I have a problem with replacing characters and do not know how to do it.
In Slovakia we use characters with diacritics.
How do I convert such a character (e.g. á) to its HTML code on input? If I have a string like Áno (which means "yes"),
how do I change Á to its HTML code within the string?
I want to build an input where smileys like :-) are changed to images, and my accented characters to HTML codes.
You can use strtr for such purposes. I do not know what problems you have to solve with smileys etc, so I'll give you an example for German umlauts (however not to HTML entities, but to standard ASCII characters):
$string = strtr($string, array('ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue'));
Of course you can also use HTML entities instead of ae etc, you just have to look them up.
Edit
Judging from your update (smileys should become images, accented characters should become HTML codes), I think you want to use both htmlentities and strtr.
htmlentities converts every character that has a named HTML entity into that entity, so non-ASCII characters display correctly regardless of the page encoding. Also have a look at UTF-8: if the page is served as UTF-8, you do not have to translate your Slovak characters at all.
And strtr will replace your smileys by the proper HTML code.
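A minimal sketch of combining the two, assuming UTF-8 input. The function name render_comment and the image file names are made up for illustration. htmlentities runs first, so the <img> tags inserted by strtr are not escaped afterwards:

```php
<?php
// Hypothetical helper: escape the text first, then swap smileys for images.
function render_comment(string $text): string
{
    // Convert characters such as Á to named entities (&Aacute;).
    $safe = htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Placeholder image paths; adjust to your own files.
    $smileys = [
        ':-)' => '<img src="smile.png" alt=":-)">',
        ':-(' => '<img src="sad.png" alt=":-(">',
    ];

    // The array form of strtr does not re-scan text it has already
    // replaced, so the ":-)" inside the alt attribute is left alone.
    return strtr($safe, $smileys);
}

echo render_comment("\u{00C1}no :-)"), "\n";
// prints: &Aacute;no <img src="smile.png" alt=":-)">
```

Smileys that contain characters htmlentities escapes (for example > or &) would need their escaped form as the array key.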
Related
I'm currently facing a very strange encoding issue when dealing with HTML source code.
I have the following line:
"requête présentée par..."
When an external library runs utf8_decode on it, I get:
"reque^te présente´e par..."
So the accents are placed to the right of the accented characters. If I run utf8_encode on that result, I don't get the original "requête présentée par..." back; I still get "reque^te présente´e par..."
Even stranger: if I open the original HTML in Notepad++, the encoding is UTF-8 without BOM (so far, so good), but I can actually select half of a character with the text selection (keyboard or mouse). Yes, half of it. It is as if the real code were "e^" but it is displayed as "ê". When I copy it to my IDE, it copies "ê" but pastes "e^".
I have come up with a basic replacement function:
"e^" => "ê",
"e´" => "é",
...
along with some other French cases, and it is working properly for now.
But as the HTML comes in different languages, I'm pretty sure I won't be able to replace every character affected by this encoding issue.
Has anybody faced this issue before and (hopefully) found a more general solution?
Thanks in advance.
It sounds like your HTML source is using combining characters. That is, instead of using a single Unicode character to represent the ê, it uses a regular e followed by a combining character that adds the diacritic ^. You can verify this with a hex editor: the combining circumflex is U+0302, which UTF-8 encodes as the bytes CC 82.
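If the text really is in decomposed form, Unicode normalization to NFC is the general fix rather than a hand-written replacement table. A sketch, assuming the intl extension is installed (it provides the Normalizer class); the function name compose_accents is made up here:

```php
<?php
// "e" + U+0302 (combining circumflex) and "e" + U+0301 (combining acute).
$decomposed = "reque\u{0302}te pre\u{0301}sente\u{0301}e";

// Hypothetical helper: recompose base letter + combining mark pairs (NFC).
function compose_accents(string $s): string
{
    if (!class_exists('Normalizer')) {
        // intl is not available; return the input unchanged.
        return $s;
    }
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
    // normalize() returns false on invalid input; fall back to the original.
    return $normalized === false ? $s : $normalized;
}

$composed = compose_accents($decomposed);
// With intl available, strlen() drops because each two-part
// sequence collapses into one precomposed character.
echo strlen($decomposed), ' -> ', strlen($composed), "\n";
```

Running this on the text before it reaches the library that calls utf8_decode should avoid the detached accents, since the precomposed ê and é exist in ISO-8859-1 while the combining marks do not.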
See also Unicode equivalence.
In my page I convert a string from lowercase to uppercase and output it in the title tag. First I had the issue that &nbsp; is not accepted, so I had to preserve entities.
So I decoded them to Unicode, uppercased the result, and then converted back with htmlentities:
echo htmlentities(strtoupper(html_entity_decode(ob_get_clean())));
Now I have a problem related to a "right single quote": I'm getting this character as ’ in the title.
It seems that one of the functions I'm using does not convert it correctly. Is there a better function I can use, or is there something specific to the title tag?
Edit: Here is a var_dump of the original data, which I have no influence over:
string(74) "Example example example » John Doe- Who’s That? "
Edit II: This is what my code above results in:
This would happen, if I would just use strtoupper:
Your problem is that strtoupper will destroy your UTF-8 entity-decoded input because it is not multibyte aware. In this instance, &rsquo; decodes to the UTF-8 byte sequence e2 80 99. But in strtoupper's single-byte world, the byte \xe2 is the character â, which is converted to Â (\xc2), and that makes your text an invalid UTF-8 sequence.
Simply use mb_strtoupper instead.
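A sketch of the same decode, uppercase, re-encode chain with mb_strtoupper, assuming the mbstring extension is available and the data is UTF-8. The function name uppercase_title is made up, and the encoding is passed explicitly at every step instead of relying on defaults:

```php
<?php
// Hypothetical helper: uppercase a string while preserving HTML entities.
function uppercase_title(string $html): string
{
    $flags = ENT_QUOTES | ENT_HTML401;
    // 1. Entities -> real UTF-8 characters.
    $decoded = html_entity_decode($html, $flags, 'UTF-8');
    // 2. Multibyte-aware uppercasing; bytes of ’ and friends stay intact.
    $upper = mb_strtoupper($decoded, 'UTF-8');
    // 3. Back to entities for output.
    return htmlentities($upper, $flags, 'UTF-8');
}

echo uppercase_title('Who&rsquo;s that?&nbsp;'), "\n";
// prints: WHO&rsquo;S THAT?&nbsp;
```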
It's ugly, but it might work for you (although I would certainly suggest Jon's solution):
After your strtoupper(), you can restore all uppercased HTML entities to their original form this way:
$entity_table = get_html_translation_table(HTML_ENTITIES);
$entity_table_uc = array_map('strtoupper', $entity_table);
$string = str_replace($entity_table_uc, $entity_table, $string);
This should remove the need for htmlentities() / html_entity_decode().
I am scraping information from a website and I was wondering how could I ignore or replace some special HTML characters such as "á", "á", "’" and "&". These characters cannot be scraped into a database. I have already replaced " " using this:
$nbsp = utf8_decode('á');
$mystring = str_replace($nbsp, '', $mystring);
But I cannot seem to do the same with these other characters. I am scraping from the website using XPath. This returns the exact content that I am looking for but keeps the HTML characters that I do not want as they don't seem to be allowed into a database.
Thanks for any help with this.
It sounds like you've got a character set or collation issue. I suggest ensuring that your database uses a UTF-8 character set and collation (in MySQL, for example, utf8mb4 with utf8mb4_unicode_ci), and that your web page's content encoding is also set to UTF-8. This may well solve your problem.
The best way to strip all special characters is to run the string through htmlspecialchars(), then do a case-insensitive regex find and replace using the following pattern:
&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});
This should match named HTML entities (e.g. &Omega; or &nbsp;) as well as decimal (e.g. &#1234;) and hex-based (e.g. &#x0BEE;) entities. The regex will strip them out completely.
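A rough sketch of that approach, assuming UTF-8 input. Note one swap: it uses htmlentities() rather than htmlspecialchars(), because htmlspecialchars() only escapes &, <, >, and quotes and would leave characters like é or ’ untouched. The function name strip_special_chars is made up for illustration:

```php
<?php
// Hypothetical helper: turn special characters into entities, then drop them.
function strip_special_chars(string $text): string
{
    // Last argument false = do not double-encode entities already present,
    // so an existing "&nbsp;" stays "&nbsp;" instead of "&amp;nbsp;".
    $encoded = htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8', false);

    // The pattern from the answer, applied case-insensitively.
    $pattern = '/&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});/i';

    // preg_replace returns null on error; fall back to an empty string.
    return preg_replace($pattern, '', $encoded) ?? '';
}

echo strip_special_chars("caf\u{00E9} \u{2019}ok\u{2019} &nbsp;done"), "\n";
// prints: caf ok done
```

This is lossy by design: the characters are removed, not transliterated. Characters without a named HTML 4.01 entity pass through unchanged, and entity names that contain digits (such as &frac12;) are not matched by this pattern.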
Alternatively, just use the output of htmlspecialchars() to store it with the weird characters intact. Not ideal, but it works.
For reasons justified by business logic, I need to convert the character "Æ" to "Ae" in a string. However, despite the fact that mb_detect_encoding() tells me the string is UTF-8, I can't figure out how to do this. (And for other reasons of business logic, it would be an issue to htmlentities() the string before replacing it, as other Google searches have suggested.)
What I tried first was this, using the test string "Æther":
return str_replace("Æ", 'Ae', $string);
Unfortunately, that doesn't actually find the Æ in the text, returning "Æther".
return str_replace(chr(195), 'Ae', $string);
That finds the Æ and replaces it, but adds an unknown character afterwards, changing it to the not-usable "Ae�ther." So I tried this:
$ae_character = mb_convert_encoding('&#' . intval(195) . ';', 'UTF-8', 'HTML-ENTITIES');
return str_replace($ae_character, 'Ae', $string);
Which again failed to find the Æ character in the string. I know it's a UTF-8 issue of some sort, but I'm honestly stumped as to how to search for and replace this without adding the extra character afterwards. Any ideas?
<?php
$x = 'Æmystr';
print str_replace('Æ', 'AE', $x); // prints: AEmystr
?>
That code works just fine. What I believe you're missing is the encoding of your file: your .php file should be saved as UTF-8. This can be done in many text editors and IDEs, e.g. Eclipse, EditPlus, Notepad++, and even Notepad on Windows 7.
When saving, bring up the Save/Save As dialog; normally there is an Encoding dropdown or set of radio buttons near the Save button that lets you choose between ANSI and UTF-8 (and others).
On *nix I believe most editors have the same option, I'm just not sure where. Be aware that if you get it working and later edit and save the file with an editor that only handles ANSI, the character will be overwritten with an unknown character.
As to why the below code didn't work.
return str_replace(chr(195), 'Ae', $string);
It's because UTF-8 encodes every non-ASCII character as a sequence of two to four bytes; Æ (U+00C6) takes two, 0xC3 0x86. chr(195) is only the first of those bytes, so the second one is left behind as an invalid character. Try this:
print str_replace(chr(195).chr(134), 'AE', $x);
That should replace it as well, and might even be preferable because it does not depend on the encoding of the .php file.
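To see which bytes a character really consists of, bin2hex is handy. A small sketch, assuming PHP 7 or later for the "\u{...}" escape, which keeps the example independent of how the source file is saved:

```php
<?php
// Æ is U+00C6; in UTF-8 that is the two bytes C3 86 (decimal 195, 134).
$ae = "\u{00C6}";
echo bin2hex($ae), "\n";            // prints: c386

$input = "\u{00C6}ther";

// Replacing only the first byte leaves the second byte (0x86) behind.
$broken = str_replace(chr(195), 'Ae', $input);
echo bin2hex($broken), "\n";        // prints: 416586 followed by 74686572

// Replacing the full two-byte sequence works.
$fixed = str_replace("\xC3\x86", 'Ae', $input);
echo $fixed, "\n";                  // prints: Aether
```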
I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
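A sketch covering both the lowercase and the uppercase schwa, assuming the input is UTF-8; the function name latin_schwa_to_cyrillic is made up. One thing that stands out in the uppercase attempt from the question is the escape "\xC6\8F": the second byte is missing its x, so PHP reads it as a literal backslash, "8", and "F" rather than the byte 0x8F:

```php
<?php
// "\8F" is not a valid escape, so this string is 4 bytes, not 2.
var_dump(strlen("\xC6\8F"));   // int(4)
var_dump(strlen("\xC6\x8F"));  // int(2)

// Hypothetical helper: map Latin schwa to Cyrillic schwa in UTF-8 text.
function latin_schwa_to_cyrillic(string $s): string
{
    return strtr($s, [
        "\xC9\x99" => "\xD3\x99", // U+0259 (lowercase) -> U+04D9
        "\xC6\x8F" => "\xD3\x98", // U+018F (uppercase) -> U+04D8
    ]);
}

$out = latin_schwa_to_cyrillic("\u{018F}z\u{0259}");
echo bin2hex($out), "\n";      // prints: d3987ad399
```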
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that this byte sequence actually occurs in $string. I mention this because "\x02\x59" starts with a low byte, which is how the character looks in UTF-16BE; if your $string is UTF-8 encoded, those bytes will not be in it.