I have a problem with a hidden character that shows up neither in the database (phpMyAdmin) nor on the website. The website uses UTF-8 character encoding. If I copy and paste the string with the "hidden" character into Notepad, I can see it. It looks just like a bullet character, but is hidden. What type of character is this, and can it be removed with PHP?
The user who typed this character is on a Mac and is probably copying and pasting from a document (maybe Unicode?) into a form on our website and saving it. So this character is not visible with UTF-8 encoding, but it is visible if I copy the string into a Notepad document.
This is the hidden character at the end of the string. Looks like a bullet:
Copy the character, then fire up PowerShell and do the following (yes, it's convoluted, sorry):
'U+{0:X4}'-f+[char]'<PASTE>'
and paste the character where it says <PASTE>. It should give you the Unicode code point of that character. You should then be able to write something that removes it from the string, but as far as I can see there shouldn't be any input that destroys the document layout, except maybe fun things like RTL markers.
Short explanation of the above: [char]'x' converts a single-character string to a char, + will then treat it as a number (similar to [int], but shorter). The rest is a format string and the formatting operator -f.
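Since the question also asks whether the character can be removed with PHP, here is a minimal sketch of the same idea on the PHP side. It assumes the stored string is valid UTF-8 and that the mbstring extension is loaded. The helper names nonAsciiCodePoints and stripInvisible are hypothetical, and stripping every control (Cc) and format (Cf) character is an assumption that may be too broad for some input (it also removes the zero-width joiner used in some emoji sequences).

```php
<?php
// Hypothetical helper: report the code point of every non-ASCII character.
function nonAsciiCodePoints(string $s): array {
    $out = [];
    // Split the UTF-8 string into single characters.
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $cp = mb_ord($ch, 'UTF-8');
        if ($cp > 0x7F) {
            $out[] = sprintf('U+%04X', $cp);
        }
    }
    return $out;
}

// Hypothetical helper: drop control (Cc) and format (Cf) characters,
// but keep tab, carriage return and line feed.
function stripInvisible(string $s): string {
    return preg_replace('/(?![\t\r\n])[\p{Cc}\p{Cf}]/u', '', $s);
}

$input = "Some text\u{200B}"; // assumed sample: a zero-width space at the end
echo implode(', ', nonAsciiCodePoints($input)), "\n"; // U+200B
echo strlen(stripInvisible($input)), "\n";            // 9
```

If the reported code point turns out not to be a control or format character, it is probably safer to remove that one code point explicitly than to widen the character class.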
I store codes like "\u1F603" within messages in my database, and now I need to display the corresponding emoji on my web page.
How can I convert \u1F603 to \xF0\x9F\x98\x83 using PHP for displaying emoji icons in a web page?
You don't need to convert emoji character codes to UTF-8 sequences, you can simply use the original 21-bit Unicode value as numeric character reference in HTML like this: &#x1F603; which renders as: 😃.
The Wikipedia article "Unicode and HTML" explains:
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by &# and followed by ;, like this: &#21512;, which produces this: 合.
So if in your PHP code you have a string containing '\u1F603', then you can create the corresponding HTML string using preg_replace, as in following example:
$text = "This is fun \\u1F603!"; // this has just one backslash, it had to be escaped
echo "Database has: $text<br>";
$html = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $text);
echo "Browser shows: $html<br>";
This outputs:
Database has: This is fun \u1F603!
Browser shows: This is fun 😃!
Note that if your data also uses the literal \u notation for lower-range Unicode characters, i.e. with hex numbers of 2 to 4 digits, you must make sure the character following the escape is not also a hex digit, as that would lead to a wrong interpretation of where the \u escape sequence stops. In that case I would suggest always left-padding these hex numbers with zeroes in your data so they are always 5 digits long.
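If you do want the actual UTF-8 bytes the question asks for (rather than an HTML character reference), something along these lines could work. It is only a sketch: it assumes the escapes are always padded to exactly 5 hex digits as suggested above, that mbstring is available, and decodeUnicodeEscapes is a hypothetical helper name.

```php
<?php
// Hypothetical helper: turn literal \uXXXXX escapes (exactly 5 hex digits)
// into the corresponding UTF-8 characters.
function decodeUnicodeEscapes(string $text): string {
    return preg_replace_callback(
        '/\\\\u([0-9A-Fa-f]{5})/',
        static function (array $m): string {
            $ch = mb_chr(hexdec($m[1]), 'UTF-8');
            // mb_chr() returns false for invalid code points; keep the escape as-is then.
            return $ch === false ? $m[0] : $ch;
        },
        $text
    );
}

echo bin2hex(decodeUnicodeEscapes('\u1F603')), "\n"; // f09f9883
echo decodeUnicodeEscapes('caf\u000E9s'), "\n";      // cafés
```

The second call shows why the fixed width matters: with a 2-to-5 digit pattern the escape could not be told apart from hex-looking text that follows it.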
To ensure your browser uses the correct character encoding, do the following:
Specify the UTF-8 character encoding in the HTML head section:
<meta charset="utf-8">
Save your PHP file in UTF-8 encoding. Depending on your editor, you may need to use a "Save As" option, or find such a setting in the editor's "Preferences" or "Options" menu.
Hello everyone,
after many tries I found a solution.
I used the code below:
https://github.com/BriquzStudio/php-emoji
include 'Emoji.php';
$message = Emoji::Decode($message);
This one is working fine for me!! :) Below is my result
I have an XML with special Characters encoded as &#xxx; in it. As long as I'd output these characters to a browser, that would work fine as they're HTML-Encodings (sort of).
But I need to read the XML-File with simplexml_load_string, which results in garbage for certain characters, because they're in the extended ASCII-table.
For example:
&#154; translates to š - but when I try to use html_entity_decode, I get an empty character.
I tried almost everything from iconv to mb_decode_numericentity - nothing worked.
How do I convert those &#xxx; to the real characters???
[Edit]
I found this table http://www.ascii-code.com that claims &#154; is an extended ASCII character using ISO-8859-1
I'm confused...
You're apparently dealing with two different characters that look almost identical when printing:
'LATIN SMALL LETTER S WITH CARON' (U+0161) actually encodes as &#353;
&#154; corresponds to 'SINGLE CHARACTER INTRODUCER' (U+009A)
I've found that none of my fonts or text editors handle the second one properly. So you most likely get a blank character for that precise reason.
The second one appears to be some kind of weird control character whose exact purpose escapes me:
To be followed by a single printable character (0x20 through 0x7E) or
format effector (0x08 through 0x0D). The intent was to provide a means
by which a control function or a graphic character that would be
available regardless of which graphic or control sets were in use
could be defined. Definitions of what the following byte would invoke
was never implemented in an international standard. Not part of the
first edition of ISO/IEC 6429
It's worth noting that character references in XML use numeric codes from a fixed encoding (some UCS variant). If the author of the XML file doesn't follow this convention you'll be faced with either invalid XML (something that effectively prevents it from being parsed with an XML library) or valid XML that contains corrupted data (something that, at most, will require tedious post-processing).
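The "tedious post-processing" could look roughly like the sketch below. It rests on an assumption the source does not confirm: that the XML author wrote decimal references in the range 128 to 159 meaning Windows-1252 bytes rather than Unicode code points. fixC1References is a hypothetical helper name, the sample XML is made up, and the mbstring and SimpleXML extensions are assumed to be loaded.

```php
<?php
// Hypothetical helper: reinterpret decimal references 128..159 as Windows-1252
// bytes and replace them with the real UTF-8 character before parsing.
function fixC1References(string $xml): string {
    return preg_replace_callback('/&#(\d+);/', static function (array $m): string {
        $n = (int) $m[1];
        if ($n < 128 || $n > 159) {
            return $m[0]; // leave all other references untouched
        }
        return mb_convert_encoding(chr($n), 'UTF-8', 'Windows-1252');
    }, $xml);
}

// Made-up sample document containing the problematic reference.
$raw = '<item><name>Mili&#154;</name></item>';
$xml = simplexml_load_string(fixC1References($raw));
echo $xml->name, "\n"; // Miliš
```

A few bytes in that range (0x81, 0x8D, 0x8F, 0x90, 0x9D) have no Windows-1252 character assigned, so real data may need a decision about what to do with those.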
I'm using SimpleXML to read nodes, and I echo out the image file name. Using foreach, I print them out:
assets/project_Guide2Big1.jpg
assets/project_Guide2Big2.jpg
assets/project_Guide2Big3.jpg
assets/project_Guide2Big4.jpg
assets/project_Guide2Big5.jpg
I inserted these values into my img tags, but the images don't appear except for the first one.
I copy "assets/project_Guide2Big1.jpg" into the browser. I see the image, but when I copy "assets/project_Guide2Big2.jpg", the address changes to this
asset/%E2%80%8BprojectGuide2Big2.jpg.
It looks like some urlencoding(?). I tried to decode, but my images still aren't working. This is so weird.
Where does the %E2%80%8B come from?
That looks suspiciously like a UTF-8 character sequence representing some Unicode character which you didn't expect to be there.
Using this online converter, we can see that the sequence of UTF-8 bytes E2 80 8B represent the Unicode codepoint U+200B, which is a "Zero Width Space".
So somehow, your source XML includes an invisible character after the slash. When echoed to the screen, it is completely invisible - even when viewing source, since the source is still just text. But when you try to load the URL, that character is outside the valid range for URLs, so gets automatically encoded by the browser.
You might be wondering what the point of a zero-width space is, but consider automatic word-wrap functions - they may look for a space to break on, but a URL contains no spaces. So inserting a zero-width space makes the text look the same, but allows it to wrap at that specific point. Another character useful for this is the "soft hyphen", which has the beautifully apt entity name of &shy; - as a friend of mine put it, "the soft hyphen is shy, and may not appear". :)
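One way to deal with this on the PHP side is to strip such characters from the value before putting it into the img tag. The following is a minimal sketch, assuming the node value is valid UTF-8; stripZeroWidth is a hypothetical helper name, and the list of characters removed (soft hyphen, zero-width space, zero-width non-joiner and joiner, byte order mark) is my own selection rather than anything the source prescribes.

```php
<?php
// Hypothetical helper: remove soft hyphens, zero-width spaces/joiners and BOMs.
function stripZeroWidth(string $s): string {
    return preg_replace('/[\x{00AD}\x{200B}-\x{200D}\x{FEFF}]/u', '', $s);
}

// Assumed sample value, with a zero-width space after the slash.
$fromXml = "assets/\u{200B}project_Guide2Big2.jpg";

echo rawurlencode("\u{200B}"), "\n"; // %E2%80%8B
echo stripZeroWidth($fromXml), "\n"; // assets/project_Guide2Big2.jpg
```

The first echo shows where the %E2%80%8B in the address bar comes from: it is simply the percent-encoded UTF-8 bytes of U+200B.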
I'm making a cURL request to a third party website which returns a text file on which I need to do a few string replacements to replace certain characters by their HTML entity equivalents, e.g. I need to replace í by &iacute;.
Using str_replace/preg_replace_callback on the response directly didn't result in matches (whether searching for í directly or using its hex code \x00\xED), so I used utf8_encode() before carrying out the replacement. But utf8_encode replaces all the í characters by Ã.
Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text using php?
*edit - some further research reveals
utf8_decode("í") == í;
utf8_encode("í") == ÃÂ;
utf8_encode("\xc3\xad") == ÃÂ;
utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
Re. searching for the character directly or using its hex code, did you make sure to add the u modifier at the end of the regex? e.g. /\x00\xED/u?
You're probably specifying the characters/strings you want replaced via string literals in the PHP source code? If you do, then the values of those string literals depend on the encoding you save your PHP file in. So while you see the character í, maybe the literal value is a Latin-encoded í, like maybe 8859-1 encoding, or maybe it's Windows cp1252 í, or maybe it's UTF-8 í, or maybe even UTF-32 í... I don't know offhand how many of those are different, but I know at least some have different byte representations, and so won't match in a PHP string comparison.
My point is, you need to specify the correct character that will match whatever encoding your incoming text is in.
Here's an example without using literals:
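$iso8859_1 = chr(237); // byte 0xED, which is í in ISO-8859-1
$utf8 = utf8_encode(chr(237)); // bytes 0xC3 0xAD, í in UTF-8 (utf8_encode is deprecated as of PHP 8.2)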
Be warned: text editors may or may not convert the existing characters when you change the file encoding to UTF-8. I've seen editors do really bizarre things when changing the encoding. Start with a fresh file.
Also, just because the other server claims it's UTF-8 doesn't mean it really is.
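To make the points above concrete, here is a rough sketch that checks what the response really contains before replacing anything. It assumes the text is either valid UTF-8 or ISO-8859-1 (the fallback encoding is a guess you would need to confirm for the actual server) and that mbstring is loaded; toUtf8 and replaceIacute are hypothetical helper names.

```php
<?php
// Hypothetical helper: only convert when the input is not already valid UTF-8,
// which avoids the double encoding described in the question.
function toUtf8(string $s, string $fallback = 'ISO-8859-1'): string {
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', $fallback);
}

// Hypothetical helper: replace í (U+00ED) with its HTML entity.
// "\u{00ED}" always yields the UTF-8 bytes C3 AD, whatever encoding this file is saved in.
function replaceIacute(string $s): string {
    return str_replace("\u{00ED}", '&iacute;', toUtf8($s));
}

echo replaceIacute("caf\xC3\xAD"), "\n"; // UTF-8 input      -> caf&iacute;
echo replaceIacute("caf\xED"), "\n";     // ISO-8859-1 input -> caf&iacute;

// Converting text that is already UTF-8 a second time doubles the bytes:
echo bin2hex(mb_convert_encoding("\xC3\xAD", 'UTF-8', 'ISO-8859-1')), "\n"; // c383c2ad
```

Because str_replace works on bytes, no u modifier or regex is needed once both the needle and the haystack are known to be UTF-8.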
I am using this autocompleter from Google
http://code.google.com/p/jquery-autocomplete/ (if you click on "Source" you can find all the source files for the script)
and everything is working fine, except it's having problems with special Croatian characters (like č, ć, ž etc. I'm not sure if you'll see these, so here's an idea of what I am talking about: link - the letter c with a hachek on top etc.)
Here's the setup:
an html file points to a jquery autocomplete script and a php file with the results array
the metadata for the html file has a charset of utf-8, no other pages have any kind of encoding at all
the array in the php file has those special characters encoded with html codes (the letter "ž" is replaced with its HTML character reference, so a typical array element has the form "Po[reference for ž]ega" => "5")
when I enter a search string into the input field, the returning results are displayed correctly - Požega etc. but when I click the result to accept it, it enters the raw character reference instead of ž into the input field, which is obviously not what I want
when my search string has a special letter in it, the script doesn't find anything
How do I fix this? Should I just replace the HTML special codes in the array with the actual special letters(it seems to work fine then, but I'm not sure whether everybody will see this as I intended)? If not, how do I set the character encoding on all pages so the special letters display correctly on the input field and they're searchable?
Thanks for the help!
Character encoding is such a pain in the ass with browsers. There are several things you can do to cover your bases, one of which you've already done.
Set the <meta> tag to indicate a charset of UTF-8
Use .htaccess to define a charset of UTF-8
Use PHP to define a charset of UTF-8 in the header (something like: header('Content-Type: text/html; charset=UTF-8');)
Making sure these are true should ensure that the data shows up on all UTF-8 supported browsers. By the way, I can see the special characters, so you must be doing something right. :)
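As for the question of replacing the HTML codes in the array with the actual letters: a minimal sketch of doing that at runtime is below, combined with the header() call mentioned above. The sample data and the decimal reference form (&#382; for ž) are assumptions, since the exact codes in the original array are not shown, and decodeKeys is a hypothetical helper name.

```php
<?php
// Send the charset header, as suggested above (skipped if output already started).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical helper: turn character references in the array keys into real
// UTF-8 letters so they can be searched and inserted into the input field.
function decodeKeys(array $items): array {
    $out = [];
    foreach ($items as $key => $value) {
        $out[html_entity_decode($key, ENT_QUOTES | ENT_HTML5, 'UTF-8')] = $value;
    }
    return $out;
}

// Assumed sample data in the shape the question describes.
$towns = ['Po&#382;ega' => '5', '&#268;akovec' => '7'];

echo implode(', ', array_keys(decodeKeys($towns))), "\n"; // Požega, Čakovec
```

With real UTF-8 letters in the array, a search string typed with a special letter has a chance of matching, provided the PHP file and the response are UTF-8 as well.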