PHP Unicode character questions - php

Here's a link I found, which even has a character I need to play with for other projects of mine.
http://www.fileformat.info/info/unicode/char/2446/index.htm
There is a box with the Title of: "Encodings" on that page. And I am wondering about some of the rows.
I obviously need a course on this sort of thing, but I'm wondering what the difference is between "HTML Entity (decimal)" and "HTML Entity (hex)".
The funny thing is, which confuses me, I throw those characters on a web page, and they display fine. But I haven't specified any UTF-8 encoding in the php page.
<?php
$string1 = '⑆';
$string2 = '⑆';
echo $string1;
echo '<br>';
echo $string2;
?>
Does the browser know how to display both automatically?
And to make it weirder, I can only see those characters on my Mac, in Firefox.
But my windows box doesn't want to show them. I've tested it in chrome, and firefox. Do I need to tell the browsers to view them correctly? Or is it an operating system modification?

They're both valid numeric HTML entities, and the browser does indeed know how to decode them. The difference is the first is a hexadecimal number, while the latter is decimal.
0x2446 = 9286
Note that 0x means hexadecimal.
Also note that it is good practice to always have your server explicitly specify an encoding. The W3C explains how to do so. UTF-8 is a good choice.
If you use any Unicode encoding, you can always put the character right on your page, so you don't have to use entities.

To be exact, neither is an entity reference. & is an entity reference that refers to the entity named amp that is defined as:
<!ENTITY amp CDATA "&" -- ampersand, U+0026 ISOnum -->
Here you can see that the entity’s value is just another reference: &.
⑆ and ⑆ are “just” character references (numeric character references to be exact) and refers to characters by specifying the code position of a character in the Universal Character Set, i.e. the Unicode character set.

You can use any "HTML Entity" in any encoding and in practice, if You have installed appropriate fonts, every browser will work fine. Well, it was created for displaying characters that are not included in current encoding. In Your situations it looks You have to install some fonts on Your Windows box.
On the other hand, it has almost nothing to do with PHP.

Related

Display \u1F603 (emoji icon) in web page

I store codes like "\u1F603" within messages in my database, and now I need to display the corresponding emoji on my web page.
How can I convert \u1F603 to \xF0\x9F\x98\x83 using PHP for displaying emoji icons in a web page?
You don't need to convert emoji character codes to UTF-8 sequences, you can simply use the original 21-bit Unicode value as numeric character reference in HTML like this: 😃 which renders as: 😃.
The Wikipedia article "Unicode and HTML" explains:
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by &# and followed by ;, like this: 合, which produces this: 合.
So if in your PHP code you have a string containing '\u1F603', then you can create the corresponding HTML string using preg_replace, as in following example:
$text = "This is fun \\u1F603!"; // this has just one backslash, it had to be escaped
echo "Database has: $text<br>";
$html = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $text);
echo "Browser shows: $html<br>";
This outputs:
Database has: This is fun \u1F603!
Browser shows: This is fun 😃!
Note that if in your data you would use the literal \u notation also for lower range Unicode characters, i.e. with hex numbers of 2 to 4 digits, you must make sure the next user's character is not also a hex digit, as it would lead to a wrong interpretation of where the \u escape sequence stops. In that case I would suggest to always left-pad these hex numbers with zeroes in your data so they are always 5 digits long.
To ensure your browser uses the correct character encoding, do the following:
Specify the UTF-8 character encoding in the HTML head section:
<meta charset="utf-8">
Save your PHP file in UTF-8 encoding. Depending on your editor, you may need to use a "Save As" option, or find such a setting in the editor's "Preferences" or "Options" menu.
Hell everyone,
after many try i can found solution.
I user below code:
https://github.com/BriquzStudio/php-emoji
include 'Emoji.php';
$message = Emoji::Decode($message);
This one working fine for me!! :)Below is my reslut

Decode HTML-encoded characters with extended ASCII

I have an XML with special Characters encoded as &#xxx; in it. As long as I'd output these characters to a browser, that would work fine as they're HTML-Encodings (sort of).
But I need to read the XML-File with simplexml_load_string, which results in garbage for certain characters, because they're in the extended ASCII-table.
For example:
š translates to š - but when I try to use html_entity_decode, I get an empty character.
I tried almost everything from iconv to mb_decode_numericentity - nothing worked.
How do I convert those &#xxx; to the real characters???
[Edit]
I found this table http://www.ascii-code.com that claims the š is an extended ASCII Character using ISO-8859-1
I'm confused...
You're apparently dealing with two different characters that look almost identical when printing:
'LATIN SMALL LETTER S WITH CARON' (U+0161) actually encodes as š
š corresponds to 'SINGLE CHARACTER INTRODUCER' (U+009A)
I've found that none of my fonts or text editors handle the second one properly. So you most likely get a blank character for that precise reason.
The second one appears to be some kind of weird control character whose exact purpose escapes from my understanding:
To be followed by a single printable character (0x20 through 0x7E) or
format effector (0x08 through 0x0D). The intent was to provide a means
by which a control function or a graphic character that would be
available regardless of which graphic or control sets were in use
could be defined. Definitions of what the following byte would invoke
was never implemented in an international standard. Not part of the
first edition of ISO/IEC 6429
It's worth noting that character references in XML use numeric codes from a fixed encoding (some UCS variant). If the author of the XML file doesn't follow this convention you'll be faced with either invalid XML (something that effectively prevents it from being parsed with an XML library) or valid XML that contains corrupted data (something that, at most, will require tedious post-processing).

echo-ing EURO symbol

i have tried to copy euro symbol from Wikipedia...and echo it (in my parent page),at that time it is working.but when i replace the same html content using jquery(used same symbol to echo in the other page).it is not displaying.why is it so..(or is der any way to display the same thing using html)?
In HTML you do this
€
And of course this works with jQuery, or any other web based language you are using
For more information look here
You need to ensure that your data is encoded using $X, that your server claims it is encoded using $X, and that any meta tags or xml prologs you may have also claim it is encoded using $X.
... where $X is a character encoding which includes the euro symbol. UTF-8 is recommended.
The W3C have an introduction to character encoding.
You can bypass this using HTML entities (€ in this case), which let you represent characters using ASCII (which is a subset of pretty much any character encoding you care to name). This has the advantage of being easy to type of a keyboard which doesn't have that character, but requires a tiny bit more bandwidth and will make it hard to read the source code of documents which include a lot of non-ASCII characters.
Note that HTML entities will only work when dealing with HTML. You'll find it breaking if you try things such as $(input).val('€').

UTF-8 & IsAlpha() in PHP

I'm working on a application which supports several languages and has a functionality in place which tries to use the language requested by the browser and also allows manual override of this function. This part works fine and picks the correct templates, labels, etc.
User have to enter sometimes text on their own and that's where I run into issues because the application has to accept even "complicated" languages like Chinese and Russian. So far I've taken care of the things mentioned in other posting, i.e.:
calling mb_internal_encoding( 'UTF-8' )
setting the right encoding when rendering the webpages with meta http-equiv=Content-Type content=text/html;charset=UTF-8 (format adapted due to stackoverflow limitations)
even the content arrives correctly, because mb_detect_encoding() == UTF-8
tried to set setLocale(LC_CTYPE, "UTF-8"), which doesn't seem to work because it requires the selection of one language, which I can't specify because I have to support several. And it still fails if I force it manually for testing purposes, i.e. with; setLocale(LC_CTYPE,"zh__CN.utf8") - ctype_alpha() would still fail for Chinese text
It seems that even explicit language selection doesn't make ctype_alpha() useful.
Hence the question is: how should I check for alphabetic characters in all languages?
The only idea I had at the moment is to check manually with arrays of "valid" characters - but this seems ugly especially for Chinese.
How would you solve this issue?
If you'd like to check only for valid unicode letters regardless of the used language I'd propose to use a regular expression (if your pcre-regex extension is built with unicode support):
// adjust pattern to your needs
// $input needs to be UTF-8 encoded
if (preg_match('/^\p{L}+$/u', $input)) {
// OK
} else {
// not OK
}
\p{L} checks for unicode characters with the L(etter) property which includes the properties Ll (lower case letter), Lm (modifier letter), Lo (other letter), Lt (title case letter) and Lu (upper case letter) - from: Regular Expression Details).
I wouldn't use an array of characters. That would get impossible to manage.
What I'd suggest is working out a 'default' language from the IP address and using that as the locale for a request. You could also get it from the browser-agent string in some cases. You could provide the user a way to override so that if your default isn't correct they aren't stuck with a strange site. (E.g. provide on the form 'language set to english. If this isn't correct, please change: '. This isn't the nicest thing to provide but you won't get any working validation otherwise as you NEED a language/locale set in order to have a sensible alpha validation (An A isn't a letter in chinese).
You can use the languages from
$_SERVER['HTTP_ACCEPT_LANGUAGE']
It contains something like
de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
so you need to parse this string. Then you can use the preferred language in the setLocale function.
This is rather an encoding issue than a language detection issue. Because UTF-8 can encode any Unicode character.
The best approach is to use UTF-8 throughout your project: in your database, in your output and as expected encoding for the input.
Output    Make sure you encode your data with UTF-8 and declare that in the HTTP header in the Content-Type field and not just in the document itself.
Input    If you’re using forms, declare the expected encoding in the accept-charset attribute.

Get non-UTF-8-form fields as UTF-8 in PHP?

I have a form served in non-UTF-8 (it’s actually in Windows-1251). People, of course, post there any characters they like to. The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities so I can still recognise them. For example, if user types an →, I receive an →. That’s partially great, like, if I just echo it back, the browser will correctly display the → no matter what.
The problem is, I actually do a htmlspecialchars () on the text before displaying it (it’s a PHP function to convert special characters to HTML entities, e.g. & becomes &). My users sometimes type things like — or ©, and I want to display them as actual — or ©, not — and ©.
There’s no way for me to distinguish an → from →, because I get them both as →. And, since I htmlspecialchars () the text, and I also get a → for a → from browser, I echo back an &#8594; which gets displayed as → in a browser. So the user’s input gets corrupted.
Is there a way to say: “Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself”?
Oh, I know that the good idea is to switch the whole software to UTF-8, but that is just too much work, and I would be happy to get a quick fix for this. If this matters, the form’s enctype is "multipart/form-data" (includes file uploader, so cannot use any other enctype). I use Apache and PHP.
Thanks!
The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities
Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “ƛ” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘Б’ character.
I actually do a htmlspecialchars () on the text before displaying it
Yes. You must do that, or else you've got a security problem.
Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself
Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.
I know that the good idea is to switch the whole software to UTF-8,
Yup. Well, at least the encoding of the page containing the form should be UTF-8.
<form action="action.php" method="get" accept-charset="UTF-8">
<!-- some elements -->
</form>
All browsers should return the values in the encoding specified in accept-charset.
You check to see if the characters are within a certain range. If they fall outside the range of standard UTF-8 characters, do whatever you want to with it. I would do this by looking at each character &, #, 8, 5, 9, 4, and parsing it into something you can apply something to.
Short of finding somewhere where someone has created a Windows-1251 to UTF-8 conversion script, you are probably going to have to roll your own. You are probably going to have to look at each specific character and see what needs to be done with it. If it's something like © you will want to handle it differently than → because the second one has the # in it.
I think this answers your question.
The html_entity_decode function is probably what you want.
You could set the fourth parameter of the htmlspecialchars function (double_encode, since PHP 5.2.3) to false do avoid the character references being encoded again.
Or you first decode those existing character references.
You can convert the strings to UTF-8 using the PHP multi-byte functions. From there you can do as you wish. Especially the mb_convert_encoding() to move it from windows-1251 to UTF-8, or where ever.
I don't quite understand your question though, because if someone enters & as a text string, when you do the htmlspecialchars() that should convert it to &amp; ... which when ran back through a html_entity_decode() would come out as the text string the user entered.
This is of course if you haven't used the double_encode option when running your string through the htmlspecialchars()
mbstring supports the "charset" HTML-Entities
for($i=0; $i<strlen($out); $i++) {
printf('%02X ', ord($out[$i]));
}61 20 E2 86 92 20 62 20 26 20 63 E2 86 92 is the byte-sequence for → (RIGHTWARDS ARROW) in utf8.
You won't be able to distinguish between the browser converting a codepoint to an entity and your users typing in an entity because they look identical. The real solution is to give up on Windows 1251. Instead, serve the webpage and form in UTF-8, ask for UTF-8 encoding and all these problems should just go away.

Categories