Emojis not correctly encoded into hexadecimal - PHP

$message = "Spanish Language
á, é, í, ó, ú, ñ, ü
😃 😄 😅 😆 😉 😊 😋 😎";
$hex = '#U' . strtoupper(bin2hex(mb_convert_encoding($message, 'UCS-2','auto')));
When I send $hex to the following API, everything works except the emojis: a ? symbol appears on the phone instead of each emoji.
https://api.txtlocal.com/docs/encodingdecodingunicode
Please tell me what I'm doing wrong.

These emoji are not representable in UCS-2. In UTF-16, they are represented using surrogate pairs, which UCS-2 does not support. For example, 😋 (U+1F60B) is encoded in little-endian UTF-16 as:
0x3d 0xd8 0x0b 0xde
This is four bytes, even though it is a single character. UCS-2 guarantees that every character it contains takes exactly two bytes, so 😋 is not included.
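The surrogate-pair arithmetic behind those bytes can be sketched as follows. This is only an illustration and not part of the original answer; the function name surrogatePair is made up for the example.

```php
<?php
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogates.
function surrogatePair(int $codePoint): array
{
    $offset = $codePoint - 0x10000;        // 20-bit value to split
    $high   = 0xD800 + ($offset >> 10);    // top 10 bits
    $low    = 0xDC00 + ($offset & 0x3FF);  // bottom 10 bits
    return [$high, $low];
}

[$high, $low] = surrogatePair(0x1F60B);    // U+1F60B is 😋
printf("%04X %04X\n", $high, $low);        // prints "D83D DE0B"
```

Written little-endian, those two 16-bit units are the four bytes 3d d8 0b de.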

I fixed this issue by changing that line of code to the following:
return '#U' . strtoupper(bin2hex(mb_convert_encoding($message, 'UTF-16','UTF-8')));
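A self-contained sketch of that fix, assuming the message is UTF-8 and the mbstring extension is loaded. The helper name toUnicodeHex is hypothetical, and naming UTF-16BE instead of UTF-16 is an assumption made here so that the byte order is explicit.

```php
<?php
// Hypothetical helper; the '#U' prefix is taken from the question's code.
function toUnicodeHex(string $message): string
{
    // UTF-16BE keeps surrogate pairs, so characters outside the
    // Basic Multilingual Plane (such as emoji) survive the conversion.
    $utf16 = mb_convert_encoding($message, 'UTF-16BE', 'UTF-8');
    return '#U' . strtoupper(bin2hex($utf16));
}

echo toUnicodeHex('ñ'), "\n";   // #U00F1
echo toUnicodeHex('😋'), "\n";  // #UD83DDE0B
```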

Related

How to encode unicode strings to sequence like this "\xd1\x81" in PHP?

I have a Unicode string: сентрября
and I know it can be expressed as a sequence like this:
\xd1\x81\xd0\xb5\xd0\xbd\xd1\x82\xd1\x8f\xd1\x80\xd0\xb1\xd1\x80\xd1\x8f
What is this kind of encoded representation called, and how can I convert any Unicode text to sequences like this in PHP?
The prefix "\x" indicates that what follows is hexadecimal. If the prefixes are removed, you get the same output as the bin2hex function in PHP gives.
I think this is the function you are looking for:
https://www.php.net/manual/de/function.bin2hex.php
bin2hex("сентрября") = d181d0b5d0bdd182d180d18fd0b1d180d18f
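Building on bin2hex, the "\x" form from the question can be produced with a few lines. This is a minimal sketch; the helper name toHexEscapes is made up, and the output shown assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper: render every byte of a string as a "\xNN" escape.
function toHexEscapes(string $text): string
{
    if ($text === '') {
        return '';
    }
    // bin2hex gives two hex digits per byte; split them into pairs
    // and put a literal backslash-x in front of each pair.
    return '\x' . implode('\x', str_split(bin2hex($text), 2));
}

echo toHexEscapes('с'), "\n";   // \xd1\x81 (Cyrillic "с" is two bytes in UTF-8)
```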
Those are characters in Windows-1252 (CP1252); some of them are special characters, so you can decode them with the iconv function:
\x81 = Ñ (LATIN character)
echo iconv("cp1252", "utf-8//IGNORE", "\x81\xd0\xb5\xd0\xbd\xd1\x82\xd1\x8f\xd1\x80\xd0\xb1\xd1\x80\xd1\x8f");
Result of the code above: ентÑрбрÑ
NOTE: this kind of encoding is often used to hack websites and inject code into them.

How to display the (extended) ASCII representation of a special character in PHP 5.6?

I am trying to decode this special character: "ß". If I use "ord()", I get "C3":
echo "ord hex--> " . dechex(ord('ß'));
...but that doesn't look right, so I tried "bin2hex()"; now I get "C39F" (what?).
echo "bin2hex --> " . bin2hex('ß');
By using an Extended ASCII table from the Internet, I know that the correct hexadecimal value is "DF", so I then tried "hex2bin()", but that gives me an unknown character like this: "�".
echo "hex2bin --> " . hex2bin('DF');
Is it possible to get the "DF" output?
You're on the right path with bin2hex, what you're confused about is merely the encoding. Currently you're seeing the hex value of ß for the UTF-8 encoding, because your string is encoded in UTF-8. What you want is the hex value for that string in some other encoding. Let's assume "Extended ASCII" refers to ISO-8859-1, as it colloquially often does (but doesn't have to):
echo bin2hex(iconv('UTF-8', 'ISO-8859-1', 'ß'));
Now, having said that, I have no idea what you'd use that information for. There are many valid "hex values" for the character ß in various different encodings; "Extended ASCII" is just one possible answer, and it's a vague answer to be sure, since "Extended ASCII" has very little practical meaning with hundreds of different "Extended ASCII" charsets available.
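To make the "many valid hex values" point concrete, here is a small sketch that prints the bytes of the same character in several encodings. The helper name hexIn is hypothetical, and the code assumes the source file is saved as UTF-8 and the iconv extension is available.

```php
<?php
// Hypothetical helper: hex dump of a UTF-8 string re-encoded as $encoding.
function hexIn(string $utf8Text, string $encoding): string
{
    return bin2hex(iconv('UTF-8', $encoding, $utf8Text));
}

foreach (['UTF-8', 'ISO-8859-1', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
    printf("%-10s %s\n", $encoding, hexIn('ß', $encoding));
}
// UTF-8      c39f
// ISO-8859-1 df
// UTF-16BE   00df
// UTF-32BE   000000df
```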
ASCII goes from 0x00 to 0x7F. This is not enough to represent all the characters needed so historically old Windows OSes used the available space in a byte (from 0x80 to 0xFF) to represent different characters depending on the localization. This is what codepages are: an arbitrary mapping of non-ASCII values to non-ASCII characters. What you call "extended ASCII" is IMO an inappropriate name for a codepage.
The assumption 1 byte - 1 character is dead and (if not) must die.
So actually what you are seeing is the UTF-8 representation of ß. If you want to see the Unicode code point of ß (or any other character), just show its UTF-32 representation, which AFAIK maps 1:1 to the code point.
// Prints 000000df
echo bin2hex(iconv('UTF-8', 'UTF-32BE', 'ß'));
bin2hex() should be fine, as long as you know what encoding you are using.
The C3 output you get appears to be the first byte of the two-byte representation of the character in UTF-8 (which incidentally means that you've configured your editor to save files in that encoding, which is a good idea in 2017).
The ord() function does not accept arbitrary encodings, let alone Unicode-compatible ones such as UTF-8. The manual says:
"Returns the ASCII value of the first character of string."
ASCII (a fairly small 7-bit charset) does not have any encoding for the ß character (aka U+00DF LATIN SMALL LETTER SHARP S). Seriously. ASCII does not even have a DF position (it only goes up to 7F).
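The difference between a first byte and a code point can be sketched like this. Note the assumptions: the file is saved as UTF-8, mbstring is loaded, and mb_ord is available, which requires PHP 7.2 or later and so would not run on the PHP 5.6 mentioned in the question. Both helper names are made up for the example.

```php
<?php
// Hypothetical helpers contrasting ord() with mb_ord().
function firstByteHex(string $text): string
{
    // ord() only looks at the first byte of the string.
    return dechex(ord($text));
}

function codePointHex(string $text): string
{
    // mb_ord() decodes the first character and returns its code point.
    return dechex(mb_ord($text, 'UTF-8'));
}

echo firstByteHex('ß'), "\n";  // c3
echo codePointHex('ß'), "\n";  // df
```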

Normalize Turkish in PHP?

Is there a simple way to normalize Turkish characters like Ç, Ğ, İ, Ö, Ş, Ü and ı?
Right now I'm using str_replace, but that doesn't seem like the right way to go, because it's easy to forget a character. Is there a more standard way? I tried to use the normalize method from the PHP internationalization module, but the Turkish characters stay Turkish. I would like to replace them with plain ASCII characters for use in a URL, so Ç becomes C, Ş becomes S, and so on.
What do you mean by normalization? Just take the characters as they come in, but put your scripts, connection and HTML in the correct encoding.
UTF-8 suggested, explanation: UTF-8 vs. Unicode
If you only want ASCII chars, you can test for them with something like ord($char) < 128.
For conversion look into these functions:
http://php.net/iconv
http://php.net/utf8_encode
http://php.net/mb_convert_encoding
A call similar to
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
would do the trick.
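Because the result of iconv's //TRANSLIT depends on the platform's iconv implementation and locale, an alternative worth considering is the Transliterator from the intl extension. The sketch below is an assumption-laden illustration rather than anything from the answers: it requires intl, the helper name foldToAscii is made up, and 'Any-Latin; Latin-ASCII' is an ICU transform identifier.

```php
<?php
// Hypothetical helper; requires the intl extension.
function foldToAscii(string $text): string
{
    // 'Any-Latin; Latin-ASCII' asks ICU to romanize the text and then
    // replace accented Latin letters with plain ASCII ones.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
    if ($result === false) {
        throw new RuntimeException('Transliteration failed');
    }
    return $result;
}

// Only run the demo where intl is actually installed.
if (extension_loaded('intl')) {
    echo foldToAscii('Çağrı Şükrü'), "\n";  // Cagri Sukru
}
```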
Another preg_replace way: Convert special characters to normal characters using PHP, like ã, é, ç to a, e, c

Language specific characters to regular English chars

I am not sure where to start with this, but here is what I want to do:
Users have a text field where they need to enter a few words. The problem is that the page will be used by people from different countries, and they will enter "weird" Latin characters like: ž, Ä, Ü, đ, Ť, Á etc.
Before saving to the database I want to convert them to z, a, u, d, t, a... Is there a way to do this without writing something like this (I think there are too many characters to cover):
$string = str_replace(array('Č','Ä','Á','đ'), array('C','A','A','d'), $string);
And yes, I know that I can save UTF-8 in the database, but the problem is that this string will later be sent by SMS, and because of the nature of the SMS protocol, these "special" chars use more space in the message than regular English alphabet characters (I am limited to 120 chars, and if I put "Ä" in the message, it takes up more than one character's worth of space).
First of all, I would still store the original characters in utf-8 in the database. You can always "translate" them to ASCII characters upon retrieval. This is good because if, say, in the future SMS adds UTF-8 support (or you want to use user data for something else), you'll have the original characters intact.
That said, you can use iconv to do this:
iconv('utf-8', 'ascii//TRANSLIT', $input); //where $input contains "weird" characters
See this thread for more info, including some caveats of this approach: PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string
Close, but not perfect, because it converts accents and similar marks into separate characters.
http://www.php.net/manual/en/function.iconv.php
echo iconv("ISO-8859-1", "ASCII//TRANSLIT", 'Martín');
//output: Mart'in
echo iconv("ISO-8859-1", "ASCII//TRANSLIT", "ÆÇÈÊÈÒÐÑÕ");
//output: AEC`E^E`E`OD~N~O
Using
echo iconv('utf-8', 'ascii//TRANSLIT', 'Martín');
//output: Mart
If the accented character is not UTF-8, it just cuts off the string from the special char onwards.
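One way to guard against that truncation is to check the input's encoding before handing it to iconv. The following is a minimal sketch, assuming mbstring is loaded; the helper name ensureUtf8 and the ISO-8859-1 fallback are assumptions for illustration, not something the answers prescribe.

```php
<?php
// Hypothetical helper: return $input as UTF-8, converting from a
// fallback single-byte encoding when it is not valid UTF-8 already.
function ensureUtf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;  // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

$latin1 = "Mart\xEDn";                  // "Martín" in ISO-8859-1
echo bin2hex(ensureUtf8($latin1)), "\n"; // 4d617274c3ad6e
```

The returned string can then be passed to iconv('utf-8', 'ascii//TRANSLIT', ...) without being mistaken for malformed UTF-8.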

DomDocument and special characters written in two bytes

I have a web application, written in PHP, based on UTF-8 (both PHP and MySQL are on UTF-8). Everything is beautiful - no problem with special characters.
However, I had to build an export to XML with encoding ISO-8859-2 (Polish), so I picked DomDocument because it has built-in encoding conversion.
But when I sent the XML to my partner for validation, he said that one of the tags has too many characters. That was strange, because the tag had a specific maximum number of characters. Then I opened the file in a hex editor and saw that every special character takes two bytes.
I have tried to convert the result with iconv and mb_convert_encoding.
Iconv says:
iconv(): Detected an illegal character in input string in file application/controllers/report/export.php at 169
mb_convert_encoding simply deletes all special characters, and the result is encoded in ASCII.
Is there a way to convert the output of DomDocument to one-byte characters?
Thanks in advance!
One problem when switching between encodings is that, even with transliteration, not all characters are representable in other encodings in a single byte.
For example, consider the EURO SIGN, a character that takes 3 bytes when encoded in UTF-8. If you look at the charset support page for it, you can see that ISO-8859-2 is not listed.
Since ISO-8859-2 has no single character for the euro sign, transliteration does its best to still represent it in the output:
echo iconv( 'UTF-8', 'ISO-8859-2//TRANSLIT', '€' ); // EUR
In this example, we still end up with 3 bytes to represent the euro sign after transliterating.
EDIT
P.S. The NOTICE level error you're getting is because you executed iconv() without the transliteration flag. And as I highlighted above, the EURO SIGN doesn't exist in ISO-8859-2, so you clearly have at least one character in your data that also doesn't exist in ISO-8859-2, so you'll have to use transliteration. Just know that it doesn't guarantee that you'll get down to 1 byte/char.
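For characters that do exist in ISO-8859-2, letting DOMDocument serialize in that encoding should yield one byte per character. The sketch below illustrates this under stated assumptions: the dom extension is loaded, the input is UTF-8, and both the helper name exportName and the element name "name" are made up for the example.

```php
<?php
// Hypothetical helper: build a tiny XML document serialized as ISO-8859-2.
function exportName(string $utf8Name): string
{
    $doc  = new DOMDocument('1.0', 'ISO-8859-2');  // encoding used on save
    $node = $doc->createElement('name');
    // DOM methods expect UTF-8 input; the conversion happens in saveXML().
    $node->appendChild($doc->createTextNode($utf8Name));
    $doc->appendChild($node);
    return $doc->saveXML();
}

$xml = exportName('Łódź');
// In ISO-8859-2 the four letters are the four bytes A3 F3 64 BC.
var_dump(strpos($xml, "<name>\xA3\xF3\x64\xBC</name>") !== false);  // bool(true)
```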
