How to convert a Chinese character to UTF-16 code units?


I'm using PHP for this web development project. Right now, I'm working on a user page where the user can add words that he knows. Of course, I'm starting out crude, without adding any special features yet, such as a "Do you know this character?" suggestion.
I have already set the charset and collation of my MySQL database (hosted online at http://freemysqlhosting.net) to UTF-16 so that my website supports Chinese characters. What I'm struggling with now is automatic Pinyin generation for my Chinese characters.
I have found this after searching all over SO: https://github.com/reorx/pinyindep/blob/master/Uni2Pinyin. Each line begins with a Chinese character, in UTF-16 Code Units.
Take for example, 爱. In UTF-16, it is 7231. I convert this at https://r12a.github.io/apps/conversion/. When I do a lookup in the file, I get the pinyin associated. :D This is the functionality I need, though looking it up in GitHub is in JS, rather than PHP.
In the manual lookup, ai4 is returned, which is the correct intonation. Now, what I'm looking for is either a built-in PHP library or a code snippet to convert a string input, let's say “爱”, into its four-character UTF-16 code unit, here 7231.
So what's the question:
How should I convert a Chinese character, in form of a string, to UTF-16 code units? (Either through built-in library, or through a suggested PHP Code Snippet)
P.S. I don't really like third-party tools unless they are really popular worldwide, or there's no other option.

You need to use PHP's multibyte string module:
$c = "爱";
list(, $d) = unpack('N', mb_convert_encoding($c, 'UCS-4BE', 'UTF-8'));
echo dechex($d);
// => 7231
Change UTF-8 to UTF-16 if your string is coming from the database in that encoding.
mb_convert_encoding changes the string into a four-byte-per-character encoding; unpack then converts the four bytes into an unsigned long; finally, dechex converts that number to a hexadecimal string.

If you are using PHP 7.2+ you can use mb_ord to simplify the conversion.
echo dechex(mb_ord("爱"));
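Note that a code point and a UTF-16 code unit are only the same number for characters in the Basic Multilingual Plane, which includes 爱. Characters outside it are stored in UTF-16 as two code units (a surrogate pair). Here is a minimal sketch that returns the actual code units, assuming the input is UTF-8 and the mbstring extension is loaded; utf16CodeUnits is a hypothetical helper name, not a built-in:

```php
<?php
// Hypothetical helper: returns the UTF-16 code units of a UTF-8 string
// as upper-case, four-digit hex strings.
function utf16CodeUnits(string $utf8): array
{
    // Big-endian UTF-16 without a BOM: every code unit is exactly two bytes.
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    // Two bytes are four hex digits, so split the hex dump into groups of four.
    return str_split(strtoupper(bin2hex($utf16)), 4);
}

print_r(utf16CodeUnits("爱"));  // one code unit: 7231
print_r(utf16CodeUnits("😀")); // two code units: D83D, DE00
```

For a lookup file keyed by four-digit values, the first form is what you need; the second case only matters if the file also covers characters beyond U+FFFF.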

Related

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes. Here the character coding is known. In practice, it comes from imports, e.g. with file_get_contents(), and the character coding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
return '\x'.rtrim(chunk_split(strtoupper(bin2hex($str)),2,'\x'),'\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters untouched?

I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represent, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the € character which is 20AC in Unicode and expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 codepage. Whenever a character conversion fails, the Unicode FFFD character is used instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see â‚¬ or similar.) To say that again, your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach(str_split($s) as $c){
$buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
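The loop above escapes every byte over x7F, including bytes that belong to well-formed UTF-8 characters. If the goal is what the question literally asks for (keep correct UTF-8, escape only the stray bytes), one possible sketch is a byte-mode regular expression that recognises well-formed UTF-8 sequences and falls back to single bytes. The function name escapeInvalidUtf8 is hypothetical, and this assumes you have decided to treat the input as "mostly UTF-8":

```php
<?php
// Hypothetical helper: keeps well-formed UTF-8 sequences as they are and
// rewrites every other byte as a literal \xNN escape.
function escapeInvalidUtf8(string $s): string
{
    // Group 1: exactly one well-formed UTF-8 sequence. Group 2: any single byte.
    // No /u modifier, so PCRE works on raw bytes.
    $pattern = '/(
        [\x00-\x7F]
      | [\xC2-\xDF][\x80-\xBF]
      | \xE0[\xA0-\xBF][\x80-\xBF]
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
      | \xED[\x80-\x9F][\x80-\xBF]
      | \xF0[\x90-\xBF][\x80-\xBF]{2}
      | [\xF1-\xF3][\x80-\xBF]{3}
      | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )|(.)/xs';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // $m[2] is only present when the single-byte fallback matched.
        return isset($m[2]) ? '\x' . bin2hex($m[2]) : $m[1];
    }, $s);

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $result ?? $s;
}

echo escapeInvalidUtf8("The price is 15 \x80"), "\n"; // The price is 15 \x80
echo escapeInvalidUtf8("The price is 15 €"), "\n";    // unchanged
```

The caveat from the answer still applies: a string that contains invalid UTF-8 is usually in some other encoding altogether, so this is a debugging aid rather than a conversion.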

PHP Encoding Conversion to Windows-1252 whilst keeping UTF-8 Compatibility

I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass on those files to a controller (on which I don't have any influence), the files have to be Windows-1252 encoded. This controller then again generates a list of valid file(names) that are stored via MySQL into a database - therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to "Windows-1252" (like with the character "ï"), the character is converted to something invalid in UTF-8. MySQL then drops those invalid characters - as a result filenames on disk and filenames stored to the database don't match anymore. This conversion, which sometimes fails, is achieved with simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters being generated by the conversion, I then again can remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this will completely remove / recode any special characters left in the string. For example I lose all "äöüÄÖÜ" etc., which are quite common in the German language.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!

I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding, but it does so from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). On my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding the function with it.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.
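Given that detection is unreliable, one pragmatic pattern is to skip guessing among many encodings: check whether the bytes are well-formed UTF-8 and, if they are not, assume one single fallback encoding. A rough sketch, where toUtf8 is a hypothetical helper and Windows-1252 as the fallback is an assumption about where your uploads come from:

```php
<?php
// Hypothetical helper: returns the input unchanged if it is already valid
// UTF-8, otherwise converts it from the assumed fallback encoding.
function toUtf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already well-formed UTF-8 (plain ASCII included)
    }
    // Assumption: anything that is not valid UTF-8 is in $fallback.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

var_dump(toUtf8("\xE4") === "ä"); // Windows-1252 byte for ä, gets converted
var_dump(toUtf8("ä") === "ä");    // already UTF-8, left alone
```

This does not tell Windows-1252 apart from ISO-8859-1 or ISO-8859-15; it only separates "valid UTF-8" from "everything else", which is the part that can be checked mechanically.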

You can't have a string be Windows-1252 and UTF-8 at the same time. The character sets are identical for the first 128 characters (they contain e.g. the basic latin alphabet), but when it goes beyond that (like for Umlauts), it's either one or the other. They have different code points in UTF-8 than they have in Windows-1252.

Keep to ASCII in the filesystem - if you need to sustain characters outside ASCII in a filename, there are schemes you can use to represent Unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
Of course this will hit the file name limit pretty fast and is not very optimal.
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt
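The percent-encoding scheme mentioned above maps directly onto PHP's built-in rawurlencode() and rawurldecode(). A small sketch, assuming the original filename is held as a UTF-8 string:

```php
<?php
$original = "äöüÄÖÜ.txt"; // assumed to be UTF-8

// Every byte outside the unreserved ASCII set becomes %XX.
$asciiName = rawurlencode($original);
echo $asciiName, "\n"; // %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt

// The mapping is reversible, so the original name can be restored for display.
$restored = rawurldecode($asciiName);
var_dump($restored === $original); // bool(true)
```

Because the encoded name is pure ASCII, it is byte-for-byte identical in Windows-1252 and UTF-8, which sidesteps the mismatch between the controller and the database.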

Google can't read a sitemap with special characters in URLs

I have a big sitemap created dynamically with PHP. It has a sitemap index with some 230 separate sitemaps, and each individual sitemap has between 3,000 and 15,000 URLs.
In most of those 230 sitemaps everything is OK, but in some of them some URLs contain special characters, and Google returns an error and does not accept those sitemaps. An example of a normal, accepted URL:
http://www.site.com/Gentofte-Greve/Denmark 1 Badmintonligaen/12-fe-juice_a-1091627-1-33-1-odds/
An example of a URL which corrupts the entire sitemap file for Google:
http://www.site.com/Team%20%C5rhus%20Elite-Solr%F8d%20Strand/Denmark 1 Badmintonligaen/12-fe-juice_a-1091631-1-33-1-odds/
Any special character, for example the Nordic ones, will wreck the sitemap. Here is an example of Nordic characters: http://www.borgos.nndata.no/alfabet.htm
My question is: how do I encode those special characters (and other similar ones) so that the sitemap still checks out fine? Which PHP encoding function do I use, if that's a solution? Is the only solution to use str_replace and replace those characters with normal ones? It wouldn't be an issue, since the URL works no matter what you write in the first part of it (that part is for SEO only), but it would be time-consuming. I'd prefer to be able to write those special characters in a way which doesn't wreck the sitemap for Google.
Everything else regarding my sitemaps is fine, they're coded in UTF-8 or at least they should be with this line:
<?xml version='1.0' encoding='UTF-8'?>

Are the %C5 and %F8 sequences meant to represent the characters U+00C5 (Å) and U+00F8 (ø)? If so, you need to use their UTF-8 encodings, not their raw Unicode codepoint numbers. 'Å' should be %C3%85, and 'ø' should be %C3%B8.
For more information about URI encoding, see RFC 3986.
Doing this in PHP is complicated by the fact that PHP strings are really byte strings, not Unicode character strings. They can't store abstract Unicode characters; they can only store the encoded representation of those characters, in a particular encoding such as UTF-8 or UTF-16. You can use the mbstring extension to work with encoded Unicode strings, but doing this correctly will probably mean using the mbstring functions for all handling of Unicode text throughout your application.
You should be looking to fix this encoding problem at the source: how did your program get a string that contains the byte 0xC5 to represent the character U+00C5? Something, somewhere, must've assumed that Unicode codepoint numbers translate directly into bytes, which is wrong. Find and fix that, so that your data is read into the PHP string in UTF-8 form to begin with, and then use the mbstring functions for any manipulation of the string afterward.
Once you have a string that contains the UTF-8 representation of your URL, rawurlencode() should give you the correct percent-escaped result.
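To make the difference concrete, here is a short sketch that reproduces the broken escapes from the question and the corrected ones. It assumes the team name reaches PHP as ISO-8859-1 bytes, which is what %C5 and %F8 suggest; if the source is really Windows-1252 or something else, the third argument of mb_convert_encoding() would need to change:

```php
<?php
// Assumption: the name arrives as ISO-8859-1 (0xC5 = Å, 0xF8 = ø).
$latin1 = "Team \xC5rhus Elite-Solr\xF8d Strand";

// Percent-encoding the raw bytes gives the escapes Google rejects.
$broken = rawurlencode($latin1);
echo $broken, "\n"; // Team%20%C5rhus%20Elite-Solr%F8d%20Strand

// Convert to UTF-8 first, then percent-encode.
$utf8  = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$fixed = rawurlencode($utf8);
echo $fixed, "\n"; // Team%20%C3%85rhus%20Elite-Solr%C3%B8d%20Strand
```

Converting at output time like this is a stopgap; as the answer says, the cleaner fix is to have the data in UTF-8 from the moment it is read.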

Why use multibyte string functions in PHP?

At the moment, I don't understand why it is so important to use the mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace and so on work properly by default?

PHP's standard string functions do not handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.

Here is my answer in plain English.
A single Japanese, Chinese or Korean character takes more than a single byte. For example, a typical character, say x, takes 1 byte in English, but a character in Japanese, Chinese or Korean takes more than 1 byte. PHP's standard string functions treat a single character as 1 byte, so if you try to compare two Japanese, Chinese or Korean characters, they will not work as expected. For example, "Hello World!" written in Japanese, Chinese or Korean will be more than 12 bytes long.
Read http://www.php.net/manual/en/intro.mbstring.php

You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
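As an illustration of the claim above, here is a small sketch (assuming the source file and the data are both UTF-8) in which plain byte-oriented functions search and replace correctly. The only visible difference from the mb_* functions is that positions are counted in bytes:

```php
<?php
$word = "medúlla"; // ú is two bytes in UTF-8: C3 BA

// Byte-wise replacement works because the two-byte sequence for ú
// cannot occur inside any other well-formed UTF-8 character.
$replaced = str_replace("ú", "u", $word);
echo $replaced, "\n"; // medulla

// Searching works too; the result is a byte offset, not a character index.
$byteOffset = strpos($word, "lla");               // 5
$charOffset = mb_strpos($word, "lla", 0, 'UTF-8'); // 4
echo "$byteOffset vs $charOffset\n";
```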

PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
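A quick sketch of the difference, assuming the string literal is stored as UTF-8 and the mbstring extension is available:

```php
<?php
$greeting = "こんにちは"; // five characters, three bytes each in UTF-8

$byteCount = strlen($greeting);             // counts bytes: 15
$charCount = mb_strlen($greeting, 'UTF-8'); // counts characters: 5

echo "$byteCount bytes, $charCount characters\n";
```

The encoding is passed explicitly here because, as described above, the string itself does not carry that information.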
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
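The same effect can be shown with a real encoding. In the sketch below (assuming UTF-8 input), the pattern for "three identical characters in a row" only matches once PCRE is told, via the /u modifier, that the subject is UTF-8:

```php
<?php
$text = "äää"; // UTF-8 bytes: C3 A4 C3 A4 C3 A4

// Without /u the dot matches one byte, and no byte repeats three times in a row.
$byteMode = preg_match('/(.)\1\1/', $text);  // 0

// With /u the dot matches one whole character.
$utf8Mode = preg_match('/(.)\1\1/u', $text); // 1

var_dump($byteMode, $utf8Mode);
```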
Summary:
No sane 'default' is possible, as PHP strings do not carry their character encoding. And even if they did, a single function like strlen() cannot return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as useful for denoting the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.

multibyte => multi + byte.
1) The multibyte functions are used to work with strings written in languages other than English.
2) The default PHP string functions only work properly with English (or closely related) text.
3) This matters whenever you apply strlen(), strpos(), strtoupper() or str_replace() to special characters.
Suppose we need to apply string functions to "Hello" in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते) or Gujarati (હેલો).
Different languages can have their own character sets, so mbstring was introduced to handle text in various languages (Chinese, Japanese, etc.).

Raul González is a perfect example of why:
It is about shortening user names that are too long for a MySQL column: say we have a 10-character limit and the name Raul González.
The unit test below is an example of how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
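public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);

    // substr() counts bytes: the cut lands in the middle of the two-byte "á",
    // so saving the truncated name is expected to throw.
    try {
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
    }
    $this->assertTrue(isset($ex));

    // mb_substr() counts characters, so the result is still valid UTF-8.
    $name2 = mb_substr($name, 0, 10);
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}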
PHP, Laravel and PHPUnit were used for illustration.
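The same effect can be seen without a database or framework. This is a minimal standalone sketch, assuming the string is UTF-8, that shows what each function returns and why MySQL rejects the first result:

```php
<?php
$name = 'Raul González'; // "á" is two bytes in UTF-8: C3 A1

$byBytes = substr($name, 0, 10);             // "Raul Gonz" plus a lone C3 byte
$byChars = mb_substr($name, 0, 10, 'UTF-8'); // "Raul Gonzá"

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($byChars, 'UTF-8')); // bool(true)
```

The dangling C3 byte at the end of the first result is the '\xC3' reported in the error message above.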

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() not to raise an error when it meets a character it can't handle, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
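The medÃºlla symptom from the question is what UTF-8 bytes look like when some layer treats them as a single-byte encoding and converts them to UTF-8 a second time. A small sketch that reproduces and reverses this, assuming the single-byte encoding involved is ISO-8859-1 (in practice it is often Windows-1252, which differs only in the 0x80 to 0x9F range):

```php
<?php
$original = "medúlla"; // correct UTF-8

// Misread the UTF-8 bytes as ISO-8859-1 and "convert" them to UTF-8 again.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
echo $mojibake, "\n"; // medÃºlla

// Undo it by converting back the other way.
$repaired = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $original); // bool(true)
```

If your data shows this pattern, a conversion function looking for ú will not find it, because the string really contains Ã followed by º.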

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should encode it as UTF-8 and then %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
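A brief sketch of the two functions side by side, assuming UTF-8 input. For this particular word they agree; the difference shows up with characters such as the space:

```php
<?php
$title = "Medúlla"; // assumed to be UTF-8

$pathSegment = rawurlencode($title);
echo $pathSegment, "\n"; // Med%C3%BAlla

// The two functions differ in how they treat a space.
$rawSpace   = rawurlencode("a b"); // a%20b (suitable for path components)
$querySpace = urlencode("a b");    // a+b   (form/query-string style)
echo "$rawSpace vs $querySpace\n";
```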
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, since PHP does not interpret \x escapes in single-quoted strings) to avoid the problem.
Running SET NAMES utf8 on the database before querying
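Use mysql_set_charset() in preference (or mysqli_set_charset() with the mysqli extension, since the old mysql extension was removed in PHP 7).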
