From the answers to this question I tried to make my program more safe by converting strings to hex and comparing those values instead of directly and dangerously using strings directly from the user. I modified the code on that question to add a conversion:
function mssql_escape($data) {
if(is_numeric($data))
return $data;
$data = iconv("ISO-8859-1", "UTF-16", $data);
$unpacked = unpack('H*hex', $data);
return '0x' . $unpacked['hex'];
}
I do this because in my database I am using nvarchar instead of varchar. Now when I run through it on the php side, it comes up with
0xfeff00680065006c006c006f00200077006f0072006c00640021
Then I run the following query:
declare #test nvarchar(100);
set #test = 'hello world!';
select CONVERT(VARBINARY(MAX), #test);
It results in:
0x680065006C006C006F00200077006F0072006C0064002100
Now you'll notice those numbers are ALMOST the same. Other than the trailing zeros, the only difference is feff00. Why is that there? I realize all I would have to do is shift, but I'd really like to know WHY it's there instead of just making an assumption. Can anybody explain to me why php decides to throw feff00 (yellow!) in the front of my hex?
Well, Andrew, I seem to answer a lot of your questions. This link explains:
So the people were forced to come up with the bizarre convention of
storing a FE FF at the beginning of every Unicode string; this is
called a Unicode Byte Order Mark and if you are swapping your high and
low bytes it will look like a FF FE and the person reading your string
will know that they have to swap every other byte. Phew. Not every
Unicode string in the wild has a byte order mark at the beginning.
And Wikipedia explains:
If the 16-bit units are represented in big-endian byte order, this BOM
character will appear in the sequence of bytes as 0xFE followed by
0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text
display that expects the text to be ISO-8859-1.
if the 16-bit units
use little-endian order, the sequence of bytes will have 0xFF followed
by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a
text display that expects the text to be ISO-8859-1.
So the code you displayed with FEFF, which means it's in Big Endian notation. Use UTF-16LE for little endian, and SQL will understand that. Shifting the first SIX hex digits will only coincidentally work as long as you're only using two bytes.
Related
I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes. Here the character coding is known. In practice, it comes from imports e.G. with file_cet_contents() and the character coding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
return '\x'.rtrim(chunk_split(strtoupper(bin2hex($str)),2,'\x'),'\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represents, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here which is asking for problems. The only safe and sane answer is Unicode with one of the officially support encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the € character which is 20AC in Unicode and expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 codepage. Whenever a character conversion fails, the Unicode FFFD character is used instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see € or similar.) To say that again, your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach(str_split($s) as $c){
$buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
I receive country names like from a library: "\u00c3\u0096sterreich".
How do I convert this to Österreich?
Using PHP 7.3
This one is a lot trickier than it seem, but the below code appears to work.
First we pipe it through the standard regex for Unicode escape sequences, then pack that as a binary string, convert the encoding and finally decode. I cannot promise this is the best way to do this, but it appears to be working correct as far as I can tell.
$str = '\u00c3\u0096sterreich';
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
return utf8_decode(mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE'));
}, $str);
Demo here
The Unicode for the UTF-8 character "Ö" is U+00D6.
This character consists of the 2 hex bytes: c3 and 96.
The representation \u00c3 \u0096 for these 2 bytes is a bit strange. Provided that the multibyte character is represented byte for byte, the following code can also be used.
$str = '\u00c3\u0096sterreich';
$str = preg_replace_callback(
'~\\\\u00([0-9a-f]{2})~i',
function($m){
return hex2bin($m[1]);
},
$str
);
//Test
$expect = "Österreich";
var_dump($str === $expect); //bool(true)
In case anyone else ends up here with a similar issue, I thought I'd try and shed some light on what's going on. Because as mentioned, this is a lot more complicated that it might look.
A string like \u00c3 refers to a Unicode code-point, in hexadecimal. Ö in the Unicode table is character 214, or \u00d6.
The 214 here is not directly related to how Ö is actually stored in any particular encoding (UTF-8, UTF-16, etc), it's just an abstract number in the overall Unicode table that refers to that character. UTF-8, for instance, will store it in two bytes 11000010 10010110 (194 150 in decimal). There's a really good explanation of how this works in this answer, if you're interested in the finer details.
What appears to have happened in your string is that these two bytes have then been encoded back into hexadecimal, and returned as two separate Unicode code points. u00c3 is Ã, and \u0096 is a control character. This is why any standard methods of decoding this (json_decode, etc) won't have worked - ultimately what you have is not a valid representation of the string Österreich.
The other answers should both work perfectly well, but this code snippet might better illustrate the issue with the format your library is using. It specifically matches two consecutive low Unicode code-points, recombines their decimal representations into an unsigned two-byte integer, and then returns the result.
$str = '\u00c3\u0096sterreich';
echo preg_replace_callback('/\\\\u00([0-9a-fA-F]{2})\\\\u00([0-9a-fA-F]{2})/', function ($match) {
$i = (hexdec($match[1]) << 8) + hexdec($match[2]);
return pack('N', $i);
}, $str);
Österreich
See https://3v4l.org/QtUuGD
I am trying to decode this special character: "ß", if I use "ord()", I get "C3"
echo "ord hex--> " . dechex(ord('ß'));
...but that doesn't look good; so i tried "bin2hex()", now I get "C39F" (what?).
echo "bin2hex --> " . bin2hex('ß');
By using an Extended ASCII Table from the Internet, i know that the correct hexadecimal value is "DF", so i now tried "hex2bin()", but that give me some unknown character like this: "�".
echo "hex2bin --> " . hex2bin('DF');
Is it possible to get the "DF" output?
You're on the right path with bin2hex, what you're confused about is merely the encoding. Currently you're seeing the hex value of ß for the UTF-8 encoding, because your string is encoded in UTF-8. What you want is the hex value for that string in some other encoding. Let's assume "Extended ASCII" refers to ISO-8859-1, as it colloquially often does (but doesn't have to):
echo bin2hex(iconv('UTF-8', 'ISO-8859-1', 'ß'));
Now, having said that, I have no idea what you'd use that information for. There are many valid "hex values" for the character ß in various different encodings; "Extended ASCII" is just one possible answer, and it's a vague answer to be sure, since "Extended ASCII" has very little practical meaning with hundreds of different "Extended ASCII" charsets available.
ASCII goes from 0x00 to 0x7F. This is not enough to represent all the characters needed so historically old Windows OSes used the available space in a byte (from 0x80 to 0xFF) to represent different characters depending on the localization. This is what codepages are: an arbitrary mapping of non-ASCII values to non-ASCII characters. What you call "extended ASCII" is IMO an inappropriate name for a codepage.
The assumption 1 byte - 1 character is dead and (if not) must die.
So actually what you are seeing is the UTF-8 representation of ß. If you want to see the UNICODE code point value of ß (or any other character) just show its UTF-32 representation that AFAIK is mapped 1:1.
// Print 000000df
echo bin2hex(iconv('UTF-8', 'UTF-32BE', 'ß')));
bin2hex() should be fine, as long as you know what encoding you are using.
The C3 output you get appears to be the first byte of the two-byte representation of the character in UTF-8 (what incidentally means that you've configured your editor to save files in such encoding, which is a good idea in 2017).
The ord() function does not accept arbitrary encodings, let alone Unicode-compatible ones such as UTF-8:
Returns the ASCII value of the first character of string.
ASCII (a fairly small 7-bit charset) does not have any encoding for the ß character (aka U+00DF LATIN SMALL LETTER SHARP S). Seriously. ASCII does not even have a DF position (it goes up to 7E).
At the moment, I don't understand why it is really important to use mbstring functions in PHP when dealing with UTF-8? My locale under linux is already set to UTF-8, so why doesn't functions like strlen, preg_replace and so on don't work properly by default?
All of the PHP string functions do not handle multibyte strings regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese and Chinese and Korean character take more than a single byte. Eg., a typical charactert say x is takes 1 byte in English it will take more than 1 byte in Japanese and Chinese and Korean. Now PHP's standard string functions are meant to treat a single character as 1 byte. So in case you are trying to do compare two Japanese or Chinese or Korean characters they will not work as expected. For example the length of "Hello World!" in Japanese or Chinese or Korean will have more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
Summary:
No sane 'default' is possible as PHP strings do not contain the character encoding. And even if, a single function like strlen() cannot return the length of the byte sequence as required for Content-Length HTTP header and at the same time the number of characters as useful to denote the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
multibyte => multi + byte.
1) It is use to work with string which is in other language(means not in English) format.
2) Default PHP string functions only work proper with English (or releted to it) language.
3) If you want to use strlen() or strpos() or uppercase() or strreplace() for special character,
Suppose We need to apply string functions on "Hello".
In chines (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (
नमस्ते), Gujarati (હેલો).
Different language can it's own character sets
so that mbstring introduced for communicate with various languages like (chines,Japanese etc).
Raul González is a perfect example of why:
It is about shortening too long user names for MySQL database, say we have 10 character limit and Raul González.
The unit test below is an example how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
public function test_substr(): void
{
$name = 'Raul González';
$user = factory(User::class)->create(['name' => $name]);
try {
$name1 = substr($name, 0, 10);
$user->name = $name1;
$user->save();
} catch (Exception $ex) {
}
$this->assertTrue(isset($ex));
$name2 = mb_substr($name, 0, 10);
$user->name = $name2;
$user->save();
$this->assertTrue(true);
}
PHP Laravel and PhpUnit was used for illustration.
I'm having this problem with UTF8 string comparison which I really have no idea about and it starts to give me headache. Please help me out.
Basically I have this string from a xml document encoded in UTF8: 'Mina Tidigare anställningar'
And when I compare that string with the exactly the same string which I typed myself: 'Mina Tidigare anställningar' (also in UTF8). And the result is FALSE!!!
I have no idea why. It is so strange. Can someone help me out?
This seems somewhat relevant. To simplify, there are several ways to get the same text in Unicode (and therefore UTF8): for example, this: ř can be written as one character ř or as two characters: r and the combining ˇ.
Your best bet would be the normalizer class - normalize both strings to the same normalization form and compare the results.
In one of the comments, you show these hex representations of the strings:
4d696e61205469646967617265 20 616e7374 c3a4 6c6c6e696e676172 // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
^^-----------------^^^^1 ^^^^^^2
Note the parts I marked, apparently there are two parts to this problem.
For the first, observe this question on the meaning of byte sequence "c2a0" - for some reason, your typing is translated to a non-breakable space where the XML file has a normal space. Note that there's a normal space in both cases after "Mina". Not sure what to do about that in PHP, except to replace all whitespace with a normal space.
As to the second, that is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS" - one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A" - one character, one byte) and cc88 would be the combining umlaut " (U+0308 "COMBINING DIAERESIS" - two characters, three bytes). Here, the normalization library should be useful.
Let's try blindly: maybe both UTF-8 strings have not the same underlying representation (you can get characters with accents as a sequence or as a unique character). You should give use some hex dump of both UTF8 strings and someone may be able to help.
mb_detect_encoding($s, "UTF-8") == "UTF-8" ? : $s = utf8_encode($s);