Hex to Unicode in PHP ( \u014D to ō) [duplicate] - php

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
How to decode Unicode escape sequences like “\u00ed” to proper UTF-8 encoded characters?
How can I convert \u014D to ō in PHP?
Thank You

It's not immediately clear what you mean when you say "to ō". If you're asking how to convert it into a different encoding, then a general approach is to use the iconv function. 014D is the UCS-2 (Unicode) code for your desired character so, if you have a string containing the bytes 01 4D, you could use
iconv('UCS-2', 'UTF-8', $s)
to convert from UCS-2 to UTF-8. Similarly if you want to convert to a different encoding - although you need to be aware that not all encodings will include the character you are using. You'll see from the iconv documentation that the //TRANSLIT option may help in that case.
Note that iconv is taking a byte sequence so, if you actually have a string containing a slash, then a u, then a 0 etc... you'll need to convert that into the byte sequence first.
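As a rough sketch of that last step, the literal escape can be turned into its two-byte sequence with pack() and then handed to iconv. The function name decodeUnicodeEscapes() is made up for illustration, and UCS-2BE is named explicitly on the assumption that the byte order of plain UCS-2 may depend on the iconv implementation. It only covers four-digit escapes outside the surrogate range.

```php
<?php
// Sketch: turn literal \uXXXX escapes (backslash, "u", four hex digits)
// into UTF-8. decodeUnicodeEscapes() is an illustrative name, not a built-in.
function decodeUnicodeEscapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9A-Fa-f]{4})/',
        static function (array $match): string {
            // Two big-endian bytes, e.g. "014D" becomes "\x01\x4D".
            $ucs2 = pack('n', hexdec($match[1]));
            $utf8 = iconv('UCS-2BE', 'UTF-8', $ucs2);
            // Leave the escape untouched if iconv rejects it.
            return $utf8 === false ? $match[0] : $utf8;
        },
        $text
    );
}

$decoded = decodeUnicodeEscapes('T\u014Dky\u014D');
echo $decoded, "\n";
```

If the escapes come from JSON, json_decode() on the quoted string is probably the simpler route.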

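If you have the escape characters in the string you could use a messy eval statement (eval, not exec: exec runs an external program and does not evaluate PHP code).
$string = '\\u014D';
eval('$string = "\u{' . substr($string, 2) . '}";');
This way, the Unicode escape sequence is recognized and interpreted as a Unicode character when the string is parsed. This relies on the \u{...} escape in double-quoted strings, which exists since PHP 7.0.
Of course, you should never use eval unless absolutely necessary, and never on untrusted input.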

Related

PHP html_entity_decode is not working for UTF-8 characters? [duplicate]

Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested to know about strlen(), not other functions
This is the string:
$1ï¿½2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.
How about using mb_strlen()?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen, it's possible to configure PHP by setting the mbstring.func_overload directive to 2, so that calls to strlen in your scripts are automatically replaced with mb_strlen. Note that this directive was deprecated in PHP 7.2 and removed in PHP 8.0.
The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two)
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
However, if we were to store that string as ISO 8859-1 or CP1252 we would have a six-byte sequence that would be legal as UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit 2). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "ï¿½".
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation: first by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
You need to use the multibyte string function mb_strlen(), like:
mb_strlen($string, 'UTF-8');
It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.
The sequence ï¿½ is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result in latin1. This character is used as a replacement for byte sequences that don't encode any character when reading text from a file, for example. What has happened is likely this:
The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)
The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
Then some third program comes along that reads the file in latin1, and shows $1ï¿½2.
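The three steps above can be replayed in a few lines. This is only a sketch of the suspected chain, assuming the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
// Step 1: "$1¢2" stored as latin1 is 4 bytes, but it is not valid UTF-8.
$latin1 = '$1' . "\xA2" . '2';
$latin1IsUtf8 = mb_check_encoding($latin1, 'UTF-8');      // false

// Step 2: a UTF-8 reader substitutes U+FFFD and writes the text back as UTF-8.
$replaced = '$1' . "\xEF\xBF\xBD" . '2';
$replacedBytes = strlen($replaced);                       // 6
$replacedChars = mb_strlen($replaced, 'UTF-8');           // 4

// Step 3: a latin1 reader turns each of those 6 bytes into one character.
$misread = mb_convert_encoding($replaced, 'UTF-8', 'ISO-8859-1');
$misreadChars = mb_strlen($misread, 'UTF-8');             // 6
$misreadBytes = strlen($misread);                         // 9

printf("%d bytes / %d chars, then %d bytes / %d chars\n",
    $replacedBytes, $replacedChars, $misreadBytes, $misreadChars);
```

So strlen() gives 4 only for the original latin1 bytes, 6 for the UTF-8 text with the replacement character, and 9 for the doubly mangled version.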
No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF-8 encoding needs at least 1 byte per character.
We have established that:
there are 4 bytes
a character is represented by no less than 1 byte
...yet, we have 6 characters....which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software (e.g. the web browser) is using to interpret the string. It could use some uncommon encoding scheme where a character can be represented by fewer than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So, the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.
Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.

php preg_replace: unicode modifier for ascii strings

I need to handle strings in my PHP script using regular expressions. But there is a problem: different strings have different encodings. If a string contains just ASCII symbols, the mb_detect_encoding function returns 'ASCII'. But if a string contains Russian symbols, for example, mb_detect_encoding returns 'UTF-8'. It's not a good idea to check the encoding of each string manually, I suppose.
So the question is - is it correct to use preg_replace (with unicode modifier) for ascii strings? Is it right to write such code preg_replace ("/[^_a-z]/u","",$string); for both ascii and utf-8 strings?
This would be no problem if the two choices were "UTF-8" or "ASCII", but that's not the case.
If PHP doesn't use UTF-8, it uses ISO-8859-1, which is NOT ASCII (it's a superset of ASCII in that the first 128 characters are identical). Some characters, for example the Swedish ones å, ä and ö, can be represented in both ISO-8859-1 and UTF-8, but with different byte sequences! I don't think this matters much for preg_* functions so it may not be applicable to your question, but please keep this in mind when working with different encodings.
You should really, really try to know which character set your strings are in, without the magic of mb_detect_encoding (mb_detect_encoding is not a guarantee, just a good guess). For example, strings fetched through HTTP usually have a character set specified in the HTTP Content-Type header.
Yes, sure, you can always use the Unicode modifier, and it will affect neither the results nor the performance.
The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string you should be able to use the PREG "u" modifier on it.
However, if you have a "supplemented" 8-bit ASCII character set such as ISO-8859-1, Windows-1252 or HP-Roman8 the characters with the leftmost bit set on (values x80 - xff) are not encoded the same in UTF-8 and it would not be appropriate to use the PREG "u" modifier.
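A quick way to see the difference, as a sketch: the pattern is the one from the question, while the sample strings are made up. With the "u" modifier, a subject that is not valid UTF-8 makes preg_replace() return null, and preg_last_error() reports PREG_BAD_UTF8_ERROR.

```php
<?php
$pattern = '/[^_a-z]/u';

// Pure 7-bit ASCII is also valid UTF-8, so the "u" modifier is harmless here.
$ascii = 'Hello_World';
$asciiResult = preg_replace($pattern, '', $ascii);        // "ello_orld"

// ISO-8859-1 bytes for "smörgås": \xF6 and \xE5 are not valid UTF-8 on their own.
$latin1 = "sm\xF6rg\xE5s";
$latin1Result = preg_replace($pattern, '', $latin1);      // null
$error = preg_last_error();

var_dump($asciiResult, $latin1Result, $error === PREG_BAD_UTF8_ERROR);
```

Checking the return value for null is therefore a cheap guard when the input encoding is not fully under your control.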

PHP - convert unicode to character [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How to get the character from unicode value in PHP?
PHP: Convert unicode codepoint to UTF-8
How can I convert a unicode character such as %u05E1 to a normal character via PHP?
The chr function does not cover it, and I am looking for something similar.
"%uXXXX" is a non-standard scheme for URL-encoding Unicode characters. Apparently it was proposed but never really used. As such, there's hardly any standard function that can decode it into an actual UTF-8 sequence.
It's not too difficult to do it yourself though:
$string = '%u05E1%u05E2';
$string = preg_replace('/%u([0-9A-F]+)/', '&#x$1;', $string);
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
This converts the %uXXXX notation to HTML entity notation &#xXXXX;, which can be decoded to actual UTF-8 by html_entity_decode. The above outputs the characters "סע" in UTF-8 encoding.
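Use hexdec to convert it to its decimal representation first. Note that chr() only produces a single byte (its argument is taken modulo 256), so for a code point above 255 use mb_chr() (PHP 7.2+) instead:
echo mb_chr(hexdec("05E1"), 'UTF-8');
var_dump(hexdec("%u05E1") == hexdec("05E1")); // true, because hexdec ignores non-hex characters (deprecated since PHP 7.4)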
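For completeness, here is a sketch that decodes a whole %uXXXX string in one pass with mb_chr(). It assumes PHP 7.2 or later with mbstring loaded; decodePercentU() is a hypothetical helper name, and surrogate values simply decode to an empty string here.

```php
<?php
// Sketch: decode the non-standard %uXXXX notation to UTF-8.
// decodePercentU() is an illustrative name, not a built-in function.
function decodePercentU(string $text): string
{
    return preg_replace_callback(
        '/%u([0-9A-Fa-f]{4})/',
        static function (array $match): string {
            // mb_chr() returns false for invalid code points; cast that to "".
            return (string) mb_chr(hexdec($match[1]), 'UTF-8');
        },
        $text
    );
}

$decoded = decodePercentU('%u05E1%u05E2');
echo $decoded, "\n";
```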

Why use multibyte string functions in PHP?

At the moment, I don't understand why it is really important to use mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace and so on work properly by default?
None of the standard PHP string functions handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese, Chinese or Korean character takes more than a single byte. For example, a typical English character, say x, takes 1 byte, whereas a Japanese, Chinese or Korean character takes more than 1 byte. PHP's standard string functions treat every byte as one character. So if you try to compare or measure Japanese, Chinese or Korean characters with them, they will not work as expected. For example, "Hello World!" is 12 bytes long in English, but a string of 12 Japanese, Chinese or Korean characters takes more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
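To make that claim concrete, here is a small sketch of byte-only functions searching, replacing and splitting UTF-8 text. It assumes the source file itself is saved as UTF-8; the sample text is made up. It only illustrates search and replace, not case mapping or cutting at arbitrary offsets.

```php
<?php
$text = 'naïve café, 日本語 text';

// strpos() returns a byte offset, which always falls on a character boundary
// because a complete UTF-8 sequence cannot start in the middle of another one.
$pos = strpos($text, '日本語');
$found = substr($text, $pos, strlen('日本語'));

// Replacing and splitting on complete sequences is safe for the same reason.
$replaced = str_replace('café', 'tea', $text);
$words = explode(' ', $text);

var_dump($found, $replaced, count($words));
```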
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
Summary:
No sane 'default' is possible as PHP strings do not contain the character encoding. And even if they did, a single function like strlen() cannot return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as useful to denote the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
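A short sketch of the two lengths side by side, assuming a UTF-8 source file and the mbstring extension (the sample string is made up):

```php
<?php
$body = 'Grüße, 世界';

// Number of bytes: what a Content-Length header needs.
$byteLength = strlen($body);

// Number of characters: only computable once the encoding is stated.
$charLength = mb_strlen($body, 'UTF-8');

printf("%d bytes, %d characters\n", $byteLength, $charLength);
```

Both numbers are correct answers to different questions, which is why overloading strlen() to mean mb_strlen() silently breaks code that needed the byte count.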
multibyte => multi + byte.
1) It is used to work with strings in other languages (that is, not in English).
2) The default PHP string functions only work properly with English (or related) text.
3) Suppose you want to use strlen(), strpos(), strtoupper() or str_replace() on special characters, for example on "Hello" in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते) or Gujarati (હેલો).
Different languages can have their own character sets,
so mbstring was introduced to handle various languages (Chinese, Japanese, etc.).
Raul González is a perfect example of why:
Say we need to shorten user names that are too long for a MySQL database column with a 10-character limit, and the name is Raul González.
The unit test below shows how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
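public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);

    // substr() counts bytes: the cut lands inside the two-byte "á",
    // leaving a lone \xC3 that MySQL rejects.
    try {
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
        // Expected: error 1366, incorrect string value.
    }
    $this->assertTrue(isset($ex));

    // mb_substr() counts characters, so the result stays valid UTF-8.
    $name2 = mb_substr($name, 0, 10);
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}
Laravel and PHPUnit were used for illustration.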
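The same effect can be reproduced without a database or framework. This is a sketch assuming a UTF-8 source file and the mbstring extension; mb_strcut() is shown as an option for the case where the limit is really in bytes rather than characters.

```php
<?php
$name = 'Raul González';

// Byte-based cut: 10 bytes ends between the two bytes of "á" (C3 A1).
$byteCut = substr($name, 0, 10);
$byteCutValid = mb_check_encoding($byteCut, 'UTF-8');     // false
$lastByte = bin2hex(substr($byteCut, -1));                // "c3"

// Character-based cut: 10 characters, still valid UTF-8.
$charCut = mb_substr($name, 0, 10, 'UTF-8');              // "Raul Gonzá"

// Byte limit that never splits a character.
$safeByteCut = mb_strcut($name, 0, 10, 'UTF-8');

var_dump($byteCutValid, $lastByte, $charCut, $safeByteCut);
```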

utf8_encode function purpose

Suppose that I'm encoding my files as UTF-8.
Within a PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); // Do I need this step?
if(preg_match('/ぁ/u',$string))
    // Do something on match...
Is that string really UTF-8 without the utf8_encode() function?
If you encode your files as UTF-8, do you not need this function?
If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
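You can check this yourself by dumping the bytes of the literal. A minimal sketch, assuming the script file is saved as UTF-8:

```php
<?php
$string = 'あ';

// The raw bytes PHP stored for the literal, in hex.
$hex = bin2hex($string);

// The same bytes written out as bits.
$bits = implode(' ', array_map(
    static function (int $byte): string {
        return sprintf('%08b', $byte);
    },
    unpack('C*', $string)
));

echo $hex, "\n", $bits, "\n";
```

If the file were saved in a different encoding, the same script would print different bytes, which is the whole point: the string simply holds whatever bytes the file contained.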
PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variable's content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a question mark or similar. So before you go on, you already see that something might not be as intended. (It turned out this was a missing font on my end.)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff
Your string is not a UTF-8 character, so preg_match can't match it, which is why you need to utf8_encode it. Try encoding the PHP file as UTF-8 (use something like Notepad++) and it may work without it.
Summary:
The utf8_encode() function encodes every byte of a given string to UTF-8, treating each byte as one ISO-8859-1 character,
no matter what encoding was previously used to store the file.
Its purpose is to encode strings¹ that aren't UTF-8 yet.
1.- The correct use of this function is to pass it an ISO-8859-1 string as the parameter.
Why? Because Unicode and ISO-8859-1 have the same characters at the same positions.
[Encoding] [Char][Value/Position] ----> [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of the [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of the [¢]? Yes
The function may seem to work with other encodings: it works if the string to encode contains only characters with the same
values as in the ISO-8859-1 encoding (e.g. in Windows-1252, the 00-7F and A0-FF positions).
We should take into account that if the function receives a UTF-8 string (for example from a file encoded as UTF-8), it will encode that UTF-8 string again and produce garbage.
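The double-encoding effect is easy to see in hex. Since utf8_encode() is deprecated as of PHP 8.2, this sketch uses mb_convert_encoding() with an ISO-8859-1 source, which performs the same conversion; it assumes the mbstring extension is loaded.

```php
<?php
// "ä" as a single ISO-8859-1 byte.
$latin1 = "\xE4";

// Correct use: ISO-8859-1 in, UTF-8 out.
$once = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$onceHex = bin2hex($once);                                // "c3a4"

// Wrong use: the input is already UTF-8, so each of its two bytes
// is treated as a separate ISO-8859-1 character ("Ã" and "¤").
$twice = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');
$twiceHex = bin2hex($twice);                              // "c383c2a4"

echo $onceHex, "\n", $twiceHex, "\n";
```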
