How does intval() change when using UTF-8 multibyte strings as opposed to regular one byte per character strings? Is it the same?
PHP doesn't distinguish string encodings internally. A string is simply an array of bytes. If you pass an UTF-8 string to intval, the function only sees the bytes of the encoded UTF-8 string. Given the nature of the UTF-8 encoding, intval will treat any non-ASCII character as a non-digit. So it doesn't make a difference whether you pass an ASCII, Latin1, or UTF-8 string.
Related
Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested to know about strlen(), not other functions
This is the string:
$1�2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.
how about using mb_strlen() ?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen, its possible to configure your webserver by setting mbstring.func_overload directive to 2, so it will automatically replace using of strlen to mb_strlen in your scripts.
The string you posted is six character long: $1�2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two)
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
However, if we were to store that string as ISO 8859-1 or CP1252 we would have a six byte long sequence that would be legal as UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit 2). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "�".
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation; by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1�2).
need to use Multibyte String Function mb_strlen() like:
mb_strlen($string, 'UTF-8');
It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.
The sequence � is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result in latin1. This character is used as a replacement for byte sequences that don't encode any character when reading text from a file, for example. What has happened is likely this:
The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)
The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
Then some third program comes that reads the file in latin1, and shows $1�2.
No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF8 encoding needs at least 1 byte per character.
We have established that:
there are 4 bytes
a character is represented by no less than 1 byte
...yet, we have 6 characters....which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software(eg, the web browser) is using to intepret the string. It could use some uncommon encoding scheme where a character can be represented by less than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So, the string could be utf8, but the browser could decide to interpret it as, say, some 5 bit character set.
Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.
What do we call these strings and how they can be decoded? as I found these are UTF-8 multi-byte characters and somewhere I noticed they are UTF-32
\x4c\x4f\x42\x41\x4c\x53
The following sequence of bytes \x4c\x4f\x42\x41\x4c\x53 is an ASCII safe sequence. So you can either treat it as a single byte encoding string, or UTF-8.
$s = "\x4c\x4f\x42\x41\x4c\x53";
echo $s; // outputs LOBALS
I need to handle strings in my php script using regular expressions. But there is a problem - different strings have different encodings. If string contains just ascii symbols, mb_detect_encoding function returns 'ASCII'. But if string contains russian symbols, for example, mb_detect_encoding returns 'UTF-8'. It's not good idea to check encoding of each string manually, I suppose.
So the question is - is it correct to use preg_replace (with unicode modifier) for ascii strings? Is it right to write such code preg_replace ("/[^_a-z]/u","",$string); for both ascii and utf-8 strings?
This would be no problem if the two choices were "UTF-8" or "ASCII", but that's not the case.
If PHP doesn't use UTF-8, it uses ISO-8859-1, which is NOT ASCII (it's a superset of ASCII in that the first 127 characters . It's a superset of ASCII. Some characters, for example the Swedish ones å, ä and ö, can be represented in both ISO-8859-1 and Unicode, with different code points! I don't think this matter much for preg_* functions so it may not be applicable to your question, but please keep this in mind when working with different encodings.
You should really, really try to know which character set your strings are in, without the magic of mb_detect_encoding (mb_detect_encoding is not a guarantee, just a good guess). For example, strings fetched through HTTP does have a character set specified in the HTTP header.
Yes sure, you can always use Unicode modifier and it will not affect neither results nor performance.
The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string you should be able to use the PREG "u" modifier on it.
However, if you have a "supplemented" 8-bit ASCII character set such as ISO-8859-1, Windows-1252 or HP-Roman8 the characters with the leftmost bit set on (values x80 - xff) are not encoded the same in UTF-8 and it would not be appropriate to use the PREG "u" modifier.
Let's say I have an UTF-8 text like this:
âàêíóôõ <br> âàêíóôõ <br> âàêíóôõ
I want to replace <br> with <br />. Do I need to use mb_str_replace or I can use str_replace ?
Consindering < b r / > are all single byte char?
Since str_replace is binary-safe and UTF-8 is a bijective encoding, you can use str_replace, even if search string or replacement contains multi-byte characters, as long as all three parameters are encoded as UTF-8.
That's why there isn't an mb_str_replace function in the first place.
If your encoding is not bijective - i.e. there are multiple representations of the same string, for example < in UTF-7, which can be expressed both as '+ADw-' and '<', you should convert all strings to the same (bijective) encoding, apply str_replace, and then convert the strings to the target encoding.
Reference for manipulating UTF-8 strings safely in PHP (archive). There is no hard-and-fast rule. Some native PHP string functions functions can operate safely on utf-8, some can with care, and some cannot.
There is no mb_str_replace(). Notice the section "UTF-8 Safe Functionality": explode() and str_replace() are safe as long as all three arguments to it are valid UTF-8 strings.
i wanna convert to original string of “Cool†..Origingal string is cool . (' is backquote)
It seems that you just forgot to specify the character encoding properly.
Because “ is what you get when the character “ (U+201C) encoded in UTF-8 (0xE2809C) is interpreted with a single-byte character encoding like Windows-1252 (default character encoding in some browsers) where 0xE2, 0x80, and 0x9C represent the characters â, €, and œ respectively.
So just make sure to specify your character encoding properly. Or if you actually want to use Windows-1252 as your output character encoding, you can convert your UTF-8 data with mb_convert_encoding, iconv or similar functions.
There's a wide variety of character encoding functions in PHP, especially if you have access to the multibyte string functions. (mb_string is thankfully enabled on most PHP installs.)
What you need to do is convert the encoding of the original string to the encoding you require, but as I don't know what encoding has been used/is required all I can suggest is that you could try using the mb_convert_encoding function, possibly after using mb_detect_encoding on the original string.
Incidentally, I'd highly recommend attempting to keep all data in UTF-8, (text files, HTML encoding, database connections/data, etc.) as you'll make your life a lot easier this way.