I have a lot of images that were imported from an SQL dump with UTF-8 encoding. As a result, instead of "FF D8 FF E0" I see "C3 BF C3 98 C3 BF C3 A0" at the beginning of JPEG images.
I've tried iconv('utf-8', 'iso-8859-1', $data), but it does not convert the whole file (there are characters in the UTF-8 data which cannot be converted to ISO-8859-1).
How can I simply convert the UTF-8 data back to one-byte binary, regardless of the encoding?
The problem was that UTF-8 allows several representations of the same character, the so-called "non-shortest" forms. Those characters can be converted mathematically, but iconv treats them as erroneous and does not convert them.
I've written a short function which converts any UTF-8 text to an array of Unicode (UTF-16) code points, and then remaps some values that do not fit in one byte to single-byte values using a simple table (for example, 0x20ac maps to 0x80, etc.). You can find the complete code and remapping table here: Converting UTF-8 with non-shortest characters to one-byte encoding
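For illustration, here is a minimal sketch of that idea. It is not the code from the linked post: the function names are made up, and the remap table is only a short excerpt of the Windows-1252 mapping (the assumption being that the dump treated the original bytes as Windows-1252).

```php
<?php
// Sketch only: names and the partial remap table are illustrative assumptions.

// Excerpt of Windows-1252: code point => original byte value.
const CP1252_REMAP = [0x20AC => 0x80, 0x201A => 0x82, 0x201C => 0x93, 0x201D => 0x94];

/** Decode UTF-8 arithmetically, accepting non-shortest (overlong) forms. */
function utf8_to_codepoints(string $s): array
{
    $out = [];
    $len = strlen($s);
    for ($i = 0; $i < $len; $i += $extra + 1) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            [$cp, $extra] = [$b, 0];
        } elseif (($b & 0xE0) === 0xC0) {
            [$cp, $extra] = [$b & 0x1F, 1];
        } elseif (($b & 0xF0) === 0xE0) {
            [$cp, $extra] = [$b & 0x0F, 2];
        } elseif (($b & 0xF8) === 0xF0) {
            [$cp, $extra] = [$b & 0x07, 3];
        } else {
            throw new InvalidArgumentException("Invalid lead byte at offset $i");
        }
        if ($i + $extra >= $len) {
            throw new InvalidArgumentException("Truncated sequence at offset $i");
        }
        for ($k = 1; $k <= $extra; $k++) {
            $c = ord($s[$i + $k]);
            if (($c & 0xC0) !== 0x80) {
                throw new InvalidArgumentException("Invalid continuation byte after offset $i");
            }
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        $out[] = $cp;
    }
    return $out;
}

/** Turn code points back into single bytes, remapping via the table. */
function codepoints_to_bytes(array $cps): string
{
    $bytes = '';
    foreach ($cps as $cp) {
        $cp = CP1252_REMAP[$cp] ?? $cp;
        if ($cp > 0xFF) {
            throw new RangeException(sprintf('No single-byte value for U+%04X', $cp));
        }
        $bytes .= chr($cp);
    }
    return $bytes;
}

// The mangled JPEG header from the question.
$mangled = "\xC3\xBF\xC3\x98\xC3\xBF\xC3\xA0";
echo bin2hex(codepoints_to_bytes(utf8_to_codepoints($mangled))), "\n"; // ffd8ffe0
```

Code points that are neither below 0x100 nor in the table raise an exception instead of being guessed, so a real conversion would need the complete table.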
I'm trying to decode a text that contains extended ASCII characters but when I try to convert the character I get the wrong value. Like this:
echo "“<br>";
echo ord("“")."<br>";
echo chr(ord("“"))."<br>";
And this is my output:
“
226
�
The ASCII value of the character "“" is 147, not 226. And instead of the � symbol, I want to get "“" character back.
I'm using UTF-8
<meta charset="utf-8">
I have tried changing to different charsets but it didn't work.
1st: U+201C Left Double Quotation Mark is the UTF-8 byte sequence E2 80 9C (hexadecimal), i.e. decimal 226 128 156.
2nd: ord: Convert the first byte of a string to a value between 0 and 255.
Result: ord("“") returns 226…
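A small sketch to check this byte by byte. It uses only core functions; the "\u{201C}" escape (PHP 7+) is used so the result does not depend on how the script file itself is saved, and the variable names are just for illustration.

```php
<?php
// U+201C written as an escape, so the file encoding does not matter.
$quote = "\u{201C}";

$hex   = bin2hex($quote);                    // all bytes as hex
$bytes = array_map('ord', str_split($quote)); // each byte as a decimal value
$first = ord($quote);                         // ord() only sees the first byte

echo $hex, "\n";                 // e2809c
echo implode(' ', $bytes), "\n"; // 226 128 156
echo $first, "\n";               // 226
```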
Instead of ord and chr pair, use mb_ord and its complement mb_chr, e.g. as follows:
<?php
echo "“<br>";
echo mb_ord("“")."<br>";
echo mb_chr(mb_ord("“"))."<br>";
?>
Result: .\SO\74045685.php
“8220“
Edit: you can get the Windows-1251 code (147) for the character “ (U+201C, Left Double Quotation Mark) as follows:
echo ord(mb_convert_encoding("“","Windows-1251","UTF-8")); //147
You're incorrect about the “ character: 147 is not its ASCII value. Code point 147 (U+0093) is a control character, and its UTF-8 encoding is two bytes: c293.
See: SET TRANSMIT STATE.
In the manual for ord() it says:
However, note that this function is not aware of any string encoding,
and in particular will never identify a Unicode code point in a
multi-byte encoding such as UTF-8 or UTF-16.
On top of this, if I actually convert the '“' character to hexadecimal, I get: e2809c. So it's a triplet. Never trust what you read online. 😏
See: https://3v4l.org/57UV8
There is no ASCII representation for “. As has already been said, it is multibyte, UTF-8 to be precise:
echo mb_detect_encoding("“"); // UTF-8
ord() and chr() don't support this; they only look at the first of the up to four bytes needed for a particular character. Fortunately, there are functions that do:
echo "“\n"; // “
echo mb_ord("“")."\n"; // 8220
echo mb_chr(mb_ord("“")); // “
But why do you need to transform it back and forth? It seems you already have the character in your code :), not as a value but as the actual visual representation.
Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested to know about strlen(), not other functions
This is the string:
$1ï¿½2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.
How about using mb_strlen()?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen, it's possible to configure PHP by setting the mbstring.func_overload directive to 2, so that calls to strlen in your scripts are automatically replaced with mb_strlen. Note that this directive was deprecated in PHP 7.2 and removed in PHP 8.0.
The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, inverted question mark, one half fraction, digit two)
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
However, if we were to store that string as ISO 8859-1 or CP1252 we would have a six-byte sequence that would be legal as UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "ï¿½".
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation: first by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
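The byte counts above can be checked with a short sketch. It assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// U+FFFD (replacement character) encoded as UTF-8 is the three bytes EF BF BD.
$replacement = "\xEF\xBF\xBD";
$asUtf8 = '$1' . $replacement . '2';

$byteLen = strlen($asUtf8);             // strlen() counts bytes
$charLen = mb_strlen($asUtf8, 'UTF-8'); // mb_strlen() counts characters

// Misread the same six bytes as ISO-8859-1 and re-encode the result as UTF-8.
$mojibake      = mb_convert_encoding($asUtf8, 'UTF-8', 'ISO-8859-1');
$mojibakeBytes = strlen($mojibake);
$mojibakeChars = mb_strlen($mojibake, 'UTF-8');

echo "$byteLen $charLen $mojibakeBytes $mojibakeChars\n"; // 6 4 9 6
```

So the same data gives 6 bytes and 4 characters when read as UTF-8, and 6 characters (9 bytes once re-encoded as UTF-8) when misread as ISO-8859-1.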
You need to use the multibyte string function mb_strlen(), like:
mb_strlen($string, 'UTF-8');
It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.
The sequence ï¿½ is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result in latin1. This character is used as a replacement for byte sequences that don't encode any character when reading text from a file, for example. What has happened is likely this:
The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)
The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
Then some third program comes along that reads the file in latin1, and shows $1ï¿½2.
No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF8 encoding needs at least 1 byte per character.
We have established that:
there are 4 bytes
a character is represented by no less than 1 byte
...yet, we have 6 characters....which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software (e.g. the web browser) is using to interpret the string. It could use some uncommon encoding scheme where a character can be represented by fewer than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So, the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.
Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.
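If mbstring is not available, the difference between bytes and characters can still be seen with core functions only. The sketch below counts characters by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The function name is made up, and it assumes the input is valid UTF-8.

```php
<?php
/** Count UTF-8 characters by counting every byte that is not a continuation byte. */
function utf8_char_count(string $s): int
{
    $count = 0;
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a character.
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$price = "\u{20AC}10"; // euro sign followed by "10"
echo strlen($price), ' ', utf8_char_count($price), "\n"; // 5 3
```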
I have a system whose HTML encoding was previously set to ISO-8859-1, which caused all Chinese characters to be stored in the format "&#36830;&#34915;&#35033;".
So my question is, how can I convert the format above back into Chinese characters in UTF-8?
For your information, I have tried utf8_decode and iconv, but neither of them worked. :(
Thank you very much.
The current text encoding of that string is rather insubstantial. What you have there are HTML entities; they have little to do with the underlying "physical" encoding like ISO-8859 or UTF-8. What you want is to decode those HTML entities into a byte representation of the characters in a specific encoding, in this case to UTF-8. Therefore:
echo html_entity_decode('&#36830;&#34915;&#35033;', ENT_COMPAT, 'UTF-8');
// 连衣裙
You need to use:
utf8_encode($data);
and not decode, to convert your current ISO-8859-1 to UTF-8.
Some native PHP functions such as strtolower(), strtoupper() and ucfirst() do not always function correctly with UTF-8 strings. Possible solutions: convert to latin first or add the following line to your code:
setlocale(LC_CTYPE, 'C');
Make sure not to save your PHP files using a BOM (Byte-Order Marker) UTF-8 file marker (your browser might show these BOM characters between PHP pages on your site).
Just for your reference:
ISO-8859-1 => Albanian, Brazilian, Catalan, Danish, Dutch, English, Finnish, French, German, Portuguese, Norwegian, Spanish, Swedish
UTF-8 => Chinese (simplified), Chinese (traditional), Japanese, Persian
There are many tools that can convert character references to characters, and writing such a tool is rather straightforward, especially if you know the references are all decimal. So the answer really depends on the software environment.
For example, to do such a conversion for an individual HTML document, you could use the BabelPad editor: command Convert → Numeric Character References (NCR) → NCR to Unicode, and save the result as UTF-8.
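In PHP, such a tool for decimal-only references might look like the sketch below. The function name is hypothetical, it needs PHP 7.4+ with mbstring (for the arrow function and mb_chr), and it deliberately leaves hexadecimal and named references untouched.

```php
<?php
/** Replace decimal numeric character references (&#NNNN;) with UTF-8 characters. */
function decode_decimal_ncr(string $html): string
{
    return preg_replace_callback(
        '/&#(\d+);/',
        // mb_chr() returns false for invalid code points; keep the reference as-is then.
        fn(array $m) => mb_chr((int) $m[1], 'UTF-8') ?: $m[0],
        $html
    );
}

$decoded = decode_decimal_ncr('&#36830;&#34915;&#35033;');
echo bin2hex($decoded), "\n"; // e8bf9ee8a1a3e8a399
```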
I have a database which uses latin-1 and a PHP application which is utf-8.
I have strings in the database like this:
'SociÃ©tÃ©' which should be Société
'â‚¬1bn' which should be €1bn.
When I print the faulty characters to screen with PHP's ord(), from the returning data in the db, it prints 195 and 226.
Could somebody explain why this is happening (why saving like this and why characters being read as they are) and if I can reverse it.
The WHY:
1) é is Unicode code point 233 (as the browser reads it).
The UTF-8 bytes of é, read as Latin-1 characters, are Ã ©. This is why it appears like this in the database.
Ã © is recognised as Ã, which is code point 195. Hence why you see that.
2) € is Unicode code point 8364.
The UTF-8 bytes of €, read as Latin-1 characters, are â <82> ¬. Again, this is why they appear like this in the db.
â <82> ¬ is recognised as â, which is code point 226. Again, this is why you see this.
That is why you see those values from ord() and why the characters are stored in that manner in a latin-1 database.
Reverse:
To reverse it we need to go from Latin-1 character bytes to UTF-8 bytes.
If we try it:
â is 226. Converted from Latin-1 to UTF-8 it produces â.
Ã is 195. Converted from Latin-1 to UTF-8 it produces Ã.
Problem:
The problem is that Latin-1 has far fewer characters than UTF-8.
Latin-1 is a single-byte stream and UTF-8 is a multi-byte character stream, so one character in UTF-8 can produce up to 4 characters in Latin-1.
So the UTF-8 to Latin-1 conversion produces faulty characters.
Converting Latin-1 back to UTF-8 this way does not restore the original characters.
Solution:
If you are unable to change the character set of your database, I would suggest encoding special characters as character entities before writing them to the database (so the db can stay as latin1 and the app as utf8, as both can understand HTML entities), e.g. an umlaut as &Auml;.
It could be done using PHP's html_entity_decode() combined with mb_detect_encoding() to detect and convert specific characters.
References:
See ltg.ed.ac.uk for the utf8 char bytes to latin1 bytes:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%96&mode=char
These are strings in UTF-8 but displayed as if they were latin1. In UTF-8, é is encoded with two bytes and € with three, which is why you see two or three characters when the string is interpreted as latin1. So what you are doing is storing UTF-8 data in a table that was not declared as UTF-8. You should change the encoding of the database* and the connection**; then you will get a consistent presentation of your data.
*) for example see here: https://stackoverflow.com/a/6184788/664108 (case 2)
**) SET NAMES 'utf8' in SQL
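To see the effect outside the database, here is a small sketch that treats the UTF-8 bytes of é and € as Latin-1 and re-encodes them, which is roughly what a mislabelled column or connection does. It assumes mbstring is available; the variable names are illustrative only.

```php
<?php
$eAcute = "\xC3\xA9";     // UTF-8 bytes of é (U+00E9)
$euro   = "\xE2\x82\xAC"; // UTF-8 bytes of € (U+20AC)

// Treat every UTF-8 byte as one Latin-1 character and re-encode as UTF-8.
$eMojibake    = mb_convert_encoding($eAcute, 'UTF-8', 'ISO-8859-1');
$euroMojibake = mb_convert_encoding($euro, 'UTF-8', 'ISO-8859-1');

// First byte of each original sequence, as ord() reports it.
echo ord($eAcute), ' ', ord($euro), "\n"; // 195 226

// One character in, two or three characters out.
echo mb_strlen($eMojibake, 'UTF-8'), ' ', mb_strlen($euroMojibake, 'UTF-8'), "\n"; // 2 3
```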
I have a web application, written in PHP, based on UTF-8 (both PHP and MySQL are on UTF-8). Everything is beautiful - no problem with special characters.
However, I had to build an export to XML with encoding ISO-8859-2 (Polish), so I picked DomDocument because it has built in encoding conversion.
But when I sent the XML to my partner for validation, he said that one of the tags had too many characters. That was strange, because it contained exactly the maximum number of characters allowed. Then I opened the file in a hex editor and saw that every special character takes two bytes.
I have tried to convert the result with iconv and mb_convert_encoding.
Iconv says:
iconv() [<a href='function.iconv'>function.iconv</a>]: Detected an illegal character in input string in file application/controllers/report/export.php at 169
mb_convert_encoding is simply deleting all special characters and result is encoded in ASCII.
Is there a way to convert the output of DomDocument to one-byte characters?
Thanks in advance!
One problem when switching between encodings is that, even with transliteration, not all characters are representable in other encodings in a single byte.
For example, consider the EURO SIGN, a character that takes 3 bytes when encoded in UTF-8. If you look at the charset support page, you can see that ISO-8859-2 is not listed.
Since there is no single character to represent the euro sign, transliteration does its best to still represent it in the output:
echo iconv( 'UTF-8', 'ISO-8859-2//TRANSLIT', '€' ); // EUR
In this example, we still end up with 3 bytes to represent the euro sign after transliterating.
EDIT
P.S. The NOTICE level error you're getting is because you executed iconv() without the transliteration flag. And as I highlighted above, the EURO SIGN doesn't exist in ISO-8859-2, so you clearly have at least one character in your data that also doesn't exist in ISO-8859-2, so you'll have to use transliteration. Just know that it doesn't guarantee that you'll get down to 1 byte/char.
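One way to find out in advance which characters will not fit is a round trip through the target encoding. This is only a sketch: the function name is made up, it uses mbstring rather than iconv (whose behaviour on unconvertible characters varies by platform), and mb_str_split() needs PHP 7.4+.

```php
<?php
/** Return the distinct characters of $utf8 that do not survive a round trip through $target. */
function unrepresentable_chars(string $utf8, string $target): array
{
    $bad = [];
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $ch) {
        $encoded   = mb_convert_encoding($ch, $target, 'UTF-8');
        $roundTrip = mb_convert_encoding($encoded, 'UTF-8', $target);
        // A character missing from $target comes back as a substitute (or not at all).
        if ($roundTrip !== $ch && !in_array($ch, $bad, true)) {
            $bad[] = $ch;
        }
    }
    return $bad;
}

// "złoty €": ł exists in ISO-8859-2, the euro sign does not.
$missing = unrepresentable_chars("z\u{0142}oty \u{20AC}", 'ISO-8859-2');
echo count($missing), "\n"; // 1
```

Characters reported here are the ones that would need transliteration or some other replacement before the export.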