DomDocument and special characters written in two bytes - php

I have a web application, written in PHP, based on UTF-8 (both PHP and MySQL are on UTF-8). Everything is beautiful - no problem with special characters.
However, I had to build an XML export with encoding ISO-8859-2 (Polish), so I picked DomDocument because it has built-in encoding conversion.
But when I sent the XML to my partner for validation, he said that one of the tags has too many characters. That was strange, because it was within the specified maximum number of characters. Then I opened the file in a hex editor and saw that every special character takes two bytes.
I tried to convert the result with iconv and mb_convert_encoding.
Iconv says:
iconv() [function.iconv]: Detected an illegal character in input string in file application/controllers/report/export.php at 169
mb_convert_encoding simply deletes all special characters, and the result is encoded in ASCII.
Is there a way to convert the output of DomDocument to one-byte characters?
Thanks in advance!

One problem when switching between encodings is that, even with transliteration, not all characters are representable in other encodings in a single byte.
For example, consider the EURO SIGN, a character that takes 3 bytes when encoded in UTF-8. If you look at the charset support page for that character, you can see that ISO-8859-2 is not listed.
Since ISO-8859-2 has no single character to represent the euro sign, transliteration does its best to still represent it in the output:
echo iconv( 'UTF-8', 'ISO-8859-2//TRANSLIT', '€' ); // EUR
In this example, we still end up with 3 bytes to represent the euro sign after transliterating.
EDIT
P.S. The NOTICE-level error you're getting is because you executed iconv() without the transliteration flag. As I highlighted above, the EURO SIGN doesn't exist in ISO-8859-2, so you clearly have at least one character in your data that doesn't exist in ISO-8859-2 either, and you'll have to use transliteration. Just know that it doesn't guarantee you'll get down to 1 byte per character.
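A minimal sketch of that approach, assuming the DOMDocument (here called $dom) was created as new DOMDocument('1.0', 'UTF-8') so that saveXML() returns a UTF-8 string:

$xml = $dom->saveXML();                                                 // UTF-8 serialization
$xml = iconv('UTF-8', 'ISO-8859-2//TRANSLIT', $xml);                    // transliterate what ISO-8859-2 can't hold (€ becomes EUR)
$xml = str_replace('encoding="UTF-8"', 'encoding="ISO-8859-2"', $xml);  // keep the declaration in sync with the bytes written
file_put_contents('export.xml', $xml);

As noted above, transliteration may still emit more than one byte for an original character, so the length constraint has to be re-checked after the conversion.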

Related

PHP html_entity_decode is not working for UTF-8 characters? [duplicate]

Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested to know about strlen(), not other functions
This is the string:
$1ï¿½2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and its answer (4) come from a mock ZCE test I bought on eBay.
How about using mb_strlen()?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen(), it's possible to configure your web server by setting the mbstring.func_overload directive to 2, so that calls to strlen() are automatically replaced with mb_strlen() in your scripts.
The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two).
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
However, if we were to store that string as ISO-8859-1 or CP1252, we would have a six-byte sequence that is also valid UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "ï¿½".
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation: first by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
You need to use the multibyte string function mb_strlen(), like:
mb_strlen($string, 'UTF-8');
It's likely that at some point between the preparation of the question and your reading of it, some process mangled the non-ASCII characters in it, and the question was originally about a string that does have 4 characters in it.
The sequence ï¿½ is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result as latin1. This character is used as a replacement for byte sequences that don't encode any character, for example when reading text from a file. What has likely happened is this:
The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)
The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
Then some third program comes along that reads the file as latin1 and shows $1ï¿½2.
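That chain can be reproduced with explicit byte strings; a sketch (the variable name is illustrative):

// A UTF-8 reader that hits the lone latin1 byte 0xA2 ("¢") substitutes U+FFFD
// and writes the text back out as UTF-8, i.e. these six bytes:
$mangled = "\x24\x31\xEF\xBF\xBD\x32";  // "$1" . U+FFFD . "2"
// A latin1 reader then sees six separate characters; re-encoding its view as
// UTF-8 shows what it would display:
echo mb_convert_encoding($mangled, 'UTF-8', 'ISO-8859-1'); // $1ï¿½2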
No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF8 encoding needs at least 1 byte per character.
We have established that:
there are 4 bytes
a character is represented by no less than 1 byte
...yet, we have 6 characters....which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software (e.g., the web browser) is using to interpret the string. It could use some uncommon encoding scheme where a character can be represented by fewer than 8 bits. If that were the case, then 4 bytes could display as 6 characters. So the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.
Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.
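A quick sketch of the difference, using the valid-UTF-8 interpretation discussed above:

$str = '$1' . "\xEF\xBF\xBD" . '2';   // "$1", then U+FFFD encoded as three UTF-8 bytes, then "2"
echo strlen($str);                    // 6  (counts bytes)
echo mb_strlen($str, 'UTF-8');        // 4  (counts characters)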

How to display the (extended) ASCII representation of a special character in PHP 5.6?

I am trying to decode this special character: "ß". If I use ord(), I get "C3":
echo "ord hex--> " . dechex(ord('ß'));
...but that doesn't look good, so I tried bin2hex(); now I get "C39F" (what?):
echo "bin2hex --> " . bin2hex('ß');
Using an Extended ASCII table from the Internet, I know that the correct hexadecimal value is "DF", so I then tried hex2bin(), but that gives me some unknown character like this: "�".
echo "hex2bin --> " . hex2bin('DF');
Is it possible to get the "DF" output?
You're on the right path with bin2hex, what you're confused about is merely the encoding. Currently you're seeing the hex value of ß for the UTF-8 encoding, because your string is encoded in UTF-8. What you want is the hex value for that string in some other encoding. Let's assume "Extended ASCII" refers to ISO-8859-1, as it colloquially often does (but doesn't have to):
echo bin2hex(iconv('UTF-8', 'ISO-8859-1', 'ß'));
Now, having said that, I have no idea what you'd use that information for. There are many valid "hex values" for the character ß in various different encodings; "Extended ASCII" is just one possible answer, and it's a vague answer to be sure, since "Extended ASCII" has very little practical meaning with hundreds of different "Extended ASCII" charsets available.
ASCII goes from 0x00 to 0x7F. This is not enough to represent all the characters needed so historically old Windows OSes used the available space in a byte (from 0x80 to 0xFF) to represent different characters depending on the localization. This is what codepages are: an arbitrary mapping of non-ASCII values to non-ASCII characters. What you call "extended ASCII" is IMO an inappropriate name for a codepage.
The assumption 1 byte - 1 character is dead and (if not) must die.
So what you are actually seeing is the UTF-8 representation of ß. If you want to see the Unicode code point value of ß (or any other character), just show its UTF-32 representation, which AFAIK maps 1:1 to code points.
// Prints 000000df
echo bin2hex(iconv('UTF-8', 'UTF-32BE', 'ß'));
bin2hex() should be fine, as long as you know what encoding you are using.
The C3 output you get appears to be the first byte of the two-byte representation of the character in UTF-8 (which incidentally means that you've configured your editor to save files in that encoding, which is a good idea in 2017).
The ord() function does not accept arbitrary encodings, let alone Unicode-compatible ones such as UTF-8:
Returns the ASCII value of the first character of string.
ASCII (a fairly small 7-bit charset) does not have any encoding for the ß character (aka U+00DF LATIN SMALL LETTER SHARP S). Seriously: ASCII does not even have a DF position (it only goes up to 7F).
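Pulling those pieces together, a short sketch (assuming the PHP source file itself is saved as UTF-8):

$char = 'ß';
echo bin2hex($char);                                // c39f      (the two UTF-8 bytes)
echo dechex(ord($char));                            // c3        (only the first UTF-8 byte)
echo bin2hex(iconv('UTF-8', 'ISO-8859-1', $char));  // df        (single-byte "extended ASCII" value)
echo bin2hex(iconv('UTF-8', 'UTF-32BE', $char));    // 000000df  (Unicode code point, zero-padded)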

Two byte character in a single byte character encoded (ISO-8859-1) HTML document

I learned that ISO-8859-1 is a single-byte charset.
See the page http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/malayalamContentView.do?tabId=11&programId=1073753760&BV_ID=###&contentId=15238737&contentType=EDITORIAL&articleType=Malayalam%20News. It is using the Malayalam language.
The HTTP header and the meta tag say that it is using ISO-8859-1 as the character encoding.
But this page uses a two-byte character, U+201A (http://unicodelookup.com/#%E2%80%9A).
(Copy the character and look it up at http://unicodelookup.com.)
<div id="articleTitleMal" style="padding-top:10px;">
<font face= "Manorama" >
¼ÈØOVA¢: ÜÍß‚Äí 1.28 ...
</font>
</div>
How is it possible to use a two-byte character in a single-byte encoding?
This is not mere curiosity: one of my tasks is stuck because I don't understand the above issue.
Update: They are using the font www.manoramaonline.com/portal/mmcss/Manorama.ttf, and I think some of the characters in the Manorama font use two bytes.
Update 2: I tried to convert the document from ISO-8859-1 to UTF-8 using the code below.
<?php
$t = file_get_contents('http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/malayalamContentView.do?tabId=11&programId=1073753760&BV_ID=###&contentId=15238737&contentType=EDITORIAL&articleType=Malayalam%20News');
// Change the charset info in meta-tag
$t = str_replace('ISO-8859-1', 'UTF-8', $t);
file_put_contents('t.html', utf8_encode($t));
This time, the character selected above is missing.
Even though the page is declared as ISO-8859-1 encoded in HTTP headers, browsers interpret it as Windows-1252 encoded. This is a longstanding tradition, now being formalized e.g. in the WHATWG Encoding Standard.
Thus, when the data contains the byte 82 (hex), it is not taken as a control character (as per ISO 8859-1) but as U+201A “‚” (as per Windows-1252).
However, the page uses font trickery that maps code positions to Malayalam characters according to a special internal, nonstandard encoding. (You can see this if you disable style sheets on the page. All texts become gibberish.) The page is not really meant to contain U+201A “‚” but the byte 82 to which a Malayalam character is assigned in the font.
So you need to preserve the byte as-is to get the same results. A conversion to UTF-8 would break this.
If you wanted to convert the data to Unicode, you would need to find out the internal encoding of the font being used and perform that mapping at the character level.
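A small sketch of what that browser behaviour means for the byte in question (0x82, taken from the markup above):

$byte = "\x82";
// Browsers decode "ISO-8859-1" content as Windows-1252, where 0x82 is U+201A:
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'Windows-1252')); // e2809a
// utf8_encode() instead treats its input as real ISO-8859-1 and maps 0x82 to the
// invisible control character U+0082, which is why the glyph vanished after conversion:
echo bin2hex(utf8_encode($byte));                                  // c282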

Bullet "•" in XML

Similar to this question
I am consuming an XML product that has some illegal chars in it. I seriously doubt I can get them to fix the problem, but I will try. In the meantime I'd like a work-around.
The problem is that it contains a bullet. It renders as "â€¢" in my source. I've tried a few encoding conversions but have not found a combination that works. (I'm not accustomed to even thinking about my encoding type, so I'm out of my element here.) So, I tried the code below, and it seems that str_replace does not recognize the "•" (it renders as a tall block in my text editor).
You can see the commented lines where I tried a few different things.
I tried str_replace on "•" first, then tweaked around, and this is my latest:
// deal with bullets in XML.
$bullet="•"; //this was copied and pasted from transliterated text.
//$data=iconv( "UTF-8", "windows-1252//TRANSLIT", $data ); //transliterate the text:
//$data=str_replace($bullet,'•',$data); // replace the bullet char
$data=str_replace($bullet,' - ',$data); // replace the bullet char
//$data=iconv( "windows-1252", "UTF-8", $data ); // return the text to utf-8 encoding.
Any ideas how to strip or replace this char? If there's a function to pre-clean the XML, that'd be great, and I wouldn't have to worry about it.
XML by definition has no illegal chars. If some string contains a character that is not part of XML, then that string is not XML by definition.
The character you're concerned about is part of Unicode. As XML is based on Unicode, this is good news. So let's name what you aim for:
Unicode Character 'BULLET' (U+2022)
So you now say it renders as â€¢. Because U+2022 is encoded as 0xE2 0x80 0xA2 in UTF-8, it is a more or less safe assumption that you have a UTF-8 encoded string (which is the default encoding used in XML, btw) but command the software that renders it to treat it as some single-byte encoding, hence turning the single code point into three different characters:
Unicode Character 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2)
Unicode Character 'EURO SIGN' (U+20AC)
Unicode Character 'CENT SIGN' (U+00A2)
Instead you need to command the rendering application to use the UTF-8 encoding. That should immediately solve your issue. So find the place where you introduce the wrong encoding; you will likely not need to re-encode anything, just to properly hint the encoding.
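If you nevertheless want to strip or replace the character in the data itself, match on its exact UTF-8 byte sequence; that sidesteps the copy-and-paste encoding mismatch in the str_replace() attempt above. A sketch, assuming $data is the UTF-8 encoded XML string:

$bullet = "\xE2\x80\xA2";                  // U+2022 BULLET encoded in UTF-8
$data = str_replace($bullet, ' - ', $data);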
If you wonder which single-byte character encodings have these three Unicode characters at the corresponding bytes (0xE2 0x80 0xA2), here is a list; the most popular of these is Windows 1252:
ISO-8859-15 (Latin 9)
OEM 858 (Multilingual Latin I + Euro)
Windows 1252 (Latin I)
Windows 1254 (Turkish)
Windows 1256 (Arabic)
Windows 1258 (Vietnam)

PHP Encoding Conversion to Windows-1252 whilst keeping UTF-8 Compatibility

I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass on those files to a controller (on which I don't have any influence), the files have to be Windows-1252 encoded. This controller then again generates a list of valid file(names) that are stored via MySQL into a database - therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to Windows-1252 (like with the character "ï"), the character is converted to something invalid in UTF-8. MySQL then drops those invalid characters; as a result, filenames on disk and filenames stored in the database no longer match. This conversion, which sometimes fails, is achieved with simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters from being generated by the conversion, I can then remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this completely removes / recodes any special characters left in the string. For example, I lose all "äöüÄÖÜ" etc., which are quite common in German.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!
I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding, but it does so from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). On my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding it to the function.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.
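A sketch of what that looks like; the candidate list here is only an example for this use case, and as noted, the single-byte candidates can rarely be told apart reliably:

var_dump(mb_detect_order());  // typically ["ASCII", "UTF-8"] unless configured otherwise

$candidates = array('UTF-8', 'Windows-1252', 'ISO-8859-15');
$sEncoding = mb_detect_encoding($sOriginalFilename, $candidates, true); // strict mode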
You can't have a string be Windows-1252 and UTF-8 at the same time. The character sets are identical for the first 128 characters (they contain e.g. the basic Latin alphabet), but when you go beyond that (like for umlauts), it's either one or the other: those characters are encoded as different byte sequences in UTF-8 than in Windows-1252.
Keep to ASCII in the filesystem. If you need to sustain characters outside ASCII in a filename, there are schemes you can use to represent Unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
Of course this will hit the file name limit pretty fast and is not very optimal.
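For reference, that percent-encoding round trip is built into PHP; a sketch (assuming the filename is UTF-8):

$name = 'äöüÄÖÜ.txt';
$safe = rawurlencode($name);   // %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
echo rawurldecode($safe);      // äöüÄÖÜ.txt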
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt
