Using PHP 5.6.11, I have a block of HTML that is UTF-8 encoded; the multibyte characters are embedded directly in the text.
Here is one sample of a string:
"You haven’t added"
Viewed with hexdump (note the e2 80 99) on a UTF-8 console (Linux):
00000000 59 6f 75 20 68 61 76 65 6e e2 80 99 74 20 61 64 |You haven...t ad|
Here it is as html entities:
"You haven’t added"
All this is OK. However, when I load it into a DOMDocument, it comes out mangled again (shown here as HTML entities).
"You haven’t added"
Here is the code to generate this snippet.
$text="<html><body>You haven’t added anything.<br></body></html>";
echo mb_detect_encoding($text)."\n";
$text2= substr($text,strpos($text,"You haven"),20);
echo $text2."\n";
echo htmlentities($text2);
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML($text);
$text2 = $doc->saveHTML();
$text2= substr($text2,strpos($text2,"You haven"),35);
echo "\n".htmlentities($text2)."\n";
The output of this is:
UTF-8
You haven’t added
You haven&rsquo;t added
You havenâ€™t added
I have tried a variety of ideas, but I can't seem to keep DOMDocument from mangling either the HTML or the multibyte characters. Any suggestions?
Edit: If I insert a meta tag it works more as expected.
$text='<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head><body>You haven’t added anything.<br></body></html>';
Output:
UTF-8
You haven’t added
You haven&rsquo;t added
You haven&rsquo;t added anything.<br></
Edit 2:
Inserting the meta tag with charset=utf-8 works fine, and so does:
$doc->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'));
Either one fixes the encoding. I still can't figure out what DOMDocument is doing with the encoding; I tried that line at least three times earlier and it wasn't working. Perhaps a little time away from the keyboard was needed, because it seems to be working now. I'll update this if a problem turns up once I test it on bigger datasets.
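For reference, here is the entity-conversion workaround as a minimal self-contained sketch, using the same input string as above (mb_convert_encoding() with 'HTML-ENTITIES' exists in PHP 5.6; it was only deprecated much later, in 8.2):

```php
<?php
// Input HTML containing a multibyte character (’ = E2 80 99 in UTF-8).
$text = "<html><body>You haven’t added anything.<br></body></html>";

// Without a charset hint (meta tag or XML prolog), loadHTML() does not
// assume UTF-8 and mangles the multibyte sequence. Converting everything
// above ASCII to entities first sidesteps the guessing entirely.
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'));

// The text content comes back intact.
echo $doc->getElementsByTagName('body')->item(0)->textContent, "\n";
```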
Related
To prepare some HTML contenteditable content for download as a plain text file, I do the following:
Send the contenteditable HTML, which contains other HTML elements, via Ajax to a server-side script, prepareDownload.php.
There I create a new DOMDocument : $doc = new DOMDocument();
Then I do : $doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"]);
Then I look for the text content of certain elements and assemble it into $plainText.
Finally I write $plainText to disk with : file_put_contents($txtFile, $plainText, LOCK_EX);
So far it works … but when I open the text file, special characters like the German Ä are a mess.
To find out where the problem might be introduced, I placed some print_r() calls at several stages in the PHP script and looked in the browser's console at what comes back.
Everything is perfect right up to the point where I write $plainText to disk with file_put_contents(). Looking into the stored text file, though, the characters are a mess.
Now I assume that file_put_contents() misinterprets the given charset. But how do I tell file_put_contents() that it should interpret (not re-encode) the data as UTF-8?
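For what it's worth, file_put_contents() can be ruled out directly, since it copies bytes verbatim. A small sketch (the filename is a placeholder):

```php
<?php
// file_put_contents() has no notion of a character set: it writes the
// string's bytes to disk unchanged. If $plainText is valid UTF-8, the
// file will be valid UTF-8 too, so any mangling happens earlier (e.g.
// DOMDocument guessing the wrong input encoding) or later (the editor
// opening the file with the wrong charset).
$plainText = "erzählt"; // ä = C3 A4 in UTF-8 (this source file is UTF-8)
file_put_contents('out.txt', $plainText, LOCK_EX);

// The bytes on disk are exactly the bytes of the string.
var_dump(bin2hex($plainText) === bin2hex(file_get_contents('out.txt'))); // bool(true)
```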
EDIT:
As a test to find out more, I replaced the explicit statement:
$doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"])
with
$doc->loadHTML($_POST["data"])
The character ä in the file still looks weird, but different. The hexdump now looks like this:
0220: 20 76 69 65 6C 2C 20 65 72 7A C3 A4 68 6C 74 20 viel, erz..hlt
Now the ä shows up as two dots (two bytes), hex C3 A4. What kind of encoding is this?
I have the following test case for the PHP function filter_var():
<?php
$inputvalue = "Ž"; // NUM = 142 on the ASCII extended list
$sanitized = filter_var($inputvalue, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH);
echo 'The sanitized output: '.$sanitized."\n"; // --> &#197;&#189; (Å½)
?>
If you run the above snippet, the output is not what I expect to be returned. Ž is number 142 on the ASCII extended list (see: ascii-code[dot]com), so what I expect to get back is the string '&#142;'.
I had help finding out what is going wrong; I just don't know how to solve it yet.
If you convert 'Ž' to its UTF-8 bytes you get: C5 BD. Those hex values correspond to the ISO-8859-1 characters Å and ½ (see: http://cs.stanford.edu/~miles/iso8859.html). These 2 characters then get encoded by filter_var to '&#197;&#189;'.
See this online converter: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C5%BD&mode=char
So basically what happens: the UTF-8 bytes are being interpreted as Latin-1 characters. The converter page says the following: "UTF-8 bytes as Latin-1 characters is what you typically see when you display a UTF-8 file with a terminal or editor that only knows about 8-bit characters."
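That byte-wise behaviour is easy to reproduce (this uses FILTER_SANITIZE_STRING, which exists in PHP 5.x/7.x; it was deprecated in 8.1):

```php
<?php
// filter_var() operates on bytes, not characters. FILTER_FLAG_ENCODE_HIGH
// encodes every byte >= 0x80 on its own, so the two UTF-8 bytes of 'Ž'
// (C5 BD) come back as two numeric entities instead of one.
$inputvalue = "Ž"; // U+017D, UTF-8 bytes C5 BD

echo bin2hex($inputvalue), "\n"; // c5bd

$sanitized = filter_var($inputvalue, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH);
echo $sanitized, "\n"; // &#197;&#189;  (0xC5 = 197, 0xBD = 189)
```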
I don't think my editor is the problem. I am using a Mac with Coda 2 (UTF-8 as default). The test has also been run on an HTML5 page with the meta character set set to UTF-8. Furthermore, I am using a default XAMPP localhost server. With Firebug in Firefox I also checked that the file was served as UTF-8 (it is).
Does anyone have an idea how I can solve this encoding problem?
I am going to drop this because I am not finding any solution. The mail() function is also not safe, so I am going to use either PHPMailer or Swift Mailer (and I am leaning towards the latter).
I have a PHP web application which is originally in Polish, but I was asked to localize it into Russian. I've decided to use gettext, but I have a problem when trying to translate strings containing Polish special characters. For example:
echo gettext('Urządzenie');
This displays "Urządzenie" in the web browser instead of the word in Russian.
All files are encoded in UTF-8 and the .po file was generated with --from-code=utf-8. Translations without Polish special characters, such as
echo gettext('Instrukcja');
work well. Do you know what could be the reason for this strange behaviour?
Are you sure the PHP file is in UTF-8 format? To verify, try this:
echo bin2hex('Urządzenie');
You should see the following bytes:
55 72 7a c4 85 64 7a 65 6e 69 65
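If the bytes check out, another thing worth verifying (an assumption on my part, not something confirmed in the question) is that gettext itself is pinned to UTF-8; the domain name and path below are placeholders:

```php
<?php
// If the locale's default codeset is not UTF-8, gettext may transliterate
// or fail to match msgids containing non-ASCII bytes. Forcing the codeset
// rules that out. 'messages' and ./locale are placeholder names.
setlocale(LC_ALL, 'ru_RU.utf8');
bindtextdomain('messages', __DIR__ . '/locale');
bind_textdomain_codeset('messages', 'UTF-8');
textdomain('messages');

echo gettext('Urządzenie');
```

This is environment configuration rather than a standalone test: it needs the ru_RU.utf8 locale installed and a compiled .mo catalog under ./locale to do anything.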
If I type å in CMD, fgets stops waiting for more input and the loop runs until I press Ctrl-C. If I type normal characters like a-z0-9!?(), it works as expected.
I run the code in CMD under Windows 7 with UTF-8 as the charset (chcp 65001); the file is saved as UTF-8 without a BOM. I use PHP 5.3.5 (CLI).
<?php
echo "ÅÄÖåäö work here.\n";
while(1)
{
echo '> '. fgets(STDIN);
}
?>
If I change the charset to chcp 1252, the loop doesn't break when I type å and it prints "> å", but "ÅÄÖåäö work here" becomes "Ã…Ã„Ã–Ã¥Ã¤Ã¶ work here". And I know that I can change the file to ANSI, but then I can't use special characters like ╠╦╗.
So why does fgets stop waiting for user input after I have typed åäö?
And how can I fix this?
EDIT:
Also found a strange bug.
echo "öäåÅÄÖåäö work here! Or?".chr(10); -> ��äåÅÄÖåäö work here! Or? re! Or?.
If the first character in the echo is å/ä/ö, it prints strange characters AND the end of the output is duplicated with n - 1 characters (n = number of å/ä/ö at the beginning of the string).
E.g. echo "åäö 1234" -> ??äö 123434, and echo "åäöåäö 1234" -> ??äöåäö 1234 1234.
EDIT2 (solved):
The problem was chcp 65001; now I use chcp 437 instead.
Big thanks to Timothy Martens!
Possible solution:
echo '>';
$line = stream_get_line(STDIN, 999999, PHP_EOL);
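If it helps, here is that suggestion wrapped into a complete loop (a sketch; the 999999 length limit is arbitrary):

```php
<?php
// The workaround above, in a complete loop: stream_get_line() reads up to
// the delimiter and sidesteps the fgets()/chcp 65001 behaviour described
// in the question.
while (true) {
    echo '> ';
    $line = stream_get_line(STDIN, 999999, PHP_EOL);
    if ($line === false) {
        break; // end of input (e.g. Ctrl+Z then Enter on Windows)
    }
    echo $line, "\n";
}
```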
Notes:
I was unable to reproduce your error using multiple versions of PHP.
Using the following PHP version 5.3.8 gave me no issues
PHP 5.3 (5.3.8)
VC9 x86 Non Thread Safe (2011-Aug-23 12:26:18)
Architecture is Win XP SP3, 32-bit
You might try upgrading PHP.
I downloaded php-5.3.5-nts-Win32-VC6-x86 and was not able to reproduce your error; it works fine for me.
Edit: Additionally, I typed the characters using my Spanish keyboard.
Edit2:
CMD Command:
chcp 437
PHP Code:
<?php
$fp = fopen("php://stdin", "r");
while (1) {
    $str = fgets(STDIN);
    echo mb_detect_encoding($str)."\n";
    echo '>'.stream_get_line($fp, 999999, "\n")."\n";
}
?>
Output:
test
ASCII
test
>test
öïü
öïü
>öïü
I think this happens because PHP 5.3 does not properly support multibyte characters.
These chars: ÅÄÖåäö
are, in binary: c3 85 c3 84 c3 96 c3 a5 c3 a4 c3 b6 (without a BOM at the beginning).
Citing PHP String:
A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support. See details of the string type.
This normally does not affect the final result, because the browser/reader understands multibyte characters, but for CMD and the STDIN buffer, ÅÄÖåäö is a 12-byte array (which PHP treats as 12 chars).
Only the mb_* functions handle basic operations on multibyte strings.
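The byte-vs-character distinction is easy to see directly:

```php
<?php
// PHP (5.x) strings are byte arrays: strlen() counts bytes, while
// mb_strlen() counts characters. The six characters below occupy
// twelve bytes in UTF-8, which is what CMD and the STDIN buffer see.
$s = "ÅÄÖåäö";

echo strlen($s), "\n";             // 12 (bytes)
echo mb_strlen($s, 'UTF-8'), "\n"; // 6  (characters)
```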
Each line is a string:
Â&nbsp;4
Â&nbsp;minutes
Â&nbsp;12
Â&nbsp;minutes
Â&nbsp;16
Â&nbsp;minutes
I was able to remove the Â successfully using str_replace, but not the HTML entity. I found this question: How to remove html special chars?
But the preg_replace there did not do the job. How can I remove the HTML entity and that Â?
Edit:
I think I should have said this earlier: I am using DOMDocument::loadHTML() and DOMXpath.
Edit:
Since this seems like an encoding issue, I should say that these are actually all separate strings.
Alright, I think I've got a handle on this now, so I want to expand on some of the encoding errors that people are getting at:
This seems to be an advanced case of Mojibake, but here is what I think is going on. MikeAinOz's original suspicion that this is UTF-8 data is probably true. If we take the following UTF-8 data:
4&nbsp;minutes
Now, remove the HTML entity and replace it with the character it actually corresponds to: U+00A0. (It's a non-breaking space, so I can't exactly "show" it to you.) You get the string "4 minutes". Encode this as UTF-8, and you get the following byte sequence:
characters: 4 [nbsp] m i n ...
bytes : 34 C2 A0 6D 69 6E ...
(I'm using [nbsp] above to mean a literal non-breaking space: the character itself, not the HTML entity &nbsp;. It's just white-space, and thus difficult to show.) Note that the [nbsp]/U+00A0 non-breaking space takes 2 bytes to encode in UTF-8.
Now, to go from the byte stream back to readable text, we should decode using UTF-8, since that's what we encoded with. But let us (wrongly) use ISO-8859-1 ("latin1") instead; if you use the wrong one, this is almost always it:
bytes : 34 C2 A0 6D 69 6E ...
characters: 4 Â [nbsp] m i n ...
And switch the raw non-breaking space into its HTML entity representation, and you get what you have.
So, either your PHP code is interpreting your text in the wrong character set and you need to tell it otherwise, or you are outputting the result in the wrong character set somewhere. More code would be useful here: where are you getting the data you're passing to loadHTML, and how are you going about getting the output you're seeing?
Some background: a "character encoding" is just a means of going from a series of characters to a series of bytes. What bytes represent "é"? UTF-8 says C3 A9, whereas ISO-8859-1 says E9. To get the original text back from a series of bytes, we must know what we encoded it with. If we decode C3 A9 as UTF-8 data, we get "é" back; if we (mistakenly) decode it as ISO-8859-1, we get "Ã©". Junk. In pseudo-code:
utf8-decode ( utf8-encode ( text-data ) ) // OK
iso8859_1-decode ( iso8859_1-encode ( text-data ) ) // OK
iso8859_1-decode ( utf8-encode ( text-data ) ) // Fails
utf8-decode ( iso8859_1-encode ( text-data ) ) // Fails
This isn't PHP code, and isn't your fix... it's just the crux of the problem. Somewhere, over the large scale, that's happening, and things are confused.
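To make that concrete for the original question, here is one possible clean-up (a sketch, assuming the input really is the entity plus the stray "Â" described above; the real fix is still to decode with the correct charset):

```php
<?php
// Sample input: an entity plus the stray "Â" produced by reading UTF-8
// bytes as Latin-1 (assumed shape; adjust to your actual data).
$s = "Â&nbsp;4 minutes";

// 1. Decode entities into real characters, naming the target charset:
//    &nbsp; becomes U+00A0 (bytes C2 A0 in UTF-8).
$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');

// 2. Replace the non-breaking space (C2 A0) and the stray Â (U+00C2,
//    bytes C3 82) with plain spaces, then collapse the whitespace.
$s = str_replace(["\xC2\xA0", "\xC3\x82"], ' ', $s);
$s = trim(preg_replace('/\s+/', ' ', $s));

echo $s, "\n"; // "4 minutes"
```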
This looks like an encoding error: your document is encoded in UTF-8, but is being decoded as ISO-8859-1. Solving your encoding mismatch will solve your issues. You could try using utf8_decode() on your source before calling DOMDocument::loadHTML().
Here's an alternative solution from the DOMdocument::loadHTML() documentation page.