I am downloading HTML files (raw HTML without any !DOCTYPE...) from a government website and then extracting paragraphs to put them into a MySQL database.
I am using DOMDocument, so I am going
$doc = new DOMDocument();
$doc->loadHTMLFile( "../notifs/notif$notif_no.htm" );
The problem is that certain characters get transformed into something strange: e.g. (one type of) apostrophe becomes â€™.
If I then try to save this paragraph to a text field in a table, either it is refused by MySQL or it is recorded as these strange characters, depending on the encoding of the text field.
Also, if I go $doc->saveHTMLFile( "test.htm" ); it actually prints out the strange characters, not the apostrophe.
I know this has something to do with encoding, but several days' googling and much looking at questions on SE have not led to the solution. Firefox tells me that the downloaded HTML files are in utf-8 encoding. I tried changing the php.ini file so the default_charset is "utf-8". No joy.
I am more an application programmer than a website person so I am quite new to encoding. I have tried cracking this one myself but just don't really understand what's going on or what to do.
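The strange characters are what the UTF-8 bytes of the apostrophe look like when some layer reads them as a single-byte encoding. Here is a minimal sketch of that effect, assuming the misreading is Windows-1252 (an assumption; nothing above says which encoding was actually applied):

```php
<?php
// U+2019 (right single quotation mark) is three bytes in UTF-8: E2 80 99.
$apostrophe = "\u{2019}";

// Treat those three bytes as Windows-1252 text and convert them "to" UTF-8.
// Each byte becomes a character of its own, which is the familiar garbage.
$garbled = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');

echo bin2hex($apostrophe), "\n"; // e28099
echo $garbled, "\n";             // â€™
```

If this matches what you see, the file itself is probably fine and the damage happens where the bytes are interpreted.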
later
have found that by putting
$file = file_get_contents("../notifs/notif$notif_no.htm");
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );
then saveHTMLFile() outputs with a correct apostrophe... as does my echo of the SQL INSERT INTO ... (...) VALUES (...) string. However the text in the MySQL text field obstinately refuses to cooperate. (naturally have tried multiple different collations). Meanwhile, mb_detect_encoding ( $clean_string ) prints "UTF-8" and mb_check_encoding ( $clean_string ) returns TRUE.
Another puzzling thing, though: if I do
$doc->loadHTML('<?xml encoding="latin1">' . $file )
this same partial success stays the same, right down to the "UTF-8" detected encoding. hmmmm
later
$doc = new DOMDocument();
$file = file_get_contents("../notifs/notif$notif_no.htm");
# without this following line adding an explicit encoding for the DOMDocument nothing worked!
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );
and then, when you've extracted some text and cleaned it up a bit, calling it $clean_string
# convert difficult UTF-8 characters into HTML special sequences ("&rsquo;", etc.)
$clean_string = mb_convert_encoding($clean_string, "HTML-ENTITIES", "UTF-8");
After this, $clean_string contains sequences like "... wine&rsquo;s worth drinking"... but it can still be quite confusing, because if you simply go
echo ">>> clean string $clean_string<br>";
... the "&rsquo;" sequence will of course be displayed by the browser as ’ (a right single quote).
This is probably absolutely obvious to most PHPers... but if you want to display an accurate picture of what you have in $clean_string you have to go
$decoded_clean_string = htmlspecialchars( $clean_string, ENT_QUOTES );
echo ">>> decoded string: $decoded_clean_string<br>";
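The difference between what the string contains and what the browser shows can be checked outside the browser. This is a small sketch, not the exact code above: htmlentities() stands in for the mb_convert_encoding(..., "HTML-ENTITIES", "UTF-8") call (which is deprecated as of PHP 8.2), and unlike that call it also escapes <, >, & and quotes.

```php
<?php
$clean_string = "wine\u{2019}s worth drinking";

// Stand-in for the HTML-ENTITIES conversion: non-ASCII characters become named entities.
$encoded = htmlentities($clean_string, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Escaping the ampersand makes a browser display the entity text itself
// instead of rendering it as a quote character.
$visible = htmlspecialchars($encoded, ENT_QUOTES, 'UTF-8');

echo $encoded, "\n"; // wine&rsquo;s worth drinking
echo $visible, "\n"; // wine&amp;rsquo;s worth drinking
```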
$doc = new DOMDocument();
$file = file_get_contents("../notifs/notif$notif_no.htm");
$file = mb_convert_encoding($file, "UTF-8");
$doc->loadHTML( $file );
Worth a shot?
or
$file = mb_convert_encoding($file, 'HTML-ENTITIES', 'UTF-8');
Related
I'm generating an XML file from a database that is formatted as UTF-8. However, in some specific cases it is not converted properly and I get this message:
DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding ! Bytes: 0x96 0x20 0x50 0x61 in Entity, line: 1
I have already tried all the solutions I could find online, from iconv to regex replacements, but none of them solves the problem. mb_detect_encoding reports ASCII, which should be compatible with UTF-8, and the file itself appears to be UTF-8 when checked.
This is the start of my script, which loads the file whose path comes from the database in the variable $xml_file. All inputs from the database are decoded using utf8_decode.
<?php
$content = utf8_encode(file_get_contents($xml_file));
//$encoding = mb_detect_encoding($content);
//$myXMLString = file_put_contents($xml_file, iconv('WINDOWS-1251', 'UTF-8', file_get_contents($xml_file)));
$xml_doc = new DomDocument();
$xml_doc->formatOutput = true;
$xml_doc->preserveWhiteSpace = false;
$xml_doc->loadXML($content);
?>
This only happens with some items; others are generated correctly. However, I cannot find any particular difference between them, nor a permanent fix.
HOW I FIXED IT:
I managed to fix this by converting the content to UTF-8 again, ignoring any invalid bytes:
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
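Note that //IGNORE silently drops the offending bytes. If the stray 0x96 byte is really a Windows-1252 en dash (a guess based on the bytes in the error message, not something the question confirms), converting from that encoding keeps the character. A minimal sketch under that assumption:

```php
<?php
// The four bytes quoted in the error message: 0x96 0x20 0x50 0x61.
$raw = "\x96 Pa";

// A lone 0x96 is not valid UTF-8, which is what loadXML() complains about.
$isValidUtf8 = mb_check_encoding($raw, 'UTF-8');

// Assumption: the data is Windows-1252, where 0x96 is an en dash (U+2013).
// Only convert when the input is not already valid UTF-8.
$converted = $isValidUtf8
    ? $raw
    : mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

$doc = new DOMDocument();
$loaded = $doc->loadXML('<item>' . $converted . '</item>');
echo $doc->documentElement->textContent, "\n";
```

The validity check matters because converting text that is already UTF-8 would garble it instead.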
In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:
<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>
The name is supposed to be Ramírez. The accented character í is Unicode character U+00ED (\xc3\xad as literal UTF-8 bytes; see this chart for reference).
However, PHP's SimpleXML mangles this string, transforming it into
RamÃrez Jones
(which you can see because I simply pasted the output into the editor box here; evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP).
Now when this mangled string gets saved into, then pulled out of MongoDB, it then becomes
RamÃ-rez Jones
For some reason a hyphen is inserted there, although believe it or not, if you select the above bold text, then paste it back into a StackOverflow editor window, it will simply appear as RamÃrez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!
Here is some example code to show this problem:
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());
Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:
UTF-8
RamÃrez Jones
RamA-rez Jones
How can we avoid this problem? It's really screwing things up.
EDIT:
Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off the í).
REVISED FINAL SOLUTION:
It's different than the correct answer below but this was the most elegant solution that I found. Just replace the last line of the example with this:
echo iconv('UTF-8','ASCII//TRANSLIT', html_entity_decode($xml));
This works because "&#xC3;" and "&#xAD;" are HTML entities.
ALTERNATE SOLUTION
Strangely, this also works:
$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name;
It does not work because it is encoded twice. The character í has the code point U+00ED and it should be encoded in XML as &#xED;.
You can fix its encoding using either:
$name = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $elem->Name->__toString());
or
$name = mb_convert_encoding($elem->Name->__toString(), 'ISO-8859-1', 'UTF-8');
UPDATE:
Both approaches suggested above fix the encoding (they actually convert the string from UTF-8 to ISO-8859-1, which incidentally fixes the issue at hand).
The solution provided by @Hazzit also works.
The real challenge for both solutions (and for your code) is to detect if the received data is encoded in a wrong way and apply these fixes only in that situation, letting the code work correctly when Amazon fixes the encoding issue. I hope they will do it.
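One way to apply the fix only when it is needed is to try the reverse conversion and keep the result only if it is valid UTF-8. The helper below is a hypothetical sketch (the function name is made up here), and it is a heuristic: a short string that is legitimately made of Latin-1 characters could in rare cases be "repaired" by mistake.

```php
<?php
/**
 * Hypothetical helper: undo one round of UTF-8 double encoding when detected.
 */
function fix_double_encoded_utf8(string $text): string
{
    // Invalid UTF-8, or characters above U+00FF, cannot be the result of
    // reading UTF-8 bytes as ISO-8859-1, so such input is left untouched.
    if (!mb_check_encoding($text, 'UTF-8')
        || preg_match('/[^\x{00}-\x{FF}]/u', $text) === 1) {
        return $text;
    }

    // Map every character back to the single byte it was misread from.
    $bytes = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // Accept the result only if something changed and the bytes form valid UTF-8.
    return ($bytes !== $text && mb_check_encoding($bytes, 'UTF-8')) ? $bytes : $text;
}

$mangled = "Ram\u{00C3}\u{00AD}rez Jones"; // what SimpleXML returns for the bad feed
$correct = "Ram\u{00ED}rez Jones";         // what a fixed feed would return

echo fix_double_encoded_utf8($mangled), "\n"; // Ramírez Jones
echo fix_double_encoded_utf8($correct), "\n"; // Ramírez Jones
```

Correctly encoded names pass through unchanged because a lone 0xED byte followed by an ASCII letter is not valid UTF-8.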
Stripping the accents with minimum loss of information
After you fix the encoding, in order to replace the accented letters with similar letters from the ASCII subset you must use iconv() (because only iconv() can help), as you already did in the sample code.
$nameAscii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
An explanation of the second parameter can be found in the documentation page of iconv(); please also read the user comments.
SimpleXML does not decode the hex entities and understand the result as UTF-8, because that's not how XML or UTF-8 actually works. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.
function decode_hexentities($xml) {
return
preg_replace_callback(
'~&#x([0-9a-fA-F]+);~i',
function ($matches) { return chr(hexdec($matches[1])); },
$xml
);
}
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$xml = decode_hexentities($xml);
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
results in:
UTF-8
Ramírez Jones
Ramirez Jones
I have problems with wrong character encoding while reading a xml-file.
While this one shows the complete content of the file correctly...
$reader = new DOMDocument();
$reader->preserveWhiteSpace = false;
$reader->load('zip://content.odt#content.xml');
echo $reader->saveXML();
...this one gives me a strange output (german umlauts, em dashes, µ or similar characters aren't shown correctly):
$reader = new DOMDocument();
$reader->preserveWhiteSpace = false;
$reader->load('zip://content.odt#content.xml');
$content = '';
$elements = $reader->getElementsByTagName('text');
foreach($elements as $node){
foreach($node->childNodes as $child) {
$content .= $child->nodeValue;
}
}
echo $content;
I don't know why this is the case. Hope someone can explain it to me.
DOMDocument::saveXML()
This method returns the whole XML document as string. As with any XML document, the encoding is given in the XML declaration or it has the default encoding which is UTF-8.
DOMNode::$nodeValue
Contains the value of a node, most often text. All text strings returned by the DOMDocument library (of which DOMNode is a part) are in UTF-8 encoding, regardless of the encoding of the XML document.
Since you write that when you output the first:
echo $reader->saveXML();
all umlauts are preserved, the XML itself most likely ships with an encoding other than UTF-8, because the latter
$content .= $child->nodeValue;
...
echo $content;
doesn't preserve them.
As you don't share how and with which application you're displaying and reading the output, not much more can be said.
You most likely need to hint the character encoding to the displaying application in the latter case. For example, if you display text in a browser, you should add the appropriate content-type header at the very beginning:
header("Content-Type: text/plain; charset=utf-8");
Compare with How to set UTF-8 encoding for a PHP file.
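The difference between the two properties described above can be reproduced with a tiny document. This is an illustrative sketch with a made-up ISO-8859-1 input, not the ODT file from the question:

```php
<?php
// An ISO-8859-1 document: "Grüße" with ü = 0xFC and ß = 0xDF as single bytes.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<text>Gr\xFC\xDFe</text>";

$reader = new DOMDocument();
$reader->loadXML($xml);

// saveXML() serializes in the document's declared encoding (ISO-8859-1 here).
$serialized = $reader->saveXML();

// nodeValue is always returned as UTF-8, whatever the document encoding is.
$value = $reader->documentElement->nodeValue;

echo bin2hex($value), "\n"; // 4772c3bcc39f65
```

So the same script can emit two different encodings depending on which of the two it echoes, and only one of them will match the charset the viewer assumes.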
DOMDocument seems to convert Chinese characters into codes, for instance,
你的乱发 will become ä½ çš„ä¹±å‘
How can I keep the Chinese or other foreign language as they are instead of converting them into codes?
Below is my simple test,
$dom = new DOMDocument();
$dom->loadHTML($html);
If I add this below before loadHTML(),
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
I get,
&#20320;&#30340;&#20081;&#21457;
Even though the converted codes will be displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; is still not the 你的乱发 I am after....
DOMDocument seems to convert Chinese characters into codes [...]. How can I keep the Chinese or other foreign language as they are instead of converting them into codes?
$dom = new DOMDocument();
$dom->loadHTML($html);
If you're using the loadHTML function to load an HTML chunk, by default DOMDocument expects that string to be in HTML's default encoding (ISO-8859-1). However, most often the charset (sic!) is meta-information provided alongside the string you're using and not inside it. To make this more complicated, that meta-information may even be inside the string.
Anyway as you have not shared the string data of the HTML and you have not specified the encoding, it's hard to tell specifically what is going on.
I assume the HTML is UTF-8 encoded but this is not signalled within the HTML string. So the following work-around can help:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix
foreach ($doc->childNodes as $item)
if ($item->nodeType == XML_PI_NODE)
$doc->removeChild($item); // remove hack
$doc->encoding = 'UTF-8'; // insert proper
It injects an encoding hint on the very beginning (and removes it after the HTML has been loaded). From that point on, DOMDocument will return UTF-8 (as always).
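The same workaround can be wrapped in a function so the hint never leaks into the rest of the code. This is a sketch only: load_utf8_html is a hypothetical name, and it assumes the input really is UTF-8.

```php
<?php
/**
 * Hypothetical wrapper around the encoding-hint workaround.
 * Assumes $html is UTF-8 and carries no charset declaration of its own.
 */
function load_utf8_html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Copy the node list first so removing a node does not disturb the iteration.
    foreach (iterator_to_array($doc->childNodes) as $item) {
        if ($item->nodeType === XML_PI_NODE) {
            $doc->removeChild($item); // remove the injected hint
        }
    }
    $doc->encoding = 'UTF-8';

    return $doc;
}

$doc = load_utf8_html('<p>你的乱发</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```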
I just stumbled upon this thread while searching for a solution to a similar problem. After loading the HTML properly and doing some parsing with XPath etc., my text ends up like this:
&#20320;&#30340;&#20081;&#21457;
This displays fine in the body of the HTML, but won't display properly in a style or script tag (e.g. when setting Chinese fonts).
To fix this, do the reverse of what lauthiamkok did:
$html = mb_convert_encoding($html, "UTF-8", "HTML-ENTITIES");
If for any reason the first workaround doesn't work for you, try this conversion.
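Since the "HTML-ENTITIES" pseudo-encoding of mb_convert_encoding() is deprecated as of PHP 8.2, here is a sketch of the same reverse conversion using html_entity_decode() instead. This is a swapped-in alternative, not the call used above:

```php
<?php
// Numeric character references for the four Chinese characters.
$html = '&#20320;&#30340;&#20081;&#21457;';

// Decode the references back into real UTF-8 characters.
$decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n"; // 你的乱发
```

Be aware that this also decodes markup-related entities such as &lt; and &amp;, so it is best applied to extracted text rather than to a whole HTML document.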
I'm pretty sure ä½ çš„ä¹±å‘ is actually Windows Latin 1 (not ASCII, there are no diacritics in ASCII). Somewhere along the way your UTF-8 text got saved as Windows Latin 1....
$string = file_get_contents('http://example.com');
if ('UTF-8' === mb_detect_encoding($string)) {
$dom = new DOMDocument();
// hack to preserve UTF-8 characters
$dom->loadHTML('<?xml encoding="UTF-8">' . $string);
$dom->preserveWhiteSpace = false;
$dom->encoding = 'UTF-8';
$body = $dom->getElementsByTagName('body');
echo htmlspecialchars($body->item(0)->nodeValue);
}
This changes all UTF-8 characters to Å, ¾, ¤ and other rubbish. Is there any other way how to preserve UTF-8 characters?
Don't post answers telling me to make sure I am outputting it as UTF-8, I made sure I am.
Thanks in advance :)
I had similar problems recently, and eventually found this workaround - convert all the non-ascii characters to html entities before loading the html
$string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");
$dom->loadHTML($string);
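On PHP 8.2 and later the "HTML-ENTITIES" conversion raises a deprecation notice. A possible replacement, sketched here under the assumption that $string is UTF-8, is mb_encode_numericentity(), which turns every non-ASCII character into a numeric reference before the HTML is loaded:

```php
<?php
$string = '<p>你的乱发</p>';

// Conversion map: every code point from U+0080 to U+10FFFF, no offset, full mask.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii_html = mb_encode_numericentity($string, $map, 'UTF-8');

// The input is now pure ASCII, so the parser's default encoding no longer matters.
$dom = new DOMDocument();
$dom->loadHTML($ascii_html);

// Text read back from the DOM is UTF-8 again.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo $ascii_html, "\n"; // <p>&#20320;&#30340;&#20081;&#21457;</p>
echo $text, "\n";       // 你的乱发
```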
In case it is definitely the DOM screwing up the encoding, this trick did it for me a while back the other way round (accepting ISO-8859-1 data). DOMDocument should be UTF-8 by default in any case but you can still try:
$dom = new DOMDocument('1.0', 'utf-8');
I had to add a utf8 header to get the correct view:
header('Content-Type: text/html; charset=utf-8');
At the top of the script where your PHP code lies (the code you posted here), make sure you send a UTF-8 header. I bet your encoding is some variant of latin1 right now. Yes, I know the remote webpage is UTF-8, but this PHP script isn't.