I'm generating an XML file from database content that is supposed to be UTF-8. In some specific cases the conversion fails and I get this message:
DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding ! Bytes: 0x96 0x20 0x50 0x61 in Entity, line: 1
I have already tried every solution I could find online, from iconv to regex replacements, but none of them solves the problem. mb_detect_encoding reports ASCII where I expect UTF-8, and the file itself also appears to be UTF-8.
This is the start of my script. It loads the file whose path comes from the database in the variable $xml_file. All inputs from the database are decoded using utf8_decode.
<?php
$content = utf8_encode(file_get_contents($xml_file));
//$encoding = mb_detect_encoding($content);
//$myXMLString = file_put_contents($xml_file, iconv('WINDOWS-1251', 'UTF-8', file_get_contents($xml_file)));
$xml_doc = new DomDocument();
$xml_doc->formatOutput = true;
$xml_doc->preserveWhiteSpace = false;
$xml_doc->loadXML($content);
?>
This only happens with some items; others generate correctly. However, I cannot find any particular difference between them, nor a permanent fix.
HOW I FIXED:
Managed to fix this by converting it to UTF-8 again:
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
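A note on that fix: //IGNORE silently drops the offending bytes rather than converting them. The bytes in the error message (0x96 0x20 0x50 0x61) would be an en dash followed by " Pa" if the text were Windows-1252, so a conversion may preserve the character instead. The following is only a sketch under that assumption; to_utf8() and its $fallback parameter are hypothetical names, and Windows-1252 is a guess about the source data, not something confirmed by the question.

```php
<?php
// Hypothetical helper: return $bytes as UTF-8, converting only when needed.
// ASSUMPTION: input that is not valid UTF-8 is Windows-1252; adjust $fallback otherwise.
function to_utf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8 (plain ASCII included)
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// 0x96 is an en dash in Windows-1252; it is converted to U+2013 instead of being dropped.
$fixed = to_utf8("\x96 Pa");
```

The result can then be passed to loadXML() in place of the utf8_encode() output. Valid UTF-8 input is returned unchanged, so running it twice does not double-encode.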
I encountered a problem converting a Windows-1257 file to UTF-8. The original file has
<?xml version="1.0" encoding="windows-1257"?>
on top and I try to convert it using this code:
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);
file_put_contents('data/rmtools/import/utf8/'.$files_single, $unicode_xml);
It saves the file as UTF-8, but when I open this file I still get the error:
XML parsing error: Input is not proper UTF-8, indicate encoding ! Bytes: 0x04 0x50 0x72 0x65
Is there a proper way to convert it to readable UTF-8, or does this mean the file still contains some symbols that are not valid UTF-8?
You're converting from UTF-8 to UTF-8//IGNORE, and that's why you're receiving that error. The first parameter of iconv is the in_charset (see the iconv documentation on PHP.net). Please change
$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);
to
$unicode_xml = iconv("CP1257", "UTF-8//IGNORE", $baltic_xml);
However, I'd personally recommend using mb_*: iconv relies heavily on your OS's implementation of iconv and can behave differently between operating systems, whereas mb_* is a pure PHP extension and is consistent. Using mb_*, the whole code becomes
ini_set('mbstring.substitute_character', 'none'); // drop unconvertible characters, in place of //IGNORE in iconv
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = mb_convert_encoding($baltic_xml, 'UTF-8', 'ISO-8859-13'); // ISO-8859-13 stands in for CP1257, see the note after this code
$unicode_xml = preg_replace('/[^\PC\s]/u', '', $unicode_xml); // remove control characters, if any
file_put_contents('data/rmtools/import/utf8/' . $files_single, $unicode_xml);
According to the list of encodings supported by mbstring, CP1257 is not one of them. You may use ISO-8859-13 instead; however, please note that there are some inconsistencies between the two in some graphical characters (language characters seem to be consistent, according to Wikipedia).
I have a .csv file encoded in UCS-2LE with a BOM. I need to make some changes to it and I want to use preg_replace, so I want to convert the file to UTF-8. However, when I convert it, all spaces disappear and all words that belong to the same line are stuck together.
My code is:
$content = file_get_contents( "myFile.csv" );
$content = mb_convert_encoding( $content, 'UCS-2LE', 'UTF-8');
What is the proper way to make the conversion so that I do not lose any spaces or characters?
You should change the second line to this:
$content = mb_convert_encoding($content, 'UTF-8', 'UCS-2LE');
The 2nd argument is the TO encoding; the 3rd is the FROM encoding.
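Since the file starts with a BOM, it may be worth stripping it before converting so that it does not end up as an invisible character at the start of the UTF-8 text. A minimal sketch, where ucs2le_to_utf8() is a hypothetical helper name and the input is assumed to really be UCS-2LE:

```php
<?php
// Hypothetical helper: convert UCS-2LE bytes (optionally starting with a BOM) to UTF-8.
function ucs2le_to_utf8(string $bytes): string
{
    // Drop the little-endian byte-order mark (FF FE) if present.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = substr($bytes, 2);
    }
    // Argument order: data, TO encoding, FROM encoding.
    return mb_convert_encoding($bytes, 'UTF-8', 'UCS-2LE');
}
```

Usage would be something like $content = ucs2le_to_utf8(file_get_contents("myFile.csv")); after which preg_replace with the /u modifier can be applied to $content.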
In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:
<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>
The name is supposed to be Ramírez. The accented character í is Unicode character U+00ED (\xc3\xad as literal UTF-8 bytes; see this chart for reference).
However, PHP's SimpleXML mangles this string, transforming it into
RamÃrez Jones
(which you can see because I simply pasted it into the editor box here; evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP).
Now when this mangled string gets saved into, then pulled out of MongoDB, it then becomes
RamÃ-rez Jones
For some reason a hyphen is inserted there, although believe it or not, if you select the above bold text, then paste it back into a StackOverflow editor window, it will simply appear as RamÃrez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!
Here is some example code to show this problem:
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());
Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:
UTF-8
RamÃrez Jones
RamA-rez Jones
How can we avoid this problem? It's really screwing things up.
EDIT:
Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off the í).
REVISED FINAL SOLUTION:
It's different than the correct answer below but this was the most elegant solution that I found. Just replace the last line of the example with this:
echo iconv('UTF-8','ASCII//TRANSLIT', html_entity_decode($xml));
This works because "&#xC3;&#xAD;" are HTML entities.
ALTERNATE SOLUTION
Strangely, this also works:
$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name;
It does not work because it is encoded twice. The character í has the code point U+00ED and it should be encoded in XML as &#xED;.
You can fix its encoding using either:
$name = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $elem->Name->__toString());
or
$name = mb_convert_encoding($elem->Name->__toString(), 'ISO-8859-1', 'UTF-8');
UPDATE:
Both ways suggested above work to fix the encoding (they actually convert the encoding of the string from UTF-8 to ISO-8859-1 which incidentally fix the issue at hand).
The solution provided by @Hazzit also works.
The real challenge for both solutions (and for your code) is to detect if the received data is encoded in a wrong way and apply these fixes only in that situation, letting the code work correctly when Amazon fixes the encoding issue. I hope they will do it.
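One possible way to approach that detection is a heuristic: a double-encoded string consists only of code points up to U+00FF, and converting it from UTF-8 to ISO-8859-1 yields bytes that are themselves valid UTF-8. The sketch below relies on that assumption. fix_double_utf8() is a hypothetical name, and the heuristic can in principle misfire on genuine text that happens to look like double-encoded UTF-8 (for example a real "Ã" followed by a soft hyphen), so treat it as a starting point only.

```php
<?php
// Hypothetical helper: undo one level of UTF-8 double-encoding when it looks safe to do so.
function fix_double_utf8(string $s): string
{
    // Pure ASCII cannot be double-encoded.
    if (!preg_match('/[\x80-\xFF]/', $s)) {
        return $s;
    }
    // Invalid UTF-8 is left alone; this function does not try to repair it.
    if (!mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // Double-encoded text only contains code points up to U+00FF.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $s)) {
        return $s;
    }
    // Map each code point back to a single byte and see whether the bytes form valid UTF-8.
    $candidate = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $s;
}
```

With this, the mangled name (U+00C3 U+00AD in place of í) comes back as Ramírez, while a correctly encoded Ramírez is returned unchanged because its ISO-8859-1 form is not valid UTF-8.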
Stripping the accents with minimum loss of information
After you fix the encoding, in order to replace the accented letters with similar letters from the ASCII subset you must use iconv() (because only iconv() can help), as you already did in the sample code.
$nameAscii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
An explanation of the second parameter can be found in the documentation page of iconv(); please also read the user comments.
SimpleXML does not decode the hex entities and understand the result as UTF-8, because that's not how XML or UTF-8 actually works. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.
// Replace hex character references (&#xNN;) with the raw byte they name.
// chr() only yields a single byte, so this is meant for values up to 0xFF.
function decode_hexentities($xml) {
    return preg_replace_callback(
        '~&#x([0-9a-fA-F]+);~i',
        function ($matches) { return chr(hexdec($matches[1])); },
        $xml
    );
}
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$xml = decode_hexentities($xml);
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
results in:
UTF-8
Ramírez Jones
Ramirez Jones
I am downloading HTML files (raw HTML without any !DOCTYPE...) from a government website and then extracting paragraphs to put them into a MySQL database.
I am using DOMDocument, so I am going
$doc = new DOMDocument();
$doc->loadHTMLFile( "../notifs/notif$notif_no.htm" );
The problem is that certain characters get transformed into something strange: e.g. (one type of) apostrophe becomes â€™.
If I then try and save this para to a text field in a table either it is refused by MySQL or it is recorded as these strange characters... depending on the encoding of the text field.
Also, if I go $doc->saveHTMLFile( "test.htm" ); it actually prints out the strange characters, not the apostrophe.
I know this has something to do with encoding, but several days' googling and much looking at questions on SE have not led to the solution. Firefox tells me that the downloaded HTML files are in utf-8 encoding. I tried changing the php.ini file so the default_charset is "utf-8". No joy.
I am more an application programmer than a website person so I am quite new to encoding. I have tried cracking this one myself but just don't really understand what's going on or what to do.
later
have found that by putting
$file = file_get_contents("../notifs/notif$notif_no.htm");
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );
then saveHTMLFile() outputs with a correct apostrophe... as does my echo of the SQL INSERT INTO ... (...) VALUES (...) string. However the text in the MySQL text field obstinately refuses to cooperate. (naturally have tried multiple different collations). Meanwhile, mb_detect_encoding ( $clean_string ) prints "UTF-8" and mb_check_encoding ( $clean_string ) returns TRUE.
Another puzzling thing, though: if I do
$doc->loadHTML('<?xml encoding="latin1">' . $file )
this same partial success stays the same, right down to the "UTF-8" detected encoding. hmmmm
later
$doc = new DOMDocument();
$file = file_get_contents("../notifs/notif$notif_no.htm");
# without this following line adding an explicit encoding for the DOMDocument nothing worked!
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );
and then, when you've extracted some text and cleaned it up a bit, calling it $clean_string
# convert difficult UTF-8 characters into HTML special sequences ("&rsquo;", etc.)
$clean_string = mb_convert_encoding($clean_string, "HTML-ENTITIES", "UTF-8");
After this $clean_string contains sequences like "... wine&rsquo;s worth drinking"... but I, for one, can still be quite confused, because if you simply go
echo ">>> clean string $clean_string<br>";
... the "&rsquo;" sequence will of course be displayed by the browser as ’ (a right single quote).
This is probably absolutely obvious to most PHPers... but if you want to display an accurate picture of what you have in $clean_string you have to go
$decoded_clean_string = htmlspecialchars( $clean_string, ENT_QUOTES );
echo ">>> decoded string: $decoded_clean_string<br>";
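For anyone who wants to check the loading step in isolation, here is a small sketch of the same prefix trick wrapped in a function. first_paragraph_text() is a hypothetical helper, and it assumes the downloaded HTML really is UTF-8; the text it returns is plain UTF-8, with no entity conversion involved. Note also that the "HTML-ENTITIES" target of mb_convert_encoding is deprecated as of PHP 8.2, so on newer versions keeping the text as UTF-8 may be the simpler route.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML and return the text of the first <p> element.
// ASSUMPTION: $html is UTF-8 encoded.
function first_paragraph_text(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix makes libxml treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = $doc->getElementsByTagName('p');
    return $paragraphs->length > 0 ? $paragraphs->item(0)->textContent : '';
}
```

If the returned string passes mb_check_encoding($text, 'UTF-8') but still arrives garbled in the database, the loading step is probably not the culprit and the connection or column character set would be the next thing to look at.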
$doc = new DOMDocument();
$file = file_get_contents("../notifs/notif$notif_no.htm");
$file = mb_convert_encoding($file, "UTF-8");
$doc->loadHTML( $file );
Worth a shot?
or
$file = mb_convert_encoding($file, 'HTML-ENTITIES', 'UTF-8');
I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 JÃ¤rvamaa
EE.05 JÃµgevamaa
EE.07 LÃ¤Ã¤nemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there are a bunch of encoding-specific byte sequences (meaning sequences that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line altogether.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding, $str = iconv('', 'UTF-8', $str); (which may or may not work)...
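To make the utf8_encode point concrete, here is a small sketch. latin1_to_utf8() does what utf8_encode does (every byte is treated as ISO-8859-1), and ensure_utf8() is a hypothetical guard that converts only when the bytes are not already valid UTF-8. Both names are made up for illustration, and ISO-8859-1 as the fallback source encoding is an assumption.

```php
<?php
// Same behaviour as utf8_encode(): treat every byte as ISO-8859-1 and re-encode as UTF-8.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical guard: convert only when the bytes are not already valid UTF-8.
// ASSUMPTION: anything that is not valid UTF-8 is ISO-8859-1.
function ensure_utf8(string $bytes): string
{
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : latin1_to_utf8($bytes);
}

$utf8 = "J\u{E4}rvamaa";          // "Järvamaa", already UTF-8
$twice = latin1_to_utf8($utf8);   // each UTF-8 byte becomes its own character: "JÃ¤rvamaa"
$kept = ensure_utf8($utf8);       // unchanged, because it is already valid UTF-8
```

A strict validity check like this avoids the guesswork of mb_detect_encoding, although it still cannot tell which single-byte encoding invalid input actually uses.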
It doesn't work like that. Even if you utf8_encode($theString) you will not CREATE a UTF8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
Read these to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
As the UTF-8 byte-order mark is '\xef\xbb\xbf' we should add it to the document's header.
<?php
function writeStringToFile($file, $string){
    $f = fopen($file, "wb");
    $string = "\xEF\xBB\xBF" . $string; // prepend the UTF-8 BOM to the content
    fputs($f, $string);
    fclose($f);
}
?>
$file can be the name of any text or XML file.
$string is your UTF-8 encoded string.
Try it now and it will write a UTF-8 encoded file with your UTF-8 content (string).
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, utf8_encode($s));
fclose($f);
?>