How to convert unknown/mixed encoding file to UTF-8 - php

I am retrieving an XML file from a remote service. It is supposed to be UTF-8, as its header is <?xml version="1.0" encoding="UTF-8"?>. However, certain parts of it are apparently not UTF-8: when I load it into PHP's XMLReader extension, it throws some sort of "Not UTF-8 as expected" error when parsing certain parts of the document (parts that look like they have been copy-pasted directly from MS Word).
I am looking for ideas to solve this error. Is there some program I can use to "fix" the file by converting any non-UTF-8 content? A PHP solution or any other solution will do.

It depends on which encoding you are converting from. If it is ISO-8859-1, the utf8_encode() function is a quick and easy way to get UTF-8-safe strings, but it only works for that one source encoding (and it is deprecated as of PHP 8.2). Also, the text must not already be UTF-8, or you have a good chance of ending up with garbled text.
See the man page for more info:
// Usage can be as simple as this.
$name = utf8_encode($contact['name']);
On the other hand, if you need to convert from any other encoding, you will have to look into the iconv() function.
Good luck.
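For the mixed case in the question, one possible approach is to leave text that already validates as UTF-8 alone and convert only what does not. This is a minimal sketch, not a complete solution: the names to_utf8() and file_to_utf8() are made up for illustration, the Windows-1252 fallback is only a guess based on the MS Word hint, and the mbstring extension is assumed to be loaded.

```php
// Hypothetical helper: return $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is in $fallback (Windows-1252 here).
function to_utf8(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid; converting again would garble it
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// Convert line by line so that valid UTF-8 lines and legacy lines can coexist.
function file_to_utf8(string $contents): string
{
    $lines = explode("\n", $contents);
    return implode("\n", array_map('to_utf8', $lines));
}
```

A line that mixes both encodings is treated as entirely Windows-1252 by this sketch, so any real UTF-8 characters on such a line would still come out garbled.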

Related

Problems with php xml characters

Hello friends, I have a problem with some characters when reading an XML file from PHP. I am using this source code:
$file = 'test.xml';
$xml_1 = simplexml_load_file($file);
echo ($xml_1->content);
It works OK, but when the content contains a special character like ñ or ó, it shows a strange character like ñ instead. I tried setting the UTF-8 charset in the HTML head, but the result is the same.
SimpleXML emits UTF-8 output by design. If your application does not support UTF-8, you'll have to convert with the usual tools (e.g. mb_convert_encoding()), but you need to take this into account:
You need to know for sure which encoding your app is using.
UTF-8 can hold the complete Unicode catalogue, so some characters may not have an equivalent in your target encoding.
In any case, in 2016 there's no reason to use anything other than UTF-8 unless you're maintaining legacy code.
Finally I found the solution: I have to use PHP's utf8_decode() function to convert the characters. It is not enough to put the UTF-8 charset in the page head; you must convert in PHP first.
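As a rough illustration of the conversion discussed above, here is a sketch that reads a value through SimpleXML and converts it for an application that really does need ISO-8859-1. It uses mb_convert_encoding() rather than utf8_decode(), which is deprecated as of PHP 8.2. The sample document is made up, and the source file is assumed to be saved as UTF-8.

```php
// Assumption: the application really needs ISO-8859-1 output.
$xml = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?><content>ñ ó</content>');

$utf8   = (string) $xml;  // SimpleXML hands back UTF-8 regardless of the source encoding
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// With mbstring's default settings, characters that have no ISO-8859-1
// equivalent (for example the euro sign) are replaced by "?".
```

If the page itself is served as UTF-8, no conversion is needed; the conversion only makes sense when the rest of the application is stuck on a legacy encoding.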

XML file isn't UTF-8 encoded when created in PHP

I'm trying to output an XML file using PHP, and everything is right except that the file that is created isn't UTF-8 encoded, it's ANSI. (I see that when I open the file and do a Save as...).
I was using
$dom = new DOMDocument('1.0', 'UTF-8');
but I figured out that non-English characters don't appear in the output.
I searched for a solution and first tried adding
header("Content-Type: application/xml; charset=utf-8");
at the beginning of the PHP script, but it says:
Extra content at the end of the document
Below is a rendering of the page up to the first error.
I've tried some other suggestions, like not including 'UTF-8' when creating the document but setting it separately:
$doc->encoding = 'UTF-8';
but the result was the same.
I used
$doc->save("filename.xml");
to save the file, and I've tried changing it to
$doc->saveXML();
but the non-english characters didn't appear.
Any ideas?
ANSI is not a real encoding. It's a word that basically means "whatever encoding my Windows computer is configured to use". Getting ANSI is a clear sign of relying on default encoding somewhere.
In order to generate valid UTF-8 output, you have to feed all XML functions with proper UTF-8 input. The most straightforward way to do it is to save your PHP source code as UTF-8 and then just type some non-English letters. If you are reading data from external sources (such as a database) you need to ensure that the complete toolchain makes proper use of encodings.
In any case, using "Save as" in an undisclosed piece of software is not a reliable way to determine the file encoding.
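To make the "feed it proper UTF-8 input" advice concrete, here is a small sketch. The element names and the sample text are invented, and it assumes the PHP source file itself is saved as UTF-8 so that the string literal is UTF-8.

```php
// Assumption: this source file is saved as UTF-8, so the literal below is UTF-8.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->formatOutput = true;

$root = $dom->createElement('people');
$dom->appendChild($root);

// createTextNode() expects UTF-8 input and escapes markup characters for you.
$person = $dom->createElement('person');
$person->appendChild($dom->createTextNode('Žiga Čop'));
$root->appendChild($person);

$xml = $dom->saveXML();          // returns the document as a string
// $dom->save('filename.xml');   // or write it straight to disk
```

To check the result, test the returned string with mb_check_encoding($xml, 'UTF-8') rather than relying on what an editor's "Save as" dialog shows.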

Read specific character from xml file to a php file

I have a problem when I read specific characters from my XML file to the PHP file.
I use characters like "ä" , "ü" and "ö". I get the following error:
simplexml_load_string() [function.simplexml-load-string]: Entity: line 96: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xFC 0x73 0x65 0x0C
PHP has no native Unicode string type; its strings are plain byte sequences. The native Unicode support once promised for PHP 6 never shipped (PHP 6 was abandoned, and PHP 7 did not add it). To bridge the gap, there are several extensions such as mbstring, iconv and intl.
Make sure you send the HTML Response with an appropriate content-type and encoding, e.g.
<?php header('Content-Type: text/html; charset=utf-8');?>
Also check that the XML file prolog contains the proper encoding, e.g.
<?xml version="1.0" encoding="UTF-8"?>
Assuming that is all correct, it appears that the XML file claims to be UTF-8 but is actually something else (most likely ISO-8859-1, also known as latin1, possibly mixed with real UTF-8). You can open the XML file in your favorite editor (I like Sublime) and save it explicitly with UTF-8 encoding. Or you can use a function that attempts to repair the string before loading, like the one from "Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string":
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}
But at the end of the day, it's going to be messy and PHP still doesn't seem to handle Unicode as well as we would all like it to and it simply isn't built into the core.
We suggest you check out Portable UTF-8 - a Lightweight Library for Unicode Handling in PHP.
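Before repairing anything, it can help to see where the bad bytes actually are, since the parser error only reports the first offending line. A minimal diagnostic sketch follows; the function name is hypothetical and mbstring is assumed to be available.

```php
// Hypothetical diagnostic helper: list the 1-based numbers of the lines
// that are not valid UTF-8.
function invalid_utf8_lines(string $contents): array
{
    $bad = [];
    foreach (explode("\n", $contents) as $index => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $bad[] = $index + 1;
        }
    }
    return $bad;
}
```

Calling it on the output of file_get_contents() for the XML file returns an array of line numbers, which can then be inspected in a hex viewer or editor.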
The XML string you've got is not properly encoded. The default encoding is UTF-8, but the string you've got uses a different one, most likely Windows-1252.
If you want that error to go away, you need to re-encode the string from that encoding (unknown, since your question doesn't say) to UTF-8.
Because text in an unknown encoding cannot be converted reliably, you first need to find out what the encoding of the string is.
Then you can either convert it to UTF-8 or inject the correct encoding into the XML declaration. The latter is easily done with XMLRecoder ("Inspect and modify character encoding of an XML document based on XML Declaration and BOM"). Parts of it are explained in "PHP XMLReader, get the version and encoding"; that question is about XMLReader, but SimpleXML is also a libxml-based PHP XML extension and shares much of the same behavior, so this works here too.
Usage example:
$buffer = file_get_contents($file);
$fromEncoding = 'WINDOWS-1252'; # insert *your* correct string encoding here
$recoder = new XMLRecoder();
$buffer = $recoder->setEncodingDeclaration($buffer, $fromEncoding);
$sxml = simplexml_load_string($buffer);
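If you would rather not pull in a class for this, the same idea (make the declaration match the bytes) can be sketched with a regular expression. To be clear, this is not how XMLRecoder is implemented; redeclare_xml_encoding() is a made-up function that only handles a declaration which already has an encoding attribute and no BOM in front of it. The sample document and the ISO-8859-1 choice are assumptions for the demo.

```php
// Hypothetical stand-in, NOT XMLRecoder's implementation: rewrite the encoding
// named in an existing XML declaration so that it matches the actual bytes.
function redeclare_xml_encoding(string $xml, string $encoding): string
{
    return preg_replace_callback(
        '/^(<\?xml\b[^>]*?\bencoding\s*=\s*)(["\'])[^"\']*\2/',
        function (array $m) use ($encoding): string {
            return $m[1] . $m[2] . $encoding . $m[2];
        },
        $xml,
        1
    );
}

// The bytes below are ISO-8859-1 even though the declaration says UTF-8.
$buffer = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>gr\xFC\xDFe</word>";
$buffer = redeclare_xml_encoding($buffer, 'ISO-8859-1');
$sxml   = simplexml_load_string($buffer);
$text   = (string) $sxml; // libxml has converted the text to UTF-8 in memory
```

The bytes of the document are left untouched; only the label changes, and the parser does the conversion while loading.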
To better understand XML encodings in PHP and the available charset encodings and names, please see:
how to get list of supported encodings by iconv library in php? (The iconv library is used internally by SimpleXML and other PHP XML libraries to convert charsets in the document into the in-memory representation of UTF-8)
https://gist.github.com/4188459
Official Names for Character Sets, ed. Keld Simonsen et al. (Internet Assigned Numbers Authority)
PHP XMLReader, get the version and encoding
php XML export issue with XMLWriter using writeAttribute() method

How to write a file in UTF-8 format

I have read a XML file with the simplexml_load_file() function. I suppose this function is well written and supports XML encodings correctly. My XML file is in UTF-8 format, i.e. it contains normal ANSI characters along with national characters with multibyte encoding.
So, now I want to write the XML file back with fopen() and fwrite(), also in UTF-8 format.
Should I perform some conversions to do that correctly?
Suppose variable $a contains some UTF-8 encodings. Will it be written correctly?
If your XML is really encoded in UTF-8 already, then the strings produced by SimpleXML will be UTF-8 too, meaning you can just write them to the file as-is.
PHP's interfacing with the underlying libxml is a bit funky, though. Even if the XML is UTF-8 encoded, make sure that it starts with a proper encoding declaration, or it may get misinterpreted:
<?xml version="1.0" encoding="UTF-8"?>
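Putting that together, a write-back with fopen() and fwrite() could look roughly like this. The sample document and the temporary output path are placeholders, and the source file is assumed to be saved as UTF-8.

```php
// Assumption: the input really is UTF-8 and says so in its declaration.
$source = '<?xml version="1.0" encoding="UTF-8"?><city>Dublín</city>';
$sxml   = simplexml_load_string($source);

$path   = tempnam(sys_get_temp_dir(), 'xml'); // placeholder output location
$handle = fopen($path, 'wb');                 // binary mode: no newline translation
fwrite($handle, $sxml->asXML());              // the UTF-8 bytes are written unchanged
fclose($handle);

$written = file_get_contents($path);          // read back to verify the result
```

No conversion step is involved: fwrite() writes bytes, and the bytes SimpleXML produces here are already UTF-8.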

Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

I'm getting the error:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20
This happens when trying to process an XML response from a 3rd-party source using simplexml_load_string. The raw XML response does declare the encoding:
<?xml version="1.0" encoding="UTF-8"?>
Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish, and it contains words like Dublín.
I'm unable to get the 3rd party to sort out their XML.
How can I pre-process the XML and fix the encoding incompatibilities?
Is there a way to detect the correct encoding for a XML file?
Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them would be to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (They won't, but you can at least ignore the invalid characters so you can load your XML.)
Or you can take the long, long road and validate/fix the sequences yourself. That will take you a while, depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know of any.
Either way, notify your data provider that they're sending invalid data so that they can fix it.
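To show the mojibake downside in practice, here is a small demo. latin1_to_utf8() is a made-up wrapper that performs the same ISO-8859-1 to UTF-8 conversion as utf8_encode() (which is deprecated as of PHP 8.2) via mbstring.

```php
// Same conversion that utf8_encode() performs, written with mbstring.
function latin1_to_utf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Dubl\xEDn";      // ISO-8859-1 bytes, as in the error message
$utf8   = "Dubl\xC3\xADn";  // the same word, already valid UTF-8

$fixed   = latin1_to_utf8($latin1); // correct: becomes valid UTF-8
$mangled = latin1_to_utf8($utf8);   // wrong: each UTF-8 byte is encoded again

echo bin2hex($fixed), "\n";   // 4475626cc3ad6e
echo bin2hex($mangled), "\n"; // 4475626cc383c2ad6e
```

The second result is what shows up on screen as "DublÃ­n"-style garbage, which is why a blanket conversion is only safe when the whole document is in one encoding.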
Here's a partial fix. It will definitely not fix everything, but it will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}
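For the "validate the sequences yourself" road mentioned above, a stricter variant can be sketched as follows: match every well-formed UTF-8 sequence and keep it, and convert any byte that does not belong to one. This is an illustration under assumptions, not a drop-in fix: repair_mixed_utf8() is a hypothetical name, and the stray bytes are assumed to be Windows-1252 (which agrees with ISO-8859-1 for bytes 0xA0 to 0xFF).

```php
// Hypothetical repair function: keep valid UTF-8 sequences, convert every other byte.
// Assumption: stray bytes are in $legacy (Windows-1252 by default).
function repair_mixed_utf8(string $text, string $legacy = 'Windows-1252'): string
{
    $pattern = '/(
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequence
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlong forms
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    )|(.)/xs';

    return preg_replace_callback($pattern, function (array $m) use ($legacy): string {
        // Group 2 is only set when the byte did not start a valid sequence.
        return isset($m[2]) ? mb_convert_encoding($m[2], 'UTF-8', $legacy) : $m[1];
    }, $text);
}
```

It is still a heuristic: legacy text that happens to form a valid UTF-8 sequence (for example the two Latin-1 characters "Ã©") is kept as UTF-8, so the provider fixing the feed remains the real solution.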
I solved this using
$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
If you are sure that your XML is encoded in UTF-8 but contains bad characters, you can use this call to drop them:
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
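Because //IGNORE support depends on the underlying iconv implementation, an mbstring-based alternative may be worth trying if the call above misbehaves on your platform. A sketch, with a hypothetical function name and assuming mbstring is loaded:

```php
// Hypothetical helper: drop byte sequences that are not valid UTF-8.
function strip_invalid_utf8(string $text): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // invalid input is removed, not replaced
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore it for the rest of the script
    return $clean;
}
```

Like the iconv() call, this throws the offending bytes away, so the affected words lose characters; it makes the document loadable, it does not recover the text.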
We recently ran into a similar issue and were unable to find anything obvious as the cause. There turned out to be a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.
We managed to solve our problem thanks to this post and this:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
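Note that the pattern above also removes tabs and line breaks, which are legal in XML. A slightly narrower variant, sketched here with a hypothetical function name, keeps those three characters:

```php
// Hypothetical helper: remove the C0 control characters that XML 1.0 does not
// allow, plus DEL (0x7F), which the pattern above also removes.
// Tab (0x09), line feed (0x0A) and carriage return (0x0D) are kept.
function strip_xml_control_chars(string $input): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
}
```

This would, for instance, drop a stray form feed (0x0C) while leaving the document's line structure intact.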
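Instead of using JavaScript, you can simply put this line of code after your mysql_connect call (this is the legacy mysql extension, which was removed in PHP 7; mysqli_set_charset() is the mysqli equivalent):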
mysql_set_charset('utf8',$connection);
Cheers.
Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.
If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).
If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
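A minimal sketch of that check, with a made-up wrapper name. The third argument turns on strict mode, so an encoding is rejected as soon as the input contains a byte sequence that is invalid in it.

```php
// Hypothetical helper: guess between two candidate encodings.
// Strict mode rejects an encoding if any byte sequence is invalid in it.
function guess_encoding(string $text): string
{
    $guess = mb_detect_encoding($text, ['UTF-8', 'ISO-8859-1'], true);
    return $guess === false ? 'unknown' : $guess;
}

echo guess_encoding("Dubl\xEDn"), "\n"; // ISO-8859-1, because 0xED 0x6E is not valid UTF-8
```

Keep in mind that every byte string is valid ISO-8859-1, so this really only answers "is it UTF-8 or not"; it cannot tell ISO-8859-1 apart from Windows-1252 or other single-byte encodings.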
If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)
The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it's only information for the validator or other consumer.
I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.
I used Sublime to change the file encoding to utf-8, and lxml imported it no issues.
After several tries I found that the htmlentities function works:
$value = htmlentities($value);
What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.
And here is a piece of code that could be useful to anyone out there:
$product_desc = ..;
// Filter your $product_desc here. Remove tags, strip, do all you would do to print XML
try {
    (new SimpleXMLElement('<sth><![CDATA[' . $product_desc . ']]></sth>'))->asXML();
} catch (Exception $exc) {
    $product_desc = ''; // Don't print trash
}
Note this part:
<![CDATA[]]>
When you build the XML for this check, be sure to pass in the final form a browser would see, meaning your field wrapped in CDATA.
When generating mapping files using Doctrine, I ran into the same issue. I fixed it by removing all the comments that some fields had in the database.
