Read special characters from an XML file into a PHP file - php

I have a problem reading special characters from my XML file into my PHP file.
I use characters like "ä", "ü" and "ö". I get the following error:
simplexml_load_string() [function.simplexml-load-string]: Entity: line 96: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xFC 0x73 0x65 0x0C

PHP has no native Unicode string type: its strings are plain byte sequences. The Unicode support promised for PHP 6 never shipped (PHP 6 was abandoned), and PHP 7 did not add it either. To bridge the gap, there are several extensions such as mbstring, iconv and intl.
Make sure you send the HTML response with an appropriate content type and encoding, e.g.
<?php header('Content-Type: text/html; charset=utf-8');?>
Also check that the XML file prolog contains the proper encoding, e.g.
<?xml version="1.0" encoding="UTF-8"?>
Assuming that is all correct, it appears that the XML file claims to be UTF-8 but is actually something else (most likely ISO-8859-1, also known as Latin-1, or Windows-1252), which is what produces mojibake. You can open the XML file in your favorite editor (I like Sublime) and save it explicitly with UTF-8 encoding. Or you can use a function that attempts to repair the string before loading it, like the one from: Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string
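// Re-encodes bytes that look like stray Latin-1 characters inside otherwise UTF-8 text.
// This is a heuristic only and can misfire on unusual input.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    // Match a high byte that is not followed by two or more UTF-8 continuation bytes.
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}
function utf8_encode_callback($m)
{
    // utf8_encode() treats its input as ISO-8859-1; it is deprecated as of PHP 8.2.
    return utf8_encode($m[0]);
}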
But at the end of the day, it's going to be messy: PHP still doesn't handle Unicode as well as we would all like, and Unicode support simply isn't built into the core.
We suggest you check out Portable UTF-8, a lightweight library for Unicode handling in PHP.
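If the file turns out to be in a single legacy encoding, a simpler check-then-convert step may be enough. The following is a minimal sketch, not a definitive fix: the helper name ensure_utf8 is hypothetical, and Windows-1252 is only an assumption (the bytes 0xFC 0x73 0x65 from the error message would read "üse" in that encoding).

```php
<?php
// Hypothetical helper: return $bytes as UTF-8, converting from
// $assumedEncoding only when the input is not already valid UTF-8.
function ensure_utf8(string $bytes, string $assumedEncoding = 'Windows-1252'): string
{
    // With the /u modifier, preg_match() fails on input that is not valid UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    $converted = iconv($assumedEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $assumedEncoding");
    }
    return $converted;
}

// "\xFC" is the single Windows-1252 byte for "ü".
$broken = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>\xFCse</word>";
$fixed  = ensure_utf8($broken);
```

The converted string can then be handed to simplexml_load_string(). Input that is already valid UTF-8 passes through unchanged, so calling the helper twice does no harm.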

The XML string you've got is not properly encoded. The default encoding is UTF-8, but the string you've got uses a different one, most likely Windows-1252.
If you want that error to go away, you need to re-encode the string from its current encoding (unknown, since your question does not say) to UTF-8.
A string whose encoding is unknown cannot be converted reliably, so you first need to find out which encoding the string uses.
Then you can either convert it to UTF-8 or inject the correct encoding into the XML declaration, which is easy with XMLRecoder - Inspect and modify character encoding of an XML document based on XML Declaration and BOM. Parts of it are explained in PHP XMLReader, get the version and encoding. That question is about XMLReader, but XMLReader, like SimpleXML, is a libxml-based PHP XML extension and shares much of the same behavior, so the approach works here too.
Usage example:
$buffer = file_get_contents($file);
$fromEncoding = 'WINDOWS-1252'; # insert *your* correct string encoding here
$recoder = new XMLRecoder();
$buffer = $recoder->setEncodingDeclaration($buffer, $fromEncoding);
$sxml = simplexml_load_string($buffer);
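Before choosing $fromEncoding, it can help to see what the document itself declares. The sketch below is not part of XMLRecoder; the function name declared_xml_encoding is hypothetical, and it assumes an ASCII-compatible document without a byte order mark whose declaration sits at the very start of the string.

```php
<?php
// Hypothetical helper: read the encoding named in the XML declaration,
// or return null when there is no declaration or no encoding attribute.
function declared_xml_encoding(string $xml): ?string
{
    $pattern = '/^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
             . '\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    return preg_match($pattern, $xml, $m) === 1 ? $m[1] : null;
}

$declared = declared_xml_encoding('<?xml version="1.0" encoding="UTF-8"?><a/>');
```

A declared value only tells you what the document claims. As in the question, the actual bytes may still be in another encoding.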
To better understand XML encodings in PHP and the available charset encodings and names, please see:
how to get list of supported encodings by iconv library in php? (The iconv library is used internally by SimpleXML and other PHP XML libraries to convert charsets in the document into the in-memory representation of UTF-8)
https://gist.github.com/4188459
Official Names for Character Sets, ed. Keld Simonsen et al. (Internet Assigned Numbers Authority)
PHP XMLReader, get the version and encoding
php XML export issue with XMLWriter using writeAttribute() method

Related

Problems with php xml characters

Hello friends, I have a problem with some characters when reading an XML file from PHP. I am using this source code:
$file = 'test.xml';
$xml_1 = simplexml_load_file($file);
echo ($xml_1->content);
It works fine, but when the content contains a special character like ñ or ó, it shows a strange character like this: ñ. I tried declaring the UTF-8 charset in the HTML head, but the result is the same.
SimpleXML emits UTF-8 output by design. If your application does not support UTF-8, you'll have to convert with the usual tools (e.g. mb_convert_encoding()), but you need to take this into account:
You need to know for sure the encoding your app is using.
UTF-8 can hold the complete Unicode catalogue, thus some characters may not have an equivalent in your target encoding.
In any case, in 2016 there's no reason to use anything other than UTF-8 unless you're maintaining legacy code.
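As a rough illustration of that conversion, here is a minimal sketch that assumes the application expects ISO-8859-1 (an assumption; substitute whatever encoding your app really uses). The helper name from_utf8 is hypothetical, and the mbstring extension must be installed.

```php
<?php
// Hypothetical helper: convert SimpleXML's UTF-8 output to a legacy encoding.
function from_utf8(string $utf8, string $targetEncoding): string
{
    return mb_convert_encoding($utf8, $targetEncoding, 'UTF-8');
}

$enye   = "\xC3\xB1";                      // "ñ" as UTF-8 (two bytes)
$latin1 = from_utf8($enye, 'ISO-8859-1');  // one byte in ISO-8859-1
echo bin2hex($latin1), "\n";               // prints "f1"
```

Characters that do not exist in the target encoding cannot survive this step. mbstring replaces them with a substitute character ("?" by default).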
Finally I found the solution: I must use the utf8_decode PHP function to convert the characters. It is not enough to put the UTF-8 charset in the page head; you must convert the text in PHP first.

How to write a file in UTF-8 format

I have read an XML file with the simplexml_load_file() function. I suppose this function is well written and supports XML encodings correctly. My XML file is in UTF-8 format, i.e. it contains normal ANSI characters along with national characters with multibyte encoding.
So, now I want to write the XML file back with fopen() and fwrite(), also in UTF-8 format.
Should I perform some conversions to do that correctly?
Suppose variable $a contains some UTF-8 encoded text. Will it be written correctly?
If your XML is really encoded in UTF-8 already, then the strings produced by SimpleXML will be UTF-8 as well, meaning you can just write them to the file as-is.
PHP's interfacing with the underlying libxml is a bit funky, though. Even though the XML may be UTF-8 encoded, make sure that it starts with a proper encoding declaration, or it may get misinterpreted.
<?xml version="1.0" encoding="UTF-8"?>
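To illustrate the "write as-is" point, here is a small sketch: fwrite() copies bytes unchanged, so a UTF-8 string reaches the file byte for byte. The sample content and the temporary file are made up for the demonstration.

```php
<?php
// Sample UTF-8 content: "\xC3\xBC" is "ü" encoded as UTF-8.
$a = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<name>J\xC3\xBCrgen</name>\n";

// Write the string without any conversion step.
$path   = tempnam(sys_get_temp_dir(), 'xml');
$handle = fopen($path, 'wb');
fwrite($handle, $a);
fclose($handle);

// Read it back to confirm the bytes are identical.
$roundTrip = file_get_contents($path);
unlink($path);
```

Since no conversion takes place, the file is UTF-8 exactly when $a was UTF-8 to begin with.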

How to convert unknown/mixed encoding file to UTF-8

I am retrieving an XML file from a remote service which is supposed to be UTF-8, as the header is <?xml version="1.0" encoding="UTF-8"?>. However, certain parts of it are apparently not UTF-8: when I load it into PHP's XMLReader extension, it throws some sort of "Not UTF-8 as expected" error when parsing certain parts of the document (parts that look like they have been copy-pasted directly from MS Word).
I am looking for ideas to solve this error. Is there some program I can use to "fix" any non-UTF-8 encodings in the file? A PHP solution or any other solution will do.
Depending on which encoding you are converting from, the utf8_encode function is a quick and easy way to get UTF-8 safe strings, but it only works for ISO-8859-1 input. Also, your text must not already be UTF-8, or you have a good chance of ending up with garbled text.
See the man page for more info:
// Usage can be as simple as this.
$name = utf8_encode($contact['name']);
On the other hand, if you need to convert from any other encoding, you will have to look into the iconv() function.
Good luck
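Text pasted from MS Word often contains Windows-1252 "smart quotes", which utf8_encode() cannot handle because it assumes ISO-8859-1. A minimal iconv() sketch follows, assuming the whole string is Windows-1252 (an assumption; in a file that mixes encodings, only the non-UTF-8 parts should be converted this way).

```php
<?php
// "\x93" and "\x94" are the curly double quotes in Windows-1252.
$fromWord = "\x93quoted\x94";

// Convert the whole string from Windows-1252 to UTF-8.
$utf8 = iconv('Windows-1252', 'UTF-8', $fromWord);
if ($utf8 === false) {
    throw new RuntimeException('Conversion failed');
}

echo bin2hex($utf8), "\n"; // prints "e2809c71756f746564e2809d"
```

Running the same call on text that is already UTF-8 would double-encode it, which is the garbling the answer warns about.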

PHP 5, XSL and The Character Ú

I'm having difficulty getting the letter
Ú
to render through PHP 5.3 and XSL. It's part of a string in a database that is loaded into an XML node within a tags. However, it causes the XSL/XML transformation to not render. Removing the character from the string fixes the problem instantly.
Any ideas?
What character encoding are you using? From the sounds of it you have some sort of character encoding mismatch.
If your XSL is using ISO-8859-1 (or ASCII equivalent) and you are trying to output to a page that is UTF-8 encoded, then the character output will be off. The same applies vice versa.
Actually I don't know the right answer, but I have a workaround like the one below:
"&".htmlentities("Ú");
Your XSL transformation engine probably interprets your document as non-well-formed XML because of encoding issues. If the text containing Ú is stored using some 8-bit encoding (like the ISO-8859 variants), then this character will not form a valid UTF-8 byte sequence if it is used as such without any character conversion. Invalid characters in an XML document mean it is not well-formed XML, and processing it as XML is forbidden.
There are many points where that encoding error might happen:
1. it could be stored in the database incorrectly
2. it could be read from the database incorrectly
3. you might produce your XML by concatenating strings that use different encodings
4. you might manipulate the text with a tool or method that can't handle your encoding or is not aware of it
5. your XSLT engine might not be aware of the correct encoding of the input stream, resulting in a rejected file even though it has no encoding error
My random guesses for the probable causes of that are points 3 and 5.
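To narrow down where the mismatch is introduced, you can inspect the raw bytes at each stage (after the database read, after concatenation, before the transformation). A rough diagnostic sketch follows; the helper name describe_bytes is hypothetical.

```php
<?php
// Hypothetical helper: classify a string by looking at its raw bytes.
function describe_bytes(string $s): string
{
    // Only 7-bit bytes: identical in ASCII, ISO-8859-1 and UTF-8.
    if (preg_match('/^[\x00-\x7F]*\z/', $s) === 1) {
        return 'ASCII only';
    }
    // With the /u modifier, preg_match() succeeds only on valid UTF-8.
    return preg_match('//u', $s) === 1 ? 'valid UTF-8' : 'not UTF-8';
}

// "Ú" is the single byte DA in ISO-8859-1 but the two bytes C3 9A in UTF-8.
echo describe_bytes("\xDA"), "\n";      // prints "not UTF-8"
echo describe_bytes("\xC3\x9A"), "\n";  // prints "valid UTF-8"
```

Printing bin2hex() of the suspect string at the same points shows directly whether Ú arrives as DA or as C3 9A.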

base64 decode French characters

We are getting Base64-encoded (XML) data from a third party. If the XML data is in English, everything works fine: I am able to Base64-decode it and parse the XML. If the XML is all lower-case French characters, everything works fine. But if the XML data contains upper-case French characters (like &Agrave;), and I Base64-decode it and try to parse it, the parser fails. Any suggestions on how to fix this problem?
Thanks.
Base64 is a method to encode 8-bit binary data using 7-bit US-ASCII characters. After the Base64 decode you should have a standard XML file.
Probably this XML file contains illegal characters, or does not correctly specify the character encoding it uses.
You mention &Agrave;, an HTML-specific (not XML) representation of À. If the XML contains the HTML-encoded string &Agrave;, there should also be a reference in the XML to an entity table specifying how to decode that string.
Alternatively, if your XML contains the À character directly, encoded using (for example) the ISO-8859-1 character set, either your XML should specify this encoding (<?xml version="1.0" encoding="ISO-8859-1"?>), or you should specify it yourself when decoding it.
Failing that, the parser may assume (e.g) UTF-8 encoding is used, and will fail when trying to decode the À.
The exact error message should tell you what the problem is.
[update: À directly]:
Sounds like the XML is invalid, then: they declare UTF-8 but are actually using a different encoding. Check the XML bytes (after the Base64 decode) for this; if the À is encoded as one byte, it is definitely not UTF-8.
[update: how to fix?] If they incorrectly specify it in the XML header, they should really replace the false header (<?xml version="1.0" encoding="UTF-8"?>) with the correct one (<?xml version="1.0" encoding="windows-1252"?>).
If they don't specify anything, it looks like the iconv function may be your best bet. I haven't really needed it, so I'm not 100 % sure about this, but it looks like you could use: $data = iconv("ISO-8859-1", "UTF-8", $data) after the base64_decode and before the simplexml_load_string. I don't know of a way to specify the encoding directly while decoding the XML.
I'm not really experienced with the PHP specifics of character encoding, so I'm not giving any guarantees...
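The decode-then-convert order suggested above can be sketched as follows. This assumes the payload carries no encoding declaration and is really ISO-8859-1 (both are assumptions); the helper name decode_payload and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: Base64-decode a payload and convert it to UTF-8.
function decode_payload(string $base64, string $sourceEncoding = 'ISO-8859-1'): string
{
    $data = base64_decode($base64, true); // strict mode rejects invalid input
    if ($data === false) {
        throw new InvalidArgumentException('Payload is not valid Base64');
    }
    $utf8 = iconv($sourceEncoding, 'UTF-8', $data);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert from $sourceEncoding");
    }
    return $utf8;
}

// Stand-in for the third-party data: "\xC0" is the ISO-8859-1 byte for "À".
$payload = base64_encode("<name>\xC0 bient\xF4t</name>");
$xml     = decode_payload($payload);
```

The resulting $xml string is UTF-8 and can then be passed to simplexml_load_string(). If the payload did declare an encoding in its XML header, converting the bytes without also updating that declaration would create a new mismatch.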
What's the XML character encoding? Maybe it's not UTF-8 and your parser is trying to parse the XML string as UTF-8.
