I am creating an XML navigation for my website. The line below is causing a SimpleXML issue:
<label>Osnabrück</label>
My PHP code, using htmlentities(), has changed Osnabrück into Osnabr&Atilde;&frac14;ck. However, when I try to parse my XML with this line in it, I get this error:
/application/configs/navigation.xml:318: parser error : Entity 'Atilde' not defined simplexml_load_file()
Should I not be using htmlentities()? Or is there some kind of setting I'm missing?
Kind Regards
Steve
You should not be using HTML Entities in XML. Using normal UTF-8 characters should be fine.
The occurrence of Osnabr&Atilde;&frac14;ck means that at some point, most likely, the city name is processed as ISO-8859-1 instead of UTF-8. It is not htmlentities()'s fault. You need to find that point and fix it.
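To see how the two-entity form can arise, here is a minimal sketch. It assumes the misreading happens in a call that treats the UTF-8 bytes as ISO-8859-1; htmlentities() with an explicit 'ISO-8859-1' charset stands in for that step, and the variable names are only for illustration.

```php
<?php
// The source file itself must be saved as UTF-8 for this to behave as described.
$city = 'Osnabrück';

// Assumption: somewhere the two UTF-8 bytes of "ü" are read as two ISO-8859-1 characters.
$asLatin1 = htmlentities($city, ENT_COMPAT, 'ISO-8859-1');

// Reading the same bytes as UTF-8 yields a single entity for the single character.
$asUtf8 = htmlentities($city, ENT_COMPAT, 'UTF-8');

echo $asLatin1, "\n"; // Osnabr&Atilde;&frac14;ck
echo $asUtf8, "\n";   // Osnabr&uuml;ck
```

Neither &Atilde; nor &uuml; is defined in plain XML, so for the navigation file the unescaped UTF-8 string is what should end up inside the label element.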
You can use the iconv() function to convert to UTF-8 on the fly:
iconv("ISO-8859-1", "UTF-8", $text);
Related
I am consuming an XML feed which contains a great deal of whitespace.
When I echo out the raw feed, it looks as though the columns of the tabled data are properly formatted with just the white space.
I have tried many regex patterns to remove it, to only allow visible characters, trim, chop, utf-8 encode/decode, nothing is touching it. It's like it is laughing in my face when I echo out a value and see this:
string(17) "72"
Opened the data in Notepad++ with show all characters on, and it simply shows it as spaces. I am at a loss of where to go with this.
I did receive the following error:
simplexml_load_string(): Entity: line 265: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x43 0x20 0x74
I just found this regex (untested)
$xml_data = preg_replace("/>\s+</", "><", $xml_data);
If you are using the xml parser, I think you can use the 'XML_OPTION_SKIP_WHITE' option referenced here:
http://php.net/manual/en/function.xml-parser-set-option.php
Try running the data through utf8_encode() - it might seem like a hack, but it seems like the originating data isn't properly setup.
My theory is that you're grabbing it with the wrong encoding, and the proper solution would be to load it differently.
Solution
My very hacky workaround that works:
$raw = file_get_contents('http://stupidwebservice.com/xmldata.asmx/Feed');
$raw = urlencode(utf8_encode($raw));
$raw = str_replace('++','',$raw);
$raw = urldecode($raw);
urlencoding after the utf-8 encoding turned the space into +'s. I simply removed all instances of double ++'s and took it back. Works great.
I have generated an XML file in PHP using the DOMDocument class, the data was grabbed from a MySQL database. A lot of the data contains HTML markup, but I've encased all of it in a CDATA section.
At first the file had a lot of encoding errors, but running everything through utf8_encode() before putting it into the file seems to have fixed all the errors except one.
Here is the error I have right now:
error on line 5113 at column 450: Input is not proper UTF-8, indicate encoding !
Bytes: 0x14 0x31 0x30 0x30
I found some posts on here with similar errors, but none have solved my problem, or they suggest using utf8_encode(). Here is the section that seems to be triggering the error:
...quiet portable package. ]]></Summary><Features><![CDATA[The EF4500iSE was designed for maximum fuel...
The error seems to be between CDATA[ and The, although I can't see any characters between there, and that piece is the same as every other CDATA block in the file. If I remove the entire Features element and its contents, the file loads fine.
Here is the link to the file: http://test.hhdev.hothousemarketing.com/inventory.xml
The problem ended up being a non-ASCII character present within the CDATA tag, as pointed out by Colin in the comments of the question.
I was in a rush to solve this, so I just used a brute force method and ran everything through a regex replacement in addition to utf8_encode(). I used:
$output = preg_replace('/[^(\x20-\x7F)]*/','', $output);
I found this here: http://www.stemkoski.com/php-remove-non-ascii-characters-from-a-string/
Thanks to Colin and Francis for their contributions.
Some characters are just flat-out not permitted in XML, even in a CDATA section, even entity-encoded.
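You might be able to use this on a UTF-8 string (untested). The surrogate range U+D800 to U+DFFF is left out of the pattern because PCRE rejects those code points in UTF mode, and a valid UTF-8 string cannot contain them anyway:
$xml_legal_chars = preg_replace('/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{FFFE}\x{FFFF}]/u', '', $utf8string);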
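Another way to approach the same idea is to keep only what XML 1.0 allows instead of listing what it forbids. The sketch below does that; the function name strip_xml_illegal_chars() is made up for illustration, and the sample string is modelled on the reported bytes 0x14 0x31 0x30 0x30 (a control character followed by "100").

```php
<?php
// Hypothetical helper: removes every character outside the XML 1.0 "Char" range.
// Expects valid UTF-8; preg_replace() returns null if the input is not.
function strip_xml_illegal_chars(string $utf8): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

// 0x14 is a control character that XML does not allow, even inside CDATA.
$dirty = "CDATA[\x14100 hours of runtime";
$clean = strip_xml_illegal_chars($dirty);

echo $clean, "\n"; // CDATA[100 hours of runtime
```

Unlike stripping everything outside \x20-\x7F, this keeps tabs, line breaks and accented characters intact.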
I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from twitter and uploading them to my DB however when I output them to screen I get these characters:
"moved to dusseldorf.�"
OR
también
and if I have Russian characters then I get lots of ugly boxes in place.
What I would like is for the correct native accents to show under one encoding. I thought this was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities but I really appreciate a solution that works across the board on projects.
Thanks in advance.
Replace this line:
$data = htmlentities($data);
With this:
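$data = htmlentities($data, ENT_COMPAT, "UTF-8");
That way, htmlentities() reads the input as UTF-8 instead of ISO-8859-1, so a multibyte character is no longer split into separate bytes and turned into several wrong entities. For more information see the documentation for htmlentities().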
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data in an HTML context.
Make sure that you set your PHP internal encoding to UTF-8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that your database stores data with UTF-8 encoding, though you say that's already the case.
You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &lt;, &gt;, &amp;, &apos; and &quot;. All other entities need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
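On PHP 5.4 or later, a small wrapper around htmlspecialchars() with the ENT_XML1 flag may be enough. This is only a sketch: xml_escape() is a made-up name, not a built-in, and it assumes the input is already valid UTF-8 and the XML document is UTF-8 as well.

```php
<?php
// Hypothetical helper: escapes only the five characters XML treats specially
// and leaves everything else (including ü) as plain UTF-8.
function xml_escape(string $text): string
{
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$label = xml_escape('Osnabrück & "Co" <test>');
echo $label, "\n"; // Osnabrück &amp; &quot;Co&quot; &lt;test&gt;

// Round trip: the parser hands back the original text.
$xml = simplexml_load_string("<label>$label</label>");
echo (string) $xml, "\n";
```

Because no named HTML entities are produced, the parser never has to resolve anything beyond the predefined XML ones.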
Hope that helps.
I am getting this error in my local site.
Warning (2): htmlspecialchars(): Invalid multibyte sequence in argument in [/var/www/html/cake/basics.php, line 207]
Does anyone know what the problem is, or what the solution should be?
Thanks.
Be sure to specify the encoding to UTF-8 if your files are encoded as such:
htmlspecialchars($str, ENT_COMPAT, 'UTF-8');
Before PHP 5.4 the default charset for htmlspecialchars() was ISO-8859-1 (as of PHP 5.4 the default became 'UTF-8'), which might explain why things go haywire when it meets multibyte characters.
I ran in to this error on production and found this great post about it -
http://insomanic.me.uk/post/191397106/php-htmlspecialchars-htmlentities-invalid
It appears to be a bug in PHP (for CentOS at least) that displays this error only when display_errors is Off!
You are feeding corrupted character data into the function, or not specifying the right encoding.
I had this issue a while ago, old behavior (prior to PHP 5.2.7 I believe) was to return the string despite corruption, but since that version it will throw this error instead.
My solution involved writing a script to feed my strings through iconv using the //IGNORE modifier to remove corrupted data.
(We had a corrupted database which had some strings in UTF-8, some in latin-1 usually with incorrectly defined character types on the columns).
Looking at the comment to Tatu's answer, I would start by looking at (and playing with) the contents of the $charset variable.
The correct code in order not to get any error is:
htmlentities($string, ENT_IGNORE, 'UTF-8') ;
Beside this you can also use str_replace to replace some bad characters to your needs and then use htmlentities function.
Have a look at this RSS feed: it replaced the HTML greater-than sign with the &gt; entity, which might not look nice when reading the RSS feed. You can replace it with something like a "-" or ")" sign instead.
Had the same problem because I was using substr on utf-8 string.
Error was infrequent and seemingly random. Error occurred only if string was cut on multibyte char!
mb_substr solved the problem :)
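A quick way to see the difference, as a sketch (it needs the mbstring extension, and the file must be saved as UTF-8):

```php
<?php
$text = 'äöü'; // three characters, six bytes in UTF-8

// substr() counts bytes, so a length of 3 cuts the second character in half.
$cut = substr($text, 0, 3);
var_dump(mb_check_encoding($cut, 'UTF-8')); // bool(false)

// mb_substr() counts characters, so the result is still valid UTF-8.
$safe = mb_substr($text, 0, 2, 'UTF-8');
var_dump(mb_check_encoding($safe, 'UTF-8')); // bool(true)
echo $safe, "\n"; // äö
```

This also explains why the error looks random: it only appears when the byte offset happens to fall inside a multibyte character.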
That's actually one of the most frequent errors I get.
Sometimes I don't use the __() translation function, just plain German text containing äöü.
There it is especially important to mind the encoding of the files.
So make sure you properly save the files that contain special chars as UTF8.
I'm getting the error:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20
When trying to process an XML response using simplexml_load_string from a 3rd party source. The raw XML response does declare the content type:
<?xml version="1.0" encoding="UTF-8"?>
Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish, and it contains words like Dublín.
I'm unable to get the 3rd party to sort out their XML.
How can I pre-process the XML and fix the encoding incompatibilities?
Is there a way to detect the correct encoding for a XML file?
Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
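You can check that reading of the bytes yourself. A small sketch, assuming the mbstring extension is available and the script is saved as UTF-8:

```php
<?php
// The four bytes quoted in the parser error.
$bytes = "\xED\x6E\x2C\x20";

// 0xED starts a three-byte UTF-8 sequence, but 0x6E is not a continuation byte.
var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(false)

// Interpreted as ISO-8859-1, the same bytes are readable text.
$decoded = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
var_dump($decoded === 'ín, '); // bool(true)
echo bin2hex($decoded), "\n";  // c3ad6e2c20
```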
Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them would be to use utf8_encode(). The downside is that if that XML contains both valid UTF-8 and some ISO-8859-1 then the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (They won't, but you can at least ignore the invalid characters so you can load your XML.)
Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.
Either way, notify your data provider that they're sending invalid data so that they can fix it.
Here's a partial fix. It will definitely not fix everything, but will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}
function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}
I solved this using
$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
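Calling utf8_encode() on everything will double-encode a feed that is already valid UTF-8. A slightly more careful sketch converts only when the input fails a UTF-8 check. It assumes that anything invalid is ISO-8859-1, which is a guess about the provider, and the helper name to_utf8_assuming_latin1() and the sample XML are made up for illustration.

```php
<?php
// Hypothetical helper: leaves valid UTF-8 untouched and treats everything
// else as ISO-8859-1 (an assumption about the feed, not a detection).
function to_utf8_assuming_latin1(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

// Sample response: declared as UTF-8, but "í" is the single ISO-8859-1 byte 0xED.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<city>Dubl\xEDn, Ireland</city>";

$xml = simplexml_load_string(to_utf8_assuming_latin1($raw));
echo (string) $xml, "\n"; // Dublín, Ireland
```

A feed that mixes both encodings in one document would still come out with mojibake, so this is no substitute for the provider fixing their output.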
If you are sure that your xml is encoded in UTF-8 but contains bad characters, you can use this function to correct them :
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
We recently ran into a similar issue and were unable to find anything obvious as the cause. There turned out to be a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.
We managed to solve our problem thanks to this post and this:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
Instead of using JavaScript, you can simply put this line of code after your mysql_connect call:
mysql_set_charset('utf8',$connection);
Cheers.
Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.
If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).
If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)
The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it's only information for the validator or another resource.
I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.
I used Sublime to change the file encoding to utf-8, and lxml imported it no issues.
After several tries I found that the htmlentities function works.
$value = htmlentities($value);
What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.
And here is a piece of code that could be useful to anyone out there:
$product_desc = ..;
//Filter your $product_desc here. Remove tags, strip, do all you would do to print XML
try{(new SimpleXMLElement('<sth><![CDATA['.$product_desc.']]></sth>'))->asXML();}
catch(Exception $exc) {$product_desc = '';}; //Don't print trash
Note that part.
<![CDATA[]]>
When you try to create an XML out of it, be sure to pass it the final product a browser would see, meaning, having your field wrapped with CDATA
When generating mapping files using Doctrine I ran into the same issue. I fixed it by removing all the comments that some fields had in the database.