Encoding Error in PHP Generated XML File - php

I have generated an XML file in PHP using the DOMDocument class; the data was grabbed from a MySQL database. A lot of the data contains HTML markup, but I've encased all of it in a CDATA section.
At first the file had a lot of encoding errors, but running everything through utf8_encode() before putting it into the file seems to have fixed all the errors except one.
Here is the error I have right now:
error on line 5113 at column 450: Input is not proper UTF-8, indicate encoding !
Bytes: 0x14 0x31 0x30 0x30
I found some posts on here with similar errors, but none have solved my problem, or they just suggest using utf8_encode(). Here is the section that seems to be triggering the error:
...quiet portable package. ]]></Summary><Features><![CDATA[The EF4500iSE was designed for maximum fuel...
The error seems to be between CDATA[ and The, although I can't see any characters there, and that piece is the same as every other CDATA block in the file. If I remove the entire Features element and its contents, the file loads up fine.
Here is the link to the file: http://test.hhdev.hothousemarketing.com/inventory.xml

The problem ended up being a non-ASCII character present within the CDATA tag, as pointed out by Colin in the comments of the question.
I was in a rush to solve this, so I just used a brute-force method and ran everything through a regex replacement in addition to utf8_encode():
$output = preg_replace('/[^(\x20-\x7F)]*/','', $output);
I found this here: http://www.stemkoski.com/php-remove-non-ascii-characters-from-a-string/
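For reference, here is roughly how that cleanup slots into the DOMDocument code; a minimal sketch only, where $doc, $item and $row['features'] are placeholders and not my actual variable names:
$clean = utf8_encode($row['features']);                  // fix the Latin-1 / UTF-8 mismatch first
$clean = preg_replace('/[^(\x20-\x7F)]*/', '', $clean);  // then brute-force away non-ASCII bytes
$features = $doc->createElement('Features');
$features->appendChild($doc->createCDATASection($clean));
$item->appendChild($features);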
Thanks to Colin and Francis for their contributions.

Some characters are just flat-out not permitted in XML, even in a CDATA section, even entity-encoded.
You might be able to use this on a UTF-8 string (untested):
$xml_legal_chars = preg_replace('/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{D800}-\x{DFFF}\x{FFFE}\x{FFFF}]/u', '', $utf8string);

Related

Removing invisible characters from UTF-8 XML data

I am consuming an XML feed which contains a great deal of whitespace.
When I echo out the raw feed, it looks as though the columns of the tabled data are properly formatted with just the white space.
I have tried many regex patterns to remove it (allowing only visible characters), trim, chop, UTF-8 encode/decode; nothing is touching it. It's like it's laughing in my face when I echo out a value and see this:
string(17) "72"
Opened the data in Notepad++ with show all characters on, and it simply shows it as spaces. I am at a loss of where to go with this.
I did receive the following error:
simplexml_load_string(): Entity: line 265: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x43 0x20 0x74
I just found this regex (untested)
$xml_data = preg_replace("/>\s+</", "><", $xml_data);
If you are using the xml parser, I think you can use the 'XML_OPTION_SKIP_WHITE' option referenced here:
http://php.net/manual/en/function.xml-parser-set-option.php
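An untested sketch of what that option might look like with the expat-based parser (the variable names are just for illustration):
$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1); // drop whitespace-only text nodes
xml_parse_into_struct($parser, $xml_data, $values, $index);
xml_parser_free($parser);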
Try running the data through utf8_encode() - it might seem like a hack, but it seems like the originating data isn't properly setup.
My theory is that you're grabbing it with the wrong encoding, and the proper solution would be to load it differently.
Solution
My very hacky workaround that works:
$raw = file_get_contents('http://stupidwebservice.com/xmldata.asmx/Feed');
$raw = urlencode(utf8_encode($raw));
$raw = str_replace('++','',$raw);
$raw = urldecode($raw);
urlencode() after the UTF-8 encoding turned the invisible spaces into +'s. I simply removed all instances of double ++ and decoded it back. Works great.
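If the invisible bytes turn out to be non-breaking spaces (0xA0 in Latin-1, 0xC2 0xA0 after utf8_encode()), a more direct variation on the same idea might be to strip them explicitly; an untested sketch, assuming the feed really is Latin-1:
$raw = file_get_contents('http://stupidwebservice.com/xmldata.asmx/Feed');
$raw = utf8_encode($raw);                                  // assumes the feed is Latin-1
$raw = str_replace(array("\xC2\xA0", "\xA0"), '', $raw);   // drop non-breaking space bytes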

How to avoid echoing character 65279 in php?

I have encountered a similar problem described here (and in other places) -
whereas on an AJAX callback I get an xmlhttp.responseText that seems OK (when I alert it, it shows the right text), but when I use an 'if' statement to compare it to the string, it returns false.
(I am also the one who wrote the server-side code returning that string.) After much studying of the string, I've discovered that it had an "invisible character" as its first character, a character that was not shown. If I copied it to Notepad and then deleted the first character, it wouldn't delete until I pressed Delete again.
I did a charCodeAt(0) for the returned string in xmlhttp.responseText. And it returned 65279.
Googling it reveals that it is the Unicode byte order mark (BOM), which is supposed to indicate "big-endian" or "little-endian" encoding.
So now I know the cause of the problem, but... why is that character being echoed?
In the source php I simply use
echo 'the string'...
and it apparently somehow outputs [chr(65279)]the string...
Why? And how can I avoid it?
To conclude, and specify the solution:
Windows Notepad adds the BOM character (the 3 bytes: EF BB BF) to files saved with utf-8 encoding.
PHP doesn't seem to be bothered by it, unless you include one PHP file into another; then things get messy and strings get displayed with character 65279 prepended to them.
You can edit the file with another text editor such as Notepad++ and use the encoding
"Encode in UTF-8 without BOM",
and this seems to fix the problem.
Also, you can save the other PHP file with ANSI encoding in Notepad, and this also seems to work (provided you don't actually use any extended characters in the file, I guess...).
If you want to print a string that contains the ZERO WIDTH NO-BREAK SPACE char (e.g., by including an external non-PHP file), try the following code:
echo preg_replace("/\xEF\xBB\xBF/", "", $string);
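If you prefer to avoid a regex, an equivalent check (assuming the BOM can only appear at the very start of the string) is:
if (substr($string, 0, 3) === "\xEF\xBB\xBF") {
    $string = substr($string, 3); // strip the UTF-8 BOM
}
echo $string;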
If you are using Linux or Mac, here is an elegant solution to get rid of the BOM character in PHP.
If you are using WordPress (25% of Internet websites are powered by WordPress), the chances are that a plugin or the active theme is introducing the BOM character due to a file that contains a BOM (maybe that file was edited in Windows). If that's the case, go to your wp-content/themes/ folder and run the following command:
grep -rl $'\xEF\xBB\xBF' .
This will search for files with BOM. If you have .php results in the list, then do this:
Rename the file to something like filename.bom.bak.php
Open the file in your editor and copy the content to the clipboard.
Create a new file and paste the content from the clipboard.
Save the file with the original name filename.php
If you are dealing with this locally, then eventually you'd need to re-upload the new files to the server.
If you don't have results after running the grep command and you are using WordPress, then another place to check for BOM files is the /wp-content/plugins folder. Go there and run the command again. Alternatively, you can start by deactivating all the plugins and then check whether the problem is solved while you activate the plugins again.
If you are not using WordPress, then go to the root of your project folder and run the command to find files with BOM. If any file is found, then run the four steps procedure described above.
You can also remove the character in javascript with:
myString = myString.replace(String.fromCharCode(65279), "" );
I had this problem and changed my encoding to UTF-8 without BOM, ANSI, etc. with no luck. My problem was caused by using a PHP include function in the HTML body. Moving the include function to above my HTML (above the !DOCTYPE tag) resolved the issue.
After I knew my issue I tested the include, include_once and require functions. All attempts to include a file from within the HTML body created the extra stray BOM character at the spot where the PHP code would start.
I also tried to assign the result of the include to a variable, i.e. $result = include("myfile.txt");, with the same extra character being added.
Please note that moving the include above the HTML does not stop the extra character from being output; it just moves it out of my data and out of the content area.
In addition to the above, I just had this issue when pulling some data from a MySQL database (charset is set to UTF-8). The issue was the HTML tags: I allowed some basic ones like <p> and <a>, and when I displayed the data on the page I got the &#65279 character when looking through Dev Tools in Chrome.
So I removed the tags from the table and that removed the &#65279 issue (and the blank line above where the text was to be displayed).
I just wanted to add to this, since my Rep isn't high enough to actually comment on the answer.
EDIT: Using VIM I was able to remove the BOM with :set nobomb and you can confirm the presence of the BOM with :set bomb? which will display either bomb or nobomb
I use "Dreamweaver CC 2015", by default it has this option enabled: "include BOM signature" or something like that, when you click on save as option from file menu. In the window that apears, you can see "Unicode Options..". You can disable the BOM option. And remeber to change all your files like that. Or you can simply go to preferences and disable the BOM option and save all your files.
I'm using the PhpStorm IDE to develop php pages.
I had this problem and use this option of IDE to remove any BOM characters and problem solved:
File -> Remove BOM
Try to find options like this in your IDE.
Probably something on the server. If you know it's there, I would just bypass it until solved.
myString = myString.substring(1)
Chops off the first character.
When using Atom, it shows up as whitespace at the start of the document, before <?php.
A Linux solution to find and remove this character from a file is to use sed -i 's/\xEF\xBB\xBF//g' your-filename-here
My solution is to create a PHP file with this content:
<?php
header("Content-Type:text/html;charset=utf-8");
?>
Save it as ANSI, then have the other PHP files require/include it before any HTML or PHP code.
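For example, assuming the snippet above is saved as charset-header.php (the filename here is just an illustration), the other files would start with:
<?php
require 'charset-header.php'; // must run before any HTML or other output
?>
<!DOCTYPE html>
<!-- rest of the page as usual -->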

XML character encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from Twitter and uploading them to my DB; however, when I output them to the screen I get these characters:
"moved to dusseldorf.�"
OR
también
and if I have Russian characters then I get lots of ugly boxes in their place.
What I would like is for the correct native accents to show under one encoding. I thought this was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities, but I would really appreciate a solution that works across the board on projects.
Thanks in advance.
Replace this line:
$data = htmlentities($data);
With this:
$data = htmlentities($data, null, "UTF-8");
That way, htmlentities() will leave valid UTF-8 characters alone. For more information see the documentation for htmlentities().
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data a in HTML context.
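A rough sketch of what that setup might look like with the old mysql_* API the question uses ($connection and the column name are placeholders):
mysql_set_charset('utf8', $connection);          // make the connection speak UTF-8
$data = trim($data);
$data = mysql_real_escape_string($data);         // escape for SQL only, no htmlentities()
// ...later, when printing in an HTML context:
echo htmlspecialchars($row['tweet'], ENT_QUOTES, 'UTF-8');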
Make sure that you set your PHP internal encoding to UTF-8 using iconv_set_encoding(), and that you call htmlentities() with the encoding information as EdoDodo said. Also make sure that your database stores data with UTF-8 encoding, though you say that's already the case.
You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &gt;, &lt; and &amp;. All other entities need to be represented using their numeric form.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
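If you are on PHP 5.4+, one simple alternative (not the philsXMLClean() approach mentioned above, just a sketch) is to escape only the characters XML predefines and leave the rest as raw UTF-8:
function xml_escape($string)
{
    // ENT_XML1 makes htmlspecialchars() emit XML-safe numeric/predefined entities only
    return htmlspecialchars($string, ENT_QUOTES | ENT_XML1, 'UTF-8');
}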
Hope that helps.

Zend_Config_XML encoding issue

I am creating an XML navigation for my website. The line below is causing a simpleXML issue:
<label>Osnabrück</label>
My PHP code, using htmlentities(), has changed Osnabrück into Osnabr&Atilde;&frac14;ck. However, when trying to parse my XML with this line in it, I get this error:
/application/configs/navigation.xml:318: parser error : Entity 'Atilde' not defined simplexml_load_file()
Should I not be using htmlentities()? Or is there some kind of setting I'm missing?
Kind Regards
Steve
You should not be using HTML Entities in XML. Using normal UTF-8 characters should be fine.
The occurrence of Osnabr&Atilde;&frac14;ck means that at some point, most likely, the city name is processed as ISO-8859-1 instead of UTF-8. It is not htmlentities()'s fault. You need to find that point and fix it.
You can use the iconv() function to convert to UTF-8 dynamically.
iconv("ISO-8859-1", "UTF-8", $text);

Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

I'm getting the error:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20
when trying to process an XML response using simplexml_load_string() from a 3rd-party source. The raw XML response does declare the encoding:
<?xml version="1.0" encoding="UTF-8"?>
Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish and it contains words like Dublín.
I'm unable to get the 3rd party to sort out their XML.
How can I pre-process the XML and fix the encoding incompatibilities?
Is there a way to detect the correct encoding for a XML file?
Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
Now there are a few ways to work it around, which you should only use if you cannot load the XML normally. One of them would be to use utf8_encode(). The downside is that if that XML contains both valid UTF-8 and some ISO-8859-1 then the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (they won't, but you can at least ignore the invalid characters so you can load your XML)
Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.
Either way, notify your data provider that they're sending invalid data so that they can fix it.
Here's a partial fix. It will definitely not fix everything, but it will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}
I solved this using
$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
If you are sure that your xml is encoded in UTF-8 but contains bad characters, you can use this function to correct them :
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
We recently ran into a similar issue and were unable to find anything obvious as the cause. There turned out to be a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.
We managed to solve our problem thanks to this post and this:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
Instead of fixing it in JavaScript, you can simply put this line of code after your mysql_connect() call:
mysql_set_charset('utf8',$connection);
Cheers.
Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.
If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).
If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
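An untested sketch of that last idea, assuming the only candidates are UTF-8 and ISO-8859-1:
$enc = mb_detect_encoding($xml, array('UTF-8', 'ISO-8859-1'), true);
if ($enc !== false && $enc !== 'UTF-8') {
    $xml = iconv($enc, 'UTF-8', $xml); // convert whatever was detected to UTF-8
}
$feed = simplexml_load_string($xml);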
If you download the XML file and open it, for example, in Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)
The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it's only information for the validator or another consumer.
I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.
I used Sublime to change the file encoding to utf-8, and lxml imported it with no issues.
After several tries I found that the htmlentities() function works:
$value = htmlentities($value);
What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.
And here is some piece of code that could be useful to anyone out there:
$product_desc = ..;
//Filter your $product_desc here. Remove tags, strip, do all you would do to print XML
try {
    (new SimpleXMLElement('<sth><![CDATA[' . $product_desc . ']]></sth>'))->asXML();
} catch (Exception $exc) {
    $product_desc = ''; // Don't print trash
}
Note that part:
<![CDATA[ ]]>
When you try to create XML out of it, be sure to pass in the final product a browser would see, meaning your field wrapped in CDATA.
When generating mapping files using Doctrine I ran into the same issue. I fixed it by removing all the comments that some fields had in the database.
