Removing invisible characters from UTF-8 XML data - php

I am consuming an XML feed which contains a great deal of whitespace.
When I echo out the raw feed, it looks as though the columns of the tabled data are properly formatted with just the white space.
I have tried many regex patterns to remove it or to allow only visible characters, as well as trim, chop, and UTF-8 encode/decode, and nothing touches it. It's like it is laughing in my face when I echo out a value and see this:
string(17) "72"
I opened the data in Notepad++ with "show all characters" on, and it simply shows them as spaces. I am at a loss as to where to go with this.
I did receive the following error:
simplexml_load_string(): Entity: line 265: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x43 0x20 0x74

I just found this regex (untested)
$xml_data = preg_replace("/>\s+</", "><", $xml_data);
If you are using the xml parser, I think you can use the 'XML_OPTION_SKIP_WHITE' option referenced here:
http://php.net/manual/en/function.xml-parser-set-option.php
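A rough sketch of what using that option might look like with PHP's xml extension. The helper name xml_leaf_values is made up for illustration. As far as I can tell the option only skips values that consist entirely of whitespace, so padding inside a value may still need a trim():

```php
<?php
// Hypothetical helper: collect tag => text pairs with the expat-based parser,
// asking it to skip whitespace-only values.
function xml_leaf_values(string $xml): array
{
    $parser = xml_parser_create('UTF-8');
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    // Skip values made up only of whitespace characters.
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);

    $values = [];
    $index = [];
    xml_parse_into_struct($parser, $xml, $values, $index);

    $out = [];
    foreach ($values as $node) {
        if (isset($node['value'])) {
            // Later tags with the same name overwrite earlier ones in this sketch.
            $out[$node['tag']] = $node['value'];
        }
    }
    return $out;
}
```

Calling xml_leaf_values("<row>\n  <qty>72</qty>\n</row>") should give you an array with a 'qty' key holding the string 72.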

Try running the data through utf8_encode() - it might seem like a hack, but it seems like the originating data isn't properly set up.
My theory is that you're grabbing it with the wrong encoding, and the proper solution would be to load it differently.
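One way "loading it differently" could look, as a sketch only. It assumes the feed is really ISO-8859-1 (the 0xB0 byte in the error is the degree sign in that charset), that its XML declaration says UTF-8 or is missing, and that the mbstring and SimpleXML extensions are installed. load_feed_xml is a made-up name. It uses mb_convert_encoding() rather than utf8_encode() because the latter is deprecated as of PHP 8.2:

```php
<?php
// Hypothetical helper: parse XML that claims UTF-8 but may really be ISO-8859-1.
function load_feed_xml(string $raw)
{
    // Convert only when the bytes are not already valid UTF-8,
    // so a correct feed is not double-encoded.
    if (!mb_check_encoding($raw, 'UTF-8')) {
        // Assumption: the real source charset is ISO-8859-1.
        $raw = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
    }
    // Returns a SimpleXMLElement, or false if parsing still fails.
    return simplexml_load_string($raw);
}
```

If the guess about the source charset is wrong, you will get mojibake instead of a parser error, so check a few values by eye.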

Solution
My very hacky workaround that works:
$raw = file_get_contents('http://stupidwebservice.com/xmldata.asmx/Feed');
$raw = urlencode(utf8_encode($raw));
$raw = str_replace('++','',$raw);
$raw = urldecode($raw);
URL-encoding after the UTF-8 encoding turned the spaces into + signs. I simply removed every pair of consecutive + signs and decoded the result. Works great.

Related

Encoding Error in PHP Generated XML File

I have generated an XML file in PHP using the DOMDocument class, the data was grabbed from a MySQL database. A lot of the data contains HTML markup, but I've encased all of it in a CDATA section.
At first the file had a lot of encoding errors, but running everything through utf8_encode() before putting it into the file seems to have fixed all the errors except one.
Here is the error I have right now:
error on line 5113 at column 450: Input is not proper UTF-8, indicate encoding !
Bytes: 0x14 0x31 0x30 0x30
I found some posts on here with similar errors, but they either have not solved my problem or just suggest using utf8_encode(). Here is the section that seems to be triggering the error:
...quiet portable package. ]]></Summary><Features><![CDATA[The EF4500iSE was designed for maximum fuel...
The error seems to be between CDATA[ and The, although I can't see any characters between there, and that piece is the same as every other CDATA block in the file. If I remove the entire Features element and its contents, the file loads up fine.
Here is the link to the file: http://test.hhdev.hothousemarketing.com/inventory.xml
The problem ended up being a non-ASCII character present within the CDATA tag, as pointed out by Colin in the comments of the question.
I was in a rush to solve this, so I just used a brute force method and ran everything through a regex replacement in addition to utf8_encode(). I used:
$output = preg_replace('/[^(\x20-\x7F)]*/','', $output);
I found this here: http://www.stemkoski.com/php-remove-non-ascii-characters-from-a-string/
Thanks to Colin and Francis for their contributions.
Some characters are just flat-out not permitted in XML, even in a CDATA section, even entity-encoded.
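You might be able to use this on a UTF-8 string (untested; the surrogate range U+D800 to U+DFFF is left out because PCRE rejects it in /u patterns, and valid UTF-8 cannot contain surrogates anyway):
$xml_legal_chars = preg_replace('/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{FFFE}\x{FFFF}]/u', '', $utf8string);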
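The same idea can be written the other way around, keeping only what the XML 1.0 spec's Char production allows. This is a sketch, with a made-up function name, and it assumes the input is already valid UTF-8 (preg_replace() returns null otherwise when /u is used):

```php
<?php
// Keep only code points allowed by the XML 1.0 "Char" production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function strip_xml_illegal_chars(string $utf8): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );
    if ($clean === null) {
        // With the /u modifier, preg_replace() gives null for malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}
```

Unlike the brute force ASCII-only regex, this leaves accented letters, tabs and newlines alone and only drops characters an XML parser would refuse, such as the 0x14 byte from the error message.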

Illegal non-standard quotes in XML

I'm allowing some user input on my website, that later is read in XML. Every once in a while I get these weird single or double quotes like this ”’. These are directly copied from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my xml. htmlentities did not seem to touch them.
Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.
EDIT- I forgot to clarify these quotes are not being used in attributes, but in the following way:
<SomeTag>User’s Input</SomeTag>
Don't disallow and/or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following encoding pragma at the top of your XML files:
<?xml version="1.0" encoding="UTF-8"?>
There may also be a UTF-8 option in the parser's API.
Edit: I just read that you're reading the XML directly in a browser. Most browsers listen to the encoding pragma!
Edit 2: Apparently, those quotes aren't even legal in UTF-8, so ignore what I said above. Instead, you might find what you're looking for here, where a similar problem is being discussed.
Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".
If you need to get rid of them, a simple global replace using a text editor will do the job fine.
But you might try to work out first why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply some unwanted transcoding is happening somewhere along the line).
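If replacing them is still what you want, the global replace can also be done in PHP instead of a text editor. A minimal sketch, assuming the input is UTF-8 and PHP 7 or later (for the "\u{...}" escapes). The function name is made up:

```php
<?php
// Map the four common typographic quotes to their typewriter equivalents.
// strtr() works on whole byte sequences, so no regex or /u modifier is needed.
function normalize_quotes(string $utf8): string
{
    return strtr($utf8, [
        "\u{2018}" => "'",  // left single quotation mark
        "\u{2019}" => "'",  // right single quotation mark
        "\u{201C}" => '"',  // left double quotation mark
        "\u{201D}" => '"',  // right double quotation mark
    ]);
}
```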
If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:
$html = htmlentities( '”’', ENT_COMPAT, "utf-8" );
echo $html;
For me gives:
&rdquo;&rsquo;
whereas
$html = htmlentities( '”’' );
echo $html;
gets confused (on PHP versions before 5.4, where the default charset was ISO-8859-1):
&acirc;??&acirc;??
If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
Stay away from Microsoft Office apps. Word, Excel etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with non-standard "smart quotes".
These quote characters are truly non-standard and never made it into the official latin-1 character set. All the MS Office apps "helpfully" replace standard quote characters with these abominations.
Just google for "undoing smartquotes" or "convert smartquotes back" for hints, tips and regexes to get rid of these.
Use
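$s = 'User’s Input';
// The /u modifier makes PCRE treat the pattern and subject as UTF-8, so whole characters are matched, not single bytes.
$descriptfix = preg_replace('/[“”]/u', '"', $s);
$descriptfix = preg_replace('/[‘’]/u', "'", $descriptfix);
// Function calls are not interpolated inside double-quoted strings, so concatenate instead.
echo '<SomeTag>' . htmlspecialchars($descriptfix, ENT_QUOTES | ENT_XML1, 'UTF-8') . '</SomeTag>';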

XML character encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from twitter and uploading them to my DB however when I output them to screen I get these characters:
"moved to dusseldorf.�"
OR
también
and if I have Russian characters then I get lots of ugly boxes in place.
What I would like is for the correct native accents to show under one encoding. I thought this was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities but I really appreciate a solution that works across the board on projects.
Thanks in advance.
Replace this line:
$data = htmlentities($data);
With this:
$data = htmlentities($data, ENT_COMPAT, "UTF-8");
That way, htmlentities() will treat the input as UTF-8 and convert each multibyte character as a whole instead of mangling it byte by byte. For more information see the documentation for htmlentities().
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data in an HTML context.
Make sure that you set your PHP internal encoding to UTF-8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that your database stores data with UTF-8 encoding, though you say that's already the case.
You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &gt;, &lt;, &amp;, &quot; and &apos;. All other characters need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.
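For what it's worth, something along these lines can stand in for the missing xmlentities(). This is not philsXMLClean(), just a sketch with a made-up name. It assumes UTF-8 input and the mbstring extension. Note that htmlspecialchars() returns an empty string if the input is not valid UTF-8:

```php
<?php
// Hypothetical xmlentities() replacement: predefined XML entities for the
// markup characters, decimal numeric references for everything above ASCII.
function xml_entities(string $utf8): string
{
    // Escape &, <, > and both quote characters first, so the ampersands
    // of the numeric references added below are not escaped again.
    $escaped = htmlspecialchars($utf8, ENT_QUOTES | ENT_XML1, 'UTF-8');

    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}
```

If the XML file itself is declared and saved as UTF-8, the second step is optional, since the characters can then stay as they are.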

html entities decoding in php

I seem to be completely unable to get around utf-8 character encoding.
So I'm exporting content from a database as a utf-8 xml file.
The software I am importing into is quite strict about character encoding, so I can't just put everything in CDATA tags.
There's a whole bunch of weird characters, e.g. &rsquo;, &mdash;, &hellip; already in the data.
These aren't working in the xml and need to be replaced out (normally with just a ' quote).
Ideally, I'd like to decode all the characters, and then use htmlspecialchars($text, ENT_COMPAT, 'UTF-8', FALSE) to encode them back again. But I can't seem to find a function that will decode them. Is there one?
I've started to manually go through each entity with a str_replace() but it's turning into a much bigger job than I anticipated.
Any help would be a lifesaver.
Thanks
html_entity_decode() perhaps?
In some cases of character conversion issues in PHP, it is important to have a locale set. It doesn't matter which, e.g.
setlocale(LC_CTYPE,'en_US.utf8');
But I would advise that any time invested in getting the encoding right from the beginning, without reverting to entities, if at all possible, is worth it.
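To make the decode-then-re-encode idea from the question concrete, here is a small sketch. The function name is made up, ENT_HTML5 is chosen only because it knows the largest set of named entities, and the data is assumed to be UTF-8:

```php
<?php
// Hypothetical helper: turn HTML entities back into real characters,
// then escape only what XML itself requires.
function reencode_for_xml(string $text): string
{
    // Decode named and numeric entities (&rsquo;, &mdash;, &#8217; ...) to UTF-8.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Re-escape &, <, > and double quotes for use in XML.
    return htmlspecialchars($decoded, ENT_COMPAT | ENT_XML1, 'UTF-8');
}
```

The result contains the actual ’ and — characters as UTF-8 bytes, which any XML consumer that honours the declared encoding should accept. If the importer really needs a plain ' instead, a str_replace() on the decoded string is the place to do it.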

Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

I'm getting the error:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20
This happens when I try to process an XML response from a 3rd party source using simplexml_load_string. The raw XML response does declare the encoding:
<?xml version="1.0" encoding="UTF-8"?>
Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish and it contains words like Dublín.
I'm unable to get the 3rd party to sort out their XML.
How can I pre-process the XML and fix the encoding incompatibilities?
Is there a way to detect the correct encoding for a XML file?
Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
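You can check that reading of the bytes yourself. A small sketch, assuming the mbstring extension is available. The function name is made up:

```php
<?php
// Hypothetical helper: report whether the bytes from a libxml error message
// are valid UTF-8, and what they say when read as ISO-8859-1.
function inspect_reported_bytes(string $bytes): array
{
    return [
        'valid_utf8' => mb_check_encoding($bytes, 'UTF-8'),
        'as_latin1'  => mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'),
    ];
}

// The four bytes from the error message in the question.
$report = inspect_reported_bytes("\xED\x6E\x2C\x20");
```

Here $report['valid_utf8'] is false, because 0xED starts a three-byte sequence and 0x6E is not a continuation byte, and $report['as_latin1'] is the UTF-8 string "ín, ".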
Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them would be to use utf8_encode(). The downside is that if that XML contains both valid UTF-8 and some ISO-8859-1 then the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (They won't, but you can at least ignore the invalid characters so you can load your XML.)
Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.
Either way, notify your data provider that they're sending invalid data so that they can fix it.
Here's a partial fix. It will definitely not fix everything, but will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}
function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}
I solved this using
$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
If you are sure that your XML is encoded in UTF-8 but contains bad characters, you can use this function to correct them:
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
We recently ran into a similar issue and were unable to find anything obvious as the cause. There turned out to be a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.
We managed to solve our problem thanks to this post and this:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
Instead of using javascript, you can simply put this line of code after your mysql_connect call:
mysql_set_charset('utf8',$connection);
Cheers.
Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.
If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).
If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
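A sketch of that, with a made-up function name and a candidate list that is only an assumption about what the feed might be. Strict mode is what makes it useful here: without it, invalid byte sequences do not necessarily rule a candidate out:

```php
<?php
// Hypothetical helper: guess between UTF-8 and ISO-8859-1.
function guess_feed_encoding(string $raw)
{
    // Third argument true = strict: a candidate is rejected as soon as
    // the string contains a byte sequence that is invalid in it.
    return mb_detect_encoding($raw, ['UTF-8', 'ISO-8859-1'], true);
}
```

Keep in mind that every byte string is "valid" ISO-8859-1, so an answer of ISO-8859-1 really only tells you the data is not valid UTF-8. It cannot tell ISO-8859-1 apart from, say, Windows-1252.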
If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)
The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it only tells the parser or validator what encoding to expect.
I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.
I used Sublime to change the file encoding to utf-8, and lxml imported it no issues.
After several tries I found that the htmlentities function works.
$value = htmlentities($value);
What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.
And here is a piece of code that could be useful to anyone out there:
$product_desc = ..;
//Filter your $product_desc here. Remove tags, strip, do all you would do to print XML
try{(new SimpleXMLElement('<sth><![CDATA['.$product_desc.']]></sth>'))->asXML();}
catch(Exception $exc) {$product_desc = '';}; //Don't print trash
Note that part.
<![CDATA[]]>
When you try to create an XML out of it, be sure to pass it the final product a browser would see, meaning, having your field wrapped with CDATA
When generating mapping files using doctrine I ran into same issue. I fixed it by removing all comments that some fields had in the database.
