MYSQL REPLACE in QUERY and CDATA output usage still causes broken XML - php

I have a PHP script that queries a MySQL database and displays that information in XML format output.
I have a troublesome column that I have no control over (I can only SELECT). This column is full of carriage returns and similar characters.
In the MySQL query, I have tried to use REPLACE on this column like this:
REPLACE(PropertyInformation, '\r\n', '') AS PropertyInformation
In the PHP script I also wrap the exported XML in CDATA as I was told this could help, like this:
<Description><![CDATA[' . $PropertyInformation . ']]></Description>
I also form the XML like this in the script:
header("Content-Type: text/xml;charset=UTF-8");
echo '<?xml version="1.0" encoding="UTF-8"?>

The result is broken because your data is not UTF-8, even though the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) claims it is.
You need to convert your data to UTF-8.
There are three ways to do that:
convert the data in the database to UTF-8, or
convert it in the SELECT statement, or
convert it to UTF-8 in PHP, leaving the data in the database as-is.
For the first option, take a dump of the database, run it through iconv, and import it back.
For the second, use SELECT CONVERT(latin1column USING utf8) ...
For the third, use iconv in PHP. Assuming your data is ISO-8859-1: $converted = iconv("ISO-8859-1", "UTF-8", $text);
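A minimal sketch of the third option, combined with the CDATA wrapping from the question. The helper names (to_utf8, strip_invalid_xml_chars, cdata) are hypothetical, and ISO-8859-1 as the source encoding is an assumption; check what the column really holds. The sketch also drops control characters that XML 1.0 does not allow, since CDATA does not protect against those, and it splits any "]]>" inside the data so the section cannot end early.

```php
<?php
// Hypothetical helpers. ISO-8859-1 as the source encoding is an assumption.

function to_utf8(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    // Leave strings that are already valid UTF-8 untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

function strip_invalid_xml_chars(string $utf8): string
{
    // Keep only code points allowed by the XML 1.0 Char production.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    // preg_replace() returns null if the subject is not valid UTF-8.
    return preg_replace($pattern, '', $utf8) ?? '';
}

function cdata(string $utf8): string
{
    // "]]>" would close the section early, so split it across two sections.
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $utf8) . ']]>';
}

// A Latin-1 byte (0xE9) plus a vertical tab (0x0B), which XML 1.0 forbids.
$raw   = "Caf\xE9 with garden\x0B";
$clean = strip_invalid_xml_chars(to_utf8($raw));

echo '<Description>' . cdata($clean) . '</Description>', "\n";
```

Converting first and stripping second matters here, because the /u pattern only works on valid UTF-8 input.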

Related

simplexml_load_string - parse error due to unicode characters in payload

I have a problem with simplexml_load_string erring with parse errors due to an xml payload coming from a database with unicode characters in it.
I'm at a loss how to get php to read this and use the xml like I normally would. The code has been working fine until people were getting creative with data being submitted.
Unfortunately I cannot modify the source data, I have to work with what I receive, to give you an idea, one field that's breaking it in the original raw receipt looks like :
<FirstName>🐺</FirstName>
Previously the code works fine by parsing the xml with a simple line of :
$xmlresult = simplexml_load_string($result, 'SimpleXMLElement',LIBXML_NOCDATA);
However with these unicode characters, it just errors.
Depending on what I use to view the data if I dump the raw payload it can look like:
<d83d><dc3a>
or <U+D83D><U+DC3A>
Reading a bit on stack, it seemed DOM might work but didn't have any luck there either.
The incoming payload does have the header:
<?xml version="1.0" encoding="UTF-8"?>
data comes in via
<data type="cdata"><![CDATA[<payload>
I'm at a complete loss, hopefully can get some help here to get me over this hump with this data handling.
I've been staring at this for days and it seems one thing I didn't try was to wrap my curl call function with utf8_encode like this :
$result = utf8_encode(do_curl($xmlbuildquery));
My do_curl function is just a separate function to call the curl procedure, nothing more.
Doing that, I'm able to parse the results. Instead of those unicode characters showing up, it now displays as
[firstname] => 🐺
(the above is the result of print_r($result); after the call below)
$xmldata = simplexml_load_string((string)$xmlresult->body->function->data);
With that in place the XML finally parses. Oddly, this sparked my curiosity further, as this information is provided via a CSV that's imported into a MySQL database, and when I look up the same record it's shown as:
FirstName: ????
with the column defined as:
FirstName varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
That might suggest they're not UTF-8 encoding the output to the CSV, perhaps. Separate from this issue, but interesting.
And finally, my script is able to run again!!
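Note that utf8_encode() converts from ISO-8859-1 to UTF-8 regardless of what the input really is, and it is deprecated as of PHP 8.2. Before reaching for it, it can help to find out whether the payload is valid UTF-8 at all and what libxml is complaining about. The sketch below does only that; parse_utf8_xml is a hypothetical helper, and it does not repair anything.

```php
<?php
// Hypothetical helper: returns [SimpleXMLElement|null, list of messages].
function parse_utf8_xml(string $payload): array
{
    // Reject byte sequences that are not valid UTF-8 before libxml sees them.
    if (!mb_check_encoding($payload, 'UTF-8')) {
        return [null, ['payload is not valid UTF-8']];
    }

    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($payload, 'SimpleXMLElement', LIBXML_NOCDATA);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$xml === false ? null : $xml, $messages];
}

// A well-formed payload with a 4-byte UTF-8 character parses normally.
[$ok, $okMessages] = parse_utf8_xml("<r><FirstName>\u{1F43A}</FirstName></r>");
echo (string) $ok->FirstName, "\n";

// A stray ISO-8859-1 byte (0xFC) is reported instead of producing a warning.
[$bad, $why] = parse_utf8_xml("<r><FirstName>G\xFCrtel</FirstName></r>");
echo $why[0], "\n";
```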

Decoding XML from UTF-8 to ISO-8859-1 in PHP

I'm trying to "decode" an XML file (and transforming it with XSLT), but I'm having trouble decoding both files. The scenario is as follows:
I have a site for data entry which is all encoded in ISO-8859-1 (our Oracle database is in that format, so I can't change it). The problem is, I have those 2 files (an XML to show the data entry form and an XSLT to transform it into HTML). Both files are saved in ISO-8859-1 encoding, and both have the corresponding header, i.e., <?xml version="1.0" encoding="ISO-8859-1"?>. Whenever I read the files and show them in the browser, the special characters (ñ, á, ¿) are shown either as UTF-8 or as a question mark (depending on the method I use for showing), but never as the "normal" representation.
My code for showing the XML file is:
<?php
$xslString = file_get_contents("catalog.xsl");
$xslString = utf8_decode($xslString);
$xslDoc = simplexml_load_string($xslString);
$xmlString = file_get_contents("questionnaire.xml");
$xmlString = utf8_decode($xmlString);
$xmlDoc = simplexml_load_string($xmlString);
$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
?>
I already tried several combinations of DOMDocument, iconv, mb_convert_encoding, but they show the XML file as unencoded UTF, a question mark or a double question mark.
On the other hand, this also messes up my data entry, since if I want to enter one of those characters, they either show as ? or ?? on the corresponding data field on the DB, or they get truncated at the first special char (if I use iconv).
What am I missing? Is there a workaround? I can't convert anything to UTF-8 because of the database.
I hope I'm being clear enough, please excuse my English.
Thanks in advance!
Hope this helps others. In the end, there were two things:
1) I was reading the XML/XSL files like this (in my original script):
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xmlFile);
$xmlDoc->load("xmlfile.xml");
?>
which effectively changed the encoding to UTF-8. I changed the lines to:
<?php
$xmlString = file_get_contents("xmlfile.xml");
$xmlDoc = simplexml_load_string($xmlString);
?>
removing the utf8_decode statement, and it worked like a charm. Now I get my special chars on screen as they're intended. As a side effect, the data entered in the form is now saved correctly to my database, so I got two birds in one shot.
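A short sketch of why the extra decoding step hurt: libxml reads the encoding from the XML declaration, and SimpleXML then hands every value back as UTF-8. If the page or database needs ISO-8859-1, one conversion at the output boundary is enough. The sample document and variable names below are illustrative, not taken from the original files.

```php
<?php
// An ISO-8859-1 document, built from raw bytes so this file's own
// encoding does not matter (0xF1 is "ñ" in ISO-8859-1).
$latin1Xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
           . "<question>Ma\xF1ana</question>";

// No utf8_decode() beforehand: the declaration tells libxml how to read it.
$doc = simplexml_load_string($latin1Xml);

// SimpleXML returns UTF-8, whatever the declaration said.
$utf8 = (string) $doc;

// Convert once, on the way out, for an ISO-8859-1 page or database.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";    // 4d61c3b1616e61
echo bin2hex($latin1), "\n";  // 4d61f1616e61
```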

Character encoding error, cannot write valid XML from MySQL via PHP

The feed in question is: http://api.inoads.com/snowstorm/feed.xml
Here is the PHP code I am using for the generation:
<?php
$database = 'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);
$query = "SELECT * FROM the_queue WHERE id LIKE '%' ORDER BY id DESC LIMIT 25";
$result = mysql_query($query, $dbconnect);
while ($line = mysql_fetch_assoc($result))
{
$return[] = $line;
}
$now = date("D, d M Y H:i:s T");
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<rss version=\"2.0\">
<channel>
<title>The Queue</title>
<link>http://readapp.net</link>
<description>A curated reading list.</description>
<language>en-us</language>
<pubDate>$now</pubDate>
<lastBuildDate>$now</lastBuildDate>
";
foreach ($return as $line)
{
$output .= "<item><title>".htmlspecialchars($line['title'])."</title>
<description>".htmlspecialchars($line['description'])."</description>
<link>".htmlspecialchars($line['link'])."</link>
<pubDate>".htmlspecialchars($line['pubDate'])."</pubDate>
</item>";
}
$output .= "</channel></rss>";
$fh = fopen('feed.xml', 'w');
fwrite($fh, $output);
?>
What might be causing the error?
Here's a link from the feed validator: http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fapi.inoads.com%2Fsnowstorm%2Ffeed.xml
You said the XML file is UTF-8, but when I download it and open it in my text editor it auto-detects the windows latin1 encoding, and the quotes display perfectly.
If I force my text editor to use UTF-8, it shows an error message because there are illegal characters for the UTF-8 encoding.
Therefore, your data is not UTF-8, it is latin1. You need to find out exactly where that's happening. It could be any one, or several of:
is the HTML page where the content is typed in by the user set to UTF-8?
If not, the browser will be sending latin1 quotes. To fix this, the first tag in your <head> needs to be:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>
is every browser correctly respecting your UTF-8 setting in that page's HTML?
If you specify UTF-8 and the page contains characters illegal in that encoding, some browsers might decide to use a different encoding despite the <meta> tag. How to check this is different in every browser.
is the MySQL connection when inserting into the database set to use UTF-8?
You need to be using UTF-8 here, or else MySQL may try to convert the encoding for you, often corrupting the data. Set the encoding with:
$database = 'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);
mysql_query('SET NAMES utf8', $dbconnect);
is the MySQL table (and individual column) set to use UTF-8?
Again, to avoid MySQL doing its own buggy conversion, you need to make sure it's using UTF-8 for the table and also for the individual column. Do a structure dump of the database and check for:
CREATE TABLE `the_queue` (
...
) ... DEFAULT CHARSET=utf8;
And also make sure there isn't something like this on any of the columns:
`description` varchar(255) CHARACTER SET latin1,
is the MySQL connection when reading the database set to use UTF-8?
Your read connection also needs to be utf8. So double check that.
are you doing anything in the PHP that cannot handle UTF-8?
PHP has some functions that corrupt UTF-8 strings when they are used with the wrong charset. One of those is htmlentities(): before PHP 5.4 it assumed ISO-8859-1 unless told otherwise, and it emits named HTML entities that XML does not define, so for XML output always use htmlspecialchars(). The easiest way to test this is to comment out big chunks of your code to see where the encoding breaks.
There is one problem here:
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
...
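The string contains "?>", which is PHP's closing tag. Inside a quoted string PHP does not treat it as a closing tag, so it should not cause an error by itself, but it can confuse some editors and syntax highlighters.
If you want to keep the sequence out of your source, split it like this: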
$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?".">
...
The point of htmlentities is to replace all characters that have defined HTML character entities with those entities. If you really don't want any character entities (as your desired result suggests), don't use htmlentities.
By default (before PHP 5.4), htmlentities uses the latin-1 charset, so it chokes on the smart quotes (indeed, on all multibyte characters), which is where the question marks come from. One fix is to use htmlspecialchars to convert a much more limited set of characters (&, <, >, ' and "). This will still convert the double quotes because, well, that's the point of htmlspecialchars, unless you specify ENT_NOQUOTES as the second argument. Another fix is to specify the character set as the third argument (this isn't exclusive of using htmlspecialchars).
The fourth argument to either function specifies whether or not to encode already encoded characters. Whether or not to double-encode depends on the source data.
$line['description'] = '"Dave, stop. Stop, will you? Stop, Dave. Will you stop, Dave?” ... “Dave, my mind is going,” HAL says, forlornly. “I can feel it. I can feel it.”';
echo "<description>" . htmlspecialchars($line['description'], ENT_NOQUOTES, 'UTF-8', false) . "</description>";
See also:
RSS 2.0 Best Practice Tip: Entity-encoded HTML in Descriptions
The problem is that you are storing this string with its quotes in the database (I assume). If so, PHP is removing the quotes (which is proper) to avoid bugs such as SQL injection. So remove the quotes in the database and add them back when generating the XML file; that is the simplest approach in my opinion. Also try to avoid double quotes (") and use single quotes (') instead, because inside double quotes the PHP parser additionally interprets the contents.
Another error is the date format. The date must be in RFC-822 format, like "Wed, 02 Oct 2002 08:00:00 EST", not "July/August 2008".
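The escaping and date points above can be sketched together. rss_item is a hypothetical helper and $row mirrors the columns used in the question; it assumes the values are already valid UTF-8 (htmlspecialchars() returns an empty string for invalid input) and that pubDate is something strtotime() can parse.

```php
<?php
// Hypothetical helper; $row mirrors the columns used in the question.
function rss_item(array $row): string
{
    // ENT_XML1 limits the output to entities that XML itself defines.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    };

    $item = '<item><title>' . $esc($row['title']) . '</title>'
          . '<description>' . $esc($row['description']) . '</description>'
          . '<link>' . $esc($row['link']) . '</link>';

    // strtotime() returns false for values it cannot parse; skip those
    // rather than emit an invalid date.
    $timestamp = strtotime($row['pubDate']);
    if ($timestamp !== false) {
        // DATE_RSS is PHP's RFC-822 style format for feeds.
        $item .= '<pubDate>' . gmdate(DATE_RSS, $timestamp) . '</pubDate>';
    }

    return $item . '</item>';
}

$xml = rss_item([
    'title'       => 'Q & A',
    'description' => "\u{201C}I can feel it.\u{201D}",
    'link'        => 'http://readapp.net/?a=1&b=2',
    'pubDate'     => '2002-10-02 08:00:00 UTC',
]);
echo $xml, "\n";
```

Smart quotes pass through untouched here because they are ordinary UTF-8 characters; only the characters that are special in XML are escaped.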

PHP Sax Parser and UTF-8

It is unfortunate that I am running into some troubles with the php sax parser and the utf-8 encoding.
The case:
I have an XML file that is encoded in UTF-8. The file is parsed using the standard PHP SAX parser. The data is stored in some container objects and inserted into a MySQL database. Unfortunately some characters look weird in the database (mostly German umlauts). For example Gürtel looks like GÃ¼rtel.
The following code fragment shows how the parser is instantiated:
$saxParser = xml_parser_create("UTF-8");
Does this suffice to parse UTF-8 files? If yes, what am I missing? Some special database setup when inserting?
Thanks in advance.
Check the encoding step by step to find the invalid code:
Print the value you retrieve from the XML
Print out the SQL statement you build
When printing the values, make sure your browser reads the output with the correct encoding.
You have to ensure that every component uses the proper encoding:
PHP script
Save your PHP file as UTF-8 without a BOM, because a BOM might cause problems. Use only multibyte string functions when working with UTF-8 strings.
XML file
XML file starts with
<?xml version="1.0" encoding="UTF-8" ?>
and the file is properly saved with the encoding set to UTF-8.
SQL column (collation)
VARCHAR(length) [CHARACTER SET charset_name] [COLLATE collation_name]
Communication between MySQL server and PHP script
Run this command right after opening the connection to the MySQL server:
SET NAMES 'UTF8'
SET NAMES indicates what character set the client will use to send SQL
statements to the server.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
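With the mysqli extension, the same step is usually done through set_charset(), which also tells the client library which charset to use for escaping. This is a sketch only: connect_utf8 is a hypothetical wrapper, the credentials are placeholders, and utf8mb4 is chosen on the assumption that the columns use it (MySQL's plain utf8 stores at most three bytes per character).

```php
<?php
// Hypothetical wrapper; host, user, password and database are placeholders.
function connect_utf8(string $host, string $user, string $password, string $database): mysqli
{
    // Throw exceptions on connection or query errors instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $db = new mysqli($host, $user, $password, $database);

    // Sets the connection charset, like SET NAMES, and keeps the client
    // library's escaping in sync with it.
    $db->set_charset('utf8mb4');

    return $db;
}

// Usage (not executed here, since it needs a running server):
// $db = connect_utf8('localhost', 'user', 'secret', 'shop');
```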

PHP: Simple XML and different codepages and getting the data correctly

I am working on this project where I receive different XML files from different sources. My PHP script should read them, parse them, and store them into the mysql database.
To parse the XML files, I use the SimpleXMLElement class in PHP. I receive files from Belgium in UTF-8 encoding, from Germany in iso-8859-1 encoding, from the Czech Republic in cp1250, and so on...
When I pass the xml-data to SimpleXMLElement and print an asXML() on this object, I see the xml data correctly as it was in the original xml file.
When I try to assign a field to a PHP-variable and print this variable on the screen, the text looks corrupted, and is of course also corrupted when inserted into the mysql database.
Example:
The XML:
<?xml version="1.0" encoding="cp1250"?>
...
<name>Labe Dìèín - Rozb 741,85km ; Dìèín - Rozb 741,85km </name>
...
The PHP code:
$sxml = file_get_contents("test.xml");
$xml = new SimpleXMLElement($sxml);
//echo $xml->asXML() . "\n"; // content will show up correctly in the shell
$name = (string)$xml->ftm->fairway_section->geo_object->name;
echo $name . "\n";
Result of the code (on linux bash shell) moves the cursor upwards and then prints: bín - Rozb 741,85km ; DÄ (the cursor movement is of course related to the incorrect characters that are printed out by PHP)
I think that PHP converts its data to UTF-8 to store it in a string parameter, so I presumed that using mb_convert_encoding to convert from UTF-8 to cp1250 would show the correct result, but it doesn't. Also I should be able to store the data in a format that is combinable with all the other sources.
I don't know much about encodings/codepages, this is probably why I can't get it to work right, but what I do know is that if I copy/paste the texts from the different languages to for example a new UltraEdit file, all of them show up right. How does UltraEdit handle this? Does it use UTF-8 (which I presume can show anything?)
How can I convert my data so that it will always show up, with whatever encoding on the source?
Try iconv instead:
$str = iconv('UTF-8', 'WINDOWS-1250', $str);
The problem is your input file is malformed. There is no character ì (latin small letter I with grave) in Windows-1250.
The closest character is U+00ED (LATIN SMALL LETTER I WITH ACUTE).
The fact such character shows correctly in the shell is likely fortuitous.
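A byte-level check can help settle this kind of question. The sketch below assumes (it is not confirmed by the source) that the <name> text in the question was copied from a viewer that displayed Windows-1250 bytes as ISO-8859-1. Under that assumption the bytes decode cleanly as Windows-1250, and the value SimpleXML returns would simply be UTF-8 printed on a terminal that is not set to UTF-8.

```php
<?php
// Raw bytes as they might sit in the cp1250 file (an assumption).
$bytes = "D\xEC\xE8\xEDn";

// The same bytes, interpreted under two different encodings.
$asCp1250 = iconv('WINDOWS-1250', 'UTF-8', $bytes);
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $bytes);

echo $asCp1250, "\n";           // Děčín
echo $asLatin1, "\n";           // Dìèín
echo bin2hex($asCp1250), "\n";  // UTF-8 bytes of the cp1250 reading
```

In the UTF-8 result, ě is the byte pair C4 9B. A terminal expecting a single-byte encoding would show C4 as "Ä" and treat 9B as a control code, which is consistent with the "DÄ" and cursor movement described in the question. If so, storing the UTF-8 string as-is (over a UTF-8 connection) would be the format that combines all sources.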
