PHP: UTF-8 character encoding

I am scraping a list of RSS feeds by using cURL, and then I am reading and parsing the RSS data with SimpleXML. The sorted data is then inserted into a mySQL database.
However, as you can see at http://dansays.co.uk/research/MNA/rss.php, several characters are not displaying correctly.
Examples:
âGuitar Hero: Van Halenâ Trailer And Tracklist Available
NV 10/10/09 – Salt Lake City, UT 10/11/09 – Denver, CO 10/13/09 –
I have tried using htmlentities and htmlspecialchars on the data before inserting it into the database, but that doesn't resolve the issue.
How can I resolve this?
Thanks for any advice.
Updated
I've tried what Greg suggested, and the issue is still here...
Here is the code I used to do SET NAMES in PDO:
$dbh = new PDO($dbstring, $username, $password);
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->query('SET NAMES "utf8"');
I did a bit of echo'ing with the simplexml data before it is sorted and inserted into the database, and I now believe it is something to do with the cURL...
Here is what I have for cURL:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');
$data = curl_exec($ch);
curl_close($ch);
$doc = new SimpleXmlElement($data, LIBXML_NOCDATA);
Issue Resolved
I had to set the content charset in the RSS/HTML page to "UTF-8" to resolve this issue. I guess this isn't a real fix as the char problems are still there in the raw data. Looking forward to proper support for it in PHP6!

Your page is being served as UTF-8 so I'd point my finger at the database.
Make sure the connection is in UTF-8 before any SELECTs or INSERTs. In MySQL:
SET NAMES "utf8"
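With PDO, the same effect can usually be had by naming the charset in the DSN instead of issuing SET NAMES by hand. A minimal sketch, assuming the MySQL driver, PHP 5.3.6 or later (where the charset DSN parameter is honoured) and a server that knows utf8mb4; buildMysqlDsn(), the host and the database name are hypothetical and not taken from the question:

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that pins the connection charset.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = buildMysqlDsn('localhost', 'feeds');
echo $dsn, "\n";

// The connection itself is left commented out because it needs a real server:
// $dbh = new PDO($dsn, $username, $password);
// $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
```

Setting the charset at connection time keeps every later query on the same encoding without a separate round trip.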

Just a quick note about CURLOPT_ENCODING: it sets the Accept-Encoding header, which is about content coding (compression) and is not the same thing at all as character encoding. Supported accept encodings are "identity", "deflate", and "gzip".
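To make the distinction concrete, here is a rough sketch of a fetch that leaves CURLOPT_ENCODING for compression and reads the character set from the Content-Type response header instead. fetchFeed() and charsetFromContentType() are hypothetical names; the cURL options and CURLINFO_CONTENT_TYPE are standard, and an empty string for CURLOPT_ENCODING asks cURL to accept every content coding it supports:

```php
<?php
// Hypothetical helper: pulls the charset parameter out of a Content-Type value.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null; // the server did not declare a charset
}

// Hypothetical helper: returns [body, declared charset or null].
function fetchFeed(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_HEADER         => false,
        CURLOPT_ENCODING       => '',    // compression only, not character encoding
    ]);
    $body = curl_exec($ch);
    $type = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return [$body, charsetFromContentType($type)];
}

echo charsetFromContentType('text/xml; charset=ISO-8859-1'), "\n";
```

The declared charset is only a hint; as the answers here show, a feed can declare one encoding and send another.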

Like all debugging, you start by isolating the problem:
"I am scraping a list of RSS feeds by using cURL," - look at the XML from the RSS feed that's giving the problem (there's more than one feed, so it's possible for some feeds to be right and for the feeds that are wrong to be wrong in different ways)
"and then I am reading and parsing the RSS data with SimpleXML." - print out the field that SimpleXML read out - is it OK or does a problem show up?
"The sorted data is then inserted into a mySQL database." - print out hex(field), length(field), and char_length(field) for the piece of data that's giving the problem.
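The same hex/length/char_length check can be done on the PHP side at each stage before the data reaches MySQL. A small sketch, assuming the mbstring extension is available; describeBytes() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: mirrors MySQL's HEX(), LENGTH() and CHAR_LENGTH() for a PHP string.
function describeBytes(string $value): array
{
    return [
        'hex'   => bin2hex($value),            // raw bytes, two hex digits each
        'bytes' => strlen($value),             // byte count
        'chars' => mb_strlen($value, 'UTF-8'), // character count if the bytes are read as UTF-8
    ];
}

// A correctly encoded left double quote (U+201C) in UTF-8.
print_r(describeBytes("\xE2\x80\x9C"));

// The byte 0x93 after being treated as ISO-8859-1 and converted to UTF-8.
print_r(describeBytes("\xC2\x93"));
```

Comparing the hex output after the cURL fetch, after SimpleXML, and after the database round trip shows which stage changes the bytes.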
EDIT
Take the feed http://hangout.altsounds.com/external.php?type=RSS2 , put it into the validator http://validator.w3.org/feed/ . They're declaring their content type as iso-8859-1 but some of the actual content, such as the quotes, is in something like cp1252 - for example they're using the byte 0x93 to represent the left quote - http://www.fileformat.info/info/unicode/char/201C/charset_support.htm .
What's annoying about this is that this doesn't show up in some tools - Firefox seems to guess what's going on and show the quotes correctly, and more to the point, SimpleXML converts the 0x93 into utf8, so it comes out as 0xc293, which exacerbates the problem.
EDIT 2
A workaround to get that feed to read a bit more correctly is to replace "ISO-8859-1" with "Windows-1252" before passing it to SimpleXML. It won't work 100% because it turns out that some parts of the feed are in UTF-8.
The general approach, assuming that you can't get everyone in the world to correct their feeds, is to isolate whatever workarounds you require to the interface with the external system that's emitting the malformed data, and to pass in pure clear utf8 to the hub of your system. Save a dated copy of the raw external feed so you can remember in future why the workaround was required, separate off and comment the code lines that implement the workaround so it's easy to get at and change if and when the external organisation corrects its feed (or breaks it in a different way), and check it again from time to time. Unfortunately instead of programming to a spec you're programming to the current state of a bug, so there's no permanent, clean solution - the best you can do is isolate, document, and monitor.
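One way to keep such a workaround isolated and commented is sketched below. It is a variation on the replacement described above, not the exact same step: instead of only relabelling the declaration as Windows-1252, it converts the bytes to UTF-8 and relabels the declaration to match, so nothing depends on which encodings the local libxml build knows. normaliseMislabelledFeed() is a hypothetical name, mbstring is assumed, and any part of the feed that is already UTF-8 would still come out wrong:

```php
<?php
// Workaround for a feed that declares ISO-8859-1 but sends Windows-1252 bytes.
// Kept in one function so it is easy to find and remove if the feed gets fixed.
function normaliseMislabelledFeed(string $raw): string
{
    // Reinterpret the bytes as Windows-1252 and convert them to UTF-8.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
    // Make the XML declaration agree with the converted bytes.
    return preg_replace('/encoding="ISO-8859-1"/i', 'encoding="UTF-8"', $utf8, 1);
}

// 0x93 and 0x94 are the curly double quotes in Windows-1252.
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?><title>\x93Guitar Hero\x94</title>";
$doc = new SimpleXMLElement(normaliseMislabelledFeed($raw), LIBXML_NOCDATA);
echo bin2hex((string) $doc), "\n";
```

Everything downstream of this function then sees UTF-8 only.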

It may have to do with the XML prologue, which looks like this for that particular feed you linked to:
<?xml version="1.0" encoding="ISO-8859-1" ?>
As far as I know, libxml, on which SimpleXML is based, looks for this kind of thing. I'm not sure about XML files, but I'm sure that with HTML strings it looks for META elements that specify the charset.
Try stripping the XML prologue (I solved a similar problem once by stripping the HTML META tags) and don't forget to utf8_encode() the data before feeding it to SimpleXMLElement. Note that utf8_encode() only converts from ISO-8859-1 to UTF-8, and it is deprecated as of PHP 8.2.

Related

simplexml_load_string - parse error due to unicode characters in payload

I have a problem with simplexml_load_string erring with parse errors due to an xml payload coming from a database with unicode characters in it.
I'm at a loss how to get php to read this and use the xml like I normally would. The code has been working fine until people were getting creative with data being submitted.
Unfortunately I cannot modify the source data; I have to work with what I receive. To give you an idea, one field that's breaking it in the original raw receipt looks like:
<FirstName>🐺</FirstName>
Previously the code worked fine, parsing the XML with a single line:
$xmlresult = simplexml_load_string($result, 'SimpleXMLElement',LIBXML_NOCDATA);
However with these unicode characters, it just errors.
Depending on what I use to view the data if I dump the raw payload it can look like:
<d83d><dc3a>
or <U+D83D><U+DC3A>
Reading a bit on stack, it seemed DOM might work but didn't have any luck there either.
The incoming payload does have the header:
<?xml version="1.0" encoding="UTF-8"?>
data comes in via
<data type="cdata"><![CDATA[<payload>
I'm at a complete loss, hopefully can get some help here to get me over this hump with this data handling.
I've been staring at this for days and it seems one thing I didn't try was to wrap my curl call function with utf8_encode like this :
$result = utf8_encode(do_curl($xmlbuildquery));
My do_curl function is just a separate function to call the curl procedure, nothing more.
Doing that, I'm able to parse the results. Instead of those unicode characters, the field now displays as
[firstname] => 🐺
The above is the result of print_r($result); after running:
$xmldata = simplexml_load_string((string)$xmlresult->body->function->data);
With that in place the XML is finally parsing. Oddly, this sparked my curiosity further, as this information is also provided via a CSV that's imported into a MySQL database, and when I look up the same record it's shown as:
FirstName: ????
with the column defined as:
FirstName varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
That might suggest they're not UTF-8 encoding the output to the CSV. It's separate from this issue, but interesting.
And finally, my script is able to run again!!
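The <d83d><dc3a> seen in the raw dump is the UTF-16 surrogate pair for U+1F43A. One possible explanation, which is an assumption and not something the post confirms, is that the sender encoded each surrogate half as its own three-byte sequence (often called CESU-8), which strict UTF-8 parsers such as libxml reject. Under that assumption, a sketch of an alternative to utf8_encode() that rebuilds the real four-byte character could look like this; repairSurrogatePairs() is a hypothetical helper:

```php
<?php
// Hypothetical helper: rewrites surrogate pairs that were encoded as two
// 3-byte sequences (CESU-8 style) into the proper 4-byte UTF-8 form.
function repairSurrogatePairs(string $bytes): string
{
    // Byte-wise pattern (no /u flag): high surrogate ED A0..AF xx, low surrogate ED B0..BF xx.
    $pattern = '/\xED([\xA0-\xAF])([\x80-\xBF])\xED([\xB0-\xBF])([\x80-\xBF])/';
    return preg_replace_callback($pattern, function (array $m): string {
        $hi = 0xD000 | ((ord($m[1]) & 0x3F) << 6) | (ord($m[2]) & 0x3F);
        $lo = 0xD000 | ((ord($m[3]) & 0x3F) << 6) | (ord($m[4]) & 0x3F);
        $cp = 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
        // Encode the code point as four UTF-8 bytes.
        return chr(0xF0 | ($cp >> 18))
            . chr(0x80 | (($cp >> 12) & 0x3F))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }, $bytes);
}

// Assumed shape of the broken field: D83D and DC3A each stored as three bytes.
$payload = "<FirstName>\xED\xA0\xBD\xED\xB0\xBA</FirstName>";
$isValidBefore = preg_match('//u', $payload) === 1; // PCRE's UTF-8 check rejects surrogates
$fixed = repairSurrogatePairs($payload);
$xml = simplexml_load_string($fixed);
echo bin2hex((string) $xml), "\n";
```

Unlike utf8_encode(), this keeps the original character intact instead of turning it into a multi-character substitute, but it only helps if the payload really is encoded this way.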

PHP-DOMDocument Character Encoding Issue

This one has been baffling me for a few days. I'm having issues with character encoding, and I've spent much time researching and reading through Stack Overflow questions, and I am yet to find a solution.
So I have an XML file, and inside that file there is a group of tags similar to this:
<item name="purchase" date="November 12 2014 02:27:48">
<airline>Aero Test Ltd</airline>
<aircraft>Boeing 747-400</aircraft>
<engine>Rolls-Royce RB211-524H2-T</engine>
<config>5 25 40 560</config>
<value>261430000</value>
<name>None</name>
</item>
From a web page, a user can change the name of the aircraft (The <name> Tag). The name is sent via XMLHttpRequest to my PHP page, which should create a new set of tags like the above, and log the name in a mySQL database.
It works normally with regular English-alphabet text. When I try to use the name "Corvina Panameña" I come across some trouble with the ñ.
It adds this group of tags to my XML document (like it should):
<item name="renaming" date="January 03 2015 04:34:38">
<airline>Aero Test Ltd</airline>
<aircraft>Boeing 747-400</aircraft>
<engine>Rolls-Royce RB211-524H2-T</engine>
<config>5 25 40 560</config>
<value>227852883</value>
<name>Corvina Panameña</name>
</item>
And DOMDocument encodes the ñ as &#xF1;, which is what it should do according to my research. When I open the file in Chrome it displays the character.
I have 3 other web pages, 2 of which use the data from the mySQL database. One of these mySQL-data pages displays the character correctly, and then the problem: the other shows this character combination instead: Ã±. Both pages have the HTML5 Doctype and do not have a character set defined in a <meta> tag.
The 3rd web page uses the XML data. Strangely, it displays the same character combination as the 2nd mySQL page: Ã±. The page uses the HTML5 Doctype and does not have a character set defined in a <meta> tag.
What is the solution to this bizarre problem?
Is the problem similar to this:
http://www.glenscott.co.uk/blog/html5-character-encodings-and-domdocument-loadhtml-and-loadhtmlfile/
This is my DOMDocument procedure for adding the tag group: http://codepad.org/dHdiY5wG
DOMDocument procedure for reading the data: http://codepad.org/ATpkZq4H
Full XML File: http://codepad.org/XnN9ahuc
Screenshots: http://imgur.com/a/ajiPG
-Edit-
ini_set("default_encoding", "UTF-8") and htmlentities didn't help.
-Edit 2-
Using utf8_encode() on the data didn't help either.
-Edit 3-
It appears as if the post data being sent by the XMLHttpRequest is the problem, not the XML.
This data is sent: Corvina Panameña
And this is received: Corvina PanameÃ±a
Try using $xml = new DOMDocument('1.0', 'UTF-8'); to set the UTF-8 charset encoding for your XML file. If that isn't enough, try applying utf8_encode to the name string before adding it to the XML document.
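A short sketch of what the encoding argument changes, using made-up element content based on the question: with an encoding declared, saveXML() writes the ñ as raw UTF-8 bytes, while without one it falls back to a numeric character reference. Both forms are valid XML for the same character.

```php
<?php
// Hypothetical helper: builds a one-element document, optionally with a declared encoding.
function buildNameXml(?string $encoding): string
{
    $doc = $encoding === null
        ? new DOMDocument('1.0')
        : new DOMDocument('1.0', $encoding);
    $name = $doc->createElement('name');
    // The text is passed to DOM as UTF-8 (C3 B1 is ñ).
    $name->appendChild($doc->createTextNode("Corvina Paname\xC3\xB1a"));
    $doc->appendChild($name);
    return $doc->saveXML();
}

$withEncoding = buildNameXml('UTF-8');
$withoutEncoding = buildNameXml(null);

echo $withEncoding, $withoutEncoding;
```

Either way, DOM methods expect their string arguments in UTF-8, so the name must already be UTF-8 when it arrives from the request.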

Decoding XML from UTF-8 to ISO-8859-1 in PHP

I'm trying to "decode" an XML file (and transform it with XSLT), but I'm having trouble decoding both files. The scenario is as follows:
I have a site for data entry which is all encoded in ISO-8859-1 (our Oracle database is in that format, so I can't change it). The problem is, I have two files (an XML file to show the data entry form and an XSLT file to transform it into HTML). Both files are saved in ISO-8859-1 encoding, and both have the corresponding encoding declaration in their header, yet whenever I read the files and show them in the browser, the special characters (ñ, á, ¿) are shown either as UTF-8 or as a question mark (depending on the method I use for showing), but never as the "normal" representation.
My code for showing the XML file is:
<?php
$xslString = file_get_contents("catalog.xsl");
$xslString = utf8_decode($xslString);
$xslDoc = simplexml_load_string($xslString);
$xmlString = file_get_contents("questionnaire.xml");
$xmlString = utf8_decode($xmlString);
$xmlDoc = simplexml_load_string($xmlString);
$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
?>
I already tried several combinations of DOMDocument, iconv, mb_convert_encoding, but they show the XML file as unencoded UTF, a question mark or a double question mark.
On the other hand, this also messes up my data entry, since if I want to enter one of those characters, they either show as ? or ?? on the corresponding data field on the DB, or they get truncated at the first special char (if I use iconv).
What am I missing? Is there a workaround? I can't convert anything to UTF-8 because of the database.
I hope I'm being clear enough, please excuse my English.
Thanks in advance!
Hope this helps others. In the end, there were two things:
1) I was reading the XML/XSL files like this (in my original script):
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xmlFile);
$xmlDoc->load("xmlfile.xml");
?>
which effectively changed the encoding to UTF-8. I changed the lines to:
<?php
$xmlString = file_get_contents("xmlfile.xml");
$xmlDoc = simplexml_load_string($xmlString);
?>
removing the utf8_decode statement, and it worked like a charm. Now I get my special chars on screen as they're intended. As a side effect, the data entered in the form is now saved correctly to my database, so I got two birds in one shot.
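For anyone reading values out of such a file rather than only transforming it: libxml honours the declared ISO-8859-1 encoding when parsing and hands text back to PHP as UTF-8, so no utf8_decode() is needed on the input. A minimal sketch with a made-up one-element document, assuming mbstring for the conversion back to ISO-8859-1:

```php
<?php
// An ISO-8859-1 document: the byte 0xF1 is ñ in that encoding.
$xmlString = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><label>Espa\xF1a</label>";

// The parser reads the declaration and decodes the bytes itself.
$xmlDoc = simplexml_load_string($xmlString);

// Text read from the parsed document is UTF-8.
$utf8Text = (string) $xmlDoc;

// Convert explicitly only at the boundary that needs ISO-8859-1 (page or database).
$latin1Text = mb_convert_encoding($utf8Text, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8Text), ' ', bin2hex($latin1Text), "\n";
```

Keeping the conversion at one boundary avoids the double decoding that produced the question marks.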

encoding issues in drupal when importing from wordpress

I am currently moving blog posts from WordPress to Drupal. However, after the move some of the text is not being displayed correctly.
wordpress is displaying :
When it hasn’t (html code is <h2>When it hasn’t</h2>)
Drupal is displaying :
When it hasnâ€™t (html code is <h2>When it hasn’t</h2>)
In the wordpress and drupal db the value is correct. The source is the same.
<h2>When it hasn’t</h2>
I did a search and found many options. None of them helped.
Below are the ones I have done and checked.
1) I double checked that utf-8 is the character encoding in drupal and wp.
I also made a simple test.php file to check nothing else was coming in the way
and it still did not display correctly.
2) I made sure when we take a mysqldump and upload to drupal utf-8
is used.
3) I also made sure the .php file is in utf-8 when saved.
4) I changed the encoding type in chrome for every option available and nothing
displayed it correctly.
5) I also used php functions to recode it but they did not work.
$value2="<h2>When it hasn’t</h2>";
$out = recode_string('..utf-8', $value2);
//output - When it hasnt
$out2= mb_convert_encoding($value2,'UTF-8', "UTF-8");
// output - When it hasn’t
$out3 = @iconv('UTF-8', 'utf-8', $value2);
// output - When it hasn’t
I have run out of options now and I am stuck. Please help.
You say the text in both databases is correct, but this doesn't mean much: to view the content of a record you must use some client, and several transformations may happen depending on how the text is rendered for you to read.
So only two things matter:
the encoding of the column
the encoding of the HTML page returned by Drupal
Since your page outputs â€™ (the CP1252 reading of the bytes 0xE2 0x80 0x99) for ’ (Unicode U+2019, whose UTF-8 encoding is 0xE2 0x80 0x99), I guess the column is indeed UTF-8; however, something between the database and the browser thinks the text is CP1252. This is what you have to check:
If using MySQL, the connection encoding must be UTF-8 so that what you have in your PHP script is UTF-8 text. You can use SET NAMES 'utf8' (MySQL spells the charset name without a hyphen). Note that if you don't need the Unicode set, you can even use CP1252: the only important thing is that you know the encoding, since PHP strings are just byte arrays.
Explicitly define the response encoding in the HTTP Content-Type header; that is, configure Drupal to call header('Content-Type: text/html; charset=utf-8');
If the HTTP response encoding differs from the one used for the text retrieved from the db, transcode the query result accordingly.
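The misreading described above can be reproduced in a few lines, which also helps confirm the diagnosis on real data. A sketch assuming the mbstring extension; the variable names are illustrative:

```php
<?php
// U+2019 (right single quotation mark) encoded as UTF-8.
$correct = "\xE2\x80\x99";

// Simulate a layer that believes those three bytes are CP1252 and converts them to UTF-8.
$mojibake = mb_convert_encoding($correct, 'UTF-8', 'Windows-1252');

// Undo the damage by reversing the conversion.
$restored = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');

echo bin2hex($mojibake), "\n";  // bytes of the three characters shown instead of the quote
echo bin2hex($restored), "\n";  // back to the original three bytes
```

If reversing the conversion restores the text, some layer is treating UTF-8 bytes as CP1252; the lasting fix is to correct that layer, not to transcode on output.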

XML not well formed error

I have a php script that writes xml data to a file and another one that sends the contents of this file to the client as the response.
But on the client side, I'm getting the following error:
XML Parsing Error: not well-formed
When I view the source of the page, the XML I see is as follows:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<books><date>December 24th, 2009</date><total>2</total><book><name>Book 1</name><url>http://www.mydomain.com/posters/68370/img.jpg</url></book><book><name>Book 2</name><url>http://www.anotherdomain.com/posters/76198/img1.jpg</url></book></books>
In file1.php I have the following code that writes the XML to a file:
$file= fopen("book_results.xml", "w");
$xml_writer = new XMLWriter();
$xml_writer->openMemory();
$xml_writer->startDocument('1.0', 'UTF-8', 'yes');
$xml_writer->startElement('books');
$xml_writer->writeElement('date',get_current_date()); // Like December 23rd, 2009
$xml_writer->writeElement('total',$totalResults);
foreach ($bookList as $key => $value) { /* $bookList contains key value pairs */
    $xml_writer->startElement('book');
    $xml_writer->writeElement('name', $key);
    $xml_writer->writeElement('url', $value);
    $xml_writer->endElement(); //book
}
$xml_writer->endElement(); //books
$xml_data = $xml_writer->outputMemory();
fwrite($file,$xml_data);
fclose($file);
And in index.php, I have the following code to send the contents of the file as a response:
<?php
//Send the xml file contents as response
header('Content-type: text/xml');
readfile('book_results.xml');
?>
What could be causing the error ?
Please help.
Thank You.
The above looks good to me (including the fact that you're forming the XML via a dedicated component) and either:
what you're using to validate this is wrong
you're looking at something different to what you think you are
I would definitely try another tool/browser/whatever to validate this. Additionally, you may want to save the XML file as sent to the browser, and check it using XMLStarlet (a command-line XML toolkit).
I'm wondering also if it's an issue that we can't easily see: a character encoding problem or a Byte-Order-Mark issue (related to encodings). Does the character encoding of the web page you're sending match the encoding of the XML (UTF-8)?
There are some free websites and tools for checking for validity in XML.
According to the XML Validator, when I pasted your XML above into the textarea, it said "no errors found".
However, Validome says "Can not find declaration of element 'books'."
Perhaps Jeff's suggestion of changing date and total to attributes might help. It would probably be easy to try that.
Have you tried using those 2 loose date and total tags as attributes instead?:
<books date="December 24th" total="2">
Also, XML can be quite sensitive. Make sure to use CDATA sections where appropriate.
It validates fine in WMHelp XMLPad 3.0.1.0, and opens fine in FireFox 3.0.8 and IE7 without errors.
The only thing I can see, from a copy and paste of your XML, is that the XML declaration is followed by a CR/LF combination (0x0D 0x0A). This is platform specific (Windows), and may be an issue on the client; you didn't mention what the client was, however, so I can't be sure if that's the problem.
Ensure that you are writing UTF-8 or 7-bit ASCII encoding to the file (test with a text editor or the 'file' command, if you have it), and that your checker supports it. Keep in mind that UTF-8 can include a signature (sometimes called the byte-order mark) in the first three bytes (EF BB BF) that sometimes confuses some tools if it is there, and rarely if it is not.
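Both checks can be scripted instead of eyeballed. The sketch below looks for the three-byte UTF-8 signature at the start of the data and asks libxml whether the document is well-formed; stripUtf8Bom() and isWellFormedXml() are hypothetical helper names, and the sample string stands in for the contents of book_results.xml:

```php
<?php
// Hypothetical helper: removes a leading UTF-8 signature (EF BB BF) if present.
function stripUtf8Bom(string $data): string
{
    return strncmp($data, "\xEF\xBB\xBF", 3) === 0 ? substr($data, 3) : $data;
}

// Hypothetical helper: true if libxml can parse the string as XML.
function isWellFormedXml(string $data): bool
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    $ok = simplexml_load_string($data) !== false;
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// Stand-in for file_get_contents('book_results.xml'), with a signature in front.
$file = "\xEF\xBB\xBF<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
      . "<books><total>2</total></books>";
$clean = stripUtf8Bom($file);

var_dump(isWellFormedXml($clean));
```

If the file passes here but the browser still complains, the problem is more likely in what index.php sends (stray output before the declaration, for example) than in the file itself.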
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
Use single quotes.
