PHP-DOMDocument Character Encoding Issue


This one has been baffling me for a few days. I'm having issues with character encoding, and I've spent much time researching and reading through Stack Overflow questions, and I am yet to find a solution.
So I have an XML file, and inside that file there is a group of tags similar to this:
<item name="purchase" date="November 12 2014 02:27:48">
<airline>Aero Test Ltd</airline>
<aircraft>Boeing 747-400</aircraft>
<engine>Rolls-Royce RB211-524H2-T</engine>
<config>5 25 40 560</config>
<value>261430000</value>
<name>None</name>
</item>
From a web page, a user can change the name of the aircraft (the <name> tag). The name is sent via XMLHttpRequest to my PHP page, which should create a new set of tags like the above, and log the name in a MySQL database.
It works normally with regular English-alphabet text. When I try to use the name "Corvina Panameña" I come across some trouble with the ñ.
It adds this group of tags to my XML document (like it should):
<item name="renaming" date="January 03 2015 04:34:38">
<airline>Aero Test Ltd</airline>
<aircraft>Boeing 747-400</aircraft>
<engine>Rolls-Royce RB211-524H2-T</engine>
<config>5 25 40 560</config>
<value>227852883</value>
<name>Corvina Panameña</name>
</item>
And DOMDocument encodes the ñ as &#xF1;, which is what it should do according to my research. When I open the file in Chrome, it displays the character.
I have 3 other web pages, 2 of which use the data from the MySQL database. One of these MySQL-data pages displays the character correctly, but the other shows this character combination instead: Ã±. Both pages have the HTML5 doctype and do not have a character set defined in a <meta> tag.
The 3rd web page uses the XML data. Strangely, it displays the same character combination as the 2nd MySQL page: Ã±. This page also uses the HTML5 doctype and does not have a character set defined in a <meta> tag.
What is the solution to this bizarre problem?
Is the problem similar to this:
http://www.glenscott.co.uk/blog/html5-character-encodings-and-domdocument-loadhtml-and-loadhtmlfile/
This is my DOMDocument procedure for adding the tag group: http://codepad.org/dHdiY5wG
DOMDocument procedure for reading the data: http://codepad.org/ATpkZq4H
Full XML File: http://codepad.org/XnN9ahuc
Screenshots: http://imgur.com/a/ajiPG
-Edit-
ini_set("default_encoding", "UTF-8") and htmlentities didn't help.
-Edit 2-
Using utf8_encode() on the data didn't help either.
-Edit 3-
It appears as if the post data being sent by the XMLHttpRequest is the problem, not the XML.
This data is sent: Corvina Panameña
And this is received: Corvina PanameÃ±a

Try using $xml = new DOMDocument('1.0', 'UTF-8'); to set the UTF-8 charset encoding for your XML file. If that isn't enough, try calling utf8_encode on the name string before adding it to the XML document.
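A minimal sketch of that suggestion, assuming PHP 7 or later with the dom and mbstring extensions enabled. The helper name build_item_xml is hypothetical, and treating input that is not valid UTF-8 as ISO-8859-1 is an assumption (it is the same assumption utf8_encode makes). mb_convert_encoding is used instead of utf8_encode because utf8_encode is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: builds a small UTF-8 XML document around a user-supplied name.
function build_item_xml(string $name): string
{
    // preg_match with the /u flag returns 1 only when $name is valid UTF-8.
    // Assumption: anything else is ISO-8859-1 and can be converted as such.
    if (preg_match('//u', $name) !== 1) {
        $name = mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1');
    }

    // Declaring the encoding makes saveXML() write UTF-8 bytes
    // instead of numeric character references such as &#xF1;.
    $doc = new DOMDocument('1.0', 'UTF-8');

    $item = $doc->createElement('item');
    $item->setAttribute('name', 'renaming');

    // createTextNode escapes markup characters in the value for us.
    $nameNode = $doc->createElement('name');
    $nameNode->appendChild($doc->createTextNode($name));

    $item->appendChild($nameNode);
    $doc->appendChild($item);

    return $doc->saveXML();
}

echo build_item_xml("Corvina Paname\xC3\xB1a");
```

If the value looks right in the XML file but still shows up as Ã± in the browser, the page may be getting interpreted as ISO-8859-1; sending header('Content-Type: text/html; charset=utf-8') before any output is one thing worth checking.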

Related

simplexml_load_string - parse error due to unicode characters in payload

I have a problem with simplexml_load_string erring with parse errors due to an xml payload coming from a database with unicode characters in it.
I'm at a loss how to get php to read this and use the xml like I normally would. The code has been working fine until people were getting creative with data being submitted.
Unfortunately I cannot modify the source data, I have to work with what I receive, to give you an idea, one field that's breaking it in the original raw receipt looks like :
<FirstName>🐺</FirstName>
Previously the code works fine by parsing the xml with a simple line of :
$xmlresult = simplexml_load_string($result, 'SimpleXMLElement',LIBXML_NOCDATA);
However with these unicode characters, it just errors.
Depending on what I use to view the data if I dump the raw payload it can look like:
<d83d><dc3a>
or <U+D83D><U+DC3A>
Reading a bit on stack, it seemed DOM might work but didn't have any luck there either.
The incoming payload does have the header:
<?xml version="1.0" encoding="UTF-8"?>
The data comes in via:
<data type="cdata"><![CDATA[<payload>
I'm at a complete loss, hopefully can get some help here to get me over this hump with this data handling.

I've been staring at this for days, and it seems the one thing I didn't try was wrapping my curl call function with utf8_encode, like this:
$result = utf8_encode(do_curl($xmlbuildquery));
My do_curl function is just a separate function to call the curl procedure, nothing more.
Doing that, I'm able to parse the results. Instead of those unicode characters showing up, the field is now displayed as:
[firstname] => í ½í°º
(The above is the result of print_r($result); after the following line.)
$xmldata = simplexml_load_string((string)$xmlresult->body->function->data);
With that in place, the XML is finally parsing. Oddly, this sparked my curiosity further, as this information is also provided via a CSV that's imported into a MySQL database, and when I look up the same record it's shown as:
FirstName: ????
with the column defined as:
FirstName varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
That might suggest they're not UTF-8 encoding the output to the CSV. That's separate from this issue, but interesting.
And finally, my script is able to run again!!
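A note on why this parses but prints í ½í°º: those characters are the bytes ED A0 BD ED B0 BA read as ISO-8859-1, which is the surrogate pair D83D DC3A written as two 3-byte sequences (often called CESU-8). Strict UTF-8 parsers reject that form. An alternative sketch, not the fix used above, is to rewrite such pairs as real 4-byte UTF-8 before parsing. The function name cesu8_to_utf8 is made up for illustration, and the sketch assumes the rest of the payload is already valid UTF-8 and that the simplexml extension is enabled.

```php
<?php
// Hypothetical helper: rewrites CESU-8 surrogate pairs as 4-byte UTF-8 sequences.
function cesu8_to_utf8(string $bytes): string
{
    // A high surrogate is encoded as ED A0..AF xx, a low surrogate as ED B0..BF xx.
    // No /u flag: the pattern works on raw bytes.
    return preg_replace_callback(
        '/\xED([\xA0-\xAF])([\x80-\xBF])\xED([\xB0-\xBF])([\x80-\xBF])/',
        function (array $m): string {
            // Decode each 3-byte sequence back to its UTF-16 code unit.
            $high = 0xD000 | ((ord($m[1]) & 0x3F) << 6) | (ord($m[2]) & 0x3F);
            $low  = 0xD000 | ((ord($m[3]) & 0x3F) << 6) | (ord($m[4]) & 0x3F);

            // Combine the pair into one code point above U+FFFF.
            $cp = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);

            // Encode the code point as 4-byte UTF-8.
            return chr(0xF0 | ($cp >> 18))
                . chr(0x80 | (($cp >> 12) & 0x3F))
                . chr(0x80 | (($cp >> 6) & 0x3F))
                . chr(0x80 | ($cp & 0x3F));
        },
        $bytes
    );
}

// The field from the question, as the raw bytes appear to arrive.
$raw = "<FirstName>\xED\xA0\xBD\xED\xB0\xBA</FirstName>";
$xml = simplexml_load_string(cesu8_to_utf8($raw));
echo (string) $xml, "\n";
```

Unlike utf8_encode, this keeps the original character, so it can be stored in a utf8mb4 column as-is.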

Some '<' tags within a PHP SOAP response displaying as HTML '&lt;' entities

I am getting some data from a Web Service using PHP SOAP. The data I receive from the Soap Client using __getLastResponse appears to be a SOAP envelope around the relevant XML data, which is fine as I am then planning on turning this into a SimpleXMLElement to extract the data.
The problem is that the data looks correct until it hits a certain <records> tag, after which it replaces all < characters with &lt;.
This is what the data looks like when I print_r it (this is just a small example of the full data):
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><ns2:searchResponse xmlns:ns2="http://woksearch.v3.wokmws.thomsonreuters.com"><return><queryId>1</queryId><recordsFound>16492</recordsFound><recordsSearched>38802522</recordsSearched><records><records xmlns="http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord">
<REC r_id_disclaimer="ResearcherID data provided by Thomson Reuters"><UID>WOS:000257367300002</UID><static_data><summary><EWUID><WUID coll_id="WOS"></WUID><edition value="WOS.SCI"></edition></EWUID><pub_info issue="8" pubtype="Journal" sortdate="2008-07-01" has_abstract="Y" coverdate="JUL 2008" pubmonth="JUL" vol="17" pubyear="2008"><page end="1820" page_count="16" begin="1805">1805-1820</page></pub_info><titles count="6"><title type="source">BIODIVERSITY AND CONSERVATION</title>...etc...</static_data><dynamic_data><citation_related><tc_list><silo_tc local_count="16" coll_id="WOS"></silo_tc></tc_list></citation_related><cluster_related><identifiers><identifier value="0960-3115" type="issn"></identifier><identifier value="10.1007/s10531-007-9267-2" type="doi"></identifier><identifier value="10.1007/s10531-007-9267-2" type="xref_doi"></identifier></identifiers></cluster_related></dynamic_data></REC>
</records></records></return></ns2:searchResponse></soap:Body></soap:Envelope>
Why are the opening tags displaying correctly until it gets to the second <records> tag? After that it replaces them with &lt; until it reaches the closing </records> tag, when it carries on displaying the opening tags correctly. It doesn't affect the closing tags or the quotation marks, which is strange.
Is this something to do with CDATA? That's all I can think of although it doesn't state that there is a block of CDATA anywhere...
Thanks.

I had the same problem. I don't know if this is the "good" solution, but at least it works:
$xmlP = html_entity_decode($client->__getLastResponse());
This decodes the HTML entities back to markup. The < and > characters that already exist correctly in the string are kept the same.
Hope this helps someone.
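An alternative sketch, assuming the inner <records> element simply holds a second XML document as escaped text: parse the envelope first, read the text content of <records> (the parser undoes the &lt; escaping at that point), then parse that string as its own document. The sample envelope below is a shortened, made-up stand-in for the real response, and the dom extension is assumed to be enabled.

```php
<?php
// Shortened stand-in for $client->__getLastResponse(); the inner XML is escaped text.
$response = '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    . '<soap:Body><return><recordsFound>1</recordsFound>'
    . '<records>&lt;records&gt;&lt;REC&gt;&lt;UID&gt;WOS:000257367300002&lt;/UID&gt;'
    . '&lt;/REC&gt;&lt;/records&gt;</records>'
    . '</return></soap:Body></soap:Envelope>';

// Step 1: parse the SOAP envelope itself.
$envelope = new DOMDocument();
$envelope->loadXML($response);

// Step 2: textContent returns the unescaped inner XML as a plain string.
$innerXml = $envelope->getElementsByTagName('records')->item(0)->textContent;

// Step 3: parse the inner XML as a separate document.
$records = new DOMDocument();
$records->loadXML($innerXml);

$uid = $records->getElementsByTagName('UID')->item(0)->textContent;
echo $uid, "\n";
```

Compared with running html_entity_decode over the whole response, this only unescapes the one element that carries escaped XML, so entities elsewhere in the envelope are left alone.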

CKEditor output in BBcode format not HTML

I'm using CKEditor in a form. When I submit that form, the content I wrote in the CKEditor text area is saved in the database in this format: [b]helllo[/b][size=100]fefdf[/size]:*). On another page, when I retrieve the data, it shows in HTML as the same [b]helllo[/b][size=100]fefdf[/size]:*) instead of the output in BBCode format. Can anyone help me with how to get it in BBCode format?
What I want:
What I am getting:

I deduce that BBCode option is activated when you call the editor (see this example code), so that shouldn't be the issue.
One thing you could try is setting the basicEntities config to false.
Taken from the CKeditor API:
<static> {Boolean} CKEDITOR.config.basicEntities Since: 3.0
Whether to escape basic HTML entities in the document, including:
nbsp
gt
lt
amp
Note: It should not be subject to change unless when outputting a non-HTML data format like BBCode.
Defined in: plugins/entities/plugin.js.
config.basicEntities = false;
Default Value:
true

Hello, thanks for the help. I solved the problem: I just removed the line "extraPlugins : 'bbcode'," and now it's working.

XML not well formed error

I have a php script that writes xml data to a file and another one that sends the contents of this file to the client as the response.
But on the client side, I'm getting the following error:
XML Parsing Error: not well-formed
When I view the source of the page, the XML I see is as follows:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<books><date>December 24th, 2009</date><total>2</total><book><name>Book 1</name><url>http://www.mydomain.com/posters/68370/img.jpg</url></book><book><name>Book 2</name><url>http://www.anotherdomain.com/posters/76198/img1.jpg</url></book></books>
In file1.php I have the following code that writes the XML to a file:
$file= fopen("book_results.xml", "w");
$xml_writer = new XMLWriter();
$xml_writer->openMemory();
$xml_writer->startDocument('1.0', 'UTF-8', 'yes');
$xml_writer->startElement('books');
$xml_writer->writeElement('date',get_current_date()); // Like December 23rd, 2009
$xml_writer->writeElement('total',$totalResults);
foreach($bookList as $key => $value) { /* $bookList contains key value pairs */
$xml_writer->startElement('book');
$xml_writer->writeElement('name',$key);
$xml_writer->writeElement('url',$value);
$xml_writer->endElement(); //book
}
$xml_writer->endElement(); //books
$xml_data = $xml_writer->outputMemory();
fwrite($file,$xml_data);
fclose($file);
And in index.php, I have the following code to send the contents of the file as a response:
<?php
//Send the xml file contents as response
header('Content-type: text/xml');
readfile('book_results.xml');
?>
What could be causing the error ?
Please help.
Thank You.

The above looks good to me (including the fact that you're forming the XML via a dedicated component) and either:
what you're using to validate this is wrong
you're looking at something different to what you think you are
I would definitely try another tool/browser/whatever to validate this. Additionally, you may want to save the XML file as sent to the browser, and check it using XMLStarlet (a command-line XML toolkit).
I'm also wondering if it's an issue that we can't easily see: a character encoding problem or a byte-order-mark issue (related to encodings). Does the character encoding of the web page you're sending match the encoding of the XML (UTF-8)?

There are some free websites and tools for checking for validity in XML.
According to the XML Validator, when I pasted your XML above into the textarea, it said "no errors found".
However, Validome says "Can not find declaration of element 'books'."
Perhaps Jeff's suggestion of changing date and total to attributes might help. It would probably be easy to try that.

Have you tried using those 2 loose date and total tags as attributes instead?:
<books date="December 24th" total="2">
Also, XML can be quite sensitive. Make sure to use CDATA sections where appropriate.

It validates fine in WMHelp XMLPad 3.0.1.0, and opens fine in FireFox 3.0.8 and IE7 without errors.
The only thing I can see, from a copy and paste of your XML, is that the XML declaration is followed by a CR/LF combination (0x0D 0x0A). This is platform specific (Windows), and may be an issue on the client; you didn't mention what the client was, however, so I can't be sure if that's the problem.

Ensure that you are writing UTF-8 or 7-bit ASCII encoding to the file (test with a text editor or the 'file' command, if you have it), and that your checker supports it. Keep in mind that UTF-8 can include a signature (sometimes called the byte-order mark) in the first three bytes (EF BB BF) that sometimes confuses some tools if it is there, and rarely if it is not.
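A rough way to run both checks from PHP itself, assuming the dom extension is enabled. The function names strip_utf8_bom and xml_errors are hypothetical; the second one just collects whatever libxml reports when it tries to load the string.

```php
<?php
// Hypothetical helper: drops a leading UTF-8 signature (EF BB BF) if one is present.
function strip_utf8_bom(string $data): string
{
    return strncmp($data, "\xEF\xBB\xBF", 3) === 0 ? substr($data, 3) : $data;
}

// Hypothetical helper: returns libxml's error messages for $xml, or [] if it is well-formed.
function xml_errors(string $xml): array
{
    // Collect errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Example: a missing closing tag is reported with its position.
print_r(xml_errors('<books><book><name>Book 1</name></books>'));
```

Running xml_errors on the exact bytes the client receives (for example, saved from the browser) rather than on book_results.xml can show whether something is being added between the file and the response.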

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
Use single quotes.

PHP: UTF 8 characters encoding

I am scraping a list of RSS feeds by using cURL, and then I am reading and parsing the RSS data with SimpleXML. The sorted data is then inserted into a mySQL database.
However, as you can see on http://dansays.co.uk/research/MNA/rss.php, I am having several issues with characters not displaying correctly.
Examples:
âGuitar Hero: Van Halenâ Trailer And Tracklist Available
NV 10/10/09 â€“ Salt Lake City, UT 10/11/09 â€“ Denver, CO 10/13/09 â€“
I have tried using htmlentities and htmlspecialchars on the data before inserting it into the database, but it doesn't seem to resolve the issue.
How could I resolve this?
Thanks for any advice.
Updated
I've tried what Greg suggested, and the issue is still here...
Here is the code I used to do SET NAMES in PDO:
$dbh = new PDO($dbstring, $username, $password);
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->query('SET NAMES "utf8"');
I did a bit of echo'ing with the simplexml data before it is sorted and inserted into the database, and I now believe it is something to do with the cURL...
Here is what I have for cURL:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');
$data = curl_exec($ch);
curl_close($ch);
$doc = new SimpleXmlElement($data, LIBXML_NOCDATA);
Issue Resolved
I had to set the content charset in the RSS/HTML page to "UTF-8" to resolve this issue. I guess this isn't a real fix as the char problems are still there in the raw data. Looking forward to proper support for it in PHP6!

Your page is being served as UTF-8 so I'd point my finger at the database.
Make sure the connection is in UTF-8 before any SELECTs or INSERTS - in MySQL:
SET NAMES "utf8"

Just a quick note about CURLOPT_ENCODING : it's the Accept-Encoding header, which is not the same at all as character encoding. Supported accept encodings are "identity", "deflate", and "gzip".

Like all debugging, you start by isolating the problem:
I am scraping a list of RSS feeds by using cURL, - look at the xml from the RSS feed that's giving the problem (there's more than one feed, so it's possible for some feeds to be right and for the feeds that are wrong to be wrong in different ways)
and then I am reading and parsing the RSS data with SimpleXML. - print out the field that SimpleXML read out - is it ok or does a problem show up?
The sorted data is then inserted into a mySQL database. - print out hex(field), length(field), and char_length(field) for the piece of data that's giving the problem.
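On the PHP side, the same kind of inspection can be sketched like this. The helper name describe_bytes is hypothetical; it only reports what the bytes are, so the output can be compared at each stage (after cURL, after SimpleXML, after reading back from the database).

```php
<?php
// Hypothetical helper: reports the raw bytes of a value so stages can be compared.
function describe_bytes(string $value): array
{
    return [
        'hex'        => bin2hex($value),                 // exact bytes, two hex digits each
        'bytes'      => strlen($value),                  // length in bytes, not characters
        'valid_utf8' => preg_match('//u', $value) === 1, // /u fails on invalid UTF-8
    ];
}

// An en dash in three forms:
print_r(describe_bytes("\xE2\x80\x93"));             // correct UTF-8
print_r(describe_bytes("\x96"));                     // single Windows-1252 byte
print_r(describe_bytes("\xC3\xA2\xC2\x80\xC2\x93")); // UTF-8 that was encoded a second time
```

A value that is valid UTF-8 but twice as long as expected usually means it was converted to UTF-8 a second time somewhere along the way.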
EDIT
Take the feed http://hangout.altsounds.com/external.php?type=RSS2 , put it into the validator http://validator.w3.org/feed/ . They're declaring their content type as iso-8859-1 but some of the actual content, such as the quotes, is in something like cp1252 - for example they're using the byte 0x93 to represent the left quote - http://www.fileformat.info/info/unicode/char/201C/charset_support.htm .
What's annoying about this is that this doesn't show up in some tools - Firefox seems to guess what's going on and show the quotes correctly, and more to the point, SimpleXML converts the 0x93 into utf8, so it comes out as 0xc293, which exacerbates the problem.
EDIT 2
A workaround to get that feed to read a bit more correctly is to replace "ISO-8859-1" by "Windows-1252" before passing to Simple XML. It won't work 100% because it turns out that some parts of the feed are in UTF8.
The general approach, assuming that you can't get everyone in the world to correct their feeds, is to isolate whatever workarounds you require to the interface with the external system that's emitting the malformed data, and to pass in pure clear utf8 to the hub of your system. Save a dated copy of the raw external feed so you can remember in future why the workaround was required, separate off and comment the code lines that implement the workaround so it's easy to get at and change if and when the external organisation corrects its feed (or breaks it in a different way), and check it again from time to time. Unfortunately instead of programming to a spec you're programming to the current state of a bug, so there's no permanent, clean solution - the best you can do is isolate, document, and monitor.
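As a rough illustration of keeping the workaround at the boundary, here is a sketch that assumes the mbstring extension is enabled and that a feed which is not valid UTF-8 is entirely Windows-1252. That second assumption does not hold for feeds that mix encodings, like the one above, so treat it as a starting point only. The function name normalise_feed is made up.

```php
<?php
// Hypothetical boundary function: returns the feed as UTF-8 with a matching XML declaration.
function normalise_feed(string $raw): string
{
    // Assumption: bytes that are not valid UTF-8 are Windows-1252 throughout.
    if (preg_match('//u', $raw) !== 1) {
        $raw = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
    }

    // Rewrite the declared encoding so the parser does not convert the bytes again.
    return preg_replace(
        '/^(<\?xml[^>]*encoding=)(["\'])[^"\']*\2/i',
        '$1$2UTF-8$2',
        $raw,
        1
    );
}

// Workaround for a feed that declares ISO-8859-1 but uses cp1252 quotes (0x93, 0x94).
$feed = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?><title>\x93Guitar Hero\x94</title>";
echo normalise_feed($feed), "\n";
```

Everything after this call can then assume UTF-8, and the function is the one place to revisit when the feed changes.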

It may have to do with the XML prologue, which looks like this for that particular feed you linked to:
<?xml version="1.0" encoding="ISO-8859-1" ?>
As far as I know, libxml, on which SimpleXML is based, looks for this kind of thing. I'm not sure about XML files, but I'm sure that with HTML strings it looks for META elements that specify the charset.
Try stripping the XML prologue (I solved a similar problem once by stripping the HTML META tags) and don't forget to utf8_encode() the data before feeding it to SimpleXMLElement.
