XML not well formed error - php

I have a PHP script that writes XML data to a file, and another that sends the contents of this file to the client as the response.
On the client side, however, I'm getting the following error:
XML Parsing Error: not well-formed
When I view the source of the page, the XML I see is as follows:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<books><date>December 24th, 2009</date><total>2</total><book><name>Book 1</name><url>http://www.mydomain.com/posters/68370/img.jpg</url></book><book><name>Book 2</name><url>http://www.anotherdomain.com/posters/76198/img1.jpg</url></book></books>
In file1.php I have the following code, which writes the XML to a file:
$file = fopen("book_results.xml", "w");
$xml_writer = new XMLWriter();
$xml_writer->openMemory();
$xml_writer->startDocument('1.0', 'UTF-8', 'yes');
$xml_writer->startElement('books');
$xml_writer->writeElement('date', get_current_date()); // Like December 23rd, 2009
$xml_writer->writeElement('total', $totalResults);
foreach ($bookList as $key => $value) { /* $bookList contains key value pairs */
    $xml_writer->startElement('book');
    $xml_writer->writeElement('name', $key);
    $xml_writer->writeElement('url', $value);
    $xml_writer->endElement(); // book
}
$xml_writer->endElement(); // books
$xml_data = $xml_writer->outputMemory();
fwrite($file, $xml_data);
fclose($file);
And in index.php, I have the following code to send the contents of the file as the response:
<?php
//Send the xml file contents as response
header('Content-type: text/xml');
readfile('book_results.xml');
?>
What could be causing the error ?
Please help.
Thank You.

The above looks good to me (including the fact that you're forming the XML via a dedicated component), so either:
- what you're using to validate this is wrong, or
- you're looking at something different to what you think you are.
I would definitely try another tool/browser/whatever to validate this. Additionally, you may want to save the XML file as sent to the browser, and check it using XMLStarlet (a command-line XML toolkit).
I'm also wondering if it's an issue that we can't easily see: a character encoding problem or a byte-order mark issue (related to encodings). Does the character encoding of the web page you're sending match the encoding of the XML (UTF-8)?
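One way to check the file independently of the browser is to run the exact bytes through PHP's own libxml parser and print whatever it complains about. The following is a minimal sketch, assuming the dom extension is available; the helper name xml_wellformed_errors() is made up for illustration and is not part of the original scripts.

```php
<?php
// Hypothetical helper (not from the original post): returns libxml's
// complaints about a string, or an empty array if it is well-formed.
function xml_wellformed_errors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$good = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><books><total>2</total></books>';
$bad  = '<books><total>2</books>'; // mismatched closing tag

var_dump(xml_wellformed_errors($good) === []);
var_dump(count(xml_wellformed_errors($bad)) > 0);
```

In practice you would pass file_get_contents('book_results.xml') instead of the sample strings, so that any invisible leading bytes are included in the check.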

There are some free websites and tools for checking for validity in XML.
According to the XML Validator, when I pasted your XML above into the textarea, it said "no errors found".
However, Validome says "Can not find declaration of element 'books'."
Perhaps Jeff's suggestion of changing date and total to attributes might help. It would probably be easy to try that.

Have you tried using those 2 loose date and total tags as attributes instead?:
<books date="December 24th" total="2">
Also, XML can be quite sensitive. Make sure to use CDATA sections where appropriate.

It validates fine in WMHelp XMLPad 3.0.1.0, and opens fine in FireFox 3.0.8 and IE7 without errors.
The only thing I can see, from a copy and paste of your XML, is that the XML declaration is followed by a CR/LF combination (0x0D0x0A). This is platform specific (Windows), and may be an issue on the client; you didn't mention what the client was, however, so I can't be sure if that's the problem.

Ensure that you are writing UTF-8 or 7-bit ASCII encoding to the file (test with a text editor or the 'file' command, if you have it), and that your checker supports it. Keep in mind that UTF-8 can include a signature (sometimes called the byte-order mark) in the first three bytes (EF BB BF) that sometimes confuses some tools if it is there, and rarely if it is not.
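A quick way to look for the signature mentioned above is to inspect the first three bytes of the data. This is only a sketch; the helper names has_utf8_bom() and strip_utf8_bom() and the sample string are assumptions for illustration.

```php
<?php
// The UTF-8 signature (byte-order mark) is the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true if the data starts with the UTF-8 signature.
function has_utf8_bom(string $data): bool
{
    return strncmp($data, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: returns the data without a leading signature.
function strip_utf8_bom(string $data): string
{
    return has_utf8_bom($data) ? substr($data, 3) : $data;
}

// Sample data standing in for the contents of book_results.xml.
$withBom = UTF8_BOM . '<?xml version="1.0" encoding="UTF-8"?><books/>';

echo bin2hex(substr($withBom, 0, 3)), "\n"; // efbbbf
var_dump(has_utf8_bom($withBom));
var_dump(has_utf8_bom(strip_utf8_bom($withBom)));
```

The same check applies to the PHP source files themselves: a signature saved at the top of index.php would be sent before the XML declaration.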

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
Use single quotes.

Related

simplexml_load_string - parse error due to unicode characters in payload

I have a problem with simplexml_load_string erring with parse errors due to an xml payload coming from a database with unicode characters in it.
I'm at a loss how to get php to read this and use the xml like I normally would. The code has been working fine until people were getting creative with data being submitted.
Unfortunately, I cannot modify the source data; I have to work with what I receive. To give you an idea, one field that's breaking it in the original raw receipt looks like this:
<FirstName>🐺</FirstName>
Previously the code works fine by parsing the xml with a simple line of :
$xmlresult = simplexml_load_string($result, 'SimpleXMLElement',LIBXML_NOCDATA);
However with these unicode characters, it just errors.
Depending on what I use to view the data if I dump the raw payload it can look like:
<d83d><dc3a>
or <U+D83D><U+DC3A>
Reading a bit on stack, it seemed DOM might work but didn't have any luck there either.
The incoming payload does have the header:
<?xml version="1.0" encoding="UTF-8"?>
data comes in via
<data type="cdata"><![CDATA[<payload>
I'm at a complete loss, hopefully can get some help here to get me over this hump with this data handling.
I've been staring at this for days and it seems one thing I didn't try was to wrap my curl call function with utf8_encode like this :
$result = utf8_encode(do_curl($xmlbuildquery));
My do_curl function is just a separate function to call the curl procedure, nothing more.
Doing that, I'm able to parse the results. Instead of those Unicode characters showing up as before, the field is now displayed as
[firstname] => 🐺
The above is the result of print_r($result); after:
$xmldata = simplexml_load_string((string)$xmlresult->body->function->data);
With that in place, the XML is finally parsing. Oddly, this sparked my curiosity further, as this information is provided via a CSV that's imported into a MySQL database, and when I look up the same record it's shown as:
FirstName: ????
with the column defined as:
FirstName varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
That might suggest they're not UTF-8 encoding the output to the CSV. That's separate from this issue, but interesting.
And finally, my script is able to run again!!
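For anyone repeating this on a current PHP version: utf8_encode() treats its input as ISO-8859-1 and is deprecated as of PHP 8.2. A hedged alternative is to re-encode only when the payload is not already valid UTF-8. The sketch below assumes the mbstring and simplexml extensions; to_utf8_for_xml() is a made-up helper name, and whether the real payload fails the validity check is an assumption that would need testing against the actual data.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise treat the bytes
// as ISO-8859-1 (the same assumption utf8_encode() makes) and convert.
function to_utf8_for_xml(string $payload): string
{
    if (mb_check_encoding($payload, 'UTF-8')) {
        return $payload; // already valid, do not double-encode
    }
    return mb_convert_encoding($payload, 'UTF-8', 'ISO-8859-1');
}

// Sample payloads standing in for the curl response.
$valid   = "<FirstName>\u{1F43A}</FirstName>"; // a real UTF-8 emoji
$invalid = "<FirstName>caf\xE9</FirstName>";   // a lone ISO-8859-1 byte

$xml = simplexml_load_string(
    to_utf8_for_xml($invalid),
    'SimpleXMLElement',
    LIBXML_NOCDATA
);
var_dump($xml !== false);
```

For reference, D83D and DC3A are the two UTF-16 surrogate halves of U+1F43A, which is why the raw dump showed those two values.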

Extract XML from .prt file using PHP but file becomes unreadable when opened with PHP

I have a .prt (CAD design file) that I need to extract some XML from using PHP. When I view this file directly in the browser, I can see the XML along with some unreadable areas. However, when I open it using PHP to get the XML I need from it, the file becomes mostly unreadable and the XML is nowhere to be found; the file looks as if it were encrypted.
This is an example of what the .prt file looks like when opened directly in the browser: File in Browser
This is an example of what the file looks like when opened using PHP: Using PHP
This is how I am trying to open the file with PHP:
$handle = fopen("thePart.prt", "rb");
$contents = trim(stream_get_contents($handle));
fclose($handle);
//echo out contents to see what happens
echo $contents;
If I could get this file to open without doing what it is doing, I can get the XML out of it myself. How do I fix the issue that I am having? Thank you very much in advance.
Real Answer
Turns out that there was no problem at all with the code. The browser was just interpreting the XML tags as HTML and so the data was not displayed (PHP by default sets a content type of text/html). When viewing the source code, the XML was plain and visible. The XML can also be seen without viewing the source by setting the content type of the php file:
header('Content-Type: text/plain');
This way, the browser will just display the XML as it is, without attempting to parse it as HTML first.
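If the content type has to stay text/html (for example, because the output is embedded in a larger page), escaping the string has a similar effect. This is a minimal sketch; the sample string is an assumption standing in for the contents read from the .prt file.

```php
<?php
// Sample data standing in for the text read from the .prt file.
$contents = '<part><name>bracket</name></part>';

// Escape markup so an HTML renderer shows the tags instead of hiding them.
// ENT_SUBSTITUTE replaces invalid byte sequences rather than returning an
// empty string, which matters for files that are partly binary.
$escaped = htmlspecialchars($contents, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo '<pre>', $escaped, '</pre>', "\n";
```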
Initial Answer
Just a guess here, but it might be that you're opening the file in binary mode (the "rb" in your first line of code). Try opening it as a plain text file (use "r" instead of "rb").
More likely, it's an encoding issue where PHP is trying to decode a UTF-8 file as ASCII, for instance. Since you are opening a binary file (CAD Design File is binary with a little XML, I'm assuming), PHP might be getting confused while trying to detect the encoding of the file. I would need a copy of the file to know for sure.
Try comparing the result of mb_detect_encoding:
mb_detect_encoding($contents)
and the actual encoding of the XML data within the .prt file. If they are different, that's how you know that PHP is using the wrong encoding. In that case, use mb_convert_encoding to convert from PHP's detected encoding to that of the XML data.

getting xml data as is(with tags, attributes and values) with php

Hello, I'm struggling with a problem. I have a URL that contains XML data.
When I use file_get_contents($url) or fopen($url, 'r'), it gives me only the values.
Consider the XML:
<tag1 attrName="something">
<tag2>some Value</tag2>
<tag2>some Other Value</tag2>
...
...
</tag1>
What I get: some Value, some Other Value
But I need to get the whole XML (with tags, attributes, and their values) and parse it my own way, because there's a restriction that I'm not allowed to use PHP 5.x practices. I mean I can't use any parser. It shouldn't be so hard to get XML data as is, should it?
what i get: some Value, some Other Value
Nope - my suspicion is that that is what you see in your browser, because it is swallowing all <tags>.
The XML source code will be there after a file_get_contents() operation.
You are using file_get_contents() which states
This function is similar to file(), except that file_get_contents()
returns the file in a string, starting at the specified offset up to
maxlen bytes. On failure, file_get_contents() will return FALSE.
Press Ctrl+U to see the source code in any of the major browsers (except IE, where it's F12 in IE9). I am sure that your code will be there. Your browser won't display the tags, that's all.
The other, longer (but better) way to display an XML file from your PHP file is to send the content type as text/xml, as follows:
<?php
header("Content-Type: text/xml");//SHOULD come before any output
// dynamically generate and output your xml here
?>
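To convince yourself without a browser in the way, you can test the string directly. The sketch below uses a hard-coded sample in place of the file_get_contents($url) result, which is an assumption since the real URL is not given.

```php
<?php
// Sample standing in for the string returned by file_get_contents($url).
$xml = '<tag1 attrName="something"><tag2>some Value</tag2><tag2>some Other Value</tag2></tag1>';

// The tags and attributes are present in the string itself.
var_dump(strpos($xml, '<tag2>') !== false);
var_dump(strpos($xml, 'attrName="something"') !== false);

// Removing the tags leaves only the values, which is roughly what an
// HTML renderer shows when it does not recognise the elements.
$valuesOnly = strip_tags($xml);
echo $valuesOnly, "\n";
```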

Stop greater than sign converting to HTML entity

I'm writing some XML in PHP that is not validating because the closing greater than sign on a CDATA element is getting converted to an HTML entity. The code is as follows:
$xml .= '<item number="'.$i.'">
<sku>'.$this->get_product_sku($key, $value).'</sku>
<description>
<![CDATA[
'.get_the_title($value['prodid']).'
]]>
</description>
<qty>'.$value['quantity'].'</qty>
<price>'.$value['price'].'</price>
<extended>'.$value['quantity']*$value['price'].'</extended>
</item>';
The resulting XML looks something like the following when printed out using var_dump or print_r:
<item number="2">
<sku>45NK2</sku>
<description>
<![CDATA[
Test Product
]]&gt;
</description>
<qty>2</qty>
<price>1500.00</price>
<extended>3000.00</extended>
</item>
The closing > turns into &gt; and the XML does not validate. Can someone help me fix this problem?
Thanks!
EDIT: Here is the whole function that generates the XML. I only call and print this function. There is nothing done to the string that is invalidating it.
function build_xml($p, $c)
{
global $wpdb;
// Make the billing and shipping data available
$this->determine_shipping_details($p, $c);
$this->determine_billing_details($p, $c);
// Build the XML
$xml = '<?xml version="1.0" ?>
<orderdata batch="'.$p['id'].'">
<order id="'.$p['id'].'">
<orderdate>'.date('m/d/Y h:i:s', $p['date']).'</orderdate>
<store>'.$this->store_id.'</store>
<adcode>OL</adcode>
<username>'.$this->username.'</username>
<password>'.$this->password.'</password>
<billingaddress>
<firstname>'.$this->billing_details['first_name'].'</firstname>
<lastname>'.$this->billing_details['last_name'].'</lastname>
<address1>'.$this->billing_details['address'].'</address1>
<city>'.$this->billing_details['city'].'</city>
<state>'.$this->billing_details['state'].'</state>
<zipcode>'.$this->billing_details['zip'].'</zipcode>
<country>'.$this->billing_details['country'].'</country>
<phone>'.$this->billing_details['phone'].'</phone>
<email>'.$this->billing_details['email'].'</email>
</billingaddress>
<shippingaddress>
<firstname>'.$this->shipping_details['first_name'].'</firstname>
<lastname>'.$this->shipping_details['last_name'].'</lastname>
<address1>'.$this->shipping_details['address'].'</address1>
<city>'.$this->shipping_details['city'].'</city>
<state>'.$this->shipping_details['state'].'</state>
<zipcode>'.$this->shipping_details['zip'].'</zipcode>
<country>'.$this->shipping_details['country'].'</country>
<phone>'.$this->shipping_details['phone'].'</phone>
<email>'.$this->shipping_details['email'].'</email>
</shippingaddress>
<orderdetails>';
// Add the individual items' information to the XML
$i = 1;
foreach($c as $key => $value)
{
$xml .= '<item number="'.$i.'">
<sku>'.$this->get_product_sku($key, $value).'</sku>
<description>
<![CDATA[
'.get_the_title($value['prodid']).'
]]>
</description>
<qty>'.$value['quantity'].'</qty>
<price>'.$value['price'].'</price>
<extended>'.str_replace(stripslashes( get_option('wpsc_thousands_separator') ), '', trim(wpsc_currency_display($value['quantity']*$value['price'], array('display_currency_symbol' => false, 'display_decimal_point' => true, 'display_currency_code' => false, 'display_as_html' => false)))).'</extended>
</item>';
$i++;
}
// Add the order totals
$xml .= '<subtotal>'.str_replace(stripslashes( get_option('wpsc_thousands_separator') ), '', trim(wpsc_currency_display($p['totalprice']-$p['wpec_taxes_total']-$p['base_shipping'], array('display_currency_symbol' => false, 'display_decimal_point' => true, 'display_currency_code' => false, 'display_as_html' => false)))).'</subtotal>
<shipping code="'.'FEG'.'" rate="'.$p['base_shipping'].'" thirdparty="">'.'FEDEX GROUND SERVICE'.'</shipping>
<tax rate="'.$p['wpec_taxes_rate'].'">'.$p['wpec_taxes_total'].'</tax>
<total>'.$p['totalprice'].'</total>
<amountpaid>'.$p['totalprice'].'</amountpaid>
</orderdetails>';
// Close out the tags
$xml .= '</order>
</orderdata>';
return $xml;
}
When I run it on my web server, it is formatted correctly. Are you setting the header?
Try
header('Content-type: text/xml');
echo $xml;
From the information you provided with your question, it's hard to specifically say why the output gets mangled.
So you need to step through your program and look at each point where your XML is built (already part of your question) and then processed further by your WordPress setup with its various plugins and themes.
For that, you need an understanding of where such modifications can occur.
Additionally, you need a way to see the output as-is, meaning unchanged. Viewing the source in your browser often doesn't give you that: browsers can change the output before they display it. It's often better to dump the response on the command line with an HTTP client like curl, which can optionally write the output to a file that you can then inspect, unchanged, in an editor.
Let's recap:
The creation of the XML must be correct firsthand.
The XML might get changed by wordpress.
The XML might get changed by the browser.
This can be a lot of points to check:
1. The creation of the XML must be correct firsthand.
First of all, I would look at the return value of get_the_title($value['prodid']) on its own, so you actually know what you're dealing with. Does it perhaps already contain the &gt;? That would explain where that single &gt; might come from. It would be valid to use it within <![CDATA[...]]>, however. This is just to get a feel for what might happen later on.
Besides the single value in question, you should make sure the XML itself looks correct before it is processed further, which means at the end of the function. You can do so by outputting it just before the method returns and then exiting the application to prevent further processing:
echo "Test output:\n\n", $xml; die();
Then look at the output. Does it look correct? Is the problematic &gt; already there at the end of the CDATA section in question? If yes, you know that the problem is already inside the function. If not, you know that the problem is unrelated to the function in question and that the XML is mangled later on. Depending on the outcome, you know where to look for the defect.
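Rather than checking by eye, the same test can be done in code at each checkpoint. The following is a minimal sketch; find_escaped_cdata_end() is a hypothetical helper, not something WordPress or the original plugin provides.

```php
<?php
// Hypothetical helper: returns the byte offsets at which an escaped CDATA
// terminator occurs in the string (empty array if there is none).
function find_escaped_cdata_end(string $xml): array
{
    $needle  = ']]&gt;';
    $offsets = [];
    $pos     = 0;
    while (($pos = strpos($xml, $needle, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += strlen($needle);
    }
    return $offsets;
}

// Sample strings standing in for the XML at two different checkpoints.
$clean   = '<description><![CDATA[Test Product]]></description>';
$mangled = '<description><![CDATA[Test Product]]&gt;</description>';

var_dump(find_escaped_cdata_end($clean) === []);
var_dump(count(find_escaped_cdata_end($mangled)) === 1);
```

Calling this once at the end of build_xml() and once right before the output is sent would show on which side of the function the change happens.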
2. The XML might get changed by wordpress.
In comments you asked:
Why would var_dump be filtered? I'm running this in a plugin I'm building. Not sure why this would be filtered.
Besides filtering done by the program you view it with (browser, source viewer, etc.), WordPress itself or one of its add-ons (plugins, themes) might filter the output. Your comment says you don't know why this can happen, so you don't know where it can happen either.
You have not yet shared how the XML is output. Are you just echoing it to the browser? Is it passed to some function that handles the output? This is most likely very important for finding the cause of your issue. For example, is your plugin answering an XML-RPC request? In your question you focus a lot on the invalid XML, but you didn't share much about the purpose the XML is created for, where it goes, and so on. This information would be useful to understand the bigger picture.
If you take care of the output yourself (echo, print, etc.), some code might have installed an output buffer. That means your output gets buffered and possibly processed later on. These output-buffer issues are harder to track down. First of all, you can disable all other plugins and themes and see what happens. WordPress itself does not make much use of output buffering (Output Buffering Control [Docs]), so this could narrow it down quite fast, because then only the default output buffering would interfere with your output.
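To see whether any output buffer is active at the point where the XML is echoed, PHP can report the current nesting level and the registered handlers. This is a minimal sketch; describe_output_buffers() is a made-up helper name, and writing to the error log is just one way to keep the diagnostic out of the buffered output.

```php
<?php
// Hypothetical helper: summarise the output buffers active right now.
function describe_output_buffers(): string
{
    return sprintf(
        'level=%d handlers=%s',
        ob_get_level(),
        implode(',', ob_list_handlers())
    );
}

// Log instead of echo, so the message itself is not captured by a buffer.
error_log(describe_output_buffers());
```

A level above what your PHP configuration sets by default, or a handler name you do not recognise, would point at a plugin or theme that post-processes the output.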
If you use a WordPress function to output the XML, then filters can be in action. WordPress has a built-in filter system that allows it to hook into and change various data. Additionally, WordPress core functions are always "trying hard" to escape output. So there can be a lot of points where this filtering actually takes place. "Not sure why this would be filtered." - There might be no why for your case; it's just that it always happens.
These issues can be located more easily by using an interactive debugger with breakpoints and variable inspection. It allows you to look into the program while it executes and you can see "live" what happens with the data. However you don't have it always. The other alternative is to set breakpoints yourself (die) and do the output yourself (echo, var_dump etc.).
3. The XML might get changed by the browser.
I already wrote about this at the beginning and in between. Basically, if you're not seeing the source as-is but mangled by the browser, you might misjudge the cause of the error. It's like wearing the wrong glasses: it hinders you from tracking things down in the first place. So know your tools.
Things are not always easy to detect. You need to look into the right area in the first place and you need to consistently track things down. There can be various reasons why things happen if the software is more complex like Wordpress.
Try using html_entity_decode() or htmlspecialchars_decode(). Either should work for this case.
http://www.php.net/manual/en/function.html-entity-decode.php
http://www.php.net/manual/en/function.htmlspecialchars-decode.php
Encode it on purpose:
$xml .= '<item number="'.$i.'">
<sku>'.$this->get_product_sku($key, $value).'</sku>
<description>
<![CDATA[
'.get_the_title($value['prodid']).'
]]>
</description>
<qty>'.$value['quantity'].'</qty>
<price>'.$value['price'].'</price>
<extended>'.$value['quantity']*$value['price'].'</extended>
</item>';
Then decode it on display:
echo html_entity_decode($xml);
I know this is an old thread that I am reviving, but I thought I'd share this so that others looking for a solution to a similar problem might benefit, especially since this whole discussion doesn't have the right answer.
The solution is very simple. The problem is that WordPress processes this as HTML rather than as script output and converts the greater-than symbol > to &gt;. The offending code is in /wp-includes/post-template.php and looks like this:
function the_content($more_link_text = null, $stripteaser = false) {
$content = get_the_content($more_link_text, $stripteaser);
$content = apply_filters('the_content', $content);
/** $content = str_replace(']]>', ']]&gt;', $content); */
As you may notice, the last line converts ]]> to ]]&gt;. Commenting it out will solve the problem.
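If editing a core file is not an option, another angle is to build the fragment with XMLWriter, so the CDATA delimiters are produced by the library, and to echo the result directly rather than passing it through the_content(). This is a sketch under assumptions: the literal values stand in for get_the_title() and the cart data, and it does not by itself stop a later filter from rewriting the string.

```php
<?php
// Sample values standing in for get_the_title($value['prodid']) and the
// cart entry from the question.
$title = 'Test Product';

$writer = new XMLWriter();
$writer->openMemory();

$writer->startElement('item');
$writer->writeAttribute('number', '2');
$writer->writeElement('sku', '45NK2');

// writeCdata() emits the opening and closing CDATA delimiters itself.
$writer->startElement('description');
$writer->writeCdata($title);
$writer->endElement(); // description

$writer->writeElement('qty', '2');
$writer->endElement(); // item

$itemXml = $writer->outputMemory();
echo $itemXml, "\n";
```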

Why do I see `á` instead of a space when writing to screen (encoding problem)?

I am completely lost with encoding issues, I have no idea what's going on, what the problem is exactly and how to fix it.
Basically I'm just trying to read an HTML file from a Zip file, parse it then output pieces to XML. Now something funky is happening with the text I get out of the parser.
When parsing the HTML, instead of a space I get á only if I write to the screen. If I keep it in a variable and write to a file it looks fine in the file. However even though it looks right in the XML something is wrong with it, my PHP parser can't parse that XML nor does IE seem to like it.
I had to first mb_convert_encoding($xmlcontent, "ASCII"); so I could get that XML to parse in PHP.
Any idea what my problem is?
extract HTML from a .tar.gz file using Perl
my $tar = Archive::Tar->new;
$tar->read("myfile.tar.gz");
$tar->extract_file('index.html', 'output.html');
load HTML, this is where it starts to get funky, I get output like Numberáofásourceálines
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('output.html') or die $!;
$tree->elementify;
write to XML
my $output = new IO::File(">output.xml");
my $writer = new XML::Writer(OUTPUT => $output, DATA_MODE => 1,DATA_INDENT => 2);
If it looks correct when you write it to a file and wrong when you write it to the terminal, it sounds like your terminal is expecting the wrong encoding. Check your terminal settings.
Also, see Jon Rockway's answer to "Why does modern Perl avoid UTF-8 by default?". With encodings, you have to convert your input to the correct encoding and convert your output to the correct encoding. Everything that looks at the data needs to know which encoding you're using.
I think I just fixed it by processing this on the html before parsing it, thanks for all the great pointers!
s/\&nbsp\;/ /g;
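Since the XML is later parsed in PHP, the same clean-up can be sketched there as well. This is an assumption-laden example: normalize_nbsp() is a made-up helper, and it assumes the text is UTF-8. It replaces both the &nbsp; entity and the decoded no-break space character (U+00A0). As a side note, byte 0xA0 is á in DOS code pages 437 and 850, which may be why a console showed á where the no-break space was.

```php
<?php
// Hypothetical helper: turn no-break spaces into ordinary spaces, whether
// they appear as the HTML entity or as the decoded U+00A0 character.
function normalize_nbsp(string $html): string
{
    return str_replace(['&nbsp;', "\u{00A0}"], ' ', $html);
}

echo normalize_nbsp('Number&nbsp;of&nbsp;source&nbsp;lines'), "\n";
```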
