Parsing XML using PHP - Which includes ampersands and other characters - php

I'm trying to parse an XML file and one of the fields looks like the following:
<link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>
This seems to break the parser. I think it might have something to do with the & in the link?
My code is quite simple:
<?php
$xml = simplexml_load_file("files/this.xml");
echo $xml->getName() . "<br />";
foreach($xml->children() as $child) {
echo $child->getName() . ": " . $child . "<br />";
}
?>
Any ideas how I can resolve this?

The XML snippet you posted is not valid. Ampersands have to be escaped, which is why the parser complains.

Your XML feed is not valid XML: the & should be escaped as &amp;
This means you cannot use an XML parser on it as-is :-(
A possible "solution" (feels wrong, but should work) would be to replace every '&' that is not part of an entity with '&amp;', to get a valid XML string before loading it with an XML parser.
In your case, considering this :
$str = <<<STR
<xml>
<link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>
</xml>
STR;
You might use a simple call to str_replace, like this :
$str = str_replace('&', '&amp;', $str);
And, then, parse the string (now XML-valid) that's in $str :
$xml = simplexml_load_string($str);
var_dump($xml);
In this case, it should work...
But note that you must take care with entities: if the feed already contains an entity like '&gt;', you must not turn it into '&amp;gt;'!
Which means that such a simple call to str_replace is not the right solution: it will probably break things on many XML feeds!
Up to you to find the right way to do that replacement, maybe with some kind of regex...

It breaks the parser because your XML is invalid: & should be encoded as &amp;.

If your XML already has some escaping, this way it will be preserved and unescaped ampersands will be fixed:
$brokenXmlText = file_get_contents("files/this.xml");
$fixed = preg_replace('/&(?!lt;|gt;|quot;|apos;|amp;|#)/', '&amp;', $brokenXmlText);
$xml = simplexml_load_string($fixed);
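A slightly stricter variant of the same idea, as a rough sketch rather than a definitive fix: the helper name escapeBareAmpersands is made up for illustration, and the sample XML is a shortened version of the link from the question. It only leaves an ampersand alone when it starts one of the five predefined XML entities or a complete numeric character reference.

```php
<?php
// Hypothetical helper (not part of PHP): escape only the ampersands that do
// not already start a predefined XML entity or a numeric character reference.
function escapeBareAmpersands(string $xmlText): string
{
    return preg_replace(
        '/&(?!(?:lt|gt|quot|apos|amp|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xmlText
    );
}

// Shortened sample: one bare ampersand in the link, two valid entities in the note.
$broken = '<xml><link>http://foo.com/click.php?var_a=a&var_b=b</link><note>1 &lt; 2 &amp; 3</note></xml>';
$fixed  = escapeBareAmpersands($broken);
$xml    = simplexml_load_string($fixed);

// Casting a node to string decodes the entities again,
// so the link comes back with a plain & in it.
echo (string) $xml->link, "\n";
echo (string) $xml->note, "\n";
```

This still treats the input as text, so an ampersand inside a CDATA section or a comment would be rewritten as well. If that matters for your feed, fixing the producer is the safer route.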

The comment by mjv resolved it:
Alternatively to using &amp;, you may consider putting the URLs and other XML-unfriendly content in <![CDATA[ ... ]]>, i.e. a Character Data block

I think this will help you
http://www.php.net/manual/en/simplexml.examples-errors.php#96218
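In the same spirit as that manual page, here is a minimal sketch of collecting the libxml errors yourself instead of letting them surface as PHP warnings. The sample string is a shortened, still invalid version of the link from the question; the message format is just one possible layout.

```php
<?php
// Keep libxml errors in an internal buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Shortened sample with a bare ampersand, so parsing is expected to fail.
$source = '<xml><link>http://foo.com/click.php?var_a=a&var_b=b</link></xml>';
$xml = simplexml_load_string($source);

$messages = [];
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError object with line, column and message.
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }
    // Empty the buffer so later parses start clean.
    libxml_clear_errors();
}

echo implode("\n", $messages), "\n";
```

That way you can log the exact position of the offending character and decide whether to repair the input or reject it.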

Related

simplexml_load_file fails to read a node which contains a URL with a query string?

I am working on a simple project that reads data from XML and saves it into a database. But I ran into a problem while reading the XML file. A brief description follows:
My parser.php code:
if (file_exists('zero.xml')) {
$xml = simplexml_load_file('zero.xml');
echo $xml->productURL;
} else {
exit('Failed to open zero.xml.');
}
and the zero.xml file contains:
<product sku="107">
<price>6999</price>
<productURL>https://www.example.com/in/open.pl?user_id=2&b=96</productURL>
</product>
When I run the code I get no output, just warnings:
EntityRef: expecting ';' in C:\xampp\htdocs\sqlsvr2012\parser.php on
line 31
When I replace the "&" in the URL
user_id=2&b=96
with ; or : then I get output. But that is not the exact URL format, so it is not acceptable. I don't know why this does not work with the & character. Please help me fix this issue.
I am adding a small workaround to salathe's answer; it is not ideal, but it will work. You can load the file contents, run a find-and-replace with str_replace() (& -> &amp;), and pass the cleaned-up contents to the SimpleXML parser. Note that this also rewrites ampersands that already begin an entity.
<?php
$path_to_file = 'path/to/the/file.xml';
$file_contents = file_get_contents($path_to_file);
$file_contents = str_replace("&", "&amp;", $file_contents);
$xml = simplexml_load_string($file_contents);
?>
Hope this may help!
A bare & is not valid XML character data, since it is used to denote the start of an entity reference. You have several options, including enclosing the value in a CDATA block; my preferred choice is to use the &amp; entity, like ?user_id=2&amp;b=96
Entity example
<productURL>https://www.example.com/in/open.pl?user_id=2&amp;b=96</productURL>
CDATA example
<productURL><![CDATA[https://www.example.com/in/open.pl?user_id=2&b=96]]></productURL>
The XML spec says the following on this matter:
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively.
– http://www.w3.org/TR/REC-xml/#syntax
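To check that the two forms really are equivalent from PHP's point of view, here is a small sketch. The variable names are mine and the XML is trimmed down to the one element that matters.

```php
<?php
// The same URL written once with an escaped ampersand and once inside CDATA.
$withEntity = '<product><productURL>https://www.example.com/in/open.pl?user_id=2&amp;b=96</productURL></product>';
$withCdata  = '<product><productURL><![CDATA[https://www.example.com/in/open.pl?user_id=2&b=96]]></productURL></product>';

// Casting the node to string yields the decoded text content in both cases.
$fromEntity = (string) simplexml_load_string($withEntity)->productURL;
$fromCdata  = (string) simplexml_load_string($withCdata)->productURL;

echo $fromEntity, "\n";
echo $fromCdata, "\n";
var_dump($fromEntity === $fromCdata);
```

Either way the code that consumes the value sees a plain & in the URL; the escaping only exists in the file.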

Why does the PHP function strip_tags() remove data that is not tags? How to avoid this?

This code:
$input = 'I love <3 PHP!';
echo strip_tags($input);
Outputs:
I love
Is there a PHP function (or anyone's custom function) which would remove only tags (that means properly closed tags), not everything preceded by < ?
Why does the PHP function strip_tags() remove data that is not tags?
It errs on the side of security.
How to avoid this?
If you are expecting text input, use htmlspecialchars to escape < characters (and a few others) instead of removing them.
Try htmlspecialchars, it will still show tags, but converted to html entities
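For the input from the question, that would look roughly like this (the ENT_QUOTES flag and the explicit UTF-8 charset are my choice, not something the question requires):

```php
<?php
$input = 'I love <3 PHP!';

// Escape <, >, & and both kinds of quotes instead of stripping anything.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // prints: I love &lt;3 PHP!
```

The browser renders &lt; as a literal <, so the visitor sees the original text and nothing is lost.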
As of PHP 5 the Tidy extension is usually available in most compiled binaries. It is not 100% effective but could help you in this case. Tidy tries to close all unclosed HTML tags in a string. With it closed you could then ignore the wanted tag. You would then need to strip out the final tag that tidy put in.
Tidy documentation
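// tidy_repair_string() is the procedural entry point of the Tidy extension;
// 'show-body-only' keeps Tidy from wrapping the result in a full HTML document.
// What Tidy does with "<3" may vary between versions, so check the output.
$str = tidy_repair_string("I <3 PHP", array('show-body-only' => true));
// the second parameter tells strip_tags() to keep the <3> "tag"
$str = strip_tags($str, '<3>');
$str = str_replace('<3>', '<3', $str);
echo $str;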

Correct combination of escape characters to inject PHP in javascript?

I've tried all the combination I know of but can't get it right!
echo <<<EOF
Popup!
EOF;
I want to pass the string that is contained in $comments to the popup, but I can't seem to get the right combination of escape characters and concatenation. Help pls!
TIA
Edit: This is the HTML that goes into the string I mentioned.
$comments.= "<b>" . $row['comName'] . "</b><br><i>" . $row['comment'] . "</i><br><br>";
You need to escape the string to valid Javascript/JSON first to preserve Javascript syntax, then escape the Javascript to preserve the syntax of the HTML it's embedded in:
$js = sprintf('javascript:popup(%s)', json_encode($comments));
printf('<a href="%s">Popup!</a>', htmlspecialchars($js));
Since this is quite a pain, you should really try to go for unobtrusive Javascript, which separates Javascript from HTML.
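One way the unobtrusive variant could look on the PHP side, as a sketch only: the helper name popupLink, the popup-link class and the data-comments attribute are all made up for illustration. The comment HTML goes into a data attribute, so only one layer of escaping (HTML) is needed.

```php
<?php
// Hypothetical helper: carry the comment HTML in a data attribute so that
// no JavaScript has to be written inside the HTML attribute itself.
function popupLink(string $comments): string
{
    return sprintf(
        '<a href="#" class="popup-link" data-comments="%s">Popup!</a>',
        htmlspecialchars($comments, ENT_QUOTES, 'UTF-8')
    );
}

// Sample value in the shape of the $comments string from the question.
echo popupLink('<b>Ann</b><br><i>It\'s "great"</i><br><br>'), "\n";
```

A separate script would then attach a click handler to .popup-link elements and pass element.dataset.comments to the existing popup() function.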

Replacing &#xD; with <br> before printing XML contents using SimpleXML

I know it's probably a very simple issue but still I didn't find a solution...
Ok, I'll be brief:
suppose I have an XML file structured like this:
<root><item><text>blah blah blah&#xD;
blah blah blah&#xD;
blah blah blah&#xD;
...</text></item></root>
My XML is obviously more complex but that's not important since my question is:
how do I replace those &#xD; with, for instance, HTML <br> tags?
I'm using SimpleXML to read data from XML and tried with:
echo str_replace("&#xD;", "<br>", $message->text);
and even with:
echo str_replace("\n", "<br>", $message->text);
but nothing...
I need to use SimpleXML for this.
&#xD; represents the ASCII "carriage return" character (ASCII code 13, which is D in hexadecimal), sometimes written "\r", rather than the "linefeed" character, "\n" (which is ASCII 10, or A in hex). Note that when SimpleXML is asked for the string content of a node (with (string)$node or implicitly with statements like echo $node) it will turn this "entity" into the actual character it represents.
Depending on your platform (Windows, Linux, MacOS, etc), the standard line-ending, accessible via the built-in constant PHP_EOL, will be either "\n", "\r\n", or "\r".
The safest way to replace these with HTML linebreak tags (<br>) is to replace any of these characters, since you don't know which convention the source of the XML data might have been using.
PHP has a built-in function which should be able to do this for you, called nl2br(). If you want a slightly custom version, there's a comment in the docs from "ngkongs" showing how to use str_replace to similar effect.
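A minimal sketch of the "replace any of these" approach, assuming the structure from the question (the function name newlinesToBr is made up). The order of the search array matters: "\r\n" has to come first so a Windows-style pair becomes one <br> rather than two.

```php
<?php
// Hypothetical helper: turn every line-ending convention into a <br> tag.
function newlinesToBr(string $text): string
{
    // "\r\n" first, so a CR+LF pair is replaced once, not twice.
    return str_replace(array("\r\n", "\r", "\n"), "<br>", $text);
}

// Same shape as the XML in the question: a &#xD; reference before each newline.
$xml = simplexml_load_string(
    "<root><item><text>blah blah blah&#xD;\nblah blah blah&#xD;\n...</text></item></root>"
);

// Casting to string gives the decoded text, carriage returns included.
$html = newlinesToBr((string) $xml->item->text);
echo $html, "\n";
```

Unlike a replacement based on PHP_EOL, this does not depend on the platform the script happens to run on.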
I figured out how to solve this just before posting my question so, having already written it up, I'll share it in the hope it will be useful to someone else sooner or later...
This does the trick:
echo str_replace(PHP_EOL, "<br>", $message->text);

xsl to php array

I have an XML file that contains hierarchical data. Now I need to get some of that data into a PHP array. I use XSL to get the data I want formatted as a PHP array. But when I print it, it leaves all the tabs and extra spaces and line breaks etc. which I need to get rid of to turn it into a flat string (I suppose!) and then convert that string into an array.
In the xsl I output as text and have indent="no" (which does nothing). I've tried to strip \t \n \r etc but it doesn't affect the output at all.
Is there a really good php function out there that can strip out all formatting except single spaces? Or is there more going on here I don't know about or another way of doing the same thing?
First off, using xsl output to form your PHP array is fairly inelegant and inefficient. I would highly suggest going with something like the domdocument class available in PHP (http://www.php.net/manual/en/class.domdocument.php). If you must stick with your current method, try using regular expressions to remove any unnecessary whitespace.
$string = preg_replace('/\s+/', '', $string);
or
$string = preg_replace('/\s\s+/', ' ', $string);
to preserve single white space.
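Putting both suggestions together, a rough sketch: read the values with DOMDocument and collapse the whitespace per value. The <data> element name and the collapseWhitespace helper are assumptions for illustration; adapt them to whatever your XSL actually emits.

```php
<?php
// Hypothetical helper: squeeze every run of whitespace (spaces, tabs,
// newlines) into a single space and trim both ends.
function collapseWhitespace(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Assumed sample input with indentation and line breaks inside the values.
$doc = new DOMDocument();
$doc->loadXML(
    "<xml>\n  <data>value\n    with newline\n  </data>\n  <data>with   lots   of whitespace</data>\n</xml>"
);

// Collect the cleaned text content of every <data> element.
$values = array();
foreach ($doc->getElementsByTagName('data') as $node) {
    $values[] = collapseWhitespace($node->textContent);
}

print_r($values);
```

Cleaning each value separately keeps the structure intact, which is harder to guarantee when one regex is run over the whole output string.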
I've created a class for an open-source library that you're welcome to use, and to look at as an example of how to create an array from XML (and just take out the "good" parts).
USING XML
So the crux of the problem is probably keeping the data in XML as long as possible. After the XSL transformation you would then have something like:
<xml>
<data>value
with newline
</data>
<data>with lots of whitespace</data>
</xml>
Then you could loop through that data like:
$data_array = array(); // collected values
$xml = simplexml_load_string($xml_string);
foreach($xml as $data)
{
// use str_replace or a regular expression to replace the values...
$data_array[] = str_replace(array(" ", "\n"), "", $data);
}
// $data_array is the array you want!
USING JSON
However, if you can't output XML from the XSL and loop through it, you may want to use XSL to create a JSON string and convert that to an array, so the XSL output would look like:
{
"0" : "value
with newline",
"1" : "with lots of whitespace"
}
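Then you could loop through that data like:
// json_decode (not json_encode) parses the string; TRUE returns an associative array.
// Note: json_decode() returns null if a value contains a raw, unescaped newline.
$json_array = json_decode($json_string, TRUE);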
foreach($json_array as $key => $value)
{
// use str_replace or a regular expression to replace the values...
$json_array[$key] = str_replace(array(" ", "\n"), "", $value);
}
Either way you'll have to pull the values in PHP because XSLT's handling of spaces and newlines is pretty rudimentary.
