Parsing xml feed with cdata PHP SimpleXML [duplicate] - php

This question already has answers here:
How to parse CDATA HTML-content of XML using SimpleXML?
(2 answers)
Closed 8 years ago.
I am converting an RSS feed to JSON using PHP, with the code below.
My JSON output contains the description data from each item element, but the title and link data are not extracted.
The problem is either incorrect CDATA in the feed or my code not parsing it correctly.
The XML is the feed fetched in the first lines of the code:
$blog_url = 'http://www.blogdogarotinho.com/rssfeedgenerator.ashx';
$rawFeed = file_get_contents($blog_url);
$xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);
// step 2: extract the channel metadata
$articles = array();
// step 3: extract the articles
foreach ($xml->channel->item as $item) {
    $article = array();
    $article['title'] = (string)trim($item->title);
    $article['link'] = $item->link;
    $article['pubDate'] = $item->pubDate;
    $article['timestamp'] = strtotime($item->pubDate);
    $article['description'] = (string)trim($item->description);
    $article['isPermaLink'] = $item->guid['isPermaLink'];
    $articles[$article['timestamp']] = $article;
}
echo json_encode($articles);

I think you are just the victim of the browser hiding the tags. Let me explain:
Your input feed doesn't really have <![CDATA[ ]]> sections in it. The < and > characters are entity-encoded in the raw source of the RSS stream. Hit Ctrl+U on the RSS link in your browser and you will see:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
<channel>
<description>Blog do Garotinho</description>
<item>
<description>&lt;![CDATA[&lt;br&gt;
Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
</description>
<link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
...
<title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
</item>
As you can see, the content of <title>, for example, starts with &lt;, which turns into a literal < when SimpleXML returns it for your JSON data.
Now if you look at the printed JSON data in a browser, the browser sees the following:
"title":"<![CDATA[A bancada dos caras de pau]]>"
This is not rendered because the browser treats it as a tag. The description seems to show up because it contains a <br> tag at some point, which ends the first "tag", and thus you can see the rest of the output.
If you hit Ctrl+U you should see the output printed as expected (I used a command-line PHP script myself and did not notice this at first).
Try this demo:
There seems to be an empty "" after the "title":
http://codepad.viper-7.com/ZYpaS1
However, if I put htmlspecialchars() around the json_encode(), the values become "visible":
http://codepad.viper-7.com/1nHqym
You could get rid of these markers by stripping them after the parse with a simple preg_replace():
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}
This takes care of the CDATA markers if they are at the start or the end of the individual element values. You can call it inside the foreach() loop like this:
// ....
$article['title'] = clean_cdata($item->title);
// ....
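To see the effect end to end, here is a minimal sketch that runs clean_cdata() against a small inline feed instead of the live URL. The feed snippet is a hypothetical stand-in that assumes the real feed entity-encodes its CDATA markers as described above.

```php
<?php
// Hypothetical feed snippet: the CDATA markers are entity-encoded,
// so the XML parser treats them as ordinary text.
$rawFeed = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
    </item>
  </channel>
</rss>
XML;

// Strip a leading "<![CDATA[" and a trailing "]]>" from a value.
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

$xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);
$item = $xml->channel->item[0];

$rawTitle = (string) $item->title;        // still carries the literal markers
$cleanTitle = clean_cdata($item->title);  // markers removed

echo json_encode(array('raw' => $rawTitle, 'clean' => $cleanTitle)), PHP_EOL;
```

Viewed in a browser, only the cleaned value is visible, because the raw one looks like a tag.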

Related

Improve security with simplexml

I have an XML document, and with SimpleXML I can easily parse it into what I want.
My XML:
<?xml version="1.0" encoding="UTF-8"?>
<noticias>
<noticia url="noticia-1">
<titulo>título da notícia 1</titulo>
<desc>some description</desc>
<texto>some text here</texto>
<img>filename here</img>
<in>some reference to where it came from</in>
</noticia>
...
</noticias>
PHP SimpleXML parser:
$file = 'xml/noticias.xml';
if (file_exists($file)) {
    $xml = simplexml_load_file($file);
    foreach ($xml as $item) {
        $url = $item['url'];
        $titulo = $item->titulo;
        ...
        echo '<div><h2>'.$titulo.'</h2></div>';
    }
}
My question is: is this secure? How can I improve security?
Thanks in advance.
It is not. However, the problem in your source is not related to SimpleXML. You output a string value from an external data source (an XML file) as HTML source. This allows for something called HTML injection: the value can break your output, or manipulate it without the user noticing.
Here is a small example based on your source:
$xmlString = <<<'XML'
<noticias>
  <noticia url="noticia-1">
    <titulo>título da &lt;i&gt;notícia&lt;/i&gt; 1</titulo>
  </noticia>
</noticias>
XML;
$xml = simplexml_load_string($xmlString);
foreach ($xml->noticia as $item) {
    $titulo = $item->titulo;
    echo '<div><h2>'.$titulo.'</h2></div>';
}
Output:
<div><h2>título da <i>notícia</i> 1</h2></div>
The i elements are text content in the XML, but HTML source in the output. Part of the title will be rendered in italics in the browser. This is a harmless example of HTML injection, but imagine someone with less friendly intent.
If you output any value to HTML, make sure to escape special characters with htmlspecialchars() or use an API (like DOM) that does the escaping for you.
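As a minimal sketch of the escaping advice, the same loop can pass each value through htmlspecialchars() before it reaches the HTML. The inline XML and the $html accumulator are illustrative assumptions; the ENT_QUOTES flag and the UTF-8 charset are one reasonable choice, not the only one.

```php
<?php
// Sample data: the <i> markup is entity-encoded, so it is text content in the XML.
$xmlString = <<<'XML'
<noticias>
  <noticia url="noticia-1">
    <titulo>título da &lt;i&gt;notícia&lt;/i&gt; 1</titulo>
  </noticia>
</noticias>
XML;

$xml = simplexml_load_string($xmlString);

$html = '';
foreach ($xml->noticia as $item) {
    // Cast to string first, then escape the HTML special characters.
    $titulo = htmlspecialchars((string) $item->titulo, ENT_QUOTES, 'UTF-8');
    $html .= '<div><h2>' . $titulo . '</h2></div>';
}

echo $html, PHP_EOL;
```

With the escaping in place, the browser shows the <i> tags as literal text instead of rendering them.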

simplexml_load_string is stripping data out of soap response

I am calling a SOAP service. When I get the data and pass it through simplexml_load_string in order to access it as an object (or just to access the data at all), simplexml_load_string seems to strip content out.
A raw response from the SOAP service looks like:
A result parsed through simplexml_load_string
using the following code:
$result = simplexml_load_string((string)$result->DisplayCategoriesResult->any);
I get a result of:
This looks correct, but on closer inspection you will notice that it contains only the IDs; the category names are left behind by simplexml_load_string.
How can I get the proper result? If there is another way of getting the raw data into a "usable" form or object, that solution is also welcome.
The text content of XML nodes doesn't show up when using print_r or var_dump, etc. They aren't "traditional" PHP objects, so you can't use the standard debugging options.
To access the text content (whether embedded as CDATA or otherwise), you need to step down into the child elements, and then cast them to strings:
<?php
$xml = <<<XML
<randgo xmlns="">
    <status>0</status>
    <message>Success</message>
    <categories>
        <category id="53"><![CDATA[go eat]]></category>
        <category id="54"><![CDATA[go do]]></category>
        <category id="55"><![CDATA[go out]]></category>
    </categories>
</randgo>
XML;
$sxml = simplexml_load_string($xml);
foreach ($sxml->categories->category as $category)
{
    echo $category['id'] . ": " . (string) $category, PHP_EOL;
}
Output:
$ php simplexml_categories.php
53: go eat
54: go do
55: go out
See: https://eval.in/590975
(Sorry if there are any typos in the XML, I think I copied from the screenshot correctly...)
The category names are CDATA. Try something like this to read them:
$doc = new DOMDocument();
// The SOAP payload is an XML string, so use loadXML(); load() expects a file path or URL
$doc->loadXML((string)$result->DisplayCategoriesResult->any);
// The CDATA sections are children of each <category> element
$categories = $doc->getElementsByTagName("category");
foreach ($categories as $category) {
    foreach ($category->childNodes as $child) {
        if ($child->nodeType == XML_CDATA_SECTION_NODE) {
            echo $child->textContent . "<br/>";
        }
    }
}
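For a version that runs without the SOAP client, here is a minimal DOM sketch on an inline sample. The XML is a shortened, hypothetical copy of the payload from the first answer, and the $names array is only there for illustration. Reading textContent on each category element returns the CDATA text directly, so no node-type check is needed.

```php
<?php
// Shortened stand-in for the SOAP payload (assumed structure).
$xmlString = <<<'XML'
<randgo xmlns="">
  <status>0</status>
  <categories>
    <category id="53"><![CDATA[go eat]]></category>
    <category id="54"><![CDATA[go do]]></category>
  </categories>
</randgo>
XML;

$doc = new DOMDocument();
$doc->loadXML($xmlString);

// Map each category id to its name.
$names = array();
foreach ($doc->getElementsByTagName('category') as $category) {
    $names[$category->getAttribute('id')] = $category->textContent;
}

print_r($names);
```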

parse and process HTML/XML/plain text page [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I am creating a small PHP app that pulls data from a remote website. It is working great, but I would like to make it more user friendly now.
I need to get a few specific items from the page. As far as I can tell, the page looks like an XML file when you look at the source code, but it has no styling and appears as plain text, so I don't really know what to do.
The page I am trying to get looks like this:
<channel>
<name>data</name>
<id>data</id>
<img>data</img>
<auther>data</auther>
<mp3>data</mp3>
<bio>data</bio>
</channel>
<channel>
<name>data</name>
<id>data</id>
<img>data</img>
<auther>data</auther>
<mp3>data</mp3>
<bio>data</bio>
</channel>
<channel>
<name>data</name>
<id>data</id>
<img>data</img>
<auther>data</auther>
<mp3>data</mp3>
<bio>data</bio>
</channel>
<channel>
<name>data</name>
<id>data</id>
<img>data</img>
<auther>data</auther>
<mp3>data</mp3>
<bio>data</bio>
</channel>
I need to get all the data from each tag under the channel tag and keep it in the same order, so I can echo it back out onto my own page in the same way.
How could I do this? I tried using a regex with the following pattern:
$pattern = '<channel>
<name>(.*)</name>
<id>(.*)</id>
<img>(.*)</img>
<auther>(.*)</auther>
<mp3>(.*)</mp3>
<bio>(.*)</bio>
</channel>';
but that doesn't work. I really need the best and simplest way to do this.
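// $str holds the page source. It has no single root element, so wrap it in one first.
$SimpleXMLElement = new SimpleXMLElement('<channels>' . $str . '</channels>');
foreach ($SimpleXMLElement->children() as $Channel) {
    foreach ($Channel->children() as $Child) {
        echo $Child->getName() . ' = ' . (string) $Child . PHP_EOL;
    }
}
This way you can use SimpleXMLElement; it's very easy.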
I would "sanitize" the incoming data and make an XML document out of it. This can be done by simply wrapping it in a surrounding tag (I named it channels). With that in place, you can parse the data using DOM:
// Sanitize input data. Make an XML document out of it
// ($url is the address of the remote page)
$xml = '<channels>';
$xml .= file_get_contents($url);
$xml .= '</channels>';
// Create a document
$doc = new DOMDocument();
$doc->loadXML($xml);
// Iterate through channel elements
foreach ($doc->getElementsByTagName('channel') as $channel) {
    echo $channel->getElementsByTagName('name')->item(0)->nodeValue . PHP_EOL;
    echo $channel->getElementsByTagName('id')->item(0)->nodeValue . PHP_EOL;
    // And so on ...
}
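The wrapping trick also combines with SimpleXML. Here is a minimal sketch that collects every channel into an array, keeping the fields in document order. The $page string is a hypothetical stand-in for the downloaded source, and the approach assumes the page contains only channel fragments (no XML declaration, no stray unescaped markup).

```php
<?php
// Hypothetical page source: several <channel> fragments with no common root.
$page = <<<'XML'
<channel>
<name>First</name>
<id>1</id>
</channel>
<channel>
<name>Second</name>
<id>2</id>
</channel>
XML;

// Wrap the fragments in a single root element so they form a well-formed document.
$xml = simplexml_load_string('<channels>' . $page . '</channels>');

$channels = array();
foreach ($xml->channel as $channel) {
    $fields = array();
    foreach ($channel->children() as $child) {
        // Tag name => text content, in the order the tags appear.
        $fields[$child->getName()] = (string) $child;
    }
    $channels[] = $fields;
}

foreach ($channels as $fields) {
    foreach ($fields as $name => $value) {
        echo $name, ' = ', $value, PHP_EOL;
    }
}
```

Because the data ends up in a plain array, it can be echoed into any HTML layout later; remember to escape the values when doing so.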

php - converting xml to json does not work when there is CDATA

If I use the following php code to convert an xml to json:
<?php
header("Content-Type:text/json");
$resultXML = "
<QUERY>
<Company>fcsf</Company>
<Details>
fgrtgrthtyfgvb
</Details>
</QUERY>
";
$sxml = simplexml_load_string($resultXML);
echo json_encode($sxml);
?>
I get
{"Company":"fcsf","Details":"\n fgrtgrthtyfgvb\n "}
However, If I use CDATA in the Details element as follows:
<?php
header("Content-Type:text/json");
$resultXML = "
<QUERY>
<Company>fcsf</Company>
<Details><![CDATA[
fgrtgrthtyfgvb]]>
</Details>
</QUERY>
";
$sxml = simplexml_load_string($resultXML);
echo json_encode($sxml);
?>
I get the following
{"Company":"fcsf","Details":{}}
In this case the Details element is blank. Any idea why Details is blank and how to correct this?
This is not a problem with the JSON encoding – var_dump($sxml->Details) shows you that SimpleXML already messed it up before, as you will only get
object(SimpleXMLElement)#2 (0) {
}
– an “empty” SimpleXMLElement, the CDATA content is already missing there.
And after we figured that out, googling for “simplexml cdata” leads us straight to the first user comment on the manual page on SimpleXML Functions, that has the solution:
If you are having trouble accessing CDATA in your simplexml document, you don't need to str_replace/preg_replace the CDATA out before loading it with simplexml.
You can do this instead, and all your CDATA contents will be merged into the element contents as strings.
$xml = simplexml_load_file($xmlfile, 'SimpleXMLElement', LIBXML_NOCDATA);
So, use
$sxml = simplexml_load_string($resultXML, 'SimpleXMLElement', LIBXML_NOCDATA);
in your code, and you’ll get
{"Company":"fcsf","Details":"\n fgrtgrthtyfgvb\n "}
after JSON-encoding it.
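If you only need a few fields, another option is to skip LIBXML_NOCDATA and cast the elements to strings yourself, since a string cast returns the text of an element whether or not it sits in a CDATA section. The sketch below shows both routes side by side; the $plain, $merged and $details names and the trim() call are illustrative choices, not part of the original answer.

```php
<?php
$resultXML = "
<QUERY>
<Company>fcsf</Company>
<Details><![CDATA[
fgrtgrthtyfgvb]]>
</Details>
</QUERY>
";

// Default parsing keeps the CDATA section as its own node type.
$plain = simplexml_load_string($resultXML);
// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$merged = simplexml_load_string($resultXML, 'SimpleXMLElement', LIBXML_NOCDATA);

// A string cast returns the text in either case.
$details = trim((string) $plain->Details);

// Route 1: encode the merged tree directly.
echo json_encode($merged), PHP_EOL;

// Route 2: build the array by hand from string casts.
$manual = json_encode(array(
    'Company' => (string) $plain->Company,
    'Details' => $details,
));
echo $manual, PHP_EOL;
```

The manual route takes more typing, but it gives you control over whitespace and over which elements end up in the JSON.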

Hide XML declaration in files generated using PHP

I was testing a simple example of how to display XML in the browser using PHP and found this example, which works well:
<?php
$xml = new DOMDocument("1.0");
$root = $xml->createElement("data");
$xml->appendChild($root);
$id = $xml->createElement("id");
$idText = $xml->createTextNode('1');
$id->appendChild($idText);
$title = $xml->createElement("title");
$titleText = $xml->createTextNode('Valid');
$title->appendChild($titleText);
$book = $xml->createElement("book");
$book->appendChild($id);
$book->appendChild($title);
$root->appendChild($book);
$xml->formatOutput = true;
echo "<xmp>". $xml->saveXML() ."</xmp>";
$xml->save("mybooks.xml") or die("Error");
?>
It produces the following output:
<?xml version="1.0"?>
<data>
<book>
<id>1</id>
<title>Valid</title>
</book>
</data>
Now I have two questions about how the output should look.
First, the first line of the XML file, '<?xml version="1.0"?>', should not be displayed; that is, it should be hidden.
Second, how can I display the text node on the next line? In total I am expecting output in this fashion:
<data>
<book>
<id>1</id>
<title>
Valid
</title>
</book>
</data>
Is it possible to get the desired output? If so, how can I accomplish that?
Thanks
To skip the XML declaration you can use the result of saveXML on the root node:
$xml_content = $xml->saveXML($root);
file_put_contents("mybooks.xml", $xml_content) or die("cannot save XML");
Please note that saveXML(node) has a different output from saveXML().
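A minimal sketch of that difference, using a smaller document than the one in the question (the variable names are illustrative):

```php
<?php
$xml = new DOMDocument("1.0");
$root = $xml->createElement("data");
$xml->appendChild($root);
$root->appendChild($xml->createElement("id", "1"));

// Whole document: starts with the XML declaration.
$withDeclaration = $xml->saveXML();
// Only the root node: no declaration is emitted.
$withoutDeclaration = $xml->saveXML($root);

echo $withDeclaration;
echo $withoutDeclaration, PHP_EOL;
```

The node form serializes just the element you pass in, so anything outside it (the declaration, and any comments or processing instructions next to the root) is left out.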
First question:
Here is my post where all usable threads with answers are listed: How do you exclude the XML prolog from output?
Second question:
I don't know of any PHP function that outputs text nodes like that.
You could:
read the XML using DOMDocument and save each node as a string
iterate through the nodes
detect text nodes and add newlines to the XML string manually
At the end you would have the same XML with text node values on their own lines:
<node>
some text data
</node>
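One way to sketch those steps is to edit the text nodes in the DOM itself rather than patching the serialized string. The helper name putTextOnOwnLines() and the two-space indent are assumptions made for this example, and the padding changes the text content of the elements, so consumers of the file would need to trim the values.

```php
<?php
// Pad every non-blank text node so it sits on its own line,
// indented one level deeper than its parent element.
function putTextOnOwnLines(DOMDocument $doc, $indent = '  ')
{
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()[normalize-space()]') as $text) {
        // Depth of the parent element (the root element has depth 0).
        $depth = 0;
        for ($n = $text->parentNode; $n->parentNode instanceof DOMElement; $n = $n->parentNode) {
            $depth++;
        }
        $text->nodeValue = "\n" . str_repeat($indent, $depth + 1) . trim($text->nodeValue)
            . "\n" . str_repeat($indent, $depth);
    }
}

$xml = new DOMDocument("1.0");
$root = $xml->appendChild($xml->createElement("data"));
$book = $root->appendChild($xml->createElement("book"));
$book->appendChild($xml->createElement("id", "1"));
$book->appendChild($xml->createElement("title", "Valid"));

putTextOnOwnLines($xml);
$xml->formatOutput = true;

// Passing the root node also leaves out the XML declaration.
echo $xml->saveXML($root), PHP_EOL;
```

With formatOutput enabled, elements that contain only other elements should still be indented by the serializer, while the padded text nodes are written out exactly as set.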
