I have an XML document, and with SimpleXML I can easily parse it into what I want.
My XML:
<?xml version="1.0" encoding="UTF-8"?>
<noticias>
<noticia url="noticia-1">
<titulo>título da notícia 1</titulo>
<desc>some description</desc>
<texto>some text here</texto>
<img>filename here</img>
<in>some reference to where it came from</in>
</noticia>
...
</noticias>
PHP SimpleXML parser:
$file = 'xml/noticias.xml';
if (file_exists($file)) {
    $xml = simplexml_load_file($file);
    foreach ($xml as $item) {
        $url = $item['url'];
        $titulo = $item->titulo;
        ...
        echo '<div><h2>'.$titulo.'</h2></div>';
    }
}
My question is: is this secure? How can I improve security?
Thanks in advance.
It is not. However, the problem in your source is not related to SimpleXML. You output a string value from an external data source (an XML file) as HTML source. This allows for something called HTML injection. It can simply break your output, or it can let the output be manipulated without the user noticing.
Here is a small example based on your source:
$xmlString = <<<'XML'
<noticias>
  <noticia url="noticia-1">
    <titulo>título da &lt;i&gt;notícia&lt;/i&gt; 1</titulo>
  </noticia>
</noticias>
XML;
$xml = simplexml_load_string($xmlString);
foreach ($xml->noticia as $item) {
    $titulo = $item->titulo;
    echo '<div><h2>'.$titulo.'</h2></div>';
}
Output:
<div><h2>título da <i>notícia</i> 1</h2></div>
The i elements are text content in the XML, but HTML source in the output. Part of the title will be rendered in italics in the browser. This is a harmless example of HTML injection, but imagine someone with less friendly intent.
If you output any value to HTML, make sure to escape special characters with htmlspecialchars() or use an API (like DOM) that does the escaping for you.
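As a minimal sketch of the escaping advice above, the loop from the example can pass each value through htmlspecialchars() before it reaches the output. The $html variable and the explicit ENT_QUOTES and 'UTF-8' arguments are choices made for this illustration, not part of the original code.

```php
<?php
$xmlString = <<<'XML'
<noticias>
  <noticia url="noticia-1">
    <titulo>título da &lt;i&gt;notícia&lt;/i&gt; 1</titulo>
  </noticia>
</noticias>
XML;

$xml = simplexml_load_string($xmlString);
$html = '';
foreach ($xml->noticia as $item) {
    // Cast to string first, then escape the HTML special characters
    $titulo = htmlspecialchars((string) $item->titulo, ENT_QUOTES, 'UTF-8');
    $html .= '<div><h2>' . $titulo . '</h2></div>';
}
echo $html, PHP_EOL;
```

With the escaping in place, the i tags show up as literal text in the browser instead of being rendered as markup.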
I want to create dynamic tags in XML using PHP,
like this: <wsse:Username>fqsuser01</wsse:Username>
The main thing is that I want to be able to change the prefix part of the tag (the "wsse" value).
What do I need to do to create this XML file with PHP?
Thanks,
For this purpose you can use XMLWriter, for example (another option is SimpleXML). Both options ship with PHP, so no third-party libraries are needed. wsse is a namespace prefix, so it is worth reading up on XML namespaces.
I also share with you some example code:
<?php
//create a new xmlwriter object
$xml = new XMLWriter();
//using memory for string output
$xml->openMemory();
//set the indentation to true (if false all the xml will be written on one line)
$xml->setIndent(true);
//create the document tag, you can specify the version and encoding here
$xml->startDocument();
//Create an element
$xml->startElement("root");
//Write to the element
$xml->writeElement("r1:id", "1");
$xml->writeElement("r2:id", "2");
$xml->writeElement("r3:id", "3");
$xml->endElement(); //End the element
//output the xml
echo $xml->outputMemory();
?>
Result:
<?xml version="1.0"?>
<root>
<r1:id>1</r1:id>
<r2:id>2</r2:id>
<r3:id>3</r3:id>
</root>
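The result above uses prefixes that are never declared, which a namespace-aware parser will complain about. A hedged sketch of one way around that is XMLWriter's namespace-aware methods, startElementNs() and writeElementNs(). The Security wrapper element and the urn:example:wsse URI are placeholders invented for this example; use the element names and URI your schema requires.

```php
<?php
// The prefix is a plain variable, so it can change at runtime
$prefix = 'wsse';
// Placeholder URI (assumption); replace it with the real namespace URI
$namespaceUri = 'urn:example:wsse';

$writer = new XMLWriter();
$writer->openMemory();
$writer->setIndent(true);
$writer->startDocument('1.0', 'UTF-8');
// Declares xmlns:<prefix> on the wrapper element
$writer->startElementNs($prefix, 'Security', $namespaceUri);
// Passing null as the namespace reuses the prefix without declaring it again
$writer->writeElementNs($prefix, 'Username', null, 'fqsuser01');
$writer->endElement();
$writer->endDocument();

$result = $writer->outputMemory();
echo $result;
```

Because the prefix is declared once on the wrapper, the child elements can use it and the document stays parseable.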
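You could build a string and convert it to XML using simplexml_load_string(). The string must be well formed, which includes declaring the namespace prefix.
<?php
$usernames = array(
    'username01',
    'username02',
    'username03'
);
// The prefix can come from a variable.
// The URI below is only a placeholder; use the one your schema requires.
$prefix = 'wsse';
$xml_string = "<{$prefix}:Usernames xmlns:{$prefix}=\"urn:example:wsse\">";
foreach ($usernames as $username) {
    // Escape the value so characters such as < and & cannot break the markup
    $value = htmlspecialchars($username, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    $xml_string .= "<{$prefix}:Username>{$value}</{$prefix}:Username>";
}
$xml_string .= "</{$prefix}:Usernames>";
$note = <<<XML
$xml_string
XML;
// The closing XML; marker above must sit at the very start of its own line
$xml = simplexml_load_string($note);
?>
If you wanted to be able to change the namespace prefix on each XML element, you would do something very similar to what is shown above (form the string with dynamic prefixes).
The closing marker of the heredoc is sensitive to its position: before PHP 7.3 it had to start in the first column and be followed by nothing except a semicolon. See https://www.w3schools.com/php/func_simplexml_load_string.asp for an example that you can copy & paste.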
I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
My problem is that I can't just ask them to use CDATA because they won't be able to handle it properly, and they are already used to doing without it.
Also, if possible, I don't want the file to be edited, because the information needs to be exactly what the user sent.
The function simplexml_load_string simply erases the HTML node and anything inside it.
How can I keep the information?
SOLUTION
To handle the problem I used asXml(), as explained by @ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to erase the node.
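The exact pattern used is not shown, so the following is only a sketch of the idea: take the asXML() output and strip the wrapping son tags with preg_replace(). The pattern and the $inner variable are assumptions made for this example, and a self-closing son element is not handled.

```php
<?php
$xml = simplexml_load_string('<father><son>Text with <b>HTML</b>.</son></father>');

// asXML() on a child element serializes the element including its own tags
$outer = $xml->son->asXML();

// Remove the opening tag at the start and the closing tag at the end
$inner = preg_replace('#^<son\b[^>]*>|</son>$#', '', $outer);

echo $inner, PHP_EOL;
```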
A CDATA section is a character node, just like a text node, but it does less encoding/decoding. This is mostly a downside, actually. On the upside, something in a CDATA section might be more readable for a human, and it allows for some backwards compatibility in special cases. (Think HTML script tags.)
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts too much).
$document = new DOMDocument();
$father = $document->appendChild(
    $document->createElement('father')
);
// First son: the HTML fragment stored in a text node
$son = $father->appendChild(
    $document->createElement('son')
);
$son->appendChild(
    $document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
// Second son: the same fragment stored in a CDATA section
$son = $father->appendChild(
    $document->createElement('son')
);
$son->appendChild(
    $document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
  <son>With &lt;b&gt;HTML&lt;/b&gt;&lt;br&gt;It's so nice.</son>
  <son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see, they are serialized very differently, but from the API's point of view they are basically interchangeable. If you're using an XML parser, the value you get back should be the same in both cases.
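A short sketch to check that claim: parse a document that holds the same fragment once as an escaped text node and once as a CDATA section, then compare the values DOM returns. The shortened fragment and the $values array are made up for this example.

```php
<?php
$xml = '<father>'
    . '<son>With &lt;b&gt;HTML&lt;/b&gt;</son>'
    . '<son><![CDATA[With <b>HTML</b>]]></son>'
    . '</father>';

$document = new DOMDocument();
$document->loadXML($xml);

$values = [];
foreach ($document->getElementsByTagName('son') as $son) {
    // textContent returns the decoded character data for both node types
    $values[] = $son->textContent;
}

var_dump($values[0] === $values[1]);
```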
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML-compatible HTML. You can mix and match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes. So here is an example how you can read it in DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
    $result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
    $result .= $document->saveXml($child);
}
echo $result;
I am parsing an RSS feed to JSON using PHP, with the code below.
My JSON output contains the description data from each item element, but the title and link data are not extracted.
The problem is either incorrect CDATA in the feed, or my code is not parsing it correctly.
The feed URL is in the code:
$blog_url = 'http://www.blogdogarotinho.com/rssfeedgenerator.ashx';
$rawFeed = file_get_contents($blog_url);
$xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);
// step 2: extract the channel metadata
$articles = array();
// step 3: extract the articles
foreach ($xml->channel->item as $item) {
    $article = array();
    // cast each node to a string so json_encode() emits plain values, not objects
    $article['title'] = trim((string) $item->title);
    $article['link'] = trim((string) $item->link);
    $article['pubDate'] = (string) $item->pubDate;
    $article['timestamp'] = strtotime((string) $item->pubDate);
    $article['description'] = trim((string) $item->description);
    $article['isPermaLink'] = (string) $item->guid['isPermaLink'];
    $articles[$article['timestamp']] = $article;
}
echo json_encode($articles);
I think you are just the victim of the browser hiding the tags. Let me explain:
Your input feed doesn't really contain <![CDATA[ ]]> sections. The < and > characters are entity encoded in the raw source of the RSS stream. Hit Ctrl+U on the RSS link in your browser and you will see:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
<channel>
<description>Blog do Garotinho</description>
<item>
<description>&lt;![CDATA[&lt;br&gt;
Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
</description>
<link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
...
<title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
</item>
As you can see, the <title> for example starts with a &lt;, which turns into a < when SimpleXML returns it for your JSON data.
Now if you are looking the printed json data in a browser your browser will see the following:
"title":"<![CDATA[A bancada dos caras de pau]]>"
This will not be rendered, because the browser treats it as the inside of a tag. The description seems to show up because it contains a <br> tag at some point, which ends the first "tag", and thus you can see the rest of the output.
If you hit Ctrl+U you should see the output printed as expected (I used a command-line PHP file myself and did not notice this at first).
Try this demo:
The value after "title" seems to be an empty "":
http://codepad.viper-7.com/ZYpaS1
However, if I put htmlspecialchars() around the json_encode() call:
http://codepad.viper-7.com/1nHqym the values become "visible".
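For readers without access to the demo links, here is a small self-contained sketch of the same effect. The one-item feed is made up for this example and only mimics the entity-encoded markers of the real feed.

```php
<?php
// Made-up feed: the CDATA markers are entity encoded, as in the real one
$feed = '<rss><channel><item>'
    . '<title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>'
    . '</item></channel></rss>';

$xml = simplexml_load_string($feed);
$json = json_encode(['title' => (string) $xml->channel->item->title]);

// Raw JSON: a browser that renders this as HTML hides the "<![CDATA[...]]>" part
echo $json, PHP_EOL;

// Escaped for display: the markers become visible as text
$escaped = htmlspecialchars($json, ENT_QUOTES, 'UTF-8');
echo $escaped, PHP_EOL;
```

The escaping is only for looking at the output in a browser. A consumer of the JSON would want the raw string.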
You could get rid of these markers by stripping them after the parse with a simple preg_replace():
function clean_cdata($str) {
    // Strip a leading "<![CDATA[" and a trailing "]]>", including surrounding whitespace
    return preg_replace('#^\s*<!\[CDATA\[|\]\]>\s*$#', '', (string)$str);
}
This should take care of the CDATA markers if they are at the start or the end of the individual values. You can call this inside the foreach() loop like this:
// ....
$article['title'] = clean_cdata($item->title);
// ....
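To see the clean-up in isolation, the sketch below runs a variant of the helper against a single made-up item. The name strip_cdata_markers() and the sample XML are assumptions for this example. The pattern only touches markers at the very start and end of the value.

```php
<?php
function strip_cdata_markers($value)
{
    // Remove "<![CDATA[" at the start and "]]>" at the end, with surrounding whitespace
    return preg_replace('#^\s*<!\[CDATA\[|\]\]>\s*$#', '', (string) $value);
}

// Made-up item with entity-encoded CDATA markers, as in the feed
$item = simplexml_load_string(
    '<item><title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title></item>'
);

$title = strip_cdata_markers($item->title);
echo $title, PHP_EOL;
```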
If I use the following php code to convert an xml to json:
<?php
header("Content-Type:text/json");
$resultXML = "
<QUERY>
<Company>fcsf</Company>
<Details>
fgrtgrthtyfgvb
</Details>
</QUERY>
";
$sxml = simplexml_load_string($resultXML);
echo json_encode($sxml);
?>
I get
{"Company":"fcsf","Details":"\n fgrtgrthtyfgvb\n "}
However, If I use CDATA in the Details element as follows:
<?php
header("Content-Type:text/json");
$resultXML = "
<QUERY>
<Company>fcsf</Company>
<Details><![CDATA[
fgrtgrthtyfgvb]]>
</Details>
</QUERY>
";
$sxml = simplexml_load_string($resultXML);
echo json_encode($sxml);
?>
I get the following
{"Company":"fcsf","Details":{}}
In this case the Details element is blank. Any idea why Details is blank and how to correct this?
This is not a problem with the JSON encoding – var_dump($sxml->Details) shows that the content is already absent from what SimpleXML exposes, as you will only get
object(SimpleXMLElement)#2 (0) {
}
– a SimpleXMLElement that looks empty; the CDATA content does not appear in this view, which is the same one json_encode() works from.
And after we figured that out, googling for “simplexml cdata” leads us straight to the first user comment on the manual page on SimpleXML Functions, that has the solution:
If you are having trouble accessing CDATA in your simplexml document, you don't need to str_replace/preg_replace the CDATA out before loading it with simplexml.
You can do this instead, and all your CDATA contents will be merged into the element contents as strings.
$xml = simplexml_load_file($xmlfile, 'SimpleXMLElement', LIBXML_NOCDATA);
So, use
$sxml = simplexml_load_string($resultXML, 'SimpleXMLElement', LIBXML_NOCDATA);
in your code, and you’ll get
{"Company":"fcsf","Details":"\n fgrtgrthtyfgvb\n "}
after JSON-encoding it.
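An alternative sketch that avoids the parser flag is to cast the elements to strings yourself before encoding, since a string cast of a SimpleXMLElement includes CDATA content. The $data array and the trim() call are choices made for this example, so the whitespace differs from the output above.

```php
<?php
$resultXML = "<QUERY><Company>fcsf</Company>"
    . "<Details><![CDATA[\n fgrtgrthtyfgvb]]>\n</Details></QUERY>";

// No LIBXML_NOCDATA here
$sxml = simplexml_load_string($resultXML);

// Build a plain array of strings instead of encoding the object directly
$data = [
    'Company' => (string) $sxml->Company,
    'Details' => trim((string) $sxml->Details),
];

$json = json_encode($data);
echo $json, PHP_EOL;
```

This takes more typing than the flag, but it makes explicit which fields end up in the JSON.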
I am trying to put an html string inside of xml with php like this:
<?php
$xml_resource = new SimpleXMLElement('stuff.xml', 0, true);
$xml_resource->content = '<![CDATA[<u>111111111111111111111111111111111 text</u>]]>';
$xml_resource->asXML('stuff.xml');
?>
but for some reason my xml file looks like this:
<?xml version="1.0"?>
<data>
<content id="pic1" frame="1" xpos="22" ypos="22" width="11" height="11">&lt;![CDATA[&lt;u&gt;111111111111111111111111111111111 text&lt;/u&gt;]]&gt;</content>
</data>
Thank you very much for your help good sirs.
SimpleXML cannot create CDATA sections. However, simply assigning the HTML to a node should be functionally equivalent:
$xml_resource->content = '<u>111111111111111111111111111111111 text</u>';
Of course the special characters will be escaped, and the result will be equivalent to using a CDATA section.
If you absolutely want to create CDATA sections, you will have to use something like SimpleDOM to access the corresponding DOM method.
include 'SimpleDOM.php';
$xml_resource = new SimpleDOM('stuff.xml', 0, true);
$xml_resource->content = '';
$xml_resource->content->insertCDATA('<u>111111111111111111111111111111111 text</u>');
$xml_resource->asXML('stuff.xml');
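If adding a third-party library is not an option, a sketch of the same idea with only the bundled extensions is to import the SimpleXML node into DOM and append a CDATA section there. The inline document below stands in for stuff.xml and is made up for this example.

```php
<?php
// Stand-in for stuff.xml (assumption), with an empty content element
$xml_resource = new SimpleXMLElement('<data><content id="pic1"/></data>');

// Get the DOM node behind the SimpleXML element; both share one document
$contentNode = dom_import_simplexml($xml_resource->content);

// Create the CDATA section on the owning document and append it
$cdata = $contentNode->ownerDocument->createCDATASection('<u>some text</u>');
$contentNode->appendChild($cdata);

$result = $xml_resource->asXML();
echo $result;
```

To write the file instead of a string, pass the path to asXML() as in the question.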