Parsing XML file - php

I've got a problem parsing an XML file (a well-formed one, by the way).
Consider an XML file like this:
<?xml version="1.0" encoding="utf-8" ?>
<root>
<list>
<item no="1">
<title>Item's 1 title</title>
<content>Some long content with <special>tags</special> inside</content>
</item>
<item no="2">
<title>Item's 2 title</title>
<content>Some long content with <special>tags</special> inside</content>
</item>
</list>
</root>
I need to get the contents of each item in the list and put them in an array. Generally that's not a problem, but in this case I can't get my head round it.
The problem lies in the contents of <content>. It is a string with tags in between. I can't find a way to extract it. SimpleXML returns/echoes just the string, with the <special> tags and everything inside them stripped out. Like this:
Some long content with inside.
I'd ideally want it to get a string like this:
Some long content with <special>tags</special> inside
How do I get it?

You could use DOMDocument which is built into PHP.
<?php
$xml = <<<END
<?xml version="1.0" encoding="utf-8" ?>
<root>
<list>
<item no="1">
<title>Item's 1 title</title>
<content>Some long content with <special>tags</special> inside</content>
</item>
<item no="2">
<title>Item's 2 title</title>
<content>Some long content with <special>tags</special> inside</content>
</item>
</list>
</root>
END;
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadXML($xml);
$nodes = $doc->getElementsByTagName('content');
foreach ( $nodes as $node )
{
$temp_doc = new DOMDocument('1.0', 'UTF-8');
foreach ( $node->childNodes as $child )
$temp_doc->appendChild($temp_doc->importNode($child, true));
echo $temp_doc->saveHTML(); // Outputs: Some long content with <special>tags</special> inside
}
To select the top level "content" elements (in case there are "content" elements inside), you can use DOMXPath.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadXML($xml); // $xml from the example above
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('/root/list/item/content');
foreach ( $nodes as $node )
{
$temp_doc = new DOMDocument('1.0', 'UTF-8');
foreach ( $node->childNodes as $child )
$temp_doc->appendChild($temp_doc->importNode($child, true));
echo $temp_doc->saveHTML(); // Outputs: Some long content with <special>tags</special> inside
}

SimpleXML just doesn't support mixed content (text nodes with element nodes as siblings). I suggest you use XMLReader instead.
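For what it's worth, here is a minimal sketch of the XMLReader route, assuming a shortened version of the question's XML (the sample string and the $contents array are made up for illustration). XMLReader::readInnerXml() returns the markup inside the current element, child tags included:

```php
<?php
// Sample input modelled on the question's XML (shortened for this sketch).
$xml = <<<'XML'
<root>
  <list>
    <item no="1"><content>Some long content with <special>tags</special> inside</content></item>
    <item no="2"><content>Other content with <special>tags</special> inside</content></item>
  </list>
</root>
XML;

$reader = new XMLReader();
$reader->XML($xml); // parse from a string; use open() for a file or URL

$contents = [];
while ($reader->read()) {
    // Stop on every opening <content> tag and grab its inner markup.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'content') {
        $contents[] = $reader->readInnerXml();
    }
}
$reader->close();

print_r($contents);
```

Note that this matches every <content> element at any depth, so nested <content> elements would be collected too.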

You could use SimpleXML's asXML() method. It returns the node it is called on as an XML string:
$xml = simplexml_load_file($file);
foreach($xml->list->item as $item) {
$content = $item->content->asXML(); // the element is named <content>, not <contents>
echo $content."\n";
}
will print:
<content>Some long content with <special>tags</special> inside</content>
<content>Some long content with <special>tags</special> inside</content>
It's a little ugly, but you could then clip off the opening <content> (9 characters) and the closing </content> (10 characters) with substr:
$content = substr($content,9,-10);
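If the hard-coded offsets feel fragile, a possible alternative is to hand the SimpleXML node over to DOM and serialize only its children. This is just a sketch; innerXml() is a hypothetical helper name, not part of SimpleXML, and the sample XML is shortened from the question:

```php
<?php
// Hypothetical helper: returns the markup inside $element without its own tags.
function innerXml(SimpleXMLElement $element): string
{
    $node = dom_import_simplexml($element); // same underlying node, no reparsing
    $inner = '';
    foreach ($node->childNodes as $child) {
        // saveXML() with a node argument serializes just that node.
        $inner .= $node->ownerDocument->saveXML($child);
    }
    return $inner;
}

$xml = simplexml_load_string(
    '<root><list><item no="1"><content>Some long content with <special>tags</special> inside</content></item></list></root>'
);

foreach ($xml->list->item as $item) {
    echo innerXml($item->content), "\n";
}
```

This keeps working if the element is renamed or gains attributes, which the substr() offsets would not survive.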

Related

How can I remove certain elements from XML using SimpleXML

I load the following XML data into SimpleXML like this:
<?php
$xmlString = <<<'XML'
<?xml version="1.0"?>
<response>
<item key="0">
<title>AH 2308</title>
<field_a>3.00</field_a>
<field_b>7.00</field_b>
<field_d1>35.00</field_d1>
<field_d2>40.00</field_d2>
<field_e></field_e>
<field_g2></field_g2>
<field_g>M 45x1,5</field_g>
<field_gewicht>0.13</field_gewicht>
<field_gtin>4055953012781</field_gtin>
<field_l>40.00</field_l>
<field_t></field_t>
<field_abdrueckmutter>KM 9</field_abdrueckmutter>
<field_sicherung>MB 7</field_sicherung>
<field_wellenmutter>KM 7</field_wellenmutter>
</item>
<item key="1">
<title></title>
<field_a></field_a>
<field_b></field_b>
<field_d1></field_d1>
<field_d2></field_d2>
<field_e></field_e>
<field_g2></field_g2>
<field_g></field_g>
<field_gewicht></field_gewicht>
<field_gtin></field_gtin>
<field_l></field_l>
<field_t></field_t>
<field_abdrueckmutter></field_abdrueckmutter>
<field_sicherung></field_sicherung>
<field_wellenmutter></field_wellenmutter>
</item>
</response>
XML;
$xml = simplexml_load_string($xmlString);
How can I achieve the following result:
<?xml version="1.0"?>
<response>
<item key="0">
<title>AH 2308</title>
<field_a>3.00</field_a>
<field_b>7.00</field_b>
<field_d1>35.00</field_d1>
<field_d2>40.00</field_d2>
<field_e></field_e>
<field_g2></field_g2>
<field_g>M 45x1,5</field_g>
<field_gewicht>0.13</field_gewicht>
<field_gtin>4055953012781</field_gtin>
<field_l>40.00</field_l>
<field_t></field_t>
<field_abdrueckmutter>KM 9</field_abdrueckmutter>
<field_sicherung>MB 7</field_sicherung>
<field_wellenmutter>KM 7</field_wellenmutter>
</item>
<item key="1"></item>
</response>
To delete all empty elements, I could use the following working code:
foreach ($xml->xpath('/child::*//*[not(*) and not(text()[normalize-space()])]') as $emptyElement) {
unset($emptyElement[0]);
}
But that's not exactly what I want.
Basically, when the <title> element is empty, I want to remove it with all its siblings and keep the parent <item> element.
What's important: I also want to keep empty elements if the <title> is not empty. See <item key="0"> for example: the elements <field_e>, <field_g2> and <field_t> should be left untouched.
Is there an easy XPath query that can achieve that? I hope someone can help. Thanks in advance!
This xpath query is working:
foreach ($xml->xpath('//title[not(text()[normalize-space()])]/following-sibling::*') as $emptyElement) {
unset($emptyElement[0]);
}
It keeps the <title> element but I can live with that.
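If the empty <title> should go as well, one option (a sketch only, using a cut-down version of the XML above) is to select every child of an <item> whose <title> is empty or missing, instead of only the title's following siblings:

```php
<?php
// Cut-down sample: item 0 has a title, item 1 does not.
$xmlString = <<<'XML'
<response>
  <item key="0"><title>AH 2308</title><field_a>3.00</field_a><field_e></field_e></item>
  <item key="1"><title></title><field_a></field_a><field_e></field_e></item>
</response>
XML;

$xml = simplexml_load_string($xmlString);

// All child elements of items that have no <title> with non-whitespace content.
foreach ($xml->xpath('//item[not(title[normalize-space()])]/*') as $child) {
    unset($child[0]); // removes the element itself from the document
}

echo $xml->asXML();
```

The <item key="1"> element itself stays in place, and the empty <field_e> of item 0 is left alone because that item's title is not empty.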
DOM is more flexible for manipulating nodes:
$document = new DOMDocument();
$document->loadXML($xmlString);
$xpath = new DOMXpath($document);
$expression = '/response/item[not(title[normalize-space()])]';
foreach ($xpath->evaluate($expression) as $emptyItem) {
// replace children with an empty text node
$emptyItem->textContent = '';
}
echo $document->saveXML();

Remove white spaces between tag values in xml with php

I have been searching for how to remove the white space left between tags when my PHP code exports XML. In detail: first I load an XML file, then I search it with XPath, then I remove some elements that do not match certain brands, and finally I re-export it as a new XML file. The problem is that the new XML is full of white space left behind by the code. I tried trimming it, but that doesn't seem to work correctly.
Here is my code:
<?php
$sXML = simplexml_load_file('file.xml'); //First load the XML
$brands = $sXML->xPath('//brand'); //I do a search for the <brand> tag
function filter(string $input) { //Then I give it a list of variables
switch ($input) {
case 'BRAND 3':
case 'BRAND 4':
return false;
default:
return true;
}
}
array_walk($brands, function($brand) { //I remove all elements do not match my list
$content = (string) $brand;
if (filter($content)) {
$item = $brand->xPath('..')[0];
unset($item[0]);
}
});
$sXML->asXML('filtred.xml'); // And finally export a new xml
?>
This one is the original XML:
<?xml version="1.0" encoding="utf-8"?>
<products>
<item>
<reference>00001</reference>
<other_string>PRODUCT 1</other_string>
<brand>BRAND 1</brand>
</item>
<item>
<reference>00002</reference>
<other_string>PRODUCT 2</other_string>
<brand>BRAND 2</brand>
</item>
<item>
<reference>00003</reference>
<other_string>PRODUCT 3</other_string>
<brand>BRAND 3</brand>
</item>
<item>
<reference>00004</reference>
<other_string>PRODUCT 4</other_string>
<brand>BRAND 4</brand>
</item>
<item>
<reference>00005</reference>
<other_string>PRODUCT 5</other_string>
<brand>BRAND 5</brand>
</item>
</products>
And the output of the script sends this:
<?xml version="1.0" encoding="utf-8"?>
<products>
<item>
<reference>00001</reference>
<other_string>PRODUCT 1</other_string>
<brand>BRAND 1</brand>
</item>
<item>
<reference>00002</reference>
<other_string>PRODUCT 2</other_string>
<brand>BRAND 2</brand>
</item>
<item>
<reference>00005</reference>
<other_string>PRODUCT 5</other_string>
<brand>BRAND 5</brand>
</item>
</products>
As you can see in the output, there is white space between product 2 and product 5, and I need to remove it. Any help will be appreciated.
You can force SimpleXML to trim all whitespace when it reads the file, by passing the LIBXML_NOBLANKS option to simplexml_load_file:
$sXML = simplexml_load_file('file.xml', null, LIBXML_NOBLANKS);
Then when you call ->asXML(), all the whitespace will be removed, and you'll get XML all on one line, like this:
<?xml version="1.0" encoding="utf-8"?>
<products><item><reference>00003</reference><other_string>PRODUCT 3</other_string><brand>BRAND 3</brand></item><item><reference>00004</reference><other_string>PRODUCT 4</other_string><brand>BRAND 4</brand></item></products>
To re-generate whitespace based on the remaining structure, you'll need to use DOM rather than SimpleXML - but that's easy to do without changing any of your existing code, because dom_import_simplexml simply "rewraps" the XML without reparsing it.
Then you can use the DOMDocument formatOutput property and save() method to "pretty-print" the document:
$sXML = simplexml_load_file('file.xml', null, LIBXML_NOBLANKS);
// ...
// process $sXML as before
// ...
$domDocument = dom_import_simplexml($sXML)->ownerDocument;
$domDocument->formatOutput = true;
echo $domDocument->save('filtered.xml');
Another possibility is to use preg_replace:
// Get simpleXml as string
$xmlAsString = $yourSimpleXmlObject->asXML();
// Remove newlines
$xmlAsString = preg_replace("/\n/", "", $xmlAsString);
// Remove spaces between tags
$xmlAsString = preg_replace("/>\s*</", "><", $xmlAsString);
var_dump($xmlAsString);
Now you get your XML as string in one line (including the XML declaration).

PHP XML: Getting text of a node and its children

I know that this question has been asked before, but I cannot make it work. I'm using SimpleXML and XPath in a PHP file. I need to get the text from a node, including the text in its child nodes. So the result should be:
Mr.Smith bought a white convertible car.
Here is the xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="test9.xsl"?>
<items>
<item>
<description>
<name>Mr.Smith bought a <car>white</car> <car>convertible</car> car.</name>
</description>
</item>
</items>
The php that's not working is:
$text = $xml->xpath('//items/item/description/name');
foreach($text as &$value) {
echo $value;
}
Please help!
To get the node value with all its child elements, you can use DOMDocument, with C14n():
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="test9.xsl"?>
<items>
<item>
<description>
<name>Mr.Smith bought a <car>white</car> <car>convertible</car> car.</name>
</description>
</item>
</items>
XML;
$doc = new DOMDocument;
$doc->loadXML($xml);
$x = new DOMXpath($doc);
$text = $x->query('//items/item/description/name');
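echo $text[0]->C14n(); // <name>Mr.Smith bought a <car>white</car> <car>convertible</car> car.</name>
echo $text[0]->textContent; // Mr.Smith bought a white convertible car.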
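Since the question uses SimpleXML, here is a rough sketch of staying with SimpleXML for the XPath query and borrowing DOM only to read the text. The inline XML is the question's sample without the prolog and stylesheet line; $texts is a made-up variable name:

```php
<?php
$xml = simplexml_load_string(
    '<items><item><description>'
    . '<name>Mr.Smith bought a <car>white</car> <car>convertible</car> car.</name>'
    . '</description></item></items>'
);

$texts = [];
foreach ($xml->xpath('//items/item/description/name') as $name) {
    // textContent concatenates the text of the node and all its descendants.
    $texts[] = dom_import_simplexml($name)->textContent;
}

echo implode("\n", $texts), "\n";
```

Casting the SimpleXML element to a string would only give the element's own text nodes, which is why the <car> values go missing in the original attempt.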

Accessing a single XML DOM Document node

I am completely new to DOM documents. Basically, what I am trying to do is load an RSS feed, select only one node, and then save it to an XML file.
Here is the XML I am loading from a web feed:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Markets</title>
<description/>
<link>http://www.website.com</link>
<language>en-us</language>
<copyright>XML Output Copyright</copyright>
<ttl>15</ttl>
<pubDate>Tue, 16 Nov 2010 09:38:00 +0000</pubDate>
<webMaster>admin#website.com</webMaster>
<image>
<title>title</title>
<url>http://www.website.com/images/xmllogo.gif</url>
<link>http://www.website.com</link>
<width>144</width>
<height>16</height>
</image>
<item>
<title>title</title>
<description>the description goes here
</description>
<enclosure url="http://www.website.com/images/image.png" type="image/png"/>
</item>
</channel>
</rss>
Here is my lame attempt at getting the <description> node and saving it to feed.xml:
<?php
$feed = new DOMDocument();
$feed->load('http://www.website.com/directory/directory/cz.c');
$nodeValue = $feed->getElementsByTagName('description')->item(0)->nodeValue;
$feed->save("feed.xml");
?>
So basically I need to get the description tag and save it as an XML file.
Any help would be appreciated, thanks in advance!
Almost correct. To get the "outerXml" of a node, you can pass the node to saveXml()
$feed = new DOMDocument();
$feed->load('http://www.website.com/directory/directory/cz.c');
$xml = $feed->saveXml($feed->getElementsByTagName('description')->item(0));
file_put_contents("feed.xml", $xml);
Saving with file_put_contents will not include an XML prolog. Note that in your example, the first description element is empty, so the file will contain <description/>.
If you want to extract the node as standalone XML Document, you have to instantiate a new DOMDocument and import the DOMNode and then use save().
$dom = new DOMDocument($feed->xmlVersion, $feed->xmlEncoding);
$dom->appendChild(
$dom->importNode(
$feed->getElementsByTagName('description')->item(0),
TRUE
)
);
echo $dom->save('new.xml');
Try this:
$feed = simplexml_load_file('feed.xml');
$descr = $feed->channel->description;

Transform RSS-Feed into another "standard" XML-Format with PHP

quick question: I need to transform a default RSS Structure into another XML-format.
The RSS File is like....
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Name des RSS Feed</title>
<description>Feed Beschreibung</description>
<language>de</language>
<link>http://xml-rss.de</link>
<lastBuildDate>Sat, 1 Jan 2000 00:00:00 GMT</lastBuildDate>
<item>
<title>Titel der Nachricht</title>
<description>Die Nachricht an sich</description>
<link>http://xml-rss.de/link-zur-nachricht.htm</link>
<pubDate>Sat, 1. Jan 2000 00:00:00 GMT</pubDate>
<guid>01012000-000000</guid>
</item>
<item>
<title>Titel der Nachricht</title>
<description>Die Nachricht an sich</description>
<link>http://xml-rss.de/link-zur-nachricht.htm</link>
<pubDate>Sat, 1. Jan 2000 00:00:00 GMT</pubDate>
<guid>01012000-000000</guid>
</item>
<item>
<title>Titel der Nachricht</title>
<description>Die Nachricht an sich</description>
<link>http://xml-rss.de/link-zur-nachricht.htm</link>
<pubDate>Sat, 1. Jan 2000 00:00:00 GMT</pubDate>
<guid>01012000-000000</guid>
</item>
</channel>
</rss>
...and I want to extract only the item elements (with their children and attributes), as XML like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<item>
<title>Titel der Nachricht</title>
<description>Die Nachricht an sich</description>
<link>http://xml-rss.de/link-zur-nachricht.htm</link>
<pubDate>Sat, 1. Jan 2000 00:00:00 GMT</pubDate>
<guid>01012000-000000</guid>
</item>
...
It doesn't have to be stored in a file. I just need the output.
edit: One more thing you need to know: the RSS file can contain any number of items. This is just a sample, so it has to be processed in a loop (while, for, foreach, ...).
I tried different approaches with DOMNode, SimpleXML, XPath, ... but without success.
Thanks
chris
A different approach would be to use an XSLT:
$xsl = <<< XSL
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<items>
<xsl:copy-of select="//item"/>
</items>
</xsl:template>
</xsl:stylesheet>
XSL;
The above stylesheet has just one rule: deep-copy all <item> elements from the source XML into the output and ignore everything else. The copied nodes are wrapped in an <items> root element. To process this, you'd do
$xslDoc = new DOMDocument(); // create Doc for XSLT
$xslDoc->loadXML($xsl); // load stylesheet into it
$xmlDoc = new DOMDocument(); // create Doc for RSS
$xmlDoc->loadXML($xml); // load your XML/RSS into it
$proc = new XSLTProcessor(); // init XSLT engine
$proc->importStylesheet($xslDoc); // load stylesheet into engine
echo $proc->transformToXML($xmlDoc); // output transformed XML
Instead of outputting, you could just write the return value to file.
Further reading:
http://de3.php.net/manual/en/class.xsltprocessor.php
http://www.w3.org/TR/xslt#copy-of
What you ask for is hardly a transformation. You are basically just extracting the <item> elements as they are. Also, the result you give is not valid XML, as it lacks a root node.
Apart from that, you can simply do it like this:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadXML($xml); // load some XML into it
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//item'); // Find all item elements
foreach($nodes as $node) { // Iterate over found item elements
echo $dom->saveXml($node); // output the item node's outer XML
}
The above would echo the <item> nodes. You could just as well buffer the output, concatenate it into a string, or collect it in an array and implode it, and then write it to a file.
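As a rough illustration of the array-and-implode variant, assuming a tiny made-up feed in $xml (the $parts and $output names and the items.xml path are placeholders):

```php
<?php
// Tiny made-up feed standing in for the real RSS.
$xml = '<rss version="2.0"><channel><title>Feed</title>'
     . '<item><title>A</title></item><item><title>B</title></item>'
     . '</channel></rss>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$parts = [];
foreach ($xpath->query('//item') as $node) {
    $parts[] = $dom->saveXML($node); // outer XML of one <item>
}

$output = implode("\n", $parts);
echo $output, "\n";

// To write it to a file instead (path is a placeholder):
// file_put_contents('items.xml', $output);
```

The result still has no root element, so it is a fragment rather than a well-formed XML document.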
If you want to do it properly with DOM (and a root node), the full code would be:
$dom = new DOMDocument; // init DOMDocument for RSS
$dom->preserveWhiteSpace = FALSE; // drop whitespace-only nodes while parsing
$dom->loadXML($xml); // load some XML into it
$items = new DOMDocument; // init DOMDocument for new file
$items->formatOutput = TRUE; // make output pretty
$items->loadXML('<items/>'); // create root node
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//item'); // Find all item elements
foreach($nodes as $node) { // iterate over found item nodes
$copy = $items->importNode($node, TRUE); // deep copy of item node
$items->documentElement->appendChild($copy); // append item nodes
}
echo $items->saveXML(); // outputs the new document
Instead of saveXML(), you'd use save('filename.xml') to write it to a file.
Try:
<?php
$xmlFile = new DOMDocument(); //Instantiate new DOMDocument
$xmlFile->load("URL TO RSS/XML FILE"); //Load in XML/RSS file
$title = [];
$description = [];
$link = [];
$pubDate = [];
$guid = [];
foreach ($xmlFile->getElementsByTagName("item") as $item) //Loop over every <item>, however many there are
{
$title[] = $item->getElementsByTagName("title")->item(0)->nodeValue; //Get the value of this item's <title>
$description[] = $item->getElementsByTagName("description")->item(0)->nodeValue;
$link[] = $item->getElementsByTagName("link")->item(0)->nodeValue;
$pubDate[] = $item->getElementsByTagName("pubDate")->item(0)->nodeValue;
$guid[] = $item->getElementsByTagName("guid")->item(0)->nodeValue;
}
?>
Untested but the arrays
$title[]
$description[]
$link[]
$pubDate[]
$guid[]
should be populated with all of the data that you need!
EDIT:
OK so another approach:
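<?php
$xmlString = file_get_contents("URL TO RSS/XML FILE");
// preg_match_all() returns the number of matches and puts the captured tag contents in $m[1]
$titles = preg_match_all('#<title>(.*?)</title>#s', $xmlString, $m) ? $m[1] : [];
$descriptions = preg_match_all('#<description>(.*?)</description>#s', $xmlString, $m) ? $m[1] : [];
$links = preg_match_all('#<link>(.*?)</link>#s', $xmlString, $m) ? $m[1] : [];
$pubDates = preg_match_all('#<pubDate>(.*?)</pubDate>#s', $xmlString, $m) ? $m[1] : [];
$guids = preg_match_all('#<guid>(.*?)</guid>#s', $xmlString, $m) ? $m[1] : [];
?>
In this example each variable holds an array of the matched values. Note that the channel-level <title>, <description> and <link> are matched as well, and that a regex like this breaks as soon as a tag carries attributes, so a real XML parser is the safer choice.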
