I'm trying to read a large XML file (about 40 MB) and use its data to update my application's database.
I seem to have found a good compromise between elapsed time and memory by using XMLReader and simplexml_import_dom(), but I can't get the value of attributes with a colon in their name, for example <g:attr_name>.
If I simply call $reader->read() for each "product" node I can retrieve the value as $reader->value, but if I expand() the node and copy it with $doc->importNode, these attributes are ignored.
$reader = new XMLReader();
$reader->open(__XML_FILE__);
$doc = new DOMDocument;
while ($reader->read()) {
switch ($reader->nodeType) {
case (XMLREADER::ELEMENT):
if($reader->localName=="product"){
$node = simplexml_import_dom($doc->importNode($reader->expand(), true));
echo $node->attr_name."<br><br>";
$reader->next('product');
}
}
}
I'm probably missing something. Any advice would be really appreciated!
Thanks.
Attributes with colons in their name have a namespace.
The part before the colon is a prefix that is registered to some namespace (usually in the root node). To access the namespaced attributes of a SimpleXmlElement you have to pass the namespace to the attributes() method:
$attributes = $element->attributes('some-namespace'); // or
$attributes = $element->attributes('g', TRUE); // and then
echo $attributes['name'];
The same applies to element children of a node. Access them via the children() method:
$children = $element->children('some-namespace'); // or
$children = $element->children('g', TRUE); // and then
echo $children->elementName;
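As a minimal sketch of both lookups, here is a self-contained example. The sample document, the namespace URI http://example.com/ns/g and the names attr_name and price are assumptions made up for illustration; substitute the URI declared in your own file.

```php
<?php
// Hypothetical sample document; the namespace URI and names are assumptions.
$xml = <<<'XML'
<products xmlns:g="http://example.com/ns/g">
  <product g:attr_name="red">
    <g:price>9.99</g:price>
  </product>
</products>
XML;

$root = simplexml_load_string($xml);
$product = $root->product;

// Look up by prefix: the second argument (true) says 'g' is a prefix.
$byPrefix = $product->attributes('g', true);
// Look up by namespace URI: keeps working even if the document changes its prefix.
$byUri = $product->attributes('http://example.com/ns/g');

$attrValue = (string) $byPrefix['attr_name'];
$sameValue = (string) $byUri['attr_name'];
// Namespaced child elements are reached the same way through children().
$price = (string) $product->children('g', true)->price;

echo $attrValue, ' ', $sameValue, ' ', $price, PHP_EOL;
```

Looking up by URI is usually the more robust choice, because the prefix is only a local alias chosen by whoever wrote the file.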
On a sidenote, if you want to import this data to your database, you can also try to do so directly:
http://dev.mysql.com/tech-resources/articles/xml-in-mysql5.1-6.0.html#xml-5.1-importing
So as an example here is an MWE XML
<manifest xmlns="http://iuclid6.echa.europa.eu/namespaces/manifest/v1"
xmlns:xlink="http://www.w3.org/1999/xlink">
<general-information>
<title>IUCLID 6 container manifest file</title>
<created>Tue Nov 05 11:04:06 EET 2019</created>
<author>SuperUser</author>
</general-information>
<base-document-uuid>f53d48a9-17ef-48f0-8d0e-76d03007bdfe/f53d48a9-17ef-48f0-8d0e-76d03007bdfe</base-document-uuid>
<contained-documents>
<document id="f53d48a9-17ef-48f0-8d0e-76d03007bdfe/f53d48a9-17ef-48f0-8d0e-76d03007bdfe">
<type>DOSSIER</type>
<name xlink:type="simple"
xlink:href="f53d48a9-17ef-48f0-8d0e-76d03007bdfe_f53d48a9-17ef-48f0-8d0e-76d03007bdfe.i6d"
>Initial submission</name>
<first-modification-date>2019-03-27T06:46:39Z</first-modification-date>
<last-modification-date>2019-03-27T06:46:39Z</last-modification-date>
</document>
</contained-documents>
</manifest>
In this case I want to find an attribute xlink:href and replace the name tag with the contents of the file referred to by the xlink:href - in this case f53d48a9-17ef-48f0-8d0e-76d03007bdfe_f53d48a9-17ef-48f0-8d0e-76d03007bdfe.i6d (which is an XML format file as well).
At the moment I use SimpleXML to pull it into an object and then the xml2json library to convert it into a recursive array, but walking it using the normal methods doesn't give me a way to modify a parent node.
I'm not sure how to walk back up the hierarchy. Any suggestions?
So this is where I am right now - xml2array (https://github.com/tamlyn/xml2json) delivers an array of arrays with XML attributes brought out into the array too
<?php
include('./xml2json.php');
$arrayData = [];
$xmlOptions = array(
"namespaceRecursive" => "True"
);
function &i6cArray(& $array){
foreach ($array as $key => $value) {
if(is_array($value)){
//recurse the array of arrays
$value = &i6cArray($value);
$array[$key]=$value;
print_r($value);
} elseif ($key === '@xlink:href') {
// we want to replace the element here with the ref'd file contents
// So we should get name.content = file contents
$tempxml = simplexml_load_file($value);
$tempArrayData = xmlToArray($tempxml);
$array['content']=$tempArrayData;
} else {
//do nothing (at least for now)
}
}
return $array;
}
if (file_exists('manifest.xml')) {
$xml = simplexml_load_file('manifest.xml');
$arrayData = xmlToArray($xml,$xmlOptions);
// walk array - we know the initial thing is an array
$arrayData = &i6cArray($arrayData);
//output result
$jsonString = json_encode($arrayData, JSON_PRETTY_PRINT);
file_put_contents('dossier.json', $jsonString);
} else {
exit("Failed to open manifest.");
}
?>
I would have liked to remove the @xlink attributes, but since leaving them in does no harm, I am going to insert a 'content' value holding the referenced XML content instead.
I would still like to have replaced the entire 'name' key with something.
A few bits of background before we get into the specific solution:
The parts of names before a colon are local aliases for a particular namespace, identified by a URI in an xmlns attribute. They need slightly different handling than non-namespaced names; see this reference question for SimpleXML.
PHP's SimpleXML and DOM extensions both have support for a language called "XPath", which lets you search for elements and attributes based on their parents and/or content.
The DOM is a more complex API than SimpleXML, but has more powerful features, particularly for writing. You can switch between the two using the functions simplexml_import_dom() and dom_import_simplexml().
In this case, we want to find all xlink:href attributes. Looking at the xmlns attributes at the top of the file, we see these are in the http://www.w3.org/1999/xlink namespace. In XPath, you can say "has an attribute" with the syntax [@attributename], so we can use SimpleXML and XPath like this:
$simplexml->registerXPathNamespace('xl', 'http://www.w3.org/1999/xlink');
$elements_with_xlink_hrefs = $simplexml->xpath('//*[@xl:href]');
For each of those, we want the attribute value:
foreach ( $elements_with_xlink_hrefs as $simplexml_element ) {
$filename = (string)$simplexml_element->attributes('http://www.w3.org/1999/xlink')->href;
// ...
We then want to load that file, and inject it into the document; this is easier with the DOM, but there is a complexity of having to "import" the node so that it's "owned by" the right document.
// load the other file
$other_document = new DOMDocument;
$other_document->load($filename);
// switch to DOM and add it in place
$dom_element = dom_import_simplexml($simplexml_element);
$dom_element->appendChild(
$dom_element->ownerDocument->importNode(
$other_document->documentElement,
true // deep copy: bring the imported element's children along too
)
);
We can now tidy up and delete the "xlink" attributes:
$dom_element->removeAttributeNs('http://www.w3.org/1999/xlink', 'href');
$dom_element->removeAttributeNs('http://www.w3.org/1999/xlink', 'type');
Once we're done, we can output the whole thing back as one combined XML document:
} // end of foreach loop
echo $simplexml->asXML();
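Put together, the steps above can be sketched as one runnable script. To keep it self-contained, the referenced file is replaced by an in-memory string: the $linked array, the trimmed manifest and the dossier/subject element names are assumptions for illustration only, and in real use you would call $other->load($href) on the actual file instead.

```php
<?php
// Trimmed, hypothetical manifest; only the xlink attributes matter here.
$manifest = <<<'XML'
<manifest xmlns:xlink="http://www.w3.org/1999/xlink">
  <name xlink:type="simple" xlink:href="other.xml">Initial submission</name>
</manifest>
XML;

// Stand-in for the referenced files on disk (assumption for this sketch).
$linked = ['other.xml' => '<dossier><subject>Example</subject></dossier>'];

$xlinkNs = 'http://www.w3.org/1999/xlink';
$sx = simplexml_load_string($manifest);
$sx->registerXPathNamespace('xl', $xlinkNs);

foreach ($sx->xpath('//*[@xl:href]') as $el) {
    $href = (string) $el->attributes($xlinkNs)->href;

    // Load the referenced document (from the stand-in array here).
    $other = new DOMDocument();
    $other->loadXML($linked[$href]);

    // Switch to DOM, import a deep copy, and append it in place.
    $domEl = dom_import_simplexml($el);
    $domEl->appendChild(
        $domEl->ownerDocument->importNode($other->documentElement, true)
    );

    // Tidy up the xlink attributes now that the content is inlined.
    $domEl->removeAttributeNS($xlinkNs, 'href');
    $domEl->removeAttributeNS($xlinkNs, 'type');
}

$combined = $sx->asXML();
echo $combined;
```

Because dom_import_simplexml() returns a view of the same underlying node, the changes made through the DOM are visible when the SimpleXML object is serialized at the end.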
I'm trying to process an RSS feed using PHP and there are some tags such as 'itunes:image' which I need to process. The code I'm using is below, and for some reason these elements are not returning any value: the output shows a length of 0.
How can I read these tags and get their attributes?
$f = $_REQUEST['feed'];
$feed = new DOMDocument();
$feed->load($f);
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
foreach($items as $key => $item)
{
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
$description = $item->getElementsByTagName('description')->item(0)->textContent; // textContent
$arrt = $item->getElementsByTagName('itunes:image');
print_r($arrt);
}
getElementsByTagName is specified by DOM, and PHP is just following that. It doesn't consider namespaces. Instead, use getElementsByTagNameNS, which requires the full namespace URI (not the prefix). This appears to be http://www.itunes.com/dtds/podcast-1.0.dtd*. So:
// item(0) returns the first matching element, or null if this item has none
$img = $item->getElementsByTagNameNS('http://www.itunes.com/dtds/podcast-1.0.dtd', 'image')->item(0);
// Set preemptive fallback, then set value if check passes
$urlImage = '';
if ($img) {
$urlImage = $img->getAttribute('href');
}
Or put the namespace in a constant.
You might be able to get away with simply removing the prefix and getting all image tags of any namespace with getElementsByTagName.
Make sure to check whether a given item has an itunes:image element at all (example now given); in the example podcast, some don't, and I suspect that was also giving you trouble. (If there's no href attribute, getAttribute will return either null or an empty string per the DOM spec without erroring out.)
*In case you're wondering, there is no actual DTD file hosted at that location, and there hasn't been for about ten years.
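For a quick check without a live feed, here is a minimal sketch of the same lookup against an in-memory document. The two-item feed and the example.com URL are made up for illustration; only the namespace URI comes from the answer above.

```php
<?php
const ITUNES_NS = 'http://www.itunes.com/dtds/podcast-1.0.dtd';

// Hypothetical two-item feed: one item has an itunes:image, one does not.
$rss = <<<'XML'
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item><title>With image</title><itunes:image href="http://example.com/a.jpg"/></item>
    <item><title>No image</title></item>
  </channel>
</rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss);

$urls = [];
foreach ($feed->getElementsByTagName('item') as $item) {
    // Match on namespace URI + local name; item(0) is null when nothing matches.
    $img = $item->getElementsByTagNameNS(ITUNES_NS, 'image')->item(0);
    $urls[] = $img ? $img->getAttribute('href') : '';
}

print_r($urls);
```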
<?php
$rss_feed = simplexml_load_file("url link");
if (!empty($rss_feed)) {
    foreach ($rss_feed->channel->item as $feed_item) {
        // itunes:image is in the "itunes" namespace; read it from the current item
        echo $feed_item->children('itunes', true)->image->attributes()->href;
    }
}
?>
If I have the following data in my XML file;
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.008.001.02" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrDrctDbtInitn>
<PmtInf>
<PmtInfId>5n7gfUaPGK</PmtInfId>
<PmtMtd>DD</PmtMtd>
<NbOfTxs>1</NbOfTxs>
<CtrlSum>200.2</CtrlSum>
<PmtTpInf>
<SvcLvl>
<Cd>SEPA</Cd>
</SvcLvl>
<LclInstrm>
<Cd>CORE</Cd>
</LclInstrm>
<SeqTp>RCUR</SeqTp>
</PmtTpInf>
<DrctDbtTxInf>
<PmtId>
<EndToEndId>nmu5AOhE7G</EndToEndId>
</PmtId>
</DrctDbtTxInf>
</PmtInf>
<PmtInf>
<PmtInfId>5jAcoNoId3</PmtInfId>
<PmtMtd>DD</PmtMtd>
<NbOfTxs>3</NbOfTxs>
<CtrlSum>100.5</CtrlSum>
<PmtTpInf>
<SvcLvl>
<Cd>SEPA</Cd>
</SvcLvl>
<LclInstrm>
<Cd>CORE</Cd>
</LclInstrm>
<SeqTp>FRST</SeqTp>
</PmtTpInf>
<DrctDbtTxInf>
<PmtId>
<EndToEndId>nmu5AbdfG</EndToEndId>
</PmtId>
</DrctDbtTxInf>
<DrctDbtTxInf>
<PmtId>
<EndToEndId>nmu5A3r5jgG</EndToEndId>
</PmtId>
</DrctDbtTxInf>
</PmtInf>
</CstmrDrctDbtInitn>
</Document>
How would I access <NbOfTxs> in the second <PmtInf> block (where the value is 3) instead of <NbOfTxs> in the first <PmtInf> block (where the value is 1)?
If I just used the following line of code;
$FRSTTransaction = $xml->getElementsByTagName('NbOfTxs')->nodeValue;
It doesn't know which <NbOfTxs> I am attempting to access.
The only difference between each payment block is the <SeqTp>. There will be 4 Payment Blocks in total.
I am trying to count the number of <DrctDbtTxInf> blocks in each Payment Block and then put this value into <NbOfTxs>.
<DrctDbtTxInf>
<PmtId>
<EndToEndId>nmu5AOhE7G</EndToEndId>
</PmtId>
</DrctDbtTxInf>
The code I tried is as follows;
$filename = date('Y-W').'.xml'; //2014-26.xml
$xml = new DOMDocument;
$xml->load($filename);
$NumberTransactions = 0;
$RCURTransaction = $xml->getElementsByTagName('DrctDbtTxInf');
$NodeValue = $xml->getElementsByTagName('NbOfTxs')->nodeValue;
foreach ($RCURTransaction as $Transaction) {
$NodeValue = $NodeValue + 1;
}
$Document = simplexml_load_file($filename);
$Document->CstmrDrctDbtInitn->PmtInf->NbOfTxs = $NodeValue;
$Document->asXML($filename);
I receive no errors; it just doesn't seem to access the node value.
Read the manual. DOMDocument::getElementsByTagName does not return a node; it returns an instance of the DOMNodeList class. This class implements the Traversable interface (which means you can foreach over it) and has one method of its own, item() (cf., again, the manual).
You are attempting to access the nodeValue property of what you think is a DOMNode instance, but is in fact a DOMNodeList instance. As you can see in the manual, there is no nodeValue property available. Instead, get a specific node from this list, and then get the node value:
$nodes = $xml->getElementsByTagName('NbOfTxs');
foreach ($nodes as $node)
echo $node->nodeValue, PHP_EOL;//first, second, third node
Or, if you want, for example to see the value of just the third occurrence of this node:
if ($nodes->length > 2)//zero-indexed!
echo $nodes->item(2)->nodeValue, PHP_EOL;
else
echo 'Error, only ', $nodes->length, ' occurrences of that node found', PHP_EOL;
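Applied to the question as asked (the NbOfTxs of the second payment block), the same idea can be sketched like this. The document below is a stripped-down stand-in for the real file, without its namespace, purely for illustration.

```php
<?php
// Stripped-down stand-in for the payment file (assumption for this sketch).
$doc = new DOMDocument();
$doc->loadXML(
    '<Document>'
    . '<PmtInf><NbOfTxs>1</NbOfTxs></PmtInf>'
    . '<PmtInf><NbOfTxs>3</NbOfTxs></PmtInf>'
    . '</Document>'
);

// getElementsByTagName() returns a DOMNodeList; item(1) is the second block.
$blocks = $doc->getElementsByTagName('PmtInf');
$second = $blocks->item(1);

$value = null;
if ($second !== null) {
    // Search only inside the second block, then take its first NbOfTxs.
    $nb = $second->getElementsByTagName('NbOfTxs')->item(0);
    $value = $nb !== null ? $nb->nodeValue : null;
}

echo $value, PHP_EOL;
```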
Bottom line, as often, really is RTM. The documentation for DOMDocument::getElementsByTagName clearly shows the return type of the method. If it's an instance of a particular class, that return type is clickable on the PHP website and links you through to the manual page of that class. Navigating an API couldn't be simpler than that, IMHO:
//from the docs
public DOMNodeList DOMDocument::getElementsByTagName ( string $name )
// class::methodName arguments + expected type
|-> return type, links to docs for this class
Update
Addressing the things you mention in your updated question:
How to count specific children for a node
I'm assuming each PmtInf is a payment block, but all the SeqTp tags seem to me to be children of PmtTpInf tags. We're working with a DOMNodeList, which consists of DOMNode instances, so looking at the manual is the first thing to do. As you can see, each DOMNode has numerous handy properties and methods: $childNodes, $nodeName and $parentNode are the ones we will be using here.
$payments = $xml->getElementsByTagName('PmtTpInf');//don't get PmtInf, we can access that through `PmtTpInf` nodes' parentNode property
$idx = -1;
$counts = array();
$parents = array();
foreach ($payments as $payment)
{
if (!$parents || $parents[$idx] !== $payment->parentNode)
{//current $payment is not a child of last processed payment block
$parents[++$idx] = $payment->parentNode;//add reference to new payment block
$counts[$idx] = 0;//set count to 0
}
foreach ($payment->childNodes as $child)
{
if ($child->nodeName === 'SeqTp')
++$counts[$idx];//add 1 to count
}
}
Ok, now we have 2 arrays, $parents, which contains each payment block, and $counts, which contains the total count of all SeqTp blocks in that payment block. Let's set about adding/updating that data:
foreach ($parents as $idx => $block)
{//iterate over the payment blocks
$nbNode = null;//no node found yet
for ($i=0;$i<$block->childNodes->length;++$i)
{
if ($block->childNodes->item($i)->nodeName === 'NbOfTxs')
{
$nbNode = $block->childNodes->item($i);
break;//found the node, stop here
}
}
if ($nbNode === null)
{//NbOfTxs tag does not exist yet
$nbNode = $xml->createElement('NbOfTxs', 0);//create new node
$block->appendChild($nbNode);//add as child of the payment-block node
}
$nbNode->nodeValue = $counts[$idx];//set value using the counts array we constructed above
}
Lastly, to save this updated XML dom:
$xml->save($filename);
That's all, no need for simplexml_load_file at all, because that parses the XML DOM, which DOMDocument already did for you.
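If the goal is specifically the number of DrctDbtTxInf blocks per payment block, as the question describes, one alternative is to let XPath do the counting. This is only a sketch under assumptions, not the method used above: the prefix p is an arbitrary alias registered for the document's default namespace, and the sample document is trimmed to the relevant elements.

```php
<?php
$ns = 'urn:iso:std:iso:20022:tech:xsd:pain.008.001.02';

// Trimmed, hypothetical version of the payment file.
$xmlString = <<<'XML'
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.008.001.02">
  <CstmrDrctDbtInitn>
    <PmtInf><NbOfTxs>0</NbOfTxs><DrctDbtTxInf/></PmtInf>
    <PmtInf><NbOfTxs>0</NbOfTxs><DrctDbtTxInf/><DrctDbtTxInf/></PmtInf>
  </CstmrDrctDbtInitn>
</Document>
XML;

$doc = new DOMDocument();
$doc->loadXML($xmlString);

$xpath = new DOMXPath($doc);
// XPath has no default namespace, so bind a prefix to the document's namespace.
$xpath->registerNamespace('p', $ns);

$written = [];
foreach ($xpath->query('//p:PmtInf') as $block) {
    // count() is evaluated relative to the current payment block.
    $count = (int) $xpath->evaluate('count(p:DrctDbtTxInf)', $block);
    $nb = $xpath->query('p:NbOfTxs', $block)->item(0);
    if ($nb !== null) {
        $nb->nodeValue = (string) $count;
        $written[] = $nb->nodeValue;
    }
}

print_r($written);
// $doc->save($filename) would then write the updated document back to disk.
```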
I have one solution to the subject problem, but it’s a hack and I’m wondering if there’s a better way to do this.
Below is a sample XML file and a PHP CLI script that executes an xpath query given as an argument. For this test case, the command line is:
./xpeg "//MainType[@ID=123]"
What seems most strange is this line, without which my approach doesn’t work:
$result->loadXML($result->saveXML($result));
As far as I know, this simply re-parses the modified XML, and it seems to me that this shouldn’t be necessary.
Is there a better way to perform xpath queries on this XML in PHP?
XML (note the binding of the default namespace):
<?xml version="1.0" encoding="utf-8"?>
<MyRoot
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.example.com/data http://www.example.com/data/MyRoot.xsd"
xmlns="http://www.example.com/data">
<MainType ID="192" comment="Bob's site">
<Price>$0.20</Price>
<TheUrl><![CDATA[http://www.example.com/path1/]]></TheUrl>
<Validated>N</Validated>
</MainType>
<MainType ID="123" comment="Test site">
<Price>$99.95</Price>
<TheUrl><![CDATA[http://www.example.com/path2]]></TheUrl>
<Validated>N</Validated>
</MainType>
<MainType ID="922" comment="Health Insurance">
<Price>$600.00</Price>
<TheUrl><![CDATA[http://www.example.com/eg/xyz.php]]></TheUrl>
<Validated>N</Validated>
</MainType>
<MainType ID="389" comment="Used Cars">
<Price>$5000.00</Price>
<TheUrl><![CDATA[http://www.example.com/tata.php]]></TheUrl>
<Validated>N</Validated>
</MainType>
</MyRoot>
PHP CLI Script:
#!/usr/bin/php-cli
<?php
$xml = file_get_contents("xpeg.xml");
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
// remove the default namespace binding
$e = $domdoc->documentElement;
$e->removeAttributeNS($e->getAttributeNode("xmlns")->nodeValue,"");
// hack hack, cough cough, hack hack
$domdoc->loadXML($domdoc->saveXML($domdoc));
$xpath = new DOMXpath($domdoc);
$str = trim($argv[1]);
$result = $xpath->query($str);
if ($result !== FALSE) {
dump_dom_levels($result);
}
else {
echo "error\n";
}
// The following function isn't really part of the
// question. It simply provides a concise summary of
// the result.
function dump_dom_levels($node, $level = 0) {
$class = get_class($node);
if ($class == "DOMNodeList") {
echo "Level $level ($class): $node->length items\n";
foreach ($node as $child_node) {
dump_dom_levels($child_node, $level+1);
}
}
else {
$nChildren = 0;
foreach ($node->childNodes as $child_node) {
if ($child_node->hasChildNodes()) {
$nChildren++;
}
}
if ($nChildren) {
echo "Level $level ($class): $nChildren children\n";
}
foreach ($node->childNodes as $child_node) {
if ($child_node->hasChildNodes()) {
dump_dom_levels($child_node, $level+1);
}
}
}
}
?>
The solution is using the namespace, not getting rid of it.
$result = new DOMDocument();
$result->loadXML($xml);
$xpath = new DOMXpath($result);
$xpath->registerNamespace("x", trim($argv[2]));
$str = trim($argv[1]);
$result = $xpath->query($str);
And call it as this on the command line (note the x: in the XPath expression)
./xpeg "//x:MainType[@ID=123]" "http://www.example.com/data"
You can make this more shiny by:
finding out default namespaces yourself (by looking at the namespaceURI property of the document element)
supporting more than one namespace on the command line and registering them all before $xpath->query()
supporting arguments in the form of xyz=http://namespace.uri/ to create custom namespace prefixes
Bottom line is: In XPath you can't query //foo when you really mean //namespace:foo. These are fundamentally different and therefore select different nodes. The fact that XML can have a default namespace defined (and thus can drop explicit namespace usage in the document) does not mean you can drop namespace usage in XPath.
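A minimal sketch of that difference, with the default namespace read from the document element instead of being passed on the command line. The prefix x is an arbitrary choice and the document is a shortened version of the sample from the question.

```php
<?php
// Shortened version of the sample document (nowdoc, so "$" stays literal).
$xml = <<<'XML'
<MyRoot xmlns="http://www.example.com/data">
  <MainType ID="192"><Price>$0.20</Price></MainType>
  <MainType ID="123"><Price>$99.95</Price></MainType>
</MyRoot>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
// Bind the prefix "x" to whatever namespace the root element is in.
$xpath->registerNamespace('x', $doc->documentElement->namespaceURI);

// Unprefixed names mean "no namespace" in XPath, so this finds nothing.
$unprefixed = $xpath->query('//MainType[@ID=123]')->length;

// Prefixed names match the elements in the default namespace.
$prefixed = $xpath->query('//x:MainType[@ID=123]');
$price = $xpath->evaluate('string(x:Price)', $prefixed->item(0));

echo $unprefixed, ' ', $prefixed->length, ' ', $price, PHP_EOL;
```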
Just out of curiosity, what happens if you remove this line?
$e->removeAttributeNS($e->getAttributeNode("xmlns")->nodeValue,"");
That strikes me as the most likely to cause the need for your hack. You're basically removing the xmlns="http://www.example.com/data" part and then re-building the DOMDocument. Have you considered simply using string functions to remove that namespace?
$pieces = explode('xmlns="', $xml);
$xml = $pieces[0] . substr($pieces[1], strpos($pieces[1], '"') + 1);
Then continue on your way? It might even end up being faster.
Given the current state of the XPath language, I feel that the best answer is provided by Tomalek: to associate a prefix with the default namespace and to prefix all tag names. That’s the solution I intend to use in my current application.
When that’s not possible or practical, a better solution than my hack is to invoke a method that does the same thing as re-scanning (hopefully more efficiently): DOMDocument::normalizeDocument(). The method behaves “as if you saved and then loaded the document, putting the document in a ‘normal’ form.”
Also, as a variant, you may use an XPath expression that matches on the local name only:
//*[local-name(.) = 'MainType'][@ID='123']
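A short sketch of that variant, which needs no namespace registration at all. The document is a shortened version of the sample from the question.

```php
<?php
$xml = <<<'XML'
<MyRoot xmlns="http://www.example.com/data">
  <MainType ID="192" comment="Bob's site"/>
  <MainType ID="123" comment="Test site"/>
</MyRoot>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// local-name() ignores the namespace, so no registerNamespace() call is needed.
$matches = $xpath->query("//*[local-name() = 'MainType'][@ID='123']");
$comment = $matches->item(0)->getAttribute('comment');

echo $matches->length, ' ', $comment, PHP_EOL;
```

The trade-off is that this also matches a MainType element from any other namespace, so it is best kept for documents where that cannot happen.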
I want to delete all the empty nodes in my XML document using SimpleXML
Here is my code :
$xs = file_get_contents('liens.xml')or die("Fichier XML non chargé");
$doc_xml = new SimpleXMLElement($xs);
foreach($doc_xml->xpath('//*[not(text())]') as $torm)
unset($torm);
$doc_xml->asXML("liens.xml");
I saw with a print_r() that XPath is grabbing something, but nothing is removed from my XML file.
$file = 'liens.xml';
$xpath = '//*[not(text())]';
if (!$xml = simplexml_load_file($file)) {
throw new Exception("Fichier XML non chargé");
}
foreach ($xml->xpath($xpath) as $remove) {
unset($remove[0]);
}
$xml->asXML($file);
I know this post is a bit old, but in your foreach, $torm is just a local variable that is reassigned on every iteration. This means your unset($torm) only destroys that variable and does nothing to the original $doc_xml object.
Instead you will need to remove the element itself:
foreach($doc_xml->xpath('//*[not(text())]') as $torm)
unset($torm[0]);
This works by using a SimpleXMLElement self-reference.
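Here is a small self-contained sketch of the self-reference in action. The sample document is made up for illustration; it is deliberately indented, so the root element contains whitespace text nodes and is not itself matched by the XPath expression.

```php
<?php
// Hypothetical sample: <b> and <c> are empty, <a> is not.
$xmlString = <<<'XML'
<links>
  <a>keep</a>
  <b></b>
  <c/>
</links>
XML;

$xml = simplexml_load_string($xmlString);

foreach ($xml->xpath('//*[not(text())]') as $empty) {
    // $empty[0] refers to the element itself, so this removes it from the tree;
    // unset($empty) would only destroy the loop variable.
    unset($empty[0]);
}

$remaining = count($xml->children());
echo $xml->asXML();
```

Note that the expression matches any element without a direct text node child, so on compact (unindented) XML it can also match container elements, including the root.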