HTML DOM Merge paged XML file into single file with a loop - php

Hey, I'm working on an import for a list of elements. The code works for now, but it is not future-proof if more items are added. The XML uses a unique key and pagination (every 100 items a new key).
Below is the PHP code for the function I've built.
<?php
$feedUrl = '[url of the feed]';
$doc1 = new DOMDocument();
$doc1->load($feedUrl);
$doc1_token = $doc1->getElementsByTagName('resumptionToken')[0]->nodeValue;
$doc2 = new DOMDocument();
$doc2->load($feedUrl . '&resumptionToken=' . $doc1_token);
$doc2_token = $doc2->getElementsByTagName('resumptionToken')[0]->nodeValue;
$doc3 = new DOMDocument();
$doc3->load($feedUrl . '&resumptionToken=' . $doc2_token);
$doc3_token = $doc3->getElementsByTagName('resumptionToken')[0]->nodeValue;
$doc4 = new DOMDocument();
$doc4->load($feedUrl . '&resumptionToken=' . $doc3_token);
$doc4_token = $doc4->getElementsByTagName('resumptionToken')[0]->nodeValue;
$doc5 = new DOMDocument();
$doc5->load($feedUrl . '&resumptionToken=' . $doc4_token);
$doc5_token = $doc5->getElementsByTagName('resumptionToken')[0]->nodeValue;
// get 'ListRecords' element of document 1
$list_records = $doc1->getElementsByTagName('ListRecords')->item(0); //edited res - items
// iterate over 'item' elements of document 2
$items2 = $doc2->getElementsByTagName('record');
for ($i = 0; $i < $items2->length; $i ++) {
$item2 = $items2->item($i);
// import/copy item from document 2 to document 1
$item1 = $doc1->importNode($item2, true);
// append imported item to document 1 'res' element
$list_records->appendChild($item1);
}
// iterate over 'item' elements of document 3
$items3 = $doc3->getElementsByTagName('record');
for ($i = 0; $i < $items3->length; $i ++) {
$item3 = $items3->item($i);
// import/copy item from document 3 to document 1
$item1 = $doc1->importNode($item3, true);
// append imported item to document 1 'res' element
$list_records->appendChild($item1);
}
// iterate over 'item' elements of document 4
$items4 = $doc4->getElementsByTagName('record');
for ($i = 0; $i < $items4->length; $i ++) {
$item4 = $items4->item($i);
// import/copy item from document 4 to document 1
$item1 = $doc1->importNode($item4, true);
// append imported item to document 1 'res' element
$list_records->appendChild($item1);
}
// iterate over 'item' elements of document 5
$items5 = $doc5->getElementsByTagName('record');
for ($i = 0; $i < $items5->length; $i ++) {
$item5 = $items5->item($i);
// import/copy item from document 5 to document 1
$item1 = $doc1->importNode($item5, true);
// append imported item to document 1 'res' element
$list_records->appendChild($item1);
}
$doc1->save('merged.xml'); //edited -added saving into xml file
I think the code is not perfect, because if we add more than 600 records, the latest ones are not imported into the merged XML.
Besides this there is another issue. We have nested "<record>" nodes. We need to merge only the "<record>" elements that are direct children of "<ListRecords>".
<ListRecords>
<record>
<header>
...
</header>
<metadata>
<record xmlns="http://www.openarchives.org/OAI/2.0/" priref="100000002">
...
</record>
</metadata>
</record>
</ListRecords>

You can use Xpath expressions to address specific elements. For your snippet that would be /ListRecords/record. However, I think the snippet is missing the document element node with the namespace declaration for the Open Archives Initiative Protocol for Metadata Harvesting. It should be something like:
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
<ListRecords>
<record>
<header>
<identifier>oai:arXiv.org:hep-th/9901001</identifier>
</header>
</record>
</ListRecords>
</OAI-PMH>
To address the namespace with Xpath you need to register a prefix for it. Then put the feed urls in an array and iterate them:
$mergeDocument = new DOMDocument();
$mergeDocument->loadXML(
'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords/></OAI-PMH>'
);
$mergeTarget = $mergeDocument->documentElement->firstChild;
foreach ($feedUrls as $feedUrl) {
$document = new DOMDocument();
$document->load($feedUrl);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('oai', 'http://www.openarchives.org/OAI/2.0/');
foreach ($xpath->evaluate('/oai:OAI-PMH/oai:ListRecords/oai:record') as $record) {
$mergeTarget->appendChild($mergeDocument->importNode($record, TRUE));
}
}
$mergeDocument->formatOutput = TRUE;
echo $mergeDocument->saveXML();
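With a paged feed the URLs are not known up front, because each page only reveals the token for the next one. Here is a minimal sketch of a loop that follows the token until it runs out. It assumes that an empty or missing resumptionToken marks the last page. The function name mergePagedFeed(), the $fetchPage callback and the $maxPages safety limit are hypothetical, and the two in-memory pages only stand in for the real feed:

```php
<?php
const OAI_NS = 'http://www.openarchives.org/OAI/2.0/';

// $fetchPage is a hypothetical callback: it receives the current resumption
// token (null for the first page) and returns the XML of that page as a string.
function mergePagedFeed(callable $fetchPage, int $maxPages = 1000): DOMDocument
{
    $merged = new DOMDocument();
    $merged->loadXML('<OAI-PMH xmlns="' . OAI_NS . '"><ListRecords/></OAI-PMH>');
    $target = $merged->documentElement->firstChild;

    $token = null;
    for ($page = 0; $page < $maxPages; $page++) {
        $document = new DOMDocument();
        $document->loadXML($fetchPage($token));
        $xpath = new DOMXPath($document);
        $xpath->registerNamespace('oai', OAI_NS);

        // only the direct children of ListRecords, not the nested records
        foreach ($xpath->evaluate('/oai:OAI-PMH/oai:ListRecords/oai:record') as $record) {
            $target->appendChild($merged->importNode($record, true));
        }

        // assumption: an empty or missing token means this was the last page
        $token = trim($xpath->evaluate('string(/oai:OAI-PMH/oai:ListRecords/oai:resumptionToken)'));
        if ($token === '') {
            break;
        }
    }
    return $merged;
}

// two in-memory pages standing in for the real feed
$open = '<OAI-PMH xmlns="' . OAI_NS . '"><ListRecords>';
$close = '</ListRecords></OAI-PMH>';
$pages = [
    '' => $open . '<record><header/><metadata><record priref="1"/></metadata></record>'
        . '<resumptionToken>t2</resumptionToken>' . $close,
    't2' => $open . '<record><header/><metadata><record priref="2"/></metadata></record>'
        . '<resumptionToken/>' . $close,
];
$merged = mergePagedFeed(function (?string $token) use ($pages) {
    return $pages[$token ?? ''];
});
$merged->formatOutput = true;
echo $merged->saveXML();
```

For the real feed the callback could build the URL the way the question does ($feedUrl . '&resumptionToken=' . $token) and return the result of file_get_contents(). The page limit is only there so that a feed which keeps returning tokens cannot loop forever.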

Related

PHP XML find "searching" node and create new node after "searching node"

I need some help.
I'm loading XML from a file.
I'm also loading data from a CSV file.
The CSV has 2 columns:
ID
Number
I'm looking up the ID from the CSV in the XML file.
I can find it; that part works.
However, I need to edit the XML as follows:
If an element in the XML has an ID that contains an ID value from the CSV file, copy this element as many times as the value in the CSV (column Number) says.
Here is my code.
<?php
$file = 'doc.xml';
if (!copy($file, $newfile)) {
echo "failed to copy";
}
$fh = fopen("data.csv", "r");
$csvData = array();
//Loop through the rows in our CSV file and add them to
//the PHP array that we created above.
while (($row = fgetcsv($fh, 0, ";")) !== FALSE) {
$csvData[] = $row;
}
$length = count($csvData);
if (file_exists('doc.xml')) {
$xml = simplexml_load_file('doc.xml');
for ($i=0; $i < $length; $i++) {
$searchedNode = $csvData[$i+1][0];
$searchingMedia = $xml->xpath("/node/media/image[contains(@id,'$searchedNode')]");
foreach ($searchingMedia as $node) {
$update = $node->addAttribute('count',$csvData[$i+1][1]);
}
}
}
$xml->asXml('doc_new.xml');
?>
CSV data :
Can someone help me, please?
SimpleXML abstracts the XML nodes, so it is not the best API for direct node manipulations - use DOM.
You did not provide an example but from your code I would expect something like this:
$data = [
[
'42', '6'
]
];
$xml = <<<'XML'
<root>
<node>
<media>
<image id="42"/>
</media>
</node>
</root>
XML;
Then it is fairly straightforward:
// bootstrap the DOM
$document = new DOMDocument();
// let the parser ignore whitespace nodes (indents)
$document->preserveWhiteSpace = FALSE;
$document->loadXML($xml);
// DOM has a separate object for Xpath
$xpath = new DOMXpath($document);
// iterate the CSV data
foreach ($data as $row) {
// looking for "image" elements with a specific id attribute
$expression = sprintf(
'//root/node/media/image[@id="%s"]',
$row[0]
);
// iterate the found image nodes
foreach ($xpath->evaluate($expression) as $imageNode) {
// "amount" times
for ($i = 0, $c = (int)$row[1]; $i < $c; $i++) {
// clone the "image" element and insert clone after it
$imageNode->after(
$newNode = $imageNode->cloneNode(TRUE)
);
// modify the clone
$newNode->textContent = 'Inserted Node #'.$i;
}
}
}
$document->formatOutput = TRUE;
echo $document->saveXML();
Output:
<?xml version="1.0"?>
<root>
<node>
<media>
<image id="42"/>
<image id="42">Inserted Node #5</image>
<image id="42">Inserted Node #4</image>
<image id="42">Inserted Node #3</image>
<image id="42">Inserted Node #2</image>
<image id="42">Inserted Node #1</image>
<image id="42">Inserted Node #0</image>
</media>
</node>
</root>
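DOMElement::after() requires PHP 8. Here is a rough sketch of the same idea for older versions, using insertBefore() with the next sibling. The CSV content is a made-up string with ";" as separator, and the line splitting is deliberately simplistic (no quoting support):

```php
<?php
// hypothetical CSV content, columns: ID;Number
$csv = "42;3\n7;1";
$xml = '<root><node><media><image id="42"/><image id="7"/></media></node></root>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

foreach (explode("\n", $csv) as $line) {
    // simplistic parsing, assumes exactly two unquoted columns
    [$id, $count] = explode(';', trim($line));
    // the ID is placed inside a quoted Xpath literal, so it must not contain a double quote
    $expression = sprintf('/root/node/media/image[@id="%s"]', $id);
    foreach ($xpath->query($expression) as $image) {
        for ($i = 0; $i < (int)$count; $i++) {
            // a null reference node (no next sibling) makes insertBefore() append
            $image->parentNode->insertBefore($image->cloneNode(true), $image->nextSibling);
        }
    }
}
$document->formatOutput = true;
echo $document->saveXML();
```

With a real file the lines could come from fgetcsv() as in the question. The query result is a static list, so the clones inserted during the loop are not visited again.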

Change outerHTML of a php DOMElement?

How do I change the outerHTML of an element using the PHP DOMDocument class? No third-party library such as Simple PHP Dom or similar should be used.
For example:
I want to do something like this.
$dom = new DOMDocument;
$dom->loadHTML($html);
$tag = $dom->getElementsByTagName('h3');
foreach ($tag as $e) {
$e->outerHTML = '<h5>Hello World</h5>';
}
libxml_clear_errors();
$html = $dom->saveHTML();
echo $html;
And the output should be like this:
Old Output: <h3>Hello World</h3>
But I need this new output: <p>Hello World</p>
You can copy the element's content and attributes into a new node (with the new name you need) and swap it in with replaceChild().
The code below works only for simple elements (text inside a node). If you have nested elements, you will need to write a recursive function.
$dom = new DOMDocument;
$dom->loadHTML($html);
$titles = $dom->getElementsByTagName('h3');
for($i = $titles->length-1 ; $i >= 0 ; $i--)
{
$title = $titles->item($i);
$titleText = $title->textContent ; // get original content of the node
$newTitle = $dom->createElement('h5'); // create a new node with the correct name
$newTitle->textContent = $titleText ; // copy the content of the original node
// copy the attribute (class, style, ...)
$attributes = $title->attributes ;
for($j = $attributes->length-1 ; $j>= 0 ; --$j)
{
$attributeName = $attributes->item($j)->nodeName ;
$attributeValue = $attributes->item($j)->nodeValue ;
$newAttribute = $dom->createAttribute($attributeName);
$newAttribute->nodeValue = $attributeValue ;
$newTitle->appendChild($newAttribute);
}
$title->parentNode->replaceChild($newTitle, $title); // replace original node with our copy
}
libxml_clear_errors();
$html = $dom->saveHTML();
echo $html;
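For nested elements, one alternative to writing a recursive copy is to move the existing child nodes into the new element, since appendChild() moves a node that is already part of the document. This is a minimal sketch, not tested against every kind of markup. The helper name renameElement() is hypothetical:

```php
<?php
// hypothetical helper: replaces $old with a new element named $newName,
// keeping attributes and moving all child nodes (including nested elements)
function renameElement(DOMElement $old, string $newName): DOMElement
{
    $new = $old->ownerDocument->createElement($newName);
    foreach ($old->attributes as $attribute) {
        $new->setAttribute($attribute->nodeName, $attribute->nodeValue);
    }
    // appendChild() moves the node, so the loop ends when $old is empty
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }
    $old->parentNode->replaceChild($new, $old);
    return $new;
}

$dom = new DOMDocument();
$dom->loadHTML('<div><h3 class="title">Hello <b>World</b></h3></div>');
// copy the live node list into an array before changing the tree
foreach (iterator_to_array($dom->getElementsByTagName('h3')) as $h3) {
    renameElement($h3, 'h5');
}
$html = $dom->saveHTML();
echo $html;
```

Copying the node list into an array first avoids the problem that getElementsByTagName() returns a live list, which is why the answer above iterates backwards.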

PHP XPath - query finds too many nodes

I'm trying to duplicate a row (with data-id='first') from a template three times and fill the placeholder ({first}) with a value (0, 1, 2 in this case). My code is below. I don't understand why this line finds more than one node: $nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode); It matches every node that contains the text 'first', so it finds both rows, the cloned one and the original, and replaces the text in both. It should replace it only in the new one. Note that I'm passing the second parameter to $xpath->query(), which should make the search relative to the node I just cloned.
Here's a fiddle: https://eval.in/170941
HTML:
<html>
<head>
<title>test</title>
</head>
<body>
<table>
<tr data-id="first">
<td>{first}</td>
</tr>
</table>
</body>
</html>
PHP:
<?php
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute("data-id");
$parent = $element->parentNode;
for ($i = 0; $i < 3; $i++) {
$newNode = $element->cloneNode(true);
$parent->insertBefore($newNode, $element);
$nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);
for($j = 0; $j < $nodeList->length; $j++) {
$n = $nodeList->item($j);
$n->nodeValue = preg_replace("{{first}}", $i, $n->nodeValue);
}
}
$parent->removeChild($element);
echo $dom->saveHTML();
As you can see, the result is a table with three rows valued 0,0,0, while the expected values are 0,1,2.
Starting an xpath location path with / means that it starts at the document root. So //* always matches any element node in the document, and the context argument has no effect.
Try:
$nodeList = $xpath->query(".//*[text()[contains(.,'first')]]", $newNode);
HINT: DOMXpath::query() only allows expressions that return a node list, DOMXpath::evaluate() allows all expressions. Example: count(//*).
HINT: DOMNodeList objects are traversable, so you can use foreach to iterate them.
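Putting the relative path and the foreach hint together, the loop from the question could look roughly like this. It is a sketch under the assumption that the placeholder always sits in a text node. It selects the text nodes directly and uses str_replace() instead of the regular expression:

```php
<?php
$html = '<html><head><title>test</title></head><body><table>'
    . '<tr data-id="first"><td>{first}</td></tr></table></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute('data-id');
$parent = $element->parentNode;
for ($i = 0; $i < 3; $i++) {
    $newNode = $element->cloneNode(true);
    $parent->insertBefore($newNode, $element);
    // the leading dot makes the path relative to the context node $newNode
    foreach ($xpath->query(".//text()[contains(., '{first}')]", $newNode) as $text) {
        $text->nodeValue = str_replace('{first}', (string)$i, $text->nodeValue);
    }
}
$parent->removeChild($element);

// collect the cell values to check the result
$cells = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    $cells[] = $td->textContent;
}
echo implode(',', $cells), "\n"; // 0,1,2
```

Because the query no longer touches the template row, the original element still contains {first} on every pass and can be cloned directly.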
The problem you are having is that you are cloning the original node, but in your first pass you're altering the original node's content. Every pass after that is copying the already modified node, so there is no {first} to find.
One solution is to make a clone of the source element which you never insert into the document, and use that inside your loop.
Here's my fiddle: https://eval.in/171149
<?php
$html = '<html><head><title>test</title></head><body><table><tr data-id="first"><td>{first}</td></tr></table></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute("data-id");
$parent = $element->parentNode;
$clonedNode = $element->cloneNode(true);
for ($i = 0; $i < 3; $i++) {
$newNode = $clonedNode->cloneNode(true);
$parent->insertBefore($newNode, $element);
$nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);
for($j = 0; $j < $nodeList->length; $j++) {
$n = $nodeList->item($j);
$n->nodeValue = preg_replace("{{first}}", $i, $n->nodeValue);
}
}
$parent->removeChild($element);
echo $dom->saveHTML();

Fetch the attributes using PHP crawler

I am trying to fetch the name, address and location by crawling a website. It's a single page and I don't need anything else from it. I am using the code below.
<?php
include 'simple_html_dom.php';
$html = "http://www.phunwa.com/phone/0191/2604233";
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="address-tags"]')->item(0);
for($i=0; $i < $div->length; $i++ )
{
print "nodename=".$div->item( $i )->nodeName;
print "\t";
print "nodevalue : ".$div->item( $i )->nodeValue;
print "\r\n";
echo $link->getElementsByTagName("<p>");
}
?>
The website html source code is
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, P.O.MUTHI TEH.& DISTT.JAMMU,X, 181206</p>
<p><strong>Location:</strong> JAMMU, Jammu & Kashmir, India</p>
<p><strong>Other Numbers:</strong> 01912604233 | +911912604233 | +91-191-2604233</p>
Can someone please help me get the three attributes as output? Nothing is echoed on the page as of now.
Thanks a lot.
You need $dom->load($html); instead of $dom->loadHtml($html);. After doing this you will find that your HTML is not well formed, so $xpath stays empty.
Maybe try something like:
$html = file_get_contents('http://www.phunwa.com/phone/0191/2604233');
$name = preg_replace('/(.*)(<p><strong>Name:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$address = preg_replace('/(.*)(<p><strong>Address:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$location = preg_replace('/(.*)(<p><strong>Location:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$othernumbers = preg_replace('/(.*)(<p><strong>Other Numbers:<\/strong> )(.*)/mis','$3',$html);
list($othernumbers,$trash)= preg_split('/<\/p>/mis',$othernumbers,0);
echo 'name: '.$name.'<br>address: '.$address.'<br>location: '.$location.'<br>other numbers: '.$othernumbers;
exit;
You should use the following for your XPath query:
//*[@class='address-tags']/p
so you're retrieving the actual paragraph nodes that are children of the 'address-tags' parent. Then you can use a loop on them:
$nodes = $xpath->query('//*[@class="address-tags"]/p');
for ($i = 0; $i < $nodes->length; $i++) {
echo $nodes->item($i)->nodeValue;
}
// or just
foreach($nodes as $node) {
echo $node->nodeValue;
}
Right now your code is properly fetching the first div that's found, but then you continue treating that div as if it was a DOMNodeList returned from an xpath query, which is incorrect. ->item() returns a DOMNode object, which does NOT have an ->item() method.
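To turn the paragraphs into label/value pairs, something like the following might work. It is only a sketch: the HTML is a shortened copy of the snippet from the question loaded from a string, and it assumes every paragraph starts with a strong element holding the label:

```php
<?php
// shortened copy of the markup from the question
$html = <<<'HTML'
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Location:</strong> JAMMU, Jammu & Kashmir, India</p>
</div>
HTML;

// collect parser warnings (for example the bare "&") instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$fields = [];
foreach ($xpath->query('//*[@class="address-tags"]/p') as $p) {
    // evaluate() with string() returns the text of the first strong child
    $label = $xpath->evaluate('string(strong)', $p);
    // everything after the label is the value
    $value = trim(substr($p->textContent, strlen($label)));
    $fields[rtrim($label, ':')] = $value;
}
print_r($fields);
```

For the live page, the string passed to loadHTML() could come from file_get_contents(), or the document could be loaded with loadHTMLFile().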

PHP DOMDocument Strings to Objects

I have written a PHP DOM script that scrapes multiple HTML files for all p tags with a specific class.
I then want to take the values inside those p tags and build an unordered list with PHP DOM.
My problem: I can get the values and echo all of them onto a page, but when I create elements and append each value in its own li tag, the result contains only the LAST item in the list. I hope that makes sense. Here is the code:
$dom = new DOMDocument();
$dom->formatOutput = true;
$dom->preserveWhiteSpace = false;
//looping through an array
foreach ($pages as $page) {
foreach ($page['pageContent'] as $listlinks) {
$dom->loadHTMLFile($theurl . 'content_id_' . $listlinks['content'] . '.html');
//create the xPath object after loading the html source, otherwise the query won't work:/
$xPath = new DOMXPath($dom);
//get the p nodes in a DOMNodeList that has class"content_header_type_2":
$nodeList = $xPath->query("//p[@class='content_header_type_2']");
//create a new DOMDocument and add a ul element:
$newDom = new DOMDocument();
$ul = $newDom->createElement('ul');
$newDom->appendChild($ul);
// append all nodes from $nodeList to the new dom, as children of $ul:
foreach ($nodeList as $domElement) {
$domNode = $newDom->importNode($domElement, true);
echo $domNode->nodeValue . '<br>'; //This gives the entire list
$li = $newDom->createElement('li', $domNode->nodeValue); //This gives the last value in the list
$ul->appendChild($li);
}
}
};
$output = $newDom ->saveHTML();
echo $output;
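Judging from the code above, a likely cause is that $newDom and its ul are created inside the loop, so every file starts a fresh list and only the list from the last file is saved. Here is a minimal sketch that creates the target document once, before the loop. The $sources array is a hypothetical stand-in for the scraped HTML files:

```php
<?php
// hypothetical stand-ins for the scraped HTML files
$sources = [
    '<html><body><p class="content_header_type_2">One</p><p>skip</p></body></html>',
    '<html><body><p class="content_header_type_2">Two</p>'
        . '<p class="content_header_type_2">Three</p></body></html>',
];

// create the target document and its ul once, before the loop
$newDom = new DOMDocument();
$newDom->formatOutput = true;
$ul = $newDom->appendChild($newDom->createElement('ul'));

foreach ($sources as $source) {
    $dom = new DOMDocument();
    $dom->loadHTML($source);
    $xPath = new DOMXPath($dom);
    foreach ($xPath->query("//p[@class='content_header_type_2']") as $p) {
        $li = $ul->appendChild($newDom->createElement('li'));
        // a text node escapes special characters such as "&"
        $li->appendChild($newDom->createTextNode($p->textContent));
    }
}
$output = $newDom->saveHTML();
echo $output;
```

In the original script the same change would mean moving the three lines that create $newDom and $ul above the outer foreach, and loading each file into a fresh DOMDocument.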
