I am building an RSS parser using the SimpleXML class, and I was wondering if using the DOMDocument class would improve the speed of the parser. I am parsing an RSS document that is at least 1000 lines, and I use almost all of the data from those lines. I am looking for the method that will take the least time to complete.
SimpleXML and DOMDocument both use the same parser (libxml2), so the parsing difference between them is negligible.
This is easy to verify:
function time_load_dd($xml, $reps) {
    // discard a few warm-up runs to prime caches
    for ($i = 0; $i < 5; ++$i) {
        $dom = new DOMDocument();
        $dom->loadXML($xml);
    }
    $start = microtime(true);
    for ($i = 0; $i < $reps; ++$i) {
        $dom = new DOMDocument();
        $dom->loadXML($xml);
    }
    return microtime(true) - $start;
}

function time_load_sxe($xml, $reps) {
    for ($i = 0; $i < 5; ++$i) {
        $sxe = simplexml_load_string($xml);
    }
    $start = microtime(true);
    for ($i = 0; $i < $reps; ++$i) {
        $sxe = simplexml_load_string($xml);
    }
    return microtime(true) - $start;
}

function main() {
    // This is an 1800-line Atom feed of some complexity.
    $url = 'http://feeds.feedburner.com/reason/AllArticles';
    $xml = file_get_contents($url);
    $reps = 10000;
    $methods = array('time_load_dd', 'time_load_sxe');
    echo "Time to complete $reps reps:\n";
    foreach ($methods as $method) {
        echo $method, ": ", $method($xml, $reps), "\n";
    }
}
main();
On my machine I get basically no difference:
Time to complete 10000 reps:
time_load_dd: 17.725028991699
time_load_sxe: 17.416455984116
The real issue here is what algorithms you are using and what you are doing with the data. 1000 lines is not a big XML document. Your slowdown will not be in memory usage or parsing speed but in your application logic.
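Since both APIs are thin wrappers over the same libxml2 tree, you can even convert between them without re-parsing: `dom_import_simplexml()` and `simplexml_import_dom()` just hand you a different view of the same nodes. A minimal sketch (the feed structure here is invented for illustration):

```php
<?php
// Both APIs are views over the same libxml2 tree, so conversion is cheap:
// no serialization or re-parsing happens.
$xml = '<feed><entry><title>Hello</title></entry></feed>';

$sxe = simplexml_load_string($xml);

// Get a DOMElement view of a SimpleXML node to use DOM-only features...
$domElement = dom_import_simplexml($sxe->entry[0]);
echo $domElement->tagName, "\n"; // entry

// ...and convert a DOM node back to SimpleXML for terse property access.
$dom = new DOMDocument();
$dom->loadXML($xml);
$back = simplexml_import_dom($dom->documentElement);
echo (string) $back->entry->title, "\n"; // Hello
```

So even if one API turns out more convenient for part of your parser, you are not locked in by the initial parse.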
Well, I have encountered a HUGE performance difference between DOMDocument and SimpleXML. I have a ~15 MB XML file with approximately 50,000 elements like this:
...
<ITEM>
    <Product>some product code</Product>
    <Param>123</Param>
    <TextValue>few words</TextValue>
</ITEM>
...
I only need to "read" those values and save them into a PHP array. At first I tried DOMDocument:
$dom = new DOMDocument();
$dom->loadXML($external_content);
$root = $dom->documentElement;
$xml_param_values = $root->getElementsByTagName('ITEM');
foreach ($xml_param_values as $item) {
    $product_code = $item->getElementsByTagName('Product')->item(0)->textContent;
    // ... some other operation
}
That script died after 60 seconds with a maximum-execution-time-exceeded error. Only 15,000 of the 50,000 items were parsed.
So I rewrote the code using SimpleXML:
$xml = new SimpleXMLElement($external_content);
foreach ($xml->xpath('ITEM') as $item) {
    $product_code = (string) $item->Product;
    // ... some other operation
}
After 1 second all was done.
I don't know how those classes are implemented internally in PHP, but in my application (and with my XML structure) there is a really, REALLY HUGE performance difference between DOMDocument and SimpleXML.
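A likely culprit is not DOM parsing itself but the nested `getElementsByTagName()` call, which is re-evaluated for every item, so the cost grows much faster than the item count. A single `DOMXPath` pass usually brings DOM back in line with SimpleXML. A sketch over the same `<ITEM>` structure (sample data invented):

```php
<?php
// Two sample items standing in for the ~50,000-item file.
$external_content = '<ROOT>
    <ITEM><Product>P-001</Product><Param>123</Param><TextValue>few words</TextValue></ITEM>
    <ITEM><Product>P-002</Product><Param>456</Param><TextValue>more words</TextValue></ITEM>
</ROOT>';

$dom = new DOMDocument();
$dom->loadXML($external_content);
$xpath = new DOMXPath($dom);

$products = array();
// One XPath pass over the document, then a cheap relative string()
// evaluation per item, instead of a nested getElementsByTagName() scan.
foreach ($xpath->query('//ITEM') as $item) {
    $products[] = $xpath->evaluate('string(Product)', $item);
}
print_r($products);
```

This is only a sketch of the technique, not a claim about the exact internals, but in practice avoiding repeated tag-name scans is where most of the DOM time goes.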
Related
I have to process a huge XML file. I used DOMDocument to process it, but the data returned is huge, so how can I choose a specific number of elements to display?
For example, I want to display 5 elements.
My code:
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->load('IPCCPC-epoxif-201905.xml'); // IPCCPC-epoxif-201905
$xpath = new DOMXPath($doc);
if (empty($_POST['search'])) {
    $txtSearch = 'A01B1/00';
} else {
    $txtSearch = $_POST['search'];
}
$titles = $xpath->query("Doc/Fld[@name='IC']/Prg/Sen[contains(text(),\"$txtSearch\")]");
foreach ($titles as $title) {
    // I want to display 5 results here.
}
Add an index to the loop, and break out when it hits the limit.
$limit = 5;
foreach ($titles as $i => $title) {
    if ($i >= $limit) {
        break;
    }
    // rest of code
}
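Another option is to push the limit into the XPath expression itself with `position()`, so the node list never holds more than five matches in the first place. The parentheses matter: they apply `position()` to the combined result set rather than per parent node. A sketch with a made-up stand-in document:

```php
<?php
// Hypothetical document standing in for the patent classification file.
$doc = new DOMDocument();
$doc->loadXML('<root>' . str_repeat('<Sen>A01B1/00</Sen>', 20) . '</root>');
$xpath = new DOMXPath($doc);

// (…)[position() <= 5] limits the merged node set to the first five hits.
$titles = $xpath->query('(//Sen[contains(text(), "A01B1/00")])[position() <= 5]');
echo $titles->length, "\n"; // 5
```

With a huge document this also saves the cost of materializing thousands of matches you never display.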
I'm trying to duplicate a row (with data-id='first') from a template three times and fill the placeholder field ({first}) with a value (0, 1, 2 in this case). Below you can find my simple code. I don't understand why this line, $nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);, finds more than one node containing the text 'first'. It finds both rows, the cloned one and the original, so it replaces the text in both of them, while it should replace it only in the new one. Please note that I'm passing the second parameter to $xpath->query, which should make the search relative to the node I just cloned.
Here's a fiddle: https://eval.in/170941
HTML:
<html>
<head>
    <title>test</title>
</head>
<body>
    <table>
        <tr data-id="first">
            <td>{first}</td>
        </tr>
    </table>
</body>
</html>
PHP:
<?php
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute("data-id");
$parent = $element->parentNode;
for ($i = 0; $i < 3; $i++) {
    $newNode = $element->cloneNode(true);
    $parent->insertBefore($newNode, $element);
    $nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);
    for ($j = 0; $j < $nodeList->length; $j++) {
        $n = $nodeList->item($j);
        $n->nodeValue = preg_replace("{{first}}", $i, $n->nodeValue);
    }
}
$parent->removeChild($element);
echo $dom->saveHTML();
As you can see, the result is a three-row table with values 0, 0, 0, while the expected values are 0, 1, 2.
Starting an XPath location path with / means that it starts at the document root. So //* always matches any element node anywhere in the document; the context argument has no effect.
Try:
$nodeList = $xpath->query(".//*[text()[contains(.,'first')]]", $newNode);
HINT: DOMXpath::query() only allows expressions that return a node list; DOMXpath::evaluate() allows all expressions, for example count(//*).
HINT: DOMNodeList objects are traversable, so you can use foreach to iterate over them.
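For example, evaluate() can return a scalar directly, which query() cannot (the document below is invented for illustration):

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><a/><b/><c/></root>');
$xpath = new DOMXPath($dom);

// evaluate() may return a float, string, or bool depending on the expression;
// query() would reject count(//*) because it is not a node-set expression.
$count = $xpath->evaluate('count(//*)');
echo $count, "\n"; // 4 (the root element plus three children)
```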
The problem you are having is that you are cloning the original node, but on your first pass you're altering the original node's content. Every pass after that copies the already modified node, so there is no {first} left to find.
One solution is to make a clone of the source element which you never insert into the document, and use that inside your loop.
Here's my fiddle: https://eval.in/171149
<?php
$html = '<html><head><title>test</title></head><body><table><tr data-id="first"><td>{first}</td></tr></table></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute("data-id");
$parent = $element->parentNode;
$clonedNode = $element->cloneNode(true);
for ($i = 0; $i < 3; $i++) {
    $newNode = $clonedNode->cloneNode(true);
    $parent->insertBefore($newNode, $element);
    $nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);
    for ($j = 0; $j < $nodeList->length; $j++) {
        $n = $nodeList->item($j);
        $n->nodeValue = preg_replace("{{first}}", $i, $n->nodeValue);
    }
}
$parent->removeChild($element);
echo $dom->saveHTML();
I have an array that reads its cells from an XML file. I wrote it with a for loop, but since I don't know how many nodes there are, I want to write this loop so that it runs from the start to the end of the XML file. My code with for is:
$description = array();
for ($i = 0; $i < 2; $i++) {
    $description[$i] = read_xml_node("description", $i);
}
and my xml file:
<eth0>
    <description>WAN</description>
</eth0>
<eth1>
    <description>LAN</description>
</eth1>
In this code I must know the "2", but I want a way that doesn't require knowing it.
I am not sure what kind of parser you are using, but it is very easy with SimpleXML, so I put together some sample code using it.
Something like this should do the trick:
$xmlstr = <<<XML
<?xml version='1.0' standalone='yes'?>
<node>
    <eth0>
        <description>WAN</description>
    </eth0>
    <eth1>
        <description>LAN</description>
    </eth1>
</node>
XML;
$xml = new SimpleXMLElement($xmlstr);
foreach ($xml as $xmlnode) {
    foreach ($xmlnode as $description) {
        echo $description . " ";
    }
}
output:
WAN LAN
$length = count($description);
for ($i = 0; $i < $length; $i++) {
    print $description[$i];
}
The parser you use might provide a while loop that returns false when it has reached the end of the XML document. For example:
while ($node = $xml->read_next_node($mydoc)) {
    // Do whatever...
}
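The read_next_node() call above is pseudocode, but PHP's built-in XMLReader follows exactly this pattern: read() advances one node at a time and returns false at the end of the document, so no element count is needed. A sketch using the interface list from the question:

```php
<?php
$xmlstr = '<node><eth0><description>WAN</description></eth0>'
        . '<eth1><description>LAN</description></eth1></node>';

$reader = new XMLReader();
$reader->XML($xmlstr);

$descriptions = array();
// read() steps through every node and returns false at end of input,
// so the loop terminates without knowing the element count up front.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'description') {
        $descriptions[] = $reader->readString();
    }
}
$reader->close();
print_r($descriptions);
```

Because XMLReader streams instead of building a tree, this also works on files far too large for SimpleXML or DOMDocument.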
If none exists, you can use count() in the second clause of your for loop; it returns the length of the array you specify. For example:
for ($i = 0; $i < count($myarray); $i++) {
    // Do whatever...
}
I am trying to fetch the name, address, and location by crawling a website. It's a single page, and I don't want anything other than this. I am using the code below.
<?php
include 'simple_html_dom.php';
$html = "http://www.phunwa.com/phone/0191/2604233";
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="address-tags"]')->item(0);
for ($i = 0; $i < $div->length; $i++) {
    print "nodename=" . $div->item($i)->nodeName;
    print "\t";
    print "nodevalue : " . $div->item($i)->nodeValue;
    print "\r\n";
    echo $link->getElementsByTagName("<p>");
}
?>
The website's HTML source code is
<div class="address-tags">
    <p><strong>Name:</strong> RAJ GOPAL SINGH</p>
    <p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, P.O.MUTHI TEH.& DISTT.JAMMU,X, 181206</p>
    <p><strong>Location:</strong> JAMMU, Jammu & Kashmir, India</p>
    <p><strong>Other Numbers:</strong> 01912604233 | +911912604233 | +91-191-2604233</p>
Can someone please help me get the three attributes as output? Nothing is echoed on the page as of now.
Thanks a lot.
You need $dom->load($html); instead of $dom->loadHtml($html);, since $html holds a URL, not markup. After doing this you will find your HTML is not well formed, so $xpath stays empty.
Maybe try something like:
$html = file_get_contents('http://www.phunwa.com/phone/0191/2604233');
$name = preg_replace('/(.*)(<p><strong>Name:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$address = preg_replace('/(.*)(<p><strong>Address:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$location = preg_replace('/(.*)(<p><strong>Location:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$othernumbers = preg_replace('/(.*)(<p><strong>Other Numbers:<\/strong> )(.*)/mis','$3',$html);
list($othernumbers,$trash)= preg_split('/<\/p>/mis',$othernumbers,0);
echo 'name: '.$name.'<br>address: '.$address.'<br>location: '.$location.'<br>other numbers: '.$othernumbers;
exit;
You should use the following for your XPath query:
//*[@class='address-tags']/p
so you're retrieving the actual paragraph nodes that are children of the 'address-tags' parent. Then you can use a loop on them:
$nodes = $xpath->query('//*[@class="address-tags"]/p');
for ($i = 0; $i < $nodes->length; $i++) {
    echo $nodes->item($i)->nodeValue;
}
// or just
foreach ($nodes as $node) {
    echo $node->nodeValue;
}
Right now your code properly fetches the first div that's found, but then you continue treating that div as if it were a DOMNodeList returned from an XPath query, which is incorrect. ->item() returns a DOMNode object, which does NOT have an ->item() method.
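Putting that together, a corrected sketch might look like the following. The HTML snippet from the question is inlined here so the example is self-contained; in the real script you would fetch the page with file_get_contents() first, since loadHTML() expects markup, not a URL:

```php
<?php
// Inlined stand-in for: $html = file_get_contents('http://www.phunwa.com/phone/0191/2604233');
$html = '<div class="address-tags">
    <p><strong>Name:</strong> RAJ GOPAL SINGH</p>
    <p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, P.O.MUTHI TEH.&amp; DISTT.JAMMU,X, 181206</p>
    <p><strong>Location:</strong> JAMMU, Jammu &amp; Kashmir, India</p>
</div>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from malformed real-world HTML

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[@class="address-tags"]/p');
foreach ($nodes as $node) {
    echo $node->nodeValue, "\n";
}
```

This prints the "Name:", "Address:", and "Location:" lines; from there you can split each nodeValue on the first colon if you need the label and value separately.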
What are the advantages and disadvantages of the following libraries?
PHP Simple HTML DOM Parser
QP
phpQuery
Of the above I've used QP, which failed to parse invalid HTML, and Simple HTML DOM Parser, which does a good job but leaks memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you don't need an object anymore.
Are there any more scrapers? What are your experiences with them? I'm going to make this a community wiki; maybe we'll build a useful list of libraries that can be helpful when scraping.
I did some tests based on Byron's answer:
<?php
include("lib/simplehtmldom/simple_html_dom.php");
include("lib/phpQuery/phpQuery/phpQuery.php");

echo "<pre>";
$html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
$data['pq'] = $data['dom'] = $data['simple_dom'] = array();

$timer_start = microtime(true);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node) {
    $data['dom'][] = $node->getAttribute("href");
}
foreach ($x->query("//img") as $node) {
    $data['dom'][] = $node->getAttribute("src");
}
foreach ($x->query("//input") as $node) {
    $data['dom'][] = $node->getAttribute("name");
}
$dom_time = microtime(true) - $timer_start;
echo "dom: \t\t $dom_time . Got " . count($data['dom']) . " items \n";

$timer_start = microtime(true);
$doc = phpQuery::newDocument($html);
foreach ($doc->find("a") as $node) {
    $data['pq'][] = $node->href;
}
foreach ($doc->find("img") as $node) {
    $data['pq'][] = $node->src;
}
foreach ($doc->find("input") as $node) {
    $data['pq'][] = $node->name;
}
$time = microtime(true) - $timer_start;
echo "PQ: \t\t $time . Got " . count($data['pq']) . " items \n";

$timer_start = microtime(true);
$simple_dom = new simple_html_dom();
$simple_dom->load($html);
foreach ($simple_dom->find("a") as $node) {
    $data['simple_dom'][] = $node->href;
}
foreach ($simple_dom->find("img") as $node) {
    $data['simple_dom'][] = $node->src;
}
foreach ($simple_dom->find("input") as $node) {
    $data['simple_dom'][] = $node->name;
}
$simple_dom_time = microtime(true) - $timer_start;
echo "simple_dom: \t $simple_dom_time . Got " . count($data['simple_dom']) . " items \n";
echo "</pre>";
and got
dom: 0.00359296798706 . Got 115 items
PQ: 0.010568857193 . Got 115 items
simple_dom: 0.0770139694214 . Got 115 items
I used to use Simple HTML DOM exclusively until some bright SO'ers showed me the light, hallelujah.
Just use the built-in DOM functions. They are written in C and part of the PHP core, and they are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is very simple. This simple change has made my PHP-based scrapers run faster while saving my precious time.
My scrapers used to take ~60 MB to scrape 10 sites asynchronously with cURL, even with the Simple HTML DOM memory fix you mentioned.
Now my PHP processes never go above 8 MB.
Highly recommended.
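If you want to check numbers like these on your own workload, memory_get_peak_usage() makes it easy. A rough sketch with an invented document:

```php
<?php
// Compare peak memory before and after building a DOM from a made-up document.
$before = memory_get_peak_usage(true);

$dom = new DOMDocument();
$dom->loadXML('<root>' . str_repeat('<item>data</item>', 10000) . '</root>');

$after = memory_get_peak_usage(true);
printf("DOM peak overhead: %.2f MB\n", ($after - $before) / 1048576);
```

The exact figure depends on the PHP version and document shape, so treat the output as a relative comparison tool, not an absolute benchmark.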
EDIT
Okay, I did some benchmarks. The built-in DOM is at least an order of magnitude faster.
Built in php DOM: 0.007061
Simple html DOM: 0.117781
<?php
include("../lib/simple_html_dom.php");

$html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
$data['dom'] = $data['simple_dom'] = array();

$timer_start = microtime(true);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node) {
    $data['dom'][] = $node->getAttribute("href");
}
foreach ($x->query("//img") as $node) {
    $data['dom'][] = $node->getAttribute("src");
}
foreach ($x->query("//input") as $node) {
    $data['dom'][] = $node->getAttribute("name");
}
$dom_time = microtime(true) - $timer_start;
echo "built in php DOM : $dom_time\n";

$timer_start = microtime(true);
$simple_dom = new simple_html_dom();
$simple_dom->load($html);
foreach ($simple_dom->find("a") as $node) {
    $data['simple_dom'][] = $node->href;
}
foreach ($simple_dom->find("img") as $node) {
    $data['simple_dom'][] = $node->src;
}
foreach ($simple_dom->find("input") as $node) {
    $data['simple_dom'][] = $node->name;
}
$simple_dom_time = microtime(true) - $timer_start;
echo "simple html DOM : $simple_dom_time\n";