HTML scraping and CSS queries - PHP

What are the advantages and disadvantages of the following libraries?
PHP Simple HTML DOM Parser
QP
phpQuery
Of the above I've used QP, and it failed to parse invalid HTML. I've also used Simple HTML DOM Parser, which does a good job but leaks memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you don't need an object anymore.
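A minimal sketch of that cleanup pattern (the URL is just a placeholder):

<?php
include("lib/simplehtmldom/simple_html_dom.php");

$html = file_get_html("http://example.com/"); // placeholder URL
foreach ($html->find("a") as $link) {
    echo $link->href, "\n";
}

// Free the internal object graph before dropping the reference;
// otherwise circular references keep the memory alive.
$html->clear();
unset($html);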
Are there any more scrapers? What are your experiences with them? I'm going to make this a community wiki; maybe we'll build a useful list of libraries for scraping.
I did some tests based on Byron's answer:
<?php
include("lib/simplehtmldom/simple_html_dom.php");
include("lib/phpQuery/phpQuery/phpQuery.php");

echo "<pre>";

$html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
$data['pq'] = $data['dom'] = $data['simple_dom'] = array();

// built-in DOM + XPath
$timer_start = microtime(true);
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from malformed HTML
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node)
{
    $data['dom'][] = $node->getAttribute("href");
}
foreach ($x->query("//img") as $node)
{
    $data['dom'][] = $node->getAttribute("src");
}
foreach ($x->query("//input") as $node)
{
    $data['dom'][] = $node->getAttribute("name");
}
$dom_time = microtime(true) - $timer_start;
echo "dom: \t\t $dom_time . Got ".count($data['dom'])." items \n";

// phpQuery
$timer_start = microtime(true);
$doc = phpQuery::newDocument($html);
foreach ($doc->find("a") as $node)
{
    $data['pq'][] = $node->href;
}
foreach ($doc->find("img") as $node)
{
    $data['pq'][] = $node->src;
}
foreach ($doc->find("input") as $node)
{
    $data['pq'][] = $node->name;
}
$time = microtime(true) - $timer_start;
echo "PQ: \t\t $time . Got ".count($data['pq'])." items \n";

// Simple HTML DOM
$timer_start = microtime(true);
$simple_dom = new simple_html_dom();
$simple_dom->load($html);
foreach ($simple_dom->find("a") as $node)
{
    $data['simple_dom'][] = $node->href;
}
foreach ($simple_dom->find("img") as $node)
{
    $data['simple_dom'][] = $node->src;
}
foreach ($simple_dom->find("input") as $node)
{
    $data['simple_dom'][] = $node->name;
}
$simple_dom_time = microtime(true) - $timer_start;
echo "simple_dom: \t $simple_dom_time . Got ".count($data['simple_dom'])." items \n";

echo "</pre>";
and got
dom: 0.00359296798706 . Got 115 items
PQ: 0.010568857193 . Got 115 items
simple_dom: 0.0770139694214 . Got 115 items

I used to use Simple HTML DOM exclusively until some bright SO'ers showed me the light, hallelujah.
Just use the built-in DOM functions. They are written in C and part of the PHP core, so they are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is muy simple. This simple change has made my PHP-based scrapers run faster while saving my precious time.
My scrapers used to take ~60 megabytes to scrape 10 sites asynchronously with cURL, and that was even with the Simple HTML DOM memory fix you mentioned.
Now my PHP processes never go above 8 megabytes.
Highly recommended.
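For context, a minimal sketch of fetching several pages concurrently with curl_multi and handing each result to the built-in DOM (the URL list is just a placeholder):

<?php
// Hypothetical URL list; fetch all of them concurrently with curl_multi.
$urls = array("http://example.com/a", "http://example.com/b");

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Run all transfers until none are still active.
do {
    curl_multi_exec($mh, $active);
    curl_multi_select($mh);
} while ($active > 0);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    // ... hand $html to DOMDocument/DOMXPath here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);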
EDIT
Okay, I did some benchmarks. The built-in DOM is at least an order of magnitude faster.
Built in php DOM: 0.007061
Simple html DOM: 0.117781
<?php
include("../lib/simple_html_dom.php");

$html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
$data['dom'] = $data['simple_dom'] = array();

// built-in DOM + XPath
$timer_start = microtime(true);
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from malformed HTML
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node)
{
    $data['dom'][] = $node->getAttribute("href");
}
foreach ($x->query("//img") as $node)
{
    $data['dom'][] = $node->getAttribute("src");
}
foreach ($x->query("//input") as $node)
{
    $data['dom'][] = $node->getAttribute("name");
}
$dom_time = microtime(true) - $timer_start;
echo "built in php DOM : $dom_time\n";

// Simple HTML DOM
$timer_start = microtime(true);
$simple_dom = new simple_html_dom();
$simple_dom->load($html);
foreach ($simple_dom->find("a") as $node)
{
    $data['simple_dom'][] = $node->href;
}
foreach ($simple_dom->find("img") as $node)
{
    $data['simple_dom'][] = $node->src;
}
foreach ($simple_dom->find("input") as $node)
{
    $data['simple_dom'][] = $node->name;
}
$simple_dom_time = microtime(true) - $timer_start;
echo "simple html DOM : $simple_dom_time\n";

Related

Extracting information from an <i> tag in HTML using PHP

I have some code and I'm getting an HTTP 500 error, which is confusing me. I need to extract the numeric weather information from a weather forecast site and add it to my website.
Here is the code:
orai_class.php
<?php
class orai {
    var $url;

    function generate_orai($url) {
        $html = file_get_contents($url);
        $classname = 'wi wi-1';
        $dom = new DOMDocument;
        $dom->loadHTML($html);
        $xpath = new DOMXPath($dom);
        $results = $xpath->query("//*[@class='" . $classname . "']");
        $i = 0;
        foreach ($results as $node) {
            if ($results->length > 0) {
                $array[] = $results->item($i)->nodeValue;
            }
            $i++;
        }
        return $array;
    }
}
?>
index.php
<?php
include("orai.class.php");
$orai = new orai();
print_r($orai->generate_orai('https://orai.15min.lt/prognoze/vilnius'));
?>
Thank You.
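For what it's worth, a minimal corrected sketch of generate_orai() (untested against that site; it assumes the icons really carry class="wi wi-1" as their full class attribute). The main fixes are suppressing warnings from real-world HTML and initialising $array so an empty result set doesn't return an undefined variable:

<?php
class orai {
    function generate_orai($url) {
        $html = file_get_contents($url);
        if ($html === false) {
            return array(); // request failed
        }
        $dom = new DOMDocument;
        @$dom->loadHTML($html); // real-world HTML is rarely valid
        $xpath = new DOMXPath($dom);

        $array = array(); // initialise so an empty result set is not an error
        foreach ($xpath->query("//*[@class='wi wi-1']") as $node) {
            $array[] = $node->nodeValue;
        }
        return $array;
    }
}
?>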

Caching og properties in PHP

I'm trying to extract the Open Graph properties from a link like Facebook does, but the page takes many seconds to load with just a couple of links. Is there any way to speed it up, for example by using caches?
libxml_use_internal_errors(true);
$c = file_get_contents('https://link-here');
$d = new DOMDocument();
$d->loadHTML($c);
$xp = new DOMXPath($d);
foreach ($xp->query("//meta[@property='og:title']") as $el) {
    $title = $el->getAttribute("content");
}
foreach ($xp->query("//meta[@property='og:description']") as $el) {
    $content = $el->getAttribute("content");
}
foreach ($xp->query("//meta[@property='og:image']") as $el) {
    $image = $el->getAttribute("content");
}
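A hedged sketch of one way to do that: cache the extracted og: values on disk, keyed by URL, so each link is fetched and parsed only once per TTL (the cache location, the one-day TTL, and the og_properties() helper name are all arbitrary choices, not an existing API):

<?php
function og_properties($url, $ttl = 86400) {
    $cache_file = sys_get_temp_dir() . '/og_' . md5($url) . '.json';

    // Serve from the cache while it is fresh.
    if (file_exists($cache_file) && time() - filemtime($cache_file) < $ttl) {
        return json_decode(file_get_contents($cache_file), true);
    }

    libxml_use_internal_errors(true);
    $d = new DOMDocument();
    $d->loadHTML(file_get_contents($url));
    $xp = new DOMXPath($d);

    $og = array();
    foreach (array('og:title', 'og:description', 'og:image') as $prop) {
        foreach ($xp->query("//meta[@property='$prop']") as $el) {
            $og[$prop] = $el->getAttribute("content");
        }
    }

    file_put_contents($cache_file, json_encode($og));
    return $og;
}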

How to get child nodes from an XML URL?

I got this link: https://www.ncbi.nlm.nih.gov/gene/7128?report=xml&format=text. I am trying to write code that gets the Interactions and GeneOntology entries within Gene-commentary_heading from the link. I only succeed with this code when there are two or three nodes, but in this case there are at least six nodes or more. Could someone help me?
Below is an example of the information I am looking for (it's too much to visualise, so I'm only showing a part):
<Gene-commentary_heading>GeneOntology</Gene-commentary_heading>
<Gene-commentary_source>
  <Other-source>
    <Other-source_pre-text>Provided by</Other-source_pre-text>
    <Other-source_anchor>GOA</Other-source_anchor>
    <Other-source_url>http://www.ebi.ac.uk/GOA/</Other-source_url>
  </Other-source>
</Gene-commentary_source>
<Gene-commentary_comment>
  <Gene-commentary>
    <Gene-commentary_type value="comment">254</Gene-commentary_type>
    <Gene-commentary_label>Function</Gene-commentary_label>
    <Gene-commentary_comment>
      <Gene-commentary>
        <Gene-commentary_type value="comment">254</Gene-commentary_type>
        <Gene-commentary_source>
          <Other-source>
            <Other-source_src>
              <Dbtag>
                <Dbtag_db>GO</Dbtag_db>
                <Dbtag_tag>
                  <Object-id>
                    <Object-id_id>3677</Object-id_id>
                  </Object-id>
                </Dbtag_tag>
...
$url = "https://www.ncbi.nlm.nih.gov/gene/7128?report=xml&format=text";
$document_xml = new DOMDocument();
$document_xml->loadXML($url);
$elements = $url->getElementsByTagName('Gene-commentary_heading');
echo $elements;
foreach ($element as $node) {
    $GO = $node->getElementsByTagName('GeneOntology');
    $Int = $node->getElementsByTagName('Interactions');
}
My answer
$esearch_test = "https://www.ncbi.nlm.nih.gov/gene/7128?report=xml&format=text";
$result = file_get_contents($esearch_test);

$doc = new DOMDocument();
$doc->loadXML($result); // $result is already an XML string; no SimpleXML round-trip needed

$c = 1;
foreach ($doc->getElementsByTagName('Gene-commentary_heading') as $node) {
    echo "$c: " . $node->textContent . "\n";
    $c++;
}
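To pull out just those two sections rather than every heading, one option is an XPath filter on the heading text, as a hedged sketch (it assumes the heading values are literally Interactions and GeneOntology):

<?php
$result = file_get_contents("https://www.ncbi.nlm.nih.gov/gene/7128?report=xml&format=text");

$doc = new DOMDocument();
$doc->loadXML($result);
$xpath = new DOMXPath($doc);

// Select only the headings we care about.
$query = "//Gene-commentary_heading[. = 'Interactions' or . = 'GeneOntology']";
foreach ($xpath->query($query) as $heading) {
    // $heading->parentNode is the surrounding Gene-commentary block.
    echo $heading->textContent, "\n";
}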

How to get a UL class only once using DOMDocument in PHP

I am using PHP DOMDocument to load my HTML. In my HTML, class="smalllist" appears two times, but I only need the elements from the first one.
Now, My PHP Code is
$d = new DOMDocument();
$d->validateOnParse = true;
@$d->loadHTML($html);
$xpath = new DOMXPath($d);
$table = $xpath->query('//ul[@class="smalllist"]');
foreach ($table as $row) {
    echo $row->getElementsByTagName('a')->item(0)->nodeValue . "-";
    echo $row->getElementsByTagName('a')->item(1)->nodeValue . "\n";
}
which loads both lists. But I need to load only the first one with that class name.
Please help me with this. Thanks in advance.
DOMXPath::query() returns a DOMNodeList, which has an item() method. See if this works:
$table->item(0)->getElementsByTagName('a')->item(0)->nodeValue
edited (untested):
foreach ($table->item(0)->getElementsByTagName('a') as $anchor) {
    echo $anchor->nodeValue . "\n";
}
You can put a break within the foreach loop to read only from the first list. Or you can iterate just the first match, e.g. foreach ($table->item(0)->childNodes as $row) {...
Code:
$count = 0;
foreach ($table->item(0)->getElementsByTagName('a') as $anchor) {
    echo $anchor->nodeValue . "\n";
    if (++$count > 2) {
        break;
    }
}
another way rather than using break (more than one way to skin a cat):
$anchors = $table->item(0)->getElementsByTagName('a');
for ($i = 0; $i < 2; $i++) {
    echo $anchors->item($i)->nodeValue . "\n";
}
This is my final code:
$d = new DOMDocument();
$d->validateOnParse = true;
@$d->loadHTML($html);
$xpath = new DOMXPath($d);
$table = $xpath->query('//ul[@class="smalllist"]');
$count = 0;
foreach ($table->item(0)->getElementsByTagName('a') as $anchor) {
    $data[$k][$arr1[$count]] = $anchor->nodeValue;
    if (++$count > 1) {
        break;
    }
}
Working fine.

SimpleXML vs DOMDocument performance

I am building an RSS parser using the SimpleXML class, and I was wondering whether using the DOMDocument class would improve the speed of the parser. I am parsing an RSS document that is at least 1000 lines, and I use almost all of the data from those 1000 lines. I am looking for the method that takes the least time to complete.
SimpleXML and DOMDocument both use the same parser (libxml2), so the parsing difference between them is negligible.
This is easy to verify:
function time_load_dd($xml, $reps) {
    // discard first runs to prime caches
    for ($i = 0; $i < 5; ++$i) {
        $dom = new DOMDocument();
        $dom->loadXML($xml);
    }
    $start = microtime(true);
    for ($i = 0; $i < $reps; ++$i) {
        $dom = new DOMDocument();
        $dom->loadXML($xml);
    }
    $stop = microtime(true) - $start;
    return $stop;
}

function time_load_sxe($xml, $reps) {
    // discard first runs to prime caches
    for ($i = 0; $i < 5; ++$i) {
        $sxe = simplexml_load_string($xml);
    }
    $start = microtime(true);
    for ($i = 0; $i < $reps; ++$i) {
        $sxe = simplexml_load_string($xml);
    }
    $stop = microtime(true) - $start;
    return $stop;
}

function main() {
    // This is a 1800-line atom feed of some complexity.
    $url = 'http://feeds.feedburner.com/reason/AllArticles';
    $xml = file_get_contents($url);
    $reps = 10000;
    $methods = array('time_load_dd', 'time_load_sxe');
    echo "Time to complete $reps reps:\n";
    foreach ($methods as $method) {
        echo $method, ": ", $method($xml, $reps), "\n";
    }
}

main();
On my machine I get basically no difference:
Time to complete 10000 reps:
time_load_dd: 17.725028991699
time_load_sxe: 17.416455984116
The real issue here is what algorithms you are using and what you are doing with the data. 1000 lines is not a big XML document. Your slowdown will not be in memory usage or parsing speed but in your application logic.
Well, I have encountered a HUGE performance difference between DOMDocument and SimpleXML. I have a ~15 MB XML file with approximately 50,000 elements like this:
...
<ITEM>
  <Product>some product code</Product>
  <Param>123</Param>
  <TextValue>few words</TextValue>
</ITEM>
...
I only need to "read" those values and save them in a PHP array. At first I tried DOMDocument...
$dom = new DOMDocument();
$dom->loadXML($external_content);
$root = $dom->documentElement;
$xml_param_values = $root->getElementsByTagName('ITEM');
foreach ($xml_param_values as $item) {
    $product_code = $item->getElementsByTagName('Product')->item(0)->textContent;
    // ... some other operation
}
That script died after 60 seconds with a "maximum execution time exceeded" error, having parsed only 15,000 of the 50,000 items.
So I rewrote the code using SimpleXML:
$xml = new SimpleXMLElement($external_content);
foreach ($xml->xpath('ITEM') as $item) {
    $product_code = (string) $item->Product;
    // ... some other operation
}
After 1 second, everything was done.
I don't know how those functions are implemented internally in PHP, but in my application (and with my XML structure) there is a really, REALLY HUGE performance difference between DOMDocument and SimpleXML.
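One hedged guess, untested: the nested getElementsByTagName() calls build a fresh node list for every item, so a single XPath pass over the same document via DOMXPath might behave very differently:

$dom = new DOMDocument();
$dom->loadXML($external_content);
$xpath = new DOMXPath($dom);

// One XPath query per field instead of nested getElementsByTagName() calls.
foreach ($xpath->query('//ITEM') as $item) {
    $product_code = $xpath->evaluate('string(Product)', $item);
    // ... some other operation
}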
