Optimize remote page retrieving and parsing

I'm retrieving a remote page with PHP, extracting a few links from that page, then fetching and parsing each link.
It takes about 12 seconds, which is far too long, and I need to optimize the code somehow.
My code looks something like this:
$result = get_web_page('THE_WEB_PAGE');
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);
foreach ($matches[1] as $lnk) { // the pattern has a single capture group: the href
$result = get_web_page($lnk);
preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test'] = $match[1];
preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test2'] = $match[1];
preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test3'] = $match[1];
++$index;
}
I have some more preg_match calls inside the loop.
How can I optimize my code?
Edit:
I've changed my code to use XPath instead of regex, and it became much slower.
Edit2:
Here is my full code:
<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[ @class = "lnk"]/@href');
if ($matches === FALSE) {
echo 'error';
exit();
}
foreach ($matches as $match) {
$links[] = 'WEB_PAGE'.$match->value;
}
$index = 0;
// For each link
foreach ($links as $link) {
echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
$result = get_web_page($link);
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
$match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
if ($match === FALSE) {
exit();
}
$data[$index]['name'] = $match;
$matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['types'][] = $match->data;
}
$matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['info'][] = $match->data;
}
$matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['names'][] = $match->data;
}
++$index;
}
?>

As others mentioned, use a parser instead (such as DOMDocument) and combine it with XPath queries. Consider the following example:
<?php
# set up some dummy data
$data = <<<DATA
<div>
<a class='link'>Some link</a>
<a class='link' id='otherid'>Some link 2</a>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
# all links
$links = $xpath->query("//a[@class = 'link']");
print_r($links);
# special id link
$special = $xpath->query("//a[@id = 'otherid']");
# and so on
$textlinks = $xpath->query("//a[starts-with(text(), 'Some')]");
?>
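
The queries above return node lists, so the values still have to be read out of them. A minimal sketch of how that could look, assuming the dummy links also carry href attributes (the hrefs `/a` and `/b` are made up for illustration and are not part of the original data):

```php
<?php
// Assumed sample markup: the dummy data from above plus illustrative hrefs.
$data = "<div>"
      . "<a class='link' href='/a'>Some link</a>"
      . "<a class='link' id='otherid' href='/b'>Some link 2</a>"
      . "</div>";

$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

// query() returns a DOMNodeList; iterate it to read each element.
$hrefs = [];
foreach ($xpath->query("//a[@class = 'link']") as $a) {
    $hrefs[] = $a->getAttribute('href');
}

// evaluate() can also return a scalar, e.g. a count or a string value.
$count   = $xpath->evaluate("count(//a[@class = 'link'])");
$special = $xpath->evaluate("string(//a[@id = 'otherid'])");

print_r($hrefs);
echo $count, ' links, special text: ', $special, PHP_EOL;
```

Using evaluate() with string() or count() avoids a loop when only a single value is needed.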

Consider using a DOM framework for PHP. This should be way faster.
Use PHP's DOMDocument with xpath queries:
http://php.net/manual/en/class.domdocument.php
See Jan's answer for more explanation.
The following also works but is less preferable, according to the comments.
For example:
http://simplehtmldom.sourceforge.net/
An example that gets all a tags on a page:
<?php
include_once('simple_html_dom.php');
$url = "http://your_url/";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link)
{
// do something with the link
}
?>

Related

How to extract particular link from html page using php

Hi, I'm trying to scrape the href of an a tag using regex, but I'm unable to retrieve the link. Can someone help me achieve this? Here is the link I'm trying to extract from the HTML page: /u/0/uc?export=download&confirm=EY_S&id=fileid. Here is my PHP function:
<?php
function dwnload($url)
{
$scriptx = "";
$internalErrors = libxml_use_internal_errors(true);
$dom = new DOMDocument();
@$dom->loadHTML(curl($url));
foreach ($dom->getElementsByTagName('a') as $k => $js) {
$scriptx .= $js->nodeValue;
}
preg_match_all('#\bhttps?://[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#', $scriptx, $match);
$vlink = "";
foreach ($match[0] as $c) {
if (strpos($c, 'export=download') !== false) {
$vlink = $c;
}
}
return $vlink;
}?>
Thanks

You're concatenating the link texts, which does not make sense. If you want to extract links, DOMDocument::getElementsByTagName() already does the job. You just need to filter the results.
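Let's consider a small HTML fragment (the href of the second link is a placeholder):
$html = <<<'HTML'
<a href="/u/0/uc?export=download&amp;confirm=EY_S&amp;id=fileid">SUCCESS</a>
<a href="/other">FAILURE</a>
HTML;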
Now iterate the a elements and filter them by their href attribute.
$document = new DOMDocument();
$document->loadHTML($html);
foreach ($document->getElementsByTagName('a') as $a) {
$href = $a->getAttribute('href');
if (strpos($href, 'export=download') !== false) {
var_dump([$href, $a->textContent]);
}
}
Output:
array(2) {
[0]=>
string(46) "/u/0/uc?export=download&confirm=EY_S&id=fileid"
[1]=>
string(7) "SUCCESS"
}
Since this is a plain substring match, it is also possible to use an XPath expression:
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXpath($document);
foreach ($xpath->evaluate('//a[contains(@href, "export=download")]') as $a) {
var_dump([$a->getAttribute('href'), $a->textContent]);
}
Or combine the XPath expression with a more specific regular expression:
$pattern = '((?:\\?|&)export=download(?:&|$))';
foreach ($xpath->evaluate('//a[contains(@href, "export=download")]') as $a) {
$href = $a->getAttribute('href');
if (preg_match($pattern, $href)) {
var_dump([$href, $a->textContent]);
}
}
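
Putting the pieces together, the XPath filter and the regular expression could be wrapped in a function that takes the HTML and returns the matching hrefs. This is only a sketch: the function name `find_download_links` and the sample markup are assumptions made for illustration. The second sample link contains the substring `export=download` but not as a query parameter of its own, which is the case the regular expression is there to reject.

```php
<?php
// Hypothetical helper: returns the href of every <a> whose query string
// carries the parameter export=download.
function find_download_links(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath   = new DOMXPath($document);
    $pattern = '((?:\\?|&)export=download(?:&|$))';
    $links   = [];
    // Cheap substring pre-filter in XPath, exact check with the regex.
    foreach ($xpath->evaluate('//a[contains(@href, "export=download")]') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match($pattern, $href)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Assumed sample input.
$html = '<a href="/u/0/uc?export=download&amp;confirm=EY_S&amp;id=fileid">SUCCESS</a>'
      . '<a href="/page?noexport=download">FAILURE</a>';

print_r(find_download_links($html));
```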

PHP Web Crawler doesn't crawl .php files

This is the simple webcrawler I was trying to build
<?php
$to_crawl = "http://samplewebsite.com/about.php";
function get_links($url)
{
$input = @file_get_contents($url);
$regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
echo $link."</br>";
}
}
get_links($to_crawl);
?>
When I run the script with the $to_crawl variable set to a URL ending with a file name, e.g. "facebook.com/about", it works, but for some reason it echoes nothing when the link ends with a '.php' file name. Can someone please help?

To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
$result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See IDEONE demo

PHP: DOM get url and anchors (but not IMG)

I want to collect all URLs from an HTML page into an array, given markup like:
This is a webpage with
different kinds of <img src="someimg.png">
The output I would like is:
with => http://somesite.se/link1.php
What I get now is:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I do not want the links that contain an image between the opening <a> and closing </a> tags, only the ones with text.
My current code is:
<?php
function innerHTML($node) {
$ret = '';
foreach ($node->childNodes as $node) {
$ret .= $node->ownerDocument->saveHTML($node);
}
return $ret;
}
$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ = suppresses errors from the HTML...
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
//$node = $link->nodeValue;
$node = innerHTML($link);
$href = $link->getAttribute('href');
if (preg_match('/\.pdf$/i', $href))
$result[$node] = $href;
}
print_r($result);
?>

Add a second preg_match to your conditional:
if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href;
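
As an alternative to the second regex, the check could be done on the DOM itself by asking each link whether it contains an img element. A rough sketch under assumed sample markup (the URLs and link texts below are invented for illustration):

```php
<?php
// Assumed sample markup: one image link, one text link, one non-PDF link.
$html = '<a href="/docs/a.pdf"><img src="someimg.png"></a>'
      . '<a href="/docs/a.pdf">with</a>'
      . '<a href="/docs/b.html">other</a>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$result = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // True when the anchor has an <img> anywhere among its descendants.
    $hasImage = $link->getElementsByTagName('img')->length > 0;
    if (preg_match('/\.pdf$/i', $href) && !$hasImage) {
        $result[trim($link->textContent)] = $href;
    }
}
print_r($result);
```

This avoids serializing each link back to HTML just to search it for an img tag.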

How to use this function to grab a div

I found this function on Snipplr, which grabs a div with a certain attribute. I tried to use it, but it didn't work. Is there something wrong with the way I'm using it?
http://snipplr.com/view.php?codeview&id=20987
function get_tag( $attr, $value, $xml, $tag=null ) {
if( is_null($tag) )
$tag = '\w+';
else
$tag = preg_quote($tag);
$attr = preg_quote($attr);
$value = preg_quote($value);
$tag_regex = "/<(".$tag.")[^>]*$attr\s*=\s*".
"(['\"])$value\\2[^>]*>(.*?)<\/\\1>/";
preg_match_all($tag_regex,
$xml,
$matches,
PREG_PATTERN_ORDER);
return $matches[3];
}
I made a change on it to use it for a url like this:
function get_tag( $attr, $value, $page, $tag=null ) {
if( is_null($tag) )
$tag = '\w+';
else
$tag = preg_quote($tag);
$attr = preg_quote($attr);
$value = preg_quote($value);
$tag_regex = "/<(".$tag.")[^>]*$attr\s*=\s*".
"(['\"])$value\\2[^>]*>(.*?)<\/\\1>/";
$page = file_get_contents($page);
preg_match_all($tag_regex,
$page,
$matches,
PREG_PATTERN_ORDER);
return $matches[3];
}
get_tag("class","weather","http://www.masrawy.com","div");
How can I use this correctly?

Don't use a regex for this. Use something that can parse and query the DOM, like DOMDocument, Zend_Dom_Query or SimpleHTMLDOM.
DOMDocument example:
$dom = new DomDocument();
$html = file_get_contents('http://www.masrawy.com');
$dom->loadHTML($html);
$finder = new DomXPath($dom);
$classname="weather";
$nodes = $finder->query("//div[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
$extracted = array();
foreach($nodes as $element)
{
// convert to html string
$extracted[] = $element->ownerDocument->saveXML($element);
}
// now iterate over extracted and output...
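
The concat/normalize-space expression in the query above is there because a class attribute is a space-separated list of tokens. Padding both the attribute and the search term with spaces makes the match apply to a whole token only. A small sketch on assumed sample markup shows the difference from a plain contains():

```php
<?php
// Assumed sample markup: two divs with the class "weather", one look-alike.
$html = '<div class="weather">a</div>'
      . '<div class="box weather today">b</div>'
      . '<div class="weatherman">c</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// Naive substring test: also matches class="weatherman".
$naive = $finder->query("//div[contains(@class, 'weather')]");

// Token test: only matches when "weather" is a complete class name.
$exact = $finder->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' weather ')]"
);

echo $naive->length, ' vs ', $exact->length, PHP_EOL; // prints "3 vs 2"
```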
A Zend_Dom_Query example:
$html = file_get_contents("http://www.masrawy.com");
$dom = new Zend_Dom_Query($html);
$results = $dom->query('div.theCssClassName');
$extracted = array();
foreach($results as $element)
{
// convert to html string
$extracted[] = $element->ownerDocument->saveXML($element);
}
// now iterate over extracted and output...

adding rel="nofollow" while saving data

My application allows users to write comments on my website, and it is working fine. It also has a tool for inserting their web links into the comments, and I'm happy for the content to contain their links.
Now I want to add rel="nofollow" to every link in the content they have written.
I would like to add rel="nofollow" using PHP, i.e. while saving the data.
So what's a simple method to add rel="nofollow", or to update rel="someother" to rel="someother nofollow", using PHP?
A good example would be much appreciated.

Regexes really aren't the best tool for dealing with HTML, especially when PHP has a pretty good HTML parser built in.
This code will handle adding nofollow if the rel attribute is already populated.
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
}
var_dump($dom->saveHTML());
CodePad.
The resulting HTML is in $dom->saveHTML(). Except it will wrap it with html, body elements, etc, so use this to extract just the HTML you entered...
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
$html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
}
echo $html;
If you have PHP 5.3.6 or later, replace saveXML() with saveHTML() and drop the second argument.
Example
This HTML...
hello
hello
hello
hello
...is converted into...
hello
hello
hello
hello
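
The rule applied to each rel attribute can also be looked at in isolation: split the value into tokens, append nofollow only if it is missing, and join the tokens again. The helper below is a hypothetical illustration of that rule (the name `merge_nofollow` is not part of the answer's code):

```php
<?php
// Hypothetical helper isolating the rel-merging rule used in the loop above.
function merge_nofollow(string $rel): string
{
    // Split on whitespace and drop empty pieces, so '' yields no tokens.
    $tokens = preg_split('/\s+/', trim($rel), -1, PREG_SPLIT_NO_EMPTY);
    if (!in_array('nofollow', $tokens, true)) {
        $tokens[] = 'nofollow';
    }
    return implode(' ', $tokens);
}

var_dump(merge_nofollow(''));          // empty rel becomes "nofollow"
var_dump(merge_nofollow('someother')); // existing value is kept: "someother nofollow"
var_dump(merge_nofollow('nofollow'));  // already present: unchanged
```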

Good answer, Alex. It is more useful in the form of a function, so I wrapped it in one below:
function add_no_follow($str){
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
}
$dom->saveHTML();
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
$html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
}
return $html;
}
Use as follows:
$str = "Some content with link Some content ... ";
$str = add_no_follow($str);

I've copied Alex's answer and made it into a function that makes links nofollow and open in a new tab/window (and added UTF-8 support). I'm not sure if this is the best way to do this, but it works (constructive input is welcome):
function nofollow_new_window($str)
{
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor)
{
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
$target = array();
if ($anchor->hasAttribute('target') AND ($relAtt = $anchor->getAttribute('target')) !== '') {
$target = preg_split('/\s+/', trim($relAtt));
}
if (in_array('_blank', $target)) {
continue;
}
$target[] = '_blank';
$anchor->setAttribute('target', implode(' ', $target));
}
$str = utf8_decode($dom->saveHTML($dom->documentElement));
return $str;
}
Simply use the function like this:
$str = '<html><head></head><body>fdsafffffdfsfdffff dfsdaff flkklfd aldsfklffdssfdfds Google</body></html>';
$str = nofollow_new_window($str);
echo $str;
