How to speed up class that checks links in HTML? - php

I have cobbled together a class that checks links. It works but it is slow:
The class basically parses an HTML string and returns all invalid links found in href and src attributes. Here is how I use it:
$class = new Validurl(array('html' => file_get_contents('http://google.com')));
$invalid_links = $class->check_links();
print_r($invalid_links);
With HTML that has a lot of links it becomes really slow and I know it has to go through each link and follow it, but maybe someone with more experience can give me a few pointers on how to speed it up.
Here's the code:
class Validurl {

    private $html = '';

    public function __construct($params) {
        $this->html = $params['html'];
    }

    public function check_links() {
        $invalid_links = array();
        $all_links = $this->get_links();
        foreach ($all_links as $link) {
            if (!$this->is_valid_url($link['url'])) {
                array_push($invalid_links, $link);
            }
        }
        return $invalid_links;
    }

    private function get_links() {
        $xml = new DOMDocument();
        @$xml->loadHTML($this->html);   // @ suppresses warnings from sloppy markup
        $links = array();
        foreach ($xml->getElementsByTagName('a') as $link) {
            $links[] = array('type' => 'url', 'url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
        }
        foreach ($xml->getElementsByTagName('img') as $link) {
            $links[] = array('type' => 'img', 'url' => $link->getAttribute('src'));
        }
        return $links;
    }

    private function is_valid_url($url) {
        if (strpos($url, "http") === false) $url = "http://" . $url;
        if (is_array(@get_headers($url))) {   // get_headers() makes a real HTTP request per link
            return true;
        } else {
            return false;
        }
    }
}

First of all I would not push the links and images into an array, and then iterate through the array, when you could directly iterate the results of getElementsByTagName(). You'd have to do it twice for <a> and <img> tags, but if you separate the checking logic into a function, you just call that for each round.
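A rough sketch of what that could look like, reusing is_valid_url() from the class above and skipping the intermediate array:
    public function check_links() {
        $invalid = array();
        $dom = new DOMDocument();
        @$dom->loadHTML($this->html);
        // validate <a href="..."> targets directly while iterating
        foreach ($dom->getElementsByTagName('a') as $a) {
            $url = $a->getAttribute('href');
            if (!$this->is_valid_url($url)) {
                $invalid[] = array('type' => 'url', 'url' => $url, 'text' => $a->nodeValue);
            }
        }
        // same for <img src="...">
        foreach ($dom->getElementsByTagName('img') as $img) {
            $url = $img->getAttribute('src');
            if (!$this->is_valid_url($url)) {
                $invalid[] = array('type' => 'img', 'url' => $url);
            }
        }
        return $invalid;
    }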
Second, get_headers() is slow, based on comments on the PHP manual page. You should rather use cURL, with something like this (found in a comment on the same page):
function get_headers_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only, no body transfer
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $r = curl_exec($ch);
    $r = explode("\n", $r);                          // split() is removed as of PHP 7; use explode()
    return $r;
}
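You can then look at the status line to decide validity, for example (a sketch that treats anything other than a 2xx/3xx response as invalid):
    $headers = get_headers_curl($url);
    // $headers[0] is the status line, e.g. "HTTP/1.1 200 OK"
    $ok = isset($headers[0]) && preg_match('#^HTTP/\S+\s+[23]\d\d#', $headers[0]);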
UPDATE: and yes, some kind of caching could also help, e.g. an SQLite database with one table for the link and the result, which you could purge once a day or so.
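A minimal sketch of that idea with PDO and SQLite (the file, table, and column names here are just placeholders):
    $db = new PDO('sqlite:/tmp/linkcache.sqlite');
    $db->exec('CREATE TABLE IF NOT EXISTS link_cache (url TEXT PRIMARY KEY, valid INTEGER, checked_at INTEGER)');

    // look up a cached result no older than one day
    $stmt = $db->prepare('SELECT valid FROM link_cache WHERE url = ? AND checked_at > ?');
    $stmt->execute(array($url, time() - 86400));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        // not cached (or stale): do the real check and remember the result
        $valid = is_array(@get_headers($url)) ? 1 : 0;
        $ins = $db->prepare('INSERT OR REPLACE INTO link_cache (url, valid, checked_at) VALUES (?, ?, ?)');
        $ins->execute(array($url, $valid, time()));
    } else {
        $valid = (int)$row['valid'];
    }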

You could cache the results (in a DB, e.g. a key-value store), so that your validator assumes that if a link was valid, it stays valid for 24 hours or a week or something like that.
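If you have APC/APCu available, a TTL-based cache along those lines is even less code (a sketch with a 24-hour TTL):
    $key = 'linkcheck:' . md5($url);
    $valid = apcu_fetch($key, $hit);
    if (!$hit) {
        $valid = is_array(@get_headers($url));   // the real check
        apcu_store($key, $valid, 86400);         // remember it for 24 hours
    }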


PHP DOMDocument getting elements by tag name ignores commented ones [duplicate]

I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty XPath answer, but if I have to resort to regular expressions I guess that's ok. I'm not so great with regex, though, so if you think that's the way to go, I'd appreciate examples...
Pretty standard starting point:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    $info .= "<br />cURL error number:" . curl_errno($ch);
    $info .= "<br />cURL error:" . curl_error($ch);
    return $info;
}
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
and extraction of info, for example:
// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
    // get iframe attributes
    $iframe = $iframes->item($i);
    $framesrc = $iframe->getAttribute("src");
    $framewidth = $iframe->getAttribute("width");
    $frameheight = $iframe->getAttribute("height");
    $framealt = $iframe->getAttribute("alt");
    $frameclass = $iframe->getAttribute("class");
    $info .= $framesrc.' ('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}
Questions/Problems:
How to extract HTML comments?
I can't figure out how to identify the comments – are they considered nodes, or something else entirely?
How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.
Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:
$comments = $xpath->query('//comment()'); // or another path, as you prefer
They are standard nodes: here is the manual entry for the DOMComment class.
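Putting those together, reading the comment text is just a matter of looping over the node list:
    foreach ($xpath->query('//comment()') as $comment) {
        // each $comment is a DOMComment; its text lives in nodeValue (or ->data)
        echo trim($comment->nodeValue), "\n";
    }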
To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:
$html = $dom->saveXML($el); // $el should be the element you want to get the HTML for
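For the div question, here is a sketch that grabs a div by class (the class name "content" is just an example) and returns its markup, children included:
    // find divs whose class attribute contains "content"
    $divs = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " content ")]');
    if ($divs->length > 0) {
        $div = $divs->item(0);
        // saveXML($node) serializes the node and everything inside it
        $blockOfHtml = $dom->saveXML($div);
    }
If you want only the children without the wrapping <div>, loop over $div->childNodes and concatenate $dom->saveXML() for each child instead.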
For the HTML comments a fast method is:
function getComments($html) {
    $rcomments = array();
    $comments = array();
    if (preg_match_all('#<\!--(.*?)-->#is', $html, $rcomments)) {
        // $rcomments[1] holds the captured comment bodies
        foreach ($rcomments[1] as $c) {
            $comments[] = $c;
        }
        return $comments;
    } else {
        // no comments matched
        return null;
    }
}
This regex also helps:
\s*<!--[\s\S]+?-->
You can try it out in a regex tester.
For comments you're looking for a recursive regex. For instance, to get rid of HTML comments:
preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);
To find them:
preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s', $yourHTML, $comments);

Pull text from another website

Is it possible to pull text data from another domain (not currently owned) using PHP? If not, is there any other method? I've tried using iframes, but because my page is a mobile website things just don't look good. I'm trying to show a marine forecast for a specific area. Here is the link I'm trying to display.
Update: This is what I ended up using. Maybe it will help someone else. However, I felt there was more than one right answer to my question.
<?php
$ch = curl_init("http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
echo $content;
?>
This works as I think you want it to, except that it depends on the weather site keeping the same format (and on "Outlook" being displayed).
<?php
// define the URL of the resource
$url = 'http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1';

// function from http://stackoverflow.com/questions/5696412/get-substring-between-two-strings-php
function getInnerSubstring($string, $boundstring, $trimit = false)
{
    $res = false;
    $bstart = strpos($string, $boundstring);
    if ($bstart !== false)        // strpos() returns false when the bound string is missing
    {
        $bend = strrpos($string, $boundstring);
        if ($bend !== false && $bend > $bstart)
        {
            $res = substr($string, $bstart + strlen($boundstring), $bend - $bstart - strlen($boundstring));
        }
    }
    return $trimit ? trim($res) : $res;
}

// if the URL is reachable
if ($source = file_get_contents($url))
{
    $raw = strip_tags($source, '<hr>');
    echo '<pre>'.substr(strstr(trim(getInnerSubstring($raw, "<hr>")), 'Outlook'), 7).'</pre>';
}
else {
    echo 'Error';
}
?>
If you need any revisions, please comment.
Try using a user-agent as shown below. Then you can use SimpleXML to parse the contents and extract the text you want (see the PHP manual for more info on SimpleXML).
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "User-agent: www.example.com"
    )
);
$content = file_get_contents($url, false, stream_context_create($opts));
$xml = simplexml_load_string($content);
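Note that simplexml_load_string() will choke on most real-world HTML; if that happens, running the markup through DOMDocument first and importing it is usually more forgiving (a sketch):
    libxml_use_internal_errors(true);      // keep parse warnings quiet on sloppy HTML
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    $xml = simplexml_import_dom($dom);     // SimpleXML view of the parsed document
    $cells = $xml->xpath('//td');          // example query; adjust it to the page you fetch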
You may use cURL for that. Have a look at http://www.php.net/manual/en/book.curl.php

My Link Crawler Does Not Work on Certain Webpages

I wrote a program in PHP to find and print all links present on a web page. It also goes inside any links it finds and does the same. My problem is that on some sites (like YouTube) it won't print the links, or follow them.
Here is my main code:
function echo_urls($site_address) {
    if (check_valid_url($site_address)) {
        $site = new site();
        $site->address = $site_address;
        $site->full_address = "$site_address";
        $site->depth = 0;
        $queue = new queue();
        $queue->push($site);
        array_push($queue->seen, $site->address);
        $depth = 0;
        while (($site = $queue->get_first())) {
            $depth++;
            echo $site->depth." : ".$site->full_address."<br>";
            $queue = push_links($site->address, $queue, $depth);
        }
    }
    else;
}

function push_links($site_address, $queue, $depth) {
    if ($depth < 4) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $site_address);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout after 30 seconds
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $result = curl_exec($ch);
        curl_close($ch);
        if ($result) {
            preg_match_all('/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU', $result, $list);
            $list = $list[0];
            foreach ($list as $item) {
                if (!(empty($item)))
                    if ($result = get_all_string_between($item, "href=\"", "\"")) {
                        if ((array_search($result[0], $queue->seen)) == false) {
                            $site = new site();
                            $site->address = $result[0];
                            $site->full_address = $item;
                            $site->depth = $depth;
                            $queue->push($site);
                            array_push($queue->seen, $site->address);
                        }
                    }
            }
        }
    }
    return $queue;
}
It's hard to tell by looking at a couple of functions, but my guess is:
YouTube is blocking you
The if($depth<4){ check is stopping push_links() from executing because it might be evaluating to FALSE
Also, don't use RegEx for this. Use something like the DOMDocument class.
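A sketch of the same link extraction done with DOMDocument instead of the regex (get_page_links() is just an example name; you would call it on $result inside push_links()):
    function get_page_links($html) {
        $links = array();
        libxml_use_internal_errors(true);  // tolerate sloppy markup
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = trim($a->getAttribute('href'));
            if ($href !== '') {
                $links[] = $href;
            }
        }
        return $links;
    }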
I usually use phpQuery for crawling sites. It's very simple:
http://code.google.com/p/phpquery/
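Something along these lines (a sketch; it assumes phpQuery.php from that project is on your include path):
    require_once 'phpQuery.php';

    phpQuery::newDocumentHTML($html);          // load the fetched page
    foreach (pq('a') as $a) {                  // pq() works like jQuery's $()
        echo $a->getAttribute('href'), "\n";   // each $a is a plain DOMElement
    }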

Update Twitter status with media with Twitter Async

Any idea how one would update a user's Twitter status with an image - using the Twitter-Async class?
This is what I have
$twitter = new Twitter(CONSUMER_KEY, CONSUMER_SECRET,$_SESSION['oauth_token'],$_SESSION['oauth_token_secret']);
$array = array('media[]' => '@/img/1.jpg', 'status' => $status);
$twitter->post('/statuses/update_with_media.json', $array);
With thanks to @billythekid, I have managed to do this. This is what you need to do:
Look these functions up in the EpiOAuth file, see what I've added, and alter them where necessary.
EpiOAuth.php
// I have this on line 24
protected $mediaUrl = 'https://upload.twitter.com';

// and altered getApiUrl() to include check for such (you may wish to make this a regex in keeping with the rest?)
private function getApiUrl($endpoint)
{
    if (strpos($endpoint, "with_media") > 0)
        return "{$this->mediaUrl}/{$this->apiVersion}{$endpoint}";
    elseif (preg_match('#^/(trends|search)[./]?(?=(json|daily|current|weekly))#', $endpoint))
        return "{$this->searchUrl}{$endpoint}";
    elseif (!empty($this->apiVersion))
        return "{$this->apiVersionedUrl}/{$this->apiVersion}{$endpoint}";
    else
        return "{$this->apiUrl}{$endpoint}";
}

// add urldecode if post is multiPart (otherwise tweet is encoded)
protected function httpPost($url, $params = null, $isMultipart)
{
    $this->addDefaultHeaders($url, $params['oauth']);
    $ch = $this->curlInit($url);
    curl_setopt($ch, CURLOPT_POST, 1);
    // php's curl extension automatically sets the content type
    // based on whether the params are in string or array form
    if ($isMultipart) {
        $params['request']['status'] = urldecode($params['request']['status']);
    }
    if ($isMultipart)
        curl_setopt($ch, CURLOPT_POSTFIELDS, $params['request']);
    else
        curl_setopt($ch, CURLOPT_POSTFIELDS, $this->buildHttpQueryRaw($params['request']));
    $resp = $this->executeCurl($ch);
    $this->emptyHeaders();
    return $resp;
}
Post image
// how to post image
$twitter = new Twitter(CONSUMER_KEY, CONSUMER_SECRET, $_SESSION['oauth_token'], $_SESSION['oauth_token_secret']);
$array = array('@media[]' => '@/img/1.jpg', 'status' => $status);
$twitter->post('/statuses/update_with_media.json', $array);

php strange looping problem

Sorry for the long code, I'm really losing it.
This code is supposed to get a list of URLs through POST, in a textarea with line breaks between each URL. The script should download each URL, go through the HTML and take some links, then go into those links, get some data, and echo it out.
For some reason, visually it looks as if I'm running getDetails() only once, as I'm getting only one set of results.
I have checked multiple times that the foreach loop takes each URL separately, and that part is working.
Can anyone spot the problem?
require_once('simple_html_dom.php');

function getDetails($html) {
    $dom = new simple_html_dom;
    $dom->load($html);
    $title = $dom->find('h1', 0)->find('a', 0);
    foreach ($dom->find('span[style="color:#333333"]') as $element) {
        $address = $element->innertext;
    }
    $address = str_replace("<br>", " ", $address);
    $address = str_replace(",", " ", $address);
    $title->innertext = str_replace(",", " ", $title->innertext);
    if ($address == "") {
        $exp = explode("<strong><strong>", $html);
        $exp2 = explode("</strong>", $exp[1]);
        $address = $exp2[0];
    }
    echo $title->innertext . "," . $address . "<br>";
}

function getHtml($Url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

function getdd($u) {
    $html = getHtml($u);
    $dom = new simple_html_dom;
    $dom->load($html);
    foreach ($dom->find('a') as $element) {
        if (strstr($element->href, "display_one.asp")) {
            $durls[] = $element->href;
        }
    }
    return $durls;
}

if (isset($_POST['url'])) {
    $urls = explode("\n", $_POST['url']);
    foreach ($urls as $u) {
        $durls2 = getdd($u);
        $durls2 = array_unique($durls2);
        foreach ($durls2 as $durl) {
            $d = getHtml("http://www.example.co.il/" . $durl);
            getDetails($d);
        }
    }
}
It looks like you're only assigning the last element in the loop. You'll need to concatenate: something like $address .= $element->innertext; inside the loop (note the .= instead of =).
Edit: unless I'm mistaken about what it's supposed to be doing. I think I may have been focusing on the wrong part of the code.
When you use DOMDocument on HTML and load it with $dom->loadHTMLFile() or $dom->loadHTML(), you should also call libxml_use_internal_errors(true) beforehand so that it does not choke on improperly formatted HTML.
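In this case that would look something like:
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);              // messy markup no longer floods the output
    libxml_clear_errors();              // optionally discard the collected warnings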
