Is there a way to use PHP to crawl links? - php

I'd like to use PHP to crawl a document we have that has about 6 or 7 thousand href links in it. What we need is what's on the other side of each link, which means PHP would have to follow each link and grab its contents. Can this be done?
Thanks

Sure. Grab the content of your starting URL with a function like file_get_contents (http://nl.php.net/file_get_contents), find the URLs in the content of that page using a regular expression, grab the contents of those URLs, and so on.
The regexp will be something like:
$regexUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

Once you harvest the links, you can fetch them with cURL or file_get_contents (though in a locked-down environment file_get_contents may not be allowed to fetch over HTTP, i.e. allow_url_fopen is disabled).
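As a rough sketch of that distinction, a small fetch helper could check allow_url_fopen and fall back to cURL; the name fetch_url and the 30-second timeout are just illustrative choices:
<?php
// Hypothetical helper: prefer file_get_contents when allow_url_fopen is on,
// otherwise fall back to cURL.
function fetch_url($url) {
    if (ini_get('allow_url_fopen')) {
        return file_get_contents($url);
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}
?>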

I just keep a SQL table of all the links I have found, plus a flag for whether they have been parsed yet.
I then use Simple HTML DOM to parse the oldest added page, although as it tends to run out of memory with large pages (500 KB+ of HTML) I use regex for some of it*. For every link I find, I add it to the SQL database as needing parsing, along with the time I found it.
The SQL database prevents the data being lost on an error, and since I have 100,000+ links to parse, I do it over a long period of time.
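A minimal sketch of such a queue, assuming MySQL via PDO; the links table and its url/parsed/found_at columns are hypothetical names, not something from the answer above:
<?php
// Hypothetical schema: links(url, parsed, found_at)
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

// Queue a newly discovered link (duplicates are ignored).
function queue_link(PDO $pdo, $url) {
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO links (url, parsed, found_at) VALUES (?, 0, NOW())'
    );
    $stmt->execute(array($url));
}

// Fetch the oldest unparsed link and mark it as parsed.
function next_link(PDO $pdo) {
    $row = $pdo->query(
        'SELECT url FROM links WHERE parsed = 0 ORDER BY found_at ASC LIMIT 1'
    )->fetch(PDO::FETCH_ASSOC);
    if ($row) {
        $pdo->prepare('UPDATE links SET parsed = 1 WHERE url = ?')
            ->execute(array($row['url']));
        return $row['url'];
    }
    return null;
}
?>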
I am unsure, but have you checked the user agent of file_get_contents()? If they aren't your pages and you make thousands of requests, you may want to change the user agent, either by writing your own HTTP downloader or using one from a library (I use the one in the Zend Framework), though cURL etc. work fine. A custom user agent allows the admin looking over the logs to see information about your bot. (I tend to put the reason why I am crawling and a contact address in mine.)
*The regex I use is:
'/<a[^>]+href="([^"]+)"[^"]*>/is'
A better solution (from Gumbo) could be:
'/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=("[^"]+"|\'[^\']+\'|[^<>\s]+)/i'
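On the user-agent point above, a minimal sketch with cURL; the bot name and contact URL are placeholders:
<?php
$ch = curl_init('http://example.com/page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Identify the crawler and give the site admin a way to reach you.
curl_setopt($ch, CURLOPT_USERAGENT,
    'MyCrawler/1.0 (+http://your-site.example/crawler; crawler@your-site.example)');
$html = curl_exec($ch);
curl_close($ch);
?>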

The PHP Snoopy library has a bunch of built-in functions to accomplish exactly what you are looking for.
http://sourceforge.net/projects/snoopy/
You can download the page itself with Snoopy, then it has another function to extract all the URLs on that page. It will even correct the links to be full-fledged URIs (i.e. they aren't just relative to the domain/directory the page resides on).
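A minimal sketch of that, assuming the classic Snoopy API where fetchlinks() fills $snoopy->results with the extracted, absolutized URLs:
<?php
require_once 'Snoopy.class.php';

$snoopy = new Snoopy();
if ($snoopy->fetchlinks('http://example.com/your-6k-page.html')) {
    // $snoopy->results holds the extracted, fully-qualified URLs.
    foreach ($snoopy->results as $link) {
        echo $link, "\n";
    }
}
?>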

You can try the following. See this thread for more details.
<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5)
{
    // Remember every URL already visited across recursive calls,
    // otherwise the duplicate check below never triggers.
    static $seen = array();
    if (($depth == 0) or (in_array($url, $seen))) {
        return;
    }
    $seen[] = $url;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);

    if ($result) {
        // Keep only the anchor tags, then pull out href and link text.
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);

        foreach ($matches as $match) {
            $href = $match[1];
            // Turn relative links into absolute ones.
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}\n";
}
crawl_page("http://www.sitename.com/", 3);
?>

I suggest that you take the HTML document with your 6000 URLs, parse the URLs out and loop through the list you've got. In your loop, get the contents of the current URL using file_get_contents (for this purpose you don't really need cURL when file_get_contents is enabled on your server), parse out the contained URLs again, and so on.
It would look something like this:
<?php
function getUrls($url) {
    $doc = file_get_contents($url);
    $pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    preg_match_all($pattern, $doc, $urls);
    return $urls[0]; // the full matches; $urls[1] and $urls[2] hold the captured sub-patterns
}

$urls = getUrls("your_6k_file.html");
foreach ($urls as $url) {
    $moreUrls = getUrls($url);
    //do something with moreUrls
}
?>

Related

Get a Tiny URL's real URL and store as PHP variable

I want to get the video ID of a Youtube URL, but often the URL is condensed into a Tiny URL when being shared.
For example, I have a script that gets the YouTube video's thumbnail based on its video ID:
<?php $vlog = "OeqlkEymQ94"; ?>
<img src="http://img.youtube.com/vi/<?=$vlog;?>/0.jpg" alt="" />
This is easy to get when the URL I'm extracting it from is
http://www.youtube.com/watch?v=OeqlkEymQ94
but sometimes the URL is a tiny URL so I have to figure out how to return the real URL so I can use it.
http://tinyurl.com/kmx9zt6
Is it possible to retrieve the real URL behind a shortened URL through PHP?
You could use get_headers() or cURL to grab the Location header:
function getFullURL($url) {
    $headers = get_headers($url);
    $headers = array_reverse($headers); // walk the headers from the final response backwards
    foreach ($headers as $header) {
        if (strpos($header, 'Location: ') !== FALSE) {
            $url = str_replace('Location: ', '', $header);
            break;
        }
    }
    return $url;
}
Usage:
echo getFullURL('http://tinyurl.com/kmx9zt6');
Note: It's a slightly modified version of the gist here.
For future reference, I used a much simpler function because my Tiny URLs were always going to resolve to YouTube and the headers were almost always the same:
function getFullURL($url) {
    $headers = get_headers($url);
    $url = $headers[4]; // this is the Location part of the array (fragile: assumes a fixed header order)
    $url = str_replace('Location: ', '', $url);
    $url = str_replace('location: ', '', $url); // in my case the location header was lowercase, but it can't hurt to handle both
    return $url;
}
Usage:
echo getFullURL('http://tinyurl.com/kmx9zt6');
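Another option, sketched under the assumption that cURL is available: let cURL follow the redirect and ask it for the final URL, then pull the video ID out of the query string. The function name resolveUrl is just illustrative:
<?php
function resolveUrl($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the TinyURL redirect
    curl_setopt($ch, CURLOPT_NOBODY, true);         // we only need the headers
    curl_exec($ch);
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // URL after all redirects
    curl_close($ch);
    return $final;
}

$full = resolveUrl('http://tinyurl.com/kmx9zt6');
$query = array();
parse_str((string) parse_url($full, PHP_URL_QUERY), $query);
$vlog = isset($query['v']) ? $query['v'] : null; // the YouTube video ID
?>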

Strip HTML to remove all JS/CSS/HTML tags to get the actual text (as displayed in the browser) for indexing and search

I have tried strip_tags but it still leaves inline JS: (function(){..}) and also inline CSS: #button{}
I need to extract pure text from HTML, without any JS functions, styling or tags, so that I can index it and use it for my search functionality.
html2text also doesn't seem to solve the problem!
EDIT
PHP Code:
$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
$fileHeaders = #get_headers($url);
if( $fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK")
{
$content = strip_tags(file_get_contents($url));
}
OUTPUT :
$content =
(function() { var a=window,c="jstiming",d="tick";var e=function(b){this.t={};this.tick=function(b,o,f){f=void 0!=f?f:(new Date).getTime();this.t[b]=[f,o]};this[d]("start",null,b)},h=new e;a.jstiming={Timer:e,load:h};if(a.performance&&a.performance.timing){var i=a.performance.timing,j=a[c].load,k=i.navigationStart,l=i.responseStart;0=k&&(j[d]("_wtsrt",void 0,k),j[d]("wtsrt_","_wtsrt",l))}
try{var m=null;a.chrome&&a.chrome.csi&&(m=Math.floor(a.chrome.csi().pageT));null==m&&a.gtbExternal&&(m=a.gtbExternal.pageT());null==m&&a.external&&(m=a.external.pageT);m&&(a[c].pt=m)}catch(n){};a.tickAboveFold=function(b){var g=0;if(b.offsetParent){do g+=b.offsetTop;while(b=b.offsetParent)}b=g;750>=b&&a[c].load[d]("aft")};var p=!1;function q(){p||(p=!0,a[c].load[d]("firstScrollTime"))}a.addEventListener?a.addEventListener("scroll",q,!1):a.attachEvent("onscroll",q);
})();
Everyman Software: Development Setup for Neo4j and PHP: Part 2
#navbar-iframe { display:block }
if(window.addEventListener) {
window.addEventListener('load', prettyPrint, false);
} else {
window.attachEvent('onload', prettyPrint);
}
var a=navigator,b="userAgent",c="indexOf",f="&m=1",g="(^|&)m=",h="?",i="?m=1";function j(){var d=window.location.href,e=d.split(h);switch(e.length){case 1:return d+i;case 2:return 0
2011-11-05
Development Setup for Neo4j and PHP: Part 2
This is Part 2 of a series on setting up a development environment for building projects using the graph database Neo4j and PHP. In Part 1 of this series, we set up unit test and development databases. In this part, we'll build a skeleton project that includes unit tests, and a minimalistic user interface.
All the files will live under a directory on our web server. In a real project, you'll probably want only the user interface files under the web server directory and your testing and library files somewhere more protected.
Also, I won't be using any specific PHP framework. The principles in t
This is a small snippet I always use to remove all the hidden text from a webpage, including everything in between <script>, <style>, <head>, etc. tags. It will also replace all multiple occurrences of any kind of whitespace with a single space.
<?php
$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
$fileHeaders = @get_headers($url);
if ($fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK")
{
    $content = strip_html_tags(file_url_contents($url));
}

############################################
//To fetch the $url by using cURL
function file_url_contents($url) {
    $crl = curl_init();
    $timeout = 30;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
} //file_url_contents ENDS

//To remove all the hidden text not displayed on a webpage
function strip_html_tags($str) {
    $str = preg_replace('/(<|>)\1{2}/is', '', $str);
    $str = preg_replace(
        array(// Remove invisible content
            '#<head[^>]*?>.*?</head>#siu',
            '#<style[^>]*?>.*?</style>#siu',
            '#<script[^>]*?.*?</script>#siu',
            '#<noscript[^>]*?.*?</noscript>#siu',
        ),
        "", //replace above with nothing
        $str);
    $str = replaceWhitespace($str);
    $str = strip_tags($str);
    return $str;
} //function strip_html_tags ENDS

//To replace all types of whitespace with a single space
function replaceWhitespace($str) {
    $result = $str;
    foreach (array(
        "  ", " \t", " \r", " \n",
        "\t\t", "\t ", "\t\r", "\t\n",
        "\r\r", "\r ", "\r\t", "\r\n",
        "\n\n", "\n ", "\n\t", "\n\r",
    ) as $replacement) {
        $result = str_replace($replacement, $replacement[0], $result);
    }
    return $str !== $result ? replaceWhitespace($result) : $result;
}
############################
?>
See it in action here http://codepad.viper-7.com/txIxfE
And output: http://pastebin.com/a86jd17s
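If the regular expressions feel too fragile, the same idea can be sketched with DOMDocument: drop the script, style, head and noscript nodes, then read textContent. The helper name visible_text is illustrative:
<?php
function visible_text($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from real-world markup

    // Remove the elements whose content never shows up in the browser.
    foreach (array('script', 'style', 'head', 'noscript') as $tag) {
        $nodes = $doc->getElementsByTagName($tag);
        while ($nodes->length > 0) {
            $node = $nodes->item(0);
            $node->parentNode->removeChild($node);
        }
    }

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $doc->textContent));
}
?>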
strip_tags() will remove everything that is inside < and >. So, e.g., if you have something like
<script type="text/javascript">alert('hello world');</script>
It will be reduced to
alert('hello world');
This will not be executed but just displayed on your site.
Alternatively, try htmlentities() to convert "<" to "&lt;" and ">" to "&gt;" so the markup can be displayed safely without executing anything.
If, instead, your question is about extracting data from tags, better use regular expressions. E.g., if you have something like
<a href="http://www.google.com">Google</a>
you can simply use preg_match() to get the word "Google" from the whole link:
$content='<a href="http://www.google.com">Google</a>';
$regex="#<a[^>]*>(.+?)</a>#";
preg_match($regex,$content,$match);
echo $match[1];
By the way, $match[1] will contain the match cleared of any tags anyway, while $match[0] won't. To get more than one match, use preg_match_all().

How to skip links containing file extensions while web scraping using PHP

Here is a function that validates the .edu TLD and checks that the URL does not point to a .pdf or .doc document.
public function validateEduDomain($url) {
    if (preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9\.-]+\.edu/i', $url) && !preg_match('/\.(pdf)|(doc)$/i', $url)) {
        return TRUE;
    }
    return FALSE;
}
Now I am encountering links that point to .jpg, .rtf and other files that simple_html_dom tries to parse, returning their contents. I want to avoid this by skipping all such links, but the list of extensions is non-exhaustive. How am I supposed to do that?
Trying to filter URLs by guessing what's behind them will always fail in a number of cases. Assuming you are using cURL to download, you should check whether the response Content-Type header is among the acceptable ones:
<?php
require "simple_html_dom.php";

$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); //default is to output it

$urls = array(
    "google.com",
    "https://www.google.com/logos/2012/newyearsday-2012-hp.jpg",
    "http://cran.r-project.org/doc/manuals/R-intro.pdf",
);
$acceptable_types = array("text/html", "application/xhtml+xml");

foreach ($urls as $url) {
    curl_setopt($curl, CURLOPT_URL, $url);
    $contents = curl_exec($curl);

    //we need to handle content-types like "text/html; charset=utf-8"
    list($response_type) = explode(";", curl_getinfo($curl, CURLINFO_CONTENT_TYPE));

    if (in_array($response_type, $acceptable_types)) {
        echo "accepting {$url}\n";
        // create a simple_html_dom object from string
        $obj = str_get_html($contents);
    } else {
        echo "rejecting {$url} ({$response_type})\n";
    }
}
running the above results in:
accepting google.com
rejecting https://www.google.com/logos/2012/newyearsday-2012-hp.jpg (image/jpeg)
rejecting http://cran.r-project.org/doc/manuals/R-intro.pdf (application/pdf)
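If downloading the whole body just to reject it is too wasteful, the same check can be sketched with a HEAD request first (CURLOPT_NOBODY), fetching the full page only for acceptable types; note that some servers answer HEAD requests poorly, so treat this as an optimization, not a guarantee:
<?php
// Returns the Content-Type of $url without downloading the body.
function head_content_type($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // send a HEAD request
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects to the real resource
    curl_exec($ch);
    list($type) = explode(';', (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE));
    curl_close($ch);
    return trim($type);
}

$acceptable_types = array('text/html', 'application/xhtml+xml');
if (in_array(head_content_type($url), $acceptable_types)) {
    $contents = file_get_contents($url); // only now fetch the full page
}
?>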
Update the last regex to something like this:
!preg_match('/\.(pdf|doc|jpg|rtf)$/i', $url)
That will filter out the JPG and RTF documents.
You have to add the extensions to the regex above to omit them.
Update
I don't think it's possible to block every sort of extension, and I personally do not recommend it for scraping either. You will have to skip some extensions to keep crawling. Why don't you change your regex filter to match only the extensions you would like to accept, like:
preg_match('/\.(htm|html|php|aspx)$/i', $url)

parsing html through file_get_contents()

I have been told that the best way to parse HTML is through DOM, like this:
<?
$html = "<span>Text</span>";
$doc = new DOMDocument();
$doc->loadHTML($html);
$elements = $doc->getElementsByTagName("span");
foreach ($elements as $el)
{
    echo $el->nodeValue . "\n";
}
?>
But in the above, the variable $html can't be a URL, or can it?
Wouldn't I have to use the function file_get_contents() to get the HTML of a page?
You have to use DOMDocument::loadHTMLFile to load HTML from a URL.
$doc = new DOMDocument();
$doc->loadHTMLFile($path);
DOMDocument::loadHTML parses a string of HTML.
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents($path));
It can be, but it depends on allow_url_fopen being enabled in your PHP install. Basically all of the PHP file-based functions can accept a URL as a source (or destination). Whether such a URL makes sense is up to what you're trying to do.
e.g. doing file_put_contents('http://google.com') is not going to work, as you'd be attempting to do an HTTP upload to Google, and they're not going to allow you to replace their homepage...
but doing $dom->loadHTMLFile('http://google.com'); would work, and would suck Google's homepage into the DOM for processing.
If you're having trouble using DOM, you could use cURL to fetch the page and a regular expression to parse it. For example:
$url = "http://www.davesdaily.com/";
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_URL, $url);
$input = curl_exec($curl);
curl_close($curl);

$regexp = "<span class=comment>([^<]*)<\/span>";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo $match[1] . "\n"; // the text between the tags
    }
}
The script grabs the text between <span class=comment> and </span> and stores each match in the array $matches; $match[1] holds just the text, so this should echo Entertainment.
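Since the question is about DOM anyway, here is a sketch of the same extraction without a regex, using DOMXPath; the class name comment mirrors the example above:
<?php
$html = file_get_contents("http://www.davesdaily.com/");

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from imperfect markup

$xpath = new DOMXPath($doc);
// Select every <span> whose class attribute is exactly "comment".
foreach ($xpath->query('//span[@class="comment"]') as $span) {
    echo $span->textContent . "\n";
}
?>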

how to find the total no.of inbound and outbound links of a website using php?

how to find the total no.of inbound and outbound links of a website using php?
To count outbound links:
parse the HTML of the web page
extract all links using a regex
filter out links which start with your domain or "/" (those are internal)
To count inbound links:
grab a Google results page, e.g.
http://www.google.ca/search?sourceid=chrome&ie=UTF-8&q=site:
and parse it similarly
For outbound links, you will have to parse the HTML code of the website as some here have suggested.
For inbound links, I suggest using the Google Search API, since sending a direct request to Google can get your IP banned. You can view the search API here. Here is a function I use in my code for this API:
function doGoogleSearch($searchTerm)
{
    $referer = 'http://your-site.com';
    $args['q'] = $searchTerm;
    $endpoint = 'web';
    $url = "http://ajax.googleapis.com/ajax/services/search/" . $endpoint;
    $args['v'] = '1.0';
    $key = 'your-api-key';
    $url .= '?' . http_build_query($args, '', '&');

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    $body = curl_exec($ch);
    curl_close($ch);

    //decode and return the response
    return json_decode($body);
}
After calling this function as: $result = doGoogleSearch('link:site.com'), the variable $result->cursor->estimatedResultCount will have the number of results returned.
PHP can't determine the inbound links of a page through some trivial action. You either have to monitor all incoming visitors and check what their referrer is, or parse the entire internet for links that point to that site. The first method will miss links that never get followed, and the second method is best left to Google.
On the other hand, counting the outbound links of a site is doable. You can read in a page and analyze the text for links with a regular expression, counting up the total.
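As a sketch of that outbound count, using DOMDocument instead of a regex; the host comparison is what decides whether a link counts as outbound, and countOutboundLinks is an illustrative name:
<?php
function countOutboundLinks($pageUrl) {
    $host = parse_url($pageUrl, PHP_URL_HOST);

    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($pageUrl)); // suppress warnings from messy markup

    $outbound = 0;
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $linkHost = parse_url($href, PHP_URL_HOST);
        // Relative links have no host; links to a different host are outbound.
        if ($linkHost !== null && strcasecmp($linkHost, $host) !== 0) {
            $outbound++;
        }
    }
    return $outbound;
}

echo countOutboundLinks("http://www.example.com/");
?>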
function getGoogleLinks($host)
{
    // getPageData() is the caller's own HTTP fetch helper (e.g. a cURL wrapper); it is not defined here.
    $request = "http://www.google.com/search?q=" . urlencode("link:" . $host) . "&hl=en";
    $data = getPageData($request);
    preg_match('/<div id=resultStats>(About )?([\d,]+) result/si', $data, $l);
    $value = ($l[2]) ? $l[2] : "n/a";
    return $value;
}
//$host means the domain name
