How to skip links containing file extensions while web scraping using PHP - php

Here is a function that validates that a URL is under the .edu TLD and does not point to a .pdf or .doc document:
public function validateEduDomain($url) {
    if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9\.-]+\.edu/i', $url) && !preg_match('/\.(pdf|doc)$/i', $url) ) {
        return TRUE;
    }
    return FALSE;
}
Now I am encountering links that point to .jpg, .rtf and other files, which simple_html_dom still tries to parse, returning their contents. I want to avoid this by skipping all such links. The problem is that the list of extensions is non-exhaustive, and I want the code to skip all such links. How am I supposed to do that?

Trying to filter URLs by guessing what's behind them will always fail in a number of cases. Assuming you are using curl to download, you should check whether the response's Content-Type header is among the acceptable ones:
<?php
require "simple_html_dom.php";

$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // default is to output it

$urls = array(
    "google.com",
    "https://www.google.com/logos/2012/newyearsday-2012-hp.jpg",
    "http://cran.r-project.org/doc/manuals/R-intro.pdf",
);

$acceptable_types = array("text/html", "application/xhtml+xml");

foreach ($urls as $url) {
    curl_setopt($curl, CURLOPT_URL, $url);
    $contents = curl_exec($curl);

    // we need to handle content-types like "text/html; charset=utf-8"
    list($response_type) = explode(";", curl_getinfo($curl, CURLINFO_CONTENT_TYPE));

    if (in_array($response_type, $acceptable_types)) {
        echo "accepting {$url}\n";
        // create a simple_html_dom object from string
        $obj = str_get_html($contents);
    } else {
        echo "rejecting {$url} ({$response_type})\n";
    }
}
running the above results in:
accepting google.com
rejecting https://www.google.com/logos/2012/newyearsday-2012-hp.jpg (image/jpeg)
rejecting http://cran.r-project.org/doc/manuals/R-intro.pdf (application/pdf)

Update the last regex to something like this:
!preg_match('/\.(pdf|doc|jpg|rtf)$/i', $url)
This will filter out the .jpg and .rtf documents. You have to add each extension you want to omit to that regex.
Update
I don't think it's possible to block every sort of extension, and I personally don't recommend it for scraping either; you will always have to skip some extensions to keep crawling. Why don't you change your regex filter to match the extensions you would like to accept instead, like:
preg_match('/\.(html|htm|php|aspx)$/i', $url)
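If you'd rather not maintain that regex by hand, a rough alternative (my own sketch, not from the answer above) is to inspect the path's extension with parse_url() and pathinfo() against an allow-list:

function has_acceptable_extension($url) {
    // Hypothetical allow-list; the empty entry covers directory-style URLs with no extension
    $allowed = array('', 'html', 'htm', 'php', 'aspx');

    // Look only at the path, so query strings and fragments don't confuse the check
    $path = parse_url($url, PHP_URL_PATH);
    $ext  = strtolower((string) pathinfo((string) $path, PATHINFO_EXTENSION));

    return in_array($ext, $allowed, true);
}

Because the query string is stripped first, a link like /page.php?file=report.pdf is still judged by its real path.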

Related

Getting whole HTML element with PHP

I want to get the whole <article> element, which represents one listing (containing the image, the title, its link and the description), but it doesn't work. Can someone help me please?
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$first_step = explode( '<article>' , $content );
$second_step = explode("</article>" , $first_step[3] );
echo $second_step[0];
?>
You should definitely be using curl for this type of request.
function curl_download($url){
    // is cURL installed?
    if (!function_exists('curl_init')){
        die('cURL is not installed!');
    }

    $ch = curl_init();

    // URL to download
    curl_setopt($ch, CURLOPT_URL, $url);

    // User agent
    curl_setopt($ch, CURLOPT_USERAGENT, "Set your user agent here...");

    // Include header in result? (1 = yes, 0 = no)
    curl_setopt($ch, CURLOPT_HEADER, 0);

    // Should cURL return or print out the data? (true = return, false = print)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Timeout in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    // Download the given URL, and return output
    $output = curl_exec($ch);

    // Close the cURL resource, and free system resources
    curl_close($ch);

    return $output;
}
For best results, combine it with the Simple HTML DOM parser. Note that curl_download() returns a plain string, so it has to be turned into a DOM object first; use it like:
$html = str_get_html(curl_download($url));

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
Good Luck!
I'm not sure I understand you right, but I guess you need a PHP DOM parser. I suggest this one (a great PHP library for parsing HTML).
You can also get the whole HTML code like this:
$url = 'http://www.polkmugshot.com/';
$html = file_get_html($url);
echo $html;
Probably a better way would be to parse the document and run some XPath queries over it afterwards, like so:
$url = 'http://www.polkmugshot.com/';
$xml = simplexml_load_file($url);

$articles = $xml->xpath("//article");
foreach ($articles as $article) {
    // do sth. useful here
}
Read about SimpleXML here.
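One caveat: simplexml_load_file() expects well-formed XML, which real-world HTML pages rarely are, so it may just emit warnings and return false. A sketch of a more forgiving variant, letting DOMDocument do the lenient HTML parsing and bridging to SimpleXML afterwards:

$url = 'http://www.polkmugshot.com/';

$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents($url)); // suppress warnings about sloppy markup

$xml = simplexml_import_dom($dom);
foreach ($xml->xpath("//article") as $article) {
    // do sth. useful here
}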
Extract the articles with DOMDocument. Working example:
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);

$domd = new DOMDocument();
@$domd->loadHTML($content); // suppress warnings about sloppy real-world markup

foreach($domd->getElementsByTagName("article") as $article){
    var_dump($domd->saveHTML($article));
}
And as pointed out by @Guns, you'd better use curl, for several reasons:
1: file_get_contents will fail if allow_url_fopen is not set to true in php.ini.
2: until around PHP 5.5.0, file_get_contents kept reading from the connection until the connection was actually closed, which for many servers can be many seconds after all content has been sent, while curl only reads until it reaches the Content-Length given in the HTTP headers, which makes for much faster transfers (luckily this was fixed).
3: curl supports gzip and deflate compressed transfers, which again makes for much faster transfers when the content is compressible (such as HTML), while file_get_contents will always transfer plain.
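As an aside, enabling those compressed transfers in curl only takes one extra option; a minimal sketch (the URL is just a placeholder):

$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// An empty string lets curl advertise and transparently decode every
// encoding it was built with (gzip, deflate, ...)
curl_setopt($ch, CURLOPT_ENCODING, '');

$html = curl_exec($ch);
curl_close($ch);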

How to change values in php?

How can I change the value of the price (in WordPress), which is set up for numeric values? I want to change the value to display text or a number taken from a URL (a scraping API).
Right now my class_core.php file shows this:
/* Price Display
========================================================================== */
function PRICE($val){
    // RETURN IF NOT NUMERIC
    if(!is_numeric($val) && defined('WLT_JOBS') ){ return $val; }

    if(isset($GLOBALS['CORE_THEME']['currency'])){
        $seperator = "."; $sep = ","; $digs = 2;
        if(is_numeric($val)){
            $val = number_format($val,$digs, $seperator, $sep);
        }
        $val = hook_price_filter($val);

        // RETURN IF EMPTY
        if($val == ""){ return $val; }

        // LEFT/RIGHT POSITION
        if(isset($GLOBALS['CORE_THEME']['currency']['position']) && $GLOBALS['CORE_THEME']['currency']['position'] == "right"){
            if(substr($val,-3) == ".00"){ $val = substr($val,0,-3); }
            $val = $val.$GLOBALS['CORE_THEME']['currency']['symbol'];
        }else{
            $val = $GLOBALS['CORE_THEME']['currency']['symbol'].$val;
        }
    }
PHP is a scripting language; you don't have to declare what kind of variable you will be using. You just declare the name, and the type of the variable changes automatically depending on what data you are storing.
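A trivial illustration of that (not from the question's code):

$price = "100";   // a string
$price = 100;     // the same variable now holds an integer
$price = 99.95;   // ...and now a float

var_dump(is_numeric("100")); // bool(true) - numeric strings pass is_numeric() too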
If you have a URL that contains some information, like www.xyz.com/dddddd/ddddd, you can use cURL to obtain a result (ref: http://www.jonasjohn.de/snippets/php/curl-example.htm):
function curl_download($Url){
    // is cURL installed yet?
    if (!function_exists('curl_init')){
        die('Sorry cURL is not installed!');
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.example.org/yay.htm");
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $output = curl_exec($ch);
    curl_close($ch);

    return $output;
}
and then in your code...
// remember to include the http:// scheme in front of the URL
$url_for_value = "http://www.xyz.com/dddddd/ddddd";
$val = curl_download($url_for_value);
function PRICE($val){
    if(!is_numeric($val) && defined('WLT_JOBS') ){
        // if not numeric, e.g. "$100", strip off non-numeric characters.
        preg_match_all('/([\d]+)/', $val, $match);

        // Do we have a valid number now? (preg_match_all fills $match[0] with an array of matches)
        if (!isset($match[0][0]) || !is_numeric($match[0][0])){
            // perform other tests on return info from the CURL function?
            return $val;
        }
        $val = $match[0][0];
    }
    if(isset($GLOBALS['CORE_THEME']['currency'])){ ....
Note: It's certainly admirable to have a need for a specific function and then use that need to motivate you to learn new skills. This project assumes a certain experience with HTML, PHP and WordPress. If you don't feel comfortable with that stuff yet, that's okay; we all started knowing nothing.
Here's a possible learning roadmap:
--HTML: Learn the organization of a website, its elements, and how to create forms, buttons, etc.
--PHP: This is a scripting language; it runs on a server.
--CSS: You will need this for WordPress. (Why? Because we insist on you using a child theme, and that will require you to understand how CSS works.)
--JavaScript: although not absolutely required, lots of existing tools use it.
There are a lot of free tutorials on this stuff. I'd probably start at http://html.net/ or somewhere like that. Do all the tutorials.
After that you get to jump into WordPress. Start small, modify a few sites, then grow to writing your own plugins. At that point, I think you should be able to easily create the functionality you are looking for.
If not, it could well be quicker to hire the job out. eLance is your friend.

Get Google Search Images using php

Search Google Images with the keyword "car" and get car images.
I found two links on how to implement something like this:
PHP class to retrieve multiple images from Google using curl multi handler
Google image API using cURL
I implemented them as well, but they only gave 4 random images, not more than that.
Question: How do I get car images in PHP using a keyword, just like searching on Google?
Any suggestion will be appreciated!
You could use the PHP Simple HTML DOM library for this:
<?php
include "simple_html_dom.php";

$search_query = "ENTER YOUR SEARCH QUERY HERE";
$search_query = urlencode( $search_query );

$html = file_get_html( "https://www.google.com/search?q=$search_query&tbm=isch" );

$image_container = $html->find('div#rcnt', 0);
$images = $image_container->find('img');

$image_count = 10; // Enter the amount of images to be shown
$i = 0;
foreach($images as $image){
    if($i == $image_count) break;
    $i++;

    // Do with the image whatever you want here (the image element is '$image'):
    echo $image;
}
This will print a specific number of images (the number is set in $image_count).
For more information on the PHP Simple HTML DOM library, click here.
I am not entirely sure about this, but Google provides nice documentation about it.
$url = "https://ajax.googleapis.com/ajax/services/search/images?" .
"v=1.0&q=barack%20obama&userip=INSERT-USER-IP";
// sendRequest
// note how referer is set manually
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, /* Enter the URL of your site here */);
$body = curl_exec($ch);
curl_close($ch);
// now, process the JSON string
$json = json_decode($body);
// now have some fun with the results...
This is from Google's official developer guide regarding image searching.
For more reference, you can consult the same guide here:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_php
In $url you must set the search keywords.
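For the "car" example that would look roughly like this (note that this AJAX image-search API has long since been deprecated by Google, so treat it as historical):

$keyword = "car";
$url = "https://ajax.googleapis.com/ajax/services/search/images?" .
       "v=1.0&q=" . urlencode($keyword) . "&userip=INSERT-USER-IP";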

PHP - Parse_url only get pages

I'm working on a little web crawler as a side project at the moment, basically having it collect all hrefs on a page and then subsequently parsing those. My problem is:
How can I only get the actual page results? At the moment I'm using the following:
foreach($page->getElementsByTagName('a') as $link)
{
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "")
    {
        $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
    }
    elseif ( @$base_url['host'] == @$compare_url['host'] )
    {
        $links[] = $link->getAttribute('href');
    }
}
As you can see, this will bring in JPEGs, EXE files etc. I only need to pick up web pages like .php, .html, .asp etc.
I'm not sure if there is some function able to work this out, or if it will need to be a regex against some sort of master list?
Thanks
Since the URL string alone isn't connected to the resource behind it in any way, you will have to go out and ask the webserver about it. For this there's an HTTP method called HEAD, so you won't have to download everything.
You can implement this with curl in PHP like this:
function curl_head($url) {
    $curl = curl_init($url);

    curl_setopt($curl, CURLOPT_NOBODY, true);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);

    $content = curl_exec($curl);
    curl_close($curl);

    // redirected heads just pile up one after another
    $parts = explode("\r\n\r\n", trim($content));

    // return only the last one
    return end($parts);
}

function is_html($url) {
    $header = curl_head($url);
    // look for the content-type part of the header response
    return preg_match('/content-type\s*:\s*text\/html/i', $header);
}

var_dump(is_html('http://github.com'));
This version only accepts text/html responses and doesn't check whether the response is a 404 or another error (although it follows redirects up to 5 jumps). You can tweak the regexp or add some error handling, either from the curl response or by matching against the header string's first line.
Note: webservers will run scripts behind these URLs to give you responses. Be careful not to overload hosts with probing, or to grab "delete" or "unsubscribe" type links.
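A sketch of that first-line check, reusing the $header string inside is_html() above (my own addition, so treat the regex as an assumption rather than battle-tested):

// Only accept the content type when the final response was a 2xx status
if (preg_match('/^HTTP\/\d(?:\.\d)?\s+(\d{3})/', $header, $m) && $m[1] >= 200 && $m[1] < 300) {
    return preg_match('/content-type\s*:\s*text\/html/i', $header);
}
return false;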
To check if a page has a valid extension (html, php, ...), use this function:
function check($url){
    $extensions = array("php","html"); // Add extensions here
    foreach($extensions as $ext){
        if(substr($url,-(strlen($ext)+1)) == ".".$ext){
            return 1;
        }
    }
    return 0;
}

foreach($page->getElementsByTagName('a') as $link) {
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "") {
        if(check($link->getAttribute('href'))){ $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href'); }
    }
    elseif ( @$base_url['host'] == @$compare_url['host'] ) {
        if(check($link->getAttribute('href'))){ $links[] = $link->getAttribute('href'); }
    }
}
Consider using preg_match to check the type of the link (application, picture, HTML file) and, depending on the result, decide what to do.
Another (simple) option is to use explode and look at the last part of the URL, the one that comes after a "." (the extension).
For instance:
// If the URL has any one of the following extensions, ignore it.
$forbid_ext = array('jpg','gif','exe');

foreach($page->getElementsByTagName('a') as $link) {
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "")
    {
        if(check_link_type($link->getAttribute('href')))
            $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
    }
    elseif ( @$base_url['host'] == @$compare_url['host'] )
    {
        if(check_link_type($link->getAttribute('href')))
            $links[] = $link->getAttribute('href');
    }
}

function check_link_type($url)
{
    global $forbid_ext;

    $parts = explode("." , $url);
    $ext = end($parts); // end() needs a real variable, not the return value of explode()
    if(in_array($ext , $forbid_ext))
        return false;
    return true;
}
UPDATE (instead of checking 'forbidden' extensions , let's look for good ones)
$good_ext = array('html','php','asp');

function check_link_type($url)
{
    global $good_ext;

    $parts = explode("." , $url);
    $ext = end($parts);
    // accept the link only when it has no extension at all or one from the "good" list
    if($ext == "" || in_array($ext , $good_ext))
        return true;
    return false;
}

Is there a way to use PHP to crawl links?

I'd like to use PHP to crawl a document we have that has about 6 or 7 thousand href links in it. What we need is what is on the other side of those links, which means that PHP would have to follow each link and grab its contents. Can this be done?
Thanks
Sure, just grab the content of your starting URL with a function like file_get_contents (http://nl.php.net/file_get_contents), find the URLs in the content of this page using a regular expression, grab the contents of those URLs, and so on.
The regexp will be something like:
$regexUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
Once you have harvested the links, you can use curl or file_get_contents (though in a locked-down environment, file_get_contents is usually not allowed to fetch over the http protocol).
I just keep a SQL table of all the links I have found, and whether they have been parsed or not.
I then use Simple HTML DOM to parse the oldest added page, although as it tends to run out of memory with large pages (500 kB+ of HTML) I use regex for some of it*. For every link I find, I add it to the SQL database as needing parsing, along with the time I found it.
The SQL database prevents the data being lost on an error, and as I have 100,000+ links to parse, I do it over a long period of time.
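For the record, a minimal sketch of that kind of link queue with PDO and MySQL; the table and column names here are purely illustrative, not taken from the answer above:

// Assumed schema:
// CREATE TABLE links (id INT AUTO_INCREMENT PRIMARY KEY, url TEXT,
//                     parsed TINYINT(1) DEFAULT 0, found_at DATETIME);
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

// Take the oldest link that still needs parsing
$row = $pdo->query("SELECT id, url FROM links WHERE parsed = 0 ORDER BY found_at LIMIT 1")
           ->fetch(PDO::FETCH_ASSOC);

if ($row) {
    $new_links = array(); // ...download $row['url'] and collect the hrefs you find in here...

    $insert = $pdo->prepare("INSERT INTO links (url, parsed, found_at) VALUES (?, 0, NOW())");
    foreach ($new_links as $link) {
        $insert->execute(array($link));
    }

    // Mark the page as done so it is never picked up again
    $pdo->prepare("UPDATE links SET parsed = 1 WHERE id = ?")->execute(array($row['id']));
}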
I am unsure, but have you checked the user agent of file_get_contents()? If the pages aren't yours and you make thousands of requests, you may want to change the user agent, either by writing your own HTTP downloader or using one from a library (I use the one in the Zend Framework), but cURL etc. work fine. A custom user agent allows an admin looking over the logs to see information about your bot. (I tend to put the reason why I am crawling and a contact address in mine.)
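If you do stay with file_get_contents(), the user agent can be set through a stream context; a small sketch (the identifier string is made up):

$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'MyCrawler/1.0 (crawling for research; contact: you@example.com)',
        'timeout'    => 10,
    ),
));

$html = file_get_contents('http://example.com/', false, $context);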
*The regex I use is:
'/<a[^>]+href="([^"]+)"[^"]*>/is'
A better solution (From Gumbo) could be:
'/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i'
The PHP Snoopy library has a bunch of built-in functions to accomplish exactly what you are looking for.
http://sourceforge.net/projects/snoopy/
You can download the page itself with Snoopy, then it has another function to extract all the URLs on that page. It will even correct the links to be full-fledged URIs (i.e. they aren't just relative to the domain/directory the page resides on).
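If memory serves, basic Snoopy usage looks roughly like the following; double-check the method names against the project's documentation, since this is written from recollection rather than the answer above:

include "Snoopy.class.php";

$snoopy = new Snoopy();
if ($snoopy->fetchlinks("http://example.com/")) {
    // fetchlinks() stores the extracted, absolutized URLs in $snoopy->results
    print_r($snoopy->results);
}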
You can try the following. See this thread for more details
<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
    static $seen = array(); // remember visited URLs across recursive calls
    if(($depth == 0) or (in_array($url, $seen))){
        return;
    }
    $seen[] = $url;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);

    if( $result ){
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER );

        foreach($matches as $match){
            $href = $match[1];
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}\n";
}
crawl_page("http://www.sitename.com/",3);
?>
I suggest that you take the HTML document with your 6,000 URLs, parse the URLs out of it and loop through the list you've got. In your loop, get the contents of the current URL using file_get_contents (for this purpose you don't really need cURL when file_get_contents is enabled on your server), parse out the contained URLs again, and so on.
It would look something like this:
<?php
function getUrls($url) {
    $doc = file_get_contents($url);
    $pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    preg_match_all($pattern, $doc, $urls);
    // $urls[0] holds the full matches; the other entries are the capture groups
    return $urls[0];
}

$urls = getUrls("your_6k_file.html");

foreach($urls as $url) {
    $moreUrls = getUrls($url);

    //do something with moreUrls
}
?>
