Prevent writing of log/file if visitor is a spider? - php

I have a website status checker that writes the latest URLs checked to a log file (URL, status e.g. up or down, and date checked). The trouble I'm now finding is that it also records spider/Googlebot visits, so the latest site checks are being written multiple times per second...
Here is my log writing function:
public function log($url, $status) {
    // Reduce a full URL down to its host name
    if (strpos($url, "/") !== false):
        if (strpos($url, "http://") === false):
            $url = "http://" . $url;
        endif;
        $parse = parse_url($url);
        $url = $parse['host'];
    endif;
    if (!empty($url)):
        $arrayToWrite = array(
            array(
                "url"    => $url,
                "status" => $status,
                "date"   => date("m/d/Y h:i")
            )
        );
        if (file_exists($this->logfile)):
            $fileContents = file_get_contents($this->logfile);
            $arrayFromFile = unserialize($fileContents);
            // Drop any existing entry for this url so it isn't duplicated
            foreach ($arrayFromFile as $k => $tmpArray):
                if ($tmpArray['url'] == $url):
                    unset($arrayFromFile[$k]);
                endif;
            endforeach;
            if (is_array($arrayFromFile)):
                // Keep the 9 most recent entries and prepend the new one
                array_splice($arrayFromFile, 9);
                $arrayToWrite = array_merge($arrayToWrite, $arrayFromFile);
            endif;
        endif;
        file_put_contents($this->logfile, serialize($arrayToWrite));
    endif;
}
What amendments could I make so that it ignores bot/spider visits and only tracks/writes real visitors?

Referencing this answer: how to detect search engine bots with php?
You can use $_SERVER['HTTP_USER_AGENT'] to check if the visitor identifies as a spider. Note that the user-agent string contains more than just the bot's name, so a substring check is needed rather than in_array():
$bots = array("googlebot", "msn", "add other bots");
$userAgent = strtolower($_SERVER['HTTP_USER_AGENT']);
foreach ($bots as $bot) {
    if (strpos($userAgent, $bot) !== false) {
        // Don't save url
    }
}
A List of Spiders
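One way to wire that check into the logger is an early return at the top of log(), so bot hits never touch the file. A minimal sketch (the bot list here is illustrative, not exhaustive):
public function log($url, $status) {
    // Skip crawlers so repeated bot visits don't rewrite the log
    $userAgent = strtolower(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '');
    foreach (array("googlebot", "bingbot", "msnbot", "slurp", "spider", "crawl") as $bot) {
        if (strpos($userAgent, $bot) !== false) {
            return;
        }
    }
    // ... existing log logic from the question ...
}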

Related

Add http:// prefix to URL when it is missing

Hello guys, I'm using this to add the http:// prefix to a URL when it's missing. The problem is that it also adds the prefix to URLs that already have http://.
foreach ($result as $key => &$value) {
    if (strpos($value['Internetadress'], 'http://') === false) {
        $value['Internetadress'] = 'http://' . $value['Internetadress'];
    }
}
I want it to leave the URL unchanged when the prefix already exists.
I want it not to add the prefix when there is no URL at all.
Sorry about my English, I'm from Germany :D
Your code should cover your first point (when http:// exists, don't change the URL); if not, provide a sample URL that you want to modify.
For the second point you just add another check, like this:
<?php
$sample['Internetadress'] = 'www.example.com';
if (strpos($sample['Internetadress'], 'http://') === false && trim($sample['Internetadress']) !== '') {
    if (strpos($sample['Internetadress'], 'https://') === false) {
        $sample['Internetadress'] = 'http://' . $sample['Internetadress'];
    }
}
echo $sample['Internetadress'];
I think in this case a little regex is a good solution:
$urls = [
    'http://www.example.com',
    'foo.bar.com',
    'https://example.com',
    'www.example.com'
];
foreach ($urls as &$url) {
    $url = preg_replace('/^(?!http)/i', 'http://', $url);
}
The values in $urls after the loop are:
[
    "http://www.example.com",
    "http://foo.bar.com",
    "https://example.com",
    "http://www.example.com"
]
Also, instead of using a foreach loop and passing the values by reference, you could use array_map() like this:
$urls = array_map(function ($url) {
    return preg_replace('/^(?!http)/i', 'http://', $url);
}, $urls);
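If you also want the second requirement from the question (don't add the prefix when there is no URL), a small variation of the same idea works; the empty-value guard here is my addition, not part of the original answer:
$urls = array_map(function ($url) {
    $url = trim($url);
    // Leave empty entries untouched instead of prefixing them
    if ($url === '') {
        return $url;
    }
    return preg_replace('/^(?!http)/i', 'http://', $url);
}, $urls);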

Scrape page and separate internal from external links

Building a little PHP scraper, I'm writing a function that should separate my internal and external links. I'm passing the function a copy of the HTML source code along with the base host address:
$source = file_get_contents('http://www.example.com');
$host = "mysite.com";
here is my function so far...
function find_page_links($source, $host) {
    if ($source) {
        $htmlDoc = new DomDocument();
        @$htmlDoc->loadHTML($source); // suppress warnings from malformed HTML
        $int_links = array();
        $ext_links = array();
        // GET LINKS
        foreach ($htmlDoc->getElementsByTagName('a') as $link) {
            $url   = trim($link->getAttribute('href'));
            $title = trim($link->getAttribute('title'));
            $text  = trim($link->nodeValue);
            $rel   = trim($link->getAttribute('rel'));
            $pos   = strpos($url, $host);
            if ($pos === false) { // NO MATCH, host not in the URL
                if (substr($url, 0, 1) == '/' || substr($url, 0, 1) == '#') {
                    // INTERNAL: relative path or on-page anchor
                    $int_links[] = array(
                        'link_url'   => $url,
                        'link_text'  => $text,
                        'link_title' => $title,
                        'link_rel'   => $rel
                    );
                } else {
                    // EXTERNAL
                    $ext_links[] = array(
                        'link_url'   => $url,
                        'link_text'  => $text,
                        'link_title' => $title,
                        'link_rel'   => $rel
                    );
                }
            } else {
                if ($pos < 20) {
                    // INTERNAL: host appears near the start of the URL
                    $int_links[] = array(
                        'link_url'   => $url,
                        'link_text'  => $text,
                        'link_title' => $title,
                        'link_rel'   => $rel
                    );
                } else {
                    // EXTERNAL: host appears deep in the URL (e.g. a share link)
                    $ext_links[] = array(
                        'link_url'   => $url,
                        'link_text'  => $text,
                        'link_title' => $title,
                        'link_rel'   => $rel
                    );
                }
            } // end else
        } // end foreach
        $content = array();
        $content['int_links'] = $int_links;
        $content['ext_links'] = $ext_links;
        return $content;
    }
}
So what's happening is: the function loads the HTML via DomDocument, I create two arrays to store internal and external links, and I loop through the document with getElementsByTagName('a').
It then uses strpos to check whether the host address ("example.com") is within the link URL. If there is no match (false), it's external, but we do a further check first: a link URL starting with a forward slash (e.g. "/contact-us.php") means it's internal, and one starting with "#" is an anchor link on the same page.
So that was the $pos === false / no-match branch.
Now, if the host is in the link URL it's a match, and I do another check on where the host sits in the string. If it's near the start, the link is internal, e.g.:
http://example.com/about/
But if the position is greater than 20 (just a number plucked from the air), the host appears much further along the string, like in a Google Plus or Facebook link, which means it's external, e.g.:
http://www.facebook.com/plugins/like.php?href=http://example.com/
Phew...
If you guys have any other BETTER ways to spot an external or internal link, please let me know. My results really vary depending on the site and on whether links use the full path.
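For comparison, a more robust classification (my own sketch, not from the original post) compares hosts with parse_url() instead of guessing from a strpos() offset. Relative URLs and anchors have no host component, so they count as internal; everything else is classified by an exact host match:
// Hedged sketch: classify a single href against the base host.
function is_internal_link($href, $host) {
    $linkHost = parse_url($href, PHP_URL_HOST);
    if ($linkHost === null || $linkHost === false) {
        // relative path or #anchor (note: schemeless "www.foo.com" also lands here)
        return true;
    }
    // Treat "www.example.com" and "example.com" as the same site
    $strip = function ($h) {
        return preg_replace('/^www\./i', '', strtolower($h));
    };
    return $strip($linkHost) === $strip($host);
}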

cloudfront signed urls ip address

I have signed URLs on CloudFront working fine in PHP. Bucket policies work with HTTP referrers on S3, but because CloudFront doesn't support HTTP referrer checks, I need to serve a file to one IP address only (ideally the client that requested the file and generated the signed URL, or my web server).
Can someone please help me add the IP Address element to the JSON code so it works?
"IpAddress":{"AWS:SourceIp":"192.0.2.0/24"},
I'm lost with the PHP and the policy statement, but I think it might be easy for someone who knows: http://tinyurl.com/9czr5lp
CloudFront does the encoding/signing a bit differently for a custom policy: http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/private-content-creating-signed-url-custom-policy.html#private-content-custom-policy-statement
The code below is an AWS example and works, except for the IP address lock-in.
I can test this very quickly if someone can please give me a hand for two minutes!
Thanks MASSIVELY for any help :)
Jon
function getSignedURL($resource, $timeout)
{
    $keyPairId = "XXXXXXXXXXXX";
    $expires = time() + $timeout;
    $json = '{"Statement":[{"Resource":"'.$resource.'","Condition":{"DateLessThan":{"AWS:EpochTime":'.$expires.'}}}]}';

    // Read the CloudFront private key
    $fp = fopen("pk-XXXXXXXX.pem", "r");
    $priv_key = fread($fp, 8192);
    fclose($fp);
    $key = openssl_get_privatekey($priv_key);
    if (!$key) {
        echo "<p>Failed to load private key!</p>";
        return;
    }

    // Sign the policy with the private key
    if (!openssl_sign($json, $signed_policy, $key, OPENSSL_ALGO_SHA1)) {
        echo '<p>Failed to sign policy: '.openssl_error_string().'</p>';
        return;
    }

    // Create url safe signed policy
    $base64_signed_policy = base64_encode($signed_policy);
    $signature = str_replace(array('+','=','/'), array('-','_','~'), $base64_signed_policy);

    // Construct the URL
    $url = $resource.'?Expires='.$expires.'&Signature='.$signature.'&Key-Pair-Id='.$keyPairId;
    return $url;
}

$url = getSignedURL("http://s675765.cloudfront.net/filename.mp4", 600);
print $url;
{"Statement":[{"Resource":"testRes","Condition":{"DateLessThan":{"AWS:EpochTime":1357034400},"IpAddress":{"AWS:SourceIp":"192.0.2.0\/24"}}}]}
This is a valid JSON string with filled and escaped values.
If you pass the IP address in as a variable, make sure you escape the /
e.g.
$escapedIp = str_replace( '/', '\/', $ipAddress );
Have a look at the PHP JSON extension; it makes this quite a bit easier.
Example statement as a PHP array:
$statement = array(
    'Statement' => array(
        array(
            'Resource' => $resource,
            'Condition' => array(
                'DateLessThan' => array(
                    'AWS:EpochTime' => $expires
                ),
                'IpAddress' => array(
                    'AWS:SourceIp' => $ipAddress
                )
            )
        )
    )
);

$json = json_encode($statement);
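With a custom policy, CloudFront also expects the URL itself to carry a Policy parameter (the url-safe, base64-encoded policy JSON) in place of Expires. A sketch of how the URL construction in getSignedURL() changes, per the custom-policy docs linked above (verify the details against them):
// Sign the custom policy JSON exactly as before
openssl_sign($json, $signed_policy, $key, OPENSSL_ALGO_SHA1);

// Both the policy and the signature must be made url-safe
$makeUrlSafe = function ($data) {
    return str_replace(array('+', '=', '/'), array('-', '_', '~'), base64_encode($data));
};
$policy    = $makeUrlSafe($json);
$signature = $makeUrlSafe($signed_policy);

// Custom policies use Policy= instead of Expires=
$url = $resource.'?Policy='.$policy.'&Signature='.$signature.'&Key-Pair-Id='.$keyPairId;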

Download rapidshare file using rapidshare api in php

I am trying to download a RapidShare file using its "download" subroutine as a free user. The following is the code I use to get a response from the subroutine:
function rs_download($params)
{
    $url = "http://api.rapidshare.com/cgi-bin/rsapi.cgi?sub=download&fileid=".$params['fileid']."&filename=".$params['filename'];
    $reply = @file_get_contents($url);
    if (!$reply) {
        return false;
    }
    $result_arr = array();
    $result_keys = array(0 => 'hostname', 1 => 'dlauth', 2 => 'countdown_time', 3 => 'md5hex');
    // The useful part of the reply comes after a "DL:" prefix
    if (preg_match("/DL:(.*)/", $reply, $reply_matches)) {
        $reply_altered = $reply_matches[1];
    } else {
        return false;
    }
    foreach (explode(',', $reply_altered) as $index => $value) {
        $result_arr[$result_keys[$index]] = $value;
    }
    return $result_arr;
}
For instance, trying to download this:
http://rapidshare.com/files/440817141/AutoRun__live-down.com_Champ.rar
I pass the fileid (440817141) and filename (AutoRun__live-down.com_Champ.rar) to rs_download(...) and I get a response just as RapidShare's API doc says.
The RapidShare API doc (see "sub=download") says to call the server hostname with the download authentication string, but I couldn't figure out what form the URL should take.
Any suggestions? I tried
$download_url = "http://$the-hostname/$the-dlauth-string/files/$fileid/$filename"
and a couple of other variations of the above; nothing worked.
I use curl to download the file, like the following:
$cr = curl_init();
$fp = fopen("d:/downloaded_files/file1.rar", "w");

// set curl options
$curl_options = array(
    CURLOPT_URL            => $download_url,
    CURLOPT_FILE           => $fp,
    CURLOPT_HEADER         => false,
    CURLOPT_CONNECTTIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true
);
curl_setopt_array($cr, $curl_options);
curl_exec($cr);
curl_close($cr);
fclose($fp);
The above curl code doesn't seem to work; nothing gets downloaded. Probably it's the download URL that is incorrect.
I also tried this format for the download URL:
"http://rs$serverid$shorthost.rapidshare.com/files/$fileid/$filename"
With this, curl writes a file entry, but that is all it does (a 0/1 KB file).
Here is the code I use to get the server id, short host, and a few other values from RapidShare:
function rs_checkfile($params)
{
    $url = "http://api.rapidshare.com/cgi-bin/rsapi.cgi?sub=checkfiles_v1&files=".$params['fileids']."&filenames=".$params['filenames'];
    // The response from rapidshare would be a string something like:
    // 440817141,AutoRun__live-down.com_Champ.rar,47768,20,1,l3,0
    $reply = @file_get_contents($url);
    if (!$reply) {
        return false;
    }
    $result_arr = array();
    $result_keys = array(
        0 => 'file_id', 1 => 'file_name', 2 => 'file_size', 3 => 'server_id',
        4 => 'file_status', 5 => 'short_host', 6 => 'md5'
    );
    foreach (explode(',', $reply) as $index => $value) {
        $result_arr[$result_keys[$index]] = $value;
    }
    return $result_arr;
}
rs_checkfile(...) takes comma-separated file ids and filenames (no commas when calling it for a single file).
Thanks in advance for any suggestions.
You start by requesting ?sub=download&fileid=X&filename=Y, and it returns $hostname,$dlauth,$countdown,$md5hex. Since you're a free user you have to wait for $countdown seconds, and then call ?sub=download&fileid=X&filename=Y&dlauth=Z against that hostname to perform the download.
There's a working implementation in Python here that would probably answer any of your other questions.
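Putting that flow together with the asker's rs_download() helper, this is roughly what the second request looks like (a sketch; the exact path on the returned host is an assumption based on the description above):
$info = rs_download(array(
    'fileid'   => '440817141',
    'filename' => 'AutoRun__live-down.com_Champ.rar',
));
if ($info) {
    // Free users must wait out the countdown before the second call
    sleep((int)$info['countdown_time']);
    $download_url = "http://".$info['hostname']
        ."/cgi-bin/rsapi.cgi?sub=download"
        ."&fileid=440817141"
        ."&filename=AutoRun__live-down.com_Champ.rar"
        ."&dlauth=".$info['dlauth'];
    // ...then fetch $download_url with the cURL snippet from the question
}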

Get keyword from a (search engine) referrer url using PHP

I am trying to get the search keyword from a referrer URL. Currently I am using the following code for Google URLs, but sometimes it does not work...
$query_get = "(q|p)";
$referrer = "http://www.google.com/search?hl=en&q=learn+php+2&client=firefox";
preg_match('/[?&]'.$query_get.'=(.*?)[&]/',$referrer,$search_keyword);
Is there another/clean/working way to do this?
Thank you,
Prasad
If you're using PHP5 take a look at http://php.net/parse_url and http://php.net/parse_str
Example:
// The referrer
$referrer = 'http://www.google.com/search?hl=en&q=learn+php+2&client=firefox';
// Parse the URL into an array
$parsed = parse_url( $referrer, PHP_URL_QUERY );
// Parse the query string into an array
parse_str( $parsed, $query );
// Output the result
echo $query['q'];
There are different query strings on different search engines. After trying Wiliam's method, I have figured out my own method (because Yahoo uses 'p', but sometimes 'q').
$referrer = "http://search.yahoo.com/search?p=www.stack+overflow%2Ccom&ei=utf-8&fr=slv8-msgr&xargs=0&pstart=1&b=61&xa=nSFc5KjbV2gQCZejYJqWdQ--,1259335755";
$referrer_query = parse_url($referrer);
$referrer_query = $referrer_query['query'];
$q = "[q|p]"; //Yahoo uses both query strings, I am using switch() for each search engine
preg_match('/'.$q.'=(.*?)&/',$referrer,$keyword);
$keyword = urldecode($keyword[1]);
echo $keyword; //Outputs "www.stack overflow,com"
Thank you,
Prasad
To supplement the other answers, note that the query string parameter that contains the search terms varies by search provider. This snippet of PHP shows the correct parameter to use:
$search_engines = array(
    'q'    => 'alltheweb|aol|ask|bing|google',
    'p'    => 'yahoo',
    'wd'   => 'baidu',
    'text' => 'yandex'
);
Source: http://betterwp.net/wordpress-tips/get-search-keywords-from-referrer/
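To make that table directly usable, here is a small helper built on the snippet above (my own sketch; the host matching is deliberately simplistic):
function keyword_from_referrer($referrer) {
    $search_engines = array(
        'q'    => 'alltheweb|aol|ask|bing|google',
        'p'    => 'yahoo',
        'wd'   => 'baidu',
        'text' => 'yandex'
    );
    $host  = parse_url($referrer, PHP_URL_HOST);
    $query = parse_url($referrer, PHP_URL_QUERY);
    if (!$host || !$query) {
        return null;
    }
    parse_str($query, $params);
    // Match the referring host against each engine list,
    // then read the engine-specific query parameter
    foreach ($search_engines as $param => $engines) {
        if (preg_match('/(' . $engines . ')\./i', $host) && isset($params[$param])) {
            return $params[$param];
        }
    }
    return null;
}

echo keyword_from_referrer('http://www.google.com/search?hl=en&q=learn+php+2&client=firefox');
// "learn php 2"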
<?php
class GET_HOST_KEYWORD
{
    public function get_host_and_keyword($_url) {
        $_data = array();
        $chunk_url = parse_url($_url);
        $_data["host"] = isset($chunk_url['host']) ? $chunk_url['host'] : '';
        // parse_str() with a single argument (extracting into the local
        // scope) is removed in PHP 8, so collect into an array instead
        parse_str(isset($chunk_url['query']) ? $chunk_url['query'] : '', $vars);
        $p = isset($vars['p']) ? $vars['p'] : '';
        $q = isset($vars['q']) ? $vars['q'] : '';
        $_data["keyword"] = $p ? $p : ($q ? $q : '');
        return $_data;
    }
}

// Sample Example
$obj = new GET_HOST_KEYWORD();
print_r($obj->get_host_and_keyword('http://www.google.co.in/search?sourceid=chrome&ie=UTF-&q=hire php php programmer'));

// sample output
// Array
// (
//     [host] => www.google.co.in
//     [keyword] => hire php php programmer
// )
?>
$query = parse_url($request, PHP_URL_QUERY);
This one should work for Google, Bing and, sometimes, Yahoo Search:
if (isset($_SERVER['HTTP_REFERER']) && $_SERVER['HTTP_REFERER']) {
    $query = getSeQuery($_SERVER['HTTP_REFERER']);
    echo $query;
} else {
    echo "I think they spelled REFERER wrong? Anyways, your browser says you don't have one.";
}

function getSeQuery($url = false) {
    $segments = parse_url($url);
    $keywords = null;
    if ($query = isset($segments['query']) ? $segments['query'] : (isset($segments['fragment']) ? $segments['fragment'] : null)) {
        parse_str($query, $segments);
        $keywords = isset($segments['q']) ? $segments['q'] : (isset($segments['p']) ? $segments['p'] : null);
    }
    return $keywords;
}
I believe Google and Yahoo have since updated their search to exclude the keywords and other parameters from the referrer URL, so they can no longer be recovered with the HTTP referrer method.
Please let me know if the above recommendations still provide the search keywords.
What I am receiving now at my website end when using the HTTP referrer:
from Google: https://www.google.co.in/
from Yahoo: https://in.yahoo.com/
Ref: https://webmasters.googleblog.com/2012/03/upcoming-changes-in-googles-http.html
