I want to check the links in my database to see whether each link (following any redirects) still returns a valid status (e.g. 200). The script below is what I currently use. The limitation is that beyond roughly 400 links, the server gives me a 500 internal error. Unfortunately, I cannot review the server's logs to find the reason; my assumption is that it's a timeout issue.
How can I make this script scalable, so that it can handle more than the current ~400 links?
function urlValidator($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_NOBODY, true); // HEAD-style request, skip the body
    curl_setopt($ch, CURLOPT_MAXREDIRS, 30);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode != 200) {
        echo $url . " - " . $httpcode;
    }
}
// creation of $url_array
//
foreach ($url_array as $url) {
    if (!is_null($url)) {
        urlValidator($url);
    }
}
I did try adding flush() and/or ob_flush() to the code, but it didn't help either (or I implemented it wrongly).
Any suggestions are more than welcome.
The default execution time of a PHP script is 30 seconds. After that it will time out.
You can increase this limit, for example:
ini_set('max_execution_time', 600); // 10 minutes
But, to make it really scalable, I would store the current "link-check" status in a database, so that you can continue where you left off and have multiple instances call your script.
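A minimal sketch of that idea, assuming a hypothetical links table with id, url and last_status columns (adjust the DSN, credentials and schema to your setup). Each run only picks up unchecked links, so a timed-out run can simply be restarted and will continue where it left off:
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Fetch the next batch of links that have not been checked yet.
$batch = $pdo->query(
    "SELECT id, url FROM links WHERE last_status IS NULL LIMIT 100"
)->fetchAll(PDO::FETCH_ASSOC);

$update = $pdo->prepare("UPDATE links SET last_status = ? WHERE id = ?");

foreach ($batch as $row) {
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // Persist the result so this link is not re-checked on the next run.
    $update->execute(array($httpcode, $row['id']));
}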
How can I resend a curl request if the output is empty or contains something like "Database maintenance, please check back in 10 minutes"? I want to retry the request if the response matches either of those cases.
I'm using a basic cURL setup:
$this->ch = curl_init();
curl_setopt($this->ch, CURLOPT_HEADER, true);
curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, true);
#curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($this->ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($this->ch, CURLOPT_USERAGENT, 'Opera/9.23 (Windows NT 5.1; U; en)');
// site is returning a gzipped body, force uncompressed
curl_setopt($this->ch, CURLOPT_ENCODING, 'identity');
If this is a very simple question, I'm really sorry, because I'm new.
Thanks in advance.
If you want to retry immediately in the same script execution, you can do something like this:
$response = curl_exec($this->ch);
if (empty($response)) {
    sleep(10);
    $response = curl_exec($this->ch);
}
Or use a combination of that in a loop if you want to retry, say, three times or more.
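A sketch of such a retry loop, trying up to three times when the body is empty or contains the maintenance notice (the exact message text and the attempt count are assumptions; adjust them to what the site actually returns):
$maxAttempts = 3;
$response = false;
for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
    $response = curl_exec($this->ch);
    // Retry on an empty body or on the assumed maintenance message.
    if (!empty($response) && strpos($response, 'Database maintenance') === false) {
        break; // looks like a usable response
    }
    sleep(10); // wait a bit before the next attempt
}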
The following curl call succeeds every time, if and only if $data is printed after the call, with curl_getinfo() returning
[content_type] => text/html; charset=UTF-8
If $data is not printed, the curl call sometimes returns the same result as above and sometimes returns $data being "Loading...", which means that the page has not finished loading yet. In those cases curl_getinfo() returns
[content_type] => text/html
Furthermore, when using print_r($data), I can see the output of print_r(curl_getinfo($ch)) on my website being updated several times while the curl call is performed. What... The.... F?
(the setopt list has grown larger as I've been trying to find a solution, LOL)
Ooh.. yeah, even if I print $data after it's been returned to the function caller and caught in another variable, curl succeeds every time.
Is this normal behaviour? I don't want to print_r($data)!
Is it possible that the URL I'm retrieving contains JavaScript which gets run when I "print" it on my website? Why does it work occasionally without the print_r($data)? Ref: is-there-a-way-to-let-curl-wait-until-the-pages-dynamic-updates-are-done
Edit: Until further notice, I've put the curl call in a while loop, checking if the downloaded size is above a certain threshold. I've set the loop to 10 iterations, and so far that is enough, i.e. it manages to download the content of interest. The time consumed is barely noticeable.
function curl_get_contents($url) {
    global $dbg;
    $ch = curl_init();
    $timeout = 30;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    //curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
    $data = curl_exec($ch);
    if ($dbg) {
        print_r(curl_getinfo($ch)); // This one gets refreshed if print_r($data) is used below
        if (curl_errno($ch)) {
            echo 'Curl error: ' . curl_error($ch);
        } else {
            echo "ALL GOOD <br>";
        }
    }
    curl_close($ch);
    //echo $data;     // If I do this...
    //print_r($data); // ... or this, curl is a success 100%.
    return $data;
}
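For reference, the workaround described in the edit could look roughly like this (the 10 iterations come from the edit; the 1000-byte threshold is an assumption, pick a value safely above the size of the "Loading..." placeholder):
$data = false;
for ($i = 0; $i < 10; $i++) {
    $data = curl_get_contents($url);
    // Stop once the body is large enough to be the real page rather
    // than the "Loading..." placeholder.
    if ($data !== false && strlen($data) > 1000) {
        break;
    }
}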
I need PHP to submit parameters from one domain to another. JavaScript is not an option in my situation. I'm now trying to use cURL with PHP, but have not been successful in getting around the cross-domain restriction.
On domain_A, I have a page with the following PHP cURL script:
<?php
if (_iscurl()) {
    echo "<p>CURL is enabled</p>";
    $url = "http://domain_B/process.php?id=123&amt=100&jsonp=?";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)");
    curl_setopt($ch, CURLOPT_URL, $url);
    $return = curl_exec($ch);
    curl_close($ch);
    echo "<p>Finished operations</p>";
} else {
    echo "CURL is disabled";
}
?>
I am not getting any results, so I assume the PHP cURL script is not succeeding. Any ideas how to fix this?
Thanks.
Well, it's a bit late, but I'm adding this answer for further readers who might face a similar issue. This issue sometimes arises when sending a PHP cURL request from a domain hosted over HTTP to a domain hosted over HTTPS (HTTP over SSL).
Just add the code snippet below before executing the request.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
With CURLOPT_RETURNTRANSFER set to false, curl_exec() outputs the response directly and only returns true or false, so nothing ends up in your variable. Set it to true (or 1):
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
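With that change, curl_exec() returns the response body instead of printing it, so you can capture and inspect it; a minimal sketch:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$return = curl_exec($ch);
if ($return === false) {
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo $return; // the response body from domain_B
}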
I've been trying to write a script that retrieves Google Trends results for a given keyword. Please note I'm not trying to do anything malicious; I just want to be able to automate this process and run it a few times every day.
After investigating the Google trends page I discovered that the information is available using the following URL:
http://www.google.com/trends/trendsReport?hl=en-GB&q=keyword&cmpt=q&content=1
You can request that information multiple times with no issues from a browser, but if you try in "privacy mode", after 4 or 5 requests the following is displayed:
An error has been detected. You have reached your quota limit. Please try again later.
This makes me think that cookies are required. So I have written my script as follows:
$cookiefile = $siteurl . '/wp-content/plugins/' . basename(dirname(__FILE__)) . '/cookies.txt';
$url = 'http://www.google.com/trends/trendsReport?hl=en-GB&q=keyword&cmpt=q&content=1';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$x = 'error';
while (trim($x) != '') {
    $html = curl_exec($ch);
    $x = curl_error($ch);
}
echo "test cookiefile contents = " . file_get_contents($cookiefile) . "<br />";
echo $html;
However, I just can't get anything written to my cookies file, so I keep getting the error message. Can anyone see where I'm going wrong?
I'm pretty sure your cookie file should exist before you can use it with curl.
Try:
$h = fopen($cookiefile, "x+");
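A slightly fuller sketch (assuming the plugin directory is writable by the web server; "x+" fails if the file already exists, so guard with file_exists()):
if (!file_exists($cookiefile)) {
    $h = fopen($cookiefile, "x+");
    if ($h === false) {
        die('Could not create cookie file - check directory permissions');
    }
    fclose($h);
}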
I'm writing a web app that requires a lot of proxies to work.
I also have a list of proxies, but I don't know which of them work and what type they are (SOCKS, HTTP, HTTPS).
Let's assume I have 5000 proxies in ip:port format.
What is the fastest way to check all of them?
I tried fsockopen, but it is quite slow.
Maybe pinging them first would save time?
<?php
$proxies = file("proxies.txt");
$mc = curl_multi_init();
for ($thread_no = 0; $thread_no < count($proxies); $thread_no++) {
    $c[$thread_no] = curl_init();
    curl_setopt($c[$thread_no], CURLOPT_URL, "http://google.com");
    curl_setopt($c[$thread_no], CURLOPT_HEADER, 0);
    curl_setopt($c[$thread_no], CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c[$thread_no], CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($c[$thread_no], CURLOPT_TIMEOUT, 10);
    curl_setopt($c[$thread_no], CURLOPT_PROXY, trim($proxies[$thread_no]));
    curl_setopt($c[$thread_no], CURLOPT_PROXYTYPE, 0);
    curl_multi_add_handle($mc, $c[$thread_no]);
}
do {
    while (($execrun = curl_multi_exec($mc, $running)) == CURLM_CALL_MULTI_PERFORM);
    if ($execrun != CURLM_OK) break;
    while ($done = curl_multi_info_read($mc)) {
        $info = curl_getinfo($done['handle']);
        // http://google.com answers with a 301 redirect, so a 301 means
        // the proxy relayed the request successfully.
        if ($info['http_code'] == 301) {
            echo trim($proxies[array_search($done['handle'], $c)]) . "\r\n";
        }
        curl_multi_remove_handle($mc, $done['handle']);
    }
} while ($running);
curl_multi_close($mc);
?>
You could use cURL to check the proxies, much like the multi-handle example above.
Hope it helps.
The port usually gives you a good clue about the proxy type.
80, 8080 and 3128 are typically HTTP
1080 is typically SOCKS
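As a rough first pass you could pre-sort the list by that convention before doing any real connection tests; a sketch (the port-to-type mapping is only a heuristic, not a guarantee):
function guessProxyType($proxy) {
    // $proxy is in "ip:port" format; guess the type from the port.
    $port = (int) substr(strrchr(trim($proxy), ':'), 1);
    if (in_array($port, array(80, 8080, 3128))) {
        return 'http';
    }
    if ($port == 1080) {
        return 'socks';
    }
    return 'unknown';
}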
But let's be realistic: you seem to have a list of public proxies, and it's quite likely that hardly any of them still work.
You can use curl, wget, or lynx in a script to test the proxies.
You can also try to sort your list into SOCKS and HTTP as well as you can and enter it into the Proxycollective.
That's a free project, but you need an invitation code or a 99-cent ticket to become a member.
Once you are a member you can upload your proxy lists and they will be tested; all working ones will be given back to you, sortable.
So if you don't want to program something on your own, that's maybe your best bet. Invitation codes can sometimes be found in various forums.
But keep in mind what I said: if you have a list of 5000 random proxies, I bet you'll hardly find more than 10 working ones in there anymore. Public proxies only live a short time.
This proxy checker API could be exactly what you're looking for; you can easily check a proxy list with it.
If you want to develop it yourself, it is not difficult to write a small script that does the same as that API.
Here is the code I use. You can modify it to meet your requirements:
function _check($url, $usecookie = false, $sock = "", $ref = "") {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $_POST['timeoutpp']);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3");
    if ($sock) {
        // Route the request through the given SOCKS5 proxy ("ip:port").
        curl_setopt($ch, CURLOPT_PROXY, $sock);
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    }
    if ($usecookie) {
        curl_setopt($ch, CURLOPT_COOKIEJAR, $usecookie);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $usecookie);
    }
    if ($ref) {
        curl_setopt($ch, CURLOPT_REFERER, $ref);
    }
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
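For example, a quick hypothetical check of a single SOCKS5 proxy with this function (the proxy address is a placeholder; note the function reads its timeout from $_POST['timeoutpp']):
$_POST['timeoutpp'] = 10; // the function expects its timeout here
$html = _check('http://google.com', false, '127.0.0.1:1080');
if ($html !== false && strlen($html) > 0) {
    echo "Proxy appears to work\n";
} else {
    echo "Proxy failed\n";
}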
There is a PHP tool on GitHub that can scan a proxy list with multiple threads. It will not be difficult to integrate it into your own project. The tool also analyzes the security level of the proxies it scans. Yes, I wrote it. :)
https://github.com/enseitankado/proxy-profiler