I want to check the links in my database to see whether each link (following any redirects) still returns a valid status (e.g. 200). The script below is what I currently use. The limitation is that beyond roughly 400 links, the server gives me a 500 internal error. Unfortunately, I cannot review the server's logs to find the reason; my assumption is that it's a timeout issue.
How can I make this script scalable, so that it can handle more than the current ~400 links?
function urlValidator($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_NOBODY, true); // HEAD-style request, skip the body
    curl_setopt($ch, CURLOPT_MAXREDIRS, 30);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode != 200) {
        echo $url . " - " . $httpcode;
    }
}
// creation of $url_array
//
foreach ($url_array as $url) {
    if (!is_null($url)) {
        urlValidator($url);
    }
}
I did try adding flush() and/or ob_flush() to the code, but it didn't help either (or I implemented it wrongly).
Any suggestions are more than welcome.
The default execution time of a PHP script is 30 seconds. After that it will time out.
You can increase this limit, for example:
ini_set('max_execution_time', 600); // 10 minutes
But, to make it really scalable, I would store the current "link-check" status in a database, so that you can continue where you left off and have multiple instances call your script.
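A minimal sketch of that idea, assuming a hypothetical links table with id, url and last_status columns (adjust the DSN, credentials and schema to your setup). Each run only picks up unchecked links, so a timed-out run can simply be restarted and will continue where it left off:
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Fetch the next batch of links that have not been checked yet.
$batch = $pdo->query(
    "SELECT id, url FROM links WHERE last_status IS NULL LIMIT 100"
)->fetchAll(PDO::FETCH_ASSOC);

$update = $pdo->prepare("UPDATE links SET last_status = ? WHERE id = ?");

foreach ($batch as $row) {
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // Persist the result so this link is not re-checked on the next run.
    $update->execute(array($httpcode, $row['id']));
}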
How can I resend a curl request if the output is empty or contains something like "Database maintenance, please check back in 10 minutes"? I want to retry the request if the response matches either of those cases.
I'm using a basic cURL setup:
$this->ch = curl_init();
curl_setopt($this->ch, CURLOPT_HEADER, true);
curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, true);
#curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($this->ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($this->ch, CURLOPT_USERAGENT, 'Opera/9.23 (Windows NT 5.1; U; en)');
// site is returning a gzipped body, force uncompressed
curl_setopt($this->ch, CURLOPT_ENCODING, 'identity');
If this is a very simple question, I'm really sorry, because I'm new.
Thanks in advance.
If you want to retry immediately in the same script execution, you can do something like this:
$response = curl_exec($this->ch);
if (empty($response)) {
    sleep(10);
    $response = curl_exec($this->ch);
}
Or use a combination of that in a loop if you want to retry, say, three times or more.
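A sketch of such a retry loop, trying up to three times when the body is empty or contains the maintenance notice (the exact message text and the attempt count are assumptions; adjust them to what the site actually returns):
$maxAttempts = 3;
$response = false;
for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
    $response = curl_exec($this->ch);
    // Retry on an empty body or on the assumed maintenance message.
    if (!empty($response) && strpos($response, 'Database maintenance') === false) {
        break; // looks like a usable response
    }
    sleep(10); // wait a bit before the next attempt
}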
The following curl call succeeds every time, if and only if $data is printed after the call, with curl_getinfo() returning
[content_type] => text/html; charset=UTF-8
If $data is not printed, the curl call sometimes returns the same result as above and sometimes returns $data being "Loading...", which means that the page has not finished loading yet. In those cases curl_getinfo() returns
[content_type] => text/html
Furthermore, when using print_r($data), I can see the output of print_r(curl_getinfo($ch)) on my website being updated several times while the curl call is performed. What... The.... F?
(the setopt list has grown larger as I've been trying to find a solution, LOL)
Ooh.. yeah, even if I print $data after it's been returned to the function caller and caught in another variable, curl succeeds every time.
Is this normal behaviour? I don't want to print_r($data)!
Is it possible that the URL I'm retrieving contains JavaScript which gets run when I "print" it on my website? Why does it work occasionally without the print_r($data)? Ref: is-there-a-way-to-let-curl-wait-until-the-pages-dynamic-updates-are-done
Edit: Until further notice, I've put the curl call in a while loop, checking if the downloaded size is above a certain threshold. I've set the loop to 10 iterations, and so far that is enough, i.e. it manages to download the content of interest. The time consumed is barely noticeable.
function curl_get_contents($url) {
    global $dbg;
    $ch = curl_init();
    $timeout = 30;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    //curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
    $data = curl_exec($ch);
    if ($dbg) {
        print_r(curl_getinfo($ch)); // This one gets refreshed if print_r($data) is used below
        if (curl_errno($ch)) {
            echo 'Curl error: ' . curl_error($ch);
        } else {
            echo "ALL GOOD <br>";
        }
    }
    curl_close($ch);
    //echo $data;     // If I do this...
    //print_r($data); // ... or this, curl is a success 100%.
    return $data;
}
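For reference, the workaround described in the edit could look roughly like this (the 10 iterations come from the edit; the 1000-byte threshold is an assumption, pick a value safely above the size of the "Loading..." placeholder):
$data = false;
for ($i = 0; $i < 10; $i++) {
    $data = curl_get_contents($url);
    // Stop once the body is large enough to be the real page rather
    // than the "Loading..." placeholder.
    if ($data !== false && strlen($data) > 1000) {
        break;
    }
}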
I need PHP to submit parameters from one domain to another. JavaScript is not an option in my situation. I'm now trying to use cURL with PHP, but have not been successful in getting around the cross-domain restriction.
On domain_A, I have a page with the following PHP cURL script:
<?php
if (_iscurl()) {
    echo "<p>CURL is enabled</p>";
    $url = "http://domain_B/process.php?id=123&amt=100&jsonp=?";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)");
    curl_setopt($ch, CURLOPT_URL, $url);
    $return = curl_exec($ch);
    curl_close($ch);
    echo "<p>Finished operations</p>";
} else {
    echo "CURL is disabled";
}
?>
I am not getting any results, so I assume the PHP cURL script is not succeeding. Any ideas how to fix this?
Thanks.
Well, it's a bit late, but I'm adding this answer for further readers who might face a similar issue. This issue sometimes arises when sending a PHP cURL request from a domain hosted over HTTP to a domain hosted over HTTPS (HTTP over SSL).
Just add the code snippet below before executing the request.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
With CURLOPT_RETURNTRANSFER set to false, curl_exec() outputs the response directly and only returns true or false, so nothing ends up in your variable. Set it to true (or 1):
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
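With that change, curl_exec() returns the response body instead of printing it, so you can capture and inspect it; a minimal sketch:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$return = curl_exec($ch);
if ($return === false) {
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo $return; // the response body from domain_B
}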
I've been trying to write a script that retrieves Google Trends results for a given keyword. Please note I'm not trying to do anything malicious; I just want to be able to automate this process and run it a few times every day.
After investigating the Google trends page I discovered that the information is available using the following URL:
http://www.google.com/trends/trendsReport?hl=en-GB&q=keyword&cmpt=q&content=1
You can request that information multiple times with no issues from a browser, but if you try in "privacy mode", after 4 or 5 requests the following is displayed:
An error has been detected. You have reached your quota limit. Please try again later.
This makes me think that cookies are required. So I have written my script as follows:
$cookiefile = $siteurl . '/wp-content/plugins/' . basename(dirname(__FILE__)) . '/cookies.txt';
$url = 'http://www.google.com/trends/trendsReport?hl=en-GB&q=keyword&cmpt=q&content=1';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$x = 'error';
while (trim($x) != '') {
    $html = curl_exec($ch);
    $x = curl_error($ch);
}
echo "test cookiefile contents = " . file_get_contents($cookiefile) . "<br />";
echo $html;
However, I just can't get anything written to my cookies file, so I keep getting the error message. Can anyone see where I'm going wrong?
I'm pretty sure your cookie file should exist before you can use it with curl.
Try:
$h = fopen($cookiefile, "x+");
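A slightly fuller sketch (assuming the plugin directory is writable by the web server; "x+" fails if the file already exists, so guard with file_exists()):
if (!file_exists($cookiefile)) {
    $h = fopen($cookiefile, "x+");
    if ($h === false) {
        die('Could not create cookie file - check directory permissions');
    }
    fclose($h);
}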
I'm writing a web app that requires a lot of proxies to work.
I also have a list of proxies, but I don't know which of them work and what type they are (SOCKS, HTTP, HTTPS).
Let's assume I have 5000 proxies in ip:port format.
What is the fastest way to check all of them?
I tried fsockopen, but it is quite slow.
Maybe pinging them first would save time?
<?php
$proxies = file("proxies.txt");
$mc = curl_multi_init();
for ($thread_no = 0; $thread_no < count($proxies); $thread_no++) {
    $c[$thread_no] = curl_init();
    curl_setopt($c[$thread_no], CURLOPT_URL, "http://google.com");
    curl_setopt($c[$thread_no], CURLOPT_HEADER, 0);
    curl_setopt($c[$thread_no], CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c[$thread_no], CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($c[$thread_no], CURLOPT_TIMEOUT, 10);
    curl_setopt($c[$thread_no], CURLOPT_PROXY, trim($proxies[$thread_no]));
    curl_setopt($c[$thread_no], CURLOPT_PROXYTYPE, 0);
    curl_multi_add_handle($mc, $c[$thread_no]);
}
do {
    while (($execrun = curl_multi_exec($mc, $running)) == CURLM_CALL_MULTI_PERFORM);
    if ($execrun != CURLM_OK) break;
    while ($done = curl_multi_info_read($mc)) {
        $info = curl_getinfo($done['handle']);
        // http://google.com answers with a 301 redirect, so a 301 means
        // the proxy relayed the request successfully.
        if ($info['http_code'] == 301) {
            echo trim($proxies[array_search($done['handle'], $c)]) . "\r\n";
        }
        curl_multi_remove_handle($mc, $done['handle']);
    }
} while ($running);
curl_multi_close($mc);
?>
You could use cURL to check the proxies, much like the multi-handle example above.
Hope it helps.
The port usually gives you a good clue about the proxy type.
80, 8080 and 3128 are typically HTTP
1080 is typically SOCKS
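As a rough first pass you could pre-sort the list by that convention before doing any real connection tests; a sketch (the port-to-type mapping is only a heuristic, not a guarantee):
function guessProxyType($proxy) {
    // $proxy is in "ip:port" format; guess the type from the port.
    $port = (int) substr(strrchr(trim($proxy), ':'), 1);
    if (in_array($port, array(80, 8080, 3128))) {
        return 'http';
    }
    if ($port == 1080) {
        return 'socks';
    }
    return 'unknown';
}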
But let's be realistic: you seem to have a list of public proxies, and it's quite likely that hardly any of them still work.
You can use curl, wget, or lynx in a script to test the proxies.
You can also try to sort your list into SOCKS and HTTP as well as you can and enter it into the Proxycollective.
That's a free project, but you need an invitation code or a 99-cent ticket to become a member.
Once you are a member you can upload your proxy lists and they will be tested; all working ones will be given back to you, sortable.
So if you don't want to program something on your own, that's maybe your best bet. Invitation codes can sometimes be found in various forums.
But keep in mind what I said: if you have a list of 5000 random proxies, I bet you'll hardly find more than 10 working ones in there anymore. Public proxies only live a short time.
This proxy checker API could be exactly what you're looking for; you can easily check a proxy list with it.
If you want to develop it yourself, it is not difficult to write a small script that does the same as that API.
Here is the code I use. You can modify it to meet your requirements:
function _check($url, $usecookie = false, $sock = "", $ref = "") {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $_POST['timeoutpp']);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3");
    if ($sock) {
        // Route the request through the given SOCKS5 proxy ("ip:port").
        curl_setopt($ch, CURLOPT_PROXY, $sock);
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    }
    if ($usecookie) {
        curl_setopt($ch, CURLOPT_COOKIEJAR, $usecookie);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $usecookie);
    }
    if ($ref) {
        curl_setopt($ch, CURLOPT_REFERER, $ref);
    }
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
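For example, a quick hypothetical check of a single SOCKS5 proxy with this function (the proxy address is a placeholder; note the function reads its timeout from $_POST['timeoutpp']):
$_POST['timeoutpp'] = 10; // the function expects its timeout here
$html = _check('http://google.com', false, '127.0.0.1:1080');
if ($html !== false && strlen($html) > 0) {
    echo "Proxy appears to work\n";
} else {
    echo "Proxy failed\n";
}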
There is a PHP tool on GitHub that can scan a proxy list with multiple threads. It will not be difficult to integrate it into your own project. The tool also analyzes the security level of the proxies it scans. Yes, I wrote it. :)
https://github.com/enseitankado/proxy-profiler