PHP cURL progress monitor?

I'm using cURL in PHP in a script that can take over an hour to run. For debugging purposes I'd like to be able to watch the progress of the request on screen (this is not a public site) and see what's going on. I've tried a few things but am not having any luck. I don't need a lot of info, just something like 'now loading id 123'.
I've tried ob_flush, but apparently that is no longer supported as per: http://php.net/ob_flush
I've also tried using CURLOPT_PROGRESSFUNCTION but there is not a lot of documentation on it and I couldn't get it to work. My code is very simple:
$sql = "select item_number from products order by id desc";
$result_sql = $db->query($sql);
while ($row = $result_sql->fetch_assoc()) {
    // I'd like it to display this as it loads
    print '<br>Getting data for item_number: ' . $row['item_number'];

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, "http://targetsite.com/" . $row['item_number']);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_SSLVERSION, 3);
    //curl_setopt($curl, CURLOPT_HEADER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 ( .NET CLR 3.5.30729)");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_VERBOSE, 1);
    $data = curl_exec($curl); // run the request
    curl_close($curl);
}
Any advice? I'm really not fussy, just the easiest method that will quickly show me that something is happening.

Assuming you're using PHP >= 5.3 (previous versions do not support CURLOPT_PROGRESSFUNCTION), you can use it like so:
function callback($download_size, $downloaded, $upload_size, $uploaded)
{
    // do your progress stuff here
}

$ch = curl_init('http://www.example.com');

// This is required for cURL to report progress: if CURLOPT_NOPROGRESS
// is not set to false, the progress function never gets called.
curl_setopt($ch, CURLOPT_NOPROGRESS, false);

// Set up the callback
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, 'callback');

// A bigger buffer means fewer progress callbacks;
// a smaller buffer means more.
curl_setopt($ch, CURLOPT_BUFFERSIZE, 128);

$data = curl_exec($ch);
Source: http://pastebin.com/bwb5VKpe
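For the original goal of watching progress on screen, here is a minimal sketch of a callback that prints and flushes as data arrives. Note that from PHP 5.5 onward the callback receives the cURL handle as an extra first argument; the signature below assumes PHP 5.3/5.4, and it also assumes output buffering is disabled (if it isn't, call ob_flush() before flush()):

function progress_callback($download_size, $downloaded, $upload_size, $uploaded)
{
    if ($download_size > 0) {
        printf('<br>Downloaded %d of %d bytes (%.1f%%)',
            $downloaded, $download_size, $downloaded / $download_size * 100);
        flush(); // push output to the browser immediately
    }
}

$ch = curl_init('http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOPROGRESS, false);
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, 'progress_callback');
curl_setopt($ch, CURLOPT_BUFFERSIZE, 128);
$data = curl_exec($ch);
curl_close($ch);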

Related

How to resend the curl request if the output is empty

How can I resend a curl request if the output is empty, or if it contains something like "Database maintenance, please check back in 10 minutes"? I want to retry the request in either case.
I'm using a basic curl setup:
$this->ch = curl_init();
curl_setopt($this->ch, CURLOPT_HEADER, true);
curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, true);
#curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($this->ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($this->ch, CURLOPT_USERAGENT, 'Opera/9.23 (Windows NT 5.1; U; en)');
// site is returning a gzipped body, force uncompressed
curl_setopt($this->ch, CURLOPT_ENCODING, 'identity');
If this is a very simple question, I'm really sorry; I'm new to this. Thanks in advance.
If you want to retry immediately in the same script execution, you can do something like this:
$response = curl_exec($this->ch);
if (empty($response)) {
    sleep(10);
    $response = curl_exec($this->ch);
}
Or wrap that in a loop if you want to check three times or more.
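For example, a sketch where the retry count, the sleep interval, and the matched maintenance string are all illustrative:

// Retry up to 3 times; treat an empty body or the maintenance notice as a miss.
$response = false;
for ($attempt = 0; $attempt < 3; $attempt++) {
    $response = curl_exec($this->ch);
    if (!empty($response) && strpos($response, 'Database maintenance') === false) {
        break; // usable response
    }
    sleep(10); // wait before retrying
}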

cURL waits for URL to load IF results are printed

The following curl call succeeds every time, if and only if $data is printed after the call, with curl_getinfo() returning
[content_type] => text/html; charset=UTF-8
If $data is not printed, the curl call sometimes returns the same result as above and sometimes returns $data as "Loading...", which means the page has not finished loading yet, with curl_getinfo() returning
[content_type] => text/html
Furthermore, when using print_r($data), I can see the print_r(curl_getinfo($ch)); output on my website being updated several times during the curl call. What... the... F?
(the setopt list has grown larger as I've been trying to find a solution, LOL)
Ooh.. yeah, even if I print $data after it's been returned to the function caller and caught in another variable, curl succeeds every time.
Is this normal behaviour? I don't want to print_r($data)!
Is it possible that the url I'm retrieving contains javascript which gets run when I "print" it on my website? Why does it work occasionally without the print_r($data)? Ref: is-there-a-way-to-let-curl-wait-until-the-pages-dynamic-updates-are-done
edit: Until further notice, I've put the curl call in a while loop (sketched after the function below), checking whether the downloaded size is above a certain threshold. I've capped the loop at 10 iterations, and so far that has been enough, i.e. it manages to download the content of interest. The time consumed is barely noticeable.
function curl_get_contents($url) {
    global $dbg;
    $ch = curl_init();
    $timeout = 30;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    //curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
    $data = curl_exec($ch);
    if ($dbg) {
        print_r(curl_getinfo($ch)); // This one gets refreshed if print_r($data) is used below
        if (curl_errno($ch)) {
            echo 'Curl error: ' . curl_error($ch);
        } else {
            echo "ALL GOOD <br>";
        }
    }
    curl_close($ch);
    //echo $data;     // If I do this...
    //print_r($data); // ... or this, curl succeeds 100% of the time.
    return $data;
}
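The workaround described in the edit above, as a sketch; the size threshold and the iteration cap are illustrative:

// Retry until the downloaded size passes a threshold, capped at 10 tries.
$threshold = 5000; // bytes; tune to the expected size of the fully loaded page
$data = '';
for ($i = 0; $i < 10; $i++) {
    $data = curl_get_contents($url);
    if (strlen($data) > $threshold) {
        break; // real content (not just "Loading...") has arrived
    }
}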

Bulk link checker in PHP

I want to check the links in my database, to see whether the status of each link (through possible redirections) is still valid (e.g. status 200). The script below is what I currently use. The limitation is that beyond +/- 400 links, the server gives me a 500 internal error. Unfortunately, I cannot review the server's logs to find the reason; my assumption is that it's a timeout issue.
How can I make this script scalable, so that it will allow me to run more than the current +/- 400 links?
function urlValidator($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 30);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $data = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode != '200') {
        echo $url;
        echo " - " . $httpcode;
    }
}

// creation of $url_array
//
foreach ($url_array as $url) {
    if (!is_null($url)) {
        urlValidator($url);
    }
}
I did try adding flush() and/or ob_flush() to the code, but it didn't help (or I implemented it wrongly).
Any suggestions are more than welcome.
The default maximum execution time of a PHP script is 30 seconds; after that it times out.
You can increase this limit, for example:
ini_set('max_execution_time', 600); //10 minutes
But, to make it really scalable, I would store the current "link-check" status in a database, so that you can continue where you left off and have multiple instances call your script.
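A sketch of that approach, assuming a hypothetical links table with columns id, url, and last_status, and a mysqli handle in $db (the batch size is illustrative):

ini_set('max_execution_time', 600);

$batchSize = 100; // links per run; keep small enough to finish within the limit

// Resume from the first link that has not been checked yet.
$result = $db->query(
    "SELECT id, url FROM links WHERE last_status IS NULL ORDER BY id LIMIT $batchSize"
);

while ($row = $result->fetch_assoc()) {
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_NOBODY, true); // HEAD-style check, no body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    $stmt = $db->prepare("UPDATE links SET last_status = ? WHERE id = ?");
    $stmt->bind_param('ii', $code, $row['id']);
    $stmt->execute();
}

Each run handles one batch, so no single invocation hits the time limit; re-run the script (via cron, another curl call, or a meta refresh) until no unchecked rows remain.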

PHP cURL fails to fetch images from website

I've written a small PHP script for grabbing images with curl and saving them locally.
It reads the URLs for the images from my db, grabs each one and saves the file to a folder.
It was tested and works on a couple of other websites, but fails with a new one I'm trying it on.
I did some reading around, modified the script a bit but still nothing.
Please suggest what to look out for.
$query_products = "SELECT * from product";
$products = mysql_query($query_products, $connection) or die(mysql_error());
$row_products = mysql_fetch_assoc($products);
$totalRows_products = mysql_num_rows($products);
do {
    $ch = curl_init($row_products['picture']);
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    $rawdata = curl_exec($ch);
    $http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($http_status == 200) {
        $newname = "images/products/" . $row_products['productcode'] . ".jpg";
        $fp = fopen($newname, 'w');
        fwrite($fp, $rawdata);
        fclose($fp);
        echo ' -- Downloaded ' . $row_products['picture'] . ' to local: ' . $newname;
    } else {
        echo ' -- Failed to download ' . $row_products['picture'];
    }
    usleep(500);
} while ($row_products = mysql_fetch_assoc($products));
Your target website may require/check a combination of things. In order:
Referer. Some websites only allow the referer to be a certain value (either their own site or no referer at all, to prevent hotlinking)
Incorrect URL
Cookies. Yes, this can be checked
Authentication of some sort
The only way to find out is to sniff what a normal browser request looks like and mimic it. Your MSIE user-agent string looks different from a genuine MSIE UA, however, and I'd consider changing it to an exact copy of a real one if I were you.
Could you get curl to output to a file (using the setopt for an output stream) and tell us what error code you are getting, along with the URL of an image? That will help me be more precise.
Also, 0 isn't a success code - it's a failure.
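To test the referer and cookie theories, a sketch of those options, plus sending cURL's verbose log to a file for inspection (the file paths and the referer URL are illustrative):

$ch = curl_init($row_products['picture']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/'); // pretend we came from their site
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');      // store cookies after the request
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');     // send them back on the next one
curl_setopt($ch, CURLOPT_VERBOSE, true);
$log = fopen('/tmp/curl_debug.log', 'a');
curl_setopt($ch, CURLOPT_STDERR, $log);                       // verbose output goes to the file
$rawdata = curl_exec($ch);
$http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
fclose($log);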

How to use PHP to get a webpage into a variable

I want to download a page from the web. Downloading it is allowed when you use a simple browser like Firefox, but when I use file_get_contents, the server refuses and replies that it understands the request but doesn't allow such downloads.
So what to do? I think I saw, in some Perl scripts, a way to make your script look like a real browser by setting a user agent and cookies, which makes servers think that your script is a real web browser.
Does anyone have an idea about this, how it can be done?
Use cURL.
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// set the UA
curl_setopt($ch, CURLOPT_USERAGENT, 'My App (http://www.example.com/)');
// Alternatively, lie, and pretend to be a browser
// curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
(From http://uk.php.net/manual/en/curl.examples-basic.php)
Yeah, cURL is pretty good at getting page content. I use it with classes like DOMDocument and DOMXPath to grind the content into a usable form.
function __construct($useragent, $url)
{
    $this->useragent = 'Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.' . $useragent;
    $this->url = $url;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $this->useragent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed HTML
    $this->xpath = new DOMXPath($dom);
}
...
public function displayResults($site)
{
    $data = $this->path[0]->length;
    for ($i = 0; $i < $data; $i++) {
        $delData = $this->path[0]->item($i);
        // setting the href and title properties
        $urlSite = $delData->getElementsByTagName('a')->item(0)->getAttribute('href');
        $titleSite = $delData->getElementsByTagName('a')->item(0)->nodeValue;
        // setting the saves and additional
        $saves = $delData->getElementsByTagName('span')->item(0)->nodeValue;
        if ($saves == NULL) {
            $saves = 0;
        }
        // build the array
        $this->newSiteBookmark[$i]['source'] = 'delicious.com';
        $this->newSiteBookmark[$i]['url'] = $urlSite;
        $this->newSiteBookmark[$i]['title'] = $titleSite;
        $this->newSiteBookmark[$i]['saves'] = $saves;
    }
}
The latter is part of a class that scrapes data from delicious.com. Not entirely legal, though.
This answer takes your comment to Rich's answer in mind.
The site is probably checking whether or not you are a real user using the HTTP referer or the User-Agent string. Try setting these for your curl handle:
// pretend you came from their site already
curl_setopt($ch, CURLOPT_REFERER, 'http://domainofthesite.com');
// pretend you are Firefox 3.0.6 running on Windows Vista
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6');
Another way to do it (though others have pointed out a better way) is to use PHP's fopen() function, like so:
$handle = fopen("http://www.example.com/", "r");//open specified URL for reading
It's especially useful if cURL isn't available.
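A stream context also lets fopen()/file_get_contents() send a custom User-Agent, which addresses the original refusal; a sketch, where the UA string is illustrative:

$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0\r\n",
    ),
));
$html = file_get_contents('http://www.example.com/', false, $context);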
