The following curl call succeeds every time, but only if $data is printed after the call, with curl_getinfo() returning:
[content_type] => text/html; charset=UTF-8
If $data is not printed, the curl call sometimes returns the same result as above and sometimes returns $data as "Loading...", which means the page has not finished loading yet. In that case curl_getinfo() returns:
[content_type] => text/html
Furthermore, when using print_r($data), I can see the output of print_r(curl_getinfo($ch)); on my page being updated several times during the curl call. (The setopt list has grown as I've been trying to find a solution.)
Even if I print $data after it has been returned to the caller and stored in another variable, curl succeeds every time.
Is this normal behaviour? I don't want to print_r($data)!
Is it possible that the URL I'm retrieving contains JavaScript which runs when I "print" it on my page? And why does it occasionally work without the print_r($data)? Ref: is-there-a-way-to-let-curl-wait-until-the-pages-dynamic-updates-are-done
Edit: Until further notice, I've wrapped the curl call in a while loop that checks whether the downloaded size is above a certain threshold (see the sketch after the function below). I've capped the loop at 10 iterations, and so far that has been enough to download the content of interest. The extra time is barely noticeable.
function curl_get_contents($url) {
    global $dbg;
    $ch = curl_init();
    $timeout = 30;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    //curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
    $data = curl_exec($ch);
    if ($dbg) {
        print_r(curl_getinfo($ch)); // This one gets refreshed if print_r($data) is used below
        if (curl_errno($ch)) {
            echo 'Curl error: ' . curl_error($ch);
        } else {
            echo "ALL GOOD <br>";
        }
    }
    curl_close($ch);
    //echo $data;     // If I do this...
    //print_r($data); // ... or this, curl succeeds 100% of the time.
    return $data;
}
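For reference, the workaround loop from the edit looks roughly like this. The 1000-byte threshold and the wrapper name are placeholders of mine, not the original code:
// Sketch: retry the call until the body exceeds a size threshold, so a
// premature "Loading..." response gets fetched again. Tune $minBytes to
// the page you are scraping.
function curl_get_contents_retry($url, $maxTries = 10, $minBytes = 1000) {
    $data = '';
    for ($i = 0; $i < $maxTries; $i++) {
        $data = curl_get_contents($url);
        if ($data !== false && strlen($data) >= $minBytes) {
            break; // looks like the full page, not just "Loading..."
        }
    }
    return $data;
}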
1) Instead of http://webiste.com/filename, I want to take the data from a .txt file that contains 20374 rows, where each row is a different website, for example:
website1.com
website2.com
website3.com
etc.
2) Parse each of them individually using curl.
3) Extract the needed data via preg_match.
4) Save the final results to my MySQL database.
Below is the code I am using at the moment. Please advise what needs to be added to achieve this goal (a sketch follows the code).
function curl($url)
{
    $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_HTTPHEADER, array("Cookie: ddosdefend=1d4607e3ac67b865e6c7263260c34e888cae7c56"));
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);    // User-Agent header for the request
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);    // return the transfer as a string instead of outputting it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);    // follow any "Location:" redirect headers (bounded by CURLOPT_MAXREDIRS)
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$test = curl('http://webiste.com/filename');
preg_match('/<iframe class=\"metaframe rptss\" src="(.*?)"/', $test, $matches);
$test2 = $matches[1];
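For illustration, a minimal sketch of steps 1-4 might look like the following. The file name urls.txt, the results table, and the PDO credentials are assumed placeholders; adapt them to your setup:
// Sketch: read one URL per line, fetch each page with the curl() helper
// above, extract the iframe src, and store it. urls.txt, the DSN, and the
// `results` table are hypothetical names for illustration.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO results (url, iframe_src) VALUES (?, ?)');

$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($urls as $url) {
    $html = curl('http://' . $url);    // rows contain bare hostnames
    if ($html === false) {
        continue;                      // skip unreachable sites
    }
    if (preg_match('/<iframe class=\"metaframe rptss\" src="(.*?)"/', $html, $m)) {
        $stmt->execute(array($url, $m[1]));
    }
}
With 20374 rows, fetching sequentially will be slow; it may also be worth looking at the curl_multi_* functions to fetch several pages in parallel, and running the job from the CLI so web-server time limits don't apply.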
I want to check the links in my database and verify whether each link (through any redirects) is still valid (e.g. returns status 200). The script below is what I currently use. The limitation is that beyond roughly 400 links, the server gives me a 500 internal error. Unfortunately, I cannot review the server's logs for the reason; my assumption is that it's a timeout issue.
How can I make this script scalable, so that it can handle more than the current ~400 links?
function urlValidator($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 30);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $data = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode != 200) {
        echo $url;
        echo " - " . $httpcode;
    }
}
// creation of $url_array
//
foreach ($url_array as $url) {
    if (!is_null($url)) {
        urlValidator($url);
    }
}
I did try adding flush() and/or ob_flush() to the code, but it didn't help (or I implemented it incorrectly).
Any suggestions are more than welcome.
The default maximum execution time of a PHP script is 30 seconds; after that it times out.
You can increase this limit, for example:
ini_set('max_execution_time', 600); // 10 minutes
But to make it really scalable, I would store the current link-check status in a database, so that you can continue where you left off and have multiple instances call your script (a sketch follows).
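Roughly, such a batch-based checker might look like this. The links table and its columns are assumed for illustration, and urlValidator() is assumed to be modified to return the status code rather than echo it:
// Sketch: claim a batch of unchecked links, validate them, and record the
// result. Because each run only processes one batch, the script stays under
// the execution time limit, and several instances can run side by side.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');

$batch = $pdo->query(
    'SELECT id, url FROM links WHERE checked_at IS NULL LIMIT 100'
)->fetchAll(PDO::FETCH_ASSOC);

$update = $pdo->prepare(
    'UPDATE links SET http_status = ?, checked_at = NOW() WHERE id = ?'
);

foreach ($batch as $row) {
    $status = urlValidator($row['url']);  // assumes it now returns the code
    $update->execute(array($status, $row['id']));
}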
function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/25.0.1");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIE, 'long cookie here');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
The original URL I'm feeding it is http://example.com/i-123.html, but if I open it in a browser, I get redirected to https://example.com/item-description-123.html (so I added CURLOPT_FOLLOWLOCATION).
However, the output of this function is binary data:
1f8b 0800 0000 0000 0003 ed7d e976 db38
f2ef e7f8 2930 9ac9 d86e 9b92 b868 f3a2
3e5e 9374 67fb c7ee 74f7 e4e6 f880 2428
31a6 4835 172f 3dd3 8f74 3fde 17b8 f7c5
6e15 008a 8ba8 2db1 3ce9 25a7 dba4 4810
......
How do I fix this? I tried adding
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);
(copied from somewhere), but it didn't work.
file_get_contents() gives me the same output.
Well, the solution was pathetic...
Using wget -S http://example.com I found out that the content is compressed (gzipped); using gunzip I successfully extracted the HTML.
I also added this to my original PHP script:
curl_setopt($ch, CURLOPT_ENCODING, "");
And it worked like a charm. Passing an empty string makes curl send an Accept-Encoding header listing all encodings it supports and transparently decompress the response.
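For completeness, if you already have the gzipped bytes in hand and cannot re-fetch with CURLOPT_ENCODING set, PHP can unpack them directly; a small sketch:
// Sketch: manually decompress a gzipped response body.
// gzdecode() requires PHP >= 5.4.
$html = gzdecode($output);
if ($html === false) {
    echo "Body was not valid gzip data\n";
}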
I'm writing a PHP script that deals with page processing via cURL, so I have a function to fetch and return pages by URL:
function get_url($Url) {
    if (!function_exists('curl_init')) {
        die('Sorry cURL is not installed!');
    }
    set_time_limit(20);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array("Cookie: age_gate_birthday=19901101"));
    curl_setopt($ch, CURLOPT_REFERER, "http://www.facebook.com");
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch); // free the handle
    return $output;
}
Echoing $output inside this function always prints a string of HTML; however, if I call this function from another function,
function get_vid($sql, $url) {
    $data = get_url($url);
    ...
the returned value is an empty string, even though $output had a value while get_url() was doing its thing.
Oddly enough, the error only occurs with specific URLs; others work fine.
Thank you for trying to help!
UPDATE: It seems cURL randomly returns FALSE for specific links, which appears to be the culprit of this issue; however, curl_error() is empty, so I'm unable to identify the cause.
I think it's because you get an HTTP redirect.
Try checking the HTTP code like this:
if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 302) {
    // Manage the HTTP redirect here
}
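If following the redirect is what you want, one way to handle it is sketched below: let curl chase the Location header itself and surface failures instead of returning an empty string. The added options are standard libcurl flags:
// Sketch: follow redirects automatically and log errors. Note that
// curl_errno() can be informative even when curl_error() is empty.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow the 302 Location header
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);         // avoid redirect loops

$output = curl_exec($ch);
if ($output === false) {
    error_log('curl failed: ' . curl_errno($ch) . ' ' . curl_error($ch));
}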
I've written a small PHP script for grabbing images with cURL and saving them locally.
It reads the URLs for the images from my DB, grabs each one and saves the file to a folder.
It's tested and works on a couple of other websites, but fails with a new one I'm trying it with.
I did some reading around and modified the script a bit, but still nothing.
Please suggest what to look out for.
$query_products = "SELECT * from product";
$products = mysql_query($query_products, $connection) or die(mysql_error());
$row_products = mysql_fetch_assoc($products);
$totalRows_products = mysql_num_rows($products);
do {
$ch = curl_init ($row_products['picture']);
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20110319 Firefox/4.0');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$rawdata = curl_exec ($ch);
$http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close ($ch);
if($http_status==200){
$fp = fopen("images/products/".$row_products['productcode'].".jpg", 'w');
fwrite($fp, $rawdata);
fclose($fp);
echo ' -- Downloaded '.$newname.' to local: '.$newname.'';
} else {
echo ' -- Failed to download '.$row_products['picture'].'';
}
usleep(500);
} while ($row_products = mysql_fetch_assoc($products));
Your target website may require/check a combination of things. In order:
Referer. Some websites only allow the referer to be a certain value (either their own site or no referer, to prevent hotlinking)
Incorrect URL
Cookies. Yes, these can be checked
Authentication of some sort
The only way to find out is to sniff what a normal browser request looks like and mimic it. Your MSIE user-agent string looks different from a genuine MSIE UA, however, and I'd consider changing it to an exact copy of a real one if I were you (sketched below).
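As a rough illustration, mimicking a browser request might look like this; every header value here is a placeholder you would replace with what you capture from a real browser session:
// Sketch: mimic a real browser request. All values below are hypothetical;
// copy the exact Referer, Cookie, and User-Agent from a sniffed request.
curl_setopt($ch, CURLOPT_REFERER, 'http://www.example-shop.com/'); // hypothetical referring page
curl_setopt($ch, CURLOPT_COOKIE, 'session=PASTE_REAL_COOKIE_HERE');
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36');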
Could you get cURL to write its output to a file (using the setopt for an output stream) and tell us what error code you are getting, along with the URL of an image? That will help me be more precise.
Also, an HTTP status of 0 isn't a success; it means the request never completed.
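A minimal sketch of that kind of logging, assuming a writable debug.log path:
// Sketch: send libcurl's verbose trace to a file so the failing request
// can be inspected afterwards. debug.log is an assumed path.
$log = fopen('debug.log', 'a');
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_STDERR, $log);  // verbose output goes here instead of STDERR

$rawdata = curl_exec($ch);
echo 'errno: ' . curl_errno($ch) . ' status: '
    . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
fclose($log);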