PHP cURL data scraping

I have some cURL code that fetches data from a site. It had been working fine for the last few months, but it suddenly stopped working and now returns:
HTTP/1.0 302 Moved Temporarily
My code is:
$ch = curl_init();
curl_setopt($ch, CURLOPT_REFERER, $baseUrl);
curl_setopt($ch, CURLOPT_PROXY, $proxy[0]);
curl_setopt($ch, CURLOPT_PROXYPORT, $proxy[1]);
//curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIE , $phpSId);
curl_setopt($ch, CURLOPT_COOKIEJAR , $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE , $cookie);
curl_setopt($ch, CURLOPT_USERAGENT , "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:7.0.1) Gecko/20100101 Firefox/7.0.1");
curl_setopt($ch, CURLOPT_TIMEOUT , 40);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST , 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER , 0);
curl_setopt($ch, CURLOPT_URL , $url);
curl_setopt($ch, CURLOPT_HEADER , 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION , 1);
curl_setopt($ch, CURLOPT_POST , 1);
curl_setopt($ch, CURLOPT_POSTFIELDS , $data);
$result = curl_exec($ch);
curl_close ($ch);
unset($ch);
die($result);
Please help, thanks in advance

The options you specified already make cURL follow redirects. However, in the case of a long redirect chain, you may want to increase CURLOPT_MAXREDIRS.
You can use a packet dumper such as Wireshark to check which requests cURL actually sends. It may simply be a bug in the scraped website that causes it to redirect infinitely.
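To see where the redirects are going without a packet capture, a minimal sketch (reusing the $url from the question) is to cap CURLOPT_MAXREDIRS and read the redirect metadata cURL exposes through curl_getinfo():
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10); // explicit cap so a redirect loop fails fast
$body = curl_exec($ch);
if ($body === false) {
    echo 'cURL error: ' . curl_error($ch) . PHP_EOL;
} else {
    echo 'HTTP code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
    echo 'Redirects followed: ' . curl_getinfo($ch, CURLINFO_REDIRECT_COUNT) . PHP_EOL;
    echo 'Final URL: ' . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . PHP_EOL;
}
curl_close($ch);
If the redirect count climbs to the cap with the same URL repeating, the site is bouncing the request (often because an expected cookie or referer is missing) rather than pointing to new content.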

Related

Facebook php curl not returning response when content length is set

I am trying to POST form data programmatically in order to delete a lot of posts on Facebook, but the code gives no error and the posts are not deleted.
Whenever I set a content-length in the PHP cURL header array, Facebook does not respond and the page keeps loading.
Can anyone point out the mistake?
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS,$fi);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_ENCODING , "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_HTTPHEADER,
array(
'authority:mbasic.facebook.com',
'method:POST',
'path:'.$actionl,
'scheme:https',
'accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'accept-encoding:gzip, deflate',
'accept-language:en-US,en;q=0.8',
'cache-control:max-age=0',
'content-length:136',
'content-type:application/x-www-form-urlencoded',
'origin:https://mbasic.facebook.com',
'referer:'.$reffr,
'upgrade-insecure-requests:1',
'user-agent:Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 Nokia5233/21.1.004; Profile/MIDP-2.1 Configuration/CLDC-1.1 ) AppleWebKit/525 (KHTML, like Gecko) Version/3.0 BrowserNG/7.2.5.2 3gpp-gba'
));
curl_setopt($ch, CURLOPT_COOKIEFILE, getcwd () . '/fb.txt' ); // the cookie file
curl_setopt($ch, CURLOPT_COOKIEJAR, getcwd () . '/fb.txt' ); // cookie jar
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
echo curl_exec($ch);
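A hard-coded content-length:136 matches only one exact payload; if the real body is shorter than the declared length, the server keeps waiting for bytes that never arrive, which fits the endless loading described above. A minimal sketch (assuming $fi holds the url-encoded POST body, as in the code) is to compute the length from the payload, or omit the header and let cURL derive it from CURLOPT_POSTFIELDS:
// Sketch only: keep the declared length in sync with the real payload instead of fixing it at 136.
$headers = array(
    'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-language: en-US,en;q=0.8',
    'content-type: application/x-www-form-urlencoded',
    'content-length: ' . strlen($fi), // or drop this line entirely and let cURL add it
    'referer: ' . $reffr,
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);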

cURL request not accepting cookies, results in endless redirect

I am trying to build a cURL request to download a pdf. I am getting the "Error:Maximum (100) redirects followed" error and believe it is due to a cookie issue. Here's the relevant code from my cURL request:
$cookie_file = "/home/mrpeanut/php/cookie.txt";
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSLVERSION, 4);
curl_setopt($ch, CURLOPT_SSL_CIPHER_LIST, implode(':', $arrayCiphers));
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 100);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
curl_setopt($ch, CURLOPT_COOKIEJAR, realpath($cookie_file));
curl_setopt($ch, CURLOPT_COOKIEFILE, realpath($cookie_file));
I have also tried setting CURLOPT_COOKIEJAR to '-' (a suggestion I found elsewhere), all without luck. The PDF downloads just fine in a browser. However, it also downloads if I disable all cookies using the Web Developer toolbar, so maybe cookies aren't my true issue?
You need to send CURLOPT_COOKIE with the value of your PHPSESSID; that is how the server identifies your session.
You should add this:
curl_setopt($ch, CURLOPT_COOKIE, 'PHPSESSID=' . $_COOKIE['PHPSESSID'] . '; path=/');
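A minimal sketch of the whole request (the $pdf_url variable is a placeholder for the protected PDF's address; the cookie jar lines are kept from the question):
$ch = curl_init($pdf_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIE, 'PHPSESSID=' . $_COOKIE['PHPSESSID'] . '; path=/');
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);  // note: realpath() returns false
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // if the file does not exist yet
$pdf = curl_exec($ch);
curl_close($ch);
Passing $cookie_file directly (instead of realpath($cookie_file)) avoids the case where realpath() returns false because the file has not been created yet, which would silently disable the cookie jar.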

Error 301 when using the function file_get_contents_curl on a URL

I am getting a cURL 301 error. The error appears when I make the cURL request to leboncoin.fr.
I tried to solve the problem by adding curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1) to my cURL function.
The code worked fine for one day only; the next day the same code gave the 301 error again.
Here is the cURL code:
function file_get_contents_curl($url)
{
$ch = curl_init();
$timeout = 10;
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);
curl_setopt($ch, CURLOPT_FORBID_REUSE , 1);
curl_setopt($ch, CURLOPT_TIMEOUT , 10);
curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Do you have any idea how to solve this?
Thanks.
If it worked one day and nothing has changed on your side or on the site you are connecting to, you should get the same result.
In any case, since you know that the address changed, update it in your code to save the extra redirect.
Also, you may be running into the connection time limit: try increasing CURLOPT_CONNECTTIMEOUT a bit, say to 13, in case the servers are taking too long to respond or to perform the redirect.
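To find out where the 301 actually points (so the hard-coded URL can be updated), a small sketch is to stop following and read the target from curl_getinfo(); CURLINFO_REDIRECT_URL needs a reasonably recent PHP/libcurl:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0); // do not follow, just inspect
curl_setopt($ch, CURLOPT_NOBODY, 1);         // headers are enough here
curl_exec($ch);
echo 'Status: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Redirects to: ' . curl_getinfo($ch, CURLINFO_REDIRECT_URL) . PHP_EOL;
curl_close($ch);
Some servers refuse HEAD-style requests; if that happens, drop the CURLOPT_NOBODY line.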

Stop Curl redirecting to new page

I am trying to make a cURL request to the domain http://xyz.com. Here is my code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_URL, $strURL);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $arrData);
curl_exec($ch);
While making the request it gets redirected to some page within that site and doesn't come back to my page.
How can I stop being redirected in the middle of the cURL request?
I'm sorry guys...
After the suggestion I tried setting CURLOPT_FOLLOWLOCATION to 0 and it worked. It was my mistake: I hadn't removed the header redirect on the next line, so the redirect kept being passed on and on.
Sorry, my mistake.
Once more: with CURLOPT_FOLLOWLOCATION set to 0, cURL won't follow the transfer to the redirected location.
I think that page checks your user-agent or sets cookies, so you need to mimic a web browser as closely as possible.
For example, add a user-agent:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7');
Or try setting a cookie jar:
$cookieJar = tempnam ("/tmp", "CURLCOOKIE");
curl_setopt ($ch, CURLOPT_COOKIEJAR, $cookieJar);
If you provide the URL, maybe I can help more.
Try using the CURLOPT_MAXREDIRS option.
CURLOPT_MAXREDIRS: the maximum number of HTTP redirects to follow. Use this option alongside CURLOPT_FOLLOWLOCATION.
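For example (a sketch; the cap of 5 is arbitrary):
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects...
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);      // ...but give up after 5 hops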
Try playing with:
CURLOPT_RETURNTRANSFER,
CURLOPT_FOLLOWLOCATION,
CURLOPT_COOKIEJAR,
CURLOPT_COOKIEFILE
And remember to log; it's much easier to debug with a log:
$handle = fopen('log.tmp', 'w');
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_STDERR, $handle);

How to use CURL via a proxy?

I am looking to set cURL to use a proxy server. The URL is provided by an HTML form, which has not been a problem, and without the proxy it works fine. I have found code on this and other sites, but it does not work. Any help in finding the correct solution would be much appreciated. I feel that the snippets below are close, but I am missing something. Thank you.
The code below, adapted from http://www.webmasterworld.com/forum88/10572.htm, returns an error message about an unexpected T_VARIABLE on line 12.
<?
$url = '$_POST[1]';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 0);
curl_setopt($ch, CURLOPT_PROXY, '66.96.200.39:80');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
curl_setopt ($ch, CURLOPT_HEADER, 1)
curl_exec ($ch);
$curl_info = curl_getinfo($ch);
curl_close($ch);
echo '<br />';
print_r($curl_info);
?>
The code below is from "curl through proxy returns no content":
<?
$proxy = "66.96.200.39:80";
$proxy = explode(':', $proxy);
$url = "$_POST[1]";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, $proxy[0]);
curl_setopt($ch, CURLOPT_PROXYPORT, $proxy[1]);
curl_setopt($ch, CURLOPT_HEADER, 1);
$exec = curl_exec($ch);
echo curl_error($ch);
print_r(curl_getinfo($ch));
echo $exec;
?>
It is currently live on pelican-cement.com but also does not work.
UPDATE:
Thank you for all your help, I made the above changes. Now it only returns a blank screen.
<?
$url = $_POST['1'];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 0);
curl_setopt($ch, CURLOPT_PROXY, '66.96.200.39:80');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
curl_setopt ($ch, CURLOPT_HEADER, 1);
curl_exec ($ch);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;
?>
Here is a working version with your bugs removed.
$url = 'http://dynupdate.no-ip.com/ip.php';
$proxy = '127.0.0.1:8888';
//$proxyauth = 'user:password';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
//curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyauth);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;
I have added CURLOPT_PROXYUSERPWD in case any of your proxies require a user name and password.
I set CURLOPT_RETURNTRANSFER to 1 so that the data is returned in the $curl_scraped_page variable.
I removed the second, extra curl_exec($ch); which would have prevented the variable from being returned.
I consolidated your proxy IP and port into one setting.
I also removed CURLOPT_HTTPPROXYTUNNEL and CURLOPT_CUSTOMREQUEST, as they were set to the defaults.
If you don't want the headers returned, comment out CURLOPT_HEADER.
To disable the proxy simply set it to null.
curl_setopt($ch, CURLOPT_PROXY, null);
If you have any questions, feel free to ask; I work with cURL every day.
I have explained the use of the various cURL options required for using cURL with a proxy in the inline comments below.
$url = 'http://dynupdate.no-ip.com/ip.php';
$proxy = '127.0.0.1:8888';
$proxyauth = 'user:password';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // URL for CURL call
curl_setopt($ch, CURLOPT_PROXY, $proxy); // PROXY details with port
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyauth); // Use if proxy have username and password
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5); // If expected to call with specific PROXY type
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // If the URL redirects, follow through to the final URL.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the response instead of printing it, so it ends up in $curl_scraped_page.
curl_setopt($ch, CURLOPT_HEADER, 1); // Include response headers in the output; set to 0 if you don't want them.
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;
root#APPLICATIOSERVER:/var/www/html# php connectiontest.php
61e23468-949e-4103-8e08-9db09249e8s1 OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to 10.172.123.1:80 root#APPLICATIOSERVER:/var/www/html#
After declaring the proxy settings in the PHP script file, the issue was fixed:
$proxy = '10.172.123.1:80';
curl_setopt($cSession, CURLOPT_PROXY, $proxy); // PROXY details with port
Here is a well-tested function that I have used in my projects, with detailed, self-explanatory comments.
Ports other than 80 are often blocked by the server's firewall, so code can appear to work fine on localhost but not on the server.
function get_page($url){
global $proxy;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
//curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_HEADER, 0); // return headers 0 no 1 yes
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return page 1:yes
curl_setopt($ch, CURLOPT_TIMEOUT, 200); // http request timeout: 200 seconds
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Follow redirects, need this if the url changes
curl_setopt($ch, CURLOPT_MAXREDIRS, 2); // maximum number of redirects to follow
curl_setopt($ch, CURLOPT_USERAGENT,
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt"); // where received cookies are written
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt"); // and read back from on the next request
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skip SSL certificate verification for https
curl_setopt($ch, CURLOPT_ENCODING, "gzip"); // ask for gzip; cURL decompresses the response
$data = curl_exec($ch); // execute the http request
curl_close($ch); // close the connection
return $data;
}
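A quick usage sketch (the test URL is the same IP-echo page used above; set $proxy and uncomment the CURLOPT_PROXY line inside get_page() to route the request through the proxy):
$proxy = '127.0.0.1:8888'; // placeholder proxy address
$html = get_page('http://dynupdate.no-ip.com/ip.php');
echo $html;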
