I'm just starting to learn cURL, and for now I'm simply trying all kinds of things to get used to the options cURL gives me. Here is a simple script I use to connect to a site and retrieve data from it:
<?php
$ch=curl_init();
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($ch, CURLOPT_URL, "http://img2.somesite.net/bitbucket/");
curl_setopt($ch, CURLOPT_HEADER, $header);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a=curl_exec($ch);
curl_close($ch);
echo $a;
?>
I want to get the images stored there, but obviously I'm missing something, and I'm not sure how it should be done. When you request http://img2.somesite.net/bitbucket/pic.jpg, the image loads. I want to get the names of all the files that are there, or maybe I should trigger a command that downloads the images so I can check them on my PC. I don't know whether this is possible with cURL, or how it could be done.
Also, when I use just
http://somesite.net/
I get the resource back, so basically this works.
Thanks
Leron
Try setting a User-Agent, as some sites block requests without one.
Also, install Tamper Data for Firefox and look at the headers sent to the server when you start the download from your browser. Imitate all of those headers in cURL and that should do it.
That's how I do it.
CURLOPT_HEADER is a boolean that determines whether the response headers are included in the output.
To pass headers to cURL, you should use CURLOPT_HTTPHEADER.
This should work just fine:
<?php
$ch=curl_init();
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($ch, CURLOPT_URL, "http://img2.somesite.net/bitbucket/");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a=curl_exec($ch);
curl_close($ch);
echo $a;
?>
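On the "names of all files" part of the question: HTTP has no directory-listing command, so cURL can only return whatever the server sends for that URL. If the server happens to return an index page with links, the names can be pulled out of the returned HTML. A minimal Python sketch of that idea follows; the sample index page, the `LinkCollector` class, and the `image_names` helper are all hypothetical, and a real server may return no listing at all.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def image_names(index_html, extensions=(".jpg", ".jpeg", ".png", ".gif")):
    """Return the links that look like image files, in page order."""
    parser = LinkCollector()
    parser.feed(index_html)
    return [link for link in parser.links if link.lower().endswith(extensions)]


# Hypothetical index page, standing in for the body cURL returned.
sample_index = """
<html><body>
<a href="../">Parent Directory</a>
<a href="pic.jpg">pic.jpg</a>
<a href="notes.txt">notes.txt</a>
<a href="logo.PNG">logo.PNG</a>
</body></html>
"""

names = image_names(sample_index)
print(names)
```

Each name found this way can then be appended to the directory URL and fetched with a separate request.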
I want to scrape a website that uses AJAX pagination with Scrapy. I have already scraped this site with PHP using cURL: I monitored the network traffic with Firebug, which has a "Copy as cURL" option for the POST request.
My question is: how can I do the same with Scrapy?
My function in PHP:
function forCurl($url,$refer, $jsessionid){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0');
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
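$header[] = "Cache-Control: no-cache";
// The copied cURL command passed this POST body via --data; it does not belong in a header.
curl_setopt($ch, CURLOPT_POSTFIELDS, 't%3Azoneid=forceAjax');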
$header[] = "Connection: keep-alive";
$header[] = "Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3";
$header[] = "Pragma: no-cache";
$header[] = "X-Requested-With: XMLHttpRequest";
$header[] = "Keep-Alive: 700";
$cookie = "JSESSIONID=" . $jsessionid. '; langueFront=fr; tc_cj_v2=%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLMQOMROKJRZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLMQOSQMMRNZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNNJJKNRRMZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNNKNJOJSKZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNNMLSNSKLZZZ%5D777_rn_lh%5BfyfcheZZZ222H%7B0%7D%23%7B%29H%21-ZZZKNLLNNMMLMJNJZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNOOJSKRKMZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNOOLSOMPNZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNOPJMROQLZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNOPMQSKNOZZZ%5D; _ga=GA1.2.487921595.1421941922; aurol=GA1.2.865695137.1421941922; __utma=239562643.487921595.1421941922.1422452658.1422454606.14; __utmz=239562643.1422443324.10.2.utmcsr=Sphere_myWebSite|utmccn=myWebSitefr_logo|utmcmd=Interne; kameleoonVisitIdentifier=rj1hnzh5ux1n2gxr/4; myWebSiteCook=\"869|\"; revelationDriveWin=2; myWebSite.hamon=1; __utmv=239562643.|1=visite_myWebSitedrive=239562643.487921595.1421941922.1422452658.1422454606.14=1; tosend=%7B%22p%22%3A%7B%22tracker%22%3A%22myWebSitedrive%22%2C%20%22url%22%3A%22rayon%22%2C%20%22mtime%22%3A1422455760000%2C%20%22ref%22%3A%22http%3A%2F%2Fwww.myWebSitedrive.fr%2Fdrive%2Frecherche%2Fbio%22%2C%20%22dest%22%3A%22http%3A%2F%2Fwww.myWebSitedrive.fr%2Fdrive%2FNice-Cote-dAzur-869%2FSurgeles-R41355%2FViandes-Volailles-41478%2F%22%7D%2C%22d%22%3A%7B%22dv%22%3A%22NA%22%7D%2C%20%22t%22%3A%7B%22iplobserverstart%22%3A%221422455762613%22%2C%22jsinit%22%3A%221422455763871%22%2C%22domload%22%3A%221422455764728%22%2C%22clicklink%22%3A%221422455817128%22%2C%22unload%22%3A%221422455817521%22%7D%7D; kameleoonExperiment-14570=86018/1422452656881/false; __utmc=239562643; rdmvalidation=1; layerDrivePromos=2; __utmb=239562643.19.10.1422454606; _gat=1; _gat_myWebSiteRollup=1; __utmt=1; __utmt_secondTracker=1; __utmli=toPage_14b30fac8d4_0';
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_REFERER, $refer);
curl_setopt($ch, CURLOPT_COOKIE, $cookie);
$content = curl_exec($ch);
curl_close($ch);
return $content;
}
I want to know how I can post the same parameters with Scrapy. Is this a good approach for scraping a website with AJAX pagination?
I tried this:
yield Request(sousUrl, headers={'Referer':'%s' % url}, callback=self.parse_page)
In Python you can use PycURL.
PycURL is a Python interface to libcurl.
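If you would rather stay inside Scrapy, the pieces of the copied cURL command (headers, cookie string, POST body) can be turned into plain dictionaries first. The following is a minimal sketch of that conversion in standard-library Python; the helper names are hypothetical, and `abc123` is a placeholder session id rather than a real value.

```python
from urllib.parse import parse_qsl


def headers_to_dict(header_lines):
    """Turn cURL-style 'Name: value' strings into a dict."""
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def cookies_to_dict(cookie_string):
    """Split a 'k1=v1; k2=v2' Cookie header value into a dict."""
    cookies = {}
    for pair in cookie_string.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


def form_to_dict(urlencoded_body):
    """Decode an application/x-www-form-urlencoded body into a dict."""
    return dict(parse_qsl(urlencoded_body, keep_blank_values=True))


headers = headers_to_dict([
    "X-Requested-With: XMLHttpRequest",
    "Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3",
])
cookies = cookies_to_dict("JSESSIONID=abc123; langueFront=fr")  # placeholder id
formdata = form_to_dict("t%3Azoneid=forceAjax")
print(headers, cookies, formdata)
```

Inside a spider callback, these dictionaries could then be passed along the lines of `yield FormRequest(sousUrl, formdata=formdata, headers=headers, cookies=cookies, callback=self.parse_page)`, which sends a POST instead of the plain GET that `Request` issues by default.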
This PHP script successfully gets the feed from my Facebook page:
$access_token = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
$url = 'https://graph.facebook.com/v2.2/711493158867950/feed?access_token='.$access_token;
$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_REFERER, '');
curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 10);
$json = curl_exec($curl);
But after two hours the access_token expires and I get the following error: http://prntscr.com/60jyqt.
How can I rewrite this PHP code to renew the access_token?
You could start by having a look at the documentation:
https://developers.facebook.com/docs/facebook-login/access-tokens#termtokens
https://developers.facebook.com/docs/facebook-login/access-tokens#extending
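The "extending" page describes exchanging a short-lived token for a longer-lived one by calling the `oauth/access_token` endpoint with `grant_type=fb_exchange_token`. A minimal Python sketch of building that request URL follows. `APP_ID`, `APP_SECRET`, and `SHORT_LIVED_TOKEN` are placeholders, the `v2.2` prefix is assumed from the feed URL in the question, and `build_exchange_url` is a hypothetical helper. The call should be made server-side only, since it includes the app secret.

```python
from urllib.parse import urlencode

# Assumed to match the API version used in the question's feed URL.
GRAPH_BASE = "https://graph.facebook.com/v2.2"


def build_exchange_url(app_id, app_secret, short_lived_token):
    """Build the URL that exchanges a short-lived token for a longer-lived one."""
    query = urlencode({
        "grant_type": "fb_exchange_token",
        "client_id": app_id,
        "client_secret": app_secret,
        "fb_exchange_token": short_lived_token,
    })
    return f"{GRAPH_BASE}/oauth/access_token?{query}"


# Placeholder credentials; substitute real values on the server.
url = build_exchange_url("APP_ID", "APP_SECRET", "SHORT_LIVED_TOKEN")
print(url)
```

Fetching this URL with the same cURL setup as in the question should return the new token, which can be stored and used in place of the hard-coded `$access_token`.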
This worked for me (Facebook SDK v3.1):
1) Create a "system user"
2) Grant it access to the properties I needed (in my case, an app)
3) Generate a new token for that app and system user
The instructions I used can be found here
Some time ago I fetched the feed from my Facebook page with a cURL request like this:
$url='https://www.facebook.com/feeds/page.php?format=json&id=XXXXXXXXXXXXX';
$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_REFERER, '');
curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 10);
$json = curl_exec($curl);
But two days ago this script started returning an error with a message like this: http://prntscr.com/602m20
What could be the problem?
The Pages JSON feed (e.g.
https://www.facebook.com/feeds/page.php?id=%2019292868552&format=json)
is now deprecated and will stop returning data from Jan 28, 2015
onwards. Developers should instead call the feed edge on the Graph
API's Page object: /v2.2/{page_id}/feed.
Source
I have a cURL function, shown below. To trigger it I use <img src="http://site.com/pxl.php?i=1.jpg" height="1" width="1" />, but when I do that, the cookies don't get added to my browser. Is there a way, when cURL runs on the URL, to collect the cookies from that URL and set them in the browser of the user loading <img src="http://site.com/pxl.php?i=1.jpg" height="1" width="1" />?
function get_content($url,$ref)
{
$browser = $_SERVER['HTTP_USER_AGENT'];
$ch = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($ch, CURLOPT_URL, $url);
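curl_setopt($ch, CURLOPT_USERAGENT, $browser);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // without this, the $header array above is never sent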
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $ref);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, false);
$html = curl_exec($ch);
curl_close ($ch);
return $html;
}
To send a cookie to a page using cURL, use the CURLOPT_COOKIE option:
curl_setopt($ch, CURLOPT_COOKIE, 'user=alice; activity=knitting');
To set a cookie in the visitor's browser from your script, use setcookie():
setcookie("TestCookie", $value, time()+3600);
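To connect the two halves, the script has to read the Set-Cookie lines from the remote response (in PHP that means asking cURL to include the response headers) and then re-issue each cookie itself. A browser only stores cookies for the domain that served the response, so they end up under your own domain, not the remote one. A minimal Python sketch of the parsing step follows; the sample header block and the `extract_cookies` helper are hypothetical.

```python
from http.cookies import SimpleCookie


def extract_cookies(raw_headers):
    """Collect name/value pairs from every Set-Cookie line in a raw header block."""
    found = {}
    for line in raw_headers.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() != "set-cookie":
            continue
        jar = SimpleCookie()
        jar.load(value.strip())
        for key, morsel in jar.items():
            found[key] = morsel.value
    return found


# Hypothetical response headers, standing in for what the remote site sent.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: image/jpeg\r\n"
    "Set-Cookie: user=alice; Path=/\r\n"
    "Set-Cookie: activity=knitting; Max-Age=3600\r\n"
    "\r\n"
)

cookies = extract_cookies(sample)
print(cookies)
```

Each pair found this way would then be handed to `setcookie()` before any output is sent.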
I'm trying to fake the referer of a request using:
<?php
$url = "http://www.blabla.com";
function doMagic($url)
{
$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: ";
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1");
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_REFERER, "http://www.fakeRef.com");
curl_setopt($curl, CURLOPT_ENCODING, "gzip,deflate");
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 10);
$html = curl_exec($curl);
if ($html === false) {
    // Only report an error when the transfer actually failed.
    echo 'Curl error: ' . curl_error($curl);
}
curl_close($curl);
return $html;
}
$text = doMagic($url);
print("$text");
?>
I have a local Apache server that I'm using to run this PHP script: localhost/script.php. The problem is that the actual referer (that Piwik reports) is localhost/script.php, not http://www.fakeRef.com.
What's the issue here?
You seem to be viewing the output of your cURL operation in a browser, in which case the explanation is simple. Piwik uses a tracking image to count your hit. The browser loads that tracking image itself, so the image request's referer is your script, not the fake value you used to fetch the HTML.
If you want to test whether setting the referer this way works, look in the access logs of the server you requested the page from. The request made by script.php should show the faked referer there.
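A minimal Python sketch of that log check follows. It assumes the server writes Apache's "combined" log format; the sample line is invented for illustration, and the `referer_of` helper and the regular expression are hypothetical, so adjust them if your LogFormat differs.

```python
import re

# Pattern for Apache's "combined" log format (assumed, see lead-in).
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def referer_of(log_line):
    """Return the referer field of a combined-format log line, or None."""
    match = COMBINED.match(log_line)
    return match.group("referer") if match else None


# Invented sample line, shaped like the request the script would make.
sample = (
    '127.0.0.1 - - [16/Oct/2011:20:23:00 +0200] "GET / HTTP/1.1" 200 5120 '
    '"http://www.fakeRef.com" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) '
    'Gecko/20100101 Firefox/7.0.1"'
)

print(referer_of(sample))
```

If the printed referer for the script's request is the fake value, the CURLOPT_REFERER setting is working and the Piwik report is only reflecting the browser's own image request.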