I'm trying to fake the referer of a request using:
<?php
$url = "http://www.blabla.com";

function doMagic($url)
{
    $curl = curl_init();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1");
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, "http://www.fakeRef.com");
    curl_setopt($curl, CURLOPT_ENCODING, "gzip,deflate");
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($curl);
    echo 'Curl error: ' . curl_error($curl);
    curl_close($curl);
    return $html;
}

$text = doMagic($url);
print("$text");
?>
I have a local Apache server that I'm using to run this PHP script: localhost/script.php. The problem is that the actual referer (the one Piwik reports) is localhost/script.php, not http://www.fakeRef.com.
What's the issue here?
You seem to be viewing the output of your cURL operation in a browser. If so, the explanation is simple: Piwik uses a tracking image to count the hit. Your browser loads that tracking image, and the image request's referer is your script, not the fake value you used when fetching the HTML.
If you want to test whether setting the referer this way works, look at the access logs of the server you are requesting: the request made by script.php should contain the faked referer.
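If you just want to confirm which Referer header cURL actually sends, another quick check is to point the same options at a request-echo service. A minimal sketch, assuming httpbin.org is reachable from your machine (it simply echoes back the request headers it received):
<?php
// Sketch only: send the faked Referer to an echo service and print the
// headers it saw. httpbin.org is used purely as an example endpoint.
$curl = curl_init("https://httpbin.org/headers");
curl_setopt($curl, CURLOPT_REFERER, "http://www.fakeRef.com");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($curl);
curl_close($curl);
echo $response; // the echoed "Referer" field should read http://www.fakeRef.com
?>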
Related
Right now I use cURL to get html5player.setVideoUrlLow and it works, but the quality is poor. I need to get html5player.setVideoUrlHigh, but this parameter doesn't appear in the cURL response when I run the script from my server; on localhost it works fine. What am I missing in my code?
I have already tried different CURLOPT_USERAGENT values with the same problem. Thank you!
<?php
function getstring($string, $start, $end)
{
    $str = explode($start, $string);
    $str = explode($end, $str[1]);
    return $str[0];
}

$viewkey = $_GET['viewkey']; // https://mypage.com/view_video.php?viewkey=54501623
$url = "http://www.xvideos.com/embedframe/" . $viewkey;
// $url = "http://www.xvideos.com/video" . $viewkey; // alternate but same result //

$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: ";
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) Ap");
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_REFERER, "http://www.google.com/bot.html");
curl_setopt($curl, CURLOPT_ENCODING, "gzip,deflate");
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 30);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($curl);
curl_close($curl);

$VideoUrlLow  = getstring($html, "html5player.setVideoUrlLow('", "');");
$VideoUrlHigh = getstring($html, "html5player.setVideoUrlHigh('", "');");

if ($VideoUrlHigh != "") {
    $mp4 = $VideoUrlHigh; // empty on server but work in localhost
} else {
    $mp4 = $VideoUrlLow;
}
header('Location: ' . $mp4);
?>
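One side observation about the extraction itself: the explode()-based getstring() helper emits notices and returns nothing useful whenever the start marker is missing, which is exactly the case when setVideoUrlHigh is absent from the response. A hedged alternative sketch (extractBetween is a made-up helper name) that makes the "marker not found" case explicit:
<?php
// Sketch of a marker-based extractor that returns null instead of
// triggering notices when the start marker is not present in the HTML.
function extractBetween($html, $start, $end)
{
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    if (preg_match($pattern, $html, $m)) {
        return $m[1];
    }
    return null; // the marker was not found in this response
}

// Usage with the markers from the question:
// $VideoUrlHigh = extractBetween($html, "html5player.setVideoUrlHigh('", "');");
?>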
I want to scrape a website that uses AJAX pagination with Scrapy. I have already scraped this site in PHP using cURL; I monitored the network with Firebug, which has a "Copy for cURL" option for the POST request.
My question is: how can I do the same with Scrapy?
My function in PHP:
function forCurl($url,$refer, $jsessionid){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0');
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: no-cache' --data 't%3Azoneid=forceAjax";
$header[] = "Connection: keep-alive";
$header[] = "Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3";
$header[] = "Pragma: no-cache";
$header[] = "X-Requested-With: XMLHttpRequest";
$header[] = "Keep-Alive: 700";
$cookie = "JSESSIONID=" . $jsessionid. '; langueFront=fr; tc_cj_v2=%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLMQOMROKJRZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLMQOSQMMRNZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNNJJKNRRMZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNNKNJOJSKZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNNMLSNSKLZZZ%5D777_rn_lh%5BfyfcheZZZ222H%7B0%7D%23%7B%29H%21-ZZZKNLLNNMMLMJNJZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNOOJSKRKMZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNOOLSOMPNZZZ%5D777%5Ecl_%5Dny%5B%5D%5D_mmZZZZZZKNLLNOPJMROQLZZZ%5D777_rn_lh%5BfyfcheZZZ%7B%7E%28%24%29H/*+%7E-%241%20H%21-ZZZKNLLNOPMQSKNOZZZ%5D; _ga=GA1.2.487921595.1421941922; aurol=GA1.2.865695137.1421941922; __utma=239562643.487921595.1421941922.1422452658.1422454606.14; __utmz=239562643.1422443324.10.2.utmcsr=Sphere_myWebSite|utmccn=myWebSitefr_logo|utmcmd=Interne; kameleoonVisitIdentifier=rj1hnzh5ux1n2gxr/4; myWebSiteCook=\"869|\"; revelationDriveWin=2; myWebSite.hamon=1; __utmv=239562643.|1=visite_myWebSitedrive=239562643.487921595.1421941922.1422452658.1422454606.14=1; tosend=%7B%22p%22%3A%7B%22tracker%22%3A%22myWebSitedrive%22%2C%20%22url%22%3A%22rayon%22%2C%20%22mtime%22%3A1422455760000%2C%20%22ref%22%3A%22http%3A%2F%2Fwww.myWebSitedrive.fr%2Fdrive%2Frecherche%2Fbio%22%2C%20%22dest%22%3A%22http%3A%2F%2Fwww.myWebSitedrive.fr%2Fdrive%2FNice-Cote-dAzur-869%2FSurgeles-R41355%2FViandes-Volailles-41478%2F%22%7D%2C%22d%22%3A%7B%22dv%22%3A%22NA%22%7D%2C%20%22t%22%3A%7B%22iplobserverstart%22%3A%221422455762613%22%2C%22jsinit%22%3A%221422455763871%22%2C%22domload%22%3A%221422455764728%22%2C%22clicklink%22%3A%221422455817128%22%2C%22unload%22%3A%221422455817521%22%7D%7D; kameleoonExperiment-14570=86018/1422452656881/false; __utmc=239562643; rdmvalidation=1; layerDrivePromos=2; __utmb=239562643.19.10.1422454606; _gat=1; _gat_myWebSiteRollup=1; __utmt=1; __utmt_secondTracker=1; __utmli=toPage_14b30fac8d4_0';
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_REFERER, $refer);
curl_setopt($ch, CURLOPT_COOKIE, $cookie);
$content = curl_exec($ch);
curl_close($ch);
return $content;
}
I want to know how I can post the same parameters with Scrapy. Is this a good approach for scraping a website with AJAX pagination?
I tried this:
yield Request(sousUrl, headers={'Referer':'%s' % url}, callback=self.parse_page)
In Python you can use PycURL, which is a Python interface to libcurl.
I want to fetch the result of a webpage exactly as a normal user in a browser would. I set the headers and cookies of the request, which I captured with Fiddler4. Last night it worked, but now it throws cURL error 28 (request timed out).
Here is the code I used:
function cURL($url)
{
    $cURL = curl_init();
    $header[0] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: hu-HU,hu;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2,it;q=0.2,de;q=0.2,fr;q=0.2";
    $header[] = "Accept-Encoding: gzip, deflate, sdch";
    $header[] = "Pragma: ";
    curl_setopt($cURL, CURLOPT_URL, $url);
    // Note: CURLOPT_USERAGENT and CURLOPT_COOKIE take the bare value,
    // without the "User-Agent: " / "Cookie: " header-name prefixes.
    curl_setopt($cURL, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36');
    curl_setopt($cURL, CURLOPT_HTTPHEADER, $header);
    curl_setopt($cURL, CURLOPT_POST, true);
    curl_setopt($cURL, CURLOPT_POSTFIELDS, 'action=verChau');
    curl_setopt($cURL, CURLOPT_REFERER, 'http://www.google.com');
    curl_setopt($cURL, CURLOPT_AUTOREFERER, true);
    curl_setopt($cURL, CURLOPT_COOKIE, '__gfp_64b=mgK79a4qc_M9RH4eFToSbGkkxUWaWD2tKPQ51RreN8r.A7; PHPSESSID=780d83cb35c5b82098e33fde9c101d08; __atuvc=0%7C4%2C0%7C5%2C1%7C6%2C6%7C7%2C1%7C8; cTest=1; resDone20101213=1; _gat=1; _ga=GA1.2.307256553.1418233339; _goa3=eyJ1IjoiMTQxMjEyNDEwODM2NDE4MTU4NzAxNiIsImgiOiJCQzI0OEVFRC5kc2wucG9vbC50ZWxla29tLmh1IiwicyI6MTQxODMzODgwMDAwMH0=; _goa3TC=eyI1NjM2NCI6MTQyNTEzMjMwNDE2OCwiMzEzNDUzMiI6MTQyNTE0ODk1NDg2M30=; _goa3TS=e30=');
    curl_setopt($cURL, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($cURL, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($cURL);
    if ($html === false) {
        echo "cURL error " . curl_errno($cURL) . ": " . curl_error($cURL);
    }
    curl_close($cURL);
    return $html;
}
Can anybody help me out?
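As a side note on the error itself: cURL error 28 simply means the transfer did not complete within CURLOPT_TIMEOUT (10 seconds above). A minimal sketch of separating the connection timeout from the total timeout and reporting which limit was hit; the URL and the values are only illustrative:
<?php
// Sketch only: use a separate connection timeout and a larger total
// timeout, then report how long the request actually ran if it times out.
// The URL is a placeholder; the limits are example values.
$cURL = curl_init('http://www.example.com/');
curl_setopt($cURL, CURLOPT_CONNECTTIMEOUT, 10); // max seconds to establish the connection
curl_setopt($cURL, CURLOPT_TIMEOUT, 60);        // max seconds for the whole transfer
curl_setopt($cURL, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($cURL);
if ($html === false && curl_errno($cURL) === 28) { // 28 = CURLE_OPERATION_TIMEDOUT
    echo 'Timed out after ' . curl_getinfo($cURL, CURLINFO_TOTAL_TIME) . ' seconds';
}
curl_close($cURL);
?>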
This PHP script successfully gets feeds from my Facebook page:
$access_token = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
$url = 'https://graph.facebook.com/v2.2/711493158867950/feed?access_token='.$access_token;
$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_REFERER, '');
curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 10);
$json = curl_exec($curl);
But after two hours the access_token expires and I get the following error: http://prntscr.com/60jyqt.
How can I rewrite the PHP code above to "renew" the access_token?
You could start by having a look at the documentation:
https://developers.facebook.com/docs/facebook-login/access-tokens#termtokens
https://developers.facebook.com/docs/facebook-login/access-tokens#extending
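In short, the second page describes exchanging a short-lived token for a long-lived one via grant_type=fb_exchange_token. A minimal sketch of that exchange with cURL; the app ID, app secret and short-lived token below are placeholders:
<?php
// Sketch of the token-exchange call from the "extending access tokens"
// documentation. All credential values are placeholders.
$app_id      = 'YOUR_APP_ID';
$app_secret  = 'YOUR_APP_SECRET';
$short_token = 'SHORT_LIVED_TOKEN';

$exchange_url = 'https://graph.facebook.com/oauth/access_token'
    . '?grant_type=fb_exchange_token'
    . '&client_id=' . $app_id
    . '&client_secret=' . $app_secret
    . '&fb_exchange_token=' . $short_token;

$curl = curl_init($exchange_url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($curl);
curl_close($curl);

// The response carries the long-lived token (in this API era as a
// query-string style body: access_token=...&expires=...).
echo $response;
?>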
This worked for me (Facebook SDK v3.1):
1) Create a "system user"
2) Grant it access to the properties I needed (in my case, an app)
3) Generate a new token for that app and system user
The instructions I used can be found here
Some time ago I was getting feeds from my Facebook page with a cURL request like this:
$url = 'https://www.facebook.com/feeds/page.php?format=json&id=XXXXXXXXXXXXX';
$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_REFERER, '');
curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 10);
$json = curl_exec($curl);
But two days ago this script started returning an error with a message like this: http://prntscr.com/602m20.
What could be the problem?
The Pages JSON feed (e.g.
https://www.facebook.com/feeds/page.php?id=%2019292868552&format=json)
is now deprecated and will stop returning data from Jan 28, 2015
onwards. Developers should instead call the feed edge on the Graph
API's Page object: /v2.2/{page_id}/feed.
Source
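A minimal sketch of that replacement call with cURL; the page ID and access token are placeholders, and a valid token is required:
<?php
// Sketch only: fetch the Page feed through the Graph API edge mentioned
// in the deprecation notice. Page ID and token are placeholders.
$page_id      = 'YOUR_PAGE_ID';
$access_token = 'YOUR_ACCESS_TOKEN';
$url = 'https://graph.facebook.com/v2.2/' . $page_id . '/feed?access_token=' . $access_token;

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_TIMEOUT, 10);
$json = curl_exec($curl);
curl_close($curl);

$feed = json_decode($json, true); // array of posts on success, null on failure
?>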