I'm trying to grab the source code of a website so I can parse out football fixtures, my code is:
<?php
$url = "https://www.bbc.co.uk/sport/football/scores-fixtures/2019-03-06";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0.2) Gecko/20100101 Firefox/6.0.2',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-gb,en;q=0.5',
'Accept-Encoding: gzip, deflate',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Proxy-Connection: Close',
'Cookie: PREF=ID=2bb051bfbf00e95b:U=c0bb6046a0ce0334:',
'Cache-Control: max-age=0',
'Connection: Close'
));
$output = curl_exec($ch);
curl_close($ch);
echo substr($output, 0, 12);
?>
Output of the substring shown is:
���
I need the output in standard text, is that compressed or something?
How do I fix this please?
Thanks.
I need the output in standard text, is that compressed or something?
Yes, exactly that: it's gzip-compressed. Your options are a) decompress it using e.g. gzdecode b) tell the server you don't want a gzip-encoded response; the easiest way is to let curl handle this for you:
delete 'Accept-Encoding: gzip, deflate', from your header array
Add: curl_setopt($ch, CURLOPT_ENCODING, 'identity'); somewhere before you curl_exec()
Related
Recently, I want to scraping a website using CURL PHP. And the problem come. It return weird string combination and symbol. I really confused about it. I have set the encoding, both in header and declared it in curlopt.
Here is the coding I used to scrap.
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
//curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate,br');
curl_exec($ch);
curl_close($ch);
And this is the header I sent :
$header = [
':authority: www.airpaz.com',
':method: GET',
':path: $path,
':scheme: https',
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding: gzip, deflate, br',
'accept-language: en-US,en;q=0.9',
'cache-control: max-age=0',
'referer: $referer',
'upgrade-insecure-requests: 1',
'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
];
When I run it, it return exactly like the image below :
Can anyone tell what's the problem is? Thanks for your time. It will help me a lot
include_once('simple_html_dom.php');
$usuario = "username";
$password = "password";
$url = 'https://www.instagram.com/';
$url_login = 'https://www.instagram.com/accounts/login/ajax/';
$user_agent = array("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 ",
"(KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36");
$ch = curl_init();
$headers = [
'Accept-Encoding: gzip, deflate',
'Accept-Language: en-US;q=0.6,en;q=0.4',
'Connection: keep-alive',
'Content-Length: 0',
'Host: www.instagram.com',
'Origin: https://www.instagram.com',
'Referer: https://www.instagram.com/',
'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36',
'X-Instagram-AJAX: 1',
'X-Requested-With: XMLHttpRequest'
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/cookie/pruebalogininsta2.txt");
curl_setopt($ch, CURLOPT_REFERER, $sTarget);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
$html = curl_exec($ch);
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $html, $matches);
$cookies = array();
foreach($matches[1] as $item) {
parse_str($item, $cookie);
$cookies = array_merge($cookies, $cookie);
}
$headers = [
'Accept-Encoding: gzip, deflate',
//'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4',
'Accept-Language: en-US;q=0.6,en;q=0.4',
'Connection: keep-alive',
'Content-Length: 0',
'Host: www.instagram.com',
'Origin: https://www.instagram.com',
'Referer: https://www.instagram.com/',
'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36',
'X-Instagram-AJAX: 1',
'X-Requested-With: XMLHttpRequest'
];
$cadena_agregar_vector = 'X-CSRFToken:'. $cookies["csrftoken"];
$headers[] = $cadena_agregar_vector ;
$sPost = "username=".$usuario . "&password=". $password ;
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_POSTFIELDS, $sPost);
curl_setopt($ch, CURLOPT_URL, $url_login);
$html2 = curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, "http://www.instagram.com/");
$html4 = curl_exec($ch);
echo $html4;
this is what I get
the problem is the way you hardcode Accept-Encoding: gzip, deflate, this makes curl send the encoding header indeed, but it does not turn on the decoding feature of curl, thus you get the raw data, without curl decoding it for you.
remove 'Accept-Encoding: gzip, deflate', and add curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate'); , and curl will decode it for you (provided that curl is compiled with gzip & deflate support) - or better yet, just do curl_setopt($ch, CURLOPT_ENCODING, ''); , and curl will automatically list all supported encodings, so you dont run into the encoding problem where curl isn't compiled with gzip support.
on an unrelated note, you probably want to use CURLOPT_USERAGENT, not set the user-agent header manually. else, the UA-string will just be sent with this 1 request, and be reset on the next request, while CURLOPT_USERAGENT is kept until curl_close($ch)
edit: on my first revision of this post, i wrote CURLOPT_POSTFIELDS instead of CURLOPT_ENCODING, sorry, fixed that
edit 2: on another unrelated note, you're encoding the username/password wrong. instead of $sPost = "username=".$usuario . "&password=". $password ;, do
$sPost=http_build_query(array('username'=>$usuario,'password'=>$password));, else accounts with & or = or NULLs in the password or username wont work properly
The answer posted by #hanshenrik should really be accepted. But if you just want an easy solution that works and is not incorrect, remove the 'Accept-Encoding: gzip, deflate' from your headers array.
By curiosity i've tried to parse html
$url = "http://www.continente.pt/stores/continente/pt-pt/public/Pages/subcategory.aspx?cat=Bebidas_Vinhos";
$agent= 'Googlebot-Image/1.0 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
var_dump($result);
from shop supermarket website and i get this message
Error
This page can't be displayed. Contact support for additional information.
The incident ID is: N/A.
I found it strange and they have some protection against this type of "attacks", but how they protect this website and how they let google bot crawl for digital marketing purpose?
Try with session cookies, but this page not have content because is loaded async with ajax.
curl 'http://www.continente.pt/stores/continente/pt-pt/public/Pages/subcategory.aspx?cat=Bebidas_Vinhos' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3' -H 'Connection: keep-alive' -H 'Cookie: searchRefiner=%7B%22%22%3A%7B%221449049672079%22%3A%5B%5D%7D%7D; SPSessionGuid=ec3e4a3e-7cfe-4c8a-902f-1c64ba0868f4; __CommerceAnonymousShopper_ef77e72d-62b9-4b0f-8113-d111c9d6d7ce_Internet=0244rfNRN5rPgC7kvXzyqrNQg==WBGr/AUg99sKnXpF3QH4Sa5cHPFred5bJqPiwbFvDnL1jHUk6v0Jb0dpOZLY66bXpC8faWF7k5aOMi/qIkOgA4RNWuskMnicr6OJ12BBs8ns68kXmckzTJvkVEfDQB7DApeN5ULier028VPSLkChmWvBHyCHno328U6SrLu65m5e3lu521PF940napZPZIvN7hP51Yfi9c+FkwjIAZ+j8w==; MSCSProfile=287001FD2674671C70ED37E496ED003312D0DA42BDDB218BA1D2B71AD462488CF83AD1F7530553A13FDD4C8DB0E26123D3A02CCFBA6DAE49B72A185609583B9617878CEA5D73023FE7A74384436D54761511ED87FFA2AF58124E143C0E90DC9C72D55A51B3AE6EAB71153682F607FE3C29538E729117E4DD3D6B05C06E7FBA47; cPrompt_useCookies=1; cpup=2; _ga=GA1.2.532033017.1449049672; _dc_gtm_UA-158387-26=1; byside_webcare_tuid=5110f1jvvitrsyi82c2q4kddcxlrl0vdwfmrmtzeah679ditkl; __atuvc=1%7C48; __atuvs=565ebe4c6d710bda000; CampaignHistory=146148' -H 'Host: www.continente.pt' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0'
I am unable to set the host in curl. It still shows as localhost if i use the following code
function wget($url)
{
$agent= 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1';
$curlHeaders = array (
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding: gzip, deflate',
'Accept-Language: en-US,en;q=0.5',
'User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1',
'Connection: Keep-Alive',
'Pragma: no-cache',
'Referer: http://example.com/',
'Host: hostname',
'Cache-Control: no-cache',
'Cookie: visid_incap_185989=9v1q8Ar0ToSOja48BRmb8nn1GFUAAAAAQUIPAAAAAABCRWagbDIfmlN9NTrcvrct; incap_ses_108_185989=Z1orY6Bd0z3nGYE2lbJ/AXn1GFUAAAAAmb41m+jMLFCJB1rTIF28Mg==; _ga=GA1.3.637468927.1427699070; _gat=1; frontend=rqg7g9hp2ht788l309m7gk8qi7; _gat_UA-1279175-12=1; __utma=233911437.637468927.1427699070.1427699078.1427699078.1; __utmb=233911437.2.10.1427699078; __utmc=233911437; __utmz=233911437.1427699078.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt_UA-1279175-1=1; _cb_ls=1; _chartbeat2=S0WVXDwMWnCFBgQp.1427699081322.1427699232786.1; PRUM_EPISODES=s=1427699568560&r=http%3A//example.com/'
);
$ch = curl_init();
curl_setopt ($ch, CURLOPT_HTTPHEADER, $curlHeaders);
curl_setopt ($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
return $result;
}
I use fiddler to track the network requests. where I found the host is still as localhost
If I load this same Link in browser i get as following in fiddler
I need my specified domain to be accessed. How can I achieve this?
Note: I am aware that host name should not contain the protocol.
Alternatively
Also i would like to know is it possible to get the source code of a website the could be seen in browser through terminal?
Assuming we are not trying spoof the Host header, omit the Host header altogether and let curl sort it out. In this case, just remove 'Host: hostname', because you already get curl to automatically set this with your code near the bottom with curl_setopt($ch, CURLOPT_URL, $url);.
If you really want to set the Host header yourself, then just replace
'Host: hostname',
with
"Host: ". parse_url($url, PHP_URL_HOST),
(Note: This function doesn't work with relative URLs.)
try like this,
curl_init('XXX.XXX.XXX.XXX');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: subdomain.hostname.com'));
If you are using windows and xampp then try to use virtual host rather than localhost, then it will start working, I did the same.
According to HTTP quick specification read, I assume your problems are happening because of improper Host header being send. I was able to download some websites with following code:
function wget($url, $follow = true) {
$host = parse_url($url);
$agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1';
$curlHeaders = array(
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding: gzip, deflate',
'Accept-Language: en-US,en;q=0.5',
'User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1',
'Connection: Keep-Alive',
'Pragma: no-cache',
'Referer: http://example.com/',
'Host: ' . $host['host'] . (isset($host['port']) ? ':' . $host['port'] : null), // building host header
'Cache-Control: no-cache',
'Cookie: visid_incap_185989=9v1q8Ar0ToSOja48BRmb8nn1GFUAAAAAQUIPAAAAAABCRWagbDIfmlN9NTrcvrct; incap_ses_108_185989=Z1orY6Bd0z3nGYE2lbJ/AXn1GFUAAAAAmb41m+jMLFCJB1rTIF28Mg==; _ga=GA1.3.637468927.1427699070; _gat=1; frontend=rqg7g9hp2ht788l309m7gk8qi7; _gat_UA-1279175-12=1; __utma=233911437.637468927.1427699070.1427699078.1427699078.1; __utmb=233911437.2.10.1427699078; __utmc=233911437; __utmz=233911437.1427699078.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt_UA-1279175-1=1; _cb_ls=1; _chartbeat2=S0WVXDwMWnCFBgQp.1427699081322.1427699232786.1; PRUM_EPISODES=s=1427699568560&r=http%3A//example.com/'
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, $curlHeaders);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow); // following redirects or not
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
return $result;
}
echo(wget('http://example.com'));
Anyway this function is not an universal build. Personally I would add saving cookies between redirection requests etc. Essential change is within 'Host' header line. I'm building there proper Host header based on full $url provided to function.
Set the full URL into CURLOPT_URL.
I am using cURL to access a facebook page. Locally it works perfect, but when I upload it to my dev server, it breaks and returns an empty string. I've checked and cURL is installed on the server. Here's the code I use to access facebook:
$header = array();
$header[] = 'Accept: text/json';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$header[] = 'Pragma: ';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://facebook.com/feeds/page.php?format=json&id=135137236003');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$result = curl_exec($ch);
curl_close ($ch);
Any help is appreciated!
Change the accept header to */* or application/json as facebook is sending the response header as application/json.
And change this url
http://facebook.com/feeds/page.php?format=json&id=135137236003
to
http://www.facebook.com/feeds/page.php?format=json&id=135137236003
as facebook is redirecting the non-www request to www requests. Though it works for you as put follow location, but it saves one reound trip