Out of curiosity, I tried to parse the HTML
$url = "http://www.continente.pt/stores/continente/pt-pt/public/Pages/subcategory.aspx?cat=Bebidas_Vinhos";
$agent= 'Googlebot-Image/1.0 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
var_dump($result);
from a supermarket's online shop, and I get this message:
Error
This page can't be displayed. Contact support for additional information.
The incident ID is: N/A.
I found this strange. They seem to have some protection against this kind of "attack", but how do they protect the website while still letting Googlebot crawl it for digital marketing purposes?
Try it with session cookies. Note, however, that the page has no content in its initial HTML, because the content is loaded asynchronously via AJAX.
curl 'http://www.continente.pt/stores/continente/pt-pt/public/Pages/subcategory.aspx?cat=Bebidas_Vinhos' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3' -H 'Connection: keep-alive' -H 'Cookie: searchRefiner=%7B%22%22%3A%7B%221449049672079%22%3A%5B%5D%7D%7D; SPSessionGuid=ec3e4a3e-7cfe-4c8a-902f-1c64ba0868f4; __CommerceAnonymousShopper_ef77e72d-62b9-4b0f-8113-d111c9d6d7ce_Internet=0244rfNRN5rPgC7kvXzyqrNQg==WBGr/AUg99sKnXpF3QH4Sa5cHPFred5bJqPiwbFvDnL1jHUk6v0Jb0dpOZLY66bXpC8faWF7k5aOMi/qIkOgA4RNWuskMnicr6OJ12BBs8ns68kXmckzTJvkVEfDQB7DApeN5ULier028VPSLkChmWvBHyCHno328U6SrLu65m5e3lu521PF940napZPZIvN7hP51Yfi9c+FkwjIAZ+j8w==; MSCSProfile=287001FD2674671C70ED37E496ED003312D0DA42BDDB218BA1D2B71AD462488CF83AD1F7530553A13FDD4C8DB0E26123D3A02CCFBA6DAE49B72A185609583B9617878CEA5D73023FE7A74384436D54761511ED87FFA2AF58124E143C0E90DC9C72D55A51B3AE6EAB71153682F607FE3C29538E729117E4DD3D6B05C06E7FBA47; cPrompt_useCookies=1; cpup=2; _ga=GA1.2.532033017.1449049672; _dc_gtm_UA-158387-26=1; byside_webcare_tuid=5110f1jvvitrsyi82c2q4kddcxlrl0vdwfmrmtzeah679ditkl; __atuvc=1%7C48; __atuvs=565ebe4c6d710bda000; CampaignHistory=146148' -H 'Host: www.continente.pt' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0'
I'm trying to grab the source code of a website so I can parse out football fixtures. My code is:
<?php
$url = "https://www.bbc.co.uk/sport/football/scores-fixtures/2019-03-06";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0.2) Gecko/20100101 Firefox/6.0.2',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-gb,en;q=0.5',
'Accept-Encoding: gzip, deflate',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Proxy-Connection: Close',
'Cookie: PREF=ID=2bb051bfbf00e95b:U=c0bb6046a0ce0334:',
'Cache-Control: max-age=0',
'Connection: Close'
));
$output = curl_exec($ch);
curl_close($ch);
echo substr($output, 0, 12);
?>
Output of the substring shown is:
���
I need the output in standard text, is that compressed or something?
How do I fix this please?
Thanks.
I need the output in standard text, is that compressed or something?
Yes, exactly that: it's gzip-compressed. Your options are (a) decompress it yourself, e.g. with gzdecode, or (b) tell the server you don't want a gzip-encoded response. The easiest way to do the latter is to let curl handle it for you:
Delete 'Accept-Encoding: gzip, deflate', from your header array.
Add curl_setopt($ch, CURLOPT_ENCODING, 'identity'); somewhere before you call curl_exec().
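A minimal sketch of option (a), assuming PHP's zlib extension is available. The helper name decodeIfGzipped is hypothetical, not part of any library; it checks for the two gzip magic bytes before calling gzdecode, so a body that is already plain text passes through unchanged. The sample body is generated locally with gzencode as a stand-in for a compressed response.

```php
<?php
// Hypothetical helper: return the body as plain text whether or not it is gzip-compressed.
function decodeIfGzipped(string $body): string
{
    // Every gzip stream starts with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        // gzdecode() returns false on corrupt input; fall back to the raw body.
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}

// Stand-in for a compressed response body (not a real server response).
$compressed = gzencode('<!DOCTYPE html><html>fixtures</html>');
echo substr(decodeIfGzipped($compressed), 0, 15), "\n";
```

With a real request, the same helper could be applied to the value returned by curl_exec() when CURLOPT_RETURNTRANSFER is enabled.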
Recently I wanted to scrape a website using cURL in PHP, and I ran into a problem: it returns a weird combination of characters and symbols. I'm really confused by it. I have set the encoding both in the headers and in the cURL options.
Here is the code I used for scraping.
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
//curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate,br');
curl_exec($ch);
curl_close($ch);
And these are the headers I sent:
$header = [
':authority: www.airpaz.com',
':method: GET',
':path: $path',
':scheme: https',
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding: gzip, deflate, br',
'accept-language: en-US,en;q=0.9',
'cache-control: max-age=0',
'referer: $referer',
'upgrade-insecure-requests: 1',
'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
];
When I run it, the output looks exactly like the image below:
Can anyone tell me what the problem is? Thanks for your time; it would help me a lot.
I extract a set of almost 1500 records from a database table, and for each record I have to call an endpoint through cURL like this:
for($i=0; $i <1500; $i++) {
$headers = [
'Host: www.hostname.it',
'Accept: application/json, text/javascript, */*; q=0.01',
'X-Requested-With: XMLHttpRequest',
'Accept-Language: it-it',
'Content-Type: application/x-www-form-urlencoded; charset=UTF-8',
'Origin: https://www.desiderimagazine.it',
'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30',
'Connection: close',
'Referer: https://www.hpstname.it/page/registrazione?utm_source=gate2000&utm_medium=display&utm_campaign=ghh_gen18_regist&utm_content=leadcampaign2',
'Content-Length: '.mb_strlen($post_fields, '8bit')
];
$ch = curl_init($endpoint);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/601.6.17 (KHTML, like Gecko) Version/9.1.1 Safari/601.6.17");
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 24000);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 24000);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields); // Post fields
$result = curl_exec($ch);
// .. here I save the result on database
}
I run the script by entering its URL in the browser, and it works fine for roughly the first 20-30 records (I correctly see the results in the database and the endpoint response). After that I systematically get a
504 error - Gateway error timeout
I suspect it could be the way I execute it, but there must be some configuration I can change in my code to fix it.
Thanks
include_once('simple_html_dom.php');
$usuario = "username";
$password = "password";
$url = 'https://www.instagram.com/';
$url_login = 'https://www.instagram.com/accounts/login/ajax/';
$user_agent = array("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 ",
"(KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36");
$ch = curl_init();
$headers = [
'Accept-Encoding: gzip, deflate',
'Accept-Language: en-US;q=0.6,en;q=0.4',
'Connection: keep-alive',
'Content-Length: 0',
'Host: www.instagram.com',
'Origin: https://www.instagram.com',
'Referer: https://www.instagram.com/',
'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36',
'X-Instagram-AJAX: 1',
'X-Requested-With: XMLHttpRequest'
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/cookie/pruebalogininsta2.txt");
curl_setopt($ch, CURLOPT_REFERER, $sTarget);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
$html = curl_exec($ch);
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $html, $matches);
$cookies = array();
foreach($matches[1] as $item) {
parse_str($item, $cookie);
$cookies = array_merge($cookies, $cookie);
}
$headers = [
'Accept-Encoding: gzip, deflate',
//'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4',
'Accept-Language: en-US;q=0.6,en;q=0.4',
'Connection: keep-alive',
'Content-Length: 0',
'Host: www.instagram.com',
'Origin: https://www.instagram.com',
'Referer: https://www.instagram.com/',
'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36',
'X-Instagram-AJAX: 1',
'X-Requested-With: XMLHttpRequest'
];
$cadena_agregar_vector = 'X-CSRFToken:'. $cookies["csrftoken"];
$headers[] = $cadena_agregar_vector ;
$sPost = "username=".$usuario . "&password=". $password ;
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_POSTFIELDS, $sPost);
curl_setopt($ch, CURLOPT_URL, $url_login);
$html2 = curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, "http://www.instagram.com/");
$html4 = curl_exec($ch);
echo $html4;
This is what I get:
The problem is the way you hardcode Accept-Encoding: gzip, deflate. This does make curl send the encoding header, but it does not turn on curl's decoding feature, so you get the raw compressed data without curl decoding it for you.
Remove 'Accept-Encoding: gzip, deflate', and add curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate'); and curl will decode the response for you (provided that curl is compiled with gzip and deflate support). Better yet, just use curl_setopt($ch, CURLOPT_ENCODING, ''); and curl will automatically list all the encodings it supports, so you don't run into trouble when curl isn't compiled with gzip support.
On an unrelated note, you probably want to use CURLOPT_USERAGENT rather than setting the User-Agent header manually. Otherwise the UA string is lost as soon as you replace the header array for a later request, whereas CURLOPT_USERAGENT stays in effect until curl_close($ch).
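The CURLOPT_ENCODING fix described above could be sketched as follows. Both helper names (withoutAcceptEncoding and fetchDecoded) are hypothetical, and fetchDecoded is only defined here, not called, since it needs the curl extension and network access.

```php
<?php
// Hypothetical helper: drop any hand-written Accept-Encoding header.
function withoutAcceptEncoding(array $headers): array
{
    return array_values(array_filter($headers, function ($header) {
        // Keep every header that does not start with "Accept-Encoding:" (case-insensitive).
        return stripos($header, 'Accept-Encoding:') !== 0;
    }));
}

// Hypothetical helper: fetch a URL and let cURL negotiate and decode compression.
function fetchDecoded(string $url, array $headers)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, withoutAcceptEncoding($headers));
    // Empty string: advertise every encoding this cURL build supports and decode the response.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    return curl_exec($ch); // string on success, false on failure
}

$headers = withoutAcceptEncoding([
    'Accept-Encoding: gzip, deflate',
    'Accept-Language: en-US;q=0.6,en;q=0.4',
]);
print_r($headers);
```

Filtering the header list rather than editing it by hand means the same request headers can be reused unchanged apart from the encoding line.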
Edit: in the first revision of this post I wrote CURLOPT_POSTFIELDS instead of CURLOPT_ENCODING; sorry, that is fixed now.
Edit 2: on another unrelated note, you're encoding the username and password incorrectly. Instead of $sPost = "username=".$usuario . "&password=". $password ;, do
$sPost=http_build_query(array('username'=>$usuario,'password'=>$password));, otherwise accounts with & or = or NUL bytes in the username or password won't work properly.
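To illustrate the difference, here is a small sketch that round-trips both encodings through parse_str. The credentials are made-up placeholders chosen to contain & and =.

```php
<?php
// Placeholder credentials containing characters that are special in form encoding.
$usuario  = 'user@example.com';
$password = 'p&ss=w0rd';

// Naive concatenation: the & and = inside the password break the field boundaries.
$naive = 'username=' . $usuario . '&password=' . $password;
parse_str($naive, $naiveFields);

// http_build_query() percent-encodes each value, so the fields survive a round trip.
$encoded = http_build_query(['username' => $usuario, 'password' => $password]);
parse_str($encoded, $encodedFields);

var_dump($naiveFields['password'] === $password);   // bool(false): the password was cut off at "&"
var_dump($encodedFields['password'] === $password); // bool(true)
```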
The answer posted by @hanshenrik should really be accepted. But if you just want an easy solution that works and is not incorrect, remove 'Accept-Encoding: gzip, deflate' from your headers array.
I am unable to set the host in cURL. It still shows up as localhost when I use the following code:
function wget($url)
{
$agent= 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1';
$curlHeaders = array (
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding: gzip, deflate',
'Accept-Language: en-US,en;q=0.5',
'User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1',
'Connection: Keep-Alive',
'Pragma: no-cache',
'Referer: http://example.com/',
'Host: hostname',
'Cache-Control: no-cache',
'Cookie: visid_incap_185989=9v1q8Ar0ToSOja48BRmb8nn1GFUAAAAAQUIPAAAAAABCRWagbDIfmlN9NTrcvrct; incap_ses_108_185989=Z1orY6Bd0z3nGYE2lbJ/AXn1GFUAAAAAmb41m+jMLFCJB1rTIF28Mg==; _ga=GA1.3.637468927.1427699070; _gat=1; frontend=rqg7g9hp2ht788l309m7gk8qi7; _gat_UA-1279175-12=1; __utma=233911437.637468927.1427699070.1427699078.1427699078.1; __utmb=233911437.2.10.1427699078; __utmc=233911437; __utmz=233911437.1427699078.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt_UA-1279175-1=1; _cb_ls=1; _chartbeat2=S0WVXDwMWnCFBgQp.1427699081322.1427699232786.1; PRUM_EPISODES=s=1427699568560&r=http%3A//example.com/'
);
$ch = curl_init();
curl_setopt ($ch, CURLOPT_HTTPHEADER, $curlHeaders);
curl_setopt ($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
return $result;
}
I use Fiddler to track the network requests, and there I found that the host is still localhost.
If I load the same link in a browser, I get the following in Fiddler:
I need my specified domain to be accessed. How can I achieve this?
Note: I am aware that host name should not contain the protocol.
Alternatively
Also, I would like to know whether it is possible to get, from a terminal, the source code of a website as it is seen in the browser.
Assuming we are not trying to spoof the Host header, omit the Host header altogether and let curl sort it out. In this case, just remove 'Host: hostname', because curl already sets it automatically from the URL you pass near the bottom of your code with curl_setopt($ch, CURLOPT_URL, $url);.
If you really want to set the Host header yourself, then just replace
'Host: hostname',
with
"Host: ". parse_url($url, PHP_URL_HOST),
(Note: This function doesn't work with relative URLs.)
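As a rough sketch of that replacement, the hypothetical helper below (hostHeaderFor is not a built-in) builds the Host line from an absolute URL, adds the port when the URL specifies one, and returns null for a relative URL, where parse_url finds no host.

```php
<?php
// Hypothetical helper: build a Host header line from an absolute URL,
// or return null when the URL has no host component (e.g. a relative URL).
function hostHeaderFor(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    // PHP_URL_PORT yields an int when the URL names a port, otherwise null.
    $port = parse_url($url, PHP_URL_PORT);
    return 'Host: ' . $host . (is_int($port) ? ':' . $port : '');
}

var_dump(hostHeaderFor('http://example.com:8080/index.html'));
var_dump(hostHeaderFor('/index.html'));
```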
Try it like this:
$ch = curl_init('XXX.XXX.XXX.XXX');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: subdomain.hostname.com'));
If you are using Windows and XAMPP, try using a virtual host rather than localhost. I did the same, and it started working.
From a quick read of the HTTP specification, I assume your problems are caused by an improper Host header being sent. I was able to download some websites with the following code:
function wget($url, $follow = true) {
$host = parse_url($url);
$agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1';
$curlHeaders = array(
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding: gzip, deflate',
'Accept-Language: en-US,en;q=0.5',
'User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1',
'Connection: Keep-Alive',
'Pragma: no-cache',
'Referer: http://example.com/',
'Host: ' . $host['host'] . (isset($host['port']) ? ':' . $host['port'] : ''), // building host header
'Cache-Control: no-cache',
'Cookie: visid_incap_185989=9v1q8Ar0ToSOja48BRmb8nn1GFUAAAAAQUIPAAAAAABCRWagbDIfmlN9NTrcvrct; incap_ses_108_185989=Z1orY6Bd0z3nGYE2lbJ/AXn1GFUAAAAAmb41m+jMLFCJB1rTIF28Mg==; _ga=GA1.3.637468927.1427699070; _gat=1; frontend=rqg7g9hp2ht788l309m7gk8qi7; _gat_UA-1279175-12=1; __utma=233911437.637468927.1427699070.1427699078.1427699078.1; __utmb=233911437.2.10.1427699078; __utmc=233911437; __utmz=233911437.1427699078.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt_UA-1279175-1=1; _cb_ls=1; _chartbeat2=S0WVXDwMWnCFBgQp.1427699081322.1427699232786.1; PRUM_EPISODES=s=1427699568560&r=http%3A//example.com/'
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, $curlHeaders);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow); // following redirects or not
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
return $result;
}
echo(wget('http://example.com'));
This function is not a universal solution, though. Personally, I would also save cookies between redirected requests, and so on. The essential change is in the 'Host' header line, where I build a proper Host header from the full $url passed to the function.
Set the full URL into CURLOPT_URL.