I'm downloading blog posts for analysis and after 10 pages of results I'm getting a strange redirect to the site's homepage rather than to the 10th page of results. Going to the 10th page in my browser works just fine.
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, "http://www.russellmoore.com/category/article/page/10");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
$status = curl_getinfo($ch);
print_r($status);
Executing this code redirects my script to http://www.russellmoore.com/.
As Daren pointed out, removing the user agent worked. However, because another blog I was downloading from required a user agent, I changed it to:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
This solution worked for both blogs.
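To check whether a given user agent still triggers the redirect, it can help to compare the URL you requested with the one cURL ended up on. A minimal sketch, assuming a hypothetical helper named wasRedirected (not part of the original code); the effective URL would come from curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) after curl_exec:

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// returns true when the final URL differs from the requested one.
function wasRedirected(string $requestedUrl, string $effectiveUrl): bool
{
    // Ignore a trailing slash so "/page/10" and "/page/10/" count as the same page.
    return rtrim($requestedUrl, '/') !== rtrim($effectiveUrl, '/');
}
```

With the code from the question, $status['url'] holds the effective URL, so wasRedirected("http://www.russellmoore.com/category/article/page/10", $status['url']) would flag the unwanted jump to the homepage.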
Related
I'm using the same code to get the price from several web pages (7 in particular). All of them work perfectly except one, where I cannot get any data. Could you tell me whether this is impossible, or whether the page has some kind of protection? Thanks in advance.
$source = file_get_contents("https://www.cyberpuerta.mx/Computo-Hardware/Discos-Duros-SSD-NAS/Discos-Duros-Internos-para-PC/Disco-Duro-Interno-Western-Digital-Caviar-Blue-3-5-1TB-SATA-III-6-Gbit-s-7200RPM-64MB-Cache.html");
preg_match("'<span class=\"priceText\">(.*?)</span>'", $source, $price);
echo $price[1];
I expect this result:
$869.00
The code fails only on the website shown in the code above.
Use curl with a user agent set; this usually makes the site's protections treat the request as coming from a real browser.
$URL = "https://www.cyberpuerta.mx/Computo-Hardware/Discos-Duros-SSD-NAS/Discos-Duros-Internos-para-PC/Disco-Duro-Interno-Western-Digital-Caviar-Blue-3-5-1TB-SATA-III-6-Gbit-s-7200RPM-64MB-Cache.html";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, $URL);
$result = curl_exec($ch);
// Only print the price when the pattern actually matched.
if (preg_match("'<span class=\"priceText\">(.*?)</span>'", $result, $price)) {
    echo $price[1];
}
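If the site blocks the request or changes its markup, the pattern will not match and $price[1] will be undefined. A small sketch of the extraction step on its own, assuming a hypothetical helper named extractPrice and made-up sample markup; it is the same regular expression as above, just wrapped so a missing price comes back as null:

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the text inside <span class="priceText">, or null if absent.
function extractPrice(string $html): ?string
{
    if (preg_match('~<span class="priceText">(.*?)</span>~s', $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

// Made-up markup standing in for the downloaded page.
$sample = '<div><span class="priceText">$869.00</span></div>';
$price = extractPrice($sample);
echo $price === null ? "price not found\n" : $price . "\n";
```

Checking for null makes it obvious when the problem is the download (empty or blocked page) rather than the pattern.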
I am trying to get web page content with cURL from some websites, but they return 400 Bad Request (file_get_contents returns an empty result). Here's the function I am using:
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
Put error_reporting(E_ALL); at the top of the file where you call this function.
It should surface the cause of the error.
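error_reporting covers PHP-level warnings and notices; cURL reports its own transport failures through curl_errno() and curl_error(), and the HTTP status through curl_getinfo(). A rough sketch of the function from the question returning those details, assuming a hypothetical name fetchWithDiagnostics and a 5-second connect timeout picked for illustration:

```php
<?php
// Hypothetical variant of the question's function (name and return shape
// are assumptions): returns diagnostics alongside the body.
function fetchWithDiagnostics(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // illustrative value
    $body = curl_exec($ch);

    return [
        'ok'     => $body !== false,                        // false means a transport-level failure
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),  // 0 when no HTTP response arrived
        'errno'  => curl_errno($ch),
        'error'  => curl_error($ch),
        'body'   => $body === false ? '' : $body,
    ];
}
```

Note that a 400 response is not a cURL failure: 'ok' would be true and 'status' would be 400, so the status code and the body are the places to look for what the server objected to.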
$loginUrl = 'http://mp3.zing.vn/json/song/get-source/ZmJmTknNCBmLNzHtZbxtvmLH';
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$loginUrl);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$result=curl_exec($ch);
curl_close($ch);
var_dump(json_decode($result));
I have a problem getting data with cURL. If I open the URL in my browser it returns the data, but here var_dump shows NULL. I have consulted some posts on Stack Overflow but I can't solve this problem.
Where is my mistake? Please help me. Thanks.
The URL is invalid, i.e. the path assigned to the variable $loginUrl does not exist:
$loginUrl = 'http://mp3.zing.vn/json/song/get-source/ZmJmTknNCBmLNzHtZbxtvmLH';
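json_decode() returns NULL both when the body is empty and when it is not valid JSON (for example an HTML error page), so var_dump alone does not say which happened. A small sketch that separates the cases, assuming a hypothetical helper named decodeJsonBody; json_last_error() and json_last_error_msg() are standard PHP functions:

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// decodes a response body and explains why decoding failed, if it did.
function decodeJsonBody($body): array
{
    // curl_exec() returns false on failure; an empty string means no content.
    if ($body === false || $body === '') {
        return ['data' => null, 'problem' => 'empty response (request failed or no body)'];
    }
    $data = json_decode($body, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        return ['data' => null, 'problem' => 'not JSON: ' . json_last_error_msg()];
    }
    return ['data' => $data, 'problem' => null];
}
```

Calling decodeJsonBody($result) in place of json_decode($result) and printing the 'problem' entry would show whether the server sent nothing or sent something other than JSON.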
I use this library:
https://github.com/zrashwani/arachnid
and I call it like this:
$url = "www.google.com";
$crawler = new \Arachnid\Crawler($url, 2);
$crawler->traverse();
I run this PHP script from cron.
There is one URL that I want to fetch from cron,
but it gives me a blank page.
How do I handle that with this library?
How do I add this code to my code:
$userAgent = "IE 7 – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
Crawling http://www.mfinante.ro/infocodfiscal.html?cod=299 is not working.
It's getting redirected to some other location. But why?
<?php
$url = 'http://www.mfinante.ro/infocodfiscal.html?cod=299';
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_ENCODING, "");
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
$html = curl_exec($curl);
$redirectURL = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL);
curl_close($curl);
echo $html;
?>
I'm unable to understand why this is happening.
You could use htmlspecialchars() to view the source code of the response:
echo htmlspecialchars($html);
It's likely that there is a JavaScript or meta redirect in there somewhere. My JS is so poor I can't really help you with that.
If you can find it, you can build a regular expression to extract the URL and then fetch its contents.
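As a rough illustration of that last step, here is a sketch that looks for the two most common client-side redirects in the returned HTML. The helper name findClientSideRedirect and both patterns are assumptions for illustration; they only cover a meta refresh with http-equiv before content and a simple quoted location assignment, so a real page may need a different pattern:

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the target of a meta refresh or a simple JS location assignment,
// or null when neither pattern is found.
function findClientSideRedirect(string $html): ?string
{
    // <meta http-equiv="refresh" content="0; url=http://example.com/next">
    $meta = '~<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*content=["\'][^"\']*url=([^"\'>\s]+)~i';
    if (preg_match($meta, $html, $m) === 1) {
        return html_entity_decode($m[1]);
    }
    // window.location = "..."; location.href = '...';
    $js = '~(?:window\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']~i';
    if (preg_match($js, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}
```

The returned value may be a relative URL, in which case it has to be resolved against the page's own URL before fetching it with a second cURL request.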