PHP curl redirect moved permanently - php

I'm downloading blog posts for analysis and after 10 pages of results I'm getting a strange redirect to the site's homepage rather than to the 10th page of results. Going to the 10th page in my browser works just fine.
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, "http://www.russellmoore.com/category/article/page/10");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
$status = curl_getinfo($ch);
print_r($status);
Executing this code redirects my script to http://www.russellmoore.com/.

As Daren pointed out, removing the user agent worked. However, because another blog I was downloading from required a user agent, I changed it to:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
This solution worked for both blogs.

Related

web scraping does not work only on this site

I'm using the same code to get the price of different web pages (7 in particular), all work perfect, but in 1 I can not get any data, could you tell me if it is impossible, if the page has any protection? Thanks in advance.
$source = file_get_contents("https://www.cyberpuerta.mx/Computo-Hardware/Discos-Duros-SSD-NAS/Discos-Duros-Internos-para-PC/Disco-Duro-Interno-Western-Digital-Caviar-Blue-3-5-1TB-SATA-III-6-Gbit-s-7200RPM-64MB-Cache.html");
preg_match("'<span class=\"priceText\">(.*?)</span>'", $source, $price);
echo $price[1];
I hope this result:
$869.00
This code only works badly on the website shown in the code.
Use curl with an agent set, this usually tricks the website protections to believe it's a true user.
$URL = "https://www.cyberpuerta.mx/Computo-Hardware/Discos-Duros-SSD-NAS/Discos-Duros-Internos-para-PC/Disco-Duro-Interno-Western-Digital-Caviar-Blue-3-5-1TB-SATA-III-6-Gbit-s-7200RPM-64MB-Cache.html";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, $URL);
$result =curl_exec($ch);
preg_match("'<span class=\"priceText\">(.*?)</span>'", $result, $price);
echo $price[1];

Why curl return 400 bad request when i try to get page content?

i am trying to get web page content with curl from some websites but they return 400 bad request ( file_get_contents return empty ) here's the function i am using :
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Put error_reporting(E_ALL); line at the top file where you are calling this function.
It will generate the cause of an error.

Curl return empty in php

$loginUrl = 'http://mp3.zing.vn/json/song/get-source/ZmJmTknNCBmLNzHtZbxtvmLH';
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$loginUrl);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$result=curl_exec($ch);
curl_close($ch);
var_dump(json_decode($result));
I have a problem to get the data using curl operation. If i use the url only in my browser then it returns the data but here i using var_dump its null. I have consult some post in stackoverflow but i cant sovle this problem.
Where i do some mistake, please help my. Thanks
The URL is invalid, i.e. the path mentioned as the variable $loginURL doesnot exist.
loginUrl = 'http://mp3.zing.vn/json/song/get-source/ZmJmTknNCBmLNzHtZbxtvmLH';

how run with \Arachnid\Crawler in URL with CURL

i use that:
https://github.com/zrashwani/arachnid
and i do that:
$url = "www.google.com";
$crawler = new \Arachnid\Crawler($url, 2);
$crawler->traverse();
i run that with cron in php
and i have a URL that i can to come in to him with cron
that give me a blnk page
how i do with that apps a
how i add that code to my code:
$userAgent = "IE 7 – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);

Scraping a website with cURL request not reading the HTML code

Crawling http://www.mfinante.ro/infocodfiscal.html?cod=299 is not working.
It's getting redirected to some other location. But why?
<?php
$url = 'http://www.mfinante.ro/infocodfiscal.html?cod=299';
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_ENCODING ,"");
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
$html = curl_exec($curl);
$redirectURL = curl_getinfo($curl,CURLINFO_EFFECTIVE_URL );
curl_close($curl);
echo $html;
?>
I'm unable to understand why this happening.
You could use htmlspecialchars() to get the source code of the response
echo htmlspecialchars($html);
It's likely that there is a javascript or meta redirect in there somewhere. My JS is so poor i can't really help you with that.
If you can find that, you can build a regular expression to find the URL and then fetch it's contents.

Categories