Web scraping fails only on this one site - PHP

I'm using the same code to get the price from several web pages (7 in total). They all work perfectly, but on one of them I can't get any data. Could you tell me whether it's impossible, or whether the page has some kind of protection? Thanks in advance.
$source = file_get_contents("https://www.cyberpuerta.mx/Computo-Hardware/Discos-Duros-SSD-NAS/Discos-Duros-Internos-para-PC/Disco-Duro-Interno-Western-Digital-Caviar-Blue-3-5-1TB-SATA-III-6-Gbit-s-7200RPM-64MB-Cache.html");
preg_match("'<span class=\"priceText\">(.*?)</span>'", $source, $price);
echo $price[1];
I expect this result:
$869.00
The code only fails on the website shown above.

Use cURL with a user agent set; this usually tricks the website's protection into believing the request comes from a real browser.
$URL = "https://www.cyberpuerta.mx/Computo-Hardware/Discos-Duros-SSD-NAS/Discos-Duros-Internos-para-PC/Disco-Duro-Interno-Western-Digital-Caviar-Blue-3-5-1TB-SATA-III-6-Gbit-s-7200RPM-64MB-Cache.html";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, $URL);
$result = curl_exec($ch);
curl_close($ch);
preg_match("'<span class=\"priceText\">(.*?)</span>'", $result, $price);
echo $price[1];

Related

Why does cURL return 400 Bad Request when I try to get the page content?

I am trying to get web page content with cURL from some websites, but they return 400 Bad Request (file_get_contents returns empty). Here's the function I am using:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Put an error_reporting(E_ALL); line at the top of the file where you call this function. It will surface the cause of the error.
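For reference, a minimal debugging sketch along those lines; the URL is a placeholder, and the curl_error() check is an addition of mine, not part of the original answer:
<?php
error_reporting(E_ALL); // surface every notice and warning while debugging
ini_set('display_errors', '1');

$url = 'https://example.com/'; // placeholder URL
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
if ($data === false) {
    // curl_error() reports why the transfer failed (DNS, SSL, timeout, ...)
    echo 'cURL error: ' . curl_error($ch) . PHP_EOL;
}
echo 'HTTP status: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
curl_close($ch);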

cURL returns empty in PHP

$loginUrl = 'http://mp3.zing.vn/json/song/get-source/ZmJmTknNCBmLNzHtZbxtvmLH';
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$loginUrl);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$result=curl_exec($ch);
curl_close($ch);
var_dump(json_decode($result));
I have a problem getting the data with this cURL operation. If I open the URL in my browser it returns the data, but here var_dump() shows null. I have consulted some posts on Stack Overflow but I can't solve this problem.
Where did I make a mistake? Please help me. Thanks.
The URL is invalid, i.e. the path assigned to the variable $loginUrl does not exist:
$loginUrl = 'http://mp3.zing.vn/json/song/get-source/ZmJmTknNCBmLNzHtZbxtvmLH';
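One way to confirm this from PHP is to check the HTTP status code the server returns for that path; a minimal sketch (not part of the original answer):
$loginUrl = 'http://mp3.zing.vn/json/song/get-source/ZmJmTknNCBmLNzHtZbxtvmLH';
$ch = curl_init($loginUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true); // HEAD-style request, we only need the status
curl_exec($ch);
// Anything >= 400 (e.g. 404) confirms the path no longer exists.
echo curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);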

Error handling in a PHP function

How do I handle NULL returns from a function?
The code below is a basic cURL function. I find there are times the result for $url will be NULL, for example if a website goes offline for some reason or a user types in a wrong URL. In these instances I get the error "Call to a member function on null".
How do I return an empty result instead of a null result and stop the user from seeing this error?
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
One potential avenue is to add:
$headers = curl_getinfo($ch);
and then check the status code before using the response:
if ($headers['http_code'] < 400) {
    $data = "whatever you need it to be...";
}
You can expand this to handle 3xx redirects as necessary.
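Putting that together, a sketch of the same helper that returns an empty string instead of false/NULL when the transfer fails or the server answers with an error status (the < 400 threshold follows the suggestion above; it is not a hard rule):
function file_get_contents_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    // curl_getinfo() must be read before the handle is closed.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($data === false || $status >= 400) {
        return ''; // empty result instead of NULL/false on failure
    }
    return $data;
}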

Scraping a website with a cURL request is not reading the HTML code

Crawling http://www.mfinante.ro/infocodfiscal.html?cod=299 is not working.
It's getting redirected to some other location. But why?
<?php
$url = 'http://www.mfinante.ro/infocodfiscal.html?cod=299';
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_ENCODING ,"");
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
$html = curl_exec($curl);
$redirectURL = curl_getinfo($curl,CURLINFO_EFFECTIVE_URL );
curl_close($curl);
echo $html;
?>
I'm unable to understand why this is happening.
You could use htmlspecialchars() to inspect the raw source code of the response:
echo htmlspecialchars($html);
It's likely that there is a JavaScript or meta redirect in there somewhere. My JS is too poor to help you with that part.
If you can find it, you can build a regular expression to extract the URL and then fetch its contents.
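A minimal sketch of that idea, assuming the redirect is a standard <meta http-equiv="refresh"> tag rather than JavaScript (the pattern would need adjusting otherwise):
// Look for something like: <meta http-equiv="refresh" content="0;url=http://...">
if (preg_match('/<meta[^>]+http-equiv=["\']refresh["\'][^>]+url=([^"\'>]+)/i', $html, $m)) {
    $redirectUrl = trim($m[1]);
    // Fetch the real target with the same options as before.
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $redirectUrl);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_USERAGENT, $agent);
    $html = curl_exec($curl);
    curl_close($curl);
}
echo htmlspecialchars($html);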

cURL Bad Request 400 when submitting from a web server

I'm trying to submit a form to an .aspx page with cURL and then do something with the response. The problem is that my code works when I submit it from my local XAMPP server, but when submitted from the web server I get "HTTP Error 400. The request URL is invalid."
I tried removing the CURLOPT_POST option (a suggestion I found somewhere on SO). I also tried URL-encoding, but then I get nothing.
$url = "http://www.somepage.com/locations/default.aspx#location_page_map";
$kv[]='search=92627';
$kv[]='__VIEWSTATE';
$kv[]='__EVENTTARGET';
$query_string = join("&", $kv);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_POST, count($kv));
curl_setopt($ch, CURLOPT_POSTFIELDS, $query_string);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
$output = curl_exec($ch);
var_dump($output);
curl_close($ch);
You can actually leave out __VIEWSTATE and __EVENTTARGET; they are most likely something to do with ASP's form value persistence. You can also remove the #location_page_map fragment, as that just focuses the page on the map section, so it will not affect the results from the service/site you're trying to scrape. Then use http_build_query() to turn the array into a string for cURL.
<?php
//$url = "http://www.myfitfoods.com/locations/default.aspx#location_page_map";
$url = "http://www.somepage.com/locations/default.aspx#location_page_map";
$kv['search'] = '92627';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_POST, count($kv));
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($kv));
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
$output = curl_exec($ch);
var_dump($output);
curl_close($ch);
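As a side note, CURLOPT_POST expects a boolean, so passing true is clearer than count($kv), which merely happens to be truthy here:
curl_setopt($ch, CURLOPT_POST, true);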
You haven't defined your $kv array properly. cURL will take an array, but it has to be in key => value format. All you've provided is 3 values, e.g. you'd actually be passing:
=search%3D92627&=__VIEWSTATE&=__EVENTTARGET
^--no key       ^--no key    ^--no key
Try:
$kv = array(
    'search' => 92627,
    'x' => '__VIEWSTATE',
    'y' => '__EVENTTARGET'
);
curl_setopt($ch, CURLOPT_POSTFIELDS, $kv);
or similar instead.
