Why am I not getting the whole HTML when scraping with cURL? - php

For example, I tried to scrape the meta tags for yebhi.com, and for some pages the result comes back as null.
I'm using the following code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.yebhi.com/253196/PD/Tech-Graphic-Tee-81703611.htm');
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4');
$data = curl_exec($ch);
curl_close($ch);
I am not getting the complete HTML. What am I doing wrong?

Related

get_meta_tags http request failed 403 forbidden

When I do:
$tags = get_meta_tags('http://example.com');
I get the error "http request failed 403 forbidden", but when I open the site in a browser everything is fine (status code 200). Maybe I need to set the user agent? How can I do that?
You can do it with cURL. Here's an example:
$user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36';
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, 'https://example.com');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
You can set the user agent and retrieve the meta information:
ini_set('user_agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:7.0.1) Gecko/20100101 Firefox/7.0.1');
$meta_tags = get_meta_tags('http://www.example.com');
It will return an array of all meta tags.
For more information, please refer to the PHP manual.
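The cURL answer above stops at returning the page body, while get_meta_tags() expects a URL or file name rather than a string. Here is a minimal sketch of bridging the two, assuming the HTML is already in a string and PHP's DOM extension is available. extract_meta_tags() is a hypothetical helper name, not a PHP built-in, and the sample markup is made up:

```php
<?php
// Hypothetical helper: collect <meta name="..." content="..."> pairs
// from an HTML string (for example, the $data returned by curl_exec()).
function extract_meta_tags(string $html): array
{
    $tags = [];
    if ($html === '') {
        return $tags;
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Lower-case the name so "Description" and "description" agree.
        $name = strtolower($meta->getAttribute('name'));
        if ($name !== '') {
            $tags[$name] = $meta->getAttribute('content');
        }
    }
    return $tags;
}

// Sample markup standing in for a downloaded page.
$sample = '<html><head><meta name="Description" content="A demo page">'
        . '<meta name="keywords" content="php,curl"></head><body></body></html>';
print_r(extract_meta_tags($sample));
```

An empty array from this helper suggests the response itself lacked the tags (for example, an error page), which is worth checking before blaming the parser.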

A website URL is not loading with Curl php

I am using PHP cURL to fetch data from a remote site. My script is:
<?php
$url = 'https://www.(url).com/';
$sleep = rand(10, 12);
sleep($sleep);
$agent= 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','accept-encoding:gzip, deflate, sdch','accept:image/webp,image/*,*/*;q=0.8'));
curl_setopt($ch, CURLOPT_PROXY, "x.x.x.x:x");
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
$mainPage = new simple_html_dom;
echo $mainPage->load($result);
But it returns a 403 Forbidden error in the response.
I also tried more advanced user agent strings, but I still get this error.
Thanks in advance for suggestions and comments.

PHP Curl not executing

I am trying to retrieve the HTML from a user profile on Instagram using cURL.
I am new to cURL, so I do not know the cause of this error.
Nothing happens when the cURL request is executed; the page just seems to refresh.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.instagram.com/zohebchaudhry1/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookiess.txt');
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookiess.txt');
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
$html = curl_exec($ch);
curl_close($ch);
echo $html;
Above is the PHP cURL code.
It appears that cURL is working; you just can't see the markup, because echoing raw HTML makes the browser render it rather than display it as text.
I suggest replacing echo $html; with echo htmlentities($html);
Read more: php.net/htmlentities
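To illustrate the suggestion, here is a short sketch with a hard-coded string standing in for the cURL response (the sample markup is an assumption, not Instagram's actual output):

```php
<?php
// Stand-in for the string returned by curl_exec().
$html = '<h1>Profile</h1><p>42 posts</p>';

// Echoing $html directly would make the browser render a heading and a paragraph.
// Escaping it first makes the browser show the source text instead.
$escaped = htmlentities($html);
echo $escaped;
// prints: &lt;h1&gt;Profile&lt;/h1&gt;&lt;p&gt;42 posts&lt;/p&gt;
```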

simple_html_dom: trying to find height in google search

Can anyone explain what is wrong with the code and how I can get the height value? I am trying to get the height of celebrities. Any suggestions?
Thanks.
My code (Updated with CURL user agent setting as advised):
$url='https://www.google.com/webhp?ie=UTF-8#q=ailee+height';
//Set CURL user agent
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
curl_close($ch);
//simple html dom
require_once('lib/simple_html_dom.php');
$html = str_get_html($data);
$height= $html->find('div[class="_eF"]',0)->innertext;
echo $height;
I get an empty result from the above code. In this case, I want to return:
5' 5" (1.65 m)
The problem is that cURL doesn't process JavaScript, and Google serves a different page when JavaScript is disabled. In this case, the div changes to a span with a different class:
<span class="_m3b">1.65 m</span>
Also, the link you were using wasn't working for me, most likely because the query sits after the #, and a URL fragment is never sent to the server.
Try this instead:
<?php
header('Content-Type: text/html; charset=utf-8');
$url='https://www.google.pt/search?q=ailee+height&num=10&gbv=1';
//Set CURL user agent
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
curl_close($ch);
require_once('simple_html_dom.php');
$html = str_get_html($data);
$height= $html->find('span[class="_m3b"]',0)->innertext;
echo $height;
//1.65 m
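Class names such as _m3b are generated and can change, and chaining ->innertext onto a find() that matched nothing fails on a null value. Here is a minimal sketch of a null-safe lookup. It swaps in PHP's built-in DOMDocument and DOMXPath for simple_html_dom so the example is self-contained. first_text_by_class() is a hypothetical helper, and the sample markup is the span quoted above:

```php
<?php
// Hypothetical helper: return the text of the first element carrying the
// given class, or null when nothing matches.
// Assumes $class is a plain class name without quotes or spaces.
function first_text_by_class(string $html, string $class): ?string
{
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside the class attribute.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$sample = '<div><span class="_m3b">1.65 m</span></div>';
var_dump(first_text_by_class($sample, '_m3b')); // the span's text
var_dump(first_text_by_class($sample, '_eF'));  // no match, so null
```

Checking for null before echoing makes it obvious when Google has changed its markup, instead of failing on a missing element.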

php curl to read data from webpage

I have been given a project that involves fetching data from this URL.
Simple HTML DOM has already failed for this, so I am working with:
function curl_download($Url){
    if (!function_exists('curl_init')){
        die('Sorry cURL is not installed!');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_REFERER, "www.idealo.de/preisvergleich/MainSearchProductCategory.html?q=0018208925063");
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
print curl_download('www.idealo.de/preisvergleich/MainSearchProductCategory.html?q=0018208925063');
This code returns a blank page. Can anyone please help me?
The reason is that the user agent you used is too short to look like a real browser.
Try this one instead:
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.38");
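A blank page can also come from a transport error or a non-200 status, so it may help to look at what cURL reports before changing headers. Here is a minimal sketch, assuming the cURL extension is loaded. describe_curl_result() and fetch_with_diagnostics() are hypothetical helper names, and the user agent is the one suggested above:

```php
<?php
// Hypothetical helper: turn what cURL reported into a one-line diagnosis.
function describe_curl_result($body, int $errno, string $error, int $httpCode): string
{
    if ($body === false || $errno !== 0) {
        return "cURL error $errno: $error";
    }
    if ($httpCode >= 400) {
        return "HTTP $httpCode: the server refused or failed the request";
    }
    if ($body === '') {
        return "HTTP $httpCode with an empty body";
    }
    return "HTTP $httpCode, " . strlen($body) . ' bytes received';
}

// Hypothetical wrapper: performs the request and returns [body, diagnosis].
function fetch_with_diagnostics(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.38');
    $body = curl_exec($ch);
    $report = describe_curl_result(
        $body,
        curl_errno($ch),
        curl_error($ch),
        (int) curl_getinfo($ch, CURLINFO_HTTP_CODE)
    );
    curl_close($ch);
    return [$body, $report];
}

// Offline examples with made-up values, so no request is sent here.
echo describe_curl_result(false, 28, 'Operation timed out', 0), PHP_EOL;
echo describe_curl_result('', 0, '', 403), PHP_EOL;
echo describe_curl_result('<html></html>', 0, '', 200), PHP_EOL;
```

In real use, something like [$html, $report] = fetch_with_diagnostics('https://...'); followed by logging $report would show whether the blank page is an error, a block, or a genuinely empty response.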
