Scraping a secure page (HTTPS) in PHP

I'm trying to crawl a secure page (HTTPS) such as Google with cURL, but I seem to get no data back from my crawler.

PHP function:
include 'simple_html_dom.php'; // Simple HTML DOM library, needed for simple_html_dom below

function getDOM($url){
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RANGE, '0-100'); // fetch only the first 100 bytes
    $content = curl_exec($ch);
    curl_close($ch);
    echo $url."<br>";
    echo $content;
    $dom = new simple_html_dom();
    $dom->load($content);
    if($dom){
        return $dom;
    }
    return null;
}
getDOM("https://www.google.co.uk/search?sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8&q=crawling%20https#hl=en&gs_nf=1&pq=site:stackoverflow.com%20crawling%20https%20php&cp=6&gs_id=s&xhr=t&q=stackoverflow&pf=p&sclient=psy-ab&oq=stacko&aq=0&aqi=g4&aql=&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=8baefeb740f734a5&biw=1280&bih=685");
Is there anything I can do to crawl an HTTPS page? I don't seem to have this problem with normal (HTTP) pages.

curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
Add this to your code. It allows any certificate to pass verification, so it should be fine for your use case (but it's not a good idea in general, since it leaves the connection open to man-in-the-middle attacks).
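If you need to keep verification on instead, here is a minimal sketch pointing cURL at an explicit CA bundle (the bundle path is an assumption; use the cacert.pem from the curl project or your system's bundle):
<?php
// Keep SSL verification enabled and supply a CA bundle explicitly.
// '/path/to/cacert.pem' is a placeholder, not a real path.
$ch = curl_init('https://www.google.co.uk/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 1); // verify the peer certificate
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); // verify the hostname matches the certificate
curl_setopt($ch, CURLOPT_CAINFO, '/path/to/cacert.pem');
$content = curl_exec($ch);
if ($content === false) {
    echo 'cURL error: ' . curl_error($ch);
}
curl_close($ch);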

Related

Retrieving a 3rd-party webpage with cURL/PHP - not entirely working

I am writing a tool that accesses a set of external website pages. Here is my test code to see if I can retrieve the page:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    /* curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); */
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$returned_content = get_data('http://www.imdb.com/');
echo $returned_content;
The thing is, when I pass Google (for example) as the URL, I get the Google homepage in my browser (sans images, for obvious reasons), but when I pass the site I actually want, www.imdb.com, I get nothing. Why is this, and what can I do about it?
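No answer was posted for this one, but a reasonable first step, sketched below as an assumption rather than a confirmed fix, is to ask cURL why the transfer came back empty: check curl_error() and the HTTP status code, and follow redirects, since many sites answer a bare request with a redirect that cURL does not follow by default.
<?php
// Diagnostic sketch: report the failure instead of silently
// returning an empty string.
$ch = curl_init('http://www.imdb.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 redirects
$data = curl_exec($ch);
if ($data === false) {
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo 'HTTP status: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
curl_close($ch);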

Why does Instagram return blank to a cURL request?

I wrote the following code to get HTML data from a URL. It works for HTTPS sites such as Facebook, but not for Instagram: Instagram returns a blank response.
<?php
$url = 'https://www.instagram.com';
$returned_content = get_data($url);
print_r($returned_content);

/* gets the data from a URL */
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
?>
Instagram returns only JavaScript, which your browser can't render, because the scripts use paths relative to the document: <script src='/path/file.js'> resolves against your own host, so the browser requests localhost/path/file.js instead of instagram.com/path/file.js. That file doesn't exist locally, so the page stays blank.
One solution is to get the full HTML instead of the JavaScript, and you can use the User-Agent header for this trick. Search-engine crawlers don't execute JS, so Instagram (and many other websites) serve a pre-rendered page without JS to anything that identifies itself as a bot.
So, add this:
curl_setopt($ch, CURLOPT_USERAGENT, "ABACHOBot");
The "ABACHOBot" is one Crawler. In this page you can found many others alternatives, like a "Baiduspider", "BecomeBot"...
You can use "generic" user-agent too, like "bot", "spider", "crawler" and probably will work too.
Here, try this:
<?php
$url = 'https://www.instagram.com';
$returned_content = get_data($url);
print_r($returned_content);

/* gets the data from a URL */
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Updated: identify as a crawler and relax certificate checks
    curl_setopt($ch, CURLOPT_USERAGENT, 'spider');
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_HEADER, false);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
?>
You should pass
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
and the other header options as above.
For more detail, please see
http://stackoverflow.com/questions/4372710/php-curl-https

How to make a call to an HTTPS .aspx page from a PHP script on my localhost with XAMPP?

I am trying to send an SMS from my localhost with XAMPP installed.
The requested page is an .aspx page served over HTTPS.
I am getting the error "HTTP Error 400. The request is badly formed.", or in some cases just a blank page.
Details are as follows:
$url = 'https://www.ismartsms.net/iBulkSMS/HttpWS/SMSDynamicAPI.aspx';
$postArgs = 'UserId='.$username.
    '&Password='.$password.
    '&MobileNo='.$destination.
    '&Message='.$text.
    '&PushDateTime='.$PushDateTime.
    '&Lang='.$Lang;
function getSslPage($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_REFERER, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$response = getSslPage($all);
echo "<pre>";
print_r($response); exit;
I tried every possible solution/combination found on the internet but could not resolve it. The API developers do not have an example for a PHP script.
I tried the httpful PHP library and the file_get_contents function but get an empty page. I also tried every combination of curl_setopt options.
I need to call this URL without any post data and see the response from it. Instead I am getting a blank page.
Please note that when I execute the URL with all the details in a browser, it works fine.
Can anybody help me in this regard?
Thank you,
Usman
First, urlencode each value as follows:
$postArgs = 'UserId='.urlencode($username).
    '&Password='.urlencode($password).
    '&MobileNo='.urlencode($destination).
    '&Message='.urlencode($text).
    '&PushDateTime='.urlencode($PushDateTime).
    '&Lang='.urlencode($Lang);
After that there are two possible solutions. One is using GET:
curl_setopt($ch, CURLOPT_URL, $url . "?" . $postArgs);
The second option is using the POST method:
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postArgs);
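As an aside, PHP's built-in http_build_query() does the encoding and joining in one step; a minimal sketch, assuming the same variables as above:
<?php
// http_build_query() URL-encodes each value and joins pairs with '&'.
$postArgs = http_build_query(array(
    'UserId'       => $username,
    'Password'     => $password,
    'MobileNo'     => $destination,
    'Message'      => $text,
    'PushDateTime' => $PushDateTime,
    'Lang'         => $Lang,
));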

Unable to use file_get_contents(), returns nothing

I'm trying to get some data from a website that is not mine, using this code.
<?php
$text = file_get_contents("https://ninjacourses.com/explore/4/");
echo $text;
?>
However, nothing is echoed, and the string length is 0.
I've used this method before and it worked with no problem, but with this website it is not working at all.
Thanks!
I managed to get the contents using curl like this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, "https://ninjacourses.com/explore/4/");
$result = curl_exec($ch);
curl_close($ch);
cURL is a way to hit a URL from your code and get the HTML response back. cURL stands for "client URL"; it lets you connect to other URLs and use their responses in your code.
I think the curl-with-php guide and other similar references will be useful for you.
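For completeness, file_get_contents() itself can often be made to work by passing a stream context. The thread doesn't establish why the plain call fails here, so this is only a sketch of common fixes (the user-agent string and the relaxed SSL options are assumptions for debugging):
<?php
// Sketch: file_get_contents() with an explicit stream context.
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'Mozilla/5.0 (compatible; MyFetcher/1.0)', // placeholder UA
        'timeout'    => 5,
    ),
    'ssl' => array(
        'verify_peer'      => false, // insecure; for debugging only
        'verify_peer_name' => false,
    ),
));
$text = file_get_contents("https://ninjacourses.com/explore/4/", false, $context);
var_dump(strlen($text));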

cURL unable to download webpages

I am trying to open the homepages of websites and extract the title and description from their HTML markup using cURL with PHP. I am successful in doing this to an extent, but there are many websites I am unable to open. My code is here:
function curl_download($Url){
    if (!function_exists('curl_init')){
        die('Sorry, cURL is not installed!');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

// $url is any url
$source = curl_download($url);
$d = new DOMDocument();
$d->loadHTML($source);
$title = $d->getElementsByTagName("title")->item(0)->textContent;
$domx = new DOMXPath($d);
$desc = $domx->query("//meta[@name='description']")->item(0);
$description = $desc->getAttribute('content');
This code works fine for most websites, but there are many it is not even able to open. What could be the reason?
When I try getting the headers of those websites using the get_headers function, it works fine, but the pages cannot be opened using cURL. Two of these websites are blogger.com and live.com.
Replace:
$output = curl_exec($ch);
with:
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSLVERSION, 3);
$output = curl_exec($ch);
if (!$output) {
    echo curl_error($ch);
}
and see why cURL is failing.
It's a good idea to always check the result of function calls to see if they succeeded or not, and to report when they fail. While a function may work 99.999% of the time, you need to report the times it fails, and why, so the underlying cause can be identified and fixed, if possible.
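Along those lines, here is a minimal sketch of the download function with that error reporting folded in (the structure is an assumption; it just applies the advice above):
<?php
// Sketch: curl_download() with explicit error reporting.
// Returns null on failure instead of an empty string.
function curl_download_checked($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    if ($output === false) {
        // curl_errno()/curl_error() report exactly what went wrong
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
        $output = null;
    }
    curl_close($ch);
    return $output;
}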
