How to scrape an SSL or HTTPS URL - php

I have written a function to scrape a website using cURL, but it returns nothing when called and I can't understand why. The output is empty.
<?php
function scrape($url)
{
$headers = Array(
"Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
"Cache-Control: max-age=0",
"Connection: keep-alive",
"Keep-Alive: 300",
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
"Accept-Language: en-us,en;q=0.5",
"Pragma: "
);
$config = Array(
CURLOPT_RETURNTRANSFER => TRUE ,
CURLOPT_FOLLOWLOCATION => TRUE ,
CURLOPT_AUTOREFERER => TRUE ,
CURLOPT_CONNECTTIMEOUT => 120 ,
CURLOPT_TIMEOUT => 120 ,
CURLOPT_MAXREDIRS => 10 ,
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8" ,
CURLOPT_URL => $url ,
) ;
$handle = curl_init() ;
curl_setopt_array($handle,$config) ;
curl_setopt($handle,CURLOPT_HTTPHEADER,$headers) ;
$data = curl_exec($handle) ;
curl_close($handle) ;
return $data ;
}
echo scrape("https://www.google.com") ;
?>

There are two possible fixes when scraping an SSL/HTTPS URL: a quick fix and a proper fix.
The quick fix first.
Warning: this disables certificate verification and reintroduces the security issues (such as man-in-the-middle attacks) that SSL is designed to protect against.
Set: CURLOPT_SSL_VERIFYPEER => false
The second, and proper, fix is to set three options:
CURLOPT_SSL_VERIFYPEER => true
CURLOPT_SSL_VERIFYHOST => 2
CURLOPT_CAINFO => getcwd() . '\CAcert.pem'
The last thing you need to do is download the CA certificate bundle.
Go to http://curl.haxx.se/docs/caextract.html, click 'cacert.pem', copy and paste the text into a text editor, and save the file as 'CAcert.pem'. Check that it isn't saved as 'CAcert.pem.txt'.
<?php
function scrape($url)
{
$headers = Array(
"Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
"Cache-Control: max-age=0",
"Connection: keep-alive",
"Keep-Alive: 300",
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
"Accept-Language: en-us,en;q=0.5",
"Pragma: "
);
$config = Array(
CURLOPT_SSL_VERIFYPEER => true,
CURLOPT_SSL_VERIFYHOST => 2,
CURLOPT_CAINFO => getcwd() . '\CAcert.pem',
CURLOPT_RETURNTRANSFER => TRUE ,
CURLOPT_FOLLOWLOCATION => TRUE ,
CURLOPT_AUTOREFERER => TRUE ,
CURLOPT_CONNECTTIMEOUT => 120 ,
CURLOPT_TIMEOUT => 120 ,
CURLOPT_MAXREDIRS => 10 ,
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8" ,
CURLOPT_URL => $url
) ;
$handle = curl_init() ;
curl_setopt_array($handle,$config) ;
curl_setopt($handle,CURLOPT_HTTPHEADER,$headers) ;
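$output = new stdClass() ; // create the result object before assigning properties
$output->data = curl_exec($handle) ; // run the request once and reuse the result
if($output->data === false) {
$output->error = 'Curl error: ' . curl_error($handle);
} else {
$output->error = 'Operation completed without any errors';
}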
curl_close($handle) ;
return $output ;
}
$scrape = scrape("https://www.google.com") ;
echo $scrape->data;
//uncomment for errors
//echo $scrape->error;
?>
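
The '\CAcert.pem' path used above only resolves on Windows, and a wrong path is a common reason the proper fix still fails. The following is a minimal sketch, assuming the CA bundle was saved in a known directory; tls_options is a hypothetical helper name, not part of cURL. It builds the same three options with a portable separator and fails early if the bundle cannot be read.

```php
<?php
// Hypothetical helper: returns the three TLS options from the answer above,
// built with a portable path. Throws if the CA bundle is missing so the
// problem shows up before curl_exec() is ever called.
function tls_options(string $dir, string $file = 'CAcert.pem'): array
{
    $path = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $file;
    if (!is_readable($path)) {
        throw new RuntimeException("CA bundle not found or unreadable: $path");
    }
    return [
        CURLOPT_SSL_VERIFYPEER => true,  // verify the server certificate chain
        CURLOPT_SSL_VERIFYHOST => 2,     // verify the certificate matches the host name
        CURLOPT_CAINFO         => $path, // CA bundle used for verification
    ];
}
```

One possible way to use it is curl_setopt_array($handle, tls_options(__DIR__) + $config); the + operator keeps the integer option keys intact.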

Related

Trying to get html response

I was trying to get an HTML response using PHP, but I always get something like this:
screenshot [picture hidden from post due to binary characters]
<?php
$header_request = array (
"accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding: gzip",
"accept-language: id-ID,id;q=0.9,en-US;q=0.8,en;q=0.7",
"cookie: csrf_cookie_name=9316014c9d7860019da66a78edfaf926; _data_pop=115-1_274-1; ci_session=607f0be4e56b8b08ee2398b892f115c9e660192e; _data_cpc=1-2_15-2_190-4",
"referer: https://ptc4btc.com/",
"user-agent: Mozilla/5.0 (Linux; Android 7.0; Moto C Plus) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.56 Mobile Safari/537.36",
"upgrade-insecure-requests: 1"
);
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => "https://ptc4btc.com/dashboard",
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_FOLLOWLOCATION => 1,
CURLOPT_HTTPHEADER => $header_request,
CURLOPT_SSL_VERIFYPEER => 0,
CURLOPT_SSL_VERIFYHOST => 2,
));
$exec = curl_exec($ch);
echo($exec);
curl_close($ch); // close the cURL handle, not the response string
You said you were willing to accept gzip data with accept-encoding: gzip in the headers. So decode it:
echo gzdecode($exec);
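
Calling gzdecode unconditionally fails if the server decides not to compress a given response. A minimal sketch of a more defensive variant follows; decode_body is a hypothetical helper name. It decodes only when the body starts with the gzip magic bytes and otherwise returns the body unchanged.

```php
<?php
// Hypothetical helper: gunzip a response body only if it is actually gzip data.
function decode_body(string $body): string
{
    // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        // Fall back to the raw body if the stream turns out to be corrupt.
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}
```

Another option is to drop the manual accept-encoding header and set CURLOPT_ENCODING => '' instead, which asks cURL to advertise every encoding it supports and to decompress the response itself.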

PHP curl extract download link

I'm trying to extract a download link from this website using cURL.
http://abelhas.pt/Asantino/TUTORIAIS/SONY+VEGAS/ZLICED+TRAILER/ZLICED+TRAILER/ZLICED+TRAILER+PRE-RENDERD+FX,36948354.mp4(video)
Opening the link gives you a download button; pressing it calls an AJAX script that returns some JSON data from the following URL
http://abelhas.pt/action/License/Download
This URL expects two POST parameters (fileId (int) and __RequestVerificationToken (str)), and this header needs to be sent as well (X-Requested-With: XMLHttpRequest)
This is the script I'm using to log in and fetch the __RequestVerificationToken and the fileId from the item page
$options = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-language: en\r\n" .
"Cookie:".$cookie.";\r\n" . // check function.stream-context-create on php.net
"User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.102011-10-16 20:23:10\r\n"
)
);
$context = stream_context_create($options);
$html = file_get_html($url,false,$context);
$token = $html->find('input[name=__RequestVerificationToken]',0)->value;
$id = $html->find('a.fileCopyAction',0)->rel;
return array(
'id' => $id,
'token' => $token
);
Followed by this code, which I use to extract the download link (this does not work)
$res = array();
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // do not return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_POSTFIELDS => 'fileId='.$file_id.'&__RequestVerificationToken='.$token,
CURLOPT_HTTPHEADER => array(
'Accept:*/*',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.8',
'Cache-Control: max-age=0',
'Connection: keep-alive',
'Content-Type: application/x-www-form-urlencoded',
'Cookie: '.$cookie,
'Host: abelhas.pt',
'Origin: http://abelhas.pt',
'X-Requested-With: XMLHttpRequest',
),
);
$ch = curl_init("http://abelhas.pt/action/License/Download");
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$res['content'] = $content;
$res['url'] = $header['url'];
return $res;
This returns JSON data, but not what the website returns if you inspect it with Chrome/Firefox.
So the question remains: how do I get the same results as when visiting the page myself in a browser?
Thanks in advance.

PHP curl not getting the intended page content, Firefox does. Possible causes?

When I use curl to fetch a page on an ecommerce site, it always gives me the same front page (ignoring the starting item parameter), whereas when I go to the URL in a browser it works as usual.
Simplified code:
// s is the starting item count, no idea what yp4p_page is for exactly yet.
$url = 'http://list.taobao.com/market/baobao.htm?cat=40&yp4p_page=4&s=176';
$ch = curl_init($url);
$header[0] = 'Accept: text/xml,application/xml,application/xhtml+xml,'
. 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
//$cookieFile = tempnam('/tmp', 'curlcookie');
$cookieFile = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'curlcookies.txt';
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => 'gzip,deflate',
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0 FirePHP/0.6',
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => 120,
CURLOPT_TIMEOUT => 120,
CURLOPT_MAXREDIRS => 10,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_VERBOSE => 1,
CURLOPT_HTTPHEADER => $header,
CURLOPT_COOKIEFILE => $cookieFile,
CURLOPT_COOKIEJAR => $cookieFile,
);
curl_setopt_array($ch, $options);
$strPageHTML = curl_exec($ch);
curl_close($ch);
I'm sorry for the Chinese site, but if you look at the items listed and their URLs as returned by curl, their IDs are always the same as the ones on the front page (where s = 0) when they should be different.
What am I doing wrong?
Edit 1: added cookie to code, still doesn't work.
Edit 2: edited the cookie line to clear any confusion. Also the contents of the cookies are as follows:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
#HttpOnly_.taobao.com TRUE / FALSE 0 cookie2 d686d4be95b4b56b61292118b43e1333
#HttpOnly_.taobao.com TRUE / FALSE 1316321978 _tb_token_ eeab7e3e5ea9e
.taobao.com TRUE / FALSE 1321505978 t 3c473872e51e93b0cf172375b31f503a
.taobao.com TRUE / FALSE 0 uc1 cookie14=UoLdHCGrCsSKAg%3D%3D
.taobao.com TRUE / FALSE 0 v 0
.taobao.com TRUE / FALSE 0 _lang zh_CN:GBK
You should take a look at the cookies generated by the website, or at CSRF tokens that may be inserted to keep you from scraping the page.
When I inspect the webpage at first load, I can find this:
Set-Cookie:cookie2=b1d92ddac8aa82350a6ff5e892a8637d;Domain=.taobao.com;Path=/;HttpOnly
_tb_token_=fde3979ee6b13;Domain=.taobao.com;Path=/;Expires=Sat, 17-Sep-2011 07:09:40 GMT;HttpOnly
t=91f29eb410a21a04bf36025823c4b2ad; Domain=.taobao.com; Expires=Wed, 16-Nov-2011 07:09:40 GMT; Path=/
uc1=cookie14=UoLdHCDBHbn1eg%3D%3D; Domain=.taobao.com; Path=/
Maybe these cookies are used to identify you while navigating through categories.
Searching for "token" in the DOM returned some results too.
Instead of accessing the page by pretending to be a user, is it possible to access the information you require via their api (http://open.taobao.com/)?
This page uses a lot of cookies; I would not be surprised if a session cookie is required to load the page. See what happens when you enable cookie handling:
curl_setopt($DATA_POST, CURLOPT_COOKIEFILE, 'cookiefile.txt');
curl_setopt($DATA_POST, CURLOPT_COOKIEJAR, 'cookiefile.txt');
// s is the starting item count, no idea what yp4p_page is for exactly yet.
$url = 'http://list.taobao.com/market/baobao.htm?cat=40&yp4p_page=4&s=176';
$ch = curl_init($url);
$header[0] = 'Accept: text/xml,application/xml,application/xhtml+xml,'
. 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$cookieFile = "cookie_china"; // I've changed this value and it seems to be working fine, I get the same results
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => 'gzip,deflate',
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0 FirePHP/0.6',
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => 120,
CURLOPT_TIMEOUT => 120,
CURLOPT_MAXREDIRS => 10,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_VERBOSE => 1,
CURLOPT_HTTPHEADER => $header,
CURLOPT_COOKIEFILE => $cookieFile,
CURLOPT_COOKIEJAR => $cookieFile,
);
curl_setopt_array($ch, $options);
$strPageHTML = curl_exec($ch);
curl_close($ch);
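
When debugging cookie problems like this one, it can help to check what the jar file really contains. The sketch below assumes the tab-separated Netscape format that libcurl writes (seven fields per line); read_cookie_jar is a hypothetical helper name. Note that lines starting with #HttpOnly_ are real cookies, not comments.

```php
<?php
// Hypothetical helper: parse the text of a Netscape-format cookie jar
// into an array keyed by cookie name.
function read_cookie_jar(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        // libcurl prefixes HttpOnly cookies with "#HttpOnly_"; keep them.
        $httpOnly = strncmp($line, '#HttpOnly_', 10) === 0;
        if ($httpOnly) {
            $line = substr($line, 10);
        } elseif ($line === '' || $line[0] === '#') {
            continue; // blank line or ordinary comment
        }
        $fields = explode("\t", $line);
        if (count($fields) !== 7) {
            continue; // not a cookie line
        }
        // Fields: domain, subdomain flag, path, secure flag, expiry, name, value
        [$domain, , $path, , $expires, $name, $value] = $fields;
        $cookies[$name] = [
            'domain'   => $domain,
            'path'     => $path,
            'expires'  => (int) $expires, // 0 means a session cookie
            'value'    => $value,
            'httponly' => $httpOnly,
        ];
    }
    return $cookies;
}
```

For example, print_r(read_cookie_jar(file_get_contents($cookieFile))) after a request shows whether cookie2 and _tb_token_ were stored.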

curl .net https page login problem 500 Internal Server Error

When I try to log in to a page with curl I get a 500 Internal Server Error. The page is on HTTPS.
Here is the login function:
function login ($login, $pass) {
$loginURL = "https://bitomat.pl/Account/LogOn";
$loginPage = $this->curl->getPage($loginURL);
$numStart = strpos(htmlentities($loginPage['content']), "type=&quot;hidden&quot; value=&quot;");
$numEnd = strpos(htmlentities($loginPage['content']), "&lt;fieldset&gt;");
$verNum = substr(htmlentities($loginPage['content']), $numStart+36, 64);
echo $numStart." - ".$numEnd." - ".$verNum."<br>";
$page = $this->curl->post($loginURL, "__RequestVerificationToken=".urlencode($verNum)."&UserName=$login&Password=".urlencode($pass)."&RememberMe=false");
echo nl2br(htmlentities($page['content']));
}
In response I get:
HTTP/1.1 500 Internal Server Error
Cache-Control: private
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/7.0
X-AspNetMvc-Version: 2.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Sat, 25 Jun 2011 16:32:18 GMT
Content-Length: 3974
cURL post function:
function post($url, $params) {
$rnd = rand(0, 10000000000);
$options = array(
CURLOPT_COOKIESESSION => true,
CURLOPT_COOKIEFILE => "cookie/cookieBitomat",// . $rnd,
CURLOPT_COOKIEJAR => "cookie/cookieBitomat",// . $rnd,
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => true, // return headers along with the body
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => $params,
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 60, // timeout on connect
CURLOPT_TIMEOUT => 60, // timeout on response
CURLOPT_MAXREDIRS => 20, // stop after 20 redirects
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_VERBOSE => TRUE,
CURLOPT_HTTPHEADER => array("Content-Type: application/x-www-form-urlencoded",
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Host: bitomat.pl",
"Accept-Language: pl,en-us;q=0.7,en;q=0.3",
"Accept-Encoding: gzip, deflate",
"Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.7",
"Keep-Alive: 115",
"Connection: keep-alive",
"Referer: https://bitomat.pl/Account/LogOn"
)
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);
$err = curl_errno($ch);
$errmsg = curl_error($ch);
$header = curl_getinfo($ch);
curl_close($ch);
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header;
}
Usually you get a 500 in cases like this because your HTTP request differs somewhat from the one a browser would send in this situation, and the server-side script makes assumptions about it.
Make sure you inspect a "live" request and mimic it as closely as possible.
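
One way to make that comparison less error-prone is to diff the header names. This is only a sketch; missing_headers is a hypothetical helper, and both input lists are assumed to be arrays of raw "Name: value" lines (one copied from the browser's network tools, one from the script).

```php
<?php
// Hypothetical helper: list header names the browser sent but the script did not.
function missing_headers(array $browserHeaders, array $scriptHeaders): array
{
    // Extract lower-cased header names; lines without a colon are skipped.
    $names = function (array $lines): array {
        $out = [];
        foreach ($lines as $line) {
            $parts = explode(':', $line, 2);
            if (count($parts) === 2) {
                $out[] = strtolower(trim($parts[0]));
            }
        }
        return $out;
    };
    return array_values(array_diff($names($browserHeaders), $names($scriptHeaders)));
}
```

The script-side list can be captured by setting CURLINFO_HEADER_OUT to true before curl_exec and splitting the string returned by curl_getinfo($ch, CURLINFO_HEADER_OUT) on "\r\n".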

PHP Curl 302 authentication with cookies

I am trying to learn to use PHP curl, and it seemed to go well until I tried to authenticate to changeip.com. Here is the function I use to make a curl call:
function request($ch, $url, $params = array())
{
$options = array
(
CURLOPT_URL => $url,
CURLOPT_USERAGENT => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8',
//CURLOPT_COOKIESESSION => TRUE,
CURLOPT_FOLLOWLOCATION => TRUE,
CURLOPT_HEADER => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_SSL_VERIFYPEER => FALSE,
CURLOPT_SSL_VERIFYPEER => FALSE,
CURLINFO_HEADER_OUT => TRUE,
CURLOPT_CONNECTTIMEOUT => 30,
CURLOPT_TIMEOUT => 30,
CURLOPT_MAXREDIRS => 30,
CURLOPT_VERBOSE => TRUE,
CURLOPT_COOKIEJAR => __DIR__ . DIRECTORY_SEPARATOR . 'cookies.txt',
CURLOPT_COOKIEFILE => __DIR__ . DIRECTORY_SEPARATOR . 'cookies.txt',
CURLOPT_HTTPHEADER => array
(
'Host: www.changeip.com',
'Pragma:',
'Expect:',
'Keep-alive: 115',
'Connection: keep-alive',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-us,en;q=0.5',
//'Accept-Encoding: gzip,deflate',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Content-Type: application/x-www-form-urlencoded',
),
);
if(!empty($params['referrer']))
{
$options[CURLOPT_REFERER] = $params['referrer'];
}
if(!empty($params['post']))
{
$options[CURLOPT_POST] = TRUE;
$options[CURLOPT_POSTFIELDS] = $params['post'];
}
curl_setopt_array($ch, $options);
$return = array();
$return['body'] = curl_exec($ch);
$info = curl_getinfo($ch);
//die(var_dump( curl_getinfo($ch, CURLINFO_HEADER_OUT) ));
$return['header'] = http_parse_headers(substr($return['body'], 0, $info['header_size']));
$return['body'] = substr($return['body'], $info['header_size']);
/*if(!empty($return['header']['Location']))
{
$params['referrer'] = $url;
return request($ch, substr($url, 0, strrpos($url, '/')+1) . $return['header']['Location'], $params);
}*/
return $return;
}
And here is the actual call:
// changeip
$ch = curl_init();
// login
$params = array();
$params['post'] = array
(
'p' => 'aaaaaa2',
'u' => 'aaaaaa2',
);
$params['referrer'] = 'https://www.changeip.com/login.asp';
$return = request($ch, 'https://www.changeip.com/loginverify.asp?', $params);
However, this script does not retrieve valid cookies from changeip.com, i.e., it does not authenticate. I have compared the headers sent by curl with those shown by HTTPLiveHeaders, expecting to find a difference, but in the end I didn't find anything. Can anyone advise me on what is missing to make this work?
Commonly asked question:
Is cookies.txt 0777? Yes, and the script does actually create some sort of cookie:
www.changeip.com FALSE / FALSE 0 ACloginAddrs 6
www.changeip.com FALSE / FALSE 0 ASPSESSIONIDCCSSCQRA DNHKGDICMKHFIJADMAPPMHHC
But it isn't a valid cookie.
$options[CURLOPT_POSTFIELDS] = http_build_query($params['post']);
Fixed the issue.
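
A likely reason this helps: when CURLOPT_POSTFIELDS receives an array, cURL sends the body as multipart/form-data, whereas a string is sent as application/x-www-form-urlencoded, which is what the Content-Type header set in request() announces. The short sketch below only shows the string that http_build_query produces for the login fields used above.

```php
<?php
// Same fields as in the login call above.
$post = ['p' => 'aaaaaa2', 'u' => 'aaaaaa2'];

// Encode as application/x-www-form-urlencoded; the separator is passed
// explicitly so the result does not depend on php.ini settings.
$body = http_build_query($post, '', '&');

echo $body, "\n"; // prints: p=aaaaaa2&u=aaaaaa2
```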
