Fetch content of google scholar page with php - php

This is my journal page in Google Scholar:
https://scholar.google.com/citations?user=F4z6guYAAAAJ
I can check the page with browser. But can not get contents by PHP (Curl or File_get_contents)
I tried many headers but was not useful.
Update : My code is here:
$fgc_context = stream_context_create(array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept: text/html,application/xhtml+xml,application/xml\r\n" .
"Accept-Charset: ISO-8859-1,utf-8\r\n" .
"Accept-Encoding: gzip,deflate,sdch\r\n" .
"Accept-Language: en-US,en;q=0.8\r\n",
"timeout" => 60,
'user_agent'=>"user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9\r\n"
)
));
ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9');
$wcnt = #file_get_contents($the_journal_url, false, $fgc_context);
And google return a page ends with:
<H1>Server Error</H1> We're sorry but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue.<p>Please try again later.</p>

Try with this code :
(run it 2 times to create the cookie the first time)
$cookie = __DIR__ . '/cookie.txt';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, 'https://scholar.google.com/citations?user=F4z6guYAAAAJ');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0');
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
echo $data;

Related

How to have multiple CURLOPT_USERAGENT in one query?

I have the following code which currently use only one user agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8.
My question is how I can use multiple user agents in one time? It should change user agent if the current one will not pass.
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8');
$html = curl_exec($ch);
curl_close($ch);
return $html;
}

cURL request does not display any content

I am trying to recover the content of a page with PHP cURL. It works well on other websites, but on this website it does not work and I don't why.
Here is my code :
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://ratings.fide.com/top.phtml');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
$result = curl_exec($curl);
curl_close($curl);
echo $result;

Get data that PHP curl is sending

PHP's curl is surprisingly undebugable and obscure. I have some problem downloading a JSON API data with cURL. I want to see what is exactly cURL sending to the remote HTTP server.
Currently the only debug option I have is to temporarily send request to some simple HTTP server that writes input to stdout. I would need to write that server just to debug curl!
What I do:
function get_data($url) {
$ch = curl_init();
echo "Download: $url.\n";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
// I hoped to get some debug info
// but this setting has no effect
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_HEADER, array(
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0',
'X-Purpose: Counting downloads.'
));
echo "Sending: \n".curl_getinfo($ch, CURLINFO_HEADER_OUT);
$data = curl_exec($ch);
var_dump($data);
echo curl_error($ch)." ".curl_errno($ch);
curl_close($ch);
return $data;
}
How can I get the data that is sent by cURL as a text?
If you want to define the headers you should use CURLOPT_HTTPHEADER and not CURLOPT_HEADER, i.e.:
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0',
'X-Purpose: Counting downloads.'
));
To get the the content curl is sending use:
curl_setopt($handle, CURLOPT_VERBOSE, true);
curl_setopt($handle, CURLOPT_STDERR,$f = fopen($verbosePath, "w+"));
function get_data($url) {
$verbosePath = __DIR__.DIRECTORY_SEPARATOR.'verbose.txt';
echo "Saving verbose to: $verbosePath\n";
$handle=curl_init('http://www.google.com/');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($handle, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($handle, CURLOPT_HTTPHEADER, array(
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0',
'X-Purpose: Counting downloads.'
));
curl_setopt($handle, CURLOPT_VERBOSE, true);
curl_setopt($handle, CURLOPT_STDERR,$f = fopen($verbosePath, "w+"));
$data = curl_exec($handle);
curl_close($handle);
fclose($f);
return $data;
}
get_data("https://www.google.com");
verbose.txt
* About to connect() to www.google.com port 80
* Trying 172.217.0.100... * connected
* Connected to www.google.com (172.217.0.100) port 80
> GET / HTTP/1.1
Host: www.google.com
Accept: */*
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0
X-Purpose: Counting downloads.

Amazon Blocks cURL Request?

I am trying to use php cURL to fetch amazon web page but get
HTTP/1.1 503 Service Temporarily Unavailable instead. Is Amazon blocking cURL?
http://www.amazon.com/gp/offer-listing/B003B7Q5YY/
<?php
function get_html_content($url) {
// fake user agent
$userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch,CURLOPT_COOKIEFILE,'cookies.txt');
curl_setopt($ch,CURLOPT_COOKIEJAR,'cookies.txt');
$string = curl_exec($ch);
curl_close($ch);
return $string;
}
echo get_html_content("http://www.amazon.com/gp/offer-listing/B003B7Q5YY");
?>
I use simple
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $offers_page);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$html = curl_exec($ch);
curl_close($ch);
but i have another problem. if you send a lot of queries to amazon - they start send 500 page to you.

file_get_contents and CURL can't open a specific website

I tried to use file_get_contents and cURL to get the content of an website, I also tried to open the same site using Lynx and could not get the content. I got a 406 Not Acceptable, it seems that the site checks if I'm using a browser. Is there a work around?
It probably expects the user agent to be a web browser. You can set this easily using cURL:
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
Where $useragent is the string you want to use for a user agent. Try it with some common ones for the major browsers and see if that helps. This page lists some common user agents.
//make a call the the webpage to get his handicap
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.golfspain.com/portalgolf/HCP/handicap_resul.aspx?sLic=CB00693474");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 60);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE );
curl_setopt($ch, CURLOPT_REFERER, "http://google.com" );
curl_setopt($ch, CURLOPT_HEADER, TRUE );
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13');
$header = array(
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Accept-Language: en-us;q=0.8,en;q=0.6'
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
$html = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
Maybe you have to set some more HTTP headers like a 'real' browser. With cURL:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13');
$header = array(
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Accept-Language: en-us;q=0.8,en;q=0.6'
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);

Categories