Web Scraping in Google Scholar

Web Scraping in Google Scholar - php

I'm trying to scrape from Google Scholar profile pages.The idea is that I want to retrieve the publications' list with an XPath but I does not download the page, here's my code:
I tried with curl
function get_page($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
//I tried to change user agent as well
//curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
return $response;
}
And without curl :
function get_xpath($query_url) {
$dom = new DOMDocument();
#$dom->loadHTMLFile($query_url);
sleep(1);
return new DOMXpath($dom);
}
$query_url = "https://scholar.google.it/citations?user=p-POZjgAAAAJ&hl=it&cstart=0&pagesize=100";
To get it without curl
$xpath = get_xpath($query_url);
To get it with curl
$xpath = get_xpath(get_page($query_url));
And then
$autori=$xpath->query("//tr[1]/td[1]/div[1]");
But $autori keeps being empty, any idea?

Related

Extract/Display JSON Wikipedia PHP

I am new to programming,
I need to extract the wikipedia content and put it into html.
//curl request returns json output via json_decode php function
function curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
$search = $_GET["search"];
if (empty($search)) {
//term param not passed in url
exit;
} else {
//create url to use in curl call
$term = str_replace(" ", "_", $search);
$url = "https://en.wikipedia.org/w/api.php?action=opensearch&search=".$search."&limit=1&namespace=0&format=jsonfm";
$json = curl($url);
$data = json_decode($json, true);
$data = $data['parse']['wikitext']['*'];
}
so I basically want to reprint a wiki page but with my styles and do not know how to do.
Any ideas, Thanks

Scraping a website for price data using PHP but it returns zero(==$0) may be the website is blocking me. How to over come it?

This is the code that i have used:
$curl = curl_init("https://www.flipkart.com/curren-cu2-345656-analog-watch-boys-men/p/itmeax4wh4ujcfft?pid=WATEAX4WGYNYWVCM&srno=b_1_1&otracker=hp_omu_Deals%20of%20the%20Day_5_15c7e867-d35a-4431-a4a0-da39f043bc1f_0&lid=LSTWATEAX4WGYNYWVCMHVLY32");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument();
#$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tableRows = $xpath->query('//*[#id="container"]/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/div[2]/div[1]/div/div[1]');
echo $tableRows[0];
echo $tableRows[1];
echo $tableRows[2];
foreach ($tableRows as $row) {
echo $row . "<br>";
}
It shows zero, while i open the source in F12 developer mode it shows "==$0" adjacent to the div, how to i overcome this ?

As such flipkart is https so its blocking your request. To overcome this issue. Please use following two lines of code in addition with your curl request.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);

Xpath returns blank page doesn't echo the values

With the code below I get only blank page the name or the nickname is not getting echoed back. I crossed checked the path its correct still its not echoing anything back
<?php
$url="http://www.mans-best-friend.org.uk/dog-breeds-alphabetical-list.htm";
$curl_handle=curl_init();
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
$html = curl_exec($curl_handle);
curl_close($curl_handle);
$mydoc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(empty($html)) die("EMPTY HTML");
$mydoc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$my_xpath = new DOMXPath($mydoc);
//////////////////////////////////////////////////////
$nodes = $my_xpath->query( '//*[#id="table94"]/tbody/tr/td' );
foreach( $nodes as $node )
{
$title=$my_xpath->query( 'p[#data-iceapc="1"]/span/a/font', $node );
$nickname=$my_xpath->query( 'p[#data-iceapc="2"]/span/a/font', $node );
echo $title." ".$nickname."<br>";
}
?>
In case you can't find the p element. Scroll to the part where dog names are. For e.g. Affenpinscher right click on it and select inspect - it shows the p element.

First of all you have to "fix" the html code for the xpath to properly work because it contains too many errors. In this case im extracting only the needed table with id table94.
Afterwards you can use xpath on the dom object to get your desired data:
<?php
$url="http://www.mans-best-friend.org.uk/dog-breeds-alphabetical-list.htm";
$curl_handle=curl_init();
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
$html = curl_exec($curl_handle);
curl_close($curl_handle);
$html = preg_replace('/^.*(<table[^>]*id="table94">.*?<\/table>).*$/is', '\1', $html);
$mydoc = new DOMDocument();
$mydoc->loadHTML($html);
$my_xpath = new DOMXPath($mydoc);
$nodes = $my_xpath->query( '//tr' );
foreach( $nodes as $node )
{
if ($my_xpath->query('td[position()=last()-1]/p/span/a/font', $node)->length > 0) {
echo $my_xpath->query('td[position()=last()-1]/p/span/a/font', $node)->item(0)->textContent.' ';
echo $my_xpath->query('td[position()=last()]/p/span/font', $node)->item(0)->textContent."<br />";
}
}

checking for latest image with cURL PHP (404)...but images not updating

I'm trying to use PHP to find the most recent images/loop of a weather model that dynamically updates.
I'm using this script within a looper to show some static images, and the third object in the slideshow/looper is the problematic model loop within an iframe.
Here's the page: http://greenandtheblue.com/weather/thunderstorms.php
It looks like cURL is working some of the time, but for some reason, when there is a new model run, even though the page refreshes, it does not respond and can't "find" the new run to show.
You can see I basically use cURL to detect if an image exists. If it doesn't, it goes back one hour to see if that "model run" exists. It does this until it finds a run that exists and then plugs this time into the URL for the weather model, trying to display it in an iframe.
Here's the php (I'm not good at coding, and you can tell I'm sure):
//begin process to find date format...then run through past 6 hours to get formats for each possible HRRR run in order to find most recent HRRR run available
$hrrr_date = date('Ymd');
$cur_hr = date('H');
$cur_hr = strtotime("now")."\n";
$m1_hr = strtotime("-1 hour");
$m2_hr = strtotime("-2 hours");
$m3_hr = strtotime("-3 hours");
$m4_hr = strtotime("-4 hours");
$m5_hr = strtotime("-5 hours");
$m6_hr = strtotime("-6 hours");
$time_cur = date("YmdH",$cur_hr);
$time_m1 = date("YmdH",$m1_hr);
$time_m2 = date("YmdH",$m2_hr);
$time_m3 = date("YmdH",$m3_hr);
$time_m4 = date("YmdH",$m4_hr);
$time_m5 = date("YmdH",$m5_hr);
$hrrr_cur = "rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_cur."/t1/cref_t1sfc_f05.png";
$hrrr_m1 = "rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m1."/t1/cref_t1sfc_f05.png";
$hrrr_m2 = "rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m2."/t1/cref_t1sfc_f05.png";
$hrrr_m3 = "rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m3."/t1/cref_t1sfc_f05.png";
$hrrr_m4 = "rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m4."/t1/cref_t1sfc_f05.png";
$hrrr_m5 = "rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m5."/t1/cref_t1sfc_f05.png";
echo $time_cur;
//use curl to find if various HRRR runs result in 404...or broken link
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_cur."/t1/cref_t1sfc_f05.png");
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_exec($ch);
$is404 = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 404;
curl_close($ch);
$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, "http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m1."/t1/cref_t1sfc_f05.png");
curl_setopt($ch2, CURLOPT_NOBODY, 1);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch2, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_exec($ch2);
$is404_2 = curl_getinfo($ch2, CURLINFO_HTTP_CODE) == 404;
curl_close($ch2);
$ch3 = curl_init();
curl_setopt($ch3, CURLOPT_URL, "http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m2."/t1/cref_t1sfc_f05.png");
curl_setopt($ch3, CURLOPT_NOBODY, 1);
curl_setopt($ch3, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch3, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_exec($ch3);
$is404_3 = curl_getinfo($ch3, CURLINFO_HTTP_CODE) == 404;
curl_close($ch3);
$ch4 = curl_init();
curl_setopt($ch4, CURLOPT_URL, "http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m3."/t1/cref_t1sfc_f05.png");
curl_setopt($ch4, CURLOPT_NOBODY, 1);
curl_setopt($ch4, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch4, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_exec($ch4);
$is404_4 = curl_getinfo($ch4, CURLINFO_HTTP_CODE) == 404;
curl_close($ch4);
$ch5 = curl_init();
curl_setopt($ch5, CURLOPT_URL, "http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m4."/t1/cref_t1sfc_f05.png");
curl_setopt($ch5, CURLOPT_NOBODY, 1);
curl_setopt($ch5, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch5, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_exec($ch5);
$is404_5 = curl_getinfo($ch5, CURLINFO_HTTP_CODE) == 404;
curl_close($ch5);
if ($is404 == 0) {
$good_time = $time_cur;
echo "latest HRRR run: $good_time"; //current http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_cur."/t1/cref_t1sfc_f05.png = $is404 <br>";
}
else if ($is404_2 == 0) {
$good_time = $time_m1;
echo "latest HRRR run: $good_time"; //http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m1."/t1/cref_t1sfc_f05.png = $is404_2 <br>";
}
else if ($is404_3 == 0) {
$good_time = $time_m2;
echo "latest HRRR run $good_time"; //http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m2."/t1/cref_t1sfc_f05.png = $is404_3 <br>";
}
else if ($is404_4 == 0) {
$good_time = $time_m3;
echo "latest HRRR run $good_time"; //http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m3."/t1/cref_t1sfc_f05.png = $is404_4 <br>";
}
else if ($is404_5 == 0) {
$good_time = $time_m4;
echo "latest HRRR run $good_time"; //http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/".$time_m4."/t1/cref_t1sfc_f05.png = $is404_5 <br>";
}
?>
It's hard to explain, but the script doesn't work or refresh correctly. For instance, this image exists right now: http://rapidrefresh.noaa.gov/HRRR/for_web/hrrr_ncep_jet/2015050417/t1/cref_t1sfc_f05.p‌ng
The script should key in on the time of 2015050417 as the most recent valid run time. It then plugs it into this http://rapidrefresh.noaa.gov/HRRR/jsloopLocalDiskDateDomainZipZ.cgi?dsKeys=hrrr_ncep_jet:&runTime=2015050417&plotName=cref_sfc&fcstInc=60&numFcsts=16&model=hrrr&ptitle=HRRR%20Model%20Fields%20-%20Experimental&maxFcstLen=15&fcstStrLen=-1&resizePlot=1&domain=full URL to display in an iframe. This URL exists right now and shows properly in a window itself, but inside the looper as an iframe, it does not display.
Sorry for any confusion in explaining, but I find this problem to be difficult to put into words. If you have any suggestions for streamlining/fixing this script to work more consistently, I'm happy to listen.
Thanks for any help.

How to parse feed with php

Using Wikiepdia API link to get some basic informations about some world known characters.
Example : (About Dave Longaberger)
This would show as following
Now my question
I'd like to parse the xml to get such basic informations between <extract></extract> to show it.
Here is my idea but failed (I/O warning : failed to load external entity)
<?PHP
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave Longaberger&format=xml&exintro=1';
$xml = simplexml_load_file($url);
// get extract
$text=$xml->pages[0]->extract;
// show title
echo $text;
?>
Another idea but also failed (failed to open stream: HTTP request failed!)
<?PHP
function get_url_contents($url){
$crl = curl_init();
$timeout = 5;
curl_setopt ($crl, CURLOPT_URL,$url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
$ret = curl_exec($crl);
curl_close($crl);
return $ret;
}
$url = "http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave Longaberger&format=xml&exintro=1";
$text = file_get_contents($url);
echo $text;
?>
so any idea how to do it. ~ Thanks
Update (after added urlencode or rawurlencode still not working)
$name = "Dave Longaberger";
$name = urlencode($name);
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles='.$name.'&format=xml&exintro=1';
$text = file_get_contents($url);
Also not working
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave Longaberger&format=xml&exintro=1';
$url = urlencode($url);
$text = file_get_contents($url);
nor
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles='.rawurlencode('Dave Longaberger').'&format=xml&exintro=1';
$text = file_get_contents($url);
Well so i really don't know looks like it is impossible by somehow.

Set the User Agent Header in your curl request, wikipedia replies with error 403 forbidden otherwise.
<?PHP
$url = "http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave+Longaberger&format=xml&exintro=1";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
$xml = curl_exec($ch);
curl_close($ch);
echo $xml;
?>
Alternatively:
ini_set("user_agent","Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
$url = "http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave+Longaberger&format=xml&exintro=1";
$xml = simplexml_load_file($url);
$extracts = $xml->xpath("/api/query/pages/page/extract");
var_dump($extracts);

Look at the note in this php man page
http://php.net/manual/en/function.file-get-contents.php
If you're opening a URI with special characters, such as spaces, you need to encode the URI with urlencode().

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Web Scraping in Google Scholar - php

Related

Extract/Display JSON Wikipedia PHP

Scraping a website for price data using PHP but it returns zero(==$0) may be the website is blocking me. How to over come it?

Xpath returns blank page doesn't echo the values

checking for latest image with cURL PHP (404)...but images not updating

How to parse feed with php

Categories

Resources