Get Paginated Links With PHP and Simple HTML DOM

I have this code to try to get the pagination links using PHP, but the result is not quite right. Could anyone help me?
What I get back is just a recurring instance of the first link.
<?php
include_once('simple_html_dom.php');
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
$Next_Link = array();
foreach($dom->find('a[title=Next]') as $element){
$Next_Link[] = $element->href;
}
print_r($Next_Link);
$next_page_url = $Next_Link[0];
if($next_page_url !='') {
echo '<br>' . $next_page_url;
$dom->clear();
unset($dom);
//load the next page from the pagination to collect the next link
dlPage($next_page_url);
}
}
$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
//print_r($data)
?>
What I want to get is
mySiteUrl/?facet_is_mpg_child=0&viewType=gridView&page=2
mySiteUrl/?facet_is_mpg_child=0&viewType=gridView&page=3
.
.
.
up to the last link in the pagination. Please help.

Here it is. Note that I run htmlspecialchars_decode on the link, because the href passed to cURL should not contain &amp; the way it does in XML. As I understood it, the return value of dlPage should be the last link in the pagination.
<?php
include_once('simple_html_dom.php');
function dlPage($href, $already_loaded = array()) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$htmlPage = curl_exec($curl);
curl_close($curl);
echo "Loading From URL:" . $href . "<br/>\n";
$already_loaded[$href] = true;
// Create a DOM object
$dom = new simple_html_dom(); // file_get_html($href) would download the page a second time
// Load HTML from a string
$dom->load($htmlPage);
$next_page_url = null;
$items = $dom->find('ul[class="osh-pagination"] li[class="item"] a[title="Next"]');
foreach ($items as $item) {
$link = htmlspecialchars_decode($item->href);
if (!isset($already_loaded[$link])) {
$next_page_url = $link;
break;
}
}
if ($next_page_url !== null) {
$dom->clear();
unset($dom);
//load the next page from the pagination to collect the next link
return dlPage($next_page_url, $already_loaded);
}
return $href;
}
$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
echo "DATA:" . $data . "\n";
And the output is:
Loading From URL:https://www.jumia.com.gh/phones/<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=2<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=3<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=4<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5<br/>
DATA:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5
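The question asked for every pagination link, while the function above returns only the last one. A minimal iterative sketch of collecting them all could look like the following. The name collectPaginationLinks and the $findNext callable are hypothetical: $findNext stands in for the cURL download plus the find('... a[title="Next"]') lookup shown above and is assumed to return the next page's href, or null when there is none.

```php
<?php
// Hypothetical helper, not part of simple_html_dom.
// $findNext: callable(string $url): ?string, assumed to return the "Next" href for a page.
function collectPaginationLinks(callable $findNext, string $startUrl, int $maxPages = 100): array
{
    $visited = [$startUrl => true]; // guards against pagination loops
    $links   = [];
    $current = $startUrl;

    // $maxPages caps the number of lookups so a broken site cannot loop forever.
    for ($i = 0; $i < $maxPages; $i++) {
        $next = $findNext($current);
        if ($next === null || $next === '') {
            break; // no "Next" link: this was the last page
        }
        $next = htmlspecialchars_decode($next); // turn &amp; back into &
        if (isset($visited[$next])) {
            break; // already seen: stop instead of repeating
        }
        $visited[$next] = true;
        $links[]        = $next;
        $current        = $next;
    }

    return $links;
}
```

Because the loop is iterative rather than recursive, the list of links is available in one array at the end, and the depth of the pagination does not affect the call stack.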

Related

Cannot extract the data from the website using PHP cURL

I am working on a project which needs to get the data from other webpage:
https://eth.ethfans.org/#/miner?0x2998850087633a4806191960c94ed535d97da598
I am trying to use cURL:
<?php
$url = "https://eth.ethfans.org/#/miner?0x2998850087633a4806191960c94ed535d97da598";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);
echo $contents;
?>
However, I can only get the layout of the site; I cannot get the data inside it.
Can anyone help with this?
Thanks in Advance.
Regards,
Alex
Use str_get_html to fetch the data from the layout:
$get_html = str_get_html($contents);
Example:
function check()
{
$url = "https://stackoverflow.com/questions/49248329/cannot-extract-the-data-from-the-website-using-php-curl";
$get_html = $this->get_curl($url);
#print_r($get_html); exit;
$get_html = str_get_html($get_html);
$fb = NULL;
foreach ($get_html->find('a') as $v) { // you can get what data from the layout
if (strpos($v->href, 'facebook') !== false) // strpos returns 0 (falsy) for a match at the start
{
echo $fb = $v->href;
echo "\n";
break;
}
}
unset($get_html);
}
public function get_curl($url)
{
ob_start();
$ch = curl_init($url);
$headers = [
'Accept-Language: en-US,en;q=0.5',
'Cache-Control: no-cache',
'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:28.0) Gecko/20100101 Firefox/51.0',
];
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch,CURLOPT_URL, $url);
$response = curl_exec ($ch);
curl_close ($ch);
ob_end_flush();
return $response;
}
You're hitting the wrong URL. The page you're requesting contains only the layout and the JavaScript required to fetch the actual data; that JavaScript then fetches the data from https://eth.ethfans.org/api/page/miner?value=2998850087633a4806191960c94ed535d97da598. So do as the JavaScript does and fetch that URL.
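A rough sketch of that suggestion follows. Both function names are hypothetical. The mapping from the page URL to the API URL (take the part of the fragment after "?", drop the "0x" prefix, pass the rest as value) is inferred only from the two URLs quoted in this thread, and the assumption that the endpoint answers with JSON is not confirmed here.

```php
<?php
// Hypothetical helper: derive the API URL from the page URL's #fragment.
// The rule is inferred from the two URLs quoted in the thread, nothing more.
function minerApiUrl(string $pageUrl): string
{
    // Everything after "#" never reaches the server; for this page it is "/miner?0x2998...".
    $fragment = (string) parse_url($pageUrl, PHP_URL_FRAGMENT);
    $queryPos = strpos($fragment, '?');
    $address  = $queryPos === false ? '' : substr($fragment, $queryPos + 1);

    if (stripos($address, '0x') === 0) {
        $address = substr($address, 2); // the API URL in the answer has no "0x" prefix
    }

    return 'https://eth.ethfans.org/api/page/miner?value=' . rawurlencode($address);
}

// Hypothetical helper: fetch a URL and decode the body, assuming it is JSON.
function fetchJson(string $url, int $timeout = 5): ?array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $body = curl_exec($ch);
    curl_close($ch);

    if ($body === false) {
        return null; // transport error
    }
    $data = json_decode($body, true);

    return is_array($data) ? $data : null; // null when the body was not a JSON object or array
}
```

With these in place, the script from the question would call something like fetchJson(minerApiUrl($url)) and read the fields it needs from the returned array.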

Extract/Display JSON Wikipedia PHP

I am new to programming.
I need to extract content from Wikipedia and put it into HTML.
//curl request returns json output via json_decode php function
function curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
$search = $_GET["search"];
if (empty($search)) {
//term param not passed in url
exit;
} else {
//create url to use in curl call
$term = str_replace(" ", "_", $search);
$url = "https://en.wikipedia.org/w/api.php?action=opensearch&search=".$search."&limit=1&namespace=0&format=jsonfm";
$json = curl($url);
$data = json_decode($json, true);
$data = $data['parse']['wikitext']['*'];
}
So I basically want to reprint a wiki page with my own styles, and I do not know how to do it.
Any ideas? Thanks.
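Two details in the code above stand out: format=jsonfm returns JSON wrapped in HTML for debugging, so json_decode cannot read it, and action=opensearch returns a list of matching titles rather than a parse.wikitext structure. One possible direction, sketched below under those observations, is to request the rendered page HTML with action=parse. The function names are hypothetical, and the sample response shape in the test is an assumption based on formatversion=2, where parse.text is a plain string.

```php
<?php
// Hypothetical helper: build a MediaWiki API URL that asks for a page's rendered HTML.
function wikiParseUrl(string $search): string
{
    $params = [
        'action'        => 'parse',
        'page'          => str_replace(' ', '_', trim($search)),
        'prop'          => 'text',   // rendered HTML of the page body
        'format'        => 'json',   // plain JSON, unlike jsonfm
        'formatversion' => 2,        // parse.text comes back as a plain string
    ];

    // Pass the separator explicitly so php.ini's arg_separator.output cannot change it.
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params, '', '&');
}

// Hypothetical helper: pull the HTML out of the decoded response, or null if it is missing.
function wikiHtmlFromJson(string $json): ?string
{
    $data = json_decode($json, true);
    if (is_array($data) && isset($data['parse']['text']) && is_string($data['parse']['text'])) {
        return $data['parse']['text'];
    }

    return null; // API error, missing page, or unexpected shape
}
```

The returned HTML can then be echoed inside a container element that carries your own stylesheet, for example with echo wikiHtmlFromJson(curl(wikiParseUrl($search))) using the curl() function from the question.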

Scraping a website for price data using PHP returns zero (== $0); maybe the website is blocking me. How do I overcome it?

This is the code that I have used:
$curl = curl_init("https://www.flipkart.com/curren-cu2-345656-analog-watch-boys-men/p/itmeax4wh4ujcfft?pid=WATEAX4WGYNYWVCM&srno=b_1_1&otracker=hp_omu_Deals%20of%20the%20Day_5_15c7e867-d35a-4431-a4a0-da39f043bc1f_0&lid=LSTWATEAX4WGYNYWVCMHVLY32");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tableRows = $xpath->query('//*[@id="container"]/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/div[2]/div[1]/div/div[1]');
echo $tableRows[0];
echo $tableRows[1];
echo $tableRows[2];
foreach ($tableRows as $row) {
echo $row . "<br>";
}
It shows zero. When I open the source in F12 developer mode, it shows "== $0" next to the div. How do I overcome this?
Flipkart is served over HTTPS, so cURL may be rejecting the connection during SSL certificate verification. To work around this, add the following two lines to your cURL request:
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);
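Separately from the SSL settings, two things in the question are worth checking. The "== $0" shown in Chrome's developer tools only marks the currently selected element; it is not part of the page. And a DOMElement cannot be echoed directly, so the node's textContent has to be read instead. The sketch below shows that pattern on a made-up HTML string; the function name, the sample markup, and the class name "price" are all hypothetical and not Flipkart's real structure.

```php
<?php
// Hypothetical helper: return the trimmed text of the first node matching an XPath expression.
function firstTextByXPath(string $html, string $expression): ?string
{
    if ($html === '') {
        return null; // nothing was downloaded
    }

    $dom      = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect HTML warnings instead of printing them
    $loaded   = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null; // invalid expression or no match
    }

    // DOMElement has no string conversion; textContent holds the text inside the node.
    return trim($nodes->item(0)->textContent);
}

// Made-up markup standing in for a downloaded product page.
$sampleHtml = '<div id="container"><div class="price">$19.99</div></div>';
echo firstTextByXPath($sampleHtml, '//*[@id="container"]//div[@class="price"]'), "\n";
```

If the function returns null for the real page, it may help to inspect $html itself: when the price is inserted by JavaScript after the page loads, it will not be present in what cURL downloads, whatever XPath is used.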

Curl with Simple HTML DOM using Link Pagination

I want to combine cURL and Simple HTML DOM.
Both work fine separately.
I want to fetch a site with cURL and then look into the inner data using Simple HTML DOM,
following the pagination page numbers.
I am using this code:
<?php
include 'simple_html_dom.php';
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
return $dom;
}
$url = 'http://example.com/';
$data = dlPage($url);
// echo $data;
#######################################################
$startpage = 1;
$endpage = 3;
for ($p=$startpage;$p<=$endpage;$p++) {
$html = file_get_html("http://example.com/page/$p.html"); // double quotes so $p is interpolated
// connect to main page links
foreach ($html->find('div#link a') as $link) {
$linkHref = $link->href;
//loop through each link
$linkHtml = file_get_html($linkHref);
// parsing inner data
foreach($linkHtml->find('h1') as $title) {
echo $title;
}
foreach ($linkHtml->find('div#data') as $description) {
echo $description;
}
}
}
?>
How can I combine this to make it work as one single script?
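One way the two halves might be combined is to route every download through the cURL-based dlPage function instead of file_get_html. The sketch below is only an outline under that assumption: scrapePages is a hypothetical name, and $fetchDom is expected to behave like dlPage above, returning an object with find() and clear() methods (as simple_html_dom provides) or a falsy value on failure.

```php
<?php
// Hypothetical helper: walk the listing pages, follow each link, and collect titles and descriptions.
// $fetchDom: callable(string $url) returning a simple_html_dom-like object, e.g. 'dlPage'.
// $urlPattern: sprintf pattern with one %d for the page number.
function scrapePages(callable $fetchDom, string $urlPattern, int $startPage, int $endPage): array
{
    $results = [];

    for ($p = $startPage; $p <= $endPage; $p++) {
        $listing = $fetchDom(sprintf($urlPattern, $p));
        if (!$listing) {
            continue; // skip pages that could not be loaded
        }

        // Links on the listing page, same selector as in the question.
        foreach ($listing->find('div#link a') as $link) {
            $detail = $fetchDom($link->href);
            if (!$detail) {
                continue;
            }

            $row = ['url' => $link->href, 'titles' => [], 'descriptions' => []];
            foreach ($detail->find('h1') as $title) {
                $row['titles'][] = $title->plaintext;
            }
            foreach ($detail->find('div#data') as $description) {
                $row['descriptions'][] = $description->plaintext;
            }
            $results[] = $row;

            $detail->clear(); // release the parsed inner page
        }

        $listing->clear(); // release the parsed listing page
    }

    return $results;
}
```

With the dlPage function from the question in scope, a call such as scrapePages('dlPage', 'http://example.com/page/%d.html', 1, 3) would replace the loop at the bottom of the script, and the returned rows can be echoed or stored as needed.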

Update page after each cURL Request

I have the following code:
$fiz = $_GET['file'];
$file = file_get_contents($fiz);
$trim = trim($file);
$tw = explode("\n", $trim);
$browser = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36';
foreach($tw as $twi){
$url = 'https://twitter.com/users/username_available?username='.$twi;
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $browser); // single quotes would send the literal text $browser
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($ch);
$json = json_decode($result, true);
if($json['valid'] == 1){
echo "Twitter ".$twi." is available! <br />";
$fh = fopen('available.txt', 'a') or die("can't open file");
fwrite($fh, $twi."\n");
} else {
echo "Twitter ".$twi." is taken! <br />";
}
}
It takes a list that looks something like:
apple
dog
cat
and so on, and checks each name with Twitter to see whether it is taken.
What I want to know is whether it's possible to make each result show up right after its check, instead of all results showing up at once.
You need to use Ajax calls. If you are familiar with JavaScript or jQuery, you can easily do this.
Instead of checking all names at once, use an Ajax function to send one name at a time to the server-side PHP code.
Say you send "cat" first: the page is processed and returns the result via Ajax, and you can display that result on the page.
Then send "dog", get the response, display it, and so on.
A similar question has been answered here: Return AJAX result for each cURL Request
Hope this helps. I use jQuery here:
JavaScript
<script>
var keyArray = ['cat', 'dog', 'mouse', 'rat']; // square brackets make an array

function checkUserName(name, keyArray, position) {
    // The name travels in the query string, so .load() issues a GET request
    // and namecheck.php can read it from $_GET['username'].
    var url = "namecheck.php?username=" + encodeURIComponent(name);
    $("#result").load(url, function () { // results are displayed in the 'result' element
        fetchNext(keyArray, position);
    });
}

function fetchNext(keyArray, position) {
    position++; // move to the next name in the array
    if (position < keyArray.length) { // do not run past the end of the array
        checkUserName(keyArray[position], keyArray, position); // make the Ajax call to check the user name
    }
}

function startProcess() {
    var position = -1; // fetchNext() increments first, so checking starts at index 0
    fetchNext(keyArray, position);
}
</script>
HTML
<div id="result"></div>
<button onclick="startProcess()"> Start Process </button>
PHP
<?php
$twi = isset($_GET['username']) ? trim($_GET['username']) : '';
$browser = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36';
$url = 'https://twitter.com/users/username_available?username=' . urlencode($twi);
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $browser); // pass the variable, not the literal string '$browser'
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($ch);
curl_close($ch);
$json = json_decode($result, true);
$name = htmlspecialchars($twi); // escape the name before echoing it into HTML
if (!empty($json['valid'])) {
    echo "Twitter " . $name . " is available! <br />";
    $fh = fopen('available.txt', 'a') or die("can't open file");
    fwrite($fh, $twi . "\n");
    fclose($fh);
} else {
    echo "Twitter " . $name . " is taken! <br />";
}
?>
