I want to combine cURL and Simple HTML DOM. Both work fine separately.
I want to fetch a site with cURL and then look into the inner data with the DOM parser,
following the pagination page numbers.
I am using this code.
<?php
include 'simple_html_dom.php';

function dlPage($href) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $href);
    curl_setopt($curl, CURLOPT_REFERER, $href);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
    $str = curl_exec($curl);
    curl_close($curl);

    // Create a DOM object
    $dom = new simple_html_dom();
    // Load HTML from a string
    $dom->load($str);
    return $dom;
}

$url = 'http://example.com/';
$data = dlPage($url);
// echo $data;

#######################################################

$startpage = 1;
$endpage = 3;
for ($p = $startpage; $p <= $endpage; $p++) {
    // double quotes so $p is actually interpolated into the URL
    $html = file_get_html("http://example.com/page/{$p}.html");
    // connect to main page links
    foreach ($html->find('div#link a') as $link) {
        $linkHref = $link->href;
        // loop through each link
        $linkHtml = file_get_html($linkHref);
        // parsing inner data
        foreach ($linkHtml->find('h1') as $title) {
            echo $title;
        }
        foreach ($linkHtml->find('div#data') as $description) {
            echo $description;
        }
    }
}
?>
How can I combine this to make it work as one single script?
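For reference, here is a minimal sketch of one way to combine the two pieces, reusing dlPage() so that every request goes through cURL instead of file_get_html(). The page URL pattern and the selectors are taken straight from the code above and are only assumptions about the target site:

<?php
include 'simple_html_dom.php';

// dlPage() as defined above: fetch the URL with cURL and return a simple_html_dom object.

$startpage = 1;
$endpage   = 3;

for ($p = $startpage; $p <= $endpage; $p++) {
    // double quotes so $p is interpolated into the page URL
    $listing = dlPage("http://example.com/page/{$p}.html");

    // follow each link found on the listing page
    foreach ($listing->find('div#link a') as $link) {
        $inner = dlPage($link->href);

        // parse the inner data of the linked page
        foreach ($inner->find('h1') as $title) {
            echo $title;
        }
        foreach ($inner->find('div#data') as $description) {
            echo $description;
        }

        $inner->clear();
        unset($inner);
    }

    $listing->clear();
    unset($listing);
}
?>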
I am new to programming.
I need to extract Wikipedia content and put it into my own HTML.
// curl request returns JSON output, decoded via PHP's json_decode function
function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

$search = $_GET["search"];
if (empty($search)) {
    // search term not passed in the URL
    exit;
} else {
    // build the URL to use in the curl call
    $term = str_replace(" ", "_", $search);
    $url = "https://en.wikipedia.org/w/api.php?action=opensearch&search=" . urlencode($term) . "&limit=1&namespace=0&format=json";
    $json = curl($url);
    $data = json_decode($json, true);
    $data = $data['parse']['wikitext']['*'];
}
So I basically want to reprint a wiki page with my own styles, but I don't know how to do it.
Any ideas? Thanks.
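One possible approach (just a sketch, not tested against your setup): the opensearch endpoint only returns matching titles, so to get the article body you would call the parse API instead. With action=parse and prop=text the rendered HTML comes back under ['parse']['text']['*'], which you can then wrap in your own markup. The stylesheet name below is made up:

<?php
// Sketch: fetch the rendered HTML of a Wikipedia article and wrap it in your own page.
// Reuses the curl() helper defined in the question.
$search = isset($_GET['search']) ? $_GET['search'] : '';
if ($search === '') {
    exit; // no search term passed in the URL
}

$term = str_replace(' ', '_', $search);
$url  = 'https://en.wikipedia.org/w/api.php?action=parse&prop=text&format=json&redirects=1&page=' . urlencode($term);

$json = curl($url);
$data = json_decode($json, true);

if (isset($data['parse']['text']['*'])) {
    // my-styles.css is a placeholder for your own stylesheet
    echo '<link rel="stylesheet" href="my-styles.css">';
    echo '<h1>' . htmlspecialchars($data['parse']['title']) . '</h1>';
    echo '<div class="wiki-content">' . $data['parse']['text']['*'] . '</div>';
} else {
    echo 'Article not found.';
}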
This is the code that I have used:
$curl = curl_init("https://www.flipkart.com/curren-cu2-345656-analog-watch-boys-men/p/itmeax4wh4ujcfft?pid=WATEAX4WGYNYWVCM&srno=b_1_1&otracker=hp_omu_Deals%20of%20the%20Day_5_15c7e867-d35a-4431-a4a0-da39f043bc1f_0&lid=LSTWATEAX4WGYNYWVCMHVLY32");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from malformed HTML
$xpath = new DOMXPath($dom);
$tableRows = $xpath->query('//*[@id="container"]/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/div[2]/div[1]/div/div[1]');

echo $tableRows[0];
echo $tableRows[1];
echo $tableRows[2];

foreach ($tableRows as $row) {
    echo $row . "<br>";
}
It shows zero matches, yet when I open the page in the F12 developer tools I can see "==$0" adjacent to the div. How do I overcome this?
Since Flipkart is served over HTTPS, it is blocking your request. To overcome this issue, add the following two lines to your cURL request:
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);
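As a side note (a sketch, assuming the XPath eventually matches something): the entries in $tableRows are DOMElement objects, so echoing them directly does not print their markup; textContent or DOMDocument::saveHTML() is the usual way to get the content out.

// Sketch: print the matched nodes once $tableRows actually contains results.
foreach ($tableRows as $row) {
    echo $row->textContent . "<br>";        // plain text of the node
    // echo $dom->saveHTML($row) . "<br>";  // or the node's full HTML markup
}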
I have this code to try and get the pagination links using PHP, but the result is not quite right. Could anyone help me?
What I get back is just the first link, repeated over and over.
<?php
include_once('simple_html_dom.php');

function dlPage($href) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $href);
    curl_setopt($curl, CURLOPT_REFERER, $href);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
    $str = curl_exec($curl);
    curl_close($curl);

    // Create a DOM object
    $dom = new simple_html_dom();
    // Load HTML from a string
    $dom->load($str);

    $Next_Link = array();
    foreach ($dom->find('a[title=Next]') as $element) {
        $Next_Link[] = $element->href;
    }
    print_r($Next_Link);

    $next_page_url = $Next_Link[0];
    if ($next_page_url != '') {
        echo '<br>' . $next_page_url;
        $dom->clear();
        unset($dom);
        // load the next page from the pagination to collect the next link
        dlPage($next_page_url);
    }
}

$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
// print_r($data);
?>
What I want to get is:
mySiteUrl/?facet_is_mpg_child=0&viewType=gridView&page=2
mySiteUrl//?facet_is_mpg_child=0&viewType=gridView&page=3
.
.
.
down to the last link in the pagination. Please help.
Here it is. Note that I run htmlspecialchars_decode on the link, because the href comes back with &amp; entities (as in XML), which should not be passed to cURL. As I understood it, the return value of dlPage should be the last link in the pagination.
<?php
include_once('simple_html_dom.php');

function dlPage($href, $already_loaded = array()) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $href);
    curl_setopt($curl, CURLOPT_REFERER, $href);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
    $htmlPage = curl_exec($curl);
    curl_close($curl);

    echo "Loading From URL:" . $href . "<br/>\n";
    // remember this URL so it is never loaded twice
    $already_loaded[$href] = true;

    // Create a DOM object
    $dom = new simple_html_dom();
    // Load HTML from a string
    $dom->load($htmlPage);

    $next_page_url = null;
    $items = $dom->find('ul[class="osh-pagination"] li[class="item"] a[title="Next"]');
    foreach ($items as $item) {
        $link = htmlspecialchars_decode($item->href);
        if (!isset($already_loaded[$link])) {
            $next_page_url = $link;
            break;
        }
    }

    if ($next_page_url !== null) {
        $dom->clear();
        unset($dom);
        // load the next page from the pagination to collect the next link
        return dlPage($next_page_url, $already_loaded);
    }

    return $href;
}

$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
echo "DATA:" . $data . "\n";
And the output is:
Loading From URL:https://www.jumia.com.gh/phones/<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=2<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=3<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=4<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5<br/>
DATA:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5
I've been playing with the PHP Simple HTML DOM Parser (manual here: http://simplehtmldom.sourceforge.net/manual.htm) and I had success with some tests, except this one:
The page has nested tables and spans, and I would like to parse the outer text of the span with class mynum.
<?php
require_once 'simple_html_dom.php';

$url = 'http://relumastudio.com/test/target.html';

$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);

$DEBUG = 1;
if ($DEBUG) {
    $html = new simple_html_dom();
    $html->load($url);
    echo $html->find('span[class=mynum]',0)->outertext; // I should get 123456
} else {
    echo $result;
}
curl_close($ch);
I thought I could get away with just one call to echo $html->find('span[class=mynum]',0)->outertext; to get the text 123456, but I can't.
Any ideas? Any help is greatly appreciated. Thank you.
Load the url properly first. Then use ->innertext in this case:
$url = 'http://relumastudio.com/test/target.html';
$html = file_get_html($url);
$num = $html->find('span.mynum', 0)->innertext;
echo $num;
You need innertext.
$html = new simple_html_dom();
$html->load_file($url);
echo $html->find('span[class=mynum]',0)->innertext;
outertext returns <span class="mynum">123456</span>
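As a side note, the cURL response in the question is downloaded but never parsed; here is a small sketch of wiring the two together with str_get_html() (same selector as above, assuming the span is present in the raw HTML):

// Sketch: parse the HTML that cURL already downloaded instead of fetching it again.
// $ch is the cURL handle set up in the question above.
$result = curl_exec($ch);
curl_close($ch);

$html = str_get_html($result);                 // simple_html_dom helper for strings
$span = $html->find('span[class=mynum]', 0);

if ($span) {
    echo $span->innertext;   // 123456
    // $span->outertext would include the surrounding <span> tag,
    // and $span->plaintext strips any nested markup.
}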
I have the following code:
$fiz = $_GET['file'];
$file = file_get_contents($fiz);
$trim = trim($file);
$tw = explode("\n", $trim);
$browser = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36';

foreach ($tw as $twi) {
    $url = 'https://twitter.com/users/username_available?username=' . $twi;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, $browser); // unquoted so the variable is actually used
    curl_setopt($ch, CURLOPT_TIMEOUT, 8);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $result = curl_exec($ch);
    curl_close($ch);
    $json = json_decode($result, true);

    if ($json['valid'] == 1) {
        echo "Twitter " . $twi . " is available! <br />";
        $fh = fopen('available.txt', 'a') or die("can't open file");
        fwrite($fh, $twi . "\n");
        fclose($fh);
    } else {
        echo "Twitter " . $twi . " is taken! <br />";
    }
}
What it does is take a list that would look something like:
apple
dog
cat
and so on, and check with Twitter whether each name is taken or not.
What I want to know is whether it's possible to make each result show up as soon as its check finishes, instead of everything showing up all at once.
You need to use Ajax calls. If you are familiar with JavaScript or jQuery, you can do this easily.
Instead of checking all the names at once, use an Ajax function to send one name at a time to the server-side PHP code.
Say you send "cat" first; the page is processed and the result comes back via Ajax, so you can display it on the page right away.
Then send "dog", get the response, display it, and so on.
A similar question has been answered here: Return AJAX result for each cURL Request.
Hope this helps; I use jQuery here...
JavaScript
<script>
function checkUserName(name, keyArray, position){
    // pass the data as a query string so jQuery's .load() issues a GET request
    $("#result").load("namecheck.php", "username=" + encodeURIComponent(name), function(){
        // result is displayed in the #result element, then check the next name
        fetchNext(keyArray, position);
    });
}

function fetchNext(keyArray, position){
    position++; // get next name in the array
    if(position < keyArray.length){ // not exceeding the array count
        checkUserName(keyArray[position], keyArray, position); // make ajax call to check user name
    }
}

function startProcess(){
    var keyArray = ['cat', 'dog', 'mouse', 'rat'];
    var position = -1; // fetchNext() will advance to the first element
    fetchNext(keyArray, position);
}
</script>
HTML
<div id="result"></div>
<button onclick="startProcess()"> Start Process </button>
PHP
<?php
$twi = $_GET['username'];
$browser = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36';
$url = 'https://twitter.com/users/username_available?username=' . $twi;

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $browser); // unquoted so the variable is actually used
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($ch);
curl_close($ch);
$json = json_decode($result, true);

if ($json['valid'] == 1) {
    echo "Twitter " . $twi . " is available! <br />";
    $fh = fopen('available.txt', 'a') or die("can't open file");
    fwrite($fh, $twi . "\n");
    fclose($fh);
} else {
    echo "Twitter " . $twi . " is taken! <br />";
}
?>
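As an aside, if you would rather stay purely server-side, output flushing is another option. This is only a rough sketch: whether the browser actually sees the lines one by one depends on server, proxy, and gzip buffering, so treat it as an assumption to verify.

<?php
// Sketch: flush output to the browser after each check instead of waiting for the whole loop.
@ob_end_flush();          // stop PHP's own output buffering if it is active
ob_implicit_flush(true);  // flush automatically after every piece of output

$tw = array('apple', 'dog', 'cat'); // or the list read from the file, as in the question

foreach ($tw as $twi) {
    // ... run the same cURL check as in the question here, filling $json ...
    $available = isset($json['valid']) && $json['valid'] == 1;
    echo "Twitter " . $twi . ($available ? " is available!" : " is taken!") . " <br />\n";
    flush();              // push this line to the browser immediately
}
?>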