Php Simple Html Dom Parser can't get content on pagination

Php Simple Html Dom Parser can't get content on pagination - php

Hi i'm a beginner in using simple_html_dom. i'm trying to fetch list of href's from list of posts from this sample website having pagination using below code.
<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.themelock.com/wordpress/elegantthemes/');
function getArticles($page) {
global $articles;
$html = new simple_html_dom();
$html->load_file($page);
$items = $html->find('h2[class=post-title]');
foreach($items as $post) {
$articles[] = array($post->children(0)->href);
}
foreach($articles as $item) {
echo "<div class='item'>";
echo $item[0];
echo "</div>";
}
}
if($next = $html->find('div[class=navigation]', 0)->last_child() ) {
$URL = $next->href;
$html->clear();
unset($html);
getArticles($URL);
}
?>
As a result i'm getting
http://www.themelock.com/wordpress/908-minimal-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/892-event-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/882-askit-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/853-lightbright-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/850-inreview-elegantthemes-review-wordpress-theme.html
http://www.themelock.com/wordpress/807-boutique-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/804-elist-elegantthemes-directory-wordpress-theme.html
http://www.themelock.com/wordpress/798-webly-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/795-elegantestate-real-estate-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/786-notebook-elegantthemes-wordpress-theme.html
Above code fetching only Next page (Second page) contents. I'm wondering how to get first page post url's followed by next pages.
Did anyone know how to do this ?

Thanks for your support guys, I made this to work using below code,
<?php
include('simple_html_dom.php');
$url = "http://www.themelock.com/wordpress/yootheme-wordpress/";
// Start from the main page
$nextLink = $url;
// Loop on each next Link as long as it exsists
while ($nextLink) {
echo "<hr>nextLink: $nextLink<br>";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a url
$html->load_file($nextLink);
$posts = $html->find('h2[class=post-title]');
foreach($posts as $post) {
// Get the link
$articles = $post->children(0)->href;
echo $articles.'</br>';
}
// Extract the next link, if not found return NULL
$nextLink = ( ($temp = $html->find('div[class=navigation]', 0)->last_child()) ? $temp->href : NULL );
// Clear DOM object
$html->clear();
unset($html);
}
?>

Related

Scrape Facebook page and parse with php

I am trying to get a list of names and profiles that shared a particular post of a Facebook page.
I thought I could use simple html dom to parse the page with php, but with no success yet. This is my code so far:
<?php
include_once('simple_html_dom.php');
function scraping_shares() {
$html = file_get_html('https://m.facebook.com/shares/view?id=10156833628051729');
foreach($html->find('div.bn') as $data) {
$item['name'] = trim($data->find('h3.bo', 0)->plaintext);
$ret[] = $item;
}
$html->clear();
unset($html);
return $ret;
}
ini_set('user_agent', 'My-Application/2.5');
$ret = scraping_shares();
foreach($ret as $v) {
echo $v['name'].' <br>';
}
?>
Any help please?

Parsing HTML Table Data from XML with PHP

I am somewhat new with PHP, but can't really wrap my head around what I am doing wrong here given my situation.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element via Reddit (if you visit this page, it would be the actual link of the video - not the reddit link but the external youtube link or whatever - nothing else).
Here is my code so far (code updated):
Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.
function getXMLFeed() {
echo "<h2>Reddit Items</h2><hr><br><br>";
//$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
$feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
$xml = simplexml_load_file($feedURL);
//define each xml entry from reddit as an item
foreach ($xml -> entry as $item ) {
foreach ($item -> content as $content) {
$newContent = (string)$content;
$html = str_get_html($newContent);
foreach($html->find('table') as $table) {
$links = $table->find('span', '0');
//echo $links;
foreach($links->find('a') as $link) {
echo $link->href;
}
}
}
}
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain element's values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz

The following code can extract you all the youtube links from each content.
function extract_youtube_link($xml) {
$entries = $xml['entry'];
$videos = [];
foreach($entries as $entry) {
$content = html_entity_decode($entry['content']);
preg_match_all('/<span><a href="(.*)">\[link\]/', $content, $matches);
if(!empty($matches[1][0])) {
$videos[] = array(
'entry_title' => $entry['title'],
'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
'author_reddit_url' => $entry['author']['uri'],
'video_url' => $matches[1][0]
);
}
}
return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The code outputs in the multidimensional format of array with the elements inside are entry_title, author, author_reddit_url and video_url. Hope it helps you!

If you're looking for a specific element you don't need to parse the whole thing. One way of doing it could be to use the DOMXPath class and query directly the xml. The documentation should guide you through.
http://php.net/manual/es/class.domxpath.php .

How to scrape page using simple htmldom and PHP?

I am trying to get the data in <div id listing-page-cart-inner> and <div id="description text"> and <div id="tags">, but i am finding it difficult to mine data.
Can anyone guide me? I am not able to fetch data though first div that I mentioned I am able to scrape, but other div I am not able to. When I loop through the second foreach it takes longer time.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://etsy.com/listing/107492702/');
//$val = $html->find('div[id=listing-page-cart-inner]');
function scraping_etsy() {
// create HTML DOM
$html = file_get_html('https://etsy.com/listing/107492702/');
foreach($html->find('div[id=listing-page-cart-inner]') as $article)
{
// get title
//$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('span', 0)->plaintext);
// get intro
//$lists = $articles->find('div[id=item-overview]');
$item['list1'] = trim($article->find('li',0)->plaintext);
$item['list2'] = trim($article->find('li',1)->plaintext);
$item['list3'] = trim($article->find('li',2)->plaintext);
$item['list4'] = trim($article->find('li',3)->plaintext);
$item['list5'] = trim($article->find('li',4)->plaintext);
/*foreach($article->find('li') as $al){
$item['lists'] =trim($al->find('li')->plaintext);
}*/
$ret[] = $item;
}
foreach($html->find('div[id=description]') as $content){
var_dump($content->find('text'));
// $item['content'] = trim($content->find('div[id=description]')->plaintext);
// $ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret ;
}
$ret = scraping_etsy();
var_dump($ret);
/*foreach($ret as $v) {
echo $v['title'].'<br>';
echo '<ul>';
echo '<li>'.$v['details'].'</li>';
echo '<li>Diggs: '.$v['diggs'].'</li>';
echo '</ul>';
}*/
?>

As for getting children of those divs, just remember that if found the parent element, always use ->find('<the selector here>', 0) always use the index to actually point to that element.
$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';
$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
echo $overview->innertext . '<br/>';
}
// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
$tags[] = $li->find('a', 0)->innertext;
}
echo '<pre>', print_r($tags, 1), '</pre>';
// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;

The easiest way to start always is to use 3d-party library, i.e. Symfony DomCrawler
It usage as easy as
use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
foreach ($crawler as $domElement) {
print $domElement->nodeName;
}
And you can use filters like
$crawler = $crawler->filter('body > p');

Simple HTML DOM Not Finding DIV

I have code trying to extract the Event SKU from the Robot Events Page, here is an example. The code that I am using dosn't find any of the SKU on the page. The SKU is on line 411, with a div of the class "product-sku". My code doesn't event find the Div on the page and just downloads all the events. Here is my code:
<?php
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]);
$html->load($htmldown);
echo "Downloaded";
foreach ($html->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
?>
Can anyone help me fix my code?

This code is used DOMDocument php class. It works successfully for below sample HTML. Please try this code.
// new dom object
$dom = new DOMDocument();
// HTML string
$html_string = '<html>
<body>
<div class="product-sku1" name="div_name">The this the div content product-sku</div>
<div class="product-sku2" name="div_name">The this the div content product-sku</div>
<div class="product-sku" name="div_name">The this the div content product-sku</div>
</body>
</html>';
//load the html
$html = $dom->loadHTML($html_string);
//discard white space
$dom->preserveWhiteSpace = TRUE;
//the table by its tag name
$divs = $dom->getElementsByTagName('div');
// loop over the all DIVs
foreach ($divs as $div) {
if ($div->hasAttributes()) {
foreach ($div->attributes as $attribute){
if($attribute->name === 'class' && $attribute->value == 'product-sku'){
// Peri DIV class name and content
echo 'DIV Class Name: '.$attribute->value.PHP_EOL;
echo 'DIV Content: '.$div->nodeValue.PHP_EOL;
}
}
}
}

I would use a regex (regular expression) to accomplish pulling skus out.
The regex:
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
See php regex docs.
New code:
<?php
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = curl_init($event[4]);
curl_setopt($htmldown, CURLOPT_RETURNTRANSFER, true);
$html=curl_exec($htmldown);
curl_close($htmldown)
echo "Downloaded";
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
foreach ($matches as $row) {
echo $row;
}
}
?>
And actually in this case (using that webpage) being that there is only one sku...
instead of:
foreach ($matches as $row) {
echo $row;
}
You could just use: echo $matches[1]; (The reason for array index 1 is because the whole regex pattern plus the sku will be in $matches[0] but just the subgroup containing the sku is in $matches[1].)

try to use
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = str_get_html($event[4]);
echo "Downloaded";
foreach ($htmldown->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
and if class "product-sku" is only for div's then you can use
$htmldown->find('.product-sku')

PHP Simple DOM Parser to Scrape From Multiple URLs

Is it possible to use a foreach loop to scrape multiple URL's from an array? I've been trying but for some reason it will only pull from the first URL in the array and the show the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.

include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Our you could use the following instead of the whole foreach loop above
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// if the 0 means, return first found or something similar,
// I just had a look at Amazons source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following html-regulations
// then there should only be ONE element with a specific id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
This should do the trick
I have renamed the array to 'links' instead of 'link'. It's an array of links, containing link(s), therefore, foreach($link as $links) seemed wrong, and I changed it to foreach($links as $link)

I really need to ask this question as it will answer way more questions after the world reads this thread. What if ... you used articles like the simple html dom site.
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
what if its $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
what would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
Ive seen this multiple links all over stackoverflow for past 2 years, and I still cannot figure it out. Would be great to get the basic handle on it to how the simple html dom examples are.
thx.
First time postin im sure I broke a bunch of rules and didnt do the code section right. I just had to ask this question badly.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Php Simple Html Dom Parser can't get content on pagination - php

Related

Scrape Facebook page and parse with php

Parsing HTML Table Data from XML with PHP

How to scrape page using simple htmldom and PHP?

Simple HTML DOM Not Finding DIV

PHP Simple DOM Parser to Scrape From Multiple URLs

Categories

Resources