Scrape pagination content using simple dom parser - php

I want to scrape the post titles of a blog, and I wrote the code below. I'm stuck figuring out how to loop through every page.
$dom = file_get_html('http://demos.appthemes.com/clipper/');
scrape('http://demos.appthemes.com/clipper/');

function scrape($URL)
{
    $dom = file_get_html($URL);
    foreach ($dom->find('.item-frame h1 a') as $items) {
        $item = array('courseTitle' => $items->text());
        var_dump($item);
    }
}

for ($pages = 0; $pages < 3; $pages++) {
    if ($next = $dom->find('a[class=page]', $pages)) {
        $URL = $next->href;
        $dom->clear();
        unset($dom);
        scrape($URL);
    }
}
A partial result did appear, but I'm stuck at an error: Undefined variable: dom on line 23.

unset($dom); causes the $dom variable to be unset, so on the second loop iteration ($pages == 1) the call to $dom->find fails.
I don't quite follow the logic, but try removing the $dom->clear(); unset($dom); lines.
Hope it helps.
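For reference, here's a minimal sketch of one way to restructure the loop: scrape the first page, collect the pagination hrefs before clearing the DOM, then scrape each of those pages in turn. The selectors and text() call are taken from the question; everything else is an illustrative assumption about the page markup.
<?php
include('simple_html_dom.php');

function scrape($URL)
{
    $dom = file_get_html($URL);
    foreach ($dom->find('.item-frame h1 a') as $items) {
        var_dump(array('courseTitle' => $items->text()));
    }
    $dom->clear();
    unset($dom);
}

// Scrape the first page, then collect the pagination hrefs from it
// BEFORE clearing the DOM, and scrape each of those pages in turn.
$start = 'http://demos.appthemes.com/clipper/';
scrape($start);

$dom = file_get_html($start);
$pageLinks = array();
foreach ($dom->find('a[class=page]') as $next) {
    $pageLinks[] = $next->href;
}
$dom->clear();
unset($dom);

foreach ($pageLinks as $URL) {
    scrape($URL);
}
?>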

Related

Remove the tag <a> but not content

I'm trying to get all the links from a page and remove them, keeping only their contents. The code doesn't work 100%: some links are removed and others aren't.
I'm using PHP and DOMDocument.
$dom = new DOMDocument();
$dom->encoding = 'utf-8';
$dom->loadHTML(utf8_decode($text));
$links = $dom->getElementsByTagName('a');
foreach ($links as $link)
{
    $link->parentNode->replaceChild(new DOMText($link->textContent), $link); // I've tried this way, but it doesn't work.
    // And I've tried another way, below:
    /*$sibling = $link->firstChild;
    do {
        $next = $sibling->nextSibling;
        $link->parentNode->insertBefore($sibling, $link);
    } while ($sibling = $next);
    $link->parentNode->removeChild($link);*/
}
return $dom->saveHTML();
For example, we have three links:
<p>Page</p>
<a href="#">Page1</a>
<a href="#">Page2</a>
<a href="#">Page3</a>
<p>Test</p>
The result is:
<p>Page</p>
Page1
<a href="#">Page2</a>
Page3
<p>Test</p>
I want all the links removed (but not their content).
Any ideas on how to solve this?
Make a copy of $links as an ordinary array, because the object that getElementsByTagName() returns is a "live" NodeList: it changes as you modify the DOM, and this causes the foreach loop to skip elements (it's the same problem as trying to delete elements from an array while you're looping over it).
$links_array = [];
foreach ($links as $l) {
    $links_array[] = $l;
}

foreach ($links_array as $link)
{
    $link->parentNode->replaceChild(new DOMText($link->textContent), $link);
}
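Putting it together, a minimal self-contained sketch (the sample HTML string is just an illustration, and iterator_to_array() is an equivalent way to copy the live NodeList):
<?php
function strip_links($html)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Copy the live NodeList into a plain array before touching the DOM.
    $links_array = iterator_to_array($dom->getElementsByTagName('a'));

    foreach ($links_array as $link) {
        // Replace each <a> element with a text node holding its text content.
        $link->parentNode->replaceChild(new DOMText($link->textContent), $link);
    }
    return $dom->saveHTML();
}

echo strip_links('<p><a href="/p1">Page1</a> <a href="/p2">Page2</a></p>');
// Output contains: <p>Page1 Page2</p>
?>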

Parsing HTML Table Data from XML with PHP

I am somewhat new to PHP, and can't really wrap my head around what I am doing wrong here.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element from Reddit (if you visit this page, it would be the actual link of the video: not the reddit link, but the external YouTube link or whatever; nothing else).
Here is my code so far (code updated):
Update: Loop-mania! I got all of the hrefs, but am now trying to store them in a global array so I can access a random one outside of this function.
function getXMLFeed() {
    echo "<h2>Reddit Items</h2><hr><br><br>";
    //$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
    $feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
    $xml = simplexml_load_file($feedURL);
    // define each xml entry from reddit as an item
    foreach ($xml->entry as $item) {
        foreach ($item->content as $content) {
            $newContent = (string)$content;
            $html = str_get_html($newContent);
            foreach ($html->find('table') as $table) {
                $links = $table->find('span', 0);
                //echo $links;
                foreach ($links->find('a') as $link) {
                    echo $link->href;
                }
            }
        }
    }
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried that uses DOMDocument (I scrapped it).
foreach ($item->content as $content) {
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    $xpath = new DOMXPath($dom);
    $classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
    foreach ($dom->getElementsByTagName('table') as $node) {
        echo $dom->saveHtml($node), PHP_EOL;
        //$originalURL = $node->getAttribute('href');
    }
    //$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain elements' values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz
The following code extracts all the YouTube links from each entry's content.
function extract_youtube_link($xml) {
    $entries = $xml['entry'];
    $videos = [];
    foreach ($entries as $entry) {
        $content = html_entity_decode($entry['content']);
        preg_match_all('/<span><a href="(.*)">\[link\]/', $content, $matches);
        if (!empty($matches[1][0])) {
            $videos[] = array(
                'entry_title' => $entry['title'],
                'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
                'author_reddit_url' => $entry['author']['uri'],
                'video_url' => $matches[1][0]
            );
        }
    }
    return $videos;
}

$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);

foreach ($videos as $video) {
    echo "<p>Entry Title: {$video['entry_title']}</p>";
    echo "<p>Author: {$video['author']}</p>";
    echo "<p>Author URL: {$video['author_reddit_url']}</p>";
    echo "<p>Video URL: {$video['video_url']}</p>";
    echo "<br><br>";
}
The code returns a multidimensional array whose elements contain entry_title, author, author_reddit_url, and video_url. Hope it helps you!
If you're looking for a specific element, you don't need to parse the whole thing. One way of doing it is to use the DOMXPath class and query the XML directly. The documentation should guide you through:
http://php.net/manual/es/class.domxpath.php
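For example, a rough sketch of that approach for this feed. The XPath expressions are assumptions based on the Atom feed structure and the <span>[link]</span> anchor described above:
<?php
$dom = new DOMDocument();
$dom->load('https://www.reddit.com/r/videos/.xml?limit=200');

$xpath = new DOMXPath($dom);
// The feed is Atom, so its elements live in the Atom namespace.
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

foreach ($xpath->query('//atom:entry/atom:content') as $content) {
    // The content is escaped HTML; parse it with a second DOMDocument.
    $inner = new DOMDocument();
    @$inner->loadHTML($content->textContent);

    $innerXpath = new DOMXPath($inner);
    // Grab the anchor whose text is "[link]".
    foreach ($innerXpath->query('//a[text()="[link]"]') as $a) {
        echo $a->getAttribute('href'), PHP_EOL;
    }
}
?>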

How can I retrieve info from a PHP DOMElement?

I'm working on a function that gets the whole content of the style.css file and returns only the CSS rules needed by the currently viewed page (it will be cached too, so the function only runs when the page has changed).
My problem is with parsing the DOM (I've never done it before with PHP DOM). I have the following function, but $element->tagname returns NULL. I also want to check the element's "class" attribute, but I'm stuck here.
function get_rules($html) {
    $arr = array();
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('*') as $element) {
        $arr[sizeof($arr)] = $element->tagname;
    }
    return array_unique($arr);
}
What can I do? How can I get all of the DOM elements tag name, and class from HTML?
tagname returns NULL because the property is named tagName (camel case); DOMElement property names are case-sensitive.
function get_rules($html) {
    $arr = array();
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('*') as $element) {
        $e = array();
        $e['tagName'] = $element->tagName; // tagName, not tagname
        // get all of the element's attributes
        foreach ($element->attributes as $attr) {
            $attrs = array();
            $attrs['name'] = $attr->nodeName;
            $attrs['value'] = $attr->nodeValue;
            $e['attributes'][] = $attrs;
        }
        $arr[] = $e;
    }
    return $arr;
}
Simple Output
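For instance, a quick usage sketch (the HTML string and the printed structure below are hypothetical, just to show the shape of the result):
<?php
$html = '<div class="wrap"><p class="intro">Hello</p></div>';
print_r(get_rules($html));
// Prints entries like:
//   [tagName] => p
//   [attributes] => Array ( [0] => Array ( [name] => class, [value] => intro ) )
?>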

Php Simple Html Dom Parser can't get content on pagination

Hi, I'm a beginner in using simple_html_dom. I'm trying to fetch the list of hrefs from the list of posts on this sample website with pagination, using the code below.
<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.themelock.com/wordpress/elegantthemes/');

function getArticles($page) {
    global $articles;
    $html = new simple_html_dom();
    $html->load_file($page);
    $items = $html->find('h2[class=post-title]');
    foreach ($items as $post) {
        $articles[] = array($post->children(0)->href);
    }
    foreach ($articles as $item) {
        echo "<div class='item'>";
        echo $item[0];
        echo "</div>";
    }
}

if ($next = $html->find('div[class=navigation]', 0)->last_child()) {
    $URL = $next->href;
    $html->clear();
    unset($html);
    getArticles($URL);
}
?>
As a result I'm getting:
http://www.themelock.com/wordpress/908-minimal-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/892-event-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/882-askit-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/853-lightbright-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/850-inreview-elegantthemes-review-wordpress-theme.html
http://www.themelock.com/wordpress/807-boutique-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/804-elist-elegantthemes-directory-wordpress-theme.html
http://www.themelock.com/wordpress/798-webly-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/795-elegantestate-real-estate-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/786-notebook-elegantthemes-wordpress-theme.html
The above code fetches only the next page's (second page's) contents. I'm wondering how to get the first page's post URLs followed by the next pages'.
Does anyone know how to do this?
Thanks for your support, guys. I made this work using the code below:
<?php
include('simple_html_dom.php');

$url = "http://www.themelock.com/wordpress/yootheme-wordpress/";

// Start from the main page
$nextLink = $url;

// Loop on each next link as long as it exists
while ($nextLink) {
    echo "<hr>nextLink: $nextLink<br>";

    // Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);

    $posts = $html->find('h2[class=post-title]');
    foreach ($posts as $post) {
        // Get the link
        $articles = $post->children(0)->href;
        echo $articles.'</br>';
    }

    // Extract the next link; if not found, this returns NULL
    $nextLink = (($temp = $html->find('div[class=navigation]', 0)->last_child()) ? $temp->href : NULL);

    // Clear the DOM object
    $html->clear();
    unset($html);
}
?>

Find and replace all links in a web page using php/javascript

I need to find the links in a part of some HTML code and replace each of them with one of two different absolute or base domains, followed by the link itself...
I have found a lot of ideas and tried a lot of different solutions, but luck isn't on my side on this one. Please help me out!
Thank you!!
This is my code:
<?php
$url = "http://www.oxfordreference.com/views/SEARCH_RESULTS.html?&q=android";
$raw = file_get_contents($url);
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

$start = strpos($content, '<table class="short_results_summary_table">');
$end = strpos($content, '</table>', $start) + 8;
$table = substr($content, $start, $end - $start);
echo "{$table}";

$dom = new DOMDocument();
$dom->loadHTML($table);
$dom->strictErrorChecking = FALSE;

// Get all the links
$links = $dom->getElementsByTagName("a");
foreach ($links as $link) {
    $href = $link->getAttribute("href");
    echo "{$href}";
    if (strpos("http://oxfordreference.com", $href) == -1) {
        if (strpos("/views/", $href) == -1) {
            $ref = "http://oxfordreference.com/views/"+$href;
        }
        else
            $ref = "http://oxfordreference.com"+$href;
        $link->setAttribute("href", $ref);
        echo "{$link->getAttribute("href")}";
    }
}

$table12 = $dom->saveHTML;
preg_match_all("|<tr(.*)</tr>|U", $table12, $rows);
echo "{$rows[0]}";
foreach ($rows[0] as $row) {
    if ((strpos($row, '<th') === false)) {
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        echo "{$cells}";
    }
}
?>
When I run this code, I get an htmlParseEntityRef: expecting ';' warning on the line where I load the HTML.
var links = document.getElementsByTagName("a"); will get you all the links.
And this will loop through them:
for (var i = 0; i < links.length; i++)
{
    links[i].href = "newURLHERE";
}
You should use jQuery: it is excellent for link replacement. Rather than explaining it here, please look at this answer:
How to change the href for a hyperlink using jQuery
I recommend scrappedcola's answer, but if you don't want to do it on the client side, you can use regex to replace:
ob_start();
// your HTML
// end of the page
$body = ob_get_clean();
$body = preg_replace("/<a[^>]*href=(\"[^\"]*\")/", "NewURL", $body);
echo $body;
You can use referencing (\$1) or the callback version (preg_replace_callback) to modify the output as you like.
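A small sketch of the callback version, assuming the goal from the question (prefix relative hrefs with a base domain); the sample HTML, the domain, and the regex details are illustrative:
<?php
$body = '<p><a href="/views/ENTRY.html">Entry</a> <a href="http://oxfordreference.com/x">X</a></p>';

// Rewrite only relative hrefs, leaving absolute ones untouched.
$body = preg_replace_callback(
    '/href="([^"]*)"/',
    function ($m) {
        $href = $m[1];
        if (strpos($href, 'http://') === 0) {
            return 'href="' . $href . '"'; // already absolute
        }
        return 'href="http://oxfordreference.com' . $href . '"';
    },
    $body
);

echo $body;
?>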
