I am using the PHP Simple HTML DOM parser to extract all of the image sources on a given page, like so:
// Include the library
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://google.com/');
// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
echo $e->src . '<br>';
Instead of using Google.com, I wish to use a page in WordPress's admin (backend) area. These pages are PHP pages, not HTML files (but they output standard HTML throughout). How would I use the current page as the $html variable? PHP newbie over here.
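(A hedged sketch of one possible approach, assuming the parsing code runs inside the WordPress page as it renders: buffer the page's own output and hand the string to str_get_html(), Simple HTML DOM's string-based counterpart of file_get_html(). The buffering placement is an assumption for illustration, not part of the thread.)
include('simple_html_dom.php');
// Start buffering before the page renders its HTML
ob_start();
// ... the page emits its normal HTML here ...
$page = ob_get_contents(); // read the buffered output
ob_end_flush();            // still send the page to the browser
// Parse the captured string instead of fetching a URL
$html = str_get_html($page);
foreach ($html->find('img') as $e) {
    echo $e->src . '<br>';
}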
Using the WebGet library by dxtool, found here.
Login
require 'WebGet.php';
$w = new WebGet();
// using cache to prevent repetitive download
$w->useCache = true;
$w->cacheLocation = '/tmp';
$w->cacheMaxAge = 3600;
$w->cookieFile = '/tmp/cookie.txt';
// $login_get_data and $login_post_data are associative arrays
$login = $w->requestContent($login_url, $login_get_data, $login_post_data);
Visiting the image-containing page
// $image_page_url is the url of the page where your images exist.
$image_page = $w->requestContent($image_page_url);
Parse images and display
$dom = new DOMDocument();
$dom->loadHTML($image_page);
$imgs = $dom->getElementsByTagName("img");
foreach($imgs as $img){
echo $img->getAttribute("src");
}
Disclaimer: I am the author of this class
I am building a web scraper using the Simple HTML DOM parser. However, I ran into some issues figuring out how to store HTML elements on a web page as objects. I would like to take an input URL and turn all of the HTML elements, like tags, divs, fields, etc., into objects that get output onto a page. I have written some code that currently works when I type in a URL, but the output is not what I am trying to achieve. Below I have attached the code I have worked out already, and I am seeking a way to achieve what I am trying to do.
I have tried finding all images and links, as well as creating a DOM object. I can't seem to figure out how to convert these elements into objects that I can use to learn more about a website, and possibly store that data in a database.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$url = $_POST["url"];
$html = file_get_html($url);
echo $html;
// Find all images
$element = new simple_html_dom();
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
$element = new simple_html_dom();
foreach($html->find('a') as $element)
echo $element->href . '<br>';
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL
$html->load_file($url);
echo $html;
?>
I am expecting an output of objects, but instead I am getting the actual images and links of the web page rendered.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
// $url = $_POST["url"];
$url = 'Your-Url'; // Your url: 'www.example.com'
$html = file_get_html($url);
// Find all images
$images = []; //create empty images array
foreach($html->find('img') as $element){
$images[] = $element->src . '<br>'; //Store the found elements in the images array
}
echo '<pre>Output $images: '; var_dump($images); echo '</pre>'; //An output from the images array
// Find all links
$links = []; //create empty links array
foreach($html->find('a') as $element){
$links[] = $element->href . '<br>'; //Store the found elements in the links array
}
echo '<pre>Output $links: '; var_dump($links); echo '</pre>'; //An output from the links array
The echo statements display the arrays filled with the values of the 'img' and 'a' tags from your page.
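Since the original question also mentions possibly storing that data in a database, here is a hedged sketch of one way to persist the two arrays with PDO; the DSN, the credentials, and the scraped_urls table are assumptions for illustration, not part of the original code:
// Hypothetical persistence step; database name, credentials and the
// scraped_urls(type, url) table are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO scraped_urls (type, url) VALUES (?, ?)');
foreach ($images as $src) {
    $stmt->execute(['img', strip_tags($src)]); // strip_tags() drops the appended '<br>'
}
foreach ($links as $href) {
    $stmt->execute(['a', strip_tags($href)]);
}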
An API returns some HTML code (only part of the body, not a full HTML document) and I want to change all of the images' srcs to other paths.
I get and set the attributes, and if I echo them inside the foreach loop I see the old and new values; but when I try to save with saveHTML and then dump the full HTML block returned from the API, I don't see the replaced paths.
$page = json_decode($page);
$page = (array) $page->rows;
$page = ($page[0]->_->content);
$dom = new \DOMDocument();
$dom->loadHTML($page);
$tag = $dom->getElementsByTagName('img');
foreach($tag as $t)
{
echo $t->getAttribute('src').'<br>'; //showing old src
$t->setAttribute('src', 'bla');
echo $t->getAttribute('src').'<br>'; //showing new src
}
$dom->saveHTML();
var_dump($page); //nothing is changed
My friend, this is not how it works.
You should have your edited HTML in the result of saveHTML(), so:
$editedHtml = $dom->saveHTML();
var_dump($editedHtml);
Now you should see your changed HTML.
The explanation is that $page is a plain string that has nothing to do with the $dom object: loadHTML() copies its content into the DOM, and changes to the DOM only become visible when you export them with saveHTML().
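Put together, a minimal sketch of the corrected flow (the same logic as in the question, with the export step added at the end):
$dom = new \DOMDocument();
@$dom->loadHTML($page); // @ suppresses warnings on partial/fragment HTML
foreach ($dom->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', 'bla');
}
$editedHtml = $dom->saveHTML(); // the modified markup is returned here;
var_dump($editedHtml);          // the original $page string is never touched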
Cheers!
My scraping code works for just about every site I've come across while testing... except for nytimes.com articles. I use AJAX with the following PHP code (I've left out some details to focus on my specific problem):
$link = "http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp";
$article = new DOMDocument;
$article->loadHTMLFile($link);
//generate image array
$images = $article->getElementsByTagName("img");
foreach ($images as $image) {
$source = $image->getAttribute("src");
echo '<img src="' . $source . '" alt="alt"><br><br>';
}
My problem is that the main images on nytimes.com pages don't even seem to get picked up by getElementsByTagName. Pinterest finds a way to scrape the main images from this site, for example: http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp, whereas I cannot. Any suggestions?
OK, so this is what I tried so far, as I found your question interesting.
When I do this in the browser console using jQuery, I do get results for the images. My query was:
var a= new Array();
$('img[src]').each(function(){ a.push($(this).attr('src'));});
console.log(a);
Also see the screenshot of the results. Note that console.log(arrayname) works in the Chrome browser.
So ideally your code should work. Please consider adding an is_null check like I've done.
Below is the code where I try loading the URL using a different approach (perhaps a better one too) to get to the root cause of why you only get the single image of the NYT logo.
A screenshot of the resulting HTML is attached.
<?php
$html = file_get_contents("http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp");
echo $html;
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover=true;
@$doc->loadHTML("<html><body>".$html."</body></html>");
$xpath = new DOMXpath($doc);
$images = $xpath->query("//*/img");
if (!is_null($images)) {
echo $images->length;
foreach ($images as $image) {
$source = $image->getAttribute('src');
echo '<img src="' . $source . '" alt="alt"><br><br>';
}
}
?>
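(A hedged aside, added here as an assumption about what scrapers like Pinterest may rely on: many article pages expose their lead image in an Open Graph og:image meta tag, which is present in the raw HTML even when the body images are injected later by JavaScript. A minimal sketch:)
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
// og:image is a common Open Graph meta tag holding a page's lead image
$meta = $xpath->query('//meta[@property="og:image"]/@content');
if ($meta->length > 0) {
    echo $meta->item(0)->nodeValue;
}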
You can't get the content via the feed unless you are authenticated.
You can try:
- using the context parameter in the file_get_contents method (see the sketch after this list);
- consuming the RSS/ATOM feeds of the article;
- downloading the page as HTML yourself and then loading it with file_get_contents. PS: this works.
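For the first suggestion, a hedged sketch of what such a stream context might look like; the User-Agent value is an assumption (some sites simply refuse requests that don't look like a browser):
// Send a browser-like User-Agent header along with file_get_contents()
$context = stream_context_create([
    'http' => [
        'header' => "User-Agent: Mozilla/5.0 (compatible; ExampleBot/1.0)\r\n",
    ],
]);
$html = file_get_contents($link, false, $context);
$article = new DOMDocument;
@$article->loadHTML($html);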
I have been working with this PHP code, which should modify the Google Calendar layout. But when I put the code on the page, it makes everything below it disappear. What's wrong with it?
<?php
$your_google_calendar=" PAGE ";
$url= parse_url($your_google_calendar);
$google_domain = $url['scheme'].'://'.$url['host'].dirname($url['path']).'/';
// Load and parse Google's raw calendar
$dom = new DOMDocument;
$dom->loadHTMLfile($your_google_calendar);
// Change Google's CSS file to use absolute URLs (assumes there's only one element)
$css = $dom->getElementByTagName('link')->item(0);
$css_href = $css->getAttributes('href');
$css->setAttributes('href', $google_domain . $css_href);
// Change Google's JS file to use absolute URLs
$scripts = $dom->getElementByTagName('script')->item(0);
foreach ($scripts as $script) {
$js_src = $script->getAttributes('src');
if ($js_src) { $script->setAttributes('src', $google_domain . $js_src); }
}
// Create a link to a new CSS file called custom_calendar.css
$element = $dom->createElement('link');
$element->setAttribute('type', 'text/css');
$element->setAttribute('rel', 'stylesheet');
$element->setAttribute('href', 'custom_calendar.css');
// Append this link at the end of the <head> element
$head = $dom->getElementByTagName('head')->item(0);
$head->appendChild($element);
// Export the HTML
echo $dom->saveHTML();
?>
When I test your code, I get some errors because of wrong method calls:
->getElementByTagName should be ->getElementsByTagName, with an s after Element,
and
->setAttributes and ->getAttributes should be ->setAttribute and ->getAttribute, without the s at the end.
I'm guessing that you don't have error_reporting turned on, and because of that you don't know that anything went wrong?
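For reference, a quick way to turn that on while debugging:
// Show all errors and warnings during development
error_reporting(E_ALL);
ini_set('display_errors', '1');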
I have the following code that replaces all <img> tags on a page and adds the nCode image resizer to them. The code is as follows:
function ncode_the_content($content) {
return preg_replace("/<img([^`|>]*)>/im", "<img onload=\"NcodeImageResizer.createOn(this);\"$1>", $content);
}
What I need to do is make it so that if an image has the class "noresize", the preg_replace skips it.
So far I have only managed to make it so that if the "noresize" class appears anywhere on the page, it stops resizing all of the images instead of just the ones with that class.
Any suggestions?
UPDATE:
Am I even remotely in the right ballpark with this?
function ncode_the_content($content) {
//Load the HTML page
$html = file_get_contents($content);
//Parse it. Here we use loadHTML as a static method
//to parse the HTML and create the DOM object in one go.
@$dom = DOMDocument::loadHTML($html);
//Init the XPath object
$xpath = new DOMXpath($dom);
//Query the DOM
$linksnoresize = $xpath->query( 'img[@class = "noresize"]' );
$links = $xpath->query( 'img[]' );
//Display the results as in the previous example
foreach($links as $link){
echo $link->getAttribute('onload'), 'NcodeImageResizer.createOn(this);';
}
foreach($linksnoresize as $link){
echo $link->getAttribute('onload'), '';
}
}
Here's some untested code:
$dom = DOMDocument::loadHTML($content);
$images = $dom->getElementsByTagName("img");
foreach ($images as $image) {
if (!strstr($image->getAttribute("class"), "noresize")) {
$image->setAttribute("onload", "NcodeImageResizer.createOn(this);");
}
}
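// Note (added for completeness): the changes live in $dom, so export them
// afterwards with $dom->saveHTML(), as discussed in the saveHTML answer above.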
But, if it were me, I would eschew any such inline event handler and instead just find the appropriate elements with JavaScript.
I ended up just using pure CSS and adding a <div> around the images I didn't want to be resized. I forced the width and height of that div back to auto and then removed the warning message that was displayed above them. Seems to work fine. Thanks for your help :)