Using SimplePie and Simple HTML DOM together

I'm trying to use SimplePie to pull a list of links via RSS feeds and then scrape those links using Simple HTML DOM to pull out images. I'm able to get SimplePie working to pull the links and store them in an array. I can also use the Simple HTML DOM parser to get the image link that I'm looking for. The problem is that when I try to use SimplePie and Simple HTML DOM at the same time, I get a 500 error. Here's the code:
set_time_limit(0);
error_reporting(0);
$rss = new SimplePie();
$rss->set_feed_url('http://contently.com/strategist/feed/');
$rss->init();
foreach ($rss->get_items() as $item) {
    $urls[] = $item->get_permalink();
}
unset($rss);
/*
$urls = array(
    'https://contently.com/strategist/2016/01/22/whats-in-a-spotify-name-and-5-other-stories-you-should-read/',
    'https://contently.com/strategist/2016/01/22/how-to-make-content-marketing-work-inside-a-financial-services-company/',
    'https://contently.com/strategist/2016/01/22/glenn-greenwald-talks-buzzfeed-freelancing-the-future-journalism/',
    ...
    'https://contently.com/strategist/2016/01/19/update-a-simpler-unified-workflow/');
*/
foreach ($urls as $url) {
    $html = new simple_html_dom();
    $html->load_file($url);
    $images = $html->find('img[class=wp-post-image]', 0);
    echo $images;
    $html->clear();
    unset($html);
}
I commented out the urls array, but it is identical to the array created by the SimplePie loop (I created it manually from the results). It fails on the find command the first time through the loop. If I comment out the $rss->init() line and use the static url array, the code all runs with no errors, but doesn't give me the result I want - of course. Any help is greatly appreciated!

There's a strange incompatibility between simple_html_dom and SimplePie. When the HTML is loaded, simple_html_dom->root is not set, so every subsequent operation fails.
Curiously, switching from the object-style call to the function-style call works fine for me:
$html = file_get_html( $url );
instead of:
$html = new simple_html_dom();
$html->load_file($url);
Anyway, simple_html_dom is known for causing problems, above all with memory usage.
Edited:
OK, I have found the bug.
It resides in simple_html_dom->load_file(), which calls the standard function file_get_contents(), then checks the result through error_get_last() and, if an error is found, unsets its own data. But if an error occurred earlier (in my test SimplePie output a warning that ./cache is not writeable), simple_html_dom interprets that previous error as a file_get_contents() failure.
If you have PHP 7 installed, you can call error_clear_last() after unset($rss), and your code should work. Otherwise, you can use my code above, or pre-load the HTML into a variable and then call simple_html_dom->load() instead of simple_html_dom->load_file().
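The stale-error behaviour described above can be reproduced without either library. This is only a minimal sketch, assuming PHP 7 or later: the warning text is made up to stand in for SimplePie's cache warning, and reading the script's own file stands in for a successful download.

```php
<?php
// Simulate an unrelated earlier warning (a stand-in for SimplePie's cache
// warning). The @ operator hides it from output, but PHP still records it.
@trigger_error('cache directory is not writeable', E_USER_WARNING);

// A later operation that succeeds does not reset the recorded error.
$data = file_get_contents(__FILE__);
$stale = error_get_last();      // still describes the earlier warning

// PHP 7+: clear the recorded error so later checks start from a clean state.
error_clear_last();
$afterClear = error_get_last(); // null once the record has been cleared

var_dump($data !== false, $stale !== null, $afterClear === null);
```

Any code that treats a non-null error_get_last() as proof that the most recent call failed will be misled in the same way, which is why clearing the record before the call helps.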

Related

Pull content from one wordpress site to another wordpress site

I am trying to find a way of displaying the text from a website on a different site.
I own both the sites, and they both run on wordpress (I know this may make it more difficult). I just need a page to mirror the text from the page and when the original page is updated, the mirror also updates.
I have some experience in PHP and HTML, and I also would rather not use Js.
I have been looking at some posts that suggest cURL and file_get_contents but have had no luck editing it to work with my sites.
Is this even possible?
Look forward to your answers!

Both cURL and file_get_contents() are fine to get the full html output from an url. For example with file_get_contents() you can do it like this:
<?php
$content = file_get_contents('http://elssolutions.co.uk/about-els');
echo $content;
However, in case you need just a portion of the page, DOMDocument and DOMXPath are far better options, as with the latter you can also query the DOM. Below is a working example.
<?php
$url = 'http://elssolutions.co.uk/about-els';
// The `id` of the node in the target document to get the contents of
$id = 'comp-iudvhnkb';
$dom = new DOMDocument();
// Silence `DOMDocument` errors/warnings on html5-tags
libxml_use_internal_errors(true);
// Loading content from external url
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Querying DOM for target `id` (XPath selects attributes with `@`)
$xpathResultset = $xpath->query("//*[@id='$id']")->item(0);
// Getting plain html
$content = $dom->saveHTML($xpathResultset);
echo $content;
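The same query-by-id technique can be tried offline against an in-memory string. This is a minimal sketch only: the sample markup and the id `intro` are invented for illustration, and the null check covers the case where no node matches.

```php
<?php
// Hypothetical sample markup standing in for the remote page.
$html = '<html><body><div id="intro"><p>Hello</p></div>'
      . '<div id="other">x</div></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings quietly
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// In XPath, attributes are addressed with "@", so the id test is @id.
$node = $xpath->query("//*[@id='intro']")->item(0);

// item(0) returns null when nothing matched, so guard before serializing.
$fragment = $node !== null ? $dom->saveHTML($node) : '';
echo $fragment, "\n";
```

Passing the node to saveHTML() serializes only that subtree, so the rest of the page is left out of the mirrored content.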

Using simpleHTMLDOM to only grab a specific div class grabs the entire page instead

Hello Stack Overflow, I'm trying to use the following library: simplehtmldom.sourceforge.net
The purpose of this little script is to grab the Stack Overflow logo and echo it. But for some strange reason it grabs every DOM element instead. Any idea what I'm doing wrong here?
<?php
include('simple_html_dom.php');
$request_url = 'http://stackoverflow.com/';
$html = file_get_html($request_url);
$element = $html->find('div[id=hlogo]');
echo $html->save($element);
Thank you in advance for taking your time to read this!

$html->find returns an array in the form that you're using it, so you need to access the first element of the array to get the results:
include('simple_html_dom.php');
$html = file_get_html('http://stackoverflow.com');
$logo = $html->find('#hlogo'); // find the id hlogo
echo $logo[0];
# prints out
# <div id="hlogo"> Stack Overflow </div>
You're also using the save function wrong; from the docs:
// Dumps the internal DOM tree back into string
$str = $html->save();
// Dumps the internal DOM tree back into a file
$html->save('result.htm');
You're getting the whole page because $html contains the whole DOM!

file_get_html() doesnt work [duplicate]

I used the following code to parse the HTML of another site, but it displays a fatal error:
$html=file_get_html('http://www.google.co.in');
Fatal error: Call to undefined function file_get_html()

Are you sure you have downloaded and included the PHP Simple HTML DOM Parser?

You are calling a function that is not part of PHP.
Download the simple_html_dom class and use the methods it provides. It is especially handy when you are working with email newsletters:
include_once('simple_html_dom.php');
$html = file_get_html('http://www.google.co.in');

As everyone has told you, you are seeing this error because you did not download and include the simple_html_dom class after copying and pasting that third-party code.
You have two options. Option one is what the other answers describe: download the file and include it.
Option two is to not use that third-party class at all and to use PHP's built-in DOMDocument class for the same task. The DOM extension is enabled by default in PHP, so there is nothing extra to download or include.
Instead of file_get_html, which is not a function defined by PHP, use:
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
echo $doc->saveHTML();
These methods are defined by PHP itself; see the DOMDocument pages in the PHP manual (php.net/manual).
This puts the HTML into a DOM object which can be parsed by individual tags, attributes, etc.. Here is an example of getting all the 'href' attributes and corresponding node values out of the 'a' tag. Very cool....
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
}

It looks like you're looking for simplexml_load_file which will load a file and put it into a SimpleXML object.
Of course, if the document is not well-formed, that might cause problems. Your other option is DOMDocument::loadHTMLFile, which is a good deal more forgiving of badly formed documents.
If you don't care about the XML and just want the data, you can use file_get_contents.

$html = file_get_contents('http://www.google.co.in');
to get the html content of the page

In simple words:
Download simple_html_dom.php.
Now add this line to your PHP file:
include_once('simple_html_dom.php');
and start your coding after that:
$html = file_get_html('http://www.google.co.in');
No error will be displayed.

Try file_get_contents.
http://www.php.net/manual/en/function.file-get-contents.php

How can I load a url obtained from $_SERVER['REQUEST_URI'] into DOMDocument?

How to load a URL obtained from $_SERVER['REQUEST_URI'] into domDocument?
I am trying to load a dynamic webpage into DOMDocument to be parsed for certain words. Ultimately I want to create a glossary for my site (Tiki Wiki CMS). I started very simple and right now I am only trying to load a page and parse the text for testing purposes.
I am new to DOMDocument, and after reading several articles on this site and in the PHP Manual, I know that I have to load an HTML page with loadHTMLFile, then parse it with getElementById or getElementsByTagName in order to do stuff with it. It works fine for static pages, but the main problem I am having is that I cannot enter a static URL into loadHTMLFile, because parsing should be performed when the site is uploaded by the user.
Here's the code that DID work:
$url = 'http://mysite.org/bbk/tiki-index.php?page=pagetext';
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
    echo $link->nodeValue;
}
So, I thought I could use $_SERVER['REQUEST_URI'] for the job, but it did not work.
This did NOT work (no error message):
$url = $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
    echo $link->nodeValue;
}
After checking what the $url output was, I decided to add http://mysite.org to it to make it identical to the url that worked. However, no luck either and this time I got an internal server error.
This did NOT work either (Internal Server Error):
$url = 'http://mysite.org' . $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
    echo $link->nodeValue;
}
I think I am missing something substantial here and I thought it might just not be possible to use DOMDocument in this way, so I was searching the web for help again (if it is possible to use $_SERVER['REQUEST_URI'] in combination with DOMdocument at all), but I didn't find an answer. So I hope anybody here can help. Any suggestions including third party parsers etc. would be helpful, except anything that requires parsing with regex. Tiki Wiki CMS already has a glossary option done with regex, but it is very buggy.
Thanks.
UPDATE
I haven't found an answer to the problem, but I think I have an idea on where my mistake was. I was expecting $_SERVER['REQUEST_URI'] to run on a dynamic page that was not completely built yet. I ran the script on the main setup page, so I guess the html was not rendered yet, when I tried to point $_SERVER['REQUEST_URI'] to it. When I noticed that this might be the problem, I abandoned the idea of parsing the document with DomDocument and used a javascript solution that can be loaded after the document is ready.

I can think of two things that you can do (they probably won't solve your problem directly, but will help you greatly with solving it):
1. $_SERVER['REQUEST_URI'] doesn't contain what you think it does. Try echoing or var_dumping it, and see if it matches your expectations.
2. Enable error reporting. You are seeing a generic 500 error page because error reporting is disabled. Enable it using error_reporting().
Also note that DOMDocument only parses HTML. If you have dynamic DOM nodes generated and added to the page using a client-side language, or CSS pseudo-elements, they won't be present unless you deploy a JS/CSS parser as well (which is not trivial).
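To see why the bare value is not loadable as a URL: REQUEST_URI holds only the path and query string, with no scheme or host. A rough sketch of assembling a full URL from the usual server variables follows. The helper name build_absolute_url is hypothetical, the sample array stands in for $_SERVER so the snippet also runs from the command line, and the `localhost` fallback is an assumption.

```php
<?php
// Hypothetical helper: combine scheme, host and request URI into one URL.
function build_absolute_url(array $server): string
{
    // HTTPS is typically set to a non-empty value other than "off" on TLS requests.
    $https  = !empty($server['HTTPS']) && $server['HTTPS'] !== 'off';
    $scheme = $https ? 'https' : 'http';
    $host   = $server['HTTP_HOST'] ?? 'localhost';   // assumed fallback
    $uri    = $server['REQUEST_URI'] ?? '/';         // path plus query string only
    return $scheme . '://' . $host . $uri;
}

// Sample values mirroring the question; in a web request, pass $_SERVER instead.
$sample = [
    'HTTP_HOST'   => 'mysite.org',
    'REQUEST_URI' => '/bbk/tiki-index.php?page=pagetext',
];
echo build_absolute_url($sample), "\n";
```

Keep in mind that a script which fetches its own URL runs itself again on every fetch, so this sketch only builds the string and leaves the decision to load it to the caller.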

Finding and Echoing out a Specific ID from HTML document with PHP

I am grabbing the contents from Google with PHP. How can I search $page for the element with the id "lga" and echo out another property? Say #lga is an image, how would I echo out its source?
No, I'm not going to do this with Google; Google is strictly an example and testing page.
<body><img id="lga" src="snail.png" /></body>
I want to find the element with the id "lga" and echo out its source; so for the above code I would want to echo out "snail.png".
This is what I'm using and how I'm storing what I found:
<?php
$url = "https://www.google.com/";
$page = file($url);
foreach($page as $part){
}
?>

You can achieve this using the built-in DOMDocument class. This class allows you to work with HTML in a structured manner rather than parsing plain text yourself, and it's quite versatile:
$dom = new DOMDocument();
$dom->loadHTML($html);
To get the src attribute of the element with the id lga, you could simply use:
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
Note that DOMDocument::loadHTML will generate warnings when it encounters invalid HTML. The method's doc page has a few notes on how to suppress these warnings.
Also, if you have control over the website you are parsing the HTML from, it might be more appropriate to have a dedicated script to serve the information you are after. Unless you need to parse exactly what's on a page as it is served, extracting data from HTML like this could be quite wasteful.
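Put together, the approach above might look like the following sketch. It uses the sample markup from the question as an in-memory string in place of a downloaded page, and adds a null check as an assumption about how a missing element should be handled.

```php
<?php
// Sample markup from the question, used here in place of a fetched page.
$html = '<body><img id="lga" src="snail.png" /></body>';

libxml_use_internal_errors(true);   // keep warnings about invalid HTML quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementById() returns null when no element has that id, so check first.
$image    = $dom->getElementById('lga');
$imageSrc = $image !== null ? $image->getAttribute('src') : null;

echo $imageSrc, "\n";
```

For a live page, the string could come from file_get_contents($url) instead, with a check that the call did not return false before handing it to loadHTML().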
