How to work around PHP Advanced HTML DOM's conversion of entities? - php

How can I work around advanced_html_dom.php str_get_html()'s conversion of HTML entities, short of applying htmlentities() to every element's content?
Despite the claim at
http://archive.is/YWKYp#selection-971.0-979.95
that

"The goal of this project is to be a DOM-based drop-in replacement for
PHP's simple html dom library. ... If you use file/str_get_html then you
don't need to change anything."

I find that with simple html dom:

include 'simple_html_dom.php';
$set = str_get_html('<html><title>&nbsp;</title></html>');
echo ($set->find('title',0)->innertext)."\n"; // Expected: &nbsp; Observed: &nbsp;
changing to advanced HTML DOM gives an incompatible result:

include 'advanced_html_dom.php';
$set = str_get_html('<html><title>&nbsp;</title></html>');
echo ($set->find('title',0)->innertext)."\n"; // Expected: &nbsp; Observed: -á (the raw non-breaking space, mangled by the console's encoding)
This issue is not confined to spaces.

$set = str_get_html('<html><body>&bull;</body></html>');
echo $set->find('body',0)->innertext; // Expected: &bull; Observed: ÔÇó (the raw UTF-8 bullet shown in a single-byte console encoding)
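
For reference, the per-element fallback I'd like to avoid looks like this (a minimal sketch, assuming the parser hands back decoded UTF-8 text):

include 'advanced_html_dom.php';
$set = str_get_html('<html><body>&bull;</body></html>');
$text = $set->find('body',0)->innertext; // the decoded bullet character
// re-encode the raw character back into its entity form
echo htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8'); // &bull;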

You can check out my own package, PHPHTMLQuery; it helps you use PHP to select HTML elements using most CSS3 selectors.
The package works with external links and local HTML files too.
Installation
Open your terminal, browse to your project's root folder, and run:
composer require "abdelilahlbardi/phphtmlquery":"#dev"
Documentation
For more information, please visit the package link: PHPHTMLQuery

Related

file_get_html() doesn't work [duplicate]

I used the following code to parse the HTML of another site, but it displays a fatal error:
$html=file_get_html('http://www.google.co.in');
Fatal error: Call to undefined function file_get_html()
Are you sure you have downloaded and included the PHP Simple HTML DOM Parser?
The class you are calling does not belong to PHP itself.
Download the simple_html_dom class here and use the methods it includes as you like. It is really great, especially when you are working with email newsletters:
include_once('simple_html_dom.php');
$html = file_get_html('http://www.google.co.in');
As everyone has told you, you are seeing this error because you didn't download and include the simple_html_dom class after copy-pasting that third-party code.
Now you have two options. Option one is what all the other developers have provided in their answers, along with mine.
However, my friend,
option two is to not use that third-party PHP class at all, and instead use PHP's built-in DOM classes to perform the same task. Those classes always ship with PHP, so there is also efficiency in this method, along with originality plus security!
Instead of file_get_html, which is not a function defined by PHP itself, use:
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
echo $doc->saveHTML();
That is indeed defined by them. Check it on php.net/manual (the original PHP manual by its devs).
This puts the HTML into a DOM object which can be parsed by individual tags, attributes, etc. Here is an example of getting all the 'href' attributes and corresponding node values out of the 'a' tags. Very cool:
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
    echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
}
It looks like you're looking for simplexml_load_file, which will load a file and put it into a SimpleXML object.
Of course, if the file is not well-formed, that might cause problems. Your other option is DOMDocument::loadHTMLFile. That is a good deal more forgiving of badly formed documents.
If you don't care about the XML and just want the data, you can use file_get_contents to get the HTML content of the page:
$html = file_get_contents('http://www.google.co.in');
In simple words:
download simple_html_dom.php from here: Click here
Now add this line to your PHP file:
include_once('simple_html_dom.php');
and start your coding after that:
$html = file_get_html('http://www.google.co.in');
No error will be displayed.
Try file_get_contents.
http://www.php.net/manual/en/function.file-get-contents.php

Converting HTML to ENML

I am trying to write an extension for Gmail that lets you save mail as a note in Evernote, but Evernote's ENML is pretty strict; for one, it doesn't allow external styles.
So what I am looking to do is something like this:
- convert external styles to inline styles,
- validate/balance the tags,
- purify the tags that Evernote considers offensive.
So before I jump into writing a parser for the above, does anyone know of a PHP library that already does the heavy lifting?
If not, what is the way to go with the above requirements?
If the only interesting problem is converting external styles to inline styles, you can use https://github.com/tijsverkoyen/CssToInlineStyles. It also has a Composer package on Packagist for easy deployment.
I used it like this:
<?php
// ...
use \TijsVerkoyen\CssToInlineStyles\CssToInlineStyles;
// ...
// load the HTML and the external stylesheet
$content = file_get_contents('./content.html');
$css = file_get_contents('./styles.css');
// create instance and inline the styles
$cssToInlineStyles = new CssToInlineStyles();
$cssToInlineStyles->setHTML($content);
$cssToInlineStyles->setCSS($css);
$mail_content = $cssToInlineStyles->convert();
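
For the other two bullets (balancing tags and stripping markup Evernote rejects), HTMLPurifier can handle both in one pass, since it always returns well-formed output. A minimal sketch; the whitelist below is only illustrative, so check it against the ENML spec:

require_once 'HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
// allow only tags/attributes ENML tolerates (an assumed list, not the spec)
$config->set('HTML.Allowed', 'p,br,b,i,u,ul,ol,li,a[href],img[src],span[style],div[style]');
$purifier = new HTMLPurifier($config);
$enml_body = $purifier->purify($mail_content); // balanced and filtered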

Parsing Wordpress XML file in PHP

I'm migrating a big WordPress site to a custom CMS. I need to extract information from a big (20MB+) XML file exported from WordPress.
I don't have any experience with XML under PHP and I don't know how to start reading the file.
The WordPress file contains structures like this:
<excerpt:encoded><![CDATA[Encoded text here]]></excerpt:encoded>
and I don't know how to handle this in PHP.
You are probably going to do fine with SimpleXML:
$xml = simplexml_load_file('big_xml_file.xml');
foreach ($xml->element as $el) {
    echo $el->name;
}
See php.net for more info
PHP 5 ships with two extensions for working with XML: DOM and SimpleXML.
Generally speaking, I recommend looking into SimpleXML first, since it's the more accessible library of the two.
For starters, use simplexml_load_file() to read an XML file into an object for further processing.
You should also check out the SimpleXML basic examples page on php.net.
I don't have any experience in XML under PHP
Take a look at simplexml_load_file() or DOMDocument.
<excerpt:encoded><![CDATA[Encoded text here]]></excerpt:encoded>
This should not be a problem for the XML parser. However, you will have a problem with the content exported by WordPress. For example, it can contain WordPress shortcodes, which will come across in their raw format instead of expanded.
Better Approach
Determine whether the system you are migrating to supports importing a WordPress export. Many other systems do: Drupal, Joomla, Octopress, etc.
Although Adam is absolutely right, his answer needed a bit more detail. Here's a simple script that should get you going:
$xmlfile = simplexml_load_file('yourxmlfile.xml');
foreach ($xmlfile->channel->item as $item) {
    var_dump($item->xpath('title'));
    var_dump($item->xpath('wp:post_type'));
}
simplexml_load_file() is the way to go for creating an object, but you will also need to use XPath, as WordPress uses namespaces. If I remember correctly, SimpleXML does not handle namespaces well, or at all.
$xml = simplexml_load_file( $file );
$xml->xpath('/rss/channel/wp:category');
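
One detail worth adding: registering the wp: prefix explicitly removes the guesswork. A short sketch (the 1.2 namespace URI is an assumption; match it to the wxr_version declared at the top of your export):

$xml = simplexml_load_file($file);
// bind the wp: prefix to the export's namespace URI before querying
$xml->registerXPathNamespace('wp', 'http://wordpress.org/export/1.2/');
$categories = $xml->xpath('/rss/channel/wp:category');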
I would recommend looking at what WordPress uses for importing the files.
https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/class-wp-importer.php
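
For a 20MB+ export, simplexml_load_file() also pulls the entire document into memory at once. A streaming sketch with PHP's built-in XMLReader (the file name is from the example above; the element names follow the standard WXR format):

$reader = new XMLReader();
$reader->open('yourxmlfile.xml');
while ($reader->read()) {
    // stop at each <item> element (one post per <item> in a WXR export)
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'item') {
        // readOuterXml() serializes just this item, including the in-scope
        // namespace declarations, so SimpleXML can parse it in isolation
        $item = simplexml_load_string($reader->readOuterXml());
        echo $item->title, ' | ', $item->children('wp', true)->post_type, "\n";
    }
}
$reader->close();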

Including/Excluding content with xPath/DOM > PHP

I'm trying to take an existing PHP file which I've built for a page of my site (blue.php), and grab the parts I really want with some XPath to create a different version of that page (blue-2.php).
I've been successful in pulling in my existing .php file with
$documentSource = file_get_contents("http://mysite.com/blue.php");
I can alter an attribute, and have my changes reflected correctly within blue-2.php, for example:
$xpath->query("//div[#class='walk']");
foreach ($xpath->query("//div[#class='walk']") as $node) {
$source = $node->getAttribute('class');
$node->setAttribute('class', 'run');
With my current code, I'm limited to making changes like in the example above. What I really want to be able to do is remove/exclude certain divs and other elements from showing on my new php page (blue-2.php).
By using echo $doc->saveHTML(); at the end of my code, it appears that everything from blue.php is included in blue-2.php's output, when I only want to output certain elements, while excluding others.
So the essence of my question is:
Can I parse an entire page using $documentSource = file_get_contents("http://mysite.com/blue.php"); and pick and choose (include and exclude) which elements show on my new page with XPath? Or am I limited to only making modifications to the existing code, like in my 'div class walk/run' example above?
Thank you for any guidance.
I've tried this, and it just throws errors:
$xpath->query("//img[#src='blue.png']")->remove();
What part of the documentation made you think remove() is a method of DOMNodeList? Use DOMNode::removeChild:
foreach ($xpath->query("//img[@src='blue.png']") as $node) {
    $node->parentNode->removeChild($node);
}
I would suggest browsing a bit through all the classes & functions of the DOM extension (which is not PHP-only, BTW) to get a feel for what to find where.
On a side note: it would probably be far more resource-efficient to add a switch to your original blue.php that produces the different output, because this solution (an extra HTTP request plus a full DOM load & manipulation) has a LOT of unneeded overhead compared to that.
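
Putting that together with the question's own setup, a minimal end-to-end sketch (blue.php and the img filter are taken from the question; the rest is the standard DOM API):

$documentSource = file_get_contents("http://mysite.com/blue.php");
$doc = new DOMDocument();
@$doc->loadHTML($documentSource); // the @ silences warnings from loose HTML
$xpath = new DOMXPath($doc);
// remove every element you don't want to appear in blue-2.php
foreach ($xpath->query("//img[@src='blue.png']") as $node) {
    $node->parentNode->removeChild($node);
}
echo $doc->saveHTML(); // only the remaining elements are output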

HTMLPurifier, over-purification of third-party source

UPDATE 2: http://htmlpurifier.org/phorum/read.php?3,5088,5113 The author has already identified the problem.
UPDATE: The issue appears to be exclusive to version 4.2.0. I have downgraded to 4.1.0 and it works. Thank you for all your help. The author of the package has been notified.
I am scraping some pages like:
http://form.horseracing.betfair.com/horse-racing/010108/Catterick_Bridge-GB-Cat/1215
According to W3C validation it is valid XHTML Strict.
I am then using http://htmlpurifier.org/ to purify the HTML before loading it into a DOMDocument. However, it is only returning a single line of content.
Output:
12:15 Catterick Bridge - Tuesday 1st January 2008 - Timeform | Betfair
Code:
echo $content; # all good
$purifier = new \HTMLPurifier();
$content = $purifier->purify($content);
echo $content; # all bad
BTW, it works for data sourced from other sites; as described above, it leaves only the title for pages from this domain.
Related Links
HTMLPurifier dies when the following code is run through it (unanswered question on similar topic)
You should not need the HTML purifier. The DOMDocument class will take care of everything for you. However, it will trigger a warning on invalid HTML, so just do this:
$doc = new DOMDocument();
@$doc->loadHTML($content); // the @ suppresses the invalid-HTML warning
Then the error will not be triggered, and you can do what you wish with the HTML.
If you are scraping links, I would recommend using SimpleXMLElement::xpath(). That is much easier than working with DOMDocument. Another example of that:
$xml = new SimpleXMLElement($content);
$result = $xml->xpath('a/@href');
print_r($result);
You can write much more complex XPath expressions that allow you to specify class names, IDs, and other attributes. This is much more powerful than DOMDocument.
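
For instance, attribute conditions can be combined in a single expression (the class name here is hypothetical):

// every link URL inside <div class="racecard"> elements
$result = $xml->xpath("//div[@class='racecard']//a/@href");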
