Robust way to parse text content from HTML in PHP? - php

I'm trying to find a robust way to parse ALL of the text (i.e. non-html/non-code/non-script content) from an HTML document. I'm talking specifically about extracting keywords on any input web page on the internet. I'm writing a keyword spider that tracks keyword trends on web pages using PHP, and although I've found a number of great ways to actually read in the content (like DOMDocument and cURL), I'm having a hard time finding any robust solutions for actually parsing out all of the word content separate from the HTML/Javascript/CSS/etc on any old random page on the Internet.
I first tried using strip_tags(), but the result has lots of artifacts from JavaScript and other markup that might be on the page. I've also tried Simple HTML DOM, but it seems to have problems with punctuation and whitespace handling. I finally tried building a library from the tutorials on nadeausoftware, and while it works phenomenally well on most pages, on some pages it doesn't return any content at all (I guess that's the curse of trying to use regex for parsing).
I'm just wondering if there aren't any php libraries that provide the specific capability of grabbing all of the non-html/non-javascript/non-xml/non-code words from an HTML document. I know that might sound like a tall order, and I'm not looking for perfection, but if there's a solution that's 80% reliable on most web-pages, I'd be happy.
Thanks for any help anyone can provide!

You could load the document, get rid of the tags you don't want and then query the textContent property:
$html = '<html><head><style type="text/css">hola</style></head><body><script>tada</script>hello <span>world</span></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// Remove the script and style elements before reading the text content
foreach ($dom->getElementsByTagName('*') as $node) {
    if (in_array($node->nodeName, array('script', 'style'))) {
        $node->parentNode->removeChild($node);
    }
}

echo $dom->documentElement->textContent;
// hello world

As it turns out, the PHP parsing code from Nadeau Software is actually more robust than I originally gave it credit for. On further tinkering, I discovered that the problems I was encountering were caused by feeding the parser HTML content that wasn't properly encoded as UTF-8.
It's unfortunate that there don't seem to be any existing libraries for handling such a complex use-case, but at least I was able to get the tutorial code to work on a broad number of test cases.
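For anyone hitting the same wall, here is a minimal sketch of the kind of normalization that fixed it for me, assuming the mbstring extension is available (the candidate encoding list is only an example, not a complete one):
// Normalize a fetched page to UTF-8 before handing it to any parser
$raw = file_get_contents('http://example.com/');

$encoding = mb_detect_encoding($raw, array('UTF-8', 'ISO-8859-1', 'Windows-1251'), true);

if ($encoding !== false && $encoding !== 'UTF-8') {
    $raw = mb_convert_encoding($raw, 'UTF-8', $encoding);
}

// $raw should now be UTF-8 and safe to feed to the parser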

Related

Caching web pages using PHP (for offline viewing)

I'm working on a personal project to view web pages offline. The first idea that I came up with is using file_get_contents to get the contents of a specific URL, but this only gets the HTML and not the assets on that page (CSS, images, JavaScript, etc.). So I had to write regex to get the stylesheets and images in the page:
$css_pattern = '/\S*\.css"/';
$img_src_pattern = '/src=(?:"|\')?.+\.(?:gif|jpg|png|jpeg)(?:"|\')/';
preg_match_all($css_pattern, $contents, $style_matches);
preg_match_all($img_src_pattern, $contents, $img_matches);
This works, but there are also image links inside the CSS, and I'm still thinking about how to deal with those.
There are also projects like ganon https://code.google.com/p/ganon/ and simple html parser that might make my life easier, but I prefer using regex because I want to learn more about it.
The question is: is there a better way of doing this project? The app will probably have folders in which to save the assets and HTML for each site, and it will probably become unwieldy. I've heard of things like the manifest file in HTML5, but I'm not sure if that's possible when you don't own the site. Any ideas? If there's no other way to do this, then maybe you can just help me improve the regex I have above. I basically have to use str_replace and a foreach to get the stylesheets:
$stylesheets = array();
foreach ($style_matches[0] as $match) {
    $stylesheets[] = str_replace(array('href=', '"', "'"), '', $match);
}
Thanks in advance!
I prefer using regex because I want to learn more about it.
Parsing HTML with regex is possible albeit non-trivial. A good introduction is given in the following paper:
REX: XML Shallow Parsing with Regular Expressions
The regular expressions used in that paper (REX) are not the flavor used in PHP (PCRE); however, they are similar enough that you should be able to understand the paper if you're willing to learn.
Following what that paper outlines and writing the regular expressions in PHP on your own, with some nice test cases, should be real training for digging into regular expressions.
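To make the idea concrete, here is a deliberately crude sketch of shallow tokenizing in PHP. This is not one of the REX expressions from the paper, just a simplified illustration, and it will break on things like > inside attribute values:
// Split markup into tag tokens and text tokens with a naive pattern
$html = '<p>Hello <b>world</b> &amp; friends</p>';

$tokens = preg_split(
    '/(<[^>]*>)/',
    $html,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

foreach ($tokens as $token) {
    echo ($token[0] === '<' ? 'tag : ' : 'text: '), $token, PHP_EOL;
}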
Besides the regular expressions, you also need to deal with character encodings, which is another field of its own, and then adapt the parser to an encoding (if you do not re-encode before parsing).
If you're looking specifically for an HTML5-compatible parser, it is specified as part of the HTML5 "specification", but you can no longer do it precisely with regular expressions in any sane way (at least as far as I know):
12.2 Parsing HTML documents — HTML Living Standard — Updated ca. daily
8.2 Parsing HTML documents — HTML5 — A vocabulary and associated APIs for HTML and XHTML W3C Candidate Recommendation 17 December 2012
To me that type of parsing looks like a large amount of overhead, but peek into the outline of the HTML5 parser and you get an idea of everything you could take care of in HTML parsing nowadays. It seems like the spec authors really pushed in everything they could imagine. The following engines/browsers actually have an HTML5 parser:
Gecko 2
Webkit
Chrome 7 (Webkit)
Opera 11.60 (Ragnarök)
IE10
From personal experience, in the PHP ecosystem there are not many SGML-based / "loose" / low-level / tag-soup HTML parsers. If I were to write one, I would also use regular expressions for string parsing; the REX shallow-parsing article has some good discussion. However, I would probably only use such a low-level HTML parser to make arbitrary HTML consumable for DOMDocument or for validation/fixing-related work, and wouldn't use it for further parsing/document abstraction. DOMDocument is pretty powerful, especially for gathering links as you describe above.
For the rest of your question, the pieces you need to bring together are outlined in the various HTTP-related RFCs, so you need to decide on your own which link-resolving algorithm you want to support and how you re-map the static CSS/image/JS files when you save them again. You then normally rewrite the HTML as well, for which DOMDocument is really handy, as in the sketch below.
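A rough sketch of that part, assuming a hypothetical saveAsset() helper and an assets/ directory layout (resolving relative URLs against the page URL is left out here):
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://example.com/');
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// img/src, script/src and stylesheet link/href cover the common cases
$nodes = $xpath->query('//img[@src] | //script[@src] | //link[@rel="stylesheet"][@href]');

foreach ($nodes as $node) {
    $attr = $node->hasAttribute('src') ? 'src' : 'href';
    $url  = $node->getAttribute($attr);

    // saveAsset() is hypothetical: it would download $url and store it locally
    $local = 'assets/' . basename(parse_url($url, PHP_URL_PATH));
    // saveAsset($url, $local);

    $node->setAttribute($attr, $local);
}

file_put_contents('cache/page.html', $dom->saveHTML());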
You should also store some HTTP headers inside the HTML file via meta elements, especially the encoding, unless you re-encode everything (which can be useful for offline reading anyway). Some of the more general Q&A suggestions for HTML authoring apply to a static cache as well.
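Building on the $dom from the sketch above, recording the encoding in the cached copy could look like this (the charset value is illustrative; in practice you would take it from the response headers):
$meta = $dom->createElement('meta');
$meta->setAttribute('http-equiv', 'Content-Type');
$meta->setAttribute('content', 'text/html; charset=utf-8');

$head = $dom->getElementsByTagName('head')->item(0);
if ($head !== null) {
    $head->insertBefore($meta, $head->firstChild);
}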
The HTML5 manifest file is actually something different: the original server would have had to support it, which is likely not the case (or you would need to build a parser for it and process it as well). Still, if you create a mirror, you might also want to point out all static resources that can be stored locally for offline usage. That's a nice idea; I haven't yet seen it implemented by tools like wget, so it's probably worth playing with a little.
Instead of the HTML5 manifest file you might also look at one of the following container formats:
Mozilla Archive Format - MAFF
MIME HTML - MHTML
Webarchive
Another one of these formats/extensions (here: the SingleFile Chrome extension) makes use of the Data URI scheme according to Wikipedia, which might also be useful in this context, although I wouldn't favor it. I'd say it's better to have an algorithm that can rewrite URLs to the local file system in a reproducible manner, so that you can dump multiple HTML files sharing the same assets without fetching those assets multiple times.
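For completeness, the Data URI variant is only a couple of lines in PHP (the path and MIME type here are made up for illustration):
$path = 'assets/logo.png';   // hypothetical local file
$mime = 'image/png';         // would normally be detected, e.g. via finfo

$dataUri = 'data:' . $mime . ';base64,' . base64_encode(file_get_contents($path));

// e.g. $imgNode->setAttribute('src', $dataUri);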

Parsing HTML with PHP to get data for several articles of the same kind

I'm working on a website which parses coupon sites and lists those coupons. Some sites provide their listings as an XML file - no problem with those. But there are also some sites which do not provide XML. I'm thinking of parsing those sites and getting the coupon information from the page content - grabbing that data from the HTML with PHP. As an example, you can see the following site:
http://www.biglion.ru/moscow/
I'm working with PHP. So, my question is - is there a relatively easy way to parse HTML and get the data for each coupon listed on that site just like I get while parsing XML?
Thanks for the help.
You can always use a DOM parser, but scraping content from sites is unreliable at best.
If their layout changes ever so slightly, your app could fail. Oh, and in most cases it's also against the site's TOS to do so.
While using a DOM parser might seem a good idea, I usually prefer good old regular expressions for scraping. It's much less work, and if the site changes its layout you're screwed anyway, whatever your approach is. But if you use a smart enough regex, your code should be immune to changes that don't directly impact the part you're interested in.
One thing to remember is to include class names in the regex when they're provided, but to assume anything can appear between the pieces of info you need. E.g.:
preg_match_all(
    '#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s',
    file_get_contents('http://www.biglion.ru/moscow/'),
    $matches,
    PREG_SET_ORDER
);
print_r($matches);
The most reliable method is a PHP DOM parser, if you prefer working with PHP.
Here is an example of parsing only the anchor (a) elements.
// Include the library
include('simple_html_dom.php');

// Retrieve the DOM from a given URL
$html = file_get_html('http://mypage.com/');

// Find all "A" tags and print their HREFs
foreach ($html->find('a') as $e) {
    echo $e->href . '<br>';
}
I am providing some more information about parsing the other html elements too.
I hope that will be useful to you.
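As a hedged sketch of what that could look like for other elements with the same library (the div.coupon selector and the URL are hypothetical and would need to match the real page structure):
include('simple_html_dom.php');
$html = file_get_html('http://mypage.com/');

// All images and their sources
foreach ($html->find('img') as $img) {
    echo $img->src . '<br>';
}

// Elements by class name, e.g. a hypothetical coupon container
foreach ($html->find('div.coupon') as $div) {
    echo $div->plaintext . '<br>';
}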

XML parser vs regex

What should I use?
I am going to fetch links, images, text, etc. and use it for building SEO statistics and analysis of the page.
What do you recommend to be used? XML Parser or regex
I have been using regex and have never had any problems with it; however, I keep hearing from people that it can't do some things, and so on... To be honest, I don't know why, but I am afraid to use an XML parser and prefer regex (and it works and serves the purpose pretty well).
So, if everything is working well with regex, why am I here asking what to use? Well, just because everything has been fine so far doesn't mean it will be in the future as well, so I wanted to know: what are the benefits of using an XML parser over regex? Are there improvements in performance, is it less error prone, is there better support, or other shiny features?
If you do suggest an XML parser, then which one is recommended for use with PHP?
I would most definitely like to know why you would pick one over the other.
What should I use?
You should use an XML Parser.
If you do suggest an XML parser, then which one is recommended for use with PHP?
See: Robust and Mature HTML Parser for PHP.
If you're processing real world (X)HTML then you'll need an HTML parser not an XML parser, because XML parsers are required to stop parsing as soon as they hit a well-formedness error, which will be almost immediately with most HTML.
The point against regex for processing HTML is that it isn't reliable. For any regex, there will be HTML pages that it will fail on. HTML parsers are just as easy to use as regex, and process HTML just like a browser does, so are very much more reliable and there's rarely any reason not to use one.
One possible exception is sampling for statistics purposes. Suppose you're going to scan 100,000 web pages for a fairly simple pattern, for example, the presence of a particular attribute, and return the percentage of matching pages that you get. While even a well designed regex will likely produce both false positives and false negatives, they are unlikely to affect the overall percentage score by very much. You may be able to accept those false matches for the benefit that a regex scan is likely to run more quickly than a full parse of each page. You can then reduce the number of false positives by running a parse only on the pages which return a regex match.
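As an illustration of that hybrid approach (the rel="nofollow" attribute and the $pages array of URL => HTML strings are assumptions for the sketch, not part of the question):
$matched = array();

foreach ($pages as $url => $html) {
    // Cheap pre-filter: rough regex check for the attribute of interest
    if (!preg_match('/\brel="nofollow"/i', $html)) {
        continue;
    }

    // Confirm with a real parser before counting the page
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    if ($xpath->query('//a[@rel="nofollow"]')->length > 0) {
        $matched[] = $url;
    }
}

printf("%.1f%% of pages matched\n", 100 * count($matched) / max(1, count($pages)));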
To see the kind of problems that will cause difficulties for regexes see: Can you provide some examples of why it is hard to parse XML and HTML with a regex?
It sounds to me as if you are doing screen-scraping. This is inevitably a somewhat heuristic process - you're looking for patterns that commonly occur in the web pages of interest, and you're inevitably going to miss a few of them, and you don't really mind. For example, you don't really care that your search for img tags will also find an img tag that happens to be commented out. If that characterizes your application, then the usual strictures against using regular expressions for processing HTML or XML might not apply to your case.

Regex (or better suggestion) on html with correct nesting

I've had a look and there don't seem to be any old questions that directly address this. I also haven't found a clear solution anywhere else.
I need a way to match a tag, open to close, and return everything enclosed by the tag. The regexes I've tried have problems when tags are nested. For example, the regex <tag\b[^>]*>(.*?)</tag> will cause trouble with <tag>Some text <tag>that is nested</tag> in tags</tag>. It will match <tag>Some text <tag>that is nested</tag>.
I'm looking for a solution to this, ideally an efficient one. I've seen solutions that involve matching on start and end tags separately and keeping track of their index in the content to work out which tags go together, but that seems wildly inefficient to me (if it's the only possible way, then c'est la vie).
The solution must be PHP only as this is the language I have to work with. I'm parsing html snippets (think body sections from a wordpress blog and you're not too far off). If there is a better than regex solution, I'm all ears!
UPDATE:
Just to make it clear, I'm aware regexes are a poor solution but I have to do it somehow which is why the title specifically mentions better solutions.
FURTHER UPDATE:
I'm parsing snippets. Solutions should take this into account. If the parser only works on a full document or is going to add <head> etc... when I get the html back out, it's not an acceptable solution.
As always, you simply cannot parse HTML with regex because it is not a regular language. You either need to write a real HTML parser, or use a real HTML parser (that someone's already written). For reasons that should be obvious, I recommend the latter option.
Relevant questions
Robust and Mature HTML Parser for PHP
How do you parse and process HTML/XML in PHP?
Why not just use DOMDocument::loadHTML? It uses libxml under the hood which is fast and robust.
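For snippets specifically, a sketch along these lines avoids the added <html>/<head>/<body> wrappers on output, assuming PHP 5.4+ with a reasonably recent libxml (the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags):
$snippet = '<div>Some text <div>that is nested</div> in divs</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($snippet, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// Everything enclosed by the outer element, nesting handled correctly
$outer = $dom->getElementsByTagName('div')->item(0);
echo $dom->saveHTML($outer);
// <div>Some text <div>that is nested</div> in divs</div>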

How can I scrape a website with invalid HTML [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some info because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and it's a shared hosting server, so I have no control over that. I tried HTMLPurifier, but that just seems to be for securing user input, since it completely removes the doctype, head, and body tags.
Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need, it just seems to need some help cleaning the HTML up before it can parse it.
Edit: I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.
DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$links = $xPath->query('//div[@class="courseList_section"]//a');

foreach ($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}
will output
ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD - Art and Design (index.aspx?semester=2010f&subjectID=AD )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB - Urban Systems (index.aspx?semester=2010f&subjectID=URB )
Using
echo $dom->saveXML($link), PHP_EOL;
in the foreach loop will output the full outerHTML of the links.
If you know the errors, you might apply some regular expressions to fix them specifically. While this ad-hoc solution might seem dirty, it may actually be better, because if the HTML is indeed malformed it can be complex to infer a correct interpretation automatically.
EDIT: Actually, it might be better to simply extract the needed information through regular expressions, as the page has many errors which would be hard, or at least tedious, to fix.
Is there a web service that will run your content through Tidy? Could you write one? Tidy is the only sane way I know of fixing broken markup.
Another simple way to solve the problem could be to pass the site you're trying to scrape through a mobile browser adapter such as Google's mobilizer for complicated websites. This will correct the invalid HTML and let you use the Simple HTML DOM parser package, but it might not work if you need some of the information that gets stripped out of the site. The link to this adapter is below. I use this for sites where the information is poorly formatted, or when I need a way to simplify the formatting so that it's easy to parse. The HTML returned by the Google mobilizer is simpler and much easier to process.
http://www.google.com/gwt/n
