I am aware that the common opinion is not to use regex for parsing HTML; however, I do not see how it would be harmful for what I want to achieve (similar regex-based functions, such as _StringBetween() in AutoIt3, have been added to scripting languages before).
I am also aware that _StringBetween() was not specifically written for HTML, but I have been using it on HTML content without any problem for the past 8 years, as have other people.
For my HTML extraction API, consider the following piece of HTML:
<div class="video" id="video-91519"><!-- The value of the identifier is dynamic-->
<a href="about:blank"><img src="silly.jpg"><!-- So is the href and src in a, img -->
</div>
The reason for the API I am trying to write is to make extracting the video_url and thumbnail extremely easy, which is why an HTML parser seems out of reach. I would like to be able to extract them using something along the lines of
<div class="video" id="video-{{unknown}}">{{unknown}}<a href="{{video_url}}"><img src="{{thumbnail}}">{{unknown}}</div>
Of course, in the previous piece of HTML you could do it much more simply, such as
<a href="{{video_url}}"><img src="{{thumbnail}}">
but I was trying to present a more complete example to avoid confusion.
How does regex come into play? Well, I was going to replace {{video_url}}, {{thumbnail}} and {{unknown}} with (.*?), (.*?) and .* respectively, using the /s modifier, and of course making sure that there are no multiple occurrences of {{video_url}} or {{thumbnail}} in the provided input (the template, not the HTML).
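For illustration, here is a minimal sketch of that compilation step; the function name compilePattern is made up for this example, and this is only one way to do it:

<?php
// Split on the placeholders, quote the literal HTML in between, and
// rejoin with capturing groups, so the braces never need escaping.
function compilePattern(string $template): string
{
    $parts = preg_split('/\{\{(\w+)\}\}/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $pattern .= preg_quote($part, '/');  // literal HTML
        } elseif ($part === 'unknown') {
            $pattern .= '.*?';                   // filler: match, don't capture
        } else {
            $pattern .= '(?<' . $part . '>.*?)'; // named capture group
        }
    }
    return '/' . $pattern . '/s';
}

$template = '<a href="{{video_url}}"><img src="{{thumbnail}}">';
$html = file_get_contents('video-page.html'); // whatever source is scraped
if (preg_match(compilePattern($template), $html, $m)) {
    echo $m['video_url'], ' ', $m['thumbnail'];
}

Note that this relies on each named placeholder occurring only once, exactly as stipulated above, since duplicate named groups would make the pattern invalid.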
So, is there any reason for me not to use regex, or should I still go for an HTML parser? Proofs of concept of either an acceptable regex and/or an HTML parser are welcome. I personally cannot see how to make this happen using an HTML parser.
I think the way you have framed the problem presupposes the solution: if you want to be able to specify a pattern to match, then you have to use a pattern-matching language, such as regex. But if you frame the question as allowing searches for content in the document, then other options become available, such as a path-based input that compiles to XPath expressions, or CSS selectors as used very successfully by the likes of jQuery.
What you are building here is not really an HTML extraction API as such, but a regex processing API - you are inventing a simplified pattern-matching language which can be compiled to a regex, and that regex applied to any string.
That in itself is not a bad thing, but if the users of that pattern-matching API try to use it to parse a more complex or unpredictable document, they will have all the same problems that everyone has when they try to match HTML using regexes, plus additional limitations imposed by your pre-processor. Those limitations are an inevitable consequence of simplifying the language: you are trading some of the power of the regex engine in order to make your patterns more "user friendly".
To return to the idea of reframing the question, here is an example of a simplified matching API that could compile to CSS selectors (e.g. used with SimpleHTMLDOM); a sketch of what the compiled result might look like follows the example:
Find: div (class:video)
Must-Contain: a, img
Match: id Against video-{{video_id}}
Child: a
Extract: href Into video_url
Child: img
Extract: src Into thumbnail
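For illustration, here is a rough sketch of what that mini-language might compile down to, assuming SimpleHTMLDOM and its attribute selectors (nothing here is an existing compiler):

<?php
// Hypothetical compiled output of the matching rules above, using
// simple_html_dom.php from simplehtmldom.sourceforge.net.
include 'simple_html_dom.php';

$doc = str_get_html($html); // $html holds the markup from the question
// Find: div (class:video) / Match: id Against video-{{video_id}}
foreach ($doc->find('div.video[id^=video-]') as $div) {
    $video_id  = substr($div->id, strlen('video-'));
    // Child: a / Extract: href Into video_url
    // (Must-Contain guarantees these children exist in this sketch.)
    $video_url = $div->find('a', 0)->href;
    // Child: img / Extract: src Into thumbnail
    $thumbnail = $div->find('img', 0)->src;
}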
Note that this language is a lot more abstracted away from the HTML; this has advantages and disadvantages. On the one hand, the simple matching pattern in your question is easy to create based on a single example. On the other hand, it is much more fragile to variations in the HTML, either due to changes in the site, or in-page variations such as adding an extra CSS class of "featured-video" to a small number of videos. The selector-based example requires the user to learn more specifics of the API, but if they don't know HTML and pattern-matching to begin with, a verbose syntax may be more helpful to them than one involving lots of cryptic punctuation.
I'm working on a personal project to view web pages offline. The first idea that I came up with was using file_get_contents to get the contents of a specific URL, but this only gets the HTML and not the assets in the page (CSS, images, JavaScript, etc.). So I had to write regexes to get the stylesheets and images in the page:
// Match anything up to a closing quote ending in .css
$css_pattern = '/\S*\.css"/';
// Match src attributes pointing at common image extensions
$img_src_pattern = '/src=(?:"|\')?.+\.(?:gif|jpg|png|jpeg)(?:"|\')/';
preg_match_all($css_pattern, $contents, $style_matches);
preg_match_all($img_src_pattern, $contents, $img_matches);
This works, but there are also image links inside the CSS itself, and I'm still thinking about how to deal with those.
There are also projects like ganon https://code.google.com/p/ganon/ and the Simple HTML DOM parser that might make my life easier, but I prefer using regex because I want to learn more about it.
The question is: is there a better way of doing this project? The app will probably have folders in which to save the assets and HTML for each site, and it will probably become unwieldy. I've heard of things like the manifest file in HTML5, but I'm not sure whether that's possible if you don't own the site. Any ideas? If there's no other way to do this, then maybe you can just help me improve the regexes I have above. I basically have to use str_replace and a foreach to get the stylesheets:
$stylesheets = array();
foreach ($style_matches[0] as $match) {
    $stylesheets[] = str_replace(array('href=', '"', "'"), '', $match);
}
Thanks in advance!
I prefer using regex because I want to learn more about it.
Parsing HTML with regex is possible albeit non-trivial. A good introduction is given in the following paper:
REX: XML Shallow Parsing with Regular Expressions
The regular expressions used in that paper (REX) are not the ones used in PHP (PCRE), but they are similar, so you should be able to follow it if you're willing to learn.
Following what that paper outlines and writing the regular expressions in PHP on your own, with some nice test cases, should be a real training camp for digging into regular expressions.
Besides the regular expressions, you also need to deal with character encodings, which is a field of its own, and then adapt the parser to an encoding (if you do not re-encode before parsing).
If you're looking specifically for an HTML5-compatible parser, one is specified as part of the HTML5 specification, but as far as I know you can no longer do it precisely with regular expressions in any sane way:
12.2 Parsing HTML documents — HTML Living Standard — Updated ca. daily
8.2 Parsing HTML documents — HTML5 — A vocabulary and associated APIs for HTML and XHTML W3C Candidate Recommendation 17 December 2012
To me, that type of parsing looks like a large amount of overhead, but take a peek at the outline of the HTML5 parser and you get an idea of everything you could take care of in HTML parsing nowadays. It seems like those guys and girls really pushed in everything they could imagine. Actually, the following engines/browsers have an HTML5 parser:
Gecko 2
Webkit
Chrome 7 (Webkit)
Opera 11.60 (Ragnarök)
IE10
From personal experience, in the PHP ecosystem there are not that many SGML-based / "loose" / low-level / tag-soup HTML parsers. If I were to write one, I would also use regular expressions for the string parsing; the REX shallow-parsing article has some good discussion. However, I would probably only use such a low-level HTML parser to make arbitrary HTML consumable for DOMDocument or for other validation/fixing-related work, and would not use it for further parsing/document abstraction. DOMDocument is pretty powerful, especially for gathering links, which is what you describe above.
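For example, here is a rough sketch of gathering the asset URLs from the question with DOMDocument and XPath instead of the $css_pattern / $img_src_pattern regexes:

<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate tag-soup HTML
$doc->loadHTML($contents);        // $contents as from file_get_contents()
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$stylesheets = array();
foreach ($xpath->query('//link[@rel="stylesheet"]/@href') as $attr) {
    $stylesheets[] = $attr->value;
}

$images = array();
foreach ($xpath->query('//img/@src') as $attr) {
    $images[] = $attr->value;
}

This also makes the str_replace clean-up from the question unnecessary, since XPath hands back the bare attribute values.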
For the rest of your question: you will find everything you need outlined across the various HTTP-related RFCs, so you have to decide on your own which link-resolving algorithm you want to support and how you re-map the static CSS/image/JS files when you save them. You will then normally rewrite the HTML as well, for which DOMDocument is really handy.
You should also store some of the HTTP headers inside the HTML file via the meta element, especially the encoding, unless you re-encode the document (which can be useful for offline reading anyway). Some of the more general Q&A suggestions for HTML authoring apply to a static cache as well.
The HTML5 manifest file is actually something different: the original server would have to support it, and that is likely not the case (or you would need to build a parser for it as well and process it). So if you create a mirror, you might want to point out all the static resources that can be stored locally for offline usage. That is a nice idea; I have not yet seen it implemented by tools like wget, so it's probably worth playing with a little.
Instead of the HTML5 manifest file, you might also have been thinking of one of the following container formats:
Mozilla Archive Format - MAFF
MIME HTML - MHTML
Webarchive
Another one of these formats/extensions (here: the SingleFile Chrome extension) makes use of the data URI scheme, according to Wikipedia, which might also be useful in this context, although I would not favor it. I'd say it's better to have an algorithm that can rewrite URLs to the local file system in a reproducible manner, so that you can dump multiple HTML files with the same assets without fetching the assets multiple times.
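As a sketch of such a reproducible mapping (localPath is an illustrative name, not an existing function): hashing the absolute URL means the same asset always lands in the same local file, so shared assets are fetched only once.

function localPath(string $url): string
{
    $path = (string) parse_url($url, PHP_URL_PATH);
    $ext  = pathinfo($path, PATHINFO_EXTENSION);
    // Same URL in, same file name out, every time.
    return 'assets/' . sha1($url) . ($ext !== '' ? '.' . $ext : '');
}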
I want to link topics present in my description to their relevant topic pages. I am using preg_replace() to do this, but I need help writing the regex pattern.
The challenges I face are:
1) The description can contain all types of HTML tags
2) My replace function should not replace anything that already appears between the opening and closing tags of an existing element (for example, inside an existing link)
3) It should not replace any attribute of any tag present within the description. For example, if there is the string "Style and Beauty" and I want to link "Style" as my topic, it should not link the 'style' attribute of a tag; instead it should link "Style" in the "Style and Beauty" string.
Any kind of help on above query will be appreciated....
Thanks in advance...
Use either the DOMDocument class or one of the several XML parsing libraries available in PHP, depending on how well-formed your input is.
To elaborate on my comment: regular expressions are not suitable for stateful or recursive parsing. They can match in quite advanced ways, but anything that requires recursion or state (most notably, anything that is somehow tree-like) can't be parsed using regular expressions. Some regex dialects (e.g. Perl regular expressions) feature back-references and other constructs that extend regular expressions beyond strictly regular parsing, but even with those, things are going to be painful at best.
Instead, do the sane thing: find a DOM parser that works on your input (e.g. PHP's own DOMDocument API) and do your processing on the resulting DOM tree. An approach that should work well is to walk through the DOM tree recursively and, at each node, check whether it is a text node; if it is, apply your simple search-and-replace logic to its contents, otherwise descend into it and/or leave it unchanged. Alternatively, you can throw an XPath expression at the document to give you the text nodes, and then change them directly. Or you can hook a suitable replacer function into an XSLTProcessor and do the replacing in XSLT; this is fairly straightforward if you're familiar with XSLT, but if you're not, the DOM walker is probably easier to implement.
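Here is a minimal sketch of that DOM-walker approach, assuming a $topics map of topic names to URLs (purely illustrative); it touches only text nodes, so tag attributes are never rewritten, and it skips existing links:

<?php
function linkTopics(DOMNode $node, array $topics): void
{
    // Never descend into existing links.
    if ($node->nodeName === 'a') {
        return;
    }
    if ($node instanceof DOMText) {
        $orig = htmlspecialchars($node->nodeValue);
        $html = $orig;
        foreach ($topics as $topic => $url) {
            $html = preg_replace(
                '/\b' . preg_quote($topic, '/') . '\b/',
                '<a href="' . htmlspecialchars($url) . '">$0</a>',
                $html
            );
        }
        if ($html !== $orig) {
            $frag = $node->ownerDocument->createDocumentFragment();
            $frag->appendXML($html);
            $node->parentNode->replaceChild($frag, $node);
        }
        return;
    }
    // Copy the child list first: replacing nodes mutates it.
    foreach (iterator_to_array($node->childNodes) as $child) {
        linkTopics($child, $topics);
    }
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($description); // note: loadHTML wraps fragments in html/body
linkTopics($doc->documentElement, array('Style' => '/topics/style'));
$result = $doc->saveHTML();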
I've been searching through questions about finding content in a page, and a lot of answers recommend using the DOM instead of regex when parsing web pages. Why is that? Does it improve the processing time or something?
A DOM parser is actually parsing the page.
A regular expression is searching for text, not understanding the HTML's semantic meaning.
It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.
You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
So, instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML).
I can't stand hearing "HTML is not a regular language ..." anymore. Regular expressions (as used in today's languages) aren't regular either.
The simple answer is:
A regular expression is not a parser: it describes a pattern and it will match that pattern, but it has no idea about the document structure. You can't parse anything with one regex. Of course regexes can be part of a parser; in fact, I assume nearly every parser uses regexes internally to find certain sub-patterns.
If you can build a pattern for the stuff you want to find inside the HTML, fine, use it. But very often you will not be able to create such a pattern, because it is practically impossible to cover all the corner cases, or dependencies like "find all links, but only if they are green and not pink".
In most cases it's a lot easier to use a parser that understands the structure of your document and also accepts a lot of "broken" HTML. It makes it easy for you to access all links, or all table elements of a certain table, or whatever else you need.
To my mind, it's safer to use a regex on pages where you don't have control over the content: the HTML might not be formed properly, and then a DOM parser can fail.
Edit:
Well, considering what I've just read, you should probably use a regex only if you need very small things, like getting all the links of a document, etc.
What should I use?
I am going to fetch links, images, text, etc. and use them to build SEO statistics and analysis of the page.
What do you recommend: an XML parser or regex?
I have been using regex and have never had any problems with it; however, I keep hearing from people that it cannot do some things and blah blah blah... but to be honest, I don't know why, I am simply afraid to use an XML parser and prefer regex (and it works and serves the purpose pretty well).
So, if everything is working well with regex, why am I here asking what to use? Well, I think that even though everything has been fine so far, that doesn't mean it will be in the future as well, so I just wanted to know what the benefits of using an XML parser over regex are. Are there improvements in performance, is it less error-prone, is there better support, or are there other shiny features?
If you do suggest using an XML parser, then which one is recommended for use with PHP?
I would most definitely like to know why you would pick one over the other.
What should I use?
You should use an XML Parser.
If you do suggest using an XML parser, then which one is recommended for use with PHP?
See: Robust and Mature HTML Parser for PHP .
If you're processing real world (X)HTML then you'll need an HTML parser not an XML parser, because XML parsers are required to stop parsing as soon as they hit a well-formedness error, which will be almost immediately with most HTML.
The point against regex for processing HTML is that it isn't reliable. For any regex, there will be HTML pages that it will fail on. HTML parsers are just as easy to use as regex, and process HTML just like a browser does, so are very much more reliable and there's rarely any reason not to use one.
One possible exception is sampling for statistics purposes. Suppose you're going to scan 100,000 web pages for a fairly simple pattern, for example, the presence of a particular attribute, and return the percentage of matching pages that you get. While even a well designed regex will likely produce both false positives and false negatives, they are unlikely to affect the overall percentage score by very much. You may be able to accept those false matches for the benefit that a regex scan is likely to run more quickly than a full parse of each page. You can then reduce the number of false positives by running a parse only on the pages which return a regex match.
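As a sketch of that two-pass idea (the rel attribute is just an example): a cheap regex prescan rejects most pages quickly, and a real parse confirms the rest, weeding out false positives such as matches inside comments.

function pageHasRelAttribute(string $html): bool
{
    if (!preg_match('/\brel\s*=/i', $html)) {
        return false; // fast path: the attribute is definitely absent
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);
    return $xpath->query('//*[@rel]')->length > 0; // confirm with a real parse
}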
To see the kind of problems that will cause difficulties for regexes see: Can you provide some examples of why it is hard to parse XML and HTML with a regex?
It sounds to me as if you are doing screen-scraping. This is inevitably a somewhat heuristic process: you're looking for patterns that commonly occur in the web pages of interest, you're inevitably going to miss a few of them, and you don't really mind. For example, you don't really care that your search for img tags will also find an img tag that happens to be commented out. If that characterizes your application, then the usual strictures against using regular expressions for processing HTML or XML might not apply to your case.
I have some random HTML layouts that contain important text I would like to extract. I can't just strip_tags(), as that will leave a bunch of extra junk from the sidebar/footer/header/etc.
I found a method built in Python and I was wondering if there is anything like this in PHP.
The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn't a novel idea, but it works!) The basic process works as follows:
Parse the HTML code and keep track of the number of bytes processed.
Store the text output on a per-line, or per-paragraph basis.
Associate with each text line the number of bytes of HTML required to describe it.
Compute the text density of each line by calculating the ratio of text to bytes.
Then decide if the line is part of the content by using a neural network.
You can get pretty good results just by checking if the line's density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning - not to mention that it's easier to implement!
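For concreteness, here is a minimal PHP sketch of that density heuristic, using a fixed threshold rather than the neural network the passage mentions (extractDenseLines and the threshold value are illustrative):

function extractDenseLines(string $html, float $threshold = 0.5): array
{
    $content = array();
    foreach (preg_split('/\R/', $html) as $line) {
        $raw = strlen($line);
        if ($raw === 0) {
            continue;
        }
        $text = trim(strip_tags($line));
        // Ratio of visible text bytes to total bytes on the line.
        $density = strlen($text) / $raw;
        if ($text !== '' && $density >= $threshold) {
            $content[] = $text;
        }
    }
    return $content;
}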
Update: I started a bounty for an answer that can pull the main content from a random HTML template. Since I can't share the documents I will be using, just pick any random blog site and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text as well. See the link above for ideas.
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
UPDATE 2
DEMO: http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/
tested on a casual list of blogs taken from the Technorati Top 100 and Best Blogs of 2010
many blogs make use of a CMS;
blog HTML structure is almost always the same.
avoid common selectors like #sidebar, #header, #footer, #comments, etc.
avoid any widgets, by tag name: script, iframe
clear well-known content like:
/\d+\scomment(?:[s])/im
/(read the rest|read more).*/im
/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
/[^a-z0-9]+/im
search for well-known classes and ids:
typepad.com .entry-content
wordpress.org .post-entry .entry .post
movabletype.com .post
blogger.com .post-body .entry-content
drupal.com .content
tumblr.com .post
squarespace.com .journal-entry-text
expressionengine.com .entry
gawker.com .post-body
Ref: The blog platforms of choice among the top 100 blogs
// find() takes a selector string, so join the candidate selectors with commas
$selectors = array('.post-body', '.post', '.journal-entry-text', '.entry-content', '.content');
$doc = phpQuery::newDocumentFile('http://blog.com')->find(implode(',', $selectors))->children('p,div');
search based on a common HTML structure that looks like this:
<div>
<h1|h2|h3|h4|a />
<p|div />
</div>
$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');
DOMDocument can be used to parse HTML documents, which can then be queried through PHP.
I worked on a similar project a while back. It's not as complex as the Python script, but it will do a good job. Check out the PHP Simple HTML DOM Parser:
http://simplehtmldom.sourceforge.net/
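Typical usage, per the project's documentation (the URL and the .entry-content selector are placeholders):

include 'simple_html_dom.php';

$html = file_get_html('http://blog.example.com/');
foreach ($html->find('div.entry-content p') as $p) {
    echo $p->plaintext, "\n";
}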
Depending on your HTML structure, and if you have ids or classes in place, you can get a little more involved and use preg_match() to specifically grab any information between a certain start and end tag. This means that you should know how to write regular expressions.
You can also look into a browser-emulation PHP class. I've done this for page scraping and it works well enough, depending on how well formatted the DOM is. I personally like SimpleBrowser:
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html
I have developed an HTML parser and filter PHP package that can be used for that purpose.
It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations on HTML/XML code.
It was meant to deal with real-world pages, so it can handle malformed tags and data structures and preserve as much of the original document as possible.
One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.
All those filter classes are optional. You can chain them together the way you want, if you need any at all.
So, to solve your problem: I do not think there is already a specific solution for this in PHP anywhere, but a special filter class could be developed for it. Take a look at the package; it is thoroughly documented.
If you need help, just check my profile and mail me, and I may even develop a filter that does exactly what you need, perhaps inspired by solutions that exist for other languages.