I'm working on a personal project to view web pages offline. My first idea was to use file_get_contents to fetch the contents of a specific URL, but this only gets the HTML and not the assets in that page (CSS, images, JavaScript, etc.). So I had to write regexes to get the stylesheets and images in the page:
$css_pattern = '/\S*\.css"/';
$img_src_pattern = '/src=(?:"|\')?.+\.(?:gif|jpg|png|jpeg)(?:"|\')/';
preg_match_all($css_pattern, $contents, $style_matches);
preg_match_all($img_src_pattern, $contents, $img_matches);
This works, but there are also image links in the CSS, and I'm still thinking about how to deal with those.
There are also projects like ganon https://code.google.com/p/ganon/ and simple html parser that might make my life easier but I prefer using regex because I want to learn more about it.
The question is: is there a better way of doing this project? The app will probably have folders in which to save assets and html for each site and it will probably become unwieldy. I've heard of things like manifest file in html5 but I'm not sure if that's possible if you don't own the site. Any ideas? If there's no other way to do this then maybe you can just help me improve the regex that I have above. I basically have to use str_replace and foreach to get the stylesheets:
$stylesheets = array();
foreach($style_matches[0] as $match){
$stylesheets[] = str_replace(array('href=', '"', "'"), '', $match);
}
Thanks in advance!
I prefer using regex because I want to learn more about it.
Parsing HTML with regex is possible albeit non-trivial. A good introduction is given in the following paper:
REX: XML Shallow Parsing with Regular Expressions
The regular expressions used in that paper (REX) are not the flavor used in PHP (PCRE), but they are similar, so you should be able to follow them if you're willing to learn.
Following what that paper outlines and writing regular expressions in PHP on your own with some nice test-cases should be a real training camp for you digging into regular expressions.
Besides the regular expressions, you also need to deal with character encodings, which is a field of its own, and then adapt the parser to an encoding (if you do not re-encode before parsing).
If you're looking specifically for an HTML 5 compatible parser, one is specified as part of the HTML 5 "specification", but as far as I know you can no longer implement it precisely with regular expressions in a sane way:
12.2 Parsing HTML documents — HTML Living Standard — Updated ca. daily
8.2 Parsing HTML documents — HTML5 — A vocabulary and associated APIs for HTML and XHTML W3C Candidate Recommendation 17 December 2012
For me that type of parsing looks like a large amount of overhead, but peek into the outline of the HTML 5 parser and you get an idea of everything you could take care of when parsing HTML nowadays. It seems like those guys and girls really needed to push in anything they could imagine. The following engines/browsers actually have an HTML 5 parser:
Gecko 2
Webkit
Chrome 7 (Webkit)
Opera 11.60 (Ragnarök)
IE10
From personal experience, there are not many SGML-based / "loose" / low-level / tag-soup HTML parsers in the PHP ecosystem. If I were to write one, I would also use regular expressions for string parsing; the REX shallow parsing article has some good discussion. However, I would probably only use such a low-level HTML parser to make arbitrary HTML consumable for DOMDocument or for some other validation/fixing task, and would not use it for further parsing or document abstraction. DOMDocument is pretty powerful, especially for gathering the links you describe above.
For the rest of your question, you find all the elements you need to bring together outlined in diverse HTTP related RFCs, so you need to decide on your own which link resolving algorithm you want to support and how you re-map the static CSS/image/js files if you save them again. You normally then re-write the HTML as well for which DOMDocument is really handy.
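As a rough illustration of gathering and rewriting links with DOMDocument, here is a minimal sketch. The function name rewriteAssets and the "assets/" directory are made up for this example, and the URL mapping is deliberately naive (it keeps only the file name and does not resolve relative URLs), so treat it as a starting point rather than a finished mirror tool.

```php
<?php
// Sketch: collect asset URLs from an HTML string and rewrite them to local paths.
// rewriteAssets() and the "assets/" directory are hypothetical names for illustration.
function rewriteAssets(string $html, string $localDir = 'assets/'): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $found = [];
    // Stylesheets, images and scripts carry their URL in href or src.
    $query = '//link[@rel="stylesheet"]/@href | //img/@src | //script/@src';
    foreach ($xpath->query($query) as $attr) {
        $found[] = $attr->value;
        // Naive mapping: keep only the file name of the URL path.
        $path = (string) parse_url($attr->value, PHP_URL_PATH);
        $attr->value = $localDir . basename($path);
    }
    return [$found, $doc->saveHTML()];
}

$html = '<html><head><link rel="stylesheet" href="/css/site.css"></head>'
      . '<body><img src="http://example.com/img/logo.png"></body></html>';
[$urls, $rewritten] = rewriteAssets($html);
print_r($urls);
echo $rewritten;
```

The list in $urls is what you would then download; the rewritten markup is what you would save next to the downloaded files.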
Also, you should store some HTTP headers inside the HTML file via the meta element, especially the encoding, unless you re-encode the document (which can be useful for offline reading anyway). Some of the more general Q&A suggestions for HTML authoring apply to a static cache as well.
The HTML5 manifest file is actually something different. The original server would have to support it, which is likely not the case (or you would need to build a parser for it as well and process it). So if you create a mirror, you might also want to list all static resources that can be stored locally for offline usage. That is a nice idea; I have not yet seen it implemented by tools like wget, so it's probably worth playing with a little.
Instead of the HTML5 manifest file, you might also have been thinking of one of the following container formats:
Mozilla Archive Format - MAFF
MIME HTML - MHTML
Webarchive
Another one of these formats/extensions (here: the SingleFile Chrome extension) makes use of the Data URI scheme, according to Wikipedia. That might also be useful in this context, although I would not favor it. I'd say it's better to have an algorithm that re-writes URLs to the local file system in a reproducible manner, so that you can dump multiple HTML files that share the same assets without fetching those assets multiple times.
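One way to get such a reproducible mapping is to derive the local file name from the URL itself. The sketch below assumes absolute URLs as input; the function name localPathFor, the "mirror" directory, and the choice of SHA-1 are all illustrative assumptions, not part of any standard.

```php
<?php
// Sketch: map an absolute asset URL to a stable local path.
// localPathFor(), the "mirror" directory and the hash choice are assumptions.
function localPathFor(string $url, string $baseDir = 'mirror'): string
{
    $host = parse_url($url, PHP_URL_HOST) ?: 'unknown-host';
    $path = (string) parse_url($url, PHP_URL_PATH);
    $ext  = pathinfo($path, PATHINFO_EXTENSION);
    // Hashing the full URL keeps the name stable across runs and avoids clashes
    // between files that share a base name (e.g. /a/logo.png and /b/logo.png).
    $name = sha1($url) . ($ext !== '' ? '.' . $ext : '');
    return $baseDir . '/' . $host . '/' . $name;
}

$first  = localPathFor('http://example.com/a/logo.png');
$second = localPathFor('http://example.com/b/logo.png');
$again  = localPathFor('http://example.com/a/logo.png');
echo $first, "\n", $second, "\n";
```

Because the same URL always yields the same path, a second page that references an already stored asset can skip the download with a simple file_exists() check.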
I am aware that it is public opinion not to use RegEx for parsing HTML; however, I do not see how it would be harmful to use RegEx for what I want to achieve (similar RegEx-based functions have existed in earlier scripting languages, such as _StringBetween( ) in AutoIt3).
I am also aware that _StringBetween( ) was not specifically written for HTML but I have been using it without any problem on HTML content for the past 8 years along with other folks.
For my HTML Extraction API I would like to present the following piece of HTML
<div class="video" id="video-91519"><!-- The value of the identifier is dynamic-->
<a href="about:blank"><img src="silly.jpg"><!-- So is the href and src in a, img -->
</div>
The reason for the API I am trying to write is to make extraction of the video_url and thumbnail extremely easy, and therefore an HTML parser seems out of reach. I would like to be able to extract them using something along the lines of
<div class="video" id="video-{{unknown}}">{{unknown}}<a href="{{video_url}}"><img src="{{thumbnail}}">{{unknown}}</div>
Of course, in the previous piece of HTML you could do it much easier such as
<a href="{{video_url}}"><img src="{{thumbnail}}">
but I was trying to present a perfect example to avoid confusion.
How does RegEx come into play? Well, I was going to replace {{video_url}}, {{thumbnail}} and {{unknown}} with (.*?), (.*?) and .* using /s, and of course make sure that there are no multiple occurrences of {{video_url}} and {{thumbnail}} in the provided input (the pattern, not the HTML).
So, is there any reason for me not to use RegEx, or should I still go for an HTML parser? Proofs of concept of either an acceptable RegEx or an HTML parser approach are welcome. I cannot personally see how to make this happen using an HTML parser.
I think the way you have framed the problem pre-supposes the solution: if you want to be able to specify a pattern to match, then you have to use a pattern-matching language, such as regex. But if you frame the question as allowing searches for content in the document, then other options might be available, such as a path-based input that compiled to XPath expressions, or CSS selectors as used very successfully by the likes of jQuery.
What you are building here is not really an HTML extraction API as such, but a regex processing API - you are inventing a simplified pattern-matching language which can be compiled to a regex, and that regex applied to any string.
That in itself is not a bad thing, but if the users of that pattern-matching API try to use it to parse a more complex or unpredictable document, they will have all the same problems that everyone has when they try to match HTML using regexes, plus additional limitations imposed by your pre-processor. Those limitations are an inevitable consequence of simplifying the language: you are trading some of the power of the regex engine in order to make your patterns more "user friendly".
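To make the "compiles to a regex" point concrete, here is a minimal sketch of such a pre-processor. The function name compileTemplate is made up, and two choices are assumptions of this sketch rather than something from the question: {{unknown}} is compiled to a lazy .*? instead of a greedy .*, and named placeholders become named capture groups.

```php
<?php
// Sketch: compile a {{placeholder}} template into a PCRE pattern.
// compileTemplate() is a hypothetical helper, not an existing API.
function compileTemplate(string $template): string
{
    // Split on placeholders but keep their names (they land on the odd indexes).
    $parts = preg_split('/\{\{(\w+)\}\}/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);
    $regex = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Literal text from the template: escape regex metacharacters.
            $regex .= preg_quote($part, '~');
        } elseif ($part === 'unknown') {
            // Anything at all, as little as possible.
            $regex .= '.*?';
        } else {
            // A value to extract, exposed as a named group.
            $regex .= '(?<' . $part . '>.*?)';
        }
    }
    // The s modifier lets "." match newlines as well.
    return '~' . $regex . '~s';
}

$html = "<div class=\"video\" id=\"video-91519\"><!-- dynamic id -->\n"
      . "<a href=\"about:blank\"><img src=\"silly.jpg\"><!-- dynamic too -->\n"
      . "</div>";
$pattern = compileTemplate(
    '<div class="video" id="video-{{unknown}}">{{unknown}}<a href="{{video_url}}">'
    . '<img src="{{thumbnail}}">{{unknown}}</div>'
);
$matched = preg_match($pattern, $html, $m);
echo $m['video_url'], ' ', $m['thumbnail'], "\n";
```

The fragility discussed above shows up quickly: swap the attribute order or change a double quote to a single quote in the HTML and the compiled pattern no longer matches.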
To return to the idea of reframing the question, here is an example of a simplified matching API that could compile to CSS expressions (e.g. used with SimpleHTMLDOM):
Find: div (class:video)
Must-Contain: a, img
Match: id Against video-{{video_id}}
Child: a
Extract: href Into video_url
Child: img
Extract: src Into thumbnail
Note that this language is a lot more abstracted away from the HTML; this has advantages and disadvantages. On the one hand, the simple matching pattern in your question is easy to create based on a single example. On the other hand, it is much more fragile to variations in the HTML, either due to changes in the site, or in-page variations such as adding an extra CSS class of "featured-video" to a small number of videos. The selector-based example requires the user to learn more specifics of the API, but if they don't know HTML and pattern-matching to begin with, a verbose syntax may be more helpful to them than one involving lots of cryptic punctuation.
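For comparison, here is roughly what the selector-style rules above could compile down to. This sketch uses PHP's built-in DOMDocument and DOMXPath rather than SimpleHTMLDOM, hand-translates the rules instead of parsing the rule language, and the function name extractVideos is hypothetical.

```php
<?php
// Sketch: the "Find / Must-Contain / Match / Extract" rules, hand-translated to XPath.
// extractVideos() is a hypothetical name used only for this illustration.
function extractVideos(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $videos = [];
    // Find: div (class:video). Matching the class as a whole word means
    // class="video featured-video" is still found.
    $divs = $xpath->query(
        '//div[contains(concat(" ", normalize-space(@class), " "), " video ")]'
    );
    foreach ($divs as $div) {
        // Must-Contain: a, img
        $a   = $xpath->query('.//a[@href]', $div)->item(0);
        $img = $xpath->query('.//img[@src]', $div)->item(0);
        if ($a === null || $img === null) {
            continue;
        }
        // Match: id Against video-{{video_id}}
        if (!preg_match('/^video-(.+)$/', $div->getAttribute('id'), $m)) {
            continue;
        }
        $videos[] = [
            'video_id'  => $m[1],
            'video_url' => $a->getAttribute('href'),
            'thumbnail' => $img->getAttribute('src'),
        ];
    }
    return $videos;
}

$html = '<div class="video featured-video" id="video-91519"><!-- comment -->'
      . '<a href="about:blank"><img src="silly.jpg"><!-- comment -->'
      . '</div>';
$videos = extractVideos($html);
print_r($videos);
```

Note that the extra "featured-video" class and the unclosed a element do not disturb the extraction, which is the robustness the selector-based approach buys.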
I've been trying to do some simple DOM parsing of HTML documents and am really shocked at how difficult it is to do.
I've looked into some of the many alternatives to PHP's DOM classes (like simple xml parser and simple HTML DOM). I found a very effective dom2array function too, which is useful for extremely basic parsing where you just want raw values of elements.
None of these alternatives is really compelling though.
PHP documentation of the DOM is typically lacking in detail and largely useless. A lot of the comments are actually really helpful though.
The tutorials I've found online typically cover only the very very basics like writing a 20 line XML document or parsing all the p tags in a document. Meh.
Are there any sites (or books) that go into detail specifically on working with the DOM using PHP's DOM libraries?
The DOM is a language-independent interface and documented in detail by the W3C.
That being said, if your aim is extremely simple parsing of (typically) structured information, XML may not be the correct format in the first place; XML includes a variety of advanced features (namespaces, DTDs, XSLT, distinction between attributes and text, markup instead of structured information). If that's the case, consider JSON, which is extremely easy to parse and generate.
Anything that says "DOM" in the name or claims to support it should support the DOM API as defined by the W3C, and you should consider their documentation normative for everything but the language-specific parts.
I should have titled my post, "Easiest way to parse HTML DOM in PHP". 'Easiest' is not a very good word, I know. It's all relative to what you're trying to do. What I'm doing is pretty straight-forward. I want to parse standalone HTML documents and present the content in a different context.
These are the things I wanted to do:
Parse basic properties like title and body
Alter all file references (images, links, css, js) to point to a valid location
Add/remove attributes from tags (dealing with 1995 HTML here)
Strip inline styles
I ended up going with Simple HTML DOM Parser.
It has a very small learning curve and gives easy read/write access to the DOM. End of story. It does seem to choke on nested elements sometimes though.
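For anyone who wants to try the same four tasks with PHP's built-in DOM classes instead of Simple HTML DOM, a minimal sketch might look like the following. The function name cleanLegacyHtml, the asset base path, and the list of presentational attributes to strip are assumptions made for this example.

```php
<?php
// Sketch: the four tasks above with DOMDocument instead of Simple HTML DOM.
// cleanLegacyHtml() and the attribute list are illustrative assumptions.
function cleanLegacyHtml(string $html, string $assetBase): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);

    // 1. Parse basic properties such as the title.
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    // 2. Point relative file references at a new location.
    $refs = $xpath->query('//img/@src | //script/@src | //link/@href | //a/@href');
    foreach ($refs as $attr) {
        // Leave absolute URLs, protocol-relative URLs and fragments alone.
        if (!preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $attr->value)) {
            $attr->value = rtrim($assetBase, '/') . '/' . ltrim($attr->value, '/');
        }
    }

    // 3 and 4. Remove old presentational attributes and inline styles.
    $unwanted = iterator_to_array($xpath->query('//@style | //@bgcolor | //@align'));
    foreach ($unwanted as $attr) {
        $attr->ownerElement->removeAttributeNode($attr);
    }

    return ['title' => $title, 'html' => $doc->saveHTML()];
}

$page = '<html><head><title> Old page </title></head><body bgcolor="#ffffff">'
      . '<p style="color:red" align="center"><img src="pics/a.gif">'
      . '<a href="http://example.com/">home</a></p></body></html>';
$result = cleanLegacyHtml($page, '/archive/site1');
echo $result['title'], "\n", $result['html'];
```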
I'm trying to parse specific Wikipedia content in a structured way. Here's an example page:
http://en.wikipedia.org/wiki/Polar_bear
I'm having some success. I can detect that this page is a "species" page, and I can also parse the Taxobox (on the right) information into a structure. So far so good.
However, I'm also trying to parse the text paragraphs. These are returned by the API in Wiki format or HTML format, I'm currently working with the Wiki format.
I can read these paragraphs, but I'd like to "clean" them in a specific way, because ultimately I will have to display it in my app and it has no sense of Wiki markup. For example, I'd like to remove all images. That's fairly easy by filtering out [[Image:]] blocks. Yet there are also blocks that I simply cannot remove, such as:
{{convert|350|-|680|kg|abbr=on}}
Removing this entire block would break the sentence. And there are dozens of notations like this that have special meaning. I'd like to avoid writing a hundred regular expressions to process all of this, and see how I can parse this in a smarter way.
My dilemma is as follows:
I could continue my current path of semi-structured parsing, where I'd have a lot of work deleting unwanted elements as well as "mimicking" templates that do need to be rendered.
Or, I could start with the rendered HTML output and parse that, but my worry is that it's just as fragile and complex to parse in a structured way.
Ideally, there'd be a library to solve this problem, but I haven't found one yet that is up to the job. I also had a look at structured Wikipedia databases like DBPedia, but those only have the same structure I already have; they don't provide any structure for the Wiki text itself.
There are too many templates in use to reimplement all of them by hand, and they change all the time. So you will need an actual parser of the wiki syntax that can process all the templates.
And the wiki syntax is quite complex, has lots of quirks and has no formal specification. This means creating your own parser would be too much work; you should use the one in MediaWiki.
Because of this, I think getting the parsed HTML through the MediaWiki API is your best bet.
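A minimal sketch of that route: build an action=parse request, then pull plain paragraph text out of the returned HTML. The helper names parseApiUrl and paragraphsFromHtml are made up for this example, and the actual HTTP request and JSON decoding are left out so the snippet runs without network access; the sample HTML string stands in for what the API would return.

```php
<?php
// Sketch: ask MediaWiki for rendered HTML, then extract paragraph text.
// parseApiUrl() and paragraphsFromHtml() are hypothetical helper names.
function parseApiUrl(string $pageTitle, string $endpoint = 'https://en.wikipedia.org/w/api.php'): string
{
    return $endpoint . '?' . http_build_query([
        'action' => 'parse',
        'page'   => $pageTitle,
        'prop'   => 'text',
        'format' => 'json',
    ]);
}

function paragraphsFromHtml(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common trick to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent drops all markup, so expanded templates show up as plain text.
        $text = trim($p->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return $paragraphs;
}

$url = parseApiUrl('Polar_bear');
// Fetch $url, decode the JSON response, and pass its HTML string to paragraphsFromHtml().
$sample = '<p>Adult males weigh <span>350-680 kg</span>.</p><p> </p>';
$paragraphs = paragraphsFromHtml($sample);
print_r($paragraphs);
```

In the rendered HTML a template such as {{convert|350|-|680|kg|abbr=on}} has already been expanded, so the sentence stays intact after stripping the markup.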
One thing that's probably easier to be parsed from wiki markup are the infoboxes, so maybe they should be a special case.
What should I use?
I am going to fetch links, images, text, etc. and use them to build SEO statistics and an analysis of the page.
What do you recommend using: an XML parser or regex?
I have been using regex and never have had any problems with it however, I have been hearing from people that it can not do some things and blah blah blah...but to be honest I don't know why but I am afraid to use XML parser and prefer regex (and it works and serves the purpose pretty well)
So, if everything is working well with regex, why am I here asking you what to use? Well, just because everything has been fine so far doesn't mean it will stay that way in the future, so I wanted to know: what are the benefits of using an XML parser over regex? Are there any improvements in performance, is it less error-prone, is there better support, are there other shiny features, etc.?
If you do suggest using an XML parser, which one is recommended for use with PHP?
I would most definitely like to know why would you pick one over the other?
What should I use?
You should use an XML Parser.
If you do suggest to use XML parser then which is recommended one to be used with PHP
See: Robust and Mature HTML Parser for PHP.
If you're processing real world (X)HTML then you'll need an HTML parser not an XML parser, because XML parsers are required to stop parsing as soon as they hit a well-formedness error, which will be almost immediately with most HTML.
The point against regex for processing HTML is that it isn't reliable. For any regex, there will be HTML pages that it will fail on. HTML parsers are just as easy to use as regex, and process HTML just like a browser does, so are very much more reliable and there's rarely any reason not to use one.
One possible exception is sampling for statistics purposes. Suppose you're going to scan 100,000 web pages for a fairly simple pattern, for example, the presence of a particular attribute, and return the percentage of matching pages that you get. While even a well designed regex will likely produce both false positives and false negatives, they are unlikely to affect the overall percentage score by very much. You may be able to accept those false matches for the benefit that a regex scan is likely to run more quickly than a full parse of each page. You can then reduce the number of false positives by running a parse only on the pages which return a regex match.
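The two-stage idea from the paragraph above can be sketched as follows: a cheap regex screens each page, and only pages that pass are handed to the parser to weed out false positives. The function name hasAttribute is hypothetical, and the attribute name is assumed to be a plain, trusted identifier such as "itemprop".

```php
<?php
// Sketch: regex pre-filter first, full parse only for pages that pass it.
// hasAttribute() is a hypothetical helper; $attribute must be a trusted plain name.
function hasAttribute(string $html, string $attribute): bool
{
    $attribute = strtolower($attribute);

    // Stage 1: cheap screen. This can give false positives (comments, text, scripts).
    if (!preg_match('/\b' . preg_quote($attribute, '/') . '\s*=/i', $html)) {
        return false;
    }

    // Stage 2: confirm with a real parse, which ignores commented-out markup.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    return $xpath->query('//@' . $attribute)->length > 0;
}

$real      = hasAttribute('<p itemprop="name">x</p>', 'itemprop');
$commented = hasAttribute('<!-- <p itemprop="name">x</p> --><p>y</p>', 'itemprop');
$absent    = hasAttribute('<p>nothing here</p>', 'itemprop');
var_dump($real, $commented, $absent);
```

The second call is the interesting one: the regex alone would count the commented-out attribute, while the parse rejects it.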
To see the kind of problems that will cause difficulties for regexes see: Can you provide some examples of why it is hard to parse XML and HTML with a regex?
It sounds to me as if you are doing screen-scraping. This is inevitably a somewhat heuristic process - you're looking for patterns that commonly occur in the web pages of interest, and you're inevitably going to miss a few of them, and you don't really mind. For example, you don't really care that your search for img tags will also find an img tag that happens to be commented out. If that characterizes your application, then the usual strictures against using regular expressions for processing HTML or XML might not apply to your case.
I have normally hand written xml like this:
<tag><?= $value ?></tag>
Having found tools such as simpleXML, should I be using those instead? What's the advantage of doing it using a tool like that?
Good XML tools will ensure that the resulting XML file properly validates against the DTD you are using.
Good XML tools also save a bunch of repetitive typing of tags.
If you're dealing with a small bit of XML, there's little harm in doing it by hand (as long as you can avoid typos). However, with larger documents you're frequently better off using an editor, which can validate your doc against the schema and protect against typos.
You could use the DOM extension, which can be quite cumbersome to code against. My personal opinion is that the most effective way to write XML documents from the ground up is the XMLWriter extension that comes with PHP and is enabled by default in recent versions.
// Build the document in memory rather than writing to a file.
$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8');
$w->startElement("root");
$w->writeAttribute("ah", "OK");
// text() escapes special characters such as & and < for you.
$w->text('Wow, it works!');
$w->endElement();
$w->endDocument();
// htmlentities() is only there so the markup is visible in a browser.
echo htmlentities($w->outputMemory(true));
using a good XML generator will greatly reduce potential errors due to fat-fingering, lapse of attention, or whatever other human frailty. there are several different levels of machine assistance to choose from, however:
at the very least, use a programmer's text editor that does syntax highlighting and auto-indentation. just noticing that your text is a different color than you expect, or not lining up the way you expect, can tip you off to a typo you might otherwise have missed.
better yet, take a step back and write the XML as a data structure of whatever language you prefer, then convert that data structure to XML. Perl gives you modules such as the lightweight XML::Simple for small jobs or the heftier XML::Generator; using XML::Simple is just a matter of arranging your content into a standard Perl hash of hashes and running it through the appropriate method.
-steve
Producing XML via any sort of string manipulation opens the door for bugs to get into your code. The extremely simple example you posted, for instance, won't produce well-formed XML if $value contains an ampersand.
There aren't a lot of edge cases in XML, but there are enough that it's a waste of time to write your own code to handle them. (And if you don't handle them, your code will unexpectedly fail someday. Nobody wants that.) Any good XML tool will automatically handle those cases.
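The ampersand case is easy to reproduce. The following sketch compares the hand-written string, the same string with explicit escaping, and XMLWriter output, using DOMDocument::loadXML as a well-formedness check; the sample value is made up for the demonstration.

```php
<?php
// Sketch: what goes wrong with string-built XML, and two ways to avoid it.
$value = 'Tom & Jerry <3';

// Hand-written and unescaped: the bare "&" and "<" make this ill-formed.
$broken = '<tag>' . $value . '</tag>';

// Hand-written with explicit escaping.
$escaped = '<tag>' . htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</tag>';

// XMLWriter escapes text content on its own.
$w = new XMLWriter();
$w->openMemory();
$w->startElement('tag');
$w->text($value);
$w->endElement();
$generated = $w->outputMemory();

// loadXML() returns false when the input is not well-formed.
libxml_use_internal_errors(true);
$brokenOk    = (new DOMDocument())->loadXML($broken);
$escapedOk   = (new DOMDocument())->loadXML($escaped);
$generatedOk = (new DOMDocument())->loadXML($generated);
libxml_clear_errors();

var_dump($brokenOk, $escapedOk, $generatedOk);
```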
Use the generator.
The advantage of using a generator is you have consistent markup and don't run the risk of fat-fingering a bracket or quote, or forgetting to encode something. This is crucial because these mistakes will not be found until runtime, unless you have significant tests to ensure otherwise.
Hand-writing isn't always the best practice, because in a large XML document you can write wrong tags and it can be difficult to find the cause of an error. So I suggest using XML tools to create XML files.
Speed may be an issue... handwritten can be a lot faster.
The XML tools in Eclipse are really useful too. Just create a new XML schema and document, and you can easily use most of the graphical tools. I would like to point out that a prior understanding of how schemas work will be of use.
Always use a tool of some kind. XML can be very complex. I know that the PHP guys are used to working with hacky little stuff, but it's a huge code smell in the .NET world if someone doesn't use System.XML for creating XML.