I've been searching for questions about finding contents in a page, and a lot of answers recommend using a DOM parser instead of regex when parsing web pages. Why is that? Does it improve processing time or something?
A DOM parser is actually parsing the page.
A regular expression is searching for text, not understanding the HTML's semantic meaning.
It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.
You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
So, instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML).
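To make the difference concrete, here is a minimal sketch using PHP's bundled DOM extension. The sample markup and the variable names are made up for illustration; the point is only that a naive pattern both misses real links and picks up a commented-out one, while the parser sees the document the way a browser would.

```php
<?php
// Hypothetical sample markup with three common variations.
$html = <<<'HTML'
<p>
  <a class="nav" href='/single-quoted'>one</a>
  <A HREF="/upper-case">two</A>
  <!-- <a href="/commented-out">three</a> -->
</p>
HTML;

// Naive regex: only matches the exact spelling <a href="...">.
preg_match_all('/<a href="([^"]*)"/', $html, $m);
$regexLinks = $m[1]; // misses both real links, finds only the commented-out one

// DOM parser: attribute order, quoting, case and comments are handled for us.
$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // keep parse warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$domLinks = [];
foreach ((new DOMXPath($doc))->query('//a/@href') as $attr) {
    $domLinks[] = $attr->nodeValue;
}

print_r($regexLinks);
print_r($domLinks);
```

The regex could of course be extended to cover each of these cases, but every extension makes it longer and there is always another variation it has not seen yet.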
I can't stand hearing "HTML is not a regular language ..." anymore. Regular expressions (as implemented in today's languages) aren't strictly regular either.
The simple answer is:
A regular expression is not a parser. It describes a pattern and will match that pattern, but it has no idea about the document structure. You can't parse a document with a single regex. Of course regexes can be part of a parser; I don't know for sure, but I assume nearly every parser uses regexes internally to find certain sub-patterns.
If you can build that pattern for the stuff you want to find inside HTML, fine, use it. But very often you will not be able to create this pattern, because it's practically impossible to cover all the corner cases, or dependencies like "find all links, but only if they are green and not pink".
In most cases it's a lot easier to use a parser that understands the structure of your document and also accepts a lot of "broken" HTML. It makes it easy to access all links, or all elements of a certain table, or ...
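As a rough illustration of the "certain table" case, the sketch below feeds deliberately sloppy markup (no closing td or tr tags) to PHP's DOMDocument and then asks for the cells of one table by its id. The markup and the id "prices" are invented for the example.

```php
<?php
// Deliberately sloppy markup: no </td> or </tr>, as often seen in the wild.
$html = '<table id="prices"><tr><td>Apples<td>1.20<tr><td>Pears<td>0.90</table>'
      . '<table id="other"><tr><td>ignored</table>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // collect parse warnings quietly
$doc->loadHTML($html);                        // the HTML parser repairs the markup
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Only the cells of the table we care about, wherever they sit inside it.
$xpath = new DOMXPath($doc);
$cells = [];
foreach ($xpath->query('//table[@id="prices"]//td') as $td) {
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

Writing a single regex that copes with the missing end tags and still stops at the end of the right table would be considerably harder.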
To my mind, it's safer to use regexp on pages where you don't have control over the content: the HTML might not be well formed, and then a DOM parser can fail.
Edit:
Well, considering what I just read, you should probably use regexp only for very small things, like getting all links of a document, etc.
I am aware that the prevailing opinion is not to use RegEx for parsing HTML; however, I do not see how it would be harmful to use RegEx for what I want to achieve (similar functions built on RegEx have existed in earlier scripting languages, such as _StringBetween( ) in AutoIt3).
I am also aware that _StringBetween( ) was not specifically written for HTML but I have been using it without any problem on HTML content for the past 8 years along with other folks.
For my HTML Extraction API I would like to present the following piece of HTML
<div class="video" id="video-91519"><!-- The value of the identifier is dynamic-->
<a href="about:blank"><img src="silly.jpg"><!-- So is the href and src in a, img -->
</div>
The reason for the API I am trying to write is to make extraction of the video_url and thumbnail extremely easy, and therefore an HTML parser seems out of reach. I would like to be able to extract them using something along the lines of
<div class="video" id="video-{{unknown}}">{{unknown}}<a href="{{video_url}}"><img src="{{thumbnail}}">{{unknown}}</div>
Of course, in the previous piece of HTML you could do it much easier such as
<a href="{{video_url}}"><img src="{{thumbnail}}">
but I was trying to present a perfect example to avoid confusion.
How does RegEx come into play? Well, I was going to replace {{video_url}}, {{thumbnail}} and {{unknown}} with (.*?), (.*?) and .* respectively, using the /s modifier, and of course make sure that {{video_url}} and {{thumbnail}} each occur only once in the provided input (the template, not the HTML).
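For reference, the substitution described above could be sketched roughly as follows. The function name compileTemplate is hypothetical, named capture groups are used instead of plain (.*?) so the results can be read by name, and placeholder names are assumed to be plain identifiers.

```php
<?php
// Hypothetical helper: turn a {{placeholder}} template into a PCRE pattern.
function compileTemplate(string $template): string
{
    // Even indexes hold literal text, odd indexes hold placeholder names.
    $parts = preg_split('/\{\{(\w+)\}\}/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);
    $regex = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $regex .= preg_quote($part, '~');        // literal HTML, escaped
        } elseif ($part === 'unknown') {
            $regex .= '.*';                          // "don't care" filler
        } else {
            $regex .= '(?<' . $part . '>.*?)';       // value to extract
        }
    }
    return '~' . $regex . '~s';                      // /s: dot also matches newlines
}

$template = '<div class="video" id="video-{{unknown}}">{{unknown}}'
          . '<a href="{{video_url}}"><img src="{{thumbnail}}">{{unknown}}</div>';

$html = "<div class=\"video\" id=\"video-91519\"><!-- dynamic id -->\n"
      . "<a href=\"about:blank\"><img src=\"silly.jpg\"><!-- dynamic too -->\n"
      . "</div>";

if (preg_match(compileTemplate($template), $html, $m)) {
    echo $m['video_url'], ' ', $m['thumbnail'], "\n";
}
```

This works on the sample, but note how much of the literal markup ends up baked into the pattern: an extra class, a reordered attribute or single quotes in the page would make it stop matching.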
So, is there any reason for me not to use RegEx, or should I still go for an HTML parser? Proofs of concept of either an acceptable RegEx and/or an HTML parser are welcome. I cannot personally see how to make this happen using an HTML parser.
I think the way you have framed the problem presupposes the solution: if you want to be able to specify a pattern to match, then you have to use a pattern-matching language, such as regex. But if you frame the question as allowing searches for content in the document, then other options become available, such as a path-based input that compiles to XPath expressions, or CSS selectors as used very successfully by the likes of jQuery.
What you are building here is not really an HTML extraction API as such, but a regex processing API - you are inventing a simplified pattern-matching language which can be compiled to a regex, and that regex applied to any string.
That in itself is not a bad thing, but if the users of that pattern-matching API try to use it to parse a more complex or unpredictable document, they will have all the same problems that everyone has when they try to match HTML using regexes, plus additional limitations imposed by your pre-processor. Those limitations are an inevitable consequence of simplifying the language: you are trading some of the power of the regex engine in order to make your patterns more "user friendly".
To return to the idea of reframing the question, here is an example of a simplified matching API that could compile to CSS expressions (e.g. used with SimpleHTMLDOM):
Find: div (class:video)
Must-Contain: a, img
Match: id Against video-{{video_id}}
Child: a
Extract: href Into video_url
Child: img
Extract: src Into thumbnail
Note that this language is a lot more abstracted away from the HTML; this has advantages and disadvantages. On the one hand, the simple matching pattern in your question is easy to create based on a single example. On the other hand, it is much more fragile to variations in the HTML, either due to changes in the site, or in-page variations such as adding an extra CSS class of "featured-video" to a small number of videos. The selector-based example requires the user to learn more specifics of the API, but if they don't know HTML and pattern-matching to begin with, a verbose syntax may be more helpful to them than one involving lots of cryptic punctuation.
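As a rough idea of what such a selector-style rule could compile down to, here is a sketch that hard-codes the rule above using PHP's bundled DOMDocument and DOMXPath rather than SimpleHTMLDOM. The function name extractVideos and the returned array keys are assumptions made for the example.

```php
<?php
// Hypothetical hand-written equivalent of the Find/Match/Extract rule above.
function extractVideos(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $videos = [];

    // Find: div (class:video), matching "video" as one of several classes.
    $divs = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " video ")]');
    foreach ($divs as $div) {
        // Match: id Against video-{{video_id}}
        if (!preg_match('/^video-(.+)$/', $div->getAttribute('id'), $m)) {
            continue;
        }
        // Must-Contain: a, img
        $a   = $xpath->query('.//a[@href]', $div)->item(0);
        $img = $xpath->query('.//img[@src]', $div)->item(0);
        if ($a === null || $img === null) {
            continue;
        }
        $videos[] = [
            'video_id'  => $m[1],
            'video_url' => $a->getAttribute('href'),
            'thumbnail' => $img->getAttribute('src'),
        ];
    }
    return $videos;
}

$html = "<div class=\"video featured-video\" id=\"video-91519\"><!-- dynamic -->\n"
      . "<a href=\"about:blank\"><img src=\"silly.jpg\"><!-- dynamic -->\n"
      . "</div>";
print_r(extractVideos($html));
```

The sample deliberately carries the extra "featured-video" class mentioned above and the unclosed a tag from the question; neither requires any change to the rule.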
I was wondering which method mentioned in the title is more efficient to replace content in a html page.
I have this custom tag in my page: <includes module='footer'/> which will be replaced with some content.
Now there are some downsides to using DOMDocument->getElementsByTagName('includes')->item(0)->parentNode->replaceChild: for instance, when I forget the slash in the tag, like so: <includes module='footer'>, the whole site crashes.
Regex tolerates exceptions like these, as long as the input matches the rule. It would even allow me to replace any string, like {includes:footer}.
Now back to my actual question: are there any downsides to using regex for this purpose, like performance issues?
More here: Append child/element in head using XML Manipulation
cheers
I wouldn't be too worried about performance here; I would consider them "comparable". Benchmarks would need to be run to truly determine this, as it would depend on the size of the document and how the regular expression is written.
Instead, I would be concerned about accuracy. In general DOMDocument will be much better at parsing XML since it was built to read and understand the language. However, it does fail on <includes module='footer'> because it is an un-closed tag (expecting: </includes>).
Most common HTML/XML formatting issues can be fixed with PHP's Tidy class. I would check this out, since you should receive much more "expected results" compared to if you used regex to parse. If you used a regular expression, there could technically be attributes before/after the module, elements within the includes element, unexpected characters like <includes module='foo>bar'>, etc.
In the end, if your XML is in a "controlled" environment (i.e. you know what can and can't happen, you know what possible characters module will contain, you know that it will always be a self-closing element containing no children, etc.) then by all means use a regular expression. Just know it is looking for a very specific set of rules. However, if you expect this to work with "anything you throw at it", please use a DOM parser (after Tidy'ing to avoid the exceptions), regardless of performance (although I bet it will be very comparable in many instances).
Also, a final note: if you plan to find/replace/manipulate many nodes in a document, you will likely see a large performance increase by going with a DOM parser. A DOM parser takes a document and parses it once. Then you just traverse the data it has already loaded. Compare this with regular expressions, where each individual expression is run across the whole document looking for a set of matches.
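A minimal sketch of the "parse once, replace many" idea, assuming the template is well-formed XML. The function name renderIncludes and the $modules array are made up for the example, and module content is inserted as plain text, so any markup in it would be escaped.

```php
<?php
// Hypothetical helper: parse the template once, then swap every <includes/> node.
function renderIncludes(string $xml, array $modules): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadXML($xml);               // false if not well-formed
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        throw new RuntimeException('Template is not well-formed XML');
    }

    // getElementsByTagName() returns a live list; copy it before modifying the tree.
    $nodes = iterator_to_array($doc->getElementsByTagName('includes'));
    foreach ($nodes as $node) {
        $content = $modules[$node->getAttribute('module')] ?? '';
        $node->parentNode->replaceChild($doc->createTextNode($content), $node);
    }
    return $doc->saveXML($doc->documentElement);
}

$template = '<html><head><title><includes module="title"/></title></head>'
          . '<body><p><includes module="body"/></p><p><includes module="footer"/></p></body></html>';

$page = renderIncludes($template, [
    'title'  => 'My page',
    'body'   => 'Hello',
    'footer' => 'Bye',
]);
echo $page, "\n";
```

Checking the return value of loadXML() also turns the "whole site crashes" case from the question into an exception that can be caught and reported.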
If you want me to get more specific in any area (i.e. give a Tidy example, or work on a benchmark), let me know.
So I did some naive performance testing using microtime(true), and it turns out preg_replace is the faster option. While DOM replaceChild needed between 2.0 and 3.5 ms, preg_replace needed between 0.5 and 1.2 ms! But I guess that only holds for my case.
This is what my HTML looks like:
<!DOCTYPE html>
<html>
<head>
{includes:title}
{includes:style}
</head>
<body>
{includes:body}
{includes:footer}
...
allot more here
...
</body>
</html>
This is the regex I used: /{([ ]*)includes:([ ]*)$key([^}]*)}/i
As I said, I'm not fully proficient with regex, but this did the job. I guess if you optimized it, it would run even faster.
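One possible optimization, sketched here as an alternative rather than as the code that was benchmarked above: instead of running one preg_replace per $key, a single preg_replace_callback pass can fill in every placeholder. The function name fillPlaceholders is hypothetical, and module names are assumed to be lowercase word characters.

```php
<?php
// Hypothetical helper: replace all {includes:name} placeholders in one pass.
function fillPlaceholders(string $template, array $modules): string
{
    $result = preg_replace_callback(
        '/\{\s*includes:\s*(\w+)[^}]*\}/i',
        static function (array $m) use ($modules): string {
            // Unknown modules are left untouched so they are easy to spot.
            return $modules[strtolower($m[1])] ?? $m[0];
        },
        $template
    );
    return $result ?? $template; // preg_replace_callback returns null on error
}

$filled = fillPlaceholders(
    '<head>{includes:title}</head><body>{ includes: body }{includes:missing}</body>',
    ['title' => 'T', 'body' => 'B']
);
echo $filled, "\n";
```

Whether this is measurably faster would still have to be benchmarked on the real pages.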
For the replaceChild method I used a custom tag like this: <includes module='body'/>
Again, this was tested on my local server, so I still need to test how it behaves on my online server...
I want to link the topics that appear in my description to their relevant topic pages. I am using preg_replace() for this, but I need help writing the regex pattern.
The challenges I face are:
1) The description can contain all types of HTML tags
2) My replace function should not replace anything between tag and tag
3) It should not replace any attribute of any tag within the description. For example, if the description contains the string "Style and Beauty" and I want to link Style as my topic, it should not link the 'style' attribute of a tag; it should link Style in the "Style and Beauty" string.
Any help with the above would be appreciated.
Thanks in advance...
Use either the DOMDocument class or one of the several XML parsing libraries available in PHP, depending on how well formed your input is.
To elaborate on my comment: Regular Expressions are not suitable for stateful or recursive parsing, that is, they can match in quite advanced ways, but anything that requires recursion or state, most notably, anything that is somehow tree-like, can't be parsed using regular expressions. Some regex dialects (e.g. Perl regular expressions) feature back-references and other constructs that extend regular expressions beyond strictly regular parsing, but even with those, things are going to be painful at best.
Instead, do the sane thing: find a DOM parser that works on your input (e.g. PHP's own DOMDocument API), and do your processing on the resulting DOM tree. An approach that should work well is to walk through your DOM tree recursively and, at each node, see if it's a text node; if it is, apply your simple search-and-replace logic to its contents, otherwise descend into it and/or leave it unchanged. Alternatively, you can throw an XPath expression at it to give you the text nodes, and then change them directly. Or you can hook a suitable replacer function into an XSLTProcessor and do the replacing in XSLT. This is fairly straightforward if you're familiar with XSLT, but if you're not, the DOM walker is probably easier to implement.
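A minimal sketch of the XPath variant, assuming the description is an HTML fragment and that text inside existing links, scripts and style blocks should be left alone. The function name linkTopic and the URL are made up, and character-encoding handling is left out for brevity.

```php
<?php
// Hypothetical helper: link a topic in text nodes only, never in tags or attributes.
function linkTopic(string $html, string $topic, string $url): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');   // wrapper so the fragment has one root
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath   = new DOMXPath($doc);
    $pattern = '/\b(' . preg_quote($topic, '/') . ')\b/';
    $query   = '//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]';

    // Copy the node list first: we are about to modify the tree.
    foreach (iterator_to_array($xpath->query($query)) as $text) {
        $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                              // topic not in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {                    // odd indexes are the matched topic
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$linked = linkTopic(
    '<p style="color:red" title="Style">Style and Beauty</p><a href="/old">Style</a>',
    'Style',
    '/topics/style'
);
echo $linked, "\n";
```

Because only text nodes are ever touched, the style attribute, the title attribute and the text of the existing link are never candidates for replacement.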
What should I use?
I am going to fetch links, images, text, etc. and use them to build SEO statistics and an analysis of the page.
What do you recommend: an XML parser or regex?
I have been using regex and have never had any problems with it. However, I have been hearing from people that it cannot do some things and blah blah blah... To be honest, I don't know why, but I am afraid to use an XML parser and prefer regex (and it works and serves the purpose pretty well).
So, if everything is working well with regex, why am I here asking what to use? Well, the fact that everything has been fine so far doesn't mean it will be in the future as well, so I wanted to know: what are the benefits of using an XML parser over regex? Is it faster, less error-prone, better supported, does it have other shiny features, etc.?
If you do suggest to use XML parser then which is recommended one to be used with PHP
I would most definitely like to know why would you pick one over the other?
What should I use?
You should use an XML Parser.
If you do suggest to use XML parser then which is recommended one to be used with PHP
See: Robust and Mature HTML Parser for PHP .
If you're processing real world (X)HTML then you'll need an HTML parser not an XML parser, because XML parsers are required to stop parsing as soon as they hit a well-formedness error, which will be almost immediately with most HTML.
The point against regex for processing HTML is that it isn't reliable. For any regex, there will be HTML pages that it will fail on. HTML parsers are just as easy to use as regex, and process HTML just like a browser does, so are very much more reliable and there's rarely any reason not to use one.
One possible exception is sampling for statistics purposes. Suppose you're going to scan 100,000 web pages for a fairly simple pattern, for example, the presence of a particular attribute, and return the percentage of matching pages that you get. While even a well designed regex will likely produce both false positives and false negatives, they are unlikely to affect the overall percentage score by very much. You may be able to accept those false matches for the benefit that a regex scan is likely to run more quickly than a full parse of each page. You can then reduce the number of false positives by running a parse only on the pages which return a regex match.
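The two-stage idea could look roughly like this: a cheap regex screen first, and a full parse only for pages that pass it. The function name pageHasAttribute is hypothetical, the attribute is assumed to be a simple lowercase name, and the screen as written will miss attributes that appear without a value.

```php
<?php
// Hypothetical two-stage check: regex prefilter, then confirm with a real parse.
function pageHasAttribute(string $html, string $attribute): bool
{
    if (!preg_match('/^[a-z][a-z0-9-]*$/', $attribute)) {
        throw new InvalidArgumentException('Unsupported attribute name');
    }

    // Stage 1: fast scan. May give false positives (comments, scripts, text).
    if (!preg_match('/\s' . preg_quote($attribute, '/') . '\s*=/i', $html)) {
        return false;
    }

    // Stage 2: only pages that passed the scan pay for a full parse.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return (new DOMXPath($doc))->query('//*[@' . $attribute . ']')->length > 0;
}

$real      = pageHasAttribute('<p><a hreflang="en" href="/x">x</a></p>', 'hreflang');
$commented = pageHasAttribute('<p>text <!-- <a hreflang="en" href="/x">x</a> --></p>', 'hreflang');
$absent    = pageHasAttribute('<p>nothing here</p>', 'hreflang');
var_dump($real, $commented, $absent);
```

The second stage removes false positives such as the commented-out link; false negatives from the first stage remain, which is the trade-off described above.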
To see the kind of problems that will cause difficulties for regexes see: Can you provide some examples of why it is hard to parse XML and HTML with a regex?
It sounds to me as if you are doing screen-scraping. This is inevitably a somewhat heuristic process - you're looking for patterns that commonly occur in the web pages of interest, and you're inevitably going to miss a few of them, and you don't really mind. For example, you don't really care that your search for img tags will also find an img tag that happens to be commented out. If that characterizes your application, then the usual strictures against using regular expressions for processing HTML or XML might not apply to your case.
I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay
Firstly, are you using the right tool for the job? Regex is a text-matching engine, not a full-blown parser; perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal English text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
Then modify your regex to match your 5th HTML tag (which is a bit harder: < and > are not regex metacharacters, but the / in a closing tag needs escaping if / is your pattern delimiter, and real tags can carry attributes)
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
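The steps above can be sketched as a single small function that lets PHP do the counting. The name replaceNth is hypothetical; the regex itself stays as simple as in the first step.

```php
<?php
// Hypothetical helper: replace only the n-th match of a pattern (1-based).
function replaceNth(string $pattern, string $replacement, string $subject, int $n): string
{
    $count = 0;
    $result = preg_replace_callback(
        $pattern,
        static function (array $m) use (&$count, $n, $replacement): string {
            $count++;                                   // PHP counts, not the regex
            return $count === $n ? $replacement : $m[0];
        },
        $subject
    );
    return $result ?? $subject; // null means the regex engine reported an error
}

// Steps 1 and 2: plain English text, replace only the 2nd "and".
$sentence = replaceNth('/\band\b/', 'xxx', 'a and b and c and d', 2);

// Step 3: the same function with a tag pattern (~ as delimiter, so / needs no escaping).
$page = replaceNth('~<div>.*?</div>~si', '<div>xxx</div>', '<div>1</div><div>2</div><div>3</div>', 2);

echo $sentence, "\n", $page, "\n";
```

The tag pattern here only copes with attribute-free, non-nested div blocks, which is exactly the limitation the other answers warn about.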
This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens God will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace... or phpQuery and QueryPath.
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context-free language, and as such can only be properly parsed with a context-free parser.
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
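For comparison, a parser-based sketch of "replace the contents of the n-th tag" using PHP's bundled DOMDocument (not phpQuery or QueryPath). The function name replaceNthElement is made up; elements are counted in document order, so nested blocks are counted too.

```php
<?php
// Hypothetical helper: replace the contents of the n-th <tag> element (1-based).
function replaceNthElement(string $html, string $tag, int $n, string $text): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName($tag)->item($n - 1);
    if ($node === null) {
        return $html;                       // fewer than n such elements: unchanged
    }
    while ($node->firstChild !== null) {    // drop everything inside the element
        $node->removeChild($node->firstChild);
    }
    $node->appendChild($doc->createTextNode($text));

    return $doc->saveHTML();                // the whole page, re-serialized
}

$original = '<html><body><div>alpha<div>beta</div></div><div>gamma</div></body></html>';
$filtered = replaceNthElement($original, 'div', 3, 'xxxxxxx');
echo $filtered, "\n";
```

Unlike a lazy `<div>.*?</div>` pattern, this keeps working when the blocks are nested or carry attributes, because the parser already knows where each element starts and ends.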
OK, a few ground rules:
Don't post a question like that; wrapping the whole question in <pre> will only keep people away.
Regular expressions are awesome!
If you want to consider other options, look at how to read HTML as an XML document and query it using XPath.
@tobyodavies is pretty much correct; I'll include the answer in case you want to do it anyway.
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be OK using that expression and counting the occurrences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER );
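Suppose you only need the 5th one. The array is zero-indexed, so the 5th match is at index 4: $matches[$i][0] is the whole string of match number $i + 1, and $matches[$i][1] is the text between its tags.
if (count($matches) >= 5)
{
    $myMatch = $matches[4][0];     // the whole 5th <div>...</div> block
    $matchedText = $matches[4][1]; // only the text inside the 5th <div>
}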
Good luck in your efforts...