Regular Expressions - Where Angels Fear to Tread - php

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay

Firstly, are you using the right tool for the job? Regex is a text matching engine, not a full-blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal English text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
Then modify your regex to match your 5th HTML tag (which is a bit harder: < and > are not special characters in PCRE and need no escaping, but the / in a closing tag does if / is also your pattern delimiter)
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
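The first two steps above can be sketched roughly as follows. This is only a minimal illustration: replaceNth is a hypothetical helper name, not a built-in function, and it relies on preg_replace_callback so that PHP does the counting while the regex only does the matching.

```php
<?php
// Hypothetical helper: replace only the nth (1-based) match of $pattern in $text.
function replaceNth(string $pattern, string $replacement, string $text, int $n): string
{
    $seen = 0;
    return preg_replace_callback(
        $pattern,
        function (array $m) use (&$seen, $n, $replacement): string {
            $seen++;
            // Return the match unchanged unless it is the nth one.
            return $seen === $n ? $replacement : $m[0];
        },
        $text
    );
}

$text = 'salt and pepper and sugar and spice';
echo replaceNth('/\band\b/', 'AND', $text, 2), "\n";
// prints: salt and pepper AND sugar and spice
```

Once this works on plain text, the same helper can be pointed at a tag pattern, with the caveats about nesting discussed in the other answers.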

This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace... or phpQuery and QueryPath.
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context-free language, and as such can only be properly parsed with a context-free parser.
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
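As a rough sketch of the XPath route suggested above, using PHP's bundled DOM extension: replaceNthElement is a hypothetical helper name, and the tag name is assumed to come from trusted code, not from user input, because it is placed directly into the XPath expression.

```php
<?php
// Hypothetical helper: replace the contents of the nth <tag> element
// (1-based, in document order, nested elements included).
function replaceNthElement(string $html, string $tag, int $n, string $replacement): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // (//div)[5] selects the 5th div of the whole document.
    $nodes = $xpath->query(sprintf('(//%s)[%d]', strtolower($tag), $n));
    if ($nodes === false || $nodes->length === 0) {
        return $html; // fewer than n such elements: leave the page unchanged
    }

    $node = $nodes->item(0);
    // Remove every child, then insert the replacement as plain text.
    while ($node->firstChild !== null) {
        $node->removeChild($node->firstChild);
    }
    $node->appendChild($doc->createTextNode($replacement));

    return $doc->saveHTML();
}

$page = '<html><body><div>one</div><div>two <div>nested ad</div></div><div>four</div></body></html>';
echo replaceNthElement($page, 'div', 3, 'xxxxxxx');
```

Here the third div in document order is the nested one, so only "nested ad" is replaced. A regex has no reliable way to tell which closing tag belongs to which opening tag in that situation.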

OK, a few ground rules.
Don't post a question like that; putting the whole question in a pre block will only keep people away.
Regular expressions are awesome!
If you want to consider options, look into how to read HTML as an XML document and parse it using XPath.
@tobyodavies is pretty much correct; I'll include the answer in case you want to do it anyway.
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be OK using that expression and counting the occurrences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER );
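Suppose you only need the 5th one. $matches[$i][0] is the whole string of match number $i, counting from zero, so the 5th match is at index 4:
if (count($matches) >= 5)
{
    $myMatch = $matches[4][0];     // the whole <div>...</div> block
    $matchedText = $matches[4][1]; // only the text between the tags
}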
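To go from finding the match to actually replacing it, one option is to ask preg_match_all for byte offsets and splice the replacement in with substr_replace. The following is a sketch under assumptions: replaceNthDiv is a hypothetical helper, the pattern is widened with [^>]* so that div tags with attributes also match, and nested divs are still not handled.

```php
<?php
// Hypothetical helper: replace the inner text of the nth (1-based) <div>...</div>.
// Assumes the divs are not nested.
function replaceNthDiv(string $html, int $n, string $replacement): string
{
    // [^>]* allows attributes on the opening tag; s = dot matches newlines, i = case-insensitive.
    $regex = '#<div\b[^>]*>(.*?)</div>#si';
    $count = preg_match_all($regex, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if ($count === false || $count < $n) {
        return $html; // fewer than n matches: nothing to replace
    }
    // With PREG_OFFSET_CAPTURE every group is an array of [text, byteOffset].
    [$inner, $offset] = $matches[$n - 1][1];
    return substr_replace($html, $replacement, $offset, strlen($inner));
}

$page = '<div>a</div><div class="x">b</div><div>c</div>';
echo replaceNthDiv($page, 2, 'xxxxxxx'), "\n";
// prints: <div>a</div><div class="x">xxxxxxx</div><div>c</div>
```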
Good luck in your efforts...

Related

Why use dom to parse webpages instead of regex?

I've been searching for questions about finding contents in a page, and a lot of answers recommend using DOM when parsing webpages instead of regex. Why is that? Does it improve the processing time or something?
A DOM parser is actually parsing the page.
A regular expression is searching for text, not understanding the HTML's semantic meaning.
It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.
You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
So, instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML).
I can't stand hearing that "HTML is not a regular language ..." anymore. Regular expressions (as used in today's languages) aren't regular either.
The simple answer is:
A regular expression is not a parser: it describes a pattern and it will match that pattern, but it has no idea about the document structure. You can't parse anything with one regex. Of course regexes can be part of a parser; I don't know for sure, but I assume nearly every parser uses regexes internally to find certain sub-patterns.
If you can build that pattern for the stuff you want to find inside HTML, fine, use it. But very often you will not be able to create this pattern, because it's practically impossible to cover all the corner cases, or dependencies like "find all links, but only if they are green and not pink".
In most cases it's a lot easier to use a parser that understands the structure of your document and that also accepts a lot of "broken" HTML. It makes it easy to access all links, or all table elements of a certain table, or ...
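As a small illustration of that last point, here is a sketch that collects all links from deliberately sloppy markup with PHP's DOMDocument. The sample HTML and the variable names are made up for the example.

```php
<?php
// Deliberately sloppy markup: the <p> and <b> elements are never closed.
$html = '<p>See <a href="/one">first</a><p>and <b><a href="/two">second</a>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links); // the two href values, /one and /two
```

The parser repairs the unclosed elements on its own, so the code never has to deal with tag openings and closings.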
To my mind, it's safer to use REGEXP on pages where you don't have control over the content: the HTML might not be properly formed, and then a DOM parser can fail.
Edit:
Well, considering what I just read, you should probably use regexp only if you need very small things, like getting all links of a document, etc.

Regular expressions - Reference the first match in a search

I don't quite know how to describe my problem in a short title so I am sorry if the title for this question is a bit misleading.
But I really don't know what the thing I am looking for is called or if it is even possible.
I am trying to use a regular expression to find everything between a set of matching tags in HTML.
This was easy for me when I was testing with static tags because I could just search for everything in between two pieces of text such as \{myTag\}(someExpression)\{\/myTag\}
My problem comes with the fact that 'myTag' could be anything.
I just don't know how (or if it is even possible) to match the starting tag with the ending tag when that text is variable.
I thought I had seen some kind of referencing system in regular expressions before where you can use the dollar sign and a number, but I don't know if you can use this within the search itself.
I originally thought that perhaps I could write something like: \{(.*?)\}(someExpression)\{\/${1}\} but I have no idea if this would actually work or if it is possible (let alone if it is correct).
I hope this question makes sense as I'm not really sure how to ask it.
Mainly because, like I said, I don't know if this has a name or if it is possible, and I am also a total beginner at regular expressions.
And if it makes any difference, the language I am doing this in is PHP, with the preg_replace_callback function.
Any help would be greatly appreciated.
Try this:
\{([^}]*)\}(someExpression)\{\/\1\}
but be aware that you need to make sure someExpression doesn't match ending tags as well (like for example .* would). And of course, if tags are nested, then all bets are off, and you'll need a different regex (or a parser).
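A minimal sketch of that backreference in PHP, since the question mentions preg_replace_callback. The sample input and the lazy (.*?) standing in for someExpression are assumptions made for illustration, and nested tags are not handled.

```php
<?php
// \1 must match exactly the text captured by group 1 (the opening tag name).
$pattern = '~\{([^}/]+)\}(.*?)\{/\1\}~s';
$input = 'Hello {b}bold{/b} and {i}italic{/i}, but {b}mismatched{/i}.';

$output = preg_replace_callback($pattern, function (array $m): string {
    // $m[1] is the tag name, $m[2] the content between the tags.
    return '<' . $m[1] . '>' . strtoupper($m[2]) . '</' . $m[1] . '>';
}, $input);

echo $output, "\n";
// prints: Hello <b>BOLD</b> and <i>ITALIC</i>, but {b}mismatched{/i}.
```

The mismatched pair at the end is left alone because \1 only matches a closing tag with the same name as the opening one.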
It kind of depends on your case. If you know it's just an HTML snippet and there is a specific pattern you can search the HTML for then you can use a regex to find and replace the pattern but it seems to me you are trying to parse the HTML. So the issue would be if you had a nested tag. You should check out http://php.net/manual/en/function.preg-replace.php because that seems like a much easier function to use than the one with the callback.
As a note about regular expression backreferences: you refer to a captured group as $i or \i depending on the language you are using. PHP's preg functions support \1 inside the pattern, and both $1 and \1 in the replacement string.

Scraping Google Search Results in PHP

I would like to get the links from the search results. Can someone please help me with the regular expression to do this? I've got this, and it doesn't work:
preg_match_all("/<h3(.*)><a href=\"(.*)\"(.*)<\/h3>/", $result, $matches);
Your pattern's biggest problem is most likely that its quantifiers are greedy rather than lazy: each (.*) grabs as much text as it can. Changing it to the following should solve that issue...
preg_match_all('#<h3.*?><a href="(.*?)".*?</h3>#', $result, $matches);
print_r($matches[1]);
There are possibly a few rare URLs that could mess the pattern up, but chances are you won't run into one. I will point out that stillstanding has a good point, though: using the API would be a better option.
As for people that blanket answer with "You can't parse HTML with Regex, use a DOM"... Whilst you cannot create a generic HTML parser (and should be using DOM for that task), you can match patterns in a set of text you know follows a certain structure, the fact that structure is HTML is irrelevant. Yes, if Google change their layout it will probably break, but this is also probably true of a DOM Parser. (P.S. I'm well aware this will probably get down-voted by the sheeple).
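To make the greedy versus lazy difference from this answer concrete, here is a small sketch that runs both patterns over the same made-up snippet. The two example URLs are placeholders, not real Google markup.

```php
<?php
$result = '<h3 class="r"><a href="http://a.example/">A</a></h3>'
        . '<h3 class="r"><a href="http://b.example/">B</a></h3>';

// Greedy: each .* grabs as much as it can, so a single match swallows both results.
preg_match_all('/<h3(.*)><a href="(.*)"(.*)<\/h3>/', $result, $greedy);

// Lazy: each .*? stops at the first position where the rest of the pattern can match.
preg_match_all('#<h3.*?><a href="(.*?)".*?</h3>#', $result, $lazy);

echo count($greedy[0]), "\n"; // prints: 1
print_r($lazy[1]);            // both URLs, a.example first
```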

Building a markup parser in php

I have created a very simple markup parser in PHP. However, it currently uses str_replace to switch between markup and HTML. How can I make a "code" box of sorts (will eventually use GeSHi) that has the contents untouched?
Right now, the following markup: [code][b]Some bold text[/b][/code] winds up parsing as the code box with <b>Some bold text</b>.
I need some advice, which option is best?
Have it check each word individually, and if it is not inside a [code] box it should parse
Leave it as is, let users be unable to post markup inside of [code].
Create another type of code box specifically for HTML markup, have [code] autorevert any < or > to [ and ].
Is there maybe even another option? This is a bit tougher than I thought it would be...
EDIT: Is it even worth adding a code box type thing to this parser? I mean, I see how it could be useful, but it is a rather large amount of effort for a small result.
Why would you reinvent the wheel?
There's plenty of markup parsers already.
Anyway, just str_replace won't help much. You'd have to learn regular expressions and as they say, now you've got two problems ;)
You could break it down into multiple strings for the purposes of using the str_replace. Split the strings on the [code] and [/code] tags - saving the code box in a separate string. Make note of where it went in the original string somehow. Then use str_replace on the original string and do whatever parsing you like on the code box string. Finally reinsert the parsed code boxes and display.
Just a word of warning though, turning input into html for display strikes me as inherently dangerous. I'd recommend a large amount of input sanitization and checking before converting to html for redisplay.
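The split-and-reinsert idea above, combined with escaping, could look roughly like this. It is a sketch, not a finished parser: renderMarkup is a hypothetical function name, only the [b] tag is converted, and the pre/code wrapper is an arbitrary choice for the code box.

```php
<?php
// Hypothetical renderer: converts [b]...[/b] outside [code] blocks and
// leaves the contents of [code] blocks untouched.
function renderMarkup(string $text): string
{
    // The capture group makes preg_split keep each [code]...[/code] block as its own chunk.
    $chunks = preg_split('~(\[code\].*?\[/code\])~s', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($chunks as $chunk) {
        // Escape first, so user-supplied HTML never reaches the page as markup.
        $safe = htmlspecialchars($chunk, ENT_QUOTES, 'UTF-8');
        if (preg_match('~^\[code\](.*)\[/code\]$~s', $safe, $m)) {
            // Code box: emit the inner text as is, without converting any markup.
            $out .= '<pre><code>' . $m[1] . '</code></pre>';
        } else {
            $out .= str_replace(['[b]', '[/b]'], ['<b>', '</b>'], $safe);
        }
    }
    return $out;
}

echo renderMarkup('[b]Bold[/b] [code][b]Some bold text[/b][/code]'), "\n";
// prints: <b>Bold</b> <pre><code>[b]Some bold text[/b]</code></pre>
```

Because the chunks stay in their original order, there is no need to record where each code box came from.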
HTML beautifier is pretty sweet. http://pear.php.net/package/PHP_Beautifier . They have a decorator class as well that would probably suit your needs.
To be clear, your problem is in two parts. The first part is the need for a lexical analyzer to break your "code" into the keywords for your "language." Once you have a lexical analyzer, you then need a parser. A parser is code that accepts the keywords for your language one at a time in a logical (usually recursive-descent) manner.

PHP regex to get contents of a specific span element

I need some help ... I'm a bit (read total) n00b when it comes to regular expressions, and need some help writing one to find a specific piece of text contained within a specific HTML tag from PHP.
The source string looks like this:
<span lang="en">English Content</span><span lang="fr">French content</span> ... etc ...
I'd like to extract just the text of the element for a specific language.
Can anyone help?
There are plenty of HTML parsers available for PHP. I suggest you check out one of those, (for example: PHP Simple HTML DOM Parser).
Shooting yourself in the foot with trying to read HTML with regex is a lot easier than you think, and a lot harder to avoid than you wish (especially when you don't know regex thoroughly, and your input is not guaranteed to be 100% clean HTML).
(Bad, not working) example which shows why you should not use regex for parsing html.
/<span lang="en">(.*)<\/span>/
Will output:
English Content</span><span lang="fr">French content
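For contrast, a sketch of the same lookup with PHP's bundled DOM extension and an XPath attribute filter. The variable names are illustrative, and $lang is assumed to be a trusted value because it is placed directly into the XPath expression.

```php
<?php
$source = '<span lang="en">English Content</span><span lang="fr">French content</span>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$lang = 'fr'; // language to extract

$texts = [];
foreach ($xpath->query('//span[@lang="' . $lang . '"]') as $span) {
    $texts[] = $span->textContent;
}

print_r($texts); // one entry: French content
```

Each span is a separate node, so the result cannot spill over into a neighbouring span the way the greedy (.*) does.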
More stuff to read:
Parsing: Beyond Regex
For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS
There's this most awesome class that lets you do SQL-like queries on HTML pages. It might be worth a look:
HTML SQL
I've used it a bunch and I love it.
Hope that helps...
