Reliably parsing HTML elements using RegEx [duplicate] - php

Possible Duplicate:
Best methods to parse HTML with PHP
I'm trying to parse a webpage using RegEx, and I'm having some trouble making it work in a reliable manner.
Say I wanted to parse the code that creates a div element, and I want to extract everything between <div> and </div>. Now, this code could just be <div></div>, but it could also very well be something like:
<div class="thisIsMyDivClass"><p>This text is inside the div</p></div>
How can I make sure that, no matter how many characters lie between the opening div tag and its corresponding closing div tag, I always get only the content between them? If I specify that the number of characters following < can be anything from one to ten thousand, the match will always run on to the last > within those ten thousand characters, and thus (most likely, unless there is a lot of code or text in between) retrieve a bunch of code that I don't need.
This is my code so far (not reliable for the aforementioned reason):
/<.{1,10000}>/

Regular expressions describe so-called regular languages, which are Type 3 in the Chomsky hierarchy. HTML, on the other hand, is a context-free language, which is Type 2 in the Chomsky hierarchy. Matching arbitrarily nested opening and closing tags means keeping track of nesting depth, which a regular language cannot do. So there is no way to reliably parse HTML with regular expressions in general. Use an HTML parser instead. For PHP you can find some suggestions in this question: How do you parse and process HTML/XML in PHP?

You will need a lexical analyser and a grammar checker to parse HTML correctly. Regular expressions were designed mainly for searching strings for patterns.

I would suggest using something like DOM. I am building a large-scale site and using DOM like crazy on it. It works, works well, and with a little work can be extremely powerful.
http://php.net/manual/en/book.dom.php
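As a rough illustration of the DOM suggestion, here is a minimal sketch that pulls the contents of every div out of a string. The sample input is the snippet from the question; the variable names are only illustrative, and a real page would be fetched into $html first.

```php
<?php
// Sample input taken from the question; in practice this would be the fetched page.
$html = '<div class="thisIsMyDivClass"><p>This text is inside the div</p></div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$contents = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    $inner = '';
    // saveHTML($node) serializes a single node; joining the children gives the inner HTML.
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $contents[] = $inner;
}

print_r($contents);
```

Because the parser tracks opening and closing tags itself, the length of the content and any tags nested inside the div no longer matter.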

Related

How to retrieve content of a DIV using regex? [duplicate]

Possible Duplicates:
Xpath not behaving for me in parsing basic html
I know how to get content from a div with static name (i.e. always the same in the whole page). However, my case is "post_id_xxxxx", something like this:
<div id="post_id_12345">abc</div>
<div id="post_id_67890">abc</div>
<div id="post_id_31234">abc</div>
I would like to extract the "abc" string, but it seems difficult to me since every div has a different ID.
Thanks.
This is still workable with a regex, if it's really only about the overly simplistic cases in your example:
preg_match('#<div\s[^>]*id="post_id_12345"[^>]*>(.*?)</div>#', $str, $m)
But as soon as you have nested divs in the document or other complex constructs, you need to use an HTML parser. To give you a real example instead of generic links, use phpQuery or QueryPath with:
print qp($html)->find("#post_id_12345")->text();
Do not parse HTML/XML with regexp. HTML has a structure that an HTML-specific parser can exploit. See this classic link: RegEx match open tags except XHTML self-contained tags
You should try one of PHP's parsers, such as DOMDocument.
DO NOT USE THIS
Here is a regexp that will match the example you specified. It will not work on more complicated structures (e.g. nested divs). You haven't really specified what invariants you know about the structure of your HTML, but judging from the example this should work. You can expand this regexp to handle more complexity, but a real parser will be much more robust and easier.
<div id="post_id_[0-9]{5}">(.*)</div>
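For comparison, the same extraction can be sketched with PHP's built-in DOMDocument and DOMXPath instead of a regex. This assumes every wanted div has an id beginning with "post_id_", as in the question; the $posts array is just an illustrative way to collect the results.

```php
<?php
// Sample input copied from the question.
$html = '<div id="post_id_12345">abc</div>'
      . '<div id="post_id_67890">abc</div>'
      . '<div id="post_id_31234">abc</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$posts = [];
// starts-with() is an XPath 1.0 function, so DOMXPath supports it.
foreach ($xpath->query('//div[starts-with(@id, "post_id_")]') as $div) {
    $posts[$div->getAttribute('id')] = $div->textContent;
}

print_r($posts);
```

Unlike the regex, this keeps working if a div gains extra attributes or contains nested elements, since textContent returns the text of the whole subtree.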

Regular Expressions - Where Angels Fear to Tread

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay
Firstly, are you using the right tool for the job? Regex is a text matching engine, not a full-blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
1. Start with a simple block of normal English text, and try to match and replace (for example) every occurrence of the word "and".
2. When that works, wrap it in a PHP loop that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
3. Then modify your regex to match your 5th HTML tag (which is a bit harder, because real tags carry attributes and can be nested; < and > need no escaping in a PCRE pattern, but / does if you use it as the delimiter).
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace... or phpQuery and QueryPath.
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context-free language, and as such can only be properly parsed with a context-free parser.
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
OK, a few ground rules.
Don't post a question like that; preformatting the whole question will only keep people away.
Regular expressions are awesome!
If you want to consider other options, look into reading the HTML as an XML document and parsing it using XPath.
@tobyodavies is pretty much correct; I'll include the answer in case you want to do it anyway.
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be OK using that expression and counting the occurrences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER);
Suppose you only need the 5th one. $matches[$i][0] is the whole string of match number $i, and the array is zero-based, so the 5th match is at index 4:
if (count($matches) >= 5)
{
    $myMatch = $matches[4][0];     // the whole 5th <div>...</div>
    $matchedText = $matches[4][1]; // only the text between the tags
}
Good luck in your efforts...
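For the parser route that the other answers recommend, here is a minimal sketch using PHP's built-in DOMDocument. The function name replaceNthElement and the sample page are made up for illustration. Elements are counted in document order, so a nested div counts too, and saveHTML() returns a normalised document rather than the original bytes.

```php
<?php
// Hypothetical helper (not from the answers above): replaces the contents of the
// n-th <tag> element, counted in document order, with plain text.
function replaceNthElement(string $html, string $tag, int $n, string $replacement): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    // item() is zero-based, so the n-th element lives at index n - 1.
    // The HTML parser lower-cases tag names, hence strtolower().
    $node = $doc->getElementsByTagName(strtolower($tag))->item($n - 1);
    if ($node === null) {
        return $html; // fewer than n elements: leave the page untouched
    }

    // Remove every existing child, then add the replacement as a text node.
    while ($node->firstChild) {
        $node->removeChild($node->firstChild);
    }
    $node->appendChild($doc->createTextNode($replacement));

    return $doc->saveHTML();
}

// Build a small sample page with six <div> blocks.
$page = '<html><body>';
for ($i = 1; $i <= 6; $i++) {
    $page .= "<div>block $i</div>";
}
$page .= '</body></html>';

$filtered = replaceNthElement($page, 'DIV', 5, 'xxxxxxx');
echo $filtered;
```

The counting is done by PHP and the tag matching by the parser, so no regular expression is needed at all.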

preg_match_all Problems

I'm trying to match a string that contains HTML code that contains parameters to a function in Javascript.
There are several of these function calls in the string containing the HTML code.
changeImg('location','size');
Let's say that I want to grab the location within the single quotes; how would I go about doing this? There is more than one instance in the string.
Thanks in advance.
This is a fairly common question on SO and the answer is always the same: regular expressions are a poor tool for parsing HTML. Use an XML or HTML parser. That's what they're for. Take a look at Parse HTML With PHP And DOM for an example and Parsing Html The Cthulhu Way for a bit of background.
Parsing JavaScript is even harder, as it can appear inside <script> tags and in attributes, so at the very least you'd need to get every <script> tag and parse its contents, as well as every element and parse its event handlers (onclick, etc.).
I'm reminded of this quote:
"Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems." -- Jamie Zawinski
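One way to follow that advice is to let a DOM parser find the script blocks and event-handler attributes, and only then apply a small pattern to each piece of JavaScript. The sketch below assumes the arguments are always plain single-quoted strings, as in the question; the sample markup and file names are invented for illustration.

```php
<?php
// Invented sample markup: one call in an onclick handler, one in a <script> block.
$html = '<a href="#" onclick="changeImg(\'img/one.png\',\'large\');">One</a>'
      . '<script>changeImg(\'img/two.png\',\'small\');</script>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$locations = [];
// Select <script> elements plus every attribute whose name starts with "on".
foreach ($xpath->query('//script | //@*[starts-with(name(), "on")]') as $node) {
    // The pattern only has to cope with one small piece of JavaScript at a time.
    $pattern = "/changeImg\\(\\s*'([^']*)'\\s*,\\s*'([^']*)'\\s*\\)/";
    if (preg_match_all($pattern, $node->textContent, $m)) {
        // $m[1] holds the first argument (the location) of every call found.
        $locations = array_merge($locations, $m[1]);
    }
}

print_r($locations);
```

This is not a JavaScript parser: calls built from variables, double-quoted strings or escaped quotes would be missed.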

Building a markup parser in php

I have created a very simple markup parser in PHP. However, it currently uses str_replace to switch between markup and HTML. How can I make a "code" box of sorts (it will eventually use GeSHi) that leaves its contents untouched?
Right now, the following markup: [code][b]Some bold text[/b][/code] winds up parsing as the code box with <b>Some bold text</b>.
I need some advice, which option is best?
1. Have it check each word individually, and parse it only if it is not inside a [code] box.
2. Leave it as is, and let users be unable to post markup inside of [code].
3. Create another type of code box specifically for HTML markup, and have [code] auto-revert any < or > to [ and ].
Is there maybe even another option? This is a bit tougher than I thought it would be...
EDIT: Is it even worth adding a code box type thing to this parser? I mean, I see how it could be useful, but it is a rather large amount of effort for a small result.
Why would you reinvent the wheel?
There are plenty of markup parsers already.
Anyway, just str_replace won't help much. You'd have to learn regular expressions and as they say, now you've got two problems ;)
You could break it down into multiple strings for the purposes of using the str_replace. Split the strings on the [code] and [/code] tags - saving the code box in a separate string. Make note of where it went in the original string somehow. Then use str_replace on the original string and do whatever parsing you like on the code box string. Finally reinsert the parsed code boxes and display.
Just a word of warning though, turning input into html for display strikes me as inherently dangerous. I'd recommend a large amount of input sanitization and checking before converting to html for redisplay.
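The split-and-reinsert idea can be sketched with preg_split, which can hand back the [code] contents alongside the surrounding text. The function name renderMarkup is hypothetical, and the single [b] rule stands in for whatever str_replace rules the existing parser already has. Escaping with htmlspecialchars first is one simple way to act on the sanitization warning.

```php
<?php
// Hypothetical function name; the [b] rule stands in for the existing str_replace rules.
function renderMarkup(string $input): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured [code] contents land at the odd
    // indexes of the result, and the ordinary text at the even indexes.
    $parts = preg_split('#\[code\](.*?)\[/code\]#s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        // Escape everything first so user-supplied HTML is never echoed raw.
        $safe = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        if ($i % 2 === 1) {
            // Inside a code box: show the markup literally, untouched.
            $out .= '<pre><code>' . $safe . '</code></pre>';
        } else {
            // Outside a code box: apply the normal markup replacements.
            $out .= str_replace(['[b]', '[/b]'], ['<b>', '</b>'], $safe);
        }
    }
    return $out;
}

$rendered = renderMarkup('[b]Bold[/b] [code][b]Some bold text[/b][/code]');
echo $rendered;
```

Because the text is split before any replacement runs, there is no need to remember positions and reinsert the code boxes afterwards; the pieces are simply concatenated in order.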
HTML beautifier is pretty sweet: http://pear.php.net/package/PHP_Beautifier. They have a decorator class as well that would probably suit your needs.
To be clear, your problem is in two parts. The first part is the need for a lexical analyzer to break your "code" into the keywords (tokens) of your "language." Once you have a lexical analyzer, you then need a parser. A parser is code that accepts the tokens of your language one at a time in a logical manner (usually by recursive descent).

PHP regex to get contents of a specific span element

I need some help... I'm a bit of a (read: total) n00b when it comes to regular expressions, and would like a hand writing one to find a specific piece of text contained within a specific HTML tag from PHP.
The source string looks like this:
<span lang="en">English Content</span><span lang="fr">French content</span> ... etc ...
I'd like to extract just the text of the element for a specific language.
Can anyone help?
There are plenty of HTML parsers available for PHP. I suggest you check out one of those, (for example: PHP Simple HTML DOM Parser).
Shooting yourself in the foot by trying to read HTML with regex is a lot easier than you think, and a lot harder to avoid than you wish (especially when you don't know regex thoroughly, and your input is not guaranteed to be 100% clean HTML).
Here is a (bad, not working) example which shows why you should not use regex for parsing HTML:
/<span lang="en">(.*)<\/span>/
Will output:
English Content</span><span lang="fr">French content
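By contrast, a parser-based version does not over-match. The following is a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than the Simple HTML DOM Parser mentioned above; $lang is assumed to be a trusted value such as "en", since it is concatenated straight into the XPath expression.

```php
<?php
// Source string copied from the question.
$html = '<span lang="en">English Content</span><span lang="fr">French content</span>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$lang = 'en'; // assumed to be trusted input
$texts = [];
// Select only the span elements whose lang attribute equals $lang.
foreach ($xpath->query('//span[@lang="' . $lang . '"]') as $span) {
    $texts[] = $span->textContent;
}

print_r($texts);
```

If the real content contains non-ASCII characters, the input encoding needs to be declared to loadHTML, which otherwise may not treat the string as UTF-8.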
More stuff to read:
Parsing: Beyond Regex
For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS
There's this most awesome class that lets you do SQL-like queries on HTML pages. It might be worth a look:
HTML SQL
I've used it a bunch and I love it.
Hope that helps...
