This question already has answers here:
Closed 11 years ago.
Possible Duplicates:
Xpath not behaving for me in parsing basic html
I know how to get content from a div with a static name (i.e. always the same in the whole page). However, in my case the IDs follow the pattern "post_id_xxxxx", like this:
<div id="post_id_12345">abc</div>
<div id="post_id_67890">abc</div>
<div id="post_id_31234">abc</div>
I would like to extract the "abc" string, but it seems difficult to me since every div has a different ID.
Thanks.
This is still workable with a regex, if it's really only about the overly simplistic cases in your example:
preg_match('#<div\s[^>]*id="post_id_12345"[^>]*>(.*?)</div>#', $str, $m)
But as soon as you have nested divs in the document or other complex constructs, you need to use an HTML parser. To give you a real example instead of generic links, use phpQuery or QueryPath with:
print qp($html)->find("#post_id_12345")->text();
Do not parse HTML/XML with regular expressions. HTML has a structure that an HTML-specific parser can exploit. See this classic link: RegEx match open tags except XHTML self-contained tags
You should try one of PHP's parsers, such as DOMDocument.
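As a minimal sketch of the DOMDocument route (not the only way to do it), the following selects every div whose id starts with "post_id_" using XPath's starts-with(). The sample markup is copied from the question; the variable names are just for illustration.

```php
<?php
// Sample markup copied from the question.
$html = '<div id="post_id_12345">abc</div>'
      . '<div id="post_id_67890">abc</div>'
      . '<div id="post_id_31234">abc</div>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Every <div> whose id attribute begins with "post_id_".
$nodes = $xpath->query('//div[starts-with(@id, "post_id_")]');

$posts = [];
foreach ($nodes as $node) {
    // Map each id to the text inside the div.
    $posts[$node->getAttribute('id')] = $node->textContent;
}

print_r($posts);
```

Because the selection is done on the parsed tree, it keeps working if the divs gain extra attributes or contain nested markup.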
DO NOT USE THIS
Here is a regexp that will match the example you specified. It will not work on more complicated structures (e.g. nested divs). You haven't really specified what invariants you know about the structure of your HTML, but judging from the example this should work. You can expand this regexp to cover more complex cases, but a real parser will be much more robust and easier to use.
<div id="post_id_[0-9]{5}">(.*)</div>
I've been searching for questions about finding contents in a page, and a lot of answers recommend using DOM instead of regex when parsing webpages. Why is that? Does it improve the processing time or something?
A DOM parser is actually parsing the page.
A regular expression is searching for text, not understanding the HTML's semantic meaning.
It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.
You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
So, instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML).
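To make the difference concrete, here is a small sketch comparing the two approaches on a nested div. The markup and the ids in it are made up for the illustration.

```php
<?php
// Hypothetical markup: a div that contains another div.
$html = '<div id="outer">before <div class="inner">nested</div> after</div>';

// Regex attempt: the non-greedy match stops at the FIRST closing tag,
// which belongs to the inner div, so the result is cut short.
preg_match('#<div id="outer">(.*?)</div>#s', $html, $m);
$regexResult = $m[1];

// Parser attempt: the tree knows which closing tag belongs to which div.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$outer = $xpath->query('//div[@id="outer"]')->item(0);
$domResult = $outer->textContent;

echo $regexResult, "\n"; // truncated, with a dangling <div class="inner">
echo $domResult, "\n";   // the full text of the outer div
```

Switching the regex to a greedy match does not fix this in general; it then overshoots as soon as a second outer div follows.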
I can't stand hearing that "HTML is not a regular language ..." anymore. Regular expressions (as used in today's languages) aren't regular either.
The simple answer is:
A regular expression is not a parser. It describes a pattern and it will match that pattern, but it has no idea about the document structure. You can't parse anything with one regex. Of course regexes can be part of a parser; I don't know for sure, but I assume nearly every parser uses regexes internally to find certain sub-patterns.
If you can build that pattern for the stuff you want to find inside HTML, fine, use it. But very often you will not be able to create this pattern, because it's practically impossible to cover all the corner cases, or dependencies like "find all links, but only if they are green and not pink".
In most cases it's a lot easier to use a parser that understands the structure of your document and also accepts a lot of "broken" HTML. It makes it so easy for you to access all links, or all table elements of a certain table, or ...
To my mind, it's safer to use regexp on pages where you don't control the content: the HTML might not be well formed, in which case a DOM parser can fail.
Edit:
Well, considering what I just read, you should probably use regexp only if you need very small things, like getting all links of a document, etc.
Possible Duplicate:
Best methods to parse HTML with PHP
I'm trying to parse a webpage using RegEx, and I'm having some trouble making it work in a reliable manner.
Say I wanted to parse the code that creates a div element, and I want to extract everything between <div> and </div>. Now, this code could just be <div></div>, but it could also very well be something like:
<div class="thisIsMyDivClass"><p>This text is inside the div</p></div>
How can I make sure that, no matter how many characters lie between the opening div tag and the corresponding closing div tag, I always get only the content in between them? If I specify that the number of characters following < can be anything from one to ten thousand, the match will always run on to a > up to ten thousand characters later, and thus (most likely, unless there is a lot of code or text in between) retrieve a bunch of code that I don't need.
This is my code so far (not reliable for the aforementioned reason):
/<.{1,10000}>/
Regular expressions describe so-called regular languages, or Type 3 in the Chomsky hierarchy. HTML, on the other hand, is a context-free language, which is Type 2 in the Chomsky hierarchy. So there is no way to reliably parse HTML with regular expressions in general. Use an HTML parser instead. For PHP you can find some suggestions in this question: How do you parse and process HTML/XML in PHP?
You will need a lexical analyser and grammar checker to parse HTML correctly. The main purpose of regex is searching strings for patterns.
I would suggest using something like DOM. I am building a large-scale site and using DOM like crazy on it. It works well, and with a little work it can be extremely powerful.
http://php.net/manual/en/book.dom.php
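For the question as asked (everything between a div's opening tag and its matching closing tag), a rough sketch with the DOM extension could look like this. The helper name divInnerHtml is made up for the example, and the exact whitespace of the serialized markup may differ slightly from the input.

```php
<?php
/**
 * Return the inner markup of every <div> in $html, in document order.
 * Hypothetical helper, not part of the DOM extension itself.
 */
function divInnerHtml(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $result = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        $inner = '';
        // Serialize each child node; together they form the div's content.
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $result[] = $inner;
    }
    return $result;
}

$parts = divInnerHtml(
    '<div class="thisIsMyDivClass"><p>This text is inside the div</p></div><div></div>'
);
print_r($parts);
```

The length of the opening tag and the amount of content no longer matter, because the parser, not a character count, decides where the div ends.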
I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay
Firstly, are you using the right tool for the job? Regex is a text matching engine, not a fully blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal English text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
Then modify your regex to match your 5th HTML tag (which is a bit harder, because tags can carry attributes and the / in a closing tag needs escaping if / is also your pattern delimiter)
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
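The first two steps above can be sketched with preg_replace_callback, letting PHP do the counting. The helper name replaceNthMatch and the sample sentence are made up for the illustration.

```php
<?php
/**
 * Replace only the $n-th match (counting from 1) of $pattern in $text.
 * Hypothetical helper for illustration.
 */
function replaceNthMatch(string $pattern, string $replacement, string $text, int $n): string
{
    $count = 0;
    return preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $n, $replacement) {
            $count++;
            // Swap in the replacement for the n-th hit, keep all others as-is.
            return $count === $n ? $replacement : $m[0];
        },
        $text
    );
}

$text = 'a and b and c and d and e and f';
echo replaceNthMatch('/\band\b/', 'AND', $text, 5), "\n";
// a and b and c and d and e AND f
```

The regex stays trivial and the "which occurrence" logic lives in ordinary PHP, which is much easier to debug than a pattern that tries to count.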
This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at xpath as a technology better suited to describing html nodes you might want to replace... or phpQuery and QueryPath
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context-free language, and as such can only be properly parsed with a context-free parser.
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
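For the original goal of blanking out the 5th div, the XPath suggestion could look roughly like this. It is a sketch under assumptions: replaceNthDiv is a made-up helper, divs are counted in document order (nested ones included), and the result is the whole re-serialized document, so a doctype and html/body wrappers may be added.

```php
<?php
/**
 * Replace the contents of the $n-th <div> (counting from 1, in document
 * order) with plain text. Hypothetical helper for illustration.
 */
function replaceNthDiv(string $html, int $n, string $replacement): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // (//div)[n] selects the n-th div of the whole document.
    $target = $xpath->query("(//div)[$n]")->item(0);
    if ($target === null) {
        return $html; // fewer than $n divs: leave the page untouched
    }

    // Drop everything inside the div, then insert the replacement text.
    while ($target->firstChild) {
        $target->removeChild($target->firstChild);
    }
    $target->appendChild($doc->createTextNode($replacement));

    return $doc->saveHTML();
}

$page = '<div>1</div><div>2</div><div>3</div><div>4</div><div>5</div><div>6</div>';
echo replaceNthDiv($page, 5, 'xxxxxxx');
```

Both input and output are single strings, as the question requires, and the other divs are left alone.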
ok, few ground rules.
Don't post a question like that; wrapping the whole question in a pre block will only keep people away
Regular expressions are awesome!
If you want to consider options, look at how to read HTML as an XML document and parse it using XPath
@tobyodavies is pretty much correct, I'll include the answer in case you want to do it anyway
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be OK using that expression and counting the occurrences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER );
Suppose you only need the 5th one. $matches[$i][0] is the whole text of match number $i (counting from 0) and $matches[$i][1] is its first capture group, so the 5th match sits at index 4:
if (count($matches) >= 5)
{
    $myMatch = $matches[4][0];
    $matchedText = $matches[4][1];
}
Good luck in your efforts...
I'm trying to match a string that contains HTML code that contains parameters to a function in Javascript.
There are several of these function calls in the string containing the HTML code.
changeImg('location','size');
Let's say that I want to grab the location within the single quotes. How would I go about doing this? There is more than one instance in the string.
Thanks in advance.
This is a fairly common question on SO and the answer is always the same: regular expressions are a poor tool for parsing HTML. Use an XML or HTML parser. That's what they're for. Take a look at Parse HTML With PHP And DOM for an example and Parsing Html The Cthulhu Way for a bit of background.
Parsing JavaScript is even harder, as it can appear inside <script> tags and in attributes, so at the very least you'd need to get every <script> tag and parse its contents, as well as every element and parse its event handlers (onclick, etc).
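A rough sketch of that two-stage idea: let the DOM parser find the places where JavaScript lives, then apply a narrow regex to those snippets only. The sample page is made up, only onclick handlers are collected here, and the pattern assumes both arguments are plain single-quoted literals as in the question.

```php
<?php
// Hypothetical page: the call appears in an onclick handler and in a script.
$html = '<a href="#" onclick="changeImg(\'img/a.png\',\'small\');">A</a>'
      . '<script>changeImg(\'img/b.png\',\'large\');</script>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Stage 1: collect the JavaScript snippets (onclick values, script bodies).
$snippets = [];
foreach ($xpath->query('//@onclick | //script') as $node) {
    $snippets[] = $node->nodeValue;
}

// Stage 2: pull the first argument out of every changeImg('...','...') call.
$locations = [];
foreach ($snippets as $js) {
    if (preg_match_all("/changeImg\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/", $js, $m)) {
        $locations = array_merge($locations, $m[1]);
    }
}

print_r($locations);
```

This is not a JavaScript parser: calls built from variables, double-quoted strings, or escaped quotes would be missed, which is exactly the kind of limit the answer above warns about.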
I'm reminded of this quote:
"Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems." -- Jamie Zawinski
I need some help ... I'm a bit (read total) n00b when it comes to regular expressions, and need some help writing one to find a specific piece of text contained within a specific HTML tag from PHP.
The source string looks like this:
<span lang="en">English Content</span><span lang="fr">French content</span> ... etc ...
I'd like to extract just the text of the element for a specific language.
Can anyone help?
There are plenty of HTML parsers available for PHP. I suggest you check out one of those, (for example: PHP Simple HTML DOM Parser).
Shooting yourself in the foot with trying to read HTML with regex is a lot easier than you think, and a lot harder to avoid than you wish (especially when you don't know regex thoroughly, and your input is not guaranteed to be 100% clean HTML).
A (bad, not working) example which shows why you should not use regex for parsing HTML:
/<span lang="en">(.*)<\/span>/
Will output:
English Content</span><span lang="fr">French content
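For comparison, a parser-based sketch that returns only the text for one language. It uses PHP's built-in DOMDocument rather than the library named above, and the helper name textForLang is made up for the example.

```php
<?php
/**
 * Return the text of the first <span> whose lang attribute equals $lang,
 * or null if there is none. Hypothetical helper for illustration.
 */
function textForLang(string $html, string $lang): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('span') as $span) {
        if ($span->getAttribute('lang') === $lang) {
            return $span->textContent;
        }
    }
    return null;
}

$html = '<span lang="en">English Content</span><span lang="fr">French content</span>';

echo textForLang($html, 'en'), "\n"; // English Content
echo textForLang($html, 'fr'), "\n"; // French content
```

Each span is its own node in the tree, so there is no way for the match to spill over into the next one as the greedy regex does.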
More stuff to read:
Parsing: Beyond Regex
For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS
There's this most awesome class that lets you do SQL-like queries on HTML pages. It might be worth a look:
HTML SQL
I've used it a bunch and I love it.
Hope that helps...