I'm trying to write a regular expression for matching the following HTML.
<span class="hidden_text">Some text here.</span>
I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.
$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";
If anyone could highlight what I'm doing wrong that would be great.
You need a non-greedy (lazy) quantifier: add ? after .* so the match stops at the first </span> instead of the last one:
$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";
Note: If you need to match generic HTML, you should use an XML parser like DOM.
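To make the difference concrete, here is a minimal sketch comparing the two quantifiers. The sample input with two spans on one line is made up for illustration; only the class name comes from the question.

```php
<?php
// Hypothetical sample input: two matching spans on the same line.
$html = '<span class="hidden_text">first</span> and <span class="hidden_text">second</span>';

// Single quotes and "~" delimiters, so neither quotes nor slashes need escaping.
$greedy = '~<span class="hidden_text">(.*)</span>~';
$lazy   = '~<span class="hidden_text">(.*?)</span>~';

preg_match($greedy, $html, $greedyMatch);
preg_match($lazy, $html, $lazyMatch);

// Greedy runs to the LAST </span>; lazy stops at the FIRST one.
echo $greedyMatch[1], "\n"; // first</span> and <span class="hidden_text">second
echo $lazyMatch[1], "\n";   // first
```

The greedy version is what "selects everything after the span as well" whenever a second closing tag appears later in the input.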
You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.
See the following questions for further information on how to do that with PHP:
How to parse HTML with PHP?
Best methods to parse HTML
$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
I got it. ;)
Chances are that you have multiple spans, and the regexp you're using defaults to greedy mode, so .* runs on to the last </span>.
It's a lot easier to use PHP's DOM parser to extract content from HTML.
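For comparison, a rough sketch of the DOM approach using PHP's built-in DOMDocument and DOMXPath. The sample markup and the variable names are assumptions for illustration, and the XPath test only matches spans whose class attribute is exactly hidden_text.

```php
<?php
// Hypothetical sample document; only the class name comes from the question.
$html = '<div><span class="hidden_text">Some text here.</span><span>visible</span>'
      . '<span class="hidden_text">More text.</span></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Select every span whose class attribute is exactly "hidden_text".
$xpath = new DOMXPath($doc);
$texts = [];
foreach ($xpath->query('//span[@class="hidden_text"]') as $span) {
    $texts[] = $span->textContent;
}

print_r($texts);
```

Unlike the regex, this keeps working when the span has extra attributes, different whitespace, or nested tags in its content.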
I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:
"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"
...and this one:
'~<span class="hidden_text">[^><]++</span>~'
PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and of variable expressions wrapped in braces ({$obj->prop}), and they also process backslash escape sequences. If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's no point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character is not <, you don't need a lookahead to tell you the match is going to fail.
Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.
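A quick way to check the "same results" claim is to run both patterns side by side. This is only a sketch: the two sample inputs are made up, and both patterns are written as single-quoted strings here.

```php
<?php
// Hypothetical sample inputs: one plain span, one with a nested tag.
$samples = [
    '<span class="hidden_text">Some text here.</span>',
    '<span class="hidden_text">Some <b>bold</b> text</span>',
];

// The self-answer's pattern and the simplified one discussed above.
$original   = '/<span class="hidden_text">(?<=^|>)[^><]+?(?=<|$)<\/span>/';
$simplified = '~<span class="hidden_text">[^><]++</span>~';

// preg_match() returns 1 on a match and 0 on no match.
$results = [];
foreach ($samples as $html) {
    $results[] = [preg_match($original, $html), preg_match($simplified, $html)];
}

print_r($results);
```

Both patterns accept the plain span and both reject the one with a nested tag, which is one of the ways either of them can go wrong on valid HTML.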
I'm not familiar with regular expressions. I'm trying to understand them, but it's difficult.
I've got a regular expression which will wrap any URL in an anchor tag. However, it's also wrapping URLs which are already in an anchor tag. I would like to prevent that, so I found a regular expression which does this for me.
?![^<]*</a>
However, I have no idea how I would add this to my existing regular expression. This is my current regular expression:
preg_replace('!(((ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>', $text);
So, how can I skip an URL that is already wrapped in an anchor tag?
I'm gonna join the choir and say: Don't use regex for this - use a html parser.
That said, what you found isn't a complete regex in itself. It's part of a negative lookahead that roughly checks you aren't inside an anchor. (It should really be (?![^<]*</a>).) It asserts that the text following the match, up to the next <, isn't immediately followed by </a>.
Appending this to the end of your original RE will sometimes do the trick. I won't spend time thinking of situations where it'll fail, but it probably will.
Along with some simplifications your regex should look like this:
(https?:\/\/[-\wа-яА-Я()#:%+.~#?&;\/=]+)(?![^<]*<\/a>)
This probably will work for you mostly, but probably will fail at times as well.
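Here is a minimal sketch of how that could be wired into preg_replace(). It uses a simplified variant of the pattern above (ASCII-only character class, ~ as the delimiter), and the sample text and replacement string are assumptions for illustration.

```php
<?php
// Simplified variant of the pattern above: ASCII-only character class and
// "~" as the delimiter so slashes need no escaping (assumptions for this sketch).
$pattern = '~(https?://[-\w()#:%+.?&;/=]+)(?![^<]*</a>)~i';

// Hypothetical input: one bare URL and one URL that is already linked.
$text = 'Visit http://example.com and <a href="http://foo.org">http://foo.org</a>';

// Wrap only the URLs that are not followed by "</a>" before the next "<".
$linked = preg_replace($pattern, '<a href="$1">$1</a>', $text);

echo $linked, "\n";
```

One case where this breaks: if the link text is wrapped in another tag, as in <a href="..."><b>http://foo.org</b></a>, the next < after the URL starts </b> rather than </a>, so the lookahead no longer protects it.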
Regards
I am making a preg_replace on html page. My pattern is aimed to add surrounding tag to some words in html. However, sometimes my regular expression modifies html tags. For example, when I try to replace this text:
yasar
So that yasar reads <span class="selected-word">yasar</span> , my regular expression also replaces yasar in alt attribute of anchor tag. Current preg_replace() I am using looks like this:
preg_replace("/(asf|gfd|oyws)/", '<span class=something>${1}</span>',$target);
How can I make a regular expression, so that it doesn't match anything inside a html tag?
You can use an assertion for that, as you just have to ensure that the searched words occur somewhere after a >, or before any <. The latter test is easier to accomplish because lookahead assertions can be variable length:
/(asf|foo|barr)(?=[^>]*(<|$))/
See also http://www.regular-expressions.info/lookaround.html for a nice explanation of that assertion syntax.
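As a rough illustration of that lookahead in a preg_replace() call, here is a sketch. The input string and the replacement markup are assumptions; the pattern is the one shown above.

```php
<?php
// Hypothetical input: "foo" appears inside an attribute and in the text.
$target = '<a href="#" title="foo">foo</a> asf';

// A word only matches if the next angle bracket after it is "<"
// (or there is no ">" before the end of the string).
$pattern = '/(asf|foo|barr)(?=[^>]*(<|$))/';

$result = preg_replace($pattern, '<span class="selected-word">$1</span>', $target);

echo $result, "\n";
```

The foo inside the title attribute is left alone because the first angle bracket after it is the > that closes the tag.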
Yasar, resurrecting this question because it had another solution that wasn't mentioned.
Instead of just checking that the next tag character is an opening tag, this solution skips all <full tags>.
With all the disclaimers about using regex to parse html, here is the regex:
<[^>]*>(*SKIP)(*F)|word1|word2|word3
Here is a demo. In code, it looks like this:
$target = "word1 <a skip this word2 >word2 again</a> word3";
$regex = "~<[^>]*>(*SKIP)(*F)|word1|word2|word3~";
$repl= '<span class="">\0</span>';
$new=preg_replace($regex,$repl,$target);
echo htmlentities($new);
Here is an online demo of this code.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
This might be the kind of thing that you're after: http://snipplr.com/view/3618/
In general, I'd advise against such. A better alternative is to strip out all HTML tags and instead rely on BBcode, such as:
[b]bold text[/b] [i]italic text[/i]
However I appreciate that this might not work well with what you're trying to do.
Another option may be HTML Purifier, see: http://htmlpurifier.org/
Off the top of my head, this should work:
echo preg_replace("/<(.*)>(.*)<\/(.*)>/i","<$1><span class=\"some-class\">$2</span></$3>",$target);
But, I don't know how safe this would be. I am just presenting a possibility :)
I have the following problem.
Let's take the input (wikitext)
======hello((my first program)) world======
I want to match "hello", "my first program" and " world" (notice the space).
But for the input:
======hello(my first program)) world======
I want to match "hello(my first program" and " world".
In other words, I want to match any letters, spaces and additionally any single symbols (no double or more).
This should be done with the unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.
Any ideas?
Addendum 1
The regex has just to stop before any double symbol or punctuation in unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.
I'm not trying to parse the whole wikitext with this, read my question carefully. The regex I'm looking for IS for the lexer I'm working on, and making it match such inputs will simplify my parser incredibly.
Addendum 2
The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.
Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that's the natural way of how lexers work: they match sequences in the input stream.
return preg_split('/([\pS\pP])\\1+/', $theString);
Result: http://www.ideone.com/YcbIf
(You need to get rid of the empty strings manually.)
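If removing the empty strings by hand is a nuisance, preg_split() can drop them itself via the PREG_SPLIT_NO_EMPTY flag. A small sketch follows; the /u modifier is my addition and assumes the input is valid UTF-8.

```php
<?php
// Same pattern as above, plus /u (an assumption: input is valid UTF-8).
// It matches any symbol or punctuation character repeated two or more times.
$pattern = '/([\pS\pP])\1+/u';

// The two inputs from the question.
$first  = '======hello((my first program)) world======';
$second = '======hello(my first program)) world======';

// PREG_SPLIT_NO_EMPTY discards the empty leading and trailing pieces.
$a = preg_split($pattern, $first, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($pattern, $second, -1, PREG_SPLIT_NO_EMPTY);

print_r($a);
print_r($b);
```

Note that \1+ only treats a repeat of the same character as a delimiter, so a mixed pair such as "(=" would not split the string.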
Edit: as a preg_match regex:
'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'
take the 2nd capture group when it is matched. Example: http://www.ideone.com/ErTVA
But couldn't you just consume ([\pS\pP])\\1+ and discard it, or, if that doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record it, since your lexer is going to use more than one regex anyway?
Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.
You should seriously consider writing a proper lexer and parser instead.
I'm trying to use preg_replace to get some data from a remote page, but I'm having a bit of an issue when it comes to sorting out the pattern.
function getData($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<span class=\"SectionHeader\"\>title\</span>/<br/>/\<div class=\"header2\"\>(.*)\</div\></span\>/",$str,$title);
return $title[1];
}
}
Here's the HTML as is before I ended up throwing a million slashes at it (looks like I forgot a part or two):
<span class="cell CellFullWidth"><span class="SectionHeader">mytitle</span><br/><div class="Center">Event Name</div></span>
Where Event Name is the data I want to return in my function.
Thanks a lot guys, this is a pain in the ass.
While I am inclined to agree with the commenters that this is not a pretty solution, here's my untested revision of your statement:
preg_match('#\<span class="SectionHeader"\>title\</span\>/\<br/\>/\<div class="header2"\>(.*)\</div\>\</span\>#',$str,$title);
I changed the double-quoted string to single-quoted as you aren't using any of the variable-substitution features of double-quoted strings and this avoids having to backslash-escape double-quotes as well as avoiding any ambiguity about backslashes (which perhaps should have been doubled to produce the proper strings--see the php manual on strings). I changed the slash / delimiters to hash # because of the number of slashes appearing in the match pattern (some of which were not backslash-escaped in your version).
There are quite a few things wrong with your expression:
You're using / as the delimiter, but then use / unescaped in various places.
You're escaping < and > seemingly at random. They shouldn't be escaped at all.
You have some rogue /s around the <br/> for some reason.
The class name for the div is specified as header2 in the regex but Center in the sample HTML.
The title is mytitle in the HTML and title in the regex.
With all of these corrected, you get:
preg_match('(<span class="SectionHeader">mytitle</span><br/><div class="Center">(.*)</div></span>)',$data,$t);
If you want to match any title instead of the specific title mytitle, just replace that with .*?.
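Putting those corrections together, a minimal sketch run against the sample markup from the question. The # delimiter, the lazy quantifiers, and the $eventName variable are my choices here, not part of the answer above.

```php
<?php
// Sample markup copied from the question.
$str = '<span class="cell CellFullWidth"><span class="SectionHeader">mytitle</span>'
     . '<br/><div class="Center">Event Name</div></span>';

// "#" as delimiter avoids escaping "/"; the first ".*?" accepts any section title.
$pattern = '#<span class="SectionHeader">.*?</span><br/><div class="Center">(.*?)</div></span>#';

// preg_match() returns 1 on success; fall back to null when nothing matches.
$eventName = preg_match($pattern, $str, $title) === 1 ? $title[1] : null;

echo $eventName, "\n"; // Event Name
```

Checking the return value before reading $title[1] avoids an undefined-index notice when the remote page changes its markup.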
I need to retrieve content of <p> tag with given class. Class could be simplecomment or comment ...
So I wrote the following code
preg_match("|(<p class=\"(simple)?comment(.*)?\">)(.*)<\/p>|ism", $fcon, $desc);
Unfortunately, it returns nothing. However, if I remove the tag-ending part (<\/p>) it works somehow, returning a string which is too long (from the tag start to the end of the document) ...
What is wrong with my regular expression?
Try using a dom parser like http://simplehtmldom.sourceforge.net/
If I read the example code on simplehtmldom's homepage correctly
you could do something like this:
$desc = $html->find('p.simplecomment', 0)->innertext;
The quick fix here is the following:
'~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is'
Changes:
The construct (.*) will just blindly match everything, which stops your regular expression from working, so I've replaced those instances completely with more strict matches:
...comment(.*)?... – this will match all or nothing, basically. I replaced this with [^"]* since that will match zero or more non-" characters (basically, it will match up to the closing " character of the class attribute).
...>)(.*)<\/p>... – again, this will match too much. I've replaced it with an efficient pattern that will match all non-< characters, and once it hits a < it will check if it is followed by </p>. If it is, it will stop matching (since we're at the end of the <p> tag), otherwise it will continue.
I removed the m flag since it has no use in this regular expression.
But it won't be reliable (imagine <p class="comment">...<p>...</p></p>; it will match <p class="comment">...<p>...</p>).
To make it reliable, you'll need to use recursive regular expressions or (even better) an HTML parser (or XML if it's XHTML you're dealing with.) There are even libraries out there that can handle malformed HTML "properly" (like browsers do.)
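As a rough sketch of the parser route, PHP's built-in DOMDocument and DOMXPath can pull out the paragraphs directly. The sample markup is made up, and the starts-with() tests are my approximation of the class="(simple)?comment..." condition in the regex.

```php
<?php
// Hypothetical sample markup; $fcon is the variable name used in the question.
$fcon = '<div><p class="simplecomment">First <b>comment</b></p>'
      . '<p class="other">Skip me</p>'
      . '<p class="comment highlighted">Second comment</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($fcon);
libxml_clear_errors();

// Mirror the regex: class attribute starting with "simplecomment" or "comment".
$xpath = new DOMXPath($doc);
$query = '//p[starts-with(@class, "simplecomment") or starts-with(@class, "comment")]';

$comments = [];
foreach ($xpath->query($query) as $p) {
    $comments[] = $p->textContent; // text only, nested tags stripped
}

print_r($comments);
```

Nested inline tags such as <b> are handled by the parser, so there is no need to guess where the paragraph ends.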