matching html tag content in regex - php

I know the rule is "don't use regex for HTML", but seriously, loading an entire HTML parser isn't always an option.
So, here is the scenario:
<script...>
some stuff
</script>
<script...>
var stuff = '<';
anchortext
</script>
If you do this:
<script[^>]*?>.*?anchor.*?</script>
You will capture from the first script tag to the /script in the second block. Is there a way to do a .*? but by replacing the . with a match block, something like:
<script[^>]*?>(^</script>)*?anchor.*?</script>
I looked at negative lookaheads etc, but I can't get something to work properly. Usually I just use [^>]*? to avoid running past the closing block, but in this particular example, the script content has a "<" in it, and it stops matching on that before reaching the anchortext.
To simplify, I need something like [^z]*? but instead of a single character or character range, I need a capture group to fit a string.
.*?(?!z) doesn't have the same effect as [^z]*? as I assumed it would.
Here is where I am stuck: http://regexr.com?34llp

Match-anything-but is indeed commonly implemented with a negative lookahead:
((?!exclude).)*?
The trick is not to repeat the bare . dot. Instead, repeat a group that matches one character at a time, after checking that the character is not the beginning of the excluded word.
In your case you would want this in place of the initial .*? (if the script body spans several lines, also set the s modifier so that . matches newlines):
<script[^>]*?>((?!</script>).)*?anchor.*?</script>
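To see the difference on the question's scenario, here is a minimal PHP sketch. The sample input and variable names are my own assumptions, and the s modifier is added so that . can cross the line breaks inside the script blocks. It counts how many opening script tags each match swallowed.

```php
<?php
// Sample input modeled on the question: two script blocks, the second
// containing a literal "<" before the text we are looking for.
$html = <<<'HTML'
<script type="text/javascript">
some stuff
</script>
<script type="text/javascript">
var stuff = '<';
anchortext
</script>
HTML;

// Naive version: .*? may cross the first </script> to reach "anchor".
$naive = '~<script[^>]*>.*?anchor.*?</script>~s';

// Tempered version: each character is consumed only if it does not
// start "</script>", so the match cannot leave the current block.
$tempered = '~<script[^>]*>(?:(?!</script>).)*?anchor.*?</script>~s';

preg_match($naive, $html, $naiveMatch);
preg_match($tempered, $html, $temperedMatch);

// Count how many opening tags ended up inside each match.
$naiveCount = substr_count($naiveMatch[0], '<script');
$temperedCount = substr_count($temperedMatch[0], '<script');

echo "naive: $naiveCount block(s), tempered: $temperedCount block(s)\n";
```

The naive pattern spans both blocks, while the tempered one matches only the block that actually contains the anchor text.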

Like this:
$pattern = '~<script[^>]*+>((?:[^<]+?|<++(?!/script>))*?\banchor(?:[^<]+?|<++(?!/script>))*+)</script>~';
But using the DOM is by far the better way to do that.

Related

Regex Including the next occurence of word

The regex works, but the problem is that it also includes the next occurrence instead of ending at the first occurrence and then starting again from the
Regex : (?=<appView)\s{0,1}(.*)(?<=<\/appView>)
String: <appView></appView> <appView></appView>
But my problem is that it matches the whole string, like:
(Match 1)<appView></appView> <appView></appView>
I want it to match each group separately, but I can't make it work.
Desired output : (Match 1) <appView></appView> (Match 2)<appView></appView>

\s{0,1} is equivalent to \s?. You need to use the lazy (.*?) instead of the greedy (.*).
Use this pattern: ~(?=<appView)\s?(.*?)(?<=</appView>)~
Note: you don't have to escape the / in the closing tag if you use something other than a slash as your pattern delimiter. I am using ~ at the beginning and end of my pattern to avoid escaping.
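A quick way to check the greedy/lazy difference is to run both patterns through preg_match_all on the string from the question. This is only a sketch; the variable names are mine.

```php
<?php
$subject = '<appView></appView> <appView></appView>';

// Greedy (.*) runs to the last </appView>; lazy (.*?) stops at the first.
$greedy = '~(?=<appView)\s?(.*)(?<=</appView>)~';
$lazy   = '~(?=<appView)\s?(.*?)(?<=</appView>)~';

preg_match_all($greedy, $subject, $greedyMatches);
preg_match_all($lazy, $subject, $lazyMatches);

print_r($greedyMatches[0]); // one match covering the whole string
print_r($lazyMatches[0]);   // two matches, one per <appView></appView> pair
```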

I fully recommend switching from regex to an actual sequential XML parser. Regex is awful for parsing XML-based files, for example because of the problems below.
That said, you can "fix" your regex by using ([^<>]*). This matches any run of characters other than < and >, which makes sure that no other tags are nested inside. If done with all tags, you cannot match something like <appview><unclosedTag></appView>, because it is invalid. If you can be certain that the structure is correct, this is slightly less of an issue.
Another problem with your approach is nested tags. Given <appView> something <appView> something else </appView> else </appView>, you will end up with [replaced] else </appView>.

I need to quickly remove a set of classes from an arbitrary string of html

The HTML is run through a purifier first (tinyMCE+Wordpress), so it should match somewhat standard forms. All script and style tags are stripped, and all data inside tags is HTML-encoded, so there are no extraneous symbols to worry about.
I know the general stance on parsing html with regular expressions is "don't", but in this specific example, the problem seems less like parsing, and more like simple string processing... am I missing some unseen level of complexity?
As far as I can break it down, it seems like the pattern in question can be broken down into logical components:
/<[a-zA-Z][^>]+ - matches the start of any html tag and any mix of tags and attributes within, but not the end bracket
(?i:class)=\" - the start of a class attribute, case-insensitive
(?: - start a non-capturing sub-pattern
(?: *[a-zA-Z_][\w-]* +)* - any number of class names (or none), but if they exist, there must be whitespace before the capture
( *".implode('|', $classes)." *) - the set of classes to capture (each name should be passed through preg_quote first)
(?: +[a-zA-Z_][\w-]* *)* - any number of class names (or none), but if they exist, there must be whitespace after the capture
)+ - close the non-capturing subpattern and loop it in case multiple matching classes are in one attribute
\"(?: [^>]*)>/ - the end of the class attribute, and everything to the end of the html tag
making the final regex:
$pattern = "/<[a-zA-Z][^>]+ (?i:class)=\"(?:(?: *[a-zA-Z_][\w-]* +)*( *".implode('|', $classes)." *)(?: +[a-zA-Z_][\w-]* *)*)+\"(?: [^>]*)>/";
I haven't tried running this yet, because I know if it works, I'll be heavily tempted to use it, but running this through a preg_replace seems like it should do the job, except for one minor issue. I believe it will leave extraneous whitespace around the capture area. This isn't a significant issue, but it might be nice to avoid, if anyone knows how.
It should also be noted that this is not a mission-critical process, and if my capture occasionally fails to remove the classes, no one dies.
so, in essence... can someone explain what makes this a bad idea in this case?

OK, what is the list of class names you want to remove?
Can you give an example of the typical HTML, and what you want to change it to?
Example:
Before
<div class="someClass">
<i class="dontchange doChange"></i>
<a class="hello john"></a>
</div>
Change to
<div>
<i class="dontchange"></i>
<a></a>
</div>

This will remove every class attribute from the HTML (JavaScript):
myHtml.replace(/class\=\"[^\"]*\"/g,'');
Is this what you are looking for? Or something more specific?
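If the goal is to remove only a given set of class names, as in the question, one possible PHP approach is to match each class attribute and filter its names in a callback. This is a rough sketch, not a hardened solution: removeClasses is a hypothetical helper name, and it assumes double-quoted class attributes on purified, entity-encoded HTML as described in the question.

```php
<?php
/**
 * Remove the given class names from every class="..." attribute.
 * Hypothetical helper; assumes double-quoted attributes and at most
 * one class attribute per tag.
 */
function removeClasses(string $html, array $classes): string
{
    return preg_replace_callback(
        '~(<[a-zA-Z][^>]*?)\s+class="([^"]*)"~i',
        function (array $m) use ($classes): string {
            // Split on whitespace, drop unwanted names, rejoin with single spaces.
            $names = preg_split('~\s+~', $m[2], -1, PREG_SPLIT_NO_EMPTY);
            $kept = array_diff($names, $classes);
            // Drop the attribute entirely when no class is left.
            return $kept ? $m[1] . ' class="' . implode(' ', $kept) . '"' : $m[1];
        },
        $html
    );
}

$before = '<div class="someClass"><i class="dontchange doChange"></i><a class="hello john"></a></div>';
echo removeClasses($before, ['someClass', 'doChange', 'hello', 'john']), "\n";
// prints: <div><i class="dontchange"></i><a></a></div>
```

Doing the filtering in the callback also sidesteps the leftover-whitespace problem mentioned in the question, because the remaining names are rejoined with single spaces.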

Matching sets of tags in PHP with Regular Expression

I am currently working on protecting my AJAX Chat against exploits by checking all text in PHP before it is passed to the client. So far I have been successful with my mission except for one part where I require to match sets of image tags.
Overall I wish to have it pick up any instance of a newline character between a set of tags, which I have sort of managed, but the solution I have is greedy and also matches newline characters outside of tags if there are multiple sets of tags.
At the moment I have the following, which works if I wanted to match just [img]{newline}[/img]
if (preg_match('/\[\bimg\].*\x0A.*\[\/\bimg\]/', $text)) { /* code */ }
But if I wanted to do [img]image.jpg[/img]{newline}[img]image.jpg[/img], it only sees the very first and end tags which I do not want.
So now I ask, how do you make it match each set of tags properly?
Edit: For clarification. Any newline characters inside tags are bad, so I want to detect them. Any newline characters outside tags are good and I want to ignore them. The reason being, if the client processes a newline character inside of a tag, it crashes.

Just make it ungreedy by putting ? after the two .*
But note that your current solution will not match this:
[img]
look, two newlines!
[/img]
I'm not sure why you want to do this, but you can make . match newlines by adding the s modifier to your regex. Then it's just "(\[img\](.*?)\[/img\])is" to match it, and you can even capture that group and individually check it for newlines if you want.
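The capture-and-check idea could look roughly like this in PHP. hasNewlineInsideImg is a hypothetical helper name, and ~ is used as the delimiter so that the slash in [/img] needs no escaping.

```php
<?php
/**
 * Return true if any [img]...[/img] pair contains a newline.
 * Hypothetical helper name; newlines between pairs are ignored.
 */
function hasNewlineInsideImg(string $text): bool
{
    // s: let . match newlines; i: accept [IMG] as well. The lazy .*?
    // keeps each match inside a single pair of tags.
    preg_match_all('~\[img\](.*?)\[/img\]~is', $text, $matches);
    foreach ($matches[1] as $inner) {
        if (strpos($inner, "\n") !== false) {
            return true;
        }
    }
    return false;
}

var_dump(hasNewlineInsideImg("[img]a.jpg[/img]\n[img]b.jpg[/img]")); // bool(false)
var_dump(hasNewlineInsideImg("[img]\nlook, two newlines!\n[/img]"));  // bool(true)
```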

Try setting the s modifier, like this:
if (preg_match('/\[\bimg\].*\x0A.*\[\/\bimg\]/s', $text)) { code }
See also the PHP Documentation for Regex modifiers

Regular expression to match a certain HTML element

I'm trying to write a regular expression for matching the following HTML.
<span class="hidden_text">Some text here.</span>
I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.
$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";
If anyone could highlight what I'm doing wrong that would be great.

You need to use a non-greedy selection by adding ? after .* :
$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";
Note: If you need to match generic HTML, you should use an XML parser like DOM.

You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.
See the following questions for further information on how to do that with PHP:
How to parse HTML with PHP?
Best methods to parse HTML

$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
I got it. ;)

Chances are that you have multiple spans, and the regexp you're using defaults to greedy mode.
It's a lot easier to use PHP's DOM parser to extract content from HTML.

I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:
"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"
...and this one:
'~<span class="hidden_text">[^><]++</span>~'
PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and of variable expressions wrapped in braces ({$my_obj->prop}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's no point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character is not <, you don't need a lookahead to tell you the match is going to fail.
Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.
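As a rough check of the "same results" claim, the two patterns can be run side by side. In this sketch the sample HTML is made up, and I added a capture group to each pattern so that the extracted text can be compared. Both are written as single-quoted strings.

```php
<?php
$html = '<p><span class="hidden_text">Some text here.</span> and '
      . '<span class="hidden_text">More.</span></p>';

// The self-answer's pattern and the trimmed-down possessive one, each
// with a capture group added (my addition) around the text part.
$original   = '/<span class="hidden_text">(?<=^|>)([^><]+?)(?=<|$)<\/span>/';
$simplified = '~<span class="hidden_text">([^><]++)</span>~';

preg_match_all($original, $html, $a);
preg_match_all($simplified, $html, $b);

var_dump($a[1] === $b[1]); // both patterns capture the same texts here
print_r($b[1]);
```

Neither pattern will match a span whose content includes another tag, which is one of the ways both can go wrong.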

RegEx problem - retrieve content of tag with given class - preg_match(_all)

I need to retrieve content of <p> tag with given class. Class could be simplecomment or comment ...
So I wrote the following code
preg_match("|(<p class=\"(simple)?comment(.*)?\">)(.*)<\/p>|ism", $fcon, $desc);
Unfortunately, it returns nothing. However, if I remove the tag-ending part (<\/p>) it works somehow, returning a string that is too long (from the tag start to the end of the document) ...
What is wrong with my regular expression?

Try using a dom parser like http://simplehtmldom.sourceforge.net/
If I read the example code on simplehtmldom's homepage correctly
you could do something like this:
$html->find('div.simplecomment', 0)->innertext = '';

The quick fix here is the following:
'~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is'
Changes:
The construct (.*) will just blindly match everything, which stops your regular expression from working, so I've replaced those instances completely with more strict matches:
...comment(.*)?... - this will match all or nothing, basically. I replaced this with [^"]* since that will match zero or more non-" characters (basically, it will match up to the closing " character of the class attribute).
...>)(.*)<\/p>... - again, this will match too much. I've replaced it with an efficient pattern that will match all non-< characters, and once it hits a < it will check if it is followed by </p>. If it is, it will stop matching (since we're at the end of the <p> tag), otherwise it will continue.
I switched the delimiter from | to ~, because the pattern now uses | for alternation and an unescaped delimiter character inside the pattern would end it early.
I removed the m flag since it has no use in this regular expression.
But it won't be reliable (imagine <p class="comment">...<p>...</p></p>; it will match <p class="comment">...<p>...</p>).
To make it reliable, you'll need to use recursive regular expressions or (even better) an HTML parser (or XML if it's XHTML you're dealing with.) There are even libraries out there that can handle malformed HTML "properly" (like browsers do.)
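For the parser route, a minimal sketch with PHP's bundled DOM extension might look like this. It assumes ext-dom is available, the sample HTML is made up, and the XPath expression is one common way to test for a class token, not the only one.

```php
<?php
$html = '<div><p class="simplecomment">first <b>bold</b></p>'
      . '<p class="other">skip</p><p class="comment extra">second</p></div>';

$doc = new DOMDocument();
// Collect libxml warnings about imperfect markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Pad the class attribute with spaces so that whole class names are
// compared: "comment" must not match "commentary".
$query = '//p[contains(concat(" ", normalize-space(@class), " "), " comment ")'
       . ' or contains(concat(" ", normalize-space(@class), " "), " simplecomment ")]';

$contents = [];
foreach ($xpath->query($query) as $p) {
    $contents[] = $p->textContent; // text only; inner tags are dropped
}
print_r($contents);
```

Unlike the regex, this keeps working when the paragraph has extra classes or contains other tags, since the parser deals with the nesting.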
