Regex: matching all improper tag attributes

For an example input of:
<a href="abc" something=b foo="bar" baz=cool>
I am trying to match:
something=b
baz=cool
However, everything I end up with will only match the first one (something=b), even when using preg_match_all. The regular expression I am using is:
<\w+.*?(\w+=[^"|^'|^>]).*?>
Or:
<\w+ # Word starting with <
.*? # Anything that comes in front of the matching attribute.
(
\w+ # The attribute
=
[^"|^'|^>]+? # Keep going until we find a ", ' or >
)
.*? # Anything that comes after the matching attribute.
> # Closing >
I'm probably doing something horribly wrong, pretty new to regular expressions. Please advise! :)
edit:
Revised regular expression:
<\w+.*?\w+=([^"\'\s>]+).*?>
I want it to match zzz=aaa there too ;)

Use a library like Tidy or HTMLPurifier to fix broken HTML for you.

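For starters, a caret "^" at the start of a character class negates the entire class, so it only needs to appear once. A character class already means "any one of these characters", so the pipes are not alternation; inside the brackets they are literal characters. Your class can be shortened to [^'">]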
Now as to why you're only getting the "something=b" tag, I believe you're missing a + after your character class.
So your regexp with my modifications would be:
<\w+.*?(\w+=[^"'>]+?) .*?>
Note the space after the end of the group

<\w+
(?:
\s+
(?:
\w+="[^"]*"
|(\w+=[^\s>]+)
)
)+
\s*/?>
You may try this with # delimiter and x modifier. I have formatted it so it is more readable.

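In your regex <\w+.*?(\w+=[^"|^'|^>]).*?>, the \w+=[^"|^'|^>] part doesn't do what you think it does: you are mixing up character classes and alternation with the pipe character. Inside [...], the | and the repeated ^ are literal characters, so the class also excludes | and ^.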
Writing a regex that will catch all malformed attributes inside a given XMLish tag is tricky if the attribute value can have > or = characters.
For example:
<a href="asd" title=This page proves that e=MC^2>
Your regex tries to extract all attributes from the whole string in one step - it looks for <tag and then an unquoted attribute somewhere later. This way you'll match only one attribute, the first one.
You can extract the contents of the opening and closing angle brackets in one step and then look for attributes within that. The regex <\w+\s+([^>]+?)\s*> will give you the substring of attributes. Search within that string for unquoted attributes. If the attributes are simple (as in they don't contain spaces), you can use a simple
\w+=(?=[^"'])[^ ]+
If they can contain spaces too, you're gonna need some more lookahead:
\w+=(?=[^"']).+?(?=\w+=|$)
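The two-step approach described above might look like this in PHP. This is only a minimal sketch: the function name findUnquotedAttributes is hypothetical, it uses the simple no-spaces pattern, and it will also pick up name=value text that happens to appear inside a quoted value.

```php
<?php
// Hypothetical helper: returns unquoted name=value pairs found in opening tags.
function findUnquotedAttributes(string $html): array
{
    $found = [];
    // Step 1: capture everything between the tag name and the closing ">".
    if (preg_match_all('/<\w+\s+([^>]+?)\s*>/', $html, $tags)) {
        foreach ($tags[1] as $attributeString) {
            // Step 2: an attribute name whose value does not start with a quote.
            if (preg_match_all('/\w+=(?=[^"\'])[^ ]+/', $attributeString, $m)) {
                $found = array_merge($found, $m[0]);
            }
        }
    }
    return $found;
}

print_r(findUnquotedAttributes('<a href="abc" something=b foo="bar" baz=cool>'));
```

For the sample tag from the question this should list something=b and baz=cool.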

If you know you don't have any = sign outside your tags, you can use this regex:
(?<=\=)([^"\'\s>]+)(?=[\s>])
In this example it matches the values of all improper (unquoted) attributes.
Edit:
(?<=\=)([^"\'\s/>]+)(?=[\s/?>])
this matches class2 in <div class=class2/> also.
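As a rough check of the edited pattern, here is a small sketch run against two assumed sample tags. Note that the capture holds only the value after the = sign, not the attribute name.

```php
<?php
// The edited pattern from the answer, written with # delimiters
// so the slash inside the character classes needs no escaping.
$pattern = '#(?<=\=)([^"\'\s/>]+)(?=[\s/?>])#';

// Assumed sample inputs for illustration.
preg_match_all($pattern, '<a href="abc" something=b foo="bar" baz=cool>', $first);
preg_match_all($pattern, '<div class=class2/>', $second);

print_r($first[1]);  // unquoted values from the first tag
print_r($second[1]); // unquoted value from the self-closing tag
```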

Related

php preg_match() not working, find a var

header.php file
<?php
echo 'this is example '.$adv['name'].' , this is another.....';
main.php file
<?php
if(preg_match('/$adv[(.*?)]/',file_get_contents('header.php'),$itext)){
echo $itext[1].'';
}
The output is empty.

This regular expression will work:
/\$adv\[(.*?)\]/
You need to escape the symbols $, [ and ] with a backslash, since they have special meaning in regular expressions: unescaped, $ is an end-of-line anchor and [...] is a character class.
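A small sketch of the escaped pattern in use. The $source string stands in for file_get_contents('header.php') and is an assumption made for the example.

```php
<?php
// Stand-in for the contents of header.php; a nowdoc keeps $adv literal.
$source = <<<'CODE'
echo 'this is example '.$adv['name'].' , this is another.....';
CODE;

if (preg_match('/\$adv\[(.*?)\]/', $source, $itext)) {
    // Group 1 is everything between the brackets, quotes included.
    echo $itext[1], PHP_EOL;
}
```

Because the source file writes $adv['name'], the captured text includes the single quotes around name.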

Here is a more efficient solution:
Pattern (Demo):
/\$adv\[\K[^]]*?(?=\])/
PHP Implementation:
if(preg_match('/\$adv\[\K[^]]*?(?=\])/','this is example $adv[name]',$itext)){
echo $itext[0];
}
Output for $itext:
name
Notice that by using \K instead of a capture group, the targeted string is returned as the full match ($itext[0]), so the output array holds one element instead of two.
You can see the demo link for explanation on individual pieces of the pattern, but basically it matches $adv[ then resets the matching point, matches all characters between the square brackets, then does a positive lookahead for a closing square bracket which will not be included in the returned match.
If you want to match different variable names, or adv is not a determining component of your input strings, you could use /\$[^[]*?\[\K[^]]*?(?=\])/. This will accommodate adv or any other substring that follows the dollar sign. By using negated character classes like [^]] instead of . to match unlimited characters, the regex pattern performs more efficiently.

PHP Regex - Get text between <P> tags with multiple lines [duplicate]

I have a string that contains normal characters, whitespace and newline characters between <div> and </div>.
This regular expression doesn't work: /<div>(.*)<\/div>/. It is because .* doesn't match newline characters. How can I do this?

You need to use the DOTALL modifier (/s).
'/<div>(.*)<\/div>/s'
This might not give you exactly what you want because you are greedy matching. You might instead try a non-greedy match:
'/<div>(.*?)<\/div>/s'
You could also solve this by matching everything except '<' if there aren't other tags:
'/<div>([^<]*)<\/div>/'
Another observation is that you don't need to use / as your regular expression delimiters. Using another character means that you don't have to escape the / in </div>, improving readability. This applies to all the above regular expressions. Here's how it would look if you use '#' instead of '/':
'#<div>([^<]*)</div>#'
However all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with Regex, so you should consider using an HTML parser instead.
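To make the greedy versus non-greedy difference concrete, here is a short sketch on an assumed two-div input. Both patterns use the s modifier so the dot also matches newlines.

```php
<?php
// Assumed sample input with a newline inside the first div.
$html = "<div>first\nblock</div>\n<div>second</div>";

preg_match('#<div>(.*)</div>#s', $html, $greedy);
preg_match('#<div>(.*?)</div>#s', $html, $lazy);

var_dump($greedy[1]); // runs through to the last </div>
var_dump($lazy[1]);   // stops at the first </div>
```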

To match all characters, you can use this trick:
%\<div\>([\s\S]*)\</div\>%

You can also use the (?s) mode modifier inside the pattern. For example,
/(?s)<div>(.*?)<\/div>/

There shouldn't be any problem with just doing:
(.|\n)
This matches either any character except newline or a newline, so every character. It solved it for me, at least.

An option would be:
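'/<div>((?:\n|.)*)<\/div>/i'
Here each character is matched as either a newline or whatever the dot matches. Writing the group as (\n*|.*) would not work, because it accepts only newlines or only non-newline characters, never a mix.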

There is usually a flag in the regular expression compiler to tell it that dot should match newline characters. In PHP's preg functions, that flag is the s (PCRE_DOTALL) modifier.

Replacing all matches except if surrounded by or only if surrounded by

Given a text string (a markdown document) I need to achieve one of these two options:
to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax ![Blah theWord blah](url).
to replace all the matches of a particular expression ({{([^}}]+)}}\[\[[^\]\]]+\]\]) ONLY inside the markdown images, ie.: ![Blah {{theWord}}[[1234]] blah](url).
Both expressions are currently matching everything, whether inside the markdown image syntax or not, and I've already tried everything I could think of.
Here is an example of the first option
And here is an example of the second option
Any help and/or clue will be highly appreciated.
Thanks in advance!

I modified the first expression a little, since I thought it had some unnecessary capturing groups, and then made both expressions work by adding a lookahead:
-First one (Live demo):
\b(vitae)\b(?![^[]*]\s*\()
-Second one (Live demo):
{{([^}}]+)}}\[\[[^\]\]]+\]\](?=[^[]*]\s*\()
Lookahead part explanations:
(?! # Starting a negative lookahead
[^[]*] # Everything that's between brackets
\s* # Any whitespace
\( # Check if it's followed by an opening parentheses
) # End of lookahead which confirms the whole expression doesn't match between brackets
(?= means a positive lookahead
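A quick sketch of the negative lookahead version with preg_replace. The sample text and the XXX replacement are assumptions, and the brackets are escaped here for readability, which should be equivalent to the pattern above.

```php
<?php
// Assumed sample text: one occurrence inside the image alt text, two outside.
$text = 'Lorem vitae ipsum ![Blah vitae blah](http://example.com/a.png) vitae end';

// Reject a match when "](" follows before any new "[" opens.
$pattern = '/\bvitae\b(?![^\[]*\]\s*\()/';

$result = preg_replace($pattern, 'XXX', $text);
echo $result, PHP_EOL;
```

Only the two occurrences outside the ![...](...) syntax should be replaced.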

You can leverage the discard technique, which is really useful for these cases. It consists of a pattern of the following form:
patternToSkip1 (*SKIP)(*FAIL)|patternToSkip2 (*SKIP)(*FAIL)| MATCH THIS PATTERN
So, according to your needs:
to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax
You can easily achieve this in PCRE with the (*SKIP)(*FAIL) verbs, so for your case you can use a regex like this:
\[.*?\](*SKIP)(*FAIL)|\bTheWord\b
Or using your pattern:
\[.*?\](*SKIP)(*FAIL)|(\W)(theWord)(\W)
The idea behind this regex is to tell the regex engine to skip the content within [...]
Working demo

The first regex is easily fixed with a SKIP-FAIL trick:
\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b
To replace with the word of your choice. It is a totally valid way in PHP (PCRE) regex to match something outside some markers.
See Demo 1
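Here is a minimal sketch of the SKIP-FAIL pattern applied with preg_replace. The sample text and the XXX replacement are assumptions made for the example.

```php
<?php
// Assumed sample text; the middle occurrence sits inside a markdown image.
$text = 'Vitae ![Blah vitae blah](http://example.com/a.png) vitae';

// The left alternative consumes a whole image and then fails on purpose,
// so the engine skips past it; the right alternative matches the word.
$pattern = '#\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b#i';

$result = preg_replace($pattern, 'XXX', $text);
echo $result, PHP_EOL;
```

With the i modifier the capitalised first word is replaced too, while the occurrence inside the image is left alone.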
As for the second one, it is harder, but achievable with \G, which ensures we match consecutively inside some markers:
(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))
To replace with $1$2{{NEW_REPLACED_TEXT}}[[NEW_DIGITS]]
See Demo 2
PHP:
$re1 = "#\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b#i";
$re2 = "#(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))#i";

PHP Preg_replace() pattern to replace only proper text and not image alt/src text

I'm trying to replace first occurrence of the string "Harvey" with the link <a href='/harvey'>Harvey</a>
I'm using "/(?<!(src|alt|href)=\")".$internal_links_row['key_phrase']."/i" as the search pattern, it only skips matching when there is exact match in the alt/src pattern.
For example, it correctly skips alt="Harvey", but it still replaces the word inside alt="James Stewart in Harvey".
I need to skip every occurrence within the double quotes and I can not use strip_tags
Please help me guys,
Thanks

Try using:
Harvey(?![^<>]*>)
Which makes sure there's no closing angled bracket ahead indicating it's inside an HTML tag.
If that doesn't work nicely, maybe a positive lookahead instead:
Harvey(?=[^<>]*(?:<|\Z))
Which makes sure there's the opening angled bracket of a tag ahead, or the end of the string.
Which translates to:
"/".$internal_links_row['key_phrase']."(?![^<>]*>)/i"
"/".$internal_links_row['key_phrase']."(?=[^<>]*(?:<|\Z))/i"
respectively
EDIT: As per comment:
"~".$internal_links_row['key_phrase']."(?=[^<>]*(?:<(?!/a)|\Z))~i"
I changed the delimiters to ~ and added the negative lookahead (?!/a).
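As a sketch of how the first pattern could be wired into preg_replace: $keyPhrase stands in for $internal_links_row['key_phrase'], and the sample HTML is an assumption. The limit argument of 1 is what restricts the replacement to the first match outside a tag.

```php
<?php
$keyPhrase = 'Harvey'; // stands in for $internal_links_row['key_phrase']
$html = '<img src="harvey.jpg" alt="James Stewart in Harvey"> <p>Harvey is a film. Harvey again.</p>';

// preg_quote() guards against regex metacharacters in the phrase.
$pattern = '~' . preg_quote($keyPhrase, '~') . '(?![^<>]*>)~i';

// The fourth argument (limit 1) replaces only the first match.
$result = preg_replace($pattern, '<a href="/harvey">$0</a>', $html, 1);
echo $result, PHP_EOL;
```

The occurrences inside the src and alt attributes are skipped because a closing > follows them before any other angle bracket.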

Why not use str_replace()?
$ans = str_replace('href="/harvey"', 'href="/some_string"', $subject);

PHP regex lookbehind with wildcard

I have two strings in PHP:
$string = '<a href="http://localhost/image1.jpeg" /></a>';
and
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]<a href="http://localhost/image1.jpeg" /></a>[/caption]';
I'm trying to match strings of the first type. That is strings that are not surrounded by '[caption ... ]' and '[/caption]'. So far, I would like to use something like this:
$pattern = '/(?<!\[caption.*\])(?!\[\/caption\])(<a.*><img.*><\/a>)/';
but PHP matches out the first string as well with this pattern even though it is NOT preceded by '[caption' and zero or more characters followed by ']'. What gives? Why is this and what's the correct pattern?
Thanks.

Variable length look-behind is not supported in PHP, so this part of your pattern is not valid:
(?<!\[caption.*\])
It should be warning you about this.
In addition, .* always matches the largest possible amount. Thus your pattern may result in a match that overlaps multiple tags. Instead, use [^>] (match anything that is not a closing bracket), because closing brackets should not occur inside the img tag.
To solve the look-behind problem, why not just check for the closing tag only? This should be sufficient (assuming the caption tags are only used in a way similar to what you have shown).
$pattern = '|(<a[^>]*><img[^>]*></a>)(?!\[/caption\])|';
When matching patterns that contain /, use another character as the pattern delimiter to avoid leaning toothpick syndrome. You can use nearly any non-alphanumeric character around the pattern.
Update: the previous regex is based on the example regex you gave, rather than the example data. If you want to match links that don't contain images, do this:
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';
Note that this doesn't allow any tags in the middle of the link. If you allow tags (such as by using .*?), a regex could match something starting within the [caption] and ending elsewhere.
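A quick check of the updated pattern against the two strings from the question, as a sketch:

```php
<?php
$string  = '<a href="http://localhost/image1.jpeg" /></a>';
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]<a href="http://localhost/image1.jpeg" /></a>[/caption]';

// A link with no nested tags that is not followed by [/caption].
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';

var_dump(preg_match($pattern, $string));  // bare link
var_dump(preg_match($pattern, $string2)); // link wrapped in [caption]
```

The bare link should match and the captioned one should not.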

I don't see how your regexp could match either string, since you're looking for <a.*><img.*><\/a>, and both anchors don't contain an <img... tag. Also, the two subexpressions looking for and prohibiting the caption-bits look oddly positioned to me. Finally, you need to ensure your tag-matching bits don't act greedy, i.e. don't use .* but [^>]*.
Do you mean something like this?
$pattern = '/(<a[^>]*>(<img[^>]*>)?<\/a>)(?!\[\/caption\])/';
Test it on regex101.
Edit: Removed useless lookahead as per dan1111's suggestion and updated regex101 link.

Lookbehind doesn't allow patterns of non-fixed length, i.e. ones using quantifiers such as *, + or ?. I think /<a.*><\/a>(?!\[\/caption\])/ is enough for your requirement.
