The regex works, but it also includes the next occurrence instead of ending at the first occurrence and then starting again at the next one.
Regex : (?=<appView)\s{0,1}(.*)(?<=<\/appView>)
String: <appView></appView> <appView></appView>
But my problem is that it matches the whole string, like this:
(Match 1)<appView></appView> <appView></appView>
I want it to match each group separately, but I can't make it work.
Desired output : (Match 1) <appView></appView> (Match 2)<appView></appView>
\s{0,1} is equivalent to \s?. You also need to use (.*?) instead of (.*) so that the group is lazy.
Use this pattern: ~(?=<appView)\s?(.*?)(?<=</appView>)~
*note, you don't have to escape / in the closing tag if you use something other than a slash as your pattern delimiter. I am using ~ at the beginning and end of my pattern to avoid escaping.
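To see the difference between the greedy and the lazy group, here is a minimal sketch in Python (my choice for illustration; the question does not say which engine is used). The sample string is the one from the question, and the variable names are made up for the example.

```python
import re

text = "<appView></appView> <appView></appView>"

# Greedy (.*) runs to the end of the string, then the lookbehind succeeds there.
greedy = re.compile(r"(?=<appView)\s?(.*)(?<=</appView>)")
# Lazy (.*?) stops at the first position where the lookbehind succeeds.
lazy = re.compile(r"(?=<appView)\s?(.*?)(?<=</appView>)")

greedy_matches = [m.group(0) for m in greedy.finditer(text)]
lazy_matches = [m.group(0) for m in lazy.finditer(text)]

print(greedy_matches)  # one match covering both elements
print(lazy_matches)    # two matches, one per element
```

Note that Python needs no pattern delimiters at all, so the slash does not have to be escaped there either.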
I strongly recommend switching from regex to an actual sequential XML parser. Regex is awful for parsing XML-based files, for example because of the problems below.
That said, you can "fix" your regex by using ([^<>]*). This matches any run of characters other than < or >, which makes sure that no other tags are nested inside. If you do this for all tags, you cannot match something like <appView><unclosedTag></appView>, because it is invalid. If you can be certain that the structure is correct, this is slightly less of an issue.
Another problem your approach has is that if you have nested tags like so: <appView> something <appView> something else </appView> else </appView>, your approach will make you end up with [replaced] else </appView>.
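A rough sketch of the ([^<>]*) variant, again in Python purely for illustration. The element contents "a" and "b" are made-up sample data; the nested string is the one from the paragraph above.

```python
import re

# The group may not contain < or >, so a match can never swallow another tag.
pattern = re.compile(r"<appView>([^<>]*)</appView>")

flat = "<appView>a</appView> <appView>b</appView>"
flat_contents = pattern.findall(flat)

nested = "<appView> something <appView> something else </appView> else </appView>"
nested_contents = pattern.findall(nested)

print(flat_contents)    # ['a', 'b']
print(nested_contents)  # only the innermost element matches
```

With nested input only the innermost element is found, and the outer one is silently skipped. That is better than a wrong match, but it is another reason to prefer a parser when nesting can occur.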
I'm trying to match
Some HTML content
Using preg_match
\<\!\-\- FOR (\d+) \-\-\>(.*)\<\!\-\- END FOR \-\-\>
Doesn't work since they are on different lines.
First you need to learn that < ! - > are not special characters. Escaping them with backslashes makes you look a bit silly.
Then learn about the /x and /s flags. One of them is what you need. The other is me trying to trick you into learning something unrelated.
Then test your regular expression with some HTML content that contains two or more of those FOR/END FORs and see what happens.
Also, you need to look into how to make your capturing conditions "greedy" or "non-greedy". By default, matches are greedy. So a pattern such as "A(.*)B" applied to the string "A1B A2B A3B" would find one match, capturing "1B A2B A3": everything from the first "A" to the last "B". If you want to find all the values between each A/B pair, you need to make the match non-greedy: "A(.*?)B".
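Both points can be checked quickly. The sketch below uses Python rather than preg_match, so treat it as an illustration only: re.DOTALL is Python's counterpart of the /s flag, and the multi-line sample content is made up because the original HTML is not shown.

```python
import re

# Greedy versus non-greedy on the A/B example.
greedy = re.findall(r"A(.*)B", "A1B A2B A3B")
lazy = re.findall(r"A(.*?)B", "A1B A2B A3B")
print(greedy)  # ['1B A2B A3']
print(lazy)    # ['1', '2', '3']

# Hypothetical content spanning several lines.
html = "<!-- FOR 3 -->\nline one\nline two\n<!-- END FOR -->"
pattern = r"<!-- FOR (\d+) -->(.*?)<!-- END FOR -->"

# Without DOTALL the dot does not match newlines, so nothing is found.
without_flag = re.search(pattern, html)
# With DOTALL the dot matches newlines as well.
with_flag = re.search(pattern, html, re.DOTALL)

print(without_flag)        # None
print(with_flag.groups())  # ('3', '\nline one\nline two\n')
```

The pattern is written without any backslashes in front of <, !, - and >, which shows they are not needed.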
What I'm trying to do is find all the matches within a content block, but ignore anything that is inside tags, for use inside preg_replace_callback().
For example:
test
test title
test
In this case, I want the first line to match, and the third line to match, but NOT the url match, nor the title match in between the a tags.
I've got a regex that I feel like is close:
#(?!<.*?)(\btest\b)(?![^<>]*?>)#si
(and this will not match the url part)
But how do I modify the regex to also exclude the "test" between a and /a?
If it's always the same pattern you can use [A-Z] or a combination like [A-Za-z]
I ended up solving it myself. This regex pattern will do what I wanted:
#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si
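For anyone who wants to try it, here is a small sketch of that pattern used with a replacement callback. It is in Python instead of preg_replace_callback(), and the sample markup is my own guess at what the example looked like, so both are assumptions.

```python
import re

# Assumed sample: plain text, then a link whose URL and title contain "test".
content = 'test <a href="test.html">test title</a> test'

# Same pattern as above; the # delimiters and the \/ escape are not needed in Python.
pattern = re.compile(r"(?!<a[^>]*?>)(\btest\b)(?![^<]*?</a>)", re.I | re.S)

def shout(match):
    # Hypothetical callback: upper-case every accepted match.
    return match.group(1).upper()

result = pattern.sub(shout, content)
print(result)  # TEST <a href="test.html">test title</a> TEST
```

The trailing lookahead does the real work: a "test" is rejected when the next tag after it is a closing </a>. That covers both the link text and the href here, but it would not protect attributes of other tags.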
I've searched for this but couldn't find a solution that worked for me.
I need a regex pattern that will match all text except HTML tags, so I can convert it to Cyrillic (which would obviously ruin the entire HTML otherwise =))
So, for example:
<p>text1</p>
<p>text2 <span class="theClass">text3</span></p>
I need to match text1, text2, and text3, so something like
preg_match_all("/pattern/", $text, $matches)
and then I would just iterate over the matches, or if it can be done with preg_replace, to replace text1/2/3, with textA/B/C, that would be even better.
As you probably know, regex is not a great choice for this (the general advice here will be to use a Dom parser).
However, if you need a quick regex solution, you can use this:
<[^>]*>(*SKIP)(*F)|[^<]+
How this works is that on the left the <[^>]*> matches complete <tags>, then the (*SKIP)(*F) causes the regex to fail and the engine to advance to the position in the string that follows the last character of the matched tag.
This is an application of a general technique to exclude patterns from matches (read the linked question for more details).
If you don't want to allow the matches to span several lines, add \r\n to the negated character class that does your matching, like so:
<[^>]*>(*SKIP)(*F)|[^<\r\n]+
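The (*SKIP)(*F) verbs are PCRE features. In an engine without them, the same "match what you don't want, keep what you do" idea can be approximated with a plain alternation and a capture group. Below is a sketch in Python's built-in re module, which does not support those verbs; this is a substitute technique, not the pattern above, and the upper-casing callback is just a stand-in for the real text conversion.

```python
import re

html = '<p>text1</p>\n<p>text2 <span class="theClass">text3</span></p>'

# Left branch consumes whole tags; right branch captures text between tags.
pattern = re.compile(r"<[^>]*>|([^<]+)")

# Keep only matches where the text branch fired and is not just whitespace.
texts = [
    m.group(1).strip()
    for m in pattern.finditer(html)
    if m.group(1) and m.group(1).strip()
]

def convert(match):
    # Tags come back unchanged; captured text gets transformed.
    if match.group(1) is None:
        return match.group(0)
    return match.group(1).upper()

replaced = pattern.sub(convert, html)

print(texts)     # ['text1', 'text2', 'text3']
print(replaced)
```

Because the tag branch comes first, a tag is always consumed as a whole and its contents never reach the callback as text.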
How about this RegEx:
/(?<=>)[\w\s]+(?=<)/g
Maybe this one (in Ruby):
/(?<!<)(?<!<\/)(?<![<\/\w+])([[:alpha:]])+(?!>)/
Enjoy !
Please use PHP's DOMDocument class to parse XML content.
I know, "Don't use regex for HTML", but seriously, loading an entire HTML parser isn't always an option.
So, here is the scenario
<script...>
some stuff
</script>
<script...>
var stuff = '<';
anchortext
</script>
If you do this:
<script[^>]*?>.*?anchor.*?</script>
You will capture from the first script tag to the /script in the second block. Is there a way to do a .*? but by replacing the . with a match block, something like:
<script[^>]*?>(^</script>)*?anchor.*?</script>
I looked at negative lookaheads etc, but I can't get something to work properly. Usually I just use [^>]*? to avoid running past the closing block, but in this particular example, the script content has a "<" in it, and it stops matching on that before reaching the anchortext.
To simplify, I need something like [^z]*? but instead of a single character or character range, I need a capture group to fit a string.
.*?(?!z) doesn't have the same effect as [^z]*? as I assumed it would.
Here is where I am stuck at: http://regexr.com?34llp
Match-anything-but is indeed commonly implemented with a negative lookahead:
((?!exclude).)*?
The trick is to not have the . dot repeated. But make it successively match any character while ensuring that character is not the beginning of the excluded word.
In your case you would want to have this instead of the initial .*?
<script[^>]*?>((?!</script>).)*?anchor.*?</script>
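A quick way to convince yourself, sketched in Python (the question does not name a language, so that is an assumption, as are the attribute values in the sample). The dot has to match newlines here, hence re.DOTALL.

```python
import re

# Two script blocks; only the second one contains the anchor text.
page = (
    "<script src='a.js'>\nsome stuff\n</script>\n"
    "<script>\nvar stuff = '<';\nanchortext\n</script>"
)

naive = re.compile(r"<script[^>]*?>.*?anchor.*?</script>", re.DOTALL)
tempered = re.compile(
    r"<script[^>]*?>((?!</script>).)*?anchor.*?</script>", re.DOTALL
)

naive_match = naive.search(page).group(0)
tempered_match = tempered.search(page).group(0)

print(naive_match == page)  # True: it ran across the first </script>
print(tempered_match)       # only the second block
```

The attempt starting at the first block fails because the lookahead refuses to step over </script>, so the engine moves on and matches from the second <script> tag. The lone "<" inside the script body is no obstacle, since only the exact string </script> is excluded.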
like that:
$pattern = '~<script[^>]*+>((?:[^<]+?|<++(?!/script>))*?\banchor(?:[^<]+?|<++(?!/script>))*+)</script>~';
But DOM is by far the better way to do that.
<a href="/search?hl=en&pwst=1&sa=X&ei=RCPqTqkHycryA_bK_f0J&ved=0CCUQvwUoAQ&q=psychology&spell=1" class=spell><b><i>psychology</i></b></a>
Hi, I'm looking to create a regex which matches this anchor and returns the inner text of it.
This is what I've been trying as a regex but without success.
'/<a[^>]+class=\"spell\"[^>]*>(.*?)<\/a>/isU'
It's probably something really silly. Thanks.
Problem was missing quotes surrounding the class. Not proper html markup but I neglected to notice so I just changed my regex to have quotes as optional.
Final regex:
'/<a[^>]+class=\"?spell\"?[^>]*>(.*?)<\/a>/is'
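Here is the same fix as a sketch in Python, for comparison only (the original code is PHP, and the variable names are mine). The input is the anchor from the question, where class=spell has no quotes.

```python
import re

html = (
    '<a href="/search?hl=en&pwst=1&sa=X&ei=RCPqTqkHycryA_bK_f0J'
    '&ved=0CCUQvwUoAQ&q=psychology&spell=1" class=spell>'
    '<b><i>psychology</i></b></a>'
)

# Quotes required: fails on this markup because class=spell is unquoted.
strict = re.search(r'<a[^>]+class="spell"[^>]*>(.*?)</a>', html, re.I | re.S)

# Quotes optional: matches both class=spell and class="spell".
relaxed = re.search(r'<a[^>]+class="?spell"?[^>]*>(.*?)</a>', html, re.I | re.S)

inner_html = relaxed.group(1)
# Crude tag stripping to get at the bare text.
inner_text = re.sub(r"<[^>]+>", "", inner_html)

print(strict)      # None
print(inner_html)  # <b><i>psychology</i></b>
print(inner_text)  # psychology
```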
The regex looks OK, although you don't need to escape the quotes. Perhaps PHP doesn't like it if you use unnecessary escapes, although I doubt it. The problem is more likely the way you're using the regex. Did you access group number 1?
if (preg_match('%<a[^>]+class="spell"[^>]*>(.*?)</a>%', $subject, $regs)) {
$result = $regs[1];
}
Your problem might be the combination of (.*?) and the /isU modifiers. The U modifier inverts the meaning of ?, which actually makes your (.*?) group greedy. It will then match past the first <\/a> end marker whenever another one follows.
If you remove the /U it works as expected. With your given input text, at least.
Here are two options to fix your expression:
For starters, you can simplify your expression to:
class=\"spell\"[^>]*>(.*?)<\/a>
This captures
<b><i>psychology</i></b>
in Group 1. I assume this is what you want to achieve.
Then, if you want to capture "psychology" without the bold and italic tags, you can use:
class=\"spell\"[^>]*>\s*<(\w+)>?\s*<(\w+)>?\s*(.*?)<\/\2>\s*<\/\1>\s*<\/a>
This captures "psychology" in group 3.
In group 1, you will find the first optional tag, whether it be "b", "strong" or nothing.
In group 2, you will find the second optional tag, which was "i" in your example.
The multiple instances of \s* allow for optional space between the tags.
Is this what you were looking for?