RegEx Or and AND - php

Hi, I tried to use RegEx in PHP. I would like to match the following elements with it:
<a="300">
<a="300"b="300">
<b="300">
The Problem is that I get only
<a="300">
<b="300">
with the following RegEx:
<(a|b)="[0-9]*">
What do I have to change so that I get all three elements? Is there an AND/OR operator?

Assuming your problem is simple string processing rather than serious parsing, I would modify your regex like this:
<(a|b)="[0-9]+".*>
I added .* to allow characters in between " and >.
Or a version that is more to my taste:
<[ab]="\d+"[^>]*?>
Alternating single characters with | is less favored than a character class [...]
\d matches a digit, so \d+ matches a series of digits
[^>]*? matches characters other than >, as few as possible

You need an additional group to specify that you accept one or more items of that kind:
echo '<a="300">
<a="300"b="300">
<b="300">' | egrep '<((a|b)="[0-9]*")+>'
<a="300">
<a="300"b="300">
<b="300">

Regex is not boolean logic. The | symbol in regex is not an OR operator; it is referred to as alternation, which works similarly but is not quite the same thing. If you're just trying to match one of multiple characters, you should use square brackets [] to create a character set. In this case, [ab] matches a or b, just as [0-9] matches 0 or 1 or 2 etc.
Here's the pattern that I would suggest
<[ab]="\d+"(?:[ab]="\d+")?>
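
As a quick check of the suggested pattern, here is a minimal PHP sketch. The input is the three elements from the question; the variable names are only for illustration.

```php
<?php
// The three elements from the question, one per line.
$input = "<a=\"300\">\n<a=\"300\"b=\"300\">\n<b=\"300\">";

// [ab] matches a single "a" or "b"; the optional non-capturing group
// allows a second attribute directly after the first one.
$pattern = '/<[ab]="\d+"(?:[ab]="\d+")?>/';

preg_match_all($pattern, $input, $matches);
print_r($matches[0]); // all three elements are matched
```

If more than two attributes can occur, the optional group could be swapped for a repeated one, as in the egrep answer above.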

Related

Combine two regular expressions for php

I have these two regular expression
^(((98)|(\+98)|(0098)|0)(9){1}[0-9]{9})+$
^(9){1}[0-9]{9}+$
How can I combine these expressions?
A valid phone number:
starts with 0098, +98, 98, 09 or 9
Samples:
00989151855454
+989151855454
989151855454
09151855454
9151855454
You haven't provided what passes and what doesn't, but I think this will work if I understand correctly...
/^\+?0{0,2}98?/
^ Matches the start of the string
\+? Matches 0 or 1 plus symbols (the backslash escapes the +)
0{0,2} Matches between 0 and 2 (that is 0, 1, or 2) occurrences of the 0 character
9 Matches a literal 9
8? Matches 0 or 1 occurrences of a literal 8
Looking at your second regex, it looks like you want to make the first part ((98)|(\+98)|(0098)|0) in your first regex optional. Just make it optional by putting ? after it and it will allow the numbers allowed by second regex. Change this,
^(((98)|(\+98)|(0098)|0)(9){1}[0-9]{9})+$
to,
^(?:98|\+98|0098|0)?9[0-9]{9}$
The ? after the non-capturing group makes the whole group, which contains the alternatives you want to allow, optional.
I've made a few more corrections to the regex. The {1} is redundant, since matching exactly once is the default for any token. You also don't need to group parts of a regex unless you need the groups. Finally, I've removed the outermost parentheses and the + after them, as they are not needed.
This regex
^(?:98|\+98|0098|0)?9[0-9]{9}$
matches
00989151855454
+989151855454
989151855454
09151855454
9151855454
Demo: https://regex101.com/r/VFc4pK/1/
Note, however, that this requires a 9 as the first digit after the country code or the leading 0.
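
To see the combined expression in action, here is a small PHP sketch that runs it over the five samples from the question. The last entry in the list is a made-up invalid number, added only for contrast.

```php
<?php
// Optional prefix (98, +98, 0098 or 0), then a 9 and nine more digits.
$pattern = '/^(?:98|\+98|0098|0)?9[0-9]{9}$/';

$samples = [
    '00989151855454',
    '+989151855454',
    '989151855454',
    '09151855454',
    '9151855454',
    '8151855454', // hypothetical invalid number: does not start with 9
];

$valid = [];
foreach ($samples as $number) {
    $ok = preg_match($pattern, $number) === 1;
    if ($ok) {
        $valid[] = $number;
    }
    echo $number, ' => ', $ok ? 'valid' : 'invalid', PHP_EOL;
}
```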

Regex prevent selecting characters from previous match

My title probably doesn't explain exactly what I mean. Take the following string:
POWERSTART9^{{2|3}}POWERENDx{{3^EXSTARTxEXEND}}=POWERSTART27^{{1|4}}POWEREND
What I want to do here is isolate the parts that are like this:
{{2|3}} or {{1|4}}
The following expression works to an extent, it selects the first one {{2|3}} with no issue:
\{\{(.*?)\|(.*?)\}\}
The problem is that it does not select just {{2|3}} and then {{1|4}}. After the first match comes {{3^EXSTARTxEXEND}}, so the second match starts at {{3 and runs right up to the end of the second part I want, |4}}.
I've never been great with regex and can't work out how to stop it doing that. Any ideas? I basically want it to only match the exact pattern and not something that contains it.
You may use
\{\{((?:(?!{{).)*?)\|(.*?)}}
See the regex demo.
If there can be no { and } inside the {{...}} substrings, you may use a simpler \{\{([^{}|]*)\|([^{}]*)}} expression (see demo).
Details
\{\{ - a {{ substring
((?:(?!{{).)*?) - Capturing group 1: any char (.), as few as possible (*?), that does not start a {{ char sequence (tempered greedy token)
[^{}|]* - any 0 or more chars other than {, } and |
\| - a | char
(.*?) - Capturing group 2: any 0 or more chars, as few as possible
[^{}]* - any 0 or more chars other than { and }
}} - a }} substring.
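
Both variants can be tried against the string from the question with a short PHP sketch. The braces are escaped throughout and ~ is used as the delimiter; both are just choices for this example.

```php
<?php
$subject = 'POWERSTART9^{{2|3}}POWERENDx{{3^EXSTARTxEXEND}}=POWERSTART27^{{1|4}}POWEREND';

// Tempered greedy token: group 1 may not run across another "{{".
$tempered = '~\{\{((?:(?!\{\{).)*?)\|(.*?)\}\}~';
// Simpler variant, assuming there are no braces inside {{...}}.
$simple = '~\{\{([^{}|]*)\|([^{}]*)\}\}~';

preg_match_all($tempered, $subject, $a);
preg_match_all($simple, $subject, $b);

print_r($a[0]); // full matches of the tempered pattern
print_r($b[0]); // full matches of the simpler pattern
```

With either pattern, the values on each side of the | are available in capture groups 1 and 2.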
Try this \{\{([^\^|]*)\|([^\^|]*)\}\}
https://regex101.com/r/bLF8Oq/1

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.
When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
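
A minimal PHP sketch of these two steps, using the test string from the question. The variable names and the handling of an empty list are my own assumptions.

```php
<?php
$line = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

$version = null;
$items = [];

// Step 1: capture the version and the raw list between the parentheses.
if (preg_match('~SomeText/([\d.]*) \(([^)]*)\)~', $line, $m) === 1) {
    $version = $m[1];
    // Step 2: split on commas and trim; an empty list "()" gives no items.
    $items = $m[2] === '' ? [] : array_map('trim', explode(',', $m[2]));
}

print_r($items);
```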
Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G(?!^)) the lookbehind is the part that matches are glued to
\G matches where the previous match ended; (?!^) prevents it from matching at the start of the string
[(,]? * matches an optional opening parenthesis or comma followed by any number of spaces
\K resets the beginning of the reported match
[^,)(]+ matches the wanted characters, i.e. anything that is none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one, without a lookbehind, is a bit more accurate (it also requires the opening parenthesis), performs better (needs fewer steps) and is probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
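
The capture-group variant could be used from PHP roughly as follows. This is a sketch under the assumption that the input looks like the sample in the question; the loop that sorts matches into a version and a list of items is my own addition.

```php
<?php
$line = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';
$pattern = '~SomeText/([^(]*)\(|\G(?!^),? *([^,)]+)~';

preg_match_all($pattern, $line, $matches, PREG_SET_ORDER);

$version = null;
$items = [];
foreach ($matches as $m) {
    if (($m[2] ?? '') !== '') {
        // A list entry, glued to the previous match via \G.
        $items[] = $m[2];
    } else {
        // The entry match; $1 still contains the space before "(".
        $version = trim($m[1]);
    }
}

print_r($items);
```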
Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.
I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
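
A likely explanation, offered as a general property of regex engines rather than something stated in this thread: a capture group inside a quantifier keeps only the text of its last iteration. The short PHP sketch below illustrates this with a made-up input string.

```php
<?php
// Hypothetical input, used only for illustration.
$list = 'debug, OS X, Mac';

// The group ([^,]+) is repeated by *, but only its final iteration is kept.
preg_match('/^(?:([^,]+),?\s?)*$/', $list, $m);

echo $m[1], PHP_EOL; // prints "Mac", the last entry only
```

This would also explain the result in the question, where the second group reports only the last list entry.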
This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

"OR" operator in RegEx syntax

OK, I've worked with RegEx numerous times but this is one of the things I honestly can't get my head around. And it looks as if I'm missing something rather simple...
So, let's say we want to match "AB" or "AC". In other words, "A" followed by either "B" OR "C".
This would be expressed like A[BC] or A[B|C] or A(B|C) and so on.
Now, what if A,B,C are not just single letters but sub-expressions?
Please, have a look at this example here (well, I admit it doesn't look that... simple! lol) : http://regexr.com?382a4
I'm trying to match capital = (and its variations) followed by either :
Pattern 1
Pattern 2
Why is it that using the | operator only works on the latter part (my regex also matches "Pattern 2" withOUT preceding capital =). Please note that I've also tried using positive look-arounds, but without any success.
Any ideas?
Your original regex could be summarized as:
capital = (ABC)|(DEF)
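This matches either "capital = ABC" or "DEF" on its own, because alternation has the lowest precedence and splits the whole pattern. Add an extra pair of () that wraps the | clause properly: capital = ((ABC)|(DEF)).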
I suppose this regexp:
capital = (ABC|XYZ)
should work (if I did correctly understand your request...)
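
The difference between the two forms can be checked with a few lines of PHP. ABC and DEF are the placeholders from the summary above, not the real sub-expressions from the question.

```php
<?php
$unwrapped = '/capital = (ABC)|(DEF)/'; // alternation splits the whole pattern
$wrapped   = '/capital = (ABC|DEF)/';   // alternation is limited to the group

$r1 = preg_match($unwrapped, 'DEF');         // 1: "DEF" alone matches
$r2 = preg_match($wrapped, 'DEF');           // 0: "capital = " is required
$r3 = preg_match($wrapped, 'capital = DEF'); // 1

var_dump($r1, $r2, $r3);
```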
Actually [B|C] is incorrect, (B|C) is correct.
Character classes
In RegEx jargon [] is called a character class and it is used to represent one (single) character according to the options listed between the brackets.
In your case [B|C] matches either B or | or C. We can correct this by using [BC] to match either B or C. This matches exactly one character either B or C.
Capturing groups
In RegEx jargon () is called a capturing group. It is used to create boundaries between adjacent groups, and whatever it matches will be present in the output array of preg_match or available as a backreference in preg_replace.
Within that group you can use the | operator to specify that you want to match either whatever is before or whatever is after the operator.
This can be used to match strings with more than one character such as (Ana|Maria) or various structures such as ([a-zA-Z]+|[0-9]+).
You can also use the | outside of a capturing group such as (group-1)|(group-2) and you can also use subgrouping such as ((group-1)|(group-2)).
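
A short PHP sketch of both points, with made-up test strings: the character class [B|C] also accepts a literal |, while a capturing group with | picks one of several longer alternatives and exposes it in the match array.

```php
<?php
// [B|C] is a character class containing three characters: B, | and C.
$classMatchesPipe = preg_match('/^A[B|C]$/', 'A|'); // 1
$fixedClass       = preg_match('/^A[BC]$/', 'A|');  // 0

// Capturing groups with alternation; the captured text lands in $m[1] and $m[2].
preg_match('/^(Ana|Maria) ([a-zA-Z]+|[0-9]+)$/', 'Maria 42', $m);

print_r($m);
```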

Regex optional groups

I'd like to capture up to four groups of text between <p> and </p>. I can do that using the following regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
The text to match on:
<h5>Trivia</h5><p>Was discovered by a freelance photographer while sunbathing on Bournemouth Beach in August 2003.</p><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p><p>Listens to soul, R&B, Stevie Wonder, Aretha Franklin, Usher Raymond, Michael Jackson and George Michael.</p>
It outputs the four lines of text. It also works as intended if there are more trivia items or <p> occurrences.
But if there are fewer than four trivia items or <p> groups, it outputs nothing since it cannot find the fourth group. How do I make that group optional?
I've tried: <h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)? and that works according to http://gskinner.com/RegExr/ but it doesn't work if I put it inside PHP code. It only detects one group and puts everything in it.
The magic word is either 'escaping' or 'delimiters', read on.
The first regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
worked because you escaped the / characters in tags like </h5> to <\/h5>.
But in your second regex (correctly enclosing each paragraph in an optional non-capturing group, fetching 1 to 5 paragraphs):
<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?
you forgot to escape those / characters.
It should then have been:
$pattern = '/<h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?/';
The above assumes you were putting your regex between two / "delimiter" characters (out of conventional habit).
To dive a little deeper into the rabbit hole, note that in PHP the first and last characters of a regular expression are "delimiters", so that modifiers (such as case-insensitive) can be added after the closing one.
So instead of escaping your regex, you could also use a ~ character (or #, etc) as a delimiter.
Thus you could also use the same identical (second) regex that you posted and enclose for example like this:
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Here is a working (web-based) example of that, using # as delimiter (just because we can).
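
For reference, here is a minimal PHP sketch of the ~ delimited pattern applied to an input with only two paragraphs. The input is a shortened version of the text from the question.

```php
<?php
// Shortened input with only two trivia paragraphs.
$html = "<h5>Trivia</h5><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p>";

// With ~ as the delimiter, the / characters need no escaping.
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';

$trivia = [];
if (preg_match($pattern, $html, $m) === 1) {
    // Drop the full match; trailing groups that did not match are absent.
    $trivia = array_slice($m, 1);
}

print_r($trivia);
```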
You can use the question mark to make each <p>...</p> optional:
$pattern = '~<h5>Trivia</h5>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Using the DOM is a good option too.
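
A rough sketch of the DOM approach, assuming PHP's DOM extension is available and that every <p> in the fragment is a trivia item. The input is a shortened version of the text from the question.

```php
<?php
// Shortened input with only two trivia paragraphs.
$html = "<h5>Trivia</h5><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$trivia = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $trivia[] = $p->textContent;
}

print_r($trivia);
```

This handles any number of paragraphs without listing optional groups by hand.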
