OK, I've worked with RegEx numerous times but this is one of the things I honestly can't get my head around. And it looks as if I'm missing something rather simple...
So, let's say we want to match "AB" or "AC". In other words, "A" followed by either "B" OR "C".
This would be expressed like A[BC] or A[B|C] or A(B|C) and so on.
Now, what if A,B,C are not just single letters but sub-expressions?
Please, have a look at this example here (well, I admit it doesn't look that... simple! lol) : http://regexr.com?382a4
I'm trying to match capital = (and its variations) followed by either :
Pattern 1
Pattern 2
Why is it that the | operator only applies to the latter part? (My regex also matches "Pattern 2" without the preceding capital =.) Please note that I've also tried using positive look-arounds, but without any success.
Any ideas?
Your original regex could be summarized as:
capital = (ABC)|(DEF)
This matches capital = ABC or DEF, because | has the lowest precedence and splits the whole pattern into two alternatives. Add an extra pair of () that wraps the | clause properly.
Demo here
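A minimal Python sketch of the precedence issue, using the placeholder sub-patterns ABC and DEF from the summary above. Python's re module is an assumption here, since the question does not name a language:

```python
import re

# ABC and DEF are placeholders standing in for the real sub-expressions.
ungrouped = re.compile(r"capital = (ABC)|(DEF)")
grouped = re.compile(r"capital = (ABC|DEF)")

samples = ["capital = ABC", "capital = DEF", "DEF"]

# fullmatch() only succeeds if the whole string matches the pattern.
ungrouped_hits = [s for s in samples if ungrouped.fullmatch(s)]
grouped_hits = [s for s in samples if grouped.fullmatch(s)]

print(ungrouped_hits)
print(grouped_hits)
```

The ungrouped pattern accepts a bare "DEF" and rejects "capital = DEF", while the grouped one requires the prefix in both cases.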
I suppose this regexp:
capital = (ABC|XYZ)
should work (if I understood your request correctly).
Actually [B|C] is incorrect, (B|C) is correct.
Character classes
In RegEx jargon [] is called a character class and it is used to represent one (single) character according to the options listed between the brackets.
In your case [B|C] matches either B or | or C, because | is an ordinary character inside a character class. We can correct this by using [BC], which matches exactly one character: either B or C.
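A quick way to see the difference, sketched in Python (the choice of Python and the candidate strings are mine, not from the question):

```python
import re

# [B|C] is a character class with three members: B, | and C.
with_pipe = re.compile(r"A[B|C]")
# [BC] has only two members: B and C.
without_pipe = re.compile(r"A[BC]")

candidates = ["AB", "AC", "A|", "ABC"]

pipe_hits = [s for s in candidates if with_pipe.fullmatch(s)]
plain_hits = [s for s in candidates if without_pipe.fullmatch(s)]

print(pipe_hits)
print(plain_hits)
```

Only the first pattern accepts "A|", and neither accepts "ABC", since a character class consumes a single character.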
Capturing groups
In RegEx jargon () is called a capturing group. It is used to create boundaries between adjacent groups and whatever it matches will be present in the output array of a preg_match or as a variable in preg_replace.
Within that group you can use the | operator to specify that you want to match either whatever's before or whatever's after the operator.
This can be used to match strings with more than one character such as (Ana|Maria) or various structures such as ([a-zA-Z]+|[0-9]+).
You can also use the | outside of a capturing group such as (group-1)|(group-2) and you can also use subgrouping such as ((group-1)|(group-2)).
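Here is a small sketch of those two groups in Python rather than PHP (an assumption on my part; re.search and re.findall play roughly the role of preg_match and preg_match_all). The sample strings are made up:

```python
import re

names = re.compile(r"(Ana|Maria)")
word_or_number = re.compile(r"([a-zA-Z]+|[0-9]+)")

# group(1) holds whatever the first capturing group matched.
name_match = names.search("Hello Maria")
captured_name = name_match.group(1) if name_match else None

# With a single group, findall() returns the text captured by that group.
tokens = word_or_number.findall("abc 123 de45")

print(captured_name)
print(tokens)
```

Note that "de45" is split into "de" and "45", because each alternative only covers letters or digits, not a mix.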
Related
Hi, I tried to use RegEx in PHP. I would like to get the following elements with it:
<a="300">
<a="300"b="300">
<b="300">
The Problem is that I get only
<a="300">
<b="300">
with the following RegEx:
<(a|b)="[0-9]*">
What do I have to change so that I get all three elements? Is there an ANDOR operator?
Assuming your problem is rather a simple string processing than serious parsing, I would modify your regex like this:
<(a|b)="[0-9]+".*>
I added .* to allow characters in between " and >. Note that .* is greedy, so if several tags share one line it runs up to the last > on that line.
Or a version more to my taste:
<[ab]="\d+"[^>]*?>
alternating single characters with | is less favored than a character class [...]
\d+ matches a series of digits (\d alone is a single digit)
[^>]*? matches any characters other than >
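To compare the two patterns, here is a rough Python sketch. I put the three sample elements on one line, which is an assumption (the question shows them on separate lines), because that is where the two patterns behave differently:

```python
import re

# Assumption: all three elements sit on a single line.
text = '<a="300"> <a="300"b="300"> <b="300">'

greedy = re.compile(r'<(a|b)="[0-9]+".*>')
bounded = re.compile(r'<[ab]="\d+"[^>]*?>')

# The greedy .* swallows everything up to the last > on the line.
greedy_first = greedy.search(text).group(0)

# [^>]*? cannot cross a >, so each element is matched on its own.
bounded_all = bounded.findall(text)

print(greedy_first)
print(bounded_all)
```

With the elements on separate lines both patterns find three matches, since . does not match a newline by default.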
You need an additional grouping, to specify, that you would accept multiple of that kind:
echo '<a="300">
<a="300"b="300">
<b="300">' | egrep '<((a|b)="[0-9]*")+>'
<a="300">
<a="300"b="300">
<b="300">
Regex is not boolean logic. The | symbol in regex is not an OR operator; it is referred to as alternation, which works similarly but is not quite the same thing. If you're just trying to match one of multiple characters, you should use square brackets [] to create a character set. In this case, [ab] matches a or b, just as [0-9] matches 0 or 1 or 2 etc.
Here's the pattern that I would suggest
<[ab]="\d+"(?:[ab]="\d+")?>
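A short Python check of how far that pattern goes, next to a variant that repeats the attribute with + instead of ?. The variant and the three-attribute sample are my own additions for illustration, not something the question asked for:

```python
import re

# One attribute, optionally followed by a second one.
one_or_two = re.compile(r'<[ab]="\d+"(?:[ab]="\d+")?>')
# Hypothetical variant: one or more attributes.
one_or_more = re.compile(r'<(?:[ab]="\d+")+>')

tags = ['<a="300">', '<a="300"b="300">', '<b="300">', '<a="1"b="2"a="3">']

two_hits = [t for t in tags if one_or_two.fullmatch(t)]
many_hits = [t for t in tags if one_or_more.fullmatch(t)]

print(two_hits)
print(many_hits)
```

The optional group covers the three elements from the question; the + variant only matters if more than two attributes can occur.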
I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.
When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
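The two steps could look roughly like this in Python, where str.split stands in for PHP's explode. The function name parse_line is made up for this sketch:

```python
import re

OUTER = re.compile(r"SomeText/([\d.]*) \(([^)]*)\)")


def parse_line(line):
    """Return (version, items) or None if the line does not match."""
    m = OUTER.search(line)
    if m is None:
        return None
    version = m.group(1)
    # Step 2: split the parenthesised part on commas and drop empty entries.
    items = [part.strip() for part in m.group(2).split(",") if part.strip()]
    return version, items


result = parse_line("SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)")
empty = parse_line("SomeText/1.0.4 ()")

print(result)
print(empty)
```

An empty list inside the parentheses simply yields no items, so the "0 or more entries" case needs no special handling.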
Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? * matches optional opening parenthesis or comma followed by any amount of space
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one without lookbehind is a bit more accurate (it also requires the opening parenthesis), of better performance (needs fewer steps) and probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.
I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
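One likely reason, sketched in Python: a capturing group inside a repetition is still a single group, and each iteration overwrites what the previous one captured, so only the last item survives. The pattern below is a simplified version of the one above (I excluded ) from the class and dropped the SomeText prefix), so treat it as an illustration only:

```python
import re

text = "(debug, OS X 10.11.2, Macbook Pro Retina)"

# The inner group runs three times, but there is only one group 1.
repeated = re.compile(r"\((?:([^,)]+),?\s?)*\)")
m = repeated.fullmatch(text)
last_only = m.group(1) if m else None

# Applying the inner part on its own returns every item instead.
inner = text[1:-1]
all_items = re.findall(r"([^,)]+),?\s?", inner)

print(last_only)
print(all_items)
```

That is why the hand-repeated version works: it has six distinct groups, one per item.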
This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!
I am new to regex and I know the basics of how to pull out one sub string from a given string but I am struggling to get out multiple parts that I need. I am wondering if someone could help me with this simple example and then I work my way from there. Take this string:
LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL
The parts in italics are the parts of the string that will change and bold will stay the same in every string. The parts I need out are:
1. First team id, which in this case is LMJ; this will always start the string and be 3 uppercase letters, ^[A-Z]{3}?
2. The Neu part, which could be one of 3 strings, Neu, Off, Def, [Neu|Off|Def]?
3. The second team part, which will always come after the word Zone -, [A-Z]{3}?
4. The numeric part of the string after the first #. This could be 1 or 2 digits, [0-9]{1,2}?
5. Third team part, same as 3 except it will appear after vs, [A-Z]{3}?
6. Same as 4, the numeric part after the 2nd #, [0-9]{1,2}?
I would like to put that all together into one regex is that possible?
Everything inside square brackets is a so-called character class: it matches only a single character. so, [Neu|Off|Def] means: exactly one of the characters N, e, u, |, O, f or D (repetitions are ignored)
What you want is a capture group: (Neu|Off|Def)
Putting it together:
^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+ vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$
(This assumes you're not interested in the "LEIGH" and "ONEIL" parts, and these are always in upper case letters)
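For what it's worth, here is that pattern applied to the sample string in Python (the language is my assumption; the question does not say which one is used):

```python
import re

PLAY = re.compile(
    r"^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+"
    r" vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$"
)

line = "LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL"

m = PLAY.match(line)
# groups() returns the six captured parts in order, as strings.
parts = m.groups() if m else ()

print(parts)
```

The six groups come back in the order listed in the question: first team, zone, second team, first number, third team, second number.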
The regex should be something like;
'/([A-Z]{3})\ won\ (Neu|Off|Def)\.\ Zone\ -\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)\ vs\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)/'
() are used for capturing the different parts.
This is not tested properly.
I am trying to create a regular expression which searches a huge document for a person's full name. In the text the name can be written in full, or the first names can be abbreviated to a single letter, abbreviated to a letter followed by a dot, or omitted. For instance, my search for ALBERTO JORGE ALONSO CALEFACCION is now:
preg_match('/([;:.,&\s\xc2\-(){}!"'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+
(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION))([;:.,&\s\xc2(){}
!"'<>]{1})/i', $text, $match);
Between the first names and last names an asterisk (*) can be present.
This works as long as all first names are present in some form. But I don't know how to extend the expression for when first names are omitted. Can you help me?
Let's start by simplifying what you have;
start:
/([;:.,&\s\xc2\-(){}!"'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"'<>]{1})/i
as I said in my comment, \b is a "word boundary", so you can simplify a lot of that:
/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
(added bonus: it won't match the characters either side now, and it will match at the start and end of the text)
Next, you can use the ? token for the dots (which should be escaped by the way; . is special and means "match any single character")
/\b(ALBERTO|A\.?)[\s\xc2-]+(JORGE|J\.?)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Finally, to actually answer your question, you have 2 choices. Either make the entire bracketed name optional, or add a new blank option. The first is the most flexible, since we'll need to cope with the whitespace too:
/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Note that if you're reading the matched parts you'll need to update your indices. Also note that this fixed an issue where omitting the second name (JORGE) still required an extra space.
This will match things like A. J. ALONSO CALEFACCION, A. ALONSO CALEFACCION and ALONSO CALEFACCION, but not J. ALONSO CALEFACCION (it's only a small tweak if you do want that)
Breaking up that final string for clarity:
/\b
(
(ALBERTO|A\.?)[\s\xc2-]+
(
(JORGE|J\.?)[\s\xc2,]+
)?
)?
(ALONSO)[\s\xc2*-]+
(CALEFACCION)
\b/i
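A rough Python port of the final pattern, in case you want to try the variants quickly. This is a sketch, not your PHP code: I use fullmatch so each test string has to match as a whole, whereas preg_match on a long text would search inside it:

```python
import re

NAME = re.compile(
    r"\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?"
    r"(ALONSO)[\s\xc2*-]+(CALEFACCION)\b",
    re.IGNORECASE,
)

variants = [
    "ALBERTO JORGE ALONSO CALEFACCION",
    "A. J. ALONSO CALEFACCION",
    "A. ALONSO CALEFACCION",
    "ALONSO CALEFACCION",
    "J. ALONSO CALEFACCION",
]

accepted = [v for v in variants if NAME.fullmatch(v)]

# With the extra wrapper groups, the surnames are now groups 5 and 6.
m = NAME.fullmatch("A. J. ALONSO CALEFACCION")
initials = (m.group(2), m.group(4))
surnames = (m.group(5), m.group(6))

print(accepted)
print(initials, surnames)
```

When searching inside a longer text, the "J. ALONSO CALEFACCION" case would still produce a match on the "ALONSO CALEFACCION" part, just without the J.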
Finally, it's an odd thought, but you could change the names which can be initials to be in this form: (A(LBERTO|\.|)), which means you're not repeating the initials (a potential source of mistakes)
I need to match "name" only after "listing", but of course those words could be any url directory or page.
mydomain.com/listing/name
so the only thing I can "REGuest" (request) is to be some parent directory there.
In other words, I want to match the "position" i.e. whatever comes 2nd after the domain.
I'm trying something like
(?<=mydomain\.com/[^/\?&]+/)[^/\?&]+(?:/)?
But the character set won't work inside the positive lookbehind, or at least it only works when it matches exactly ONE character. As soon as I try to match anything other than one (e.g. modify it with +, ? or *) it just stops working.
I'm obviously missing the positive lookbehind syntax and it seems not intended for what I'm trying.
How can I match that 2nd level filename?
Thanks.
Regular-expressions.info states that
The bad news is that most regex flavors do not allow you to use just
any regex inside a lookbehind, because they cannot apply a regular
expression backwards. Therefore, the regular expression engine needs
to be able to figure out how many steps to step back before checking
the lookbehind...
(Read further, they even mention Perl, Python and Java.)
I think the quantifier might be the problem. I found this on stackoverflow and only skimmed it.
Wouldn't it be possible to just match the whole path, and use a group for the second level filename:
mydomain\.com\/[^\/\?&]+\/([^\/\?&]+)(?:\/)?
(note: I had to escape the / for my tests...)
The result of this would be something like:
Array
(
[0] => mydomain.com/listing/name
[1] => name
)
Now, because I don't know the context of your problem, I just assumed you would be able to postprocess the results and get the group 1 (index 1) from the result. If not, I unfortunately don't know...
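Both points can be checked with a small Python sketch. Python is my assumption here; its re module is one of the flavors that rejects a variable-width lookbehind, and the sample URL is made up:

```python
import re

# A quantifier inside the lookbehind makes its width variable,
# which Python's re module refuses to compile.
try:
    re.compile(r"(?<=mydomain\.com/[^/?&]+/)[^/?&]+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Matching the whole path and capturing the second level works fine.
second_level = re.compile(r"mydomain\.com/[^/?&]+/([^/?&]+)/?")
m = second_level.search("http://mydomain.com/listing/name")
name = m.group(1) if m else None

print(lookbehind_ok)
print(name)
```

Group 0 is the whole matched path and group 1 is the second-level name, the same layout as the array shown above.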