PHP regex match multiple pieces - php

I am new to regex. I know the basics of how to pull one substring out of a given string, but I am struggling to extract the multiple parts I need. Could someone help me with this simple example, so that I can work my way up from there? Take this string:
LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL
The parts in italics are the parts of the string that will change and bold will stay the same in every string. The parts I need out are:
1. First team id, which in this case is LMJ. This will always start the string and be 3 uppercase letters: ^[A-Z]{3}?
2. The Neu part, which could be one of 3 strings (Neu, Off, Def): [Neu|Off|Def]?
3. The second team part, which will always come after the word Zone -: [A-Z]{3}?
4. The numeric part of the string after the first #. This could be 1 or 2 digits: [0-9]{1,2}?
5. Third team part, same as 3 except that it will appear after vs: [A-Z]{3}?
6. Same as 4, but the numeric part after the 2nd #: [0-9]{1,2}?
I would like to put that all together into one regex. Is that possible?

Everything inside square brackets is a so-called character class: it matches only a single character. So [Neu|Off|Def] means exactly one of the characters N, e, u, |, O, f or D (repetitions are ignored).
What you want is a capture group: (Neu|Off|Def)
Putting it together:
^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+ vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$
(This assumes you're not interested in the "LEIGH" and "ONEIL" parts, and these are always in upper case letters)
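As a minimal sketch of how that pattern could be used from PHP, assuming the sample line from the question (the variable names are illustrative only):

```php
<?php
// Sample line from the question.
$line = 'LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL';

// Single quotes keep the backslash in "\." intact for the regex engine.
$pattern = '/^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+ vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$/';

$parts = [];
if (preg_match($pattern, $line, $m) === 1) {
    // $m[0] is the whole match; $m[1] to $m[6] are the six capture groups.
    $parts = array_slice($m, 1);
}

print_r($parts);
```

The six elements of $parts are, in order: first team, zone, second team, first number, third team, second number.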

The regex should be something like:
'/([A-Z]{3})\ won\ (Neu|Off|Def)\.\ Zone\ -\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)\ vs\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)/'
() are used for capturing the different parts.
This is not tested properly.

Related

PHP preg_replace: find string part not starting with an exclamation point

I am working on some very messy Excel sheets, and trying to use PHP to find clues.
I have a MySQL database with all formulas from an Excel document and, as usual, the cell names from the current sheet do not have a "sheetname!" in front of them. To make the formulas searchable (and find dead routes in them) I would like to rewrite all formulas in the database so that every cell carries its sheet name as a prefix.
Example:
=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12
The database contains the name of the current sheet, and I would like to rewrite the formula above using that sheet name (let's call it "sheet_turnover").
=+(sheet_factory_costs!A17 / sheet_employees!D23)+sheet_turnover!T12+sheet_turnover!W12
I try this in PHP with preg_replace, and I think I need the following rules:
Find one or two letters, directly followed by a number. This is always a cell address within formulas.
When there is a ! in the position before, there is already a sheet name. So I am only looking for letters and numbers NOT preceded by an exclamation point.
The problem seems to be that the ! is also a special sign within patterns. Even if I try to escape it, it does not work:
$newformula =
preg_replace('/(?<\!)[A-Z]{1,2}[0-9]/',
'lala',
$oldformula);
(lala is my temporary marker to see if it is selecting the right cell addresses)
(and yes, the lala is only placed over the first number, but that's no issue right now)
(and yes, all Excel $..$.. (permanent) markers have already been replaced. No need to build that in the formula)
Your negative lookbehind is malformed; it needs to be written as (?<!!). You also need either a word boundary before it or a (?<![A-Z]) lookbehind, to make sure no other letters precede the [A-Z]{1,2}.
So, you may use
'~\b(?<!!)[A-Z]{1,2}[0-9]~'
See the regex demo. Replace with sheet_turnover!$0 where $0 is the whole match value.
Details
\b - a word boundary (it is necessary, or name!AA11 would still get matched)
(?<!!) - no ! immediately to the left of the current location
[A-Z]{1,2} - 1 or 2 letters
[0-9] - a digit.
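A minimal sketch of the replacement in PHP, using the formula from the question. The variable names are illustrative, and the sheet name is assumed to contain no $ or backslash characters, since those have special meaning in a preg_replace replacement string:

```php
<?php
// Hypothetical inputs: in practice both would come from the database row.
$sheetName  = 'sheet_turnover';
$oldFormula = '=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12';

// $0 in the replacement stands for the whole match (the letters plus the first digit).
// The remaining digits of the cell address are left untouched after the match.
$newFormula = preg_replace('~\b(?<!!)[A-Z]{1,2}[0-9]~', $sheetName . '!$0', $oldFormula);

echo $newFormula, PHP_EOL;
```

Cells that already carry a sheet name (A17 and D23 here) are left alone; only T12 and W12 get the prefix.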
Another approach is to match and skip the "wrong" contexts and then match and keep the "right" ones:
'~\w+![A-Z]{1,2}[0-9](*SKIP)(*F)|\b[A-Z]{1,2}[0-9]~'
See this regex demo.
Here, the \w+![A-Z]{1,2}[0-9](*SKIP)(*F)| part matches 1 or more word chars, then a !, then 1 or 2 uppercase ASCII letters and then a digit. (*SKIP)(*F) then discards that match and makes the engine resume the search from the position where the discarded match ended.
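The skip-and-fail variant can be sketched the same way. Again the variable names are illustrative and the sheet name is assumed to be free of $ and backslash characters:

```php
<?php
// Hypothetical inputs, matching the example in the question.
$sheet   = 'sheet_turnover';
$formula = '=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12';

// First alternative: consume "name!A1"-style references and throw them away.
// Second alternative: whatever is left and looks like a cell address gets the prefix.
$pattern  = '~\w+![A-Z]{1,2}[0-9](*SKIP)(*F)|\b[A-Z]{1,2}[0-9]~';
$prefixed = preg_replace($pattern, $sheet . '!$0', $formula);

echo $prefixed, PHP_EOL;
```

For this input the result is the same as with the lookbehind pattern; the difference is only in how already-prefixed references are excluded.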

A test of preg_match is successful but preg_split fails

I am trying to test a means by which I can break apart a single string containing multiple records about scholarly publications. There is nothing so convenient as a meaningful delimiter separating one record from the next. But I believe it could be accomplished, given the pattern that each record ends with a date followed by a comma and a space (unless no additional records follow, in which case it is merely ended with the date), such as "YYYY-MM-DD, ".
I have begun with a simple test involving a string, and confirming that the regular expression recognizes the pattern I am looking for:
$date="2012-09-12, ";
if (preg_match("/^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]), $/",$date))
{
echo("yes");
}else{
echo("no");
}
However, when I try to take it to the next step by using a sample of real data and preg_split(), the split isn't working. I cannot understand why this simple test, adapted from example 1 in the manual, fails to split the string:
<?php
$pubs="L.J. Santodonato, Y. Zhang, M. Feygenson, C.M. Parish, M.C. Gao, R.J. Weber, J.C. Neuefeind, Z. Tang, P.K. Liaw~Deviation from high-entropy configurations in the atomic distributions of a multi-principal-element alloy.~NATURE COMMUNICATIONS~6~2015~~~~0~~0~~2015-11-21, S. Liu, M.C. Gao, P.K. Liaw, Y. Zhang~Microstructures and mechanical properties of AlxCrFeNiTi 0.25 alloys.~JOURNAL OF ALLOYS AND COMPOUNDS~619~2015~610~~~0~~0~~2015-11-21";
$pubsArray = preg_split("/^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]), $/", $pubs);
print_r($pubsArray);
?>
Data matching the same pattern is found within the example string $pubs, but all I ever get back is an array with a single element containing the full string. I have run out of ideas as to what to try next, and would be grateful for any suggestions.
You describe each record as ending with a date followed by a comma and a space. Your pattern is anchored with ^ and $, so it matches only when the entire string is a date followed by ", ". That holds for the short test string but not for $pubs, which is why preg_split() returns the whole string as a single element. Since you are splitting on the occurrence of a date rather than validating it, a simple pattern like /\d{4}(-\d{2}){2}/ is enough; there is no need to enumerate the valid months and days.
To split the string after each date, use the following regex.
Regex: /(?<=\d{4}(-\d{2}){2}),\s*/ uses a lookbehind to find a date, then matches the comma and any whitespace that follow it. Only the comma and whitespace are consumed by the split, so the date of publication stays in each record.
PHP code
<?php
$pubs="L.J. Santodonato, Y. Zhang, M. Feygenson, C.M. Parish, M.C. Gao, R.J. Weber, J.C. Neuefeind, Z. Tang, P.K. Liaw~Deviation from high-entropy configurations in the atomic distributions of a multi-principal-element alloy.~NATURE COMMUNICATIONS~6~2015~~~~0~~0~~2015-11-21, S. Liu, M.C. Gao, P.K. Liaw, Y. Zhang~Microstructures and mechanical properties of AlxCrFeNiTi 0.25 alloys.~JOURNAL OF ALLOYS AND COMPOUNDS~619~2015~610~~~0~~0~~2015-11-21";
$pubsArray = preg_split("/(?<=\d{4}(-\d{2}){2}),\s*/", $pubs);
print_r($pubsArray);
?>
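To see the effect of the ^ and $ anchors in isolation, here is a small sketch with shortened, made-up records in the same shape as the real data. It uses a non-capturing group inside the lookbehind, which is an optional tidy-up rather than a requirement:

```php
<?php
// Shortened, invented records; only the trailing "YYYY-MM-DD, " shape matters here.
$records = 'A. Author, B. Writer~First title.~JOURNAL ONE~6~2015~~2015-11-21, '
         . 'C. Person~Second title.~JOURNAL TWO~619~2015~~2015-11-21';

// Anchored pattern from the question: it can only match if the WHOLE string is a date.
$anchored = preg_split('/^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]), $/', $records);

// Unanchored lookbehind: split on ", " only where a date sits immediately before it.
$split = preg_split('/(?<=\d{4}(?:-\d{2}){2}),\s*/', $records);

print_r($anchored); // one element: the untouched input
print_r($split);    // two elements, each still ending in its date
```

The comma between the two author names is not a split point, because it is not preceded by a date.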

PHP Regexp capturing repeating group of chars, e.g. hahaha jajajaja hihihi

As the title says: is there a way in PHP, with preg_match_all, to catch all repetitions of a group of characters?
For instance, catch:
hahahaha
jajajaj
hihihi
It's fine to catch repetition of any char, like abababab, acacacacac.
Also, is there a way to count the number of repetitions?
The idea is to catch all this "forms" of smiling on social media.
I figured out that there are also other cases, such as misspelled instances like ahahhahaah (where you have two consecutive a or h). Any ideas?
How about this:
preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $str, $m);
$matches = $m[0]; //$matches will contain an array of matches
A bit complicated, but it does work. To explain, the first subpattern (((?i)[a-z])) matches any character between a and z, no matter the case. The second subpattern (((?i)[a-z])) does the same thing. The third subpattern ((\1\2)+) matches one or more repetitions of the first two letters, in the same case as they originally appeared. This regular expression also assumes that the match consists of whole pairs only, so it always has an even number of characters. If you don't want that, you can add \1? at the end, meaning that (as long as it contains one or more repetitions) it can end with the first character (for instance, hahah and ikikikik would both be valid, but not asa).
To retrieve the number of repetitions for a specific match, you can do:
$numb = strlen($matches[$index])/2 - 1; //-1 because the first two letters aren't repetitions
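Put together, the matching and counting steps might look like the sketch below. The sample text and variable names are made up for illustration, and intdiv() is used so the count stays an integer:

```php
<?php
// Invented sample text containing three "laughs" and one non-match.
$text = 'hahaha lol jajajaja hihihi';

preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $text, $m);
$laughs = $m[0]; // full matches only

$counts = [];
foreach ($laughs as $laugh) {
    // Each pair is two characters; the first pair is not itself a repetition.
    $counts[$laugh] = intdiv(strlen($laugh), 2) - 1;
}

print_r($counts);
```

"lol" is not matched because the pair "lo" is never repeated.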
For the shortest repetition (e.g. ha gets repeated multiple times in hahahaha):
(.+?)\1+
For the longest repetition (e.g. haha gets repeated in hahahaha):
(.+)\1+
Counting Repetitions
The non-regex solution is to compare the lengths of Group 1 (the repeated token) and the overall match.
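A short sketch of that length comparison in PHP, run against both the lazy and the greedy pattern (the variable names are illustrative):

```php
<?php
$subject = 'hahahaha';

// Lazy group: the shortest repeating token ("ha").
preg_match('/(.+?)\1+/', $subject, $short);
$shortCount = intdiv(strlen($short[0]), strlen($short[1]));

// Greedy group: the longest repeating token ("haha").
preg_match('/(.+)\1+/', $subject, $long);
$longCount = intdiv(strlen($long[0]), strlen($long[1]));

echo "{$short[1]} occurs {$shortCount} times", PHP_EOL;
echo "{$long[1]} occurs {$longCount} times", PHP_EOL;
```

The quotient is the total number of occurrences of the token, including the first one; subtract 1 if only the repeats should be counted.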
With pure regex, in .NET, you could simply do (.+?)(\1)+ and look at the number of captures in the Group 1 CaptureCollection object.
In PHP, that's not possible, but there are some hacks. See, for instance, this question about matching a line number; it's the same technique. This is for "study purposes" only: you wouldn't want to use that in real life.

"OR" operator in RegEx syntax

OK, I've worked with RegEx numerous times but this is one of the things I honestly can't get my head around. And it looks as if I'm missing something rather simple...
So, let's say we want to match "AB" or "AC". In other words, "A" followed by either "B" OR "C".
This would be expressed like A[BC] or A[B|C] or A(B|C) and so on.
Now, what if A,B,C are not just single letters but sub-expressions?
Please, have a look at this example here (well, I admit it doesn't look that... simple! lol) : http://regexr.com?382a4
I'm trying to match capital = (and its variations) followed by either :
Pattern 1
Pattern 2
Why is it that using the | operator only works on the latter part (my regex also matches "Pattern 2" withOUT preceding capital =). Please note that I've also tried using positive look-arounds, but without any success.
Any ideas?
Your original regex could be summarized as:
capital = (ABC)|(DEF)
This matches capital = ABC or DEF. Add an extra pair of () that wraps the | clause properly.
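The difference in scope can be checked quickly in PHP. This sketch uses the placeholder sub-expressions ABC and DEF from the summary above, not the real patterns from the question:

```php
<?php
// Ungrouped: the alternation splits the WHOLE pattern into
// "capital = (ABC)" on the left and "(DEF)" on the right.
$loose = '/capital = (ABC)|(DEF)/';

// Grouped: the alternation is confined to the part after "capital = ".
$tight = '/capital = (ABC|DEF)/';

$looseMatchesBareDef = preg_match($loose, 'DEF');           // 1: DEF alone is accepted
$tightMatchesBareDef = preg_match($tight, 'DEF');           // 0: the prefix is required
$tightMatchesFull    = preg_match($tight, 'capital = DEF', $m);

var_dump($looseMatchesBareDef, $tightMatchesBareDef, $tightMatchesFull, $m[1]);
```

If the captured value is not needed, a non-capturing group (?:ABC|DEF) limits the alternation in the same way.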
I suppose this regexp:
capital = (ABC|XYZ)
should work (if I understood your request correctly).
Actually [B|C] is incorrect, (B|C) is correct.
Character classes
In RegEx jargon [] is called a character class and it is used to represent one (single) character according to the options listed between the brackets.
In your case [B|C] matches either B or | or C. We can correct this by using [BC] to match either B or C. This matches exactly one character either B or C.
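A small check of that claim in PHP, using the literal letters from the question as test input:

```php
<?php
// Inside a character class the pipe is just another literal character.
$pipeAccepted = preg_match('/A[B|C]/', 'A|'); // 1: "|" is a member of the class
$pipeRejected = preg_match('/A[BC]/', 'A|');  // 0: only B or C may follow A
$letterOk     = preg_match('/A[BC]/', 'AC');  // 1

var_dump($pipeAccepted, $pipeRejected, $letterOk);
```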
Capturing groups
In RegEx jargon () is called a capturing group. It is used to create boundaries between adjacent groups and whatever it matches will be present in the output array of a preg_match or as a variable in preg_replace.
Within that group you can use the | operator to specify that you want to match either whatever's before or whatever's after the operator.
This can be used to match strings of more than one character such as (Ana|Maria) or various structures such as ([a-zA-Z]+|[0-9]+).
You can also use the | outside of a capturing group such as (group-1)|(group-2) and you can also use subgrouping such as ((group-1)|(group-2)).

Regex Capital letter combo

Regex is something of a mystery to me. After searching on SO, I did download Espresso and went through the tutorial, but things still are not clicking for me. It may just be my specific need, but I haven't found any examples. What I want to do is find matches that are exactly two specific letters (capital, lowercase or mixed) followed by a string of numbers. Here are the cases I want to test against:
TL123
TL 123
tl123
tl 123
TLABC123
tlabc123
What I'm then trying to do is preg_replace the results for that match (and ultimately always return TL-123 - for example).
So, any letter or number combo after TL would return TL- and vice-versa. Any nudges in the right direction would be extremely helpful. Thanks!
Edit
It might actually be preg_match_all that I need for this.
To match the specified pattern, you can use:
TL(?:[^0-9]*)(\d+)
This will match a TL followed by anything that isn't a number (or nothing) and then a list of numbers.
You could use this with PHP's preg_replace() like:
$str = preg_replace('/TL(?:[^0-9]*)(\d+)/i', 'TL-$1', $str);
This example, of course, assumes that TL is the exact characters you want to match. If TL is just a placeholder and you could match anything, you could use the following:
$str = preg_replace('/([a-z]{2})(?:[^0-9]*)(\d+)/i', '$1-$2', $str);
With this, I have it hardcoded to only allow 2 characters to match ({2}). You can modify this to any number if you need it to change.
Also, since you want the matched characters to always come out in uppercase but need to match lowercase too, I would suggest just wrapping the result in strtoupper() (instead of using a callback).
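Combining the generic pattern with strtoupper(), a sketch run against the six test cases from the question could look like this. The function name normalizeCode is hypothetical, and each input is assumed to hold a single code, since strtoupper() uppercases the whole string:

```php
<?php
// Hypothetical helper: two letters, any non-digits, then digits => "XX-123".
function normalizeCode(string $code): string
{
    $dashed = preg_replace('/([a-z]{2})[^0-9]*(\d+)/i', '$1-$2', $code);

    return strtoupper($dashed);
}

$inputs  = ['TL123', 'TL 123', 'tl123', 'tl 123', 'TLABC123', 'tlabc123'];
$results = array_map('normalizeCode', $inputs);

print_r($results);
```

Every one of the six inputs comes out as TL-123.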
