PHP regex incorrect - php

I am trying to extract all strings that look like 12-15 from a parent string. This means all strings that have a dash in between two digits.
Using this answer as a basis, I tried the following:
<?php
$str = "34,56,67-90,45";
preg_match('/^(\d-\d)|(,\d-\d)|(\d-\d,)|(,\d-\d,)$/', $str, $output, PREG_OFFSET_CAPTURE);
print_r($output);
?>
This looks for any substring that looks like a dash enclosed between digits, whether it has a comma before, after, both, or neither. When I run the PHP code, I get an empty array. On Regex101, when I test the regular expression, strings like 4-5,,,,, seem to match, and I don't understand why it's letting me add extra commas.
What's wrong with my regex that I get an empty array?

I think you could use a simple regex like this
\d+[-]\d+
That is (match at least 1 digit) (match a literal dash) (match at least 1 digit)
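A minimal sketch of how that pattern could be applied, assuming the goal is to collect every range in the string rather than only the first one. It uses preg_match_all() instead of preg_match(); the second input string is made up for illustration.

```php
<?php
// Sample input from the question.
$str = "34,56,67-90,45";

// \d+ matches one or more digits, so multi-digit numbers are covered.
// preg_match_all() collects every match instead of stopping at the first.
preg_match_all('/\d+-\d+/', $str, $matches);

// $matches[0] holds the full matches.
print_r($matches[0]);

// A second, made-up input containing several ranges.
preg_match_all('/\d+-\d+/', "1-2,30,12-15", $more);
print_r($more[0]);
```

Note that this simple pattern does not check what surrounds the range, so it would also pick 12-15 out of something like a12-15b.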

\d matches a single digit. All the numbers in your sample string have two digits. You should use \d+ to match any number of digits.
preg_match('/^(\d+-\d+)|(,\d+-\d+)|(\d+-\d+,)|(,\d+-\d+,)$/', $str, $output, PREG_OFFSET_CAPTURE);
Output:
Array
(
[0] => Array
(
[0] => ,67-90
[1] => 5
)
[1] => Array
(
[0] =>
[1] => -1
)
[2] => Array
(
[0] => ,67-90
[1] => 5
)
)
You can also simplify the regexp:
preg_match('/(?:^|,)\d+-\d+(?:,|$)/', $str, $output, PREG_OFFSET_CAPTURE);
Output:
Array
(
[0] => Array
(
[0] => ,67-90,
[1] => 5
)
)

The | operator has the lowest precedence, so the ^ and $ anchors apply only to the first and last alternatives. Your expression is interpreted as "match any one of the following":
START of text -> 1 digit -> dash -> 1 digit (not matching end of text)
Comma (may be in the middle of the text, anywhere) -> 1 digit -> dash -> 1 digit
1 digit (anywhere) -> dash -> 1 digit -> comma
comma (anywhere) -> 1 digit -> dash -> 1 digit -> comma -> END of text
Also, you are using \d, which matches exactly 1 digit (a single character). You can use \d{2} to match 2 digits (00 to 99), or \d+ to match any integer (1, 55, 123456, etc.).
In your case, I think you're trying to use this expression:
/(?:^|,)(\d+-\d+)(?=,|$)/
which means: START of text OR comma -> any integer -> dash -> any integer -> followed by (but not consumed in the match) a comma OR END of text
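To see why the lookahead matters, here is a small sketch comparing a version that consumes the trailing comma with the lookahead version. The input string is made up so that two ranges sit next to each other; the variable names are illustrative.

```php
<?php
// Made-up input with two adjacent ranges.
$str = "1-2,3-4,56,67-90";

// Consuming version: the trailing comma becomes part of each match,
// so the next range has no leading comma left to match against.
preg_match_all('/(?:^|,)\d+-\d+(?:,|$)/', $str, $consuming);

// Lookahead version: the trailing comma is only checked, not consumed.
preg_match_all('/(?:^|,)(\d+-\d+)(?=,|$)/', $str, $lookahead);

print_r($consuming[0]); // full matches, with the commas included
print_r($lookahead[1]); // group 1 only: the ranges themselves
```

With the consuming pattern the range 3-4 is skipped, while the lookahead pattern finds all three ranges.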

Related

Regex: Capturing multiple instances in one word group

I'm not good at Regex and I've been trying for hours now so I hope you can help me. I have this text:
&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*
I need to capture (using PREG_OFFSET_CAPTURE) only the &#x271D; in a word surrounded by *, so I only need to capture the last three &#x271D; in this example. The output array should look something like this:
[0] => Array
(
[0] => &#x271D;
[1] => 17
)
[1] => Array
(
[0] => &#x271D;
[1] => 32
)
[2] => Array
(
[0] => &#x271D;
[1] => 44
)
I've tried using (&#x271D;) but of course this will select all instances, including those in words without asterisks. Then I've tried \*[^ ]*(&#x271D;)[^ ]*\* but this only gives me the last instance in a word. I've tried many other variations but all were wrong.
To clarify: the asterisks can be anywhere in the string, but always at the beginning and end of a word. The opening asterisk is always preceded by a space, except at the beginning of the string, and the closing asterisk is always followed by a space, except at the end of the string. I must add that punctuation marks can be inside these asterisks. &#x271D; is exactly (and only) what I need to capture and can be at any position in a word.
You could make use of the \G anchor to get iterative matches between the *. The anchor matches either at the start of the string, or at the end of the previous match.
(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)
Explanation
(?: Non capture group
\* Match *
| Or
\G(?!^) Assert the end of the previous match, not at the start
) Close non capture group
[^&*]* Match 0+ times any char except & and *
(?> Atomic group
&(?!#) Match & only when not directly followed by #
[^&*]* Match 0+ times any char except & and *
)* Close atomic group and repeat 0+ times
\K Clear the match buffer (forget what is matched until now)
&#x271D; Match literally
(?=[^*]*\*) Positive lookahead, assert a * at the right
Regex demo | Php demo
For example
$re = '/(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)/m';
$str = '&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*';
preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);
print_r($matches[0]);
Output
Array
(
[0] => Array
(
[0] => &#x271D;
[1] => 16
)
[1] => Array
(
[0] => &#x271D;
[1] => 31
)
[2] => Array
(
[0] => &#x271D;
[1] => 43
)
)
Note that the offset is 1 less than expected because string offsets start counting at 0. See PREG_OFFSET_CAPTURE.
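A short sketch of how the offsets behave, using a made-up string rather than the one from the question. PREG_OFFSET_CAPTURE reports offsets in bytes, starting at 0, so a multibyte character in front of a match shifts the offset by more than one. The "\u{...}" escape assumes PHP 7.0 or later.

```php
<?php
// "\u{00E9}" is é, which takes two bytes in UTF-8.
$str = "\u{00E9}*x*";

preg_match_all('/\*/', $str, $m, PREG_OFFSET_CAPTURE);

// Each entry in $m[0] is [matched text, byte offset]; offsets start at 0.
$offsets = array_column($m[0], 1);
print_r($offsets);
```

The first asterisk is the second character of the string, but its reported offset is 2 because é occupies bytes 0 and 1.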
If you want to match more variations, you could use a non capturing group and list the ones that you would accept to match. If you don't want to cross newline boundaries you can exclude matching those in the negated character class.
(?:\*|\G(?!^))[^&*\r\n]*(?>&(?!#)[^&*\r\n]*)*\K&#(?:x271D|169);(?=[^*\r\n]*\*)
Regex demo

Validate url parameters with preg_match

Valid example
12[red,green],13[xs,xl,xxl,some other text with chars like _&-##%]
number[anythingBut ()[]{},anythingBut ()[]{}](,number[anythingBut ()[]{},anythingBut ()[]{}]) or nothing
Full match 12[red,green]
Group 1 12
Group 2 red,green
Full match 13[xs,xl,xxl,some other text with chars like _&-##%]
Group 1 13
Group 2 xs,xl,xxl,some other text with chars like _&-##%
Not valid example
13[xs,xl,xxl 9974-?ds12[dfgd,dfgd]]
What I tried is this: (\d+(?=\[))\[([^\(\[\{\}\]\)]+)\], regex101 link with what I tried, but this also matches wrong input like given in the example.
If you just need to validate the input, you can add some anchors:
^(?:\d+\[[^\(\[\{\}\]\)]+\](?:,|$))+$
Regex101
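As a rough sketch, the anchored pattern could be checked against the two examples from the question like this. The variable names are illustrative.

```php
<?php
// Anchored pattern from the answer: the whole string must consist of
// number[...] items separated by commas.
$pattern = '/^(?:\d+\[[^\(\[\{\}\]\)]+\](?:,|$))+$/';

$valid   = '12[red,green],13[xs,xl,xxl,some other text with chars like _&-##%]';
$invalid = '13[xs,xl,xxl 9974-?ds12[dfgd,dfgd]]';

// preg_match() returns 1 on a match and 0 when there is no match.
$validResult   = preg_match($pattern, $valid);
$invalidResult = preg_match($pattern, $invalid);

var_dump($validResult, $invalidResult);
```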
If you also need to get all the matching parts, you can use another regex. Using only one will not work well.
$in = '12[red,green],13[xs,xl,xxl,some other text with chars like _&-##%],13[xs,xl,xxl 9974-?ds12[dfgd,dfgd]]';
preg_match_all('/(\d+)\[([^][{}()]+)(?=\](?:,|$))/', $in, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => 12[red,green
[1] => 13[xs,xl,xxl,some other text with chars like _&-##%
)
[1] => Array
(
[0] => 12
[1] => 13
)
[2] => Array
(
[0] => red,green
[1] => xs,xl,xxl,some other text with chars like _&-##%
)
)
Explanation:
/ : regex delimiter
(\d+) : group 1, 1 or more digits
\[ : open square bracket
( : start group 2
[^][{}()]+ : 1 or more characters that are not parentheses, curly braces or square brackets
) : end group 2
(?= : positive lookahead, make sure we have after
\] : a close square bracket
(?:,|$) : non capture group, a comma or end of string
) : end lookahead
/ : regex delimiter
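Building on the extraction pattern explained above, here is a sketch of one way to turn the matches into a lookup array. It assumes the input has already been validated; $parsed and the use of explode() are my own additions, not part of the answer.

```php
<?php
$in = '12[red,green],13[xs,xl,xxl,some other text with chars like _&-##%]';

// PREG_SET_ORDER groups the results per match: [full match, group 1, group 2].
preg_match_all('/(\d+)\[([^][{}()]+)(?=\](?:,|$))/', $in, $sets, PREG_SET_ORDER);

$parsed = [];
foreach ($sets as $set) {
    // Group 1 is the number, group 2 the comma-separated values.
    $parsed[$set[1]] = explode(',', $set[2]);
}

print_r($parsed);
```

PHP converts numeric string keys such as "12" to integer keys, so $parsed[12] holds the values of the first item.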

need some help on regex in preg_match_all()

I need to extract the ticket number "Ticket#999999" from a string. How do I do this using regex?
My current regex works if there is more than one digit after Ticket#, as in Ticket#9999, but it does not work if I only have Ticket#9. Please help.
current regex.
preg_match_all('/(Ticket#[0-9])\w\d+/i',$data,$matches);
thank you.
In your pattern [0-9] matches 1 digit, \w matches one more word character (a digit in your data) and \d+ matches 1+ digits, thus requiring at least 3 characters after #.
Use
preg_match_all('/Ticket#([0-9]+)/i',$data,$matches);
This will match:
Ticket# - a literal string Ticket#
([0-9]+) - Group 1 capturing 1 or more digits.
PHP demo:
$data = "Ticket#999999 ticket#9";
preg_match_all('/Ticket#([0-9]+)/i',$data,$matches, PREG_SET_ORDER);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => Ticket#999999
[1] => 999999
)
[1] => Array
(
[0] => ticket#9
[1] => 9
)
)
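If only the numbers are needed, a small variation could read them straight from group 1. This is a sketch with a made-up input string that also contains a Ticket# without digits, to show that it is skipped.

```php
<?php
// Made-up input: two valid tickets, one "Ticket#" with no digits,
// and one ticket number followed by a letter.
$data = "Ticket#999999 ticket#9 Ticket# ticket#42a";

// Default order (PREG_PATTERN_ORDER): $matches[1] lists every group 1 capture.
preg_match_all('/Ticket#([0-9]+)/i', $data, $matches);

$numbers = $matches[1];
print_r($numbers);
```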

Regex to split string with the last occurrence of a dot, colon or underscore

We have thousands of rows of data containing article numbers in all sorts of formats, and I need to split off the main article number from a size indicator. There is (almost) always a dot, dash or underscore before the last few characters (not always 2).
In short: the data is main article number + size indicator; the separator differs but is always one of the 3 characters .-_
Question: how do I split main article number + size indicator? My regex below, which I built based on some Googling, isn't working.
preg_match('/^(.*)[\.-_]([^\.-_]+)$/', $sku, $matches);
Sample data + expected result
AR.110052.15-40 [AR.110052.15 & 40]
BI.533.41-41 [BI.533.41 & 41]
CG.00554.000-39 [CG.00554.000 & 39]
LL.PX00.SC004-40 [LL.PX00.SC004 & 40]
LOS.HAPPYSOCKS.1X [LOS.HAPPYSOCKS & 1X]
MI.PMNH300043-XXXXL [MI.PMNH300043 & XXXXL]
You need to move the - to the end of character class to make the regex engine parse it as a literal hyphen:
^(.*)[._-]([^._-]+)$
See the regex demo. Actually, even ^(.+)[._-](.+)$ will work.
^ - matches the start of string
(.*) - Group 1 capturing any 0+ chars as many as possible up to the last...
[._-] - either . or _ or -
([^._-]+) - Group 2: one or more chars other than ., _ and -
$ - end of string.
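A small sketch of why the hyphen position matters. In [\.-_] the hyphen creates a range from "." (0x2E) to "_" (0x5F), which covers digits and uppercase letters but not "-" itself (0x2D). The variable names are illustrative.

```php
<?php
// The accidental range . to _ does not contain the hyphen...
$rangeMatchesHyphen = preg_match('/[\.-_]/', '-');
// ...but it does contain digits.
$rangeMatchesDigit  = preg_match('/[\.-_]/', '4');
// With the hyphen at the end of the class it is a literal character.
$fixedMatchesHyphen = preg_match('/[._-]/', '-');

$sku = 'AR.110052.15-40';

// Original pattern: the negated range rejects digits, so "40" cannot match.
$broken = preg_match('/^(.*)[\.-_]([^\.-_]+)$/', $sku, $brokenMatch);

// Corrected pattern: splits at the last ., _ or -.
$fixed = preg_match('/^(.*)[._-]([^._-]+)$/', $sku, $m);

var_dump($rangeMatchesHyphen, $rangeMatchesDigit, $fixedMatchesHyphen, $broken);
print_r($m);
```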
Use preg_split() instead of preg_match() because:
this isn't a validation task, it is an extraction task and
preg_split() returns the exact desired array compared to preg_match() which carries the unnecessary fullstring match in its returned array.
Limit the number of elements produced (like you would with explode()'s limit parameter).
No capture groups are needed at all.
Greedily match zero or more characters, then just before matching the latest occurring delimiter, restart the fullstring match with \K. This will effectively use the matched delimiter as the character to explode on and it will be "lost" in the explosion.
Code: (Demo)
$strings = [
'AR.110052.15-40',
'BI.533.41-41',
'CG.00554.000-39',
'LL.PX00.SC004-40',
'LOS.HAPPYSOCKS.1X',
'MI.PMNH300043-XXXXL',
];
foreach ($strings as $string) {
var_export(preg_split('~.*\K[._-]~', $string, 2));
echo "\n";
}
Output:
array (
0 => 'AR.110052.15',
1 => '40',
)
array (
0 => 'BI.533.41',
1 => '41',
)
array (
0 => 'CG.00554.000',
1 => '39',
)
array (
0 => 'LL.PX00.SC004',
1 => '40',
)
array (
0 => 'LOS.HAPPYSOCKS',
1 => '1X',
)
array (
0 => 'MI.PMNH300043',
1 => 'XXXXL',
)

split string by spaces and colon but not if inside quotes

having a string like this:
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
the desired result is:
[0] => Array (
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
what I get with:
preg_match_all("/\'(?:[^()]|(?R))+\'|'[^']*'|[^(),\s]+/", $str, $m);
is:
[0] => Array (
[0] => dateto:'2015-10-07
[1] => 15:05'
[2] => xxxx
[3] => datefrom:'2015-10-09
[4] => 15:05'
[5] => yyyy
[6] => asdf
)
Also tried with preg_split("/[\s]+/", $str) but no clue how to escape if value is between quotes. Can anyone show me how and also please explain the regex. Thank you!
I would use PCRE verb (*SKIP)(*F),
preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);
DEMO
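For reference, a runnable sketch of that call with the string from the question. The idea is that a quoted part is matched first and then discarded by (*SKIP)(*F), so whitespace inside quotes is never offered to the \s+ branch.

```php
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";

// '[^']*'(*SKIP)(*F) : match a quoted part, then force a failure and skip past it
// \s+                : otherwise split on runs of whitespace
$parts = preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);

print_r($parts);
```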
Often, when you want to split a string, preg_split isn't the best approach (that seems a little counterintuitive, but it's true most of the time). A more efficient way is to find all items (with preg_match_all) using a pattern that describes everything that is not the delimiter (whitespace here):
$pattern = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;
if (preg_match_all($pattern, $str, $m))
$result = $m[0];
pattern details:
~ # pattern delimiter
(?=\S) # the lookahead assertion only succeeds if there is a non-
# white-space character at the current position.
# (This lookahead is useful for two reasons:
# - it allows the regex engine to quickly find the start of
# the next item without having to test each branch of the
# following alternation at each position in the strings
# until one succeeds.
# - it ensures that there's at least one non-white-space.
# Without it, the pattern may match an empty string.
# )
[^'"\s]* #"'# all that is not a quote or a white-space
(?: # optional quoted parts
'[^']*' [^'"\s]* #"# single quotes
|
"[^"]*" [^'"\s]* # double quotes
)*
~
demo
Note that with this slightly longer pattern, the five items of your example string are found in only 60 steps. You can also use this shorter, simpler pattern:
~(?:[^'"\s]+|'[^']*'|"[^"]*")+~
but it's a little less efficient.
For your example, you can use preg_split with negative lookbehind (?<!\d), i.e.:
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$matches = preg_split('/(?<!\d)(\s)/', $str);
print_r($matches);
Output:
Array
(
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
Demo:
http://ideone.com/EP06Nt
Regex Explanation:
(?<!\d)(\s)
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\d)»
Match a single character that is a “digit” «\d»
Match the regex below and capture its match into backreference number 1 «(\s)»
Match a single character that is a “whitespace character” «\s»
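Keep in mind that this lookbehind relies on the quoted spaces in the example always following a digit. A quick sketch with a made-up second input shows what happens when that assumption does not hold:

```php
<?php
$pattern = '/(?<!\d)(\s)/';

// The string from the question: the only quoted spaces follow a digit.
$sample = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$sampleParts = preg_split($pattern, $sample);

// Made-up input: an unquoted space after a digit and a quoted space after a letter.
$other = "id:42 name:'x y'";
$otherParts = preg_split($pattern, $other);

print_r($sampleParts);
print_r($otherParts);
```

For the second input the split happens inside the quotes and not after 42, so this approach only fits data shaped like the example.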
