PHP preg_split (with the delimiter included) - php

I was trying to include the delimiter while using preg_split but was unsuccessful.
print_r(preg_split('/((?:fy)[.]+)/', 'fy13 eps fy14 rev', -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));
I'm trying to return:
array(
[0] => fy13 eps
[1] => fy14 rev
)
With the flags parameter set to PREG_SPLIT_DELIM_CAPTURE:
If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.
The fy is in parentheses, so I don't know why this doesn't work.

Your current approach isn't working because "parenthesized expression" here refers to capturing groups, and the ?: at the start of your group makes it non-capturing. You can get the fy included by changing your expression to /(fy)/, but I don't think that is what you want: you will get an array that contains fy, 13 eps, fy, and 14 rev (the parenthesized expressions become separate entries in the result).
Instead, try the following:
print_r(preg_split('/(?=fy)/', 'fy13 eps fy14 rev', -1, PREG_SPLIT_NO_EMPTY));
This uses a lookahead to split just before each occurrence of fy in your string.
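A minimal sketch of the lookahead split. The trim step is an assumption added here (the split leaves a trailing space on the first piece); it is not part of the answer above.

```php
<?php
// Split just before each "fy"; PREG_SPLIT_NO_EMPTY drops the empty piece
// that would otherwise appear before the first "fy".
$parts = preg_split('/(?=fy)/', 'fy13 eps fy14 rev', -1, PREG_SPLIT_NO_EMPTY);

// Assumption: surrounding whitespace is not wanted in the result.
$parts = array_map('trim', $parts);

print_r($parts);
```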

With the example you gave, I am not sure that you really need to use the preg_split function. For example you can obtain the same with preg_match_all in a more efficient way (from the perspective of performance):
preg_match_all('/fy(?>[^f]++|f++(?!y\d))*/', 'fy13 eps fy14 rev', $results);
print_r($results);
The idea here is to match fy followed, zero or more times, by either a run of characters other than f or a run of f that is not followed by y and a digit.
(?>...) is an atomic group and ++ is a possessive quantifier; both stop the engine from backtracking into what they have already matched.

Related

regex to convert string 018v-s001v => 18v-s1v but 020v_001 => 20v_001

I'm struggling with a Regex to convert the following strings
018v-s001v => 18v-s1v
018v-s001r => 18v-s1r
018r-s002v => 18r-s2v
020v_001 => 20v_001
020r_002 => 20r_002
0001 => 0001
I managed to convert the first three cases but I'm struggling with the last three: how do I preserve the zeros after _ and all the zeros in the last case?
My attempt: (0*)([1-9]{0,4}[vr]?)((-s)?+([0]{0,2}))?+([1-9][vr])?
https://regex101.com/r/2go5KO/1
For your given examples, you could use
000\d+(*SKIP)(*FAIL)|(?<=\b|[a-z])0+
See a demo on regex101.com.
To get the expected result for your example data you could use preg_replace.
Match one or more zeros with 0+, then capture in a group one or more digits followed by v or r: ([0-9]+[vr]).
Regex
0+([0-9]+[vr])
Replace
Captured group 1: $1
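As a quick check, the pattern 0+([0-9]+[vr]) can be run over the six sample strings from the question. This is only a sketch; the variable names are illustrative.

```php
<?php
$inputs = ['018v-s001v', '018v-s001r', '018r-s002v', '020v_001', '020r_002', '0001'];

// Leading zeros are dropped only in front of a digit run that ends in v or r.
// preg_replace accepts an array of subjects and returns an array.
$outputs = preg_replace('/0+([0-9]+[vr])/', '$1', $inputs);

print_r($outputs);
```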
How about this one:
$result = preg_replace('/(?:(\d{4})|(0)?(\d{2}\w))(?:([-_])(?:(\d{3})|(\w)(0+)(\d+?\w)))?/m',
'$1$3$4$5$6$8', $subject);
This produces all the results you require from your test strings. It wasn't clear where a zero will always appear and where it is only optional, but I'm sure the pattern can be adapted. I also noticed the separator was sometimes a hyphen - and sometimes an underscore _, and it wasn't clear whether that was just your typing or was significant, so I've assumed it could be either.

RegExp in PHP. Get text between first level parentheses

I have two type of strings in one text:
a(bc)de(fg)h
a(bcd(ef)g)h
I need to get text between first level parentheses. In my example this is:
bc
fg
bcd(ef)g
I tried the regular expression /\((.+)\)/ with the Ungreedy (U) flag:
bc
fg
bcd(ef
And without it:
bc)de(fg
bcd(ef)g
Neither variant does what I need. Does anyone know how to solve this?
Use PCRE Recursive pattern to match substrings in nested parentheses:
$str = "a(bc)de(fg)h some text a(bcd(ef)g)h ";
preg_match_all("/\((((?>[^()]+)|(?R))*)\)/", $str, $m);
print_r($m[1]);
The output:
Array
(
[0] => bc
[1] => fg
[2] => bcd(ef)g
)
\( ( (?>[^()]+) | (?R) )* \)
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (i.e. a correctly parenthesized
substring). Finally, there is a closing parenthesis.
Technical cautions:
If there are more than 15 capturing parentheses in a pattern, PCRE has
to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no
memory can be obtained, it saves data for the first 15 capturing
parentheses only, as there is no way to give an out-of-memory error
from within a recursion.
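The same recursive pattern can be written with the x modifier so that each part carries its own comment. This is a sketch: the ~ delimiter and the layout are choices made here, and the test string is the one from the answer above.

```php
<?php
$pattern = '~
    \(                  # an opening parenthesis
    (                   # group 1: everything inside the outer pair
        (?:
            (?> [^()]+ )  # a run of characters that are not parentheses
          |
            (?R)          # or a nested, balanced pair (the whole pattern again)
        )*
    )
    \)                  # the matching closing parenthesis
~x';

preg_match_all($pattern, 'a(bc)de(fg)h some text a(bcd(ef)g)h', $m);
print_r($m[1]);
```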
This question pretty much has the answer, but the implementations are a little ambiguous. You can use the logic in the accepted answer without the ~s to get this regex:
\(((?:[^\(\)]++|(?R))*)\)
Please can you try that:
preg_match("/\((.+)\)/", $input_line, $output_array);
Test this code in http://www.phpliveregex.com/
Regex: \((.+)\)
Input: a(bcd(eaerga(er)gaergf)g)h
Output: array(2
0 => (bcd(eaerga(er)gaergf)g)
1 => bcd(eaerga(er)gaergf)g
)

How to match all words but "stop" in a string by regex

Another regex question. I use PHP and have a string: fdjkaljfdlstopfjdslafdj. You can see there is a stop in the middle. I want to replace everything except that stop. I tried [^stop], but it also excludes the s near the end of the string.
My Solution
Thanks for everyone's help here.
I also figured out a solution that uses only regex (within the scope of my regex knowledge; PCRE verbs are too advanced for me), but it needs two steps. I don't want to mix PHP functions in, because sometimes the job is outside of coding, e.g. multi-renaming files in Total Commander.
Let’s see the string: xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq. For example, I want to keep all foos in this string, and replace any other non-foo greedily into ---.
First, I need to find all the non-foo text between each pair of foos: (?<=foo).+?(?=foo).
The string turns into xxxfoo---foo---foo---foolsjunegpq, so only the non-foo text at both ends is left.
Then use [^-]+(?=foo)|(?<=foo)[^-]+.
This time the result is ---foo---foo---foo---foo---. Everything but foo has been turned into ---.
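The two steps can be reproduced with preg_replace. This is a sketch that assumes, as the second pattern does, that the input contains no - characters of its own.

```php
<?php
$s = 'xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq';

// Step 1: replace whatever sits between two consecutive "foo"s.
$step1 = preg_replace('/(?<=foo).+?(?=foo)/', '---', $s);

// Step 2: replace the leftover text before the first and after the last "foo".
$step2 = preg_replace('/[^-]+(?=foo)|(?<=foo)[^-]+/', '---', $step1);

echo $step1, "\n", $step2, "\n";
```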
i just dont want to include "stop"...
You can skip it by using the PCRE verbs (*SKIP)(*F). Try this:
stop(*SKIP)(*F)|.
Demo at regex101
or sequence: (stop)(*SKIP)(*F)|(?:(?!(?1)).)+
or for words: stop(*SKIP)(*F)|\w+
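A small sketch of the first variant, stop(*SKIP)(*F)|. , used with preg_replace. The replacement character - is chosen only for illustration.

```php
<?php
$s = 'fdjkaljfdlstopfjdslafdj';

// "stop" is matched, then deliberately failed and skipped over;
// every other character is matched by the dot and replaced.
$masked = preg_replace('/stop(*SKIP)(*F)|./', '-', $s);

echo $masked, "\n";
```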
[^stop] doesn't mean any text that is NOT stop. It means any single character that is not one of the 4 characters inside [...], which in this case are s, t, o, p.
Better to split on the text you don't want to match:
$s = 'fdjkaljfdlstopfjdslafdjstopfoobar';
php> $arr = preg_split('/stop/', $s);
php> print_r($arr);
Array
(
[0] => fdjkaljfdl
[1] => fjdslafdj
[2] => foobar
)
You can generalize this to any pattern:
(?<neg>stop)(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|(?&neg))
Demo
Just put the pattern you don't want in the neg group.
This regex will try to do the following for any character position:
Match the pattern you don't want. If it matches, discard it with (*SKIP)(*FAIL) and restart another match at this position.
If the pattern you don't want doesn't match at a particular position, then match anything, until either:
You reach the end of the input string (\Z)
Or the pattern you don't want immediately follows the current matching position ((?&neg))
This approach is slower than a hand-tuned expression. You could get better performance, at the cost of repeating yourself, by writing the unwanted pattern out twice, which avoids the subroutine call:
stop(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|stop)
But of course, the best approach would be to use the features provided by your language: match the string you don't want, then use code to discard it and keep everything else.
In PHP, you can use the PREG_OFFSET_CAPTURE flag to tell the preg_match_all function to provide you the offsets of each match.
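One way this could look in code: a sketch in which segments_outside_matches is a hypothetical helper name, not a built-in function. It matches the unwanted text, then uses the reported offsets to keep everything in between.

```php
<?php
// Hypothetical helper: returns the pieces of $subject that $pattern does not cover.
function segments_outside_matches(string $pattern, string $subject): array
{
    // With PREG_OFFSET_CAPTURE every match is a [text, byte offset] pair.
    preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);

    $segments = [];
    $cursor = 0; // end of the previous unwanted match
    foreach ($matches[0] as [$text, $offset]) {
        if ($offset > $cursor) {
            $segments[] = substr($subject, $cursor, $offset - $cursor);
        }
        $cursor = $offset + strlen($text);
    }
    if ($cursor < strlen($subject)) {
        $segments[] = substr($subject, $cursor); // tail after the last match
    }
    return $segments;
}

print_r(segments_outside_matches('/stop/', 'fdjkaljfdlstopfjdslafdjstopfoobar'));
```

For a fixed pattern like stop this gives the same pieces as preg_split; the offsets become useful when you also need to know where each kept piece came from.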

Combine multiple match regular expression into one and get the matching ones

I have a list of regular expressions:
suresnes|suresne|surenes|surene
pommier|pommiers
^musique$
^(faq|aide)$
^(file )?loss( )?less$
paris
faq <<< this match twice
My use case is that each pattern that matches displays a link to my user,
so several patterns can match at once.
I test those patterns against a simple string of text: "live in paris" / "faq" / "pom"...
The simple way to do it is to loop over all the patterns with preg_match, but I will do that a lot on a performance-critical page, so this looks bad to me.
Here is what I have tried: combining all those expressions into one with group names:
preg_match("#(?P<group1>^(faq|aide|todo|paris)$)|(?P<group2>(paris)$)#im", "paris", $groups);
As you can see, each pattern is grouped: (?P<GROUPNAME>PATTERN) and they are all separated by a pipe |.
The result is not what I expect: only the first matching group is returned. It looks like parsing stops as soon as a match occurs.
What I want is the list of all the matching groups. preg_match_all does not help either.
Thanks!
How about:
preg_match("#(?=(?P<group1>^(faq|aide|todo|paris)$))(?=(?P<group2>(paris)$))#im", "paris", $groups);
print_r($groups);
output:
Array
(
[0] =>
[group1] => paris
[1] => paris
[2] => paris
[group2] => paris
[3] => paris
[4] => paris
)
The (?= ) is called lookahead
Explanation of the regex:
(?= # start lookahead
(?P<group1> # start named group group1
^ # start of string
( # start capture group #2 (the named group is #1)
faq|aide|todo|paris # match any of faq, aide, todo or paris
) # end capture group #2
$ # end of string
) # end of named group group1
) # end of lookahead
(?= # start lookahead
(?P<group2> # start named group group2
( # start capture group #4 (the named group is #3)
paris # paris
) # end capture group #4
$ # end of string
) # end of named group group2
) # end of lookahead
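Note that a chain of plain lookaheads matches only when every one of them succeeds. A possible variation, sketched below, gives each lookahead an empty alternative so that any subset can match. The function name matching_names, the group names and the sample patterns are hypothetical, and the sketch assumes the patterns contain no # (the delimiter used here) and cannot match an empty string.

```php
<?php
// Hypothetical helper: returns the names of all patterns that match $subject.
function matching_names(array $patterns, string $subject): array
{
    $combined = '';
    foreach ($patterns as $name => $p) {
        // Either the lookahead finds the pattern somewhere ahead, or nothing is required.
        $combined .= "(?:(?=.*?(?P<$name>$p))|)";
    }
    if (!preg_match("#\\A$combined#ims", $subject, $g)) {
        return [];
    }
    $hits = [];
    foreach (array_keys($patterns) as $name) {
        // Groups that did not take part are either missing or empty strings.
        if (isset($g[$name]) && $g[$name] !== '') {
            $hits[] = $name;
        }
    }
    return $hits;
}

$patterns = ['city' => 'paris', 'help' => '^(?:faq|aide)$', 'music' => '^musique$'];
print_r(matching_names($patterns, 'live in paris'));
```

Whether this is faster than one preg_match per pattern is not something the sketch settles; it would need benchmarking on real data.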
Try this approach:
#/ define input string
$str_1 = "{STRING HERE}";
#/ Define regex array
$reg_arr = array(
'suresnes|suresne|surenes|surene',
'pommier|pommiers',
'^musique$',
'^(faq|aide)$',
'^(file )?loss( )?less$',
'paris',
'faq'
);
#/ define a callback function to process Regex array
function cb_reg($reg_t)
{
global $str_1;
if(preg_match("/{$reg_t}/ims", $str_1, $matches)){
return $matches[0]; // the full match; replacing each pattern with its match result is the key trick here
// $matches[1] exists only when the pattern has a capturing group that took part in the match
// or you could return the whole $matches array. It's up to you how to do it.
}else{
return '';
}
}
#/ Apply the callback to every pattern with array_map (instead of an explicit loop)
$results = array_map('cb_reg', $reg_arr); //returns regex results
$results = array_diff($results, array('')); //remove empty values returned
Basically, this is the simplest way I could think of.
You can't easily combine, say, hundreds of regexes into one call: the combined pattern would be complex to build and easy to get wrong. This is one straightforward way to do it.
In my opinion, combining a large number of regexes into one (if that can be achieved at all) may well be slower to execute with preg_match than applying a callback to the array, but that should be benchmarked.
Also note:
array_map still calls preg_match once per pattern, so the cost grows linearly with the number of patterns, just as with an explicit loop. Measure both before assuming one is faster.
You can combine all of your regexes with "|" in between them. Then apply this: http://www.rexegg.com/regex-optimizations.html, which will optimize it, collapse common expressions, etc.

Tricky Question: How to order results from a multiple regexes

I currently use 3 different regular expressions in one preg_match, using the or sign | to separate them. This works perfectly. However the first and second regex have the same type of output. e.g. [0] Source Text [1] Number Amount [2] Name - however the last one since it uses a different arrangement of source text results in: [0] Source Text [1] Name [2] Number Amount.
preg_match('/^Guo (\d+) Cars #(\w+)|^AV (\d+) Cars #(\w+)|^#(\w+) (\d+) [#]?av/i', $source, $output);
Since Name can be numeric, I can't do a simple check to see whether a value is numeric. Is there a way I can either switch the order in the regex or identify which regex it matched? Speed is of the essence here, so I didn't want to use 3 separate preg_match statements (and more to come).
Three separate regular expressions don't have to be slower. One big statement can mean a lot of backtracking for the regular expression engine. The key in regular expression optimisation is to make the engine fail ASAP. Did you do some benchmarking with them pulled apart?
In your case you can make use of PCRE's named captures (?<name>match something here) and replace with ${name} instead of \1. I'm not 100% certain this works for preg_replace. I know for certain that preg_match correctly stores named captures, though.
The pattern needs to be compiled with the PCRE_DUPNAMES option for that to be useful in your case (as in RoBorg's post). That option can be enabled inside the pattern itself with (?J).
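You could use named capture groups. Reusing a name is only allowed when duplicate names are enabled, for example with (?J) at the start of the pattern:
preg_match('/(?J)^Guo (?P<number_amount>\d+) Cars #(?P<name>\w+)|^AV (?P<number_amount>\d+) Cars #(?P<name>\w+)|^#(?P<name>\w+) (?P<number_amount>\d+) [#]?av/i', $source, $output);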
I don’t know since what version PCRE supports the duplicate subpattern numbers syntax (?| … ). But try this regular expression:
/^(?|Guo (\d+) Cars #(\w+)|AV (\d+) Cars #(\w+)|#(\w+) (\d+) #?av)/i
So:
$source = '#abc 123 av';
preg_match('/^(?|Guo (\\d+) Cars #(\\w+)|AV (\\d+) Cars #(\\w+)|#(\\w+) (\\d+) #?av)/i', $source, $output);
var_dump($output);
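With the branch-reset group, each alternative still numbers its own groups from left to right, so for the third alternative as written above group 1 holds the name and group 2 the number. One possible way to get the same order everywhere, sketched here as an assumption rather than as part of the answer, is to capture the number first through a lookahead in that alternative. The sample inputs are made up.

```php
<?php
// In the third alternative the lookahead captures the number as group 1
// before (\w+) captures the name as group 2.
$pattern = '/^(?|Guo (\d+) Cars #(\w+)|AV (\d+) Cars #(\w+)|#(?=\w+ (\d+))(\w+) \d+ #?av)/i';

$rows = [];
foreach (['Guo 12 Cars #bob', 'AV 7 Cars #42', '#abc 123 av'] as $source) {
    if (preg_match($pattern, $source, $output)) {
        // $output[1] is always the number amount, $output[2] always the name.
        $rows[] = [$output[1], $output[2]];
    }
}
print_r($rows);
```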
