PHP Regex with parentheses - php

I have the below string:
test 13 (8) end 12 (14:48 IN 1ST)
I need the output to be:
14:48 IN 1ST
or, more generally, everything inside the parentheses at the end of the string.
I don't need the 8 inside the first set of parentheses. The string can contain multiple sets of parentheses; I only want the contents of the last set.

Regex Explanation
.* Greedily consume everything up to the last (
\( Starts with (
([^)]*) Capture 0 or more characters other than )
\) Ends with )
preg_match
$str = "test 13 (8) end 12 () (14:48 IN 1ST) asd";
$regex = "/.*\(([^)]*)\)/";
preg_match($regex,$str,$matches);
$matches
array (
0 => 'test 13 (8) end 12 () (14:48 IN 1ST)',
1 => '14:48 IN 1ST',
)
Accept Empty preg_match_all
$str = "test 13 (8) end 12 () (14:48 IN 1ST) asd";
$regex = "/\(([^)]*)\)/";
preg_match_all($regex,$str,$matches);
$matches
array (
0 =>
array (
0 => '(8)',
1 => '()',
2 => '(14:48 IN 1ST)',
),
1 =>
array (
0 => '8',
1 => '',
2 => '14:48 IN 1ST',
),
)
Don't Accept Empty preg_match_all
$str = "test 13 (8) end 12 () (14:48 IN 1ST) asd";
$regex = "/\(([^)]+)\)/";
preg_match_all($regex,$str,$matches);
$matches
array (
0 =>
array (
0 => '(8)',
1 => '(14:48 IN 1ST)',
),
1 =>
array (
0 => '8',
1 => '14:48 IN 1ST',
),
)
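If only the last group is wanted, the preg_match_all result above can be reduced to its final element. This is a minimal sketch that reuses the sample string and pattern from above; the variable name $last is introduced here purely for illustration.

```php
<?php
$str = "test 13 (8) end 12 () (14:48 IN 1ST) asd";

// Collect every non-empty parenthesized group; $matches[1] holds the captured contents.
preg_match_all('/\(([^)]+)\)/', $str, $matches);

// Take the last captured group, or null when the string has no such group.
$last = $matches[1] ? end($matches[1]) : null;

echo $last, "\n"; // 14:48 IN 1ST
```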

I wouldn't use a regex for this, it's unnecessary.
Use strrpos and substr to extract the string that you need. It's simple, straightforward, and achieves the desired output.
It works by finding the last '(' in the string, and removing one character from the end of the string.
$str = "test 13 (8) end 12 (14:48 IN 1ST)";
echo substr( $str, strrpos( $str, '(') + 1, -1);
$str = "(dont want to capture the (8)) test 13 (8) end 12 (14:48 IN 1ST)";
echo substr( $str, strrpos( $str, '(') + 1, -1);
Demo
Edit: I should also note that my solution will work for all of the following cases:
Empty parentheses
One set of parentheses (i.e. the string before the desired grouping does not contain parentheses)
More than three sets of parentheses (as long as the desired grouping is located at the end of the string)
Any text following the last parenthesis grouping (per the edits below)
Final edit: Again, I cannot emphasize enough that using a regex for this is unnecessary. Here's an example showing that string manipulation is 3x - 7x faster than using a regex.
As per MetaEd's comments, my example / code can easily be modified to ignore text after the last parenthesis.
$str = "test 13 (8) end 12 (14:48 IN 1ST) fkjdsafjdsa";
$beginning = substr( $str, strrpos( $str, '(') + 1);
echo substr( $beginning, 0, strpos( $beginning, ')')) . "\n";
STILL faster than a regex.
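The two string-function steps above can be wrapped in one function that also copes with input lacking parentheses. This is only a sketch: lastParenthesized() is a hypothetical helper name, and returning null for "no group found" is an assumption, not something the answer specifies.

```php
<?php
// Hypothetical helper: returns the text inside the last "(...)" pair, or null if there is none.
function lastParenthesized(string $str): ?string
{
    $open = strrpos($str, '(');
    if ($open === false) {
        return null; // no opening parenthesis at all
    }
    $close = strpos($str, ')', $open);
    if ($close === false) {
        return null; // the last "(" is never closed
    }
    // Everything strictly between the last "(" and the first ")" after it.
    return substr($str, $open + 1, $close - $open - 1);
}

echo lastParenthesized("test 13 (8) end 12 (14:48 IN 1ST) fkjdsafjdsa"), "\n"; // 14:48 IN 1ST
```

Searching for ')' from the position of the last '(' means trailing text after the group is ignored without a second substr() call.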

I would go with the following regex:
.*\(([^)]+)\)

\(.*\) will match the first and last parens. To prevent that, begin with .* which will greedily consume everything up to the final open paren. Then put a capture group around what you want to output, and you have:
.*\((.*)\)
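To see the effect of the leading .* in PHP, the two patterns can be run side by side on the question's sample string. A small sketch; the variable names $m and $wide are illustrative only.

```php
<?php
$str = "test 13 (8) end 12 (14:48 IN 1ST)";

// The leading .* is greedy, so the engine backtracks only as far as the last "(".
if (preg_match('/.*\((.*)\)/', $str, $m)) {
    echo $m[1], "\n"; // 14:48 IN 1ST
}

// Without the leading .*, the match starts at the first "(" and runs to the last ")".
preg_match('/\((.*)\)/', $str, $wide);
echo $wide[1], "\n"; // 8) end 12 (14:48 IN 1ST
```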

This regex will do: .+\((.+?)\)$
Escape the parentheses, make the + non-greedy with ?, and make sure it's at the end of the line.
If there may be characters after it, try this instead:
.\).+\((.+?)\)
Which basically makes sure only the second parentheses will match. I would still prefer the first.

The easiest thing would be to split the string on ')' and then just grab everything from the last item in the resulting array up till '('... I know it's not strictly regex but it's close enough.
"test 13 (8) end 12 (14:48 IN 1ST)".split(/\)/);
This will produce an array with three elements...
"test 13 (8"
and
" end 12 (14:48 IN 1ST"
followed by an empty string, because the input ends with ')'.
Notice that no matter how many (xyz) you have in there you will end up with the last one in the second-to-last array item.
Then you just look through that item for a '(' and if it's there grab everything after it.
I suspect this will work faster than a straight regex approach, but I haven't tested, so can't guarantee that... regardless it does work.
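The snippet in this answer is JavaScript, while the question is about PHP. A rough PHP equivalent of the same idea might look like the following, assuming explode() in place of split(); lastGroupBySplit() is a hypothetical name and the null return for "nothing found" is an assumption.

```php
<?php
// Hypothetical helper illustrating the split-on-")" idea with explode().
function lastGroupBySplit(string $str): ?string
{
    $parts = explode(')', $str);
    array_pop($parts);            // discard whatever follows the final ")"
    if (!$parts) {
        return null;              // the string contains no ")"
    }
    $last = end($parts);          // text leading up to the final ")"
    $open = strrpos($last, '(');
    return $open === false ? null : substr($last, $open + 1);
}

echo lastGroupBySplit("test 13 (8) end 12 (14:48 IN 1ST)"), "\n"; // 14:48 IN 1ST
```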

Related

Split string after each number

I have a database full of strings that I'd like to split into an array. Each string contains a list of directions that begin with a letter (U, D, L, R for Up, Down, Left, Right) and a number to tell how far to go in that direction.
Here is an example of one string.
$string = "U29R45U2L5D2L16";
My desired result:
['U29', 'R45', 'U2', 'L5', 'D2', 'L16']
I thought I could just loop through the string, but I don't know how to tell if the number is one or more digits in length.
You can use preg_split to break up the string, splitting on something which looks like a U,L,D or R followed by numbers and using the PREG_SPLIT_DELIM_CAPTURE to keep the split text:
$string = "U29R45U2L5D2L16";
print_r(preg_split('/([UDLR]\d+)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));
Output:
Array (
[0] => U29
[1] => R45
[2] => U2
[3] => L5
[4] => D2
[5] => L16
)
Demo on 3v4l.org
A regular expression should help you:
<?php
$string = "U29R45U2L5D2L16";
preg_match_all("/[A-Z]\d+/", $string, $matches);
var_dump($matches);
Because this task is about text extraction and not about text validation, you can merely split on the zero-width position after one or more digits. In other words, match one or more digits, then forget them with \K so that they are not consumed while splitting.
Code: (Demo)
$string = "U29R45U2L5D2L16";
var_export(
preg_split(
'/\d+\K/',
$string,
0,
PREG_SPLIT_NO_EMPTY
)
);
Output:
array (
0 => 'U29',
1 => 'R45',
2 => 'U2',
3 => 'L5',
4 => 'D2',
5 => 'L16',
)
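If the direction and the distance are needed as separate values rather than as one token, capture groups can pull them apart in the same pass. This is a sketch under the assumption that each step should become a [direction, distance] pair; the names $sets and $moves are illustrative.

```php
<?php
$string = "U29R45U2L5D2L16";

// PREG_SET_ORDER groups the captures per match: [full match, direction, distance].
preg_match_all('/([UDLR])(\d+)/', $string, $sets, PREG_SET_ORDER);

$moves = [];
foreach ($sets as [, $direction, $distance]) {
    // Cast the digits so the distance can be used in arithmetic directly.
    $moves[] = [$direction, (int) $distance];
}

var_export($moves);
```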

How to preg_split without losing a character?

I have a string like this
$string = "Hello; how are you;Hey, I am fine";
$new = preg_split("/;\w/", $string);
print_r($new);
I am trying to split the string only when there is no white-space between the words and ";". But when I do this, I lose the H from Hey. It's probably because the split happens through the recognition of ;H. Could someone tell me how to prevent this?
My output:
$array = [
0 => [
0 => 'Hello; how are you ',
1 => 0,
],
1 => [
0 => 'ey, I am fine',
1 => 21,
],
]
You might use a word boundary \b:
\b;\b
$string = "Hello; how are you;Hey, I am fine";
$new = preg_split("/\b;\b/", $string);
print_r($new);
Demo
Or a negative lookahead and negative lookbehind
(?<! );(?! )
Demo
Lookarounds cost more steps. In terms of pattern efficiency, a word boundary is better and maintains the intended "no-length" character consumption.
In well-formed English, you won't ever have to check for a space before a semi-colon, so only 1 word boundary seems sufficient (I don't know if malformed English is possible because it is not represented in your sample string).
If you want to acquire the offset value, preg_split() has a flag for that.
Code: (Demo)
$string = "Hello; how are you;Hey, I am fine";
$new = preg_split("/;\b/", $string, -1, PREG_SPLIT_OFFSET_CAPTURE);
var_export($new);
Output:
array (
0 =>
array (
0 => 'Hello; how are you',
1 => 0,
),
1 =>
array (
0 => 'Hey, I am fine',
1 => 19,
),
)
Use split with this regex ;(?=\w) then you will not lose the H
You are consuming the \w in your regex. You don't want that. Therefore, do this:
$new = preg_split("/;(?=\w)/", $string);
Parentheses normally define a capture group, but (?=...) is a lookahead: it checks that the text matches without consuming or capturing it.
Check it out here https://3v4l.org/Q77LZ
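The difference between the original pattern and the suggested ones can be checked side by side. A short sketch using the question's string; the variable names are illustrative.

```php
<?php
$string = "Hello; how are you;Hey, I am fine";

// ";\w" consumes the letter after the semicolon, so the "H" is lost.
$lossy = preg_split('/;\w/', $string);

// ";(?=\w)" only checks that a word character follows; nothing extra is consumed.
$lookahead = preg_split('/;(?=\w)/', $string);

// ";\b" behaves the same here: ";" is a non-word character, so a word boundary
// right after it means a word character follows.
$boundary = preg_split('/;\b/', $string);

var_export($lossy);
var_export($lookahead);
var_export($boundary);
```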

Regex to split string with the last occurrence of a dot, dash or underscore

We have thousands of rows of data containing article numbers in all sorts of formats, and I need to split the main article number from a size indicator. There is (almost) always a dot, dash or underscore before the last few characters (not always 2).
In short: the data is main article number + size indicator; the separator varies but is one of three characters: . - _
Question: how do I split main article number + size indicator? My regex below, which I built based on some Googling, isn't working.
preg_match('/^(.*)[\.-_]([^\.-_]+)$/', $sku, $matches);
Sample data + expected result
AR.110052.15-40 [AR.110052.15 & 40]
BI.533.41-41 [BI.533.41 & 41]
CG.00554.000-39 [CG.00554.000 & 39]
LL.PX00.SC004-40 [LL.PX00.SC004 & 40]
LOS.HAPPYSOCKS.1X [LOS.HAPPYSOCKS & 1X]
MI.PMNH300043-XXXXL [MI.PMNH300043 & XXXXL]
You need to move the - to the end of character class to make the regex engine parse it as a literal hyphen:
^(.*)[._-]([^._-]+)$
See the regex demo. Actually, even ^(.+)[._-](.+)$ will work.
^ - matches the start of string
(.*) - Group 1 capturing any 0+ chars as many as possible up to the last...
[._-] - either . or _ or -
([^._-]+) - Group 2: one or more chars other than ., _ and -
$ - end of string.
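Why the hyphen's position matters can be shown directly: in the middle of a character class it forms a range. The sketch below assumes ASCII input and uses one of the sample SKUs; $broken and $fixed are illustrative names.

```php
<?php
// Inside a character class, ".-_" is a range from "." (0x2E) to "_" (0x5F),
// which covers digits and uppercase letters but not "-" (0x2D).
var_dump(preg_match('/^[\.-_]$/', 'A')); // int(1)
var_dump(preg_match('/^[\.-_]$/', '-')); // int(0)

// With the hyphen last, it is a literal hyphen.
var_dump(preg_match('/^[._-]$/', '-'));  // int(1)

$sku = 'AR.110052.15-40';

// The original pattern fails: "40" falls inside the accidental range,
// so the negated class cannot match it.
$broken = preg_match('/^(.*)[\.-_]([^\.-_]+)$/', $sku);

// The corrected pattern splits at the last separator.
$fixed = preg_match('/^(.*)[._-]([^._-]+)$/', $sku, $matches);
echo $matches[1], ' & ', $matches[2], "\n"; // AR.110052.15 & 40
```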
Use preg_split() instead of preg_match() because:
this isn't a validation task, it is an extraction task and
preg_split() returns the exact desired array compared to preg_match() which carries the unnecessary fullstring match in its returned array.
Limit the number of elements produced (like you would with explode()'s limit parameter).
No capture groups are needed at all.
Greedily match zero or more characters, then just before matching the latest occurring delimiter, restart the fullstring match with \K. This will effectively use the matched delimiter as the character to explode on and it will be "lost" in the explosion.
Code: (Demo)
$strings = [
'AR.110052.15-40',
'BI.533.41-41',
'CG.00554.000-39',
'LL.PX00.SC004-40',
'LOS.HAPPYSOCKS.1X',
'MI.PMNH300043-XXXXL',
];
foreach ($strings as $string) {
var_export(preg_split('~.*\K[._-]~', $string, 2));
echo "\n";
}
Output:
array (
0 => 'AR.110052.15',
1 => '40',
)
array (
0 => 'BI.533.41',
1 => '41',
)
array (
0 => 'CG.00554.000',
1 => '39',
)
array (
0 => 'LL.PX00.SC004',
1 => '40',
)
array (
0 => 'LOS.HAPPYSOCKS',
1 => '1X',
)
array (
0 => 'MI.PMNH300043',
1 => 'XXXXL',
)
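Since the question says a separator is only "almost" always present, a caller may want a predictable shape when none is found. This is a sketch only: splitSku() is a hypothetical helper, and padding the missing size with null is an assumption about what the caller would want.

```php
<?php
// Hypothetical wrapper: always returns [article, size], with size = null when no separator exists.
function splitSku(string $sku): array
{
    // Split once, at the last ".", "_" or "-" in the string.
    $parts = preg_split('~.*\K[._-]~', $sku, 2);

    return array_pad($parts, 2, null);
}

[$article, $size] = splitSku('MI.PMNH300043-XXXXL');
echo $article, ' & ', $size, "\n"; // MI.PMNH300043 & XXXXL
```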

Extracting some content from given string

This is the content piece:
This is content that is a sample.
[md] Special Content Piece [/md]
This is some more content.
What I want is a preg_match_all expression such that it can fetch and give me the following from the above content:
[md] Special Content Piece [/md]
I have tried this:
$pattern ="/\[^[a-zA-Z][0-9\-\_\](.*?)\[\/^[a-zA-Z][0-9\-\_]\]/";
preg_match_all($pattern, $content, $matches);
But it gives a blank array. Could someone help?
$pattern = "/\[md\](.*?)\[\/md\]/";
generally
$pattern = "/\[[a-zA-Z0-9\-\_]+\](.*?)\[\/[a-zA-Z0-9\-\_]+\]/";
or even better
$pattern = "/\[\w+\](.*?)\[\/\w+\]/";
and to match the start tag with the end tag:
$pattern = '/\[(\w+)\](.*?)\[\/\1\]/';
(Just note that the "tag" name is then returned in the match array.)
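A quick run of the backreference variant on the question's content shows what ends up in the match array. One caveat worth noting in this sketch: the pattern is written in single quotes, because in a double-quoted PHP string "\1" is read as an octal escape rather than passed to the regex engine as a backreference.

```php
<?php
$content = "This is content that is a sample.\n"
         . "[md] Special Content Piece [/md]\n"
         . "This is some more content.";

// Single quotes keep \1 intact so PCRE sees it as a backreference to group 1.
$pattern = '/\[(\w+)\](.*?)\[\/\1\]/';

preg_match_all($pattern, $content, $matches);

echo $matches[0][0], "\n"; // [md] Special Content Piece [/md]
echo $matches[1][0], "\n"; // md
```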
You can use this:
$pattern = '~\[([^]]++)]\K[^[]++(?=\[/\1])~';
explanation:
~ #delimiter of the pattern
\[ #literal opening square bracket (must be escaped)
( #open the capture group 1
[^]]++ #all characters that are not ] one or more times
) #close the capture group 1
] #literal closing square bracket (no need to escape)
\K #reset all the match before
[^[]++ #all characters that are not [ one or more times
(?= #open a lookahead assertion (this doesn't consume characters)
\[/ #literal opening square bracket and slash
\1 #back reference to the group 1
] #literal closing square bracket
) #close the lookahead
~
Interest of this pattern:
The result is the whole match because I have reset the match before \K, and because the lookahead assertion after what you are looking for doesn't consume characters and is not part of the match.
The character classes are defined negatively and are therefore shorter to write and permissive (you don't care which characters are inside).
The pattern checks that the opening and closing tags are the same by means of a capture group and a backreference.
Limits:
This expression doesn't deal with nested structures (which you didn't ask for). If you need that, please edit your question.
For nested structures you can use:
(?=(\[([^]]++)](?<content>(?>[^][]++|(?1))*)\[/\2]))
If attributes are allowed in your bbcode:
(?=(\[([^]\s]++)[^]]*+](?<content>(?>[^][]++|(?1))*)\[/\2]))
If self-closing bbcode tags are allowed:
(?=((?:\[([^][]++)](?<content>(?>[^][]++|(?1))*)\[/\2])|\[[^/][^]]*+]))
Notes:
A lookahead means in other words: "followed by"
I use possessive quantifiers (++) instead of simple greedy quantifiers (+) to inform the regex engine that it doesn't need to backtrack (a performance gain), and atomic groups (i.e. (?>..)) for the same reason.
In the patterns for nested structures slashes are not escaped; to use them you must choose a delimiter that is not a slash (~, #, `).
The patterns for nested structures use recursion (i.e. (?1)); you can find more information about this feature here and here.
Update:
If you're likely to be working with nested "tags", I'd probably go for something like this:
$pattern = '/(\[\s*([^\]]++)\s*\])(?=(.*?)(\[\s*\/\s*\2\s*\]))/';
Which, as you can probably tell, is not unlike what CasimiretHippolyte suggested (only his regex, AFAICT, won't capture outer tags in a scenario like the following):
This is content that is a sample.
[md] Special Content [foo]Piece[/foo] [/md]
This is some more content.
Whereas, with this expression, $matches looks like:
array (
0 =>
array (
0 => '[md]',
1 => '[foo]',
),
1 =>
array (
0 => '[md]',
1 => '[foo]',
),
2 =>
array (
0 => 'md',
1 => 'foo',
),
3 =>
array (
0 => ' Special Content [foo]Piece[/foo] ',
1 => 'Piece',
),
4 =>
array (
0 => '[/md]',
1 => '[/foo]',
),
)
A rather simple pattern to match all substrings looking like this [foo]sometext[/foo]
$pattern = '/(\[[^\/\]]+\])([^\]]+)(\[\s*\/\s*[^\]]+\])/';
if (preg_match_all($pattern, $content, $matches))
{
echo '<pre>';
print_r($matches);
echo '</pre>';
}
Output:
array (
0 =>
array (
0 => '[md] Special Content Piece [/md]',
),
1 =>
array (
0 => '[md]',
),
2 =>
array (
0 => ' Special Content Piece ',
),
3 =>
array (
0 => '[/md]',
),
)
How this pattern works: It's divided into three groups.
The first: (\[[^\/\]]+\]) matches an opening [ and a closing ], with everything in between that is neither a closing bracket nor a forward slash.
The second: ([^\]]+) matches every char after the first group that is not ]
The third: (\[\s*\/\s*[^\]]+\]) matches an opening [, followed by zero or more spaces, a forward slash, again followed by zero or more spaces, any other chars that aren't ], and a closing ]
If you want to match a specific end-tag, but keeping the same three groups (with a fourth), use this (slightly more complex) expression:
$pattern = '/(\[\s*([^\]]+?)\s*\])(.+?)(\[\s*\/\s*\2\s*\])/';
This'll return:
array (
0 =>
array (
0 => '[md] Special Content Piece [/md]',
),
1 =>
array (
0 => '[md]',
),
2 =>
array (
0 => 'md',
),
3 =>
array (
0 => ' Special Content Piece ',
),
4 =>
array (
0 => '[/md]',
),
)
Note that group 2 (the one we used in the expression as \2) is the "tagname" itself.

Retain Delimiters when Splitting String

Edit: OK, I can't read, thanks to Col. Shrapnel for the help. If anyone comes here looking for the same thing to be answered...
print_r(preg_split('/([\!|\?|\.|\!\?])/', $string, null, PREG_SPLIT_DELIM_CAPTURE));
Is there any way to split a string on a set of delimiters, and retain the position and character(s) of the delimiter after the split?
For example, using delimiters of ! ? . !? turning this:
$string = 'Hello. A question? How strange! Maybe even surreal!? Who knows.';
into this
array('Hello', '.', 'A question', '?', 'How strange', '!', 'Maybe even surreal', '!?', 'Who knows', '.');
Currently I'm trying to use print_r(preg_split('/([\!|\?|\.|\!\?])/', $string)); to capture the delimiters as a subpattern, but I'm not having much luck.
Your comment sounds like you've found the relevant flag, but your regex was a little off, so I'm going to add this anyway:
preg_split('/(!\?|[!?.])/', $string, null, PREG_SPLIT_DELIM_CAPTURE);
Note that this will leave spaces at the beginning of every string after the first, so you'll probably want to run them all through trim() as well.
Results:
$string = 'Hello. A question? How strange! Maybe even surreal!? Who knows.';
print_r(preg_split('/(!\?|[!?.])/', $string, null, PREG_SPLIT_DELIM_CAPTURE));
Array
(
[0] => Hello
[1] => .
[2] => A question
[3] => ?
[4] => How strange
[5] => !
[6] => Maybe even surreal
[7] => !?
[8] => Who knows
[9] => .
[10] =>
)
From PHP 8.1, passing null as the limit parameter of preg_split() is deprecated because an integer is expected. When seeking unlimited output elements from the return value, it is acceptable to use 0 or -1. (Demo)
To avoid empty elements in the returned array, I recommend PREG_SPLIT_NO_EMPTY as an additional flag. (Demo)
var_export(
preg_split(
'/(!\?|[!?.])/',
$string,
0,
PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
)
);
Since PHP8, it is technically possible to omit the limit parameter and declare flags by using named parameters.
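A sketch of the named-parameter form, assuming PHP 8 or later. The array_map('trim', ...) step is an addition here, applied to remove the leading space that the split leaves on each sentence after the first.

```php
<?php
$string = 'Hello. A question? How strange! Maybe even surreal!? Who knows.';

// PHP 8+: skip $limit entirely and pass $flags by name.
$parts = preg_split(
    '/(!\?|[!?.])/',
    $string,
    flags: PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// Remove the leading space left on each sentence after the first.
$parts = array_map('trim', $parts);

var_export($parts);
```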
Simply add the PREG_SPLIT_DELIM_CAPTURE to the preg_split function:
$str = 'Hello. A question? How strange!';
$var = preg_split('/([!?.])/', $str, 0, PREG_SPLIT_DELIM_CAPTURE);
$var = array(
0 => "Hello",
1 => ".",
2 => " A question",
3 => "?",
4 => " How strange",
5 => "!",
6 => "",
);
You can also split on the space after a ., !, ? or !?. But this can only be used if you can guarantee that there is a space after such a character.
You can do this by matching a space preceded by a positive lookbehind, (?<=\.|\?|!), which makes the regex
'/(?<=\.|\?|!) /'
And then, you'll have to check whether each resulting string ends with !?: if so, split off the last two characters. If not, split off the last character.
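The lookbehind approach and the follow-up step of splitting off the trailing delimiter could be combined roughly as follows. This is a sketch, assuming every delimiter inside the string is followed by a single space; using a second regex to peel off the delimiter (instead of substr checks) is a substitution made here for brevity.

```php
<?php
$string = 'Hello. A question? How strange! Maybe even surreal!? Who knows.';

// Split on a space only when it directly follows ".", "?" or "!".
$sentences = preg_split('/(?<=\.|\?|!) /', $string);

$result = [];
foreach ($sentences as $sentence) {
    // Peel the trailing delimiter off; "!?" is tried before the single characters.
    if (preg_match('/^(.*?)(!\?|[!?.])$/', $sentence, $m)) {
        $result[] = $m[1];
        $result[] = $m[2];
    } else {
        $result[] = $sentence; // no trailing delimiter
    }
}

var_export($result);
```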
