Capture all occurrences of repeated formatted substrings - php

I've a string that follows this pattern [:it]Stringa in italiano[:en]String in english.
I'm trying to use preg_match_all() to capture the locales and the associated strings, ie:
[1] => 'it',
[2] => 'en',
...
[1] => 'Stringa in italiano',
[2] => 'String in english'
The regex that I'm using, "/\[:(\w+)](.+?)(?=\[:\w+])/" (https://regex101.com/r/eZ1gT7/400), returns only the first group of data. What am I doing wrong?

The final formatted segment will not satisfy your lookahead because nothing follows it. You will need to include the option of matching the position of the end of the string with an alternation. A pipe (|) means "or". A dollar symbol ($) means "end of string".
I am using negated character classes to match between literal square braces. If your \w is sufficient for your project, feel free to keep that portion as you originally posted.
Code: (Demo)
$string = '[:it]Stringa in italiano[:en]String in english';
preg_match_all('~\[:([^]]+)](.+?)(?=$|\[:[^]]+])~', $string, $m);
var_export($m);
Output:
array (
0 =>
array (
0 => '[:it]Stringa in italiano',
1 => '[:en]String in english',
),
1 =>
array (
0 => 'it',
1 => 'en',
),
2 =>
array (
0 => 'Stringa in italiano',
1 => 'String in english',
),
)
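If the end goal is a locale-to-text lookup, the two capture groups can be paired up with array_combine(). This is only a minimal sketch, assuming each locale code appears once in the string; the function name parseLocalizedString is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: turns "[:xx]text[:yy]text" into ['xx' => 'text', 'yy' => 'text'].
function parseLocalizedString(string $input): array
{
    // Group 1 is the locale code, group 2 is the text that follows it.
    // The lookahead ends each text at the next "[:xx]" marker or at the end of the string.
    $count = preg_match_all('~\[:([^]]+)](.+?)(?=$|\[:[^]]+])~', $input, $m);
    if (!$count) {
        return []; // no segments found (or the pattern failed)
    }
    return array_combine($m[1], $m[2]);
}

var_export(parseLocalizedString('[:it]Stringa in italiano[:en]String in english'));
```

If the same locale could occur twice, array_combine() would keep only the last text for it, so that case would need separate handling.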

Related

How Can I Split a String in One Character Increments but also Keep The String between Brackets Intact?

So, say I have string like:
abcd[efg]hi[jkl]m
What I want is to split the string in such a way that each character is placed in its own index in an array, except for the characters within the brackets. They should remain grouped together. In other words, I want to create an array from the input string as such:
[0] => a
[1] => b
[2] => c
[3] => d
[4] => efg
[5] => h
[6] => i
[7] => jkl
[8] => m
I know I can split the input string in the way shown below using preg_split('/[\[]*[\][]/U', $value, -1):
[0] => abcd
[1] => efg
[2] => hi
[3] => jkl
[4] => m
However, I do not know the regular expression I'd need to use to get me exactly what I want. What regular expression will give me my desired solution?
Effectively, you split on every "optionally occurring" square-braced expression. IOW, if a square braced expression is encountered, split on the whole expression (without consuming it) and if there is no expression encountered, then split on that position in the string.
The pattern can look like either of these: (Regex Demo)
~\[([^\]]+)]|~ # (shorter pattern, 44 steps on the sample input string)
or
~(?:\[([^\]]+)])?~ # (longer pattern, 64 steps on the sample input string)
Both preg flags are essential to ensure that no substrings are lost and no empty elements are generated during the process.
Code: (Demo)
$string = 'abcd[efg]hi[jkl]m';
var_export(
preg_split(
'~\[([^\]]+)]|~',
$string,
0,
PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE
)
);
Output:
array (
0 => 'a',
1 => 'b',
2 => 'c',
3 => 'd',
4 => 'efg',
5 => 'h',
6 => 'i',
7 => 'jkl',
8 => 'm',
)
Pattern Breakdown:
~ #starting pattern delimiter
\[ #literal opening square brace
( #open capturing group
[^\]]+ #match one or more non-closing square braces
) #close capturing group
] #literal closing square brace
| #OR operator, to allow a match on current position (not a brace expression)
~ #ending pattern delimiter
The literal braces are written outside of the inner capture group so that they are discarded/consumed during the splitting. Normally, the inner capture group would be consumed as well, but the PREG_SPLIT_DELIM_CAPTURE flag demands that that substring should remain after the splitting.
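For comparison, the same result can probably be reached by matching tokens rather than splitting. The sketch below swaps in a different technique (preg_match_all() with a branch-reset group) and is not the method from the answer above; the function name splitKeepingBrackets is made up for illustration:

```php
<?php
// Alternative sketch: collect tokens with preg_match_all() instead of preg_split().
function splitKeepingBrackets(string $input): array
{
    // (?| ... ) is a branch-reset group, so both alternatives write to group 1:
    // either the text between square braces, or one single character.
    preg_match_all('~(?|\[([^\]]+)]|(.))~', $input, $m);
    return $m[1];
}

var_export(splitKeepingBrackets('abcd[efg]hi[jkl]m'));
```

One difference to be aware of: an empty pair of braces ("[]") does not satisfy the first alternative, so here the two brace characters would come back as separate single-character tokens.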

Regex to split string into array of numbers and characters using PHP

I have an arithmetic string that will be similar to the following pattern.
a. 1+2+3
b. 2/1*100
c. 1+2+3/3*100
d. (1*2)/(3*4)*100
Points to note are that
1. the string will never contain spaces.
2. the string will always be a combination of Numbers, Arithmetic symbols (+, -, *, /) and the characters '(' and ')'
I am looking for a regex in PHP to split the characters based on their type and form an array of individual string characters like below.
(Note: I cannot use str_split because I do not want multi-digit numbers (10 or greater) to be split.)
a. 1+2+3
output => [
0 => '1'
1 => '+'
2 => '2'
3 => '+'
4 => '3'
]
b. 2/1*100
output => [
0 => '2'
1 => '/'
2 => '1'
3 => '*'
4 => '100'
]
c. 1+2+3/3*100
output => [
0 => '1'
1 => '+'
2 => '2'
3 => '+'
4 => '3'
5 => '/'
6 => '3'
7 => '*'
8 => '100'
]
d. (1*2)/(3*4)*100
output => [
0 => '('
1 => '1'
2 => '*'
3 => '2'
4 => ')'
5 => '/'
6 => '('
7 => '3'
8 => '*'
9 => '4'
10 => ')'
11 => '*'
12 => '100'
]
Thank you very much in advance.
Use this regex :
(?<=[()\/*+-])(?=[0-9()])|(?<=[0-9()])(?=[()\/*+-])
It will match every position between a digit or a parenthesis and an operator or a parenthesis.
(?<=[()\/*+-])(?=[0-9()]) matches the position with a parenthesis or an operator at the left and a digit or parenthesis at the right
(?<=[0-9()])(?=[()\/*+-]) is the same but with left and right reversed.
Demo here
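A minimal sketch of how that lookaround pattern could be fed to preg_split(). The function name splitExpression is hypothetical, and ~ is used as the delimiter here so the slash inside the character classes does not need escaping:

```php
<?php
// Hypothetical wrapper around the lookaround pattern from the answer above.
function splitExpression(string $expression): array
{
    // Both alternatives are zero-width, so nothing is consumed:
    // the string is only cut at the positions between tokens.
    return preg_split(
        '~(?<=[()/*+-])(?=[0-9()])|(?<=[0-9()])(?=[()/*+-])~',
        $expression
    );
}

var_export(splitExpression('(1*2)/(3*4)*100'));
```

Because the pattern never splits between two operators, an input such as 1+-2 would keep "+-" together; the question rules that case out.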
Since you state that the expressions are "clean", no spaces or such, you could split on
\b|(?<=\W)(?=\W)
It splits on all word boundaries and boundaries between non word characters (using positive lookarounds matching a position between two non word characters).
See an illustration here at regex101
As I said, I will help you with that if you can provide some work you did by yourself to solve that problem.
However, if your objective in crafting a one-dimensional array out of an arithmetic expression is to parse and compute that array, then you should build a tree instead and organise it as a hierarchy, with the operators as nodes and the operands as branches:
'(1*2)/(3*4)*100'
Array
(
[operand] => '*',
[left] => Array
(
[operand] => '/',
[left] => Array
(
[operand] => '*',
[left] => 1,
[right] => 2
),
[right] => Array
(
[operand] => '*',
[left] => 3,
[right] => 4
)
),
[right] => 100
)
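To show why the tree shape is convenient, here is a short sketch of a recursive evaluator for an array laid out as above. It assumes the same keys as in that example (operand, left, right) and only the four basic operators; evaluateTree is a hypothetical name, and building the tree from the string is not covered here:

```php
<?php
// Hypothetical evaluator for a tree shaped like the example above.
function evaluateTree($node)
{
    // A leaf is a plain number.
    if (!is_array($node)) {
        return $node;
    }
    // Evaluate both branches first, then apply this node's operator.
    $left = evaluateTree($node['left']);
    $right = evaluateTree($node['right']);
    switch ($node['operand']) {
        case '+': return $left + $right;
        case '-': return $left - $right;
        case '*': return $left * $right;
        case '/': return $left / $right;
    }
    throw new InvalidArgumentException('Unknown operator: ' . $node['operand']);
}

// The tree for '(1*2)/(3*4)*100'.
$tree = [
    'operand' => '*',
    'left' => [
        'operand' => '/',
        'left' => ['operand' => '*', 'left' => 1, 'right' => 2],
        'right' => ['operand' => '*', 'left' => 3, 'right' => 4],
    ],
    'right' => 100,
];
var_dump(evaluateTree($tree));
```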
There is no need to use regex for this. You just loop through the string and build the array as you want.
Edit, just realized it can be done much faster with a while loop instead of two for loops and if().
$str ="(10*2)/(3*40)*100";
$str = str_split($str); // make str an array
$arr = array();
$j=0; // counter for new array
for($i=0;$i<count($str);$i++){
if(is_numeric($str[$i])){ // if the item is a number
$arr[$j] = $str[$i]; // add it to new array
$k = $i+1;
while($k < count($str) && is_numeric($str[$k])){ // while in bounds and still a digit, append to the current item.
$arr[$j] .= $str[$k];
$k++; // move on to the next character.
}
$j++; // we are done with this item, add one to counter.
$i=$k-1; // set new value to $i
}else{
// not number, add it to the new array and add one to array counter.
$arr[$j] = $str[$i];
$j++;
}
}
var_dump($arr);
https://3v4l.org/p9jZp
You can also use this matching regex: [()+\-*\/]|\d+
Demo
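Since this one is a matching pattern rather than a splitting pattern, it would be used with preg_match_all(). A small sketch, with tokenize as a hypothetical function name and ~ as the delimiter so the slash can stay unescaped:

```php
<?php
// Hypothetical wrapper: every full match is one token.
function tokenize(string $expression): array
{
    // Either a run of digits, or exactly one parenthesis/operator.
    preg_match_all('~\d+|[()+\-*/]~', $expression, $m);
    return $m[0];
}

var_export(tokenize('1+2+3/3*100'));
```

Characters that match neither alternative (a stray space or letter, for example) are silently skipped, which is fine for the clean input described in the question.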
I was doing something similar to this for a php calculator demo. A related post.
Consider this pattern for preg_split():
~-?\d+|[()*/+-]~ (Pattern Demo)
This has the added benefit of allowing negative numbers without confusing them for operators. The first "alternative" matches positive or negative integers, while the second "alternative" (after the |) matches parentheses and operators, one at a time.
In the php implementation, I place the entire pattern in a capture group and retain the delimiters. This way no substrings are left behind. ~ is used as the pattern delimiter so that the slash in the pattern doesn't need to be escaped.
Code: (Demo)
$expression = '(1*2)/(3*4)*100+-10';
var_export(
preg_split(
'~(-?\d+|[()*/+-])~',
$expression,
0,
PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
)
);
Output:
array (
0 => '(',
1 => '1',
2 => '*',
3 => '2',
4 => ')',
5 => '/',
6 => '(',
7 => '3',
8 => '*',
9 => '4',
10 => ')',
11 => '*',
12 => '100',
13 => '+',
14 => '-10',
)

Extract urls from string without spaces between

Let's say I have a string like this:
$urlsString = "http://foo.com/barhttps://bar.com//foo.com/foo/bar"
and I want to get an array like this:
array(
[0] => "http://foo.com/bar",
[1] => "https://bar.com",
[2] => "//foo.com/foo/bar"
);
I'm looking for something like:
preg_split("~((https?:)?//)~", $urlsString, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
Where PREG_SPLIT_DELIM_CAPTURE definition is:
If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.
That said, the above preg_split returns:
array (size=3)
0 => string '' (length=0)
1 => string 'foo.com/bar' (length=11)
2 => string 'bar.com//foo.com/foo/bar' (length=24)
Any idea of what I'm doing wrong or any other idea?
PS: I was using this regex until I've realized that it doesn't cover this case.
Edit:
As @sidyll pointed out, I'm missing the $limit in the preg_split parameters. Anyway, there is something wrong with my regex, so I will use @WiktorStribiżew's suggestion.
You may use a preg_match_all with the following regex:
'~(?:https?:)?//.*?(?=$|(?:https?:)?//)~'
See the regex demo.
Details:
(?:https?:)? - https: or http:, optional (1 or 0 times)
// - double /
.*? - any 0+ chars other than line break as few as possible up to the first
(?=$|(?:https?:)?//) - either of the two:
$ - end of string
(?:https?:)?// - https: or http:, optional (1 or 0 times), followed with a double /
Below is a PHP demo:
$urlsString = "http://foo.com/barhttps://bar.com//foo.com/foo/bar";
preg_match_all('~(?:https?:)?//.*?(?=$|(?:https?:)?//)~', $urlsString, $urls);
print_r($urls[0]); // index 0 holds the full matches
// => Array ( [0] => http://foo.com/bar [1] => https://bar.com [2] => //foo.com/foo/bar )
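For what it is worth, the preg_split() idea from the question can also be made to work once the $limit argument is passed (so the flags land in the fourth parameter) and the delimiter is made zero-width. This is a sketch of a different technique, not the answer's method; splitUrls is a hypothetical name, and it assumes a URL never contains "//" after its scheme:

```php
<?php
// Hypothetical helper: cut the string in front of every URL start.
function splitUrls(string $urls): array
{
    return preg_split(
        // Split before "http://" or "https://", or before a bare "//"
        // that is not the "//" belonging to a scheme (not preceded by ":").
        '~(?=https?://)|(?<!:)(?=//)~',
        $urls,
        -1,                  // $limit: -1 means no limit
        PREG_SPLIT_NO_EMPTY  // drop the empty piece in front of the first URL
    );
}

print_r(splitUrls('http://foo.com/barhttps://bar.com//foo.com/foo/bar'));
```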

PHP Regex: How to get optional text if present?

Let's take an example of following string:
$string = "length:max(260):min(20)";
In the above string, :max(260):min(20) is optional. I want to get it if it is present otherwise only length should be returned.
I have following regex but it doesn't work:
/(.*?)(?::(.*?))?/se
It doesn't return anything in the array when I use preg_match function.
Remember, there can be something else than above string. Maybe like this:
$string = "number:disallow(negative)";
Is there any problem in my regex or PHP won't return anything? Dumping preg_match returns int 1 which means the string matches the regex.
Fully Dumped:
int 1
array (size=2)
0 => string '' (length=0)
1 => string '' (length=0)
Your pattern begins with a lazy quantifier on the dot (.*?), and everything after it is optional, so the match succeeds with zero characters at position zero. If you change your preg_match function to preg_match_all you'll see the captured groups.
Another problem is with your Regular Expression. You're killing the engine. Also, the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, and it only ever applied to the preg_replace function.
Don't use the s modifier either! It's not needed.
This works for your case:
/([^:]+)(:.*)?/
Online demo
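A quick sketch of how that pattern behaves with and without the optional part. The function name splitRule is hypothetical; the thing to notice is that preg_match() leaves trailing unmatched groups out of the result array, hence the ?? fallback:

```php
<?php
// Hypothetical helper: returns [name, optionalRest] for a rule string.
function splitRule(string $rule): array
{
    if (!preg_match('/([^:]+)(:.*)?/', $rule, $m)) {
        return ['', '']; // nothing matched at all
    }
    // $m[2] is not set when the optional ":..." part is absent.
    return [$m[1], $m[2] ?? ''];
}

var_export(splitRule('length:max(260):min(20)'));
var_export(splitRule('length'));
```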
I tried to prepare a regex that can probably solve your issue and also adds some value:
this regex not only matches the optional elements but also captures them as key/value pairs.
Regex
/(?<=:|)(?'prop'\w+)(?:\((?'val'.+?)\))?/g
Test string
length:max(260):min(20)
length
number:disallow(negative)
Result
MATCH 1
prop [0-6] length
MATCH 2
prop [7-10] max
val [11-14] 260
MATCH 3
prop [16-19] min
val [20-22] 20
MATCH 4
prop [24-30] length
MATCH 5
prop [31-37] number
MATCH 6
prop [38-46] disallow
val [47-55] negative
try demo here
EDIT
I think I understand what you meant by duplicate array with different key, it was due to named captures eg. prop & val
here is the revision without named capturing
Regex
/(?<=:|)(\w+)(?:\((.+?)\))?/
Sample code
$str = "length:max(260):min(20)";
$str .= "\nlength";
$str .= "\nnumber:disallow(negative)";
preg_match_all("/(?<=:|)(\w+)(?:\((.+?)\))?/",
$str,
$matches);
print_r($matches);
Result
Array
(
[0] => Array
(
[0] => length
[1] => max(260)
[2] => min(20)
[3] => length
[4] => number
[5] => disallow(negative)
)
[1] => Array
(
[0] => length
[1] => max
[2] => min
[3] => length
[4] => number
[5] => disallow
)
[2] => Array
(
[0] =>
[1] => 260
[2] => 20
[3] =>
[4] =>
[5] => negative
)
)
try demo here

Extracting some content from given string

This is the content piece:
This is content that is a sample.
[md] Special Content Piece [/md]
This is some more content.
What I want is a preg_match_all expression such that it can fetch and give me the following from the above content:
[md] Special Content Piece [/md]
I have tried this:
$pattern ="/\[^[a-zA-Z][0-9\-\_\](.*?)\[\/^[a-zA-Z][0-9\-\_]\]/";
preg_match_all($pattern, $content, $matches);
But it gives a blank array. Could someone help?
$pattern = "/\[md\](.*?)\[\/md\]/";
generally
$pattern = "/\[[a-zA-Z0-9\-\_]+\](.*?)\[\/[a-zA-Z0-9\-\_]+\]/";
or even better
$pattern = "/\[\w+\](.*?)\[\/\w+\]/";
and to match the start tag with the end tag:
$pattern = '/\[(\w+)\](.*?)\[\/\1\]/'; // single quotes, so PHP does not read \1 as an octal escape
(Just note that the "tag" name is then returned in the match array.)
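A small sketch of the backreference version in use, on the sample content from the question. The function name extractTagged is hypothetical. The pattern is single-quoted on purpose: inside a double-quoted PHP string, \1 is interpreted as an octal escape before PCRE ever sees it.

```php
<?php
// Hypothetical helper: returns one ['tag' => ..., 'content' => ...] entry per matched pair.
function extractTagged(string $content): array
{
    // \1 must reach PCRE as a backreference, so the pattern is single-quoted.
    preg_match_all('/\[(\w+)\](.*?)\[\/\1\]/', $content, $matches, PREG_SET_ORDER);
    $result = [];
    foreach ($matches as $match) {
        $result[] = ['tag' => $match[1], 'content' => $match[2]];
    }
    return $result;
}

$content = "This is content that is a sample.\n[md] Special Content Piece [/md]\nThis is some more content.";
print_r(extractTagged($content));
```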
You can use this:
$pattern = '~\[([^]]++)]\K[^[]++(?=\[/\1])~';
explanation:
~ #delimiter of the pattern
\[ #literal opening square bracket (must be escaped)
( #open the capture group 1
[^]]++ #all characters that are not ] one or more times
) #close the capture group 1
] #literal closing square bracket (no need to escape)
\K #reset all the match before
[^[]++ #all characters that are not [ one or more times
(?= #open a lookahead assertion (this doesn't consume characters)
\[/ #literal opening square bracket and slash
\1 #back reference to the group 1
] #literal closing square bracket
) #close the lookahead
~ #ending delimiter of the pattern
Interest of this pattern:
The result is the whole match, because I have reset everything matched before \K and because the lookahead assertion after what you are looking for doesn't consume characters and is not part of the match.
The character classes are defined by negation and are therefore shorter to write and permissive (you don't need to care about which characters appear inside).
The pattern checks that the opening and closing tags are the same by using a capture group and a back reference.
Limits:
This expression doesn't deal with nested structures (which you didn't ask for). If you need that, please edit your question.
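A short sketch of the non-nested \K pattern applied to the sample content from the question; extractInnerText is a hypothetical name. Because of \K and the lookahead, the full match (index 0) is already just the text between the tags:

```php
<?php
// Hypothetical helper: returns the text found between matching [tag] ... [/tag] pairs.
function extractInnerText(string $content): array
{
    // ~ is the delimiter, so the slash in the closing tag needs no escaping.
    preg_match_all('~\[([^]]++)]\K[^[]++(?=\[/\1])~', $content, $m);
    return $m[0];
}

$content = "This is content that is a sample.\n[md] Special Content Piece [/md]\nThis is some more content.";
var_export(extractInnerText($content));
```

The surrounding spaces are part of the match; trim() can be applied to each element if they are not wanted.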
For nested structures you can use:
(?=(\[([^]]++)](?<content>(?>[^][]++|(?1))*)\[/\2]))
If attributes are allowed in your bbcode:
(?=(\[([^]\s]++)[^]]*+](?<content>(?>[^][]++|(?1))*)\[/\2]))
If self-closing bbcode tags are allowed:
(?=((?:\[([^][]++)](?<content>(?>[^][]++|(?1))*)\[/\2])|\[[^/][^]]*+]))
Notes:
A lookahead means in other words: "followed by"
I use possessive quantifiers (++) instead of simple greedy quantifiers (+) to inform the regex engine that it doesn't need to backtrack (a performance gain), and atomic groups (i.e. (?>..)) for the same reason.
In the patterns for nested structures slashes are not escaped, to use them you must choose a delimiter that is not a slash (~, #, `).
The patterns for nested structures use recursion (i.e. (?1)); you can find more information about this feature here and here.
Update:
If you're likely to be working with nested "tags", I'd probably go for something like this:
$pattern = '/(\[\s*([^\]]++)\s*\])(?=(.*?)(\[\s*\/\s*\2\s*\]))/';
Which, as you can probably tell, is not unlike what CasimiretHippolyte suggested (only his regex, AFAICT, won't capture outer tags in a scenario like the following:)
This is content that is a sample.
[md] Special Content [foo]Piece[/foo] [/md]
This is some more content.
Whereas, with this expression, $matches looks like:
array (
0 =>
array (
0 => '[md]',
1 => '[foo]',
),
1 =>
array (
0 => '[md]',
1 => '[foo]',
),
2 =>
array (
0 => 'md',
1 => 'foo',
),
3 =>
array (
0 => ' Special Content [foo]Piece[/foo] ',
1 => 'Piece',
),
4 =>
array (
0 => '[/md]',
1 => '[/foo]',
),
)
A rather simple pattern to match all substrings looking like this [foo]sometext[/foo]
$pattern = '/(\[[^\/\]]+\])([^\]]+)(\[\s*\/\s*[^\]]+\])/';
if (preg_match_all($pattern, $content, $matches))
{
echo '<pre>';
print_r($matches);
echo '</pre>';
}
Output:
array (
0 =>
array (
0 => '[md] Special Content Piece [/md]',
),
1 =>
array (
0 => '[md]',
),
2 =>
array (
0 => ' Special Content Piece ',
),
3 =>
array (
0 => '[/md]',
),
)
How this pattern works: It's divided into three groups.
The first: (\[[^\/\]]+\]) matches opening and closing [], with everything in between that is neither a closing bracket nor a forward slash.
The second: ([^\]]+) matches every char after the first group that is not ]
The third: (\[\s*\/\s*[^\]]+\]) matches an opening [, followed by zero or more spaces, a forward slash, again followed by zero or more spaces, and any other char that isn't ]
If you want to match a specific end-tag, but keeping the same three groups (with a fourth), use this (slightly more complex) expression:
$pattern = '/(\[\s*([^\]]+?)\s*\])(.+?)(\[\s*\/\s*\2\s*\])/';
This'll return:
array (
0 =>
array (
0 => '[md] Special Content Piece [/md]',
),
1 =>
array (
0 => '[md]',
),
2 =>
array (
0 => 'md',
),
3 =>
array (
0 => ' Special Content Piece ',
),
4 =>
array (
0 => '[/md]',
),
)
Note that group 2 (the one we used in the expression as \2) is the "tagname" itself.
