preg_split with two patterns (one of them quoted) - php

I would like to split a string in PHP containing quoted and unquoted substrings.
Let's say I have the following string:
"this is a string" cat dog "cow"
The resulting array should look like this:
array (
[0] => "this is a string"
[1] => "cat"
[2] => "dog"
[3] => "cow"
)
I'm struggling a bit with regex and I'm wondering whether this is even possible with a single regex/preg_split call.
The first thing I tried was:
[[:blank:]]*(?=(?:[^"]*"[^"]*")*[^"]*$)[[:blank:]]*
But this splits only array[0] and array[3] correctly; the rest is split character by character.
Then I found this link:
PHP preg_split with two delimiters unless a delimiter is within quotes
(?=(?:[^"]*"[^"]*")*[^"]*$)
This seems like a good starting point to me. However, the result for my example is the same as with the first regex.
I tried combining both: first the one for quoted strings, then a second sub-regex that should omit quoted strings (hence the [^"]):
(?=(?:[^"]*"[^"]*")*[^"]*$)|[[:blank:]]*([^"].*[^"])[[:blank:]]*
So I have two questions:
Is it even possible to achieve what I want with a single regex/preg_split call?
If yes, I would appreciate a hint on how to assemble the regex correctly.

Since matches cannot overlap, you could use preg_match_all like this:
preg_match_all('/"[^"]*"|\S+/', $input, $matches);
Now $matches[0] should contain what you are looking for. The regex will first try to match a quoted string, and then stop. If that fails, it will just collect as many non-whitespace characters as possible. Since alternations are tried from left to right, the quoted version takes precedence.
EDIT: This will not get rid of the quotes though. To do this, you could use capturing groups:
preg_match_all('/(?|"([^"]*)"|(\S+))/', $input, $matches);
Now $matches[1] will contain exactly what you are looking for. The (?| is there so that both capturing groups end up at the same index.
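Both variants can be checked against the string from the question. A minimal sketch (the variable names $withQuotes and $withoutQuotes are illustrative, not from the answer):

```php
<?php
$input = '"this is a string" cat dog "cow"';

// Variant 1: quoted strings keep their quotes.
preg_match_all('/"[^"]*"|\S+/', $input, $withQuotes);

// Variant 2: the branch reset group (?| ... ) puts both alternatives in group 1.
preg_match_all('/(?|"([^"]*)"|(\S+))/', $input, $withoutQuotes);

print_r($withQuotes[0]);    // full matches, quotes included
print_r($withoutQuotes[1]); // group 1, quotes stripped
```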
EDIT 2: Since you were asking for a preg_split solution, that is also possible. We can use a lookahead that asserts that the whitespace is followed by an even number of quotes (up to the end of the string). The trailing [^"]* lets the string end in unquoted text:
$result = preg_split('/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/', $input);
Of course, this will not get rid of the quotes, but that can easily be done in a separate step.
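The separate step could be a simple trim of each piece. A hedged sketch: the lookahead used here ends in [^"]*$ (as in the pattern quoted in the question) so that unquoted words after the last quoted string are still split, and the extra word "bird" is my own test data:

```php
<?php
$input = '"this is a string" cat dog "cow" bird';

// Split on whitespace that is followed by an even number of quotes.
$parts = preg_split('/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/', $input);

// Second step: strip the surrounding quotes from each piece.
$parts = array_map(function ($part) {
    return trim($part, '"');
}, $parts);

print_r($parts);
```

This assumes the quotes in the input are balanced; with an odd number of quotes the lookahead never succeeds for text before the stray quote.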

Related

How do I locate and replace text with a common element using regex?

I'm pretty lousy at regex, and need help with the following scenario. I need to locate and replace text that has a common structure, but one aspect will be different:
here is a string (with 3 values)
here is another string (with 5 values)
In the above examples, I need to locate and then replace the value in parentheses. I can't search by parens alone, as the string may contain other parens. But the value in the parens that needs to be replaced is consistently constructed: (with # values); the only difference will be the number.
So ideally the regex returns (with 3 values) and (with 5 values) so I can use a simple str_replace to change the text.
This is regex in a PHP script.
Try this regex:
\(with\s+\d+\s+values\)
Demo here
The following regex should work for you:
/\(with (\d+) values\)/g
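This matches strings of the specified format and gives the value in a capture group so it may be used in the replacement. The g flag at the end asks for every occurrence in one string in engines that support it; PHP's preg functions have no g modifier (preg_replace and preg_match_all already handle all matches), so leave it off in PHP code.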
Demo here
If, however, there can only be one digit, then the following will work:
/\(with (\d) values\)/g
Or, if the number can only be a digit greater than 1, for example, then the following:
/\(with ([2-9]) values\)/g
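Since the goal in the question is a replacement, matching and replacing can be combined in one preg_replace call. A minimal sketch, assuming the replacement text '(with N values)' is only a placeholder and not something from the question:

```php
<?php
$lines = [
    'here is a string (with 3 values)',
    'here is another string (with 5 values)',
    'a string (with other parens) is left alone',
];

// No "g" modifier here: preg_replace already replaces every occurrence.
// Passing an array as the subject returns an array of results.
$replaced = preg_replace('/\(with \d+ values\)/', '(with N values)', $lines);

print_r($replaced);
```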
If I got you right, you are looking for exactly three or five items within parentheses (comma separated).
This could be accomplished by
\( # "(" literally
(?:[^,()]+,){2} # not , or ( or ) exactly two times
(?:(?:[^,()]+,){2})? # repeated
[^,()]+ # without the comma in the end
\) # the closing parenthesis
See a demo on regex101.com.
If you're really looking for only two variants of the string, you could very easily do
\(with (?:3|5) values\)
In general
\(with \d+ values\)
as proposed by @SchoolBoy.
Something like this maybe
$str ="here is another string (with 5 values)";
preg_match_all("/\(with (\d+) values\)/", $str, $out );
print_r( $out );
Output:
Array
(
[0] => Array
(
[0] => (with 5 values)
)
[1] => Array
(
[0] => 5
)
)
Here at ideone...
It uses the regex
\(with (\d+) values\)
that matches the literal opening parenthesis followed by the string with # values, capturing the actual number #, and finally the closing parenthesis.
It returns the complete match (the parenthesized string) in the first dimension and the actual number in the second.

All caps string with space in php

I need to extract a string like "THE NEED OF FOLLOWING A RELIGION " from a larger string.
I can extract the individual words (THE, NEED, OF, ...), but I need the complete run of capital letters, "THE NEED OF FOLLOWING A RELIGION", and I have not been able to get it. Please help.
preg_match_all("/[A-Z]*/", $html, $out);
Thanks
A very basic modification to the original code to find runs of more than one capital at a time.
$str='This is a string WITH MIXED CASE words and WE ONLY WANT capitals';
preg_match_all("/[A-Z\s]{2,}/", $str, $out);
echo '<pre>',print_r($out,true),'</pre>';
outputs:
Array
(
[0] => Array
(
[0] => WITH MIXED CASE
[1] => WE ONLY WANT
)
)
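One detail worth checking: because \s is part of the character class, the raw matches appear to include the whitespace around each run, which the print_r output hides. A small sketch that trims it in a second step (the array_map/trim step is my addition, not part of the answer):

```php
<?php
$str = 'This is a string WITH MIXED CASE words and WE ONLY WANT capitals';
preg_match_all('/[A-Z\s]{2,}/', $str, $out);

// Each raw match carries the spaces before and after the run of capitals.
$trimmed = array_map('trim', $out[0]);

var_dump($out[0]);   // var_dump shows string lengths, so the padding is visible
print_r($trimmed);
```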
Your regex is just missing a condition on the delimiter, namely:
Each word has to be either followed or preceded by a space.
Converting that sentence to a regex, we get
[A-Z]*(?=\s)|(?<=\s)[A-Z]*
The regex above covers each of these cases:
\sWORD
WORD\s
\sWORD\s
See DEMO.
You can add the space to your character class, like [A-Z ]. Now you get all the runs of capitalized words, but also a bunch of single spaces. To avoid the single spaces, use ([A-Z]+[A-Z ]*[A-Z]). I added the () to capture the matched results.
You can check it in action here.
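As a quick check of that last pattern, here is a sketch run against a made-up $html value (the surrounding markup and sentence are assumptions; only the capitalized phrase comes from the question):

```php
<?php
$html = '<p>Chapter one: THE NEED OF FOLLOWING A RELIGION is discussed here.</p>';

// Starts and ends on a capital letter, so no leading or trailing spaces are captured.
preg_match_all('/([A-Z]+[A-Z ]*[A-Z])/', $html, $out);

print_r($out[1]);
```

Note that this pattern needs at least two capital letters, so a lone capital such as the "C" in "Chapter" is skipped.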

How to match all words but "stop" in a string by regex

Another regex question. I use PHP and have a string: fdjkaljfdlstopfjdslafdj. As you can see, there is a stop in the middle. I want to replace everything except that stop. I tried [^stop], but it also includes the s at the end of the string.
My Solution
Thanks for everyone's help here.
I also figured out a solution using pure regex (pure within the scope of my regex knowledge; PCRE verbs are too advanced for me), but it needs two steps. I don't want to mix PHP functions in, because sometimes the job is outside of code, e.g. batch-renaming files in Total Commander.
Let’s see the string: xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq. For example, I want to keep all foos in this string, and replace any other non-foo greedily into ---.
First, I need to find all the non-foo between each foo: (?<=foo).+?(?=foo).
The string turns into xxxfoo---foo---foo---foolsjunegpq; only the non-foo text at both ends is left now.
Then use [^-]+(?=foo)|(?<=foo)[^-]+.
This time: ---foo---foo---foo---foo---. All words but foo have been turned into ---.
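In PHP the two steps above can be sketched with two preg_replace calls. This assumes, as the second pattern does, that the input contains no "-" characters of its own; the variable names are illustrative:

```php
<?php
$s = 'xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq';

// Step 1: replace everything that sits between two foos.
$step1 = preg_replace('/(?<=foo).+?(?=foo)/', '---', $s);

// Step 2: replace the leftover text before the first foo and after the last one.
$step2 = preg_replace('/[^-]+(?=foo)|(?<=foo)[^-]+/', '---', $step1);

echo $step1, "\n", $step2, "\n";
```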
i just dont want to include "stop"...
You can skip it by using the PCRE verbs (*SKIP)(*F). Try it like this:
stop(*SKIP)(*F)|.
Demo at regex101
or sequence: (stop)(*SKIP)(*F)|(?:(?!(?1)).)+
or for words: stop(*SKIP)(*F)|\w+
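A short sketch of how two of these patterns behave in PHP, using the string from the question. The "-" replacement character is an arbitrary choice for illustration:

```php
<?php
$s = 'fdjkaljfdlstopfjdslafdj';

// Character variant: every character outside "stop" is replaced one by one.
$masked = preg_replace('/stop(*SKIP)(*F)|./', '-', $s);

// Sequence variant: each run of text that does not contain "stop" is one match.
preg_match_all('/(stop)(*SKIP)(*F)|(?:(?!(?1)).)+/', $s, $m);

echo $masked, "\n";
print_r($m[0]);
```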
[^stop] doesn't mean any text that is NOT stop. It just means any single character that is not one of the characters inside [...], which in this case are s, t, o, p.
Better to split on the text you don't want to match:
$s = 'fdjkaljfdlstopfjdslafdjstopfoobar';
$arr = preg_split('/stop/', $s);
print_r($arr);
Array
(
[0] => fdjkaljfdl
[1] => fjdslafdj
[2] => foobar
)
You can generalize this to any pattern:
(?<neg>stop)(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|(?&neg))
Demo
Just put the pattern you don't want in the neg group.
This regex will try to do the following for any character position:
Match the pattern you don't want. If it matches, discard it with (*SKIP)(*FAIL) and restart another match at this position.
If the pattern you don't want doesn't match at a particular position, then match anything, until either:
You reach the end of the input string (\Z)
Or the pattern you don't want immediately follows the current matching position ((?&neg))
This approach is slower than a manually tuned expression. You can get better performance at the cost of repeating yourself, which avoids the (?&neg) subroutine call:
stop(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|stop)
But of course, the best approach would be to use the features provided by your language: match the string you don't want, then use code to discard it and keep everything else.
In PHP, you can use the PREG_OFFSET_CAPTURE flag to tell the preg_match_all function to provide you the offsets of each match.
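A minimal sketch of that idea: match the unwanted pattern, then use the reported offsets to keep everything in between. The helper name segmentsOutside is hypothetical, not a PHP built-in:

```php
<?php
// Hypothetical helper: returns the parts of $subject not covered by any match of $pattern.
function segmentsOutside(string $subject, string $pattern): array
{
    // With PREG_OFFSET_CAPTURE each match is a [text, byte offset] pair.
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);

    $segments = [];
    $pos = 0;
    foreach ($m[0] as [$text, $offset]) {
        if ($offset > $pos) {
            $segments[] = substr($subject, $pos, $offset - $pos);
        }
        $pos = $offset + strlen($text);
    }
    if ($pos < strlen($subject)) {
        $segments[] = substr($subject, $pos); // tail after the last match
    }
    return $segments;
}

$kept = segmentsOutside('fdjkaljfdlstopfjdslafdjstopfoobar', '/stop/');
print_r($kept);
```

For a fixed delimiter like this, preg_split gives the same result with less code; the offsets become useful when the positions themselves are needed.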

PHP Regex to identify keys in array representation

I have this string authors[0][system:id] and I need a regex that returns:
array('authors', '0', 'system:id')
Any ideas?
Thanks.
Just use PHP's preg_split(), which returns an array of elements similarly to explode() but with RegEx.
Split the string on [ or ] and then remove the last element (which is an empty string) of the resulting array, $tokens.
EDIT: Also, remove the 3rd element with array_splice($array, int $offset, int $length), since this item is also an empty string.
The regex /[\[\]]/ just means match any [ or ] character
$string = "authors[0][system:id]";
$tokens = preg_split("/[\]\[]/", $string);
array_pop($tokens);
array_splice($tokens, 2, 1);
//rest of your code using $tokens
Here is the format of $tokens after this has run:
Array ( [0] => authors [1] => 0 [2] => system:id )
Taking the most simplistic approach, we would just match the three individual parts. So first of all we'd look for the token that is not enclosed in brackets:
[a-z]+
Then we'd look for the brackets and the value in between:
\[[^\]]+\]
And then we'd repeat the second step.
You'd also need to add capture groups () to extract the actual values that you want.
So when you put it all together you get something like:
([a-z]+)\[([^\]]+)\]\[([^\]]+)\]
That expression could then be used with preg_match() and the values you want would be extracted into the referenced array passed to the third argument (like this). But you'll notice the above expression is quite a difficult-to-read collection of punctuation, and also that the resulting array has an extra element on it that we don't want - preg_match() places the whole matched string into the first index of the output array. We're close, but it's not ideal.
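A minimal sketch of that preg_match() call, including the extra first element mentioned above. The variable names are illustrative:

```php
<?php
$str = 'authors[0][system:id]';

$parts = [];
if (preg_match('/([a-z]+)\[([^\]]+)\]\[([^\]]+)\]/', $str, $m)) {
    // $m[0] holds the whole matched string; the three values start at index 1.
    $whole = $m[0];
    $parts = array_slice($m, 1);
}

print_r($parts);
```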
However, as @AlienHoboken correctly points out and almost correctly implements, a simpler solution would be to split the string up based on the position of the brackets. First let's take a look at the expression we'd need (or at least, the one that I would use):
(?:\[|\])+
This looks for at least one occurrence of either [ or ] and uses that block as the delimiter for the split. This seems like exactly what we need, except when we run it we'll find we have a small issue:
array('authors', '0', 'system:id', '')
Where did that extra empty string come from? Well, the last character of the input string matches your delimiter expression, so it's treated as a split position, with the result that an empty string gets appended to the results.
This is quite a common issue when splitting based on a regular expression, and luckily PHP knows this and provides a simple way to avoid it: the PREG_SPLIT_NO_EMPTY flag.
So when we do this:
$str = 'authors[0][system:id]';
$expr = '/(?:\[|\])+/';
$result = preg_split($expr, $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
...you will see the result you want.
See it working

Search a String for Alpha Numeric Characters in a Pattern

I have a string that contains 5 words. In the string one of the words is a Ham Radio Call Sign and can be anyone of the thousands of call signs in the US. In order to extract the Call Sign from the string I need to utilize the below pattern. The Call Sign I need to extract can be in any of the 5 positions in the string. The number is never the first character and the number is never the last character. The string is actually put together from an Array since it is originally read from a text file.
$string = $word[1] $word[2] $word[3] etc....
So the search can be either done on the whole string or each piece of the array.
Patterns:
1 Number and 3 Letters Example: AB4C A4BC
1 Number and 4 Letters Example: A4BCD
1 Number and 5 Letters Example: AB4CDE
I have tried everything I can think of and search till I cant search no more. I am sure I am over thinking this.
A two-step regular expression like this would do it:
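$str = "hello A4AB there BC5AD";
$signs = array();
// Step 1: candidates of 4 to 6 characters that start and end with a capital letter
preg_match_all('/[A-Z][A-Z\d]{2,4}[A-Z]/', $str, $possible_signs);
foreach ($possible_signs[0] as $possible_sign) {
    // Step 2: keep only candidates that contain exactly one digit
    if (preg_match('/^\D+\d\D+$/', $possible_sign)) {
        array_push($signs, $possible_sign);
    }
}
print_r($signs); // Array ([0] => A4AB [1] => BC5AD)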
Explanation
This is a regular expression approach, using two patterns. I don't think it could be done with one and still satisfy the exact requirements of the matching rules.
The first pattern enforces the following requirements:
substring starts and ends with a capital letter
substring contains only other capital letters or numbers between the first and last letter
substring is, overall, not more than 6 characters long
What I can't do in that same pattern, for complex REGEX reasons I won't go into (unless someone knows a way and can correct me), is enforce that only one number is contained.
@jeroen's answer does enforce this in a single pattern, but in turn does not enforce the correct length of the substring. Either way, we need a second pattern.
So after grabbing the initial matches, we loop over the results. We then apply each to a second pattern that enforces simply that there is only one number in the substring.
If so, we green-light the substring and it's added to the $signs array.
Hope this helps.
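For what it's worth, a lookahead may allow both checks in one pattern: the lookahead limits the overall length, and the rest of the pattern requires letters, exactly one digit, then letters. This is a sketch under my own assumptions (call signs are whole words, 4 to 6 characters, uppercase), not something taken from either answer:

```php
<?php
$str = 'hello A4AB there BC5AD and A44B ABC4 AB4CDEF AB4CDE';

// (?=[A-Z\d]{4,6}\b) : the word is 4 to 6 letters/digits long
// [A-Z]+\d[A-Z]+     : letters, exactly one digit, letters
preg_match_all('/\b(?=[A-Z\d]{4,6}\b)[A-Z]+\d[A-Z]+\b/', $str, $m);

print_r($m[0]);
```

Here A44B (two digits), ABC4 (digit last) and AB4CDEF (seven characters) are rejected, while the six-character AB4CDE is kept.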
It depends on what the other words can contain, but you could use a regular expression like:
#\b[a-z]+\d[a-z]+\b#i
Where:
\b is a word boundary
[a-z]+ is one or more letters
\d is one number
i (after the closing #) makes the match case insensitive
You can make it more restrictive by using {1,3} instead of + for the letters so that you have a sequence of 1 to 3 letters.
The complete expression would be something like:
$success = preg_match('#\b[a-z]+\d[a-z]+\b#i', $input_string, $matches);
where $matches[0] will contain the matched value, see the manual.
