Match all occurrences of a string - php

My search text is as follows.
...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...
The text contains many lines (it is actually a JavaScript file), but I only need to parse the values in the variable strings, i.e. aaa, bbb, ccc, ddd, eee.
The Perl code follows; a PHP version is at the bottom.
my $str = <<STR;
...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...
STR
my @matches = $str =~ /(?:\"(.+?)\",?)/g;
print "@matches";
I know the above script will match all instances, but it will also pick up quoted strings ("xyz") on the other lines. So I need to anchor the match on the text var strings =
/var strings = \[(?:\"(.+?)\",?)/g
The regex above captures aaa.
/var strings = \[(?:\"(.+?)\",?)(?:\"(.+?)\",?)/g
The regex above captures aaa and bbb. To avoid repeating the group by hand, I used the '+' quantifier as below.
/var strings = \[(?:\"(.+?)\",?)+/g
But I got only eee. So my question is: why did I get ONLY eee when I used the '+' quantifier?
Update 1: Using PHP preg_match_all (doing it to get more attention :-) )
$str = <<<STR
...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...
STR;
preg_match_all("/var strings = \[(?:\"(.+?)\",?)+/",$str,$matches);
print_r($matches);
Update 2: Why did it match eee? Because of the greediness of (?:\"(.+?)\",?)+ . With the lazy form /var strings = \[(?:\"(.+?)\",?)+?/ aaa is matched instead. But why only one result? Is there any way to achieve this with a single regex?

Here's a single-regex solution:
/(?:\bvar\s+strings\s*=\s*\[|\G,)\s*"([^"]*)"/g
\G is a zero-width assertion that matches the position where the previous match ended (or the beginning of the string if it's the first match attempt). So this acts like:
var\s+strings\s*=\s*\[\s*"([^"]*)"
...on the first attempt, then:
,\s*"([^"]*)"
...after that, but each match has to start exactly where the last one left off.
This works in PHP, and it will work in Perl, too.
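As a rough illustration of the \G pattern above, the sketch below runs it through preg_match_all. The sample text, including the extra var other line, is made up for the demonstration and is not part of the question.

```php
<?php
// Sample input (assumed for illustration); the "var other" line should be ignored.
$str = <<<'STR'
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
var other = ["xyz"];
...
STR;

// The first alternative anchors on "var strings = ["; the second (\G,) can only
// continue from the exact position where the previous match ended.
$pattern = '/(?:\bvar\s+strings\s*=\s*\[|\G,)\s*"([^"]*)"/';

preg_match_all($pattern, $str, $matches);

// $matches[1] holds the capture from each successive match.
print_r($matches[1]);
```

One caveat: on the very first attempt \G also matches at the start of the subject, so a subject that begins with ,"..." would produce a match there.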

You may prefer this solution which first looks for the string var strings = [ using the /g modifier. This sets \G to match immediately after the [ for the next regex, which looks for all immediately following occurrences of double-quoted strings, possibly preceded by commas or whitespace.
my @matches;
if ($str =~ /var \s+ strings \s* = \s* \[ /gx) {
    @matches = $str =~ /\G [,\s]* "([^"]+)" /gx;
}
Despite the /g modifier, your regex /var strings = \[(?:\"(.+?)\",?)+/g matches only once because there is no second occurrence of var strings = [. Each match returns a list of the values of the capture variables $1, $2, $3 etc. as they stand when the match completes, and /(?:"(.+?)",?)+/ (there is no need to escape the double quotes) captures multiple values into $1 in turn, leaving only the final value there. You need to write something like the above, which captures only a single value into $1 for each match.
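For readers working in PHP, here is a hedged translation of the same two-step idea. PHP has no direct equivalent of Perl's pos(), so this sketch passes the offset explicitly; the sample string and the variable names ($start, $found) are assumptions made for the example.

```php
<?php
// Sample input (assumed): other arrays surround the one we want.
$str = 'var x = ["no"]; var strings = ["aaa","bbb","ccc","ddd","eee"]; var y = ["xyz"];';

$found = [];
// Step 1: locate "var strings = [" and compute the byte offset just past the "[".
if (preg_match('/var\s+strings\s*=\s*\[/', $str, $m, PREG_OFFSET_CAPTURE)) {
    $start = $m[0][1] + strlen($m[0][0]);

    // Step 2: starting at that offset, \G forces each match to begin where the
    // previous one ended, so scanning stops at the first character that does not fit.
    preg_match_all('/\G[,\s]*"([^"]+)"/', $str, $mm, PREG_PATTERN_ORDER, $start);
    $found = $mm[1];
}

print_r($found);
```

Because every match must be contiguous with the previous one, the quoted strings in the other arrays are never reached.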

Because the + repeats the group (?:"(.+?)",?) one or more times within a single match. Every repetition captures into the same group 1 and overwrites the previous value, so when the match completes only the last captured string, eee, is left.
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/var strings = \[(?:"(.+?)",?)+/)->explain();
The regular expression:
(?-imsx:var strings = \[(?:"(.+?)",?)+)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally):
----------------------------------------------------------------------
var strings =            'var strings = '
----------------------------------------------------------------------
\[                       '['
----------------------------------------------------------------------
(?:                      group, but do not capture (1 or more times (matching the most amount possible)):
----------------------------------------------------------------------
"                        '"'
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
.+?                      any character except \n (1 or more times (matching the least amount possible))
----------------------------------------------------------------------
)                        end of \1
----------------------------------------------------------------------
"                        '"'
----------------------------------------------------------------------
,?                       ',' (optional (matching the most amount possible))
----------------------------------------------------------------------
)+                       end of grouping
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
A simpler example would be:
my @m = ('abcd' =~ m/(\w)+/g);
print "@m";
Prints only d. This is due to:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/(\w)+/)->explain();
The regular expression:
(?-imsx:(\w)+)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally):
----------------------------------------------------------------------
(                        group and capture to \1 (1 or more times (matching the most amount possible)):
----------------------------------------------------------------------
\w                       word characters (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
)+                       end of \1 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
If you use the quantifier on the capture group, only the last instance will be used.
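The same behaviour can be observed in PHP. The short sketch below is only an illustration (the variable names are arbitrary): it runs both the simplified pattern and the regex from the question through preg_match and prints what ends up in group 1.

```php
<?php
// A quantified capture group keeps only its final iteration.
preg_match('/(\w)+/', 'abcd', $m);
print_r($m);   // $m[0] is the whole match, $m[1] the last repetition

// The same effect with the regex from the question.
$js = 'var strings = ["aaa","bbb","ccc","ddd","eee"];';
preg_match('/var strings = \[(?:"(.+?)",?)+/', $js, $q);
print_r($q);   // $q[1] is the value captured by the last repetition
```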
Here's a way that works:
my $str = <<STR;
...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...
STR
my @matches;
$str =~ m/var strings = \[(.+?)\]/; # get the array first
my $jsarray = $1;
@matches = $jsarray =~ m/"(.+?)"/g; # and get the strings from that
print "@matches";
Update:
A single-line solution (though not a single regex) would be:
@matches = ($str =~ m/var strings = \[(.+?)\]/)[0] =~ m/"(.+?)"/g;
But this is highly unreadable, in my opinion.
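A PHP version of the same two-step approach might look like the sketch below. It is an illustrative translation rather than the answerer's code; the sample text and the names $strings and $quoted are assumptions.

```php
<?php
// Sample input (assumed); only the "strings" array should be read.
$str = <<<'STR'
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
var other = ["xyz"];
...
STR;

$strings = [];
// Step 1: capture everything between the brackets of "var strings = [...]".
if (preg_match('/var strings = \[(.+?)\]/', $str, $m)) {
    // Step 2: pull each double-quoted value out of the captured text only.
    preg_match_all('/"(.+?)"/', $m[1], $quoted);
    $strings = $quoted[1];
}

print_r($strings);
```

Splitting the work this way keeps each regex simple, at the cost of a second pass over the captured text.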

Related

Match everything in brackets after a specific character

I have the following string:
$text = 'These are my cards. They are {{Archetype|Agumon}} and {{Fire|Gabumon}}';
I'm trying to replace every occurrence like {{Archetype|Agumon}} with [Agumon].
I've been struggling to get my head around it and have come up with this so far:
$string = preg_replace('#\{\{(.*?)\}\}#', '[$1]', $text);
This results in:
These are my cards. They are [Archetype|Agumon] and [Fire|Gabumon]
So I am currently matching the full text found in between the double curly brackets.
I thought it would be something like this: \|(.*?) to get the match after the | character in the curly brackets but to no avail.
You may use:
\{\{[^}]*\|([^}]*)\}\}
Breakdown:
\{\{ - Match "{{" literally.
[^}]* - Greedily match zero or more characters other than '}'.
\| - Match a pipe character.
([^}]*) - Match zero or more characters other than '}' and capture them in group 1.
\}\} - Match "}}" literally.
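Applied with preg_replace, the pattern broken down above could be used roughly as follows. This is a minimal sketch using the $text from the question; $result is just an illustrative name.

```php
<?php
$text = 'These are my cards. They are {{Archetype|Agumon}} and {{Fire|Gabumon}}';

// Group 1 is whatever follows the pipe inside {{...}}.
$result = preg_replace('/\{\{[^}]*\|([^}]*)\}\}/', '[$1]', $text);

echo $result, PHP_EOL;
```

Since [^}]* is greedy, a tag containing several pipes would keep only the part after the last one.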
Use
preg_replace('/{{(?:(?!{|}})[^|]*\|(.*?))}}/s', '[$1]', $text)
It will support { and } in the part before the pipe; the look-ahead only rejects a { or }} immediately after the opening {{.
Explanation
--------------------------------------------------------------------------------
{{                       '{{'
--------------------------------------------------------------------------------
(?:                      group, but do not capture:
--------------------------------------------------------------------------------
(?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
{                        '{'
--------------------------------------------------------------------------------
|                        OR
--------------------------------------------------------------------------------
}}                       '}}'
--------------------------------------------------------------------------------
)                        end of look-ahead
--------------------------------------------------------------------------------
[^|]*                    any character except: '|' (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\|                       '|'
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
.*?                      any character (0 or more times (matching the least amount possible))
--------------------------------------------------------------------------------
)                        end of \1
--------------------------------------------------------------------------------
)                        end of grouping
--------------------------------------------------------------------------------
}}                       '}}'
PHP code:
$text = 'These are my cards. They are {{Archetype|Agumon}} and {{Fire|Gabumon}}';
echo preg_replace('/{{(?:(?!{|}})[^|]*\|(.*?))}}/s', '[$1]', $text);
Results: These are my cards. They are [Agumon] and [Gabumon]

PHP preg_replace remove specific parts from string

I'm having problems understanding regex in PHP. I have this img src:
src="http://example.com/javascript:gallery('/info/2005/image.jpg',383,550)"
and need to build from it this:
src="http://example.com/info/2005/image.jpg"
How is it possible to cut the first and last parts from the string to obtain a clean link without the javascript part?
Right now I'm using this regex:
$cont = 'src="http://example.com/javascript:gallery(\'/info/2005/image.jpg\',383,550)"';
$cont = preg_replace("/(src=\")(.*)(\/info)/","$1http://example.com$3", $cont);
and output is:
src="http://example.com/info/2005/image.jpg',383,550)"
As an alternative solution, you might also capture the src="http://example.com part by matching the protocol in group 1, so you can use it in the replacement.
(src="https?://[^/]+)/[^']*'(/info[^']*)'[^"]*
Explanation
(src="https?://[^/]+)/   Capture group 1, match src="http, optional s, :// and till the first /
[^']*'                   Match any char except ', then match '
(/info[^']*)             Capture group 2, match /info followed by any char except '
'[^"]*                   Match the ' followed by matching any char except "
$cont = 'src="http://example.com/javascript:gallery(\'/info/2005/image.jpg\',383,550)"';
$cont = preg_replace("~(src=\"https?://[^/]+)/[^']*'(/info[^']*)'[^\"]*~", '$1$2', $cont);
echo $cont;
Output
src="http://example.com/info/2005/image.jpg"
Use
preg_replace("/src=\"\K.*(\/info[^']*)'[^\"]*/", 'http://example.com$1', $cont)
Explanation
--------------------------------------------------------------------------------
src=                     'src='
--------------------------------------------------------------------------------
\"                       '"'
--------------------------------------------------------------------------------
\K                       match reset operator
--------------------------------------------------------------------------------
.*                       any character except \n (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
\/                       '/'
--------------------------------------------------------------------------------
info                     'info'
--------------------------------------------------------------------------------
[^']*                    any character except: ''' (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
)                        end of \1
--------------------------------------------------------------------------------
'                        '\''
--------------------------------------------------------------------------------
[^\"]*                   any character except: '\"' (0 or more times (matching the most amount possible))
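To see the effect of \K in isolation, the following sketch applies the pattern explained above to the sample string from the question. The variable name $clean is an assumption for the example.

```php
<?php
$cont = 'src="http://example.com/javascript:gallery(\'/info/2005/image.jpg\',383,550)"';

// \K drops src=" from the reported match, so only the URL part is replaced;
// the closing double quote is never consumed because [^"]* stops before it.
$clean = preg_replace('/src="\K.*(\/info[^\']*)\'[^"]*/', 'http://example.com$1', $cont);

echo $clean, PHP_EOL;
```

Note that the host name is hard-coded in the replacement here, unlike the variant that captures it in a group.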

php preg_replace_callback blockquote regex

I am trying to create a regex that will convert the input below into the output below.
Input
> quote
the rest of it
> another paragraph
the rest of it
And OUTPUT
quote
the rest of it
another paragraph
the rest of it
with a resulting HTML of
<blockquote>
<p>quote
the rest of it</p>
<p>another paragraph
the rest of it</p>
</blockquote>
This is what I have below
$text = preg_replace_callback('/^>(.*)(...)$/m',function($matches){
return '<blockquote>'.$matches[1].'</blockquote>';
},$text);
Any help or suggestion would be appreciated
Here is a possible solution for the given example.
$text = "> quote
the rest of it
> another paragraph
the rest of it";
preg_match_all('/^>([\w\s]+)/m', $text, $matches);
$out = $text ;
if (!empty($matches)) {
$out = '<blockquote>';
foreach ($matches[1] as $match) {
$out .= '<p>'.trim($match).'</p>';
}
$out .= '</blockquote>';
}
echo $out ;
Outputs :
<blockquote><p>quote
the rest of it</p><p>another paragraph
the rest of it</p></blockquote>
Try this regex:
(?s)>((?!(\r?\n){2}).)*+
meaning:
(?s) # enable dot-all option
> # match the character '>'
( # start capture group 1
(?! # start negative look ahead
( # start capture group 2
\r? # match the character '\r' and match it once or none at all
\n # match the character '\n'
){2} # end capture group 2 and repeat it exactly 2 times
) # end negative look ahead
. # match any character
)*+ # end capture group 1 and repeat it zero or more times, possessively
The \r?\n matches Windows, *nix and (newer) macOS line breaks. If you need to account for really old Mac computers, add a single \r as an alternative: \r?\n|\r
question: https://stackoverflow.com/a/2222331/9238511
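One possible way to turn that regex into the requested HTML is sketched below. It assumes that paragraphs in the input are separated by a blank line (which is what the look-ahead for two line breaks relies on) and it does not escape HTML special characters; $html is an illustrative name.

```php
<?php
// Assumed input: quoted paragraphs separated by one blank line.
$text = "> quote\nthe rest of it\n\n> another paragraph\nthe rest of it";

// Each full match runs from ">" up to (not including) the next blank line.
preg_match_all('/(?s)>((?!(\r?\n){2}).)*+/', $text, $matches);

$html = '<blockquote>';
foreach ($matches[0] as $block) {
    // Drop the leading ">" and surrounding whitespace before wrapping.
    $html .= '<p>' . trim(substr($block, 1)) . '</p>';
}
$html .= '</blockquote>';

echo $html, PHP_EOL;
```

The full match ($matches[0]) is used rather than group 1, because a quantified group only retains its last repetition.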

PHP: Parse comma-delimited string outside single and double quotes and parentheses

I've found several partial answers to this question, but none that cover all my needs...
I am trying to parse a user generated string as if it were a series of php function arguments to determine the number of arguments:
This string:
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
will be inserted as the arguments of a function:
function my_function( [insert string here] ){ ... }
I need to parse the string on the commas, taking into account single- and double-quotes, parentheses, and escaped quotes and parentheses to create an array:
array(4) {
[0] => $arg1
[1] => $arg2='ABC,DEF'
[2] => $arg3="GHI\",JKL"
[3] => $arg4=array(1,'2)',"3\"),")
}
Any help with a regular expression or parser function to accomplish this is appreciated!
It isn't possible to solve this problem with a classical CSV tool, since more than one character can protect parts of the string.
Using preg_split is possible but will result in a very complicated and inefficient pattern. So the best way is to use preg_match_all. There are however several problems to solve:
a comma enclosed in quotes or parentheses must be ignored (treated as an ordinary character, not as a delimiter)
you need to extract the params, but you also need to check that the string is well-formed, otherwise the match results may be totally wrong!
For the first point, you can define subpatterns to describe each case: the quoted parts, the parts enclosed in parentheses, and a more general subpattern that matches a complete param and uses the two previous subpatterns when needed.
Note that the parentheses subpattern needs to refer to the general subpattern too, since it can contain anything (including commas).
The second point can be solved using the \G anchor, which ensures that all matches are contiguous. But you also need to be sure that the end of the string has been reached. To do that, you can add an optional empty capture group at the end of the main pattern that is set only if the end-of-string anchor \z succeeds.
$subject = <<<'EOD'
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
EOD;
$pattern = <<<'EOD'
~
# named groups definitions
(?(DEFINE) # this definition group allows to define the subpatterns you want
# without matching anything
(?<quotes>
' [^'\\]*+ (?s:\\.[^'\\]*)*+ ' | " [^"\\]*+ (?s:\\.[^"\\]*)*+ "
)
(?<brackets> \( \g<content> (?: ,+ \g<content> )*+ \) )
(?<content> [^,'"()]*+ # ' # (<-- comment for SO syntax highlighting)
(?:
(?: \g<brackets> | \g<quotes> )
[^,'"()]* # ' #
)*+
)
)
# the main pattern
(?: # two possible beginnings
\G(?!\A) , # a comma contiguous to a previous match
| # OR
\A # the start of the string
)
(?<param> \g<content> )
(?: \z (?<check>) )? # create an item "check" when the end is reached
~x
EOD;
$result = false;
if ( preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER) &&
isset(end($matches)['check']) )
$result = array_map(function ($i) { return $i['param']; }, $matches);
else
echo 'bad format' . PHP_EOL;
var_dump($result);
You could split the argument string at ,$ and then prepend $ back to the array values:
$args_array = explode(',$', $arg_str);
foreach($args_array as $key => $arg_raw) {
$args_array[$key] = '$'.ltrim($arg_raw, '$');
}
print_r($args_array);
Output:
(
[0] => $arg1
[1] => $arg2='ABC,DEF'
[2] => $arg3="GHI\",JKL"
[3] => $arg4=array(1,'2)',"3\"),")
)
If you want to use a regex, you can use something like this:
(.+?)(?:,(?=\$)|$)
Php code:
$re = '/(.+?)(?:,(?=\$)|$)/';
$str = "\$arg1,\$arg2='ABC,DEF',\$arg3=\"GHI\\\",JKL\",\$arg4=array(1,'2)',\"3\\\"),\")\n";
preg_match_all($re, $str, $matches);
Match information:
MATCH 1
1. [0-5] `$arg1`
MATCH 2
1. [6-21] `$arg2='ABC,DEF'`
MATCH 3
1. [22-39] `$arg3="GHI\",JKL"`
MATCH 4
1. [40-67] `$arg4=array(1,'2)',"3\"),")`
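Both the explode approach and this regex rely on every argument boundary looking like ,$ and on that sequence never appearing inside a value. The sketch below uses a made-up input (not from the question) to show what happens when a quoted string contains ,$.

```php
<?php
$re = '/(.+?)(?:,(?=\$)|$)/';

// Hypothetical input with two arguments; the first has ",$" inside its quotes.
$str = "\$a='x,\$y',\$b";

preg_match_all($re, $str, $matches);

// The quoted value is cut at the inner ",$", so three pieces come back.
print_r($matches[1]);
```

If such input is possible, a pattern that understands quotes and parentheses is the safer choice.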

split a string which consists decimals instead of integer

I split a string '3(1-5)' like this:
$pattern = '/^(\d+)\((\d+)\-(\d+)\)$/';
preg_match($pattern, $string, $matches);
But I need to do the same thing for decimals, i.e. '3.5(1.5-4.5)'.
And what do I have to do, if the user writes '3,5(1,5-4,5)'?
Output of '3.5(1.5-4.5)' should be:
$matches[1] = 3.5
$matches[2] = 1.5
$matches[3] = 4.5
You can use the following regular expression.
$pattern = '/^(\d+(?:[.,]\d+)?)\(((?1))-((?1))\)$/';
The first capturing group ( ... ) matches the following pattern:
( # group and capture to \1:
\d+ # digits (0-9) (1 or more times)
(?: # group, but do not capture (optional):
[.,] # any character of: '.', ','
\d+ # digits (0-9) (1 or more times)
)? # end of grouping
) # end of \1
Afterwards we look for an opening parenthesis and then recurse (match/capture) the 1st subpattern followed by a hyphen (-) and then recurse (match/capture) the 1st subpattern again followed by a closing parenthesis.
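A small usage sketch for this pattern follows. Converting a decimal comma to a dot before casting is an addition for illustration, as are the names $results and $values; the pattern itself is the one given above.

```php
<?php
$pattern = '/^(\d+(?:[.,]\d+)?)\(((?1))-((?1))\)$/';

$results = [];
foreach (['3(1-5)', '3.5(1.5-4.5)', '3,5(1,5-4,5)'] as $string) {
    if (preg_match($pattern, $string, $matches)) {
        // Normalise a decimal comma to a dot so the values convert to floats.
        $values = array_map(function ($v) {
            return (float) str_replace(',', '.', $v);
        }, array_slice($matches, 1));

        $results[$string] = $values;
        echo $string, ' => ', implode(' ', $values), PHP_EOL;
    }
}
```

Groups 2 and 3 wrap the (?1) calls, so each number still lands in its own entry of $matches.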
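This pattern should help, although it requires at least two digits in each number, so a single-digit integer input such as 3(1-5) will not match: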
^(\d+\.?\,?\d+)\((\d+\,?\.?\d+)\-(\d+\.?\,?\d+)\)$
