exploding a search string - php

I'm trying to make a search string, which can accept a query like this:
$string = 'title -launch category:technology -tag:news -tag:"outer space"$';
Here's a quick explanation of what I want to do:
$ = suffix indicating that the match should be exact
" = double quotes indicate that the multi-word is taken as a single keyword
- = a prefix indicating that the keyword is excluded
Here's my current parser:
$string = preg_replace('/(\w+)\:"(\w+)/', '"${1}:${2}', $string);
$array = str_getcsv($string, ' ');
I was using the code above, but it does not work as intended for keywords such as -tag:"outer space". It does not recognize keywords that start with the - character, and it splits the keyword at the whitespace between outer and space even though the phrase is enclosed in double quotes.
EDIT: What I'm trying to do with that code is to preg_replace -tag:"outer space" into "-tag:outer space" so that they won't be broken when I pass the string to str_getcsv().

You may use preg_replace like this:
preg_replace('/(-?\w+:)"([^"]+)"/', '"$1$2"', $str);
The regex matches:
(-?\w+:) - Capturing group 1: an optional - (? matches 1 or 0 occurrences), then 1+ letters/digits/underscores and a :
" - a double quote (it will be removed)
([^"]+) - Capturing group 2: one or more chars other than a double quote
" - a double quote
The replacement pattern is "$1$2": a ", the capturing group 1 value, the capturing group 2 value, and a closing ".
See the regex demo here.
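A minimal sketch of the whole pipeline, assuming the goal is one array entry per keyword. The helper name tokenize_search is made up for illustration, and the sample query leaves out the trailing $ so the expected tokens are easy to check by eye:

```php
<?php
// Hypothetical helper; the regex is the one explained above.
function tokenize_search(string $query): array
{
    // Move the opening quote in front of the optional "-" and the "key:" prefix,
    // so -tag:"outer space" becomes "-tag:outer space".
    $normalized = preg_replace('/(-?\w+:)"([^"]+)"/', '"$1$2"', $query);

    // Split on spaces; double-quoted fields stay in one piece.
    return str_getcsv($normalized, ' ', '"', '\\');
}

$tokens = tokenize_search('title -launch -tag:"outer space" category:technology');
print_r($tokens);
// ['title', '-launch', '-tag:outer space', 'category:technology']
```

The excluded, quoted keyword now arrives as a single token with its - prefix intact, so later code only has to inspect the first character of each token.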

Here's how I did it:
$string = preg_replace('/(\-?)(\w*\:?)"(\w+)/', '"$1$2$3', $string);
$array = str_getcsv($string, ' ');
I also accounted for formats like -"top ten", that is, quoted multi-word keywords that don't have a category/tag + colon prefix.
I'm sorry for being slow; I'm new to regex, PHP and programming in general, and this is also my first post on Stack Overflow. I'm learning it as a personal hobby. I'm glad that I learned something new today. I'll be reading more about regex since it looks like it can do a lot.

Related

php preg_match_all url with specific word between double quotes

I am having a very hard time coming up with a regex that works in this situation.
I am trying to use preg_match_all to capture the URL between double quotes that contains "720.mp4", so that the result is the URL without the surrounding quotes.
[{"file":"https:\/\/cw012.videohost.com\/files\/videos\/2017\/09\/18\/1505738417e8b76-720.mp4?h=33wg3l1i1G0XcJxvT82x7Q&ttl=1505769928",
In the end I want the above to result in:
https:\/\/cw012.videohost.com\/files\/videos\/2017\/09\/18\/1505738417e8b76-720.mp4?h=33wg3l1i1G0XcJxvT82x7Q&ttl=1505769928
Any ideas? I am very new to regex; I have done my reading but can't apply what I have read to this specific case.
As a simple approach you can use:
preg_match_all('#"([^"]*720\.mp4[^"]*)"#', $str, $m);
var_dump($m[1]);
The steps are straightforward. We want a literal ". Then we open a capture group ((), then anything that is not a ", then the literal string 720.mp4 (with an escaped dot, because . has a special meaning). Again anything but ", close the group, and a final ".
$m[1] is the content of the capture group we want. $m[0] contains the entire match with the quotes.
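As a rough end-to-end sketch, here is the pattern applied to the sample from the question. The final str_replace step, which turns the JSON-style \/ back into /, is an extra assumption about what the asker ultimately needs and is not part of the answer above:

```php
<?php
// Sample input copied from the question (a fragment of JSON-like text).
$str = '[{"file":"https:\/\/cw012.videohost.com\/files\/videos\/2017\/09\/18\/1505738417e8b76-720.mp4?h=33wg3l1i1G0XcJxvT82x7Q&ttl=1505769928",';

// Capture every double-quoted run that contains the literal text 720.mp4.
preg_match_all('#"([^"]*720\.mp4[^"]*)"#', $str, $m);

// Optional extra step (assumption): undo the escaped slashes.
$urls = array_map(fn($u) => str_replace('\/', '/', $u), $m[1]);
print_r($urls);
```

If the input is complete, valid JSON, decoding it with json_decode and filtering the resulting values is likely to be more robust than a regex.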

RegEx to match value of a variable or a string (with or without quotes)

Here is my dilemma:
I wrote this RegEx pattern which works in my sandbox but does not work on my website:
Sandbox: http://regex101.com/r/vP3uG4
Pattern:
(.*[$]'.$variable.'\s*=\s*\'?)(.*?)(\'?;.*)
The line of code goes like this:
$savedsettings_new = preg_replace('/(.*[$]'.$variable.'\s*=\s*\'?)(.*?)(\'?;.*)/is','$1'. $value .'$3',$savedsettings_temp);
As you can see it works on the sandbox but it doesn't work live.
I am trying to match values of variables that can be expressed as strings (with single quotes around them) or numerical values with no quotes, like so:
$match_string = 'value';
$match_number = 1;
Right now this code works fine with strings, but with numerical values that are not enclosed in quotes I just get the contents of the backreference $3 and nothing at all before it!
I'm scratching my head and really can't figure out why it works on RegEx101 but not live... Am I not doing the right thing when matching one or no single quotes (and escaping them because the preg_replace pattern is itself in single quotes)?
Okay, found out the issue. The solution is to wrap the backreference in ${}.
Quoting the PHP manual:
When working with a replacement pattern where a backreference is immediately followed by another number (i.e.: placing a literal number immediately after a matched pattern), you cannot use the familiar \\1 notation for your backreference. \\11, for example, would confuse preg_replace() since it does not know whether you want the \\1 backreference followed by a literal 1, or the \\11 backreference followed by nothing. In this case the solution is to use \${1}1.
So, your code should look like:
header('Content-Type: text/plain');
$variable = 'tbs_development';
$value = '333';
$savedsettings_temp = <<<'CODE'
$tbs_underconstruction = 'foo';
$tbs_development = 0;
CODE;
$pattern = '/(.*[$]'.preg_quote($variable, '/').'\s*=\s*\'?)(.*?)(\'?;.*)/is';
$replacement = '${1}'.$value.'${3}';
$savedsettings_new = preg_replace($pattern, $replacement, $savedsettings_temp);
echo $savedsettings_new;
Output:
$tbs_underconstruction = 'foo';
$tbs_development = 333;
If $value holds a number such as 2, the replacement string passed to preg_replace becomes $12$3.
That is valid syntax, but it does not mean what you expect. In a replacement string PHP reads up to two digits after $ (or \), so $12 is parsed as a reference to capture group 12 rather than group 1 followed by a literal 2.
Group 12 does not exist in this pattern, so it expands to an empty string and everything matched by group 1 is lost.
To avoid the ambiguity, wrap the group number in braces: ${1}2${3}.
Change your replacement pattern to '${1}'.$value.'${3}'
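The difference is easy to reproduce in isolation. The following is a small sketch using a simplified, single-line version of the pattern; the sample source line and the value 7 are made up for illustration:

```php
<?php
$source  = '$tbs_development = 0;';
$pattern = '/([$]tbs_development\s*=\s*\'?)(.*?)(\'?;)/';
$value   = '7';

// Builds the replacement "$17$3": PHP reads "$17" as group 17, which does
// not exist here, so the text captured by group 1 is dropped.
$broken = preg_replace($pattern, '$1' . $value . '$3', $source);

// Builds "${1}7${3}": the braces end the group number before the literal 7.
$fixed = preg_replace($pattern, '${1}' . $value . '${3}', $source);

var_dump($broken); // string(1) ";"
var_dump($fixed);  // string(21) "$tbs_development = 7;"
```

Using the braced form unconditionally is the safer habit, since it behaves the same whether the inserted value starts with a digit or not.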

Regex grab all text between brackets, and NOT in quotes

I'm attempting to match all text between {brackets}, however not if it is in quotation marks:
For example:
$str = 'value that I {want}, vs value "I do {NOT} want" ';
my results should snatch "want", but omit "NOT". I've searched stackoverflow desperately for the regex that could perform this operation with no luck. I've seen answers that allow me to get the text between quotes but not outside quotes and in brackets. Is this even possible?
And if so how is it done?
So far this is what I have:
preg_match_all('/{([^}]*)}/', $str, $matches);
But unfortunately it only gets all text inside brackets, including {NOT}
It's quite tricky to get this done in one go. I even wanted to make it compatible with nested brackets so let's also use a recursive pattern :
("|').*?\1(*SKIP)(*FAIL)|\{(?:[^{}]|(?R))*\}
Ok, let's explain this mysterious regex :
("|') # match either a single or a double quote and put it in group 1
.*? # match anything ungreedy until ...
\1 # match what was matched in group 1
(*SKIP)(*FAIL) # make it skip this match since it's a quoted set of characters
| # or
\{(?:[^{}]|(?R))*\} # match a pair of brackets (even if they are nested)
Some php code:
$input = <<<INP
value that I {want}, vs value "I do {NOT} want".
Let's make it {nested {this {time}}}
And yes, it's even "{bullet-{proof}}" :)
INP;
preg_match_all('~("|\').*?\1(*SKIP)(*FAIL)|\{(?:[^{}]|(?R))*\}~', $input, $m);
print_r($m[0]);
Sample output:
Array
(
[0] => {want}
[1] => {nested {this {time}}}
)
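The matches above still include the outer braces. If only the inner text is wanted, as in the question, one possible follow-up (a sketch, with a shortened sample input of my own) is to trim the first and last character of each match:

```php
<?php
$input = 'value that I {want}, vs value "I do {NOT} want", then {nested {this}}';

// Same pattern as above: quoted runs are skipped, brace pairs are matched.
preg_match_all('~("|\').*?\1(*SKIP)(*FAIL)|\{(?:[^{}]|(?R))*\}~', $input, $m);

// Drop the outermost "{" and "}" from every match.
$inner = array_map(fn($s) => substr($s, 1, -1), $m[0]);
print_r($inner);
// ['want', 'nested {this}']
```

Only the outermost pair is removed, so nested braces inside a match are kept as they are.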
Personally I'd process this in two passes. The first to strip out everything in between double quotes, the second to pull out the text you want.
Something like this perhaps:
$str = 'value that I {want}, vs value "I do {NOT} want" ';
// Get rid of everything in between double quotes
$str = preg_replace("/\".*\"/U","",$str);
// Now I can safely grab any text between curly brackets
preg_match_all("/\{(.*)\}/U",$str,$matches);
Working example here: http://3v4l.org/SRnva
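For comparison, the two-pass idea can be wrapped in a small function that returns just the captured text. This is a sketch only: the function name is hypothetical, and like the code above it handles double quotes but not single quotes or nested braces:

```php
<?php
// Hypothetical helper wrapping the two passes described above.
function braces_outside_quotes(string $text): array
{
    // Pass 1: drop every double-quoted run (the U flag makes .* lazy).
    $unquoted = preg_replace('/".*"/U', '', $text);

    // Pass 2: capture the text inside each remaining {...} pair.
    preg_match_all('/\{(.*)\}/U', $unquoted, $matches);

    return $matches[1];
}

$found = braces_outside_quotes('value that I {want}, vs value "I do {NOT} want" and {this}');
print_r($found);
// ['want', 'this']
```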

how to match this pattern in php

I am looking for a regular expression in php to parse a string of the following pattern. The command are wrapped by double square bracket as
[[a src="" desc=""]]
where a, src and desc are the keywords (won't be changed). src must be given but desc is optional, the value of src or desc can be wrapped by double or single quote. And src and desc could be given in any order. For example, the following patterns are all valid
[[a src="http://a.c.d" desc ="hello"]]
[[a src ="http://a.c.d" desc= 'hello']]
[[a desc ="hello " src= 'http://a.c.d' ]]
[[a src = "http://a.c.d" ]]
[[a src="http://a.c.d" desc ="hello"]]
any space between value and 'a', 'src', 'desc', '=' (without quotation) should be ignored. I am going to replace this command with html tag like
SOMETHING_EXTRACT_FROM_DESC
It seems pretty tough to think of one regex to do the work. Right now I have 3 regexes set up to handle the different cases separately. It looks like this:
$pattern = '/\[\[a[:blank:]+src[:blank:]*=[:blank:]*"(.*?)"[:blank:]+desc[:blank:]*=[:blank:]+"(.*?)"\]\]/i';
$rtn = preg_replace($pattern, '${2}', $src);
$pattern = '/\[\[a[:blank:]+desc[:blank:]*=[:blank:]*"(.*?)"[:blank:]+src[:blank:]*=[:blank:]+"(.*?)"\]\]/i';
$rtn = preg_replace($pattern, '${2}', $rtn);
$pattern = '/\[\[a[:blank:]+src[:blank:]*=[:blank:]+"(.*?)"\]\]/i';
$rtn = preg_replace($pattern, '${2}', $rtn);
But this doesn't work, regular expression is hard to learn :(
I wrote a regular expression that matches everything you requested, but it is slightly more permissive than required, as I'll explain at the end. But first the regex:
Looks like this:
\[\[a(\s+(src|desc)\s*=\s*('[^']*'|"[^"]*")){1,2}\s*\]\]
I'll break it down so you can understand it:
\[\[ ... \]\] matches [[ ... ]], the beginning and ending
\s matches any whitespace character (space, tab, newline), \s+ expects at least one
(src|desc) matches either the string src or the string desc. It's an OR operator: match src OR desc.
'[^']*' matches two single quotes and anything in between that is not a single quote
"[^"]*" same with double quotes
('[^']*'|"[^"]*") matches one of the above two
(src|desc)\s*=\s*('[^']*'|"[^"]*") matches a token like src='something'
{1,2} matches something once or twice; appended to the above expression, it matches one or two of those tokens
And that's pretty much it. The only problem is that it will also match this:
[[a src="http://a.c.d" src="http://a.c.d"]]
Which I think is a mismatch. If it doesn't bother you, you're good to go, otherwise you'll need to change the whole concept of using a big atom with ors (i.e.: |) and take a different approach. You could use look-aheads for example. But it will get real nasty pretty fast.
The regex is much more readable if I remove the backslashes and the \s parts. This version won't work, but I think it will help you understand it:
[[a ( (src|desc)=('[^']*'|"[^"]*") ){1,2} ]]
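To go from matching to actually reading the attribute values, one option is a second pass over each matched command. The sketch below is an illustration under my own assumptions: parse_a_command is a made-up name, the ^ and $ anchors are added so the function validates a single command, and the "src is mandatory" rule from the question is enforced in PHP rather than in the regex:

```php
<?php
// Hypothetical helper: returns the attributes of one [[a ...]] command,
// or null when the text is not a valid command or has no src.
function parse_a_command(string $command): ?array
{
    // The regex from above, anchored to the whole string.
    $whole = '/^\[\[a(\s+(src|desc)\s*=\s*(\'[^\']*\'|"[^"]*")){1,2}\s*\]\]$/';
    if (!preg_match($whole, $command)) {
        return null;
    }

    // Second pass: collect each name/value token in order of appearance.
    preg_match_all('/(src|desc)\s*=\s*(\'[^\']*\'|"[^"]*")/', $command, $pairs, PREG_SET_ORDER);

    $attrs = [];
    foreach ($pairs as $pair) {
        $attrs[$pair[1]] = substr($pair[2], 1, -1); // strip the surrounding quotes
    }

    // src must be present; desc is optional.
    return isset($attrs['src']) ? $attrs : null;
}

print_r(parse_a_command('[[a src = "http://a.c.d" ]]'));
// ['src' => 'http://a.c.d']
```

A duplicated attribute such as src="..." src="..." still passes the regex; in this sketch the later value simply overwrites the earlier one.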

PHP: Regex to ignore escaped quotes within quotes

I looked through related questions before posting this and I couldn't modify any relevant answers to work with my method (not good at regex).
Basically, here are my existing lines:
$code = preg_replace_callback( '/"(.*?)"/', array( &$this, '_getPHPString' ), $code );
$code = preg_replace_callback( "#'(.*?)'#", array( &$this, '_getPHPString' ), $code );
They both match strings contained between '' and "". I need the regex to ignore escaped quotes contained between themselves. So data between '' will ignore \' and data between "" will ignore \".
Any help would be greatly appreciated.
For most strings, you need to allow escaped anything (not just escaped quotes). e.g. you most likely need to allow escaped characters like "\n" and "\t" and of course, the escaped-escape: "\\".
This is a frequently asked question, and one which was solved (and optimized) long ago. Jeffrey Friedl covers this question in depth (as an example) in his classic work: Mastering Regular Expressions (3rd Edition). Here is the regex you are looking for:
Good:
"([^"\\]|\\.)*"
Version 1: Works correctly but is not terribly efficient.
Better:
"([^"\\]++|\\.)*" or "((?>[^"\\]+)|\\.)*"
Version 2: More efficient if you have possessive quantifiers or atomic groups (See: sin's correct answer which uses the atomic group method).
Best:
"[^"\\]*(?:\\.[^"\\]*)*"
Version 3: More efficient still. Implements Friedl's "unrolling-the-loop" technique. Does not require possessive quantifiers or atomic groups (i.e. this can be used in JavaScript and other less-featured regex engines).
Here are the recommended regexes in PHP syntax for both double and single quoted sub-strings:
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";
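A quick way to see the difference is to run the double-quote version next to the original lazy pattern on a string that contains escaped quotes. The sample text below is invented for illustration:

```php
<?php
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';

// In a single-quoted PHP string, \" stays a literal backslash plus quote.
$text = 'say "a \"quoted\" word" and "plain"';

// Unrolled-loop pattern: escaped quotes stay inside the match.
preg_match_all($re_dq, $text, $m);
print_r($m[0]);
// [0] => "a \"quoted\" word"
// [1] => "plain"

// The lazy pattern from the question stops at the first quote it sees,
// escaped or not, and so splits the first string apart.
$naiveCount = preg_match_all('/"(.*?)"/', $text, $naive);
echo $naiveCount, "\n"; // 3
```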
Try a regex like this:
'/"(\\\\[\\\\"]|[^\\\\"])*"/'
A (short) explanation:
" # match a `"`
( # open group 1
\\\\[\\\\"] # match either `\\` or `\"`
| # OR
[^\\\\"] # match any char other than `\` and `"`
)* # close group 1, and repeat it zero or more times
" # match a `"`
The following snippet:
<?php
$text = 'abc "string \\\\ \\" literal" def';
preg_match_all('/"(\\\\[\\\\"]|[^\\\\"])*"/', $text, $matches);
echo $text . "\n";
print_r($matches);
?>
produces:
abc "string \\ \" literal" def
Array
(
[0] => Array
(
[0] => "string \\ \" literal"
)
[1] => Array
(
[0] => l
)
)
as you can see on Ideone.
This has possibilities:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
/'(?>(?:(?>[^'\\]+)|\\.)*)'/
This will leave the quotes outside the match:
(?<=['"])(.*?)(?=["'])
With the global flag (/g), or preg_match_all in PHP, it will match every quoted group.
This seems to be as fast as the unrolled loop, based on some cursory benchmarks, but is much easier to read and understand. It doesn't require any backtracking in the first place.
"[^"\\]*(\\.[^"\\]*)*"
According to the W3C XPath 2.0 recommendation:
https://www.w3.org/TR/2010/REC-xpath20-20101214/#doc-xpath-StringLiteral
The general Regex is:
"(\\.|[^"])*"
(There is no need to exclude backslashes in the negated class, because the escape alternative is tried first.)
Explain:
"..." any match between quotes
(...)* The inside can have any length from 0 to Infinity
\\.|[^"] First accept a backslash followed by any character, or (|) otherwise accept any character that is not a double quote
The PHP version of the regex, with extra grouping to handle either kind of quote, can look like this:
<?php
$str='"First \\" \n Second" then \'This \\\' That\'';
echo $str."\n";
// "First \" \n Second" then 'This \' That'
$RX_inQuotes='/"((\\\\.|[^"])*)"/';
preg_match_all($RX_inQuotes,$str,$r,PREG_SET_ORDER);
echo $r[0][1]."\n";
// First \" \n Second
$RX_inAnyQuotes='/("((\\\\.|[^"])*)")|(\'((\\\\.|[^\'])*)\')/';
preg_match_all($RX_inAnyQuotes,$str,$r,PREG_SET_ORDER);
echo $r[0][2]." --- ".$r[1][5];
// First \" \n Second --- This \' That
?>
Try it: http://sandbox.onlinephpfunctions.com/code/4328cc4dfc09183f7f1209c08ca5349bef9eb5b4
Important note: when the content may contain multi-byte text such as UTF-8, add the u flag at the end of the regex (/.../u) so that multi-byte characters are not split apart, or use the mb_ereg functions such as mb_ereg_match.
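A small sketch of what the u flag changes, using a single accented character (written with a Unicode escape so the example does not depend on the file encoding):

```php
<?php
$text = "\u{00E9}"; // "é", which is two bytes in UTF-8

// Without u, the pattern works on bytes, so "." matches twice.
$bytes = preg_match_all('/./', $text);

// With u, the subject is treated as UTF-8, so "." matches one character.
$chars = preg_match_all('/./u', $text);

echo $bytes, ' ', $chars, "\n"; // 2 1
```

This matters most when a pattern can match or capture a single "character", for example with . or a negated character class.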
