Regex grab all text between brackets, and NOT in quotes - php

I'm attempting to match all text between {brackets}, however not if it is in quotation marks:
For example:
$str = 'value that I {want}, vs value "I do {NOT} want" ';
My results should capture "want" but omit "NOT". I've searched Stack Overflow for a regex that can do this, with no luck. I've seen answers that get the text between quotes, but none that get text in brackets only when it is outside quotes. Is this even possible?
And if so how is it done?
So far this is what I have:
preg_match_all('/{([^}]*)}/', $str, $matches);
But unfortunately it only gets all text inside brackets, including {NOT}

It's quite tricky to get this done in one go. I even wanted to make it compatible with nested brackets so let's also use a recursive pattern :
("|').*?\1(*SKIP)(*FAIL)|\{(?:[^{}]|(?R))*\}
Ok, let's explain this mysterious regex :
("|') # match either a single or a double quote and put it in group 1
.*? # match anything ungreedy until ...
\1 # match what was matched in group 1
(*SKIP)(*FAIL) # make it skip this match since it's a quoted set of characters
| # or
\{(?:[^{}]|(?R))*\} # match a pair of brackets (even if they are nested)
Some php code:
$input = <<<INP
value that I {want}, vs value "I do {NOT} want".
Let's make it {nested {this {time}}}
And yes, it's even "{bullet-{proof}}" :)
INP;
preg_match_all('~("|\').*?\1(*SKIP)(*FAIL)|\{(?:[^{}]|(?R))*\}~', $input, $m);
print_r($m[0]);
Sample output:
Array
(
[0] => {want}
[1] => {nested {this {time}}}
)
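The question asked for want rather than {want}. As a minimal sketch, assuming it is acceptable to post-process the matches, the pattern above can be run unchanged and the outer braces trimmed afterwards with substr. The sample input and the $inner variable are illustrative and not part of the original answer.

```php
<?php
// Same pattern as in the answer: skip quoted runs, match balanced braces.
$pattern = '~("|\').*?\1(*SKIP)(*FAIL)|\{(?:[^{}]|(?R))*\}~';
$input = 'value that I {want}, vs value "I do {NOT} want", plus {nested {this}}';

preg_match_all($pattern, $input, $m);

// Drop the first and last character (the outer braces) of every match.
$inner = array_map(function ($s) {
    return substr($s, 1, -1);
}, $m[0]);

print_r($inner);
```

Only the outermost pair of braces is removed, so nested braces inside a match are kept as they are.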

Personally I'd process this in two passes. The first to strip out everything in between double quotes, the second to pull out the text you want.
Something like this perhaps:
$str = 'value that I {want}, vs value "I do {NOT} want" ';
// Get rid of everything in between double quotes
$str = preg_replace("/\".*\"/U","",$str);
// Now I can safely grab any text between curly brackets
preg_match_all("/\{(.*)\}/U",$str,$matches);
Working example here: http://3v4l.org/SRnva

Related

exploding a search string

I'm trying to make a search string, which can accept a query like this:
$string = 'title -launch category:technology -tag:news -tag:"outer space"$';
Here's a quick explanation of what I want to do:
$ = suffix indicating that the match should be exact
" = double quotes indicate that the multi-word is taken as a single keyword
- = a prefix indicating that the keyword is excluded
Here's my current parser:
$string = preg_replace('/(\w+)\:"(\w+)/', '"${1}:${2}', $string);
$array = str_getcsv($string, ' ');
I was using the code above, but it doesn't work as intended for keywords such as -tag:"outer space". It doesn't recognize keywords starting with the - character, and it splits the keyword at the whitespace between outer and space, despite the double quotes.
EDIT: What I'm trying to do with that code is preg_replace -tag:"outer space" into "-tag:outer space" so that it isn't split when I pass the string to str_getcsv().
You may use preg_replace like this:
preg_replace('/(-?\w+:)"([^"]+)"/', '"$1$2"', $str);
See the PHP demo online.
The regex matches:
(-?\w+:) - Capturing group 1: an optional - (? matches 1 or 0 occurrences), then 1+ letters/digits/underscores and a :
" - a double quote (it will be removed)
([^"]+) - Capturing group 2: one or more chars other than a double quote
" - a double quote
The replacement pattern is "$1$2": a ", the capturing group 1 value, the capturing group 2 value, and a ".
See the regex demo here.
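To see the replacement working together with str_getcsv(), here is a small sketch. It assumes a shortened query without the trailing $ suffix (how str_getcsv treats characters after a closing quote is not covered here), and it passes the enclosure and escape arguments explicitly.

```php
<?php
$string = 'title -launch -tag:"outer space"';

// Move the opening quote in front of the (optionally negated) prefix,
// so -tag:"outer space" becomes "-tag:outer space".
$string = preg_replace('/(-?\w+:)"([^"]+)"/', '"$1$2"', $string);

// Split on spaces; a quoted field stays together as one keyword.
$array = str_getcsv($string, ' ', '"', '\\');

print_r($array);
```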
Here's how I did it:
$string = preg_replace('/(\-?)(\w+?\:?)"(\w+)/', '"$1$2$3', $string);
$array = str_getcsv($string, ' ');
I also considered formats like -"top ten" for quoted multi-word keywords that don't have a category/tag + colon prefix.
I'm sorry for being slow; I'm new to regex, PHP and programming in general, and this is also my first post on Stack Overflow. I'm trying to learn it as a personal hobby. I'm glad that I learned something new today. I'll be reading more about regex since it looks like it can do a lot of stuff.

Regex to find hashtag in string - without taking the initial hashtag symbol

I'm trying to do this in PHP and I am just wondering as I'm not great with Regex.
I'm trying to find all hashtags in a string, and wrap them in a link to twitter. In order to do this I need the content of the hashtag, without the symbol.
I want to select the #hashtag - without the preceding # => Just to return hashtag?
I'd like to do it in one line but I'm doing a preg_replace, followed by a string replace as shown:
$string = preg_replace('/\B#([a-z0-9_-]+)/i', '$0 ', $string);
$string = str_replace('https://twitter.com/hashtag/#', 'https://twitter.com/hashtag/', $string);
Any guidance is appreciated!
I was using a regex tester and found the answer.
preg_replace was returning two values, one $0 with the #hashtag value, and $1 with the hashtag value - without the # symbol.
Tested here (select preg_replace): http://www.phpliveregex.com/p/kOn
Perhaps it is something to do with the regex itself I'm not sure. Hopefully this helps someone else too.
My one liner is:
$string = preg_replace('/\B#([a-z0-9_-]+)/i', '$0 ', $string);
Edit: I understand it now. The added brackets ( ) around the square brackets effectively return the $1 variable. Otherwise the whole pattern is $0.
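The link markup in the snippets above does not seem to have survived, so here is a hedged sketch of the one-step version. It assumes the goal is an anchor tag pointing at https://twitter.com/hashtag/ followed by the tag name; the exact HTML and the sample text are assumptions. $1 (the tag without #) goes into the URL and $0 (the full match) is used as the link text.

```php
<?php
$string = 'Loving #PHP and #regex_101 today';

// $1 = captured tag name without "#", $0 = whole match including "#".
$linked = preg_replace(
    '/\B#([a-z0-9_-]+)/i',
    '<a href="https://twitter.com/hashtag/$1">$0</a>',
    $string
);

echo $linked, "\n";
```

With this approach the follow-up str_replace is no longer needed, because the # never reaches the URL.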

Search and Replace with Regex

I am trying to search through text for a specific word and then add an HTML tag around that word. For example, if I had the string "I went to the shop to buy apples and oranges" and wanted to add HTML bold tags around apples.
The problem: the word I search the string with is stored in a text file and can be uppercase, lowercase, etc. When I use preg_replace I manage to add the tags correctly, but if, for example, I searched for APPLES and the string contained "apples", it would change the text from apples to APPLES. I want the casing to stay the same.
I have tried using preg_replace but I can't find a way to keep the original casing. This is what I have:
foreach($keywords as $value)
{
$pattern = "/\b$value\b/i";
$replacement = "<b>$value</b>";
$new_string = preg_replace($pattern, $replacement, $string);
}
So again, if $value was APPLES, it would change every occurrence of apples in $string to uppercase, because $replacement contains $value, which is "APPLES".
How could I achieve this while keeping the original casing and without looping multiple times over different case variants?
Thanks
Instead of using $value verbatim in the replacement, you can use the literal strings \0 or $0. Just as \n/$n, for some integer n, refers back to the nth capturing group of parentheses, \0/$0 is expanded to the entire match. Thus, you'd have
foreach ($keywords as $value) {
$new_string = preg_replace("/\\b$value\\b/i", '<b>$0</b>', $string);
}
Note that '<b>$0</b>' uses single quotes. You can get away with double quotes here, because $0 isn't interpreted as a reference to a variable, but I think this is clearer. In general, you have to be careful with using a $ inside a double-quoted string, as you'll often get a reference to an existing variable unless you escape the $ as \$. Similarly, you should escape the backslash in \b inside the double quotes for the pattern; although it doesn't matter in this specific case, in general backslash is a meaningful character within double quotes.
I might have misunderstood your question, but if what you are struggling on is differentiating between upper-case letter (APPLE) and lower-case letter (apple), then the first thing you could do is convert the word into upper-case, or lower-case, and then run the tests to find it and put HTML tags around it. That is just my guess and maybe I completely misunderstood the question.
The code also has an unrelated bug: every iteration runs preg_replace on the original $string and overwrites $new_string, so after the loop $new_string contains only the last keyword's replacement.
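Putting these points together, a minimal sketch might look like the following. It uses $0 in the replacement to keep the original casing and feeds each result into the next iteration so earlier replacements are not lost. The preg_quote() call is an addition not mentioned above; it is there on the assumption that keywords read from a file may contain regex metacharacters. The sample data is made up.

```php
<?php
$string = 'I went to the shop to buy apples and Oranges';
$keywords = ['APPLES', 'oranges'];

// Start from the original text and keep applying replacements to the result.
$new_string = $string;
foreach ($keywords as $value) {
    // Escape the keyword so it is matched literally inside the pattern.
    $pattern = '/\b' . preg_quote($value, '/') . '\b/i';
    // $0 is the text as it appeared in the subject, so casing is preserved.
    $new_string = preg_replace($pattern, '<b>$0</b>', $new_string);
}

echo $new_string, "\n";
```

One caveat of this sketch: a keyword such as b would also match inside tags inserted by an earlier iteration.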

PHP: Regex to ignore escaped quotes within quotes

I looked through related questions before posting this and I couldn't modify any relevant answers to work with my method (not good at regex).
Basically, here are my existing lines:
$code = preg_replace_callback( '/"(.*?)"/', array( &$this, '_getPHPString' ), $code );
$code = preg_replace_callback( "#'(.*?)'#", array( &$this, '_getPHPString' ), $code );
They both match strings contained between '' and "". I need the regex to ignore escaped quotes contained between themselves. So data between '' will ignore \' and data between "" will ignore \".
Any help would be greatly appreciated.
For most strings, you need to allow escaped anything (not just escaped quotes). e.g. you most likely need to allow escaped characters like "\n" and "\t" and of course, the escaped-escape: "\\".
This is a frequently asked question, and one which was solved (and optimized) long ago. Jeffrey Friedl covers this question in depth (as an example) in his classic work: Mastering Regular Expressions (3rd Edition). Here is the regex you are looking for:
Good:
"([^"\\]|\\.)*"
Version 1: Works correctly but is not terribly efficient.
Better:
"([^"\\]++|\\.)*" or "((?>[^"\\]+)|\\.)*"
Version 2: More efficient if you have possessive quantifiers or atomic groups (See: sin's correct answer which uses the atomic group method).
Best:
"[^"\\]*(?:\\.[^"\\]*)*"
Version 3: More efficient still. Implements Friedl's: "unrolling-the-loop" technique. Does not require possessive or atomic groups (i.e. this can be used in Javascript and other less-featured regex engines.)
Here are the recommended regexes in PHP syntax for both double and single quoted sub-strings:
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";
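As a quick check of the double-quote version, here is a small sketch that runs $re_dq over a made-up line of code containing escaped quotes. The $code sample is illustrative only.

```php
<?php
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';

// The subject is: echo "say \"hi\"" . "x";
$code = 'echo "say \\"hi\\"" . "x";';

preg_match_all($re_dq, $code, $matches);

// Each match is one complete string literal, escaped quotes included.
print_r($matches[0]);
```

The escaped quotes inside the first literal do not end the match, so the two literals are returned as two separate matches.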
Try a regex like this:
'/"(\\\\[\\\\"]|[^\\\\"])*"/'
A (short) explanation:
" # match a `"`
( # open group 1
\\\\[\\\\"] # match either `\\` or `\"`
| # OR
[^\\\\"] # match any char other than `\` and `"`
)* # close group 1, and repeat it zero or more times
" # match a `"`
The following snippet:
<?php
$text = 'abc "string \\\\ \\" literal" def';
preg_match_all('/"(\\\\[\\\\"]|[^\\\\"])*"/', $text, $matches);
echo $text . "\n";
print_r($matches);
?>
produces:
abc "string \\ \" literal" def
Array
(
[0] => Array
(
[0] => "string \\ \" literal"
)
[1] => Array
(
[0] => l
)
)
as you can see on Ideone.
This has possibilities:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
/'(?>(?:(?>[^'\\]+)|\\.)*)'/
This leaves the quotes outside the match:
(?<=['"])(.*?)(?=["'])
With the global flag (/g), or preg_match_all in PHP, it matches every group.
This seems to be as fast as the unrolled loop, based on some cursory benchmarks, but is much easier to read and understand. It doesn't require any backtracking in the first place.
"[^"\\]*(\\.[^"\\]*)*"
According to W3 resources :
https://www.w3.org/TR/2010/REC-xpath20-20101214/#doc-xpath-StringLiteral
The general Regex is:
"(\\.|[^"])*"
(There is no need to exclude backslashes in the character class when the escape alternative is tried first.)
Explanation:
"..." any match between quotes
(...)* The inside can have any length from 0 to infinity
\\.|[^"] First accept any character preceded by a backslash, or (|) otherwise accept any character that is not a double quote
A PHP version of the regex, with extra groups so that either kind of quote can be handled, could look like this:
<?php
$str='"First \\" \n Second" then \'This \\\' That\'';
echo $str."\n";
// "First \" \n Second" then 'This \' That'
$RX_inQuotes='/"((\\\\.|[^"])*)"/';
preg_match_all($RX_inQuotes,$str,$r,PREG_SET_ORDER);
echo $r[0][1]."\n";
// First \" \n Second
$RX_inAnyQuotes='/("((\\\\.|[^"])*)")|(\'((\\\\.|[^\'])*)\')/';
preg_match_all($RX_inAnyQuotes,$str,$r,PREG_SET_ORDER);
echo $r[0][2]." --- ".$r[1][5];
// First \" \n Second --- This \' That
?>
Try it: http://sandbox.onlinephpfunctions.com/code/4328cc4dfc09183f7f1209c08ca5349bef9eb5b4
Important note: if the input may contain multi-byte text such as UTF-8, add the u flag at the end of the regex (/.../u) so that multi-byte characters are not broken apart, or use the mb_ereg functions such as mb_ereg_match.
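A small sketch of what the u flag changes, assuming PHP 7 or later for the "\u{...}" escape: without the flag the pattern works on bytes, so a dot can match just one byte of a multi-byte character.

```php
<?php
// "\u{00E9}" is the letter e with an acute accent, two bytes in UTF-8.
$str = "caf\u{00E9}";

// Without the u flag the final "." matches only the last byte.
preg_match('/.$/', $str, $bytes);

// With the u flag the subject is treated as UTF-8 and "." matches the
// whole character.
preg_match('/.$/u', $str, $chars);

var_dump(strlen($bytes[0])); // byte length of the match without u
var_dump(strlen($chars[0])); // byte length of the match with u
```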

Negate character group: match "abc'," but not "abc\',"

I need a pattern which can negate a character group and also negate a character inside the negated group.
The following pattern works, but I want to do a bit more
(?:(?!'\,).)+
Here I don't want to match strings that contain ',
But what I really need is to integrate a negation inside the negation group - something like this
(?:(?![^\\]'\,).)+
I don't want to match any escaped quote signs
Match: abc',
Don't match: abc\',
argh.. it posts on enter..
$str = "'abc\',',asdf";
preg_match("/^('(?:(?!',).)+')/", $str, $matches);
echo '<pre>';
print_r($matches);
echo '</pre>';
this should output abc\', but it outputs abc\
Judging by your last comment, I think you're trying to match a single-quoted string literal, which might contain single-quotes escaped with backslashes. For example, in this string:
'abc\',','xyz'
...you want to match 'abc\',' and 'xyz'. That's easy enough:
$source = "'abc\',','xyz'";
print "$source\n\n";
preg_match_all("/'(?:[^'\\\\]++|\\\\.)*+'/", $source, $matches);
print_r($matches);
output:
'abc\',','xyz'
Array
(
[0] => Array
(
[0] => 'abc\','
[1] => 'xyz'
)
)
see it on ideone
But maybe you want to match all the items in a comma-separated list, which may or may not be quoted--in other words, CSV (or something very similar). If that's the case, you should use a dedicated CSV processing tool; there are many of them out there. In fact, PHP has one built in: http://php.net/manual/en/function.fgetcsv.php
/^(?:(?!\\\\',).)+$/ appears to do what you want. Note that you have to escape the single quote ''. See http://ideone.com/ypln2
If you don't necessarily want to match the full string, remove the ^ and $. See http://ideone.com/G67RV
