PHP Regex : several stopping characters with Positive lookbehind - php

Hi stackoverflow community !
I'm trying to use a simple regex expression in PHP based on a Positive lookbehind. My objective is to extract everything in a URL between a domain name and a set of specific characters (? or & or /). I want to extract "bar" on those examples :
foo.com/bar?
foo.com/bar&
foo.com/bar/
I tried
(?<=foo\.com\/)[^/?&]+
it works fine in the plateform test
but not with PHP 5.3x preg_match : the error thrown is that I can't use several stopping characters - it works with one.
I also tried a combination of positive lookbehind/lookahead, but the issue remains the same.
What did I do wrong ?

In PHP, unlike (say) JavaScript, you can't use the regex-delimiter without escaping it, even inside a character class. So, you need to change this:
"/(?<=foo\.com\/)[^/?&]+/"
to this:
"/(?<=foo\.com\/)[^\/?&]+/"

Escape the slashes:
preg_match("/(?<=foo\.com\/)[^\/?&]+/", "http://www.foo.com/bar?", $result);
here ___^
or use another delimiter
preg_match("#(?<=foo\.com/)[^/?&]+#", "http://www.foo.com/bar?", $result);

Related

preg_replace_callback to run EXCEPT when inside first argument of .replace()

I want to perform a php preg_match_callback against all single or double-quoted strings, for which I'm using the code seen on https://codereview.stackexchange.com/a/217356, which includes handling of backslashed single/double quotes.
const PATTERN = <<<'PATTERN'
~(?|(")(?:[^"\\]|\\(?s).)*"|(')(?:[^'\\]|\\(?s).)*'|(#|//).*|(/\*)(?s).*?\*/|(<!--)(?s).*?-->)~
PATTERN;
$result=preg_replace_callback(PATTERN, function($m) {
return $m[1]."XXXX".$m[1];
}, $test);
but this runs into a problem when scanning blocks like that seen in .replace() calls from javascript, e.g.
x=y.replace(/'/g, '"');
... which treats '/g, ' as a string, with the "');......." as the following string.
To work around this I figure it would be good to do the callback except when the quotes are inside the first argument of .replace() as these cause problems with quoting.
i.e. do the standard callbacks, but when .replace is involved I want to change the XXXX part of abc.replace(/\'/, "XXXX"); but I want to ignore the \' quote/part.
How can I do this?
See https://onlinephp.io/c/5df12 ** https://onlinephp.io/c/8a697 for a running example, showing some successes (in green), and some failures (in red).
(** Edit to correct missing slash)
Note, the XXXX is a placeholder for some more work later.
Also note that I have looked at Javascript regex to match a regex but this talks about matching regex's - and I'm talking about excluding them. If you plug in their regex pattern into my code it does not work - so should not be considered a valid answer
You can use verbs (*SKIP)(*F) to skip something. For skipping the first argument e.g.:
\(\s*/.*?/\w*\h*,(*SKIP)(*F)|(?|(")[^"\\]*(?:\\.[^"\\]*)*"|(')[^'\\]*(?:\\.[^'\\]*)*')
See this demo at regex101 or your updated php demo
The pattern on the skipped side is very simple, you might want to further improve that.
Besides I used a bit more efficient pattern to match the quoted parts, explained here.

Extract text between brakets tags in php using Regex

I have the following content in a string (query from the DB), example:
$fulltext = "Thank you so much, {gallery}art-by-stephen{/gallery}. As you know I fell in love with it from the moment I saw it and I couldn’t wait to have it in my home!"
So I only want to extract what it is between the {gallery} tags, I'm doing the following but it does not work:
$regexPatternGallery= '{gallery}([^"]*){/gallery}';
preg_match($regexPatternGallery, $fulltext, $matchesGallery);
if (!empty($matchesGallery[1])) {
echo ('<p>matchesGallery: '.$matchesGallery[1].'</p>');
}
Any suggestions?
Try this:
$regexPatternGallery= '/\{gallery\}(.*)\{\/gallery\}/';
You need to escape / and { with a \ before it. And you where missing start and end / of the pattern.
http://www.phpliveregex.com/p/fn1
Similar to Andreas answer but differ in ([^"]*?)
$regexPatternGallery= '/\{gallery\}([^"]*?)\{\/gallery\}/';
Don't forget to put / at the beginning and the end of the Regex string. That's a must in PHP, different from other programming languages.
{,},/ are characters that can be confused as a Regex logic, so you have to escape it using \ like \{.
Use ? to make the string to non-greedy, thus saves memory. It avoids error when facing this kind of string "blabla {galery}you should only get this{/gallery} but you also got this instead.{/gallery} Rarely happens but be careful anyway".
Try this RegEx:
\{gallery\}(.*?)\{\/gallery\}
The problem with your RegEx was that you did not escape the / in the closing {gallery}. You also need to escape { and }.
You should use .*? for a lazy match, otherwise if there are 2 tags in one string, it will combine them. I.e. {gallery}by-joe{/gallery} and {gallery}by-tim{/gallery} would end up as:
by-joe{/gallery} and {gallery}by-tim
However, using a lazy match, you would get 2 results:
by-joe
by-tim
Live Demo on Regex101

How to match a string until the first instance of a character that does not follow another specific character

Related question: How can I use regex to match a character (') when not following a specific character (?)?
I'm parsing a log using regex (PHP PCRE library), and trying to extract a URL from it. The URL is encapsulated in double quotes ", but some of the requests also include a double quote ". For example:
"https://www.amh.net.au/online/dbSearch.php?t=all&q=\"Rosuvastatin\""
My first pattern was basically:
#\"([^\"]*)\"#
This worked well, until I reached one of the entries as above, and it truncated the match so all I got was:
https://www.amh.net.au/online/dbSearch.php?t=all&q=\
After digging around, and rediscovering the cheatsheets for regex at http://addedbytes.com and also some more useful information at http://www.regular-expressions.info/lookaround.html I have now tried the following look-behind:
#"([(?<!\\)"]*)"#
But, now all I get is "" and then an empty string
You placed your lookbehind INSIDE your group ([]), so it's not interpreted as such, but rather just you say you only want those individual characters.
Basically, I think you'd like something like this:
#"(?:[^"]|(?<=\\)")"#
Though you should be aware that you'd be trolled by \\" for example.
The URLs in the logs would be URL-encoded. As such, the following pattern should work:
#\"([^ ]*)\"#

Building a regex expression for PHP

I am stuck trying to create a regex that will allow for letters, numbers, and the following chars: _ - ! ? . ,
Here is what I have so far:
/^[-\'a-zA-Z0-9_!\?,.\s]+$/ //not escaping the ?
and this version too:
/^[-\'a-zA-Z0-9_!\?,.\s]+$/ //attempting to escape the ?
Neither of these seem to be able to match the following:
"Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?"
Can somebody point out what I am doing wrong? I must point out that my script takes the user input (the paragraph in quotes in this case) and strips all white space so actual input has no white space.
Thanks!
UPDATE:
Thanks to Lix's advice, this is what I have so far:
/^[-\'a-zA-Z0-9_!\?,\.\s]+$/
However, it's still not working??
UPDATE2
Ok, based on input this is what's happening.
User inputs string, then I run the string through following functions:
$comment = preg_replace('/\s+/', '',
htmlspecialchars(strip_tags(trim($user_comment_orig))));
So in the end, user input is just a long string of chars without any spaces. Then that string of chars is run using:
preg_match("#^[-_!?.,a-zA-Z0-9]+$#",$comment)
What could possibly be causing trouble here?
FINAL UPDATE:
Ended up using this regex:
"#[-'A-Z0-9_?!,.]+#i"
Thanks all! lol, ya'll are going to kill me once you find out where my mistake was!
Ok, so I had this piece of code:
if(!preg_match($pattern,$comment) || strlen($comment) < 2 || strlen($comment) > 60){
GEEZ!!! I never bothered to look at the strlen part of the code. Of course it was going to fail every time...I only allowed 60 chars!!!!
When in doubt, it's always safe to escape non alphanumeric characters in a class for matching, so the following is fine:
/^[\-\'a-zA-Z0-9\_\!\?\,\.\s]+$/
When run through a regular expression tester, this finds a match with your target just fine, so I would suggest you may have a problem elsewhere if that doesn't take care of everything.
I assume you're not including the quotes you used around the target when actually trying for a match? Since you didn't build double quote matching in...
Can somebody point out what I am doing wrong? I must point out that my script takes the user input (the paragraph in quotes in this case) and strips all white space so actual input has no white space.
in which case you don't need the \s if it's working correctly.
I got the following code to work as expected to (running php5):
<?php
$pattern = "#[-'A-Z0-9_?!,.\s]+#i";
$string = "Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?";
$results = array();
preg_match($pattern, $string, $results);
echo '<pre>';
print_r($results);
echo '</pre>';
?>
The output from print_r($results) was as following:
Array
(
[0] => Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?
)
Tested on http://writecodeonline.com/php/.
It's not necessary to escape most characters inside []. However, \s will not do what you want inside the expression. You have two options: either manually expand (/^[-\'a-zA-Z0-9_!?,. \t\n\r]+$/) or use alternation (/^(?:[-\'a-zA-Z0-9_!?,.]|\s)+$/).
Note that I left the \ before the ' because I'm assuming you're putting this in a PHP string and I wouldn't want to suggest a syntax error.
The only characters with a special meaning within a character class are:
the dash (since it can be used as a delimiter for ranges), except if it is used at the beginning (since in this case it is no part of any range),
the closing bracket,
the backslash.
In "pure regex parlance", your character class can be written as:
[-_!?.,a-zA-Z0-9\s]
Now, you need to escape whatever needs to be escaped according to your language and how strings are written. Given that this is PHP, you can take the above sample as is. Note that \s is interpreted in character classes as well, so this will match anything which is matched by \s outside of a character class.
While some manuals recommend using escapes for safety, knowing the general regex rules for character classes and applying them leads to shorter and easier to read results ;)

Difference in laziness of lookahead assertions between JavaScript and PHP

I'm confused by a difference I found between the way JavaScript and PHP handle the following regex.
In JavaScript,
'foobar'.replace(/(?=(bar))/ , '$1');
'foobar'.replace(/(?=(bar))?/ , '$1');
'foobar'.replace(/(?:(?=(bar)))?/, '$1');
results in, respectively,
foobarbar
foobar
foobar
as shown in this jsFiddle.
However, in PHP,
echo preg_replace('/(?=(bar))/', '$1', "foobar<br/>");
echo preg_replace('/(?=(bar))?/', '$1', "foobar<br/>");
echo preg_replace('/(?:(?=(bar)))?/', '$1', "foobar<br/>");
results in,
foobarbar
Warning: preg_replace() [function.preg-replace]: Compilation failed: nothing to repeat at offset 9 in /homepages/26/d94605010/htdocs/lz/writecodeonline.com/php/index.php(201) : eval()'d code on line 2
foobarbar
I'm not so much worried about the warning. But it appears that in JavaScript, lookahead assertions are somehow "lazier" than in PHP. Why the difference? Is this a bug in one of the engines? Which is theoretically more "correct"?
The real difference is actually very simple:
In JavaScript, replace will only replace the first match, unless using the /g flag (global).
In PHP, preg_replace replaces all matches.
The third pattern, (?:(?=(bar)))?, can match the empty string in every position, and captures "bar" in some positions. Without the /g flag, it only matches once, at the beginning of the string.
You would have easily seen the difference had you used a more visible replacement string, like [$1].
PHP Example: http://ideone.com/8Mjg6
JavaScript Example, no /g: http://jsfiddle.net/qKb4b/3/
JavaScript Example, with /g: http://jsfiddle.net/qKb4b/2/
I would also note that "laziness" is a different concept in regular expressions, not related to this question.

Categories