Why doesn't this greedy regex work as expected?

Why doesn't this greedy regex work as expected? - php

i'm trying to learn a bit about regex, can anyone explain to me what is going on here? And give example on a regex that would provide the expected output? Thanks!
input data = 'Sometext|even more text'
regex = '(.*)?\|?.*'
replacement = '$1'
expected output = 'Sometext'
actual output = 'Sometext|even more text'
PHP
preg_filter("(.*)?\|?.*", "$1", 'Sometext|even more text'); // returns Sometext|even more text

(.*) is greedy, so matches everything. $1 is everything then.
You are probably looking for:
/^([^|]*).*$/

Your regex is saying "all chars, followed by an optional |, followed by 0 or more chars".
Change the initial (.*) to ([^\|]*), or make the | non-optional.

* is greedy, which means it will try to match as much text as possible. In this case:
(.*)? will match all the text
\|?.* will match the "rest" (empty string)
try: regex = '\|[^|]*', replacement = ''

If you change your regex to (\w+)?\|?.*, specifically adding the + after the \w then you will get your expected answer of 'Sometext'.
The reason you were having the whole string match is that the first .* was matching the whole string. With the changes I have above, you will be matching on any word character.

Related

Regex After Last / and Before period

Sorry if the title is confusing. All I'm trying to do is some simple regex:
The text: /thing/images/info.gif
And what I want is: info
My regex (not fully working): ([^\/]+$)(.*?)(?=\.gif)
(Note: [^\/]+$ returns info.gif)
Thanks for any help!

I'd say you don't need to match all the string, so you can be much more generic. If you know your string always contains a path you can just use:
preg_match( '/([^\/]+)\.\w+$/', "/thing/images/info.gif", $matches) ;
print_r( $matches );
and it will be valid for any filename, even names that contains dots like my_file.name.jpg or spaces like /thing/images/my image.gif
Demo here.
The structure is (from the end of the regex moving to the left):
Match before the end of the string
any number of characters preceded by a dot
any character that is not a slash (your filename, if there is a slash, there starts the directories)

Not sure how much more complex the string is but this seems to work on the test string:
preg_match('![^/.]+(?=\.gif)!', '/thing/images/info.gif', $m);
Matching NOT / NOT . followed by .gif.

In editors (Sublime):
Find:^(.*)(\/)(.*)(\.)(.*)$
Replace it with:\3
In PHP:
<?php
preg_match('/^(.*)(\/)(.*)(\.)(.*)$/', '/thing/images/info.gif', $match);
echo $match[3];

regex to clean up url

I am looking for a way to get a valid url out of a string like:
$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
My original solution was:
preg_match('#^[^:|]*#', str_replace('//', '/', $string), $modifiedPath);
But obviously its going to remove a slash from the http:// instead of the one in the middle of the string.
My expected output that I want from the original is:
http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
I could always break off the http part of the string first but would like a more elegant solution in the form of regex if possible. Thanks.

This will do exactly what you are asking:
<?php
$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
preg_match('/^([^|]+)/', $string, $m); // get everything up to and NOT including the first pipe (|)
$string = $m[1];
$string = preg_replace('/(?<!:)\/\//', '/' ,$string); // replace all occurrences of // as long as they are not preceded by :
echo $string; // outputs: http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
exit;
?>
EDIT:
(?<!X) in regular expressions is the syntax for what is called a lookbehind. The X is replaced with the character(s) we are testing for.
The following expression would match every instance of double slashes (/):
\/\/
But we need to make sure that the match we are looking for is NOT preceded by the : character so we need to 'lookbehind' our match to see if the : character is there. If it is then we don't want it to be counted as a match:
(?<!:)\/\/
The ! is what says NOT to match in our lookbehind. If we changed it to (?=:)\/\/ then it would only match the double slashes that did have the : preceding them.
Here is a Quick tutorial that can explain it all better than I can lookahead and lookbehind tutorial

Assuming all your strings are in the form given, you don't need any but the simplest of regexes to do this; if you want an elegant solution, then a regex is definitely not what you need. Also, double slashes are legal in a URL, just like in a Unix path, and mean the same thing a single slash does, so you don't really need to get rid of them at all.
Why not just
$url = array_shift(preg_split('/\|/', $string));
?
If you really, really care about getting rid of the double slashes in the URL, then you can follow this with
$url = preg_replace('/([^:])\/\//', '$1/', $url);
or even combine them into
$url = preg_replace('/([^:])\/\//', '$1/', array_shift(preg_split('/\|/', $string)));
although that last form gets a little bit hairy.

Since this is a quite strictly defined situation, I'd consider just one preg to be the most elegant solution.
From the top of my head:
$sanitizedURL = preg_replace('~((?<!:)/(?=/)|\\|.+)~', '', $rawURL);
Basically, what this does is look for any forward slash that IS NOT preceded by a colon (:), and IS followed bij another forward slash. It also searches for any pipe character and any character following it.
Anything found is removed from the result.
I can explain the RegEx in more detail if you like.

extract text between two words in php

I got the following URL
http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego
and I want to extract
B000NO9GT4
that is the asin...to now, I can get search between the string, but not in this way I require. I saw the split functin, I saw the explode. but cant find a way out...also, the urls will be different in length so I cant hardcode the length two..the only thing which make some sense in my mind is to split the string so that
http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/
become first part
and
B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego
becomes the 2nd part , from the second part , I should extract B000NO9GT4
in the same way, i would want to get product name LEGO-Ultimate-Building-Set-Pieces from the first part
I am very bad at regex and cant find a way out..
can somebody guide me how I can do it in php?
thanks

This grabs both pieces of information that you are looking to capture:
$url = 'http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego';
$path = parse_url($url, PHP_URL_PATH);
if (preg_match('#^/([^/]+)/dp/([^/]+)/#i', $path, $matches)) {
echo "Description = {$matches[1]}<br />"
."ASIN = {$matches[2]}<br />";
}
Output:
Description = LEGO-Ultimate-Building-Set-Pieces
ASIN = B000NO9GT4
Short Explanation:
Any expressions enclosed in ( ) will be saved as a capture group. This is how we get at the data in $matches[1] and $matches[2].
The expression ([^/]+) says to match all characters EXCEPT / so in effect it captures everything in the URL between the two / separators. I use this pattern twice. The [ ] actually defines the character class which was /, the ^ in this case negates it so instead of matching / it matches everything BUT /. Another example is [a-f0-9] which would say to match the characters a,b,c,d,e,f and the numbers 0,1,2,3,4,5,6,7,8,9. [^a-f0-9] would be the opposite.
# is used as the delimiter for the expression
^ following the delimiter means match from the beginning of the string.
See www.regular-expressions.info and PCRE Pattern Syntax for more info on how regexps work.

You can try
$str = "http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego" ;
list(,$desc,,$num,) = explode("/",parse_url($str,PHP_URL_PATH));
var_dump($desc,$num);
Output
string 'LEGO-Ultimate-Building-Set-Pieces' (length=33)
string 'B000NO9GT4' (length=10)

preg_replace not matching properly

I know this has been asked before as ive just been reading those answers but still cant get this to work (properly).
Im very new to regex and am trying to do something that sounds pretty simple:
The string would be:
http://www.something.com/section/filter/colour/red-#998682/size/small/
What i would like to do is a preg_replace to remove the -#?????? so the url looks like:
http://www.something.com/section/filter/colour/red/size/small/
So i tried:
$string = $theURL;
$pattern = '/-\#(.*)\//i';
$replacement = '/';
$newURL = preg_replace($pattern, $replacement, $string);
That sort of works but it doesnt stop. If I have anything after the -#?????? it also removes that as well. But I thought having the / on the end would stop it doing that?
Hoping someone can help and thanks for reading

PCRE is greedy by default, meaning that .* will match as big a chunk as possible. Make it ungreedy by adding the U flag (for the entire pattern) or use .*? (for just that wildcard part):
/-\#(.*)\//iU
or
/-\#(.*?)\//i

You need to use non-greedy quantifier.
$pattern = '/-\#(.*?)\//i';
Your regex is greedy, which means that (.*)\/ looks for the last slash, not the first one.
demo

(.*) pattern is gready, which means it'll match as many characters as possible. To match everything to the first slash use (.*?):
$pattern = '/-\#(.*?)\//i';

regex to find all text after delimited string

I have some content that contains a token string in the form
$string_text = '[widget_abc]This is some text. This is some text, etc...';
And I want to pull all the text after the first ']' character
So the returned value I'm looking for in this example is:
This is some text. This is some text, etc...

preg_match("/^.+?\](.+)$/is" , $string_text, $match);
echo trim($match[1]);
Edit
As per author's request - added explanation:
preg_match(param1, param2, param3) is a function that allows you to match a single case scenario of a regular expression that you're looking for
param1 = "/^.+?](.+?)$/is"
"//" is what you put on the outside of your regular expression in param1
the i at the end represents case insensitive (it doesn't care if your letters are 'a' or 'A')
s - allows your script to go over multiple lines
^ - start the check from the beginning of the string
$ - go all the way to end of the string
. - represents any character
.+ - at least one or more characters of anything
.+? - at least one more more characters of anything until you reach
.+?] - at least one or more characters of anything until you reach ] (there is a backslash before ] because it represents something in regular expressions - look it up)
(.+)$ - capture everything after ] and store it as a seperate element in the array defined in param3
param2 = the string that you created.
I tried to simplify the explanations, I might be off, but I think I'm right for the most part.

The regex (?<=]).* will solve this problem if you can guarantee that there are no other square brackets on the line. In PHP the code will be:
if (preg_match('/(?<=\]).*/', $input, $group)) {
$match = $group[0];
}
This will transform [widget_abc]This is some text. This is some text, etc... into This is some text. This is some text, etc.... It matches everything that follows the ].

$output = preg_replace('/^[^\]]*\]/', '', $string_text);

Is there any particular reason why a regex is wanted here?
echo substr(strstr($string_text, ']'), 1);

A regex is definitely overkill for this instance.
Here is a nice one-liner :
list(, $result) = explode(']', $inputText, 2);
It does the job and is way less expensive than using regular expressions.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Why doesn't this greedy regex work as expected? - php

(.) is greedy, so matches everything. $1 is everything then. You are probably looking for: /^([^|]).*$/

Your regex is saying "all chars, followed by an optional |, followed by 0 or more chars". Change the initial (.) to ([^\|]), or make the | non-optional.

* is greedy, which means it will try to match as much text as possible. In this case: (.)? will match all the text \|?. will match the "rest" (empty string) try: regex = '\|[^|]*', replacement = ''

Related

Regex After Last / and Before period

regex to clean up url

extract text between two words in php

preg_replace not matching properly

regex to find all text after delimited string

Categories

Resources

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Why doesn't this greedy regex work as expected? - php

(.*) is greedy, so matches everything. $1 is everything then. You are probably looking for: /^([^|]*).*$/

Your regex is saying "all chars, followed by an optional |, followed by 0 or more chars". Change the initial (.*) to ([^\|]*), or make the | non-optional.

* is greedy, which means it will try to match as much text as possible. In this case: (.*)? will match all the text \|?.* will match the "rest" (empty string) try: regex = '\|[^|]*', replacement = ''

Related

Regex After Last / and Before period

regex to clean up url

extract text between two words in php

preg_replace not matching properly

regex to find all text after delimited string

Categories

Resources

(.) is greedy, so matches everything. $1 is everything then. You are probably looking for: /^([^|]).*$/

Your regex is saying "all chars, followed by an optional |, followed by 0 or more chars". Change the initial (.) to ([^\|]), or make the | non-optional.

* is greedy, which means it will try to match as much text as possible. In this case: (.)? will match all the text \|?. will match the "rest" (empty string) try: regex = '\|[^|]*', replacement = ''