preg_match in loop returning impossible results - php

I'm sure I'm missing something. I know just enough to be dangerous.
In my php code I use file_get_contents() to put a file into a variable.
I then loop through an array and use preg_match to search the same variable many times. The file is a tab-delimited txt file. It does fine 800 times but one time randomly in the middle it does something very odd.
$current = file_get_contents($file);
foreach($blahs as $blah){
$image = 'somefile.jpg';
$pattern = '/https:\/\/www\.example\.com\/media(.*)\/' . preg_quote($image) . '/';
preg_match($pattern, $current, $matches);
echo $matches[0];
}
For some reason, that one time it returns two URLs with a tab between them. When I look at the txt file, the image I'm looking for is listed first, followed by the second image, but echo $matches[0] returns them in reverse order. The string does not exist in the file the way echo $matches[0] returns it. It would be as if you searched the string 'one two' and $matches returned 'two one'.

The regex engine is trying to do you a favor and capture the longest match. The \t tab between the two urls is being matched by the . (dot / any character).
Demonstration: (Link)
$blah='test case: https://www.example.com/media/foo/bar.jpg https://www.example.com/media/cat/fish.jpg some text';
$image = 'fish.jpg';
$your_pattern = '/https:\/\/www\.example\.com\/media(.*)\/'.preg_quote($image).'/';
echo preg_match($your_pattern,$blah,$matches)?$matches[0]:'fail';
echo "\n----\n";
$my_pattern='~https://www\.example\.com/media(?:[^/\s]*/)+'.preg_quote($image).'~';
echo preg_match($my_pattern,$blah,$out)?$out[0]:'fail';
Output:
https://www.example.com/media/foo/bar.jpg https://www.example.com/media/cat/fish.jpg
----
https://www.example.com/media/cat/fish.jpg
To crystallize...
test case: https://www.example.com/media/foo/bar.jpg https://www.example.com/media/cat/fish.jpg some text
// your (.*) is matching ---------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
My suggested pattern (I may be able to refine the pattern if you provide some sample strings) uses (?:[^/\s]*/)+ instead of the (.*).
My non-capturing group breaks down like this:
(?: #start non-capturing group
[^/\s]* #greedily match zero or more non-slash, non-whitespace characters
/ #match a slash
) #end non-capturing group
+ #allow the group to repeat one or more times
*note1: You can use \t where I use \s if you want to be more literal, I am using \s because a valid url shouldn't contain a space anyhow. You may make this adjustment in your project without any loss of accuracy.
*note2: Notice that I changed the pattern delimiters to ~ so that / doesn't need to be escaped inside the pattern.
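As a rough sketch of how the suggested pattern could slot back into the original loop: the helper name findImageUrl and the sample $current string below are made up for illustration, and passing '~' to preg_quote() as the delimiter is an extra precaution that is not part of the pattern above.

```php
<?php
// Hypothetical helper: returns the first URL under /media that ends in $image,
// or null when no such URL exists in $haystack.
function findImageUrl(string $haystack, string $image): ?string
{
    // [^/\s]* cannot cross a tab or a space, so a match stays inside one URL.
    $pattern = '~https://www\.example\.com/media(?:[^/\s]*/)+'
        . preg_quote($image, '~') . '~';

    return preg_match($pattern, $haystack, $m) === 1 ? $m[0] : null;
}

// Made-up tab-delimited sample, standing in for file_get_contents($file).
$current = "row1\thttps://www.example.com/media/foo/bar.jpg"
    . "\thttps://www.example.com/media/cat/fish.jpg\tsome text\n";

foreach (['bar.jpg', 'fish.jpg', 'missing.jpg'] as $image) {
    $url = findImageUrl($current, $image);
    echo $image, ' => ', $url ?? 'not found', "\n";
}
```

Returning null instead of echoing $matches[0] directly also avoids reading a stale or undefined index when an image is not in the file.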

Related

Replace data after .PNG extension in image tag regular expression

Here is my code
<img src="folder/img1.jpg?somestring">
<img src="folder/img2.jpg?somediffstring">
I want to replace somestring and somediffstring with another string throughout the HTML. Please suggest a regular expression for this in PHP.
example
change to using regular expression or anything
First of all, you shouldn't parse HTML with Regular Expressions.
Solution 1
Now, if you are exclusively parsing img tags, you could come up with a satisfying enough solution like this:
(\b\.jpg|\b\.png)\?(.+?)\"
That is:
(\b\.jpg|\b\.png) # 1st Capturing Group
\b\.jpg # 1st Alternative: match ``.jpg`` literally
\b\.png # 2nd Alternative: match ``.png`` literally
\? # Match the character ? literally
(.+?) # 2nd Capturing Group
.+? # Match any character between one and unlimited times,
# as few times as possible, expanding as needed.
\" # Match the character " literally
Problem
What's the problem? We are not checking if we are inside an img tag. This will match everywhere in the HTML.
Solution 2
Let's add the check for img > src:
<img.+?src=\".*?(\b\.jpg|\b\.png)\?(.+?)\"
That is:
<img # Match ``<img`` literally
.+? # Match any character between one and unlimited times,
# as few times as possible, expanding as needed.
# Needed in case there are rel or alt options inside the img tag.
src=\" # Match ``src="`` literally
... # The rest is same as before.
Problem
Does this really do its job? Apparently yes, but in reality no.
Consider the following HTML code
<img src="data:image/png;base64,iVBORw0KG" />
<div style="background-image: url(../images/test-background.jpg?)">
blah blah
</div>
It shouldn't match, right? But it does (if you remove the line breaks). The regular expression above starts matching at <img src=" and does not stop until the " just before the > of the div tag. The second capturing group then contains the characters between ? and ", namely ), and substituting it will break the HTML.
This was just an example, but many other situations will match even if they should not.
Other solutions...?
No matter how many constraints you can add to your RegEx and how sophisticated it becomes... HTML is a Context-Free Language and it can't be captured by a Regular Expression, which only recognizes Regular Languages.
In PHP
Still sure you're gonna use Regular Expressions? Alright, then your PHP function is preg_replace. You only need to keep in mind that it will replace everything that matched, not only the capturing groups. Hence, you need to wrap what you want to "remember" into another capturing group:
$str = '<img src="folder/img1.jpg?foo">';
$pattern = '/(<img.+?src=\".*?(\b\.jpg|\b\.png)\?)(.+?)(\")/';
$replacement = '$1' . 'bar' . '$4';
$str_replaced = preg_replace($pattern, $replacement, $str);
// Now you have $str_replaced = '<img src="folder/img1.jpg?bar">';
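If you stay with a regex anyway, a somewhat tighter variant is sketched below. It is still not an HTML parser. The helper name replaceImgQuery is made up, the pattern assumes double-quoted src attributes, and the arrow function needs PHP 7.4 or later. Negated character classes replace the lazy dots so a match cannot run past the end of the tag or the attribute value.

```php
<?php
// Hypothetical helper: swaps the query string of .jpg/.png sources in <img> tags.
function replaceImgQuery(string $html, string $newQuery): string
{
    // [^>]* keeps the match inside one tag; [^"?]* and [^"]* keep it inside
    // one double-quoted attribute value.
    $pattern = '~(<img\b[^>]*?\bsrc="[^"?]*\.(?:jpg|png)\?)[^"]*(")~i';

    return preg_replace_callback(
        $pattern,
        // A callback avoids "$1" running into digits at the start of $newQuery.
        fn(array $m): string => $m[1] . $newQuery . $m[2],
        $html
    );
}

echo replaceImgQuery('<img src="folder/img1.jpg?somestring">', 'bar'), "\n";

// The problematic input from above is left untouched by this pattern.
$tricky = '<img src="data:image/png;base64,iVBORw0KG" />'
    . '<div style="background-image: url(../images/test-background.jpg?)">';
var_dump(replaceImgQuery($tricky, 'bar') === $tricky);
```

Other inputs (single-quoted attributes, a > inside an attribute value) can still defeat it, which is the point made above about regular expressions and HTML.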
With reference to this How can I use the captured group in the same regex
Suppose you want to change img1.jpg?somestring to img1.jpg?somestringAAA
and img2.jpg?somediffstring to img2.jpg?somediffstringAAA
Search for: src="([a-zA-Z.0-9_/]*)[?]([a-zA-Z.0-9_]*)">
Replace with: src="$1?$2AAA">
Here $1 represents whatever the first pair of parentheses () matched, i.e., folder/img1.jpg (the / in the character class is needed because the src values in the question include a folder),
and $2 represents what the second pair of parentheses matched.
UPDATE:
$string = 'img1.jpg?somestring';
$pattern = '/([a-zA-Z.0-9_]*)[?]([a-zA-Z.0-9_]*)/i';
$replacement = '$1?$2AAA';
echo preg_replace($pattern, $replacement, $string);
You can do it in this way :
<?php
$url_value = "folder/img2.jpg?somediffstring";
echo $url =substr($url_value, 0, strpos($url_value, "?"));
?>
You can use the regex \?(\w*)"
If you want to replace somestring and somediffstring with xx, search for the regex \?(\w*)" and replace it with ?xx" (the closing quote is part of the match, so it has to be put back).
https://regex101.com/r/S5pPuW/1

Regex to ignore matches that have a specific string in front of it

I have this string stored in php:
Keyboard layout codes found here https://msdn.microsoft.com/en-us/library/cc233982.aspx test 123
test https://google.com
test google.com
<img src='http://example.com/pages/projects/uploader/files/2017-06-16%2011_27_36-Settings.png'>Link Converted to Image</img>
The img element was made with a previous regex:
$url = '~(https|http)?(://)((\S)+(png|jpg|gif|jpeg))~';
$output = preg_replace($url, "<img src='$0'>Link Converted to Image</img>", $output);
My problem is, now I want to convert the regular links to an a element.
I have this regex, which works except for one problem.
$url = '~(https|http)?(://)?((\S)+[.]+(\w*))~';
$output = preg_replace($url, "<img src='$0'>Link Converted to Image</img>", $output);
This regex ALSO converts the link that has already become an img element, so it puts an a element in the source of the img element. My thinking on avoiding this problem is to ignore a preg match checking if the match starts with src=', but I can't figure out how to actually do this.
Am I doing this incorrectly? What is the most common/efficient way to accomplish this?
A good example for (*SKIP)(*FAIL):
<img.+?</img>(*SKIP)(*FAIL) # match <img> tags and throw them away
| # or
\bhttps?\S+\b # a link starting with http/https
In PHP:
<?php
$string = <<<DATA
Keyboard layout codes found here https://msdn.microsoft.com/en-us/library/cc233982.aspx test 123
test https://google.com
<img src='http://example.com/pages/projects/uploader/files/2017-06-16%2011_27_36-Settings.png'>Link Converted to Image</img>
DATA;
$regex = '~<img.+?</img>(*SKIP)(*FAIL)|\bhttps?\S+\b~';
$string = preg_replace($regex, "<a href='$0'>$0</a>", $string);
echo $string;
?>
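To see both passes working together, here is a small sketch that runs the image regex from the question first and the (*SKIP)(*FAIL) regex second. The wrapper name convertLinks and the one-line sample input are made up for illustration.

```php
<?php
// Hypothetical wrapper around the two replacement passes.
function convertLinks(string $text): string
{
    // Pass 1 (regex from the question): image URLs become <img> elements.
    $imagePattern = '~(https|http)?(://)((\S)+(png|jpg|gif|jpeg))~';
    $text = preg_replace(
        $imagePattern,
        "<img src='$0'>Link Converted to Image</img>",
        $text
    );

    // Pass 2: (*SKIP)(*FAIL) discards whole <img>...</img> spans, so only
    // URLs outside of them are wrapped in an <a> element.
    $linkPattern = '~<img.+?</img>(*SKIP)(*FAIL)|\bhttps?\S+\b~';

    return preg_replace($linkPattern, "<a href='$0'>$0</a>", $text);
}

echo convertLinks('see http://example.com/a.png and https://google.com'), "\n";
// see <img src='http://example.com/a.png'>Link Converted to Image</img> and <a href='https://google.com'>https://google.com</a>
```

The order matters: if the link pass ran first, the image URL would already be inside an <a> element before the image pass saw it.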
Adding to @Jan's answer, although there may be some drawbacks with this workaround, it will match URL-like strings:
<img.+?</img>(*SKIP)(*FAIL)|(?:https?\S+|(?:(?!:)(?(1)\S|(\w)))*\.\w{2,5})
Live demo
Breakdown:
(?: # Open a NCG (a)
(?!:) # Next immediate character shouldn't be a colon `:`
(?(1)\S # If CG #1 exists match a non-whitespace character
| # otherwise
(\w)) # Match a word character (a URL begins with a word character)
)* # As much as possible (this cluster denotes a tempered pattern)
\.\w{2,5} # Match TLD
Drawbacks:
TLD's character limit
Partial match of URLs containing a port number

Regex After Last / and Before period

Sorry if the title is confusing. All I'm trying to do is some simple regex:
The text: /thing/images/info.gif
And what I want is: info
My regex (not fully working): ([^\/]+$)(.*?)(?=\.gif)
(Note: [^\/]+$ returns info.gif)
Thanks for any help!
I'd say you don't need to match all the string, so you can be much more generic. If you know your string always contains a path you can just use:
preg_match( '/([^\/]+)\.\w+$/', "/thing/images/info.gif", $matches) ;
print_r( $matches );
and it will be valid for any filename, even names that contain dots like my_file.name.jpg or spaces like /thing/images/my image.gif
Demo here.
The structure is (reading the regex from its end toward the left):
$ anchors the match at the end of the string
\.\w+ matches a dot followed by one or more word characters (the extension)
([^\/]+) captures one or more characters that are not a slash (your filename; the last slash is where the directories end)
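A short sketch of that pattern wrapped in a function, next to PHP's built-in pathinfo() for comparison. The function name fileStem is made up here, and pathinfo() is offered as an alternative rather than as part of the answer above.

```php
<?php
// Hypothetical helper: file name without directories and without the last extension.
function fileStem(string $path): ?string
{
    // Group 1 is everything after the last slash, up to the final ".ext".
    return preg_match('/([^\/]+)\.\w+$/', $path, $m) === 1 ? $m[1] : null;
}

foreach ([
    '/thing/images/info.gif',
    '/thing/images/my_file.name.jpg',
    '/thing/images/my image.gif',
] as $path) {
    // pathinfo() with PATHINFO_FILENAME strips the directory and the last extension.
    echo fileStem($path), ' | ', pathinfo($path, PATHINFO_FILENAME), "\n";
}
```

For these three sample paths both columns print the same value; the regex returns null when the path has no extension at all, whereas pathinfo() returns the bare name.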
Not sure how much more complex the string is but this seems to work on the test string:
preg_match('![^/.]+(?=\.gif)!', '/thing/images/info.gif', $m);
Matching NOT / NOT . followed by .gif.
In editors (Sublime):
Find:^(.*)(\/)(.*)(\.)(.*)$
Replace it with:\3
In PHP:
<?php
preg_match('/^(.*)(\/)(.*)(\.)(.*)$/', '/thing/images/info.gif', $match);
echo $match[3];

extract text between two words in php

I got the following URL
http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego
and I want to extract
B000NO9GT4
That is the ASIN. So far I can search within the string, but not in the way I need. I looked at the split function and at explode, but can't find a way through. Also, the URLs will differ in length, so I can't hardcode the length either. The only thing that makes sense to me is to split the string so that
http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/
become first part
and
B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego
becomes the 2nd part , from the second part , I should extract B000NO9GT4
in the same way, i would want to get product name LEGO-Ultimate-Building-Set-Pieces from the first part
I am very bad at regex and cant find a way out..
can somebody guide me how I can do it in php?
thanks
This grabs both pieces of information that you are looking to capture:
$url = 'http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego';
$path = parse_url($url, PHP_URL_PATH);
if (preg_match('#^/([^/]+)/dp/([^/]+)/#i', $path, $matches)) {
echo "Description = {$matches[1]}<br />"
."ASIN = {$matches[2]}<br />";
}
Output:
Description = LEGO-Ultimate-Building-Set-Pieces
ASIN = B000NO9GT4
Short Explanation:
Any expressions enclosed in ( ) will be saved as a capture group. This is how we get at the data in $matches[1] and $matches[2].
The expression ([^/]+) says to match all characters EXCEPT / so in effect it captures everything in the URL between the two / separators. I use this pattern twice. The [ ] actually defines the character class which was /, the ^ in this case negates it so instead of matching / it matches everything BUT /. Another example is [a-f0-9] which would say to match the characters a,b,c,d,e,f and the numbers 0,1,2,3,4,5,6,7,8,9. [^a-f0-9] would be the opposite.
# is used as the delimiter for the expression
^ following the delimiter means match from the beginning of the string.
See www.regular-expressions.info and PCRE Pattern Syntax for more info on how regexps work.
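The capture groups can also be given names, which some find easier to read than $matches[1] and $matches[2]. The following is only a sketch: the function name parseProductUrl and the group names description and asin are made up, and the (?:/|$) ending is a small change so that URLs ending right after the ASIN are accepted too.

```php
<?php
// Hypothetical helper returning ['description' => ..., 'asin' => ...] or null.
function parseProductUrl(string $url): ?array
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // malformed URL or no path component
    }

    // (?<name>...) is a named capture group; (?:/|$) accepts a slash or the end.
    $pattern = '#^/(?<description>[^/]+)/dp/(?<asin>[^/]+)(?:/|$)#i';
    if (preg_match($pattern, $path, $m) !== 1) {
        return null;
    }

    return ['description' => $m['description'], 'asin' => $m['asin']];
}

$url = 'http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego';
print_r(parseProductUrl($url));
```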
You can try
$str = "http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego" ;
list(,$desc,,$num,) = explode("/",parse_url($str,PHP_URL_PATH));
var_dump($desc,$num);
Output
string 'LEGO-Ultimate-Building-Set-Pieces' (length=33)
string 'B000NO9GT4' (length=10)

Regular Expression to collect everything after the last /

I'm new at regular expressions and wonder how to phrase one that collects everything after the last /.
I'm extracting an ID used by Google's GData.
my example string is
http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123
Where the ID is: p1f3JYcCu_cb0i0JYuCu123
Oh and I'm using PHP.
This matches at least one of (anything not a slash) followed by end of the string:
[^/]+$
Notes:
No parens because it doesn't need any groups - result goes into group 0 (the match itself).
Uses + (instead of *) so that if the last character is a slash it fails to match (rather than matching empty string).
But, most likely a faster and simpler solution is to use your language's built-in string list processing functionality - i.e. ListLast( Text , '/' ) or equivalent function.
For PHP, the closest function is strrchr which works like this:
strrchr( Text , '/' )
This includes the slash in the results - as per Teddy's comment below, you can remove the slash with substr:
substr( strrchr( Text, '/' ), 1 );
Generally:
/([^/]*)$
The data you want would then be the match of the first group.
Edit: Since you're using PHP, you could also use strrchr, which returns everything from the last occurrence of a character in a string up to the end. Or you could use a combination of strrpos and substr: first find the position of the last occurrence, then take the substring from that position to the end. Or use explode and array_pop: split the string at the / and take just the last part.
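The three PHP alternatives mentioned here, plus the [^/]+$ regex, can be sketched side by side. The variable names are made up for illustration, and the sample URL is the one from the question.

```php
<?php
$url = 'http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123';

// 1. strrchr returns "/p1f3..." (slash included), so drop the first character.
$viaStrrchr = substr(strrchr($url, '/'), 1);

// 2. strrpos finds the index of the last slash; take everything after it.
$viaStrrpos = substr($url, strrpos($url, '/') + 1);

// 3. explode splits on every slash; array_pop takes the last piece.
$parts = explode('/', $url);
$viaExplode = array_pop($parts);

// 4. The regex: one or more non-slash characters up to the end of the string.
preg_match('~[^/]+$~', $url, $m);
$viaRegex = $m[0];

echo $viaStrrchr, "\n", $viaStrrpos, "\n", $viaExplode, "\n", $viaRegex, "\n";
```

All four print the same ID for this URL. They differ on edge cases: for a string ending in a slash the first three produce an empty string, while the regex does not match at all.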
You can also get the "filename", or the last part, with the basename function.
<?php
$url = 'http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123';
echo basename($url); // "p1f3JYcCu_cb0i0JYuCu123"
On my box I could just pass the full URL. It's possible you might need to strip off http:/ from the front.
Basename and dirname are great for moving through anything that looks like a unix filepath.
/^.*\/(.*)$/
^ = start of the line
.*\/ = greedy match from the start of the line to the last occurrence of /
(.*) = group of everything that comes after the last occurrence of /
You can also use a normal string split:
$str = "http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123";
$s = explode("/",$str);
print end($s);
This pattern will not capture the last slash in $0, and it won't match anything if there's no characters after the last slash.
/(?<=\/)([^\/]+)$/
Edit: but it requires lookbehind, which older ECMAScript engines (JavaScript before ES2018, ActionScript), Ruby 1.8 and a few other flavors do not support. If you are using one of those flavors, you can use:
/\/([^\/]+)$/
But it will capture the last slash in $0.
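In PHP the difference between the two patterns shows up in $0 versus $1, as the sketch below illustrates. The third pattern uses PCRE's \K, which is not mentioned in the answer above and is included only as an assumption-free alternative for PHP: \K drops everything matched so far from $0. The variable names are made up.

```php
<?php
$url = 'http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123';

// Lookbehind: the slash is asserted but not consumed, so $0 has no slash.
preg_match('/(?<=\/)([^\/]+)$/', $url, $withLookbehind);

// No lookbehind: the slash is consumed, so $0 includes it and $1 does not.
preg_match('/\/([^\/]+)$/', $url, $withSlash);

// \K resets the start of the reported match, so $0 again has no slash.
preg_match('~/\K[^/]+$~', $url, $withK);

echo $withLookbehind[0], "\n"; // p1f3JYcCu_cb0i0JYuCu123
echo $withSlash[0], "\n";      // /p1f3JYcCu_cb0i0JYuCu123
echo $withSlash[1], "\n";      // p1f3JYcCu_cb0i0JYuCu123
echo $withK[0], "\n";          // p1f3JYcCu_cb0i0JYuCu123
```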
Not a PHP programmer, but strrpos seems a more promising place to start. Find the rightmost '/', and everything past that is what you are looking for. No regex used.
Find position of last occurrence of a char in a string
Based on @Mark Rushakoff's answer, the best solution for different cases:
<?php
$path = "http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123?var1&var2#hash";
$vars = strrchr($path, "?"); // "?var1&var2#hash"
var_dump(preg_replace('/'. preg_quote($vars, '/') . '$/', '', basename($path))); // "p1f3JYcCu_cb0i0JYuCu123"
?>
Regular Expression to collect everything after the last /
How to get file name from full path with PHP?
