How to preg_match all three cases in the content-disposition header? - php

I'm trying to decode the content-disposition header (from curl) to get the filename using the following regular expression:
<?php
$str = 'attachment;filename="unnamed.jpg";filename*=UTF-8\'\'unnamed.jpg\'';
preg_match('/^.*?filename=(["\'])([^"\']+)\1/m', $str, $matches);
print_r($matches);
It matches when the filename is in single or double quotes, but it fails when there are no quotes around the filename (which can happen):
$str = 'attachment;filename=unnamed.jpg;filename*=unnamed.jpg';
Right now I'm using two regular expressions (with if-else), but I wanted to learn whether it is possible to do this in a single regex. This is just for my own learning, to master regex.

I would use the branch reset feature (?|...|...|...), which gives a more readable pattern and avoids creating a capture group for the quotes. In a branch-reset group, the capture groups in each alternative share the same numbers:
if ( preg_match('~filename=(?|"([^"]*)"|\'([^\']*)\'|([^;]*))~', $str, $match) )
echo $match[1], PHP_EOL;
Whichever alternative succeeds, the capture is always in group 1.
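As a minimal sketch of the branch-reset pattern applied to all three header forms: the helper name extractFilename and the sample headers below are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the filename from a Content-Disposition value, or null.
function extractFilename(string $header): ?string
{
    // Branch reset: whichever alternative matches, the name lands in group 1.
    $pattern = '~filename=(?|"([^"]*)"|\'([^\']*)\'|([^;]*))~';
    return preg_match($pattern, $header, $m) ? $m[1] : null;
}

$headers = [
    'attachment;filename="unnamed.jpg"',
    "attachment;filename='unnamed.jpg'",
    'attachment;filename=unnamed.jpg;filename*=unnamed.jpg',
];
foreach ($headers as $header) {
    echo extractFilename($header), PHP_EOL; // unnamed.jpg each time
}
```

Because every alternative writes to group 1, the caller never has to work out which branch matched.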

Just to put my two cents in - you could use a conditional regex:
filename=(['"])?(?(1)(.+?)\1|([^;]+))
Broken down, this says:
filename= # match filename=
(['"])? # capture " or ' into group 1, optional
(?(1) # if group 1 was set ...
(.+?)\1 # ... then match up to \1
| # else
([^;]+) # not a semicolon
)
Afterwards, you need to check if group 2 or 3 was present.
Alternatively, go for @Casimir's answer using the (often overlooked) branch reset.
See a demo on regex101.com.
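The "check group 2 or 3" step might look like the sketch below. The helper name filenameViaConditional is hypothetical; the pattern is the conditional regex from this answer wrapped in ~ delimiters.

```php
<?php
// Hypothetical helper built on the conditional pattern.
function filenameViaConditional(string $header): ?string
{
    $pattern = '~filename=([\'"])?(?(1)(.+?)\1|([^;]+))~';
    if (!preg_match($pattern, $header, $m)) {
        return null;
    }
    // Quoted names land in group 2; unquoted names land in group 3.
    // Group 2 is an empty string when the unquoted branch matched.
    return $m[2] !== '' ? $m[2] : $m[3];
}

echo filenameViaConditional('attachment;filename="unnamed.jpg"'), PHP_EOL; // unnamed.jpg
echo filenameViaConditional('attachment;filename=unnamed.jpg'), PHP_EOL;   // unnamed.jpg
```

The extra branch on $m[2] is the cost of this approach compared with a branch reset, where the result is always in group 1.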

One approach is to use an alternation in a single regex to match either a single/double quoted filename, or a filename which is completely unquoted. Note that one side effect of this approach is that we introduce more capture groups into the regex. So we need a bit of extra logic to handle this.
<?php
$str = 'attachment;filename=unnamed.jpg;filename*=UTF-8\'\'unnamed.jpg\'';
$result = preg_match('/^.*?filename=(?:(?:(["\'])([^"\']+)\1)|([^"\';]+))/m',
$str, $matches);
print_r($matches);
$index = count($matches) == 3 ? 2 : 3;
if ($result) {
echo $matches[$index];
}
else {
echo "filename not found";
}
?>

You could make the capturing group and its backreference optional, (["\'])? and \1?, and end the regex with a non-capturing group (?:;|$) that checks for a ; or the end of the line:
^.*?filename=(["\'])?([^"\']+)\1?(?:;|$)
$str = 'attachment;filename=unnamed.jpg;filename*=UTF-8\'\'unnamed.jpg\'';
preg_match('/^.*?filename=(["\'])?([^"\']+)\1?(?:;|$)/m', $str, $matches);
print_r($matches);
You can also use \K to reset the starting point of the reported match and then match until you encounter a double quote or a semicolon [^";]+. This will only return the filename.
^.*?filename="?\K[^";]+
foreach ($strings as $string) {
preg_match('/^.*?filename="?\K[^";]+/m', $string, $matches);
print_r($matches);
}

Related

A simple pattern works fine with preg_match_all; how do I use it with preg_replace?

Thanks for your help.
My goal is to use preg_replace with a pattern to remove very simple strings.
Using only preg_replace, on this string or others, I need to remove any content between the tag name and the closing > of an opening tag. The pattern is simple:
$x = '#<\w+(\s+[^>]*)>#is';
$s = 'DATA<td class="td1">111</td><td class="td2">222</td>DATA';
preg_match_all($x, $s, $Q);
print_r($Q[1]);
[1] => Array
(
[0] => class="td1"
[1] => class="td2"
)
This works great!
Now I try to remove the strings using the same pattern:
$new_string = '';
$Q = preg_replace($x, "\\1$new_string", $s);
print_r($Q);
The result is completely different.
What is wrong with my use of preg_replace?
Using only preg_replace(), how can I remove these strings?
(We could use foreach(...) to remove each string, but where is the error in my code?)
The result I expect for this input:
$s = 'DATA<td class="td1">111</td><td class="td2">222</td>DATA';
is this output:
$Q = 'DATA<td>111</td><td>222</td>DATA';
Let's break down your RegEx, #<\w+(\s+[^>]*)>#is, and see if that helps.
# // Start delimiter
< // Literal `<` character
\w+ // One or more word-characters, a-z, A-Z, 0-9 or _
( // Start capturing group
\s+ // One or more whitespace characters
[^>]* // Zero or more characters that are not the literal `>`
) // End capturing group
> // Literal `>` character
# // End delimiter
is // Ignore case and `.` matches all characters including newline
Given the input DATA<td class="td1">DATA this matches <td class="td1"> and captures class="td1". The difference between match and capture is very important.
When you use preg_match you'll see the entire match at index 0, and any subsequent captures at incrementing indexes.
When you use preg_replace the entire match will be replaced. You can use the captures, if you so choose, but you are replacing the match.
I'm going to say that again: whatever you pass as the replacement string will replace the entirety of the found match. If you say $1 or \\1, you are saying replace the entire match with just the capture.
Going back to the sample after the breakdown, using $1 is the equivalent of calling:
str_replace('<td class="td1">', ' class="td1"', $string);
which you can see here: https://3v4l.org/ZkPFb
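To make the match/capture distinction concrete, here is a small sketch using the pattern from the question on the single-tag input from the breakdown; the variable names are chosen for illustration only.

```php
<?php
$pattern = '#<\w+(\s+[^>]*)>#is';
$input   = 'DATA<td class="td1">DATA';

preg_match($pattern, $input, $m);
echo $m[0], PHP_EOL; // the whole match: <td class="td1">
echo $m[1], PHP_EOL; // the capture only, including its leading space

// Replacing with $1 swaps the whole match for the capture,
// so the "<td" and ">" parts of the match are lost.
$replaced = preg_replace($pattern, '$1', $input);
echo $replaced, PHP_EOL; // DATA class="td1"DATA
```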
To your question "how to change [0] by $new_string", you are doing it correctly, it is your RegEx itself that is wrong. To do what you are trying to do, your pattern must capture the tag itself so that you can say "replace the HTML tag with all of the attributes with just the tag".
As one of my comments noted, this is where you'd invert the capturing. You aren't interested in capturing the attributes; you are throwing those away. Instead, you are interested in capturing the tag itself:
$string = 'DATA<td class="td1">DATA';
$pattern = '#<(\w+)\s+[^>]*>#is';
echo preg_replace($pattern, '<$1>', $string);
Demo: https://3v4l.org/oIW7d
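Applied to the full input from the question, the inverted capture could be wrapped as in the sketch below. The function name stripAttributes is hypothetical, and the sketch assumes attribute values never contain a literal > character.

```php
<?php
// Hypothetical helper: strips every attribute from opening tags.
function stripAttributes(string $html): string
{
    // Group 1 is the tag name; the attributes are matched but not captured,
    // so replacing the whole match with "<$1>" discards them.
    // Fall back to the input if preg_replace reports an error (returns null).
    return preg_replace('#<(\w+)\s+[^>]*>#s', '<$1>', $html) ?? $html;
}

$s = 'DATA<td class="td1">111</td><td class="td2">222</td>DATA';
echo stripAttributes($s), PHP_EOL; // DATA<td>111</td><td>222</td>DATA
```

Closing tags such as </td> are left alone because \w+ cannot match the slash that follows <.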

Extract shortcode from Instagram URL

I'm trying to extract the shortcode from an Instagram URL.
Here is what I have already tried, but I don't know how to extract it when there is a username in the middle. Thanks a lot for your answer.
Instagram pattern : /p/shortcode/
https://regex101.com/r/nO4vdd/1/
https://www.instagram.com/p/BxKRx5CHn5i/
https://www.instagram.com/p/BxKRx5CHn5i/?utm_source=ig_share_sheet&igshid=znsinsart176
https://www.instagram.com/p/BxKRx5CHn5i/
https://www.instagram.com/username/p/BxKRx5CHn5i/
expected : BxKRx5CHn5i
I took your original pattern and added a .* before the \/p\/
This gave the pattern
^(?:https?:\/\/)?(?:www\.)?(?:instagram\.com.*\/p\/)([\d\w\-_]+)(?:\/)?(\?.*)?$
This would be simpler, assuming the shortcode always follows the /p/
^(?:.*\/p\/)([\d\w\-_]+)
You could prepend an optional (?:\/\w+)? non capturing group.
Note that \w also matches _ and \d so the capturing group could be updated to ([\w-]+) and the forward slash in the non capturing group might also be written as just /
^(?:https?:\/\/)?(?:www\.)?(?:instagram\.com(?:\/\w+)?\/p\/)([\w-]+)(?:\/)?(\?.*)?$
You don't have to escape the forward slashes if you use a delimiter other than /. Your pattern might then look like:
^(?:https?://)?(?:www\.)?(?:instagram\.com(?:/\w+)?/p/)([\w-]+)/?(\?.*)?$
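A short sketch of that pattern in use with ~ as the delimiter follows. The helper name instagramShortcode is hypothetical, and the optional (?:/\w+)? segment assumes the username consists of word characters only.

```php
<?php
// Hypothetical helper: returns the shortcode, or null when the URL does not match.
function instagramShortcode(string $url): ?string
{
    // "~" as delimiter means the forward slashes need no escaping.
    $pattern = '~^(?:https?://)?(?:www\.)?instagram\.com(?:/\w+)?/p/([\w-]+)/?(\?.*)?$~';
    return preg_match($pattern, $url, $m) ? $m[1] : null;
}

echo instagramShortcode('https://www.instagram.com/p/BxKRx5CHn5i/'), PHP_EOL;          // BxKRx5CHn5i
echo instagramShortcode('https://www.instagram.com/username/p/BxKRx5CHn5i/'), PHP_EOL; // BxKRx5CHn5i
```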
This expression might also work:
^https?:\/\/(?:www\.)?instagram\.com\/[^\/]+(?:\/[^\/]+)?\/([^\/]{11})\/.*$
Test
$re = '/^https?:\/\/(?:www\.)?instagram\.com\/[^\/]+(?:\/[^\/]+)?\/([^\/]{11})\/.*$/m';
$str = 'https://www.instagram.com/p/BxKRx5CHn5i/
https://www.instagram.com/p/BxKRx5CHn5i/?utm_source=ig_share_sheet&igshid=znsinsart176
https://www.instagram.com/p/BxKRx5CHn5i/
https://www.instagram.com/username/p/BxKRx5CHn5i/';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches as $match) {
var_export($match[1]);
}
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Assuming that you aren't simply trusting /p/ as the marker before the substring, you can use this pattern which will consume one or more of the directories before your desired substring.
Notice that \K restarts the fullstring match, and effectively removes the need to use a capture group -- this means a smaller output array and a shorter pattern.
Choosing a pattern delimiter like ~ which doesn't occur inside your pattern alleviates the need to escape the forward slashes. This again makes your pattern more brief and easier to read.
If you do want to rely on the /p/ substring, then just add p/ before my \K.
Code: (Demo)
$strings = [
"https://www.instagram.com/p/BxKRx5CHn5i/",
"https://www.instagram.com/p/BrODg5XHlE6/?utm_source=ig_share_sheet&igshid=znsinsart176",
"https://www.instagram.com/p/BxKRx5CHn5i/",
"https://www.instagram.com/username/p/BxE5PpZhoa9/",
"https://www.instagram.com/username/p/BxE5PpZhoa9/#look=overhere"
];
foreach ($strings as $string) {
echo preg_match('~(?:https?://)?(?:www\.)?instagram\.com(?:/[^/]+)*/\K\w+~', $string , $m) ? $m[0] : '';
echo " (from $string)\n";
}
Output:
BxKRx5CHn5i (from https://www.instagram.com/p/BxKRx5CHn5i/)
BrODg5XHlE6 (from https://www.instagram.com/p/BrODg5XHlE6/?utm_source=ig_share_sheet&igshid=znsinsart176)
BxKRx5CHn5i (from https://www.instagram.com/p/BxKRx5CHn5i/)
BxE5PpZhoa9 (from https://www.instagram.com/username/p/BxE5PpZhoa9/)
BxE5PpZhoa9 (from https://www.instagram.com/username/p/BxE5PpZhoa9/#look=overhere)
If you implicitly trust /p/ as the marker and you know that you are dealing with Instagram links, then you can avoid regex and just cut out the 11-character substring that starts 3 characters after the position of the marker.
Code: (Demo)
$strings = [
"https://www.instagram.com/p/BxKRx5CHn5i/",
"https://www.instagram.com/p/BrODg5XHlE6/?utm_source=ig_share_sheet&igshid=znsinsart176",
"https://www.instagram.com/p/BxKRx5CHn5i/",
"https://www.instagram.com/username/p/BxE5PpZhoa9/",
"https://www.instagram.com/username/p/BxE5PpZhoa9/#look=overhere"
];
foreach ($strings as $string) {
$pos = strpos($string, '/p/');
if ($pos === false) {
continue;
}
echo substr($string, $pos + 3, 11);
echo " (from $string)\n";
}
(Same output as previous technique)

Find next word after colon in regex

A Laravel console command returns output like this:
Some text as: 'Nerad'
I tried:
$regex = '/(?<=\bSome text as:\s)(?:[\w-]+)/is';
preg_match_all( $regex, $d, $matches );
but it returns an empty result.
My guess is that something is wrong with the single quotes, so I need to change the regex.
Any ideas?
Note that you get no match because the ' before Nerad is not matched, nor checked with the lookbehind.
If you need to check the context, but avoid including it into the match, in PHP regex, it can be done with a \K match reset operator:
$regex = '/\bSome text as:\s*\'\K[\w-]+/i';
The output array structure will be cleaner than when using a capturing group and you may check for unknown width context (lookbehind patterns are fixed width in PHP PCRE regex):
$re = '/\bSome text as:\s*\'\K[\w-]+/i';
$str = "Some text as: 'Nerad'";
if (preg_match($re, $str, $match)) {
echo $match[0];
} // => Nerad
Alternatively, anchor at the end of the string and capture the word in a group. Group 1 will hold the required string.
/:\s*'(\w+)'$/
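A minimal sketch of the end-anchored pattern in PHP is shown here. The helper name lastQuotedWord is hypothetical, and the single quotes are escaped so the pattern can live in a single-quoted string.

```php
<?php
// Hypothetical helper: returns the quoted word at the end of the line, or null.
function lastQuotedWord(string $line): ?string
{
    // A colon, optional whitespace, then a quoted word just before the end of the string.
    return preg_match('/:\s*\'(\w+)\'$/', $line, $m) ? $m[1] : null;
}

echo lastQuotedWord("Some text as: 'Nerad'"), PHP_EOL; // Nerad
```

Unlike the \K version, the result is read from $m[1] rather than $m[0].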

How to manipulate a string so I can make implicit multiplication explicit in a math expression?

I want to manipulate a string like "...4+3(4-2)-...." so that it becomes "...4+3*(4-2)-....". Of course, it should recognize any digit, d, followed by a '(' and change it to 'd*('. I also want to change ')(' to ')*(' at the same time, if possible. It would be nice if support for constants like pi or e could be added too.
For now, I just do it this clumsy way:
private function make_implicit_multiplication_explicit($string)
{
$i=1;
if(strlen($string)>1)
{
while(($i=strpos($string,"(",$i))!==false)
{
if(strpos("0123456789",substr($string,$i-1,1)))
{
$string=substr_replace($string,"*(",$i,1);
$i++;
}
$i++;
}
$string=str_replace(")(",")*(",$string);
}
return $string;
}
But I believe this could be done much more nicely with preg_replace or some other regex function? Those manuals are really hard to grasp, though.
Let's start with what you are looking for:
either of the following ((a|b) matches either a or b):
a single digit, \d
the character ), \)
followed by (, \(
This gives the pattern (\d|\))\(. Since you want to modify the string and keep both parts, you can also put the \( in a group, which results in (\(). That makes the pattern harder to read but easier to work with.
All that is left is to say how to rearrange the parts, which is simple: \\1*\\2. That leaves you with code like this:
$regex = "/(\d|\))(\()/";
$replace = "\\1*\\2";
$new = preg_replace($regex, $replace, $test);
To see that the pattern actually matches all cases, see this example.
To recognize any number followed by a ( OR a combination of a )( and place an asterisk in between them, you can use a combination of lookaround assertions.
echo preg_replace("/
(?<=[0-9)]) # look behind to see if there is: '0' to '9', ')'
(?=\() # look ahead to see if there is: '('
/x", '*', '(4+3(4-2)-3)(2+3)');
The positive lookbehind asserts that what precedes is either a digit or a right parenthesis, while the positive lookahead asserts that what follows is a left parenthesis.
Another option is to use the \K escape sequence in place of the lookbehind. \K resets the starting point of the reported match, so any previously consumed characters are no longer included (it throws away everything that has been matched up to that point).
echo preg_replace("/
[0-9)] # any character of: '0' to '9', ')'
\K # resets the starting point of the reported match
(?=\() # look ahead to see if there is: '('
/x", '*', '(4+3(4-2)-3)(2+3)');
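For reuse, the lookaround version could be wrapped in a function, as in this sketch. The name makeMultiplicationExplicit is hypothetical, and the pattern is the same one written on a single line without the x flag.

```php
<?php
// Hypothetical helper wrapping the lookaround version.
function makeMultiplicationExplicit(string $expr): string
{
    // Zero-width match between a digit or ')' and a following '(';
    // the replacement inserts '*' at that position without consuming anything.
    return preg_replace('/(?<=[0-9)])(?=\()/', '*', $expr) ?? $expr;
}

echo makeMultiplicationExplicit('(4+3(4-2)-3)(2+3)'), PHP_EOL; // (4+3*(4-2)-3)*(2+3)
```

Because both assertions are zero-width, a single pass handles the digit case and the ')(' case together.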
Your php code should be,
<?php
$mystring = "4+3(4-2)-(5)(3)";
$regex = '~\d+\K\(~';
$replacement = "*(";
$str = preg_replace($regex, $replacement, $mystring);
$regex1 = '~\)\K\(~';
$replacement1 = "*(";
echo preg_replace($regex1, $replacement1, $str); //=> 4+3*(4-2)-(5)*(3)
?>
Explanation:
~\d+\K\(~ this would match the one or more numbers followed by a (. Because of \K it excludes the \d+
Again it replaces the matched part with *( which in turn produces 3*( and the result was stored in another variable.
\)\K\( Matches )( and excludes the first ). This would be replaced by *( which in turn produces )*(
Silly method :^ )
$value = '4+3(4-2)(1+2)';
$search = ['1(', '2(', '3(', '4(', '5(', '6(', '7(', '8(', '9(', '0(', ')('];
$replace = ['1*(', '2*(', '3*(', '4*(', '5*(', '6*(', '7*(', '8*(', '9*(', '0*(', ')*('];
echo str_replace($search, $replace, $value);

Looping within a regular expression

Can a regex find a pattern for this?
{{foo.bar1.bar2.bar3}}
where the groups would be
$1 = foo $2 = bar1 $3 = bar2 $4 = bar3 and so on..
It would be like re-applying the expression over and over again until it fails to get a match.
The current expression I am working on is
(?:\{{2})([\w]+).([\w]+)(?:\}{2})
Here's a link from regexr.
http://regexr.com?3203h
--
OK, I guess I didn't explain well what I'm trying to achieve here.
Let's say I am trying to replace all
.barX inside a {{foo . . . }}
My expected result would be
$foo->bar1->bar2->bar3
This should work, assuming no braces are allowed within the match:
preg_match_all(
'%(?<= # Assert that the previous character(s) are either
\{\{ # {{
| # or
\. # .
) # End of lookbehind
[^{}.]* # Match any number of characters besides braces/dots.
(?= # Assert that the following regex can be matched here:
(?: # Try to match
\. # a dot, followed by
[^{}]* # any number of characters except braces
)? # optionally
\}\} # Match }}
) # End of lookahead%x',
$subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
I'm not a PHP person, but I managed to construct this piece of code here:
preg_match_all("([a-z0-9]+)",
"{{foo.bar1.bar2.bar3}}",
$out, PREG_PATTERN_ORDER);
foreach($out[0] as $val)
{
echo($val);
echo("<br>");
}
The code above prints the following:
foo
bar1
bar2
bar3
It should allow you to exhaustively search a given string by using a simple regular expression. I think that you should also be able to get what you want by removing the braces and splitting the string.
I don't think so, but it's relatively painless to just split the string on periods like so:
$str = "{{foo.bar1.bar2.bar3}}";
$str = str_replace(array("{","}"), "", $str);
$values = explode(".", $str);
print_r($values); // Yields an array with values foo, bar1, bar2, and bar3
EDIT: In response to your question edit, you could replace all barX in a string by doing the following:
$str = "{{foo.bar1.bar2.bar3}}";
$newStr = preg_replace("#bar\d#", "hi", $str);
echo $newStr; // outputs "{{foo.hi.hi.hi}}"
I don't know the correct syntax in PHP for pulling out the results, but you could do:
\{{2}(\w+)(?:\.(\w+))*\}{2}
That would capture the first hit in the first capturing group and the remaining segments in the second capturing group. Note that in PHP's PCRE a repeated group keeps only its last repetition (here bar3), whereas engines such as .NET expose every capture. regexr.com lacks the ability to show that, as far as I can see. Try out Expresso, and you'll see what I mean.
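For the result the question actually asks for ($foo->bar1->bar2->bar3), one possible sketch uses preg_replace_callback rather than looping inside the regex. The function name placeholdersToPropertyChains is hypothetical, and the sketch assumes each path segment consists of word characters only.

```php
<?php
// Hypothetical helper: rewrites {{a.b.c}} placeholders as $a->b->c.
function placeholdersToPropertyChains(string $template): string
{
    return preg_replace_callback(
        '~\{\{(\w+(?:\.\w+)*)\}\}~',
        function (array $m): string {
            // $m[1] is the dotted path without braces, e.g. "foo.bar1.bar2.bar3".
            return '$' . str_replace('.', '->', $m[1]);
        },
        $template
    ) ?? $template;
}

echo placeholdersToPropertyChains('{{foo.bar1.bar2.bar3}}'), PHP_EOL; // $foo->bar1->bar2->bar3
```

The regex only finds the whole placeholder; the per-segment work is done by str_replace inside the callback, which sidesteps the repeated-group limitation.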
