What does this `#((?<=\?)|&)openid\.[^&]+#` regexp mean? - php

I am trying to read some PHP code and found this line:
$uri = rtrim(preg_replace('#((?<=\?)|&)openid\.[^&]+#', '', $_SERVER['REQUEST_URI']), '?');
What does it mean? And if, as it seems to me, it just returns the file name, why is it so complicated?

The purpose of that line is to remove values like openid.something=value from the request URI.
There are tools out there to translate regex into prose, with an aim to help you understand what a regex is trying to match. For example, when yours is passed to such a tool the description comes back as:
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\? '?'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
& '&'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
openid 'openid'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
[^&]+ any character except: '&' (1 or more times
(matching the most amount possible))
As the above says, the regex looks for a ? or & followed by openid., followed by one or more characters that are not &. The resulting match includes the preceding & if there is one, but not the ?, since a lookbehind was used for the latter.
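To see the effect concretely, here is a minimal sketch that applies the same pattern to a few request URIs. The helper name stripOpenidParams and the sample URIs are made up for illustration; the original code reads $_SERVER['REQUEST_URI'] directly.

```php
<?php
// Hypothetical helper wrapping the line from the question.
function stripOpenidParams(string $requestUri): string
{
    // Remove every "openid.<name>=<value>" pair; a leading "&" is removed
    // with it, while a leading "?" is only looked at, never consumed.
    $stripped = preg_replace('#((?<=\?)|&)openid\.[^&]+#', '', $requestUri);

    // If nothing is left after the "?", drop the dangling "?" as well.
    return rtrim($stripped, '?');
}

// Made-up URIs for illustration.
echo stripOpenidParams('/login.php?openid.mode=id_res&openid.ns=abc'), "\n"; // /login.php
echo stripOpenidParams('/login.php?foo=1&openid.mode=id_res'), "\n";         // /login.php?foo=1
echo stripOpenidParams('/login.php?openid.mode=id_res&foo=1'), "\n";         // /login.php?&foo=1
```

The last call shows a side effect of the lookbehind: when an openid parameter comes first and other parameters follow, the & that separated them stays behind the ?.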

Related

php regex to get base64 string

I have a file smime.p7m with a lot of content. One or more parts of this content look like this:
--_3821f5f5-222-4a90-82e0-d8922ee62cc8_
Content-Type: application/pdf;
name="001235_0001.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="001235_0001.pdf"
JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=
--------------ms021111111111111111111107--
Is there a way, for example with a regex, to get the filename if it's a PDF, together with the base64 code below it? There can be more than one PDF in the file.
The filename is not the problem; I get it with "filename="(.*).pdf". But I don't know how to get the base64 code after the filename.
Base64 consists of the characters A...Z, a...z, the digits 0...9, and the symbols + and /. It can also have one or two = at the end and can be split across several lines.
if (preg_match('/filename=\"(?P<filename>[^"]*?\.pdf)\"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/', $s, $regres)) {
print("FileName: {$regres['filename']}\n");
print("Base64: {$regres['base64']}\n");
}
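As a rough check of the pattern above, the following sketch builds a small MIME-like part, extracts the base64 text, and decodes it. The payload, the file name sample.pdf, and the boundary line are hypothetical and only stand in for the real smime.p7m content.

```php
<?php
// Build a small hypothetical MIME part; the payload is not a real PDF.
$payload = '%PDF-1.3 hypothetical payload';
$s = "Content-Disposition: attachment;\n"
   . "filename=\"sample.pdf\"\n"
   . chunk_split(base64_encode($payload), 16, "\n")
   . "--boundary--\n";

// Same pattern as in the answer above.
$pattern = '/filename=\"(?P<filename>[^"]*?\.pdf)\"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/';

$decoded = null;
if (preg_match($pattern, $s, $regres)) {
    // Remove the line breaks inside the capture, then decode strictly.
    $decoded = base64_decode(preg_replace('/\s+/', '', $regres['base64']), true);
    print("FileName: {$regres['filename']}\n");
    print("Decoded: {$decoded}\n");
}
```

The captured base64 text still contains the line breaks from the file, so they are stripped before calling base64_decode() in strict mode.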
Use
(?im)^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)
See proof
PHP:
preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches);
Explanation
--------------------------------------------------------------------------------
(?im) set flags for this block (case-
insensitive) (with ^ and $ matching start
and end of line) (with . not matching \n)
(matching whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of a "line"
--------------------------------------------------------------------------------
filename=" 'filename="'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
pdf 'pdf'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
\R+ any line break sequence (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
\R any line break sequence
--------------------------------------------------------------------------------
.+ any character except \n (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)+ end of grouping
--------------------------------------------------------------------------------
) end of \2
I gather that this task is not about validation at all, and solely focuses on data extraction -- this makes sharpening the regex logic unnecessary.
You only need a pattern that will match filename=" at the start of a line, then capture the quote-wrapped substring (so long as it ends in .pdf), then, after any number of whitespace characters, capture all characters until one or two = are encountered.
Using greedy negated character classes allows the regex engine to move quickly. The m pattern modifier tells the regex engine that the ^ metacharacter (not the ^ used inside square brackets) may match the start of a line in addition to the start of the string.
Perhaps you'd like to generate an associative array where the keys are the filename strings and the encoded strings are the values. array_column() does a snappy job of setting that up when there are qualifying matches.
Code: (Demo)
var_export(
preg_match_all(
'~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m',
$fileContents,
$out,
PREG_SET_ORDER
)
? array_column($out, 2, 1)
: "no pdf's found"
);
Output:
array (
'001235_0001' => 'JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=',
)
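The same approach can be sketched with more than one attachment, followed by decoding each value. The helper makePart, the file names, and the payloads are hypothetical. Note that this pattern relies on the base64 text ending in = or ==; an attachment whose encoding needs no padding would make [^=]+ run on to the next = in the file.

```php
<?php
// Hypothetical helper that fakes one attachment; payloads are not real PDFs.
function makePart(string $name, string $payload): string
{
    return "Content-Disposition: attachment;\n"
         . "filename=\"{$name}\"\n"
         . chunk_split(base64_encode($payload), 20, "\n")
         . "--boundary\n";
}

// Payload lengths are chosen so that the base64 text ends in "=" or "==".
$fileContents = makePart('first.pdf', 'first hypothetical file')
              . makePart('second.pdf', 'second hypothetical file!');

$decoded = [];
if (preg_match_all('~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m', $fileContents, $out, PREG_SET_ORDER)) {
    // Key each encoded string by its filename, then decode it.
    foreach (array_column($out, 2, 1) as $name => $base64) {
        $decoded[$name] = base64_decode(preg_replace('/\s+/', '', $base64), true);
    }
}
var_export($decoded);
```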

PHP autodetect translatables / detect piece of code by regex

I have a multilanguage site that stores its translatables in a default.php file containing an array with all the keys.
I would prefer to make this automatic.
I already have a (singleton) class that is able to detect all my files by type (controller, action, view, model, etc.).
I would like to detect any piece of code of which the format is like this:
$this->translate('[a-zA-Z]');
$view->translate('[a-zA-Z]');
getView()->translate('[a-zA-Z]');
throw new Exception('[a-zA-Z]');
addMessage(array('message' => '[a-zA-Z]');
However it must be filtered when it starts with/contains:
$this->translate('((0-9)+_)[a-zA-Z]');
$this->translate('[a-zA-Z]' . $* . '[a-zA-Z]'); // Only a variable in the middle must filtered, begin or end is still allowed
Of course, [a-zA-Z] is just an example regex.
As I said, I already have a class that detects certain files. This class also makes use of Reflection (in this case Zend Reflection, as I'm using Zend). However, I could not see a way to reflect a function using a regex.
The action will run from a cron job and as a manually called action, so it is not a big issue if the memory usage is a bit 'too' large.
Description
[$]this->translate[(]'((?:[^'\\]|\\.|'')*)'[)];
This regular expression will do the following:
find code blocks starting with $this->translate(' through the closing ');
place the value inside the ' quotes into capture group 1
avoid messy edge cases wherein the substring contains what looks like a closing '); string when in reality the characters are escaped.
Example
Live Demo
https://regex101.com/r/eC5xQ6/
Sample text
$This->Translate('(?:Droids\');{2}');
$NotTranalate('fdasad');
$this->translate('[a-zA-Z]');
Sample Matches
MATCH 1
1. [17-33] `(?:Droids\');{2}`
MATCH 2
1. [79-87] `[a-zA-Z]`
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
[$] any character of: '$'
----------------------------------------------------------------------
this->translate 'this->translate'
----------------------------------------------------------------------
[(] any character of: '('
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^'\\] any character except: ''', '\\'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
\\ '\'
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
'' '\'\''
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[)] any character of: ')'
----------------------------------------------------------------------
; ';'
----------------------------------------------------------------------
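Here is a minimal PHP sketch that runs this expression over the sample text with preg_match_all(). The ~ delimiters and the i flag are assumptions added here: the sample matches above include $This->Translate, which suggests the demo was run case-insensitively, although the explanation table describes a case-sensitive group.

```php
<?php
// The pattern from the answer, with "~" delimiters and the "i" flag added
// (an assumption, see above). A nowdoc avoids PHP-level escaping.
$pattern = <<<'REGEX'
~[$]this->translate[(]'((?:[^'\\]|\\.|'')*)'[)];~i
REGEX;

// The sample text from the answer.
$source = <<<'CODE'
$This->Translate('(?:Droids\');{2}');
$NotTranalate('fdasad');
$this->translate('[a-zA-Z]');
CODE;

preg_match_all($pattern, $source, $matches);

// $matches[1] holds capture group 1 of every match.
foreach ($matches[1] as $key) {
    echo $key, "\n";
}
```

The first key keeps its escaped \'); sequence intact, which is the edge case the \\. alternative is there for.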

regex: lookbehind assertion

I use ((\d)(\d(?!\2))((?<!\3)\d(?!\3)))\1 to match arbitrary digits with no identical digits in a row, such as:
234234, 345345, 359359, but not 211211 or 355355 (removing the lookbehind assertion makes it match these too).
I found that the pattern raises an error when run with preg_match() in PHP, since a lookbehind must have a fixed length, but it works when tested in another debugger (I use Kodos in this case):
preg_match_all(): Compilation failed: lookbehind assertion is not fixed length at offset 23
Is there an alternative pattern that matches digit sequences like the ones above? For example 245245, or any other digits that fit the ABCABC format.
if the 3 digits must be different you can use:
((\d)(?!.?\2)(\d)(?!\3)\d)\1
but if 545545 is allowed you can use:
((\d)(?!\2)(\d)(?!\3)\d)\1
The problem is the lookbehind. This turns it into a lookahead and seems to work for me on regex101:
((\d)(\d(?!\2))(?!\3)(\d(?!\3)))\1
Just use lookahead instead of lookbehind?
((\d)(?!\2)(\d)(?!\2|\3)\d)\1
Explained by Regex Explainer:
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\2 what was matched by capture \2
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\2 what was matched by capture \2
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\3 what was matched by capture \3
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\1 what was matched by capture \1
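The lookahead-only pattern explained above can be tried in PHP as follows. This is a small sketch: the function name isAbcAbc and the ^ and $ anchors are additions for illustration, so that the whole string has to fit the ABCABC shape.

```php
<?php
// Hypothetical helper: true when $digits is three pairwise different digits
// followed by the same three digits again (ABCABC).
function isAbcAbc(string $digits): bool
{
    // (?!\2) and (?!\2|\3) are lookaheads, so PCRE needs no
    // fixed-length lookbehind here.
    return preg_match('/^((\d)(?!\2)(\d)(?!\2|\3)\d)\1$/', $digits) === 1;
}

foreach (['234234', '345345', '359359', '211211', '355355', '545545'] as $candidate) {
    printf("%s => %s\n", $candidate, isAbcAbc($candidate) ? 'match' : 'no match');
}
```

With this variant 545545 is rejected as well, because the third digit is checked against both the first and the second.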

How do you create a string to match a regex?

I need to write formatting documentation. I know the regexes that are used to format the text, but I don't know how to produce an example string for a given regex.
This one should be an internal link:
'{\[((?:\#|/)[^ ]*) ([^]]*)\]}'
Can anyone create an example that would match this, and maybe explain how they got it? I got stuck at '?'.
I have never used this metacharacter at the beginning of a group; usually I use it to mark that a literal may be absent or appear exactly once.
Thanks
(?:...) has the same grouping effect as (...), but without "capturing" the contents of the group; see http://php.net/manual/en/regexp.reference.subpatterns.php.
So, (?:\#|/) means "either # or /".
I'm guessing you know that [^ ]* means "zero or more characters that aren't SP", and that [^]]* means "zero or more characters that aren't right-square-brackets".
Putting it together, one possible string is this:
'{[/abcd asdfasefasdc]}'
See Open source RegexBuddy alternatives and Online regex testing for some helpful tools. It's easiest to have a regex explained by them first. I used YAPE here:
NODE EXPLANATION
----------------------------------------------------------------------
\[ '['
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\# '#'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
[^ ]* any character except: ' ' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
' '
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[^]]* any character except: ']' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\] ']'
----------------------------------------------------------------------
This is under the presumption that { and } in your example are the regex delimiters.
You can just read through the list of explanations and come up with a possible source string such as:
[#NOSPACE NOBRACKET]
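A quick way to confirm such a candidate is to run it through preg_match() with the pattern exactly as posted. This is a minimal sketch that assumes, as stated above, that { and } are the delimiters; the third candidate string is made up to show a non-match.

```php
<?php
// The pattern exactly as posted; "{" and "}" act as the PCRE delimiters.
$pattern = '{\[((?:\#|/)[^ ]*) ([^]]*)\]}';

// Two candidates from the answers plus one made-up string without # or /.
$candidates = ['[#NOSPACE NOBRACKET]', '[/abcd asdfasefasdc]', '[no-leading-hash text]'];

foreach ($candidates as $candidate) {
    if (preg_match($pattern, $candidate, $m)) {
        echo "{$candidate} => group 1 '{$m[1]}', group 2 '{$m[2]}'\n";
    } else {
        echo "{$candidate} => no match\n";
    }
}
```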
I think this is a good post to help design regexes. While it's fairly easy to write a
general regex to match a string, sometimes it's helpful to look at it in reverse after
it's designed. Sometimes it is necessary to see what bizarre things will match.
When mixing a lot of the metachars as literals, it's fairly important to format
these for ease of reading and to avoid errors.
Here are some samples in Perl which were easier (for me) to prototype.
my @samps = (
'{[/abcd asdfasefasdc]}',
'{[# ]}',
'{[# /# \/]}',
'{[/# {[
| /# {[#\/} ]}',
,
);
for (@samps) {
    if (m~{\[([#/][^ ]*) ([^]]*)\]}~)
    {
        print "Found: '$&'\ngrp1 = '$1'\ngrp2 = '$2'\n===========\n\n";
    }
}
__END__
Expanded
\{\[
(
[#/][^ ]*
)
[ ]
(
[^\]]*
)
\]\}
Output
Found: '{[/abcd asdfasefasdc]}'
grp1 = '/abcd'
grp2 = 'asdfasefasdc'
===========
Found: '{[# ]}'
grp1 = '#'
grp2 = ''
===========
Found: '{[# /# \/]}'
grp1 = '#'
grp2 = '/# \/'
===========
Found: '{[/# {[
| /# {[#\/} ]}'
grp1 = '/# {[
|'
grp2 = '/# {[#\/} '
===========

Understanding Pattern in preg_match_all() Function Call

I am trying to understand how preg_match_all() works and when looking at the documentation on the php.net site, I see some examples but am baffled by the strings sent as the pattern parameter. Is there a really thorough, clear explanation out there? For example, I don't understand what the pattern in this example means:
preg_match_all("/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x",
"Call 555-1212 or 1-800-555-1212", $phones);
or this:
$html = "<b>bold text</b><a href=howdy.html>click me</a>";
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);
I've taken an introductory class on PHP, but never saw anything like this. Some clarification would be appreciated.
Thanks!
Those aren't "PHP patterns", those are Regular Expressions. Instead of trying to explain what has been explained before a thousand times in this answer, I'll point you to http://regular-expressions.info for information and tutorials.
You are looking for these:
PHP PCRE Pattern Syntax
PCRE Standard syntax
Note that the first one is a subset of the second one.
Also have a look at YAPE, which for example gives this nice textual explanation for your first regex:
(?x-ims:\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4})
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?x-ims: group, but do not capture (disregarding
whitespace and comments) (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n):
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1 (optional
(matching the most amount possible)):
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
)? end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
\)? ')' (optional (matching the most amount
possible))
----------------------------------------------------------------------
(?(1) if back-reference \1 matched, then:
----------------------------------------------------------------------
[\-\s] any character of: '\-', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| else:
----------------------------------------------------------------------
succeed
----------------------------------------------------------------------
) end of conditional on \1
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
- '-'
----------------------------------------------------------------------
\d{4} digits (0-9) (4 times)
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
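Running the two examples from the question is a good way to connect the explanation with actual results. The sketch below uses single-quoted PHP strings, an assumption made here so that the backreference can be written as \2 instead of the \\2 needed inside double quotes.

```php
<?php
// First example from the question; single quotes keep the backslashes intact.
preg_match_all(
    '/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x',
    'Call 555-1212 or 1-800-555-1212',
    $phones
);
print_r($phones[0]); // full matches: 555-1212 and 800-555-1212

// Second example: \2 must repeat the tag name captured by group 2.
$html = '<b>bold text</b><a href=howdy.html>click me</a>';
preg_match_all('/(<([\w]+)[^>]*>)(.*?)(<\/\2>)/', $html, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    echo "tag: {$match[2]}, content: {$match[3]}\n";
}
```

In the first example the conditional (?(1) ...) only demands a separator when the optional area code in group 1 actually matched, which is why 555-1212 is found without one.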
The pattern you write about is a mini-language in its own right, called regular expressions. It specializes in finding patterns in strings, doing replacements, and so on, for everything that follows some sort of pattern.
More specifically, it's a Perl Compatible Regular Expression (PCRE).
The handbook for that language is not available on the PHP manual website; you can find it here: PCRE Manpage.
A well-made step-by-step introduction is on the Regular Expressions Info Website.
