php regex to get base64 string - php

I have a file smime.p7m with many content. One or more of this Content is like this
--_3821f5f5-222-4a90-82e0-d8922ee62cc8_
Content-Type: application/pdf;
name="001235_0001.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="001235_0001.pdf"
JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=
--------------ms021111111111111111111107--
Is there a way to get the filename for example with regex if it's a pDF and the BASE64 code below? It can happen that there is more than one PDF file in the file.
The Filename is not the problem. I get this with "filename="(.*).pdf". But I don't know how I get the base64code after the filename

base64 consists of characters A...Z a...z digits 0..9 symbols + and /. It also can have one or two = in the end and can be split to several lines.
if (preg_match('/filename=\"(?P<filename>[^"]*?\.pdf)\"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/', $s, $regres)) {
print("FileName: {$regres['filename']}\n");
print("Base64: {$regres['base64']}\n");
}

Use
(?im)^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)
See proof
PHP:
preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches);
Explanation
--------------------------------------------------------------------------------
(?im) set flags for this block (case-
insensitive) (with ^ and $ matching start
and end of line) (with . not matching \n)
(matching whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of a "line"
--------------------------------------------------------------------------------
filename=" 'filename="'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
pdf 'pdf'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
\R+ any line break sequence (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
\R any line break sequence
--------------------------------------------------------------------------------
.+ any character except \n (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)+ end of grouping
--------------------------------------------------------------------------------
) end of \2

I gather that this task is not about validation at all, and solely focuses on data extraction -- this makes sharpening the regex logic unnecessary.
You only need a pattern that will match filename=" at the start of a line, then capture the quote-wrapped substring (so long as it ends in .pdf), then after any number of whitespace characters, capture all characters until one or two = are encountered,
Using greedy negative character classes allows the regex engine to move quickly. The m pattern modifier tells the regex engine that the ^ meta character (not the ^ used inside of square braces) may match the start of a line in addition to the start of the string.
Perhaps you'd like to generate an associative array where the keys are the filename strings and the encoded strings are the values, array_column() does a snappy job of setting that up when there are qualifying matches.
Code: (Demo)
var_export(
preg_match_all(
'~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m',
$fileContents,
$out,
PREG_SET_ORDER
)
? array_column($out, 2, 1)
: "no pdf's found"
);
Output:
array (
'001235_0001' => 'JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=',
)

Related

regex php look ahead number

stackers!
I have been trying to figure this out for some time but no luck.
(.*?(?:\.|\?|!))(?: |$)
the above pattern is capturing and breaking all sentences in a paragraph with ending punctuation.
example
Today is the greatest. You are the greatest.
The match comes back with three
Match {
1.
Today is the greatest.
You are the greatest.
}
However I am trying to get it to not break when there is a number with a period and would like to see the following match instead:
Match {
1.Today is the greatest.
You are the greatest.
}
Thanks for your help in advance
Use
.*?[.?!](?=(?<!\d\.)\s+|\s*$)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
[.?!] any character of: '.', '?', '!'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of look-ahead

preg_match_all for a string

I'm look to just get America/Chicago from this string
Local Time Zone (America/Chicago (CDT) offset -18000 (Daylight))
and have it work for other TimeZones, like America/Los_Angeles,America/New_York and so on. Im not very good with prey_match_all and also, if someone can direct me to a good tutorial on how to properly learn that,because this is the 3rd time ive needed to use it.
here is your solution with my regular expression
Code
$in = "Local Time Zone (America/Chicago (CDT) offset -18000 (Daylight))";
preg_match_all('/\(([A-Za-z0-9_\W]+?)\\s/', $in, $out);
echo "<pre>";
print_r($out[1][0]);
?>
And OUTPUT
America/Chicago
hope this will sure help you.
I'd use the regex: \((\S+)
preg_match_all('/\((\S+)/', $in, $out);
explanation:
The regular expression:
(?-imsx:\((\S+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\S+ non-whitespace (all but \n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

Understanding Pattern in preg_match_all() Function Call

I am trying to understand how preg_match_all() works and when looking at the documentation on the php.net site, I see some examples but am baffled by the strings sent as the pattern parameter. Is there a really thorough, clear explanation out there? For example, I don't understand what the pattern in this example means:
preg_match_all("/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x",
"Call 555-1212 or 1-800-555-1212", $phones);
or this:
$html = "<b>bold text</b><a href=howdy.html>click me</a>";
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);
I've taken an introductory class on PHP, but never saw anything like this. Some clarification would be appreciated.
Thanks!
Those aren't "PHP patterns", those are Regular Expressions. Instead of trying to explain what has been explained before a thousand times in this answer, I'll point you to http://regular-expressions.info for information and tutorials.
You are looking for this,
PHP PCRE Pattern Syntax
PCRE Standard syntax
Note that first one is a subset of second one.
Also have a look at YAPE, which for example gives this nice textual explanation for your first regex:
(?x-ims:\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4})
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?x-ims: group, but do not capture (disregarding
whitespace and comments) (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n):
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1 (optional
(matching the most amount possible)):
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
)? end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
\)? ')' (optional (matching the most amount
possible))
----------------------------------------------------------------------
(?(1) if back-reference \1 matched, then:
----------------------------------------------------------------------
[\-\s] any character of: '\-', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| else:
----------------------------------------------------------------------
succeed
----------------------------------------------------------------------
) end of conditional on \1
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
- '-'
----------------------------------------------------------------------
\d{4} digits (0-9) (4 times)
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The pattern you write about is a mini-language in it's own called Regular Expression. It's specialized on finding patterns in strings, do replacements etc. for everything that follows some sort of pattern.
More specifically it's a Perl Compatible Regular Expression (PCRE).
The handbook for that language is not available on the PHP manual website, you find it here: PCRE Manpage.
A well made step-by-step introduction is on the Regular Expressions Info Website.

What does this `#((?<=\?)|&)openid\.[^&]+#` regexp mean?

So I am triing to read some php code... I found such line
$uri = rtrim(preg_replace('#((?<=\?)|&)openid\.[^&]+#', '', $_SERVER['REQUEST_URI']), '?');
what does it mean? and if it (seems for me) just returns 'file name' why it is so complicated?
The purpose of that line is to remove values like openid.something=value from the request URI.
There are tools out there to translate regex into prose, with an aim to help you understand what a regex is trying to match. For example, when yours is passed to such a tool the description comes back as:
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\? '?'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
& '&'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
openid 'openid'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
[^&]+ any character except: '&' (1 or more times
(matching the most amount possible))
As the above says, the regex looks for a ? or & followed by openid., followed by anything not &. The resulting match will include the preceeding & if there is one, but not the ? since a look behind was used for the latter.

How does this regex divide text into sentences?

I know this regex divides a text into sentences. Can someone help me understand how?
/(?<!\..)([\?\!\.])\s(?!.\.)/
You can use YAPE::Regex::Explain to decipher Perl regular expressions:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(?<!\..)([\?\!\.])\s(?!.\.)/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(?<!\..)([\?\!\.])\s(?!.\.))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
(?<! look behind to see if there is not:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
) end of look-behind
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\?\!\.] any character of: '\?', '\!', '\.'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
There is the Regular Expression Analyzer which will do quite the same as toolic already suggested - but completely webbased.
(? # Find a group (don't capture)
< # before the following regular expression
! # that does not match
\. # a literal "."
. # followed by 1 character
) # (End look-behind group)
( # Start a group (capture it to $1)
[\?\!\.] # Containing any one of the characters in the following set "?!."
) # End group $1
\s # followed by a whitespace character " ", \t, etc.
(? # Followed by a group (don't capture)
# after the preceding regular expression
! # that does not have
. # 1 character
\. # followed by a literal "."
) # (End look-ahead group)
The first part (?<!\..) is a negative look-behind. It specifies a pattern which invalidates the match. In this case it's looking for two characters--the first a period and the other one any character.
The second part is a standard capture/group, which could be better expressed: ([?!.]) (you don't need the escapes in the class brackets), that is a sentence ending punctuation character.
The next part is a single (??) white-space character: \s
And the last part is a negative look-ahead: (?!.\.). Again it is guarding against the case of a single character followed by a period.
This should work, relatively well. But I don't think I would recommend it. I don't see what the coder was getting at trying to make sure that just a period wasn't the second most recent character, or that it wasn't the second one to come.
I mean if you are looking to split on terminal punctuation, why don't you want to guard against the same class being two-back or two-ahead? Instead it relies on periods not being there. Thus a more regular expression would be:
/(?<![?!.].)([?!.])\s(?!.[?!.])/
Portions:
([\?\!\.])\s: split by ending character (.,!,or ?) which is followed by a whitespace character (space, tab, newline)
(?<!\..) where the characters before this 'ending character' arent a .+anything
(?!.\.) after the whitespace character any character directly followed by any . isn't allowed.
Those look-ahead ((?!) & look-behind ((?<!) assertions mainly seem to prevent splitting on (whitespaced?) abbreviations (q. e. d. etc.).

Categories