regex: lookbehind assertion - php

I use ((\d)(\d(?!\2))((?<!\3)\d(?!\3)))\1 to match a repeated three-digit group in which no digit is the same as the one next to it, such as:
234234, 345345, 359359, but not 211211 or 355355 (removing the lookbehind assertion makes the pattern match these as well).
The pattern produces an error when run with preg_match() in PHP because a lookbehind must have a fixed length, yet it works when tested in another debugger (Kodos in this case):
preg_match_all(): Compilation failed: lookbehind assertion is not fixed length at offset 23
Is there an alternative pattern that matches the kind of digit sequences above, such as 245245 or any other digits that fit the ABCABC format?

If the 3 digits must all be different, you can use:
((\d)(?!.?\2)(\d)(?!\3)\d)\1
But if 545545 is allowed (only neighbouring digits must differ), you can use:
((\d)(?!\2)(\d)(?!\3)\d)\1
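To compare the two variants side by side, here is a minimal sketch. The patterns are copied from the answer above; the list of sample inputs and the variable names are only illustrative.

```php
<?php
// Patterns from the answer above, wrapped in / delimiters for PCRE.
$allDistinct      = '/((\d)(?!.?\2)(\d)(?!\3)\d)\1/'; // A, B and C all differ
$adjacentDistinct = '/((\d)(?!\2)(\d)(?!\3)\d)\1/';   // only neighbours differ

// Illustrative inputs taken from the question and the answer.
$samples = ['234234', '359359', '211211', '355355', '545545'];

$results = [];
foreach ($samples as $s) {
    $results[$s] = [
        'allDistinct'      => preg_match($allDistinct, $s) === 1,
        'adjacentDistinct' => preg_match($adjacentDistinct, $s) === 1,
    ];
    printf(
        "%s  all-distinct: %-3s  adjacent-distinct: %s\n",
        $s,
        $results[$s]['allDistinct'] ? 'yes' : 'no',
        $results[$s]['adjacentDistinct'] ? 'yes' : 'no'
    );
}
```

Only 545545 is treated differently by the two patterns: the first rejects it because the first and third digits are equal, the second accepts it.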

The problem is the lookbehind. This version turns it into a lookahead and seems to work for me on regex101:
((\d)(\d(?!\2))(?!\3)(\d(?!\3)))\1

Just use lookahead instead of lookbehind?
((\d)(?!\2)(\d)(?!\2|\3)\d)\1
Explained by Regex Explainer:
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\2 what was matched by capture \2
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\2 what was matched by capture \2
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\3 what was matched by capture \3
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\1 what was matched by capture \1
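As a rough illustration of the lookahead-only pattern explained above, the sketch below pulls every ABCABC number with three distinct digits out of a longer string. The input text is made up for the example.

```php
<?php
// Lookahead-only pattern from the answer above.
$pattern = '/((\d)(?!\2)(\d)(?!\2|\3)\d)\1/';

// Hypothetical input mixing valid and invalid candidates.
$text = 'ids: 234234, 211211, 359359, 355355, 545545';

preg_match_all($pattern, $text, $m);
$found = $m[0]; // full matches only

print_r($found);
```

With this input, only 234234 and 359359 are returned; 545545 is rejected because the second alternative in (?!\2|\3) also forbids the third digit from repeating the first.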

Related

regex php look ahead number

Stackers!
I have been trying to figure this out for some time, but with no luck.
(.*?(?:\.|\?|!))(?: |$)
The above pattern captures every sentence in a paragraph, breaking after the ending punctuation.
Example:
Today is the greatest. You are the greatest.
When the paragraph starts with a number and a period, the match comes back with three results:
Match {
1.
Today is the greatest.
You are the greatest.
}
However, I am trying to get it not to break after a number followed by a period, and would like to see the following match instead:
Match {
1.Today is the greatest.
You are the greatest.
}
Thanks for your help in advance
Use
.*?[.?!](?=(?<!\d\.)\s+|\s*$)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
[.?!] any character of: '.', '?', '!'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of look-ahead
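A minimal sketch of the pattern explained above applied with preg_match_all(). The sample paragraph is an assumption based on the question; because the whitespace is only asserted by the lookahead and never consumed, later matches begin with a space, so the sketch trims each result.

```php
<?php
// Pattern from the answer above.
$pattern = '/.*?[.?!](?=(?<!\d\.)\s+|\s*$)/';

// Assumed input: a numbered paragraph like the one in the question.
$text = '1. Today is the greatest. You are the greatest.';

preg_match_all($pattern, $text, $m);

// Whitespace between sentences is not consumed, so trim each match.
$sentences = array_map('trim', $m[0]);

print_r($sentences);
```

The period after "1" is skipped because the lookbehind (?<!\d\.) sees a digit followed by a period immediately before the whitespace, so "1. Today is the greatest." stays in one piece.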

RegExp match group not preceded by alphanumeric (\w) ignoring spaces

I'm trying to match a group that only matches when the first non-spacing character preceding the match is NOT an alphanumeric character.
The RegExp I've tried consumes the spaces first with \s* and then looks behind to check for \w:
(?<!\w)\s*\({\w+}\)
Success
Input: this will = ({match})
Expected: ({match})
Actual: ({match})
Failure: it still matches while preceded by an alphanumeric character (ignoring spaces)
Input: this should = not ({match})
Expected: -
Actual: ({match})
Using \s+ instead of \s* partially solved it, but now it requires at least one space, which is not desired!
(?<!\w)\s+\({\w+}\)
I've been looking around the internet but cannot solve the problem. Anyone?
Use this solution (a mix of @horcrux's and @Wiktor Stribizew's suggestions):
<?php
$regex = '/(?<![\w\s])\s*(\({\w+}\))/';
$string = 'this will = ({match})
this should = not ({match})';
preg_match_all($regex, $string, $matches);
var_dump($matches[1]);
?>
See regex proof.
Results:
array(1) {
[0]=>
string(9) "({match})"
}
See PHP proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
[\w\s] any character of: word characters (a-z,
A-Z, 0-9, _), whitespace (\n, \r, \t,
\f, and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
{ '{'
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
} '}'
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \1
If there has to be a first non-spacing character present, you could match it and use \K to clear the match buffer.
[^\w\s]\h*\K\({\w+}\)
The pattern matches
[^\w\s] Match a single char other than a word char or whitespace char
\h*\K Match 0+ horizontal whitespace chars and forget what is matched so far
\({\w+}\) Match 1+ word chars between ({ and })
Regex demo | Php demo
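The two answers differ when nothing at all precedes the group. The following sketch runs both patterns over the inputs from the question plus one extra, hypothetical input that starts directly with the group.

```php
<?php
// Lookbehind version (first answer) and \K version (second answer).
$lookbehind = '/(?<![\w\s])\s*(\({\w+}\))/';
$keepOut    = '/[^\w\s]\h*\K\({\w+}\)/';

$inputs = [
    'this will = ({match})',
    'this should = not ({match})',
    '({match})', // hypothetical: no character precedes the group
];

$report = [];
foreach ($inputs as $in) {
    $report[$in] = [
        // Group 1 holds the wanted text in the lookbehind version.
        preg_match($lookbehind, $in, $a) ? $a[1] : null,
        // With \K the full match already excludes the prefix.
        preg_match($keepOut, $in, $b) ? $b[0] : null,
    ];
}

var_dump($report);
```

Both patterns accept the first input and reject the second. For the third input only the lookbehind version matches, since the \K version requires an actual non-word, non-space character before the group.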

php regex to get base64 string

I have a file smime.p7m with many parts. One or more of these parts looks like this:
--_3821f5f5-222-4a90-82e0-d8922ee62cc8_
Content-Type: application/pdf;
name="001235_0001.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="001235_0001.pdf"
JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=
--------------ms021111111111111111111107--
Is there a way, for example with a regex, to get the filename if the part is a PDF, together with the base64 code below it? The file can contain more than one PDF.
The filename is not the problem. I get it with "filename="(.*).pdf". But I don't know how to get the base64 code that follows the filename.
Base64 consists of the characters A..Z and a..z, the digits 0..9 and the symbols + and /. It can also have one or two = at the end and can be split across several lines.
if (preg_match('/filename=\"(?P<filename>[^"]*?\.pdf)\"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/', $s, $regres)) {
    print("FileName: {$regres['filename']}\n");
    print("Base64: {$regres['base64']}\n");
}
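The captured text still contains the line breaks of the original file, so it has to be cleaned before decoding. Here is a minimal sketch using the pattern above; the payload, the filename demo.pdf and the boundary string are all made up so that the example is self-checking.

```php
<?php
// Hypothetical payload standing in for real PDF bytes.
$payload = '%PDF-1.3 hypothetical payload';

// Build a small MIME-like part; header values and boundary are invented.
$part = "Content-Transfer-Encoding: base64\n"
      . "Content-Disposition: attachment;\n"
      . " filename=\"demo.pdf\"\n\n"
      . chunk_split(base64_encode($payload), 16, "\n") // wrap at 16 chars
      . "--hypothetical-boundary--\n";

// Pattern from the answer above.
$pattern = '/filename=\"(?P<filename>[^"]*?\.pdf)\"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/';

$decoded = null;
if (preg_match($pattern, $part, $m)) {
    // Remove line breaks and other whitespace, then decode in strict mode.
    $clean   = preg_replace('/\s+/', '', $m['base64']);
    $decoded = base64_decode($clean, true);
    printf("%s: %d bytes decoded\n", $m['filename'], strlen($decoded));
}
```

Strict mode (the second argument of base64_decode()) makes the call return false if anything outside the base64 alphabet slipped into the capture, which is a cheap sanity check for the extraction.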
Use
(?im)^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)
See proof
PHP:
preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches);
Explanation
--------------------------------------------------------------------------------
(?im) set flags for this block (case-
insensitive) (with ^ and $ matching start
and end of line) (with . not matching \n)
(matching whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of a "line"
--------------------------------------------------------------------------------
filename=" 'filename="'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
pdf 'pdf'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
\R+ any line break sequence (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
\R any line break sequence
--------------------------------------------------------------------------------
.+ any character except \n (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)+ end of grouping
--------------------------------------------------------------------------------
) end of \2
I gather that this task is not about validation at all and focuses solely on data extraction, which makes sharpening the regex logic unnecessary.
You only need a pattern that matches filename=" at the start of a line, then captures the quote-wrapped substring (so long as it ends in .pdf), then, after any number of whitespace characters, captures all characters until one or two = are encountered.
Using greedy negated character classes allows the regex engine to move quickly. The m pattern modifier tells the regex engine that the ^ metacharacter (not the ^ used inside square brackets) may match the start of a line in addition to the start of the string.
Perhaps you'd like to generate an associative array where the keys are the filename strings and the encoded strings are the values. array_column() does a snappy job of setting that up when there are qualifying matches.
Code: (Demo)
var_export(
    preg_match_all(
        '~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m',
        $fileContents,
        $out,
        PREG_SET_ORDER
    )
    ? array_column($out, 2, 1)
    : "no pdf's found"
);
Output:
array (
'001235_0001' => 'JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=',
)

Understanding Pattern in preg_match_all() Function Call

I am trying to understand how preg_match_all() works and when looking at the documentation on the php.net site, I see some examples but am baffled by the strings sent as the pattern parameter. Is there a really thorough, clear explanation out there? For example, I don't understand what the pattern in this example means:
preg_match_all("/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x",
"Call 555-1212 or 1-800-555-1212", $phones);
or this:
$html = "<b>bold text</b><a href=howdy.html>click me</a>";
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);
I've taken an introductory class on PHP, but never saw anything like this. Some clarification would be appreciated.
Thanks!
Those aren't "PHP patterns", those are Regular Expressions. Instead of trying to explain what has been explained before a thousand times in this answer, I'll point you to http://regular-expressions.info for information and tutorials.
You are looking for these:
PHP PCRE Pattern Syntax
PCRE Standard syntax
Note that the first one is a subset of the second one.
Also have a look at YAPE, which for example gives this nice textual explanation for your first regex:
(?x-ims:\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4})
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?x-ims: group, but do not capture (disregarding
whitespace and comments) (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n):
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1 (optional
(matching the most amount possible)):
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
)? end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
\)? ')' (optional (matching the most amount
possible))
----------------------------------------------------------------------
(?(1) if back-reference \1 matched, then:
----------------------------------------------------------------------
[\-\s] any character of: '\-', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| else:
----------------------------------------------------------------------
succeed
----------------------------------------------------------------------
) end of conditional on \1
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
- '-'
----------------------------------------------------------------------
\d{4} digits (0-9) (4 times)
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
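Running the first pattern from the question shows what the conditional (?(1) ...) does in practice. This sketch reuses the call from the question unchanged and only adds output.

```php
<?php
// Same call as in the question; /x lets the pattern contain spaces.
preg_match_all(
    "/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x",
    "Call 555-1212 or 1-800-555-1212",
    $phones
);

// $phones[0] holds the full matches, $phones[1] the optional area code.
print_r($phones[0]);
print_r($phones[1]);
```

The first number is matched without an area code, so group 1 does not participate and no separator is required. The second is matched as 800-555-1212 because group 1 captured 800, which makes the conditional demand a dash or whitespace after it.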
The pattern you are asking about is written in a mini-language of its own called Regular Expressions. It specializes in finding patterns in strings, doing replacements and so on, for anything that follows some sort of pattern.
More specifically, it's a Perl Compatible Regular Expression (PCRE).
The handbook for that language is not available on the PHP manual website; you can find it here: PCRE Manpage.
A well-made step-by-step introduction is on the Regular Expressions Info website.
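The second pattern from the question can be explored the same way. This sketch reuses the question's call and prints the tag name (group 2) and the inner text (group 3) of each match; the backreference \2 forces the closing tag to use the same name as the opening tag.

```php
<?php
$html = "<b>bold text</b><a href=howdy.html>click me</a>";

// Same call as in the question. PREG_SET_ORDER groups the results per match:
// [0] whole element, [1] opening tag, [2] tag name, [3] inner text, [4] closing tag.
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $set) {
    printf("tag=%s inner=%s\n", $set[2], $set[3]);
}
```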

What does this `#((?<=\?)|&)openid\.[^&]+#` regexp mean?

So I am trying to read some PHP code... I found this line:
$uri = rtrim(preg_replace('#((?<=\?)|&)openid\.[^&]+#', '', $_SERVER['REQUEST_URI']), '?');
What does it mean? And if (as it seems to me) it just returns the 'file name', why is it so complicated?
The purpose of that line is to remove values like openid.something=value from the request URI.
There are tools out there to translate regex into prose, with an aim to help you understand what a regex is trying to match. For example, when yours is passed to such a tool the description comes back as:
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\? '?'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
& '&'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
openid 'openid'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
[^&]+ any character except: '&' (1 or more times
(matching the most amount possible))
As the above says, the regex looks for a ? or & followed by openid., followed by anything that is not &. The resulting match will include the preceding & if there is one, but not the ?, since a lookbehind was used for the latter.
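To see the effect on concrete values, here is a small sketch that wraps the line from the question in a function. The function name stripOpenidParams and the example URIs are hypothetical, and a plain string replaces $_SERVER['REQUEST_URI'].

```php
<?php
// Hypothetical wrapper around the expression from the question.
function stripOpenidParams(string $requestUri): string
{
    return rtrim(preg_replace('#((?<=\?)|&)openid\.[^&]+#', '', $requestUri), '?');
}

// Illustrative request URIs.
$onlyOpenid  = stripOpenidParams('/index.php?openid.mode=id_res&openid.identity=abc');
$openidLast  = stripOpenidParams('/index.php?page=2&openid.mode=id_res');
$openidFirst = stripOpenidParams('/index.php?openid.mode=id_res&page=2');

echo $onlyOpenid, "\n";  // all parameters removed, trailing ? trimmed
echo $openidLast, "\n";  // other parameters are kept
echo $openidFirst, "\n"; // the & of the following parameter is left behind
```

The first call yields /index.php and the second /index.php?page=2. In the third case the result is /index.php?&page=2: the ? is kept because it was only looked behind, and the & belongs to the parameter that was not removed.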
