RegEx Fails on Known Good Value - php

I have a regex designed to detect plausible Base64 strings. It works in tests at https://regex101.com for all expected test values.
~^((?:[a-zA-Z0-9/+]{4})*(?:(?:[a-zA-Z0-9/+]{3}=)|(?:[a-zA-Z0-9/+]{2}==))?)$~
However, when I use this pattern in PHP, I find some values inexplicably fail.
$tests = array(
'MFpGQkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
'MFpGRkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
'MFpGSkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
);
foreach ($tests as $str) {
$result = preg_match(
'~^((?:[a-zA-Z0-9/+]{4})*(?:(?:[a-zA-Z0-9/+]{3}=)|(?:[a-zA-Z0-9/+]{2}==))?)$~i',
preg_replace('~[\s\R]~u', "", $str)
);
var_dump($result);
}
results:
int(1)
int(0)
int(1)
Question: Why does this pattern fail for the second test string?

The problem is in your preg_replace call:
preg_replace('~[\s\R]~u', "", $str)
Inside a character class, \R is not the line-break escape. It is treated as a literal R, so the R is removed from the second element of the array, and that causes preg_match to fail.
Change it to:
preg_replace('~\s|\R~u', "", $str)
With the u modifier, \s already matches every character that \R can match, so you can just do:
preg_replace('~\s+~u', "", $str)
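As a quick check of the fix, the sketch below strips whitespace with \s+ and then applies the Base64 pattern from the question. The helper name isPlausibleBase64 is made up for this illustration, and the i modifier is dropped on the assumption that it is redundant, since the character class already lists both cases.

```php
<?php
// Hypothetical helper, not part of the original question.
function isPlausibleBase64(string $str): bool
{
    // Strip all whitespace (including line breaks) before validating.
    $clean = preg_replace('~\s+~u', '', $str);
    if ($clean === null) {
        return false; // preg_replace failed, e.g. on invalid UTF-8 input
    }
    $pattern = '~^(?:[a-zA-Z0-9/+]{4})*(?:[a-zA-Z0-9/+]{3}=|[a-zA-Z0-9/+]{2}==)?$~';
    return preg_match($pattern, $clean) === 1;
}

// The value that failed in the question (it contains an uppercase R).
var_dump(isPlausibleBase64('MFpGRkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=')); // bool(true)
// The same value with embedded whitespace.
var_dump(isPlausibleBase64("MFpG RkVB\nJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=")); // bool(true)
```

Note that an empty string also passes this pattern, so a length check may be needed depending on the use case.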

Related

preg_match does not find a UTF-8 character at the beginning of a binary string which contain non-UTF8 characters

If there is a non-UTF-8 byte anywhere in the string, preg_match with the u modifier returns false to signal an error.
For example:
<?php
$string = "ABCD\xc3";
$r = preg_match('/^./u',$string, $match);
var_dump($r); //bool(false)
You can try this example yourself: https://3v4l.org/qkHl4
The regular expression finds the first character once the non-UTF-8 byte at the end is removed.
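To confirm that the false comes from invalid UTF-8 rather than from a broken pattern, preg_last_error() can be checked right after the call. This is only a small diagnostic sketch; the variable name $badUtf8 is an assumption for illustration.

```php
<?php
$string = "ABCD\xc3";
$r = preg_match('/^./u', $string, $match);

if ($r === false) {
    // PREG_BAD_UTF8_ERROR signals that the subject is not valid UTF-8.
    $badUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;
    var_dump($badUtf8); // bool(true)
}
```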
$string = "ABCD";
$r = preg_match('/^./u',$string, $match);
var_dump($r, $match);
//int(1) array(1) { [0]=> string(1) "A" }
Is there an easy way to use regular expressions to identify a UTF-8 character at the beginning of strings that also contain non-UTF-8 bytes?
You could also consider using T-Regx, which handles UTF8 errors in a more cooperative way:
try {
    pattern('^.', 'u')->match("ABCD\xc3")->all();
} catch (SafeRegexException $e) {
    // handle
}
Based on this answer, you can sanitize invalid UTF-8 bytes using mb_convert_encoding. It does not strictly remove them but replaces each one with a substitute character (? by default):
$string = "ABCD\xc3";
$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
$r = preg_match('/^./u', $string, $match);
var_dump($r, $match);
This gives the following result:
int(1)
array(1) {
[0] =>
string(1) "A"
}
I think after a long search I found an answer myself.
The modifier u works only if the entire string is a valid UTF-8 string.
Even if only the first character is to be found, the entire string is checked first.
The modifier u cannot be used for this problem. However, a byte-oriented regular expression (without u) can be used.
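function utf8Char($string){
    // Byte-wise match (no u modifier): try a 4-, 3-, 2-, then 1-byte sequence.
    // Lead bytes C0, C1 and F5-F7 never occur in valid UTF-8, so they are excluded.
    $ok = preg_match(
        '/^[\xF0-\xF4][\x80-\xBF][\x80-\xBF][\x80-\xBF]
        |^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]
        |^[\xC2-\xDF][\x80-\xBF]
        |^[\x00-\x7F]/sx',
        $string,
        $match);
    // Return the leading character, or false if no such sequence starts the string.
    return $ok ? $match[0] : false;
}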
var_dump(utf8char("€a\xc3def")); //string(3) "€"
var_dump(utf8char("a\xc3def")); //string(1) "a"
var_dump(utf8char("\xc3def")); //bool(false)
The invalid bytes can be retrieved using the substr function.
var_dump(substr("\xc3def",0,1)); //string(1) "�"
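An alternative sketch that avoids hand-written byte ranges, assuming the mbstring extension is available: try prefixes of one to four bytes and return the first one that mb_check_encoding accepts as UTF-8. The function name firstUtf8Char is invented for this example and is not part of the answers above.

```php
<?php
// Hypothetical helper: returns the leading UTF-8 character of $string,
// or false if the string is empty or starts with an invalid sequence.
function firstUtf8Char(string $string)
{
    $max = min(4, strlen($string)); // a UTF-8 character is at most 4 bytes
    for ($len = 1; $len <= $max; $len++) {
        $candidate = substr($string, 0, $len);
        // An incomplete multi-byte prefix is rejected, so the first
        // accepted prefix is exactly one complete character.
        if (mb_check_encoding($candidate, 'UTF-8')) {
            return $candidate;
        }
    }
    return false;
}

var_dump(firstUtf8Char("€a\xc3def")); // string(3) "€"
var_dump(firstUtf8Char("a\xc3def"));  // string(1) "a"
var_dump(firstUtf8Char("\xc3def"));   // bool(false)
```

Unlike the regex version, this leaves the question of overlong or otherwise ill-formed sequences to mbstring.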

PHP - How to remove specific character at a string?

I have several + symbols in my string, and I want to remove the last one and every character after it.
Example:
Giza+badrashen+test
You can explode your string on '+' and then join it back together while dropping the last element of the split (using array_slice with a negative length), like this (assuming $str is your string):
$result = join('+', array_slice(explode('+', $str), 0, -1));
In case you suspect your string may not contain a '+', you can check for its presence first with strpos:
if (strpos($str, '+') !== false) {
    $result = join('+', array_slice(explode('+', $str), 0, -1));
}
A simple regex solution:
Assuming you have Giza+badrashen+test and want Giza+badrashen as result.
echo preg_replace("/\+[^\+]*$/", "", "Giza+badrashen+test");
Tests:
var_dump(preg_replace("/\+[^\+]*$/", "", "Giza+badrashen+test"));
var_dump(preg_replace("/\+[^\+]*$/", "", "Giza+badrashen+test+"));
var_dump(preg_replace("/\+[^\+]*$/", "", "Giza"));
Output:
string(14) "Giza+badrashen"
string(19) "Giza+badrashen+test"
string(4) "Giza"
$string = "abc1234+12+3455+xzyabc";
$string = substr($string, 0, strrpos($string,"+"));
echo $string;
> abc1234+12+3455
EDIT: This gives an empty string if there is no +, but it doesn't crash or fail.
EDIT2: I slightly misread the question the first time; my edited answer is even simpler.
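If the original string should be kept when it contains no +, the strrpos result can be checked before cutting. A minimal sketch; the function name stripFromLastPlus is made up for illustration.

```php
<?php
// Hypothetical helper: drop the last '+' and everything after it.
function stripFromLastPlus(string $str): string
{
    $pos = strrpos($str, '+');
    // strrpos returns false when there is no '+'; keep the string unchanged then.
    return $pos === false ? $str : substr($str, 0, $pos);
}

var_dump(stripFromLastPlus('Giza+badrashen+test'));  // string(14) "Giza+badrashen"
var_dump(stripFromLastPlus('Giza+badrashen+test+')); // string(19) "Giza+badrashen+test"
var_dump(stripFromLastPlus('Giza'));                 // string(4) "Giza"
```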
Regex with a negative lookahead might be the most compact solution:
$myString="foo-bar+foo+foobar";
$result = preg_split("/\+(?!.*\+)/", $myString);
echo $result[0];
//result: foo-bar+foo
There is no need for an additional check, because if no + is found it simply gives back the original string.
It's worth pointing out that the + must be escaped, since it has a special meaning in all flavors of regex.

php preg_match get numbers between two strings

Hi I'm starting to learn php regex and have the following problem:
I need to extract the numbers inside $string.
The regex I use returns "NULL".
$string = 'Clasificación</a> (2194) </li>';
$regex = '/Clasificación</a>((.*?))</li>/';
preg_match($regex , $string, $match);
var_dump($match);
Thanks in advance.
There are three problems with your regex:
You aren't escaping the forward slash. You're using the forward slash as the delimiter, so if you want to use it as a literal character inside the expression, you need to escape it (or choose a different delimiter).
((.*?)) doesn't do what you think it does. It creates two capturing groups, one nested inside the other. I assume you're trying to capture what's inside the literal parentheses. For that, you'll need to escape the ( and ) characters. The expression would become: \((.*?)\)
Your expression doesn't handle whitespace. In the string you've given, there is whitespace between the </a> and the beginning of the number -- </a> (2194). To ignore the whitespace and capture just the number, you need to use \s (which matches any whitespace character). For that, you need to write \s*\((.*?)\)\s*.
The final regular expression, after fixing all of the above errors (and switching the delimiter to ~ so that / needs no escaping), will look like:
$regex = '~Clasificación</a>\s*\((.*?)\)\s*</li>~';
Full code:
$string = 'Clasificación</a> (2194) </li>';
$regex = '~Clasificación</a>\s*\((.*?)\)\s*</li>~';
preg_match($regex , $string, $match);
var_dump($match);
Output:
array(2) {
[0]=>
string(32) "Clasificación</a> (2194) </li>"
[1]=>
string(4) "2194"
}
Demo.
You forgot to escape / in your regex, since you're using / as the delimiter:
$regex = '/Clasificación<\/a>((.*?))<\/li>/';
//        ^ delimiter    ^^          ^^   ^ delimiter
//                       ^^ a / inside the pattern, which is escaped
Another way is to change the delimiter, and then you will not have to escape it:
$regex = '#Clasificación</a>((.*?))</li>#';
See the PHP documentation for more information.
You will have to escape the special characters that you want to match literally:
$regex = '/Clasificación<\/a> \((.*?)\) <\/li>/';
You may also want to make your match a little more specific where it matters (depending on your use case):
$regex = '/Clasificación<\/a>\s*\(([0-9]+)\)\s*<\/li>/';
That will allow for zero or more whitespace characters before and after the (1234), and it will only match if there are only digits inside the ().
I just tried this in php:
php > preg_match($regex , $string, $match);
php > var_dump($match);
array(2) {
[0]=>
string(30) "Clasificacin</a> (2194) </li>"
[1]=>
string(4) "2194"
}
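Putting the points from the answers together, a minimal sketch that pulls the number out as an integer. It uses ~ as the delimiter and the digits-only group; the variable name $number is an assumption for illustration.

```php
<?php
$string = 'Clasificación</a> (2194) </li>';
// ~ is the delimiter, so the / in </a> and </li> needs no escaping.
$regex = '~Clasificación</a>\s*\(([0-9]+)\)\s*</li>~';

if (preg_match($regex, $string, $match) === 1) {
    // $match[1] holds the first capturing group: the digits only.
    $number = (int) $match[1];
    var_dump($number); // int(2194)
}
```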

Extract all strings values from code

Hi everyone. I have a problem and I can't resolve it.
Pattern: \'(.*?)\'
Source string: 'abc', 'def', 'gh\'', 'ui'
I need [abc], [def], [gh\'], [ui]
But I get [abc], [def], [gh\], [, ] etc.
Is it possible? Thanks in advance
PHP Code: Using negative lookbehind
$s = "'abc', 'def', 'ghf\\\\', 'jkl\'f'";
echo "$s\n";
if (preg_match_all("~'.*?(?<!(?:(?<!\\\\)\\\\))'~", $s, $arr))
var_dump($arr[0]);
OUTPUT:
array(4) {
[0]=>
string(5) "'abc'"
[1]=>
string(5) "'def'"
[2]=>
string(7) "'ghf\\'"
[3]=>
string(8) "'jkl\'f'"
}
Live Demo: http://ideone.com/y80Gas
Yes, those matches are possible.
But if you mean to ask whether it's possible to get what's inside the quotes, the easiest here would be to split by comma (through a CSV parser preferably) and trim any trailing spaces.
Otherwise, you could try something like:
\'((?:\\\'|[^\'])+)\'
Which will match either \' or a non-quote character, but will fail against stuff like \\'...
A longer, and slower regex you might use for this case is:
\'((?:(?<!\\)(?:\\\\)*\\\'|[^\'])+)\'
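In PHP (single-quoted string, where each literal backslash of the regex has to be written as \\\\):
preg_match_all('/\'((?:(?<!\\\\)\\\\\'|[^\'])+)\'/', $text, $match);
Or if you use double quotes:
preg_match_all("/'((?:(?<!\\\\)\\\\'|[^'])+)'/", $text, $match);
Writing the lookbehind as (?<!\\) in a single-quoted PHP string causes an error: PHP reduces \\ to a single backslash, so PCRE receives (?<!\), where the backslash escapes the closing parenthesis. It works if the pattern is changed to (?<!\\\\).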
ideone demo
EDIT: Found a simpler, better, faster regex (again with each regex backslash doubled for the PHP string):
preg_match_all("/'((?:[^'\\\\]|\\\\.)+)'/", $text, $match);
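A runnable sketch of that last idea, applied to the source string from the question. It assumes the regex PCRE should see is '((?:[^'\\]|\\.)+)', which means every regex backslash is written as \\\\ inside the double-quoted PHP string.

```php
<?php
// Source text as PHP sees it: 'abc', 'def', 'gh\'', 'ui'
$text = "'abc', 'def', 'gh\\'', 'ui'";

// Inside the quotes: either a character that is neither ' nor \,
// or a backslash followed by any character (an escape pair).
$pattern = "/'((?:[^'\\\\]|\\\\.)+)'/";

preg_match_all($pattern, $text, $match);
print_r($match[1]); // abc, def, gh\', ui
```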
<?php
// string to extract data from
$string = "'abc', 'def', 'gh\'', 'ui'";
// split the string into an array, using the comma as the delimiter
$strings = explode(",", $string);
# OPTION 1: keep the escaped single quote (\')
// strip the single quotes and the spaces
$replacee = ["'", " "];
$strings = str_replace($replacee, "", $strings);
// then put the quote back after the remaining backslash
$strings = str_replace("\\", "\'", $strings);
# OPTION 2: drop the escaped single quote as well (uncomment the triple-slash lines)
// replace the single quotes, the backslash, and the spaces
/// $replacee = ["'", "\\", " "];
// do the replacement of everything in $replacee with an empty string
/// $strings = str_replace($replacee, "", $strings);
var_dump($strings);
?>
Instead, you should use str_getcsv:
str_getcsv("'abc', 'def', 'gh\'', 'ui'", ",", "'");

How do I remove many different parenthesis [] () {} and content from a string using REGEX with a single preg_replace?

Currently I'm using
$str = preg_replace("/\([^\)]+\)/", "", $str); // Remove () and Content
$str = preg_replace("/\[[^\)]+\]/", "", $str); // Remove [] and Content
$str = preg_replace("/\{[^\)]+\}/", "", $str); // Remove {} and Content
$str = preg_replace("/[^a-zA-Z0-9]/", "", $str); // Remove all non-alphanumeric characters
I was wondering if there was a way to combine these into a single regex statement
You can use | to combine those:
(\([^\)]+\)|\[[^\]]+\]|\{[^\}]+\}|[^A-Za-z0-9])
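A sketch of how the combined pattern could be used in a single preg_replace call. The sample input string is invented for illustration, and the outer capturing group is dropped because nothing refers to it.

```php
<?php
// One alternation: (...) blocks, [...] blocks, {...} blocks, then any
// remaining non-alphanumeric character.
$pattern = '/\([^)]+\)|\[[^\]]+\]|\{[^}]+\}|[^A-Za-z0-9]/';

$str = 'Song Title (Live) [HD] {2012} - Artist!';
$clean = preg_replace($pattern, '', $str);
var_dump($clean); // string(15) "SongTitleArtist"
```

The bracket alternatives come first so that a whole bracketed block is consumed before the catch-all class can remove its opening character on its own.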
But in your case, I would usually just go through the string and parse it manually, using "state variables" ($paren_open, $brace_open, $curly_open) to know whether to ignore the current character or not.
Even if it doesn't seem fast (because you have to go through every character), it can be quicker than a regex, because the regex engine may do something far more complicated.
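The manual approach could look roughly like the sketch below. Instead of three separate flags it keeps a single state variable holding the closing character being waited for; the function name stripBracketed is an assumption for illustration. It works byte by byte, so it is meant for ASCII input, and it makes no claim about speed compared to the regex.

```php
<?php
// Hypothetical helper: drops (...), [...] and {...} blocks and every
// non-alphanumeric character in one pass over the string.
function stripBracketed(string $str): string
{
    $closers = ['(' => ')', '[' => ']', '{' => '}'];
    $expected = null; // closing character we are waiting for, null when outside a block
    $out = '';
    for ($i = 0, $n = strlen($str); $i < $n; $i++) {
        $ch = $str[$i];
        if ($expected !== null) {
            // Inside a block: skip everything up to and including the closer.
            if ($ch === $expected) {
                $expected = null;
            }
            continue;
        }
        if (isset($closers[$ch])) {
            $expected = $closers[$ch]; // a block starts here
            continue;
        }
        if (ctype_alnum($ch)) {
            $out .= $ch; // keep only letters and digits
        }
    }
    return $out;
}

var_dump(stripBracketed('Song Title (Live) [HD] {2012} - Artist!')); // string(15) "SongTitleArtist"
```

One behavioral difference from the regex: an opening bracket that is never closed causes the rest of the string to be dropped here, whereas the regex would only strip the bracket character itself.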
Resources:
Wikipedia: State pattern
