PCRE_UTF8 Modifier Extremely Slow

PCRE_UTF8 Modifier Extremely Slow - php

For some reason, simply adding the PCRE_UTF8 modifier to the regex input for preg_match() roughly decuples (x10) the execution time even if no multibyte characters are used. I can't figure out why this is the case and how to best reduce the time. The script used to test is:
$s = microtime(true);
for ($i = 0; $i < 1000; $i++) {
preg_match('/ /u', str_repeat(' ', 50000), $match);
}
$e = microtime(true);
echo "u Modifier:\t".(($e-$s)/$i)."\n";
$s = microtime(true);
for ($i = 0; $i < 1000; $i++) {
preg_match('/ /', str_repeat(' ', 50000), $match);
}
$e = microtime(true);
echo "No Modifier:\t".(($e-$s)/$i)."\n";
Try it online here.
The results were:
u Modifier: 2.5037050247192E-5
No Modifier: 2.4969577789307E-6
I tried to see if this was a known bug online, but supposedly, it is not a problem with PHP.
What is this caused by and what would be the best method to execute the match1 quicker?
1"the match" refers to any match. The example used is simply a minimal example and could obviously be matched in much better ways.

PCRE checks for UTF validity before any other processing takes place.
From the PCRE docs:
When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, an negative error code is returned. The code unit offset to the offending character can be extracted from the match data block by calling pcre2_get_startchar(), which is used for this purpose after a UTF error.
...
The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.
...
In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance, for example in the case of a long subject string that is being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time, PCRE2 assumes that the pattern or subject it is given (respectively) contains only valid UTF code unit sequences.
(Note: These docs are quoted from PCRE2, but the PCRE behavior is the same)
Unfortunately, I don't think there's a way to set the PCRE2_NO_UTF_CHECK option from PHP.
Anyway, your benchmark should go over much more iterations to be meaningful. You should measure the time over several seconds worth of computing time to get a better sense of the impact this feature has.

Related

preg_match ( PHP ) : report the FIRST conflicting character & HALT the operation

Supposedly my string is:
$a = 'abc-def";
i only want a-z in my string.
but i want to let the user know what "bad" character makes their string "bad".
i can do this:
$conflict = preg_replace('/[a-zA-Z0-9 .]/', '', $a);
echo "the conflict was: $conflict";
but if my string is huge.. let's say a document filled with lots of text.. then this code above would go through the entire document and show all the conflicting characters.
i would like it to STOP after it finds only 1 conflicting character.
that way less CPU / memory / resources are used.
in other words.
$a = 'abc-,def";
in this example it should stop and indicate the conflict is "-"
rather than report the conflict is : "-,"
because soon as it sees "-" it should stop looking anymore.
the goal is to not use too many resources.
MORE INFO:
the idea is to read a string from left to right (i suppose) and soon as a character is detected that does not obey the law of a-z and 0-9.. the process is to be terminated and the string that has been found is to be reported

You can do the following:
if (preg_match("/[^a-z0-9]/i", $a, $match)) :
echo "Conflict was: $match[0]";
Logic: Check for characters other than a-z0-9 and if there is a match print the response.

preg_match appears to hit a limit when using two matches

I have run up against an odd problem. it appears i am reaching some sort of limit with preg_replace while trying to use two matches using php-5.3.3
// works fine
$pattern_1 = '?START(.*)STOP?';
$string = 'START' . str_repeat('x',9999999) . 'STOP' ;
preg_match($pattern_1, $string , $matchedArray ) ;
$pattern_2 = '?START-ONE(.*)STOP-ONE.*START-TWO(.*)STOP-TWO.*?';
// works fine
$string = 'START-ONE this is head stuff STOP-ONE START-TWO' . str_repeat('x', 49970) . 'STOP-TWO' ;
preg_match($pattern_2, $string , $matchedArray_2 ) ;
// didnt work
$string = 'START-ONE this is head stuff STOP-ONE START-TWO' . str_repeat('x', 49971) . 'STOP-TWO' ;
preg_match($pattern_2, $string , $matchedArray_3 ) ;
The first option with only one match uses a very large string and has no problems.
The second option has a string length of 50,026 and works fine. the last option has a string length of 50,027 (one more) and the match no longer works. since the 49971 number can vary when the error occurs, it could be changed to something larger to simulate the problem.
Any ideas or thoughts? perhaps is this a php version issue? maybe a possible workaround is merely to only use one match rather than two and then run preg_match it twice ?

Ok, PHP's not very talkative about regex errors, it just returns false for the last case, which simply tells than an error occured, per the PHP docs.
I've reproduced the problem using PCRE (the regex engine used by preg_match) in C# (but with a much higher character count), and the error I'm getting is PCRE_ERROR_MATCHLIMIT.
This means you're hitting the backtracking limit set in PCRE. It's just a safety measure to prevent the engine from looping indefinitely, and I think your PHP configuration sets it to a low value.
To fix the issue, you can set a higher value for the pcre.backtrack_limit PHP option which controls this limit:
ini_set("pcre.backtrack_limit", "10000000"); // Actually, this is PCRE's default
On a side note:
You probably should replace (.*) with (.*?) to get less useless backtracking and for correctness (otherwise the regex engine will get past the STOP string and will have to backtrack to reach it)
Using ? as a pattern delimiter is a bad idea since it prevents you from using the ? metacharacter and therefore applying the above advice. Really, you should never use regex metacharacters as pattern delimiters.
If you're interested in more low-level details, here's the relevant bit of the PCRE docs (emphasis mine):
The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is a pattern that uses nested unlimited repeats.
Internally, pcre_exec() uses a function called match(), which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string.
When pcre_exec() is called with a pattern that was successfully studied with a JIT option, the way that the matching is executed is entirely different. However, there is still the possibility of runaway matching that goes on for a very long time, and so the match_limit value is also used in this case (but in a different way) to limit how long the matching can continue.
The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all but the most extreme cases. You can override the default by suppling pcre_exec() with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
A value for the match limit may also be supplied by an item at the start of a pattern of the form
(*LIMIT_MATCH=d)
where d is a decimal number. However, such a setting is ignored unless d is less than the limit set by the caller of pcre_exec() or, if no such limit is set, less than the default.

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG":
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terrible well if later the code is modified to match say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.

Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";

This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC|A.F|A..G|.CF|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C|.E|..G))|(?:.C(?:E|.G))|(?:..EG)$/ - but that's still the same solution, just in a more efficiently-processed form.)

You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
function findMatchingLines($line) {
static $file = null;
if( !$file) $file = file("BigFileOfLines.txt");
$search = str_split($line);
foreach($file as $l) {
$test = str_split($l);
$matches = count(array_intersect($search,$test));
if( $matches > 2) // define number of matches required here - optionally make it an argument
return true;
}
// no matches
return false;
}

There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
function findMatchingLines($line, $subject) {
$regex = "/(?<=^([AB])([CD])([EF])([GH])[.\n]+)"
+ "(\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)/m";
$matchingLines = array();
preg_match_all($regex, $line + "\n" + $subject, $matchingLines);
return $matchingLines;
}
What this function does is pre-pend your input string with the line you want to match against, then uses a pattern that compares each line after the first line (that's the + after [.\n] working) back to the first line's 4 characters.
If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).

People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the highest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D in the first position, OR E or | or F in the first position, OR G or '|orH` in the first position.
This is because [A|B] means a character class with one of the three given characters (including the |. And because {1} means one character (it is also completely superfluous and could be dropped), and because the outer | alternate between everything around it. In my English expression above each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very Strictly speaking, and picking up from #Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(A(C|.E|..G))|(.C(E)|(.G))|(..EG)$/
as compared to:
/^(AC|A.E|A..G|.CE|.C.G|..EG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (or which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test, `EG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately, means to fail, you would have test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being the first position, you would immediately go to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: On additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.

Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);

I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
/^[^ACFG]*+(?:[ACFG][^ACFG]*+){2}$/m
It looks for two occurrences of one of the ACFG characters surrounded by any other characters. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
return "/^[^$line]*+(?:[$line][^$line]*+){$num}$/m";
}

I need a PHP regular expression to validate string format of 5 digits, one comma

I have a huge PHP input box on a webpage. This input should only take 5 digit string separated by commas:
00100,00247,90277,97030,00657
notice the last one has no comma at the end.
Is there a regular expression that can do this? Since the input box is very large and can take 100+ of these items, I want to validate it on the PHP server side before the database is queried and those avoid any SQL Injection tries.
Query is only run if only 5 numbers and a comma in the sequence, except for the last one.
These are a state's public water system ID's by the way.

I believe this will get the result you're looking for, though explode may be the better option.
/^(?:\d{5},)*\d{5}$/
This will only match 1 or more 5-digit numbers that are comma delimited with no spaces.

Since this is user submitted data, your validation should be more flexible. What if the user accidentally puts a space after one of the commas? Or a line break gets inserted?
I realize you are looking for a regex solution but may I suggest using explode to create an array and apply a rule to each element. Having them separated into elements allows more flexibility when validating and storing:
$nums = explode(',', '00100,00247,90277,97030,00657');
foreach ($nums as $num) {
if (!preg_match('/^\d{5}$)/', trim($num))) {
// error!
}
}

I'd explode it and validate each string individually:
$input = '00100,00247,90277,97030,00657';
$input_array = explode(',', $input);
$is_valid = true;
foreach ($input_array as $number) {
if (preg_match("/\\d/", trim($number)) != strlen(trim($number))) {
$is_valid = false;
}
}
print($is_valid);

I think you rather need str_getcsv:
while ($row = str_getcsv($fp)) {
// $row is an array containing your digits
}

Simple. This regex matches a value having one or more comma separated 5-digit numbers:
if (preg_match('/^\d{5}(\s*,\s*\d{5})*$/', $value)) {
// Good value
}
It allows whitespace between the numbers as well.

This might work:
/^\d{5}(?:,\d{5})*$/
edit 1 noticed ridgerunner has the same answer, so disregard this.
edit 2 some notes on performance.
Failure analysis
Backtracking give back on failure:
^\d{5}(?:,\d{5})*$ gives back ,\d{5}
^(?:\d{5},)*\d{5}$ gives back \d{5},
Post Backtracking regressive topography checks:
(After backtracking give back, checks are to the right of the one that gave back)
^\d{5}(?:,\d{5})*$ checks for $
^(?:\d{5},)*\d{5}$ checks for \d{5}$
Winner: ^\d{5}(?:,\d{5})*$
NON-Backtracking regex's (using possesive quantifier +):
^\d{5}(?:,\d{5})*+$ gives nothing back, fails immediately
^(?:\d{5},)*+\d{5}$ gives nothing back fails immediately
Benchmarks
Using a string of 50 blocks of \d{5},.
The sample string is matched against each regex in a loop of 100,000 times.
Failure was induced at the end of the string, removed for a sucess test.
Sucess:
All took 1 second to complete a sucessfull run.
Failure, Backtracking:
^\d{5}(?:,\d{5})\*$ took 1.2 seconds best
^(?:\d{5},)\*\d{5}$ took 1.6 seconds
Failure, Non-Backtracking:
^\d{5}(?:,\d{5})*+$ took .9 seconds
^(?:\d{5},)*+\d{5}$ took .9 seconds
Conclusions
Backtracking - Put the smallest post-backtracking check
after the backtracking sub-expression. In this case, the
smallest is $.
In general, put the required expressions ahead of the optional ones.
Best ^\d{5}(?:,\d{5})*$
NON-Backtracking - It doesn't matter.
^\d{5}(?:,\d{5})*+$ or ^(?:\d{5},)*+\d{5}$

php regular expression to filter out junk

So I have an interesting problem: I have a string, and for the most part i know what to expect:
http://www.someurl.com/st=????????
Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ
Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.
The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward as many as I need to fill up the 8 character quota. Is there a regex way to do this or will i need to roll up my sleeves and go nested-loop style?
update:
To clear up some confusion, I get an input string that's like this:
[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????
except the garbage is in unpredictable locations in the string (except the end is never garbage), and has unpredictable length (at least, I have been able to find patterns in neither). Usually the ?s are all together hence me just grabbing the last 8 chars, but sometimes they aren't which results in some missing data and returned garbage :-\

$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case
$clean = join(
array_filter(
str_split($var, 1),
function ($char) {
return (
array_key_exists(
$char,
array_flip(array_merge(
range('A','Z'),
range('a','z'),
range((string)'0',(string)'9'),
array(':','.','/','-','_')
))
)
);
}
)
);
Hah, that was a joke. Here's a regex for you:
$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);

As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters" characters, and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":
__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__

What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().

You can use this regular expression :
if (preg_match('/[\'^£$%&*()}{##~?><>,|=_+¬-]/', $string) ==1)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.