preg_match is faster than strpos on large text? - php

I am currently updating very old script written for PHP 5.2.17 for PHP 8.1.2. There is a lot of text processing code blocks and almost all of them are preg_match/preg_match_all. I used to know, that strpos for string matching have always been faster than preg_match, but I decided to check one more time.
Code was:
$c = file_get_contents('readme-redist-bins.txt');
$start = microtime(true);
for ($i=0; $i < 1000000; $i++) {
strpos($c, '[SOMEMACRO]');
}
$el = microtime(true) - $start;
exit($el);
and
$c = file_get_contents('readme-redist-bins.txt');
$start = microtime(true);
for ($i=0; $i < 1000000; $i++) {
preg_match_all("/\[([a-z0-9-]{0,100})".'[SOMEMACRO]'."/", $c, $pma);
}
$el = microtime(true) - $start;
exit($el);
I took readme-redist-bins.txt file which comes with php8.1.2 distribution, about 30KB.
Results(preg_match_all):
PHP_8.1.2: 1.2461s
PHP_5.2.17: 11.0701s
Results(strpos):
PHP_8.1.2: 9.97s
PHP_5.2.17: 0.65s
Double checked... Tried Windows and Linux PHP builds, on two machines.
Tried the same code with small file(200B)
Results(preg_match_all):
PHP_8.1.2: 0.0867s
PHP_5.2.17: 0.6097s
Results(strpos):
PHP_8.1.2: 0.0358s
PHP_5.2.17: 0.2484s
And now the timings is OK.
So, how cant it be, that preg_match is so match faster on large text? Any ideas?
PS: Tried PHP_7.2.10 - same result.

PCRE2 is really fast. It's so fast that there usually is barely any difference between it and plain string processing in PHP and sometimes it's even faster. PCRE2 internally uses JIT and contains a lot of optimizations. It's really good at what it does.
On the other hand, strpos is poorly optimized. It's doing some simple byte comparison in C. It doesn't use parallelization/vectorization. For short needles and short haystacks, it uses memchr, but for longer values, it performs Sunday Algorithm.
For small datasets, the overhead from calling PCRE2 will probably outweigh its optimizations, but for larger strings, or case-insensitive/Unicode strings PCRE2 might offer better performance.

Related

PHP built in functions complexity (isAnagramOfPalindrome function)

I've been googling for the past 2 hours, and I cannot find a list of php built in functions time and space complexity. I have the isAnagramOfPalindrome problem to solve with the following maximum allowed complexity:
expected worst-case time complexity is O(N)
expected worst-case space complexity is O(1) (not counting the storage required for input arguments).
where N is the input string length. Here is my simplest solution, but I don't know if it is within the complexity limits.
class Solution {
// Function to determine if the input string can make a palindrome by rearranging it
static public function isAnagramOfPalindrome($S) {
// here I am counting how many characters have odd number of occurrences
$odds = count(array_filter(count_chars($S, 1), function($var) {
return($var & 1);
}));
// If the string length is odd, then a palindrome would have 1 character with odd number occurrences
// If the string length is even, all characters should have even number of occurrences
return (int)($odds == (strlen($S) & 1));
}
}
echo Solution :: isAnagramOfPalindrome($_POST['input']);
Anyone have an idea where to find this kind of information?
EDIT
I found out that array_filter has O(N) complexity, and count has O(1) complexity. Now I need to find info on count_chars, but a full list would be very convenient for future porblems.
EDIT 2
After some research on space and time complexity in general, I found out that this code has O(N) time complexity and O(1) space complexity because:
The count_chars will loop N times (full length of the input string, giving it a start complexity of O(N) ). This is generating an array with limited maximum number of fields (26 precisely, the number of different characters), and then it is applying a filter on this array, which means the filter will loop 26 times at most. When pushing the input length towards infinity, this loop is insignificant and it is seen as a constant. Count also applies to this generated constant array, and besides, it is insignificant because the count function complexity is O(1). Hence, the time complexity of the algorithm is O(N).
It goes the same with space complexity. When calculating space complexity, we do not count the input, only the objects generated in the process. These objects are the 26-elements array and the count variable, and both are treated as constants because their size cannot increase over this point, not matter how big the input is. So we can say that the algorithm has a space complexity of O(1).
Anyway, that list would be still valuable so we do not have to look inside the php source code. :)
A probable reason for not including this information is that is is likely to change per release, as improvements are made / optimizations for a general case.
PHP is built on C, Some of the functions are simply wrappers around the c counterparts, for example hypot a google search, a look at man hypot, in the docs for he math lib
http://www.gnu.org/software/libc/manual/html_node/Exponents-and-Logarithms.html#Exponents-and-Logarithms
The source actually provides no better info
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/math/w_hypot.c (Not official, Just easy to link to)
Not to mention, This is only glibc, Windows will have a different implementation. So there MAY even be a different big O per OS that PHP is compiled on
Another reason could be because it would confuse most developers.
Most developers I know would simply choose a function with the "best" big O
a maximum doesnt always mean its slower
http://www.sorting-algorithms.com/
Has a good visual prop of whats happening with some functions, ie bubble sort is a "slow" sort, Yet its one of the fastest for nearly sorted data.
Quick sort is what many will use, which is actually very slow for nearly sorted data.
Big O is worst case - PHP may decide between a release that they should optimize for a certain condition and that will change the big O of the function and theres no easy way to document that.
There is a partial list here (which I guess you have seen)
List of Big-O for PHP functions
Which does list some of the more common PHP functions.
For this particular example....
Its fairly easy to solve without using the built in functions.
Example code
function isPalAnagram($string) {
$string = str_replace(" ", "", $string);
$len = strlen($string);
$oddCount = $len & 1;
$string = str_split($string);
while ($len > 0 && $oddCount >= 0) {
$current = reset($string);
$replace_count = 0;
foreach($string as $key => &$char) {
if ($char === $current){
unset($string[$key]);
$len--;
$replace_count++;
continue;
}
}
$oddCount -= ($replace_count & 1);
}
return ($len - $oddCount) === 0;
}
Using the fact that there can not be more than 1 odd count, you can return early from the array.
I think mine is also O(N) time because its worst case is O(N) as far as I can tell.
Test
$a = microtime(true);
for($i=1; $i<100000; $i++) {
testMethod("the quick brown fox jumped over the lazy dog");
testMethod("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
testMethod("testest");
}
printf("Took %s seconds, %s memory", microtime(true) - $a, memory_get_peak_usage(true));
Tests run using really old hardware
My way
Took 64.125452041626 seconds, 262144 memory
Your way
Took 112.96145009995 seconds, 262144 memory
I'm fairly sure that my way is not the quickest way either.
I actually cant see much info either for languages other than PHP (Java for example).
I know a lot of this post is speculating about why its not there and theres not a lot drawing from credible sources, I hope its an partially explained why big O isnt listed in the documentation page though

PHP looping context query

A question that has always puzzled me is why people write it like the first version when the second version is smaller and easier to read. I thought it might be because php calculates the strlen each time it iterates. any ideas?
FIRST VERSION
for ($i = 0, $len = strlen($key); $i < $len; $i++) {}
You can obviously use $len inside the loop and further on in the code, but what are the benefits over the following version?
SECOND VERSION
for ($i = 0; $i < strlen($key); $i++) {}
It's a matter of performance.
Your first version of the for loop will recaculate the strlen every time and thus, the performances could be slowed down.
Even though it wouldn't be significant enough, you could be surprised how much the slow can be exponantial sometimes.
You can see here for some performances benchmarks with loops.
The first version is best used if the loop is expected to have many iterations and $key won't change in the process.
The second one is best used if the loop is updating $key and you need to recalculate it, or, when recalculating it doesn't affect your performance.

In PHP which operator should i use '<' or '<='?

I am using the following code :
<?php
$start_time = microtime(true);
for($i=1;$i <= 999999; $i++){
}
$end_time = microtime(true);
echo "Time Interval : ".$end_time-$start_time;
$start_time = microtime(true);
for($j=1;$j < 1000000; $j++){
}
$end_time = microtime(true);
echo "Time Interval : ".$end_time-$start_time;
?>
It shows met the time difference between them of 0.0943 seconds.So in this way <= operator is faster than '<'. I just wanna know if there is any disadvantage of using <= operator over '<'?
There is certainly no performance penalty. Both operands are single opcode in PHP bytecode, and ultimately both should be executed using exactly one CPU instruction.
http://www.php.net/manual/en/internals2.opcodes.is-smaller.php
http://www.php.net/manual/en/internals2.opcodes.is-smaller-or-equal.php
and of course I agree with comments - you should never optimize things at this low level of your code.
http://en.wikipedia.org/wiki/Program_optimization
"Premature optimization" is a phrase used to describe a situation
where a programmer lets performance considerations affect the design
of a piece of code. This can result in a design that is not as clean
as it could have been or code that is incorrect, because the code is
complicated by the optimization and the programmer is distracted by
optimizing.
Usually I'd stick with having meaningful code vs. faster code that's less meaningful. That guard will probably be a value coming from a variable defined elsewhere, so you should keep it's meaning. You shouldn't change all your < to <= to save a fraction of time (which may be a random error).

PHP function efficiencies

Is there a table of how much "work" it takes to execute a given function in PHP? I'm not a compsci major, so I don't have maybe the formal background to know that "oh yeah, strings take longer to work with than integers" or anything like that. Are all steps/lines in a program created equal? I just don't even know where to start researching this.
I'm currently doing some Project Euler questions where I'm very sure my answer will work, but I'm timing out my local Apache server at a minute with my requests (and PE has said that all problems can be solved < 1 minute). I don't know how/where to start optimizing, so knowing more about PHP and how it uses memory would be useful. For what it's worth, here's my code for question 206:
<?php
$start = time();
for ($i=1010374999; $i < 1421374999; $i++) {
$a = number_format(pow($i,2),0,".","");
$c = preg_split('//', $a, -1, PREG_SPLIT_NO_EMPTY);
if ($c[0]==1) {
if ($c[2]==2) {
if ($c[4]==3) {
if ($c[6]==4) {
if ($c[8]==5) {
if ($c[10]==6) {
if ($c[12]==7) {
if ($c[14]==8) {
if ($c[16]==9) {
if ($c[18]==0) {
echo $i;
}
}
}
}
}
}
}
}
}
}
}
$end = time();
$elapsed = ($end-$start);
echo "<br />The time to calculate was $elapsed seconds";
?>
If this is a wiki question about optimization, just let me know and I'll move it. Again, not looking for an answer, just help on where to learn about being efficient in my coding (although cursory hints wouldn't be flat out rejected, and I realize there are probably more elegant mathematical ways to set up the problem)
There's no such table that's going to tell you how long each PHP function takes to execute, since the time of execution will vary wildly depending on the input.
Take a look at what your code is doing. You've created a loop that's going to run 411,000,000 times. Given the code needs to complete in less than 60 seconds (a minute), in order to solve the problem you're assuming each trip through the loop will take less than (approximately) .000000145 seconds. That's unreasonable, and no amount of using the "right" function will solve your call. Try your loop with nothing in there
for ($i=1010374999; $i < 1421374999; $i++) {
}
Unless you have access to science fiction computers, this probably isn't going to complete execution in less than 60 seconds. So you know this approach will never work.
This is known a brute force solution to a problem. The point of Project Euler is to get you thinking creatively, both from a math and programming point of view, about problems. You want to reduce the number of trips you need to take through that loop. The obvious solution will never be the answer here.
I don't want to tell you the solution, because the point of these things is to think your way through it and become a better algorithm programmer. Examine the problem, think about it's restrictions, and think about ways you reduce the total number of numbers you'd need to check.
A good tool for taking a look at execution times for your code is xDebug: http://xdebug.org/docs/profiler
It's an installable PHP extension which can be configured to output a complete breakdown of function calls and execution times for your script. Using this, you'll be able to see what in your code is taking longest to execute and try some different approaches.
EDIT: now that I'm actually looking at your code, you're running 400 million+ regex calls! I don't know anything about project Euler, but I have a hard time believing this code can be excuted in under a minute on commodity hardware.
preg_split is likely to be slow because it's using a regex. Is there not a better way to do that line?
Hint: You can access chars in a string like this:
$str = 'This is a test.';
echo $str[0];
Try switching preg_split() to explode() or str_split() which are faster
First, here's a slightly cleaner version of your function, with debug output
<?php
$start = time();
$min = (int)floor(sqrt(1020304050607080900));
$max = (int)ceil(sqrt(1929394959697989990));
for ($i=$min; $i < $max; $i++) {
$c = $i * $i;
echo $i, ' => ', $c, "\n";
if ($c[0]==1
&& $c[2]==2
&& $c[4]==3
&& $c[6]==4
&& $c[8]==5
&& $c[10]==6
&& $c[12]==7
&& $c[14]==8
&& $c[16]==9
&& $c[18]==0)
{
echo $i;
break;
}
}
$end = time();
$elapsed = ($end-$start);
echo "<br />The time to calculate was $elapsed seconds";
And here's the first 10 lines of output:
1010101010 => 1020304050403020100
1010101011 => 1020304052423222121
1010101012 => 1020304054443424144
1010101013 => 1020304056463626169
1010101014 => 1020304058483828196
1010101015 => 1020304060504030225
1010101016 => 1020304062524232256
1010101017 => 1020304064544434289
1010101018 => 1020304066564636324
1010101019 => 1020304068584838361
That, right there, seems like it oughta inspire a possible optimization of your algorithm. Note that we're not even close, as of the 6th entry (1020304060504030225) -- we've got a 6 in a position where we need a 5!
In fact, many of the next entries will be worthless, until we're back at a point where we have a 5 in that position. Why bother caluclating the intervening values? If we can figure out how, we should jump ahead to 1010101060, where that digit becomes a 5 again... If we can keep skipping dozens of iterations at a time like this, we'll save well over 90% of our run time!
Note that this may not be a practical approach at all (in fact, I'm fairly confident it's not), but this is the way you should be thinking. What mathematical tricks can you use to reduce the number of iterations you execute?

Quickest way to Count the amount of numbers in a string, plus how to test PHP performance?

I need the best way of finding how many numbers in a string, would I have to first remove everything but numbers and then strlen?
Also how would I go about testing the performance of a any PHP script I have written, say for speed and performance under certain conditions?
UPDATE
say I had to inlcude ½ half numbers, its definetly preg then is it not?
You can just split the string with preg_split() and count the parts:
$digits = count(preg_split("/\d/", $str))-1;
$numbers = count(preg_split("/\d+/", $str))-1;
For performance you could use microtime().
eg:
For execution time:
$start = microtime(true);
// do stuff
$end = microtime(true)-$start;
For memory usage:
$start = memory_get_usage(true);
// do stuff
$end = memory_get_usage(true) - $start;
http://us.php.net/manual/en/function.memory-get-usage.php
For peak memory:
memory_get_peak_usage(true);
http://us.php.net/manual/en/function.memory-get-peak-usage.php
There are also specific PHP profiling tools like XDebug.
There is a good tutorial on using it with Eclipse:
http://devzone.zend.com/article/2930
There is also benchmark in PEAR:
http://pear.php.net/package/Benchmark/docs/latest/Benchmark/Benchmark_Profiler.html
And a list of others:
http://onwebdevelopment.blogspot.com/2008/06/php-code-performance-profiling-on.html

Categories