I am processing fairly large files in PHP (300 MB - 1024 MB) in pursuit of finding a line that matches my search criteria and returning the whole line. Since I cannot afford to read the entire file into memory, I am reading line by line:
function getLineWithString($fileName, $str) {
    $matches = array();
    $handle = @fopen($fileName, "r");
    if ($handle) {
        while (!feof($handle)) {
            $buffer = fgets($handle, 4096);
            if (strpos($buffer, $str) !== FALSE) {
                return '<pre>'.$matches[] = $buffer.'</pre>';
            }
        }
        fclose($handle);
    }
}
Since my $str (needle) is an array, I am using foreach() to process all its elements and invoke the function every time:
foreach ($my_array as $a_match) {
    echo getLineWithString($myFile, trim($a_match));
}
However, this approach (using foreach()) hits max_execution_time, memory_limit, Apache's FcgidIOTimeout, and other limits. My array of needles contains 88 elements, and it may grow depending on the end user's actions, so this is definitely not an adequate way.
My question is: how can I avoid using foreach() (or any other loop) and invoke the function only once?
Note about memory leak
It's important to note that this is a misuse of the term memory leak since in PHP you have no control over memory management. A memory leak is generally defined as a process having allocated memory on the system that is no longer reachable by that process. It's not possible for you to do this in your PHP code since you have no direct control over the PHP memory manager.
Your code runs inside of the PHP virtual machine which manages memory for you. Exceeding the memory_limit you set in PHP is not the same thing as PHP leaking memory. This is a defined limit, controlled by you. You can raise or lower this limit at your discretion. You may even ask PHP to not limit the amount of memory at all by setting memory_limit = -1, for example. Of course, this is still subject to your machine's memory capacity.
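For example, the limit can be inspected and adjusted at runtime (a minimal sketch; the values are illustrative, not recommendations):

```php
<?php
// A minimal sketch: inspect and adjust the limit at runtime.
// The values are illustrative, not recommendations.
echo ini_get('memory_limit'), PHP_EOL;    // the current cap, e.g. "128M"
ini_set('memory_limit', '512M');          // raise it for this request
ini_set('memory_limit', '-1');            // or remove the cap entirely (use with care)
echo memory_get_usage(true), PHP_EOL;     // bytes PHP has actually allocated
```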
Your actual problem
However, the approach you are using is not much better than reading the entire file into memory because you will have to read the file line by line with every search (call to your function). That's worse time complexity even though it may be more efficient in terms of memory.
To be efficient in both time and memory complexity you need to perform the search on each needle at once while reading from the file. Rather than send a single needle to your function, consider sending the entire array of needles at once. This way you defer the loop you're using to call your function to the function itself.
Additionally, you should note that your current function returns immediately upon finding the first match, since you're using return inside your loop. You should instead use return $matches at the end of your function, outside of the loop.
Here's a better approach.
function getLineWithString($fileName, Array $needles) {
    $matches = [];
    $handle = fopen($fileName, "r");
    if ($handle) {
        while (!feof($handle)) {
            $buffer = fgets($handle);
            foreach ($needles as $str) {
                if (strpos($buffer, $str) !== FALSE) {
                    $matches[] = $buffer;
                }
            }
        }
        fclose($handle);
        return $matches;
    }
}
Now, let's say you're searching for the strings "foo", "bar", and "baz" in your file. You can make a single call to your function with an array of those strings to search them all at once rather than call your function in a loop. The function will loop over the search strings each time it reads a line from the file and search that $buffer for a match, then return the entire $matches array when it's done.
var_dump(getLineWithString("somefile.txt", ["foo", "bar", "baz"]));
N.B.
I'd strongly advise against using the error-silencing operator @, since it will make debugging your code much more difficult when there is a problem, because it turns off all error reporting for its operand. Even if there were an error, PHP won't tell you about it, which isn't useful at all.
Related
I often see code like this:
function load_items(&$items_arr) {
    // ... some code
}
load_items($items_arr);
$v = &$items_arr[$id];
compared to code as follow:
function load_items() {
    // ... some code
    return $items_arr;
}
$items_arr = load_items();
$v = $items_arr[$id];
Will the second snippet copy $items_arr and $items_arr[$id]?
Will the first snippet improve performance?
No, it will not copy the value right away. Copy-on-write is one of the memory management techniques used in PHP. It ensures that memory isn't wasted when you copy values between variables.
What that means is that when you assign:
$v = $items_arr[$id];
PHP will simply update the symbol table to indicate that $v points to the same memory address as $items_arr[$id]. Only if you change $items_arr or $v afterwards does PHP allocate more memory and perform the actual copy.
By delaying this extra memory allocation and the copying PHP saves time and memory in some cases.
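A quick way to observe this is to watch memory_get_usage() around the assignment and the first write (a rough demonstration; exact numbers vary by PHP version):

```php
<?php
// A rough demonstration of copy-on-write (numbers vary by PHP version):
// assignment is nearly free, while the first write to the copy pays for
// the actual duplication.
$items_arr = range(1, 100000);

$before = memory_get_usage();
$v = $items_arr;                  // no copy yet: both names share one value
$afterAssign = memory_get_usage();

$v[0] = 'changed';                // the write triggers the actual copy
$afterWrite = memory_get_usage();

echo "assign cost: ", $afterAssign - $before, " bytes\n";  // tiny
echo "write cost:  ", $afterWrite - $before, " bytes\n";   // roughly the array's size
```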
There's a nice article about memory management in PHP: http://hengrui-li.blogspot.no/2011/08/php-copy-on-write-how-php-manages.html
I have some PHP code on my site that reads a line from a file and echoes that line. In one of these lines of code I have placed a variable interpolation in the form of {$varname}, but the end result actually echoes {$varname} instead of replacing it with the variable's value.
The line reads:
<p>Today's date (at the server) is {$date}.</p>
The code used to echo the line reads:
$line = fgets($posts);
echo $line;
And the output of that code is: 'Today's date (at the server) is {$date}.'
And the variable $date is declared earlier in the code. I am wondering if there is some special way of doing this for lines in a file, or if I'm not doing this right?
EDIT: The output is also available at http://codegamecentral.grn.cc/main/?pageNumber=2.
ANOTHER EDIT: This code is run through a while loop until it reaches the end of the file, read line by line. This should preferably be a fast solution that will not cause problems when used with a string that doesn't contain {$date}.
How about a str_replace()?
echo str_replace('{$date}', $date, $line);
This is how you can do what you want:
eval("\$line = \"$line\";");
echo $line;
Warning:
Although this will do the job, I would strongly discourage you from doing it unless you have 100% certainty that only you or trusted people can generate the files that will be evaluated this way because eval() can run any PHP code inside the variable.
You are reading a line as text, and PHP substitution won't work unless it's done against code.
You need some more complicated processing (or to tell PHP to consider that text as if it was code, by using eval; which is strongly discouraged for security reasons, and may not work everywhere since the eval function is sometimes disabled by webmasters, for those same security reasons).
The most powerful alternative would be to use preg_replace_callback to recognize text sequences such as {$varname} and replace them with $varname. Of course $varname would need to be defined, or checked for existence:
function expandVariables($text, $allowedVariables) {
    return preg_replace_callback('#{\$([a-z][a-z_0-9]*)}#i',
        function($replace) use ($allowedVariables) {
            if (array_key_exists($replace[1], $allowedVariables)) {
                return $allowedVariables[$replace[1]];
            }
            return "NO '{$replace[1]}' VARIABLE HERE.";
        },
        $text
    );
}
$date = date('Y-m-d H:i:s');
$line = '<p>Now (at the server) is {$date}.</p>';
$vars = get_defined_vars(); // LOTS of memory :-(
// better:
// $vars = array ( 'date' => $date, ... ); // Only allowed variables.
$eval = expandVariables($line, $vars);
print "The line is {$line}\nand becomes:\n{$eval}";
Outputs:
The line is <p>Now (at the server) is {$date}.</p>
and becomes:
<p>Now (at the server) is 2014-10-12 18:36:16.</p>
Caveats
This implementation is more secure than straight eval(), which would execute any PHP code at all that it found in the line read. But it can still be used to output the content of any defined variable, provided the attacker knows its name and is allowed to request it; an admittedly unrealistic example would be <p>Hello, {$adminPassword}!</p>.
To be more secure still, albeit at the expense of flexibility, I'd endorse Viktor Svensson's solution which only allows very specific variables to be set, and is both simpler and faster:
// Remember to use 'single quotes' for '{$variables}', because you DO NOT
// want expanded them in here, but in the replaced text!
$text = str_replace(array('{$date}', '{$time}' /*, ...more... */),
array(date('Y-m-d'), date('H:i:s') /*, ...more... */),
$text);
Also, you might be interested in checking out some templating solutions such as Smarty.
Processing a whole file
To process a whole file, if memory is not an issue, you can in both cases (preg and str_) load the whole file as an array of lines:
$file = file($fileName);
and use $file as subject of the replace. Then you can iterate on the results:
// replaceVariables receives a string and returns a string
// or receives an array of strings and returns the same.
$text = replaceVariables($file, $variables);
foreach ($text as $line) {
// Do something with $line, where variables have already been replaced.
}
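If memory is a concern, the same per-line replacement can instead be done while streaming the file. In this sketch, replaceVariables() stands in for whichever replacement function (preg or str_) you chose above; a trivial str_replace-based stand-in is used here so the sketch runs, and the file name is a hypothetical example:

```php
<?php
// A sketch of the streaming alternative: replace variables one line at a
// time instead of loading the whole file with file(). replaceVariables()
// stands in for whichever replacement function (preg or str_) you chose;
// a trivial str_replace-based stand-in is used here so the sketch runs.
function replaceVariables($line, $variables) {
    return str_replace(array_keys($variables), array_values($variables), $line);
}

$variables = array('{$date}' => date('Y-m-d'));
$fileName  = 'posts.txt';                 // hypothetical input file
if (is_readable($fileName) && ($handle = fopen($fileName, 'r'))) {
    while (($line = fgets($handle)) !== false) {
        $line = replaceVariables($line, $variables);
        // Do something with $line, where variables have already been replaced.
    }
    fclose($handle);
}
```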
Speed
Something one doesn't often appreciate is that regular expressions are fast. I don't know what text-matching algorithm str_replace employs (I think Boyer-Moore's), but preg has the advantage of being Perlishly array-aware. It has a heavier setup (linear in the dictionary size), but then it "scales" better in the replacement time, while str_replace has a constant setup and scales linearly in the replacement time. This means preg_replace will save huge amounts of time in massive replacements; it fares worse in simpler contexts.
Very roughly, run time is (S + RLV)*V (S = setup cost, R = per-replacement cost, V = number of variables, L = number of lines), where preg_replace has a detectable S and a negligible R, while str_ has the reverse. With large values of V you really want the smallest R you can get, even at the expense of an increase in setup time S.
Dependency on number of variables (18 lines, keylen 5, vallen 20):
v preg advantage
5 -82%
15 -55%
25 -2%
35 14%
45 65%
55 41%
65 51%
75 197%
85 134%
95 338%
Dependency on file length (32 variables, keylen 5, vallen 20):
l preg advantage
5 -31%
15 -33%
25 14%
35 80%
45 116%
Of course, maintainability is also an issue - str_replace does it in one line of code, and the function itself is maintained by the PHP team. The function built around preg_replace requires as much as 15 lines. Granted that, once tested, you shouldn't need to modify it any longer, just pass it a dictionary.
Looping
Finally, you may want to use variables that refer to other variables. In this case neither preg nor str_ will work reliably, and you will have to implement a loop of your own:
<?php
$file = "This is a {\$test}. And {\$another}. And {\$yet_another}.\n";
$vars = array(
    "test" => "test",
    "another" => "another {\$test}",
    "yet_another" => "yet {\$another} {\$test}",
);

$text = preg_replace_callback('#{\$([a-z][a-z_0-9]*)}#i',
    function($replace) use ($vars) {
        if (array_key_exists($replace[1], $vars)) {
            return $vars[$replace[1]];
        }
        return "NO '{$replace[1]}' VARIABLE HERE.";
    },
    $file
);

$keys = array_map(function($k){ return "{\${$k}}"; }, array_keys($vars));
$vals = array_values($vars);
$text2 = str_replace($keys, $vals, $file);

$text3 = $file;
do {
    $prev = $text3;
    $text3 = str_replace($keys, $vals, $text3);
} while ($text3 != $prev);

print "PREG: {$text}\nSTR_: {$text2}\nLOOP: {$text3}\n";
The output is:
PREG: This is a test. And another {$test}. And yet {$another} {$test}.
STR_: This is a test. And another {$test}. And yet {$another} {$test}.
LOOP: This is a test. And another test. And yet another test test.
I'm trying to extract data from many HTML files. To do it fast I don't use a DOM parser, just simple strpos(). Everything goes well when I generate from roughly 200,000 files. But if I do it with more files (300,000) it outputs nothing, with this strange effect:
Look at the bottom diagram (the upper one is the CPU). In the first phase (marked RED) the output file size is growing and everything seems OK. After that (marked ORANGE) the file size becomes zero and the memory usage grows. (Everything appears twice because I restarted the computation halfway through.)
I forgot to say that I use WAMP.
I have tried unsetting variables, putting the loop into a function, using implode instead of string concatenation, using fopen instead of file_get_contents, and garbage collection too...
What is the second phase? Am I out of memory? Is there some limit I don't know about (max_execution_time and memory_limit are already ignored)? Why does this small program use so much memory?
Here is the code.
$datafile = fopen("meccsek2b.jsb", 'w');
for ($i = 0; $i < 100000; $i++) {
    $a = explode('|', $data[$i]);
    $file = "data2/$mid.html";
    if (file_exists($file)) {
        $c = file_get_contents($file);
        $o = 0;
        $a_id = array();
        $a_h = array();
        $a_d = array();
        $a_v = array();
        while ($o = strpos($c, '<a href="/test/', $o)) {
            $o = $o + 15;
            $a_id[] = substr($c, $o, strpos($c, '/', $o) - $o);
            $o = strpos($c, 'val_h="', $o) + 7;
            $a_h[] = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_d="', $o) + 7;
            $a_d[] = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_v="', $o) + 7;
            $a_v[] = substr($c, $o, strpos($c, '"', $o) - $o);
        }
        fwrite($datafile,
            $mid.'|'.
            implode(';', $a_id).'|'.
            implode(';', $a_h).'|'.
            implode(';', $a_d).'|'.
            implode(';', $a_v).
            PHP_EOL);
    }
}
fclose($datafile);
I think I found the problem:
There was an infinite loop because strpos() returned 0.
The allocated memory size was growing until an exception:
PHP Fatal error: Out of memory
Ensino's note about using the command line was very useful; that finally led me to this answer.
You should consider running your script from the command line; this way you might catch the error without digging through the error logs.
Furthermore, as stated in the PHP manual, the strpos function may return boolean FALSE, but may also return a non-boolean value which evaluates to FALSE, so the correct way to test the return value of this function is by using the !== operator:
while (($o = strpos($c, '<a href="/test/', $o)) !== FALSE) {
    ...
}
The CPU spike most likely means that PHP is doing garbage collection. If you want to gain some performance at the cost of higher memory usage, you can disable garbage collection with gc_disable().
Looking at the code, I'd guess you've reached the point where file_get_contents is reading some big file, and PHP realizes it has to free some memory by running garbage collection in order to store its content.
The best approach is to read the file continuously and process it in chunks rather than holding the whole thing in memory.
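As a sketch of that idea (the function name, chunk size, and needle are illustrative assumptions, not the asker's code): a small tail from the previous chunk is carried over so a match that straddles a chunk boundary is not lost.

```php
<?php
// A sketch of chunked reading: keep a small tail from the previous chunk
// so a match that straddles a chunk boundary is not lost. The function
// name, chunk size, and needle are illustrative assumptions.
function countOccurrences($path, $needle, $chunkSize = 1048576) {
    $overlap = strlen($needle) - 1;       // longest prefix that can straddle
    $handle  = fopen($path, 'r');
    $tail    = '';
    $count   = 0;
    while (!feof($handle)) {
        $chunk = $tail . fread($handle, $chunkSize);
        $pos = 0;
        while (($pos = strpos($chunk, $needle, $pos)) !== false) {
            $count++;
            $pos += strlen($needle);
        }
        // Carry the boundary bytes over; the tail is shorter than the
        // needle, so an already-counted match cannot be counted again.
        $tail = $overlap > 0 ? substr($chunk, -$overlap) : '';
    }
    fclose($handle);
    return $count;
}
```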
A huge amount of data is going into the system's internal cache. When the cached data is written to disk, it can have an impact on memory and performance.
There is a system function, FlushFileBuffers, to enforce writes:
Please look at http://msdn.microsoft.com/en-us/library/windows/desktop/aa364451%28v=vs.85%29.aspx and http://winbinder.org/ for calling the function.
(This doesn't explain the empty file, though, unless there is a Windows bug.)
foreach (explode(',', $foo) as $bar) { ... }
vs
$test = explode(',', $foo);
foreach($test as $bar) { ... }
In the first example, does it explode the $foo string for each iteration or does PHP keep it in memory exploded in its own temporary variable? From an efficiency point of view, does it make sense to create the extra variable $test or are both pretty much equal?
I could make an educated guess, but let's try it out!
I figured there were three main ways to approach this.
1. explode and assign before entering the loop
2. explode within the loop, with no assignment
3. string tokenize
My hypotheses:
1. probably consumes more memory due to the assignment
2. probably identical to #1 or #3, not sure which
3. probably both quicker and with a much smaller memory footprint
Approach
Here's my test script:
<?php
ini_set('memory_limit', '1024M');
$listStr = 'text';
$listStr .= str_repeat(',text', 9999999);
$timeStart = microtime(true);
/*****
* {INSERT LOOP HERE}
*/
$timeEnd = microtime(true);
$timeElapsed = $timeEnd - $timeStart;
printf("Memory used: %s kB\n", memory_get_peak_usage()/1024);
printf("Total time: %s s\n", $timeElapsed);
And here are the three versions:
1)
// explode separately
$arr = explode(',', $listStr);
foreach ($arr as $val) {}
2)
// explode inline-ly
foreach (explode(',', $listStr) as $val) {}
3)
// tokenize
$tok = strtok($listStr, ',');
while ($tok = strtok(',')) {}
Results
Conclusions
Looks like some assumptions were disproven. Don't you love science? :-)
In the big picture, any of these methods is sufficiently fast for a list of "reasonable size" (few hundred or few thousand).
If you're iterating over something huge, time difference is relatively minor but memory usage could be different by an order of magnitude!
When you explode() inline without pre-assignment, it's a fair bit slower for some reason.
Surprisingly, tokenizing is a bit slower than explicitly iterating a declared array. Working on such a small scale, I believe that's due to the call stack overhead of making a function call to strtok() every iteration. More on this below.
In terms of the number of function calls, explode()ing beats tokenizing: O(1) vs. O(n).
I added a bonus to the chart where I run method 1) with a function call in the loop. I used strlen($val), thinking it would be a relatively similar execution time. That's subject to debate, but I was only trying to make a general point. (I only ran strlen($val) and ignored its output. I did not assign it to anything, for an assignment would be an additional time-cost.)
// explode separately
$arr = explode(',', $listStr);
foreach ($arr as $val) {strlen($val);}
As you can see from the results table, it then becomes the slowest method of the three.
Final thought
This is interesting to know, but my suggestion is to do whatever you feel is most readable/maintainable. Only if you're really dealing with a significantly large dataset should you be worried about these micro-optimizations.
In the first case, PHP explodes it once and keeps it in memory.
The impact of creating the extra variable (or not) would be negligible. The PHP interpreter needs to maintain a pointer to the location of the next item either way, whether the array is stored in a user-defined variable or not.
From the point of memory it will not make a difference, because PHP uses the copy on write concept.
Apart from that, I personally would opt for the first option - it's a line less, but not less readable (imho!).
Efficiency in what sense? Memory management or processor time? For the processor it wouldn't make a difference; as for memory, you can always do $foo = explode(',', $foo).
This recursive function uses over 1.4 MB of RAM, and it's not freed. All it returns is a single int. How can I free up much more memory?
function bottomUpTree($item, $depth)
{
    if ($depth)
    {
        --$depth;
        $newItem = $item << 1;
        return array(
            bottomUpTree($newItem - 1, $depth),
            bottomUpTree($newItem, $depth),
            $item
        );
    }
    unset($depth);
    unset($newItem);
    return array(NULL, NULL, $item);
}

bottomUpTree(0, 7);
Recursive functions will always suck up memory. Each call will use up a bit more until you reach the bottom and start returning. This is unavoidable. Doing unset() on the function's parameters won't help you... they've already taken up space on the call stack and can't be removed until the function returns.
One option would be to switch over to an iterative function, but that's harder to do with a tree-structure.
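As an illustration of that recursion-to-iteration pattern (a sketch, not the asker's code): an explicit stack of pending (item, depth) pairs replaces the call stack. To keep it short, this version only counts the nodes the recursive version would build rather than building them.

```php
<?php
// A sketch of the recursion-to-iteration pattern: an explicit stack of
// pending (item, depth) pairs replaces the call stack. This version only
// counts the nodes the recursive version would build.
function countTreeNodes($item, $depth) {
    $count = 0;
    $stack = [[$item, $depth]];
    while ($stack) {
        [$item, $depth] = array_pop($stack);
        $count++;                          // one node per (item, depth) pair
        if ($depth > 0) {
            $newItem = $item << 1;
            $stack[] = [$newItem - 1, $depth - 1];
            $stack[] = [$newItem,     $depth - 1];
        }
    }
    return $count;
}

var_dump(countTreeNodes(0, 7)); // int(255): a depth-7 tree has 2^8 - 1 nodes
```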
But most of all, what does this function actually accomplish? You're returning arrays, but not assigning them anywhere in the calling level, so you're creating a ton of arrays only to throw them away again immediately.
A few tricks for getting PHP to release memory:
Extract the memory-intensive pieces of your recursive function into their own function/method. PHP won't release a function's memory until the function finishes/exits/returns.
Before returning from your extracted function/method, set variables to NULL.
Call gc_collect_cycles() after you call the memory-intensive function.
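Put together, those tricks look roughly like this (the wrapper name is hypothetical; bottomUpTree() is the question's function, repeated so the sketch runs):

```php
<?php
// The question's recursive function, repeated here so the sketch runs.
function bottomUpTree($item, $depth) {
    if ($depth) {
        --$depth;
        $newItem = $item << 1;
        return array(bottomUpTree($newItem - 1, $depth),
                     bottomUpTree($newItem, $depth),
                     $item);
    }
    return array(NULL, NULL, $item);
}

// Hypothetical wrapper: the memory-intensive work is isolated here, so
// its locals are released as soon as it returns.
function buildAndSummarize($item, $depth) {
    $tree = bottomUpTree($item, $depth);   // the expensive piece
    $sum  = $tree[2];                      // keep only what we need
    $tree = NULL;                          // drop the big structure before returning
    return $sum;
}

$result = buildAndSummarize(0, 7);
gc_collect_cycles();                       // reclaim any lingering cycles
var_dump($result); // int(0)
```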