I'm trying to extract data from many HTML files. To do it fast I don't use a DOM parser, just simple strpos(). Everything goes well if I generate from roughly 200,000 files. But if I do it with more files (300,000), it outputs nothing and shows this strange effect:
Look at the bottom diagram (the upper one is the CPU). In the first phase (marked RED) the output file size is growing and everything seems OK. After that (marked ORANGE) the file size drops to zero and the memory usage keeps growing. (Everything appears twice because I restarted the computation at the halfway point.)
I forgot to say that I use WAMP.
I have tried unsetting variables, putting the loop into a function, using implode instead of concatenating strings, using fopen instead of file_get_contents, and garbage collection too...
What is the 2nd phase? Am I out of memory? Is there some limit that I don't know about (max_execution_time and memory_limit are already ruled out)? Why does this small program use so much memory?
Here is the code.
$datafile = fopen("meccsek2b.jsb", 'w');

for ($i = 0; $i < 100000; $i++) {
    // $data and $mid come from code not shown in the question
    $a = explode('|', $data[$i]);
    $file = "data2/$mid.html";

    if (file_exists($file)) {
        $c = file_get_contents($file);
        $o = 0;
        $a_id = array();
        $a_h  = array();
        $a_d  = array();
        $a_v  = array();

        // walk through every '<a href="/test/' anchor in the page
        while ($o = strpos($c, '<a href="/test/', $o)) {
            $o = $o + 15;
            $a_id[] = substr($c, $o, strpos($c, '/', $o) - $o);
            $o = strpos($c, 'val_h="', $o) + 7;
            $a_h[]  = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_d="', $o) + 7;
            $a_d[]  = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_v="', $o) + 7;
            $a_v[]  = substr($c, $o, strpos($c, '"', $o) - $o);
        }

        fwrite($datafile,
            $mid.'|'.
            implode(';', $a_id).'|'.
            implode(';', $a_h).'|'.
            implode(';', $a_d).'|'.
            implode(';', $a_v).
            PHP_EOL);
    }
}

fclose($datafile);
I think I found the problem:
There was an infinite loop because strpos() returned 0.
The allocated memory size was growing until an exception:
PHP Fatal error: Out of memory
Ensino's note about using the command line was very useful; that finally led me to this question.
You should consider running your script from the command line; this way you might catch the error without digging through the error logs.
Furthermore, as stated in the PHP manual, the strpos function may return boolean FALSE, but may also return a non-boolean value which evaluates to FALSE, so the correct way to test the return value of this function is by using the !== operator:
while (($o = strpos($c,'<a href="/test/',$o)) !== FALSE){
...
}
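To see why the strict check matters, here is a small illustration (not from the original post):
// A needle found at offset 0 yields int(0), which is falsy but not FALSE.
var_dump(strpos('<a href="/test/1/">x</a>', '<a href="/test/'));  // int(0)
var_dump(strpos('<a href="/test/1/">x</a>', 'missing-needle'));   // bool(false)

// A loose check cannot tell the two cases apart; a strict check can.
var_dump(0 == false);   // bool(true)
var_dump(0 === false);  // bool(false)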
The CPU spike most likely means that PHP is doing garbage collection. In case you want to gain some performance at the cost of higher memory usage, you can disable garbage collection with gc_disable().
Looking at the code, I'd guess that you've reached the point where file_get_contents is reading some big file and PHP realizes it has to free some memory by running garbage collection in order to store its contents.
The best way to deal with that is to read the file continuously and process it in chunks rather than keeping the whole thing in memory.
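A minimal sketch of that idea (the file name and chunk size are made up for illustration):
$fh = fopen('data2/example.html', 'r');
if ($fh === false) {
    die('cannot open file');
}

while (!feof($fh)) {
    $chunk = fread($fh, 8192);   // read 8 KB at a time instead of the whole file
    // ... scan $chunk here, carrying over any state that may span chunk boundaries ...
}

fclose($fh);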
A huge amount of data is going into the system's internal cache. When the data in the system cache is written to disk, it can have an impact on memory and performance.
There is a system function, FlushFileBuffers, to enforce the writes:
Please look at http://msdn.microsoft.com/en-us/library/windows/desktop/aa364451%28v=vs.85%29.aspx and http://winbinder.org/ for calling the function.
(Though this does not explain the empty file, unless there is a Windows bug.)
Related
I am processing somewhat big files in PHP (300 MB - 1024 MB) in pursuit of finding a line that matches my search criteria and returning the whole line. Since I cannot afford to read the entire file and store it in memory, I am reading line by line:
function getLineWithString($fileName, $str) {
    $matches = array();
    $handle = @fopen($fileName, "r");
    if ($handle) {
        while (!feof($handle)) {
            $buffer = fgets($handle, 4096);
            if (strpos($buffer, $str) !== FALSE) {
                return '<pre>'.$matches[] = $buffer.'</pre>';
            }
        }
        fclose($handle);
    }
}
Since my $str (needle) is an array, I am using foreach() to process all its elements and invoke the function every time:
foreach ($my_array as $a_match) {
    echo getLineWithString($myFile, trim($a_match));
}
However, this approach (using foreach()) hits max_execution_time, memory_limit, Apache's FcgidIOTimeout, and others. My array of needles contains 88 elements, and they might grow in number depending on the end user's actions, so this is definitely not an adequate way.
My question is: how can I avoid the foreach() (or any other looping) and invoke the function only once?
Note about memory leak
It's important to note that this is a misuse of the term memory leak since in PHP you have no control over memory management. A memory leak is generally defined as a process having allocated memory on the system that is no longer reachable by that process. It's not possible for you to do this in your PHP code since you have no direct control over the PHP memory manager.
Your code runs inside of the PHP virtual machine which manages memory for you. Exceeding the memory_limit you set in PHP is not the same thing as PHP leaking memory. This is a defined limit, controlled by you. You can raise or lower this limit at your discretion. You may even ask PHP to not limit the amount of memory at all by setting memory_limit = -1, for example. Of course, this is still subject to your machine's memory capacity.
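For example, the limit can be raised or removed for a single script (an illustration only, use with care):
ini_set('memory_limit', '512M');   // raise the limit for this script only
// ini_set('memory_limit', '-1');  // or remove it entirely
echo ini_get('memory_limit'), PHP_EOL;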
Your actual problem
However, the approach you are using is not much better than reading the entire file into memory because you will have to read the file line by line with every search (call to your function). That's worse time complexity even though it may be more efficient in terms of memory.
To be efficient in both time and memory complexity you need to perform the search on each needle at once while reading from the file. Rather than send a single needle to your function, consider sending the entire array of needles at once. This way you defer the loop you're using to call your function to the function itself.
Additionally, you should note that your current function returns immediately, upon finding a match, since you're using return inside your loop. You should instead use return $matches at the end of your function, outside of the loop.
Here's a better approach.
function getLineWithString($fileName, Array $needles) {
    $matches = [];
    $handle = fopen($fileName, "r");
    if ($handle) {
        while (!feof($handle)) {
            $buffer = fgets($handle);
            foreach ($needles as $str) {
                if (strpos($buffer, $str) !== FALSE) {
                    $matches[] = $buffer;
                }
            }
        }
        fclose($handle);
        return $matches;
    }
}
Now, let's say you're searching for the strings "foo", "bar", and "baz" in your file. You can make a single call to your function with an array of those strings to search them all at once rather than call your function in a loop. The function will loop over the search strings each time it reads a line from the file and search that $buffer for a match, then return the entire $matches array when it's done.
var_dump(getLineWithString("somefile.txt", ["foo", "bar", "baz"]));
N.B.
I'd strongly advise against using the error-silencing operator @, since it makes debugging your code much more difficult when there is a problem, because it turns off all error reporting for its operand. Even if there were an error, PHP wouldn't tell you about it, which isn't useful at all.
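A sketch of the alternative: let errors surface and check the return value explicitly (the file name is just an example):
error_reporting(E_ALL);
ini_set('display_errors', '1');

$handle = fopen('somefile.txt', 'r');   // no @, so a failure will be reported
if ($handle === false) {
    die('Could not open somefile.txt');
}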
I have read several times that, in order to invoke the garbage collector and actually free the RAM used by a variable, you have to assign a new value (e.g. NULL) to it instead of simply unset()ing it.
This code, however, shows that the memory allocated for the array $a is not freed after the NULL assignment.
function build_array()
{
    for ($i = 0; $i < 10000; $i++) {
        $a[$i] = $i;
    }
    $i = null;
    return $a;
}
echo '<p>'.memory_get_usage(true);
$a = build_array();
echo '<p>'.memory_get_usage(true);
$a = null;
echo '<p>'.memory_get_usage(true);
The output I get is:
262144
1835008
786432
So part of the memory is cleaned, but not all the memory. How can I completely clean the RAM?
You have no way to definitively clear a variable from memory with PHP.
It is up to the PHP garbage collector to do that when it sees that it should.
Fortunately, the PHP garbage collector may not be perfect, but it is one of the best features of PHP. If you do things as per the PHP documentation, there's no reason to have problems.
If you have a realistic scenario where it may be a problem, post the scenario here or report it to the PHP core team.
Otherwise, unset() is the best way to clear variables.
The point is that memory_get_usage(true) shows the memory allocated to your PHP process, not the amount actually in use. The system can free the unused part once it is required elsewhere.
More details on memory_get_usage can be found in the PHP manual.
If you run this with memory_get_usage(false), you will see that the array was actually garbage-collected.
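A quick sketch of the difference (the numbers will vary by system):
printf("real: %d  used: %d\n", memory_get_usage(true), memory_get_usage(false));

$a = range(1, 100000);      // allocate a sizeable array
printf("real: %d  used: %d\n", memory_get_usage(true), memory_get_usage(false));

$a = null;                  // drop it again
// "used" falls back down; "real" may stay higher because PHP keeps the
// already-reserved blocks around for reuse.
printf("real: %d  used: %d\n", memory_get_usage(true), memory_get_usage(false));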
I see a lot of code like this:
function load_items(&$items_arr) {
    // ... some code
}

load_items($items_arr);
$v = &$items_arr[$id];
compared to code like this:
function load_items() {
    // ... some code
    return $items_arr;
}

$items_arr = load_items();
$v = $items_arr[$id];
Will the second code copy $items_arr and $items_arr[$id]?
Will the first code improve performance?
No, it will not copy the value right away. Copy-on-write is one of the memory management techniques used in PHP. It ensures that memory isn't wasted when you copy values between variables.
What that means is that when you assign:
$v = $items_arr[$id];
PHP will simply update the symbol table to indicate that $v points to the same memory address as $items_arr[$id]; only if you change $items_arr or $v afterwards does PHP allocate more memory and perform the copy.
By delaying this extra allocation and copying, PHP saves time and memory in many cases.
There's a nice article about memory management in PHP: http://hengrui-li.blogspot.no/2011/08/php-copy-on-write-how-php-manages.html
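A small sketch that makes the copy-on-write behaviour visible (illustrative only):
$items_arr = range(1, 100000);
echo memory_get_usage(), PHP_EOL;   // baseline with the big array

$v = $items_arr;                    // no real copy yet: both names share the same data
echo memory_get_usage(), PHP_EOL;   // roughly unchanged

$v[0] = -1;                         // the first write forces the actual copy
echo memory_get_usage(), PHP_EOL;   // usage jumps by roughly the size of the array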
So, I am running a long-running script that deals with memory-sensitive data (large amounts of it). I think I am doing a good job of properly destroying large objects throughout the long-running process, to save memory.
I have a log that continuously outputs current memory usage (using memory_get_usage()), and I do not notice rises and drops (significant ones) in memory usage. Which tells me I am probably doing the right thing with memory management.
However, if I log on to the server and run a top command, I notice that the Apache process dealing with this script never deallocates memory (at least not visibly through the top command). It simply remains at its highest memory usage, even when the current memory usage reported by PHP is much, much lower.
So, my question is: are my attempts to save memory futile if the memory isn't really being freed back to the server? Or am I missing something here?
Thank you.
PS: using PHP 5.4 on Linux.
PPS: for those who want code, this is a basic representation:
function bigData()
{
    $obj = new BigDataObj();
    $obj->loadALotOfData();

    $varA = $obj->getALotOfData();

    // all done
    $obj  = NULL;
    $varA = NULL;
    unset($obj, $varA);
}
Update: as hek2mgl recommended, I ran debug_zval_dump(), and the output seems correct to me.
function bigData()
{
    $obj = new BigDataObj();
    $obj->loadALotOfData();

    // all done
    $obj = NULL;
    debug_zval_dump($obj);

    unset($obj);
    debug_zval_dump($obj);
}
Output:
NULL refcount(2)
NULL refcount(1)
PHP has a garbage collector. It will free memory for variable containers whose reference count has dropped to 0, meaning that no userland reference exists anymore.
I guess there are still references to variables which you think you have cleaned. I'd need to see your code to show you what the problem is.
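For illustration, here is the kind of leftover reference that keeps memory alive (a made-up example, not the asker's code):
$big   = str_repeat('x', 5 * 1024 * 1024);   // ~5 MB string
$alias = $big;                               // a second reference to the same value

unset($big);                                 // drops only one reference
echo memory_get_usage(), PHP_EOL;            // the 5 MB is still held via $alias

unset($alias);                               // the last reference is gone
echo memory_get_usage(), PHP_EOL;            // now the memory can be reused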
In a PHP program, I sequentially read a bunch of files (with file_get_contents), gzdecode them, json_decode the result, analyze the contents, throw most of it away, and store about 1% in an array.
Unfortunately, with each iteration (I traverse an array containing the filenames), there seems to be some memory lost (according to memory_get_peak_usage, about 2-10 MB each time). I have double- and triple-checked my code; I am not storing unneeded data in the loop (and the needed data hardly exceeds about 10 MB overall), but I am frequently rewriting (actually, strings in an array). Apparently, PHP does not free the memory correctly, using more and more RAM until it hits the limit.
Is there any way to do a forced garbage collection? Or, at least, to find out where the memory is used?
It has to do with memory fragmentation.
Consider two strings concatenated into one. Each original must remain until the output is created, and the output is longer than either input.
Therefore, a new allocation must be made to store the result of such a concatenation. The original strings are freed, but they are small blocks of memory.
In the case of 'str1' . 'str2' . 'str3' . 'str4' you have several temporaries created at each ".", and none of them fit in the space that has been freed up. The strings are likely not laid out in contiguous memory (that is, each string is, but the various strings are not laid end to end) due to other uses of the memory. So freeing a string creates a problem, because the space can't be reused effectively. You grow with each temporary you create, and you never reuse anything.
Using the array-based implode, you create only one output, exactly the length you require, performing only one additional allocation. So it's much more memory efficient, and it doesn't suffer from concatenation fragmentation. The same is true of Python. If you need to concatenate strings, more than one concatenation should always be array based:
''.join(['str1','str2','str3'])
in python
implode('', array('str1', 'str2', 'str3'))
in PHP
sprintf equivalents are also fine.
The memory reported by memory_get_peak_usage is basically always the "last" bit of memory in the virtual map it had to use. Since it's always growing, it reports rapid growth, as each allocation falls "at the end" of the currently used memory block.
In PHP >= 5.3.0, you can call gc_collect_cycles() to force a GC pass.
Note: You need to have zend.enable_gc enabled in your php.ini, or call gc_enable() to activate the circular reference collector.
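A minimal sketch:
gc_enable();                       // only needed if zend.enable_gc is off
$collected = gc_collect_cycles();  // force a collection pass
echo "Collected $collected cycles", PHP_EOL;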
Found the solution: it was string concatenation. I was generating the input line by line by concatenating some variables (the output is a CSV file). However, PHP seems not to free the memory used for the old copy of the string, thus effectively clobbering RAM with unused data. Switching to an array-based approach (and imploding it with commas just before fputs-ing it to the output file) circumvented this behavior.
For some reason, not obvious to me, PHP reported the increased memory usage during json_decode calls, which misled me into assuming that the json_decode function was the problem.
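A sketch of the array-based approach described above ($records and its keys are placeholders, not the original data):
$out = fopen('result.csv', 'w');
$records = [
    ['id' => 1, 'name' => 'foo', 'value' => 42],
    ['id' => 2, 'name' => 'bar', 'value' => 7],
];

foreach ($records as $record) {
    // collect the fields in an array and build the line with a single implode()
    $fields = [$record['id'], $record['name'], $record['value']];
    fputs($out, implode(',', $fields) . PHP_EOL);
}

fclose($out);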
There's a way.
I had this problem one day. I was writing from a DB query into CSV files: I always allocated one $row, then reassigned it in the next step. I kept running out of memory. Unsetting $row didn't help; putting a 5 MB string into $row first (to avoid fragmentation) didn't help; creating an array of $row-s (loading many rows into it and unsetting the whole thing at every 5000th step) didn't help. But it was not the end, to quote a classic.
When I made a separate function that opened the file, transferred 100,000 lines (just enough not to eat up the whole memory) and closed the file, and THEN made subsequent calls to this function (appending to the existing file), I found that on every function exit, PHP removed the garbage. It was a local-variable-space thing.
TL;DR
When a function exits, it frees all local variables.
If you do the job in smaller portions, like 0 to 1000 in the first function call, then 1001 to 2000 and so on, then every time the function returns, your memory will be regained. Garbage collection is very likely to happen on return from a function. (If it's a relatively slow function eating a lot of memory, we can safely assume it always happens.)
Side note: for variables passed by reference it will obviously not work; a function can only free the variables inside it, which would be lost anyway on return.
I hope this saves your day as it saved mine!
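Here is a rough sketch of that pattern (the data written is a stand-in, not the original query):
function writeChunk($outFile, $start, $count) {
    $fh = fopen($outFile, 'a');                  // append to the existing file
    for ($i = $start; $i < $start + $count; $i++) {
        // stand-in for one row of real data
        fputs($fh, implode(',', array($i, md5((string)$i))) . PHP_EOL);
    }
    fclose($fh);
    // every local variable goes out of scope here, so PHP can reclaim its memory
}

for ($start = 0; $start < 1000000; $start += 100000) {
    writeChunk('out.csv', $start, 100000);
    echo memory_get_usage() . PHP_EOL;           // stays roughly flat between calls
}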
I've found that PHP's internal memory manager is most likely to be invoked upon completion of a function. Knowing that, I've refactored code in a loop like so:
while (condition) {
// do
// cool
// stuff
}
to
while (condition) {
do_cool_stuff();
}
function do_cool_stuff() {
// do
// cool
// stuff
}
EDIT
I ran this quick benchmark and did not see an increase in memory usage. This leads me to believe the leak is not in json_decode()
for ($x = 0; $x < 10000000; $x++) {
    do_something_cool();
}

function do_something_cool() {
    $json = '{"a":1,"b":2,"c":3,"d":4,"e":5}';
    $result = json_decode($json);
    echo memory_get_peak_usage() . PHP_EOL;
}
I was going to say that I wouldn't necessarily expect gc_collect_cycles() to solve the problem, since presumably the files are no longer mapped to zvals. But did you check that gc_enable() was called before loading any files?
I've noticed that PHP seems to gobble up memory when doing includes, much more than is required for the source and the tokenized file; this may be a similar problem. I'm not saying that this is a bug, though.
I believe one workaround would be not to use file_get_contents but rather fopen()...fgets()...fclose(), rather than mapping the whole file into memory in one go. But you'd need to try it to confirm.
HTH
C.
Call memory_get_peak_usage() after each statement, and ensure you unset() everything you can. If you are iterating with foreach(), use a referenced variable to avoid making a copy of the original:
foreach( $x as &$y)
If PHP is actually leaking memory a forced garbage collection won't make any difference.
There's a good article on PHP memory leaks and their detection at IBM
There was recently a similar issue with System_Daemon. Today I isolated my problem to file_get_contents.
Could you try using fread instead? I think this may solve your problem.
If it does, it's probably time to do a bugreport over at PHP.