PHP PDO fetch() loop dies after processing part of large dataset

I have a PHP script which processes a "large" dataset (about 100K records) from a PDO query into a single collection of objects, in a typical loop:
while ($record = $query->fetch()) {
    $obj = new Thing($record);
    /* do some processing */
    $list[] = $obj;
    $count++;
}
error_log('Processed '.$count.' records');
This loop processes about 50% of the dataset and then inexplicably breaks.
Things I have tried:
Memory profiling: memory_get_peak_usage() consistently outputs about 63MB before the loop dies. The memory limit is 512MB, set through php.ini.
Using set_time_limit() to increase the script execution time to 1 hour (3600 seconds). The loop breaks long before that, and I don't see the usual timeout error in the log.
Setting PDO::MYSQL_ATTR_USE_BUFFERED_QUERY to false to avoid buffering the entire dataset.
Logging $query->errorInfo() immediately after the loop breaks. This was no help, as the error code was "00000".
Checking the MySQL error log. Nothing of note in there before, after, or while this script runs.
Batching the processing into 20K-record chunks. No difference. Loop broke in the same spot. However, by "cleaning up" the PDO statement object at the end of each batch, I was able to get the processed total to 54%.
Other weird behavior:
When I set the memory limit using ini_set('memory_limit', '1024MB'), the loop actually dies earlier than with a smaller memory limit, at about 20% progress.
During this loop, the PHP process uses 100% CPU, but once it breaks, usage drops back down to 2%, despite further processing in another loop immediately afterwards. Likely, the connection with the MySQL server in the first loop is very resource-intensive.
I am doing this all locally using MAMP PRO if that makes any difference.
Is there something else that could be consistently breaking this loop that I haven't checked? Is this simply not a viable strategy for processing this many records?
UPDATE
After using a batching strategy (20K increments), I have started to see a MySQL error consistently around the third batch: MySQL server has gone away; possibly a symptom of a long-running unbuffered query.

If you really need to process 100K records on the fly, you should do the processing in SQL and fetch the result as you need it; it should save a lot of time.
But you probably can't do that for some reason. Since you always process all the rows from the statement, use fetchAll() once and leave MySQL alone after that, like this:
$records = $query->fetchAll();
foreach ($records as $record) {
    $obj = new Thing($record);
    /* do some processing */
    $list[] = $obj;
    $count++;
}
error_log('Processed '.$count.' records');
Also, select only the rows and columns that you will actually use.
If this does not help, you can try this: Setting a connect timeout with PDO.
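To make that concrete, here is a rough sketch combining both suggestions; the DSN, credentials, and column names are placeholders, and PDO::ATTR_TIMEOUT behaviour is driver-dependent, so treat this as illustrative rather than a drop-in fix:
$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb;charset=utf8mb4', // placeholder DSN
    'user',
    'pass',
    array(
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_TIMEOUT => 600, // driver-dependent timeout, illustrative value
    )
);

// Select only the columns Thing() actually needs, fetch everything once,
// then release the statement so MySQL is left alone during processing.
$query = $pdo->prepare('SELECT id, name, price FROM things');
$query->execute();
$records = $query->fetchAll(PDO::FETCH_ASSOC);
$query->closeCursor();

$list = array();
foreach ($records as $record) {
    $list[] = new Thing($record);
}
error_log('Processed ' . count($list) . ' records');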

Related

Is free_result needed when reusing a variable?

INCLUDE mysqli object
--
$sel = $mysqli->query("select * from `items`");
while($res = $sel->fetch_assoc()) {
$items[] = $res;
}
$sel->free_result();
$sel = $mysqli->query("select * from `sets`");
while($res = $sel->fetch_assoc()) {
$sets[] = $res;
}
$sel->free_result();
$sel = $mysqli->query("select * from `parts`");
while($res = $sel->fetch_assoc()) {
$parts[] = $res;
}
$sel->free_result();
--
DO OTHER STUFF
Are the first two calls to $sel->free_result(); really needed?
I think they are unnecessary when I reuse the variable $sel.
Do you agree?
According to the PHP documentation on the matter:
You should always free your result with mysqli_free_result(), when your result object is not needed anymore.
Looking into the comments section of that page gives a reason why:
If you are getting this error: Internal SQL Bug: 2014, Commands out of sync; you can't run this command now
Then you never called mysqli_result::free(), mysqli_result::free_result(), mysqli_result::close(), or mysqli_free_result() in your script, and must call it before executing another stored procedure.
Basically, you don't need to do it, but there could be odd instances where you're unable to execute other procedures, and commands could be out of sync.
Another issue which could come from not doing it is that the result could be taking up a lot of memory so if you put new code in between the queries which needs memory, then you could hit an issue where the script crashes due to being unable to allocate more memory. By freeing the result you're helping to reduce the chances of this happening should you need to make changes between query executions.
At an extreme level, if you've got a bug in a different part of your system which can read other memory addresses and one of those queries has sensitive information stored in it, by not freeing the result, you're potentially giving an attacker a chance to read that sensitive information from memory.
There are bound to be other cases where there are issues, but those two sprang to mind.
In a nutshell, it's better to free the result and not have the information in memory when it's not needed.
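If you want to see whether it matters in your own script, one rough check is to wrap one of the queries from the question with memory_get_usage() calls; the table names below are just the ones from the question:
$sel = $mysqli->query("select * from `items`");
while ($res = $sel->fetch_assoc()) {
    $items[] = $res;
}
echo 'after fetch: ' . memory_get_usage() . PHP_EOL;

$sel->free_result();
echo 'after free_result: ' . memory_get_usage() . PHP_EOL;

// Reusing $sel afterwards works either way; the explicit free simply
// releases the previous result set before the next query runs.
$sel = $mysqli->query("select * from `sets`");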

PHP, Using PDO with nested queries, results corrupted

I'm using PDO against MSSQL, and need to run nested queries. They are all prepared statements. If I try to use the fetch() method, the inner queries fail immediately, so I used fetchAll(). So, I get something like this, with Programs, Products and Budgets:
$pgm_stmt->execute();
$pgm_res = $pgm_stmt->fetchAll(PDO::FETCH_ASSOC);
foreach ($pgm_res as $pgmrow) {
    $prod_stmt->execute(array($pgmrow['ID']));
    $prod_res = $prod_stmt->fetchAll(PDO::FETCH_ASSOC);
    foreach ($prod_res as $prodrow) {
        $bdgt_stmt->execute(array($pgmrow['ID'], $prodrow['ID']));
        $bdgt_res = $bdgt_stmt->fetchAll(PDO::FETCH_NUM);
        foreach ($bdgt_res as $bdgtrow) {
            /* work here */
        }
    }
}
OK, everything works the first time through, but when it loops back for the 2nd program, the product result set gets corrupted somehow. When I dump the $prod_res variable right after the fetchAll(), the values are randomly assigned from other parts of memory, bits of other arrays, etc. Of course it fails because the $prodrow['ID'] value is undefined, because that whole result set is mangled.
Can someone help me troubleshoot this? I'm stumped.
Thanks.
Not a bug, but a feature, see: https://bugs.php.net/bug.php?id=65945
This is the behavior of MSSQL (TDS), DBLIB and FreeTDS. One statement per connection rule. If you initiate another statement, the previous statement is cancelled.
The previous versions buffered the entire result set in memory, leading to OOM errors on large result sets.
The previous behavior can be replicated using fetchAll() and a loop if desired. Another workaround is to open 2 connection objects, one per statement.
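A rough sketch of the two-connections workaround, assuming a pdo_dblib DSN (adjust for your driver); the table and column names are illustrative:
// One connection per statement that must stay open at the same time.
$dsn  = 'dblib:host=sqlserver;dbname=mydb'; // placeholder DSN
$pdoA = new PDO($dsn, 'user', 'pass');
$pdoB = new PDO($dsn, 'user', 'pass');

$pgm_stmt  = $pdoA->prepare('SELECT ID FROM Programs');
$prod_stmt = $pdoB->prepare('SELECT ID FROM Products WHERE ProgramID = ?');

$pgm_stmt->execute();
while ($pgmrow = $pgm_stmt->fetch(PDO::FETCH_ASSOC)) {
    // fetch() is safe here because the inner statement uses its own connection.
    $prod_stmt->execute(array($pgmrow['ID']));
    while ($prodrow = $prod_stmt->fetch(PDO::FETCH_ASSOC)) {
        /* work here */
    }
}
The Budgets level from the question would need a third connection in the same way.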

Memory leak in fgets

I have been playing with PHP sockets for a few days while making a simple IRC bot for some other projects. The bot is up and running, but I noticed that after a couple of hours it will have eaten up all available memory.
I have been doing some debugging with memory_get_usage(), and after making sure that I null out all variables I use within my loops, the only thing that causes an increase in memory usage is fgets(), and I cannot figure out why it won't release its memory after use.
Any ideas what I have been doing wrong?
Pseudo-code:
$this->socket = stream_socket_client('tcp://' . $server . ':' . $port);
stream_set_blocking($this->socket, 0);
stream_set_timeout($this->socket, 600);
while (true) {
    usleep(500000);
    $data = fgets($this->socket, 8192);
    /* work with $data if strlen > 0 */
    $data = null;
}
Note that I have disabled blocking so that the bot can do some background tasks even when there is no activity on the channels it is watching.
Memory usage before and after calling fgets (the same result with stream_get_line):
int(959504)
string(0) "" //Data returned from gets
int(967736)
Note that I am testing against an SSL server; could this be some kind of SSL "overflow"?
Or if you want to look at the whole code for yourself: https://github.com/Ueland/VikingBot
According to https://bugs.php.net/bug.php?id=38962, this is a bug that was reproduced specifically in PHP 5.2.6. So if you are using a higher version, you can report your findings :)
Simply setting a variable to null doesn't release the memory the previous data was using. It simply disconnects the data from the variable. At some point in the future, the PHP garbage collector MAY kick in and actually free up the memory, but it's not guaranteed to do so. Garbage collection is a very expensive operation, CPU-usage-wise, and PHP will not run the GC unless it absolutely has to. Usually this'd be when memory usage gets close to the memory_limit setting.
You can try to force a GC run via gc_collect_cycles()
use stream_get_line() instead of fgets()
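A minimal sketch of that swap, assuming the socket from the question and IRC's "\r\n" line terminator:
// stream_get_line() takes the delimiter explicitly and strips it from
// the returned string; 8192 is the same length cap as in the question.
$data = stream_get_line($this->socket, 8192, "\r\n");
if ($data !== false && strlen($data) > 0) {
    /* work with $data */
}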
Figured it out at last. I realized that instead of using a while(true) loop, I had a function that called itself when it was done, thereby keeping a reference to itself lying around. I don't know why I didn't notice it before now, but at least the memory usage now stays the same for every round. ;)
Thanks for all the suggestions!
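For illustration, the difference between the two shapes might look like this (hypothetical function names; the recursive form keeps every pending call frame, and everything it references, alive):
// Leaky shape: each pass calls itself again, so no frame ever returns.
function tick_recursive($socket) {
    $data = fgets($socket, 8192);
    /* work with $data */
    usleep(500000);
    tick_recursive($socket); // stacks another frame on every iteration
}

// Flat shape: one frame; locals are reclaimed at the end of each pass.
function tick_loop($socket) {
    while (true) {
        $data = fgets($socket, 8192);
        /* work with $data */
        usleep(500000);
    }
}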

Force freeing memory in PHP

In a PHP program, I sequentially read a bunch of files (with file_get_contents), gzdecode them, json_decode the result, analyze the contents, throw most of it away, and store about 1% in an array.
Unfortunately, with each iteration (I traverse over an array containing the filenames), there seems to be some memory lost (according to memory_get_peak_usage, about 2-10 MB each time). I have double- and triple-checked my code; I am not storing unneeded data in the loop (and the needed data hardly exceeds about 10MB overall), but I am frequently rewriting (actually, strings in an array). Apparently, PHP does not free the memory correctly, thus using more and more RAM until it hits the limit.
Is there any way to do a forced garbage collection? Or, at least, to find out where the memory is used?
It has to do with memory fragmentation.
Consider two strings, concatenated to one string. Each original must remain until the output is created. The output is longer than either input.
Therefore, a new allocation must be made to store the result of such a concatenation. The original strings are freed but they are small blocks of memory.
In the case of 'str1' . 'str2' . 'str3' . 'str4', you have several temporaries being created at each "." -- and none of them fit in the space that's been freed up. The strings are likely not laid out in contiguous memory (that is, each string is, but the various strings are not laid end to end) due to other uses of the memory. So freeing a string creates a problem because the space can't be reused effectively. You grow with each temporary you create, and you never re-use anything.
Using the array-based implode, you create only one output -- exactly the length you require -- performing only one additional allocation. So it's much more memory efficient, and it doesn't suffer from concatenation fragmentation. The same is true of Python: if you need to concatenate more than one string, the operation should always be array based:
''.join(['str1','str2','str3'])
in python
implode('', array('str1', 'str2', 'str3'))
in PHP
sprintf equivalents are also fine.
The memory reported by memory_get_peak_usage() is basically always the "last" bit of memory in the virtual map that it had to use. So, since it's always growing, it reports rapid growth, as each allocation falls "at the end" of the currently used memory block.
In PHP >= 5.3.0, you can call gc_collect_cycles() to force a GC pass.
Note: You need to have zend.enable_gc enabled in your php.ini, or call gc_enable() to activate the circular reference collector.
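A minimal sketch of forcing a collection pass inside the file loop from the question ($filenames standing in for the array of file names being traversed):
gc_enable(); // no-op if zend.enable_gc is already on

foreach ($filenames as $filename) {
    $json = gzdecode(file_get_contents($filename));
    $data = json_decode($json, true);
    /* analyze $data, keep the ~1% you need */
    unset($json, $data);
    gc_collect_cycles(); // force a pass over the circular-reference buffer
}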
Found the solution: it was the string concatenation. I was generating each line of output by concatenating some variables (the output is a CSV file). However, PHP seems not to free the memory used for the old copy of the string, thus effectively clobbering RAM with unused data. Switching to an array-based approach (and imploding it with commas just before fputs-ing it to the outfile) circumvented this behavior.
For some reason - not obvious to me - PHP reported the increased memory usage during json_decode calls, which misled me to the assumption that the json_decode function was the problem.
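That array-based fix might look roughly like this (the field names and output file are illustrative):
$out = fopen('output.csv', 'w');

foreach ($records as $record) {
    // Collect the fields in an array and join once, instead of growing
    // a string with repeated concatenation.
    $fields = array($record['id'], $record['name'], $record['value']);
    fputs($out, implode(',', $fields) . "\n");
}

fclose($out);
fputcsv() would achieve the same thing and handle the quoting for you.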
There's a way.
I had this problem one day. I was writing from a DB query into CSV files - always allocating one $row, then reassigning it in the next step. I kept running out of memory. Unsetting $row didn't help; putting a 5MB string into $row first (to avoid fragmentation) didn't help; creating an array of $row-s (loading many rows into it + unsetting the whole thing at every 5000th step) didn't help. But it was not the end, to quote a classic.
When I made a separate function that opened the file, transferred 100,000 lines (just enough not to eat up the whole memory) and closed the file, THEN made subsequent calls to this function (appending to the existing file), I found that for every function exit, PHP removed the garbage. It was a local-variable-space thing.
TL;DR
When a function exits, it frees all local variables.
If you do the job in smaller portions, like 0 to 1000 in the first function call, then 1001 to 2000 and so on, then every time the function returns, your memory will be regained. Garbage collection is very likely to happen on return from a function. (If it's a relatively slow function eating a lot of memory, we can safely assume it always happens.)
Side note: for reference-passed variables it will obviously not work; a function can only free its inside variables that would be lost anyway on return.
I hope this saves your day as it saved mine!
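A sketch of that chunking pattern, built around a hypothetical process_chunk() that handles rows $from to $from + $size - 1 and reports how many it actually found:
// Hypothetical helper: all locals allocated inside are freed on return.
function process_chunk($from, $size) {
    $handled = 0;
    /* open the file, fetch rows $from .. $from + $size - 1, write, close */
    return $handled;
}

$offset = 0;
$size   = 100000;
do {
    $handled = process_chunk($offset, $size);
    $offset += $size;
} while ($handled === $size); // a short chunk means we reached the end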
I've found that PHP's internal memory manager is most likely to be invoked upon completion of a function. Knowing that, I've refactored code in a loop like so:
while (condition) {
    // do
    // cool
    // stuff
}
to
while (condition) {
    do_cool_stuff();
}

function do_cool_stuff() {
    // do
    // cool
    // stuff
}
EDIT
I ran this quick benchmark and did not see an increase in memory usage. This leads me to believe the leak is not in json_decode()
for ($x = 0; $x < 10000000; $x++) {
    do_something_cool();
}

function do_something_cool() {
    $json = '{"a":1,"b":2,"c":3,"d":4,"e":5}';
    $result = json_decode($json);
    echo memory_get_peak_usage() . PHP_EOL;
}
I was going to say that I wouldn't necessarily expect gc_collect_cycles() to solve the problem - since presumably the files are no longer mapped to zvars. But did you check that gc_enable was called before loading any files?
I've noticed that PHP seems to gobble up memory when doing includes - much more than is required for the source and the tokenized file - this may be a similar problem. I'm not saying that this is a bug though.
I believe one workaround would be to use fopen()...fgets()...fclose() instead of file_get_contents(), rather than mapping the whole file into memory in one go. But you'd need to try it to confirm.
HTH
C.
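A sketch of that suggestion, reading the file in fixed-size chunks instead of one large allocation; it only helps if you can process the data incrementally rather than needing the whole string at once:
$fh = fopen($filename, 'rb');
while (!feof($fh)) {
    $chunk = fread($fh, 1048576); // 1 MB at a time
    /* feed $chunk to an incremental parser, hash, or filter here */
}
fclose($fh);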
Call memory_get_peak_usage() after each statement, and ensure you unset() everything you can. If you are iterating with foreach(), use a referenced variable to avoid making a copy of the original:
foreach ($x as &$y)
If PHP is actually leaking memory a forced garbage collection won't make any difference.
There's a good article on PHP memory leaks and their detection at IBM
There recently was a similar issue with System_Daemon. Today I isolated my problem to file_get_contents.
Could you try using fread instead? I think this may solve your problem.
If it does, it's probably time to file a bug report over at PHP.

is it a good practice to use mysql_free_result($result)?

I am aware that all associated result memory is automatically freed at the end of the script's execution. But would you recommend using it if I am running quite a lot of somewhat similar actions, as below?
$sql = "select * from products";
$result = mysql_query($sql);
if($result && mysql_num_rows($result) > 0) {
while($data = mysql_fetch_assoc($result)) {
$sql2 = "insert into another_table set product_id = '".$data['product_id']."'
, product_name = '".$data['product_name']."'
";
$result2 = mysql_query($sql2);
**mysql_free_result($result2);**
}
}
Thanks.
Quoting the documentation of mysql_free_result:
mysql_free_result() only needs to be called if you are concerned about how much memory is being used for queries that return large result sets. All associated result memory is automatically freed at the end of the script's execution.
So, if the documentation says it's generally not necessary to call that function, I would say it's not really necessary, nor good practice, to call it ;-)
And, just to say: I almost never call that function myself; memory is freed at the end of the script, and each script should not eat too much memory.
An exception could be long-running batches that have to deal with large amounts of data, though...
Yes, it is good practice to use mysql_free_result($result). The quoted documentation in the accepted answer is inaccurate. That is what the documentation says, but that doesn't make any sense. Here is what it says:
mysql_free_result() only needs to be called if you are concerned about how much memory is being used for queries that return large result sets. All associated result memory is automatically freed at the end of the script's execution.
The first part of the first sentence is correct. It is true that you don't need to use it for reasons other than memory concerns. Memory concerns are the only reason to use it. However, the second part of the first sentence doesn't make any sense. The claim is that you would only be concerned about memory for queries that return large result sets. This is very misleading, as there are other common scenarios where memory is a concern and calling mysql_free_result() is very good practice.
Any time queries may be run an unknown number of times, more and more memory will be used up if you don't call mysql_free_result(). So if you run your query in a loop, or from a function or method, it is usually a good idea to call mysql_free_result(). You just have to be careful not to free the result until after it will not be used anymore.
You can shield yourself from having to think about when and how to use it by making your own select() and ex() functions so you are not working directly with result sets. (None of the code here is exactly the way I would actually write it; it is more illustrative. You may want to put these in a class or special namespace, throw a different Exception type, or take additional parameters like $class_name, etc.)
// call this for select queries that do not modify anything
function select($sql) {
    $array = array();
    $rs = query($sql);
    while ($o = mysql_fetch_object($rs)) {
        $array[] = $o;
    }
    mysql_free_result($rs);
    return $array;
}

// call this for queries that modify data
function ex($sql) {
    query($sql);
    return mysql_affected_rows();
}

function query($sql) {
    $rs = mysql_query($sql);
    if ($rs === false) {
        throw new Exception("MySQL query error - SQL: \"$sql\" - Error Number: "
            . mysql_errno() . " - Error Message: " . mysql_error());
    }
    return $rs;
}
Now if you only call select() and ex(), you are just dealing with normal object variables and only normal memory concerns instead of manual memory management. You still have to think about normal memory concerns, like how much memory is in use by the array of objects. After the variable goes out of scope, or you manually set it to null, it becomes available for garbage collection, so PHP takes care of that for you. You may still want to set it to null before it goes out of scope if your code does not use it anymore and there are operations following it that use up an unknown amount of memory, such as loops and other function calls.
I don't know how result sets and the functions operating on them are implemented under the hood (and even if I did, this could change with different/future versions of PHP and MySQL), so there is the possibility that the select() function approximately doubles the amount of memory used just before mysql_free_result($rs) is called. However, using select() still eliminates what is usually the primary concern: more and more memory being used during loops and various function calls.
If you are concerned about this potential for double memory usage, and you are only working with one row at a time over a single iteration, you can make an each() function that will not double your memory usage, and will still shield you from thinking about mysql_free_result():
// PHP already defines a global each(), so put this in a class or
// namespace (as suggested above) to avoid a redeclare error.
function each($sql, $fun) {
    $rs = query($sql);
    while ($o = mysql_fetch_object($rs)) {
        $fun($o);
    }
    mysql_free_result($rs);
}
You can use it like this:
each("SELECT * FROM users", function($user) {
echo $user->username."<BR>";
});
Another advantage of using each() is that it does not return anything, so you don't have to think about whether or not to set the return value to null later.
The answer is of course YES in mysqli.
Take a look at PHP mysqli_free_result documentation:
You should always free your result with mysqli_free_result(), when your result object is not needed anymore.
I tested it with the memory_get_usage() function:
echo '<br>before mysqli free result: ' . memory_get_usage();
mysqli_free_result($query[1]);
echo '<br>after mysqli free result: ' . memory_get_usage();
And it is the result:
before mysqli free result:2110088
after mysqli free result:1958744
And here we are talking about 151,344 bytes of memory for only 1000 rows of a MySQL table.
What about a million rows, and what about large projects?
mysqli_free_result() is not only for large amounts of data; it is also good practice for small projects.
It depends on how large your queries are or how many queries you run.
PHP frees the memory at the end of the script automatically, but not during the run. So if you have a large amount of data coming from a query, it is better to free the result manually.
I would say: YES, it is good practice, because you care about memory during the development of your scripts, and that is what makes a good developer :-)
