I have a PHP script that works through a large array of people: it grabs their details from an external resource via SOAP, modifies the data and sends it back. Due to the size of the details I upped PHP's memory limit to 128MB. After about 4 hours of running (it will probably take 4 days to finish) it ran out of memory. Here's the basics of what it does:
$people = getPeople();

foreach ($people as $person) {
    $data = get_personal_data();
    if ($data == "blah") {
        importToPerson("blah", $person);
    } else {
        importToPerson("else", $person);
    }
}
After it ran out of memory and crashed, I decided to initialise $data before the foreach loop. According to top, memory usage for the process hasn't risen above 7.8% since, and it's been running for 12 hours.
So my question is: does PHP not run a garbage collector on variables initialised inside the loop, even when they're reused? Is the system reclaiming the memory while PHP hasn't yet marked it as usable, so it will eventually crash again? (I've also upped the limit to 256MB, so I've changed two things and I'm not sure which one fixed it. I could change my script back to find out, but I don't want to wait another 12 hours for it to crash.)
I'm not using the Zend Framework, so I don't think the other question like this is relevant.
EDIT: I don't actually have an issue with the script or what it's doing. At the moment, as far as all system reporting is concerned, I don't have any issues. This question is about the garbage collector and how/when it reclaims resources in a foreach loop, and/or how the system reports on memory usage of a PHP process.
I don't know the insides of PHP's VM, but from my experience, it doesn't garbage collect whilst your page is running. This is because it throws away everything your page created when it finishes.
Most of the time, when a page runs out of memory and the limit is pretty high (and 128MB isn't high), there is an algorithm problem. Many PHP programmers assemble a structure of data, then pass it to the next step, which iterates over the structure, usually creating another one. Lather, rinse, repeat. Unfortunately, this approach is a big memory hog and you end up creating multiple copies of your data in memory. Two of the really big changes in PHP 5 were that objects are reference counted, not copied, and that the entire string subsystem was made much, much faster. But it's still a problem.
To minimise memory use, you would look at re-structuring your algorithm so it can work with one piece of data from start to finish. Then you get the next and start again. Best case scenario is that you don't ever have the entire dataset in memory. For a database-backed website, this would mean processing a row of data from a database query all the way to presentation before getting the next. Of course, this approach isn't always possible and the script just has to keep a huge wodge of data in memory.
That said, you can do this memory-saving approach for part of the data. The trick is that you explicitly unset() a key variable or two at the end of the loop. This should reclaim the space. The other "best-practice" trick is to shift out of the loop data manipulation that doesn't need to be in the loop. As you seem to have discovered.
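A minimal sketch of that pattern, reusing the pseudocode from the question (get_personal_data() and importToPerson() are the asker's own functions):

$people = getPeople();

foreach ($people as $person) {
    // Fetch only this person's details; nothing else is held across iterations.
    $data = get_personal_data();

    if ($data == "blah") {
        importToPerson("blah", $person);
    } else {
        importToPerson("else", $person);
    }

    // Explicitly release the large per-iteration structure so its space can be
    // reused on the next pass instead of the process growing.
    unset($data);
}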
I've run PHP scripts that need upwards of 1GB of memory. You can actually set the memory limit per script with ini_set('memory_limit', '1G');
Use memory_get_usage() to see what's going on. You could put it inside the loop to see how memory allocation behaves.
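For example (a rough sketch; the loop body stands in for whatever work the script actually does):

foreach ($people as $i => $person) {
    // ... existing work ...

    // Log memory every 100 iterations to see whether usage keeps climbing
    // or levels off once variables are reused.
    if ($i % 100 === 0) {
        echo date('H:i:s'), ' iteration ', $i, ': ',
             round(memory_get_usage() / 1048576, 2), " MB\n";
    }
}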
Have you tried looking at the system monitor or similar to see how much memory PHP is using during that process?
Related
I am running an algorithm in PHP which has a lot of data involved. All the processing happens within a nested for loop. Strangely, the outer for loop stops working after 'X' number of iterations (where 'X' changes every time I run the script). It takes anywhere between 5 and 30 minutes for the script to crash, depending on 'X'. It does not throw any errors, and only produces an incomplete printout of my var_dump (from the first iteration of the outer loop).
These are the precautions I took:
1. I have set the timeout limit in php.ini to be 3600sec (60mins).
2. I am printing out memory_get_usage() after every outer for loop iteration and I have verified that it is far below the maximum memory allocated to PHP.
3. I am unsetting arrays once they are used
4. I reuse variable names to limit memory within the for loop
5. I have minimal calls to my DB
I have been trying to solve this for a long time to no avail. So my question is: what could be the cause of this problem, and how do I go about debugging it? Thank you so much!
Extra: if I work with a much smaller test data set, everything works fine.
Obviously without code this is just a guess, but are you making sure to use a single connection to your database? If you are reconnecting every time, you may end up with too many connections, which could cause an error like this.
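Something along these lines, as a sketch only (the host, credentials, table and query are placeholders):

// One connection, opened once and reused for every iteration.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

for ($i = 0; $i < $iterations; $i++) {
    $result = $db->query('SELECT id, value FROM items WHERE batch = ' . $i);
    while ($row = $result->fetch_assoc()) {
        // ... process $row ...
    }
    $result->free();   // release each result set before the next query
}

$db->close();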
This sounds like an issue with utilisation of your server cores and a similar answer/workaround could be found here: Boost Apache2 up to 4 cores usage, running PHP
Try running your datasets in parallel.
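One way to sketch that, assuming the pcntl extension is available on the CLI and that each chunk of the data can be processed independently (processChunk() is a hypothetical worker function):

// Split the work into chunks and fork one worker per chunk.
$chunks = array_chunk($allData, (int) ceil(count($allData) / 4));

$pids = array();
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        processChunk($chunk);   // child does its share of the work
        exit(0);                // child must exit, or it will keep looping
    }
    $pids[] = $pid;
}

// Parent waits for all children to finish.
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}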
I have a seemingly harmless while loop that goes through the result set of a MySQL query and checks whether the id returned from MySQL exists in a very large multidimensional array:
// mysqli query here, producing $result
while ($row = $result->fetch_assoc())
{
    if (!in_array($row['id'], $multiDArray['dimensionOne']))
    {
        // do something
    }
}
When the script first executes, it runs through the results at about 2-5k rows per second; sometimes more, rarely less. The result set brings back 7 million rows, and the script peaks at 2.8GB of memory.
In terms of big data, this is not a lot.
The problem is, around the 600k mark, the loop starts to slow down, and by 800k, it is processing a few records a second.
In terms of server load and memory use, there are no issues.
This is behaviour I have noticed before in other scripts dealing with large data sets.
Is array seek time progressively slower as the internal pointer moves deeper?
That really depends on what happens inside the loop. I know you are convinced it's not a memory issue, but it looks like one. Programs usually get very slow when the system tries to get extra RAM by using swap. Using the hard drive is obviously very slow, and that's what you might be experiencing. It's very easy to benchmark it.
In one terminal run
vmstat 3 100
Run your script and observe vmstat. Look at the IO and swap columns. If that really isn't the case, then profile the execution with Xdebug. It might be tricky because you do many iterations, and this will also cause major IO.
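If you also want numbers from inside the script itself, a crude per-batch timer is enough to show where the slowdown begins (a sketch using $result and $multiDArray from the question):

$count = 0;
$batchStart = microtime(true);

while ($row = $result->fetch_assoc()) {
    if (!in_array($row['id'], $multiDArray['dimensionOne'])) {
        // do something
    }

    // Every 100k rows, print how long the batch took and current memory use.
    if (++$count % 100000 === 0) {
        printf("rows: %d  batch: %.1fs  mem: %.1f MB\n",
            $count, microtime(true) - $batchStart, memory_get_usage() / 1048576);
        $batchStart = microtime(true);
    }
}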
Has anyone had any luck with fixing the Simple_DOM memory problem? I scoured these forums and found only recommendations for other parsing engines.
My script loops through 20,000 files and extracts one word from each. I have to call the file_get_html function each time.
Moved it to a different server. Same result.
Changed the foreach loop to a while loop.
Increased the memory limit on either server; it doesn't help.
Yes, you can increase the memory with ini_set(), but only if you have permission to do so.
What I recommend is that as you go through your loop, once you have completed the task for that iteration, you unset the variables that contain the large sets of data:
for ($i = 0; $i < 30000; $i++) {
    $file = file_get_contents($some_path . $i);

    // do something, like write to file

    // unset the variables
    unset($file);
}
Of course this is just an example, but you can relate it to your code and make sure every iteration runs as if the script were running for the first time.
Wishing you good luck :)
Seems to me like the approach of processing that much data during a single execution is flawed. In my experience, PHP CLI processes aren't really meant to run for long periods of time and process tons of data. It takes very, very careful memory management to do so. Throw in a leaky 3rd-party script, and you have a recipe for banging your head against a desk.
Maybe instead of attempting to run through all 20k files at once, you could process a few hundred at a time, store the results someplace intermediary, like a MySQL database, and then gather the results once all the files have been processed.
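A hedged sketch of that batching idea (the table, file paths and extractWord() helper are all made up for illustration):

// Handle the 20k files a few hundred per run, writing results to MySQL and
// recording progress, so no single process has to parse them all.
$batchSize = 500;
$offset    = is_file('progress.txt') ? (int) file_get_contents('progress.txt') : 0;
$files     = array_slice(glob('pages/*.html'), $offset, $batchSize);

$db     = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'pass');
$insert = $db->prepare('INSERT INTO extracted_words (file, word) VALUES (?, ?)');

foreach ($files as $file) {
    $word = extractWord($file);          // hypothetical: your parsing step
    $insert->execute(array($file, $word));
}

// Remember how far we got so the next invocation continues from here.
file_put_contents('progress.txt', $offset + count($files));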
I have a map. On this map I want to show live data collected from several tables, some of which have astounding amounts of rows. Needless to say, fetching this information takes a long time. Also, pinging is involved. Depending on servers being offline or far away, the collection of this data could vary from 1 to 10 minutes.
I want the map to be snappy and responsive, so I've decided to add a new table to my database containing only the data the map needs. That means I need a background process to update the information in my new table continuously. Cron jobs are of course a possibility, but I want the refreshing of data to happen as soon as the previous interval has completed. And what if the number of offline IP addresses suddenly spikes and the loop takes longer to run than the interval of the Cron job?
My own solution is to create an infinite loop in PHP that runs by the command line. This loop would refresh the data for the map into MySQL as well as record other useful data such as loop time and failed attempts at pings etc, then restart after a short pause (a few seconds).
However - I'm being repeatedly told by people that a PHP script running for ever is BAD. After a while it will hog gigabytes of RAM (and other terrible things)
Partly I'm writing this question to confirm whether this is in fact the case, but some tips and tricks on how I would go about writing a clean loop that doesn't leak memory (if that is possible) wouldn't go amiss. Opinions on the matter would also be appreciated.
The reply that I feel sheds the most light on the issue is the one I will mark as correct.
The loop should be in one script, which activates/calls the actual worker script as a different process, much like cron does.
That way, even if memory leaks and uncollected memory accumulates, it will (or at least should) be freed after each cycle.
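A minimal sketch of that supervisor pattern (refresh_map_data.php is a placeholder name for the worker):

// Supervisor: loops forever, but the real work happens in a short-lived child
// process, so anything it leaks is handed back to the OS when it exits.
while (true) {
    $start = time();

    passthru('php refresh_map_data.php', $exitCode);   // blocks until the worker finishes

    if ($exitCode !== 0) {
        error_log('worker failed with exit code ' . $exitCode);
    }
    error_log('cycle took ' . (time() - $start) . 's');

    sleep(5);   // short pause before the next cycle, as described in the question
}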
However - I'm being repeatedly told by people that a PHP script running for ever is BAD. After a while it will hog gigabytes of RAM (and other terrible things)
This used to be very true. Previous versions of PHP had horrible garbage collection, so long-running scripts could easily and accidentally consume far more memory than they were actually using. PHP 5.3 introduced a new garbage collector that can understand and clean up circular references, the number one cause of "memory leaks." It's enabled by default. The garbage collection section of the PHP manual has more info and pretty graphs.
As long as your code takes steps to allow variables to go out of scope at proper times and otherwise unset variables that will no longer be used, your script should not consume unnecessary amounts of memory just because it's PHP.
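As a sketch, assuming hypothetical collectMapData() and saveToMysql() steps, a loop like this keeps each cycle's data from piling up:

// Drop references as soon as they are no longer needed and let the cycle
// collector run between iterations.
while (true) {
    $data = collectMapData();
    saveToMysql($data);

    unset($data);           // let the large structure go out of scope
    gc_collect_cycles();    // PHP 5.3+: break any leftover circular references

    sleep(5);
}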
I don't think it's bad; as with anything that you want to run continuously, you have to be more careful.
There are libraries out there to help you with the task. Have a look at System_Daemon, which released RC 1 just over a month ago; it allows you to "Set options like max RAM usage".
Rather than running an infinite loop I'd be tempted to go with the cron option you mention in conjunction with a database table entry or flat-file that you'd use to store a "currently active" status bit to ensure that you didn't have overlapping processes attempting to run at the same time.
Whilst I realise that this would mean a minor delay before you perform the next iteration, this is probably a better idea anyway as:
It'll let the RDBMS perform any pending low-priority updates, etc. that may well have been on hold due to the amount of activity you've been carrying out.
Even if you neatly unset all the temporary variables you've been using, it's still possible that PHP will "leak" memory, although recent improvements (5.2 introduced a new memory management system and garbage collection was overhauled in 5.3) should hopefully mean that this is less of an issue.
In general, it'll also be easier to deal with other issues (if the DB connection temporarily goes down due to a config change and restart for example) if you use the cron approach, although in an ideal world you'd cater for such eventualities in your code anyway. (That said, the last time I checked, this was far from an ideal world.)
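Going back to the "currently active" flag mentioned above, a sketch of that guard using a lock file (the path is arbitrary):

// cron can run this every minute; flock() makes sure only one copy does the work.
$lock = fopen('/tmp/map_refresh.lock', 'c');

if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0);                // a previous run is still busy; let the next cron try
}

refreshMapData();           // hypothetical: the actual collection/update step

flock($lock, LOCK_UN);
fclose($lock);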
First, I fail to see why you need a daemon script to provide the functionality you describe.
Cron jobs are of course a possibility, but I want the refreshing of data to happen as soon as the previous interval has completed
Then neither a cron job nor a daemon is the way to solve the problem (unless the daemon becomes the data sink for the scripts). I'd spawn a dissociated process when the data is available, using a locking strategy to avoid concurrency.
Long-running PHP scripts are not intrinsically bad, but the reference-counting garbage collector does not deal with all possible scenarios for cleaning up memory; more recent implementations have a more advanced collector (a circular reference checker) which should clean up a lot more.
I currently have a PHP CLI script, using Zend Framework extensively, which seems to use an ever larger amount of memory as it runs. It loops through a large set of models retrieved from a database in batches of 1000. Calls to memory_get_usage() show that the memory usage of the script is always increasing.
This is despite making sure that I'm unsetting the model after every iteration and actually using array_shift() to reduce the size of the array of models on every iteration.
My question is, in PHP is there a way of discovering the size-in-memory of a variable so I can track what's growing?
I don't have a solution for checking the size of every variable, but if you use Doctrine, it's probably the reason.
You need to use
$Elem->free(true);
Another thing is to upgrade to PHP 5.3 (if you haven't already); the garbage collector in 5.3 is better.
No. You are likely looking for memory that is not freed, e.g. where you unset a variable or deleted a reference and the garbage collector has not yet released the associated block of memory.
You could try Zend Server 5 (you need the commercial version, though) to memory-profile your application. It has code tracing, although I don't know whether that would let you spot memory leaks.
Also see:
What's new in PHP V5.2, Part 1: Using the new memory manager
Memtrack (PECL)
I don't know how accurate it is, but I got a number by using apc_add('variable_name', $var). I then go to apc.php, look under the user cache entries, and check the size column.
Of course for this to work you need to have APC installed and running. :-)
Here is a code snippet I found at weberdev
<?php
// Recursively sums the string length of every value in the array, giving a
// rough "data size" in bytes (not the true in-memory footprint).
function array_size($a){
    $size = 0;
    foreach ($a as $v) {
        $size += is_array($v) ? array_size($v) : strlen($v);
    }
    return $size;
}
?>
It gets the approximate size of the data in the given array, in bytes. Is this what you meant?
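Another rough option, purely as an approximation: snapshot memory_get_usage() around a forced deep copy of the variable. Copy-on-write means the number is only indicative, and anything that can't be serialized will break it.

<?php
// Rough estimate of a variable's footprint: force an independent copy and
// measure the change in allocated memory. An approximation only.
function approx_var_size($var)
{
    $before = memory_get_usage();
    $copy   = unserialize(serialize($var));
    $after  = memory_get_usage();
    unset($copy);

    return $after - $before;
}

echo approx_var_size($models), " bytes (approx)\n";   // $models: the question's batch of models
?>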