How to fix Simple_DOM memory limit fatal error? - php

Has anyone had any luck with fixing the Simple_DOM memory problem? I scoured these forums and found only recommendations for other parsing engines.
My script loops through 20,000 files and extracts one word from each. I have to call the file_get_html function each time.
I moved it to a different server. Same result.
I changed the foreach loop to a while loop.
I tried increasing the memory limit on either server; it won't work.

Yes, you can increase the memory with ini_set(), but only if you have permission to do so.
What I recommend is that as you go through your loop, once you complete each task, you unset the variables that hold the large sets of data.
for ($i = 0; $i < 30000; $i++) {
    $file = file_get_contents($some_path . $i);
    // do something, like write to file
    // then unset the variables
    unset($file);
}
Of course this is just an example, but you can adapt it to your code and make sure every iteration starts as if the script were running for the first time.
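Since you mentioned file_get_html, here is a rough sketch of the same idea for the simplehtmldom library; the $files array and the 'title' selector are placeholders, and I'm assuming your version of the library has the usual find() and clear() methods (clear() breaks the parser's internal circular references, which plain unset() alone often doesn't free):
foreach ($files as $path) {
    $html = file_get_html($path);                  // simplehtmldom parser object
    $word = $html->find('title', 0)->plaintext;    // placeholder: grab the one word you need
    // release the DOM before the next file
    $html->clear();                                // drop the parser's circular references
    unset($html);
}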
Wish you Good luck :)

Seems to me like the approach of processing that much data during a single execution is flawed. In my experience, PHP CLI processes aren't really meant to run for long periods of time and process tons of data. It takes very, very careful memory management to do so. Throw in a leaky 3rd-party script, and you have a recipe for banging your head against a desk.
Maybe instead of attempting to run through all 20k files at once, you could process a few hundred at a time, store the results somewhere intermediate, like a MySQL database, and then gather the results once all the files have been processed.
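As a rough illustration of that batching idea (the database name, the results table, $allFiles and the extractWord() helper are all hypothetical):
// e.g. process_batch.php?offset=0, then ?offset=200, and so on
$pdo    = new PDO('mysql:host=localhost;dbname=words', 'user', 'pass');   // adjust credentials
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$batch  = array_slice($allFiles, $offset, 200);                           // a few hundred files per run

foreach ($batch as $path) {
    $word = extractWord($path);                                           // hypothetical per-file work
    $stmt = $pdo->prepare('INSERT INTO results (path, word) VALUES (?, ?)');
    $stmt->execute([$path, $word]);                                       // park the result in MySQL
}
// once every batch has been run, one query over the results table gathers everything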

Related

PHP Timeout and TOO_MANY_REDIRECTS

Here is the situation:
I have an import running in PHP (basically, you can consider it a big while loop). But as there is a lot of data (hours of data to import), I can't do it in one request, otherwise I hit the PHP timeout error after 10 minutes.
In order to avoid that timeout issue, I've decided to cut my import into many parts... basically, I'm calling the same URL again but increasing the offset parameter by a thousand every 5 minutes.
This also works... but after some redirects I hit the "too many redirects" error.
This issue is tagged chrome, but if you have a solution for another browser I'll take it.
My question is: do I have a way in Chrome to increase the number of redirects that are allowed?
Or maybe the fix could be to temporarily remove the timeout from PHP? I'm struggling to know what the best solution could be. How do I do that?
First of all, I would not recommend going down that redirect route.
It would be way better to just set:
max_execution_time = 0
You don't have to change this setting for all of PHP; you can set it in your import script.
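For example, at the top of the import script (either call should work, provided the host allows overriding the limit):
set_time_limit(0);                     // no execution-time limit for this script only
// or, equivalently:
ini_set('max_execution_time', '0');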
Do you have any possibility to change source file of your import?
It would be better to break this file into smaller ones, and then you could use a message broker (e.g. RabbitMQ) to queue the files one by one into the import script.
If you can't change the source file because it comes from an external source, then you can chunk it yourself in your script. Then try to queue those chunks and import them one after another using a cron job or something similar.
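A minimal sketch of that chunk-per-cron-run idea, assuming a line-based source file; the file paths, chunk size and the importLine() function are placeholders, and a cron job would simply call this script every few minutes:
$offsetFile = '/tmp/import_offset';
$offset     = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;
$chunkSize  = 1000;

$file = new SplFileObject('/path/to/source_export.csv');
$file->seek($offset);                                  // jump to the line where the last run stopped

for ($i = 0; $i < $chunkSize && !$file->eof(); $i++) {
    importLine($file->current());                      // hypothetical: import a single line
    $file->next();
}

file_put_contents($offsetFile, $offset + $i);          // remember progress for the next cron run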
What is happening during this import?
Maybe you are trying to do just too much during import?
EDIT 2022-06
I am just curious whether people are using yield instead of returning the whole data set read from a file during such imports. To save the server's memory, it is highly recommended to do so.
It could be used like:
public function readFile(string $filePath): iterable
{
    $file = new SplFileObject($filePath);

    while (!$file->eof()) {
        $row = $file->fgets();   // read one line; fgetcsv() would work for CSV input
        yield $row;              // hand back a single row without buffering the whole file
    }
}
Using the yield statement here gives huge memory savings (especially when loading big files) and makes it possible to work smoothly on huge amounts of data.
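Consuming the generator then looks something like this (the $importer object, the file path and processRow() are placeholders):
foreach ($importer->readFile('/path/to/import.csv') as $row) {
    processRow($row);    // only the current row is held in memory at any time
}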

Splitting PHP script into chunks to avoid max_execution_time

G'day all,
This is actually the first question I have asked, however I use stack overflow religiously with its awesome search function, but I have come to a stop here.
I've been writing a bit of PHP code that basically takes the user's input for Australian airports, fetches the PDFs relevant to the aircraft type (for whatever reason the publisher releases them as single PDFs), and puts them into one PDF file. I've got it working reasonably smoothly now, but the last hitch in the plan is that when you put in lots of airfields (or ones with lots of PDFs) it exceeds the max_execution_time and gives me a 500 Internal Server Error. Unfortunately I'm with GoDaddy's shared hosting and can't change this, either in the php.ini or in a script with set_time_limit(). This guy had the same problem and I have come out as fruitless as he: PHP GoDaddy maximun execution time not working
Anyway, apart from switching my hosting, my only thought is to break up the PHP code so it doesn't run all at once. The only problem is that I am running a foreach loop and I haven't the faintest idea where to start.
Here is the code I have for the saving of the PDF's:
foreach ($pos as $po) {
    file_put_contents("/dir/temp/$chartNumber$po", file_get_contents("http://www.airservicesaustralia.com/aip/current/dap/$po"));
    $chartNumber = $chartNumber + 1;
}
The array $pos is generated by a regex search of the website and takes very little time, it is the saving of the PDF files that kills me, and if it manages to get them all, the combining can take a bit of time as well with this code:
exec("gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=/dir/finalpdf/$date.pdf /dir/temp/*.pdf");
My question is, is there any way I can do each iteration of the foreach loop in a separate script, and then pick up where I left off? Or is it time to get new hosting?
Cheers in advance!
My suggestion would be to use AJAX requests, splitting each request per file.
Here's how I would approach it:
Make a request to generate the $pos array and return it as JSON.
Make a request to generate each file, passing $po and its position in the array (assuming that's the $chartNumber); see the sketch below.
Check in jQuery whether the last file was generated (returned true), and call the script that writes the final file, returning the filename for download.
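A rough sketch of what the step 2 endpoint could look like; the script name and parameter names are made up, and it just reuses the download line from your question:
// fetch_chart.php?po=<chart id>&chartNumber=<position in the $pos array>
$po          = basename($_GET['po']);                  // crude sanitising of the chart id
$chartNumber = (int) $_GET['chartNumber'];

file_put_contents(
    "/dir/temp/$chartNumber$po",
    file_get_contents("http://www.airservicesaustralia.com/aip/current/dap/$po")
);

header('Content-Type: application/json');
echo json_encode(array('done' => true, 'chartNumber' => $chartNumber));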
But of course the best solution would be to switch to a cloud hosting. I personally use digitalocean.com, where I'm running big PHP fetching scripts without any limitations.
I've taken Edvinas' advice and transferred to digitalocean.com and have the script running now with no problems whatsoever. I have also managed to reduce the time by downloading each file with parallelcurl, which will download 5 at a time, so I can have a full 100-page file (larger than I expect I'll ever need) downloaded and generated in just under 5 minutes. I guess other than hosting the PDFs on my own server (in which case I might miss updates to charts), this is about as quick as I can get it to run.
Thanks for the advice!
Breaking down the operations into batches and running them serially will actually take longer than what you are currently doing. If the performance bottleneck is in creation of the component parts, a better solution would be to generate the parts in parallel.
the combining can take a bit of time as well with this code
Well, the first part of fixing any performance issue should be profiling to identify the bottleneck. Without direct admin access to the host there's not a lot you can do to speed up the execution of a single shell command, but if you can run shell commands then you can run a background job outside of the webserver process group.
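For instance, the Ghostscript merge from the question could be pushed into the background so it no longer counts against the request's execution time (a sketch, assuming exec() is permitted on the host):
// '&' detaches the job; nohup plus output redirection keep it running after the request ends
exec("nohup gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite "
   . "-sOutputFile=/dir/finalpdf/$date.pdf /dir/temp/*.pdf > /dev/null 2>&1 &");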

How do I split a really long, memory intensive loop into smaller chunks

I have quite a long, memory-intensive loop. I can't run it in one go because my server places a time limit on execution and/or I run out of memory.
I want to split up this loop into smaller chunks.
I had an idea to split the loop into smaller chunks and then set a location header to reload the script with new starting conditions.
MY OLD SCRIPT (Pseudocode. I'm aware of the shortcomings below)
for($i=0;$i<1000;$i++)
{
//FUNCTION
}
MY NEW SCRIPT
$start = (int) $_GET['start'];
$end   = $start + 10;
for ($i = $start; $i < $end; $i++)
{
    //FUNCTION
}
header("Location: script.php?start=$end");
However, my new script runs successfully for a few iterations and then I get a server error "Too many redirects"
Is there a way around this? Can someone suggest a better strategy?
I'm on a shared server so I can't increase memory allocation or script execution time.
I'd like a PHP solution.
Thanks.
"Too many redirects" is a browser error, so a PHP solution would be to use cURL or standard streams to load the initial page and let it follow all redirects. You would have to run this from a machine without time-out limitations though (e.g. using CLI)
Another thing to consider is to use AJAX. A piece of JavaScript on your page will run your script, gather the output from your script and determine whether to stop (end of computation) or continue (start from X). This way you can create a nifty progress meter too ;-)
You probably want to look into forking child processes to do the work. These child processes can do the work in smaller chunks in their own memory space, while the parent process fires off multiple children. This is commonly handled by Gearman, but can be done without.
Take a look at Forking PHP on Dealnews' Developers site. It has a library and some sample code to help manage code that needs to spawn child processes.
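If the pcntl extension is available (CLI only), a bare-bones version without Gearman might look like this; the chunk size of 100 is arbitrary and "FUNCTION" stands in for your real work:
$chunks = array_chunk(range(0, 999), 100);       // the 1000 iterations, split into chunks of 100

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === 0) {                            // child: works on one chunk in its own memory space
        foreach ($chunk as $i) {
            // FUNCTION for iteration $i
        }
        exit(0);                                 // all of the child's memory is released here
    }
    pcntl_waitpid($pid, $status);                // parent: wait for the child before starting the next one
}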
Generally if I have to iterate over something many many times and it has a decent amount of data, I use a "lazy load" type application like:
$data_holder = array();
for ($i = $start; $i < $end; $i++)
{
    $data_holder[] = "adding my big data chunks!";
    if ($i % 5 == 1) {
        // function to process data
        process_data($data_holder); // process that data like a boss!
        unset($data_holder);        // this frees up the memory
        $data_holder = array();     // start the next chunk fresh
    }
}
// Now pick up the stragglers of whatever is left in the data chunk
if (!empty($data_holder)) {
    process_data($data_holder);
}
That way you can continue to iterate through your data, but you don't stuff up your memory. You can work in chunks, then unset the data, work in chunks, unset the data, and so on, to help keep memory under control. As far as execution time goes, that depends on how much you have to do and how efficiently your script is written.
The basic premise -- "Process your data in smaller chunks to avoid memory issues. Keep your design simple to keep it fast."
How about you put a conditional inside your loop to sleep every 100 iterations?
for ($i = 0; $i < 1000; $i++)
{
    if ($i % 100 == 0) {
        sleep(1800); // sleep for half an hour
    }
}
First off, without knowing what you're doing inside the loop, it's hard to tell you the best approach to actually solving your issue. However, if you want to execute something that takes a really long time, my suggestion would be to set up a cron job and let it knock out little portions at a time. The script would log where it stops, and the next time it starts up it could read the log for where to start.
Edit: If you are dead set against cron, and you aren't too concerned about user experience, you could do this:
Let the page load similarly to the cron job above, except after so many seconds or iterations, stop the script. Display a refresh meta tag or a JavaScript refresh. Do this until the task is done.
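A sketch of that no-cron variant; processItem() and the $items source are placeholders, and the 20-second budget is an arbitrary value you would set below your host's limit:
$offset  = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$started = microtime(true);

for ($i = $offset; $i < count($items); $i++) {
    processItem($items[$i]);                       // placeholder for the real work
    if (microtime(true) - $started > 20) {         // stop well before the server's limit
        echo '<meta http-equiv="refresh" content="1; url=script.php?offset=' . ($i + 1) . '">';
        exit;
    }
}
echo 'Done.';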
With the limitations you have, I think the approach you are using could work. It may be that your browser is trying to be smart and not letting you redirect back to the page you were just on. It might be trying to prevent an endless loop.
You could try
Redirecting back and forth between two scripts that are identical (or aliases).
A different browser.
Having your script output an HTML page with a refresh tag, e.g.
<meta http-equiv="refresh" content="1; url=http://example.com/script.php?start=xxx">

How does the garbage collector work in PHP

I have a PHP script that has a large array of people; it grabs their details from an external resource via SOAP, modifies the data and sends it back. Due to the size of the details I upped PHP's memory limit to 128MB. After about 4 hours of running (it will probably take 4 days to run) it ran out of memory. Here's the basics of what it does:
$people = getPeople();
foreach ($people as $person) {
    $data = get_personal_data();
    if ($data == "blah") {
        importToPerson("blah", $person);
    } else {
        importToPerson("else", $person);
    }
}
After it ran out of memory and crashed, I decided to initialise $data before the foreach loop, and according to top, memory usage for the process hasn't risen above 7.8% and it's been running for 12 hours.
So my question is, does PHP not run a garbage collector on variables initialised inside the loop even if they are reused? Is the system reclaiming the memory while PHP hasn't marked it as usable yet, meaning it will eventually crash again? (I've upped the limit to 256MB now, so I've changed two things and I'm not sure which has fixed it. I could change my script back to answer this, but I don't want to wait another 12 hours for it to crash to find out.)
I'm not using the Zend Framework, so I don't think the other question like this is relevant.
EDIT: I don't actually have an issue with the script or what it's doing. At the moment, as far as all system reporting is concerned I don't have any issues. This question is about the garbage collector and how / when it reclaims resources in a foreach loop and / or how the system reports on memory usage of a php process.
I don't know the insides of PHP's VM, but from my experience, it doesn't garbage collect whilst your page is running. This is because it throws away everything your page created when it finishes.
Most of the time, when a page runs out of memory and the limit is pretty high (and 128MB isn't high), there is an algorithm problem. Many PHP programmers assemble a structure of data, then pass it to the next step, which iterates over the structure, usually creating another one. Lather, rinse, repeat. Unfortunately, this approach is a big memory hog and you end up creating multiple copies of your data in memory. Two of the really big changes in PHP 5 were that objects are reference counted, not copied, and that the entire string subsystem was made much, much faster. But it's still a problem.
To minimise memory use, you would look at re-structuring your algorithm so it can work with one piece of data from start to finish. Then you get the next and start again. Best case scenario is that you don't ever have the entire dataset in memory. For a database-backed website, this would mean processing a row of data from a database query all the way to presentation before getting the next. Of course, this approach isn't always possible and the script just has to keep a huge wodge of data in memory.
That said, you can take this memory-saving approach for part of the data. The trick is that you explicitly unset() a key variable or two at the end of the loop; this should reclaim the space. The other "best practice" trick is to move data manipulation that doesn't need to be in the loop out of it, as you seem to have discovered.
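Applied to the loop from your question, the unset() trick would look something like this:
foreach ($people as $person) {
    $data = get_personal_data();
    if ($data == "blah") {
        importToPerson("blah", $person);
    } else {
        importToPerson("else", $person);
    }
    unset($data);    // explicitly drop the large per-person payload before the next iteration
}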
I've run PHP scripts that need upwards of 1Gb of memory. You can set the memory limit per script, actually, with ini_set('memory_limit', '1G');
Use memory_get_usage() to see what's going on. You could put it inside the loop to see how memory allocation behaves over time.
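For example, something along these lines inside the existing loop:
foreach ($people as $person) {
    // ... existing work ...
    error_log('Memory: ' . round(memory_get_usage() / 1048576, 2) . ' MB'
        . ' (peak: ' . round(memory_get_peak_usage() / 1048576, 2) . ' MB)');
}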
Have you tried looking at the system monitor or whatever to see how much memory php is using during that process?

Fatal error: Allowed memory size- PHP expert advice needed

Hello
I have a PHP program with my own classes. I am using a XAMPP server in the office.
It has 4 basic parts:
1) The program reads one MySQL record ("SELECT a, b, c..."),
then loops a few times to total it.
2) Echoes some variables to the screen (about 10).
3) Inserts a record into a second MySQL table (a summary of the first group).
4) Clears the variables (30 or so).
Now I have read many pages on this topic including about 15 here. I know I can up the memory, but that is not the solution I am looking for.
I now have about 3,000 records in the first database and I expect it to grow to 1,000 times that.
I have years of programming experience and can see the problem. In a language like C, when you do this kind of loop, the results are displayed immediately. But with PHP it cycles and nothing is displayed until the operation is over, or until it stops like it does here. I know that is what is filling up the memory. I could just not display any variables, but that would make it difficult to debug.
So how do I resolve this? I looked at flush() and ob_flush(). Is that what is needed so this does not all accumulate in memory?
Thank You in advance
Why don't you fetch 5,000 records at a time, then write them to disk (like a cache)?
Can we see some code too please?
Unlike the command line interface, if you have output buffering set to automatic or your script has ob_start(), you do need ob_flush() AND flush() to commit data to the browser output. Otherwise, only flush() is necessary. This is the browser's behavior and the nature of HTTP connections.
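In your loop that would look roughly like this; I'm assuming a mysqli result set in $result, and the echo line is just a placeholder for whatever you display:
while ($row = $result->fetch_assoc()) {          // assuming a mysqli result set
    // ... total the values, echo them, insert the summary record ...
    echo "Processed record {$row['a']}\n";
    if (ob_get_level() > 0) {
        ob_flush();                              // flush PHP's own output buffer first
    }
    flush();                                     // then push it through the web server to the browser
}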
