Here is the situation:
I have an import running in PHP (basically, you can consider it a big while loop). But as there is a lot of data (hours of data to import), I can't do it in one request, otherwise I hit the PHP timeout error after 10 minutes.
In order to avoid that timeout issue, I've decided to cut my import into many parts: basically, I'm calling the same URL again but increasing the offset parameter by a thousand every 5 minutes.
This is also working, but after a number of redirects I get the "too many redirects" error.
This issue is tagged chrome, but if you have a solution for another browser I'll take it.
My question is: do I have a way in Chrome to increase the number of redirects that are allowed?
Or maybe the fix could be to temporarily remove the timeout from PHP? I'm struggling to know what the best solution would be. How do I do that?
First of all, I would not recommend relying on those redirects.
It would be way better to just set:
max_execution_time = 0
You don't have to change this setting for all of PHP; you can set it in your import script only.
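For example, at the top of the import script only:
// Disable the execution time limit for this request only;
// php.ini and all other scripts stay untouched.
set_time_limit(0);
// or, equivalently:
ini_set('max_execution_time', '0');
Note that some shared hosts block these calls, in which case the queue/cron approach below is the way to go.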
Do you have any possibility to change the source file of your import?
It would be better to break this file into smaller ones, and then you could use a message broker (e.g. RabbitMQ) to queue your files one by one into the import script.
If you can't change the source file because it comes from an external source, then you can chunk it on your own in your script. Then try to queue those chunks and import them one after another using a cron job or something similar.
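A rough sketch of that chunking, assuming a line-based source file (the paths and chunk size are placeholders):
// Split a big source file into smaller chunk files that a cron job
// or queue worker can import one after another.
$source    = new SplFileObject('import/source.csv');
$chunkSize = 1000;   // rows per chunk
$chunk     = 0;
$rows      = 0;
$out       = null;

while (!$source->eof()) {
    if ($rows % $chunkSize === 0) {
        $out = new SplFileObject(sprintf('import/chunks/chunk_%05d.csv', $chunk++), 'w');
    }
    $out->fwrite($source->fgets());
    $rows++;
}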
What is happening during this import?
Maybe you are trying to do just too much during import?
EDIT 2022-06
I am just curious whether people are using yield instead of returning all the data read from a file during such imports. To save the server's memory, it is highly recommended to do so.
It could be used like:
public function readFile(string $filePath): iterable
{
    $file = new SplFileObject($filePath);
    ...
    while (!$file->eof()) {
        $row = ...
        ...
        yield $row;
    }
}
Using the yield statement here gives us huge memory savings (especially while loading big files) and makes it possible to work on huge amounts of data smoothly.
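Consuming the generator then stays trivial; for example (the $importer object and importRow() are just illustrative names):
foreach ($importer->readFile('/path/to/big-import.csv') as $row) {
    // only the current row is held in memory at any time
    $importer->importRow($row);
}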
Related
G'day all,
This is actually the first question I have asked; however, I use Stack Overflow religiously with its awesome search function, but I have come to a stop here.
I've been writing a bit of PHP code that basically takes the user input for Australian airports, fetches the PDFs relevant to the aircraft type (for whatever reason the publisher releases them as single PDFs), and puts them into one PDF file. I've got it working reasonably smoothly now, but the last hitch in the plan is that when you place in lots of airfields (or ones with lots of PDFs) it exceeds the max_execution_time and gives me a 500 Internal Server Error. Unfortunately I'm with GoDaddy's shared hosting and can't change this, either in php.ini or in a script with set_time_limit(). This guy had the same problem and I have come out as fruitless as he: PHP GoDaddy maximun execution time not working
Anyway, apart from switching my hosting, my only thought is to break up the PHP code so it doesn't run all at once. The only problem is that I am running a foreach loop and I haven't the faintest idea where to start.
Here is the code I have for saving the PDFs:
foreach ($pos as $po) {
    file_put_contents("/dir/temp/$chartNumber$po", file_get_contents("http://www.airservicesaustralia.com/aip/current/dap/$po"));
    $chartNumber = $chartNumber + 1;
}
The array $pos is generated by a regex search of the website and takes very little time; it is the saving of the PDF files that kills me, and if it manages to get them all, the combining can take a bit of time as well with this code:
exec("gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=/dir/finalpdf/$date.pdf /dir/temp/*.pdf");
My question is, is there any way I can do each iteration of the foreach loop in a separate script, and then pick up where I left off? Or is it time to get new hosting?
Cheers in advance!
My suggestion would be to use AJAX requests, splitting the work into one request per file.
Here's how I would approach it:
Make a request to generate the $pos array and return it as JSON.
Make a request to generate each file, passing $po and its position in the array (assuming that's the $chartNumber); see the sketch after this list.
In jQuery, check whether the last file was generated (returned true), then call the script that writes the final file and returns the filename for download.
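As a sketch of step 2, the per-file endpoint could look roughly like this (the script name, parameters and paths are assumptions, not the asker's actual code):
// fetch_chart.php – called once per AJAX request, downloads a single chart
$po          = basename($_GET['po']);        // one entry of the $pos array from step 1
$chartNumber = (int) $_GET['chartNumber'];   // its position in the array

$pdf = file_get_contents('http://www.airservicesaustralia.com/aip/current/dap/' . rawurlencode($po));
$ok  = $pdf !== false
    && file_put_contents("/dir/temp/{$chartNumber}{$po}", $pdf) !== false;

header('Content-Type: application/json');
echo json_encode(array('ok' => $ok, 'po' => $po));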
But of course the best solution would be to switch to cloud hosting. I personally use digitalocean.com, where I'm running big PHP fetching scripts without any limitations.
I've taken Edvinas' advice and transferred to digitalocean.com, and the script is running now with no problems whatsoever. I have also managed to reduce the time by downloading each file with parallelcurl, which will download 5 at a time, so I can have a full 100-page file (larger than I expect I'll ever need) downloaded and generated in just under 5 minutes. I guess other than hosting the PDFs on my own server (in which case I may miss updates to charts), this is about as quick as I can get it to run.
Thanks for the advice!
Breaking down the operations into batches and running them serially will actually take longer than what you are currently doing. If the performance bottleneck is in creation of the component parts, a better solution would be to generate the parts in parallel.
the combining can take a bit of time as well with this code
Well, the first part of fixing any performance issue should be profiling to identify the bottleneck. Without direct admin access to the host there's not a lot you can do to speed up the execution of a single-line shell script, but if you can run shell commands then you can run a background job outside of the webserver process group.
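For example, the Ghostscript merge could be detached from the web request entirely (a sketch, assuming the host allows shell commands):
// nohup + & + redirected output let exec() return immediately,
// so the merge keeps running after the PHP request has finished.
$cmd = "gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite "
     . "-sOutputFile=/dir/finalpdf/$date.pdf /dir/temp/*.pdf";
exec("nohup $cmd > /dir/temp/merge.log 2>&1 &");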
I have to run an SQL query on a table (MySQL) with 600,000+ rows with CakePHP.
I'm testing locally how I can handle this huge table.
With CakePHP, when I first tried a simple ->find('all'), I got many errors about buffer sizes.
I added
ini_set("memory_limit", "-1");
set_time_limit(0);
in my index.php
Now, my page loads for a long time and then crashes.
When I try ->find('first') to get just the first row, same thing: the page loads for a long time and then hits an error.
Do you have any ideas about that?
Do you have any ideas about that?
Paginate it.
If you can't paginate it, you'll have to implement a job queue that does the work server side in the background, generates the document, and then provides it to the user as a download.
It is pretty obvious that you can't process a huge set of data without running into some technical limitations. With 600k rows, depending on how they're rendered, even the client (browser) will probably become terribly slow.
Change your approach; it's not going to work well this way. Or put a few hundred gigabytes of RAM into your server.
Setting ini_set("memory_limit", "-1"); is never a solution but a fugly workaround. Instead, make sure that your script always works within some predefined boundaries. Lifting the limit just means the script blows up on a server with 64 MB but works on one with 128 MB, and later blows up on the 128 MB one as well once you get more data.
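A minimal sketch of such a boundary, assuming a CakePHP 2-style model (MyModel is a placeholder) and that the rows can be handled in slices:
// Process the table in fixed-size slices instead of one giant find('all').
$limit  = 1000;
$offset = 0;

do {
    $rows = $this->MyModel->find('all', array(
        'limit'  => $limit,
        'offset' => $offset,
    ));

    foreach ($rows as $row) {
        // handle one row at a time here
    }

    $offset += $limit;
} while (count($rows) === $limit);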
I have quite a long, memory-intensive loop. I can't run it in one go because my server places a time limit on execution and/or I run out of memory.
I want to split up this loop into smaller chunks.
I had the idea of splitting the loop into smaller chunks and then setting a Location header to reload the script with new starting conditions.
MY OLD SCRIPT (Pseudocode. I'm aware of the shortcomings below)
for ($i = 0; $i < 1000; $i++)
{
    //FUNCTION
}
MY NEW SCRIPT
$start=$_GET['start'];
$end=$start+10;
for ($i = $start; $i < $end; $i++)
{
    //FUNCTION
}
header("Location:script.php?start=$end");
However, my new script runs successfully for a few iterations and then I get a server error "Too many redirects"
Is there a way around this? Can someone suggest a better strategy?
I'm on a shared server so I can't increase memory allocation or script execution time.
I'd like a PHP solution.
Thanks.
"Too many redirects" is a browser error, so a PHP solution would be to use cURL or standard streams to load the initial page and let it follow all redirects. You would have to run this from a machine without time-out limitations though (e.g. using CLI)
Another thing to consider is to use AJAX. A piece of JavaScript on your page will run your script, gather the output from your script and determine whether to stop (end of computation) or continue (start from X). This way you can create a nifty progress meter too ;-)
You probably want to look into forking child processes to do the work. These child processes can do the work in smaller chunks in their own memory space, while the parent process fires off multiple children. This is commonly handled by Gearman, but can be done without.
Take a look at Forking PHP on Dealnews' Developers site. It has a library and some sample code to help manage code that needs to spawn child processes.
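A bare-bones sketch of that idea without Gearman, assuming the pcntl extension and a CLI run (pcntl is not available under most web server SAPIs):
// Split the work into chunks and let each child process one chunk
// in its own memory space; the parent just waits for them all.
$chunks = array_chunk(range(0, 999), 100);   // placeholder work items

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    }
    if ($pid === 0) {            // child
        foreach ($chunk as $i) {
            // FUNCTION($i); – the real per-item work goes here
        }
        exit(0);                 // child must exit, or it will keep forking
    }
}

while (pcntl_waitpid(-1, $status) !== -1) {
    // reap children so no zombies are left behind
}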
Generally, if I have to iterate over something many, many times and it has a decent amount of data, I use a "lazy load" type approach like:
$data_holder = array();   // start with an empty chunk

for ($i = $start; $i < $end; $i++)
{
    $data_holder[] = "adding my big data chunks!";
    if ($i % 5 == 1) {
        // function to process data
        process_data($data_holder); // process that data like a boss!
        $data_holder = array();     // this frees up the memory held by the chunk
    }
}
// Now pick up the stragglers of whatever is left in the data chunk
if (count($data_holder) > 0) {
    process_data($data_holder);
}
That way you can continue to iterate through your data, but you don't stuff up your memory. You can work in chunks, then free the data, work in chunks, free the data, and so on, to help prevent memory problems. As far as execution time goes, that depends on how much you have to do and how efficiently your script is written.
The basic premise -- "Process your data in smaller chunks to avoid memory issues. Keep your design simple to keep it fast."
How about you put a conditional inside your loop to sleep every 100 iterations?
for ($i = 0; $i < 1000; $i++)
{
    if ($i % 100 == 0)
        sleep(1800); // Sleep for half an hour
}
First off, without knowing what you're doing inside the loop, it's hard to tell you the best approach to actually solving your issue. However, if you want to execute something that takes a really long time, my suggestion would be to set up a cron job and let it knock out small portions at a time. The script would log where it stopped, and the next time it starts up it could read the log to know where to start.
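A sketch of that cron approach with the position kept in a small state file (the file name and batch size are made up):
// import_batch.php – run every few minutes from cron, e.g.:
// */5 * * * * php /path/to/import_batch.php
$stateFile = __DIR__ . '/import.offset';
$batchSize = 10;

$start = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;
$end   = $start + $batchSize;

for ($i = $start; $i < $end; $i++) {
    //FUNCTION – the real per-item work
}

file_put_contents($stateFile, $end);   // remember where to pick up next time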
Edit: If you are dead set against cron, and you aren't too concerned about user experience, you could do this:
Let the page load similarly to the cron job above, except after so many seconds or iterations, stop the script. Output a refresh meta tag or a JavaScript refresh. Do this until the task is done.
With the limitations you have, I think the approach you are using could work. It may be that your browser is trying to be smart and not letting you redirect back to the page you were just on. It might be trying to prevent an endless loop.
You could try
Redirecting back and forth between two scripts that are identical (or aliases).
A different browser.
Having your script output an HTML page with a refresh tag, e.g.
<meta http-equiv="refresh" content="1; url=http://example.com/script.php?start=xxx">
Has anyone had any luck fixing the Simple_DOM memory problem? I scoured these forums and found only recommendations for other parsing engines.
My script loops through 20,000 files and extracts one word from each. I have to call the file_get_html function each time.
Moved it to a different server. Same result.
Changed the foreach loop to a while loop.
Increased the memory limit on either server. It won't work.
Yes, you can increase the memory with ini_set(), but only if you have the permission to do so.
What I recommend is that when you are going through your loop and have completed the task, you unset the variables that contain the large sets of data.
for ($i = 0; $i < 30000; $i++) {
    $file = file_get_contents($some_path.$i);
    // do something, like write to file
    // unset the variables
    unset($file);
}
Of course this is just an example, but you can relate it to your code and make sure every iteration is like running your file for the first time.
Wishing you good luck :)
Seems to me like the approach of processing that much data in a single execution is flawed. In my experience, PHP CLI processes aren't really meant to run for long periods of time and process tons of data. It takes very, very careful memory management to do so. Throw in a leaky 3rd-party script, and you have a recipe for banging your head against a desk.
Maybe instead of attempting to run through all 20k files at once, you could process a few hundred at a time, store the results somewhere intermediate, like a MySQL database, and then gather the results once all the files have been processed.
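A rough sketch of that batching, assuming PDO is available; the table name, glob pattern and DOM selector are invented for illustration:
// Process only a slice of the files per run and park results in MySQL,
// so no single PHP process has to survive all 20,000 files.
$offset = isset($argv[1]) ? (int) $argv[1] : 0;   // passed in by cron / a wrapper
$files  = array_slice(glob('/path/to/files/*.html'), $offset, 200);

$pdo    = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'pass');
$insert = $pdo->prepare('INSERT INTO words (file, word) VALUES (?, ?)');

foreach ($files as $path) {
    $html = file_get_html($path);                          // Simple HTML DOM, as in the question
    $word = trim($html->find('span.word', 0)->plaintext);  // the selector is an assumption
    $insert->execute(array(basename($path), $word));
    $html->clear();                                        // drop Simple DOM's internal references
    unset($html);
}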
I wrote a download counter:
$hit_count = @file_get_contents('download.txt');
$hit_count++;
@file_put_contents('download.txt', $hit_count);
header('Location: file/xxx.zip');
As simple as that. The problem is the stats number is truncated to 4 digits thus not showing the real count:
http://www.converthub.com/batch-image-converter/download.txt
The batch image converter program gets downloaded a couple of hundred times per day, and the PHP counter has been in place for months. The first time I found out about this was about 2 months ago, when I was very happy that it had hit the 8000 mark after a few weeks, yet a week after that it was at 500 again. And it has happened again and again.
No idea why. Why?
You're probably suffering from a race condition in the filesystem: you're attempting to open and read a file, then open the same file and write to it. The operating system may not have fully released its original lock on the file when you close it for reading and then open it for writing again straight away. If the site is as busy as you say, you could even have multiple instances of your script trying to access the file at the same time.
Failing that, do all your file operations in one go. If you use fopen(), flock(), fread(), rewind(), fwrite() and fclose() to handle the hit counter update, you can avoid having to close the file and open it again. If you use r+ mode, you'll be able to read the value, increment it, and write the result back in one go.
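That read-increment-write cycle could look roughly like this (a sketch of the approach just described; it still doesn't guard against every failure mode):
$fp = fopen('download.txt', 'r+');           // open for reading and writing
if ($fp !== false) {
    if (flock($fp, LOCK_EX)) {               // exclusive lock for the whole update
        $hit_count = (int) trim(fread($fp, 32));
        $hit_count++;
        rewind($fp);                         // back to the start of the file
        ftruncate($fp, 0);                   // wipe the old value before writing
        fwrite($fp, (string) $hit_count);
        fflush($fp);
        flock($fp, LOCK_UN);
    }
    fclose($fp);
}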
None of this can completely guarantee that you won't hit issues with concurrent accesses though.
I'd strongly recommend looking into a different approach to implementing your hit counter, such as a database driven counter.
Always do proper error handling; don't just suppress errors with @. In this case, it is probable that file_get_contents failed because the file was being written to at the time. Thus, $hit_count is set to FALSE, and $hit_count++ makes it 1. So your counter gets randomly reset to 1 whenever the read fails.
If you insist on writing the number to a file, do proper error checking and only write to the file if you are SURE you got the file open.
$hit_count = file_get_contents('download.txt');
if ($hit_count !== false) {
    $hit_count++;
    file_put_contents('download.txt', $hit_count);
}
header('Location: file/xxx.zip');
It will still fail occasionally, but at least it will not truncate your counter.
This is a kind of situation where having a database record the visits (which would allow for greater data-mining as it could be trended by date, time, referrer, location, etc) would be a better solution than using a counter in a flat file.
A likely cause is a collision between a read and a write action on the file (happening once every 8,000 instances or so). Adding the LOCK_EX flag to the file_put_contents() call (see the PHP reference) may prevent this, but I can't be 100% certain.
It would be better to look at recording the data in a database, as that is almost certain to prevent your current problem of losing the count.
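If you do switch to a database, the increment becomes a single atomic statement; for example (table and column names are placeholders):
// One UPDATE is atomic on the server, so concurrent downloads can't clobber each other.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$pdo->prepare('UPDATE downloads SET hits = hits + 1 WHERE file = ?')
    ->execute(array('batch-image-converter'));

header('Location: file/xxx.zip');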