I have the below PHP code. I want to be able to continue reading the text file from the point it stopped, and the text file is over 90mb.
Is it possible to continue reading from the point the script stopped running?
$in = fopen('email.txt', 'r');
while ($kw = trim(fgets($in))) {
    // my code
}
No, that's not easily possible without saving the current state from time to time.
However, instead of doing that, you should rather try to fix whatever causes your script to stop. set_time_limit(0); and ignore_user_abort(true); will most likely prevent your script from being stopped while it's running.
If you do want to be able to continue from some position, use ftell($in) to get the position and store it in a file/database from time to time. When starting the script, check whether you have a stored position and, if so, simply fseek($in, $offset); after opening the file.
If the script is executed from a browser and it takes enough time to make aborts likely, you could also consider splitting it in chunks and cleanly terminating the script with a redirect containing an argument where to continue. So your script would process e.g. 1000 lines and then be restarted with an offset of 1000 to process the next 1000 lines and so on.
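For illustration, a rough sketch of the save-and-resume approach described above (the offset file name and the every-1000-lines interval are arbitrary choices):

$offsetFile = 'email.txt.offset';              // where the last position is stored (placeholder name)
$offset = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;

$in = fopen('email.txt', 'r');
fseek($in, $offset);                           // jump back to where the last run stopped

$processed = 0;
while (($kw = fgets($in)) !== false) {
    $kw = trim($kw);
    // ... my code ...

    // Persist the current position every 1000 lines so a restart can resume here.
    if (++$processed % 1000 === 0) {
        file_put_contents($offsetFile, ftell($in));
    }
}
file_put_contents($offsetFile, ftell($in));    // store the final position as well
fclose($in);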
I have this file of 10 millions words, one word on every line. I'm trying to open that file, read every line, put it in an array and count the number of occurrences for each word.
wartek
mei_atnz
sommerray
swaggyfeed
yo_bada
ronnieradke
… and so on (10M+ lines)
I can open the file, read its size, even parse it line by line and echo each line in the browser (it's very long, of course), but when I try to perform any other operation, the script just refuses to execute. No error, no warning, no die(…), nothing.
Accessing the file is always OK; it's the operations on its contents that don't succeed the same way. I tried this and it worked…
while (!feof($pointer)) {
    $row = fgets($pointer);
    print_r($row);
}
… but this didn't:
while (!feof($pointer)) {
    $row = fgets($pointer);
    array_push($dest, $row);
}
I also tried SplFileObject and file($source, FILE_IGNORE_NEW_LINES), with the same result every time (not okay with the big file, okay with a small file).
Guessing that the issue is not the size (150 KB) but the length (10M+ lines), I chunked the file down to ~20k lines without any improvement, then reduced it again to ~8k lines, and that worked.
I also removed the time limit with set_time_limit(0); and removed (almost) any memory limit, both in php.ini and in my script with ini_set('memory_limit', '8192M');. Regarding the errors I could have, I set error_reporting(E_ALL); at the top of my script.
So the questions are:
is there a maximum number of lines that can be read by PHP built-in functions?
why can I echo or print_r the lines but not perform any other operation on them?
I think you might be running into the maximum execution time:
How to increase the execution timeout in php?
Different operations take different amounts of time. Printing might be a lot cheaper than pushing 10M+ new entries into an array one by one. It's strange that you don't get any error messages; you should receive a "maximum execution time exceeded" error somewhere.
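For what it's worth, if the end goal is just the per-word counts, a sketch that counts while reading instead of pushing 10M+ lines into an array might look like this (the file name is a placeholder):

$counts = array();
$fp = fopen('words.txt', 'r');                 // placeholder file name
while (($line = fgets($fp)) !== false) {
    $word = trim($line);
    if ($word === '') {
        continue;                              // skip blank lines
    }
    // Only the distinct words and their counts stay in memory,
    // not one array element per input line.
    $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
}
fclose($fp);

arsort($counts);                               // most frequent words first
print_r(array_slice($counts, 0, 20, true));    // show the top 20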
Using fgetcsv, can I somehow do a destructive read where rows I've read and processed would be discarded so if I don't make it through the whole file in the first pass, I can come back and pick up where I left off before the script timed out?
Additional Details:
I'm getting a daily product feed from a vendor that comes across as a 200 MB .gz file. When I unpack it, it turns into a 1.5 GB .csv with nearly 500,000 rows and 20-25 fields. I need to read this information into a MySQL DB, ideally with PHP, so I can schedule a cron job to run the script at my web hosting provider every day.
I have a hard timeout on the server set to 180 seconds by the hosting provider, and max memory utilization limit of 128mb for any single script. These limits cannot be changed by me.
My idea was to grab the information from the .csv using the fgetcsv function, but since I'm expecting to have to take multiple passes at the file because of the 3-minute timeout, I was thinking it would be nice to whittle away at the file as I process it, so I wouldn't need to spend cycles skipping over rows that were already processed in a previous pass.
From your problem description it really sounds like you need to switch hosts. Processing a 2 GB file with a hard time limit is not a very constructive environment. Having said that, deleting read lines from the file is even less constructive, since you would have to rewrite the entire 2 GB to disk minus the part you have already read, which is incredibly expensive.
Assuming you save how many rows you have already processed, you can skip rows like this:
$alreadyProcessed = 42; // for example
$i = 0;
while ($row = fgetcsv($fileHandle)) {
    if ($i++ < $alreadyProcessed) {
        continue;
    }
    ...
}
However, this means you're reading the entire 2 GB file from the beginning each time you go through it, which in itself already takes a while and you'll be able to process fewer and fewer rows each time you start again.
The best solution here is to remember the current position of the file pointer, for which ftell is the function you're looking for:
// Fall back to the start of the file if no position has been stored yet.
$lastPosition = is_file('last_position.txt') ? (int) file_get_contents('last_position.txt') : 0;

$fh = fopen('my.csv', 'r');
fseek($fh, $lastPosition);

while ($row = fgetcsv($fh)) {
    ...
    file_put_contents('last_position.txt', ftell($fh));
}
This allows you to jump right back to the last position you were at and continue reading. You obviously want to add a lot of error handling here, so you're never in an inconsistent state no matter which point your script is interrupted at.
You can avoid the timeout and memory errors to some extent by reading the file like a stream: read it line by line and insert each line into the database (or process it accordingly). That way only a single line is held in memory on each iteration. Please note: don't try to load a huge CSV file into an array, that really would consume a lot of memory.
if (($handle = fopen("yourHugeCSV.csv", 'r')) !== false) {
    // Get the first row (header)
    $header = fgetcsv($handle);

    // Loop through the file line by line
    while (($data = fgetcsv($handle)) !== false) {
        // Process your data
        unset($data);
    }
    fclose($handle);
}
I think a better solution (it would be phenomenally inefficient to continuously rewind and rewrite an open file stream) would be to track the file position of each record read (using ftell) and store it with the data you've read - then if you have to resume, just fseek to the last position.
You could try loading the file directly using MySQL's file-reading functionality (which will likely be a lot faster), although I've had problems with this in the past and ended up writing my own PHP code.
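If the host permits it, that MySQL route presumably means LOAD DATA INFILE; a rough sketch using PDO, where the DSN, credentials, table name, file path and the LOCAL-infile permission are all assumptions:

$pdo = new PDO('mysql:host=localhost;dbname=feed_db', 'user', 'pass', array(
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,    // assumes the driver and server permit LOCAL infile
));

$pdo->exec("LOAD DATA LOCAL INFILE '/path/to/feed.csv'
            INTO TABLE products
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES");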
I have a hard timeout on the server set to 180 seconds by the hosting provider, and max memory utilization limit of 128mb for any single script. These limits cannot be changed by me.
What have you tried?
Memory can be limited by means other than the php.ini file, but I can't imagine how anyone could actually prevent you from using a different execution time (even if ini_set is disabled, from the command line you could run php -d max_execution_time=3000 /your/script.php or php -c /path/to/custom/inifile /your/script.php).
Unless you are trying to fit the entire data file into memory, there should be no issue with a memory limit of 128 MB.
I have a file that I'm using to log IP addresses for a client. They want to keep the last 500 lines of the file. It is on a Linux system with PHP4 (oh no!).
I was going to add to the file one line at a time with new IP addresses. We don't have access to cron so I would probably need to make this function do the line-limit cleanup as well.
I was thinking of either using something like exec('tail [some params]') or maybe reading the file in with PHP, exploding it on newlines into an array, keeping the last 1000 elements, and writing it back. That seems kind of memory intensive though.
What's a better way to do this?
Update:
Per @meagar's comment below, if I wanted to use the zip functionality, how would I do that within my PHP script? (no access to cron)
if (rand(0, 10) == 10) {
    // roughly: zip the log file once it grows beyond ~1 MB
    shell_exec("find . -name 'logfile.txt' -size +1M -exec zip '{}.zip' '{}' \\;");
}
Will zip enumerate the files automatically if there is an existing file or do I need to do that manually?
The fastest way is probably, as you suggested, to use tail:
passthru("tail -n 500 $filename");
(passthru does the same as exec, only it sends the program's entire output straight to stdout. You can capture that output using an output buffer.)
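For instance, a small sketch of capturing that output with an output buffer (writing the trimmed content back over the log is an assumption about the goal):

ob_start();
passthru('tail -n 500 ' . escapeshellarg($filename));
$lastLines = ob_get_clean();     // the last 500 lines as one string

// Write just those lines back over the log (assumes that's the desired cleanup).
$fh = fopen($filename, 'w');
fwrite($fh, $lastLines);
fclose($fh);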
[edit]
I agree with a previous comment that a log rotate would be infinitely better... but you did state that you don't have access to cron so I'm assuming you can't do logrotate either.
logrotate
This would be the "proper" answer, and it's not difficult to set this up either.
You may get the number of lines using count(explode("\n", file_get_contents("log.txt"))), and if it is equal to 1000, get the substring from the first \n to the end, add the new IP address, and write the whole file again.
It's almost the same as writing the new IP by opening the file in a+ mode.
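A rough sketch of that idea, sticking to functions available in PHP 4 ($newIp and the file name are placeholders, and the limit is set to the 500 lines mentioned above):

$logFile  = 'log.txt';   // placeholder name
$maxLines = 500;

// Append the new entry ($newIp is a placeholder for the address being logged).
$fh = fopen($logFile, 'a');
fwrite($fh, $newIp . "\n");
fclose($fh);

// Trim the file back down if it has grown past the limit.
$lines = explode("\n", rtrim(file_get_contents($logFile), "\n"));
if (count($lines) > $maxLines) {
    $lines = array_slice($lines, -$maxLines);
    $fh = fopen($logFile, 'w');
    fwrite($fh, implode("\n", $lines) . "\n");
    fclose($fh);
}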
I have a DB of sensor data that is being collected every second. The client would like to be able to download 12-hour chunks in CSV format - this is all done.
The output is sadly not straight data and needs to be processed before the CSV can be created (parts are stored as JSON in the DB), so I can't just dump the table.
So, to reduce load, I figured that the first time the file is downloaded, I would cache it to disk, then any more requests just download that file.
If I don't try to write it (using file_put_contents with FILE_APPEND) and just echo every line, it is fine; but when writing it, the script runs out of memory even if I give it 512M.
so this works
while ($stmt->fetch()) {
    // processing code
    $content = // CSV formatting
    echo $content;
}
This does not
while ($stmt->fetch()) {
    // processing code
    $content = // CSV formatting
    file_put_contents($pathToFile, $content, FILE_APPEND);
}
It seems like even though I am calling file_put_contents on every line, it is storing it all in memory.
Any suggestions?
The problem is that file_put_contents is trying to dump the entire thing at once. Instead you should loop through in your formatting and use fopen, fwrite, fclose.
// Open the file once in append mode, then write each row as it is formatted.
$file = fopen($pathToFile, 'a');
while ($stmt->fetch()) {
    // processing code
    $content = // CSV formatting
    fwrite($file, $content);
}
fclose($file);
This will limit the amount of data being held in memory at any given time.
I agree completely with writing one line at a time; you will never have memory issues this way, since there is never more than one line loaded into memory at a time. I have an application that does the same. A problem I have found with this method, however, is that the file takes forever to finish writing. So this post is to back up what has already been said, but also to ask all of you for an opinion on how to speed this up.
For example, my system cleans a data file against a suppression file, so I read in one line at a time and look for a match in the suppression file; if no match is found, I write the line to the new cleaned file. A 50k-line file is taking about 4 hours to finish, however, so I am hoping to find a better way. I have tried this several ways, and at this point I load the entire suppression file into memory to avoid having my main reading loop run another loop through each line of the suppression file, but even that is still taking hours.
So, line by line is by far the best way to manage your system's memory, but I'd like to get the processing time for a 50k-line file (lines are email addresses and first and last names) down to under 30 minutes if possible.
FYI: the suppression file is 16,000 KB in size, and the total memory used by the script as reported by memory_get_usage() is about 35 MB.
Thanks!
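For illustration, one way to keep the line-by-line write but make each suppression lookup constant-time is to index the suppression entries as array keys; a sketch, assuming one email address per suppression line and that the email is the first CSV field (file names are placeholders):

// Build a hash set of suppressed addresses once: isset() on an array key is a
// constant-time lookup, whereas in_array() or a nested read loop scans the whole
// suppression list for every data line.
$suppressed = array();
$sh = fopen('suppression.txt', 'r');
while (($line = fgets($sh)) !== false) {
    $suppressed[strtolower(trim($line))] = true;
}
fclose($sh);

$in  = fopen('data.csv', 'r');
$out = fopen('cleaned.csv', 'w');
while (($row = fgetcsv($in)) !== false) {
    $email = strtolower(trim($row[0]));        // assumes the email address is the first field
    if (!isset($suppressed[$email])) {
        fputcsv($out, $row);                   // keep only the lines with no match
    }
}
fclose($in);
fclose($out);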
I'm running a continuous PHP loop that executes another PHP file using exec("php ..."). The plan is for the executed script to run, then sleep for 2 seconds, then start again. However, it seems like my loop is starting a new instance every 2 seconds instead. So, long story short: how do I get my first PHP script to wait until the execution of script nr 2 is complete?
All this is run from the command line. I would also like the echo output from script nr 2 to show up on the command line.
Any thoughts would help.
Thanks
Exec does not maintain any state information between instances. You could:
Loop in your subscript
OR
You could set some sort of environment variables that are read at the beginning of the subscript and written at the end.
OR
You could have the subscript read/write to a file in a similar fashion
OR
You could pass in parameters to the subscript whose output is captured
For outputting to the screen, you might play around with the other exec/system calls:
exec
shell_exec
passthru
system
I believe passthru() will work. Another possibility, if it doesn't, is to call exec() and use the output parameters to capture the output strings from the subscript, then just echo that output when the subscript returns.
I also believe that using the output parameters (or capturing the result of the function in a variable) will cause the exec to wait until the command is complete before continuing on.
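A small sketch of that approach (the script path is a placeholder): exec() blocks until the command finishes, and the captured lines can then be echoed by the parent loop:

while (true) {
    $output   = array();
    $exitCode = 0;

    // exec() waits for script2.php to finish, filling $output with its echoed lines.
    exec('php /path/to/script2.php', $output, $exitCode);

    // Relay the subscript's output to this script's command line.
    echo implode(PHP_EOL, $output), PHP_EOL;

    sleep(2);    // pause before starting the next run
}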
The problem is, once you execute the script, it will run. Another exec will start another instance, as you found out.
What you can do is
Put the sleep inside the executed script. Once it starts running, it will do its own sleep. You can look at setting an execution time limit and maybe ignoring user abort.
You can create a function and let your script call that function. It will then sleep after execution and call the function again.
// maybe set the time limit here
function loop()
{
    sleep(120);
    // you can add a check here whether to loop again or not
    loop();
}

loop();
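If the recursion depth is a concern (PHP does not optimise tail calls), the same idea can also be written as a plain loop; a sketch:

set_time_limit(0);          // let the script run indefinitely
ignore_user_abort(true);    // keep running even if the caller disconnects

while (true) {
    // ... do the work here ...
    sleep(120);
    // break out of the loop here if some stop condition is met
}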