I am processing a big .gz file using PHP (transferring data from the .gz file to MySQL);
it takes about 10 minutes per .gz file.
I have a lot of .gz files to process.
After PHP finishes one file, I have to manually edit the PHP script to select the next .gz file and then run the script again by hand.
I want it to automatically start the next job and process the next file.
The .gz files are named 1, 2, 3, 4, 5, ...
I could simply write a loop like this (process files 1 - 5):
for ($i = 1; $i <= 5; $i++)
{
    $file = gzfile($i.'.gz');
    ...gz content processing...
}
However, since the .gz files are really big, I cannot do that: with this loop a single PHP script would process several big .gz files in one run, which takes a lot of memory.
What I want is that after PHP finishes one job, a new job starts to process the next file.
Maybe it's going to be something like this:
$file = gzfile($_GET['filename'].'.gz');
...gz content processing...
Thank You
If you clean up after processing and free all memory using unset(), you could simply wrap the whole script in a foreach (glob(...) as $filename) loop. Like this:
<?php
foreach (glob(...) as $filename) {
// your script code here
unset($thisVar, $thatVar, ...);
}
?>
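A more concrete sketch of that idea, assuming the numbered files (1.gz, 2.gz, ...) sit in the same directory as the script and each line goes into MySQL (the '*.gz' pattern and the insert step are assumptions based on the question):
<?php
foreach (glob('*.gz') as $filename) {
    $lines = gzfile($filename);   // decompress the file into an array of lines
    foreach ($lines as $line) {
        // ...insert $line into MySQL here...
    }
    unset($lines);                // free the memory before moving to the next file
}
?>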
What you should do is:
Schedule a cron job to run your PHP script every x minutes.
When the script runs, check whether a lock file is in place: if not, create one and start processing the next unprocessed .gz file (a sketch follows this list); if there is one, abort.
Wait for the queue to get cleared.
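A rough sketch of that lock-file approach; the paths, the cron interval and the ".done" marker convention are assumptions, not anything from the question:
<?php
// cron entry (every 5 minutes, path is an assumption):
//   */5 * * * * php /path/to/process_next.php
$lock = '/tmp/gz-import.lock';

// if a previous run is still busy, abort
if (file_exists($lock)) {
    exit;
}
touch($lock);

// pick the next unprocessed file: the first N.gz without an N.gz.done marker
foreach (glob('*.gz') as $gzFile) {
    if (file_exists($gzFile . '.done')) {
        continue;
    }
    $lines = gzfile($gzFile);
    // ...insert the contents into MySQL here...
    touch($gzFile . '.done');   // mark this file as processed
    break;                      // only one file per cron run
}

unlink($lock);
?>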
You should call the PHP script with an argument, from a shell script. Here's the documentation on how to use command-line parameters in PHP: http://php.net/manual/en/features.commandline.php
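A minimal sketch of that command-line approach (the script name is just a placeholder):
<?php
// process.php - the file number comes in as the first CLI argument,
// e.g.  php process.php 3  (invoked from a shell script or cron)
$i = $argv[1];
$file = gzfile($i . '.gz');
// ...gz content processing...
?>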
Or (I can't try it now) you may give unset($file) a chance after processing the gzip:
for ($i = 1; $i <= 5; $i++)
{
    $file = gzfile($i.'.gz');
    ...gz content processing...
    unset($file);
}
My hosting is shared and caps set_time_limit at 30 seconds; I have already tried changing that in several ways via cPanel and .htaccess. I have many lines, spread across different files, to save.
Currently I am splitting the contents into several files so as not to exceed the time limit:
$lines = file(get_template_directory_uri() . '/lines1.csv', FILE_IGNORE_NEW_LINES);
foreach ($lines as $line_num => $line) {
    // here is some code that saves each line's content
}
But, someone told me to use the code:
exec("php csv_import.php > /dev/null &");
That would run a single .csv file in the background instead of multiple files, without hitting the time limit.
This is the first time I've dealt with the shell and PHP together, and I'm not sure how to go about it.
Do I have to create a file csv_import.php with the normal PHP code? And how do I run it in my server's shell?
If your host allows you to change the value, you can set a different time limit in the PHP file itself:
<?php
$minutes = 30; // just for easy management
$runfor = $minutes * 60;
set_time_limit($runfor);
?>
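As for the exec() line mentioned in the question: csv_import.php would just be an ordinary PHP script holding the import code. You can run it from the server's shell with php csv_import.php, or start it in the background from another page with exec("php csv_import.php > /dev/null &") so that page does not have to wait for it. A hedged sketch (the file name and path are assumptions):
<?php
// csv_import.php - does the whole import on its own, in a separate PHP process
$lines = file(__DIR__ . '/lines1.csv', FILE_IGNORE_NEW_LINES);
foreach ($lines as $line_num => $line) {
    // ...save each line's content here...
}
?>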
I have a directory that can contain CSV files coming in through a service, which I need to import into a database. There can be 10 to 150 of these CSV files, each about 1000 rows.
I want to insert the data of all these CSV files into the database. The problem is that PHP dies because of a timeout: even though I use set_time_limit(0), the server (siteground.com) imposes its own restrictions. Here is the code:
// just in case even though console script should not have problem
ini_set('memory_limit', '-1');
ini_set('max_input_time', '-1');
ini_set('max_execution_time', '0');
set_time_limit(0);
ignore_user_abort(1);
///////////////////////////////////////////////////////////////////
function getRow()
{
$files = glob('someFolder/*.csv');
foreach ($files as $csvFile) {
$fh = fopen($csvFile, 'r');
$count = 0;
while ($row = fgetcsv($fh)) {
$count++;
// skip header
if ($count === 1) {
continue;
}
// make sure count of header and actual row is same
if (count($this->headerRow) !== count($row)) {
continue;
}
$rowWithHeader = array_combine($this->headerRow, $row);
yield $rowWithHeader;
}
}
}
foreach(getRow() as $row) {
// fix row
// now insert in database
}
This is actually a command run through artisan (I am using Laravel). I know the CLI doesn't have time restrictions, but for some reason not all CSV files get imported and the process ends at a certain point.
So my question is: is there a way to invoke a separate PHP process for each CSV file in the directory? Or some other way of doing this so I can import all the CSV files without any issue, like PHP's generators, etc.?
You could just do some bash magic. Refactor your script so that it processes one file only; the file to process is an argument to the script, accessed via $argv.
<?php
// just in case even though console script should not have problem
ini_set('memory_limit', '-1');
ini_set('max_input_time', '-1');
ini_set('max_execution_time', '0');
set_time_limit(0);
ignore_user_abort(1);
$file = $argv[1]; // file is the first and only argument to the script
///////////////////////////////////////////////////////////////////
function getRow($csvFile)
{
    $fh = fopen($csvFile, 'r');
    $headerRow = [];
    $count = 0;
    while ($row = fgetcsv($fh)) {
        $count++;
        // the first row is the header
        if ($count === 1) {
            $headerRow = $row;
            continue;
        }
        // make sure count of header and actual row is same
        if (count($headerRow) !== count($row)) {
            continue;
        }
        $rowWithHeader = array_combine($headerRow, $row);
        yield $rowWithHeader;
    }
    fclose($fh);
}
foreach(getRow($file) as $row) {
// fix row
// now insert in database
}
Now, call your script like this:
for file in /path/to/folder/*.csv; do php /path/to/your/script.php "$file"; done
This will execute your script for each .csv file in your /path/to/folder
The best approach is to process a limited number of files per PHP process. For example, you can start with 10 files (work the exact number out empirically), process them, mark them as done (e.g. move them to a folder of processed files) and stop the process. After that, start a new process to import the next 10 files, and so on. In Laravel you can make sure a command is not started while another instance of it is still running. The Laravel scheduler entry for that is below:
$schedule->command("your job")->everyMinute()->withoutOverlapping();
If you use this approach you can be sure that all files will be processed within a reasonable time and that no single run will consume so many resources that it gets killed.
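A hedged sketch of such a batch command; the class name, batch size and folder names are assumptions, and the scheduler entry above is driven by the usual "* * * * * php artisan schedule:run" cron entry:
<?php
// app/Console/Commands/ImportCsvBatch.php - processes at most 10 files per run
// and moves them to a 'processed' folder so the next run picks up fresh ones.
namespace App\Console\Commands;

use Illuminate\Console\Command;

class ImportCsvBatch extends Command
{
    protected $signature = 'csv:import-batch';
    protected $description = 'Import the next batch of CSV files';

    public function handle()
    {
        $files = array_slice(glob('someFolder/*.csv'), 0, 10); // up to 10 files per run

        foreach ($files as $csvFile) {
            // ...read the rows and insert them into the database here...

            // mark the file as processed by moving it out of the input folder
            rename($csvFile, 'someFolder/processed/' . basename($csvFile));
        }
    }
}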
If your hosting provider allows cron jobs, they don't have a timeout limit.
They also fit heavy, long-running tasks better than calling the function manually, since calling the method several times by hand could cause huge problems.
I am currently rewriting a file uploader. The parsing scripts for the different data types are Perl scripts; the program itself is written in PHP. Currently it allows a single file upload only, and once the file is on the server it calls the Perl parser for that file's data type. We have over 20 data types.
What I have done so far is write a new system that allows multiple file uploads. It first lets you validate your attributes before upload, compresses the files with zipjs, uploads the zipped file, uncompresses it on the server and, for each file, calls the parser for it.
I am now at the part where, for each file, I need to put the parser call in the queue. I cannot run multiple parsers at once. A rough sketch is below.
foreach ($files as $file) {
    $job = "exec('location/to/file/parser.pl " . $file . "');";
    // using the pheanstalkd library
    $this->pheanstalk->useTube('testtube')->put($job);
}
Depending on the file, parsing may take 2 minutes or 20 minutes. When I put the jobs on the queue, I need to make sure that the parser for file2 only fires after the parser for file1 finishes. How can I accomplish that? Thanks.
Beanstalk doesn't have the notion of dependencies between jobs. You seem to have two jobs:
Job A: Parse file 1
Job B: Parse file 2
If you need job B to run only after job A, the most straightforward way to do this is for Job A to create Job B as its last action.
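A minimal sketch of that chaining, using the pheanstalk calls already shown in the question; the job payload being the file name and the next_unparsed_file() helper are assumptions:
<?php
// worker: reserve a parse job, run the parser, and only then queue the next file
$job = $pheanstalk->watch('testtube')->reserve();
$file = $job->getData();                          // e.g. 'file1'

exec('location/to/file/parser.pl ' . escapeshellarg($file)); // blocks until the parser exits

// last action of job A: create job B for the next file, if there is one
$next = next_unparsed_file($file);                // hypothetical helper
if ($next !== null) {
    $pheanstalk->useTube('testtube')->put($next);
}

$pheanstalk->delete($job);                        // job A is done
?>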
I have achieved what I wanted, which was to request more time if the parser takes longer than a minute. The worker is a PHP script, and I can get the process id when I execute the "exec" command for the parser executable. I am currently using the code snippet below in my worker.
$job = $pheanstalk->watch( $tubeName )->reserve();
// do some more stuff here ... then
// while the parser is running on the server
while( file_exists( "/proc/$pid" ) )
{
// make sure the job is still reserved on the queue server
if( $job ) {
// get the time left on the queue server for the job
$jobStats = $pheanstalk->statsJob( $job );
// when there is not enough time, request more
if( $jobStats['time-left'] < 5 ){
echo "requested more time for the job at ".$jobStats['time-left']." secs left \n";
$pheanstalk->touch( $job );
}
}
}
How to process big text file in PHP?
In Python one can use generators and read a file line by line without loading the whole file into memory. Is there something like generators in PHP?
You can run your PHP script from a batch file:
make a new file (run.bat)
right-click and edit
and put this in the file:
c:/PHP/php.exe -f .\yourscript.php
Change the path to php.exe and yourscript.php to where they are located.
Then loop through the file with this:
<?php
$data = file("the big file.txt");
$lines = count($data);
$x = 0;
while ($x < $lines) {
    print $data[$x] . "\n";
    sleep(1); // print out a new line every 1 sec
    $x++;
}
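To the actual question about generators: PHP has them too (the yield keyword, since PHP 5.5). A minimal sketch that reads a big file line by line without loading it all into memory; the file name is just an example:
<?php
function readLines($path)
{
    $fh = fopen($path, 'r');
    while (($line = fgets($fh)) !== false) {
        yield rtrim($line, "\r\n");   // hand back one line at a time
    }
    fclose($fh);
}

foreach (readLines('the big file.txt') as $line) {
    print $line . "\n";
}
?>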
I've done a little bit of PHP coding and am familiar with aspects of it.
I have made a PHP script that runs as a cron job that will pull data from a database and if certain conditions are met, some information is written to a file.
Because there may be more than one result in the database, a loop is done to run through each result in the database.
Within that loop, I have another loop which writes data to a file. A cron job then calls this file every minute and runs its contents as a bash script.
So the PHP loop is set up to check, via the filesize() function, whether anything has been written to the file. If the file size is not zero, it sleeps for 10 seconds and checks again. Here is the code:
while(filesize('/home/cron-script.sh') != 0)
{
sleep(10);
}
Unfortunately, when filesize() is run, it seems to place some kind of lock or something on the file. The cron job can execute the bash script without a problem, and the very last command in the script zeroes out the file:
cat /dev/null > /home/cron-script.sh
But it seems that once the while loop above has started, it locks in the original file size. As an example, I simply put the word "exit" in the cron-script.sh file and then ran this test script:
while(filesize("/home/cron-script.sh") != 0)
{
echo "filesize: " . filesize("/home/cron-script.sh");
sleep(10);
}
The loop is infinite and will continue to show "filesize: 4" when I put in the word "exit". I will then issue the command at the terminal:
cat /dev/null > /home/cron-script.sh
That clears the file while the test script above is running, but the script continues to report a file size of 4 and never returns to 0, so the PHP script runs until the execution time limit is reached.
Could anyone give me some advice on how to resolve this issue? In essence, I just need some way of reading the file size, and if there is any data in the file, to loop through a sleep routine until the file is cleared. The file should clear within one minute (since the cron job calls cron-script.sh every minute).
Thank you!
From http://www.php.net/manual/en/function.filesize.php
Note: The results of this function are cached. See clearstatcache() for more details.
To resolve this, remember to call clearstatcache() before calling filesize():
while(filesize("/home/cron-script.sh") != 0)
{
echo "filesize: " . filesize("/home/cron-script.sh");
sleep(10);
clearstatcache();
}
The results of filesize() are cached.
You can use clearstatcache() to clear the cache on each iteration of the loop.