Background process for importing data in PHP

Details
When a user first logs into my app I need to import all of their store's products from an API; this can be anywhere from 10 products to 11,000. So I'm thinking I need to inform the user that we'll import their products and email them when we're finished.
Questions
What would be the best way to go about importing this data without requiring the user to stay on the page?
Should I go down the pcntl_fork route?
Would system style background tasks be better?

AFAIK there is no way to pcntl_fork() from a web server process; you can only do it from the command line. You can, however, start a child process using exec() (or similar) that will continue to run after your script has terminated.
I don't know how "correct" this is, but I would do something like this:
upload.php - Get the user to upload their products list in whatever format you want. I shall assume you know how to do this and won't include any code - if you want an example let me know.
store.php - the upload form submits to this file:
// Make sure a file was uploaded and you have the user's ID
if (!isset($_FILES['file'], $_POST['userId']))
    exit('No file uploaded or bad user ID');
// Make sure the upload was successful
if ($_FILES['file']['error'])
    exit('File uploaded with error code '.$_FILES['file']['error']);
// Generate a temp name and store the file for processing
$tmpname = microtime(TRUE).'.tmp';
$tmppath = '/tmp/'; // ...or wherever you want to temporarily store the file
if (!move_uploaded_file($_FILES['file']['tmp_name'], $tmppath.$tmpname))
    exit('Could not store file for processing');
// Start an import process, then display a message to the user
// The ' > /dev/null &' is required here - it lets you start the process asynchronously
// Escape the arguments so a crafted userId can't inject shell commands
$userId = escapeshellarg($_POST['userId']);
$file   = escapeshellarg($tmppath.$tmpname);
exec("php import.php $userId $file > /dev/null &");
// On Windows you can do this to start an asynchronous process instead:
//$WshShell = new COM("WScript.Shell");
//$oExec = $WshShell->Run("php import.php $userId $file", 0, false);
exit("I'm importing your data - I'll email you when I've done it");
import.php - handles the import and sends an email
// Make sure the required command line arguments were passed and make sense
if (!isset($argv[1], $argv[2]) || !file_exists($argv[2])) {
    // handle improper calls here
}
// Connect to DB here and get user details based on the user ID (passed in $argv[1])
// Do the import (pseudocode-ish)
$wasSuccessful = parse_import_data($argv[2]);
if ($wasSuccessful) {
    // send the user an email
} else {
    // handle import errors here
}
// Delete the file
unlink($argv[2]);
The main issue with this approach is that if lots of people upload lists to be imported at the same time, you would risk stressing your system resources with multiple simultaneous versions of import.php running.
For this reason, it is possibly better to schedule a cron job to import the lists one at a time as suggested by Aaron Bruce - but which approach is best for you will depend on your precise requirements.

I think the "standard" way to do this in PHP would be to run a cron every five minutes or so that checks a queue of pending imports.
So your user logs in, and part of the login process adds them to your "pending_import" table (or however you choose to store the import queue). Then the next time the cron fires, it will take care of the current contents of your queue.
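For illustration, a minimal cron worker could look something like the sketch below. The column names, the DSN, and import_products_for_user() are assumptions for the example, not code from the question:
<?php
// worker.php - run from cron, e.g. */5 * * * * php /path/to/worker.php
// Assumes a pending_import table with (id, user_id, status) columns
$db = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'pass');
$pending = $db->query("SELECT id, user_id FROM pending_import WHERE status = 'pending'");
foreach ($pending as $row) {
    // Mark the row so a second cron run doesn't pick it up again
    $db->prepare("UPDATE pending_import SET status = 'running' WHERE id = ?")
       ->execute([$row['id']]);
    // Hypothetical function that pulls the products from the store's API
    import_products_for_user($row['user_id']);
    $db->prepare("UPDATE pending_import SET status = 'done' WHERE id = ?")
       ->execute([$row['id']]);
    // ...send the "import finished" email here
}
This keeps at most one batch of imports running per cron tick, which addresses the resource concern mentioned in the previous answer.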

Related

How to check if process is completed?

I have a PHP script that is currently invoked directly by a webhook. The webhook method was fine up until this past week, when the volume of requests became problematic for API rate limits.
What I have been trying to do is make a second PHP file ($path/webhook-receiver.php) to be invoked by webhooks only when there isn't already a process running. I'll be using the approach user webb recommended: the invoked script ($path/event-finance-reporting.php) creates a file as its first action and deletes that file as its last action.
Before invoking the script, the automation will check the directory to make sure it is empty; otherwise it will kick back an error to the user telling them to wait until the current job is completed before submitting another one.
The problem I'm running into now is that both $command1 and $command2 end up invoking $path/webhook-receiver.php instead of $path/event-finance-reporting.php.
$command1 = "php -f $path/event-finance-reporting.php 123456789";
$command2 = "/usr/bin/php -q -f $path/event-finance-reporting.php 123456789";
Anyone know why that would be?
The goal is to have only one instance of event-finance-reporting.php running at a time. One strategy is to create a unique lockfile, don't run if it exists, and delete it when the script finishes, e.g.:
$lockfilepath = '.../event-finance-reporting.lock';
if (file_exists($lockfilepath)) {
    print("try again later");
    exit();
}
touch($lockfilepath);
...
// event-finance-reporting.php code
...
unlink($lockfilepath);
You could also do something more complicated in the if, such as checking the age of the lockfile, then deleting and ignoring it if it was left behind a while ago by a crashed instance of event-finance-reporting.php.
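For example, that stale-lock check might look something like this (a sketch only; the 30-minute threshold is an arbitrary assumption):
$lockfilepath = '.../event-finance-reporting.lock';
if (file_exists($lockfilepath)) {
    // If the lock is older than 30 minutes, assume a crashed run left it behind
    if (time() - filemtime($lockfilepath) > 30 * 60) {
        unlink($lockfilepath);
    } else {
        print("try again later");
        exit();
    }
}
touch($lockfilepath);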
With this strategy, you also don't need two separate PHP files.

PHP - Email notification whenever a remote file changes

I was wondering if there's a way to have a php script on my web server email me whenever a file from another web server changes.
For instance, there's this file that changes frequently: http://media1.clubpenguin.com/play/en/web_service/game_configs/paper_items.json
I blog about a game and that file is very important for creating post on updates before my competitors. I often forget to check it though.
Is there a way to have a script email me whenever that file updates, or check that file to see if it has updated, and email me if it has?
Use crontab to set up a checking script that runs once a minute and compares the file with your locally stored version (or compare md5 checksums instead - the checksum will differ if the file changes).
file_put_contents('checkfile.tmp', file_get_contents('http://url-to-file'));
if (md5(file_get_contents('lastfile.tmp')) != md5(file_get_contents('checkfile.tmp')))
{
    // copy checkfile to lastfile
    unlink('lastfile.tmp');
    copy('checkfile.tmp', 'lastfile.tmp');
    // send email or do something you want ;)
}
You need to have these two files in the same folder:
old.json
scriptForCron.php
In scriptForCron.php write:
$url = 'http://media1.clubpenguin.com/play/en/web_service/game_configs/paper_items.json';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$execute = curl_exec($ch);
curl_close($ch);
// Read the previously stored copy and compare it with the fresh download
$oldjson = file_exists('old.json') ? file_get_contents('old.json') : '';
if ($execute != $oldjson) {
    mail('your#mail.com', 'Yoohoo', 'File changed');
    // Overwrite the stored copy with the new contents
    file_put_contents('old.json', $execute);
}
And then add scriptForCron.php as a cron job.
You can ask your hosting support how to do that.
This does not check for updates in real time - that would be pretty much impossible - but on a schedule of every hour or minute.
First, save a copy of the remote file on your system. Name it anything you like, for example paper_items.json.
Now make a file named checkitems.php. Read the file which changes frequently and compare whether its contents are equal to your local paper_items.json. If they are equal, there is nothing to do; if not, save the online file over your local paper_items.json and use PHP's mail() to email yourself something like "there was a change".
Finally, set up a cron job to run this every n (for example 1) hours, or every minute, etc.
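For example, an hourly crontab entry might look like this (the path to the script is an assumption):
0 * * * * php /path/to/checkitems.php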

Multiple "agents" handling a single array

Apologies if this has been covered before - I did my searching, but I may not know the correct terms to search for.
This process is handled with PHP.
Here's the situation:
I have a large array of file names. The script I have opens these files and enters their content into a database. Processing these files one at a time takes over 24 hours, and these files are updated on a daily basis.
Breaking the single large array into four smaller arrays and running concurrent processes finishes the job before the 24 hour window elapses, but sometimes one or two processes will finish hours before the others because file sizes vary on a daily basis.
Much like people who stock retail shelves (who else has worked that nightmare before?) pitch in to help out with what's left after finishing their own tasks, I'd like to have a script in place where these "agents" do the same.
Here's some basics of what I have figured out - it could be wrong, and I'm not too proud to protest if I am :-)
$files = array('file1','file2','file3','file4','file5');
//etc... on to over 4k elements
while($file = array_pop($files)){
//Something in here... I have no idea what.
}
Ideas? Something like four function calls or four loops within that overarching 'while' has crossed my mind, but I'm pretty sure it's going to wait on executing subsequent calls until the previous one(s) finish.
Any help is appreciated. I'm seriously stuck on this one!
Thanks!
A database-backed message queue seems the obvious solution but I think that's overkill in this case. I would simply put the files to be processed into a single dedicated queue directory, then use the DirectoryIterator class to scan it. Something like this:
while (true) {
look in the queue directory for a file
if you don't find one, exit the script, all processing is done
if you find one, rename it or move it to a work directory
if the rename/move command succeeded, process the file
if the rename/move command failed, one of the other threads got it first
}
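A rough PHP version of that loop could look like the sketch below. The /path/to/queue and /path/to/work directories and the process_file() function are assumptions for the example:
<?php
$queueDir = '/path/to/queue';
$workDir  = '/path/to/work';
while (true) {
    $claimed = null;
    // Look in the queue directory for a file
    foreach (new DirectoryIterator($queueDir) as $item) {
        if ($item->isFile()) {
            $target = $workDir.'/'.$item->getFilename();
            // If the rename succeeds we own the file; if it fails,
            // another worker grabbed it first, so try the next one
            if (@rename($item->getPathname(), $target)) {
                $claimed = $target;
                break;
            }
        }
    }
    if ($claimed === null) {
        break; // queue is empty, all processing is done
    }
    process_file($claimed); // hypothetical function that imports one file
}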
Edit:
Regarding launching the workers, you could use a simple shell script to spawn the PHP processes in the background:
NUM_WORKERS=5
for WORKER in $(seq 1 ${NUM_WORKERS})
do
echo "starting worker ${WORKER}"
php -f /path/to/my/process.php &
done
Then, create a cron entry to run this launcher, for example, at midnight:
0 0 * * * /path/to/launcher.sh
You want what's called a "message queue", something like beanstalkd.
You'll basically create a list of messages that include your individual filenames. You'll then create a set of processors to process them. Each processor will handle one file then go back to the queue to see if there are more messages/files waiting to be processed.
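If you go the beanstalkd route, a producer/worker pair might look roughly like this sketch. It assumes the Pheanstalk client library and a hypothetical process_file() import function, and the exact API differs a bit between Pheanstalk versions:
<?php
// producer.php - push every filename onto the queue
$queue = new Pheanstalk\Pheanstalk('127.0.0.1');
foreach ($files as $file) {
    $queue->useTube('imports')->put($file);
}

// worker.php - run several copies of this in parallel
$queue = new Pheanstalk\Pheanstalk('127.0.0.1');
while (true) {
    $job = $queue->watch('imports')->reserve(); // blocks until a job is available
    process_file($job->getData());              // hypothetical import function
    $queue->delete($job);                       // remove the job once it's done
}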
EDIT:
Here's an analogy to help explain message queues. Your first idea is like a human manager taking a stack of files, dividing them into four piles and then handing each of his four employees a pile to process. A message queue is more like this: the manager puts all the files on a table and tells each employee to take a single file from the table and process it. He tells them when they're done with the first file to keep taking files until there are no more files on the table. When all the files are done, the employees can go home.
One employee might end up with really large files and only handle a few, while another employee might get smaller files and handle many. It doesn't matter how many each employee handles, they'll all keep working until the table is empty.
I would have a socket server master script that hands out file paths to x number of slave scripts, until there are no files left to process. This way, all the slave scripts will keep running, and you can hand out file paths dynamically as they are requested.
Something like this:
master.php
<?php
// load the array of files to process (however you do this)
$fileList = file('filelist.txt');
// Create a listening socket on localhost
$serverSocket = stream_socket_server('tcp://127.0.0.1:7878');
$sockets = array($serverSocket);
$clients = array();
// Loop while there are still files to process
while (count($fileList)) {
    // Run a select() call on the existing sockets' read buffers
    // Skip to next iteration if no sockets are waiting for handling
    $read = $sockets;
    $write = $except = NULL;
    if (stream_select($read, $write, $except, 1) < 1) {
        continue;
    }
    // Loop sockets with data to read
    foreach ($read as $socket) {
        if ($socket == $serverSocket) {
            // Accept new clients
            $sockets[] = $clients[] = stream_socket_accept($serverSocket);
        } else if (trim(fgets($socket)) == 'next') {
            // Hand out a new file path to the client
            fwrite($socket, array_shift($fileList)."\n");
            if (!count($fileList)) {
                break 2;
            }
        }
    }
}
// When we're done, disconnect the clients
foreach ($clients as $socket) {
    @fclose($socket);
}
// ...and close the listen socket
@fclose($serverSocket);
slave.php
<?php
$socket = fsockopen('127.0.0.1', 7878);
while (!feof($socket)) {
    // Get a new file path from the master
    fwrite($socket, "next\n");
    $path = trim(fgets($socket));
    if (is_file($path)) {
        // Process the file at $path here
    }
}
You then just need to start master.php; once it is running, you can start as many instances of slave.php as you want, and they will all keep running until there are no more files to process.
Obviously, this has no error handling, but it should provide a basic framework to get you started. This relies on blocking function calls (stream_select() and fgets()) to avoid a race condition - this may or may not be sufficient for your purposes.

Limiting Parallel/Simultaneous Downloads - How to know if download was cancelled?

I have a simple file upload service, written out in PHP, which also includes a script that controls download speeds by sending limited-sized packets when a user requests a download from this site.
I want to implement a system to limit parallel/simultaneous downloads to 1 per user if they are not premium members. In the download script above, I can use a MySQL database to store a record that has: (1) the user ID; (2) the file ID; (3) when the download was initiated; and (4) when the last packet was sent, which is updated each time this is done (if DL speed is limited to 150 kB/sec, then after every 150 kB, this record is updated, etc.).
However, thus far, the database record will only be deleted once the download has successfully completed — at the end of the script, after the download has been fully served, the record is deleted from the table:
insert DB record;
while (download is being served) {
    serve packet of data;
    update DB record with current date/time;
}
// Download is now complete
delete DB record;
How will I be able to detect when a download has been cancelled? Would I just have to have a Cron job (or something similar) detect if an existing download record is more than X minutes/hours old? Or is there something else I can do that I'm missing?
I hope I've explained this well enough. I don't think posting specific code is required; I'm interested more in the logistics of how/whether this can be done. If specific is needed, I will gladly provide it.
NOTE: I know how to detect if a file was successfully downloaded; I need to know how to detect if it was cancelled, aborted, or otherwise stopped (and not just paused). This will be useful in stopping parallel downloads, as well as preventing a situation where the user cancels Download #1 and tries to initiate Download #2, only to find that the site claims he is still downloading file #1.
EDIT: You can find my download script here: http://codetidy.com/1319/ — it already supports multi-part downloads and download resuming.
<?php
class DownloadObserver
{
    protected $file;

    public function __construct($file) {
        $this->file = $file;
    }

    public function send() {
        // -> note in DB you've started
        readfile($this->file);
    }

    public function __destruct() {
        // download is done, either completed or aborted
        $aborted = connection_aborted();
        // -> note in DB
    }
}

$dl = new DownloadObserver("/tmp/whatever");
$dl->send();
should work just fine. No need for a shutdown_function or any funky self-built connection observation.
You will want to check out the following functions: connection_status(), connection_aborted() and ignore_user_abort() (see the connection handling section of the PHP manual for more info).
Although I can't guarantee the reliability (it's been a while since I've played around with it), with the right combination you should be able to accomplish what you want. There are a few caveats when working with these though, the big one being that if something goes wrong you could end up with stranded PHP scripts running on the server requiring you to kill Apache to stop them.
The following should give you a good idea of how to do it (adapted from the PHP code examples and a couple of the comments):
<?php
// Set PHP not to cancel execution if the connection is aborted
// and drop the time limit to allow for big file downloads
ignore_user_abort(true);
set_time_limit(0);
while (true) {
    // See the ignore_user_abort() docs re having to send data
    echo chr(0);
    // Make sure the data gets flushed properly or the connection check won't work
    ob_flush();
    flush();
    // Check the connection status and exit the loop if aborted
    if (connection_status() != CONNECTION_NORMAL || connection_aborted()) break;
    // Just to provide some spacing in this example
    sleep(1);
}
file_put_contents("abort.txt", "aborted\n", FILE_APPEND);
// Never hurts to ensure that the script halts execution
die();
Obviously for how you would be using it the data being sent would simply be the download data chunk (just make sure you flush the buffer properly to ensure the data is actually sent). As far as I'm aware, there is no way of making a distinction between pausing and aborting/stopping. Pause/resume functionality (and multi-part downloading - i.e. how download managers accelerate downloads) relies on the "Range" header, basically requesting byte x to byte y of the file. So if you want to allow resumable downloads you'll have to deal with that too.
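For reference, a bare-bones Range handler could look something like the sketch below. It ignores multi-range and suffix-range requests, and the file path is an assumption:
<?php
$path = '/path/to/file'; // assumed location of the file being served
$size = filesize($path);
$start = 0;
$end   = $size - 1;
// A request like "Range: bytes=500-999" asks for only part of the file
if (isset($_SERVER['HTTP_RANGE']) &&
    preg_match('/bytes=(\d+)-(\d*)/', $_SERVER['HTTP_RANGE'], $m)) {
    $start = (int)$m[1];
    if ($m[2] !== '') {
        $end = (int)$m[2];
    }
    header('HTTP/1.1 206 Partial Content');
    header("Content-Range: bytes $start-$end/$size");
}
header('Content-Length: '.($end - $start + 1));
// Seek to the requested offset, then serve limited-size chunks from here
// exactly as the existing rate-limiting loop already does
$fp = fopen($path, 'rb');
fseek($fp, $start);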
There is no HTTP "cancel" signal that is sent by default. So, it looks like you will need to decide on a timeout, the length of time a connection can sit without sending/receiving another packet. If you are sending rather small packets (as I presume you are) keep the timeout short for best effect.
In your while condition you will need to check the age of the last timestamp update; if it's too old, stop sending the file.
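One way to apply that timeout when a new download is requested might be the sketch below. The downloads table, its columns, and the 5-minute threshold are assumptions:
<?php
// Treat any record whose last packet is older than 5 minutes as a dead download
$db = new PDO('mysql:host=localhost;dbname=files', 'user', 'pass');
$db->prepare('DELETE FROM downloads
              WHERE user_id = ? AND last_packet_at < NOW() - INTERVAL 5 MINUTE')
   ->execute([$userId]);
// If a record is still left, the user really does have an active download
$stmt = $db->prepare('SELECT COUNT(*) FROM downloads WHERE user_id = ?');
$stmt->execute([$userId]);
if ($stmt->fetchColumn() > 0) {
    exit('You already have a download in progress.');
}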

Downloading large number of Images to my Server and notifying the User when Download is finished

I want to download a large number of files to my server. I have a list of different files to download and locations to put them in. This is all not a problem; I use wget to download the file, executing it with shell_exec:
$command = 'wget -b -O' . $filenameandpathtoput . ' ' . $submission['url'];
shell_exec($command);
This works great; the server starts all the downloads in the background and the files are downloaded in no time.
The problem is, I want to notify the user when the files are downloaded, and this does not work with my current way of doing things. So how would you implement this?
Any suggestions would be helpful!
I guess that you are able to check whether all files are in place with something like
function checkFiles()
{
    foreach ($_SESSION["targetpaths"] as $p)
    {
        if (!is_file($p)) return false;
    }
    return true;
}
Now all you have to do is to call a script on your server that calls this function every second (or so). You can either accomplish this with Meta Refresh (forcing the browser to reload the page after n seconds) or by using AJAX (have a look at jQuery's .getJSON, for example).
If the script is called and the files are not yet all downloaded, print something like "Please wait" and refresh again later. Otherwise, show the success message. That's all.
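A minimal polling page using the meta refresh variant could look like this sketch; it assumes checkFiles() from above is available (e.g. via include) and that progress.php is the page being reloaded:
<?php
// progress.php - reloads itself every 5 seconds until all files exist
session_start();
if (checkFiles()) {
    echo 'All files have been downloaded.';
} else {
    echo '<meta http-equiv="refresh" content="5">';
    echo 'Please wait, your files are still downloading...';
}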
You can consider using exec to run the external wget command. Your PHP script will block till the external command completes. Once it completes you can echo the name of the completed file.
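If you take that route, the key change from the question's code is dropping wget's -b (background) flag so exec() blocks until the download finishes, for example:
<?php
// Without -b, exec() waits for wget to finish before returning
exec('wget -O ' . escapeshellarg($filenameandpathtoput) . ' ' . escapeshellarg($submission['url']));
echo 'Finished downloading ' . htmlspecialchars($filenameandpathtoput);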
