I want to create an aggregator whose goal is to grab Facebook posts (text, pictures, videos). The next step is to post the grabbed content on my own resource, and the videos on my YouTube channel. I want cron to launch this aggregator every minute. But I think I will run into the following problem: if I grab a large video file, it may not finish uploading to YouTube within that minute. In the meantime the next video file will be grabbed, and it will need to be uploaded too. My question: will this second video file stay in a queue and wait until the first one has been uploaded, or do I need to create multiple threads for this? If so, please tell me how.
Crontab simply spawns a process at a scheduled time. If cron is still running a process that it initiated 1 minute ago, it will create another process (i.e. 2 will be running at the same time).
Your code must ensure that files which are being uploaded are marked such that, if/when cron runs a second process, it doesn't try to upload the same file multiple times.
Your logic would be something like this (a rough code sketch follows the list):
Grab data
Before you upload that data, check (in a database, for example) if you are already uploading it
If you are not already uploading it, mark in the database that you are uploading that data
Upload the data
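A minimal sketch of the check-mark-upload steps above, assuming a MySQL table named uploads with file_path and status columns (those names, and the uploadToYoutube() helper, are assumptions for illustration, not from the question); the check and the mark are collapsed into one atomic UPDATE so two overlapping cron runs cannot claim the same file:

    // Hedged sketch: table, column and helper names are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

    // $filePath comes from the grabbing step.
    // Claim the file: only succeeds if no other process has marked it yet.
    $claim = $pdo->prepare(
        "UPDATE uploads SET status = 'uploading'
         WHERE file_path = :path AND status = 'new'"
    );
    $claim->execute([':path' => $filePath]);

    if ($claim->rowCount() === 1) {
        uploadToYoutube($filePath);   // hypothetical upload routine
        $pdo->prepare("UPDATE uploads SET status = 'done' WHERE file_path = :path")
            ->execute([':path' => $filePath]);
    }
    // Otherwise another cron-spawned process already owns this file, so just exit.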
To be honest, this isn't a great use of cron and you'd do better with a long-running process. Software like supervisord can allow you to create a long-running PHP script which is automatically restarted if it crashes.
This time I come with a question that I hope you can guide me to solve.
I have created a PHP script that loads a CSV file with a large amount of data (I upload it via an AJAX request). This script extracts the data from the file, checks that the data is not already stored in the database, uses another script to obtain information about each record extracted from the file, and finally saves the records that pass all of that validation into a database table.
The process can take anywhere from a few seconds to many minutes, because some of the files I upload contain more than 100 thousand records, so I would rather not have to leave the browser open for the whole time the process lasts.
What I want to know is how I can leave this process running internally on the server when I close the browser. Something like putting it in a queue and letting it keep running after my browser is closed.
Then, once I reopen the browser and open the script's page, it should show me how the process is currently going. The idea is that the data processing is not interrupted when I close my browser.
Any suggestions or examples you could give me to achieve this?
Based on your description, I think you'd be better off running a dedicated daemon (either a 3rd-party one or one written by yourself) to do the background work.
The rationale behind why I don't think it's right to do that in your PHP web code is:
If you fork it from your server code, you have to install something extra (the pcntl extension, for instance), and since it is a fork, the process you spawn will inherit data from the parent process that is of no use to it at all.
With a dedicated daemon it's easier for you to track the status of each job and, more importantly, you won't end up with a whole bunch of processes being spawned, which is what happens if you just fork a new process for each job in the server code.
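A minimal sketch of such a daemon, assuming the web side records each uploaded CSV in a table named csv_jobs with file_path and status columns (the table, columns, and processCsv() helper are assumptions); it could be kept alive by something like supervisord, as the first answer suggests:

    // csv_worker.php - a long-running worker, e.g. started by supervisord
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    while (true) {
        // Claim the oldest pending job (MySQL allows ORDER BY / LIMIT in UPDATE).
        $pdo->exec("UPDATE csv_jobs SET status = 'running'
                    WHERE status = 'pending' ORDER BY id LIMIT 1");
        $job = $pdo->query("SELECT * FROM csv_jobs WHERE status = 'running' ORDER BY id LIMIT 1")
                   ->fetch(PDO::FETCH_ASSOC);

        if ($job === false) {
            sleep(5);                         // nothing queued, poll again shortly
            continue;
        }

        processCsv($job['file_path']);        // hypothetical helper doing the import/validation
        $pdo->prepare("UPDATE csv_jobs SET status = 'done' WHERE id = ?")
            ->execute([$job['id']]);
    }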
My problem might be very basic for you because I'm new to the Phalcon framework.
I'm building a web-based system where I want to upload an Excel file to the server, then apply a set of conditions and insert the result into my DB. I want the execution to happen on the server, meaning that the user can close the browser but the system should still keep processing the file, and while it is processing it should show the user a progress bar.
You can upload the file and add its name to a queue system, or just store it in a database. Phalcon provides Beanstalk support out of the box, but if you want to keep a history of uploads (and show progress with ease) I would recommend the database way. The table structure should contain columns like: file name, processed_rows, rows, status (0 - new, 1 - processed).
To process files in the background you can create a Phalcon CLI app which observes the queue (or the unprocessed files in the database) and processes them. You should configure a CRON task to run it every 1/5/10 minutes, depending on how many files you upload, or run it in an endless loop.
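A framework-agnostic sketch of that background processing step, using plain PDO so it could just as well live inside a Phalcon CLI task; the uploads table name, its file_path column, and reading the file as CSV rather than a real Excel workbook are all simplifying assumptions:

    // process_uploads.php - run from cron or as a CLI task
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // Pick up every file that has not been processed yet (status 0 = new).
    $files = $pdo->query("SELECT * FROM uploads WHERE status = 0")->fetchAll(PDO::FETCH_ASSOC);

    foreach ($files as $file) {
        $handle = fopen($file['file_path'], 'r');
        $done   = 0;

        while (($row = fgetcsv($handle)) !== false) {
            // ... validate the row and insert it into the target table ...
            $done++;

            if ($done % 500 === 0) {          // keep the progress counter reasonably fresh
                $pdo->prepare("UPDATE uploads SET processed_rows = ? WHERE id = ?")
                    ->execute([$done, $file['id']]);
            }
        }
        fclose($handle);

        $pdo->prepare("UPDATE uploads SET processed_rows = ?, status = 1 WHERE id = ?")
            ->execute([$done, $file['id']]);
    }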
To determine progress you can count all the rows in the file and update the processed-rows count while processing. Then you can simply calculate the progress on the client's request.
If you need live progress you can make AJAX calls at an interval to get the current progress from the database, or implement a WebSocket server which pushes the progress to you from another queue tube (updated by the background process).
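For the AJAX-polling variant, a minimal endpoint could read those two columns and return a percentage (the script name, id parameter, and column names are assumptions matching the sketch above):

    // progress.php?id=123 - returns e.g. {"progress": 42.5}
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare("SELECT processed_rows, rows FROM uploads WHERE id = ?");
    $stmt->execute([(int) $_GET['id']]);
    $row  = $stmt->fetch(PDO::FETCH_ASSOC);

    header('Content-Type: application/json');
    echo json_encode([
        'progress' => ($row && $row['rows'] > 0)
            ? round($row['processed_rows'] / $row['rows'] * 100, 1)
            : 0,
    ]);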
Ok, let's say my website allows users to upload large batches of files for processing.
Now, once they upload the files, say a process_files() function is called. This will take a long time to complete, and it will basically make the website unusable for the user until the task either times out or completes.
What I'm wondering is whether there's an easy way to just execute this in parallel or in the background so the user can still use the website.
If it makes a difference, in my case these will very seldom be called.
You can divide the task into two parts:
The first part (web) only saves the file to disk, so the web part stays fast.
The second part (system) processes all the files saved by the first part. You can start it periodically, e.g. as a cron task on Linux, or use some other task scheduler depending on your system.
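A minimal sketch of that split, assuming a spool directory /var/spool/myapp/incoming and an upload field named 'upload' (both are assumptions); process_files() is the long-running function from the question:

    // Part 1 (web): just move the uploaded file into a spool directory and return immediately.
    $spoolDir = '/var/spool/myapp/incoming';
    $target   = $spoolDir . '/' . uniqid('job_', true) . '_' . basename($_FILES['upload']['name']);
    move_uploaded_file($_FILES['upload']['tmp_name'], $target);
    echo "File accepted; it will be processed shortly.";

    // Part 2 (system, run from cron): process whatever is waiting in the spool directory.
    foreach (glob('/var/spool/myapp/incoming/*') as $file) {
        process_files($file);              // the long-running work from the question
        unlink($file);                     // or move it to a "done" directory instead
    }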
I would suggest that after the file upload, you call a script via the CLI, which will run in the background and be non-blocking:
exec("php the_long_file_processing_script.php > /dev/null 2>&1 &");
I've built a scraper to get some data from another website. The scraper currently runs at the command line inside a screen session, so the process never stops. Between each request I've set an interval to keep things calm. A single scrape can bring along up to 100 files that need to be downloaded, and that download process also has an interval after every download.
Now I want to add functionality to the back-end to scrape on the fly. Everything works fine: I get the first data set, which takes only 2 requests. Within the data returned I have an array of files that need to be downloaded (can be 10, can be 100+). I would like to build something so the user can see in real time how far along the download process is.
The thing I am facing: when the scraper has 2 jobs to do in a browser window, with 20+ downloads plus the intervals to keep things calm, it simply takes too much time. I am thinking about saving the files that need to be downloaded into a database table and handling that part of the data processing with another shell script (screen) or cron job.
I am wondering whether my thinking is on the right track, whether it is overkill, or whether there are better examples of handling these kinds of processes.
Thanks for any advice.
P.S. I am developing in PHP.
If you think that is overkill, you can run the script and wait until the task is finished before running it again.
Basically you need to implement a message queue where the HTTP request handler (front controller?) emits a message to fetch a page, and one or more workers do the job, optionally emitting more messages to the queue to download files.
There are plenty of MQ brokers, but you can also implement your own with a database as the queue storage.
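A minimal sketch of the database-as-queue idea, assuming a download_queue table with id, url, and status columns and a fetchFile() helper (all of these names are assumptions); it keeps the "interval to keep things calm" from the question:

    // download_worker.php - run in a screen session or started by cron
    $pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

    while (true) {
        // Claim the oldest pending download.
        $pdo->exec("UPDATE download_queue SET status = 'working'
                    WHERE status = 'pending' ORDER BY id LIMIT 1");
        $job = $pdo->query("SELECT * FROM download_queue WHERE status = 'working' ORDER BY id LIMIT 1")
                   ->fetch(PDO::FETCH_ASSOC);

        if ($job === false) {
            sleep(10);                        // queue is empty, poll again later
            continue;
        }

        fetchFile($job['url']);               // hypothetical download helper
        $pdo->prepare("UPDATE download_queue SET status = 'done' WHERE id = ?")
            ->execute([$job['id']]);

        sleep(2);                             // the interval between downloads
    }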
I have a customer-based website which requires customers to upload an image; on doing so, my script will save about 25-30 variations of the image onto the server using the GD library. Because of the number of images there is currently a very long wait before the customer can continue on the site, until all the images have been created and saved. Until then they cannot proceed, so we get a high level of customers leaving the site.
Is it possible, after the upload, to instead store the image URL in a database table record, and then have a PHP script which creates the 25-30 images pull each record from the database, running every 5 minutes of the day as a cron job? That way the customer can continue through the website and have the images automatically created 'in the background'.
Will all this going on in the background cause any issues for the speed of my site? Will it slow down the site for people browsing, especially if tens or hundreds of customers are using it at the same time?
I suggest you start looking at queues, more specifically at Gearman.
This will decrease the load time for your customers, as you can offload the generation of the images to a separate server. And it scales easily across multiple servers if you need more processing power.
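A minimal sketch using the PECL gearman extension; the 'resize_image' job name and the makeVariations() helper are assumptions for illustration:

    // In the upload handler: hand the work off to Gearman and return to the customer immediately.
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);                // default gearmand port
    $client->doBackground('resize_image', json_encode(['path' => $uploadedPath]));

    // image_worker.php - run on the image-processing server (one or more copies).
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction('resize_image', function (GearmanJob $job) {
        $data = json_decode($job->workload(), true);
        makeVariations($data['path']);                    // hypothetical: creates the 25-30 GD variations
    });
    while ($worker->work());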
Having a PHP script process the images in a cron job would cause no more load than having it run while the customer waits for the processing to complete in real time. So in short, no, there's no additional impact from this approach.
The catch is, you need to make sure your cron job is self-aware and does not create overlaps. For example, if the cron job runs and takes more than 5 minutes to complete its current task, what happens when a second one spins up and begins processing another image (or the same one, if you didn't implement the queue properly)? Now you have two cron jobs running and fighting for resources, which means the second one will likely take over 5 minutes as well. Eventually you end up with 3, 4, etc. all running at once. So make sure a new cron run only "boots up" if there isn't one running already.
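One common way to guarantee that is a non-blocking file lock at the top of the cron script, sketched below; the lock-file path is an assumption:

    // Refuse to start if a previous run still holds the lock.
    $lock = fopen('/tmp/image_cron.lock', 'c');           // 'c' creates the file if it doesn't exist
    if (!flock($lock, LOCK_EX | LOCK_NB)) {
        exit("Previous run still in progress, exiting.\n");
    }

    // ... process the queued images here ...

    flock($lock, LOCK_UN);                                // the lock is also released automatically on exit
    fclose($lock);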
All this being said, you'd probably be best off having another server handle the image processing, depending on the size of your client's site and how heavy their traffic is. You could have a cloud server in a cluster with your production site server which connects via the local network to access an image, process it, and return the 25-30 copies to the appropriate location on the server. This way your processing queue occupies zero resources of the public-facing web server and will have no impact on the speed of the site itself.
Sure, you can store the PATH of an image on your server and process it later.
Create a PHP script that, when run, creates a LOCK file, e.g. "/tmp/imgprocessor.lock", and deletes it at the end; if cron starts a new process, you first check that this file doesn't exist.
I would store uploaded images in e.g. pathtoimages/toprocess/ and delete each one after processing, or move it elsewhere; put the finished images in e.g. /processed/.
This way you don't need to query the DB for the paths of the images: you just process whatever is in the 'toprocess' folder, and you can simply keep UNIQ_NAME_OF_IMAGE in a table. In your web script, before loading the page, check whether UNIQ_NAME_OF_IMAGE exists in the 'processed' folder, and if so display it.
As for server load, it depends on how many images you originally have and what their sizes are. Image processing can be heavy on a server, but processing 1000 users * 30 images won't be a heavy-duty task; as I said, it depends on the size of the images.
NOTE: if you go this way you need to make sure that when the cron job starts, errors are also written to some log file. The cron script must be bulletproof: if it fails for some reason, the LOCK file will remain, so no more processing will happen and you will need to delete it manually (or create a custom error handler that deletes it and maybe sends some mails). You should check the log file periodically so you know what's going on.
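A rough sketch of that lock-and-log approach; the /tmp/imgprocessor.lock path and folder names come from the answer above, while the makeVariations() helper and the log path are assumptions. Here the finally block plays the role of the custom error handler that removes the lock:

    // imgprocessor.php - intended to be started from cron
    $lockFile = '/tmp/imgprocessor.lock';
    if (file_exists($lockFile)) {
        exit;                                  // previous run still busy (or it crashed - check the log)
    }
    touch($lockFile);

    // Send PHP errors to a log file so failed runs can be investigated later.
    ini_set('log_errors', '1');
    ini_set('error_log', '/var/log/imgprocessor.log');

    try {
        foreach (glob('pathtoimages/toprocess/*') as $image) {
            makeVariations($image);                            // hypothetical GD resizing routine
            rename($image, 'pathtoimages/processed/' . basename($image));
        }
    } finally {
        unlink($lockFile);                     // remove the lock even if processing threw an exception
    }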