PHP - Using Cron Jobs and GD together

I have a customer-facing website which requires customers to upload an image; on doing so, my script saves about 25-30 variations of the image onto the server using the GD library. Because of the number of images, there is currently a very long wait before the customer can continue on the site: they cannot proceed until every image has been created and saved, so a high proportion of customers leave the site.
Would it be possible, after upload, to instead store the image URL in a database table record, and then have a PHP script (run every 5 minutes of the day by a cron job) pull each record from the database and create the 25-30 images? That way the customer could continue through the website while the images are created automatically 'in the background'.
Will all this going on in the background cause any issues for the speed of my site? Will it slow the site down for people browsing, especially if tens or hundreds of customers are using it at the same time?

I suggest looking at job queues, specifically Gearman.
This will decrease the load time for your customers, as you can offload the generation of the images to a separate server. It also scales easily across multiple servers if you need more processing power.
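A rough sketch of what that hand-off could look like with the PHP Gearman extension; the "resize_image" job name, paths, and sizes are placeholders, not anything Gearman requires. The client part goes in your upload handler, and the worker runs as its own long-lived process, possibly on another machine:

    <?php
    // Client side (in the upload handler): hand the heavy work to Gearman
    // instead of doing it inline, so the customer gets a response immediately.
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);   // default gearmand port
    $client->doBackground('resize_image', json_encode(['path' => '/uploads/original/123.jpg']));

    // Worker side (separate script, can run on another server): consumes jobs
    // and creates the image variations with GD.
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction('resize_image', function (GearmanJob $job) {
        $data = json_decode($job->workload(), true);
        $src  = imagecreatefromjpeg($data['path']);
        foreach ([100, 200, 400] as $width) {        // the real job would loop over 25-30 sizes
            $dst = imagescale($src, $width);         // keeps the aspect ratio
            imagejpeg($dst, dirname($data['path']) . "/{$width}_" . basename($data['path']));
            imagedestroy($dst);
        }
        imagedestroy($src);
    });
    while ($worker->work());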

Having a PHP script process the images in a cron job causes no more load than having the customer upload the image and wait for the processing to complete in real time. So in short, no, there is no additional impact from this approach.
The catch is, you need to make sure your cron job is self-aware and does not create overlaps. For example, if a run takes more than 5 minutes to complete its current task, what happens when a second run spins up and begins processing another image (or the same one, if you didn't implement the queue properly)? Now you have two cron processes running and fighting for resources, which means the second one will likely take over 5 minutes as well, and eventually you end up with 3, 4, or more cron processes all running at once. So make sure your cron only "boots up" if there isn't one running already.
All this being said, you'd probably be best off having another server handle the image processing, depending on the size of your client's site and how heavy the traffic is. You could have a cloud server in a cluster with your production web server, connected via the local network, which fetches an image, processes it, and returns the 25-30 copies to the appropriate location on the web server. This way your processing queue occupies no resources on the public-facing web server and has no impact on the speed of the site itself.
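For the "only boots up if there isn't one running already" part, a minimal sketch is to take a non-blocking exclusive lock at the top of the cron script and exit if a previous run still holds it (the lock path here is arbitrary):

    <?php
    // Cron entry point: refuse to start if a previous run is still going.
    $fp = fopen('/tmp/image-worker.lock', 'c');      // 'c' creates the file without truncating it
    if ($fp === false || !flock($fp, LOCK_EX | LOCK_NB)) {
        exit(0);                                     // another instance holds the lock
    }

    // ... pull pending records from the queue table and generate the images here ...

    flock($fp, LOCK_UN);                             // also released automatically if the script dies
    fclose($fp);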

Sure, you can store the path of an image on your server and process it later.
Create a PHP script that, when run, creates a lock file, e.g. "/tmp/imgprocessor.lock", and deletes it at the end; when cron starts a new process, it first checks that the file doesn't exist.
I would store uploaded images in, say, pathtoimages/toprocess/ and delete each one after processing (or move it elsewhere), and put the generated images in, say, pathtoimages/processed/.
This way you don't need to query the DB for image paths; you just process whatever is in the 'toprocess' folder, and keep only UNIQ_NAME_OF_IMAGE in the table. In your web script, before loading the page, check whether UNIQ_NAME_OF_IMAGE exists in the 'processed' folder, and if so display it.
As for server load, it depends on how many images you have to begin with and what size they are. Image processing can be heavy on a server, but processing 1000 users * 30 images won't be a heavy-duty task; as I said, it depends on the size of the images.
NOTE: if you go this way, make sure that errors from the cron run are also written to a log file. The cron script must be bulletproof: if it fails for some reason, the lock file will remain and no further processing will happen until you delete it manually (or you can register a custom error handler that deletes it and perhaps sends an email). Check the log file periodically so you know what's going on.
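Putting the pieces of this answer together, a sketch of the cron script could look like the following; the paths, sizes, and log location are placeholders, and a shutdown function plays the role of the custom error handler that cleans up and logs:

    <?php
    $lock = '/tmp/imgprocessor.lock';
    if (file_exists($lock)) {
        exit(0);                                   // a previous run is still busy
    }
    touch($lock);

    // Log fatal errors and remove the lock even if the script dies halfway through.
    register_shutdown_function(function () use ($lock) {
        if ($err = error_get_last()) {
            error_log(print_r($err, true), 3, '/var/log/imgprocessor.log');
        }
        @unlink($lock);
    });

    foreach (glob('pathtoimages/toprocess/*.jpg') as $file) {
        // ... create the 25-30 GD variations and write them to pathtoimages/processed/ ...
        unlink($file);                             // or move the original elsewhere, as suggested above
    }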

Related

PHP script keeps "restarting" creating new instances of itself

I developed a site using Zend Framework 2. It is basically a price comparison site that integrates with many of the top affiliate networks out there. I wrote a script that checks prices from each affiliate network, and then updates my local DB with that price. Depending on which affiliate network I am contacting, I may be making an API call (Amazon or CJ.com), or I may be looking at an XML product feed (Pepperjam or LinkShare). The XML product feed would be hosted locally.
At present, there are around 3,500 SKUs that I am checking with this script. The vast majority of them (95%+) are targeting an XML product feed. I would estimate that this script should take in the neighborhood of 10 minutes to complete. Some of the XML files I am looking at are around 8 MB in size.
I have tested this script thoroughly in my local environment and gone to great lengths to make sure that there is no memory leak or anything of that nature which would cause performance issues. As an example, I made sure to use data streams where possible to avoid loading the XML file into memory over and over, etc. Suffice it to say, the script runs locally without issue.
This script is intended to be run as a cron job; however, I do have a way to trigger it ad hoc via the secure admin interface. Locally, this is how I initiate the script to run, and everything goes rather smoothly.
When I deploy my code to the shared hosting account, I am having all sorts of problems. In order to troubleshoot, I attached logging to various stages of this script to track when it starts, how it progresses, and when each step completes, etc. All of this is being logged to a MySQL database.
Problem #1: If I run the script ad hoc via an HTTP request, I find that it will run for a couple of minutes, and then the script starts again (so there are now apparently two instances running). Wait another couple of minutes and a third one will start, and so on. Here is an example from when I triggered the script to run at 10:09pm via an HTTP request.
Screenshot of process manager
Needless to say, I DO NOT run it via an HTTP request because it only serves to get me in trouble with my web hosting provider :)
Problem #2: When the script runs on the server, triggered via a cron job, it fails to complete. I have taken a copy of the production database and the XML files to my local environment, and the script runs fine there, so it should not be a problem with bad data exposing bad code. My observation is that the script runs for nearly the exact same amount of time before it aborts, is terminated, or whatever. The last record updated is generally timestamped around 4 minutes and 30 seconds or so (if memory serves) after the script is triggered. The SKU list is constantly changing, so the record it ends on differs, but the time of the last update is nearly the same each time. Nothing is being logged in the error logs. I monitored server resources via the SSH top command and there is nothing out of the ordinary: CPU usage is in check and memory usage does not go up.
I have a shared hosting account through Bluehost. My thoughts were that perhaps it was a script max execution time issue. I extended the max execution time in the script itself and via php.ini. Made no difference.
So I guess what I am looking for is some fresh ideas on where to go next. What questions should I be asking my hosting company so they can help me get to the bottom of this? They are only somewhat helpful, to say the least. Could it be some limitation on my hosting account? Am I triggering some sort of automatic monitor that is killing the script? What types of Apache settings could be problematic for a script of this nature? php.ini settings? Absolutely any input you can provide would be helpful.
And why, when triggered via HTTP, would it keep spinning up new instances? I guess I could live without running it manually and only run it via a cron job, but that isn't working either. So I'm interested in hearing the community's thoughts on this. Thanks!
I haven't seen your script, nor have I worked with your host, so everything below is just a guess and a suggestion.
Given your description, I would say you're right that your script is probably being killed by a timeout when run from cron. I'm not sure why it keeps spawning new instances when you execute it manually via an HTTP request, but that may also be related to a timeout (e.g. if they have logic that restarts a script when it has not produced any output within a certain time, or something like that).
You can follow up with your hosting provider about running long-running (or memory-consuming) scripts in their environment; they might already have an FAQ or document that covers this topic.
Let me suggest an option in case your provider is unable to help.
From what you said, I expect your script runs an SQL query to get a list of SKUs, and then slowly iterates over this list, performing some job on every item (and eventually dies for whatever reason, as we learned).
How about creating a temporary table (or a file, or any other kind of persistent storage on the server) that saves the last processed record ID, or NULL once the script has completed successfully? That way you can make your script start from the last processed record (if the last processed record had id = 1000, add ... WHERE id > 1000 to the main query that fetches SKUs), and you won't really care whether the script completed on its first attempt or not (if it didn't, it will pick up on its second try from the very point where it was killed).
Alternatively, to extend this approach, you can limit one invocation to a certain number of records (e.g. 100 or 1000), again saving the last processed record ID in the database or somewhere else.
The main idea is: if the script fails to process all SKUs at once, just make it restartable so that it does not lose its progress.
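A sketch of what that restartable loop could look like with PDO; the price_check_state and skus table names and the updatePriceForSku() helper are made up here for illustration:

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

    // Where did the previous run stop? 0 means "start from the beginning".
    $last = (int) $pdo->query('SELECT last_processed_id FROM price_check_state')->fetchColumn();

    $skus = $pdo->prepare('SELECT id, sku FROM skus WHERE id > :last ORDER BY id LIMIT 1000');
    $skus->execute(['last' => $last]);

    $save      = $pdo->prepare('UPDATE price_check_state SET last_processed_id = :id');
    $processed = 0;
    foreach ($skus as $row) {
        updatePriceForSku($row['sku']);            // your existing feed/API price check
        $save->execute(['id' => $row['id']]);      // checkpoint after every record
        $processed++;
    }

    // Nothing left above the checkpoint: reset so the next run starts over.
    if ($processed === 0) {
        $pdo->exec('UPDATE price_check_state SET last_processed_id = 0');
    }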

Using the PHP Phalcon framework to run tasks in the background

My problem might be very basic for you, because I'm new to the Phalcon framework.
I'm building a web-based system where I want to upload an Excel file to the server, then apply a set of conditions and insert the results into my DB. I want the execution to happen on the server, meaning the user can close the browser and the system will still process the file, and while processing it should show the user a progress bar.
You can upload the file and add its name to a queue system, or just store it in the database. Phalcon provides Beanstalk support out of the box, but if you want to keep a history of uploads (and show progress with ease) I would recommend the database way. The table should contain columns like: file_name, processed_rows, rows, status (0 = new, 1 = processed).
To process files in the background you can create a Phalcon CLI app which observes the queue (or the unprocessed files in the database) and processes the files. Configure a cron task to run it every 1/5/10 minutes, depending on how many files you upload, or run it in an endless loop.
To determine progress, count all the rows in the file up front and update the processed-rows count while processing. Then you can simply calculate the progress on the client's request.
If you need live progress, you can make AJAX calls at an interval to get the current progress from the database, or implement a WebSocket server which pushes you the progress from another queue tube (updated by the background process).
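A framework-agnostic sketch of the worker loop (in Phalcon this would sit inside a CLI task action and use your models rather than raw PDO); the uploads table and the readExcelRows() helper are invented for illustration:

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // Pick one unprocessed upload (status 0 = new).
    $upload = $pdo->query('SELECT id, file_name FROM uploads WHERE status = 0 LIMIT 1')->fetch();
    if (!$upload) {
        exit(0);                                   // nothing to do on this run
    }

    $progress = $pdo->prepare('UPDATE uploads SET processed_rows = :n WHERE id = :id');
    $n = 0;
    foreach (readExcelRows('/uploads/' . $upload['file_name']) as $row) {
        // ... apply your conditions and insert the result into the DB ...
        $progress->execute(['n' => ++$n, 'id' => $upload['id']]);
    }

    $pdo->prepare('UPDATE uploads SET status = 1 WHERE id = :id')->execute(['id' => $upload['id']]);

The AJAX progress endpoint then only needs to read processed_rows and rows for that upload and return round(100 * processed_rows / rows) to the progress bar.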

Apache cron: synchronous task execution

I want to create an aggregator whose goal is to grab Facebook posts (text, pictures, videos). The next step is to post the grabbed info on my site, and the videos on my YouTube channel. I want cron to launch this aggregator every minute. But I think I will run into the following problem: if I grab a large video file, it may not have finished uploading to YouTube within that minute. By then, the next video file may already have been grabbed, and it will need to be uploaded too. My question: will this second video file stay in a queue and wait until the first has been uploaded, or do I need to create multiple threads for this? If so, please tell me how.
Crontab simply spawns a process at a scheduled time. If cron is still running a process that it initiated 1 minute ago, it will create another process (i.e. 2 will be running at the same time).
Your code must ensure that files which are being uploaded are marked such that, if/when cron runs a second process, it doesn't try to upload the same file multiple times.
Your logic would be something like this:
Grab data
Before you upload that data, check (in a database, for example) if you are already uploading it
If you are not already uploading it, mark in the database that you are uploading that data
Upload the data
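Steps 2 and 3 can be collapsed into a single conditional UPDATE so that two overlapping cron runs can never both claim the same video. A sketch, where the videos table and the uploadToYouTube() helper are invented for illustration:

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

    $claim = $pdo->prepare(
        "UPDATE videos SET status = 'uploading' WHERE id = :id AND status = 'grabbed'"
    );

    foreach ($pdo->query("SELECT id, path FROM videos WHERE status = 'grabbed'") as $video) {
        $claim->execute(['id' => $video['id']]);
        if ($claim->rowCount() === 0) {
            continue;                              // another cron run already claimed this one
        }
        uploadToYouTube($video['path']);           // the slow part; may outlive the next cron tick
        $pdo->prepare("UPDATE videos SET status = 'done' WHERE id = :id")
            ->execute(['id' => $video['id']]);
    }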
To be honest, this isn't a great use of cron and you'd do better with a long-running process. Software like supervisord can allow you to create a long-running PHP script which is automatically restarted if it crashes.

PHP: Scrape and download files from URLs with an interval

I've built a scraper to get some data from another website. The scraper currently runs at the command line in a screen session, so the process never stops. Between each request I've set an interval to keep things calm. In one scrape it's possible that around 100 files come along which need to be downloaded, and this process also has an interval after every download.
Now I want to add back-end functionality to scrape on the fly. Everything works fine: I get the first data set, which only takes 2 requests. Within the returned data I have an array of files that need to be downloaded (it can be 10, it can be 100+). I would like to create something so the user can see in real time how far along the download process is.
The thing I am facing is that when the scraper has 2 jobs to do in a browser window, with 20+ downloads plus intervals to keep things calm, it takes too much time. I am thinking about saving the files that need to be downloaded into a database table and handling that part of the process with another shell script (screen) or cron job.
I am wondering whether my thinking is on the right track, whether it is overkill, or whether there are better examples of handling these kinds of processes.
Thanks for any advice.
p.s. I am developing in PHP
If you think that is overkill, you can run the script and wait until the task is finished before running it again.
Basically you need to implement a message queue where the HTTP request handler (the front controller?) emits a message to fetch a page, and one or more workers do the job, optionally emitting more messages to the queue to download files.
There are plenty of MQ brokers, but you can implement your own using a database as the queue storage.
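A minimal sketch of such a database-backed queue; the jobs table (id, type, payload, status) and the scrapeFileUrls()/downloadFile() helpers are invented for illustration:

    <?php
    function enqueue(PDO $pdo, string $type, array $payload): void {
        $pdo->prepare("INSERT INTO jobs (type, payload, status) VALUES (:t, :p, 'pending')")
            ->execute(['t' => $type, 'p' => json_encode($payload)]);
    }

    // In the front controller: answer the HTTP request right away and queue the slow work.
    // enqueue($pdo, 'scrape', ['url' => $url]);

    // In the worker (screen or cron): pop one job at a time.
    $pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
    while (true) {
        $job = $pdo->query("SELECT id, type, payload FROM jobs WHERE status = 'pending' LIMIT 1")->fetch();
        if (!$job) { sleep(5); continue; }
        $pdo->prepare("UPDATE jobs SET status = 'running' WHERE id = ?")->execute([$job['id']]);

        $data = json_decode($job['payload'], true);
        if ($job['type'] === 'scrape') {
            foreach (scrapeFileUrls($data['url']) as $fileUrl) {
                enqueue($pdo, 'download', ['url' => $fileUrl]);   // fan out one job per file
            }
        } else {
            downloadFile($data['url']);
            sleep(2);                                             // interval to keep things calm
        }
        $pdo->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")->execute([$job['id']]);
    }

With a single worker this is enough; if you run several workers in parallel, claim jobs with a conditional UPDATE (keep status = 'pending' in the WHERE clause and check the affected row count) instead of the plain SELECT, so two workers never take the same job.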

Writing a PHP web crawler using cron

I have written myself a web crawler using simplehtmldom, and have got the crawl process working quite nicely. It crawls the start page, adds all links into a database table, sets a session pointer, and meta-refreshes the page to carry on to the next page. That keeps going until it runs out of links.
That works fine; however, the crawl time for larger websites is obviously pretty tedious. I would like to speed things up a bit, and possibly make it a cron job.
Any ideas on making it as quick and efficient as possible other than setting the memory limit / execution time higher?
Looks like you're running your script in a web browser. You may consider running it from the command line. You can execute multiple scripts to crawl on different pages at the same time. That should speed things up.
Memory should not be a problem for a crawler.
Once you are done with one page and have written all relevant data to the database you should get rid of all variables you created for this job.
The memory usage after 100 pages should be the same as after 1 page. If it isn't, find out why.
You can split up the work between different processes: parsing a page usually does not take as long as loading it, so you can write all the links that you find to a database and have multiple other processes that just download the documents to a temp directory.
If you do this you must ensure that:
no link is downloaded by two workers,
your processes wait for new links if there are none,
temp files are removed after each scan,
and the download processes stop when you run out of links. You can achieve this by setting a "kill flag": this can be a file with a special name or an entry in the database.
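A sketch of what a download worker's loop could look like, covering those points; the links table, its worker column, and the kill-flag path are invented for illustration:

    <?php
    $pdo  = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
    $kill = '/tmp/crawler.stop';                   // create this file to stop all workers

    while (!file_exists($kill)) {
        // Claim one link atomically so no link is downloaded by two workers.
        $pdo->exec("UPDATE links SET status = 'claimed', worker = CONNECTION_ID()
                    WHERE status = 'new' LIMIT 1");
        $link = $pdo->query("SELECT id, url FROM links
                             WHERE status = 'claimed' AND worker = CONNECTION_ID() LIMIT 1")->fetch();
        if (!$link) { sleep(2); continue; }        // wait for new links if there are none

        $tmp = tempnam(sys_get_temp_dir(), 'crawl');
        file_put_contents($tmp, file_get_contents($link['url']));
        // ... parse $tmp with simplehtmldom, insert newly found links with status 'new' ...
        unlink($tmp);                              // remove the temp file after each scan

        $pdo->prepare("UPDATE links SET status = 'done' WHERE id = ?")->execute([$link['id']]);
    }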
