I have a service like Backupify, which downloads data from different social media platforms. Currently I have about 2,500 active users; for each user a script runs that gets data from Facebook and stores it on Amazon S3. My server is an EC2 instance on AWS.
I have around 900 entries in a table for Facebook users. A PHP script gets a user from the database table, backs up their data from Facebook, and then picks the next Facebook user.
Everything was fine when I had fewer than 1,000 users, but now that I have more than 2,500 users the problem is that the PHP script halts, or runs for the first 100 users and then halts, times out, etc. I am running the script from the command line with php -q myscript.php.
The other problem is that a single user takes about 65 seconds to process, so reaching the last user in the database table may take days. What is the best way to run the backups in parallel over the database table?
Please suggest the best way to back up a large amount of data for a large number of users. I should also be able to monitor the cron jobs, with something like a manager.
If I get it right, you've got a single cron task for all the users, running at some frequency, trying to process the data of every user in a single shot.
Did you try issuing set_time_limit(0); at the beginning of your code?
Also, if the task is resource demanding, did you consider creating a separate cron task for every N users (basically mimicking multithreaded behaviour, and thus utilizing multiple CPU cores of the server)?
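As a rough sketch of that batching idea (the table, columns and backup_user_to_s3() are stand-ins for your existing per-user code), each cron entry would call the same script with its own slice, e.g. php -q backup_batch.php 0 500 and php -q backup_batch.php 500 500:

<?php
// backup_batch.php - hypothetical batch worker; one cron entry per slice of users
set_time_limit(0);

$offset = isset($argv[1]) ? (int) $argv[1] : 0;   // where this batch starts
$limit  = isset($argv[2]) ? (int) $argv[2] : 500; // how many users it handles

$pdo  = new PDO('mysql:host=localhost;dbname=backups', 'user', 'pass');
$rows = $pdo->query("SELECT id, fb_token FROM users ORDER BY id LIMIT $limit OFFSET $offset");

foreach ($rows as $user) {
    backup_user_to_s3($user); // your existing per-user Facebook backup routine
}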
Is writing your data to some kind of cache instead of the database and having a separate task commit the cache contents to the database feasible for you?
Do you have the opportunity to use an in-memory data table (that's pretty quick)? You'll need to persist the DB contents to disk every now and then, but for this price you get fast DB access.
Can you maybe outsource the task to separate servers as a distributed service and write the cron script as a load balancer for them?
Also, optimizing your code might help. For example (if you're not doing so yet), you could buffer the collected data and commit it in a single transaction at the end of the script, so the execution flow is not broken up by recurring DB I/O.
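A minimal sketch of that buffering pattern with PDO (the table name and the fetch_facebook_data() helper are placeholders; $users is whatever list you already iterate over):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=backups', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$buffer = [];
foreach ($users as $user) {
    $buffer[] = fetch_facebook_data($user); // collect in memory, no DB writes yet
}

$pdo->beginTransaction();
$stmt = $pdo->prepare('INSERT INTO backups (user_id, payload) VALUES (?, ?)');
foreach ($buffer as $row) {
    $stmt->execute([$row['user_id'], $row['payload']]);
}
$pdo->commit(); // one commit at the end instead of one write per user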
I have a cron job script that runs every 60 seconds to process and store results in a database. That’s a maximum of 1,440 new database entries per day.
I need to have many millions of database entries, so doing this with just one instance of the script is really impractical. I'm looking for a minimum of a 50x speed-up, and ideally 300x to 500x if the cost is reasonable.
It seems like I need a server farm, but I have to use Amazon Web Services to process this data. How can I set this script up to run many simultaneous instances, while storing the data in a single, unified database?
Do I need to create completely separate server instances for every time I want to run this script, multiplying the cost?
Thank you for your help!
A serverless approach, using a remote Lambda function triggered by a queue system to execute your job, solves your problem both technically and at the pricing level.
https://aws.amazon.com/lambda/
For example, you can trigger Lambda executions from a local centralized script (e.g. driven by a single cron job) that enqueues one message on a queue for each entry you need to compute, so the entries are processed asynchronously and concurrently.
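As a rough illustration, assuming the AWS SDK for PHP and an SQS queue wired up as the Lambda trigger (the queue URL, message shape and $entriesToProcess are made up here), the cron-driven script could look like:

<?php
use Aws\Sqs\SqsClient;

require 'vendor/autoload.php';

$sqs = new SqsClient(['region' => 'us-east-1', 'version' => 'latest']);
$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/jobs'; // placeholder

// $entriesToProcess: whatever list of pending work items your cron script builds
foreach ($entriesToProcess as $entry) {
    $sqs->sendMessage([
        'QueueUrl'    => $queueUrl,
        'MessageBody' => json_encode(['entry_id' => $entry['id']]),
    ]);
}
// The Lambda function subscribed to the queue then processes each message concurrently.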
The Serverless framework can help you to avoid AWS lock-in:
https://serverless.com/
I'm just getting started with queues, and they work fine for messaging and for sending emails and SMSes via Twilio, etc.
But now I want to do something more complex and time consuming. I'm looking to upload a file of about 10,000 rows to Amazon S3, parse it, check for duplicates, and then insert only the records that aren't duplicates.
When I run this process it takes over 6 minutes to complete, which is way too long. I want to have it run in the background, with a visual progress bar that gets updated sporadically based on the queue status.
Also, while this is running, I want users to have full access to the site and database tables. This process will lock my main table.
So I basically want to have it run in the background, touch the main table only once to check for duplicates, and otherwise process/parse the file into a temporary table of 10,000+ rows, leaving the main table free.
Once completed, it will then write back to the main table only once.
How can I achieve this without slowing down the site/main server? I apologize for the extremely broad question.
Laravel Queues can do what you want, but there are a couple of points in your question to address.
How can I achieve this without slowing the site/main server down?
Well, the queue is run as a separate process on the server, so you probably won't see a major impact on the server, provided your background process doesn't do anything too stressful for the server. If you're concerned about an impact on performance and you're running a Linux server, there are options for limiting the resources used by processes - check out the renice command which allows you to adjust the priority of processes. If you're not on Linux, then there are probably other options for your OS.
With respect to the database, that's harder to answer without knowing what your tables look like. It might be possible to do the check for duplicates with a single query and JOIN on the two tables, perhaps writing the results of the check to a different table. This might work, but it could also take a long time depending on how the tables are set up. Another solution would be to use a mirror of the main database table - copy it temporarily, do your work, then delete it. And finally, for a really involved solution, set up database replication and work off a slave.
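For the JOIN idea, one possible shape (assuming PDO, with $pdo as your connection and made-up table/column names) is to insert into the main table only the staging rows that have no match there:

<?php
// assumes the uploaded file has already been parsed into staging_records
$pdo->exec("
    INSERT INTO main_records (ref, payload)
    SELECT s.ref, s.payload
    FROM staging_records AS s
    LEFT JOIN main_records AS m ON m.ref = s.ref
    WHERE m.ref IS NULL
");

That touches the main table in a single statement, which lines up with the "only write back once" requirement, though how long it takes still depends on indexing.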
As for running the queue worker, I have found that using supervisord to run my background worker is VERY helpful - it allows me to start/stop the process easily and it will automatically restart the process when it fails. The documentation on queue listeners has some discussion of this.
And the worker will fail - I have found that my worker process fails on a pretty regular basis. I think it has something to do with the PHP CLI settings, but it hasn't caused me any issues so I haven't really investigated it further. However, for a long-running job, you might run into difficulties. One way to mitigate this would be to break your job up into multiple smaller jobs and "daisy-chain" them together: when part1 finishes, it queues up part2; when part2 finishes, it queues up part3, etc.
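A hedged sketch of that daisy-chaining with a queued Laravel job (assuming a reasonably recent Laravel; the class name, chunk size and processRows() helper are all invented here):

<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class ImportChunk implements ShouldQueue
{
    use InteractsWithQueue, Queueable, SerializesModels;

    protected $file;
    protected $offset;

    public function __construct($file, $offset = 0)
    {
        $this->file = $file;
        $this->offset = $offset;
    }

    public function handle()
    {
        // import up to 1,000 rows starting at the current offset
        $more = $this->processRows($this->file, $this->offset, 1000);

        if ($more) {
            // part N queues part N+1 only after finishing its own slice
            dispatch(new ImportChunk($this->file, $this->offset + 1000));
        }
    }

    private function processRows($file, $offset, $count)
    {
        // ... read $count rows from $file starting at $offset, insert the
        // non-duplicates, and return true while rows remain ...
        return false;
    }
}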
As for the progress bar, that's pretty easy. Have the jobs update a value (in your database probably, or possibly in the filesystem) with the current status and have a Javascript function on the client periodically performing an AJAX request to get that value & update the progress bar.
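For instance (the route, table and column names are assumptions), the job can update a row as it works and a small route can expose it for the AJAX poll:

// routes/web.php
Route::get('/import/{id}/progress', function ($id) {
    $job = DB::table('import_jobs')->find($id); // the queued job updates processed_rows as it goes
    return response()->json([
        'processed' => $job->processed_rows,
        'total'     => $job->total_rows,
    ]);
});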
I have a customer-facing website which requires customers to upload an image; on doing so, my script saves about 25-30 variations of the image onto the server using the GD library. Because of the number of images, there is currently a very long wait for the customer, who cannot continue on the site until all the images have been created and saved, so we get a high level of customers leaving the site.
Is it possible, after the upload, to instead store the image URL in a database table record, and then have a PHP script which creates the 25-30 images pull each record from the database, running every 5 minutes of the day via a cron job? This way the customer can continue through the website and the images are created automatically 'in the background'.
Will all this going on in the background cause any issues for the speed of my site? Will it slow down the site for people browsing, especially if tens or hundreds of customers are using it at the same time?
I suggest you start looking at queues, and more specifically at Gearman.
This will decrease the load time for your customers as you can offload the generation of the images to a separate server. And it scales easily across multiple servers if you need more processing power.
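A bare-bones sketch using the PECL gearman extension (the host, port, job name and payload are illustrative only):

<?php
// client side: fire-and-forget a resize job right after the upload
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('resize_image', json_encode(['path' => '/uploads/abc123.jpg']));

// worker side: run this on the (possibly separate) image-processing server
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('resize_image', function (GearmanJob $job) {
    $data = json_decode($job->workload(), true);
    // generate the 25-30 GD variations of $data['path'] here
});
while ($worker->work());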
Having a PHP script process the images would cause no more lag in a cron job than it would when the customer uploads the image and waits for the processing to complete in real time. So in short, no, there's no additional impact from this approach.
The catch is, you need to make sure your cron job is self-aware and does not create overlaps. For example if the cron runs and takes more than 5 minutes to complete its current task, what happens when a second cron spins up and begins processing another image (or the same one if you didn't implement the queue properly)? Well now you have two crons running, fighting for resources. Which means the second one will likely take over 5 minutes as well. Eventually you get to the point with 3, 4, etc crons all running at once. So make sure your crons are only "booting up" if there's not one running already.
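One simple way to get that self-awareness is a flock() guard at the top of the cron script (the lock path and process_pending_images() are placeholders); unlike a plain lock file, the lock is released automatically if the process dies:

<?php
$lock = fopen('/tmp/image-worker.lock', 'c');
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0); // a previous cron run is still busy, bail out quietly
}

process_pending_images(); // your existing queue-processing routine

flock($lock, LOCK_UN);
fclose($lock);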
All this being said, you'd probably be best off having another server handle the image processing depending on the size of your client's site and how heavy their traffic is. You could have a cloud server in a cluster with your production site server which can connect via local network to access an image, process it, and return the 25-30 copies to the server in the appropriate location. This way your processing queue occupies 0 resources of the public facing web server and will have no impact on speed of the site itself.
Sure, you can store the path of an image on your server and process it later.
Create a PHP script that, when run, creates a LOCK file, e.g. "/tmp/imgprocessor.lock", and deletes it at the end. When cron starts a new process, first check that the file doesn't exist.
I would store uploaded images in e.g. pathtoimages/toprocess/ and delete each one after processing, or move it elsewhere. Put the new images in e.g. pathtoimages/processed/.
This way you don't need to query the DB for the image paths; you just process whatever is in the 'toprocess' folder, and you only need UNIQ_NAME_OF_IMAGE in the table. In your web script, before loading the page, check whether UNIQ_NAME_OF_IMAGE exists in the 'processed' folder, and if so display it.
On server load: it depends how many images you originally have and what sizes they are. Image processing can be heavy on a server, but processing 1000 users * 30 images won't be a heavy-duty task; as I said, it depends on the size of the images.
NOTE: if you go this way, you need to make sure that when the cron starts, the ERROR log is also output to some log file. The cron script must be bulletproof: if it fails for some reason, the LOCK file will remain, so no more processing will happen and you will need to delete it manually (or create a custom error handler that deletes it and maybe sends some mails). You should check the log file periodically so you know what's going on.
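Putting those pieces together, a rough sketch of the cron script (the paths, the make_variations() helper and the log location are all placeholders):

<?php
$lockFile = '/tmp/imgprocessor.lock';
if (file_exists($lockFile)) {
    exit; // previous run still going, or it crashed and the lock needs manual cleanup
}
touch($lockFile);

foreach (glob('pathtoimages/toprocess/*') as $src) {
    try {
        make_variations($src, 'pathtoimages/processed/'); // creates the 25-30 GD copies
        rename($src, 'pathtoimages/processed/' . basename($src));
    } catch (Throwable $e) {
        error_log(date('c') . ' ' . $e->getMessage() . "\n", 3, '/var/log/imgprocessor.log');
    }
}

unlink($lockFile);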
I need to run automated tasks every 15 minutes. The task is for my server (call it Server A, Ubuntu 10.04 LAMP) to query another server (Server B) for updates.
I have multiple users to query for, possibly 14 (or more). As of now, the scripts are written in PHP. They do the following:
request Server B for updates on users
If Server B says there are updates, then Server A retrieves the updates
In Server A, update the DB with the new data for those users
run calculations in server A
send prompts to users that I received data updates from.
I know cron jobs might be the way to go, but there could be a scenario where I might have a cron job for every user. Is that reasonable? Or should I force it to be one cron job querying data for all my users?
Also, the server I'm querying has a Java API that I can use to query it, which means I could develop a Java servlet to do the same. I've had trouble with this approach, but I'm looking for feedback on whether this is the way to go. I'm not familiar with Tomcat and don't fully understand it yet.
Summary: I need my server to run tasks automatically every 15 minutes that request data from another server, update its DB, and then send prompts to users. What are the recommended approaches?
Thanks for your help!
Create a single script, triggered by cron, that loops through each user and takes all three steps for each user. Pseudo code:
$users = query_local_db('SELECT * FROM users');   // list of users from the local DB

foreach ($users as $user) {
    $updates = check_server_b_for_updates($user); // step 1: ask Server B for updates
    if ($updates) {
        update_local_db($user, $updates);         // step 2: store the new data
        email_user($user);                        // step 3: prompt the user
    }
}
// the helper functions here are stand-ins for your own DB and API code
If you have a lot of users or the API is slow, you'll either want to set a long script timeout (ini_set), or you could add a TIMESTAMP DB column "LastUpdateCheck" and run the cron more often (every 30 seconds?) while limiting the update/API query to one or two users per instance (those with the oldest update times).
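A sketch of that second option (the LastUpdateCheck column comes from above; the connection details and helper functions are the same stand-ins as in the pseudo code): each run picks the two stalest users and stamps them when done.

$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->query('SELECT * FROM users ORDER BY LastUpdateCheck ASC LIMIT 2');

foreach ($stmt as $user) {
    $updates = check_server_b_for_updates($user);
    if ($updates) {
        update_local_db($user, $updates);
        email_user($user);
    }
    $pdo->prepare('UPDATE users SET LastUpdateCheck = NOW() WHERE id = ?')
        ->execute([$user['id']]);
}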
On the front-end, I have a PHP webapp that allows users to create a list of their websites (5 max).
On the back-end, a Python script runs daily (and has ~10 iterations) for each website that the user registers. Each script per website takes about 10 seconds to run through all iterations and finish its scraping. It then makes a CSV file with its findings.
So, in total, that's up to (5 websites * 10 iterations =) 50 iterations, or roughly 50 seconds of scraping, per user each day.
Right now, the script works when I manually feed it a URL, so I'm wondering how to make it dynamically part of the webapp.
How do I programmatically add and remove scripts that run daily depending on the number of users and the websites each user has each day?
How would I schedule this script to run for each website of each user, passing in the appropriate parameters?
I'm somewhat acquainted with cron jobs, as they're the only thing I know of that is made for routine processes.
You can make the PHP app place the URLs into a database (MySQL, SQLite, etc.) or a text file. Then loop through the database/text file in your Python script. Use cron to run the Python script every day.
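For instance (the table, column and path names are placeholders), the PHP side just records each site when the user saves it, and a single crontab entry runs the Python scraper once a day:

<?php
// called when the user adds a website in the PHP webapp
$pdo  = new PDO('mysql:host=localhost;dbname=webapp', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO sites (user_id, url) VALUES (?, ?)');
$stmt->execute([$userId, $url]);

// crontab entry on the server: run the Python scraper daily at 02:00,
// and let it read the URLs back out of the sites table
// 0 2 * * * /usr/bin/python /path/to/scraper.py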
There are lots of resources for learning the Cron syntax:
http://google.com/search?q=cron+tutorial
Do you need to run the script 50 times per user, or only when the user has logged into your service to check on things?
Assuming you're using a database to store the users' web sites, you can have just 1 script that runs as a daily cron job and queries the database for the list of sites to process.