This is more a request for advice/best practice. We have a site with many users (> 200,000), and we need to send emails to all of them about events occurring in their areas. What would be the best way to stagger adding the jobs?
Things to note:
We store everything in a MySQL database
The emails go out on a queue-based system, with independent workers grabbing the tasks and sending them out.
We collect username and join date, which we can use for grouping
Sending the emails is not the problem; the problem is getting the jobs added. I am afraid of a performance hit if we suddenly try to add that many jobs at once.
I assume your requirement is something like sending newsletters to groups and subscribers.
Do you already have groups, and is it possible to implement them?
That will help you filter and avoid scanning the entire 200,000 users.
Sending the emails based on groups will reduce the DB load, I hope!
You can also keep an active/inactive status for each user in the DB.
Running a cron job is the solution, but the interval should be based on the load that the job puts on your server.
So if the DB design and the job intervals are good, performance will be better.
I assume your queue is a table in a database, and your concern is that adding thousands of records to a table will thrash it because the index gets rebuilt on each insert?
If you add many entries within a single process (e.g. a single HTTP request or a single cron job script), you can start a transaction before inserting and commit when done. With all the inserts inside a transaction, the index will only be updated once.
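A minimal sketch of that idea, assuming a PDO connection and a hypothetical email_jobs queue table:
// Wrap the bulk job inserts in one transaction so the work is committed
// (and the index updated) in one go. The DSN, the email_jobs table and
// $users/$body are assumptions standing in for your own code.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$pdo->beginTransaction();
$stmt = $pdo->prepare('INSERT INTO email_jobs (user_id, body) VALUES (?, ?)');
foreach ($users as $user) {
    $stmt->execute([$user['id'], $body]);
}
$pdo->commit();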
If it's a more general problem, you might want to consider using a message queue instead of a database table.
Or am I completely off?
Set a cron job for every 5 minutes. Have it check if there are emails to send. If there are, and none are set as "in progress" yet, pick the first one and set it as being in progress. Select the next n users by id and send to them. Keep track of the last id processed, and repeat until you reach the end of the user list.
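A rough sketch of that loop; the users table, the queue helper and the checkpoint functions are assumptions:
// Each cron run resumes from the last processed id and enqueues one batch.
$lastId = get_checkpoint();   // hypothetical: last user id handled so far
$stmt = $pdo->prepare(
    'SELECT id, email FROM users WHERE id > :last ORDER BY id LIMIT :n');
$stmt->bindValue(':last', $lastId, PDO::PARAM_INT);
$stmt->bindValue(':n', 500, PDO::PARAM_INT);
$stmt->execute();
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    add_email_job($row['id'], $row['email']);   // hypothetical queue insert
    $lastId = $row['id'];
}
save_checkpoint($lastId);   // hypothetical: persist for the next run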
I developed a web application for running email campaigns. A cron is running to send multiple emails (up to 100 in a single request) per minute.
SELECT id,email,email_text FROM recipients
WHERE sent_status=0 LIMIT 100
This script takes approx. 70-100 seconds to send all the emails using PHP. After sending each email, I update sent_status=1.
Now the problem is that, due to shared hosting, the script cannot process more than 50-60 records in 60 seconds, so the next request starts and also selects the 40 or so records that are still being processed by the first request and have not been updated yet. Because of this, some recipients receive duplicate emails.
Can this be prevented by using locking or some other solution?
UPDATE
My question is very similar to the linked duplicate question, except that I am actually SELECTing data from multiple tables, using GROUP BY, and using an ORDER BY clause on multiple columns including RAND().
My actual query is something like this:
SELECT s.sender_name,
s.sender_email,
r.recipient_name,
r.email,
c.campaign_id,
c.email_text
FROM users s, recipients r, campaigns c
WHERE c.sender_id=s.sender_id
AND c.recipient_id=r.recipient_id
AND sent_status=0
GROUP BY c.sender_id, r.recipient_id
ORDER BY DATE(previous_sent_time), RAND()
LIMIT 100
Thanks
You shouldn't try to fix this by using some database mechanics.
Instead, you should rethink your method of processing the "sending".
In your case, I would perform the following steps:
Create the emails you want to send and store them in the database. Maybe 100,000 records in 10 seconds - that's no issue.
Use a script that processes these records according to your limitations (50-60 mails per minute) - that's a simple SELECT with proper limits, called every minute.
Voila, your mails are being sent. 100,000 mails at 60 mails per minute would take about 28 hours - but you can't bypass hosting limitations by altering code.
Wrap the execution in a singleton, or use some "locking" method, to make sure there is only one mail-queue processor active. Then you won't have any issues with double selects of the same mail-queue entry.
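For the locking part, a minimal sketch using a PHP lock file (the path and the processing function are assumptions):
$fp = fopen('/tmp/mail-queue.lock', 'c');
if (!flock($fp, LOCK_EX | LOCK_NB)) {
    exit; // another mail-queue processor is already running
}
process_mail_queue();   // hypothetical: SELECT a batch, send, mark as sent
flock($fp, LOCK_UN);
fclose($fp);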
I actually ran into this issue myself when developing a similar app. My solution was that, at the beginning of the cron, I mark every task the run is about to process as "in progress" in the database.
Once the script is done, it marks it as done and moves on.
Using this method, if another script runs over the same item, it will automatically skip it.
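A hedged sketch of that claim-then-skip pattern, assuming an extra run_id column and status values 0 = pending, 2 = in progress, 1 = sent:
// Claim up to 100 pending rows for this run before doing any slow work.
$runId = getmypid();
mysqli_query($db, "UPDATE recipients SET sent_status = 2, run_id = $runId
                   WHERE sent_status = 0 LIMIT 100");
$res = mysqli_query($db, "SELECT id, email, email_text FROM recipients
                          WHERE sent_status = 2 AND run_id = $runId");
while ($row = mysqli_fetch_assoc($res)) {
    send_mail($row);   // hypothetical mailer
    mysqli_query($db, 'UPDATE recipients SET sent_status = 1
                       WHERE id = ' . (int)$row['id']);
}
// An overlapping cron run only claims rows still at sent_status = 0,
// so it automatically skips everything already in progress.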
I am putting together an interface for our employees to upload a list of products for which they need industry stats (currently they do them manually, one at a time).
Each product will then be served up to our stats engine via a web service API.
The stats engine will be requesting the "next victim" from my API, and I will be replying.
Each list the users upload will have between 50 and 1000 products, and will be its own queue.
For now, Queues/Lists will likely be added (& removed via completion) approx. 10-20 times per day.
If successful, traffic will probably rev up after a few months to something like 700-900 lists per day.
We're just planning to go with a simple round-robin approach to direct the traffic evenly across queues.
The multiplexer would grab the top item off of List A, then List B, then List C and so on until looping back around to List A again ... keeping in mind that lists/queues can be added/removed at any time.
The issue I'm facing is just conceptualizing the management of this.
I thought about storing each queue as a flat file and managing the rotation via relational DB (MySQL). Thought about doing the reverse. Thought about going either completely flat-file or completely relational DB ... bottom line, I'm flexible.
Regardless, my brain is just vapor locking when I try to statelessly meld a variable list of participants with a circular rotation (I just got back from a quick holiday, and I don't think my brain's made it home yet ;)
Has anyone done something like this?
How did you handle it?
What would you improve if you had to do it again?
Any & all tips/suggestions/advice are welcome.
NOTE: Since each request from our stats engine/tool will be separated by many seconds, if not a couple of minutes, I need to keep this stateless.
List data should be stored in a database, for sure. Your PHP side should have a view giving the status of the system, and the form to add lists.
Since each request becomes its own queue, and all the request-queues are considered equal in priority, the ideal number of tables is probably three: one to list the requests, their priority relative to one another (to determine who goes next in the round-robin), and their processing status; another to hold the yet-to-be-processed items of each request; and a third to hold the processed items from each queue.
You will also need a script that does the actual processing, that is not driven by a user request, but instead by a system-scheduled job that executes periodically (throttled to whatever you desire). This can of course also be in PHP. This is where you would set up your 10-at-a-time list checks and updates.
The processing would be something like:
Select the next set of at most 10 items from the highest-priority queue.
Process them, updating their DB status as they complete.
Update the priority of the above queue so that it is now the lowest priority.
And if new queues are added, they would be added with lowest priority.
Priority could be represented with an integer.
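A hedged sketch of that rotation; the queues table and its columns are my assumptions, with the lowest integer meaning "next in line":
// Take the queue at the front of the rotation.
$q = mysqli_fetch_assoc(mysqli_query($db,
    'SELECT queue_id FROM queues ORDER BY priority ASC LIMIT 1'));
// ... process up to 10 pending items belonging to $q['queue_id'] ...
// Then send this queue to the back of the rotation.
mysqli_query($db, 'UPDATE queues
    SET priority = 1 + (SELECT MAX(p.priority)
                        FROM (SELECT priority FROM queues) AS p)
    WHERE queue_id = ' . (int)$q['queue_id']);
// New queues are likewise INSERTed with MAX(priority) + 1, i.e. last in line.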
Your users would need to wait patiently for their list to be processed and then view or download the result. You might setup an auto-refresh script for this on your view page.
It sounds like you're trying to implement something that Gearman already does very well. For each upload / request, you can simply send off a job to the Gearman server to be queued.
Gearman can be configured to be persistent (just in case things go to hell), which should eliminate the need for you to log requests in a relational database.
Then, you can start as many workers as you'd like. I know you suggest running all jobs serially, which you can still do, but you can also parallelize the work, so that your user isn't sitting around quite as long as they would've been if all jobs had been processed in a serial fashion.
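For illustration, a minimal sketch with PHP's gearman extension; the server address, function name and payload are assumptions:
// Client side (e.g. in the upload handler): queue one background job per list.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('process_list', json_encode(['list_id' => 42]));

// Worker side: start several of these to parallelize the work.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('process_list', function (GearmanJob $job) {
    $data = json_decode($job->workload(), true);
    // ... fetch and process the items of $data['list_id'] ...
});
while ($worker->work());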
After a good night's sleep, I now have my wits about me (I hope :).
A simple solution is a flat file for the priorities.
Have a text file simply with one List/Queue ID on each line.
Feed from one end of the list, and add to the other ... simple.
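A tiny sketch of that rotation; the file path is an assumption:
// Pop the next queue ID off the front of the file and push it onto the back.
// (Wrap this in flock() if two requests could hit it at the same time.)
$file   = '/var/app/queue-rotation.txt';
$queues = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$next   = array_shift($queues);
$queues[] = $next;
file_put_contents($file, implode("\n", $queues) . "\n");
// $next is the List/Queue ID to serve the "next victim" from.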
Criticisms are welcome ;o)
Thanks @Trylobot and @Chris_Henry for the feedback.
On the webpage there is a google map where the user can change the location to one that he is interested in and sign up for alerts of new jobs by pressing a button. The location of interest saved will be defined by the bounds of the google map. Whenever a new job appears within that bound, an email alert will be sent to that user based on a frequency chosen by him (every hour or every day).
Problem: I am confused on how I should process all the alerts for all users.
Currently I am thinking of using one cron job, running every hour, over a table with lat1, lng1, lat2, lng2, user_id for hourly alerts, and another cron job, running once a day (say 9pm), over another table for daily alerts. Each cron job will loop through all the individual users' lat/lng pairs that define the Google Map bounds and query the main jobs database for any jobs with a posting timestamp within the last hour (or day). If there are any, an email alert will be sent.
This seems like a lot of work for the server, especially with 5,000 users' location preferences and 1,000,000 jobs in the database (30-ish minutes to finish the cron job?). I am stuck here and would like your opinions.
Instead of searching everything every time the cron runs (assuming I'm reading correctly that that's what you're doing), I'd consider performing that check when the alert is added:
Alert added to the system. System checks for any matching boundaries, if any are found then for each match store that info into a separate table. Stick two extra columns in this new table, one for hourly sending, one for daily.
On the hourly check, just send those where the hourly flag hasn't yet been applied, and for the daily, send those where the daily flag hasn't been set.
Then delete any where both have been set afterwards.
Doing it this way, you'll be breaking up the work to be done from one massive check on each cron job (All alerts, all boundaries), to one smaller check for each alert (One alert, all boundaries).
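Reading the newly added item as a job posting, a hedged sketch of that staging step; table and column names are assumptions:
// When a job is posted, stage one pending-alert row per matching subscription.
$stmt = $pdo->prepare(
    'INSERT INTO pending_alerts (user_id, job_id, sent_hourly, sent_daily)
     SELECT user_id, :job, 0, 0 FROM subscriptions
     WHERE :lat BETWEEN lat1 AND lat2
       AND :lng BETWEEN lng1 AND lng2');
$stmt->execute([':job' => $jobId, ':lat' => $jobLat, ':lng' => $jobLng]);
// The hourly cron mails rows WHERE sent_hourly = 0, the daily cron rows
// WHERE sent_daily = 0, and rows with both flags set are then deleted.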
I think you can probably create two crons with these frequencies:
by hour.
by day.
(or any frequency you like)
Rather than processing all the alerts for all users, why not, when a user subscribes to a location, have your PHP CodeIgniter code create a task file with the details of this job - for example user_id, location (coordinates), and frequency? The exact details of the task file depend on your situation, and you will need to analyze that within your system. Then place this task file in a directory.
Then, based on the frequency specified above, create a general PHP script to be called at that frequency, as sketched below. This script will loop through the directories, process the task files and send out the emails. This way, you will not need to scan the whole database. There are also minor details like removing, updating and deleting task files, but those are entirely implementation-related.
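A rough sketch of that processor; the directory layout, task-file format and helper functions are assumptions:
// Called hourly (a sibling script would handle the daily directory).
foreach (glob('/var/app/tasks/hourly/*.json') as $path) {
    $task = json_decode(file_get_contents($path), true);
    // $task holds user_id, the map bounds and the chosen frequency.
    $jobs = find_new_jobs_in_bounds($task['bounds']);   // hypothetical query
    if ($jobs) {
        send_alert_email($task['user_id'], $jobs);      // hypothetical mailer
    }
}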
Side note: this is probably irrelevant since you tagged this question with PHP, but just in case you would like to know, Quartz does exactly what you want - though it is in Java. You can find out here if you want.
I'm looking for a technique to do the following and I need your advice.
I have a really huge table with registration IDs, and I need to send messages to the owners of those IDs. I can't send the message to many recipients at once; this needs to be processed one by one. So I would like a PHP script that can run in many parallel instances (processes), each grabbing some chunk from the DB and processing it. In other words, every process needs to work with a particular range of data. I would also like to be able to stop each process and later continue sending from where it stopped, on to the users who haven't got the message yet.
Is this possible? Any tips and advice are welcome.
You may wish to set a cron job, typically one of the best approaches to run large batch operations with PHP scripts:
http://www.developertutorials.com/tutorials/php/running-php-cron-jobs-regular-scheduled-tasks-in-php-172/
Your cron job will need to point to a PHP script which does the following:
1. Selects a subset of recipients from your large DB table, based on a flag set at step 3 (below), identifying the next batch to process
2. Sends email to those selected recipients
3. Saves a note of the current job position and its success/failure (i.e. you could set a flag next to each recipient in the DB who is successfully mailed; these are then not selected when the job is rerun)
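A hedged sketch of that script; table and column names are assumptions:
// Each cron run mails one batch and flags successes so the next run
// skips them (the flag from step 3).
$batch = mysqli_query($db,
    'SELECT id, email FROM recipients WHERE mailed = 0 LIMIT 50');
while ($row = mysqli_fetch_assoc($batch)) {
    if (mail($row['email'], 'Your subject', 'Your message')) {
        mysqli_query($db,
            'UPDATE recipients SET mailed = 1 WHERE id = ' . (int)$row['id']);
    }
}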
Parallel processing is possible only to the extent that your server's configuration allows it. Many servers can serve pages in parallel, but even then it is limited to a few at a time. Instead, the rule of thumb is to be as fast as possible and jump to the next request.
Regarding your processing of a really large list of data in your database: you will first of all need a list of IDs for the mailing you are doing:
INSERT INTO `mymailinglisttable` (mailing_id, recipient_id, senton)
SELECT 123 AS mailing_id, mycontacttable.recipient_id, NULL
FROM mycontacttable
WHERE [insert your criteria for your contacts]
Next you will need either InnoDB or some clever logic for your parallel processing:
With InnoDB, you can do row-level locking (SELECT ... FOR UPDATE), but don't ask me how - search it yourself; I don't use InnoDB at all, but I know it is possible. So read the docs on that, select and lock some rows, send the emails, mark them as sent, and wash-rinse-repeat the operation by calling back your own script (either with AJAX or with a PHP socket).
Without InnoDB, you can simply add 2 fields to your database: one is a mypid (process ID) field, the other is a lockedon field. When you want to lock some addresses for your processing, do:
$mypid = getmypid().rand(1111,9999);
$now = date('Y-m-d G:i:s');
mysql_query('UPDATE mymailinglisttable SET mypid = '.$mypid.', lockedon = "'.$now.'" WHERE mypid IS NULL LIMIT 3');
This will lock 3 unclaimed rows with your PID and the current time. Then select the rows you locked using:
mysql_query('SELECT * FROM mymailinglisttable WHERE mypid = '.$mypid.' AND lockedon = "'.$now.'"');
You will retrieve the 3 rows that you locked, ready for processing. I tend to use this version more than the InnoDB version because I was raised with this method, not because it is more performant; actually, I'm sure InnoDB's version is much better - I've just never tried it.
If you're comfortable with using PEAR modules, I'd recommend having a look at the pear Mail_Queue module.
http://pear.php.net/package/Mail_Queue
Well documented and with a nice tutorial. I've used a modified version of this before to send out thousands of emails to customers and it hasn't given me a problem yet:
http://pear.php.net/manual/en/package.mail.mail-queue.mail-queue.tutorial.php
I want to extract some of the time-consuming things into a queue. For this I found Gearman to be the most used, but I don't know if it is the right thing for me.
One of the tasks we want to queue is sending emails, and we want to provide the ability to cancel sending a mail for up to 1 minute. So the worker should not pick up the job right away but execute it at now + 1 minute. That way I can cancel the job before then and it never gets sent.
Is there a way to do this?
It will run on Debian and should be usable from PHP. The only thing I found so far was Schedule a job in Gearman for a specific date and time, but that relies on something not widely available :(
There are two parts to your question: (1) scheduling in the future and (2) being able to cancel the job until that time.
For (1), at should work just fine as specified in that question, and the guy even posted his wrapper code. Have you tried it?
If you don't want to use that, consider this scenario:
Insert a record for the email to be sent into a database, including a "timeSent" column that you set to 1 minute in the future.
Have a single Gearman worker (I'll explain why a single one) look in the database for emails that have not been sent (e.g. some status column = 0) and whose timeSent has already passed, and send those.
So, for (2), if you want to cancel an email before it's sent just update its status column to something else.
Your Gearman worker has to be a single one because, if you have multiple, they might fetch and try to send the same email record. If you do need multiple, make sure the one that gets the email record first locks it immediately, before any time-consuming operations like actually emailing it (say, by updating that status column to something else), as in the sketch below.
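A hedged sketch of that single worker's loop; the table, columns and status values (0 = pending, 1 = sending, 2 = sent, 9 = cancelled) are assumptions:
// Claim due emails first, then do the slow sending.
$db->query('UPDATE emails SET status = 1
            WHERE status = 0 AND timeSent <= NOW() LIMIT 10');
$res = $db->query('SELECT id, recipient, body FROM emails WHERE status = 1');
while ($row = $res->fetch_assoc()) {
    mail($row['recipient'], 'Your subject', $row['body']);
    $db->query('UPDATE emails SET status = 2 WHERE id = ' . (int)$row['id']);
}
// Cancelling within the first minute is then just:
//   UPDATE emails SET status = 9 WHERE id = ? AND status = 0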