I'm looking for a technique to do the following and I need your advice.
I have a huge (really huge) table with registration IDs, and I need to send messages to the owners of these IDs. I can't send the message to many recipients at once; it has to be processed one by one. So I would like a PHP script that can run in many parallel instances (processes), each getting some amount of rows from the DB and processing them. In other words, every process needs to work with a particular range of data. I would also like to be able to stop each process and later continue sending from the user where it stopped, on to the users who didn't get the message yet.
Is this possible? Any tips and advice are welcome.
You may wish to set up a cron job; this is typically one of the best approaches for running large batch operations with PHP scripts:
http://www.developertutorials.com/tutorials/php/running-php-cron-jobs-regular-scheduled-tasks-in-php-172/
Your cron job will need to point to a PHP script which does the following:
1. Selects a subset of recipients from your large DB table, based on a flag set at step 3 (below), identifying the next batch to process
2. Sends email to those selected recipients
3. Saves a note of the current job position and its success/failure (i.e. you could set a flag next to each recipient in the DB who is successfully mailed; those are then not selected when the job is rerun)
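For illustration, a minimal sketch of such a script, assuming a hypothetical recipients table with a mailed flag (neither name is from the question):

$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// 1. Select the next batch of not-yet-mailed recipients.
$batch = $db->query('SELECT id, email FROM recipients WHERE mailed = 0 LIMIT 500')
            ->fetchAll(PDO::FETCH_ASSOC);

// 2. Send email to each of them.
$done = $db->prepare('UPDATE recipients SET mailed = 1 WHERE id = ?');
foreach ($batch as $row) {
    if (mail($row['email'], 'Subject', 'Message body')) {
        // 3. Flag success so the row is skipped when the cron job reruns.
        $done->execute([$row['id']]);
    }
}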
Parallel processing is possible only to the extent that your server's configuration allows it. Many servers can serve pages in parallel, but even then the number is limited to a few. Instead, the rule of thumb is to be as fast as possible and jump to the next request.
Regarding your processing of a really large list of data in your database: first of all you will need a list of IDs for the mailing you are doing:
INSERT INTO `mymailinglisttable` (mailing_id, recipient_id, senton)
SELECT 123 AS mailing_id, mycontacttable.recipient_id, NULL
FROM mycontacttable
WHERE [insert your criteria for your contacts]
Next you will need either InnoDB or some clever logic for your parallel processing:
With InnoDB, you can do row-level locking, but don't ask me how; read the docs yourself. I don't use InnoDB at all, but I know it is possible. So read up on that, select and lock some rows, send the emails, mark them as sent, and wash-rinse-repeat the operation by calling back your own script (either with AJAX or with a PHP socket). A sketch of that variant follows.
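A minimal sketch, assuming PDO and that the locking is done with SELECT ... FOR UPDATE (row locks are held until COMMIT, so concurrent workers cannot grab the same rows):

$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$db->beginTransaction();

// Lock up to 3 unsent rows; a second worker's identical query blocks until
// we commit, then re-evaluates the WHERE clause and picks the next rows.
$ids = $db->query('SELECT recipient_id FROM mymailinglisttable
                   WHERE senton IS NULL LIMIT 3 FOR UPDATE')
          ->fetchAll(PDO::FETCH_COLUMN);

foreach ($ids as $id) {
    // ... send the email for $id ...
    $db->prepare('UPDATE mymailinglisttable SET senton = NOW() WHERE recipient_id = ?')
       ->execute([$id]);
}
$db->commit();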
Without InnoDB, you can simply add 2 fields to your table: one is a process-id field (mypid below), the other is a lockedon timestamp. When you want to lock some addresses for your processing, do:
$mypid = getmypid().rand(1111, 9999); // pseudo-unique id for this worker
$now = date('Y-m-d H:i:s'); // 'H' (zero-padded hour) matches MySQL's DATETIME format
// only grab rows that no process has locked yet:
mysql_query('UPDATE mymailinglisttable SET mypid = '.$mypid.', lockedon = "'.$now.'" WHERE mypid IS NULL LIMIT 3');
This will lock 3 not-yet-locked rows for your PID at the current time. Select the rows you just locked using:
mysql_query('SELECT * FROM mymailinglisttable WHERE mypid = '.$mypid.' AND lockedon = "'.$now.'"');
You will retrieve the 3 rows that you locked, ready for processing. I tend to use this version more than the InnoDB one because I was raised on this method, not because it is more performant; actually, I'm sure the InnoDB version is better, I've just never tried it.
If you're comfortable with using PEAR modules, I'd recommend having a look at the PEAR Mail_Queue module.
http://pear.php.net/package/Mail_Queue
It is well documented and comes with a nice tutorial. I've used a modified version of it before to send out thousands of emails to customers and it hasn't given me a problem yet:
http://pear.php.net/manual/en/package.mail.mail-queue.mail-queue.tutorial.php
Related
For some reasons (which I think are beside the point of my question, but ask if it helps and I can describe why), I need to check MySQL tables continuously for new records. If any new records come in, I want to perform some related actions that are not important right now.
The question is: how should I continuously check the database so that I use the fewest resources while getting results as close to real time as possible?
For now, I have this:
$new_record_come = false;
while (!$new_record_come) {
    // note: MySQL's interval unit is singular (SECOND, not SECONDS)
    $sql = "SELECT id FROM Notifications WHERE insert_date > (NOW() - INTERVAL 5 SECOND)";
    $result = $conn->query($sql);
    // query() returns a result object even for zero rows, so check num_rows
    if ($result && $result->num_rows > 0) {
        // doing some related actions...
        $new_record_come = true;
    } else {
        sleep(5); // 5 second delay
    }
}
But I worry that if I get thousands of users, it will bring the server down, even if the server is an expensive one!
Do you have any advice to improve the performance, or to change the approach completely, or even the type of query, or any other suggestion?
Polling a database is costly, so you're right to be wary of that solution.
If you need to scale this application up to handle thousands of concurrent users, you probably should consider additional technology that complements the RDBMS.
For this, I'd suggest using a message queue. After an app inserts a new notification into the database, it also posts an item to a topic on the message queue; typically the primary key (id) is what you post.
Meanwhile, other apps listen on that topic. They don't need to poll: the way message queues work, the client simply waits until there's a new item in the queue, and the wait returns the item.
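Here is a sketch of both sides using RabbitMQ via php-amqplib (the queue name 'notifications', the connection details, and $newNotificationId are all assumptions, not from the question):

require __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();
$channel->queue_declare('notifications', false, true, false, false);

// Producer side: right after the INSERT commits, publish the new row's id.
$channel->basic_publish(new AMQPMessage((string) $newNotificationId), '', 'notifications');

// Consumer side: no polling; wait() blocks until a message arrives.
$channel->basic_consume('notifications', '', false, true, false, false, function ($msg) {
    // Look up the (already committed) row by its primary key and act on it.
    echo 'New notification id: ', $msg->body, "\n";
});
while (count($channel->callbacks)) {
    $channel->wait();
}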
A comment suggested using a trigger to invoke a PHP script. This won't work, because triggers execute while the transaction that spawned them is not yet committed: if the trigger runs a PHP script, that script probably needs to read the record from the database, but an uncommitted record is not visible to any other database session, so the PHP script can never read the data it was notified about.
Another angle (much simpler than a message queue, I think):
I once implemented this on a website by letting the clients poll a static file AND compare it to the latest id they had received.
For example: you have a table with a primary key, and you want to watch whether new items are added.
But you don't want to set up a database connection and query the table if there is nothing new in it.
Let's say the primary key is named 'postid'.
I had a file containing the latest postid.
I updated it with each new entry in tblposts, so it always contains the latest postid.
The polling scripts on the client side simply retrieved that file (do not use PHP for this; just let Apache serve it, which is much faster: name it lastpostid.txt or something).
The client compares it to its internally stored latest postid. If the file's id is bigger, the client requests the entries after its last one. This step DOES include a query.
The advantage is that you only query the database when something new is in it, and you can also tell the PHP script what your latest postid was, so PHP fetches only the later ones.
(Not sure if this will work in your situation, because it assumes that an increasing number means 'newer'.)
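A sketch of both halves, assuming the tblposts/lastpostid.txt names from above ($newPostId, $myLastPostId and getposts.php are illustrative placeholders):

// Server side: after inserting a new row into tblposts, record its id.
file_put_contents('/var/www/html/lastpostid.txt', $newPostId, LOCK_EX);

// Client side: fetch the static file (served by Apache, no PHP involved),
// compare, and only hit the database-backed endpoint when the id grew.
$latest = (int) file_get_contents('http://example.com/lastpostid.txt');
if ($latest > $myLastPostId) {
    $new = file_get_contents('http://example.com/getposts.php?after=' . $myLastPostId);
    $myLastPostId = $latest;
}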
This might not be possible with your current system design, but instead of using triggers or a heartbeat to poll the database continuously, how about going to the places where the updates happen and executing the other code from there? That way you avoid polling the database continuously, and the code fires ONLY IF somebody initiates a request.
I have an application in Symfony that needs to send Emails/Notifications from the App.
Since the Email/Notification sending process takes time, I decided to put the messages in a Queue and process the Queue periodically; hence I can decrease the response time for Requests that involve dispatching an Email/Notification.
The Cron Job (a PHP script behind a Symfony route) that processes the queue runs every 30 seconds and checks whether there are any unsent Emails/Notifications; if there are, it fetches all of them from the Queue Table and starts sending them. When an Email/Notification is sent, the row's status flag is updated to mark it as sent.
Now, when there are more Emails in the Queue than can be sent within 30 seconds, the next Cron Job starts running and begins sending emails from the same Queue, resulting in duplicate Emails/Notifications being dispatched.
My Table structure for Email Queue is as follows :
|-------------------------------------|
| id | email | body | status | sentat |
|-------------------------------------|
My ideas to resolve this issue are as follows:
1. Set a flag in the database that a Cron Job is running, and have other Cron Jobs skip their run if they find the flag set.
2. Update the status to 'sent' for all records first, and only then start sending the Emails/Notifications.
So my question is: is there an efficient approach to processing such Queues? Is there any Symfony Bundle/Feature for this specific task?
You can take the enqueue-bundle plus its Doctrine DBAL transport.
It already takes care of race conditions and related issues.
Regarding your suggestions:
1. What if the cron job process dies (for whatever reason) and cannot clean up the flag? A flag is not a good idea, I think. If you would like to follow this approach, you should not use a boolean but rather a process ID or a timestamp, so that you can check whether the process is still alive or whether it started suspiciously long ago without cleaning up.
2. Same question: what if the process dies? You don't want to mark the mails as sent before they are actually sent.
I guess I'd probably use two fields instead: one marking a record as "sending in progress" (thus telling other processes to skip it), and one marking it as "sending successfully completed". I'd write a timestamp to both, so that I can (automatically or manually) find records whose "sending in progress" timestamp is more than X seconds in the past, which would indicate a dead process.
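A sketch of that idea (the email_queue table and its claimed_by, sending_started_at and sent_at columns are my invention, not from the question):

$db  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pid = getmypid();

// Atomically claim unclaimed rows, or rows whose claim looks dead (> 300 s old).
$db->prepare('UPDATE email_queue
              SET claimed_by = ?, sending_started_at = NOW()
              WHERE sent_at IS NULL
                AND (sending_started_at IS NULL
                     OR sending_started_at < NOW() - INTERVAL 300 SECOND)
              LIMIT 10')->execute([$pid]);

// Process only the rows this worker claimed.
$mine = $db->prepare('SELECT id, email FROM email_queue WHERE claimed_by = ? AND sent_at IS NULL');
$mine->execute([$pid]);
foreach ($mine->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // ... send the email, then mark it as successfully completed:
    $db->prepare('UPDATE email_queue SET sent_at = NOW() WHERE id = ?')->execute([$row['id']]);
}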
You can use database transactions here; the rest will be handled by the database's locking mechanism and concurrency control. Generally, the DML/DCL/DDL commands you issue are each treated as isolated transactions. In your question: if the 2nd cron job reads the rows before the 1st cron job has marked them as sent, it will find the emails unsent and try to send them again; and if the 3rd job finds them unsent before the 2nd has marked them, it will do the same. So this can cause a big problem for you.
Whatever approach you take, there will be a race condition, so let the database handle it. There are many concurrency control methods you can refer to, for example:
START TRANSACTION;
/* perform your actions here: any number of reads/writes */
COMMIT;
Still, there is one problem with this solution: you will find at some point that, as the number of read/write operations increases, some inconsistency remains.
This is where the database's isolation level comes in: it is the factor that defines how strongly two transactions are isolated from each other, and how they are scheduled to run concurrently.
You can set the isolation level according to your requirements. Remember that concurrency is inversely proportional to the isolation level, so analyse your read/write statements, figure out which level you need, and do not use a higher level than that. Here are some links which may help you:
http://www.ibm.com/developerworks/data/zones/informix/library/techarticle/db_isolevels.html
Difference between read commit and repeatable read
http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html
If you post your database operations here, I can suggest a suitable isolation level.
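For illustration, the transactional variant in MySQL might look like this (a sketch; the email_queue table and its columns are hypothetical):

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
-- FOR UPDATE locks the selected rows until COMMIT, so a concurrent cron job
-- blocks here instead of reading the same 'unsent' rows:
SELECT id, email FROM email_queue
 WHERE status = 'unsent' ORDER BY id LIMIT 10
 FOR UPDATE;
-- ... send those emails, then flag the same rows:
UPDATE email_queue SET status = 'sent', sentat = NOW()
 WHERE status = 'unsent' ORDER BY id LIMIT 10;
COMMIT;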
I have user subscriptions in my DB. The SUBSCRIPTIONS table has a UID (which I use with the CC company to charge the card) plus DATE_EXPIRED, DATE_PAID, VALID and LENGTH columns.
I am going to need some kind of function that takes a batch of subscriptions from the DB, sends them to the CC company for processing, and, upon return, treats each subscription accordingly:
if the charge went OK, I mark VALID = 1, set DATE_PAID = NOW() and set DATE_EXPIRED = NOW() + INTERVAL LENGTH MONTH;
but if the result is not OK, I need to set VALID = ERROR_NO and take some actions, like sending an email, bringing up messages, etc.
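In SQL terms I mean something like this (a sketch; 12345 stands for the UID just processed, and 2 for an example ERROR_NO):

-- charge OK:
UPDATE SUBSCRIPTIONS
   SET VALID = 1,
       DATE_PAID = NOW(),
       DATE_EXPIRED = DATE_ADD(NOW(), INTERVAL `LENGTH` MONTH)
 WHERE UID = 12345;

-- charge failed:
UPDATE SUBSCRIPTIONS SET VALID = 2 WHERE UID = 12345;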
My question is about the approach to this situation. I will be getting a list of subscriptions that need to be updated, so:
What do you think is the best way to process a batch like that?
My internal processing function uses cURL to contact the CC server and gets a cURL response, upon which I know what to do next.
Should I write all the necessary subscriptions to a file? Should I cURL to my own server so the batch runs independently? How do I send a batch of results to myself for processing? What do you think is the best approach?
How do I keep the SELECT result so that I can process it all in one go?
I think I should clarify. Say I theoretically have 100,000 results. I thought I should split my SELECT into portions with LIMIT X,Y and load them piece by piece to save memory, but then each subsequent SELECT would run against a table that may have changed in the meantime (as it might have been updated by then). I'd like to run through all 100,000 results while keeping the same 100,000 rows I had in the first SELECT. Doing the whole process in a web page would of course be ineffective, as the whole process would die as soon as the page is closed.
Use LOCK TABLES subscriptions WRITE to solve your second problem, and UNLOCK TABLES when you are done. Although really, unless you're running this on a server with no RAM at all, keeping those 100,000 rows in memory and going through them in a while loop using something like mysql_fetch_assoc or mysqli_fetch_assoc shouldn't be a problem.
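For instance (mysqli assumed; the SELECT criteria are illustrative, and note that a WRITE lock blocks all other sessions until UNLOCK TABLES runs):

$conn = new mysqli('localhost', 'user', 'pass', 'mydb');
$conn->query('LOCK TABLES subscriptions WRITE');

$result = $conn->query('SELECT * FROM subscriptions WHERE DATE_EXPIRED < NOW()');
while ($row = $result->fetch_assoc()) {
    // ... cURL call to the CC server, then UPDATE this row accordingly ...
}

$conn->query('UNLOCK TABLES');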
As to your first question: why don't you simply process each cURL response when you get it and save the result in the database on a per-subscription basis? I don't really see why you would want to put this into a separate file first, and certainly not why you would then need to use cURL to contact your own server.
This is more a request for advice/best practice. We have a site with many users (> 200,000), and we need to send emails to all of them about events occurring in their areas. What would be the best way to stagger the adding of the jobs?
Things to note:
We store everything in a MySQL database
The emails go out on a queue-based system, with independent workers grabbing the tasks and sending them out.
We collect username and join date, which we can use for grouping
Sending the emails is not the problem; the problem is adding the jobs. I am afraid of a performance hit if we suddenly try to add that many jobs at once.
I take it your requirement is something like sending newsletters to news groups and subscribers.
Do you already have groups, and is it possible to implement them? Grouping would help you filter and avoid scanning all 200,000 users at once; sending the emails per group will reduce the DB load, I hope. The DB can also carry an active/inactive status for each user.
Running a cron job is the solution, but base the interval on the load that the job can put on your server.
So if the DB design and the job intervals are good, performance will be better.
I assume your queue is a table in a database, and your concern is that adding thousands of records to the table will thrash it because the index gets rebuilt each time?
If you add many entries within a single process (e.g. a single HTTP request or a single cron job script), you can start a transaction before inserting and commit when done. With all the inserts inside one transaction, the work is flushed in a single commit rather than once per row, which avoids most of the overhead.
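A sketch of that (PDO assumed; the job_queue table and the $users array are placeholders):

$db   = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $db->prepare('INSERT INTO job_queue (user_id, email) VALUES (?, ?)');

$db->beginTransaction();
foreach ($users as $user) {              // $users: the ~200,000 recipients
    $stmt->execute([$user['id'], $user['email']]);
}
$db->commit();                           // everything is flushed in one commit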
If it's a more general problem, you might want to consider using a message queue instead of a database table.
Or am I completely off?
Set up a cron job to run every 5 minutes. Have it check whether there are emails to send. If there are, and none are marked as "in progress" yet, pick the first mailing and mark it as in progress. Select the first users with id < n and send to them; keep track of that last id, and repeat until you reach the end of the user list.
Our company deals with sales. We receive orders and our PHP application allows our CSRs to process these orders.
There is a record in the database that is constantly changing depending on which order is currently being processed by a specific CSR; there is one such field for every CSR.
Currently, a completely separate page polls the database every second using an XMLHttpRequest and receives the response. If the response is not blank (which happens only when the value has changed in the database), it performs an action.
As you can imagine, this amounts to one database query per second, as well as one HTTP request every second.
My question is: is there a better way to do this? Possibly a listener using sockets? Something that would ping my script when a change has been made, without forcing me to poll the database and/or send an HTTP request every second.
Thanks in advance
First off, 1 query/second and 1 request/second really isn't much, especially since this number won't change as you get more CSRs or sales. If you were executing 1 query/order/second or something you might have to worry, but as it stands, if it works well, I probably wouldn't change it. It may be worth running some metrics on the query to ensure that it runs quickly, selects on an indexed column, and the like. Most databases offer a way to check how a query executes, like the EXPLAIN syntax in MySQL.
That said, there are a few options.
1. Use database triggers to either perform the required updates when an edit is made, or to call an external script (see the trigger sketch after this list). Some reference material for MySQL: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
2. Have whatever software the CSRs are using call a second script directly when making an update.
3. Reduce the polling frequency.
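To illustrate option 1, a trigger sketch (all table and column names here are hypothetical): an AFTER UPDATE trigger records each change in a small side table, which is far cheaper to poll than running the full query against the main table:

DELIMITER //
CREATE TRIGGER order_assignment_changed
AFTER UPDATE ON csr_status
FOR EACH ROW
BEGIN
    IF NEW.current_order <> OLD.current_order THEN
        INSERT INTO csr_events (csr_id, order_id, created_at)
        VALUES (NEW.csr_id, NEW.current_order, NOW());
    END IF;
END//
DELIMITER ;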
You could use an asynchronous architecture based on a message queue. When a CSR starts to handle an order and the record in the database is changed, a message is added to the queue. Your script can either block on requests for the latest queue item, or you could implement a queue that automatically notifies your script when messages are added.
Unless you have millions of these events happening simultaneously, this kind of setup will cause the action to be executed within milliseconds of the event occurring, and you won't be constantly making useless polling requests to your database.