I'm building an OO PHP application that will run across multiple nodes and will be relatively stateless in nature, and I need to implement proper publisher-subscriber style events (http://en.wikipedia.org/wiki/Observer_pattern / http://sourcemaking.com/design_patterns/Observer/php).
My question is, how can I handle events?
In my application we are using technologies like Cassandra, Redis, Mongo and RabbitMQ.
I know PHP has an event extension available, but from what I can tell it only works within a single process's state - or, if something like memcached is leveraged, possibly within a single node... but my application will be distributed across multiple nodes.
So let's look at an example:
On Node 1, a metric (Metric ID 37) is updated and anything that subscribes to that metric needs to be updated. This publishes Changing and Changed as it does the update.
Something is subscribed to Metric ID 37 being updated - for example, Metric 38 may need to recalculate itself when Metric 37's value changes.
Metric 38 is currently instantiated and being used on Node 2 in Process ID 1011... How does Metric 37 tell Metric 38 on Node 2 (Process ID 1011 in this case) to run the subscribed function?
Metric 39 subscribes to Metric 38 being updated, but is not instantiated anywhere... How does Metric 39 update when Metric 38 finishes updating?
I was thinking of something like using RabbitMQ as my event queue manager, and on each node having a daemon-style 'event consumer' application that reads events from the event queue (for the sake of load balancing/distribution of the work).
Then, when the consumer sees "Metric:38:Updated", it checks something like Redis for anything subscribed to "Metric:38:Updated", gets the value ("What:Function:Values") and does something like call_user_func_array(array($what, $function), $values); ... but this seems like it may cause a lot of overhead and some level of synchronization issues...
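Roughly, the consumer daemon I have in mind would look something like this - just a sketch, assuming php-amqplib for RabbitMQ and phpredis, with subscriptions stored as a Redis set of JSON-encoded callbacks (the queue and key names are made up):

<?php
// Sketch of an 'event consumer' daemon - assumes php-amqplib and phpredis,
// and that subscriptions live in a Redis set keyed by event name.
require __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();
$channel->queue_declare('events', false, true, false, false);

$callback = function (AMQPMessage $msg) use ($redis) {
    // e.g. "Metric:37:Updated"
    $eventName = $msg->body;

    // Each subscriber stored as JSON: {"what":"Metric","function":"recalculate","values":[38]}
    foreach ($redis->sMembers('subscribers:' . $eventName) as $subscriber) {
        $sub = json_decode($subscriber, true);
        call_user_func_array(array($sub['what'], $sub['function']), $sub['values']);
    }

    $msg->delivery_info['channel']->basic_ack($msg->delivery_info['delivery_tag']);
};

$channel->basic_consume('events', '', false, false, false, false, $callback);

while (count($channel->callbacks)) {
    $channel->wait();
}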
I'm using Doctrine MongoDB ODM to persist my objects... To handle synchronization issues I was thinking of something like this:
Objects could have a version number... (version=1.0)
And redis could be used to maintain a quick reference to the latest version of the object (ObjectVersion:ObjectType:ObjectId)=1.1
And when a getter is called on an object property that is marked as #critical (things like isDeleted, monetary balances, etc.), it could check whether the instance's version ID is equal to the version number in Redis, and update its values from Mongo if it needs to...
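Something along these lines, as a sketch (assuming phpredis and the Doctrine DocumentManager are injected; the property and key names are only illustrative):

<?php
// Sketch of a version-checked getter for a 'critical' property.
// Assumes phpredis and a Doctrine MongoDB ODM DocumentManager;
// the "ObjectVersion:<type>:<id>" key layout is just the convention described above.
class Metric
{
    private $id;
    private $version = '1.0';
    private $isDeleted;

    public function getIsDeleted(Redis $redis, \Doctrine\ODM\MongoDB\DocumentManager $dm)
    {
        $latest = $redis->get('ObjectVersion:Metric:' . $this->id);

        if ($latest !== false && $latest !== $this->version) {
            // Our in-memory copy is stale: refresh this instance from Mongo.
            $dm->refresh($this);
            $this->version = $latest;
        }

        return $this->isDeleted;
    }
}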
An alternative setup is to use amphp/amp (http://amphp.org/docs/amp/reactor-concepts.html) and some form of RPC to synchronize the nodes.
Since I'm fairly new to web development (moving from C#), and to stateless, distributed architectures, I thought it would be a good idea to ask the community whether anyone has better suggestions.
My question is, how can I handle events?
If you want to use an event loop implementation, there are multiple choices available:
Amp
Icicle
React
You can use a Pub/Sub system like the one Redis offers: http://redis.io/topics/pubsub. Amp offers a package for Redis; other event libraries might already have an implementation available.
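For example, a minimal subscriber with phpredis (the channel name is made up; subscribe() blocks, so this would live in its own worker process):

<?php
// Minimal Redis Pub/Sub listener sketch using phpredis.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$redis->subscribe(array('metric-events'), function ($redis, $channel, $message) {
    // $message could be something like "Metric:38:Updated"
    echo "Received {$message} on {$channel}\n";
});

The publishing side is then just a single $redis->publish('metric-events', 'Metric:38:Updated'); call from whichever node performed the update.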
Redis will send an event notification to all connected and listening clients. You may not want that, because you want to synchronize your calculations and execute them only once.
You could push the actual data to a Redis list and use the event system only to poll in case of a new job, so the workers can otherwise sleep. A better solution might be to use blocking list operations, which block a Redis connection until there's new data available in a Redis list. When that event happens, you can recalculate the value and push the update to an event.
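As a sketch of the blocking-list variant with phpredis (the key names and payload format are made up):

<?php
// Worker sketch: block on a Redis list until a job arrives, then process it.
// Assumes phpredis; "jobs:metric-updates" is an arbitrary example key.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

while (true) {
    // blPop blocks for up to 30 seconds, then returns array($key, $value) or an empty array.
    $job = $redis->blPop(array('jobs:metric-updates'), 30);

    if (empty($job)) {
        continue; // timed out, loop again
    }

    $metricId = $job[1]; // e.g. "38"

    // ... recalculate metric $metricId here, then announce the result:
    $redis->publish('metric-events', 'Metric:' . $metricId . ':Updated');
}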
That's basically building a message queue with Redis, but essentially you will just want to look at the features of different message queue implementations and see if they suit your needs. If you want to use any of the event loop libraries, you may also want to look at the available clients and other features you need from them, because they're generally not compatible (yet).
You probably need some middleware, like Redis Pub/Sub (http://redis.io/topics/pubsub) or some other message queue, to support your application.
Related
I want to save a lengthy form's inputs at the server. But I don't think making db calls on each auto-save action is the best approach to go for.
What would constitute a good approach to solving this?
Another problem is that I have 3 app servers. So in memory cache wouldn't work.
I was thinking keeping the data in redis and updating it on every call and finally updating the db. But since I have 3 servers how do I make sure the calls are in queue?
Can anyone help with the architecture?
But I don't think making db calls on each auto-save action is the best approach to go for.
That's the real question, let's start with that. Why would you think that?
You want auto-save, right? This is the only thing that saves user work.
All the other options you listed (memcached/redis, in-process caching) - not only do they not save user work, they're ticking time bombs. Think of all things that can fail there: redis dies, network is split, the whole data center is hit by lightning.
Why create all the complexity when you can just... save? You may find out that it's not that slow (if this was your concern).
This is a very classical problem faced when scaling your architecture, but let's come to scaling later - essentially your initial app requires many calls down to the database level, so let's optimize that first.
Since you have not given any details about your IOPS, I'll describe the approaches to solving it below, in increasing order of load. All approaches are of a cascading nature, i.e. the last approach actually builds on all previous solutions:
The Direct Database Approach {db calls on each auto-save action}:
client -> application layer -> database (save data here directly)
Basically, on any data update we directly propagate it to the database level. The main bottleneck in this scheme is the database.
Things to take care of in this approach:
For the most popularly used relational databases, like MySQL:
An insert query takes less time than an update query. For a college project you could keep updating the row against the same primary key and it would work beautifully.
But in any mid-sized application, where a single form modification generates 30-40 requests, the update queries would block your db resources.
So keep an addendum (append-only) kind of schema: maybe maintain a secondary key like status that keeps track of up to which step the user has filled the form, but keep inserting a new row for each update, and always read the row with the most recent status (a sketch of this follows after this list).
For further optimization, indexes (and foreign key constraints) should be applied.
When this step is no longer enough - the bottleneck now being the database itself - the next thing to optimize is the type of data you're dealing with; you can choose a schema-less db like Mongo, DynamoDB, etc. for non-transactional data.
Using a schema-less db can be very helpful for large amounts of non-transactional data, as it inherently allows the addendum approach to the same row of data.
For additional optimization use secondary indexes to query data faster.
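As a rough sketch of the insert-only idea from the MySQL notes above, using PDO (the table and column names are just examples):

<?php
// Append-only auto-save sketch: every auto-save inserts a new row with a
// monotonically increasing status/step, and reads always take the latest row.
// Assumes PDO/MySQL; the form_drafts table and its columns are made-up examples.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

function autoSave(PDO $pdo, $userId, $formId, $step, array $data)
{
    $stmt = $pdo->prepare(
        'INSERT INTO form_drafts (user_id, form_id, status, payload, created_at)
         VALUES (:user_id, :form_id, :status, :payload, NOW())'
    );
    $stmt->execute(array(
        ':user_id' => $userId,
        ':form_id' => $formId,
        ':status'  => $step,               // how far the user has got
        ':payload' => json_encode($data),  // the form fields as filled so far
    ));
}

function loadLatestDraft(PDO $pdo, $userId, $formId)
{
    $stmt = $pdo->prepare(
        'SELECT payload FROM form_drafts
         WHERE user_id = :user_id AND form_id = :form_id
         ORDER BY status DESC, id DESC LIMIT 1'
    );
    $stmt->execute(array(':user_id' => $userId, ':form_id' => $formId));

    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row ? json_decode($row['payload'], true) : null;
}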
Involving The Application Layer Approach {Application layer caching }:
client -> application layer (save some data here then propagate later) -> database (Finally save here)
When a simple database setup is unable to serve and scale as required, the easiest way is to shift some load onto your API servers, as horizontal scaling of that layer is much easier and cheaper.
It is exactly the in-memory cache approach you described, though do not go about designing your own - it takes years of understanding large-scale web infrastructure, and even then people often fail to design an efficient cache at the application layer.
Use something like express-session. I was using this for a web app with 10 EC2 Node.js instances, storing about 15 MB of user data per session. It scales beautifully and maintains unique user session data across all servers - so on each auto-save, save the data to the user session, and on form submit write to the database from the session. It can easily be scaled by adding more API servers, and the best setup is to have it store the session data in Redis (why re-invent the wheel?).
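If your stack is PHP rather than Node, the same pattern is available by pointing PHP's native session handler at Redis - a sketch assuming the phpredis extension, with the endpoint and field names as placeholders:

<?php
// Sketch: Redis-backed PHP sessions, the PHP counterpart of the express-session
// + Redis setup described above. Assumes the phpredis extension is installed.
ini_set('session.save_handler', 'redis');
ini_set('session.save_path', 'tcp://127.0.0.1:6379'); // placeholder endpoint

session_start();

// On each auto-save request: stash the partial form in the shared session.
$_SESSION['draft'] = $_POST['form'];

// On final submit: write the draft to the database once, then clear it.
// persistDraftToDatabase($_SESSION['draft']);  // hypothetical helper
// unset($_SESSION['draft']);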
Re-inventing the wheel : Custom level application layer caching + db caching + db optimize
client -> application layer (save some data here) ->{ Add your db cache}-> database (save data here finally)
This is where we come to designing our own thing, using Redis/DynamoDB/Mongo to act as a cache for your primary database. (Please note: if you are using a non-transactional db in the first place, go for the addendum approach - this one is much more suited to scaling transactional databases by adding a wrapper of a non-transactional database.)
Express-session actually works the same way, by caching the data in Redis at the application layer; always try to reduce the number of calls to the db as much as possible.
So only go for this approach once you have a fully functioning application-layer cache and an optimized db, as it needs experienced developers and is resource intensive; it is usually employed for very large scale applications (I had a layer of Redis caching after the express session for an application that averaged 300k requests per second).
The approach here would be to save data in the user session, apply a lazy write-back to your cache, maintain a cache ledger, and then write to your main database. For the queuing in such a large-scale system, I wrote an entire separate microservice that worked in the background to transport data from session to Redis to MySQL; for your concern about how to maintain a queue, read more about priority queues and background workers. I used Kue, a priority job queue backed by Redis, built for Node.js (a rough PHP-flavoured sketch of such a write-back worker follows the links below).
Maintaining parallel and sequential queues
php client for KUE
Background Services
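A stripped-down PHP sketch of the background transporter described above (the ledger key, table and columns are invented for illustration; assumes phpredis and PDO):

<?php
// Background worker sketch: drain a Redis "cache ledger" list and lazily
// write the entries back to MySQL.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

$insert = $pdo->prepare(
    'INSERT INTO form_drafts (user_id, form_id, status, payload, created_at)
     VALUES (:user_id, :form_id, :status, :payload, NOW())'
);

while (true) {
    // Block until the application layer pushes a ledger entry.
    $entry = $redis->blPop(array('cache:ledger'), 30);
    if (empty($entry)) {
        continue;
    }

    $job = json_decode($entry[1], true);

    $insert->execute(array(
        ':user_id' => $job['user_id'],
        ':form_id' => $job['form_id'],
        ':status'  => $job['status'],
        ':payload' => $job['payload'],
    ));
}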
This solution has been adopted industry-wide successfully.
Configure a 3-node Redis cluster which takes care of replicating the data.
Writes happen only to the master node.
redis (master) - app server 1,
redis (slave1) - app server 2,
redis (slave2) - app server 3
Adding a slave is straightforward using the slaveof <host> <port> command.
Replication is done over the wire (not temporary disk storage)
Link - https://redis.io/topics/replication
Aim
To synchronize my elasticsearch server with new and expired data in my SQL database
Issue
There are two very different ways I can achieve this, and I don't know which is better. I can either pull information into Elasticsearch with a direct connection to the SQL database using the JDBC river plugin, or I can push data to Elasticsearch using the PHP client, with the code shown below as an example:
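// Assumes an Elastica client, index and type were created earlier, e.g.
// (the index/type names here are only placeholders):
// $client       = new \Elastica\Client();
// $elasticaType = $client->getIndex('twitter')->getType('tweet');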
// The Id of the document
$id = 1;

// Create a document
$tweet = array(
    'id'   => $id,
    'user' => array(
        'name'     => 'mewantcookie',
        'fullName' => 'Cookie Monster'
    ),
    'msg'      => 'Me wish there were expression for cookies like there is for apples. "A cookie a day make the doctor diagnose you with diabetes" not catchy.',
    'tstamp'   => '1238081389',
    'location' => '41.12,-71.34',
    '_boost'   => 1.0
);

// First parameter is the id of document.
$tweetDocument = new \Elastica\Document($id, $tweet);

// Add tweet to type
$elasticaType->addDocument($tweetDocument);

// Refresh Index
$elasticaType->getIndex()->refresh();
I was going to have a cron run every thirty minutes to check for items in my database that not only have an "active" flag but also do not have an "indexed" flag, which means I need to add them to the index.
QUESTION
Seeing as I can synchronize data between Elasticsearch and MySQL in two different ways, what are the advantages and disadvantages of each option? Is there a specific use case that favours one over the other?
I would use the river method, even though an in-house built solution might be more customizable.
On one side, the jdbc-river plugin is already built and has around 20 contributors so far, so you kind of have an extra team working to improve that tool as Elasticsearch itself improves.
All you have to do is install it, and you don't even need a complex configuration to set up a river between your cluster and your relational database.
Another advantage of the jdbc-river solution is that you don't need to deal with memory management. The plugin can operate as a river in "pull mode" or as a feeder in "push mode". In feeder mode, the plugin runs in a separate JVM and can connect to a remote Elasticsearch cluster. I personally prefer the river mode, because in that case Elasticsearch deals with the indexing and memory management issues.
The relational data is internally transformed into structured JSON objects for the schema-less indexing model of Elasticsearch documents.
Both ends are scalable. The plugin can fetch data from different RDBMS sources in parallel, and multithreaded bulk mode ensures high throughput when indexing to Elasticsearch.
One of the drawbacks of this solution is that it doesn't notify you when it's done indexing. As a workaround I suggest you use the Count API to compare results.
Another drawback of the river is that it doesn't pull on update; it only does so on insert or delete. I'm referring of course to the SQL actions UPDATE, INSERT and DELETE.
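To illustrate the Count API check mentioned above, a rough PHP sketch (the index, type and table names are placeholders):

<?php
// Rough consistency check: compare the SQL row count with the Elasticsearch
// document count via the _count endpoint. Names and URLs are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$sqlCount = (int) $pdo->query('SELECT COUNT(*) FROM tweets WHERE active = 1')->fetchColumn();

$response = json_decode(
    file_get_contents('http://localhost:9200/twitter/tweet/_count'),
    true
);
$esCount = (int) $response['count'];

if ($sqlCount !== $esCount) {
    // The river is probably still indexing (or something was missed).
    error_log("Out of sync: MySQL={$sqlCount}, Elasticsearch={$esCount}");
}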
On the other hand, your solution brings some advantages and drawbacks you might want to consider.
Your solution is highly customizable, so you can manage your scripts however you want. But considering the current state of the available PHP Elasticsearch clients (the official Elasticsearch-php client, Elastica or FOSElasticaBundle), and even though their maintainers are doing a great job, they are still considered less mature APIs to work with at that level compared to the official Elasticsearch Java API used by the river.
You should also consider dealing with all the errors your cluster can throw at you: memory loss, management, performance issues, etc.
E.g.: I tried to build a proof of concept using the Elastica API, pushing my data from my database to my cluster on a machine with 32 GB RAM and 8 cores running at 2.05 GHz each, in a test environment, without getting into much detail. It took me 5 hours to push 10M records from the database to the cluster, whereas with the river it takes 20 minutes for the same records. Of course there might be optimizations that could be made to my code, but I considered them more time-consuming than what they would bring me.
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
NB: Of course there might be other points you may want to consider, but this subject is too long to discuss fully here, so I chose some points I found essential for you to be aware of.
If you forget for a moment that you need to import initial data into Elasticsearch, I would use an event system to push data to Elasticsearch. This is more efficient in the long run.
Your application knows exactly when something needs to be indexed by Elasticsearch. To take your tweet example, at some point a new tweet will enter your application (a user writes one for example). This would trigger a newTweet event. You have a listener in place that will listen to that event, and store the tweet in Elasticsearch whenever such an event is dispatched.
If you don't want to use resources/time in the web request to do this (and you definitely don't want to do this), the listener could add a job to a queue (Gearman or Beanstalkd for example). You would then need a worker that will pick that job up and store the tweet in Elasticsearch.
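As a bare-bones sketch of that listener/worker split, using PHP's Gearman extension (the job name is invented, and $tweet / $elasticaType are assumed to be set up as in the question's example):

<?php
// In the web request: the event listener just hands the tweet off to Gearman.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('index_tweet', json_encode($tweet)); // returns immediately

// In a long-running worker process: pick the job up and index it.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('index_tweet', function (GearmanJob $job) use ($elasticaType) {
    $tweet = json_decode($job->workload(), true);
    $elasticaType->addDocument(new \Elastica\Document($tweet['id'], $tweet));
});

while ($worker->work()) {
    // keep consuming jobs
}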
The main advantage is that Elasticsearch is kept up to date in closer to real time. You won't need a cronjob that would introduce a delay. You'll (mostly) handle a single document at a time, and you won't need to bother the SQL database to find out what needs to be (re)indexed.
Another advantage is that you can easily scale when the amount of events/data gets out of hand. When Elasticsearch itself needs more power, add servers to the cluster. When the worker can't handle the load, simply add more of them (and place them on dedicated machines). Plus your webserver(s) and SQL database(s) won't feel a thing.
I would use the river method.
Advantages of the river:
Already built. Just download it, set your configurations and everything is done.
Tested. The river has been used by several people and thus mistakes have been fixed.
Customizable. You can set the duration between the runs, define a sql-statement for getting new data, etc.
Advantages of your solution:
Highly customizable, you can do with your script whatever you want.
Disadvantages of your solution:
Needs special flags
Prone to errors (since it has not been tested for as long)
...
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
Brief overview of my use case: consider a database (most probably MongoDB) with a million entries. The value for each entry needs to be updated every day by calling an API. How should such a cron job be designed? I know Facebook does something similar. The only thing I can think of is to have multiple jobs which divide the database entries into batches, with each job updating a batch. I am certain there are smarter solutions out there. I am also not sure what technology to use. Any advice is appreciated.
-Karan
Given the updated question context of "keeping the caches warm", a strategy of touching all of your database documents would likely diminish rather than improve performance unless that data will comfortably fit into available memory.
Caching in MongoDB relies on the operating system behaviour for file system cache, which typically frees cache by following a Least Recently Used (LRU) approach. This means that over time, the working data set in memory should naturally be the "warm" data.
If you force data to be read into memory, you could be loading documents that are rarely (or never) accessed by end users .. potentially at the expense of data that may actually be requested more frequently by the application users.
There is a use case for "prewarming" the cache .. for example when you restart a MongoDB server and want to load data or indexes into memory.
In MongoDB 2.2, you can use the new touch command for this purpose.
Other strategies for prewarming are essentially doing reverse optimization with an explain(). Instead of trying to minimize the number of index entries (nscanned) and documents (nscannedObjects), you would write a query that intentionally will maximize these entries.
With your API response time goal .. even if someone's initial call required their data to be fetched into memory, that should still be a reasonably quick indexed retrieval. A goal of 3 to 4 seconds response seems generous unless your application has a lot of processing overhead: the default "slow" query value in MongoDB is 100ms.
From a technical standpoint, you can execute scripts in the mongodb shell and run them via cron. If you schedule cron to run a command like:
./mongo server:27017/dbname --quiet my_commands.js
MongoDB will execute the contents of the my_commands.js script. Now, for an overly simple example just to illustrate the concept: if you wanted to find a person named sara and insert an attribute (yes, an unrealistic example), you could put the following in your .js script file.
// find the document to update
person = db.person.findOne( { name : "sara" } );
// add the new attribute
person.validated = "true";
// save it back to the same collection
db.person.save( person );
Then every time the cron runs, that record will be updated. Now, add a loop and a call to your API, and you might have a solution (see the sketch below). More information on these commands and examples can be found in the mongodb docs.
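If you'd rather drive that loop from PHP instead of a shell script, a batched sketch might look roughly like this (assuming the mongodb/mongodb library; the collection, field names and callExternalApi() are placeholders):

<?php
// Batched nightly update sketch using the mongodb/mongodb PHP library.
// Collection name, fields and callExternalApi() are placeholders.
require __DIR__ . '/vendor/autoload.php';

$collection = (new MongoDB\Client('mongodb://localhost:27017'))
    ->mydb
    ->entries;

$cursor = $collection->find([], ['batchSize' => 1000]); // stream results in batches

foreach ($cursor as $doc) {
    $newValue = callExternalApi($doc['external_id']);   // hypothetical API call

    $collection->updateOne(
        ['_id' => $doc['_id']],
        ['$set' => ['value' => $newValue, 'updated_at' => new MongoDB\BSON\UTCDateTime()]]
    );
}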
However, from a design perspective, are you sure you need to update every single record each night? Is there a way to identify a more reasonable subset of records that need to be processed? Or possibly can the api be called on the data as it's retrieved and served to whomever is going to consume it?
Basically, one part of some metrics that I would like to track is the number of impressions that certain objects receive on our marketing platform.
If you imagine that we display lots of objects, we would like to track each time an object is served up.
Every object is returned to the client through a single gateway/interface. So imagine that a request comes in for a page with some search criteria, and the search request is then proxied to our Solr index.
We then get 10 results back.
Each of these 10 results should be regarded as an impression.
I'm struggling to find an incredibly fast and accurate implementation.
Any suggestions on how you might do this? You can throw in any number of technologies. We currently use Gearman, PHP, Ruby, Solr, Redis, MySQL, APC and Memcache.
Ultimately all impressions should eventually be persisted to MySQL, which I could do every hour. But I'm not sure how to store the impressions in memory quickly without affecting the load time of the actual search request.
Ideas (I just added options 4 and 5)
Once the results are returned to the client, the client then requests a base64-encoded URI on our platform which contains the IDs of all of the objects they have been served. This object is then passed to Gearman, which saves the count to Redis. Once an hour, Redis is flushed and the count is incremented for each object in MySQL.
After the results have been returned from Solr, loop over, and save directly to Redis. (Haven't benchmarked this for speed). Repeat the flushing to mysql every hour.
Once the items are returned from Solr, send all the ID's in a single job to gearman, which will then submit to Redis..
new idea Since at most around 20 items will be returned, I could set an X-Application-Objects header containing a base64-encoded list of the returned IDs. These IDs (in the header) could then be stripped out by nginx, and using a custom Lua nginx module, I could write the IDs directly to Redis from nginx. This might be overkill though. The benefit of this is that I can tell nginx to return the response object immediately while it's writing to Redis.
new idea Use fastcgi_finish_request() in order to flush the request back to nginx, but then insert the results into Redis.
Any other suggestions?
Edit to Answer question:
The reliability of this data is not essential, as long as it is a best guess. I wouldn't want to see a swing of, say, 30% dropped impressions, but I would allow a tolerance of +/- 10% accuracy.
I see your two best options as:
Use the increment command in Redis to increment counters as you pull the IDs. Use the ID as a key and increment it in Redis. Redis can easily handle hundreds of thousands of increments per second, so that should be fast enough to do without any noticeable client impact. You could even pipeline each request if the PHP language binding supports it - I think it does (see the sketch at the end of this answer).
Use Redis as a plain cache. In this option you would simply use a Redis list and rpush a string containing the IDs separated by e.g. a comma. You might use the hour of the day as the key. Then you can have a separate process pull it out by grabbing the previous hour and massaging it however you want into MySQL. If you put an expiry on the keys, you can have them cleaned out after a period of time, or just delete the keys in the post-processing process.
You can also use a read slave to do the exporting to MySQL if you have very high Redis traffic or just want to offload it, and as a bonus you get a backup. If you do that, you can set the master Redis instance not to flush to disk, increasing write performance.
For some additional options regarding a more extended use of Redis' features for this sort of tracking, see this answer. You could also avoid the MySQL portion and pull the data from Redis, keeping the overall system simpler.
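To make option 1 concrete, here is a rough sketch with phpredis and PDO (the key prefix, table name and $objectIds variable are placeholders):

<?php
// Option 1 sketch: pipeline one INCR per served object ID in the request path,
// then an hourly job folds the counters into MySQL. Assumes phpredis and PDO.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// --- In the search request, right after getting results from Solr ---
$pipe = $redis->multi(Redis::PIPELINE);
foreach ($objectIds as $id) {            // e.g. the ~10-20 IDs returned
    $pipe->incr('impressions:' . $id);
}
$pipe->exec();

// --- In the hourly cron: move counters into MySQL and reset them ---
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$update = $pdo->prepare(
    'INSERT INTO impressions (object_id, views) VALUES (:id, :views)
     ON DUPLICATE KEY UPDATE views = views + :views2'
);

// Note: prefer SCAN over KEYS on large keyspaces; KEYS is used here for brevity.
foreach ($redis->keys('impressions:*') as $key) {
    $count = (int) $redis->getSet($key, 0);   // read the counter and reset it atomically
    $id    = substr($key, strlen('impressions:'));
    $update->execute(array(':id' => $id, ':views' => $count, ':views2' => $count));
}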
I would do something like #2, and hand the data off to the fastest queue you can to update Redis counters. I'm not that familiar with Gearman, but I bet it's slow for this. If your Redis client supports asynchronous writes, I'd use that, or put this in a queue on a separate thread. You don't want to slow down your response waiting to update the counters.
I currently have a read-heavy mobile app (90% reads, 10% writes) that communicates with a single web server through php calls and single MySQL db. The db stores user profile information and messages the users send and receive. We get a few messages per second added to the db.
I'm in the process of scaling horizontally, load balancing, etc. So we'll have a load balancer in front of a cluster of web servers, and then I plan to put a layer of Couchbase nodes on top of a MySQL cluster so we can have fast access to user profile and message info. We'll cache all user info in Couchbase, but I want to cache only the latest 24 hours' worth of messages in Couchbase, since that is the timeframe where most of the read activity will happen.
For the messages data stored in memcache, I want to be able to filter messages based on various data found in a message's fields like country, city, time, etc. I know Couchbase uses a KV approach so I can't query using where clauses like I would with MySQL.
Is there a way to do this? Is Couchbase Views the answer? Or am I totally barking up the wrong tree with Couchbase?
The views in Couchbase Server 2.0 and later are what you're looking for. If the data being put in Couchbase is JSON, you can use those views to perform queries across the data you put in the Couchbase cluster.
Note that you can use a view that emits a date time as an array (a common technique) and even use that in restricting your view time period so you could, potentially, just store all of your data in Couchbase without a need to put it in another system too. If you have other reasons though, you can certainly just have the items expire 24 hours after you put them in the cache. Then, if you're using one of the clients that supports it, you'll be able to get-and-touch the document in the cache extending the expiration if needed. The only downside there is that you'll need to come up with a method of invalidating the document on update.
One way to do that is a trigger in MySQL which deletes the given key; another way is to invalidate it from the application layer.
p.s.: full disclosure: I'm one of the Couchbase folks