Store and retrieve millions of JSON-encoded events (PHP/Database) - php

Let's assume we have the following example JSON event data:
{
    "eventId": "eb1363c3-6bf7-4a42-9daa-66270b922367",
    "timestamp": "2014-10-28T09:12:22.628Z",
    "ip": "1.2.3.4",
    "device": {
        "type": "mobile",
        "os": {
            "name": "iOS",
            "version": "7.1.1"
        },
        "name": "iPhone 4/4s",
        ...
    },
    "eventType": "AddedProductToCart",
    "store": "US",
    "product": {
        "sku": "ABC123",
        "name": "Yellow Socks",
        "quantity": 1,
        "properties": {
            "foo": "bar",
            "bar": 1
        },
        ...
    },
    "user": {
        "id": 123456,
        "name": "jeff",
        "type": "registered",
        ...
    }
}
while "eventId" and "timestamp" will always be supplied, the structure of the array can vary and is not the same. There are around 30-40 unique eventTypes, all with different event properties. Most of the event data have a nested structure.
What would be the best approach for storing those event properties? I have looked into MongoDB, DynamoDB and a project called EventStore (http://geteventstore.com). Obviously I have also considered MySQL, but I am wondering how it would perform in our use case.
The storage of the data is only the first part. After this, we should be able to query our database / event storage with complex queries like the following (and not only retrieve by indexed ID for example):
select all events where eventType is "AddedProductToCart" and timestamp > 2 weeks ago
-> should return all "AddedProductToCart" from 2 weeks ago until now
select all events where device.OS.name is "iOS" and device.OS.version is "7.1.1"
-> should return all events from iOS 7.1.1
etc.
We are expecting around 10 million events per month. This amounts to 3-4 writes per second on average, and probably more like 30-40 writes per second peak / worst case scenario. Storage should not really be an issue - total size per event will likely not exceed 1 or 2kb (this amounts to 1-2GB per 1 million events).
The querying part should preferably be done in PHP. DynamoDB, for example, has an SDK for PHP, which would certainly facilitate our development.
What would be our best solution for this? Writes should be blazing fast and querying should also be acceptable. In a nutshell, we're looking for a low-cost data store that lets us easily store our data and then retrieve it, queried not only by an indexed ID but also by the event properties inside the nested JSON.
Thanks for any suggestions, and if more information is required to properly answer this question, I'd be glad to supply more information.

Amazon's DynamoDB offers a fully managed (auto-scaling), durable, and predictable solution.
Judging by the amount of traffic and data you expect, DynamoDB’s free tier of 25 write/read capacity units and 25 GB covers your operations basically for free.
Each write capacity unit is equivalent to writing 1 KB of data, so if you're expecting 3-4 writes per second of 2 KB items, you need to provision 8 WCUs. In addition, DynamoDB's performance is extremely predictable, with fast single-digit millisecond latency. For more information about the free tier, check out http://aws.amazon.com/dynamodb/pricing/.
In terms of your data set, querying on non-document (top-level) attributes is relatively simple with the use of global secondary indexes.
Here’s an example from the PHP SDK.
use Aws\DynamoDb\Enum\ComparisonOperator;
use Aws\DynamoDb\Enum\Type;

$twoWeeksAgo = date("Y-m-d H:i:s", strtotime("-14 days"));

$response = $dynamoDB->query(array(
    "TableName" => <Table Name>,
    "KeyConditions" => array(
        "EventType" => array(
            "ComparisonOperator" => ComparisonOperator::EQ,
            "AttributeValueList" => array(
                array(Type::STRING => "AddedProductToCart")
            )
        ),
        "Timestamp" => array(
            "ComparisonOperator" => ComparisonOperator::GE,
            "AttributeValueList" => array(
                array(Type::STRING => $twoWeeksAgo)
            )
        )
    )
));
You can query "Device.OS.Name" and "Device.OS.Version" via a scan, but there are a couple of optimizations you should consider based on what kind of queries you want to make.
If you're looking to make ad hoc queries, you can make a parallel scan call and then apply a scan filter (a FilterExpression) on your nested attributes. By parallelizing your scan, you optimize the consumption of read capacity units on your table as well as the speed of the operation. For more information about parallel scan, check out http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#QueryAndScanParallelScan.
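As a rough sketch only (assuming the same $dynamoDB client as above, a hypothetical table name "Events", and that the nested data is stored as a DynamoDB Map under these exact attribute names), one segment of such a parallel scan with a filter on the nested OS attributes could look like this:
$response = $dynamoDB->scan(array(
    "TableName" => "Events",          // hypothetical table name
    "TotalSegments" => 4,             // run 4 workers in parallel
    "Segment" => 0,                   // this worker handles segment 0 (others handle 1..3)
    "FilterExpression" => "Device.OS.#n = :osName AND Device.OS.Version = :osVersion",
    "ExpressionAttributeNames" => array("#n" => "Name"),   // alias "Name" since it is a reserved word
    "ExpressionAttributeValues" => array(
        ":osName" => array("S" => "iOS"),
        ":osVersion" => array("S" => "7.1.1")
    )
));
Each segment consumes read capacity independently, so running the segments from separate workers speeds the scan up roughly in proportion to TotalSegments.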
Alternatively, if there are select attributes you want to query, consider making some of those fields top-level attributes, or move them into their own separate table: flatten the necessary attributes (i.e. os.name to osname) and keep a back reference to your original item (this mainly applies to your documents like "device"). By doing this, you can add indexes on top of these attributes and query them quickly and efficiently. Additionally, with the pre-announcement of online indexing, you should soon be able to add and remove indexes as necessary to meet your requirements.
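For illustration (the attribute and table names here are made up, not part of the original design), flattening can be as simple as copying the nested values into top-level attributes at write time, so a global secondary index can be built on them:
$event = json_decode($json, true);   // the incoming event shown at the top of the question

$dynamoDB->putItem(array(
    "TableName" => "Events",                                          // hypothetical
    "Item" => array(
        "EventId"   => array("S" => $event["eventId"]),
        "Timestamp" => array("S" => $event["timestamp"]),
        "EventType" => array("S" => $event["eventType"]),
        // flattened copies of nested fields, suitable as GSI keys
        "OsName"    => array("S" => $event["device"]["os"]["name"]),
        "OsVersion" => array("S" => $event["device"]["os"]["version"]),
        // keep the full event for retrieval
        "Payload"   => array("S" => json_encode($event))
    )
));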
If you would like to discuss this in further detail or ask questions in general about using DynamoDB, feel free to reach out to me by private message.
Thanks

MongoDB is a good bet here. It can handle the writes per second easily (a mongod on my laptop sees more action).
The queries you mentioned are basic ones. For example:
db.collection.find({"device.OS.name":"iOS","device.OS.version":"7.1.1"})
and (shortened for readability)
db.collection.find({"eventType":"AddedProductToCart",timestamp:{$gte: ISODate(iso8601String)}})
With indices set correctly, those should be lightning fast. You can even use TTL indices to automatically remove events older than a certain age.
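For example, with the legacy PHP mongo driver (MongoClient), made-up database/collection names, and assuming "timestamp" is stored as a MongoDate, the indexes (including a TTL index) could be created roughly like this:
$client = new MongoClient();
$events = $client->selectDB("tracking")->selectCollection("events");   // hypothetical names

// Compound index to serve the eventType + timestamp query
$events->ensureIndex(array("eventType" => 1, "timestamp" => -1));

// Index for the nested device queries
$events->ensureIndex(array("device.OS.name" => 1, "device.OS.version" => 1));

// TTL index: automatically remove events older than 90 days
// (only works if "timestamp" is stored as a BSON date, i.e. MongoDate in PHP)
$events->ensureIndex(array("timestamp" => 1), array("expireAfterSeconds" => 60 * 60 * 24 * 90));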
For data analysis, you have both map/reduce and MongoDB's extremely powerful aggregation framework.
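As a small illustration of the aggregation framework from PHP (same assumed collection as in the sketch above), counting events per eventType over the last two weeks might look like this:
$events = (new MongoClient())->selectDB("tracking")->selectCollection("events");   // hypothetical

$result = $events->aggregate(array(
    array('$match' => array("timestamp" => array('$gte' => new MongoDate(strtotime("-14 days"))))),
    array('$group' => array("_id" => '$eventType', "count" => array('$sum' => 1))),
    array('$sort'  => array("count" => -1))
));
// the grouped documents are in $result["result"]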
Let's come to the downsides. While scaling is relatively easy with MongoDB, for some reason people assume that a replicated, sharded cluster with automatic distribution of data is as easy to manage as the rest of MongoDB. The key word is relatively easy (compare it to replicated data partitioning with MySQL or, Lord help us, Oracle), but it still has some pitfalls.
Point-in-time recoveries in a sharded environment without the use of MMS are possible, but you really have to know what you are doing, since syncing the individual backups of the shards is quite tricky.
No matter which database you choose, I strongly advise getting in touch with a specialist for it. Production data is critical, and no database holding it should be planned and maintained by non-specialists.

Related

Best practice for high-volume transactions with real time balance updates

I currently have a MySQL database which deals with a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in real time. The structure is such that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the clicks balance by 1 (there is actually more processing depending on the event) for each of the user, the sub-affiliate and the affiliate. Currently I do it very simply: once I receive the event, I do sequential queries in PHP - I read the balance of the user, increment it by one and store the new value, then I read the balance of the sub-affiliate, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as real-time as possible. Other metrics on the sub-affiliate and affiliate level are less important, but the closer they are to real time the better; however, I think a 5 minute delay might be OK.
As the project grows, this is already becoming a bottleneck, and I am now looking at alternatives - how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million events per day. It is also important for me not to lose a single event, and I actually wrap each cycle of changes to click balances in an SQL transaction.
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates into the database itself by using stored procedures. I am considering adding a separate database for this; maybe Postgres would be better suited for the job? I tried to see if there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like Hadoop with Parquet (or Apache Kudu?) and just adding more servers if needed.
4 - Sharding the existing db, basically adding a separate db server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize that table to do mass UPDATEs of the counters. When there is a burst of traffic, this becomes more efficient, so it doesn't keel over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
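A sketch of that flip-and-summarize loop with PDO; all table and column names here are invented for illustration, and the linked article describes the real scheme in more detail:
$pdo = new PDO("mysql:host=localhost;dbname=tracking", "user", "pass");   // hypothetical DSN

// Web tier: cheap, append-only insert of the raw event (no balance math here)
$pdo->exec("INSERT INTO clicks_staging (user_id, subaff_id, aff_id, created_at)
            VALUES (123, 45, 6, NOW())");

// Background task: atomically swap the ping-pong tables, then summarize at leisure
$pdo->exec("RENAME TABLE clicks_staging TO clicks_processing,
                         clicks_shadow  TO clicks_staging");

// One mass UPDATE per level instead of one UPDATE per click (repeat for sub-affiliates/affiliates)
$pdo->exec("UPDATE user_balances b
            JOIN (SELECT user_id, COUNT(*) AS c
                    FROM clicks_processing
                   GROUP BY user_id) t ON t.user_id = b.user_id
             SET b.clicks = b.clicks + t.c");

// Empty the processed table and park it as the shadow for the next flip
$pdo->exec("TRUNCATE TABLE clicks_processing");
$pdo->exec("RENAME TABLE clicks_processing TO clicks_shadow");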
If I were you, I would implement Redis in-memory storage and increment your metrics there. It's very fast and reliable, and you can also read from this DB. Then create a cron job which saves that data into the MySQL DB.
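A minimal sketch of that idea with the phpredis extension (key names and IDs are made up):
$redis = new Redis();
$redis->connect("127.0.0.1", 6379);

// values taken from the incoming event
$userId = 123; $subaffId = 45; $affId = 6;

// three O(1) in-memory increments instead of three read-modify-write SQL cycles
$redis->incrBy("balance:user:" . $userId, 1);
$redis->incrBy("balance:subaff:" . $subaffId, 1);
$redis->incrBy("balance:aff:" . $affId, 1);

// a cron job then reads these counters periodically and persists them to MySQL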
Is your web tier doing the number crunching as it receives & processes the HTTP request? If so, the very first thing you will want to do is move this to a work queue and process these events asynchronously (see the sketch after the list below). I believe you hint at this in your Item 3.
There are many solutions and the scope of choosing one is outside the scope of this answer, but some packages to consider:
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
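For instance, with the PHP gearman extension the hand-off is only a few lines (the function name and payload are illustrative):
// Producer (web tier): fire-and-forget the raw event onto the queue
$event = array("type" => "click", "user_id" => 123);        // example payload
$client = new GearmanClient();
$client->addServer();                                       // defaults to 127.0.0.1:4730
$client->doBackground("process_click", json_encode($event));

// Consumer (worker process): crunch the numbers outside the HTTP request
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction("process_click", function (GearmanJob $job) {
    $event = json_decode($job->workload(), true);
    // ... update the balances here ...
});
while ($worker->work());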
In terms of storage, it really depends on what you're trying to achieve: fast reads, fast writes, bulk reads, sharding/distribution, high availability... the answer to each points you in a different direction.
This sounds like an excellent candidate for Clustrix, which is a drop-in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you, transparently to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OLTP DB that happens to be able to absorb mad amounts of analytical processing on real-time data as well.

mongodb server side api for php

I use MongoDB, in which the data changes (updates) frequently - every minute.
The data is taken from MongoDB through a third-party API application via HTTP. In that API, the data is additionally aggregated before it is returned, for example the sum of views for page N over the last X days.
With the constantly increasing data volume (a few of these collections have grown from 6 GB to 14 GB), in some cases there are 2-7 second delays until the API returns the aggregated data. That delay is too big for a web application.
I want to reduce these delays somehow.
Which models are used in situations like the one I described?
Maybe I should first of all drop that HTTP API idea and move all the API logic to the server side?
My own ideas and considerations:
Maybe there should be two separate data "processors":
1) The first "processor" does all the aggregation jobs and just writes to the second one.
2) The second "processor" just returns the data without any internal calculations or aggregations.
But there can also be a bottleneck when the first writes to the second data store; there would have to be logic to update new and old data, which also impacts performance.
That third-party application seems to do a bad job, so you should probably drop it. You can likely fix your problems by refactoring the data model or using better aggregation algorithms.
Pre-calculations
Using a batch processor and a real-time processor sounds like a good idea, but I think you won't need it yet (see below). If you still want to implement it, you should read about Lambda architecture, because it fixes some problems your approach might have.
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate precomputed views, while simultaneously using real-time stream processing to provide dynamic views. The two view outputs may be joined before presentation.
Data Model (6 rules of thumb)
You're saying that there are a lot of updates; this is a red flag when using MongoDB. Some kinds of updates can slow MongoDB down because of its distributed nature. For example, try to insert subdocuments instead of updating fields. But this isn't an exact science, so I can't help without seeing the data model.
Aggregation Framework
Databases are made for data, so move data aggregation into MongoDB. Map Reduce is slow on MongoDB, thus use the Aggregation Framework.
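As a rough sketch of what such an aggregation could look like from PHP (legacy driver; the collection and field names are guesses based on your description, not your actual schema):
$views = (new MongoClient())->selectDB("stats")->selectCollection("pageviews");   // hypothetical

$pageId = "N";                                    // the page being aggregated
$since  = new MongoDate(strtotime("-30 days"));   // "last X days"

$result = $views->aggregate(array(
    array('$match' => array("page" => $pageId, "date" => array('$gte' => $since))),
    array('$group' => array("_id" => '$page', "views" => array('$sum' => '$views')))
));
// $result["result"][0]["views"] holds the pre-aggregated sum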

Elasticsearch - Do i need the JDBC driver?

Aim
To synchronize my elasticsearch server with new and expired data in my SQL database
Issue
There are two very different ways I can achieve this and I don't know which is better. I can either pull information to elasticsearch with a direct connection to the SQL database using the JDBC river plugin. Alternatively I can push data to elasticsearch using the PHP client using the code shown below as an example:
// The Id of the document
$id = 1;

// Create a document
$tweet = array(
    'id'   => $id,
    'user' => array(
        'name'     => 'mewantcookie',
        'fullName' => 'Cookie Monster'
    ),
    'msg'      => 'Me wish there were expression for cookies like there is for apples. "A cookie a day make the doctor diagnose you with diabetes" not catchy.',
    'tstamp'   => '1238081389',
    'location' => '41.12,-71.34',
    '_boost'   => 1.0
);

// First parameter is the id of document.
$tweetDocument = new \Elastica\Document($id, $tweet);

// Add tweet to type
$elasticaType->addDocument($tweetDocument);

// Refresh Index
$elasticaType->getIndex()->refresh();
I was going to have a cron run every thirty minutes to check for items in my database that have an "active" flag but do not yet have an "indexed" flag; those are the ones I need to add to the index.
QUESTION
Seeing as I have two methods to synchronize data between elasticsearch and mysql in two different ways, what is the advantage and disadvantage of each option. Is there a specific usecase which defines using one over the other?
I would use the river method, even though an in-house built solution might be more customizable.
On one side, the jdbc-river plugin is already built and has around 20 contributors so far. So you kind of have an extra team working to improve that tool while Elasticsearch itself keeps improving.
All you have to do is install it, and you don't even need a complex configuration to set up a river between your cluster and your relational database.
Another advantage of the jdbc-river solution is that you don't need to deal with memory management. The plugin can operate as a river in "pull mode" or as a feeder in "push mode". In feeder mode, the plugin runs in a separate JVM and can connect to a remote Elasticsearch cluster. I personally prefer the river mode, because in that case Elasticsearch deals with the indexing and memory management issues.
The relational data is internally transformed into structured JSON objects for the schema-less indexing model of Elasticsearch documents.
Both ends are scalable. The plugin can fetch data from different RDBMS sources in parallel, and multithreaded bulk mode ensures high throughput when indexing into Elasticsearch.
One of the drawbacks of this solution is that it doesn't notify you when it's done indexing. As a workaround, I suggest you use the Count API to compare results.
Another drawback of the river is that it doesn't pull on update, only on insert or delete. I'm referring, of course, to the SQL actions UPDATE, INSERT and DELETE.
On the other hand, your solution brings some advantages and drawbacks you may want to consider.
Your solution is highly customizable, so you can manage your scripts however you want. But considering the current state of the PHP Elasticsearch clients available (the official Elasticsearch-php client, Elastica or FOSElasticaBundle), and even though the guys are doing a great job on them, they are still not very mature APIs to work with at that level compared to the official Elasticsearch Java API used by the river.
You should also consider dealing with all the errors your cluster can throw at you, from memory loss to management and performance issues, etc.
Ex: I tried to build a proof of concept using the Elastica API, pushing my data from my database to my cluster, with a configuration of 32 GB RAM and 8 cores running at 2.05 GHz each, in a test environment, without getting into much detail. It took me 5 hours to push 10M records from the database to the cluster, whereas with the river it takes 20 minutes for the same records. Of course there might be optimizations that could be done around my code, but I considered them more time-consuming than what they would bring me.
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
NB: Of course there might be other points you may want to consider, but this subject is too long to discuss fully here, so I chose some points I found essential for you to be aware of.
If you forget for a moment that you need to import initial data into Elasticsearch, I would use an event system to push data to Elasticsearch. This is more efficient in the long run.
Your application knows exactly when something needs to be indexed by Elasticsearch. To take your tweet example, at some point a new tweet will enter your application (a user writes one for example). This would trigger a newTweet event. You have a listener in place that will listen to that event, and store the tweet in Elasticsearch whenever such an event is dispatched.
If you don't want to use resources/time in the web request to do this (and you definitely don't want to do this), the listener could add a job to a queue (Gearman or Beanstalkd for example). You would then need a worker that will pick that job up and store the tweet in Elasticsearch.
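A sketch of that listener/worker pair with Gearman and Elastica (the queue, index and type names are made up):
// Listener: when the newTweet event fires, enqueue a job instead of indexing inline
$client = new GearmanClient();
$client->addServer();
$client->doBackground("index_tweet", json_encode($tweet));   // $tweet as in the example above

// Worker: picks the job up and stores the document in Elasticsearch
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction("index_tweet", function (GearmanJob $job) {
    $tweet = json_decode($job->workload(), true);
    $type  = (new \Elastica\Client())->getIndex("tweets")->getType("tweet");
    $type->addDocument(new \Elastica\Document($tweet["id"], $tweet));
});
while ($worker->work());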
The main advantage is that Elasticsearch is kept up-to-date more real-time. You won't need a cronjob that would introduce a delay. You'll (mostly) handle a single document at a time. You won't need to bother the SQL database to find out what needs to be (re)indexed.
Another advantage is that you can easily scale when the amount of events/data gets out of hand. When Elasticsearch itself needs more power, add servers to the cluster. When the worker can't handle the load, simply add more of them (and place them on dedicated machines). Plus your webserver(s) and SQL database(s) won't feel a thing.
I would use the river method.
Advantages of the river:
Already built. Just download it, set your configurations and everything is done.
Tested. The river has been used by several people and thus mistakes have been fixed.
Customizable. You can set the duration between the runs, define a sql-statement for getting new data, etc.
Advantages of your solution:
Highly customizable, you can do with your script whatever you want.
Disadvantages of your solution:
Needs special flags
Prone to errors (since it has not been tested for as long)
...
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.

Best practice for custom statistics

I'm in a situation where I have to build a statistics module that can store user-related statistical information.
Basically, all that's stored is an event identifier, a datetime object, the number of times this event has been fired, and the id of the object being interacted with.
I've made similar systems before, but never anything that has to store the amount of information this one does.
My suggestion would be a simple table in the database,
e.g. "statistics", containing the following columns:
id (primary, auto-increment)
amount (integer)
event (enum: list, click, view, contact)
datetime (datetime)
object_id (integer)
Usually, this method works fine, enabling me to store statistics about the object in a given timeframe (inserting a new row every hour or every 15 minutes, so the statistics update every 15 minutes).
Now, my questions are:
Are there better or more optimized methods of building a custom statistics module?
Since this new site will receive massive traffic, how do I handle the trade-off that an index on object_id will cause slower update response times?
How do you even achieve live statistics like, say, Analytics does? Is this solely about server size and processing power, or is there a best practice?
I hope my questions are understandable, and I'm looking forward to getting wiser on this topic.
Best regards,
Jonas
I believe one of the issues you are going to run into is wanting two worlds, transactional and analytical, at once. That is fine in small cases, but becomes a problem when you start to scale, especially into the realm of 500M+ records.
I would suggest separating the two: generate events and keep track of just the event itself. You would then run analytical queries to get things such as the count of events per object interaction. These counts or other metric calculations can be aggregated into a report table periodically.
As for tracking events, you could either keep them in a table of event occurrences, or have something in front of the database doing the tracking and providing the periodic aggregations to the database. Think of the world of monitoring systems, which use collection agents to generate events that go to an aggregation layer, which then writes a periodic metric snapshot to an analytical store (e.g. CollectD to StatsD / Graphite to Whisper).
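A sketch of such a periodic rollup with PDO; the table and column names are invented, and it assumes a unique key on (object_id, event, bucket) in the report table:
$pdo = new PDO("mysql:host=localhost;dbname=stats", "user", "pass");   // hypothetical DSN

// Roll raw event rows up into hourly buckets; run this from a scheduler
$pdo->exec("INSERT INTO statistics_report (object_id, event, bucket, amount)
            SELECT object_id,
                   event,
                   DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS bucket,
                   COUNT(*)
              FROM events_raw
             WHERE created_at >= NOW() - INTERVAL 1 HOUR
             GROUP BY object_id, event, bucket
            ON DUPLICATE KEY UPDATE amount = amount + VALUES(amount)");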
Disclaimer: I am an architect for InfiniDB.
Not sure what kind of data source you are using, but as you grow and determine how much history to keep, you will probably face sizing problems, as most people do when collecting event or monitoring data. If you are on MySQL / MariaDB / PostgreSQL, I suggest you check out InfiniDB (an open-source columnar MPP database for analytics); it is fully open source (GPLv2) and will provide the performance you need to run queries over billions of rows and TBs of data to answer those analytical questions.

Cloud Architecture Stack Opinions - EC2 versus Azure

I have read many blogs and articles about the pros and cons of Amazon EC2 versus Microsoft Azure (and Google's App Engine). However, I am trying to decide which would better suit my particular case.
I have a data set - which can be thought of as a standard table of the format:
[id] [name] [d0] [d1] [d2] .. [d63]
---------------------------------------
0 Name1 0.43 -0.22 0.11 -0.81
1 Name2 0.23 0.65 0.62 0.41
2 Name3 -0.13 -0.23 0.17 0.00
...
N NameN 0.43 -0.23 0.12 0.01
I ultimately want to do something that (despite my final chosen stack) would equate to an SQL SELECT statement similar to:
SELECT name FROM [table] WHERE (d0*QueryParameter0) + (d1*QueryParameter1) + (d2*QueryParameter2) + ... + (dN*QueryParameterN) < 0.5
where QueryParameter0, QueryParameter1, ..., QueryParameterN are parameters supplied at runtime that change each time the query is run (so caching is out of the question).
My main concern is with the speed of the query, so I would like advice on which cloud stack option would provide the fastest query result possible.
I can do this a number of ways:
(1) Use SQL Azure, with the query just as it is above. I have tried this method, and the queries can be quite slow as expected, since SQL Azure only gives you a single instance. I can spin up multiple SQL instances and shard the data, but that gets real expensive real quick.
(2) Use Azure Storage Tables. Bloggers claim storage tables are faster in general, but would this still be the case for my query requirements?
(3) Use EC2 and spin up several instances with MySQL, possibly incorporating sharding to new instances (cost increases though).
(4) Use EC2 with MongoDB, as I've read it is faster than MySQL. Again this is probably dependent on the type of query.
(5) Google AppEngine. I'm not really sure how GAE would work with this query structure, but I guess that's why I am looking for opinions.
I'd like to find the best stack combination to optimize my specific need (outlined by the pseudo SQL query above).
Does anyone have any experience in this? Which stack option would result in the fastest query containing many math operators in the WHERE clause?
Cheers,
Brett
Your type of query with dynamic coefficients (weights) will require the entire table to be scanned on every query. A SQL database engine is not going to help you here, because there is really nothing that the query optimizer can do.
In other words, what you need is NOT a SQL database, but a "NoSQL" database that optimizes table/row access for the fastest speed possible. So you really shouldn't have to try SQL Azure and MySQL to find out this part of the answer.
Also, each row in your type of query is completely independent of the others, so the problem lends itself to simple parallelism. Your choice of platform should be whichever gives you:
Table/row scan at the fastest speed
Ability to highly parallelize your operation
Each platform you mentioned gives you the ability to store huge amounts of blob or table-like data for very fast scan retrieval (e.g. table storage in Azure). Each also gives you the ability to "spin up" multiple instances to process the data in parallel. It really depends on which programming environment you're most comfortable in (e.g. Java in Google/Amazon, .NET in Azure). In essence they all do the same thing.
My personal recommendation is Azure, since you can:
Store massive amounts of data in "table storage", optimized for fast scan retrieval, and partitioned (e.g. over d0 ranges) for optimal parallelism
Dynamically "spin up" as many compute instances as you like to process the data in parallel
Use queueing mechanisms to synchronize the collation of the results
Azure does what you require in a very "no-frills" way, providing just enough infrastructure for you to do your job, and nothing more.
The problem is not the math operators or the number thereof, the problem is that they are parameterized - you are effectively doing a weighted average across the columns with the weights being defined at run-time, so that the operation must be computed and cannot be inferred.
Even in SQL Server, this operation can be parallelized (and this should show up on the execution plan), but it is not amenable to search optimization using indexes, which is where most relational databases really shine. With static weights, an indexed computed column would obviously perform very quickly.
Because this problem is easily parallelized, you might want to look at something based on a Map-Reduce principle.
Currently neither SQL Azure nor Amazon RDS can scale horizontally (EC2 can at least scale vertically), but IF and only IF your data can be partitioned in a way that still makes it possible to execute your query, the upcoming SQL Federations feature of SQL Azure might be worth looking at and could help you make an informed decision.
MongoDB (which I like a lot) is geared more toward document-oriented workloads and is possibly not the best solution for this type of job, although your mileage may vary (it's blazingly fast as long as most of your working set fits into memory).
Assuming that QueryParameter0, QueryParameter1, ..., QueryParameterN are all supplied at runtime and are different each time, I don't think any of the platforms will be able to provide a significant advantage over the others, since none of them will be able to take advantage of any pre-computed indices.
With indices removed, the only other factor for speed comes from the processing power available. You already know about this for the SQL Azure option, and for the other options it pretty much comes down to deciding what processing to apply; it's up to you to fetch all the data and then process it.
One option you might consider is whether you could host this data yourself on an instance (e.g. using an Azure blob or cloud drive) and then process the data in a custom-built worker role. This isn't something I'd think about for general data storage, but if it's just this one table and this one query, it would be pretty easy to hand-craft a quick solution.
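To make that concrete, the per-row work is trivial. A worker that has fetched its partition of the rows (from blob storage, a table-storage partition, etc.) only needs something like this (a sketch with made-up variable names):
/**
 * $rows    : each row is array("name" => ..., "d" => array of 64 floats)
 * $weights : the QueryParameter0..N values supplied at runtime
 * Returns the names whose weighted sum falls below the threshold.
 */
function filterRows(array $rows, array $weights, $threshold = 0.5)
{
    $matches = array();
    foreach ($rows as $row) {
        $sum = 0.0;
        foreach ($row["d"] as $i => $value) {
            $sum += $value * $weights[$i];    // d_i * QueryParameter_i
        }
        if ($sum < $threshold) {
            $matches[] = $row["name"];
        }
    }
    return $matches;
}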
Update - just seen the answer from @Cade too - +1 for his suggestion of parallelization.
