Aim
To synchronize my elasticsearch server with new and expired data in my SQL database
Issue
There are two very different ways I can achieve this, and I don't know which is better. I can either pull data into Elasticsearch over a direct connection to the SQL database using the JDBC river plugin, or push data to Elasticsearch from PHP using a client library, with code like the example below:
// The Id of the document
$id = 1;

// Create a document
$tweet = array(
    'id' => $id,
    'user' => array(
        'name' => 'mewantcookie',
        'fullName' => 'Cookie Monster'
    ),
    'msg' => 'Me wish there were expression for cookies like there is for apples. "A cookie a day make the doctor diagnose you with diabetes" not catchy.',
    'tstamp' => '1238081389',
    'location' => '41.12,-71.34',
    '_boost' => 1.0
);

// First parameter is the id of the document.
$tweetDocument = new \Elastica\Document($id, $tweet);

// Add tweet to type
$elasticaType->addDocument($tweetDocument);

// Refresh index
$elasticaType->getIndex()->refresh();
I was going to have a cron job run every thirty minutes to check for rows in my database that have an "active" flag but do not yet have an "indexed" flag, which means they still need to be added to the index.
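Roughly, the cron script I have in mind would look something like the sketch below (table and column names are simplified, and it reuses the $elasticaType object from the example above):

// Hypothetical sync cron, run every thirty minutes.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Rows that are active but not yet indexed.
$rows = $db->query("SELECT * FROM tweets WHERE active = 1 AND indexed = 0")
           ->fetchAll(PDO::FETCH_ASSOC);

$documents = array();
foreach ($rows as $row) {
    $documents[] = new \Elastica\Document($row['id'], $row);
}

if ($documents) {
    $elasticaType->addDocuments($documents); // bulk add
    $elasticaType->getIndex()->refresh();

    $ids = implode(',', array_map('intval', array_column($rows, 'id')));
    $db->exec("UPDATE tweets SET indexed = 1 WHERE id IN ($ids)");
}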
Question
Given that I have two ways to synchronize data between Elasticsearch and MySQL, what are the advantages and disadvantages of each option? Is there a specific use case that calls for one over the other?
I would use the river method, even though an in-house solution might be more customizable.
On one hand, the jdbc-river plugin is already built and has around 20 contributors so far, so you effectively have an extra team improving the tool as Elasticsearch itself evolves.
All you have to do is install it; you don't even need a complex configuration to set up a river between your cluster and your relational database.
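To give an idea of how light that setup is, a river is registered by PUTting a single _meta document to the cluster. Here is a rough sketch from PHP using cURL; the exact jdbc settings are assumptions and depend on the plugin version:

// Hypothetical river definition; adjust url/user/password/sql to your database.
$meta = array(
    'type' => 'jdbc',
    'jdbc' => array(
        'url'      => 'jdbc:mysql://localhost:3306/app',
        'user'     => 'user',
        'password' => 'pass',
        'sql'      => 'SELECT id AS _id, user, msg, tstamp, location FROM tweets WHERE active = 1'
    )
);

$ch = curl_init('http://localhost:9200/_river/my_jdbc_river/_meta');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($meta));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
curl_close($ch);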
Another advantage of the jdbc-river solution is that you don't need to deal with memory management yourself. The plugin can operate as a river in "pull mode" or as a feeder in "push mode". In feeder mode, the plugin runs in a separate JVM and can connect to a remote Elasticsearch cluster. I personally prefer the river mode, because then Elasticsearch deals with the indexing and memory management issues.
The relational data is internally transformed into structured JSON objects for the schema-less indexing model of Elasticsearch documents.
Both ends are scalable. The plugin can fetch data from different RDBMS sources in parallel, and its multithreaded bulk mode ensures high throughput when indexing into Elasticsearch.
One drawback of this solution is that it doesn't notify you when it has finished indexing. As a workaround, I suggest using the Count API to compare the number of documents on both sides.
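A minimal sketch of that comparison, assuming a tweets table and a matching tweets index (names are made up):

$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$sqlCount = (int) $db->query("SELECT COUNT(*) FROM tweets WHERE active = 1")->fetchColumn();

// The Count API returns {"count": N, ...}
$es = json_decode(file_get_contents('http://localhost:9200/tweets/_count'), true);
$esCount = (int) $es['count'];

if ($sqlCount !== $esCount) {
    error_log("Still indexing (or something went wrong): MySQL=$sqlCount, ES=$esCount");
}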
Another drawback of the river is that it doesn't pull changes on UPDATE; it only does so on INSERT and DELETE (referring, of course, to the SQL statements UPDATE, INSERT and DELETE).
On the other hand, your own solution has advantages and drawbacks you may want to consider.
Your solution is highly customizable, so you can manage your scripts however you want. But considering the current state of the available PHP Elasticsearch clients (the official elasticsearch-php client, Elastica, or FOSElasticaBundle), and even though their maintainers are doing a great job, these APIs are still not as mature at this level as the official Elasticsearch Java API used by the river.
You would also have to handle all the errors your cluster can throw at you: memory problems, cluster management, performance, and so on.
For example: I built a proof of concept using the Elastica API to push data from my database to my cluster, in a test environment with 32 GB of RAM and 8 cores running at 2.05 GHz each. Without going into much detail, it took 5 hours to push 10M records from the database to the cluster, whereas the river handled the same records in 20 minutes. There are certainly optimizations I could make to my code, but I judged them more time-consuming than the benefit they would bring.
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
NB: Of course there are other points you might want to consider, but this subject is too long to discuss fully here, so I picked the points I found essential for you to be aware of.
If you forget for a moment that you need to import initial data into Elasticsearch, I would use an event system to push data to Elasticsearch. This is more efficient in the long run.
Your application knows exactly when something needs to be indexed by Elasticsearch. To take your tweet example, at some point a new tweet will enter your application (a user writes one for example). This would trigger a newTweet event. You have a listener in place that will listen to that event, and store the tweet in Elasticsearch whenever such an event is dispatched.
If you don't want to use resources/time in the web request to do this (and you definitely don't want to do this), the listener could add a job to a queue (Gearman or Beanstalkd for example). You would then need a worker that will pick that job up and store the tweet in Elasticsearch.
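As a rough sketch of that listener/worker pair with Gearman (the index_tweet job name and the wiring around $elasticaType are assumptions for illustration):

// In the web request: the newTweet listener only queues a background job.
$client = new GearmanClient();
$client->addServer('127.0.0.1');
$client->doBackground('index_tweet', json_encode($tweet));

// In a separate, long-running worker script:
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1');
$worker->addFunction('index_tweet', function (GearmanJob $job) use ($elasticaType) {
    $tweet = json_decode($job->workload(), true);
    $elasticaType->addDocument(new \Elastica\Document($tweet['id'], $tweet));
});
while ($worker->work());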
The main advantage is that Elasticsearch is kept up to date much closer to real time. You won't need a cronjob that introduces a delay. You'll (mostly) handle a single document at a time. And you won't need to bother the SQL database to find out what needs to be (re)indexed.
Another advantage is that you can easily scale when the amount of events/data gets out of hand. When Elasticsearch itself needs more power, add servers to the cluster. When the worker can't handle the load, simply add more of them (and place them on dedicated machines). Plus your webserver(s) and SQL database(s) won't feel a thing.
I would use the river method.
Advantages of the river:
Already built. Just download it, set your configurations and everything is done.
Tested. The river has been used by several people and thus mistakes have been fixed.
Customizable. You can set the duration between the runs, define a sql-statement for getting new data, etc.
Advantages of your solution:
Highly customizable, you can do with your script whatever you want.
Disadvantages of your solution:
Needs special flags
Prone to errors (since it has not been tested over a long period)
...
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
Related
I want to build a detailed logger for my application. Because it can get very complex and has to save a lot of different things, I wonder where it is best to store the logs: in a database (and if a database, which kind is better suited to this kind of operation) or in a file (and if a file, in which format: text, CSV, JSON, XML). My first thought was of course a file, because I see a lot of problems with a database, but I also want to be able to display those logs, and that is easier with a database.
I am building a log for HIPAA compliance, and here is my rough implementation (not finished yet).
File VS. DB
I use a database table to store the last 3 months of data. Every night a cron will run and push the older data (data past 3 months) off into compressed files. I haven't written this script yet but it should not be difficult. That way the last 3 months can be searched, filtered, etc. But the database won't be overwhelmed with log entries.
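A rough sketch of that nightly archival cron with PDO (table and column names are assumptions; the DSN depends on your database):

$db = new PDO('sqlsrv:Server=localhost;Database=app', 'user', 'pass');
$cutoff = date('Y-m-d', strtotime('-3 months'));

$stmt = $db->prepare("SELECT * FROM audit_log WHERE log_date < ?");
$stmt->execute(array($cutoff));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

if ($rows) {
    // Archive the old entries to a compressed file, then drop them from the table.
    $gz = gzopen('/var/backups/audit-before-' . $cutoff . '.json.gz', 'w9');
    foreach ($rows as $row) {
        gzwrite($gz, json_encode($row) . "\n");
    }
    gzclose($gz);

    $db->prepare("DELETE FROM audit_log WHERE log_date < ?")->execute(array($cutoff));
}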
Database Preference
I am using MSSQL because I don't have a choice. I usually prefer MySQL, though, as it has better paging optimization. If you are doing more than a very minimal amount of searching and filtering, or if you are concerned about performance, you may want to consider an Apache Solr middleman. I'm not a DB expert, so I can't give you much more than that.
Table Structure
My table has 5 columns: Date, Operation (create, update, delete), Object (patient, appointment, doctor), ObjectID, and Diff (a serialized array of before and after values; changed values only, no empty or unchanged values, for the sake of saving space).
Summary
The most important question to consider is: do you need people to be able to access and filter/search the data regularly? If yes, consider a database for the recent history or the most important data.
If no, a file is probably the better option.
My hybrid solution is also worth considering. I'll be pushing the files off to an Amazon file server so they don't take up my web server's space.
You can build a detailed and complex logger using an existing library such as log4php. It is fully tested and tuned for performance, compared to designing a custom one yourself, and it will also save development time. I have personally used a few such libraries in PHP and .NET for complex logging needs in financial and medical domain projects.
If you need to do this from PHP, I would suggest using this:
https://logging.apache.org/log4php/
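A minimal usage sketch, assuming log4php is installed and a configuration file exists at the path shown:

require_once 'log4php/Logger.php';

// Appenders (file, database, syslog, ...) are defined in the config, not in code.
Logger::configure('/path/to/log4php-config.xml');
$logger = Logger::getLogger('app');

$logger->info('Appointment 42 updated by user 7');
$logger->error('Failed to save patient record', new Exception('db timeout'));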
I think the right answer is actually: Neither.
Neither a file nor a DB gives you proper search and filtering, and you need that when looking at logs. I deal with logs all day long (see http://sematext.com/logsene to see why), and I'd tackle this as follows:
log to file
use a lightweight log shipper (e.g. Logagent or Filebeat)
index logs into either your own Elasticsearch cluster (if you don't mind managing and learning) or one of the Cloud log management services (if you don't want to deal with Elasticsearch management, scaling, etc. -- Logsene, Loggly, Logentries...)
I am running a CRM application which uses a MySQL database. My application generates lots of data in MySQL. Now I want to give my customers a reporting section where an admin can view real-time reports and filter them in real time. Basically, I want my data to be sliced and diced in real time, as fast as possible.
I have implemented the reporting using MySQL and PHP, but now there is so much data that the queries take too long and the page does not load. After some reading I came across terms like NoSQL, MongoDB, Cassandra, OLAP, and Hadoop, but I was confused about which to choose. Is there a mechanism that would transfer my data from MySQL to a NoSQL store I could run my reporting queries against and serve my customers from, while keeping my MySQL database as it is?
It doesn't matter what database / datastore technology you use for reporting: you still will have to design it to extract the information you need efficiently.
Improving performance by switching from MySQL to MongoDB or one of the other scalable key/value store systems is like solving a pedestrian traffic jam by building a railroad. It's going to take a lot of work to make it help the situation. I suggest you try getting things to work better in MySQL first.
First of all, you need to take a careful look at which SQL queries in your reporting system are causing trouble. You may be able to optimize their performance by adding indexes or doing other refactoring. That should be your first step. MySQL has a slow query log. Look at it.
Secondly, you may be able to add resources (RAM, faster disks, etc) to MySQL, and you may be able to tune it for higher performance. There's a book called High Performance MySQL that offers a sound methodology for doing this.
Thirdly, many people who need to add a reporting function to their busy application use MySQL replication. That is, they configure one or two slave MySQL servers to accept copies of all data from the master server.
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
They then use the slave server or servers to run reporting queries. The slaves are ordinarily a few seconds or minutes behind the master (that is, they're slightly out of date). But it usually is good enough to give users the illusion of real-time reporting.
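In PHP this typically just means keeping two connections around, something like the sketch below (hosts and table names are assumptions):

$master  = new PDO('mysql:host=db-master;dbname=crm', 'user', 'pass');
$replica = new PDO('mysql:host=db-replica;dbname=crm', 'user', 'pass');

// Normal application writes go to the master.
$master->prepare("INSERT INTO leads (name, source) VALUES (?, ?)")
       ->execute(array('Jane Doe', 'web'));

// Heavy reporting queries hit the slave, so the master stays responsive.
$report = $replica->query("
    SELECT source, DATE(created_at) AS day, COUNT(*) AS leads
    FROM leads
    GROUP BY source, DATE(created_at)
")->fetchAll(PDO::FETCH_ASSOC);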
Notice that if you use MongoDB or some other technology you will also have to replicate your data.
I will throw this link out there for you to read, which actually gives certain use cases: http://www.mongodb.com/use-cases/real-time-analytics. But I will speak to a more traditional setup of just MongoDB.
I have used both MySQL and MongoDB for analytical purposes, and I find MongoDB better suited, even if it needs a little bit of hacking to get it working well.
The great thing about MongoDB when it comes to retrieving analytical data is that it does not require the IO/memory to write out a separate result set each time. This makes reads on a single member of a replica set extremely scalable, since you just add your analytical collections to the working set (a.k.a. memory) and serve straight from those using batch responses (this is the default implementation of the drivers).
So with MongoDB, replication rarely gives an advantage in terms of reads/writes, and in reality I have found the same with MySQL. If it does, you are running the wrong queries, which will not scale anyway; at which point you install memcached onto your database servers and, look, you have stale data being served from memory in a NoSQL fashion anyway... whoop, I guess.
Okay, so we have some basic ideas set out; time to talk about that hack. In order to get the best possible speed out of MongoDB, and since it does not have JOINs, you need to flatten your data so that no result set is ever needed on your side.
There are many tactics for this, but the one I will mention here is pre-aggregated reports: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/. This method also works well with SQL technologies, since it is essentially in the same vein as logically splitting tables to make queries on a large table faster and lighter.
What you do is take your analytical data, split it into a denomination such as per day or per month (or both), and then aggregate your data across those ranges in a de-normalised manner: essentially, all in one row (one document per range).
After this, you can show reports straight from a collection without any need for a result set, making for some very fast querying.
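As an illustration, with the (legacy) Mongo PHP driver a pre-aggregated daily document can be maintained with nothing but upserted $inc operations (collection and field names are made up):

$m = new MongoClient();
$stats = $m->selectDB('analytics')->selectCollection('daily_video_stats');

$videoId = 123;
$day     = date('Y-m-d');
$hour    = (int) date('G');

// Each view only increments counters in place; reads later need no result set at all.
$stats->update(
    array('_id' => $videoId . ':' . $day),
    array('$inc' => array(
        'views'          => 1,
        'hours.' . $hour => 1   // per-hour breakdown inside the same document
    )),
    array('upsert' => true)
);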
Later on you could add a map/reduce step to create better analytics, but so far I have not needed to; I have built full video-based analytics without it.
This should get you started.
TiDB may be a good fit (https://en.pingcap.com/tidb/): it is MySQL compatible, good at real-time analytics, and can replicate data from MySQL through the binlog.
Let's assume we have the following example JSON event data:
{
    "eventId": "eb1363c3-6bf7-4a42-9daa-66270b922367",
    "timestamp": "2014-10-28T09:12:22.628Z",
    "ip": "1.2.3.4",
    "device": {
        "type": "mobile",
        "os": {
            "name": "iOS",
            "version": "7.1.1"
        },
        "name": "iPhone 4/4s",
        ...
    },
    "eventType": "AddedProductToCart",
    "store": "US",
    "product": {
        "sku": "ABC123",
        "name": "Yellow Socks",
        "quantity": 1,
        "properties": {
            "foo": "bar",
            "bar": 1
        }
        ...
    },
    "user": {
        "id": 123456,
        "name": "jeff",
        "type": "registered"
        ...
    }
}
while "eventId" and "timestamp" will always be supplied, the structure of the array can vary and is not the same. There are around 30-40 unique eventTypes, all with different event properties. Most of the event data have a nested structure.
What would be the best approach for storing those event properties? I have looked into MongoDB, DynamoDB and a project called EventStore (http://geteventstore.com). Obviously I have also considered MySQL, but I am wondering how it would perform in our use case.
The storage of the data is only the first part. After this, we should be able to query our database / event storage with complex queries like the following (and not only retrieve by indexed ID for example):
select all events where eventType is "AddedProductToCart" and timestamp > 2 weeks ago
-> should return all "AddedProductToCart" from 2 weeks ago until now
select all events where device.OS.name is "iOS" and device.OS.version is "7.1.1"
-> should return all events from iOS 7.1.1
etc.
We are expecting around 10 million events per month. This amounts to 3-4 writes per second on average, and probably more like 30-40 writes per second peak / worst case scenario. Storage should not really be an issue - total size per event will likely not exceed 1 or 2kb (this amounts to 1-2GB per 1 million events).
The querying part should preferably be done in PHP. DynamoDB, for example, has an SDK for PHP, which would certainly facilitate our development.
What would be our best solution for this? Writes should be blazing fast and querying should also be acceptable. In a nutshell, we're looking for a low-cost data store in which we can easily store our data and then retrieve it, queried not only by an index but also by event properties from the nested JSON.
Thanks for any suggestions, and if more information is required to properly answer this question, I'd be glad to supply more information.
Amazon's DynamoDB offers a fully managed (auto-scaling), durable, and predictable solution.
Judging by the amount of traffic and data you expect, DynamoDB’s free tier of 25 write/read capacity units and 25 GB covers your operations basically for free.
Each write capacity unit is equivalent to writing 1 KB of data, so if you're expecting 3-4 writes per second of 2 KB each, you need to provision 8 WCUs. In addition, DynamoDB's performance is extremely predictable, with fast single-digit-millisecond latency. For more information about the free tier, check out http://aws.amazon.com/dynamodb/pricing/.
In terms of your data set, for non-document objects querying is relatively simple with the use of global secondary indexes.
Here’s an example from the PHP SDK.
$twoWeeksAgo = date("Y-m-d H:i:s", strtotime("-14 days"));

// ComparisonOperator and Type are enum helper classes provided by the AWS SDK for PHP.
$response = $dynamoDB->query(array(
    "TableName" => <Table Name>, // placeholder: your table name
    "KeyConditions" => array(
        "EventType" => array(
            "ComparisonOperator" => ComparisonOperator::EQ,
            "AttributeValueList" => array(
                array(Type::STRING => "AddedProductToCart")
            )
        ),
        "Timestamp" => array(
            "ComparisonOperator" => ComparisonOperator::GE,
            "AttributeValueList" => array(
                array(Type::STRING => $twoWeeksAgo)
            )
        )
    )
));
You can query "Device.OS.Name" and "Device.OS.Version" via a scan, but there are a couple of optimizations you should consider based on what kind of queries you want to make.
If you're looking to make ad-hoc queries, you can make a parallel scan call and then apply a ScanFilter using a ConditionalExpression on your nested attributes. By parallelizing your scan, you optimize the consumption of read capacity units on your table as well as the speed of the operation. For more information about parallel scan, check out http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#QueryAndScanParallelScan.
Alternatively, if there are specific attributes you want to query, consider making some of those fields top-level attributes or moving them into their own separate table, flattening the necessary attributes (e.g. os.name to osname), and keeping a back reference to your original item (this mainly applies to your document-like attributes such as "device"). By doing this, you can add indexes on top of these attributes and query them quickly and efficiently. Additionally, with the pre-announcement of online indexing, you should soon be able to add and remove indexes as necessary to meet your requirements.
If you would like to discuss this in further detail or ask questions in general about using DynamoDB, feel free to reach out to me by private message.
Thanks
MongoDB is a good bet here. It can handle the writes per second easily (the mongod on my laptop sees more action than that).
The queries you mentioned are basic ones. For example:
db.collection.find({"device.OS.name":"iOS","device.OS.version":"7.1.1"})
and (shortened for readability)
db.collection.find({"eventType":"AddedProductToCart",timestamp:{$gte: ISODate(iso8601String)}})
With the indices set correctly, those should be lightning fast. You can even use TTL indices to automatically remove events older than a certain age.
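Since your querying layer will be PHP, here is a sketch with the (legacy) Mongo PHP driver of indexes backing those queries, including a TTL index (the 90-day window and names are assumptions):

$m = new MongoClient();
$events = $m->selectDB('tracking')->selectCollection('events');

// TTL index: documents expire roughly 90 days after their "timestamp" value.
// The field must hold a real BSON date (ISODate) for this to work.
$events->ensureIndex(
    array('timestamp' => 1),
    array('expireAfterSeconds' => 90 * 24 * 3600)
);

// Compound index to back the eventType + timestamp query shown above.
$events->ensureIndex(array('eventType' => 1, 'timestamp' => -1));

// Index for the device.OS queries.
$events->ensureIndex(array('device.OS.name' => 1, 'device.OS.version' => 1));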
For data analysis, you have both map/reduce and MongoDB's extremely powerful aggregation framework.
Let's come to the downsides. While scaling is relatively easy with MongoDB, for some reason people assume that a replicated, sharded cluster with automatic distribution of data is as easy to manage as the rest of MongoDB. The key word is relatively: it is easy compared to replicated data partitioning with MySQL or (Lord help us) Oracle, but it still has some pitfalls.
Point-in-time recoveries in a sharded environment without the use of MMS are possible, but you really have to know what you are doing, since syncing the individual backups of the shards is quite tricky.
No matter which database you choose, I strongly advise getting in touch with a specialist for it. Production data is essential, and no database holding it should be planned and maintained by non-specialists.
Brief overview of my use case: consider a database (most probably MongoDB) with a million entries. The value of each entry needs to be updated every day by calling an API. How would you design such a cron job? I know Facebook does something similar. The only thing I can think of is to have multiple jobs that divide the database entries into batches, with each job updating one batch. I am certain there are smarter solutions out there. I am also not sure what technology to use. Any advice is appreciated.
-Karan
Given the updated question context of "keeping the caches warm", a strategy of touching all of your database documents would likely diminish rather than improve performance unless that data will comfortably fit into available memory.
Caching in MongoDB relies on the operating system behaviour for file system cache, which typically frees cache by following a Least Recently Used (LRU) approach. This means that over time, the working data set in memory should naturally be the "warm" data.
If you force data to be read into memory, you could be loading documents that are rarely (or never) accessed by end users .. potentially at the expense of data that may actually be requested more frequently by the application users.
There is a use case for "prewarming" the cache .. for example when you restart a MongoDB server and want to load data or indexes into memory.
In MongoDB 2.2, you can use the new touch command for this purpose.
Other strategies for prewarming are essentially doing reverse optimization with an explain(). Instead of trying to minimize the number of index entries (nscanned) and documents (nscannedObjects), you would write a query that intentionally will maximize these entries.
With your API response time goal .. even if someone's initial call required their data to be fetched into memory, that should still be a reasonably quick indexed retrieval. A goal of 3 to 4 seconds response seems generous unless your application has a lot of processing overhead: the default "slow" query value in MongoDB is 100ms.
From a technical standpoint, you can write scripts for the MongoDB shell and execute them via cron. If you schedule cron to run a command like:
./mongo server:27017/dbname --quiet my_commands.js
MongoDB will execute the contents of the my_commands.js script. Now, for an overly simple example just to illustrate the concept: if you wanted to find a person named sara and set an attribute on her record (yes, an unrealistic example), you could put the following in your .js script file.
person = db.people.findOne( { name : "sara" } );
person.validated = "true";
db.people.save( person );
Then every time the cron runs, that record will be updated. Now add a loop and a call to your API, and you might have a solution. More information on these commands and examples can be found in the MongoDB docs.
However, from a design perspective, are you sure you need to update every single record each night? Is there a way to identify a more reasonable subset of records that need to be processed? Or possibly can the api be called on the data as it's retrieved and served to whomever is going to consume it?
I'm going to try to make this as brief as possible while covering all points - I work as a PHP/MySQL developer currently. I have a mobile app idea with a friend and we're going to start developing it.
I'm not saying it's going to be fantastic, but if it catches on, we're going to have a LOT of data.
For example, we'd have "clients," for lack of a better term, who would have anywhere from 100-250,000 "products" listed. Assuming the best, we could have hundreds of clients.
The client would edit data through a web interface, the mobile interface would just make calls to the web server and return JSON (probably).
I'm a lowly CMS-developing kind of guy, so I'm not sure how to handle this. My question is more or less about performance; the most rows I've ever seen in a MySQL table was 340k, and it was already sort of slow (granted, it wasn't the best server either).
I just can't fathom a table with 40 million rows (and potential to continually grow) running well.
My plan was to have a "core" database that holds the name of each client's "real" database, so when a user comes in and tries to access a client's data, the application would go to the core database and figure out which database to get the information from.
I'm not concerned with data separation or data security (it's not private information)
Yes, it's possible, and my company does it. I'm certainly not going to say it's smart, though. We have a SaaS marketing automation system. Some clients' databases have 1 million+ records. We also deal with a second "common" database that has a "fulfillment" table tracking emails, letters, phone calls, etc. with over 4 million records, plus numerous other very large shared tables. With proper indexing, optimizing, maintaining a separate DB-only server, and possibly clustering (which we don't yet have to do), you can handle a LOT of data. In many cases, those who think MySQL can only handle a few hundred thousand records work on a competing product for a living. If you still doubt whether it's valid, consider that per MySQL's clustering metrics, an 8-server cluster can handle 2.5 million updates PER SECOND. Not too shabby at all.
The problem with using two databases is juggling multiple connections. Is it tough? No, not really. You create different objects and reference your connection classes based on which database you want. In our case, we hit the main database's company class to deduce the client DB name and then build the second connection based on that. But when you're juggling those connections back and forth, you can run into errors that require extra debugging. It's not just "Is my query valid?" but "Am I actually getting the correct database connection?" In our case, a dropped session can cause all sorts of PDO errors to fire, because the system can no longer keep track of which client database to access. Plus, from a maintainability standpoint, it's a scary process trying to push table structure updates to 100 different live databases. Yes, it can be automated, but one slip-up and you've knocked a LOT of people down and made a ton of extra work for yourself. Now, calculate the extra development and testing required to juggle connections and push updates... that will be your measure of whether it's worthwhile.
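To make the juggling concrete, here is a simplified sketch of the lookup-then-connect flow (schema and column names are assumptions):

$core = new PDO('mysql:host=db-server;dbname=core', 'user', 'pass');

// 1. Ask the core database which client database this account lives in.
$stmt = $core->prepare("SELECT db_name FROM clients WHERE client_id = ?");
$stmt->execute(array($clientId));
$dbName = $stmt->fetchColumn();

if ($dbName === false) {
    throw new RuntimeException("Unknown client: $clientId");
}

// 2. Open the second connection against that client's database.
$clientDb = new PDO("mysql:host=db-server;dbname=$dbName", 'user', 'pass');
$products = $clientDb->query("SELECT * FROM products LIMIT 50")->fetchAll(PDO::FETCH_ASSOC);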
My recommendation? Find a host that allows you to put two machines on the same local network. We chose Linode, but who you use is irrelevant. Start out with your dedicated database server, plan ahead to do clustering when it's necessary. Keep all your content in one DB, index and optimize religiously. Finally, find a REALLY good DB guy and treat him well. With that much data, a great DBA would be a must.