I use MongoDB, in which the data changes (is updated) frequently, roughly every minute.
The data is read from MongoDB through a third-party API application via HTTP. That API also aggregates the data before returning it, for example summing the last X days of views for page N.
The amount of data is constantly increasing (a few of these collections are between 6 GB and 14 GB), and in some cases there are delays of 2-7 seconds before the API returns the aggregated data. For a web application, that delay is too big.
I want to reduce these delays somehow.
Which models or architectures are typically used in situations like the one I described?
Maybe, first of all, I should drop the HTTP API idea and move all the API logic to the server side?
My own ideas and considerations:
Maybe there should be two separate data "processors":
1) The first "processor" does all the aggregation jobs and just writes the results to the second one.
2) The second "processor" just returns the data without any internal calculations or aggregations.
But there can also be a bottleneck when the first one writes to the second data store, and there has to be logic to reconcile new and old data, which also impacts performance.
That third-party application seems to be doing a bad job, so you should drop it. You can probably fix your problems by refactoring the data model or using better aggregation algorithms.
Pre-calculations
Using a batch processor and a real-time processor sounds like a good idea, but I think you won't need it yet (see below). If you still want to implement it, you should read about Lambda architecture, because it fixes some problems your approach might have.
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate precomputed views, while simultaneously using real-time stream processing to provide dynamic views. The two view outputs may be joined before presentation.
Data Model (6 rules of thumb)
You're saying that there are a lot of updates; that's a red flag when using MongoDB. Some kinds of updates can slow MongoDB down because of its distributed nature. For example, try inserting subdocuments instead of updating fields. But this isn't an exact science, so I can't help more without seeing the data model.
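For illustration only, a minimal sketch of that idea with the mongodb/mongodb PHP library; the pages collection and its fields are assumptions, not taken from the question:
require 'vendor/autoload.php'; // mongodb/mongodb library assumed
$collection = (new MongoDB\Client('mongodb://localhost:27017'))->mydb->pages;
// Rewriting a field in place, over and over, is the kind of update to watch out for:
$collection->updateOne(['page' => 'N'], ['$set' => ['viewsLastHour' => 123]]);
// Appending a small subdocument per event instead, and aggregating at read time:
$collection->updateOne(
    ['page' => 'N'],
    ['$push' => ['views' => ['ts' => new MongoDB\BSON\UTCDateTime(), 'count' => 1]]]
);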
Aggregation Framework
Databases are made for data, so move the aggregation into MongoDB. Map-reduce is slow in MongoDB, so use the Aggregation Framework instead.
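A minimal sketch of such a pipeline with the mongodb/mongodb PHP library, assuming a hypothetical pageviews collection that stores one small document per view event (page, ts, count) and a 30-day window:
require 'vendor/autoload.php'; // mongodb/mongodb library assumed
$collection = (new MongoDB\Client('mongodb://localhost:27017'))->mydb->pageviews;
// Sum the last 30 days of views for page "N" inside MongoDB,
// instead of shipping raw documents over HTTP and aggregating elsewhere.
$since  = new MongoDB\BSON\UTCDateTime((time() - 30 * 86400) * 1000);
$cursor = $collection->aggregate([
    ['$match' => ['page' => 'N', 'ts' => ['$gte' => $since]]],
    ['$group' => ['_id' => '$page', 'totalViews' => ['$sum' => '$count']]],
]);
foreach ($cursor as $doc) {
    echo $doc['_id'], ': ', $doc['totalViews'], PHP_EOL;
}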
Related
The application I am working on needs to obtain a dataset of around 10 MB at most twice an hour. We use that dataset to display paginated results on the site; a simple search by one of the object properties should also be possible.
Currently we are thinking about two different ways to implement this:
1.) Store the JSON dataset in the database or in a file on the file system, read it, and loop over it to display results whenever we need to.
2.) Store the JSON dataset in a relational MySQL table, query the results, and loop over them whenever we need to display them.
Replacing/refreshing the results has to be done multiple times per hour, as I said.
Both ways have cons. I am trying to choose the option that is the lesser evil overall. Reading 10 MB into memory is not a lot, but on the other hand rewriting a table a few times an hour could produce conflicts, in my opinion.
My concern regarding 1.) is how safe the app will be if we read 10 MB into memory all the time. What will happen if multiple users do this at the same time? Is this something to worry about, or can PHP handle it in the background?
What do you think it will be best for this use case?
Thanks!
When PHP runs on a web server (as it usually does), the server starts new PHP processes on demand to handle concurrent requests. A powerful web server may allow fifty or so PHP processes. If each of them is handling this large data set, you'll need enough RAM for fifty copies. And you'll need to load that data somehow for each new request. Reading 10 MB from a file is not an overwhelming burden unless you have some sort of parsing to do, but it is a burden.
As it starts to handle each request, PHP offers a clean context to the programming environment. PHP is not good at maintaining in-RAM context from one request to the next. You may be able to figure out how to do it, but it's a dodgy solution. If you're running on a server that's shared with other web applications -- especially applications you don't trust -- you should not attempt it; the other applications would have access to your in-RAM data.
You can control the number of concurrent processes with Apache or nginx configuration settings and restrict it to five or ten copies of PHP. But if you have a lot of incoming requests, those requests get serialized and they will slow down.
Will this application need to scale up? Will you eventually need a pool of web servers to handle all your requests? If so, the in-RAM solution looks worse.
Does your JSON data look like a big array of objects? Do most of the objects in that array have the same elements as each other? If so, it maps naturally onto a SQL table: you can make a table in which the columns correspond to the elements of your objects. Then you can use SQL to avoid touching every row -- every element of the array -- every time you display or update data (a sketch follows below).
(The same sort of logic applies to Mongo, Redis, and other ways of storing your data.)
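As a rough sketch (the items table and its columns are made up for illustration), the per-request work then becomes a small indexed query instead of decoding the whole 10 MB blob:
// Hypothetical schema: items(id, name, price, updated_at), with an index on name.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass', [
    PDO::ATTR_ERRMODE          => PDO::ERRMODE_EXCEPTION,
    PDO::ATTR_EMULATE_PREPARES => false,
]);
// Paginated listing: only the requested page is read, not the whole dataset.
$stmt = $pdo->prepare('SELECT id, name, price FROM items ORDER BY id LIMIT :limit OFFSET :offset');
$stmt->bindValue(':limit', 25, PDO::PARAM_INT);
$stmt->bindValue(':offset', 0, PDO::PARAM_INT);
$stmt->execute();
$page = $stmt->fetchAll(PDO::FETCH_ASSOC);
// Simple search on one property, again without touching every row.
$search = $pdo->prepare('SELECT id, name, price FROM items WHERE name LIKE :q LIMIT 25');
$search->execute([':q' => 'foo%']);
$results = $search->fetchAll(PDO::FETCH_ASSOC);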
Aim
To synchronize my elasticsearch server with new and expired data in my SQL database
Issue
There are two very different ways I can achieve this and I don't know which is better. I can either pull data into Elasticsearch over a direct connection to the SQL database using the JDBC river plugin, or I can push data to Elasticsearch from the PHP client, using the code shown below as an example:
// Set up the Elastica client and type (index and type names assumed)
$client = new \Elastica\Client();
$elasticaType = $client->getIndex('twitter')->getType('tweet');
// The Id of the document
$id = 1;
// Create a document
$tweet = array(
'id' => $id,
'user' => array(
'name' => 'mewantcookie',
'fullName' => 'Cookie Monster'
),
'msg' => 'Me wish there were expression for cookies like there is for apples. "A cookie a day make the doctor diagnose you with diabetes" not catchy.',
'tstamp' => '1238081389',
'location'=> '41.12,-71.34',
'_boost' => 1.0
);
// First parameter is the id of document.
$tweetDocument = new \Elastica\Document($id, $tweet);
// Add tweet to type
$elasticaType->addDocument($tweetDocument);
// Refresh Index
$elasticaType->getIndex()->refresh();
I was going to have a cron job run every thirty minutes to check for items in my database that have an "active" flag but do not yet have an "indexed" flag, which means I need to add them to the index.
QUESTION
Seeing as I have two methods to synchronize data between Elasticsearch and MySQL, what are the advantages and disadvantages of each option? Is there a specific use case that favours one over the other?
I would use the river method, even though an in-house solution might be more customizable.
On one side, the jdbc-river plugin is already built and has around 20 contributors so far, so you kind of have an extra team improving that tool as Elasticsearch itself improves.
All you have to do is install it; you don't even need a complex configuration to set up a river between your cluster and your relational database.
Another advantage of the jdbc-river solution is that you don't need to deal with memory management. The plugin can operate as a river in "pull mode" or as a feeder in "push mode". In feeder mode, the plugin runs in a separate JVM and can connect to a remote Elasticsearch cluster. I personally prefer the river mode, because in that case Elasticsearch deals with the indexing and memory-management issues.
The relational data is internally transformed into structured JSON objects for the schema-less indexing model of Elasticsearch documents.
Both ends are scalable. The plugin can fetch data from different RDBMS sources in parallel, and multithreaded bulk mode ensures high throughput when indexing into Elasticsearch.
One of the drawbacks of this solution is that it doesn't notify you when it's done indexing. As a workaround, I suggest using the Count API to compare the results on both sides (sketched below).
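A minimal sketch of that comparison (index, type, and table names are placeholders); it simply compares the row count in MySQL with the document count returned by the Elasticsearch _count endpoint:
// Row count on the SQL side.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$sqlCount = (int) $pdo->query('SELECT COUNT(*) FROM tweets')->fetchColumn();
// Document count on the Elasticsearch side via the _count endpoint.
$ch = curl_init('http://localhost:9200/twitter/tweet/_count');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$esCount = (int) json_decode(curl_exec($ch), true)['count'];
curl_close($ch);
if ($sqlCount !== $esCount) {
    // Indexing is not finished yet (or documents were missed); check again later.
}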
Another drawback of the river is that it doesn't pull on update, only on insert or delete. I'm referring, of course, to the SQL actions UPDATE, INSERT and DELETE.
On the other hand, your own solution brings some advantages and drawbacks you might want to consider.
Your solution is highly customizable, so you can manage your scripts however you want. But considering the current state of the available PHP Elasticsearch clients (the official elasticsearch-php client, Elastica, or FOSElasticaBundle), and even though the guys are doing a great job on them, these APIs are still not very mature for working at that level compared to the official Elasticsearch Java API used by the river.
You should also be prepared to handle all the errors your cluster can throw at you: memory issues, management, performance, etc.
Example: I built a proof of concept using the Elastica API to push data from my database to my cluster, on a test machine with 32 GB of RAM and 8 cores at 2.05 GHz each, without getting into much detail. It took 5 hours to push 10M records from the database to the cluster, whereas with the river it takes 20 minutes for the same records. Of course my code could probably be optimized, but I judged that it would cost more time than it would save me.
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
NB: Of course there may be other points you might want to consider, but this subject is too long to discuss fully here, so I picked out some points I found essential for you to be aware of.
If you forget for a moment that you need to import initial data into Elasticsearch, I would use an event system to push data to Elasticsearch. This is more efficient in the long run.
Your application knows exactly when something needs to be indexed by Elasticsearch. To take your tweet example, at some point a new tweet will enter your application (a user writes one for example). This would trigger a newTweet event. You have a listener in place that will listen to that event, and store the tweet in Elasticsearch whenever such an event is dispatched.
If you don't want to use resources/time in the web request to do this (and you definitely don't want to do this), the listener could add a job to a queue (Gearman or Beanstalkd for example). You would then need a worker that will pick that job up and store the tweet in Elasticsearch.
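As a rough sketch of that listener/worker split (the QueueClient interface and tube name are hypothetical; only the Elastica calls mirror the example above):
// Hypothetical thin queue interface; in practice this would wrap Gearman or Beanstalkd.
interface QueueClient
{
    public function put($tube, $payload);
    public function reserve($tube);   // returns a job object exposing ->payload, or null
    public function delete($job);
}
// Listener side: runs inside the web request and only enqueues a small job.
function onNewTweet(array $tweet, QueueClient $queue)
{
    $queue->put('index-tweet', json_encode($tweet));
}
// Worker side: a long-lived CLI process that does the actual indexing.
function runIndexWorker(QueueClient $queue, \Elastica\Type $elasticaType)
{
    while ($job = $queue->reserve('index-tweet')) {
        $tweet = json_decode($job->payload, true);
        $elasticaType->addDocument(new \Elastica\Document($tweet['id'], $tweet));
        $queue->delete($job);
    }
}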
The main advantage is that Elasticsearch is kept up to date closer to real time. You won't need a cron job that would introduce a delay. You'll (mostly) handle a single document at a time. And you won't need to bother the SQL database to find out what needs to be (re)indexed.
Another advantage is that you can easily scale when the amount of events/data gets out of hand. When Elasticsearch itself needs more power, add servers to the cluster. When the worker can't handle the load, simply add more of them (and place them on dedicated machines). Plus your webserver(s) and SQL database(s) won't feel a thing.
I would use the river method.
Advantages of the river:
Already built. Just download it, set your configurations and everything is done.
Tested. The river has been used by several people and thus mistakes have been fixed.
Customizable. You can set the duration between the runs, define a sql-statement for getting new data, etc.
Advantages of your solution:
Highly customizable, you can do with your script whatever you want.
Disadvantages of your solution:
Needs special flags
Prone to errors (since it has not been tested for as long)
...
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
I'm currently designing and developing a web application that has the potential to grow very large at a fast rate. I will give some general information and move on to my question(s). I would say I am a mid-level web programmer.
Here are some specifications:
MySQL - Database Backend
PHP - Used in front/backend. Also used for SOAP Client
HTML, CSS, JS, jQuery - Front end widgets (highcharts, datatables, jquery-ui, etc.)
I can't get into too many fine details as it is a company project, but the main objective is to construct a dashboard that thousands of users will be accessing from various devices.
The data for this project is projected to grow by 50,000 items per year ( ~1000 items per week ).
1 item = 1 row in database
An item will also record a daily history, starting on the day it was inserted.
1 day of history per item = 1 record
365 records per year per item
365 * 50,000 = ~18,250,000 records [first year]
multiply ~18,250,000 records by x for each year after.
(My formula is a bit off since items will be added periodically throughout the year.)
All items and history are accessed through a SOAP Client that connects to an API service, then writes the record to the database.
The majority of this data will be read and remain static (read only), but some item data may be updated or changed. The data will also be updated each day, which writes another batch of history records.
Questions:
1) Is MySQL a good solution to handle these data requirements? ~100 million records at some point.
2) I am limited to synchronous calls with my PHP Soap Client (as far as I know). This is becoming time consuming as more items are being extracted. Is there a better option for writing a SOAP Client so that I can send asynchronous requests without waiting for a response?
3) Are there any other requirements I should be thinking about?
The difficulty involved in scaling is almost always a function of users times data. If you have a lot of users, but not much data, it's not hard to scale. A typical example is a popular blog. Likewise, if you have a lot of data but not very many users, you're also going to be fine. This represents things like accounting systems or data-warehouse situations.
The first step towards any solution is to rough in a schema and test it at scale. You will have no idea how your application is going to perform until you run it through the paces. No two applications ever have exactly the same problems. Most of the time you'll need to adjust your schema, de-normalize some data, or cache things more aggressively, but these are just techniques and there's no standard cookbook for scaling.
In your specific case you won't have many problems if the rate of INSERT activity is low and your indexes aren't too complicated. What you'll probably end up doing is splitting out those hundreds of millions of rows into several identical tables each with a much smaller set of records in them.
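One way to get a similar effect without hand-rolling multiple tables is MySQL's native RANGE partitioning; a rough sketch (table and column names below are hypothetical):
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
// History rows are split by year, so queries that filter on the date
// only have to scan the partitions for the years they touch.
$pdo->exec("
    CREATE TABLE item_history (
        item_id     INT NOT NULL,
        recorded_at DATE NOT NULL,
        value       DECIMAL(12,2),
        PRIMARY KEY (item_id, recorded_at)
    )
    PARTITION BY RANGE (YEAR(recorded_at)) (
        PARTITION p2014 VALUES LESS THAN (2015),
        PARTITION p2015 VALUES LESS THAN (2016),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )
");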
If you're having trouble getting your queries to execute, consider the standard approach: index, optimize, then denormalize, then cache.
Where PHP can't cut it, consider using something like Python, Ruby, Java/Scala or even NodeJS to help facilitate your database calls. If you're writing a SOAP interface, you have many options.
1) Is MySQL a good solution to handle these data requirements? ~100 million records at some point.
Absolutely. Make sure you've got everything indexed properly, and if you hit a storage or query-per-second limit, you've got plenty of options that apply to most/all DBMS's. You can get beefier hardware, start sharding data across servers, clustering, etc..
2) I am limited to synchronous calls with my PHP Soap Client (as far as I know). This is becoming time consuming as more items are being extracted. Is there a better option for writing a SOAP Client so that I can send asynchronous requests without waiting for a response?
PHP 5+ allows you to execute multiple requests in parallel with cURL. Refer to the curl_multi_* functions for this, such as curl_multi_exec(). As far as I know, this requires you to handle the SOAP/XML processing separately from the requests (a sketch follows below).
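A rough sketch of that pattern (the endpoint URL and SOAP envelopes are placeholders; the responses still have to be parsed as SOAP afterwards):
// One pre-built SOAP envelope (XML string) per request; placeholders here.
$envelopes = [
    '<soapenv:Envelope>...</soapenv:Envelope>',
    '<soapenv:Envelope>...</soapenv:Envelope>',
];
$multi = curl_multi_init();
$handles = [];
foreach ($envelopes as $i => $xml) {
    $ch = curl_init('https://api.example.com/soap-endpoint');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $xml,
        CURLOPT_HTTPHEADER     => ['Content-Type: text/xml; charset=utf-8'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_multi_add_handle($multi, $ch);
    $handles[$i] = $ch;
}
// Drive all transfers until they are finished.
do {
    curl_multi_exec($multi, $running);
    curl_multi_select($multi);
} while ($running > 0);
// Collect the raw XML responses; SOAP parsing happens separately.
$responses = [];
foreach ($handles as $i => $ch) {
    $responses[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);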
3) Are there any other requirements I should be thinking about?
Probably. But, you're usually on the right track if you start with a properly indexed, normalized database, for which you've thought about your objects at least mostly correctly. Start denormalizing if/when you find instances wherein denormalization solves an existing or obvious near-future efficiency problem. But, don't optimize for things that could become problems if the moons of Saturn align. Only optimize for problems that users will notice somewhat regularly.
When talking about a large-scale app, not all the effort and credit should go to the database alone. It is the core part, since the data is the main thing in any web application, but alongside that your application also depends on code optimization, which includes your backend and frontend scripts, your images, and above all your server. There are many factors affecting the application.
Brief overview of my use case: consider a database (most probably MongoDB) with a million entries. The value of each entry needs to be updated every day by calling an API. How do you design such a cron job? I know Facebook does something similar. The only thing I can think of is to have multiple jobs that divide the database entries into batches, with each job updating one batch. I am certain there are smarter solutions out there. I am also not sure what technology to use. Any advice is appreciated.
-Karan
Given the updated question context of "keeping the caches warm", a strategy of touching all of your database documents would likely diminish rather than improve performance unless that data will comfortably fit into available memory.
Caching in MongoDB relies on the operating system behaviour for file system cache, which typically frees cache by following a Least Recently Used (LRU) approach. This means that over time, the working data set in memory should naturally be the "warm" data.
If you force data to be read into memory, you could be loading documents that are rarely (or never) accessed by end users .. potentially at the expense of data that may actually be requested more frequently by the application users.
There is a use case for "prewarming" the cache .. for example when you restart a MongoDB server and want to load data or indexes into memory.
In MongoDB 2.2, you can use the new touch command for this purpose.
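A minimal sketch of issuing that command from PHP with the mongodb/mongodb library (database and collection names are placeholders; note that the touch command only exists on older MongoDB servers and was removed in later releases):
require 'vendor/autoload.php'; // mongodb/mongodb library assumed
$db = (new MongoDB\Client('mongodb://localhost:27017'))->selectDatabase('mydb');
// Ask the server to load the collection's documents and indexes into memory.
$db->command([
    'touch' => 'pageviews',
    'data'  => true,
    'index' => true,
]);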
Other strategies for prewarming are essentially doing reverse optimization with an explain(). Instead of trying to minimize the number of index entries (nscanned) and documents (nscannedObjects), you would write a query that intentionally will maximize these entries.
With your API response time goal .. even if someone's initial call required their data to be fetched into memory, that should still be a reasonably quick indexed retrieval. A goal of 3 to 4 seconds response seems generous unless your application has a lot of processing overhead: the default "slow" query value in MongoDB is 100ms.
From a technical standpoint, you can write scripts for the mongo shell and execute them via cron. If you schedule cron to run a command like:
./mongo server:27017/dbname --quiet my_commands.js
MongoDB will execute the contents of the my_commands.js script. Now, for an overly simple example just to illustrate the concept: if you wanted to find a person named sara and add an attribute (yes, an unrealistic example), you could put the following in your .js script file:
person = db.people.findOne( { name : "sara" } );
person.validated = "true";
db.people.save( person );
Then every time the cron job runs, that record will be updated. Now add a loop and a call to your API, and you might have a solution (a sketch follows below). More information on these commands and examples can be found in the MongoDB docs.
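A rough sketch of that idea as a PHP CLI script run from cron (the collection name, batch size, and the fetch_new_value() API call are hypothetical), processing the entries in batches with bulk writes:
require 'vendor/autoload.php'; // mongodb/mongodb library assumed
$collection = (new MongoDB\Client('mongodb://localhost:27017'))->mydb->entries;
$ops = [];
foreach ($collection->find([], ['batchSize' => 1000]) as $doc) {
    // fetch_new_value() stands in for the external API call per entry.
    $newValue = fetch_new_value($doc['_id']);
    $ops[] = ['updateOne' => [
        ['_id' => $doc['_id']],
        ['$set' => ['value' => $newValue, 'updatedAt' => new MongoDB\BSON\UTCDateTime()]],
    ]];
    // Flush updates in batches of 1000 instead of one round trip per document.
    if (count($ops) === 1000) {
        $collection->bulkWrite($ops);
        $ops = [];
    }
}
if ($ops) {
    $collection->bulkWrite($ops);
}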
However, from a design perspective, are you sure you need to update every single record each night? Is there a way to identify a more reasonable subset of records that need to be processed? Or possibly can the api be called on the data as it's retrieved and served to whomever is going to consume it?
We are building a social website using PHP (Zend Framework), MySQL, server running Apache.
There is a requirement where, on the dashboard, the application will fetch data for different events (there are about 12 events) from which the user's dashboard is updated. We expect the total number of users to be around 500k to 700k, and on average about 20% of users to be online at any one time (at peak times we expect 50% of users to be online).
So the problem is that, as per our current design, the event data will be placed in a MySQL database. I don't think running a few hundred thousand queries concurrently on MySQL would be a good idea, even if we use Amazon RDS. So we are considering using DynamoDB (or Redis or another NoSQL option) alongside MySQL.
So the question is: would having data in both MySQL and a NoSQL database give us the scalability we need for our web application, or should we consider another solution?
Thanks.
You do not need to duplicate your data. One option is to use the ElastiCache service that Amazon provides to give yourself in-memory caching. This will get rid of your database calls and in a sense remove that bottleneck, but it can be very expensive. If you can sacrifice real-time updates, then you can get away with just slowing down the requests or caching data locally for the user. Say, cache the next N events on the browser if possible and display them instead of making another request to the servers.
If it has to be real time, then look at ElastiCache and tune how many nodes you require to handle your estimated amount of traffic. There is no point in duplicating your data. Keep it in a single DB if it makes sense to keep it there. If you have some relational information that you need and also a variable-schema part of the system, then you can use both databases, but not to load-balance them against each other.
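As a rough sketch of the read-through caching idea (the cache key, TTL, and the loadEventsFromMysql() helper are hypothetical), using the phpredis client against a Redis-compatible cache such as ElastiCache:
$redis = new Redis();
$redis->connect('my-cache.example.com', 6379); // ElastiCache endpoint (placeholder)
function getDashboardEvents(Redis $redis, $userId)
{
    $key = "dashboard:events:$userId";
    // Serve from cache when possible so MySQL is not hit on every dashboard load.
    $cached = $redis->get($key);
    if ($cached !== false) {
        return json_decode($cached, true);
    }
    // Cache miss: load from MySQL once, then cache for 60 seconds.
    $events = loadEventsFromMysql($userId); // hypothetical helper around the MySQL query
    $redis->setex($key, 60, json_encode($events));
    return $events;
}
$events = getDashboardEvents($redis, 42);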
I would also start thinking about the bottlenecks in your architecture and about how well your application will/can scale if you reach your estimated numbers.
I agree with @sean; there's no need to duplicate the database. Have you thought about something with auto-scalability, like Xeround? A solution like that can scale out automatically across several nodes when you have throughput peaks and later scale back in, so you don't have to commit to a larger, more expensive instance just because of seasonal peaks.
Additionally, if I understand correctly, no code changes are required for this auto-scalability. So, I’d say that unless you need to duplicate your data on both MySQL and NoSQL DB’s for reasons other than scalability-related issues, go for a single DB with auto-scaling.