I'm logging a lot of information from 8 machines into a sharded, clustered MongoDB. It grows by about 500k documents per day across 3 collections, which is about 1 GB/day.
My structure is:
1 VPS 512 MB RAM Ubuntu // shardsrvr, configsrvr and router
1 VPS 512 MB RAM Ubuntu // shardsrvr, configsrvr
1 VPS 8 GB RAM Ubuntu // shardsrvr, configsrvr // primary for all collections
For now no collection has sharding enabled and none has a replica set; I just installed the cluster.
So now I need to run queries across all these documents and collections to get different statistics. This means many WHEREs, counts, etc...
The first test I made was looping over all documents in one collection with PHP and printing the ID. This crashed the primary shard server.
Then I tried some other tests limiting queries to 5k documents at a time, and that works...
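For reference, this is roughly the batched loop that works; a minimal sketch assuming the legacy PHP Mongo driver, with placeholder database/collection names and fetching only the _id field:

<?php
// Batched iteration sketch: connect through the mongos router and walk the
// collection 5k documents at a time instead of one huge cursor.
$mongo      = new MongoClient('mongodb://router-host:27017');
$collection = $mongo->selectDB('logs')->selectCollection('machine_logs');

$batchSize = 5000;
$skip      = 0;
do {
    $cursor = $collection->find(array(), array('_id' => 1)) // project only the ID
                         ->skip($skip)
                         ->limit($batchSize);
    $seen = 0;
    foreach ($cursor as $doc) {
        echo $doc['_id'], PHP_EOL;
        $seen++;
    }
    $skip += $batchSize;
    // Note: skip() gets slower the deeper it goes; a range query on an
    // indexed field scales better for very large collections.
} while ($seen === $batchSize);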
My question is about a better way to deal with this structure.
Enable sharding for the collections?
Create replica sets?
Is PHP able to do this, or would Node.js be better?
The solution is probably going to depend on what you're hoping to accomplish long term and what types of operations you're trying to perform.
A replica set will only help you with redundancy and data availability. If you are planning on letting the data continue to grow long term, you may want to consider this as a disaster recovery solution.
Sharding, on the other hand, will provide you with horizontal scaling and should increase the speed of your queries. Since a query crashed your primary shard server, I'm guessing that the data it was attempting to process was too large for it to handle by itself. In this case, it sounds like sharding the collection being used would help, as it would spread the workload across multiple servers. You should also consider whether indexes would help make the queries more efficient.
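As a rough illustration only (your shard key needs to match your query pattern), enabling sharding for a collection comes down to two admin commands through mongos; a sketch with the legacy PHP driver and made-up database, collection, and key names:

<?php
// Hedged sketch: turn on sharding for a database and then for one collection.
// 'logs', 'machine_logs', 'machine_id' and 'ts' are placeholders.
$mongo = new MongoClient('mongodb://router-host:27017'); // must talk to mongos
$admin = $mongo->selectDB('admin');

// 1. Enable sharding for the database.
$admin->command(array('enableSharding' => 'logs'));

// 2. Shard the collection on a key your statistics queries actually filter on.
//    MongoDB needs an index on the shard key (it creates one if the collection is empty).
$admin->command(array(
    'shardCollection' => 'logs.machine_logs',
    'key'             => array('machine_id' => 1, 'ts' => 1),
));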
However, you should consider that sharding with your current setup would introduce more possible points of failure; if any one of the disks gets corrupted, then your entire data set is trashed.
In the end, it may come down to who is doing the heavy lifting, PHP or Mongo?
If you're just doing counts and returning large sets of documents for PHP to process, you might be able to handle performance issues by creating the proper indexes for your queries.
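For example, if your statistics queries filter on a machine ID and a time range, an index shaped like the query lets the counts run without scanning every document; a sketch with the legacy PHP driver and assumed field names:

<?php
// Hedged sketch: create an index that matches the query, then run a count.
// 'machine_id' and 'ts' are assumptions about the log schema.
$mongo      = new MongoClient();
$collection = $mongo->selectDB('logs')->selectCollection('machine_logs');

$collection->ensureIndex(array('machine_id' => 1, 'ts' => -1));

// A count that the index can answer without touching most documents:
$n = $collection->count(array(
    'machine_id' => 3,
    'ts'         => array('$gte' => new MongoDate(strtotime('-1 day'))),
));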
I am storing book metadata like name, authors, price, publisher, etc. in a MongoDB document. I have about 10 million of these documents and they are all in one collection. The average document size is 1.9 KB. I have indexes on name, authors, and price. In fact, I have 2 indexes on price: one in ascending order and one in descending order. My MongoDB version is 2.2.0 and I am using the PHP driver to query Mongo. The driver's version is 1.12. But when I do a range query on price I get a MongoCursorTimeoutException. In my query I am trying to find books in a certain price range, like "price less than 1000 and more than 500".
Increasing the timeout doesn't seem to be a good idea (it is already 30 seconds). Is there anything else that I can do to speed up the query process?
EDIT
Actually my price index is compound. I have a status field with an integer value, so my price indexes look like {price:-1,status:1} and {price:1,status:1}.
Also I am trying to retrieve 20 documents at a time with PHP.
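For clarity, the query I'm running looks roughly like this (a sketch; the real values and collection name differ):

<?php
// Roughly the range query that times out; values are illustrative.
$mongo = new Mongo();
$books = $mongo->selectDB('store')->selectCollection('books');

$cursor = $books->find(array(
        'price'  => array('$gt' => 500, '$lt' => 1000),
        'status' => 1,
    ))
    ->sort(array('price' => -1))  // matches the {price:-1, status:1} index order
    ->limit(20)
    ->timeout(30000);             // the current 30-second client-side timeout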
We have had a lot of experience with Mongo collections with millions of documents using both single/shared servers and dedicated replica sets on EC2 using both traditional and SSD EBS volumes. The workloads are varied: some are analytics-oriented and others are backing Web requests. Here is the root cause analysis path I'd recommend:
Run your queries with .explain() to see what's going on in terms of indexes used, etc. Adjust indexes if necessary. Mongo's optimizer is rather naive so if your indexes don't match the query pattern perfectly, they may be missed.
Check MMS and look for any of the following problems: (1) not all data in memory (indicated by page faults) and (2) queue lengths (typically indicating some type of bottleneck). Mongo's performance degrades rapidly when not all data is in memory, because the database has a single global lock, and touching storage, especially in the cloud, is bad news. We recently upgraded to SSD cloud storage and we are seeing 3-10x improvements in performance on a database that's about 1/2 TB in size.
Increase the profiling level to 2 (the max), run for a while and look at the operation log. See the MongoDB profiler.
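As a rough illustration of the explain and profiling steps above (legacy PHP driver; the query and thresholds are placeholders):

<?php
// Hedged sketch: check the query plan, then profile everything for a while.
$mongo = new Mongo();
$db    = $mongo->selectDB('store');

// 1. See which index the query uses and how many documents it scans.
$plan = $db->selectCollection('books')
           ->find(array('price' => array('$gt' => 500, '$lt' => 1000), 'status' => 1))
           ->sort(array('price' => -1))
           ->explain();
print_r($plan); // look at 'cursor', 'nscanned' and 'millis'

// 2. Profiling level 2 records every operation into system.profile.
$db->command(array('profile' => 2));
$slow = $db->selectCollection('system.profile')
           ->find(array('millis' => array('$gt' => 100)))
           ->sort(array('ts' => -1));
foreach ($slow as $op) {
    print_r($op);
}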
Hope this helps.
Check your indexes. Reindex your data, and make sure that the collection is fully indexed before running the queries. (10 million docs may take a while to index.)
The slowest part of any indexed query is the actual document retrieval. I could imagine that, depending on the number of documents you are pulling, this could take 30 seconds or more and use a lot of memory.
For more helpful instructions on some things you could try check out this page:
http://www.mongodb.org/display/DOCS/Optimization
For 10 million documents you might also think about sharding the data across computers. Remember that hard drive reads are slower than CPU cycles.
As @JohnyHK said, my RAM was too low. So I increased it to 12 GB and it works now. Thanks everyone for your comments and answers.
I see a lot of statements like: "Cassandra very fast on writes", "Cassandra has reads really slower than writes, but much faster than Mysql"
On my Windows 7 system:
I installed MySQL with the default configuration.
I installed PHP 5 with the default configuration.
I installed Cassandra with the default configuration.
Making a simple write test on MySQL: "INSERT INTO wp_test (id,title) VALUES ('id01','test')" gives me a result of 0.0002 s.
For 1000 inserts: 0.1106 s.
Making the same simple write test on Cassandra: $column_faily->insert('id01',array('title'=>'test')) gives me a result of 0.005 s.
For 1000 inserts: 1.047 s.
For read tests I also found that Cassandra is much slower than MySQL.
So the question: does it sound correct that I get 5 ms for one write operation on Cassandra? Or is something wrong and it should be more like 0.5 ms?
When people say "Cassandra is faster than MySQL", they mean when you are dealing with terabytes of data and many simultaneous users. Cassandra (and many distributed NoSQL databases) is optimized for hundreds of simultaneous readers and writers on many nodes, as opposed to MySQL (and other relational DBs), which are optimized to be really fast on a single node but tend to fall to pieces when you try to scale them across multiple nodes. There is a generalization of this trade-off, by the way: the absolute fastest disk I/O is plain old UNIX flat files, and many latency-sensitive financial applications use them for that reason.
If you are building the next Facebook, you want something like Cassandra because a single MySQL box is never going to stand up to the punishment of thousands of simultaneous reads and writes, whereas with Cassandra you can scale out to hundreds of data nodes and handle that load easily. See scaling up vs. scaling out.
Another use case is when you need to apply a lot of batch processing power to terabytes or petabytes of data. Cassandra or HBase are great because they are integrated with MapReduce, allowing you to run your processing on the data nodes. With MySQL, you'd need to extract the data and spray it out across a grid of processing nodes, which would consume a lot of network bandwidth and entail a lot of unneeded complication.
Cassandra benefits greatly from parallelisation and batching. Try doing 1 million inserts on each of 100 threads (each with their own connection and in batches of 100) and see which one is faster.
Finally, Cassandra insert performance should be relatively stable (maintaining high throughput for a very long time). With MySQL, you will find that it tails off rather dramatically once the B-trees used for the indexes grow too large to fit in memory.
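For the batching part, a hedged sketch of what "batches of 100" means, assuming the phpcassa client that the $column_faily->insert() call suggests (keyspace, column family and host are placeholders):

<?php
// Batch 1000 inserts into groups of 100 rows per round trip instead of 1000 RPCs.
require_once 'phpcassa/connection.php';
require_once 'phpcassa/columnfamily.php';

$pool = new ConnectionPool('Keyspace1', array('127.0.0.1:9160'));
$cf   = new ColumnFamily($pool, 'test_cf');

$batch = array();
for ($i = 0; $i < 1000; $i++) {
    $batch['id' . $i] = array('title' => 'test');
    if (count($batch) == 100) {
        $cf->batch_insert($batch); // one network round trip per 100 rows
        $batch = array();
    }
}
if ($batch) {
    $cf->batch_insert($batch);
}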
It's likely that the maturity of the MySQL drivers, especially the improved MySQL drivers in PHP 5.3, is having some impact on the tests. It's also entirely possible that the simplicity of the data in your query is impacting the results - maybe on 100 value inserts, Cassandra becomes faster.
Try the same test from the command line and see what the timestamps are, then try with varying numbers of values. You can't do a single test and base your decision on that.
Many user-space factors can impact write performance, such as:
Dozens of settings in each of the database server's configuration.
The table structure and settings.
The connection settings.
The query settings.
Are you swallowing warnings or exceptions? The MySQL sample would on face value be expected to produce a duplicate key error. It could be failing while doing nothing at all. What Cassandra might do in the same case isn't something I'm familiar with.
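To rule that out, make the loop fail loudly; a sketch with mysqli (using the wp_test table from the question, where id is presumably the primary key):

<?php
// Hedged sketch: throw on SQL errors instead of silently timing failed inserts.
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

$db = new mysqli('localhost', 'user', 'pass', 'test');

for ($i = 0; $i < 1000; $i++) {
    // Re-using 'id01' 1000 times would hit a duplicate-key error if id is the
    // primary key, so give every row its own id.
    $id = sprintf('id%04d', $i);
    $db->query("INSERT INTO wp_test (id, title) VALUES ('$id', 'test')");
}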
My limited experience with Cassandra tells me one thing about inserts: while the performance of everything else degrades as data grows, inserts appear to maintain the same speed. How fast it is compared to MySQL, however, isn't something I've tested.
It might not be so much that inserts are fast, but rather that Cassandra tries never to be slow. If you want a more meaningful test you need to incorporate concurrency and more variations on the scenario, such as large data sets, various batch sizes, etc. More complex tests might measure latency for availability of data after insert and read speed over time.
It would not surprise me if Cassandra's first port of call for inserting data is to put it on a queue or to simply append. This is configurable if you look at consistency level. MySQL similarly allows you to balance performance and reliability/availability though each will have variations on what they allow and don't allow.
Outside of that unless you get into the internals it may be hard to tell why one performs better than the other.
I did some benchmarks of a use case I had for Cassandra a while ago. For the benchmark it would insert tens of thousands of rows first. I had to make the script sleep for a few seconds because otherwise queries run after the fact would not see the data and the results would be inconsistent between implementations I was testing.
If you really want fast inserts, append to a file on ramdisk.
Replication
I have an app that is polling data from a large number of data feeds. It processes thousands of records per day, and this number is ever increasing. The data is stored in MySQL.
I then have a website that utilises this data.
I'm trying to build my environment with the future in mind.
I thought of MySQL replication, so that the website can use its own database on a different server and not get bogged down by the thousands of write commands that are happening on the main database.
I am having difficulty getting this set up, despite MySQL reporting it's all working fine.
I then started to think: is there not a better way?
From what I understand, MySQL sends the write commands from the master to the slave database.
Does this not mean that what I am trying to avoid is just happening anyway?
Does this mean that the slave database will suffer the same thousands of writes?
I am a one-man band, doing this venture with my own money, so I need to do this the cheapest way. I am getting a bit lost!
I have a dedicated server and a VPS, using PHP 5 and MySQL 5 in a LAMP stack.
I cannot begin to tell you how much I would appreciate some guidance!
If the slaves are a 1:1 clone of the master, then all writes to the master MUST be propagated down to the slaves. Otherwise replication would be useless.
Thousands of records per day is actually very small. Assuming the same processing time for each, and doing 5000 records, you'd have 86400/5000 = 17.28 seconds per record. That's very minimal write overhead.
If you were doing millions of records a day, THEN you'd have a write bottleneck.
I would split this into three layers.
Data Feed layer. Data read from the feeds is preprocessed and posted into a queue. This layer has a temporary queue that also serves as temporary storage, a buffer that allows every data feed to post its data (see the sketch after this list). I'd use a message queue system; it's fast and reliable.
Data Store layer. This layer reads from the queue, maybe processes the data it reads in some way, and stores the data in the database.
Data Analysis layer. This is your "slave" database. It's a data warehouse. It periodically does ETL (extract, transform and load) of data from the Data Store layer to this secondary database.
This layered approach allows you to isolate concerns (speed, reliability, security) from implementation details, and allows for future scalability.
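A minimal sketch of the first two layers, assuming a Redis list as the queue (any message queue system would do; queue, table and field names are made up):

<?php
// Data Feed layer (producer): push preprocessed feed records onto a queue.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$feedRecords = array(array('source' => 'feed-a', 'value' => 42)); // parsed feed data
foreach ($feedRecords as $record) {
    $redis->rPush('feed_queue', json_encode($record));
}

// Data Store layer (consumer): a separate long-running worker drains the queue
// and writes to MySQL at its own pace.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO feed_data (source, payload) VALUES (?, ?)');
while ($item = $redis->blPop(array('feed_queue'), 5)) { // block up to 5 seconds
    $record = json_decode($item[1], true);
    $stmt->execute(array($record['source'], $item[1]));
}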
Replication is literally what the word suggests: replicating queries on another machine.
MySQL creates a log filled with the queries that were used to create the dataset on the original machine (master) and sends it to the slave(s), which read the log and re-execute those queries.
Basically, what you want is to increase your write rate. That's achievable through using different engines; for example, TokuDB is one of them (it isn't free, but you are allowed to store 50 GB of user data for free and use it).
What you want (for the moment) is a fast disk subsystem more than a monolithic write-scalable storage system. InnoDB is capable of achieving a lot of queries per second on a properly configured machine with sufficient hardware. I am not sure about pricing, but an SSD and 4-8 GB of RAM shouldn't be that expensive. As Marc. B said, until you reach millions of records per day, you don't have to worry about scaling reads and writes through replication.
You say you have an app "polling" your data from data feeds. Does that mean you are doing full-text searches? I'm making an assumption here that you are batch processing data feeds and then querying them. If that is the case, I'd offload all your full-text queries to something like Solr. It actually isn't too time-consuming to set up; depending on the size of your DB you can get away with running it on a fairly small VPS or on your dedicated server, and best of all, the difference in search speed is incredible. I've had full-text MySQL queries that would take 20 minutes to run be done in Solr in under a second.
Just make sure you use a try statement in the event your Solr instance goes down.
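Something along these lines, assuming the PECL Solr extension (any client works; the point is the graceful fallback):

<?php
// Hedged sketch: full-text search against Solr, with a fallback if it is down.
$client = new SolrClient(array('hostname' => 'localhost', 'port' => 8983, 'path' => '/solr'));

try {
    $query = new SolrQuery('title:mysql'); // the full-text part of the search
    $query->setRows(20);
    $results = $client->query($query)->getResponse();
} catch (SolrClientException $e) {
    // Solr unreachable: degrade to a slower MySQL query or show
    // a "search temporarily unavailable" message instead of a fatal error.
    $results = null;
}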
I'm fairly familiar with most aspects of web development and I consider myself a junior level programmer. I'm always anxious when I think about application scaling and would like to learn a little more about it. Let's have a hypothetical situation.
I'm working on a web application that polls a device and fetches about 2 KB of XML data at 15-minute intervals. This data must be stored for A Very Long Time (at least a couple of years?). Now imagine that this web application has 100 users that each have this device.
After 10 years we're talking tens of millions of table rows. With 100 users, we have a cron job that queries each user's device, gets 2 KB of XML, and inserts it into the SQL database every 15 minutes.
Assuming my queries are relatively simple, only collecting the columns necessary, using joins, and avoiding subqueries, is there any reason this should not scale?
Inserting doesn't generally get slower as a table gets larger, but index updates may take longer. At some point you may want to split the table into two parts. One for archival storage, optimized for data retrieval (basically index the heck out of it), and a second table to handle the newer data, optimized more for insertion (fewer indexes).
But as always, the only way to tell for sure is to benchmark things. Set up some cloned tables with a few thousand rows, and some with multi-millions of rows, and see what happens.
You could always consider using partitioning to automagically split your data files by date, and age older records off to a slower, high-capacity disk array while keeping the newer records (and the INSERTs) on a high-speed array. Then your index builds will only have to work on a subset of the data rather than the whole deal, and should go quickly (disk I/O is typically the slowest part of a database system).
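A hedged sketch of what date partitioning could look like for the readings table described above (table and column names are assumptions):

<?php
// Range-partition by year so index maintenance and the INSERT hot spot stay on
// the newest partition; older partitions can live on slower storage.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->exec("
    CREATE TABLE device_readings (
        device_id INT      NOT NULL,
        taken_at  DATETIME NOT NULL,
        payload   TEXT,
        PRIMARY KEY (device_id, taken_at)
    )
    PARTITION BY RANGE (YEAR(taken_at)) (
        PARTITION p2012 VALUES LESS THAN (2013),
        PARTITION p2013 VALUES LESS THAN (2014),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )
");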
"Assuming my queries are relatively simple, only collecting the columns necessary, using joins, and avoiding subqueries, is there any reason this should not scale?"
When you get large you should put your active dataset in an in-memory data store (faster than disk), just like Facebook, Twitter, etc. do. Twitter became very slow when they did not keep the active dataset in memory / scale up; a lot of people saw the fail whale because of this. Both use memcached for this, but you could also use Redis (I like this) or APC if you are just on a single box. You should always install APC if you want performance, because APC is used for caching the compiled bytecode.
"Most PHP accelerators work by caching the compiled bytecode of PHP scripts to avoid the overhead of parsing and compiling source code on each request (some or all of which may never even be executed). To further improve performance, the cached code is stored in shared memory and directly executed from there, minimizing the amount of slow disk reads and memory copying at runtime."
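A minimal cache-aside sketch with Memcached (key scheme, table and TTL are placeholders):

<?php
// Keep hot rows in memory; only fall back to MySQL on a cache miss.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

function getUserProfile($userId, Memcached $cache, PDO $pdo) {
    $key     = "profile:$userId";
    $profile = $cache->get($key);
    if ($profile === false) {                 // miss: read from the database
        $stmt = $pdo->prepare('SELECT * FROM profiles WHERE user_id = ?');
        $stmt->execute(array($userId));
        $profile = $stmt->fetch(PDO::FETCH_ASSOC);
        $cache->set($key, $profile, 300);     // cache for 5 minutes
    }
    return $profile;
}

$profile = getUserProfile(42, $cache, $pdo);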
I made a social network. Testing it in WAMP shows almost 1500 SQL queries for a single person over a session of about 30 minutes and 50 page views!
[I'm not using Zend, APC, or memcached. The heaviest page loads within 0.25 seconds. Config: 512 MB RAM, AMD 1.81 GHz]
Q-> Is this OK, or do I need to reduce the number of SQL queries?
There are 2 tables, PARENT and CHILD.
Structure of PARENT table
PID [primary key]
...
...
Structure of CHILD table
ID [primary key]
PID
...
...
I've not used a foreign key, but deleting from PARENT also deletes from CHILD,
and I implemented this in PHP/SQL.
Q-> Is this OK, or should I go for a FOREIGN KEY for better performance?
In PHP I can configure how much memory PHP is going to eat.
Q-> Can I also do that with MySQL?
[I am using WAMP, and need to monitor the social network's performance under bottleneck conditions!]
No one can say whether an arbitrary number of SQL queries is OK:
It depends on the complexity of those queries
It depends on your database's structure (indexes, for instance, play a big role)
It depends on the amount of data you have
It depends on how many concurrent users you plan to have (with one user at a time, your application will probably be way faster than with 100 users at a given instant)
...
Basically: do some benchmarks, using tools such as ab / siege / JMeter, and see if your server can handle the load you expect to have in the next few weeks.
Using foreign keys generally doesn't help with performance (except if they force you to create indexes you'd need but wouldn't have created yourself): they add some extra work on the DB side.
But using foreign keys helps with data integrity, and having data that's OK is probably more important than a couple of milliseconds, especially if you are just launching your application (which means there could be quite a few bugs).
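For the PARENT/CHILD case in the question, the integrity part would look something like this (InnoDB assumed; column types are placeholders):

<?php
// Hedged sketch: let MySQL enforce "deleting a parent deletes its children"
// instead of doing it in PHP. Foreign keys require the InnoDB engine.
$pdo = new PDO('mysql:host=localhost;dbname=social', 'user', 'pass');
$pdo->exec("
    CREATE TABLE PARENT (
        PID INT PRIMARY KEY
    ) ENGINE=InnoDB
");
$pdo->exec("
    CREATE TABLE CHILD (
        ID  INT PRIMARY KEY,
        PID INT NOT NULL,
        FOREIGN KEY (PID) REFERENCES PARENT (PID) ON DELETE CASCADE
    ) ENGINE=InnoDB
");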
30 SQL queries per page is reasonable in general (actually it's quite low considering what some CMS do). On the other hand, with the information given, it is not possible to determine whether it is reasonable in your case.
Foreign keys do not improve performance. Foreign key constraints might. But they also put business logic into the persistence layer. It's an optimization.
Information about configuring the memory usage of MySQL can be found in the handbook, section 7.11.4.1, How MySQL Uses Memory.
I'd agree with Pascal and Oswald - esp. on testing with JMeter or similar to see if you really do have a problem.
I would also load up the database with a few million test profiles to see whether your queries slow down over time. This should help with optimizing query performance.
If your goal for tweaking MySQL is to introduce an artificial bottleneck to test the application, I'd be careful to extrapolate from those tests. What you see with bottlenecks is that they tend to be non-linear - everything is fine until you hit a bottleneck moment, and then everything becomes highly unpredictable. You may not recreate this simply by reducing the memory of the database server.
If there's any low-hanging fruit, I would reduce the number of SQL queries, but 30 queries per page is not excessive. If you want to be prepared to scale to Facebook levels, I don't think reducing the queries per page from 30 to 28 will help much - you need to be ready to partition the application across multiple databases, introduce caching, and buy more powerful hardware.