I see a lot of statements like: "Cassandra very fast on writes", "Cassandra has reads really slower than writes, but much faster than Mysql"
On my Windows 7 system:
I installed MySQL with the default configuration.
I installed PHP5 with the default configuration.
I installed Cassandra with the default configuration.
A simple write test on MySQL, "INSERT INTO wp_test (id,title) VALUES ('id01','test')", gives me a result of 0.0002 s.
For 1000 inserts: 0.1106 s.
The same simple write test on Cassandra, $column_faily->insert('id01',array('title'=>'test')), gives me a result of 0.005 s.
For 1000 inserts: 1.047 s.
In read tests I also found that Cassandra is much slower than MySQL.
So the question is: does it sound correct that a single write operation on Cassandra takes 5 ms? Or is something wrong, and should it be closer to 0.5 ms?
When people say "Cassandra is faster than MySQL", they mean when you are dealing with terabytes of data and many simultaneous users. Cassandra (and many distributed NoSQL databases) is optimized for hundreds of simultaneous readers and writers on many nodes, as opposed to MySQL (and other relational DBs) which are optimized to be really fast on a single node, but tend to fall to pieces when you try to scale them across multiple nodes. There is a generalization of this trade-off by the way- the absolute fastest disk I/O is plain old UNIX flat files, and many latency-sensitive financial applications use them for that reason.
If you are building the next Facebook, you want something like Cassandra because a single MySQL box is never going to stand up to the punishment of thousands of simultaneous reads and writes, whereas with Cassandra you can scale out to hundreds of data nodes and handle that load easily. See scaling up vs. scaling out.
Another use case is when you need to apply a lot of batch processing power to terabytes or petabytes of data. Cassandra or HBase are great because they are integrated with MapReduce, allowing you to run your processing on the data nodes. With MySQL, you'd need to extract the data and spray it out across a grid of processing nodes, which would consume a lot of network bandwidth and entail a lot of unneeded complication.
Cassandra benefits greatly from parallelisation and batching. Try doing 1 million inserts on each of 100 threads (each with its own connection, and in batches of 100) and see which one is faster.
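A minimal sketch of the shape such a test could take, in Python rather than PHP. The write_batch function is a hypothetical stand-in for a real batched write over a per-thread connection; it is not part of any Cassandra or MySQL driver, and the batch size and worker count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, batch_size):
    """Split rows into consecutive lists of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def write_batch(batch):
    """Hypothetical stand-in for one batched write.

    A real test would send the whole batch to the database here,
    using a connection owned by the calling thread.
    """
    return len(batch)


def parallel_insert(rows, batch_size=100, workers=8):
    """Send batches from several threads; return the number of rows written."""
    batches = chunked(rows, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(write_batch, batches))


rows = [("id%05d" % i, "test") for i in range(1000)]
written = parallel_insert(rows)
```

Timing parallel_insert against each database, with the same batch size and thread count, is what makes the comparison fair.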
Finally, Cassandra insert performance should be relatively stable (maintaining high throughput for a very long time). With MySQL, you will find that it tails off rather dramatically once the B-trees used for the indexes grow too large to fit in memory.
It's likely that the maturity of the MySQL drivers, especially the improved MySQL drivers in PHP 5.3, is having some impact on the tests. It's also entirely possible that the simplicity of the data in your query is impacting the results - maybe on 100 value inserts, Cassandra becomes faster.
Try the same test from the command line and see what the timestamps are, then try with varying numbers of values. You can't do a single test and base your decision on that.
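One way to vary the number of values instead of relying on a single run is a small timing harness. This is only a sketch in Python: fake_insert_many is a hypothetical placeholder to be replaced with real inserts against each database, and the sizes are arbitrary.

```python
import time


def time_call(fn, repeats=5):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(insert_many, sizes=(1, 10, 100, 1000)):
    """Time insert_many(n) for each n and report seconds per insert."""
    results = {}
    for n in sizes:
        results[n] = time_call(lambda: insert_many(n)) / n
    return results


def fake_insert_many(n):
    """Hypothetical placeholder: replace with n real inserts."""
    return [("id%d" % i, "test") for i in range(n)]


per_insert = benchmark(fake_insert_many)
```

Comparing the per-insert figures across sizes shows whether fixed per-call overhead (connection, driver, round trip) dominates the single-insert number.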
Many user space factors can impact write performance, such as:
Dozens of settings in each of the database server's configuration.
The table structure and settings.
The connection settings.
The query settings.
Are you swallowing warnings or exceptions? The MySQL sample would on face value be expected to produce a duplicate key error. It could be failing while doing nothing at all. What Cassandra might do in the same case isn't something I'm familiar with.
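To illustrate the duplicate key point, here is a sketch using Python's built-in sqlite3 as a stand-in for MySQL. It assumes id is a primary key, which the question does not state. Repeating the same INSERT 1000 times writes one row and fails 999 times, so a loop that ignores errors would mostly be timing failures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: id is the primary key of wp_test.
conn.execute("CREATE TABLE wp_test (id TEXT PRIMARY KEY, title TEXT)")

inserted = 0
rejected = 0
for _ in range(1000):
    try:
        conn.execute("INSERT INTO wp_test (id, title) VALUES ('id01', 'test')")
        inserted += 1
    except sqlite3.IntegrityError:
        # Duplicate primary key: the statement failed and wrote nothing.
        rejected += 1

row_count = conn.execute("SELECT COUNT(*) FROM wp_test").fetchone()[0]
print(inserted, rejected, row_count)
```

Generating a distinct id per iteration keeps the test measuring real writes.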
My limited experience of Cassandra tells me one thing about inserts: while the performance of everything else degrades as data grows, inserts appear to maintain the same speed. How fast it is compared to MySQL, however, isn't something I've tested.
It might not be so much that inserts are fast, but rather that Cassandra tries never to be slow. If you want a more meaningful test, you need to incorporate concurrency and more variations on the scenario, such as large data sets, various batch sizes, etc. More complex tests might measure how long data takes to become available after an insert, and read speed over time.
It would not surprise me if Cassandra's first port of call for inserting data is to put it on a queue or to simply append. This is configurable if you look at consistency level. MySQL similarly allows you to balance performance and reliability/availability though each will have variations on what they allow and don't allow.
Outside of that, unless you get into the internals, it may be hard to tell why one performs better than the other.
I did some benchmarks of a use case I had for Cassandra a while ago. For the benchmark it would insert tens of thousands of rows first. I had to make the script sleep for a few seconds because otherwise queries run after the fact would not see the data and the results would be inconsistent between implementations I was testing.
If you really want fast inserts, append to a file on ramdisk.
I'm logging a lot of information from 8 machines in a sharded MongoDB cluster. It is growing by about 500k documents each day across 3 collections, which is about 1 GB/day.
My structure is:
1 VPS 512mb RAM ubuntu // shardsrvr, configsrvr and router
1 VPS 512mb RAM ubuntu // shardsrvr, configsrvr
1 VPS 8gb RAM ubuntu // shardsrvr, configsrvr // primary for all collections
For now, no collection has sharding enabled and none has a replica set. I have only just installed the cluster.
So now I need to run queries over all these documents and collections to get different statistics. This means many wheres, counts, etc.
The first test I made was looping over all documents in one collection with PHP and printing the ID. This crashed the primary shard server.
Then I tried some other tests, limiting queries to 5k documents, and that works.
My question is about a better way to deal with this structure:
Enable sharding for the collections?
Create replica sets?
Is PHP able to do this? Would Node.js be better?
The solution is probably going to depend on what you're hoping to accomplish long term and what types of operations you're trying to perform.
A replica set will only help you with redundancy and data availability. If you are planning on letting the data continue to grow long term, you may want to consider this as a disaster recovery solution.
Sharding, on the other hand, will provide you with horizontal scaling and should increase the speed of your queries. Since a query crashed your primary shard server, I'm guessing that the data it was attempting to process was too large for it to handle by itself. In this case, it sounds like sharding the collection being used would help, as it would spread the workload across multiple servers. You should also consider whether indexes would help make the queries more efficient.
However, you should consider that sharding with your current setup would introduce more possible points of failure; if any one of the disks gets corrupted, your entire data set is trashed.
In the end, it may come down to who is doing the heavy lifting, PHP or Mongo?
If you're just doing counts and returning large sets of documents for PHP to process, you might be able to handle performance issues by creating the proper indexes for your queries.
I currently have 2000 records in a PostgreSQL database that are updated every minute and filtered with a SQL statement. Up to 1000 different filter combinations can exist, and approximately 500 different filters can be called every minute. At the moment, HTTP responses are cached for 59 seconds to ease server load and database calls. However, I'm considering caching the whole table in memcached and doing the filtering in PHP. 2000 rows isn't a lot, but getting data from memory rather than the database would be a lot faster.
Would the PHP processing time outweigh the database response time for SQL filtering at this number of rows? The table shouldn't grow beyond 3000 rows in the foreseeable future.
As with any question relating to is x faster than y, the only real answer is to benchmark it for yourself. However, if the database is properly indexed for the queries you need to perform, it is likely to be quite a bit faster at filtering result sets than most any PHP code you could write.
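A sketch of what such a benchmark could look like, using Python and the built-in sqlite3 module as a stand-in for PHP and PostgreSQL. The items table, its columns and the filter are hypothetical; the point is to check that both paths return the same rows before comparing their timings.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, category INTEGER, price INTEGER)"
)
conn.executemany(
    "INSERT INTO items (id, category, price) VALUES (?, ?, ?)",
    [(i, i % 20, i % 500) for i in range(2000)],
)


def filter_in_db(category, max_price):
    """Let the database do the filtering."""
    return conn.execute(
        "SELECT id FROM items WHERE category = ? AND price <= ? ORDER BY id",
        (category, max_price),
    ).fetchall()


# Simulates the "whole table cached in memory" approach.
cached_rows = conn.execute(
    "SELECT id, category, price FROM items ORDER BY id"
).fetchall()


def filter_in_app(category, max_price):
    """Filter the cached copy in application code."""
    return [(r[0],) for r in cached_rows if r[1] == category and r[2] <= max_price]


def total_seconds(fn, rounds=500):
    """Time rounds calls of fn with varying filter arguments."""
    start = time.perf_counter()
    for i in range(rounds):
        fn(i % 20, 100)
    return time.perf_counter() - start


db_seconds = total_seconds(filter_in_db)
app_seconds = total_seconds(filter_in_app)
```

The absolute numbers from an in-process SQLite database will not carry over to a networked PostgreSQL server, so the same harness should be pointed at the real setup.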
The RDBMS, on the other hand, is already designed and optimized for locating, filtering, and ordering rows.
The way PostgreSQL operates, unless you are severely starving it of memory, 100% of such a small and frequently queried table will already be held in RAM (cache) by the default caching algorithms. Having the database engine filter it is almost certainly faster than doing the same in your application.
You may want to inspect your postgresql.conf, especially shared_buffers, the planner cost constants (set random_page_cost almost or exactly as low as seq_page_cost) and effective_cache_size (set it high enough).
You could probably benefit from optimizing indexes. There is a wide range of types available. Consider partial indexes, indexes on expression or multi-column indexes in addition to plain indexes. Test with EXPLAIN ANALYZE and only keep indexes that actually get used and speed up queries. As all of the table resides in RAM, the query planner should calculate that random access is almost or exactly as fast as sequential access. The difference only applies to disc reads.
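The habit of checking whether an index is actually used can be sketched as follows. This uses Python's built-in sqlite3 and its EXPLAIN QUERY PLAN as a stand-in for PostgreSQL's EXPLAIN ANALYZE; the output format differs between the two, and the readings table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, category TEXT, value REAL)"
)


def plan(sql):
    """Return the planner's description of how it would run sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


query = "SELECT value FROM readings WHERE category = 'a'"

before = plan(query)  # no index yet: the planner has to scan the table
conn.execute("CREATE INDEX idx_readings_category ON readings (category)")
after = plan(query)   # the plan should now mention the new index

print(before)
print(after)
```

An index that never shows up in the plans of your real queries only adds cost to every update.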
As you are updating every minute, be sure not to keep any indexes that aren't actually helping. Also, vacuuming and analyzing the table frequently are key to performance in such a case. Not VACUUM FULL ANALYZE, just VACUUM ANALYZE. Or use autovacuum with tuned settings.
Of course, all the standard advice on performance optimization applies.
I'm fairly familiar with most aspects of web development and I consider myself a junior level programmer. I'm always anxious when I think about application scaling and would like to learn a little more about it. Let's have a hypothetical situation.
I'm working on a web application that polls a device and fetches about 2kb of XML data at 15 minute intervals. This data must be stored for A Very Long Time (at least a couple years?). Now imagine that this web application has 100 users that each have this device.
After 10 years we're talking tens of millions of table rows. With 100 users we have a cron job that is querying each user's device, getting 2kb of XML, and inserting it into the SQL database every 15 minutes.
Assuming my queries are relatively simple, only collecting the columns necessary, using joins, and avoiding subqueries, is there any reason this should not scale?
Inserting doesn't generally get slower as a table gets larger, but index updates may take longer. At some point you may want to split the table into two parts. One for archival storage, optimized for data retrieval (basically index the heck out of it), and a second table to handle the newer data, optimized more for insertion (fewer indexes).
But as always, the only way to tell for sure is to benchmark things. Set up some cloned tables with a few thousand rows, and some with multi-millions of rows, and see what happens.
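For a sense of scale before benchmarking, the figures in the question (100 devices, one 2 KB sample every 15 minutes, 10 years) can be worked out directly. This sketch assumes 365-day years and 2 KB = 2048 bytes, neither of which the question specifies, and it ignores row and index overhead.

```python
devices = 100
samples_per_hour = 60 // 15          # one sample every 15 minutes
years = 10

rows = devices * samples_per_hour * 24 * 365 * years
raw_bytes = rows * 2 * 1024          # 2 KB of XML per row (assumed 2048 bytes)
raw_gib = raw_bytes / 1024 ** 3

print(rows, round(raw_gib, 1))
```

That comes to 35,040,000 rows and on the order of 67 GiB of raw XML, which is a useful target size for the cloned test tables.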
You could always consider using partitioning to automagically split your data files by date, and age older records off to a slower, high-capacity disk array while keeping the newer records (and the INSERTs) on a high-speed array. Then, your index builds will only have to work on a subset of the data rather than the whole deal, and should go quickly (disk I/O is typically the slowest part of a database system).
Assuming my queries are relatively simple, only collecting the columns
necessary, using joins, and avoiding subqueries, is there any reason
this should not scale?
When you get large, you should put your active dataset in an in-memory database (faster than disk), just like Facebook, Twitter, etc. do. Twitter became very slow when it did not keep its active dataset in memory or scale up; a lot of people called this the fail whale. Both use memcached for this, but you could also use Redis (which I like), or APC if you are running on just a single box. You should always install APC if you want performance, because APC caches the compiled bytecode.
Most PHP accelerators work by caching the compiled bytecode of PHP
scripts to avoid the overhead of parsing and compiling source code on
each request (some or all of which may never even be executed). To
further improve performance, the cached code is stored in shared
memory and directly executed from there, minimizing the amount of slow
disk reads and memory copying at runtime.
I have a PHP website which displays recipes, www.trymasak.my to be exact. The recipes displayed on the index page are updated about once a day. To get the latest recipes, I just use a MySQL query which is something like "select recipe_name, page_views, image from table order by last_updated". So if I get 10000 visitors a day, obviously the query will be made 10000 times a day. A friend told me that a better way (in terms of reducing server load) is, when I update the recipes, to put the latest recipe details (names, images etc.) into a text file, and have my page read the data from the text file instead of running the same query 10,000 times. Is his suggestion really better? If yes, which PHP command is best for opening, reading and closing the text file?
thanks
The typical solution is to cache in memory. Either the query result or the whole page.
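A minimal sketch of caching a query result in memory with a time-to-live, written in Python for illustration. In a PHP deployment this role would typically be played by something like memcached or APC; the TTLCache class, the "latest" key and run_query are hypothetical names, not an existing API.

```python
import time


class TTLCache:
    """Keep computed values in memory until they are ttl_seconds old."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # still fresh: skip the query
        value = compute()            # missing or expired: run the query once
        self.store[key] = (now + self.ttl, value)
        return value


calls = []


def run_query():
    """Hypothetical stand-in for the real database query."""
    calls.append(1)
    return ["recipe A", "recipe B"]


fake_now = [0.0]                     # injectable clock, for demonstration
cache = TTLCache(60, clock=lambda: fake_now[0])
first = cache.get_or_compute("latest", run_query)
second = cache.get_or_compute("latest", run_query)  # served from memory
```

Since the recipes change about once a day, the cache could also simply be cleared whenever they are updated rather than relying on expiry.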
Benchmark
To know the truth about something you should really benchmark it. "Simple is Hard" by Rasmus Lerdorf (the author of PHP) is a really interesting video/slide deck (in my opinion ;)) which explains how to benchmark your website. It will teach you to tackle the low-hanging fruit of your website instead of wasting your time on premature optimizations.
Donald Knuth made the following two statements on optimization:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
"In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"
In a nutshell, you will run benchmarks using tools like Siege, ab, httperf, etc. I would really advise you to watch this video if you aren't familiar with this topic, because I found it a really interesting watch/read.
Speed
If speed is your concern, you should at least consider:
Using a bytecode cache => APC. Precompiling your PHP will really speed up your website for at least these two big reasons:
Most PHP accelerators work by caching
the compiled bytecode of PHP scripts
to avoid the overhead of parsing and
compiling source code on each request
(some or all of which may never even
be executed). To further improve
performance, the cached code is stored
in shared memory and directly executed
from there, minimizing the amount of
slow disk reads and memory copying at
runtime.
PHP accelerators can substantially
increase the speed of PHP
applications. Improvements of web page
generation throughput by factors of 2
to 7 have been observed. 50
times faster for compute intensive
analysis programs.
Use an in-memory database to store your query results => Redis or Memcached. There is a very big mismatch between the speed of memory and the speed of disk (I/O).
Thus, we observe that the main memory
is about 10 times slower and I/O units
1000 times slower than the processor.
The analogy part is also an interesting read (can't copy from Google Books :)).
Databases are more flexible, secure and scalable in the long run. 10000 queries per day isn't really that much for modern RDBMS either. Go (or stay) database.
Optimize on the caching side of things; the HTTP specification has its own section on that:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
Reading files:
file_get_contents(), if you want to store the contents in a variable before outputting
readfile(), if you just want to output the file's contents
If you try it, store the recipes in a folder structure like this:
/recipes/X/Y/Z/id.txt
where X, Y and Z are random integers from 1 to 25
example:
/recipes/3/12/22/12345.txt
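If X, Y and Z were truly random, the path would have to be stored somewhere in order to find the file again. The sketch below swaps in a deterministic choice derived from a hash of the id, so the path can be recomputed on every request. That substitution is my assumption, not something the answer states, and recipe_path is a hypothetical helper, shown in Python for illustration.

```python
import hashlib


def recipe_path(recipe_id, fanout=25, root="/recipes"):
    """Build /recipes/X/Y/Z/id.txt with X, Y, Z in 1..fanout.

    X, Y and Z come from the first three bytes of a hash of the id,
    so the same id always maps to the same directory.
    """
    digest = hashlib.sha256(str(recipe_id).encode("utf-8")).digest()
    x, y, z = (digest[i] % fanout + 1 for i in range(3))
    return "%s/%d/%d/%d/%s.txt" % (root, x, y, z, recipe_id)


path = recipe_path(12345)
```

Spreading files over nested directories like this keeps any single directory from holding too many entries.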
This is because the filesystem is just another database. And it has a lot more hidden meta data updates to deal with.
I think MySQL will be faster, and certainly more manageable, since you'd have to backup the MySQL db anyway.
Opening ONE file is faster than doing a MySQL connect + query.
BUT, if your website already needs a MySQL connection to retrieve some other information, you probably want to stick with your query, because the longest part is the connection and your query is very light.
On the other hand, opening 10 files takes longer than querying 10 records from a database, because you only open one MySQL connection.
In any case, you have to consider how long your query takes and whether caching it in a text file will have more pros than cons.
I was wondering if it's faster to process data in MySQL or in a server language like PHP or Python. I'm sure native functions like ORDER will be faster in MySQL due to indexing, caching, etc., but what about actually calculating the rank (including ties, with multiple entries returned as having the same rank)?
Sample SQL
SELECT TORCH_ID,
distance AS thisscore,
(SELECT COUNT(distinct(distance))+1 FROM torch_info WHERE distance > thisscore) AS rank
FROM torch_info ORDER BY rank
Server
...as opposed to just doing a SELECT TORCH_ID FROM torch_info ORDER BY score DESC and then figure out rank in PHP on the web server.
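For comparison, the application-side alternative can be sketched in Python as below. It reproduces what the sample SQL computes (one plus the number of distinct distances greater than the row's own, so ties share a rank and no ranks are skipped). The input rows are made-up illustrative data, and dense_rank is a hypothetical helper name.

```python
def dense_rank(scores):
    """Return (torch_id, distance, rank) tuples, highest distance first.

    Equal distances get the same rank, and the next distinct distance
    gets the next rank.
    """
    ranked = []
    rank = 0
    previous = None
    for torch_id, distance in sorted(scores, key=lambda s: s[1], reverse=True):
        if distance != previous:
            rank += 1
            previous = distance
        ranked.append((torch_id, distance, rank))
    return ranked


rows = [(1, 50), (2, 70), (3, 50), (4, 90)]
ranking = dense_rank(rows)
# ranking == [(4, 90, 1), (2, 70, 2), (1, 50, 3), (3, 50, 3)]
```

Note that this requires pulling every row into the application first, which is the transfer cost the two approaches have to be weighed against.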
Edit: Since posting this, my answer has changed completely, partly due to the experience I've gained since then and partly because relational database systems have gotten significantly better since 2009. Today, 9 times out of 10, I would recommend doing as much of your data crunching in-database as possible. There are three reasons for this:
Databases are highly optimized for crunching data—that's their entire job! With few exceptions, replicating what the database is doing at the application level is going to be slower unless you invest a lot of engineering effort into implementing the same optimizations that the DB provides to you for free—especially with a relatively slow language like PHP, Python, or Ruby.
As the size of your table grows, pulling it into the application layer and operating on it there becomes prohibitively expensive simply due to the sheer amount of data transferred. Many applications will never reach this scale, but if you do, it's best to reduce the transfer overhead and keep the data operations as close to the DB as possible.
In my experience, you're far more likely to introduce consistency bugs in your application than in your RDBMS, since the DB can enforce consistency on your data at a low level but the application cannot. If you don't have that safety net built in, you have to be more careful not to make mistakes.
Original answer: MySQL will probably be faster with most non-complex calculations. However, 90% of the time the database server is the bottleneck, so do you really want to add to that by bogging down your database with these calculations? I myself would rather put them on the web/application server to even out the load, but that's your decision.
In general, the answer to the "Should I process data in the database, or on the web server?" question is, "It depends".
It's easy to add another web server. It's harder to add another database server. If you can take load off the database, that can be good.
If the output of your data processing is much smaller than the required input, you may be able to avoid a lot of data transfer overhead by doing the processing in the database. As a simple example, it'd be foolish to SELECT *, retrieve every row in the table, and iterate through them on the web server to pick the one where x = 3, when you can just SELECT * WHERE x = 3
As you pointed out, the database is optimized for operation on its data, using indexes, etc.
The speed of the count is going to depend on which DB storage engine you are using and the size of the table. I suspect, though, that nearly every count and rank done in MySQL would be faster than pulling that same data into PHP memory and doing the same operation there.
Ranking is based on count, order. So if you can do those functions faster, then rank will obviously be faster.
A large part of your question is dependent on the primary keys and indexes you have set up.
Assuming that torchID is indexed properly...
You will find that MySQL is faster than server-side code.
Another consideration you might want to make is how often this SQL will be called. You may find it easier to create a rank column and update that as each track record comes in. This will result in a lot of minor hits to your database, versus a number of "heavier" hits to your database.
So let's say you have 10,000 records, 1000 users who hit this query once a day, and 100 users who put in a new track record each day. I'd rather have the DB doing 100 updates, of which 10% hit every record (9,999), than have the ranking query get hit 1,000 times a day.
My two cents.
If your test is running individual queries instead of posting transactions, then I would recommend using a JDBC driver over the ODBC DSN, because you'll get 2-3 times faster performance. (I'm assuming you're using an ODBC DSN in your tests.)