I have read many blogs and articles about the pros and cons of Amazon EC2 versus Microsoft Azure (and Google's App Engine). However, I am trying to decide which would better suit my particular case.
I have a data set - which can be thought of as a standard table of the format:
[id]  [name]   [d0]    [d1]    [d2]   ..   [d63]
-------------------------------------------------
 0    Name1    0.43   -0.22    0.11        -0.81
 1    Name2    0.23    0.65    0.62         0.41
 2    Name3   -0.13   -0.23    0.17         0.00
 ...
 N    NameN    0.43   -0.23    0.12         0.01
I ultimately want to do something that (whatever my final chosen stack) would equate to an SQL SELECT statement similar to:
SELECT name FROM [table] WHERE (d0*QueryParameter0) + (d1*QueryParameter1) + (d2*QueryParameter2) + ... + (d63*QueryParameter63) < 0.5
where QueryParameter0, 1, ..., 63 are parameters supplied at runtime, and change each time the query is run (so caching is out of the question).
My main concern is with the speed of the query, so I would like advice on which cloud stack option would provide the fastest query result possible.
I can do this a number of ways:
(1) Use SQL Azure, just as the query stands above. I have tried this method, and the queries can be quite slow, as expected, since SQL Azure only gives you a single instance. I can spin up multiple SQL instances and shard the data, but that gets real expensive real quick.
(2) Use Azure Storage Tables. Bloggers claim storage tables are faster in general, but would this still be the case for my query requirements?
(3) Use EC2 and spin up several instances with MySQL, possibly incorporating sharding to new instances (cost increases though).
(4) Use EC2 with MongoDB, as I've read it is faster than MySQL. Again this is probably dependent on the type of query.
(5) Google AppEngine. I'm not really sure how GAE would work with this query structure, but I guess that's why I am looking for opinions.
I'd like to find the best stack combination to optimize my specific need (outlined by the pseudo SQL query above).
Does anyone have any experience in this? Which stack option would result in the fastest query containing many math operators in the WHERE clause?
Cheers,
Brett
Your type of query with dynamic coefficients (weights) will require the entire table to be scanned on every query. A SQL database engine is not going to help you here, because there is really nothing that the query optimizer can do.
In other words, what you need is NOT a SQL database, but a "NoSQL" database that optimizes table/row access for the fastest speed possible. So you shouldn't have to try SQL Azure and MySQL to find out this part of the answer.
Also, each row in your type of query is completely independent from each other, so it lends itself to simple parallelism. Your choice of platform should be whichever gives you:
Table/row scan at the fastest speed
Ability to highly parallelize your operation
Each platform you mentioned gives you the ability to store huge amounts of blob or table-like data for very fast scan retrieval (e.g. table storage in Azure). Each also gives you the ability to "spin up" multiple instances to process them in parallel. It really depends on which programming environment you're most comfortable in (e.g. Java in Google/Amazon, .NET in Azure). In essence they all do the same thing.
My personal recommendation is Azure, since you can:
Store massive amounts of data in "table storage", optimized for fast scan retrieval, and partitioned (e.g. over d0 ranges) for optimal parallelism
Dynamically "spin up" as many compute instances as you like to process the data in parallel
Use queueing mechanisms to synchronize the collation of results
Azure does what you require in a very "no-frills" way -- providing just enough infrastructure for you to do your job, and nothing more.
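To make the per-worker step concrete, here is a minimal sketch (in PHP, for illustration only) of the weighted-sum scan each parallel instance would run over its own partition. The scanPartition/fetchPartition names, the $weights array, and the row layout are assumptions for this sketch, not part of any particular platform's API.

```php
<?php
// Minimal sketch of the scan each parallel worker would run over its own
// partition of rows. How the rows are fetched (Azure table storage, blob,
// SQL shard, ...) is left abstract; fetchPartition() is a hypothetical helper.

function scanPartition(array $rows, array $weights, float $threshold = 0.5): array
{
    $matches = [];
    foreach ($rows as $row) {
        // $row = ['name' => ..., 'd' => [d0, d1, ..., d63]]
        $score = 0.0;
        foreach ($row['d'] as $i => $value) {
            $score += $value * $weights[$i];   // weighted sum with run-time weights
        }
        if ($score < $threshold) {
            $matches[] = $row['name'];
        }
    }
    return $matches;
}

// Each worker scans one partition; a queue (Azure queues, SQS, ...) would
// collect the partial results for final collation, e.g.:
// $names = scanPartition(fetchPartition($partitionId), $weights);
```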
The problem is not the math operators or the number thereof, the problem is that they are parameterized - you are effectively doing a weighted average across the columns with the weights being defined at run-time, so that the operation must be computed and cannot be inferred.
Even in SQL Server, this operation can be parallelized (and this should show up in the execution plan), but it is not amenable to search optimization using indexes, which is where most relational databases really shine. With static weights, an indexed computed column would obviously perform very quickly.
Because this problem is easily parallelized, you might want to look at something based on a Map-Reduce principle.
Currently neither SQL Azure nor Amazon RDS can scale horizontally (EC2 can at least scale vertically), but if, and only if, your data can be partitioned in a way that still makes it possible to execute your query, the upcoming SQL Federations feature of SQL Azure might be worth looking at and could help you make an informed decision.
MongoDB (which I like a lot) is more geared toward document-oriented workloads and is possibly not the best solution for this type of job, although your mileage may vary (it's blazingly fast as long as most of your working set fits into memory).
Assuming that QueryParameter0, QueryParameter1, ..., QueryParameterN are all supplied at runtime and are different each time, I don't think that any of the platforms will be able to provide significant advantages over the others, since none of them will be able to take advantage of any pre-computed indices.
With indices out of the picture, the only other factor for speed is the processing power available. You already know about this for the SQL Azure option; for the other options it pretty much comes down to you deciding what processing to apply, since it's up to you to fetch all the data and then process it.
One option you might consider is whether you could host this data yourself on an instance (e.g. using an Azure blob or cloud drive) and then process it in a custom-built worker role. This isn't something I'd think about for general data storage, but if it's just this one table and this one query, then it would be pretty easy to hand-craft a quick solution.
Update - just seen the answer from @Cade too - +1 for his suggestion of parallelization.
So currently I have an application where I'm storing location data (lat, lng) along with other fields and whatnot. What I love about MySQL, or SQL in general, is that I can do geospatial queries easily, e.g. select all rows that fall within a given radius of a center point.
What I love about DynamoDB is that it's damn near infinitely scalable on AWS, which is the service I'll be using, and fast. I would love to move all my data over to DynamoDB and even insert new data there, but I wouldn't be able to use those geospatial queries, which are the most important part of my application. They're required.
I know about the geo library for DynamoDB, but it's written in Java and my backend is written in PHP, so that's a no-go; plus they don't seem to update or maintain that library.
One solution I was thinking of was to store just the coordinates in MySQL and store the corresponding id along with the other data (including the lat and long values) in DynamoDB.
With this I could achieve the geospatial query functionality I want while being able to scale everything well on Amazon specifically, because that's the host I'm using.
So basically I'd query all POIs within a given radius from MySQL, and with all those ids I'd get the full results from DynamoDB. Sounds crazy, or what?
But the potential downside of this is having to query one data source and then querying another one immediately after with the results from the first query. Maybe I'm overthinking it and underestimating how fast these technologies have become.
So to sum up my requirements:
Must be on AWS
Must be able to perform geospatial queries
Must be able to connect to dynamodb and MySQL in PHP
Any help or suggestions would be greatly appreciated.
My instinct says don't use two data sources unless you have a really specific case.
How much data do you have? Can MySQL (or Aurora) really not handle it? If your application is read-heavy, it can easily scale with read replicas.
I have a few ideas for you which may bring you at least a bit closer:
Why don't you implement your own geo-library in php? :D
You can do a dummy search in the DB where you don't filter by actual distance, but by an upper and lower boundary on lat and long (so you're not searching in a circle, but in a square). Then it's up to you whether your application is fine with that, or whether it filters the results further; that would be a much smaller dataset and an easy filter. (A rough sketch of this appears at the end of this answer.)
Maybe CloudSearch can help you out. It offers geospatial queries on lat/long fields. It works well together with DynamoDB, and it has a PHP SDK (never tried that though, I use nodejs)
You write the items that have lat,long fields to DynamoDB. Each item (or item update/deletion) is uploaded to CloudSearch automatically via a DynamoDB stream. So now you have "automatic copies" of your DynamoDB items in CloudSearch, and you can use all query capabilities of CloudSearch, including geo queries (one limitation, it only queries in boxes, not in circles, so you will need some extra math)
You will need to create a DynamoDB stream that triggers a Lambda function that uploads every item to CloudSearch. You set this up once, and it will do its magic "forever".
This approach will only work if you accept a small delay between the moment you are writing in DynamoDB and the moment it is available in CloudSearch.
With this approach you still have 2 datasources, but they are entirely separated from the perspective of your app. One datasource is for querying and the other one for writing. Keeping them in sync is done automatically for you in the AWS cloud. Your app writes to DynamoDB, and queries from CloudSearch. And you have the scalability advantages that these AWS services offer.
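As promised above, a rough sketch of the bounding-box pre-filter (idea 2). The pois table, its lat/lng columns and the degree-per-km conversion are assumptions for illustration; the final refinement to a true circle is left to your application.

```php
<?php
// Rough sketch of the bounding-box idea: pre-filter in MySQL with a lat/lng
// square, then (optionally) refine to a true circle in PHP or fetch the full
// items from DynamoDB by id. Table/column names are assumptions, not your schema.

function boundingBox(float $lat, float $lng, float $radiusKm): array
{
    $latDelta = $radiusKm / 111.0;                        // ~111 km per degree of latitude
    $lngDelta = $radiusKm / (111.0 * cos(deg2rad($lat))); // longitude degrees shrink with latitude
    return [$lat - $latDelta, $lat + $latDelta, $lng - $lngDelta, $lng + $lngDelta];
}

$pdo = new PDO('mysql:host=localhost;dbname=geo', 'user', 'pass');
[$minLat, $maxLat, $minLng, $maxLng] = boundingBox(40.7128, -74.0060, 5.0);

$stmt = $pdo->prepare(
    'SELECT id, lat, lng FROM pois
     WHERE lat BETWEEN :minLat AND :maxLat
       AND lng BETWEEN :minLng AND :maxLng'
);
$stmt->execute(compact('minLat', 'maxLat', 'minLng', 'maxLng'));
$candidateIds = array_column($stmt->fetchAll(PDO::FETCH_ASSOC), 'id');

// ...then fetch the full items from DynamoDB by id, or apply a haversine
// check here if you need an exact circle instead of a square.
```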
I am running a CRM application which uses a MySQL database. My application generates lots of data in MySQL. Now I want to give my customers a reporting section where admins can view real-time reports, and they should be able to filter in real time. Basically I want to be able to slice and dice my data in real time, as fast as possible.
I have implemented the reporting using MySQL and PHP, but now there is so much data that the queries take too long and the page does not load. After some reading I came across terms like NoSQL, MongoDB, Cassandra, OLAP, Hadoop, etc., but I was confused about which to choose. Is there any mechanism that would transfer my data from MySQL to NoSQL, on which I could run my reporting queries and serve my customers, while keeping my MySQL database as it is?
It doesn't matter what database / datastore technology you use for reporting: you still will have to design it to extract the information you need efficiently.
Improving performance by switching from MySQL to MongoDB or one of the other scalable key/value store systems is like solving a pedestrian traffic jam by building a railroad. It's going to take a lot of work to make it help the situation. I suggest you try getting things to work better in MySQL first.
First of all, you need to take a careful look at which SQL queries in your reporting system are causing trouble. You may be able to optimize their performance by adding indexes or doing other refactoring. That should be your first step. MySQL has a slow query log. Look at it.
Secondly, you may be able to add resources (RAM, faster disks, etc) to MySQL, and you may be able to tune it for higher performance. There's a book called High Performance MySQL that offers a sound methodology for doing this.
Thirdly, many people who need to add a reporting function to their busy application use MySQL replication. That is, they configure one or two slave MySQL servers to accept copies of all data from the master server.
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
They then use the slave server or servers to run reporting queries. The slaves are ordinarily a few seconds or minutes behind the master (that is, they're slightly out of date). But it usually is good enough to give users the illusion of real-time reporting.
Notice that if you use MongoDB or some other technology you will also have to replicate your data.
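For illustration, a minimal sketch of what routing reporting reads to a replica can look like at the application level. Hostnames, credentials and the events table are placeholders, not a prescription.

```php
<?php
// Sketch of routing at the application level: writes go to the master,
// heavy reporting reads go to the replica. Hostnames/credentials are placeholders.

$master  = new PDO('mysql:host=db-master.example.com;dbname=crm', 'user', 'pass');
$replica = new PDO('mysql:host=db-replica.example.com;dbname=crm', 'user', 'pass');

// Normal application writes go to the master:
$master->prepare('INSERT INTO events (user_id, type) VALUES (?, ?)')
       ->execute([42, 'login']);

// Reporting queries hit the replica, which may lag by a few seconds:
$report = $replica->query(
    'SELECT DATE(created_at) AS day, COUNT(*) AS total
     FROM events GROUP BY DATE(created_at)'
)->fetchAll(PDO::FETCH_ASSOC);
```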
I will throw this link out there for you to read which actually gives certain use cases: http://www.mongodb.com/use-cases/real-time-analytics but I will speak for a more traditional setup of just MongoDB.
I have used both MySQL and MongoDB for analytical purposes and I find MongoDB better suited, even if it needs a little bit of hacking to get it working well.
The great thing about MongoDB when it comes to retrieving analytical data is that it does not require the IO/memory to write out a separate result set each time. This makes reads on a single member of a replica set extremely scalable, since you just add your analytical collections to the working set (a.k.a. memory) and serve straight from those using batch responses (this is the default implementation of the drivers).
So with MongoDB replication rarely gives an advantage in terms of read/write, and in reality with MySQL I have found it does not either. If it does then you are doing the wrong queries which will not scale anyway; at which point you install memcache onto your database servers and, look, you have stale data being served from memory in a NoSQL fashion anyway...whoop, I guess.
Okay, so we have some basic ideas set out; time to talk about that hack. In order to get the best possible speed out of MongoDB, and since it does not have JOINs, you need to flatten your data so that no result set will even be needed on your side.
There are many tactics for this, but the one I will mention here is pre-aggregated reports: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/ This method also works well in SQL techs, since it is essentially in the same vein as logically splitting tables to make queries faster and lighter on a large table.
What you do is take your analytical data, split it into a granularity such as per day or month (or both), and then aggregate your data across those ranges in a de-normalised manner; essentially, all in one row.
After this you can show reports straight from a collection without any need for a result set, making for some very fast querying.
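As a hedged illustration of the pre-aggregation idea using the mongodb/mongodb PHP library: every tracked event does a tiny upsert into a per-day document, and the report then reads those documents directly. The collection, field names and per-day layout here are just examples.

```php
<?php
// Pre-aggregated reports sketch using the mongodb/mongodb library.
// Field names and the per-day/per-hour layout are illustrative only.

require 'vendor/autoload.php';

$collection = (new MongoDB\Client('mongodb://localhost:27017'))
    ->analytics->daily_stats;

// On each tracked event, bump the pre-aggregated counters for that day/hour:
$collection->updateOne(
    ['site' => 'example.com', 'day' => date('Y-m-d')],
    ['$inc' => [
        'views.total'                => 1,
        'views.hourly.' . date('G')  => 1,   // per-hour breakdown inside the same document
    ]],
    ['upsert' => true]
);

// The report is then a plain find over a date range, with no aggregation at read time:
$docs = $collection->find([
    'site' => 'example.com',
    'day'  => ['$gte' => date('Y-m-d', strtotime('-30 days'))],
]);
```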
Later on you could add a map-reduce step to create better analytics, but so far I have not needed to; I have completed full video-based analytics without such a need.
This should get you started.
TiDB may be a good fit (https://en.pingcap.com/tidb/); it is MySQL compatible, good at real-time analytics, and can replicate data from MySQL through the binlog.
I see a lot of statements like: "Cassandra very fast on writes", "Cassandra has reads really slower than writes, but much faster than Mysql"
On my Windows 7 system:
I installed MySQL with the default configuration.
I installed PHP 5 with the default configuration.
I installed Cassandra with the default configuration.
Running a simple write test on MySQL: "INSERT INTO wp_test (id,title) VALUES ('id01','test')" gives me a result of 0.0002 s.
For 1000 inserts: 0.1106 s.
Running the same simple write test on Cassandra: $column_family->insert('id01',array('title'=>'test')) gives me a result of 0.005 s.
For 1000 inserts: 1.047 s.
For read tests I also found that Cassandra is much slower than MySQL.
So the question: does it sound correct that I get 5 ms for one write operation on Cassandra? Or is something wrong, and it should be more like 0.5 ms?
When people say "Cassandra is faster than MySQL", they mean when you are dealing with terabytes of data and many simultaneous users. Cassandra (and many distributed NoSQL databases) is optimized for hundreds of simultaneous readers and writers on many nodes, as opposed to MySQL (and other relational DBs), which are optimized to be really fast on a single node but tend to fall to pieces when you try to scale them across multiple nodes. There is a generalization of this trade-off, by the way: the absolute fastest disk I/O is plain old UNIX flat files, and many latency-sensitive financial applications use them for that reason.
If you are building the next Facebook, you want something like Cassandra because a single MySQL box is never going to stand up to the punishment of thousands of simultaneous reads and writes, whereas with Cassandra you can scale out to hundreds of data nodes and handle that load easily. See scaling up vs. scaling out.
Another use case is when you need to apply a lot of batch processing power to terabytes or petabytes of data. Cassandra or HBase are great because they are integrated with MapReduce, allowing you to run your processing on the data nodes. With MySQL, you'd need to extract the data and spray it out across a grid of processing nodes, which would consume a lot of network bandwidth and entail a lot of unneeded complication.
Cassandra benefits greatly from parallelisation and batching. Try doing 1 million inserts on each of 100 threads (each with its own connection and in batches of 100) and see which one is faster.
Finally, Cassandra insert performance should be relatively stable (maintaining high throughput for a very long time). With MySQL, you will find that it tails off rather dramatically once the btrees used for the indexes grow too large for memory.
It's likely that the maturity of the MySQL drivers, especially the improved MySQL drivers in PHP 5.3, is having some impact on the tests. It's also entirely possible that the simplicity of the data in your query is impacting the results - maybe on 100 value inserts, Cassandra becomes faster.
Try the same test from the command line and see what the timestamps are, then try with varying numbers of values. You can't do a single test and base your decision on that.
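If it helps, here is a rough sketch of a slightly fairer harness for the MySQL side of such a test, timing single-row inserts against one multi-row insert. Credentials are placeholders, the table is the one from the question, and the equivalent batching on the Cassandra side would be done with whatever client you use there.

```php
<?php
// Micro-benchmark sketch for the MySQL side: single-row inserts vs one
// multi-row insert. Connection details are placeholders.

$mysqli = new mysqli('localhost', 'user', 'pass', 'test');
$n = 1000;

$start = microtime(true);
$stmt = $mysqli->prepare('INSERT INTO wp_test (id, title) VALUES (?, ?)');
for ($i = 0; $i < $n; $i++) {
    $id = "id$i";
    $title = 'test';
    $stmt->bind_param('ss', $id, $title);
    $stmt->execute();
}
printf("%d single-row inserts: %.4fs\n", $n, microtime(true) - $start);

$start = microtime(true);
$rows = [];
for ($i = 0; $i < $n; $i++) {
    $rows[] = sprintf("('id%d', 'test')", $i + $n);   // offset ids to avoid duplicate keys
}
$mysqli->query('INSERT INTO wp_test (id, title) VALUES ' . implode(',', $rows));
printf("1 multi-row insert of %d rows: %.4fs\n", $n, microtime(true) - $start);
```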
Many user space factors can impact write performance. Such as:
Dozens of settings in each database server's configuration.
The table structure and settings.
The connection settings.
The query settings.
Are you swallowing warnings or exceptions? The MySQL sample would on face value be expected to produce a duplicate key error. It could be failing while doing nothing at all. What Cassandra might do in the same case isn't something I'm familiar with.
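As a quick way to rule out that silent-failure case, you might make mysqli throw on errors and check that each insert really affected a row; a small hedged example:

```php
<?php
// Quick check for the "failing while doing nothing at all" case: make mysqli
// throw on errors and confirm that the insert actually added a row.

mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);  // throw mysqli_sql_exception on errors

$mysqli = new mysqli('localhost', 'user', 'pass', 'test');
try {
    $mysqli->query("INSERT INTO wp_test (id, title) VALUES ('id01', 'test')");
    if ($mysqli->affected_rows !== 1) {
        echo "Insert did not add a row\n";
    }
} catch (mysqli_sql_exception $e) {
    // A duplicate key error would surface here instead of being timed as a "write".
    echo 'Insert failed: ' . $e->getMessage() . "\n";
}
```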
My limited experience of Cassandra tells me one thing about inserts: while the performance of everything else degrades as data grows, inserts appear to maintain the same speed. How fast it is compared to MySQL, however, isn't something I've tested.
It might not be so much that inserts are fast, but rather that Cassandra tries never to be slow. If you want a more meaningful test you need to incorporate concurrency and more variations on the scenario, such as large data sets, various batch sizes, etc. More complex tests might measure latency until data is available after an insert, and read speed over time.
It would not surprise me if Cassandra's first port of call for inserting data is to put it on a queue or to simply append. This is configurable if you look at consistency levels. MySQL similarly allows you to balance performance and reliability/availability, though each will have variations on what they allow and don't allow.
Outside of that unless you get into the internals it may be hard to tell why one performs better than the other.
I did some benchmarks of a use case I had for Cassandra a while ago. For the benchmark it would insert tens of thousands of rows first. I had to make the script sleep for a few seconds because otherwise queries run after the fact would not see the data and the results would be inconsistent between implementations I was testing.
If you really want fast inserts, append to a file on ramdisk.
I was wondering if it's faster to process data in MySQL or in a server language like PHP or Python. I'm sure native functions like ORDER BY will be faster in MySQL due to indexing, caching, etc., but what about actually calculating the rank (with ties returning multiple entries as having the same rank)?
Sample SQL
SELECT TORCH_ID,
distance AS thisscore,
(SELECT COUNT(distinct(distance))+1 FROM torch_info WHERE distance > thisscore) AS rank
FROM torch_info ORDER BY rank
Server
...as opposed to just doing a SELECT TORCH_ID FROM torch_info ORDER BY score DESC and then figuring out the rank in PHP on the web server.
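For reference, a minimal sketch of what the PHP-side version could look like, using the column names from the sample SQL above and dense ranking (tied distances share a rank, matching the COUNT(DISTINCT ...) approach); connection details are placeholders.

```php
<?php
// Sketch of the "rank it in PHP" alternative: fetch rows ordered by distance
// and assign dense ranks, so tied distances get the same rank.

$pdo  = new PDO('mysql:host=localhost;dbname=torch', 'user', 'pass');
$rows = $pdo->query(
    'SELECT TORCH_ID, distance FROM torch_info ORDER BY distance DESC'
)->fetchAll(PDO::FETCH_ASSOC);

$rank = 0;
$previous = null;
foreach ($rows as &$row) {
    if ($row['distance'] !== $previous) {   // new distinct distance => next rank
        $rank++;
        $previous = $row['distance'];
    }
    $row['rank'] = $rank;                   // ties keep the previous rank
}
unset($row);
```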
Edit: Since posting this, my answer has changed completely, partly due to the experience I've gained since then and partly because relational database systems have gotten significantly better since 2009. Today, 9 times out of 10, I would recommend doing as much of your data crunching in-database as possible. There are three reasons for this:
Databases are highly optimized for crunching data—that's their entire job! With few exceptions, replicating what the database is doing at the application level is going to be slower unless you invest a lot of engineering effort into implementing the same optimizations that the DB provides to you for free—especially with a relatively slow language like PHP, Python, or Ruby.
As the size of your table grows, pulling it into the application layer and operating on it there becomes prohibitively expensive simply due to the sheer amount of data transferred. Many applications will never reach this scale, but if you do, it's best to reduce the transfer overhead and keep the data operations as close to the DB as possible.
In my experience, you're far more likely to introduce consistency bugs in your application than in your RDBMS, since the DB can enforce consistency on your data at a low level but the application cannot. Without that safety net built in, you have to be more careful not to make mistakes.
Original answer: MySQL will probably be faster with most non-complex calculations. However, 90% of the time the database server is the bottleneck, so do you really want to add to that by bogging down your database with these calculations? I myself would rather put them on the web/application server to even out the load, but that's your decision.
In general, the answer to the "Should I process data in the database, or on the web server question" is, "It depends".
It's easy to add another web server. It's harder to add another database server. If you can take load off the database, that can be good.
If the output of your data processing is much smaller than the required input, you may be able to avoid a lot of data transfer overhead by doing the processing in the database. As a simple example, it'd be foolish to SELECT *, retrieve every row in the table, and iterate through them on the web server to pick the ones where x = 3, when you can just add WHERE x = 3 to the query.
As you pointed out, the database is optimized for operation on its data, using indexes, etc.
The speed of the count is going to depend on which DB storage engine you are using and the size of the table. Though I suspect that nearly every count and rank done in MySQL would be faster than pulling that same data into PHP memory and doing the same operation.
Ranking is based on count and order, so if you can do those functions faster, then rank will obviously be faster.
A large part of your question is dependent on the primary keys and indexes you have set up.
Assuming that torchID is indexed properly...
You will find that MySQL is faster than server-side code.
Another consideration you might want to make is how often this SQL will be called. You may find it easier to create a rank column and update it as each track record comes in. This will result in a lot of minor hits to your database, versus a smaller number of "heavier" hits.
So let's say you have 10,000 records, 1,000 users who hit this query once a day, and 100 users who put in a new track record each day. I'd rather have the DB do 100 updates, in which 10% of them hit every record (9,999), than have the ranking query hit 1,000 times a day.
My two cents.
If your test is running individual queries instead of posting transactions, then I would recommend using a JDBC driver over the ODBC DSN, because you'll get 2-3 times faster performance. (I'm assuming you're using an ODBC DSN in your tests.)
How do I increase the performance of my MySQL database? I have my website hosted on a shared server and they have suspended my account because of "too many queries".
The staff asked me to "index" or "cache" or trim my database.
I don't know what "index" and "cache" mean or how to do them in PHP.
thanks
What an index is:
Think of a database table as a library - you have a big collection of books (records), each with associated data (author name, publisher, publication date, ISBN, content). Also assume that this is a very naive library, where all the books are shelved in order by ISBN (primary key). Just as the books can only have one physical ordering, a database table can only have one primary key index.
Now imagine someone comes to the librarian (database program) and says, "I would like to know how many Nora Roberts books are in the library". To answer this question, the librarian has to walk the aisles and look at every book in the library, which is very slow. If the librarian gets many requests like this, it is worth his time to set up a card catalog by author name (index on name) - then he can answer such questions much more quickly by referring to the catalog instead of walking the shelves. Essentially, the index sets up an 'alternative ordering' of the books - it treats them as if they were sorted alphabetically by author.
Notice that 1) it takes time to set up the catalog, 2) the catalog takes up extra space in the library, and 3) it complicates the process of adding a book to the library - instead of just sticking a book on the shelf in order, the librarian also has to fill out an index card and add it to the catalog. In just the same way, adding an index on a database field can speed up your queries, but the index itself takes storage space and slows down inserts. For this reason, you should only create indexes in response to need - there is no point in indexing a field you rarely search on.
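To tie the analogy back to MySQL, here is a minimal example of adding that "card catalog"; the table and column names are made up for illustration, and you should only add indexes for fields you actually search on.

```php
<?php
// Adding an index (the "card catalog") on the author column.
// Table/column names are illustrative; connection details are placeholders.

$pdo = new PDO('mysql:host=localhost;dbname=library', 'user', 'pass');

// One-time setup: slightly slower writes from now on, much faster author lookups.
$pdo->exec('CREATE INDEX idx_books_author ON books (author_name)');

// This query can now use the index instead of scanning every row:
$stmt = $pdo->prepare('SELECT COUNT(*) FROM books WHERE author_name = ?');
$stmt->execute(['Nora Roberts']);
echo $stmt->fetchColumn(), " books by Nora Roberts\n";
```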
What caching is:
If the librarian has many people coming in and asking the same questions over and over, it may be worth his time to write the answer down at the front desk. Instead of checking the stacks or the catalog, he can simply say, "here is the answer I gave to the last person who asked that question".
In your script, this may apply in different ways. You can store the results of a database query or a calculation or part of a rendered web page; you can store it to a secondary database table or a file or a session variable or to a memory service like memcached. You can store a pre-parsed database query, ready to run. Some libraries like Smarty will automatically store part or all of a page for you. By storing the result and reusing it you can avoid doing the same work many times.
In every case, you have to worry about how long the answer will remain valid. What if the library got a new book in? Is it OK to use an answer that may be five minutes out of date? What about a day out of date?
Caching is very application-specific; you will have to think about what your data means, how often it changes, how expensive the calculation is, how often the result is needed. If the data changes slowly, it may be best to recalculate and store the result every time a change is made; if it changes often but is not crucial, it may be sufficient to update only if the cached value is more than a certain age.
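A small hedged sketch of the memcached variant mentioned above, caching one expensive query result for a few minutes; the key, TTL and connection details are illustrative only.

```php
<?php
// "Write the answer down at the front desk": cache an expensive query result
// in memcached for a short time instead of recomputing it on every request.

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$key = 'book_count_by_author:nora_roberts';
$count = $memcached->get($key);

if ($count === false) {                       // cache miss: ask the database once
    $pdo = new PDO('mysql:host=localhost;dbname=library', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM books WHERE author_name = ?');
    $stmt->execute(['Nora Roberts']);
    $count = (int) $stmt->fetchColumn();
    $memcached->set($key, $count, 300);       // accept answers up to 5 minutes stale
}

echo $count;
```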
Set up a copy of your application locally, enable the MySQL query log, and set up Xdebug or some other profiler. Then start collecting data and testing your application. There are lots of guides and books available about how to optimize things. It is important that you spend time testing and collecting data first, so you optimize the right things.
Using the data you have collected, try to reduce the number of queries per page view. Ideally, you should be able to get everything you need in fewer than 5-10 queries.
Look at the logs and see if you are asking for the same thing twice. It is a bad idea to request a record in one portion of your code, and then request it again from the database a few lines later unless you are sure the value is likely to have changed.
Look for queries embedded in loops, and try to refactor them so you make a single query and simply loop over the results.
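For instance, a sketch of that refactor (table and column names are made up):

```php
<?php
// The query-in-a-loop pattern vs. a single query. $userIds and the users
// table are illustrative; connection details are placeholders.

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$userIds = [3, 7, 12, 19];

// Bad: one round-trip to MySQL per id.
$names = [];
foreach ($userIds as $id) {
    $stmt = $pdo->prepare('SELECT name FROM users WHERE id = ?');
    $stmt->execute([$id]);
    $names[$id] = $stmt->fetchColumn();
}

// Better: one query, then loop over the results in PHP.
$placeholders = implode(',', array_fill(0, count($userIds), '?'));
$stmt = $pdo->prepare("SELECT id, name FROM users WHERE id IN ($placeholders)");
$stmt->execute($userIds);
$names = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);   // [id => name]
```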
The select * you mention using is an indication you may be doing something wrong. You probably should be listing fields you explicitly need. Check this site or google for lots of good arguments about why select * is evil.
Start looking at your queries and then use EXPLAIN on them. For queries that are frequently used, make sure they are using a good index and not doing a full table scan. Tweak indexes on your development database and test.
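A quick hedged example of checking a frequent query with EXPLAIN from PHP; the query and schema are illustrative. A "type" of ALL (full table scan) on a large table is the usual sign that a new index is needed.

```php
<?php
// Inspect the execution plan of a frequently-run query. Query and column
// names are illustrative; connection details are placeholders.

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$plan = $pdo->query(
    "EXPLAIN SELECT name FROM users WHERE email = 'someone@example.com'"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($plan as $row) {
    // Columns of interest: 'type' (ALL = full scan), 'key' (index chosen), 'rows'.
    printf("table=%s type=%s key=%s rows=%s\n",
        $row['table'], $row['type'], $row['key'] ?? 'NULL', $row['rows']);
}
```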
There are a couple things you can look into:
Query Design - look into more advanced and faster solutions
Hardware - throw better and faster hardware at the problem
Database Design - use indexes and practice good database design
All of these are easier said than done, but it is a start.
Firstly, sack your host, get off shared hosting into an environment you have full control over and stand a chance of being able to tune decently.
Replicate that environment in your lab, ideally with the same hardware as production; this includes things like RAID controller.
Did I mention that you need a RAID controller? Yes, you do. You can't achieve decent write performance without one, and it needs a battery-backed cache. If you don't have one, each write needs to physically hit the disc, which is ruinous for performance.
Anyway, back to read performance, once you've got the machine with the same spec RAID controller (and same discs, obviously) as production in your lab, you can try to tune stuff up.
More RAM is usually the cheapest way of achieving better performance - make sure that you've got MySQL configured to use it - which means tuning storage-engine specific parameters.
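As a rough way to see how far off you are, here is a hedged check (in PHP) comparing the InnoDB buffer pool size, the main storage-engine parameter this RAM advice is about, against your total data and index size; the value itself is set in my.cnf.

```php
<?php
// Compare the InnoDB buffer pool with the total data + index size.
// Connection details are placeholders; assumes InnoDB tables.

$pdo = new PDO('mysql:host=localhost', 'user', 'pass');

$pool = (int) $pdo->query('SELECT @@innodb_buffer_pool_size')->fetchColumn();

$data = (int) $pdo->query(
    "SELECT SUM(data_length + index_length)
     FROM information_schema.tables
     WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')"
)->fetchColumn();

printf("buffer pool: %.1f GB, data + indexes: %.1f GB\n",
    $pool / 1024 ** 3, $data / 1024 ** 3);
// If data + indexes fit comfortably inside the buffer pool, reads are served from RAM.
```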
I am assuming here that you have at least 100 GB of data; if not, just buy enough RAM that your entire DB fits in RAM, and read performance is essentially solved.
Software changes that others have mentioned such as optimising queries and adding indexes are helpful too, but only once you've got a development hardware environment that enables you to usefully do performance work - i.e. measure performance of your application meaningfully - which means real hardware (not VMs), which is consistent with the hardware environment used in production.
Oh yes - one more thing - don't even THINK about deploying a database server on a 32-bit OS, it's a ruinous waste of good ram.
Indexing is done on database tables in order to speed up queries. If you don't know what it means, you have none. At a minimum you should have indexes on every foreign key and on most fields that are used frequently in the WHERE clauses of your queries. Primary keys get indexes automatically, assuming you set them up to begin with, which I would find unlikely in someone who doesn't know what an index is. Are your tables normalized?
BTW, since you are doing a division in your math (why, I haven't a clue), you should Google integer math. You may not be getting correct results.
You should not select * ever. Instead, select only the data you need for that particular call. And what is your intention here?
order by votes*1000+((1440 - ($server_date - date))/60)*2+visites*600 desc
You may have poorly written queries, and/or poorly written pages that run too many queries. Could you give us specific examples of queries you're using that are run on a regular basis?
sure
This is the query to fetch the last 3 posts:
select * from posts where visible = 1 and date > ($server_date - 86400) and dont_show_in_frontpage = 0 order by votes*1000+((1440 - ($server_date - date))/60)*2+visites*600 desc limit 3
what do you think?