We have a database of about 500,000 records of non-normalized data (vehicles for sale). We have a master MySQL DB and, to enable fast searching, we update a Solr index whenever changes are made. Most of our data is served from the Solr index due to the complex nature of the joins and relationships in the MySQL DB.
We have started to run into problems with the speed and integrity of updates from within Solr. When we push updates using a soft commit, we find that it takes ~1 second for the changes to become visible. While it isn't a big issue at the moment, we are concerned that the problem will get worse, and we want to have a solution before we get there.
We are after some guidance on what solutions we should be looking at:
How big is our dataset in comparison to other solutions using Solr in this manner?
We are only using 1 server for Solr at the moment. What is the split point to move to clustering and will that help or hinder our update problem?
One solution we have seen is using a NoSQL DB for some of the data. Do NoSQL DBs have better performance on a record-by-record level?
Are there some other options that might be worth looking into?
I'll answer your questions in sequence:
1) No, your dataset is not that huge. Anything below 1 million records is fine for Solr.
2) Using one Solr server is not a good option. Try SolrCloud: it is the best way to get Solr into high availability, and it will improve your performance.
3) Both SQL and NoSQL databases have their advantages and disadvantages. It depends on your dataset. In general, NoSQL databases are faster.
4) I suggest going with SolrCloud. It is fast and reliable.
I have an optimization question.
The PHP web application that I have recently started working with has several large tables in a MySQL database. The information in these tables should be accessible at all times for business purposes, which makes them grow really big over time.
The tables are regularly written to and recent records are frequently selected.
Previous developers came up with a very weird practice for optimizing the system. They created a separate database for storing recent records in order to keep the tables compact, and they sync the records back once they grow "old" (more than 24 hours old).
The application uses the current date to pick the right database when performing a SELECT query.
This is a very weird solution in my opinion. We had a big argument over it and I am looking to change this. However, before I do, I decided to ask:
1) Has anyone ever come across anything similar before? I mean, a separate database for recent records.
2) What are the most common practices to optimize databases for this particular case?
Any opinions are welcome, as there are many ways one can go at this point.
Try using an INDEX (CREATE INDEX). That improves how the information is accessed and used.
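A minimal sketch of what that could look like (the table and column names here are hypothetical; index the columns your WHERE clauses actually filter and sort on):

    -- Speed up date-based lookups of recent records
    CREATE INDEX idx_records_created_at ON records (created_at);

    -- Composite index for queries that filter by status and sort by date
    CREATE INDEX idx_records_status_created ON records (status, created_at);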
I believe RANGE partitioning could help you here. The solution is to partition the table based on a date range.
By splitting a large table into smaller, individual tables, queries that access only a fraction of the data can run faster because there is less data to scan. Maintenance tasks, such as rebuilding indexes or backing up a table, can run more quickly.
The MySQL documentation can be useful; check this out:
https://dev.mysql.com/doc/refman/5.5/en/partitioning-columns-range.html
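A rough sketch of what that could look like (the table name, column name and boundaries are placeholders; note that in MySQL the partitioning column has to be part of every unique key on the table, including the primary key):

    ALTER TABLE records
    PARTITION BY RANGE COLUMNS (created_at) (
        PARTITION p2014 VALUES LESS THAN ('2015-01-01'),
        PARTITION p2015 VALUES LESS THAN ('2016-01-01'),
        PARTITION pmax  VALUES LESS THAN (MAXVALUE)
    );

Queries that filter on created_at then only touch the relevant partitions, and old partitions can be dropped cheaply with ALTER TABLE ... DROP PARTITION.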
We have a 10-year-old website that has been using SMF. Now we have written our very own forum script, but since we are inexperienced developers, we have no idea about optimization. Our messages table is too big (about 2 GB including indexes, 2,654,193 rows in total). SMF handled this table really quickly, but our new forum script causes a high system load average.
Here is the query list: http://i.imgur.com/NPm0DmM.jpg
Here is the table structure and indexes: http://i.imgur.com/FwPdMoI.jpg
Note: We use APC for acceleration and Memcached for caching. I'm a hundred percent sure that the messages table (and maybe the topics table) is slowing down our website.
This is just the right moment to learn all about SQL indexing.
Proper indexing is THE way to improve SQL performance. Indexing has to be done by developers.
Consider starting here (it's the free web edition of my book SQL Performance Explained):
http://use-the-index-luke.com/
Major disclaimer: all links go to my own content.
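To make that concrete for a large messages table (the column names below are only a guess at your schema): a typical topic view filters by topic and sorts by post time, so a composite index lets MySQL read just that slice instead of scanning 2.6 million rows:

    -- Serve "messages in this topic, ordered by time" straight from the index
    CREATE INDEX idx_messages_topic_time ON messages (topic_id, post_time);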
I am running a CRM application which uses a MySQL database. My application generates a lot of data in MySQL. Now I want to give my customers a reporting section where an admin can view real-time reports and filter them in real time. Basically, I want my data to be sliced and diced in real time, as fast as possible.
I have implemented the reporting using MySQL and PHP, but now that there is too much data, the queries take too long and the page does not load. After some reading I came across terms like NoSQL, MongoDB, Cassandra, OLAP, Hadoop, etc., but I was confused about which to choose. Is there any mechanism which would transfer my data from MySQL to NoSQL, on which I could run my reporting queries and serve my customers, while keeping my MySQL database as it is?
It doesn't matter what database / datastore technology you use for reporting: you still will have to design it to extract the information you need efficiently.
Improving performance by switching from MySQL to MongoDB or one of the other scalable key/value store systems is like solving a pedestrian traffic jam by building a railroad. It's going to take a lot of work to make it help the situation. I suggest you try getting things to work better in MySQL first.
First of all, you need to take a careful look at which SQL queries in your reporting system are causing trouble. You may be able to optimize their performance by adding indexes or doing other refactoring. That should be your first step. MySQL has a slow query log. Look at it.
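A minimal sketch of turning the slow query log on at runtime (the one-second threshold is just an example; these settings can also live in my.cnf):

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;             -- log anything slower than 1 second
    SHOW VARIABLES LIKE 'slow_query_log_file';  -- where the log is written

Then run EXPLAIN on the queries that show up there and add indexes accordingly.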
Secondly, you may be able to add resources (RAM, faster disks, etc) to MySQL, and you may be able to tune it for higher performance. There's a book called High Performance MySQL that offers a sound methodology for doing this.
Thirdly, many people who need to add a reporting function to their busy application use MySQL replication. That is, they configure one or two slave MySQL servers to accept copies of all data from the master server.
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
They then use the slave server or servers to run reporting queries. The slaves are ordinarily a few seconds or minutes behind the master (that is, they're slightly out of date). But it usually is good enough to give users the illusion of real-time reporting.
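A very rough sketch of what pointing a slave at the master looks like (the host, credentials and binary log coordinates are placeholders; follow the linked how-to for the full procedure):

    -- On the slave, after loading a snapshot of the master's data:
    CHANGE MASTER TO
        MASTER_HOST = 'master.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'secret',
        MASTER_LOG_FILE = 'mysql-bin.000042',
        MASTER_LOG_POS = 120;
    START SLAVE;
    SHOW SLAVE STATUS\G   -- Seconds_Behind_Master shows the replication lag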
Notice that if you use MongoDB or some other technology you will also have to replicate your data.
I will throw this link out there for you to read which actually gives certain use cases: http://www.mongodb.com/use-cases/real-time-analytics but I will speak for a more traditional setup of just MongoDB.
I have used both MySQL and MongoDB for analytical purposes, and I find MongoDB better suited, even if it needs a little bit of hacking to get it working well.
The great thing about MongoDB when it comes to retrieving analytical data is that it does not require the IO/memory to write out a separate result set each time. This makes reads on a single member of a replica set extremely scalable, since you just add your analytical collections to the working set (a.k.a. memory) and serve straight from those using batch responses (this is the default implementation of the drivers).
So with MongoDB, replication rarely gives an advantage in terms of read/write performance, and in reality with MySQL I have found it does not either. If it does, then you are doing the wrong queries, which will not scale anyway; at which point you install memcache onto your database servers and, look, you have stale data being served from memory in a NoSQL fashion anyway... whoop, I guess.
Okay, so we have some basic ideas set out; time to talk about that hack. In order to get the best possible speed out of MongoDB, and since it does not have JOINs, you need to flatten your data so that no result set will even be needed on your side.
There are many tactics for this, but the one I will mention here is pre-aggregated reports: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/ This method also works well with SQL technologies, since it is essentially the same idea as logically splitting tables to make queries faster and lighter on a large table.
What you do is take your analytical data, split it into a denomination such as per day or per month (or both), and then aggregate your data across those ranges in a de-normalised manner, essentially all in one row.
After this you can show reports straight from a collection without any need for a result set, making for some very fast querying.
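Since the same idea works in SQL, here is a sketch of what pre-aggregation could look like on the MySQL side (the table and columns are made up for illustration): each incoming event does one upsert into a daily rollup row, and reports read the small rollup table instead of the raw event log:

    CREATE TABLE daily_pageviews (
        day     DATE NOT NULL,
        page_id INT  NOT NULL,
        views   INT  NOT NULL DEFAULT 0,
        PRIMARY KEY (day, page_id)
    );

    -- Bump the pre-aggregated counter on every event
    INSERT INTO daily_pageviews (day, page_id, views)
    VALUES (CURRENT_DATE, 42, 1)
    ON DUPLICATE KEY UPDATE views = views + 1;

    -- Reports read the rollup, not the raw events
    SELECT day, SUM(views) AS views
    FROM daily_pageviews
    WHERE day >= CURRENT_DATE - INTERVAL 30 DAY
    GROUP BY day;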
Later on you could add a map-reduce step to create better analytics, but so far I have not needed to; I have built full video-based analytics without it.
This should get you started.
TiDB may be a good fit (https://en.pingcap.com/tidb/): it is MySQL-compatible, good at real-time analytics, and can replicate data from MySQL through the binlog.
I am designing a "high" traffic application which relies mainly on PHP and MySQL database queries.
I am designing the database tables so they can hold 100,000 rows; each page load queries the DB for user data.
Can I experience slow performance or database errors when there are, say, 1,000 users connected?
I'm asking because I cannot find a specification of the real performance limits of MySQL databases.
Thanks
If the user data remains unchanged between page loads, you could think about storing that information in a session.
Also, you should analyze what the read/write ratio is in your database and on specific tables. MyISAM and InnoDB are very different when it comes to locking. Many connections can slow down your server; try to cache connections.
Take a look at http://php.net/manual/en/pdo.connections.php
If designed wrongly, one user might kill your server. You need to run performance tests and find bottlenecks by profiling your code. Use EXPLAIN for your queries...
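For example (the query is only an illustration):

    EXPLAIN SELECT * FROM users WHERE email = 'someone@example.com';
    -- A NULL "key" column plus a large "rows" estimate usually means a
    -- missing index and a full table scan.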
Well-designed databases can handle tens of millions of rows, but poorly designed ones can't.
Don't worry about performance; try to design it well.
It's just hard to say whether a design is good or not; you should always do some stress tests before you launch your application or website to see the performance. The tools I often use are mysqlslap (for MySQL only) and Apache's ab command. You can google them for details.
I have an InnoDB table using numerous foreign keys, but we just want to look up some basic info from it.
I've done some research but I'm still lost.
1) How can I tell if my host has Sphinx installed already? I don't see it as an option for table storage method (i.e. InnoDB, MyISAM).
2) Is Zend_Search_Lucene responsive enough for AJAX functionality across millions of records?
3) Should I mirror my InnoDB table with a MyISAM one? Make every InnoDB transaction end with a write to the MyISAM table, then use 1:1 lookups? How would I do this automagically? This should make the MyISAM copy ACID-compliant and free(er) from corruption, no?
4) PostgreSQL fulltext queries don't even look like SQL to me, wtf. I don't have time to learn a new SQL syntax; I need noob options.
This is a high-volume site on a decently equipped VPS.
Thanks very much for any ideas.
Sphinx is a very good choice: very scalable, with built-in clustering and sharding.
Your question is very vague on what you're actually wanting to accomplish here but I can tell you to stay away from Zend_Search_Lucene with record counts that high. In my experience (and many others, including Zend Certified Engineers) ZSL's performance on large record-sets is poor at best. Use a tool like Apache Lucene instead if you go that route.