I have two routes:
1) creating a sub-table for each user and storing that user's individual content
2) creating a few tables and storing the data of all users in them.
For instance:
1) 100,000 tables, each with 1,000 rows
2) 50 tables, each with 2,000,000 rows
I want to know which route is the better and more efficient one.
Context: think of Facebook, with millions of users and their posts, photos, and tags. Is all that information kept in a few giant tables for all users, or does each user have their own sub-tables?
These are some pros and cons of the two approaches in MySQL.
1. Many small tables.
Cons:
Using more tables concurrently means more file descriptors are needed (check this).
A database with 100,000 tables is a mess.
Pros:
Small tables mean small indexes. A small index can be loaded entirely into memory, which means your queries will run faster.
Also, because the indexes are small, data manipulation such as inserts will run faster.
2. Few big tables
Cons:
A huge table implies very big indexes. If your index cannot be loaded entirely into memory, most queries will be very slow.
Pros:
The database (and also your code) is clearer and easier to maintain.
You can use partitioning if your tables become too big (check this).
In my experience, a table of two million rows (I've worked with tables of 70 million rows) is not a performance problem in MySQL, as long as you are able to load your entire active index into memory.
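As a rough way to check whether your indexes fit in memory, you could compare index sizes against the buffer pool (a sketch; it assumes InnoDB and that your schema is named your_db):

-- Per-table index size, biggest first
SELECT table_name, ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.tables
WHERE table_schema = 'your_db'
ORDER BY index_length DESC;

-- Compare the totals against the configured buffer pool
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';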
If you will have many concurrent users, I suggest you evaluate other technologies such as Elasticsearch, which seems to fit this kind of scenario better.
Creating a table for each user is the worst possible design. Avoiding it is one of the first things you are taught in a database design class.
A table is a strong logical component of a database, and hence the RDBMS uses it for many maintenance tasks: it is customary to set up table file space, limitations, quotas, log space, transaction space, index tree space, and many other things per table. If every table gets its own file to put data in, you'll get big round-trip times when joining tables and the like.
When you create many tables, you'll have a really BIG maintenance overhead. You'll also be denying the very nature of relational data. And just suppose you're adding records to the database - creating a new table each time? It would make your code quite a bit harder.
But then again, you could try and see for yourself.
You should leverage the power of MySQL indexes, which will basically give you something similar to having one table per user.
Creating one table called user_data, indexed on user_id, will (in the big picture) transform a query with a WHERE clause on user_id, like this one:
SELECT picture FROM user_data WHERE user_id = INT
Into:
Look in the index to find the rows of user_data where user_id = INT.
Then, in that batch of rows, load the value of picture.
By doing that, MySQL won't search through all the rows of user_data, only the relevant ones found via the index.
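A minimal sketch of such a table (the columns beyond user_id and picture, and the index name, are assumptions for illustration):

CREATE TABLE user_data (
  data_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id INT NOT NULL,
  picture VARCHAR(255),
  INDEX idx_user (user_id)
) ENGINE=InnoDB;

-- The WHERE clause is resolved through idx_user instead of a full table scan
SELECT picture FROM user_data WHERE user_id = 42;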
Part of my project involves storing and retrieving loads of IPs in my database. I have estimated that my database will have millions of IPs within months of starting the project. That being the case, I would like to know how slow simple queries to a big database can get. What would be the approximate speeds of the following queries:
SELECT * FROM table where ip= '$ip' LIMIT 1
INSERT INTO table(ip, xxx, yyy)VALUES('$ip', '$xxx', '$yyy')
on a table with 265 million rows?
Could I speed queries up by creating 255^2 tables, named after the first two octets of all possible IPv4 addresses? Each table would then have at most 255^2 rows covering all possible second halves of the address. For example, to query the IP address "216.27.61.137", it would be split into two parts, "216.27" (p1) and "61.137" (p2). First the script would select the table named p1, then it would check whether there is a row for "p2"; if so, it would pull the required data from that row. The same process would be used to insert new IPs into the database.
If the above plan would not work what would be a good way to speed up queries in a big database?
The answers to both your questions hinge on the use of INDEXES.
If your table is indexed on ip, your first query should execute more or less immediately, regardless of the size of your table: MySQL will use the index. Your second query will be slower, as MySQL has to update the index on each INSERT.
If your table is not indexed, then the second query will execute almost immediately, as MySQL can just add the row at the end of the table. Your first query may become unusable, as MySQL will have to scan the entire table each time.
The problem is balance. Adding an index will speed up the first query but slow down the second. Exactly what happens will depend on the server hardware, which database engine you choose, the MySQL configuration, and what else is going on at the time. If performance is likely to be critical, do some tests first.
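One quick way to run such a test (a sketch; the table name ip_table and the index name idx_ip are hypothetical stand-ins for your own table):

-- Add the index, then check whether MySQL actually uses it for the lookup
ALTER TABLE ip_table ADD INDEX idx_ip (ip);

EXPLAIN SELECT * FROM ip_table WHERE ip = '216.27.61.137' LIMIT 1;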
Before splitting data into tables like that, read this question and (more importantly) its answers: How to store an IP in mySQL
It is generally not a good idea to split data among multiple tables. Database indexes are good at what they do, so just make sure you create them accordingly. A binary column to store IPv4 addresses will work rather nicely - it is more a question of query load than of table size.
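For example, a sketch of that approach (the table name ip_log and the INT UNSIGNED encoding via INET_ATON are illustrative choices, not from the question):

CREATE TABLE ip_log (
  ip  INT UNSIGNED NOT NULL,
  xxx VARCHAR(255),
  yyy VARCHAR(255),
  INDEX (ip)
) ENGINE=InnoDB;

INSERT INTO ip_log (ip, xxx, yyy) VALUES (INET_ATON('216.27.61.137'), 'a', 'b');

-- With the index on ip this stays a point lookup even at hundreds of millions of rows
SELECT * FROM ip_log WHERE ip = INET_ATON('216.27.61.137') LIMIT 1;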
First and foremost, you can't predict how long a query will take, even if you knew everything about the database, the database server, the network performance, and a thousand other variables.
Second, if you are using a decent database engine, you don't have to split the data into different tables. It knows how to handle big data. Leave the database functionality to the database itself.
There are several workarounds to deal with large datasets. Using the right data types and creating the right indexes will help a lot.
When you begin to have problems with your database, then search for something specific to the problem you are having.
There are no silver bullets to big data problems.
I am working on a tracking site like Google AdWords; I am only making modifications. On that site I have seen that they create tables for each month and merge them into one. This is done for the tables that store information about clicks and search details, and the merged tables have crores (tens of millions) of records. For querying, they use only the merged table, which has crores of records. Is there any advantage to using tables like this? The query takes more than 10 minutes to execute.
The advantages are as follows; if you don't get any of these advantages, then do not use it. (A sketch of how a MERGE table is defined follows the list.)
Easily manage a set of log tables. For example, you can put data from different months into separate tables, compress some of them with myisampack, and then create a MERGE table to use them as one.
Obtain more speed. You can split a large read-only table based on some criteria, and then put individual tables on different disks. A MERGE table structured this way could be much faster than using a single large table.
Perform more efficient searches. If you know exactly what you are looking for, you can search in just one of the underlying tables for some queries and use a MERGE table for others. You can even have many different MERGE tables that use overlapping sets of tables.
Perform more efficient repairs. It is easier to repair individual smaller tables that are mapped to a MERGE table than to repair a single large table.
Instantly map many tables as one. A MERGE table need not maintain an index of its own because it uses the indexes of the individual tables. As a result, MERGE table collections are very fast to create or remap. (You must still specify the index definitions when you create a MERGE table, even though no indexes are created.)
If you have a set of tables from which you create a large table on demand, you can instead create a MERGE table from them on demand. This is much faster and saves a lot of disk space.
Exceed the file size limit for the operating system. Each MyISAM table is bound by this limit, but a collection of MyISAM tables is not.
You can create an alias or synonym for a MyISAM table by defining a MERGE table that maps to that single table. There should be no really notable performance impact from doing this (only a couple of indirect calls and memcpy() calls for each read).
You can read more about this at the basic info link and the advantages/disadvantages link.
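As a minimal sketch of how such a MERGE table is defined (the table and column names here are hypothetical):

-- Two identical monthly MyISAM tables
CREATE TABLE clicks_2012_01 (
  click_id   INT NOT NULL,
  clicked_at DATETIME NOT NULL,
  INDEX (clicked_at)
) ENGINE=MyISAM;

CREATE TABLE clicks_2012_02 LIKE clicks_2012_01;

-- Presents the monthly tables as one logical table; new rows go to the last table
CREATE TABLE clicks_all (
  click_id   INT NOT NULL,
  clicked_at DATETIME NOT NULL,
  INDEX (clicked_at)
) ENGINE=MERGE UNION=(clicks_2012_01, clicks_2012_02) INSERT_METHOD=LAST;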
An MRG_MYISAM table only works over MyISAM tables, which are, by themselves, not the first option for a table. You would normally go for InnoDB tables.
The MRG_MYISAM engine was invented before MySQL had support for views and for partitions. Range partitioning (e.g. partition per month) is most probably what you want.
Partitioning is transparent to the user in terms of queries, but nevertheless uses pruning so as to only read from selected partitions for a query, thus optimizing it.
I would recommend that you use InnoDB tables and check out range partitioning.
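A sketch of what month-based range partitioning could look like (the table and column names are assumptions, not from the question):

CREATE TABLE clicks (
  click_id   BIGINT NOT NULL AUTO_INCREMENT,
  user_id    INT NOT NULL,
  clicked_at DATETIME NOT NULL,
  PRIMARY KEY (click_id, clicked_at)  -- the partitioning column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(clicked_at)) (
  PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
  PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- A query that filters on clicked_at only reads the relevant partitions (pruning)
SELECT COUNT(*) FROM clicks
WHERE clicked_at >= '2012-01-01' AND clicked_at < '2012-02-01';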
MRG_MYISAM and MyISAM are still in use and could work out for you. It's just that MyISAM introduces so much trouble (no crash recovery, table-level locking, and more) that it is often out of the question.
I'm developing software for conducting online surveys. When a lot of users are filling in a survey simultaneously, I'm having trouble handling the high database write load. My current table (MySQL, InnoDB) for storing survey data has the following columns: dataID, userID, item_1 .. item_n. The item_* columns have different data types corresponding to the type of data acquired by the specific items. Most item columns are TINYINT(1), but there are also some TEXT item columns. Large surveys can have more than a hundred items, leading to a table with more than a hundred columns. A user answers around 20 items in one HTTP POST, and the corresponding row has to be updated accordingly. The user may skip a lot of items, leading to a lot of NULL values in the row.
I'm considering the following solution to my write load problem. Instead of having a single table with many columns, I would set up several tables corresponding to the data types used, e.g. data_tinyint_1, data_smallint_6, data_text. Each of these tables would have only the following columns: userID, itemID, value (the value column has the data type corresponding to its table). For one HTTP POST with, say, 20 items, I might then have to create 19 rows in data_tinyint_1 and one row in data_text (instead of updating one large row with many columns). However, for every item I would need to determine its data type (via two table joins) so I know in which table to create the new row. My Zend Framework-based application code will get more complicated with this approach.
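For concreteness, the proposed layout might look like this (a sketch; the key layout and column types are assumed from the description above):

CREATE TABLE data_tinyint_1 (
  userID INT NOT NULL,
  itemID INT NOT NULL,
  value  TINYINT(1),
  PRIMARY KEY (userID, itemID)
) ENGINE=InnoDB;

CREATE TABLE data_text (
  userID INT NOT NULL,
  itemID INT NOT NULL,
  value  TEXT,
  PRIMARY KEY (userID, itemID)
) ENGINE=InnoDB;

-- One answered item becomes one small row instead of one column in a wide row
INSERT INTO data_tinyint_1 (userID, itemID, value) VALUES (17, 203, 4);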
My questions:
Will my solution be better for heavy write load?
Do you have a better solution?
Since you're getting to the point of abstracting this schema to mimic actual data types, it might stand to reason that you should simply create a new set of tables per survey instead. The benefit would be that locking will lessen, and you could isolate heavy loads on separate machines if the load becomes unbearable.
The single-survey database structure can then more accurately reflect your real-world conditions and data input handlers. It ought to make your abstraction headaches go away.
There's nothing wrong with creating tables on the fly. In some configurations, soft sharding is preferable.
The obvious solution here looks like using a document database for the fast writes and then bulk-inserting the answers into MySQL asynchronously, using cron or something like that. You could create a view in the document database for quick statistics, but allow filtering and other complicated operations only in MySQL if you're not a fan of document DBMSs.
I have any number of users in a database (this could be 100, 2,000, or 3). What I'm doing is using MySQL's "SHOW TABLES" and storing the table names in an array, then running a while loop that takes every table name (the user's name), inserts it into some code, and runs that piece of code for every table name. With 3 users, this script takes around 20 seconds. It uses the Twitter API and does some MySQL inserts. Is this the most efficient way to do it or not?
Certainly not!
I don't understand why you store each user in their own table. You should create a single users table and select from that.
It will run in 0.0001 seconds.
Update:
A table has rows and columns. You can store multiple users in rows, and information about each user in columns.
Please try some database design tutorials/books; they will help you a great deal.
If you're worried about storing multiple entries for each user within the same users table, you can have a separate table for tweets, with each tweet referring back to its user.
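A minimal sketch of that layout (all table and column names here are assumptions):

CREATE TABLE users (
  user_id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  screen_name VARCHAR(64) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE tweets (
  tweet_id BIGINT NOT NULL PRIMARY KEY,
  user_id  INT NOT NULL,
  body     VARCHAR(140) NOT NULL,
  INDEX (user_id),
  FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;

-- One indexed query replaces looping over one table per user
SELECT body FROM tweets WHERE user_id = 42;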
I'd certainly go for one users table.
Databases are optimized for processing many rows; some of the techniques used are indexes, the physical layout of data on disk, and so on. Operations across many tables will always be slower - this is just not what an RDBMS was built to do.
There is one exception - sometimes you optimize databases by sharding (partitioning data), but this approach has as many advantages as disadvantages. One of the disadvantages is that queries like the one you described take a lot of time.
You should put all your users in one table because, from a logical point of view, they represent one entity.
Using PHP, I am building an application that is MySQL database resource-heavy, but I also need its data to be very flexible. Currently there are a number of tables that have an array of different columns (including some TEXT, LONGTEXT, INT, etc.), and in the future I would like to expand the number of columns in these tables whenever new data groups are required.
My question is, if I have a table with, say, 10 columns, and I expand this to 40 columns in the future, would a SQL query (via PHP) be slowed down considerably?
Assuming the initial, small query that looks up only the first 10 columns is not a SELECT * query, I would like to know whether more resources or processing are used because the source table is now much larger.
Also, will the database in general run slower or be much larger due to many columns now constantly remaining as NULL values (eg, whenever a new entry that only requires the first 10 columns is inserted)?
MyISAM and InnoDB behave differently in this regard, for various reasons.
For instance, InnoDB will allocate disk space for each column regardless of whether it has data in it, while MyISAM will compress the tables on disk. In a case where there are a large number of empty columns, InnoDB will waste a lot of space. On the other hand, InnoDB does row-level locking, which means that (with caveats) concurrent reads and writes to the same table will perform better (MyISAM takes a table-level lock on write).
Generally speaking, it's probably not a good idea to have many columns in one table, particularly for volatility reasons. For instance, in InnoDB (possibly MyISAM also?), re-arranging columns or changing column types (e.g. VARCHAR(128) -> VARCHAR(255)) in the middle of a table requires that all data in the columns to the right be moved around on disk to make (or remove) space for the altered column.
With respect to your overall database design, it's best to aim for as many columns as possible to be not null, which saves space (you don't need the null flag on the column, and you don't store empty data) and also increases query and index performance. If many records will have a particular column set to null, you should probably move it to a foreign key relationship and use a JOIN. That way disk space and index overhead is only incurred for records that are actually holding information.
Likely, the best solution would be to create a new table with the additional fields and JOIN the tables when necessary. The original table remains unchanged, keeping its speed, but you can still get to the extra fields.
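A sketch of that split (the table and column names are hypothetical):

CREATE TABLE entries (
  entry_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title    VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE entry_extras (
  entry_id INT NOT NULL PRIMARY KEY,
  notes    TEXT NOT NULL,
  FOREIGN KEY (entry_id) REFERENCES entries (entry_id)
) ENGINE=InnoDB;

-- Only rows that actually have the extra fields pay for them
SELECT e.title, x.notes
FROM entries e
JOIN entry_extras x ON x.entry_id = e.entry_id;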
Optimization is not a trivial question; nothing can be predicted.
In general, the short answer is: yes, it will be slower (because the DBMS at least needs to read more data from disk and send it, obviously).
But how much slower depends very much on the particular case. You might not even see the difference, or it might be 10x slower.
In all likelihood, no, it won't be slowed down considerably.
However, a better question to ask is: Which method of adding more fields results in a more elegant, understandable, maintainable, cost effective solution?
Usually the answer is "It depends." It depends on how the data is accessed, how the requirements will change, how the data is updated, and how fast the tables grow.
You can divide one master table into multiple transaction tables, which should give you much faster results than you are getting now. Also make the primary key a UNIQUE KEY in all the transaction tables as well as the master tables; this really helps make your queries faster.
Thanks.