I'm developing software for conducting online surveys. When a lot of users are filling in a survey simultaneously, I have trouble handling the high database write load. My current table (MySQL, InnoDB) for storing survey data has the following columns: dataID, userID, item_1 .. item_n. The item_* columns have different data types corresponding to the type of data acquired with the specific items. Most item columns are TINYINT(1), but there are also some TEXT item columns. Large surveys can have more than a hundred items, leading to a table with more than a hundred columns. A user answers around 20 items in one HTTP POST, and the corresponding row has to be updated accordingly. The user may skip a lot of items, leading to a lot of NULL values in the row.
I'm considering the following solution to my write load problem. Instead of having a single table with many columns, I would set up several tables corresponding to the data types in use, e.g. data_tinyint_1, data_smallint_6, data_text. Each of these tables would have only the following columns: userID, itemID, value (the value column has the data type corresponding to its table). For one HTTP POST with, say, 20 items, I might then have to create 19 rows in data_tinyint_1 and one row in data_text (instead of updating one large row with many columns). However, for every item I need to determine its data type (via two table joins) so that I know in which table to create the new row. My Zend Framework based application code will get more complicated with this approach.
My questions:
Will my solution be better for heavy write load?
Do you have a better solution?
Since you're getting to the point of abstracting this schema to mimic actual data types, it might make sense to simply create a new set of tables per survey instead. The benefit is that lock contention will lessen, and you could move heavy loads to separate machines if the load becomes unbearable.
The single-survey database structure then can more accurately reflect your real world conditions and data input handlers. It ought to make your abstraction headaches go away.
There's nothing wrong with creating tables on the fly. In some configurations, soft sharding is preferable.
The obvious solution here looks like using a document database for fast writes and then bulk-inserting the answers into MySQL asynchronously, using cron or something similar. You can create a view in the document database for quick statistics, but allow filtering and other complicated operations only in MySQL if you're not a fan of document DBMSs.
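For reference, the narrow per-type tables proposed in the question could be sketched roughly as follows. This is only an illustration: it uses Python's sqlite3 module as a stand-in for MySQL/InnoDB so that it runs anywhere, and the ITEM_TABLE lookup and the save_answers helper are hypothetical names, not something from the original application. The idea is to resolve each item's target table from a map loaded once per survey (rather than with joins on every request) and to write all rows of one POST in a single transaction.

```python
import sqlite3

# SQLite stands in for MySQL/InnoDB; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_tinyint_1 (
        userID INTEGER NOT NULL,
        itemID INTEGER NOT NULL,
        value  INTEGER,
        PRIMARY KEY (userID, itemID)
    );
    CREATE TABLE data_text (
        userID INTEGER NOT NULL,
        itemID INTEGER NOT NULL,
        value  TEXT,
        PRIMARY KEY (userID, itemID)
    );
""")

# Hypothetical itemID -> table map, loaded once per survey (assumption).
ITEM_TABLE = {1: "data_tinyint_1", 2: "data_tinyint_1", 3: "data_text"}


def save_answers(conn, user_id, answers):
    """Store the answers of one POST, grouped by target table, in one transaction."""
    rows_by_table = {}
    for item_id, value in answers.items():
        rows_by_table.setdefault(ITEM_TABLE[item_id], []).append((user_id, item_id, value))
    with conn:  # commits on success, rolls back on error
        for table, rows in rows_by_table.items():
            # Table names come from ITEM_TABLE, never from user input.
            conn.executemany(
                f"INSERT OR REPLACE INTO {table} (userID, itemID, value) VALUES (?, ?, ?)",
                rows,
            )


save_answers(conn, 42, {1: 1, 2: 0, 3: "free text answer"})
```

On MySQL the counterpart of INSERT OR REPLACE would be REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE. Whether this layout actually handles the write load better than the wide table would still have to be measured.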
I have a MySQL database that is becoming really large. I can feel the site becoming slower because of this.
Now, on a lot of pages I only need a certain part of the data. For example, I store information about users every 5 minutes for history purposes. But on one page I only need the information that is the newest (not the whole history of data). I achieve this by a simple MAX(date) in my query.
Now I'm wondering if it wouldn't be better to make a separate table that just stores the latest data so that the query doesn't have to search for the latest data from a specific user between millions of rows but instead just has a table with only the latest data from every user.
The con here would be that I have to run 2 queries to insert the latest history in my database every 5 minutes, i.e. insert the new data in the history table and update the data in the latest history table.
The pro would be that MySQL has a lot less data to go through.
What are common ways to handle this kind of issue?
There are a number of ways to handle slow queries in large tables. The three most basic ways are:
1: Use indexes, and use them correctly. It is important to avoid table scans on large tables; this is almost always your most significant performance hit with single queries.
For example, if you're querying something like: select max(active_date) from activity where user_id=?, then create an index on the activity table for the user_id column. You can have multiple columns in an index, and multiple indexes on a table.
CREATE INDEX idx_user ON activity (user_id)
2: Use summary/"cache" tables. This is what you have suggested. In your case, you could apply an insert trigger to your activity table, which will update your summary table whenever a new row gets inserted. This means your code won't need to execute two queries. For example:
CREATE TRIGGER update_summary
AFTER INSERT ON activity
FOR EACH ROW
UPDATE activity_summary SET last_active_date=new.active_date WHERE user_id=new.user_id
You can change that to check whether a row already exists for the user and do an insert if it is their first activity. Alternatively, you can insert a row into the summary table when a user registers.
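A minimal sketch of that "insert on first activity, update afterwards" variant is shown here. It uses Python's sqlite3 module as a stand-in for MySQL so it can be run as-is; the trigger body therefore uses SQLite's INSERT OR REPLACE, and the column types and sample dates are assumptions. On MySQL the same effect is usually achieved with INSERT ... ON DUPLICATE KEY UPDATE inside the trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (
        user_id     INTEGER NOT NULL,
        active_date TEXT    NOT NULL
    );
    CREATE TABLE activity_summary (
        user_id          INTEGER PRIMARY KEY,
        last_active_date TEXT NOT NULL
    );
    -- Creates the summary row on a user's first activity and overwrites it afterwards.
    CREATE TRIGGER update_summary
    AFTER INSERT ON activity
    FOR EACH ROW
    BEGIN
        INSERT OR REPLACE INTO activity_summary (user_id, last_active_date)
        VALUES (NEW.user_id, NEW.active_date);
    END;
""")

conn.execute("INSERT INTO activity VALUES (1, '2024-01-01 10:00:00')")
conn.execute("INSERT INTO activity VALUES (1, '2024-01-01 10:05:00')")
conn.execute("INSERT INTO activity VALUES (2, '2024-01-01 10:05:00')")

# One row per user, holding the most recently inserted date.
summary = dict(conn.execute("SELECT user_id, last_active_date FROM activity_summary"))
```

Like the original trigger, this keeps the most recently inserted date, which equals the latest date only as long as rows arrive in chronological order.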
3: Review the query! Use MySQL's EXPLAIN command to get the query plan and see what the optimizer does with your query. Use it to make sure the optimizer is avoiding table scans on large tables (and either create or force an index if necessary).
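To illustrate points 1 and 3 together, here is a small sketch that inspects the query plan before and after adding an index. It runs on Python's sqlite3 module, so it uses SQLite's EXPLAIN QUERY PLAN rather than MySQL's EXPLAIN, and the output format differs between the two. The composite index (user_id, active_date) and the helper name plan are my own assumptions; the answer above only indexes user_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (user_id INTEGER NOT NULL, active_date TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?)",
    [(u, f"2024-01-01 00:{m:02d}:00") for u in range(1, 51) for m in range(0, 60, 5)],
)

query = "SELECT MAX(active_date) FROM activity WHERE user_id = ?"


def plan(conn, sql, params):
    """Return the planner's description of each step (last column of every plan row)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


before = plan(conn, query, (7,))  # no index yet: the table has to be scanned
conn.execute("CREATE INDEX idx_user_date ON activity (user_id, active_date)")
after = plan(conn, query, (7,))   # the plan should now mention idx_user_date

print(before)
print(after)
```

The same habit carries over to MySQL: run EXPLAIN on the slow query, add or adjust the index, and run EXPLAIN again to confirm that the plan changed.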
I'm creating a MySQL database for a small app.
The problem is that too many fields are identical across different tables, for example:
Table 1: Municipal Issues:
ID,
UserID,
Title,
Location,
Description,
ImageURL
Table 2: Harassment Issues:
ID,
UserID,
Title,
Location,
Description,
ImageURL
Table 3: same as above
All of the tables have almost the same columns.
I want to ask whether it's better to use relations, creating one table that handles the IDs and links to the other details, or to create a single table with an extra column for the issue type.
On one hand, there will be too many tables with identical columns.
On the other hand, there will be few tables with too many rows in them.
Which is better for performance: more rows or more tables?
I'm using MySQL.
Firstly, unless you expect millions of records, don't worry that much about performance; care more about the structure of your data and how easy it will be to access. Literally write down a list of the data you plan to extract in your app, e.g. "find all issues today" or "find all unresolved issues older than 6 months", and then try to write real SQL queries against your planned structure. If they are hard to write, change the structure.
To answer your question: it depends. The current structure has the following benefits:
It's easy to query a certain type of issue
It's easy to build a PHP application - just make one template form (or model) and then copy and paste it with slight changes for the other tables
In case of performance problems, it may be easier to create a cluster by simply putting each table on a different DB server.
and the following downsides:
It's inflexible. Adding a new field that you forgot to add in the beginning will be painful, since you'll have to change 3 (or more) tables and then the same number of places in your app.
Adding new types of issues will be painful and will require creating a new table.
Writing SQL to get data like "all non-resolved issues (regardless of type)" will require complicated UNIONs. Moreover, these UNIONs will need a virtual field holding the issue type; otherwise you can't tell which table a certain id came from.
The classical DB approach recommends using one table for the common fields and creating derived tables for the fields that differ. So:
issues table should have all common fields and is identified by PK issue_id
municipal_issues uses the foreign key to issues.issue_id and has only the specific fields
harassment_issues uses the foreign key to issues.issue_id and has only the specific fields
also the issues table has the issue_type field that takes values "harassment", "municipal" etc and helps finding the table where the additional data are stored.
This pattern is called "Class Table Inheritance", and you may want to check out the SQL Antipatterns presentation for more info and other approaches. It solves the flexibility issue and still allows re-creating each of the original tables with a single simple JOIN, which is pretty fast.
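A minimal sketch of that layout, run through Python's sqlite3 module as a stand-in for MySQL. The common columns are taken from the question; the type-specific columns ward and is_anonymous, as well as the sample rows, are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE issues (
        issue_id    INTEGER PRIMARY KEY,
        issue_type  TEXT    NOT NULL,   -- 'municipal', 'harassment', ...
        user_id     INTEGER NOT NULL,
        title       TEXT    NOT NULL,
        location    TEXT,
        description TEXT,
        image_url   TEXT
    );
    -- 'ward' and 'is_anonymous' are hypothetical type-specific fields.
    CREATE TABLE municipal_issues (
        issue_id INTEGER PRIMARY KEY REFERENCES issues(issue_id),
        ward     TEXT
    );
    CREATE TABLE harassment_issues (
        issue_id     INTEGER PRIMARY KEY REFERENCES issues(issue_id),
        is_anonymous INTEGER NOT NULL DEFAULT 0
    );
""")

with conn:
    conn.execute("INSERT INTO issues VALUES (1, 'municipal', 10, 'Broken street light', 'Main St', NULL, NULL)")
    conn.execute("INSERT INTO municipal_issues VALUES (1, 'Ward 4')")
    conn.execute("INSERT INTO issues VALUES (2, 'harassment', 11, 'Incident report', 'Park', NULL, NULL)")
    conn.execute("INSERT INTO harassment_issues VALUES (2, 1)")

# A query across all issue types needs no UNION.
all_titles = [row[0] for row in conn.execute("SELECT title FROM issues ORDER BY issue_id")]

# One JOIN re-creates the original "municipal issues" table.
municipal = conn.execute("""
    SELECT i.issue_id, i.title, m.ward
    FROM issues AS i
    JOIN municipal_issues AS m ON m.issue_id = i.issue_id
""").fetchall()
```

Adding a new common field then means altering only the issues table, and a new issue type means one new small table plus a new issue_type value.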
Also as a side note you may look into the db schema of bug-trackers like Mantis since this looks like the same domain.
I am planning to design a database which may have to store huge amounts of data, but I am not sure which approach I should use. The records may have fields like user id, record date, group, coordinate and perhaps other properties like that, but the key is the user id.
I may then have to select or process the records with a given user id. There may be thousands of user ids, so here is the question.
1-) For every record, should I store all records directly in a single table and
then select or process them like "... WHERE userId=12345 ..."?
2-) For every record, should I check whether a table for that
user id exists and, if not, create a new table with the user id as the table name,
store its data in that table, and then select or process them with
"SELECT * FROM ..."?
So what would you suggest?
There are different views about using many databases vs. many tables. The common view is that there isn't any performance disadvantage. I preferred to go with the 1st way (single table). The project is finished and there aren't any problems. I don't need to change the table all the time. But my main reason was that the many-tables style is a little more complicated and time-consuming to program.
1-) For every record, should I store all records directly in a single table and then select or process them like "... WHERE userId=12345 ..."?
Besides that, here is what the MySQL documentation at mysql.com says about creating many tables:
Disadvantages of Creating Many Tables in the Same Database
If you have many MyISAM tables in the same database directory, open, close, and create operations are slow. If you execute SELECT statements on many different tables, there is a little overhead when the table cache is full, because for every table that has to be opened, another must be closed. You can reduce this overhead by increasing the number of entries permitted in the table cache.
(http://dev.mysql.com/doc/refman/5.7/en/creating-many-tables.html)
I have any number of users in a database (this could be 100, 2000, or 3). What I'm doing is using MySQL's "show tables" and storing the table names in an array. Then I run a while loop, take every table name (the user's name), insert it into some code, and run that piece of code for every table name. With 3 users, this script takes around 20 seconds. It uses the Twitter API and does some MySQL inserts. Is this the most efficient way to do it or not?
Certainly not!
I don't understand why you store each user in their own table. You should create a users table and select from there.
The query itself will then run in a tiny fraction of a second.
Update:
A table has rows and columns. You can store multiple users in rows, and information about each user in columns.
Please try some database design tutorials/books; they will help you a great deal.
If you're worried about storing multiple entries for each user within the same users table, you can have a separate table for tweets, with each tweet row referring to the user it belongs to.
I'd certainly go for one users table.
Databases are optimized for processing many rows; some of the techniques used are indexes, the physical layout of data on disk, and so on. Operations on many tables will always be slower - this is just not what RDBMSs were built to do.
There is one exception - sometimes you optimize databases by sharding (partitioning data), but this approach has as many advantages as disadvantages. One of the disadvantages is that queries like the one you described take a lot of time.
You should put all your users in one table because, from a logical point of view, they represent one entity.
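As a rough sketch of the single users table plus a separate tweets table suggested above, using Python's sqlite3 module as a stand-in for MySQL. The column names and the fetch_tweets function are hypothetical; fetch_tweets merely stands in for the Twitter API call mentioned in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE tweets (
        tweet_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL REFERENCES users(user_id),
        body     TEXT NOT NULL
    );
    CREATE INDEX idx_tweets_user ON tweets (user_id);
""")


def fetch_tweets(username):
    """Hypothetical placeholder for the Twitter API call in the question."""
    return [f"hello from {username}"]


with conn:
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)",
        [("alice",), ("bob",), ("carol",)],
    )

# One SELECT over rows replaces "show tables" plus a loop over table names.
users = conn.execute("SELECT user_id, username FROM users ORDER BY user_id").fetchall()

with conn:  # all inserts of this run go into a single transaction
    for user_id, username in users:
        conn.executemany(
            "INSERT INTO tweets (user_id, body) VALUES (?, ?)",
            [(user_id, body) for body in fetch_tweets(username)],
        )

tweet_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Adding a user is then a plain INSERT into users, with no schema change involved.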
I'm working on a basic php/mysql CMS and have a few questions regarding performance.
When viewing a blog page (or other sortable data) from the front-end, I want to allow a simple 'sort' variable to be added to the querystring, allowing posts to be sorted by any column. Obviously I can't accept just anything from the querystring, and need to make sure the column exists on the table.
At the moment I'm using
SHOW TABLES;
to get a list of all of the tables in the database, then looping the array of table names and performing
SHOW COLUMNS;
on each.
My worry is that my CMS might take a performance hit here. I thought about using a static array of the table names but need to keep this flexible as I'm implementing a plugin system.
Does anybody have any suggestions on how I can keep this more concise?
Thank you
If you are using MySQL 5+, you'll find the information_schema database useful for this task. In this database you can access information about tables, columns and references with simple SQL queries. For example, you can check whether a specific column exists in a table:
SELECT COUNT(*) FROM information_schema.COLUMNS
WHERE
TABLE_SCHEMA='your_database_name' AND
TABLE_NAME='your_table' AND
COLUMN_NAME='your_column';
Here is how to list the tables in which a specific column exists:
SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.COLUMNS WHERE COLUMN_NAME='your_column';
Since you're currently hitting the db twice before you do your actual query, you might want to consider just wrapping the actual query in a try{} block. Then if the query works you've only done one operation instead of 3. And if the query fails, you've still only wasted one query instead of potentially two.
The important caveat (as usual!) is that any user input be cleaned before doing this.
You could query the table up front and store the columns in a cache layer (e.g. memcache or APC). You could then set the expiry time on the cache entry to infinite and only delete and re-create it when a plugin has been added, updated, etc.
I guess the best bet is to put everything you're getting from SHOW TABLES etc. into a file ahead of time and just include it, instead of running those queries every time. Or implement some sort of caching if the project is still in development and you think the fields will change.