So... assuming I have a database with three tables:
Table clients
Table data
and Table clients_to_data
And I have an API which allows clients to access data from Table data. Every client has a record in Table clients (with things like IP address etc.). To log who accesses what, I'm logging to the table clients_to_data (which contains the ID for table clients, the ID for table data, and a timestamp).
Every time a user accesses my API, he gets logged in the clients_to_data table. (So records in clients and data are not updated, just read.)
I also want to be able to get the number of hits per client. Pretty easy: just query the clients_to_data table with a client_id and count the results. But as my DB grows, I'll have tens of thousands of records in the clients_to_data table.
And here's my question:
Is it better practice to add a field "hits" to Table clients that stores the number of hits for that user and increment it every time the user queries the API?
This would add redundancy to the DB, which I've heard is generally a bad thing. But in this case I think it would speed up retrieving the number of hits.
So which method is better and faster in this case? Thanks for your help!
Faster when?
Appending a row to the table will be faster than finding the record and updating it, and much faster than reading it, incrementing it, and writing it back.
However, having hits "precalculated" will be faster than the aggregate query to count them.
What you gain on the swings you lose on the roundabouts; which choice you make depends on your current usage patterns. So: are you prepared to slow down adding a hit in order to gain a significant boost when finding out how many you've had?
Obviously, selecting a single integer column from a table will be faster than selecting a COUNT() of rows from a table.
The complexity trade-off is a bit moot: one way you need to write more complex SQL, the other way you need to update/insert two tables in your code.
How often is the number of hits queried? Do your clients look it up, or do you check it once a month? If you only look now and then, I probably wouldn't be too concerned about the time taken to SELECT COUNT(*).
If your clients look up the hit count with every request, then I would look at storing a hits column.
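To make the trade-off concrete, a minimal sketch of both approaches (assuming a client_id column in clients_to_data, and an id plus a hypothetical hits column in clients) could look like:

SELECT COUNT(*) FROM clients_to_data WHERE client_id = 42;   -- count on demand

UPDATE clients SET hits = hits + 1 WHERE id = 42;             -- on every API call
SELECT hits FROM clients WHERE id = 42;                       -- cheap read later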
Now that our table structures are all clearly defined, let's get to work.
You want to record in the DB the number of times every client has accessed the data; in other terms:
Insert a record into the table "clients_to_data" for every client's "impression".
You are worried about two things:
1. Redundancy
2. Performance when retrieving the count
What about the performance when storing the count (the INSERT statements)?
This is a classic scenario where I would write the data to be inserted into memcache and do a bulk insert at the end of the day.
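As a rough sketch of that bulk insert (assuming the clients_to_data table from the question, with a hypothetical accessed_at timestamp column), the buffered rows would be flushed in one multi-row statement:

INSERT INTO clients_to_data (client_id, data_id, accessed_at)
VALUES
    (42, 7, '2024-01-01 10:00:00'),
    (42, 9, '2024-01-01 10:05:00'),
    (17, 7, '2024-01-01 10:06:00');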
More importantly, I will normalize the data before inserting it into the DB.
As for SELECTs, create indexes. If it's text, install Sphinx.
Thanks.
I have a MySQL database that is becoming really large. I can feel the site becoming slower because of this.
Now, on a lot of pages I only need a certain part of the data. For example, I store information about users every 5 minutes for history purposes. But on one page I only need the information that is the newest (not the whole history of data). I achieve this by a simple MAX(date) in my query.
Now I'm wondering if it wouldn't be better to make a separate table that just stores the latest data so that the query doesn't have to search for the latest data from a specific user between millions of rows but instead just has a table with only the latest data from every user.
The con here would be that I have to run 2 queries to insert the latest history in my database every 5 minutes, i.e. insert the new data in the history table and update the data in the latest history table.
The pro would be that MySQL has a lot less data to go through.
What are common ways to handle this kind of issue?
There are a number of ways to handle slow queries in large tables. The three most basic ways are:
1: Use indexes, and use them correctly. It is important to avoid table scans on large tables; this is almost always your most significant performance hit with single queries.
For example, if you're querying something like: select max(active_date) from activity where user_id=?, then create an index on the activity table for the user_id column. You can have multiple columns in an index, and multiple indexes on a table.
CREATE INDEX idx_user ON activity (user_id)
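Since an index can contain multiple columns, a variant that covers both columns (so MAX(active_date) for a given user can be resolved from the index alone) might be:

CREATE INDEX idx_user_date ON activity (user_id, active_date)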
2: Use summary/"cache" tables. This is what you have suggested. In your case, you could apply an insert trigger to your activity table, which will update your summary table whenever a new row gets inserted. This will mean that you won't need your code to execute two queries. For example:
CREATE TRIGGER update_summary
AFTER INSERT ON activity
FOR EACH ROW
UPDATE activity_summary SET last_active_date=new.active_date WHERE user_id=new.user_id
You can change that to check if a row exists for the user already and do an insert if it is their first activity. Or you can insert a row into the summary table when a user registers... or whatever.
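For example, one way to rewrite the trigger above so it also handles the "first activity" case, assuming user_id is the primary (or a unique) key of activity_summary, is INSERT ... ON DUPLICATE KEY UPDATE:

CREATE TRIGGER update_summary
AFTER INSERT ON activity
FOR EACH ROW
INSERT INTO activity_summary (user_id, last_active_date)
VALUES (new.user_id, new.active_date)
ON DUPLICATE KEY UPDATE last_active_date = new.active_date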
3: Review the query! Use MySQL's EXPLAIN command to grab a query plan and see what the optimizer does with your query. Use it to ensure that the optimizer is avoiding table scans on large tables (and either create or force an index if necessary).
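For instance, using the example query from point 1 (the user ID is just a placeholder):

EXPLAIN SELECT MAX(active_date) FROM activity WHERE user_id = 12345

The key and rows columns of the output show whether idx_user is being used and roughly how many rows MySQL expects to examine.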
I am planning to design a database which may have to store huge amounts of data, but I am not sure which way I should go about this. The records may have fields like user ID, record date, group, coordinate, and perhaps other properties like that, but the key is the user ID.
Then I may have to call (SELECT) or process the records with that user ID. There may be thousands of user IDs, so here is the question.
1-) For every record, should I store all records directly in a single table, and then call or process them like "... WHERE userId=12345 ..."?
2-) Or, for every record, should I check whether a table already exists for that user ID, create a new table (with the user ID as the table name) if not, store the data in that table, and then call or process it with "SELECT * FROM ..."?
So what would you suggest?
There are different views on using many databases vs. many tables; the common view is that there isn't any performance disadvantage. I preferred to go with the first way (a single table). The project is finished and there aren't any problems; I don't need to change the table all the time. But my main reason was that the many-tables style is a little more complicated and time-consuming to program.
1-) Store all records in a single table, and then call or process them like "... WHERE userId=12345 ...".
Besides that, here is a link from mysql.com about the drawbacks many tables can have:
Disadvantages of Creating Many Tables in the Same Database
If you have many MyISAM tables in the same database directory, open, close, and create operations are slow. If you execute SELECT statements on many different tables, there is a little overhead when the table cache is full, because for every table that has to be opened, another must be closed. You can reduce this overhead by increasing the number of entries permitted in the table cache.
(http://dev.mysql.com/doc/refman/5.7/en/creating-many-tables.html)
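To illustrate option 1, a rough sketch of a single-table layout with an index on the user ID (the table and column names are only assumptions based on the fields in the question):

CREATE TABLE records (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    userId      INT UNSIGNED NOT NULL,
    record_date DATETIME NOT NULL,
    group_name  VARCHAR(64),
    coordinate  VARCHAR(64),
    INDEX idx_user (userId)
);

SELECT * FROM records WHERE userId = 12345;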
Part of my project involves storing and retrieving loads of IPs in my database. I have estimated that my database will have millions of IPs within months of starting the project. That being the case, I would like to know how slow simple queries to a big database can get. What will be the approximate speeds of the following queries:
SELECT * FROM table WHERE ip = '$ip' LIMIT 1
INSERT INTO table (ip, xxx, yyy) VALUES ('$ip', '$xxx', '$yyy')
on a table with 265 million rows?
Could I speed queries up by creating 255^2 tables, named after the first two numbers of all possible IPv4 addresses? Each table would then have a maximum of 255^2 rows to accommodate all possible second parts of the IP. So, for example, to query the IP address "216.27.61.137", it would be split into two parts, "216.27" (p1) and "61.137" (p2). First the script would select the table named p1, then it would check whether there are any rows called "p2"; if so, it would pull the required data from that row. The same process would be used to insert new IPs into the database.
If the above plan would not work what would be a good way to speed up queries in a big database?
The answers to both your questions hinge on the use of INDEXES.
If your table is indexed on ip your first query should execute more or less immediately, regardless of the size of your table: MySQL will use the index. Your second query will slow as MySQL will have to update the index on each INSERT.
If your table is not indexed then the second query will execute almost immediately as MySQL can just add the row at the end of the table. Your first query may become unusable as MySQL will have to scan the entire table each time.
The problem is balance. Adding an index will speed up the first query but slow the second. Exactly what happens will depend on the server hardware, which database engine you choose, the configuration of MySQL, and what else is going on at the time. If performance is likely to be critical, do some tests first.
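As a concrete sketch (the table name ip_log is just a placeholder for whatever the table is called), adding that index would look like:

CREATE INDEX idx_ip ON ip_log (ip)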
Before doing anything of that sort, read this question and (more importantly) its answers: How to store an IP in mySQL
It is generally not a good idea to split data among multiple tables. Database indexes are good at what they do, so just make sure you create them accordingly. A binary column to store IPv4 addresses will work rather nicely - it is more a question of query load than of table size.
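For the binary-column idea, MySQL's INET_ATON() and INET_NTOA() convert dotted IPv4 strings to and from an integer that fits an INT UNSIGNED column and indexes compactly; reusing the hypothetical ip_log table from above:

SELECT INET_ATON('216.27.61.137');   -- 3625663881
SELECT INET_NTOA(3625663881);        -- '216.27.61.137'
SELECT * FROM ip_log WHERE ip = INET_ATON('216.27.61.137') LIMIT 1;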
First and foremost, you can't predict how long a query will take, even if we knew everything about the database, the database server, the network performance, and thousands of other variables.
Second, if you are using a decent database engine, you don't have to split the data into different tables. It knows how to handle big data. Leave the database functionality to the database itself.
There are several workarounds to deal with large datasets. Using the right data types and creating the right indexes will help a lot.
When you begin to have problems with your database, then search for something specific to the problem you are having.
There are no silver bullets to big data problems.
What's the most efficient way of counting the total number of registered users on a website?
I was thinking of using the following query, but if this table contained thousands of users, the execution time would be very long.
mysql_query("SELECT COUNT(*) FROM users");
Instead, I thought of creating a separate table that will hold this value. Each time a new user registers, or a current one is deleted, this value gets updated.
My Question:
Is it possible to carry out an INSERT and UPDATE in one query? The INSERT will be for storing the new user's details, and the UPDATE to increment the total users value.
I'm very interested in your thoughts on this.
If there is a better and faster way to find out the total registered users, I'm very interested to know.
Cheers ;)
You can use triggers to update the value every time you make an INSERT, UPDATE or DELETE.
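A minimal sketch of that approach, assuming a one-row stats table (here called user_stats, with a total_users column) holds the counter:

CREATE TRIGGER users_after_insert AFTER INSERT ON users
FOR EACH ROW UPDATE user_stats SET total_users = total_users + 1;

CREATE TRIGGER users_after_delete AFTER DELETE ON users
FOR EACH ROW UPDATE user_stats SET total_users = total_users - 1;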
if this table contained 1000's of users, the execution time will be very long.
I doubt that it would be that slow for thousands of users. If you had millions of users then it would probably be too slow.
And does your count need to be 100% accurate?
If an approximate row count is sufficient, SHOW TABLE STATUS can be used.
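For example (note that the Rows value reported for InnoDB tables is an estimate, not an exact count):

SHOW TABLE STATUS LIKE 'users'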
By the way, if you are using MyISAM then your original query will be close to instant, because the exact row count is already stored by this storage engine.
You don't do an INSERT and UPDATE in one query; rather, you do them in one "transaction".
Transactions have a concept of "atomicity", which means that other processes cannot see "part" of the transaction - it is all or nothing.
If this concept is not familiar to you, you may wish to look it up.
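A minimal sketch of that transactional approach, with hypothetical users and site_stats tables standing in for your schema:

START TRANSACTION;
INSERT INTO users (username, email) VALUES ('alice', 'alice@example.com');
UPDATE site_stats SET total_users = total_users + 1;
COMMIT;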
I have any number of users in a database (this could be 100, 2000, or 3). What I'm doing is using MySQL's "SHOW TABLES" and storing the table names in an array; then I'm running a while loop, taking every table name (the user's name), inserting it into some code, and running said piece of code for every table name. With 3 users, this script takes around 20 seconds. It uses the Twitter API and does some MySQL inserts. Is this the most efficient way to do it or not?
Certainly not!
I don't understand why you store each user in their own table. You should create a single users table and select from there.
It will run in 0.0001 seconds.
Update:
A table has rows and columns. You can store multiple users in rows, and information about each user in columns.
Please try some database design tutorials/books; they will help you a great deal.
If you're worried about storing multiple entries for each user within the same users table, you can have a separate table for tweets, with the tweet_id referring to the user.
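A rough sketch of that layout (all names here are assumptions), with one users table and one tweets table keyed back to it:

CREATE TABLE users (
    id       INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(64) NOT NULL
);

CREATE TABLE tweets (
    tweet_id BIGINT UNSIGNED PRIMARY KEY,
    user_id  INT UNSIGNED NOT NULL,
    body     VARCHAR(280),
    INDEX idx_user (user_id)
);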
I'd certainly go for one users table.
Databases are optimized for processing many rows; some of the techniques used are indexes, the physical layout of data on disk, and so on. Operations on many tables will always be slower; this is just not what an RDBMS was built to do.
There is one exception - sometimes you optimize databases by sharding (partitioning data), but this approach has as many advantages as disadvantages. One of the disadvantages is that queries like the one you described take a lot of time.
You should put all your users in one table because, from a logical point of view, they represent one entity.