What is the best database model to store user visits and count unique users by IP in a big database, with 1,000,000 rows for example?
SELECT COUNT(DISTINCT ip) FROM visits
But with 1,000,000 different IPs this can be a slow query, and caching will not return the real number.
How do big stats systems count unique visits?
Have another MyISAM table with only an IP column and a UNIQUE index on it. You'll get the proper count in no time (MyISAM caches the number of rows in the table).
[added after comments]
If you also need to count visits from each IP, add one more column visitCount and use
INSERT INTO visitCounter (IP, visitCount)
VALUES (INET_ATON($ip), 1)
ON DUPLICATE KEY UPDATE visitCount = visitCount + 1
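For completeness, a minimal sketch of what that counter table could look like (the visitCounter name comes from the snippet above; the INT UNSIGNED type simply matches what INET_ATON() returns, and a PRIMARY KEY doubles as the UNIQUE index the answer mentions):
-- One row per distinct IP; MyISAM keeps an exact row count, so COUNT(*) is instant.
CREATE TABLE visitCounter (
    IP INT UNSIGNED NOT NULL,
    visitCount INT UNSIGNED NOT NULL DEFAULT 1,
    PRIMARY KEY (IP)
) ENGINE=MyISAM;

-- Unique visitors is then just the cached row count:
SELECT COUNT(*) FROM visitCounter;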
Don't use a relational database for that. It's not designed to store that type of information.
You can try a NoSQL database such as Mongo (I know a lot of places use that for their logging since it has so little overhead).
If you must stick with MySQL, you can add an index to the ip column which should speed things up significantly...
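If you stay with that approach, the index itself is a one-liner (assuming the table and column are called visits and ip as in the question):
-- A secondary index on ip so lookups and COUNT(DISTINCT ip) stop scanning every row.
ALTER TABLE visits ADD INDEX idx_ip (ip);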
Related
As part of a project I've been gathering data from a number of sensors deployed in the field. The aim of this is to understand the performance of the devices and any potential problems or bugs that might be present.
I'm storing the data in a single table in a database with the columns id(primary), MAC_address, name, status, timestamp.
MAC_address is the one thing that is guaranteed to always be the same for a physical device and this is what I've been using mostly to extract information from the database.
My aim was to be able to extract data over a specific time period for a specific device whose MAC_address can be selected from a dropdown. Even doing a single SELECT DISTINCT query to get a list of unique MAC addresses was taking forever, but creating an index for that column seemed to speed it up. However, it still takes >30 seconds right now to extract any number of full rows from the database.
What is the best way to go about speeding up queries from a database this large?
Single table...wrong.
One table describes each "device" (or "sensor"). It has an id which is, perhaps, a 2-byte SMALLINT UNSIGNED (range of 0..65K -- you won't have more sensors than that?). Note the MAC_address and name belong in this table. This id is used in the other table...
Another table contains the sensor_id, timestamp, and value. This table should have PRIMARY KEY(sensor_id, timestamp) and be ENGINE=InnoDB. Now it is very efficient for finding all the readings for one sensor over a period of time.
This completely avoids the SELECT DISTINCT since there are no dups in the first table.
If you are having a pulldown, then TINYINT UNSIGNED (1-byte, 0..255) is probably plenty big.
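A rough sketch of that two-table layout, using illustrative names (readings and ts are placeholders; swap in your real value columns):
-- One row per physical device.
CREATE TABLE sensors (
    id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    MAC_address CHAR(17) NOT NULL,
    name VARCHAR(64) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_mac (MAC_address)
) ENGINE=InnoDB;

-- One row per reading; the composite PK clusters readings by sensor and time.
CREATE TABLE readings (
    sensor_id SMALLINT UNSIGNED NOT NULL,
    ts TIMESTAMP NOT NULL,
    status VARCHAR(32) NOT NULL,
    PRIMARY KEY (sensor_id, ts)
) ENGINE=InnoDB;

-- Efficient: all readings for one device over a period.
SELECT ts, status FROM readings
WHERE sensor_id = 42 AND ts BETWEEN '2024-01-01' AND '2024-01-31';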
OK, I have helped you with two SELECTs; what other ones will you have? Keep in mind that performance is primarily based on the number of rows you need to touch. And that includes the rows you end up rejecting.
How to handle over 10 million records in MySQL (read-only operations)?
I have two tables: one is company, which holds records of companies (their name and the services they provide), thus 2 columns and about 3 million records; the other is employee, which has about 40 columns and about 10 million records. When trying to fetch data, even simple queries such as
Employee::paginate(100);
// In terms of MySQL it runs:
// SELECT COUNT(*) AS aggregate FROM `employee`;
// SELECT * FROM `employee` LIMIT 100 OFFSET 0;
used to take about 30 seconds, and now it takes forever.
If I want to do a search, apply a filter, or join the two tables (company and employee), it sometimes works and sometimes crashes and produces lots of errors/warnings in the MySQL server logs.
Can anyone please tell me how I can handle this volume of records more efficiently without causing a MySQL server meltdown, especially during high-traffic times?
Currently, only the primary keys (ids) and the join ids are indexed.
This is kind of a duplicate of similar questions asked on SO, but those did not help me much.
Questions I have followed on SO:
Best data store for billions of rows
Handling very large data with mysql
https://dba.stackexchange.com/questions/20335/can-mysql-reasonably-perform-queries-on-billions-of-rows
Is InnoDB (MySQL 5.5.8) the right choice for multi-billion rows?
How big can a MySQL database get before performance starts to degrade
Handling millions of records in arrays from MySQL in PHP?
Make these changes:
Use partitioning in your database based on your needs (e.g. how to partition a table by a datetime column; see the sketch after this list).
Use the simplePaginate method instead of paginate; it does not run the count query. (https://laravel.com/docs/5.7/pagination)
Try to improve your indexing; it matters a lot. (Google: MySQL indexing best practices)
If you need the count of rows, use a caching driver (like Redis).
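A rough sketch of the indexing and partitioning points, with illustrative names (company_id as the join column and created_at as a hypothetical datetime column; note that MySQL requires the partitioning column to be part of every unique key, including the primary key):
-- Index the join/filter column so joins against company stop scanning 10M rows.
ALTER TABLE employee ADD INDEX idx_company (company_id);

-- Range-partition by a datetime column if most queries are time-bounded.
ALTER TABLE employee
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);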
Part of my project involves storing and retrieving loads of IPs in my database. I have estimated that my database will have millions of IPs within months of starting the project. That being the case, I would like to know how slow simple queries to a big database can get. What will be the approximate speeds of the following queries:
SELECT * FROM table WHERE ip = '$ip' LIMIT 1
INSERT INTO table (ip, xxx, yyy) VALUES ('$ip', '$xxx', '$yyy')
on a table with 265 million rows?
Could I speed queries up by creating 255^2 tables whose names correspond to all possible first two numbers of an IPv4 address, where each table would then hold a maximum of 255^2 rows accommodating all possible second halves of the IP? So for example, to query the IP address "216.27.61.137", it would be split into 2 parts, "216.27" (p1) and "61.137" (p2). First the script would select the table named p1, then it would check whether there is a row called "p2"; if so, it would pull the required data from that row. The same process would be used to insert new IPs into the database.
If the above plan would not work what would be a good way to speed up queries in a big database?
The answers to both your questions hinge on the use of INDEXES.
If your table is indexed on ip, your first query should execute more or less immediately, regardless of the size of your table: MySQL will use the index. Your second query will slow down, as MySQL will have to update the index on each INSERT.
If your table is not indexed then the second query will execute almost immediately as MySQL can just add the row at the end of the table. Your first query may become unusable as MySQL will have to scan the entire table each time.
The problem is balance. Adding an index will speed the first query but slow the second. Exactly what happens will depend on server hardware, which database engine you choose, configuration of MySQL, what else is going on at the time. If performance is likely to be critical, do some tests first.
Before doing anything of that sort, read this question and (more importantly) its answers: How to store an IP in mySQL
It is generally not a good idea to split data among multiple tables. Database indexes are good at what they do, so just make sure you create them accordingly. A binary column to store IPv4 addresses will work rather nicely - it is more a question of query load than of table size.
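For illustration, one way to apply that advice for exact-match lookups (ip_log is a placeholder name, since the question's literal table name would clash with a reserved word; xxx and yyy are kept from the question):
-- Store IPv4 as a 4-byte unsigned integer instead of a string, and index it.
CREATE TABLE ip_log (
    ip  INT UNSIGNED NOT NULL,
    xxx VARCHAR(255),
    yyy VARCHAR(255),
    KEY idx_ip (ip)
) ENGINE=InnoDB;

-- Convert on the way in and out with INET_ATON()/INET_NTOA().
INSERT INTO ip_log (ip, xxx, yyy) VALUES (INET_ATON('216.27.61.137'), 'a', 'b');
SELECT * FROM ip_log WHERE ip = INET_ATON('216.27.61.137') LIMIT 1;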
First and foremost, you can't predict how long a query will take, even if we knew everything about the database, the database server, the network performance, and thousands of other variables.
Second, if you are using a decent database engine, you don't have to split the data into different tables. It knows how to handle big data. Leave the database functionality to the database itself.
There are several workarounds to deal with large datasets. Using the right data types and creating the right indexes will help a lot.
When you begin to have problems with your database, then search for something specific to the problem you are having.
There are no silver bullets to big data problems.
So... assuming I have a database with three tables:
Table clients
Table data
and Table clients_to_data
And I have an API which allows clients to access data from Table data. Every client has a record in Table clients (with things like IP address etc.). To log who accesses what, I write to the clients_to_data table (which contains the ID from table clients, the ID from table data, and a timestamp).
Every time a user accesses my API, they get logged in the clients_to_data table. (So records in clients and data are not updated, just read.)
I also want to be able to get the number of hits per client. Pretty easy: just query the clients_to_data table with a client_id and count the results. But as my DB grows, I'll have tens of thousands of records in the clients_to_data table.
And here's my question:
Is it better practice to add a field "hits" to Table clients that stores the number of hits for that user and to increment it every time the user queries the API?
This would be adding redundancy to the DB, which I've heard is generally a bad thing. But in this case I think it would speed up retrieving the number of hits.
So which method is better and faster in this case? Thanks for your help!
Faster when?
Appending to the table will be faster than finding the record and updating it, and much faster than reading it, incrementing it, and updating it.
However, having the hits "precalculated" will be faster than the aggregate query to count them.
What you gain on the swings you lose on the roundabouts; which choice you make depends on your current usage patterns. So are you prepared to slow down adding a hit to gain a significant boost when finding out how many you've had?
Obviously, selecting a single integer column from a table will be faster than selecting a COUNT() of rows from a table.
The complexity trade-off is a bit moot: one way you need to write more complex SQL, the other way you need to update/insert two tables in your code.
How often is the number of hits queried? Do your clients look it up, or do you check it once a month? If you only look now and then, I probably wouldn't be too concerned about the time taken to SELECT COUNT(*).
If your clients look up the hit count with every request, then I would look at storing a hits column.
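To make the two options concrete, a minimal sketch of both, assuming clients_to_data has a client_id column and clients has an id primary key (column names are assumed, not given in the question):
-- Option 1: count hits on demand from the log table.
SELECT COUNT(*) AS hits FROM clients_to_data WHERE client_id = 42;

-- Option 2: keep a precalculated counter and bump it on every API call.
ALTER TABLE clients ADD COLUMN hits INT UNSIGNED NOT NULL DEFAULT 0;
UPDATE clients SET hits = hits + 1 WHERE id = 42;
SELECT hits FROM clients WHERE id = 42;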
Now that our table structures are all clearly defined, let's get to work.
You want to record in the DB the number of times every client has accessed the data; in other terms,
insert a record into the "clients_to_data" table for every client's "impression".
You are worried about 2 things:
1. Redundancy
2. Performance when retrieving the count
What about the performance when storing the count (INSERT statements)?
This is a classic scenario, where I would write the data to be inserted into memcache, and do a bulk insert at the end of the day.
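The end-of-day flush can then be one multi-row insert instead of thousands of single-row ones; a sketch, assuming clients_to_data has client_id, data_id and accessed_at columns (names assumed):
-- Flush the buffered hits in a single statement at the end of the day.
INSERT INTO clients_to_data (client_id, data_id, accessed_at) VALUES
    (42, 7, '2024-01-31 23:59:01'),
    (42, 9, '2024-01-31 23:59:05'),
    (43, 7, '2024-01-31 23:59:12');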
More importantly, I would normalize the data before inserting it into the DB.
As for SELECTs, create indexes. If it's text, install Sphinx.
Thanks.
What is the purpose of a secondary key? Say I have a table that logs all the check-ins (similar to Foursquare), with columns id, user_id, location_id, post, time, and there can be millions of rows; many people have said to use secondary keys to speed up the process.
Why does this work? And should both user_id and location_id be secondary keys?
I'm using MySQL btw...
Edit: There will be a page that lists/calculates all the check-ins for a particular user, and another page that lists all the users who have checked in to a particular location.
MySQL queries
Type 1
SELECT location_id FROM checkin WHERE user_id = 1234
SELECT user_id FROM checkin WHERE location_id = 4321
Type 2
SELECT COUNT(location_id) as num_users FROM checkin
SELECT COUNT(user_id) as num_checkins FROM checkin
The key (also called an index) is for speeding up queries. If you want to see all check-ins for a given user, you need a key on the user_id field. If you want to see all check-ins for a given location, you need an index on the location_id field. You can read more in the MySQL documentation.
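Concretely, for the two Type 1 queries above, that would look something like this (the index names are arbitrary):
-- Speeds up: SELECT location_id FROM checkin WHERE user_id = ...
CREATE INDEX idx_checkin_user ON checkin (user_id);

-- Speeds up: SELECT user_id FROM checkin WHERE location_id = ...
CREATE INDEX idx_checkin_location ON checkin (location_id);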
I want to comment on your question and your examples.
Let me just strongly suggest that, since you are using MySQL, you make sure your tables use the InnoDB engine type, for many reasons you can research on your own.
One important feature of InnoDB is that you have referential integrity. What does that mean? In your checkin table, you have a foreign key of user_id which is the primary key of the user table. With referential integrity, MySQL will not let you insert a row with a user_id that doesn't exist in the user table. Using MyISAM, you can. That alone should be enough to make you want to use the innodb engine.
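In MySQL that enforcement applies once the constraint is actually declared; a minimal sketch, assuming the user table's primary key is id:
-- With InnoDB, this constraint rejects check-ins whose user_id has no matching user row.
ALTER TABLE checkin
    ADD CONSTRAINT fk_checkin_user
    FOREIGN KEY (user_id) REFERENCES user (id);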
To your question about keys/indexes, essentially when a table is defined and a key is declared for a column or some combination of columns, mysql will create an index.
Indexes are essential for performance as a table grows with the insert of rows.
Relational databases and document databases alike depend on an implementation of B-tree indexing. What B-trees are very good at is finding an item (or determining that it isn't there) in a predictable number of lookups. So when people talk about the performance of a relational database, the essential building block of that is the use of B-tree indexes, which are created via KEY statements or with ALTER TABLE or CREATE INDEX statements.
To understand why this is, imagine that your user table was simply a text file, with one line per row, perhaps separated by commas. As you add a row, a new line in the text file gets added at the bottom.
Eventually you get to the point that you have 10,000 lines in the file.
Now you want to find out if you entered a line for one particular person with the last name of Smith. How can you find that out?
Without any sorting of the file, or a separate index, you have but one option: start at the first line in the file and scan through every line in the table looking for a match. Even if you found a Smith, that might not be the only 'Smith' in the table, so you have to read the entire file from top to bottom every time you want to do this search.
Obviously as the table grows the performance of searching gets worse and worse.
In relational database parlance, this is known as a "table scan". The database has to start at the first row and scan through reading every row until it gets to the end.
Without indexes, relational databases still work, but they are highly dependent on IO performance.
With a Btree index, the rows you want to find are found in the index first. The indexes have a pointer directly to the data you want, so the table no longer needs to be scanned, but instead the individual data pages required are read. This is how a database can maintain adequate performance even when there are millions or 10's or 100's of millions of rows.
To really start to gain insight into how mysql works, you need to get familiar with EXPLAIN EXTENDED ... and start looking at the explain plans for queries. Simple ones like those you've provided will have simple plans that show you how many rows are being examined to get a result and whether or not they are using one or more indexes.
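For example, to check whether the first Type 1 query is using the user_id index (the exact output columns vary by MySQL version):
-- In the plan, "key" should name the user_id index and "rows" should be small;
-- type: ALL with a large rows estimate indicates a full table scan.
EXPLAIN SELECT location_id FROM checkin WHERE user_id = 1234;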
For your summary queries, indexes are not helpful because you are doing a COUNT(). The table will need to be scanned when you have no other criteria constraining the search.
I did notice what looks like a mistake in your summary queries. Just based on your labels, I would think that these are the right queries to get what you would want given your column alias names.
SELECT COUNT(DISTINCT user_id) as num_users FROM checkin
SELECT COUNT(*) as num_checkins FROM checkin
This is yet another reason to use InnoDB, which when properly configured has a data cache (innodb buffer pool) similar to other rdbms's like oracle and sql server. MyISAM doesn't cache data at all, so if you are repeatedly querying the same sorts of queries that might require a lot of IO, MySQL will have to do all that data reading work over and over, whereas with InnoDB, that data could very well be sitting in cache memory and have the result returned without having to go back and read from storage.
Primary vs Secondary
There really is no such concept internally. A primary key is special because it allows the database to find one single row. Primary keys must be unique, and to reflect that, the associated B-tree index is unique, which simply means it will not allow two keys with the same data to exist in the index.
A unique index is also an excellent tool for maintaining the consistency of your database in many other cases. Let's say you have an 'employee' table with an SS_Number column to store the social security #. It makes sense to have an index on that column if you want the system to support finding an employee by SS number. Without an index, you will table-scan. But you also want that index to be unique, so that once an employee with a given SS# is inserted, there is no way the database will let you enter a duplicate employee with the same SS#.
But to demystify this for you: when you declare keys as you define your tables, these indexes are simply created for you and, in most cases, used automagically.
It's when you aren't dealing with keys (primary or foreign), as in the example of usernames, first and last names, SS#'s, etc., that you also need to know how to create an index yourself, because you are searching (using WHERE clause criteria) on one or more columns that aren't keys.
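To tie those ideas together, a short sketch against the hypothetical employee example above (the last_name column is illustrative):
-- Unique index: fast lookup by SS#, and the database refuses duplicate SS#'s.
CREATE UNIQUE INDEX uq_employee_ssn ON employee (SS_Number);

-- Plain secondary index: speeds up WHERE clauses on a non-key column.
CREATE INDEX idx_employee_last_name ON employee (last_name);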