Part of my project involves storing and retrieving large numbers of IPs in my database. I estimate that my database will have millions of IPs within months of starting the project. That being the case, I would like to know how slow simple queries to a big database can get. What would be the approximate speeds of the following queries:
SELECT * FROM table WHERE ip = '$ip' LIMIT 1
INSERT INTO table (ip, xxx, yyy) VALUES ('$ip', '$xxx', '$yyy')
on a table with 265 million rows?
Could I speed queries up by creating 255^2 tables, named after all possible first two octets of an IPv4 address, where each table would then hold at most 255^2 rows covering all possible second halves of the address? For example, to query the IP address "216.27.61.137", it would be split into two parts, "216.27" (p1) and "61.137" (p2). First the script would select the table named p1, then check whether it contains a row named p2, and if so pull the required data from that row. The same process would be used to insert new IPs into the database.
If the above plan would not work what would be a good way to speed up queries in a big database?
The answers to both your questions hinge on the use of INDEXES.
If your table is indexed on ip, your first query should execute more or less immediately, regardless of the size of your table: MySQL will use the index. Your second query will slow down slightly, as MySQL has to update the index on each INSERT.
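For example, a minimal sketch of adding such an index; the table name ip_log is an assumption, not your actual schema.

    -- Add a secondary index on the ip column of a hypothetical ip_log table.
    ALTER TABLE ip_log ADD INDEX idx_ip (ip);
    -- EXPLAIN should now show the lookup using idx_ip instead of a full table scan.
    EXPLAIN SELECT * FROM ip_log WHERE ip = '216.27.61.137' LIMIT 1;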
If your table is not indexed then the second query will execute almost immediately as MySQL can just add the row at the end of the table. Your first query may become unusable as MySQL will have to scan the entire table each time.
The problem is balance. Adding an index will speed up the first query but slow down the second. Exactly what happens will depend on server hardware, which database engine you choose, the configuration of MySQL, and what else is going on at the time. If performance is likely to be critical, do some tests first.
Before doing anything of that sort, read this question and (more importantly) its answers: How to store an IP in MySQL
It is generally not a good idea to split data among multiple tables. Database indexes are good at what they do, so just make sure you create them accordingly. A binary column to store IPv4 addresses will work rather nicely - it is more a question of query load than of table size.
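For illustration, a minimal sketch of one common approach from that linked question, storing IPv4 addresses as unsigned integers and converting with MySQL's INET_ATON/INET_NTOA; the table and column names are made up.

    -- A 4-byte unsigned integer holds any IPv4 address; the index keeps lookups fast.
    CREATE TABLE ip_log (
        ip  INT UNSIGNED NOT NULL,
        xxx VARCHAR(64),
        yyy VARCHAR(64),
        KEY idx_ip (ip)
    );
    INSERT INTO ip_log (ip, xxx, yyy) VALUES (INET_ATON('216.27.61.137'), 'foo', 'bar');
    SELECT INET_NTOA(ip), xxx, yyy FROM ip_log WHERE ip = INET_ATON('216.27.61.137') LIMIT 1;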
First and foremost, you can't predict how long a query will take, even if we knew all the information about the database, the database server, the network performance, and a thousand other variables.
Second, if you are using a decent database engine, you don't have to split the data into different tables. It knows how to handle big data. Leave the database functionality to the database itself.
There are several workarounds to deal with large datasets. Using the right data types and creating the right indexes will help a lot.
When you begin to have problems with your database, then search for something specific to the problem you are having.
There are no silver bullets to big data problems.
Related
How to handle over 10 million records in MySQL for read-only operations.
I have two tables. One is company, which holds a record for each company (its name and the services it provides), so it has 2 columns and about 3 million records. The other is employee, which has about 40 columns and about 10 million records. When trying to fetch data, even simple queries such as
Employee::paginate(100);
//In terms of MySQL this is:
//SELECT COUNT(*) AS aggregate FROM `employee`;
//SELECT * FROM `employee` LIMIT 100 OFFSET 0;
used to take about 30s and now it takes like forever.
If I want to do a search, apply a filter, or join the two tables (company and employee), then sometimes it works and sometimes it crashes and produces lots of errors/warnings in the SQL server logs.
Can anyone please tell me how I can handle this volume of records more efficiently without causing a SQL server meltdown, especially during high-traffic times?
Currently, only the primary keys (ids) and join ids are indexed.
This may look like a duplicate of similar questions already asked on SO, but those did not help me much.
Questions I have followed on SO:
Best data store for billions of rows
Handling very large data with mysql
https://dba.stackexchange.com/questions/20335/can-mysql-reasonably-perform-queries-on-billions-of-rows
Is InnoDB (MySQL 5.5.8) the right choice for multi-billion rows?
How big can a MySQL database get before performance starts to degrade
Handling millions of records in arrays from MySQL in PHP?
Make these changes:
Use partitioning in your database based on your needs (e.g. how to partition a table by a datetime column?); a sketch follows this list.
Use the simplePaginate method instead of paginate; it does not issue a separate count query. (https://laravel.com/docs/5.7/pagination)
Try to improve your indexing; it makes a real difference. (Google: mysql indexing best practices.)
If you need the row count, use a caching driver (like Redis).
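For illustration, a minimal sketch of range partitioning by a datetime column together with a supporting index; the cut-down table and column names here are assumptions, not the real employee schema.

    -- Hypothetical cut-down employee table, range-partitioned by year of hire_date.
    -- Note: MySQL requires the partitioning column to be part of every unique key,
    -- hence PRIMARY KEY (id, hire_date).
    CREATE TABLE employee_partitioned (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
        company_id INT UNSIGNED NOT NULL,
        hire_date  DATETIME     NOT NULL,
        PRIMARY KEY (id, hire_date),
        KEY idx_company (company_id)
    )
    PARTITION BY RANGE (YEAR(hire_date)) (
        PARTITION p2017 VALUES LESS THAN (2018),
        PARTITION p2018 VALUES LESS THAN (2019),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

Queries that filter on hire_date can then prune to a single partition, and the company_id index supports the join to company.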
On a daily basis, I get a source CSV file that has 250k rows and 40 columns; it's 290MB. I will need to filter it because it has more rows and more columns than I need.
For each row that meets the filtering criteria, we want to update that into the destination system 1 record at a time using its PHP API.
What will be the best approach for everything up until the API call (the reading / filtering / loading) for the fastest performance?
Iterating through each row of the file, deciding if it’s a row I want, grabbing only the columns I need, and then passing it to the API?
Loading ALL records into a temporary MySQL table using LOAD DATA INFILE. Then querying the table for the rows and fields I want, and iterating through the resultset passing each record to the API?
Is there a better option?
Thanks!
I need to make an assumption first: the majority of the 250K rows will go to the database. If only a very small percentage will, then iterating over the file and sending those rows in batches is definitely faster.
Different configurations could affect both approaches, but generally speaking, the 2nd approach performs better with less scripting effort.
Approach 1: the worst case is sending each row to the server individually - more round trips and more small commits.
What you can improve here is to send rows in batches, maybe a few hundred together. You will see a much better result.
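A minimal sketch of what "a few hundred together" means in SQL terms; the staging table and its columns are made up.

    -- One statement, one round trip, one commit for the whole batch.
    INSERT INTO staging (order_id, customer, amount) VALUES
        (1001, 'alice', 19.99),
        (1002, 'bob',   42.50),
        (1003, 'carol',  7.25);
    -- Repeat with the next few hundred rows from the CSV.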
Approach 2: MyISAM will be faster than InnoDB because it avoids the overhead and complexity of ACID. If MyISAM is acceptable to you, try it first.
For InnoDB, there is a better Approach 3 (which is actually a mix of approach 1 and approach 2).
Because InnoDB doesn't take a table-level lock, you can try to import multiple files concurrently, i.e., split the CSV into multiple files and execute LOAD DATA from your scripts. Don't add an auto_increment key to the table at first, to avoid the auto-inc lock.
LOAD DATA, but say @dummy1, @dummy2, etc. for columns that you don't need to keep. That gets rid of the extra columns. Load into a temp table. (1 SQL statement; see the sketch after this list.)
Do any cleansing of the data. (Some number of SQL statements, but no loop, if possible.)
Do one INSERT INTO real_table SELECT ... FROM tmp_table WHERE ... to both filter out unnecessary rows and copy the desired ones into the real table. (1 SQL statement.)
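A minimal sketch of steps 1 and 3; the file path, table names, and column names are made up, and only a few of the 40 CSV columns are shown.

    -- Step 1: load only the columns you need; @dummy variables swallow the rest.
    CREATE TEMPORARY TABLE tmp_import (
        id     INT UNSIGNED,
        name   VARCHAR(100),
        amount DECIMAL(10,2)
    );
    LOAD DATA INFILE '/path/to/source.csv'
    INTO TABLE tmp_import
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    IGNORE 1 LINES
    (id, @dummy1, name, @dummy2, amount);
    -- Step 3: filter rows while copying into the real table.
    INSERT INTO real_table (id, name, amount)
    SELECT id, name, amount
    FROM tmp_import
    WHERE amount > 0;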
You did not mention any need for step 2. Some things you might need:
Computing one column from other(s).
Normalization.
Parsing dates into the right format.
In one project I did:
1GB of data came in every hour.
Load into a temp table.
Normalize several columns (2 SQLs per column)
Some other processing.
Summarize the data into about 6 summary tables. Sample: INSERT INTO summary SELECT a,b,COUNT(*),SUM(c) FROM tmp GROUP BY a,b; or use INSERT ... ON DUPLICATE KEY UPDATE to deal with rows already existing in summary (a sketch of this follows the list).
Finally, copy the normalized, cleaned data into the 'real' table.
The entire process took 7 minutes.
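For reference, a sketch of the summarize-with-upsert pattern mentioned above; it assumes the summary table has a unique key on (a, b) plus cnt and total_c columns.

    -- Upsert: new (a,b) groups are inserted, existing ones are accumulated.
    INSERT INTO summary (a, b, cnt, total_c)
    SELECT a, b, COUNT(*), SUM(c)
    FROM tmp
    GROUP BY a, b
    ON DUPLICATE KEY UPDATE
        cnt     = cnt + VALUES(cnt),
        total_c = total_c + VALUES(total_c);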
Whether to use MyISAM, InnoDB, or MEMORY -- You would need to benchmark your own case. No index is needed on the tmp table since each processing step does a full table scan.
So, 0.3GB each 24 hours -- Should be no sweat.
I'm doing some web crawling and inserting the results into a database. It takes about 2 seconds to scrape but a lot longer to insert. There are two tables: table one is a list of URLs and their IDs, table two is a set of tagIds and siteIds.
When I add an index to the siteIds (which are MD5 hashes of the URL; I did this because it speeds up insertion, since the script doesn't have to query the database for each URL's id to add the site-tag pairings), the insert speed falls off a cliff after 300,000 or so pages.
Example
Table 1
hash                       | url              | title    | description
sjkjsajwoi20doi2jdo2xq2klm | www.somesite.com | somesite | a site with info
Table 2
site                       | tag
sjkjsajwoi20doi2jdo2xq2klm | xn\zmcbmmndkd2
When I took off the indexes it went much faster and I was able to add about 25 million records in 12 hours, but searching unindexed tags is just impossible.
I'm using PHP and mysqli for this; I'm open to suggestions for a better way to organise this data.
Hmm, this is a bit tricky as the slow-down is due to the overhead of the database needing to update the index data structure when each record is inserted.
How are you accessing the database? Using PDO from PHP? Raw SQL? Prepared statements?
I would also work out whether you need transactions or not, as the db could be implicitly using a transaction per insert, and that can slow down the inserts. For atomic records (records that are only collected, never deleted, or ones WITHOUT normalized foreign-key-dependent records) you don't need this.
You could also test whether a STORED PROCEDURE is more efficient (the db can sometimes optimize better when it has a stored procedure), and then just call that stored procedure via PDO. It is also possible that the server / db install has a hardware limitation: either storage (not on an SSD), or the db cannot access the full power of the CPU (low priority in the OS, other large processes making the db wait for CPU cycles, etc.).
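For example, a minimal sketch of the stored-procedure idea for the site-tag pairings; the table name site_tags and the parameter types are assumptions.

    -- A simple procedure that inserts one site-tag pairing.
    DELIMITER //
    CREATE PROCEDURE add_site_tag(IN p_site CHAR(32), IN p_tag VARCHAR(64))
    BEGIN
        INSERT INTO site_tags (site, tag) VALUES (p_site, p_tag);
    END //
    DELIMITER ;
    -- Then from PHP, each insert becomes: CALL add_site_tag(?, ?);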
So... assuming I have a database with three tables:
Table clients
Table data
and Table clients_to_data
And I have an API which allows clients to access data from table data. Every client has a record in table clients (with things like IP address etc.). To log who accesses what, I'm logging to the table clients_to_data (which contains the ID from table clients, the ID from table data, and a timestamp).
Every time a user accesses my API, he gets logged in the clients_to_data table. (So records in clients and data are not updated, just read.)
I also want to be able to get the number of hits per client. Pretty easy: just query the clients_to_data table with a client_id and count the results. But as my DB grows, I'll have tens of thousands of records in the clients_to_data table.
And here's my question:
Is it better practice to add a field "hits" to table clients that stores the number of hits for that client and increment it every time the client queries the API?
This would add redundancy to the DB, which I've heard is generally a bad thing. But in this case I think it would speed up retrieving the number of hits.
So which method is better and faster in this case? Thanks for your help!
Faster when?
Appending a row to the log table will be faster than finding the clients record and updating it, and much faster than reading it, incrementing it, and updating it.
However, having the hits "precalculated" will be faster than running the aggregate query to count them.
What you gain on the swings you lose on the roundabouts; which choice you make depends on your current usage patterns. So are you prepared to slow down recording a hit in order to gain a significant boost when finding out how many you've had?
Obviously, selecting a single integer column from a table will be faster than selecting a count() of rows from a table.
The complexity trade-off is a bit moot: one way you need to write slightly more complex SQL, the other way you need to update/insert two tables in your code.
How often is the number of hits queried? Do your clients look it up, or do you check it once a month? If you only look now and then, I probably wouldn't be too concerned about the time taken by SELECT COUNT(*).
If your clients look up the hit count with every request, then I would look at storing a hits column.
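A minimal sketch of the hits-column variant; the table names follow the question, while the column types and the client id 42 are placeholders.

    -- One-time schema change on the clients table.
    ALTER TABLE clients ADD COLUMN hits INT UNSIGNED NOT NULL DEFAULT 0;
    -- On every API request, alongside the clients_to_data log insert:
    UPDATE clients SET hits = hits + 1 WHERE id = 42;
    -- Reading the count is then a single-row lookup instead of an aggregate:
    SELECT hits FROM clients WHERE id = 42;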
Now that our table structures are all clearly defined, let's get to work.
You want to record in the DB the number of times each client has accessed the data; in other terms,
insert a record into the table clients_to_data for every client "impression".
You are worried about 2 things:
1. Redundancy
2. Performance when retrieving the count
But what about the performance when storing the count (the insert/update statements)?
This is a classic scenario where I would write the data to be inserted into memcache and do a bulk insert at the end of the day.
More importantly, I would normalize the data before inserting it into the DB.
As for the SELECT side, create indexes. If it's text, install Sphinx.
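On the indexing point, a minimal sketch; the column name client_id is assumed from the question's description of clients_to_data.

    -- With this index, counting hits per client scans only index entries.
    ALTER TABLE clients_to_data ADD INDEX idx_client (client_id);
    SELECT COUNT(*) FROM clients_to_data WHERE client_id = 42;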
Thanks.
I have any number of users in a database (this could be 100, 2000, or 3). What I'm doing is using MySQL's "show tables" and storing the table names in an array, then running a while loop that takes every table name (the user's name), inserts it into some code, and runs that piece of code for every table name. With 3 users, this script takes around 20 seconds. It uses the Twitter API and does some MySQL inserts. Is this the most efficient way to do it or not?
Certainly not!
I don't understand why you store each user in their own table. You should create a users table and select from there.
It will run in 0.0001 seconds.
Update:
A table has rows and columns. You can store multiple users in rows, and information about each user in columns.
Please try some database design tutorials/books; they will help you a great deal.
If you're worried about storing multiple entries for each user within the same users table, you can have a separate table for tweets, with each tweet referring back to its user.
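For illustration, a minimal sketch of that two-table layout; the names and column types are assumptions.

    -- One row per user instead of one table per user.
    CREATE TABLE users (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(64)  NOT NULL UNIQUE
    );
    -- One row per tweet, linked back to its user.
    CREATE TABLE tweets (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id  INT UNSIGNED NOT NULL,
        tweet_id BIGINT UNSIGNED NOT NULL,
        body     VARCHAR(280),
        KEY idx_user (user_id),
        FOREIGN KEY (user_id) REFERENCES users(id)
    );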
I'd certainly go for one users table.
Databases are optimized for processing many rows; some of the techniques used are indexes, physical layout of data on disk, and so on. Operations on many tables will always be slower - this is just not what an RDBMS was built to do.
There is one exception - sometimes you optimize databases by sharding (partitioning data), but this approach has as many advantages as disadvantages. One of the disadvantages is that queries like the one you described take a lot of time.
You should put all your users in one table because, from a logical point of view, they represent one entity.