I need to perform a large SELECT ... WHERE ... IN query on MySQL and I need it to run quickly. I have a table with more than 100 million rows with the primary key on a VARCHAR(127) column (and it has to be that way).
I am running SELECT col1 FROM table WHERE col1 IN ($in), where $in has 5,000 values. Essentially I just need to find which of the 5,000 values exist in the table's primary key col1.
The query takes between 1 and 10 seconds generally but is usually about 7 or 8 seconds.
Is there a faster, more efficient way of performing SELECTs with large IN clauses on huge tables indexed by a varchar?
I am using InnoDB on a dedicated server with PHP and PDO. Thanks for the suggestions.
This is a bit long for a comment.
I am guessing that you already have an index on table(col1); otherwise the query would probably take longer than 10 seconds. If not, then add an index. Better yet, make the column the primary key.
I have a suspicion that the index doesn't fit into memory. For this, you will need to find a MySQL DBA (which you should have if you have such a large table) or learn about the memory options for MySQL. An index not fitting into memory would exhibit this type of behavior.
If this is true, then the behavior should be pretty linear. So, if you have a list of 500 ids, it should take about one second or a bit less. If you have 50 ids, then a tenth of a second or so.
It is possible that sorting the list of ids would help in this case. However, that is really just speculation on my part.
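If you want to try the sorting idea, a minimal PHP/PDO sketch might look like this ($pdo is an assumed PDO handle and $ids is an array holding the 5,000 values; both names are placeholders):

sort($ids); // sorted values walk the primary-key B-tree roughly in order
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT col1 FROM `table` WHERE col1 IN ($placeholders)");
$stmt->execute($ids); // one placeholder per value, so the driver handles quoting
$found = $stmt->fetchAll(PDO::FETCH_COLUMN); // the values that exist in col1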
Related
On a daily basis, I get a source CSV file that has 250k rows and 40 columns; it's 290 MB. I will need to filter it because it has more rows and more columns than I need.
For each row that meets the filtering criteria, we want to update it in the destination system one record at a time using its PHP API.
What will be the best approach for everything up until the API call (the reading / filtering / loading) for the fastest performance?
Iterating through each row of the file, deciding if it’s a row I want, grabbing only the columns I need, and then passing it to the API?
Loading ALL records into a temporary MySQL table using LOAD DATA INFILE. Then querying the table for the rows and fields I want, and iterating through the resultset passing each record to the API?
Is there a better option?
Thanks!
I need to make an assumption first: the majority of the 250K rows will go to the database. If only a very small percentage will, then iterating over the file and sending the matching rows in a batch is definitely faster.
Different configurations could affect both approaches, but generally speaking, the 2nd approach performs better with less scripting effort.
Approach 1: the worst part is sending each row to the server on its own; that means more round trips and more small commits.
What you can improve here is to send rows in batches, maybe a few hundred together, as in the sketch below. You will see a much better result.
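A rough sketch of that batching with PDO (here $pdo is an assumed PDO handle, staging and its three columns are made-up names, and $rows holds the already-filtered rows as plain arrays):

foreach (array_chunk($rows, 500) as $chunk) {
    // one "(?,?,?)" group per row in this chunk
    $group = '(' . implode(',', array_fill(0, 3, '?')) . ')';
    $sql = 'INSERT INTO staging (col_a, col_b, col_c) VALUES '
         . implode(',', array_fill(0, count($chunk), $group));
    $pdo->prepare($sql)->execute(array_merge(...$chunk));
}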
Approach 2: MyISAM will be faster than InnoDB because it avoids the overhead and complexity of ACID. If MyISAM is acceptable to you, try it first.
For InnoDB, there is a better Approach 3 (which is actually a mix of approach 1 and approach 2).
Because InnoDB doesn't take a table lock, you can try to import multiple files concurrently, i.e., split the CSV into several files and execute LOAD DATA from your scripts in parallel. Don't add an auto_increment key to the table at this stage, to avoid the auto-inc lock.
LOAD DATA, but assign the columns that you don't need to keep to user variables such as @dummy1, @dummy2, etc. That gets rid of the extra columns. Load into a temp table. (1 SQL statement.)
Do any cleansing of the data. (Some number of SQL statements, but no loop, if possible.)
Do one INSERT INTO real_table SELECT ... FROM tmp_table WHERE ... to both filter out unnecessary rows and copy the desired ones into the real table. (1 SQL statement; see the sketch below.)
You did not mention any need for step 2. Some things you might need:
Computing one column from other(s).
Normalization.
Parsing dates into the right format.
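A rough sketch of steps 1 and 3 through PDO (the file path, table names, and column names are all made up, and LOAD DATA LOCAL also requires the local-infile option to be enabled):

$pdo->exec("
    LOAD DATA LOCAL INFILE '/path/to/source.csv'
    INTO TABLE tmp_table
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    IGNORE 1 LINES
    (keep_a, keep_b, @dummy1, @dummy2, keep_c)   -- @dummyN swallows the unwanted columns
");

$pdo->exec("
    INSERT INTO real_table (keep_a, keep_b, keep_c)
    SELECT keep_a, keep_b, keep_c
    FROM tmp_table
    WHERE keep_b = 'wanted'                      -- your row filter
");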
In one project I did:
1GB of data came in every hour.
Load into a temp table.
Normalize several columns (2 SQLs per column)
Some other processing.
Summarize the data into about 6 summary tables. Sample: INSERT INTO summary SELECT a, b, COUNT(*), SUM(c) FROM tmp GROUP BY a, b; or an INSERT ... ON DUPLICATE KEY UPDATE to deal with rows already existing in summary.
Finally copy the normalized, cleaned, data into the 'real' table.
The entire process took 7 minutes.
Whether to use MyISAM, InnoDB, or MEMORY -- You would need to benchmark your own case. No index is needed on the tmp table since each processing step does a full table scan.
So, 0.3GB each 24 hours -- Should be no sweat.
Scenario 1
I have one table, let's say "member". In that table "member" I have 7 fields (memid, login_name, password, age, city, phone, country) and 10K records. I need to fetch one record, so I'm using a query like this:
mysql_query("select * from member where memid=999");
Scenario 2
I have the same table called "member", but I split it into two tables, member and member_txt. In member_txt I have (memid, age, phone, city, country) and in member I have (memid, login_name, password).
Which is the best scenario for fetching the data quickly: keeping a single table, or splitting the table into two with a reference between them?
Note: I need to fetch this data with PHP and MySQL. Please let me know which is the best method to follow.
we have 10K records
For your own health, use the single table approach.
As long as you are using a primary key for memid, things are going to be lightning fast. This is because a PRIMARY KEY automatically gets an index, which points straight to the location of the data and removes the need to scan through everything else, as it otherwise would.
From http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. If a table has 1,000 rows, this is at least 100 times faster than reading sequentially. If you need to access most of the rows, it is faster to read sequentially, because this minimizes disk seeks.
Your second approach only makes your system more complex, and provides no benefits.
Use scenario 1.
Make memid a primary/unique key; then having one table is faster than having two tables.
In general you should not see too much impact on performance with 10K rows, as long as you are accessing them by your primary key.
Also note that fetching data from one table is faster than fetching data from 2 tables.
If you want to optimize further, list the column names in the SELECT statement instead of using the * operator.
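A minimal sketch of scenario 1 with PDO ($pdo is an assumed PDO handle; the ALTER is only needed if memid is not already the primary key):

// only needed if memid is not already the primary key
$pdo->exec("ALTER TABLE member ADD PRIMARY KEY (memid)");

$stmt = $pdo->prepare("SELECT memid, login_name, age, city FROM member WHERE memid = ?");
$stmt->execute([999]);
$row = $stmt->fetch(PDO::FETCH_ASSOC); // single index lookup, no table scan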
I have 35 large databases (1M+ rows with 35 columns each) and each one gets updated with per-row imports based on the primary key.
I am thinking about grouping these updates into blocks, disabling the keys and then re-enabling them.
Does anyone know when disabling the keys is recommended? E.g. if I were going to update a single record it would be a terrible idea, but if I wanted to update every record it would be a good idea. Are there any mathematical formulae to follow for this, or should I just keep benchmarking?
I would disable my keys when I notice that there are particular performance effects on inserts / updates. These are the most prone to getting bogged down in foreign-key problems. Inserting a row into a fully keyed/indexed table with tens of millions of records can be a nightmare, if there are a lot of columns and non-null attributes in the insert. I wouldn't worry about keys/indices in a small table --- in smaller tables (let's say ~500,000 rows or less with maybe 6 or 7 columns) the keys probably aren't going to kill you.
As hinted above, you must also consider disabling the real-time management of indices when you are doing this. Indices, if maintained by the database in real-time, will slow down operations that change the tables in the database as well.
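For reference, a hedged sketch of what that can look like in MySQL (big_table is a placeholder; ALTER TABLE ... DISABLE KEYS only defers non-unique MyISAM indexes and is ignored by InnoDB, where relaxing the checks and batching the work into one transaction is the usual substitute):

// MyISAM: defer maintenance of non-unique indexes around a bulk block of changes
$pdo->exec("ALTER TABLE big_table DISABLE KEYS");
// ... run the block of INSERT/UPDATE statements here ...
$pdo->exec("ALTER TABLE big_table ENABLE KEYS");

// InnoDB alternative: relax checks for the session while the batch runs
$pdo->exec("SET unique_checks = 0");
$pdo->exec("SET foreign_key_checks = 0");
// ... batch of changes, ideally inside one transaction ...
$pdo->exec("SET foreign_key_checks = 1");
$pdo->exec("SET unique_checks = 1");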
Regarding mathematical formulae: you can look at the trends in your insert/update speed when you do / do not have indices, with respect to database size. At some point (i.e. once your DB reaches a certain size) you might find that the time for an insert starts increasing geometrically, or that it takes a steep "jump". If you can find these points in your system, you'll know when you are pushing it to the limit, and a good admin might even be able to tell you WHY the system performance drops at those points.
Ironically -- sometimes keys/indices speed things up! Indices and keys can speed up some updates and inserts by making any subqueries or other lookup operations extremely fast. So if an operation is slow you might ask yourself: "Is there some static data that I can index to speed the lookup operation up?"
I'm looking for the best, most scalable way of keeping track of a large number of on/offs. The on/offs apply to items numbered from 1 to about 60 million. (In my case the on/off is whether a member's book has been indexed or not, a separate process.)
The on/offs must be searched rapidly by item number. They change constantly, so re-indexing costs can't be high. New items are added to the end of the table less often.
The ideal solution would, I think, be an index-only table -- a table where every field is part of the primary key. I gather Oracle has this, but no engine for MySQL has it.
If I use MySQL I think my choice is between:
a two-field table--the item and the "on/off" field. Changes would be handled with UPDATE.
a one-field table--the item. Being in the table means being "on." Changes are handled with INSERT and DELETE.
I am open to other technologies. Storing the whole thing bitwise in a file?
You may have more flexibility by using option #1, but both would work effectively. However, if speed is an issue, you might want to consider creating a HEAP (MEMORY) table that is pre-populated on MySQL startup and maintained in situ by your other processes. Also, use INT and ENUM field types in the table. Since it will all be held in memory, it should be lightning fast, and because there is not a lot of data stored per record, 60 million records shouldn't be a huge burden, memory-wise. If I had to roughly estimate:
INT(8) (for growth, assuming you'll exceed 100 million records someday)
ENUM('0','1')
So let's round up to 10 bytes per record:
10 * 60,000,000 = 600,000,000
That's about 572 MB worth of data, plus the index and additional overhead, so let's roughly say a 600 MB table. If you have that kind of memory to spare on your server, then a HEAP table might be the way to go.
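A hedged sketch of such a MEMORY (HEAP) table (table and column names are invented, and item_flags_disk stands in for whatever persistent copy you reload from, since MEMORY tables come up empty after a restart):

$pdo->exec("
    CREATE TABLE item_flags (
        item_id    INT UNSIGNED NOT NULL,
        is_indexed ENUM('0','1') NOT NULL DEFAULT '0',
        PRIMARY KEY (item_id)
    ) ENGINE=MEMORY
");

// re-populate on startup from the persistent copy
$pdo->exec("INSERT INTO item_flags SELECT item_id, is_indexed FROM item_flags_disk");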
60 million rows with an ID and an on/off bit should be no problem at all for MySQL if you are using InnoDB.
I have an InnoDB table that tracks which forum topics users have read and which post they've read up to. It contains 250 million rows, is 14 bytes wide, and it is updated constantly... It's doing 50 updates a second right now, and it is midnight, so at peak time it could be 100-200.
The indexed columns themselves are not updated after insert. The primary key is (user_id, topic_id) and I add new last_read information by using INSERT ... ON DUPLICATE KEY UPDATE.
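That upsert looks roughly like this (the column names are guessed from the description above, and $pdo is an assumed PDO handle):

$stmt = $pdo->prepare("
    INSERT INTO topic_reads (user_id, topic_id, last_read_post_id)
    VALUES (?, ?, ?)
    ON DUPLICATE KEY UPDATE last_read_post_id = VALUES(last_read_post_id)
");
$stmt->execute([$userId, $topicId, $lastPostId]);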
I measure constantly and I don't see any contention or performance problems but I do cache reads a lot in memcached since deciding when to expire the cache is very straightforward. I've been considering sharding this table by user in order to keep growth in check but I may not even bother storing it in MySQL forever.
I am open to other technologies. Storing the whole thing bitwise in a file?
Redis would be a great alternative. In particular, its sets and sorted sets would work for this (sorted sets might be nice if you need to grab a range of values using something other than the item ID - like last update time)
Redis might be worth checking out if you haven't already - it can be a great addition to an application that relies on MySQL and you'll likely find other good uses for it that simplify your life.
Using PHP, I am building an application that is MySQL database resource heavy, but I also need its data to be very flexible. Currently there are a number of tables which have an array of different columns (including some TEXT, LONGTEXT, INT, etc.), and in the future I would like to expand the number of columns of these tables whenever new data groups are required.
My question is, if I have a table with, say, 10 columns, and I expand this to 40 columns in the future, would a SQL query (via PHP) be slowed down considerably?
Assuming the initial, small query only looks up the original 10 columns and is not a SELECT-all (*) query, I would like to know whether more resources or processing are used because the source table is now much larger.
Also, will the database in general run slower or be much larger due to many columns now constantly remaining as NULL values (eg, whenever a new entry that only requires the first 10 columns is inserted)?
MyISAM and InnoDB behave differently in this regard, for various reasons.
For instance, InnoDB will partition disk space for each column on disk regardless of whether it has data in it, while MyISAM will compress the tables on disk. In a case where there are large amounts of empty columns, InnoDB will be wasting a lot of space. On the other hand, InnoDB does row-level locking, which means that (with caveats) concurrent read / writes to the same table will perform better (MyISAM does a table-level lock on write).
Generally speaking, it's probably not a good idea to have many columns in one table, particularly for volatility reasons. For instance, in InnoDB (possibly MyISAM also?), re-arranging columns or changing the type of a column (e.g. VARCHAR(128) -> VARCHAR(255)) in the middle of a table requires that all data in columns to the right be moved around on disk to make (or remove) space for the altered column.
With respect to your overall database design, it's best to aim for as many columns as possible to be not null, which saves space (you don't need the null flag on the column, and you don't store empty data) and also increases query and index performance. If many records will have a particular column set to null, you should probably move it to a foreign key relationship and use a JOIN. That way disk space and index overhead is only incurred for records that are actually holding information.
Likely, the best solution would be to create a new table with the additional fields and JOIN the tables when necessary. The original table remains unchanged, keeping its speed, but you can still get to the extra fields.
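A small sketch of that split (every table and column name here is invented):

$pdo->exec("
    CREATE TABLE main_extra (
        main_id    INT UNSIGNED NOT NULL PRIMARY KEY,  -- same key as the original table
        extra_text LONGTEXT NULL,
        extra_num  INT NULL
    )
");

// join the extra columns in only when a query actually needs them
$stmt = $pdo->query("
    SELECT m.id, m.col_1, e.extra_text
    FROM main_table AS m
    LEFT JOIN main_extra AS e ON e.main_id = m.id
");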
Optimization is not a trivial question, and nothing can be predicted in general.
The general short answer is: yes, it will be slower (because the DBMS at least needs to read from disk and send more data, obviously).
But how much slower depends very much on the particular case. You might not see any difference at all, or it might get 10x slower.
In all likelihood, no it won't be slowed down considerably.
However, a better question to ask is: Which method of adding more fields results in a more elegant, understandable, maintainable, cost effective solution?
Usually the answer is "It depends." It depends on how the data is accessed, how the requirements will change, how the data is updated, and how fast the tables grow.
You can divide one master table into multiple transaction tables, which will give you much faster results than you are getting now. Also make the primary key a UNIQUE KEY in all of the transaction tables as well as the master table; it really helps to make your queries faster.
Thanks.