I have a very large database (~10 million rows) and I want to list these rows as fast as possible in a table. I have a few options:
I can limit the rows in MySQL - not preferred, as I also want to count the rows containing a specific type of data, say attachments.
Fetch all rows and use a while loop to show 1000 records at a time - I think this is workable, but pulling 10 million rows into memory looks insane and I am quite sure it would perform worse.
Count the total and then list using LIMIT - but MySQL's count is a deal breaker, as despite a unique and indexed id I have had a bad time with MySQL COUNT.
What is the best way to do this?
If I just want to list 10 million rows, is it a bad idea to fetch everything and have PHP parse the data and display 1000 rows at a time?
There are some things to consider:
Is the database optimized? If yes, skip this step.
Index the columns you want to filter the search on.
Select only the columns you require (instead of SELECT *).
If you want to count the total and the id is sequential, you can select the latest row and derive the count from its id if COUNT(*) is 'that slow'.
If you're looking at some sort of pagination, you can count the rows and select only a few records based on user input (SELECT with LIMIT 1000, skip 1000 when it's page 2, etc; see the sketch after this list).
You wouldn't want 10 million rows in "memory" when you'd be using 0.1% of them, right?
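A rough sketch of the limited-columns + pagination points above, using PDO; the messages table, its columns, and the has_attachment flag are placeholder names, not taken from the question:

// Paginated listing sketch: select only the needed columns, page with LIMIT/OFFSET.
// Assumes $pdo is an existing PDO connection and the page number comes from user input.
$perPage = 1000;
$page    = max(1, (int)($_GET['page'] ?? 1));
$offset  = ($page - 1) * $perPage;

$stmt = $pdo->prepare(
    'SELECT id, title, created_at
       FROM messages
      WHERE has_attachment = 1
      ORDER BY id
      LIMIT :lim OFFSET :off'
);
$stmt->bindValue(':lim', $perPage, PDO::PARAM_INT);
$stmt->bindValue(':off', $offset, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Matching total for the pager; stays cheap if has_attachment is indexed.
$total = $pdo->query('SELECT COUNT(*) FROM messages WHERE has_attachment = 1')->fetchColumn();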
I'm using PHP 7 and MySQL for a small custom-built forum, with a query that grabs 7 columns through 2 SQL JOINs for a "latest posts" page. When the time comes that I hit 1 million rows, will the LIMIT 30 stop at 30 rows, or will it have to sort the entire table each run?
The reason I'm asking is that I'm trying to wrap my head around how to paginate this custom forum I've built, and whether that pagination will be "ok" once it has to (theoretically) read through a million rows.
EDIT: My current query is a LIMIT 30, sorted descending.
EDIT2: Currently I'm getting about 500-600 posts a day, give or take 50. It's quickly adding up, so I'm trying to monitor this before I get to 1 million. That being said, I'm only looking up one table right now, tblTopics, with topic_id, topic_name, and topic_author (a FK). Then I'm doing another lookup after that using the topic's own foreign keys, topic_rating and topic_category. The original lookup is where I have the sort and limit.
The sort is applied to the complete set and the limit is applied after the sort, so adding a LIMIT to an ORDER BY query does not make it much faster.
It depends.
SELECT ... FROM tbl ORDER BY x LIMIT 30;
With INDEX(x), this will probably use the index and stop after 30 rows, not 1 million.
SELECT ... FROM tbl GROUP BY zz ORDER BY x LIMIT 30;
will scan all million rows, do the grouping, write to a tmp table, sort that tmp table, and only then deliver 30 rows.
SELECT ... FROM tbl WHERE yy = 123 ORDER BY x LIMIT 30;
With only INDEX(yy), the optimizer will probably prefer that index, and it is hard to say how efficient it will be.
SELECT ... FROM tbl WHERE yy = 123 ORDER BY x LIMIT 30;
With INDEX(yy, x), this will be very efficient -- the index can be used not only for the filtering, but also for the ORDER BY and the LIMIT. Only 30 rows will be touched.
SELECT ... FROM tbl LIMIT 30;
is of dubious use. You will get some 30 rows, but who knows which 30? But it will be fast.
Well, this is still not answering your question. Your question involves a JOIN. Can you guess how much more complex the question becomes with a JOIN involved?
If you would like to discuss your specific query, please provide the query, SHOW CREATE TABLE for each table, and how many rows are in each table.
If you are joining a 1-row table to a million row table, the 1-row table probably does not add any complexity.
If you are joining two million-row tables together without any indexes, then you are looking at a trillion intermediate 'rows' to work with!
Oh, and then you will want the 'second' 30 rows? That adds another dimension of complexity. I could spend a few more paragraphs on what can go wrong with OFFSET.
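For what it's worth, the usual way around the OFFSET problem is keyset ("seek") pagination: remember the last sort value a page ended on and filter past it, so each page is again a short index range read. A rough PDO sketch, where posts and its columns are assumed names rather than the asker's schema:

// Keyset pagination sketch: page by the last id seen instead of using OFFSET.
// Assumes post_id is the indexed sort column and $pdo is an open PDO connection.
$before = isset($_GET['before']) ? (int)$_GET['before'] : PHP_INT_MAX;

$stmt = $pdo->prepare(
    'SELECT post_id, topic_id, author_id, created_at
       FROM posts
      WHERE post_id < :before    -- continue just past where the previous page ended
      ORDER BY post_id DESC
      LIMIT 30'
);
$stmt->bindValue(':before', $before, PDO::PARAM_INT);
$stmt->execute();
$page = $stmt->fetchAll(PDO::FETCH_ASSOC);
// Hand the smallest post_id in $page back to the client as ?before=... to fetch the next 30.

Unlike OFFSET, this never has to skip over earlier rows, so page 1000 costs the same as page 1.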
If this forum is somewhat open-ended, where anyone can post "topics" and be the originating author, you probably want, at a minimum, a topics table with a PKID, Name, and Author as you have, but also the date added, the most recent post, and a count of posts against it. Too many times people build web sites that want counters all over the place and try to do aggregates or find the most recent item on the fly. Speaking of the most recent post, hold the ID of that post too, so you don't have to find the max date; you can then do the join based on that ID.
Then a secondary table would hold the details associated with a given post.
Then, via a trigger on your detail table for whatever you are posting against, you can update the parent topic row, stamping it with count + 1, a most-recent date of now, and the ID of the newly created record as the last ID.
So now, joining to get that most recent context entry is a simple join and not overly complex.
Put an index on your topics table on the most recent post date, so you are now getting, for example, the most recent 30 topics, not necessarily the most recent 30 posts (where 3 busy topics might account for all 30). Get 30 distinct topics, then let the user see the details as they select the topic of interest. Your query at the top level never goes against the underlying details.
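A sketch of that trigger, run once from a setup or migration script; tblPosts and the counter columns are assumed names based on the description above:

// One-time setup: keep tblTopics' counters and "latest post" pointers in sync on every insert.
// Table and column names are illustrative, not the asker's actual schema.
$pdo->exec("
    CREATE TRIGGER trg_posts_after_insert
    AFTER INSERT ON tblPosts
    FOR EACH ROW
        UPDATE tblTopics
           SET post_count     = post_count + 1,
               last_post_date = NEW.created_at,
               last_post_id   = NEW.post_id
         WHERE topic_id = NEW.topic_id
");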
This is obviously brief given the true context of your website, but hopefully the suggestions make sense for you to run with.
I'm working on a Cassandra database storing the number of times each word occurred. I want to find out which 100 words occur the most times. In a relational database, it'd be something like this:
select * FROM wordcounter ORDER BY counts DESC LIMIT 100;
but ordering by a counter column in Cassandra is impossible.
So, instead I'll have to periodically (probably once per day) fetch all rows and write the 100 words with the highest counters to the db. The following is not an option:
select * FROM wordcounter
because that would return way too much data. I'll have to do it in increments, but how (and how many rows per query are acceptable)?
UPDATE
It's supposedly possible to iterate over all Cassandra rows, but I am using PHP PDO to communicate with Cassandra and it certainly doesn't have an iterate feature as far as I've seen. But I found I can query by token, so this is possible:
select * FROM wordcounter LIMIT 100;
And then keep looping this until 0 results are returned
select * FROM wordcounter WHERE token(word) > token('lastword') LIMIT 100;
So this is basically the equivalent of an OFFSET, which will allow me to process parts of the dataset without having to query it all at once. But I guess this does mean I can't distribute the query over multiple systems. Does anyone know of any alternatives?
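A sketch of that token-paging loop in PHP; the query call here is a stand-in for whatever Cassandra client/driver is in use, and only the paging pattern itself comes from the question:

// Page through wordcounter 100 rows at a time via token(), keeping a running top-100.
// $client->query() is a placeholder for the driver call that returns rows as assoc arrays.
$top      = [];      // word => counts, best 100 seen so far
$lastWord = null;

do {
    if ($lastWord === null) {
        $cql = "SELECT word, counts FROM wordcounter LIMIT 100";
    } else {
        $escaped = str_replace("'", "''", $lastWord);   // CQL escapes quotes by doubling them
        $cql = "SELECT word, counts FROM wordcounter WHERE token(word) > token('$escaped') LIMIT 100";
    }

    $rows = $client->query($cql);

    foreach ($rows as $row) {
        $top[$row['word']] = $row['counts'];
        $lastWord = $row['word'];
    }

    arsort($top);                            // keep only the current best 100 to bound memory
    $top = array_slice($top, 0, 100, true);
} while (count($rows) > 0);                  // loop until a page comes back empty

// $top now holds the 100 most frequent words and their counts.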
I have table(1) that holds a total-records value for table(2). I do this so that I can quickly show users the total without having to run SELECT COUNT every time a page is brought up.
My Question:
I am debating whether to update that total-records value in table(1) as new records come in, or to have a script run every 5 minutes to update it.
The problem is we plan on having many records created during a day, which will result in an additional update for each one.
However, if we use a script, it will need to run for every record in table(1), and that update query will have a subquery counting records from table(2). The script will need to run every 5 to 10 minutes to keep things in sync.
table(1) will not grow fast; at peak it might get to around 5000 records. table(2) has the potential to get massive, over 1 million records in a short period of time.
Would love to hear some suggestions.
This is where a trigger on table 2 might be useful, automatically updating table 1 as part of the same transaction, rather than using a second query initiated by PHP. It's still a slight overhead, but it's handled by the database itself rather than being a larger overhead in your PHP code, and it maintains the accuracy of the table 1 counts ACIDly (assuming you use transactions).
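A minimal sketch of such a trigger, created once from a setup script; the table and column names are placeholders for table(1)/table(2):

// Keep table1.total_records in step with inserts into table2.
// Assumes table2 rows reference their parent via table1_id; adjust names to your schema.
$pdo->exec("
    CREATE TRIGGER trg_table2_after_insert
    AFTER INSERT ON table2
    FOR EACH ROW
        UPDATE table1
           SET total_records = total_records + 1
         WHERE id = NEW.table1_id
");

A matching AFTER DELETE trigger that decrements the count keeps it accurate if rows can be removed.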
There is a difference between the MyISAM and InnoDB engines. If you need to count the total number of rows in the table, SELECT COUNT(*) FROM table, then with MyISAM you will get this number blazingly fast no matter what the size of the table is (MyISAM tables already store the row count, so the engine just reads it).
InnoDB does not store such info. But if an approximate row count is sufficient, SHOW TABLE STATUS can be used.
If you need to count based on something, SELECT COUNT(*) FROM table WHERE ..., then there are two different options:
either put an index on that something, and the count will be fast
use triggers/application logic to automatically maintain a count field in another table
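A quick PDO sketch of the approximate count and the indexed filtered count for InnoDB; 'mytable' and the status column are placeholder names:

// Fast approximate row count for an InnoDB table (the 'Rows' figure is an estimate, not exact).
$approx = $pdo->query("SHOW TABLE STATUS LIKE 'mytable'")->fetch(PDO::FETCH_ASSOC)['Rows'];

// Exact filtered count; stays fast on large tables if `status` is indexed.
$exact = $pdo->query("SELECT COUNT(*) FROM mytable WHERE status = 'active'")->fetchColumn();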
What I'm trying to do: I have a table that keeps user information (one row per user), and I run a PHP script daily to fill in information I get from users. For one column, say column A, if I find information I fill it in; otherwise I don't touch it, so it remains NULL. The reason is to allow those rows to be updated in the next run, when the information might be available.
The problem is that I have too many rows to update; if I blindly SELECT all rows where column A is NULL, the result won't fit into memory. If I SELECT 5000 at a time, then in the next SELECT 5000 I could get the same rows that didn't get updated last time, which would be an infinite loop...
Does anyone have any idea how to do this? I don't have an ID column, so I can't just say SELECT ... WHERE id > X... Is there a solution (either on the MySQL side or on the PHP side) without modifying the table?
You'll want to use the LIMIT and OFFSET keywords.
SELECT [stuff] LIMIT 5000 OFFSET 5000;
LIMIT indicates the number of rows to return, and OFFSET indicates how far into the result set to start reading.
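A rough PHP/PDO sketch of that batching loop; users, email, and column_a are placeholder names, and the ORDER BY on some stable column is an added assumption so pages don't overlap within a single run:

// Page through the NULL rows 5000 at a time with LIMIT/OFFSET.
$batch  = 5000;
$offset = 0;

do {
    $stmt = $pdo->prepare(
        'SELECT * FROM users
          WHERE column_a IS NULL
          ORDER BY email
          LIMIT :lim OFFSET :off'
    );
    $stmt->bindValue(':lim', $batch, PDO::PARAM_INT);
    $stmt->bindValue(':off', $offset, PDO::PARAM_INT);
    $stmt->execute();
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // ...look up the missing info and UPDATE column_a for this user if found...
    }

    $offset += $batch;   // rows that got filled drop out of the NULL set and shift later pages
                         // forward; anything skipped this run is picked up on the next daily run
} while (count($rows) === $batch);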
I have a big MySQL users table and need to get six random rows from it (I'm using PHP). The table has an index column that is auto incremented. The only problem is that some rows are marked as inactive, because some users have disabled their accounts or whatever. That means I can't just count the rows and then grab a random number from that range because some of them will be inactive.
How can I efficiently get a random row without using RAND() and without trying to query an inactive user?
Thanks!
WHERE `inactive` = 0 LIMIT random_offset, 1
where random_offset is precalculated in PHP (as a random number from 0 to the COUNT of active users minus 1). The result consists of 6 UNIONed queries.
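A rough PHP sketch of that approach; the users table and its columns are placeholders:

// Six independent random offsets, one parenthesized subquery per row, combined with UNION ALL.
$total = (int)$pdo->query('SELECT COUNT(*) FROM users WHERE inactive = 0')->fetchColumn();

$parts = [];
for ($i = 0; $i < 6; $i++) {
    $offset  = mt_rand(0, max(0, $total - 1));                          // 0-based offset per row
    $parts[] = "(SELECT id, name FROM users WHERE inactive = 0 LIMIT $offset, 1)";
}

$sql    = implode(' UNION ALL ', $parts);   // UNION ALL keeps any accidental duplicates visible
$random = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);

Note that LIMIT with a large offset still has to walk past the skipped rows, so on a very large table the linked alternatives below may be worth a look.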
If you wish to avoid the very slow ORDER BY RAND() method, here are various alternatives, with various options to manage holes:
quick selection of a random row from a large table in mysql