I'm working with a Cassandra database storing the number of times each word occurred. I want to find out which 100 words occur the most times. In a relational database, it'd be something like this:
select * FROM wordcounter ORDER BY counts DESC LIMIT 100;
but ordering by a counter column in Cassandra is impossible.
So instead I'll have to periodically (probably once per day) fetch all rows and write the 100 words with the highest counters back to the db. The following is not an option:
select * FROM wordcounter
Because that would return way too much data. I'll have to do it in increments, but how (and how many rows per query is acceptable)?
UPDATE
It's supposedly possible to iterate over all Cassandra rows, but I am using PHP PDO to communicate with Cassandra, and it doesn't have an iterate feature as far as I've seen. However, I found I can query by token, so this is possible:
select * FROM wordcounter LIMIT 100;
And then keep looping this until 0 results are returned
select * FROM wordcounter WHERE token(word) > token('lastword') LIMIT 100;
So this basically is the equivalent of an OFFSET, which will allow me to process parts of the dataset without having to query it all at once. But I guess this does mean I can't distribute the query over multiple systems. Does anyone know of any alternatives?
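For what it's worth, here is a rough sketch of that token-paging loop in PHP, keeping a running top 100 as it goes. It assumes a PDO connection to Cassandra in $pdo, that word is the partition key, and that the counter column is named counts; adjust quoting and column names for your driver and schema.
// Hypothetical sketch: page through the table via token(), keeping only the
// 100 largest counters seen so far. $pdo is assumed to be a PDO connection
// speaking CQL; `word` is assumed to be the partition key.
$top = array();        // word => counter value
$lastWord = null;
do {
    $cql = ($lastWord === null)
        ? "SELECT word, counts FROM wordcounter LIMIT 100"
        : "SELECT word, counts FROM wordcounter WHERE token(word) > token(" . $pdo->quote($lastWord) . ") LIMIT 100";
    $rows = $pdo->query($cql)->fetchAll(PDO::FETCH_ASSOC);
    foreach ($rows as $row) {
        $top[$row['word']] = $row['counts'];
        $lastWord = $row['word'];
    }
    arsort($top);                           // sort by counter, descending
    $top = array_slice($top, 0, 100, true); // prune to the current top 100
} while (count($rows) > 0);
Each pass only holds one page plus the running top 100 in memory, so it stays small even if the table doesn't.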
Related
I have a very large database (~10 million rows) and I want to list these things as fast as possible in a table. I have a few options:
Limit the rows in MySQL - not preferred, as I want to count the rows with a specific type of data, say attachments.
Fetch all rows and use a while loop to show 1000 records at a time - it might work, but pulling 10 million rows into memory looks insane, and I am quite sure it would perform badly.
Count the total and then list using LIMIT - but MySQL's COUNT is a deal breaker; in spite of a unique, indexed id I have had bad experiences with it.
What is the best way to do this?
If I just want to list 10 million rows, is parsing the data in PHP to display 1000 rows at a time a bad idea?
There are some things to consider:
Is the database optimized? If yes, skip this step.
Index the columns you want to filter the search on.
Select only the columns you require (instead of SELECT *).
If you want to count the total and the id is sequential, you can select the latest row and count based on its id if COUNT really is 'that slow'.
If you're looking at some sort of pagination, you can count the rows and select only a few records based on user input (select with LIMIT 1000, skip 1000 when it's page 2, etc.) - see the sketch after this list.
You wouldn't want 10 million rows in memory when you'd be using 0.1% of them, right?
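To make the pagination point concrete, here is a minimal sketch with PDO; the table and column names (items, type, 'attachment') are placeholders, not your actual schema.
// Minimal pagination sketch (names are placeholders): filter on an indexed
// column, select only the columns you need, and fetch one page at a time.
$perPage = 1000;
$page    = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;
$offset  = ($page - 1) * $perPage;

$stmt = $pdo->prepare(
    "SELECT id, title, type FROM items WHERE type = :type
     ORDER BY id LIMIT :limit OFFSET :offset"
);
$stmt->bindValue(':type',   'attachment');
$stmt->bindValue(':limit',  $perPage, PDO::PARAM_INT);
$stmt->bindValue(':offset', $offset,  PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
Depending on the driver you may need to disable emulated prepares for the LIMIT/OFFSET placeholders to be accepted as integers.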
Currently, I am using this query in my PHP script:
SELECT * FROM `ebooks` WHERE `id`!=$ebook[id] ORDER BY RAND() LIMIT 125;
The database will be about 2500 rows big at max, but I've read that ORDER BY RAND() eventually will slow down the processing time as the data in the database grows.
So I am looking for an alternate method for my query to make things still run smoothly.
Also, I noticed that ORDER BY RAND() is not truly randomizing the rows, because often I see that it follows some kind of pattern that sometimes repeats over and over again.
Is there any method to truly randomize the rows?
The RAND() function is a pseudo-random number generator; if you do not initialize it with different values it will give you the same sequence of numbers, so what you should do is:
SELECT * FROM `ebooks` WHERE `id`!=$ebook[id] ORDER BY RAND(UNIX_TIMESTAMP()) LIMIT 125;
which will seed the random number generator from the current time and will give you a different sequence of numbers.
RAND() will slow down the SELECT's ORDER BY clause, since it has to generate a random number for every row and then sort by it. I would suggest having the data returned to the calling program and randomizing it there using something like array_rand.
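For example, a minimal sketch of that idea (using shuffle() rather than array_rand(), and assuming a PDO connection in $pdo):
// Sketch: fetch the candidate rows once, then randomize in PHP instead of
// making MySQL sort by RAND(). With ~2500 rows this stays cheap.
$stmt = $pdo->prepare('SELECT * FROM ebooks WHERE id != :id');
$stmt->execute(array(':id' => $ebook['id']));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

shuffle($rows);                         // randomize order in PHP
$random = array_slice($rows, 0, 125);   // take up to 125 random rows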
This question has already been answered:
quick selection of a random row from a large table in mysql
Here too:
http://snippetsofcode.wordpress.com/2011/08/01/fast-php-mysql-random-rows/
I have a file that goes through a large data set and spits out the rows in a paginated manner. The dataset contains about 210k rows, which isn't even that much; it will grow to 3M+ in a few weeks, but it's already slow.
I have a first query that gets the total number of items in the DB for a particular WHERE clause combination, the most basic one looks like this:
SELECT count(v_id) as num_items FROM versions
WHERE v_status = 1
It takes 0.9 seconds to run.
The 2nd query is a LIMIT query that gets the actual data for that page. This query is really quick (less than 0.001 s).
SELECT
v_id,
v_title,
v_desc
FROM versions
WHERE v_status = 1
ORDER BY v_dateadded DESC
LIMIT 0, 25
There is an index on v_status, v_dateadded
I use PHP. I cache the result in memcached, so subsequent requests are really fast, but the first request is laggy. Especially once I throw a fulltext search in there, it starts taking 2-3 seconds for the two queries.
I'm not sure this is it, but try making it COUNT(*); I think COUNT(x) has to go through every row and count only the ones that don't have a NULL value (so it has to go through all the rows anyway).
Given that v_id is a PRIMARY KEY it should not have any nulls, so try count(*) instead...
But I don't think it will help, since you have a WHERE clause.
Not sure if this is the same for MySQL, but in MS SQL Server COUNT(*) is almost always faster than COUNT(column). The parser determines the fastest column to count and uses that.
Run an explain plan to see how the optimizer is running your queries.
That'll probably tell you what Andreas Rehm told you: you'll want to add indices that cover your where clauses.
EDIT: For me FOUND_ROWS() was the fastest way of doing this:
SELECT
SQL_CALC_FOUND_ROWS
v_id,
v_title,
v_desc
FROM versions
WHERE v_status = 1
ORDER BY v_dateadded DESC
LIMIT 0, 25;
Then in a secondary query just do:
SELECT FOUND_ROWS();
If you are outputting to PHP, you do this:
$totalnumber = mysql_result(mysql_query($secondquery), 0, 0);
I was previously trying to do the same thing as the OP, putting COUNT(column) in the first query, but it took about three times longer than even the slowest WHERE and ORDER BY query I could write (with a LIMIT set). I tried changing to COUNT(*) and it improved a lot. But the results in my case were even better using MySQL's FOUND_ROWS().
I am testing in PHP with microtime() and repeating the query. In the OP's case, if he ran COUNT(*) I think he would save some time, but it is not the fastest way of doing this. I ran some tests on COUNT(*) vs. FOUND_ROWS(), and FOUND_ROWS() is quite a bit faster.
Using FOUND_ROWS() was nearly twice as fast in my case.
I first started doing EXPLAIN on the COUNT(*) query. In the OP's case you would see that MySQL still checks a total of 210k rows in the first query. It checks every row before even starting the LIMIT query and doesn't seem to get any performance benefit from doing this.
If you run EXPLAIN on the LIMIT query it will probably check fewer than 100 rows, as you have limited the results to 25. But this is still duplicated work, and there will be cases where you can't afford it; at the least you should still compare performance with FOUND_ROWS().
I thought this might only save time on large LIMIT requests, but when I ran EXPLAIN on my LIMIT query it was actually only checking 25 rows to get 15 values. However, there was still a very noticeable difference in query time: on average I got down from 0.25 to 0.14 seconds and achieved the same results.
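A rough sketch of that kind of microtime() comparison, assuming a PDO connection in $pdo and the OP's column names (not the exact benchmark script):
// Rough timing sketch: compare SQL_CALC_FOUND_ROWS / FOUND_ROWS() against a
// separate COUNT(*) query.
$start = microtime(true);
$pdo->query("SELECT SQL_CALC_FOUND_ROWS v_id, v_title, v_desc
             FROM versions WHERE v_status = 1
             ORDER BY v_dateadded DESC LIMIT 0, 25")->fetchAll();
$total = (int)$pdo->query("SELECT FOUND_ROWS()")->fetchColumn();
printf("FOUND_ROWS(): %d rows, %.4f s\n", $total, microtime(true) - $start);

$start = microtime(true);
$count = (int)$pdo->query("SELECT COUNT(*) FROM versions WHERE v_status = 1")->fetchColumn();
printf("COUNT(*):     %d rows, %.4f s\n", $count, microtime(true) - $start);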
I've developed a user rating system that analyzes users and saves their information with a score in a db.
I'm getting close to 1 Million users rated and stored.
I'm having issues with taking a certain set of users from the table (score < 50) and then comparing their ids against another set of ids without the whole thing crashing down.
The result of the (score < 50) query is around 65k rows and the comparison is against probably 1,000 user ids so the whole thing is running 65k * 1,000.
Is my bottleneck at the db? Or is it at the comparison of ids? Is there a better way to split this up?
Query -> "select username, userscore from users where userscore < 50"
then
Foreach compares values
Since you haven't provided any table or index information, here's what I'm going to suggest.
Make sure there's an index on userscore. If you have more than a million rows in your table and you're doing a query with "WHERE userscore < 50", that column needs an index.
Make sure your query is using that index. Run your query manually with EXPLAIN at the front, i.e. EXPLAIN SELECT username, userscore FROM users WHERE userscore < 50. Optimize based on the results.
You haven't mentioned how you're doing the id comparison, so I'll assume it's in a loop that checks each one against the array. You might be better off putting all 1000 ids into the query and limiting your SELECT query to users with score < 50 AND with their id in that set.
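Something along these lines, assuming the ids live in a non-empty PHP array $ids and the table has an id column; a sketch, not your exact schema:
// Sketch: push the id comparison into the query instead of a 65k * 1,000
// PHP loop. $ids is assumed to be the array of ~1,000 user ids.
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare(
    "SELECT username, userscore FROM users
     WHERE userscore < 50 AND id IN ($placeholders)"
);
$stmt->execute($ids);
$matches = $stmt->fetchAll(PDO::FETCH_ASSOC);
That way the database does the intersection using its indexes, and PHP only ever sees the rows that actually match.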
If you post more information about your tables, indexes, and comparisons, I can probably be more specific.
It seems easy enough to find out whether it is the db or not. Just prior to your query, fopen() a log file in /tmp. Then fwrite() the result of a microtime() call into the file. Just after your query, fwrite() the result of another microtime(). Run your script once and you will be able to see the following (a minimal logging sketch follows this list):
1) Are you even getting to the pre-query spot?
2) Is the script failing in the middle of the query?
3) How long does the query take if it doesn't crash the script?
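A minimal sketch of that logging, assuming a PDO connection in $pdo (the file path and query are placeholders):
// Write a timestamp to a /tmp log just before and just after the query.
$log = fopen('/tmp/query_timing.log', 'a');

fwrite($log, 'before query: ' . microtime(true) . "\n");
$rows = $pdo->query('SELECT username, userscore FROM users WHERE userscore < 50')
            ->fetchAll(PDO::FETCH_ASSOC);
fwrite($log, 'after query:  ' . microtime(true) . "\n");

fclose($log);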
When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.
Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.
This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?
I am using PHP and Postgres.
Pure SQL
Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. Introduced with PostgreSQL 8.4 in 2009.
SELECT foo
, count(*) OVER() AS full_count
FROM bar
WHERE <some condition>
ORDER BY <some col>
LIMIT <pagesize>
OFFSET <offset>;
Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a possible shortcut taking just the top rows from a matching index may not be helpful any more.
Doesn't matter much with small tables or full_count <= OFFSET + LIMIT. Matters for a substantially bigger full_count.
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. So you also get no full_count. Possible alternative:
Run a query with a LIMIT/OFFSET and also get the total number of rows
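On the PHP side, a minimal sketch of consuming that window-function query with the pgsql extension (connection in $conn; the filter and sort columns are illustrative placeholders):
// One round trip: the page rows plus the full count from the same result set.
$result = pg_query_params($conn,
    'SELECT foo, count(*) OVER() AS full_count
       FROM bar
      WHERE condition_col = $1
      ORDER BY sort_col
      LIMIT $2 OFFSET $3',
    array($filterValue, $pageSize, $offset)
);
$rows      = pg_fetch_all($result) ?: array();
$fullCount = $rows ? (int)$rows[0]['full_count'] : 0;  // 0 rows: OFFSET past the end (corner case above)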
Sequence of events in a SELECT query
( 0. CTEs are evaluated and materialized separately. In Postgres 12 or later the planner may inline those like subqueries before going to work.) Not here.
1. The WHERE clause (and JOIN conditions, though none in your example) filters qualifying rows from the base table(s). The rest is based on the filtered subset.
( 2. GROUP BY and aggregate functions would go here.) Not here.
( 3. Other SELECT list expressions are evaluated, based on grouped / aggregated columns.) Not here.
4. Window functions are applied depending on the OVER clause and the frame specification of the function. The simple count(*) OVER() is based on all qualifying rows.
5. ORDER BY is applied.
( 6. DISTINCT or DISTINCT ON would go here.) Not here.
7. LIMIT / OFFSET are applied based on the established order to select rows to return.
LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:
Optimize query with OFFSET on large table
Alternatives to get final count
There are completely different approaches to get the count of affected rows (not the full count before OFFSET & LIMIT were applied). Postgres has internal bookkeeping of how many rows were affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).
For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:
GET DIAGNOSTICS integer_var = ROW_COUNT;
Details in the manual.
Or you can use pg_num_rows in PHP. Or similar functions in other clients.
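For instance, a tiny sketch with pg_num_rows (the query itself is illustrative); note that this counts the rows the limited query actually returned, not the full count:
$result = pg_query($conn, 'SELECT foo FROM bar WHERE foo IS NOT NULL LIMIT 25 OFFSET 50');
$rowsOnThisPage = pg_num_rows($result);   // rows in this page only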
Related:
Calculate number of rows affected by batch query in PostgreSQL
As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to do the query in its entirety, even if the LIMIT clause would have allowed it to stop early.
As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing for which LIMIT is used, IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way is to remember the value of the row you are ordering by for the previous row (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've got a problem). The disadvantage is that if new elements are added between page views, this can get a little out of sync (but then again, it may not be observable by visitors and can be a big performance gain).
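A sketch of that "remember the last value" rewrite in PHP terms, with placeholder names (items, order_row) and $lastSeen holding the order_row value of the last row on the previous page:
// Keyset-style pagination: no OFFSET, so the DB can seek straight to the
// right spot via the index on order_row.
$stmt = $pdo->prepare(
    'SELECT id, order_row, payload
       FROM items
      WHERE order_row > :last_seen
      ORDER BY order_row
      LIMIT 10'
);
$stmt->execute(array(':last_seen' => $lastSeen));
$page = $stmt->fetchAll(PDO::FETCH_ASSOC);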
You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say, 5 minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.
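One way to do that, assuming a memcached instance is available (key name, TTL and query are placeholders):
// Cache the expensive COUNT for ~5 minutes instead of running it per request.
$perPage   = 25;                             // page size (placeholder)
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$total = $memcached->get('items_total_count');
if ($total === false) {                      // cache miss: hit the database
    $res   = pg_query($conn, 'SELECT count(*) FROM bar WHERE condition_col = 1');
    $total = (int)pg_fetch_result($res, 0, 0);
    $memcached->set('items_total_count', $total, 300);   // keep for 5 minutes
}
$pages = (int)ceil($total / $perPage);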
Since Postgres already does a certain amount of caching, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time. We have timers built into our DB layer, so I have seen the evidence.
Seeing as you need to know for the purpose of paging, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.
If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.