I have a MongoDB collection that contains lots of books with many fields. The key fields relevant to my question are:
{
    book_id : 1,
    book_title : "Hackers & Painters",
    category_id : "12",
    related_topics : [ { topic_id : "8", topic_name : "Computers" },
                       { topic_id : "11", topic_name : "IT" } ],
    ...
    ... (at least 20 more fields)
    ...
}
We have a form for filtering results (with many inputs/select boxes) on our search page, and of course there is also pagination. Alongside the filtered results, we show all categories on the page, and for each category we show the number of results found in that category.
We are trying to use MongoDB instead of PostgreSQL, because performance and speed are our main concerns for this process.
Now the question is :
I can easily filter results by feeding the find() function all the filter parameters. That's cool. I can paginate results with the skip() and limit() functions:
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" over all the results, collecting categories and managing pagination in PHP, because the filtered data often consists of nearly 200,000 results.
Problem 1 : I found the MongoDB::command() function in the PHP manual with a "distinct" example. I think I can get distinct values with that method, but the command function doesn't accept conditional parameters (for filtering), and I don't know how to apply the same filter parameters while asking for distinct values.
Problem 2 : Even if there is a way to send filter parameters to MongoDB::command(), that would be another query in the process, and I think it would take approximately the same time as (maybe more than) the previous query. That would be another speed penalty.
Problem 3 : Getting distinct topic_ids with their result counts would be yet another query, and yet another speed penalty :(
I am new to working with MongoDB. Maybe I am looking at these problems from the wrong point of view. Can you help me solve them and give your opinion on the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array());
$count = $cursor->count();
$data = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to count() is to load each document and check it. If you do skip() and limit() with no sort(), then the server just needs to find the first 20 matching documents, which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or an in-line map/reduce. Neither is generally intended for ad-hoc queries.
The only other option is basically to load all of the search results and then count on the webserver (instead of the DB). Obviously, this is also inefficient.
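A third middle-ground option, if your MongoDB version is 2.2 or newer, is the aggregation framework. A hedged sketch using the legacy PHP driver's aggregate() helper, reusing the question's $lib_collection and $filter_params (the version requirement and driver method are my assumptions about your setup):

```php
<?php
// Sketch only: per-category counts for the same filter, computed server-side.
// Assumes MongoDB >= 2.2 and the legacy MongoCollection::aggregate() helper.
$pipeline = array(
    array('$match' => $filter_params),          // same filters as the find()
    array('$group' => array(
        '_id'   => '$category_id',
        'count' => array('$sum' => 1),
    )),
);
$out = $lib_collection->aggregate($pipeline);
// $out['result'] holds one { _id: <category_id>, count: <n> } document per
// category; topic counts would need an extra $unwind on related_topics first.
```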
Getting all of these features is going to require some planning and tradeoffs.
Pagination
Be careful with pagination on large datasets. Remember that skip() and limit() -- whether you use an index or not -- have to perform a scan. Therefore, skipping very far is very slow.
Think of it this way: the database has an index (B-Tree) that can compare values to each other: it can tell you quickly whether something is bigger or smaller than some given x. Hence, search times in well-balanced trees are logarithmic. This is not true for position-based lookups: a B-Tree has no way to tell you quickly what the 15,000th element is; it has to walk and enumerate all the entries up to that point.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the
server to walk from the beginning of the collection, or index, to get
to the offset/skip position before it can start returning the page of
data (limit). As the page number increases skip will become slower and
more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow
you to easily jump to a specific page.
Make sure you really need this feature: typically, nobody cares about the 42,436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
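The range-based paging mentioned in the quote can be sketched like this, reusing the question's variables ($last_id, the _id of the final document on the previous page, is a name I am assuming you would track yourself):

```php
<?php
// Range-based ("remember where you left off") paging sketch.
$filter = $filter_params;
$filter['_id'] = array('$gt' => $last_id);   // resume after the previous page

$page = $lib_collection->find($filter)
                       ->sort(array('_id' => 1))
                       ->limit(20);
// No skip(): the index on _id takes the cursor straight to the right spot.
```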
Distinct Topic Counts
I believe you might be using a sledgehammer as a flotation device. Take a look at your data: related_topics. I personally hate RDBMSs because of object-relational mapping, but this seems to be the perfect use case for a relational database.
If your documents are very large, performance is a problem, and you hate ORM as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, changes then need to be written to both databases.
Related
I have 2 tables:
1. The first table contains prospects, their treatment status, and the mail code they received (think of it as a foreign key)
2. The second table contains mails, indexed by mail code
I need to display some charts about hundreds of thousands of prospects, so I was thinking about an aggregate query (get prospect data grouped by month, count positive statuses, count negative statuses, between start and end dates, etc.)
The result is pretty short and simple, and I can use it directly in charts:
[ "2019-01" => [ "WON" => 55000, "LOST" => 85000, ...],
...
]
Then I was asked to add a filter on mails (code and human-readable label) so the user can choose them from a multi-select field. I can handle writing the query (or queries), but I am wondering which way I should go.
I have a choice between:
- keeping my first query and doing a second one (distinct values of mail, same conditions)
- querying everything and processing all my rows with PHP
I know coding but I have little knowledge about performance.
In theory I should not run 2 queries over the same data, but processing all those rows with PHP when MySQL can do it better looks like... "overkill".
Is there a best practice?
I have a lot of PHP pages that have dozens of queries supporting them, and they run plenty fast. When a page does not run fast, I focus on the slowest query; I do not play games in PHP. But I avoid running a query that hits hundreds of thousands of rows; it will be "too" slow. Some things...
Maybe I will find a way to aggregate the data to avoid a big scan.
Maybe I will move the big query to a second page -- this avoids penalizing the user who does not need it.
Maybe I will break up the big scan so that the user must ask for pieces, not build a page with 100K lines. Pagination is not good for that many rows. So...
Maybe I will dynamically build an index into a second level of pages.
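The first of these ideas (aggregating the data to avoid a big scan) could look like a summary table that is refreshed periodically. Table and column names below are illustrative guesses, not taken from your schema:

```sql
-- Hypothetical monthly summary table, refreshed nightly or after imports.
CREATE TABLE prospect_monthly (
    month  CHAR(7)      NOT NULL,   -- e.g. '2019-01'
    status VARCHAR(10)  NOT NULL,   -- e.g. 'WON', 'LOST'
    cnt    INT UNSIGNED NOT NULL,
    PRIMARY KEY (month, status)
);

INSERT INTO prospect_monthly (month, status, cnt)
SELECT DATE_FORMAT(created_at, '%Y-%m'), status, COUNT(*)
FROM prospects
GROUP BY 1, 2
ON DUPLICATE KEY UPDATE cnt = VALUES(cnt);
```

The charts then read from the small summary table instead of scanning hundreds of thousands of prospect rows on every page view.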
To discuss this further, please provide SHOW CREATE TABLE, some SELECTs (not worrying about how bad they are; we'll tell you), and mockups of page(s).
So I have a website with a product catalog. The catalog page has three product sliders: one for recent products, another for bestsellers, and a third for special offers.
Is it better to create a query for each type of slider, or should I get all products and then have PHP sort them out and separate them into three distinct arrays, one for each slider?
Currently I am just doing
SELECT * FROM products WHERE deleted = 0
For testing.
It's almost always best to refine the query so that it returns only the records you actually need, and only the columns you really need. In your case this would look like:
SELECT id, description, col3
FROM products
WHERE deleted = 0
AND -- conditions that make it fit in the proper slider
The reason is that it costs resources (time and bandwidth) to transport a result set over to the processing program, and processing time to evaluate the individual records; thus the "cost" of the query grows with the table size, not with the size of the dataset you actually need.
Just an example: suppose you want to create a top-10 seller list, and you have 100 items in store. You'll retrieve 100 records and retain 10. No big deal. Now your shop grows and you have 100,000 items in store. For your top 10 you'll have to plow through all of these and throw away 99,990 records: 99,990 records the db server has to read and transfer to you, and 99,990 records you have to individually inspect to see whether each is a top-10 item.
As this type of query is going to be executed often, it's also a good idea to optimize it by indexing the search columns, as indexed searches in the db server are much faster.
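As a sketch of what "conditions that make it fit in the proper slider" might look like: the column names created_at, units_sold, and on_sale below are my assumptions, not from your schema:

```sql
-- One small query per slider; each returns only what the slider shows.
SELECT id, description, price
FROM products
WHERE deleted = 0
ORDER BY created_at DESC            -- "recent products" slider
LIMIT 10;

SELECT id, description, price
FROM products
WHERE deleted = 0
ORDER BY units_sold DESC            -- "bestsellers" slider
LIMIT 10;

SELECT id, description, price
FROM products
WHERE deleted = 0 AND on_sale = 1   -- "special offers" slider
LIMIT 10;

-- Supporting indexes so each query avoids a full table scan:
CREATE INDEX idx_deleted_created ON products (deleted, created_at);
CREATE INDEX idx_deleted_sold    ON products (deleted, units_sold);
```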
I said "almost always" because there are rare cases where you have a hard time expressing in SQL what you actually need, or where you need to use fairly exotic techniques like query hints to force the database engine to execute your query in a reasonably efficient way. But these cases are fairly rare, and when a query doesn't perform as expected, with some analysis you'll manage to improve its performance in most cases - have a look at the literally thousands of questions regarding query optimization here on SO.
I have a table with currently ~1500 rows, which is expected to grow over time (I can't say by how much, but still). The website is read-only and lets users run complex queries through some forms; the search query is completely URL-encoded since it's a public database. It's important to know that users can select which column the data must be sorted by.
I'm not concerned about adding some indexes and slowing down INSERTs and UPDATEs (performed only occasionally by admins), since the workload is basically heavy reading. But I need to paginate results, as some popular queries can return 900+ results, and that takes up too much space and RAM on the client side (results are further processed to create a quite rich <div> HTML element with an <img> for each result, btw).
I'm aware of the use of LIMIT {$n} OFFSET {$m} but would like to avoid it.
I'm also aware of this query:
SELECT *
FROM table
WHERE {$filters} AND id > {$last_id}
ORDER BY id ASC
LIMIT {$results_per_page}
and that's what I'd like to use, but it requires rows to be sorted by their ID only!
I've come up with (what I think is) a very similar query to custom sort results and allow efficient pagination.
Query:
SELECT *
FROM table
WHERE {$filters} AND {$column_id} > {$last_column_id}
ORDER BY {$column} ASC
LIMIT {$results_per_page}
but that unfortunately requires a {$last_column_id} value to pass between pages!
I know indexes (especially unique indexes) essentially keep an ordered "ranking" of a table by the values of a column (be it integer, varchar, etc.), but I really don't know how to make MySQL return the needed $last_column_id for that query to work!
The only thing I can come up with is to add an extra "XYZ_id" integer column next to every "XYZ" column users can sort by, then update the values periodically through some scripts. But is that the only way to make it work? Please help.
(Too many comments to fit into a 'comment'.)
Is the query I/O bound? Or CPU bound? It seems like a mere 1500 rows would be CPU-bound and fast enough.
What engine are you using? How much RAM? What are the settings of key_buffer_size and innodb_buffer_pool_size?
Let's see SHOW CREATE TABLE. If the table is full of big BLOBs or TEXT fields, we need to code the query to avoid fetching those bulky fields only to throw them away because of OFFSET. Hint: Fetch the LIMIT IDs, then reach back into the table to get the bulky columns.
The only way for this to be efficient:
SELECT ...
WHERE x = ...
ORDER BY y
LIMIT 100,20
is to have INDEX(x,y). But even that will still have to step over 100 cow paddies.
You have implied that there are many possible WHERE and ORDER BY clauses? That would imply that adding enough indexes to cover all cases is probably impractical?
"Remembering where you left off" is much better than using OFFSET, so try to do that. That avoids the already-discussed problem with OFFSET.
Do not use WHERE (a,b) > (x,y); that construct has historically not been optimized well. (Perhaps 5.7 has fixed it, but I don't know.)
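"Remembering where you left off" with a secondary sort column can be written with the row-value comparison expanded into AND/OR form, avoiding the construct warned about above. A sketch, where $last_y and $last_id come from the final row of the previous page:

```sql
-- Keyset pagination sketch; assumes INDEX(x, y, id).
SELECT ...
FROM t
WHERE x = ?
  AND ( y > ?                      -- rows strictly after the last page
        OR (y = ? AND id > ?) )    -- tie-break on id for equal y values
ORDER BY y, id
LIMIT 20;
```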
My blog on OFFSET discusses your problem. (However, it may or may not help your specific case.)
I have a dilemma that I'm trying to solve right now. I have a table called "generic_pricing" that has over a million rows. It looks like this....
I have a list of 25000 parts that I need to get generic_pricing data for. Some parts have a CLEI, some have a partNumber, and some have both. For each of the 25000 parts, I need to search the generic_pricing table to find all rows that match either clei or partNumber.
Making matters more difficult is that I have to do matches based on substring searches. For example, one of my parts may have a CLEI of "IDX100AB01", but I need the results of a query like....
SELECT * FROM generic_pricing WHERE clei LIKE 'IDX100AB%';
Currently, my lengthy PHP code for finding these matches loops through the 25000 items. For each item, I use the query above on clei. If a match is found, I use that row for my calculations. If not, I execute a similar query on partNumber to try to find the matches.
As you can imagine, this is very time consuming. And this has to be done for about 10 other tables similar to generic_pricing to run all of the calculations. The system is now bogging down and timing out trying to crunch all of this data. So now I'm trying to find a better way.
One thought I have is to just query the database one time to get all rows, and then use loops to find matches. But for 25000 items each having to compare against over a million rows, that just seems like it would take even longer.
Another thought I have is to get 2 associative arrays of all of the generic_pricing data. i.e. one array of all rows indexed by clei, and another all indexed by partNumber. But since I am looking for substrings, that won't work.
I'm at a loss here for an efficient way to handle this task. Is there anything that I'm overlooking to simplify this?
Do not query the db for all rows and sort them in your app. That will cause a lot more headaches.
Here are a few suggestions:
Use parameterized queries. This allows your db engine to compile the query once and reuse it multiple times. Otherwise it has to optimize and compile the query each time.
Figure out a way to make IN work. Instead of using LIKE, try ... left(clei,8) in ('IDX100AB','IDX100AC','IDX101AB'...)
Do the calculations/math on the db side. Build a stored proc which takes a list of part/clei numbers and outputs the same list with the computed prices. You'll have a lot more control of execution and a lot less network overhead. If not a stored proc, build a view.
Paginate. If this data is being displayed somewhere, switch to processing in batches of 100 or less.
Build a cheat sheet. If speed is an issue try precomputing prices into a separate table nightly, include some partial clei/part numbers if needed. Then use the precomputed lookup table.
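One way to turn the 25000 per-item queries into a single statement is to load the prefixes into a temporary table and match them in one pass. A sketch only: the 8-character prefix length and the table name are assumptions:

```sql
-- Load all part prefixes once, then match them in a single join.
CREATE TEMPORARY TABLE part_prefixes (
    prefix CHAR(8) NOT NULL,
    PRIMARY KEY (prefix)
);
-- ... bulk INSERT the LEFT(clei, 8) / LEFT(partNumber, 8) values here ...

SELECT g.*
FROM generic_pricing AS g
JOIN part_prefixes  AS p
  ON g.clei LIKE CONCAT(p.prefix, '%');  -- prefix LIKE can use an index on clei
```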
I have a search engine on a shared host that uses MySQL. This search engine potentially has millions/trillions etc of records.
Each time a search is performed I return a count of the records that can then be used for pagination purposes.
The count tells you how many results there are for the search performed. MySQL's COUNT is, I believe, considered quite slow.
Order of search queries:
Search executed and results returned
Count query executed
I don't perform a count in PHP, as this would be far slower on larger data sets.
The question is: do I need to worry about MySQL COUNT, and at what stage should I start worrying about it? How do the big search engines perform this task?
In almost all cases the answer is indexing. The larger your database gets, the more important it is to have a well-designed and optimized indexing strategy.
The importance of indexing on a large database cannot be overstated.
You are absolutely right about not looping in code to count DB records. Your RDBMS is optimized for operations like that; your programming language is not. Wherever possible you want to do any sorting, grouping, counting, and filtering operations within the SQL provided by your RDBMS.
As for efficiently getting the count on a "paginated" query that uses a LIMIT clause, check out SQL_CALC_FOUND_ROWS.
SQL_CALC_FOUND_ROWS tells MySQL to calculate how many rows there would
be in the result set, disregarding any LIMIT clause. The number of
rows can then be retrieved with SELECT FOUND_ROWS(). See Section
11.13, “Information Functions”.
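Usage is roughly as follows; the table and column names are placeholders for your own schema. (Worth noting: SQL_CALC_FOUND_ROWS and FOUND_ROWS() are deprecated as of MySQL 8.0.17, which recommends a separate COUNT(*) query instead.)

```sql
SELECT SQL_CALC_FOUND_ROWS id, title
FROM books
WHERE ...            -- your search conditions
LIMIT 20 OFFSET 40;  -- page 3

SELECT FOUND_ROWS(); -- total matches, ignoring the LIMIT
```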
If a MySQL database reaches several million records, that's a sign you'll be forced to stop using a monolithic data store, meaning you'll have to split reads and writes and most likely use a different storage engine than the default one.
Once that happens, you'll stop using the actual row count and start using an estimate, caching the search results, and so on, in order to reduce the load on the database. Even Google uses caching and displays an estimate of the number of records.
Anyway, for now, you've got 2 options:
1 - Run 2 queries, one to retrieve the data and the other one where you use COUNT() to get the number of rows.
2 - Use SQL_CALC_FOUND_ROWS like #JohnFX suggested.
Percona has an article about which is faster, though it might be outdated by now.
The biggest problem you're facing is the way MySQL handles LIMIT with OFFSET, which means you probably won't like your users using large offset numbers.
In case you do end up with millions of records, I don't foresee a bright future for your monolithic MySQL storage on a shared server. However, good luck to you and your project.
If I understand what you are trying to do, you can execute the one query and call the mysql_num_rows() function on the result in PHP... that should be pretty zippy.
http://php.net/manual/en/function.mysql-num-rows.php
Since you're using PHP, you could use the mysql_num_rows method to tell you the count after the query is done. See here: http://www.php.net/manual/en/function.mysql-num-rows.php