Sphinx query with large number of attributes - php

We have a big index, around 1 billion documents. Our application does not allow users to search everything; they have subscriptions and should only be able to search within them.
Our first iteration of the index used attributes, so a typical query looked like this (we are using PHP API):
$cl->SetFilter('category_id', $category_ids); // array with all user subscriptions
$result = $cl->Query($term,"documents");
This worked without issues but was very slow. Then we saw this article. The analogy with an un-indexed MySQL query was alarming, and we decided to ditch the attribute-based filter and try a full-text column instead. So now our category_id is a full-text field. Indeed, our initial tests showed that searching is a lot faster, but when we launched the index into production we ran into an issue. Some users have many subscriptions, and we started to receive this error from Sphinx:
Error: index documents: query too complex, not enough stack (thread_stack_size=337K or higher required)
Our new queries look like this:
user_input #category_id c545|c547|c549|c556|c568|c574|c577|c685...
When there are too many categories, the above error shows up. We thought it would be easy to fix by just increasing thread_stack to a higher value, but it turned out to be capped at 2MB, and we still have queries exceeding that.
The question is: what do we do now? We were thinking about splitting the query into smaller queries, but then how would we aggregate the results with the correct limit (we are using $cl->SetLimits($page, $limit); for pagination)?
Any ideas will be welcome.

You can do the 'pagination' in the application; this is roughly how Sphinx itself does merging when querying distributed indexes.
$upper_limit = ($page_number * $page_size) + 1;
$cl->SetLimits(0, $upper_limit);
foreach ($indexes as $index) {
    $cl->AddQuery($term, $index);   // same search term against each index
}
$results = $cl->RunQueries();

$all = array();
foreach ($results as $result) {
    foreach ($result['matches'] as $id => $match) {
        $all[$id] = $match['weight'];
    }
}
arsort($all);   // best weight first
$page = array_slice($all, ($page_number - 1) * $page_size, $page_size, true);
(This is just a sketch to show the basic procedure.)
... yes, it's wasteful, but in practice most queries are for the first few pages anyway, so it doesn't matter all that much. Only 'deep' result pages will be particularly slow.
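If you keep the single full-text index and split the category list instead, the same merge works per chunk of categories. A hedged sketch (the chunk size of 100 is an arbitrary guess; tune it so each sub-query stays under the stack limit):

// Split the subscribed categories into smaller full-text sub-queries against the
// same "documents" index, then merge exactly as above.
$upper_limit = ($page_number * $page_size) + 1;
$cl->SetLimits(0, $upper_limit);

foreach (array_chunk($category_ids, 100) as $chunk) {
    $cats = 'c' . implode('|c', $chunk);                  // c545|c547|c549|...
    $cl->AddQuery($term . ' #category_id ' . $cats, 'documents');
}
$results = $cl->RunQueries();

$all = array();
foreach ($results as $result) {
    if (!empty($result['matches'])) {
        foreach ($result['matches'] as $id => $match) {
            // a document may match more than one chunk; keep its best weight
            $all[$id] = max($match['weight'], isset($all[$id]) ? $all[$id] : 0);
        }
    }
}
arsort($all);
$page = array_slice($all, ($page_number - 1) * $page_size, $page_size, true);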

Related

CodeIgniter - How to SELECT all rows in a big table without memory leak

It's kinda hard to understand my need from the title alone.
CodeIgniter is performing a SELECT query on a table of 800,000+ rows in one shot.
It takes a lot of memory, and on one specific server I get an "Out of memory" fatal error.
For performance purposes, I would like to separate the select into 2 selects: the first 50% of the rows, and then the remaining 50%.
I reuse this set of data to perform an INSERT afterwards.
How can I do that without losing/forgetting a single row?
Besides the fact that operations like that are closely tied to performance issues, you can use unbuffered_row().
Basically, if you have a job with data that large, you should use the unbuffered_row() method provided by the built-in Query Builder result.
It's very well documented in the Result Rows section of the CodeIgniter documentation.
For example:
$query = $this->db->select('*')->from('your_table')->get();

while ($row = $query->unbuffered_row())
{
    // do your job with $row
}
This will avoid your memory problem.
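If the rows are then written back out (the question mentions an INSERT afterwards), the same streaming idea can be combined with batched inserts. A hedged sketch; 'target_table', the 500-row batch size and the second connection are assumptions, not part of the original answer:

// Stream rows one at a time and flush them to the target table in chunks, so neither
// the full result set nor the full insert payload has to sit in memory at once.
// A second connection is used for the writes, in case the read result is truly
// unbuffered on the first one (which would otherwise give "commands out of sync").
$write_db = $this->load->database('default', TRUE);

$query = $this->db->select('*')->from('your_table')->get();
$batch = array();

while ($row = $query->unbuffered_row('array'))       // 'array' returns an associative array
{
    $batch[] = $row;
    if (count($batch) >= 500) {
        $write_db->insert_batch('target_table', $batch);
        $batch = array();
    }
}
if (!empty($batch)) {
    $write_db->insert_batch('target_table', $batch); // flush the remainder
}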

What's faster, db calls or resorting an array?

In a site I maintain I have a need to query the same table (articles) twice, once for each category of article. AFAICT there are basically two ways of doing this (maybe someone can suggest a better, third way?):
Perform the db query twice, meaning the db server has to sort through the entire table twice. After each query, I iterate over the cursor to generate html for a list entry on the page.
Perform the query just once and pull out all the records, then sort them into two separate arrays. After this, I have to iterate over each array separately in order to generate the HTML.
So it's this:
$newsQuery = $mysqli->query("SELECT * FROM articles WHERE type='news' ");
while($newRow = $newsQuery->fetch_assoc()){
    // generate article summary in html
}
// repeat for informational articles
vs this:
$query = $mysqli->query("SELECT * FROM articles ");
$news = Array();
$info = Array();
while($row = $query->fetch_assoc()){
    if($row['type'] == "news"){
        $news[] = $row;
    } else {
        $info[] = $row;
    }
}
// iterate over each array separately to generate article summaries
The recordset is not very large, currently <200, and will probably grow to 1000-2000. Is there a significant difference in the timing between the two approaches, and if so, which one is faster?
(I know this whole thing seems awfully inefficient, but it's a poorly coded site I inherited and have to take care of without a budget for refactoring the whole thing...)
I'm writing in PHP, no framework :( , on a MySql db.
Edit
I just realized I left out one major detail. On a given page in the site, we will display (and thus retrieve from the db) no more than 30 records at once - but here's the catch: 15 info articles, and 15 news articles. On each page we pull the next 15 of each kind.
You know you can sort in the DB right?
SELECT * FROM articles ORDER BY type
EDIT
Due to the change made to the question, I'm updating my answer to address the newly revealed requirement: 15 rows for 'news' and 15 rows for not-'news'.
The gist of the question is the same: "which is faster... one query or two separate queries". The gist of the answer remains the same: each database roundtrip incurs overhead (extra time, especially over a network connection to a separate database server), so with all else being equal, reducing the number of database roundtrips can improve performance.
The new requirement really doesn't impact that. What the newly revealed requirement really impacts is the actual query to return the specified resultset.
For example:
( SELECT n.*
FROM articles n
WHERE n.type='news'
LIMIT 15
)
UNION ALL
( SELECT o.*
FROM articles o
WHERE NOT (o.type<=>'news')
LIMIT 15
)
Running that statement as a single query is going to require fewer database resources, and be faster than running two separate statements, and retrieving two disparate resultsets.
We weren't provided any indication of what the other values for type can be, so the statement offered here simply addresses two general categories of rows: rows that have type='news', and all other rows that have some other value for type.
That query assumes that type allows for NULL values, and we want to return rows that have a NULL for type. If that's not the case, we can adjust the predicate to be just
WHERE o.type <> 'news'
Or, if there are specific values for type we're interested in, we can specify that in the predicate instead
WHERE o.type IN ('alert','info','weather')
If "paging" is a requirement... "next 15", the typical pattern we see applied, LIMIT 30,15 can be inefficient. But this question isn't asking about improving efficiency of "paging" queries, it's asking whether running a single statement or running two separate statements is faster.
And the answer to that question is still the same.
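For completeness, a hedged sketch of what the "next 15 of each kind" page could look like with that UNION pattern (the id column used for a stable order is an assumption, and as noted, large offsets still get progressively slower):

// $page is assumed 0-based; each branch is ordered so LIMIT offset,15 is deterministic.
$offset = (int) $page * 15;
$result = $mysqli->query(
    "( SELECT n.* FROM articles n WHERE n.type = 'news' ORDER BY n.id LIMIT $offset, 15 )
     UNION ALL
     ( SELECT o.* FROM articles o WHERE NOT (o.type <=> 'news') ORDER BY o.id LIMIT $offset, 15 )"
);
while ($row = $result->fetch_assoc()) {
    // generate article summary in html, as in the question
}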
ORIGINAL ANSWER below
There's overhead for every database roundtrip. In terms of database performance, for small sets (like you describe) you're better off with a single database query.
The downside is that you're fetching all of those rows and materializing an array. (But, that looks like that's the approach you're using in either case.)
Given the choice between the two options you've shown, go with the single query. That's going to be faster.
As far as a different approach, it really depends on what you are doing with those arrays.
You could actually have the database return the rows in a specified sequence, using an ORDER BY clause.
To get all of the 'news' rows first, followed by everything that isn't 'news', you could
ORDER BY type<=>'news' DESC
That's MySQL short hand for the more ANSI standards compliant:
ORDER BY CASE WHEN t.type = 'news' THEN 1 ELSE 0 END DESC
Rather than fetch every single row and store it in an array, you could just fetch from the cursor as you output each row, e.g.
while($row = $query->fetch_assoc()) {
    echo "<br>Title: " . htmlspecialchars($row['title']);
    echo "<br>byline: " . htmlspecialchars($row['byline']);
    echo "<hr>";
}
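Putting those two pieces together, a hedged sketch of rendering both sections from one ordered query, straight from the cursor (the <h2> headings are just for illustration):

// One query, 'news' rows first; switch sections when the first non-news row appears.
$query = $mysqli->query("SELECT * FROM articles ORDER BY type<=>'news' DESC");

$in_news = true;
echo "<h2>News</h2>";
while ($row = $query->fetch_assoc()) {
    if ($in_news && $row['type'] !== 'news') {
        $in_news = false;
        echo "<h2>Informational articles</h2>";
    }
    echo "<br>Title: " . htmlspecialchars($row['title']);
    echo "<br>byline: " . htmlspecialchars($row['byline']);
    echo "<hr>";
}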
The best way of dealing with a situation like this is to test it for yourself. It doesn't matter how many records you have at the moment; you can simulate whatever amount you'd like, that's never a problem. Also, 1000-2000 is really a small data set.
I somewhat don't understand why you'd have to iterate over all the records twice. You should never retrieve all the records in a query either way, only the small subset you need to work with. On a typical site where you manage articles it's usually about 10 records per page, max. No user will ever go through 2000 articles in a way that would force you to pull all the records at once. Use paging and smart querying.
// iterate over each array separately to generate article summaries
Not really sure what you mean by this, but something tells me this data should be stored in the database as well. I really hope you're not generating article excerpts on the fly for every page hit.
It all sounds to me more like a bad architecture design than anything else...
PS: I believe sorting/ordering/filtering of database data should be done on the database server, not in the application itself. You may save some traffic by doing a single query, but it won't help much if you transfer too much data at once that you won't even be using.

Reasons not to use GROUP_CONCAT?

I just discovered this amazingly useful MySQL function GROUP_CONCAT. It appears so useful and over-simplifying to me that I'm actually afraid of using it, mainly because it's been quite some time since I started in web programming and I've never seen it anywhere. A sample of awesome usage would be the following:
Table clients holds clients ( you don't say... ) one row per client with unique IDs.
Table currencies has 3 columns client_id, currency and amount.
Now, if I wanted to get user 15's name from the clients table along with his balances, with the "old" method of array overwriting I would have to use the following SQL:
SELECT id, name, currency, amount
FROM clients LEFT JOIN currencies ON clients.id = client_id
WHERE clients.id = 15
Then in PHP I would have to loop through the result set and do an array overwrite (which I'm really not a big fan of, especially with massive result sets), like:
$result = array();
foreach($stmt->fetchAll() as $row){
    $result[$row['id']]['name'] = $row['name'];
    $result[$row['id']]['currencies'][$row['currency']] = $row['amount'];
}
However with the newly discovered function I can use this
SELECT id, name, GROUP_CONCAT(currency) AS currencies, GROUP_CONCAT(amount) AS amounts
FROM clients LEFT JOIN currencies ON clients.id = client_id
WHERE clients.id = 15
GROUP BY clients.id
Then at the application level things are so awesome and pretty:
$results = $stmt->fetchAll();
foreach($results as $k => $v){
    $results[$k]['currencies'] = array_combine(explode(',', $v['currencies']), explode(',', $v['amounts']));
}
The question I would like to ask is: are there any drawbacks to using this function, in performance or anything at all? To me it just looks like pure awesomeness, which makes me think there must be a reason people aren't using it more often.
EDIT:
What I eventually want to ask is: what are the other options, besides array overwriting, to end up with a multidimensional array from a MySQL result set? Because if I'm selecting 15 columns, it's a really big pain in the neck to write that beast.
Using GROUP_CONCAT() usually invokes the group-by logic and creates temporary tables, which are usually a big negative for performance. Sometimes you can add the right index to avoid the temp table in a group-by query, but not in every case.
As @MarcB points out, the default length limit of a group-concatenated string is pretty short, and many people have been confused by truncated lists. You can increase the limit with group_concat_max_len.
Exploding a string into an array in PHP does not come for free. Just because you can do it in one function call in PHP doesn't mean it's the best for performance. I haven't benchmarked the difference, but I doubt you have either.
GROUP_CONCAT() is a MySQLism. It is not supported widely by other SQL products. In some cases (e.g. SQLite), they have a GROUP_CONCAT() function, but it doesn't work exactly the same as in MySQL, so this can lead to confusing bugs if you have to support multiple RDBMS back-ends. Of course, if you don't need to worry about porting, this is not an issue.
If you want to fetch multiple columns from your currencies table, then you need multiple GROUP_CONCAT() expressions. Are the lists guaranteed to be in the same order? That is, does the third field in one list correspond to the third field in the next list? The answer is no -- not unless you specify the order with an ORDER BY clause inside the GROUP_CONCAT().
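To illustrate that last point: if both lists share the same ORDER BY inside GROUP_CONCAT(), the positions line up and the array_combine() from the question stays safe. A hedged sketch ($pdo stands for the PDO connection behind the question's $stmt, and it assumes client 15 has at least one currency row):

$sql = "SELECT clients.id, clients.name,
               GROUP_CONCAT(currency ORDER BY currency) AS currencies,
               GROUP_CONCAT(amount   ORDER BY currency) AS amounts
        FROM clients LEFT JOIN currencies ON clients.id = client_id
        WHERE clients.id = 15
        GROUP BY clients.id, clients.name";

foreach ($pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // positions match because both lists were ordered by the same column
    $balances = array_combine(
        explode(',', $row['currencies']),
        explode(',', $row['amounts'])
    );
}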
I usually favor your first code format, use a conventional result set, and loop over the results, saving to a new array indexed by client id, appending the currencies to an array. This is a straightforward solution, keeps the SQL simple and easier to optimize, and works better if you have multiple columns to fetch.
I'm not trying to say GROUP_CONCAT() is bad! It's really useful in many cases. But trying to make any one-size-fits-all rule to use (or to avoid) any function or language feature is simplistic.
The biggest problem that I see with GROUP_CONCAT is that it is highly specific to MySql: if you want to port your code to run against any other platform, you would have to rewrite all queries that use GROUP_CONCAT. For example, your first query is a lot more portable - you can probably run it against any major RDBMS engine without changing a single character in it.
If you are fine with working only with MySQL (say, because you are writing a tool that is meant to be specific to MySQL), the queries with GROUP_CONCAT() will probably be faster, because the RDBMS does more of the work for you, saving on the size of the data transfer.

!= in sql or != in php

Task: display 10 objects, excluding 1 specific one.
Solutions:
Get 11 objects from the DB and do something like this:
foreach ($products as $product) {
    if($product->getId() != $specificProduct->getId()){
        // display
    }
}
Just add a condition to the SQL query: WHERE p.id != :specific_product_id
Some additional information: we use Doctrine 2 with MySQL, so we should expect some additional time for hydration. I made some tests and timed both solutions, but I still have no idea which way is better.
So, I got some strange results from my test (100 queries with different parameters):
php = 0.19614
dql = 0.16745
php = 0.13542
dql = 0.15531
Maybe someone has advice on how I could have made my test better.
If you're concerned about the overhead of hydration, keep in mind that doing the != condition in PHP code means you have to fetch thousands of irrelevant, non-matching rows from the database, hydrate them, and then discard them.
Just the wasted network bandwidth for fetching all those irrelevant rows is costly -- even more so if you do that query hundreds of times per second as many applications do.
It is generally far better to eliminate the unwanted rows using SQL expressions, supported by indexes on the referenced tables.
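Since the question mentions Doctrine 2, a hedged sketch of pushing the condition into DQL with the QueryBuilder (the Product entity name and $em, the EntityManager, are assumptions):

// Exclude the row in the database, so only the 10 products that will actually be
// displayed are fetched and hydrated.
$products = $em->createQueryBuilder()
    ->select('p')
    ->from('Product', 'p')
    ->where('p.id != :specificId')
    ->setParameter('specificId', $specificProduct->getId())
    ->setMaxResults(10)
    ->getQuery()
    ->getResult();

foreach ($products as $product) {
    // display
}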
Use SQL as much as possible. This is the basic query you could use; it is much more efficient than discarding the rows in PHP:
$query = "SELECT * FROM table_name WHERE id <> '$specific_id'"; // escape or bind $specific_id in real code
I think queries like this (id-based, or on well-indexed columns) should be done on the SQL side, because the database can use its indexes and return less data to your application. Processing less data makes your application run faster.

Performance issues with mongo + PHP with pagination, distinct values

I have a mongodb collection containing lots of books with many fields. Some key fields relevant to my question are:
{
    book_id : 1,
    book_title : "Hackers & Painters",
    category_id : "12",
    related_topics : [ { topic_id : "8",  topic_name : "Computers" },
                       { topic_id : "11", topic_name : "IT" } ],
    ...
    ... (at least 20 more fields)
    ...
}
We have a form for filtering results (with many inputs/selectbox) on our search page. And of course there is also pagination. With the filtered results, we show all categories on the page. For each category, number of results found in that category is also shown on the page.
We are trying to use MongoDB instead of PostgreSQL, because performance and speed are our main concerns for this process.
Now the question is :
I can easily filter results by feeding the find function all the filter parameters. That's cool. I can paginate results with the skip and limit functions:
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" over all the results, collect categories and manage pagination in PHP, because the filtered data often consists of nearly 200,000 results.
Problem 1: I found the mongodb::command() function in the PHP manual with a "distinct" example. I think I can get distinct values with this method, but the command function doesn't seem to accept conditional parameters (for filtering). I don't know how to apply the same filter params while asking for distinct values.
Problem 2: Even if there is a way to send filter parameters to the mongodb::command() function, it will be another query in the process and will take approximately the same time (maybe more) as the previous query, I think. That would be another speed penalty.
Problem 3: Getting distinct topic_ids with the number of results will be yet another query and yet another speed penalty :(
I am new to working with MongoDB. Maybe I am looking at the problems from the wrong point of view. Can you help me solve the problems and give your opinion about the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array());
$count = $cursor->count();
$data = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to count() is to load and check each document. If you do skip() and limit() with no sort(), then the server just needs to find the first 20 matching documents, which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or in-line map/reduce. Neither one is generally intended for ad-hoc queries.
The only other option is basically to load all of the search results and then count on the webserver (instead of the DB). Obviously, this is also inefficient.
Getting all of these features is going to require some planning and tradeoffs.
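On Problem 1 from the question: the distinct database command does accept a query document, so the same filter used for find() can be reused. A hedged sketch with the legacy PHP driver ($db is the MongoDB database object; the collection name 'books' is an assumption):

// Distinct category_id values restricted to the filtered set. Note this returns the
// values only, not per-category counts; counts still need group/map-reduce as above.
$result = $db->command(array(
    'distinct' => 'books',
    'key'      => 'category_id',
    'query'    => $filter_params,
));
$category_ids = $result['values'];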
Pagination
Be careful with pagination on large datasets. Remember that skip() and limit() -- whether you use an index or not -- have to perform a scan. Therefore, skipping very far is very slow.
Think of it this way: the database has an index (B-tree) that can compare values to each other; it can tell you quickly whether something is bigger or smaller than some given x. Hence, search times in well-balanced trees are logarithmic. This is not true for count-based indexation: a B-tree has no way to quickly tell you what the 15,000th element is; it would have to walk and enumerate the entire tree.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the
server to walk from the beginning of the collection, or index, to get
to the offset/skip position before it can start returning the page of
data (limit). As the page number increases skip will become slower and
more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow
you to easily jump to a specific page.
Make sure you really need this feature: typically, nobody cares about the 42,436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
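A hedged sketch of the range-based paging mentioned in the quote, using the driver calls from the question and _id as the range key (any indexed, strictly increasing field works):

// Keyset paging: remember the last _id of the previous page and continue after it,
// instead of skipping. The index does the positioning, so deep pages stay cheap.
$page_filter = $filter_params;
if ($last_seen_id !== null) {            // the MongoId of the last document shown
    $page_filter['_id'] = array('$gt' => $last_seen_id);
}

$cursor = $lib_collection->find($page_filter)
                         ->sort(array('_id' => 1))
                         ->limit(20);

foreach ($cursor as $doc) {
    $last_seen_id = $doc['_id'];
    // render $doc ...
}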
Distinct Topic Counts
I believe you might be using a sledgehammer as a flotation device. Take a look at your data: related_topics. I personally hate RDBMSs because of object-relational mapping, but this seems to be the perfect use case for a relational database.
If your documents are very large, performance is a problem and you hate ORM as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: Let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, writing changes to the DB needs to occur on both databases.
