I have an array of IDs that I loop over with foreach, searching for each ID in a Solr index using the PHP Apache Solr client. It's slow as a dead turtle. Any help optimizing this is appreciated.
foreach ( $f_games as $game_id ) {
    $game_type = BKT_PLUGIN_CLASS::tv_regions($game_id);
    // Do my stuff
    $count++;
}
where
BKT_PLUGIN_CLASS::tv_regions
is my class method for the Solr API search (which works fine, no issues there).
So it does what I want it to do: it takes each ID, queries Solr, brings back the result for that item, then I do my thing and increment the count. With only 200+ IDs, it takes more than two minutes to spit out results.
Use Result Grouping in Solr; that way you can get x hits for each region, all rolled up into a single response. Tweak the number of groups and the number of hits per group to match your needs.
Filter the list with an fq containing all the values, so that only the documents you need are returned, then group by the field you'd normally search for.
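A rough sketch of that batched request with the PHP client (assuming the Apache_Solr_Service class; game_id and region are placeholder field names, adjust them to your schema):

$solr = new Apache_Solr_Service('localhost', 8983, '/solr/');

// One request for all IDs instead of one request per ID.
// game_id / region are hypothetical field names; intval() assumes numeric IDs.
$fq = 'game_id:(' . implode(' OR ', array_map('intval', $f_games)) . ')';

$params = array(
    'fq'          => $fq,        // restrict to just the IDs you need
    'group'       => 'true',     // enable result grouping
    'group.field' => 'region',   // one group per region
    'group.limit' => 5,          // hits returned per group
);

$response = $solr->search('*:*', 0, count($f_games), $params);

That turns 200+ round trips into one, which is almost certainly where the two minutes are going.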
Why are you pinging the API for each game? You are losing a lot of time just connecting each time...
Can't you just pass all the IDs at once and count the results?
I don't know Solr well, but it seems insane that this wouldn't be possible (so I assume it is doable).
Related:
How to do an IN query in Solr?
How can I search on a list of values using Solr/Lucene?
I have Sphinx Search running in production, performing keyword search through the official sphinxapi.php. Now I need to output the sum of an attribute called price along with the search results, similar to the SQL query "SELECT SUM(t.price) FROM table_name t WHERE condition". This data is to be displayed on a web page as "Showing 1 - 10 out of 12345 results, total cost is $67890". As the documentation says, the SUM() function is available when used with GROUP BY. However, the documentation does not provide enough implementation details, and googling and searching Stack Overflow doesn't help much either.
Questions:
How should I group the search result?
Can it be performed with 1 Sphinx request, or do I have to get the search results first and then query Sphinx again to get the sum of found documents?
Please advise. An example would be really helpful. Thank you.
You will need to run a second query. The 'sum' is wanted over the WHOLE result set, whereas with normal grouping the aggregation runs per group. In your SQL example there is an implicit GROUP BY '1' which aggregates all rows.
So you need to use grouping to do the same in Sphinx.
http://sphinxsearch.com/docs/current.html#clustering
Using the aggregation function is relatively easy with SetSelect(), but SetGroupBy() has no syntax to group all rows, so that has to be emulated.
// all the normal setup needed for the main query here
$cl->SetLimits($offset, $limit);
$cl->AddQuery($query, $index);

// add the group query
$cl->SetSelect("1 AS one, SUM(price) AS sum_price");
$cl->SetGroupBy("one", SPH_GROUPBY_ATTR);  // don't care about sorting
$cl->SetRankingMode(SPH_RANK_NONE);        // no point actually ranking results
$cl->SetLimits(0, 1);
$cl->AddQuery($query, $index);

// run both queries at once...
$results = $cl->RunQueries();
var_dump($results);

// $results[0] contains the normal text query results; use its total_found
// $results[1] contains just the SUM() data
This also shows how to set the two up as multi-queries:
http://sphinxsearch.com/docs/current.html#multi-queries
We have a big index of around one billion documents. Our application does not allow users to search everything: they have subscriptions, and they should only be able to search within them.
Our first iteration of the index used attributes, so a typical query looked like this (we are using the PHP API):
$cl->SetFilter('category_id', $category_ids); // array with all user subscriptions
$result = $cl->Query($term,"documents");
This worked without issues but was very slow. Then we saw this article. The analogy with an un-indexed MySQL query was alarming, and we decided to ditch the attribute-based filter and try a full-text column. So now our category_id is a full-text column. Indeed, our initial tests showed that searching is a lot faster, but when we launched the index into production we ran into an issue. Some users have many subscriptions, and we started to receive this error from Sphinx:
Error: index documents: query too complex, not enough stack (thread_stack_size=337K or higher required)
Our new queries look like this:
user_input #category_id c545|c547|c549|c556|c568|c574|c577|c685...
When there are too many categories, the above error shows up. We thought it would be easy to fix by just increasing thread_stack, but it turns out to be capped at 2MB, and we still have queries exceeding that.
The question is what to do now. We were thinking about splitting the query into smaller queries, but then how would we aggregate the results with the correct limit (we are using $cl->SetLimits($page, $limit); for pagination)?
Any ideas will be welcome.
You can do the 'pagination' in the application; this is roughly how Sphinx itself merges results when querying distributed indexes.
$upper_limit = ($page_number * $page_size) + 1;
$cl->SetLimits(0, $upper_limit);
foreach ($indexes as $index) {
    $cl->AddQuery($query, $index);
}
$results = $cl->RunQueries();

$all = array();
foreach ($results as $result) {
    foreach ($result['matches'] as $id => $match) {
        $all[$id] = $match['weight'];
    }
}
arsort($all); // best weight first
$page = array_slice($all, ($page_number - 1) * $page_size, $page_size, true);
(This is only a sketch to show the basic procedure.)
...yes, it's wasteful, but in practice most queries are for the first few pages anyway, so it doesn't matter all that much. It's the 'deep' results that will be particularly slow.
I work on a site which sells, let's say, stuff, and offers a "vendor search". In this search you enter your city, postal code, or region, plus a distance (in km or miles), and the site gives you a list of vendors.
To do that, I have a database of vendors. In the form used to save these vendors you enter their full address, and when you click the save button a request is made to Google Maps to get their latitude and longitude.
When someone does a search, I look in a table where I store all the search terms and their lat/lng.
This table looks like
+--------+-------+------+
| term | lat | lng |
+--------+-------+------+
So the first query is something very simple:
select lat, lng from my_search_table where term = "the term"
If I find a result, I then search with a nice method for all the vendors in the range the visitor wants and print the result on a map.
If I don't find a result, I search with a Levenshtein function, because people writing bruxelle or bruxeles instead of bruxelles is really common and I don't want to hit Google Maps every time (I also have a "how many times searched" column in my table for stats).
So I query my_search_table with no WHERE clause and loop through all the results to find the smallest Levenshtein distance. If the smallest distance is greater than 2, I request coordinates from Google Maps.
Here is my problem. For some countries (we have several sites around the world), my_search_table has 15-20k+ entries, and PHP doesn't (really) like looping over that much data (which I perfectly understand), so my request hits the PHP timeout. I could increase the timeout, but the problem will just come back in a few months.
So I tried a Levenshtein MySQL function (found on Stack Overflow, by the way), but it is also very slow.
So my question is: is there any way to make this search fast, even on very large datasets?
My suggestion is based on three things:
First, your data set is big. That means it's big enough to reject the idea of "select all" + "run levenshtein() in the PHP application".
Second, you have control over your database, so you can adjust some architecture-related things.
Finally, the performance of SELECT queries is the most important thing, while the performance of adding new data doesn't matter.
The thing is, you cannot perform a fast Levenshtein search, because Levenshtein itself is slow: calculating the Levenshtein distance is an expensive operation. Thus you won't be able to resolve the issue with "smart search" alone; you'll have to prepare some data.
A possible solution: create a group index and assign it when adding/updating data. That is, you'll store an additional column holding a group identifier (numeric, for example). When adding new data, you'll:
Perform a Levenshtein-distance search (using either your application or the MySQL function you already mentioned) over all records in the table against the inserted term.
Set the group index of the new row to the group index of the rows found in the previous step.
If nothing is found, set a new group index value (it's the first row and there are no similar rows yet), different from any group index value already present in the table.
To fetch the desired rows, you then just select rows with the same group index value, which means your SELECT queries will be very fast. But yes, this causes a huge overhead when adding/changing data, so it isn't applicable when insert/update performance matters.
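A rough sketch of the insert-time step (table and column names follow the question; group_idx is the hypothetical extra column, and the distance-2 cutoff mirrors the one already in use):

// Pick a group index for a newly added term.
function assignGroupIndex(PDO $db, $newTerm)
{
    $bestDistance = PHP_INT_MAX;
    $bestGroup    = null;

    foreach ($db->query('SELECT term, group_idx FROM my_search_table') as $row) {
        $d = levenshtein($newTerm, $row['term']);
        if ($d < $bestDistance) {
            $bestDistance = $d;
            $bestGroup    = (int) $row['group_idx'];
        }
    }

    if ($bestGroup === null || $bestDistance > 2) {
        // No similar term yet: allocate a fresh group index.
        $stmt = $db->query('SELECT COALESCE(MAX(group_idx), 0) + 1 FROM my_search_table');
        $bestGroup = (int) $stmt->fetchColumn();
    }

    return $bestGroup;
}

// At search time the lookup becomes a plain indexed equality:
// SELECT lat, lng FROM my_search_table WHERE group_idx = ?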
You could try the MySQL operator SOUNDS LIKE:
SELECT lat, lng FROM my_search_table WHERE term SOUNDS LIKE "the term"
You can use a kd-tree or a ternary tree to speed up the search. The idea is to use a binary search.
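A related structure purpose-built for edit distance is the BK-tree, which prunes branches via the triangle inequality. A minimal, untested sketch:

class BKTree {
    private $term;
    private $children = array(); // edit distance => subtree

    public function __construct($term) {
        $this->term = $term;
    }

    public function add($term) {
        $d = levenshtein($term, $this->term);
        if ($d === 0) {
            return; // already stored
        }
        if (isset($this->children[$d])) {
            $this->children[$d]->add($term);
        } else {
            $this->children[$d] = new BKTree($term);
        }
    }

    // Return all stored terms within $maxDist edits of $term.
    public function search($term, $maxDist) {
        $d    = levenshtein($term, $this->term);
        $hits = ($d <= $maxDist) ? array($this->term) : array();
        // Triangle inequality: only children whose edge distance lies in
        // [d - maxDist, d + maxDist] can contain matches.
        for ($i = $d - $maxDist; $i <= $d + $maxDist; $i++) {
            if (isset($this->children[$i])) {
                $hits = array_merge($hits, $this->children[$i]->search($term, $maxDist));
            }
        }
        return $hits;
    }
}

// Usage: build the tree once (and cache it), then query.
// $tree = new BKTree('bruxelles');
// $tree->add('bruges'); ...
// $matches = $tree->search('bruxeles', 2);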
I have a MongoDB collection containing lots of books with many fields. Some key fields relevant to my question are:
{
    book_id        : 1,
    book_title     : "Hackers & Painters",
    category_id    : "12",
    related_topics : [ { topic_id : "8",  topic_name : "Computers" },
                       { topic_id : "11", topic_name : "IT" } ]
    ...
    ... (at least 20 more fields)
    ...
}
We have a form for filtering results (with many inputs/selectboxes) on our search page, and of course there is also pagination. Alongside the filtered results, we show all categories on the page, and for each category we show the number of results found in it.
We are trying to use MongoDB instead of PostgreSQL, because performance and speed are our main concerns for this process.
Now the question is:
I can easily filter results by feeding the find function with all the filter parameters. That's cool. I can paginate results with the skip and limit functions:
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" over all the results, collecting categories and managing pagination in PHP, because the filtered data often consists of nearly 200,000 results.
Problem 1: I found the mongodb::command() function in the PHP manual, with a "distinct" example. I think I can get distinct values this way, but the command function doesn't seem to accept conditional parameters (for filtering). I don't know how to apply the same filter params while asking for distinct values.
Problem 2: Even if there is a way to send filter parameters with mongodb::command(), that will be another query in the process and will take approximately the same time as the previous query (maybe more), I think. That would be another speed penalty.
Problem 3: Getting distinct topic_ids with the number of results will be yet another query, another speed penalty :(
I am new to working with MongoDB. Maybe I am looking at these problems from the wrong point of view. Can you help me solve them and give your opinion on the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array());
$count  = $cursor->count();
$data   = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to count() is to load each document and check it. If you do skip() and limit() with no sort(), the server just needs to find the first 20 matching documents, which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or inline map/reduce. Neither is generally intended for ad-hoc queries.
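On Problem 1 specifically: the distinct command does accept a query key, so the same filter can be applied server-side. A sketch using the legacy Mongo PHP driver ($lib_db and the 'books' collection name are assumptions based on the question):

// Distinct category_ids restricted to the current filter.
$result = $lib_db->command(array(
    'distinct' => 'books',          // collection name (assumed)
    'key'      => 'category_id',
    'query'    => $filter_params,   // the same filter as the find()
));
$category_ids = $result['values'];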
The only other option is to load all of the search results and then count on the webserver (instead of in the DB). Obviously, this is also inefficient.
Getting all of these features is going to require some planning and tradeoffs.
Pagination
Be careful with pagination on large datasets. Remember that skip() and limit() (whether you use an index or not) have to perform a scan, so skipping very far is very slow.
Think of it this way: the database has an index (a B-tree) that can compare values to each other, so it can quickly tell you whether something is bigger or smaller than some given x. Hence search times in well-balanced trees are logarithmic. This does not hold for position-based access: a B-tree has no way to quickly tell you which the 15,000th element is; it has to walk and enumerate the entire tree.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases, skip will become slower and more CPU intensive, and possibly IO bound, with larger collections. Range-based paging provides better use of indexes but does not allow you to easily jump to a specific page.
Make sure you really need this feature: typically, nobody cares about the 42,436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
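To illustrate the range-based paging the documentation mentions, a sketch (reusing $lib_collection and $filter_params from the question; $last_id would be carried over from the previous page):

// Range-based paging: continue after the last _id of the previous page
// instead of skip()-ing past all the earlier documents.
$filter = $filter_params;
if ($last_id !== null) {
    $filter['_id'] = array('$gt' => $last_id);
}
$data = $lib_collection->find($filter)
                       ->sort(array('_id' => 1))
                       ->limit(20);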
Distinct Topic Counts
I believe you might be using a sledgehammer as a flotation device. Take a look at your data: related_topics. I personally hate RDBMSes because of object-relational mapping, but this looks like the perfect use case for a relational database.
If your documents are very large, performance is a problem, and you hate ORM as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, writes would then need to go to both databases.
I am storing lap-time data in a database; the data consists of a distance, time, average speed, and max speed.
I am trying to display a leaderboard which shows the top ten people for whatever query you set (who has gone the furthest in total, best average time, etc.). However, below the top ten I want to show the logged-in user's position in the leaderboard. To do this I am trying to run the same query, ordering the results and adding a row number to get the position.
I am using symfony 1.4 with the Doctrine ORM, and I can't for the life of me figure out how to get row numbers in a query. I know you can do it in SQL like so:
SELECT full_name, ROW_NUMBER() OVER(ORDER BY distance) AS row_number
Yet I can't get it working in Doctrine under symfony.
Does anyone have any ideas on a way I can do this? (or even another way of going about it)
Thanks in advance.
OK, here's my solution.
I kind of found an answer to this after some more experimenting and collaborating.
What I did was get rid of the select and just return the ordered list of results:
$results = self::getInstance()->createQuery('r')
    ->orderBy('r.distance')
    ->execute();
Then I hydrated this result into an array and used array_search() to find the key of the result in the array (luckily I am returning user data here, and I know which user I am looking for):
$index = array_search($user->toArray(), $results->toArray());
I can then return $index + 1 to give me the user's position in the leaderboard.
There is probably a better way to do this on the database side, but I couldn't for the life of me find out how.
If anyone has a better solution, please share.
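For what it's worth, one database-side alternative is a counting query: the user's rank equals the number of rows that sort ahead of them, plus one. An untested sketch for Doctrine 1 (the table name and the getDistance() accessor are guesses based on the question):

// Count how many rows rank ahead of the current user instead of
// hydrating the whole ordered list.
$conn  = Doctrine_Manager::connection();
$ahead = (int) $conn->fetchOne(
    'SELECT COUNT(*) FROM result r WHERE r.distance > ?',
    array($user->getDistance())
);
$position = $ahead + 1;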