Sphinx search on multiple indexes with different attributes - php

Is it possible to search on multiple indexes with different attributes and keep paging consistent?
For example we have 2 indexes:
Places with GEO data
Objects without GEO data
And we want to apply GEO filters to index #1 (SetFilterFloatRange, SetGeoAnchor) and SKIP these filters for index #2. We want to show the results in one result set with one paging.
Is it possible with SPHINX?

No, this is not currently possible - you will receive an error if you try to do so.
The workaround is to add the same field to index #2, filled with a placeholder value that indicates this check should be skipped.
Your search query might look like this: (@somefield ("%s") | @somefield ("NONE")), where NONE is your "empty value" and %s is the string you are actually looking for.
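A minimal sketch of building that query string, assuming Sphinx's extended-syntax field operator `@field`. The function name `build_optional_field_query` and the default placeholder are hypothetical, and escaping of special characters in the term is deliberately left out:

```python
def build_optional_field_query(field: str, term: str, empty_value: str = "NONE") -> str:
    """Match documents whose field holds the term OR the "skip this check" placeholder."""
    # Documents from the index without real data carry `empty_value` in this field,
    # so they always pass this part of the query.
    return f'(@{field} ("{term}") | @{field} ("{empty_value}"))'


query = build_optional_field_query("somefield", "paris")
print(query)
```

The resulting string would then be passed as the query text when searching both indexes in one call.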

Related

With the PHP Riak Client, how would I search two indexes for values simultaneously?

I have a dataset in Riak with different items indexed (using index_bin). How would I go about searching objects where two of these indexes have particular values in a single request? Example: gender, last_name with gender = male, and last_name = Smith
Would I use Map/Reduce? If so, any example code?
A limitation of secondary indexes in Riak is that only a single index can be searched at a time. You will therefore not be able to directly combine indexes.
As the index data is stored in the metadata of the record, you could create a mapreduce job that takes one 2i query as input and has a map phase that filters on the other based on the metadata. Using mapreduce this way may however be quite slow and inefficient as all data passed into the map phase function need to be read from disk.
If you are looking to serve a reasonably common and predictable request, you can always create and use a composite index instead. You could e.g. create an index named gender_name_bin which could have values like male_Smith. This will allow you to do a range query on the last part of the index as long as the first part is fixed, which gives some flexibility.
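The composite index idea can be illustrated without a Riak cluster. The sketch below only simulates an ordered index with a sorted Python list; `composite_key`, `range_query` and the `\uffff` upper bound are assumptions made for illustration, not Riak client calls:

```python
from bisect import bisect_left, bisect_right


def composite_key(gender: str, last_name: str) -> str:
    """Build a value for a hypothetical gender_name_bin index, e.g. male_Smith."""
    return f"{gender}_{last_name}"


def range_query(sorted_keys, start, end):
    """Return all keys k with start <= k <= end, like an inclusive range query."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


people = [("male", "Smith"), ("female", "Smith"), ("male", "Jones"),
          ("male", "Smythe"), ("male", "Taylor")]
index = sorted(composite_key(g, n) for g, n in people)

# First part fixed ("male"), range over the last part: names starting with "Sm".
matches = range_query(index, "male_Sm", "male_Sm\uffff")
print(matches)
```

Because the gender prefix is fixed, every key in the range shares it, which is why the range over the name part stays meaningful.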
In recent versions it is also possible to filter secondary index values with regular expressions, which does not require the actual object to be loaded. More information about this can be found here.

Combining joined tag search with other parameters in RT sphinx

We use sphinx with a RealTime (RT) index to search through our database. Right now it contains fields such as longitude, latitude, title, content and it's all working fine. The problem is that we want to implement a relational Tag-table and we are not sure how to do it.
In our current configuration we take advantage of a lot of the preconfigured methods available in the sphinxApi (for php), such as:
$this->_sphinxClient->setMatchMode(SPH_MATCH_EXTENDED2);
$this->_sphinxClient->SetGeoAnchor('latitude', 'longitude', (float)$this->latitude, (float)$this->longitude);
$this->_sphinxClient->SetFilterRange('price', $this->_priceMin, $this->_priceMax);
// And getting the final result with the:
$result = $this->_sphinxClient->Query($this->searchString, 'rt');
What we would like to do, if possible, is either use an MVA (multi-value attribute) or search through the results a second time with a join statement, weeding out the results that contain none of the tags.
We can't get either of these options to work at the moment, so if anyone has any idea I would love a little help here. Use another index with an id/tagname combination, or a string attribute in the current one? Implement the search in the same query as the first one, or search through those results in a second query with the tag join?
If I have missed anything important here please let me know, and thank you in advance!
Attach the tags to the current index. If you just need to search them, insert the tags into a full-text field, and also into a string attribute if you want the tags returned in the result. If you need to do grouping, you can:
use an MVA, but you will need to maintain a map between tag names and tag ids
use a JSON attribute. You can use IN on an array of strings, as with an MVA. For something more advanced you can use the ALL() or ANY() functions.
For grouping, remember to use SetArrayResult(true). I also recommend switching to the SphinxQL interface.
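A rough sketch of the tag name to tag id map mentioned for the MVA option. The tag names, the ids, the attribute name `tags` and the helper `tag_filter_clause` are all hypothetical; the helper only produces a SphinxQL-style `IN` condition as text and does not talk to a server:

```python
# Hypothetical mapping; in practice this would come from the relational tag table.
TAG_IDS = {"beach": 1, "hotel": 2, "museum": 3}


def tag_filter_clause(tag_names, attribute="tags"):
    """Translate tag names into an `attribute IN (...)` condition on an MVA."""
    ids = sorted(TAG_IDS[name] for name in tag_names)
    return f"{attribute} IN ({', '.join(str(i) for i in ids)})"


clause = tag_filter_clause(["museum", "beach"])
print(clause)
```

Such a clause would be appended to the WHERE part of the query next to the full-text match and the geo conditions.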

Algorithm for optimising compound index search in MongoDb

I have a collection X on which I have to apply a filter.
The filter is saved as a separate entity (collection filters), and the only data it holds are the field names and the conditions applied to those fields.
Example of filter:
Name is Stephan and Age BETWEEN 10, 20
Basically what I have to improve is the fact that each field in my filter is an index added upon creation of the filter.
The only structure that matches is a compound index on the fields filtered.
In conclusion, the problem is that when I have a filter like:
Name is Stephan and Age BETWEEN 10,20
My compound index in MongoDb will be: {'Name':1,'Age':1}
But then, if I add another filter, let's say: Age is 10 and Name is Adrian and Height BETWEEN 170,180
compound index is: {'Age':1,'Name':1, 'Height':1}
{'Name':1,'Age':1} <> {'Age':1,'Name':1, 'Height':1}
What can I do to make the last index fit with the first and the other way around.
Please let me know if I haven't been explicit enough.
The cleanest solution to this problem is index intersections, which is currently in development. That way, an index for each of the criteria would be sufficient.
In the meantime, I see two options:
Use a separate search database that returns the relevant ids based on your criteria, then use $in in MongoDB to query the actual documents. There are a number of tools that use this approach, but it adds quite a bit of overhead because you need to code against and administer a second db, keep the data in sync, etc.
Use a smart mix of compound indexes and 'infinite range queries'. For instance, you can argue that a query for age in the range of (0, 200) won't discard anybody from the result set, neither will a height query between 0 and 400.
That might not be the cleanest approach, and its efficiency depends very much on the details of the queries, so that might require some fine-tuning.
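The second option can be sketched as a small helper that pads every filter to the shape of one compound index. This assumes a single index in the order Name, Age, Height and uses the bounds mentioned above as "match everything" ranges; `INDEX_FIELDS`, `WIDE_RANGES` and `pad_filter` are hypothetical names, and only plain query documents are built (no driver call):

```python
# Field order of the single compound index, e.g. {'Name': 1, 'Age': 1, 'Height': 1}.
INDEX_FIELDS = ["Name", "Age", "Height"]

# Assumed bounds that exclude nobody; fields without an entry cannot be padded.
WIDE_RANGES = {"Age": (0, 200), "Height": (0, 400)}


def pad_filter(conditions: dict) -> dict:
    """Return a query document that follows the index order, adding wide ranges for missing fields."""
    query = {}
    for field in INDEX_FIELDS:
        if field in conditions:
            query[field] = conditions[field]
        elif field in WIDE_RANGES:
            low, high = WIDE_RANGES[field]
            query[field] = {"$gte": low, "$lte": high}
    return query


first = pad_filter({"Name": "Stephan", "Age": {"$gte": 10, "$lte": 20}})
second = pad_filter({"Age": 10, "Name": "Adrian", "Height": {"$gte": 170, "$lte": 180}})
print(first)
print(second)
```

Both filters now touch the same three fields in the same order, so one compound index can serve them. A filter without a Name condition is not padded here, since there is no obvious "infinite range" for a string equality.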

Levenshtein search

I work on a site which sells let's say stuff and offers a "vendors search". On this search you enter your city, or postal code, or region and a distance (in km or miles) then the site gives you a list of vendors.
To do that, I have a database with the vendors. In the form to save these vendors, you enter their full address and when you click on the save button, a request to google maps is made in order to get their latitude and longitude.
When someone does a search, I look on a table where I store all the search terms and their lat/lng.
This table looks like
+--------+-------+------+
| term | lat | lng |
+--------+-------+------+
So the first query is something very simple
select lat, lng from my_search_table where term = "the term"
If I find a result, I then search with a nice method for all the vendors in the range the visitor wants and print the result on a map.
If I don't find a result, I search with a levenshtein function, because people writing bruxelle or bruxeles instead of bruxelles is really common and I don't want to make a request to google maps every time (I also have a "number of times searched" column in my table to get some stats).
So I query my_search_table with no where clause and loop through all results to get the smallest levenshtein distance. If the smallest distance is greater than 2, I request coordinates from google maps.
Here is my problem. For some countries (we have several sites all around the world), my_search_table has 15-20k+ entries... and php doesn't (really) like looping on such data (which I perfectly understand), so my request hits the php timeout. I could increase this timeout, but the problem will be the same in a few months.
So I tried a levenshtein MySQL function (found on stackoverflow btw), but it's also very slow.
So my question is "is there any way to make this search fast even on very large datasets ?"
My suggestion is based on three things:
First, your data set is big. That is, big enough to rule out the idea of "select all" + "run levenshtein() in the PHP application".
Second, you have control over your database, so you can adjust some architecture-related things.
Finally, the performance of SELECT queries is the most important thing, while the performance of adding new data doesn't matter.
The thing is, you cannot perform a fast levenshtein search, because calculating the levenshtein distance is itself slow. Thus, you won't be able to resolve the issue with a "smart search" alone. You'll have to prepare some data.
A possible solution: create a group index and assign it when adding or updating data. That is, you store an additional column holding some hash (numeric, for example). When adding new data, you:
Perform a levenshtein distance search of the inserted data against all records in your table (for that you may use either your application or the function you already mentioned).
Set the group index of the new row to the group index value of the rows found in the previous step.
If nothing is found, set a new group index value (it's the first row and there are no similar rows yet), different from any group index value already present in the table.
To find the desired rows, you then just select the rows with the same group index value, so your SELECT queries will be very fast. But yes, this causes a huge overhead when adding or changing data, so it isn't applicable when insert/update performance matters.
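The insert steps above can be sketched in memory as follows. This is only an illustration under assumptions: `SearchTable`, its methods and the threshold of 2 stand in for the real table and its extra group column, and a new term simply joins the group of the first sufficiently close existing term:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


class SearchTable:
    """In-memory stand-in for my_search_table with an extra group column."""

    def __init__(self, max_distance: int = 2):
        self.rows = []  # list of (term, group_id)
        self.max_distance = max_distance
        self.next_group = 1

    def insert(self, term: str) -> int:
        # Slow part, paid once per insert: compare against every stored term.
        for existing, group_id in self.rows:
            if levenshtein(term, existing) <= self.max_distance:
                self.rows.append((term, group_id))
                return group_id
        group_id = self.next_group  # nothing similar yet: open a new group
        self.next_group += 1
        self.rows.append((term, group_id))
        return group_id

    def similar_terms(self, term: str):
        # Fast part: an equality lookup on the group, no distance computation.
        groups = {g for t, g in self.rows if t == term}
        return [t for t, g in self.rows if g in groups]


table = SearchTable()
for word in ["bruxelles", "paris", "bruxelle", "bruxeles"]:
    table.insert(word)
print(table.similar_terms("bruxelle"))
```

In a database the lookup in `similar_terms` would be an indexed equality condition on the group column, which is what makes the SELECT side cheap.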
You could try MySQL function SOUNDS LIKE
SELECT lat, lng FROM my_search_table WHERE term SOUNDS LIKE "the term"
You can use a kd-tree or a ternary tree to speed up the search. The idea is to use a binary search.

How do I do a boolean OR in solr?

I am completely new at this...
I am using Drupal with the apachesolr module.
I have the following filter:
( tid:19895 OR tid:19937 ) AND type:poster
This does not return any results, but if I have the following, it returns the results as expected
( tid:19937 ) AND type:poster
EDIT: This is all in a filter. I need to have a list of tids that it could be along with having type be poster. So it has to be type:poster AND one of the following tids: 19895, 19937
although:
((tid:(19895 OR 19937)) AND type:poster) should work irrespective of defaultOperator being OR/AND
Adding the clauses as separate filters is better, so add two filters:
tid:(19895 OR 19937)
type:poster
The two filters are combined as AND, and each one is cached separately, which improves performance for later queries that reuse the same filter.
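As a sketch of what two separate filters look like at the request level, assuming they are sent as repeated `fq` parameters on a plain Solr select request (the Drupal module builds its requests differently, and `build_solr_params` is a hypothetical helper):

```python
from urllib.parse import urlencode, parse_qs


def build_solr_params(tids, doc_type, query="*:*"):
    """Encode one main query plus two independent filter queries."""
    tid_clause = " OR ".join(str(t) for t in tids)
    return urlencode([
        ("q", query),
        ("fq", f"tid:({tid_clause})"),  # any of the listed tids
        ("fq", f"type:{doc_type}"),     # and the required type
    ])


params = build_solr_params([19895, 19937], "poster")
print(params)
print(parse_qs(params)["fq"])
```

Keeping the type filter separate means it can be reused unchanged by other queries that use a different list of tids.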
I think it's important to note that SOLR isn't a database. It's a search engine, and therefore isn't going to work like the database queries you may be used to.
Adding a + means the clause is required.
+tid:19895 would mean that the TID field is required to equal exactly 19895.
The * is the wildcard, so adding it means your field would start with the value but not necessarily equal it.
tid:19895* would mean the TID field starts with 19895.
It looks like the issue is defining the field twice in the same filter?
So try removing the second "tid:".
In my testing I did this:
My SOLR query is:
+type:poster
And the Filter is:
tid:19895 OR 19937
Here is another stackoverflow question similar to yours since you obviously have a different syntax than I use.
using OR and NOT in solr query
As an FYI, if the default boolean operator is "OR", then write the query like this:
((tid:(19895 19937)) AND type:poster) This works fine; I tested it.
