Apache Solr - Alternate ways to boost results - php

I'm new to Solr and feel there must be ways of boosting results other than the "qf" and "pf" parameters.
Can someone give me an alternative way to do this? I have three fields and would like to base my boost(s) on them.
Let's say there is a field with boolean values (either 1 or 0), and I want to boost results where it has the value 1. Is there a way to do this? Would I have to write an "if" condition of some sort? Simply put, is there a way to get it done?
Thanks

If you are using the edismax or dismax query parser, which is most probably the case, you can use bq (boost query) or bf (boost function).
So for your example, I would add a boost query like this:
bq=Myfield:1^2.0
http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
If you are using the standard query parser, you can use the BoostQParserPlugin, and type your query like this: q={!boost b=xxx}query
You can also use Solr's magic field _val_, which affects the score but not the matching.
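As a rough illustration of the bq suggestion above, here is a minimal PHP sketch that assembles the request parameters. The helper name buildBoostedQuery and the qf fields "title content" are assumptions made for the example; only defType, q, qf and bq are actual Solr parameters.

```php
<?php
// Hypothetical helper: builds the query string for a Solr select request
// that boosts documents whose boolean field has the value 1.
function buildBoostedQuery(string $userQuery, string $boostField, float $boost): string
{
    $params = [
        'defType' => 'edismax',        // bq is understood by the (e)dismax parsers
        'q'       => $userQuery,
        'qf'      => 'title content',  // assumed searchable fields, adjust to your schema
        'bq'      => $boostField . ':1^' . number_format($boost, 1, '.', ''),
    ];
    // Explicit '&' separator so the result does not depend on php.ini settings.
    return http_build_query($params, '', '&');
}

echo buildBoostedQuery('laptop', 'Myfield', 2.0), "\n";
```

The resulting string would be appended to the select URL of your own Solr core. Documents with Myfield:0 still match; they just score lower.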

Related

Setting Boost in Solr - Adding conditions

I would like to know if there is a way to set conditional boosts.
Ex:
if( category:1 )( field_1^1.5 field_2^1.2 )else if( category:3 )( field_3^7.5 field_4^5.2 )
I'm planning to set the "qf" and "pf" parameters this way in order to boost my results, is it possible?
Conceptually, yes. It could be done using function queries (http://wiki.apache.org/solr/FunctionQuery), which include an if function, but I wasn't able to do it myself, since I couldn't use the == operator.
You could also write your own function query.
Either way, treat this as a starting point rather than a concrete answer.
I think you have two ways of doing this.
The first way is to simplify things at index time: create another set of redundant fields in the schema (e.g. boostfield_1, boostfield_2, etc.), and if the document category is 1, set the value of boostfield_1 to field_1 and boostfield_2 to field_2. If the category is 2, set them to other fields.
This allows you to use "pf" straight away without any conditions, because you already applied the conditions at index time and indexed each document differently based on its category. The drawback is that you won't be able to change the score or boost values of the fields according to the category, but it is the simpler way.
The second way is to use the _val_ or bq parameters to specify a boost query, so the same condition can be written as follows:
url?q=query AND _val_:"(category:1 AND (field_1:query OR field_2:query)) OR (category:3 AND field_2:query)"
The small drawback here is that you repeat the query text in every inner query, which is not a big deal.
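Since the query text is repeated in every inner clause, generating the boost clause programmatically may be less error-prone. The sketch below is only an illustration: buildConditionalBoost is a hypothetical helper, the boosts are taken from the question, and it assumes a single search term that needs no escaping. The output is meant to be sent as a bq value; check it against your own Solr version.

```php
<?php
// Hypothetical helper: builds one boost clause per category.
// $rules maps a category value to a list of field => boost pairs.
function buildConditionalBoost(string $term, array $rules): string
{
    $clauses = [];
    foreach ($rules as $category => $fields) {
        $fieldClauses = [];
        foreach ($fields as $field => $boost) {
            // The term is repeated for every field, as noted in the answer.
            $fieldClauses[] = sprintf('%s:%s^%s', $field, $term, $boost);
        }
        $clauses[] = sprintf('(category:%s AND (%s))', $category, implode(' OR ', $fieldClauses));
    }
    return implode(' OR ', $clauses);
}

$bq = buildConditionalBoost('query', [
    1 => ['field_1' => '1.5', 'field_2' => '1.2'],
    3 => ['field_3' => '7.5', 'field_4' => '5.2'],
]);
echo $bq, "\n";
```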

Combining joined tag search with other parameters in RT sphinx

We use Sphinx with a RealTime (RT) index to search through our database. Right now it contains fields such as longitude, latitude, title and content, and it's all working fine. The problem is that we want to implement a relational tag table and we are not sure how to do it.
In our current configuration we take advantage of a lot of the preconfigured methods available in the sphinxApi (for php), such as:
$this->_sphinxClient->setMatchMode(SPH_MATCH_EXTENDED2);
$this->_sphinxClient->SetGeoAnchor('latitude', 'longitude', (float)$this->latitude, (float)$this->longitude);
$this->_sphinxClient->SetFilterRange('price', $this->_priceMin, $this->_priceMax);
// And getting the final result with the:
$result = $this->_sphinxClient->Query($this->searchString, 'rt');
What we would like to do, if possible, is either use an MVA (multi-value attribute) or search through the results a second time with a join statement, weeding out the results that contain none of the tags.
We can't get either of these options to work at the moment, so if anyone has any idea I would love a little help here. Should we use another index with an id/tagname combination, or a string attribute in the current one? Should we implement the search in the same query as the first one, or search through those results in a second query with the tag join?
If I have missed anything important here please let me know, and thank you in advance!
Attach the tags to the current index. If you just need to search them, insert the tags in a full-text field, plus a string attribute if you also want the tags returned in the results. If you need to do grouping, you can:
use a MVA, but you will need to make a map between tag name and a tag id
use a JSON attribute. You can use IN on an array of strings like for MVA. For something more advanced you can use ALL() or ANY() functions.
For grouping, remember to use SetArrayResult(true). I also recommend switching to the SphinxQL interface.
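For the MVA option, the map between tag names and numeric ids has to come from somewhere. One possible sketch, assuming you have no tag table yet, derives the id from the tag name with crc32(). The helper names tagId and tagIds and the attribute name "tags" are hypothetical, and crc32() can collide for different tags, so a real id column in your tag table is the safer choice.

```php
<?php
// Hypothetical mapping: derive a numeric id from a tag name so tags fit in an MVA.
// Names are normalized first so "PHP" and " php" map to the same id.
function tagId(string $tag): int
{
    return crc32(strtolower(trim($tag)));
}

// Maps a list of tag names to a de-duplicated list of ids.
function tagIds(array $tags): array
{
    return array_values(array_unique(array_map('tagId', $tags)));
}

$ids = tagIds(['PHP', 'php ', 'sphinx']);

// With the sphinxApi client from the question, filtering on an assumed
// MVA attribute named "tags" might then look like:
// $this->_sphinxClient->SetFilter('tags', $ids);
```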

Performance issues with mongo + PHP with pagination, distinct values

I have a MongoDB collection that contains lots of books with many fields. Some key fields that are relevant to my question are:
{
book_id : 1,
book_title :"Hackers & Painters",
category_id : "12",
related_topics : [ {topic_id : "8", topic_name : "Computers"},
{topic_id : "11", topic_name : "IT"}
]
...
... (at least 20 fields more)
...
}
We have a form for filtering results (with many inputs and select boxes) on our search page, and of course there is also pagination. Along with the filtered results, we show all categories on the page, each with the number of results found in that category.
We are trying MongoDB instead of PostgreSQL because performance and speed are our main concerns for this process.
Now the question is :
I can easily filter results by feeding "find" function with all filter parameters. That's cool. I can paginate results with skip and limit functions :
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" over all results, collect categories and manage pagination in PHP, because the filtered data often consists of nearly 200,000 results.
Problem 1 : I found the mongodb::command() function in the PHP manual with a "distinct" example. I think I can get distinct values with this method, but the command function doesn't seem to accept conditional parameters (for filtering). I don't know how to apply the same filter params while asking for distinct values.
Problem 2 : Even if there is a way of sending filter parameters with the mongodb::command function, it will be another query in the process and will, I think, take approximately the same time as the previous query (maybe more). That is another speed penalty.
Problem 3 : Getting the distinct topic_ids with their number of results will be yet another query, another speed penalty :(
I am new to MongoDB. Maybe I am looking at the problem from the wrong point of view. Can you help me solve these problems and give your opinion about the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array());
$count = $cursor->count();
$data = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to "count()" is to load each document and check. If you do skip() and limit() with no sort(), the server only needs to find the first 40 matching documents (20 skipped plus 20 returned), which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or in-line map/reduce. Neither one is generally intended for ad-hoc queries.
The only other option is basically to load all of the search results and then count on the webserver (instead of the DB). Obviously, this is also inefficient.
Getting all of these features is going to require some planning and tradeoffs.
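One further option, not mentioned in the answer above, is MongoDB's aggregation framework, which can apply the same filter and count per group on the server. The sketch below only builds the pipeline arrays; the function names are hypothetical, the field names come from the question, and the pipelines would be passed to the aggregate method of whichever driver version you use. Treat it as a starting point to benchmark, not as a guaranteed speed-up.

```php
<?php
// Hypothetical helper: pipeline that counts matching books per category.
function categoryCountPipeline(array $filterParams): array
{
    return [
        ['$match' => $filterParams],   // same filter as the find() call
        ['$group' => ['_id' => '$category_id', 'count' => ['$sum' => 1]]],
    ];
}

// Hypothetical helper: pipeline that counts matching books per topic.
// $unwind emits one document per entry of the related_topics array.
function topicCountPipeline(array $filterParams): array
{
    return [
        ['$match'  => $filterParams],
        ['$unwind' => '$related_topics'],
        ['$group'  => ['_id' => '$related_topics.topic_id', 'count' => ['$sum' => 1]]],
    ];
}

$filter_params = ['category_id' => '12'];
$pipeline = topicCountPipeline($filter_params);
```

Note that an empty filter array may need to be cast to an object before it is sent, depending on the driver.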
Pagination
Be careful with pagination on large datasets. Remember that skip() followed by limit(), whether or not you use an index, has to perform a scan up to the offset. Therefore, skipping very far is very slow.
Think of it this way: the database has an index (B-Tree) that can compare values to each other, so it can tell you quickly whether something is bigger or smaller than some given x. Hence, search times in well-balanced trees are logarithmic. This is not true for count-based lookups: a B-Tree has no way to tell you quickly what the 15,000th element is; it has to walk the tree and count elements until it gets there.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
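Range based paging, as mentioned in the quote, can be sketched like this. The helper nextPageFilter is hypothetical, and the sketch assumes book_id is unique, indexed, and not already constrained by the filter.

```php
<?php
// Hypothetical helper: adds a range condition so the next page starts
// right after the last book_id seen on the previous page.
function nextPageFilter(array $filterParams, $lastSeenId = null): array
{
    if ($lastSeenId !== null) {
        $filterParams['book_id'] = ['$gt' => $lastSeenId];
    }
    return $filterParams;
}

// Usage with the cursor API from the question, sorting on the same field:
// $data = $lib_collection->find(nextPageFilter($filter_params, $lastBookId))
//                        ->sort(array('book_id' => 1))
//                        ->limit(20);
```

The server can then seek to the range start through the index instead of walking past skipped documents, at the cost of only supporting "next page" style navigation.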
Make sure you really need this feature: Typically, nobody cares for the 42436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
Distinct Topic Counts
I believe you might be using a sledgehammer as a floatation device. Take a look at your data: related_topics. I personally hate RDBMS because of object-relational mapping, but this seems to be the perfect use case for a relational database.
If your documents are very large, performance is a problem and you hate ORM as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: Let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, writing changes to the DB needs to occur on both databases.

postprocess solr's faceted search result

I'm not sure how to handle the following issue, so I hope to get some ideas here.
I'm using Lucene with Solr. Every document indexed in Lucene has a date field and a topic field (with some keywords).
By using faceted search, I'm able to calculate the frequency of every keyword on a specific date.
Example 1 (pseudo code):
1st search where date=today:
web=>70
apple=>35
blue=>32
2nd search where date=yesterday:
web=>65
blue=>55
apple=>5
But now I would like to combine the results into one Solr/Lucene query in order to calculate which word frequencies grow strongly and which don't.
A result could be:
Example 2:
one search merging both queries from example 1
web=>(70,65) <- growth +7.69%
blue=>(32,55) <- growth -41.82%
apple=>(35,5) <- growth +600%
Is it possible (and useful) to do this consolidation (and calculation) inside Solr, or is it better to run 2 Solr queries (see example 1) and postprocess the results with PHP?
Thank you!
If you have the facet values a priori, you could do this with facet queries, i.e. something like facet.query=category:web AND date:[2011-06-14T00:00:00Z TO 2011-06-14T23:59:59Z]&facet.query=category:web AND date:[2011-06-13T00:00:00Z TO 2011-06-13T23:59:59Z]&... so you would do the cartesian product of facet values * dates.
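The cartesian product of facet values and dates can be generated rather than typed by hand. This is a minimal sketch with hypothetical helper names; the field names category and date follow the example above, and values are assumed to need no escaping.

```php
<?php
// Hypothetical helper: one facet.query per (value, day) pair.
function buildFacetQueries(array $values, array $days, string $field = 'category'): array
{
    $queries = [];
    foreach ($values as $value) {
        foreach ($days as $day) {
            $queries[] = sprintf(
                '%s:%s AND date:[%sT00:00:00Z TO %sT23:59:59Z]',
                $field, $value, $day, $day
            );
        }
    }
    return $queries;
}

// Solr expects the facet.query parameter to be repeated, which
// http_build_query() does not produce, so the string is assembled by hand.
function facetQueryString(array $facetQueries): string
{
    $parts = ['facet=true'];
    foreach ($facetQueries as $facetQuery) {
        $parts[] = 'facet.query=' . rawurlencode($facetQuery);
    }
    return implode('&', $parts);
}

$queries = buildFacetQueries(['web', 'blue'], ['2011-06-14', '2011-06-13']);
echo facetQueryString($queries), "\n";
```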
Otherwise, to do this inside Solr I think you'd have to write some custom Java faceting code. Or do it client-side, with multiple queries as you mentioned.
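For the client-side route, the growth calculation itself is small. The sketch below assumes the two facet results have already been turned into term => count arrays; the function names are hypothetical. Terms that only appear yesterday are ignored here, and a term with no count yesterday yields null instead of a division by zero.

```php
<?php
// Returns the percentage change from $previous to $current,
// or null when $previous is 0 (growth is undefined in that case).
function growthPercent(int $current, int $previous): ?float
{
    if ($previous === 0) {
        return null;
    }
    return ($current - $previous) / $previous * 100;
}

// Combines two facet results (term => count) into term => growth in percent.
function facetGrowth(array $today, array $yesterday): array
{
    $result = [];
    foreach ($today as $term => $count) {
        $result[$term] = growthPercent($count, $yesterday[$term] ?? 0);
    }
    return $result;
}

$growth = facetGrowth(
    ['web' => 70, 'apple' => 35, 'blue' => 32],
    ['web' => 65, 'blue' => 55, 'apple' => 5]
);
```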

How can I search for multiple terms in multiple table columns?

I have a table that lists people and all their contact info. I want users to be able to perform an intelligent search on the table by simply typing in some terms and getting back results where each term they entered matches at least one of the columns in the table. To start, I have made a query like
SELECT * FROM contacts WHERE
firstname LIKE '%Bob%'
OR lastname LIKE '%Bob%'
OR phone LIKE '%Bob%' OR
...
But now I realize that this will completely fail on something as simple as 'Bob Jenkins', because it is not smart enough to search for the first and last name separately. What I need to do is split up the search terms, search for them individually, and then intersect the results from each term somehow. At least that seems like the solution to me. But what is the best way to go about it?
I have heard about fulltext and MATCH()...AGAINST() but that sounds like a rather fuzzy search and I don't know how much work it is to set up. I would like precise yes or no results with reasonable performance. The search needs to be done on about 20 columns by 120,000 rows. Hopefully users wouldn't type in more than two or three terms.
Oh sorry, I forgot to mention I am using MySQL (and PHP).
I just figured out fulltext search and it is a cool option to consider (is there a way to adjust how strict it is? LIMIT would just chop off the results regardless of how well they matched). But this requires a fulltext index, and my website is using a view, and you can't index a view, right? So...
I would suggest using MATCH / AGAINST. Full-text searches are more advanced, more like Google's and less elementary.
It can match across multiple tables and rank the results by how many matches they have.
Otherwise, if the word is there at all, especially across multiple tables, you have no ranking. You can do the ranking server-side, but that is going to take more programming and time.
Depending on what database you're using, searching across columns can be more or less difficult. You probably don't want to do 20 JOINs, as that will be a very slow query.
There are also engines such as Sphinx and Lucene dedicated to these types of searches.
BOOLEAN MODE
SELECT * FROM contacts WHERE
MATCH(firstname,lastname,email,webpage,country,city,street...)
AGAINST('+bob +jenkins' IN BOOLEAN MODE)
Boolean mode is very powerful. It might even fulfil all my needs, but I will have to do some testing. Placing + in front of a search term makes that term required (the row must match 'bob' AND 'jenkins' instead of 'bob' OR 'jenkins'). This mode even works on non-indexed columns, so I can use it on a view, although it will be slower (that is what I need to test). One final problem I had was that it wasn't matching partial search terms, so 'bob' wouldn't find 'bobby', for example. The usual % wildcard doesn't work; instead you append an asterisk to the term, as in 'bob*'.
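Putting the + and * rules together, the search string can be built from whatever the user typed. This is a minimal sketch with a hypothetical helper name; it strips characters that act as operators in boolean mode and assumes every term should be required and matched as a prefix.

```php
<?php
// Hypothetical helper: turns raw user input into a boolean-mode search
// string in which every term is required (+) and matched as a prefix (*).
function buildBooleanSearch(string $input): string
{
    // Replace characters that act as operators in boolean mode with spaces.
    $clean = preg_replace('/[+\-<>()~*"@]+/', ' ', $input);
    $terms = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);

    return implode(' ', array_map(function ($term) {
        return '+' . $term . '*';
    }, $terms));
}

echo buildBooleanSearch('Bob Jenkins'), "\n";

// The result would be bound as a parameter rather than concatenated, e.g.
// ... AGAINST(? IN BOOLEAN MODE)
```

Depending on the server's minimum word length setting, very short terms may never match, so it is worth testing with three-letter names such as 'bob'.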
