Sub-queries with "union" in elasticsearch - php

I'm currently busy working on a project in which we chose to use Elasticsearch as the search engine for a classifieds website.
Currently, I have the following business rule:
List 25 adverts per page. Of these 25, 10 of the displayed adverts must be "Paid Adverts", and the other 15 must be "Free". All 25 must be relevant to the search performed (i.e. Keywords, Region, Price, Category, etc.)
I know I can do this using two separate queries, but this seems like an immense waste of resources. Is it possible to do "sub-queries" (if you can call them that?) and union these results into a single result set? Somehow only fetching 10 "Paid" adverts and 15 "Free" ones from elasticsearch, in one single query? Assuming of course that there are enough adverts to make this requirement possible.
Thanks for any help!
edit - Just adding my mapping information for more clarity.
"properties": {
"advertText": {
"type": "string",
"boost": 2,
"store": true,
"analyzer": "snowball"
},
"canonical": {
"type": "string",
"store": true
},
"category": {
"properties": {
"id": {
"type": "string",
"store": true
},
"name": {
"type": "string",
"store": true
},
"parentCategory": {
"type": "string",
"store": true
}
}
},
"contactNumber": {
"type": "string",
"index": "not_analyzed",
"store": true
},
"emailAddress": {
"type": "string",
"store": true,
"analyzer": "url_email_analyzer"
},
"advertType": {
"type": "string",
"index": "not_analyzed"
},
...
}
What I want then is to be able to query this and get 10 results where "advertType": "Paid" and 15 where "advertType": "Free"...

A couple of approaches you can take.
First, you can try using the multi-search API:
Multi Search API
The multi search API allows to execute several search requests within
the same API. The endpoint for it is _msearch.
The format of the request is similar to the bulk API format
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-multi-search.html
A basic example:
curl -XGET 'http://127.0.0.1:9200/advertising_index/_msearch?pretty=1' -d '
{}
{"query" : {"match" : {"Paid_Ads" : "search terms"}}, "size" : 10}
{}
{"query" : {"match" : {"Free" : "search terms"}}, "size" : 15}
'
I've made up the fields and query but overall you should get the idea - you hit the _msearch endpoint and pass it a series of queries starting with empty brackets {}. For Paid I've set size to 10 and for Free I've set size to 15.
Subject to the details of your own implementation you should be able to use something like this.
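As a rough sketch of how the two searches could be assembled and their results combined, the Python below builds the newline-delimited _msearch body and concatenates the two responses. It assumes the advertType field from the mapping in the question holds the exact values "Paid" and "Free", that keywords are matched against advertText, and it uses the filtered query form seen elsewhere on this page. The function names and the sample keywords are made up for illustration.

```python
import json


def build_msearch_body(keywords, paid_size=10, free_size=15):
    """Return the newline-delimited body for an _msearch request."""
    lines = []
    for advert_type, size in (("Paid", paid_size), ("Free", free_size)):
        search = {
            "query": {
                "filtered": {
                    "query": {"match": {"advertText": keywords}},
                    "filter": {"term": {"advertType": advert_type}},
                }
            },
            "size": size,
        }
        lines.append(json.dumps({}))      # empty header: use the index from the URL
        lines.append(json.dumps(search))  # the search itself
    # The bulk-style format expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def merge_hits(msearch_response):
    """Concatenate the hits of every response, paid adverts first."""
    merged = []
    for response in msearch_response["responses"]:
        merged.extend(response["hits"]["hits"])
    return merged


body = build_msearch_body("mountain bike")
print(body)
```

The body would be sent to the _msearch endpoint in one HTTP round trip, and merge_hits would be applied to the decoded JSON response. Elasticsearch still runs two searches internally, so this saves the round trip rather than the search work.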
If that does not work for whatever reason you can also try using a limit filter:
Limit Filter
A limit filter limits the number of documents (per shard) to execute
on. For example:
{
"filtered" : {
"filter" : {
"limit" : {"value" : 100}
},
"query" : {
"term" : { "name.first" : "shay" }
}
}
}
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-limit-filter.html
Note that the limits are per shard, not per index. Given a default of 5 primary shards per index, to get a total response of 10 you would set limit to 2 (2X5 == 10). Also note that this can produce incomplete results if you have multiple matches on one shard but none on another.
You would then combine two filters with a bool filter:
Bool Filter
A filter that matches documents matching boolean combinations of other
queries. Similar in concept to Boolean query, except that the clauses
are other filters. Can be placed within queries that accept a filter.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-bool-filter.html
I've not fleshed this one out in any detail as it will require more information about your specific indexes, mappings, data and queries.

Try using a limit filter, which limits the number of docs returned per shard:
{
"filtered" : {
"filter" : {
"limit" : {"value" : 10}
},
"query" : {
"term" : { "name.first" : "shay" }
}
}
}
With the default of 5 primary shards, change value to 2 to get 10 results and to 3 to get 15.
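The per-shard arithmetic behind those values can be sketched in a few lines of Python. This is only an illustration of the calculation, assuming 5 primary shards; per_shard_limit is a made-up helper name, not part of any Elasticsearch client.

```python
import math


def per_shard_limit(wanted_total, primary_shards=5):
    """Smallest per-shard limit whose upper bound reaches wanted_total."""
    return math.ceil(wanted_total / primary_shards)


shards = 5
for wanted in (10, 15):
    limit = per_shard_limit(wanted, shards)
    # Each shard contributes at most `limit` documents, so the real total
    # lies anywhere between 0 and limit * shards, depending on how the
    # matching documents are spread across shards.
    print(wanted, "wanted ->", limit, "per shard, at most", limit * shards)
```

Because the limit is only an upper bound per shard, the result can fall short of the wanted total when matches are unevenly distributed.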

Are you asking for a query like this?
(select * from tablename where advert = "Paid Advert" limit 10) union (select * from tablename where advert = "Free" limit 15);
Or for the logic to generate the limit per page?

Related

Elasticsearch "Join" tables

I need to do a "Join" between 2 indexes (tables) and perform a check on a specific field on documents that exist in both indexes.
I want to add condition like "dateExpiry" below, but I get an error. Is it possible to join 2 or more indexes?
GET cache-*/_search
{
"query": {
"bool": {
"must_not": [
{
"query": {
"terms": {
"TagId": {
"index": "domain_block-2016.06",
"type": "cBlock",
"id": "57692ef6ae8c50f67e8b45",
"path": "TagId",
"range" : {
"dateExpiry" : {
"gte" : "20160705T12:00:00"
}
}
}
}
}
]
}
}
}
Filters within a Terms Query Lookup are currently not supported. However, Elasticsearch has some great documentation on joins / relationships here.
Your best bet may be to run two queries against Elasticsearch - one to fetch the list of TagIds, then another that includes the list as an exclusion clause.
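A minimal sketch of that two-query approach, written in Python as plain request bodies: the first body selects the blocking documents by dateExpiry, and the second excludes the collected TagIds. It assumes that each hit stores a single TagId value in its _source, and the helper names are hypothetical.

```python
def build_block_query(expiry_from):
    """First query: blocking documents that have not expired yet."""
    return {"query": {"range": {"dateExpiry": {"gte": expiry_from}}}}


def extract_tag_ids(block_search_response):
    """Collect the distinct TagId values from the first query's hits."""
    tag_ids = []
    for hit in block_search_response["hits"]["hits"]:
        tag_id = hit["_source"]["TagId"]  # assumed to be a single value
        if tag_id not in tag_ids:
            tag_ids.append(tag_id)
    return tag_ids


def build_exclusion_query(tag_ids):
    """Second query: everything whose TagId is not in the collected list."""
    return {"query": {"bool": {"must_not": [{"terms": {"TagId": tag_ids}}]}}}


# Hypothetical response of the first query, trimmed to the fields used here.
sample_response = {
    "hits": {"hits": [
        {"_source": {"TagId": "t1"}},
        {"_source": {"TagId": "t2"}},
        {"_source": {"TagId": "t1"}},
    ]}
}
print(build_block_query("20160705T12:00:00"))
print(build_exclusion_query(extract_tag_ids(sample_response)))
```

The first body would be sent to the domain_block index and the second to the cache-* indexes.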

ElasticSearch sort on multiple fields with summation

I am doing a search using "filtered" query and then want to sort on SUM of 3 columns.
Eg:
"query": {
...
},
"sort": {
view_count + comment_count + like_count
order: DESC
}
The result should be in descending order of the sum of the above 3 counts.
How can I sum the columns and then order the results by that sum?
Any help is appreciated.
Use script sorting if you can't change the data/don't have control over the mapping/the three fields are changing (ie they are not insert and forget).
{
"sort": {
"_script": {
"type": "number",
"script": "return doc['view_count'].value + doc['comment_count'].value + doc['like_count'].value",
"lang": "groovy",
"order": "desc"
}
}
}
If you update the documents frequently, you can also add another field - let's call it sum - where you already index the sum of the three fields. And then you simply sort on the sum field.
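The precomputed-field idea could look roughly like this on the indexing side. The sketch is in Python, the field name sum follows the suggestion above, and with_sum is a hypothetical helper that would run before each document is indexed.

```python
COUNT_FIELDS = ("view_count", "comment_count", "like_count")


def with_sum(doc):
    """Return a copy of doc with a precomputed `sum` field."""
    enriched = dict(doc)
    # Missing counters are treated as 0 (an assumption of this sketch).
    enriched["sum"] = sum(doc.get(field, 0) for field in COUNT_FIELDS)
    return enriched


# Sort clause to send with the search once `sum` is indexed.
SORT_BY_SUM = {"sort": [{"sum": {"order": "desc"}}]}

print(with_sum({"view_count": 3, "comment_count": 2, "like_count": 1}))
```

The trade-off is that every change to one of the three counters must also rewrite sum, whereas the script sort always works from the current values.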
In my case (Elasticsearch 5.6), the code below sorts by the sum of multiple fields.
GET /_search
{
"query" : {
"term" : { "user" : "kimchy" }
},
"sort" : {
"_script" : {
"type" : "number",
"script" : {
"lang": "painless",
"source": "doc['retweet_count'].value + doc['favorite_count'].value"
},
"order" : "desc"
}
}}
Source: Script-based Sorting

elasticsearch: applying an additional boost on a given field for a given value

I have a Symfony 2.7.6 application using the FOSElasticaBundle.
I have 2 types of search:
One without keyword, in this case only filters are applied and all documents scores are 1 (sometimes with a random order), in this case the main query is:
$query = new Elastica\Query\MatchAll();
One with keyword, same filters are applied and the match is run against a list of fields (one with a different boost). And the results are sorted by score. The main query is now:
$match = new Elastica\Query\MultiMatch();
$match->setQuery($keyword);
$match->setOperator('AND');
$match->setFields([
'field1^30',
'field2',
'field3',
'field4',
'_all'
]);
Those 2 searches are working well.
Now for both searches I want a dynamic boost to be applied for a given field value. Let's say: if field5 == 'value' then add boost 15 (15 is just an example, we will run tests to see what additional boost value has to be chosen). The value used here is not the keyword, it is another parameter.
I tried with a FunctionScore and with Boosting queries but without success. Any hint with a very simple elasticsearch query would be appreciated.
How about this:
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "blabla",
"operator": "AND",
"fields": [
"field1^30",
"field2",
"field3",
"field4",
"_all"
]
}
},
"functions": [
{
"filter": {
"term": {
"field5": "some_value"
}
},
"boost_factor": 15
}
]
}
}
}
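Since the question covers two searches, here is a hedged Python sketch that builds the same function_score body for both cases: match_all when there is no keyword, multi_match otherwise. The key boost_factor is taken from the answer above, so check which key your Elasticsearch version expects. build_boosted_query and its parameters are illustrative names only.

```python
def build_boosted_query(field, value, boost, keyword=None, fields=None):
    """Wrap the base query in function_score with a conditional boost."""
    if keyword is None:
        base = {"match_all": {}}  # search without keyword
    else:
        base = {
            "multi_match": {
                "query": keyword,
                "operator": "AND",
                "fields": fields or ["_all"],
            }
        }
    return {
        "query": {
            "function_score": {
                "query": base,
                "functions": [
                    # Applied only to documents where field == value.
                    {"filter": {"term": {field: value}}, "boost_factor": boost}
                ],
            }
        }
    }


print(build_boosted_query("field5", "some_value", 15))
print(build_boosted_query("field5", "some_value", 15, "blabla", ["field1^30", "field2"]))
```

The existing filters from the question would still need to be added around or inside the base query.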

Advanced search with ElasticSearch

I've created a small application with PHP and I use ES.
My request is valid, but I don't get the result I expect.
My request looks like this:
link:9200/index/_search?from=0&size=130&q=try:'yes'
%2Bbrand:'BMW' %2Bmodel:'SERIE 5' %2Bprice:[500 TO 700000]
When I send this query, ES replies with models 'SERIE 3' and 'SERIE 5'. That's close, but I would like to get back only 'BMW' and 'SERIE 5'.
How can I fix this?
First, you should take a look at the documentation to be more familiar with these notions (analyze / difference between query and filters) which are very important for a good use of ElasticSearch. You can find a good getting started documentation here.
Your problem is that your "model" field is a string, which by default is analyzed using the standard analyzer.
It outputs 2 tokens because of the whitespace in the model name as you can see if you use the _analyze endpoint :
GET _analyze?analyzer=standard&text='Serie 5'
{
"tokens": [
{
"token": "serie",
"start_offset": 1,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "5",
"start_offset": 7,
"end_offset": 8,
"type": "<NUM>",
"position": 2
}
]
}
On top of that, you're using a query, which returns every document that matches even partially. So you will certainly get both cars in your results, but the "SERIE 5" car should come first because it matches better than the "SERIE 3" car, which shows up as a higher _score attribute.
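To make the partial match concrete, here is a small Python imitation of what happens. rough_analyze is only a crude stand-in for the standard analyzer (lowercase, split on non-alphanumerics), not its real implementation, and the two match functions are simplified models of an analyzed query and an exact term comparison.

```python
import re


def rough_analyze(text):
    """Lowercase and split on non-alphanumerics (crude analyzer stand-in)."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def matches_any(query_text, field_text):
    """Analyzed match: true if the two texts share at least one token."""
    return bool(set(rough_analyze(query_text)) & set(rough_analyze(field_text)))


def matches_exact(query_text, field_text):
    """Exact comparison, as on a field that is not analyzed."""
    return query_text == field_text


print(rough_analyze("SERIE 5"))
for model in ("SERIE 5", "SERIE 3"):
    print(model, matches_any("SERIE 5", model), matches_exact("SERIE 5", model))
```

Both models share the token "serie", which is why the analyzed query returns 'SERIE 3' as well, while the exact comparison keeps only 'SERIE 5'.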
You need to use a term filter which will return only the documents containing the term value you provided.
However, as it works on terms, you have to change the mapping of your field to "not_analyzed" like this to keep it as it is :
PUT /test/car/_mapping
{
"properties":{
"model":{
"type": "string",
"index":"not_analyzed"
}
}
}
Finally, the search request will be something like this (with the price criterion as a range filter and an and filter to combine both):
GET /test/car/_search
{
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"term": {
"model": "SERIE 5"
}
},
{
"range": {
"price": {
"from": 500,
"to": 70000
}
}
}
]
}
}
}
}
}
Your query (url_decoded) looks like
link:9200/index/_search?from=0&size=130&q=try:'yes' +brand:'BMW' +model:'SERIE 5' +price:[500 TO 700000]
I think you are using '+' incorrectly, so that it performs an OR operation for your query.
If you want to match try:yes, brand:BMW and model:SERIE 5, then you have to join these clauses with the AND keyword,
like this:
link:9200/index/_search?from=0&size=130&q=try:'yes'
AND brand:'BMW' AND model:'SERIE 5' AND price:[500 TO 700000]
And you should be aware of choosing analyzer (in mapping of fields), so that things are indexed as you want.
It will work, Thanks
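As a sketch of how such a URL could be put together safely, the Python below joins the clauses with AND and lets urlencode handle spaces and special characters. Two things here are assumptions rather than part of the answer above: the base URL is a placeholder, and the multi-word value is wrapped in double quotes, which is how query-string syntax normally groups a phrase.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, clauses, start=0, size=130):
    """Join the clauses with AND and URL-encode the whole query string."""
    query_string = " AND ".join(clauses)
    params = {"from": start, "size": size, "q": query_string}
    return base_url + "?" + urlencode(params)


url = build_search_url(
    "http://localhost:9200/index/_search",  # hypothetical host and index
    ['try:"yes"', 'brand:"BMW"', 'model:"SERIE 5"', "price:[500 TO 700000]"],
)
print(url)
# Decoding the URL again shows the query string the server will receive.
print(parse_qs(urlsplit(url).query)["q"][0])
```

Encoding the whole q parameter in one step avoids hand-written %2B sequences and stray spaces in the URL.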
Reference

Percentage of OR conditions matched in mongodb

I have got my data in following format..
{
"_id" : ObjectId("534fd4662d22a05415000000"),
"product_id" : "50862224",
"ean" : "8808992479390",
"brand" : "LG",
"model" : "37LH3000",
"features" : [
{
"key" : "Screen Format",
"value" : "16:9"
}, {
"key" : "DVD Player / Recorder",
"value" : "No"
}, {
"key" : "Weight in kg",
"value" : "12.6"
}
... so on
]
}
I need to compare features of one product with others and divide the result into separate categories ( 100% match, 50-99 % match) based on % of feature matches..
My initial thought was to prepare a dynamic query with an or condition for each feature and do the percentage thing in php, but then that means mongodb will return even those products which only have 1 feature matching. And I think nearly all products of a category might have some feature in common, so I fear I might be working on a lot of products in php.
I have two questions basically.
Are there any alternative ways?
And is the data structure I am using good enough to support the functionality I am looking for, or should I consider changing it?
Well, your solution really should be MongoDB-specific; otherwise you will end up doing your calculations and possibly the matching on the client side, and that is not going to be good for performance.
So of course what you really want is a way to do that processing on the server side:
db.products.aggregate([
// Match the documents that meet your conditions
{ "$match": {
"$or": [
{
"features": {
"$elemMatch": {
"key": "Screen Format",
"value": "16:9"
}
}
},
{
"features": {
"$elemMatch": {
"key" : "Weight in kg",
"value" : { "$gt": "5", "$lt": "8" }
}
}
},
]
}},
// Keep the document and a copy of the features array
{ "$project": {
"_id": {
"_id": "$_id",
"product_id": "$product_id",
"ean": "$ean",
"brand": "$brand",
"model": "$model",
"features": "$features"
},
"features": 1
}},
// Unwind the array
{ "$unwind": "$features" },
// Find the actual elements that match the conditions
{ "$match": {
"$or": [
{
"features.key": "Screen Format",
"features.value": "16:9"
},
{
"features.key" : "Weight in kg",
"features.value" : { "$gt": "5", "$lt": "8" }
},
]
}},
// Count those matched elements
{ "$group": {
"_id": "$_id",
"count": { "$sum": 1 }
}},
// Restore the document and divide the matched elements by the
// number of elements in the "or" condition
{ "$project": {
"_id": "$_id._id",
"product_id": "$_id.product_id",
"ean": "$_id.ean",
"brand": "$_id.brand",
"model": "$_id.model",
"features": "$_id.features",
"matched": { "$divide": [ "$count", 2 ] }
}},
// Sort by the matched percentage
{ "$sort": { "matched": -1 } }
])
So as you know the "length" of the $or condition being applied, then you simply need to find out how many of the elements in the "features" array match those conditions. So that is what the second $match in the pipeline is all about.
Once you have that count, you simply divide by the number of conditions that were passed in as your $or. The beauty here is that now you can do something useful with this, like sort by that relevance and then even "page" the results server side.
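To see what the pipeline computes for a single product, here is the same count-and-divide logic as a plain Python sketch. It is an illustration only, not a replacement for the server-side aggregation. The sample features come from the question, the conditions mirror the two $or clauses, and matched_fraction is a made-up name.

```python
def matched_fraction(features, conditions):
    """Count the features that satisfy any condition, divided by the number of conditions."""
    matched = sum(
        1 for feature in features
        if any(condition(feature) for condition in conditions)
    )
    return matched / len(conditions)


features = [
    {"key": "Screen Format", "value": "16:9"},
    {"key": "DVD Player / Recorder", "value": "No"},
    {"key": "Weight in kg", "value": "12.6"},
]

conditions = [
    lambda f: f["key"] == "Screen Format" and f["value"] == "16:9",
    # The values are stored as strings, so this is a string comparison,
    # just like the "$gt"/"$lt" on strings in the pipeline.
    lambda f: f["key"] == "Weight in kg" and "5" < f["value"] < "8",
]

print(matched_fraction(features, conditions))
```

Here only the screen format condition matches, because "12.6" compared as a string is not between "5" and "8". Storing numeric feature values as numbers would make range conditions behave as expected.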
Of course if you want some additional "categorization" of this, all you would need to do is add another $project stage to the end of the pipeline:
{ "$project": {
"product_id": 1,
"ean": 1,
"brand": 1,
"model": 1,
"features": 1,
"matched": 1,
"category": { "$cond": [
{ "$eq": [ "$matched", 1 ] },
"100",
{ "$cond": [
{ "$gte": [ "$matched", .7 ] },
"70-99",
{ "$cond": [
{ "$gte": [ "$matched", .4 ] },
"40-69",
"under 40"
]}
]}
]}
}}
Or something similar. The $cond operator can help you here.
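The nested $cond expressions read more easily as an if/else chain. The Python sketch below mirrors the same thresholds and labels purely for illustration; categorize is a hypothetical name and takes the matched fraction produced by the pipeline.

```python
def categorize(matched):
    """Map a matched fraction (0..1) to the same labels as the $cond chain."""
    if matched == 1:
        return "100"
    if matched >= 0.7:
        return "70-99"
    if matched >= 0.4:
        return "40-69"
    return "under 40"


for fraction in (1, 0.75, 0.5, 0.25):
    print(fraction, "->", categorize(fraction))
```

Each inner $cond in the pipeline corresponds to one of the later if branches.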
The architecture should be fine as you have it as you can have a compound index on the "key" and "value" for the entries in your features array and this should scale well for queries.
Of course if you actually need something more than that, such as faceted searching and results, you can look at solutions like Solr or elastic search. But the full implementation of that would be a bit lengthy for here.
I'm assuming that you'd like to compare the rest of the collection to a given product, which is a textbook example of aggregation:
lookingat = db.products.findOne({product_id:'50862224'})
matches = db.products.aggregate([
{ $unwind: '$features' },
{ $match: { features: { $in: lookingat.features }}},
{ $group: { _id: '$product_id', matchedfeatures: { $sum:1 }}},
{ $sort: { matchedfeatures: -1 }},
{ $limit: 5 },
{ $project: { _id:0, product_id: '$_id',
pctmatch: { $multiply: [ '$matchedfeatures',
100/lookingat.features.length ]}
}}
])
Walking through this briefly from the perspective of a product in the collection that has 6 features, and comparing it to the target product ('lookingat') which has 4 features, 3 of which match:
$unwind turns 1 document with 6 features into 6 otherwise-identical documents with 1 feature each
$match looks for that feature in the target's feature array (be aware that two documents are "equal" only if they have the same field names and values, in the same order), discards the 3 that don't match, and passes along the 3 that do
$group consumes those 3 matching documents and produces a new one that tells you there were 3 documents that matched that product_id
$sort and $limit give you the most relevant results and leave behind all those 1-feature matches you were concerned about
$project lets you rename the _id from the $group step back to product_id and also turn the number of matching features into a percentage (we avoided a $divide operation by recognizing that 2 of the 3 terms in our calculation are constants and can be divided in JS)
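The walkthrough above can be replayed on in-memory data with a short Python sketch. It imitates the stages rather than calling MongoDB, the sample products are invented, and one difference is worth noting: Python dictionaries compare equal regardless of key order, while the $in match on sub-documents is order-sensitive.

```python
from collections import Counter


def percent_matches(target, products, limit=5):
    """Imitate the $unwind / $match / $group / $sort / $limit / $project stages."""
    counts = Counter()
    for product in products:
        for feature in product["features"]:           # $unwind
            if feature in target["features"]:         # $match with $in
                counts[product["product_id"]] += 1    # $group with $sum: 1
    ranked = counts.most_common(limit)                # $sort desc + $limit
    scale = 100 / len(target["features"])             # constant computed once
    return [
        {"product_id": product_id, "pctmatch": matched * scale}  # $project
        for product_id, matched in ranked
    ]


def feature(key, value):
    return {"key": key, "value": value}


# Target with 4 features; product "B" has 6 features, 3 of which match.
target = {"product_id": "A", "features": [
    feature("a", "1"), feature("b", "2"), feature("c", "3"), feature("d", "4"),
]}
other = {"product_id": "B", "features": [
    feature("a", "1"), feature("b", "2"), feature("c", "3"),
    feature("x", "7"), feature("y", "8"), feature("z", "9"),
]}

print(percent_matches(target, [other]))
```

If the target product itself is part of the collection, it would come back as a 100% match, so it may need to be filtered out.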
