Custom sorting in Elasticsearch - php

Does anyone know if it's possible to custom sort in elasticsearch?
I have a sort on the category field. Which groups all of the records together by category. This works great.
However could you then give the sort a list e.g cars, books, food.
It would then show the cars first, then books and finally food?

You can use a function_score query, something like this:
{
"query": {
"function_score": {
"query": { "match_all": {} },
"boost": "5",
"functions": [
{
"filter": { "match": { "category": "cars" } },
"weight": 100
},
{
"filter": { "match": { "category": "books" } },
"weight": 50
},
{
"filter": { "match": { "category": "food" } },
"weight": 1
}
],
"score_mode": "max",
"boost_mode": "replace"
}
}
}
Where you, of course, put whichever query you are using now instead of the match_all query, and leave off the sort (the default is by score, which is what you want here).
This is replacing the score elasticsearch normally generates, with a custom score for each category. You could experiment with other boost_mode in order to have a reasonable ranking within the categories. In case you need to understand what is happening with the scoring, you can add "explain": true to the query at the top level.

You can use custom script for your own scoring.
More details at in Script Based Sorting section: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-sort.html

Related

Elasticsearch "Join" tables

I need to do "Join" between 2 indexes (tables) and preform a check on specific field on documents that exists in both indexes.
I want to add condition like "dateExpiry" below, but I get an error. Is it possible to join 2 or more indexes?
GET cache-*/_search
{
"query": {
"bool": {
"must_not": [
{
"query": {
"terms": {
"TagId": {
"index": "domain_block-2016.06",
"type": "cBlock",
"id": "57692ef6ae8c50f67e8b45",
"path": "TagId",
"range" : {
"dateExpiry" : {
"gte" : "20160705T12:00:00"
}
}
}
}
}
]
}
}
}
Filters within a Terms Query Lookup are currently not supported. However, Elasticsearch has some great documentation on joins / relationships here.
Your best bet may be to run two queries against Elasticsearch - one to fetch the list of TagIds, then another that includes the list as an exclusion clause.

Elasticsearch: What's the best way to search for a word within a string AND get score?

I'm using ElasticSearch's PHP client and I find really difficult to return results with scores whenever I want to search for a word that is "hidden" within a string.
This is an example:
I want to get all the documents where the field "file" has the word "anses" and files are named like this:
axx14anses19122015.zip
What I know about it
I know I should tokenize those words, can't realize how to do it.
Also I've read about aggregations but I'm really new to ES and I have to deliver a working piece ASAP.
What I've tried so far
REGEXP: using regular expressions is very expensive and does not return any scores, which is a must-to-have in order to shrink results and bring the user accurate information.
Wildcards: same thing, slow and no scores
Own script where I have a dictionary and search for critical words using regexp, if match, create a new field within that matched document with the word. The reason is to create a TOKEN so in future searches I can use regular match with scores. Negative side: the dictionary thing was totally denied by my boss so I'm here asking for any ideas.
Thanks in advance.
I suggest in your case nGram tokenizer see the example
I will create a analyzer and a mapping for a doc type
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": 4,
"max_gram": 4,
"token_chars": [ "letter", "digit" ]
}
},
"analyzer": {
"ngram_tokenizer_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"text_field": {
"type": "string",
"term_vector": "yes",
"analyzer": "ngram_tokenizer_analyzer"
}
}
}
}
}
after that I`ll insert a document using your file name
PUT /test_index/doc/1
{
"text_field": "axx14anses19122015"
}
now I`ll just will use a query match
POST /test_index/_search
{
"query": {
"match": {
"text_field": "anses"
}
}
}
and will receive a reponse like this
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.10848885,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.10848885,
"_source": {
"text_field": "axx14anses19122015"
}
}
]
}
}
What i did?
i just created a nGram tokenizer that will explode our string in 4 characters terms and will index this terms separated and they will be searched when I search a part of the string.
To see more, read this article https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
Hope it help!
Ok after trying -so- many times it worked. I'll share the solution just in case someone else needs it. Thank you so much to Waldemar, it was a really good approach and I still cannot see why it's not working.
curl -XPUT 'http://ipaddresshere/tokentest' -d
'{ "settings":
{ "number_of_shards": 1, "analysis" :
{ "analyzer" : { "myngram" : { "tokenizer" : "mytokenizer" } },
"tokenizer" : { "mytokenizer" : {
"type" : "nGram",
"min_gram" : "3",
"max_gram" : "5",
"token_chars" : [ "letter", "digit" ] } } } },
"mappings":
{ "doc" :
{ "properties" :
{ "field" : {
"type" : "string",
"term_vector" : "yes",
"analyzer" : "myngram" } } } } }'
Sorry for bad indentation, I'm really hurry but want to post the solution.
So, this will take any string from "field" and split it into nGrams with lenght 3 to 5. For example: "abcanses14f.zip" will result in:
abc, abca, abcan, bca, bcan, bcans, etc... until it reaches anses or a similar term which is matcheable and has a score related to it.

multiple group by in elasticsearch including missing values

I'm trying to do a group by in elasticsearch, by multiple fields. I know that nested aggregation exists, but what I want is including in a certain bucket the record for which the field I'm grouping by is empty.
Say that we have this kind of data structure:
SONG_ID | SONG_GENRE | SONG_ARTIST
and i want to group by genere, artists.
I would like to have a group for each possibile combination, i.e
group by genre gives me 5 buckets (if genres are 5) plus the bucket in which there are the songs without a genre. grouping then by artist gives me, for each genre, bucket by artists plus the one with songs without an artist.
Basically, I'd like to have the same results that I have using a group by. Is that even possible?
You can approach in different ways to solve your need.
The simplest way would be to index a fix value say "notmentioned" against the genre field of songs if genre is not present. you can do it while indexing or by defining "null_value" in your field mapping.
"SONG_GENRE": {"type": "string", "null_value": "notmentioned"},
"SONG_ARTIST": {"type": "string", "null_value": "notmentioned"},
So during aggregation (nested) you will automatically find the count against "notmentioned" for songs not having genre.
Another approach would be to use the missing filter as another aggregation along with normal aggregation. Something like below.
{
"aggs": {
"SONG_GENRE": {
"terms": {
"field": "SONG_GENRE"
},
"aggs": {
"SONG_ARTIST": {
"terms": {
"field": "SONG_ARTIST"
}
},
"MISSING_SONG_ARTIST": {
"filter": {
"missing": {
"field": "SONG_ARTIST"
}
}
}
}
},
"MISSING_SONG_GENRE": {
"filter": {
"missing": {
"field": "SONG_GENRE"
}
},
"aggs": {
"MISSING_SONG_GENRE_SONG_ARTIST": {
"terms": {
"field": "SONG_ARTIST"
}
},
"MISSING_SONG_GENRE_MISSING_SONG_ARTIST": {
"filter": {
"missing": {
"field": "SONG_ARTIST"
}
}
}
}
}
}
}
I haven't verified the syntax. It is just to give you an idea
Another hacking way could be to treat the missing count (total hits - all aggregation count) as the count against no genre.

advanced search with ElasticSerach

I've create a small application with PHP and I use ES.
My request is good, but I've got the good result.
My request look-like that:
link:9200/index/_search?from=0&size=130&q=try:'yes'
%2Bbrand:'BMW' %2Bmodel:'SERIE 5' %2Bprice:[500 TO 700000]
When I send this query, ES reply me with model 'SERIE 3' and 'SERIE 5', it's great, but when I send this query, I would like to recover only 'BMW' and 'SERIE 5'.
How can I fix this?
First, you should take a look at the documentation to be more familiar with these notions (analyze / difference between query and filters) which are very important for a good use of ElasticSearch. You can find a good getting started documentation here.
Your problem is that your "model" field is a string, which by default is analyzed using the standard analyzer.
It outputs 2 tokens because of the whitespace in the model name as you can see if you use the _analyze endpoint :
GET _analyze?analyzer=standard&text='Serie 5'
{
"tokens": [
{
"token": "serie",
"start_offset": 1,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "5",
"start_offset": 7,
"end_offset": 8,
"type": "<NUM>",
"position": 2
}
]
}
On top of that, you're using a query and though will return all results matching even partially. So, you're certainly having the two cars in your results, but the "SERIE 5" car must be the first (as it matches better) than the car "SERIE 3", which is represented by a higher _score attribute.
You need to use a term filter which will return only the documents containing the term value you provided.
However, as it works on terms, you have to change the mapping of your field to "not_analyzed" like this to keep it as it is :
PUT /test/car/_mapping
{
"properties":{
"model":{
"type": "string",
"index":"not_analyzed"
}
}
}
Finally, the search request will be something like this (with price criteria as range filter and the use of a and filter to combine both) :
GET /test/car/_search
{
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"term": {
"model": "Serie 3"
}
},
{
"range": {
"price": {
"from": 500,
"to": 70000
}
}
}
]
}
}
}
}
}
Your query (url_decoded) looks like
link:9200/index/_search?from=0&size=130&q=try:'yes' +brand:'BMW' +model:'SERIE 5' +price:[500 TO 700000]
I think you are using '+' incorrectely, so that it is doing or operation for your query,
If you want to get with try:yes, brand:BMW and model:SERIE 5 then you have to join these query by AND keyword.
like.
link:9200/index/_search?from=0&size=130&q=try:'yes'
AND brand:'BMW' AND model:'SERIE 5' AND price:[500 TO 700000]
And you should be aware of choosing analyzer (in mapping of fields), so that things are indexed as you want.
It will work, Thanks
Reference

Percentage of OR conditions matched in mongodb

I have got my data in following format..
{
"_id" : ObjectId("534fd4662d22a05415000000"),
"product_id" : "50862224",
"ean" : "8808992479390",
"brand" : "LG",
"model" : "37LH3000",
"features" : [{
{
"key" : "Screen Format",
"value" : "16:9",
}, {
"key" : "DVD Player / Recorder",
"value" : "No",
},
"key" : "Weight in kg",
"value" : "12.6",
}
... so on
]
}
I need to compare features of one product with others and divide the result into separate categories ( 100% match, 50-99 % match) based on % of feature matches..
My initial thought was to prepare a dynamic query with or condition for each feature and do the percentage thing in php but then that means mongodb will return me even those product which only have 1 feature matching. And I I think nearly all products of a category might have some feature in common, so I fear I might be working on lot of products in php.
I have two questions basically.
is there any alternate ways?
And is the data structure I am using is good enough to support the functionality I am looking for, Or should I consider changing it
Well your solution really should be MongoDB specific otherwise you will end up doing your calculations and possible matching on the client side, and that is not going to be good for performance.
So of course what you really want is a way for that to have that processing on the server side:
db.products.aggregate([
// Match the documents that meet your conditions
{ "$match": {
"$or": [
{
"features": {
"$elemMatch": {
"key": "Screen Format",
"value": "16:9"
}
}
},
{
"features": {
"$elemMatch": {
"key" : "Weight in kg",
"value" : { "$gt": "5", "$lt": "8" }
}
}
},
]
}},
// Keep the document and a copy of the features array
{ "$project": {
"_id": {
"_id": "$_id",
"product_id": "$product_id",
"ean": "$ean",
"brand": "$brand",
"model": "$model",
"features": "$features"
},
"features": 1
}},
// Unwind the array
{ "$unwind": "$features" },
// Find the actual elements that match the conditions
{ "$match": {
"$or": [
{
"features.key": "Screen Format",
"features.value": "16:9"
},
{
"features.key" : "Weight in kg",
"features.value" : { "$gt": "5", "$lt": "8" }
},
]
}},
// Count those matched elements
{ "$group": {
"_id": "$_id",
"count": { "$sum": 1 }
}},
// Restore the document and divide the mated elements by the
// number of elements in the "or" condition
{ "$project": {
"_id": "$_id._id",
"product_id": "$_id.product_id",
"ean": "$_id.ean",
"brand": "$_id.brand",
"model": "$_id.model",
"features": "$_id.features",
"matched": { "$divide": [ "$count", 2 ] }
}},
// Sort by the matched percentage
{ "$sort": { "matched": -1 } }
])
So as you know the "length" of the $or condition being applied, then you simply need to find out how many of the elements in the "features" array match those conditions. So that is what the second $match in the pipeline is all about.
Once you have that count, you simply divide by the number of conditions what were passed in as your $or. The beauty here is that now you can do something useful with this like sort by that relevance and then even "page" the results server side.
Of course if you want some additional "categorization" of this, all you would need to do is add another $project stage to the end of the pipeline:
{ "$project": {
"product_id": 1
"ean": 1
"brand": 1
"model": 1,
"features": 1,
"matched": 1,
"category": { "$cond": [
{ "$eq": [ "$matched", 1 ] },
"100",
{ "$cond": [
{ "$gte": [ "$matched", .7 ] },
"70-99",
{ "$cond": [
"$gte": [ "$matched", .4 ] },
"40-69",
"under 40"
]}
]}
]}
}}
Or as something similar. But the $cond operator can help you here.
The architecture should be fine as you have it as you can have a compound index on the "key" and "value" for the entries in your features array and this should scale well for queries.
Of course if you actually need something more than that, such as faceted searching and results, you can look at solutions like Solr or elastic search. But the full implementation of that would be a bit lengthy for here.
I'm assuming that you'd like to compare the rest of the collection to a given product, which is a textbook example of aggregation:
lookingat = db.products.findOne({product_id:'50862224'})
matches = db.products.aggregate([
{ $unwind: '$features' },
{ $match: { features: { $in: lookingat.features }}},
{ $group: { _id: '$product_id', matchedfeatures: { $sum:1 }}},
{ $sort: { matchedfeatures: -1 }},
{ $limit: 5 },
{ $project: { _id:0, product_id: '$_id',
pctmatch: { $multiply: [ '$matchedfeatures',
100/lookingat.features.length ]}
}}
])
Walking through this briefly from the perspective of a product in the collection that has 6 features, and comparing it to the target product ('lookingat') which has 4 features, 3 of which match:
$unwind turns 1 document with 6 features into 6 otherwise-identical documents with 1 feature each
$match looks for that feature in the target's feature array (be aware that two documents are "equal" only if they have the same field names and values, in the same order), discards the 3 that don't match, and passes along the 3 that do
$group consumes those 3 matching documents and produces a new one that tells you there were 3 documents that matched that product_id
$sort and $limit give you the most relevant results and leave behind all those 1-feature matches you were concerned about
$project lets you rename the _id from the $group step back to product_id and also math the number of matching features into a percentage (we avoided a $divide operation by recognizing that 2 of the 3 terms in our calculation are constants and can be divided in JS)

Categories