I have been using Elasticsearch for a project, but I find the results of the snowball analyzer a bit strange.
Below is the mapping I used.
$myTypeMapping = array(
'_source' => array(
'enabled' => true
),
'properties' => array(
'id' => array(
'type' => 'integer',
'index' => 'not_analyzed'
),
'name' => array(
'type' => 'string',
'analyzer' => 'snowball',
'boost' => 2.0
),
'food_types' => array(
'type' => 'string',
'analyzer' => 'keyword'
),
'location' => array(
'type' => 'geo_point',
"geohash_precision"=> 4
),
'city' => array(
'type' => 'string',
'analyzer' => 'keyword'
)
)
);
$indexParams['body']['mappings']['online_pizza'] = $myTypeMapping;
// Create the index
$elastic_client->indices()->create($indexParams);
On querying http://localhost:9200/online_pizza/online_pizza/_mapping I get the following result:
{
"online_pizza": {
"properties": {
"city": {
"type": "string",
"analyzer": "keyword"
},
"food_types": {
"type": "string",
"analyzer": "keyword"
},
"id": {
"type": "integer"
},
"location": {
"type": "geo_point",
"geohash_precision": 4
},
"name": {
"type": "string",
"boost": 2,
"analyzer": "snowball"
}
}
}
}
My question is: I have a document whose name field is "Milano". Querying for "Milano" returns the desired result, but querying for "Milan" or "Mil" returns nothing.
{
"query": {
"query_string": {
"default_field": "name",
"query": "Milan"
}
}
}
I've also tried specifying the snowball analyzer at query time, but it didn't help.
{
"query": {
"query_string": {
"default_field": "name",
"query": "Milan",
"analyzer": "snowball"
}
}
}
My second question: keyword search is case-sensitive (e.g. Pizza != pizza). How do I get around this?
Thanks,
The snowball stemmer does not do partial or prefix matching; it reduces a word to its stem using language-specific rules. If you try it with jumping, it outputs jump as expected.
However, depending on the word, it may be left unstemmed because it doesn't match any stemmer rule.
If you use the _analyze API endpoint, you will see that analyzing Milano with the snowball analyzer gives you the token milano:
GET _analyze?analyzer=snowball&text=Milano
Output :
{
"tokens": [
{
"token": "milano",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Then, using the same snowball analyzer on Mil like this:
GET _analyze?analyzer=snowball&text=Mil
gives you this token :
{
"tokens": [
{
"token": "mil",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
}
]
}
That's why searching for 'milan' or 'mil' won't match 'Milano' documents: neither query term equals the milano term stored in the index.
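To make the mismatch concrete, here is a minimal plain-Python sketch of term-level matching. It is not Elasticsearch code: `analyze` is a hypothetical stand-in that only lowercases and splits on whitespace, on the assumption (consistent with the _analyze output above) that the snowball analyzer leaves these particular words otherwise unchanged.

```python
def analyze(text):
    # Hypothetical stand-in for the snowball analyzer: lowercase only.
    # Assumption: no English stemming rule changes these words.
    return text.lower().split()


# Terms stored in the inverted index for a document named "Milano".
indexed_terms = set(analyze("Milano"))


def matches(query):
    # A query matches only if one of its analyzed terms
    # is exactly equal to a term stored in the index.
    return any(term in indexed_terms for term in analyze(query))


for q in ("Milano", "Milan", "Mil"):
    print(q, matches(q))
```

Only the first query produces a term that is equal to the stored one; a shorter term is never treated as a prefix.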
For your second question, you can define a custom analyzer that combines the keyword tokenizer with a lowercase token filter, so that your keyword search is case-insensitive (provided you use the same analyzer at search time):
POST index_name
{
"analysis": {
"analyzer": {
"case_insensitive_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
}
Test :
GET index_name/_analyze?analyzer=case_insensitive_keyword&text=Choo Choo
Output :
{
"tokens": [
{
"token": "choo choo",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 1
}
]
}
I hope my explanations are clear enough :)
Related
I have an Elasticsearch search that works, but it matches far too broadly: it returns too many results for terms that have nothing to do with the query. I'm looking for a way to refine these results.
On a sample of fake text, when I search for the term music, the terms that come out in the highlights are:
must, much, alice, inside, patriotic, noticed
I suspect the ngram filter is not helping here, but I also think I really need it for a better search.
Here is my configuration :
{
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "mySnowball", "myNgram"]
},
"default_search": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "mySnowball", "myNgram"]
}
},
"filter": {
"mySnowball": {
"type": "snowball",
"language": "English"
},
"myNgram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 6
}
}
}
}
Here is my request :
{
"query": {
"bool": {
"should": [{
"match": {
"content": "music"
}
}, {
"match": {
"url": "music"
}
}, {
"match": {
"h1": "music"
}
}, {
"match": {
"h2": "music"
}
}
],
"minimum_should_match": 1
}
},
"min_score": 8
}
My document is quite simple :
content => text,
url => text,
h1 => text,
h2 => text,
And the mapping too:
$configMapping = [
'content' => ['type' => 'text', 'boost' => 6],
'url' => ['type' => 'text', 'boost' => 6],
'h1' => ['type' => 'text', 'boost' => 9],
'h2' => ['type' => 'text', 'boost' => 7]
]
I welcome any modification that will allow me to obtain only consistent results.
As you said yourself, analyzing with the ngram filter is the reason you get all these unrelated results.
In every result you get, you can find a short token (as short as 2 characters, your min_gram) that also occurs in the query term 'music':
must, much, alice, inside, patriotic, noticed
Start by removing this filter from your analyzer and keep on tuning the results from there.
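The overlap can be checked with a small plain-Python sketch. It only imitates the ngram filter with min_gram 2 and max_gram 6 on lowercased words; the snowball step is deliberately left out, so this is an approximation of the analyzer chain rather than a reproduction of it. The function name `ngrams` is made up for the illustration.

```python
def ngrams(token, min_gram=2, max_gram=6):
    """Return every substring of token whose length is min_gram..max_gram."""
    token = token.lower()
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


query_grams = ngrams("music")

# The highlighted words reported in the question.
for word in ["must", "much", "alice", "inside", "patriotic", "noticed"]:
    shared = sorted(query_grams & ngrams(word))
    print(f"{word}: {shared}")
```

Every word shares at least one gram with music, and because the search analyzer applies the same ngram filter to the query, a single shared 2-character gram is enough for a match.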
Database Server:
Elasticsearch 7.9.2
Centos 7.7
Dev env:
PHP 7.3.11
MacOS
I am fairly new to Elasticsearch, so please bear with me on this one.
It is driving me crazy though.
I am trying to do something very easy, but since I come from the relational database world, I need some mind bending. I have created a mapping with a parent-child relationship.
Product --> Price
This is the mapping I created:
PUT /products_pc
{
"mappings": {
"properties": {
"datafeed_id": {
"type": "integer"
},
"date_add": {
"type": "date"
},
"description": {
"type": "text"
},
"ean": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"image_url": {
"type": "text",
"index": false
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"sku": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"webshop_id": {
"type": "integer"
},
"price": {
"type": "float"
},
"url": {
"type": "text"
},
"date_mod":{
"type": "date"
},
"product_price" : {
"type":"join",
"relations": {
"product":"price"
}
}
}
}
}
So far so good. When I manually add a product and 2 prices I can get what I would expect: 1 parent with 2 child documents.
Now on to PHP: I am able to index the parent document, but not the child documents. It looks like I am not able to send along a routing parameter (which I can with Kibana).
This is what I tried in PHP, parent _id = 123
$hosts = ['xxx.xxx.xxx.xxx:9200'];
$client = ClientBuilder::create()
->setHosts($hosts)
->build();
$params['body'][] = [
'create' => [
'_index' => 'products_pc',
'_id' => '123_1'
]
];
$params['body'][] = [
'webshop_id' => 1,
'date_mod' => time(),
'price' => 12,
'url' => '',
'product_price' => [
'name' => 'price',
'parent' => 123
]
];
$client->bulk($params);
But this does not work, as there is no routing set. If I add '_routing' => 123 below the _id field, I get a 400 error telling me the _routing parameter is unknown ("Action/metadata line [3] contains an unknown parameter [_routing]").
I have been searching for 2 days now, running in circles. The different Elasticsearch versions all differ slightly, so I have to admit that I am lost. Can anybody point out my mistake, or give me a hint in the right direction? It is driving me crazy. (I am afraid it will turn out to be too simple...)
Thanks in advance!
So here we are, after 2 more days of searching, and it seems I have found the solution.
I ended up at this page (again):
https://elastic.co/guide/en/elasticsearch/client/php-api/current/ElasticsearchPHP_Endpoints.html#Elasticsearch_Clientbulk_bulk
And there it was, in the params list of the bulk endpoint:
$params['routing'] = // (string) Specific routing value
I was not quite sure how to use this at first, but then I tried setting it on each of the child documents, which seems to do the trick!
$hosts = ['xxx.xxx.xxx.xxx:9200'];
$client = ClientBuilder::create()
->setHosts($hosts)
->build();
// insert price
$params['body'][] = [
'index' => [
'_index' => 'products_pc',
'_id' => '123_1',
'routing' => 123 // <-- Insert routing here.
]
];
$params['body'][] = [
'webshop_id' => 1,
'date_mod' => time(),
'price' => 12,
'url' => '',
'product_price' => [
'name' => 'price',
'parent' => 123 // <-- Parent _id value
]
];
$client->bulk($params);
As I thought, too easy actually. But I guess that is the life of a programmer.
Please be aware, though, that a LOT of documentation mentions the _routing field (even the official docs for version 7.9: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/mapping-routing-field.html, both in the text and in the right-hand submenu under metadata fields). That page describes the _routing metadata field; in the bulk action line the parameter is just "routing". Might save you a couple of days ;-)
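For illustration, here is a plain-Python sketch of the two newline-delimited JSON lines that a bulk request carries for one child document. It does not call Elasticsearch or the PHP client; the helper name `bulk_lines` is made up, and only the index name and field names are taken from the example above.

```python
import json


def bulk_lines(parent_id, child_id, price):
    # Action/metadata line: routing is a plain "routing" key (no underscore)
    # and carries the parent's _id so the child lands on the parent's shard.
    action = {
        "index": {
            "_index": "products_pc",
            "_id": child_id,
            "routing": str(parent_id),
        }
    }
    # Source line: the join field names the relation and the parent _id.
    source = {
        "price": price,
        "product_price": {"name": "price", "parent": str(parent_id)},
    }
    # The bulk format is one JSON object per line, each ending in a newline.
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"


print(bulk_lines(123, "123_1", 12), end="")
```

The routing value and the parent value in the join field are the same id, which is what keeps parent and child on one shard.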
I need to ignore the apostrophe in indexed text so that searching for "Johns potato" will show results for "John's potato".
I was able to get the analyzer accepted, but now I get no search results. Does anyone see something obvious that I am missing?
$params = [
'index' => $index,
'body' => [
'settings' => [
'number_of_shards' => 5,
'number_of_replicas' => 2,
'analysis' => [
"analyzer" => [
"my_analyzer" => [
"tokenizer" => "keyword",
"char_filter" => [
"my_char_filter"
]
]
],
"char_filter" => [
"my_char_filter" => [
"type" => "mapping",
"mappings" => [
"' => "
]
]
]
]
],
'mappings' => [
$type => [
'_source' => [
'enabled' => true
],
'properties' => [
'title' => [
'type' => 'text',
'analyzer' => 'my_analyzer'
],
'content' => [
'type' => 'text',
'analyzer' => 'my_analyzer'
]
]
]
]
]
];
I did find out that removing the analyzer from my field mappings allowed results to reappear, but I get no results the second I add the analyzer.
Here's an example query that I make.
{
"body": {
"query": {
"bool": {
"must": {
"multi_match": {
"query": "apples",
"fields": [
"title",
"content"
]
}
},
"filter": {
"terms": {
"site_id": [
"1351",
"1349"
]
}
},
"must_not": [
{
"match": {
"visible": "false"
}
},
{
"match": {
"locked": "true"
}
}
]
}
}
}
}
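Your custom analyzer uses the keyword tokenizer, which emits the whole field value as a single token, so a query like apples only matches a document whose entire title or content is exactly that text. What you probably want instead is the built-in english analyzer. The standard analyzer, which is the default, tokenizes on whitespace and some punctuation but leaves apostrophes alone. The english analyzer additionally stems and removes stop words, since the language is known.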
Here is the standard analyzer's output, where you can see "john's":
POST _analyze
{
"analyzer": "standard",
"text": "John's potato"
}
{
"tokens": [
{
"token": "john's",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "potato",
"start_offset": 7,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
}
]
}
And here is the english analyzer where you can see the 's is removed. The stemming will allow "John's", "Johns", and "John" to all match the document.
POST _analyze
{
"analyzer": "english",
"text": "John's potato"
}
{
"tokens": [
{
"token": "john",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "potato",
"start_offset": 7,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
}
]
}
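The difference between a single-token analyzer and a word-level one can be sketched in plain Python. Both functions below are hypothetical stand-ins, not Elasticsearch analyzers: `keyword_analyzer` imitates the keyword tokenizer plus the apostrophe-stripping char filter from the question, and `word_analyzer` is a rough approximation of word-level tokenization without stemming.

```python
def keyword_analyzer(text):
    # Char filter drops apostrophes; keyword tokenizer keeps the
    # whole field value as one single token.
    return [text.replace("'", "")]


def word_analyzer(text):
    # Rough stand-in for a word-level analyzer: drop apostrophes,
    # lowercase, split on whitespace (no stemming, no stop words).
    return text.replace("'", "").lower().split()


def match(query, field_value, analyzer):
    # A match query hits when any analyzed query term
    # equals any analyzed term of the field.
    indexed = set(analyzer(field_value))
    return any(term in indexed for term in analyzer(query))


print(match("apples", "Green apples", keyword_analyzer))
print(match("apples", "Green apples", word_analyzer))
print(match("Johns potato", "John's potato", keyword_analyzer))
```

With the single-token analyzer, only a query that reproduces the whole field value can match, which is consistent with results disappearing as soon as the analyzer is added to the mapping.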
I am trying to sort some data, but sorting does not work in my base skeleton; if I remove the sorting, it works fine.
So how can I put sorting into my base skeleton?
I can't just use
$params['body'] = [
'sort' => [['title' => ['order' => 'asc']]]];
$results = $client->search($params);
because I have other conditions that need the must clause.
Does anyone know how this can be solved?
Any advice would be really appreciated.
// my base skeleton
$params = array(
'index' => "myIndex",
'type' => "myType",
'body' => array(
'query' => array(
'bool' => array(
'must' => array(
// empty must clause for starters
)
)
),
'sort' => array()
)
);
// sorting is not working with bool and must
if ($request->request->get('salarySort')) {
$params['body']['query']['bool']['must'][] = array(
'sort' => array(
"title" => array('order' => 'asc')
)
);
}
This is what I get as json_encode output ---
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1066,
"max_score": null,
"hits": [
{
"_index": "myIndex",
"_type": "myType",
"_id": "pe065319de73937aa6ef46413afd7aac26a58a611",
"_score": null,
"_source": {
"title": "Smarason trycker ",
"content": "HIF gör 2-0 mot Halmstad.",
"tag": [
"Soprts"
],
"category": [
"Sports"
]
},
"sort": [
"0"
]
},
{
"_index": "myIndex",
"_type": "myType",
"_id": "pebc44a70008f53f74f23ab23f8a1f79b2b729448",
"_score": null,
"_source": {
"title": "Anders Svenssons tips gav 1-0",
"content": "Anders Svenssons tips i halvtid Kalmar FF.",
"source": "Unknown",
"tag": [
"Soprts"
],
"category": [
"Sports"
]
},
"sort": [
"0"
]
}
]
}
}
query in JSON ---
{
"index": "myIndex",
"type": "myType",
"size": 30,
"body": {
"query": {
"match_all": []
},
"sort": [
{
"title": "asc"
}
]
}
}
You're almost there. You've correctly placed the empty sort array at the same level as your query.
The issue comes later, when you add the sort as a bool/must clause instead of appending it to that sort array.
// append the sort clause to the top-level sort array, not to bool/must
if ($request->request->get('salarySort')) {
$params['body']['sort'][] = array( // changed: target body.sort
"Salary" => 'asc' // changed: field name => sort order
);
}
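The resulting request shape can be sketched in plain Python, independent of the PHP client. The helper name `build_search_body` and the example match clause are made up for the illustration; the point is only that "sort" sits next to "query" in the body and never inside the bool clauses.

```python
import json


def build_search_body(must_clauses, sort_by_salary=False):
    # "query" and "sort" are sibling keys of the request body.
    body = {"query": {"bool": {"must": list(must_clauses)}}, "sort": []}
    if sort_by_salary:
        # Append to the top-level sort array, not to bool/must.
        body["sort"].append({"Salary": "asc"})
    return body


# Hypothetical must clause, only to show both parts together.
body = build_search_body([{"match": {"category": "Sports"}}], sort_by_salary=True)
print(json.dumps(body, indent=2))
```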
I'm implementing Elasticsearch in my Laravel application using the PHP package from Elasticsearch.
My application is a small job board, and currently my job document looks like this:
{
"_index":"jobs",
"_type":"job",
"_id":"19",
"_score":1,
"_source":{
"0":"",
"name":"Programmer",
"description":"This is my first job! :)",
"text":"Programming is awesome",
"networks":[
{
"id":1,
"status":"PRODUCTION",
"start":"2015-02-26",
"end":"2015-02-26"
},
{
"id":2,
"status":"PAUSE",
"start":"2015-02-26",
"end":"2015-02-26"
}
]
}
}
As you can see, a job can be attached to multiple networks. In my search query I would like to include WHERE network.id == 1 AND network.status == PRODUCTION.
My current query looks like this; however, it returns any document that has a network with id 1 and any network with status PRODUCTION, not necessarily the same one. Is there any way I can enforce both to be true within one network?
$query = [
'index' => $this->index,
'type' => $this->type,
'body' => [
'query' => [
'bool' => [
'must' => [
['match' => ['networks.id' => 1]],
['match' => ['networks.status' => 'PRODUCTION']]
],
'should' => [
['match' => ['name' => $query]],
['match' => ['text' => $query]],
['match' => ['description' => $query]],
],
],
],
],
];
You need to specify that the objects in the networks array should be stored as individual objects in the index; this will allow you to search on individual network objects. You can do so using the nested type in Elasticsearch.
Also, if you are doing exact matches, it is better to use a filter rather than a query, as filters can be cached and usually perform better than a scoring query.
Create your index with a new mapping. Use the nested type for the networks array.
POST /test
{
"mappings": {
"job": {
"properties": {
"networks": {
"type": "nested",
"properties": {
"status": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
}
Add a document:
POST /test/job/1
{
"0": "",
"name": "Programmer",
"description": "This is my first job! :)",
"text": "Programming is awesome",
"networks": [
{
"id": 1,
"status": "PRODUCTION",
"start": "2015-02-26",
"end": "2015-02-26"
},
{
"id": 2,
"status": "PAUSE",
"start": "2015-02-26",
"end": "2015-02-26"
}
]
}
As you have a nested type you will need to use a nested filter.
POST /test/job/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "networks",
"filter": {
"bool": {
"must": [
{
"term": {
"networks.id": "1"
}
},
{
"term": {
"networks.status.raw": "PRODUCTION"
}
}
]
}
}
}
}
}
}
}
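Why the nested type is needed can be shown with a small plain-Python simulation. It does not talk to Elasticsearch: `flattened_match` imitates how an array of plain objects is indexed as independent multi-valued fields, and `nested_match` imitates per-object matching. Both function names and the sample job are made up for the illustration.

```python
def flattened_match(doc, net_id, status):
    # Without nested, object arrays are flattened: all ids go into one
    # multi-valued field and all statuses into another, so the link
    # between an id and its status is lost.
    ids = {n["id"] for n in doc["networks"]}
    statuses = {n["status"] for n in doc["networks"]}
    return net_id in ids and status in statuses


def nested_match(doc, net_id, status):
    # With nested, both conditions must hold inside the same object.
    return any(
        n["id"] == net_id and n["status"] == status for n in doc["networks"]
    )


# Hypothetical job: network 1 is paused, network 2 is in production.
job = {
    "networks": [
        {"id": 1, "status": "PAUSE"},
        {"id": 2, "status": "PRODUCTION"},
    ]
}

print(flattened_match(job, 1, "PRODUCTION"))
print(nested_match(job, 1, "PRODUCTION"))
```

The flattened check accepts this job even though network 1 is not in production, which is the false positive described in the question; the per-object check rejects it.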