Immense term error in elasticsearch

I'm working on a membership administration program, for which we want to use Elasticsearch as the search engine. At this point we're having problems indexing certain fields, because they generate an 'immense term' error on the _all field.
Our settings:
curl -XGET 'http://localhost:9200/my_index?pretty=true'
{
"my_index" : {
"aliases" : { },
"mappings" : {
"Memberships" : {
"_all" : {
"analyzer" : "keylower"
},
"properties" : {
"Amount" : {
"type" : "float"
},
"Members" : {
"type" : "nested",
"properties" : {
"Startdate membership" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"Enddate membership" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"Members" : {
"type" : "string",
"analyzer" : "keylower"
}
}
},
"Membership name" : {
"type" : "string",
"analyzer" : "keylower"
},
"Description" : {
"type" : "string",
"analyzer" : "keylower"
},
"elementId" : {
"type" : "integer"
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1441310632366",
"number_of_shards" : "1",
"analysis" : {
"filter" : {
"my_char_filter" : {
"type" : "asciifolding",
"preserve_original" : "true"
}
},
"analyzer" : {
"keylower" : {
"filter" : [ "lowercase", "my_char_filter" ],
"tokenizer" : "keyword"
}
}
},
"number_of_replicas" : "1",
"version" : {
"created" : "1040599"
},
"uuid" : "nn16-9cTQ7Gn9NMBlFxHsw"
}
},
"warmers" : { }
}
}
We use the keylower analyzer because we don't want the full name to be split on whitespace: we want to be able to search for 'john johnson' in the _all field as well as in the 'Members' field.
The 'Members' field can contain multiple members, which is where the problems start. When the field contains only a couple of members (as in the example below), there is no problem. However, the field may contain hundreds or thousands of members, which is when we get the immense term error.
curl 'http://localhost:9200/my_index/_search?pretty=true&q=*:*'
{
"took":1,
"timed_out":false,
"_shards":{
"total":1,
"successful":1,
"failed":0
},
"hits":{
"total":1,
"max_score":1.0,
"hits":[
{
"_index":"my_index",
"_type":"Memberships",
"_id":"15",
"_score":1.0,
"_source":{
"elementId":[
"15"
],
"Membership name":[
"My membership"
],
"Amount":[
"100"
],
"Description":[
"This is the description."
],
"Members":[
{
"Members":"John Johnson",
"Startdate membership":"2015-01-09",
"Enddate membership":"2015-09-03"
},
{
"Members":"Pete Peterson",
"Startdate membership":"2015-09-09"
},
{
"Members":"Santa Claus",
"Startdate membership":"2015-09-16"
}
]
}
}
]
}
}
NOTE: The above example works! It's only when the field 'Members' contains (a lot) more members that we get the error. The error we get is:
"error":"IllegalArgumentException[Document contains at least one
immense term in field=\"_all\" (whose UTF8 encoding is longer than the
max length 32766), all of which were skipped. Please correct the
analyzer to not produce such terms. The prefix of the first immense
term is: '[...]...', original message: bytes can be at most 32766 in
length; got 106807]; nested: MaxBytesLengthExceededException[bytes can
be at most 32766 in length; got 106807]; " "status":500
We only get this error on the _all field, not on the original Members field. With ignore_above, it's no longer possible to search the _all field by full name. With the standard analyzer, I would find this document when searching for 'Santa Johnson', because the _all field would contain the tokens 'Santa' and 'Johnson'. That's why I use keylower for these fields.
What I would like is an analyzer that tokenizes per field, but doesn't break up the values within the fields themselves. What happens now is that the entire 'Members' field, including its child fields, is fed to _all as a single token. The token in the example above would be:
John Johnson 2015-01-09 2015-09-03 Pete Peterson 2015-09-09 Santa Claus 2015-09-16
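The limit in the error message applies to the UTF-8 length of a single term. The following is a minimal sketch in Python, not Elasticsearch code, of why one keyword-style token built from many members crosses that limit. The helper names and the generated member names are hypothetical; only the 32766-byte limit comes from the error message.

```python
MAX_TERM_BYTES = 32766  # limit quoted in the error message


def keyword_term(values):
    """Model a keyword-style analyzer: all values end up in one lowercase term."""
    return " ".join(values).lower()


def is_immense(term):
    """Return True if the UTF-8 encoding of the term exceeds the limit."""
    return len(term.encode("utf-8")) > MAX_TERM_BYTES


few = ["John Johnson", "2015-01-09", "2015-09-03"]
many = ["Member Number %d" % i for i in range(5000)]  # hypothetical member names

print(is_immense(keyword_term(few)))
print(is_immense(keyword_term(many)))
```

With a handful of members the single term stays far below the limit; with thousands it does not, which matches the behaviour described in the question.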
Is it possible to tokenize these fields in such a way that every field value is fed to _all as a separate token, without breaking up the values themselves? The tokens would then be:
John Johnson
2015-01-09
2015-09-03
Pete Peterson
2015-09-09
Santa Claus
2015-09-16
Note: We use the Elasticsearch php library.

There is a much better way of doing this. Whether or not a phrase search can span multiple field values is determined by position_offset_gap (renamed to position_increment_gap in 2.0). This parameter specifies how many positions are "inserted" between the last token of one value and the first token of the next. In Elasticsearch prior to 2.0 it defaults to 0, which is what causes the issue you describe.
By combining the copy_to feature with a position_offset_gap, you can create an alternative my_all field that does not have this issue. By naming this new field in the index.query.default_field setting, you tell Elasticsearch to use it instead of _all when a query specifies no fields.
curl -XDELETE "localhost:9200/test-idx?pretty"
curl -XPUT "localhost:9200/test-idx?pretty" -d '{
"settings" :{
"index": {
"number_of_shards": 1,
"number_of_replicas": 0,
"query.default_field": "my_all"
}
},
"mappings": {
"doc": {
"_all" : {
"enabled" : false
},
"properties": {
"Members" : {
"type" : "nested",
"properties" : {
"Startdate membership" : {
"type" : "date",
"format" : "dateOptionalTime",
"copy_to": "my_all"
},
"Enddate membership" : {
"type" : "date",
"format" : "dateOptionalTime",
"copy_to": "my_all"
},
"Members" : {
"type" : "string",
"analyzer" : "standard",
"copy_to": "my_all"
}
}
},
"my_all" : {
"type": "string",
"position_offset_gap": 256
}
}
}
}
}'
curl -XPUT "localhost:9200/test-idx/doc/1?pretty" -d '{
"Members": [{
"Members": "John Johnson",
"Startdate membership": "2015-01-09",
"Enddate membership": "2015-09-03"
}, {
"Members": "Pete Peterson",
"Startdate membership": "2015-09-09"
}, {
"Members": "Santa Claus",
"Startdate membership": "2015-09-16"
}]
}'
curl -XPOST "localhost:9200/test-idx/_refresh?pretty"
echo
echo "Should return one hit"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
"query": {
"match_phrase" : {
"my_all" : "John Johnson"
}
}
}'
echo
echo "Should return one hit"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
"query": {
"query_string" : {
"query" : "\"John Johnson\""
}
}
}'
echo
echo "Should return no hits"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
"query": {
"match_phrase" : {
"my_all" : "Johnson 2015-01-09"
}
}
}'
echo
echo "Should return no hits"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
"query": {
"query_string" : {
"query" : "\"Johnson 2015-01-09\""
}
}
}'
echo
echo "Should return no hits"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
"query": {
"match_phrase" : {
"my_all" : "Johnson Pete"
}
}
}'
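The effect of the position gap can be illustrated with a small simulation. This is a simplified model in Python, not Elasticsearch or Lucene code: tokens are produced by splitting on whitespace, and the function names are made up for the illustration. It shows why a phrase matches inside one value but not across two values once a gap is applied.

```python
def index_positions(values, gap):
    """Assign a position to every token of a multi-valued field.

    `gap` extra positions are inserted between the last token of one value
    and the first token of the next value.
    """
    positions = []
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for token in value.lower().split():
            positions.append((token, pos))
            pos += 1
    return positions


def phrase_matches(positions, phrase):
    """Return True if the phrase tokens occur at consecutive positions."""
    words = phrase.lower().split()
    lookup = {}
    for token, pos in positions:
        lookup.setdefault(token, set()).add(pos)
    for start in lookup.get(words[0], ()):
        if all(start + k in lookup.get(w, ()) for k, w in enumerate(words)):
            return True
    return False


values = ["John Johnson", "Pete Peterson", "Santa Claus"]
no_gap = index_positions(values, gap=0)
with_gap = index_positions(values, gap=256)

print(phrase_matches(no_gap, "Johnson Pete"))    # phrase spans two values
print(phrase_matches(with_gap, "Johnson Pete"))  # the gap separates the values
print(phrase_matches(with_gap, "John Johnson"))  # both tokens are in one value
```

A gap of 256 is simply larger than any phrase you are likely to search for; any value longer than the longest expected phrase would behave the same way in this model.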

Related

Searching for exact phrase with synonyms

I am trying to build a query, where I am using exact phrase match and synonyms and I can't figure it out. Also, when using wildcard approach I don't know how to use fuzziness. Is it even possible with wildcards? It would be great to get same results for terms "call of duty", "cod" or "call of dutz".
I have created this index:
PUT exact_search
{
"settings": {
"index": {
"number_of_shards": "1",
"number_of_replicas": "0",
"analysis": {
"analyzer": {
"analyzer_exact": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"icu_folding",
"synonyms"
]
}
},
"filter": {
"synonyms": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword",
"fields": {
"analyzer_exact": {
"type": "text",
"analyzer": "analyzer_exact"
}
}
}
}
}
}
And I fill it with these items:
POST exact_search/_doc/1
{
"name": "Hoodie Call of Duty"
}
POST exact_search/_doc/2
{
"name": "Call of Duty 2"
}
POST exact_search/_doc/3
{
"name": "Call of Duty: Modern Warfare 2"
}
POST exact_search/_doc/4
{
"name": "COD: Modern Warfare 2"
}
POST exact_search/_doc/5
{
"name": "Call of duty"
}
POST exact_search/_doc/6
{
"name": "Call of the sea"
}
POST exact_search/_doc/7
{
"name": "Heavy Duty"
}
synonyms.txt looks like this:
cod,call of duty
What I am trying to achieve is to get all the results (except "Call of the sea" and "Heavy Duty") when I search for "call of duty" or "cod".
So far, I constructed this query, but it does not work as expected when using "cod" search term (term "call of duty" works fine):
GET exact_search/_search
{
"explain": false,
"query":{
"bool":{
"must":[
{
"wildcard": {
"name.analyzer_exact": {
"value": "*cod*"
}
}
}
]
}
}
}
But the result is only two items:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "exact_search",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"name" : "COD: Modern Warfare 2"
}
},
{
"_index" : "exact_search",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"name" : "Call of duty"
}
}
]
}
}
It looks like the synonyms are working, because the "Call of duty" game is returned, but the wildcards seem to be ignored: "Call of Duty 2", for example, is not returned.
I need an exact phrase match because I don't want to get "Heavy Duty" or "Call of the sea" as results (when only the words "call" or "duty" match).
Thank you for pointing me in the right direction.

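I doubt that the synonym filter can produce the tokens you need while analyzer_exact uses "tokenizer": "keyword", because the keyword tokenizer emits the whole name as a single token.
I would change a few things to make it work.
First, switch the tokenizer from keyword to standard: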
"analyzer_exact": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"synonyms"
]
}
I would then use match_phrase to exclude names that contain neither "call of duty" nor "cod":
{
"match_phrase": {
"name.analyzer_exact": "cod"
}
}
Response after changes
{
"hits": {
"hits": [
{
"_source": {
"name": "Call of duty"
}
},
{
"_source": {
"name": "COD: Modern Warfare 2"
}
},
{
"_source": {
"name": "Call of Duty 2"
}
},
{
"_source": {
"name": "hoddies Call of Duty"
}
},
{
"_source": {
"name": "Call of Duty: Modern Warfare 2"
}
}
]
}
}
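Why the standard tokenizer helps can be sketched outside Elasticsearch. The Python below is a rough model under stated assumptions: the regular expression is only a crude stand-in for the standard tokenizer, and the synonym rule "cod,call of duty" is simplified to rewriting "cod" into the phrase "call of duty" rather than emitting both forms at the same position. All function names are hypothetical.

```python
import re

CANONICAL = ["call", "of", "duty"]  # simplified form of the rule "cod,call of duty"


def standard_tokens(text):
    """Crude stand-in for the standard tokenizer plus the lowercase filter."""
    return re.findall(r"[a-z0-9]+", text.lower())


def apply_synonyms(tokens):
    """Rewrite the single-word synonym 'cod' to the canonical phrase."""
    out = []
    for token in tokens:
        out.extend(CANONICAL if token == "cod" else [token])
    return out


def contains_phrase(tokens, phrase_tokens):
    """Return True if phrase_tokens appear as a contiguous run in tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))


def search(names, query):
    """Return the names whose analyzed tokens contain the analyzed query phrase."""
    wanted = apply_synonyms(standard_tokens(query))
    return [n for n in names if contains_phrase(apply_synonyms(standard_tokens(n)), wanted)]


names = [
    "Hoodie Call of Duty",
    "Call of Duty 2",
    "Call of Duty: Modern Warfare 2",
    "COD: Modern Warfare 2",
    "Call of duty",
    "Call of the sea",
    "Heavy Duty",
]

print(search(names, "cod"))
print(search(names, "call of duty"))
```

In this model both queries return the same five names and leave out "Call of the sea" and "Heavy Duty", because the phrase must appear as a contiguous run of word tokens. With a keyword tokenizer each name would be one token, so a synonym could only apply when the whole name equals "cod" or "call of duty".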

Elasticsearch - excluding children from documents with join field

So I've set up an index with the following mapping:
PUT test_index
{
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text"
},
"author": {
"type": "text"
},
"reader_stats": {
"type": "join",
"relations": {
"book": "reader"
}
}
}
}
}
}
Each parent document represents a book and its children represent readers of that book. However, if I were to run:
GET test_index/_search
{
"query":{"match_all":{}}
}
The results would be populated with both books and readers like so:
"hits" : [
{
"_index" : "test_index",
"_type" : "doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"title" : "my second book",
"author" : "mr author",
"reader_stats" : {
"name" : "book"
}
}
},
{
"_index" : "test_index",
"_type" : "doc",
"_id" : "7",
"_score" : 1.0,
"_routing" : "2",
"_source" : {
"name" : "michael bookworm",
"clicks" : 1,
"reader_stats" : {
"name" : "reader",
"parent" : 2
}
}
}
]
Is there some way I can exclude reader documents and only show books? I already used match_all in my app to grab books so it would be good if I can avoid having to change that query but I guess that's not possible.
Also I'm a bit confused as to how mappings work with join fields as there is no definition for what fields are required of child documents. For example, in my mapping there's nowhere to specify that 'reader' documents must have 'name' and 'clicks' fields. Is this correct?

Use the has_child query to return only parent documents, and the has_parent query to return only child documents.
Is there some way I can exclude reader documents and only show books?
Yes. Note that has_child matches only books that have at least one reader, so books without readers are not returned. Your query will be:
GET test_index/_search
{
"query": {
"has_child": {
"type": "reader",
"query": {
"match_all": {}
}
}
}
}
For more details, see:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-has-child-query.html
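As a hedged sketch, the two request bodies can be built as plain Python dictionaries. The first mirrors the has_child query above. The second is an assumption rather than something taken from this answer: it relies on the join field being searchable by relation name with a term query, which would also return books that have no readers. Verify it against your Elasticsearch version before relying on it. The function names are hypothetical.

```python
import json


def books_with_readers(child_type="reader"):
    """Body for the has_child query: parents that have at least one child."""
    return {"query": {"has_child": {"type": child_type, "query": {"match_all": {}}}}}


def all_books(join_field="reader_stats", parent_relation="book"):
    """Assumed alternative: term query on the join field's relation name."""
    return {"query": {"term": {join_field: parent_relation}}}


print(json.dumps(books_with_readers(), indent=2))
print(json.dumps(all_books(), indent=2))
```

Either body would be sent to GET test_index/_search in place of the match_all query from the question.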

Elasticsearch Range filter with year only

I need to filter my data by year only using Elasticsearch. I am using PHP to fetch and show the results. Here is my data in JSON format:
{
"loc_cityname": "New York",
"location_countryname": "US",
"location_primary": "North America",
"admitted_date": "1994-12-10"
},
{
"loc_cityname": "New York",
"location_countryname": "US",
"location_primary": "North America",
"admitted_date": "1995-12-10"
}
I am using the code below to filter the values by year.
$options='{
"query": {
"range" : {
"admitted_date" : {
"gte" : 1994,
"lte" : 2000
}
}
},
"aggs" : {
"citycount" : {
"cardinality" : {
"field" : "loc_cityname",
"precision_threshold": 100
}
}
}
}';
How can I filter the results by year only?
Thanks in advance.

You simply need to add the format parameter to your range query (the "format" line below):
$options='{
"query": {
"range" : {
"admitted_date" : {
"gte" : 1994,
"lte" : 2000,
"format": "yyyy"
}
}
},
"aggs" : {
"citycount" : {
"cardinality" : {
"field" : "loc_cityname",
"precision_threshold": 100
}
}
}
}';
UPDATE
Note that the above solution only works for ES 1.5 and above. With previous versions of ES, you could use a script filter instead:
$options='{
"query": {
"filtered": {
"filter": {
"script": {
"script": "(min..max).contains(doc.admitted_date.date.year)",
"params": {
"min": 1994,
"max": 2000
}
}
}
}
},
"aggs": {
"citycount": {
"cardinality": {
"field": "loc_cityname",
"precision_threshold": 100
}
}
}
}';
In order to be able to run this script filter, you need to make sure that you have enabled scripting in elasticsearch.yml:
script.disable_dynamic: false
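If you would rather not depend on how a year-only value is expanded, another option is to spell out full dates for the bounds. The Python sketch below builds such a range query. It assumes the admitted_date field accepts values like "1994-12-10", as in the sample data, and the function name is hypothetical.

```python
import json


def year_range_query(field, first_year, last_year):
    """Range query covering first_year through last_year inclusive, using full dates."""
    return {
        "range": {
            field: {
                "gte": "%04d-01-01" % first_year,       # first day of the first year
                "lt": "%04d-01-01" % (last_year + 1),   # first day after the last year
            }
        }
    }


body = {"query": year_range_query("admitted_date", 1994, 2000)}
print(json.dumps(body, indent=2))
```

Using an exclusive upper bound on 1 January of the following year avoids having to choose a last timestamp for 31 December.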

How to order a mongoDB query by a field in an embedded document?

I have these documents in a mongoDB:
/* 1 */
{
"_id" : ObjectId("553ce99a39108e2b7c1edeb9"),
"coleccion" : "aplicaciones",
"nombre" : "Mascotas",
"descripcion" : "Censo de mascotas",
"tipo" : "privada"
}
/* 2 */
{
"_id" : ObjectId("553e316e39108e802a1edeb9"),
"coleccion" : "aplicaciones",
"nombre" : "otra aplicacionn",
"descripcion" : "w aplicacion",
"tipo" : "privada",
"campoId" : [
{
"id" : "1430145364",
"id_campo" : "553bffca39108eb163cff7aa",
"orden" : 90
},
{
"id" : "1430145368",
"id_campo" : "553bffed39108e346ccff7ab",
"orden" : 100,
"estado" : "0"
},
{
"id" : "1430145370",
"id_campo" : "553c001139108ebc63cff7aa",
"orden" : 29,
"estado" : "1"
},
{
"id" : "1430145395",
"id_campo" : "553c001139108ebc63cff7aa",
"orden" : 9,
"estado" : "0"
}
]
}
I need to query the documents and sort them in ascending order by the field "campoId.orden". I have executed this query:
db.getCollection('aplicaciones').find({}).sort({'campoId.orden' : -1})
but I do not get the order I want.
Can anyone suggest a way?

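In your documents, orden is inside a nested array, so you should use the MongoDB aggregation framework. The steps are:
1> First check whether campoId exists, using $exists
2> Then unwind the campoId array with $unwind
3> Then sort by campoId.orden with $sort (-1 is descending; use 1 for ascending order)
4> Then group the fields back together with $group
So the query is as below: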
db.aplicaciones.aggregate({
"$match": {
"campoId": {
"$exists": true // check here campoId presents or not using exists
}
}
}, {
"$unwind": "$campoId" // unwind campoId array
}, {
"$sort": {
"campoId.orden": -1
}
},
//groups all fields
{
"$group": {
"_id": "$_id",
"coleccion": {
"$first": "$coleccion"
},
"nombre": {
"$first": "$nombre"
},
"descripcion": {
"$first": "$descripcion"
},
"tipo": {
"$first": "$tipo"
},
"campoId": {
"$push": "$campoId"
}
}
}).pretty()
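To see what the pipeline does to the data, here is a small in-memory model of its stages in Python. It is not MongoDB code and makes no driver calls: the function name is hypothetical and the sample document is abbreviated from the question.

```python
def unwind_sort_group(documents, array_field, key, descending=True):
    """Model $match ($exists), $unwind, $sort and $group with $push in memory."""
    # $match + $unwind: one row per array element, skipping documents without the field
    rows = [
        (doc, item)
        for doc in documents
        if array_field in doc
        for item in doc[array_field]
    ]
    # $sort on the unwound element
    rows.sort(key=lambda row: row[1][key], reverse=descending)
    # $group by _id, pushing the elements back in sorted order
    grouped = {}
    for doc, item in rows:
        if doc["_id"] not in grouped:
            rebuilt = {k: v for k, v in doc.items() if k != array_field}
            rebuilt[array_field] = []
            grouped[doc["_id"]] = rebuilt
        grouped[doc["_id"]][array_field].append(item)
    return list(grouped.values())


docs = [
    {"_id": 1, "nombre": "Mascotas"},
    {"_id": 2, "nombre": "otra aplicacionn", "campoId": [
        {"id": "1430145364", "orden": 90},
        {"id": "1430145368", "orden": 100},
        {"id": "1430145370", "orden": 29},
        {"id": "1430145395", "orden": 9},
    ]},
]

result = unwind_sort_group(docs, "campoId", "orden", descending=True)
print([item["orden"] for item in result[0]["campoId"]])
```

Passing descending=False corresponds to using 1 instead of -1 in the $sort stage. As in the pipeline, documents without campoId are dropped by the first stage.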

Elasticsearch find documents by location

I have indexed a number of documents in my Elasticsearch database, and when I query for all of them I see they have a structure like this:
GET http://localhost:9200/restaurants/restaurant/_search
Output:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 362,
"max_score": 1,
"hits": [
{
"_index": "restaurants",
"_type": "restaurant",
"_id": "2",
"_score": 1,
"_source": {
"businessName": "VeeTooNdoor Dine",
"geoDescription": "Right next to you2",
"tags": {},
"location": {
"lat": -33.8917007446,
"lon": 151.1369934082
}
}
},
...
]
}
}
I now want to search for restaurants around a given geo-location and following the documentation [1] I use something like this:
{
"filtered" : {
"query" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "1km",
"location" : {
"lat": -33.8917007446,
"lon": 151.1369934082
}
}
}
}
}
The thing I have changed is the match_all since I don't want to specify any search field in particular, I only care about the geo location.
When I run the query I get the following message:
"error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[olAoWBSJSF2XfTnzEhKIYA][btq][3]: SearchParseException[[btq][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{\n \"filtered\" : {\n \"query\" : {\n \"match_all\" : {}\n },\n \"filter\" : {\n .....
Now I did notice the following on the tutorial page:
Update: The automatic mapping of “geo enabled” properties has been
disabled since publishing this article. You have to provide the
correct mapping for geo properties. Please see the documentation.
This gives me the impression that I have to create a "mapping" which specifies the field types. However, the documentation it refers to doesn't give me enough information on how to actually do this. It shows blobs of JSON, but I'm not sure about the correct URLs for them.
Furthermore, I'm using the PHP client and I'm not sure whether it even supports mappings as demonstrated in this walkthrough [2].
I get the impression that quite a few changes have been made to the query DSL etc., and that a lot of examples on the web don't work anymore, though I could be wrong. I'm using Elasticsearch 1.0.0.
[1] http://www.elasticsearch.org/blog/geo-location-and-search
[2] http://blog.qbox.io/elasticsearch-aggregations

Things that might be wrong:
1: your query shows pin.location and your field is just location.
2: your _mapping for location could be wrong
Does your mapping show something like:
"location": {
"type": "geo_point",
"geohash_precision": 4
}
I was able to run this search against some of my own data:
POST /myindex/mydata/_search
{
"query": {
"match_all": {}
},
"filter": {
"geo_distance" : {
"distance" : "100km",
"_latlng_geo" : {
"lat": -33.8917007446,
"lon": 151.1369934082
}
}
}
}
... a snippet of my mapping:
"properties": { .....
"_latlng_geo": {
"type": "geo_point",
"geohash_precision": 4
}
.....
EDIT : How to use Put Mapping API
You can create the mapping when you create the index like so:
PUT /twitter/
{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"tweet":{
"properties":{
"latlng" : {
"type" : "geo_point"
}
}
}
}
}

Removing the query is what finally worked for me:
{
"filter": {
"geo_distance" : {
"distance" : "300km",
"location" : {
"lat" : 45,
"lon" : -122
}
}
}
}

You have to replace "location" with "restaurant.location", because Elasticsearch interprets it as a type rather than as an attribute.
{
"filtered" : {
"query" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "1km",
"restaurant.location" : {
"lat": -33.8917007446,
"lon": 151.1369934082
}
}
}
}
}
I hope it helps you.

As stated in the docs:
"We use the generic PHP stdClass object to represent an empty object. The JSON will now encode correctly."
http://www.elasticsearch.org/guide/en/elasticsearch/client/php-api/current/_dealing_with_json_arrays_and_objects_in_php.html
In your case you should use
$searchParams['body']['query']['filtered']['query']['match_all'] = new \stdClass();
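For comparison, here is a sketch of the complete search body built in Python, where an empty dict serializes to the empty JSON object that match_all needs. It wraps the filtered query in a top-level "query" key, as the Elasticsearch 1.x search API expects; the function name is hypothetical and the field name and coordinates come from the question.

```python
import json


def geo_search_body(field, lat, lon, distance):
    """Search body for a match_all query filtered by geo_distance (ES 1.x syntax)."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},  # empty object, not an empty array
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


body = geo_search_body("location", -33.8917007446, 151.1369934082, "1km")
print(json.dumps(body, indent=2))
```

In PHP the same empty object has to be written as new \stdClass(), because an empty PHP array is encoded as [] and not as {}.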
