Elasticsearch -- count number of keyword occurrences in a document


Database: Elasticsearch v7.2
Application: Laravel v5.7
Using Elasticsearch/Elasticsearch (https://github.com/elastic/elasticsearch-php) Official PHP Library
I have a query_string query for Elasticsearch that retrieves documents containing certain phrases when I search my index:
[
"query_string" => [
"default_field" => $content,
"query" => $keywords
]
],
and the $keywords variable contains:
("MCU" OR "Marvel" OR "Spiderman")
Now I want to count the NUMBER OF OCCURRENCES of these words in each document that I retrieve.
I used an aggs query like this:
'aggs' => [
'count' => [
'terms' => [
'field' => 'content.keyword'
]
]
]
However, I have no idea how to associate these doc_count values with the hits and display them side by side, because the bucket key is the content itself instead of the document ID.
I'm planning to display the whole document and show, as Mentions, how many times the $keywords above occur in each document.
Is there another way to count the occurrences without using aggs in Elasticsearch?

If you only want to count the occurrences of keywords, you don't have to enable fielddata. Try the filters aggregation along with your query:
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "MCU OR Marvel OR Spiderman"
}
},
"aggs": {
"count": {
"filters": {
"filters": {
"mcu": {
"match": {
"content": "MCU"
}
},
"marvel": {
"match": {
"content": "Marvel"
}
},
"spiderman": {
"match": {
"content": "Spiderman"
}
}
}
}
}
}
}
The result will look like this:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1.219939,
"hits": [
....
....
]
},
"aggregations": {
"count": {
"buckets": {
"marvel": {
"doc_count": 2
},
"mcu": {
"doc_count": 2
},
"spiderman": {
"doc_count": 1
}
}
}
}
}
Source : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filters-aggregation.html
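For the PHP client, the same filters aggregation can be generated from the keyword list instead of being written out by hand. This is only a sketch: buildKeywordFilters() is a hypothetical helper, and it assumes the searched field is named content, as in the request above. Keep in mind that each doc_count in the response is the number of documents matching that keyword, not the number of times the keyword appears inside one document.

```php
<?php
// Hypothetical helper: builds one named "match" filter per keyword.
function buildKeywordFilters(array $keywords, string $field = 'content'): array
{
    $filters = [];
    foreach ($keywords as $keyword) {
        // Bucket names are lowercased, mirroring "mcu", "marvel", "spiderman" above.
        $filters[strtolower($keyword)] = ['match' => [$field => $keyword]];
    }

    return ['count' => ['filters' => ['filters' => $filters]]];
}

$aggs = buildKeywordFilters(['MCU', 'Marvel', 'Spiderman']);

// $aggs can be used as the 'aggs' element of the search body.
echo json_encode($aggs, JSON_PRETTY_PRINT), PHP_EOL;
```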

Thanks to @AshrafulIslam, I came across Elasticsearch's highlight feature. Since highlighting wraps every matched keyword in <em> tags, I used PHP's substr_count() function to count those tags.
I added this code as a sibling of the ['body']['query'] element:
"highlight" => [
"fields" => [
"content" => ["number_of_fragments" => 0]
],
'require_field_match' => false
]
Then as I loop through the ['hits']['hits'] array element, I performed something like this:
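$articles = $client->search($params);
$hits = $articles['hits']['hits'];
// Count the <em> tags in each hit's highlighted content; hits without a highlight get 0.
foreach ($hits as $i => $hit) {
    $highlighted = $hit['highlight']['content'][0] ?? '';
    $hits[$i]['_source']['count_mentions'] = substr_count($highlighted, '<em>');
}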
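If a per-keyword breakdown is wanted rather than one total, the highlighted terms can be tallied individually. The following is a minimal sketch, not part of the original solution: countHighlightedTerms() is a hypothetical helper, the sample hit is made up, and it assumes the default <em> highlight tags. A quoted phrase may be highlighted word by word, so multi-word keywords would need extra handling.

```php
<?php
// Hypothetical helper: tallies the text found between <em> and </em>, case-insensitively.
function countHighlightedTerms(array $hit, string $field = 'content'): array
{
    $counts = [];
    foreach ($hit['highlight'][$field] ?? [] as $fragment) {
        preg_match_all('~<em>(.*?)</em>~s', $fragment, $matches);
        foreach ($matches[1] as $term) {
            $key = strtolower($term);
            $counts[$key] = ($counts[$key] ?? 0) + 1;
        }
    }

    return $counts;
}

// Made-up hit, shaped like one element of ['hits']['hits'].
$hit = [
    '_id' => '1',
    'highlight' => [
        'content' => ['<em>Marvel</em> announced the next <em>MCU</em> film; <em>Marvel</em> fans cheered.'],
    ],
];

$counts = countHighlightedTerms($hit);
print_r($counts);
```

Summing the result with array_sum($counts) gives the same total as the substr_count() approach.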

Enabling fielddata may not be the best way to enable text search.
https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#before-enabling-fielddata
Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
"mappings": {
"properties": {
"my_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}

Related

Group search result from multiple indices to single hits in Algolia

I am trying to implement the Algolia search. I am using PHP.
Scenario:
I have three tables (products, resources, and news). I am currently using MultipleQueries, as described in the documentation.
As a result, I am getting results in the following format as in the documentation.
{
"results": [
{
"hits": [
{
........
},
],
"page": 0,
"nbHits": 1,
"nbPages": 1,
"hitsPerPage": 20,
"processingTimeMS": 1,
"query": "jimmie paint",
"params": "query=jimmie+paint&attributesToRetrieve=firstname,lastname&hitsPerPage=50"
"index": "people"
},
{
"hits": [
{
...........
},
{
...........
}
],
"page": 0,
"nbHits": 1,
"nbPages": 1,
"hitsPerPage": 20,
"processingTimeMS": 1,
"query": "jimmie paint",
"params": "query=jimmie+paint&attributesToRetrieve=firstname,lastname&hitsPerPage=50"
"index": "famous_people"
},
{
..............
}
]
}
This is great, but WHAT I WANT is to group the results of all 3 indices into a single hits array. Below is a sample of what I am expecting from the API.
{
"results": [
{
"hits": [
{
........
},
{
........
},
{
........
},
{
........
},
{
........
},
{
........
},
{
........
},
],
"page": 0,
"nbHits": 1,
"nbPages": 1,
"hitsPerPage": 20,
"processingTimeMS": 1,
"query": "jimmie paint",
"params": "query=jimmie+paint&attributesToRetrieve=firstname,lastname&hitsPerPage=50"
"index": "indices goes here"
},
]
}
I searched a lot but could not come up with a suitable solution. Is this even possible using Algolia? Any help would be greatly appreciated.
Thank you in advance.

There's no concept of aggregation across multiple indices baked into Algolia. You'll have to aggregate the hit records yourself via code before displaying.
It's more typical for Algolia users to display results from multiple indices in a federated way using one of the front end libraries. Autocomplete is great at this:
https://www.algolia.com/doc/guides/solutions/ecommerce/search/autocomplete/federated-search/#combining-different-data-sources
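Aggregating the hit records in code could look roughly like the sketch below. It works on a response array shaped like the multi-query result in the question; mergeHits() and the sourceIndex key are hypothetical names, and the sample records are made up. The merged list is simply ordered index by index, since this sketch makes no attempt to rank hits from different indices against each other.

```php
<?php
// Hypothetical helper: flattens the hits of every result set into one list.
function mergeHits(array $response): array
{
    $merged = [];
    foreach ($response['results'] ?? [] as $result) {
        foreach ($result['hits'] ?? [] as $hit) {
            // Remember which index each record came from.
            $hit['sourceIndex'] = $result['index'] ?? null;
            $merged[] = $hit;
        }
    }

    return $merged;
}

// Made-up response with the same structure as the multi-query result.
$response = [
    'results' => [
        ['index' => 'people', 'hits' => [['objectID' => '1', 'firstname' => 'Jimmie']]],
        ['index' => 'famous_people', 'hits' => [
            ['objectID' => '7', 'firstname' => 'Jimmie'],
            ['objectID' => '9', 'firstname' => 'Jim'],
        ]],
    ],
];

$hits = mergeHits($response);
echo count($hits), ' hits in total', PHP_EOL;
```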

Elasticsearch - Search that returns too many bad results

I have a working Elasticsearch setup, but its matching is far too broad: it returns many results for terms that have nothing to do with my query. I'm looking for a way to refine these results.
On a sample of fake text when I search for the term music, the terms that come out in highlights are :
must, much, alice, inside, patriotic, noticed
I suspect the ngram filter is hurting me here, but I also think I need it for a better search.
Here is my configuration :
{
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "mySnowball", "myNgram"]
},
"default_search": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "mySnowball", "myNgram"]
}
},
"filter": {
"mySnowball": {
"type": "snowball",
"language": "English"
},
"myNgram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 6
}
}
}
}
Here is my request :
{
"query": {
"bool": {
"should": [{
"match": {
"content": "music"
}
}, {
"match": {
"url": "music"
}
}, {
"match": {
"h1": "music"
}
}, {
"match": {
"h2": "music"
}
}
],
"minimum_should_match": 1
}
},
"min_score": 8
}
My document is quite simple :
content => text,
url => text,
h1 => text,
h2 => text,
And the mapping too:
$configMapping = [
'content' => ['type' => 'text', 'boost' => 6],
'url' => ['type' => 'text', 'boost' => 6],
'h1' => ['type' => 'text', 'boost' => 9],
'h2' => ['type' => 'text', 'boost' => 7]
]
I welcome any modification that will allow me to obtain only consistent results.

As you said yourself, analyzing with 'ngram' is the reason you get all these unrelated results.
In each result you get, you can find a short token (as short as 2 characters, the min_gram of your n-gram filter) that matched the query term 'music':
must, much, alice, inside, patriotic, noticed
Start by removing this filter from your analyzer and keep on tuning the results from there.
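To see why, it helps to list the n-grams two words have in common. The sketch below is a simplified stand-in for the myNgram filter (min_gram 2, max_gram 6): ngrams() is a hypothetical helper, and it ignores the lowercase and snowball filters that run before the n-gram filter in the real analyzer.

```php
<?php
// Hypothetical helper: every substring of $token with length $min..$max.
function ngrams(string $token, int $min, int $max): array
{
    $grams = [];
    $length = strlen($token);
    for ($n = $min; $n <= $max; $n++) {
        for ($i = 0; $i + $n <= $length; $i++) {
            $grams[] = substr($token, $i, $n);
        }
    }

    return array_values(array_unique($grams));
}

// Tokens that both "music" and "must" produce; any one of them is enough for a match.
$shared = array_values(array_intersect(ngrams('music', 2, 6), ngrams('must', 2, 6)));
echo implode(', ', $shared), PHP_EOL; // prints "mu, us, mus"
```

Because the same filter is applied at index time and at search time, a document only needs to share one such fragment with the query to be returned.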

Elasticsearch find subdomains

I'm trying to find subdomains of a main domain in Elasticsearch.
I added a few domains to Elasticsearch:
$domains = [
'site.com',
'ns1.site.com',
'ns2.site.com',
'test.main.site.com',
'sitesite.com',
'test-site.com',
];
foreach ($domains as $domain) {
$params = [
'index' => 'my_index',
'type' => 'my_type',
'body' => ['domain' => $domain],
];
$client->index($params);
}
Then I try to search:
$params = [
'index' => 'my_index',
'type' => 'my_type',
'body' => [
'query' => [
'wildcard' => [
'domain' => [
'value' => '.site.com',
],
],
],
],
];
$response = $client->search($params);
But found nothing. :(
My mapping is:
https://pastebin.com/raw/k9MzjJUM
Any ideas to fix it?
Thanks

You're almost there, just a couple of things missing.
How to make an "ends with" query?
It's enough to add * in your query (that's why this query is called wildcard):
POST my_index/my_type/_search
{
"query": {
"wildcard" : { "domain" : "*.site.com" }
}
}
This will give you the following result:
{
...
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "RoE8VGMBRuo1XmkIXhp0",
"_score": 1,
"_source": {
"domain": "test.main.site.com"
}
}
]
}
}
Seems to work, but we only get one of the results (not all of them).
Why doesn't it return all matching documents?
Returning to your mapping, the field domain has type text:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"domain": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
This means that the content of that field will be tokenized and lowercased (by the standard analyzer). You can see which tokens will actually be searchable using the _analyze API, like this:
POST _analyze
{
"text": "test.main.site.com"
}
{
"tokens": [
{
"token": "test.main.site.com",
"start_offset": 0,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 0
}
]
}
That's why wildcard query could match test.main.site.com.
What if we take n1.site.com?
POST _analyze
{
"text": "n1.site.com"
}
{
"tokens": [
{
"token": "n1",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "site.com",
"start_offset": 3,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
As you can see, there is no token that ends with .site.com (note the . before the site.com).
Fortunately, your mapping is already capable of returning all results.
How to return all the results for "ends with" query?
You could use keyword field, which uses the exact value for querying:
POST my_index/my_type/_search
{
"query": {
"wildcard" : { "domain.keyword" : "*.site.com" }
}
}
This will give you the following result:
{
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "RoE8VGMBRuo1XmkIXhp0",
"_score": 1,
"_source": {
"domain": "test.main.site.com"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "Q4E8VGMBRuo1XmkIFRpy",
"_score": 1,
"_source": {
"domain": "ns1.site.com"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "RYE8VGMBRuo1XmkIORqG",
"_score": 1,
"_source": {
"domain": "ns2.site.com"
}
}
]
}
}
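With the PHP client from the question, the same request could be built along these lines. This is a sketch only: buildSubdomainSearch() is a hypothetical helper, and the index and type names are the ones used in the question.

```php
<?php
// Hypothetical helper: "ends with" search on the exact-value keyword sub-field.
function buildSubdomainSearch(string $parentDomain): array
{
    return [
        'index' => 'my_index',
        'type'  => 'my_type',
        'body'  => [
            'query' => [
                'wildcard' => [
                    // The leading "*." keeps sitesite.com and test-site.com out of the results.
                    'domain.keyword' => ['value' => '*.' . $parentDomain],
                ],
            ],
        ],
    ];
}

$params = buildSubdomainSearch('site.com');
// $response = $client->search($params); // $client as created in the question
echo json_encode($params['body']), PHP_EOL;
```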
Is this the best way to do "ends with"-like queries?
Actually, no. wildcard queries can be very slow:
Note that this query can be slow, as it needs to iterate over many
terms. In order to prevent extremely slow wildcard queries, a wildcard
term should not start with one of the wildcards * or ?.
To achieve best performance, in your case, I would suggest creating another field, higherLevelDomains, and manually extracting the higher level domains from the original. The document might look like:
POST my_index/my_type
{
"domain": "test.main.site.com",
"higherLevelDomains": [
"main.site.com",
"site.com",
"com"
]
}
This will allow you to use term query:
POST my_index/my_type/_search
{
"query": {
"term" : { "higherLevelDomains.keyword" : "site.com" }
}
}
This is probably the most efficient query you can get with Elasticsearch for such task.
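Extracting the higher level domains at index time could be done with a few lines of PHP. A minimal sketch, assuming plain dot-separated host names; higherLevelDomains() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: every parent domain of $domain, from most to least specific.
function higherLevelDomains(string $domain): array
{
    $labels = explode('.', strtolower($domain));
    $parents = [];
    for ($i = 1; $i < count($labels); $i++) {
        $parents[] = implode('.', array_slice($labels, $i));
    }

    return $parents;
}

$domain = 'test.main.site.com';
$body = [
    'domain' => $domain,
    'higherLevelDomains' => higherLevelDomains($domain),
];

// $body can be passed as the 'body' element when indexing the document.
echo json_encode($body), PHP_EOL;
```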
Hope that helps!

Updating a MongoDB array subelement field

I can't seem to figure out how to update a single element within a subarray. I'd like to update images > 59db1c3654819952005897 > sort to be 5
"_id" : 34,
"images": [
{
"59db1c3654819952005897": {
"name": "1024x1024.png",
"size": "19421",
"sort": 2
}
},
{
"59db1c3652cda581935479": {
"name": "200x200.png",
"size": "52100",
"sort": 3
}
}
]
Here's what I've tried, but neither works:
updateOne(['_id' => 34], ['$set' => ["images.59db1c3654819952005897.sort" => 5]])
updateOne(['_id' => 34], ['$set' => ["images.$.59db1c3654819952005897.sort" => 5]])

When using the positional $ operator with dot notation to update a field of an embedded document, you need to include the array in the query; otherwise it won't work. In the above case, the revised update operation would be:
db.collection.updateOne(
{
"_id": 34,
"images.59db1c3654819952005897": { "$exists": true } // <-- include array in query
},
{
"$set": {
"images.$.59db1c3654819952005897.sort": 5
}
}
)
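Since the question uses PHP, here is a sketch of the same operation as PHP arrays. buildImageSortUpdate() is a hypothetical helper, and $collection in the commented call is assumed to be an existing MongoDB\Collection. Single-quoted strings are used so that PHP never tries to interpolate the $ characters.

```php
<?php
// Hypothetical helper: returns the [filter, update] pair for updateOne().
function buildImageSortUpdate(int $id, string $imageKey, int $sort): array
{
    $filter = [
        '_id' => $id,
        // Matching on the array is what lets "$" resolve to the right element.
        'images.' . $imageKey => ['$exists' => true],
    ];
    $update = [
        '$set' => ['images.$.' . $imageKey . '.sort' => $sort],
    ];

    return [$filter, $update];
}

[$filter, $update] = buildImageSortUpdate(34, '59db1c3654819952005897', 5);
// $collection->updateOne($filter, $update); // $collection is assumed to exist
echo json_encode($update), PHP_EOL;
```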

ElasticSearch match combination in array

I'm integrating Elasticsearch into my Laravel application using the official Elasticsearch PHP package.
My application is a small job board, and currently my job document looks like this:
{
"_index":"jobs",
"_type":"job",
"_id":"19",
"_score":1,
"_source":{
"0":"",
"name":"Programmer",
"description":"This is my first job! :)",
"text":"Programming is awesome",
"networks":[
{
"id":1,
"status":"PRODUCTION",
"start":"2015-02-26",
"end":"2015-02-26"
},
{
"id":2,
"status":"PAUSE",
"start":"2015-02-26",
"end":"2015-02-26"
}
]
}
}
As you can see, a job can be attached to multiple networks. In my search query I would like to include WHERE network.id == 1 AND network.status == PRODUCTION.
My current query looks like this. However, it returns any document that has a network with id 1 and any network with status PRODUCTION, even when these are two different networks. Is there any way I can enforce both conditions within a single network?
$query = [
'index' => $this->index,
'type' => $this->type,
'body' => [
'query' => [
'bool' => [
'must' => [
['match' => ['networks.id' => 1]],
['match' => ['networks.status' => 'PRODUCTION']],
],
'should' => [
['match' => ['name' => $query]],
['match' => ['text' => $query]],
['match' => ['description' => $query]],
],
],
],
],
];

You need to specify that the objects in the networks array should be stored as individual objects in the index. This allows you to search on individual network objects. You can do so using the nested type in Elasticsearch.
Also, if you are doing exact matches, it is better to use a filter rather than a query, since filters skip scoring and can be cached.
Create your index with a new mapping. Use the nested type for the networks array.
POST /test
{
"mappings": {
"job": {
"properties": {
"networks": {
"type": "nested",
"properties": {
"status": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
}
Add a document:
POST /test/job/1
{
"0": "",
"name": "Programmer",
"description": "This is my first job! :)",
"text": "Programming is awesome",
"networks": [
{
"id": 1,
"status": "PRODUCTION",
"start": "2015-02-26",
"end": "2015-02-26"
},
{
"id": 2,
"status": "PAUSE",
"start": "2015-02-26",
"end": "2015-02-26"
}
]
}
Because networks is now a nested type, you will need to use a nested filter:
POST /test/job/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "networks",
"filter": {
"bool": {
"must": [
{
"term": {
"networks.id": "1"
}
},
{
"term": {
"networks.status.raw": "PRODUCTION"
}
}
]
}
}
}
}
}
}
}
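For the PHP client used in the question, the request body above translates into a nested array along the following lines. This is a sketch: buildNestedNetworkSearch() is a hypothetical helper, the index and type names come from the example above, and the filtered query only exists in the older Elasticsearch versions this answer was written for. An empty JSON object such as match_all's {} has to be written as new stdClass(), because an empty PHP array would be encoded as [].

```php
<?php
// Hypothetical helper: both conditions must hold within the same nested network object.
function buildNestedNetworkSearch(int $networkId, string $status): array
{
    $nestedFilter = [
        'path' => 'networks',
        'filter' => [
            'bool' => [
                'must' => [
                    ['term' => ['networks.id' => $networkId]],
                    ['term' => ['networks.status.raw' => $status]],
                ],
            ],
        ],
    ];

    return [
        'index' => 'test',
        'type'  => 'job',
        'body'  => [
            'query' => [
                'filtered' => [
                    'query'  => ['match_all' => new stdClass()],
                    'filter' => ['nested' => $nestedFilter],
                ],
            ],
        ],
    ];
}

$params = buildNestedNetworkSearch(1, 'PRODUCTION');
// $results = $client->search($params); // $client: an Elasticsearch client, assumed to exist
echo json_encode($params['body']), PHP_EOL;
```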
