How can I fetch distinct records from Elasticsearch - php

I have been working with Elasticsearch (ES) for the last couple of weeks. There are currently millions of records spread across different search indices in ES.
I have noticed that records are duplicated across these indices, and this is causing problems.
We could find the duplicates in code and remove them. That may work, but I have more than 100 million records, so it would take a lot of time.
My requirement is this: when we fetch records from ES, we can apply different filters. Is there a filter, or some other way, to fetch only distinct records? I am currently using the REST API from PHP.
Here is the code I am currently using; the filters work perfectly.
$params = [
    'index' => 'MyIndex',
    'type' => 'MyType',
    'from' => 0,
    'size' => 10,
    'body' => [
        'query' => [
            'bool' => [
                'must' => [
                    [ 'match' => [ 'image' => true ] ],
                    [ 'simple_query_string' => [ 'query' => 'MyQuery' ] ]
                ]
            ]
        ]
    ]
];
I also looked through "Aggregations", but couldn't find anything that matches my requirement.
Quick help will be highly appreciated.
Thanks in advance.

I think what you are looking for is "collapsing".
Elasticsearch supports it in 6.x (field collapsing was introduced in 5.3):
https://www.elastic.co/guide/en/elasticsearch/reference/6.x/search-request-collapse.html
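A minimal sketch of how collapsing could be added to the query from the question. It assumes a hypothetical field "record_id" that has the same value in every duplicate of a record; collapsing needs a single-valued keyword or numeric field with doc values enabled. The index name is a placeholder, and the "type" key from the question is left out here.

```php
<?php
// Sketch only: 'record_id' is a hypothetical field that identifies duplicates.
$params = [
    'index' => 'my_index',
    'from'  => 0,
    'size'  => 10,
    'body'  => [
        'query' => [
            'bool' => [
                'must' => [
                    ['match' => ['image' => true]],
                    ['simple_query_string' => ['query' => 'MyQuery']],
                ],
            ],
        ],
        // Return only the top-ranked hit for each distinct 'record_id' value.
        'collapse' => ['field' => 'record_id'],
    ],
];

// With the official PHP client this array would be passed to $client->search($params).
echo json_encode($params['body'], JSON_PRETTY_PRINT), PHP_EOL;
```

Collapsing only changes which hits are returned. The total hit count in the response still counts all matching documents, duplicates included, and the duplicates stay in the index.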

Related

How can I fetch results from elastic search from one column with different values at the same time?

I am using the REST API from PHP to fetch data from Elasticsearch with the following code:
$params = [
    'index' => $search_index,
    'type' => $search_type,
    'from' => $_POST["from"],
    'size' => $_POST["fetch"],
    'body' => [
        'query' => [
            'bool' => [
                'must' => [
                    [ 'match' => [ 'is_validated' => false ] ],
                    [ 'query_string' => [ 'query' => $search_str, 'default_operator' => 'OR' ] ]
                ]
            ]
        ]
    ]
];
This works perfectly and gives me the results I want.
The data returned from ES has a "result_source" field with predefined values such as CNN, BBC or YouTube.
I want to filter on "result_source" so that I fetch only the results with the sources I choose: for example only "YouTube", or only "BBC and CNN", or only "CNN or YouTube".
I have already tried the "should" option, but it also returns documents with other values that I don't need. I'm not sure how to exclude those "result_source" values when fetching results from ES.
Any help on this will be appreciated.
Thanks
Solved!!
I am answering my own question because I found a solution. Maybe it can help someone else in the future.
If anyone is looking for a way to search within a specific field of Elasticsearch, here is what can be done:
[ 'query_string' => [ 'query' => $search_str.'(result_source:CNN OR result_source:BBC)', 'default_operator' => 'OR' ] ]
"result_source" is the name of the ES field on which the filter is applied, so that only results with result_source=BBC or result_source=CNN are returned.
This solved my issue.
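An alternative sketch, not taken from the answer above: the same restriction can be written as a "terms" clause in the "filter" part of the bool query, which keeps the user's search string separate from the source restriction. It assumes "result_source" is indexed as an exact-value (keyword / not analyzed) field. The function name buildSourceSearch and the index name are made up for illustration. Keep in mind that with default_operator set to OR, a clause appended to the query string is optional unless it is joined with an explicit AND.

```php
<?php
// Hypothetical helper: builds search params restricted to the given sources.
function buildSourceSearch(string $searchStr, array $sources, int $from = 0, int $size = 10): array
{
    return [
        'index' => 'my_index', // placeholder index name
        'from'  => $from,
        'size'  => $size,
        'body'  => [
            'query' => [
                'bool' => [
                    // Scored part, same as in the question.
                    'must' => [
                        ['match' => ['is_validated' => false]],
                        ['query_string' => ['query' => $searchStr, 'default_operator' => 'OR']],
                    ],
                    // Unscored part: a document must have one of the listed sources.
                    'filter' => [
                        ['terms' => ['result_source' => array_values($sources)]],
                    ],
                ],
            ],
        ],
    ];
}

$params = buildSourceSearch('election results', ['CNN', 'BBC']);
echo json_encode($params['body']), PHP_EOL;
```

Passing ['YouTube'] alone, or any other combination, changes only the list of allowed sources.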

Elasticsearch Query for PHP

I am new to Elasticsearch and am trying to work out the best way to write a query. I'm using PHP, so it would be helpful to see it in that format, but I am fine with any Elasticsearch DSL.
The query needs to match any or all words in multiple fields, for example [title, description].
But I also want to include only documents that pass at least one of several conditions (for example, the document has field1 = true OR field2 = true).
For example, I search for "Nike boots that are green".
I would like to see results that contain Nike, boots and green,
so I could just do
'query' => [
    'query_string' => [
        'fields' => [ 'title^6', 'description^3' ],
        'query' => 'Nike boots that are green'
    ],
],
And I get all the content, ordered by best score.
What I really want to add are "filters" or "should" clauses so that a document is returned only if its field 'access' == 1 OR its field 'permission' == 5. How do I do that? I now know that it needs to be a bool query.
Is it possible to have both a query_string query and a bool query in the same search?
The query_string query supports OR:
'query' => [
    'query_string' => [
        'query' => 'access:1 OR permission:5'
    ],
],
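To the question of combining both: a bool query can wrap the query_string query and the access/permission condition in one search. The following is a sketch under the assumption that 'access' and 'permission' are numeric fields; it is one possible shape, not the only one.

```php
<?php
// Sketch: full-text scoring in 'must', the OR condition as an unscored filter.
$body = [
    'query' => [
        'bool' => [
            // Scored full-text part, as in the question.
            'must' => [
                ['query_string' => [
                    'fields' => ['title^6', 'description^3'],
                    'query'  => 'Nike boots that are green',
                ]],
            ],
            // Restriction: access == 1 OR permission == 5.
            'filter' => [
                ['bool' => [
                    'should' => [
                        ['term' => ['access' => 1]],
                        ['term' => ['permission' => 5]],
                    ],
                    'minimum_should_match' => 1,
                ]],
            ],
        ],
    ],
];

echo json_encode($body, JSON_PRETTY_PRINT), PHP_EOL;
```

Because the condition sits in "filter", it decides which documents are returned without changing their scores, so the ranking still comes from the title and description match.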

Creating campaign for dynamic TextMerge segment fails

I'm trying to send a campaign to a dynamic list segment based on a custom numeric merge field (GMT_OFFSET, in this case) but the code below yields the following error from the MailChimp API:
"errors" => [
0 => [
"field" => "recipients.segment_opts.conditions.item:0"
"message" => "Data did not match any of the schemas described in anyOf."
]
]
My code, using drewm/mailchimp-api 2.4:
$campaign = $mc->post('campaigns', [
    'recipients' => [
        'list_id' => config('services.mailchimp.list_id'),
        'segment_opts' => [
            'conditions' => [
                [
                    'condition_type' => 'TextMerge',
                    'field' => 'GMT_OFFSET',
                    'op' => 'is',
                    'value' => 2,
                ],
            ],
            'match' => 'all',
        ],
    ],
    // Cut for brevity
]);
If I am to take the field description literally (see below), the TextMerge condition type only works on merge0 or EMAIL fields, which is ridiculous considering the Segment Type title says it is a "Text or Number Merge Field Segment". However, other people have reported the condition does work when applied exclusively to the EMAIL field. (API Reference)
I found this issue posted but unresolved on both DrewM's git repo (here) and SO (here) from January 2017. Hoping somebody has figured this out by now, or found a way around it.
Solved it! I passed an integer value which seemed to make sense given that my GMT_OFFSET merge field was of a Number type. MailChimp support said this probably caused the error and suggested I send a string instead. Works like a charm now.
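A small sketch of that fix: the condition is the same as in the question, but the value is cast to a string before it is sent. The helper name textMergeCondition is hypothetical and only there to make the cast explicit.

```php
<?php
// Hypothetical helper: builds a TextMerge condition and always sends the value as a string.
function textMergeCondition(string $field, $value, string $op = 'is'): array
{
    return [
        'condition_type' => 'TextMerge',
        'field'          => $field,
        'op'             => $op,
        'value'          => (string) $value, // the integer 2 becomes the string '2'
    ];
}

$segmentOpts = [
    'match'      => 'all',
    'conditions' => [textMergeCondition('GMT_OFFSET', 2)],
];

echo json_encode($segmentOpts), PHP_EOL;
```

The resulting array would go under 'recipients' => ['segment_opts' => ...] in the campaign request from the question.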

elasticsearch sort data in fuzzy mode

I want to sort results by similarity when searching Elasticsearch in fuzzy mode.
We have two records:
1. panadol
2. penadol
When I search for panadol or penadol, the first result is (penadol). What I want is that when I type (panadol), the first result is (panadol) and the second result is (penadol), and so on.
$params = [
    'index' => 'my_index',
    'type' => 'my_type',
    'body' => [
        "track_scores" => true,
        'sort' => [
            'name' => ['reverse' => true],
            '_score' => ['order' => 'desc'],
        ],
        'query' => [
            'fuzzy' => [
                'name' => [
                    "value" => 'panadol',
                    "fuzziness" => 2,
                ]
            ]
        ],
    ]
];
Fuzziness is not meant for scoring. You can find more info about it in the docs.
If you want to sort the results by relevance to the original phrase you searched for, you can use either the phrase suggester or the completion suggester, depending on your needs (and your data).
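A different workaround, not the suggester approach named above and offered only as a sketch: put an exact term clause with a boost next to the fuzzy clause in a bool query, and sort by _score alone. In the question's sort, "name" comes before "_score", so the score only breaks ties between equal names. The function name and the boost value 10 are arbitrary choices for illustration.

```php
<?php
// Hypothetical helper: exact matches get a boost, fuzzy matches still qualify.
function buildExactFirstFuzzySearch(string $field, string $term): array
{
    return [
        'query' => [
            'bool' => [
                'should' => [
                    // Exact term: the boost lets an exact hit outrank near misses.
                    ['term' => [$field => ['value' => $term, 'boost' => 10]]],
                    // Fuzzy term: 'penadol' can still match a search for 'panadol'.
                    ['fuzzy' => [$field => ['value' => $term, 'fuzziness' => 2]]],
                ],
                'minimum_should_match' => 1,
            ],
        ],
        // Order by relevance only.
        'sort' => [['_score' => ['order' => 'desc']]],
    ];
}

$body = buildExactFirstFuzzySearch('name', 'panadol');
echo json_encode($body, JSON_PRETTY_PRINT), PHP_EOL;
```

The array would be used as the 'body' of the search params from the question.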

Elasticsearch exact match field

I have a field called url that is set to not_analyzed when I index it:
'url' => [
    'type' => 'string',
    'index' => 'not_analyzed'
]
Here is my method to determine if a URL already exists in the index:
public function urlExists($index, $type, $url) {
    $params = [
        'index' => $index,
        'type' => $type,
        'body' => [
            'query' => [
                'match' => [
                    'url' => $url
                ]
            ]
        ]
    ];
    $results = $this->client->count($params);
    return ($results['count'] > 0);
}
This seems to work fine; however, I can't be 100% sure this is the correct way to find an exact match, because according to the docs another way to do the search is with params like:
$params = [
    'index' => $index,
    'type' => $type,
    'body' => [
        'query' => [
            'filtered' => [
                'filter' => [
                    'term' => [
                        'url' => $url
                    ]
                ]
            ]
        ]
    ]
];
My question is: would both sets of params work the same way for a not_analyzed field?
The second query is the right approach. Term-level queries/filters should be used for exact matches. The biggest advantage is caching: Elasticsearch caches filter results as bitsets, so subsequent calls get a quicker response.
From the docs:
Exclude as many document as you can with a filter, then query just the
documents that remain.
Also, if you look at your output, you will find that the _score of every document is 1, because scoring is not applied to filters; the same goes for highlighting. With a match query you will see different _score values. Again, from the docs:
Keep in mind that once you wrap a query as a filter, it loses query
features like highlighting and scoring because these are not features
supported by filters.
Your first query uses match, which is mainly meant for analyzed fields. For example, when you want both Google and google to match all documents containing google (case-insensitive), a match query is used.
Hope this helps!!
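A note for newer versions, with a hedged sketch: the "filtered" query was deprecated in Elasticsearch 2.0 and removed in 5.0, and a "string" field with "not_analyzed" became the "keyword" type. The same exact-match check can be written with a term clause inside bool/filter. The function name, index name and URL below are illustrative, and "url" is assumed to be mapped as keyword (or not_analyzed).

```php
<?php
// Hypothetical helper: count params for an exact match on the 'url' field.
function buildUrlExistsParams(string $index, string $url): array
{
    return [
        'index' => $index,
        'body'  => [
            'query' => [
                'bool' => [
                    // Filter context: exact match, no scoring.
                    'filter' => [
                        ['term' => ['url' => $url]],
                    ],
                ],
            ],
        ],
    ];
}

$params = buildUrlExistsParams('pages', 'https://example.com/a');
echo json_encode($params['body']), PHP_EOL;
```

As in the question, the params would be passed to $this->client->count($params), and a count greater than 0 means the URL already exists.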
