ElasticSearch match combination in array

ElasticSearch match combination in array - php

I'm implementing ElasticSearch into my Laravel application using the php package from ElasticSearch.
My application is a small jobboard and currently my job document is looking like this:
{
"_index":"jobs",
"_type":"job",
"_id":"19",
"_score":1,
"_source":{
"0":"",
"name":"Programmer",
"description":"This is my first job! :)",
"text":"Programming is awesome",
"networks":[
{
"id":1,
"status":"PRODUCTION",
"start":"2015-02-26",
"end":"2015-02-26"
},
{
"id":2,
"status":"PAUSE",
"start":"2015-02-26",
"end":"2015-02-26"
}
]
}
}
As you can see a job can be attached to multiple networks. In my search query I would like to include WHERE network.id == 1 AND network.status == PRODUCTION.
My current query looks like this, however this returns documents where it has a network of id 1, if it has any network of status PRODUCTION. Is there anyway i can enforce both to be true within one network?
$query = [
'index' => $this->index,
'type' => $this->type,
'body' => [
'query' => [
'bool' => [
'must' => [
['networks.id' => 1]],
['networks.status' => 'PRODUCTION']]
],
'should' => [
['match' => ['name' => $query]],
['match' => ['text' => $query]],
['match' => ['description' => $query]],
],
],
],
],
];

You need to specify that the objects in the networks array should be stored as individual objects in the index, this will allow you to perform a search on individual network objects. You can do so using the nested type in Elasticsearch.
Also, if you doing exact matches it is better to use a filter rather than a query as the filters are cached and always give you better performance than a query.
Create your index with a new mapping. Use the nested type for the networks array.
POST /test
{
"mappings": {
"job": {
"properties": {
"networks": {
"type": "nested",
"properties": {
"status": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
}
Add a document:
POST /test/job/1
{
"0": "",
"name": "Programmer",
"description": "This is my first job! :)",
"text": "Programming is awesome",
"networks": [
{
"id": 1,
"status": "PRODUCTION",
"start": "2015-02-26",
"end": "2015-02-26"
},
{
"id": 2,
"status": "PAUSE",
"start": "2015-02-26",
"end": "2015-02-26"
}
]
}
As you have a nested type you will need to use a nested filter.
POST /test/job/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "networks",
"filter": {
"bool": {
"must": [
{
"term": {
"networks.id": "1"
}
},
{
"term": {
"networks.status.raw": "PRODUCTION"
}
}
]
}
}
}
}
}
}
}

Related

php echo from a start point to a end point

I have built a php site that takes code from a page using GET, and then echos it to the current page, but it exports text like this,
{ "selection13": [ { "name": "L" }, { "name": "100%\nA" } ], "selection7": [ { "name": "S" }, { "name": "100%\nA" } ], "selection8": [ { "name": "SS" }, { "name": "100%\nA" } ], "selection9": [ { "name": "SP" }, { "name": "100%\nA" } ], "selection10": [ { "name": "P" }, { "name": "100%\nA" } ], "selection11": [ { "name": "A" }, { "name": "100.00%\nA+" } ], "selection12": [ { "name": "H", "selection5": [ { "name": "T }, { "name": "100.00%\nA+" } ] }, { "name": "100.00%\nA+", "selection5": [ { "name": "T" }, { "name": "100.00%\nA+" } ] } ] }
I need to organizes it in to categorizes like, {L},{100%), here is the code I am currently using,
<?php
$params = http_build_query(array(
"api_key" => "()",
"format" => "json"
));
$result = file_get_contents(
'(url)'.$params,
false,
stream_context_create(array(
'http' => array(
'method' => 'GET'
)
))
);
echo gzdecode($result);
?>

Your script is outputting JSON but there appears to be a few issues with the data itself that need to be addressed first.
The next to the last "name": "T is missing a closing quote.
Section 12 has an inconsistent organization compared to the other sections. Can a section be a child of another section or is this an error?
"selection11": [
{
"name": "A"
},
{
"name": "100.00%\nA+"
}
],
"selection12": [
{
"name": "H",
"selection5": [
{
"name": "T"
},
{
"name": "100.00%\nA+"
}
]
},
{
"name": "100.00%\nA+",
"selection5": [
{
"name": "T"
},
{
"name": "100.00%\nA+"
}
]
}
]
These 2 issues need to be fixed in the JSON data you are retrieving first.
From there, the best way to reorganize this is to convert it to an array using json_decode.
$json = gzdecode($result);
$resultArray = json_decode($json);
I can't tell from your question exactly how you want to reorganize the data but use one or more of the builtin array functions to get the data in the structure you want. If you need to output it in JSON, use json_encode on the manipulated array to get the final data format. More specific help can be provided if you can be more clear on want output structure you are looking for and the format (does it need to be JSON or something else).

Elasticsearch - Research that returns too many bad results

I have an elasticsearch that works but it is really too large, it gives me too many results on terms that have nothing to do with it. I'm looking for a way to refine these results.
On a sample of fake text when I search for the term music, the terms that come out in highlights are :
must, much, alice, inside, patriotic, noticed
I think that the ngram doesn't help me but I think I really need it to have a better search.
Here is my configuration :
{
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "mySnowball", "myNgram"]
},
"default_search": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "mySnowball", "myNgram"]
}
},
"filter": {
"mySnowball": {
"type": "snowball",
"language": "English"
},
"myNgram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 6
}
}
}
}
Here is my request :
{
"query": {
"bool": {
"should": [{
"match": {
"content": "music"
}
}, {
"match": {
"url": "music"
}
}, {
"match": {
"h1": "music"
}
}, {
"match": {
"h2": "music"
}
}
],
"minimum_should_match": 1
}
},
"min_score": 8
}
My document is quite simple :
content => text,
url => text,
h1 => text,
h2 => text,
And the mapping too:
$configMapping = [
'content' => ['type' => 'text', 'boost' => 6],
'url' => ['type' => 'text', 'boost' => 6],
'h1' => ['type' => 'text', 'boost' => 9],
'h2' => ['type' => 'text', 'boost' => 7]
]
I welcome any modification that will allow me to obtain only consistent results.

As you said yourself, analyzing with 'ngram' is the reason you get all these unrelated results.
In all the results you get, you can see the token (2 characters token, as the minimum of your n-gram) that matched the query term 'music':
must, much, alice, inside, patriotic, noticed
Start by removing this filter from your analyzer and keep on tuning the results from there.

Elasticseach or query for comma separated values

I am saving id's in the database as comma separated and indexing the same to ElasticSearch. Now I need to retrieve if the user_id matches with the value.
For example it it saving like this in the indexing for the column user_ids (database type is varchar(500) in elasticsearch it is text)
8938,8936,8937
$userId = 8936; // For example expecting to return that row
$whereCondition = [];
$whereCondition[] = [
"query_string" => [
"query"=> $userId,
"default_field" => "user_ids",
"default_operator" => "OR"
]
];
$searchParams = [
'query' => [
'bool' => [
'must' => [
$whereCondition
],
'must_not' => [
['exists' => ['field' => 'deleted_at']]
]
]
],
"size" => 10000
];
User::search($searchParams);
Json Query
{
"query": {
"bool": {
"must": [
[{
"query_string": {
"query": 8936,
"default_field": "user_ids",
"default_operator": "OR"
}
}]
],
"must_not": [
[{
"exists": {
"field": "deleted_at"
}
}]
]
}
},
"size": 10000
}
Mapping details
{
"user_details_index": {
"aliases": {},
"mappings": {
"test_type": {
"properties": {
"created_at": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"deleted_at": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"updated_at": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"user_ids": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
},
"settings": {
"index": {
"creation_date": "1546404165500",
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "krpph26NTv2ykt6xE05klQ",
"version": {
"created": "6020299"
},
"provided_name": "user_details_index"
}
}
}
}
I am trying with above logic, but not unable to retrieve. Can someone help on this.

Since the field user_ids is of type text any no analyzer is specified for it by default it will use standard analyzer which won't break 8938,8936,8937 into terms 8938, 8936 and 8937 and hence the id can't match.
To solve this I would suggest you to store array of ids to user_ids field instead of csv. So while indexing you json input should look as below:
{
...
"user_ids": [
8938,
8936,
8937
]
...
}
Since user ids are integer values following changes should be done in mapping:
{
"user_ids": {
"type": "integer"
}
}
The query will be now as follow:
{
"query": {
"bool": {
"filter": [
[
{
"terms": {
"userIds": [
8936
]
}
}
]
],
"must_not": [
[
{
"exists": {
"field": "deleted_at"
}
}
]
]
}
},
"size": 10000
}

Elasticsearch find subdomains

I try find subdomains by main domain in elasticsearch.
I added few domains to elastic:
$domains = [
'site.com',
'ns1.site.com',
'ns2.site.com',
'test.main.site.com',
'sitesite.com',
'test-site.com',
];
foreach ($domains as $domain) {
$params = [
'index' => 'my_index',
'type' => 'my_type',
'body' => ['domain' => $domain],
];
$client->index($params);
}
Then I try to search:
$params = [
'index' => 'my_index',
'type' => 'my_type',
'body' => [
'query' => [
'wildcard' => [
'domain' => [
'value' => '.site.com',
],
],
],
],
];
$response = $client->search($params);
But found nothing. :(
My mapping is:
https://pastebin.com/raw/k9MzjJUM
Any ideas to fix it?
Thanks

You're almost there, just a couple of things missing.
How to make an "ends with" query?
It's enough to add * in your query (that's why this query is called wildcard):
POST my_index/my_type/_search
{
"query": {
"wildcard" : { "domain" : "*.site.com" }
}
}
This will give you the following result:
{
...
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "RoE8VGMBRuo1XmkIXhp0",
"_score": 1,
"_source": {
"domain": "test.main.site.com"
}
}
]
}
}
Seems to work, but we only get one of the results (not all of them).
Why it returns not all matching documents?
Returning to your mapping, the field domain has type text:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"domain": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
This means that content of that field will be tokenized and lowercased (with standard analyzer). You can see which tokens will be actually searchable using _analyze API, like this:
POST _analyze
{
"text": "test.main.site.com"
}
{
"tokens": [
{
"token": "test.main.site.com",
"start_offset": 0,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 0
}
]
}
That's why wildcard query could match test.main.site.com.
What if we take n1.site.com?
POST _analyze
{
"text": "n1.site.com"
}
{
"tokens": [
{
"token": "n1",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "site.com",
"start_offset": 3,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
As you can see, there is no token that ends with .site.com (note the . before the site.com).
Fortunately, your mapping is already capable to return all results.
How to return all the results for "ends with" query?
You could use keyword field, which uses the exact value for querying:
POST my_index/my_type/_search
{
"query": {
"wildcard" : { "domain.keyword" : "*.site.com" }
}
}
This will give you the following result:
{
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "RoE8VGMBRuo1XmkIXhp0",
"_score": 1,
"_source": {
"domain": "test.main.site.com"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "Q4E8VGMBRuo1XmkIFRpy",
"_score": 1,
"_source": {
"domain": "ns1.site.com"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "RYE8VGMBRuo1XmkIORqG",
"_score": 1,
"_source": {
"domain": "ns2.site.com"
}
}
]
}
}
Is this the best way to do "ends with"-like queries?
Actually, no. wildcard queries can be very slow:
Note that this query can be slow, as it needs to iterate over many
terms. In order to prevent extremely slow wildcard queries, a wildcard
term should not start with one of the wildcards * or ?.
To achieve best performance, in your case, I would suggest creating another field, higherLevelDomains, and manually extracting the higher level domains from the original. The document might look like:
POST my_index/my_type
{
"domain": "test.main.site.com",
"higherLevelDomains": [
"main.site.com",
"site.com",
"com"
]
}
This will allow you to use term query:
POST my_index/my_type/_search
{
"query": {
"term" : { "higherLevelDomains.keyword" : "site.com" }
}
}
This is probably the most efficient query you can get with Elasticsearch for such task.
Hope that helps!

Elasticsearch wont apply not_analyzed into my mapping

When I try to apply "not_analyzed" into my ES mapping it doesnt work.
I am using this package for ES in Laravel - Elasticquent
My mapping looks like:
'ad_title' => [
'type' => 'string',
'analyzer' => 'standard'
],
'ad_type' => [
'type' => 'integer',
'index' => 'not_analyzed'
],
'ad_type' => [
'type' => 'integer',
'index' => 'not_analyzed'
],
'ad_state' => [
'type' => 'integer',
'index' => 'not_analyzed'
],
Afterwards I do an API get call to view the mapping and it will output:
"testindex": {
"mappings": {
"ad_ad": {
"properties": {
"ad_city": {
"type": "integer"
},
"ad_id": {
"type": "long"
},
"ad_state": {
"type": "integer"
},
"ad_title": {
"type": "string",
"analyzer": "standard"
},
"ad_type": {
"type": "integer"
},
Note that not_analyzed is missing.
I cant see any errors/warnings in my logs either.

What I gathered from my own experience is that you must do the mapping before you do any indexing. Delete the index you've created, assign your not_analyzed mapper and then index your fields again, and you will have the not_analyzed field appear. Please let me know if this works for you. Thank you.

There is an easy test to see if it works:
PUT /stack
{
"mappings": {
"try": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"age": {
"type": "integer",
"index": "not_analyzed"
}
}
}
}
}
GET /stack/_mapping
and the response is:
{
"stack": {
"mappings": {
"try": {
"properties": {
"age": {
"type": "integer"
},
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

ElasticSearch match combination in array - php

Related

php echo from a start point to a end point

Elasticsearch - Research that returns too many bad results

Elasticseach or query for comma separated values

Elasticsearch find subdomains

Elasticsearch wont apply not_analyzed into my mapping

Categories

Resources