Elasticsearch find subdomains - php

I try find subdomains by main domain in elasticsearch.
I added few domains to elastic:
$domains = [
'site.com',
'ns1.site.com',
'ns2.site.com',
'test.main.site.com',
'sitesite.com',
'test-site.com',
];
foreach ($domains as $domain) {
$params = [
'index' => 'my_index',
'type' => 'my_type',
'body' => ['domain' => $domain],
];
$client->index($params);
}
Then I try to search:
$params = [
'index' => 'my_index',
'type' => 'my_type',
'body' => [
'query' => [
'wildcard' => [
'domain' => [
'value' => '.site.com',
],
],
],
],
];
$response = $client->search($params);
But found nothing. :(
My mapping is:
https://pastebin.com/raw/k9MzjJUM
Any ideas to fix it?
Thanks

You're almost there, just a couple of things missing.
How to make an "ends with" query?
It's enough to add * in your query (that's why this query is called wildcard):
POST my_index/my_type/_search
{
"query": {
"wildcard" : { "domain" : "*.site.com" }
}
}
This will give you the following result:
{
...
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "RoE8VGMBRuo1XmkIXhp0",
"_score": 1,
"_source": {
"domain": "test.main.site.com"
}
}
]
}
}
Seems to work, but we only get one of the results (not all of them).
Why it returns not all matching documents?
Returning to your mapping, the field domain has type text:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"domain": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
This means that content of that field will be tokenized and lowercased (with standard analyzer). You can see which tokens will be actually searchable using _analyze API, like this:
POST _analyze
{
"text": "test.main.site.com"
}
{
"tokens": [
{
"token": "test.main.site.com",
"start_offset": 0,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 0
}
]
}
That's why wildcard query could match test.main.site.com.
What if we take n1.site.com?
POST _analyze
{
"text": "n1.site.com"
}
{
"tokens": [
{
"token": "n1",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "site.com",
"start_offset": 3,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
As you can see, there is no token that ends with .site.com (note the . before the site.com).
Fortunately, your mapping is already capable to return all results.
How to return all the results for "ends with" query?
You could use keyword field, which uses the exact value for querying:
POST my_index/my_type/_search
{
"query": {
"wildcard" : { "domain.keyword" : "*.site.com" }
}
}
This will give you the following result:
{
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "RoE8VGMBRuo1XmkIXhp0",
"_score": 1,
"_source": {
"domain": "test.main.site.com"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "Q4E8VGMBRuo1XmkIFRpy",
"_score": 1,
"_source": {
"domain": "ns1.site.com"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "RYE8VGMBRuo1XmkIORqG",
"_score": 1,
"_source": {
"domain": "ns2.site.com"
}
}
]
}
}
Is this the best way to do "ends with"-like queries?
Actually, no. wildcard queries can be very slow:
Note that this query can be slow, as it needs to iterate over many
terms. In order to prevent extremely slow wildcard queries, a wildcard
term should not start with one of the wildcards * or ?.
To achieve best performance, in your case, I would suggest creating another field, higherLevelDomains, and manually extracting the higher level domains from the original. The document might look like:
POST my_index/my_type
{
"domain": "test.main.site.com",
"higherLevelDomains": [
"main.site.com",
"site.com",
"com"
]
}
This will allow you to use term query:
POST my_index/my_type/_search
{
"query": {
"term" : { "higherLevelDomains.keyword" : "site.com" }
}
}
This is probably the most efficient query you can get with Elasticsearch for such task.
Hope that helps!

Related

Elasticsearch - Research that returns too many bad results

I have an elasticsearch that works but it is really too large, it gives me too many results on terms that have nothing to do with it. I'm looking for a way to refine these results.
On a sample of fake text when I search for the term music, the terms that come out in highlights are :
must, much, alice, inside, patriotic, noticed
I think that the ngram doesn't help me but I think I really need it to have a better search.
Here is my configuration :
{
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "mySnowball", "myNgram"]
},
"default_search": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "mySnowball", "myNgram"]
}
},
"filter": {
"mySnowball": {
"type": "snowball",
"language": "English"
},
"myNgram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 6
}
}
}
}
Here is my request :
{
"query": {
"bool": {
"should": [{
"match": {
"content": "music"
}
}, {
"match": {
"url": "music"
}
}, {
"match": {
"h1": "music"
}
}, {
"match": {
"h2": "music"
}
}
],
"minimum_should_match": 1
}
},
"min_score": 8
}
My document is quite simple :
content => text,
url => text,
h1 => text,
h2 => text,
And the mapping too:
$configMapping = [
'content' => ['type' => 'text', 'boost' => 6],
'url' => ['type' => 'text', 'boost' => 6],
'h1' => ['type' => 'text', 'boost' => 9],
'h2' => ['type' => 'text', 'boost' => 7]
]
I welcome any modification that will allow me to obtain only consistent results.
As you said yourself, analyzing with 'ngram' is the reason you get all these unrelated results.
In all the results you get, you can see the token (2 characters token, as the minimum of your n-gram) that matched the query term 'music':
must, much, alice, inside, patriotic, noticed
Start by removing this filter from your analyzer and keep on tuning the results from there.

Elasticsearch get data using search in PHP

I am trying to get data from elastic search in PHP.
Following is my elastic type (curated) structure :
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "qwerty",
"_type": "curated",
"_id": "2",
"_score": 1,
"_source": {
"shows_on_lfz": [
{
"sh_id": 14,
"sh_parent_id": 102,
"sh_name": "Veg Versions Of Global Dishes"
}
]
}
},
{
"_index": "qwerty",
"_type": "curated",
"_id": "1",
"_score": 1,
"_source": {
"top_stories": [
{
"ts_id": 515,
"ts_parent_id": 485,
}
]
}
}
]
}
}
How can I get all data of top_stories using search query in PHP?
I am getting data using match_all & get() query but I need data using search().
I tried -
$body['query']['bool']['should'] = ['match' => ['top_stories'=> '']];
But getting a null response.
This is the query that will bring all entries that have the field top_stories
GET qwerty/_search
{
"query": {
"exists": {
"field": "top_stories"
}
}
}
You can convert this to code and you can test it through kibana

Elasticsearch check if fieldA has value x then search fieldB for term xyz

Suppose I have stored bellow data and want to search for term xy in old_value and new_value fields of those documents that their field_name is curriculum_name_en or curriculum_name_pr:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 98,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "audit_field",
"_id": "57526c197e83c",
"_score": 1,
"_source": {
"session_id": 119,
"trans_seq_no": 1,
"table_seq_no": 1,
"field_id": 2,
"field_name": "curriculum_id",
"new_value": 118,
"old_value": null
}
},
{
"_index": "my_index",
"_type": "audit_field",
"_id": "57526c197f2c3",
"_score": 1,
"_source": {
"session_id": 119,
"trans_seq_no": 1,
"table_seq_no": 1,
"field_id": 3,
"field_name": "curriculum_name_en",
"new_value": "Test Index creation",
"old_value": null
}
},
{
"_index": "my_index",
"_type": "audit_field",
"_id": "57526c198045c",
"_score": 1,
"_source": {
"session_id": 119,
"trans_seq_no": 1,
"table_seq_no": 1,
"field_id": 4,
"field_name": "curriculum_name_pr",
"new_value": null,
"old_value": null
}
},
{
"_index": "my_index",
"_type": "audit_field",
"_id": "57526c1981512",
"_score": 1,
"_source": {
"session_id": 119,
"trans_seq_no": 1,
"table_seq_no": 1,
"field_id": 5,
"field_name": "curriculum_name_pa",
"new_value": null,
"old_value": null
}
}
]
}
}
and many more fields may be there, now user may select one or more of those fields and define a search term across those fields that he/she selected, the challenge is here, how we can say elastic that consider field_name to match those fields that user selected, then search in old_value, and new_value.
for example if user select curriculum_name_en and curriculum_name_pr and then want to search for xy inside old_value and new_value fields of those documents that their field_name is above fields.
how we can do that?
The idea with this requirement is that you need to make something like: the query needs to match new_value and/or old_value only if field_name matches a certain value as well. There is no programmatic-like way of saying if this then that.
What I'm suggesting is something like this:
{
"query": {
"bool": {
"must": [
{
"terms": {
"field_name": [
"curriculum_name_en",
"curriculum_name_pr"
]
}
},
{
"multi_match": {
"query": "Test Index",
"fields": ["new_value","old_value"]
}
}
]
}
}
}
So, your if this then that condition is a must statement from a bool query where your if and then branches live inside the must.
This may solve your problem
{
"query": {
"filtered": {
"filter": {
"and": [
{
"query" : {
"terms" : {
"field_name" : [
"curriculum_name_en",
"curriculum_name_pr"
],
"minimum_match" : 1
}
}
},
{
"query" : {
"terms" : {
"new_value" : [
"test", "index"
],
"minimum_match" : 1
}
}
}
]
}
}
}
}
}

ElasticSearch match combination in array

I'm implementing ElasticSearch into my Laravel application using the php package from ElasticSearch.
My application is a small jobboard and currently my job document is looking like this:
{
"_index":"jobs",
"_type":"job",
"_id":"19",
"_score":1,
"_source":{
"0":"",
"name":"Programmer",
"description":"This is my first job! :)",
"text":"Programming is awesome",
"networks":[
{
"id":1,
"status":"PRODUCTION",
"start":"2015-02-26",
"end":"2015-02-26"
},
{
"id":2,
"status":"PAUSE",
"start":"2015-02-26",
"end":"2015-02-26"
}
]
}
}
As you can see a job can be attached to multiple networks. In my search query I would like to include WHERE network.id == 1 AND network.status == PRODUCTION.
My current query looks like this, however this returns documents where it has a network of id 1, if it has any network of status PRODUCTION. Is there anyway i can enforce both to be true within one network?
$query = [
'index' => $this->index,
'type' => $this->type,
'body' => [
'query' => [
'bool' => [
'must' => [
['networks.id' => 1]],
['networks.status' => 'PRODUCTION']]
],
'should' => [
['match' => ['name' => $query]],
['match' => ['text' => $query]],
['match' => ['description' => $query]],
],
],
],
],
];
You need to specify that the objects in the networks array should be stored as individual objects in the index, this will allow you to perform a search on individual network objects. You can do so using the nested type in Elasticsearch.
Also, if you doing exact matches it is better to use a filter rather than a query as the filters are cached and always give you better performance than a query.
Create your index with a new mapping. Use the nested type for the networks array.
POST /test
{
"mappings": {
"job": {
"properties": {
"networks": {
"type": "nested",
"properties": {
"status": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
}
Add a document:
POST /test/job/1
{
"0": "",
"name": "Programmer",
"description": "This is my first job! :)",
"text": "Programming is awesome",
"networks": [
{
"id": 1,
"status": "PRODUCTION",
"start": "2015-02-26",
"end": "2015-02-26"
},
{
"id": 2,
"status": "PAUSE",
"start": "2015-02-26",
"end": "2015-02-26"
}
]
}
As you have a nested type you will need to use a nested filter.
POST /test/job/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "networks",
"filter": {
"bool": {
"must": [
{
"term": {
"networks.id": "1"
}
},
{
"term": {
"networks.status.raw": "PRODUCTION"
}
}
]
}
}
}
}
}
}
}

Elasticsearch Snowball Analyzer wants exact word

I Have been using Elastic Search for a project, but I find the result of Snowball Analyzer a bit strange.
Below is my example of Mapping used.
$myTypeMapping = array(
'_source' => array(
'enabled' => true
),
'properties' => array(
'id' => array(
'type' => 'integer',
'index' => 'not_analyzed'
),
'name' => array(
'type' => 'string',
'analyzer' => 'snowball',
'boost' => 2.0
),
'food_types' => array(
'type' => 'string',
'analyzer' => 'keyword'
),
'location' => array(
'type' => 'geo_point',
"geohash_precision"=> 4
),
'city' => array(
'type' => 'string',
'analyzer' => 'keyword'
)
)
);
$indexParams['body']['mappings']['online_pizza'] = $myTypeMapping;
// Create the index
$elastic_client->indices()->create($indexParams);
On quering the http://localhost:9200/online_pizza/online_pizza/_mapping I get the following results,
{
"online_pizza": {
"properties": {
"city": {
"type": "string",
"analyzer": "keyword"
},
"food_types": {
"type": "string",
"analyzer": "keyword"
},
"id": {
"type": "integer"
},
"location": {
"type": "geo_point",
"geohash_precision": 4
},
"name": {
"type": "string",
"boost": 2,
"analyzer": "snowball"
}
}
}
}
My Question is, I have data, which has Name field as "Milano". On querying for "Milano" I get the desired result, but if I query for "Milan" or "Mil" I get no result found.
{
"query": {
"query_string": {
"default_field": "name",
"query": "Milan"
}
}
}
I've also tried to snowball analyzer during querying, no help.
{
"query": {
"query_string": {
"default_field": "name",
"query": "Milan",
"analyzer": "snowball"
}
}
}
Second Question is Keyword Search is case sensitive, eg, Pizza != pizza, how do i get away with this ?
Thanks,
The snowball stemmer doesn't want exact words. If you try it with jumping, it outputs jump as expected.
However, depending on the case, you word may be understemmed as it doesn't match any stemmer rule.
If you use the analyze API endpoint (more info here), you will see that analyzing Milano with snowball analyzer gives you the token milano :
GET _analyze?analyzer=snowball&text=Milano
Output :
{
"tokens": [
{
"token": "milano",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Then, using same snowball analyzer on Mil like this :
GET _analyze?analyzer=snowball&text=Mil
gives you this token :
{
"tokens": [
{
"token": "mil",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
}
]
}
That's why searching for 'milan' or 'mil' won't match 'Milano' documents : it doesn't match the milano term stored in index.
For your second question, you can prepare a custom analyzer combining keyword tokenizer and a lowercase tokenfilter in order to have your keyword search case-insensitive (if you use the same analyzer at search time) :
POST index_name
{
"analysis": {
"analyzer": {
"case_insensitive_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
}
Test :
GET analyse/_analyze?analyzer=case_insensitive_keyword&text=Choo Choo
Output :
{
"tokens": [
{
"token": "choo choo",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 1
}
]
}
I hope I'm clear enough in my explainations :)

Categories