Elasticsearch - use EdgeNGram analyzer for case-insensitive search - PHP

I want to make a case-insensitive search on fields with an EdgeNGram analyzer. I am using ES in PHP via Elastica.
I have a table of users:
{
    "user": {
        "analyzer": "analyzer_edgeNGram",
        "properties": {
            "admin": {
                "type": "boolean"
            },
            "firstName": {
                "type": "string",
                "analyzer": "analyzer_edgeNGram"
            },
            "lastName": {
                "type": "string",
                "analyzer": "analyzer_edgeNGram"
            },
            "username": {
                "type": "string",
                "analyzer": "analyzer_edgeNGram"
            }
        }
    }
}
My analyzers look like this (you can see there is a lowercase filter in the edgeNGram analyzer):
"index.analysis.filter.asciifolding.type": "asciifolding",
"index.number_of_replicas": "1",
"index.analysis.filter.standard.type": "standard",
"index.analysis.tokenizer.edgeNGram.token_chars.1": "digit",
"index.analysis.tokenizer.edgeNGram.max_gram": "10",
"index.analysis.analyzer.analyzer_edgeNGram.type": "custom",
"index.analysis.tokenizer.edgeNGram.token_chars.0": "letter",
"index.analysis.filter.lowercase.type": "lowercase",
"index.analysis.tokenizer.edgeNGram.side": "front",
"index.analysis.tokenizer.edgeNGram.type": "edgeNGram",
"index.analysis.tokenizer.edgeNGram.min_gram": "1",
"index.analysis.tokenizer.standard.type": "standard",
"index.analysis.analyzer.analyzer_edgeNGram.filters": "standard,lowercase,asciifolding",
"index.analysis.analyzer.analyzer_edgeNGram.tokenizer": "edgeNGram",
"index.number_of_shards": "1",
"index.version.created": "900299"
There is, for example, a user with firstName Miroslav. If I run a query like this
{"query": {"match": {"firstName": "miro"}}}
I get 0 hits. But if I change miro to Miro in the query, it finds the user.
I've checked how the tokens are generated, and they are case sensitive: M, Mi, Mir, ...
Any advice on how to achieve case-insensitive searching?
Thank you

The default search_analyzer is standard, which behaves like the following custom analyzer:
"analyzer": {
"rebuilt_standard": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
}
}
So by default your queries should already be case-insensitive, but you can always try setting search_analyzer to something else explicitly. From the docs:
Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete.
By default, queries will use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer setting:
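For example, a minimal sketch in PHP (plain curl; the host, the index name users, and the type name user are assumptions) that keeps analyzer_edgeNGram at index time but searches with standard, so the query string is only lowercased instead of being edge-n-grammed:

<?php
// A sketch: keep the edge-n-gram analyzer for indexing, but analyze
// queries with the standard analyzer (which lowercases the input).
$mapping = [
    'user' => [
        'properties' => [
            'firstName' => [
                'type'            => 'string',
                'index_analyzer'  => 'analyzer_edgeNGram', // used at index time
                'search_analyzer' => 'standard',           // used at query time
            ],
        ],
    ],
];

$ch = curl_init('http://localhost:9200/users/user/_mapping');
curl_setopt_array($ch, [
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_POSTFIELDS     => json_encode($mapping),
    CURLOPT_RETURNTRANSFER => true,
]);
echo curl_exec($ch);

With this mapping the n-grams are still produced at index time, but a search for miro is analyzed into the single lowercased token miro, which matches the indexed gram, provided the lowercase filter is actually applied on the index side.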

Related

Rest API Field Requirements

I'm curious whether there is a way to send field requirements (type, length, required) together with the API response, which I can then use for form validation.
So what I expect is the following:
Page loads
Gets required fields and their requirements from API.
Builds form based on requirements
Just add more properties to the response body to specify the requirements. For example:
{
    "fields": [{
        "field": "username",
        "type": "String",
        "minLength": 3,
        "maxLength": 20,
        "required": true
    }, {
        "field": "password",
        "type": "String",
        "minLength": 6,
        "maxLength": 15,
        "required": true
    }]
}
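One way to produce that body in PHP (a sketch; the definition array is illustrative) is to keep the rules in a single array that can also drive server-side validation, and let json_encode emit it:

<?php
// A sketch: one definition array is the single source of truth for
// both the API response and any server-side validation.
$fieldDefinitions = [
    ['field' => 'username', 'type' => 'String', 'minLength' => 3, 'maxLength' => 20, 'required' => true],
    ['field' => 'password', 'type' => 'String', 'minLength' => 6, 'maxLength' => 15, 'required' => true],
];

header('Content-Type: application/json');
echo json_encode(['fields' => $fieldDefinitions], JSON_PRETTY_PRINT);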

Instead of a join as in SQL, what should I use in MongoDB?

I have two collections in a MongoDB database, and I want to join the two collections in PHP.
I have searched but unfortunately have not found a convincing answer.
The data looks like this:
users
{
    "_id": "4ca30369fd0e910ecc000006",
    "login": "user11",
    "pass": "example_pass",
    "date": "2017-12-15"
}
news
{
    "_id": "4ca305c2fd0e910ecc000003",
    "name": "news 333",
    "content": "news content",
    "user_id": "4ca30373fd0e910ecc000007",
    "date": "2017-12-15"
}
This was already answered in this thread.
Note: I'm a MEAN developer. In MEAN we use the .populate() method (Mongoose) to achieve joins up to a level.
As for PHP, you can use a different approach than you would in an RDBMS:
Data replication
"news": {
"_id": "4ca305c2fd0e910ecc000003",
"name": "news one",
"content": "news one",
"user": {
"_id": "4ca30369fd0e910ecc000006",
"login": "user11"
},
"date": "2017-12-15"
}
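For instance, with the mongodb/mongodb composer library (a sketch; the database name "mydb" is an assumption), the embedded copy is created at insert time:

<?php
// A sketch of the replication approach: copy the fields you need
// from the user into each news document when the news is inserted.
require 'vendor/autoload.php';

$db   = (new MongoDB\Client('mongodb://localhost:27017'))->mydb;
$user = $db->users->findOne(['_id' => '4ca30369fd0e910ecc000006']);

$db->news->insertOne([
    'name'    => 'news one',
    'content' => 'news one',
    'user'    => [                       // embedded copy, no join needed on read
        '_id'   => $user['_id'],
        'login' => $user['login'],
    ],
    'date'    => '2017-12-15',
]);

The trade-off of replication is that the embedded copies must be updated whenever the user document changes; in exchange, reading news requires no join at all.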

Elasticsearch: What's the best way to search for a word within a string AND get score?

I'm using Elasticsearch's PHP client, and I find it really difficult to return results with scores whenever I want to search for a word that is "hidden" within a string.
This is an example:
I want to get all the documents where the field "file" contains the word "anses", and the files are named like this:
axx14anses19122015.zip
What I know about it
I know I should tokenize those words; I just can't figure out how to do it.
I've also read about aggregations, but I'm really new to ES and I have to deliver a working piece ASAP.
What I've tried so far
REGEXP: using regular expressions is very expensive and does not return any scores, which is a must-have in order to narrow results and give the user accurate information.
Wildcards: same thing, slow and no scores.
My own script, where I keep a dictionary and search for critical words using regexps; on a match, I create a new field within the matched document containing the word. The point is to create a TOKEN so that future searches can use a regular match with scores. Downside: the dictionary idea was flatly rejected by my boss, so I'm here asking for ideas.
Thanks in advance.
In your case I suggest the nGram tokenizer; see the example below.
First I'll create an analyzer and a mapping for a doc type:
PUT /test_index
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 4,
                    "max_gram": 4,
                    "token_chars": [ "letter", "digit" ]
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "text_field": {
                    "type": "string",
                    "term_vector": "yes",
                    "analyzer": "ngram_tokenizer_analyzer"
                }
            }
        }
    }
}
After that I'll insert a document using your file name:
PUT /test_index/doc/1
{
    "text_field": "axx14anses19122015"
}
Now I'll just use a match query:
POST /test_index/_search
{
    "query": {
        "match": {
            "text_field": "anses"
        }
    }
}
and will receive a response like this:
{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.10848885,
        "hits": [
            {
                "_index": "test_index",
                "_type": "doc",
                "_id": "1",
                "_score": 0.10848885,
                "_source": {
                    "text_field": "axx14anses19122015"
                }
            }
        ]
    }
}
What did I do?
I created an nGram tokenizer that splits the string into 4-character terms and indexes those terms separately, so that they can be matched when you search for a part of the string.
To see more, read this article: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
Hope it helps!
OK, after trying so many times, it worked. I'll share the solution just in case someone else needs it. Thank you so much to Waldemar; it was a really good approach, and I still cannot see why it wasn't working for me.
curl -XPUT 'http://ipaddresshere/tokentest' -d '
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "myngram": {
                    "tokenizer": "mytokenizer"
                }
            },
            "tokenizer": {
                "mytokenizer": {
                    "type": "nGram",
                    "min_gram": "3",
                    "max_gram": "5",
                    "token_chars": [ "letter", "digit" ]
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "field": {
                    "type": "string",
                    "term_vector": "yes",
                    "analyzer": "myngram"
                }
            }
        }
    }
}'
So, this will take any string from "field" and split it into nGrams of length 3 to 5. For example, "abcanses14f.zip" will result in:
abc, abca, abcan, bca, bcan, bcans, etc., until it reaches anses or a similar term which is matchable and has a score related to it.
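Since the question mentions Elasticsearch's PHP client, here is roughly how the same match query could be run from PHP (a sketch with the official elasticsearch/elasticsearch package; the host placeholder and the index/type names come from the curl call above):

<?php
// A sketch: run the match query against the tokentest index and
// print each hit with the relevance score the question asked for.
require 'vendor/autoload.php';

$client = Elasticsearch\ClientBuilder::create()
    ->setHosts(['http://ipaddresshere'])
    ->build();

$response = $client->search([
    'index' => 'tokentest',
    'type'  => 'doc',
    'body'  => [
        'query' => [
            'match' => ['field' => 'anses'],
        ],
    ],
]);

foreach ($response['hits']['hits'] as $hit) {
    printf("%s scored %f\n", $hit['_source']['field'], $hit['_score']);
}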

JSON Schema Requirement Enforcement

So this is my first time using JSON Schema and I have a fairly basic question about requirements.
My top level schema is as follows:
schema.json:
{
    "id": "http://localhost/srv/schemas/schema.json",
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "event": { "$ref": "events_schema.json#" },
        "building": { "$ref": "buildings_schema.json#" }
    },
    "required": [ "event" ],
    "additionalProperties": false
}
I have two other schema definition files (events_schema.json and buildings_schema.json) that have object field definitions in them. The one of particular interest is buildings_schema.json.
buildings_schema.json:
{
    "id": "http://localhost/srv/schemas/buildings_schema.json",
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "buildings table validation definition",
    "type": "object",
    "properties": {
        "BuildingID": {
            "type": "integer",
            "minimum": 1
        },
        "BuildingDescription": {
            "type": "string",
            "maxLength": 255
        }
    },
    "required": [ "BuildingID" ],
    "additionalProperties": false
}
I am using this file to test my validation:
test.json:
{
    "event": {
        "EventID": 1,
        "EventDescription": "Some description",
        "EventTitle": "Test title",
        "EventStatus": 2,
        "EventPriority": 1,
        "Date": "2007-05-05 12:13:45"
    },
    "building": {
        "BuildingID": 1
    }
}
Which passes validation fine. But when I use the following:
test2.json:
{
    "event": {
        "EventID": 1,
        "EventDescription": "Some description",
        "EventTitle": "Test title",
        "EventStatus": 2,
        "EventPriority": 1,
        "Date": "2007-05-05 12:13:45"
    }
}
I get the error: [building] the property BuildingID is required
Inside my buildings_schema.json file I have the line "required": [ "BuildingID" ], which is what causes the error. It appears that schema.json traverses down the property definitions and enforces all of the requirements. This is counterintuitive; I would like it to enforce a requirement ONLY if its parent property is present.
I have a few ways around this that involve arrays and fundamentally changing the structure of the JSON, but that kind of defeats the purpose of my attempt to validate existing JSON. I have read over the documentation (/sigh) and have not found anything relating to this issue. Is there some simple requirement-inheritance setting I am missing?
I am using the Json-Schema for PHP implementation from here: https://github.com/justinrainbow/json-schema
After messing with different validators, it appears to be an issue with the validator: it assumes required inheritance through references. I fixed this by simply breaking the main schema apart into subschemas and only applying a subschema when necessary.
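For illustration, the workaround could look something like this (a sketch using the justinrainbow/json-schema validator; the file names follow the question, everything else is an assumption): validate each top-level property against its own subschema, and only when that property is actually present in the data.

<?php
// A sketch: validate each top-level property with its own subschema,
// skipping subschemas whose parent property is absent from the data.
require 'vendor/autoload.php';

$data = json_decode(file_get_contents('test2.json'));

$subschemas = [
    'event'    => 'file://' . realpath('events_schema.json'),
    'building' => 'file://' . realpath('buildings_schema.json'),
];

$errors = [];
foreach ($subschemas as $property => $schemaUri) {
    if (!isset($data->{$property})) {
        continue; // parent property absent: its requirements do not apply
    }
    $validator = new JsonSchema\Validator();
    $validator->validate($data->{$property}, (object) ['$ref' => $schemaUri]);
    foreach ($validator->getErrors() as $error) {
        $errors[] = sprintf('[%s.%s] %s', $property, $error['property'], $error['message']);
    }
}

echo $errors ? implode("\n", $errors) . "\n" : "Valid\n";

Note that the top-level "required": [ "event" ] from schema.json would still need its own check, since this loop only looks at properties that exist.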

How may I retrieve MySQL data into this JSON format?

I have read about json_encode but still lack the logic to use it for this particular JSON structure.
Assuming the JSON structure is as follows:
{
    "_id": "23441324",
    "api_rev": "1.0",
    "type": "router",
    "hostname": "something",
    "lat": -31.805412,
    "lon": -64.424677,
    "aliases": [
        {
            "type": "olsr",
            "alias": "104.201.0.29"
        }
    ],
    "site": "roof town hall",
    "community": "Freifunk/Berlin",
    "attributes": {
        "firmware": {
            "name": "meshkit",
            "url": "http:k.net/"
        }
    }
}
Some of the attribute values will be taken from the database, while some are going to be hardcoded (static), like "type" and "api_rev". I was thinking of just using string concatenation to build the structure, but I learnt that's a bad idea. So if I am to use json_encode, how can I handle this structure (array dimensions etc.)?
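A minimal sketch of one approach (the PDO connection, table, and column names below are made up, and the firmware URL is a placeholder since it is garbled in the question): build a nested associative array that mirrors the target structure, mixing database values with the hardcoded ones, and let json_encode handle the nesting.

<?php
// A sketch: hypothetical PDO connection and "routers" table;
// mix fetched columns with hardcoded/static values in one array.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$row = $pdo->query('SELECT id, hostname, lat, lon, site FROM routers LIMIT 1')
           ->fetch(PDO::FETCH_ASSOC);

$doc = [
    '_id'      => $row['id'],
    'api_rev'  => '1.0',      // hardcoded (static)
    'type'     => 'router',   // hardcoded (static)
    'hostname' => $row['hostname'],
    'lat'      => (float) $row['lat'],
    'lon'      => (float) $row['lon'],
    'aliases'  => [
        // a JSON array of objects is a PHP list of associative arrays
        ['type' => 'olsr', 'alias' => '104.201.0.29'],
    ],
    'site'       => $row['site'],
    'community'  => 'Freifunk/Berlin',
    'attributes' => [
        'firmware' => [
            'name' => 'meshkit',
            'url'  => 'http://example.net/', // placeholder; URL is garbled in the question
        ],
    ],
];

echo json_encode($doc, JSON_PRETTY_PRINT);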
