Elasticsearch match substring in PHP

Below is my code to build an index with Elasticsearch. The index is generated successfully. I am using it to provide autosuggest based on movie name, actor name and genre.
My requirement now is to match a substring against a particular field. This works if I use $params['body']['query']['wildcard']['field'] = '*sub_word*'; but only within a single word (i.e. a search for 'to' returns 'tom kruz', but a search for 'tom kr' returns no result).
I want to match a substring that spans multiple words (i.e. 'tom kr' should return 'tom kruz').
I found a few docs saying this is possible using 'ngram'.
But I don't know how to implement it in my code, as I am using array-based configuration for Elasticsearch and all the supporting docs show the configuration in JSON format.
Please help.
require 'vendor/autoload.php';
$client = \Elasticsearch\ClientBuilder::create()
    ->setHosts(['http://localhost:9200'])->build();
/*************Index a document****************/
$params = ['body' => []];
$j = 1;
for ($i = 1; $i <= 100; $i++) {
    $params['body'][] = [
        'index' => [
            '_index' => 'pvrmod',
            '_type' => 'movie',
            '_id' => $i
        ]
    ];
    if ($i % 10 == 0) {
        $j++;
    }
    $params['body'][] = [
        'title' => 'salaman khaan'.$j,
        'desc' => 'salaman khaan description'.$j,
        'gener' => 'movie gener'.$j,
        'language' => 'movie language'.$j,
        'year' => 'movie year'.$j,
        'actor' => 'movie actor'.$j,
    ];
    // Every 10 documents stop and send the bulk request
    if ($i % 10 == 0) {
        $responses = $client->bulk($params);
        // erase the old bulk request
        $params = ['body' => []];
        unset($responses);
    }
}
// Send the last batch if it exists
if (!empty($params['body'])) {
    $responses = $client->bulk($params);
}

The problem here lies in the fact that Elasticsearch builds an inverted index. Assuming you use the standard analyzer, the sentence "tom kruz is a top gun" gets split into 6 tokens: tom - kruz - is - a - top - gun. These tokens get assigned to the document (with some metadata about their position, but let's leave that aside for now).
If you want to make a partial match, you can, but only within the separate tokens, not across token boundaries as you would like. The suggestion to split your search string and build a wildcard query out of the parts is one option.
Another option would indeed be using an ngram or edge_ngram token filter. At index time, it creates those partial tokens (like t - to - tom - ... - k - kr - kru - kruz - ...) in advance, so you can just put 'tom kr' in your (match) search and it will match. Be careful though: this will bloat your index (as you can see, it stores A LOT more tokens), and you need custom analyzers and probably quite a bit of knowledge about your mappings.
In general, the (edge_)ngram route is a good idea only for things like autocomplete, not for just any text field in your index. There are a few ways to get around your problem, but most involve building separate features to detect misspelled words and suggest the right terms for them.
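Since the question asks how to express this in array form, here is a minimal sketch of the edge_ngram setup as PHP arrays. The analyzer and filter names (autocomplete, autocomplete_filter), the helper function names and the gram sizes are illustrative assumptions, not taken from the question. It also assumes an Elasticsearch version that still has mapping types and the 'text' field type (5.x/6.x); on older versions the field type would be 'string'.

```php
<?php
// Sketch only: index settings and mapping expressed as PHP arrays.
// 'autocomplete' and 'autocomplete_filter' are hypothetical names.
function buildAutocompleteIndexParams(string $index, string $type, array $fields): array
{
    $properties = [];
    foreach ($fields as $field) {
        $properties[$field] = [
            'type' => 'text',                // assumption: 5.x/6.x; older versions use 'string'
            'analyzer' => 'autocomplete',    // index time: emit edge n-grams
            'search_analyzer' => 'standard', // search time: keep the query terms whole
        ];
    }

    return [
        'index' => $index,
        'body' => [
            'settings' => [
                'analysis' => [
                    'filter' => [
                        'autocomplete_filter' => [
                            'type' => 'edge_ngram',
                            'min_gram' => 1,
                            'max_gram' => 20,
                        ],
                    ],
                    'analyzer' => [
                        'autocomplete' => [
                            'type' => 'custom',
                            'tokenizer' => 'standard',
                            'filter' => ['lowercase', 'autocomplete_filter'],
                        ],
                    ],
                ],
            ],
            'mappings' => [
                $type => ['properties' => $properties],
            ],
        ],
    ];
}

// Match query that requires every search word to match some indexed n-gram.
function buildAutocompleteSearchParams(string $index, string $field, string $text): array
{
    return [
        'index' => $index,
        'body' => [
            'query' => [
                'match' => [
                    $field => ['query' => $text, 'operator' => 'and'],
                ],
            ],
        ],
    ];
}

$indexParams = buildAutocompleteIndexParams('pvrmod', 'movie', ['title', 'actor']);
$searchParams = buildAutocompleteSearchParams('pvrmod', 'title', 'tom kr');
// With a client built as in the question:
// $client->indices()->create($indexParams);
// $results = $client->search($searchParams);
```

The index has to be created with these settings before the documents are bulk-indexed, because the n-grams are produced at index time.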

Try building this JSON:
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "field": {
                  "value": "tom*",
                  "boost": 1
                }
              }
            },
            {
              "wildcard": {
                "field": {
                  "value": "kr*",
                  "boost": 1
                }
              }
            }
          ]
        }
      }
    }
  }
}
You can explode your search term
$searchTerms = explode(' ', 'tom kruz');
And then create the wildcard for each one
foreach($searchTerms as $searchTerm) {
//create the new array
}
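The loop body above is left open, so here is one way it could be filled in, as a sketch. The function name buildWildcardQuery is hypothetical, the "filtered" wrapper is left out for brevity, and the input is lowercased on the assumption that the field is indexed with an analyzer that lowercases its tokens.

```php
<?php
// Build one wildcard clause per search word and combine them in a bool/should.
function buildWildcardQuery(string $field, string $input): array
{
    // Split on any run of whitespace and drop empty pieces.
    $terms = preg_split('/\s+/', strtolower(trim($input)), -1, PREG_SPLIT_NO_EMPTY);

    $should = [];
    foreach ($terms as $term) {
        $should[] = [
            'wildcard' => [
                $field => ['value' => $term . '*', 'boost' => 1],
            ],
        ];
    }

    return ['query' => ['bool' => ['should' => $should]]];
}

$body = buildWildcardQuery('title', 'tom kr');
echo json_encode($body, JSON_PRETTY_PRINT), PHP_EOL;
// $results = $client->search(['index' => 'pvrmod', 'body' => $body]);
```

With should, a document matching either word is returned; putting the clauses under must instead would require every word to match.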

Related

PhpMongo - how to apply AND condition for a single document present in an array?

My Mongo collection has two documents
{
"_id":ObjectId("567168393d5c6cd46a00002a"),
"type":"SURVEY",
"description":"YOU HAVE AN UNANSWERED SURVEY.",
"user_to_notification_seen_status":[
{
"user_id":1,
"status":"UNSEEN",
"time_updated":1450272825
},
{
"user_id":2,
"status":"SEEN",
"time_updated":1450273798
},
{
"user_id":3,
"status":"UNSEEN",
"time_updated":1450272825
}
],
"feed_id":1,
"time_created":1450272825,
"time_updated":1450273798
}
Here is the query I used, which should fetch the document only if user_id is 2 and status is "UNSEEN".
$query = array('$and' => array(array('user_to_notification_seen_status.user_id'=> 2,'user_to_notification_seen_status.status' => "UNSEEN")));
$cursor = $notification_collection->find($query);
Ideally the above query shouldn't retrieve any results, but it does return results. If I give an invalid id or an invalid status, it does not return any record.
You're misunderstanding how the query works. It matches your document because user_to_notification_seen_status contains an element with user_id: 2 and an element with status: UNSEEN; the two conditions do not have to be satisfied by the same element.
What you can do to get the desired results is use the aggregation framework: unwind the array and then match both conditions. That way you'll only get the unwound documents whose array element satisfies both conditions.
Run this in mongo shell (or convert to PHP equivalent). Also, change YourCollection to your actual collection name:
db.YourCollection.aggregate([ { $unwind: "$user_to_notification_seen_status" }, { $match: { "user_to_notification_seen_status.status": "UNSEEN", "user_to_notification_seen_status.user_id": 2 } } ] );
This will return no records, but if you change the id to 3 for example, it will return one.
Try:
$query = array(
array('$unwind' => '$user_to_notification_seen_status'),
array(
'$match' => array('user_to_notification_seen_status.status' => 'UNSEEN', 'user_to_notification_seen_status.user_id' => 2),
),
);
$cursor = $notification_collection->aggregate($query);
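When whole documents are wanted rather than unwound ones, MongoDB's $elemMatch operator is an alternative: it requires all listed conditions to hold for the same array element. The sketch below builds that filter as a PHP array and, purely for illustration, mimics the two matching semantics in plain PHP. The helper function names are hypothetical.

```php
<?php
// Filter for find(): both conditions must hold for the SAME array element.
function buildSameElementQuery(int $userId, string $status): array
{
    return [
        'user_to_notification_seen_status' => [
            '$elemMatch' => ['user_id' => $userId, 'status' => $status],
        ],
    ];
}

// Plain-PHP illustration of dotted-field matching: each condition may be
// satisfied by a different element of the array.
function matchesAcrossElements(array $elements, int $userId, string $status): bool
{
    $ids = array_column($elements, 'user_id');
    $statuses = array_column($elements, 'status');
    return in_array($userId, $ids, true) && in_array($status, $statuses, true);
}

// Plain-PHP illustration of $elemMatch: one element must satisfy everything.
function matchesSameElement(array $elements, int $userId, string $status): bool
{
    foreach ($elements as $element) {
        if ($element['user_id'] === $userId && $element['status'] === $status) {
            return true;
        }
    }
    return false;
}

$elements = [
    ['user_id' => 1, 'status' => 'UNSEEN'],
    ['user_id' => 2, 'status' => 'SEEN'],
    ['user_id' => 3, 'status' => 'UNSEEN'],
];
var_dump(matchesAcrossElements($elements, 2, 'UNSEEN')); // bool(true)
var_dump(matchesSameElement($elements, 2, 'UNSEEN'));    // bool(false)

$query = buildSameElementQuery(2, 'UNSEEN');
// $cursor = $notification_collection->find($query);
```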

get first element of an array

Let's assume the return value of a search function is something like this
// If only one record is found
$value = [
'records' => [
'record' => ['some', 'Important', 'Information']
]
];
// If multiple records are found
$value = [
'records' => [
'record' => [
0 => ['some', 'important', 'information'],
1 => ['some', 'information', 'I dont care']
]
]
];
What would be the best way to get the important information (in case of multiple records, it is always the first one)?
Should I check something like
if (array_values($value['records']['record'])[0] == 0){//do something};
But I guess there is a much more elegant solution.
Edit:
And by the way, this is not really a duplicate of the referenced question, which only covers the multiple-records case.
If you want the first element of an array, you should use reset. This function sets the internal pointer to the first element and returns it.
$firstValue = reset($value['records']['record']);
Edit: after reading your question again, it seems you don't want the first element.
You rather want this
if (isset($value['records']['record'][0]) && is_array($value['records']['record'][0])) {
    // multiple return values
} else {
    // single return value
}
Doing this is kind of error-prone, and I wouldn't suggest that one function return different kinds of array structures.
Check like this:
if(is_array($value['records']['record'][0])) {
// multiple records
} else {
// single record
}
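The two checks above can be wrapped in a small helper so that callers always get one record back, whichever shape the search function returned. This is a sketch; the name firstRecord is hypothetical.

```php
<?php
// Return the first (or only) record, or null if there is none.
function firstRecord(array $value): ?array
{
    $record = $value['records']['record'] ?? null;
    if (!is_array($record) || $record === []) {
        return null;
    }

    // If the first element is itself an array, $record is a list of records.
    $first = reset($record);
    return is_array($first) ? $first : $record;
}

$single = ['records' => ['record' => ['some', 'Important', 'Information']]];
$multiple = ['records' => ['record' => [
    0 => ['some', 'important', 'information'],
    1 => ['some', 'information', 'I dont care'],
]]];

print_r(firstRecord($single));
print_r(firstRecord($multiple));
```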

MongoDB: Advanced query on array

Suppose I have the following objects in my collection:
{id:'123', tags:['berry', 'apple']}
{id:'456', tags:['salad', 'tomatoe']}
{id:'789', tags:['bread', 'rice']}
My search term is "Strawberry". I want to find all objects, where one of the tags is part of search term. In this case it's the object with id '123', since 'berry' is part of 'Strawberry'.
I wanted to use Regex, like this (I'm using php btw):
$regex = new MongoRegex("/.*berry.*/i");
$results = $mongodb->data->find(array("tags" => array('$in' => array($regex))));
but the problem is that the regex is applied to the tags and not to the search term. So I'd need something like a reverse regex.
Is a query like this somehow possible? Right now I'm doing it like this:
$search = "Strawberry";
$js = "function() { var i = 0; for (; i < this.tags.length; i++) { if ('".$search."'.indexOf(this.tags[i]) != -1) { return true; } } }";
$results = $mongodb->data->find($js);
That's OK for now, since the dataset isn't very large, but will be slow in the future.
Does anyone have a suggestion? Thanks.
UPDATE:
Sorry if this is still not clear.
My search Term is "Strawberry", not "berry". The php code I posted that contains the Regex was just to show that this is not a solution and does not work.
So again: my search term is "Strawberry" and I want to find all objects where one of the tags is part of the search term, not the other way around.
UPDATE 2:
To make it even clearer, in SQL this would be:
SELECT * FROM data WHERE 'Strawberry' LIKE CONCAT('%', tag, '%')
This query matches every tag that contains "berry", so it will match "strawberry" if you have it in tags:
db.collection.aggregate(
[
{$unwind: "$tags"},
{$match : {tags: /.*berry.*/i }}
]
)
Tested output:
{
"result" : [
{
"_id" : ObjectId("537373c17c3639c32fe515fb"),
"id" : "123",
"tags" : "berry"
},
{
"_id" : ObjectId("537375337c3639c32fe515fe"),
"id" : "789",
"tags" : "strawberry"
}
],
"ok" : 1
}
In terms of PHP,
$result = $mongodb->data->aggregate(array(
    array(
        // single quotes, so PHP does not try to interpolate $tags
        '$unwind' => '$tags',
    ),
    array(
        '$match' => array(
            'tags' => new MongoRegex('/.*berry.*/i')
        ),
    ),
));
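For the reverse match the question actually asks for (a tag contained in the search term, as in the SQL above), one option that avoids server-side JavaScript is to enumerate the substrings of the search term in PHP and query for them with $in. This is only a sketch under assumptions: substringsOf is a hypothetical helper, tags are assumed to be stored in lowercase, and the term is assumed to contain single-byte characters.

```php
<?php
// All distinct substrings of $term that are at least $minLength characters long.
function substringsOf(string $term, int $minLength = 1): array
{
    $term = strtolower($term);
    $length = strlen($term);
    $seen = [];
    for ($start = 0; $start < $length; $start++) {
        for ($size = $minLength; $start + $size <= $length; $size++) {
            $seen[substr($term, $start, $size)] = true; // keys de-duplicate
        }
    }
    // Numeric-looking keys come back as integers, so cast them to strings again.
    return array_map('strval', array_keys($seen));
}

$search = 'Strawberry';
$query = ['tags' => ['$in' => substringsOf($search, 3)]];
// $results = $mongodb->data->find($query);
```

Because $in compares whole tag values, this kind of query can use an index on tags. The number of candidates grows quadratically with the length of the search term, which is why a minimum length (3 here, an arbitrary choice) is worth setting.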

How to query MongoDB by the last item of an array?

I want to find documents where the last element in an array equals some value.
Array elements may be accessed by specific array position:
// i.e. comments[0].by == "Abe"
db.example.find( { "comments.0.by" : "Abe" } )
But how do I search using the last item as the criterion?
i.e.
db.example.find( { "comments.last.by" : "Abe" } )
By the way, I'm using PHP.
I know this question is old, but I found it on google after answering a similar new question. So I thought this deserved the same treatment.
You can avoid the performance hit of $where by using aggregate instead:
db.example.aggregate([
    // Narrow down using an index, which $where cannot do
    {$match: { "comments.by": "Abe" }},
    // De-normalize the array
    {$unwind: "$comments"},
    // The order of the array is maintained, so just take the $last per _id
    {$group: { _id: "$_id", comments: {$last: "$comments"} }},
    // Match only where that $last comment is by "Abe"
    {$match: { "comments.by": "Abe" }},
    // Retain the original _id order
    {$sort: { _id: 1 }}
])
And that should run rings around $where since we were able to narrow down the documents that had a comment by "Abe" in the first place. As warned, $where is going to test every document in the collection and never use an index even if one is there to be used.
Of course, you can also maintain the original document using the technique described here as well, so everything would work just like a find().
Just food for thought for anyone finding this.
Update for Modern MongoDB releases
Modern releases have added the $redact pipeline expression as well as $arrayElemAt ( the latter as of 3.2, so that would be the minimal version here ) which in combination would allow a logical expression to inspect the last element of an array without processing an $unwind stage:
db.example.aggregate([
{ "$match": { "comments.by": "Abe" }},
{ "$redact": {
"$cond": {
"if": {
"$eq": [
{ "$arrayElemAt": [ "$comments.by", -1 ] },
"Abe"
]
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
])
The logic here is a comparison in which $arrayElemAt takes the element at index -1, the last one, of "$comments.by", which resolves to an array of just the values of the "by" property. This allows comparison of that single value against the required parameter, "Abe".
Or even a bit more modern using $expr for MongoDB 3.6 and greater:
db.example.find({
"comments.by": "Abe",
"$expr": {
"$eq": [
{ "$arrayElemAt": [ "$comments.by", -1 ] },
"Abe"
]
}
})
This would be by far the most performant solution for matching the last element within an array, and it is actually expected to supersede the usage of $where in most cases, and especially here.
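Since the question mentions PHP, the $expr filter above can be written as a PHP array. This is a sketch: lastElementEqualsFilter is a hypothetical helper, and the commented find() call assumes a MongoDB\Collection object from the newer PHP library. Single quotes matter, because PHP would otherwise try to interpolate names such as $expr.

```php
<?php
// Build a find() filter matching documents whose LAST array element has
// $subField equal to $expected (needs MongoDB 3.6+ for $expr).
function lastElementEqualsFilter(string $arrayField, string $subField, $expected): array
{
    $path = $arrayField . '.' . $subField;

    return [
        // Plain condition first, so an index on e.g. "comments.by" can narrow the candidates.
        $path => $expected,
        '$expr' => [
            '$eq' => [
                ['$arrayElemAt' => ['$' . $path, -1]],
                $expected,
            ],
        ],
    ];
}

$filter = lastElementEqualsFilter('comments', 'by', 'Abe');
echo json_encode($filter, JSON_PRETTY_PRINT), PHP_EOL;
// $cursor = $collection->find($filter); // $collection: a MongoDB\Collection (assumption)
```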
You can't do this in one go with this schema design. You can either store the length and do two queries, or store the last comment additionally in another field:
{
    '_id': 'foo',
    'comments': [
        { 'value': 'comment #1', 'by': 'Ford' },
        { 'value': 'comment #2', 'by': 'Arthur' },
        { 'value': 'comment #3', 'by': 'Zaphod' }
    ],
    'last_comment': {
        'value': 'comment #3', 'by': 'Zaphod'
    }
}
Sure, you'll be duplicating some data, but at least you can set this data with $set together with the $push for the comment.
$comment = array(
'value' => 'comment #3',
'by' => 'Zaphod',
);
$collection->update(
array( '_id' => 'foo' ),
array(
'$set' => array( 'last_comment' => $comment ),
'$push' => array( 'comments' => $comment )
)
);
Finding the last one is easy now!
You could do this with a $where operator:
db.example.find({ $where:
'this.comments.length && this.comments[this.comments.length-1].by === "Abe"'
})
The usual slow performance caveats for $where apply. However, you can help with this by including "comments.by": "Abe" in your query:
db.example.find({
"comments.by": "Abe",
$where: 'this.comments.length && this.comments[this.comments.length-1].by === "Abe"'
})
This way, the $where only needs to be evaluated against documents that include comments by Abe and the new term would be able to use an index on "comments.by".
I'm just doing :
db.products.find({'statusHistory.status':'AVAILABLE'},{'statusHistory': {$slice: -1}})
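This gets me products that have status='AVAILABLE' somewhere in statusHistory, with only the last statusHistory item returned for each, so whether that last item is the one with status='AVAILABLE' still has to be checked on the returned document.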
I am not sure why my answer above is deleted. I am reposting it. I am pretty sure without changing the schema, you should be able to do it this way.
db.example.find({ "comments:{$slice:-1}.by" : "Abe" })
// ... or
db.example.find({ "comments.by" : "Abe" })
This by default takes the last element in the array.

Map Reduce To Get Most popular tags

I have a problem that I need some help with, but I feel I'm close. It involves Lithium and MongoDB. The code looks like this:
http://pastium.org/view/0403d3e4f560e3f790b32053c71d0f2b
$db = PopularTags::connection();
$map = new \MongoCode("function() {
if (!this.saved_terms) {
return;
}
for (index in this.saved_terms) {
emit(this.saved_terms[index], 1);
}
}");
$reduce = new \MongoCode("function(previous, current) {
var count = 0;
for (index in current) {
count += current[index];
}
return count;
}");
$metrics = $db->connection->command(array(
'mapreduce' => 'users',
'map' => $map,
'reduce' => $reduce,
'out' => 'terms'
));
$cursor = $db->connection->selectCollection($metrics['result'])->find()->limit(1);
print_r($cursor);
/**
User Data In Mongo
{
"_id" : ObjectId("4e789f954c734cc95b000012"),
"email" : "example#bob.com",
"saved_terms" : [
null,
[
"technology",
" apple",
" iphone"
],
[
"apple",
" water",
" beryy"
]
] }
**/
I have users saving the terms they search on, and I am trying to get the most popular terms,
but I keep getting errors like: Uncaught exception 'Exception' with message 'MongoDB::__construct( invalid name '. Does anyone have any idea how to do this, or some direction?
First off, I would not store this in the user object. MongoDB documents have an upper size limit of 4/16MB (depending on version). This limit is normally not a problem, but when logging inline in one object you might reach it. A more real problem, however, is that every time you need to act on these objects you need to load them into RAM, and that becomes expensive. I don't think you want that on your user objects.
Secondly, arrays in objects are not sortable and have other limitations that might come back to bite you later.
But if you want to keep it like this (a low volume of searches should not really be a problem), the easiest way to solve it is a group query.
A group query is much like a GROUP BY query in SQL, so there is a slight trick: you need to group on something most objects share (an active field on users, maybe).
So, here's a working group example that sums the words used, based on your structure.
Just put this method in your model and do MyModel::searchTermUsage() to get a Document object back.
public static function searchTermUsage() {
$reduce = 'function(obj, prev) {
obj.terms.forEach(function(terms) {
terms.forEach(function(term) {
if (!(term in prev)) prev[term] = 0;
prev[term]++;
});
});
}';
return static::all(array(
'initial' => new \stdclass,
'reduce' => $reduce,
'group' => 'common-value-key' // Change this
));
}
There is no protection against non-array types in the terms field (you had a null value in your example). I left it out for simplicity; it's probably better to strip such values before they end up in the database.
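For the low volumes mentioned above, the counting can also be done in plain PHP after fetching the user documents. The sketch below adds the guard against non-array entries that the group example omits, and trims the leading spaces seen in the sample data. The function name countSavedTerms is hypothetical, and $users stands for documents already loaded into arrays.

```php
<?php
// Count how often each saved term occurs across all users, most popular first.
function countSavedTerms(array $users): array
{
    $counts = [];
    foreach ($users as $user) {
        foreach ($user['saved_terms'] ?? [] as $terms) {
            if (!is_array($terms)) {
                continue; // skip null or other non-array entries
            }
            foreach ($terms as $term) {
                $term = trim((string) $term);
                if ($term === '') {
                    continue;
                }
                $counts[$term] = ($counts[$term] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // highest count first, keys preserved
    return $counts;
}

$users = [
    ['saved_terms' => [
        null,
        ['technology', ' apple', ' iphone'],
        ['apple', ' water', ' beryy'],
    ]],
];
$counts = countSavedTerms($users);
print_r($counts);
```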
