Searching numbers with Zend_Search_Lucene

Searching numbers with Zend_Search_Lucene - php

So why does the first search example below return no results? And any ideas on how to modify the below code to make number searches possible would be much appreciated.
Create the index
$index = new Zend_Search_Lucene('/myindex', true);
$doc->addField(Zend_Search_Lucene_Field::Text('ssn', '123-12-1234'));
$doc->addField(Zend_Search_Lucene_Field::Text('cats', 'Fluffy'));
$index->addDocument($doc);
$index->commit();
Search - NO RESULTS
$index = new Zend_Search_Lucene('/myindex', true);
$results = $index->find('123-12-1234');
Search - WITH RESULTS
$index = new Zend_Search_Lucene('/myindex', true);
$results = $index->find('Fluffy');

First you need to change your text analizer to include numbers
Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum() );
Then for fields with numbers you want to use Zend_Search_Lucene_Field::Keyword instead of Zend_Search_Lucene_Field::Text
this will skip the the creation of tokens and saves the value 'as is' into the index. Then you can search by it. I don't know how it behaves with floats ( is probably not going to work for floats 3.0 is not going to match 3) but for natural numbers ( like ids ) works like a charm.

This is an effect of which Analyzer you have chosen.
I believe the default Analyzer will only index terms that match /[a-zA-Z]+/. This means that your SSN isn't being added to the index as a term.
Even if you switched to the text+numeric case insensitive Analyzer, what you are wanting still will not work. The expression for a term is /[a-zA-Z0-9]+/ this would mean your terms added to the index would be 12,123,1234.
If you need 123-12-1234 to be seen as a valid term, you are probably going to need to extend Zend_Search_Lucene_Analysis_Analyzer_Common and make it so that 123-12-1234 is a term.
See
http://framework.zend.com/manual/en/zend.search.lucene.extending.html#zend.search.lucene.extending.analysis
Your other choice is to store the ssn as a Zend_Search_Lucene_Field::Keyword. Since a keyword is not broken up into terms.
http://framework.zend.com/manual/en/zend.search.lucene.html#zend.search.lucene.index-creation.understanding-field-types

Related

Find a sentence that contains a word in text from database

I want to explain what I am trying to accomplish with an example. Let's assume I have two rows in my posts table.
It snows a lot in winter in Norway. While it barely snows where I live.
I run four miles every morning. I am trying to get fit.
When I search in winter I want to get the sentence It snows a lot in winter in Norway.
I know I can get the row with:
$posts = \App\Models\Post::where('body', 'like', "%{in winter}%")->get();
But I am not sure how to get the exact sentence.

While it might be technically possible using SQL to get the exact sentence, you are better off using PHP for getting the exact sentence from the collection.
I have created an example using collections (since you're using Laravel), starting with your provided Post query (although I did remove the curly braces from the like string).
1. Get the collection of posts with body like search query
$posts = Post::where('body', 'like', "%in winter%")->get();
2. Map collection of posts to individual sentences. Flatten to remove empty sentences.
$postSentences = $posts->map( function($post) {
// preg split makes sure the text is splitted on . or ! or ?
return preg_split('/\.|\?|!/', $post->body);
})->flatten();
3. Get the corresponding sentence(s) using a filter and Str::contains()
$matchingSentences = $postSentences->filter( function($sentence) {
return Str::contains($sentence, 'in winter');
});
$matchingSentences should return:
Illuminate\Support\Collection
all: [
"It snows a lot in winter in Norway",
],
}
The example can probably be altered / shortened to your fitting. But this should solve the aforementioned problem.

Is there a simple way to return WHAT is similar between 2 strings in PHP?

I've looked into the similar_text() and levenshtein() functions, but they only seem to return THAT there are similarities and the percentage of those similarities.
What I am trying to do is compare 2 strings to determine WHAT is actually similar between the two.
Basically:
<?php
$string1 = "IMG_1";
$string2 = "IMG_2";
echo CompareTheseStrings($string1,$string2); // returns "IMG_";
If this wonderful function doesn't exist, what would be the best way to accomplish this?
My end game plan is to read through a list of file names and then replace the similar text with something user defined or just remove it all together, but I don't want to replace each files unique identifier.

Reading your 'end goal' I think you're going about this completely the wrong way I think you should really be looking at str_replace
$string1 = "IMG_1";
$string2 = "IMG_2";
// you would create a loop here to insert each string
str_replace("IMG", "NEW_TERM", $string1);
If you want to remove the text altogether then just pass an empty string in as the 2nd parameter

How to highlight search-matching text on a web page

I'm trying to write a PHP function that takes some text to be displayed on a web page, and then based on some entered search terms, highlights the corresponding parts of the text. Unfortunately, I'm having a couple of issues.
To better explain the two issues I'm having, let's imagine that the following innocuous string is being searched on and will be displayed on the web page:
My daughter was born on January 11, 2011.
My first problem is that if more than one search term is entered, any placeholder text I use to mark the start and end of any matches for the first term may then be matched by the second term.
For example, I'm currently using the following delimiting strings to mark the beginning and end of a match (upon which I use the preg_replace function at the end of the function to turn the delimiters into HTML span tags):
'#####highlightStart#####'
'#####highlightEnd#####'
The problem is, if I do a search like 2011 light, then 2011 will be matched first, giving me:
My daughter was born on January 11, #####highlightStart#####2011#####highlightEnd#####.
Upon which when light is searched for, it will match the word light within both #####highlightStart##### and #####highlightEnd#####, which I don't want.
One thought I had was to create some really obscure delimiting strings (in perhaps a foreign language) that would likely never be searched on, but I can't guarantee that any particular string will never be searched on and it just seems like a really kludgy solution. Basically, I imagine that there is a better way to do it.
Any advice on this first point would be greatly appreciated.
My second question has to do with how to handle overlapping matches.
For example, with the same string My daughter was born on January 11, 2011., if the entered search is Jan anuar, then Jan will be matched first, giving me:
My daughter was born on #####highlightStart#####Jan#####highlightEnd#####uary 11, 2011.
And because the delimiting text is now a part of the string, the second search term, anuar will never be matched.
Regarding this issue, I am quite perplexed and really have no clue how to solve it.
I feel like I need to somehow do all of the search operations on the original string separately and then somehow combine them at the end, but again, I'm lost on how to do this.
Perhaps there's a way better solution altogether, but I don't know what that would be.
Any advice or direction on how to solve either or both of these problems would be greatly appreciated.
Thank you.

In this case I think it's simpler to use str_replace (though it won't be perfect).
Assuming you've got an array of terms you want to highlight, I'll call it $aSearchTerms for the sake of argument... and that wrapping the highlighted terms in the HTML5 <mark> tag is acceptable (for the sake of legibility, you've stated it's going on a web-page and it's easy to strip_tags() from your search terms):
$aSearchTerms = ['Jan', 'anu', 'Feb', '11'];
$sinContent = "My daughter was born on January 11, 2011.";
foreach($aSearchTerms as $sinTerm) {
$sinContent = str_replace($sinTerm, "<mark>{$sinTerm}</mark>", $sinContent);
}
echo $sinContent;
// outputs: My d<mark>au</mark>ghter was born on <mark>Jan</mark>uary <mark>11</mark>, 20<mark>11</mark>.
It's not perfect since, using the data in that array, the first pass will change January to <mark>Jan</mark>uary which means anu will no longer match in January - something like this will, however, cover most usage needs.
EDIT
Oki - I'm not 100% certain this is sane but I took a totally different approach looking at the link #AlexAtNet posted:
https://stackoverflow.com/a/3631016/886824
What I've done is looked at the points in the string where the search term is found numerically (the indexes) and built an array of start and end indexes where the <mark> and </mark> tags are going to be entered.
Then using the answer above merged those start and end indexes together - this covers your overlapping matches issue.
Then I've looped that array and cut the original string up into substrings and glued it back together inserting the <mark> and </mark> tags at the relevant points (based on the indexes). This should cover your second issue so you're not having string replacements replacing string replacements.
The code in full looks like:
<?php
$sContent = "Captain's log, January 11, 2711 - Uranus";
$ainSearchTerms = array('Jan', 'asduih', 'anu', '11');
//lower-case it for substr_count
$sContentForSearching = strtolower($sContent);
//array of first and last positions of the terms within the string
$aTermPositions = array();
//loop through your search terms and build a multi-dimensional array
//of start and end indexes for each term
foreach($ainSearchTerms as $sinTerm) {
//lower-case the search term
$sinTermLower = strtolower($sinTerm);
$iTermPosition = 0;
$iTermLength = strlen($sinTermLower);
$iTermOccursCount = substr_count($sContentForSearching, $sinTermLower);
for($i=0; $i<$iTermOccursCount; $i++) {
//find the start and end positions for this term
$iStartIndex = strpos($sContentForSearching, $sinTermLower, $iTermPosition);
$iEndIndex = $iStartIndex + $iTermLength;
$aTermPositions[] = array($iStartIndex, $iEndIndex);
//update the term position
$iTermPosition = $iEndIndex + $i;
}
}
//taken directly from this answer https://stackoverflow.com/a/3631016/886824
//just replaced $data with $aTermPositions
//this sorts out the overlaps so that 'Jan' and 'anu' will merge into 'Janu'
//in January - whilst still matching 'anu' in Uranus
//
//This conveniently sorts all your start and end indexes in ascending order
usort($aTermPositions, function($a, $b)
{
return $a[0] - $b[0];
});
$n = 0; $len = count($aTermPositions);
for ($i = 1; $i < $len; ++$i)
{
if ($aTermPositions[$i][0] > $aTermPositions[$n][1] + 1)
$n = $i;
else
{
if ($aTermPositions[$n][1] < $aTermPositions[$i][1])
$aTermPositions[$n][1] = $aTermPositions[$i][1];
unset($aTermPositions[$i]);
}
}
$aTermPositions = array_values($aTermPositions);
//finally chop your original string into the bits
//where you want to insert <mark> and </mark>
if($aTermPositions) {
$iLastContentChunkIndex = 0;
$soutContent = "";
foreach($aTermPositions as $aChunkIndex) {
$soutContent .= substr($sContent, $iLastContentChunkIndex, $aChunkIndex[0] - $iLastContentChunkIndex)
. "<mark>" . substr($sContent, $aChunkIndex[0], $aChunkIndex[1] - $aChunkIndex[0]) . "</mark>";
$iLastContentChunkIndex = $aChunkIndex[1];
}
//... and the bit on the end
$soutContent .= substr($sContent, $iLastContentChunkIndex);
}
//this *should* output the following:
//Captain's log, <mark>Janu</mark>ary <mark>11</mark>, 27<mark>11</mark> - Ur<mark>anu</mark>s
echo $soutContent;
The inevitable gotcha!
Using this on content that's already HTML may fail horribly.
Given the string.
In January this year...
A search/mark of Jan will insert the <mark>/</mark> around 'Jan' which is fine. However a search mark of something like In Jan is going to fail as there's markup in the way :\
Can't think of a good way around that I'm afraid.

Do not modify original string and store the matches in the individual array, either starts in odd and ends in even elements or store them in records (arrays of two items).
After searching for the several keywords, you end up with several arrays with matches. So the task now is how to merge two lists of segments, producing the segments that covers the areas. As the lists are sorted, this is a trivial task that can be solved in O(n) time.
Then just insert highlight tokens into positions recorded in the resulting array.

Search by term giving empty result using elastica

Using Elastica - Elasticsearch PHP Client. There are so many fields but I want to search in "Screen_name" field. What I have to do for it. I used term but without success. Result is empty array
Here is the code.
// Load index (database)
$elasticaIndex = $elasticaClient->getIndex('twitter_json');
//Load type (table)
$elasticaType = $elasticaIndex->getType('tweet_json');
$elasticaFilterScreenName = new \Elastica\Filter\Term();
$elasticaFilterScreenName->setTerm('screen_name', 'sohail');
//Search on the index.
$elasticaResultSet = $elasticaIndex->search($elasticaFilterScreenName);
var_dump($elasticaResultSet); exit;
$elasticaResults = $elasticaResultSet->getResults();
$totalResults = $elasticaResultSet->getTotalHits();

It's hard to say without knowing your mapping, but there is a good chance the document property "screen_name" does not contain the term "sohail".
Try using a Match query or a Query_String.
"Term" has special meaning in ElasticSearch. A Term is the base, atomic unit of an index. Terms are generated after your text is run through an analyzer. If "screen_name" has an analyzer associated with the index, "sohail" is being modified in some capacity before being saved into the index.

Multi-Term Wildcard queries in Lucene?

I'm using Zend_Search_Lucene, the PHP port of Java Lucene. I currently have some code that will build a search query based on an array of strings, finding results for which at least one index field matches each of the strings submitted. Simplified, it looks like this:
(Note: $words is an array constructed from user input.)
$query = new Zend_Search_Lucene_Search_Query_Boolean();
foreach ($words as $word) {
$term1 = new Zend_Search_Lucene_Index_Term($word, $fieldname1);
$term2 = new Zend_Search_Lucene_Index_term($word, $fieldname2);
$multiq = new Zend_Search_Lucene_Search_Query_MultiTerm();
$multiq->addTerm($term1);
$multiq->addTerm($term2);
$query->addSubquery($multiq, true);
}
$hits = $index->find($query);
What I would like to do is replace $word with ($word . '*') — appending an asterisk to the end of each word, turning it into a wildcard term.
But then, $multiq would have to be a Zend_Search_Lucene_Search_Query_Wildcard instead of a Zend_Search_Lucene_Search_Query_MultiTerm, and I don't think I would still be able to add multiple Index_Terms to each $multiq.
Is there a way to construct a query that's both a Wildcard and a MultiTerm?
Thanks!

Not in the way you're hoping to achieve it, unfortunately:
Lucene supports single and multiple
character wildcard searches within
single terms (but not within phrase
queries).
and even if it were possible, would probably not be a good idea:
Wildcard, range and fuzzy search
queries may match too many terms. It
may cause incredible search
performance downgrade.
I imagine the way to go if you insist on multiple wildcard terms, would be two execute two separate searches, one for each wildcarded term, and bundle the results together.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Searching numbers with Zend_Search_Lucene - php

Related

Find a sentence that contains a word in text from database

Is there a simple way to return WHAT is similar between 2 strings in PHP?

How to highlight search-matching text on a web page

Search by term giving empty result using elastica

Multi-Term Wildcard queries in Lucene?

Categories

Resources