Multi-Term Wildcard queries in Lucene? - php

I'm using Zend_Search_Lucene, the PHP port of Java Lucene. I currently have some code that will build a search query based on an array of strings, finding results for which at least one index field matches each of the strings submitted. Simplified, it looks like this:
(Note: $words is an array constructed from user input.)
$query = new Zend_Search_Lucene_Search_Query_Boolean();
foreach ($words as $word) {
    $term1 = new Zend_Search_Lucene_Index_Term($word, $fieldname1);
    $term2 = new Zend_Search_Lucene_Index_Term($word, $fieldname2);
    // Match the word in either field
    $multiq = new Zend_Search_Lucene_Search_Query_MultiTerm();
    $multiq->addTerm($term1);
    $multiq->addTerm($term2);
    // true marks the subquery as required
    $query->addSubquery($multiq, true);
}
$hits = $index->find($query);
What I would like to do is replace $word with ($word . '*') — appending an asterisk to the end of each word, turning it into a wildcard term.
But then, $multiq would have to be a Zend_Search_Lucene_Search_Query_Wildcard instead of a Zend_Search_Lucene_Search_Query_MultiTerm, and I don't think I would still be able to add multiple Index_Terms to each $multiq.
Is there a way to construct a query that's both a Wildcard and a MultiTerm?
Thanks!

Not in the way you're hoping to achieve it, unfortunately:
"Lucene supports single and multiple character wildcard searches within single terms (but not within phrase queries)."
And even if it were possible, it would probably not be a good idea:
"Wildcard, range and fuzzy search queries may match too many terms. It may cause incredible search performance downgrade."
I imagine the way to go, if you insist on multiple wildcard terms, would be to execute two separate searches, one for each wildcarded term, and bundle the results together.
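The "bundle the results together" step could look roughly like the sketch below. It assumes each wildcard search has already been run separately (one per field and per word) and that every hit has been reduced to a plain array with id and score keys. unionHits() and intersectHits() are hypothetical helper names, not part of Zend_Search_Lucene, and summing the scores is only one possible way to rank the merged hits.

```php
<?php
// Hypothetical helpers. Each hit list is an array of ['id' => int, 'score' => float];
// with Zend_Search_Lucene these values would be copied from the hit objects
// returned by $index->find().

/** Union of hit lists: a document matches if it appears in any list. */
function unionHits(array ...$hitLists): array
{
    $merged = [];
    foreach ($hitLists as $hits) {
        foreach ($hits as $hit) {
            $id = $hit['id'];
            // Keep the best score seen for each document
            if (!isset($merged[$id]) || $hit['score'] > $merged[$id]) {
                $merged[$id] = $hit['score'];
            }
        }
    }
    return $merged; // id => score
}

/** Intersection of id => score maps: a document must appear in every map. */
function intersectHits(array ...$maps): array
{
    if ($maps === []) {
        return [];
    }
    $common = array_shift($maps);
    foreach ($maps as $map) {
        $common = array_intersect_key($common, $map);
        foreach ($common as $id => $score) {
            $common[$id] = $score + $map[$id]; // naive combined score
        }
    }
    arsort($common); // best combined score first
    return $common;
}

// Word 1 ("luc*") searched in two fields, then word 2 ("zen*") in two fields
$word1 = unionHits(
    [['id' => 1, 'score' => 0.9], ['id' => 2, 'score' => 0.4]],
    [['id' => 2, 'score' => 0.7]]
);
$word2 = unionHits(
    [['id' => 2, 'score' => 0.5]],
    [['id' => 3, 'score' => 0.2]]
);

// Only document 2 matches both words
print_r(array_keys(intersectHits($word1, $word2)));
```

This mirrors the original query's logic: either field may match a word (union), but every word must match somewhere (intersection).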

Related

How do I make a Case Insensitive, Partial Text Search Engine that uses Regex with MongoDB and PHP?

I'm trying to improve the search bar in my application. If a user types "Titan" into the search bar right now, the application will retrieve the movie "Titanic" from MongoDB every time I use the following regex function:
require 'dbconnection.php';
if ($_SERVER["REQUEST_METHOD"] == "POST") {
    $input = $_REQUEST['input'];
    $query = $collection->find(['movie' => new MongoDB\BSON\Regex($input)]);
}
I can also make collections case insensitive by creating the following index within the Mongo shell, so if a user types "tiTAnIc" into the search bar, the application will retrieve the movie "Titanic" from MongoDB:
db.createCollection("c1", { collation: { locale: 'en_US', strength: 2 } } )
db.c1.createIndex( { movie: 1 } )
I am not capable of combining these two features at the same time, however. The index above will only remove case sensitivity when I change my query to this:
$query=$collection->find( [ 'movie' => $input] );
If I use the regex query at the top in tandem with the collated index, it will ignore the regex part, so if I type "Titan," it doesn't retrieve anything; if I type "Titanic," however, it will successfully retrieve "Titanic" (because "Titanic" is the exact word stored in my database).
Any advice?
Beware: a regex search on an indexed field will affect performance, as stated in the $regex docs:
Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.
The underlying issue is that MongoDB relies on a prefix expression (e.g. /^acme/) to look up a $regex efficiently in an index.
For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. This allows MongoDB to construct a “range” from that prefix and only match against those values from the index that fall within that range.
So it needs to be changed like this:
$query=$collection->find(['movie' => new MongoDB\BSON\Regex('^'.$input, 'i')]);
I suggest you design your collection more carefully.
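One possible redesign, sketched here as an assumption rather than a confirmed fix, is to store a lowercase copy of the title in an extra field (called movie_lower below, a made-up name) and run a case-sensitive, anchored regex against it. The helper buildPrefixPattern() is hypothetical. It also escapes the user input so that characters such as "." or "*" are matched literally. This only covers prefix matches ("Titan" finding "Titanic"), not arbitrary substrings.

```php
<?php
/**
 * Build an anchored, escaped prefix pattern from raw user input.
 * Hypothetical helper; it assumes the stored field is already lowercase.
 */
function buildPrefixPattern(string $input): string
{
    // strtolower() only handles ASCII reliably; mb_strtolower() would be
    // the choice for non-ASCII titles.
    $normalized = strtolower(trim($input));
    // preg_quote() escapes regex metacharacters in the user input
    return '^' . preg_quote($normalized);
}

// Intended use with the MongoDB PHP library (not executed here):
// $cursor = $collection->find([
//     'movie_lower' => new MongoDB\BSON\Regex(buildPrefixPattern($input)),
// ]);

echo buildPrefixPattern(' TiTAn '), PHP_EOL; // ^titan
```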
Related:
https://stackoverflow.com/a/46228114/6118551
https://scalegrid.io/blog/mongodb-regular-expressions-indexes-performance/

PHP Questions. Loops or If statement?

I am trying to learn PHP while I write a basic application. I want a process whereby old words get put into an array, $oldWords = array();, so that all $words that have been used get inserted using array_push($oldWords, $words).
Every time the code is executed, I want a process that finds a new word from $wordList = array(...). However, I don't want to select any words that have already been used or are in $oldWords.
Right now I'm thinking about how I would go about this. I've been considering finding a new word via $wordChooser = rand(1, $totalWords);. I've also thought of using an if/else statement, but the problem is that if array_search($word, $oldWords) finds the word, I would need to pick a new word and check it again.
This process seems extremely inefficient. I'm considering a loop, but which one, and what would be a good way to solve the issue?
Thanks
I'm a bit confused, PHP dies at the end of the execution of the script. However you are generating this array, could you also not at the same time generate what words haven't been used from word list? (The array_diff from all words to used words).
Or else, if there's another reason I'm missing, why can't you just use a loop and quickly find the first word in $wordList that's not in $oldWords in O(n)?
function generate_new_word(array $wordList, array $oldWords) {
    foreach ($wordList as $word) {
        if (!in_array($word, $oldWords)) {
            return $word; // Word hasn't been used yet
        }
    }
    return null; // All words have been used
}
Or, just do an array difference (less efficient though, since best case is it has to go through the entire array, while for the above it only has to go to the first word)
EDIT: For random
$newWordArray = array_diff($allWords, $oldWords); // All words in $allWords that are not in $oldWords
// array_rand() returns a key (and fails on an empty array), so look the word up by that key
$randomNewWord = $newWordArray ? $newWordArray[array_rand($newWordArray)] : null;
Or unless you're interested in making your own datatype, the best case for this could possibly be in O(log(n))
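Putting the pieces together, a minimal sketch of the loop the question asks about might look like this. pickNewWord() is a hypothetical helper name. It only tracks used words for the duration of one script run; as noted above, PHP state ends with the request, so $oldWords would have to be stored somewhere (session, file, database) to survive between runs.

```php
<?php
/**
 * Pick a random unused word and record it as used.
 * Returns null once every word has been used.
 */
function pickNewWord(array $wordList, array &$oldWords): ?string
{
    $unused = array_diff($wordList, $oldWords);
    if ($unused === []) {
        return null;
    }
    // array_rand() returns a key of $unused, not the word itself
    $word = $unused[array_rand($unused)];
    $oldWords[] = $word;
    return $word;
}

$wordList = ['apple', 'banana', 'cherry'];
$oldWords = ['banana'];

// Keeps drawing until nothing is left; no retry loop is needed because
// used words are removed before the random pick.
while (($word = pickNewWord($wordList, $oldWords)) !== null) {
    echo $word, PHP_EOL; // apple and cherry, in random order
}
```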

Mongodb like statement with array

I am trying to save some DB calls by replacing a looped bit of code with a single query. Before, I was simply adding to the like statements in a loop before firing off the query, but I can't get the same idea going in Mongo. I'd appreciate any ideas.
I am basically trying to do a like, but with the value as an array
('app', replaces 'mongodb' down to my CI setup )
Here's how I was doing it pre mongofication:
foreach ($workids as $workid):
$this->ci->app->or_like('work',$workid) ;
endforeach;
$query = $this->ci->db->get("who_users");
$results = $query->result();
print_r($results);
and this is how I was hoping I could get it to work, but no joy here, that function is only designed to accept strings
$query = $this->ci->app->like('work',$workids,'.',TRUE,TRUE)->get("who_users");
print_r($query);
If anyone can think of a cunning way to get my returned array with a single call again, that would be great. I've not found any documentation on this sort of query. The only way I can think of is to loop over the query and push the results into a new array, but that is really going to hurt if my app scales up.
Are you using codeigniter-mongodb-library? Based on the existing or_like() documentation, it looks like CI wraps each match with % wildcards. The equivalent query in Mongo would be a series of regex matches in an $or clause:
db.who_users.find({
    $or: [
        { work: /.*workIdA.*/ },
        { work: /.*workIdB.*/ },
        ...
    ]
});
Unfortunately, this is going to be quite inefficient unless (1) the work field is indexed and (2) your regexes are anchored with some constant value (e.g. /^workId.*/). This is described in more detail in Mongo's regex documentation.
Based on your comments to the OP, it looks like you're storing multiple IDs in the work field as a comma-delimited string. To take advantage of Mongo's schema, you should model this as an array of strings. Thereafter, when you query on the work field, Mongo will consider all values in the array (discussed here).
db.who_users.find({
    work: "workIdA"
});
This query would match a record whose work value was ["workIdA", "workIdB"]. And if we need to search for one of a set of IDs (taking this back to your OR query), we can extend this example with the $in operator:
db.who_users.find({
    work: { $in: ["workIdA", "workIdB", ...] }
});
If that meets your needs, be sure to index the work field as well.
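On the PHP side, the same idea might be sketched as follows. splitWorkIds() and workFilter() are hypothetical helpers: the first converts an existing comma-delimited value into the array form suggested above, the second builds the $in filter as a plain PHP array. How that array is handed to the database depends on the library in use, so the find() call is only shown as a comment.

```php
<?php
/** Split a comma-delimited string such as "workIdA, workIdB" into an array of IDs. */
function splitWorkIds(string $csv): array
{
    $ids = array_map('trim', explode(',', $csv));
    // Drop empty entries and duplicates, then reindex from 0
    return array_values(array_unique(array_filter($ids, 'strlen')));
}

/** Build the filter document for "work contains any of these IDs". */
function workFilter(array $workIds): array
{
    return ['work' => ['$in' => array_values($workIds)]];
}

$ids = splitWorkIds('workIdA, workIdB,,workIdA');

// With the MongoDB PHP library this would be passed along the lines of:
// $cursor = $collection->find(workFilter($ids));

echo json_encode(workFilter($ids)), PHP_EOL;
// {"work":{"$in":["workIdA","workIdB"]}}
```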

How search for thousands of possible keywords in a string

I have a database of thousands (about 10,000) keywords. When a user posts a blog on my site, I would like to automatically search for the keywords in the text, and tag the post with any direct matches.
So far, all I can think of is to pull the ENTIRE list of keywords, loop through it, and check for the presence of each tag in the post...which seems very inefficient (that's 10,000 loops).
Is there a more common way to do this? Should I maybe use a MySQL query to limit it down?
I imagine this is not a totally rare task.
No, just don't do that.
Instead of looping through 10000 elements, it is better to extract the words from the sentence or text, then add it to the SQL query and that way you will have all the needed records. This is surely more efficient than the solution you proposed.
You can do this in the following way using PHP:
$possible_keywords = preg_split('/\b/', $your_text, -1, PREG_SPLIT_NO_EMPTY); // flags are the fourth argument; -1 means no limit
The above will split the text on the words' boundaries and will return no empty elements in the array.
Then you just can create the SQL query in a fashion similar to the following:
SELECT * FROM `keywords` WHERE `keywords`.`keyword` IN (...)
(just put the comma-separated list of extracted words in the bracket)
You should probably filter the $possible_keywords array before making the query (to include only the keywords with appropriate length and to exclude duplicates), and make sure the keyword column is indexed.
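The filtering and query-building steps described above could be sketched like this. extractCandidates() and buildKeywordQuery() are hypothetical helper names, and the sketch splits on non-word characters (\W+) instead of \b so that whitespace and punctuation never end up in the list. It uses placeholders rather than pasting the words into the SQL string, and it can only find single-word keywords.

```php
<?php
/** Extract unique, lowercased candidate words of at least $minLength characters. */
function extractCandidates(string $text, int $minLength = 3): array
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_filter($words, function ($word) use ($minLength) {
        return strlen($word) >= $minLength;
    });
    return array_values(array_unique($words));
}

/** Build a parameterized IN (...) query; returns [sql, params], or null if there is nothing to look up. */
function buildKeywordQuery(array $candidates): ?array
{
    if ($candidates === []) {
        return null; // IN () is not valid SQL
    }
    $placeholders = implode(', ', array_fill(0, count($candidates), '?'));
    $sql = "SELECT * FROM `keywords` WHERE `keyword` IN ($placeholders)";
    return [$sql, $candidates];
}

$candidates = extractCandidates('The cat, the CAT and a dog!');
[$sql, $params] = buildKeywordQuery($candidates);

// With PDO (assumed here) the query would then be run as:
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);

echo $sql, PHP_EOL;
print_r($params);
```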
I don't know what language you intend on using, but a standard trie (prefix tree) would solve this issue, if you were feeling up to it.
I guess you could build a regular expression dynamically which will enable you to match keywords inside a specific string. You can package all this in a class which does the grunt work.
class KeywordTagger {
    public static function getTags($body) {
        if (preg_match_all(self::getRegex(), $body, $keywords)) {
            return $keywords[0];
        } else {
            return null;
        }
    }

    private static $regex;

    private static function getRegex() {
        if (self::$regex === null) {
            // Load keywords from DB here
            $keywords = KeywordsTable::getAllKeywords();
            // Escape regex metacharacters in each keyword
            $keywords = array_map('KeywordTagger::pregQuoteWords', $keywords);
            // Base regex
            $regex = '/\b(?:%s)\b/ui';
            // Build final regex
            self::$regex = sprintf($regex, implode('|', $keywords));
        }
        return self::$regex;
    }

    private static function pregQuoteWords($word) {
        return preg_quote($word, '/');
    }
}
Then, all you have to do is, when a user writes a post, run it through the class:
$tags = KeywordTagger::getTags($_POST['messageBody']);
For a small speed up, you could cache the built regex using memcached, APC or a good-old file-based cache.
Well, I think that PHP's stripos is already quite optimized. If you want to optimize this search further, you would have to take advantage of similarities between your keywords (e.g. instead of looking for "foobar" and then for "foobaz", look for "fooba" and then check for each "fooba" if it's followed by a 'r', a 'z', or none). But this would require some sort of tree-representation of your keywords, like:
      root (empty string)
             |
           fooba
          /     \
     foobar     foobaz
Yes, that's a trie.
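A minimal sketch of such a trie in PHP, using nested arrays, is shown below. buildTrie() and findKeywords() are hypothetical names. The sketch works byte by byte and lowercases with strtolower(), so it is only reliable for ASCII keywords, and it reports keywords found anywhere in the text, including inside longer words.

```php
<?php
/** Build a trie as nested arrays; the '$end' key marks a complete keyword. */
function buildTrie(array $keywords): array
{
    $root = [];
    foreach ($keywords as $keyword) {
        if ($keyword === '') {
            continue;
        }
        $node = &$root;
        foreach (str_split(strtolower($keyword)) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node['$end'] = $keyword; // multi-character key, cannot clash with a single char
        unset($node);
    }
    return $root;
}

/** Return every keyword from the trie that occurs somewhere in $text. */
function findKeywords(array $trie, string $text): array
{
    $text = strtolower($text);
    $length = strlen($text);
    $found = [];
    for ($start = 0; $start < $length; $start++) {
        $node = $trie;
        // Walk down the trie for as long as the text keeps matching
        for ($i = $start; $i < $length && isset($node[$text[$i]]); $i++) {
            $node = $node[$text[$i]];
            if (isset($node['$end'])) {
                $found[$node['$end']] = true;
            }
        }
    }
    return array_keys($found);
}

$trie = buildTrie(['foobar', 'foobaz', 'php']);
print_r(findKeywords($trie, 'I like PHP and foobaz.'));
```

Each position in the text is followed down the tree only as far as it matches, so shared prefixes such as "fooba" are compared once rather than once per keyword.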

Searching numbers with Zend_Search_Lucene

So why does the first search example below return no results? And any ideas on how to modify the below code to make number searches possible would be much appreciated.
Create the index
$index = new Zend_Search_Lucene('/myindex', true);
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('ssn', '123-12-1234'));
$doc->addField(Zend_Search_Lucene_Field::Text('cats', 'Fluffy'));
$index->addDocument($doc);
$index->commit();
Search - NO RESULTS
$index = new Zend_Search_Lucene('/myindex', true);
$results = $index->find('123-12-1234');
Search - WITH RESULTS
$index = new Zend_Search_Lucene('/myindex', true);
$results = $index->find('Fluffy');
First, you need to change your text analyzer to one that includes numbers:
Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum() );
Then, for fields with numbers, use Zend_Search_Lucene_Field::Keyword instead of Zend_Search_Lucene_Field::Text.
This skips the creation of tokens and saves the value 'as is' in the index, so you can search by it. I don't know how it behaves with floats (it probably won't work: 3.0 is not going to match 3), but for natural numbers (like IDs) it works like a charm.
This is an effect of which Analyzer you have chosen.
I believe the default Analyzer will only index terms that match /[a-zA-Z]+/. This means that your SSN isn't being added to the index as a term.
Even if you switched to the text+numeric case-insensitive Analyzer, what you want still will not work. The expression for a term is /[a-zA-Z0-9]+/, which would mean the terms added to the index would be 123, 12, and 1234.
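The effect of those two expressions can be reproduced in plain PHP. The tokenize() function below is a hypothetical stand-in that only applies the regular expressions quoted above; it is not the actual Zend_Search_Lucene analyzer code.

```php
<?php
/**
 * Rough stand-in for an analyzer: return every substring that matches the
 * token pattern, lowercased.
 */
function tokenize(string $pattern, string $value): array
{
    preg_match_all($pattern, $value, $matches);
    return array_map('strtolower', $matches[0]);
}

print_r(tokenize('/[a-zA-Z]+/', '123-12-1234'));    // letters only: no tokens at all
print_r(tokenize('/[a-zA-Z0-9]+/', '123-12-1234')); // letters and digits: 123, 12, 1234
print_r(tokenize('/[a-zA-Z]+/', 'Fluffy'));         // fluffy
```

In neither case does the full string 123-12-1234 end up in the index as a single term.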
If you need 123-12-1234 to be seen as a valid term, you are probably going to need to extend Zend_Search_Lucene_Analysis_Analyzer_Common and make it so that 123-12-1234 is a term.
See
http://framework.zend.com/manual/en/zend.search.lucene.extending.html#zend.search.lucene.extending.analysis
Your other choice is to store the SSN as a Zend_Search_Lucene_Field::Keyword, since a keyword is not broken up into terms.
http://framework.zend.com/manual/en/zend.search.lucene.html#zend.search.lucene.index-creation.understanding-field-types
