How to search for thousands of possible keywords in a string - php

I have a database of thousands (about 10,000) keywords. When a user posts a blog on my site, I would like to automatically search for the keywords in the text, and tag the post with any direct matches.
So far, all I can think of is to pull the ENTIRE list of keywords, loop through it, and check for the presence of each tag in the post...which seems very inefficient (that's 10,000 loops).
Is there a more common way to do this? Should I maybe use a MySQL query to limit it down?
I imagine this is not a totally rare task.

No, just don't do that.
Instead of looping through 10,000 elements, it is better to extract the words from the text and put them into the SQL query; that way you get all the matching records in one go. This is surely more efficient than the solution you proposed.
You can do this in the following way using PHP:
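$possible_keywords = preg_split('/\b/', $your_text, -1, PREG_SPLIT_NO_EMPTY);
The above splits the text on word boundaries and returns no empty elements. The third argument of preg_split() is the limit (-1 means no limit), so the flag has to be passed as the fourth argument. Note that the pieces between the words (spaces, punctuation) are returned as well, so filter them out before building the query.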
Then you can create the SQL query in a fashion similar to the following:
SELECT * FROM `keywords` WHERE `keywords`.`keyword` IN (...)
(just put the comma-separated list of extracted words in the brackets, properly quoted and escaped or bound as parameters)
You should probably filter the $possible_keywords array before making the query (to include only words of an appropriate length and to exclude duplicates), and make sure the keyword column is indexed.
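A minimal sketch of the filtering and query-building steps described above. The helper names extractCandidateWords() and buildKeywordQuery(), the minimum length of 3, and the use of ? placeholders for a prepared statement are assumptions made for illustration; only the table and column names come from the query above.

```php
<?php
// Hypothetical helper: split text into unique, lower-cased candidate words.
function extractCandidateWords(string $text, int $minLength = 3): array
{
    // Split on runs of characters that are neither letters nor digits.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    // strtolower() only lower-cases ASCII letters; enough for this sketch.
    $words = array_map('strtolower', $words);
    // Drop words that are too short to be a keyword.
    $words = array_filter($words, function ($word) use ($minLength) {
        return strlen($word) >= $minLength;
    });
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($words));
}

// Hypothetical helper: build the IN (...) query with one placeholder per word.
// The caller should skip the query entirely when $words is empty.
function buildKeywordQuery(array $words): string
{
    $placeholders = implode(', ', array_fill(0, count($words), '?'));
    return "SELECT * FROM `keywords` WHERE `keywords`.`keyword` IN ($placeholders)";
}

$words = extractCandidateWords('PHP is fun, and PHP is fast.');
$sql = buildKeywordQuery($words);
```

With PDO, the resulting string could be passed to prepare() and the $words array to execute(), so the words never have to be escaped by hand.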

I don't know what language you intend on using, but a standard trie (prefix tree) would solve this issue, if you were feeling up to it.

I guess you could build a regular expression dynamically which will enable you to match keywords inside a specific string. You can package all this in a class which does the grunt work.
class KeywordTagger {
    static function getTags($body) {
        if(preg_match_all(self::getRegex(), $body, $keywords)) {
            return $keywords[0];
        } else {
            return null;
        }
    }

    private static $regex;

    private static function getRegex() {
        if(self::$regex === null) {
            // Load Keywords from DB here
            $keywords = KeywordsTable::getAllKeywords();
            // Let's escape
            $keywords = array_map('KeywordTagger::pregQuoteWords', $keywords);
            // Base Regex
            $regex = '/\b(?:%s)\b/ui';
            // Build Final
            self::$regex = sprintf($regex, implode('|', $keywords));
        }
        return self::$regex;
    }

    private static function pregQuoteWords($word) {
        return preg_quote($word, '/');
    }
}
Then, all you have to do is, when a user writes a post, run it through the class:
$tags = KeywordTagger::getTags($_POST['messageBody']);
For a small speed up, you could cache the built regex using memcached, APC or a good-old file-based cache.
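A rough sketch of the file-based variant of that caching idea. The helper name getCachedRegex(), the temp-file location and the three stand-in keywords are assumptions for illustration; in the class above the builder would be the body of getRegex().

```php
<?php
// Hypothetical helper: return the cached regex, or build it once and cache it.
function getCachedRegex(string $cacheFile, callable $buildRegex): string
{
    if (is_file($cacheFile)) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false && $cached !== '') {
            return $cached; // cache hit: no database access needed
        }
    }
    $regex = $buildRegex();
    file_put_contents($cacheFile, $regex, LOCK_EX);
    return $regex;
}

$cacheFile = sys_get_temp_dir() . '/keyword_regex_' . uniqid() . '.cache';
$builds = 0;
$build = function () use (&$builds) {
    $builds++;
    $keywords = ['php', 'mysql', 'apache']; // stand-in for the database lookup
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $keywords);
    return '/\b(?:' . implode('|', $quoted) . ')\b/ui';
};

$first = getCachedRegex($cacheFile, $build);  // builds and writes the cache
$second = getCachedRegex($cacheFile, $build); // served from the cache file
unlink($cacheFile); // clean up the demo file
```

The cache file has to be deleted (or rebuilt) whenever the keyword table changes, otherwise new keywords will not be picked up.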

Well, I think that PHP's stripos is already quite optimized. If you want to optimize this search further, you would have to take advantage of similarities between your keywords (e.g. instead of looking for "foobar" and then for "foobaz", look for "fooba" and then check for each "fooba" if it's followed by a 'r', a 'z', or none). But this would require some sort of tree-representation of your keywords, like:
root (empty string)
|
fooba
/ \
foobar foobaz
Yes, that's a trie.
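To make the trie idea concrete, here is a small sketch using nested PHP arrays. The function names buildTrie() and findKeywords() and the 'end' marker key are assumptions for illustration. It works on bytes, lower-cases only ASCII, and matches keywords anywhere in the text (not only at word boundaries), so treat it as a starting point rather than a finished tagger.

```php
<?php
// Build a character trie as nested arrays; the key 'end' marks a complete keyword.
// Child keys are single characters, so they can never collide with 'end'.
function buildTrie(array $keywords): array
{
    $trie = [];
    foreach ($keywords as $keyword) {
        if ($keyword === '') {
            continue;
        }
        $node = &$trie;
        foreach (str_split(strtolower($keyword)) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node['end'] = $keyword;
        unset($node); // break the reference before the next keyword
    }
    return $trie;
}

// Walk the trie from every start position and collect the keywords found.
function findKeywords(array $trie, string $text): array
{
    $text = strtolower($text);
    $length = strlen($text);
    $found = [];
    for ($start = 0; $start < $length; $start++) {
        $node = $trie;
        for ($i = $start; $i < $length && isset($node[$text[$i]]); $i++) {
            $node = $node[$text[$i]];
            if (isset($node['end'])) {
                $found[$node['end']] = $node['end'];
            }
        }
    }
    return array_values($found);
}

$trie = buildTrie(['foobar', 'foobaz', 'php']);
$tags = findKeywords($trie, 'I wrote Foobaz in PHP');
```

The shared prefix "fooba" is stored once, so "foobar" and "foobaz" are checked in a single walk instead of two separate stripos() calls.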

Related

PHP Questions. Loops or If statement?

I am trying to learn PHP while I write a basic application. I want a process whereby old words get put into an array, $oldWords = array();, so that all $words that have been used get inserted using array_push($oldWords, $words).
Every time the code is executed, I want a process that finds a new word from $wordList = array(...). However, I don't want to select any words that have already been used or are in $oldWords.
Right now I'm thinking about how I would go about this. I've been considering finding a new word via $wordChooser = rand (1, $totalWords); I've been thinking of using an if/else statement, but the problem is if array_search($word, $doneWords) finds a word, then I would need to renew the word and check it again.
This process seems extremely inefficient, and I'm considering a loop function but, which one, and what would be a good way to solve the issue?
Thanks
I'm a bit confused: PHP dies at the end of the execution of the script. However you are generating this array, could you not also generate, at the same time, the list of words from the word list that haven't been used yet? (The array_diff of all words and used words.)
Or else, if there's another reason I'm missing, why can't you just use a loop and quickly find the first word in $wordList that's not in $oldWords in O(n)?
function generate_new_word(array $wordList, array $oldWords) {
    foreach ($wordList as $word) {
        if (!in_array($word, $oldWords)) {
            return $word; // Word hasn't been used
        }
    }
    return null; // All words have been used
}
Or just do an array difference (less efficient though, since even in the best case it has to go through the entire array, while the loop above can stop at the first unused word).
EDIT: For random
$newWordArray = array_diff($allWords, $oldWords); // All words in $allWords that are not in $oldWords
$randomNewWord = $newWordArray ? $newWordArray[array_rand($newWordArray)] : null; // array_rand() returns a key, so look the word up; null if no words are left
Unless you're interested in building your own data type, this is about as good as it gets; with a custom structure, the best case for this could possibly be O(log(n)).
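Wrapped up as a function, the random pick could look roughly like this. The name pickUnusedWord() and the sample words are made up for illustration.

```php
<?php
// Hypothetical helper: return a random word from $wordList that is not in
// $oldWords, or null when every word has already been used.
function pickUnusedWord(array $wordList, array $oldWords): ?string
{
    // array_diff() keeps the original keys, so reindex before picking.
    $unused = array_values(array_diff($wordList, $oldWords));
    if ($unused === []) {
        return null;
    }
    // array_rand() returns a random key, not the value itself.
    return $unused[array_rand($unused)];
}

$wordList = ['apple', 'banana', 'cherry'];
$oldWords = ['apple', 'cherry'];

$word = pickUnusedWord($wordList, $oldWords); // only 'banana' is left
$none = pickUnusedWord($wordList, $wordList); // nothing left to pick
```

No retry loop is needed, because used words are removed before the random pick rather than checked afterwards.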

evaluate string with array php

I have a string like
"subscription link :%list:subscription%
unsubscription link :%list:unsubscription%
------- etc"
AND
I have an array like
$variables['list']['subscription']='example.com/sub';
$variables['list']['unsubscription']='example.com/unsub';
----------etc.
I need to replace %list:subscription% with $variables['list']['subscription'], and so on.
Here list is the first index and subscription is the second index of $variables.
Is it possible to use eval() for this? I have no idea how to do this, please help me.
str_replace should work for most cases:
foreach ($variables as $key_l1 => $value_l1) {
    foreach ($value_l1 as $key_l2 => $value_l2) {
        $string = str_replace('%' . $key_l1 . ':' . $key_l2 . '%', $value_l2, $string);
    }
}
eval() has to parse and compile the string every time it runs, which is comparatively expensive, so unless you've got some serious work cut out for eval it's going to slow you down.
Besides the speed issue, eval can also be exploited if the code it runs originates from public users.
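As an alternative sketch to the nested str_replace loop above, the placeholders can be collected into one map and replaced in a single strtr() call. The helper name buildPlaceholderMap() is an assumption; the sample data is taken from the question.

```php
<?php
// Hypothetical helper: flatten $variables into a '%outer:inner%' => value map.
function buildPlaceholderMap(array $variables): array
{
    $map = [];
    foreach ($variables as $outerKey => $values) {
        foreach ($values as $innerKey => $value) {
            $map['%' . $outerKey . ':' . $innerKey . '%'] = $value;
        }
    }
    return $map;
}

$variables = [
    'list' => [
        'subscription'   => 'example.com/sub',
        'unsubscription' => 'example.com/unsub',
    ],
];
$string = "subscription link :%list:subscription%\n"
        . "unsubscription link :%list:unsubscription%";

// strtr() with an array replaces all keys in one pass and never re-scans
// text it has already replaced.
$result = strtr($string, buildPlaceholderMap($variables));
```

Because strtr() does not re-scan replaced text, a value that happens to contain something like %list:subscription% is not expanded a second time.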
You could write the string to a file, enclosing the string in a function definition within the file, and give the file a .php extension.
Then you include the php file in your current module and call the function which will return the array.
I would use a regular expression and do it like this:
$stringWithLinks = "";
$variables = array();
// your link pattern, in this case
// the i at the end makes it case insensitive
$pattern = '/%([a-z]+):([a-z]+)%/i';
// http://cz2.php.net/manual/en/function.preg-replace-callback.php
$stringWithReplacedMarkers = preg_replace_callback(
    $pattern,
    function ($matches) use ($variables) {
        // important fact: $matches[0] contains the whole matched string
        // $variables must be imported with "use", closures do not see outer variables
        return $variables[$matches[1]][$matches[2]];
    },
    $stringWithLinks
);
You can obviously write the pattern right inside, I simply want to make it clearer. Check PHP manual for more regular expression possibilities. The method I used is here:
http://cz2.php.net/manual/en/function.preg-replace-callback.php

Is it possible to render the content of a file without using eval in PHP?

I'm trying to create a simple framework in PHP which will include a file (index.bel) and render the variables within the file. For instance, the index.bel could contain the following:
<h1>{$variable_name}</h1>
How would I achieve this without using eval or demanding the users of the framework to type index.bel like this:
$index = "<h1>{$variable_name}</h1>";
In other words: Is it possible to render the content of a file without using eval? A working solution for my problem is this:
index.php:
<?php
$variable_name = 'Welcome!';
eval ('print "'.file_get_contents ("index.bel").'";');
index.bel:
<h1>{$variable_name}</h1>
I know many have recommended that you use a template engine, but if you want to create your own, the easiest way in your case is to use str_replace:
$index = file_get_contents ("index.bel");
$replace_from = array ('$variable_a', '$variable_b');
$replace_to = array ($var_a_value, $var_b_value);
$index = str_replace($replace_from,$replace_to,$index);
That covers simple variable replacement, but you will soon want more tags and more functionality, and one way to get there is preg_replace_callback. You might want to take a look at it, as it will eventually make it possible to replace variables, include other files ({include:file.bel}) and replace text like {img:foo.png}.
EDIT: having read more of your comments, you are on your way to creating your own framework. Take a look at preg_replace_callback, as it gives you more ways to handle things.
A very simple example:
...
$index = preg_replace_callback('/\{([^}]+)\}/', 'preg_callback', $index);
...
function preg_callback($matches) {
    $s = explode(':', $matches[1]); // matched string split on ":"
    $f = 'func_' . strtolower($s[0]); // all handlers are named like func_img, func_include, ...
    if (!function_exists($f)) {
        return $matches[0]; // unknown tag: leave it untouched
    }
    return $f($s); // call the handler with all split parameters
}
function func_img($s) {
    return '<img src="' . $s[1] . '" />';
}
From here you can improve this in many ways, for example by splitting the functionality into classes, if you want.
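For the original {$variable_name} syntax specifically, a small sketch along the same lines might look like this. The function name renderTemplate(), the decision to HTML-escape values and the decision to leave unknown placeholders untouched are assumptions, not something the question prescribes.

```php
<?php
// Hypothetical helper: replace each {$name} in $template with $vars['name'].
function renderTemplate(string $template, array $vars): string
{
    return preg_replace_callback(
        '/\{\$([A-Za-z_][A-Za-z0-9_]*)\}/',
        function (array $matches) use ($vars) {
            $name = $matches[1];
            // Unknown placeholders stay as they are; known values are escaped for HTML.
            return array_key_exists($name, $vars)
                ? htmlspecialchars((string) $vars[$name], ENT_QUOTES, 'UTF-8')
                : $matches[0];
        },
        $template
    );
}

// In the real framework this string would come from file_get_contents('index.bel').
$template = '<h1>{$variable_name}</h1>';
$html = renderTemplate($template, ['variable_name' => 'Welcome!']);
```

Since the template is only ever treated as text, nothing in index.bel is executed, which is the main point of avoiding eval.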
Yes, this is possible, but why are you making your own framework? The code you provided looks a lot like a Smarty template. You could take a look at how they did it.
A possible way to run that code is to split it into pieces. You split on the dollar sign and on the next symbol that is not an underscore, a letter or a number. Once you have done that, you can resolve the name to a variable.
$var = 'variable_name'; // Split it first
echo $$var; // Get the given variable
Did you mean something like this?

Regex for random text replacement on input stream

I have a paragraph of text below that I want to use as an input source. I want the doSpin() function to take the stream of text and pick one value at random from each [%group of replacement candidates%].
This [%should|ought|would|could%] make
it much [%more
convenient|faster|easier%] and help
reduce duplicate content.
So this sentence, when filtered, could potentially result in any of the following when input...
1) This should make it much more convenient and help reduce duplicate content.
2) This ought make it much faster and help reduce duplicate content.
3) This would make it much easier and help reduce duplicate content.
// So the code stub would be...
$content = file_get_contents('path to my input file');
function doSpin($content)
{
// REGEX MAGIC HERE
return $content;
}
$myNewContent = doSpin($content);
*I know zilch of Regex. But I know what I'm trying to do requires it.
Any ideas?
Use preg_replace_callback():
function doSpin($content) {
    // the s modifier lets . match newlines, so a group may span several lines
    return preg_replace_callback('!\[%(.*?)%\]!s', 'pick_one', $content);
}
function pick_one($matches) {
    $choices = explode('|', $matches[1]);
    return $choices[array_rand($choices)]; // array_rand() returns a key, not the value
}
The way this works is that it searches for [%...%] and captures what's in between. That's passed as $matches[1] to the callback (as it is the first captured group). That group is split on | using explode(), array_rand() picks a random key, and the choice stored under that key is returned.
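A self-contained variant for trying this out on the sample paragraph from the question. The name spinText() and the whitespace clean-up are additions for illustration: the sample wraps "more convenient" across two lines, so the chosen text is collapsed back onto a single line.

```php
<?php
// Hypothetical variant of doSpin(): pick one candidate per [%...%] group.
function spinText(string $content): string
{
    return preg_replace_callback('/\[%(.*?)%\]/s', function (array $matches) {
        $choices = explode('|', $matches[1]);
        $choice = $choices[array_rand($choices)];
        // Collapse line breaks and repeated spaces inside the chosen text.
        return trim(preg_replace('/\s+/', ' ', $choice));
    }, $content);
}

$input = "This [%should|ought|would|could%] make\n"
       . "it much [%more\nconvenient|faster|easier%] and help\n"
       . "reduce duplicate content.";

$output = spinText($input);
echo $output, "\n";
```

Each run can print a different combination, since every group is resolved independently.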

Multi-Term Wildcard queries in Lucene?

I'm using Zend_Search_Lucene, the PHP port of Java Lucene. I currently have some code that will build a search query based on an array of strings, finding results for which at least one index field matches each of the strings submitted. Simplified, it looks like this:
(Note: $words is an array constructed from user input.)
$query = new Zend_Search_Lucene_Search_Query_Boolean();
foreach ($words as $word) {
    $term1 = new Zend_Search_Lucene_Index_Term($word, $fieldname1);
    $term2 = new Zend_Search_Lucene_Index_Term($word, $fieldname2);
    $multiq = new Zend_Search_Lucene_Search_Query_MultiTerm();
    $multiq->addTerm($term1);
    $multiq->addTerm($term2);
    $query->addSubquery($multiq, true);
}
$hits = $index->find($query);
What I would like to do is replace $word with ($word . '*') — appending an asterisk to the end of each word, turning it into a wildcard term.
But then, $multiq would have to be a Zend_Search_Lucene_Search_Query_Wildcard instead of a Zend_Search_Lucene_Search_Query_MultiTerm, and I don't think I would still be able to add multiple Index_Terms to each $multiq.
Is there a way to construct a query that's both a Wildcard and a MultiTerm?
Thanks!
Not in the way you're hoping to achieve it, unfortunately:
"Lucene supports single and multiple character wildcard searches within single terms (but not within phrase queries)."
And even if it were possible, it would probably not be a good idea:
"Wildcard, range and fuzzy search queries may match too many terms. It may cause incredible search performance downgrade."
I imagine the way to go, if you insist on multiple wildcard terms, would be to execute two separate searches, one for each wildcarded term, and bundle the results together.
