I have two lists of words, say LIST1 and LIST2. I want to compare LIST1 against LIST2 to find the duplicates, but the comparison should also match the plural and the -ing form of each word. For example:
Suppose LIST1 has the word "account" and LIST2 has the words "accounts" and "accounting". When I compare them, the result should show two matches for the word "account".
I am doing this in PHP and have the lists in MySQL tables.
You can use a technique called porter stemming to map each list entry to its stem, then compare the stems. An implementation of the Porter Stemming algorithm in PHP can be found here or here.
What I would do is take your word and compare it directly to LIST2, and at the same time remove your word from every word you're comparing against, looking for a leftover "ing", "s" or "es" that denotes a plural or -ing form (this should be accurate enough). If not, you'll have to write an algorithm for making plurals out of words, as it is not as simple as adding an "s".
Duplicate Ending List
s
es
ing
LIST1
Gas
Test
LIST2
Gases
Tests
Testing
Now compare LIST1 to LIST2. In the same loop, do a direct comparison of the items and a second one in which the word from LIST1 is removed from the current word in LIST2. Then just check whether the remainder is in the Duplicate Ending List.
Hope that makes sense.
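The loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name findSuffixMatches() and the hard-coded ending list are hypothetical, the comparison is made case-insensitive, and irregular forms (for example "run" and "running") are not handled.

```php
<?php
// Hypothetical helper: for each word in $list1, collect the words in $list2
// that are either identical or equal to the word plus one of the endings.
function findSuffixMatches(array $list1, array $list2, array $endings = ['s', 'es', 'ing']): array
{
    $matches = [];
    foreach ($list1 as $base) {
        $baseLower = strtolower($base);
        foreach ($list2 as $candidate) {
            $candLower = strtolower($candidate);
            // The candidate must start with the base word
            if (strncmp($candLower, $baseLower, strlen($baseLower)) !== 0) {
                continue;
            }
            // "Remove" the base word and look at what is left over
            $rest = (string) substr($candLower, strlen($baseLower));
            if ($rest === '' || in_array($rest, $endings, true)) {
                $matches[$base][] = $candidate;
            }
        }
    }
    return $matches;
}

$result = findSuffixMatches(['Gas', 'Test'], ['Gases', 'Tests', 'Testing']);
print_r($result);
```

With the sample lists above, "Gas" is matched with "Gases", and "Test" with "Tests" and "Testing".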
The problem with that is, in English at least, plurals are not all standard extensions, nor are present participles. You can make an approximation by using all words +'ing' and +'s', but that will give false positives and negatives.
You can handle it directly in MySQL if you wish.
SELECT DISTINCT l2.word
FROM LIST1 l1, LIST2 l2
WHERE l1.word = l2.word OR CONCAT(l1.word, 's') = l2.word OR CONCAT(l1.word, 'ing') = l2.word;
This function will output the plural of a word.
http://www.exorithm.com/algorithm/view/pluralize
Something similar could be written for gerunds and present participles (ing forms)
You might consider using the Doctrine Inflector class in conjunction with a stemmer for this.
Here's the algorithm at a high level
Split search string on spaces, process words individually
Lowercase the search word
Strip special characters
Singularize, replace differing portion with wildcard ('%')
Stem, replace differing portion with wildcard ('%')
Here's the function I put together
/**
* Use inflection and stemming to produce a good search string to match subtle
* differences in a MySQL table.
*
* @param string $sInputString The string you want to base the search on
* @param string $sSearchTable The table you want to search in
* @param string $sSearchField The field you want to search
*/
function getMySqlSearchQuery($sInputString, $sSearchTable, $sSearchField)
{
$aInput = explode(' ', strtolower($sInputString));
$aSearch = [];
foreach($aInput as $sInput) {
$sInput = str_replace("'", '', $sInput);
//--------------------
// Inflect
//--------------------
$sInflected = Inflector::singularize($sInput);
// XOR the two strings: identical leading bytes become "\0", so strspn() returns
// the length of the common prefix. Where the strings diverge, keep the shared
// prefix and replace the rest with a % (wildcard) for the MySQL query
$iPosition = strspn($sInput ^ $sInflected, "\0");
if($iPosition < strlen($sInput)) {
$sInput = substr($sInflected, 0, $iPosition) . '%';
}
//--------------------
// Stem
//--------------------
$sStemmed = stem_english($sInput);
// Same trick for the stem: replace the part of the stemmed string where it
// differs from the input string with a % (wildcard) for the MySQL query
$iPosition = strspn($sInput ^ $sStemmed, "\0");
if($iPosition !== null && $iPosition < strlen($sInput)) {
$aSearch[] = substr($sStemmed, 0, $iPosition) . '%';
} else {
$aSearch[] = $sInput;
}
}
$sSearch = implode(' ', $aSearch);
return "SELECT * FROM $sSearchTable WHERE LOWER($sSearchField) LIKE '$sSearch';";
}
Which I ran with several test strings
Input String: Mary's Hamburgers
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'mary% hamburger%';
Input String: Office Supplies
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'offic% suppl%';
Input String: Accounting department
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'account% depart%';
Probably not perfect, but it's a good start anyway! Where it will fall down is when multiple matches are returned. There's no logic to determine the best match. That's where things like MySQL fulltext and Lucene come in. Thinking about it a little more, you might be able to use levenshtein to rank multiple results with this approach!
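As a rough illustration of the levenshtein idea, the rows returned by the LIKE query could be re-ranked in PHP by their edit distance to the original input. The function name rankByLevenshtein() and the sample candidates are made up for this sketch; it also assumes the strings are short ASCII text (older PHP versions limit levenshtein() to 255 characters).

```php
<?php
// Hypothetical helper: sort candidate strings so that the one closest to
// $input (smallest Levenshtein distance, case-insensitive) comes first.
function rankByLevenshtein(string $input, array $candidates): array
{
    $needle = strtolower($input);
    usort($candidates, function (string $a, string $b) use ($needle): int {
        return levenshtein($needle, strtolower($a)) <=> levenshtein($needle, strtolower($b));
    });
    return $candidates;
}

// Sample rows, standing in for what the LIKE query might return
$ranked = rankByLevenshtein('Accounting department', [
    'Accountancy departmental store',
    'Accounts departments',
    'Accounting department',
]);
print_r($ranked);
```

The exact match ends up first and the longest, least similar string last.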
Related
I'm working on a search / advertising system that matches a given ad group with keywords. Query is the string that is the search string, what we're looking for is the best and most efficient way to enhance the simple 'contains' script below that searches through the query, but looks for keyword matches on an AND (&&) explosion. With this script one could build either 'IF' or it could be a "CASE" below is the pseudo code:
$query = "apple berry tomato potato";
if contains ($query,"tomato") { }
if contains ($query,"potato,berry") { }
if contains ($query,"apple,berry") { }
else i.e. none of the above do { }
the function contains would use strpos but would also use some combination of explode to distinguish words that are separated by commas. So Apple,berry would be where a string contains the list of keywords separated by commas.
What would be the best way to write a contains script that searches through the query string and matches against the comma-separated values in the second parameter? Love your ideas. Thanks.
Here is the classic simple 'contains' function, but it doesn't handle the comma-separated AND explosion; it only works with single words or phrases:
function contains($haystack,$needle)
{
return strpos($haystack, $needle) !== false;
}
Note : the enhanced contains function should scan for the match of the string on an AND basis. If commas exist in the $needle it needs to include all of the keywords to show a match. The simple contains script is explained on this post Check if String contains a given word . What I'm looking for is an expanded function by the same name that also searches for multiple keywords, not just a single word.
The $query string will always be space delimited.
The $needle string will always be comma delimited, or it could be delimited by spaces.
The main thing is that the function works in multiple directions
Suppose the $query = 'business plan template'
or $query = 'templates for business plan'
if you ran contains ($query,"business plan")
or contains ($query,"business,plan") both tests would show a match. The sequence of the words should not matter.
Here's a simple way. Just compare the count of $needle with the count of $needle(s) that are in $haystack using array_intersect():
function contains($haystack, $needle) {
$h = explode(' ', $haystack);
$n = explode(',', $needle);
// $n goes first so that repeated words in $haystack cannot inflate the count
return count(array_intersect($n, $h)) == count($n);
}
You could optionally pass $needle in as an array and then no need for that explode().
If you need it case-insensitive:
$h = explode(' ', strtolower($haystack));
$n = explode(',', strtolower($needle));
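Since the question says $needle may be delimited by commas or by spaces, a variant along these lines might cover both cases. This is only a sketch: the name containsAll() is hypothetical, matching is on whole words only (so "templates" does not match "template"), and an empty needle is treated as no match.

```php
<?php
// Hypothetical variant: true if every keyword in $needle (comma- and/or
// space-delimited) occurs as a whole word in the space-delimited $haystack.
function containsAll(string $haystack, string $needle): bool
{
    $h = preg_split('/\s+/', strtolower($haystack), -1, PREG_SPLIT_NO_EMPTY);
    $n = preg_split('/[\s,]+/', strtolower($needle), -1, PREG_SPLIT_NO_EMPTY);
    // array_diff() returns the keywords that are missing from the haystack
    return $n !== [] && array_diff($n, $h) === [];
}

var_dump(containsAll('business plan template', 'business plan'));
var_dump(containsAll('templates for business plan', 'plan,business'));
var_dump(containsAll('apple berry tomato potato', 'apple,cherry'));
```

Using array_diff() rather than comparing counts avoids having to think about duplicate words on either side.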
I am using MyISAM engine with fulltext indexing for storing a list of strings.
These strings can be a single word, or a sentence.
If I want to know how many times string hello appears in my table, I do
SELECT COUNT(*) Total
FROM String s
WHERE
MATCH (s.name) AGAINST ('hello')
I would like to create a similar report, but for all strings. Result should be a list of TOP-N strings that are the most common in this table (top ones most probably are "the", "a", "to" etc.).
Exact match case is pretty obvious:
SELECT name as String, count(*) as Total
FROM String
GROUP
BY name
ORDER
BY total desc
LIMIT *some number*
But it counts only whole strings.
Is there any way to achieve my desired result?
Thanks.
I guess there is no easy way for this. I would create a "statistic table" for this purpose only. One column for words themselves, one column for the number of occurrences. (Primary key on the first column of course.)
Fill it with a PL/SQL block that scans all strings and splits them into words.
If a word is not found in the statistic table, you insert a new row.
If a word is found in the statistic table, you increase the value in the second column.
This can run for a pretty long time, but after the first run is ready, you only have to check the new strings on insert, perhaps with a trigger. (Assuming you want to use it not once but regularly.)
Hope this helps, I have no simpler answer.
I think it will work if you use the LIKE operator:
select name, count(*) as total from String where name like '%hello%' group by name order by total
let me know
I didn't find any solution with SQL and my Full text index, but I managed to get my desired result by getting all of my strings from DB and processing them on the backend with php:
//get all strings from DB
$queryResult = $db->query("SELECT name as String FROM String");
//Combine all of them into array
$stringArray = array();
while($row = $queryResult->fetch_array(MYSQLI_ASSOC)) {
$stringArray[] = $row['String'];
}
//"Glue" all these strings into one huge string
$text = implode(" ", $stringArray);
//Make everything lowercase
$textLowercase = strtolower($text);
//Find all words
$result = preg_split('/[^a-z]/', $textLowercase, -1, PREG_SPLIT_NO_EMPTY);
//Filter some unwanted words
$result = array_filter($result, function($x){
return !preg_match("/^(.|the|and|of|to|it|in|or|is|a|an|not|are)$/",$x);
});
//Count a number of occurrence of each word
$result = array_count_values($result);
//Sort
arsort($result);
//Select TOP-N strings, where N is $amount
$result = array_slice($result, 0, $amount);
I have nodes in my database under the label Keywords, with words as an attribute. I would like to compare a string ($mostRecentPost) with the words in the array, words.
$queryString ="WITH["Batman","Jaws","Fun","Baseball","Halo","PS4","Nike","Jeep","Mustang"] AS words MATCH (n.Keywords) WHERE ".$mostRecentPost." =~'(?i).*n.kw.*' IN words RETURN n";
$query = new Everyman\Neo4j\Cypher\Query($client, $queryString);
$relativePosts = $query->getResultSet();
Basically we have an example $mostRecentPost = a node, with content = "the new Halo looks awesome". I am trying to compare the contents of that node with the contents of the words array, when it matches one of the array words with some word in the post, it returns that word.
Your query seems to be totally off:
you don't use your "words" anywhere
not sure what .$mostRecentPost stands for
your regexp is not related to the words at all
WITH["Batman","Jaws","Fun","Baseball","Halo","PS4","Nike","Jeep","Mustang"] AS words
MATCH (n.Keywords)
WHERE ".$mostRecentPost." =~'(?i).*n.kw.*' IN words
RETURN n
You could do (which will be not fast):
MATCH (n:Keywords)
WHERE n.text =~ '(?i).*(Batman|Jaws|Fun|...).*'
RETURN n
and use a parameter for the regexp-string
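On the PHP side, the regexp string for that parameter could be built from the word list roughly like this. The function name buildKeywordRegex() is hypothetical, and using preg_quote() to escape the words is an assumption: it targets PCRE, and its escaping should also be acceptable to the Java regex engine Neo4j uses, but that is worth verifying for unusual keywords. How the value is then bound as a query parameter depends on your client library.

```php
<?php
// Hypothetical helper: turn a list of keywords into a case-insensitive
// "contains any of these words" pattern such as '(?i).*(Batman|Halo).*'.
function buildKeywordRegex(array $words): string
{
    $escaped = array_map(function (string $word): string {
        // Escape regex metacharacters so keywords are matched literally
        return preg_quote($word);
    }, $words);
    return '(?i).*(' . implode('|', $escaped) . ').*';
}

$pattern = buildKeywordRegex(['Batman', 'Halo', 'PS4']);
echo $pattern, "\n";
```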
You should use fulltext-search with a list of words, see this blog post for some info on how to set it up with Neo4j 2.0 http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/
My original pname in the table 'english' is "The Digital Santa Monica Mug". If users search for "Digital Mug", it does not return the product whose pname contains "digital mug".
I am using this query:
select *
from english
where((pname like '%$val%'
or desp1 like '%$search%'
or pid like '%$search%' $key_value)
and warehouse=0 and cid !=49)
group by pid;
use pname like '%".implode('%', explode(' ', $val))."%' instead of pname like '%$val%'
In this case order will matter. Means Digital Mug will give you result but MUG Digital won't.
Use full text searches for that
It does not work because when "The Digital Santa Monica Mug" is searched as "Digital Mug", the pattern becomes '%Digital Mug%', which only matches values containing the exact phrase "Digital Mug" with any text before and after it.
Eg : THE Digital Mug Paradise
Such a text will be matched.
So try MYSQL FULL TEXT SEARCH for that
FULL TEXT SEARCH
Either what The C Man advised (split the search phrase and search for every word), or fulltext search.
For "splitting words" method, I'd advise to:
use regular expressions for splitting, something like preg_match_all('#[a-zA-Z0-9]+#', $text, $words); you don't need to search for symbols like "$", do you?
write a function that would generate where clause for you.
Function for generating where clause might look like this:
function generateFilter(array $fields, array $words) {
// prepare $word for putting into SQL statement
foreach ( $words as &$word ) {
// ensure that wildcard characters are used as regular characters
$word = str_replace('%', '\\%', $word);
$word = str_replace('_', '\\_', $word);
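// prevent SQL injections (the mysql_* functions were removed in PHP 7;
// with mysqli use mysqli_real_escape_string($link, $word) instead)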
$word = mysql_real_escape_string($word);
}
unset($word);
// generate filter
$filter = array();
foreach ( $words as $word ) {
$wordFilter = array();
foreach ( $fields as $field ) {
$wordFilter[] = "{$field} like '%$word%'";
}
$filter[] = implode(' or ', $wordFilter);
}
$filter = '(' . implode(') and (', $filter) . ')';
return $filter;
}
$filter = generateFilter(
array('name', 'surname', 'address'),
array('john', 'doe')
);
echo $filter;
Result:
(name like '%john%' or surname like '%john%' or address like '%john%') and
(name like '%doe%' or surname like '%doe%' or address like '%doe%')
If you use prepared statements (which is highly advised), this function would be a bit more complicated, as resulting string would have placeholders for variables, while $words would be put into some array of variables that have to be bound to prepared statement.
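A prepared-statement variant might look roughly like the sketch below. The function name generatePreparedFilter() and the table name in the usage comment are hypothetical; the function returns the WHERE fragment with ? placeholders plus the list of values to bind, in order. It assumes the field names come from your own code, not from user input, since identifiers cannot be bound as parameters.

```php
<?php
// Hypothetical helper: build "(f1 LIKE ? OR f2 LIKE ?) AND (...)" plus the
// matching list of values to bind, one value per placeholder.
function generatePreparedFilter(array $fields, array $words): array
{
    $clauses = [];
    $params = [];
    foreach ($words as $word) {
        // Escape LIKE wildcards (and the escape character itself) so they
        // are matched literally; backslash is MySQL's default LIKE escape
        $pattern = '%' . strtr($word, ['\\' => '\\\\', '%' => '\\%', '_' => '\\_']) . '%';
        $wordClauses = [];
        foreach ($fields as $field) {
            $wordClauses[] = "{$field} LIKE ?";
            $params[] = $pattern;
        }
        $clauses[] = '(' . implode(' OR ', $wordClauses) . ')';
    }
    return [implode(' AND ', $clauses), $params];
}

[$where, $params] = generatePreparedFilter(['name', 'surname'], ['john', 'doe']);
echo $where, "\n";

// Possible usage with PDO (table name "people" is made up):
// $stmt = $pdo->prepare("SELECT * FROM people WHERE $where");
// $stmt->execute($params);
```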
The "splitting words" method works for small strings and small amounts of data. If you have huge amounts of data and/or large strings, consider using fulltext search. It does not require splitting the search phrase, though it has some limitations: it needs a fulltext index on the columns that are used for searching (IIRC, you can create an index on multiple columns and then use fulltext search on all indexed columns at the same time, i.e., you don't have to search every column separately), it has a minimum keyword length, and it might give non-strict results, e.g., sometimes only 3 of 5 keywords might appear in a result. On the other hand, it gives the relevance of every result: results that are closer to the search terms have higher relevance. This is useful for sorting results by relevance.
While creating index may seem to be an "extra work" for you, it will allow DBMS to perform the search faster than without index.
You could split up the input value into two different words. In order to do this, do
$term_array = explode(" ", $val);
Now, $term_array will hold both words separately, and you can run queries on the words individually. For example, you could go through the query twice, and run the same query on the single words. However, doing this would result in duplicates (and likely some unnecessary results). You could probably think of some kind of query using the two separated words that would yield better results, though.
To construct query:
$split = explode(" ", $val);
$qry_pname = "pname LIKE '%".implode("%' or pname LIKE '%", $split)."%'";
$qry = "
SELECT *
FROM english
WHERE(( $qry_pname
or desp1 like '%$search%'
or pid like '%$search%' $key_value)
and warehouse=0 and cid !=49)
group by pid;
";
I'm looking for either a routine or a technique for error-tolerant string comparison.
Let's say, we have test string Čakánka - yes, it contains CE characters.
Now, I want to accept any of following strings as OK:
cakanka
cákanká
ČaKaNKA
CAKANKA
CAAKNKA
CKAANKA
cakakNa
The problem is, that I often switch letters in word, and I want to minimize user's frustration with not being able (i.e. you're in rush) to write one word right.
So, I know how to make ci comparison (just make it lowercase :]), I can delete CE characters, I just can't wrap my head around tolerating few switched characters.
Also, you often put one character not only in wrong place (character=>cahracter), but sometimes shift it by multiple places (character=>carahcter), just because one finger was lazy during writing.
Thank you :]
Not sure (especially about the accents / special characters stuff, which you might have to deal with first), but for characters that are in the wrong place or missing, the levenshtein function, that calculates Levenshtein distance between two strings, might help you (quoting) :
int levenshtein ( string $str1 , string $str2 )
int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )
The Levenshtein distance is defined as
the minimal number of characters you
have to replace, insert or delete to
transform str1 into str2
Other possibly useful functions could be soundex, similar_text, or metaphone.
And some of the user notes on the manual pages of those functions, especially the manual page of levenshtein might bring you some useful stuff too ;-)
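A minimal sketch of the levenshtein approach, assuming the accents have already been stripped and using a made-up threshold of 2 edits (the function name isCloseEnough() is hypothetical too). Note that plain Levenshtein distance counts a swap of two adjacent letters as two edits, so a single switched pair already uses up this budget.

```php
<?php
// Hypothetical helper: accept $input if it is within $maxDistance edits
// (insertions, deletions, replacements) of $target, ignoring case.
function isCloseEnough(string $input, string $target, int $maxDistance = 2): bool
{
    return levenshtein(strtolower($input), strtolower($target)) <= $maxDistance;
}

// Accent-free variants from the question, compared against "cakanka"
foreach (['CAKANKA', 'CAAKNKA', 'CKAANKA', 'cakakNa', 'monitor'] as $word) {
    echo $word, ': ', isCloseEnough($word, 'cakanka') ? 'ok' : 'rejected', "\n";
}
```

The threshold probably needs tuning against word length: two edits is generous for a four-letter word and strict for a twenty-letter one.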
You could transliterate the words to Latin characters and use a phonetic algorithm like Soundex to get the essence of your word and compare it to the ones you have. In your case that gives C252 for most of your variants; variants where a letter has moved further, such as CKAANKA (C520) and cakakNa (C225), produce different codes, so Soundex alone will not catch every swap.
Edit The problem with comparative functions like levenshtein or similar_text is that you need to call them for each pair of input value and possible matching value. That means if you have a database with 1 million entries you will need to call these functions 1 million times.
But functions like soundex or metaphone, that calculate some kind of digest, can help to reduce the number of actual comparisons. If you store the soundex or metaphone value for each known word in your database, you can reduce the number of possible matches very quickly. Later, when the set of possible matching value is reduced, then you can use the comparative functions to get the best match.
Here’s an example:
// building the index that represents your database
$knownWords = array('Čakánka', 'Cakaka');
$index = array();
foreach ($knownWords as $key => $word) {
$code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
if (!isset($index[$code])) {
$index[$code] = array();
}
$index[$code][] = $key;
}
// test words
$testWords = array('cakanka', 'cákanká', 'ČaKaNKA', 'CAKANKA', 'CAAKNKA', 'CKAANKA', 'cakakNa');
echo '<ul>';
foreach ($testWords as $word) {
$code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
if (isset($index[$code])) {
echo '<li> '.$word.' is similar to: ';
$matches = array();
foreach ($index[$code] as $key) {
similar_text(strtolower($word), strtolower($knownWords[$key]), $percentage);
$matches[$knownWords[$key]] = $percentage;
}
arsort($matches);
echo '<ul>';
foreach ($matches as $match => $percentage) {
echo '<li>'.$match.' ('.$percentage.'%)</li>';
}
echo '</ul></li>';
} else {
echo '<li>no match found for '.$word.'</li>';
}
}
echo '</ul>';
Spelling checkers do something like fuzzy string comparison. Perhaps you can adapt an algorithm based on that reference. Or grab the spell checker guessing code from an open source project like Firefox.