I am using the MyISAM engine with fulltext indexing to store a list of strings.
These strings can be a single word or a sentence.
If I want to know how many times the string hello appears in my table, I do
SELECT COUNT(*) Total
FROM String s
WHERE
MATCH (s.name) AGAINST ('hello')
I would like to create a similar report, but for all strings. Result should be a list of TOP-N strings that are the most common in this table (top ones most probably are "the", "a", "to" etc.).
Exact match case is pretty obvious:
SELECT name as String, count(*) as Total
FROM String
GROUP
BY name
ORDER
BY total desc
LIMIT *some number*
But it counts only whole strings.
Is there any way to achieve my desired result?
Thanks.
I guess there is no easy way for this. I would create a "statistic table" for this purpose only. One column for words themselves, one column for the number of occurrences. (Primary key on the first column of course.)
Fill it with a stored procedure (or a script) that scans all strings and splits them into words.
If a word is not found in the statistic table, you insert a new row.
If the word is found in the statistic table, you increase the value in the second column.
This can run for a pretty long time, but after the first run is ready, you only have to check the new strings on insert, perhaps with a trigger. (Assuming you want to use it not once but regularly.)
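A minimal sketch of the bookkeeping described above, assuming a word is a run of ASCII letters and modelling the statistic table as a PHP array keyed by word. The function name countWordsInto is made up for illustration. In MySQL itself, the same "insert or increase" step could probably be expressed with INSERT ... ON DUPLICATE KEY UPDATE.

```php
<?php
// Hypothetical helper: adds the words of one string to an in-memory
// "statistic table" ($stats maps word => number of occurrences).
function countWordsInto(array &$stats, string $text): void
{
    // Assumption: a word is a run of ASCII letters; anything else separates words.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (!isset($stats[$word])) {
            $stats[$word] = 0;   // word not seen yet: "insert a new row"
        }
        $stats[$word]++;         // word already known: "increase the value"
    }
}

$stats = [];
foreach (['Hello world', 'hello again', 'the world'] as $row) {
    countWordsInto($stats, $row);
}
arsort($stats); // most frequent words first
print_r($stats);
```

Calling countWordsInto only for newly inserted strings corresponds to the trigger idea: the full scan is needed once, and later updates stay cheap.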
Hope this helps, I have no simpler answer.
I think using LIKE will work:
select name, count(*) as total from String where name like '%hello%' group by name order by total
Let me know.
I didn't find any solution with SQL and my fulltext index, but I managed to get my desired result by fetching all of my strings from the DB and processing them on the backend with PHP:
//get all strings from DB
$queryResult = $db->query("SELECT name as String FROM String");
//Combine all of them into an array
$stringArray = array();
while($row = $queryResult->fetch_array(MYSQLI_ASSOC)) {
$stringArray[] = $row['String'];
}
//"Glue" all these strings into one huge string
$text = implode(" ", $stringArray);
//Make everything lowercase
$textLowercase = strtolower($text);
//Find all words
$result = preg_split('/[^a-z]/', $textLowercase, -1, PREG_SPLIT_NO_EMPTY);
//Filter some unwanted words
$result = array_filter($result, function($x){
return !preg_match("/^(.|the|and|of|to|it|in|or|is|a|an|not|are)$/",$x);
});
//Count a number of occurrence of each word
$result = array_count_values($result);
//Sort
arsort($result);
//Select TOP-N strings, where N is $amount
$result = array_slice($result, 0, $amount);
I have nodes in my database that are under the label Keywords with words as an attribute. I would like to compare a string ($mostRecentPost) with the words in the array, words.
$queryString ="WITH["Batman","Jaws","Fun","Baseball","Halo","PS4","Nike","Jeep","Mustang"] AS words MATCH (n.Keywords) WHERE ".$mostRecentPost." =~'(?i).*n.kw.*' IN words RETURN n";
$query = new Everyman\Neo4j\Cypher\Query($client, $queryString);
$relativePosts = $query->getResultSet();
Basically we have an example $mostRecentPost = a node, with content = "the new Halo looks awesome". I am trying to compare the contents of that node with the contents of the words array, when it matches one of the array words with some word in the post, it returns that word.
Your query seems to be totally off:
you don't use your "words" anywhere
not sure what .$mostRecentPost stands for
your regexp is not related to the words at all
WITH["Batman","Jaws","Fun","Baseball","Halo","PS4","Nike","Jeep","Mustang"] AS words
MATCH (n.Keywords)
WHERE ".$mostRecentPost." =~'(?i).*n.kw.*' IN words
RETURN n
You could do this (which will not be fast):
MATCH (n:Keywords)
WHERE n.text =~ '(?i).*(Batman|Jaws|Fun|...).*'
RETURN n
and use a parameter for the regexp-string
You should use fulltext-search with a list of words, see this blog post for some info on how to set it up with Neo4j 2.0 http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/
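The regexp string suggested above can be assembled from the word list rather than typed by hand. This is only a sketch: buildKeywordRegex and the parameter name pattern are made up for illustration, n.text is taken from the query above, and preg_quote() is PHP's own escaping, which is assumed to be close enough to Java regex escaping for plain words like these.

```php
<?php
// Hypothetical helper: builds the case-insensitive "contains any of these words" pattern.
function buildKeywordRegex(array $words): string
{
    // Escape regex metacharacters in each word (PHP rules, assumed compatible for plain words).
    $escaped = array_map(function ($w) { return preg_quote($w); }, $words);
    return '(?i).*(' . implode('|', $escaped) . ').*';
}

$pattern = buildKeywordRegex(['Batman', 'Jaws', 'Halo', 'PS4']);

// The pattern is meant to be supplied as a query parameter (named "pattern" here),
// written with the {param} placeholder syntax of Neo4j 2.x.
$queryString = 'MATCH (n:Keywords) WHERE n.text =~ {pattern} RETURN n';

echo $pattern, "\n"; // (?i).*(Batman|Jaws|Halo|PS4).*
```

How the parameter is handed to the client library depends on that library and is left out here.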
My original pname in the table 'english' is "The Digital Santa Monica Mug". If users search for "Digital Mug", the query does not return the product whose pname contains both Digital and Mug.
I am using this query:
select *
from english
where((pname like '%$val%'
or desp1 like '%$search%'
or pid like '%$search%' $key_value)
and warehouse=0 and cid !=49)
group by pid;
Use pname like '%".implode('%', explode(' ', $val))."%' instead of pname like '%$val%'.
In this case the order of the words will matter: Digital Mug will give you a result, but Mug Digital won't.
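To see why the order matters, here is a small sketch. orderedLikePattern builds the pattern from the suggestion above, and likeMatches is a rough PHP emulation of LIKE written only for this illustration (case-insensitive, no escape handling); neither name comes from the original code.

```php
<?php
// Builds the ordered pattern: "Digital Mug" -> "%Digital%Mug%".
function orderedLikePattern(string $search): string
{
    $words = preg_split('/\s+/', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    return '%' . implode('%', $words) . '%';
}

// Rough emulation of SQL LIKE: % matches any run of characters, _ matches one character.
function likeMatches(string $pattern, string $value): bool
{
    $regex = '/^' . str_replace(['%', '_'], ['.*', '.'], preg_quote($pattern, '/')) . '$/is';
    return preg_match($regex, $value) === 1;
}

$pname = 'The Digital Santa Monica Mug';
var_dump(likeMatches(orderedLikePattern('Digital Mug'), $pname)); // bool(true)
var_dump(likeMatches(orderedLikePattern('Mug Digital'), $pname)); // bool(false)
```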
Use full text searches for that
This does not work because a search for Digital Mug is turned into the pattern '%Digital Mug%', which only matches values that contain the exact phrase Digital Mug, with any text before and after it. The Digital Santa Monica Mug does not contain that phrase.
E.g.: THE Digital Mug Paradise
Such a text will be matched.
So try MySQL full-text search for that.
Either what The C Man advised (split the search phrase and search for every word), or fulltext search.
For "splitting words" method, I'd advise to:
use regular expressions for splitting, something like preg_match_all('#[a-zA-Z0-9]+#', $text, $words); you don't need to search for symbols like "$", do you?
write a function that would generate where clause for you.
Function for generating where clause might look like this:
function generateFilter(array $fields, array $words) {
// prepare $word for putting into SQL statement
foreach ( $words as &$word ) {
// ensure that wildcard characters are used as regular characters
$word = str_replace('%', '\\%', $word);
$word = str_replace('_', '\\_', $word);
// prevent SQL injections
$word = mysql_real_escape_string($word);
}
unset($word);
// generate filter
$filter = array();
foreach ( $words as $word ) {
$wordFilter = array();
foreach ( $fields as $field ) {
$wordFilter[] = "{$field} like '%$word%'";
}
$filter[] = implode(' or ', $wordFilter);
}
$filter = '(' . implode(') and (', $filter) . ')';
return $filter;
}
$filter = generateFilter(
array('name', 'surname', 'address'),
array('john', 'doe')
);
echo $filter;
Result:
(name like '%john%' or surname like '%john%' or address like '%john%') and
(name like '%doe%' or surname like '%doe%' or address like '%doe%')
If you use prepared statements (which is highly advised), this function would be a bit more complicated, as resulting string would have placeholders for variables, while $words would be put into some array of variables that have to be bound to prepared statement.
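A sketch of what such a prepared-statement variant might look like. The name generatePreparedFilter is made up for illustration, and the field names are assumed to come from trusted code (never from user input), since placeholders cannot be used for column names.

```php
<?php
// Hypothetical variant of generateFilter(): returns the WHERE fragment with "?" placeholders
// plus the list of values to bind, in the same order as the placeholders.
function generatePreparedFilter(array $fields, array $words): array
{
    $clauses = [];
    $params  = [];
    foreach ($words as $word) {
        // Escape LIKE wildcards (and the escape character itself) so they match literally.
        $pattern = '%' . addcslashes($word, '%_\\') . '%';
        $wordClauses = [];
        foreach ($fields as $field) {
            $wordClauses[] = "{$field} LIKE ?";
            $params[] = $pattern;
        }
        $clauses[] = '(' . implode(' OR ', $wordClauses) . ')';
    }
    return [implode(' AND ', $clauses), $params];
}

list($sql, $params) = generatePreparedFilter(
    array('name', 'surname', 'address'),
    array('john', 'doe')
);
echo $sql, "\n";
print_r($params);
```

The returned $params array can then be bound to the statement with whatever database API is in use, for example PDO or mysqli.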
The "splitting words" method works for small strings and small amounts of data. If you have huge amounts of data and/or large strings, consider using fulltext search. It does not require splitting the search phrase, though it has some limitations: it needs a fulltext index on the columns that are used for searching (IIRC, you can create an index on multiple columns and then use fulltext search on all indexed columns at the same time, i.e., you don't have to search every column separately), it has a minimum keyword length, and it might give non-strict results, e.g., sometimes only 3 of 5 keywords might appear in a result. It does, however, give a relevance score for every result: results that are closer to the search terms have higher relevance, which is useful for sorting.
While creating an index may seem like extra work, it will allow the DBMS to perform the search faster than without an index.
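For comparison, a fulltext query that requires every word could be prepared roughly as follows. This is a sketch under assumptions: requiredTerms is a made-up helper, a FULLTEXT index on english.pname is assumed to exist, and short words such as Mug may fall below the configured minimum word length and then not be indexed at all.

```php
<?php
// Hypothetical helper: turns a search phrase into a boolean-mode expression in which
// every word is required, e.g. "Digital Mug" -> "+Digital +Mug".
function requiredTerms(string $phrase): string
{
    // Keep only alphanumeric runs, so boolean operators typed by the user are dropped.
    preg_match_all('/[a-zA-Z0-9]+/', $phrase, $matches);
    return implode(' ', array_map(function ($w) { return '+' . $w; }, $matches[0]));
}

// Assumes a FULLTEXT index on english.pname; bind requiredTerms($val) to the placeholder.
$sql = 'SELECT * FROM english WHERE MATCH (pname) AGAINST (? IN BOOLEAN MODE)';

echo requiredTerms('Digital Mug'), "\n"; // +Digital +Mug
```

Unlike the LIKE pattern, the order of the words in the search phrase does not matter here.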
You could split up the input value into two different words. In order to do this, do
$term_array = explode(" ", $val);
Now, $term_array will hold both words separately, and you can run queries on the words individually. For example, you could go through the query twice, and run the same query on the single words. However, doing this would result in duplicates (and likely some unnecessary results). You could probably think of some kind of query using the two separated words that would yield better results, though.
To construct query:
$split = explode(" ", $val);
$qry_pname = "pname LIKE '%".implode("%' or pname LIKE '%", $split)."%'";
$qry = "
SELECT *
FROM english
WHERE(( $qry_pname
or desp1 like '%$search%'
or pid like '%$search%' $key_value)
and warehouse=0 and cid !=49)
group by pid;
";
I have two lists of words, say LIST1 and LIST2. I want to compare LIST1 against LIST2 to find the duplicates, but it should also find the plural and the -ing form of each word. For example:
Suppose LIST1 has the word "account", and LIST2 has the words "accounts,accounting". When I compare them, the result should show two matches for the word "account".
I am doing it in PHP and have the lists in MySQL tables.
You can use a technique called Porter stemming to map each list entry to its stem, then compare the stems. An implementation of the Porter Stemming algorithm in PHP can be found here or here.
What I would do is take your word and compare it directly to LIST2, and at the same time remove your word from every word you're comparing, looking for a leftover ing, s, or es that denotes a plural or -ing form (this should be accurate enough). If not, you'll have to write an algorithm for making plurals out of words, as it is not as simple as adding an S.
Duplicate Ending List
s
es
ing
LIST1
Gas
Test
LIST2
Gases
Tests
Testing
Now compare List1 to List2. In the same comparison loop, do a direct comparison of the items, and a second one where the word from List1 is removed from the current word you're looking at in List2. Then just check whether what is left over is in the Duplicate Ending List.
Hope that makes sense.
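The comparison loop described above might be sketched like this. findEndingMatches is a made-up name, matching is case-insensitive by assumption, and irregular forms (such as "children") are not handled.

```php
<?php
// For each word in $list1, collects the entries of $list2 that are either equal to it
// or equal to it followed by one of the known endings (case-insensitive).
function findEndingMatches(array $list1, array $list2, array $endings): array
{
    $matches = [];
    foreach ($list1 as $base) {
        $matches[$base] = [];
        $b = strtolower($base);
        foreach ($list2 as $candidate) {
            $c = strtolower($candidate);
            if (strncmp($c, $b, strlen($b)) !== 0) {
                continue; // candidate does not start with the base word
            }
            // What remains after removing the base word from the candidate.
            $leftover = (string) substr($c, strlen($b));
            if ($leftover === '' || in_array($leftover, $endings, true)) {
                $matches[$base][] = $candidate;
            }
        }
    }
    return $matches;
}

$result = findEndingMatches(
    ['Gas', 'Test'],
    ['Gases', 'Tests', 'Testing'],
    ['s', 'es', 'ing']
);
print_r($result); // Gas => [Gases], Test => [Tests, Testing]
```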
The problem with that is, in English at least, plurals are not all standard extensions, nor are present participles. You can make an approximation by using all words +'ing' and +'s', but that will give false positives and negatives.
You can handle it directly in MySQL if you wish.
SELECT DISTINCT l2.word
FROM LIST1 l1, LIST2 l2
WHERE l1.word = l2.word OR CONCAT(l1.word, 's') = l2.word OR CONCAT(l1.word, 'ing') = l2.word;
This function will output the plural of a word.
http://www.exorithm.com/algorithm/view/pluralize
Something similar could be written for gerunds and present participles (ing forms)
You might consider using the Doctrine Inflector class in conjunction with a stemmer for this.
Here's the algorithm at a high level
Split search string on spaces, process words individually
Lowercase the search word
Strip special characters
Singularize, replace differing portion with wildcard ('%')
Stem, replace differing portion with wildcard ('%')
Here's the function I put together
/**
* Use inflection and stemming to produce a good search string to match subtle
* differences in a MySQL table.
*
* @param string $sInputString The string you want to base the search on
* @param string $sSearchTable The table you want to search in
* @param string $sSearchField The field you want to search
*/
function getMySqlSearchQuery($sInputString, $sSearchTable, $sSearchField)
{
$aInput = explode(' ', strtolower($sInputString));
$aSearch = [];
foreach($aInput as $sInput) {
$sInput = str_replace("'", '', $sInput);
//--------------------
// Inflect
//--------------------
$sInflected = Inflector::singularize($sInput);
// Otherwise replace the part of the inflected string where it differs from the input string
// with a % (wildcard) for the MySQL query
$iPosition = strspn($sInput ^ $sInflected, "\0");
if($iPosition !== null && $iPosition < strlen($sInput)) {
$sInput = substr($sInflected, 0, $iPosition) . '%';
} else {
$sInput = $sInput;
}
//--------------------
// Stem
//--------------------
$sStemmed = stem_english($sInput);
// Otherwise replace the part of the inflected string where it differs from the input string
// with a % (wildcard) for the MySQL query
$iPosition = strspn($sInput ^ $sStemmed, "\0");
if($iPosition !== null && $iPosition < strlen($sInput)) {
$aSearch[] = substr($sStemmed, 0, $iPosition) . '%';
} else {
$aSearch[] = $sInput;
}
}
$sSearch = implode(' ', $aSearch);
return "SELECT * FROM $sSearchTable WHERE LOWER($sSearchField) LIKE '$sSearch';";
}
Which I ran with several test strings
Input String: Mary's Hamburgers
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'mary% hamburger%';
Input String: Office Supplies
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'offic% suppl%';
Input String: Accounting department
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'account% depart%';
Probably not perfect, but it's a good start anyway! Where it will fall down is when multiple matches are returned. There's no logic to determine the best match. That's where things like MySQL fulltext and Lucene come in. Thinking about it a little more, you might be able to use levenshtein to rank multiple results with this approach!
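The levenshtein idea could look roughly like this: fetch the rows matched by the LIKE query, then order them by edit distance to the original input. rankByLevenshtein is a made-up name, and the candidate strings are invented sample data. Note that older PHP versions limit levenshtein() to strings of at most 255 characters.

```php
<?php
// Orders candidate strings by edit distance to the input, closest first (case-insensitive).
function rankByLevenshtein(string $input, array $candidates): array
{
    $needle = strtolower($input);
    usort($candidates, function ($a, $b) use ($needle) {
        return levenshtein($needle, strtolower($a)) <=> levenshtein($needle, strtolower($b));
    });
    return $candidates;
}

$ranked = rankByLevenshtein('account', ['Accounting', 'Accountability', 'Accounts']);
print_r($ranked); // Accounts, Accounting, Accountability
```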
I'm using a simple query for my search:
SELECT * FROM table WHERE field LIKE '%term%'
If I have a field = "Company Name 123" and I search for Company 123, the result is empty.
How can I improve this? It only finds the term if the words appear in sequence.
Replace spaces with %
$newTerm = str_replace(' ', '%', $term);
$sql = "SELECT * FROM table WHERE field LIKE '%$newTerm%'";
$r = mysql_query($sql, $conn);
You need to put a % between Company and 123 in order for it to match. You might want to check out full text search functions.
Try replacing the spaces:
$searchtext =str_replace(' ','%',$searchtext);
You could:
split your search term into words and build a query with a lot of ANDs (or ORs if you just want to find one of the parts) out of it (ugly, but I've seen this a lot of times)
replace ' ' (space) with % (that's a wildcard) in your term (the way to go)
I have a field that is in this format
5551112391^HUMAN^HUMAN-800-800^6-main^^
How would I only grab the numbers 5551112391 before the character ^?
Would you do this with regex?
You can make use of explode:
$var = '5551112391^HUMAN^HUMAN-800-800^6-main^^';
$arr = explode('^',$var);
$num = $arr[0];
Using regex:
$var = '5551112391^HUMAN^HUMAN-800-800^6-main^^';
if(preg_match('/^(\d+)/',trim($var),$m)){
$num = $m[1];
}
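Another option, shown here only as a sketch, is strstr() with its third argument set to true, which returns everything before the first occurrence of the delimiter and keeps the value as a string:

```php
<?php
$var = '5551112391^HUMAN^HUMAN-800-800^6-main^^';

// strstr() with $before_needle = true returns the part before the first '^'.
$num = strstr($var, '^', true);

// strstr() returns false when the string contains no '^'; fall back to the whole string.
if ($num === false) {
    $num = $var;
}

echo $num, "\n"; // 5551112391
```

Keeping the result as a string also preserves leading zeros, if such values can occur in the data.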
Regex overkill, nice...
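What about a simple cast to int? It works fine if the number is at the beginning of the data, and it is definitely faster than regexps. Note that 5551112391 is larger than the 32-bit integer maximum (2147483647), so this needs a 64-bit PHP build.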
$var = '5551112391^HUMAN^HUMAN-800-800^6-main^^';
$num = (int)$var;
http://www.php.net/manual/en/language.types.type-juggling.php
You're doing it in a completely wrong way.
You treat mysql database as a flat text file. But it is not.
All these fields must be separated and stored in separate columns.
To get only certain data from the table, you should not select all rows and then compare one by one but make database do it for you:
SELECT * FROM table WHERE number=5551112391