I have a database table with a large number of words and strings. (Right now it has over 300K entries, and it keeps growing.) What would be the best way to get only those values that fit a given pattern? Let's say the table contains:
apples
oranges
abba
car
real
tipi
riot
tidy
Now, how do I retrieve only the words that match the pattern CVCV (consonant, vowel, consonant, vowel)? Or CVVC, LLLL (any four letters), etc.? I could add a column listing every pattern a word fits, like so:
word: real
patterns: LLLL,CVVC,LVVC,LVVL,LVLC,LLVC,LLLC,LVLL,CLLC,...
and search the database with "SELECT * FROM table WHERE patterns LIKE $pattern", but I was wondering whether there is a better way.
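CVCV (the ^ and $ anchors make the pattern match the whole word rather than any four-letter run inside a longer word):
SELECT 'cara' REGEXP '^[bcdfghjklmnpqrstvwxz][aeiouy][bcdfghjklmnpqrstvwxz][aeiouy]$';
true
SELECT 'abba' REGEXP '^[bcdfghjklmnpqrstvwxz][aeiouy][bcdfghjklmnpqrstvwxz][aeiouy]$';
false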
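As a rough sketch of how pattern strings such as CVCV, CVVC or LLLL could be turned into a regular expression automatically, here is a small Python example. The function names and the `CLASSES` table are illustrative assumptions, not part of any library; the character classes are copied from the REGEXP examples above (which treat "y" as a vowel). The generated string could be passed to MySQL's REGEXP, or used directly in application code as shown.

```python
import re

# Character classes copied from the REGEXP examples above.
# Treating "y" as a vowel follows those examples; adjust as needed.
CLASSES = {
    "C": "[bcdfghjklmnpqrstvwxz]",  # consonant
    "V": "[aeiouy]",                # vowel
    "L": "[a-z]",                   # any letter
}


def pattern_to_regex(pattern):
    """Translate a pattern such as 'CVCV' into an anchored regular expression."""
    return "^" + "".join(CLASSES[symbol] for symbol in pattern) + "$"


def words_matching(pattern, words):
    """Return the words that fit the pattern, ignoring case."""
    regex = re.compile(pattern_to_regex(pattern), re.IGNORECASE)
    return [word for word in words if regex.match(word)]


words = ["apples", "oranges", "abba", "car", "real", "tipi", "riot", "tidy"]
print(words_matching("CVCV", words))
print(words_matching("CVVC", words))
print(pattern_to_regex("LLLL"))
```

With this approach no extra "patterns" column is needed; the trade-off is that a regular expression match generally cannot use an index, so every row is scanned.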
If you're only looking for 4-letter words, then that should be fairly simple to do with a REGEXP condition. For example, if you don't care about the order of the vowels and the consonants, then it's as simple as this:
SELECT *
FROM yourTable
WHERE yourField REGEXP '^[a-z]{4}$'
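This matches any value that consists of exactly four letters a-z.
Note: as written, the character class lists only lowercase letters. MySQL's REGEXP is normally not case sensitive for non-binary strings, so this may already match uppercase letters as well. If your column is binary or uses a case-sensitive collation, you can do either of the following: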
1) LOWER(yourField) REGEXP '^[a-z]{4}$'
OR
2) yourField REGEXP '^[a-zA-Z]{4}$'
If you would like something similar to this but not quite what I gave you, read up on regular expressions. This is a pretty good starter reference: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
I would suggest you read up on regular expressions a little anyway as they are pretty powerful and fairly useful in a lot of instances of string manipulation.
Related
I have a query problem.
For example, the name column contains "Rodrigue Dattatray Desilva".
I want to write a query so that, if I search for 'gtl' and those letters match anywhere in the string, the row shows up in the result.
I know that in PHP I can build a pattern like '%g%t%l%'.
But I want to know the MySQL way.
Note: the search string can be anything; the above is just an example.
EDIT:
create table Test(id integer, title varchar(100));
insert into Test(id, title) values(1, "Rodrigue Dattatray Desilva");
select * from Test where title like '%g%t%l%';
Consider the above case, where "gtl" is the string I am searching for in the title; the search string can be anything.
The letters of "gtl" all occur in the current title in that order, but not next to each other.
The easy answer is that you need an extra wildcard:
select * from Test where title like '%g%t%l%';
The query you posted does not have a wildcard after the 'l', so it would only match if the phrase ended with 'l'.
The more complicated answer is that you can also use regular expressions, which give you more power over the search.
The even more complicated answer is that the performance of these string matching queries tends to be poor: a leading wildcard means that indexes are usually ineffective. If you have a large number of rows in your table, full-text searching is much faster.
You can do the same in MySQL too, using the LIKE keyword with these wildcards:
% - The percent sign represents zero, one, or multiple characters
_ - The underscore represents a single character
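Since the search string can be anything, the '%g%t%l%' pattern has to be built from user input. The following Python sketch shows one way to do that; `subsequence_like_pattern` and `is_subsequence` are hypothetical helper names. It escapes `%`, `_` and backslash so they are matched literally, assuming MySQL's default LIKE escape character (backslash). The second function mimics the same "letters in order, anywhere in the string" test in application code.

```python
def subsequence_like_pattern(search):
    """Build a LIKE pattern matching the characters of `search` in order."""
    escaped = []
    for ch in search:
        # Backslash is MySQL's default LIKE escape character.
        if ch in ("%", "_", "\\"):
            ch = "\\" + ch
        escaped.append(ch)
    return "%" + "%".join(escaped) + "%"


def is_subsequence(search, text):
    """Application-side equivalent: are the letters of `search` found in order?"""
    remaining = iter(text.lower())
    # `ch in remaining` consumes the iterator up to the first occurrence of ch.
    return all(ch in remaining for ch in search.lower())


print(subsequence_like_pattern("gtl"))
print(is_subsequence("gtl", "Rodrigue Dattatray Desilva"))
```

The resulting pattern should still be passed to the query as a bound parameter rather than concatenated into the SQL string.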
I have a table of tags (words). Each time I add a new item (word) to the table, I first want to see the existing words that look most like the word I am entering, so that I notice if the table already contains a similar word.
It is a bit like the MATCH() function in MySQL, except that I don't want a score based on how many words correspond; I want a score based on how many characters within a word correspond.
So something like: select * from tags order by look_a_like_score(#newword)
But is there a function like look_a_like_score()?
For example, the table already contains:
Restaurant
Elevator
Swimming pool
Wifi
Now I want to add:
Free swimmer facilities
What I would like to get now is a list with 'Swimming pool' on top, because the part 'swimm' is the longest match.
Can you help me do this?
PS. I load the entire table into a PHP array, so a PHP approach is also welcome.
On the MySQL side you have SOUNDEX, which in my experience does not work as well as I would like.
You could also extend MySQL with a user-defined function for Levenshtein distance (you would need to write and compile it in C).
On the PHP side you have levenshtein(), which gives a quite decent similarity score.
You may also use:
soundex() - Calculate the soundex key of a string
similar_text() - Calculate the similarity between two strings
metaphone() - Calculate the metaphone key of a string
Check the manual to see how to use them.
You can look here for an implementation of the Levenshtein distance formula, which is good for finding the edit distance between two strings.
Other things that might work for you are soundex or possibly double metaphone, to do "sounds like" matches.
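To make the idea of edit distance concrete, here is a minimal Python sketch of the classic dynamic-programming Levenshtein computation, used to rank existing tags by distance to a new one. It is an illustration written for this thread, not the implementation linked above and not PHP's levenshtein(); `closest_tags` is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, 1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_tags(new_tag, tags):
    """Return (tag, distance) pairs, closest first, comparing case-insensitively."""
    scored = [(levenshtein(new_tag.lower(), tag.lower()), tag) for tag in tags]
    return [(tag, distance) for distance, tag in sorted(scored)]


tags = ["Restaurant", "Elevator", "Swimming pool", "Wifi"]
print(closest_tags("swiming pool", tags)[0])
```

Edit distance works best for catching typos in a whole tag. For inputs of very different length (such as "Free swimmer facilities" versus "Swimming pool") a shared-substring measure is likely to rank better.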
There is no such function. But you can do this with some SQL. Let me assume that @newtag contains your new tag and that you have a numbers table (a table numbers with a column n holding 1, 2, 3, ...). You can do something like this:
select t.tag, max(patterns.len) as biggestmatch
from (select concat('%', substr(@newtag, n1.n, n2.n), '%') as pat, /* every substring of the new tag */
n1.n as start, n2.n as len
from numbers n1 cross join
numbers n2
where n1.n <= length(@newtag) and n1.n + n2.n - 1 <= length(@newtag)
) patterns join
tags t
on t.tag like patterns.pat
group by t.tag
order by biggestmatch desc /* longest match first */
limit 1 /* you only need this if you want the best one */
I'm not promising that this will perform particularly well. But for a handful of tags and strings that are not too long, it might suit your purposes.
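Because the poster already loads all tags into an application-side array, the same "longest shared substring" score can also be computed outside the database. The Python sketch below is one way to do it under that assumption; the function names are hypothetical, and the comparison is case-insensitive by choice.

```python
def longest_common_substring_len(a, b):
    """Length of the longest run of characters that appears in both strings."""
    a, b = a.lower(), b.lower()
    best = 0
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def rank_tags(new_tag, tags):
    """Return (tag, score) pairs with the longest shared substring first."""
    scored = [(tag, longest_common_substring_len(new_tag, tag)) for tag in tags]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


tags = ["Restaurant", "Elevator", "Swimming pool", "Wifi"]
for tag, score in rank_tags("Free swimmer facilities", tags):
    print(tag, score)
```

For the example tags this puts 'Swimming pool' first, since "swimm" is the longest shared run. The cost grows with the product of the two string lengths per tag, which should be fine for a modest tag list.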
I have a large database of sentences, and a problem: sentences like "i'm good" do not match "im good" (and vice versa), and "is that mine?" does not match "is that mine" (and vice versa), when I want them to be detected as matches.
I wrote complicated and messy functions trying to do this with wildcards, but it is just a big mess, and I'm sure there must be a way to search with this one-character leeway. If possible, I would like to control which characters get this leeway; in my examples the main troublemakers are the question mark and the single quote (? ').
I'm currently using a plain SELECT query with PHP and MySQL to do the matching.
I would love some help figuring this out so I can clean up the code that is currently doing the job inconsistently.
In case anyone wants to see it, the query that checks for matches looks like this:
$checkqwry = "select * from `eng-jap` where (eng = '$eng' or english = '$oldeng' or english = '$oldeng2') and (jap = '$jap' or japanese = '$oldjap' or japanese = '$oldjap2');";
The purpose of the query is just to check whether a translation with $eng and $jap is already in the DB. The reason you see $oldeng, $oldeng2, $oldeng3 and so on is, as I said, my messy attempt to match regardless of whether there is a question mark, and so on: some of the $oldeng variables have question marks or single quotes and the others don't. There is more code above that appends and removes question marks and the like. Yes, it's a big mess.
You want to use a string metric algorithm as mentioned above. PHP has this built in: http://php.net/manual/en/function.levenshtein.php as well as http://www.php.net/manual/en/function.similar-text.php.
MySQL doesn't implement this (specific algorithm) natively, but some people have gone ahead and written stored procedures to accomplish the same: http://www.artfulsoftware.com/infotree/queries.php#552
In my opinion, using a string metric that can handle arbitrary changes is better than stripping out punctuation, and it can also catch omissions, transpositions, etc.
It is probably better to simply strip non-alphanumeric characters out before comparing the strings.
You can use the REPLACE() function in SQL to replace "'" with "" and "?" with "".
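The stripping approach can also be applied on the application side before the query is built, which would let a single normalized value replace the several $oldeng variants. Here is a small Python sketch of the idea; `normalize` and the `IGNORED` set are illustrative assumptions, and the set is configurable because the poster wants control over which characters get leeway.

```python
# Characters that should not affect matching (question mark and single quote,
# as in the question). Extend this string to ignore more characters.
IGNORED = "'?"


def normalize(sentence, ignored=IGNORED):
    """Drop the ignored characters, lowercase, and collapse runs of whitespace."""
    table = str.maketrans("", "", ignored)
    return " ".join(sentence.translate(table).lower().split())


print(normalize("I'm good") == normalize("im good"))
print(normalize("is that mine?") == normalize("is that mine"))
```

One possible design is to store the normalized form in its own indexed column and compare against that, so the lookup stays a plain equality test.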
You might want to look at natural language full text searches in MySQL. Add a FULLTEXT index to the eng column.
ALTER TABLE `eng-jap` ADD FULLTEXT INDEX `full` (`eng`) ;
Then, use match function:
select * from `eng-jap` where match(eng) against ('Im happy');
This will return both I'm happy and Im happy
If you select the relevance score like:
select id, match(eng) against ('Im happy') from `eng-jap` where match(eng) against ('Im happy');
you can use it to further process the matches in PHP and filter.
[EDIT]: Just verified that the relevance score for yesterday and yesterday? are the same too:
select *, match(eng) against ('yesterday') as mc from `eng-jap`
Result is:
6, yesterday?, 0.9058732390403748
7, yesterday, 0.9058732390403748
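Note: when this was written, a FULLTEXT index required the MyISAM engine; InnoDB also supports FULLTEXT indexes from MySQL 5.6 onward. Also, words shorter than the minimum indexed word length (ft_min_word_len, 4 by default for MyISAM) are ignored, so the index doesn't match a short word like 'yes'.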
I am trying to do a search query with SQL; my page contains an input field whose value is taken and simply concatenated to my SQL statement.
So SELECT * FROM users, after a search, becomes SELECT * FROM users WHERE company LIKE '%georges brown%'.
It then returns results based on what the user types in; in this case Georges Brown. However, it only finds entries whose company is typed out exactly as Georges Brown (with an 's').
What I am trying to do is return a result set that not only contains entries with Georges but also George (no 's').
Is there any way to make this search more flexible so that it finds results with Georges and George?
Try using more wildcards around george.
SELECT * FROM users WHERE company LIKE '%george% %brown%'
Try this query:
SELECT *
FROM users
WHERE company LIKE '%george% brown%'
Use SOUNDEX
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
You can also remove the last 2 characters before computing the SOUNDEX codes, and then compare the codes.
You'll have to look at the documentation of your database system. MySQL for example provides the SOUNDEX function.
Otherwise, what should always work and give you better matching is to only work on upper or lower cased strings. SQL-92 defines the TRIM, UPPER, and LOWER functions. So you'd do something like WHERE UPPER(company) LIKE UPPER('%georges brown%').
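In specific cases you can use a single-character wildcard:
WHERE company LIKE '%george_ brown%' -- will match `georges` but not `georgeani`
_ is a single-character wildcard, while % is a multi-character wildcard. Note that _ requires exactly one character, so this pattern does not match `george brown` without the 's'.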
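To see the difference between the two wildcards without a database at hand, the following Python sketch translates a LIKE pattern into a regular expression. It is a simplified model for illustration only: `like_match` is a hypothetical helper, escape sequences are not handled, and matching is made case-insensitive on the assumption of a case-insensitive collation.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern: % is any run of characters, _ is exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def like_match(pattern, value):
    """LIKE must match the whole value, hence fullmatch."""
    return like_to_regex(pattern).fullmatch(value) is not None


for company in ["Georges Brown Ltd", "George Brown", "Georgeani Brown"]:
    print(company,
          like_match("%george_ brown%", company),
          like_match("%george% brown%", company))
```

The `_` pattern accepts only "Georges Brown Ltd", while the `%` pattern accepts all three names, including "Georgeani Brown".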
But maybe it's better to use another piece of software for indexing, like Sphinx.
It has:
"Flexible text processing. Sphinx indexing features include full support for SBCS and UTF-8 encodings (meaning that effectively all world's languages are supported); stopword removal and optional hit position removal (hitless indexing); morphology and synonym processing through word forms dictionaries and stemmers; exceptions and blended characters; and many more."
It allows you to do smarter searches with partial matches, while providing more accuracy than soundex, for example.
Probably best to explode out your search string into individual words then find the plural / singular of each of those words. Then do a like for both possibilities for each word.
However for this to be usably efficient on large amounts of data you probably want to run against a table of words linked to each company.
Soundex alone probably isn't much use, as too many words come out the same (it gives you a 4-character code: the first character is the first letter of the word, and the next 3 are a numeric code). Levenshtein is more accurate, but MySQL has no built-in function for it, although PHP does have a fast one (the MySQL functions I found to calculate it were far too slow to be useful on a large search).
What I did for a similar search function was to take the input string and explode it into words, then convert those words to their singular form (my table of used words contains only singular versions of words). For each word I then found all the used words starting with the same letter and used Levenshtein to get the best match(es), and from these listed the possible matches. This made it possible to cope with typos (so it would likely find George if someone entered Goerge), and also to find best matches (i.e., if someone searched on 5 words but only 4 were found). It could also come up with a few alternatives if the spelling was miles out.
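The procedure described above can be sketched roughly as follows in Python. This is not the answerer's actual code: the names are hypothetical, the singular form is a deliberately naive "drop a trailing s" rule, and the `max_distance` cutoff of 2 is an assumed value chosen so that a swapped pair of letters still matches.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with the usual row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        current = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def singular(word):
    """Very naive singular form: drop one trailing 's' (illustrative assumption)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def suggest(query, vocabulary, max_distance=2):
    """Map each search word to the known words it most plausibly refers to."""
    suggestions = {}
    for raw in query.lower().split():
        word = singular(raw)
        # Only compare against known words that start with the same letter.
        candidates = [known for known in vocabulary if known[:1] == word[:1]]
        scored = sorted((edit_distance(word, known), known) for known in candidates)
        suggestions[raw] = [known for distance, known in scored
                            if distance <= max_distance]
    return suggestions


vocabulary = ["george", "green", "brown", "bravo"]  # singular words already in use
print(suggest("Goerges Brown", vocabulary))
```

Restricting candidates to the same first letter keeps the number of distance computations small, at the price of missing typos in the first character.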
You may also want to look up Metaphone and Double Metaphone.
What's the best way to go about doing this?
I have columns: track_name, artist_name, album_name
I want all columns to be matched against the search query, with some flexibility while matching.
MySQL's LIKE is too strict, even with %XXX%. It matches the search string as a whole, not its parts.
Your MySQL query could have several OR clauses, searching for each space-delimited word entered by the user. For example, a user search for "Queens of the Stoneage" may be represented in SQL as SELECT * FROM songs WHERE artist_name LIKE "%Queens%" OR artist_name LIKE "%Stoneage%".
However, that could be undesirable because LIKE searches which start with a % cannot use an index and could be terribly slow on a large database.
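Building such a query from user input could look roughly like this Python sketch. `build_like_search` is a hypothetical helper; the `%s` placeholders assume a MySQL driver that uses that parameter style, and the table and column names must come from your own code, never from the user. It only illustrates generating one LIKE clause per word and column, so the performance caveat above still applies.

```python
def build_like_search(query, columns, table="songs"):
    """Return (sql, params) with one 'column LIKE %s' clause per word and column."""
    clauses = []
    params = []
    for word in query.split():
        for column in columns:
            clauses.append(f"{column} LIKE %s")
            params.append(f"%{word}%")
    # With no search words, match nothing instead of producing invalid SQL.
    where = " OR ".join(clauses) if clauses else "1 = 0"
    return f"SELECT * FROM {table} WHERE {where}", params


sql, params = build_like_search("Queens Stoneage", ["artist_name"])
print(sql)
print(params)
```

Passing the words as parameters instead of concatenating them into the SQL string keeps quotes in the user's input from breaking the query.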
Though I can't speak to the performance implications, you should have a look at natural language full-text searches. It's probably the most effective solution you'll find:
SELECT * FROM songs WHERE MATCH(track_name, artist_name, album_name) AGAINST('Queens of the Stoneage' IN NATURAL LANGUAGE MODE);
Some PHP functions do exist for determining the similarity of strings of text, but keeping this work in the database will probably be most efficient (and less frustrating):
levenshtein()
similar_text()
soundex()
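For comparison, here is what application-side similarity matching can look like. The sketch uses Python's standard difflib module rather than the PHP functions listed above, so it is an analogue of the idea and not the same algorithm as similar_text(); `closest_titles` and the sample titles are made up for illustration.

```python
import difflib


def closest_titles(query, titles, limit=3, cutoff=0.6):
    """Return up to `limit` titles whose similarity ratio to the query is >= cutoff."""
    by_lower = {title.lower(): title for title in titles}  # compare case-insensitively
    hits = difflib.get_close_matches(query.lower(), list(by_lower),
                                     n=limit, cutoff=cutoff)
    return [by_lower[hit] for hit in hits]


titles = ["Stone Age", "Queen", "Muse"]
print(closest_titles("Stoneage", titles))
```

This requires loading the candidate strings into memory, which is why keeping the work in the database is usually preferable for large tables.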
I think you need to rethink your application. What I understood from your comment is that you need to implement logical operators like "and", "or" and "not" in your program. It's not only about a fancy algorithm such as a full-text index or a longest-common-substring match query. But I could be wrong.