I am trying to do a search query with SQL; my page contains an input field whose value is taken and simply concatenated to my SQL statement.
So, SELECT * FROM users after a search then becomes SELECT * FROM users WHERE company LIKE '%georges brown%'.
It then returns results based on what the user types in; in this case Georges Brown. However, it only finds entries whose companies are typed out exactly as Georges Brown (with an 's').
What I am trying to do is return a result set that not only contains entries with Georges but also George (no 's').
Is there any way to make this search more flexible so that it finds results with Georges and George?
Try using more wildcards around george.
SELECT * FROM users WHERE company LIKE '%george% %brown%'
Try this query:
SELECT *
FROM users
WHERE company LIKE '%george% brown%'
Use SOUNDEX
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
You can also remove last 2 characters and get SOUNDEX codes and compare them.
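To illustrate why a phonetic code helps here, below is a small Python sketch of a four-character Soundex. It follows the "original" variant that drops vowels before collapsing duplicate codes, which is my understanding of what the MySQL manual describes for SOUNDEX(); treat it as an illustration only, since MySQL can return longer codes for long words. The function name soundex and the truncation to four characters are my own choices.

```python
# Sketch of the "original" Soundex variant: letters without a code
# (vowels, H, W, Y) are dropped first, then adjacent duplicate codes.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in group:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character code: first letter plus up to three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    # Codes of the remaining letters, with uncoded letters already removed.
    rest = [CODES[c] for c in letters[1:] if c in CODES]
    out = []
    prev = CODES.get(letters[0], "")  # the first letter's own code counts too
    for digit in rest:
        if digit != prev:
            out.append(digit)
        prev = digit
    # Pad with zeros, then truncate to the classic length of four.
    return (letters[0] + "".join(out) + "000")[:4]


print(soundex("Georges"), soundex("George"))
```

Both spellings end up with the same code, so comparing codes instead of raw strings would treat Georges and George as a match.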
You'll have to look at the documentation of your database system. MySQL for example provides the SOUNDEX function.
Otherwise, what should always work and give you better matching is to only work on upper or lower cased strings. SQL-92 defines the TRIM, UPPER, and LOWER functions. So you'd do something like WHERE UPPER(company) LIKE UPPER('%georges brown%').
In specific cases you can use a wildcard:
WHERE company LIKE '%george_ brown%' -- will match `georges` but not `georgeani` (nor plain `george`, since _ needs exactly one character)
_ is a single-character wildcard, while % is a multi-character wildcard.
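A quick way to see the difference between the two wildcards is to run both patterns against a few sample rows. The sketch below uses Python's built-in sqlite3 module only so that it runs standalone; the sample company names are made up, and SQLite's LIKE is assumed to behave like MySQL's for these simple ASCII patterns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (company TEXT)")
conn.executemany(
    "INSERT INTO users (company) VALUES (?)",
    [("George Brown",), ("Georges Brown",), ("Georgeani Brown",)],
)


def search(pattern):
    """Return company names matching the LIKE pattern, sorted by name."""
    rows = conn.execute(
        "SELECT company FROM users WHERE company LIKE ? ORDER BY company",
        (pattern,),
    )
    return [row[0] for row in rows]


# _ stands for exactly one character after "george"
underscore_hits = search("%george_ brown%")
# % stands for zero or more characters after "george"
percent_hits = search("%george% brown%")

print(underscore_hits)
print(percent_hits)
```

The underscore pattern only accepts exactly one extra character, while the percent pattern accepts any number of them, including none.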
But maybe it's better to use another piece of software for indexing, like Sphinx.
It has:
"Flexible text processing. Sphinx indexing features include full support for SBCS and UTF-8 encodings (meaning that effectively all world's languages are supported); stopword removal and optional hit position removal (hitless indexing); morphology and synonym processing through word forms dictionaries and stemmers; exceptions and blended characters; and many more."
It allows you to do smarter searches with partial matches, while providing more accuracy than SOUNDEX, for example.
Probably best to explode out your search string into individual words then find the plural / singular of each of those words. Then do a like for both possibilities for each word.
However for this to be usably efficient on large amounts of data you probably want to run against a table of words linked to each company.
Soundex alone probably isn't much use, as too many words are similar (it gives you a 4-character code, the first character being the first character of the word, while the next 3 are a numeric code). Levenshtein is more accurate, but MySQL has no built-in function for it, although PHP does have a fast one (the MySQL functions I found to calculate it were far too slow to be useful on a large search).
What I did for a similar search function was to take the input string and explode it into words, then convert those words to their singular form (my table of used words contains only singular versions of words). For each word I then found all the used words starting with the same letter and used Levenshtein to get the best match(es), and from these listed the possible matches. This made it possible to cope with typos (so it would likely find George if someone entered Goerge), and also to find best matches (i.e., if someone searched on 5 words but only 4 were found). It could also come up with a few alternatives if the spelling was miles out.
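The matching step described above could look roughly like this in Python. It is a sketch, not the code I actually ran: the word list USED_WORDS, the function names and the max_distance cut-off are all made up for illustration, and the singular-form conversion is left out.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_matches(word, vocabulary, max_distance=2):
    """Used words sharing the first letter, closest first."""
    word = word.lower()
    candidates = [v for v in vocabulary if v[:1] == word[:1]]
    scored = sorted((levenshtein(word, v), v) for v in candidates)
    return [v for distance, v in scored if distance <= max_distance]


# Hypothetical table of used (singular, lower-case) words
USED_WORDS = ["george", "brown", "general", "green"]

print(best_matches("Goerge", USED_WORDS))
```

Restricting candidates to the same first letter keeps the number of distance calculations small, at the price of missing typos in the first character.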
You may also want to look up Metaphone and Double Metaphone.
I'm creating a paraphrasing system, where a user inputs text and the system paraphrases for them.
My database looks like this:
KeyWord: dainty
Synonyms1: choice; delicious; tasty; juicy; luscious; palatable; savoury
Synonyms2: ethereal; beautiful; fragile; charming; petite; frail; elegant
where KeyWord (varchar), Synonyms1 (text), and Synonyms2 (text) are database columns. The example above is one row of a table with 3 fields and their values.
This is how it works: if the system finds, for example, a word like tasty, it can be replaced by any of the words separated by a semicolon from either Synonyms1 or Synonyms2, or by the keyword, because they are all synonyms.
Let me explain how the word search works. The system first searches for the word in the KeyWord column; if the word is not found, it goes on to search for the word in the Synonyms1 column, and so on.
My problem is checking for the user's specific word in the Synonyms1 or Synonyms2 columns. When I use the LIKE clause, the generic way of searching in the database, the system does not search for a whole word; instead, it searches for a sequence of characters. For example, let's assume the writer's text is: "Benson has an ice cube"; the system assumes that ice was found in choice. I don't want that; I want to search for a whole word.
If anyone has understood me, please help to solve this.
If I understand your question, you want to search for ice in columns Synonyms1 and Synonyms2 but make sure you do not inadvertently match a word such as choice.
If you have ever read or heard anything on the subject of database normalization, you would realize that your database does not even meet the requirements for 1NF (first normal form) because it has columns that consist of repeating values, which, as you have found out, makes searching inefficient and difficult. But let's move on:
A synonym column might just contain one word, so it might look like:
ethereal
Or:
ethereal; beautiful; fragile; charming; petite; frail; elegant
Thus the word you are looking for might be:
the entire column value
preceded by nothing and followed by a ;
preceded by a space and followed by a ;
preceded by a space and followed by nothing
So if your version of MySQL does not support regular expressions and you are looking for, say, the word ice in column Synonyms2, the WHERE clause should be:
WHERE (
Synonyms2 = 'ice'
OR
Synonyms2 like 'ice;%'
OR
Synonyms2 like '% ice;%'
OR
Synonyms2 like '% ice'
)
If you are running MySQL 8+, then:
WHERE regexp_like(Synonyms2, '( |^)ice(;|$)')
This states that ice must be preceded by either a space or the start of the string and followed by either a ; or the end of the string.
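If you want to check the pattern outside the database, the same regular expression can be tried in Python. This is only a sketch: contains_word and split_synonyms are names I made up, and it assumes the values are always separated by a semicolon followed by a space, as in your sample row.

```python
import re


def contains_word(synonyms, word):
    """True if word appears as a whole entry in a '; '-separated list."""
    # Same idea as the regexp_like pattern: space or start before,
    # semicolon or end after. re.escape guards against special characters.
    pattern = r"( |^)" + re.escape(word) + r"(;|$)"
    return re.search(pattern, synonyms) is not None


def split_synonyms(synonyms):
    """Alternative: split the column value into a list of whole words."""
    return [part.strip() for part in synonyms.split(";") if part.strip()]


row = "choice; delicious; tasty; juicy"
print(contains_word(row, "ice"))    # "ice" only occurs inside "choice"
print(contains_word(row, "tasty"))
print(split_synonyms(row))
```

Splitting the value into a list is essentially what a normalized table (one synonym per row) would give you for free.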
NB:
I could have used a dictionary like Pspell, Aspell or Hunspell, but this does not apply well to business names or cities. Furthermore, I don't want to query the DB for all the suggested corrections (especially with a typeahead firing every 300ms), and there are more issues with these dictionaries.
I could have used complementary search engine such as Elasticsearch or Sphinx but I do not have the financial or human resources allocated for this MVP. As suggested in this answer, MySQL fulltext should be enough and much less complex.
Technologies available:
MySQL 5.7 InnoDB with fulltext index boolean mode on desired fields, PHP 7.0 with php-fpm, VPS with Centos 7, corejs-typeahead
The objective:
I want to return from MySQL the results of a user search, whether it is a correct search or a misspelled search.
Example of common issues:
HYPHEN
Words with hyphens '-' are annoying to handle in a partial search.
Potential solution:
I would have to wrap the search query within "" to search for a phrase (see the examples in the MySQL manual). Still, it would not find a business named '"le dé-k-lé"' because ft_min_word_len=3 and because "de" and "le" are stopwords (too frequent in many languages).
I could, but will not, go into the following solutions, because I am not skilled enough or because they are inappropriate: as suggested by the MySQL manual, modify the MySQL source, modify a character set file, or add a new collation. For example, if I want to use the minus (-) operator to filter out some words in the future, it will no longer be possible.
APOSTROPHE / SINGLE QUOTE
Words with an apostrophe are often searched for without the apostrophe (especially on mobiles). For example, "A'trego" would be input as "atrego". It will definitely be missed by the fulltext index, as "A'trego" is treated as 2 words, "a" and "trego".
DOUBLE LETTERS MISSED
words with double letters are often missed or misspelled by the user. For example, 'Cerrutti' could be misspelled 'Cerutti' or 'Cerruti', etc.
Potential solution:
I could use SOUNDEX() but it is mostly designed for english language
I could use the levenshtein function, but it would be slow for large datasets (e.g. a table with all European cities). It seems that it has to do a full scan; coupled with a typeahead, it is definitely not the way to go, even though some of the suggestions made elsewhere are interesting.
EXONYMS and PLURAL FORMS
Exonyms can be difficult to handle in searches (from the user perspective). For example, the Italian city Firenze is named Florenz in German, Florence in French, etc. People often switch from the exonym to the local name when they are in the city itself. Exonyms will not be handled appropriately by the previous algorithms. Furthermore, it is not a good user experience to have a city name without its exonyms. It is not good either for i18n.
Potential solution:
A self-made dictionary using Pspell or other similar libraries would return the string that is stored and indexed in MySQL.
DIACRITICS
Similar to exonyms, diacritics can be difficult for the user to handle. The same goes for i18n. For example, try to find a restaurant in Łódź in Poland using your usual keyboard. A Polish and an English person will definitely not approach this string the same way.
Potential solution:
The potential solution is already managed in the front end by the mapping used by the corejs-typeahead library. The remainder is cleaned with PHP: $strCleaned = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
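For comparison, the same kind of diacritic folding can be sketched in Python with the standard unicodedata module. This is an illustration under assumptions, not what my front end does: the function name fold_diacritics and the small EXTRA mapping are made up, and the mapping is far from exhaustive. It is needed because letters such as Ł have no Unicode decomposition and would otherwise pass through unchanged.

```python
import unicodedata

# Letters without a Unicode decomposition need an explicit mapping.
# Illustrative only; a real list would be longer.
EXTRA = str.maketrans({"Ł": "L", "ł": "l", "Ø": "O", "ø": "o", "ß": "ss"})


def fold_diacritics(text):
    """Strip accents so 'Łódź' can be matched by typing 'Lodz'."""
    # NFKD splits 'ó' into 'o' plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("Łódź"))
print(fold_diacritics("le Dé-K-Lé"))
```

The folded form would be stored next to the original name, so that the display value keeps its diacritics while the search works on the plain form.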
ABBREVIATIONS & ACRONYMS
Abbreviations are used interchangeably with company names, especially for blue chips. For example, LVMH, HP, GM, GE, BMW. The same goes for cities. Not returning a company or a city when searching with the abbreviation is a big fail in terms of user experience.
Potential solution:
First, ft_min_word_len should be reduced to two characters.
Second, a stopword list should be implemented
Third, the fulltext index rebuilt.
I do not see any other sustainable alternative
This list is not exhaustive in the issues nor the potential solutions.
MY SOLUTION
My solution is inspired and extrapolated from an answer here
Basically, before each search, the user input should be stripped of characters such as apostrophes and hyphens, and simplified by collapsing repeated consecutive letters.
Those cleaned alternative words will be stored in a column indexed with a fulltext index.
This solution is fairly simple and responds appropriately to my requirements. But my short experience tells me to be cautious, as it surely suffers from drawbacks (that I have not identified yet).
Below is a simplified version of my code.
PHP
// Get input from the typeahead searched word
$query = (!empty($_GET['q'])) ? strtolower(trim($_GET['q'])) : null;
// End the script if the query is empty
if ($query === null || $query === '') {
die('Invalid query.');
}
// Filter/sanitize the query: only digits, spaces, letters and a few punctuation marks
if (!preg_match("/^[0-9 '#&\-.\pL]+$/ui", $query)) {
exit;
}
// Clean and strip input
$query = str_replace(["'", "-"], "", $query);
// Collapse repeated consecutive characters (the u flag keeps multibyte characters intact)
$query = preg_replace('{(.)\1+}u', '$1', $query);
$query = mysqli_real_escape_string($conn, $query); // I will switch to PDO prepared statements soon, as mysqli_real_escape_string does not offer enough protection
MySQL Query
SELECT DISTINCT
company.company_name,
MATCH (company_name, company_alternative) AGAINST ('$query*' IN BOOLEAN MODE) AS relevance
FROM company
WHERE
MATCH (company_name, company_alternative) AGAINST ('$query*' IN BOOLEAN MODE)
HAVING relevance > 1 -- a SELECT alias cannot be referenced in WHERE, so it is filtered here
ORDER BY
CASE
WHEN company_name = '$query' THEN 0
WHEN company_name LIKE '$query%' THEN 1
WHEN company_name LIKE '%$query' THEN 2
ELSE 3
END
LIMIT 20
MySQL Table
As a reminder, I got a two column fulltext index on (company_name,company_alternative)
**company_name** | **company_alternative**
l'Attrego | lattrego latrego attrego atrego
le Dé-K-Lé | dekle dekale decale
General Electric | GE
THE DRAWBACKS of my solution that I have identified
The alternative words will not contain the common spelling mistakes until I add them manually to the company_alternative column or a machine learning process is implemented. Thus, it is difficult to manage and not scalable (this drawback could be removed without too much difficulty with machine learning, as I already collect all search queries).
I have to manage a dynamic and complex stopword list
I have to rebuild the indexes due to lowering ft_min_word_len to 2
So my question,
"How to implement an autocorrect/alternative spelling search system with PHP and MySQL fulltext boolean mode for an MVP?" could be reworded to:
Is my solution the least scalable?
Do you see drawbacks that I do not see?
How could I improve this approach if it is a reasonable one?
This may be a newbie question, as I'm not an expert in SQL. However, I couldn't find the answer using Google.
I have a table called record_fields which contains the majority of my system's content, which I want to search in. The content cell is defined as LONGTEXT as it can include extremely long input.
Originally, I used (simplifying the query a bit for clarity's sake):
SELECT * FROM record_fields WHERE LOWER(content) LIKE LOWER('%{$keyword}%')
Execution time aside, this query has one major issue. If I search for the term "post" it will return all content which has words like "poster", "posting" and others. I wanted to add a FULLTEXT search.
Now the query looks like this (again, simplified):
SELECT * FROM record_fields WHERE MATCH (content) AGAINST ('{$keyword}')
However, this is still problematic. With MATCH, if my system's users search for the words "Bank of America", for example, all records that have either the word "Bank" or the word "America" will be returned.
TL;DR - my question is this:
how do I use MATCH to search for exact phrases with space in them?
Any help would be highly appreciated, thanks in advance!
%{keyword}% matches all text sub-strings that include your keyword anywhere in the string. MATCH usually takes all keywords in the match string as individual search terms, and matches against each. You can use boolean mode and use a + symbol before each required keyword. Take a look at the MySQL reference for this.
Edited the answer to reflect Idan's response in not getting the results from the suggested %keyword solution.
You can use Match Against With Boolean Mode and you can put your input string inside '"{$keyword}"'.
Check last example in below link
https://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html
SELECT * FROM record_fields WHERE MATCH (content) AGAINST ('"{$keyword}"' IN BOOLEAN MODE )
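To show how the quoted phrase could be built safely from user input, here is a small sketch in Python. The helper name phrase_expression is made up, and the %s placeholder assumes a MySQL driver that uses that parameter style (PyMySQL and mysql-connector-python do); the point is that the quotes go around the bound value, not into the SQL text.

```python
def phrase_expression(keyword):
    """Wrap user input in double quotes for a boolean-mode phrase search."""
    # A double quote inside the input would end the phrase early, so
    # replace it with a space and normalize the whitespace afterwards.
    cleaned = " ".join(keyword.replace('"', " ").split())
    return '"' + cleaned + '"'


# The search expression is passed as a bound parameter, never concatenated.
SQL = ("SELECT * FROM record_fields "
       "WHERE MATCH (content) AGAINST (%s IN BOOLEAN MODE)")

params = (phrase_expression("Bank of America"),)
print(SQL)
print(params)
```

With a real connection you would pass SQL and params to the cursor's execute method; that part is left out because it depends on your driver.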
In a database I work with, there are a few million rows of customers. To search this database, we use a match against Boolean expression. All was well and good, until we expanded into an Asian market, and customers are popping up with the name 'In'. Our search algorithm can't find this customer by name, and I'm assuming that it's because it's an InnoDB reserved word. I don't want to convert my query to a LIKE statement because that would reduce performance by a factor of five. Is there a way to find that name in a full text search?
The query in production is very long, but the portion that's not functioning as needed is:
SELECT
`customer`.`name`
FROM
`customer`
WHERE
MATCH(`customer`.`name`) AGAINST("+IN*+KYU*+YANG*" IN BOOLEAN MODE);
Oh, and the innodb_ft_min_token_size variable is set to 1 because our customers "need" to be able to search by middle initial.
It isn't a reserved word, but it is in the stopword list. You can override this with ft_stopword_file, to give your own list of stopwords. Two possible problems with this are: (1) on altering it, you need to rebuild your fulltext index; (2) it's a global variable: you can't alter it on a session / location / language-used basis, so if you really need all the words and are using a lot of different languages in one database, providing an empty one is almost the only way to go, which can hurt a bit for uses where you would like a stopword list to be used. Note that ft_stopword_file applies to MyISAM fulltext indexes; for InnoDB the corresponding settings are innodb_ft_server_stopword_table, innodb_ft_user_stopword_table and innodb_ft_enable_stopword, and the index likewise has to be rebuilt after changing them.
Say if I had a table of books in a MySQL database and I wanted to search the 'title' field for keywords (input by the user in a search field); what's the best way of doing this in PHP? Is the MySQL LIKE command the most efficient way to search?
Yes, the most efficient way usually is searching in the database. To do that you have three alternatives:
LIKE to match exact substrings (ILIKE, the case-insensitive variant, exists in PostgreSQL but not in MySQL)
RLIKE to match POSIX regexes
FULLTEXT indexes, which offer three further kinds of search (natural language, boolean, and query expansion)
So what is best depends on what you will actually be searching for. For book titles I'd offer a LIKE search for exact substring matches, useful when people know the book they're looking for, and also a FULLTEXT search to help find titles similar to a word or phrase. I'd give them different names in the interface, of course, probably something like exact for the substring search and similar for the fulltext search.
An example about fulltext: http://www.onlamp.com/pub/a/onlamp/2003/06/26/fulltext.html
Here's a simple way you can break apart some keywords to build some clauses for filtering a column on those keywords, either ANDed or ORed together.
$terms = explode(',', $_GET['keywords'] ?? '');
$clauses = array();
foreach ($terms as $term)
{
    //remove any chars you don't want to be searching - adjust to suit
    //your requirements
    $clean = trim(preg_replace('/[^a-z0-9]/i', '', $term));
    if ($clean !== '')
    {
        //note use of mysqli_real_escape_string ($conn is assumed to be an
        //open mysqli connection; the older mysql_escape_string was removed
        //in PHP 7) - while not strictly required in this example due to the
        //preg_replace earlier, it's good practice to sanitize your DB inputs
        //in case you modify that filter...
        $clauses[] = "title like '%" . mysqli_real_escape_string($conn, $clean) . "%'";
    }
}
if (!empty($clauses))
{
    //concatenate the clauses together with AND or OR, depending on
    //your requirements
    $filter = '(' . implode(' AND ', $clauses) . ')';
    //build and execute the required SQL
    $sql = "select * from foo where $filter";
}
else
{
    //no search term, do something else, find everything?
}
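The same clause-building idea also works with bound parameters, which avoids escaping altogether. Below is a sketch in Python using the built-in sqlite3 module purely so that it runs standalone; the table foo and column title are taken from the example above, while the sample rows and the function names build_filter and search are made up.

```python
import re
import sqlite3


def build_filter(keywords, joiner="AND"):
    """Turn 'a, b' into ('title LIKE ? AND title LIKE ?', ['%a%', '%b%'])."""
    if joiner not in ("AND", "OR"):
        raise ValueError("joiner must be AND or OR")
    # Keep only letters and digits in each comma-separated term.
    terms = [re.sub(r"[^a-z0-9]", "", t, flags=re.I) for t in keywords.split(",")]
    terms = [t for t in terms if t]
    clause = (" " + joiner + " ").join("title LIKE ?" for _ in terms)
    params = ["%" + t + "%" for t in terms]
    return clause, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (title TEXT)")
conn.executemany("INSERT INTO foo VALUES (?)",
                 [("Learning SQL",), ("SQL Cookbook",), ("Learning PHP",)])


def search(keywords, joiner="AND"):
    clause, params = build_filter(keywords, joiner)
    if not clause:
        return []  # no usable search term
    sql = "SELECT title FROM foo WHERE " + clause + " ORDER BY title"
    return [row[0] for row in conn.execute(sql, params)]


print(search("learning, sql"))
print(search("learning, sql", "OR"))
```

Only the fixed text "title LIKE ?" and the validated AND/OR keyword are ever concatenated into the statement; the user's terms travel separately as parameters.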
Consider using Sphinx. It's an open source full-text engine that can consume your MySQL database directly. It's far more scalable and flexible than hand-coding LIKE statements (and far less susceptible to SQL injection).
You may also check soundex functions (soundex, sounds like) in mysql manual http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
It's useful to fall back to these matches when, for example, strict checking (with LIKE or =) did not return any results.
Paul Dixon's code example gets the main idea across well for the LIKE-based approach.
I'll just add this usability idea: Provide an (AND | OR) radio button set in the interface, default to AND, then if a user's query results in zero (0) matches and contains at least two words, respond with an option to the effect:
"Sorry, no matches were found for your search phrase. Expand search to match on ANY word in your phrase?"
Maybe there's a better way to word this, but the basic idea is to guide the person toward another query (that may be successful) without the user having to think in terms of the Boolean logic of AND and ORs.
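In code, the fallback could be sketched like this. It is Python with an in-memory list standing in for the database, just to show the control flow; TITLES and the function names are hypothetical.

```python
# Hypothetical stand-in for the books table
TITLES = ["Learning SQL", "SQL Cookbook", "Learning PHP", "PHP in Action"]


def search(phrase, mode="AND"):
    """Match titles containing all (AND) or any (OR) of the words."""
    words = phrase.lower().split()
    if not words:
        return []
    combine = all if mode == "AND" else any
    return [t for t in TITLES if combine(w in t.lower() for w in words)]


def search_with_suggestion(phrase):
    """Run the default AND search and say whether to offer the OR search."""
    results = search(phrase, "AND")
    # Only offer the expansion when it can actually change the outcome.
    offer_or = not results and len(phrase.split()) >= 2
    return results, offer_or


results, offer_or = search_with_suggestion("sql action")
print(results, offer_or)
print(search("sql action", "OR"))
```

A single-word query never triggers the offer, because AND and OR give the same result for one word.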
I think LIKE is the most efficient way if the search is a single word. Multiple words can be split with the explode function, as already mentioned, and each word searched for individually in a loop. To avoid returning the same result twice, read the values into an array and ignore any value that already exists in it. The count function then tells you where to stop when printing in a loop. Sorting can be done with the similar_text function, using the percentage it returns to sort the array.
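As a rough illustration of the de-duplicate-then-sort step, here is a Python sketch. Note that it uses difflib.SequenceMatcher from the standard library as a stand-in for PHP's similar_text; the two are different algorithms and will not give identical percentages. The function name rank_titles and the sample titles are made up.

```python
from difflib import SequenceMatcher


def rank_titles(query, titles):
    """Drop duplicate titles, then sort by similarity to the query."""
    seen = set()
    unique = []
    for title in titles:
        if title not in seen:  # ignore a result that was already returned
            seen.add(title)
            unique.append(title)

    def score(title):
        # Ratio between 0.0 and 1.0; higher means more similar.
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()

    return sorted(unique, key=score, reverse=True)


hits = ["Learning SQL", "SQL Cookbook", "Learning SQL"]
print(rank_titles("sql cookbook", hits))
```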