In a database I work with, there are a few million rows of customers. To search this database, we use MATCH ... AGAINST in Boolean mode. All was well and good until we expanded into an Asian market, and customers started popping up with the name 'In'. Our search algorithm can't find this customer by name, and I'm assuming that's because 'in' is an InnoDB reserved word. I don't want to convert my query to a LIKE statement because that would reduce performance by a factor of five. Is there a way to find that name with a full text search?
The query in production is very long, but the portion that's not functioning as needed is:
SELECT
`customer`.`name`
FROM
`customer`
WHERE
MATCH(`customer`.`name`) AGAINST("+IN*+KYU*+YANG*" IN BOOLEAN MODE);
Oh, and the innodb_ft_min_token_size variable is set to 1 because our customers "need" to be able to search by middle initial.
It isn't a reserved word, but it is in the default stopword list. Since you are on InnoDB, you can override the list with innodb_ft_server_stopword_table (the ft_stopword_file variable only applies to MyISAM tables), pointing it at a table containing your own stopwords. Two possible problems with this: (1) after altering it, you need to rebuild your fulltext index; (2) it's a global variable: you can't alter it on a session / location / language-used basis, so if you really need all the words and are using a lot of different languages in one database, providing an empty list is almost the only way to go, which can hurt a bit for uses where you would like a stopword list to be applied.
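Since the question's innodb_ft_min_token_size setting implies InnoDB tables, the server-wide override there is the innodb_ft_server_stopword_table variable. A minimal sketch, where the database, table, and index names are hypothetical:

```sql
-- The stopword table must be InnoDB and have a single VARCHAR column named `value`.
CREATE TABLE mydb.my_stopwords (value VARCHAR(30)) ENGINE = InnoDB;

-- Leave it empty to disable stopwords entirely, or insert only the words you want filtered.
SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords';

-- Rebuild the FULLTEXT index so the new list takes effect.
ALTER TABLE customer DROP INDEX ft_name;
ALTER TABLE customer ADD FULLTEXT INDEX ft_name (name);
```

Running OPTIMIZE TABLE with innodb_optimize_fulltext_only=ON is an alternative way to rebuild the index.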
NB:
I could have used a dictionary like Pspell, Aspell or Hunspell, but these do not work well for business names or city names. Furthermore, I don't want to query the DB for all the suggested corrections (especially with a typeahead firing every 300ms) (there are more issues with these dictionaries).
I could have used a complementary search engine such as Elasticsearch or Sphinx, but I do not have the financial or human resources to allocate to that for this MVP. As suggested in this answer, MySQL fulltext should be enough and much less complex.
Technologies available:
MySQL 5.7 InnoDB with fulltext index boolean mode on desired fields, PHP 7.0 with php-fpm, VPS with Centos 7, corejs-typeahead
The objective:
I want to return from MySQL the results of a user search, whether it is a correct search or a misspelled search.
Example of common issues:
HYPHEN
Words with hyphens ('-') are annoying to handle in partial searches.
Potential solution:
I would have to wrap the search query within "" to search for a phrase (see the examples in the manual). Still, it would not find a business named 'le dé-k-lé', because ft_min_word_len=3 and both 'de' and 'le' are stopwords (too frequent in many languages).
I could, but will not, get into the solutions suggested by the MySQL manual (modifying the MySQL source, modifying a character set file, or adding a new collation) because I am not skilled enough or they are inappropriate here. For example, if '-' were made a word character, it would no longer be possible to use the minus (-) operator to filter out words in the future.
APOSTROPHE / SINGLE QUOTE
Words with an apostrophe are often searched without the apostrophe (especially on mobiles). For example, "A'trego" would be input as "atrego". It will definitely be missed by the fulltext index, as "A'trego" is indexed as two words, 'a' and 'trego'.
DOUBLE LETTERS MISSED
words with double letters are often missed or misspelled by the user. For example, 'Cerrutti' could be misspelled 'Cerutti' or 'Cerruti', etc.
Potential solution:
I could use SOUNDEX(), but it is mostly designed for the English language.
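That said, since SOUNDEX collapses repeated letters, it does handle the double-letter case above. A sketch against the company table used later in this question:

```sql
-- All three spellings produce the same SOUNDEX code, because
-- SOUNDEX collapses adjacent letters that map to the same digit.
SELECT SOUNDEX('Cerrutti'), SOUNDEX('Cerutti'), SOUNDEX('Cerruti');

-- Matching a stored name against a misspelled input:
SELECT company_name
FROM company
WHERE SOUNDEX(company_name) = SOUNDEX('Cerutti');
```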
I could use a levenshtein function, but it would be slow for large datasets (e.g. a table with all European cities). It seems that it has to do a full scan, and coupled with a typeahead it is definitely not the way to go, even though some suggestions are interesting here and here.
EXONYMS and PLURAL FORMS
Exonyms can be difficult to handle in searches (from the user perspective). For example, the Italian city Firenze is named Florenz in German, Florence in French, etc. People often switch from the exonym to the local name when they are in the city itself. Exonyms will not be handled appropriately by the previous algorithms. Furthermore, it is not a good user experience to have a city name without its exonyms. It is not good either for i18n.
Potential solution:
A self-made dictionary using Pspell or other similar libraries would return the string that is stored and indexed in MySQL.
DIACRITICS
- Similar to exonyms, diacritics can be difficult for the user to handle. The same goes for i18n. For example, try to find a restaurant in Łódź in Poland using your usual keyboard. A Polish and an English person will definitely not approach this string the same way.
Potential solution:
- This is already partly handled in the front-end by the character mapping used by the corejs-typeahead library. Whatever remains is cleaned in PHP with $strCleaned = iconv('UTF-8', 'ASCII//TRANSLIT', $str); (the target charset must be ASCII//TRANSLIT for the transliteration to actually strip diacritics).
ABBREVIATIONS & ACRONYMS
- Abbreviations are used interchangeably with company names, especially for blue chips: for example, LVMH, HP, GM, GE, BMW. The same goes for cities. Not returning a company or a city when searching by its abbreviation is a big failure in terms of user experience.
Potential solution:
First, ft_min_word_len (innodb_ft_min_token_size on InnoDB) should be reduced to two characters.
Second, a stopword list should be implemented.
Third, the fulltext index must be rebuilt.
I do not see any other sustainable alternative.
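A sketch of that configuration change, assuming InnoDB (the variable is read only at server startup, so it goes in my.cnf and requires a restart plus an index rebuild):

```ini
# my.cnf -- innodb_ft_min_token_size cannot be changed at runtime
[mysqld]
innodb_ft_min_token_size = 2
```

After restarting, the fulltext index can be rebuilt with ALTER TABLE ... DROP INDEX / ADD FULLTEXT INDEX, or with OPTIMIZE TABLE while innodb_optimize_fulltext_only=ON.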
This list is not exhaustive in the issues nor the potential solutions.
MY SOLUTION
My solution is inspired by and extrapolated from an answer here.
Basically, before each search, the user input is stripped of characters such as apostrophes and hyphens, and simplified to collapse repeated consecutive letters.
Those cleaned alternative words are stored in a dedicated column indexed with a fulltext index.
This solution is fairly simple and responds appropriately to my requirements. But my short experience tells me to be cautious, as it surely suffers from drawbacks that I have not identified yet.
Below is a simplified version of my code.
PHP
// Get input from the typeahead searched word
$query = (!empty($_GET['q'])) ? strtolower($_GET['q']) : null;
// end the script if empty query
if (!isset($query)) {
die('Invalid query.');
}
// Clean and Strip input
$query = trim($query);
$query = str_replace("'","",$query);
$query = str_replace("-","",$query);
$query = preg_replace('{(.)\1+}','$1',$query);
// filter/sanitize query: only letters, digits, spaces and a few symbols allowed
if (!preg_match("/^[0-9 '#&\-.\pL]+$/u", $query)) { exit('Invalid query.'); }
$query = mysqli_real_escape_string($conn, $query); // I will switch to PDO prepared statements soon, as mysqli_real_escape_string does not offer enough protection
MySQL Query
SELECT DISTINCT
company.company_name,
MATCH (company_name, company_alternative) AGAINST ('$query*' IN BOOLEAN MODE) AS relevance
FROM company
WHERE
MATCH (company_name, company_alternative) AGAINST ('$query*' IN BOOLEAN MODE)
HAVING relevance > 1 -- the alias cannot be referenced in WHERE
ORDER BY
CASE
WHEN company_name = '$query' THEN 0
WHEN company_name LIKE '$query%' THEN 1
WHEN company_name LIKE '%$query' THEN 2
ELSE 3
END
LIMIT 20
MySQL Table
As a reminder, I have a two-column fulltext index on (company_name, company_alternative).
**company_name** | **company_alternative**
l'Attrego | lattrego latrego attrego atrego
le Dé-K-Lé | dekle dekale decale
General Electric | GE
THE DRAWBACKS of my solution that I have identified
The alternative words will not contain common spelling mistakes until I add them manually to the company_alternative column or implement a machine-learning process. It is therefore difficult to manage and not scalable (this drawback could be mitigated without too much difficulty with machine learning, as I already collect all search queries).
I have to manage a dynamic and complex stopword list.
I have to rebuild the indexes after lowering ft_min_word_len to 2.
So my question,
"How do I implement an autocorrect/alternative-spelling search system with PHP and MySQL fulltext boolean mode for an MVP?", could be reworded as:
Is my solution at least scalable?
Do you see drawbacks that I do not see?
How could I improve this approach, if it is a reasonable one?
I am working on a big eCommerce shopping website. I have around 40 databases. I want to create a search page which shows 18 results after searching by title across all the databases.
(SELECT id_no,offers,image,title,mrp,store from db1.table1 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db3.table3 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db2.table2 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
LIMIT 18
Currently I am using the above query. It works fine for keywords of 4 or more characters (like laptop, nokia, etc.) but takes 10-15 seconds to process; for keywords of less than 3 characters it takes 30-40 seconds, or I end up with a 500 internal server error. Is there an optimized way of searching across multiple databases? I created two indexes: the primary key and a fulltext index on title.
Currently my search page is in PHP; I am ready to code in Python or any other language if it gets me better speed.
You can use Sphinx: http://sphinxsearch.com/. It is a powerful search engine for databases. IMHO Sphinx is the best choice for search on your site.
FULLTEXT is not configured (by default) to search for words less than three characters in length. You can configure it to handle shorter words by setting the ...min_token_size parameter. Read this: https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html. You can only do this if you control the MySQL server; it won't be possible on shared hosting.
FULLTEXT is designed to produce more false-positive matches than false-negative matches. It's generally most useful for populating dropdown picklists like the ones under the location field of a browser. That is, it requires some human interaction to choose the correct record. To expect FULLTEXT to be able to do absolutely correct searches is probably a bad idea.
You simply cannot use AND column LIKE '%whatever%' if you want any reasonable performance at all. You must get rid of that. You might be able to rewrite your python program to do something different when the search term is one or two letters, and thereby avoid many, but not all, LIKE '%a%' and LIKE '%ab%' operations. If you go this route, create ordinary indexes on your title columns. Whatever you do, don't combine the FULLTEXT and LIKE searches in a single query.
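To illustrate the ordinary-index route for short terms: a left-anchored LIKE (no leading wildcard) can use a B-tree index, while a leading-% pattern cannot. A sketch, with a hypothetical index name:

```sql
-- Ordinary B-tree index on title (use a prefix length if title is a long column).
CREATE INDEX idx_title ON db1.table1 (title);

-- Can use the index: the pattern is anchored at the start of the string.
SELECT id_no, title FROM db1.table1 WHERE title LIKE 'ab%';

-- Cannot use the index: a leading wildcard forces a full scan. Avoid.
-- SELECT id_no, title FROM db1.table1 WHERE title LIKE '%ab%';
```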
If this were my project I'd consider using a special table with columns like this to hold all the short words from the title column in every row of each table.
id_pk INT autoincrement
id_no INT
word VARCHAR(3)
Then you can use a query like this to look up short words
SELECT a.id_no,offers,image,title,mrp,store
FROM db1.table1 a
JOIN db1.table1_shortwords s ON a.id_no = s.id_no
WHERE s.word = '$searchkey'
To do this, you will have to preprocess the title columns of your other tables to populate the shortwords tables, and put an index on the word column. This will be fast, but it will require a special-purpose program to do the preprocessing.
Having to search multiple tables with your UNION ALL operation is a performance problem. You will be able to improve performance dramatically by redesigning your schema so you need search only one table.
Having to search databases on different server machines is a performance problem. You may be able to rig up your python program to search them in parallel: that is, to somehow use separate tasks to search each one, then aggregate the results. Each of those separate search tasks requires its own connection to the data base, so this is not a cheap or simple solution.
If this system faces the public web, you will have to redesign it sooner or later, because it will never perform well enough as it is now. (Sorry to be the bearer of bad news.) Many system designers like to avoid redesigning systems after they become enormous. So, if I were you I would get the redesign done.
If your focus is on searching, then bend the schema to facilitate searching rather than the other way around.
Collect all the strings to search for in a single table. Whereas a UNION of 40 tables does work, it will be ~40 times as slow as having the strings collected together.
Use FULLTEXT when the words are long enough, use some other technique when they are not. (This addresses your 3-char problem; see also the Answer discussing innodb_ft_min_token_size. You are using InnoDB, correct?)
Use + and boolean mode to say that a word is mandatory: MATCH(col) AGAINST("+term" IN BOOLEAN MODE)
Do not add on a LIKE clause unless there is a good reason.
I am trying to do a search query with SQL; my page contains an input field whose value is taken and simply concatenated to my SQL statement.
So, Select * FROM users after a search then becomes SELECT * FROM users WHERE company LIKE '%georges brown%'.
It then returns results based on what the user types in; in this case, Georges Brown. However, it only finds entries whose companies are typed out exactly as Georges Brown (with an 's').
What I am trying to do is return a result set that not only contains entries with Georges but also George (no 's').
Is there any way to make this search more flexible so that it finds results with Georges and George?
Try using more wildcards around george.
SELECT * FROM users WHERE company LIKE '%george% %brown%'
Try this query:
SELECT *
FROM users
WHERE company LIKE '%george% brown%'
Use SOUNDEX
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
You can also remove the last two characters, compute the SOUNDEX codes, and compare them.
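A sketch of the SOUNDEX comparison against the question's table; MySQL also has the equivalent SOUNDS LIKE shorthand:

```sql
-- Compare SOUNDEX codes directly...
SELECT * FROM users
WHERE SOUNDEX(company) = SOUNDEX('georges brown');

-- ...or use the equivalent shorthand.
SELECT * FROM users
WHERE company SOUNDS LIKE 'georges brown';
```

Note that SOUNDEX treats the whole string as a single word, so splitting the input into words and comparing per word may give better results.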
You'll have to look at the documentation of your database system. MySQL for example provides the SOUNDEX function.
Otherwise, what should always work and give you better matching is to only work on upper or lower cased strings. SQL-92 defines the TRIM, UPPER, and LOWER functions. So you'd do something like WHERE UPPER(company) LIKE UPPER('%georges brown%').
In specific cases you can use a wildcard:
WHERE company LIKE 'george_ brown%' -- will match `georges` but not `georgeani`
_ is a single-character wildcard, while % is a multi-character wildcard.
But maybe it's better to use another piece of software for indexing, like Sphinx.
It has:
"Flexible text processing. Sphinx indexing features include full support for SBCS and UTF-8 encodings (meaning that effectively all world's languages are supported); stopword removal and optional hit position removal (hitless indexing); morphology and synonym processing through word forms dictionaries and stemmers; exceptions and blended characters; and many more."
It allows you to do smarter searches with partial matches, while providing more accuracy than soundex, for example.
It's probably best to explode your search string into individual words, then find the plural/singular form of each of those words, and do a LIKE for both possibilities for each word.
However for this to be usably efficient on large amounts of data you probably want to run against a table of words linked to each company.
Soundex alone probably isn't much use, as too many words are similar (it gives you a 4-character code: the first character is the first letter of the word, while the next 3 are a numeric code). Levenshtein is more accurate, but MySQL has no built-in method for it, although PHP does have a fast function (the MySQL functions I found to calculate it were far too slow to be useful on a large search).
What I did for a similar search function was to take the input string and explode it into words, then convert those words to their singular form (my table of used words only contains singular versions). For each word I then found all the used words starting with the same letter, and used levenshtein to get the best match(es), and from this listed out the possible matches. This made it possible to cope with typos (so it would likely find George if someone entered Goerge) and also to find best matches (i.e., if someone searched on 5 words but only 4 were found). It could also come up with a few alternatives if the spelling was miles out.
You may also want to look up Metaphone and Double Metaphone.
I want to create an autosuggest for a fulltext search with AJAX, PHP & MySQL.
I am looking for the right way to implement the backend. While the user is typing, the input field should show him suggestions, generated from text entries in a table.
Some information about these entries: they are stored as fulltext, generated from PDFs with 3-4 pages each. There are no more than 100 entries for now, and they will reach a maximum of 2000 in the next few years.
If the user starts to type, the word he is typing should be completed with a word stored in the DB, sorted by occurrences descending. The next step is to suggest combinations with other words which have a high occurrence in the entries matching the first word. You can compare it to Google's autosuggest.
I am thinking about 3 different ways to implement this:
Generate an index via a cronjob which counts occurrences of words and combinations overnight. The user searches on this index.
Do a live search within the entries with a LIKE '%search%' query, then look for the word after the match and GROUP the results by occurrence.
Create a logfile of all user searches, and look for good combinations as in 1), so the search gets more intelligent with each search action.
What is the best way to start with this? The search should be fast and performant.
Is there a better possibility I did not think about?
I'd use mysql's MATCH() AGAINST() (http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html), eg:
SELECT *
FROM table
WHERE MATCH(column) AGAINST('search')
ORDER BY MATCH(column) AGAINST('search')
Another advantage is that you could further tweak the importance of words being searched for (if necessary), like:
MATCH(column) AGAINST('>important <lessimportant' IN BOOLEAN MODE)
Or say that certain words of the search term are required, whilst others may not be present in the result, eg:
MATCH(column) AGAINST('+required -prohibited' IN BOOLEAN MODE)
I think idea no. 1 is the best. By the way, don't forget to eliminate stopwords from the autosuggest (an, the, by, ...).
I have a table that lists people and all their contact info. I want for users to be able to perform an intelligent search on the table by simply typing in some stuff and getting back results where each term they entered matches at least one of the columns in the table. To start I have made a query like
SELECT * FROM contacts WHERE
firstname LIKE '%Bob%'
OR lastname LIKE '%Bob%'
OR phone LIKE '%Bob%' OR
...
But now I realize that this will completely fail on something as simple as 'Bob Jenkins', because it is not smart enough to search for the first and last names separately. What I need to do is split up the search terms, search for them individually, and then intersect the results from each term somehow. At least that seems like the solution to me. But what is the best way to go about it?
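The intersect idea can be expressed directly in SQL by ANDing together one OR-group per search term, built from the split-up input. A sketch with just three of the columns:

```sql
-- Each row must contain every term in at least one column.
SELECT * FROM contacts
WHERE (firstname LIKE '%Bob%'     OR lastname LIKE '%Bob%'     OR phone LIKE '%Bob%')
  AND (firstname LIKE '%Jenkins%' OR lastname LIKE '%Jenkins%' OR phone LIKE '%Jenkins%');
```

Note that leading-wildcard LIKEs still cannot use an index, so this only addresses the correctness problem, not performance.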
I have heard about fulltext and MATCH()...AGAINST() but that sounds like a rather fuzzy search and I don't know how much work it is to set up. I would like precise yes or no results with reasonable performance. The search needs to be done on about 20 columns by 120,000 rows. Hopefully users wouldn't type in more than two or three terms.
Oh sorry, I forgot to mention I am using MySQL (and PHP).
I just figured out fulltext search and it is a cool option to consider (is there a way to adjust how strict it is? LIMIT would just chop off the results regardless of how well they matched). But this requires a fulltext index, and my website is using a view, and you can't index a view, right? So...
I would suggest using MATCH / AGAINST. Full-text searches are more advanced searches, more like Google's, less elementary.
It can match across multiple tables and rank them to how many matches they have.
Otherwise, if the word is there at all, esp. across multiple tables, you have no ranking. You can do ranking server-side, but that is going to take more programming/time.
Depending on what database you're using, the ability to do cross columns can become more or less difficult. You probably don't want to do 20 JOINs as that will be a very slow query.
There are also engines such as Sphinx and Lucene dedicated to do these types of searches.
BOOLEAN MODE
SELECT * FROM contacts WHERE
MATCH(firstname,lastname,email,webpage,country,city,street...)
AGAINST('+bob +jenkins' IN BOOLEAN MODE)
Boolean mode is very powerful. It might even fulfil all my needs. I will have to do some testing. By placing + in front of the search terms, those terms become required (the row must match 'bob' AND 'jenkins' instead of 'bob' OR 'jenkins'). This mode even works on non-indexed columns, and thus I can use it on a view, although it will be slower (that is what I need to test). One final problem I had was that it wasn't matching partial search terms; 'bob' wouldn't find 'bobby', for example. The usual % wildcard doesn't work here; instead you use an asterisk (*).