I'm working on a program that needs to find similar text on a webpage. Our SQL database holds 400,000 search terms, for example ‘San Miguel Pale Pilsen’, ‘Schaumburger Bali’ and ‘Rizmajer Cortez’.
Currently I check each word on the webpage against the database. For each word I send a SELECT query with a LIKE '%word%' condition, and for each result I call PHP's similar_text(). If the search term has more words than the piece of page text I'm comparing it with, I append the following words from the webpage so that both have the same number of words.
(And yes I know that it isn’t smart)
The problem is that it takes a lot of time and puts a heavy load on the server.
What is the best and fastest way to find similar text on a webpage?
The LIKE operator will always be slow if the pattern starts with a % wildcard, because a leading wildcard prevents MariaDB from using an index on the column, so every row has to be scanned.
Since you need to find words at any position in the VARCHAR column, the best solution is to implement bona fide full-text search. See MariaDB's Full-Text Index Overview.
Searches will become orders of magnitude faster, and the solution will scale far better.
Related
Current Situation:
I am currently running a keyword search using multiple keywords in PHP and SQL. The field I'm applying the search to is the title field, which is a VARCHAR(250) field.
A user can input a single keyword, e.g. "apple" or also multiple, e.g. "apple banana yellow". The first option is trivial. For the second option, my current algorithm works like this:
1) Try to find items that match the exact entire string "apple banana yellow" in the title. Order the results by index id.
2) If no more results matching the exact entire string are found, or if none are found in the first place, search for all titles containing either "apple", "banana", or "yellow". Order the results by index id.
The algorithm is very basic but, funnily enough, works pretty well.
What I'm looking for:
However I am now looking to implement a smarter search algorithm without having to rely on external paid scripts like Amazon services. I'm looking for a way to implement the following:
fuzzy search (I've read about SOUNDEX or Levenshtein, which might achieve this)
smarter keyword search (don't just return items that match ALL words or JUST A SINGLE WORD, but also items that match 2 or 3 of the words, ranked ahead of single-word matches)
order by relevance/likeness (Order by likeness of the search to the title, and not just the index id)
(Bonus: maybe even implement search for exact strings, like using " " on google to find exactly the words between the quotation marks)
What is the best way to get started with such a search? I am using InnoDB for MySQL.
Assuming MySQL, you can add a FULLTEXT index. There are then a number of functions that will allow you to do basic searches that meet most of the needs you list: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
You end up using syntax like:
SELECT * FROM table_name
WHERE MATCH(column_with_fulltext_index_on_it)
      AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE);
To see the match score:
SELECT column_with_fulltext_index_on_it,
       MATCH(column_with_fulltext_index_on_it)
       AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE) AS score
FROM table_name
WHERE MATCH(column_with_fulltext_index_on_it)
      AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE);
There can be a bit of a learning curve in understanding how to tweak the MATCH clause to fit your needs, but your examples seem pretty basic (except the smarter search).
Also good to note: there are system settings that control the min/max length of the words/tokens that get indexed. You can read https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html to get a deeper understanding of the indexing options. Percona is a good resource as well: https://www.percona.com/blog/2013/02/26/myisam-vs-innodb-full-text-search-in-mysql-5-6-part-1/ (typically more digestible than the MySQL docs).
If you need to do more complex searches, you can look at adding other technologies like Solr, but I've always recommended getting the basics working with what you've got, and only adopting a new technology if you hit a brick wall, or if you have good metrics on the existing solution and know the new technology will improve on them (speed, storage space, quality of results, etc.). If you can't quantify it, stick to the basics until you can.
Here's a good tutorial: http://www.w3resource.com/mysql/mysql-full-text-search-functions.php
I am working on a project and I need your suggestions in a database query. I am using PHP and MySQL.
Context
I have a table named phrases containing a phrases column in which there are phrases stored, each of which consists of one to three words.
I have a text string which contains 500 - 1000 words
I need to highlight all the phrases in the text string which exist in my phrases database table.
My solution
I go through every phrase in the phrase list and compare it against the text, but the number of phrases is large (100k) so it takes about 2 min or more to do this matching.
Is there any more efficient way of doing this?
I'm going to focus on how to do the comparison against 100K values. This requires two steps.
a) Write a C++ library and link it to PHP as an extension. Google PHP-CPP; it is a framework that allows you to do this.
b) Inside C/C++, create a data structure whose lookup time is O(n), n being the length of the phrase you're searching for, regardless of how many phrases are stored. This is normally called a trie. It is conventionally used for single words without spaces (not phrases), but you can certainly write your own variant.
Here is a link that contains the single-word implementation, i.e. a dictionary:
http://www.geeksforgeeks.org/trie-insert-and-search/
This takes quite a bit of memory, since there are 100K phrases, so it's fair to say you need a large system. But when you're looking for better performance, memory tends to be the tradeoff.
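As a rough illustration of the trie idea adapted to phrases, here is a minimal sketch in Python rather than C++ (it ports readily to PHP arrays). It assumes words are separated by whitespace and ignores punctuation; the names `build_trie`, `find_phrases` and `END` are made up for this example and do not come from any library.

```python
END = object()  # sentinel key marking "a complete phrase ends at this node"


def build_trie(phrases):
    """Build a word-level trie: each node is a dict mapping a word to a child node."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[END] = True
    return root


def find_phrases(trie, text):
    """Return (start_word_index, phrase) for every stored phrase found in text."""
    words = text.lower().split()
    matches = []
    for start in range(len(words)):
        node = trie
        # Walk down the trie for as long as the following words keep matching.
        for end in range(start, len(words)):
            node = node.get(words[end])
            if node is None:
                break
            if END in node:
                matches.append((start, " ".join(words[start:end + 1])))
    return matches


if __name__ == "__main__":
    trie = build_trie(["full text", "full text search", "index"])
    print(find_phrases(trie, "A full text search uses an index"))
```

Because stored phrases have at most three words, the inner walk stops after at most three steps per starting position, so the scan is linear in the length of the text.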
Alternative Approach
Only PHP. Here, extract the candidate phrases from your text input and convert them into a hash. The phrases from your table should also be stored in a hash [needs a lot of memory]. Each lookup is then very fast, O(1) on average. Because a phrase is one to three consecutive words, a text of k words yields at most about 3k candidates, so the total time is O(k).
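A minimal sketch of the hash-based approach, written in Python for brevity (in PHP the set would be an associative array checked with isset()). It assumes phrases are one to three whitespace-separated words, as described in the question; `find_known_phrases` is a hypothetical name.

```python
def find_known_phrases(text, phrase_set, max_words=3):
    """Check every window of 1..max_words consecutive words against a hash set."""
    words = text.lower().split()
    found = []
    for start in range(len(words)):
        for size in range(1, max_words + 1):
            if start + size > len(words):
                break
            candidate = " ".join(words[start:start + size])
            if candidate in phrase_set:  # average O(1) hash lookup
                found.append((start, candidate))
    return found


if __name__ == "__main__":
    phrases = {"red apple", "banana"}  # in practice: loaded once from the phrases table
    print(find_known_phrases("A red apple and a banana", phrases))
```

The returned word positions can then be used to wrap the matching words in highlight markup.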
I am in need of a lightweight fast search solution.
Today I use Fulltext in boolean mode, where every searchword is mandatory in the results.
The function is fast, working and meets the requirements.
BUT some of the full-text limitations (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) have turned out to be a problem. The site is on a hosted server and I'm not allowed to change the MySQL settings (e.g. the minimum word length).
E.g.
the search must be able to find red, 11 and ab.cd, which today's full-text solution can't.
http://sphinxsearch.com/ is what you're looking for,
though you have to understand that the shorter the words you want to find, the bigger the index gets.
Use Lucene; it's very often used together with MySQL, and it'll be both faster and more featureful.
Using the built-in FTS engine is relatively bad practice, especially since (before MySQL 5.6) it doesn't work with the slightly more reliable InnoDB engine.
The only thing that comes to mind would be basing your search on the number of occurrences you can find. Your actual indexing method could vary, depending on what the DB offers.
Assuming DB size isn't an issue, a (very) basic approach would be to break the search blobs (say, a post on stackoverflow) into each word, normalize it (remove plurals, strip 'logic' words such as and, etc.) then insert each word as a new record, together with the ID that identifies your indexed resource.
Count the instances of the ID, order by count, higher number = more relevant.
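The word-per-record idea can be sketched in memory as follows. This is only an illustration in Python: in the real setup each (word, ID) pair would be a database row and the counting a GROUP BY query. The stop-word list and the plural stripping are deliberately crude assumptions, and this version counts how many distinct query words each resource contains.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"and", "or", "the", "a", "an", "of"}  # hypothetical, far from complete


def normalize(word):
    """Lowercase, trim punctuation and strip a trailing plural 's' (very crude)."""
    word = word.lower().strip(".,;:!?\"'()")
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word


def build_index(documents):
    """Map each normalized word to the set of resource IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw in text.split():
            word = normalize(raw)
            if word and word not in STOP_WORDS:
                index[word].add(doc_id)
    return index


def search(index, query):
    """Count matching query words per ID; a higher count means more relevant."""
    counts = Counter()
    for raw in query.split():
        for doc_id in index.get(normalize(raw), ()):
            counts[doc_id] += 1
    return counts.most_common()


if __name__ == "__main__":
    docs = {
        1: "Apple and banana pie",
        2: "Yellow bananas",
        3: "Green apples and yellow bananas",
    }
    print(search(build_index(docs), "apple banana yellow"))
```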
Not exactly my field though, so tread carefully! =]
I'd recommend you try distance searching: Levenshtein
Or search for "N-gram fulltext indexing".
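For reference, Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. PHP ships a built-in levenshtein() function; the sketch below shows the standard dynamic-programming computation in Python, plus a hypothetical helper `closest_terms` that keeps only candidates within a chosen distance. Comparing against every stored term this way is slow for large tables, so treat it as an illustration of the measure rather than a complete search solution.

```python
def levenshtein(a, b):
    """Edit distance between a and b, computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_terms(query, terms, max_distance=2):
    """Return (distance, term) pairs within max_distance, nearest first."""
    scored = []
    for term in terms:
        distance = levenshtein(query.lower(), term.lower())
        if distance <= max_distance:
            scored.append((distance, term))
    return sorted(scored)


if __name__ == "__main__":
    print(closest_terms("aliens", ["alien", "predator"], max_distance=1))
```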
I haven't mucked around with it, but I read the theory of full text searching (with mysql at least) a little while back.
If memory serves me correctly, you can use full-text search for what you want, but you need to reconfigure it (and, I think, recompile) to get it to work with shorter search words. I think the default minimum is 4 characters. You'll want to change it to 2 characters, along with a few other options, and test the results you get.
Someone correct me if this is incorrect. I would rather not send him chasing a red herring.
How do you make it so that when you search for "alien vs predator" you also get results containing the string "alienS vs predator", with the "S"?
Example: http://www.torrentz.com/search?q=alien+vs+predator
How have they implemented this?
Is this advanced search engine stuff?
This is known as Word Stemming. When the text is indexed, words are "stemmed" to their "roots". So fighting becomes fight, skiing becomes ski, runs becomes run, etc. The same thing is done to the text that a user enters at search time, so when the search terms are compared to the values in the index, they match.
The Lucene project supports this. I wouldn't consider it an advanced feature. Especially with the expectations that Google has set.
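To make the index-time/search-time symmetry concrete, here is a toy sketch in Python. The `stem` function is a deliberately naive suffix stripper invented for this example; it is not the Porter algorithm or what Lucene does, and it will mangle many real words. The point is only that the same normalization is applied to both the indexed title and the query.

```python
def stem(word):
    """Naive stemmer: strip a few common English suffixes (illustration only)."""
    word = word.lower().strip("'\".,!?")
    for suffix in ("'s", "ing", "s"):
        # Only strip when at least three characters remain, so "vs" stays "vs".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def stems(text):
    """Set of stems for all whitespace-separated words in text."""
    return {stem(word) for word in text.split()}


def title_matches(query, title):
    """True if every stemmed query word also occurs (stemmed) in the title."""
    return stems(query) <= stems(title)


if __name__ == "__main__":
    print(title_matches("alien vs predator", "AlienS vs Predator"))
```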
Checking for plurals is a form of stemming. Stemming is a common feature of search engines and other text matching. See the wikipedia page: http://en.wikipedia.org/wiki/Stemming for a host of algorithms to perform stemming.
Typically when one sets up a search engine to search for text, one will construct a query that's something like:
SELECT * FROM TBLMOVIES WHERE NAME LIKE '%ALIEN%'
This means that the substring ALIEN can appear anywhere in the NAME field, so you'll get back strings like ALIENS.
When words are indexed, they are indexed by their root form. For example, "aliens", "alien", "alien's" and "aliens'" are all stored as "alien".
And when words are searched for, the search engine likewise looks up only the root form "alien".
This is called stemming, and a widely used algorithm for it is the Porter Stemming Algorithm. You can download an implementation for your favorite language here - http://tartarus.org/~martin/PorterStemmer/
This is a basic feature of a search engine, rather than just a program that matches your query with a set of pre-defined results.
If you have the time, this is a great read, all about different algorithms, and how they are implemented.
You could try using soundex() as a fuzzy match on your strings. If you save the soundex value alongside the title and then compare that stored value against a prefix of the search term's soundex using LIKE 'XXX%', you should get a decent match. The longer the prefix you compare, the closer the match has to be.
see docs: http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_soundex
I have a particular problem and need to know the best way to go about solving it.
I have a PHP string that can contain a number of keywords (tags, actually). For example:
"seo, adwords, google"
or
"web development, community building, web design"
I want to create pools of related keywords, e.g. one pool with all SEO and online marketing keywords and another with all web development keywords.
I want to check the keyword / tag string against these pools. If, for example, seo or adwords is contained in the keyword string, it is matched to the online marketing pool and a particular piece of content is served.
I wish to know the best way of coding this. I'm guessing some kind of hash table or array but not sure the best way to approach it.
Any ideas?
Thanks
Jonathan
Three approaches come to my mind, although I'm sure there could be more. Of course in any case I would store the values in a database table (or config file, or whatever depending on your application) so it can be edited easily.
1) Easiest: Convert the list into a regular expression of the form "keyword1|keyword2|keyword3" and see if the input matches.
2) Medium: Add the words to a hashtable, then split the input into words (you may have to use regular expression replacing to remove punctuation) and try to find each word of input in the hashtable.
3) Hardest: This may not work depending on your exact situation, but if all the possible content can be indexed by a search solution (like Apache SOLR, for example) then your list of keywords could be used as a search string and you could return results above a particular level of relevance.
It's hard to know exactly which solution would work best without knowing more about your source data. A large number of keywords may jam up a regular expression, but if it's a short list then it might work great. If your inputs are long then #2 won't work so well because you have to test each and every input word. As always your mileage may vary, so I would start with the easiest solution I thought would work and see if the performance is acceptable.
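Approaches 1 and 2 can be sketched as follows, in Python for brevity (PHP's preg_match and associative arrays would play the same roles). The pool names and their members are made up from the examples in the question. Since the tags are comma-separated, this sketch splits on commas rather than on single words, so multi-word tags like "web design" stay intact.

```python
import re

# Hypothetical pools; in practice these would come from a table or config file.
POOLS = {
    "online marketing": {"seo", "adwords", "google"},
    "web development": {"web development", "web design", "community building"},
}


def build_lookup(pools):
    """Approach 2: a hash table mapping each keyword to its pool name."""
    return {keyword: name for name, keywords in pools.items() for keyword in keywords}


def match_pools(tag_string, lookup):
    """Return the set of pool names hit by a comma-separated tag string."""
    tags = [tag.strip().lower() for tag in tag_string.split(",")]
    return {lookup[tag] for tag in tags if tag in lookup}


def build_pattern(keywords):
    """Approach 1: one 'keyword1|keyword2|...' regular expression per pool."""
    ordered = sorted(keywords, key=len, reverse=True)  # try longer keywords first
    alternatives = "|".join(re.escape(keyword) for keyword in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


if __name__ == "__main__":
    lookup = build_lookup(POOLS)
    print(match_pools("seo, adwords, google", lookup))
    print(bool(build_pattern(POOLS["online marketing"]).search("Tips on AdWords bidding")))
```

The hash lookup requires tags to match a keyword exactly, while the regular expression also finds keywords embedded in longer text, which may or may not be what you want.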