search query "alien vs predator" - php

How do you make a search for "alien vs predator" also return results containing the string "alienS vs predator", with the "S"?
Example: http://www.torrentz.com/search?q=alien+vs+predator
How have they implemented this?
Is this advanced search engine stuff?

This is known as Word Stemming. When the text is indexed, words are "stemmed" to their "roots". So fighting becomes fight, skiing becomes ski, runs becomes run, etc. The same thing is done to the text that a user enters at search time, so when the search terms are compared to the values in the index, they match.
The Lucene project supports this. I wouldn't consider it an advanced feature, especially given the expectations that Google has set.
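As a rough illustration of the idea above, here is a minimal sketch in Python (the same logic ports directly to PHP). The `naive_stem` and `stem_text` helpers are hypothetical names, and the suffix rules are a deliberately crude assumption: this is not the stemmer Lucene uses, only a demonstration that stemming both the indexed text and the query makes them comparable.

```python
def naive_stem(word: str) -> str:
    """Lowercase a word and strip a couple of common English suffixes."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words like "vs" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_text(text: str) -> list:
    """Stem every whitespace-separated word of a text."""
    return [naive_stem(word) for word in text.split()]


# The same function is applied at index time and at search time.
indexed_terms = stem_text("Aliens vs Predator")
query_terms = stem_text("alien vs predator")
print(indexed_terms == query_terms)
```

A real stemmer handles far more cases (irregular plurals, doubled consonants, and so on), so this should be treated as a toy.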

Checking for plurals is a form of stemming. Stemming is a common feature of search engines and other text-matching tools. See the Wikipedia page http://en.wikipedia.org/wiki/Stemming for a host of stemming algorithms.

Typically when one sets up a search engine to search for text, one will construct a query that's something like:
SELECT * FROM TBLMOVIES WHERE NAME LIKE '%ALIEN%'
This means that the substring ALIEN can appear anywhere in the NAME field, so you'll get back strings like ALIENS.
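To see the substring behaviour in action, here is a small sketch using Python's built-in sqlite3 module with an in-memory table. The table contents are made up for illustration, and SQLite stands in for whatever database you actually use (its LIKE is case-insensitive for ASCII by default, which may differ from your collation).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblmovies (name TEXT)")
conn.executemany(
    "INSERT INTO tblmovies (name) VALUES (?)",
    [("Alien",), ("Aliens vs Predator",), ("Predator",)],
)

# A single-word pattern matches anywhere in the name, so "Aliens" is found.
rows = conn.execute(
    "SELECT name FROM tblmovies WHERE name LIKE ? ORDER BY name",
    ("%ALIEN%",),
).fetchall()
matches = [name for (name,) in rows]

# The whole phrase as one pattern does not match, because of the extra "s".
phrase_rows = conn.execute(
    "SELECT name FROM tblmovies WHERE name LIKE ?",
    ("%alien vs predator%",),
).fetchall()

print(matches, phrase_rows)
```

Note that this only works per word: the full phrase pattern finds nothing, which is why stemming is the more general answer.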

When words are indexed, they are stored in their root form. For example, "aliens", "alien", "alien's" and "aliens'" are all stored as "alien".
When a query is run, the search engine likewise searches only for the root form "alien".
A common way to do this is the Porter Stemming Algorithm. You can download an implementation for your favorite language here: http://tartarus.org/~martin/PorterStemmer/

This is a basic feature of a search engine, rather than just a program that matches your query with a set of pre-defined results.
If you have the time, this is a great read, all about different algorithms, and how they are implemented.

You could try using soundex() as a fuzzy match on your strings. If you save the soundex value with the title and then compare it against a prefix of the search term's soundex using LIKE 'XXX%', you should get a decent match. The longer the prefix you compare, the closer the strings have to match.
See the docs: http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_soundex
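For a feel of what the codes look like, below is a sketch of the classic four-character American Soundex in Python. It is an assumption that this is close enough for illustration: MySQL's SOUNDEX() can return longer strings and handles some letter sequences differently, so the exact values may not be identical to what your database stores.

```python
# Letter groups of the classic Soundex code table.
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the four-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        code = CODES.get(letter, "")
        # Skip vowels and repeats of the previous consonant group.
        if code and code != previous:
            encoded += code
        # H and W do not separate two letters of the same group.
        if letter not in "HW":
            previous = code
    return (encoded + "000")[:4]


print(soundex("alien"), soundex("aliens"))
```

With this sketch the two codes share their first three characters and differ only in the last one, which is the situation the prefix comparison with LIKE is meant to handle.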

Related

PHP: searching with search terms for similar text on webpage

I'm busy with a program that needs to find similar text on a webpage. In SQL we have 400,000 search terms. For example, the search terms can be ‘San Miguel Pale Pilsen’, ‘Schaumburger Bali’ and ‘Rizmajer Cortez’.
Currently I check each word on the webpage against the database. For each word on the webpage I send a SELECT query with a LIKE pattern wrapped in % wildcards. For each result I call PHP's similar_text(). If the search term contains more words than the text I'm comparing it with, I take some extra words from the webpage so that both have the same number of words.
(And yes I know that it isn’t smart)
The problem is that it takes a lot of time and puts a heavy load on the server.
What is the best and fastest way to find similar text on a webpage?
A LIKE pattern that starts with a % wildcard will always be slow, because it prevents MariaDB from using an index on the column.
Considering you need to find words at any location in the VARCHAR column, the best solution is to implement bona fide Full Text Search. See MariaDB's Full-Text Index Overview.
Searches will become orders of magnitude faster, not to mention more scalable.
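The reason a word-level index is so much faster can be sketched in a few lines of Python. This is not how MariaDB implements full-text search; it is a simplified, in-memory assumption (hypothetical `build_index` and `find_terms` helpers, exact word matching only) showing that each word on the page becomes a dictionary lookup instead of a table scan.

```python
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase a text and split it on whitespace."""
    return text.lower().split()


def build_index(terms):
    """Map the first word of each search term to the terms starting with it."""
    index = defaultdict(list)
    for term in terms:
        words = tuple(tokenize(term))
        if words:
            index[words[0]].append((words, term))
    return index


def find_terms(page_text: str, index) -> list:
    """Return the search terms that occur word-for-word in the page text."""
    words = tokenize(page_text)
    found = []
    for position, word in enumerate(words):
        # One dictionary lookup per page word instead of one query per word.
        for term_words, term in index.get(word, ()):
            if tuple(words[position:position + len(term_words)]) == term_words:
                found.append(term)
    return found


terms = ["San Miguel Pale Pilsen", "Schaumburger Bali", "Rizmajer Cortez"]
index = build_index(terms)
page = "We tasted San Miguel Pale Pilsen and Rizmajer Cortez today"
print(find_terms(page, index))
```

The index is built once, so 400,000 terms cost memory rather than one query per page word.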

PHP/SQL: Multiple fuzzy keyword search based on likeness (Advanced SQL Search)

Current Situation:
I am currently running a keyword search using multiple keywords in PHP and SQL. The field I'm applying the search to is the title field, which is a 250 VARCHAR field.
A user can input a single keyword, e.g. "apple" or also multiple, e.g. "apple banana yellow". The first option is trivial. For the second option, my current algorithm works like this:
Try and find items that match the exact entire string "apple banana yellow" in the title. Order the results by index id.
If no more results matching the exact entire string are found, or if none are found in the first place, search for all titles containing either "apple", "banana", or "yellow". Order the results by index id.
The algorithm is very basic but funny enough works pretty well.
What I'm looking for:
However I am now looking to implement a smarter search algorithm without having to rely on external paid scripts like Amazon services. I'm looking for a way to implement the following:
fuzzy search (I've read about SOUNDEX or levenshtein which may realize this)
smarter keyword search (Don't just either return items that match ALL words or JUST A SINGLE WORD, but maybe also 2 words or 3 words before)
order by relevance/likeness (Order by likeness of the search to the title, and not just the index id)
(Bonus: maybe even implement search for exact strings, like using " " on google to find exactly the words between the quotation marks)
What is the best way to get started with such a search? I am using InnoDB for MySQL.
Assuming MySQL, you can add a FULLTEXT index. There are then a number of functions that let you do basic searches covering the needs you list: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
You end up using syntax like:
SELECT * FROM table_name WHERE MATCH(column_with_fulltext_index_on_it)
AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE)
To see the match score:
SELECT column_with_fulltext_index_on_it, MATCH(column_with_fulltext_index_on_it)
AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE) AS score FROM table_name WHERE MATCH(column_with_fulltext_index_on_it)
AGAINST('apple banana yellow' IN NATURAL LANGUAGE MODE)
There is a bit of a learning curve in tuning the MATCH clause to your needs, but your examples seem pretty basic (except the smarter search).
Also worth noting: there are system settings that control the minimum and maximum length of the words/tokens that get indexed. You can read https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html to get a deeper understanding of the indexing options. Percona is a good resource as well: https://www.percona.com/blog/2013/02/26/myisam-vs-innodb-full-text-search-in-mysql-5-6-part-1/ (typically more human-digestible than the MySQL docs).
If you need to do more complex searches, you can look at adding other technologies like Solr. But I've always recommended getting the basics working with what you have, and only adopting a new technology if you hit a brick wall, or if you have good metrics on the existing solution and know the new technology will improve on them (speed, storage space, quality of results, etc.). If you can't quantify it, stick to the basics until you can.
Here's a good tutorial: http://www.w3resource.com/mysql/mysql-full-text-search-functions.php
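To make the "order by relevance" idea concrete, here is a hedged Python sketch of a scoring function. It is not the formula MySQL uses for MATCH ... AGAINST; the `score_title` and `search` helpers and the phrase bonus are assumptions made purely for illustration of ranking by how many keywords a title contains.

```python
def score_title(title: str, query: str) -> int:
    """Score a title by matched keywords, with a bonus for the exact phrase."""
    title_words = set(title.lower().split())
    keywords = query.lower().split()
    score = sum(1 for keyword in set(keywords) if keyword in title_words)
    if query.lower() in title.lower():
        # Assumed weighting: an exact phrase match doubles the full score.
        score += len(keywords)
    return score


def search(titles, query: str) -> list:
    """Return matching titles, best score first; ties keep their input order."""
    scored = [(score_title(title, query), title) for title in titles]
    hits = [(score, title) for score, title in scored if score > 0]
    hits.sort(key=lambda pair: -pair[0])
    return [title for _, title in hits]


titles = [
    "yellow submarine",
    "apple banana yellow smoothie",
    "banana apple pie",
    "green grapes",
]
print(search(titles, "apple banana yellow"))
```

Titles matching two of three words now rank between the full match and the single-word matches, which is the "smarter keyword search" behaviour asked for.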

solr search with spaces not working properly

I'm using Solr to search my data, and I have been searching with a query like
title:*something here*
This works fine without spaces. But if I search for
seva sa
even though I have the value below
seva samithi
it searches for either "seva" or "sa". Can anyone suggest a proper way to search in Solr?
You're performing a wildcard search. Wildcard searches are not analyzed, so they bypass the regular analysis chain: the only thing that will match "*seva sa*" is a single token that contains the whole string in the exact case. That would probably be either a StrField or a field indexed with a KeywordTokenizer instead of the regular tokenizer.
A better solution, if you want to match any content within one of the words, might be to use an NGramFilter, so that each token also gets indexed in its shorter forms.
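What an n-gram filter produces can be shown with a short Python sketch. The `ngrams` helper and its size limits are assumptions for illustration only; in Solr the equivalent work is done by NGramFilterFactory with its minGramSize and maxGramSize settings, and the order of emitted tokens may differ.

```python
def ngrams(token: str, min_size: int = 2, max_size: int = 4) -> list:
    """Return every substring of the token between min_size and max_size long."""
    token = token.lower()
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# Index both words of the stored value as n-grams.
indexed = set(ngrams("seva")) | set(ngrams("samithi"))

# Each word of the query "seva sa" is now an ordinary term lookup.
print(all(word in indexed for word in "seva sa".split()))
```

Because "sa" is itself one of the indexed grams of "samithi", no wildcard is needed at query time.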
Use (seva sa), assuming OR is your default operator.
Try this tutorial:
http://www.solrtutorial.com/solr-query-syntax.html
From your query I can see that you are using a wildcard. There are other ways to search in Solr; by default it matches loosely on the individual terms, which is why you end up with the results you posted.

Generate suggestion based on database (PHP and MySql)

Can you give me some tips on how I can generate a suggestion based on the word entered by the user? It's not a misspelling thing: when a user enters the word "hello" and the database does not contain "hello" but does contain "helo" or "helol", I want to suggest those.
Thank you.
FYI
You should look into PHP's levenshtein() function. It gives the edit distance between two strings, so you can score every word in a dictionary file against the input and pick the closest matches. I know you said it's not misspelling, but the dictionary file can be anything, and you can have more than one, depending on how you want to use it.
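Here is a sketch of that approach in Python, with the edit distance written out so the scoring is visible (in PHP you would simply call the built-in levenshtein()). The `suggest` helper, the word list and the `max_distance` cutoff are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def suggest(word: str, known_words, max_distance: int = 2) -> list:
    """Return known words within max_distance edits, closest first."""
    scored = sorted((levenshtein(word, known), known) for known in known_words)
    return [known for distance, known in scored if distance <= max_distance]


print(suggest("hello", ["helo", "helol", "world"]))
```

Scoring every word on each request does not scale to a large dictionary, so in practice the candidates are usually narrowed first (for example by first letter or length).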
It will be way too complex to do with MySQL alone. You need to index commonly used words using something like Sphinx Search (a stand-alone full text search engine) and then run the queries against Sphinx.
There is a pretty good thread about it at http://sphinxsearch.com/forum/view.html?id=5898
You can use the soundex() function and compare the submitted string against a dictionary database, e.g.:
soundex("Hellllo") == soundex("Hello");
All you have to do is store the soundex of each suggestion in a table. Then, when a user submits a word, you can look up its soundex hash in your table and return the words with the same or a close pronunciation.
The soundex method is fairly fast, but your dictionary table has to be indexed if you need performance.

coding inspiration needed - keywords contained within string

I have a particular problem and need to know the best way to go about solving it.
I have a php string that can contain a number of keywords (tags actually). For example:-
"seo, adwords, google"
or
"web development, community building, web design"
I want to create a pool of keywords that are related, so all seo, online marketing related keywords or all web development related keywords.
I want to check the keyword / tag string against these pools of keywords and if for example seo or adwords is contained within the keyword string it is matched against the keyword pool for online marketing and a particular piece of content is served.
I wish to know the best way of coding this. I'm guessing some kind of hash table or array but not sure the best way to approach it.
Any ideas?
Thanks
Jonathan
Three approaches come to my mind, although I'm sure there could be more. Of course in any case I would store the values in a database table (or config file, or whatever depending on your application) so it can be edited easily.
1) Easiest: Convert the list into a regular expression of the form "keyword1|keyword2|keyword3" and see if the input matches.
2) Medium: Add the words to a hashtable, then split the input into words (you may have to use regular expression replacing to remove punctuation) and try to find each word of input in the hashtable.
3) Hardest: This may not work depending on your exact situation, but if all the possible content can be indexed by a search solution (like Apache SOLR, for example) then your list of keywords could be used as a search string and you could return results above a particular level of relevance.
It's hard to know exactly which solution would work best without knowing more about your source data. A large number of keywords may jam up a regular expression, but if it's a short list then it might work great. If your inputs are long then #2 won't work so well because you have to test each and every input word. As always your mileage may vary, so I would start with the easiest solution I thought would work and see if the performance is acceptable.
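Approach 2 can be sketched as follows in Python (a PHP associative array works the same way). The pool names and their contents are made-up assumptions based on the examples in the question, and `build_lookup` and `match_pools` are hypothetical helper names.

```python
# Hypothetical keyword pools; in practice these would come from a table or config.
POOLS = {
    "online marketing": {"seo", "adwords", "google"},
    "web development": {"web development", "web design", "community building"},
}


def build_lookup(pools) -> dict:
    """Invert the pools into a keyword -> pool-name hash table."""
    return {keyword: name for name, keywords in pools.items() for keyword in keywords}


def match_pools(tag_string: str, lookup) -> list:
    """Return the pools matched by a comma-separated tag string, without repeats."""
    matched = []
    for tag in tag_string.split(","):
        pool = lookup.get(tag.strip().lower())
        if pool and pool not in matched:
            matched.append(pool)
    return matched


lookup = build_lookup(POOLS)
print(match_pools("seo, adwords, google", lookup))
print(match_pools("web development, community building, web design", lookup))
```

Splitting on commas rather than on words keeps multi-word tags such as "web design" intact, which a plain word split would break apart.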
