Need suggestion for an alternative to fulltext search - PHP

I am in need of a lightweight fast search solution.
Today I use FULLTEXT in boolean mode, where every search word is mandatory in the results.
The function is fast, working and meets the requirements.
BUT some of the fulltext limitations (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) have turned out to be a problem. The site is on a hosted server and I'm not allowed to change the MySQL settings (e.g. minimum word length).
E.g. the search must be able to find red, 11 and ab.cd, which today's fulltext solution can't.

http://sphinxsearch.com/ is what you're looking for,
though you have to understand that the smaller the words you want to find, the bigger the indexes you end up with.

Use Lucene; it's very often deployed alongside MySQL, and it'll be both faster and more feature-rich.
Using the built-in FTS engine is relatively bad practice, especially since it doesn't work with the slightly more reliable InnoDB engine.

The only thing that comes to mind would be to base your search on the number of occurrences you can find. Your actual index method could vary, depending on what the DB offers.
Assuming DB size isn't an issue, a (very) basic approach would be to break the searchable blobs (say, a post on Stack Overflow) into individual words, normalize them (remove plurals, strip 'logic' words such as and, etc.), then insert each word as a new record, together with the ID that identifies your indexed resource.
Count the instances of each ID, order by count; a higher number means more relevant.
Not exactly my field though, so tread carefully! =]
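A minimal sketch of that idea, assuming a hypothetical search_index (word, post_id) table, a PDO connection in $pdo, and deliberately crude normalization:
<?php
// Crude inverted index: split a post into words and store (word, post_id) rows.
// The search_index table, $pdo and the plural stripping are assumptions of this sketch.
function index_post(PDO $pdo, $postId, $text) {
    $stopWords = array('and', 'or', 'the', 'a', 'is', 'was', 'were');
    $words = preg_split('/[^a-z0-9]+/i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $insert = $pdo->prepare('INSERT INTO search_index (word, post_id) VALUES (?, ?)');
    foreach (array_unique($words) as $word) {
        if (in_array($word, $stopWords)) {
            continue;
        }
        $insert->execute(array(rtrim($word, 's'), $postId)); // naive plural stripping
    }
}

// Rank posts by how many of the search words they contain.
function search_posts(PDO $pdo, $query) {
    $words = preg_split('/[^a-z0-9]+/i', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    $placeholders = implode(',', array_fill(0, count($words), '?'));
    $stmt = $pdo->prepare("SELECT post_id, COUNT(*) AS hits
                           FROM search_index
                           WHERE word IN ($placeholders)
                           GROUP BY post_id
                           ORDER BY hits DESC"); // add HAVING hits = N to make every word mandatory
    $stmt->execute(array_map(function ($w) { return rtrim($w, 's'); }, $words));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}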

I'd recommend you try distance searching: Levenshtein
Or search for "N-gram fulltext indexing".

I haven't mucked around with it, but I read up on the theory of full-text searching (with MySQL at least) a little while back.
If memory serves me correctly you can use full-text search for what you want, but you need to change the configuration (and, I think, a recompile) to get it to work on a smaller number of characters. I think the minimum word length defaults to 4 characters. You'll want to change it to 2 characters, throw in a few other options, and test the results you get.
Someone correct me if this is wrong; I would rather not send the OP off on a red herring.

Related

PHP: searching with search terms for similar text on webpage

I'm busy with a program that needs to find similar text on a webpage. In the database we have 400,000 search terms. For example, the search terms can be ‘San Miguel Pale Pilsen’, ‘Schaumburger Bali’ and ‘Rizmajer Cortez’.
Right now I check each word on the webpage against the database. For each word on the webpage I send a SELECT query with a %LIKE% pattern. For each result I use similar_text() in PHP. If the word and the search term don't contain the same number of words, I take some extra words from the webpage to make them equal.
(And yes I know that it isn’t smart)
The problem is that it takes a lot of time and the server has to work hard for it.
What is the best and fastest way to find similar text on a webpage?
The LIKE operator will always be slow if you start the pattern with a % wildcard, since that negates MariaDB's ability to use any index.
Considering you need to find words at any location in the VARCHAR column, the best solution is to implement bona fide full-text search. See MariaDB's Full-Text Index Overview.
Searches will become orders of magnitude faster, not to mention more scalable.
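For example (a sketch only: the search_terms table, the term column and $pageText are assumptions, and it inverts the usual direction by using the page text as the search string), you could index the stored terms once and then run a single MATCH query per page instead of one LIKE query per word:
<?php
// One-off: add a FULLTEXT index on the column holding the 400,000 search terms.
$pdo->exec('ALTER TABLE search_terms ADD FULLTEXT INDEX ft_term (term)');

// Find stored terms whose words occur in the page text; natural-language mode
// also gives a relevance score to order by.
$stmt = $pdo->prepare(
    'SELECT term, MATCH(term) AGAINST (?) AS relevance
     FROM search_terms
     WHERE MATCH(term) AGAINST (?)
     ORDER BY relevance DESC
     LIMIT 50'
);
$stmt->execute(array($pageText, $pageText));
$matches = $stmt->fetchAll(PDO::FETCH_ASSOC);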

search query "alien vs predator"

How do you make it so that when you search for "alien vs predator" you also get results containing the string "alienS vs predator", with the "S"?
Example: http://www.torrentz.com/search?q=alien+vs+predator
How have they implemented this?
Is this advanced search-engine stuff?
This is known as Word Stemming. When the text is indexed, words are "stemmed" to their "roots". So fighting becomes fight, skiing becomes ski, runs becomes run, etc. The same thing is done to the text that a user enters at search time, so when the search terms are compared to the values in the index, they match.
The Lucene project supports this. I wouldn't consider it an advanced feature. Especially with the expectations that Google has set.
Checking for plurals is a form of stemming. Stemming is a common feature of search engines and other text matching. See the wikipedia page: http://en.wikipedia.org/wiki/Stemming for a host of algorithms to perform stemming.
Typically when one sets up a search engine to search for text, one will construct a query that's something like:
SELECT * FROM TBLMOVIES WHERE NAME LIKE '%ALIEN%'
This means that the substring ALIEN can appear anywhere in the NAME field, so you'll get back strings like ALIENS.
When words are indexed they are indexed by root form. For example for "aliens", "alien", "alien's", "aliens'" are all stored as "alien".
And when words are searched, the search engine also searches only for the root form "alien".
This is often done with the Porter stemming algorithm. You can download an implementation for your favorite language here - http://tartarus.org/~martin/PorterStemmer/
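A hypothetical usage sketch in PHP, assuming the port from that page exposes a PorterStemmer class with a static Stem() method (check the actual class and method names in whichever download you use):
<?php
require_once 'PorterStemmer.php'; // the PHP port downloaded from the page above

// Stem every word before indexing, and do the same to the user's query,
// so "aliens" and "alien" both end up stored and searched as "alien".
$title = 'Aliens vs Predator';
$words = preg_split('/\s+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
$stemmed = array_map(array('PorterStemmer', 'Stem'), $words);

print_r($stemmed); // "aliens" becomes "alien", etc.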
This is a basic feature of a search engine, rather than just a program that matches your query with a set of pre-defined results.
If you have the time, this is a great read, all about different algorithms, and how they are implemented.
You could try using soundex() as a fuzzy match on your strings. If you save the soundex alongside the title and then compare that column against a prefix of the search term's soundex using LIKE 'XXX%', you should get a decent match. The more characters of the code you compare, the closer the match.
see docs: http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_soundex
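A rough sketch of that idea; the movies table and its soundex_name column are assumptions, and the soundex code is computed in PHP on both the write and the read path so the stored values stay comparable:
<?php
// Store a soundex code alongside each title (assumed soundex_name CHAR(4) column).
$pdo->prepare('UPDATE movies SET soundex_name = ? WHERE id = ?')
    ->execute(array(soundex($title), $movieId));

// Fuzzy lookup: compare against a prefix of the search term's code.
// Using more characters of the code makes the match stricter.
$prefix = substr(soundex($searchTerm), 0, 3);
$stmt = $pdo->prepare('SELECT * FROM movies WHERE soundex_name LIKE ?');
$stmt->execute(array($prefix . '%'));
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);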

How to find keywords (useful words) from text?

I am doing an experimental project.
What I am trying to achieve is to find the keywords in that text.
The way I am trying to do this is by making a list of how many times each word appears in the text, sorted with the most used words at the top.
But the problem is that some common words like is, was, were are always at the top. Obviously these are not useful keywords.
Can you suggest some good logic for this, so that it always finds good, related keywords?
Use something like a Brill Parser to identify the different parts of speech, like nouns. Then extract only the nouns, and sort them by frequency.
Well, you could use preg_split() to get the list of words and how often they occur; I'm assuming that's the bit you've got working so far.
The only thing I can think of for stripping out the unimportant words is to have a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc., and use this dictionary to filter out the unwanted words.
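Putting those two steps together, a minimal sketch (assuming $text already holds your input; the stop-word list is only a starting point):
<?php
// Count word frequencies, ignoring a small dictionary of stop words.
$stopWords = array('a', 'i', 'the', 'and', 'is', 'was', 'were', 'of', 'to', 'in');

$words = preg_split('/[^a-z0-9]+/i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
$words = array_diff($words, $stopWords);          // drop the unwanted words
$counts = array_count_values($words);             // word => number of occurrences
arsort($counts);                                   // most used words at the top

print_r(array_slice(array_keys($counts), 0, 10)); // top 10 candidate keywords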
Why are you doing this, is it for searching page content? If it is, then most back end databases offer some kind of text search functionality, both MySQL and Postgres have a fulltext search engine, for example, that automatically discards the unimportant words. I'd recommend using the fulltext features of the backend database you're using, as chances are they're already implementing something that meets your requirements.
My first approach to something like this would be more mathematical modelling than pure programming.
There are two "simple" ways you can attack a problem like this:
a) an exclusion list (penalize a collection of words which you deem useless)
b) a weight function which, for example, builds on word length, so that small words such as prepositions (in, at, ...) and pronouns (I, you, me, his, ...) are penalized and hopefully fall mid-table (see the sketch below)
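A sketch of option b), with an arbitrary length-based weight that you would have to tune yourself:
<?php
// Weight each word by frequency, but penalize very short words so that
// prepositions and pronouns sink down the table. The formula is arbitrary.
$words = preg_split('/[^a-z]+/i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
$scores = array();
foreach (array_count_values($words) as $word => $count) {
    $scores[$word] = $count * log(1 + strlen($word)); // longer words score higher
}
arsort($scores);
print_r(array_slice($scores, 0, 10, true)); // best-scoring candidates first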
I am not sure if this was what you were looking for, but I hope it helps.
By the way, I know that contextual text processing is a subject of active research; you might find a number of projects that could be interesting.

coding inspiration needed - keywords contained within string

I have a particular problem and need to know the best way to go about solving it.
I have a PHP string that can contain a number of keywords (tags actually). For example:
"seo, adwords, google"
or
"web development, community building, web design"
I want to create pools of related keywords, e.g. all SEO / online-marketing related keywords, or all web-development related keywords.
I want to check the keyword/tag string against these pools, and if for example seo or adwords is contained within the keyword string, it is matched to the online-marketing pool and a particular piece of content is served.
I wish to know the best way of coding this. I'm guessing some kind of hash table or array but not sure the best way to approach it.
Any ideas?
Thanks
Jonathan
Three approaches come to my mind, although I'm sure there could be more. Of course in any case I would store the values in a database table (or config file, or whatever depending on your application) so it can be edited easily.
1) Easiest: Convert the list into a regular expression of the form "keyword1|keyword2|keyword3" and see if the input matches.
2) Medium: Add the words to a hashtable, then split the input into words (you may have to use regular-expression replacing to remove punctuation) and try to find each word of the input in the hashtable (see the sketch after this answer).
3) Hardest: This may not work depending on your exact situation, but if all the possible content can be indexed by a search solution (like Apache SOLR, for example) then your list of keywords could be used as a search string and you could return results above a particular level of relevance.
It's hard to know exactly which solution would work best without knowing more about your source data. A large number of keywords may jam up a regular expression, but if it's a short list then it might work great. If your inputs are long then #2 won't work so well because you have to test each and every input word. As always your mileage may vary, so I would start with the easiest solution I thought would work and see if the performance is acceptable.
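A sketch of option 2; the pool names and keyword lists below are just placeholders for whatever you actually store in your table or config file:
<?php
// Map each keyword to the pool it belongs to (hypothetical pools).
$pools = array(
    'online marketing' => array('seo', 'adwords', 'google'),
    'web development'  => array('web development', 'community building', 'web design'),
);

// Build a keyword => pool lookup table once.
$lookup = array();
foreach ($pools as $pool => $keywords) {
    foreach ($keywords as $keyword) {
        $lookup[$keyword] = $pool;
    }
}

// Check an incoming tag string against the lookup.
$tags = array_map('trim', explode(',', strtolower('SEO, adwords, google')));
$matchedPools = array();
foreach ($tags as $tag) {
    if (isset($lookup[$tag])) {
        $matchedPools[$lookup[$tag]] = true;
    }
}
print_r(array_keys($matchedPools)); // e.g. array('online marketing')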

MySQL LIKE query

I'm trying to search for a string in my DB, but it should be able to match any word and not only the whole phrase.
Suppose a table's data field has text like a b c d e f g. Then if I search for d c b it should be able to show that row in the results.
field LIKE '%d c b%' doesn't work in this way.
Can someone suggest a more robust way to search, possibly showing a relevance count as well?
I don't mind using PHP also for the above, but prefer to do the search at DB level.
For best results, you need to create FULLTEXT index on your data.
CREATE TABLE mytable (id INT NOT NULL, data TEXT NOT NULL, FULLTEXT KEY fx_mytable_data (data)) ENGINE=MyISAM
SELECT *
FROM mytable
WHERE MATCH(data) AGAINST ('+word1 +word2 +word3' IN BOOLEAN MODE)
Note that to index one-letter words (as in your example), you'll need to set ft_min_word_len to 1 in the MySQL configuration.
This syntax can work even if you don't have an index (as long as your table is MyISAM), but will be quite slow.
I think what you want to do is, for any of the letters:
field LIKE '%d%' or field like '%c%' or field like '%b%'
for all of the letters
field LIKE '%d%' and field like '%c%' and field like '%b%'
If your table uses MyISAM, you can use the FULLTEXT search integrated into MySQL: 11.8. Full-Text Search Functions
Though there will be some restrictions (for instance, if I remember correctly, you cannot search for words shorter than X characters -- X generally being 3 or 4).
Another solution would be to use a dedicated fulltext engine, like Lucene, Solr, or Sphinx -- those generally do a better job when it comes to fulltext searching: it is their job (MySQL's job being to store data, not to do fulltext search).
There have been lots of questions about those on SO; for instance:
php mysql fulltext search: lucene, sphinx, or ?
Choosing a stand-alone full-text search server: Sphinx or SOLR?
Pros & cons of full text search engine Lucene, Sphinx, Postgresql full text search, MySQL full text search
how much more performant is sphinx than MySQL default fulltext search?
And many others (use the... search engine... on the top right of the site ;-) )
If you are using PHP and cannot install anything else, there is a full-PHP implementation of Lucene : Zend_Search_Lucene
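A rough sketch of what using it looks like, based on the Zend Framework 1 API as I remember it (double-check field and method names against the Zend_Search_Lucene documentation; the index path and field names here are placeholders):
<?php
require_once 'Zend/Search/Lucene.php';

// Build (or later re-open) an index on disk; the path is a placeholder.
$index = Zend_Search_Lucene::create('/path/to/index');

// Index a document: keep the row id, index the text without storing it.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', 42));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $text));
$index->addDocument($doc);
$index->commit();

// Query it; every hit carries a relevance score.
$hits = $index->find('word1 AND word2');
foreach ($hits as $hit) {
    echo $hit->id . ' => ' . $hit->score . "\n";
}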
In the end, a MySQL LIKE clause is not meant to be a 'powerful' search tool for word-based matching; it's a simple tool to find partial phrases. It also isn't known for scaling well, so if you are doing this on a high-throughput website, you will probably want another solution.
So that being said, there ARE some options for you, to get what you are wanting:
REGEXP support: MySQL supports regex-based searches. Using that, and with a complicated enough pattern, you can find what you are looking for (see the sketch after this list).
True Full Text Indexing in MySQL. MySQL does have a way to create FULLTEXT indexes. You need to be using the MyISAM storage engine, and there are restrictions on what exactly you can or can't do. But it's much more powerful than the basic LIKE functionality SQL has. I'd recommend reading up on it if you are interested.
3rd party indexers. This is actually the route that most people go. They will use Lucene / Solr, or other similar indexing technologies that are specifically designed for doing full text searching of words with various logic, just like how modern web search engines work. They are extremely efficient because they, essentially, keep their own database where they break everything up and store it in a manner that works best for exactly those types of searches.
Hopefully one of those three options will work for you.
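For the REGEXP route from the first point, a sketch; the [[:<:]] / [[:>:]] word-boundary markers work in MySQL 5.x (MySQL 8 switched to ICU and uses \b instead), and the table/column names are assumptions:
<?php
// Match whole words anywhere in the field, in any order, using MySQL's REGEXP.
$words = array('d', 'c', 'b');
$conditions = array();
foreach ($words as $word) {
    // preg_quote() is a crude way to escape regex metacharacters in the word.
    $conditions[] = 'field REGEXP ' . $pdo->quote('[[:<:]]' . preg_quote($word) . '[[:>:]]');
}
$sql = 'SELECT * FROM mytable WHERE ' . implode(' AND ', $conditions);
$rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);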
When using the LIKE clause, take care that the pattern is %variable% or variable%, not %variable.
Secondly, to make an effective search, use the explode() function to break the phrase into words; for example, if I search for "learn php" it should be searched as "learn+php", as in Google.
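A sketch of that second point, turning the user's phrase into a boolean-mode fulltext query that requires every word (it assumes the mytable/data FULLTEXT setup from the earlier answer, and $searchPhrase holding the user's input):
<?php
// Break the phrase into words and require every word, Google-style
// ("learn php" becomes "+learn +php").
$words = array_filter(explode(' ', trim($searchPhrase)));
$boolean = '+' . implode(' +', $words);

$stmt = $pdo->prepare('SELECT * FROM mytable WHERE MATCH(data) AGAINST (? IN BOOLEAN MODE)');
$stmt->execute(array($boolean));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);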
