Backend for autosuggest for fulltext search - php

I want to create an autosuggest for a fulltext search with AJAX, PHP & MySQL.
I am looking for the right way to implement the backend. While the user is typing, the input field should offer suggestions. The suggestions should be generated from text entries in a table.
Some information about these entries: they are stored as full text, generated from PDFs with 3-4 pages each. There are no more than 100 entries for now, and the table will reach a maximum of 2000 in the next few years.
When the user starts to type, the word being typed should be completed with a word that is stored in the DB, sorted by occurrences descending. The next step is to suggest combinations with other words which have a high occurrence in the entries matching the first word. You can compare it to Google's autosuggest.
I am thinking about 3 different ways to implement this:
1) Generate an index via a cronjob, which counts occurrences of words and combinations overnight. The user searches on this index.
2) Do a live search within the entries with a 'LIKE "%search%"' condition, then look at the word that follows each match and GROUP the results by occurrence.
3) Create a logfile for all user searches and look for good combinations as in 1), so the search gets more intelligent with each search action.
What is the best way to start with this? The search should be fast.
Is there a better option I did not think about?

I'd use MySQL's MATCH() AGAINST() (http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html), e.g.:
SELECT *
FROM table
WHERE MATCH(column) AGAINST('search')
ORDER BY MATCH(column) AGAINST('search') DESC
Another advantage is that you can further tweak the importance of the words being searched for (if necessary), like:
MATCH(column) AGAINST('>important <lessimportant' IN BOOLEAN MODE)
Or say that certain words of the search term are required, whilst others must not be present in the result, e.g.:
MATCH(column) AGAINST('+required -prohibited' IN BOOLEAN MODE)

I think idea no. 1 is the best. By the way, don't forget to eliminate stopwords from the autosuggest (an, the, by, ...).
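A minimal sketch of idea no. 1 in Python, assuming the entries are available as plain strings. The stopword list and all function names here are illustrative assumptions, not part of the original setup; a nightly job could run `build_index` and store the counts in a table that the AJAX endpoint queries.

```python
import re
from collections import Counter, defaultdict

# Assumed stopword list, for illustration only; a real one would be longer.
STOPWORDS = {"a", "an", "the", "by", "of", "and", "in", "to"}

def tokenize(text):
    """Lowercase the text and return its words without stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def build_index(entries):
    """Count word occurrences and, per word, the words that follow it."""
    word_counts = Counter()
    followers = defaultdict(Counter)
    for text in entries:
        words = tokenize(text)
        word_counts.update(words)
        # Pairs are formed after stopword removal, so "index by night"
        # counts "night" as a follower of "index".
        for first, second in zip(words, words[1:]):
            followers[first][second] += 1
    return word_counts, followers

def suggest_words(word_counts, prefix, limit=5):
    """Complete a prefix with indexed words, most frequent first."""
    prefix = prefix.lower()
    matches = [(w, n) for w, n in word_counts.items() if w.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [w for w, _ in matches[:limit]]

def suggest_next(followers, word, limit=5):
    """Suggest the words that most often follow the given word."""
    counts = followers.get(word.lower(), {})
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:limit]]
```

With at most 2000 entries of a few pages each, rebuilding such an index from scratch every night should be cheap, which is one reason the precomputed approach looks attractive here.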

Related

faster way for Search in multiple databases

I am working on a big eCommerce shopping website. I have around 40 databases. I want to create a search page which shows 18 results after searching by title in all databases.
(SELECT id_no,offers,image,title,mrp,store from db1.table1 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db3.table3 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db2.table2 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
LIMIT 18
Currently I am using the above query. It works for keywords of 4 or more characters, like laptop or nokia, but takes 10-15 sec to process. For keywords of less than 3 characters it takes 30-40 sec, or I end up with a 500 internal server error. Is there an optimized way of searching in multiple databases? I created two indexes: a primary key and a full text index on title.
Currently my search page is in PHP. I am ready to code in Python or any other language if I get good speed.
You can use Sphinx: http://sphinxsearch.com/. It is a powerful search engine for databases. IMHO Sphinx is the best choice for search on your site.
FULLTEXT is not configured (by default) for searching for words less than three characters in length. You can configure it to handle shorter words by setting a ...min_token_size parameter; see https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html. You can only do this if you control the MySQL server. It won't be possible on shared hosting.
FULLTEXT is designed to produce more false-positive matches than false-negative matches. It's generally most useful for populating dropdown picklists like the ones under the location field of a browser. That is, it requires some human interaction to choose the correct record. To expect FULLTEXT to be able to do absolutely correct searches is probably a bad idea.
You simply cannot use AND column LIKE '%whatever%' if you want any reasonable performance at all. You must get rid of that. You might be able to rewrite your python program to do something different when the search term is one or two letters, and thereby avoid many, but not all, LIKE '%a%' and LIKE '%ab%' operations. If you go this route, create ordinary indexes on your title columns. Whatever you do, don't combine the FULLTEXT and LIKE searches in a single query.
If this were my project I'd consider using a special table with columns like this to hold all the short words from the title column in every row of each table.
id_pk INT autoincrement
id_no INT
word VARCHAR(3)
Then you can use a query like this to look up short words
SELECT a.id_no,offers,image,title,mrp,store
FROM db1.table1 a
JOIN db1.table1_shortwords s ON a.id_no = s.id_no
WHERE s.word = '$searchkey'
To do this, you will have to preprocess the title columns of your other tables to populate the shortwords tables, and put an index on the word column. This will be fast, but it will require a special-purpose program to do the preprocessing.
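The preprocessing step could look roughly like the following Python sketch. It uses the standard-library `sqlite3` module with an in-memory database purely as a stand-in for MySQL so that it runs anywhere; the helper names, the sample titles and the reduced column set are assumptions for illustration.

```python
import re
import sqlite3

def short_words(title, max_len=3):
    """Return the distinct lowercase words of at most max_len characters."""
    words = re.findall(r"[A-Za-z0-9]+", title.lower())
    return sorted({w for w in words if len(w) <= max_len})

def rebuild_shortwords(conn):
    """Repopulate table1_shortwords from the titles in table1."""
    conn.execute("DELETE FROM table1_shortwords")
    rows = conn.execute("SELECT id_no, title FROM table1").fetchall()
    for id_no, title in rows:
        conn.executemany(
            "INSERT INTO table1_shortwords (id_no, word) VALUES (?, ?)",
            [(id_no, w) for w in short_words(title)],
        )
    conn.commit()

def search_short(conn, searchkey):
    """Look up a short word through the indexed shortwords table."""
    sql = (
        "SELECT a.id_no, a.title FROM table1 a "
        "JOIN table1_shortwords s ON a.id_no = s.id_no "
        "WHERE s.word = ? ORDER BY a.id_no"
    )
    # The search key is bound as a parameter, not pasted into the SQL.
    return conn.execute(sql, (searchkey.lower(),)).fetchall()

# Demo schema and data (SQLite stand-in; sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id_no INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE table1_shortwords (
        id_pk INTEGER PRIMARY KEY AUTOINCREMENT,
        id_no INTEGER,
        word VARCHAR(3)
    );
    CREATE INDEX idx_shortwords_word ON table1_shortwords (word);
""")
conn.executemany(
    "INSERT INTO table1 (id_no, title) VALUES (?, ?)",
    [(1, "HP 15 laptop"), (2, "LG TV 42 inch"), (3, "Nokia phone")],
)
rebuild_shortwords(conn)
```

Because the lookup is an equality match on an indexed column, it avoids the full scan that LIKE '%ab%' would cause. Binding the search key as a parameter also avoids interpolating `$searchkey` directly into the SQL string.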
Having to search multiple tables with your UNION ALL operation is a performance problem. You will be able to improve performance dramatically by redesigning your schema so you need search only one table.
Having to search databases on different server machines is a performance problem. You may be able to rig up your python program to search them in parallel: that is, to somehow use separate tasks to search each one, then aggregate the results. Each of those separate search tasks requires its own connection to the data base, so this is not a cheap or simple solution.
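As a rough illustration of searching in parallel, here is a Python sketch built on the standard `concurrent.futures` module. The `searchers` callables are hypothetical: each one is assumed to open its own database connection, run the title query for the given key, and return a list of rows.

```python
from concurrent.futures import ThreadPoolExecutor

def search_all(searchers, searchkey, limit=18, max_workers=8):
    """Run every per-database search concurrently and merge the results.

    `searchers` is a list of callables; each is assumed to open its own
    connection, run the query for `searchkey`, and return a list of rows.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as `searchers`.
        per_db = list(pool.map(lambda search: search(searchkey), searchers))
    merged = [row for rows in per_db for row in rows]
    return merged[:limit]
```

Threads fit here because the work is mostly waiting on the database servers. The total latency is then close to that of the slowest single database rather than the sum over all of them, at the cost of one connection per database per request.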
If this system faces the public web, you will have to redesign it sooner or later, because it will never perform well enough as it is now. (Sorry to be the bearer of bad news.) Many system designers like to avoid redesigning systems after they become enormous. So, if I were you I would get the redesign done.
If your focus is on searching, then bend the schema to facilitate searching rather than the other way around.
Collect all the strings to search for in a single table. Whereas a UNION of 40 tables does work, it will be ~40 times as slow as having the strings collected together.
Use FULLTEXT when the words are long enough, use some other technique when they are not. (This addresses your 3-char problem; see also the Answer discussing innodb_ft_min_token_size. You are using InnoDB, correct?)
Use + and boolean mode to say that a word is mandatory: MATCH(col) AGAINST("+term" IN BOOLEAN MODE)
Do not add on a LIKE clause unless there is a good reason.

MySQL Match Against Reserved Word in Field

In a database I work with, there are a few million rows of customers. To search this database, we use a match against Boolean expression. All was well and good, until we expanded into an Asian market, and customers are popping up with the name 'In'. Our search algorithm can't find this customer by name, and I'm assuming that it's because it's an InnoDB reserved word. I don't want to convert my query to a LIKE statement because that would reduce performance by a factor of five. Is there a way to find that name in a full text search?
The query in production is very long, but the portion that's not functioning as needed is:
SELECT
`customer`.`name`
FROM
`customer`
WHERE
MATCH(`customer`.`name`) AGAINST("+IN*+KYU*+YANG*" IN BOOLEAN MODE);
Oh, and the innodb_ft_min_token_size variable is set to 1 because our customers "need" to be able to search by middle initial.
It isn't a reserved word, but it is in the stopword list. You can override this with ft_stopword_file to supply your own list of stopwords. (That variable applies to MyISAM indexes; for InnoDB, which you are using, the corresponding settings are innodb_ft_server_stopword_table, innodb_ft_user_stopword_table and innodb_ft_enable_stopword.) Two possible problems with this: (1) after altering it, you need to rebuild your fulltext index; (2) it's a global variable, so you can't alter it on a session / location / language-used basis. If you really need all the words and are using a lot of different languages in one database, providing an empty list is almost the only way to go, which can hurt a bit for uses where you would like a stopword list to apply.

optimize tables for search using LIKE clause in MySQL

I am building a search feature for the messages part of my site, and have a messages database with a little over 9,000,000 rows and an index on the sender, subject, and message fields. I was hoping to use the MySQL LIKE clause in my query, for example:
SELECT sender, subject, message FROM Messages WHERE message LIKE '%EXAMPLE_QUERY%';
to retrieve results. Unfortunately, MySQL doesn't use indexes when a leading wildcard is present, and the leading wildcard is necessary because the search query could appear anywhere in the message (this is how the wildcards work, no?). Queries are very slow and I cannot use a full text index either, because of the annoying 50% rule (I just can't afford to rule that much out). Is there any way (or any alternative) to optimize a query using LIKE and two wildcards? Any help is appreciated.
You should either use full-text indexes (you said you can't), design a full-text search by yourself or offload the search from MySQL and use Sphinx/Lucene. For Lucene you can use Zend_Search_Lucene implementation from Zend Framework or use Solr.
Normal indexes in MySQL are B+Trees, and they can't be used if the start of the string is not known (which is the case when the pattern begins with a wildcard).
Another option is to implement search on your own, using a reference table. Split the text into words and create a table that contains word, record_id. Then, when searching, split the query into words and look up each of the words in the reference table. This way you are not limiting yourself to the beginning of the whole text, but only to the beginning of each word (and you'll match the rest of the words anyway).
'%EXAMPLE_QUERY%' is a very bad idea. Here are some suggestions:
A. Avoid wildcards at the start of LIKE patterns; use 'EXAMPLE_QUERY%' instead.
B. Create keywords on which you can easily use MATCH.
If you want to stick with using MySQL, you should use FULL TEXT indexes. Full text indexes index words in a text block. You can then search on word stems and return the results in order of relevance. So you can find the word "example" within a block of text, but you still can't search efficiently on "xampl" to find "example".
MySQL's full text search is not great, but it is functional.
http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html
select * from emp where ename like '%e';
returns the rows whose ename ends with the letter e.
select * from emp where ename like 'A%';
returns the rows whose ename begins with the letter A.
select * from emp where ename like '_a%';
returns the rows whose ename has a as its second letter.

Need a PHP MySQL script to search for keywords in a database

I need to implement a search option for user comments that are stored in a MySQL database. I would optimally like it to work in a similar manner to a standard web page search engine, but I am trying to avoid the large scale solutions. I'd like to just get a feel for the queries that would give me decent results. Any suggestions? Thanks.
It's possible to create a full indexing solution with some straightforward steps. You could create a table that maps words to each post, then when you search for some words find all posts that match.
Here's a short algorithm:
1. When a comment is posted, convert the string to lowercase and split it into words (split on spaces, and optionally dashes/punctuation).
2. In a "words" table, store each word with an ID if it's not already in the table. (Here you might wish to ignore common words like 'the' or 'for'.)
3. In an "indexedwords" table, map the IDs of the words you just inserted to the post ID of the comment (or article, if that is what you want to return).
4. When searching, split the search term into words and find all posts that contain each of the words. (Again, here you might want to ignore common words.)
5. Order the results by number of occurrences. If the results must contain all the words, you'd need to find the intersection of your different arrays of posts.
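The algorithm can be sketched in memory with plain Python; in the database version the `postings` mapping below would correspond to the "words" and "indexedwords" tables. The class name, method names and the list of common words are assumptions for illustration. When all words are required, the sketch intersects the per-word sets of post IDs; otherwise it takes their union.

```python
import re
from collections import Counter, defaultdict

COMMON_WORDS = {"the", "for", "a", "and"}  # assumed, illustrative list

def split_words(text):
    """Lowercase, split on non-alphanumerics, drop common words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in COMMON_WORDS]

class CommentIndex:
    def __init__(self):
        # word -> Counter mapping post_id to number of occurrences
        self.postings = defaultdict(Counter)

    def add(self, post_id, text):
        """Index a newly posted comment."""
        for word in split_words(text):
            self.postings[word][post_id] += 1

    def search(self, query, require_all=True):
        """Return post IDs ordered by total occurrences, highest first."""
        words = split_words(query)
        if not words:
            return []
        per_word = [self.postings.get(w, Counter()) for w in words]
        if require_all:
            # Posts must contain every word: intersect the ID sets.
            ids = set(per_word[0])
            for counts in per_word[1:]:
                ids &= set(counts)
        else:
            # Posts may contain any of the words: union of the ID sets.
            ids = set().union(*per_word)
        score = {i: sum(c[i] for c in per_word) for i in ids}
        return sorted(ids, key=lambda i: (-score[i], i))
```

A short usage example:

```python
index = CommentIndex()
index.add(1, "The red car and the red bike")
index.add(2, "A red apple")
index.add(3, "Green bike for sale")
print(index.search("red bike"))                     # posts containing both words
print(index.search("red bike", require_all=False))  # posts containing either word
```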
As an entry point, you can use MySQL LIKE queries.
For example if you have a table 'comments' with a column named 'comment', and you want to find all comments that contain the word 'red', use:
SELECT comment FROM comments WHERE comment LIKE '% red %';
Please note that searches like this scan the full text of every row and can be slow, so if your database is very large or if you run this query a lot, you will want to find an optimized solution, such as Sphinx (http://sphinxsearch.com).

How can I search for multiple terms in multiple table columns?

I have a table that lists people and all their contact info. I want for users to be able to perform an intelligent search on the table by simply typing in some stuff and getting back results where each term they entered matches at least one of the columns in the table. To start I have made a query like
SELECT * FROM contacts WHERE
firstname LIKE '%Bob%'
OR lastname LIKE '%Bob%'
OR phone LIKE '%Bob%' OR
...
But now I realize that this will completely fail on something as simple as 'Bob Jenkins', because it is not smart enough to search for the first and last name separately. What I need to do is split up the search terms, search for them individually, and then intersect the results from each term somehow. At least that seems like the solution to me. But what is the best way to go about it?
I have heard about fulltext and MATCH()...AGAINST() but that sounds like a rather fuzzy search and I don't know how much work it is to set up. I would like precise yes or no results with reasonable performance. The search needs to be done on about 20 columns by 120,000 rows. Hopefully users wouldn't type in more than two or three terms.
Oh sorry, I forgot to mention I am using MySQL (and PHP).
I just figured out fulltext search and it is a cool option to consider (is there a way to adjust how strict it is? LIMIT would just chop off the results regardless of how well they matched). But this requires a fulltext index, and my website is using a view, and you can't index a view, right? So...
I would suggest using MATCH / AGAINST. Full-text searches are more advanced searches, more like Google's and less elementary.
It can match across multiple columns and rank the rows by how many matches they have.
Otherwise, if the word is there at all, especially across multiple columns, you have no ranking. You can do the ranking server-side, but that is going to take more programming and time.
Depending on what database you're using, searching across columns can be more or less difficult. You probably don't want to do 20 JOINs, as that will be a very slow query.
There are also engines such as Sphinx and Lucene dedicated to do these types of searches.
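For the precise yes-or-no matching the question asks for, the idea of splitting the input into terms can be sketched as below: every term must match at least one column, so the per-term OR groups are joined with AND. This is only an illustration in Python with an in-memory SQLite table standing in for the contacts view. The function names and sample rows are assumptions, and the column list must come from the code, never from user input.

```python
import sqlite3

def build_contact_search(terms, columns, placeholder="?"):
    """Build a WHERE clause in which every term must match at least one column."""
    groups, params = [], []
    for term in terms:
        ors = " OR ".join(f"{col} LIKE {placeholder}" for col in columns)
        groups.append(f"({ors})")
        # One bound value per column, so the term itself never enters the SQL.
        params.extend([f"%{term}%"] * len(columns))
    return " AND ".join(groups), params

def search_contacts(conn, text, columns):
    """Split the input on whitespace and return matching (firstname, lastname) rows."""
    terms = text.split()
    if not terms:
        return []
    where, params = build_contact_search(terms, columns)
    sql = (
        "SELECT firstname, lastname FROM contacts "
        f"WHERE {where} ORDER BY lastname, firstname"
    )
    return conn.execute(sql, params).fetchall()

# Demo data (made up); SQLite's LIKE is case-insensitive for ASCII text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (firstname TEXT, lastname TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [("Bob", "Jenkins", "555-0100"),
     ("Bob", "Smith", "555-0101"),
     ("Alice", "Jenkins", "555-0102")],
)
COLUMNS = ["firstname", "lastname", "phone"]
```

Because every pattern starts with a wildcard, a query like this cannot use ordinary indexes and scans the whole table, so it is worth timing on the real 120,000 rows. A MySQL connector would typically use `%s` instead of `?` as the placeholder, which is why it is a parameter here.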
BOOLEAN MODE
SELECT * FROM contacts WHERE
MATCH(firstname,lastname,email,webpage,country,city,street...)
AGAINST('+bob +jenkins' IN BOOLEAN MODE)
Boolean mode is very powerful. It might even fulfil all my needs; I will have to do some testing. Placing + in front of a search term makes that term required (the row must match 'bob' AND 'jenkins' instead of 'bob' OR 'jenkins'). This mode even works on non-indexed columns (on MyISAM tables), and thus I can use it on a view, although it will be slower (that is what I need to test). One final problem I had was that it wasn't matching partial search terms, so 'bob' wouldn't find 'bobby', for example. The usual % wildcard doesn't work; instead you append an asterisk to the term, as in 'bob*'.
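Turning what the user typed into such a boolean-mode string can be sketched as follows. This is an illustrative Python helper with a hypothetical name; it makes every word required and, optionally, a prefix match.

```python
import re

def to_boolean_query(user_input, prefix=True):
    """Turn free text into a MySQL boolean-mode search string.

    Every word becomes required (+); with prefix=True a trailing * is
    added so that 'bob' also matches 'bobby'.
    """
    # Keep only letters, digits and underscore so that characters with a
    # meaning in boolean mode (+ - < > ( ) ~ * ") cannot come from the user.
    words = re.findall(r"\w+", user_input)
    suffix = "*" if prefix else ""
    return " ".join(f"+{w}{suffix}" for w in words)
```

For example, `to_boolean_query("bob jenkins")` gives `+bob* +jenkins*`. The result would then be passed as a bound parameter to `AGAINST(... IN BOOLEAN MODE)` rather than concatenated into the SQL.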
