How to make a fulltext search

I want to build a fulltext search on top of metaphone keys.
Each row has an ID plus four fields, i.e.
ID | Category    | Type            | Title                        | Meta
1  | Vehicle     | 4 Wheelers      | Farrari Car for Sale         | FHKL WLRS FRR KR FR SL
2  | Real Estate | Residential Apt | 3BHK for sale                | RL ESTT RSTN APT BK FR SL
3  | Music       | Instruments     | Piano for sale               | MSK INST PN FR SL
4  | Stationary  | College         | Bag for $50                  | STXN KLJ BK FR
5  | Services    | Job             | Vacancy for Jr.Web Developer | SRFS JB FKNS FR JRWB TFLP
The above is sample data. I want to combine metaphone with a fulltext search using MATCH() AGAINST().
Mostly this works. However, short words like Bag, Job and Car are ignored, because the default minimum word length is 4 characters. I am on shared hosting, and the provider has told me they can neither give me access to the MySQL config file nor change the setting themselves, so putting this in the config file
ft_min_word_len = 2
is not an option.
<?php
// Code for generating metaphone: one key per word, at most 4 phonemes each
$string = "Vacancy for Jr.Web Developer";
$a = explode(" ", $string);
foreach ($a as $value) {
echo metaphone($value, 4) . "<br>";
}
?>
I am using a normal
SELECT * FROM tbl_sc WHERE MATCH(META) AGAINST('$USER_SEARCH');
All the information in the database is user generated, so I cannot supervise it. Since I use MySQL and PHP on shared hosting, I cannot use Elasticsearch, Solr or anything similar. I have searched Google and Stack Overflow, but I have not found anything that helps.
One option is the LIKE operator, but I want to use MATCH() AGAINST() if possible.
Kindly help me out with a workaround or an alternate route.

First, there are three types of fulltext searches:
Natural language fulltext searches
Boolean fulltext searches
Query expansion searches
What suits your question is the natural language fulltext search, since your queries are mostly free text and use no special characters. The syntax goes like this:
SELECT * FROM table_name WHERE MATCH(col1, col2)
AGAINST('search terms' IN NATURAL LANGUAGE MODE)
In your case, first add a fulltext index to your table:
$stmt_txt_search = $conn->prepare("ALTER TABLE tbl_sc ADD FULLTEXT (Category, Type, Title, Meta)");
$stmt_txt_search->execute();
Your query should then be something like this:
$stmt_match = $conn->prepare("SELECT * FROM tbl_sc WHERE MATCH (Meta) AGAINST(? IN NATURAL LANGUAGE MODE)");
$stmt_match->bind_param("s",$USER_SEARCH);
$stmt_match->execute();
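Since the Meta column stores metaphone keys rather than the original words, the user's search string presumably has to be converted the same way before it is bound to the query. A minimal sketch of that step, assuming the same metaphone($word, 4) call as in the question (the function name to_metaphone_query is hypothetical, not from the original post):

```php
<?php
// Hypothetical helper: turn a free-text search into space-separated
// metaphone keys, using the same 4-phoneme limit as the stored Meta column.
function to_metaphone_query(string $input): string
{
    // Split on any run of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    $codes = [];
    foreach ($words as $word) {
        $code = metaphone($word, 4);
        if ($code !== '') {   // words with no letters produce no key
            $codes[] = $code;
        }
    }
    return implode(' ', $codes);
}

echo to_metaphone_query('Piano for sale'), PHP_EOL;
```

The result would be passed as $USER_SEARCH to bind_param. Short keys such as FR or SL are still subject to the ft_min_word_len limit discussed in the question.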
To alter ft_min_word_len you normally have to edit the my.cnf file, change it to the desired value, restart the server and rebuild your indexes, like so:
[mysqld]
ft_min_word_len=3
then
mysql> ALTER TABLE tbl_sc DROP INDEX Title, Category...;
mysql> ALTER TABLE tbl_sc ADD FULLTEXT Title, Category...;
But since you are on a shared hosting account, you cannot access the my.cnf file. However, using SHOW VARIABLES or INFORMATION_SCHEMA you can see all the variables that are set, and variables that are dynamic can be changed at runtime with SET.
For instance, to list the variables in SQL you can use
SELECT * FROM information_schema.global_variables;
This shows the server's global variables. A dynamic variable such as flush_time can be set to 1 using SET GLOBAL flush_time = 1; (given sufficient privileges), so the database will use a flush time of 1 from then on. In your case it is cheap to try
SET ft_min_word_len = 2; within your current session, but note that the MySQL manual lists ft_min_word_len as a non-dynamic variable, so expect the statement to be rejected. If it is, the value can only be changed in the server configuration. For more information, see the server system variables documentation.

Related

CodeIgniter Query on table without primary key

For some customers I am working on a project which is using a MySQL database.
I need to implement a search functionality which should be able to search in the database the devices with all the features selected. I am using CodeIgniter. The problem is the structure of the table.
I've found out that the table contains 2 columns: ID_D (the id of the device) and ID_F (the id of the feature). Basically the table doesn't contain any primary key (that's why I cannot execute any join at all).
So, it's also possible that a device id can appear in 10 rows for each feature it has. When I execute the search, I have a list of the features ID and I should be able to read only the devices with all the features selected.
if (isset($feature_array)) {
foreach($feature_array as $key => $row) {
$this->db->where('f_id',$row['f_id']);
}
}
Naturally, something like that won't work. Any ideas?

I think this might solve your problem, provided the feature ids are unique and do not contain spaces. If they do contain spaces, pick a separator that cannot occur in a feature id.
Basically the code uses the MySQL GROUP_CONCAT function to concatenate all feature ids of a device and matches the resulting string against every searched feature. This finds all devices that support at least the given set of features.
CI itself does not support that function, so it is added through a workaround in the select method.
Also, this might be a bit load intensive if it is called on a large table. Maybe someone else has a faster solution?
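// Second argument false: do not let CI escape the GROUP_CONCAT expression
$this->db->select("ID_D, GROUP_CONCAT(ID_F SEPARATOR ' ') AS id_f_concatenated", false);
foreach ($feature_array as $row) {
// The alias is an aggregate, so it has to be filtered with HAVING, not WHERE
$this->db->having("id_f_concatenated REGEXP '(^| ){$row['id_f']}( |$)'", null, false);
}
$this->db->group_by('ID_D');
$result = $this->db->get('tablename');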
EDIT:
Changed the LIKE to a REGEXP to make sure that the feature id 112 is not matched by the id 12, without making the expression too complicated.
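The point of the EDIT, that id 12 must not match inside 112, can be checked outside the database. The sketch below uses PHP's preg_match rather than MySQL's REGEXP, and the helper name has_feature is made up for illustration; it only demonstrates the whole-token test on a space-separated id string.

```php
<?php
// Hypothetical helper: does the space-separated id list contain exactly
// this feature id as a whole token (not as part of a longer id)?
function has_feature(string $concatenated, string $featureId): bool
{
    // The id must be preceded by start-of-string or a space,
    // and followed by a space or end-of-string.
    $pattern = '/(^| )' . preg_quote($featureId, '/') . '( |$)/';
    return preg_match($pattern, $concatenated) === 1;
}

var_dump(has_feature('7 112 30', '12')); // 12 only occurs inside 112
var_dump(has_feature('7 12 30', '12'));  // 12 is a token of its own
```

A device qualifies when this test succeeds for every selected feature id, which is what chaining one condition per feature achieves in the query.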

How to use match against with 4 or less characters on shared hosting?

I'm building a search engine for jobs and I've used MATCH ... AGAINST, but I have a problem when I search for 'php', 'hr' or 'IT'.
There are no results, because I can't change the ft_min_word_len variable on shared hosting. My client tells me he knows other companies that use the same MATCH ... AGAINST queries on shared hosting and it works, so what is the solution?
Thanks

@Shadow pointed out that ft_min_word_len is a MySQL server-wide setting. Some shared hosting companies will say yes when you ask for a change to that variable, and others will say no. The worst ones will say huh? Avoid those companies.
So your first task is to find out which companies say yes. Can you use one of those companies? If so, go do it.
If that doesn't work, you could consider moving your application to a virtual machine provider like Amazon Web Services, and installing your own MySQL server instance. It's a little more trouble, but you can control things.
If that doesn't work, and you must be able to search for short words, you'll need to populate your own short word table and search it with non-FULLTEXT SQL filtering.
To explain this in detail is beyond the scope of a SO answer. But here's the outline.
Create a shortwords table containing two columns, item_id and word. item_id is a FK to another table containing your text items. word is a VARCHAR(6) column. The primary key of this table can be a composite of both columns, that is, (word, item_id).
Populate that table by writing a PHP program to process each bit of text in your text items. It must explode the text strings into words, then INSERT each short word into your shortwords table with the item_id value.
When you search for short words, do this sort of query.
SELECT item.text
FROM item
JOIN shortwords ON item.item_id = shortwords.item_id
WHERE shortwords.word = 'php'
OR shortwords.word = 'it'
OR shortwords.word = 'hr'
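The word-splitting part of the populate step could look roughly like the following. This is only a sketch under assumptions that are not in the answer: "short" is taken to mean at most 3 characters (below the default minimum of 4), words are lowercased ASCII letters and digits, and the function name extract_short_words is invented. The INSERT into shortwords is left out.

```php
<?php
// Hypothetical helper: return the distinct short words of a text,
// lowercased, in order of first appearance.
function extract_short_words(string $text, int $maxLen = 3): array
{
    // Split on anything that is not an ASCII letter or digit.
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $short = [];
    foreach ($tokens as $token) {
        if (strlen($token) <= $maxLen) {
            $short[] = $token;
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($short));
}

print_r(extract_short_words('Senior PHP developer, HR and IT'));
```

Each returned word would then be inserted together with the item's item_id. Because (word, item_id) is the primary key, removing duplicates first avoids key collisions within one item.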
Look, this is an outline of the solution to your short word problem. It is going to take quite a bit of programming, by you, to get this working. Please don't ask SO to do this programming for you.

MySQL Match Against Reserved Word in Field

In a database I work with, there are a few million rows of customers. To search this database, we use a match against Boolean expression. All was well and good, until we expanded into an Asian market, and customers are popping up with the name 'In'. Our search algorithm can't find this customer by name, and I'm assuming that it's because it's an InnoDB reserved word. I don't want to convert my query to a LIKE statement because that would reduce performance by a factor of five. Is there a way to find that name in a full text search?
The query in production is very long, but the portion that's not functioning as needed is:
SELECT
`customer`.`name`
FROM
`customer`
WHERE
MATCH(`customer`.`name`) AGAINST("+IN*+KYU*+YANG*" IN BOOLEAN MODE);
Oh, and the innodb_ft_min_token_size variable is set to 1 because our customers "need" to be able to search by middle initial.

It isn't a reserved word, but it is in the stopword list. You can override this with ft_stopword_file to supply your own list of stopwords (that variable applies to MyISAM indexes; for InnoDB the corresponding settings are innodb_ft_server_stopword_table and innodb_ft_user_stopword_table). There are two possible problems with this:
(1) After altering it, you need to rebuild your fulltext index.
(2) It's a global variable: you can't alter it on a session / location / language-used basis. So if you really need all the words and are using a lot of different languages in one database, providing an empty list is almost the only way to go, which can hurt a bit for uses where you would like a stopword list to apply.

Perform accent insensitive fulltext search MySQL

I'm currently developing a search functionality for a website. Users search for other users by name. I'm having some trouble getting good results for users that have accents on their name.
I have a FULLTEXT index on the name column and the table's collation is utf8_general_ci.
Currently if somebody registers for the site, and has a name with accents (for example: Alberto Andrés), the name is stored in the DB as shown in the following image:
So if I perform the following query SELECT * MATCH(name) AGAINST('alberto andres') I get lots of results with better match scores like 'Alberto', 'Andres', 'Andrés' and finally with a low match score the record the user is probably looking for 'Alberto Andrés'.
What could I do to take into account the way accented records are currently stored in the DB?
Thanks!

It looks to me like the surname of el Señor Andrés is actually stored correctly. The rendering you showed us is the way some non-UTF apps mangle UTF8 text.
You might try this modification of your query if you don't yet have a whole bunch of records in your table. Fulltext (non-boolean) mode works weirdly on small data sets.
SELECT *
FROM TABLE
WHERE MATCH(name) AGAINST('alberto andres' IN BOOLEAN MODE)
You also might try
SELECT *
FROM TABLE
WHERE MATCH(name) AGAINST(CONVERT('alberto andres' USING utf8))
just to make sure your match string is in the same character set as your MySQL columns.

Searching a big mysql database with relevance

I'm building a rather large "search" engine for our company intranet; it has more than 1 million entries.
It's running on a rather fast server, and yet some search queries take up to 1 minute.
This is how the table looks
I tried to create an index for it, but it seems I'm missing something. This is what SHOW INDEX shows
and this is the query itself. It is mostly the ordering that slows the query down, but even a query without the sorting is somewhat slow.
SELECT SQL_CALC_FOUND_ROWS *
FROM `businessunit`
INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id` = `businessunit-postaddress`.`BusinessUnit`
WHERE `businessunit`.`Name` LIKE 'tanto%'
ORDER BY `businessunit`.`Premium` DESC ,
CASE WHEN `businessunit`.`Name` = 'tanto'
THEN 0
WHEN `businessunit`.`Name` LIKE 'tanto %'
THEN 1
WHEN `businessunit`.`Name` LIKE 'tanto%'
THEN 2
ELSE 3
END , `businessunit`.`Name`
LIMIT 0 , 30
Any help is very much appreciated.
Edit:
What's choking this query 99% of the time is ordering by relevance with the wildcard character %.
When I run EXPLAIN it says Using where; Using filesort

You should try Sphinx, a full-text search engine that will give you very good performance along with lots of options for tuning relevancy.

It seems the index doesn't cover Premium, yet that is the first ORDER BY argument.
Run EXPLAIN on your query to see the query plan, then change your index to remove any table scans, as explained in http://dev.mysql.com/doc/refman/5.0/en/using-explain.html

MySQL is good for storing data but not great when it comes down to fast text based search.
Apart from Sphinx which has been already suggested I recommend two fantastic search engines:
Solr with http://pecl.php.net/package/solr - very popular search engine. Used on massive services like NetFlix.
Elastic Search - relatively new software but with very active community and lots of respect
Both solutions are based on the same library, Apache Lucene.

If the "ORDER BY" is really the bottleneck, the straight-forward solution would be to remove the "ORDER BY" logic from your query, and re-implement the sorting directly in your application's code using C# sorting. Unfortunately, this means you'd also have to move your pagination into your application, since you'd need to obtain the complete result set before you can sort & paginate it. I'm just mentioning this because no-one else so far appears to have thought of it.
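As an illustration of moving the sort out of SQL, here is a rough sketch in PHP (not C#; PHP is used only because it is the language of the other examples on this page). It assumes the unsorted rows were already fetched as associative arrays with Name and Premium keys; the function names are hypothetical.

```php
<?php
// Mirror of the CASE expression: 0 = exact match, 1 = term followed by a
// space, 2 = any other prefix match, 3 = everything else.
function relevance_rank(string $name, string $term): int
{
    $name = strtolower($name);
    $term = strtolower($term);
    if ($name === $term) {
        return 0;
    }
    if (strpos($name, $term . ' ') === 0) {
        return 1;
    }
    if (strpos($name, $term) === 0) {
        return 2;
    }
    return 3;
}

// Sort by Premium DESC, then relevance rank ASC, then Name ASC.
function sort_by_relevance(array $rows, string $term): array
{
    usort($rows, function (array $a, array $b) use ($term): int {
        $cmp = $b['Premium'] <=> $a['Premium'];
        if ($cmp !== 0) {
            return $cmp;
        }
        $cmp = relevance_rank($a['Name'], $term) <=> relevance_rank($b['Name'], $term);
        if ($cmp !== 0) {
            return $cmp;
        }
        return strcasecmp($a['Name'], $b['Name']);
    });
    return $rows;
}

$rows = [
    ['Name' => 'tantor', 'Premium' => 0],
    ['Name' => 'tanto', 'Premium' => 0],
    ['Name' => 'tanto ab', 'Premium' => 1],
    ['Name' => 'tanto cafe', 'Premium' => 0],
];
// Pagination then happens in the application, e.g. the first 30 rows:
$page = array_slice(sort_by_relevance($rows, 'tanto'), 0, 30);
print_r(array_column($page, 'Name'));
```

The trade-off is the one described above: the whole matching result set has to be loaded before it can be sorted and sliced.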
Frankly (like others have pointed out), the query you showed at the top should not need full-text indexing. A single suffix wildcard (e.g., LIKE 'ABC%') should be very effective as long as a BTREE (and not a HASH) index is available on the column in question.
And, personally, I have no aversion to even a double wildcard (e.g., LIKE '%ABC%'), which of course can never make use of indexes, as long as a full table scan is cheap. Probably 250,000 rows is the point where I'll start to seriously consider full-text indexing. 100,000 is definitely no problem.
I always make sure my SELECTs are dirty reads, though (no transactionality applied to the select).
It's dirty once it gets to the user's eyeballs in any case!

Most search-oriented sites use full-text search.
It is much faster than SELECT with LIKE.
I have added one example and some links.
I think they will be useful for you.
Full-text search also comes with some conditions of its own.
STEP:1
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body)
);
STEP:2
INSERT INTO articles (title,body) VALUES
('MySQL Tutorial','DBMS stands for DataBase ...'),
('How To Use MySQL Well','After you went through a ...'),
('Optimizing MySQL','In this tutorial we will show ...'),
('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
('MySQL vs. YourSQL','In the following database comparison ...'),
('MySQL Security','When configured properly, MySQL ...');
STEP:3
Natural Language Full-Text Searches:
SELECT * FROM articles
WHERE MATCH (title,body) AGAINST ('database');
Boolean Full-Text Searches
SELECT * FROM articles WHERE MATCH (title,body)
AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
Go through these links:
viralpatel.net, devzone.zend.com, sqlmag.com, colorado.edu, en.wikipedia.org

It's such a strange query :)
Let's try to understand what it does.
The result is at most 30 rows from the table "businessunit", subject to some conditions.
The first condition is the join on the foreign key of the "businessunit-postaddress" table.
Please check that you have an index on the column businessunit-postaddress.BusinessUnit.
The second one is a filter that returns only rows whose businessunit.Name begins with 'tanto'.
If I'm not mistaken, you have a very complex index 'Business' that consists of 11 fields!
And the field 'Name' is not the first field in this index.
So this index is useless when you run the "LIKE 'tanto%'" query.
I strongly doubt this index is necessary at all.
By the way, it takes considerable resources to maintain and slows down edit operations on this table.
You have to create an index on the field 'Name' alone.
After filtering, the query sorts the results, and it does so in a strange way too.
First it sorts by the field businessunit.Premium, which is normal.
However, the CASE statements that follow are useless too.
Here is why.
Zero is assigned to Name = 'tanto' (exactly).
The next rows, with one, are rows with a space after 'tanto'. These will come after 'tanto' in any case (except for special symbols), because a space sorts lower than any letter.
The next rows, with two, are rows with some characters after 'tanto' (including a space!). These rows will be in this order too, by definition.
And three is "reserved" for "other" rows, but you won't get any "other" rows: remember the [WHERE businessunit.Name LIKE 'tanto%'] condition.
So this part of the ORDER BY is meaningless.
And at the end of the ORDER BY there is businessunit.Name again...
My advice: rebuild the query from scratch, keeping in mind what you want to get.
Anyway, I guess you can use
SELECT SQL_CALC_FOUND_ROWS *
FROM `businessunit`
INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id` = `businessunit-postaddress`.`BusinessUnit`
WHERE `businessunit`.`Name` LIKE 'tanto%'
ORDER BY `businessunit`.`Premium` DESC,
`businessunit`.`Name`
LIMIT 0 , 30
Don't forget about an index on the field businessunit-postaddress.BusinessUnit!
And I have a strong assumption about the field Premium.
I guess it is designed for storing binary data (yes/no).
If so, an ordinary (BTREE) index is a poor fit because the column has so few distinct values.
A bitmap index would suit such a column, although MySQL itself does not provide one.
P.S. I'm not sure that you really need to use SQL_CALC_FOUND_ROWS; see
MySQL: Pagination - SQL_CALC_FOUND_ROWS vs COUNT()-Query

It's either full-text (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) or pattern matching (http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html) on the PHP and MySQL side.
From experience and theory:
Advantages of full-text -
1) Results are very relevant, and delimiter characters like spaces in the search query do not hinder the search.
Disadvantages of full-text -
1) There are stopwords, which web hosts use as restrictions to prevent excess load (e.g. search results containing the word 'one' or 'moz' are not displayed). This can be avoided if you're running your own server, by keeping no stopwords.
2) If I type 'ree' it only displays words that are exactly 'ree', not 'three' or 'reed'.
Advantages of pattern matching -
1) It does not have any stopwords as full-text does, and if you search for 'ree', it displays any word containing 'ree', like 'reed' or 'three', unlike full-text where only the exact word is retrieved.
Disadvantages of pattern matching -
1) If your search words contain delimiters like spaces and those spaces are not present in the stored text in the same way, it returns no result, because the whole search string is matched as one pattern rather than word by word.
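The 'ree' example can be mimicked in plain PHP. This sketch does not call MySQL at all: stripos stands in for LIKE '%ree%', and a word-boundary regex stands in for full-text's whole-word behaviour. Both function names are invented for illustration.

```php
<?php
// Stand-in for LIKE '%needle%': true if needle occurs anywhere.
function substring_match(string $haystack, string $needle): bool
{
    return stripos($haystack, $needle) !== false;
}

// Stand-in for whole-word matching: needle must appear as a word of its own.
function whole_word_match(string $haystack, string $needle): bool
{
    return preg_match('/\b' . preg_quote($needle, '/') . '\b/i', $haystack) === 1;
}

var_dump(substring_match('three reeds', 'ree'));  // found inside both words
var_dump(whole_word_match('three reeds', 'ree')); // no word is exactly "ree"
```

Real full-text search additionally applies stopwords and the minimum word length, which this stand-in ignores.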

If the argument of LIKE doesn't begin with a wildcard character, as in your example, the LIKE operator should be able to take advantage of indexes.
In this case, LIKE should perform better than LOCATE or LEFT, so I suspect that changing the condition like this could make things worse, but I still think it's worth trying (who knows?):
WHERE LOCATE('tanto', `businessunit`.`Name`)=1
or:
WHERE LEFT(`businessunit`.`Name`,5)='tanto'
I would also change your order by clause:
ORDER BY
`businessunit`.`Premium` DESC ,
CASE WHEN `businessunit`.`Name` LIKE 'tanto %' THEN 1
WHEN `businessunit`.`Name` = 'tanto' THEN 0
ELSE 2 END,
`businessunit`.`Name`
Name has to be LIKE 'tanto%' already, so you can skip a condition (CASE will never return value 3). Of course, make sure that Premium field is indexed.
Hope this helps.

I think you need to collect the keys only, sort them, then join last
SELECT A.*,B.* FROM
(
SELECT * FROM (
SELECT id BusinessUnit,Premium,
CASE
WHEN Name = 'tanto' THEN 0
WHEN Name LIKE 'tanto %' THEN 1
WHEN Name LIKE 'tanto%' THEN 2
ELSE 3
END SortOrder
FROM businessunit WHERE Name LIKE 'tanto%'
) AA ORDER BY Premium,SortOrder LIMIT 0,30
) A LEFT JOIN `businessunit-postaddress` B USING (BusinessUnit);
This will still generate a filesort.
You may want to consider preloading the needed keys in a separate table you can index.
CREATE TABLE BusinessKeys
(
id int not null auto_increment,
BusinessUnit int not null,
Premium int not null,
SortOrder int not null,
PRIMARY KEY (id),
KEY OrderIndex (Premium,SortOrder,BusinessUnit)
);
Populate all keys that match
INSERT INTO BusinessKeys (BusinessUnit,Premium,SortOrder)
SELECT id,Premium,
CASE
WHEN Name = 'tanto' THEN 0
WHEN Name LIKE 'tanto %' THEN 1
WHEN Name LIKE 'tanto%' THEN 2
ELSE 3
END
FROM businessunit WHERE Name LIKE 'tanto%';
Then, to paginate, run LIMIT on the BusinessKeys only
SELECT A.*,B.*
FROM
(
SELECT * FROM BusinessKeys
ORDER BY Premium,SortOrder
LIMIT 0,30
) BK
LEFT JOIN businessunit A ON BK.BusinessUnit = A.id
LEFT JOIN `businessunit-postaddress` B ON BK.BusinessUnit = B.BusinessUnit
;
CAVEAT : I use LEFT JOIN instead of INNER JOIN because LEFT JOIN preserves the order of the keys from the left side of the query.

I've read the answer suggesting Sphinx to optimize the search. But based on my experience I would advise a different solution. We used Sphinx for some years and had a few nasty problems with segmentation faults and corrupted indices. Perhaps Sphinx isn't as buggy as it was a few years ago, but for a year now we have been very happy with a different solution:
http://www.elasticsearch.org/
The great benefits:
Scalability - you can simply add another server with nearly zero configuration. If you know mysql replication, you'll love this feature
Speed - Even under heavy load you get good results in much less than a second
Easy to learn - Only by knowing HTTP and JSON you can use it. If you are a Web-Developer, you feel like home
Easy to install - it is useable without touching the configuration. You just need simple Java (no Tomcat or whatever) and a Firewall to block direct access from the public
Good Javascript integration - even a phpMyAdmin-like Tool is a simple HTML-Page using Javascript: https://github.com/mobz/elasticsearch-head
Good PHP Integration with https://github.com/ruflin/Elastica
Good community support
Good documentation (it is not eye friendly, but it covers nearly every function!)
If you need an additional storage solution, you can easily combine the search engine with http://couchdb.apache.org/
