Using a hyphen in fulltext search with an InnoDB engine? - php

I have a FULLTEXT search in a table of part numbers. Some part numbers have hyphens.
The table engine is InnoDB using MySQL 5.6.
The problem I am having is that MySQL treats the hyphen (-) character as a word separator.
So I created a new MySQL character set collation in which the hyphen is treated as a letter.
I followed this tutorial: http://dev.mysql.com/doc/refman/5.0/en/full-text-adding-collation.html
I made a test table using the syntax at the bottom of the link, however I used the InnoDB engine. I searched for '----' and received "syntax error, unexpected '-'".
However, if I change the engine to MyISAM, I get the correct result.
How do I get this to work with the InnoDB engine?
It seems with MySQL it's one step forward and two steps back.
Edit: I found this link for 5.6 (http://dev.mysql.com/doc/refman/5.6/en/full-text-adding-collation.html), which is the same tutorial using InnoDB as the engine.
But here's my test:
create table test (a TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, FULLTEXT INDEX(a)) ENGINE=InnoDB
Added a row that is just "----"
select * from test where MATCH(a) AGAINST('----' IN BOOLEAN MODE)
syntax error, unexpected '-'
Drop the table and recreate it with MyISAM:
create table test (a TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, FULLTEXT INDEX(a)) ENGINE=MyISAM
Added a row that is just "----"
select * from test where MATCH(a) AGAINST('----' IN BOOLEAN MODE)
1 result

I encountered this exact issue recently. I had previously added a custom collation per the docs, was using MyISAM, and it was working fine. Then a few weeks ago I switched to InnoDB and things stopped working. I tried:
Rebuilding my collation and A/B testing to make sure they are working
Disabling stopwords by setting innodb_ft_enable_stopword to 0
Rebuilding my fulltext table and index
In the end I took a different approach since InnoDB doesn't seem to follow the same rules as MyISAM when it comes to fulltext indexing. This is a bit hacky but works for my application:
Create a special search column containing the data I need to search. This column has a fulltext index and exists for the sole purpose of doing a fulltext search, which is still very fast on a table with millions of rows.
Search/replace all - in my search column with an unused character that is considered a "word" character. See my question here regarding this: https://dba.stackexchange.com/questions/248607/which-characters-are-considered-word-characters. Figuring out which characters count as word characters turns out to be not so easy, but here are a few that worked for me: Ω œ π µ. These characters are probably not used in the data you need to search, but they will be recognized by the parser as searchable characters. In my case I replace - with Ω. Since I only need the row ID, it doesn't matter what the data in this column looks like to human eyes.
Revise my updates and inserts to keep the search column data and substitutions up to date. In my case this was easy since there is only one place in the application that updates this particular table. A couple of triggers could also be used to handle this:
CREATE TRIGGER update_search BEFORE UPDATE ON animals
FOR EACH ROW SET NEW.search = REPLACE(NEW.animal_name, '-', 'Ω');
CREATE TRIGGER insert_search BEFORE INSERT ON animals
FOR EACH ROW SET NEW.search = REPLACE(NEW.animal_name, '-', 'Ω');
Replace - in my search queries with Ω (see the end-to-end sketch below).
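Putting the pieces together, here is a minimal sketch of the workaround, reusing the animals / animal_name / search names from the triggers above (the id column and the 'ABC-123' part number are hypothetical):

-- add the search-only column, backfill it with the substitution, then index it
ALTER TABLE animals ADD COLUMN search TEXT;
UPDATE animals SET search = REPLACE(animal_name, '-', 'Ω');
ALTER TABLE animals ADD FULLTEXT INDEX ft_search (search);

-- apply the same substitution to the user's term in the application,
-- then search for the rewritten literal ('ABC-123' becomes 'ABCΩ123')
SELECT id FROM animals
WHERE MATCH(search) AGAINST('ABCΩ123' IN BOOLEAN MODE);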
Voila. Here's a fiddle demonstrating: https://www.db-fiddle.com/f/x1WZpZP6wcqbTTvTEFFXYc/0
The above workaround might not be realistic for every application but hopefully it's useful for someone. Would be great to have a real solution to this for InnoDB.

The InnoDB FULLTEXT search is probably treating the hyphens as stop-words. So when it gets to the second hyphen, it expects a word, not another hyphen, which would explain the 'syntax error'.
The reason it doesn't do this in MyISAM is that the InnoDB implementation of FULLTEXT indexes is quite different; after all, they were only added for InnoDB in MySQL 5.6.
What can you do about this? It seems you can influence the list of stop-words through a special table: http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_ft_user_stopword_table. This could stop MySQL from treating hyphens as stop-words.
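For reference, a minimal sketch of that stopword-table mechanism as the documentation describes it (mydb and my_stopwords are hypothetical names, and it assumes the InnoDB test table from the question still exists with its auto-named index a):

-- the stopword table must be InnoDB with a single VARCHAR column named "value";
-- leaving it empty means no stop-words at all
CREATE TABLE mydb.my_stopwords (value VARCHAR(30)) ENGINE=InnoDB;
SET GLOBAL innodb_ft_user_stopword_table = 'mydb/my_stopwords';
-- (the variable also has session scope; reconnect or set it for the session
-- before rebuilding so the rebuild picks it up)

-- rebuild the FULLTEXT index so it uses the new stopword list
ALTER TABLE test DROP INDEX a;
ALTER TABLE test ADD FULLTEXT INDEX (a);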

Related

Store full width and half width character in unique column of database

I have a word list stored in MySQL, and the size is around 10k words. The column is marked as unique. However, I cannot insert both the full-width and half-width versions of a punctuation mark.
Here are some examples:
(half-width, full-width)
('?', '？')
('/', '／')
The purpose is that I have many articles containing both full-width and half-width characters, and I want to find out whether the articles contain these words. I use PHP to do the comparison and it can tell that '?' is different from '？'. Is there any way to do this in MySQL too? Or is there some way for PHP to treat them as equal?
I use utf8_unicode_ci for the database encoding, and the column also uses utf8_unicode_ci. When I run these two queries, both return the same record:
SELECT word FROM word_list WHERE word='?測試'
SELECT word FROM word_list WHERE word='？測試'
The most likely explanation is a character-set translation issue; for example, the column you are storing the value in is defined with the latin1 character set.
But it's not necessarily the character set of the column that's causing the issue; it's a character-set conversion happening somewhere.
If you aren't aware of character-set encodings, I recommend consulting the source of all knowledge: Google.
I highly recommend the two top hits for this search:
what every programmer needs to know about character encoding
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/
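As a quick check of where the conversion might be happening, a hedged sketch using the word_list table and word column from the question (HEX() just shows the bytes actually stored, and SHOW CREATE TABLE shows the column's character set and collation):

SHOW CREATE TABLE word_list;

SELECT word, HEX(word) FROM word_list WHERE word = '？測試';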

finding words in mysql huge database

So, I've never worked with a database this huge. We are talking about 200,000,000+ words that I want to be able to search through. How should I approach this? Using a normal WHERE clause would take 10+ minutes; should I split up the database or something?
Any help would be great!
MySQL FULLTEXT indexes are quite useful when searching for words. You have to create FULLTEXT indexes on the fields that contain the relevant text/character strings. Then you can use:
SELECT * FROM table WHERE MATCH (text_index_field) AGAINST ('what you need to look for');
You should use MySQL FULLTEXT indexing.
Use ALTER TABLE to create a FULLTEXT index on your desired column.
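For example, a minimal sketch reusing the hypothetical table and text_index_field names from the query above (ft_text is just a made-up index name):

ALTER TABLE `table` ADD FULLTEXT INDEX ft_text (text_index_field);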
From http://dev.mysql.com/doc/refman/5.1/en/alter-table.html:
Full-text indexes can be used only with MyISAM tables. (In MySQL 5.6 and up, they can also be used with InnoDB tables.) Full-text indexes can be created only for CHAR, VARCHAR, or TEXT columns.

Opencart - Search regardless accent

It has been written many times already that Opencart's basic search isn't good enough. Well, I have come across this issue:
When a customer searches for a product in my country (Slovakia (UTF8)), he probably won't use diacritics. So he/she types "cucoriedka" and finds nothing.
But there is a product named "čučoriedka" in the database, and I want it to be displayed too, since that's what he was looking for.
Do you have an idea how to get this to work? The simpler the better!
I'm ignorant of Slovak, I am sorry. But the Slovak collation utf8_slovak_ci treats the Slovak letter č as distinct from c. (Do the surnames starting with Č all come after those starting with C in your telephone directories? They probably do. The creators of MySQL certainly think they do.)
The collation utf8_general_ci treats č and c the same. Here's a sql fiddle demonstrating all this. http://sqlfiddle.com/#!9/46c0e/1/0
If you change the collation of the column containing your product name to utf8_general_ci, you will get a more search-friendly table. Suppose your table is called product and the column with the name in it is called product_name. Then this SQL data-definition statement will convert the column as you require. You should look up the actual datatype of the column instead of using varchar(nnn) as I have done in this example.
alter table product modify product_name varchar(nnn) collate utf8_general_ci
If you can't alter the table, then you can change your WHERE clause to work like this, specifying the collation explicitly.
WHERE 'userInput' COLLATE utf8_general_ci = product_name
But this will be slower to search than changing the column collation.
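As a quick sanity check of the collation behaviour, a hedged one-liner (it assumes the connection character set is utf8; the two literals come from the question):

SELECT 'cucoriedka' = 'čučoriedka' COLLATE utf8_general_ci;  -- returns 1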
You can use the SOUNDEX() function or the SOUNDS LIKE operator of MySQL.
These compare phonetics.
The accuracy of SOUNDEX is doubtful for languages other than English, but it can be improved if you use it like this:
select soundex('ball')=soundex('boll') from dual
SOUNDS LIKE can also be used; see the sketch below.
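A hedged example of SOUNDS LIKE, reusing the product / product_name names from the earlier answer (expr1 SOUNDS LIKE expr2 is shorthand for comparing their SOUNDEX() values):

SELECT * FROM product WHERE product_name SOUNDS LIKE 'cucoriedka';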
Using a combination of both SOUNDEX() and SOUNDS LIKE will improve accuracy.
Kindly refer to the MySQL documentation for details, or to mysql-sounds-like-and-soundex.

mysql collation and indexes

What MySQL collation should I use for my tables to support all European languages, all Latin American languages, and maybe Chinese and other Asian languages? Thanks!
What is the rule when it comes to using indexes on MySQL table columns? When should you not use an index for a column in a table?
UTF-8 would probably be the best choice; more specifically, utf8_general_ci.
Indexes should not be added to a table that you're going to perform a huge number of insertions into. Indexes speed up SELECT queries, but they need to be updated every time you INSERT into the table. So, if you have a table that... well, let's say it stores news articles, a suitable index might be the title or something else you might want to "search" for, as in the sketch below.
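For example, a minimal sketch of that news-article scenario (articles and title are hypothetical names, and title is assumed to be a VARCHAR column):

-- index the column you will filter on
CREATE INDEX idx_title ON articles (title);

-- this lookup can now use the index
SELECT * FROM articles WHERE title = 'Some headline';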
Hope this clears some things up.
utf8 with utf8_general_ci is a universal character set choice.
You should not use an index on a column when you're sure you will never search on it (via a WHERE clause).

Partial keyword searching

Can anyone give me an idea of how I can do partial keyword searching with a PHP/MySQL search engine?
For example, if a person searches for "just can't get enough", I want it to return results containing the keywords "just can't get enough by black eyed peas" or the keywords "black eyed peas just can't get enough".
Another example: if I enter "orange juice", I want it to return results with the keywords "orange juice taste good".
It's pretty much like Google and YouTube search.
The code I'm using is: http://tinypaste.com/eac6cf
The search method you've used is the standard method for searching within small numbers of records. For example, if you had just around a thousand records, it would be OK.
But if you have to search millions of records, this method should not be used, as it will be terribly slow.
Instead, you have two options.
Explode your search field and build your own index table containing single words and a reference to the record position. Then search only your index and fetch the corresponding record from the main table (a sketch of this follows below).
Use MySQL's full-text search feature. This is easier to implement but has its own restrictions. This way you don't have to build the index yourself.
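A minimal sketch of the first option, with hypothetical word_index / record_id names (the main table is whatever holds your searchable text):

-- one row per (word, record) pair, populated whenever records are inserted or updated
CREATE TABLE word_index (
  word VARCHAR(64) NOT NULL,
  record_id INT NOT NULL,
  PRIMARY KEY (word, record_id)
);

-- find the records containing a given word
SELECT record_id FROM word_index WHERE word = 'orange';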
MySQL full-text search would help here, but it historically only worked with MyISAM tables (InnoDB added support in MySQL 5.6), and the performance tends to go down the drain when your data set becomes quite large.
At the company I work for, we push our search queries to Sphinx. Sites like Craigslist, The Pirate Bay, and Slashdot all use it, so it's pretty much proven for production use.
In MySQL, you can use a MyISAM table and simply define a text field (CHAR, VARCHAR, or TEXT) and then create a FULLTEXT index. Just keep in mind the size of the text field: the more characters allowed, the larger the index and the slower it will be to update.
Other large-data-set options include something like Solr, but unless you already know your data is going to be huge, you could certainly start with MySQL and see how it goes.
Most MySQL editors, including phpMyAdmin, provide a GUI for adding indexes; if you're doing it by hand, the code would look something like this:
CREATE TABLE IF NOT EXISTS `test2` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` text CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`),
FULLTEXT KEY `ft_name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
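And a hedged example of querying the test2 table above once it has some rows (the 'orange juice' term comes from the question; without an explicit mode, AGAINST uses natural language mode):

SELECT * FROM test2
WHERE MATCH(name) AGAINST('orange juice');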
