It has been written many times already that OpenCart's basic search isn't good enough. Well, I have come across this issue:
When a customer searches for a product in my country (Slovakia, UTF-8), he or she probably won't use diacritics. So the customer types "cucoriedka" and finds nothing.
But there is a product named "čučoriedka" in the database, and I want it to be displayed too, since that's what the customer was looking for.
Do you have an idea how to get this to work? The simpler the better!
I'm ignorant of Slovak, I am sorry. But the Slovak collation utf8_slovak_ci treats the Slovak letter č as distinct from c. (Do the surnames starting with Č all come after those starting with C in your telephone directories? They probably do. The creators of MySQL certainly think they do.)
The collation utf8_general_ci treats č and c the same. Here's an SQL Fiddle demonstrating all this: http://sqlfiddle.com/#!9/46c0e/1/0
If you change the collation of the column containing your product name to utf8_general_ci, you will get a more search-friendly table. Suppose your table is called product and the column with the name in it is called product_name. Then this SQL data-definition statement will convert the column as you require. You should look up the actual datatype of the column instead of using varchar(nnn) as I have done in this example.
ALTER TABLE product MODIFY product_name VARCHAR(nnn) COLLATE utf8_general_ci;
If you can't alter the table, then you can change your WHERE clause to work like this, specifying the collation explicitly.
WHERE 'userInput' COLLATE utf8_general_ci = product_name
But searching this way will be slower than changing the column collation, because the explicit COLLATE in the WHERE clause prevents MySQL from using an index on the column.
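To illustrate the effect, here is a minimal sketch using the product and product_name names assumed above; with utf8_general_ci the undiacritized search term matches the accented name:
SELECT product_name FROM product WHERE product_name = 'cucoriedka';
-- under utf8_general_ci this also returns the row named 'čučoriedka'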
You can use MySQL's SOUNDEX() function or its SOUNDS LIKE operator.
These compare words phonetically.
The accuracy of SOUNDEX is doubtful for languages other than English, but it can be improved if you use it like this:
SELECT SOUNDEX('ball') = SOUNDEX('boll') FROM dual;
SOUNDS LIKE can also be used; expr1 SOUNDS LIKE expr2 is shorthand for SOUNDEX(expr1) = SOUNDEX(expr2).
Using a combination of both SOUNDEX() and SOUNDS LIKE may improve accuracy.
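For example, a phonetic match against the product table assumed earlier might look like this (a sketch, not tested against OpenCart's schema):
SELECT product_name FROM product WHERE product_name SOUNDS LIKE 'cucoriedka';
-- equivalent to: WHERE SOUNDEX(product_name) = SOUNDEX('cucoriedka')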
Kindly refer to the MySQL documentation for details, or see mysql-sounds-like-and-soundex.
Related
I am using Laravel but can use raw SQL if needed.
I have multiple JSON fields; the JSON holds the translated data for each language. So, for example, the post table has a title field, and that title is {"en": "Title in english", "et": "Title in estonian"}.
Now I need to make a full-text search that searches these fields; for some columns I need to search the term across all languages, not just the active one.
I am using the latest stable MariaDB.
If I make a full-text index of these fields, I can search fine, but the search is case-sensitive.
How can I make the search case-insensitive? The JSON fields are currently longtext with utf8mb4_bin; Laravel chose that for JSON fields. I know bin is a case-sensitive collation, but what else could I use so that the functionality to find records by translated slug (for example) would still be there?
In Laravel, one can search like ->where('slug->en', 'some-post-slug'), so I need to keep Laravel's JSON field functionality intact.
I have been trying to achieve this for two days now; I need some external input.
The MySQL documentation states the following:
By default, the search is performed in case-insensitive fashion. To perform a case-sensitive full-text search, use a case-sensitive or binary collation for the indexed columns. For example, a column that uses the utf8mb4 character set can be assigned a collation of utf8mb4_0900_as_cs or utf8mb4_bin to make it case-sensitive for full-text searches.
I bet it's the same in MariaDB. If you change your collation, it will work with case-insensitive searches as expected. You can change it this way:
ALTER TABLE mytable
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;
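If converting the whole table is broader than you want, here is a minimal sketch of the narrower change, assuming the table is posts and the column is title stored as longtext the way Laravel creates it:
ALTER TABLE posts
MODIFY title LONGTEXT
CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;
Laravel's ->where('title->en', ...) lookups should keep working, since MariaDB's JSON functions operate on ordinary text columns; verify the slug comparisons behave as you expect after the change.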
I have a database (MySQL) in which I store more than 100,000 keywords in different languages. As an example, I have three columns: [id] [turkish (utf8_turkish_ci)] [german (utf8)].
Users can enter a German or a Turkish word in the search box. If the user enters a German word, all is fine and it prints out the Turkish word, but how do I solve it for the Turkish one? I ask because each language has its own additional characters, like ä ü ö ş etc.
So should I use
mb_convert_encoding
to convert the string? But then how do I check whether it is a German or a Turkish string? I think that would be too complex. Or is the encoding of the tables wrong?
I'm stuck now, so how do I implement this so that the user can enter keywords in both languages?
You have several issues to solve to make this work correctly.
First, you've chosen the utf8 character set to hold all your text. That is a good choice. If this is a new-in-2016 application, you might choose the utf8mb4 character set instead. Once you have chosen a character set, your users should be able to read your text.
Second, for the sake of searching and sorting (WHERE and ORDER BY) you need to choose an appropriate collation for each language. For modern German, utf8_general_ci will work tolerably well. utf8_unicode_ci works a little better if you need standard lexical ordering. Read this: http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
For modern Spanish, you should use utf8_spanish_ci. That's because in Spanish the N and Ñ characters are not considered the same. I don't know whether the general collation works for Turkish.
Notice that you seem to have confused the notions of character set and collation in your question. You've mentioned a collation with your Turkish column and a character set with your German column.
You can explicitly specify character set and collation in queries. For example, you can write
WHERE _utf8 'München' COLLATE utf8_unicode_ci = table.name;
In this expression, _utf8 'München' is a character constant, and
constant COLLATE utf8_unicode_ci = table.name
is a query specifier which includes an explicit collation name. Read this: http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html
Third, you may want to assign a default collation to each language-specific column. Default collations are baked into indexes, so they'll help accelerate searching.
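For example, here is a sketch of such a table definition (keywords and the column sizes are placeholders; the per-column collations are the point):
CREATE TABLE keywords (
  id INT AUTO_INCREMENT PRIMARY KEY,
  turkish VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_turkish_ci,
  german  VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci,
  INDEX (turkish),
  INDEX (german)
);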
Fourth, your users will need to use an appropriate input method (keyboard mapping, etc) to present data to your application. Turkish-language users hopefully know how to type Turkish words.
I have a word list stored in MySQL, and its size is around 10k words. The column is marked as unique. However, I cannot insert both the half-width and the full-width version of a punctuation mark.
Here are some examples:
(half-width, full-width)
('?', '？')
('/', '／')
The purpose is that I have many articles containing both full-width and half-width characters, and I want to find out whether the articles contain these words. I use PHP to do the comparison, and it knows that '?' is different from '？'. Is there any idea how to do it in MySQL too? Or is there some way so that PHP can treat them as equal?
I use utf8_unicode_ci for the database encoding, and the column also uses utf8_unicode_ci. When I run these queries, both return the same record, '?測試':
SELECT word FROM word_list WHERE word='?測試'
SELECT word FROM word_list WHERE word='？測試'
The most likely explanation is a character-set translation issue; for example, the column you are storing the value in is defined with the latin1 character set.
But it's not necessarily the character set of the column that's causing the issue. It's a character-set conversion happening somewhere.
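One way to see what is actually stored, rather than what your client renders, is to inspect the raw bytes (a sketch; word_list and word are the names from the question):
SELECT word, HEX(word), CHARSET(word), COLLATION(word)
FROM word_list
WHERE word LIKE '%測試%';
-- HEX() shows the stored bytes; CHARSET() and COLLATION() show how MySQL interprets them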
If you aren't aware of character-set encodings, I recommend consulting the source of all knowledge: Google.
I highly recommend the two top hits for this search:
what every programmer needs to know about character encoding
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/
I have a FULLTEXT search in a table of part numbers. Some part numbers have hyphens.
The table engine is InnoDB using MySQL 5.6.
The problem I am having is that MySQL treats the hyphen (-) character as a word separator.
So I created a new MySQL charset collation whereas the hyphen is treated as a letter.
I followed this tutorial: http://dev.mysql.com/doc/refman/5.0/en/full-text-adding-collation.html
I made a test table using the syntax at the bottom of the link; however, I used the InnoDB engine. I searched for '----' and received "syntax error, unexpected '-'".
However, if I change the engine to MyISAM, I get the correct result.
How do I get this to work with the InnoDB engine?
It seems with MySQL it's one step forward and two steps back.
Edit: I found this link for 5.6 (http://dev.mysql.com/doc/refman/5.6/en/full-text-adding-collation.html), which is the same tutorial using InnoDB as the engine.
But here's my test:
create table test (a TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, FULLTEXT INDEX(a)) ENGINE=InnoDB
Added a row that is just "----"
select * from test where MATCH(a) AGAINST('----' IN BOOLEAN MODE)
syntax error, unexpected '-'
Drop the table and recreate it with the MyISAM engine:
create table test (a TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, FULLTEXT INDEX(a)) ENGINE=MyISAM
Added a row that is just "----"
select * from test where MATCH(a) AGAINST('----' IN BOOLEAN MODE)
1 result
I encountered this exact issue recently. I had previously added a custom collation per the docs, was using MyISAM, and it was working fine. Then a few weeks ago I switched to InnoDB and things stopped working. I tried:
Rebuilding my collation and A/B testing to make sure it was working
Disabling stopwords by setting innodb_ft_enable_stopword to 0
Rebuilding my full-text table and index
In the end I took a different approach since InnoDB doesn't seem to follow the same rules as MyISAM when it comes to fulltext indexing. This is a bit hacky but works for my application:
Create a special search column containing the data I need to search for. This column has a fulltext index and exists for the sole purposes of doing a fulltext search, which is still very fast on a table with millions of rows.
Search/replace all - in my search column with an unused character that is considered a "word" character. See my question here regarding this: https://dba.stackexchange.com/questions/248607/which-characters-are-considered-word-characters. Figuring out what word characters are turns out to be not so easy but here are a few that worked for me: Ω œ π µ. These characters are probably not used in the data you need to be searching but they will be recognized by the parser as searchable characters. In my case I replace - with Ω. Since I only need the row ID, it doesn't matter what the data in this column looks like to human eyes.
Revise my updates and inserts to keep the search column data and substitutions up to date. In my case this was easy since there is only one place in the application that updates this particular table. A couple of triggers could also be used to handle this:
CREATE TRIGGER update_search BEFORE UPDATE ON animals
FOR EACH ROW SET NEW.search = REPLACE(NEW.animal_name, '-', 'Ω');
CREATE TRIGGER insert_search BEFORE INSERT ON animals
FOR EACH ROW SET NEW.search = REPLACE(NEW.animal_name, '-', 'Ω');
Replace - in my search queries with Ω.
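The lookup itself then uses the substituted term (a sketch using the animals/search names from the triggers above; 'ABC-123' is a hypothetical user-entered part number, with the replacement done in the application before the query is built):
SELECT id FROM animals
WHERE MATCH(search) AGAINST('ABCΩ123' IN BOOLEAN MODE);
-- 'ABC-123' with '-' replaced by 'Ω'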
Voila. Here's a fiddle demonstrating: https://www.db-fiddle.com/f/x1WZpZP6wcqbTTvTEFFXYc/0
The above workaround might not be realistic for every application but hopefully it's useful for someone. Would be great to have a real solution to this for InnoDB.
The InnoDB FULLTEXT search is probably treating the hyphens as stopwords. So when it gets to the second hyphen, it expects a word, not a hyphen. This would explain the 'syntax error'.
It doesn't do this in MyISAM because the implementation of FULLTEXT indexes in InnoDB is quite different, and of course they were only added for InnoDB in MySQL 5.6.
What can you do about this? It seems you can influence the list of stop-words through a special table: http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_ft_user_stopword_table. This could stop MySQL from treating hyphens as stop-words.
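A sketch of that approach, following the linked docs (mydb and my_stopwords are placeholder names; the single value column is the layout InnoDB expects):
CREATE TABLE mydb.my_stopwords (value VARCHAR(30)) ENGINE=InnoDB;
SET GLOBAL innodb_ft_user_stopword_table = 'mydb/my_stopwords';
-- an empty stopword table disables stopword filtering entirely;
-- drop and re-create the FULLTEXT index afterwards so the change takes effect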
What MySQL collation should I use for my tables to support all European languages, all Latin American languages, and maybe Chinese or other Asian languages? Thanks!
What is the rule when it comes to using indexes on MySQL table columns? When should you not use an index for a column in a table?
UTF-8 would probably be the best choice; more specifically, utf8_general_ci.
Indexes should not be added to a table that you're going to perform a huge number of insertions into. Indexes speed up SELECT queries, but they need to be updated every time you INSERT into the table. So, if you have a table that's... well, let's say it stores news articles; suitable indexes might be on the title or something else that you might want to "search" for. See the sketch below.
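For instance, a minimal sketch of such a table (articles and its columns are hypothetical):
CREATE TABLE articles (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255),
  body TEXT,
  INDEX idx_title (title)  -- speeds up WHERE title = '...' lookups
) CHARACTER SET utf8 COLLATE utf8_general_ci;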
Hope this clears some things up.
utf8, more specifically utf8_general_ci, is a universal character set...
You should not use an index when you're sure you will not search on the column (via a WHERE clause).