searching for Persian string in MySQL using laravel elequent - php

In my laravel in order to search in products title column I use the following code:
$products->where('title', 'like', '%' . $request->title . '%');
the title column is a string column and data stored in it are in Persian. Also, the database collation is UTF8_general_ci. however, when I search something some titles are found and some aren't. I need the result to find every product which contains the $request->title in their title columns.
can you help me?

Change Collation UTF8_general_ci to latin1_swedish_ci
Collations have these general characteristics:
Two different character sets cannot have the same collation.
Each character set has one collation that is the default collation. For example, the default collation for latin1 is latin1_swedish_ci. The output for SHOW CHARACTER SET indicates which collation is the default for each displayed character set.
There is a convention for collation names: They start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary).
In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.
reference here

Related

Working with SET NAMES utf8mb4 with utf8 tables

In a large system based on Mysql 5.5.57 Php 5.6.37 setup
Currently the whole system is working in utf8 including SET NAMES utf8 at the beginning of each db connection.
I need to support emojis in one of the tables so I need to switch it to utf8mb4. I don't want to switch other tables.
My question is - if I change to SET NAMES utf8mb4 for all connections (utf8 and utf8mb4) and switch the specific table only to utf8mb4 (and only write mb4 data to this table). Will the rest of the system work as before?
Can there be any issue from working with SET NAMES utf8mb4 in the utf8 tables/data/connections?
I think there should no problem using SET NAMES utf8mb4 for all connections.
(utf8mb3 is a synonym of utf8 in MySQL; I'll use the former for clarity.)
utf8mb3 is a subset of utf8mb4, so your client's bytes will be happy either way (except for Emoji, which needs utf8mb4). When the bytes get to (or come from) a column that is declared only there will be a check to verify that you are not storing Emoji or certain Chinese characters, but otherwise, it goes through with minimal fuss.
I suggest
ALTER TABLE ... CONVERT TO utf8mb4
as the 'right' way to convert a table. However, it converts all varchar/text columns. This may be bad...
If you JOIN a converted table to an unconverted table, then you will be trying to compare a utf8mb3 string to a utf8mb4 string. MySQL will throw up its hands and convert all rows from one to the other. That is no INDEX will be useful.
So... Be sure to at least be consistent about any columns that are involved in JOINs.

SQL select case insensitive text but not Latex function names

I have a MySql database of latex snippets. Each snippet contains normal text and latex commands. The commands are all preceded by a backslash \ . I would like to search through these snippets such that the text is case insensitive but the commands are case insensitive. So selecting for vector gives results where the text contains either vector or Vector whereas selecting for \Vector will not return \vector.
Your question is about collations. A column in a table has a collation setting, for example utfmb4_general_ci or utf8mb4_bin. The first of those is case-insensitive, meaning a search like this
WHERE col LIKE '%vector%'
will yield rows containing both ...Vector... and ...vector...
If you use the utf8_bin (binary match, case sensitive) collation, that search excludes ...Vector...
You can specify the collation to use in a filter clause, like so:
WHERE col LIKE '%Vector%' COLLATE utf8_bin;
and that will force MySQL to use the collation you want. If you don't specificy the collation (the normal case) MySQL uses the collation specified at the time you created the table or the column.
When you specify a collation explicitly, and it's different from the column's collation, MySQL's query planner cannot use an index on the column to satisfy your query. So it might be slower. Of course the filter-clause pattern column LIKE '%value%' (with a leading %) also prevents the use the index.

PHP + MySQL + Spanish

My system deals with spanish data. I am using laravel + mysql. My database collation is latin1 - default collation and my tables structure looks something like this:
CREATE TABLE `product` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) CHARACTER SET latin1 NOT NULL,
) ENGINE=InnoDB AUTO_INCREMENT=298 DEFAULT CHARSET=utf8mb4;
Have a few questions:
I load data from file to db. Is it a good practice to
utf8_encode($name) before inserting to db? I am currently doing so,
else some comparison throw error : SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_unicode_ci,COERCIBLE) for operation '='
If using utf8_encode is the way to go, do i need to utf8_encode even name i want to search? i.e. select... where name =
utf8_encoded(name)?
Is there any flaws or better way to handle the above? As i doing spanish for the first time (characters with accents).
Your product.name column has the character set latin1. You know that. It also has the collation latin1_swedish_ci. That's the default. The original developers of MySQL are Swedish. Because you're working in Spanish, you probably want to use latin1_spanish_ci for your collation; it sorts Ñ after N. The other Latin-language collations sort them together.
Because your product.name column is stored in latin1, it is a bad, not a good, idea to use utf8_encode() on text before storing it to that column.
Your best course of action, especially if your application is new, is to make the character set for all columns utf8mb4. That means changing the defined character set of your name column. Then you can convert text strings to unicode before storing them.
You probably would be wise to make the default collation of each table utf8mb4_spanish_ci as well. Collations get baked into indexes for varchar() columns. (If you're working in traditional Spanish, in which ch is a distinct letter, use utf8mb4_spanish2_ci.)

Multi language database encoding in search engine

I have a database(Mysql) in which I store more then 100 000 keywords with keyword in different languages. So an example if I have three colums [id] [turkish (utf8_turkish_ci)] [german(utf8)]
The users could enter a german or a turkish word in the search box. If the user enters a german word all is fine so it prints out the turkish word but how to solve it with the turkish one. I ask because each language has its own additional characters like ä ü ö ş etc.
So should I use
mb_convert_encoding
to convert the string but then how to check if it is a german or turkish string I think that would be to complex. Or is the encoding of the tables wrong?
Stuck now so how to implement it so the user could enter keyword of both languages words
You have several issues to solve to make this work correctly.
First, you've chosen the utf8 character set to hold all your text. That is a good choice. If this is a new-in-2016 application, you might choose the utf8mb4 character set instead. Once you have chosen a character set your users should be able to read your text.
Second, for the sake of searching and sorting (WHERE and ORDER BY) you need to choose an appropriate collation for each language. For modern German, utf8_general_ci will work tolerably well. utf8_unicode_ci works a little better if you need standard lexical ordering. Read this. http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
For modern Spanish, you should use utf8_spanish_ci. That's because in Spanish the N and Ñ characters are not considered the same. I don't know whether the general collation works for Turkish.
Notice that you seem to have confused the notions of character set and collation in your question. You've mentioned a collation with your Turkish column and a character set with your German column.
You can explicitly specify character set and collation in queries. For example, you can write
WHERE _utf8 'München' COLLATE utf8_unicode_ci = table.name;
In this expression, _utf8 'München' is a character constant, and
constant COLLATE utf8_unicode_ci = table.name
is a query specifier which includes an explicit collation name. Read this.http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html
Third, you may want to assign a default collation to each language specific column. Default collations are baked into indexes, so they'll help accelerate searching.
Fourth, your users will need to use an appropriate input method (keyboard mapping, etc) to present data to your application. Turkish-language users hopefully know how to type Turkish words.

Store full width and half width character in unique column of database

I have a word list stored in mysql, and the size is around 10k words. The column is marked as unique. However, I cannot insert full-width and half-width character of punctuation mark.
Here are some examples:
(half-width, full-width)
('?', '?')
('/', '/')
The purpose is that, I have many articles containing both full-width and half-width characters and want to find out if the articles contain these words. I use php to do the comparison and it can know that '?' is different than '?'. Is there any idea how to do it in mysql too? Or is there some ways so that php can make it equal?
I use utf8_unicode_ci for the database encoding, and the column is also used utf8_unicode_ci for the encoding. When I made these queries, both return the same record, '?測試'
SELECT word FROM word_list WHERE word='?測試'
SELECT word FROM word_list WHERE word='?測試'
Most likely explanation is a characterset translation issue; for example, the column you are storing the value to is defined as latin1 characterset.
But it's not necessarily the characterset of the column that's causing the issue. It's a characterset conversion happening somewhere.
If you aren't aware of characterset encodings, I recommend consulting the source of all knowledge: google.
I highly recommend the two top hits for this search:
what every programmer needs to know about character encoding
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/

Categories