SQL collation not working - php

I am trying to insert some utf8 encoded data into a mysql database. The special characters are correctly displayed in the browser, but not in my database. After manually changing the collation to utf8_unicode_ci and confirming that "this operation will attempt to convert your data to the new collation", the data gets displayed correctly. However if I create the table using
CREATE TABLE IF NOT EXISTS table_name (
date date NOT NULL,
searchengine VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
location VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
keyword VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
position INT NOT NULL,
competition VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
url VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL
)
and insert the data after creating the table, the data is still not shown correctly, even though the collation is utf8_unicode_ci. Any ideas on how to fix this?

A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations. A collation orders characters based on weights.
utf8mb4_unicode_ci is based on the Unicode standard for sorting and comparison, which sorts accurately in a very wide range of languages.
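One detail worth adding, since the question's columns use utf8 while the collation named above belongs to utf8mb4: in MySQL, utf8 is an alias for utf8mb3, which stores at most three bytes per character, whereas utf8mb4 allows four. Below is a minimal Python sketch for checking whether a string would fit into a utf8mb3 column. The helper name fits_utf8mb3 is made up for illustration, and it only looks at byte lengths, not at anything the server does.

```python
def fits_utf8mb3(text):
    """Return True if every character encodes to at most 3 bytes in UTF-8.

    Hypothetical helper: MySQL's utf8 (utf8mb3) cannot store characters
    that need 4 bytes, such as most emoji; utf8mb4 can.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


samples = ["żółw", "Привет", "😀"]
for s in samples:
    # Print each sample with its per-character byte lengths.
    print(s, [len(ch.encode("utf-8")) for ch in s], fits_utf8mb3(s))
```

If the data can contain four-byte characters, declaring the columns and the connection as utf8mb4 avoids that limit.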

Related

MySQL - searching for words in a table written in Cyrillic or, for example, in Arabic and Polish words written without the use of diacritics

I'm creating a website design in PHP with aquarium ads - www.akwa-market.pl
I wanted the site to be international.
I have created a database and tables, one of them is with classifieds entries. I also have a built-in word search.
Suppose I typed in the polish word turtle - żółw
I run the command:
SELECT * FROM table_name WHERE item REGEXP '[[:<:]]{$keyword}[[:>:]]';
or:
SELECT * FROM table_name WHERE item REGEXP '{$keyword}';
and find a record in the database that contains that word.
The problem occurs when I enter a word in Russian, for example Привет
The search engine does not show a record with the given Russian word.
After connecting to the database, I call:
mysqli_set_charset($db, "utf8mb4");
mysqli_query($db, "SET NAMES utf8mb4");
mysqli_query($db, "SET CHARACTER SET utf8mb4");
I tried setting the collation in the database to utf8mb4_bin or to utf8mb4_unicode_ci,
but this did not help. It is still only possible to search for words containing Latin characters, including the diacritics of European languages.
I don't know how to write the query so that it can search for words in other languages, e.g. Russian, Chinese or Arabic.
I would like to use it together with REGEXP, preferably to search for several words at the same time, separated by the '|' sign, i.e. OR, where, for example, the $keyword variable takes the value:
$keyword = 'Привет | Андрей';
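One thing to watch with a value like the one above: the spaces around '|' become part of each alternative, so the pattern looks for "Привет " and " Андрей" rather than the bare words. Here is a rough Python sketch of splitting and trimming the keyword before building a whole-word alternation. It uses Python's re module purely as a stand-in; MySQL's REGEXP behaves differently depending on the server version (as far as I know, older versions match byte by byte, and 8.0.4+ uses ICU, where \b replaces [[:<:]] and [[:>:]]). The function name build_pattern is made up for illustration.

```python
import re


def build_pattern(keyword):
    """Turn 'word1 | word2' into a whole-word, case-insensitive alternation."""
    # Trim each alternative so the spaces around '|' are not matched literally.
    words = [w.strip() for w in keyword.split("|") if w.strip()]
    # Escape the words so user input cannot inject regex syntax.
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


pattern = build_pattern("Привет | Андрей")
for text in ["Привет, мир", "привет всем", "Андрейка пришёл", "hello"]:
    print(text, bool(pattern.search(text)))
```

The same trimming can be done in PHP before the value is placed into the query; the point is only that each alternative should contain the word itself and nothing else.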
P.S. Additionally, I don't know how to handle the case when someone types the Polish word for tortoise as zółw, with a plain z instead of ż, so that a record containing żółw is still found. The same goes for the word typed without any diacritics at all, e.g. zolw.
Can anyone help? The site is functioning almost normally and the server reports no errors, but the search engine is not working correctly and I don't really know how to solve it.
Sorry, but English is my second language.
Edit:
Output of SHOW CREATE TABLE, as user @ysth asked:
CREATE TABLE `items` (
`index` int(11) NOT NULL AUTO_INCREMENT,
`id` varchar(11) COLLATE utf8_unicode_ci NOT NULL,
`user_name` varchar(30) COLLATE utf8_unicode_ci DEFAULT NULL,
`user_id` varchar(30) COLLATE utf8_unicode_ci DEFAULT NULL,
`price` decimal(10,2) DEFAULT NULL,
`main_category` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`category` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`sub_category` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`description` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
`country` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`voivodeship` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`town` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`latitude` decimal(11,8) DEFAULT NULL,
`longitude` decimal(11,8) DEFAULT NULL,
`email` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`entry_start` timestamp NULL DEFAULT NULL,
`entry_expire` timestamp NULL DEFAULT NULL,
`adv_type` varchar(30) COLLATE utf8_unicode_ci NOT NULL,
`adv_status` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`adv_views` decimal(9,0) DEFAULT 0,
`users_report` text COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`index`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
As I see now, the description column, where the Russian text was saved, has a different collation (utf8mb4_bin) from the rest of the table. If that is the cause of the problem, sorry for disturbing; I will check that right now and let you know.
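For the P.S. about finding żółw when the user types zółw or zolw, one possible approach is to fold both the stored text and the search term to a diacritic-free form before comparing. A minimal Python sketch under that assumption follows; the function name fold_diacritics and the small extra mapping are made up for illustration. Note that ł has no Unicode decomposition, so it needs an explicit entry, and that this kind of folding also changes some non-Latin letters (for example Cyrillic й loses its breve), so it may only be appropriate for Latin-script input.

```python
import unicodedata

# Letters that do not decompose into base letter + combining mark.
_EXTRA = str.maketrans({"ł": "l", "Ł": "L"})


def fold_diacritics(text):
    """Return a lowercase, diacritic-free version of text for matching."""
    # NFD splits e.g. 'ż' into 'z' followed by a combining dot above.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(_EXTRA).casefold()


for typed in ["żółw", "zółw", "zolw", "ZOLW"]:
    print(typed, "->", fold_diacritics(typed))
```

On the database side, a similar effect could come from storing a folded copy of the text in a separate column and searching that column with the folded keyword.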

MySQL Character Set & Select Query Performance in stored procedure

Recently I noticed that a few queries take a very long time to execute. Looking further, I found that the MySQL optimizer applies a COLLATE in the WHERE clause, and that causes the performance issue; if I run the query below without the COLLATE, I get a quick response from the database:
SELECT notification_id FROM notification
WHERE ref_table = 2
AND ref_id = NAME_CONST('v_wall_detail_id',_utf8mb4'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE 'utf8mb4_unicode_ci')
MySQL version 5.7
Database Character Set: utf8mb4
Column Character set: UTF8
Column Data Type: CHAR(36) UUID
From PHP in Connection object passing: utf8mb4
Index is applied
This query is written in MySQL stored procedure
SHOW CREATE TABLE
CREATE TABLE `notification` (
`notification_id` CHAR(36) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`title` VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`created` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`notification_id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8mb4
SHOW VARIABLES LIKE 'coll%';
collation_connection utf8_general_ci
collation_database utf8mb4_unicode_ci
collation_server latin1_swedish_ci
SHOW VARIABLES LIKE 'char%';
character_set_client, Connection,Result, System: utf8
character_set_database utf8mb4
character_set_server latin1
character_set_system utf8
Any suggestion, what improvements are needed to make my queries faster?
The column's character set is utf8, so I guess its collation is one of utf8_general_ci or utf8_unicode_ci. You can check this way:
SELECT collation_name from INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = '...your schema...' AND table_name = 'notification'
AND column_name = 'ref_id';
You are forcing it to compare to a string with a utf8mb4 charset and collation. An index is a sorted data structure, and the sort order depends on the collation of the column. Using that index means taking advantage of the sort order to look up values rapidly, without examining every row.
When you compare the column to a string with a different collation, MySQL cannot infer that the sort order or string equivalence of your UUID constant is compatible. So it must do the string comparison the hard way, row by row.
This is not a bug, this is the intended way for collations to work. To take advantage of the index, you must compare to a string with a compatible collation.
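To make the point about sort order concrete, here is a toy Python sketch. It is not how MySQL implements indexes; a sorted list with binary search merely stands in for an index, and case-insensitive versus code point ordering stands in for two different collations. The names index_lookup, index_ci and the sample values are made up for illustration.

```python
from bisect import bisect_left


def index_lookup(sorted_keys, probe):
    """Binary search; only valid if sorted_keys is ordered by the same rule as probe."""
    pos = bisect_left(sorted_keys, probe)
    return pos < len(sorted_keys) and sorted_keys[pos] == probe


values = ["cherry", "Banana", "apple"]

# The "index" is built in case-insensitive order: apple, Banana, cherry.
index_ci = sorted(values, key=str.casefold)

# Probing with the same rule the index was built with finds the value.
same_rule = index_lookup([v.casefold() for v in index_ci], "Banana".casefold())

# Probing with code point order (where 'B' sorts before 'a') walks the
# wrong way through the list and misses a value that is actually present.
other_rule = index_lookup(index_ci, "Banana")

print(same_rule, other_rule)
```

Because a lookup under a different ordering can miss rows that exist, a database cannot safely use the sorted structure in that case and has to fall back to comparing every row.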
I tested and found that the following expressions fail to use the index:
Different character set, different collation:
WHERE ref_id = _utf8mb4'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE utf8mb4_general_ci
WHERE ref_id = _utf8mb4'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE utf8mb4_unicode_ci
Same character set, different collation:
WHERE ref_id = _utf8'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE 'utf8_unicode_ci'
The following expressions successfully use the index:
Different character set, default collation:
WHERE ref_id = _utf8mb4'c37e32fc-b3b5-11ec-befc-02447a44a47c'
Same character set, same collation:
WHERE ref_id = _utf8'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE 'utf8_general_ci'
Same character set, default collation:
WHERE ref_id = _utf8'c37e32fc-b3b5-11ec-befc-02447a44a47c'
To simplify your environment, I recommend using just one character set and one collation in all tables and in your session. I suggest:
ALTER TABLE notification CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
This will rebuild the indexes on string columns, using the sort order for the specified collation.
Then using COLLATE utf8mb4_unicode_ci will be compatible, and will use the index.
P.S. In all cases I omitted the NAME_CONST() function, because it has no purpose in a WHERE clause as far as I know. I don't know why you are using it.
These say what the client is talking in:
collation_connection utf8_general_ci
character_set_client, Connection,Result, System: utf8
Either change them or change the various columns to match them.
If you have stored routines, they need to be dropped and re-created: drop them, run SET NAMES to match what you picked, then CREATE them again.
Since you are using 5.7, I recommend using utf8mb4 and utf8mb4_unicode_520_ci throughout.

SilverStripe duplicate database column error when running dev/build

Hello, I had to use an older version of SilverStripe for my project (SS 3.1.5). I added SS Fluent (3.2.3) to make the project multilingual. Everything works fine with the ENG, LAT, SW and DE languages, but when I add Russian (RU) and then run dev/build, I get a duplicate DB column error. I get it even if I manually remove the columns.
This is one of the error messages I get. I posted the full build list and error list here - https://github.com/tractorcow/silverstripe-fluent/issues/340
Couldn't run query: ALTER TABLE "SiteTree_Live" ADD "URLSegment_ru_RU" varchar(255) character set utf8 collate utf8_general_ci, ADD "Title_ru_RU" varchar(255) character set utf8 collate utf8_general_ci, ADD "MenuTitle_ru_RU" varchar(100) character set utf8 collate utf8_general_ci, ADD "Content_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "MetaDescription_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "ExtraMeta_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "ReportClass_ru_RU" varchar(50) character set utf8 collate utf8_general_ci, ADD index "URLSegment_ru_RU" ("URLSegment_ru_RU") Duplicate column name 'URLSegment_ru_RU'
MySQLDatabase.php:598
MySQLDatabase->databaseError(Couldn't run query: ALTER TABLE "SiteTree_Live" ADD "URLSegment_ru_RU" varchar(255) character set utf8 collate utf8_general_ci, ADD "Title_ru_RU" varchar(255) character set utf8 collate utf8_general_ci, ADD "MenuTitle_ru_RU" varchar(100) character set utf8 collate utf8_general_ci, ADD "Content_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "MetaDescription_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "ExtraMeta_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "ReportClass_ru_RU" varchar(50) character set utf8 collate utf8_general_ci, ADD index "URLSegment_ru_RU" ("URLSegment_ru_RU") | Duplicate column name 'URLSegment_ru_RU',256)
MySQLDatabase.php:150
MySQLDatabase->query(ALTER TABLE "SiteTree_Live" ADD "URLSegment_ru_RU" varchar(255) character set utf8 collate utf8_general_ci, ADD "Title_ru_RU" varchar(255) character set utf8 collate utf8_general_ci, ADD "MenuTitle_ru_RU" varchar(100) character set utf8 collate utf8_general_ci, ADD "Content_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "MetaDescription_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "ExtraMeta_ru_RU" mediumtext character set utf8 collate utf8_general_ci, ADD "ReportClass_ru_RU" varchar(50) character set utf8 collate utf8_general_ci, ADD index "URLSegment_ru_RU" ("URLSegment_ru_RU"))
MySQLDatabase.php:320
MySQLDatabase->alterTable(SiteTree_Live,Array,Array,Array,Array,,)
Database.php:223
SS_Database->endSchemaUpdate()
DatabaseAdmin.php:215
DatabaseAdmin->doBuild(,1)
DatabaseAdmin.php:100
DatabaseAdmin->build()
DatabaseAdmin.php:80
DatabaseAdmin->index(SS_HTTPRequest)
RequestHandler.php:288
RequestHandler->handleAction(SS_HTTPRequest,index)
Controller.php:194
Controller->handleAction(SS_HTTPRequest,index)
RequestHandler.php:200
RequestHandler->handleRequest(SS_HTTPRequest,DataModel)
Controller.php:153
Controller->handleRequest(SS_HTTPRequest,DataModel)
DevelopmentAdmin.php:146
DevelopmentAdmin->build(SS_HTTPRequest)
RequestHandler.php:288
RequestHandler->handleAction(SS_HTTPRequest,build)
Controller.php:194
Controller->handleAction(SS_HTTPRequest,build)
RequestHandler.php:200
RequestHandler->handleRequest(SS_HTTPRequest,DataModel)
Controller.php:153
Controller->handleRequest(SS_HTTPRequest,DataModel)
Director.php:366
Director::handleRequest(SS_HTTPRequest,Session,DataModel)
Director.php:152
Director::direct(/dev/build,DataModel)
main.php:189

How can I sort strings in multiple languages?

I am currently working on a project that is translated into 18 languages, such as Russian, German, Swedish and Chinese. I have some issues with sorting country names in different languages. For example, country names in French are sorted like this:
- États-Unis
- Éthiopie
- Afghanistan
I don't have this issue on my local server using MAMP.
My database's character set is configured as utf8 and the collation is utf8_unicode_ci. I have exactly the same configuration on the distant server.
I created a my.cnf file on my local server with the following instructions in order to correctly display special characters :
[mysqld]
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8
On the distant server, the my.cnf file does not contain these lines. When I tried to add them, MySQL no longer recognised special characters, as if it were interpreting them as latin1.
I checked collation_database and all character_set variables but they are all set as utf8 / utf8_unicode_ci.
Here is the SQL code for the creation of the table :
CREATE TABLE esth_countries (
country_id varchar(2) COLLATE utf8_unicode_ci NOT NULL,
name varchar(100) COLLATE utf8_unicode_ci NOT NULL,
region varchar(40) COLLATE utf8_unicode_ci NOT NULL,
language_id varchar(2) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (country_id,language_id),
KEY language_id (language_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Special characters are correctly displayed on my distant server. The only problem concerns sorting using ORDER BY clause.
It seems like there is something wrong with the distant server's configuration but I can't figure out what.
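As a way to see what a given ordering does to these names outside the database, here is a small Python sketch comparing plain code point order with an accent-insensitive sort key. It only approximates what a collation such as utf8_unicode_ci does and says nothing about why the two servers differ. The function name sort_key and the extra sample names (Espagne, Zambie) are added for illustration.

```python
import unicodedata


def sort_key(name):
    """Accent- and case-insensitive key, with the original as a tie-breaker."""
    # NFD splits 'É' into 'E' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), name)


countries = ["États-Unis", "Éthiopie", "Afghanistan", "Espagne", "Zambie"]

# Plain code point order: 'É' has a higher code point than 'Z'.
by_codepoint = sorted(countries)

# Accent-insensitive order: 'É' sorts together with 'E'.
by_folded = sorted(countries, key=sort_key)

print(by_codepoint)
print(by_folded)
```

If the ORDER BY result on one server resembles the first list rather than the second, that suggests the comparison there is effectively binary rather than going through the collation's weights.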

How to convert latin1_swedish_ci data into utf8_general_ci?

I have a MySQL database in which the collation of all the table fields is latin1_swedish_ci.
It already has almost 1000 records stored, and now I want to convert all this data to utf8_general_ci so that I can display content in any language. I have already altered the field collations to utf8_general_ci, but this does not convert the old records to utf8_general_ci.
One funny thing:
CONVERT TO CHARSET and the CONVERT()/CAST() approach suggested by Anshu work fine only if the data in the table is actually stored in the encoding the column declares.
If for some reason a latin1 column contains utf8 text, CONVERT() and CAST() alone will not be able to help. I had "messed up" my database with exactly that setup, so I spent a bit more time solving this.
To fix this, a few extra steps are needed in addition to the character set conversion.
The "hard one" is to recreate the database from a dump that is converted via the console.
The "simple one" is to convert row by row or table by table:
INSERT INTO UTF8_TABLE (UTF8_FIELD)
SELECT convert(cast(convert(LATIN1_FIELD using latin1) as binary) using utf8)
FROM LATIN1_TABLE;
Basically, both approaches first turn the string back into its original bytes and then read those bytes in the right encoding; a plain convert(field using encoding) on the table does not do that.
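The same round trip can be sketched in Python to show what the nested convert/cast is doing: take the mis-decoded text, recover the raw bytes by encoding it as latin1, then decode those bytes as UTF-8. This is only an illustration. The function name repair_double_encoded is made up, and Python's latin-1 codec is not identical to MySQL's latin1 (which is closer to cp1252), so real data may need a different codec.

```python
def repair_double_encoded(text):
    """Undo UTF-8 bytes that were mistakenly read as latin1."""
    # encode("latin-1") recovers the original bytes; decode("utf-8")
    # then interprets them the way they were meant to be read.
    return text.encode("latin-1").decode("utf-8")


# Simulate the damage: UTF-8 bytes of 'é' read back as latin1.
broken = "é".encode("utf-8").decode("latin-1")
fixed = repair_double_encoded(broken)

print(repr(broken), "->", repr(fixed))
```

Text that was stored correctly in the first place will usually fail this round trip with a decode error, so it is worth testing on a copy of the data before changing anything.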
Export your table.
Drop the table.
Open the export file in the editor.
Edit it manually where the table structure is created.
old query:
CREATE TABLE `message` (
`message_id` int(11) NOT NULL,
`message_thread_id` int(11) NOT NULL,
`message_from` int(11) NOT NULL,
`message_to` int(11) NOT NULL,
`message_text` longtext NOT NULL,
`message_time` varchar(50) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
new query: ( suppose you want to change message_text field. )
CREATE TABLE `message` (
`message_id` int(11) NOT NULL,
`message_thread_id` int(11) NOT NULL,
`message_from` int(11) NOT NULL,
`message_to` int(11) NOT NULL,
`message_text` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`message_time` varchar(50) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Save the file and import it back into the database.
