How can I sort strings in multiple languages? - php

I am currently working on a project that is translated into 18 languages, such as Russian, German, Swedish and Chinese. I am having trouble sorting country names in the different languages. For example, country names in French are sorted like this:
- États-Unis
- Éthiopie
- Afghanistan
I don't have this issue on my local server using MAMP.
My database's character set is configured as utf8 and the collation is utf8_unicode_ci. I have exactly the same configuration on the remote server.
I created a my.cnf file on my local server with the following directives so that special characters display correctly:
[mysqld]
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8
On the remote server, the my.cnf file does not contain these lines. When I tried to add them, MySQL no longer recognised special characters, as if it were interpreting them as latin1.
I checked collation_database and all character_set variables but they are all set as utf8 / utf8_unicode_ci.
Here is the SQL code for the creation of the table :
CREATE TABLE esth_countries (
country_id varchar(2) COLLATE utf8_unicode_ci NOT NULL,
name varchar(100) COLLATE utf8_unicode_ci NOT NULL,
region varchar(40) COLLATE utf8_unicode_ci NOT NULL,
language_id varchar(2) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (country_id,language_id),
KEY language_id (language_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Special characters are correctly displayed on my remote server. The only problem concerns sorting with the ORDER BY clause.
It seems like there is something wrong with the remote server's configuration, but I can't figure out what.
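For reference, the order the asker expects is what a Unicode collation such as utf8_unicode_ci produces: accents are ignored at the primary comparison level, so "États-Unis" sorts among the E's, not after Z. This rough Python sketch (stdlib only, not MySQL's actual collation tables) approximates that behaviour:

```python
import unicodedata

def accent_insensitive_key(s):
    # Decompose accented characters (É -> E + combining acute accent) and
    # drop the combining marks, then casefold -- a crude approximation of
    # the primary-strength comparison a *_unicode_ci collation performs.
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c)).casefold()

names = ['États-Unis', 'Éthiopie', 'Afghanistan']
print(sorted(names, key=accent_insensitive_key))
# -> ['Afghanistan', 'États-Unis', 'Éthiopie']
```

A plain bytewise sort of the UTF-8 strings would instead place both É names after "Afghanistan", which is the kind of mismatch seen when the server falls back to a non-Unicode collation.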

Related

MySQL Character Set & Select Query Performance in stored procedure

Recently I noticed that a few queries are taking a very long time to execute. I checked further and found that the MySQL optimizer is applying COLLATE in the WHERE clause, and that is causing the performance issue. If I run the query below without COLLATE, I get a quick response from the database:
SELECT notification_id FROM notification
WHERE ref_table = 2
AND ref_id = NAME_CONST('v_wall_detail_id',_utf8mb4'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE 'utf8mb4_unicode_ci')
MySQL version 5.7
Database Character Set: utf8mb4
Column Character set: UTF8
Column Data Type: CHAR(36) UUID
From PHP in Connection object passing: utf8mb4
Index is applied
This query is written in MySQL stored procedure
SHOW CREATE TABLE
CREATE TABLE `notification` (
`notification_id` CHAR(36) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`title` VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`created` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`notification_id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8mb4
SHOW VARIABLES LIKE 'coll%';
collation_connection utf8_general_ci
collation_database utf8mb4_unicode_ci
collation_server latin1_swedish_ci
SHOW VARIABLES LIKE 'char%';
character_set_client utf8
character_set_connection utf8
character_set_results utf8
character_set_database utf8mb4
character_set_server latin1
character_set_system utf8
Any suggestions on what improvements are needed to make my queries faster?
The table's character set is utf8, so I guess its collation is one of utf8_general_ci or utf8_unicode_ci. You can check this way:
SELECT collation_name from INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = '...your schema...' AND table_name = 'notification'
AND column_name = 'ref_id';
You are forcing it to compare to a string with a utf8mb4 charset and collation. An index is a sorted data structure, and the sort order depends on the collation of the column. Using that index means taking advantage of the sort order to look up values rapidly, without examining every row.
When you compare the column to a string with a different collation, MySQL cannot infer that the sort order or string equivalence of your UUID constant is compatible, so it must do the string comparison the hard way, row by row.
This is not a bug; this is the intended way for collations to work. To take advantage of the index, you must compare to a string with a compatible collation.
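The idea can be illustrated outside MySQL. An index is essentially a sorted structure searched by binary search; the search is only valid under the same ordering the structure was sorted with. This Python sketch (an analogy, not MySQL internals) shows why a different "collation" invalidates the index:

```python
import bisect

# An index is a sorted list; binary-search lookups assume the query uses
# the SAME ordering the list was sorted with (analogous to the column's
# collation, which determined the index's sort order).
values = sorted(['apple', 'Banana', 'cherry'])   # bytewise order: 'Banana' < 'apple'
i = bisect.bisect_left(values, 'cherry')
assert values[i] == 'cherry'                     # same ordering -> lookup works

# Under a case-insensitive ordering the same list is NOT sorted, so binary
# search could land in the wrong place. MySQL avoids wrong answers by
# ignoring the index and scanning row by row when collations differ.
ci_order = sorted(values, key=str.casefold)
assert ci_order != values
```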
I tested and found that the following expressions fail to use the index:
Different character set, different collation:
WHERE ref_id = _utf8mb4'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE utf8mb4_general_ci
WHERE ref_id = _utf8mb4'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE utf8mb4_unicode_ci
Same character set, different collation:
WHERE ref_id = _utf8'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE 'utf8_unicode_ci'
The following expressions successfully use the index:
Different character set, default collation:
WHERE ref_id = _utf8mb4'c37e32fc-b3b5-11ec-befc-02447a44a47c'
Same character set, same collation:
WHERE ref_id = _utf8'c37e32fc-b3b5-11ec-befc-02447a44a47c' COLLATE 'utf8_general_ci'
Same character set, default collation:
WHERE ref_id = _utf8'c37e32fc-b3b5-11ec-befc-02447a44a47c'
To simplify your environment, I recommend using just one character set and one collation in all tables and in your session. I suggest:
ALTER TABLE notification CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
This will rebuild the indexes on string columns, using the sort order for the specified collation.
Then using COLLATE utf8mb4_unicode_ci will be compatible, and will use the index.
P.S. In all cases I omitted the NAME_CONST() function, because it has no purpose in a WHERE clause as far as I know. I don't know why you are using it.
These say what character set and collation the client is talking in:
collation_connection utf8_general_ci
character_set_client, Connection,Result, System: utf8
Either change them or change the various columns to match them.
If you have stored routines, they need to be dropped, then re-CREATEd after doing SET NAMES to match what you picked.
Since you are using 5.7, I recommend using utf8mb4 and utf8mb4_unicode_520_ci throughout.

SQL collation not working

I am trying to insert some UTF-8 encoded data into a MySQL database. The special characters are displayed correctly in the browser, but not in my database. After manually changing the collation to utf8_unicode_ci and confirming that "this operation will attempt to convert your data to the new collation", the data is displayed correctly. However, if I create the table using
CREATE TABLE IF NOT EXISTS table_name (
date date NOT NULL,
searchengine VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
location VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
keyword VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
position INT NOT NULL,
competition VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
url VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL
)
and insert the data after creating the table, the data is still not shown correctly, even though the collation is utf8_unicode_ci. Any ideas on how to fix this?
A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations. A collation orders characters based on weights.
utf8mb4_unicode_ci is based on the Unicode standard for sorting and comparison, which sorts accurately in a very wide range of languages.
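To make the "weights" idea concrete, here is a toy Python sketch. The weight table below is invented for illustration (MySQL's real collations use the Unicode Collation Algorithm's weight tables); the point is only that strings are compared by weight sequences, so characters sharing a weight compare as equal:

```python
# Toy collation: map characters to weights (ASSUMED values, not MySQL's).
# In a *_ci accent-insensitive collation, 'e', 'é' and 'è' share a weight.
weights = {'a': 1, 'A': 1, 'e': 2, 'é': 2, 'è': 2}

def sort_key(s):
    # Compare strings by their sequence of weights; unknown characters
    # fall back to their code point.
    return [weights.get(c, ord(c)) for c in s]

assert sort_key('é') == sort_key('e')           # compare equal under this collation
assert sort_key('étage') == sort_key('etage')   # so ORDER BY interleaves them
```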

Searching Geoname database with non-latin characters

I have a copy of the Geonames database stored in a MySQL database, and a PHP application that allows users to search the database for their city. It works fine if they type the city name in English, but I want them to be able to search in their native language.
For example, instead of asking a Japanese speaker to search for Tokyo, they should be able to search for 東京.
The Geonames database contains an alternatenames column with, "alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)."
For example, the alternatenames value for the Tokyo row is Edo,TYO,Tochiu,Tocio,Tokija,Tokijas,Tokio,Tokió,Tokjo,Tokyo,Toquio,Toquio - dong jing,Toquio - æ±äº¬,Tòquio,Tókýó,Tóquio,TÅkyÅ,dokyo,dong jing,dong jing dou,tokeiyw,tokkiyo,tokyo,twkyw,twqyw,Τόκιο,Токио,Токё,Токіо,ÕÕ¸Õ¯Õ«Õ¸,טוקיו,توكيو,توکیو,طوكيو,ܛܘܟÜܘ,ܜܘܟÜܘ,टोकà¥à¤¯à¥‹,டோகà¯à®•à®¿à®¯à¯‹,โตเà¸à¸µà¸¢à¸§,ტáƒáƒ™áƒ˜áƒ,东京,æ±äº¬,æ±äº¬éƒ½,ë„ì¿„.
Those values don't contain 東京 exactly, but I'm guessing they contain a form of it that has been encoded or converted in some way. So I'm assuming that if I perform the same encoding/conversion on my search string, I'll be able to match the row. For example:
mysql_query( sprintf( "
SELECT * FROM geoname
WHERE
MATCH( name, asciiname, alternatenames )
AGAINST ( %s )
LIMIT 1",
iconv( 'UTF-8', 'ASCII', '東京' )
) );
The problem is that I don't know what that conversion would be. I've tried lots of combinations of iconv(), mb_convert_string(), etc, but with no luck.
The MySQL table looks like this:
CREATE TABLE `geoname` (
`geonameid` int(11) NOT NULL DEFAULT '0',
`name` varchar(200) DEFAULT NULL,
`asciiname` varchar(200) DEFAULT NULL,
`alternatenames` mediumtext,
`latitude` decimal(10,7) DEFAULT NULL,
`longitude` decimal(10,7) DEFAULT NULL,
`fclass` char(1) DEFAULT NULL,
`fcode` varchar(10) DEFAULT NULL,
`country` varchar(2) DEFAULT NULL,
`cc2` varchar(60) DEFAULT NULL,
`admin1` varchar(20) DEFAULT NULL,
`admin2` varchar(80) DEFAULT NULL,
`admin3` varchar(20) DEFAULT NULL,
`admin4` varchar(20) DEFAULT NULL,
`population` int(11) DEFAULT NULL,
`elevation` int(11) DEFAULT NULL,
`gtopo30` int(11) DEFAULT NULL,
`timezone` varchar(40) DEFAULT NULL,
`moddate` date DEFAULT NULL,
PRIMARY KEY (`geonameid`),
KEY `timezone` (`timezone`),
FULLTEXT KEY `namesearch` (`name`,`asciiname`,`alternatenames`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4
Can anyone point me in the right direction?
When I download the Japan file and set up a database like this:
CREATE TABLE geonames (
geonameid SERIAL,
name varchar(200),
asciiname varchar(200),
alternatenames varchar(10000),
latitude float,
longitude float,
featureclass varchar(1),
featurecode varchar(10),
countrycode varchar(2),
cc2 varchar(200),
admin1code varchar(20),
admin2code varchar(80),
admin3code varchar(20),
admin4code varchar(20),
population BIGINT,
elevation INT,
dem INT,
timezone varchar(40),
modificationdate DATE
) CHARSET utf8mb4;
Then I load the data like this:
LOAD DATA INFILE '/tmp/JP.txt' INTO TABLE geonames CHARACTER SET utf8mb4;
And select it like this:
SELECT alternatenames FROM geonames WHERE geonameid=1850147\G
I get this:
*************************** 1. row ***************************
alternatenames: Edo,TYO,Tochiu,Tocio,Tokija,Tokijas,Tokio,Tokió,Tokjo,Tokyo,Toquio,Toquio - dong jing,Toquio - 東京,Tòquio,Tókýó,Tóquio,Tōkyō,dokyo,dong jing,dong jing dou,tokeiyw,tokkiyo,tokyo,twkyw,twqyw,Τόκιο,Токио,Токё,Токіо,Տոկիո,טוקיו,توكيو,توکیو,طوكيو,ܛܘܟܝܘ,ܜܘܟܝܘ,टोक्यो,டோக்கியோ,โตเกียว,ტოკიო,东京,東京,東京都,도쿄
I can also do a search like this:
SELECT name FROM geonames WHERE alternatenames LIKE '%,東京,%';
Which is a long way of saying: Note the charset declaration when I created the table. I believe this is what you failed to do when you created your database.
Recommended reading:
https://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/
In terms of MySQL, what is of critical importance is the characterset of the MySQL connection. That's the characterset that MySQL Server thinks the client is using in its communication.
SHOW VARIABLES LIKE 'character_set%';
If that isn't set right, for example, the client is sending latin1 (ISO-8859-1), but MySQL server thinks it's receiving UTF8, or vice versa, there's potential for mojibake.
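That mismatch is exactly what produced the garbled alternatenames in the question: UTF-8 bytes mis-decoded as Latin-1. It is easy to reproduce in a couple of lines of Python:

```python
# UTF-8 bytes for 東京, mis-decoded as Latin-1, become the 'æ±äº¬'-style
# garbage visible in the question's alternatenames value.
garbled = '東京'.encode('utf-8').decode('latin-1')

# The damage is reversible as long as no byte was lost along the way:
restored = garbled.encode('latin-1').decode('utf-8')
assert restored == '東京'
```

So the fix is not to transform the search string to match the garbage, but to repair the connection/table character sets so the data is stored and read as UTF-8 in the first place.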
Also of importance is the characterset of the alternatenames column.
One issue when dealing with multibyte character sets is going to be the PHP sprintf function. Many of the string handling functions in PHP have "multibyte" equivalents that correctly handle strings containing multibyte characters.
https://secure.php.net/manual/en/book.mbstring.php
Unfortunately, there is no builtin mb_sprintf function.
For a more detailed description of string handling in PHP including multibyte characters/charactersets:
https://secure.php.net/manual/en/language.types.string.php#language.types.string.details
excerpt:
Ultimately, this means writing correct programs using Unicode depends on carefully avoiding functions that will not work and that most likely will corrupt the data and using instead the functions that do behave correctly, generally from the intl and mbstring extensions. However, using functions that can handle Unicode encodings is just the beginning. No matter the functions the language provides, it is essential to know the Unicode specification.
Also, a google search of "utf8 all the way through" may return some helpful notes. But be aware that this mantra is not a silver bullet or panacea to the issues.
Another possible issue, noted in the MySQL Reference Manual:
https://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html
13.9.5 Full-Text Restrictions
Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the built-in full-text parser cannot determine where words begin and end in these and other such languages.
In MySQL 5.7.6, a character-based ngram full-text parser that supports Chinese, Japanese, and Korean (CJK), and a word-based MeCab parser plugin that supports Japanese, are provided for use with InnoDB and MyISAM tables.
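The ngram parser works around the missing word delimiters by indexing every run of n consecutive characters instead of whole words. A minimal Python sketch of bigram tokenization (MySQL's default ngram_token_size is 2):

```python
def ngrams(text, n=2):
    # Split text into overlapping n-character tokens, as a character-based
    # ngram full-text parser does for CJK text that has no word delimiters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams('東京都'))  # -> ['東京', '京都']
```

A search for 東京 then becomes a lookup of the bigram token 東京, which matches even though the source text was never split into words.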

Wrong character encoding with database output using laravel

I've recently started using Laravel for a project I'm working on, and I'm currently having problems displaying data from my database in the correct character encoding.
My current system consists of a separate script responsible for populating the database with data, while the Laravel project is responsible for displaying it. The view is set to display all text as UTF-8, which works, as I've successfully printed special characters in the view. Text from the database, however, is not printed as UTF-8, and special characters come out wrong. I've tried using both Eloquent models and DB::select(), but they both show the same poor result.
charset in database.php is set to utf8 while collation is set to utf8_unicode_ci.
The database table:
CREATE TABLE `RssFeedItem` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`feedId` smallint(5) unsigned NOT NULL,
`title` varchar(250) COLLATE utf8_unicode_ci NOT NULL,
`url` varchar(250) COLLATE utf8_unicode_ci NOT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`text` mediumtext COLLATE utf8_unicode_ci,
`textSha1` varchar(250) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `url` (`url`),
KEY `feedId` (`feedId`),
CONSTRAINT `RssFeedItem_ibfk_1` FOREIGN KEY (`feedId`) REFERENCES `RssFeed` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6370 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I've also set up a test page in order to see if the problem could be my database setup, but the test page prints everything just fine. The test page uses PDO to select all data, and prints it on a simple html page.
Does anyone know what the problem might be? I've tried searching around with no luck besides this link, but I haven't found anything that might help me.
I did eventually end up solving this myself. The problem was caused by the separate script responsible for populating my database with data. It was solved by running the query SET NAMES utf8 before inserting data into the database. The original data was pulled out, and then sent back in after running said query.
The reason it worked outside Laravel was simply that said query wasn't executed on my test page. If I ran the query before retrieving the data, it came out with the wrong encoding, because the query stated that the data was encoded as UTF-8 when it really wasn't.
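What the answer describes is classic double encoding, and it explains why the broken test page "worked": two wrong conversions cancel out. A Python sketch of the byte-level effect (the encodings are assumed for illustration; the exact pair depends on the server's configured character set):

```python
# Without SET NAMES: the client sends UTF-8 bytes, the server assumes they
# are latin1 and converts them to UTF-8 again -- double encoding on disk.
original = 'Bodø'
stored = original.encode('utf-8').decode('latin-1').encode('utf-8')

# A reader that makes the SAME mistake undoes it by accident, so the text
# looks fine -- which is why the misconfigured test page displayed correctly.
roundtrip = stored.decode('utf-8').encode('latin-1').decode('utf-8')
assert roundtrip == original

# A correctly configured reader sees the stored garbage instead:
assert stored.decode('utf-8') != original   # 'BodÃ¸'
```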

Why is unicode not working in my MySQL table?

I have a MySQL DB table where I store addresses, including Norwegain addresses.
CREATE TABLE IF NOT EXISTS `addresses` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`street1` varchar(50) COLLATE utf8_danish_ci NOT NULL,
`street2` varchar(50) COLLATE utf8_danish_ci DEFAULT 'NULL',
`zipcode` varchar(10) COLLATE utf8_danish_ci NOT NULL,
`city` varchar(30) COLLATE utf8_danish_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `index_store` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
Now, this table was fine until I screwed up and accidentally set all cities = 'test'. Luckily I had another table called helper_zipcode. This table contains all zipcodes and cities for Norway.
So I updated the addresses table with data from helper_zipcode.
Unfortunately, in the front end, cities like Bodø now show up like Bod�.
All æ ø å are now shown as � � � (but they look fine in the DB).
I'm using HTML 5, so my header looks like this:
<!DOCTYPE HTML>
<head>
<meta charset = "utf-8" />
(...)
This is not the first time I've struggled with unicode.
What is the secret for storing unicode characters (from Europe) in the DB and displaying them the same way when retrieved from the DB?
From the MySQL docs:

Posted by lorenz pressler on May 2 2006 12:46pm

If you get data via PHP from your MySQL DB (everything UTF-8) but still get '?' for some special characters in your browser (<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />), try this: after mysql_connect() and mysql_select_db(), add this line:
mysql_query("SET NAMES utf8");
Worked for me. I tried first with utf8_encode, but this only worked for äüöéè... and so on, not for Cyrillic and other chars.
Is your problem storing the data in MySQL, or retrieving the stored data using PHP?
Before the first query, you need to run mysql_query("SET NAMES UTF8");.
What happens if you change your browser encoding from auto-detect to UTF-8 or Unicode? I'm trying to determine whether it's the database or the web browser that's wrong.
Alternatively, if you have a database tool for your MySQL database, does it show the right or wrong characters?
