PHP + MySQL + Spanish - php

My system deals with spanish data. I am using laravel + mysql. My database collation is latin1 - default collation and my tables structure looks something like this:
CREATE TABLE `product` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) CHARACTER SET latin1 NOT NULL,
) ENGINE=InnoDB AUTO_INCREMENT=298 DEFAULT CHARSET=utf8mb4;
Have a few questions:
I load data from file to db. Is it a good practice to
utf8_encode($name) before inserting to db? I am currently doing so,
else some comparison throw error : SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_unicode_ci,COERCIBLE) for operation '='
If using utf8_encode is the way to go, do i need to utf8_encode even name i want to search? i.e. select... where name =
utf8_encoded(name)?
Is there any flaws or better way to handle the above? As i doing spanish for the first time (characters with accents).

Your product.name column has the character set latin1. You know that. It also has the collation latin1_swedish_ci. That's the default. The original developers of MySQL are Swedish. Because you're working in Spanish, you probably want to use latin1_spanish_ci for your collation; it sorts Ñ after N. The other Latin-language collations sort them together.
Because your product.name column is stored in latin1, it is a bad, not a good, idea to use utf8_encode() on text before storing it to that column.
Your best course of action, especially if your application is new, is to make the character set for all columns utf8mb4. That means changing the defined character set of your name column. Then you can convert text strings to unicode before storing them.
You probably would be wise to make the default collation of each table utf8mb4_spanish_ci as well. Collations get baked into indexes for varchar() columns. (If you're working in traditional Spanish, in which ch is a distinct letter, use utf8mb4_spanish2_ci.)

Related

Unequal binary char comparison when MySQL string fetched from PHP

I have some interesting issue. The following two codes do not produce the same output:
$result = $sql->QueryFetch("SELECT machinecodeSigned FROM ...");
echo bin2hex($result['machinecodeSigned']);
and
$result = $sql->QueryFetch("SELECT HEX(machinecodeSigned) FROM ...");
echo $result['machinecodeSigned'];
So, $sql is just some wrapper class and method QueryFetch internally just calls PHP standard functions for query and fetch to attain values.
I get two different results, though. For example, for some arbitrary input in my database, I get:
08c3bd79c3a0c2a66fc2bb375b6370c399c3acc3ba7bc2b8c2b203c39d70
and
08FD79E0A66FBB375B6370D9ECFA7BB8B203DD70
Ignoring case-sensitivity, the first output is nonsense while the other one is correct.
machinecodeSigned is a char(255) field that is latin-1 encoded and has collation latin-1 (which should not play a role, I assume).
What could be the reason that I get two different results? This used to yield the same results for years, but suddenly I had to change the code from version 1 to version 2 in order for it to produce the correct result. It seems, as if PHP does some arbitrary conversion of the bytes in the string.
Edit: It seems necessary to say that the field is not human-readable. In any case, since the second output is the correct one, feel free to convert the hexadecimal form to ASCII characters, if this helps you.
Edit:
SHOW CREATE TABLE yields:
CREATE TABLE `user` (
`ID` int(9) NOT NULL AUTO_INCREMENT,
`machinecodeSigned` char(255) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL
PRIMARY KEY (`ID`),
) ENGINE=InnoDB AUTO_INCREMENT=10092 DEFAULT CHARSET=latin1 COLLATE=latin1_german2_ci
char(255) CHARACTER SET latin1 COLLATE latin1_bin
will read/write bytes unchanged. It would be better to say BINARY(255), or perhaps something else.
If you tell the server that your client wants to talk in "utf8", and you SELECT that column, then MySQL will translate from latin1 (the charset of the data) to utf8 (the encoding you say the client wants). This leads to the longer hex string.
You say that phpmyadmin says "utf8" somewhere; that is probably the cause of the confusion.
If it had been stored as base64, there would be no confusion because base64 uses very few different characters, and they are encoded identically in latin1 and utf8. Furthermore, latin1_bin would have been appropriate. So, another explanation of what went wrong is the unwanted reconversion from base64 to binary.
MySQL's implementation of latin1_bin is simple and permissive -- all 256 bit values are simply stored and loaded, unchecked. This makes it virtually identical to BLOB and BINARY.
This is probably the base64_encode that should have been stored:
MDhGRDc5RTBBNjZGQkIzNzVCNjM3MEQ5RUNGQTdCQjhCMjAzREQ3MA==
Datatypes starting with VAR or ending with BLOB or TEXT are implemented via a 'length' field plus the bytes needed to represent the value.
On the other hand, CHAR and BINARY are fixed length, and padded by spaces (CHAR) or \0 (BINARY).
So, writing binary info to CHAR(255) actually may modify the data due to spaces appended.

searching for Persian string in MySQL using laravel elequent

In my laravel in order to search in products title column I use the following code:
$products->where('title', 'like', '%' . $request->title . '%');
the title column is a string column and data stored in it are in Persian. Also, the database collation is UTF8_general_ci. however, when I search something some titles are found and some aren't. I need the result to find every product which contains the $request->title in their title columns.
can you help me?
Change Collation UTF8_general_ci to latin1_swedish_ci
Collations have these general characteristics:
Two different character sets cannot have the same collation.
Each character set has one collation that is the default collation. For example, the default collation for latin1 is latin1_swedish_ci. The output for SHOW CHARACTER SET indicates which collation is the default for each displayed character set.
There is a convention for collation names: They start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary).
In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.
reference here

Getting duplicate entry errors when converting utf-8 database to utf8mb4

I'm trying to convert a database to use utf8mb4 instead of utf8. Everything is going fine except one table:
CREATE TABLE `search_terms` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`search_term` varchar(128) NOT NULL,
`time_added` timestamp NULL DEFAULT NULL,
`count` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `search_term` (`search_term`),
KEY `search_term_count` (`count`)
) ENGINE=InnoDB AUTO_INCREMENT=198981 DEFAULT CHARSET=utf8;
Basically all it does is save an entry every time somebody searches something in a form so we can track the number of searches, very simple.
There's a unique index on search_term because we want to only have one row per search term and instead increment the count value.
However when converting to utf8mb4 I am getting duplicate entry errors. Here is the command I am running:
ALTER TABLE `search_terms` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Looking in the database I can see various examples like this:
fm2012
fm2012
fm2012
In it's current utf8 character set, these are all being treated as unique and exist within the database without ever having an issue with the unique index on search_term.
But when converting to utf8mb4 they are now being considered equal and throwing an error due to that index.
I can figure out how to merge these together easily enough, but i'm concerned this may be a symptom of a greater underlying problem. I'm not really sure how this has happened or what the consequences may be, so my questions are a bit vague:
Why is utf8mb4 treating these differently to utf8?
What are the possible consequences?
Is there someway I can do a conversion so things like "fm2012" never appear in my database and I only have "fm2012" (I am also using Laravel 5.1)
Your problem is the change of collation: you're using general_ci and you're converting to unicode_ci: general_ci is quite a simple collation that doesn't know much about unicode, but unicode_ci does.
The first "f" in your example string is a "Fullwidth Latin Small Letter F" (U+FF46) which is considered equal to "Latin Small Letter F" (U+0066) by unicode_ci but not by general_ci.
Normally it's recommended to use unicode_ci exactly because of its unicode-awareness but you could convert to utf8mb4_general_ci to prevent this problem.
To prevent this problem in the future, you should normalize your input before saving it in the DB. Normally you'd use NFC, but your case seems to call for NFKC. This should bring all "equivalent" strings to the same form.
Despite what was said previously it is not about general_ci being more simplistic than unicode_ci. Yes, it can be true, but the issue is that you need to keep it matching to the sub-type you have.
For example, my database is utf8_bin. I cannot convert to utf8mb4_unicode_ci nor to utf8mb4_general_ci. These commands will throw an error of a duplicate key being found. However the correct collation utf8mb4_bin completes without issues.

Multi language database encoding in search engine

I have a database(Mysql) in which I store more then 100 000 keywords with keyword in different languages. So an example if I have three colums [id] [turkish (utf8_turkish_ci)] [german(utf8)]
The users could enter a german or a turkish word in the search box. If the user enters a german word all is fine so it prints out the turkish word but how to solve it with the turkish one. I ask because each language has its own additional characters like ä ü ö ş etc.
So should I use
mb_convert_encoding
to convert the string but then how to check if it is a german or turkish string I think that would be to complex. Or is the encoding of the tables wrong?
Stuck now so how to implement it so the user could enter keyword of both languages words
You have several issues to solve to make this work correctly.
First, you've chosen the utf8 character set to hold all your text. That is a good choice. If this is a new-in-2016 application, you might choose the utf8mb4 character set instead. Once you have chosen a character set your users should be able to read your text.
Second, for the sake of searching and sorting (WHERE and ORDER BY) you need to choose an appropriate collation for each language. For modern German, utf8_general_ci will work tolerably well. utf8_unicode_ci works a little better if you need standard lexical ordering. Read this. http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
For modern Spanish, you should use utf8_spanish_ci. That's because in Spanish the N and Ñ characters are not considered the same. I don't know whether the general collation works for Turkish.
Notice that you seem to have confused the notions of character set and collation in your question. You've mentioned a collation with your Turkish column and a character set with your German column.
You can explicitly specify character set and collation in queries. For example, you can write
WHERE _utf8 'München' COLLATE utf8_unicode_ci = table.name;
In this expression, _utf8 'München' is a character constant, and
constant COLLATE utf8_unicode_ci = table.name
is a query specifier which includes an explicit collation name. Read this.http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html
Third, you may want to assign a default collation to each language specific column. Default collations are baked into indexes, so they'll help accelerate searching.
Fourth, your users will need to use an appropriate input method (keyboard mapping, etc) to present data to your application. Turkish-language users hopefully know how to type Turkish words.

Issue with charset and data

I try to explain the whole problem with my poor english:
I use to save data from my application (encoded on utf8) to database using the default connection of PHP (latin1) to the tables of my DB with latin1 as charset.
That wasn't a big problem : for example the string Magnüs was stored as Magnüs, and when I recovered the data I saw correctly the string Magnüs (because the default connection, latin1).
Now, I change the connection, using the correct charset, with mysql_query("SET NAMES 'utf8'", $mydb), and I've also changed the charset of my tables's fields, so the value now is correctly store as Magnüs on DB; Then I still seeing Magnüs when I retrieve the data and I print on my Web Application.
Of course, unfortunatly, some old values now are badly printed (Magnüs is printed as Magnüs).
What I'd like to do is "to convert" these old values with the real encoding.
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8; will convert only the field type, not the data.
So, a solution (discovered on internet) should be this:
ALTER TABLE table CHANGE field field BLOB;
ALTER TABLE table CHANGE field field VARCHAR(255) CHARACTER SET utf8;
But these old string won't change on database, so neither in the Web Application when I print them.
Why? And what can I do?
Make sure that your forms are sending UTF-8 encoded text, and that the text in your table is also UTF-8 encoded.
According to the MySQL reference, the last two ALTER you mentioned do not change the column contents encoding, its more like a "reinterpretation" of the contents.
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.

Categories