mysql collation and indexes

mysql collation and indexes - php

What mysql collation should I use for my tables to support all european languages, all latin american languages, and maybe chinese, maybe asian languages? Thanks!
What is the rule when it comes to using indexes on mysql table columns? When should you not use an index for a column in a table?

UTF8 would probably be the best choice, more specific; utf8_general_ci.
Indices should not be set in a table that you're going to perform a huge amount of insertions into. Indices speed up SELECT-queries, but these indices need to be rebuilt everytime you INSERT into the table. So, if you have a table that's... well, let's say it stores news articles - suitable indices might be the title or something that you might wanna "search" for.
Hope this clears some things up.

utf8
utf8-general
is universal character set...
you should not use index when you're sure you will not search for it (via WHERE clause)

Related

Compare same values stored with different encodings

This question is not a duplicate of PHP string comparison between two different types of encoding because my question requires a SQL solution, not a PHP solution.
Context ► There's a museum with two databases with the same charset and collation (engine=INNODB charset=utf8 collate=utf8_unicode_ci) used by two different PHP systems. Each PHP system stores the same data in a different way, next image is an example :
There are tons of data already stored that way and both systems are working fine, so I can't change the PHP encoding or the databases'. One system handles the sales from the box office, the other handles the sales from the website.
The problem ► I need to compare the right column (tipo_boleto_tipo) to the left column (tipo) in order to get the value in another column of the left table (unseen in image), but I'm getting no results because the same values are stored different, for example, when I search for "Niños" it is not found because it was stored as "NiÃ±os" ("children" in spanish). I tried to do it via PHP by using utf8_encode and utf8_decode but it is unacceptably slow, so I think it's better to do it with SQL only. This data will be used for a unified report of sales (box office and internet) in variable periods of time and it has to compare hundreds of thousands of rows, that's why it is so slow with PHP.
The question ► Is there anything like utf8_encode or utf8_decode in MYSQL that allows me to match the equivalent values of both columns? Any other suggestion will be welcome.
Next is my current code (with no results) :
DATABASE TABLE COLUMN
▼ ▼ ▼
SELECT boleteria.tipos_boletos.genero ◄ DESIRED COLUMN.
FROM boleteria.tipos_boletos ◄ DATABASE WITH WEIRD CHARS.
INNER JOIN venta_en_linea.ventas_detalle ◄ DATABASE WITH PROPER CHARS.
ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo
WHERE venta_en_linea.ventas_detalle.evento_id='1'
AND venta_en_linea.ventas_detalle.tipo_boleto_tipo = 'Niños'
The line ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo is never gonna work because both values are different ("Niños" vs "NiÃ±os").

It appears the application which writes to the boleteria database is not storing correct UTF-8. The database column character set refers to how MySQL interprets strings, but your application can still write in other character sets.
I can't tell from your example exactly what the incorrect character set is, but assuming it's Latin-1 you can convert it to latin1 (to make it "correct"), then convert it back to "actual" utf8:
SELECT 1
FROM tipos_boletos, ventas_detalle
WHERE CONVERT(CAST(CONVERT(tipo USING latin1) AS binary) USING utf8)
= tipo_boleto_tipo COLLATE utf8_unicode_ci
I've seen this all too often in PHP applications that weren't written carefully from the start to use UTF-8 strings. If you find the performance too slow and you need to convert frequently, and you don't have an opportunity to update the application writing the data incorrectly, you can add a new column and trigger to the tipos_boletos table and convert on the fly as records are added or edited.

Store full width and half width character in unique column of database

I have a word list stored in mysql, and the size is around 10k words. The column is marked as unique. However, I cannot insert full-width and half-width character of punctuation mark.
Here are some examples:
(half-width, full-width)
('?', '？')
('/', '／')
The purpose is that, I have many articles containing both full-width and half-width characters and want to find out if the articles contain these words. I use php to do the comparison and it can know that '?' is different than '？'. Is there any idea how to do it in mysql too? Or is there some ways so that php can make it equal?
I use utf8_unicode_ci for the database encoding, and the column is also used utf8_unicode_ci for the encoding. When I made these queries, both return the same record, '?測試'
SELECT word FROM word_list WHERE word='?測試'
SELECT word FROM word_list WHERE word='？測試'

Most likely explanation is a characterset translation issue; for example, the column you are storing the value to is defined as latin1 characterset.
But it's not necessarily the characterset of the column that's causing the issue. It's a characterset conversion happening somewhere.
If you aren't aware of characterset encodings, I recommend consulting the source of all knowledge: google.
I highly recommend the two top hits for this search:
what every programmer needs to know about character encoding
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/

Opencart - Search regardless accent

It had been written many times already that Opencart's basic search isn't good enough .. Well, I have came across this issue:
When customer searches product in my country (Slovakia (UTF8)) he probably won't use diacritics. So he/she writes down "cucoriedka" and found nothing.
But, there is product named "čučoriedka" in database and I want it to display too, since that's what he was looking for.
Do you have an idea how to get this work? The simple the better!

I'm ignorant of Slovak, I am sorry. But the Slovak collation utf8_slovak_ci treats the Slovak letter č as distinct from c. (Do the surnames starting with Č all come after those starting with C in your telephone directories? They probably do. The creators of MySQL certainly think they do.)
The collation utf8_general_ci treats č and c the same. Here's a sql fiddle demonstrating all this. http://sqlfiddle.com/#!9/46c0e/1/0
If you change the collation of the column containing your product name to utf8_general_ci, you will get a more search-friendly table. Suppose your table is called product and the column with the name in it is called product_name. Then this SQL data-definition statement will convert the column as you require. You should look up the actual datatype of the column instead of using varchar(nnn) as I have done in this example.
alter table product modify product_name varchar(nnn) collate utf8_general_ci
If you can't alter the table, then you can change your WHERE clause to work like this, specifying the collation explicitly.
WHERE 'userInput' COLLATE utf8_general_ci = product_name
But this will be slower to search than changing the column collation.

You can use SOUNDEX() or SOUNDS LIKE function of MySQL.
These functions compare phonetics.
Accuracy of soundex is doubtful for other than English. But, it can be improved if we use it like
select soundex('ball')=soundex('boll') from dual
SOUNDS LIKE can also be used.
Using combination of both SOUNDEX() and SOUNDS LIKE will improve accuracy.
Kindly refer MySQL documentation for details OR mysql-sounds-like-and-soundex

Which collation is preferred when sorting is not required?

I have a table e.g.:
create table T1(Id int primary key auto_increment, Value text)
Value is used to store "textual" data but rows are never sorted according to the Value column.
Which collation should be prefered for Value?
Would utf8mb4_bin be a better choice or utf8mb4_general_ci?

That looks fine. I certainly wouldn't use a case-insensitive collation if it wasn't needed (as per your case) since it may result in slower queries (though I doubt it would be used for non-textual fields anyway).
You should keep in mind, however, that collation is not just for sorting, but for selection as well (e.g., the where clause). If you're only going to retrieve rows based on columns other than Value, that shouldn't matter.
In any case, I'm actually not a big fan of case-insensitive collations being done by the database itself, since I'd rather keep the database running as blindingly fast as possible, and use my own methods to handle case issues (such as an extra indexed column holding lower-cased last names, and updated with insert/update triggers to maintain consistency with the rest of the row).
Basically, I'm a Luddite :-) but nobody ever complains about how big their databases are, only about how slow.

Mysql phpMyAdmin few questions:

I am quite new to the mysql phpMyadmin environment, and I would like to have some area
1. I need a field of text that should be up to around 500 characters.
Does that have to be "TEXT" field? does it take the application to be responsible for the length ?
indexes. I understand that when I signify a field as "indexed", that means that field would have a pointer table and upon each a WHERE inclusive command, the search would be optimized by that field (log n complexity). But what happens if I signify a field as indexed after the fact ? say after it has some rows in it ? can I issue a command like "walk through all that table and index that field" ?
When I mark fields as indexed, I sometimes get them in phpMyAdmin as having the keyname
for accessing the table by the indexed field when I write php, does it take an extra effort on my side to use that keyname that is written down there at the "structure" view to use the table as indexed, or does that keyname is being used behind the scenes and I should not care about it whatsoever ?
I sometimes get the keynames referencing two or more fields altogether. The fields show one on top of the other. I don't know how it happened, but I need them to index only one field. What is going on ?
I use UTF-8 values in my db. When I created it, I think I marked it as utf8_unicode_ci, and some fields are marked as utf8_general_ci, does it matter ? Can I go back and change the whole DB definition to be utf8_general_ci ?
I think that was quite a bit,
I thank you in advance!
Ted

First, be aware that this not per se something about phpmyadmin, but more about mysql / databases.
1)
An index means that you make a list (most of the time a tree) of the values that are present. This way you can easily find the row with that/those values. This tree can be just as easily made after you insert values then before. Mind you, this means that all the "add to index" commands are put together, so not something you want to do on a "live" table with loads of entries. But you can add an index whenever you want it. Just add the index and the index will be made, either for an empty table or for a 'used' one.
2)
I don't know what you mean by this. Indexes have a name, it doesn't really matter what it is. A (primary) key is an index, but not all indexes are keys.
3)
You don't need to 'force' mysql to use a key, the optimizer knows best how and when to use keys. If your keys are correct they are used, if they are not correct they can't be used so you can't force it: in other words: don't think about it :)
4)
PHPMYADMIN makes a composite keys if you mark 2 fields as key at the same time. THis is annoying and can be wrong. If you search for 2 things at once, you can use the composite key, but if you search for the one thing, you can't. Just mark them as a key one at a time, or use the correct SQL command manually.
5)
you can change whatever you like, but I don't know what will happen with your values. Better check manually :)

If you need a field to contain 500 characters, you can do that with VARCHAR. Just set its length to 500.
You don't index field by field, you index a whole column. So it doesn't matter if the table has data in it. All the rows will be indexed.
Not a question
The indexes will be used whenever they can. You only need to worry about using the same columns that you have indexed in the WHERE section of your query. Read about it here
You can add as many columns as you wish in an index. For example, if you add columns "foo", "bar" and "ming" to an index, your database will be speed optimized for searches using those columns in the WHERE clause, in that order. Again, the link above explains it all.
I don't know. I'm 100% sure that if you use only UTF-8 values in the database, it won't matter. You can change this later though, as explained in this Stackoverflow question: How to convert an entire MySQL database characterset and collation to UTF-8?
I would recommend you scrap PHPMyAdmin for HeidiSQL though. HeidiSQL is a windows client that manages all your MySQL servers. It has lots of cool functions, like copying a table or database directly from one MySQL server to another. Try it out (it's free)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.