Best way to select from database and using special characters - php

I would like to put the following question in front of you.
The database consists of various names. These names can contain 'special' characters like é, ê & alike.
My plan was to have a SEO friendly URL containing the names from the database. The creation of the link is not a big issue, just replace all the special characters with their normal characters using the normalize function or a str_replace function.
But then when I want to query the database using the SEO friend URL (after the necessary validations and checks). The DB ofcourse doesn't return any results due to the special characters.
Now comes the question, is there a way to query the database without the special characters or would it be better to have a separate column in the database where I store the name without the special characters (less need to clean them during the creation of the link and easy to query on). This will have the need to maintain two columns, but will allow me to easily query the database.
Any idea's?

Depending on the collation hôtel will be equal to hotel:
mysql> select 'hôtel' = 'hotel';
+--------------------+
| 'hôtel' = 'hotel' |
+--------------------+
| 1 |
+--------------------+
The problem you have is that you are storing html entities (hôtel) which I think is very wrong because you are mixing a presentation issue with a persistence issue. Get rid of it and choose an adequate collation to your charset then make it the database default or set it for each query.
An adequate charset for web use would be utf8. To set the database up do:
create database mydatabase character set utf8 collate utf8_unicode_ci;

The easy way is to put an ID in the URL like URL's from Stackflow.

Related

Compare same values stored with different encodings

This question is not a duplicate of PHP string comparison between two different types of encoding because my question requires a SQL solution, not a PHP solution.
Context ► There's a museum with two databases with the same charset and collation (engine=INNODB charset=utf8 collate=utf8_unicode_ci) used by two different PHP systems. Each PHP system stores the same data in a different way, next image is an example :
There are tons of data already stored that way and both systems are working fine, so I can't change the PHP encoding or the databases'. One system handles the sales from the box office, the other handles the sales from the website.
The problem ► I need to compare the right column (tipo_boleto_tipo) to the left column (tipo) in order to get the value in another column of the left table (unseen in image), but I'm getting no results because the same values are stored different, for example, when I search for "Niños" it is not found because it was stored as "Niños" ("children" in spanish). I tried to do it via PHP by using utf8_encode and utf8_decode but it is unacceptably slow, so I think it's better to do it with SQL only. This data will be used for a unified report of sales (box office and internet) in variable periods of time and it has to compare hundreds of thousands of rows, that's why it is so slow with PHP.
The question ► Is there anything like utf8_encode or utf8_decode in MYSQL that allows me to match the equivalent values of both columns? Any other suggestion will be welcome.
Next is my current code (with no results) :
DATABASE TABLE COLUMN
▼ ▼ ▼
SELECT boleteria.tipos_boletos.genero ◄ DESIRED COLUMN.
FROM boleteria.tipos_boletos ◄ DATABASE WITH WEIRD CHARS.
INNER JOIN venta_en_linea.ventas_detalle ◄ DATABASE WITH PROPER CHARS.
ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo
WHERE venta_en_linea.ventas_detalle.evento_id='1'
AND venta_en_linea.ventas_detalle.tipo_boleto_tipo = 'Niños'
The line ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo is never gonna work because both values are different ("Niños" vs "Niños").
It appears the application which writes to the boleteria database is not storing correct UTF-8. The database column character set refers to how MySQL interprets strings, but your application can still write in other character sets.
I can't tell from your example exactly what the incorrect character set is, but assuming it's Latin-1 you can convert it to latin1 (to make it "correct"), then convert it back to "actual" utf8:
SELECT 1
FROM tipos_boletos, ventas_detalle
WHERE CONVERT(CAST(CONVERT(tipo USING latin1) AS binary) USING utf8)
= tipo_boleto_tipo COLLATE utf8_unicode_ci
I've seen this all too often in PHP applications that weren't written carefully from the start to use UTF-8 strings. If you find the performance too slow and you need to convert frequently, and you don't have an opportunity to update the application writing the data incorrectly, you can add a new column and trigger to the tipos_boletos table and convert on the fly as records are added or edited.

Discover the charset from foreign database

I have mysql database (not mine). In this database all the encodings set to utf-8, and I connect with charset utf-8. But, when I try to read from the database I get this:
×¢×?ק 1
בית ×ª×•×’× ×” העוסק במספר שפות ×ª×•×›× ×”
× × ×œ× ×œ×¤× ×•×ª ×חרי 12 בלילה ..
What I supposed to get:
עסק 1
בית תוגנה העוסק במספר שפות תוכנה
נא לא לפנות אחרי 12 בלילה ..
When I look from phpmyadmin, I have the same thing(connection in pma is to utf-8).
I know that the data is supposed to be in Hebrew. Someone have an idea how to fix these?
You appear to have UTF-8 data that was treated as Windows-1252 and subsequently converted to UTF-8 (sometimes referred to as "double-encoding").
The first thing that you need to determine is at what stage the conversion took place: before the data was saved in the table, or upon your attempts to retrieve it? The easiest way is often to SELECT HEX(the_column) FROM the_table WHERE ... and manually inspect the byte-encoding as it is currently stored:
If, for the data above, you see C397C2A9... then the data is stored erroneously (an incorrect connection character set at the time of data insertion is the most common culprit); it can be corrected as follows (being careful to use data types of sufficient length in place of TEXT and BLOB as appropriate):
Undo the conversion from Windows-1252 to UTF-8 that caused the data corruption:
ALTER TABLE the_table MODIFY the_column TEXT CHARACTER SET latin1;
Drop the erroneous encoding metadata:
ALTER TABLE the_table MODIFY the_column BLOB;
Add corrected encoding metadata:
ALTER TABLE the_table MODIFY the_column TEXT CHARACTER SET utf8;
See it on sqlfiddle.
Beware to correctly insert any data in the future, or else the table will be partly encoded in one way and partly in another (which can be a nightmare to try and fix).
If you're unable to modify the database schema, the records can be transcoded to the correct encoding on-the-fly with CONVERT(BINARY CONVERT(the_column USING latin1) USING utf8) (see it on sqlfiddle), but I strongly recommended that you fix the database when possible instead of leaving it containing broken data.
However, if you see D7A2D73F... then the data is stored correctly and the corruption is taking place upon data retrieval; you will have to perform further tests to identify the exact cause. See UTF-8 all the way through for guidance.

Opencart - Search regardless accent

It had been written many times already that Opencart's basic search isn't good enough .. Well, I have came across this issue:
When customer searches product in my country (Slovakia (UTF8)) he probably won't use diacritics. So he/she writes down "cucoriedka" and found nothing.
But, there is product named "čučoriedka" in database and I want it to display too, since that's what he was looking for.
Do you have an idea how to get this work? The simple the better!
I'm ignorant of Slovak, I am sorry. But the Slovak collation utf8_slovak_ci treats the Slovak letter č as distinct from c. (Do the surnames starting with Č all come after those starting with C in your telephone directories? They probably do. The creators of MySQL certainly think they do.)
The collation utf8_general_ci treats č and c the same. Here's a sql fiddle demonstrating all this. http://sqlfiddle.com/#!9/46c0e/1/0
If you change the collation of the column containing your product name to utf8_general_ci, you will get a more search-friendly table. Suppose your table is called product and the column with the name in it is called product_name. Then this SQL data-definition statement will convert the column as you require. You should look up the actual datatype of the column instead of using varchar(nnn) as I have done in this example.
alter table product modify product_name varchar(nnn) collate utf8_general_ci
If you can't alter the table, then you can change your WHERE clause to work like this, specifying the collation explicitly.
WHERE 'userInput' COLLATE utf8_general_ci = product_name
But this will be slower to search than changing the column collation.
You can use SOUNDEX() or SOUNDS LIKE function of MySQL.
These functions compare phonetics.
Accuracy of soundex is doubtful for other than English. But, it can be improved if we use it like
select soundex('ball')=soundex('boll') from dual
SOUNDS LIKE can also be used.
Using combination of both SOUNDEX() and SOUNDS LIKE will improve accuracy.
Kindly refer MySQL documentation for details OR mysql-sounds-like-and-soundex

Issue with charset and data

I try to explain the whole problem with my poor english:
I use to save data from my application (encoded on utf8) to database using the default connection of PHP (latin1) to the tables of my DB with latin1 as charset.
That wasn't a big problem : for example the string Magnüs was stored as Magnüs, and when I recovered the data I saw correctly the string Magnüs (because the default connection, latin1).
Now, I change the connection, using the correct charset, with mysql_query("SET NAMES 'utf8'", $mydb), and I've also changed the charset of my tables's fields, so the value now is correctly store as Magnüs on DB; Then I still seeing Magnüs when I retrieve the data and I print on my Web Application.
Of course, unfortunatly, some old values now are badly printed (Magnüs is printed as Magnüs).
What I'd like to do is "to convert" these old values with the real encoding.
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8; will convert only the field type, not the data.
So, a solution (discovered on internet) should be this:
ALTER TABLE table CHANGE field field BLOB;
ALTER TABLE table CHANGE field field VARCHAR(255) CHARACTER SET utf8;
But these old string won't change on database, so neither in the Web Application when I print them.
Why? And what can I do?
Make sure that your forms are sending UTF-8 encoded text, and that the text in your table is also UTF-8 encoded.
According to the MySQL reference, the last two ALTER you mentioned do not change the column contents encoding, its more like a "reinterpretation" of the contents.
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.

Can't get the right characters to display from the database

I'm re-designing a Web site and I have a problem with the existing data base:
The database collate is set to utf8_unicode_ci and in the table row I'm calling the collate seems to be set to latin1_swedish_ci the characters store in it are Japanese (but even in phpmyadmin) you see other characters (I guess because of the latin1_swedish_ci).
When I print the result from the query I get a bunch of ??? now using
mysql_query('SET NAMES utf8');
mysql_set_charset('utf8',$conn);
Will output 2009â€N10ŒŽÂ†2009?N10???2009â€N11ŒŽÂ†2009?N11???
Any ideas?
Because the table was set to use latin1_swedish_ci, it was unable to correctly store the UTF-8 data that was entered. You need to switch that table to use utf8_unicode_ci for data going forward, but any existing data is essentially corrupted. You would have to re-enter the data after switching the collate to get the correct Japanese characters for the existing records.
You need to change the charset to utf8. The collation do not need to be changed to display japanese characters (but to be able to sort and compare texts it might be a good idea to change it to utf8_general_ci).
Hi all thanks for your reply's this is what happened, I couldn't really change anything in the DB since there's another version of the site that still uses that DB and will be up. So the solution I found was the following:
Case scenario:
The DB is set to use UTF8 -> (utf8_general_ci) but the field (at least the one's I needed where set to latin1_swedish_ci.
Solution:
After mysql_connect I put the following:
mysql_query("SET NAMES 'Shift_JIS'",$conn);
mysql_set_charset('Shift_JIS',$conn);
Then in the PHP file:
$titleJP = $row['titleJP'];
$titleJP = mb_convert_encoding($titleJP, "UTF-8", mb_detect_encoding($titleJP,"Shift_JIS,JIS,SJIS,eucjp-win"));
Now that worked perfectly the characters are displayed in correct Japanese.
I tried every other solution I could think of with no luck (utf-8_decode/encode php functions, etc.. etc..)

Categories