converting from latin character set to unicode - php

I am trying to change the character set from latin1 to utf8.
Problem: passwords containing French (accented) characters no longer work. Passwords with other special characters (quotes, brackets, dollar sign, etc.) work fine. If I switch the character set in the code back to latin1, I can log in with the French characters, but not with utf8.
What I have done so far:
Changed the character set of the database; all column types now show as utf8. I ran the conversion query at both the database and the table level.
Changed the character set in the code to utf8.
My testing shows everything else is fine: accented French characters display correctly and nothing seems broken. It is only the passwords that are giving me trouble.
Please suggest:
Do I need to convert the data itself to utf8 as well?
I ran the ALTER TABLE command and it changed the column character set to utf8; am I missing something here?
I suspect this is the cause, because the passwords work fine if I switch the code back to latin1. My thinking: while both the code and the database were latin1, the accented characters were interpreted consistently, but after the switch to utf8 the French letters no longer match because they were originally stored as latin1.
Both PHP and MySQL are the latest versions.
Since my response was long, I decided to add it here:
The hashing function is quite complex; it uses a combination of md5(), base64 encoding, and crypt(). I have noticed that the resulting password hash is different for latin1 and UTF-8 input, which is why I suspect that hashes generated under latin1 no longer match once the password arrives as UTF-8 after the conversion. Again, this only happens for French letters, not for the ASCII range 0 to 127. I am not sure how to handle this so that existing users can still log in after the character set is changed to UTF-8. I can't simply use iconv(), because I have no way to tell whether a given password hash was created from latin1 or UTF-8 input. Do I need to convert the data as well as the database settings, and how? If I am thinking about this correctly, converting the data to UTF-8 should take care of the French characters as well?
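For illustration, the mismatch can be reproduced with plain md5() standing in for the site's real md5/base64/crypt combination (a simplification, not the actual hashing code):

// The same visible password in the two encodings.
$utf8   = "pass\xC3\xA9";                         // "passé" as UTF-8 ('é' = 0xC3 0xA9)
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);    // "passé" as latin1 ('é' = 0xE9)

// Hash functions see bytes, not characters, so the digests differ.
var_dump(md5($utf8) === md5($latin1));            // bool(false)

// Pure ASCII (0-127) has identical bytes in both encodings, so those passwords still match.
$ascii = "password";
var_dump($ascii === iconv('UTF-8', 'ISO-8859-1', $ascii));   // bool(true)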

If you need to convert a string from one character set to another,
you can use this function:
iconv()
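A minimal example of such a conversion (the value here is just an illustrative latin1 string; it could equally be a value read from a not-yet-converted column):

// A latin1 (ISO-8859-1) string: 'é' is stored as the single byte 0xE9.
$latin1 = "Caf\xE9";

$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);    // 'é' becomes the two bytes 0xC3 0xA9
echo $utf8;                                       // Café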

Related

Special characters in mySQL (and php) - THE BASICS

I am confused! Recently my web host updated PHP, and now my old tables render special characters differently (wrongly).
Both my tables and my input/output PHP pages are set to utf-8, and since this update the input from PHP is treated differently as well; my special characters are now being utf-8-encoded as they enter the database. So when I review tables in phpMyAdmin, the old inserts show the original (non-encoded) special characters, while the new posts show utf-8-encoded characters.
What I would like to do is rewrite input and output to insert and show non-encoded characters, but I am not sure if this is possible without skipping utf-8 entirely (in PHP and MySQL). Is there a utf-8 way to submit non-encoded characters?
AND, perhaps more fundamentally, I need to understand what the possible downsides are. I am using Danish characters in and out, and I am not going to use any other language (for this project). So if it IS possible to insert and output non-encoded characters using utf-8, am I going to run into unexpected/destructive issues?
I have read a lot of posts about PHP/MySQL/special characters, but I haven't seen this angle on the issue yet. I hope I am not duplicating.
I hope not, because it has been working very nicely until the update.
Even if you are using only Danish characters, you may as well go utf8 all the way.
There are many places where the encoding needs to be stated:
The <meta charset="UTF-8"> tag at the top of the HTML.
The columns in the database (column CHARACTER SET defaults from table, which defaults from database)
The encoding in your PHP code.
When you CREATE TABLE, tack on DEFAULT CHARACTER SET utf8. If you have existing tables without that, speak up; we may need to deal with them.
If you want Danish collation, then specify COLLATE utf8_danish_ci, too. Then (if I recall correctly) aa will sort after z.
(The default is utf8_general_ci, which won't do that sorting.)
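A sketch of what those statements could look like (assuming an existing mysqli connection in $mysqli; the posts table is only a placeholder):

// New table: state the character set and the Danish collation up front.
$mysqli->query("CREATE TABLE posts (
                    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
                    body VARCHAR(255)
                ) DEFAULT CHARACTER SET utf8 COLLATE utf8_danish_ci");

// Existing table: convert both the definition and the stored data in one go.
$mysqli->query("ALTER TABLE posts CONVERT TO CHARACTER SET utf8 COLLATE utf8_danish_ci");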
Figure out what encoding you have (or can get) in your PHP code. If you have some text with accents in it, do this:
$hex = unpack('H*', $text);
echo implode('', $hex);
If you have utf8, å will be C3A5; for latin1 it will be E5.
Regardless of what encoding is in the tables, you must call set_charset('utf8') or set_charset('latin1') depending on what encoding is in the data in PHP. MySQL will gladly transcode between latin1 and utf8 as values pass between PHP and MySQL. For the different APIs:
⚈ mysql: mysql_set_charset('utf8');
⚈ mysqli: $mysqli_obj->set_charset('utf8');
⚈ PDO: $db = new PDO('mysql:host=host;dbname=db;charset=utf8', $user, $pwd);
For much more info, see http://mysql.rjweb.org/doc.php/charcoll .
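Putting this together with mysqli, a minimal round trip might look like the following sketch (host, credentials and the posts table are placeholders; the essential parts are the charset header and the set_charset() call):

header('Content-Type: text/html; charset=UTF-8');

$mysqli = new mysqli('localhost', 'user', 'pwd', 'db');
$mysqli->set_charset('utf8');                     // must match the encoding of your PHP strings

// Store a Danish string; a prepared statement avoids quoting problems.
$stmt = $mysqli->prepare('INSERT INTO posts (body) VALUES (?)');
$body = 'Blåbærgrød';                             // UTF-8 in the PHP source file
$stmt->bind_param('s', $body);
$stmt->execute();

// Read it back; MySQL hands back UTF-8 because the connection charset is utf8.
$res = $mysqli->query('SELECT body FROM posts ORDER BY id DESC LIMIT 1');
$row = $res->fetch_assoc();
echo htmlspecialchars($row['body'], ENT_QUOTES, 'UTF-8');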

Character encoding MSSQL.. ISO -> Utf-8 -> Latin-1..need reversed

We are trying to migrate database content (with a PHP script).
Content has been copied into a CMS and then written to the database. The copied content could be from any character encoding scheme (e.g. ISO-...-14) and any website.
The PHP CMS is UTF-8, so a character pasted into a textbox would be converted to UTF-8 when it was POSTed, but it was then written to the database as Latin-1 (MSSQL db; both the db charset and the query charset are latin-1).
We are desperately trying to think up how this could be reversed or if it is even possible (to get it so the character is fully UTF-8) in PHP.
If we can get the logic right, we can write an extension in C++ if PHP can't handle it (which it probably can't with mb_* and iconv).
I keep getting lost in UTF-8's 1-to-4-byte character streams (i.e. 0-127 is... etc.).
Anybody got any ideas?
So far we have used PHP's ord() function to try to produce a Unicode/ASCII char ref for each char (I know ord() returns ASCII values, but it prints character numbers over 128, which I thought was weird if it is only meant to be ASCII, or maybe it repeats itself).
My thought is that the latin1 data will struggle to convert back to UTF-8 and will result in black diamonds, due to the single-byte character stream in Latin1 (ISO-...-1).
If latin1 is an 8-bit-clean encoding for your database (it is in MySQL; I don't know about MSSQL), then you don't need to do anything to reconstruct the utf-8 string. When you pull it out of your database into PHP you will get back the same bytes you put in, i.e. UTF-8.
If latin1 is not an 8-bit-clean encoding for your database then your strings are irretrievably broken. This means any characters which the database considered invalid were either dropped or replaced the moment you wrote your utf-8 string to the database. There isn't any way to recover from this.
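One way to check which of these two cases applies is to pull a value back into PHP and look at the raw bytes (the sample value below is just a stand-in for a real fetch from MSSQL):

// Stand-in for a value fetched from the database; real code would read it from MSSQL.
$value = "Caf\xC3\xA9";

echo bin2hex($value), "\n";                       // UTF-8 'é' shows up as c3a9, latin1 'é' as e9

// If the bytes came back intact, they should still validate as UTF-8:
var_dump(mb_check_encoding($value, 'UTF-8'));     // bool(true) means nothing was dropped or replaced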

How do I insert UCS-2 data with PHP PDO into MySQL?

The manual clearly states " ucs2 cannot be used as a client character set, which means that it does not work for SET NAMES or SET CHARACTER SET". So how can I insert, for example, the codepoint U+2193? I am using PHP 5.3 + PDO.
If you want to use Unicode for communicating with a MySQL server, your only option is to use UTF-8.
If you're working with UCS-2 or UTF-16 strings in PHP now, you'll have to convert them to UTF-8 before trying to store them. Also note that MySQL will give you back UTF-8 if that's what you set your client character set to, so you'll need to convert query results as well if you're committed to working with UCS-2 on the PHP side. (If you're in a position to make bigger changes, you'd likely be better off simply using UTF-8 everywhere than doing all this extra conversion.)
As for storing the codepoint U+2193, no worries: UTF-8 can represent every Unicode codepoint (in this specific case, it'd be 0xE2 0x86 0x93).
Technically, this is fudging a little, since MySQL's utf8 and ucs2 character sets only cover a subset of Unicode called the Basic Multilingual Plane (BMP). The world of Unicode charsets is expanded in MySQL 5.5 to move beyond the BMP, but you still can't use ucs2, the new utf16 or utf32 charsets as client charsets, leaving you still stuck with UTF-8.
For posterity: CREATE TABLE test (encoding VARCHAR(255) CHARACTER SET ucs2); and then INSERT INTO test VALUES (CHAR(0x2193 USING ucs2));. If I then run SELECT * FROM test I see a down arrow.
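A sketch of the PHP 5.3 + PDO side under those constraints, reusing the test table above (the DSN and credentials are placeholders): convert the UCS-2 data to UTF-8, talk to MySQL in UTF-8, and let MySQL transcode into the ucs2 column.

// charset=utf8 in the DSN sets the client character set (honoured from PHP 5.3.6 onwards).
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'pwd');

// U+2193 (downwards arrow) as UCS-2 big-endian bytes, converted to UTF-8 (0xE2 0x86 0x93).
$ucs2 = "\x21\x93";
$utf8 = mb_convert_encoding($ucs2, 'UTF-8', 'UCS-2BE');

// MySQL converts the UTF-8 value into the ucs2 column on insert.
$stmt = $pdo->prepare('INSERT INTO test (encoding) VALUES (?)');
$stmt->execute(array($utf8));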

How do browsers/PHP handle characters outside the set characterset?

I'm looking into how characters are handled that are outside of the set characterset for a page.
In this case the page is set to iso-8859-1, and the previous programmer decided to escape input using htmlentities($string, ENT_COMPAT). This is then stored in latin1 tables in MySQL.
As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed.
I did some experiments on http://floris.workingweb.nl/experiments/characters.php and it seems that for stuff inside Latin1 some characters are escaped, but for example with a Czech name they are not.
Is this because those characters are outside of Latin1? If so, then the htmlentities can be removed, as it doesn't help for stuff outside of Latin1 anyway, and for within Latin1 it is not needed as far as I can see now...
htmlentities() only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list) and leaves the rest as is. So you're right: using it for non-latin data makes no sense. Moreover, HTML-encoding database entries and using latin1 are both bad ideas, and I'd suggest getting rid of both.
A word of warning: after removing htmlentities(), remember that you still need to escape quotes for the data you're going to insert in DB (mysql_escape_string or similar).
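A sketch of storing raw text and escaping only on output, using mysqli prepared statements in place of manual escaping (a substitution on my part; the connection details and the comments table are placeholders):

$mysqli = new mysqli('localhost', 'user', 'pwd', 'db');
$mysqli->set_charset('utf8');                     // or 'latin1' while the page is still ISO-8859-1

// Store the text as-is; the prepared statement takes care of quoting.
$stmt = $mysqli->prepare('INSERT INTO comments (body) VALUES (?)');
$stmt->bind_param('s', $_POST['body']);
$stmt->execute();

// Escape only when rendering HTML, passing the page's actual charset.
echo htmlspecialchars($_POST['body'], ENT_QUOTES, 'UTF-8');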
He could have used it as a basic safety precaution, i.e. to prevent users from inserting HTML/JavaScript into the input (because < and > will be escaped as well).
By the way, if you want to support Eastern and Western European languages, I would suggest using UTF-8 as the default character encoding.
Yes
though not because the Czech characters are outside of Latin1, but because they occupy the same positions in the table, so the database treats them as the corresponding latin1 characters.
Using htmlentities() is always bad; the only proper solution for storing different languages is to use the UTF-8 charset.
Take note that htmlentities() / htmlspecialchars() have a third parameter (since PHP 4.1.0) for the charset. ISO-8859-1 is the default, so if you apply htmlentities() without that third parameter to a UTF-8 string, for example, the output will be corrupted.
You can detect and convert the input string with mb_detect_encoding() and mb_convert_encoding() to make sure it matches the desired charset.
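For example, a sketch of normalising input to UTF-8 before encoding entities (the list of candidate encodings is an assumption):

$input = $_POST['name'];

// Guess the source encoding, then normalise to UTF-8.
$from = mb_detect_encoding($input, array('UTF-8', 'ISO-8859-1'), true);
$utf8 = mb_convert_encoding($input, 'UTF-8', $from ?: 'ISO-8859-1');

// Pass the charset explicitly so the entities are computed on the right byte sequences.
echo htmlentities($utf8, ENT_COMPAT, 'UTF-8');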

PostgreSQL + PHP + UTF8 = invalid byte sequence for encoding

I'm migrating a db from MySQL to PostgreSQL. The MySQL db's default collation is UTF8, Postgres is also using UTF8, and I'm escaping the data with pg_escape_string(). For whatever reason, however, I'm running into some funky errors about bad encoding:
pg_query() [function.pg-query]: Query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xeb7374
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
I've been poking around trying to figure this out, and noticed that PHP is doing something weird: if a string has only ASCII chars in it (e.g. "hello"), the detected encoding is ASCII; if the string contains any non-ASCII chars, it says the encoding is UTF8 (e.g. "Hëllo").
When I use utf8_encode() on strings that are already UTF8, it mangles the special chars, so... what can I do to get this to work?
(the exact char hanging it up right now is "�", but instead of just search/replace, i'd like to find a better solution so this kinda problem doesn't happen again)
Most likely, the data in your MySQL database isn't UTF8. It's a pretty common scenario. MySQL at least used to not do any proper validation at all on the data, so it accepted anything you threw at it as UTF8 as long as your client claimed it was UTF8. They may have fixed that by now (or not, I don't know if they even consider it a problem), but you may already have incorrectly encoded data in the db. PostgreSQL, of course, performs full validation when you load it, and thus it may fail.
You may want to feed the data through something like iconv, which can be set to ignore unknown characters or to transform them to a "best guess".
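A sketch of that clean-up step before the INSERT (the 0xEB byte below is the same one from the error message; the items table and connection details are placeholders):

$conn  = pg_connect('host=localhost dbname=target user=user password=pwd');
$dirty = "b\xEBstia";                             // contains 0xEB 0x73 0x74, the sequence from the error (latin1 "ëst")

// If the text is supposed to be UTF-8, drop whatever is not valid UTF-8:
$clean = iconv('UTF-8', 'UTF-8//IGNORE', $dirty);

// If the source is really latin1 (as the lone 0xEB byte suggests), convert it instead:
// $clean = iconv('ISO-8859-1', 'UTF-8', $dirty);

pg_query($conn, "INSERT INTO items (title) VALUES ('" . pg_escape_string($conn, $clean) . "')");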
BTW, an ASCII string is exactly the same in UTF-8, because the two encodings share the same first 128 characters; "Hello" in ASCII is byte-for-byte the same as "Hello" in UTF-8, so no conversion is needed.
The collation on the table may be UTF-8, but you may not be fetching the information in the same encoding. If you're having trouble with the data you pass to pg_escape_string(), it's probably because you're assuming the content fetched from MySQL is encoded in UTF-8 when it isn't. I suggest you look at the page on connection character sets in the MySQL documentation and check the encoding of your connection: you're probably fetching from a table whose collation is UTF-8, but your connection is something like Latin-1 (in which special characters such as çéèêöà won't be encoded as UTF-8).
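A sketch of forcing the MySQL side of the migration to hand you UTF-8 before anything reaches pg_escape_string() (connection details and the items table are placeholders; the old mysql_* API is shown because the question uses the procedural pg_* style):

$my = mysql_connect('localhost', 'user', 'pwd');
mysql_select_db('source_db', $my);
mysql_set_charset('utf8', $my);                   // have MySQL transcode latin1 columns to UTF-8

$res = mysql_query('SELECT id, title FROM items', $my);
while ($row = mysql_fetch_assoc($res)) {
    // $row['title'] now arrives as UTF-8, so pg_escape_string() receives valid input.
}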
