charset issue with accents retrieving DB2 values with php over odbc

charset issue with accents retrieving DB2 values with php over odbc - php

I'm trying to do a select from a DB2 through PHP and odbc and then save those values on a file. The OS where the code is being executed is Debian. What I do is the following:
$query = "SELECT NAME FROM DATABASE_EXAMPLE.TABLE_EXAMPLE";
$result = odbc_prepare($server, $query);
$success = odbc_execute($result);
$linias = "";
if ($success) {
while ($myRow = odbc_fetch_array($result)) {
$linias .=format_word($myRow['NAME'], 30) . "\r\n";
}
generate_file($linias);
function format_word($paraula, $longitut) {
return str_pad(utf8_encode($paraula), $longitut, " ", STR_PAD_LEFT);
}
function generate_file($linias) {
$nom_fitxer = date('YmdGis');
file_put_contents($nom_fitxer . ".tmp", $linias);
rename($nom_fitxer . '.tmp', $nom_fitxer . '.itf');
}
The problem is that some of the retrieved values contains spanish letters and accents. To make and example, one of the values is "ÁNGULO". If I var_dump the code on my browser I get the word fine, but when it's write into the file it apends weird characters on it (that's why I think there is a problem with the charset). I have tried different workarounds but it just make it worst. The file opened with Notepad++ (with UTF8 encoding enabled) looks like:
Is there a function in PHP that translate between charsets?
Edit
Following erg instructions I do further research:
The DB2 database use IBM284 charset, as I found executing the next command:
select table_schema, table_name, column_name, character_set_name from SYSIBM.COLUMNS
Firefox says the page is encoded as Unicode.
If i do:
var_dump(mb_detect_encoding($paraula));
I get bool(false) as a result.
I have changed my function for formating the word hoping that iconv resolve the conflict:
function format_word($paraula, $longitut) {
$paraula : mb_convert_encoding($paraula, 'UTF-8');
$paraula= iconv("IBM284", "UTF-8", $paraula);
return $paraula;
}
But it doesn't. Seems like the ODBC it's doing some codification bad and that is what mess the data. How can I modify the odbc to codificate to the right charset? I have seen some on Linux changing the locale, but if I execute the command locale on the PC I get:
LC_NAME="es_ES.UTF-8"
LC_ADDRESS="es_ES.UTF-8"
...

I will try to summarize from the comments into an answer:
First note that PHPs utf8_encode will convert from ISO-8859-1 to utf-8. If your database / ODBC-Driver does not return ISO-8859-1 encoded strings, PHPs utf8_encode will fail or return garbage.
The easiest solution should be to let the database / driver convert the values to correct encoding, using its CAST function: https://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_castspecification.html
Try to alter your query to let DB2 convert everything to UTF-8 directly and omit the utf8_encode call. This can be done by altering you query to something like:
SELECT CAST(NAME AS VARCHAR(255) CCSID 1208) FROM DATABASE_EXAMPLE.TABLE_EXAMPLE
Thanks to Sergei for the note about CCSID 1208 on IBM PUA. I changed CCSID UNICODE to CCSID 1208.
I do not have DB2 at hand here, so the above query is untested. I'm not sure if this will return utf-8 or utf-16..

Related

mysql & UTF8 Issue with arabic

this might look like a similar issues for utf8 and Arabic language with MySQL database but i searched for result and found none..
my database endocing is set to utf8_general_ci ,
i had my php paging to be encoded as ansi by default
the arabic language in database shows as : ãÌÑÈ
but i changed it to utf8 ,
if i add new input to database , the arabic language in database shows as : Ø²ÙŠÙ†
i dont care how it show indatabase as long as it shows normally in php page ,
after changing the php page to utf8 , when adding input than retriving it , if show result as it should .
but the old data which was added before converting the page encoding to uft8 show like this : �����
i tried a lot of methods for fixis this like using iconv in ssh and php , utf8_decode() utf8_encode() .. and more but none worked .
so i was hoping that you have a solution for me here ?
update :: Main goal was solved by retrieving data from php page in old encoding ' windows-1256' than update it from ssh .
but one issue left ::
i have some text that was inserted as 'windows-1256' and other that was inserted as 'utf-8' so now the windows encoding was converted to utf-8 and works fine , but the original utf-8 was converted as well to something unreadable , using iconv in php, with old page encoding ..
so is there a way to check what encoding is original in order to convert or not ?

Try run query set name utf8 after create a DB connection, before run any other query.
Such as :
$dbh = new PDO('mysql:dbname='.DB_NAME.';host='.DB_HOST, DB_USER, DB_PASSWORD);
$dbh->exec('set names utf8');

PHP MySQL search cyrillic characters

I would like to optimize my search sql query from cyrillic input.
If the user enters 'čšž' database return results with 'čšž' and 'csz'.
$result = mysqli_query($con,"SELECT * FROM Persons WHERE name = ?");
SQL or PHP should convert character to non cyrillic.
Any ideas?

You should use UTF8_unicode_ci collation for your column (or table or whole database) and then database should handle all that you want.

This will do the trick using iconv:
// your input
$input = "Fóø Bår";
setlocale(LC_ALL, "en_US.utf8");
// this will remove accents and output standard ascii characters
$output = iconv("utf-8", "ascii//TRANSLIT", $input);
// do whatever you want with your output, for example use it in your MySQL query
// or just "echo" it, for demonstration
echo $output;
Note that this may not work as expected when a wrong version of iconv is installed on the server.
The server is using the wrong implementation of iconv. It has the glibc version instead of the required libiconv version.
In that case, you may use this ugly hack that will work for most cases :
function stripAccents($stripAccents){
return strtr($stripAccents,'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ','aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}
Taken from this answer.

How to safelycompare UTF-8 to ISO 8859-1 (latin1) in PHP?

This might be a stupid question, but nothing seems to be working for me:
I'm having to compare values between 2 columns on 2 different databases (which I don't have access to change the values).
The encoding in db1 is UTF-8.
The encoding in db2 is latin1.
So, for example, these are the 2 values I'm comparing and should be the same in the comparison:
**db1_value** = 'Maranhão'
**db2_value** = 'Maranhão';
They display exactly the same way using utf_encode, displaying is not the issue.
I'd like to compare the variable db1_value to the field db2_value in the db, so I'm using something very simple like this:
$query = "SELECT **db2_value** FROM db2 WHERE db2_field LIKE '" . **$db1_value** . "'";
How do I convert 'Maranhão' into '**Maranhão**' before comparing?
I've tried several methods, iconv, utf8_encode, and a few others, but they make no difference to the variable. I'm just wondering if I'm taking the right approach to do this.
Appreciate any constructive comments on this.
Thanks a lot,

You need to convert not from UTF-8 but from HTML-ENTITIES into actual value
Luckily mbstring extension has such conversion available:
$latin1 = mb_convert_encoding($db1_value, "ISO-8859-1", "HTML-ENTITIES");
Here we specify the HTML-ENTITIES as the FROM charset
Then you can compare $latin1 to your $db2_value.

How to ensure all data going in and out of a database is utf-8 encoded?

I just learned about character sets today, so forgive the newb factor if this is confusing. Please ask for clarification if it's needed.
I wrote a program in php which recursively goes through the files in a folder and stores the file names in a database. The file names are then all exported from the database in json format using the json_encode($array) function.
However this function only works with UTF-8 encoded data. And since a few of the key-value pairs in the json export have the value of null, I'm lead to believe that those strings of filenames taken from the database are in fact not utf-8.
I've ensured that all the data going in and out of the the database is utf-8 by setting the defaults to utf-8 in my.cnf and restarting mysql from the command line using service mysql restart
[client]
default-character-set=utf8
[mysqld]
default-character-set = utf8
I then created my database, the table and all the columns in the table and confirmed that the database, table and all the columns are in fact utf-8
Checks if database is utf-8
SELECT default_character_set_name FROM information_schema.SCHEMATA S
WHERE schema_name = "schemaname";
Checks if table is utf-8
SELECT CCSA.character_set_name FROM information_schema.`TABLES` T,
information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA
WHERE CCSA.collation_name = T.table_collation
AND T.table_schema = "schemaname"
AND T.table_name = "tablename";
Checks if field is utf-8
SELECT character_set_name FROM information_schema.`COLUMNS` C
WHERE table_schema = "schemaname"
AND table_name = "tablename"
AND column_name = "columnname";
There's this file that has the characters –µ–ª–∫—É–Ω—á–∏–∫ in the file name. When it's stored in the database the values appear as â€“Â©â€“Âµâ€“Âªâ€“â'.
Per my database settings, are all the strings going in and out of my database utf-8?
What can I do to ensure the data I am SELECT'ing from the database is utf-8, so I can perform json_encode($array)? (NOTE: this function only works on utf-8 encoded data)

Unfortunately I don't know how you can ensure everything coming out is UTF-8 (now I'm curious too!), but a starting point would be trying this in your PHP:
$encodedNames = array();
$errors = array();
// Loop through all of the filenames
foreach($filenames as $filename)
{
// Check if it's UTF-8 encoded
if('UTF-8' === mb_detect_encoding($filename, 'UTF-8', true))
{
$encodedNames[] = $filename;
}
else
{
$errors[] = $filename;
}
}
// json_encode the UTF-8 filenames
$jsonString = json_encode($encodedNames);
// Log the other filenames here so you can deal with them later...
http://php.net/manual/en/function.mb-detect-encoding.php

how to detect and fix character encoding in a mysql database via php?

I have received this database full of people names and data in French, which means, using characters such as é,è,ö,û, etc. Around 3000 entries.
Apparently, the data inside has been encoded sometimes using utf8_encode(), and sometimes not. This result in a messed up output: at some places the characters show up fine, at others they don't.
At first i tried to track down every place in the UI where those issues arise and use utf8_decode() where necessary, but it's really not a practicable solution.
I did some testing and there is no reason to use utf8_encode in the first place, so i'd rather remove all that and just work in UTF8 everywhere - at the browser, middleware and database levels. So i need to clean the database, converting all misencoded data by its cleaned up version.
Question : would it be possible to create a function in php that would check if a utf8 string is correctly encoded (without utf8_encode) or not (with utf8_encode), and, if it was, convert it back to its original state?
In other terms: i would like to know how i could detect utf8 content that has been utf8_encode() to utf8 content that has not been utf8_encode()d.
**UPDATE: EXAMPLE **
Here is a good example: you take a string full of special chars and take a copy of that string and utf8_encode() it. The function i'm dreaming of takes both strings, leaves the first one untouched and the second string is now the same as string one.
I tried this:
$loc_fr = setlocale(LC_ALL, 'fr_BE.UTF8','fr_BE#euro', 'fr_BE', 'fr', 'fra', 'fr_FR');
$str1= "éèöûêïà ";
$str2 = utf8_encode($str1);
function convert_charset($str) {
$charset= mb_detect_encoding($str);
if( $charset=="UTF-8" ) {
return utf8_decode($str);
}
else {
return $str;
}
}
function correctString($str) {
echo "\nbefore: $str";
$str= convert_charset($str);
echo "\nafter: $str";
}
correctString($str1);
echo('<hr/>'."\n");
correctString($str2);
And that gives me:
before: éèöûêïà after: �������
before: Ã©Ã¨Ã¶Ã»ÃªÃ¯Ã after: éèöûêïà
Thanks,
Alex

It's not completely clear from the question what character-encoding lens you're currently looking through (this depends on the defaults of your text editor, browser headers, database configuration, etc), and what character-encoding transformations the data has gone through. It may be that, for example, by tweaking a database configuration everything will be corrected, and that's a lot better than making piecemeal changes to data.
It looks like it might be a problem of utf8 double-encoding, and if that's the case, both the original and the corrupted data will be in utf8, so encoding detection won't give you the information you need. The approach in that case requires making assumptions about what characters can reasonably turn up in your data: as far as PHP and Mysql are concerned "Ã©" is perfectly legal utf8, so you have to make a judgement based on what you know about the data and its authors that it must be corrupted. These are risky assumptions to make if you're just a technician. Luckily, if you know the data is in French and there's only 3000 records, it's probably ok to make those kinds of assumptions.
Below is a script that you can adapt first of all to check your data, then to correct it, and finally to check it again. All it's doing is processing a string as utf8, breaking it into characters, and comparing the characters against a whitelist of expected French characters. It signals a problem if the string is either not in utf8 or contains characters that aren't normally expected in French, for example:
PROBABLY OK Côte d'Azur
HAS NON-WHITELISTED CHAR CÃ´te d'Azur 195,180 Ã´
NON-UTF8 C�e d'Azur
Here's the script, you'll need to download the dependent unicode functions from http://hsivonen.iki.fi/php-utf8/
<?php
// Download from http://hsivonen.iki.fi/php-utf8/
require "php-utf8/utf8.inc";
$my_french_whitelist = array_merge(
range(0,127), // throw in all the lower ASCII chars
array(
0xE8, // small e-grave
0xE9, // small e-acute
0xF4, // small o-circumflex
//... Will need to add other accented chars,
// Euro sign, and whatever other chars
// are normally expected in the data.
)
);
// NB, whether this string literal is in utf8
// depends on the encoding of the text editor
// used to write the code
$str1 = "Côte d'Azur";
$test_data = array(
$str1,
utf8_encode($str1),
utf8_decode($str1),
);
foreach($test_data as $str){
$questionable_chars = non_whitelisted(
$my_french_whitelist,
$str
);
if($questionable_chars===true){
p("NON-UTF8", $str);
}else if ($questionable_chars){
p(
"HAS NON-WHITELISTED CHAR",
$str,
implode(",", $questionable_chars),
unicodeToUtf8($questionable_chars)
);
}else{
p("PROBABLY OK", $str);
}
}
function non_whitelisted($whitelist, $utf8_str){
$codepoints = utf8ToUnicode($utf8_str);
if($codepoints===false){ // has non-utf8 char
return true;
}
return array_diff(
array_unique($codepoints),
$whitelist
);
}
function p(){
$args = func_get_args();
echo implode("\t", $args), "\n";
}

I think you might be taking a more compilation approach. I received a Bulgarian database a few weeks back that was dynamically encoded in the DB, but when moving it to another database I got the funky ???
The way I solved that was by dumping the database, setting the database to utf8 collation and then importing the data as binary. This auto-converted everything to utf8 and didn't give me anymore ???.
This was in MySQL

When you connect to the database remember to always use mysql_set_charset('utf8', $db_connection);
it will fix everything, it solved all my problems.
See this: http://phpanswer.com/store-french-characters-into-mysql-db-and-display/

As you said that your data is sometimes converted using utf8_encode, your data is encoded with either UTF-8 oder ISO 8859-1 (since utf8_encode converts from ISO 8859-1 to UTF-8). And since UTF-8 encodes the characters from 128 to 255 with two bytes starting with 1100001x, you just have to test if your data is valid UTF-8 and convert it if not.
So scan all your data if it already is UTF-8 (see several is_utf8 functions) and use utf8_encode if it’s not UTF-8.

my problem is that somehow I got in my database chars like these à,é,ê in plain format or utf8 encoded. After investigation I got the conclusion that some browser (I do not know IE or FF or other) is encoding the submitted input data as there was no utf8 encoding intentionally added to handling the submit forms. So, if I would read data with utf8_encode, I'll alter the other plain chars, and vice-versa.
My solution, after I studied solutions given above:
1. I created a new database with charset utf8
2. Imported the database AFTER I changed the charset definition on CREATE TABLE statement in sql dump file from Latin.... to UTF8.
3. import data from original database
(until here maybe will be enough just to change the charset on existing db and tables, and this only if original db is not utf8)
4. update the content in database directly by replacing the utf8 encoded chars with there plain format something like
UPDATE `clients` SET `name` = REPLACE(`name`,"Ã©",'é' ) WHERE `name` LIKE CONVERT( _latin1 '%é%' USING utf8 );
I put in db class (for php code) this line to make sure that their is a UTF8 communication
$this->query('SET CHARSET UTF8');
So, ho to update? (step 4)
I've built an array with possible chars that might be encoded
$special_chars = array(
'ù','û','ü',
'ÿ',
'à','â','ä','å','æ',
'ç',
'é','è','ê','ë',
'ï','î',
'ô','','ö','ó','ø',
'ü');
I've buit an array with pairs of table,field that should be updated
$where_to_look = array(
array("table_name" , "field_name"),
..... );
than,
foreach($special_chars as $char)
{
foreach($where_to_look as $pair)
{
//$table = $pair[0]; $field = $pair[1]
$sql = "SELECT id , `" . $pair[1] . "` FROM " .$pair[0] . " WHERE `" . $pair[1] . "` LIKE CONVERT( _latin1 '%" . $char . "%' USING utf8 );";
if($db->num_rows() > 0){
$sql1 = "UPDATE " . $pair[0] . " SET `" . $pair[1] . "` = REPLACE(`" . $pair[1] . "`,CONVERT( _latin1 '" . $char . "' USING utf8 ),'" . $char . "' ) WHERE `" . $pair[1] . "` LIKE CONVERT( _latin1 '%" . $char . "%' USING utf8 )";
$db1->query($sql1);
}
}
}
The basic ideea is to use encoding features of mysql to avoid encoding done between mysql, apache, browser and back;
NOTE: I had not avaiable php functions like mb_....
Best

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

charset issue with accents retrieving DB2 values with php over odbc - php

Related

mysql & UTF8 Issue with arabic

PHP MySQL search cyrillic characters

How to safelycompare UTF-8 to ISO 8859-1 (latin1) in PHP?

How to ensure all data going in and out of a database is utf-8 encoded?

how to detect and fix character encoding in a mysql database via php?

Categories

Resources