convert latin1_swedish_ci to utf8 - php

I am importing data from an xml, and it appears they use " latin1_swedish_ci" which is causing me lots of issues with php and mysql (PDO).
I'm getting a lot of these errors:
General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='
I am wondering how I can convert them to proper UTF-8 to store in my database.
I've tried doing this:
$game['human_name'] = iconv(mb_detect_encoding($game['human_name'], mb_detect_order(), true), "UTF-8", $game['human_name']);
Which I found here: PHP: Convert any string to UTF-8 without knowing the original character set, or at least try
I still seem to get the same error though?

Don't use any conversion functions -- get utf8 specified throughout the processing.
Make sure the data is encoded utf8.
When using PDO:
$db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);
Table definitions (or at least columns):
CHARACTER SET utf8
Web page output:
<meta ... charset=UTF-8>

Related

How has this mysql string been encoded and how can I replicate it?

Here are the hex values of two strings stored in a MySQL database using two different methods.
20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5
and
E0A495E0A4BEE0A49AE0A48220E0A4B6E0A495E0A58DE0A4A8E0A58BE0A4AEE0A58DE0A4AFE0A4A4E0A58DE0A4A4E0A581E0A4AEE0A58D20E0A5A420E0A4A8E0A58BE0A4AAE0A4B9E0A4BFE0A4A8E0A4B8E0A58DE0A4A4E0A4BF20E0A4AEE0A4BEE0A4AEE0A58D20E0A5A5
They represent the string काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥. The former appears to be encoded badly, but works in the application, the latter appears encoded correctly but does not. I need to be able to create the first hex string from the input.
Here comes the long version: I've got a legacy application built in PHP/MySQL. The database connection charset is latin1. The charset of the table is utf8 (don't ask). The input is coerced into being correct utf8 via the ForceUTF8 composer library. Looking directly in the database, the stored value of this string is काचं शकà¥à¤¨à¥‹à¤®à¥à¤¯à¤¤à¥à¤¤à¥à¤®à¥ । नोपहिनसà¥à¤¤à¤¿ मामॠ॥
I am aware that this looks horrendous and appears to me to be badly encoded, however it is out of scope to fix the legacy application. The rest of the application is able to cope with this data as it is and everything else works and displays perfectly well with it.
I have created an external node application to replace the current insert routine running on Azure. I've set the connection charset to latin1, it's connecting to the same database and running the same insert statement. The only part of the puzzle I've not been able to replicate is the ForceUTF8 library as I could find no equivalent in the npm ecosystem. When the same string is inserted it renders perfectly when looking at the raw field in PHP Storm i.e. it looks exactly like the original text above, and the hex value of the string is the latter of the two presented at the top of the question. However, when viewed in the application the values are corrupted by question marks and black diamonds.
If, within the PHP application, I run SET NAMES utf8 ahead of the rendering data query then the node-inserted values render correctly, and the legacy ones now display as corrupted. Adding set names utf8 to the application for this query is not an acceptable solution since it breaks the appearance of the legacy data, and fixing the legacy data is also not an acceptable solution.
I have tried all sorts of connection charsets and various Iconv functions to make the data exactly match how the legacy app makes it but have not been able to "break it" in exactly the same way.
How can I make "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥" into a string, the hex value of which is "20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5" using some variation of database connection charset and string conversion?
I'm not familiar with PHP, but I was able to generate the "horrendous" encoding with Python (and it is horrendous...not sure how someone intentionally generated this crap). Hopefully this guides you to a solution:
import re
expected = '20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5'
original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥'
# Encode in UTF-8 w/ BOM (U+FEFF encoded in UTF-8 as a signature)
step1 = original.encode('utf-8-sig')
# Windows-1252 doesn't define some byte -> codepoint mappings and Python normally
# raises an error on those bytes. Use an error handler to keep the bytes that
# fail, then replace the escape codes with the matching Unicode codepoint.
step2 = step1.decode('cp1252',errors='backslashreplace')
step3 = re.sub(r'\\x([0-9a-f]{2})', lambda x: chr(int(x.group(1),16)), step2)
# There is an extra space before the UTF-8-encoded BOM for some reason
step4 = ' ' + step3
step5 = step4.encode('utf8')
# Format to match expected string
final = step5.hex().upper()
print(final == expected) # True
HEX('काचं') = 'E0A495E0A4BEE0A49AE0A482'
-- utf8mb4 to utf8mb4 hex
HEX(CONVERT(CONVERT(BINARY('काचं') USING latin1) USING utf8mb4)) = 'C3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A' is utf8mb4 to double-encoded
See "double-encoding" in Trouble with UTF-8 characters; what I see is not what I stored
More
"Double-encoding", as I understand it, is where utf8 bytes (up to 4 bytes per "character") are treated as latin1 (or cpnnnn) and converted to utf8, and then that happens a second time. In this case, each 3-byte Devanagari is converted twice, leading to between 6 and 9 bytes.
You explained the cause here:
The database connection charset is latin1. The charset of the table is utf8
BOM is, in my opinion, a red herring. It was intended to be a useful clue that a "text" file was encoded in UTF-8, but unfortunately, very few products generate it. Hence, BOM is more of a distraction than a help. (I don't think MySQL has any way to take care of BOM -- after all, most database activity is at the row level, not the file level.)
The solution (for the data flow) in MySQL context is to rip out all "conversion" functions and, instead, configure things so that MySQL will convert at the appropriate places. Your mention of "latin1" was the main "mis-configuration".
The long expression (HEX...) gives a clue of how to fix the data, but it must be coordinated with changes to configuration and changes to code.

Encoding problem when connecting to SQL Server database via odbc_connect()

Firstly, I don't have an option to use latest SQLSRV drivers on my host so I am stuck with odbc connection.
$connection_string = 'DRIVER={SQL Server};SERVER=111.111.111.111;DATABASE=MY_DATABASE';
$user = 'name';
$pass = 'pass';
$connection = odbc_connect( $connection_string, $user, $pass, SQL_CUR_USE_ODBC );
The collation of that database is Slovak_CI_AI. If I set my PHP header to utf-8, output data looks messed up, encoding is wrong.
If I put 'Slovak_CI_AI' as a charset to my PHP header, data displays fine, but it is probably a no go, because I need to work with that data in WordPress, which fails to process them if they contain special/non-english characters (those strings looks broken to WP).
I've tried many conversions with mb_convert_encoding, iconv or utf8_decode, but no luck. WordPress uses utf-8.
I can't find any solution for this.
Update: I've tried adding CHARSET=UTF8 to my odbc connection string, but no luck. Also I found out the character set for texts in the database is cp1250. I've tried setting cp1250 as a charset to my PHP header, output is fine but WordPress still fails once it encounters a special character. I've tried converting those strings from cp1250 to utf-8 with iconv, but no luck as well - strings have wrong encoding on output and WordPress fails as well.
This whole encoding thing still feels chaotic to me, but I somehow managed to do it. It works when:
odbc connection string contains charset=cp1250
PHP header character set is set to utf-8
I convert all problematic strings from cp1250 to utf-8 with iconv

I get an Ansi string instead of Utf-8 from Utf-8 mysql table [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 4 years ago.
When I moved from php mysql shared hosting to my own VPS I've found that code which outputs user names in UTF8 from mysql database outputs ?�??????� instead of 鬼神❗. My page has utf-8 encoding, and I have default_charset = "UTF-8" in php.ini, and header('Content-Type: text/html; charset=utf-8'); in my php file, as well as <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in html part of it.
My database has collation utf8_bin, and table has the same. On both previos and current hosting in phpmyadmin for this database record I see: 鬼神❗. When I create ANSI text file in Notepad++, paste 鬼神❗ into it and select Encoding->Encode in UTF-8 menu I see 鬼神❗, so I suppose it is correct encoded UTF-8 string.
Ok, and then I added
init_connect='SET collation_connection = utf8_general_bin'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_bin
skip-character-set-client-handshake
to my.cnf and now my page shows 鬼神❗ instead of ?�??????�. This is the same output I get in phpmyadmin on both hostings, so I'm on a right way. And still somehow on my old hosting the same php script returns utf-8 web page with name 鬼神❗ while on new hosting - 鬼神❗. It looks like the string is twice utf-8 encoded: I get utf-8 string, I give it as ansi string to Notepad++ and it encodes it in correct utf-8 string.
However when I try utf8_encode() I get й¬ÑзÒÑвÑâ, and utf8_decode() returns ?�???????. The same result return mb_convert_encoding($name,"UTF-8","ISO-8859-1"); and iconv( "ISO-8859-1","UTF-8", $name);.
So how could I reproduce the same conversion Notepad++ does?
See answer below.
The solution was simple yet not obvious for me, as I never saw my.cnf on that shared hosting: it seems that that server had settings as follows
init_connect='SET collation_connection = cp1252'
init_connect='SET NAMES cp1252'
character-set-server=cp1252
So to make no harm to other code on my new server I have to place mysql_query("SET NAMES CP1252"); on top of each php script which works with utf8 strings.
The trick here was script gets a string as is (ansi) and outputs it, and when browser is warned that page is in utf-8 encoding it just renders my strings as utf-8.

json_encode - Invalid UTF-8 sequence in argument

I had issue with json_encode. I was getting
PHP Warning: json_encode() [<a href='function.json-encode'>function.json-encode</a>]: Invalid UTF-8 sequence in argument in /var/www/html/web/example.php on line 500
Then I set magic_quotes_gpc = 0 in php.ini (it was 1 before) and it stopped showing the json_encode error.
Now, I started getting the same error again. magic_quotes_gpc is 0 in php.ini. I am using PHP 5.3
I found many answers which say to convert it to UTF-8. But I cannot do it because I am using json_encode in many places and changing all is not possible.
I would like to fix the root issue so that I don't need to change the json_encode code.
In MySQL, the result for
SHOW VARIABLES LIKE 'character_set%';
is
character_set_client latin1
character_set_connection latin1
character_set_database latin1
character_set_filesystem binary
character_set_results
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/
What is the reason for the json_encode error?
I am using zend server and I see this json_encode error in zend server logs.
Another thing I noticed is, even if I see the error in the server logs, it is properly converting the array to json.
There is no error in converting array to json. Then Why I see the error in zend server?
json_encode() needs valid UTF-8 data as input.
You are feeding it invalid data.
From what you describe, you probably are getting latin1 data from your database connection, which will cause json_encode() to choke.
Set the database's connection in your script to UTF-8. How to do that depends on the database library you are using.
Here is a list of ways to switch to UTF-8 in the most common libraries:
UTF-8 all the way through
Try to add the instruction below after the connection parameters:
mysql_set_charset('utf8');
And for the result value:
mb_convert_encoding($result,'UTF-8','UTF-8');

Dealing with eacute and other special characters using Oracle, PHP and Oci8

Hi I am trying to store names into an Oracle database and fetch them back using PHP and oci8.
However, if I insert the é directly into the Oracle database and use oci8 to fetch it back I just receive an e
Do I have to encode all special characters (including é) into html entities (ie: é) before inserting into database ... or am I missing something ?
Thx
UPDATE: Mar 1 at 18:40
found this function:
http://www.php.net/manual/en/function.utf8-decode.php#85034
function charset_decode_utf_8($string) {
if(#!ereg("[\200-\237]",$string) && #!ereg("[\241-\377]",$string)) {
return $string;
}
$string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e","'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",$string);
$string = preg_replace("/([\300-\337])([\200-\277])/e","'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",$string);
return $string;
}
seems to work, although not sure if its the optimal solution
UPDATE: Mar 8 at 15:45
Oracle's character set is ISO-8859-1.
in PHP I added:
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1");
to force the oci8 connection to use that character set.
Retrieving the é using oci8 from PHP now worked ! (for varchars, but not CLOBs had to do utf8_encode to extract it )
So then I tried saving the data from PHP to Oracle ... and it doesnt work..somewhere along the way from PHP to Oracle the é becomes a ?
UPDATE: Mar 9 at 14:47
So getting closer.
After adding the NLS_LANG variable, doing direct oci8 inserts with é works.
The problem is actually on the PHP side.
By using ExtJs framework, when submitting a form it encodes it using encodeURIComponent.
So é is sent as %C3%A9 and then re-encoded into é.
However it's length is now 2 (strlen($my_sent_value) = 2) and not 1.
And if in PHP I try: $my_sent_value == é = FALSE
I think if I am able to re-encode all these characters in PHP back into lengths of byte size 1 and then inserting them into Oracle, it should work.
Still no luck though
UPDATE: Mar 10 at 11:05
I keep thinking I am so close (yet so far away).
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9"); works very sporadicly.
I created a small php script to test:
header('Content-Type: text/plain; charset=ISO-8859-1');
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9");
$conn= oci_connect("user", "pass", "DB");
$stmt = oci_parse($conn, "UPDATE temp_tb SET string_field = '|é|'");
oci_execute($stmt, OCI_COMMIT_ON_SUCCESS);
After running this once and loggin into the Oracle Database directly I see that STRING_FIELD is set to |¿|. Obviously not what I had come to expect from my previous experience.
However, if I refresh that PHP page twice quickly.... it worked !!!
In Oracle I correctly saw |é|.
It seems like maybe the environment variable is not being correctly set or sent in time for the first execution of the script, but is available for the second execution.
My next experiment is to export the variable into PHP's environment, however, I need to reset Apache for that...so we'll see what happens, hopefully it works.
I presume you are aware of these facts:
There are many different character sets: you have to pick one and, of course, know which one you are using.
Oracle is perfectly capable of storing text without HTML entities (é). HTML entities are used in, well, HTML. Oracle is not a web browser ;-)
You must also know that HTML entities are not bind to a specific charset; on the contrary, they're used to represent characters in a charset-independent context.
You indistinctly talk about ISO-8859-1 and UTF-8. What charset do you want to use? ISO-8859-1 is easy to use but it can only store text in some latin languages (such as Spanish) and it lacks some common chars like the € symbol. UTF-8 is trickier to use but it can store all characters defined by the Unicode consortium (which include everything you'll ever need).
Once you've taken the decision, you must configure Oracle to hold data in such charset and choose an appropriate column type. E.g., VARCHAR2 is fine for plain ASCII, NVARCHAR2 is good for UTF-8.
This is what I finally ended up doing to solve this problem:
Modified the profile of the daemon running PHP to have:
NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
So that the oci8 connection uses ISO-8859-1.
Then in my PHP configuration set the default content-type to ISO-8859-1:
default_charset = "iso-8859-1"
When I am inserting into an Oracle Table via oci8 from PHP, I do:
utf8_decode($my_sent_value)
And when receiving data from Oracle, printing the variable should just work as so:
echo $my_received_value
However when sending that data over ajax I have had to use:
utf8_encode($my_received_value)
If you really cannot change the character set that oracle will use then how about Base64 encoding your data before storing it in the database. That way, you can accept characters from any character set and store them as ISO-8859-1 (because Base64 will output a subset of the ASCII character set which maps exactly to ISO-8859-1). Base64 encoding will increase the length of the string by, on average, 37%
If your data is only ever going to be displayed as HTML then you might as well store HTML entities as you suggested, but be aware that a single entity can be up to 10 characters per unencoded character e.g. ϑ is ϑ
I had to face this problem : the LatinAmerican special characters are stored as "?" or "¿" in my Oracle database ... I can't change the NLS_CHARACTER_SET because we're not the database owners.
So, I found a workaround :
1) ASP.NET code
Create a function that converts string to hexadecimal characters:
public string ConvertirStringAHex(String input)
{
Encoding encoding = System.Text.Encoding.GetEncoding("ISO-8859-1");
Byte[] stringBytes = encoding.GetBytes(input);
StringBuilder sbBytes = new StringBuilder(stringBytes.Length);
foreach (byte b in stringBytes)
{
sbBytes.AppendFormat("{0:X2}", b);
}
return sbBytes.ToString();
}
2) Apply the function above to the variable you want to encode, like this
myVariableHex = ConvertirStringZHex( myVariable );
In ORACLE, use the following:
PROCEDURE STORE_IN_TABLE( iTEXTO IN VARCHAR2 )
IS
BEGIN
INSERT INTO myTable( SPECIAL_TEXT )
VALUES ( UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW( iTEXTO ));
COMMIT;
END;
Of course, iTEXTO is the Oracle parameter which receives the value of "myVariableHex" from ASP.NET code.
Hope it helps ... if there's something to improve pls don't hesitate to post your comments.
Sources:
http://www.nullskull.com/faq/834/convert-string-to-hex-and-hex-to-string-in-net.aspx
https://forums.oracle.com/thread/44799
If you have different charsets between the server side code (php in this case) and the Oracle database, you should set server side code charset in the Oracle connection, then Oracle made the conversion.
Example: Let's assume:
php charset utf-8 (default).
Oracle charset AMERICAN_AMERICA.WE8ISO8859P1
In the connection to Oracle made by php you should set UTF8 (third parameter).
oci_pconnect("USER", "PASS", "URL"),"UTF8");
Doing this, you write code in utf-8 (not doing any conversion at all) and get utf-8 from the database through this connection.
So you could write something like SELECT * FROM SOME_TABLE WHERE TEXT = 'SOME TEXT LIKE áéíóú Ñ' and also get utf-8 text as a result.
According to the php documentation, by default, Oracle client (oci_pconnect) takes the NLS_LANG environment variable from the Operating system. Some debian based systems has no NLS_LANG enviromental variable, so I think Oracle client use it's own charset (AMERICAN_AMERICA.WE8ISO8859P1) if we don't specify the third parameter.

Categories