Latvian characters into MySQL Database converting - php

I have some problem. I have some data in my DB in Latvian (i.e Valentīna) and I need to display this on my page.
Other data are saved in cp1257 encoding and looks like AÎDA MACIJEVSKA - and it displays as Aīda Macijevska
So what I have tried...
1 - ucwords(mb_strtolower(iconv("windows-1257", "UTF-8//TRANSLIT", trim($row['pac_name'])), "UTF-8"));
2 - ucwords(mb_strtolower(iconv("windows-1257", "UTF-8", trim($row['pac_name'])), "UTF-8"));
3 - just show without any converting from DB `$row["pac_name"]`;
and all 3 points display same result - Valent?na
P.S Database has utf8_general_ci collation, also I gave header for utf-8 encoding - header('Content-Type: text/html; charset=utf-8');
So can anyone please help me with my problem?

Assuming you are truly using cp1257 and not utf8, then you need
SET NAMES cp1257 (or some other way for the client to tell mysqld that the bytes are encoded using cp1257)
CHARACTER SET cp1257 on each column (or perhaps defaulting from table definition).
But it sounds like you should go with utf8, not cp1257...
As I see it, Î does not exist in cp1257. Reference: http://en.wikipedia.org/wiki/Windows-1257 . Hence, the code you mentioned is free to screw it up, by either using ī or ?.
If you really need I-hat, go with utf8. Note that the collation utf8_latvian_ci exists. All the i's mentioned here do exist in utf8.
If you have further questions, please provide SELECT HEX(col)... for any text in question. For example (with spaces added for clarity):
In utf8: AÎDA --> 41 C38E 44 41; Aīda --> 41 C4AB 64 61
In cp1257: Aīda --> 41 EE 64 61

Related

How has this mysql string been encoded and how can I replicate it?

Here are the hex values of two strings stored in a MySQL database using two different methods.
20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5
and
E0A495E0A4BEE0A49AE0A48220E0A4B6E0A495E0A58DE0A4A8E0A58BE0A4AEE0A58DE0A4AFE0A4A4E0A58DE0A4A4E0A581E0A4AEE0A58D20E0A5A420E0A4A8E0A58BE0A4AAE0A4B9E0A4BFE0A4A8E0A4B8E0A58DE0A4A4E0A4BF20E0A4AEE0A4BEE0A4AEE0A58D20E0A5A5
They represent the string काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥. The former appears to be encoded badly, but works in the application, the latter appears encoded correctly but does not. I need to be able to create the first hex string from the input.
Here comes the long version: I've got a legacy application built in PHP/MySQL. The database connection charset is latin1. The charset of the table is utf8 (don't ask). The input is coerced into being correct utf8 via the ForceUTF8 composer library. Looking directly in the database, the stored value of this string is काचं शकà¥à¤¨à¥‹à¤®à¥à¤¯à¤¤à¥à¤¤à¥à¤®à¥ । नोपहिनसà¥à¤¤à¤¿ मामॠ॥
I am aware that this looks horrendous and appears to me to be badly encoded, however it is out of scope to fix the legacy application. The rest of the application is able to cope with this data as it is and everything else works and displays perfectly well with it.
I have created an external node application to replace the current insert routine running on Azure. I've set the connection charset to latin1, it's connecting to the same database and running the same insert statement. The only part of the puzzle I've not been able to replicate is the ForceUTF8 library as I could find no equivalent in the npm ecosystem. When the same string is inserted it renders perfectly when looking at the raw field in PHP Storm i.e. it looks exactly like the original text above, and the hex value of the string is the latter of the two presented at the top of the question. However, when viewed in the application the values are corrupted by question marks and black diamonds.
If, within the PHP application, I run SET NAMES utf8 ahead of the rendering data query then the node-inserted values render correctly, and the legacy ones now display as corrupted. Adding set names utf8 to the application for this query is not an acceptable solution since it breaks the appearance of the legacy data, and fixing the legacy data is also not an acceptable solution.
I have tried all sorts of connection charsets and various Iconv functions to make the data exactly match how the legacy app makes it but have not been able to "break it" in exactly the same way.
How can I make "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥" into a string, the hex value of which is "20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5" using some variation of database connection charset and string conversion?
I'm not familiar with PHP, but I was able to generate the "horrendous" encoding with Python (and it is horrendous...not sure how someone intentionally generated this crap). Hopefully this guides you to a solution:
import re
expected = '20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5'
original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥'
# Encode in UTF-8 w/ BOM (U+FEFF encoded in UTF-8 as a signature)
step1 = original.encode('utf-8-sig')
# Windows-1252 doesn't define some byte -> codepoint mappings and Python normally
# raises an error on those bytes. Use an error handler to keep the bytes that
# fail, then replace the escape codes with the matching Unicode codepoint.
step2 = step1.decode('cp1252',errors='backslashreplace')
step3 = re.sub(r'\\x([0-9a-f]{2})', lambda x: chr(int(x.group(1),16)), step2)
# There is an extra space before the UTF-8-encoded BOM for some reason
step4 = ' ' + step3
step5 = step4.encode('utf8')
# Format to match expected string
final = step5.hex().upper()
print(final == expected) # True
HEX('काचं') = 'E0A495E0A4BEE0A49AE0A482'
-- utf8mb4 to utf8mb4 hex
HEX(CONVERT(CONVERT(BINARY('काचं') USING latin1) USING utf8mb4)) = 'C3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A' is utf8mb4 to double-encoded
See "double-encoding" in Trouble with UTF-8 characters; what I see is not what I stored
More
"Double-encoding", as I understand it, is where utf8 bytes (up to 4 bytes per "character") are treated as latin1 (or cpnnnn) and converted to utf8, and then that happens a second time. In this case, each 3-byte Devanagari is converted twice, leading to between 6 and 9 bytes.
You explained the cause here:
The database connection charset is latin1. The charset of the table is utf8
BOM is, in my opinion, a red herring. It was intended to be a useful clue that a "text" file was encoded in UTF-8, but unfortunately, very few products generate it. Hence, BOM is more of a distraction than a help. (I don't think MySQL has any way to take care of BOM -- after all, most database activity is at the row level, not the file level.)
The solution (for the data flow) in MySQL context is to rip out all "conversion" functions and, instead, configure things so that MySQL will convert at the appropriate places. Your mention of "latin1" was the main "mis-configuration".
The long expression (HEX...) gives a clue of how to fix the data, but it must be coordinated with changes to configuration and changes to code.

I get an Ansi string instead of Utf-8 from Utf-8 mysql table [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 4 years ago.
When I moved from php mysql shared hosting to my own VPS I've found that code which outputs user names in UTF8 from mysql database outputs ?�??????� instead of 鬼神❗. My page has utf-8 encoding, and I have default_charset = "UTF-8" in php.ini, and header('Content-Type: text/html; charset=utf-8'); in my php file, as well as <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in html part of it.
My database has collation utf8_bin, and table has the same. On both previos and current hosting in phpmyadmin for this database record I see: 鬼神❗. When I create ANSI text file in Notepad++, paste 鬼神❗ into it and select Encoding->Encode in UTF-8 menu I see 鬼神❗, so I suppose it is correct encoded UTF-8 string.
Ok, and then I added
init_connect='SET collation_connection = utf8_general_bin'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_bin
skip-character-set-client-handshake
to my.cnf and now my page shows 鬼神❗ instead of ?�??????�. This is the same output I get in phpmyadmin on both hostings, so I'm on a right way. And still somehow on my old hosting the same php script returns utf-8 web page with name 鬼神❗ while on new hosting - 鬼神❗. It looks like the string is twice utf-8 encoded: I get utf-8 string, I give it as ansi string to Notepad++ and it encodes it in correct utf-8 string.
However when I try utf8_encode() I get й¬ÑзÒÑвÑâ, and utf8_decode() returns ?�???????. The same result return mb_convert_encoding($name,"UTF-8","ISO-8859-1"); and iconv( "ISO-8859-1","UTF-8", $name);.
So how could I reproduce the same conversion Notepad++ does?
See answer below.
The solution was simple yet not obvious for me, as I never saw my.cnf on that shared hosting: it seems that that server had settings as follows
init_connect='SET collation_connection = cp1252'
init_connect='SET NAMES cp1252'
character-set-server=cp1252
So to make no harm to other code on my new server I have to place mysql_query("SET NAMES CP1252"); on top of each php script which works with utf8 strings.
The trick here was script gets a string as is (ansi) and outputs it, and when browser is warned that page is in utf-8 encoding it just renders my strings as utf-8.

PHP sudden UTF8 encoding issues previously absent. Application not changed

I run a personal website that always showed accented chars correctly.
Now, suddenly, it doesn't any longer. The funny thing is, even its localhost version doesn't.
The application is unaltered over years in this regard and here it is what it does, in the given order:
mysql database set to collation utf8_general_ci
Application sends these two queries:
"SET NAMES 'utf8' COLLATE 'utf8'"
and
"SET CHARACTER_SET 'utf8'"
Php headers send the following headers before anything is printed:
header('Content-type: text/xml; charset=utf-8'."\r\n");
header('Content-transfer-encoding: utf-8'."\r\n");
Each web page shows a meta tag as follows:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Yet, now, suddenly, chars are shown all wrong.
If I replace manually the chars, they are shwon as intended. But I cannot fathom if or what may have "corrupted" the database then. And certainly I cannot fix manually hundreds of posts.
Any idea why this strange thing suddenly happens and suggestions about how to fix it?
Instance of a wrong line:
"Non ho mai avuto l' opportunità di incontrarti di persona. Non so se è perchè non ho cercato abbastanza l' occasione o perchè" etc...
è is Mojibake for è. Were you expecting a grave-e? Regardless of what changed or did not change, let's look at fixing it.
See this and look for Mojibake. It says to check/fix these:
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with .
Also see the technique for checking the HEX of what is stored for è:
utf8 hex is C3A8.
Hex C383C2A8 means you have "double encoding"; that will lead to other issues.
E8 is the latin1 hex -- I doubt if you will see this.

PHP: characters wrongly encoded in csv when using utf8_encode

I am facing a strange issue when extracting data from a MySql database and inserting it in a CSV file. In the database, the field value is the following:
K Secure Connection 1 año 1 PC
When I echo it before writing it to the CSV file, I get the same as the above in my terminal.
I use the following code to write content to the CSV file:
fwrite($this->fileHandle, utf8_encode($lineContent . PHP_EOL));
Yet, when I open the CSV with LibreOffice Calc (and specify UTF-8 as the encoding format), the following is displayed:
K Secure Connection 1 año 1 PC
I have no idea why this happens. Can someone explain how to solve this?
REM:
SELECT ##character_set_database;
returns
latin1
REM 2:
`var_dump($lineContent, bin2hex($lineContent))`
gives
string(39) "Kaspersky Secure Connection 1 año 1 PC"
string(78) "4b6173706572736b792053656375726520436f6e6e656374696f6e20312061c3b16f2031205043"
The var_dump shows that the string is already encoded in UTF-8. Using utf8_encode on it will garble it (the function attempts a conversion from Latin-1 to UTF-8). You're therefore actually writing "año" encoded in UTF-8 into your file, which is then "correctly" picked up by LibreOffice.
Simply don't utf8_encode.
I would try to open the csv file with other editor just to make sure te problem is not with the office...
You may be double encoding the content if it is already in UTF-8 format.
I also prefer to aways work with UTF-8, so I get the data from database already in UTF-8 and no more convertion is needed. For that I run this query right after opening the SQL connection:
"set names 'utf8'"

Dealing with eacute and other special characters using Oracle, PHP and Oci8

Hi I am trying to store names into an Oracle database and fetch them back using PHP and oci8.
However, if I insert the é directly into the Oracle database and use oci8 to fetch it back I just receive an e
Do I have to encode all special characters (including é) into html entities (ie: é) before inserting into database ... or am I missing something ?
Thx
UPDATE: Mar 1 at 18:40
found this function:
http://www.php.net/manual/en/function.utf8-decode.php#85034
function charset_decode_utf_8($string) {
if(#!ereg("[\200-\237]",$string) && #!ereg("[\241-\377]",$string)) {
return $string;
}
$string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e","'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",$string);
$string = preg_replace("/([\300-\337])([\200-\277])/e","'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",$string);
return $string;
}
seems to work, although not sure if its the optimal solution
UPDATE: Mar 8 at 15:45
Oracle's character set is ISO-8859-1.
in PHP I added:
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1");
to force the oci8 connection to use that character set.
Retrieving the é using oci8 from PHP now worked ! (for varchars, but not CLOBs had to do utf8_encode to extract it )
So then I tried saving the data from PHP to Oracle ... and it doesnt work..somewhere along the way from PHP to Oracle the é becomes a ?
UPDATE: Mar 9 at 14:47
So getting closer.
After adding the NLS_LANG variable, doing direct oci8 inserts with é works.
The problem is actually on the PHP side.
By using ExtJs framework, when submitting a form it encodes it using encodeURIComponent.
So é is sent as %C3%A9 and then re-encoded into é.
However it's length is now 2 (strlen($my_sent_value) = 2) and not 1.
And if in PHP I try: $my_sent_value == é = FALSE
I think if I am able to re-encode all these characters in PHP back into lengths of byte size 1 and then inserting them into Oracle, it should work.
Still no luck though
UPDATE: Mar 10 at 11:05
I keep thinking I am so close (yet so far away).
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9"); works very sporadicly.
I created a small php script to test:
header('Content-Type: text/plain; charset=ISO-8859-1');
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9");
$conn= oci_connect("user", "pass", "DB");
$stmt = oci_parse($conn, "UPDATE temp_tb SET string_field = '|é|'");
oci_execute($stmt, OCI_COMMIT_ON_SUCCESS);
After running this once and loggin into the Oracle Database directly I see that STRING_FIELD is set to |¿|. Obviously not what I had come to expect from my previous experience.
However, if I refresh that PHP page twice quickly.... it worked !!!
In Oracle I correctly saw |é|.
It seems like maybe the environment variable is not being correctly set or sent in time for the first execution of the script, but is available for the second execution.
My next experiment is to export the variable into PHP's environment, however, I need to reset Apache for that...so we'll see what happens, hopefully it works.
I presume you are aware of these facts:
There are many different character sets: you have to pick one and, of course, know which one you are using.
Oracle is perfectly capable of storing text without HTML entities (é). HTML entities are used in, well, HTML. Oracle is not a web browser ;-)
You must also know that HTML entities are not bind to a specific charset; on the contrary, they're used to represent characters in a charset-independent context.
You indistinctly talk about ISO-8859-1 and UTF-8. What charset do you want to use? ISO-8859-1 is easy to use but it can only store text in some latin languages (such as Spanish) and it lacks some common chars like the € symbol. UTF-8 is trickier to use but it can store all characters defined by the Unicode consortium (which include everything you'll ever need).
Once you've taken the decision, you must configure Oracle to hold data in such charset and choose an appropriate column type. E.g., VARCHAR2 is fine for plain ASCII, NVARCHAR2 is good for UTF-8.
This is what I finally ended up doing to solve this problem:
Modified the profile of the daemon running PHP to have:
NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
So that the oci8 connection uses ISO-8859-1.
Then in my PHP configuration set the default content-type to ISO-8859-1:
default_charset = "iso-8859-1"
When I am inserting into an Oracle Table via oci8 from PHP, I do:
utf8_decode($my_sent_value)
And when receiving data from Oracle, printing the variable should just work as so:
echo $my_received_value
However when sending that data over ajax I have had to use:
utf8_encode($my_received_value)
If you really cannot change the character set that oracle will use then how about Base64 encoding your data before storing it in the database. That way, you can accept characters from any character set and store them as ISO-8859-1 (because Base64 will output a subset of the ASCII character set which maps exactly to ISO-8859-1). Base64 encoding will increase the length of the string by, on average, 37%
If your data is only ever going to be displayed as HTML then you might as well store HTML entities as you suggested, but be aware that a single entity can be up to 10 characters per unencoded character e.g. ϑ is ϑ
I had to face this problem : the LatinAmerican special characters are stored as "?" or "¿" in my Oracle database ... I can't change the NLS_CHARACTER_SET because we're not the database owners.
So, I found a workaround :
1) ASP.NET code
Create a function that converts string to hexadecimal characters:
public string ConvertirStringAHex(String input)
{
Encoding encoding = System.Text.Encoding.GetEncoding("ISO-8859-1");
Byte[] stringBytes = encoding.GetBytes(input);
StringBuilder sbBytes = new StringBuilder(stringBytes.Length);
foreach (byte b in stringBytes)
{
sbBytes.AppendFormat("{0:X2}", b);
}
return sbBytes.ToString();
}
2) Apply the function above to the variable you want to encode, like this
myVariableHex = ConvertirStringZHex( myVariable );
In ORACLE, use the following:
PROCEDURE STORE_IN_TABLE( iTEXTO IN VARCHAR2 )
IS
BEGIN
INSERT INTO myTable( SPECIAL_TEXT )
VALUES ( UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW( iTEXTO ));
COMMIT;
END;
Of course, iTEXTO is the Oracle parameter which receives the value of "myVariableHex" from ASP.NET code.
Hope it helps ... if there's something to improve pls don't hesitate to post your comments.
Sources:
http://www.nullskull.com/faq/834/convert-string-to-hex-and-hex-to-string-in-net.aspx
https://forums.oracle.com/thread/44799
If you have different charsets between the server side code (php in this case) and the Oracle database, you should set server side code charset in the Oracle connection, then Oracle made the conversion.
Example: Let's assume:
php charset utf-8 (default).
Oracle charset AMERICAN_AMERICA.WE8ISO8859P1
In the connection to Oracle made by php you should set UTF8 (third parameter).
oci_pconnect("USER", "PASS", "URL"),"UTF8");
Doing this, you write code in utf-8 (not doing any conversion at all) and get utf-8 from the database through this connection.
So you could write something like SELECT * FROM SOME_TABLE WHERE TEXT = 'SOME TEXT LIKE áéíóú Ñ' and also get utf-8 text as a result.
According to the php documentation, by default, Oracle client (oci_pconnect) takes the NLS_LANG environment variable from the Operating system. Some debian based systems has no NLS_LANG enviromental variable, so I think Oracle client use it's own charset (AMERICAN_AMERICA.WE8ISO8859P1) if we don't specify the third parameter.

Categories