Dealing with eacute and other special characters using Oracle, PHP and Oci8 - php

Hi, I am trying to store names in an Oracle database and fetch them back using PHP and oci8.
However, if I insert the é directly into the Oracle database and use oci8 to fetch it back, I just receive an e.
Do I have to encode all special characters (including é) into HTML entities (i.e. &eacute;) before inserting them into the database, or am I missing something?
Thx
UPDATE: Mar 1 at 18:40
found this function:
http://www.php.net/manual/en/function.utf8-decode.php#85034
function charset_decode_utf_8($string) {
    // Only take the slow path if the string contains 8-bit characters.
    if (!ereg("[\200-\237]", $string) && !ereg("[\241-\377]", $string)) {
        return $string;
    }
    // Rewrite 3-byte UTF-8 sequences as numeric HTML entities.
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e", "'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'", $string);
    // Rewrite 2-byte UTF-8 sequences as numeric HTML entities.
    $string = preg_replace("/([\300-\337])([\200-\277])/e", "'&#'.((ord('\\1')-192)*64 + (ord('\\2')-128)).';'", $string);
    return $string;
}
It seems to work, although I'm not sure it's the optimal solution.
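For what it's worth, ereg() and the /e modifier of preg_replace() used above were removed in PHP 7, so this function will not run on a modern PHP at all. A minimal sketch of the same idea (non-ASCII codepoints to numeric HTML entities) using mb_encode_numericentity:
function charset_decode_utf_8_modern($string) {
    // Convert every codepoint above ASCII to a numeric HTML entity.
    return mb_encode_numericentity($string, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}
echo charset_decode_utf_8_modern('é'); // &#233;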
UPDATE: Mar 8 at 15:45
Oracle's character set is ISO-8859-1.
in PHP I added:
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1");
to force the oci8 connection to use that character set.
Retrieving the é using oci8 from PHP now worked! (For VARCHAR2 columns, that is; for CLOBs I had to run utf8_encode on the extracted value.)
So then I tried saving data from PHP to Oracle... and it doesn't work. Somewhere along the way from PHP to Oracle, the é becomes a ?
UPDATE: Mar 9 at 14:47
So getting closer.
After adding the NLS_LANG variable, doing direct oci8 inserts with é works.
The problem is actually on the PHP side.
I am using the ExtJS framework, and when submitting a form it encodes the values using encodeURIComponent.
So é is sent as %C3%A9 and then decoded back into the two-byte UTF-8 sequence for é.
However, its length is now 2 (strlen($my_sent_value) == 2), not 1.
And in PHP, $my_sent_value == 'é' evaluates to FALSE.
I think that if I can re-encode all these characters in PHP back into single-byte characters before inserting them into Oracle, it should work.
Still no luck though.
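A tiny sketch of what seems to be happening with the submitted value (variable name from above; the rest is illustrative):
// The browser sends %C3%A9; PHP decodes it to the two UTF-8 bytes 0xC3 0xA9.
$my_sent_value = urldecode('%C3%A9');
var_dump(strlen($my_sent_value));                  // int(2)
var_dump($my_sent_value === "\xE9");               // false: Latin-1 é is the single byte 0xE9
var_dump(utf8_decode($my_sent_value) === "\xE9");  // true: back to one byte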
UPDATE: Mar 10 at 11:05
I keep thinking I am so close (yet so far away).
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9"); works very sporadically.
I created a small php script to test:
header('Content-Type: text/plain; charset=ISO-8859-1');
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9");
$conn = oci_connect("user", "pass", "DB");
$stmt = oci_parse($conn, "UPDATE temp_tb SET string_field = '|é|'");
oci_execute($stmt, OCI_COMMIT_ON_SUCCESS);
After running this once and logging into the Oracle database directly, I see that STRING_FIELD is set to |¿|. Obviously not what I had come to expect from my previous experience.
However, if I refresh that PHP page twice quickly.... it worked !!!
In Oracle I correctly saw |é|.
It seems like maybe the environment variable is not being correctly set or sent in time for the first execution of the script, but is available for the second execution.
My next experiment is to export the variable into PHP's environment; however, I need to restart Apache for that, so we'll see what happens. Hopefully it works.
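(One shortcut that may sidestep the environment-variable timing question entirely: oci_connect() accepts the client charset directly as its fourth parameter. A sketch, reusing the test script above:)
$conn = oci_connect("user", "pass", "DB", "WE8ISO8859P1");
$stmt = oci_parse($conn, "UPDATE temp_tb SET string_field = '|é|'");
oci_execute($stmt, OCI_COMMIT_ON_SUCCESS);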

I presume you are aware of these facts:
There are many different character sets: you have to pick one and, of course, know which one you are using.
Oracle is perfectly capable of storing text without HTML entities (&eacute;). HTML entities are used in, well, HTML. Oracle is not a web browser ;-)
You must also know that HTML entities are not bound to a specific charset; on the contrary, they're used to represent characters in a charset-independent context.
You talk about ISO-8859-1 and UTF-8 interchangeably. Which charset do you want to use? ISO-8859-1 is easy to use but it can only store text in some Latin languages (such as Spanish), and it lacks some common chars like the € symbol. UTF-8 is trickier to use but it can store all characters defined by the Unicode consortium (which include everything you'll ever need).
Once you've taken the decision, you must configure Oracle to hold data in such charset and choose an appropriate column type. E.g., VARCHAR2 is fine for plain ASCII, NVARCHAR2 is good for UTF-8.
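If you are unsure what the database is configured for, you can ask it from PHP. A sketch, assuming an existing oci8 connection $conn (NLS_DATABASE_PARAMETERS is a standard Oracle view):
$stmt = oci_parse($conn,
    "SELECT parameter, value FROM nls_database_parameters
      WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET')");
oci_execute($stmt);
while ($row = oci_fetch_assoc($stmt)) {
    echo $row['PARAMETER'], ' = ', $row['VALUE'], PHP_EOL; // e.g. NLS_CHARACTERSET = WE8ISO8859P1
}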

This is what I finally ended up doing to solve this problem:
Modified the profile of the daemon running PHP to have:
NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
So that the oci8 connection uses ISO-8859-1.
Then in my PHP configuration set the default content-type to ISO-8859-1:
default_charset = "iso-8859-1"
When I am inserting into an Oracle Table via oci8 from PHP, I do:
utf8_decode($my_sent_value)
And when receiving data from Oracle, printing the variable should just work, like so:
echo $my_received_value
However when sending that data over ajax I have had to use:
utf8_encode($my_received_value)
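Putting the pieces together, a sketch of both directions under this setup (table, column, and variable names are illustrative):
// Client charset is WE8ISO8859P1 via NLS_LANG; page charset is ISO-8859-1.
$conn = oci_connect('user', 'pass', 'DB');

// Inserting a UTF-8 value that came from the browser:
$stmt = oci_parse($conn, 'INSERT INTO people (name) VALUES (:name)');
$latin1 = utf8_decode($my_sent_value);          // UTF-8 -> ISO-8859-1
oci_bind_by_name($stmt, ':name', $latin1);
oci_execute($stmt);

// Fetching: a plain echo works as-is, but ajax/JSON responses need UTF-8.
$stmt = oci_parse($conn, 'SELECT name FROM people');
oci_execute($stmt);
while ($row = oci_fetch_assoc($stmt)) {
    echo $row['NAME'];                          // fine on an ISO-8859-1 page
    $for_ajax = utf8_encode($row['NAME']);      // UTF-8 for the wire
}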

If you really cannot change the character set that Oracle will use, then how about Base64-encoding your data before storing it in the database? That way, you can accept characters from any character set and store them as ISO-8859-1 (because Base64 outputs a subset of ASCII which maps exactly onto ISO-8859-1). Base64 encoding will increase the length of the string by about 33%.
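A sketch of that round trip (column and variable names are illustrative):
$stored = base64_encode($any_user_input);        // pure ASCII, safe in ISO-8859-1
// ... INSERT $stored, and on the way back out:
$original = base64_decode($row['ENCODED_TEXT']);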
If your data is only ever going to be displayed as HTML then you might as well store HTML entities as you suggested, but be aware that a single entity can be up to 10 characters per unencoded character, e.g. ϑ is &thetasym;.

I had to face this problem: the Latin American special characters get stored as "?" or "¿" in my Oracle database... I can't change the NLS_CHARACTER_SET because we're not the database owners.
So, I found a workaround :
1) ASP.NET code
Create a function that converts string to hexadecimal characters:
public string ConvertirStringAHex(String input)
{
    Encoding encoding = System.Text.Encoding.GetEncoding("ISO-8859-1");
    Byte[] stringBytes = encoding.GetBytes(input);
    StringBuilder sbBytes = new StringBuilder(stringBytes.Length);
    foreach (byte b in stringBytes)
    {
        sbBytes.AppendFormat("{0:X2}", b);
    }
    return sbBytes.ToString();
}
2) Apply the function above to the variable you want to encode, like this:
myVariableHex = ConvertirStringAHex( myVariable );
3) In Oracle, use the following:
PROCEDURE STORE_IN_TABLE( iTEXTO IN VARCHAR2 )
IS
BEGIN
    INSERT INTO myTable( SPECIAL_TEXT )
    VALUES ( UTL_RAW.CAST_TO_VARCHAR2( HEXTORAW( iTEXTO ) ) );
    COMMIT;
END;
Of course, iTEXTO is the Oracle parameter which receives the value of "myVariableHex" from ASP.NET code.
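(For the PHP readers of this thread: the equivalent hex conversion is a one-liner with bin2hex. A sketch, assuming the same STORE_IN_TABLE procedure and an oci8 connection $conn:)
$hex  = strtoupper(bin2hex($myVariable));       // ISO-8859-1 bytes -> hex string
$stmt = oci_parse($conn, 'BEGIN STORE_IN_TABLE(:txt); END;');
oci_bind_by_name($stmt, ':txt', $hex);
oci_execute($stmt);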
Hope it helps... if there's something to improve, please don't hesitate to post your comments.
Sources:
http://www.nullskull.com/faq/834/convert-string-to-hex-and-hex-to-string-in-net.aspx
https://forums.oracle.com/thread/44799

If you have different charsets between the server-side code (PHP in this case) and the Oracle database, you should set the server-side charset in the Oracle connection; Oracle will then do the conversion.
Example: Let's assume:
PHP charset: UTF-8 (the default).
Oracle charset: AMERICAN_AMERICA.WE8ISO8859P1.
In the connection to Oracle made by PHP you should set UTF8 (the fourth parameter):
$conn = oci_pconnect("USER", "PASS", "URL", "UTF8");
Doing this, you write code in utf-8 (not doing any conversion at all) and get utf-8 from the database through this connection.
So you could write something like SELECT * FROM SOME_TABLE WHERE TEXT = 'SOME TEXT LIKE áéíóú Ñ' and also get utf-8 text as a result.
According to the PHP documentation, by default the Oracle client (oci_pconnect) takes the NLS_LANG environment variable from the operating system. Some Debian-based systems have no NLS_LANG environment variable, so I think the Oracle client falls back to its own default charset (AMERICAN_AMERICA.WE8ISO8859P1) if we don't specify the fourth parameter.
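A minimal end-to-end sketch of this approach (credentials and table are placeholders):
$conn = oci_pconnect('USER', 'PASS', 'URL', 'UTF8');
$stmt = oci_parse($conn, "SELECT * FROM SOME_TABLE WHERE TEXT = 'SOME TEXT LIKE áéíóú Ñ'");
oci_execute($stmt);
while ($row = oci_fetch_assoc($stmt)) {
    echo $row['TEXT'], PHP_EOL;  // arrives as UTF-8, no manual conversion needed
}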

Related

How has this mysql string been encoded and how can I replicate it?

Here are the hex values of two strings stored in a MySQL database using two different methods.
20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5
and
E0A495E0A4BEE0A49AE0A48220E0A4B6E0A495E0A58DE0A4A8E0A58BE0A4AEE0A58DE0A4AFE0A4A4E0A58DE0A4A4E0A581E0A4AEE0A58D20E0A5A420E0A4A8E0A58BE0A4AAE0A4B9E0A4BFE0A4A8E0A4B8E0A58DE0A4A4E0A4BF20E0A4AEE0A4BEE0A4AEE0A58D20E0A5A5
They represent the string काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥. The former appears to be encoded badly, but works in the application; the latter appears encoded correctly but does not. I need to be able to create the first hex string from the input.
Here comes the long version: I've got a legacy application built in PHP/MySQL. The database connection charset is latin1. The charset of the table is utf8 (don't ask). The input is coerced into being correct utf8 via the ForceUTF8 composer library. Looking directly in the database, the stored value of this string is काचं शकà¥à¤¨à¥‹à¤®à¥à¤¯à¤¤à¥à¤¤à¥à¤®à¥ । नोपहिनसà¥à¤¤à¤¿ मामॠ॥
I am aware that this looks horrendous and appears to me to be badly encoded, however it is out of scope to fix the legacy application. The rest of the application is able to cope with this data as it is and everything else works and displays perfectly well with it.
I have created an external node application to replace the current insert routine running on Azure. I've set the connection charset to latin1, it's connecting to the same database and running the same insert statement. The only part of the puzzle I've not been able to replicate is the ForceUTF8 library as I could find no equivalent in the npm ecosystem. When the same string is inserted it renders perfectly when looking at the raw field in PHP Storm i.e. it looks exactly like the original text above, and the hex value of the string is the latter of the two presented at the top of the question. However, when viewed in the application the values are corrupted by question marks and black diamonds.
If, within the PHP application, I run SET NAMES utf8 ahead of the rendering data query then the node-inserted values render correctly, and the legacy ones now display as corrupted. Adding set names utf8 to the application for this query is not an acceptable solution since it breaks the appearance of the legacy data, and fixing the legacy data is also not an acceptable solution.
I have tried all sorts of connection charsets and various Iconv functions to make the data exactly match how the legacy app makes it but have not been able to "break it" in exactly the same way.
How can I make "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥" into a string, the hex value of which is "20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5" using some variation of database connection charset and string conversion?
I'm not familiar with PHP, but I was able to generate the "horrendous" encoding with Python (and it is horrendous...not sure how someone intentionally generated this crap). Hopefully this guides you to a solution:
import re
expected = '20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5'
original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥'
# Encode in UTF-8 w/ BOM (U+FEFF encoded in UTF-8 as a signature)
step1 = original.encode('utf-8-sig')
# Windows-1252 doesn't define some byte -> codepoint mappings and Python normally
# raises an error on those bytes. Use an error handler to keep the bytes that
# fail, then replace the escape codes with the matching Unicode codepoint.
step2 = step1.decode('cp1252',errors='backslashreplace')
step3 = re.sub(r'\\x([0-9a-f]{2})', lambda x: chr(int(x.group(1),16)), step2)
# There is an extra space before the UTF-8-encoded BOM for some reason
step4 = ' ' + step3
step5 = step4.encode('utf8')
# Format to match expected string
final = step5.hex().upper()
print(final == expected) # True
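If you would rather reproduce the mangling in PHP itself, the same byte-level trick can be sketched as follows (the function name is mine; the table covers the bytes Windows-1252 leaves undefined, mirroring the backslashreplace workaround above):
function cp1252_bytes_to_utf8($bytes) {
    // Windows-1252 differs from Latin-1 only in 0x80-0x9F; bytes undefined
    // in cp1252 fall back to their Latin-1 control codepoints.
    static $cp1252 = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $b = ord($byte);
        $out .= mb_chr(isset($cp1252[$b]) ? $cp1252[$b] : $b, 'UTF-8');
    }
    return $out;
}

$original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥';
$mangled  = ' ' . cp1252_bytes_to_utf8("\xEF\xBB\xBF" . $original); // space + UTF-8 BOM + text
echo strtoupper(bin2hex($mangled)), PHP_EOL; // matches the first hex string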
HEX('काचं') = 'E0A495E0A4BEE0A49AE0A482' -- plain utf8mb4
HEX(CONVERT(CONVERT(BINARY('काचं') USING latin1) USING utf8mb4)) = 'C3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A' -- utf8mb4, double-encoded
See "double-encoding" in Trouble with UTF-8 characters; what I see is not what I stored
"Double-encoding", as I understand it, is where utf8 bytes (up to 4 bytes per "character") are treated as latin1 (or cpnnnn) and converted to utf8, and then that happens a second time. In this case, each 3-byte Devanagari is converted twice, leading to between 6 and 9 bytes.
You explained the cause here:
The database connection charset is latin1. The charset of the table is utf8
BOM is, in my opinion, a red herring. It was intended to be a useful clue that a "text" file was encoded in UTF-8, but unfortunately, very few products generate it. Hence, BOM is more of a distraction than a help. (I don't think MySQL has any way to take care of BOM -- after all, most database activity is at the row level, not the file level.)
The solution (for the data flow) in MySQL context is to rip out all "conversion" functions and, instead, configure things so that MySQL will convert at the appropriate places. Your mention of "latin1" was the main "mis-configuration".
The long expression (HEX...) gives a clue of how to fix the data, but it must be coordinated with changes to configuration and changes to code.

How to store and retrieve extended ASCII characters in MSSQL

I was surprised that I was unable to find a straightforward answer to this question by searching.
I have a web application in PHP that takes user input. Due to the nature of the application, users may often use extended ASCII characters (a.k.a. "ALT codes").
My specific issue at the moment is with ALT code 26, which is a right arrow (→). This will be accompanied with other text to be stored in the same field (for example, 'this→that').
My column type is NVARCHAR.
Here's what I've tried:
I've tried doing no conversions and just inserting the value as normal, but the value gets stored as thisâ??that.
I've tried converting the value to UCS-2 in PHP using iconv('UTF-8', 'UCS-2', $value), but I get an error saying Unclosed quotation mark after the character string 't'.. The query ends up looking like this: UPDATE myTable SET myColumn = 'this�!that'.
I've tried doing the above conversion and then adding an N before the quoted value, but I get the same error message. The query looks like this: UPDATE myTable SET myColumn = N'this�!that'.
I've tried removing the UCS-2 conversion and just adding the N before the quoted value, and the query works again, but the value is stored as thisâ that.
I've tried using utf8_decode($value) in PHP, but then the arrow is just replaced with a question mark.
So can anyone answer the (seemingly simple) question of, how can I store this value in my database and then retrieve it as it was originally typed?
I'm using PHP 5.5 and MSSQL 2012. If any question of driver/OS version comes into play, it's a Linux server connecting via FreeTDS. There is no possibility of changing this.
You might try Base64-encoding the input; this is fairly trivial to handle with PHP's base64_encode() and base64_decode(), and it should handle whatever your users throw at it.
(edit: You can apparently also do the base64 encoding on the SQL Server side. This doesn't seem like something it should be responsible for imho, but it's an option.)
It seems like your freetds.conf is wrong. You need a TDS protocol version >= 7.0 to support unicode. See this for more details.
Edit your freetds.conf:
[global]
# TDS protocol version
tds version = 7.4
client charset = UTF-8
Also make sure to configure PHP correctly:
ini_set('mssql.charset', 'UTF-8');
The accepted answer does the job: you can encode to Base64 and then decode it back again. But then every application that uses that remote database has to be changed to handle Base64-encoded fields. If there is a remote MS SQL Server database, there may well be other applications using it, and each of them would have to be changed to support both plain and Base64-encoded text.
I searched a little and found how to send Unicode text to MS SQL Server using plain SQL commands, with PHP converting the Unicode bytes to hex numbers.
If you go to the PHP documentation for mssql_fetch_array (http://php.net/manual/ru/function.mssql-fetch-array.php#80076), you'll see in the comments a pretty good solution that converts the text to Unicode hex values and then sends that hex data directly to MS SQL Server, like this:
Convert Unicode Text to HEX Data
// sending data to database
$utf8 = 'Δοκιμή με unicode → Test with Unicode'; // some Greek text for example
$ucs2 = iconv('UTF-8', 'UCS-2LE', $utf8);
// converting UCS-2 string into "binary" hexadecimal form
$arr = unpack('H*hex', $ucs2);
$hex = "0x{$arr['hex']}";
// IMPORTANT!
// please note that value must be passed without apostrophes
// it should be "... values(0x0123456789ABCEF) ...", not "... values('0x0123456789ABCEF') ..."
mssql_query("INSERT INTO mytable (myfield) VALUES ({$hex})", $link);
Now all the text is actually stored in the NVARCHAR database field correctly as Unicode, and that's all you have to do in order to send and store it as plain text, not encoded.
To retrieve that text, you need to ask MS SQL Server to send back UNICODE encoded text like this:
Retrieving Unicode Text from MS SQL Server
// retrieving data from database
// IMPORTANT!
// please note that VARBINARY expects a number of bytes:
// twice the NVARCHAR length, since each UCS-2 char is 2 bytes
// (myfield is NVARCHAR(50), so I use VARBINARY(100))
$result = mssql_query("SELECT CONVERT(VARBINARY(100), myfield) AS myfield FROM mytable", $link);
while (($row = mssql_fetch_array($result, MSSQL_BOTH)))
{
// we get data in UCS-2
// I use UTF-8 in my project, so I encode it back
echo '1. '.iconv('UCS-2LE', 'UTF-8', $row['myfield']).PHP_EOL;
// or you can even use mb_convert_encoding to convert from UCS-2LE to UTF-8
echo '2. '.mb_convert_encoding($row['myfield'], 'UTF-8', 'UCS-2LE').PHP_EOL;
}
(Screenshots omitted: the MS SQL table with the Unicode data after the INSERT, and the output of a PHP page displaying the values.)
I'm not sure if you can reach my test page here, but you can try to see the live results:
http://dbg.deve.wiznet.gr/php56/mssql/test1.php

Htmlentities function returns null, after upgrading PHP and MySQL

I have a web server with a hundred applications developed on PHP 5.2.3, Apache 2.2.4, MySQL 5.0.37 (databases with utf8_general_ci character set).
I set up a new machine to which I will port the web server; it has PHP 5.5.12 (default_charset=UTF-8), Apache 2.4.9 (HTML head with content="text/html; charset=utf-8"), and MySQL 5.6.17 (test database with the utf8_general_ci character set).
In the PHP scripts I used the htmlentities function many times, in the form htmlentities($var) (OK, not the best way, but I was a beginner), where $var is text extracted from MySQL and containing special chars like "è" (when I save to the database I use SET var=_utf8'è').
The problem is that on the new server the htmlentities function returns nothing (the same code on the old server returns the correct &egrave;).
After some googling I found a solution, which is to rewrite the call as htmlentities(utf8_encode($var)), but you know, correcting a hundred applications...
Is there a solution (with .ini variables, database charset modification, or similar) to maintain the "old" behaviour of the htmlentities function?
[EDIT]
Thanks to CBroe's suggestion, I can solve the problem linked to MySQL using the function mysql_set_charset called when I connect to the DB (a common function).
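For reference, a sketch of that connection-level fix (the legacy mysql_* API these PHP 5 applications use; mysqli_set_charset is the modern counterpart):
$link = mysql_connect('localhost', 'user', 'pass');
mysql_set_charset('utf8', $link);              // connection charset now matches the data
echo htmlentities($var, ENT_QUOTES, 'UTF-8');  // and htmlentities is told the input charset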
But the problem still remains for generic conversion. For example, if I want to print the euro symbol I'd like to use the htmlentities function instead of remembering the HTML code.
As a further note, if I use htmlentities("è",ENT_QUOTES,'UTF-8') the result is nothing, while htmlentities("è",ENT_QUOTES,'ISO-8859-1') or htmlentities("è",ENT_QUOTES,'') gives the right result.
PS:
The problem is the same if I pass a string with a special char like "abcdè".
[EDIT]
I found a solution for the same problem on the ODBC connection:
https://www.saotn.org/php-56-default_charset-change-may-break-html-output/
so setting default_charset = "iso-8859-1" keeps the old applications working fine.

From URL to PHP to MySQL, how to encode and decode UTF-8?

Let's say I want to send Unicode glyph U+CABC 쪼 via a web service to be saved in a database.
For example, wget is being used to connect to a web service:
shell_exec("wget 'http://doit.com/testing.php?glyph=".f(0xCABC)."'");
Where f is the PHP function (or functions) to convert/encode/escape the glyph U+CABC.
In testing.php, the glyph is accessed via $_REQUEST:
$glyph = $_REQUEST['glyph'];
I'd like to put it in the DB, so let's set up a query string like this:
$query = 'INSERT INTO UTF8_TABLE (UTF8_FIELD) VALUES ('.g($glyph).')';
Where g is the PHP function (or functions) to convert the glyph into a MySQL compatible representation.
I can't seem to find what I need for the functions f and g.
For f, I tried escaping and encoding via numerous functions, e.g. as URL-encoded UTF-8: %EC%AA%BC. For g, I tried various unescaping and decoding functions, e.g. html_entity_decode, utf8_decode, etc.
But no matter how I encode it, the value always gets interpreted as a string of three Latin-1 characters (ì, ª, ¼), which are then saved in the DB as six bytes, and not as the single character 쪼 (i.e. three bytes).
I haven't even begun to figure out how to return the glyph via SQL SELECT and encoding JSON, but for now, would just like a straightforward way to handle UTF-8 from origin to destination.
$glyph = "쪼"; //or
$glyph = "\xEC\xAA\xBC";
This is your glyph encoded in UTF-8. The former works if you save your source code in UTF-8, the latter works in any case. To transport this in a URL, URL-encode it:
$url = 'http://...?glyph=' . rawurlencode($glyph);
On the server, PHP will automatically decode it again, so:
$glyph = $_GET['glyph'];
From there, insert it into the database the same way you would any other UTF-8 encoded text, mostly making sure the database connection encoding is set correctly. See UTF-8 all the way through.
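Concretely, f and g from the question might be sketched like this (UTF8_TABLE/UTF8_FIELD are the question's names; mysqli and the connection details are assumptions, and mb_chr needs PHP 7.2+):
function f($codepoint) {                         // U+CABC -> "%EC%AA%BC"
    return rawurlencode(mb_chr($codepoint, 'UTF-8'));
}
function g($db, $glyph) {                        // quote/escape for the SQL literal
    return "'" . $db->real_escape_string($glyph) . "'";
}

$db = new mysqli('localhost', 'user', 'pass', 'test');
$db->set_charset('utf8mb4');                     // the crucial step
$glyph = $_GET['glyph'];                         // PHP has already URL-decoded it
$db->query('INSERT INTO UTF8_TABLE (UTF8_FIELD) VALUES (' . g($db, $glyph) . ')');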

Accents in uploaded file being replaced with '?'

I am building a data import tool for the admin section of a website I am working on. The data is in both French and English, and contains many accented characters. Whenever I attempt to upload a file, parse the data, and store it in my MySQL database, the accents are replaced with '?'.
I have text files containing data (charset is iso-8859-1) which I upload to my server using CodeIgniter's file upload library. I then read the file in PHP.
My code is similar to this:
$this->upload->do_upload();
$data = array('upload_data' => $this->upload->data());
$fileHandle = fopen($data['upload_data']['full_path'], "r");
while (($line = fgets($fileHandle)) !== false) {
echo $line;
}
This produces lines with accents replaced with '?'. Everything else is correct.
If I download my uploaded file from my server over FTP, the charset is still iso-8859-1, but a diff reveals that the file has changed. However, if I open the file in TextEdit, it displays properly.
I attempted to use PHP's stream_encoding method to explicitly set my file stream to iso-8859-1, but my build of PHP does not have the method.
After running out of ideas, I tried wrapping my strings in both utf8_encode and utf8_decode. Neither worked.
If anyone has any suggestions about things I could try, I would be extremely grateful.
It's important to see whether the corruption is happening before or after the query is issued to MySQL. There are too many possible things happening here to be able to pinpoint it. Are you able to output your MySQL query to check this?
Assuming that your query IS properly formed (no corruption at the stage the query is being outputted) there are a couple of things that you should check.
What is the character encoding (collation) of the database itself?
What is the charset of the connection? This may not be set up correctly in your MySQL config and can be set manually using the 'SET NAMES' command.
In my own application I issue a 'SET NAMES utf8' as my first query after establishing a connection as I am unable to change the MySQL config.
See this.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
Edit: If the issue is not related to mysql I'd check the following
You say the encoding of the file is 'charset is iso-8859-1' - can I ask how you are sure of this?
What happens if you save the file itself as utf8 (Without BOM) and try to reprocess it?
What is the encoding of the php file that is performing the conversion? (What are you using to write your php - it may be 'managing' this for you in an undesired way)
(an aside) Are the files you are processing suitable for processing using fgetcsv instead?
http://php.net/manual/en/function.fgetcsv.php
Files uploaded to your server should be returned the same on download. That means, the encoding of the file (which is just a bunch of binary data) should not be changed. Instead you should take care that you are able to store the binary information of that file unchanged.
To achieve that with your database, create a BLOB field. That's the right column type for it. It's just binary data.
Assuming you're using MySQL, this is the reference: The BLOB and TEXT Types, look out for BLOB.
The problem is that you are using iso-8859-1 instead of utf-8. In order to encode it in the correct charset, you should use the iconv function, like so:
$output_string = iconv('iso-8859-1', 'utf-8//TRANSLIT', $input_string);
iso-8859-1 covers only a limited set of accented Latin characters; anything outside that set cannot be represented and gets lost.
It would be so much better if everything were utf-8, as it handles virtually every character known to man.
