I'm running into a problem in a project I'm developing: I can't store accented characters.
I'm using ADOdb 5.20.14 to make the connection to the MySQL database.
I have a main configuration file where the connection is set as follows:
$s_driver = "mysqli";
$o_db = ADONewConnection($s_driver);
$o_db->connect($s_dbhost,$s_dbuser,$s_dbpasswd,$s_dbname);
$o_db->SetFetchMode(ADODB_FETCH_ASSOC);
$o_db->setCharset("utf8");
All files in the project are saved with UTF-8 encoding.
NOTE: I use VS Code.
All PHP files that render pages have the charset meta tag set to utf-8:
<!DOCTYPE html>
<html lang="pt-br">
<head>
<meta charset="utf-8">
My database, its tables, and their columns are all configured with the utf8 charset.
NOTE: when checking the table charsets, I added a NOT NULL condition to the query to filter out tables that have no charset configuration.
The query that inserts the information into the database is executed as follows:
$s_query_incluir = "INSERT INTO agtb_ordensdeservicos(id_agenda,
id_empresa,
hora_ini,
hora_fim,
observacoes,
tipo,
csa)
VALUES('".$a_post['add_id_os_dt_agenda']."',
'".ID_EMP_ATUAL."',
'".$a_post["add_os_hora_ini"]."',
'".$a_post['add_os_hora_fim']."',
'".$a_post['add_observacao']."',
'".$a_post['add_os_tipo']."',
'".$a_post['add_os_csa']."');";
$o_db->execute($s_query_incluir);
NOTE: at the top of the file I include the configuration file shown in the first snippet.
After performing this operation, the value stored in the database is garbled, the website displays it incorrectly, and neither matches the original text.
I managed to make it work by calling setCharset again, right before the execute:
// $s_query_incluir is the same INSERT statement as above
$o_db->setCharset("utf8");
$o_db->execute($s_query_incluir);
With that change, the text is stored correctly in MySQL.
I would like to understand where I am going wrong. I'm trying to make utf8 "automatic", without having to call setCharset before every query execution.
I appreciate any kind of help. :)
NOTE: if you need more information about the process to better understand the problem, just let me know.
I have identified the reason for the problem.
Even with the database, tables, and columns configured for UTF-8 / utf8_general_ci and the PHP files saved as UTF-8, the strange-character problem remained.
The problem was in the standard PHP string functions that format text, including:
strtolower()
strtoupper()
ucfirst()
ucwords()
These functions are not multibyte-safe: when run on accented text (for example, "acentuação"), the returned string came back with the accented characters corrupted.
After removing these functions from my entire project, the strange-character problem in the database stopped occurring (both inserting and querying, through PHP and through MySQL Workbench).
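For anyone hitting the same thing, the usual alternative is to swap in the multibyte-safe counterparts from the mbstring extension instead of dropping the formatting altogether. A minimal sketch, assuming mbstring is enabled (a native mb_ucfirst() only ships with PHP 8.4, so a small helper with a hypothetical name is included):

<?php
// The single-byte str* functions operate byte by byte and, depending on the
// locale, can rewrite bytes inside a UTF-8 sequence; the mb_* family is safe.
$texto = 'acentuação';

echo mb_strtolower($texto, 'UTF-8');                  // acentuação
echo mb_strtoupper($texto, 'UTF-8');                  // ACENTUAÇÃO
echo mb_convert_case($texto, MB_CASE_TITLE, 'UTF-8'); // Acentuação (ucwords-style)

// ucfirst() equivalent for UTF-8 strings:
function mb_ucfirst_utf8(string $s): string {
    return mb_strtoupper(mb_substr($s, 0, 1, 'UTF-8'), 'UTF-8')
         . mb_substr($s, 1, null, 'UTF-8');
}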
I hope this can help other people who end up experiencing the same problem as me. :)
NOTE: it took me a long time to post this answer because I could only validate this information after deploying these changes to a production environment.
Here are the hex values of two strings stored in a MySQL database using two different methods.
20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5
and
E0A495E0A4BEE0A49AE0A48220E0A4B6E0A495E0A58DE0A4A8E0A58BE0A4AEE0A58DE0A4AFE0A4A4E0A58DE0A4A4E0A581E0A4AEE0A58D20E0A5A420E0A4A8E0A58BE0A4AAE0A4B9E0A4BFE0A4A8E0A4B8E0A58DE0A4A4E0A4BF20E0A4AEE0A4BEE0A4AEE0A58D20E0A5A5
They represent the string काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥. The former appears to be encoded badly but works in the application; the latter appears encoded correctly but does not. I need to be able to create the first hex string from the input.
Here comes the long version: I've got a legacy application built in PHP/MySQL. The database connection charset is latin1. The charset of the table is utf8 (don't ask). The input is coerced into being correct utf8 via the ForceUTF8 composer library. Looking directly in the database, the stored value of this string is काचं शकà¥à¤¨à¥‹à¤®à¥à¤¯à¤¤à¥à¤¤à¥à¤®à¥ । नोपहिनसà¥à¤¤à¤¿ मामॠ॥
I am aware that this looks horrendous and appears to me to be badly encoded, however it is out of scope to fix the legacy application. The rest of the application is able to cope with this data as it is and everything else works and displays perfectly well with it.
I have created an external node application to replace the current insert routine running on Azure. I've set the connection charset to latin1, it's connecting to the same database and running the same insert statement. The only part of the puzzle I've not been able to replicate is the ForceUTF8 library as I could find no equivalent in the npm ecosystem. When the same string is inserted it renders perfectly when looking at the raw field in PHP Storm i.e. it looks exactly like the original text above, and the hex value of the string is the latter of the two presented at the top of the question. However, when viewed in the application the values are corrupted by question marks and black diamonds.
If, within the PHP application, I run SET NAMES utf8 ahead of the rendering data query then the node-inserted values render correctly, and the legacy ones now display as corrupted. Adding set names utf8 to the application for this query is not an acceptable solution since it breaks the appearance of the legacy data, and fixing the legacy data is also not an acceptable solution.
I have tried all sorts of connection charsets and various iconv functions to make the data exactly match what the legacy app produces, but have not been able to "break it" in exactly the same way.
How can I make "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥" into a string, the hex value of which is "20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5" using some variation of database connection charset and string conversion?
I'm not familiar with PHP, but I was able to generate the "horrendous" encoding with Python (and it is horrendous...not sure how someone intentionally generated this crap). Hopefully this guides you to a solution:
import re
expected = '20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5'
original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥'
# Encode in UTF-8 w/ BOM (U+FEFF encoded in UTF-8 as a signature)
step1 = original.encode('utf-8-sig')
# Windows-1252 doesn't define some byte -> codepoint mappings and Python normally
# raises an error on those bytes. Use an error handler to keep the bytes that
# fail, then replace the escape codes with the matching Unicode codepoint.
step2 = step1.decode('cp1252',errors='backslashreplace')
step3 = re.sub(r'\\x([0-9a-f]{2})', lambda x: chr(int(x.group(1),16)), step2)
# There is an extra space before the UTF-8-encoded BOM for some reason
step4 = ' ' + step3
step5 = step4.encode('utf8')
# Format to match expected string
final = step5.hex().upper()
print(final == expected) # True
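If a PHP equivalent of that recipe is wanted, here is a hedged port that mirrors the Python step by step. It decodes the bytes as ISO-8859-1 first, so cp1252's five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) pass through as U+00xx, which is exactly what the backslashreplace trick achieves, and then fixes up the 27 codepoints where Windows-1252 differs from ISO-8859-1:

<?php
// Codepoints where Windows-1252 differs from ISO-8859-1 (bytes 0x80-0x9F).
$cp1252Fixups = [
    "\u{80}" => "\u{20AC}", "\u{82}" => "\u{201A}", "\u{83}" => "\u{0192}",
    "\u{84}" => "\u{201E}", "\u{85}" => "\u{2026}", "\u{86}" => "\u{2020}",
    "\u{87}" => "\u{2021}", "\u{88}" => "\u{02C6}", "\u{89}" => "\u{2030}",
    "\u{8A}" => "\u{0160}", "\u{8B}" => "\u{2039}", "\u{8C}" => "\u{0152}",
    "\u{8E}" => "\u{017D}", "\u{91}" => "\u{2018}", "\u{92}" => "\u{2019}",
    "\u{93}" => "\u{201C}", "\u{94}" => "\u{201D}", "\u{95}" => "\u{2022}",
    "\u{96}" => "\u{2013}", "\u{97}" => "\u{2014}", "\u{98}" => "\u{02DC}",
    "\u{99}" => "\u{2122}", "\u{9A}" => "\u{0161}", "\u{9B}" => "\u{203A}",
    "\u{9C}" => "\u{0153}", "\u{9E}" => "\u{017E}", "\u{9F}" => "\u{0178}",
];
$original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥';
$step1 = "\xEF\xBB\xBF" . $original;                         // UTF-8 BOM ("utf-8-sig")
$step2 = mb_convert_encoding($step1, 'UTF-8', 'ISO-8859-1'); // each byte -> U+00xx
$step3 = strtr($step2, $cp1252Fixups);                       // apply the cp1252 mappings
$final = strtoupper(bin2hex(' ' . $step3));                  // extra leading space, then hex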
HEX('काचं') = 'E0A495E0A4BEE0A49AE0A482'   -- correctly encoded utf8mb4
HEX(CONVERT(CONVERT(BINARY('काचं') USING latin1) USING utf8mb4))
    = 'C3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A'   -- the double-encoded form
See "double-encoding" in Trouble with UTF-8 characters; what I see is not what I stored
"Double-encoding", as I understand it, is where utf8 bytes (up to 4 bytes per "character") are treated as latin1 (or cpnnnn) and converted to utf8, and then that happens a second time. In this case, each 3-byte Devanagari is converted twice, leading to between 6 and 9 bytes.
You explained the cause here:
The database connection charset is latin1. The charset of the table is utf8
BOM is, in my opinion, a red herring. It was intended to be a useful clue that a "text" file was encoded in UTF-8, but unfortunately, very few products generate it. Hence, BOM is more of a distraction than a help. (I don't think MySQL has any way to take care of BOM -- after all, most database activity is at the row level, not the file level.)
The solution (for the data flow) in MySQL context is to rip out all "conversion" functions and, instead, configure things so that MySQL will convert at the appropriate places. Your mention of "latin1" was the main "mis-configuration".
The long expression (HEX...) gives a clue of how to fix the data, but it must be coordinated with changes to configuration and changes to code.
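As a hedged sketch of what that data fix can look like, assuming an existing mysqli connection $mysqli (the table and column names are placeholders, and this reverses exactly one round of double encoding; test it on a copy of the data first):

<?php
// Undo one round: utf8 text -> latin1 bytes -> reinterpret those bytes as utf8mb4.
$mysqli->query(
    "UPDATE my_table
        SET my_col = CONVERT(BINARY(CONVERT(my_col USING latin1)) USING utf8mb4)"
);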
First of all, sorry for my English :p
I want to upload an Excel file (.xlsx) with names through my web app. I upload and save the data correctly in my database, but when I show that data on my website, names like João or André are shown as Jo�o and Andr�.
The collation of that table is utf8_general_ci, and there the names are shown as Joã£o and Andrã©.
According to mb_detect_encoding(), the names in the Excel file are UTF-8.
I tried converting the names to UTF-8 with utf8_encode() and mb_convert_encoding(); I tried saving the Excel file as UTF-8 and as ISO-8859-15; I tried pasting the names into Notepad, saving them as UTF-8, and copying them back into Excel... I have tried many things and none of them has worked!
I can't convert the Excel file to .csv because it has to be an Excel Workbook; I mention this because I read that converting could be a solution.
I have run out of ideas...
UPDATE: It's very strange, because on localhost it doesn't work, but when I upload it to the server the characters are displayed correctly.
If mb_detect_encoding() reports 'UTF-8', then you only have to specify the character encoding in your HTML meta tag. Then your browser knows how to decode and display the data.
<meta charset="UTF-8">
https://www.w3schools.com/html/html_charset.asp
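In addition to the meta tag, the same charset can be declared in the HTTP response header from PHP (a small sketch; the HTTP header takes precedence over the meta tag when both are present):

<?php
// Send the Content-Type header before any output.
header('Content-Type: text/html; charset=utf-8');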
I've migrated a WordPress website from a Hostgator shared host to a Ubuntu Digital Ocean LAMP stack.
The trouble started when I exported the image files which had special characters, for example the file
operários_tarsila-1024x640.jpg.
When WordPress tries to reach the file, it displays an error. I've found the cause:
I can see via Inspect Element that WordPress tries to call http://mywebsite.com/wp-content/uploads/2013/02/oper%C3%A1rios_tarsila-1024x640.jpg and the server returns a 404 error.
However if I type this URL in the browser: http://mywebsite.com/wp-content/uploads/2013/02/opera%CC%81rios_tarsila-1024x640.jpg it works and the image is displayed.
So it seems this difference between á encoded as %C3%A1 (the precomposed character á) and as a+%CC%81 (a followed by a combining acute accent) is what causes WordPress not to display my images.
So now I have on my server thousands of accented image filenames stored as character + combining accent, while WordPress requests them as precomposed accented characters.
Is there a way to bash-rename all of them using a comparison table? Or a way to make Apache aware of those differences and point to the right file when this kind of confusion happens?
Apparently the problem is in how the backup was decompressed on the new server.
There are 2 ways to fix this:
Rename the files manually to names without accents, and then update the file names in the database (this is crazy work and can be dangerous; it would be best to back up the database first).
Upload the files using FileZilla, but set it to force UTF-8 charset encoding:
File > Site Manager > {YOUR SITE} > Charset tab > Force UTF-8
We had the same problem: Mac + FileZilla + special characters in the SK language.
The problem was fixed by using another FTP client (Cyberduck, in our case).
It seems to be a problem with FileZilla's filename encoding. Forcing UTF-8 encoding (in the FileZilla host settings) doesn't help.
So, just to touch upon this issue and a solution that worked for me... I also migrated a Wordpress site and found that all images with special characters in their filename produced a 404 after migration.
I ended up having to do the manual file renaming and edits to the database via phpMyAdmin. It was arduous and I definitely recommend backing up your database first.
In my case, I had a ton of media attachments that used the special character © in their filename.
First, I locally renamed the files by removing the character. I used 1-4a Rename: I just searched for the character in the filenames and replaced it with nothing (not even a space). Then, I removed all the old files from the /wp-content/uploads/ folder and replaced them with the new files.
Next, I went into my database to update the table values. Media attachments have info stored in both the wp_posts and wp_postmeta tables. Below is the SQL I ran to update both -
UPDATE wp_posts SET guid = REPLACE(guid, '©', '');
UPDATE wp_postmeta SET meta_value = REPLACE(meta_value, '©', '')
WHERE LOWER(RIGHT(meta_value, 5)) = '.jpeg' OR
      LOWER(RIGHT(meta_value, 4)) IN ('.jpg', '.gif', '.png');
Which, again, we are replacing the character with nothing, not even a space.
I had to use the WP plugin Regenerate Thumbnails in order to get all of the thumbnails and various attachment sizes updated, but that did the trick.
I really appreciate everyone's efforts on the posts that helped me figure it out! Hope this helps someone!
Have you tried setting the same encoding in the PHP script, MySQL, and the HTML?
PHP : http://php.net/manual/en/function.mb-internal-encoding.php
Mysql : http://php.net/manual/en/function.mysql-set-charset.php
HTML : <meta http-equiv="content-type" content="text/html; charset=utf-8" />
This looks like a charset mismatch between all these layers.
If that does not work, you will have to use a small script to rename all your pictures, using a function like:
function wd_remove_accents($str, $charset = 'utf-8')
{
    $str = htmlentities($str, ENT_NOQUOTES, $charset);
    $str = preg_replace('#&([A-Za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#', '\1', $str);
    $str = preg_replace('#&([A-Za-z]{2})(?:lig);#', '\1', $str); // for ligatures, e.g. 'œ'
    $str = preg_replace('#&[^;]+;#', '', $str); // strip any remaining entities
    return $str;
}
Source : http://www.weirdog.com/blog/php/supprimer-les-accents-des-caracteres-accentues.html
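For example, with the function above (the filename is the one from the question; the expected output is shown in a comment):

echo wd_remove_accents('operários_tarsila-1024x640.jpg');
// operarios_tarsila-1024x640.jpg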
We just had a similar problem with French characters in our WordPress deployment, and our solution was to upload the files with FileZilla from a PC instead of FileZilla from a Mac.
When I uploaded from macOS to the CentOS server, the files would only show if called in the a+%CC%81 form.
When I uploaded the files from the PC, Apache found the files in the %C3%A1 form, which is how WordPress had them encoded.
If you have WP-CLI, run this Bash script. You must change the wp_ table prefix if yours differs.
It only modifies the filenames that are NOT in FORM_D (NFD) format.
Back up your DB just in case something goes wrong.
#!/bin/bash
# Normalize WordPress attachment paths to Unicode FORM_D (NFD).
normalizeWP_PHP_Script=$(cat <<'PHP'
global $wpdb;
$rows = $wpdb->get_results("SELECT * FROM wp_postmeta WHERE meta_key='_wp_attached_file'");
foreach ($rows as $row) {
    $postId   = $row->post_id;
    $filePath = $row->meta_value;
    if (!normalizer_is_normalized($filePath, Normalizer::FORM_D)) {
        $filename_nfd = Normalizer::normalize($filePath, Normalizer::FORM_D);
        echo $filename_nfd." | ";
        $wpdb->query($wpdb->prepare(
            "UPDATE wp_postmeta SET meta_value = %s WHERE post_id = %d",
            $filename_nfd, $postId
        ));
    }
}
PHP
)
wp eval "$normalizeWP_PHP_Script"
echo " - Upload URLs normalized to NFD"
There's a plugin for this situation: check out Media File Renamer.
I am building a data import tool for the admin section of a website I am working on. The data is in both French and English, and contains many accented characters. Whenever I attempt to upload a file, parse the data, and store it in my MySQL database, the accents are replaced with '?'.
I have text files containing data (charset is iso-8859-1) which I upload to my server using CodeIgniter's file upload library. I then read the file in PHP.
My code is similar to this:
$this->upload->do_upload();
$data = array('upload_data' => $this->upload->data());
$fileHandle = fopen($data['upload_data']['full_path'], "r");
while (($line = fgets($fileHandle)) !== false) {
echo $line;
}
This produces lines with accents replaced with '?'. Everything else is correct.
If I download my uploaded file from my server over FTP, the charset is still iso-8859-1, but a diff reveals that the file has changed. However, if I open the file in TextEdit, it displays properly.
I attempted to use PHP's stream_encoding method to explicitly set my file stream to iso-8859-1, but my build of PHP does not have the method.
After running out of ideas, I tried wrapping my strings in both utf8_encode and utf8_decode. Neither worked.
If anyone has any suggestions about things I could try, I would be extremely grateful.
It's important to see whether the corruption happens before or after the query is issued to MySQL. There are too many possible things happening here to be able to pinpoint it. Are you able to output your MySQL query to check this?
Assuming that your query IS properly formed (no corruption at the stage the query is being outputted) there are a couple of things that you should check.
What is the character encoding (collation) of the database itself?
What is the charset of the connection? This may not be set up correctly in your MySQL config and can be set manually using the 'SET NAMES' command.
In my own application I issue a 'SET NAMES utf8' as my first query after establishing a connection as I am unable to change the MySQL config.
See: http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
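A small sketch of both options with plain mysqli (connection details are placeholders; mysqli's set_charset() is generally preferred in PHP because it also keeps the client library's escaping in sync):

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$mysqli->set_charset('utf8');      // preferred
// or, as described above, issue it as the first query after connecting:
$mysqli->query("SET NAMES utf8");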
Edit: if the issue is not related to MySQL, I'd check the following:
You say the charset of the file is iso-8859-1; can I ask how you are sure of this?
What happens if you save the file itself as utf8 (without BOM) and try to reprocess it?
What is the encoding of the php file that is performing the conversion? (What are you using to write your php - it may be 'managing' this for you in an undesired way)
(an aside) Are the files you are processing suitable for processing using fgetcsv instead?
http://php.net/manual/en/function.fgetcsv.php
Files uploaded to your server should come back the same on download. That means the encoding of the file (which is just a bunch of binary data) should not be changed. Instead, you should make sure you can store the binary information of that file unchanged.
To achieve that with your database, create a BLOB field. That's the right column type for it. It's just binary data.
Assuming you're using MySQL, this is the reference: The BLOB and TEXT Types, look out for BLOB.
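A hypothetical sketch with PDO (the table and column names, $filename, and $tmpPath are invented for illustration); binding the value as a LOB keeps the bytes out of any charset conversion:

<?php
$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO uploads (name, contents) VALUES (?, ?)');
$stmt->bindValue(1, $filename);
$stmt->bindValue(2, file_get_contents($tmpPath), PDO::PARAM_LOB);
$stmt->execute();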
The problem is that you are using iso-8859-1 instead of utf-8. To convert the text into the correct charset, you can use the iconv function, like so:
$output_string = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', $input_string);
iso-8859-1 only covers a limited set of Western European characters.
It would be so much better if everything were utf-8, as it handles virtually every character known to man.
Hi, I am trying to store names in an Oracle database and fetch them back using PHP and oci8.
However, if I insert an é directly into the Oracle database and use oci8 to fetch it back, I just receive an e.
Do I have to encode all special characters (including é) into HTML entities (i.e., &eacute;) before inserting them into the database, or am I missing something?
Thx
UPDATE: Mar 1 at 18:40
found this function:
http://www.php.net/manual/en/function.utf8-decode.php#85034
function charset_decode_utf_8($string) {
    // Return unchanged if the string contains no relevant high-bit bytes
    if (!preg_match("/[\200-\237]/", $string) && !preg_match("/[\241-\377]/", $string)) {
        return $string;
    }
    // 3-byte UTF-8 sequences -> numeric HTML entities
    $string = preg_replace_callback("/([\340-\357])([\200-\277])([\200-\277])/",
        fn($m) => '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';', $string);
    // 2-byte UTF-8 sequences -> numeric HTML entities
    $string = preg_replace_callback("/([\300-\337])([\200-\277])/",
        fn($m) => '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';', $string);
    return $string;
}
It seems to work, although I'm not sure it's the optimal solution.
UPDATE: Mar 8 at 15:45
Oracle's character set is ISO-8859-1.
in PHP I added:
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1");
to force the oci8 connection to use that character set.
Retrieving the é using oci8 from PHP now works! (For VARCHAR2 columns, that is; for CLOBs I had to call utf8_encode on the extracted value.)
So then I tried saving data from PHP to Oracle... and it doesn't work; somewhere along the way from PHP to Oracle the é becomes a ?.
UPDATE: Mar 9 at 14:47
So getting closer.
After adding the NLS_LANG variable, doing direct oci8 inserts with é works.
The problem is actually on the PHP side.
Using the ExtJS framework, forms are submitted encoded with encodeURIComponent.
So é is sent as %C3%A9 and then decoded back into é, but now as UTF-8.
However, its length is now 2 (strlen($my_sent_value) == 2) and not 1.
And if in PHP I try $my_sent_value == 'é', I get FALSE.
I think that if I can re-encode all these characters in PHP back into their single-byte form and then insert them into Oracle, it should work.
Still no luck though
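For reference, the conversion described in this update can be sketched with iconv (collapsing the browser's two-byte UTF-8 é back to the single ISO-8859-1 byte a WE8ISO8859P1 connection expects):

<?php
$my_sent_value = "\xC3\xA9";                                   // UTF-8 "é"
$latin1 = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $my_sent_value);
var_dump(strlen($latin1));                                     // int(1)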
UPDATE: Mar 10 at 11:05
I keep thinking I am so close (yet so far away).
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9"); works very sporadicly.
I created a small php script to test:
header('Content-Type: text/plain; charset=ISO-8859-1');
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9");
$conn= oci_connect("user", "pass", "DB");
$stmt = oci_parse($conn, "UPDATE temp_tb SET string_field = '|é|'");
oci_execute($stmt, OCI_COMMIT_ON_SUCCESS);
After running this once and logging into the Oracle database directly, I see that STRING_FIELD is set to |¿|. Obviously not what I had come to expect from my previous experience.
However, if I refresh that PHP page twice quickly... it worked!
In Oracle I correctly saw |é|.
It seems like the environment variable is not being correctly set, or not sent in time, for the first execution of the script, but is available for the second execution.
My next experiment is to export the variable into PHP's environment; however, I need to restart Apache for that... so we'll see what happens, hopefully it works.
I presume you are aware of these facts:
There are many different character sets: you have to pick one and, of course, know which one you are using.
Oracle is perfectly capable of storing text without HTML entities (&eacute;). HTML entities are used in, well, HTML. Oracle is not a web browser ;-)
You must also know that HTML entities are not bound to a specific charset; on the contrary, they're used to represent characters in a charset-independent context.
You talk interchangeably about ISO-8859-1 and UTF-8. Which charset do you want to use? ISO-8859-1 is easy to use, but it can only store text in some Latin languages (such as Spanish), and it lacks some common characters like the € symbol. UTF-8 is trickier to handle, but it can store all characters defined by the Unicode consortium (which includes everything you'll ever need).
Once you've taken the decision, you must configure Oracle to hold data in such charset and choose an appropriate column type. E.g., VARCHAR2 is fine for plain ASCII, NVARCHAR2 is good for UTF-8.
This is what I finally ended up doing to solve this problem:
Modified the profile of the daemon running PHP to have:
NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
So that the oci8 connection uses ISO-8859-1.
Then, in my PHP configuration, I set the default content-type to ISO-8859-1:
default_charset = "iso-8859-1"
When I am inserting into an Oracle Table via oci8 from PHP, I do:
utf8_decode($my_sent_value)
And when receiving data from Oracle, printing the variable should just work as so:
echo $my_received_value
However, when sending that data over AJAX, I have had to use:
utf8_encode($my_received_value)
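Pulling the steps above together into one sketch (ISO-8859-1 end to end; the variable names are the ones used above):

<?php
// Daemon environment: NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
// php.ini:            default_charset = "iso-8859-1"
$latin1 = utf8_decode($my_sent_value);  // browser sends UTF-8; convert before the oci8 INSERT
// ... INSERT $latin1 via oci8 ...
echo $my_received_value;                // already ISO-8859-1, matching default_charset
echo utf8_encode($my_received_value);   // only when replying to an AJAX call in UTF-8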
If you really cannot change the character set that Oracle will use, then how about Base64-encoding your data before storing it in the database? That way, you can accept characters from any character set and store them as ISO-8859-1 (because Base64 outputs a subset of ASCII, which maps exactly onto ISO-8859-1). Base64 encoding increases the length of the string by about a third (4 output bytes for every 3 input bytes).
If your data is only ever going to be displayed as HTML, then you might as well store HTML entities as you suggested, but be aware that a single entity can be up to 10 characters per unencoded character, e.g. &thetasym; is ϑ.
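A sketch of the Base64 approach (the encoded form is pure ASCII, so it survives any single-byte column charset; decode on the way out):

<?php
$stored   = base64_encode($input);   // e.g. UTF-8 "é" becomes "w6k="
$restored = base64_decode($stored);  // the original bytes, unchanged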
I had to face this problem: Latin American special characters are stored as "?" or "¿" in my Oracle database... I can't change the NLS_CHARACTER_SET because we're not the database owners.
So, I found a workaround :
1) ASP.NET code
Create a function that converts a string to hexadecimal characters:
public string ConvertirStringAHex(String input)
{
Encoding encoding = System.Text.Encoding.GetEncoding("ISO-8859-1");
Byte[] stringBytes = encoding.GetBytes(input);
StringBuilder sbBytes = new StringBuilder(stringBytes.Length);
foreach (byte b in stringBytes)
{
sbBytes.AppendFormat("{0:X2}", b);
}
return sbBytes.ToString();
}
2) Apply the function above to the variable you want to encode, like this:
myVariableHex = ConvertirStringAHex( myVariable );
In ORACLE, use the following:
PROCEDURE STORE_IN_TABLE( iTEXTO IN VARCHAR2 )
IS
BEGIN
INSERT INTO myTable( SPECIAL_TEXT )
VALUES ( UTL_RAW.CAST_TO_VARCHAR2( HEXTORAW( iTEXTO ) ) );
COMMIT;
END;
Of course, iTEXTO is the Oracle parameter which receives the value of "myVariableHex" from ASP.NET code.
Hope it helps... if there's something to improve, please don't hesitate to post your comments.
Sources:
http://www.nullskull.com/faq/834/convert-string-to-hex-and-hex-to-string-in-net.aspx
https://forums.oracle.com/thread/44799
If the server-side code (PHP in this case) and the Oracle database use different charsets, you should set the server-side code's charset on the Oracle connection; Oracle then performs the conversion.
Example: Let's assume:
php charset utf-8 (default).
Oracle charset AMERICAN_AMERICA.WE8ISO8859P1
In the connection to Oracle made from PHP, you should set UTF8 as the charset (the fourth parameter):
oci_pconnect("USER", "PASS", "URL", "UTF8");
Doing this, you write code in UTF-8 (without doing any conversion at all) and get UTF-8 back from the database through this connection.
So you could write something like SELECT * FROM SOME_TABLE WHERE TEXT = 'SOME TEXT LIKE áéíóú Ñ' and also get UTF-8 text as a result.
According to the PHP documentation, by default the Oracle client (oci_pconnect) takes the NLS_LANG environment variable from the operating system. Some Debian-based systems have no NLS_LANG environment variable, so I think the Oracle client falls back to its own default charset (AMERICAN_AMERICA.WE8ISO8859P1) if we don't specify the fourth parameter.