php's iconv() not displaying proper strings anymore - php

I have been using iconv to convert data from my company's database (windows-1250) to UTF8. It all worked fine until recently.
I'm not really sure what happened, as I've noticed the change only recently. The problem is that iconv seems to stop working well - it still throws notices when I use bad encoding name.
Earlier, when I saved a string to the db with
htmlspecialchars(iconv('UTF-8', 'windows-1250', $string), ENT_QUOTES) it was fine. Now only question marks are written to my db instead of e.g. ąęś.
When I correct them via PL/SQL Developer and read them via php: htmlspecialchars_decode(iconv('windows-1250', 'UTF-8', $string), ENT_QUOTES)
I receive aes. I tried to set the encoding in php, right before string output:
header('Content-Type: text/html; charset=utf-8');, but it didn't help.
My software is:
PHP 5.3.15 (cli)
iconv (GNU libc) 2.15
Apache/2.2.22
openSUSE 12.2
Oracle 10.2.0.4
oci

After some help from hakre, I was able to solve my problem.
My strings were already transliterated when I selected them from the database. PHP's oci_connect() fourth parameter is character set. If you don't provide it, it is taken from environment variable NLS_LANG.
I had neither fourth parameter nor environment variable, so my database connection charset was wrong. Once I added NLS_LANG variable it started to work fine.
Thank you hakre! : )

Related

PHP HTML Entity Decode Issues

I'm on PHP 5.5.8 and saw that I was getting weird data, turns out when I decode html entities I'm getting some type of corrupted characters or something.
echo html_entity_decode('é');
Displays ├® in my terminal and é in a browser, when it should be é. I've used html_entity_decode('é', ENT_QUOTES, 'UTF-8') and defined my default charset to be UTF-8 as well. The thing is I've tried it on another server and it worked fine. But on my local environment it's failing so .. probably something to do with some settings but I don't know where to look. Can anyone help?
It seems that you have a mismatch of encodings, check in your php.ini, if your default_charset there is not utf-8, that will mess things up.
You can also set it at run time with ini_set.
ini_set ( 'default_charset', 'utf-8' );

Ubuntu encoding of new files

I'm searching there for a long time, but without any helpful result.
I'm developing a PHP project using eclipse on a Ubuntu 11.04 VM. Every thing works fine. I've never need to look for the file encoding. But after deploying the project to my server, all contents were shown with the wrong encoding. After a manual conversion to UTF8 with Notepad++ my problems were solved.
Now I want to change it in my Ubuntu VM, too. And there's the problem. I've checked the preferences in Eclipse but every property ist set to UTF8: General content types, workspace, project settings, everything ...
If I look for the encoding on the terminal, it says "test_new.dat: text/plain; charset=us-ascii". All files are saved to ascii format. If I try to create a new file with the terminal ("touch") it's also the same.
Then I've tried to convert the files with iconv:
iconv -f US-ASCII -t UTF8 -o test.dat test_new.dat
But the encoding doesn't change. Especially PHP files seems to be resistant. I have some *.ini files in my project for which a conversion works?!
Any idea what to do?
Here are my locale settings of Ubuntu:
LANG=de_DE.UTF-8
LANGUAGE=de_DE:en
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
I was also wondering about character encoding and found something that might be usefull here.
When I create a new empty .txt-file on my ubuntu 12.04 and ask for its character encoding with: "file -bi filename.txt" it shows me: charset=binary. After opening it and writing something inside like "haha" I saved it using "save as" and explicitly chose UTF-8 as character encoding. Now very strangely it did not show me charset=UTF-8 after asking again, but returned charset=us-ascii. This seemed already strange. But it got even stranger, when I did the whole thing again but this time included some german specific charakters (ä in this case) in the file and saved again (this time without saving as, I just pressed save). Now it said charset=UTF-8.
It therefore seems that at least gedit is checking the file and downgrading from UTF-8 to us-ascii if there is no need for UTF-8 since the file can be encoded using us-ascii.
Hope this helped a bit even though it is not php related.
Greetings
UTF-8 is compatible with ASCII. An ASCII text file is therefore also valid UTF-8, and a conversion from ASCII to UTF-8 is a no-op.

UTF8 characters from database don't show up properly in the browser - MySQL & PHP CodeIgniter

My database and tables are set to utf8_general_ci collation and utf8 charset. CodeIgniter is set to utf8. I've added meta tag charset=utf8, and I'm still getting something like: квартиры instead of cyrillic letters...
The same code running on the local machine works fine - Mac OSX. It's only breaking in the production machine, which is Ubuntu 11.10 64bit in AWS EC2. Static content from the .php files show up correctly, only the data coming from the database are messed up. Example page: http://dev.uzlist.com/browse/cat/nkv
Any ideas why?
Thanks.
FYI:
When I do error_log() the data coming from the database, it's the same values I'm seeing on the page. Hence, it's not the browser-server issue. It's something between mysql and php, since when I run SELECT * FROM categories, it shows the data in the right format. I'm using PHP CodeIgniter framework for database connection and query and as mentioned here, I have configured it to use utf8 connection and utf8_general_ci collation.
Make sure your my.cnf (likely to be in /etc/) has the following entries :
[mysqld]
default-character-set=utf8
default-collation=utf8_general_ci
character-set-server=utf8
collation-server=utf8_general_ci
init-connect='SET NAMES utf8'
[client]
default-character-set=utf8
You'll need to restart the mysql service once you make your changes.
Adding my comments in here to make this a little clearer.
Make sure the following HTTP header is being set so the browser knows what charset to expect.
Content-type: text/html; charset=UTF-8
Also try adding this tag into the top of your html <head> tag
<meta http-equiv="Content-type" value="text/html; charset=UTF-8" />
To make the browser show up correctly.you should check three points:
encoding of your script file.
encoding of connection.
encoding of database or table schema.
if all of these are compatible, you'll get the page you want.
The original data has been encoded as UTF-8, the result interpreted in Windows-1252 and then UTF-8 encoded again. This is really bad; it isn't about a simple encoding mismatch that a header would fix. Your data is actually broken.
If the data is ok in the database (check with SELECT hex(column) FROM myTable) to see if it was double encoded already in the database), then there must be your code that is converting it to UTF-8 on output.
Search your project for uses of function utf8_encode, convert_to_utf8, or just iconv or mb_convert_encoding. Running
$ grep -rn "\(utf8_\(en\|de\)code\|convert_to_utf8\|iconv\|mb_convert_encoding\)" .
On your application's /application folder should be enough to find something.
Also see config values for these:
<?php
var_dump(
ini_get( "mbstring.http_output" ),
ini_get( "mbstring.encoding_translation" )
);
Well, if you absolutely and positively sure that your mysql client encoding is set to utf8, there are 2 possible cases. One - double encoding - described by Esailija.
But there is another one: you have your data actually encoded in 1251, not in utf-8.
In this case you have to either recode your data or set proper encoding on the tables. Though it is not one button push task
Here is a manual (in russian) exаctly for that case: http://phpfaq.ru/charset#repair
In short, you have to dump your table, using the same encoding set on the table (to avoid recoding), backup that dump in safe place, then change table definitions to reflect the actual encoding and then load it back.
Potentially this may also be caused by the mbstring extension not being installed (which would explain a difference between your dev and production environments)
Check out this post, might give you a few more answers.
Try mysql_set_charset('utf8') after the mysql connect. Then it should works.
After 2 days of fighting this bug, finally figured out the issue. Thanks for #yourcommonsense, #robsquires, and a friend of mine from work for good resources that helped to debug the issue.
The issue was that at the time of the sql file dump to the database (import), charset for server, database, client, and connection was set to latin1 (status command helped to figure that out). So the command line was set to latin1 as well, which is why it was showing the right characters, but the connection with the PHP code was UTF8 and it was trying to encode it again. Ended up with double encoding.
Solution:
mysqldump the tables and the data (while in latin1)
dump the database
set the default charsets to UTF8 in /etc/my.cnf as Rob Squires mentioned
restart the mysql
create the database again with the right charset and collation
dump the file back into it
And it works fine.
Thanks all for contribution!

UTF-8 problems with PHP DOM on Debian server

I have a problem with UTF-8 strings in PHP on my Debian server.
Update in details
I´ve done a little more testing and the situation is now more specific. I updated the title and details to fit it better the situation. Thanks for the responses and sorry that the problem wasn´t described clearly. The following script works fine on my local Windows machine but not on my Debian server:
<?php
header("Content-Type: text/html; charset=UTF-8");
$string = '<html><head></head><body>UTF-8: ÄÖÜ<br /></body</html>';
$document = new DOMDocument();
#$document->loadHTML($string);
echo $document->saveHTML();
echo $string;
As expected on my local machine the output is:
UTF-8: ÄÖÜ
UTF-8: ÄÖÜ
On my server the output is:
UTF-8: ÄÖÜ
UTF-8: ÄÖÜ
I wrote the script in Notepad++ in UTF-8 without BOM and transferred it over SSH. As noticed by guido the string itself is properly UTF-8 encoded. There seems to be a problem with PHP DOM or maybe libxml. And the reason must be some setting since it is machine dependant.
Original question
I work locally with XAMPP on Windows and everything is fine. But when I deploy my project on the server UTF-8 strings get all messed up. In fact when I upload this test script
echo utf8_encode('UTF-8 test: ÄÖÜ');
I get "ÃÃÃ". Also when I connect with putty to the server I cannot write umlauts (ÄÖÜ) correctly in the shell. I have no idea if this issue is even PHP related.
Check for your apache's AddDefaultCharset setting.
On standard debian apache distributions, the setting can be modified in /etc/apache2/conf.d/charset.
Please verify that your file is byte-to-byte the same as on your local machine. FTP transfer in text mode could have messed it up. You may want to try binary one.
EDIT: answer for updated question:
<?php
header("Content-Type: text/html; charset=UTF-8");
$string = '<html><head>'
.'<meta http-equiv="content-type" content="text/html; charset=utf-8">'
.'</head><body>UTF-8: ÄÖÜ<br /></body</html>';
$document = new DOMDocument();
#$document->loadHTML($string);
echo $document->saveHTML();
echo $string;
?>
I suspect your input string may be already UTF-8. Try:
setlocale(LC_CTYPE, 'de_DE.UTF-8');
$s = "UTF-8 test: ÄÖÜ";
if (mb_detect_encoding($s, "UTF-8") == "UTF-8") {
echo "No need to encode";
} else {
$s = utf8_encode($s);
echo "Encoded string $s";
}
Are you explicitly sending a content-type header? If you omit it, it's likely that Apache is sending one for you. If the file is served with a Latin-1 encoding (by Apache) and the browser reads it as such, then your UTF-8 characters will be malformed.
Try this:
<?php
echo "Drop some UTF-8 characters here.";
Then this:
<?php
header("Content-Type: text/html; charset=UTF-8");
echo "Drop some UTF-8 characters here.";
The second should work, if the first doesn't. You may also want to save the file as a UTF-8-encoded file, if it's not already.
If your database characters are messed up, try setting the (My)SQL connection encoding.
Try changing the defualt charset on the server in your php.ini file:
default_charset = "UTF-8"
also, make sure your are sending out the proper content type headers as utf-8
In my experience with utf-8, if you properly configure the php mbstring module and use the mbstring functions, and also make sure your database connection is using utf-8 then you won't have any problems.
The db part can be done for mysql with the query "SET NAMES 'utf8'"
I usually started an output buffer using mbstring to handle the buffer. This is what I use in production websites and it is a very solid approach. Then send the buffer when you have finished rendering your content.
Let me know if you would like the sampe code for that.
Another easy trick to just see if it is the wrong headers being sent out by php or the webserver is to use the view->encoding menu on your browser and see if it is utf-8. If it's not and you switch it to utf-8 and everything looks ok then it is a problem with your headers or content type. If it is already utf-8 and the text is screwed up then it is something going wrong in your code or db connection. If you are using mysql make sure the tables and columns involved are also utf-8
The cause of the problem was an old version of libxml (2.6.32.) on the server. On the development machine it was 2.7.3. I upgraded libxml to an unstable package resulting in version 2.7.8. The problems are now gone.

filename encoding issue

I am getting a file with a faroese name and trying to save it in a PHP script:
2010_08_Útflutningur.xls
In Ubuntu 10.04 LTS is saving it as :
2010_08_�tflutningur.xls (invalid encoding)
I've installed and run utf8-migration-tool, but with no effect.
Is this a ubuntu error that I can fix or I just have to give up and modify the name in php?
Thanks
Ubuntu uses UTF8 internally for its filenames. In this particular case utf8_encode does the trick as the original filename is ISO-8859-1 encoded. In other cases I could use iconv, and detect the encoding if is unknown.
"Ú" it's not the ubuntu error.Basically your "Ú" chartered takes as a unreadable special charecter.So it's better to modify the name.

Categories