Saving a file with umlauts on a CentOS machine - php

We have a CentOS 6 machine with an Apache webserver that accepts file uploads from a rich Javascript client. The files are saved with php's move_uploaded_file
The client and the server (php) files are all encoded in iso-8859-1, and the database on the server as well. Also, the html output declares iso-8859-1 as charset.
File uploading works fine so far, except files with umlauts (or other yet unknown special characters) result in an error. For example, the file 1.Nachtrag Gemeinde Höchst.pdf gets echoed correctly in the application, and also the link which is produced to download the file has the correct (url-)encoding:
http://ourdomain/saba/data/dok/00000092/1.Nachtrag%20Gemeinde%20H%C3%B6chst.pdf
But when clicking on this link, a 404 error appears. When looking for the file in the shell, it gets displayed as 1.Nachtrag Gemeinde H?chst.pdf, which indicates some sort of wrong encoding, although it might be just because the shell has a utf-8 encoding.
What did we forget?

As #Amadan has correctly pointed out, the filename needed to be converted to utf-8 before saving, i.e.:
$filename = iconv('ISO-8859-1', 'UTF-8', $filename);
$is_successful = #move_uploaded_file($tmp_filename, $ordnername . DIRECTORY_SEPARATOR . $filename);

Related

File encoding incorrectly detected

On Linux server if user uploads a CSV file created in MS Office Excel (thus having Windows 1250 [or cp1250 or ASCII if you want to] encoding) all to me known methods of detecting the file encoding return incorrect ISO-8859-1 (or latin1 if you want to) encoding.
This is crucial for the encoding conversion to final UTF-8.
Methods I tried:
cli
file -i [FILE] returning iso-8859-1
file -b [FILE] returning iso-8859-1
vim
vim [FILE] and then :set fileencoding? returning latin1
PHP
mb_detect_encoding(file_get_contents($filename)) returning (surprisingly) UTF-8
while the file is indeed in WINDOWS-1250 (ASCII) as proves i.e. opening the CSV file in LibreOffice - Math asks for file encoding and selecting either of ISO-8859-1 or UTF-8 results in wrongly presented characters while selecting ASCII displays all characters correctly!
How to correctly detect the file encoding on Linux server (Ubuntu) (best if possible with default Ubuntu utilities or with PHP)?
The last option I can think of is to detect the user agent (and user OS) when uploading the file and it is windows then automatically assume the encoding is ASCII...

Ubuntu encoding of new files

I'm searching there for a long time, but without any helpful result.
I'm developing a PHP project using eclipse on a Ubuntu 11.04 VM. Every thing works fine. I've never need to look for the file encoding. But after deploying the project to my server, all contents were shown with the wrong encoding. After a manual conversion to UTF8 with Notepad++ my problems were solved.
Now I want to change it in my Ubuntu VM, too. And there's the problem. I've checked the preferences in Eclipse but every property ist set to UTF8: General content types, workspace, project settings, everything ...
If I look for the encoding on the terminal, it says "test_new.dat: text/plain; charset=us-ascii". All files are saved to ascii format. If I try to create a new file with the terminal ("touch") it's also the same.
Then I've tried to convert the files with iconv:
iconv -f US-ASCII -t UTF8 -o test.dat test_new.dat
But the encoding doesn't change. Especially PHP files seems to be resistant. I have some *.ini files in my project for which a conversion works?!
Any idea what to do?
Here are my locale settings of Ubuntu:
LANG=de_DE.UTF-8
LANGUAGE=de_DE:en
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
I was also wondering about character encoding and found something that might be usefull here.
When I create a new empty .txt-file on my ubuntu 12.04 and ask for its character encoding with: "file -bi filename.txt" it shows me: charset=binary. After opening it and writing something inside like "haha" I saved it using "save as" and explicitly chose UTF-8 as character encoding. Now very strangely it did not show me charset=UTF-8 after asking again, but returned charset=us-ascii. This seemed already strange. But it got even stranger, when I did the whole thing again but this time included some german specific charakters (ä in this case) in the file and saved again (this time without saving as, I just pressed save). Now it said charset=UTF-8.
It therefore seems that at least gedit is checking the file and downgrading from UTF-8 to us-ascii if there is no need for UTF-8 since the file can be encoded using us-ascii.
Hope this helped a bit even though it is not php related.
Greetings
UTF-8 is compatible with ASCII. An ASCII text file is therefore also valid UTF-8, and a conversion from ASCII to UTF-8 is a no-op.

echo and UTF-8 (PHP)

I have installed Apache on my server (I wasn't using Apache) and special characters started to show wrong.
So I changed every file to UTF-8, configured MySQL to work with UTF-8 and everything worked fine. However, my Python app (which retrieves some information from the website) doesn't work properly.
For example, I had a file "test.php" which returned either 0 or 1. Python code then did whatever with that result.
But now, my Python app doesn't receive "0", I don't know what it gets from the website. I made the app send a GET request to my site with what it was getting and it sent me this: "???0".
What can I do? I tried to change the header to send the result as ISO-8859-1 (as it was before) but isn't working either.
It's BOM symbol. Remove this symbol from script in Notepad++ editor (Menu -> Encoding -> Encode in UTF-8 without BOM).

UTF-8 problems with PHP DOM on Debian server

I have a problem with UTF-8 strings in PHP on my Debian server.
Update in details
I´ve done a little more testing and the situation is now more specific. I updated the title and details to fit it better the situation. Thanks for the responses and sorry that the problem wasn´t described clearly. The following script works fine on my local Windows machine but not on my Debian server:
<?php
header("Content-Type: text/html; charset=UTF-8");
$string = '<html><head></head><body>UTF-8: ÄÖÜ<br /></body</html>';
$document = new DOMDocument();
#$document->loadHTML($string);
echo $document->saveHTML();
echo $string;
As expected on my local machine the output is:
UTF-8: ÄÖÜ
UTF-8: ÄÖÜ
On my server the output is:
UTF-8: ÄÖÜ
UTF-8: ÄÖÜ
I wrote the script in Notepad++ in UTF-8 without BOM and transferred it over SSH. As noticed by guido the string itself is properly UTF-8 encoded. There seems to be a problem with PHP DOM or maybe libxml. And the reason must be some setting since it is machine dependant.
Original question
I work locally with XAMPP on Windows and everything is fine. But when I deploy my project on the server UTF-8 strings get all messed up. In fact when I upload this test script
echo utf8_encode('UTF-8 test: ÄÖÜ');
I get "ÃÃÃ". Also when I connect with putty to the server I cannot write umlauts (ÄÖÜ) correctly in the shell. I have no idea if this issue is even PHP related.
Check for your apache's AddDefaultCharset setting.
On standard debian apache distributions, the setting can be modified in /etc/apache2/conf.d/charset.
Please verify that your file is byte-to-byte the same as on your local machine. FTP transfer in text mode could have messed it up. You may want to try binary one.
EDIT: answer for updated question:
<?php
header("Content-Type: text/html; charset=UTF-8");
$string = '<html><head>'
.'<meta http-equiv="content-type" content="text/html; charset=utf-8">'
.'</head><body>UTF-8: ÄÖÜ<br /></body</html>';
$document = new DOMDocument();
#$document->loadHTML($string);
echo $document->saveHTML();
echo $string;
?>
I suspect your input string may be already UTF-8. Try:
setlocale(LC_CTYPE, 'de_DE.UTF-8');
$s = "UTF-8 test: ÄÖÜ";
if (mb_detect_encoding($s, "UTF-8") == "UTF-8") {
echo "No need to encode";
} else {
$s = utf8_encode($s);
echo "Encoded string $s";
}
Are you explicitly sending a content-type header? If you omit it, it's likely that Apache is sending one for you. If the file is served with a Latin-1 encoding (by Apache) and the browser reads it as such, then your UTF-8 characters will be malformed.
Try this:
<?php
echo "Drop some UTF-8 characters here.";
Then this:
<?php
header("Content-Type: text/html; charset=UTF-8");
echo "Drop some UTF-8 characters here.";
The second should work, if the first doesn't. You may also want to save the file as a UTF-8-encoded file, if it's not already.
If your database characters are messed up, try setting the (My)SQL connection encoding.
Try changing the defualt charset on the server in your php.ini file:
default_charset = "UTF-8"
also, make sure your are sending out the proper content type headers as utf-8
In my experience with utf-8, if you properly configure the php mbstring module and use the mbstring functions, and also make sure your database connection is using utf-8 then you won't have any problems.
The db part can be done for mysql with the query "SET NAMES 'utf8'"
I usually started an output buffer using mbstring to handle the buffer. This is what I use in production websites and it is a very solid approach. Then send the buffer when you have finished rendering your content.
Let me know if you would like the sampe code for that.
Another easy trick to just see if it is the wrong headers being sent out by php or the webserver is to use the view->encoding menu on your browser and see if it is utf-8. If it's not and you switch it to utf-8 and everything looks ok then it is a problem with your headers or content type. If it is already utf-8 and the text is screwed up then it is something going wrong in your code or db connection. If you are using mysql make sure the tables and columns involved are also utf-8
The cause of the problem was an old version of libxml (2.6.32.) on the server. On the development machine it was 2.7.3. I upgraded libxml to an unstable package resulting in version 2.7.8. The problems are now gone.

$_GET encoding problem with cyrillic text

I'm trying this code (on my local web server)
<?php
echo 'the word is / думата е '.$_GET['word'];
?>
but I get corrupted result when enter ?word=проба
the word is / думата е ����
The document is saved as 'UTF-8 without BOM' and headers are also UTF-8.
I have tried urlencode() and urldecode() but the effect was same.
When upload it on web server, works fine...
What if you try sending a HTTP Content-type header, to indicate the browser which encoding / charset your page is generating ?
For instance, something like this might help :
header('Content-type: text/html; charset=UTF-8');
echo 'the word is / думата е '.$_GET['word'];
Of course, this is if you are generating HTML -- you probably are.
Considering there is a configuration setting at the server's level that defines which encoding is sent by default, maybe the default encoding on your server is OK -- while the one on your local server is not.
Sending such a header by yourself would solve the problem : it would make sure the encoding is always set properly.
I suppose you are using the Apache web server.
There is a common problem with Apache configuration - a line with "AddDefaultCharset" in the config should be commented out (add # in the begining of the line, or replace the line with "AddDefaultCharset off") because it "overrides any encoding given in the files in meta http-equiv or xml encoding tags".
In my current installation (Apache2 # Ubuntu Linux) the line is found in "/etc/apache2/conf.d/charset" but in other (Linux/Unix) setups can be in "/etc/apache2/httpd.conf", or "/etc/apache/httpd.conf" (if you are using Apache 1). If you don't find it in these files you can search for it with "cd /etc/apache2 ; grep -r AddDefaultCharset *" (for Apache 2 # Unix/Linux).
Take a look at Changing the server encoding. An excellent read!
Cheers!
If You recieve $_GET from AJAX make sure that Your blablabla.js file in UTF-8 encode. Also You can use iconv("cp1251","utf8",$_GET['word']); to display your $_GET['word'] in UTF-8
I just had the issue and it sometimes happens if you filter the GET variable with htmlentities(). It seems like this function converts cyrillic characters into weird stuff.

Categories