File encoding incorrectly detected - php

On a Linux server, if a user uploads a CSV file created in MS Office Excel (thus encoded in Windows-1250, a.k.a. cp1250), every method of detecting the file encoding that I know of returns the incorrect ISO-8859-1 (latin1) encoding.
This matters because the file must ultimately be converted to UTF-8.
Methods I tried:
CLI:
file -i [FILE] returns iso-8859-1
file -b [FILE] returns iso-8859-1
vim:
vim [FILE], then :set fileencoding? returns latin1
PHP:
mb_detect_encoding(file_get_contents($filename)) returns (surprisingly) UTF-8
Yet the file is indeed in Windows-1250, as proven by opening the CSV file in LibreOffice: the import dialog asks for the file encoding, and selecting either ISO-8859-1 or UTF-8 results in wrongly displayed characters, while selecting ASCII displays all characters correctly!
How can I correctly detect the file encoding on a Linux server (Ubuntu), ideally with default Ubuntu utilities or with PHP?
The last option I can think of is to detect the user agent (and thus the user's OS) when the file is uploaded, and if it is Windows, automatically assume the Windows-1250 encoding...
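There may be no reliable byte-level answer here: Windows-1250, ISO-8859-1, and ISO-8859-2 are all single-byte encodings that agree on the ASCII range, so heuristic detectors like file(1) or mb_detect_encoding() often cannot tell them apart (and as far as I know, mbstring does not support Windows-1250 at all, which helps explain the surprising result above). On the command line, uchardet (Ubuntu package uchardet) typically guesses Windows code pages better than file(1), though any single-byte detection remains a guess. If uploads are known to be either UTF-8 or Windows-1250, a more robust approach is to validate rather than guess. A minimal sketch under that assumption:
// Sketch: assume uploads are either valid UTF-8 or Windows-1250 from Excel.
// Anything that fails a strict UTF-8 check is treated as Windows-1250.
function csv_to_utf8(string $raw): string
{
    // mb_check_encoding() performs a strict validity test, unlike the
    // order-dependent guessing of mb_detect_encoding().
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // iconv understands CP1250 even though mbstring does not.
    $converted = iconv('CP1250', 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException('Could not convert upload to UTF-8');
    }
    return $converted;
}

$utf8 = csv_to_utf8(file_get_contents($filename));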

Related

How to force file saving with ISO-8859-1 encoding instead of UTF-8

I have to enable XML file downloads with ISO-8859-1 encoding (I know that UTF-8 is much better, but our partner has strict encoding requirements and we cannot force them to change their policy).
Server background:
Google Chrome 71.0.3578.98 (Official Build) (64-bit)
Ubuntu 16.04
nginx
php 7.2
Symfony 4.0.15
Controller returns a response with proper charset:
return (new Response($xml->content(), Response::HTTP_CREATED, ['Content-Type' => $xml->contentType()]))
->setCharset($xml->charset());
It looks perfectly fine (at least in Chrome DevTools the correct response header is shown). But the problem is that the file is stored in the file system with UTF-8 encoding:
$ file --mime test.xml
test.xml: application/xml; charset=utf-8
and the XML file renders incorrectly when opened in the browser:
<INSIGMA>
    <AktuarMed>
        <Person>
            <Name>Hans Müller</Name>
            <Surname>Müller</Surname>
            <Forename>Hans</Forename>
        </Person>
    </AktuarMed>
</INSIGMA>
Surname should read Müller, but it is displayed wrongly (the UTF-8 bytes for ü render as Ã¼ when interpreted as ISO-8859-1). If I convert the file to the expected encoding, it displays correctly (note that redirecting iconv output straight back into its input file would truncate it first, so go through a temporary file):
$ iconv -f UTF-8 -t ISO-8859-1 test.xml > test2.xml && mv test2.xml test.xml
$ file --mime test.xml
test.xml: application/xml; charset=iso-8859-1
TL;DR: So the questions are:
Why is this file stored with UTF-8 encoding at all, if the server responds that the ISO-8859-1 charset should be used?
Do I need to send some extra headers to force downloading the file with the ISO-8859-1 charset? Or
is this the default behaviour of the browser? Or
is this the default behaviour of the operating system?
At which step should I catch this problem and look for a solution?
You can try the following:
1. Create an xml file that should act as your server response in gedit; in 'Save as', select the proper character encoding (ISO-8859-1) at the bottom of the dialog.
2. Put the file into nginx and make it accessible (public) under some URL (like http://localhost/sample-response.xml).
3. Access it with your browser.
4. Ensure that it is correct on the client side (after saving it with the browser).
5. Record the xml file download request-response log as plain text (with wireshark, tcpflow, or something else).
6. Now request the xml file generated by your PHP app and record its request-response log as in step 5.
7. Compare the request-response logs from steps 5 and 6 with a file comparison tool (meld, kdiff3, or something else).
After that you might see where the problem is.
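One thing worth knowing before diving into packet captures: Symfony's Response::setCharset() only changes the charset announced in the Content-Type header; it never transcodes the response body, and the browser saves the downloaded bytes exactly as they were sent. So if $xml->content() is a UTF-8 string, the saved file will be UTF-8 no matter what the header says. A minimal sketch of converting the payload itself (assuming the content is UTF-8 in memory):
// setCharset() only labels the response; the body must be transcoded too.
$body = mb_convert_encoding($xml->content(), 'ISO-8859-1', 'UTF-8');

return (new Response($body, Response::HTTP_CREATED, ['Content-Type' => $xml->contentType()]))
    ->setCharset('ISO-8859-1');
If the XML itself carries an encoding declaration, it should match as well (encoding="ISO-8859-1").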

Bad encoding from database in php file using IIS 8.5

I must migrate a large database and large PHP systems from PHP 4 to PHP 5.
The database tables are stored in UTF-8 format, but the data contain Windows-1257.
Every page declares the encoding in its header, but I get data from the database like this: AutomobiliĆø stovĆ«jimo.
var_dump(mysql_client_encoding($connect)); returns utf8.
File encoding: windows-1257.
On an Apache server (tried with WAMP on Windows 7 and on Windows Server 2012) I get normal data, but on IIS I don't. Maybe IIS doesn't understand the file encoding, or something similar.
I give up, and I need your help...
SOLVED: I changed the MySQL configuration (my.ini) and set character_set_server from utf8 to latin1.
Now var_dump(mysql_client_encoding($connect)); returns latin1,
and all projects work fine.
Database tables are stored in UTF-8 format, but the data contain
Windows-1257.
Try converting the data from Windows-1257 to UTF-8 with something like:
$encoded = iconv("CP1257", "UTF-8", $string);
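Relying on character_set_server in my.ini works, but it depends on a server-wide default. A more explicit variant (a sketch for the legacy mysql extension used in PHP 5; $connect is assumed to be the connection resource and $string a fetched value) is to pin the connection charset in code, so the behaviour no longer differs between the Apache and IIS hosts:
// Pin the client/connection character set explicitly instead of relying
// on the server default from my.ini; latin1 passes the stored bytes
// through untranscoded, reproducing the my.ini fix above.
mysql_set_charset('latin1', $connect);

// Alternatively, convert the raw Windows-1257 bytes to UTF-8 after reading:
$utf8 = iconv('CP1257', 'UTF-8', $string);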

Ubuntu encoding of new files

I've been searching for a long time now, but without any helpful result.
I'm developing a PHP project using Eclipse on an Ubuntu 11.04 VM. Everything works fine, and I never needed to care about file encoding. But after deploying the project to my server, all content was shown with the wrong encoding. After a manual conversion to UTF-8 with Notepad++, my problems were solved.
Now I want to fix it in my Ubuntu VM too, and there's the problem. I've checked the preferences in Eclipse, but every property is set to UTF-8: general content types, workspace, project settings, everything...
If I check the encoding in the terminal, it says "test_new.dat: text/plain; charset=us-ascii". All files are reported as ASCII. If I create a new file in the terminal ("touch"), it's the same.
Then I tried to convert the files with iconv:
iconv -f US-ASCII -t UTF8 -o test.dat test_new.dat
But the encoding doesn't change. PHP files especially seem to be resistant, yet I have some *.ini files in my project for which the conversion works?!
Any idea what to do?
Here are my locale settings of Ubuntu:
LANG=de_DE.UTF-8
LANGUAGE=de_DE:en
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
I was also wondering about character encoding and found something that might be useful here.
When I create a new, empty .txt file on my Ubuntu 12.04 and ask for its character encoding with "file -bi filename.txt", it shows: charset=binary. After opening it, writing something like "haha" inside, and saving it via "Save as" with UTF-8 explicitly chosen as the character encoding, it strangely did not report charset=UTF-8 when asked again, but charset=us-ascii. It got even stranger when I did the whole thing again, but this time included some German-specific characters (ä in this case) in the file and saved again (this time without "Save as", I just pressed save). Now it said charset=UTF-8.
It therefore seems that at least gedit checks the file and downgrades from UTF-8 to us-ascii if there is no need for UTF-8, since the file can be encoded entirely in us-ascii.
Hope this helped a bit, even though it is not PHP-related.
Greetings
UTF-8 is compatible with ASCII. An ASCII text file is therefore also valid UTF-8, and a conversion from ASCII to UTF-8 is a no-op.
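A small illustration of that point (a sketch; any PHP with mbstring and iconv will do): for pure-ASCII content there is nothing to convert, because every ASCII byte is already a valid UTF-8 byte with the same meaning. That is why the iconv call above changes nothing, and why file keeps saying us-ascii until a non-ASCII character such as ä appears.
$ascii = "haha";  // only code points below 128

// The same bytes validate as both ASCII and UTF-8.
var_dump(mb_check_encoding($ascii, 'ASCII'));  // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8'));  // bool(true)

// Converting ASCII to UTF-8 leaves the bytes untouched.
var_dump(iconv('US-ASCII', 'UTF-8', $ascii) === $ascii);  // bool(true)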

What encoding does MAC Excel use?

I have a client who wants to export a .csv file to the server, where it will be parsed by PHP in order to generate a table with its data. I'm using iconv to convert to the appropriate encoding (UTF-8). Unfortunately I'm on Windows, so I don't know what the source encoding is.
What encoding would Mac Excel use to generate a .csv? I've tried so many different combinations, but none work on the French accents, which are, as far as I know, not arranged the same way in the Mac's charset as in UTF-8.
For example:
The correct display should be:
'Délégation'
Most encodings I try (including utf8_encode()) give:
'DÈlÈgation'
macintosh to UTF-8 gives:
'D»l»gation'
If I open the .csv file (the one saved from the Mac) on my PC, I see the French 'é' accents as 'È'. So is it possible that saving the file onto my computer (or server) forced the file directly to UTF-8, so that the 'È' are now the literal character values instead of a UTF-8 misinterpretation?
Hex Dump
Using bin2hex(), the hex dump for the string 'DÈlÈgation 1' is:
44c86cc8676174696f6e2031
In fact, I'm assuming that it is DÈlÈgation and not Délégation because if I open the .csv file in Notepad (on my PC), it shows È and not é.
A common encoding for Mac programs to use is MacRoman.
Would it be possible for your client to install the trial version of Apple Numbers from the Apple website, open the .csv file in Numbers, then go to "File" > "Export" > "CSV", pick either "UTF-8" or "Windows Latin 1", and resend you the UTF-8 and Windows Latin 1 files?
The Numbers application on a Mac sometimes resolves problematic issues encountered with Excel.

Saving a file with umlauts on a CentOS machine

We have a CentOS 6 machine with an Apache web server that accepts file uploads from a rich JavaScript client. The files are saved with PHP's move_uploaded_file().
The client and the server-side (PHP) files are all encoded in ISO-8859-1, and so is the database on the server. The HTML output also declares ISO-8859-1 as its charset.
File uploading works fine so far, except that files with umlauts (or other, as yet unknown, special characters) result in an error. For example, the file 1.Nachtrag Gemeinde Höchst.pdf gets echoed correctly in the application, and the link produced to download the file also has the correct (URL) encoding:
http://ourdomain/saba/data/dok/00000092/1.Nachtrag%20Gemeinde%20H%C3%B6chst.pdf
But when clicking on this link, a 404 error appears. When looking for the file in the shell, it is displayed as 1.Nachtrag Gemeinde H?chst.pdf, which indicates some sort of wrong encoding, although that might just be because the shell uses UTF-8.
What did we forget?
As @Amadan correctly pointed out, the filename needed to be converted to UTF-8 before saving. The download link percent-encodes the ö as UTF-8 (%C3%B6), while the file on disk was saved under its Latin-1 byte name, so the web server could not match the requested name to the stored one. The fix:
$filename = iconv('ISO-8859-1', 'UTF-8', $filename);
$is_successful = @move_uploaded_file($tmp_filename, $ordnername . DIRECTORY_SEPARATOR . $filename);
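A slightly more defensive variant (a sketch, not part of the original answer): only convert when the incoming name is not already valid UTF-8, so the code keeps working if the client side ever starts sending UTF-8 names.
// Convert only when needed; valid UTF-8 names pass through unchanged.
if (!mb_check_encoding($filename, 'UTF-8')) {
    $filename = iconv('ISO-8859-1', 'UTF-8', $filename);
}
$is_successful = @move_uploaded_file($tmp_filename, $ordnername . DIRECTORY_SEPARATOR . $filename);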
