I have a client who wants to upload a .csv to the server, where it will be parsed by PHP to generate a table from its data. I'm using iconv to convert it to the appropriate encoding (UTF-8). Unfortunately I'm on Windows, so I don't know what the source encoding is.
What encoding would Excel for Mac use to generate a .csv? I've tried many different combinations, but none of them handle the French accents, which, as far as I know, do not sit at the same code positions in the Mac's charset as in UTF-8.
For example:
The correct display should be:
'Délégation'
Most source encodings I tried (and also utf8_encode()) give:
'DÈlÈgation'
macintosh to UTF-8 gives:
'D»l»gation'
If I open the .csv file (saved on the Mac) on my PC, the French 'é' accents appear as 'È'. Is it possible that saving the file onto my computer (or server) forces it directly to UTF-8, so that the 'È' characters are now the actual values in the file rather than a misinterpretation of the encoding?
Hex Dump
Using bin2hex(), the hex dump for the string:
'DÈlÈgation 1' is:
44c86cc8676174696f6e2031
In fact, I'm assuming that it's DÈlÈgation and not Délégation, because if I open the .csv file in Notepad (on my PC), it shows È and not é.
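The hex dump can be checked against a few candidate encodings directly. The following is a minimal sketch, not a definitive diagnosis: the function name decodeCandidates() and the list of candidates are illustrative choices, and whether 'MACINTOSH' is accepted depends on the iconv build. Byte 0xC8 is 'È' in ISO-8859-1 and '»' in MacRoman, where 'é' would be 0x8E, which matches both results reported above.

```php
<?php
// Hypothetical helper: decode the same bytes under several candidate encodings.
function decodeCandidates(string $bytes, array $candidates): array
{
    $results = [];
    foreach ($candidates as $encoding) {
        // iconv() returns false when the conversion fails or the encoding
        // name is not supported by this iconv build.
        $decoded = @iconv($encoding, 'UTF-8', $bytes);
        $results[$encoding] = $decoded === false ? null : $decoded;
    }
    return $results;
}

// Bytes taken from the bin2hex() output in the question.
$bytes = hex2bin('44c86cc8676174696f6e2031');

// Candidate list is an assumption; extend it as needed.
foreach (decodeCandidates($bytes, ['ISO-8859-1', 'CP1252', 'MACINTOSH']) as $encoding => $text) {
    echo str_pad($encoding, 12), $text ?? '(not supported by this iconv build)', PHP_EOL;
}
```

If none of the candidates yields 'Délégation', the bytes were probably altered before they reached the server, and a fresh export from the Mac is the safer starting point.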
A common encoding for Mac programs to use is MacRoman.
Would it be possible for your client to install the trial version of Apple Numbers from the Apple website, open the .csv file in Numbers, go to "File", "Export", "CSV", pick either "UTF-8" or "Windows Latin 1", and send you both the UTF-8 and the Windows Latin 1 files?
The Numbers application on a Mac sometimes solves problems encountered with Excel.
Related
1) I migrated a web page to another server, from MS Server 2012/XAMPP to CentOS 7/httpd. On CentOS 7 the page shows question marks where the special characters "āīūņļš" should be. The page is built with PHP.
The old server is running an old XAMPP installation:
PHP Version 5.4.27
MySQL 5.5.36
CentOS is running:
Apache/httpd
PHP 7.0.*
MariaDB 5.5.52
2) Both servers contain files with the same encoding and DB tables with the same charsets and collations, and both servers report "Server charset is UTF-8 Unicode (utf8)".
The only difference is "Server connection collation": the old server shows "Collation" and CentOS shows "utf8_general_ci".
3) I tried:
encode files to utf8
define utf8 in meta tags
header('Content-Type: text/html; charset=ISO-8859-1'); and
header('Content-Type: text/html; charset=utf8');
mysqli_set_charset($con,"utf8");
AddDefaultCharset UTF-8 in httpd.conf
I just don't understand why everything is OK on one server while on the other the text contains "?", when the files, PHP code and DB are the same! Is there a chance that httpd doesn't have some module enabled?
And there is one more problem! Some PHP files and DB tables have different encodings and collations. I tried changing the file encoding to UTF-8, and that solved the problem for static text in the PHP files. Some text in the DB contains strange characters, and the DB holds a lot of information. In some cases mysqli_set_charset($con,"utf8"); works, but there are times when text randomly disappears when mysqli_set_charset is used!
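One way to narrow down where the "?" comes from is to look at the raw bytes PHP actually receives. The sketch below is only a diagnostic aid under stated assumptions: describeBytes() is a hypothetical helper, and the sample values are made up for illustration. A literal 0x3F byte usually means the character was already replaced before it reached PHP (for example during a charset conversion on the database connection), whereas bytes that are not valid UTF-8 suggest single-byte data being served with a UTF-8 header.

```php
<?php
// Hypothetical helper: summarise the raw bytes of a value read from the DB.
function describeBytes(string $value): array
{
    return [
        'hex' => bin2hex($value),
        // preg_match() with the /u flag succeeds only on valid UTF-8 input.
        'valid_utf8' => preg_match('//u', $value) === 1,
        'has_literal_question_mark' => strpos($value, '?') !== false,
    ];
}

// Illustrative samples (assumptions, not data from the original post):
$samples = [
    'utf8'        => "\xC4\x81", // "ā" encoded as UTF-8
    'single_byte' => "\xE2",     // "ā" in Windows-1257, invalid as UTF-8
    'lost'        => '?',        // character already replaced upstream
];

foreach ($samples as $label => $value) {
    echo $label, ': ', json_encode(describeBytes($value)), PHP_EOL;
}
```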
On a Linux server, if a user uploads a CSV file created in MS Office Excel (thus having Windows 1250 [or cp1250 or ASCII if you want to] encoding), all methods known to me for detecting the file encoding return the incorrect ISO-8859-1 (or latin1 if you want to) encoding.
This is crucial for the encoding conversion to final UTF-8.
Methods I tried:
cli
file -i [FILE] returning iso-8859-1
file -b [FILE] returning iso-8859-1
vim
vim [FILE] and then :set fileencoding? returning latin1
PHP
mb_detect_encoding(file_get_contents($filename)) returning (surprisingly) UTF-8
while the file is indeed in WINDOWS-1250 (ASCII), as shown for example by opening the CSV file in LibreOffice Calc: it asks for the file encoding, and selecting either ISO-8859-1 or UTF-8 results in wrongly displayed characters, while selecting ASCII displays all characters correctly!
How can I correctly detect the file encoding on a Linux server (Ubuntu), ideally with default Ubuntu utilities or with PHP?
The last option I can think of is to detect the user agent (and user OS) when the file is uploaded and, if it is Windows, automatically assume the encoding is ASCII...
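Single-byte encodings such as ISO-8859-1 and Windows-1250 cannot be told apart reliably from the bytes alone, because almost every byte sequence is valid in both. A common workaround, sketched here under the assumption that uploads are either UTF-8 or one known legacy code page, is to test for valid UTF-8 first and otherwise convert from a configured fallback. The function name toUtf8() and the default fallback 'CP1250' are illustrative, not part of the original question.

```php
<?php
// Hypothetical helper: return UTF-8 text, assuming $fallback whenever the
// input is not already valid UTF-8.
function toUtf8(string $bytes, string $fallback = 'CP1250'): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned unchanged.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    $converted = iconv($fallback, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert input from $fallback");
    }
    return $converted;
}

echo toUtf8("\xE8aj"), PHP_EOL; // 0xE8 is "č" in Windows-1250
echo toUtf8('čaj'), PHP_EOL;    // already UTF-8, passed through as-is
```

The fallback has to come from knowledge about the users (for example a per-customer setting), since the bytes themselves do not carry it.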
I'm using exif_read_data() to extract EXIF data from uploaded pictures. This worked fine on my Windows machine, but on my Mac with the latest XAMPP all fields seem to be extracted correctly except the keywords/tags. If I look in the file, the camera model (which is extracted correctly) seems to be encoded in ASCII (one byte per character). However, the keywords (which were originally edited on Windows, in Explorer) seem to be encoded in UTF-16LE (i.e. each ASCII code followed by 0x00). So the file appears to mix character encodings.
I tried to force the character encoding to a certain standard (with e.g. ini_set('exif.encode_unicode', 'byte2le')), but most of the time I get question marks in the keywords or nothing at all.
Does anyone have an idea what's wrong, how to fix it, and why this worked fine on Windows XAMPP and not on Mac XAMPP?
Thanks
I found the answer:
Forcing exif.decode_unicode_motorola to UCS-2LE instead of the default value UCS-2BE did the trick.
ini_set('exif.decode_unicode_motorola', 'UCS-2LE');
Still don't understand why it works on a Windows machine without this.
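To see why the byte order matters, the same bytes can be decoded both ways by hand. This is only an illustration of the little-endian versus big-endian difference, not of what the exif extension does internally; decodeUcs2() is a hypothetical helper and the sample bytes are made up to match the "ASCII code followed by 0x00" pattern described in the question.

```php
<?php
// Hypothetical helper: decode 2-byte-per-character text with a given byte order.
function decodeUcs2(string $raw, bool $littleEndian): string
{
    $from = $littleEndian ? 'UCS-2LE' : 'UCS-2BE';
    $decoded = iconv($from, 'UTF-8', $raw);
    if ($decoded === false) {
        throw new RuntimeException("Input is not valid $from");
    }
    return $decoded;
}

// "tag" stored as each ASCII byte followed by 0x00 (illustrative sample).
$raw = hex2bin('740061006700');

echo decodeUcs2($raw, true), PHP_EOL;           // read as little-endian
echo bin2hex(decodeUcs2($raw, false)), PHP_EOL; // read as big-endian: unrelated characters
```

Read with the wrong byte order, 0x74 0x00 becomes the code point U+7400 instead of U+0074, which is the kind of mismatch that can end up as question marks once the text is converted to a narrower charset.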
I have to migrate a large database and large PHP systems from PHP 4 to PHP 5.
The database tables are stored in UTF-8 format, but the data is actually windows-1257.
All page in header is:
But I get data from database like this: AutomobiliĆø stovĆ«jimo;
var_dump(mysql_client_encoding($connect)); returns utf8;
File encoding: windows-1257;
On an Apache server (I tried WAMP on Windows 7 and on Windows Server 2012) I get normal data.
But on IIS I don't. Maybe IIS doesn't understand the file encoding or something similar.
I give up, and I need your help...
SOLVED: I changed the MySQL configuration (my.ini) and set character_set_server from utf8 to latin1.
Now var_dump(mysql_client_encoding($connect)); returns latin1,
and all projects work fine.
Databases tables stored in UTF-8 format, but the data contain windows-1257.
Try converting the data from Windows-1257 to UTF-8 with something like:
$encoded = iconv ( "CP1257", "UTF-8", $string );
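A slightly fuller sketch of that conversion follows. The helper names are hypothetical, and the second function rests on an assumption about the data: the sample output in the question ("Ćø" where "ų" is expected) looks like Windows-1257 bytes that were at some point treated as Latin-1 and re-encoded as UTF-8, but that is an inference from one string, not something the question confirms.

```php
<?php
// Case 1: $string holds raw Windows-1257 bytes (the situation the answer assumes).
function cp1257ToUtf8(string $string): string
{
    $encoded = iconv('CP1257', 'UTF-8', $string);
    if ($encoded === false) {
        throw new RuntimeException('Input is not valid Windows-1257');
    }
    return $encoded;
}

// Case 2 (assumption): Windows-1257 bytes were stored as if they were Latin-1
// and then encoded as UTF-8. Undo the Latin-1 step first, then decode properly.
function repairDoubleEncoded(string $utf8): string
{
    $rawBytes = iconv('UTF-8', 'ISO-8859-1', $utf8);
    if ($rawBytes === false) {
        throw new RuntimeException('Input cannot be mapped back to single bytes');
    }
    return cp1257ToUtf8($rawBytes);
}

echo cp1257ToUtf8("Automobili\xF8 stov\xEBjimo"), PHP_EOL;
echo repairDoubleEncoded("Automobili\xC3\xB8 stov\xC3\xABjimo"), PHP_EOL;
```

Checking a few rows with bin2hex() before running either conversion over a whole table helps confirm which of the two cases applies.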
I've been searching for a long time, but without any helpful result.
I'm developing a PHP project using Eclipse on an Ubuntu 11.04 VM. Everything works fine, and I've never needed to look at the file encoding. But after deploying the project to my server, all content was shown with the wrong encoding. After a manual conversion to UTF-8 with Notepad++, my problems were solved.
Now I want to change it in my Ubuntu VM too, and there's the problem. I've checked the preferences in Eclipse, but every property is set to UTF-8: general content types, workspace, project settings, everything ...
If I check the encoding in the terminal, it says "test_new.dat: text/plain; charset=us-ascii". All files are saved in ASCII format. If I create a new file from the terminal ("touch"), it's the same.
Then I've tried to convert the files with iconv:
iconv -f US-ASCII -t UTF8 -o test.dat test_new.dat
But the encoding doesn't change. PHP files in particular seem to be resistant, while the conversion does work for some *.ini files in my project?!
Any idea what to do?
Here are my locale settings of Ubuntu:
LANG=de_DE.UTF-8
LANGUAGE=de_DE:en
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
I was also wondering about character encoding and found something that might be useful here.
When I create a new empty .txt file on my Ubuntu 12.04 and ask for its character encoding with "file -bi filename.txt", it shows charset=binary. After opening it and writing something like "haha" inside, I saved it using "Save as" and explicitly chose UTF-8 as the character encoding. Strangely, asking again did not show charset=UTF-8 but charset=us-ascii. It got even stranger when I repeated the whole thing but this time included a German-specific character (ä in this case) in the file and saved again (this time just pressing "Save", without "Save as"). Now it said charset=UTF-8.
It therefore seems that at least gedit checks the file and downgrades from UTF-8 to us-ascii if there is no need for UTF-8, since the file can be encoded using us-ascii.
Hope this helped a bit even though it is not php related.
Greetings
UTF-8 is compatible with ASCII. An ASCII text file is therefore also valid UTF-8, and a conversion from ASCII to UTF-8 is a no-op.
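A short sketch makes this visible: converting pure ASCII text to UTF-8 returns byte-for-byte the same string, so a tool such as file has nothing to distinguish the two until a non-ASCII character appears. The helper name isPureAscii() is hypothetical, and the 'ä' literal assumes this script itself is saved as UTF-8.

```php
<?php
// Hypothetical helper: true when every byte is in the 7-bit ASCII range.
function isPureAscii(string $text): bool
{
    return preg_match('/[^\x00-\x7F]/', $text) === 0;
}

$ascii = 'haha';
$converted = iconv('US-ASCII', 'UTF-8', $ascii);

var_dump($converted === $ascii);  // bool(true): the conversion changed nothing
var_dump(isPureAscii($ascii));    // bool(true)

// Assumes this source file is saved as UTF-8: "ä" then occupies two bytes.
var_dump(isPureAscii('hähä'));    // bool(false)
```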