I have a file, which contains some cyrillic characters. When I open this file in Notepad++ I see, that it has ANSI encoding. If I manually encode it into UTF-8 using Notepad++, then everything is absolutely ok - I can use this file in my parsers and get results. But what I want is to do it programmatically, using PHP. This is what I tried after searching through SO and documentation:
file_put_contents($file, utf8_encode(file_get_contents($file)));
In this case when my algorithm parses the resulting files, it meets such letters as "è", "í" , "â". In other words, in this case I get some rubbish. I also tried this:
file_put_contents($file, iconv('WINDOWS-1252', 'UTF-8', file_get_contents($file)));
But it produces the very same rubbish. So, I really wonder how can I achive programmatically what Notepad++ does. Thanks!
Notepad++ may report your encoding as ANSI but this does not necessarily equate to Windows-1252. 1252 is an encoding for the Latin alphabet, whereas 1251 is designed to encode Cyrillic script. So use
file_put_contents($file, iconv('WINDOWS-1251', 'UTF-8', file_get_contents($file)));
to convert from 1251 to utf-8 with iconv.
Related
I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g Ä becomes Ž which I then use preg_replace to alter which works fine like below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å which shows up like this <U+008F>, which I see translates to single shift three and I can't seem to use preg_replace on it?
You have two major builtin functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
I have to create a csv (semicolon separated) file as export for some Česká pošta DOS based system. Its not basicaly problem, but it have to be strictly in CP852 encoding (which is extended ASCII table with czech special characters ěščřžýáíé etc. named also Latin 2). So its no solution to remove diacritics.
I tried many approches, search through stack and google, but find no working solution. Source is saved in UTF8. I try to use iconv, mb_convert_encoding and also some other libraries, but nothing works right.
Closest is to use
iconv("UTF-8", "ISO-8859-2", 'abcěščřžýáíé');
Which looks in browser that is correctly encoded in ISO
abc���������
(when I change encoding to ISO, it shows character right).
abcěščřžýáíé
But when I write it into file by FPutS and read it again by FGetS followed by mb_detect_encoding that string, it shows UTF-8 instead. Pošta's software is very strict about encoding, so I have to make it right.
When I simply use for example
$text = iconv("UTF-8", "ASCII", 'abcěščřžýáíé');
it removes all characters with diacritics, but its not solution.
Is there anyone able to help with it? I am realy stuck ...
Using PHP CLI, this works well:
$result = iconv (LATIN1, 'UTF-8', N�n��;M�tt);
Result is: Nönüß
This also works for CP437, Windows, Macintosh etc.
On apache, the SAME code results in:
$result = iconv (LATIN1, 'UTF-8', N�n��;M�tt);
Result is: Nönüß
I googled around and added setlocale(LC_ALL, "en_US.utf8"); to the script, but made no difference. Thanks for helping!
I run Debian Linux with apache2 and php 5.4. I am trying to convert different CSV files as they are being uploaded into UTF-8 for processing.
UPDATE: I found my own solution.
$result = utf8_decode (iconv (LATIN1, 'UTF-8', N�n��;M�tt));
utf8_decode makes it show up correctly in the browser and when saved to the MySQL DB.
There are always two sides to encoding: the encoded string, and the entity interpreting this encoded string into readable characters! This "entity", as I'll ambiguously call it, can be the database, the browser, your text editor, the console, or whatever else.
$result = iconv('LATIN1', 'UTF-8', 'N�n��;M�tt');
Result is: Nönüß
Not sure where you're getting 'N�n��;M�tt' from exactly, but the UNICODE REPLACEMENT CHARACTERS � in there indicate that you're trying to interpret this string as UTF-8, but the string is not actually UTF-8 encoded. Using iconv to convert it from Latin-1 to UTF-8 makes the correct characters appear - that means the string was originally Latin-1 encoded and converting it to your expected encoding solved the discrepancy.
On apache, the SAME code results in Nönüß
That means the interpreting party here is not interpreting the string as UTF-8 this time, even though the string is UTF-8. I assume by "Apache" you mean "in the browser". You need to tell your browser through HTTP headers or HTML meta tags that it's supposed to interpret the text as UTF-8.
I found my own solution.
$result = utf8_decode (iconv (LATIN1, 'UTF-8', N�n��;M�tt));
Guess what utf8_decode does. It converts the encoding of a string from UTF-8 to Latin-1. So the above code converts Latin-1 to ... Latin-1.
Please read the following:
UTF-8 all the way through
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values and if none occured I would go on to letting mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encoding don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specified explicitly
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
When using the following code:
$myString = 'some contents';
$fh=fopen('newfile.txt',"w");
fwrite($fh, "\xEF\xBB\xBF" . $myString);
Is there any point of using PHP functions to first encode the text ($myString in the example) e.g. like running utf8_encode($myString); or similar iconv() commands?
Assuming that the BOM \xEF\xBB\xBF is first inputted into the file and that UTF8 represents practically all characters in the world I don't see any potential failure scenarion of creating a file this way. In other words I don't see any case where any major text editor wouldn't be able to interpret the newly created file corectly, displaying all characters as intended. This even if $myString would be a PHP $_POST variable from a HTML form. Am I right?
If your source file is UTF-8 encoded, then the string $myString is also UTF-8 encoded, you don't need to convert it. Otherwise, you need to use iconv() to convert the encoding first before write it to the file.
And note utf8_encode() is used to encode an ISO-8859-1 string to UTF-8.
Note that utf8_encode will only convert ISO-8859-1 encoded strings.
In general, given that PHP only supports a 256 char character set, you will need to utf-8 encode any string containing non-ASCII characters before writing it to UTF-8.
The BOM is optional (most text file readers now will scan the file for its encoding).
From Wikipedia
The Unicode Standard permits the BOM in UTF-8,[2] but does not require
or recommend for or against its use