file_put_contents encoding used on web servers? - php

I am trying to use file_put_contents (and file_get_contents, for that matter) with a UTF-8 ¥ character, following this Stack Overflow post, "How to write file in UTF-8 format?", which uses:
$data = mb_convert_encoding($data, 'UTF-8', 'OLD-ENCODING');
Which wasn't really explained well, since it produces an error of:
mb_convert_encoding(): Illegal character encoding specified
So 'OLD-ENCODING' was just a placeholder they were using.
The question I have is what encoding should I change this to? ASCII or ISO-8859-1? What encoding do most web hosts use? Does it matter?
When I open the file, I will get the symbol correctly, only if I have my notepad set with encoding UTF-8. If I open it with another character set it will show up with a "?".

Try it without the third parameter, in which case the internal encoding is used as the source:
$str = mb_convert_encoding($str, "UTF-8");
Or auto:
$str = mb_convert_encoding($str, "UTF-8", "auto");
More info and examples on:
http://php.net/manual/function.mb-convert-encoding.php

mb_convert_encoding($data, 'UTF-8', mb_detect_encoding($data));
See the manual entry for mb_detect_encoding.
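file_put_contents writes the bytes of the string unchanged, so the file is UTF-8 exactly when the string is. Below is a minimal sketch of that idea, assuming the source encoding is known in advance (ISO-8859-1 is only an example here). The helper name to_utf8 is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: return $data as UTF-8, converting only when needed.
// $assumedSource is an assumption supplied by the caller; it cannot be
// derived reliably from the bytes alone.
function to_utf8(string $data, string $assumedSource = 'ISO-8859-1'): string
{
    // Already valid UTF-8 (this includes plain ASCII): leave untouched.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', $assumedSource);
}

$yenLatin1 = "\xA5";               // the yen sign as a single ISO-8859-1 byte
$yenUtf8   = to_utf8($yenLatin1);  // two bytes in UTF-8: C2 A5

// file_put_contents() stores these bytes as-is.
$path = tempnam(sys_get_temp_dir(), 'utf8');
file_put_contents($path, $yenUtf8);
```

Whether the character then displays correctly depends on the program that opens the file reading it as UTF-8, which matches the Notepad behaviour described in the question.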

Related

PHP encoding Windows-1257 to UTF-8 error

I encountered a problem with converting the Windows-1257 file to UTF-8. The original file has
<?xml version="1.0" encoding="windows-1257"?>
on top and I try to convert it using this code:
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);
file_put_contents('data/rmtools/import/utf8/'.$files_single, $unicode_xml);
It saves the file as UTF-8, but when I open this file I still get the error:
XML parsing error: Input is not proper UTF-8, indicate encoding ! Bytes: 0x04 0x50 0x72 0x65
Is there any proper way I could convert it to readable UTF-8, or it means that there is still some symbols in the file which is NOT on UTF-8?
You're converting from UTF-8 to UTF-8//IGNORE, which is why you get that error. The first parameter of iconv is the input charset (see iconv on PHP.net). Please change
$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);
to
$unicode_xml = iconv("CP1257", "UTF-8//IGNORE", $baltic_xml);
However, I'd personally recommend the mb_* functions: iconv relies heavily on your OS's iconv implementation and can behave differently between systems, whereas mbstring is a PHP extension and behaves consistently. With mb_*, the whole thing becomes:
ini_set('mbstring.substitute_character', 'none'); // drop unconvertible characters, in place of //IGNORE in iconv
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = mb_convert_encoding($baltic_xml, 'UTF-8', 'ISO-8859-13'); // ISO-8859-13 stands in for Windows-1257, see the note below
$unicode_xml = preg_replace('/[^\PC\s]/u', '', $unicode_xml); // remove control characters, if any
file_put_contents('data/rmtools/import/utf8/' . $files_single, $unicode_xml);
According to the list of mbstring supported encodings, CP1257 is not one of them; you may use ISO-8859-13 instead. Note, however, that the two differ in some graphical characters (the language characters seem to be consistent, according to Wikipedia).
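Once the bytes are converted, the XML declaration should no longer claim windows-1257, or a parser may still misread the file. The following is a sketch only: it assumes ISO-8859-13 is an acceptable substitute for Windows-1257, and the function name baltic_xml_to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: convert a Baltic-encoded XML string to UTF-8 and
// update its declaration. ISO-8859-13 is used as a stand-in for Windows-1257
// (an assumption; the two differ in a few non-letter characters).
function baltic_xml_to_utf8(string $xml): string
{
    $utf8 = mb_convert_encoding($xml, 'UTF-8', 'ISO-8859-13');

    // Rewrite only the encoding attribute of the XML declaration.
    $fixed = preg_replace(
        '/^(<\?xml[^>]*encoding=["\'])[^"\']+(["\'])/i',
        '${1}UTF-8${2}',
        $utf8,
        1
    );

    // preg_replace() returns null on error; fall back to the converted text.
    return $fixed ?? $utf8;
}

$in  = "<?xml version=\"1.0\" encoding=\"windows-1257\"?>\n<a>\xE4</a>";
$out = baltic_xml_to_utf8($in);
```

The byte 0xE4 in the sample input is the letter ä in ISO-8859-13, so the output contains it as the two UTF-8 bytes C3 A4.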

mb_detect_encoding returns both ASCII and UTF8 [duplicate]

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is given an ASCII string with special characters in the Latin-1 range 192-255 (strictly speaking an ISO-8859-1 string, since ASCII stops at 127), it detects it as UTF-8, so the subsequent attempt to convert everything to proper UTF-8 removes all special characters.
I tried writing my own method that looks for Latin-1 values and, if none occurred, lets mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specify it explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about which encodings to allow, you can add them all:
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
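An alternative to tuning the detection order is to detect strictly against a short list and fall back to one assumed legacy encoding. The sketch below assumes that any value which is neither ASCII nor valid UTF-8 is ISO-8859-1. That is an assumption about the IPTC data, not something the bytes can prove, and the function name iptc_to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: normalise a metadata value to UTF-8.
function iptc_to_utf8(string $val, string $fallback = 'ISO-8859-1'): string
{
    // Strict mode (third argument true) rejects candidates for which the
    // bytes are not valid, and returns false when no candidate fits.
    $enc = mb_detect_encoding($val, ['ASCII', 'UTF-8'], true);
    if ($enc === false) {
        // Neither ASCII nor UTF-8: assume the legacy single-byte encoding.
        $enc = $fallback;
    }
    return mb_convert_encoding($val, 'UTF-8', $enc);
}

$fromLatin1 = iptc_to_utf8("caf\xE9");      // Latin-1 input
$fromUtf8   = iptc_to_utf8("caf\xC3\xA9");  // already UTF-8
$fromAscii  = iptc_to_utf8("cafe");         // plain ASCII
```

Keeping the candidate list short is deliberate: single-byte encodings accept nearly any byte sequence, so adding many of them makes the first match largely arbitrary.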

Convert from mysql cp1251_general_ci collation (Windows-1251) into UTF-8 php

I have a MySQL varchar(50) column with the cp1251_general_ci collation.
After mysql_fetch_row in PHP I get a $string.
Then I do the following:
echo mb_detect_encoding($string,'CP1251,UTF-8,Windows-1251'); // echoes Windows-1251
$string = mb_convert_encoding($string, 'UTF-8', 'Windows-1251');
echo mb_detect_encoding($string,'CP1251,UTF-8,Windows-1251'); // again echoes Windows-1251
Why is the string not detected as UTF-8 the second time?
I also tried
$string = iconv('Windows-1251', 'UTF-8', $string);
But again the output charset is reported as Windows-1251.
In the end I get a broken encoding in my filename, which is built from the $string variable.
How can I convert from the MySQL cp1251_general_ci collation (Windows-1251) into UTF-8?
P.S.
echo $string; // echoes ������
echo bin2hex($string); // echoes cce5e3e0f4eeed
$string = mb_convert_encoding($string, 'UTF-8', 'Windows-1251');
echo $string; // echoes Мегафон
echo bin2hex($string); // echoes d09cd0b5d0b3d0b0d184d0bed0bd
But
fopen("../tmp/$string.log", "w");
creates a file .../tmp/??????????????.log (on Linux).
Found the reason for this strange situation!
In short: if a properly encoded UTF-8 string shows up as unreadable symbols on a server (in a terminal), check the server locale.
And if mb_detect_encoding() behaves strangely, remember that it cannot determine the encoding of a string precisely.
The reason for the incorrect encoding in the filename .../tmp/??????????????.log is the locale on the server. Here is the output of the locale command on the server where the file is located:
$ locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
For UTF-8 characters to display correctly in file names on the server, the server locale must be UTF-8 too.
As for the conversions in the question, both methods:
iconv('Windows-1251', 'UTF-8', $string);
and
mb_convert_encoding($string, 'UTF-8', 'Windows-1251');
work fine in this case.
The only remaining question is why the second echo in
echo mb_detect_encoding($string,'CP1251,UTF-8,Windows-1251'); // echoes Windows-1251
$string = mb_convert_encoding($string, 'UTF-8', 'Windows-1251');
echo mb_detect_encoding($string,'CP1251,UTF-8,Windows-1251'); // again echoes Windows-1251
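does not report UTF-8.
The answer is that mb_detect_encoding cannot determine a string's encoding precisely. Windows-1251 is a single-byte encoding, so almost any byte sequence, including the UTF-8 output, is also valid Windows-1251; with CP1251 listed first, it can still be reported for the converted string.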
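Instead of asking mb_detect_encoding what the result is, the conversion can be checked by inspecting the bytes and validating them as UTF-8. This is a small sketch using the hex values quoted in the question; the variable names are illustrative.

```php
<?php
// Windows-1251 bytes taken from the question's bin2hex() output.
$raw  = hex2bin('cce5e3e0f4eeed');
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1251');

// mb_check_encoding() answers a yes/no question about validity,
// which is more dependable than guessing among several encodings.
$rawIsUtf8  = mb_check_encoding($raw, 'UTF-8');
$utf8IsUtf8 = mb_check_encoding($utf8, 'UTF-8');

// The converted bytes, for comparison with the question.
$hex = bin2hex($utf8);

echo $hex, "\n";
var_dump($rawIsUtf8, $utf8IsUtf8);
```

The original bytes fail the UTF-8 check and the converted ones pass it, which confirms that the conversion worked even though the detection call reports otherwise.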

Problem writing UTF-8 encoded file in PHP

I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out the encoding unless the text contains a number of encoding-specific entities (meaning entities that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line altogether.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding, $str = iconv('', 'UTF-8', $str); (which may or may not work)...
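The garbled output in the question ("Järvamaa" for "Järvamaa") is the pattern produced when text that is already UTF-8 is converted from Latin-1 to UTF-8 a second time. The sketch below reproduces it with mb_convert_encoding in place of utf8_encode; that substitution is mine, made because utf8_encode performs the same ISO-8859-1 to UTF-8 mapping and is deprecated as of PHP 8.2.

```php
<?php
$original = "J\xC3\xA4rvamaa";   // "Järvamaa" as UTF-8 bytes

// Treating UTF-8 bytes as ISO-8859-1 and converting again double-encodes
// each non-ASCII character into two new characters.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Guard: convert only input that is not already valid UTF-8.
$safe = mb_check_encoding($original, 'UTF-8')
    ? $original
    : mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
```

If the source file really is UTF-8, as the question states, then writing $text with fwrite and no conversion at all should preserve it.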
It doesn't work like that. Even if you call utf8_encode($theString), you will not create a UTF-8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
Read these to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
As the UTF-8 byte-order mark is "\xEF\xBB\xBF", we should add it at the very start of the file.
<?php
function writeStringToFile($file, $string){
    $f = fopen($file, "wb");
    $string = "\xEF\xBB\xBF" . $string; // prepend the UTF-8 BOM to the content
    fputs($f, $string);
    fclose($f);
}
?>
The $file can be anything, a text or XML file for example.
The $string is your UTF-8 encoded string.
Try it now and it will write a UTF-8 encoded file with your UTF-8 content (string).
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, $s); // $s is already UTF-8; utf8_encode() here would double-encode it
fclose($f);
?>
