How to set text file encoding in PHP? - php

How can I set a text file ENCODING (for instance UTF-8) in PHP?
Let me show you my problem. This is my code:
<?php
file_put_contents('test.txt', $data); // data is some non-English text with UTF-8 charset
?>
Output: اÙ!
fwrite() produces similar output.
But when I create test.txt in Notepad and set the charset to UTF-8, the output is what I want.
I want to set the charset from the PHP file.
So my question is: how do I set a text file's encoding in PHP?

PHP does not apply an encoding when storing text in a file: it stores data exactly as it is laid out in the string.
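A minimal sketch of that point, assuming a scratch file created with tempnam(): the same word is written once as UTF-8 bytes and once as ISO-8859-1 bytes, and the file contains exactly the bytes that were in the string each time. Escape sequences are used so the example does not depend on how this script itself is saved.

```php
<?php
// "\xC3\xA9" is the two-byte UTF-8 form of "é"; "\xE9" is the same
// character in ISO-8859-1.
$utf8   = "caf\xC3\xA9";
$latin1 = "caf\xE9";

$path = tempnam(sys_get_temp_dir(), 'enc'); // scratch file for the demo

file_put_contents($path, $utf8);
$utf8Bytes = bin2hex(file_get_contents($path));   // 636166c3a9

file_put_contents($path, $latin1);
$latin1Bytes = bin2hex(file_get_contents($path)); // 636166e9

unlink($path);

echo $utf8Bytes, "\n", $latin1Bytes, "\n";
```

So the file is "UTF-8" only if $data already holds UTF-8 bytes; there is no separate encoding flag for PHP to set.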
You mention that you have problems opening the file in notepad.exe. That text editor is not very good at guessing the encoding of the file you are opening; if the text is encoded in UTF-8 you must choose to open it as UTF-8. Use another text editor if possible. Notepad++ is a popular replacement.
If you must use notepad.exe, as a last resort, write a Byte Order Mark to the file before you write anything else; this will make it recognize the file as UTF-8 while potentially making the file unusable for other purposes (see the Wikipedia article for details).
file_put_contents("file.txt", "\xEF\xBB\xBF" . $data);
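Because a BOM can confuse programs that read the file back, it may help to strip it again on input. The sketch below assumes two hypothetical helpers, write_with_bom() and read_without_bom(); neither is a built-in PHP function.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: prepend the UTF-8 BOM when writing.
function write_with_bom(string $path, string $data): void
{
    file_put_contents($path, UTF8_BOM . $data);
}

// Hypothetical helper: drop a leading UTF-8 BOM when reading.
function read_without_bom(string $path): string
{
    $data = file_get_contents($path);
    if (strncmp($data, UTF8_BOM, 3) === 0) {
        $data = substr($data, 3);
    }
    return $data;
}

$path = tempnam(sys_get_temp_dir(), 'bom'); // scratch file for the demo
write_with_bom($path, "caf\xC3\xA9");
$raw   = file_get_contents($path);  // starts with EF BB BF
$clean = read_without_bom($path);   // the original string again
unlink($path);
```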

You may try this using mb_convert_encoding
$data = mb_convert_encoding($data, 'UTF-8', 'auto');
file_put_contents('test.txt', $data);
Also check iconv.
Update : (try this and find the right encoding for your text)
foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($data, 'UTF-8', $chr)." : ".$chr."<br>";
}
Also, try this on GitHub.

Try:
file_put_contents('test.txt', utf8_encode($data));

You can create a function which converts a string array into a utf8 encoded string array and another to decode and write to a notepad file for you.
<?php
function utf8_string_encode(&$array){
    // array_walk() passes keys by value, so only the values can be re-encoded in place.
    $myencode = function(&$value, $key){
        if(is_string($value)){
            $value = utf8_encode($value);
        }
        if(is_array($value)){
            utf8_string_encode($value);
        }
    };
    array_walk($array, $myencode);
    return $array;
}
?>

Related

ANSI encoded file converting to UTF-8 encoded file with php script?

How can I convert an ANSI-encoded file to a UTF-8 encoded file with PHP, some other script, or a command-line tool under Linux?
Firstly, ANSI is not a single character encoding. With an "ANSI" file, you need to find out which encoding the particular file you're trying to read actually uses. First check whether the file is already UTF-8 encoded, and if not, convert it. Below, we check the encoding and, if the conversion succeeds, return the file contents.
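// Assume the contents are already UTF-8 unless the check below says otherwise.
$output = $myFile;
if( !mb_check_encoding( $myFile, 'UTF-8' ) ):
    // mb_check_encoding() takes two arguments. The source encoding here is an
    // assumption; replace 'Windows-1252' with the file's actual encoding.
    $output = mb_convert_encoding( $myFile, 'UTF-8', 'Windows-1252' );
endif;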
Then simply check if the encoding worked.
return $output ? $output : 'Failed encoding file!';
Not my answer but highlighting one of the comments on Ohgodwhy's answer:
If you can use the command line then I would probably use iconv for this.
iconv -f iso-8859-1 -t utf-8 <infile >outfile
and of course adjust the variables accordingly. – Ohgodwhy
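If the command line is not available, roughly the same conversion can be sketched in PHP with iconv(). The helper name convert_file() is hypothetical, the source encoding must still be known in advance, and the whole file is read into memory, so this suits small files only.

```php
<?php
// Hypothetical helper: re-encode a whole file from $from to $to.
function convert_file(string $in, string $out, string $from, string $to = 'UTF-8'): bool
{
    $data = file_get_contents($in);
    if ($data === false) {
        return false;
    }
    $converted = iconv($from, $to, $data);
    if ($converted === false) {
        return false; // input was not valid in the stated source encoding
    }
    return file_put_contents($out, $converted) !== false;
}

// Demo with scratch files: "café" stored as ISO-8859-1 bytes.
$in  = tempnam(sys_get_temp_dir(), 'in');
$out = tempnam(sys_get_temp_dir(), 'out');
file_put_contents($in, "caf\xE9\n");

$ok     = convert_file($in, $out, 'ISO-8859-1');
$result = file_get_contents($out); // "caf\xC3\xA9\n"

unlink($in);
unlink($out);
```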

UTF-8 characters in fwrite

I'm trying to save HTML to a .html file.
This part works:
$html_file = "output.html";
$output_string="string with characters like ã or ì";
$fileHandle = fopen($html_file, 'w') or die("file could not be accessed/created");
fwrite($fileHandle, $output_string);
fclose($fileHandle);
When I check the output.html file, these special characters in my output_string are not read correctly.
Adding a <head> tag with the charset information makes it work, but my output can't contain any <html>, <head> or <body> tags.
I have tried stuff like
header('Content-type: text/plain; charset=utf-8');
I also tried utf8_encode() on the string before fwrite, but with no success so far.
If I read the output.html file in Notepad++ or NetBeans IDE, it shows the correct characters being saved; it's the browser that isn't reading them properly.
I'm pretty sure PHP is saving my file with the incorrect charset, because if I create HTML files on my computer with those special characters (even without any charset setting), they are read correctly.
Try to add a BOM (Byte Order Mark) to your file :
$output_string = "\xEF\xBB\xBF";
$output_string .= "string with characters like ã or ì";
$fileHandle = // ...
Yes, PHP is writing the file correctly; the reading program just doesn't know which character encoding it is and interprets the data with the wrong charset. If you cannot include meta information that conveys the correct charset, the file format itself (plain text) offers no way to specify the charset, and the reading application cannot guess the charset correctly, then there's no solution.
Whatever editor you are using to write this code must have a facility to set the character encoding to 'UTF-8'.
Set the character encoding of the file in which you have written this code.
I am using an editor that lets me change the file's character encoding from the bottom of the window. There must be something similar in the editor you are using.
If you need the string in UTF-8 regardless of the php-script-file-encoding (if it's a single-byte one), you should use the UTF-8 encoding of those characters:
$output_string = "string with characters like \xC3\xA3 or \xC3\xAC"; // ã is C3 A3, ì is C3 AC
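As a possible alternative on PHP 7.0 or later, the "\u{...}" escape takes a Unicode code point and always produces UTF-8 bytes, which avoids working out the byte values by hand. A small sketch comparing the two spellings:

```php
<?php
// Byte escapes: the UTF-8 bytes for "ã" (C3 A3) and "ì" (C3 AC).
$fromBytes = "\xC3\xA3 or \xC3\xAC";

// Code point escapes (PHP 7.0+): U+00E3 is "ã", U+00EC is "ì".
$fromCodePoints = "\u{E3} or \u{EC}";

// Both spellings yield the same byte string, whatever encoding the
// script file itself is saved in.
$same = ($fromBytes === $fromCodePoints);
var_dump($same); // bool(true)
```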

PHP fputcsv encoding

I create a CSV file with fputcsv. I want the CSV file to be in Windows-1251 encoding, but I can't find a solution. How can I do that?
Cheers
The default encoding for an Excel file is a machine-specific ANSI code page, most commonly Windows-1252. But since you are creating that file and possibly inserting UTF-8 characters, it will not be handled correctly.
You could use iconv() when creating the file. Eg:
function encodeCSV(&$value, $key){
$value = iconv('UTF-8', 'Windows-1252', $value);
}
array_walk($values, 'encodeCSV');
It's Working: Enjoy
use this before fputcsv:
$line = array_map("utf8_decode", $line);
The file will be in whatever encoding your strings are in. PHP strings are raw byte arrays; their encoding depends on where the bytes came from. If you just read them from your source code files, they're in whatever you saved your source code as. If they're from a database, they're in whatever encoding the database connection was set to.
If you need to convert from one encoding to another, use iconv. If you need more in-depth information, see here:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Handling Unicode Front To Back In A Web App
fputs( $fp, $bom = chr(0xEF) . chr(0xBB) . chr(0xBF) );
Try this it worked for me!!!
Try the iconv function:
http://php.net/manual/en/function.iconv.php
To transform the members of the array you're passing to the fputcsv function to the encoding you want.
$string = iconv(mb_detect_encoding($string), 'Windows-1252//TRANSLIT', $string);
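Putting these suggestions together for the Windows-1251 case from the question, a sketch might look like the following. It assumes the fields are known to be UTF-8 (so no detection step is needed), that the local iconv build knows the name 'Windows-1251', and that the wrapper name fputcsv_encoded() is hypothetical.

```php
<?php
// Hypothetical wrapper: convert every field from UTF-8 to $to, then
// hand the row to fputcsv().
function fputcsv_encoded($handle, array $row, string $to = 'Windows-1251')
{
    $converted = array_map(function ($field) use ($to) {
        $out = iconv('UTF-8', $to, (string) $field);
        if ($out === false) {
            throw new RuntimeException("Cannot represent field in $to");
        }
        return $out;
    }, $row);

    // Delimiter, enclosure and escape are passed explicitly.
    return fputcsv($handle, $converted, ',', '"', '\\');
}

$path = tempnam(sys_get_temp_dir(), 'csv'); // scratch file for the demo
$fp   = fopen($path, 'wb');
fputcsv_encoded($fp, ["\xD0\x94\xD0\xB0", '42']); // first field is "Да" in UTF-8
fclose($fp);

$bytes = file_get_contents($path); // "Да" is now the two bytes C4 E0
unlink($path);
```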

Problem writing UTF-8 encoded file in PHP

I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there's a bunch of encoding-specific byte sequences (meaning sequences that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line altogether.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding $str = iconv('', 'UTF-8', $str); (which may or may not work)...
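One way to act on this advice without guessing is to validate rather than detect: leave the text alone if it is already valid UTF-8, and convert only otherwise, from a source encoding you supply yourself. The sketch below assumes the mbstring extension is loaded; the helper name to_utf8() and the ISO-8859-1 default are assumptions for illustration.

```php
<?php
// Hypothetical helper; $assumedSource is a guess the caller must supply.
function to_utf8(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

$alreadyUtf8 = "J\xC3\xA4rvamaa"; // "Järvamaa" as UTF-8
$latin1      = "J\xE4rvamaa";     // "Järvamaa" as ISO-8859-1

$a = to_utf8($alreadyUtf8); // unchanged
$b = to_utf8($latin1);      // converted to the same UTF-8 bytes
```

Unlike an unconditional utf8_encode(), this does not re-encode text that was UTF-8 to begin with.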
It doesn't work like that. Even if you utf8_encode($theString) you will not CREATE a UTF8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
Read these to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
As the UTF-8 byte-order mark is '\xef\xbb\xbf', we should add it to the start of the document.
<?php
function writeStringToFile($file, $string){
    $f = fopen($file, "wb");
    $string = "\xEF\xBB\xBF" . $string; // prepend the UTF-8 BOM to the data
    fputs($f, $string);
    fclose($f);
}
?>
The $file could be anything text or xml...
The $string is your UTF8 encoded string.
Try it now and it will write a UTF8 encoded file with your UTF8 content (string).
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, utf8_encode($s));
fclose($f);
?>

PHP problem character set

I have a problem where users upload zipped text files. After I extract text contents I import them in mysql database. But later when I display the text in browser some characters are garbled. I tried to encode them but I am unable to detect the encoding of the text files with PHP and convert to UTF-8 with iconv or mbstring.
Mysql database charset is UTF-8.
header('Content-type: text/html; charset=utf-8');
is added.
Tried with
iconv('UTF-8', 'UTF-8//IGNORE', $text_file_contents)
But it simply removes the garbled chars: � which should be either ' or " when I checked manually with Firefox browser. Firefox showed that is ISO-8859-1 but I can not check for every article they send (articles may be in different character set).
How can I convert these characters to UTF-8?
EDIT:
This is a modified function I found on
http://php.net/manual/en/function.mb-detect-encoding.php
originally written by prgss at bk dot ru.
function myutf8_detect_encoding($string, $default = 'UTF-8', $encode = 0, $encode_to = 'UTF-8') {
static $list = array('UTF-8', 'ISO-8859-1', 'ASCII', 'windows-1250', 'windows-1251', 'latin1', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'ISO-8859-2', 'ISO-8859-3', 'GBK', 'GB2312', 'GB18030', 'MACROMAN', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-11', 'ISO-8859-12', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16');
foreach ($list as $item) {
$sample = iconv($item, $item, $string);
if (md5($sample) == md5($string)) {
if ($encode == 1)
return iconv($item, $encode_to, $string);
else
return $item;
}
}
if ($encode == 1)
return iconv($encode_to, $encode_to . '//IGNORE', $string);
else
return $default;
}
and in my code I use:
myutf8_detect_encoding(trim($description), 'UTF-8', 1)
but it still returns garbled characters of this text “old is gold’’ .
This is indeed tricky.
Detecting an arbitrary string's encoding using mb_detect_encoding() is known to be not very reliable (although it should be able to distinguish between UTF-8 and ISO-8859-1, for example; make sure you give it a try first).
If the auto-detection doesn't work out, there is the option of displaying the content to the user before it gets submitted, along with a drop-down menu to switch between the most used encodings. Then show a message like
Please check your submission. If you are seeing incorrect or garbled characters, please change the encoding in the drop-down menu until the content is correct.
Whenever the user changes the drop-down value, your script will pull the content again, use iconv() to convert it from the specified encoding to UTF-8, and output the result, until it looks good.
This needs some finesse in designing the User Interface to be understandable for the end user, but it would often be the best option. Especially if you are dealing with users from many different regions or continents with a lot of different encodings.
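The server side of such a drop-down could be sketched roughly as follows: convert the same uploaded bytes from each candidate encoding and show the results for the user to choose from. The helper name encoding_previews() and the candidate list are assumptions, and the sample bytes are the curly-quoted text from the question as it would look in Windows-1252.

```php
<?php
// Hypothetical helper: convert the same bytes from each candidate
// encoding so a user can pick the preview that reads correctly.
function encoding_previews(string $bytes, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        // @ silences the notice iconv() raises on bytes that are
        // invalid for $encoding; such candidates are skipped.
        $converted = @iconv($encoding, 'UTF-8', $bytes);
        if ($converted !== false) {
            $previews[$encoding] = $converted;
        }
    }
    return $previews;
}

// In Windows-1252, 0x93 and 0x94 are the left and right curly quotes.
$upload   = "\x93old is gold\x94";
$previews = encoding_previews($upload, ['Windows-1252', 'ISO-8859-1', 'UTF-8']);

foreach ($previews as $encoding => $text) {
    echo $encoding, ': ', $text, "\n";
}
```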
Having had the same problem of encoding detection, I made a php function that outputs different information about the string and should make it relatively easy to identify the encoding used.
http://php.net/manual/en/function.ord.php (function hex_chars by "manixrock(hat)gmail(doink)com").
It shows the values of the characters inside the string, as well as the values of each individual byte. You look at the output and see which of your suspected encodings matches the bytes. You should first familiarize yourself with the various popular encodings like UTF-8, UTF-16, ISO-8859-X (understand their byte storage). Also make sure you test the string as unaltered as possible (take care how the encoding might change between what PHP outputs and what the browser receives, how the browser displays, or if you get the string from another source like MySQL or a file how that may change the encoding).
This helped me detect that a text had undergone the conversions: (UTF-8 to byte[]) then (ISO-8859-1 to UTF-8). That function helped a lot. Hope it helps you.
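That double conversion can be reproduced, and undone, in a few lines. This is only a sketch of the specific case described above (UTF-8 bytes mistaken for ISO-8859-1 and encoded again exactly once); text damaged in other ways will not be repaired by it.

```php
<?php
$original = "J\xC3\xA4rvamaa"; // "Järvamaa" as UTF-8

// The mistake: treating UTF-8 bytes as ISO-8859-1 and encoding them again.
$double = iconv('ISO-8859-1', 'UTF-8', $original);

// The repair: undo exactly one round of that conversion.
$repaired = iconv('UTF-8', 'ISO-8859-1', $double);

echo bin2hex($original), "\n"; // 4ac3a47276616d6161
echo bin2hex($double), "\n";   // 4ac383c2a47276616d6161
echo bin2hex($repaired), "\n"; // 4ac3a47276616d6161
```

In the hex dump, the tell-tale sign is that each byte of the original two-byte character (c3 a4) has itself become a two-byte sequence (c3 83, c2 a4).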
Use mb_detect_encoding to find out what encoding is used, then iconv to convert.
Try to insert right after the mysql connection:
mysql_query("SET NAMES utf8");
