SOLUTION:
$output = '–– € ––';
// Written literally like this, the output was wrong for me: PHP 5 treats string literals as plain bytes, not as characters.
// So I found the function below, which builds a multi-byte character from its code point.
// Unicode version of PHP's chr()
function uchr ($codes) {
if (is_scalar($codes)) $codes= func_get_args();
$str= '';
foreach ($codes as $code) $str.= html_entity_decode('&#'.$code.';',ENT_NOQUOTES,'UTF-8');
return $str;
}
//decimal values of unicode chars: – 8211 - 8211, [space] 32, € 8364,[space] 32, – 8211 - 8211
$output = uchr(8211,8211,32,8364,32,8211,8211);
//or
$output = uchr(8211,8211).' '.uchr(8364).' '.uchr(8211,8211);
echo $output;
QUESTION:
How can i write these special chars to a simple file?
$file = "./upload/myfile.txt";
$output = "–– € ––".PHP_EOL; // the "–" is not an underscore _ or - but –
file_put_contents($file, $output);
If I access this file from the browser http://mydomain.com/upload/myfile.txt i only get "�" characters.
However if i save "–– € ––" with Zend Developer or my local texteditor (on OSX) and upload this everything is perfectly fine. The browser shows it correctly.
How can I achieve this with PHP? It seems PHP writes the file differently than my MacBook does, though I thought PHP's default was UTF-8, and I also saved the file as UTF-8 in my local text editor.
EXTRA INFO: in the .htaccess file that's in the upload folder i wrote:
AddDefaultCharset utf-8
AddCharset utf-8 .txt
otherwise the firebug addon from firefox gave a message that the charset was not specified.
any ideas?
It has to do with saving the file because my uploaded file shows correctly.
i tried different options while saving the file like:
$output = mb_convert_encoding($output, 'UTF-8', 'OLD-ENCODING');
and the iconv function of php, but i cant find the solution.
any help is greatly appreciated.
EDIT: if i get the content from my uploaded file and echo it the following happens
$output = file_get_contents('./upload/myuploadedfile.txt',FILE_USE_INCLUDE_PATH);
//it show correctly –– € ––
$output = $output[1]; //it shows a �
$output = $output[3]; //it shows a �
echo $output;
PHP will write the contents of the file exactly as they are in your source code. It takes bytes exactly as they are encoded in your .php file and puts them in a file. From then it depends on how the file is interpreted. Assuming your source code is actually UTF-8 encoded, so will the file be. Try opening it with a text editor that can understand UTF-8. Change the encoding the browser interprets it with to UTF-8 (View menu > Encoding). Check if the web server actually sets the correct charset header when you open it in the browser (Firebug Network tab, headers of the response).
It's correct that $output[1] or $output[3] shows a broken UTF-8 character: indexing a PHP string returns a single byte, so you only get one byte of the three-byte character "–".
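The byte-versus-character difference can be checked directly. This is a minimal sketch, assuming the mbstring extension is loaded; the en dash is written as explicit bytes so the result does not depend on how the .php file itself was saved:

```php
<?php
// "–" (U+2013, EN DASH) is three bytes in UTF-8: E2 80 93.
$dash = "\xE2\x80\x93";

// strlen() and [] work on bytes, so they see three units, not one character.
$byteCount = strlen($dash);
$firstByte = bin2hex($dash[0]); // an incomplete sequence on its own

// The mb_* functions work on characters once they are told the encoding.
$charCount = mb_strlen($dash, 'UTF-8');
$firstChar = mb_substr($dash, 0, 1, 'UTF-8');

echo $byteCount, ' bytes, ', $charCount, ' character', PHP_EOL;
```

Echoing a single byte such as $dash[0] is what produces the "�" in the browser.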
For more in-depth information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
Related
I have a test project which allows uploading various text files (subtitles), whose content then gets displayed. A problem occurs when the characters used are not plain alphabet letters, i.e. letters with diacritics such as ŠŽČĆ. The characters in the text file are fine before upload, but when I open an uploaded file on the server, all ŠŽČĆĐ characters are replaced by a placeholder glyph. Yes, you saw it correctly, it's a rectangle thingy.
I use this line which work great on localhost, but on shared hosting throws a fit.
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", utf8_encode($tmp));
Here $tmp is the string to be converted and $temp receives the result.
Is it hosting thing, could i do something to prevent it?
PS: If i don't use utf8_encode on $tmp variable, server throws an error.
Edit1:
[Image: how the file looks when opened on the shared hosting.]
[Image: the same text after copy/paste.] Sadly it doesn't get rendered on SO. Or luckily, depending on how you look at it...
The line above is an image, not typed-out characters. It is, however, text I typed plus a character from the uploaded file, copied and pasted while writing this post on SO.
Edit2:
I sort of figured out what the problem is. The file is correctly saved as UTF-8 and contains the letters mentioned above.
When the file gets uploaded, these letters get changed to the rectangle thing. So when I open the file on the server, instead of ŠŽČĆĐ I get rectangles. How do I prevent the server from changing anything, so the file is uploaded as is?
So it's not a formatting thing, although setting the encoding to UTF-8 seems to help to at least display it, and if I don't set the encoding to UTF-8, it throws an error.
I'm using Laravel as backend.
Edit3:
If i test specific char after being read from file with this
mb_detect_encoding(file($path)[8][9]) // It should be the š character
It reports UTF-8, but if that were true the character would display correctly.
If i try this line:
mb_convert_encoding(file($path)[8][9], "UTF-8", "ISO-8859-1")
then it shows rectangle thing like in file on server.
If i use to detect encoding with additional parameters like:
mb_detect_encoding(file($path)[8], "UTF-8", TRUE);
to determine if it's actual utf8, it says it's false.
And if i paste rectangle thing into google translate it shows an "š".
which is correct letter.
If i use bin2hex() to see hex code and for example argument is š letter, i get 9a hex code.
If anybody has any idea how to recreate function that will differentiate between these rectangles and show correct hex code or char itself, or how to upload to shared hosting without allowing it to change letters encoding in text file, or how to approach whole problem, it would be much obliged.
Do not use utf8_encode(). It only converts from ISO-8859-1, and it does not work with Windows-1252.
https://www.php.net/manual/en/function.utf8-encode.php
The second problem is that your code does a double encoding. I have marked the two functions that convert a string to UTF-8.
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", utf8_encode($tmp));
/*      ^^^^^                                                             ^^^^^^^^^^^ */
If the code below does not work, I would debug the output of mb_detect_encoding($tmp, mb_detect_order(), true). The default values for mb_detect_order() may be far from optimal for your situation.
https://www.php.net/manual/en/function.mb-detect-encoding.php
https://www.php.net/manual/en/function.mb-detect-order.php
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", $tmp);
You can use mb_convert_encoding() in place of iconv.
https://www.php.net/manual/en/function.mb-convert-encoding.php
For your problem I would write this code:
/* If there are no Asian languages, UTF-8 is the only encoding mb_detect_encoding() can recognize. */
/* Strict mode (third argument) makes invalid UTF-8 return false instead of a guess. */
if (mb_detect_encoding($tmp, 'UTF-8', true) !== false) {
    $temp = $tmp;
} else {
    /* It is not UTF-8. Assume WINDOWS-1252. */
    $temp = mb_convert_encoding($tmp, 'UTF-8', 'WINDOWS-1252');
}
It is very hard to reliably detect a particular single-byte encoding. I am not aware of any built-in PHP function for this.
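The same idea can also be written as a validity check. This is only a sketch: to_utf8 is a hypothetical helper name, mbstring is assumed to be loaded, and the Windows-1252 fallback simply follows the assumption made above. The uploaded files may well use a different code page, in which case the fallback has to change.

```php
<?php
// Hypothetical helper: keep valid UTF-8 as is, otherwise assume a fallback encoding.
function to_utf8($bytes, $fallback = 'Windows-1252')
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8 (plain ASCII passes too)
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// 0x9A is "š" in Windows-1252 and is not valid UTF-8 on its own.
echo bin2hex(to_utf8("\x9A")), PHP_EOL;     // c5a1, the UTF-8 bytes of "š"
echo bin2hex(to_utf8("\xC5\xA1")), PHP_EOL; // c5a1, left untouched
```

Checking validity first also avoids the double encoding described above, because text that is already UTF-8 is never converted a second time.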
I work on a system that automates signature generation for Outlook. The part that generates the .htm files works great. But now I also need to add files in .txt format. If I use the content without any change in the encoding, all my accented characters are converted to a different value, for example: "é" becomes "Ã©" or "ô" becomes "Ã´".
This issue clearly looked like an encoding conflict of some sort. I tried to correct it by converting the text value input to the "Windows-1252" encoding.
$myText = iconv( mb_detect_encoding( $myText ) , "Windows-1252//TRANSLIT", $myText);
But it didn't change anything. I also tried with :
$myText = mb_convert_encoding($myText, "Windows-1252");
And it didn't work either. For both of these tests, I checked the file type with Atom (my IDE) and it recognise these files as UTF-8. But when I check on terminal with file -I signature.txt it responds with this encoding signature.txt: text/plain; charset=iso-8859-1
Note that if I manually change the encoding to Windows-1252 in Atom, the characters are correct.
Has anyone met the same problem ? Is there another way in php to specify the encoding of the file ?
I figured it out. The code to use was (as pointed out by @Powerlord):
$monTexteTXT = mb_convert_encoding($monTexteTXT, "Windows-1252", "UTF-8");
I had a false negative when I first tried this solution because when I opened the file the characters seemed broken. But once it was opened with outlook it was fine.
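A minimal sketch of that conversion followed by the write, assuming mbstring is available. The temporary file and the sample text are stand-ins for the real signature file, not part of the original system:

```php
<?php
// "é" and "ô" as UTF-8 bytes, independent of this file's own encoding.
$utf8Text = "caf\xC3\xA9 h\xC3\xB4tel";

// Convert from UTF-8 (source) to Windows-1252 (target) before writing.
$cp1252Text = mb_convert_encoding($utf8Text, 'Windows-1252', 'UTF-8');

// Temporary file used only for this illustration.
$path = tempnam(sys_get_temp_dir(), 'sig');
file_put_contents($path, $cp1252Text);

// Each accented letter is now a single byte: E9 for "é", F4 for "ô".
$written = file_get_contents($path);
echo bin2hex($written), PHP_EOL;
unlink($path);
```

An editor that guesses UTF-8 will show such a file as broken, which matches the false negative described above.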
i just started dabbling in php and i'm afraid i need some help to figure out how to manipulate utf-8 strings.
I'm working in ubuntu 11.10 x86, php version 5.3.6-13ubuntu3.2. I have a utf-8 encoded file (vim :set encoding confirms this) which i then proceed to reading it using
$file = fopen("file.txt", "r");
while(!feof($file)){
$line = fgets($file);
//...
}
fclose($file);
using mb_detect_encoding($line) reports UTF-8
If i do echo $line I can see the line properly (no mangled characters) in the browser
so I guess everything is fine with browser and apache. Though i did search my apache configuration for AddDefaultCharset and tried adding http meta-tags for character encoding (just in case)
When i try to split the string using $arr = mb_split(';',$line) the fields of the resulting array contain mangled utf-8 characters (mb_detect_encoding($arr[0]) reports utf-8 as well).
So echo $arr[0] will result in something like this: ΑΘΗÎÎ.
I have tried setting mb_detect_order('utf-8'), mb_internal_encoding('utf-8'), but nothing changed. I also tried to manually detect utf-8 using this w3 perl regex because i read somewhere that mb_detect_encoding can sometimes fail (myth?), but results were the same as well.
So my question is how can i properly split the string? Is going down the mb_ path the wrong way? What am I missing?
Thank you for your help!
UPDATE: I'm adding sample strings and base64 equivalents (thanks to @chris for his suggestion)
1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ΑΘΗÎΑ"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="
Ok, so after doing this there seems to be a 77u/ difference between 3. and 5. which according to this is a utf-8 BOM mark. So how can i avoid it?
UPDATE 2: I woke up refreshed today and with your tips in mind i tried it again. It seems that $line=fgets($file) reads the first line correctly (no mangled chars), and fails for each subsequent line. So then i base64_encoded the first and second line, and the 77u/ bom appeared on the base64'd string of the first line only. I then opened up the offending file in vim, and entered :set nobomb :w to save the file without the bom. Firing up php again showed that the first line was also mangled now. Based on @hakre's remove_utf8_bom i added its complementary function
function add_utf8_bom($str){
$bom= "\xEF\xBB\xBF";
return substr($str,0,3)===$bom?$str:$bom.$str;
}
and voila each line is read correctly now.
I do not much like this solution, as it seems very very hackish (i can't believe that an entire framework/language does not provide for a way to deal with nobombed strings). So do you know of an alternate approach? Otherwise I'll proceed with the above.
Thanks to @chris, @hakre and @jacob for their time!
UPDATE 3 (solution): It turns out after all that it was a browser thing: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta-tags like <meta http-equiv="Content-type" value="text/html; charset=UTF-8" />. It also had to be properly enclosed inside an <html><body> section or the browser would not understand the encoding correctly. Thanks to @jake for his suggestion.
Moral of the story: I should learn more about html before trying coding for the browser in the first place. Thanks for your help and patience everyone.
UTF-8 has the very nice feature that it is ASCII-compatible. With this I mean that:
ASCII characters stay the same when encoded to UTF-8
no other characters will be encoded to ASCII characters
This means that when you try to split a UTF-8 string by the semicolon character ;, which is an ASCII character, you can just use standard single byte string functions.
In your example, you can just use explode(';',$utf8encodedText) and everything should work as expected.
PS: Since UTF-8 is self-synchronizing (the byte sequence of one character never occurs inside the byte sequence of another), you can actually use explode() with any UTF-8 encoded separator.
PPS: It seems like you try to parse a CSV file. Have a look at the fgetcsv() function. It should work perfectly on UTF-8 encoded strings as long as you use ASCII characters for separators, quotes, etc.
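Both suggestions can be tried on the sample line from the question. This is a sketch under the assumption that the line carries a UTF-8 BOM, as the first line of the file did; the BOM is dropped first so it does not end up in the first field:

```php
<?php
// Sample line from the question, with a UTF-8 BOM in front.
$line = "\xEF\xBB\xBF" . base64_decode(
    'zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5'
);

// Drop the three BOM bytes if present; they are not part of the data.
if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
    $line = substr($line, 3);
}

// Plain byte-oriented splitting is safe because ";" is ASCII.
$fields = explode(';', $line);

// str_getcsv() gives the same result here and also honours quoted fields.
$csvFields = str_getcsv($line, ';', '"', '\\');

print_r($fields);
```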
When you write debug/testing scripts in php, make sure you output a more or less valid HTML page.
I like to use a PHP file similar to the following:
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8>
<title>Test page for project XY</title>
</head>
<body>
<h1>Test Page</h1>
<pre><?php
echo print_r($_GET,1);
?></pre>
</body>
</html>
If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
Edit: I just read your post more closely. You're suggesting a BOM was introduced by mb_split(), in which case this should output false.
header('content-type: text/plain;charset=utf-8');
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);
$pieces = mb_split(';', $str);
var_dump(substr($str, 0, 10) === $pieces[0]);
var_dump($pieces);
Does it? It works as expected for me (bool(true), and the strings in the array are correct).
The mb_split() function should be fine, but you should also define the charset it uses with mb_regex_encoding():
mb_regex_encoding('UTF-8');
About mb_detect_encoding(): it can fail, but that's just because you can never truly detect an encoding. You either know it or you can guess, and that's all. Encoding detection is mostly a gambling game; however, you can use the strict parameter with that function and specify the encoding(s) you're looking for.
How to remove the BOM:
You can filter the string input and remove a UTF-8 BOM with a small helper function:
/**
 * Remove the UTF-8 BOM if the string has one at the beginning.
 *
 * @param string $str
 * @return string
 */
function remove_utf8_bom($str)
{
    // Compare the first three bytes directly. Combining an assignment with &&
    // in one condition would assign a boolean because of operator precedence.
    if (substr($str, 0, 3) === "\xEF\xBB\xBF")
    {
        $str = substr($str, 3);
    }
    return $str;
}
Usage:
$line = remove_utf8_bom($line);
There are probably better ways to do it, but this should work.
I have a problem where users upload zipped text files. After I extract text contents I import them in mysql database. But later when I display the text in browser some characters are garbled. I tried to encode them but I am unable to detect the encoding of the text files with PHP and convert to UTF-8 with iconv or mbstring.
Mysql database charset is UTF-8.
header('Content-type: text/html; charset=utf-8');
is added.
Tried with
iconv('UTF-8', 'UTF-8//IGNORE', $text_file_contents)
But it simply removes the garbled chars: � which should be either ' or " when I checked manually with Firefox browser. Firefox showed that is ISO-8859-1 but I can not check for every article they send (articles may be in different character set).
How to convert this characters to UTF-8 ?
EDIT:
This is a modified function I found on
http://php.net/manual/en/function.mb-detect-encoding.php
originally written by prgss at bk dot ru.
function myutf8_detect_encoding($string, $default = 'UTF-8', $encode = 0, $encode_to = 'UTF-8') {
static $list = array('UTF-8', 'ISO-8859-1', 'ASCII', 'windows-1250', 'windows-1251', 'latin1', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'ISO-8859-2', 'ISO-8859-3', 'GBK', 'GB2312', 'GB18030', 'MACROMAN', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-11', 'ISO-8859-12', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16');
foreach ($list as $item) {
$sample = iconv($item, $item, $string);
if (md5($sample) == md5($string)) {
if ($encode == 1)
return iconv($item, $encode_to, $string);
else
return $item;
}
}
if ($encode == 1)
return iconv($encode_to, $encode_to . '//IGNORE', $string);
else
return $default;
}
and in my code I use:
myutf8_detect_encoding(trim($description), 'UTF-8', 1)
but it still returns garbled characters of this text “old is gold’’ .
This is indeed tricky.
Detecting an arbitrary string's encoding using mb_detect_encoding() is known to be not very reliable (although it should be able to distinguish between UTF-8 and ISO-8859-1, for example, so make sure you give it a try first).
If the auto-detection doesn't work out, there is the option of displaying the content to the user before it gets submitted, along with a drop-down menu to switch between the most used encodings. Then show a message like
Please check your submission. If you are seeing incorrect or garbled characters, please change the encoding in the drop-down menu until the content is correct.
Whenever the user changes the drop-down value, your script will pull the content again, use iconv() to convert it from the specified encoding to UTF-8, and output the result, until it looks good.
This needs some finesse in designing the User Interface to be understandable for the end user, but it would often be the best option. Especially if you are dealing with users from many different regions or continents with a lot of different encodings.
Having had the same problem of encoding detection, I made a php function that outputs different information about the string and should make it relatively easy to identify the encoding used.
http://php.net/manual/en/function.ord.php (function hex_chars by "manixrock(hat)gmail(doink)com").
It shows the values of the characters inside the string, as well as the values of each individual byte. You look at the output and see which of your suspected encodings matches the bytes. You should first familiarize yourself with the various popular encodings like UTF-8, UTF-16, ISO-8859-X (understand their byte storage). Also make sure you test the string as unaltered as possible (take care how the encoding might change between what PHP outputs and what the browser receives, how the browser displays, or if you get the string from another source like MySQL or a file how that may change the encoding).
This helped me detect that a text had undergone the conversions: (UTF-8 to byte[]) then (ISO-8859-1 to UTF-8). That function helped a lot. Hope it helps you.
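That conversion chain can be reproduced and undone in a few lines. This is a minimal sketch assuming mbstring is loaded; it is not the hex_chars function from the manual comment, just bin2hex() applied to a known string:

```php
<?php
// "é" correctly encoded as UTF-8: C3 A9.
$good = "\xC3\xA9";

// Treating those UTF-8 bytes as ISO-8859-1 and converting "to UTF-8" again
// encodes each byte separately, which yields the familiar "Ã©".
$double = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');

echo bin2hex($good), PHP_EOL;   // c3a9
echo bin2hex($double), PHP_EOL; // c383c2a9

// Reversing the mistaken step restores the original bytes.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
echo bin2hex($repaired), PHP_EOL; // c3a9
```

Seeing c383 or c382 pairs in a hex dump is a strong hint that text was encoded to UTF-8 twice.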
Use mb_detect_encoding to find out what encoding is used, then iconv to convert.
Try to insert right after the mysql connection:
mysql_query("SET NAMES utf8");
A website I recently completed with a friend has a gallery where one can upload images and text files. The only accepted text file (to ease development) is .txt and normally goes off without a hitch (or not..)
The problems I've encountered are the same of any developer: Microsoft's Extended ASCII.
Before outputting the text from the file, I go over several different layers to try to clean it up:
$txtfile = file_get_contents(".".$this->var['submission']['file_loc']);
// BOM Fun
$boms = array
(
"utf8" => array(3,pack("CCC",0xEF,0xBB,0xBF)),
"utf16be" => array(2,pack("CC",0xFE,0xFF)),
"utf16le" => array(2,pack("CC",0xFF,0xFE)),
"utf32be" => array(4,pack("CCCC",0x00,0x00,0xFE,0xFF)),
"utf32le" => array(4,pack("CCCC",0xFF,0xFE,0x00,0x00)),
"gb18030" => array(4,pack("CCCC",0x84,0x31,0x95,0x33))
);
foreach($boms as $bom)
{
if(substr($txtfile,0,$bom[0]) === $bom[1]) // BOM lengths are byte counts, so compare bytes with substr()
{
$txtfile = substr($txtfile,$bom[0]);
break;
}
}
$txtfile_o = $txtfile;
$badwords = array(chr(145),chr(146),chr(147),chr(148),chr(151),chr(133));
$fixwords = array("'","'",'"','"','-','...');
$txtfile_o = str_replace($badwords,$fixwords,$txtfile_o);
$txtfile_o = mb_convert_encoding($txtfile_o,"UTF-8");
The str_replace is the general method of converting Microsoft's awful smart quotes, em-dash, and ellipsis into their normal ASCII equivalents for output.
This code works perfectly fine under the condition that the file uploaded is ANSI / us-ascii.
This code does not work (for no particular reason) when the uploaded file is UTF-8.
When the file is UTF-8, viewing the file itself in the web browser works fine, but printing it out via the web interface using this code does not. In this event, the smart quotes become some sort of accented a character.
This is where I'm stuck. The output encoding for the webpage is UTF-8, the web browser sees it as UTF-8, the file is in UTF-8 and yet neither the replace for the smart quotes works nor does the web browser display them correctly.
Any and all help on this would be greatly appreciated.
If I understand correctly your problem is that your code that replaces "extended ASCII" characters for their ASCII counterparts fails when the user submits a file in UTF-8.
This was to be expected. str_replace and similar functions operate at the byte level, and your search strings are single Windows-1252 bytes. In UTF-8 only characters in the ASCII range are encoded as one byte; smart quotes, dashes, and the ellipsis are multi-byte sequences, so those single-byte needles never match them.
What I'd recommend you to do is to use some heuristic to determine if the file is encoded in UTF-8 (the BOM is a good way if you're sure it'll be present) or Windows-1252 or whatever and then convert it to UTF-8 if it isn't. In that case, you wouldn't need to replace any characters, you could preserve the smart quotes.
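One possible shape for such a heuristic, as a sketch only: split_bom and normalise_to_utf8 are hypothetical helper names, mbstring is assumed to be loaded, the BOM table covers just the Unicode encodings, and the Windows-1252 fallback is an assumption about where the uploads come from.

```php
<?php
// Hypothetical helper: return array(encoding or null, text without the BOM).
// Longer BOMs come first, because the UTF-32LE BOM starts with the UTF-16LE one.
function split_bom($bytes)
{
    $boms = array(
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    );
    foreach ($boms as $encoding => $bom) {
        // substr() counts bytes, which is what a BOM comparison needs.
        if (substr($bytes, 0, strlen($bom)) === $bom) {
            return array($encoding, (string) substr($bytes, strlen($bom)));
        }
    }
    return array(null, $bytes);
}

// Hypothetical helper: normalise uploaded text to UTF-8 without a BOM.
function normalise_to_utf8($bytes)
{
    list($encoding, $body) = split_bom($bytes);
    if ($encoding === null) {
        // No BOM: valid UTF-8 is kept, anything else is assumed to be Windows-1252.
        $encoding = mb_check_encoding($body, 'UTF-8') ? 'UTF-8' : 'Windows-1252';
    }
    if ($encoding === 'UTF-8') {
        return $body;
    }
    return mb_convert_encoding($body, 'UTF-8', $encoding);
}

// 0x93 and 0x94 are the Windows-1252 smart double quotes.
echo bin2hex(normalise_to_utf8("\x93hi\x94")), PHP_EOL;
```

After this step every file is UTF-8, so the smart quotes can be kept or replaced with one set of UTF-8 search strings.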
The characters you are trying to replace have different byte values in UTF8. Actually, they have more than one byte each in UTF8. You are trying to search for them with Windows encoding values and that's why you won't find them.
Look up the UTF8 byte sequences of the characters and use them for the search.
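A sketch of that lookup for the six characters the question replaces; the byte sequences are the standard UTF-8 encodings of those code points, and the sample text is made up for illustration:

```php
<?php
// UTF-8 byte sequences for the same six characters the question replaces.
$search = array(
    "\xE2\x80\x98", // U+2018 left single quote  (0x91 in Windows-1252)
    "\xE2\x80\x99", // U+2019 right single quote (0x92)
    "\xE2\x80\x9C", // U+201C left double quote  (0x93)
    "\xE2\x80\x9D", // U+201D right double quote (0x94)
    "\xE2\x80\x94", // U+2014 em dash            (0x97)
    "\xE2\x80\xA6", // U+2026 ellipsis           (0x85)
);
$replace = array("'", "'", '"', '"', '-', '...');

// Sample UTF-8 input: smart double quotes around the text, then an ellipsis.
$text  = "\xE2\x80\x9Cold is gold\xE2\x80\x9D\xE2\x80\xA6";
$plain = str_replace($search, $replace, $text);

echo $plain, PHP_EOL; // "old is gold"...
```

str_replace stays byte-oriented here; it works because the needles are now the same byte sequences that appear in the UTF-8 file.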