UTF-8 characters in fwrite - php

I'm trying to save HTML to a .html file,
This is working:
$html_file = "output.html";
$output_string="string with characters like ã or ì";
$fileHandle = fopen($html_file, 'w') or die("file could not be accessed/created");
fwrite($fileHandle, $output_string);
fclose($fileHandle);
When I check the output.html file, these special characters in my output_string are not read correctly.
My HTML file can't have a <head> tag with the charset information, this makes it work, but my output can't have any <html>, <head> or <body> tags.
I have tried stuff like
header('Content-type: text/plain; charset=utf-8');
I also tried utf8_encode() on the string before fwrite, but with no success so far.
If I read the output.html file in Notepad++ or Netbeans IDE, it shows the correct characters being saved, it's the browser that isn't them reading properly.
I'm pretty sure PHP is saving my file with the incorrect charset, because if I create HTML files in my computer with those special characters (even without any charset setting), these are read correctly.

Try to add a BOM (Byte Order Mark) to your file :
$output_string = "\xEF\xBB\xBF";
$output_string .= "string with characters like ã or ì";
$fileHandle = // ...

Yes, PHP is writing the file correctly, only the reading program doesn't know what character encoding it is and interprets the data with the wrong charset. If you cannot include meta information that convey the correct charset and if the file format itself (plain text) does not offer a way to specify the charset and if the reading application is not able to correctly guess the charset, then there's no solution.

Whatever editor you are using to write this code must have facility to set character-type as 'UTF-8'.
Set the character-type of the file in which you have written this code.
I am using an editor that allows to change the character encoding of file from the bottom. There must be something similar for the editor you are using.

If you need the string in UTF-8 regardless of the php-script-file-encoding (if it's a single-byte one), you should use the UTF-8 encoding of those characters:
$output_string = "string with characters like \xC3\xA3 or \xC3\x8C";

Related

There is a hidden character in my output and I don't know what it is [duplicate]

I have a CSS file that looks fine when I open it using gedit, but when it's read by PHP (to merge all the CSS files into one), this CSS has the following characters prepended to it: 
PHP removes all whitespace, so a random  in the middle of the code messes up the entire thing. As I mentioned, I can't actually see these characters when I open the file in gedit, so I can't remove them very easily.
I googled the problem, and there is clearly something wrong with the file encoding, which makes sense being as I've been shifting the files around to different Linux/Windows servers via ftp and rsync, with a range of text editors. I don't really know much about character encoding though, so help would be appreciated.
If it helps, the file is being saved in UTF-8 format, and gedit won't let me save it in ISO-8859-15 format (the document contains one or more characters that cannot be encoded using the specified character encoding). I tried saving it with Windows and Linux line endings, but neither helped.
Three words for you:
Byte Order Mark (BOM)
That's the representation for the UTF-8 BOM in ISO-8859-1. You have to tell your editor to not use BOMs or use a different editor to strip them out.
To automatize the BOM's removal you can use awk as shown in this question.
As another answer says, the best would be for PHP to actually interpret the BOM correctly, for that you can use mb_internal_encoding(), like this:
<?php
//Storing the previous encoding in case you have some other piece
//of code sensitive to encoding and counting on the default value.
$previous_encoding = mb_internal_encoding();
//Set the encoding to UTF-8, so when reading files it ignores the BOM
mb_internal_encoding('UTF-8');
//Process the CSS files...
//Finally, return to the previous encoding
mb_internal_encoding($previous_encoding);
//Rest of the code...
?>
Open your file in Notepad++. From the Encoding menu, select Convert to UTF-8 without BOM, save the file, replace the old file with this new file. And it will work, damn sure.
In PHP, you can do the following to remove all non characters including the character in question.
$response = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $response);
For those with shell access here is a little command to find all files with the BOM set in the public_html directory - be sure to change it to what your correct path on your server is
Code:
grep -rl $'\xEF\xBB\xBF' /home/username/public_html
and if you are comfortable with the vi editor, open the file in vi:
vi /path-to-file-name/file.php
And enter the command to remove the BOM:
set nobomb
Save the file:
wq
BOM is just a sequence of characters ($EF $BB $BF for UTF-8), so just remove them using scripts or configure the editor so it's not added.
From Removing BOM from UTF-8:
#!/usr/bin/perl
#file=<>;
$file[0] =~ s/^\xEF\xBB\xBF//;
print(#file);
I am sure it translates to PHP easily.
I don't know PHP, so I don't know if this is possible, but the best solution would be to read the file as UTF-8 rather than some other encoding. The BOM is actually a ZERO WIDTH NO BREAK SPACE. This is whitespace, so if the file were being read in the correct encoding (UTF-8), then the BOM would be interpreted as whitespace and it would be ignored in the resulting CSS file.
Also, another advantage of reading the file in the correct encoding is that you don't have to worry about characters being misinterpreted. Your editor is telling you that the code page you want to save it in won't do all the characters that you need. If PHP is then reading the file in the incorrect encoding, then it is very likely that other characters besides the BOM are being silently misinterpreted. Use UTF-8 everywhere, and these problems disappear.
For me, this worked:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
If I remove this meta, the  appears again. Hope this helps someone...
You can use
vim -e -c 'argdo set fileencoding=utf-8|set encoding=utf-8| set nobomb| wq'
Replacing with awk seems to work, but it is not in place.
grep -rl $'\xEF\xBB\xBF' * | xargs vim -e -c 'argdo set fileencoding=utf-8|set encoding=utf-8| set nobomb| wq'
I had the same problem with the BOM appearing in some of my PHP files ().
If you use PhpStorm you can set at hotkey to remove it in Settings -> IDE Settings -> Keymap -> Main Menu - > File -> Remove BOM.
In Notepad++, choose the "Encoding" menu, then "Encode in UTF-8 without BOM". Then save.
See Stack Overflow question How to make Notepad to save text in UTF-8 without BOM?.
Open the PHP file under question, in Notepad++.
Click on Encoding at the top and change from "Encoding in UTF-8 without BOM" to just "Encoding in UTF-8". Save and overwrite the file on your server.
Same problem, different solution.
One line in the PHP file was printing out XML headers (which use the same begin/end tags as PHP). Looks like the code within these tags set the encoding, and was executed within PHP which resulted in the strange characters. Either way here's the solution:
# Original
$xml_string = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
# fixed
$xml_string = "<" . "?xml version=\"1.0\" encoding=\"UTF-8\"?" . ">";
If you need to be able to remove the BOM from UTF-8 encoded files, you first need to get hold of an editor that is aware of them.
I personally use E Text Editor.
In the bottom right, there are options for character encoding, including the BOM tag. Load your file, deselect Byte Order Marker if it is selected, resave, and it should be done.
Alt text http://oth4.com/encoding.png
E is not free, but there is a free trial, and it is an excellent editor (limited TextMate compatibility).
You can open it by PhpStorm and right-click on your file and click on Remove BOM...
Here is another good solution for the problem with BOM. These are two VBScript (.vbs) scripts.
One for finding the BOM in a file and one for KILLING the damned BOM in the file. It works pretty fine and is easy to use.
Just create a .vbs file, and paste the following code in it.
You can use the VBScript script simply by dragging and dropping the suspicious file onto the .vbs file. It will tell you if there is a BOM or not.
' Heiko Jendreck - personal helpdesk & webdesign
' http://www.phw-jendreck.de
' 2010.05.10 Vers 1.0
'
' find_BOM.vbs
' ====================
' Kleines Hilfsmittel, welches das BOM finden soll
'
Const UTF8_BOM = ""
Const UTF16BE_BOM = "þÿ"
Const UTF16LE_BOM = "ÿþ"
Const ForReading = 1
Const ForWriting = 2
Dim fso
Set fso = WScript.CreateObject("Scripting.FileSystemObject")
Dim f
f = WScript.Arguments.Item(0)
Dim t
t = fso.OpenTextFile(f, ForReading).ReadAll
If Left(t, 3) = UTF8_BOM Then
MsgBox "UTF-8-BOM detected!"
ElseIf Left(t, 2) = UTF16BE_BOM Then
MsgBox "UTF-16-BOM (Big Endian) detected!"
ElseIf Left(t, 2) = UTF16LE_BOM Then
MsgBox "UTF-16-BOM (Little Endian) detected!"
Else
MsgBox "No BOM detected!"
End If
If it tells you there is BOM, go and create the second .vbs file with the following code and drag the suspicios file onto the .vbs file.
' Heiko Jendreck - personal helpdesk & webdesign
' http://www.phw-jendreck.de
' 2010.05.10 Vers 1.0
'
' kill_BOM.vbs
' ====================
' Kleines Hilfmittel, welches das gefundene BOM löschen soll
'
Const UTF8_BOM = ""
Const ForReading = 1
Const ForWriting = 2
Dim fso
Set fso = WScript.CreateObject("Scripting.FileSystemObject")
Dim f
f = WScript.Arguments.Item(0)
Dim t
t = fso.OpenTextFile(f, ForReading).ReadAll
If Left(t, 3) = UTF8_BOM Then
fso.OpenTextFile(f, ForWriting).Write (Mid(t, 4))
MsgBox "BOM gelöscht!"
Else
MsgBox "Kein UTF-8-BOM vorhanden!"
End If
The code is from Heiko Jendreck.
In PHPStorm, for multiple files and BOM not necessarily at the beginning of the file, you can search \x{FEFF} (Regular Expression) and replace with nothing.
Same problem, but it only affected one file so I just created a blank file, copy/pasted the code from the original file to the new file, and then replaced the original file. Not fancy but it worked.
Use Total Commander to search for all BOMed files:
Elegant way to search for UTF-8 files with BOM?
Open these files in some proper editor (that recognizes BOM) like Eclipse.
Change the file's encoding to ISO (right click, properties).
Cut  from the beginning of the file, save
Change the file's encoding back to UTF-8
...and do not even think about using n...d again!
I had the same problem. The problem was because one of my php files was in utf-8 (the most important, the configuaration file which is included in all php files).
In my case, I had 2 different solutions which worked for me :
First, I changed the Apache Configuration by using AddDefaultCharsetDirective in configuration files (or in .htaccess). This solution forces Apache to use the correct encodage.
AddDefaultCharset ISO-8859-1
The second solution was to change the bad encoding of the php file.
Copy the text of your filename.css file.
Close your css file.
Rename it filename2.css to avoid a filename clash.
In MS Notepad or Wordpad, create a new file.
Paste the text into it.
Save it as filename.css, selecting UTF-8 from the encoding options.
Upload filename.css.
This works for me!
def removeBOMs(fileName):
BOMs = ['',#Bytes as CP1252 characters
'þÿ',
'ÿþ',
'^#^#þÿ',
'ÿþ^#^#',
'+/v',
'÷dL',
'Ýsfs',
'Ýsfs',
'^Nþÿ',
'ûî(',
'„1•3']
inputFile = open(fileName, 'r')
contents = inputFile.read()
for BOM in BOMs:
if not BOM in contents:#no BOM in the file...
pass
else:
newContents = contents.replace(BOM,'', 1)
newFile = open(fileName, 'w')
newFile.write(newContents)
return None
Check on your index.php, find "... charset=iso-8859-1" and replace it with "... charset=utf-8".
Maybe it'll work.

XML file isn't UTF-8 encoded when created in PHP

I'm trying to output XML file using PHP, and everything is right except that the file that is created isn't UTF-8 encoded, it's ANSI. (I see that when I open the file an do the Save as...).
I was using
$dom = new DOMDocument('1.0', 'UTF-8');
but I figured out that non-english characters don't appear on the output.
I was searching for solution and I tryed first adding
header("Content-Type: application/xml; charset=utf-8");
at the beginning of the php script but it say's:
Extra content at the end of the document
Below is a rendering of the page up to the first error.
I've tryed some other suggestions like not to include 'UTF-8' when creating the document but to write it separately:
$doc->encoding = 'UTF-8'; , but the result was the same.
I used
$doc->save("filename.xml");
to save the file, and I've tryed to change it to
$doc->saveXML();
but the non-english characters didn't appear.
Any ideas?
ANSI is not a real encoding. It's a word that basically means "whatever encoding my Windows computer is configured to use". Getting ANSI is a clear sign of relying on default encoding somewhere.
In order to generate valid UTF-8 output, you have to feed all XML functions with proper UTF-8 input. The most straightforward way to do it is to save your PHP source code as UTF-8 and then just type some non-English letters. If you are reading data from external sources (such as a database) you need to ensure that the complete toolchain makes proper use of encodings.
Whatever, using "Save as" in an undisclosed piece of software is not a reliable way to determine the file encoding.

Php cannot find way to split utf-8 strings

i just started dabbling in php and i'm afraid i need some help to figure out how to manipulate utf-8 strings.
I'm working in ubuntu 11.10 x86, php version 5.3.6-13ubuntu3.2. I have a utf-8 encoded file (vim :set encoding confirms this) which i then proceed to reading it using
$file = fopen("file.txt", "r");
while(!feof($file)){
$line = fgets($file);
//...
}
fclose($file);
using mb_detect_encoding($line) reports UTF-8
If i do echo $line I can see the line properly (no mangled characters) in the browser
so I guess everything is fine with browser and apache. Though i did search my apache configuration for AddDefaultCharset and tried adding http meta-tags for character encoding (just in case)
When i try to split the string using $arr = mb_split(';',$line) the fields of the resulting array contain mangled utf-8 characters (mb_detect_encoding($arr[0]) reports utf-8 as well).
So echo $arr[0] will result in something like this: ΑΘΗÎÎ.
I have tried setting mb_detect_order('utf-8'), mb_internal_encoding('utf-8'), but nothing changed. I also tried to manually detect utf-8 using this w3 perl regex because i read somewhere that mb_detect_encoding can sometimes fail (myth?), but results were the same as well.
So my question is how can i properly split the string? Is going down the mb_ path the wrong way? What am I missing?
Thank you for your help!
UPDATE: I'm adding sample strings and base64 equivalents (thanks to #chris' for his suggestion)
1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ΑΘΗÎΑ"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="
Ok, so after doing this there seems to be a 77u/ difference between 3. and 5. which according to this is a utf-8 BOM mark. So how can i avoid it?
UPDATE 2: I woke up refreshed today and with your tips in mind i tried it again. It seems that $line=fgets($file) reads correctly the first line (no mangled chars), and fails for each subsequent line. So then i base64_encoded the first and second line, and the 77u/ bom appeared on the base64'd string of the first line only. I then opened up the offending file in vim, and entered :set nobomb :w to save the file without the bom. Firing up php again showed that the first line was also mangled now. Based on #hakre's remove_utf8_bom i added it's complementary function
function add_utf8_bom($str){
$bom= "\xEF\xBB\xBF";
return substr($str,0,3)===$bom?$str:$bom.$str;
}
and voila each line is read correctly now.
I do not much like this solution, as it seems very very hackish (i can't believe that an entire framework/language does not provide for a way to deal with nobombed strings). So do you know of an alternate approach? Otherwise I'll proceed with the above.
Thanks to #chris, #hakre and #jacob for their time!
UPDATE 3 (solution): It turns out after all that it was a browser thing: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta-tags like <meta http-equiv="Content-type" value="text/html; charset=UTF-8" />. It also had to be properly enclosed inside an <html><body> section or the browser would not understand the encoding correctly. Thanks to #jake for his suggestion.
Morale of the story: I should learn more about html before trying coding for the browser in the first place. Thanks for your help and patience everyone.
UTF-8 has the very nice feature that it is ASCII-compatible. With this I mean that:
ASCII characters stay the same when encoded to UTF-8
no other characters will be encoded to ASCII characters
This means that when you try to split a UTF-8 string by the semicolon character ;, which is an ASCII character, you can just use standard single byte string functions.
In your example, you can just use explode(';',$utf8encodedText) and everything should work as expected.
PS: Since the UTF-8 encoding is prefix-free, you can actually use explode() with any UTF-8 encoded separator.
PPS: It seems like you try to parse a CSV file. Have a look at the fgetcsv() function. It should work perfectly on UTF-8 encoded strings as long as you use ASCII characters for separators, quotes, etc.
When you write debug/testing scripts in php, make sure you output a more or less valid HTML page.
I like to use a PHP file similar to the following:
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8>
<title>Test page for project XY</title>
</head>
<body>
<h1>Test Page</h1>
<pre><?php
echo print_r($_GET,1);
?></pre>
</body>
</html>
If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
Edit, I just read your post closer. You're suggesting this should output false, because you're suggesting a BOM was introduced by mb_split().
header('content-type: text/plain;charset=utf-8');
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);
$peices = mb_split(';', $str);
var_dump(substr($str, 0, 10) === $peices[0]);
var_dump($peices);
Does it? It works as expected for me( bool true, and the strings in the array are correct)
The mb_splitDocs function should be fine, but you should define the charset it's using as well with mb_regex_encodingDocs:
mb_regex_encoding('UTF-8');
About mb_detect_encodingDocs: it can fail, but that's just by the fact that you can never detect an encoding. You either know it or you can try but that's all. Encoding detection is mostly a gambling game, however you can use the strict parameter with that function and specify the encoding(s) you're looking for.
How to remove the BOM mask:
You can filter the string input and remove a UTF-8 bom with a small helper function:
/**
* remove UTF-8 BOM if string has it at the beginning
*
* #param string $str
* #return string
*/
function remove_utf8_bom($str)
{
if ($bytes = substr($str, 0, 3) && $bytes === "\xEF\xBB\xBF")
{
$str = substr($str, 3);
}
return $str;
}
Usage:
$line = remove_utf8_bom($line);
There are probably better ways to do it, but this should work.

Problem writing UTF-8 encoded file in PHP

I have a large file that contains world countries/regions that I'm seperating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there's a bunch of encoding specific entities (meaning entities that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line all together.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with a empty input encoding $str = iconv('', 'UTF-8', $str); (which may or may not work)...
It doesn't work like that. Even if you utf8_encode($theString) you will not CREATE a UTF8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
This to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
As the UTF-8 byte-order mark is '\xef\xbb\xbf' we should add it to the document's header.
<?php
function writeStringToFile($file, $string){
$f=fopen($file, "wb");
$file="\xEF\xBB\xBF".$string; // utf8 bom
fputs($f, $string);
fclose($f);
}
?>
The $file could be anything text or xml...
The $string is your UTF8 encoded string.
Try it now and it will write a UTF8 encoded file with your UTF8 content (string).
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, utf8_encode($s));
fclose($f);
?>

How do I remove  from the beginning of a file?

I have a CSS file that looks fine when I open it using gedit, but when it's read by PHP (to merge all the CSS files into one), this CSS has the following characters prepended to it: 
PHP removes all whitespace, so a random  in the middle of the code messes up the entire thing. As I mentioned, I can't actually see these characters when I open the file in gedit, so I can't remove them very easily.
I googled the problem, and there is clearly something wrong with the file encoding, which makes sense being as I've been shifting the files around to different Linux/Windows servers via ftp and rsync, with a range of text editors. I don't really know much about character encoding though, so help would be appreciated.
If it helps, the file is being saved in UTF-8 format, and gedit won't let me save it in ISO-8859-15 format (the document contains one or more characters that cannot be encoded using the specified character encoding). I tried saving it with Windows and Linux line endings, but neither helped.
Three words for you:
Byte Order Mark (BOM)
That's the representation for the UTF-8 BOM in ISO-8859-1. You have to tell your editor to not use BOMs or use a different editor to strip them out.
To automatize the BOM's removal you can use awk as shown in this question.
As another answer says, the best would be for PHP to actually interpret the BOM correctly, for that you can use mb_internal_encoding(), like this:
<?php
//Storing the previous encoding in case you have some other piece
//of code sensitive to encoding and counting on the default value.
$previous_encoding = mb_internal_encoding();
//Set the encoding to UTF-8, so when reading files it ignores the BOM
mb_internal_encoding('UTF-8');
//Process the CSS files...
//Finally, return to the previous encoding
mb_internal_encoding($previous_encoding);
//Rest of the code...
?>
Open your file in Notepad++. From the Encoding menu, select Convert to UTF-8 without BOM, save the file, replace the old file with this new file. And it will work, damn sure.
In PHP, you can do the following to remove all non characters including the character in question.
$response = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $response);
For those with shell access here is a little command to find all files with the BOM set in the public_html directory - be sure to change it to what your correct path on your server is
Code:
grep -rl $'\xEF\xBB\xBF' /home/username/public_html
and if you are comfortable with the vi editor, open the file in vi:
vi /path-to-file-name/file.php
And enter the command to remove the BOM:
set nobomb
Save the file:
wq
BOM is just a sequence of characters ($EF $BB $BF for UTF-8), so just remove them using scripts or configure the editor so it's not added.
From Removing BOM from UTF-8:
#!/usr/bin/perl
#file=<>;
$file[0] =~ s/^\xEF\xBB\xBF//;
print(#file);
I am sure it translates to PHP easily.
I don't know PHP, so I don't know if this is possible, but the best solution would be to read the file as UTF-8 rather than some other encoding. The BOM is actually a ZERO WIDTH NO BREAK SPACE. This is whitespace, so if the file were being read in the correct encoding (UTF-8), then the BOM would be interpreted as whitespace and it would be ignored in the resulting CSS file.
Also, another advantage of reading the file in the correct encoding is that you don't have to worry about characters being misinterpreted. Your editor is telling you that the code page you want to save it in won't do all the characters that you need. If PHP is then reading the file in the incorrect encoding, then it is very likely that other characters besides the BOM are being silently misinterpreted. Use UTF-8 everywhere, and these problems disappear.
For me, this worked:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
If I remove this meta, the  appears again. Hope this helps someone...
You can use
vim -e -c 'argdo set fileencoding=utf-8|set encoding=utf-8| set nobomb| wq'
Replacing with awk seems to work, but it is not in place.
grep -rl $'\xEF\xBB\xBF' * | xargs vim -e -c 'argdo set fileencoding=utf-8|set encoding=utf-8| set nobomb| wq'
I had the same problem with the BOM appearing in some of my PHP files ().
If you use PhpStorm you can set at hotkey to remove it in Settings -> IDE Settings -> Keymap -> Main Menu - > File -> Remove BOM.
In Notepad++, choose the "Encoding" menu, then "Encode in UTF-8 without BOM". Then save.
See Stack Overflow question How to make Notepad to save text in UTF-8 without BOM?.
Open the PHP file under question, in Notepad++.
Click on Encoding at the top and change from "Encoding in UTF-8 without BOM" to just "Encoding in UTF-8". Save and overwrite the file on your server.
Same problem, different solution.
One line in the PHP file was printing out XML headers (which use the same begin/end tags as PHP). Looks like the code within these tags set the encoding, and was executed within PHP which resulted in the strange characters. Either way here's the solution:
# Original
$xml_string = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
# fixed
$xml_string = "<" . "?xml version=\"1.0\" encoding=\"UTF-8\"?" . ">";
If you need to be able to remove the BOM from UTF-8 encoded files, you first need to get hold of an editor that is aware of them.
I personally use E Text Editor.
In the bottom right, there are options for character encoding, including the BOM tag. Load your file, deselect Byte Order Marker if it is selected, resave, and it should be done.
Alt text http://oth4.com/encoding.png
E is not free, but there is a free trial, and it is an excellent editor (limited TextMate compatibility).
You can open it by PhpStorm and right-click on your file and click on Remove BOM...
Here is another good solution for the problem with BOM. These are two VBScript (.vbs) scripts.
One for finding the BOM in a file and one for KILLING the damned BOM in the file. It works pretty fine and is easy to use.
Just create a .vbs file, and paste the following code in it.
You can use the VBScript script simply by dragging and dropping the suspicious file onto the .vbs file. It will tell you if there is a BOM or not.
' Heiko Jendreck - personal helpdesk & webdesign
' http://www.phw-jendreck.de
' 2010.05.10 Vers 1.0
'
' find_BOM.vbs
' ====================
' Kleines Hilfsmittel, welches das BOM finden soll
'
Const UTF8_BOM = ""
Const UTF16BE_BOM = "þÿ"
Const UTF16LE_BOM = "ÿþ"
Const ForReading = 1
Const ForWriting = 2
Dim fso
Set fso = WScript.CreateObject("Scripting.FileSystemObject")
Dim f
f = WScript.Arguments.Item(0)
Dim t
t = fso.OpenTextFile(f, ForReading).ReadAll
If Left(t, 3) = UTF8_BOM Then
MsgBox "UTF-8-BOM detected!"
ElseIf Left(t, 2) = UTF16BE_BOM Then
MsgBox "UTF-16-BOM (Big Endian) detected!"
ElseIf Left(t, 2) = UTF16LE_BOM Then
MsgBox "UTF-16-BOM (Little Endian) detected!"
Else
MsgBox "No BOM detected!"
End If
If it tells you there is BOM, go and create the second .vbs file with the following code and drag the suspicios file onto the .vbs file.
' Heiko Jendreck - personal helpdesk & webdesign
' http://www.phw-jendreck.de
' 2010.05.10 Vers 1.0
'
' kill_BOM.vbs
' ====================
' Kleines Hilfmittel, welches das gefundene BOM löschen soll
'
Const UTF8_BOM = ""
Const ForReading = 1
Const ForWriting = 2
Dim fso
Set fso = WScript.CreateObject("Scripting.FileSystemObject")
Dim f
f = WScript.Arguments.Item(0)
Dim t
t = fso.OpenTextFile(f, ForReading).ReadAll
If Left(t, 3) = UTF8_BOM Then
fso.OpenTextFile(f, ForWriting).Write (Mid(t, 4))
MsgBox "BOM gelöscht!"
Else
MsgBox "Kein UTF-8-BOM vorhanden!"
End If
The code is from Heiko Jendreck.
In PHPStorm, for multiple files and BOM not necessarily at the beginning of the file, you can search \x{FEFF} (Regular Expression) and replace with nothing.
Same problem, but it only affected one file so I just created a blank file, copy/pasted the code from the original file to the new file, and then replaced the original file. Not fancy but it worked.
Use Total Commander to search for all BOMed files:
Elegant way to search for UTF-8 files with BOM?
Open these files in some proper editor (that recognizes BOM) like Eclipse.
Change the file's encoding to ISO (right click, properties).
Cut  from the beginning of the file, save
Change the file's encoding back to UTF-8
...and do not even think about using n...d again!
I had the same problem. The problem was because one of my php files was in utf-8 (the most important, the configuaration file which is included in all php files).
In my case, I had 2 different solutions which worked for me :
First, I changed the Apache Configuration by using AddDefaultCharsetDirective in configuration files (or in .htaccess). This solution forces Apache to use the correct encodage.
AddDefaultCharset ISO-8859-1
The second solution was to change the bad encoding of the php file.
Copy the text of your filename.css file.
Close your css file.
Rename it filename2.css to avoid a filename clash.
In MS Notepad or Wordpad, create a new file.
Paste the text into it.
Save it as filename.css, selecting UTF-8 from the encoding options.
Upload filename.css.
This works for me!
def removeBOMs(fileName):
BOMs = ['',#Bytes as CP1252 characters
'þÿ',
'ÿþ',
'^#^#þÿ',
'ÿþ^#^#',
'+/v',
'÷dL',
'Ýsfs',
'Ýsfs',
'^Nþÿ',
'ûî(',
'„1•3']
inputFile = open(fileName, 'r')
contents = inputFile.read()
for BOM in BOMs:
if not BOM in contents:#no BOM in the file...
pass
else:
newContents = contents.replace(BOM,'', 1)
newFile = open(fileName, 'w')
newFile.write(newContents)
return None
Check on your index.php, find "... charset=iso-8859-1" and replace it with "... charset=utf-8".
Maybe it'll work.

Categories