UTF-8 BOM signature in PHP files

I was writing some commented PHP classes and I stumbled upon a problem. My name (for the @author tag) ends with a ș (which is a UTF-8 character, ...and a strange name, I know).
Even though I save the file as UTF-8, some friends reported that they see that character totally messed up (È™). This problem goes away by adding the BOM signature. But that thing troubles me a bit, since I don't know that much about it, except from what I saw on Wikipedia and on some other similar questions here on SO.
I know that it adds some things at the beginning of the file, and from what I understood it's not that bad, but I'm concerned because the only problematic scenarios I read about involved PHP files. And since I'm writing PHP classes to share them, being 100% compatible is more important than having my name in the comments.
But I'm trying to understand the implications: should I use it without worrying, or are there cases where it might cause damage? If so, when?

Indeed, the BOM is actual data sent to the browser. The browser will happily ignore it, but once it has been output you can no longer send headers.
I believe the problem really is your and your friend's editor settings. Without a BOM, your friend's editor may not automatically recognize the file as UTF-8. He can try to set up his editor such that the editor expects a file to be in UTF-8 (if you use a real IDE such as NetBeans, then this can even be made a project setting that you can transfer along with the code).
An alternative is to try some tricks: some editors try to determine the encoding using some heuristics based on the entered text. You could try to start each file with
<?php //Úτƒ-8 encoded
and maybe the heuristic will get it. There's probably better stuff to put there, and you can either google for what kind of encoding detection heuristics are common, or just try some out :-)
All in all, I recommend just fixing the editor settings.
Oh wait, I misread the last part: if you are distributing the code widely, I guess you're safest either making all files contain only the lower 7-bit characters, i.e. plain ASCII, or accepting that some people with ancient editors will see your name written funny. There is no fail-safe way. The BOM is definitely bad because of the "headers already sent" problem. On the other hand, as long as you only put non-ASCII characters in comments and the like, the only impact of an editor misunderstanding the encoding is weird characters. I'd go for spelling your name correctly and adding a comment targeted at heuristics so that most editors will get it, but there will always be people who see bogus characters instead.
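The "plain ASCII only" option can be checked mechanically before a file is shared. This is a minimal sketch, assuming you simply want to flag any byte outside the 7-bit range; the function name is_plain_ascii is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: true when $source contains only 7-bit ASCII bytes.
function is_plain_ascii(string $source): bool
{
    // Without the /u modifier the pattern works on raw bytes, so any byte
    // in the range 0x80-0xFF (every non-ASCII UTF-8 sequence) is a match.
    return preg_match('/[^\x00-\x7F]/', $source) === 0;
}
```

Calling is_plain_ascii(file_get_contents($path)) on each class file before a release would tell you whether an editor's encoding guess can matter at all.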

A BOM would cause the "headers already sent" error, so you can't use a BOM in PHP files.

This is an old post and has already been answered, but I can leave you some other resources that I found when I faced this BOM issue.
http://people.w3.org/rishida/utils/bomtester/index.php lets you check whether a specific file contains a BOM.
There is also a handy script that prints all files with a BOM in your current directory.
<?php
// Returns true if the file starts with the UTF-8 BOM (EF BB BF).
function fopen_utf8($filename) {
    $file = @fopen($filename, "r");
    if ($file === false) {
        return false;
    }
    $bom = fread($file, 3);
    fclose($file);
    return $bom === "\xEF\xBB\xBF";
}

// Walks $path and prints every file that starts with a BOM.
function file_array($path, $exclude = ".|..|design", $recursive = true) {
    $path = rtrim($path, "/") . "/";
    $folder_handle = opendir($path);
    $exclude_array = explode("|", $exclude);
    $result = array();
    while (false !== ($filename = readdir($folder_handle))) {
        if (!in_array(strtolower($filename), $exclude_array)) {
            if (is_dir($path . $filename . "/")) {
                // Need to include full "path" or it's an infinite loop
                if ($recursive) $result[] = file_array($path . $filename . "/", $exclude, true);
            } else {
                if (fopen_utf8($path . $filename)) {
                    //$result[] = $filename;
                    echo ($path . $filename . "<br>");
                }
            }
        }
    }
    closedir($folder_handle);
    return $result;
}
$files = file_array(".");
?>
I found that code at php.net.
Dreamweaver also helps with this: it gives you the option to save the file without the BOM.
It's a late answer, but I still hope it helps.

Just so you know, there's an option in PHP, zend.multibyte, which allows PHP to read files with a BOM without giving the "headers already sent" error.
From the php.ini file:
; If enabled, scripts may be written in encodings that are incompatible with
; the scanner. CP936, Big5, CP949 and Shift_JIS are the examples of such
; encodings. To use this feature, mbstring extension must be enabled.
; Default: Off
;zend.multibyte = Off

In PHP, in addition to the "headers already sent" error, the presence of a BOM can also screw up the HTML in the browser in more subtle ways.
See Display problems caused by the UTF-8 BOM for an outline of the problem with some focus on PHP (W3C Internationalization).
When this occurs, not only is there usually a noticeable space at the top of the rendered page, but if you inspect the HTML in Firefox or Chrome, you may notice that the head section is empty and its elements appear to be in the body.
Of course viewing source will show everything where it was inserted, but the browser is interpreting it as body content (text) and inserting it there into the Document Object Model (DOM).

Or you could activate output buffering in php.ini which will solve the "headers already sent" problem. It is also very important to use output buffering for performance if your site has significant load.
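The same buffering idea can also be applied in code rather than in php.ini. The following is only a sketch: ob_start() and ob_get_clean() are standard PHP functions, while strip_leading_bom and render_buffered are hypothetical names introduced for illustration.

```php
<?php
// Hypothetical helper: drop a single UTF-8 BOM from the start of a string.
function strip_leading_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? (string) substr($text, 3) : $text;
}

// Hypothetical helper: run $page with output buffering enabled, so nothing
// reaches the client while it runs and header() calls inside it still work.
function render_buffered(callable $page): string
{
    ob_start();
    $page();
    // Collect the buffered output and remove a BOM at its very start.
    return strip_leading_bom((string) ob_get_clean());
}
```

Note that this only removes a BOM at the very start of the output. A BOM emitted by a file included later would end up in the middle of the buffer and is left untouched, so saving the files without a BOM remains the cleaner fix.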

The BOM is actually the most efficient way of identifying a UTF-8 file, and both modern browsers and standards support and encourage its use in HTTP response bodies.
In the case of PHP files, it's not the file but the generated output that gets sent as the response, so obviously it's not a good idea to save all PHP files with the BOM at the beginning. That doesn't mean you shouldn't use the BOM in your response.
You can in fact safely inject the following code right before your doctype declaration (in case you are generating HTML as response):
<?="\u{FEFF}"?> (or before PHP 7.0.0: <?="\xEF\xBB\xBF"?>)
For further reading: https://www.w3.org/International/questions/qa-byte-order-mark#transcoding

Adding to @omabena's answer, use this code to locate and remove the BOM from your files. Be sure to back up your files first, just in case.
// Returns true if the file starts with the UTF-8 BOM (EF BB BF).
function fopen_utf8($filename) {
    $file = @fopen($filename, "r");
    if ($file === false) {
        return false;
    }
    $bom = fread($file, 3);
    fclose($file);
    return $bom === "\xEF\xBB\xBF";
}

// Walks $path, prints every file that starts with a BOM, and rewrites it without one.
function file_array($path, $exclude = ".|..|design", $recursive = true) {
    $path = rtrim($path, "/") . "/";
    $folder_handle = opendir($path);
    $exclude_array = explode("|", $exclude);
    $result = array();
    while (false !== ($filename = readdir($folder_handle))) {
        if (!in_array(strtolower($filename), $exclude_array)) {
            if (is_dir($path . $filename . "/")) {
                // Need to include full "path" or it's an infinite loop
                if ($recursive) $result[] = file_array($path . $filename . "/", $exclude, true);
            } else {
                if (fopen_utf8($path . $filename)) {
                    //$result[] = $filename;
                    echo ($path . $filename . "<br>");
                    $pathname = $path . $filename; // change the pathname to the target file(s) you want to remove the BOM from
                    $file_handler = fopen($pathname, "r");
                    $contents = fread($file_handler, filesize($pathname));
                    fclose($file_handler);
                    // Re-check the first three bytes before rewriting the file.
                    if (substr($contents, 0, 3) === "\xEF\xBB\xBF") {
                        $file_handler = fopen($pathname, "w");
                        fwrite($file_handler, substr($contents, 3));
                        fclose($file_handler);
                        printf("%s BOM removed.<br/>\n", $pathname);
                    }
                }
            }
        }
    }
    closedir($folder_handle);
    return $result;
}
$files = file_array(".");

Related

Perform a mathematical operation after retrieving from another file

I have a text file (math.txt) in which any kind of arithmetic operation could be written. I have to read the file using PHP and determine the output. I am using the below mentioned code to read the content of the file.
$file = 'math.txt'; // 2+3 is written in math.txt
$open = fopen($file, 'r');
$read = fgets($open);
$close = fclose($open);
Using the above code, I am getting the content. But echoing the content displays the original content (i.e. 2+3) rather than the result (i.e. 5). I don't understand what I should do in this case.
Any help on this will be appreciated. Thanks in advance.

But echoing the content is displaying the original content (i.e 2+3)
rather than displaying the output(i.e 5).
This is completely expected behaviour. You read a string from a file; how should PHP know that you want it to evaluate the expression?
You have to implement a simple parser (or find one on the Internet) that analyses the expression and calculates the result.
dave1010 provided a very nice function in one of his posts:
function do_maths($expression) {
eval('$o = ' . preg_replace('/[^0-9\+\-\*\/\(\)\.]/', '', $expression) . ';');
return $o;
}
echo do_maths('1+1');
But note that this can still halt your script execution if the input contains a syntax error!
Here is a better library which uses a real parser: https://github.com/stuartwakefield/php-math-parser
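To illustrate what "a simple parser" can look like without eval, here is a minimal recursive-descent sketch. It is not the linked library's code; the class name TinyMath and its grammar (numbers, + - * /, parentheses, unary minus) are assumptions made for this example, and it requires PHP 7.4 or later for typed properties.

```php
<?php
// Hypothetical minimal evaluator for + - * / with parentheses and unary minus.
final class TinyMath
{
    private string $src = '';
    private int $pos = 0;

    public static function evaluate(string $expression): float
    {
        $parser = new self();
        $parser->src = (string) preg_replace('/\s+/', '', $expression);
        $value = $parser->expr();
        if ($parser->pos < strlen($parser->src)) {
            throw new InvalidArgumentException("Unexpected input at offset {$parser->pos}");
        }
        return $value;
    }

    // expr := term (('+' | '-') term)*
    private function expr(): float
    {
        $value = $this->term();
        while (($op = $this->peek()) === '+' || $op === '-') {
            $this->pos++;
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    // term := factor (('*' | '/') factor)*
    private function term(): float
    {
        $value = $this->factor();
        while (($op = $this->peek()) === '*' || $op === '/') {
            $this->pos++;
            $rhs = $this->factor();
            if ($op === '/' && $rhs == 0.0) {
                throw new DivisionByZeroError('Division by zero');
            }
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    // factor := number | '(' expr ')' | '-' factor
    private function factor(): float
    {
        $ch = $this->peek();
        if ($ch === '-') {
            $this->pos++;
            return -$this->factor();
        }
        if ($ch === '(') {
            $this->pos++;
            $value = $this->expr();
            if ($this->peek() !== ')') {
                throw new InvalidArgumentException('Missing closing parenthesis');
            }
            $this->pos++;
            return $value;
        }
        // \G anchors the match at the current offset.
        if (preg_match('/\G\d+(?:\.\d+)?/', $this->src, $m, 0, $this->pos)) {
            $this->pos += strlen($m[0]);
            return (float) $m[0];
        }
        throw new InvalidArgumentException("Expected a number at offset {$this->pos}");
    }

    // Returns the current character, or '' at the end of input.
    private function peek(): string
    {
        return $this->src[$this->pos] ?? '';
    }
}
```

With this in place, something like echo TinyMath::evaluate(trim($read)); would evaluate the line read from math.txt, and malformed input raises an exception instead of halting the script with a parse error.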

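Read the file and parse it according to the operator. For example, if the file contains 2*5:
$file = 'math.txt';
$open = fopen($file, 'r');
$read = trim(fgets($open));
fclose($open);
// Split on the operator; the hyphen goes last so it is not read as a range.
$key = preg_split("/[*+\/-]/", $read);
// The operator is the character right after the first operand.
$operator = substr($read, strlen($key[0]), 1);
if ($operator == '+')
{
    echo $key[0] + $key[1];
}
else if ($operator == '-')
{
    echo $key[0] - $key[1];
}
else if ($operator == '*')
{
    echo $key[0] * $key[1];
}
else if ($operator == '/')
{
    echo $key[0] / $key[1];
}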

C++ download binary file from http

I'm creating an update mechanism for my first program written in c++.
Theory is:
program sends its version to the server-side PHP script as an HTTP header
server checks if later version exists
if it does, server sends the new binary to the client.
Most of it works however the binary received is malformed. When I compare the malformed exe with the working exe I have differences at places where I have \r\ns in the compiled exe. Seems like the \r is doubled.
My c++ code for downloading:
void checkForUpdates () {
SOCKET sock = createHttpSocket (); // creates the socket, nothing wrong here, other requests work
char* msg = (char*)"GET /u/2 HTTP/1.1\r\nHost: imgup.hu\r\nUser-Agent: imgup uploader app\r\nVersion: 1\r\n\r\n";
if (send(sock, msg, strlen(msg), 0) == SOCKET_ERROR) {
error("send failed with error\n");
}
shutdown(sock, SD_SEND);
FILE *fp = fopen("update.exe", "w");
char answ[1024] = {};
int iResult;
bool first = false;
do {
if ((iResult = recv(sock, answ, 1024, 0)) < 0) {
error("recv failed with error\n");
}
if (first) {
info (answ); // debug purposes
first = false;
} else {
fwrite(answ, 1, iResult, fp);
fflush(fp);
}
} while (iResult > 0);
shutdown(sock, SD_RECEIVE);
if (closesocket(sock) == SOCKET_ERROR) {
error("closesocket failed with error\n");
}
fclose(fp);
delete[] answ;
}
and my php to process the request
<?php
if (!function_exists('getallheaders')) {
function getallheaders() {
$headers = '';
foreach ($_SERVER as $name => $value) {
if (substr($name, 0, 5) == 'HTTP_') {
$headers[str_replace(' ', '-', ucwords(strtolower(str_replace('_', ' ', substr($name, 5)))))] = $value;
}
}
return $headers;
}
}
$version = '0';
foreach (getallheaders() as $name => $value) {
if (strtolower ($name) == 'version') {
$version = $value;
break;
}
}
if ($version == '0') {
exit('error');
}
if ($handle = opendir('.')) {
while (false !== ($entry = readdir($handle))) {
if ($entry != '.' && $entry != '..' && $entry != 'u.php') {
if (intval ($entry) > intval($version)) {
header('Content-Version: ' . $entry);
header('Content-Length: ' . filesize($entry));
header('Content-Type: application/octet-stream');
echo "\r\n";
ob_clean();
flush();
readfile($entry);
exit();
}
}
}
closedir($handle);
}
echo 'error2';
?>
Notice the way I flush content after I send the headers (ob_clean(); flush();) so I don't have to parse them in C++. The first bytes written to the file are fine, so I doubt there is any problem here.
Also, example comparison of the binaries http://i.imgup.hu/meC16C.png
Question: Does http escape \r\n in binary file transfers? If not, what is causing this behavior and how do I solve this problem?

fopen opens a file in the mode you specify: first read/write/both, then append, then a binary identifier.
r/w should be clear to you, and append is also quite obvious. The trick and the trouble in your case is the binary mode.
If a file is treated as a text file (without the "b"), then, depending on the environment the application runs in, special character conversions may occur in input/output operations to adapt the data to a system-specific text file format. On Windows a line ending is \r\n, on Linux it is \n, and some older systems use \r.
In your case, the output file is opened as a text file. This means every \n in the HTTP data is converted to \r\n as it is written, so an existing \r\n becomes \r\r\n.
Opening the file as a binary file (which it indeed is!) avoids this, so the file stays byte-for-byte identical.

The problem is that the output file isn't being opened in binary mode. To do that, change the mode to "wb" versus just "w" like this:
FILE *fp = fopen("update.exe", "wb");
In text mode on Windows the ctrl+z character specifies the end of the file when seeking/reading, and the linefeed character \n is translated to \r\n when writing and \r\n pairs are translated to \n on reading. In binary mode, the file data is not interpreted or translated in any way.
On other platforms the translations may not apply, but it is still good practice to show the intent of the code by specifying the explicit mode even when not strictly necessary. This is especially true for code meant to be portable.

PHP strpos() finds just single characters, not a whole string

I have a strange problem...
I would like to search in a logfile.
$lines = file($file);
$sampleName = "T3173sGas";
foreach ($lines as &$line) {
if (strpos($line, $sampleName) !== false) {
echo "yes";
}
}
This code is not working, even though $sampleName is 100% in the log file. The search works only for single characters, for example "T" or "3", but not for "T3".
Do you have an idea why it's not working? Is the encoding of the logfile wrong?
Thanks a lot for your help!

If you can only find single characters, I would assume that your logfile is in some multi-byte character set like UTF-16. As you already suspect something similar, the next step is to consult the documentation or specification of the logfile to find out its character encoding.
You can then use encoding-aware string functions from the mbstring extension (http://php.net/mbstring).
$encoding = ... ; // encoding of logfile
if (mb_strpos($line, $sampleName, 0, $encoding) !== false) {
echo "yes";
}
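One way to apply this is to convert the log to the script's own encoding first and then search it with plain strpos(), since mb_strpos() expects the needle to be in the same encoding as the haystack. The sketch below assumes the log turns out to be UTF-16LE, which is only a guess until you have checked the file; log_contains is a hypothetical helper name, while mb_convert_encoding() is a standard mbstring function.

```php
<?php
// Hypothetical helper: search a log whose encoding differs from the script's.
function log_contains(string $rawLog, string $needle, string $logEncoding): bool
{
    // Convert the whole log to UTF-8 so needle and haystack share one encoding.
    $utf8 = mb_convert_encoding($rawLog, 'UTF-8', $logEncoding);
    return strpos($utf8, $needle) !== false;
}
```

Used as log_contains(file_get_contents($file), $sampleName, 'UTF-16LE'), this also avoids splitting a UTF-16 file into lines with file(), which breaks lines in the middle of two-byte characters.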

This may work, it searches for the entire string
<?php
$filename = 'test.php';
$file = file_get_contents($filename);
$sampleName = "T3173sGas";
if(strlen(strstr($file,$sampleName))>0)
{
echo "yes";
}
?>

How secure (hardened) is this script (part 2)

In my previous question on this topic, what would the implications be if I removed the dynamic variable and instead replaced it with a static one like you see below...
$source = 'http://mycentralserver.com/protected/myupdater.zip';
I've included the code below for convenience...
<?php
// TEST.PHP
$source = 'http://mycentralserver.com/protected/myupdater.zip';
$target = '.';
$out_file = fopen(basename($source), 'w');
$in_file = fopen($source, 'r');
while ($chunk = fgets($in_file)) {
fputs($out_file, $chunk);
}
fclose($in_file);
fclose($out_file);
$zip = new ZipArchive();
$result = $zip->open(basename($source));
if ($result) {
$zip->extractTo($target);
$zip->close();
}
?>

You should at least be hashing the zip with SHA-1 and checking it against a digest to ensure it hasn't changed. These digests should be extremely hard to replace.
I still think automated updates are a bit iffy.
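A digest check of that kind could be sketched as follows. The answer names SHA-1; this sketch swaps in SHA-256 as a deliberate substitution, and the function name archive_matches_digest is hypothetical. How the expected digest reaches the script (ideally over a separate, trusted channel) is left open here.

```php
<?php
// Hypothetical helper: true when the file's SHA-256 digest equals $expectedHex.
function archive_matches_digest(string $path, string $expectedHex): bool
{
    $actual = hash_file('sha256', $path);
    // hash_equals() compares in constant time; hash_file() returns false on failure.
    return $actual !== false && hash_equals(strtolower($expectedHex), $actual);
}
```

The updater would call this on the downloaded zip and skip extractTo() entirely when it returns false.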

PHP 5.2.6 and older had a vulnerability that allowed writing to arbitrary locations via ZipArchive's extractTo() method.
See: http://www.securityfocus.com/bid/32625
So, if the contents of the zip are untrustworthy, you must use PHP 5.2.7 or newer (or use your own Zip parser).
