Strange characters in PHP - php

This is driving me crazy.
I have this one php file on a test server at work which does not work.. I kept deleting stuff from it till it became
<?
print 'Hello';
?>
it outputs
Hello
if I create a new file and copy / paste the same script to it it works!
Why does this one file give me the strange characters all the time?

That's the BOM (Byte Order Mark) you are seeing.
In your editor, there should be a way to force saving without BOM which will remove the problem.

Found it, file -> encoding -> UTF8 with BOM , changed to to UTF :-)
I should ahve asked before wasing time trying to figure it out :-)

Just in case, here is a list of bytes for BOM
Encoding Representation (hexadecimal)
UTF-8 EF BB BF
UTF-16 (BE) FE FF
UTF-16 (LE) FF FE
UTF-32 (BE) 00 00 FE FF
UTF-32 (LE) FF FE 00 00
UTF-7 2B 2F 76, and one of the following bytes: [ 38 | 39 | 2B | 2F ]†
UTF-1 F7 64 4C
UTF-EBCDIC DD 73 66 73
SCSU 0E FE FF
BOCU-1 FB EE 28 optionally followed by FF†

Related

UTF-8 issue with PHP's json_decode

EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode which outputs Unicode code points by default. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.
I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running in to is that the API returns a different character encoding, even though the value PHP receives from the database is proper UTF-8.
I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):
<?php
$val = array("Millán");
print json_encode($val)."\n";
According to the PHP documentation, string literals are encoded ... in whatever fashion [they are] encoded in the script file.
Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):
$ grep ill test.php | od -An -t x1c
24 76 61 6c 20 3d 20 61 72 72 61 79 28 22 4d 69
$ v a l = a r r a y ( " M i
6c 6c c3 a1 6e 22 29 3b 0a
l l 303 241 n " ) ; \n
And here is the output from PHP:
$ php -f test.php | od -An -t x1c
5b 22 4d 69 6c 6c 5c 75 30 30 65 31 6e 22 5d 0a
[ " M i l l \ u 0 0 e 1 n " ] \n
The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.
How can I keep PHP/json_encode from switching the encoding of this variable?
EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.
This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences \u..... These sequences do not reference any physical byte encoding in any UTF encoding, it references the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
try the below command to solve their problems.
<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);
Note: add the JSON_UNESCAPED_UNICODE parameter to the json_encode function to keep the original values.
For python, this Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence

PHP gettext and non-ANSII charters

I have a PHP web application which is originally in Polish. But I was asked to locale it into Russian. I've decided to use gettext. But I've problem when I'm trying to translate string with Polish special characters. For example:
echo gettext('Urządzenie');
Display "Urządzenie" in web browser instead of word in Russian.
All files are encoded in UTF-8 and .po file was generated with --from-code utf-8 . Translations without Polish special chars such as
echo gettext('Instrukcja');
works well. Do you know what could be the reason of this strange behaviour?
Are you sure the PHP file is in UTF-8 format? To verify, try this:
echo bin2hex('Urządzenie');
You should see the following bytes:
55 72 7a c4 85 64 7a 65 6e 69 65

htmlentities, htmlspecialchars, and "invalid multibyte sequence"

This question tells me
htmlentities is identical to htmlspecialchars() in all ways, except with htmlentities(), all characters which have HTML character entity equivalents are translated into these entities.
Sounds like htmlentities is the one I want.
Then this question tells me I need the "UTF-8" argument to get rid of this error:
Invalid multibyte sequence in argument
So, here is my encoding wrapper function (to normalise behaviour across different PHP versions)
function html_entities ($s)
{
return htmlentities ($s, ENT_COMPAT /* ENT_HTML401 */, "UTF-8");
}
I am still getting the "multibyte sequence in argument" error.
Here is a sample string which triggers the error, and it's hex encoding:
Jigue à Baptiste
4a 69 67 75 65 20 e0 20 - 42 61 70 74 69 73 74 65
I notice that the à is encoded as 0xe0 but as a single byte which is above 0x80.
What am I doing wrong?
Your string is encoded in ISO-8859-1, not UTF-8. Plain and simple.
function html_entities ($s)
{
return htmlentities ($s, ENT_COMPAT /* ENT_HTML401 */, "ISO-8859-1");
^^^^^^^^^^^^
}
If à is encoded as 0xE0 then you didn't save the file in UTF-8 encoding. 0xE0 is invalid UTF-8. It should be 0xC3 0xA0
Save your file in UTF-8 encoding. Also see UTF-8 all the way through
If you saved it correctly in utf-8, the hex should look like so:
4A 69 67 75 65 20 C3 A0 20 42 61 70 64 69 73 74 65
J i g u e à B a p t i s t e

Add BOM stamp in xml by php

I am completing my project on fusion chart. I need to add BOM signature in my dynamic xml. But I am unable to figure out that how can I add BOM signature for dynamic xml using php.
My codes are like this
$filename="a.xml";
$file= fopen("$filename", "w");
$_xml="<something/>";
fwrite($file, $_xml);
fclose($file);
In fusion chart documentation I found I need to add for general php output
header ( 'Content-type: text/xml' );
echo pack ( "C3" , 0xef, 0xbb, 0xbf );
So can any one help me with this?
Thank you,
You can use a BOM as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.
If you want to, just pass a string (which is binary in PHP) that contains the BOM. Example strings:
Bytes PHP String Encoding Form
----- ---------- -------------
00 00 FE FF "\0\0\xFE\xFF" UTF-32, big-endian
FF FE 00 00 "\xFF\xFE\0\0" UTF-32, little-endian
FE FF "\xFE\xFF" UTF-16, big-endian
FF FE "\xFF\xFE" UTF-16, little-endian
EF BB BF "\xEF\xBB\xBF" UTF-8
See http://unicode.org/faq/utf_bom.html

Error when output binary data with PHP

I write a simple php to load some binary data from DB then output to client.
$sql="select FightPlayEnd from ZMTXLogic.FightLog where ID=".addslashes($id);
$result=$db->query($sql);
if($db->num_rows($result)>0)
{
$row = mysql_fetch_assoc($result);
$nByteCount = mb_strlen($row["FightPlayEnd"], '8bit');
//echo $nByteCount;
header("Content-type:application/octet-stream");
header("Accept-Ranges:bytes");
header("Accept-Length:".$nByteCount);
header("Content-Disposition:attachment;filename=FightPlayEnd.bin");
header( "Content-type: application/octet-stream");
echo $row["FightPlayEnd"];
}
The problem is the data got from IE is not same as original binary data but added
EF BB BF(view in UltraEdit) at the header and 0D 0A 0D 0A at the end. What is wrong with it?
0D 0A 0D 0A is just \r\n\r\n, that is, two linebreaks.
EE BB BF is a byte order mark. It signals UTF-8 encoding.
Edit (see comments):
Your script might be outputting more than it should (particularly those \r\n\r\n).
You need to clean the output buffer before you start outputting the data (ob_clean()) and exit right after your echo.

Categories