I am finishing a project that uses FusionCharts, and I need to add a BOM signature to my dynamically generated XML. I can't figure out how to add the BOM to dynamic XML using PHP.
My code looks like this:
$filename="a.xml";
$file= fopen("$filename", "w");
$_xml="<something/>";
fwrite($file, $_xml);
fclose($file);
In the FusionCharts documentation I found that, for general PHP output, I need to add:
header ( 'Content-type: text/xml' );
echo pack ( "C3" , 0xef, 0xbb, 0xbf );
So can anyone help me with this?
Thank you,
You can use a BOM as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.
To add one, just write a string containing the BOM bytes ahead of the rest of the content (PHP strings are plain byte sequences). Example strings:
Bytes        PHP String        Encoding Form
-----------  ----------------  ---------------------
00 00 FE FF  "\0\0\xFE\xFF"    UTF-32, big-endian
FF FE 00 00  "\xFF\xFE\0\0"    UTF-32, little-endian
FE FF        "\xFE\xFF"        UTF-16, big-endian
FF FE        "\xFF\xFE"        UTF-16, little-endian
EF BB BF     "\xEF\xBB\xBF"    UTF-8
See http://unicode.org/faq/utf_bom.html
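Applied to the file-writing code from the question, a minimal sketch could look like the following. The helper name write_xml_with_bom and the temp-directory path are made up for illustration; the only essential part is that the three BOM bytes are written before the XML.

```php
<?php
// Hypothetical helper (not part of FusionCharts or PHP): writes $xml to
// $filename, prefixed with the UTF-8 BOM.
function write_xml_with_bom(string $filename, string $xml): void
{
    // Same three bytes that pack("C3", 0xef, 0xbb, 0xbf) produces.
    $bom = "\xEF\xBB\xBF";

    $file = fopen($filename, 'w');
    if ($file === false) {
        throw new RuntimeException("Cannot open $filename for writing");
    }
    fwrite($file, $bom . $xml);
    fclose($file);
}

// Assumed location, chosen only so the example is runnable anywhere.
$path = sys_get_temp_dir() . '/a.xml';
write_xml_with_bom($path, '<something/>');
```

The BOM has to be the very first thing in the file, so it is concatenated in front of the XML rather than written afterwards.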
I am trying to send Bengali text as an SMS through our local carrier's API, but they don't support Unicode (UTF-8) text as a POST/GET parameter. They replied with this:
For every Bengali alphabet there is standard HEXDUMP representation
which need to be inserted in message content part.
Like below Bengali word is having below HEXDUMP representation
বাংলাদেশ : 09AC09BE098209B209BE09A609C709B6
So I tried the following two pieces of code gathered from SO.
Code-1:
$strBN = 'বাংলাদেশ';
echo bin2hex($strBN);
// it returns this value: "e0a6ace0a6bee0a682e0a6b2e0a6bee0a6a6e0a787e0a6b6"
Code-2:
$strBN = 'বাংলাদেশ';
echo fToHex($strBN);
function fToHex($string)
{
$strHData = '';
for ($i = 0; $i < strlen($string); $i++)
{
$strHData .= str_pad(dechex(ord($string[$i])), 2, '0', STR_PAD_LEFT);
}
return $strHData;
}
// This also returns the same value as above: "e0a6ace0a6bee0a682e0a6b2e0a6bee0a6a6e0a787e0a6b6"
So my question is: how can I convert that text/string to the hex dump my carrier expects?
The hex dump that you are getting is in UTF-8 format, which is a way to represent Unicode characters reliably in an 8-bit stream.
E0 A6 AC E0 A6 BE E0 A6 82 E0 A6 B2 E0 A6 BE E0 A6 A6 E0 A7 87 E0 A6 B6
The example on the other hand is a dump of the UTF-16 (or truncated 16-bit Unicode codepoint) values:
09AC 09BE 0982 09B2 09BE 09A6 09C7 09B6
In your case the solution is to convert to UTF-16 encoding:
echo bin2hex(mb_convert_encoding('বাংলাদেশ', 'UTF-16BE', 'UTF-8'));
> 09ac09be098209b209be09a609c709b6
Note that using Unicode characters in code is unreliable, because the interpretation of the bytes in a string will depend on your system details / editor / compiler or interpreter settings etc.
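One way to sidestep the source-file encoding issue mentioned above is to spell the text with "\u{...}" escapes (available since PHP 7.0). The sketch below does that and wraps the conversion in a helper; the name to_utf16_hexdump is made up, and the uppercase output is only an assumption based on the carrier's sample.

```php
<?php
// Hypothetical helper: hex dump of the UTF-16BE code units of a UTF-8 string.
function to_utf16_hexdump(string $utf8): string
{
    // Name both encodings explicitly so nothing depends on mbstring defaults.
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    return strtoupper(bin2hex($utf16));
}

// The same Bengali word as in the question, written as code point escapes,
// so the bytes do not depend on how this source file is saved.
$strBN = "\u{09AC}\u{09BE}\u{0982}\u{09B2}\u{09BE}\u{09A6}\u{09C7}\u{09B6}";

echo to_utf16_hexdump($strBN), "\n"; // 09AC09BE098209B209BE09A609C709B6
```

Characters outside the Basic Multilingual Plane would come out as surrogate pairs (two 16-bit units each); whether the carrier accepts those is not something the question answers.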
Wherever I use require_once('something.php'); a strange character appears in the HTML, and when I check the page in the validator I get:
Validation Output: 1 Error
Error Line 1, Column 1: character "" not allowed in prolog
This only happens when I'm using UTF-8. Earlier my files were in ANSI and everything was OK.
Yes, I changed the meta tag and saved all files as UTF-8.
I can move the code into the body section, but this still seems strange to me.
This is most probably caused by the UTF-8 BOM (Byte Order Mark). Open any file in some HEX viewer / editor and check the first 3 bytes in that file.
UTF-8 BOM in Windows-1250 encoding looks like this: ď»ż. Or ï»¿ in ISO-8859-1. That's EF BB BF in hexadecimal.
Just save your files as UTF-8 without BOM. For example, the Notepad++ editor has both options under the Format menu:
Convert into UTF-8 (without BOM)
Convert into UTF-8
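If you would rather check for the BOM from PHP than in a hex editor, something along these lines should work. It is only a sketch: has_utf8_bom and strip_utf8_bom are hypothetical helper names, and you would feed them the result of file_get_contents() on each suspect file.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true if the byte string starts with EF BB BF.
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: returns the bytes without a leading UTF-8 BOM.
function strip_utf8_bom(string $bytes): string
{
    return has_utf8_bom($bytes) ? substr($bytes, 3) : $bytes;
}

$withBom = UTF8_BOM . '<html>';
var_dump(has_utf8_bom($withBom));   // bool(true)
var_dump(strip_utf8_bom($withBom)); // string(6) "<html>"
```

Writing the stripped content back with file_put_contents() would remove the BOM for good, but re-saving the file from the editor is the safer route.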
I have a lot of images that were imported from an SQL dump with UTF-8 encoding. As a result, instead of "FF D8 FF E0" I see "C3 BF C3 98 C3 BF C3 A0" at the beginning of the JPEG images.
I've tried iconv('utf-8', 'iso-8859-1', $data), but it does not convert the whole file (there are characters in the UTF-8 data that cannot be converted to ISO-8859-1).
How can I simply convert UTF-8 to one-byte binary, regardless of encoding?
The problem was that UTF-8 allows several representations of the same character, the so-called "non-shortest" forms. Those characters can be converted mathematically, but iconv treats them as erroneous and does not convert them.
I wrote a short function that converts any UTF-8 text to an array of Unicode code points, and then remaps the values that do not fit in one byte to single-byte values using a simple table (for example, 0x20AC is the same as 0x80, etc.). You can find the complete code and remapping table here: Converting UTF-8 with non-shortest characters to one-byte encoding
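The linked code is not reproduced here, but the general idea can be sketched as follows. This is not the original function: the name utf8_to_single_byte is made up, the table holds only a handful of Windows-1252 entries for illustration, and it relies on mbstring (mb_str_split needs PHP 7.4+), which does not accept non-shortest forms, so those would still need the manual decoding described above.

```php
<?php
// Partial, illustrative table: Unicode code point => Windows-1252 byte.
const CP1252_REMAP = [
    0x20AC => 0x80, // euro sign
    0x201A => 0x82, // single low-9 quotation mark
    0x0192 => 0x83, // latin small letter f with hook
    0x201E => 0x84, // double low-9 quotation mark
    0x2026 => 0x85, // horizontal ellipsis
];

// Hypothetical helper: maps each UTF-8 character back to one byte.
function utf8_to_single_byte(string $utf8): string
{
    $out = '';
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if ($cp !== false && $cp <= 0xFF) {
            // Code points 0x00-0xFF map straight to the same byte value.
            $out .= chr($cp);
        } elseif ($cp !== false && array_key_exists($cp, CP1252_REMAP)) {
            $out .= chr(CP1252_REMAP[$cp]);
        } else {
            throw new UnexpectedValueException('Unmappable bytes: ' . bin2hex($char));
        }
    }
    return $out;
}

// The damaged JPEG header from the question.
echo bin2hex(utf8_to_single_byte("\xC3\xBF\xC3\x98\xC3\xBF\xC3\xA0")), "\n"; // ffd8ffe0
```

Throwing on unmappable input is a deliberate choice in this sketch: for binary data such as images, silently dropping a byte would corrupt the file.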
The characters I get from the URL, for example www.mydomain.com/?name=john, were fine as long as they were not in Russian.
If they were in Russian, I was getting '����'.
So I added $name= iconv("cp1251","utf-8" ,$name); and now it works fine for Russian and English characters, but screws up other languages. :)))
For example 'Jānis' ( Latvian ) that worked fine before iconv, now turns into 'jДЃnis'.
Any idea if there's some universal encoder that would work with the Cyrillic languages without screwing up other languages?
Why don't you just use UTF-8 with all files and processes?
Actually, this comes down to how the URL is encoded. If you click a link on a page, the browser uses that page's encoding to send the request, but if you enter the URL directly into the browser's address bar, the behavior is somewhat undefined because there is no standardized encoding to use (Firefox provides an about:config switch to use UTF-8 encoded URLs).
Apart from using some form of encoding detection, there is no way to know which encoding was used for the URL in a given request.
EDIT:
Just to back up what I said above, I wrote a small test script that shows the default behavior of the five major browsers (running on Mac OS X in my case, and on Windows Vista via Parallels for IE):
$p = $_GET['p'];
for ($i = 0; $i < strlen($p); $i++) {
// this displays the binary data received via the URL in hex format
echo dechex(ord($p[$i])) . ' ';
}
Calling http://path/to/script.php?p=äöü leads to
Safari (4.0.5): c3 a4 c3 b6 c3 bc
Firefox (3.6.3): c3 a4 c3 b6 c3 bc
Google Chrome (5.0.375.38): c3 a4 c3 b6 c3 bc
Opera (10.10): e4 f6 fc
Internet Explorer (8.0.6001.18904): e4 f6 fc
So obviously the first three use UTF-8 encoded URLs, while Opera and IE use ISO-8859-1 or one of its variants. Conclusion: you cannot be sure which encoding is used for textual data sent via a URL.
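Given that, one pragmatic (and admittedly heuristic) approach is to convert only when the incoming bytes are not already valid UTF-8. The sketch below assumes the mbstring extension; normalize_to_utf8 is a made-up name, and the fallback encoding is a guess you have to supply yourself.

```php
<?php
// Hypothetical helper: leaves valid UTF-8 alone, otherwise assumes $fallback.
function normalize_to_utf8(string $raw, string $fallback = 'Windows-1251'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already UTF-8 (pure ASCII also lands here)
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// 'Jānis' arriving as UTF-8 is returned unchanged.
echo bin2hex(normalize_to_utf8("J\xC4\x81nis")), "\n"; // 4ac4816e6973

// 'äöü' as sent by Opera/IE in the test above, with an ISO-8859-1 fallback.
echo bin2hex(normalize_to_utf8("\xE4\xF6\xFC", 'ISO-8859-1')), "\n"; // c3a4c3b6c3bc
```

This is not a universal decoder: some single-byte strings happen to be valid UTF-8 as well, and a single fallback cannot cover every legacy code page, so it only reduces the damage compared with calling iconv unconditionally.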
It seems the issue is the file encoding. You should always use UTF-8 without a BOM as the preferred encoding for your .php files; code editors such as Intype let you specify this easily (UTF-8 Plain).
Also, add the following code to your files before any output:
header('Content-Type: text/html; charset=utf-8');
You should also read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
This is in reference to this (excellent) answer. He states that the best solution for escaping input in PHP is to call mb_convert_encoding followed by htmlentities.
But why exactly would you call mb_convert_encoding with the same to and from parameters (UTF8)?
Excerpt from the original answer:
Even if you use htmlspecialchars($string) outside of HTML tags, you are still vulnerable to multi-byte charset attack vectors.
The most effective you can be is to use a combination of mb_convert_encoding and htmlentities as follows.
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
$str = htmlentities($str, ENT_QUOTES, 'UTF-8');
Does this have some sort of benefit I'm missing?
Not all binary data is valid UTF8. Invoking mb_convert_encoding with the same from/to encodings is a simple way to ensure that one is dealing with a correctly encoded string for the given encoding.
A way to exploit the omission of UTF8 validation is described in section 6 (Security Considerations) of RFC 2279:
Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.
This may be more easily understood by examining the binary representation:
110xxxxx 10xxxxxx   # header bits used by the encoding
11000000 10101110   # C0 AE
         00101110   # 2E the '.' character
In other words: (C0 AE - header-bits) == '.'
As the quoted text points out, C0 AE is not a valid UTF8 octet sequence, so mb_convert_encoding would have removed it from the string (or translated it to '.', or something else :-).
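To see this in practice, here is a small sketch that runs the illegal sequence from the RFC through the two calls from the question. It assumes the mbstring extension; what exactly replaces the invalid bytes depends on the PHP version and the mbstring.substitute_character setting, so the example only checks validity rather than a specific replacement.

```php
<?php
// The illegal octet sequence 2F C0 AE 2E 2F from the RFC excerpt.
$malicious = "/\xC0\xAE./";

// Not valid UTF-8, so a naive check for "/../" would let it through.
var_dump(mb_check_encoding($malicious, 'UTF-8')); // bool(false)
var_dump(strpos($malicious, '/../'));             // bool(false)

// Same from/to encoding: invalid sequences are substituted or dropped.
$cleaned = mb_convert_encoding($malicious, 'UTF-8', 'UTF-8');
var_dump(mb_check_encoding($cleaned, 'UTF-8'));   // bool(true)

// Only now is the string safe to hand to htmlentities.
$escaped = htmlentities($cleaned, ENT_QUOTES, 'UTF-8');
```

The important point is that the cleaned string never decodes to "/../": the overlong C0 AE is not silently turned into a dot.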