How can Detect UTF 16 decoding

How can Detect UTF 16 decoding - php

I have to read a file and identify its decoding type, I used mb_detect_encoding() to detect utf-16 but am getting wrong result.. how can i detectutf-16 encoding type in php.
Php file is utf-16 and my header was windows-1256 ( because of Arabic)
header('Content-Type: text/html; charset=windows-1256');
$delimiter = '\t';
$f= file("$fileName");
foreach($f as $dailystatmet)
{
$transactionData = str_replace("'", '', $dailystatmet);
preg_match_all("/('?\d+,\d+\.\d+)?([a-zA-Z]|[0-9]|)[^".$delimiter."]+/",$transactionData,$matches);
array_push($matchesz, $matches[0]);
}
$searchKeywords = array ("apple", "orange", 'mango');
$rowCount = count($matchesz);
for ($row = 1; $row <= $rowCount; $row++) {
$myRow = $row;
$cell = $matchesz[$row];
foreach ($searchKeywords as $val) {
if (partialArraySearch($cell[$c_description], $val)) {
}
}}
function partialArraySearch($cell, $searchword)
{
if (strpos(strtoupper($cell), strtoupper($searchword)) !== false) {
return true;
}
return false;
}
Above code is for search with in the uploaded file.. if the file was in utf-8 then match was getting but when same file with utf-16 or utf-32 am not getting the result..
so how can i get the encoding type of uploaded file ..

If someone is still searching for a solution, I have hacked something like this in the "voku/portable-utf8" repo on github. => "UTF8::file_get_contents()"
The "file_get_contents"-wrapper will detect the current encoding via "UTF8::str_detect_encoding()" and will convert the content of the file automatically into UTF-8.
e.g.: from the PHPUnit tests ...
$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16pe.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);
$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16le.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);

My solution is to detect UTF-16 and convert the code in Latin 15 is
preg_match_all('/\x00/',$content,$count);
if(count($count[0])/strlen($content)>0.4) {
$content = iconv('UTF-16', 'ISO-8859-15', $content);
}
In other words i check the frequency of the hexadecimal character 00. If it is higher than 0.4 probably the text contains characters in the base set encoded in UTF-16. This means two bytes for character but usually the second byte is 00.

Related

How to print Hexadecimal UTF-8 characters in PHP

How to print UFT-8 Characters from their Hexadecimal UTF-8 values? I read this post, but it did not solve my problem...
I work with many strings that are sanskrit words stored in a database. I have their HTML values, 16 bit binary code points, hex codes, and decimal codes, but I want to be able to work with their Hexadecimal UTF-8 values and output their symbolic form.
For example, here is a word आम that has a Binary UTF-8 value of 111000001010010010111000111000001010010010101110. I want to see/store/print its Hexadecimal UTF-8 value and print its symbolic form.
For example, here's a snippet of my code:
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
$Temporary = dechex(bindec($BinaryUTF8));
$HexadecimalUTF8 = NULL;
for($i = 0; $i < strlen($Temporary); $i+=2)
{
$HexadecimalUTF8 .= "\x".$Temporary[$i].$Temporary[$i+1];
}
$Test = "\xe0\xa4\x86\xe0\xa4\xae";
echo "\$Test = ".$Test;
echo "<br>";
echo "\$HexadecimalUTF8 = ".$HexadecimalUTF8;
The output is:
$Test = आम
$HexadecimalUTF8 = \xe0\xa4\x86\xe0\xa4\xae
$Test output the desired characters.
Why does $HexadecimalUTF8 not output the desired characters?

Your binary is wrong (I have fixed it below)
You are making a string containing the text "\xe0" instead of the character which represents that, The hex is just a number really.
This seems to work now
<?php
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
$Temporary = dechex(bindec($BinaryUTF8));
$HexadecimalUTF8 = NULL;
for($i = 0; $i < strlen($Temporary); $i+=2)
{
$HexadecimalUTF8 .= '\x' . $Temporary[$i].$Temporary[$i+1];
}
$Test = "\xe0\xa4\x86\xe0\xa4\xae";
echo "\$Test = ".$Test;
echo "<br>";
echo "\$HexadecimalUTF8 = " . makeCharFromHex($HexadecimalUTF8);
function makeCharFromHex($hex) {
return preg_replace_callback(
'#(\\\x[0-9A-F]{2})#i',
function ($matches) {
return chr(hexdec($matches[1]));
},
$hex
);
}
This question reminds me how poor PHP is for multi byte support

To print UTF-8 characters from their decimal value you can use this function
<?php
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
echo chr_utf8(9405).chr_utf8(9402).chr_utf8(9409).chr_utf8(hexdec('24C1')).chr_utf8(9412);
// Output ⒽⒺⓁⓁⓄ
// Note : Use hexdec to print UTF-8 encoded characters from hexadecimal number.
For your snippet you can try this… and check it in https://eval.in/748161
<?php
// function chr_utf8 shown above is required…
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
if (preg_match_all('#(0[01]{7})|(?:110([01]{5})10([01]{6}))|(?:1110([01]{4})10([01]{6})10([01]{6}))|(?:11110([01]{3})10([01]{6}),10([01]{6})10([01]{6}))#',$BinaryUTF8,$a,PREG_SET_ORDER))
$result=implode('',array_map(function($n){return chr_utf8(bindec(implode('',array_slice($n,1))));},$a));
echo $result;
// Output आम
// Note : If you work with "binary" the length of input must be multiple of 8.
// You can't remove leading zeros because this regex will not detect the character…
One other nice inline solution is the following… (php v5.6+ required) Check it in https://eval.in/748162
<?php
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
echo pack('C*',...array_map('bindec',str_split($BinaryUTF8,8)));
// Output आम
// Note : Length or $BinaryUTF8 of input must be multiple of 8.

Parse UTF-8 string char-by-char in PHP

I'm sorry if I'm asking the obvious, but I can't seem to find a working solution for a simple task. On the input I have a string, provided by a user, encoded with UTF-8 encoding. I need to sanitize it by removing all characters less than 0x20 (or space), except 0x7 (or tab.)
The following works for ANSI strings, but not for UTF-8:
$newName = "";
$ln = strlen($name);
for($i = 0; $i < $ln; $i++)
{
$ch = substr($name, $i, 1);
$och = ord($ch);
if($och >= 0x20 ||
$och == 0x9)
{
$newName .= $ch;
}
}
It totally missed UTF-8 encoded characters and treats them as bytes. I keep finding posts where people suggest using mb_ functions, but that still doesn't help me. (For instance, I tried calling mb_strlen($name, "utf-8"); instead of strlen, but it still returns the length of string in BYTEs instead of characters.)
Any idea how to do this in PHP?
PS. Sorry, my PHP is somewhat rusty.

If you use multibyte functions (mb_) then you have to use them for everything. In this example you should use mb_strlen() and mb_substr().
The reason it is not working is probably because you are using ord(). It only works with ASCII values:
ord
(PHP 4, PHP 5)
ord — Return ASCII value of character
...
Returns the ASCII value of the first character of string.
In other words, if you throw a multibyte character into ord() it will only use the first byte, and throw away the rest.

Wow, PHP is one messed up language. Here's what worked for me (but how much slower will this run for a longer chunk of text...):
function normalizeName($name, $encoding_2_use, $encoding_used)
{
//'$name' = string to normalize
// INFO: Must be encoded with '$encoding_used' encoding
//'$encoding_2_use' = encoding to use for return string (example: "utf-8")
//'$encoding_used' = encoding used to encode '$name' (can be also "utf-8")
//RETURN:
// = Name normalized, or
// = "" if error
$resName = "";
$ln = mb_strlen($name, $encoding_used);
if($ln !== false)
{
for($i = 0; $i < $ln; $i++)
{
$ch = mb_substr($name, $i, 1, $encoding_used);
$arp = unpack('N', mb_convert_encoding($ch, 'UCS-4BE', $encoding_used));
if(count($arp) >= 1)
{
$och = intval($arp[1]); //Index 1?! I don't understand why, but it works...
if($och >= 0x20 || $och == 0x9)
{
$ch2 = mb_convert_encoding('&#'.$och.';', $encoding_2_use, 'HTML-ENTITIES');
$resName .= $ch2;
}
}
}
}
return $resName;
}

Convert inline specified UTF-8 mail subject

want to convert the following raw mail subject to normal UTF-8 text:
=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=
The real text for that is:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
My first approach to convert this:
$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
mb_internal_encoding("UTF-8");
echo mb_decode_mimeheader($mime);
This gives me the following result:
Schuker_hat_sich_vom_Übungsabend_(01.01.2012)_abgemeldet
(Questions here: What am I doing wrong? Why do those underscores occur?)
My second approach to convert this:
$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
echo imap_utf8($mime);
This gives me the following (correct) result:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
Why does this work? On which method should I rely on?
The reason I ask is that I previously asked another mail subject decoding related question where mb_decode_mimeheader was the solution whereas here imap_utf8 would be the way to go. How can I ensure to decode everything correct for those both examples:
=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?
and
=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?=
=?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=
Should give me the expected results:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
and
Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"

Based on the hbit response, I've improved the imapUtf8() function to convert the subject text to UTF-8 using the charset information. The result is something like:
function imapUtf8($str){
$convStr = '';
$subLines = preg_split('/[\r\n]+/', $str);
for ($i=0; $i < count($subLines); $i++) {
$convLine = '';
$linePartArr = imap_mime_header_decode($subLines[$i]);
for ($j=0; $j < count($linePartArr); $j++) {
if ($linePartArr[$j]->charset === 'default') {
if ($linePartArr[$j]->text != " ") {
$convLine .= ($linePartArr[$j]->text);
}
} else {
$convLine .= iconv($linePartArr[$j]->charset, 'UTF-8', $linePartArr[$j]->text);
}
}
$convStr .= $convLine;
}
return $convStr;
}

This function works for both examples:
function imapUtf8($str){
$convStr = '';
$subLines = preg_split('/[\r\n]+/',$str); // split multi-line subjects
for($i=0; $i < count($subLines); $i++){ // go through lines
$convLine = '';
$linePartArr = imap_mime_header_decode(trim($subLines[$i])); // split and decode by charset
for($j=0; $j < count($linePartArr); $j++){
$convLine .= ($linePartArr[$j]->text); // append sub-parts of line together
}
$convStr .= $convLine; // append to whole subject
}
return $convStr; // return converted subject
}
Tests:
$sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
$sub2 = '=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=';
echo imapUtf8($sub1);
echo imapUtf8($sub2);
Result:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"

It's also in the comments in the manual for mb_decode_mimeheader, and I actually assume it is a bug. None in the database, so I'd file it as a new one.
However, AFAIK imap_mime_header_decode will cope with both your encodings without a problem, so that will keep your code going.

About the mysterious underscore in the Subject header field:
RFC2047 4.2(2) states explicitly:
The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
represented as "_" (underscore, ASCII 95.). (This character may
not pass through some internetwork mail gateways, but its use
will greatly enhance readability of "Q" encoded data with mail
readers that do not support this encoding.) Note that the "_"
always represents hexadecimal 20, even if the SPACE character
occupies a different code position in the character set in use.
The encoding rule for Subject line is documented in the very RFC2047 .

best way to detect number of SMS needed to send a text

I'm looking for a code/lib in php that I will call it and pass a text to it and it will tell me:
What is the encode I need to use in order to send this text as SMS (7,8,16 bit)
How many SMS message I will use to send this text (it must be smart to count "segmenation information" like in http://ozekisms.com/index.php?owpn=612)
do you have any idea of any code/lib exists that will do this for me?
Again I'm not looking for sending SMS or converting SMS, just to give me information about the text
Update:
Ok I did the below code and it seems to be working fine, let me know if you have better/optimized code/solution/lib
$text = '\#£$¥èéùìòÇØøÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ -./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà^{}[~]|€' ; //"\\". //'"';//' ';
print $text . "\n";
print isGsm7bit($text). "\n";
print getNumberOfSMSsegments($text). "\n";
function getNumberOfSMSsegments($text,$MaxSegments=6){
/*
http://en.wikipedia.org/wiki/SMS
Larger content (concatenated SMS, multipart or segmented SMS, or "long SMS") can be sent using multiple messages,
in which case each message will start with a user data header (UDH) containing segmentation information.
Since UDH is part of the payload, the number of available characters per segment is lower:
153 for 7-bit encoding,
134 for 8-bit encoding and
67 for 16-bit encoding.
The receiving handset is then responsible for reassembling the message and presenting it to the user as one long message.
While the standard theoretically permits up to 255 segments,[35] 6 to 8 segment messages are the practical maximum,
and long messages are often billed as equivalent to multiple SMS messages. See concatenated SMS for more information.
Some providers have offered length-oriented pricing schemes for messages, however, the phenomenon is disappearing.
*/
$TotalSegment=0;
$textlen = mb_strlen($text);
if($textlen==0) return false; //I can see most mobile devices will not allow you to send empty sms, with this check we make sure we don't allow empty SMS
if(isGsm7bit($text)){ //7-bit
$SingleMax=160;
$ConcatMax=153;
}else{ //UCS-2 Encoding (16-bit)
$SingleMax=70;
$ConcatMax=67;
}
if($textlen<=$SingleMax){
$TotalSegment = 1;
}else{
$TotalSegment = ceil($textlen/$ConcatMax);
}
if($TotalSegment>$MaxSegments) return false; //SMS is very big.
return $TotalSegment;
}
function isGsm7bit($text){
$gsm7bitChars = "\\\#£\$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà^{}[~]|€";
$textlen = mb_strlen($text);
for ($i = 0;$i < $textlen; $i++){
if ((strpos($gsm7bitChars, $text[$i])==false) && ($text[$i]!="\\")){return false;} //strpos not able to detect \ in string
}
return true;
}

I'm adding some extra information here because the previous answer isn't quite correct.
These are the issues:
You need to be specifying the current string encoding to mb_string, otherwise this may be incorrectly gathered
In 7-bit GSM encoding, the Basic Charset Extended characters (^{}\[~]|€) require 14-bits each to encode, so they count as two characters each.
In UCS-2 encoding, you have to be wary of emoji and other characters outside the 16-bit BMP, because...
GSM with UCS-2 counts 16-bit characters, so if you have a 💩 character (U+1F4A9), and your carrier and phone sneakily support UTF-16 and not just UCS-2, it will be encoded as a surrogate pair of 16-bit characters in UTF-16, and thus be counted as TWO 16-bit characters toward your string length. mb_strlen will count this as a single character only.
How to count 7-bit characters:
What I've come up with so far is the following to count 7-bit characters:
// Internal encoding must be set to UTF-8,
// and the input string must be UTF-8 encoded for this to work correctly
protected function count_gsm_string($str)
{
// Basic GSM character set (one 7-bit encoded char each)
$gsm_7bit_basic = "#£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà";
// Extended set (requires escape code before character thus 2x7-bit encodings per)
$gsm_7bit_extended = "^{}\\[~]|€";
$len = 0;
for($i = 0; $i < mb_strlen($str); $i++) {
$c = mb_substr($str, i, 1);
if(mb_strpos($gsm_7bit_basic, $c) !== FALSE) {
$len++;
} else if(mb_strpos($gsm_7bit_extended, $c) !== FALSE) {
$len += 2;
} else {
return -1; // cannot be encoded as GSM, immediately return -1
}
}
return $len;
}
How to count 16-bit characters:
Convert the string into UTF-16 representation (to preserve the emoji characters with mb_convert_encoding($str, 'UTF-16', 'UTF-8').
do not convert into UCS-2 as this is lossy with mb_convert_encoding)
Count bytes with count(unpack('C*', $utf16str)) and divide by two to get the number of UCS-2 16-bit characters that count toward the GSM multipart length
*caveat emptor, a word on counting bytes:
Do not use strlen to count the number of bytes. While it may work, strlen is often overloaded in PHP installations with a multibyte-capable version, and is also a candidate for API change in the future
Avoid mb_strlen($str, 'UCS-2'). While it does currently work, and will return, correctly, 2 for a pile of poo character (as it looks like two 16-bit UCS-2 characters), its stablemate mb_convert_encoding is lossy when converting from >16-bit to UCS-2. Who's to say that mb_strlen won't be lossy in the future?
Avoid mb_strlen($str, '8bit') / 2. It also currently works, and is recommended in a PHP docs comment as a way to count bytes. But IMO it suffers from the same issue as the above UCS-2 technique.
That leaves the safest current way (IMO) as unpacking into a byte array, and counting that.
So, what does this look like?
// Internal encoding must be set to UTF-8,
// and the input string must be UTF-8 encoded for this to work correctly
protected function count_ucs2_string($str)
{
$utf16str = mb_convert_encoding($str, 'UTF-16', 'UTF-8');
// C* option gives an unsigned 16-bit integer representation of each byte
// which option you choose doesn't actually matter as long as you get one value per byte
$byteArray = unpack('C*', $utf16str);
return count($byteArray) / 2;
}
Putting it all together:
function multipart_count($str)
{
$one_part_limit = 160; // use a constant i.e. GSM::SMS_SINGLE_7BIT
$multi_limit = 153; // again, use a constant
$max_parts = 3; // ... constant
$str_length = count_gsm_string($str);
if($str_length === -1) {
$one_part_limit = 70; // ... constant
$multi_limit = 67; // ... constant
$str_length = count_ucs2_string($str);
}
if($str_length <= $one_part_limit) {
// fits in one part
return 1;
} else if($str_length > ($max_parts * $multi_limit) {
// too long
return -1; // or throw exception, or false, etc.
} else {
// divide the string length by multi_limit and round up to get number of parts
return ceil($str_length / $multi_limit);
}
}
Turned this into a library...
https://bitbucket.org/solvam/smstools

The best solution I have so far:
$text = '\#£$¥èéùìòÇØøÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ -./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà^{}[~]|€' ; //"\\". //'"';//' ';
print $text . "\n";
print isGsm7bit($text). "\n";
print getNumberOfSMSsegments($text). "\n";
function getNumberOfSMSsegments($text,$MaxSegments=6){
/*
http://en.wikipedia.org/wiki/SMS
Larger content (concatenated SMS, multipart or segmented SMS, or "long SMS") can be sent using multiple messages,
in which case each message will start with a user data header (UDH) containing segmentation information.
Since UDH is part of the payload, the number of available characters per segment is lower:
153 for 7-bit encoding,
134 for 8-bit encoding and
67 for 16-bit encoding.
The receiving handset is then responsible for reassembling the message and presenting it to the user as one long message.
While the standard theoretically permits up to 255 segments,[35] 6 to 8 segment messages are the practical maximum,
and long messages are often billed as equivalent to multiple SMS messages. See concatenated SMS for more information.
Some providers have offered length-oriented pricing schemes for messages, however, the phenomenon is disappearing.
*/
$TotalSegment=0;
$textlen = mb_strlen($text);
if($textlen==0) return false; //I can see most mobile devices will not allow you to send empty sms, with this check we make sure we don't allow empty SMS
if(isGsm7bit($text)){ //7-bit
$SingleMax=160;
$ConcatMax=153;
}else{ //UCS-2 Encoding (16-bit)
$SingleMax=70;
$ConcatMax=67;
}
if($textlen<=$SingleMax){
$TotalSegment = 1;
}else{
$TotalSegment = ceil($textlen/$ConcatMax);
}
if($TotalSegment>$MaxSegments) return false; //SMS is very big.
return $TotalSegment;
}
function isGsm7bit($text){
$gsm7bitChars = "\\\#£\$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà^{}[~]|€";
$textlen = mb_strlen($text);
for ($i = 0;$i < $textlen; $i++){
if ((strpos($gsm7bitChars, $text[$i])==false) && ($text[$i]!="\\")){return false;} //strpos not able to detect \ in string
}
return true;
}

page 1 : 160 byte
page 2 : 146 byte
page 3 : 153 byte
page 4 : 153 byte
page 5 : 153 byte, ....
So regardless of language :
// strlen($text) show bytes
$count = 0;
$len = strlen($text);
if ($len > 306) {
$len = $len - 306;
$count = floor($len / 153) + 3;
} else if($len>160){
$count = 2;
}else{
$count = 1;
}

PHP read binary file in real binary

I searched google for my problem but found no solution.
I want to read a file and convert the buffer to binary like 10001011001011001.
If I have something like this from the file
bmoov���lmvhd�����(tF�(tF�_�
K�T��������������������������������������������#���������������������������������trak���\tkh
d����(tF�(tF������� K������������������������������������������������#������������$edts��
How can I convert all characters (including also this stuff ��) to 101010101000110010 representation??
I hope someone can help me :)

Use ord() on each byte to get its decimal value and then sprintf to print it in binary form (and force each byte to include 8 bits by padding with 0 on front).
<?php
$buffer = file_get_contents(__FILE__);
$length = filesize(__FILE__);
if (!$buffer || !$length) {
die("Reading error\n");
}
$_buffer = '';
for ($i = 0; $i < $length; $i++) {
$_buffer .= sprintf("%08b", ord($buffer[$i]));
}
var_dump($_buffer);
$ php test.php
string(2096) "00111100001111110111000001101000011100000000101000100100011000100111010101100110011001100110010101110010001000000011110100100000011001100110100101101100011001010101111101100111011001010111010001011111011000110110111101101110011101000110010101101110011101000111001100101000010111110101111101000110010010010100110001000101010111110101111100101001001110110000101000100100011011000110010101101110011001110111010001101000001000000011110100100000011001100110100101101100011001010111001101101001011110100110010100101000010111110101111101000110010010010100110001000101010111110101111100101001001110110000101000001010011010010110011000100000001010000010000100100100011000100111010101100110011001100110010101110010001000000111110001111100001000000010000100100100011011000110010101101110011001110111010001101000001010010010000001111011000010100010000000100000011001000110100101100101001010000010001001010010011001010110000101100100011010010110111001100111001000000110010101110010011100100110111101110010010111000110111000100010001010010011101100001010011111010000101000001010001001000101111101100010011101010110011001100110011001010111001000100000001111010010000000100111001001110011101100001010011001100110111101110010001000000010100000100100011010010010000000111101001000000011000000111011001000000010010001101001001000000011110000100000001001000110110001100101011011100110011101110100011010000011101100100000001001000110100100101011001010110010100100100000011110110000101000100000001000000010010001011111011000100111010101100110011001100110010101110010001000000010111000111101001000000111001101110000011100100110100101101110011101000110011000101000001000100010010100110000001110000110010000100010001011000010000001100100011001010110001101100010011010010110111000101000011011110111001001100100001010000010010001100010011101010110011001100110011001010111001001011011001001000110100101011101001010010010100100101001001110110000101001111101000010100000101001110110011000010111001001011111011001000111010101101101011100000010100000100100010111110110001001110101011001100110011001100101011100100010100100111011"

On thing you could do is to read the file into a string variable, then print the string in your binary number representation with the use of sprintfDocs:
$string = file_get_contents($file);
for($l=strlen($string), $i=0; $i<$l; $i++)
{
printf('%08b', ord($string[$i]));
}
If you're just looking for a hexadecimal representation, you can use bin2hexDocs:
echo bin2hex($string);
If you're looking for a nicer form of hexdump, please see the related question:
How can I get a hex dump of a string in PHP?

Reading a file word-wise (32 bits at once) would be faster than byte-wise:
$s = file_get_contents("filename");
foreach(unpack("L*", $s) as $n)
$buf[] = sprintf("%032b", $n);

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How can Detect UTF 16 decoding - php

Related

How to print Hexadecimal UTF-8 characters in PHP

Parse UTF-8 string char-by-char in PHP

Convert inline specified UTF-8 mail subject

best way to detect number of SMS needed to send a text

PHP read binary file in real binary

Categories

Resources