How to print Hexadecimal UTF-8 characters in PHP

How do I print UTF-8 characters from their hexadecimal UTF-8 values? I read this post, but it did not solve my problem...
I work with many strings that are Sanskrit words stored in a database. I have their HTML values, 16-bit binary code points, hex codes, and decimal codes, but I want to be able to work with their hexadecimal UTF-8 values and output their symbolic form.
For example, here is a word आम that has a Binary UTF-8 value of 111000001010010010111000111000001010010010101110. I want to see/store/print its Hexadecimal UTF-8 value and print its symbolic form.
For example, here's a snippet of my code:
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
$Temporary = dechex(bindec($BinaryUTF8));
$HexadecimalUTF8 = NULL;
for($i = 0; $i < strlen($Temporary); $i+=2)
{
$HexadecimalUTF8 .= "\x".$Temporary[$i].$Temporary[$i+1];
}
$Test = "\xe0\xa4\x86\xe0\xa4\xae";
echo "\$Test = ".$Test;
echo "<br>";
echo "\$HexadecimalUTF8 = ".$HexadecimalUTF8;
The output is:
$Test = आम
$HexadecimalUTF8 = \xe0\xa4\x86\xe0\xa4\xae
$Test output the desired characters.
Why does $HexadecimalUTF8 not output the desired characters?

Your binary is wrong (I have fixed it below)
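You are building a string that contains the literal text "\xe0" instead of the byte that this escape sequence stands for. PHP only interprets "\xe0" when it appears in a double-quoted string literal in the source code; the hex value itself is really just a number.
This seems to work now: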
<?php
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
$Temporary = dechex(bindec($BinaryUTF8));
$HexadecimalUTF8 = NULL;
for($i = 0; $i < strlen($Temporary); $i+=2)
{
$HexadecimalUTF8 .= '\x' . $Temporary[$i].$Temporary[$i+1];
}
$Test = "\xe0\xa4\x86\xe0\xa4\xae";
echo "\$Test = ".$Test;
echo "<br>";
echo "\$HexadecimalUTF8 = " . makeCharFromHex($HexadecimalUTF8);
function makeCharFromHex($hex) {
return preg_replace_callback(
'#\\\\x([0-9A-F]{2})#i', // capture only the two hex digits, so hexdec() never sees the "\x" prefix
function ($matches) {
return chr(hexdec($matches[1]));
},
$hex
);
}
This question reminds me how poor PHP is for multi byte support
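For comparison, the same conversion can be done byte by byte, without ever building "\x" escape text. The following is only a sketch, not the answerer's code: the helper names bitsToHex() and bitsToBytes() are made up for illustration, and it assumes the bit string has a length that is a multiple of 8.

```php
<?php
// Hypothetical helpers (names are illustrative, not taken from the answers above).

// Turn a string of 0/1 characters into lowercase hex, two digits per byte.
function bitsToHex(string $bits): string
{
    if (strlen($bits) % 8 !== 0 || !preg_match('/^[01]+\z/', $bits)) {
        throw new InvalidArgumentException('Expected a non-empty bit string whose length is a multiple of 8');
    }
    $hex = '';
    foreach (str_split($bits, 8) as $byteBits) {
        $hex .= sprintf('%02x', bindec($byteBits));
    }
    return $hex;
}

// Turn the same bit string into raw bytes, which print as UTF-8 text.
function bitsToBytes(string $bits): string
{
    return hex2bin(bitsToHex($bits));
}

$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
echo bitsToHex($BinaryUTF8), "\n";   // e0a486e0a4ae
echo bitsToBytes($BinaryUTF8), "\n"; // आम
```

Because each byte is converted on its own, this does not depend on the whole value fitting into a PHP integer, which dechex(bindec(...)) does.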

To print UTF-8 characters from their decimal value you can use this function
<?php
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
echo chr_utf8(9405).chr_utf8(9402).chr_utf8(9409).chr_utf8(hexdec('24C1')).chr_utf8(9412);
// Output ⒽⒺⓁⓁⓄ
// Note : Use hexdec to print UTF-8 encoded characters from hexadecimal number.
For your snippet you can try this… and check it in https://eval.in/748161
<?php
// function chr_utf8 shown above is required…
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
if (preg_match_all('#(0[01]{7})|(?:110([01]{5})10([01]{6}))|(?:1110([01]{4})10([01]{6})10([01]{6}))|(?:11110([01]{3})10([01]{6})10([01]{6})10([01]{6}))#',$BinaryUTF8,$a,PREG_SET_ORDER))
$result=implode('',array_map(function($n){return chr_utf8(bindec(implode('',array_slice($n,1))));},$a));
echo $result;
// Output आम
// Note : If you work with "binary" the length of input must be multiple of 8.
// You can't remove leading zeros because this regex will not detect the character…
One other nice inline solution is the following… (php v5.6+ required) Check it in https://eval.in/748162
<?php
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
echo pack('C*',...array_map('bindec',str_split($BinaryUTF8,8)));
// Output आम
// Note : The length of $BinaryUTF8 must be a multiple of 8.
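The bit arithmetic packed into chr_utf8() above is easier to follow when it is written out branch by branch. Here is a readable sketch of the same idea. The name codePointToUtf8() is chosen here for illustration, and rejecting surrogates and out-of-range values is an addition that chr_utf8() does not make. If the mbstring extension is available, mb_chr() (PHP 7.2+) does this job as well.

```php
<?php
// Sketch: encode one Unicode code point as UTF-8 bytes.
// codePointToUtf8() is a hypothetical name, not part of PHP.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value: $cp");
    }
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// U+0906 and U+092E are the two letters of the word from the question.
echo codePointToUtf8(0x0906) . codePointToUtf8(0x092E), "\n"; // आम
echo bin2hex(codePointToUtf8(hexdec('0906'))), "\n";          // e0a486
```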

Related

Parse UTF-8 string char-by-char in PHP

I'm sorry if I'm asking the obvious, but I can't seem to find a working solution for a simple task. On the input I have a string, provided by a user, encoded with UTF-8 encoding. I need to sanitize it by removing all characters less than 0x20 (or space), except 0x9 (or tab).
The following works for ANSI strings, but not for UTF-8:
$newName = "";
$ln = strlen($name);
for($i = 0; $i < $ln; $i++)
{
$ch = substr($name, $i, 1);
$och = ord($ch);
if($och >= 0x20 ||
$och == 0x9)
{
$newName .= $ch;
}
}
It totally missed UTF-8 encoded characters and treats them as bytes. I keep finding posts where people suggest using mb_ functions, but that still doesn't help me. (For instance, I tried calling mb_strlen($name, "utf-8"); instead of strlen, but it still returns the length of string in BYTEs instead of characters.)
Any idea how to do this in PHP?
PS. Sorry, my PHP is somewhat rusty.
If you use multibyte functions (mb_) then you have to use them for everything. In this example you should use mb_strlen() and mb_substr().
The reason it is not working is probably because you are using ord(). It only works with ASCII values:
ord
(PHP 4, PHP 5)
ord — Return ASCII value of character
...
Returns the ASCII value of the first character of string.
In other words, if you throw a multibyte character into ord() it will only use the first byte, and throw away the rest.
Wow, PHP is one messed up language. Here's what worked for me (but how much slower will this run for a longer chunk of text...):
function normalizeName($name, $encoding_2_use, $encoding_used)
{
//'$name' = string to normalize
// INFO: Must be encoded with '$encoding_used' encoding
//'$encoding_2_use' = encoding to use for return string (example: "utf-8")
//'$encoding_used' = encoding used to encode '$name' (can be also "utf-8")
//RETURN:
// = Name normalized, or
// = "" if error
$resName = "";
$ln = mb_strlen($name, $encoding_used);
if($ln !== false)
{
for($i = 0; $i < $ln; $i++)
{
$ch = mb_substr($name, $i, 1, $encoding_used);
$arp = unpack('N', mb_convert_encoding($ch, 'UCS-4BE', $encoding_used));
if(count($arp) >= 1)
{
$och = intval($arp[1]); // unpack() numbers unnamed elements starting at 1, hence index 1
if($och >= 0x20 || $och == 0x9)
{
$ch2 = mb_convert_encoding('&#'.$och.';', $encoding_2_use, 'HTML-ENTITIES');
$resName .= $ch2;
}
}
}
}
return $resName;
}
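It may be worth noting that, for this particular task, working on bytes is already safe: in UTF-8 every byte of a multibyte character is 0x80 or higher, so a byte below 0x20 can only be a real control character. Under that assumption (the input is UTF-8 or another ASCII-compatible encoding), a much shorter sketch is possible. The helper name stripControlChars() is hypothetical.

```php
<?php
// Sketch: remove control characters below 0x20 but keep TAB (0x09).
// Works on bytes; UTF-8 lead and continuation bytes are all >= 0x80,
// so multibyte characters pass through untouched.
function stripControlChars(string $name): string
{
    // The character class covers 0x00-0x08 and 0x0A-0x1F, skipping 0x09.
    return preg_replace('/[\x00-\x08\x0A-\x1F]/', '', $name);
}

$dirty = "Za\x01\tžlutý\n\x1Fkůň";
echo stripControlChars($dirty), "\n"; // Za<TAB>žlutýkůň
```

No per-character loop or encoding conversion is involved, so the cost stays close to a single pass over the string.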

Levenshtein distance on diacritic characters

In PHP I am calculating Levenshtein distance using function levenshtein(). For simple characters it works as expected, but for diacritic characters like in example
echo levenshtein('à', 'a');
it returns "2". In this case only one replacement has to be done, so I expect it to return "1".
Am I missing something?
I thought it may be useful to have this comment from the PHP manual posted as an answer to this question, so here it is:-
The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.
Example with a french accented word :
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.
Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
// find all multibyte characters (cf. utf-8 encoding specs)
$matches = array();
if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
return $str; // plain ascii string
// update the encoding map with the characters not already met
foreach ($matches[0] as $mbc)
if (!isset($map[$mbc]))
$map[$mbc] = chr(128 + count($map));
// finally remap non-ascii characters
return strtr($str, $map);
}
// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
$charMap = array();
$s1 = utf8_to_extended_ascii($s1, $charMap);
$s2 = utf8_to_extended_ascii($s2, $charMap);
return levenshtein($s1, $s2);
}
?>
Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms
The default PHP levenshtein(), like many PHP functions, is not multibyte aware. So, when processing strings with Unicode characters, it handles each byte separately and changes two bytes.
There is no multibyte version (i.e. mb_levenshtein()) so you have two options:
1) Re-implement the function yourself, using mb_ functions. Possible example code from a Gist:
<?php
function levenshtein_php($str1, $str2){
$length1 = mb_strlen( $str1, 'UTF-8');
$length2 = mb_strlen( $str2, 'UTF-8');
if( $length1 < $length2) return levenshtein_php($str2, $str1);
if( $length1 == 0 ) return $length2;
if( $str1 === $str2) return 0;
$prevRow = range( 0, $length2);
$currentRow = array();
for ( $i = 0; $i < $length1; $i++ ) {
$currentRow=array();
$currentRow[0] = $i + 1;
$c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
for ( $j = 0; $j < $length2; $j++ ) {
$c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
$insertions = $prevRow[$j+1] + 1;
$deletions = $currentRow[$j] + 1;
$substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
$currentRow[] = min($insertions, $deletions, $substitutions);
}
$prevRow = $currentRow;
}
return $prevRow[$length2];
}
2) Convert your string's Unicode characters to ASCII. If you are specifically wanting to calculate Levenshtein differences from diacritic characters to non-diacritics, though, this is probably not what you want.
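As a variation on option 1, the strings can be split into characters once up front, so the inner loop compares array elements instead of calling mb_substr() on every iteration. This is only a sketch: the function names utf8Chars() and levenshteinChars() are made up here, it relies on PCRE's /u modifier rather than mbstring, and it assumes both inputs are valid UTF-8. It compares code points, so a letter stored as base character plus combining accent still counts as two characters.

```php
<?php
// Split a UTF-8 string into an array of characters (code points).
function utf8Chars(string $s): array
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $chars;
}

// Levenshtein distance counted in characters rather than bytes.
function levenshteinChars(string $a, string $b): int
{
    $x = utf8Chars($a);
    $y = utf8Chars($b);
    $prev = range(0, count($y)); // distances from the empty prefix of $a
    foreach ($x as $i => $cx) {
        $cur = [$i + 1];
        foreach ($y as $j => $cy) {
            $cur[] = min(
                $prev[$j + 1] + 1,                // insertion
                $cur[$j] + 1,                     // deletion
                $prev[$j] + ($cx === $cy ? 0 : 1) // substitution
            );
        }
        $prev = $cur;
    }
    return $prev[count($y)];
}

echo levenshtein('à', 'a'), "\n";      // 2 (bytes)
echo levenshteinChars('à', 'a'), "\n"; // 1 (characters)
```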

Convert inline specified UTF-8 mail subject

I want to convert the following raw mail subject to normal UTF-8 text:
=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=
The real text for that is:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
My first approach to convert this:
$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
mb_internal_encoding("UTF-8");
echo mb_decode_mimeheader($mime);
This gives me the following result:
Schuker_hat_sich_vom_Übungsabend_(01.01.2012)_abgemeldet
(Questions here: What am I doing wrong? Why do those underscores occur?)
My second approach to convert this:
$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
echo imap_utf8($mime);
This gives me the following (correct) result:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
Why does this work? Which method should I rely on?
The reason I ask is that I previously asked another question about decoding mail subjects, where mb_decode_mimeheader was the solution, whereas here imap_utf8 would be the way to go. How can I make sure that both of these examples are decoded correctly:
=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=
and
=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?=
=?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=
Should give me the expected results:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
and
Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"
Based on the hbit response, I've improved the imapUtf8() function to convert the subject text to UTF-8 using the charset information. The result is something like:
function imapUtf8($str){
$convStr = '';
$subLines = preg_split('/[\r\n]+/', $str);
for ($i=0; $i < count($subLines); $i++) {
$convLine = '';
$linePartArr = imap_mime_header_decode($subLines[$i]);
for ($j=0; $j < count($linePartArr); $j++) {
if ($linePartArr[$j]->charset === 'default') {
if ($linePartArr[$j]->text != " ") {
$convLine .= ($linePartArr[$j]->text);
}
} else {
$convLine .= iconv($linePartArr[$j]->charset, 'UTF-8', $linePartArr[$j]->text);
}
}
$convStr .= $convLine;
}
return $convStr;
}
This function works for both examples:
function imapUtf8($str){
$convStr = '';
$subLines = preg_split('/[\r\n]+/',$str); // split multi-line subjects
for($i=0; $i < count($subLines); $i++){ // go through lines
$convLine = '';
$linePartArr = imap_mime_header_decode(trim($subLines[$i])); // split and decode by charset
for($j=0; $j < count($linePartArr); $j++){
$convLine .= ($linePartArr[$j]->text); // append sub-parts of line together
}
$convStr .= $convLine; // append to whole subject
}
return $convStr; // return converted subject
}
Tests:
$sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
$sub2 = '=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=';
echo imapUtf8($sub1);
echo imapUtf8($sub2);
Result:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"
It's also in the comments in the manual for mb_decode_mimeheader, and I actually assume it is a bug. None in the database, so I'd file it as a new one.
However, AFAIK imap_mime_header_decode will cope with both your encodings without a problem, so that will keep your code going.
About the mysterious underscore in the Subject header field:
RFC2047 4.2(2) states explicitly:
The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
represented as "_" (underscore, ASCII 95.). (This character may
not pass through some internetwork mail gateways, but its use
will greatly enhance readability of "Q" encoded data with mail
readers that do not support this encoding.) Note that the "_"
always represents hexadecimal 20, even if the SPACE character
occupies a different code position in the character set in use.
The encoding rules for the Subject line are documented in RFC 2047 itself.
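To make the underscore rule concrete, here is a minimal sketch of decoding RFC 2047 encoded-words by hand, using only core PHP functions. It is not a replacement for imap_mime_header_decode(): the function name decodeEncodedWords() is hypothetical, and the sketch assumes that every encoded-word already uses UTF-8, so the charset label is read but no conversion is done.

```php
<?php
// Sketch: decode "=?charset?Q?...?=" and "=?charset?B?...?=" words.
// Assumption: the charset is UTF-8, so the decoded bytes are returned as-is.
function decodeEncodedWords(string $header): string
{
    // Whitespace between two adjacent encoded-words is not part of the text.
    $header = preg_replace('/(\?=)\s+(=\?)/', '$1$2', $header);

    return preg_replace_callback(
        '/=\?([^?]+)\?([QqBb])\?([^?]*)\?=/',
        function (array $m): string {
            if (strtoupper($m[2]) === 'B') {
                return base64_decode($m[3]);
            }
            // "Q" encoding: "_" always stands for a space (hex 20),
            // everything else is quoted-printable.
            return quoted_printable_decode(str_replace('_', ' ', $m[3]));
        },
        $header
    );
}

$sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
echo decodeEncodedWords($sub1), "\n";
// Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
```

Skipping the str_replace() step reproduces the underscores seen with mb_decode_mimeheader in the question.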

PHP read binary file in real binary

I searched google for my problem but found no solution.
I want to read a file and convert the buffer to binary like 10001011001011001.
If I have something like this from the file
bmoov���lmvhd�����(tF�(tF�_�
K�T��������������������������������������������#���������������������������������trak���\tkh
d����(tF�(tF������� K������������������������������������������������#������������$edts��
How can I convert all characters (including also this stuff ��) to 101010101000110010 representation??
I hope someone can help me :)
Use ord() on each byte to get its decimal value and then sprintf to print it in binary form (and force each byte to include 8 bits by padding with 0 on front).
<?php
$buffer = file_get_contents(__FILE__);
$length = filesize(__FILE__);
if (!$buffer || !$length) {
die("Reading error\n");
}
$_buffer = '';
for ($i = 0; $i < $length; $i++) {
$_buffer .= sprintf("%08b", ord($buffer[$i]));
}
var_dump($_buffer);
$ php test.php
string(2096) "001111000011111101110000011010000111000000001010001001000110001001110101011001100110011001100101011100100010000000111101001000000110011001101001011011000110010101011111011001110110010101110100010111110110001101101111011011100111010001100101011011100111010001110011001010000101111101011111010001100100100101001100010001010101111101011111001010010011101100001010001001000110110001100101011011100110011101110100011010000010000000111101001000000110011001101001011011000110010101110011011010010111101001100101001010000101111101011111010001100100100101001100010001010101111101011111001010010011101100001010000010100110100101100110001000000010100000100001001001000110001001110101011001100110011001100101011100100010000001111100011111000010000000100001001001000110110001100101011011100110011101110100011010000010100100100000011110110000101000100000001000000110010001101001011001010010100000100010010100100110010101100001011001000110100101101110011001110010000001100101011100100111001001101111011100100101110001101110001000100010100100111011000010100111110100001010000010100010010001011111011000100111010101100110011001100110010101110010001000000011110100100000001001110010011100111011000010100110011001101111011100100010000000101000001001000110100100100000001111010010000000110000001110110010000000100100011010010010000000111100001000000010010001101100011001010110111001100111011101000110100000111011001000000010010001101001001010110010101100101001001000000111101100001010001000000010000000100100010111110110001001110101011001100110011001100101011100100010000000101110001111010010000001110011011100000111001001101001011011100111010001100110001010000010001000100101001100000011100001100100001000100010110000100000011001000110010101100011011000100110100101101110001010000110111101110010011001000010100000100100011000100111010101100110011001100110010101110010010110110010010001101001010111010010100100101001001010010011101100001010011111010000101000001010011101100110000101110010010111110110010001
11010101101101011100000010100000100100010111110110001001110101011001100110011001100101011100100010100100111011"
One thing you could do is read the file into a string variable, then print the string in binary representation using sprintf/printf:
$string = file_get_contents($file);
for($l=strlen($string), $i=0; $i<$l; $i++)
{
printf('%08b', ord($string[$i]));
}
If you're just looking for a hexadecimal representation, you can use bin2hex:
echo bin2hex($string);
If you're looking for a nicer form of hexdump, please see the related question:
How can I get a hex dump of a string in PHP?
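Processing the file word-wise (32 bits at a time) needs fewer iterations than going byte by byte and may be faster:
$s = file_get_contents("filename");
$buf = array();
// "N" reads 32-bit big-endian words, so the bit order matches the byte order in the file.
// Only complete 4-byte words are converted; up to 3 trailing bytes are skipped.
foreach(unpack("N*", $s) as $n)
$buf[] = sprintf("%032b", $n);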
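The byte-wise approach from the answers above can be wrapped in a small function that works on any binary string, whatever its length. This is a sketch with a hypothetical function name, bytesToBits(); it uses unpack('C*') instead of indexing the string, which is a stylistic choice rather than a requirement.

```php
<?php
// Sketch: render every byte of a binary string as 8 characters of 0/1.
function bytesToBits(string $bytes): string
{
    if ($bytes === '') {
        return '';
    }
    $bits = '';
    // unpack('C*') yields one unsigned integer (0-255) per byte.
    foreach (unpack('C*', $bytes) as $byte) {
        $bits .= sprintf('%08b', $byte); // zero-pad to a full 8 bits
    }
    return $bits;
}

echo bytesToBits("Hi"), "\n"; // 0100100001101001

// For a file, read it as a binary string first:
// echo bytesToBits(file_get_contents('movie.mp4')); // 'movie.mp4' is a placeholder path
```

The result is eight times the size of the input, so for large files it may be preferable to read and print the data in chunks.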

How can I cram 6+31 numeric characters into 22 alphanumeric characters?

I've got a 6-digit number and a 31-digit number (e.g. "234536" & "201103231043330478311223582826") that I need to cram into the same 22-character alphanumeric field in an API using PHP. I tried converting each to base 32 (had to use a custom function as base_convert() doesn't handle big numbers well) and joining with a single-character delimiter, but that only gets me down to 26 characters. It's a REST API, so the characters need to be URI-safe.
I'd really like to do this without creating a database table cross referencing the two numbers with another reference value, if possible. Any suggestions?
Use a radix of 62 instead (digits plus upper- and lowercase letters). That needs about 3.35 characters for the former and 17.3 characters for the latter; rounded up to whole characters that is 4 + 18, for a total of 22 characters.
>>> math.log(10**6)/math.log(62)
3.3474826039165504
>>> math.log(10**31)/math.log(62)
17.295326786902177
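A rough PHP sketch of the radix-62 idea follows. It does the big-number arithmetic on decimal strings, so it needs neither BCMath nor GMP. The function names, the alphabet order and the fixed 4 + 18 character layout (which removes the need for a delimiter) are all assumptions made for this illustration.

```php
<?php
// Alphabet order is an arbitrary choice for this sketch.
const BASE62 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Convert a non-negative decimal string to base 62 by repeated long division.
function decToBase62(string $decimal): string
{
    $digits = ltrim($decimal, '0');
    if ($digits === '') {
        return '0';
    }
    $out = '';
    while ($digits !== '') {
        $quotient = '';
        $remainder = 0;
        for ($i = 0, $n = strlen($digits); $i < $n; $i++) {
            $cur = $remainder * 10 + (int) $digits[$i];
            $q = intdiv($cur, 62);
            $remainder = $cur % 62;
            if ($quotient !== '' || $q > 0) {
                $quotient .= $q; // skip leading zeros of the quotient
            }
        }
        $out = BASE62[$remainder] . $out;
        $digits = $quotient;
    }
    return $out;
}

// Convert back: multiply the running decimal string by 62 and add each digit.
function base62ToDec(string $encoded): string
{
    $digits = '0';
    foreach (str_split($encoded) as $ch) {
        $carry = strpos(BASE62, $ch);
        $res = '';
        for ($i = strlen($digits) - 1; $i >= 0; $i--) {
            $cur = (int) $digits[$i] * 62 + $carry;
            $res = ($cur % 10) . $res;
            $carry = intdiv($cur, 10);
        }
        while ($carry > 0) {
            $res = ($carry % 10) . $res;
            $carry = intdiv($carry, 10);
        }
        $digits = ltrim($res, '0');
        if ($digits === '') {
            $digits = '0';
        }
    }
    return $digits;
}

// Fixed widths: 4 characters cover 6 digits, 18 characters cover 31 digits.
function packPair(string $short, string $long): string
{
    return str_pad(decToBase62($short), 4, '0', STR_PAD_LEFT)
         . str_pad(decToBase62($long), 18, '0', STR_PAD_LEFT);
}

$field = packPair('234536', '201103231043330478311223582826');
echo strlen($field), "\n";                     // 22
echo base62ToDec(substr($field, 0, 4)), "\n";  // 234536
echo base62ToDec(substr($field, 4)), "\n";     // 201103231043330478311223582826
```

Leading zeros of the original numbers are not preserved by this layout, so the decoded values would need to be padded back to 6 and 31 digits if that matters.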
You can write something like pack() that works with big numbers using the bc (BCMath) functions. Here is my quick solution; it converts your second number into a 13-byte string. Pretty nice!
<?php
$i2 = "201103231043330478311223582826";
function pack_large($i) {
$ret = '';
while(bccomp($i, 0) !== 0) {
$mod = bcmod($i, 256);
$i = bcsub($i, $mod);
$ret .= chr($mod);
$i = bcdiv($i, 256);
}
return $ret;
}
function unpack_large($s) {
$ret = '0';
$len = strlen($s);
for($i = $len - 1; $i >= 0; --$i) {
$add = ord($s[$i]);
$ret = bcmul($ret, 256);
$ret = bcadd($ret, $add);
}
return $ret;
}
var_dump($i2);
var_dump($pack = pack_large($i2));
var_dump(unpack_large($pack));
Sample output :
string(30) "201103231043330478311223582826"
string(13) "jàÙl¹9±̉"
string(47) "201103231043330478311223582826.0000000000000000"
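Since you need URL-friendly characters, run base64_encode() on the packed string. That gives you a 20-character string (18 if you remove the "==" padding). Note that standard Base64 output can contain "+" and "/", which are not URI-safe; replace them with "-" and "_", for example with strtr().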
