Parse UTF-8 string char-by-char in PHP


I'm sorry if I'm asking the obvious, but I can't seem to find a working solution for a simple task. As input I have a user-provided string encoded as UTF-8. I need to sanitize it by removing all characters below 0x20 (space), except 0x9 (tab).
The following works for ANSI strings, but not for UTF-8:
$newName = "";
$ln = strlen($name);
for($i = 0; $i < $ln; $i++)
{
$ch = substr($name, $i, 1);
$och = ord($ch);
if($och >= 0x20 ||
$och == 0x9)
{
$newName .= $ch;
}
}
It totally misses UTF-8 encoded characters and treats them as bytes. I keep finding posts where people suggest using mb_ functions, but that still doesn't help me. (For instance, I tried calling mb_strlen($name, "utf-8"); instead of strlen, but it still returns the length of the string in bytes instead of characters.)
Any idea how to do this in PHP?
PS. Sorry, my PHP is somewhat rusty.

If you use multibyte functions (mb_) then you have to use them for everything. In this example you should use mb_strlen() and mb_substr().
The reason it is not working is probably because you are using ord(). It only works with ASCII values:
ord
(PHP 4, PHP 5)
ord — Return ASCII value of character
...
Returns the ASCII value of the first character of string.
In other words, if you throw a multibyte character into ord() it will only use the first byte, and throw away the rest.
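A minimal sketch of that advice, assuming PHP 7.2+ with the mbstring extension (mb_ord() is the multibyte counterpart of ord()). The function name stripControlChars is made up for illustration:

```php
<?php
// Hypothetical helper: keeps tab and everything from U+0020 upwards.
function stripControlChars(string $name): string
{
    $out = '';
    $len = mb_strlen($name, 'UTF-8');            // length in characters, not bytes
    for ($i = 0; $i < $len; $i++) {
        $ch = mb_substr($name, $i, 1, 'UTF-8');  // one whole character
        $cp = mb_ord($ch, 'UTF-8');              // its code point, or false on failure
        if ($cp !== false && ($cp >= 0x20 || $cp === 0x09)) {
            $out .= $ch;
        }
    }
    return $out;
}

echo stripControlChars("na\x01me\twith\x1F accents: é"), "\n";
```

Every string call in the loop is an mb_ function with an explicit encoding, so no step falls back to byte semantics.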

Wow, PHP is one messed up language. Here's what worked for me (but how much slower will this run for a longer chunk of text...):
function normalizeName($name, $encoding_2_use, $encoding_used)
{
//'$name' = string to normalize
// INFO: Must be encoded with '$encoding_used' encoding
//'$encoding_2_use' = encoding to use for return string (example: "utf-8")
//'$encoding_used' = encoding used to encode '$name' (can be also "utf-8")
//RETURN:
// = Name normalized, or
// = "" if error
$resName = "";
$ln = mb_strlen($name, $encoding_used);
if($ln !== false)
{
for($i = 0; $i < $ln; $i++)
{
$ch = mb_substr($name, $i, 1, $encoding_used);
$arp = unpack('N', mb_convert_encoding($ch, 'UCS-4BE', $encoding_used));
if(count($arp) >= 1)
{
$och = intval($arp[1]); // unpack() numbers unnamed elements starting at 1, hence index 1
if($och >= 0x20 || $och == 0x9)
{
$ch2 = mb_convert_encoding('&#'.$och.';', $encoding_2_use, 'HTML-ENTITIES');
$resName .= $ch2;
}
}
}
}
return $resName;
}
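For what it's worth, a regular expression with the /u modifier can probably do the same filtering without a PHP-level loop. This is only a sketch: sanitizeName is a hypothetical name, and it assumes the input is meant to be valid UTF-8. preg_replace() returns null when the subject is not valid UTF-8 under /u, which is mapped to "" here to mirror the error convention of the function above.

```php
<?php
// Hypothetical helper: drop U+0000..U+001F except tab (U+0009).
function sanitizeName(string $name): string
{
    // /u makes PCRE treat pattern and subject as UTF-8
    $clean = preg_replace('/[\x00-\x08\x0A-\x1F]/u', '', $name);
    return $clean ?? '';   // null means the input was not valid UTF-8
}

echo sanitizeName("a\x01b\tc\nआम"), "\n";
```

Because the work happens inside the regex engine, the cost for long strings should be far lower than one mb_substr() call per character.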

Related

How to print Hexadecimal UTF-8 characters in PHP

How to print UTF-8 characters from their hexadecimal UTF-8 values? I read this post, but it did not solve my problem...
I work with many strings that are Sanskrit words stored in a database. I have their HTML values, 16 bit binary code points, hex codes, and decimal codes, but I want to be able to work with their Hexadecimal UTF-8 values and output their symbolic form.
For example, here is a word आम that has a Binary UTF-8 value of 111000001010010010111000111000001010010010101110. I want to see/store/print its Hexadecimal UTF-8 value and print its symbolic form.
For example, here's a snippet of my code:
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
$Temporary = dechex(bindec($BinaryUTF8));
$HexadecimalUTF8 = NULL;
for($i = 0; $i < strlen($Temporary); $i+=2)
{
$HexadecimalUTF8 .= "\x".$Temporary[$i].$Temporary[$i+1];
}
$Test = "\xe0\xa4\x86\xe0\xa4\xae";
echo "\$Test = ".$Test;
echo "<br>";
echo "\$HexadecimalUTF8 = ".$HexadecimalUTF8;
The output is:
$Test = आम
$HexadecimalUTF8 = \xe0\xa4\x86\xe0\xa4\xae
$Test output the desired characters.
Why does $HexadecimalUTF8 not output the desired characters?

Your binary is wrong (I have fixed it below).
You are building a string that contains the literal text "\xe0" rather than the byte that escape sequence stands for; the hex is really just a number.
This seems to work now:
<?php
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
$Temporary = dechex(bindec($BinaryUTF8));
$HexadecimalUTF8 = NULL;
for($i = 0; $i < strlen($Temporary); $i+=2)
{
$HexadecimalUTF8 .= '\x' . $Temporary[$i].$Temporary[$i+1];
}
$Test = "\xe0\xa4\x86\xe0\xa4\xae";
echo "\$Test = ".$Test;
echo "<br>";
echo "\$HexadecimalUTF8 = " . makeCharFromHex($HexadecimalUTF8);
function makeCharFromHex($hex) {
return preg_replace_callback(
// match a literal "\x" and capture only the two hex digits after it
'#\\\\x([0-9A-F]{2})#i',
function ($matches) {
return chr(hexdec($matches[1]));
},
$hex
);
}
This question reminds me how poor PHP is for multi byte support
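If the goal is just to get from the hex digits to the actual bytes, hex2bin() (PHP 5.4+) may be a shorter route than assembling "\x.." escapes and parsing them again. A small sketch; binaryToHex is a made-up helper name:

```php
<?php
// Hypothetical helper: turn a string of 0/1 digits into hex, one byte per 8 bits.
function binaryToHex(string $bits): string
{
    $hex = '';
    foreach (str_split($bits, 8) as $byte) {
        $hex .= sprintf('%02x', bindec($byte));  // always two hex digits per byte
    }
    return $hex;
}

$hex = binaryToHex("111000001010010010000110111000001010010010101110");
echo $hex, "\n";           // e0a486e0a4ae
echo hex2bin($hex), "\n";  // आम
```

Converting byte by byte also avoids pushing a long bit string through bindec() in one go, which would overflow an integer once the input passes 63 bits.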

To print UTF-8 characters from their decimal code point values you can use this function:
<?php
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
echo chr_utf8(9405).chr_utf8(9402).chr_utf8(9409).chr_utf8(hexdec('24C1')).chr_utf8(9412);
// Output ⒽⒺⓁⓁⓄ
// Note : Use hexdec to pass a code point that is given as a hexadecimal number.
For your snippet you can try this… and check it in https://eval.in/748161
<?php
// function chr_utf8 shown above is required…
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
if (preg_match_all('#(0[01]{7})|(?:110([01]{5})10([01]{6}))|(?:1110([01]{4})10([01]{6})10([01]{6}))|(?:11110([01]{3})10([01]{6})10([01]{6})10([01]{6}))#',$BinaryUTF8,$a,PREG_SET_ORDER))
$result=implode('',array_map(function($n){return chr_utf8(bindec(implode('',array_slice($n,1))));},$a));
echo $result;
// Output आम
// Note : If you work with "binary" the length of the input must be a multiple of 8.
// You can't remove leading zeros, because then this regex will not detect the character…
One other nice inline solution is the following… (php v5.6+ required) Check it in https://eval.in/748162
<?php
$BinaryUTF8 = "111000001010010010000110111000001010010010101110";
echo pack('C*',...array_map('bindec',str_split($BinaryUTF8,8)));
// Output आम
// Note : The length of $BinaryUTF8 must be a multiple of 8.

Levenshtein distance on diacritic characters

In PHP I am calculating Levenshtein distance using function levenshtein(). For simple characters it works as expected, but for diacritic characters like in example
echo levenshtein('à', 'a');
it returns "2". In this case only one replacement has to be done, so I expect it to return "1".
Am I missing something?

I thought it may be useful to have this comment from the PHP manual posted as an answer to this question, so here it is:-
The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.
Example with a french accented word :
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.
Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
// find all multibyte characters (cf. utf-8 encoding specs)
$matches = array();
if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
return $str; // plain ascii string
// update the encoding map with the characters not already met
foreach ($matches[0] as $mbc)
if (!isset($map[$mbc]))
$map[$mbc] = chr(128 + count($map));
// finally remap non-ascii characters
return strtr($str, $map);
}
// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
$charMap = array();
$s1 = utf8_to_extended_ascii($s1, $charMap);
$s2 = utf8_to_extended_ascii($s2, $charMap);
return levenshtein($s1, $s2);
}
?>
Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms

The default PHP levenshtein(), like many PHP functions, is not multibyte aware. So, when processing strings with Unicode characters, it handles each byte separately; replacing a two-byte character such as 'à' with the single byte 'a' therefore counts as two edits.
There is no multibyte version (i.e. mb_levenshtein()) so you have two options:
1) Re-implement the function yourself, using mb_ functions. Possible example code from a Gist:
<?php
function levenshtein_php($str1, $str2){
$length1 = mb_strlen( $str1, 'UTF-8');
$length2 = mb_strlen( $str2, 'UTF-8');
if( $length1 < $length2) return levenshtein_php($str2, $str1);
if( $length1 == 0 ) return $length2;
if( $str1 === $str2) return 0;
$prevRow = range( 0, $length2);
$currentRow = array();
for ( $i = 0; $i < $length1; $i++ ) {
$currentRow=array();
$currentRow[0] = $i + 1;
$c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
for ( $j = 0; $j < $length2; $j++ ) {
$c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
$insertions = $prevRow[$j+1] + 1;
$deletions = $currentRow[$j] + 1;
$substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
$currentRow[] = min($insertions, $deletions, $substitutions);
}
$prevRow = $currentRow;
}
return $prevRow[$length2];
}
2) Convert your string's Unicode characters to ASCII. If you are specifically wanting to calculate Levenshtein differences from diacritic characters to non-diacritics, though, this is probably not what you want.
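A possible variant of option 1 splits each string into characters once with preg_split('//u') instead of calling mb_substr() inside both loops. This is a sketch, not a tuned implementation; levenshteinChars is a hypothetical name, and it compares code points, so a letter followed by a combining accent still counts as two characters.

```php
<?php
// Hypothetical helper: Levenshtein distance over UTF-8 characters.
function levenshteinChars(string $a, string $b): int
{
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($s === false || $t === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $n = count($t);
    $prev = range(0, $n);                 // distances from the empty prefix of $a
    foreach ($s as $i => $c1) {
        $cur = [$i + 1];
        foreach ($t as $j => $c2) {
            $cur[] = min(
                $prev[$j + 1] + 1,                   // deletion
                $cur[$j] + 1,                        // insertion
                $prev[$j] + ($c1 === $c2 ? 0 : 1)    // substitution
            );
        }
        $prev = $cur;
    }
    return $prev[$n];
}

echo levenshteinChars('à', 'a'), "\n";         // 1
echo levenshteinChars('notre', 'nôtre'), "\n"; // 1
```

Only two rows of the distance matrix are kept, the same as in the Gist version above.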

PHP read binary file in real binary

I searched google for my problem but found no solution.
I want to read a file and convert the buffer to binary like 10001011001011001.
If I have something like this from the file
bmoov���lmvhd�����(tF�(tF�_�
K�T��������������������������������������������#���������������������������������trak���\tkh
d����(tF�(tF������� K������������������������������������������������#������������$edts��
How can I convert all characters (including also this stuff ��) to 101010101000110010 representation??
I hope someone can help me :)

Use ord() on each byte to get its decimal value and then sprintf to print it in binary form (and force each byte to include 8 bits by padding with 0 on front).
<?php
$buffer = file_get_contents(__FILE__);
$length = filesize(__FILE__);
if (!$buffer || !$length) {
die("Reading error\n");
}
$_buffer = '';
for ($i = 0; $i < $length; $i++) {
$_buffer .= sprintf("%08b", ord($buffer[$i]));
}
var_dump($_buffer);
$ php test.php
string(2096) "001111000011111101110000011010000111000000001010001001000110001001110101011001100110011001100101011100100010000000111101001000000110011001101001011011000110010101011111011001110110010101110100010111110110001101101111011011100111010001100101011011100111010001110011001010000101111101011111010001100100100101001100010001010101111101011111001010010011101100001010001001000110110001100101011011100110011101110100011010000010000000111101001000000110011001101001011011000110010101110011011010010111101001100101001010000101111101011111010001100100100101001100010001010101111101011111001010010011101100001010000010100110100101100110001000000010100000100001001001000110001001110101011001100110011001100101011100100010000001111100011111000010000000100001001001000110110001100101011011100110011101110100011010000010100100100000011110110000101000100000001000000110010001101001011001010010100000100010010100100110010101100001011001000110100101101110011001110010000001100101011100100111001001101111011100100101110001101110001000100010100100111011000010100111110100001010000010100010010001011111011000100111010101100110011001100110010101110010001000000011110100100000001001110010011100111011000010100110011001101111011100100010000000101000001001000110100100100000001111010010000000110000001110110010000000100100011010010010000000111100001000000010010001101100011001010110111001100111011101000110100000111011001000000010010001101001001010110010101100101001001000000111101100001010001000000010000000100100010111110110001001110101011001100110011001100101011100100010000000101110001111010010000001110011011100000111001001101001011011100111010001100110001010000010001000100101001100000011100001100100001000100010110000100000011001000110010101100011011000100110100101101110001010000110111101110010011001000010100000100100011000100111010101100110011001100110010101110010010110110010010001101001010111010010100100101001001010010011101100001010011111010000101000001010011101100110000101110010010111110110010001
11010101101101011100000010100000100100010111110110001001110101011001100110011001100101011100100010100100111011"

One thing you could do is to read the file into a string variable, then print the string in binary representation using sprintf:
$string = file_get_contents($file);
for($l=strlen($string), $i=0; $i<$l; $i++)
{
printf('%08b', ord($string[$i]));
}
If you're just looking for a hexadecimal representation, you can use bin2hex:
echo bin2hex($string);
If you're looking for a nicer form of hexdump, please see the related question:
How can I get a hex dump of a string in PHP?
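The same idea can be wrapped in a function that returns the bit string instead of printing it. A minimal sketch, assuming PHP 7.4+ for the arrow function; bytesToBits is a made-up name:

```php
<?php
// Hypothetical helper: "AB" -> "0100000101000010"
function bytesToBits(string $bytes): string
{
    if ($bytes === '') {
        return '';
    }
    // unpack('C*') yields one unsigned integer (0..255) per byte
    return implode('', array_map(
        fn (int $b): string => sprintf('%08b', $b),
        unpack('C*', $bytes)
    ));
}

echo bytesToBits("AB"), "\n"; // 0100000101000010
```

The output is eight times the size of the input, so for large files it is probably better to read and convert in chunks with fread() than to load everything with file_get_contents().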

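Reading the file word-wise (32 bits at a time) would be faster than byte-wise:
$s = file_get_contents("filename");
$buf = array();
// "N*" reads big-endian 32-bit words, so the bits keep the file's byte order;
// trailing bytes that do not fill a whole word are skipped.
foreach(unpack("N*", $s) as $n)
$buf[] = sprintf("%032b", $n);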

Text to Hex conversion in php is inaccurate

I'm trying to convert a text string to hexadecimal in php (which sounds trivial enough) but all the conversions I have tried output incorrect data.
The string I need to convert is;
RTP1 •. • A ¥;¥9ÈKJ| %¯ : E~WF 3HxI#Y¥
The correct result is;
525450310120209501022e2095204120030503040ba53b03040ba539c84b041f4a7c1120202025af032020203a20457e0357462033487849230459a52020202020
But I consistently get;
52545031012020e280a201022e20e280a2204120030503040bc2a53b03040bc2a539c3884b041f4a7c1120202025c2af032020203a20457e0357462033487849230459c2a52020202020
The online calculator at http://www.swingnote.com/tools/texttohex.php works on this perfectly - I have emailed the author to request the php source code but have had no answer.
I've tried the following functions without success;
bin2hex($data);
function strToHex($string)
{
$hex='';
for ($i=0; $i < strlen($string); $i++)
{
$hex .= dechex(ord($string[$i]));
}
return $hex;
}
for ($i = 0; $i < strlen($string); $i++) {
echo dechex(ord($string[$i]));
}
and a few others I can no longer find... I'm really at a loss with this so any help will be greatly appreciated!
Thanks!
Matthew

The input string appears to contain UTF-8 encoded characters (judging by the output). Try converting these characters back into an ASCII/ISO-8859-1-like format.
$indat = utf8_decode("...");
$hexdata = bin2hex($indat);
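One caveat: the expected result has 95 for the bullet "•", which is its Windows-1252 byte; ISO-8859-1 has no bullet, so utf8_decode() would turn it into "?". Assuming the target really is Windows-1252 (a guess based on that expected output), a sketch with mb_convert_encoding() would look like this. cp1252Hex is a hypothetical name, and the mbstring extension is required.

```php
<?php
// Hypothetical helper: UTF-8 text -> hex of its Windows-1252 bytes.
function cp1252Hex(string $utf8): string
{
    $bytes = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
    return bin2hex($bytes);   // bin2hex always emits two digits per byte
}

// U+2022 bullet, U+00A5 yen sign, U+00C8 E with grave
echo cp1252Hex("\u{2022}\u{A5}\u{C8}"), "\n"; // 95a5c8
```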

I usually just process it one char at a time.
$str = 'My Cool String!';
$hex = '';
$str_ary = str_split($str);
foreach($str_ary as $char)
{
$hex .= sprintf('%02x', ord($char)); // zero-pad so bytes below 0x10 still give two hex digits
}
echo $hex;
Edit:
Looking at it again, it looks like our code is very similar (didn't notice the code :\ ). I believe Jeff Parker has the right idea in the comment, it might just be a display issue.

How can I cram 6+31 numeric characters into 22 alphanumeric characters?

I've got a 6-digit number and a 31-digit number (e.g. "234536" & "201103231043330478311223582826") that I need to cram into the same 22-character alphanumeric field in an API using PHP. I tried converting each to base 32 (had to use a custom function as base_convert() doesn't handle big numbers well) and joining with a single-character delimiter, but that only gets me down to 26 characters. It's a REST API, so the characters need to be URI-safe.
I'd really like to do this without creating a database table cross referencing the two numbers with another reference value, if possible. Any suggestions?

Use a radix of 62 instead. That needs about 3.35 characters for the former and 17.3 characters for the latter; rounding each up gives 4 + 18 = 22 characters in total.
>>> math.log(10**6)/math.log(62)
3.3474826039165504
>>> math.log(10**31)/math.log(62)
17.295326786902177
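In PHP this could be sketched without any big-number extension by doing the base conversion on the decimal string directly. All function names and the digit alphabet below are assumptions made for illustration; fixed widths of 4 and 18 replace the delimiter, and leading zeros in the original numbers are not preserved.

```php
<?php
const B62 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Decimal string -> base 62, by repeated long division by 62.
function decToBase62(string $dec): string
{
    $digits = ltrim($dec, '0');
    if ($digits === '') {
        return '0';
    }
    $out = '';
    while ($digits !== '') {
        $rem = 0;
        $quot = '';
        foreach (str_split($digits) as $d) {
            $cur = $rem * 10 + (int) $d;
            $quot .= intdiv($cur, 62);   // always a single digit (0..9)
            $rem = $cur % 62;
        }
        $out = B62[$rem] . $out;
        $digits = ltrim($quot, '0');
    }
    return $out;
}

// Base 62 -> decimal string, by repeated multiply-by-62-and-add.
function base62ToDec(string $s): string
{
    $dec = '0';
    foreach (str_split($s) as $ch) {
        $carry = strpos(B62, $ch);
        if ($carry === false) {
            throw new InvalidArgumentException("Not a base-62 digit: $ch");
        }
        $res = '';
        for ($i = strlen($dec) - 1; $i >= 0; $i--) {
            $cur = (int) $dec[$i] * 62 + $carry;
            $res = ($cur % 10) . $res;
            $carry = intdiv($cur, 10);
        }
        $dec = ltrim($carry . $res, '0');
        if ($dec === '') {
            $dec = '0';
        }
    }
    return $dec;
}

// 4 characters for the short number + 18 for the long one = 22.
function packPair(string $short, string $long): string
{
    return str_pad(decToBase62($short), 4, '0', STR_PAD_LEFT)
         . str_pad(decToBase62($long), 18, '0', STR_PAD_LEFT);
}

function unpackPair(string $packed): array
{
    return [base62ToDec(substr($packed, 0, 4)), base62ToDec(substr($packed, 4))];
}

$packed = packPair('234536', '201103231043330478311223582826');
echo strlen($packed), "\n"; // 22
```

Base 62 uses only letters and digits, so the result needs no escaping in a URI.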

You can write something like pack() that works with big numbers using bc. Here is my quick solution; it converts your second number into a 13-byte string. Pretty nice!
<?php
$i2 = "201103231043330478311223582826";
function pack_large($i) {
$ret = '';
while(bccomp($i, 0) !== 0) {
$mod = bcmod($i, 256);
$i = bcsub($i, $mod);
$ret .= chr($mod);
$i = bcdiv($i, 256);
}
return $ret;
}
function unpack_large($s) {
$ret = '0';
$len = strlen($s);
for($i = $len - 1; $i >= 0; --$i) {
$add = ord($s[$i]);
$ret = bcmul($ret, 256);
$ret = bcadd($ret, $add);
}
return $ret;
}
var_dump($i2);
var_dump($pack = pack_large($i2));
var_dump(unpack_large($pack));
Sample output :
string(30) "201103231043330478311223582826"
string(13) "jàÙl¹9±̉"
string(47) "201103231043330478311223582826.0000000000000000"
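Since you need URL-friendly characters, use base64_encode on the packed string; this will give you a 20-character string (18 if you remove the padding). Note that the standard Base64 alphabet includes "+" and "/", so replace them with "-" and "_" before putting the value in a URI.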
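A sketch of the URL-safe variant, using the common "-" and "_" substitution and dropping the "=" padding. The two function names are made up for illustration:

```php
<?php
// Hypothetical helpers for URL-safe Base64 without padding.
function base64UrlEncode(string $bin): string
{
    return rtrim(strtr(base64_encode($bin), '+/', '-_'), '=');
}

function base64UrlDecode(string $text): string
{
    // base64_decode() in its default non-strict mode accepts missing padding
    return base64_decode(strtr($text, '-_', '+/'));
}

$packed = str_repeat("\xfb", 13);              // stand-in for a 13-byte packed number
echo strlen(base64UrlEncode($packed)), "\n";   // 18
```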
