I want to get the UCS-2 code points for a given UTF-8 string. For example, the word "hello" should become something like "0068 0065 006C 006C 006F". Note that the characters could be from any language, including complex scripts such as the East Asian languages.
So the problem comes down to "convert a given character to its UCS-2 code point".
But how? Any kind of help will be very much appreciated, since I am in a great hurry.
Transcription of questioner's response posted as an answer
Thanks for your reply, but it needs to be done in PHP v 4 or 5 but not 6.
The string will be a user input, from a form field.
I want to implement a PHP version of utf8to16 or utf8decode like
function get_ucs2_codepoint($char)
{
// calculation of ucs2 codepoint value and assign it to $hex_codepoint
return $hex_codepoint;
}
Can you help me with PHP, and can it be done with the PHP versions mentioned above?
Use an existing utility such as iconv, or whatever libraries come with the language you're using.
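As a minimal sketch of the iconv route in PHP, assuming the iconv extension is available and that the encoding name UCS-2BE is supported by the underlying iconv library (the function name utf8_to_ucs2_hex is made up for illustration):

```php
<?php
// Hypothetical helper: convert a UTF-8 string to space-separated UCS-2 hex units.
function utf8_to_ucs2_hex($utf8)
{
    // Big-endian UCS-2 yields two bytes per character, high byte first.
    // iconv() returns false (and raises a notice) when it cannot convert the input.
    $ucs2 = @iconv('UTF-8', 'UCS-2BE', $utf8);
    if ($ucs2 === false) {
        return false;
    }
    if ($ucs2 === '') {
        return '';
    }
    // Two bytes become four hex digits per character.
    return implode(' ', str_split(strtoupper(bin2hex($ucs2)), 4));
}

echo utf8_to_ucs2_hex('hello'), "\n"; // 0068 0065 006C 006C 006F
```

The false return covers input that iconv refuses to convert; how strictly that is enforced depends on the iconv implementation PHP was built against.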
If you insist on rolling your own solution, read up on the UTF-8 format. Basically, each code point is stored as 1-4 bytes, depending on the value of the code point. The ranges are as follows:
U+0000 – U+007F: 1 byte: 0xxxxxxx
U+0080 – U+07FF: 2 bytes: 110xxxxx 10xxxxxx
U+0800 – U+FFFF: 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
U+10000 – U+10FFFF: 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Where each x is a data bit. Thus, you can tell how many bytes compose each code point by looking at the first byte: if it begins with a 0, it's a 1-byte character. If it begins with 110, it's a 2-byte character. If it begins with 1110, it's a 3-byte character. If it begins with 11110, it's a 4-byte character. If it begins with 10, it's a non-initial byte of a multibyte character. If it begins with 11111, it's an invalid character.
Once you figure out how many bytes are in the character, it's just a matter of bit twiddling. Also note that UCS-2 cannot represent characters above U+FFFF.
Since you didn't specify a language, here's some sample C code (error checking omitted):
wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
{
if(!(utf8[0] & 0x80)) // 0xxxxxxx
return (wchar_t)utf8[0];
else if((utf8[0] & 0xE0) == 0xC0) // 110xxxxx
return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
else if((utf8[0] & 0xF0) == 0xE0) // 1110xxxx
return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
else
return ERROR; // uh-oh, UCS-2 can't handle code points this high
}
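Since the question turned out to be about PHP, here is a rough PHP sketch of the same bit twiddling, applied to a whole string. It assumes well-formed UTF-8 (continuation bytes are not validated), and the function name utf8_to_codepoints_hex is invented for this example:

```php
<?php
// Sketch: decode a UTF-8 string into code points, printed as 4-digit hex.
// Returns false for anything UCS-2 cannot hold (4-byte sequences) or for
// an invalid lead byte or a truncated sequence.
function utf8_to_codepoints_hex($str)
{
    $out = array();
    $len = strlen($str);
    $i = 0;
    while ($i < $len) {
        $b0 = ord($str[$i]);
        if ($b0 < 0x80) {                  // 0xxxxxxx
            $n = 1;
            $cp = $b0;
        } elseif (($b0 & 0xE0) == 0xC0) {  // 110xxxxx
            $n = 2;
            $cp = $b0 & 0x1F;
        } elseif (($b0 & 0xF0) == 0xE0) {  // 1110xxxx
            $n = 3;
            $cp = $b0 & 0x0F;
        } else {
            return false;                  // above U+FFFF, or not a lead byte
        }
        if ($i + $n > $len) {
            return false;                  // string ends mid-sequence
        }
        // Each continuation byte contributes its low six bits.
        for ($k = 1; $k < $n; $k++) {
            $cp = ($cp << 6) | (ord($str[$i + $k]) & 0x3F);
        }
        $out[] = sprintf('%04X', $cp);
        $i += $n;
    }
    return implode(' ', $out);
}

echo utf8_to_codepoints_hex('hello'), "\n"; // 0068 0065 006C 006C 006F
```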
Scott Reynen wrote a function to convert UTF-8 into Unicode. I found it looking at the PHP documentation.
function utf8_to_unicode( $str ) {
$unicode = array();
$values = array();
$lookingFor = 1;
for ($i = 0; $i < strlen( $str ); $i++ ) {
$thisValue = ord( $str[ $i ] );
if ( $thisValue < ord('A') ) {
// exclude 0-9
if ($thisValue >= ord('0') && $thisValue <= ord('9')) {
// number
$unicode[] = chr($thisValue);
}
else {
$unicode[] = '%'.dechex($thisValue);
}
} else {
if ( $thisValue < 128)
$unicode[] = $str[ $i ];
else {
if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;
$values[] = $thisValue;
if ( count( $values ) == $lookingFor ) {
$number = ( $lookingFor == 3 ) ?
( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):
( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );
$number = dechex($number);
$unicode[] = (strlen($number)==3)?"%u0".$number:"%u".$number;
$values = array();
$lookingFor = 1;
} // if
} // if
}
} // for
return implode("",$unicode);
} // utf8_to_unicode
PHP code (which assumes valid utf-8, no check for non-valid utf-8):
function ord_utf8($c) {
$b0 = ord($c[0]);
if ( $b0 < 0x80 ) { // single-byte (ASCII) character
return $b0;
}
$b1 = ord($c[1]);
if ( $b0 < 0xE0 ) {
return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);
}
return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + (ord($c[2]) & 0x3F);
}
I'm amused because I just gave this problem to students on a final exam. Here's a sketch of UTF-8:
hex binary UTF-8 binary
0000-007F 00000000 0abcdefg => 0abcdefg
0080-07FF 00000abc defghijk => 110abcde 10fghijk
0800-FFFF abcdefgh ijklmnop => 1110abcd 10efghij 10klmnop
And here's some C99 code:
static void check(char c) {
if ((c & 0xc0) != 0x80) RAISE(Bad_UTF8); // continuation bytes look like 10xxxxxx
}
uint16_t Utf8_decode(char **p) { // return code point and advance *p
char *s = *p;
if ((s[0] & 0x80) == 0) {
(*p)++;
return s[0];
} else if ((s[0] & 0x40) == 0) {
RAISE (Bad_UTF8);
return ~0; // prevent compiler warning
} else if ((s[0] & 0x20) == 0) { // 110xxxxx: 2-byte sequence
check(s[1]);
(*p) += 2;
return ((s[0] & 0x1f) << 6)
+ ((s[1] & 0x3f));
} else { // 1110xxxx: 3-byte sequence
if ((s[0] & 0xf0) != 0xe0) RAISE (Bad_UTF8);
check(s[1]); check(s[2]);
(*p) += 3;
return ((s[0] & 0x0f) << 12)
+ ((s[1] & 0x3f) << 6)
+ ((s[2] & 0x3f));
}
}
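The table can also be read from left to right, that is, as a recipe for encoding. A small PHP sketch of that direction, limited to the same U+0000 to U+FFFF range (the name codepoint_to_utf8 is hypothetical, and surrogate code points are not rejected here):

```php
<?php
// Sketch: encode one code point (0 to 0xFFFF) as UTF-8, following the table.
function codepoint_to_utf8($cp)
{
    if ($cp < 0 || $cp > 0xFFFF) {
        return false;                           // outside the range the table covers
    }
    if ($cp < 0x80) {                           // 0abcdefg
        return chr($cp);
    }
    if ($cp < 0x800) {                          // 110abcde 10fghijk
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))              // 1110abcd 10efghij 10klmnop
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x20AC)), "\n"; // e282ac
```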
Use mb_ord() in php >= 7.2.
Or this function:
function ord_utf8($c) {
$len = strlen($c);
$code = ord($c);
if($len > 1) {
$code &= 0x7F >> $len;
for($i = 1; $i < $len; $i++) {
$code <<= 6;
$code += ord($c[$i]) & 0x3F;
}
}
return $code;
}
$c is a single character.
If you need to convert a string to a character array, you can use this:
$string = 'abcde';
$string = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
In PHP, the built-in functions don't seem to cover special and new symbols. I mean ALL of them, including the ones released 3 months ago. I'm looking to turn a string with mixed symbols such as:
๐๐๐ ๐ฏ๐ฌ๐ ๐ฐ ๐๐ ฮดฯฑะถ โ
into
๐๐๐ ๐ฏ๐ฌ๐ ๐ฐ ๐๐ ฮดฯฑะถ โ
(which the browser will render the same)
I see this being done on the fly. We're talking countless symbols here. And who knows how many more in the future.
How are they achieving this? No way they really have a 1000+ key array of every single symbol and its entity?
I've hit all the related questions, no luck so far.
This function will convert every character (current and future) excluding [0-9A-Za-z ] to a numeric entity. The UTF-8 character encoding is assumed:
function html_entity_encode_all($s) {
$out = '';
for ($i = 0; isset($s[$i]); $i++) {
// read UTF-8 bytes and decode to a Unicode codepoint value:
$x = ord($s[$i]);
if ($x < 0x80) {
// single byte codepoints
$codepoint = $x;
} else {
// multibyte codepoints
if ($x >= 0xC2 && $x <= 0xDF) {
$codepoint = $x & 0x1F;
$length = 2;
} else if ($x >= 0xE0 && $x <= 0xEF) {
$codepoint = $x & 0x0F;
$length = 3;
} else if ($x >= 0xF0 && $x <= 0xF4) {
$codepoint = $x & 0x07;
$length = 4;
} else {
// invalid byte
$codepoint = 0xFFFD;
$length = 1;
}
// read continuation bytes of multibyte sequences:
for ($j = 1; $j < $length; $j++, $i++) {
if (!isset($s[$i + 1])) {
// invalid: string truncated in middle of multibyte sequence
$codepoint = 0xFFFD;
break;
}
$x = ord($s[$i + 1]);
if (($x & 0xC0) != 0x80) {
// invalid: not a continuation byte
$codepoint = 0xFFFD;
break;
}
$codepoint = ($codepoint << 6) | ($x & 0x3F);
}
if (($codepoint > 0x10FFFF) ||
($length == 2 && $codepoint < 0x80) ||
($length == 3 && $codepoint < 0x800) ||
($length == 4 && $codepoint < 0x10000)) {
// invalid: overlong encoding or out of range
$codepoint = 0xFFFD;
}
}
// have codepoint, now output:
if (($codepoint >= 48 && $codepoint <= 57) ||
($codepoint >= 65 && $codepoint <= 90) ||
($codepoint >= 97 && $codepoint <= 122) ||
($codepoint == 32)) {
// leave plain 0-9, A-Z, a-z, and space unencoded
$out .= $s[$i];
} else {
// all others as numeric entities
$out .= '&#' . $codepoint . ';';
}
}
return $out;
}
For decoding, the standard function html_entity_decode can be used.
How are they achieving this? No way they really have a 1000+ key array of every single symbol and its entity?
They do in fact have a translation table and it does contain all the symbols you have in your question (and the table has more than 1500 entries :) ).
Fiddle
Simple: the encoding doesn't use any special knowledge. The input is a numerical character value, the output is &#<decimal-value>;.
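To illustrate that point, here is a compact sketch that lets PCRE split the string into characters and only does the arithmetic itself. It assumes valid UTF-8 input (preg_replace_callback returns null otherwise); the names utf8_char_to_codepoint and to_numeric_entities are made up for this example:

```php
<?php
// Hypothetical helper: decode one UTF-8 character (1 to 4 bytes) to its code point.
function utf8_char_to_codepoint($ch)
{
    $len = strlen($ch);
    $cp = ord($ch[0]);
    if ($len > 1) {
        $cp &= 0x7F >> $len;   // strip the length-marker bits from the lead byte
        for ($i = 1; $i < $len; $i++) {
            $cp = ($cp << 6) | (ord($ch[$i]) & 0x3F);
        }
    }
    return $cp;
}

// Replace everything except 0-9, A-Z, a-z and space with &#<decimal-value>;
function to_numeric_entities($s)
{
    // The /u flag makes each match one whole UTF-8 character, not one byte.
    return preg_replace_callback('/[^0-9A-Za-z ]/u', function ($m) {
        return '&#' . utf8_char_to_codepoint($m[0]) . ';';
    }, $s);
}

echo to_numeric_entities("a \xCE\xB4"), "\n"; // a &#948;
```

No lookup table is involved, so a symbol added to Unicode next year is handled the same way as one added twenty years ago.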
I was using emojione to convert emoticons, but there is a problem.
When someone uploads an emoticon from a mobile device, it arrives as escaped Unicode like \ud83d\ude0c\ud83d\ude0c\ud83d\ude0c.
emojione doesn't convert this type of code.
Can anybody help me convert this code, or suggest another package I could use?
I have done it at last.
$str = '\ud83d\ude0c\ud83d\ude0c\ud83d\ude0c';
$regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
|\\\u([\da-fA-F]{4})/sx';
echo preg_replace_callback($regex, function($matches) {
if (isset($matches[3])) {
$cp = hexdec($matches[3]);
} else {
$lead = hexdec($matches[1]);
$trail = hexdec($matches[2]);
// http://unicode.org/faq/utf_bom.html#utf16-4
$cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
}
// https://tools.ietf.org/html/rfc3629#section-3
// Characters between U+D800 and U+DFFF are not allowed in UTF-8
if ($cp > 0xD7FF && 0xE000 > $cp) {
$cp = 0xFFFD;
}
// https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
// php_utf32_utf8(unsigned char *buf, unsigned k)
if ($cp < 0x80) {
return chr($cp);
} else if ($cp < 0xA0) {
return chr(0xC0 | $cp >> 6) . chr(0x80 | $cp & 0x3F);
}
return html_entity_decode('&#' . $cp . ';');
}, $str);
output will be:
😌😌😌
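The surrogate arithmetic in the callback is easier to follow when written out on its own. A small sketch, with both directions, using invented function names:

```php
<?php
// Combine a UTF-16 lead/trail surrogate pair into one code point.
function surrogates_to_codepoint($lead, $trail)
{
    return 0x10000 + (($lead - 0xD800) << 10) + ($trail - 0xDC00);
}

// Split a code point above U+FFFF back into its surrogate pair.
function codepoint_to_surrogates($cp)
{
    $v = $cp - 0x10000;                  // 20 bits remain
    return array(0xD800 + ($v >> 10),    // high 10 bits go in the lead surrogate
                 0xDC00 + ($v & 0x3FF)); // low 10 bits go in the trail surrogate
}

printf("U+%X\n", surrogates_to_codepoint(0xD83D, 0xDE0C)); // U+1F60C
```

For \ud83d\ude0c this gives 0x10000 + (0x3D << 10) + 0x20C = 0x1F60C, the same value the one-line formula in the answer produces.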
Is there any way to do two-way encryption/decryption of an integer (or string)?
Please note that I am not looking for encoding.
I need something like this:
crypting (100) --> 24694
crypting (101) --> 9564jh4 or 45216 or gvhjdfT or whatever ...
decrypting (24694) --> 100
I don't need encoding because it's bijective:
base64_encode(100) -->MTAw
base64_encode(101) -->MTAx
I hope I will find a way here to encrypt/decrypt PURE NUMBERS (computers love numbers, it's faster).
function decrypt($string, $key) {
$result = '';
$string = base64_decode($string);
for($i=0; $i<strlen($string); $i++) {
$char = substr($string, $i, 1);
$keychar = substr($key, ($i % strlen($key))-1, 1);
$char = chr(ord($char)-ord($keychar));
$result.=$char;
}
return $result;
}
function encrypt($string, $key) {
$result = '';
for($i=0; $i<strlen($string); $i++) {
$char = substr($string, $i, 1);
$keychar = substr($key, ($i % strlen($key))-1, 1);
$char = chr(ord($char)+ord($keychar));
$result.=$char;
}
return base64_encode($result);
}
Have you tried looking into ROT-13?
More serious answer: from this SO answer, you can use:
function numhash($n) {
return (((0x0000FFFF & $n) << 16) + ((0xFFFF0000 & $n) >> 16));
}
numhash(42); // 2752512
numhash(numhash(42)); // 42
64-bit support, negative number support, and a little bit of security via a salt.
@Petr Cibulka
class NumHash {
private static $SALT = 0xd0c0adbf;
public static function encrypt($n) {
return (PHP_INT_SIZE == 4 ? self::encrypt32($n) : self::encrypt64($n)) ^ self::$SALT;
}
public static function decrypt($n) {
$n ^= self::$SALT;
return PHP_INT_SIZE == 4 ? self::decrypt32($n) : self::decrypt64($n);
}
public static function encrypt32($n) {
return ((0x000000FF & $n) << 24) + (((0xFFFFFF00 & $n) >> 8) & 0x00FFFFFF);
}
public static function decrypt32($n) {
return ((0x00FFFFFF & $n) << 8) + (((0xFF000000 & $n) >> 24) & 0x000000FF);
}
public static function encrypt64($n) {
/*
echo PHP_EOL . $n . PHP_EOL;
printf("n :%20X\n", $n);
printf("<< :%20X\n", (0x000000000000FFFF & $n) << 48);
printf(">> :%20X\n", (0xFFFFFFFFFFFF0000 & $n) >> 16);
printf(">>& :%20X\n", ((0xFFFFFFFFFFFF0000 & $n) >> 16) & 0x0000FFFFFFFFFFFF);
printf("= :%20X\n", ((0x000000000000FFFF & $n) << 48) + (((0xFFFFFFFFFFFF0000 & $n) >> 16) & 0x0000FFFFFFFFFFFF));
/* */
return ((0x000000000000FFFF & $n) << 48) + (((0xFFFFFFFFFFFF0000 & $n) >> 16) & 0x0000FFFFFFFFFFFF);
}
public static function decrypt64($n) {
/*
echo PHP_EOL;
printf("n :%20X\n", $n);
printf("<< :%20X\n", (0x0000FFFFFFFFFFFF & $n) << 16);
printf(">> :%20X\n", (0xFFFF000000000000 & $n) >> 48);
printf(">>& :%20X\n", ((0xFFFF000000000000 & $n) >> 48) & 0x000000000000FFFF);
printf("= :%20X\n", ((0x0000FFFFFFFFFFFF & $n) << 16) + (((0xFFFF000000000000 & $n) >> 48) & 0x000000000000FFFF));
/* */
return ((0x0000FFFFFFFFFFFF & $n) << 16) + (((0xFFFF000000000000 & $n) >> 48) & 0x000000000000FFFF);
}
}
var_dump(NumHash::encrypt(42));
var_dump(NumHash::encrypt(NumHash::encrypt(42)));
var_dump(NumHash::decrypt(NumHash::encrypt(42)));
echo PHP_EOL;
// stability test
var_dump(NumHash::decrypt(NumHash::encrypt(0)));
var_dump(NumHash::decrypt(NumHash::encrypt(-1)));
var_dump(NumHash::decrypt(NumHash::encrypt(210021200651)));
var_dump(NumHash::decrypt(NumHash::encrypt(210042420501)));
Here's the step-by-step output (remove the comment markers around the printf calls to reproduce it):
210042420501
n : 30E780FD15
<< : FD15000000000000
>> : 30E780
>>& : 30E780
= : FD1500000030E780
n : FD1500000030E780
<< : 30E7800000
>> : FFFFFFFFFFFFFD15
>>& : FD15
= : 30E780FD15
int(210042420501)
This may be more than what you are looking for, but I thought it would be fun to construct as an answer. Here is a simple format-preserving encryption which takes any 16-bit number (i.e. from 0 to 65535) and encrypts it to another 16-bit number and back again, based on a 128-bit symmetric key. You can build something like this.
It's deterministic, in that any input always encrypts to the same output with the same key, but for any number n, there is no way to predict the output for n + 1.
# Written in Ruby -- implement in PHP left as an exercise for the reader
require 'openssl'
def encrypt_block(b, k)
cipher = OpenSSL::Cipher.new 'AES-128-ECB'
cipher.encrypt
cipher.key = k
cipher.update(b) + cipher.final
end
def round_key(i, k)
encrypt_block(i.to_s, k)
end
def prf(c, k)
encrypt_block(c.chr, k)[0].ord
end
def encrypt(m, key)
left = (m >> 8) & 0xff
right = m & 0xff
(1..7).each do |i|
copy = right
right = left ^ prf(right, round_key(i, key))
left = copy
end
(left << 8) + right
end
def decrypt(m, key)
left = (m >> 8) & 0xff
right = m & 0xff
(1..7).each do |i|
copy = left
left = right ^ prf(left, round_key(8 - i, key))
right = copy
end
(left << 8) + right
end
key = "0123456789abcdef"
# This shows no fails and no collisions
x = Hash.new
(0..65535).each do |n|
c = encrypt(n, key)
p = decrypt(c, key)
puts "FAIL" if n != p
puts "COLLISION" if x.has_key? c
x[c] = n
end
# Here are some samples
(0..10).each do |n|
c = encrypt(n, key)
p = decrypt(c, key)
puts "#{n} --> #{c}"
end
(0..10).each do
n = rand(65536)
c = encrypt(n, key)
p = decrypt(c, key)
puts "#{n} --> #{c}"
end
Some examples:
0 --> 39031
1 --> 38273
2 --> 54182
3 --> 59129
4 --> 18743
5 --> 7628
6 --> 8978
7 --> 15474
8 --> 49783
9 --> 24614
10 --> 58570
1343 --> 19234
19812 --> 18968
6711 --> 31505
42243 --> 29837
62617 --> 52334
27174 --> 56551
3624 --> 31768
38685 --> 40918
27826 --> 42109
62589 --> 25562
20377 --> 2670
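For anyone who wants to take up the "exercise for the reader", here is one possible PHP sketch of the same Feistel structure. Note that it is not a line-by-line port: the round function uses HMAC-SHA256 from PHP's hash extension instead of AES, so its outputs will not match the Ruby samples above. All function names are invented, and a 16-bit domain is far too small to resist brute force, so treat it as an illustration of the structure only:

```php
<?php
// Round function: maps an 8-bit half to 8 bits. The round number is mixed
// into the HMAC message so that every round behaves differently.
function feistel_prf($half, $round, $key)
{
    $mac = hash_hmac('sha256', chr($round) . chr($half), $key, true);
    return ord($mac[0]);
}

// Encrypt a 16-bit number (0..65535) to another 16-bit number.
function feistel_encrypt($m, $key, $rounds = 7)
{
    $left = ($m >> 8) & 0xFF;
    $right = $m & 0xFF;
    for ($i = 1; $i <= $rounds; $i++) {
        $tmp = $right;
        $right = $left ^ feistel_prf($right, $i, $key);
        $left = $tmp;
    }
    return ($left << 8) | $right;
}

// Undo feistel_encrypt by running the rounds in reverse order.
function feistel_decrypt($c, $key, $rounds = 7)
{
    $left = ($c >> 8) & 0xFF;
    $right = $c & 0xFF;
    for ($i = $rounds; $i >= 1; $i--) {
        $tmp = $left;
        $left = $right ^ feistel_prf($left, $i, $key);
        $right = $tmp;
    }
    return ($left << 8) | $right;
}

$key = '0123456789abcdef';
$c = feistel_encrypt(100, $key);
var_dump(feistel_decrypt($c, $key) === 100); // bool(true)
```

The Feistel construction is invertible no matter what the round function does, which is why swapping the PRF does not break the round trip.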
A simple function that mangles integers while keeping the smaller numbers small (if you need to preserve magnitude):
function switchquartets($n){
return ((0x0000000F & $n) << 4) + ((0x000000F0& $n)>>4)
+ ((0x00000F00 & $n) << 4) + ((0x0000F000& $n)>>4)
+ ((0x000F0000 & $n) << 4) + ((0x00F00000& $n)>>4)
+ ((0x0F000000 & $n) << 4) + ((0xF0000000& $n)>>4);
}
You can simply use 3DES in CBC mode to perform the operation. If you want to accept only values that you've generated, you can add an HMAC to the ciphertext. If the HMAC is not enough, you could rely on the format of the numbers for this particular scheme. If you don't want users to be able to copy values from each other, you can use a random IV.
So basically you store the number as an 8-byte binary value or as an 8-character ASCII string, left-padded with zeros. Then you encrypt that single block. This gives you room for 2^64 numbers (binary) or 10^8 numbers (decimal digits). You can Base64-encode the result, replacing the + and / characters with the URL-safe - and _ characters.
Note that this encryption/decryption is of course bijective (or a permutation, as it is usually called in crypto). That's OK though, as the output is large enough for an attacker to have trouble guessing a value.
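A rough sketch of that scheme in PHP, assuming PHP 7 or later with the openssl extension and an OpenSSL build that still offers the des-ede3-cbc cipher. The names seal_number and open_number, the token layout (IV, ciphertext, MAC) and the use of HMAC-SHA256 are choices made for this example:

```php
<?php
// $encKey: 24 random bytes (3DES key). $macKey: a separate random key.
function seal_number($n, $encKey, $macKey)
{
    if (!is_int($n) || $n < 0 || $n > 99999999) {
        return false;                                     // must fit in 8 decimal digits
    }
    $block = str_pad((string) $n, 8, '0', STR_PAD_LEFT);  // exactly one 8-byte block
    $iv = random_bytes(8);                                // random IV: tokens differ per call
    $ct = openssl_encrypt($block, 'des-ede3-cbc', $encKey,
                          OPENSSL_RAW_DATA | OPENSSL_ZERO_PADDING, $iv);
    if ($ct === false) {
        return false;
    }
    $mac = hash_hmac('sha256', $iv . $ct, $macKey, true); // authenticate IV and ciphertext
    return strtr(base64_encode($iv . $ct . $mac), '+/', '-_');
}

function open_number($token, $encKey, $macKey)
{
    $raw = base64_decode(strtr($token, '-_', '+/'), true);
    if ($raw === false || strlen($raw) !== 8 + 8 + 32) {
        return false;
    }
    $iv = substr($raw, 0, 8);
    $ct = substr($raw, 8, 8);
    $mac = substr($raw, 16);
    // Reject anything we did not generate before attempting decryption.
    if (!hash_equals(hash_hmac('sha256', $iv . $ct, $macKey, true), $mac)) {
        return false;
    }
    $block = openssl_decrypt($ct, 'des-ede3-cbc', $encKey,
                             OPENSSL_RAW_DATA | OPENSSL_ZERO_PADDING, $iv);
    if ($block === false || !preg_match('/^\d{8}$/', $block)) {
        return false;
    }
    return (int) $block;
}
```

Usage would look like $token = seal_number(100, $encKey, $macKey); followed later by open_number($token, $encKey, $macKey), which returns 100 or false.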
method "double square":
function dsCrypt($input,$decrypt=false) {
$o = $s1 = $s2 = array(); // Arrays for: Output, Square1, Square2
// build the base array with the symbol set
$basea = array('?','(','@',';','$','#',"]","&",'*'); // base symbol set
$basea = array_merge($basea, range('a','z'), range('A','Z'), range(0,9) );
$basea = array_merge($basea, array('!',')','_','+','|','%','/','[','.',' ') );
$dimension=9; // of squares
for($i=0;$i<$dimension;$i++) { // create Squares
for($j=0;$j<$dimension;$j++) {
$s1[$i][$j] = $basea[$i*$dimension+$j];
$s2[$i][$j] = str_rot13($basea[($dimension*$dimension-1) - ($i*$dimension+$j)]);
}
}
unset($basea);
$m = floor(strlen($input)/2)*2; // !strlen%2
$symbl = $m==strlen($input) ? '':$input[strlen($input)-1]; // last symbol (unpaired)
$al = array();
// crypt/uncrypt pairs of symbols
for ($ii=0; $ii<$m; $ii+=2) {
$symb1 = $symbn1 = strval($input[$ii]);
$symb2 = $symbn2 = strval($input[$ii+1]);
$a1 = $a2 = array();
for($i=0;$i<$dimension;$i++) { // search symbols in Squares
for($j=0;$j<$dimension;$j++) {
if ($decrypt) {
if ($symb1===strval($s2[$i][$j]) ) $a1=array($i,$j);
if ($symb2===strval($s1[$i][$j]) ) $a2=array($i,$j);
if (!empty($symbl) && $symbl===strval($s2[$i][$j])) $al=array($i,$j);
}
else {
if ($symb1===strval($s1[$i][$j]) ) $a1=array($i,$j);
if ($symb2===strval($s2[$i][$j]) ) $a2=array($i,$j);
if (!empty($symbl) && $symbl===strval($s1[$i][$j])) $al=array($i,$j);
}
}
}
if (sizeof($a1) && sizeof($a2)) {
$symbn1 = $decrypt ? $s1[$a1[0]][$a2[1]] : $s2[$a1[0]][$a2[1]];
$symbn2 = $decrypt ? $s2[$a2[0]][$a1[1]] : $s1[$a2[0]][$a1[1]];
}
$o[] = $symbn1.$symbn2;
}
if (!empty($symbl) && sizeof($al)) // last symbol
$o[] = $decrypt ? $s1[$al[1]][$al[0]] : $s2[$al[1]][$al[0]];
return implode('',$o);
}
echo dsCrypt('586851105743');
echo '<br />'.dsCrypt('tdtevmdrsdoc', 1);
I need to validate some user input that is encoded in UTF-8. Many have recommended using the following code:
preg_match('/\A(
[\x09\x0A\x0D\x20-\x7E]
| [\xC2-\xDF][\x80-\xBF]
| \xE0[\xA0-\xBF][\x80-\xBF]
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| \xF0[\x90-\xBF][\x80-\xBF]{2}
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/x', $string);
It's a regular expression taken from http://www.w3.org/International/questions/qa-forms-utf-8 . Everything was ok until I discovered a bug in PHP that seems to have been around at least since 2006. Preg_match() causes a seg fault if the $string is too long. There doesn't seem to be any workaround. You can view the bug submission here: http://bugs.php.net/bug.php?id=36463
Now, to avoid using preg_match I've created a function that does the exact same thing as the regular expression above. I don't know if this question is appropriate here at Stack Overflow, but I would like to know if the function I've made is correct. Here it is:
EDIT [13.01.2010]:
If anyone is interested, there were several bugs in the previous version I've posted. Below is the final version of my function.
function check_UTF8_string(&$string) {
$len = mb_strlen($string, "ISO-8859-1");
$ok = 1;
for ($i = 0; $i < $len; $i++) {
$o = ord(mb_substr($string, $i, 1, "ISO-8859-1"));
if ($o == 9 || $o == 10 || $o == 13 || ($o >= 32 && $o <= 126)) {
}
elseif ($o >= 194 && $o <= 223) {
$i++;
$o2 = ord(mb_substr($string, $i, 1, "ISO-8859-1"));
if (!($o2 >= 128 && $o2 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o == 224) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$i += 2;
if (!($o2 >= 160 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
$ok = 0;
break;
}
}
elseif (($o >= 225 && $o <= 236) || $o == 238 || $o == 239) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$i += 2;
if (!($o2 >= 128 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o == 237) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$i += 2;
if (!($o2 >= 128 && $o2 <= 159) || !($o3 >= 128 && $o3 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o == 240) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
$i += 3;
if (!($o2 >= 144 && $o2 <= 191) ||
!($o3 >= 128 && $o3 <= 191) ||
!($o4 >= 128 && $o4 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o >= 241 && $o <= 243) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
$i += 3;
if (!($o2 >= 128 && $o2 <= 191) ||
!($o3 >= 128 && $o3 <= 191) ||
!($o4 >= 128 && $o4 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o == 244) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
$i += 3;
if (!($o2 >= 128 && $o2 <= 143) ||
!($o3 >= 128 && $o3 <= 191) ||
!($o4 >= 128 && $o4 <= 191)) {
$ok = 0;
break;
}
}
else {
$ok = 0;
break;
}
}
return $ok;
}
Yes, it's very long. I hope I've understood correctly how that regular expression works. Also hope it will be of help to others.
Thanks in advance!
You can always use the Multibyte String Functions:
If you want to use it a lot and possibly change the encoding at some point:
1) First set the encoding you want to use in your config file
/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");
2) Check the String
if(mb_check_encoding($string))
{
// do something
}
Or, if you don't plan on changing it, you can always just put the encoding straight into the function:
if(mb_check_encoding($string, 'UTF-8'))
{
// do something
}
Given that there is still no explicit isUtf8() function in PHP, here's how UTF-8 can be accurately validated in PHP depending on your PHP version.
The easiest and most backwards-compatible way to properly validate UTF-8 is still a regular expression, using a function such as:
function isValid($string)
{
return preg_match(
'/\A(?>
[\x00-\x7F]+ # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x',
$string
) === 1;
}
Note the two key differences from the regular expression offered by the W3C: it uses a once-only (atomic) subpattern, and it has a '+' quantifier after the first character class. The problem of PCRE crashing still persists, but most of it is caused by repeating a capturing subpattern. Turning the pattern into a once-only pattern and matching runs of single-byte characters in a single subpattern should keep PCRE from quickly running out of stack (and causing a segfault). Unless you're validating strings with lots of multibyte characters (in the range of thousands), this regular expression should serve you well.
Another good alternative is using mb_check_encoding() if you have the mbstring extension available. Validating UTF-8 can be done as simply as:
function isValid($string)
{
return mb_check_encoding($string, 'UTF-8') === true;
}
Note, however, that if you're using a PHP version prior to 5.4.0, this function has some flaws in its validation:
Prior to 5.4.0 the function accepts code point beyond allowed Unicode range. This means it also allows 5 and 6 byte UTF-8 characters.
Prior to 5.3.0 the function accepts surrogate code points as valid UTF-8 characters.
Prior to 5.2.5 the function is completely unusable due to not working as intended.
As the internet also lists numerous other ways to validate UTF-8, I will discuss some of them here. Note that the following should be avoided in most cases.
Use of mb_detect_encoding() is sometimes seen to validate UTF-8. If you have at least PHP version 5.4.0, it does actually work with the strict parameter via:
function isValid($string)
{
return mb_detect_encoding($string, 'UTF-8', true) === 'UTF-8';
}
It is very important to understand that this does not work prior to 5.4.0. It's very flawed prior to that version, since it only checks for invalid sequences but allows overlong sequences and invalid code points. In addition, you should never use it for this purpose without the strict parameter set to true (it does not actually do validation without the strict parameter).
One nifty way to validate UTF-8 is via the use of 'u' flag in PCRE. Though poorly documented, it also validates the subject string. An example could be:
function isValid($string)
{
return preg_match('//u', $string) === 1;
}
Every string should match an empty pattern, but with the 'u' flag the match only succeeds on valid UTF-8 strings. However, unless you're using at least PHP 5.5.10, the validation is flawed as follows:
Prior to 5.5.10, it does not recognize 3 and 4 byte sequences as valid UTF-8. As it excludes most of unicode code points, this is pretty major flaw.
Prior to 5.2.5 it also allows surrogates and code points beyond allowed unicode space (e.g. 5 and 6 byte characters)
Using the 'u' flag behavior does have one advantage though: It's the fastest of the discussed methods. If you need speed and you're running the latest and greatest PHP version, this validation method might be for you.
One additional way to validate for UTF-8 is via json_encode(), which expects input strings to be in UTF-8. It does not work prior to 5.5.0, but after that, invalid sequences return false instead of a string. For example:
function isValid($string)
{
return json_encode($string) !== false;
}
I would not recommend relying on this behavior to last, however. Previous PHP versions simply produced an error on invalid sequences, so there is no guarantee that the current behavior is final.
You should be able to use iconv to check for validity. Just try and convert it to UTF-16 and see if you get an error.
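A minimal sketch of that idea, assuming the iconv extension is loaded (the function name is_utf8_via_iconv is made up). How strict the check is depends on the iconv implementation PHP was built against, so it is worth testing against the edge cases you care about:

```php
<?php
// Returns true when iconv can convert the string from UTF-8 to UTF-16.
function is_utf8_via_iconv($string)
{
    // iconv() returns false and raises a notice on an illegal input sequence;
    // the @ operator suppresses the notice so only the return value matters.
    return @iconv('UTF-8', 'UTF-16', $string) !== false;
}

var_dump(is_utf8_via_iconv("h\xC3\xA9llo")); // valid two-byte sequence
var_dump(is_utf8_via_iconv("\xC3\x28"));     // lead byte followed by a non-continuation byte
```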
Have you tried ereg() instead of preg_match? Perhaps this one doesn't have that bug, and you don't need a potentially buggy workaround.
Here is a string-function based solution:
http://www.php.net/manual/en/function.mb-detect-encoding.php#85294
<?php
function is_utf8($str) {
$c=0; $b=0;
$bits=0;
$len=strlen($str);
for($i=0; $i<$len; $i++){
$c=ord($str[$i]);
if($c >= 128){
if(($c >= 254)) return false;
elseif($c >= 252) $bits=6;
elseif($c >= 248) $bits=5;
elseif($c >= 240) $bits=4;
elseif($c >= 224) $bits=3;
elseif($c >= 192) $bits=2;
else return false;
if(($i+$bits) > $len) return false;
while($bits > 1){
$i++;
$b=ord($str[$i]);
if($b < 128 || $b > 191) return false;
$bits--;
}
}
}
return true;
}
?>