Check if string is in the BMP range


I was searching for a proper way in PHP to detect whether a string is entirely within the BMP (Basic Multilingual Plane), but I found nothing. Even mb_check_encoding and mb_detect_encoding do not offer any help in this particular case.
So I wrote my own code:
<?php
function is_bmp($string) {
    $str_ar = mb_str_split($string);
    foreach ($str_ar as $char) {
        /* Check whether any character's code point is outside the BMP range */
        if (mb_ord($char) > 0xFFFF) {
            return false;
        }
    }
    return true;
}
/* String containing a non-BMP Unicode character */
$string = '😈blah blah';
var_dump(is_bmp($string));
?>
Outputs:
bool(false)
Now my question is:
Is there a better approach, and does this one have any flaws?

If you have a correctly UTF-8 encoded input string, you can just check its bytes to find out whether it contains symbols outside the BMP.
Literally, you need to detect whether the input string contains any symbol whose code point is greater than 0xFFFF (i.e. needs more than 16 bits).
A note on how UTF-8 encoding works:
Code points 0 through 0x7F are encoded as is, in one byte.
All other code points are encoded as multi-byte sequences: a first byte in the range 0xC0...0xFF, which also encodes how many additional bytes follow, and then additional bytes in the range 0x80...0xBF.
To encode code points 0x10000 and greater, UTF-8 requires a sequence of 4 bytes, and the first byte of that sequence will be 0xF0 or greater. In all other cases the whole string will contain only bytes less than 0xF0.
In short, your task is just to find out whether the binary representation of the string contains any byte in the range 0xF0...0xFF.
function is_bmp($string) {
    // BMP-only means that no byte in the range 0xF0...0xFF occurs
    return preg_match('#[\xF0-\xFF]#', $string) === 0;
}
OR
even simpler (but probably slower), you can use PCRE's ability to work with UTF-8 sequences (see option PCRE_UTF8):
function is_bmp($string) {
    // BMP-only means that no code point above U+FFFF occurs
    return preg_match('#[^\x00-\x{FFFF}]#u', $string) === 0;
}
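Both variants above rely on the input already being valid UTF-8. As a minimal sketch, assuming the mbstring extension is loaded and PHP 7.0 or later (for the "\u{...}" escapes), the validity check and the lead-byte scan could be combined as follows. The function name is_valid_bmp_utf8 is made up for this illustration and is not part of PHP:

```php
<?php
// Hypothetical helper (name invented for this sketch, not a PHP built-in).
// Returns true only if $s is valid UTF-8 AND contains no code point above U+FFFF.
function is_valid_bmp_utf8(string $s): bool
{
    // The byte scan below is only meaningful for well-formed UTF-8,
    // so malformed input is rejected first.
    if (!mb_check_encoding($s, 'UTF-8')) {
        return false;
    }
    // In valid UTF-8, lead bytes 0xF0..0xF4 start 4-byte sequences,
    // i.e. code points U+10000 and above.
    return preg_match('/[\xF0-\xF4]/', $s) === 0;
}

echo bin2hex("\u{20AC}"), "\n";   // e282ac   (euro sign, 3 bytes, inside the BMP)
echo bin2hex("\u{1F608}"), "\n";  // f09f9888 (emoji, 4 bytes, lead byte 0xF0)

var_dump(is_valid_bmp_utf8("\u{20AC} blah"));   // bool(true)
var_dump(is_valid_bmp_utf8("\u{1F608} blah"));  // bool(false)
var_dump(is_valid_bmp_utf8("\xFF"));            // bool(false): not UTF-8 at all
```

Whether malformed input should count as "not BMP" or be reported separately is a design choice; this sketch simply returns false for it.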

var_dump(
!preg_match('/[^\x0-\x{ffff}]/u', '😈blah blah')
);

Related

PHP Unicode to character conversion

I receive country names like from a library: "\u00c3\u0096sterreich".
How do I convert this to Österreich?
Using PHP 7.3

This one is a lot trickier than it seems, but the code below appears to work.
First we pipe it through the standard regex for Unicode escape sequences, then pack each match as a binary string, convert the encoding and finally decode. I cannot promise this is the best way to do this, but it appears to be working correctly as far as I can tell.
$str = '\u00c3\u0096sterreich';
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    // Note: utf8_decode() is deprecated as of PHP 8.2
    return utf8_decode(mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE'));
}, $str);

The Unicode code point for the character "Ö" is U+00D6.
In UTF-8 this character is encoded as the two bytes c3 and 96 (hex).
Representing these two bytes as \u00c3\u0096 is a bit strange. Provided that every multibyte character is represented byte for byte in this way, the following code can also be used.
$str = '\u00c3\u0096sterreich';
$str = preg_replace_callback(
    '~\\\\u00([0-9a-f]{2})~i',
    function ($m) {
        return hex2bin($m[1]);
    },
    $str
);
// Test
$expect = "Österreich";
var_dump($str === $expect); // bool(true)

In case anyone else ends up here with a similar issue, I thought I'd try and shed some light on what's going on. Because, as mentioned, this is a lot more complicated than it might look.
A string like \u00c3 refers to a Unicode code point, in hexadecimal. Ö in the Unicode table is character 214, or \u00d6.
The 214 here is not directly related to how Ö is actually stored in any particular encoding (UTF-8, UTF-16, etc.); it's just an abstract number in the overall Unicode table that refers to that character. UTF-8, for instance, will store it in the two bytes 11000011 10010110 (195 150 in decimal, c3 96 in hexadecimal). There's a really good explanation of how this works in this answer, if you're interested in the finer details.
What appears to have happened in your string is that these two bytes have then been encoded back into hexadecimal, and returned as two separate Unicode code points. \u00c3 is Ã, and \u0096 is a control character. This is why any standard methods of decoding this (json_decode, etc.) won't have worked: ultimately, what you have is not a valid representation of the string Österreich.
The other answers should both work perfectly well, but this code snippet might better illustrate the issue with the format your library is using. It specifically matches two consecutive low Unicode code points, recombines their values into an unsigned two-byte integer, and then returns that integer packed as two raw bytes.
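$str = '\u00c3\u0096sterreich';
echo preg_replace_callback('/\\\\u00([0-9a-fA-F]{2})\\\\u00([0-9a-fA-F]{2})/', function ($match) {
    // Recombine the two byte values into one 16-bit integer
    $i = (hexdec($match[1]) << 8) + hexdec($match[2]);
    // 'n' = unsigned 16-bit big-endian, i.e. exactly two bytes ('N' would emit four)
    return pack('n', $i);
}, $str);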
Österreich
See https://3v4l.org/QtUuGD
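To see the double encoding described above in action, here is a small sketch, assuming the mbstring and json extensions are available and that the input contains nothing that would break a JSON string literal (no quotes, no stray backslashes). It first decodes the escapes the "standard" way, which yields the mojibake, and then maps each code point U+0000..U+00FF back to the single byte with the same value:

```php
<?php
$raw = '\u00c3\u0096sterreich';

// Step 1: decode the escapes as code points. This gives U+00C3 U+0096 "sterreich",
// stored by PHP as UTF-8 (c3 83 c2 96 ...), which is the mojibake form.
$mojibake = json_decode('"' . $raw . '"');
echo bin2hex(substr($mojibake, 0, 4)), "\n"; // c383c296

// Step 2: converting UTF-8 to ISO-8859-1 turns each code point below U+0100
// into one byte of the same value, restoring the original UTF-8 bytes c3 96.
$fixed = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');
echo bin2hex(substr($fixed, 0, 2)), "\n";    // c396

var_dump($fixed === "\xC3\x96sterreich");    // bool(true)
```

This only works because every escaped value in the input is a single byte (\u00XX); a genuinely encoded code point above U+00FF would not survive step 2.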

Javascript hexadecimal to binary using UTF8

I have data stored in an SQLite database as BINARY(16), the value of which is determined by PHP's hex2bin function on a 32-character hexadecimal string.
As an example, the string 434e405b823445c09cb6c359fb1b7918 returns CN#[4EÀ¶ÃYûy.
The data stored in this database needs to be manipulated by JavaScript, and to do so I've used the following function (adapted from Andris's answer here):
// Convert hexadecimal to binary string
String.prototype.hex2bin = function ()
{
    // Define the variables
    var i = 0, l = this.length - 1, bytes = []
    // Iterate over the nibbles and convert to binary string
    for (i; i < l; i += 2)
    {
        bytes.push(parseInt(this.substr(i, 2), 16))
    }
    // Return the binary string
    return String.fromCharCode.apply(String, bytes)
}
This works as expected, returning CN#[4EÀ¶ÃYûy from 434e405b823445c09cb6c359fb1b7918.
The problem I have, however, is that when dealing directly with the data returned by PHP's hex2bin function I am given the string CN#[�4E����Y�y rather than CN#[4EÀ¶ÃYûy. This is making it impossible for me to work between the two (for context, JavaScript is being used to power an offline iPad app that works with data retrieved from a PHP web app) as I need to be able to use JavaScript to generate a 32-character hexadecimal string, convert it to a binary string, and have it work with PHP's hex2bin function (and SQLite's HEX function).
The issue, I believe, is that JavaScript uses UTF-16 whereas the binary string is stored as utf8_unicode_ci. My initial thought, then, was that I need to convert the string to UTF-8. A Google search led me to here and searching StackOverflow led me to bobince's answer here, both of which recommend using unescape(encodeURIComponent(str)). However, this does not return what I need (CN#[�4E����Y�y):
// CN#[Â4EÃÂÂ¶ÃYÃ»y
unescape(encodeURIComponent('434e405b823445c09cb6c359fb1b7918'.hex2bin()))
My question, then, is:
How can I use JavaScript to convert a hexadecimal string into a UTF-8 binary string?

Given a hex-encoded UTF-8 string, hex,
hex.replace(/../g, '%$&')
will produce a URI-encoded UTF-8 string.
decodeURIComponent converts URI-encoded UTF-8 sequences into JavaScript UTF-16 encoded strings, so
decodeURIComponent(hex.replace(/../g, '%$&'))
should decode a properly hex-encoded UTF-8 string.
You can see that it works by applying it to the example from the hex2bin documentation.
alert(decodeURIComponent('6578616d706c65206865782064617461'.replace(/../g, '%$&')));
// alerts "example hex data"
The string you gave is not UTF-8 encoded, though. Specifically,
434e405b823445c09cb6c359fb1b7918
        ^
82 is a continuation byte, so it must follow a byte with at least the first two bits set, and 5b is not such a byte.
RFC 2279 explains:
The table below summarizes the format of these different octet types. The letter x indicates bits available for encoding bits of the UCS-4 character value.

UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
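Since the data in the question originates from PHP's hex2bin, the same claim can be checked on the PHP side. This is only a sketch, assuming the mbstring extension is loaded; the helper name hex_is_utf8 is invented for the example:

```php
<?php
// Hypothetical helper: does this hex string decode to well-formed UTF-8?
function hex_is_utf8(string $hex): bool
{
    $bytes = hex2bin($hex);          // false if $hex is not valid hexadecimal
    return $bytes !== false && mb_check_encoding($bytes, 'UTF-8');
}

$samples = [
    '6578616d706c65206865782064617461', // "example hex data"
    '434e405b823445c09cb6c359fb1b7918', // the 16 bytes from the question
];

foreach ($samples as $hex) {
    echo $hex, ' => ', hex_is_utf8($hex) ? 'valid UTF-8' : 'not valid UTF-8', "\n";
}
// 6578616d706c65206865782064617461 => valid UTF-8
// 434e405b823445c09cb6c359fb1b7918 => not valid UTF-8
```

The second value is arbitrary binary data rather than text, which is why treating it as a UTF-8 string anywhere in the pipeline produces replacement characters.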

Your applications don't have to handle binary at any point. Insertion is the latest possible point, so that is where you finally convert to binary. Selection is the earliest possible point, so that is where you convert back to hex. Use hex strings throughout the application.
When inserting, you can replace UNHEX with blob literals:
INSERT INTO table (id)
VALUES (X'434e405b823445c09cb6c359fb1b7918')
When selecting, you can use HEX:
SELECT HEX(id) FROM table

Expanding on Mike's answer, here's some code for encoding and decoding.
Note that the escape/unescape() functions are deprecated. If you need polyfills for them, you can check out the more comprehensive UTF-8 encoding example found here: http://jsfiddle.net/47zwb41o
// UTF-8 to hex
var utf8ToHex = function( s ){
    s = unescape( encodeURIComponent( s ) );
    var chr, i = 0, l = s.length, out = '';
    for( ; i < l; i++ ){
        chr = s.charCodeAt( i ).toString( 16 );
        out += ( chr.length % 2 == 0 ) ? chr : '0' + chr;
    }
    return out;
};
// Hex to UTF-8
var hexToUtf8 = function( s ){
    return decodeURIComponent( s.replace( /../g, '%$&' ) );
};

php unicode 16 bit

How can I append a 16-bit Unicode character to a string in PHP?
$test = "testing" . (U + 199F);
From what I can see, \x only takes 8-bit characters, i.e. ASCII.

From the manual:
PHP only supports a 256-character set, and hence does not offer native Unicode support.
You could enter a manually-encoded UTF-8 sequence, I suppose.
You can also type out UCS4 as byte sequence and use iconv("UTF-32LE", "UTF-8", $str); to convert it into UTF-8 for further processing. You just can't input the codepoint as a 32-bit code unit in one go.

Unicode characters don't directly exist in PHP(*), but you can deal with strings containing bytes that represent characters in UTF-8 encoding. Here's one way of converting a numeric character code point to UTF-8:
function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
$test = 'testing' . unichr(0x199F);
(*: and ‘16-bit’ Unicode characters don't exist at all; Unicode has code points way beyond U+FFFF. There are 16-bit ‘code units’ in UTF-16, but that's an ugly encoding you're unlikely to meet in PHP.)
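On newer PHP versions the same result can be had without a custom helper. The sketch below assumes PHP 7.2 or later with the mbstring extension (for mb_chr) and uses the "\u{...}" string escape available since PHP 7.0; the variable names are just for illustration:

```php
<?php
// mb_chr() turns a code point number into a string in the given encoding.
$viaFunction = 'testing' . mb_chr(0x199F, 'UTF-8');

// Double-quoted strings accept \u{...} escapes and emit UTF-8 bytes.
$viaEscape = "testing\u{199F}";

echo bin2hex(mb_chr(0x199F, 'UTF-8')), "\n"; // e1a69f (three UTF-8 bytes)
var_dump($viaFunction === $viaEscape);       // bool(true)
```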

Because a UTF-8 character is just a sequence of bytes and PHP strings are byte strings, you can create a multibyte character by writing out its UTF-8 bytes one by one (U+199F is E1 A6 9F in UTF-8) :)
$test = "testing\xE1\xA6\x9F";

Try (requires PHP 7.0 or later, which added the \u{...} escape in double-quoted strings):
$test = "testing" . "\u{199F}";

PHP encoding to only have letters and numbers

Is there an encoding function in PHP which will encode strings so that the resulting output contains only letters and numbers? I would use base64, but its output still contains some characters that are not alphanumeric.

You could use base32 (code easy to google), which is sort of a standard alternative to base64. Or resort to bin2hex() and pack("H*",$hex) to reverse. Hex encoding however leads to size doubling.

Short answer is no. base64 uses a reduced set of output chars compared with uuencode and was intended to solve most character conversion issues, but it still isn't URL-safe (IIRC).
But the mechanism is trivial and easily adapted. I'd suggest having a look at base32 encoding: it works the same way as base64 but carries 5 bits per output char instead of 6 (and hence a 32-char alphabet is all that's required). You would want something different for the padding char, though ('=' is not URL-safe).
A quick google found this
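For reference, a base32 variant along those lines can be sketched in a few lines of PHP. This is an illustration only, not the linked code: the function names are made up, it uses the RFC 4648 alphabet (A-Z and 2-7), and it simply leaves out the '=' padding so that the output is strictly letters and digits:

```php
<?php
// Hypothetical helpers for illustration; RFC 4648 alphabet, '=' padding omitted.
function base32_encode_nopad(string $data): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';
    // Build one long bit string, 8 bits per input byte.
    $bits = '';
    for ($i = 0, $n = strlen($data); $i < $n; $i++) {
        $bits .= str_pad(decbin(ord($data[$i])), 8, '0', STR_PAD_LEFT);
    }
    // Emit one alphabet character per group of 5 bits.
    $out = '';
    for ($i = 0, $n = strlen($bits); $i < $n; $i += 5) {
        // The last group may be short; pad it with zero bits on the right.
        $group = str_pad(substr($bits, $i, 5), 5, '0', STR_PAD_RIGHT);
        $out .= $alphabet[bindec($group)];
    }
    return $out;
}

function base32_decode_nopad(string $text): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';
    $bits = '';
    for ($i = 0, $n = strlen($text); $i < $n; $i++) {
        $pos = strpos($alphabet, $text[$i]);
        if ($pos === false) {
            throw new InvalidArgumentException('Not a base32 character: ' . $text[$i]);
        }
        $bits .= str_pad(decbin($pos), 5, '0', STR_PAD_LEFT);
    }
    // Only whole bytes count; leftover bits are the zero padding from encoding.
    $out = '';
    for ($i = 0, $n = intdiv(strlen($bits), 8) * 8; $i < $n; $i += 8) {
        $out .= chr(bindec(substr($bits, $i, 8)));
    }
    return $out;
}

$encoded = base32_encode_nopad('foobar');
echo $encoded, "\n";                       // MZXW6YTBOI
echo base32_decode_nopad($encoded), "\n";  // foobar
```

The output is 8 characters for every 5 input bytes (60% larger), compared with a doubling for bin2hex().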

Any of the hash functions (md5, sha1, etc.) output will only consist of hexadecimal digits but that's not exactly 'encoding'.

You could write your own base-62 encoder/decoder using a-z/A-Z/0-9. If you encode each byte separately you'd need 2 digits for every byte though (62² = 3844 ≥ 256), so it's no more compact than hex.

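I wrote this to use letters, numbers and dashes.
I'm sure you can improve it to take out the dashes:
function pj_code($str) {
    $enc = '';
    $len = strlen($str);
    // Walk the string backwards; pj_decode() reverses the order again
    while ($len--) {
        $enc .= base_convert((string) ord($str[$len]), 10, 36) . '-';
    }
    // Drop the trailing dash so decoding does not see an empty last item
    return rtrim($enc, '-');
}
function pj_decode($str) {
    if ($str === '') {
        return '';
    }
    $dec = '';
    $ords = explode('-', $str);
    $c = count($ords);
    while ($c--) {
        $dec .= chr((int) base_convert($ords[$c], 36, 10));
    }
    return $dec;
}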

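You can use the basic md5 hash function, which outputs only alphanumeric characters (hexadecimal digits). Note that a hash is one-way, so the original string cannot be recovered from it.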

filesize from a String

How can I get the "filesize" of a string in PHP?
I put the string in a MySQL database as a blob and I need to store the size of the blob. My solution was to create a temp file and put the string into the temp file. Now I can get the filesize of the "string", but that solution is not good...
Greetings

It depends. If you have mbstring function overloading enabled, the only call that will work will be mb_strlen($string, '8bit');. If it's not enabled, strlen($string) will work fine as well.
So, you can handle both cases like this:
if (function_exists('mb_strlen')) {
    $size = mb_strlen($string, '8bit');
} else {
    $size = strlen($string);
}

SELECT length(field) FROM table
From the MySQL docs:
LENGTH(str)
Returns the length of the string str, measured in bytes. A multi-byte character counts as multiple bytes. This means that for a string containing five two-byte characters, LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.

strlen()
before putting it into MySQL, or in SQL:
LENGTH()
Note that the length can vary depending on the character set. If you want the real length in bytes, use strlen(); if you want the character count, use mb_strlen() (if you have UTF-8 encoding, for example).
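The difference between byte length and character count is easy to see with a short sketch, assuming PHP 7.0 or later (for the "\u{...}" escape) and the mbstring extension; the sample string is arbitrary:

```php
<?php
// 'a' followed by U+25A0 BLACK SQUARE, which takes three bytes in UTF-8.
$s = "a\u{25A0}";

echo strlen($s), "\n";              // 4  (bytes: what ends up in the blob)
echo mb_strlen($s, 'UTF-8'), "\n";  // 2  (characters)
echo mb_strlen($s, '8bit'), "\n";   // 4  (bytes again, via mbstring)
```

For the blob size in the question, the byte count is the relevant number.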

If all you are storing is the string, then the size is simply the length of your string in bytes. strlen($string) already counts bytes rather than characters, so there is no need to multiply it by the number of bytes per character in the charset.

strlen($string) is the simplest way to get the size of a string in bytes.
strlen() doesn't actually return the number of characters in a string, but the number of bytes in it.
Example:
echo(strlen('a■')); will print 4 (in a UTF-8 source file), because the black square character is made of 3 bytes, and the 'a' character is made of one.

Use mb_strlen(), as then you can tell it which encoding to assume; passing '8bit' as the encoding gives you the size of the string in bytes.
