PHP functions question - php

I'm fairly new to PHP functions and I really don't know what the functions below do. Can someone give an explanation or a working example of them? Thanks.
PHP functions.
function mbStringToArray ($str) {
if (empty($str)) return false;
$len = mb_strlen($str);
$array = array();
for ($i = 0; $i < $len; $i++) {
$array[] = mb_substr($str, $i, 1);
}
return $array;
}
function mb_chunk_split($str, $len, $glue) {
if (empty($str)) return false;
$array = mbStringToArray ($str);
$n = 0;
$new = '';
foreach ($array as $char) {
if ($n < $len) $new .= $char;
elseif ($n == $len) {
$new .= $glue . $char;
$n = 0;
}
$n++;
}
return $new;
}

The first function takes a multibyte string and converts it into an array of characters, returning the array.
The second function takes a multibyte string and inserts the $glue string every $len characters.

function mbStringToArray ($str) { // $str is the function argument
    if (empty($str)) return false; // bail out on an empty value; note that empty() also treats "0", 0, null and false as empty
    $len = mb_strlen($str); // number of characters (not bytes) in a multibyte string (e.g. UTF-8)
    $array = array(); // initialise the result array
    for ($i = 0; $i < $len; $i++) { // loop over every character position
        $array[] = mb_substr($str, $i, 1); // mb_substr() extracts one character from $str at position $i on each pass
    }
    return $array; // returns the result as an array
}
That should help you understand the second function.
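As a rough illustration of how the second function behaves, here is a minimal sketch. It is not the original code: the name chunk_split_utf8 is hypothetical, it assumes valid UTF-8 input, and it splits with preg_split and the 'u' modifier instead of the mb_* loop.

```php
<?php
// Hypothetical helper, not the original mb_chunk_split(): same idea,
// written with preg_split() and array_chunk(). Assumes UTF-8 input.
function chunk_split_utf8(string $str, int $len, string $glue)
{
    if ($str === '' || $len < 1) {
        return false; // nothing to split, mirroring the original's early exit
    }
    // Split into characters (code points), not bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // preg_split() fails on invalid UTF-8
    }
    // Group the characters $len at a time and join each group into a string.
    $groups = array_map(function (array $group) {
        return implode('', $group);
    }, array_chunk($chars, $len));
    // Put $glue between the groups; none is added after the last one.
    return implode($glue, $groups);
}

echo chunk_split_utf8('主楼怎么走', 2, '-'), PHP_EOL; // 主楼-怎么-走
```

Like the function in the question, this inserts the glue only between groups, so the result never ends with it.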

Related

PHP: Split multibyte string (word) into separate characters

Trying to split this string "主楼怎么走" into separate characters (I need an array) using mb_split with no luck... Any suggestions?
Thank you!
Try a regular expression with the 'u' option, for example:
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
An ugly way to do it is:
mb_internal_encoding("UTF-8"); // this IS A MUST!! PHP has trouble with multibyte
// when no internal encoding is set!
$string = ".....";
$chars = array();
for ($i = 0; $i < mb_strlen($string); $i++ ) {
$chars[] = mb_substr($string, $i, 1); // only one char to go to the array
}
You should also try your approach with mb_split after setting the internal encoding first.
You can use the grapheme functions (PHP 5.3 or intl 1.0) and IntlBreakIterator (PHP 5.5 or intl 3.0). The following code shows the difference between the intl, mbstring and PCRE functions.
// http://www.php.net/manual/function.grapheme-strlen.php
$string = "a\xCC\x8A" // 'a' + COMBINING RING ABOVE (U+030A), displayed like U+00E5
."o\xCC\x88"; // 'o' + COMBINING DIAERESIS (U+0308), displayed like U+00F6
$expected = ["a\xCC\x8A", "o\xCC\x88"];
$expected2 = ["a", "\xCC\x8A", "o", "\xCC\x88"];
var_dump(
$expected === str_to_array($string),
$expected === str_to_array2($string),
$expected2 === str_to_array3($string),
$expected2 === str_to_array4($string),
$expected2 === str_to_array5($string)
);
function str_to_array($string)
{
$length = grapheme_strlen($string);
$ret = [];
for ($i = 0; $i < $length; $i += 1) {
$ret[] = grapheme_substr($string, $i, 1);
}
return $ret;
}
function str_to_array2($string)
{
$it = IntlBreakIterator::createCharacterInstance('en_US');
$it->setText($string);
$ret = [];
$prev = 0;
foreach ($it as $pos) {
$char = substr($string, $prev, $pos - $prev);
if ('' !== $char) {
$ret[] = $char;
}
$prev = $pos;
}
return $ret;
}
function str_to_array3($string)
{
$it = IntlBreakIterator::createCodePointInstance();
$it->setText($string);
$ret = [];
$prev = 0;
foreach ($it as $pos) {
$char = substr($string, $prev, $pos - $prev);
if ('' !== $char) {
$ret[] = $char;
}
$prev = $pos;
}
return $ret;
}
function str_to_array4($string)
{
$length = mb_strlen($string, "UTF-8");
$ret = [];
for ($i = 0; $i < $length; $i += 1) {
$ret[] = mb_substr($string, $i, 1, "UTF-8");
}
return $ret;
}
function str_to_array5($string) {
return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}
In a production environment, you need to replace invalid byte sequences with a substitute character, since almost all grapheme and mbstring functions cannot handle invalid byte sequences. If you are interested, see my past answer: https://stackoverflow.com/a/13695364/531320
If you don't care about performance, htmlspecialchars and htmlspecialchars_decode can be used. The merit of this approach is that it supports various encodings other than UTF-8.
function str_to_array6($string, $encoding = 'UTF-8')
{
$ret = [];
str_replace_callback($string, function($char, $index) use (&$ret) { $ret[] = $char; return ''; }, $encoding);
return $ret;
}
function str_replace_callback($string, $callable, $encoding = 'UTF-8')
{
$string = str_scrub($string, $encoding);
$str_size = strlen($string); // measure after scrubbing, which can change the byte length
$ret = '';
$char = '';
$index = 0;
for ($pos = 0; $pos < $str_size; ++$pos) {
$char .= $string[$pos];
if (str_check_encoding($char, $encoding)) {
$ret .= $callable($char, $index);
$char = '';
++$index;
}
}
return $ret;
}
function str_check_encoding($string, $encoding = 'UTF-8')
{
$string = (string) $string;
return $string === htmlspecialchars_decode(htmlspecialchars($string, ENT_QUOTES, $encoding));
}
function str_scrub($string, $encoding = 'UTF-8')
{
return htmlspecialchars_decode(htmlspecialchars($string, ENT_SUBSTITUTE, $encoding));
}
If you want to learn the UTF-8 specification, byte manipulation is a good way to practice.
function str_to_array7($string)
{
// REPLACEMENT CHARACTER (U+FFFD)
$substitute = "\xEF\xBF\xBD";
$size = strlen($string);
$ret = [];
for ($i = 0; $i < $size; $i += 1) {
if ($string[$i] <= "\x7F") {
$ret[] = $string[$i];
} elseif ("\xC2" <= $string[$i] && $string[$i] <= "\xDF") {
if (!isset($string[$i+1])) {
$ret[] = $substitute;
return $ret;
} elseif ($string[$i+1] < "\x80" || "\xBF" < $string[$i+1]) {
$ret[] = $substitute;
} else {
$ret[] = substr($string, $i, 2);
$i += 1;
}
} elseif ("\xE0" <= $string[$i] && $string[$i] <= "\xEF") {
$left = "\xE0" === $string[$i] ? "\xA0" : "\x80";
$right = "\xED" === $string[$i] ? "\x9F" : "\xBF";
if (!isset($string[$i+1])) {
$ret[] = $substitute;
return $ret;
} elseif ($string[$i+1] < $left || $right < $string[$i+1]) {
$ret[] = $substitute;
} else {
if (!isset($string[$i+2])) {
$ret[] = $substitute;
return $ret;
} elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) {
$ret[] = $substitute;
$i += 1;
} else {
$ret[] = substr($string, $i, 3);
$i += 2;
}
}
} elseif ("\xF0" <= $string[$i] && $string[$i] <= "\xF4") {
$left = "\xF0" === $string[$i] ? "\x90" : "\x80";
$right = "\xF4" === $string[$i] ? "\x8F" : "\xBF";
if (!isset($string[$i+1])) {
$ret[] = $substitute;
return $ret;
} elseif ($string[$i+1] < $left || $right < $string[$i+1]) {
$ret[] = $substitute;
} else {
if (!isset($string[$i+2])) {
$ret[] = $substitute;
return $ret;
} elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) {
$ret[] = $substitute;
$i += 1;
} else {
if (!isset($string[$i+3])) {
$ret[] = $substitute;
return $ret;
} elseif ($string[$i+3] < "\x80" || "\xBF" < $string[$i+3]) {
$ret[] = $substitute;
$i += 2;
} else {
$ret[] = substr($string, $i, 4);
$i += 3;
}
}
}
} else {
$ret[] = $substitute;
}
}
return $ret;
}
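The lead-byte ranges used in the function above can be summarised in a small lookup. This is only a sketch: the name utf8_sequence_length is hypothetical, it looks at the first byte alone, and it does not validate the continuation bytes the way the full function does.

```php
<?php
// Hypothetical helper: expected length in bytes of a UTF-8 sequence,
// judged from its first byte only. Returns 0 for bytes that cannot
// start a valid sequence (continuation bytes, 0xC0, 0xC1, 0xF5..0xFF).
// Assumes a non-empty string; only the first byte is inspected.
function utf8_sequence_length(string $leadByte): int
{
    $b = ord($leadByte);
    if ($b <= 0x7F) {
        return 1; // ASCII
    }
    if ($b >= 0xC2 && $b <= 0xDF) {
        return 2;
    }
    if ($b >= 0xE0 && $b <= 0xEF) {
        return 3;
    }
    if ($b >= 0xF0 && $b <= 0xF4) {
        return 4;
    }
    return 0;
}

echo utf8_sequence_length('a'), PHP_EOL;   // 1
echo utf8_sequence_length('主'), PHP_EOL;  // 3
```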
The benchmark results for these functions are as follows (elapsed time in seconds for 10,000 calls each).
grapheme
0.12967610359192
IntlBreakIterator::createCharacterInstance
0.17032408714294
IntlBreakIterator::createCodePointInstance
0.079245090484619
mbstring
0.081080913543701
preg_split
0.043133974075317
htmlspecialchars
0.25599694252014
byte manipulation
0.13132810592651
The benchmark code is here.
$string = '主楼怎么走';
foreach (timer([
'grapheme' => 'str_to_array',
'IntlBreakIterator::createCharacterInstance' => 'str_to_array2',
'IntlBreakIterator::createCodePointInstance' => 'str_to_array3',
'mbstring' => 'str_to_array4',
'preg_split' => 'str_to_array5',
'htmlspecialchars' => 'str_to_array6',
'byte manipulation' => 'str_to_array7'
],
[$string]) as $desc => $time) {
echo $desc, PHP_EOL,
$time, PHP_EOL;
}
function timer(array $callables, array $arguments, $repeat = 10000) {
$ret = [];
$save = $repeat;
foreach ($callables as $key => $callable) {
$start = microtime(true);
do {
array_map($callable, $arguments);
} while($repeat -= 1);
$stop = microtime(true);
$ret[$key] = $stop - $start;
$repeat = $save;
}
return $ret;
}
The documentation of str_split() says
Note:
str_split() will split into bytes, rather than characters when dealing with a multi-byte encoded string.
PHP 7.4.0 adds a new function mb_str_split(). Use this instead.
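A minimal sketch of the difference, assuming UTF-8 input. The wrapper name split_chars is hypothetical. It falls back to preg_split when mb_str_split is unavailable (PHP older than 7.4, or no mbstring extension).

```php
<?php
// Splits a UTF-8 string into characters. mb_str_split() needs PHP 7.4+
// with the mbstring extension; otherwise fall back to preg_split().
function split_chars(string $s): array
{
    if (function_exists('mb_str_split')) {
        return mb_str_split($s, 1, 'UTF-8');
    }
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

print_r(split_chars('主楼怎么走')); // five elements, one per character

// Byte-wise split for comparison: each of these characters is 3 bytes long.
echo count(str_split('主楼怎么走')), PHP_EOL; // 15
```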
Assuming you have set the desired encoding and regular expression encoding for the MB functions (such as to UTF-8), you could use my method from my String class library.
/**
 * Splits a string into pieces (on whitespace by default).
 *
 * @param string $target
 * @param string $pattern
 * @param int $limit
 * @return array
 */
public function split(string $target, string $pattern = '\s+', int $limit = -1): array
{
return mb_split($pattern, $target, $limit);
}
By wrapping the mb_split() function in a method, I make it much easier to use. Simply invoke it with the desired value for the variable $pattern.
Remember, set the character encoding appropriately for your task.
mb_internal_encoding('UTF-8'); // For example.
mb_regex_encoding('UTF-8'); // For example.
In the case of my wrapper method, supply the empty string to the method like so.
$string = new String('UTF-8', 'UTF-8'); // Sets the internal and regex encodings.
$string->split($yourString, "");
In the direct PHP case ...
$characters = mb_split("", $string);

How to compute the cartesian power of a range of characters?

I would like to make a function that is able to generate a list of letters and optional numbers using a-z,0-9.
$output = array();
foreach(range('a','z') as $i) {
foreach(range('a','z') as $j) {
foreach(range('a','z') as $k) {
$output[] =$i.$j.$k;
}
}
}
Thanks
example:
myfunction($include, $length)
usage something like this:
myfunction('a..z,0..9', 3);
output:
000
001
...
aaa
aab
...
zzz
The output would have every possible combination of the letters, and numbers.
Setting the stage
First, a function that expands strings like "0..9" to "0123456789" using range:
function expand_pattern($pattern) {
$bias = 0;
$flags = PREG_SET_ORDER | PREG_OFFSET_CAPTURE;
preg_match_all('/(.)\.\.(.)/', $pattern, $matches, $flags);
foreach ($matches as $match) {
$range = implode('', range($match[1][0], $match[2][0]));
$pattern = substr_replace(
$pattern,
$range,
$bias + $match[1][1],
$match[2][1] - $match[1][1] + 1);
$bias += strlen($range) - 4; // 4 == length of "X..Y"
}
return $pattern;
}
It handles any number of expandable patterns and takes care to preserve their position inside your source string, so for example
expand_pattern('abc0..4def5..9')
will return "abc01234def56789".
Calculating the result all at once
Now that we can do this expansion easily, here's a function that calculates cartesian products given a string of allowed characters and a length:
function cartesian($pattern, $length) {
$choices = strlen($pattern);
$indexes = array_fill(0, $length, 0);
$results = array();
$resets = 0;
while ($resets != $length) {
$result = '';
for ($i = 0; $i < $length; ++$i) {
$result .= $pattern[$indexes[$i]];
}
$results[] = $result;
$resets = 0;
for ($i = $length - 1; $i >= 0 && ++$indexes[$i] == $choices; --$i) {
$indexes[$i] = 0;
++$resets;
}
}
return $results;
}
So for example, to get the output described in the question you would do
$options = cartesian(expand_pattern('a..z0..9'), 3);
See it in action (I limited the expansion length to 2 so that the output doesn't explode).
Generating the result on the fly
Since the result set can be extremely large (it grows exponentially with $length), producing it all at once can turn out to be prohibitive. In that case it is possible to rewrite the code so that it returns each value in turn (iterator-style), which has become super easy with PHP 5.5 because of generators:
function cartesian($pattern, $length) {
$choices = strlen($pattern);
$indexes = array_fill(0, $length, 0);
$resets = 0;
while ($resets != $length) {
$result = '';
for ($i = 0; $i < $length; ++$i) {
$result .= $pattern[$indexes[$i]];
}
yield $result;
$resets = 0;
for ($i = $length - 1; $i >= 0 && ++$indexes[$i] == $choices; --$i) {
$indexes[$i] = 0;
++$resets;
}
}
}
See it in action.
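To give a sense of scale, 36 allowed characters and a length of 3 already produce 36^3 = 46,656 strings. The sketch below shows how a generator lets the caller stop early without building the full list. It uses a different technique from the code above (counting in base n rather than tracking an index array), and the name cartesian_lazy is hypothetical.

```php
<?php
// Hypothetical lazy variant: treats each result as a number written in
// base strlen($alphabet), so no index array has to be maintained.
// Only practical while strlen($alphabet) ** $length fits in an integer.
function cartesian_lazy(string $alphabet, int $length): Generator
{
    $n = strlen($alphabet);
    $total = $n ** $length;
    for ($k = 0; $k < $total; $k++) {
        $word = '';
        $value = $k;
        // Peel off digits from the least significant end.
        for ($i = 0; $i < $length; $i++) {
            $word = $alphabet[$value % $n] . $word;
            $value = intdiv($value, $n);
        }
        yield $word;
    }
}

// Print only the first five results; the rest are never computed.
$shown = 0;
foreach (cartesian_lazy('ab01', 3) as $word) {
    echo $word, PHP_EOL;
    if (++$shown === 5) {
        break;
    }
}
```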
See this answer for code that produces all possible combinations:
https://stackoverflow.com/a/8567199/1800369
You just need to add a $length parameter to limit the size of the combinations.
You can use a recursive function
Assuming you mean it can be any number of levels deep, you can use a recursive function to generate an array of the permutations, e.g.:
/**
 * Take the range of characters and generate an array of all permutations.
 *
 * @param array $range range of characters to iterate over
 * @param array $array input array, operated on by reference
 * @param int $depth how many characters each resulting string should have
 * @param int $currentDepth internal variable to track how nested the current call is
 * @param string $prefix internal variable holding the prefix for the current string
 * @return array permutations
 */
function foo($range, &$array, $depth = 1, $currentDepth = 0, $prefix = "") {
$start = !$currentDepth;
$currentDepth++;
if ($currentDepth > $depth) {
return;
}
foreach($range as $char) {
if ($currentDepth === $depth) {
$array[] = $prefix . $char;
continue;
}
foo($range, $array, $depth, $currentDepth, $prefix . $char);
}
if ($start) {
return $array;
}
}
With the above function, initialize the return variable and call it:
$return = array();
echo implode("\n", foo(range('a', 'z'), $return, 3));
And your output will be all three-character combinations from aaa to zzz:
aaa
aab
...
zzy
zzz
The numeric parameter determines how deep the recursion goes:
$return = array();
echo implode("\n", foo(range('a', 'z'), $return, 1));
a
b
c
...
Here's a live example.
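The same recursive idea can also be written so that the function returns its result instead of filling an array passed by reference. This is only a sketch, and the name strings_of_length is hypothetical.

```php
<?php
// Hypothetical variant of the recursive approach: builds and returns the
// list rather than writing into a by-reference argument.
function strings_of_length(array $range, int $depth): array
{
    if ($depth <= 0) {
        return ['']; // base case: exactly one empty string
    }
    // All strings one character shorter, computed once per level.
    $suffixes = strings_of_length($range, $depth - 1);
    $result = [];
    foreach ($range as $char) {
        foreach ($suffixes as $suffix) {
            $result[] = $char . $suffix;
        }
    }
    return $result;
}

echo implode("\n", strings_of_length(range('a', 'c'), 2)), "\n";
```

Because every level holds its complete list in memory, this suits small depths. For larger ones a generator is the safer choice.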
$number= range(0, 9);
$letters = range('a', 'z');
$array= array_merge($number, $letters);
//print_r($array);
for($a=0;$a<count($array);$a++){
for($b=0;$b<count($array);$b++){
for($c=0;$c<count($array);$c++){
echo $array[$a].$array[$b].$array[$c]."<br>";
}
}
}
tested and working :)

Optimal function to create a random UTF-8 string in PHP? (letter characters only)

I wrote this function that creates a random string of UTF-8 characters. It works well, but the regular expression [^\p{L}] does not seem to filter out all non-letter characters. I can't think of a better way to generate the full range of Unicode without non-letter characters, short of manually searching for and defining the decimal letter ranges between 65 and 65533.
function rand_str($max_length, $min_length = 1, $utf8 = true) {
static $utf8_chars = array();
if ($utf8 && !$utf8_chars) {
for ($i = 1; $i <= 65533; $i++) {
$utf8_chars[] = mb_convert_encoding("&#$i;", 'UTF-8', 'HTML-ENTITIES');
}
$utf8_chars = preg_replace('/[^\p{L}]/u', '', $utf8_chars);
foreach ($utf8_chars as $i => $char) {
if (trim($utf8_chars[$i])) {
$chars[] = $char;
}
}
$utf8_chars = $chars;
}
$chars = $utf8 ? $utf8_chars : str_split('abcdefghijklmnopqrstuvwxyz');
$num_chars = count($chars);
$string = '';
$length = mt_rand($min_length, $max_length);
for ($i = 0; $i < $length; $i++) {
$string .= $chars[mt_rand(1, $num_chars) - 1];
}
return $string;
}
\p{L} might be catching too much. Try limiting the class to \p{Ll} and \p{Lu}; \p{L} also includes, among others, \p{Lo} (other letters).
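A small sketch of that narrower filter, assuming valid UTF-8 input and PCRE built with Unicode property support. The function name only_cased_letters is hypothetical.

```php
<?php
// Keeps only characters that are lowercase (Ll) or uppercase (Lu) letters.
// Letters without case, such as CJK ideographs (category Lo), are removed.
function only_cased_letters(string $s): string
{
    $filtered = preg_replace('/[^\p{Ll}\p{Lu}]/u', '', $s);
    return $filtered ?? ''; // preg_replace() returns null on invalid UTF-8
}

// Input: a, B, 1, space, U+00E9 (é), U+00D1 (Ñ), U+4E3B (主), underscore.
echo only_cased_letters("aB1 \u{E9}\u{D1}\u{4E3B}_"), PHP_EOL; // aBéÑ
```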
With PHP7 and IntlChar there is now a better way:
function utf8_random_string(int $length) : string {
$r = "";
for ($i = 0; $i < $length; $i++) {
$codePoint = mt_rand(0x80, 0xffff);
$char = \IntlChar::chr($codePoint);
if ($char !== null && \IntlChar::isprint($char)) {
$r .= $char;
} else {
$i--;
}
}
return $r;
}

Recursion with anonymous function [duplicate]

Possible Duplicates:
javascript: recursive anonymous function?
Anonymous recursive PHP functions
I was wondering... Is it possible to do recursion with anonymous function?
Here is one example: I need to get six-chars long string which may contain only numbers and spaces. The only rules are that it cannot start or end with spaces. We check for that and if that occurs - just call recursion on the same, anonymous, function. Just how!?
function() {
$chars = range(0, 9);
$chars[] = ' ';
$length = 6;
$count = count($chars);
$string = '';
for ($i = 0; $i < $length; ++$i) {
$string .= $chars[mt_rand(0, $count - 1)];
}
$string = trim($string);
if (strlen($string) !== $length) { // There were spaces in front or end of the string. Shit!
// Do recursion.
}
return $string;
}
Yes it is, but I wouldn't recommend it as it's a bit tricky ;)
First possibility:
<?php
$some_var1="1";
$some_var2="2";
function($param1, $param2) use ($some_var1, $some_var2)
{
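// Caution: inside an anonymous function __FUNCTION__ evaluates to "{closure}",
// which is not a callable name, so this call fails for closures as written.
call_user_func(__FUNCTION__, $other_param1, $other_param2);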
}
?>
Another one:
<?php
$recursive = function () use (&$recursive){
// The function is now available as $recursive
}
?>
Examples taken from http://php.net/
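Applied to the question, the second approach could look roughly like the sketch below. The variable name $generate is an assumption. The rules (six characters, digits and spaces only, no space at either end) are taken from the question.

```php
<?php
// A closure that refers to itself through a by-reference "use" binding.
// The reference is needed because $generate is not yet assigned at the
// moment the closure is created.
$generate = function () use (&$generate) {
    $chars = range(0, 9);
    $chars[] = ' ';
    $length = 6;
    $string = '';
    for ($i = 0; $i < $length; ++$i) {
        $string .= $chars[mt_rand(0, count($chars) - 1)];
    }
    if (strlen(trim($string)) !== $length) {
        return $generate(); // leading or trailing space: try again
    }
    return $string;
};

var_dump($generate());
```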
The answer is complicated but not impossible. It took me several minutes to figure out. We must first define a utility function called $combinator().
The solution to your problem:
$generate = $combinator(
    function ($self) { return function () use ($self) {
        $chars = range(0, 9);
        $chars[] = ' ';
        $length = 6;
        $count = count($chars);
        $string = '';
        for ($i = 0; $i < $length; ++$i) {
            $string .= $chars[mt_rand(0, $count - 1)];
        }
        $string = trim($string);
        if (strlen($string) !== $length) {
            return $self(); // recurse through the function supplied by the combinator
        }
        return $string;
    }; }
);
The definition of $combinator():
$combinator = function ($principle)
{
    // $step receives itself as $transept, so the function handed to $principle
    // can rebuild the recursive function each time it is called.
    $step = function ($transept) use ($principle)
    {
        return $principle(
            function (...$arguments) use ($transept)
            {
                return call_user_func_array($transept($transept), $arguments);
            }
        );
    };
    return $step($step);
};
A much saner method to do the same thing. Requires only one loop as well.
$chars = array_merge(range(0, 9), array(' '));
$string = mt_rand(0, 9);
for ($i = 1; $i <= 4; $i++) {
$string .= $chars[array_rand($chars)];
}
$string .= mt_rand(0, 9);
Sorry for sidestepping the actual question though.
Use goto:
$generate = function () {
    start:
    $chars = range(0, 9);
    $chars[] = ' ';
    $length = 6;
    $count = count($chars);
    $string = '';
    for ($i = 0; $i < $length; ++$i) {
        $string .= $chars[mt_rand(0, $count - 1)];
    }
    $string = trim($string);
    if (strlen($string) !== $length) { // There were spaces at the start or end of the string.
        goto start;
    }
    return $string;
};
But it's not the best idea to use goto.

