php true multi-byte string shuffle function?

I have a unique problem with multibyte character strings and need to be able to shuffle, with some fair degree of randomness, a long UTF-8 encoded multibyte string in PHP without dropping or losing or repeating any of the characters.
In the PHP manual under str_shuffle there is a multi-byte function (the first user-submitted one) that doesn't work: if I give it a string of, say, 120 chars containing all the Japanese hiragana and katakana, I get back a string that's 119 or 118 chars long. Sometimes I've seen duplicate chars even though the original string doesn't have any. So that's not functional.
To make this more complex, I also need to include if possible Japanese UTF-8 newlines and line feeds and punctuation.
Can anyone with experience dealing in multiple languages with UTF-8 mb strings help? Does PHP have any built in functions to do this? str_shuffle is EXACTLY what I want. I just need it to also work on multibyte chars.
Thanks very much!

Try splitting the string into an array of characters using mb_strlen and mb_substr, then shuffling the array before joining it back together again. (Edit: As also demonstrated in @Frosty Z's answer.)
An example from the PHP interactive prompt:
php > $string = "Pretend I'm multibyte!";
php > $len = mb_strlen($string);
php > $sploded = array();
php > while($len-- > 0) { $sploded[] = mb_substr($string, $len, 1); }
php > shuffle($sploded);
php > echo join('', $sploded);
rmedt tmu nIb'lyi!eteP
You'll want to be sure to specify the encoding, where appropriate.
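To make the encoding explicit, the same idea can be wrapped in a function. This is only a minimal sketch: the name shuffle_utf8 is made up for illustration, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: shuffle a UTF-8 string one character at a time.
function shuffle_utf8(string $string): string
{
    $chars = [];
    $len = mb_strlen($string, 'UTF-8');
    for ($i = 0; $i < $len; $i++) {
        // Extract exactly one character, however many bytes it occupies.
        $chars[] = mb_substr($string, $i, 1, 'UTF-8');
    }
    shuffle($chars);
    return implode('', $chars);
}

// Hiragana, katakana, Japanese punctuation and a newline.
$input = "あいうえお、カキクケコ。\n";
$output = shuffle_utf8($input);

// The character count should be identical before and after.
var_dump(mb_strlen($input, 'UTF-8') === mb_strlen($output, 'UTF-8'));
```

Passing 'UTF-8' to every mb_ call avoids depending on the configured default encoding, which is one likely cause of dropped or duplicated characters.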

This should do the trick, too. I hope.
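// "String" is a reserved class name as of PHP 7, so use a different name.
class MultibyteString
{
    public function mbStrShuffle($string)
    {
        $chars = $this->mbGetChars($string);
        shuffle($chars);
        return implode('', $chars);
    }

    public function mbGetChars($string)
    {
        $chars = [];
        // Pass the encoding to both calls so they count characters the same way.
        for ($i = 0, $length = mb_strlen($string, 'UTF-8'); $i < $length; ++$i) {
            $chars[] = mb_substr($string, $i, 1, 'UTF-8');
        }
        return $chars;
    }
}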

I like to use this function:
function mb_str_shuffle($multibyte_string = "abcčćdđefghijklmnopqrsštuvwxyzžß,.-+'*?=)(/&%$#!~ˇ^˘°˛`˙´˝") {
$characters_array = mb_str_split($multibyte_string);
shuffle($characters_array);
return implode('', $characters_array); // or join('', $characters_array); if you have a death wish (JK)
}
Split string into an array of multibyte characters
Shuffle the good guy array who doesn't care about his residents being multibyte
Join the shuffled array together into a string
Of course I normally wouldn't have a default value for function's parameter.

Related

Arabic not urlencoding correctly

I have the string:
$str = 'ماجد';
This needs to be encoded as:
'%E3%C7%CC%CF'
But I cannot figure out how to reach this encoded string. I believe it is Windows-1256. The above encoded string is how it is being encoded by a program I have.
Does anyone know how to reach this string?
If you know you want to use Windows-1256 then all you have to do is to change the encoding of the input string (which is UTF-8) to Windows-1256. Then you apply urlencode() to the returned string and that's all.
There are several ways to change the encoding of a string in PHP. One of them (that I tested and provides the result you expect) is using iconv():
$str = 'ماجد';
$conv = iconv('utf-8', 'windows-1256', $str);
echo(urlencode($conv));
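The conversion can fail if the input contains characters that Windows-1256 cannot represent, so it may be worth checking the result. A small sketch, assuming the source file is saved as UTF-8 and that the local iconv build knows the WINDOWS-1256 charset; the function name is made up for illustration:

```php
<?php
// Hypothetical wrapper: percent-encode a UTF-8 string as Windows-1256 bytes.
function urlencode_windows1256(string $utf8): string
{
    $converted = iconv('UTF-8', 'WINDOWS-1256', $utf8);
    if ($converted === false) {
        // iconv() returns false when the string cannot be converted.
        throw new InvalidArgumentException('Input cannot be represented in Windows-1256');
    }
    return urlencode($converted);
}

echo urlencode_windows1256('ماجد'), "\n"; // %E3%C7%CC%CF
```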
You need to somehow split the string into its hexadecimal representation and then put a % sign in front of each pair of hex digits.
<?php
$hexString = bin2hex("ماجد");
for($i = 0; $i < strlen($hexString); $i += 2){
echo "%".substr($hexString, $i, 2);
}
?>
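This percent-encodes the bytes of the string exactly as they are stored, so with a UTF-8 source file it prints %d9%85%d8%a7%d8%ac%d8%af; convert the string to Windows-1256 first to get the form asked for. I'm sure there is a more elegant way.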

php sprintf() with foreign characters?

It seems like sprintf has a problem with foreign characters? Or am I doing something wrong? It works when I remove chars like åäö from the string, though. Should that be necessary?
I want the following lines to be aligned correctly for a report:
2011-11-27 A1823 -Ref. Leif - 12 873,00 18.98
2011-11-30 A1856 -Rättat xx - 6 594,00 19.18
I'm using sprintf() like this: %-12s %-8s -%-10s -%20s %8.2f
Using: php-5.3.23-nts-Win32-VC9-x86
Strings in PHP are basically arrays of bytes (not characters). They cannot work natively with multibyte encodings (such as UTF-8).
For details see:
https://www.php.net/manual/en/language.types.string.php#language.types.string.details
Most string functions in PHP have a multibyte equivalent (with the mb_ prefix), but sprintf does not.
There's a user comment (by "viktor at textalk dot com") with a multibyte implementation of sprintf on the function's documentation page at php.net. It may work for you:
https://www.php.net/manual/en/function.sprintf.php#89020
I was actually trying to find out if PHP ^7 finally has a native mb_sprintf() but apparently no xD.
For the sake of completeness, here is a simple solution I've been using in some old projects. It just adds the difference between strlen and mb_strlen to the desired $targetLength.
The non-multibyte example is just added for the sake of easy comparison =).
$text = "Gultigkeitsprufung ist fehlgeschlagen: %{errors}";
$mbText = "Gültigkeitsprüfung ist fehlgeschlagen: %{errors}";
$mbTextRussian = "Проверка не удалась: %{errors}";
$targetLength = 60;
$mbTargetLength = strlen($mbText) - mb_strlen($mbText) + $targetLength;
$mbRussianTargetLength = strlen($mbTextRussian) - mb_strlen($mbTextRussian) + $targetLength;
printf("%{$targetLength}s\n", $text);
printf("%{$mbTargetLength}s\n", $mbText);
printf("%{$mbRussianTargetLength}s\n", $mbTextRussian);
result
            Gultigkeitsprufung ist fehlgeschlagen: %{errors}
            Gültigkeitsprüfung ist fehlgeschlagen: %{errors}
                              Проверка не удалась: %{errors}
update 2019-06-12
@flowtron made me give it another thought. A simple mb_sprintf() could look like this.
function mb_sprintf($format, ...$args) {
$params = $args;
$callback = function ($length) use (&$params) {
$value = array_shift($params);
return strlen($value) - mb_strlen($value) + $length[0];
};
$format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
return sprintf($format, ...$args);
}
echo mb_sprintf("%-10s %-10s %10s\n", 'thüs', 'wörks', 'ök');
echo mb_sprintf("%-10s %-10s %10s\n", 'this', 'works', 'ok');
result
thüs       wörks              ök
this       works              ok
I only did some happy path testing here, but it works for PHP >=5.6 and should be good enough to give ppl an idea on how to encapsulate the behavior.
It does not work with the repetition/order modifiers though - e.g. %1$20s will be ignored/remain unchanged.
If you're using characters that fit in the ISO-8859-1 character set, you can convert the strings before formatting and convert the result back to UTF-8 when you are done. (Note that utf8_encode and utf8_decode are deprecated as of PHP 8.2.)
utf8_encode(sprintf("%-12s %-8s", utf8_decode($paramOne), utf8_decode($paramTwo)));
Problem
There are no multibyte format functions.
Idea
Converting the input strings is not always possible. Adjust the widths in the format instead.
A format such as %4s is meant to pad to 4 display columns (not 4 characters; see the footnote), but PHP's format functions count bytes.
So add the difference (bytes minus display width) to the width given in the format.
Implementations
Adapted from @nimmneun:
function mb_sprintf($format, ...$args) {
$params = $args;
$callback = function ($length) use (&$params) {
$value = array_shift($params);
return $length[0] + strlen($value) - mb_strwidth($value);
};
$format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
return sprintf($format, ...$args);
}
And don't forget another option, str_pad($input, $length, $pad_char=' ', STR_PAD_RIGHT), which can be adjusted the same way:
function mb_str_pad(...$args) {
$args[1] += strlen($args[0]) - mb_strwidth($args[0]);
return str_pad(...$args);
}
Footnote
A typical East Asian (CJK) character in UTF-8 is 3 bytes long, 2 columns wide, and counts as 1 character.
If your format is %4s and the input is one such character, you need two spaces of padding, not three.
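The three measurements can be compared directly. A short sketch, assuming the file is saved as UTF-8 and the mbstring extension is loaded:

```php
<?php
// Bytes, characters and display width for an ASCII letter,
// an accented Latin letter and a kanji.
$samples = ['a', 'ü', '漢'];
foreach ($samples as $s) {
    printf(
        "%d byte(s), %d character(s), width %d\n",
        strlen($s),
        mb_strlen($s, 'UTF-8'),
        mb_strwidth($s, 'UTF-8')
    );
}
// 1 byte(s), 1 character(s), width 1
// 2 byte(s), 1 character(s), width 1
// 3 byte(s), 1 character(s), width 2
```

This is why the width-based variant adds strlen minus mb_strwidth, while the earlier character-based variant adds strlen minus mb_strlen; the two only differ for wide characters.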

PHP String Function with non-English languages

I was trying the range() function with non-English text. It is not working.
$i = 0;
foreach(range('क', 'म') as $ab) {
++$i;
$alphabets[$ab] = $i;
}
Output: à =1
These are Hindi (Devanagari) letters. The loop iterates only once, as the output shows.
I don't know what to do about this.
So, if possible, please tell me how to fix it, and what I should do first before working with non-English text in any PHP function.
Short answer: it's not possible to use range like that.
Explanation
You are passing the string 'क' as the start of the range and 'म' as the end. You are getting only one character back, and that character is à.
You are getting back à because your source file is encoded (saved) in UTF-8. One can tell this by the fact that à is code point U+00E0, while 0xE0 is also the first byte of the UTF-8 encoded form of 'क' (which is 0xE0 0xA4 0x95). Sadly, PHP has no notion of encodings so it just takes the first byte it sees in the string and uses that as the "start" character.
You are getting back only à because the UTF-8 encoded form of 'म' also starts with 0xE0 (so PHP also thinks that the "end character" is 0xE0 or à).
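The byte-level view can be checked directly. A small sketch, assuming the source file is saved as UTF-8:

```php
<?php
$start = 'क'; // U+0915
$end   = 'म'; // U+092E

// Dump the raw bytes of each string.
echo bin2hex($start), "\n"; // e0a495
echo bin2hex($end), "\n";   // e0a4ae

// Both strings begin with the same byte, 0xE0, which is all range() looks at.
var_dump(ord($start[0]) === ord($end[0])); // bool(true)
```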
Solution
You can write range as a for loop yourself, as long as there is some function that returns the Unicode code point of a UTF-8 character (and one that does the reverse). So I googled and found these:
// Returns the UTF-8 character with code point $intval
function unichr($intval) {
return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
// Returns the code point for a UTF-8 character
function uniord($u) {
$k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
$k1 = ord(substr($k, 0, 1));
$k2 = ord(substr($k, 1, 1));
return $k2 * 256 + $k1;
}
With the above, you can now write:
for($char = uniord('क'); $char <= uniord('म'); ++$char) {
$alphabet[] = unichr($char);
}
print_r($alphabet);
The lazy solution would be to use html_entity_decode() and range() only for the numeric ranges it was originally intended (that it works with ASCII is a bit silly anyway):
$i = 0; // initialise the counter before incrementing it
foreach (range(0x0915, 0x092E) as $char) {
$char = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
$alphabets[$char] = ++$i;
}
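On PHP 7.2 or later, mb_ord() and mb_chr() do the code point conversions natively, so neither hand-written helpers nor HTML entities are needed. A sketch under that assumption; utf8_range is a made-up name, not a built-in:

```php
<?php
// Hypothetical helper: like range(), but for single UTF-8 characters.
function utf8_range(string $first, string $last): array
{
    $chars = [];
    $end = mb_ord($last, 'UTF-8');
    for ($cp = mb_ord($first, 'UTF-8'); $cp <= $end; $cp++) {
        // Turn each code point back into a UTF-8 character.
        $chars[] = mb_chr($cp, 'UTF-8');
    }
    return $chars;
}

// Build the same letter => position map as in the question (1-based).
$alphabets = [];
foreach (utf8_range('क', 'म') as $index => $char) {
    $alphabets[$char] = $index + 1;
}
print_r($alphabets);
```

Like the other solutions, this simply walks consecutive code points, so it includes every character Unicode assigns between the two endpoints.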
Another solution would be translating and getting the range then translate back again.
$first = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=क");
$second = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=म"); //not real value
$jsonfirst = json_decode($first);
$jsonsecond = json_decode($second);
$f = $jsonfirst->responseData->translatedText;
$l = $jsonsecond->responseData->translatedText;
foreach(range($f, $l) as $ab) {
echo $ab;
}
Outputs
ABCDEFGHI
To translate back, use array_map with a callback function that translates each of the English values back to Hindi.

PHP method for stripping duplicate chars from a multibyte string?

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abcgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
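In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched. Note that this only collapses adjacent repeats; a character that shows up again later in the string is kept, as the output above shows.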
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of the fact that PHP arrays are ordered: keys stay in the order in which they were first added. Once we've gone through the string and identified all of the characters, we grab the keys and join 'em back together in the order that they first appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split. (PHP 7.4 and later do provide mb_str_split.)
(No Kanji example here, I'm experiencing a copy/paste bug.)
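On PHP 7.4 or later, where mb_str_split() is available, the loop shrinks to a couple of array calls. A minimal sketch, with unique_chars as a made-up name and valid UTF-8 input assumed:

```php
<?php
// Hypothetical helper: each distinct character once, in order of first appearance.
function unique_chars(string $string): string
{
    // array_unique() keeps the first occurrence of every value.
    return implode('', array_unique(mb_str_split($string, 1, 'UTF-8')));
}

echo unique_chars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私

// Per-character counts, if they are needed as well.
print_r(array_count_values(mb_str_split('aaabcccbbc', 1, 'UTF-8')));
```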
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
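Both directions can also be combined into one call. A sketch assuming PHP 7.4 or later for mb_str_split(); the function name is made up for illustration:

```php
<?php
// Hypothetical helper: characters that appear in exactly one of the two strings.
function chars_symmetric_diff(string $one, string $two): array
{
    $left  = array_unique(mb_str_split($one, 1, 'UTF-8'));
    $right = array_unique(mb_str_split($two, 1, 'UTF-8'));
    // array_diff() is one-directional, so run it both ways and merge.
    // array_merge() also renumbers the keys from zero.
    return array_merge(array_diff($left, $right), array_diff($right, $left));
}

print_r(chars_symmetric_diff('aabbccddeeffgg', 'abcdexyz'));
```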
Have a look at the iconv_strlen function from the PHP standard library. I can't speak for East Asian encodings, but it works fine for Western and Eastern European languages. In any case it gives you another option!
$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way. (Keep in mind that str_split splits by byte, so for multibyte text use mb_str_split instead.)

Help with PHP and multibyte characters

I have a problem that I thought would be simple but it's turning out to be quite complex.
I have a long UTF-8 string that is a mix of Roman, Western-European, Japanese, and Korean characters and punctuation. Many are multibyte chars, but some (I think) are not.
I need to do 2 things:
1. Make sure there are no duplicate chars (and output that new string, stripped of dupes).
2. Randomly shuffle that new string.
(Sorry, I can't seem to get the code quoting to format right...)
function uniquechars($string) {
$l = mb_strlen($string);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($string, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
$uniquekeys = join('', array_keys($unique));
return $uniquekeys;
}
and:
function unicode_shuffle($string)
{
$len = mb_strlen($string);
$sploded = array();
while($len-- > 0) {
$sploded[] = mb_substr($string, $len, 1);
}
shuffle($sploded);
$shuffled = join('', $sploded);
return $shuffled;
}
Using those two functions, which someone very helpfully provided, I THOUGHT I was all set...except that curiously, it seems like the Unique string (no duplicates) and the Shuffled string do not contain the same number of characters. (I am highlighting these chars from my browser and then cutting-and-pasting into another application...one string is always a different length than the one above, but often it varies...it's not even the same number of chars getting truncated each time!).
I'm sorry I don't know enough about PHP nor about coding to sleuth this myself but what on earth is going wrong here? It seems like it should be easy to just shuffle a big long string, but apparently it's much harder than I thought. Is there maybe another, easier way to do this? Should I convert the string first into respective hex numbers and shuffle those, then convert back to UTF-8? Should I output to a file rather than the screen?
Anyone out there have suggestions? I'm sorry, I'm very new to this, so possibly I'm just doing something really dumb.
You can probably do things a lot simpler.
Here's a function to get only the unique characters in a string:
// returns an array of unique characters from a given string
function getUnique( $string ) {
$chars = preg_split( '//', $string, -1, PREG_SPLIT_NO_EMPTY );
$unique = array_unique( $chars );
return $unique;
}
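Then, if you want to reshuffle the order, just pass the array of unique chars to shuffle. Note that shuffle works in place and returns a boolean, so don't assign its result:
$unique = getUnique( $string );
shuffle( $unique );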
Edit: For multi-byte characters, this function should do the trick (thanks to http://php.net/manual/en/function.mb-split.php for helping with the regex):
function getUnique( $string ) {
$chars = preg_split( '/(?<!^)(?!$)/u', $string );
$unique = array_unique( $chars );
return $unique;
}
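Putting both steps together, and checking the character count in code rather than by copying text out of the browser, might look like the sketch below. The function name is made up, and the input is assumed to be valid UTF-8:

```php
<?php
// Hypothetical helper: drop duplicate characters, then shuffle what is left.
function unique_shuffled(string $string): string
{
    // The u modifier makes preg_split() cut between UTF-8 characters, not bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    $unique = array_unique($chars);
    shuffle($unique); // shuffles in place and returns a bool, not the array
    return implode('', $unique);
}

// Mixed Latin, accented, Japanese and Korean input with repeats.
$result = unique_shuffled('aüü漢字漢한글한');
echo mb_strlen($result, 'UTF-8'), "\n"; // 6
```

Counting with mb_strlen on both strings sidesteps any characters that get lost or altered when selecting and pasting from a rendered page.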
