Help with PHP and multibyte characters


I have a problem that I thought would be simple but it's turning out to be quite complex.
I have a long UTF-8 string that is a mix of Roman, Western-European, Japanese, and Korean characters and punctuation. Many are multibyte chars, but some (I think) are not.
I need to do 2 things:
Make sure there are no duplicate chars (and output that new string, stripped of dupes).
Randomly shuffle that new string.
(Sorry, I can't seem to get the code quoting to format right...)
function uniquechars($string) {
$l = mb_strlen($string);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($string, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
$uniquekeys = join('', array_keys($unique));
return $uniquekeys;
}
and:
function unicode_shuffle($string)
{
$len = mb_strlen($string);
$sploded = array();
while($len-- > 0) {
$sploded[] = mb_substr($string, $len, 1);
}
shuffle($sploded);
$shuffled = join('', $sploded);
return $shuffled;
}
Using those two functions, which someone very helpfully provided, I THOUGHT I was all set...except that curiously, it seems like the Unique string (no duplicates) and the Shuffled string do not contain the same number of characters. (I am highlighting these chars from my browser and then cutting-and-pasting into another application...one string is always a different length than the one above, but often it varies...it's not even the same number of chars getting truncated each time!).
I'm sorry I don't know enough about PHP nor about coding to sleuth this myself but what on earth is going wrong here? It seems like it should be easy to just shuffle a big long string, but apparently it's much harder than I thought. Is there maybe another, easier way to do this? Should I convert the string first into respective hex numbers and shuffle those, then convert back to UTF-8? Should I output to a file rather than the screen?
Anyone out there have suggestions? I'm sorry, I'm very new to this, so possibly I'm just doing something really dumb.

You can probably do things a lot simpler.
Here's a function to get only the unique characters in a string:
// returns an array of unique characters from a given string
function getUnique( $string ) {
$chars = preg_split( '//', $string, -1, PREG_SPLIT_NO_EMPTY );
$unique = array_unique( $chars );
return $unique;
}
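Then, if you want to reshuffle the order, pass the array of unique chars to shuffle. Note that shuffle reorders the array in place and returns a boolean, not the shuffled array:
shuffle( $unique );
$shuffled = implode( '', $unique );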
Edit: For multi-byte characters, this function should do the trick (thanks to http://php.net/manual/en/function.mb-split.php for helping with the regex):
function getUnique( $string ) {
$chars = preg_split( '/(?<!^)(?!$)/u', $string );
$unique = array_unique( $chars );
return $unique;
}
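Putting both steps together, here is a minimal sketch, assuming the input is valid UTF-8. The helper names utf8_chars and unique_shuffled are made up for illustration; the sample string is just test data.

```php
<?php
// Split a UTF-8 string into characters. The /u modifier makes PCRE work on
// whole code points instead of single bytes.
function utf8_chars(string $string): array {
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false when the input is not valid UTF-8.
    return $chars === false ? [] : $chars;
}

// Hypothetical helper: the unique characters of $string, in random order.
function unique_shuffled(string $string): string {
    $unique = array_values(array_unique(utf8_chars($string)));
    shuffle($unique);               // shuffles in place, returns a bool
    return implode('', $unique);
}

$input  = 'aab漢漢字私한한 ';
$result = unique_shuffled($input);

echo $result, "\n";
echo count(utf8_chars($result)), "\n";   // number of unique characters
```

Counting with utf8_chars rather than by eye (or by copy and paste from a browser) is the reliable way to check that no character was lost.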

Related

How to remove elements from an associative array in PHP

I am trying to read all the words in HTML documents locally. I have a loop which does it for me. I have created an array which holds the unwanted characters. I do not want those special unwanted characters to be in my word array. I have tried the following code but nothing changed.
$rii = new RecursiveIteratorIterator(new RecursiveDirectoryIterator('fulltext/course'));
$fulltext_course_files = array();
$unwantedChars = array('', ' ', '"', '!', '\'', '+', '%', '&', '/', '(', ')', '=',
'*', '.', ',', '?', '-', '_', ':', ';', '\\');
foreach ($rii as $f) {
if ($f->isDir()) {
continue;
} else {
$st = strip_tags(strtolower(file_get_contents($f)));
$swc = deleteStopWords(str_word_count($st, 1));
if (!in_array($st, $unwantedChars)) {
$fulltext_course_files[$f->getFilename()] = array_count_values($swc);
}
}
}
I still see dashes, empties ('') when I var_dump($arr);
array (size=230)
'4.html' =>
array (size=50)
'-' => int 7 ??
'cs' => int 1
'page' => int 1
'systems' => int 2
'programming' => int 1
'' => int 12 ??
'operating' => int 2
...
What can I do to remove the elements marked with ??.
Edit 1
A better solution is to prevent unwanted characters from entering the array, as @David suggests. I have tried to change the if condition from
if (!in_array($st, $unwantedChars))
to
if (!in_array($f->getFilename(), $unwantedChars))
nothing changed. Unwanted keys still there.
Edit 2
I have tried the following also:
foreach ($fulltext_course_files as $key => $val) {
if (in_array($key, $unwantedChars)) {
unset($fulltext_course_files[$key] );
}
}
Again, no help!

You can use unset: http://php.net/manual/en/function.unset.php
unset($array['mykey']);

Not sure what $f->getFilename() does, but would it not be easier to test it against your characters?
if(!in_array($f->getFilename(), $unwantedChars)) {
$fulltext_course_files[$f->getFilename()] = array_count_values($swc);
}

Instead of using in_array to search for unwanted characters, you could store them all in a string, and use strchr on it: it's basically equivalent to what you wrote, but with a string for storage rather than an array, which should be faster. That said...
My guess is that the unwanted characters still remaining in your final array are actually characters that look like the normal punctuation chars but have a different code point (the integer value corresponding to that character). Could it be that your document uses an encoding with several different dashes and double quote characters, like, say, UTF-8? If that's the case, you'll have a hard time filtering out all the noise to keep only alphabet characters that way. However, if you use a whitelisting scheme (i.e., check for good characters rather than bad ones), you'd be able to keep only the chars you're interested in. Luckily, there are functions to help you do that: ctype_alpha for letters only, and ctype_alnum for alphanumeric ones. The Ctype extension they belong to is enabled in most PHP installations.
Here's a quick implementation:
function get_word_count($content){
$words = array();
$b = 0; // start offset of the current word
$len = strlen($content);
// run one position past the end so a word at the very end is counted too
for ($i = 0; $i <= $len; $i++){
if ($i == $len || !ctype_alnum($content[$i])){
if ($b < $i){
$w = strtolower(substr($content, $b, $i - $b));
if (isset($words[$w]))
$words[$w]++;
else $words[$w] = 1;
}
$b = $i + 1;
}
}
return $words;
}
Beware that:
because it only accepts ASCII alphanumeric characters, you will not be able to index non-English words.
even in that context, there are compound words you'd probably want to consider as one, for instance you're or step-wise. This function will not help you with that. If you need a more robust approach, I suggest that you look into existing natural language processing toolkits for PHP (your search engine of choice will report several projects).
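If non-English words matter, the same whitelist idea can be expressed with Unicode character classes instead of ctype. This is only a sketch: the function name get_word_count_utf8 is made up, it assumes valid UTF-8 input and the mbstring extension, and it still does not treat you're or step-wise as single words.

```php
<?php
// Count words made of Unicode letters, combining marks and digits.
// Everything else (dashes, quotes, punctuation of any script) is a separator.
function get_word_count_utf8(string $content): array {
    // mb_strtolower() needs the mbstring extension.
    $content = mb_strtolower($content, 'UTF-8');
    preg_match_all('/[\p{L}\p{M}\p{N}]+/u', $content, $matches);
    return array_count_values($matches[0]);
}

$counts = get_word_count_utf8('Systems – operating systems, “Über” über!');
print_r($counts);
```

The typographic dash and the curly quotes never become keys here, because they are not letters, marks or digits.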

PHP method for stripping duplicate chars from a multibyte string?

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abcgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.

Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of the fact that PHP arrays are ordered: keys stay in the order in which they were first inserted. Once we've gone through the string and identified all of the characters, we grab the keys and join'em back together in the same order that they first appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
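The "call it twice" step can be wrapped up in one function. A rough sketch, using preg_split with the u modifier instead of the mb_substr loop, and assuming valid UTF-8 input; unique_chars and mb_string_chars_symdiff are names invented for this example.

```php
<?php
// Unique UTF-8 characters of a string, in order of first appearance.
function unique_chars(string $s): array {
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($chars ?: []));
}

// Hypothetical helper: characters that appear in exactly one of the strings.
function mb_string_chars_symdiff(string $one, string $two): array {
    $left  = unique_chars($one);
    $right = unique_chars($two);
    // array_merge() renumbers the integer keys left over from array_diff().
    return array_merge(
        array_diff($left, $right),   // in $one but not in $two
        array_diff($right, $left)    // in $two but not in $one
    );
}

print_r(mb_string_chars_symdiff('aabbccddeeffgg', 'abcdexy'));
print_r(mb_string_chars_symdiff('漢字私', '漢字'));
```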

Please check the iconv_strlen function from the PHP standard library. I can't speak for East Asian encodings, but it works fine for Western and Eastern European languages. In any case it gives you some freedom!

$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way.
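One caveat for the multibyte case asked about: str_split cuts the string into bytes, so it only works when every character is a single byte. A small sketch of the difference, assuming PHP 7.4 or later with the mbstring extension (that is where mb_str_split comes from):

```php
<?php
$name = '漢漢漢字漢字私私字私';

// str_split() works on bytes, so each kanji is cut into pieces.
$bytes = str_split($name);
// mb_str_split() (PHP 7.4+, mbstring) keeps every character whole.
$chars = mb_str_split($name, 1, 'UTF-8');

echo count($bytes), "\n";   // 30: each kanji here is three bytes in UTF-8
echo count($chars), "\n";   // 10

echo implode('', array_unique($chars)), "\n";   // 漢字私
```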

php true multi-byte string shuffle function?

I have a unique problem with multibyte character strings and need to be able to shuffle, with some fair degree of randomness, a long UTF-8 encoded multibyte string in PHP without dropping or losing or repeating any of the characters.
In the PHP manual under str_shuffle there is a multi-byte function (the first user submitted one) that doesn't work: If I use a string with for example all the Japanese hiragana and katakana of string length (ex) 120 chars, I am returned a string that's 119 chars or 118 chars. Sometimes I've seen duplicate chars even though the original string doesn't have them. So that's not functional.
To make this more complex, I also need to include if possible Japanese UTF-8 newlines and line feeds and punctuation.
Can anyone with experience dealing in multiple languages with UTF-8 mb strings help? Does PHP have any built in functions to do this? str_shuffle is EXACTLY what I want. I just need it to also work on multibyte chars.
Thanks very much!

Try splitting the string using mb_strlen and mb_substr to create an array, then using shuffle before joining it back together again. (Edit: As also demonstrated in @Frosty Z's answer.)
An example from the PHP interactive prompt:
php > $string = "Pretend I'm multibyte!";
php > $len = mb_strlen($string);
php > $sploded = array();
php > while($len-- > 0) { $sploded[] = mb_substr($string, $len, 1); }
php > shuffle($sploded);
php > echo join('', $sploded);
rmedt tmu nIb'lyi!eteP
You'll want to be sure to specify the encoding, where appropriate.
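For example, with the encoding passed explicitly to every mb_* call, the result no longer depends on what mb_internal_encoding() happens to be. This is a sketch assuming UTF-8 input; the function name mb_shuffle_utf8 is made up.

```php
<?php
// Hypothetical helper: shuffle the characters of a UTF-8 string.
function mb_shuffle_utf8(string $string): string {
    $len   = mb_strlen($string, 'UTF-8');
    $chars = [];
    for ($i = 0; $i < $len; $i++) {
        $chars[] = mb_substr($string, $i, 1, 'UTF-8');
    }
    shuffle($chars);
    return implode('', $chars);
}

// Hiragana, katakana, Japanese punctuation and a newline.
$in  = "あいうえお、カキクケコ。\n";
$out = mb_shuffle_utf8($in);

// Both the character count and the byte count must survive the shuffle.
var_dump(mb_strlen($in, 'UTF-8') === mb_strlen($out, 'UTF-8'));
var_dump(strlen($in) === strlen($out));
```

If mb_strlen and mb_substr disagree about the encoding, the loop can cut characters in half, which would explain strings that come back shorter or with odd duplicates.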

This should do the trick, too. I hope.
class MbString // renamed: "String" is a reserved class name since PHP 7
{
public function mbStrShuffle($string)
{
$chars = $this->mbGetChars($string);
shuffle($chars);
return implode('', $chars);
}
public function mbGetChars($string)
{
$chars = [];
for($i = 0, $length = mb_strlen($string, 'UTF-8'); $i < $length; ++$i)
{
$chars[] = mb_substr($string, $i, 1, 'UTF-8');
}
return $chars;
}
}

I like to use this function:
function mb_str_shuffle($multibyte_string = "abcčćdđefghijklmnopqrsštuvwxyzžß,.-+'*?=)(/&%$#!~ˇ^˘°˛`˙´˝") {
$characters_array = mb_str_split($multibyte_string);
shuffle($characters_array);
return implode('', $characters_array); // or join('', $characters_array); if you have a death wish (JK)
}
Split string into an array of multibyte characters
Shuffle the good guy array who doesn't care about his residents being multibyte
Join the shuffled array together into a string
Of course I normally wouldn't have a default value for function's parameter.

How to use str_replace() to remove text a certain number of times only in PHP?

I am trying to remove the word "John" a certain number of times from a string. I read in the PHP manual that str_replace accepts a 4th parameter called "count". So I figured that it can be used to specify how many instances of the search should be removed. But that doesn't seem to be the case since the following:
$string = 'Hello John, how are you John. John are you happy with your life John?';
$numberOfInstances = 2;
echo str_replace('John', 'dude', $string, $numberOfInstances);
replaces all instances of the word "John" with "dude" instead of doing it just twice and leaving the other two Johns alone.
For my purposes it doesn't matter which order the replacement happens in, for example the first 2 instances can be replaced, or the last two or a combination, the order of the replacement doesn't matter.
So is there a way to use str_replace() in this way or is there another built in (non-regex) function that can achieve what I'm looking for?

As Artelius explains, the last parameter to str_replace() is set by the function. There's no parameter that allows you to limit the number of replacements.
Only preg_replace() features such a parameter:
echo preg_replace('/John/', 'dude', $string, $numberOfInstances);
That is as simple as it gets, and I suggest using it because its performance cost is negligible compared to the tedium of the following non-regex solution:
$len = strlen('John');
while ($numberOfInstances-- > 0 && ($pos = strpos($string, 'John')) !== false)
$string = substr_replace($string, 'dude', $pos, $len);
echo $string;
You can choose either solution though, both work as you intend.

You've misunderstood the wording of the manual.
If passed, this will be set to the number of replacements performed.
The parameter is passed by reference and its value is changed by the function to indicate how many times the string was found and replaced. Its initial value is discarded.
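A quick way to see this, using the string from the question:

```php
<?php
$string = 'Hello John, how are you John. John are you happy with your life John?';

$count  = 2;   // ignored as input: str_replace() overwrites it
$result = str_replace('John', 'dude', $string, $count);

echo $result, "\n";   // every "John" has been replaced
echo $count, "\n";    // 4, the number of replacements performed
```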

There are a few things you could do to achieve this, but I can't think of one specific php function that will easily let you do this.
One option is to create your own replace function and utilize strripos and substr to do the replaces.
Another thing you can do is use preg_replace_callback and count the number of replacements you have done in the callback.
There's probably more ways but that's all I can think of on the fly. If performance is an issue I suggest you give both a try and do some simple benchmarks.
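The preg_replace_callback idea could look roughly like this. It is a sketch, not a benchmarked solution, and replace_first_n is a made-up name:

```php
<?php
// Hypothetical helper: replace only the first $limit occurrences of $find.
function replace_first_n(string $find, string $replace, string $string, int $limit): string {
    $done = 0;
    return preg_replace_callback(
        '/' . preg_quote($find, '/') . '/',
        function (array $m) use ($replace, $limit, &$done) {
            // Replace while under the limit, otherwise keep the match as is.
            return $done++ < $limit ? $replace : $m[0];
        },
        $string
    );
}

$string = 'Hello John, how are you John. John are you happy with your life John?';
echo replace_first_n('John', 'dude', $string, 2), "\n";
```

Because the callback sees every match, the same pattern can be adapted to skip the first few occurrences or replace every other one.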

The cleanest, most-direct, single function call is to use preg_replace(). Its replacement limiting parameter makes the task intuitive and readable.
$string = preg_replace('/John/', 'dude', $string, $numberOfInstances);
The function is also attractive because making the search case-insensitive is as simple as adding the i pattern modifier to the end of the pattern. I won't delve into the usefulness of word boundaries (\b).
If a search string might contain characters with special meaning to the regex engine, then preg_quote() will be necessary -- this diminishes the beauty of the technique but not prohibitively so.
$search = '$5.99';
$pattern = '/' . preg_quote($search, '/') . '/';
$string = preg_replace($pattern, 'free', $string, $numberOfInstances);
For anyone who has an unnatural bias against regex functions, this can be done without regex and without looping -- it will be case-sensitive though.
Limited Explode & Implode:
$numberOfInstances = 2;
$string = 'Hello John, how are you John. John are you happy with your life John?';
// explode here -^^^^ and ---------^^^^ only to create the following array:
// 0 => 'Hello ',
// 1 => ', how are you ',
// 2 => '. John are you happy with your life John?'
echo implode('dude', explode('John', $string, $numberOfInstances + 1));
Output:
Hello dude, how are you dude. John are you happy with your life John?
Notice the explode's limiting parameter dictates how many elements are generated, not how many explosions are executed on the string.

function str_replace_occurrences($find, $replace, $string, $count = -1) {
// current occurrence
$current = 0;
// while any occurrence (strict comparison, so a match at position 0 still counts)
while (($pos = strpos($string, $find)) !== false) {
// update length of str (size of string is changing)
$len = strlen($find);
// found next one
$current++;
// check if we've reached our target
// -1 is used to replace all occurrence
if($current <= $count || $count == -1) {
// do replacement
$string = substr_replace($string, $replace, $pos, $len);
} else {
// we've reached our target
break;
}
}
return $string;
}

Artelius has already described how the function works, I'll just show you how to do this manually:
function str_replace_occurrences($find,$replace,$string,$count = 0)
{
if($count == 0)
{
return str_replace($find,$replace,$string);
}
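$offset = 0;
$len = strlen($find);
// count down the replacements left instead of comparing a position with $count
while($count-- > 0 && false !== ($pos = strpos($string,$find,$offset)))
{
$string = substr_replace($string,$replace,$pos,$len);
// continue searching after the text that was just inserted
$offset = $pos + strlen($replace);
}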
return $string;
}
This is untested but should work.

Text Obfuscation using base64_encode()

I'm playing around with encrypt/decrypt coding in php. Interesting stuff!
However, I'm coming across some issues involving what text gets encrypted into.
Here's 2 functions that encrypt and decrypt a string. It uses an Encryption Key, which I set as something obscure.
I actually got this from a PHP book. I modified it slightly, but not enough to change its main goal.
I created a small example below that anyone can test.
But, I notice that some characters show up as the "encrypted" string. Characters like "=" and "+".
Sometimes I pass this encrypted string via the url. Which may not quite make it to my receiving scripts. I'm guessing the browser does something to the string if certain characters are seen. I'm really only guessing.
Is there another function I can use to ensure the browser doesn't touch the string? Or does anyone know enough about PHP's base64_encode() to disallow certain characters from being used? I'm really not going to expect the latter as a possibility. But I'm sure there's a work-around.
Enjoy the code, whoever needs it!
define('ENCRYPTION_KEY', "sjjx6a");
function encrypt($string) {
$result = '';
for($i=0; $i<strlen($string); $i++) {
$char = substr($string, $i, 1);
$keychar = substr(ENCRYPTION_KEY, ($i % strlen(ENCRYPTION_KEY))-1, 1);
$char = chr(ord($char)+ord($keychar));
$result.=$char;
}
return base64_encode($result)."/".rand();
}
function decrypt($string){
$exploded = explode("/",$string);
$string = $exploded[0];
$result = '';
$string = base64_decode($string);
for($i=0; $i<strlen($string); $i++) {
$char = substr($string, $i, 1);
$keychar = substr(ENCRYPTION_KEY, ($i % strlen(ENCRYPTION_KEY))-1, 1);
$char = chr(ord($char)-ord($keychar));
$result.=$char;
}
return $result;
}
echo $encrypted = encrypt("reaplussign.jpg");
echo "<br>";
echo decrypt($encrypted);

You could use PHP's urlencode and urldecode functions to make your encryption results safe for use in URLs, e.g.:
echo $encrypted = urlencode(encrypt("reaplussign.jpg"));
echo "<br>";
echo decrypt(urldecode($encrypted));

You should look at urlencode() to escape the string correctly for use in the query.

If you are worried about +, = and similar characters, you should have a look at http://php.net/manual/en/function.urlencode.php and its friends from the "See also" section. Encode it in encrypt() and decode at the beginning of decrypt().
If this doesn't work for you, maybe some simple substitution? Note that a literal + has to become %2B (%20 is a space):
$text = str_replace('+','%2B',$text);
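Another common substitution is the URL-safe base64 alphabet, which swaps + and / for - and _ and drops the = padding, so nothing in the string needs escaping at all. A sketch only; base64url_encode and base64url_decode are not built-in PHP functions, they are helpers defined here.

```php
<?php
// Hypothetical helper: base64 with "-" and "_" instead of "+" and "/",
// and without the trailing "=" padding.
function base64url_encode(string $data): string {
    return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

// Hypothetical helper: restore the padding and the standard alphabet, then decode.
function base64url_decode(string $data): string {
    $padded = str_pad($data, strlen($data) + (4 - strlen($data) % 4) % 4, '=');
    return base64_decode(strtr($padded, '-_', '+/'));
}

$raw   = "\xfb\xff\xfe binary + data?";
$token = base64url_encode($raw);

echo $token, "\n";
var_dump(base64url_decode($token) === $raw);
```

Since the token then never contains a /, it also cannot clash with the "/" separator that encrypt() in the question puts before the random number.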
