PHP method for stripping duplicate chars from a multibyte string?


Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.

Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
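Note that the pattern above only collapses runs of adjacent repeats, which is why 漢 and 字 still appear several times in the output. A minimal sketch of a regex that instead drops every character that occurs again later in the string (so each character survives once, at its last position) might look like this. The function name is made up for illustration, and valid UTF-8 input is assumed:

```php
<?php
// Hypothetical helper: keep exactly one instance of each character.
// (.) captures one character; the lookahead (?=.*\1) succeeds when the same
// character occurs again further along, and in that case the match is removed.
// u = UTF-8 mode, s = let . match newlines as well.
function mb_unique_chars_regex(string $input): string
{
    $result = preg_replace('/(.)(?=.*\1)/us', '', $input);
    if ($result === null) {
        // preg_replace() returns null on errors such as invalid UTF-8.
        throw new RuntimeException('preg_replace failed: ' . preg_last_error());
    }
    return $result;
}

echo mb_unique_chars_regex('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
echo mb_unique_chars_regex('aaabcccbbc'), "\n";                             // abc
```

The lookahead rescans the rest of the string for every character, so on long inputs the array-based approach is probably the better choice.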
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
    $char = mb_substr($not_kanji, $i, 1);
    if(!array_key_exists($char, $unique))
        $unique[$char] = 0;
    $unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of the fact that PHP arrays are ordered: keys stay in the order in which they were first inserted. Once we've gone through the string and identified all of the characters, we grab the keys and join 'em back together in the order in which they first appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
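For readers on newer PHP versions: mb_str_split was added in PHP 7.4, so the loop can be replaced by a split followed by array_unique. A short sketch, assuming PHP 7.4+ with the mbstring extension (the function name here is only illustrative):

```php
<?php
// Assumes PHP 7.4+ with mbstring, where mb_str_split() is available.
function mb_unique_chars(string $input, string $encoding = 'UTF-8'): string
{
    // Split into an array of single characters (not bytes).
    $chars = mb_str_split($input, 1, $encoding);
    // array_unique() keeps the first occurrence of each value, in order.
    return implode('', array_unique($chars));
}

echo mb_unique_chars('aaabbbbcdddbbbbccgggcdddeeedddaaaffff'), "\n"; // abcdgef
echo mb_unique_chars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
```

Unlike the loop above, this version does not give you the per-character counts.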
(No Kanji example here, I'm experiencing a copy/paste bug.)
Here, try this on for size:
function mb_count_chars_kinda($input) {
    $l = mb_strlen($input);
    $unique = array();
    for($i = 0; $i < $l; $i++) {
        $char = mb_substr($input, $i, 1);
        if(!array_key_exists($char, $unique))
            $unique[$char] = 0;
        $unique[$char]++;
    }
    return $unique;
}
function mb_string_chars_diff($one, $two) {
    $left = array_keys(mb_count_chars_kinda($one));
    $right = array_keys(mb_count_chars_kinda($two));
    return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
    [5] => f
    [6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
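The "call it twice" step can be wrapped up in one place. The following is only a sketch with hypothetical helper names; it splits with preg_split in UTF-8 mode rather than the mb_substr loop, and assumes valid UTF-8 input:

```php
<?php
// Hypothetical helper: the distinct characters of a UTF-8 string.
function unique_chars_of(string $s): array
{
    return array_unique(preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY));
}

// Hypothetical helper: run array_diff() in both directions.
function mb_string_chars_diff_both(string $one, string $two): array
{
    $left  = unique_chars_of($one);
    $right = unique_chars_of($two);
    return [
        // array_values() reindexes, since array_diff() preserves original keys.
        'only_in_first'  => array_values(array_diff($left, $right)),
        'only_in_second' => array_values(array_diff($right, $left)),
    ];
}

print_r(mb_string_chars_diff_both('aabbccddeeffgg', 'abcdeh'));
// only_in_first holds f and g, only_in_second holds h
```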

Take a look at the iconv_strlen PHP function. I can't speak for East Asian encodings, but it works fine for Western and Eastern European languages. In any case, it gives you another option.

$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
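Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way. (Note that str_split works on bytes, so this only holds for single-byte text; multibyte characters would be split apart.)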

Related

How to remove elements from an associative array in PHP

I am trying to read all the words in HTML documents locally. I have a loop which does it for me. I have created an array which holds the unwanted characters. I do not want those special unwanted characters to be in my word array. I have tried the following code but nothing changed.
$rii = new RecursiveIteratorIterator(new RecursiveDirectoryIterator('fulltext/course'));
$fulltext_course_files = array();
$unwantedChars = array('', ' ', '"', '!', '\'', '+', '%', '&', '/', '(', ')', '=',
    '*', '.', ',', '?', '-', '_', ':', ';', '\\');
foreach ($rii as $f) {
    if ($f->isDir()) {
        continue;
    } else {
        $st = strip_tags(strtolower(file_get_contents($f)));
        $swc = deleteStopWords(str_word_count($st, 1));
        if (!in_array($st, $unwantedChars)) {
            $fulltext_course_files[$f->getFilename()] = array_count_values($swc);
        }
    }
}
I still see dashes, empties ('') when I var_dump($arr);
array (size=230)
'4.html' =>
array (size=50)
'-' => int 7 ??
'cs' => int 1
'page' => int 1
'systems' => int 2
'programming' => int 1
'' => int 12 ??
'operating' => int 2
...
What can I do to remove the elements marked with ??
Edit 1
A better solution is to prevent unwanted characters from entering the array in the first place, as @David suggests. I tried changing the if condition from
if (!in_array($st, $unwantedChars))
to
if (!in_array($f->getFilename(), $unwantedChars))
Nothing changed. The unwanted keys are still there.
Edit 2
I have tried the following also:
foreach ($fulltext_course_files as $key => $val) {
    if (in_array($key, $unwantedChars)) {
        unset($fulltext_course_files[$key]);
    }
}
Again, no help!

You can use unset: http://php.net/manual/en/function.unset.php
unset($array['mykey']);

Not sure what $f->getFilename() does, but wouldn't it be easier to test it against your characters?
if (!in_array($f->getFilename(), $unwantedChars)) {
    $fulltext_course_files[$f->getFilename()] = array_count_values($swc);
}
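One thing worth noting about the attempts so far: they test either the whole file text ($st) or the file name against the list, and the Edit 2 loop tests the outer keys (file names again), so none of them ever looks at the individual words. A sketch of filtering the word list itself before counting, using stand-in data in place of the question's $swc and a shortened copy of $unwantedChars:

```php
<?php
// Shortened, illustrative copy of the question's $unwantedChars list.
$unwanted = array('', ' ', '-', '_', '.', ',');
// Stand-in for the $swc word list of a single file.
$words = array('-', 'cs', 'page', '', 'systems', '-', 'systems');

// Drop the unwanted tokens from the word list first, then count what is left.
$counts = array_count_values(array_diff($words, $unwanted));
print_r($counts); // cs => 1, page => 1, systems => 2
```

If the stray keys survive this, they are probably look-alike characters with different code points rather than the ASCII ones in the list.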

Instead of using in_array to search for unwanted characters, you could store them all in a string, and use strchr on it: it's basically equivalent to what you wrote, but with a string for storage rather than an array, which should be faster. That said...
My guess is that the unwanted characters still remaining in your final array are actually characters that look like the normal punctuation characters but have a different code point (the integer value corresponding to that character). Could it be that your document uses an encoding with several different dash and double-quote characters, such as UTF-8? If that's the case, you'll have a hard time filtering out all the noise that way to keep only alphabetic characters. However, if you use a whitelisting scheme (i.e., check for good characters rather than bad ones), you should be able to keep only the characters you're interested in. Luckily, there are functions to help you do that: ctype_alpha for alphabetic characters only, and ctype_alnum for alphanumeric ones. The Ctype extension they belong to is enabled in most PHP installations.
Here's a quick implementation:
function get_word_count($content){
    $words = array();
    $b = 0;
    $len = strlen($content);
    for ($i = 0; $i < $len; $i++){
        $c = $content[$i];
        if (!ctype_alnum($c)){
            if ($b < $i){
                $w = strtolower(substr($content, $b, $i - $b));
                if (isset($words[$w]))
                    $words[$w]++;
                else $words[$w] = 1;
            }
            $b = $i + 1;
        }
    }
    return $words;
}
Beware that:
because it only accepts ASCII alphanumeric characters, you will not be able to index non-English words.
even in that context, there are compound words you'd probably want to treat as one, for instance you're or step-wise. This function will not help you with that. If you need a more robust approach, I suggest that you look into existing natural language processing toolkits for PHP (your search engine of choice will turn up several projects).
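If non-English words do matter, one possible variation on the whitelist idea is to let PCRE's Unicode character classes decide what counts as a word character. This is only a sketch under the assumption that the input is valid UTF-8 and that mbstring is available; the function name is invented for the example:

```php
<?php
// Sketch: \p{L} matches any Unicode letter, \p{N} any Unicode number,
// so accented and non-Latin words are kept while punctuation is skipped.
function get_word_count_unicode(string $content): array
{
    $lower = mb_strtolower($content, 'UTF-8');
    preg_match_all('/[\p{L}\p{N}]+/u', $lower, $matches);
    // $matches[0] holds every full match, i.e. every word found.
    return array_count_values($matches[0]);
}

// \u{2013} is an en dash, one of the look-alike dashes mentioned above.
print_r(get_word_count_unicode("Système d'exploitation \u{2013} système"));
// système => 2, d => 1, exploitation => 1
```

Compound words such as you're are still split at the apostrophe, so the second caveat above applies here too.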

Help with PHP and multibyte characters

I have a problem that I thought would be simple but it's turning out to be quite complex.
I have a long UTF-8 string that is a mix of Roman, Western-European, Japanese, and Korean characters and punctuation. Many are multibyte chars, but some (I think) are not.
I need to do 2 things:
Make sure there are no duplicate chars (and output that new string, stripped of dupes).
Randomly shuffle that new string.
(Sorry, I can't seem to get the code quoting to format right...)
function uniquechars($string) {
    $l = mb_strlen($string);
    $unique = array();
    for($i = 0; $i < $l; $i++) {
        $char = mb_substr($string, $i, 1);
        if(!array_key_exists($char, $unique))
            $unique[$char] = 0;
        $unique[$char]++;
    }
    $uniquekeys = join('', array_keys($unique));
    return $uniquekeys;
}
and:
function unicode_shuffle($string)
{
    $len = mb_strlen($string);
    $sploded = array();
    while($len-- > 0) {
        $sploded[] = mb_substr($string, $len, 1);
    }
    shuffle($sploded);
    $shuffled = join('', $sploded);
    return $shuffled;
}
Using those two functions, which someone very helpfully provided, I THOUGHT I was all set...except that curiously, it seems like the Unique string (no duplicates) and the Shuffled string do not contain the same number of characters. (I am highlighting these chars from my browser and then cutting-and-pasting into another application...one string is always a different length than the one above, but often it varies...it's not even the same number of chars getting truncated each time!).
I'm sorry I don't know enough about PHP nor about coding to sleuth this myself but what on earth is going wrong here? It seems like it should be easy to just shuffle a big long string, but apparently it's much harder than I thought. Is there maybe another, easier way to do this? Should I convert the string first into respective hex numbers and shuffle those, then convert back to UTF-8? Should I output to a file rather than the screen?
Anyone out there have suggestions? I'm sorry, I'm very new to this, so possibly I'm just doing something really dumb.

You can probably do things a lot simpler.
Here's a function to get only the unique characters in a string:
// returns an array of unique characters from a given string
function getUnique( $string ) {
    $chars = preg_split( '//', $string, -1, PREG_SPLIT_NO_EMPTY );
    $unique = array_unique( $chars );
    return $unique;
}
Then, if you want to reshuffle the order, just pass the array of unique chars to shuffle. It reorders the array in place and returns only true or false, so don't assign its result:
shuffle( $unique );
Edit: For multi-byte characters, this function should do the trick (thanks to http://php.net/manual/en/function.mb-split.php for helping with the regex):
function getUnique( $string ) {
    $chars = preg_split( '/(?<!^)(?!$)/u', $string );
    $unique = array_unique( $chars );
    return $unique;
}
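Putting the two steps together, and counting characters in code rather than by copying from the browser, might look like the sketch below. The function name and sample string are made up, and valid UTF-8 input is assumed:

```php
<?php
// Sketch: strip duplicate characters, then shuffle what is left.
function unique_shuffled(string $input): string
{
    $chars = array_unique(preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY));
    shuffle($chars); // reorders in place; the return value is just a bool
    return implode('', $chars);
}

$input  = 'aäあ한aäあ한!!';
$result = unique_shuffled($input);

echo $result, "\n";                      // order varies from run to run
echo mb_strlen($result, 'UTF-8'), "\n";  // 5 characters: a ä あ 한 !
```

Checking the length with mb_strlen on both strings is a more reliable test than selecting text in a browser, where line breaks and rendering can change what gets copied.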

php true multi-byte string shuffle function?

I have a unique problem with multibyte character strings and need to be able to shuffle, with some fair degree of randomness, a long UTF-8 encoded multibyte string in PHP without dropping or losing or repeating any of the characters.
In the PHP manual under str_shuffle there is a multi-byte function (the first user submitted one) that doesn't work: If I use a string with for example all the Japanese hiragana and katakana of string length (ex) 120 chars, I am returned a string that's 119 chars or 118 chars. Sometimes I've seen duplicate chars even though the original string doesn't have them. So that's not functional.
To make this more complex, I also need to include if possible Japanese UTF-8 newlines and line feeds and punctuation.
Can anyone with experience dealing in multiple languages with UTF-8 mb strings help? Does PHP have any built in functions to do this? str_shuffle is EXACTLY what I want. I just need it to also work on multibyte chars.
Thanks very much!

Try splitting the string using mb_strlen and mb_substr to create an array, then using shuffle before joining it back together again. (Edit: As also demonstrated in @Frosty Z's answer.)
An example from the PHP interactive prompt:
php > $string = "Pretend I'm multibyte!";
php > $len = mb_strlen($string);
php > $sploded = array();
php > while($len-- > 0) { $sploded[] = mb_substr($string, $len, 1); }
php > shuffle($sploded);
php > echo join('', $sploded);
rmedt tmu nIb'lyi!eteP
You'll want to be sure to specify the encoding, where appropriate.
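To illustrate why the encoding matters: the mb_* functions fall back to the configured internal encoding when none is passed, and the wrong encoding gives the wrong character count. A small sketch, assuming the source file itself is saved as UTF-8:

```php
<?php
$string = '漢字私'; // three characters, three bytes each in UTF-8

echo strlen($string), "\n";                   // 9 (bytes)
echo mb_strlen($string, 'UTF-8'), "\n";       // 3 (characters)
echo mb_strlen($string, '8bit'), "\n";        // 9: wrong encoding, wrong count
echo mb_substr($string, 1, 1, 'UTF-8'), "\n"; // 字
```

Passing 'UTF-8' explicitly to every mb_* call (or calling mb_internal_encoding('UTF-8') once up front) avoids depending on the server's configuration.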

This should do the trick, too. I hope.
// Named MbString here because "String" is a reserved class name as of PHP 7.
class MbString
{
    public function mbStrShuffle($string)
    {
        $chars = $this->mbGetChars($string);
        shuffle($chars);
        return implode('', $chars);
    }
    public function mbGetChars($string)
    {
        $chars = [];
        // Pass the encoding to mb_strlen too, so it matches mb_substr below.
        for($i = 0, $length = mb_strlen($string, 'UTF-8'); $i < $length; ++$i)
        {
            $chars[] = mb_substr($string, $i, 1, 'UTF-8');
        }
        return $chars;
    }
}

I like to use this function:
function mb_str_shuffle($multibyte_string = "abcčćdđefghijklmnopqrsštuvwxyzžß,.-+'*?=)(/&%$#!~ˇ^˘°˛`˙´˝") {
    $characters_array = mb_str_split($multibyte_string);
    shuffle($characters_array);
    return implode('', $characters_array); // or join('', $characters_array); if you have a death wish (JK)
}
Split string into an array of multibyte characters
Shuffle the good guy array who doesn't care about his residents being multibyte
Join the shuffled array together into a string
Of course I normally wouldn't have a default value for function's parameter.

php outputting strange character

I have the following code to generate a random password string:
<?php
$password = '';
for($i=0; $i<10; $i++) {
    $chars = array(
        'lower' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'),
        'upper' => array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'),
        'num' => array('1','2','3','4','5','6','7','8','9','0'),
        'sym' => array('!','£','$','%','^','&','*','(',')','-','=','+','{','}','[',']',':','#','~',';','#','<','>','?',',','.','/'));
    $set = rand(1, 4);
    switch($set) {
        case 1:
            $set = 'lower';
            break;
        case 2:
            $set = 'upper';
            break;
        case 3:
            $set = 'num';
            break;
        case 4:
            $set = 'sym';
            break;
    }
    $count = count($chars[$set]);
    $digit = rand(0, ($count-1));
    $output = $chars[$set][$digit];
    $password.= $output;
}
echo $password;
?>
However, every now and then one of the characters it outputs is a capital A with a ^ above it. French or something. How is this possible? It can only pick what's in my arrays!

The only non-ascii character is the pound character, so my guess is that it has to do with this.
First off, it's probably a good idea to avoid that one, as not many people will be able to easily type it.
Good chance that the encoding of your php file (or the encoding set by your editor) is not the same as your output encoding.

Are you sure it is indeed a character not in your array, or is the browser just unable to output? For example your monetary pound sign. Ensure that both PHP, DB, and HTML output all use the same encoding.
On a separate note, your loop is slightly more complicated than it needs to be. I typically see password generators randomize a string versus several arrays. A quick example:
$chars = "abcdefghijkABCDEFG1289398$%#^&";
$pos = rand(0, strlen($chars) - 1);
$password .= $chars[$pos];

I think you are generating special HTML characters; compare your output against an ISO 8859-1 table.

You may be seeing the byte sequence C2 A3, appearing as your capital A with a circumflex followed by a pound symbol. This is because C2A3 is the UTF-8 sequence for a pound sign. As such, if you've managed to enter the UTF-8 character in your PHP file (possibly without noticing it, depending on your editor and environment) you'd see the separate byte sequence as output if your environment is then ASCII / ISO8859-1 or similar.
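This is easy to reproduce. The sketch below uses a "\u{...}" escape (PHP 7+) so it does not depend on how the script file itself is saved, and mb_convert_encoding to mimic a viewer that reads the UTF-8 bytes as ISO-8859-1:

```php
<?php
$pound = "\u{00A3}";         // the pound sign, encoded as UTF-8
echo bin2hex($pound), "\n";  // c2a3

// Treat those two bytes as ISO-8859-1, the way a mislabelled page would,
// and convert the result to UTF-8 so it can be displayed.
$misread = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";         // Â£
```

Byte C2 is Â in ISO-8859-1 and byte A3 is £, which matches the stray capital A with a circumflex described in the question.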

As per Jason McCreary, I use this function for such Password Creation
function randomString($length) {
    $characters = "0123456789abcdefghijklmnopqrstuvwxyz" .
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&";
    $string = '';
    for ($p = 0; $p < $length; $p++)
        // The upper bound is strlen - 1: mt_rand is inclusive, and the last valid index is one less than the length.
        $string .= $characters[mt_rand(0, strlen($characters) - 1)];
    return $string;
}
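Since these strings are meant to be passwords, a variant built on random_int (available since PHP 7.0 and backed by a cryptographically secure source) may be worth considering. This is a sketch rather than a vetted generator; the function name is invented, and the character set is kept ASCII-only so that indexing by byte is safe:

```php
<?php
// Sketch of a password generator using random_int() instead of mt_rand().
function random_password(int $length): string
{
    // ASCII only, so $characters[$i] always yields one whole character.
    $characters = '0123456789abcdefghijklmnopqrstuvwxyz'
                . 'ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&';
    $max = strlen($characters) - 1; // last valid index
    $password = '';
    for ($i = 0; $i < $length; $i++) {
        $password .= $characters[random_int(0, $max)];
    }
    return $password;
}

echo random_password(10), "\n"; // output differs on every run
```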

The pound symbol (£) is what is breaking, since it is not part of the basic ASCII character set.
You need to do one of the following:
Drop the pound symbol (this will also help people using non-UK keyboards!)
Convert the pound symbol to an HTML entity when outputting it to the site (&pound;)
Set your site's character set encoding to UTF-8, which will allow extended characters to be displayed. This is probably the best option in the long run, and should be fairly quick and easy to achieve.

php's range function behavior

PHP.net's documentation on the range function is a little lacking. These functions produce unexpected (to me anyways) results when given character ranges.
$m = range('A','z');
print_r($m);
$m = range('~','"');
print_r($m);
I'm looking for a reference that might explicitly define its behavior.

The issue is that range treats its arguments like integers, and if you give it a single character it will convert it to its ASCII character code.
In the first case, you're getting all characters between character 'A' (integer 65) and character 'z' (integer 122). This is expected behavior for those of us coming from a C (or C-like language) background.
This is one of the rare cases where PHP converts single characters to their ASCII codes rather than parsing the string as integer the way it does normally. Most of the PHP documentation is better at telling you when to expect this. strpos for example, notes:
Needle
If needle is not a string, it is converted to an integer and applied as the ordinal value of a character.
The documentation for range is strangely quiet about it.

Consider:
foreach (range('A','z') as $c)
echo $c."\n";
to be equivalent to:
for ($i = ord('A'); $i <= ord('z'); ++$i)
echo chr($i)."\n";
Likewise, your second example is equivalent to (since ord('~') > ord('"')):
for ($i = ord('~'); $i >= ord('"'); --$i)
echo chr($i)."\n";
It's not well documented, but that's how it is supposed to work.
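A quick way to see this for yourself is to count the elements and look at what sits between 'Z' and 'a'. The snippet below is just an illustration of the behaviour described above:

```php
<?php
$m = range('A', 'z');
echo count($m), "\n";   // 58, i.e. ord('z') - ord('A') + 1
echo $m[26], "\n";      // [ : the first of the six characters between 'Z' and 'a'

// For letters only, build the two ranges separately and merge them.
$letters = array_merge(range('A', 'Z'), range('a', 'z'));
echo count($letters), "\n"; // 52

// A start greater than the end produces a descending range.
$down = range('~', '"');
echo $down[0], $down[count($down) - 1], "\n"; // ~"
```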

That is because " has a lower character code than ~. Try:
$m = range('A','z'); print_r($m);
$m = range('z','A'); print_r($m);
The characters are selected by their chr (ASCII table) values:
http://www.asciitable.com/
The array is returned in the direction given by the order of the two parameters.
