Unicode (UTF8) string word count in PHP


I need to have the word count of the following unicode string. Using str_word_count:
$input = 'Hello, chào buổi sáng';
$count = str_word_count($input);
echo $count;
the result is
7
which is apparently wrong.
How to get the desired result (4)?

$tags = 'Hello, chào buổi sáng';
$word = explode(' ', $tags);
echo count($word);
Here's a demo: http://codepad.org/667Cr1pQ

Here is a quick and dirty regex-based (using Unicode) word counting function:
function mb_count_words($string) {
preg_match_all('/[\pL\pN\p{Pd}]+/u', $string, $matches);
return count($matches[0]);
}
A "word" is anything that contains one or more of:
Any alphabetic letter
Any digit
Any hyphen/dash
This would mean that the following contains 5 "words" (4 normal, 1 hyphenated):
echo mb_count_words('Hello, chào buổi sáng, chào-sáng');
This function is not well suited to very large texts, though it should handle most of what counts as a block of text on the internet. The problem is that preg_match_all has to build and populate a large array that is thrown away as soon as it has been counted, which is wasteful. A more efficient approach would be to walk through the text character by character, identify Unicode whitespace sequences, and increment a counter. That is not difficult, but it is tedious to write.
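One way to avoid the throwaway array, offered as a sketch rather than a benchmarked fix: since PHP 5.4 the third argument of preg_match_all is optional, and the function returns the number of full matches, so the count can be read from the return value. The name count_words_unicode is hypothetical, and \pM is an addition to the character class (an assumption, not part of the answer's pattern) so that combining marks in decomposed text stay inside a word.

```php
<?php
// Hypothetical helper: counts runs of letters, marks, digits and dashes.
function count_words_unicode(string $string): int
{
    // Without a $matches argument, only the number of matches is returned.
    $count = preg_match_all('/[\pL\pM\pN\p{Pd}]+/u', $string);

    // preg_match_all returns false on error (for example invalid UTF-8).
    return $count === false ? 0 : $count;
}

echo count_words_unicode('Hello, chào buổi sáng, chào-sáng'), "\n"; // prints 5
```

Like the original, this treats a standalone dash as a word, so "a - b" counts as three.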

You can use this function to count Unicode words in a given string:
function count_unicode_words( $unicode_string ){
// First remove all punctuation marks and digits
$unicode_string = preg_replace('/[[:punct:][:digit:]]/u', '', $unicode_string);
// Now replace every whitespace character (tabs, new lines, spaces) with a single space
$unicode_string = preg_replace('/[[:space:]]/u', ' ', $unicode_string);
// The words are now separated by spaces and can be split into an array
// \n\r\t are included here as well, but a space alone would also suffice
$words_array = preg_split( "/[\n\r\t ]+/", $unicode_string, 0, PREG_SPLIT_NO_EMPTY );
// Now we can get the word count by counting the array elements
return count($words_array);
}
All credits go to the author.

I'm using this code to count words. You can try this:
$s = 'Hello, chào buổi sáng';
$s1 = array_map('trim', explode(' ', $s));
$s2 = array_filter($s1, function($value) { return $value !== ''; });
echo count($s2);
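The explode-based answers split on a single space character only, so a tab or line break between two words would leave them joined. A minimal sketch that splits on any run of whitespace instead, assuming the input is valid UTF-8 (count_tokens is a hypothetical name):

```php
<?php
// Hypothetical helper: counts whitespace-separated tokens in a UTF-8 string.
function count_tokens(string $s): int
{
    // \s+ treats any run of whitespace as one separator;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces produced by
    // leading or trailing whitespace.
    $tokens = preg_split('/\s+/u', $s, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on error (for example invalid UTF-8).
    return $tokens === false ? 0 : count($tokens);
}

echo count_tokens("Hello,  chào\tbuổi\nsáng"), "\n"; // prints 4
```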

Related

Explode string on second last and last space to create exactly 3 elements

I want to split a space-delimited string by its spaces, but I need the total elements in the result array to be exactly 3 AND if the string has more than two spaces, only the last two spaces should be used as delimiters.
My input strings follow a predictable format. The strings are one or more words, then a word, then a parenthetically wrapped word (word in this context is a substring with no whitespaces in it).
Sample strings:
Stack Over Flow Abcpqr (UR) becomes: ["Stack Over Flow", "Abcpqr", "(UR)"]
Fluency in English Conversation Defklmno (1WIR) becomes: ["Fluency in English Conversation", "Defklmno", "(1WIR)"]
English Proficiency GHI (2WIR) becomes: ["English Proficiency", "GHI", "(2WIR)"]
Testing ADG (3WIR) becomes: ["Testing", "ADG", "(3WIR)"]
I used the following code, but it is only good for Testing (3WIR).
$Original = $row['fld_example'];
$OriginalExplode = explode(' ', $Original);
<input name="example0" id="example0" value="<?php echo $OriginalExplode[0]; ?>" type="text" autocomplete="off" required>
<input name="example1" id="example1" value="<?php echo $OriginalExplode[1]; ?>" type="text" autocomplete="off" required>
Basically, I just need to explode the string on spaces, starting from the end of the string, and limit the total explosions to 2 (to make 3 elements).

You can approach this using explode and str_replace
$string = "Testing (3WIR)";
$stringToArray = explode(":",str_replace("(",":(",$string));
echo '<pre>';
print_r($stringToArray);
Answer for the edited question:
$subject = "Fluency in English Conversation Defklmno (1WIR)";
$toArray = explode(' ',$subject);
if(count($toArray) > 2){
$first = implode(" ",array_slice($toArray, 0,count($toArray)-2));
$second = $toArray[count($toArray)-2];
$third = $toArray[count($toArray)-1];
$result = array_values(array_filter([$first, $second, $third]));
}else{
$result = array_values(array_filter(explode(":",str_replace("(",":(",$subject))));
}

I am not a fan of regular expressions, but this one seems to work very fine:
Regex to split a string only by the last whitespace character
So the PHP code would be:
function splitAtLastWord($sentence)
{
return preg_split("/\s+(?=\S*+$)/", $sentence);
}
$sentence = "Fluency in English Conversation Defklmno (1WIR)";
list($begin, $end) = splitAtLastWord($sentence);
list($first, $middle) = splitAtLastWord($begin);
$result = [$first, $middle, $end];
echo "<pre>" . print_r($result, TRUE) . "</pre>";
The output is:
Array
(
[0] => Fluency in English Conversation
[1] => Defklmno
[2] => (1WIR)
)
You can also write the same function without a regular expression:
function splitAtLastWord($sentence)
{
$words = explode(" ", $sentence);
$last = array_pop($words);
return [implode(" ", $words), $last];
}
Which is, to be honest, a better way of doing this.
This is a computationally more efficient way to do it:
function splitAtLastWord($sentence)
{
$lastSpacePos = strrpos($sentence, " ");
return [substr($sentence, 0, $lastSpacePos), substr($sentence, $lastSpacePos + 1)];
}
It looks a bit less nice but it is faster.
Anyway, defining a separate function like this is useful, you can reuse it in other places.
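The strrpos version above assumes the sentence contains at least one space; when it does not, strrpos returns false and the two substr calls no longer line up. A guarded sketch, with hypothetical names splitAtLastSpace and splitLastTwo, and assuming PHP 7.1+ for the short list syntax:

```php
<?php
// Hypothetical variant of splitAtLastWord() that handles a missing space.
function splitAtLastSpace(string $sentence): array
{
    $pos = strrpos($sentence, ' ');
    if ($pos === false) {
        // No space at all: treat the whole string as the last word.
        return ['', $sentence];
    }
    return [substr($sentence, 0, $pos), substr($sentence, $pos + 1)];
}

// Applying the split twice peels off the last two words.
function splitLastTwo(string $sentence): array
{
    [$begin, $end] = splitAtLastSpace($sentence);
    [$first, $middle] = splitAtLastSpace($begin);
    return [$first, $middle, $end];
}

print_r(splitLastTwo('Fluency in English Conversation Defklmno (1WIR)'));
```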

To isolate the two delimiting spaces, use / (?=(?:\S+ )?\()/ which leverages a lookahead containing an optional group.
Code:
$strings = [
'Stack Over Flow Abcpqr (UR)',
'Fluency in English Conversation Defklmno (1WIR)',
'English Proficiency GHI (2WIR)',
'Testing ADG (3WIR)',
];
foreach ($strings as $string) {
echo json_encode(
preg_split('/ (?=(?:\S+ )?\()/', $string)
) . "\n";
}
Output:
["Stack Over Flow","Abcpqr","(UR)"]
["Fluency in English Conversation","Defklmno","(1WIR)"]
["English Proficiency","GHI","(2WIR)"]
["Testing","ADG","(3WIR)"]
Pattern Breakdown:
[space]      #match a literal space (the first character of the pattern)
(?=          #start lookahead
  (?:\S+ )?  #optionally match one or more non-whitespaces followed by a space
  \(         #match a literal opening parenthesis
)            #end lookahead
When matching the first delimiting space, the optional subpattern will match characters. When matching the second delimiting space (before the parenthesis), the optional subpattern will not match any characters.
As a more generic solution, if the goal was to split on the space before either of the last two non-whitespace substrings, this pattern looks ahead in the same fashion but matches all the way to the end of the string.
/ (?=(?:\S+ )?\S+$)/
While I don't find non-regex solutions to be anywhere near as elegant or concise, here is one way to explode on all spaces and then implode all elements except the last two:
function implodeNotLastTwoElements($string) {
$array = explode(' ', $string);
array_splice($array, 0, -2, implode(' ', array_slice($array, 0, -2)));
return $array;
}
foreach ($strings as $string) {
echo json_encode(implodeNotLastTwoElements($string)) . "\n";
}
Or:
function implodeNotLastTwoElements($string) {
$array = explode(' ', $string);
return [implode(' ', array_slice($array, 0, -2))] + array_slice($array, -3);
}
These non-regex approaches are iterating/scanning over the data 4 times versus regex only scanning the input string once and directly creating the desired result. The decision between regex or non-regex is a no-brainer for me in this case.
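A third regex option, shown only as a sketch: capture the three parts directly with preg_match instead of splitting. The helper name splitIntoThree and the choice to return null for strings with fewer than three parts are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: returns [leading words, second-to-last word, last word],
// or null when the string has fewer than three space-separated parts.
function splitIntoThree(string $string): ?array
{
    // (.+) is greedy, so it gives back only as much as is needed
    // to leave two final non-whitespace words for the other groups.
    if (preg_match('/^(.+) (\S+) (\S+)$/', $string, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2], $m[3]];
}

echo json_encode(splitIntoThree('Stack Over Flow Abcpqr (UR)')), "\n";
// prints ["Stack Over Flow","Abcpqr","(UR)"]
```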

php regex replace each character with asterisk

I am trying to do something like this: hide user names except for the first 3 characters.
For example:
apple -> app**
google -> goo***
abc12345 ->abc*****
I am currently using php like this:
$string = "abcd1234";
$regex = '/(?<=^(.{3}))(.*)$/';
$replacement = '*';
$changed = preg_replace($regex,$replacement,$string);
echo $changed;
and the result be like:
abc*
But I want to make a replacement to every single character except for first 3 - like:
abc*****
How should I do?

Don't use regex, use substr_replace:
$var = "abcdef";
$charToKeep = 3;
echo strlen($var) > $charToKeep ? substr_replace($var, str_repeat ( '*' , strlen($var) - $charToKeep), $charToKeep) : $var;
Keep in mind that regular expressions are good for matching patterns in strings, but there are a lot of functions already designed for string manipulation.
Will output:
abc***

Try this function. You can specify how many characters should be visible and which character should be used as the mask:
$string = "abcd1234";
echo hideCharacters($string, 3, "*");
function hideCharacters($string, $visibleCharactersCount, $mask)
{
if(strlen($string) < $visibleCharactersCount)
return $string;
$part = substr($string, 0, $visibleCharactersCount);
return str_pad($part, strlen($string), $mask, STR_PAD_RIGHT);
}
Output:
abc*****
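Both strlen and substr count bytes, so with UTF-8 input that contains non-ASCII characters the cut can land in the middle of a character and the number of mask characters will be too high. A multibyte-aware sketch, assuming the mbstring extension is available (maskAfter is a hypothetical name):

```php
<?php
// Hypothetical multibyte-aware variant; requires the mbstring extension.
function maskAfter(string $string, int $visible, string $mask = '*'): string
{
    $length = mb_strlen($string, 'UTF-8');
    if ($length <= $visible) {
        return $string; // nothing to hide
    }
    // Keep the first $visible characters, mask the remaining ones.
    return mb_substr($string, 0, $visible, 'UTF-8')
        . str_repeat($mask, $length - $visible);
}

echo maskAfter('abcd1234', 3), "\n"; // prints abc*****
```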

Your regex matches all the characters after the first 3 as a single match, so they are replaced with one hard-coded *.
You can use
'~(^.{3}|(?!^)\G)\K.~'
and replace with *.
This regex matches either the first 3 characters (with ^.{3}) or the end of the previous successful match (with (?!^)\G, where the (?!^) lookahead stops \G from also matching at the start of the string), then drops everything matched so far from the match value (with \K), and finally matches any character but a newline with the trailing dot.
For example:
$re = '~(^.{3}|(?!^)\G)\K.~';
$strs = array("aa","apple", "google", "abc12345", "asdddd");
foreach ($strs as $s) {
$result = preg_replace($re, "*", $s);
echo $result . PHP_EOL;
}

Another possible solution is to concatenate the first three characters with a string of * repeated the correct number of times:
$text = substr($string, 0, 3).str_repeat('*', max(0, strlen($string) - 3));
The max() call is needed to stop str_repeat() from receiving a negative argument, which triggers a warning (a ValueError as of PHP 8). This happens when $string is shorter than 3 characters.

How to check if a word exists in a sentence

For example, if my sentence is $sent = 'how are you'; and if I search for $key = 'ho' using strstr($sent, $key) it will return true because my sentence has ho in it.
What I'm looking for is a way to return true if I only search for how, are or you. How can I do this?

You can use the function preg_match with a regex that uses word boundaries:
if(preg_match('/\byou\b/', $input)) {
echo $input.' has the word you';
}

If you want to check for multiple words in the same string, and you're dealing with large strings, then this is faster:
$text = explode(' ',$text);
$text = array_flip($text);
Then you can check for words with:
if (isset($text[$word])) doSomething();
Each lookup is then a hash-key check rather than a scan of the text.
For checking a couple of words in short strings, use preg_match instead.
UPDATE:
If you're actually going to use this I suggest you implement it like this to avoid problems:
$text = preg_replace('/[^a-z\s]/', '', strtolower($text));
$text = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
$text = array_flip($text);
$word = strtolower($word);
if (isset($text[$word])) doSomething();
Then double spaces, linebreaks, punctuation and capitals won't produce false negatives.
This method is much faster in checking for multiple words in large strings (i.e. entire documents of text), but it is more efficient to use preg_match if all you want to do is find if a single word exists in a normal size string.
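The cleanup step above keeps only a-z, so accented or non-Latin words would be stripped from the index. A Unicode-aware sketch of the same array_flip idea, assuming UTF-8 input and the mbstring extension; the names buildWordIndex and indexHasWord are hypothetical:

```php
<?php
// Hypothetical helpers; mb_strtolower requires the mbstring extension.
function buildWordIndex(string $text): array
{
    // Lowercase, then drop everything except letters, marks, digits and whitespace.
    $clean = preg_replace('/[^\pL\pM\pN\s]+/u', '', mb_strtolower($text, 'UTF-8'));
    $words = preg_split('/\s+/u', (string) $clean, -1, PREG_SPLIT_NO_EMPTY);

    // array_flip turns each word into a key, so isset() is a hash lookup.
    return array_flip($words ?: []);
}

function indexHasWord(array $index, string $word): bool
{
    return isset($index[mb_strtolower($word, 'UTF-8')]);
}

$index = buildWordIndex("How are you?\nChào buổi sáng!");
var_dump(indexHasWord($index, 'how')); // bool(true)
var_dump(indexHasWord($index, 'ho'));  // bool(false)
```

The index is built once and can then be queried for any number of words.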

One thing you can do is break up your sentence by spaces into an array.
Firstly, you would need to remove any unwanted punctuation marks.
The following code removes anything that isn't a letter, number, or space:
$sent = preg_replace("/[^a-zA-Z 0-9]+/", " ", $sent);
Now, all you have are the words, separated by spaces. To create an array that splits by space...
$sent_split = explode(" ", $sent);
Finally, you can do your check. Here are all the steps combined.
// The information you give
$sent = 'how are you';
$key = 'ho';
// Isolate only words and spaces
$sent = preg_replace("/[^a-zA-Z 0-9]+/", " ", $sent);
$sent_split = explode(" ", $sent);
// Do the check
if (in_array($key, $sent_split))
{
echo "Word found";
}
else
{
echo "Word not found";
}
// Outputs: Word not found
// because 'ho' isn't a word in 'how are you'

@codaddict's answer is technically correct, but if the word you are searching for is provided by the user, you need to escape any characters that have a special meaning in regular expressions. For example:
$searchWord = $_GET['search'];
$searchWord = preg_quote($searchWord, '/');
if (preg_match("/\b$searchWord\b/", $input)) {
echo "$input has the word $searchWord";
}

With recognition to Abhi's answer, a couple of suggestions:
I added /i to the regex, since words in a sentence are probably meant to be matched case-insensitively
I added an explicit === 1 to the comparison, based on the documented preg_match return values
$needle = preg_quote($needle, '/');
return preg_match("/\b$needle\b/i", $haystack) === 1;
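Put together as a complete function, the suggestions in these answers might look like the sketch below. The name containsWord is hypothetical, and the u modifier is an extra assumption (UTF-8 input) that the answers do not mention.

```php
<?php
// Hypothetical helper combining the suggestions above.
function containsWord(string $haystack, string $needle): bool
{
    // Pass the delimiter so that a "/" inside $needle is escaped too.
    $quoted = preg_quote($needle, '/');

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match('/\b' . $quoted . '\b/iu', $haystack) === 1;
}

var_dump(containsWord('how are you', 'ho'));  // bool(false)
var_dump(containsWord('How are you', 'how')); // bool(true)
```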

Masking all but first letter of a word using Regex

I'm attempting to create a bad word filter in PHP that will analyze the word and match against an array of known bad words, but keep the first letter of the word and replace the rest with asterisks. Example:
fook would become f***
shoot would become s****
The only part I don't know is how to keep the first letter in the string, and how to replace the remaining letters with something else while keeping the same string length.
$string = preg_replace("/\b(". $word .")\b/i", "***", $string);
Thanks!

$string = 'fook would become';
$word = 'fook';
$string = preg_replace("~\b". preg_quote($word, '~') ."\b~i", $word[0] . str_repeat('*', strlen($word) - 1), $string);
var_dump($string);

$string = preg_replace("/\b(".preg_quote($word[0], '/').")".preg_quote(substr($word, 1), '/')."\b/i", '$1'.str_repeat('*', strlen($word) - 1), $string);

This can be done in many ways, with very weird auto-generated regexps...
But I believe using preg_replace_callback() would end up being more robust
<?php
# as already pointed out, your words *may* need sanitization
foreach($words as $k=>$v)
$words[$k]=preg_quote($v,'/');
# and to be collapsed into a **big regexpy goodness**
$words=implode('|',$words);
# after that, a single preg_replace_callback() would do
$string = preg_replace_callback('/\b('. $words .')\b/i', "my_beloved_callback", $string);
function my_beloved_callback($m)
{
$len=strlen($m[1])-1;
return $m[1][0].str_repeat('*',$len);
}
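The snippet above relies on $words and $string being defined elsewhere. A self-contained sketch of the same preg_replace_callback idea follows; the wrapper name maskWords is hypothetical, and $words is assumed to be a non-empty list of plain single-byte words.

```php
<?php
// Hypothetical wrapper around the preg_replace_callback() approach.
function maskWords(string $string, array $words): string
{
    // Escape each word so it is matched literally inside the pattern.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    // Keep the first letter as it appears in the text, mask the rest.
    return (string) preg_replace_callback($pattern, function (array $m): string {
        return $m[1][0] . str_repeat('*', strlen($m[1]) - 1);
    }, $string);
}

echo maskWords('Fook that, shoot!', ['fook', 'shoot']), "\n"; // prints F*** that, s****!
```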

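Here is a Unicode-friendly regular expression for PHP. Note that it lowercases every letter of a word except the first one rather than masking it:
function lowercase_except_first_letter($s) {
// The lookbehind skips the first letter of each word; every other letter is passed to the callback.
// Including \W keeps the first letter intact even for words in quotes and brackets.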
return preg_replace_callback('/(?<!^|\s|\W)(\w)/u', function($m) {
return mb_strtolower($m[1]);
}, $s);
}

Replace all characters in string apart from PHP

I have a string Trade Card Catalogue 1988 Edition and I wish to remove everything apart from 1988.
I could have an array of all letters and do a str_replace and trim, but I wondered if there was a better solution?
$string = 'Trade Card Catalogue 1988 Edition';
$letters = array('a','b','c'....'x','y','z');
$string = strtolower($string);
$string = str_replace($letters, '', $string);
$string = trim($string);
Thanks in advance

Regular expression?
So assuming you want the number (and not the 4th word or something like that):
$str = preg_replace('#\D#', '', $str);
\D means every character that is not a digit. The same as [^0-9].
If there could be more numbers but you only want a four-digit number (a year), this will also work (but it obviously fails if there are several four-digit numbers and you want a specific one):
$str = preg_replace('#.*?(\d{4}).*#', '\1', $str);
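If the goal is to extract the year rather than delete everything around it, preg_match can capture it directly. A minimal sketch, assuming the year is a standalone run of exactly four digits (extractYear is a hypothetical name):

```php
<?php
// Hypothetical helper: returns the first standalone four-digit number, or null.
function extractYear(string $str): ?string
{
    // \b on both sides keeps a longer number such as 12345 from matching.
    return preg_match('/\b\d{4}\b/', $str, $m) === 1 ? $m[0] : null;
}

var_dump(extractYear('Trade Card Catalogue 1988 Edition')); // string(4) "1988"
```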

You can actually just pass the entire set of characters to be trimmed as a parameter to trim:
$string = trim($string, 'abc...zABC...Z ' /* don't forget the space */);
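The character list does not have to be typed out in full: trim() accepts ranges written with two dots, such as a..z. A short sketch using the string from the question:

```php
<?php
$string = 'Trade Card Catalogue 1988 Edition';

// "a..z" and "A..Z" are character ranges in trim()'s second argument;
// the trailing space in the list strips the spaces as well.
$year = trim($string, 'a..zA..Z ');

echo $year, "\n"; // prints 1988
```

Because trim() only strips from both ends, any letters sitting between digits would be left in place.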
