str_word_count() for non-latin words?


I'm trying to count the number of words in a variable written in a non-Latin language (Bulgarian), but it seems that str_word_count() does not count non-Latin words. The PHP file is encoded as UTF-8.
$str = "текст на кирилица";
echo 'Number of words: '.str_word_count($str);
//this returns 0

You may do it with regex:
$str = "текст на кирилица";
echo 'Number of words: '.count(preg_split('/\s+/', $str));
Here I'm defining the word delimiter as whitespace. If anything else should be treated as a word delimiter, you'll need to add it to your regex.
Also note that since there are no UTF-8 characters in the regex itself (only in the string), the /u modifier isn't required. But if you want some UTF-8 characters to act as delimiters, you'll need to add this modifier.
Update:
If you want only Cyrillic letters to count as words, you may use:
$str = "текст
на 12453
кирилица";
echo 'Number of words: '.count(preg_split('/[^А-Яа-яЁё]+/u', $str));
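One caveat with both preg_split variants above: if the string starts or ends with a delimiter, the result contains empty strings, and count() includes them. A minimal sketch that guards against this with the PREG_SPLIT_NO_EMPTY flag follows. The function name count_words_by_whitespace is made up for illustration, and the /u modifier is added only so that the input is handled as UTF-8.

```php
<?php
// Hypothetical helper: counts whitespace-separated words in a UTF-8 string.
function count_words_by_whitespace(string $str): int
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces produced by
    // leading or trailing whitespace.
    $words = preg_split('/\s+/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? 0 : count($words);
}

echo count_words_by_whitespace("  текст на\nкирилица  "), "\n"; // 3
echo count(preg_split('/\s+/', " a b ")), "\n";                 // 4, because of the two empty pieces
```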

And here is the solution that came to my mind:
$var = "текст на кирилица с пет думи";
$array = explode(" ", $var);
$i = 0;
foreach ($array as $item)
{
    if (strlen($item) > 2) $i++;
}
echo $i; // prints 5: strlen() counts bytes, so the one-letter word "с" (2 bytes in UTF-8) is skipped
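Note that strlen() counts bytes, not characters: in UTF-8 each Cyrillic letter takes two bytes, so a byte threshold is easy to get wrong. Here is a sketch of the same idea with an explicit character threshold, assuming the mbstring extension is loaded. The names count_words_min_length and $minChars are illustrative and not part of the answer above.

```php
<?php
// Hypothetical helper: counts words that have at least $minChars characters.
function count_words_min_length(string $text, int $minChars): int
{
    $count = 0;
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // mb_strlen() counts characters; strlen() would count bytes.
        if (mb_strlen($word, 'UTF-8') >= $minChars) {
            $count++;
        }
    }
    return $count;
}

$var = "текст на кирилица с пет думи";
echo strlen('с'), ' ', mb_strlen('с', 'UTF-8'), "\n"; // 2 1 (bytes vs. characters)
echo count_words_min_length($var, 1), "\n";            // 6, every word
echo count_words_min_length($var, 2), "\n";            // 5, the one-letter "с" is skipped
```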

As stated in the str_word_count documentation:
'word' is defined as a locale dependent string
Set the Bulgarian locale before calling str_word_count:
setlocale(LC_ALL, 'bg_BG');
echo str_word_count($content);
See the setlocale documentation for details.

The best solution I found is to provide a list of characters for word count function:
$text = 'текст на кирилице and on english too';
$count = str_word_count($text, 0, 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя');
echo $count; // => 7

Related

Splitting digits and latin letters from a string

Currently I have an array that looks something like this:
[0] => IS-001 開花した才能「篠ノ之 箒」
From this, I would like to strip the IS-001 part and keep only the Japanese text, like this:
[0] => 開花した才能「篠ノ之 箒」
I am currently using a plain preg_split on whitespace, but it pushes the 箒」 part into the next array element. So I wondered whether I could split off just the non-Japanese characters instead?

Try this
echo preg_replace('/^[a-zA-Z0-9\-_]+\s*/u', '', 'IS-001 開花した才能「篠ノ之 箒」');
^ asserts position at the start of the string
[a-zA-Z0-9\-_] matches a single character present in the list
+ between one and unlimited times, as many times as possible, giving back as needed
\s* then consumes the whitespace that follows the code, so the result has no leading space
u modifier (unicode): the pattern and subject strings are treated as UTF-8.

One solution is to use the multibyte string functions.
So $char = substr($str, $i, 1); will become $char = mb_substr($str, $i, 1, 'UTF-8'); and strlen($str) will become mb_strlen($str, 'UTF-8').
$str="IS-001 開花した才能「篠ノ之 箒」";
$japanese = preg_replace(array('/[^\p{Han}？]/u', '/(\s)+/'), array('', '$1'), $str);
echo $japanese; // note: \p{Han} covers kanji only, so kana, brackets, and spaces are removed as well
(or)
Remove latin letters and digits from string
$res = preg_replace('/[a-zA-Z0-9-]+/', '', $str);
echo $res;

If your strings always have this shape, you can use explode with the limit parameter:
$string = 'IS-001 開花した才能「篠ノ之 箒」';
$array = explode(' ', $string, 2);
echo $array[1];
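If both parts are needed, a single preg_match with two capture groups can return the code and the title together. This is only a sketch: the helper name split_code_and_title is hypothetical, and it assumes the code consists of ASCII letters, digits, hyphens, or underscores followed by whitespace.

```php
<?php
// Hypothetical helper: returns [code, title], or null if the string
// does not start with an ASCII code followed by whitespace.
function split_code_and_title(string $item): ?array
{
    // Group 1: the leading ASCII code. Group 2: everything after the whitespace.
    if (preg_match('/^([A-Za-z0-9_-]+)\s+(.*)$/u', $item, $m) === 1) {
        return [$m[1], $m[2]];
    }
    return null;
}

[$code, $title] = split_code_and_title('IS-001 開花した才能「篠ノ之 箒」');
echo $code, "\n";   // IS-001
echo $title, "\n";  // 開花した才能「篠ノ之 箒」
```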

ucwords not capitalizing accented letters

I have a string with all letters capitalized. I'm using the ucwords() and mb_strtolower() functions to capitalize only the first letter of each word. But I'm having problems when the first letter of a word has an accent. For example:
ucwords(mb_strtolower('GRANDE ÁRVORE')); //outputs 'Grande árvore'
Why is the first letter of the second word not being capitalized? What can I do to solve this?

ucwords is one of the core PHP functions which is blissfully oblivious to non-ASCII or non-Latin-1 encodings.* For handling multibyte strings and/or non-ASCII strings, you should use the multibyte aware mb_convert_case:
mb_convert_case($str, MB_CASE_TITLE, 'UTF-8')
// your string encoding here --------^^^^^^^
* I'm not entirely sure whether it works only with ASCII or at least with Latin-1, but I wouldn't even bother to find out.

If you only want to capitalize the first letter of the string, here's a way to achieve it:
$s = "économie collégiale";
echo mb_strtoupper(mb_substr($s, 0, 1)) . mb_substr($s, 1);
// output: Économie collégiale
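The two approaches in this thread behave differently on mixed-case input, which may matter when choosing between them. The sketch below wraps the first-letter approach in a helper (ucfirst_utf8 is a made-up name, and the explicit 'UTF-8' arguments are an assumption about the input encoding) and compares it with mb_convert_case, which title-cases every word and lowercases the rest of each word.

```php
<?php
// Hypothetical helper: uppercases only the first character and
// leaves the rest of the string untouched.
function ucfirst_utf8(string $s): string
{
    return mb_strtoupper(mb_substr($s, 0, 1, 'UTF-8'), 'UTF-8')
        . mb_substr($s, 1, null, 'UTF-8');
}

$s = 'économie du QUÉBEC';
echo ucfirst_utf8($s), "\n";                            // Économie du QUÉBEC
echo mb_convert_case($s, MB_CASE_TITLE, 'UTF-8'), "\n"; // Économie Du Québec
```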

ucwords doesn't recognize accented characters. Try using mb_convert_case.
$str = 'GRANDE ÁRVORE';
function ucwords_accent($string)
{
    // Input that is not detected as UTF-8 is assumed to be ISO-8859-1 and converted first.
    // Note: utf8_encode() is deprecated as of PHP 8.2.
    if (mb_detect_encoding($string) != 'UTF-8') {
        $string = mb_convert_case(utf8_encode($string), MB_CASE_TITLE, 'UTF-8');
    } else {
        $string = mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
    }
    return $string;
}
echo ucwords_accent($str);

Regex to match string with and without special/accented characters?

Is there a regular expression to match a specific string with and without special characters? Special characters-insensitive, so to speak.
Like céra will match cera, and vice versa.
Any ideas?
Edit: I want to match specific strings with and without special/accented characters. Not just any string/character.
Test example:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
if (strpos($compareClientName, $this->search) !== false)
{
$clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
}
Output: <span class="highlight">céra</span>
As you can see, I want to highlight the specific search string. However, I still want to display the original (accented) characters of the matched string.
I'll have to combine this with Michael Sivolobov's answer somehow, I guess.
I think I'll have to work with a separate preg_match() and preg_replace(), right?

You can use the \p{L} pattern to match any letter.
You have to use the u modifier after the regular expression to enable unicode mode.
Example : /\p{L}+/u
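As a quick illustration of \p{L} with the u modifier, the sketch below extracts runs of letters from a mixed Cyrillic and Latin string. The sample text is made up for the example. Digits are skipped because they are not in the letter category.

```php
<?php
$text = 'текст на кирилица and English 123';

// \p{L}+ matches one or more letters of any script; /u enables UTF-8 mode.
preg_match_all('/\p{L}+/u', $text, $matches);

print_r($matches[0]); // текст, на, кирилица, and, English
```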
Edit :
Try something like this. It should replace every accented letter with a search pattern containing the accented letter (both as a single character and as a two-character Unicode sequence) and the unaccented letter. You can then use the corrected search pattern to highlight your text.
function mbStringToArray($string)
{
    $array = array(); // initialise so that an empty string yields an empty array
    $strlen = mb_strlen($string, "UTF-8");
    while ($strlen)
    {
        $array[] = mb_substr($string, 0, 1, "UTF-8");
        $string = mb_substr($string, 1, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}
// I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
function stripAccents($stripAccents){
return utf8_encode(strtr(utf8_decode($stripAccents),utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'),'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY'));
}
$clientName = 'céra';
$clientNameNoAccent = stripAccents($clientName);
$clientNameArray = mbStringToArray($clientName);
foreach ($clientNameArray as $pos => &$char)
{
    $charNA = $clientNameNoAccent[$pos];
    if ($char != $charNA)
    {
        $char = "(?:$char|$charNA|$charNA\p{M})";
    }
}
unset($char); // break the reference left over from the foreach
$clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra
$text = 'the client name is Céra but it could be Cera or céra too.';
$search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);
echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.

If you want to know whether a letter carries an accent or another mark, you can check by matching the pattern \p{M}.
UPDATE
You need to convert every accented letter in your pattern to a group of alternatives:
E.g. céra -> c(?:é|e|e\p{M})ra
Why did I add e\p{M}? Because your letter é can be a single character in Unicode, or a combination of two characters (e plus a combining acute accent). e\p{M} matches the two-character form.
Once the pattern is converted this way, you can use it in your preg_match.
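To make the two Unicode forms concrete, here is a small sketch that builds both spellings of céra with explicit code point escapes and checks them against the alternation. The variable names are illustrative, and the "\u{...}" string escapes require PHP 7 or later.

```php
<?php
$precomposed = "c\u{00E9}ra";  // é as the single code point U+00E9
$decomposed  = "ce\u{0301}ra"; // e followed by U+0301 COMBINING ACUTE ACCENT

// Same alternation as above, with é written as \x{00E9} to avoid ambiguity.
$pattern = '/c(?:\x{00E9}|e\p{M}|e)ra/iu';

var_dump($precomposed === $decomposed);       // bool(false): the byte sequences differ
var_dump(preg_match($pattern, $precomposed)); // int(1)
var_dump(preg_match($pattern, $decomposed));  // int(1)
var_dump(preg_match($pattern, 'Cera'));       // int(1)
```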

As you noted in one of the comments, you don't need a regular expression for this, since the goal is to find specific strings. Why not use explode? Like this:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
$pieces = explode($compareClientName, $this->search);
if (count($pieces) > 1)
{
$clientName = implode('<span class="highlight">'.$clientName.'</span>', $pieces);
}
Edit:
If your $search variable may contain special characters too, why don't you transliterate it and use mb_strpos with $offset? Like this:
$offset = 0;
$highlighted = '';
$len = mb_strlen($compareClientName, 'UTF-8');
// mb_strpos() returns false (not -1) when the needle is not found
while (($pos = mb_strpos($this->search, $compareClientName, $offset, 'UTF-8')) !== false) {
    $highlighted .= mb_substr($this->search, $offset, $pos - $offset, 'UTF-8') .
        '<span class="highlight">' .
        mb_substr($this->search, $pos, $len, 'UTF-8') . '</span>';
    $offset = $pos + $len;
}
// a null length means "to the end of the string"; the encoding is the fourth argument
$highlighted .= mb_substr($this->search, $offset, null, 'UTF-8');
Update 2:
It is important to use the mb_ functions instead of plain strlen etc., because accented characters are stored using two or more bytes. Also always make sure that you use the right encoding; take a look at this, for example:
echo strlen('é');
> 2
echo mb_strlen('é');
> 2
echo mb_internal_encoding();
> ISO-8859-1
echo mb_strlen('é', 'UTF-8');
> 1
mb_internal_encoding('UTF-8');
echo mb_strlen('é');
> 1

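A POSIX equivalence class matches all characters that share the same collating order. It is written as:
[=a=]
This will match á and ä as well as a, depending on your locale. Note that this is a POSIX regex feature: PCRE, which backs PHP's preg_* functions, does not support equivalence classes and reports an error for this syntax.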

Unicode (UTF8) string word count in PHP

I need to have the word count of the following unicode string. Using str_word_count:
$input = 'Hello, chào buổi sáng';
$count = str_word_count($input);
echo $count;
the result is
7
which is apparently wrong.
How do I get the desired result (4)?

$tags = 'Hello, chào buổi sáng';
$word = explode(' ', $tags);
echo count($word);
Here's a demo: http://codepad.org/667Cr1pQ

Here is a quick and dirty regex-based (using Unicode) word counting function:
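function mb_count_words($string) {
    // \p{L} = letters, \p{N} = digits, \p{Pd} = dash punctuation.
    // Braces are required for two-letter properties: \pPd would mean
    // \pP (any punctuation) followed by a literal d.
    preg_match_all('/[\p{L}\p{N}\p{Pd}]+/u', $string, $matches);
    return count($matches[0]);
}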
A "word" is anything that contains one or more of:
Any alphabetic letter
Any digit
Any hyphen/dash
This would mean that the following contains 5 "words" (4 normal, 1 hyphenated):
echo mb_count_words('Hello, chào buổi sáng, chào-sáng');
Now, this function is not well suited for very large texts; though it should be able to handle most of what counts as a block of text on the internet. This is because preg_match_all needs to build and populate a big array only to throw it away once counted (it is very inefficient). A more efficient way of counting would be to go through the text character by character, identifying unicode whitespace sequences, and incrementing an auxiliary variable. It would not be that difficult, but it is tedious and takes time.
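One way to sidestep the throwaway array is to rely on the return value of preg_match_all, which is the number of full matches, and omit the $matches argument altogether. The sketch below does that. The function name count_words_unicode is hypothetical, the character class is written with braces, and \p{M} is my addition so that decomposed accents stay inside their word. I have not benchmarked it against the version above.

```php
<?php
// Hypothetical helper: counts runs of letters, marks, digits, and dashes.
function count_words_unicode(string $string): int
{
    // Without a $matches argument, only the match count is returned.
    $count = preg_match_all('/[\p{L}\p{M}\p{N}\p{Pd}]+/u', $string);
    return $count === false ? 0 : $count;
}

echo count_words_unicode('Hello, chào buổi sáng, chào-sáng'), "\n"; // 5
```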

You may use this function to count unicode words in given string:
function count_unicode_words( $unicode_string ){
    // First remove all the punctuation marks & digits
    $unicode_string = preg_replace('/[[:punct:][:digit:]]/', '', $unicode_string);
    // Now replace each whitespace character (tab, new line, space) with a single space
    $unicode_string = preg_replace('/[[:space:]]/', ' ', $unicode_string);
    // The words are now separated by spaces and can be split into an array
    // I have included \n\r\t here as well, but a space alone would also suffice
    $words_array = preg_split( "/[\n\r\t ]+/", $unicode_string, 0, PREG_SPLIT_NO_EMPTY );
    // Now we can get the word count by counting the array elements
    return count($words_array);
}
All credits go to the author.

I'm using this code to count words. You can try this:
$s = 'Hello, chào buổi sáng';
$s1 = array_map('trim', explode(' ', $s));
$s2 = array_filter($s1, function($value) { return $value !== ''; });
echo count($s2);

Allow only [a-z][A-Z][0-9] in string using PHP

How can I get a string that only contains a to z, A to Z, 0 to 9 and some symbols?

You can filter it like:
$text = preg_replace("/[^a-zA-Z0-9]+/", "", $text);
As for some symbols, you should be more specific

You can test your string (let $str) using preg_match:
if(preg_match("/^[a-zA-Z0-9]+$/", $str) == 1) {
    // the string contains only a to z, A to Z, and 0 to 9
}
If you need more symbols you can add them before ]
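As a sketch of such an extended whitelist, the function below also accepts underscore, dot, and hyphen. The function name is_allowed and the choice of extra symbols are assumptions for illustration. It anchors with \A and \z rather than ^ and $, because $ also matches just before a trailing newline.

```php
<?php
// Hypothetical validator: letters, digits, underscore, dot, and hyphen only.
function is_allowed(string $str): bool
{
    // '-' is placed last in the class so that it is taken literally.
    return preg_match('/\A[a-zA-Z0-9_.-]+\z/', $str) === 1;
}

var_dump(is_allowed('user_name-01.txt')); // bool(true)
var_dump(is_allowed('foo!#$bar'));        // bool(false)
var_dump(is_allowed("abc\n"));            // bool(false), although /^[a-zA-Z0-9]+$/ would accept it
```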

You don't need a regex; you can use the Ctype functions:
ctype_alnum: Check for alphanumeric character(s)
ctype_alpha: Check for alphabetic character(s)
ctype_cntrl: Check for control character(s)
ctype_digit: Check for numeric character(s)
ctype_graph: Check for any printable character(s) except space
ctype_lower: Check for lowercase character(s)
ctype_print: Check for printable character(s)
ctype_punct: Check for any printable character which is not whitespace or an alphanumeric character
ctype_space: Check for whitespace character(s)
ctype_upper: Check for uppercase character(s)
ctype_xdigit: Check for character(s) representing a hexadecimal digit
In your case use ctype_alnum, example:
if (ctype_alnum($str)) {
//...
}
Example:
<?php
$strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
    if (ctype_alnum($testcase)) {
        echo 'The string ', $testcase, ' consists of all letters or digits.', "\n";
    } else {
        echo 'The string ', $testcase, ' does not consist of all letters or digits.', "\n";
    }
}
Online example: https://ideone.com/BYN2Gn

Both these regexes should do it:
$str = preg_replace('~[^a-z0-9]+~i', '', $str);
Or:
$str = preg_replace('~[^a-zA-Z0-9]+~', '', $str);

A shortcut would be:
if (preg_match('/^[\w.]+$/', $str)) {
    echo 'Str is valid and allowed';
} else {
    echo 'Str is invalid';
}
Here:
\w matches a single character from [a-zA-Z0-9_], so the string may contain only a to z, A to Z, 0 to 9, and _ (underscore); the character class above additionally allows a literal dot.
Hope it helps!

If you need to preserve spaces in your string do this
$text = preg_replace("/[^a-zA-Z0-9 ]+/", "", $text);
Please note the space I have added between 9 and the closing bracket. For example:
$name = '!#$John Doe'; // single quotes, so that $John is not interpolated as a variable
echo preg_replace("/[^a-zA-Z0-9 ]+/", "", $name);
the output will be:
John Doe
Spaces in the string will be preserved.
If you fail to include the space between 9 and the closing bracket the output will be:
JohnDoe
Hope it helps someone.

The best and most flexible way to accomplish that is to use regular expressions.
I'm not sure how to do that in PHP, but this article may help.
