Count Number of Characters in a Mixed String of ASCII and Unicode - php

strlen($username);
Username can carry ASCII, Unicode or both.
Example:
Jam123 (ASCII) - 6 characters
ابت (Unicode) - 3 characters, but strlen returns 6 because each of these letters takes 2 bytes in UTF-8.
Jamت (Unicode and ASCII) - strlen returns 5 (3 bytes for the ASCII letters and 2 bytes for the Unicode one, even though I have only one Unicode character).
In all cases the username shouldn't go beyond 25 characters and shouldn't be shorter than 4 characters.
My main problem is mixing Unicode and ASCII together: how can I keep track of the count so the condition can decide whether the username is not over 25 and not under 4 characters?
if(strlen($username) <= 25 && !(strlen($username) < 4))
3 Unicode characters are counted as 6 bytes, which causes trouble because it lets a user have a username of 3 Unicode characters when the minimum should be 4.
Numbers will always be in ASCII

Use mb_strlen(). It counts characters rather than bytes, so each multibyte Unicode character is counted once.
Example:
mb_strlen("Jamت", "UTF-8"); // 4

You can use mb_strlen, which lets you specify the encoding.
http://sandbox.phpcode.eu/g/3a144/1
<?php
echo mb_strlen('ابت', 'UTF8'); // returns 3
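Applied to the 4 to 25 character rule from the question, a minimal sketch could look like this. The helper name isValidUsernameLength is made up for illustration, and the snippet assumes the mbstring extension is loaded and the input is UTF-8:

```php
<?php
// Hypothetical helper (not from the answers above): checks the 4..25 rule
// by counting characters instead of bytes.
function isValidUsernameLength(string $username): bool
{
    // mb_strlen with an explicit encoding counts code points, not bytes.
    $length = mb_strlen($username, 'UTF-8');
    return $length >= 4 && $length <= 25;
}

var_dump(isValidUsernameLength('Jam123')); // bool(true)  - 6 characters
var_dump(isValidUsernameLength('ابت'));    // bool(false) - 3 characters, 6 bytes
var_dump(isValidUsernameLength('Jamت'));   // bool(true)  - 4 characters, 5 bytes
```

Note that this counts code points, so a letter followed by a combining mark would count as two.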

Function to count words in a Unicode sentence/string:
function mb_count_words($string)
{
    // Count runs of letters, numbers and dash punctuation (\p{Pd}).
    preg_match_all('/[\p{L}\p{N}\p{Pd}]+/u', $string, $matches);
    return count($matches[0]);
}
or
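function mb_count_words($string, $format = 0, $charlist = '[]') {
    // $charlist is kept for signature compatibility but is not used.
    // Split on runs of anything that is not a letter, number or apostrophe;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces left by leading or trailing separators.
    $words = preg_split('~[^\p{L}\p{N}\']+~u', trim($string), -1, PREG_SPLIT_NO_EMPTY);
    switch ($format) {
        case 0:
            // Format 0: return the number of words.
            return count($words);
        default:
            // Any other format: return the list of words.
            return $words;
    }
}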
then do:
echo mb_count_words("chào buổi sáng");

Related

PHP: categorizing characters in a string

Consider the following string
hello, my name is 冰岛, nice to meet you
I need to scan this string and categorize each character as one of the following types:
1) Western text (alphabet and numbers only)
2) Chinese text (ideograms only, no punctuation)
3) Anything else (anything else, whether western or chinese or else)
Anyone can point me in the right direction? Thanks
Edit: since I suppose this has been downvoted due to being too generic..
for ($i = 0, $l = mb_strlen($string); $i < $l; $i++)
{
    $char = mb_substr($string, $i, 1);
    if (preg_match("/^[a-zA-Z]$/", $char)) $type = "alpha";
    else
    ...
    ;
}
Regular expressions other than detecting alphabetic characters defy my knowledge, especially what is needed to include Han Ideograms only and leave all Han punctuation and special symbols out.
I can suggest that you should use a preg_replace_callback to grab the chunks of text you need with a regex that will capture different categories of texts into separate groups, and build the resulting array based on these captures:
$s = "hello, my name is 冰岛, nice to meet you";
$res = array();
preg_replace_callback('~\b(?<Chinese>\p{Han}+)\b|\b(?<Western>[a-zA-Z0-9]+)\b|(?<Other>[^\p{Han}A-Za-z0-9\s]+)~su',
    function($m) use (&$res) {
        if (!empty($m["Chinese"])) {
            $t = array("type" => "Han", "value" => $m["Chinese"]);
            array_push($res, $t);
        }
        else if (!empty($m["Western"])) {
            $t = array("type" => "Western", "value" => $m["Western"]);
            array_push($res, $t);
        }
        else if (!empty($m["Other"])) {
            $t = array("type" => "Other", "value" => $m["Other"]);
            array_push($res, $t);
        }
    },
    $s);
print_r($res);
Pattern:
\b(?<Chinese>\p{Han}+)\b - a whole Chinese word
| - or
\b(?<Western>[a-zA-Z0-9]+)\b - a whole word consisting of only ASCII letters and digits
| - or
(?<Other>[^\p{Han}A-Za-z0-9\s]+) - any 1+ symbols other than Chinese chars, ASCII letters, ASCII digits and whitespaces (\s).
The s modifier is redundant here because the pattern contains no dot, but if you add a . and want it to match line breaks, s makes it do so.
The u modifier is necessary here since you deal with Unicode strings.
Also, see more about Unicode properties in the Unicode Properties section at the regular-expressions.info (e.g. you might be interested in \p{P} and \p{S} properties).
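If a label per character is really what is needed, as the question describes, the same Unicode properties can be applied one character at a time. This is only a sketch under assumptions: the helper names classifyChar and classifyString and the three labels are made up, the input is assumed to be valid UTF-8, and "Western" is taken to mean ASCII letters and digits only:

```php
<?php
// Hypothetical helper: returns 'western', 'chinese' or 'other' for one character.
function classifyChar(string $char): string
{
    if (preg_match('/^[A-Za-z0-9]$/', $char) === 1) {
        return 'western';
    }
    // \p{Han} matches characters of the Han script; CJK punctuation such as
    // the ideographic full stop belongs to the Common script and is not matched.
    if (preg_match('/^\p{Han}$/u', $char) === 1) {
        return 'chinese';
    }
    return 'other';
}

// Hypothetical helper: splits a UTF-8 string into characters and labels each one.
function classifyString(string $s): array
{
    $result = [];
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $result[] = [$char, classifyChar($char)];
    }
    return $result;
}

print_r(classifyString('is 冰岛, ok'));
```

Whitespace ends up in the 'other' bucket here; add a fourth label if it needs separate treatment.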

PHP: str_word_count(åäöåäöåäöåäö) returns the integer value of 12

I am using special symbols such as å ä ö on my website which measures the lengths of different texts. Thing is, I have noticed that PHP counts the symbols "å" "ä" "ö" as 1 word each. So åäö counts as 3 words, and åäöåäöåäöåäöåäö counts as 15 words. Well this is clearly not correct and I cannot find an answer to this problem anywhere. I'd be thankful for a useful answer, thank you!
If there's a limited set of word characters that you need to take into account, just supply those into str_word_count with its third param (charlist):
$charlist = 'åäö';
echo str_word_count('åäöåäöåäöåäöåäö', 0, $charlist); // 1
Alternatively, you can write your own Unicode-ready str_word_count function. One possible approach is to match every word-like run in the source string and count the matches:
function mb_str_word_count($str) {
return preg_match_all('#[\p{L}\p{N}][\p{L}\p{N}\'-]*#u', $str);
}
Basically, this function counts all the substrings in the target string that start with either Letter or Number character, followed by any number (incl. zero) of Letters, Numbers, hyphens and single quote symbols (matching the description given in str_word_count() docs).
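A few worked inputs may make the behaviour easier to see. The function is repeated so the snippet runs on its own; the sample strings other than the one from the question are made up, and the input is assumed to be valid UTF-8:

```php
<?php
// Same function as above, repeated so this snippet is self-contained.
function mb_str_word_count($str) {
    return preg_match_all('#[\p{L}\p{N}][\p{L}\p{N}\'-]*#u', $str);
}

echo mb_str_word_count('åäöåäöåäöåäöåäö'), "\n";      // 1
echo mb_str_word_count('Wörk has to be done.'), "\n"; // 5
echo mb_str_word_count("don't stop-gap"), "\n";       // 2 (apostrophe and hyphen stay inside a word)
```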
You can try adding
setlocale(LC_ALL, 'en_US.utf8')
before your call to str_word_count
or roll your own with
substr_count(trim($str), ' ') + 1; // one more word than there are spaces
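This works for me; I hope it's useful.
When using str_word_count you need to use utf8_decode(utf8_encode(...)).
function cortar($str)
{
    // Fewer than 20 words: return the string unchanged.
    if (20 > $count = str_word_count($str)) {
        return $str;
    }
    else
    {
        // Third argument: extra characters to treat as part of a word.
        $array = str_word_count($str, 1, '.,-0123456789()+=?¿!"<>*ñÑáéíóúÁÉÍÓÚ#|/%$#¡');
        $s = '';
        $c = 0;
        foreach ($array as $e) {
            if (20 > $c) {
                if (19 > $c) {
                    $s .= $e . ' ';
                }
                else
                {
                    // 20th word: no trailing space.
                    $s .= $e;
                }
            }
            $c += 1;
        }
        // Note: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
        return utf8_decode(utf8_encode($s));
    }
}
The function returns the first 20 words.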
If the string has no line breaks and the words are separated by single spaces, a simple workaround is to trim() the string and then count the spaces.
$string = "Wörk has to be done.";
// 1 space is 2 words, 2 spaces are 3 words etc.
if (substr_count(trim($string), ' ') > 2)
{
    // more than 3 words
    // ...
}

Compare a symbol from multibyte string with one in ASCII

I want to detect a space or a hyphen in a multibyte string.
First I split the string into an array of chars
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
Then I try to compare those symbols with a hyphen or a space
foreach ($chrArray as $char) {
    if ($char == '-' || $char == ' ') {
        // Do something
    }
}
Oh, this one doesn't work. Ok, why? Maybe because those symbols are in ASCII?
echo mb_detect_encoding('-'); // ASCII
Okay, I'll try to handle it.
$encoding = mb_detect_encoding($str); // UTF-8
$dash = mb_convert_encoding('-', $encoding);
$space = mb_convert_encoding(' ', $encoding);
Oh, but that doesn't work either. Wait a second...
echo mb_detect_encoding($dash); // ASCII
!!! What's happening??? How could I do what I want?
I ended up using a regex. This one
"/(?<=-| |^)([\w]*)/u"
finds all Unicode words that have a hyphen, a space, or nothing (start of line) at the previous position. Instead of iterating over the chars array I'm using preg_replace_callback (in PHP >= 5.4.1, mb_ereg_replace_callback can be used as well).
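For what it's worth, ASCII is a byte-for-byte subset of UTF-8, so a character taken from a UTF-8 string can be compared with '-' or ' ' without any conversion (which is also why mb_detect_encoding reports ASCII for them). If the comparison never matches, one possible cause is a look-alike character such as an en dash (U+2013) or a no-break space (U+00A0). A rough sketch, assuming UTF-8 input and PHP 7 or later for the "\u{...}" escapes; the helper name isDashOrSpace is hypothetical:

```php
<?php
// Hypothetical helper: true for any Unicode dash (\p{Pd}) or space separator (\p{Zs}).
function isDashOrSpace(string $char): bool
{
    return preg_match('/^[\p{Pd}\p{Zs}]$/u', $char) === 1;
}

$str = "a-b\u{2013}c d"; // contains an ASCII hyphen, an en dash and an ASCII space
foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $char) {
    if ($char === '-' || $char === ' ') {
        // Plain ASCII separators compare equal as-is.
        echo "ASCII separator, bytes: ", bin2hex($char), "\n";
    } elseif (isDashOrSpace($char)) {
        // Look-alikes only show up through the Unicode property check.
        echo "other separator, bytes: ", bin2hex($char), "\n";
    }
}
```

Dumping the bytes with bin2hex is a quick way to see which character is actually in the string.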

PHP regex for password validation

I'm not getting the desired effect from a script. I want the password to contain A-Z, a-z, 0-9, and special chars.
A-Z
a-z
0-9 >= 2
special chars >= 2
string length >= 8
So I want to force the user to use at least 2 digits and at least 2 special chars. My script works, but it forces me to put the digits or special chars back to back, and I don't want that. E.g. the password testABC55$$ is valid, but that's not what I want.
Instead I want test$ABC5#8 to be valid. So basically the digits/special chars can be the same or different, but it must be possible to split them up in the string.
PHP CODE:
$uppercase = preg_match('#[A-Z]#', $password);
$lowercase = preg_match('#[a-z]#', $password);
$number = preg_match('#[0-9]#', $password);
$special = preg_match('#[\W]{2,}#', $password);
$length = strlen($password) >= 8;
if(!$uppercase || !$lowercase || !$number || !$special || !$length) {
    $errorpw = 'Bad Password';
}
Using "readable" format (it can be optimized to be shorter), as you are regex newbie >>
^(?=.{8})(?=.*[A-Z])(?=.*[a-z])(?=.*\d.*\d.*\d)(?=.*[^a-zA-Z\d].*[^a-zA-Z\d].*[^a-zA-Z\d])[-+%#a-zA-Z\d]+$
Add your special character set to last [...] in the above regex (I put there for now just -+%#).
Explanation:
^ - beginning of line/string
(?=.{8}) - positive lookahead to ensure we have at least 8 chars
(?=.*[A-Z]) - ...to ensure we have at least one uppercase char
(?=.*[a-z]) - ...to ensure we have at least one lowercase char
(?=.*\d.*\d.*\d) - ...to ensure we have at least three digits
(?=.*[^a-zA-Z\d].*[^a-zA-Z\d].*[^a-zA-Z\d])
- ...to ensure we have at least three special chars
(characters other than letters and numbers)
[-+%#a-zA-Z\d]+ - combination of allowed characters
$ - end of line/string
((?=(.*\d){3,})(?=.*[a-z])(?=.*[A-Z])(?=(.*[!##$%^&]){3,}).{8,})
test$ABC5#8 is not valid because you ask for more than 2 digits and more than 2 special symbols:
A-Z
a-z
0-9 > 2
special chars > 2
string length >= 8
For matching length of string including special characters:
$result = preg_match('/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^A-Za-z\d])[\s\S]{6,16}$/', $string);
Answer explained: https://stackoverflow.com/a/46359397/5466401
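If one big regex feels hard to maintain, counting matches with preg_match_all is another way to require "at least 2, not necessarily adjacent". This is only a sketch: the function name isStrongPassword is made up, it uses the ">= 2" thresholds from the question, and it assumes that "special" means any character that is not an ASCII letter or digit:

```php
<?php
// Hypothetical helper: applies the rules from the question by counting matches.
function isStrongPassword(string $password): bool
{
    $hasUpper = preg_match('/[A-Z]/', $password) === 1;
    $hasLower = preg_match('/[a-z]/', $password) === 1;
    // preg_match_all returns the number of matches, wherever they occur.
    $digits   = preg_match_all('/[0-9]/', $password);
    // Assumption: "special" = anything that is not an ASCII letter or digit.
    $specials = preg_match_all('/[^A-Za-z0-9]/', $password);

    return $hasUpper && $hasLower
        && $digits >= 2
        && $specials >= 2
        && strlen($password) >= 8;
}

var_dump(isStrongPassword('test$ABC5#8')); // bool(true)  - digits and specials spread out
var_dump(isStrongPassword('testABC55$$')); // bool(true)  - back to back also passes
var_dump(isStrongPassword('testABC5$$'));  // bool(false) - only one digit
```

Keeping each rule in its own variable also makes it easy to report which rule failed.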

PHP: word check for 30 chars

I would like to check whether $_POST['msg'] contains a word that is longer than 30 chars (with no spaces in it), so that you wouldn't be able to write:
example1 :
asdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsdsddsasdsdsdsdsd
example2:
hello my name is asdoksdosdkokosdkosdkodskodskodksosdkosdkokodsdskosdkosdkodkoskosdkosdkosdkosdsdksdoksd
(notice no spaces).
How can I do that?
You could use preg_match to look for that as follows...
if (preg_match('/\S{31,}/', $_POST['msg']))
{
//string contains sequence of non-spaces > 30 chars
}
The \S matches any non-space character, and is the inverse of \s which matches any space. See the manual page on PCRE escape sequences
You can use the regex \w{31,} to find a word that has 31 or more characters:
if(preg_match('/\w{31,}/',$_POST['msg'])) {
echo 'Found a word >30 char in length';
}
If you want to find group of non-space characters that are 31 or more characters in length, you can use:
if(preg_match('/\S{31,}/',$_POST['msg'])) {
echo 'Found a group of non-space characters >30 in length';
}
First find the words:
// words are separated by space usually, add more logic here
$words = explode(' ', $_POST['msg']);
foreach($words as $word) {
if(strlen($word) > 30) { // if the word is bigger than 30
// do something
}
}
How about this? The logic is slightly different: it checks the total length with all whitespace removed.
if (strlen(preg_replace('#\s+#', '', $_POST['msg'])) > 30) {
    // the string has more than 30 characters (whitespace isn't counted)
}
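First split the input into words:
explode(" ", $_POST['msg']);
then get the length of the longest word (calling max() on the strings themselves would compare them as strings, not by length):
max(array_map('strlen', explode(" ", $_POST['msg'])));
and see if that is larger than 30:
max(array_map('strlen', explode(" ", $_POST['msg']))) > 30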
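Note that strlen counts bytes, so a 20-letter word in a multibyte script could already exceed a byte limit of 30. A character-based variant might look like the sketch below. The helper name hasOverlongWord is made up, and it assumes the mbstring extension and valid UTF-8 input:

```php
<?php
// Hypothetical helper: true if any whitespace-separated word has more than
// $limit characters (code points, not bytes).
function hasOverlongWord(string $msg, int $limit = 30): bool
{
    // Split on any run of whitespace and drop empty pieces.
    foreach (preg_split('/\s+/u', $msg, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (mb_strlen($word, 'UTF-8') > $limit) {
            return true;
        }
    }
    return false;
}

var_dump(hasOverlongWord('hello my name is ' . str_repeat('x', 40))); // bool(true)
var_dump(hasOverlongWord(str_repeat('ت', 30)));                       // bool(false) - 30 characters, 60 bytes
```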
