I have a string with all letters capitalized. I'm using the ucwords() and mb_strtolower() functions to capitalize only the first letter of each word. But I'm having problems when the first letter of a word has an accent. For example:
ucwords(mb_strtolower('GRANDE ÁRVORE')); //outputs 'Grande árvore'
Why is the first letter of the second word not being capitalized? What can I do to solve this?
ucwords is one of the core PHP functions which is blissfully oblivious to non-ASCII or non-Latin-1 encodings.* For handling multibyte strings and/or non-ASCII strings, you should use the multibyte aware mb_convert_case:
mb_convert_case($str, MB_CASE_TITLE, 'UTF-8')
// your string encoding here --------^^^^^^^
* I'm not entirely sure whether it works only with ASCII or at least with Latin-1, but I wouldn't even bother to find out.
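A minimal sketch of that call applied to the string from the question. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumes this file is saved as UTF-8 and ext-mbstring is available.
$str = 'GRANDE ÁRVORE';

// MB_CASE_TITLE upper-cases the first letter of each word and lower-cases the rest.
$title = mb_convert_case($str, MB_CASE_TITLE, 'UTF-8');

echo $title, PHP_EOL; // Grande Árvore
```

Passing the encoding explicitly avoids depending on whatever mb_internal_encoding() happens to be set to.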
If you only want to capitalize the first letter of the string, here's a way to achieve it:
$s = "économie collégiale";
echo mb_strtoupper(mb_substr($s, 0, 1)) . mb_substr($s, 1);
// output: Économie collégiale
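The same idea can be wrapped in a small helper. This is only a sketch: the name ucfirst_utf8 is hypothetical (chosen so it cannot clash with a built-in function), and it assumes UTF-8 input.

```php
<?php
// Hypothetical helper: upper-case only the first character of a UTF-8 string.
function ucfirst_utf8(string $s): string
{
    $first = mb_substr($s, 0, 1, 'UTF-8');    // first character, not first byte
    $rest  = mb_substr($s, 1, null, 'UTF-8'); // null length = to the end of the string
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

echo ucfirst_utf8('économie collégiale'), PHP_EOL; // Économie collégiale
```

Unlike MB_CASE_TITLE, this leaves the rest of the string untouched, so call mb_strtolower() first if the input is all uppercase.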
ucwords doesn't recognize the accented character. Try using mb_convert_case.
$str = 'GRANDE ÁRVORE';
function ucwords_accent($string)
{
    // utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding()
    // performs the same ISO-8859-1 to UTF-8 conversion.
    if (!mb_check_encoding($string, 'UTF-8')) {
        // Assumes that input which is not valid UTF-8 is ISO-8859-1.
        $string = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
    }
    return mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
}
echo ucwords_accent($str);
Related
I'm trying to count the number of words in a variable written in a non-Latin language (Bulgarian), but it seems that str_word_count() does not count non-Latin words. The PHP file is encoded as UTF-8.
$str = "текст на кирилица";
echo 'Number of words: '.str_word_count($str);
//this returns 0
You may do it with regex:
$str = "текст на кирилица";
echo 'Number of words: '.count(preg_split('/\s+/', $str));
Here I'm defining the word delimiter as whitespace characters. If anything else should be treated as a word delimiter, you'll need to add it to your regex.
Also note that since there are no UTF-8 characters in the regex itself (only in the string), the /u modifier isn't required. If you want some UTF-8 characters to act as delimiters, you'll need to add that modifier.
Update:
If you want only Cyrillic letters to count as parts of words, you may use:
$str = "текст
на 12453
кирилица";
echo 'Number of words: '.count(preg_split('/[^А-Яа-яЁё]+/u', $str));
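An alternative sketch that counts matches instead of splitting, so leading or trailing delimiters cannot produce empty items. It swaps the explicit Cyrillic range for the Unicode property \p{L} (any letter), which is an assumption about what should count as a word; the function name count_words_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: count runs of Unicode letters in a UTF-8 string.
function count_words_utf8(string $text): int
{
    // preg_match_all returns the number of matches, or false on error
    // (for example when $text is not valid UTF-8).
    $n = preg_match_all('/\p{L}+/u', $text);
    return $n === false ? 0 : $n;
}

echo count_words_utf8("текст\nна 12453\nкирилица"), PHP_EOL; // 3
```

Digits are not letters, so "12453" is skipped here, just as in the Cyrillic-only pattern.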
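And here is the solution that came to my mind:
$var = "текст на кирилица с пет думи";
$array = explode(" ", $var);
$i = 0;
foreach($array as $item)
{
    // strlen() counts bytes, and each Cyrillic letter takes 2 bytes in UTF-8,
    // so this skips empty items and one-letter words such as "с".
    if(strlen($item) > 2) $i++ ;
}
echo $i; // prints 5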
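As stated in the str_word_count description:
'word' is defined as a locale dependent string
Set the Bulgarian locale before calling str_word_count:
setlocale(LC_ALL, 'bg_BG');
echo str_word_count($content);
Note that the locale has to be installed on the system, and because str_word_count works byte by byte this is probably only useful for a single-byte encoding, not for UTF-8.
Read more about setlocale here.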
The best solution I found is to provide a list of characters for word count function:
$text = 'текст на кирилице and on english too';
$count = str_word_count($text, 0, 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя');
echo $count; // => 7
I need help...
I have a Hebrew string in PHP and I want to find the position of a substring.
My code:
$string = "אבגד הוזח טי";
$find = "הוזח";
$pos = strpos($string, $find);
echo $pos;
strpos finds the substring, but returns the wrong position.
It returns 9 instead of 5.
Why doesn't strpos work with Hebrew strings?
Can you help me, please?
Try using mb_strpos. You will have to set your internal character encoding to UTF-8 using mb_internal_encoding.
mb_internal_encoding("UTF-8");
$string = "אבגד הוזח טי";
$find = "הוזח";
$pos = mb_strpos($string, $find);
echo $pos; //5
Hebrew strings use multibyte characters, so each "character" can be 2 or more bytes long rather than 1, as most Latin-based characters are. strpos counts bytes, which is why it returns 9 here. You will probably want to look into the PHP Multibyte String Functions for your application.
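To make the byte/character difference concrete, here is a small sketch using the strings from the question. It assumes the file is saved as UTF-8, where each Hebrew letter takes 2 bytes; the variable names are illustrative.

```php
<?php
// Assumes this file is saved as UTF-8.
$string = "אבגד הוזח טי";
$find   = "הוזח";

$bytePos = strpos($string, $find);                // offset in bytes
$charPos = mb_strpos($string, $find, 0, 'UTF-8'); // offset in characters

// A byte offset can be converted by counting the characters that precede it.
$converted = mb_strlen(substr($string, 0, $bytePos), 'UTF-8');

echo "$bytePos $charPos $converted", PHP_EOL; // 9 5 5
```

The 9 comes from four 2-byte letters plus one 1-byte space before the match.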
Is there a regular expression to match a specific string with and without special characters? Special characters-insensitive, so to speak.
Like céra will match cera, and vice versa.
Any ideas?
Edit: I want to match specific strings with and without special/accented characters. Not just any string/character.
Test example:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
if (strpos($compareClientName, $this->search) !== false)
{
$clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
}
Output: <span class="highlight">céra</span>
As you can see, I want to highlight the specific search string. However, I still want to display the original (accented) characters of the matched string.
I'll have to combine this with Michael Sivolobov's answer somehow, I guess.
I think I'll have to work with a separate preg_match() and preg_replace(), right?
You can use the \p{L} pattern to match any letter.
Source
You have to use the u modifier after the regular expression to enable unicode mode.
Example : /\p{L}+/u
Edit :
Try something like this. It replaces every accented letter in the name with a pattern group that accepts the accented letter (both the single-character form and the two-character Unicode form) and the unaccented letter. You can then use the resulting search pattern to highlight your text.
function mbStringToArray($string)
{
    $array = array(); // initialized so that an empty string returns an empty array
    $strlen = mb_strlen($string, "UTF-8");
    while($strlen)
    {
        $array[] = mb_substr($string, 0, 1, "UTF-8");
        $string = mb_substr($string, 1, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}
// I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
function stripAccents($stripAccents){
return utf8_encode(strtr(utf8_decode($stripAccents),utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'),'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY'));
}
$clientName = 'céra';
$clientNameNoAccent = stripAccents($clientName);
$clientNameArray = mbStringToArray($clientName);
foreach($clientNameArray as $pos => &$char)
{
$charNA =$clientNameNoAccent[$pos];
if($char != $charNA)
{
$char = "(?:$char|$charNA|$charNA\p{M})";
}
}
$clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra
$text = 'the client name is Céra but it could be Cera or céra too.';
$search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);
echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.
If you want to know whether a letter is followed by a combining accent or another mark, you can check by matching the pattern \p{M}.
UPDATE
You need to convert every accented letter in the pattern to a group of alternatives:
E.g. céra -> c(?:é|e|e\p{M})ra
Why did I add e\p{M}? Because the letter é can be a single character in Unicode or a combination of two characters (e and a combining acute accent). e\p{M} matches the two-character form.
Once the pattern is converted to match all of these forms, you can use it in your preg_match.
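A short sketch of that pattern checked against the different forms of the name. The é is written as an escape (\x{E9} in the pattern, \u{E9} in the strings) so the example does not depend on how the file itself is normalized; the array keys are only labels.

```php
<?php
// c(?:é|e|e\p{M})ra with é written as \x{E9}; /u enables UTF-8 mode, /i ignores case.
$pattern = '/c(?:\x{E9}|e|e\p{M})ra/iu';

$candidates = [
    'precomposed' => "c\u{E9}ra",   // é as one code point (U+00E9)
    'decomposed'  => "ce\u{301}ra", // e followed by a combining acute accent (U+0301)
    'plain'       => 'cera',
    'other'       => 'cura',        // should not match
];

$results = [];
foreach ($candidates as $label => $text) {
    $results[$label] = preg_match($pattern, $text) === 1;
}

var_export($results);
```

The first three entries come out true and the last one false. The "\u{...}" string escape requires PHP 7 or later.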
As you noted in one of the comments, you don't need a regular expression for this, since the goal is to find specific strings. Why don't you use explode? Like this:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
$pieces = explode($compareClientName, $this->search);
if (count($pieces) > 1)
{
$clientName = implode('<span class="highlight">'.$clientName.'</span>', $pieces);
}
Edit:
If your $search variable may contain special characters too, why don't you transliterate it and use mb_strpos with $offset? Like this:
$offset = 0;
$highlighted = '';
$len = mb_strlen($compareClientName, 'UTF-8');
// mb_strpos returns false (not -1) when there is no further match
while(($pos = mb_strpos($this->search, $compareClientName, $offset, 'UTF-8')) !== false) {
    $highlighted .= mb_substr($this->search, $offset, $pos-$offset, 'UTF-8').
        '<span class="highlight">'.
        mb_substr($this->search, $pos, $len, 'UTF-8').'</span>';
    $offset = $pos + $len;
}
// the third argument is the length; null means "up to the end of the string"
$highlighted .= mb_substr($this->search, $offset, null, 'UTF-8');
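The loop can be packaged as a standalone function. This is a sketch under a few assumptions: the name highlight_all and its parameters are hypothetical, matching is case-sensitive and exact (no transliteration), and the text is not HTML-escaped.

```php
<?php
// Hypothetical helper: wrap every occurrence of $needle in $text with a highlight span.
function highlight_all(string $text, string $needle): string
{
    if ($needle === '') {
        return $text; // nothing to search for
    }
    $out    = '';
    $offset = 0;
    $len    = mb_strlen($needle, 'UTF-8');
    while (($pos = mb_strpos($text, $needle, $offset, 'UTF-8')) !== false) {
        $out .= mb_substr($text, $offset, $pos - $offset, 'UTF-8') // text before the match
              . '<span class="highlight">'
              . mb_substr($text, $pos, $len, 'UTF-8')              // the match itself
              . '</span>';
        $offset = $pos + $len;
    }
    return $out . mb_substr($text, $offset, null, 'UTF-8');        // remaining tail
}

echo highlight_all('la céra et céra', 'céra'), PHP_EOL;
```

Because all offsets are character offsets, the accented characters are never cut in the middle of a byte sequence.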
Update 2:
It is important to use the mb_ functions instead of plain strlen etc., because accented characters are stored using two or more bytes. Also always make sure that you use the right encoding. Take a look at this, for example:
echo strlen('é');
> 2
echo mb_strlen('é');
> 2
echo mb_internal_encoding();
> ISO-8859-1
echo mb_strlen('é', 'UTF-8');
> 1
mb_internal_encoding('UTF-8');
echo mb_strlen('é');
> 1
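As you can see here, a POSIX equivalence class matches characters with the same collating order. It is written inside a bracket expression:
[[=a=]]
This will match á and ä as well as a, depending on your locale. Note that this is a POSIX regex feature; PCRE, which backs PHP's preg_* functions, does not support it.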
I want to detect a space or a hyphen in a multibyte string.
First I split the string into an array of characters:
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
Then I try to compare each character with a hyphen or a space:
foreach ($chrArray as $char) {
if ($char == '-' || $char == ' ') {
// Do something
}
}
Oh, this one doesn't work. OK, why? Maybe because those symbols are ASCII?
echo mb_detect_encoding('-'); // ASCII
Okay, I'll try to handle it.
$encoding = mb_detect_encoding($str); // UTF-8
$dash = mb_convert_encoding('-', $encoding);
$space = mb_convert_encoding(' ', $encoding);
Oh, but that doesn't work either. Wait a second...
echo mb_detect_encoding($dash); // ASCII
!!! What's happening??? How could I do what I want?
I've come to using regexes. This one
"/(?<=-| |^)([\w]*)/u"
finds every Unicode word that is preceded by a hyphen, a space, or nothing (start of the line). Instead of iterating over the array of characters, I'm using preg_replace_callback (in PHP >= 5.4.1, mb_ereg_replace_callback can be used as well).
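For the character-by-character approach from the question, note that ASCII is a subset of UTF-8, so comparing against '-' or ' ' works as long as the input really contains those exact characters. If the comparison never succeeds, one possible cause (an assumption, since the input is not shown) is a different dash or space such as U+2010 or a no-break space. The sketch below matches any dash or space separator via Unicode properties; the function name separator_positions is hypothetical.

```php
<?php
// Hypothetical helper: character positions of any dash or space in a UTF-8 string.
function separator_positions(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // invalid UTF-8 input
    }
    $positions = [];
    foreach ($chars as $i => $char) {
        // \p{Pd} = dash punctuation (includes '-'), \p{Zs} = space separator (includes ' ')
        if (preg_match('/^[\p{Pd}\p{Zs}]$/u', $char) === 1) {
            $positions[] = $i;
        }
    }
    return $positions;
}

print_r(separator_positions('жан-поль сартр')); // positions 3 and 8
```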
I need to capitalize the first letter of every word in a string. That's easy using:
mb_convert_case($str, MB_CASE_TITLE, "UTF-8")
But what if the string contains non-alphanumeric characters:
$str = 'he said "hello world"';
echo mb_convert_case($str, MB_CASE_TITLE, "UTF-8");
Result is:
He Said "hello World"
But I need:
He Said "Hello World"
How can we handle this?
Thanks
With a regular expression.
If you are only going to work with unaccented Latin characters, it can be as simple as
$str = 'he said "hello WORLD"';
// The /e modifier was removed in PHP 7, so a callback does the uppercasing.
echo preg_replace_callback('/\b[a-z]/', function ($m) { return strtoupper($m[0]); }, strtolower($str));
This matches any lowercase unaccented Latin letter that is preceded by a word boundary. The letter is replaced with its uppercase equivalent.
If you want this to work with other languages and scripts as well, you will have to get fancy:
$str = 'he said "καλημέρα ΚΌΣΜΕ"'; // this has to be in UTF-8
echo preg_replace_callback('/(?<!\p{L})\p{Ll}/u',
    function ($m) { return mb_convert_case($m[0], MB_CASE_UPPER, 'UTF-8'); },
    mb_convert_case($str, MB_CASE_LOWER, 'UTF-8'));
To grok this you need to refer to the Unicode functionality of PCRE, and note that I have added the u modifier to the pattern. This matches any lowercase Unicode letter (the pattern \p{Ll}), provided that it is not preceded by another letter (negative lookbehind with the pattern \p{L}). It then replaces it with the uppercase equivalent.
See it in action.
Update: It looks like you intend to consider only whitespace as word boundaries. This can be done with the regular expressions
(?<=\s|^)([a-z])
(?<=\s|^)(\p{Ll})
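The choice of boundary changes the result, so here is a small sketch comparing the "not preceded by a letter" rule with the whitespace-only rule. The function name capitalize_matches and the variable names are hypothetical; UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: lower-case the string, then upper-case every match of $pattern.
function capitalize_matches(string $str, string $pattern): string
{
    $lower = mb_convert_case($str, MB_CASE_LOWER, 'UTF-8');
    return (string) preg_replace_callback(
        $pattern,
        function (array $m): string {
            return mb_convert_case($m[0], MB_CASE_UPPER, 'UTF-8');
        },
        $lower
    );
}

$afterNonLetter  = '/(?<!\p{L})\p{Ll}/u'; // letter not preceded by another letter
$afterWhitespace = '/(?<=\s|^)\p{Ll}/u';  // letter preceded by whitespace or string start

echo capitalize_matches('he said "hello WORLD"', $afterNonLetter), PHP_EOL;  // He Said "Hello World"
echo capitalize_matches('he said "hello WORLD"', $afterWhitespace), PHP_EOL; // He Said "hello World"
```

The first rule capitalizes after quotes but also after apostrophes ("it's" becomes "It'S"); the second leaves "it's" as "It's" but does not capitalize a word that starts right after a quote.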
Try something like this (according to PHP.net comments)
$str = 'he said "hello world"';
echo preg_replace_callback('/([^a-z\']|^)([a-z])/i', function ($m) { return $m[1] . strtoupper($m[2]); }, strtolower($str));
Use the manual! :D Found on php.net:
<?php
function ucwordsEx($str, $separator){
$str = str_replace($separator, " ", $str);
$str = ucwords(strtolower($str));
$str = str_replace(" ", $separator, $str);
return $str;
}
/*
Example:
*/
echo ucwordsEx("HELLO-my-NAME-iS-maNolO", "-");
/*
Prints: "Hello-My-Name-Is-Manolo"
*/
?>
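For non-Unicode characters, the following will capitalize the first letter of every word (a callback is used because the /e modifier was removed in PHP 7):
preg_replace_callback('/\b[a-z]/', function ($m) { return strtoupper($m[0]); }, strtolower($str));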
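Here is some very simple code, although by default ucwords only capitalizes after whitespace, so the letter right after the opening quote stays lowercase:
echo ucwords('he said '.ucwords('"hello world"')) ;
Output: He Said "hello World"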
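In reasonably recent PHP versions, ucwords() accepts an optional second argument listing the delimiter characters, which gives the result asked for. A minimal sketch, with the caveat that ucwords is still byte-oriented and will not upper-case accented or other non-ASCII letters:

```php
<?php
$str = 'he said "hello world"';

// Default delimiters (space, tab, CR, LF, form feed, vertical tab) plus the double quote.
$result = ucwords($str, " \t\r\n\f\v\"");

echo $result, PHP_EOL; // He Said "Hello World"
```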