php ucfirst ucwords discussion [duplicate] - php

This question already has answers here:
Make all words lowercase and the first letter of each word uppercase
(3 answers)
Closed 1 year ago.
I just wanted to share my experience when needing to deal with an language independent version of ucfirst.
the problem is when you are mixing English texts with Japanese, chinese or other languages as in my case sometimes Swedish etc. with ÅÄÖ, traditional ucfirst has issues with converting the string to capitalized.
I did however sometime ago stumbled across the following code snippet here on stack overflow:
function myucfirst($str) {
$fc = mb_strtoupper(mb_substr($str, 0, 1));
return $fc.mb_substr($str, 1);
}
It works fine in most cases but recently I also needed the translations autogenerate texts in dynamic pdfs using TCPDF.
This is when I hit my head over why TCPDF had issues with the text. I had no problems anywhere else, the character encoding was utf8 but still it bricked.
When showing Kanji for Japanese signs, I just put ignore using the above function to captitalize the word but all of a sudden when using Swedish, I encountered the same brick when I need to capitalize ÅÄÖ.
That led me to realize that the problem with the function above is that it's only looking at the first character. ÅÄÖ is taking up 2 letter spaces and kanjis for chinese or Japanese letters take up 3 letter spaces and the function above did not consider that resulting to bricking TCPDF.
To give more context, When generating PDF documents with TCPDF the TCP font will end up getting errors since the gerneal mb_string function will translate the first character to "?�"vrigt for the swedish word Övrigt and with for instance Japanese "?��"のととろ, for 隣のトトロ (my neighbour totoro.) this will make the font translation for the � not work correctly. you need to do the conversion of ÅÄÖ for the first two letters substr($str, 0,2) to be able to convert the letter properly.
Also I am not sure if you see the code examples I gave but since neither chinese or japanese use upper case letters in their writing language, I am excluding every sign that requires 3 letter spaces since they are not managing upper / lower cases at all. I don't really want to exclude them but parsing them through mb_string will lead to similar errors in TCPDF so, my examples are a workaround for now or if someone has a better solution.
so... my approach was to solve the above problem by using the following function.
function myucfirst($str) {
if ($str[0] !== "?"){
for($i = 1; $i <= 3; $i++){
$first = substr($str, 0, $i);
$first = mb_convert_case($first, MB_CASE_UPPER, "UTF-8");
if ($first !== '?'){
$rest = substr($str, $i);
break;
}
}
if ($i < 3){
$ret_string = $first . $rest;
} else {
$ret_string = $str;
}
} else {
$ret_string = $str;
}
return $ret_string;
}
Thanks to Steven Pennys' help below, this is the solution that's working both with Swedish and Japanese / chinese special characters, even when needing to use a string with the library TCPDF for dynamically creating PDFs:
function myucfirst($str) {
$ret_string = mb_convert_case($str, MB_CASE_TITLE, 'UTF-8');
return $ret_string;
}
and following to do a similar fix for ucwords
function myucwords($str){
$str = trim($str);
if (strpos($str, ' ') !== false){
$str_arr = explode(' ', $str);
foreach ($str_arr as $word){
$ret_str .= isset($ret_str)? ' ' . myucfirst($word):myucfirst($word);
}
} else {
$ret_str = myucfirst($str);
}
return $ret_str;
}
The myucwords is using the first myucfirst to capitalize each word.
Since I am not that experienced as a developer or a stack overflow contributor, you should be able to see 3 code examples and I would really appreciate if there's better ways to write these functions but for now, for those who have the similar problem, please enjoy!
/Chris

The examples you gave are poor, as with Övrigt the input is exactly the same
as the output. So I modified the example so they can be useful. See below:
<?php
# example 1
$s1 = mb_convert_case('åäö', MB_CASE_TITLE);
# example 2
$s2 = mb_convert_case('övrigt', MB_CASE_TITLE);
# exmaple 3
$s3 = mb_convert_case('隣のトトロ', MB_CASE_TITLE);
# print
var_dump($s1 == 'Åäö', $s2 == 'Övrigt', $s3 == '隣のトトロ');
Note you will need this in your php.ini, if its not already:
extension = mbstring
https://php.net/function.mb-convert-case

Related

php regex match possible accented characters

I found alot of questions about this, but none of those helped me with my especific problem. The situation: I want to search a string with something like "blablebli" and be able to find a match with all possible accented variations of that ("blablebli", "blábleblí", "blâblèbli", etc...) in an text.
I already made a workaround to to the opposite (find a word without possible accents that i wrote). But i can't figure it out a way to implement what i want.
Here is my working code. (the relevant part, this was part of a foreach so we are only seeing a single word search):
$word="something";
$word = preg_quote(trim($word)); //Just in case
$word2 = $this->removeAccents($word); // Removed all accents
if(!empty($word)) {
$sentence = "/(".$word.")|(".$word2.")/ui"; // Now I'm checking with and without accents.
if (preg_match($sentence, $content)){
echo "found";
}
}
And my removeAccents() function (i'm not sure if i covered all possible accents with that preg_replace(). So far it's working. I would appreciate if someone check if i'm missing anything):
function removeAccents($string)
{
return preg_replace('/[\`\~\']/', '', iconv('UTF-8', 'ASCII//TRANSLIT', $string));
}
What i'm trying to avoid:
I know i could check my $word and replace all a for [aàáãâä] and
same thing with other letters, but i dont know... it seens a litle
overkill.
And sure i could use my own removeAccents() function in my if
statement to check the $content without accents, something like:
if (preg_match($sentence, $content) || preg_match($sentence, removeAccents($content)))
But my problem with that second situation is i want to hightlight the word found after the match. So i can't change my $content.
Is there any way to improve my preg_match() to include possible accented characters? Or should i use my first option above?
I would decompose the string, this makes it easier to remove the offending characters, something along the lines:
<?php
// Convert unicode input to NFKD form.
$str = Normalizer::normalize("blábleblí", Normalizer::FORM_KD);
// Remove all combining characters (https://en.wikipedia.org/wiki/Combining_character).
var_dump(preg_replace('/[\x{0300}-\x{036f}]/u', "", $str));
Thanks for the help everyone, but i will end it up using my first sugestion i made in my question. And thanks again #CasimiretHippolyte for your patience, and making me realize that isn't that overkill as i thought.
Here is the final code I'm using (first the functions):
function removeAccents($string)
{
return preg_replace('/[\x{0300}-\x{036f}]/u', '', Normalizer::normalize($string, Normalizer::FORM_KD));
}
function addAccents($string)
{
$array1 = array('a', 'c', 'e', 'i' , 'n', 'o', 'u', 'y');
$array2 = array('[aàáâãäå]','[cçćĉċč]','[eèéêë]','[iìíîï]','[nñ]','[oòóôõö]','[uùúûü]','[yýÿ]');
return str_replace($array1, $array2, strtolower($string));
}
And:
$word="something";
$word = preg_quote(trim($word)); //Just in case
$word2 = $this->addAccents($this->removeAccents($word)); //check all possible accents
if(!empty($word)) {
$sentence = "/(".$word.")|(".$word2.")/ui"; // Now I'm checking my normal word and the possible variations of it.
if (preg_match($sentence, $content)){
echo "found";
}
}
Btw, im covering all possible accents from my country (and some others). You should check if you need to improve the addAccents() function before use it.

PHP: UTF8_decode needed with filter for ASCII values 126-160; proposed solution

I previously began exploring this problem here. Here is the true problem, and a proposed solution:
Filenames with ASCII characters values between 32 and 255 pose a problem for utf8_encode(). Specifically, it doesn't handle the character values inclusively between 126 and 160 correctly. While filenames with those character names may be written to a database, passing those filenames to a function in PHP code will produce error messages stating the file cannot be found, etc.
I discovered this when trying to pass a filename with the offending characters to getimagesize().
What is needed for utf8_encode is a filter to EXCLUDE the conversion of the inclusive values between 126 and 160, while INCLUDING the conversion of all other characters (or any character, characters, or character ranges of the user's dersire; mine is for the ranges stated, for the reason provided).
The solution I devised requires two functions, listed below, and their application that follows:
// With thanks to Mark Baker for this function, posted elsewhere on StackOverflow
function _unichr($o) {
if (function_exists('mb_convert_encoding')) {
return mb_convert_encoding('&#'.intval($o).';', 'UTF-8', 'HTML-ENTITIES');
} else {
return chr(intval($o));
}
}
// For each character where value is inclusively between 126 and 160,
// write out the _unichr of the character, else write out the UTF8_encode of the character
function smart_utf8_encode($source) {
$text_array = str_split($source, 1);
$new_string = '';
foreach ($text_array as $character) {
$value = ord($character);
if ((126 <= $value) && ($value <= 160)) {
$new_string .= _unichr($character);
} else {
$new_string .= utf8_encode($character);
}
}
return $new_string;
}
$file_name = "abcdefghijklmnopqrstuvxyz~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ.jpg";
// This MUST be done first:
$file_name = iconv('UTF-8', 'WINDOWS-1252', $file_name);
// Next, smart_utf8_encode the variable (from encoding.inc.php):
$file_name = smart_utf8_encode($file_name);
// Now the file name may be passed to getimagesize(), etc.
$getimagesize = getimagesize($file_name);
If only PHP7 (6 is being skipped in the numbering, yes?) would include a filter on utf8_encode() to exclude certain character values, none of this would be necessary.

php sprintf() with foreign characters?

Seams to be like sprintf have a problem with foregin characters? Or is it me doing something wrong? Looks like it work when removing chars like åäö from the string though. Should that be necessary?
I want the following lines to be aligned correctly for a report:
2011-11-27 A1823 -Ref. Leif - 12 873,00 18.98
2011-11-30 A1856 -Rättat xx - 6 594,00 19.18
I'm using sprintf() like this: %-12s %-8s -%-10s -%20s %8.2f
Using: php-5.3.23-nts-Win32-VC9-x86
Strings in PHP are basically arrays of bytes (not characters). They cannot work natively with multibyte encodings (such as UTF-8).
For details see:
https://www.php.net/manual/en/language.types.string.php#language.types.string.details
Most string functions in PHP have multibyte equivalent though (with the mb_ prefix). But the sprintf does not.
There's a user comment (by "viktor at textalk dot com") with multibyte implementation of the sprintf on the function's documentation page at php.net. It may work for you:
https://www.php.net/manual/en/function.sprintf.php#89020
I was actually trying to find out if PHP ^7 finally has a native mb_sprintf() but apparently no xD.
For the sake of completeness, here is a simple solution I've been using in some old projects. It just adds the diff between strlen & mb_strlen to the desired $targetLengh.
The non-multibyte example is just added for the sake of easy comparison =).
$text = "Gultigkeitsprufung ist fehlgeschlagen: %{errors}";
$mbText = "Gültigkeitsprüfung ist fehlgeschlagen: %{errors}";
$mbTextRussian = "Проверка не удалась: %{errors}";
$targetLength = 60;
$mbTargetLength = strlen($mbText) - mb_strlen($mbText) + $targetLength;
$mbRussianTargetLength = strlen($mbTextRussian) - mb_strlen($mbTextRussian) + $targetLength;
printf("%{$targetLength}s\n", $text);
printf("%{$mbTargetLength}s\n", $mbText);
printf("%{$mbRussianTargetLength}s\n", $mbTextRussian);
result
Gultigkeitsprufung ist fehlgeschlagen: %{errors}
Gültigkeitsprüfung ist fehlgeschlagen: %{errors}
Проверка не удалась: %{errors}
update 2019-06-12
#flowtron made me give it another thought. A simple mb_sprintf() could look like this.
function mb_sprintf($format, ...$args) {
$params = $args;
$callback = function ($length) use (&$params) {
$value = array_shift($params);
return strlen($value) - mb_strlen($value) + $length[0];
};
$format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
return sprintf($format, ...$args);
}
echo mb_sprintf("%-10s %-10s %10s\n", 'thüs', 'wörks', 'ök');
echo mb_sprintf("%-10s %-10s %10s\n", 'this', 'works', 'ok');
result
thüs wörks ök
this works ok
I only did some happy path testing here, but it works for PHP >=5.6 and should be good enough to give ppl an idea on how to encapsulate the behavior.
It does not work with the repetition/order modifiers though - e.g. %1$20s will be ignored/remain unchanged.
If you're using characters that fit in the ISO-8859-1 character set, you can convert the strings before formatting, and convert the result back to UTF8 when you are done
utf8_encode(sprintf("%-12s %-8s", utf8_decode($paramOne), utf8_decode($paramTwo))
Problem
There is no multibyte format functions.
Idea
You can't convert input strings. You should change format lengths.
A format %4s means 4 widths (not characters - see footnote). But PHP format functions count bytes.
So you should add format lengths to bytes - widths.
Implementations
from #nimmneun
function mb_sprintf($format, ...$args) {
$params = $args;
$callback = function ($length) use (&$params) {
$value = array_shift($params);
return $length[0] + strlen($value) - mb_strwidth($value);
};
$format = preg_replace_callback('/(?<=%|%-)\d+(?=s)/', $callback, $format);
return sprintf($format, ...$args);
}
And don't forget another option str_pad($input, $length, $pad_char=' ', STR_PAD_RIGHT)
function mb_str_pad(...$args) {
$args[1] += strlen($args[0]) - mb_strwidth($args[0]);
return str_pad(...$args);
}
Footnote
Asian characters have 3 bytes and 2 width and 1 character length.
If your format is %4s and the input is one asian character, you should need two spaces (padding) not three.

PHP method for stripping duplicate chars from a multibyte string?

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of PHP's positional arrays: keys are sorted in the order that they are defined. Once we've gone through the string and identified all of the characters, we grab the keys and join'em back together in the same order that they appeared in the string. You also get a per-character character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
Please try to check the iconv_strlen PHP standard library function. Can't say about orient encodings, but it works fine for european and east europe languages. In any case it gives some freedom!
$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
Much easier. User str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way.

php outputting strange character

I have the following code to generate a random password string:
<?php
$password = '';
for($i=0; $i<10; $i++) {
$chars = array('lower' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'), 'upper' => array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'), 'num' => array('1','2','3','4','5','6','7','8','9','0'), 'sym' => array('!','£','$','%','^','&','*','(',')','-','=','+','{','}','[',']',':','#','~',';','#','<','>','?',',','.','/'));
$set = rand(1, 4);
switch($set) {
case 1:
$set = 'lower';
break;
case 2:
$set = 'upper';
break;
case 3:
$set = 'num';
break;
case 4:
$set = 'sym';
break;
}
$count = count($chars[$set]);
$digit = rand(0, ($count-1));
$output = $chars[$set][$digit];
$password.= $output;
}
echo $password;
?>
However every now and then one of the characters it outputs will be a capital a with a ^ above it. French or something. How is this possible? it can only pick whats it my arrays!
The only non-ascii character is the pound character, so my guess is that it has to do with this.
First off, it's probably a good idea to avoid that one, as not many people will be able to easily type it.
Good chance that the encoding of your php file (or the encoding set by your editor) is not the same as your output encoding.
Are you sure it is indeed a character not in your array, or is the browser just unable to output? For example your monetary pound sign. Ensure that both PHP, DB, and HTML output all use the same encoding.
On a separate note, your loop is slightly more complicated than it needs to be. I typically see password generators randomize a string versus several arrays. A quick example:
$chars = "abcdefghijkABCDEFG1289398$%#^&";
$pos = rand(0, strlen($chars) - 1);
$password .= $chars[$pos];
i think you generate special HTML characters
for example here and iso8859-1 table
You may be seeing the byte sequence C2 A3, appearing as your capital A with a circumflex followed by a pound symbol. This is because C2A3 is the UTF-8 sequence for a pound sign. As such, if you've managed to enter the UTF-8 character in your PHP file (possibly without noticing it, depending on your editor and environment) you'd see the separate byte sequence as output if your environment is then ASCII / ISO8859-1 or similar.
As per Jason McCreary, I use this function for such Password Creation
function randomString($length) {
$characters = "0123456789abcdefghijklmnopqrstuvwxyz" .
"ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&";
$string = '';
for ($p = 0; $p < $length; $p++)
$string .= $characters[mt_rand(0, strlen($characters))];
return $string;
}
The pound symbol (£) is what is breaking, since it is not part of the basic ASCII character set.
You need to do one of the following:
Drop the pound symbol (this will also help people using non-UK keyboards!)
Convert the pound symbol to an HTML entity when outputting it to the site (&#pound;)
Set your site's character set encoding to UTF-8, which will allow extended characters to be displayed. This is probably the best option in the long run, and should be fairly quick and easy to achieve.

Categories