ucwords and French accented letters encoding [duplicate]

This question already has answers here:
Make all words lowercase and the first letter of each word uppercase
(3 answers)
Closed 1 year ago.
We have a database of Canadian addresses, all in caps. The client asked us to convert them to lower case except for the first letter and the letter after a '-'.
So I wrote this function, but I'm having a problem with French accented letters.
When the file and charset are ISO-8859-1 it works fine, but when I switch to UTF-8 it no longer works.
Example input: 'damien-claude élanger'
Output: Damien-Claude élanger
In UTF-8 the é becomes �
function cap_letter($string) {
$lower = str_split("àáâçèéêë");
$caps = str_split("ÀÁÂÇÈÉÊË");
$letters = str_split(strtolower($string));
foreach($letters as $code => $letter) {
if($letter === '-' || $letter === ' ') {
$position = array_search($letters[$code+1],$lower);
if($position !== false) {
// test
echo $letters[$code+1] . ' == ' . $caps[$position] ;
$letters[$code+1] = $caps[$position];
}
else {
$letters[$code+1] = mb_strtoupper($letters[$code+1]);
}
}
}
//return ucwords(implode($letters)) ;
return implode($letters) ;
}
The other solution I have in mind is ucwords(strtolower($str)): since all the addresses are already in caps, the É stays É even after strtolower is applied.
But then I'd have the problem of an uppercase É inside a word, e.g. XXXÉXXÉ.

Try mb_* string functions for multibyte characters.
echo mb_convert_case(mb_strtolower($str), MB_CASE_TITLE, "UTF-8");
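As a minimal sketch of the same idea, the rule from the question (capitalize the first letter and any letter after a space or a '-') can also be spelled out explicitly. The function name cap_words is hypothetical, and the sketch assumes the mbstring extension and UTF-8 input:

```php
<?php
// Hypothetical helper: lower-case everything, then upper-case each letter
// that starts the string or follows whitespace or a hyphen.
function cap_words(string $s): string
{
    $lower = mb_strtolower($s, 'UTF-8');

    return preg_replace_callback(
        '/(^|[\s-])(\p{L})/u', // the u flag makes PCRE treat the subject as UTF-8
        function (array $m): string {
            return $m[1] . mb_strtoupper($m[2], 'UTF-8');
        },
        $lower
    );
}

echo cap_words("DAMIEN-CLAUDE \u{c9}LANGER"), "\n"; // Damien-Claude Élanger
```

Because both the case conversion and the regex work on characters rather than bytes, the accented letter is never split in the middle.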

I had the same problem in Spanish, and I wrote this function:
function capitalize($string)
{
// Convert to UTF-8 only when the input is not already valid UTF-8
// (assumed here to be ISO-8859-1); re-encoding valid UTF-8 would double-encode it.
if (!mb_check_encoding($string, 'UTF-8')) {
$string = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
}
return mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
}

Related

php str_shuffle does wrong encoding on greek letters

I'm working on a project where I use PHP to grab a random Greek word from a XAMPP SQL server. I then use str_shuffle() to randomize the letter order (e.g. bye => ybe). However, using str_shuffle() on Greek letters returns the word with many ???? in place of most of the letters. If I remove str_shuffle() from my code, the Greek word is displayed correctly with no ???.
I have written code that ensures I have the correct encoding, but str_shuffle() is the problem.
<h1 id = "hidden-word">The word is :
<?php
$link = mysqli_connect('localhost' , 'root' , '' ,'dodecanese');
if(!$link){
echo 'Error connecting to DB';
exit;
}
mysqli_query($link,"SET NAMES 'utf8'");
$query = "SELECT island_name FROM dodecanese ORDER BY RAND() LIMIT 1";
$result = mysqli_query($link, $query);
if(!$result){
echo 'There is an issue with the DB';
exit;
}
$row = mysqli_fetch_assoc($result);
//str shuffle creates ?
echo '<span id = "random-island">'.str_shuffle($row['island_name']). '</span>';
?>
</h1>
I also declare the encoding with <meta charset="utf-8"/> in the HTML. I have read many posts about this, especially "UTF-8 all the way through", but they did not help. I would appreciate your help with this. Thank you in advance.

I looked in the PHP manual for str_shuffle and found in the comments that there are indeed problems with some Unicode characters.
There is also a solution there, which I've tested for you, and it works:
<?php
function str_shuffle_unicode($str) {
$tmp = preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
shuffle($tmp);
return join("", $tmp);
}
$a = "γκξπψ";
$b = str_shuffle($a);
$c = str_shuffle_unicode($a);
echo $a; // γκξπψ
echo "<br/>str_shuffle: ".$b; // ξ��κ�ψ�
echo "<br/>str_shuffle_unicode: ".$c; // κξγπψ
?>

Unfortunately, str_shuffle() does not work with multibyte characters, and as far as I know there is no built-in function that does. As a workaround you can write your own. For example, the code below splits the string into an array of single characters, shuffles the array, and joins its elements back into a string (I used Cyrillic letters for the example):
$str = "абвгдежзий";
$temp = mb_str_split($str,1);
shuffle($temp);
$str = join("", $temp);
echo $str;
The mb_str_split function used above is available only in PHP 7.4+. If you are using an earlier version, you can use preg_split:
$str = "абвгдежзий";
$temp = preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
shuffle($temp);
$str = join("", $temp);
echo $str;
or the more cumbersome preg_match_all:
$str = "абвгдежзий";
preg_match_all('/./u', $str, $temp);
shuffle($temp[0]);
$str = join("", $temp[0]);
echo $str;
or even looping and adding to the array char-by-char (that way you save a regex call):
$str = "абвгдежзий";
$len = mb_strlen($str, 'UTF-8');
$temp = [];
for ($i = 0; $i < $len; $i++) {
$temp[] = mb_substr($str, $i, 1, 'UTF-8');
}
shuffle($temp);
$str = join("", $temp);
echo $str;

Encoding smileys in a string with mb_convert_encoding [duplicate]

How to convert a Unicode string to HTML entities? (HEX not decimal)
For example, convert Français to Fran&#xE7;ais.

For the missing hex-encoding in the related question:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
This is similar to @Baba's answer: it converts to UTF-32BE and then uses unpack and vsprintf for the formatting.
If you prefer iconv over mb_convert_encoding, it's similar:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = iconv('UTF-8', 'UTF-32BE', $utf8);
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
I find this string manipulation a bit clearer than the one in "Get hexcode of html entities".
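On PHP 7.2 or later, where the mbstring extension provides a built-in mb_ord(), the same conversion can be sketched without the UTF-32 round trip. The function name to_hex_entities is hypothetical, and the input is assumed to be valid UTF-8:

```php
<?php
// Hypothetical helper: replace every non-ASCII character with a
// hexadecimal numeric character reference.
function to_hex_entities(string $input): string
{
    return preg_replace_callback(
        '/[\x{80}-\x{10FFFF}]/u',
        function (array $match): string {
            // mb_ord() returns the Unicode code point of the character.
            return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
        },
        $input
    );
}

echo to_hex_entities("Fran\u{e7}ais"), "\n"; // Fran&#xE7;ais
```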

Your string looks like it is UCS-4 encoded. You can try:
$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$char = current($m);
$utf = iconv('UTF-8', 'UCS-4', $char);
return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);
Output
string 'Fran&#xE7;ais' (length=13)

When I faced this problem recently, I solved it by making sure my code files, DB connection, and DB tables were all UTF-8. Then simply echoing the text works. If you must escape the output from the DB, use htmlspecialchars() and not htmlentities(), so that the UTF-8 characters are left alone rather than escaped.
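To illustrate the difference between the two functions, here is a small sketch. The sample text is made up, and the encoding is passed explicitly as UTF-8:

```php
<?php
$text = "Fran\u{e7}ais <b>";

// htmlspecialchars() escapes only the HTML-significant characters.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() also replaces every character that has a named entity.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // Français &lt;b&gt;
echo $entities, "\n"; // Fran&ccedil;ais &lt;b&gt;
```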
I would like to document an alternative solution, because it solved a similar problem for me.
I was using PHP's utf8_encode() to escape 'special' characters.
I wanted to convert them into HTML entities for display. I wrote this code because I wanted to avoid iconv and similar functions as far as possible, since not all environments necessarily have them (do correct me if that is not so!):
function unicode2html($string) {
return preg_replace('/\\\\u([0-9a-z]{4})/', '&#x$1;', $string);
}
$foo = 'This is my test string \u03b50';
echo unicode2html($foo);
Hope this helps somebody in need :-)

See "How to get the character from unicode code point in PHP?" for some code that allows you to do the following.
Example use:
echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));
echo "\nGet numeric value of character as DEC string\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));
Output :
Get string from numeric DEC value
string(4) "ď"
string(2) "ď"
Get string from numeric HEX value
string(4) "ď"
string(2) "ď"
Get numeric value of character as DEC int
int(50319)
int(271)
Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"
Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"
Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"
Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"

You can also use mb_encode_numericentity, which is supported by PHP 4.0.6+.
function unicode2html($value) {
return mb_encode_numericentity($value, [
// start codepoint
// | end codepoint
// | | offset
// | | | mask
0x0000, 0x001F, 0x0000, 0xFFFF,
0x0021, 0x002C, 0x0000, 0xFFFF,
0x002E, 0x002F, 0x0000, 0xFFFF,
0x003C, 0x003C, 0x0000, 0xFFFF,
0x003E, 0x003E, 0x0000, 0xFFFF,
0x0060, 0x0060, 0x0000, 0xFFFF,
0x0080, 0xFFFF, 0x0000, 0xFFFF
], 'UTF-8', true);
}
In this way it is also possible to indicate which ranges of characters to convert into hexadecimal entities and which ones to preserve as characters.
Usage example:
$input = array(
'"Meno più, PIÙ o meno"',
'\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
'<script>alert("XSS");</script>',
'"`'
);
$output = array();
foreach ($input as $str)
$output[] = unicode2html($str);
Result:
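$output = array(
'&#x22;Meno pi&#xF9;&#x2C; PI&#xD9; o meno&#x22;',
'&#x27;&#xC0;&#xCC;&#xD9;&#xD2;L&#xC8; PERCH&#xC9; perch&#xE9; &#xE8; sempre cos&#xEC; non si s&#xE0;&#x27;',
'&#x3C;script&#x3E;alert&#x28;&#x22;XSS&#x22;&#x29;;&#x3C;&#x2F;script&#x3E;',
'&#x22;&#x60;'
);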

This solution is like @hakre's (Nov 8, 2012 at 0:35), but it produces named HTML entities where they exist:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a";
// => $output: "Ob&oacute;z w&eogon;drowny Ko&lstrok;a"
// while both @hakre's and @Baba's code give:
// => $output: "Ob&#xF3;z w&#x119;drowny Ko&#x142;a"
But there is always a problem when the input contains invalid UTF-8, e.g.:
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a - ok\xB3adka";
// meant as "Ob&oacute;z w&eogon;drowny Ko&lstrok;a - ok&lstrok;adka" in HTML ("\xB3" is "ł" in ISO-8859-2/windows-1250)
but here
// => $output: (empty)
and the same happens with @hakre's code... :(
It was hard to find the cause. The only solution I know is the following (does anyone know a simpler one?):
function utf_entities($input) {
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
if (empty($output) && (!empty($input))) { // Trouble... Maybe not UTF-8 code inside UTF-8 string...
/* Processing string against not UTF-8 chars... */
$output = ''; // New - repaired
for ($i=0; $i<strlen($input); $i++) {
if (($char = $input[$i])<"\x80") {
$output .= $char;
} else { // maybe UTF-8 (0b ..110xx..) or not UTF-8 (i.e. 0b11111111 etc.)
$j = 0; // how many chars more in UTF-8
$char = ord($char);
do { // checking first UTF-8 code char bits
$char = ($char << 1) % 0x100;
$j++;
} while (($j<4 /* 6 before RFC 3629 */)&& (($char & 0b11000000) === 0b11000000));
$k = $i+1;
if ($j<4 /* 6 before RFC 3629 */ && (($char & 0b11000000) === 0b10000000)) { // maybe UTF-8...
for ($k=$i+$j; $k>$i && ((ord($input[$k]) & 0b11000000) === 0b10000000); $k--) ; // ...checking next bytes for valid UTF-8 codes
}
if ($k>$i || ($j>=4 /* 6 before RFC 3629 */) || (($char & 0b11000000) !== 0b10000000)) { // Not UTF-8
$output .= '&#x'.dechex(ord($input[$i])).';'; // "&#xXX;"
} else { // UTF=8 !
$output .= substr($input, $i, 1+$j);
$i += $j;
}
}
}
return utf_entities($output); // recursively after repairing
}
return $output;
}
I.e.:
echo utf_entities("o\xC5\x82a - k\xB3a"); // o&lstrok;a - k³a - UTF-8 + fixed
echo utf_entities("o".chr(0b11111101).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// oñ¸¸¸¸¸a - invalid UTF-8 (6-bytes UTF-8 valid before RFC 3629), fixed
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a - k\xB3a");
// o񸸸a - k³a - UTF-8 + fixed ("\xB3")
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// o񸸸a - valid UTF-8!
echo utf_entities("o".chr(0b11110001).'a'.chr(0b10111000).chr(0b10111000)."a");
// oña¸¸a - invalid UTF-8, fixed

Check if a string contains the "\" [duplicate]

This question already has answers here:
How do you make a string in PHP with a backslash in it? [closed]
(3 answers)
Find the occurrence of backslash in a string
(3 answers)
Closed 3 years ago.
I am trying to check whether a string contains \. If I write it as
$search4 = '\'; or as $search4 = "\"; it doesn't work, because neither is a valid string literal.
This is my search function:
function Search($search, $string){
$position = strpos($string, $search);
if ($position == true){
return 'true';
}
else{
return 'false';
}}
And I am calling it like this: echo Search($search4, $string);

You need to escape the \ by writing it twice (\\). A lone '\' escapes the closing single quote, which causes a syntax error. The same happens with double quotes.
http://php.net/manual/en/regexp.reference.escape.php
function Search($search, $string){
$position = strpos($string, $search);
if ($position !== false){ // strpos returns 0 for a match at the start, so compare strictly against false
return 'true';
}else{
return 'false';
}
}
$search = '\\';
print Search($search, "someString");
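As a small self-contained sketch of the check itself: strpos() returns the integer 0 when the match is at the very start of the string, and 0 == true is false, so the result is best compared strictly against false. The function name contains_backslash is hypothetical:

```php
<?php
// Hypothetical helper: true if the string contains at least one backslash.
function contains_backslash(string $haystack): bool
{
    // '\\' inside a single-quoted string is one literal backslash.
    return strpos($haystack, '\\') !== false;
}

var_dump(contains_backslash('C:\\temp'));   // bool(true)
var_dump(contains_backslash('\\leading'));  // bool(true), match at position 0
var_dump(contains_backslash('no slash'));   // bool(false)
```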

Why is php str_pad only working every 3rd time for Unicode Character “─” [duplicate]

This question already has answers here:
Special characters throwing off str_pad in php?
(3 answers)
Closed 5 years ago.
I am doing a simple str_pad with the Unicode Character “─” https://www.compart.com/en/unicode/U+2500
for($i=0;$i<50;$i++){
echo str_pad("", $i,"─");
echo "\n";
}
But the output in the PHP CLI is:
▒
▒
─
─▒
─▒
──
──▒
──▒
───
───▒
───▒
────
────▒
────▒
─────
─────▒
─────▒
──────
...
So it appears that every 3rd line is correct, but the 1st and 2nd end with a different character.
If I use str_repeat instead, it works fine:
for($i=0;$i<50;$i++){
echo str_repeat("─", $i);
echo "\n";
}
Results in :
─
──
───
────
─────
──────
───────
────────
─────────
──────────
...
So str_repeat works fine, but str_pad has a very weird and unexpected result. Any idea why this is happening?

Looks like a multibyte issue.
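A minimal sketch of why every third line came out right: str_pad() counts bytes, and "─" (U+2500) takes three bytes in UTF-8, so any pad length that is not a multiple of three cuts the last character in the middle. The variable names are made up for the illustration:

```php
<?php
$pad = "\u{2500}"; // "─", encoded in UTF-8 as the three bytes E2 94 80

echo strlen($pad), "\n";              // 3 (bytes)
echo mb_strlen($pad, 'UTF-8'), "\n";  // 1 (character)

// Padding to 4 bytes yields one whole character plus a dangling lead byte.
$padded = str_pad("", 4, $pad);
echo bin2hex($padded), "\n";          // e29480e2
```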
A quick way to get str_pad to work with Unicode characters:
for($i=0;$i<50;$i++){
echo str_pad("", $i*strlen("─"),"─");
echo "\n";
}
So this will give the desired output of :
─
──
───
────
─────
──────
───────
────────
...
But this will break when you actually have a string to pad out:
for($i=0;$i<50;$i++){
echo str_pad("X", $i*strlen("─"),"─");
echo "\n";
}
Becomes :
X
X▒
X─▒
X──▒
X───▒
X────▒
X─────▒
....
So you just need a multibyte version, for example:
http://php.net/manual/en/function.str-pad.php#116244
for($i=0;$i<50;$i++){
echo mb_str_pad("X", $i,"─");
echo "\n";
}
function mb_str_pad($str, $pad_len, $pad_str = ' ', $dir = STR_PAD_RIGHT, $encoding = NULL)
{
$encoding = $encoding === NULL ? mb_internal_encoding() : $encoding;
$padBefore = $dir === STR_PAD_BOTH || $dir === STR_PAD_LEFT;
$padAfter = $dir === STR_PAD_BOTH || $dir === STR_PAD_RIGHT;
$pad_len -= mb_strlen($str, $encoding);
$targetLen = $padBefore && $padAfter ? $pad_len / 2 : $pad_len;
$strToRepeatLen = mb_strlen($pad_str, $encoding);
$repeatTimes = ceil($targetLen / $strToRepeatLen);
$repeatedString = str_repeat($pad_str, max(0, $repeatTimes)); // safe if used with valid utf-8 strings
$before = $padBefore ? mb_substr($repeatedString, 0, floor($targetLen), $encoding) : '';
$after = $padAfter ? mb_substr($repeatedString, 0, ceil($targetLen), $encoding) : '';
return $before . $str . $after;
}

str_ireplace not working with German characters

I am using the following function to search for words and highlight them in a text. It works perfectly except for German characters (ä, ë, ß, etc.). I have already tried encoding to UTF-8, decoding, and checking my meta tags, but encoding is not the problem: the characters display correctly on the site, they're just not "colored" by this function:
function highlight($keyword, $input, $linktext, $color){
$text = $input;
$word = $keyword;
$text = str_ireplace(" ".$word, ' <span id="">' . $word . '</span>', $text);
$iteration = 1;
while (true) {
$text = preg_replace('/<span.id="">' . $word . '<\/span>/imsxU', '<span style="background:'.$color.'" class="keyword" id="link' .
$iteration . "\" onclick=\"setLink2('$keyword','$linktext',$iteration)\">" . $word . '</span>', $text, 1, $count);
if (!$count) {
break;
}
$y++;
$iteration++;
}
return $text;
}
Any idea how I can achieve this? I also tried replacing the characters, but the German words have to appear exactly as they are in the text, so that's a no-go =/

PHP's str_* functions are not multibyte-aware, so you have to use the mbstring (mb_*) functions. In your case, replace str_ireplace with mb_eregi_replace.
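Another option, sketched here as an assumption rather than as the answer's own method, is a PCRE pattern with the i and u flags, which matches case-insensitively on UTF-8 characters and keeps the matched text exactly as it appears. The function name highlight_word is hypothetical:

```php
<?php
// Hypothetical helper: wrap every case-insensitive occurrence of $keyword
// in a coloured span, preserving the original spelling of each match.
function highlight_word(string $keyword, string $text, string $color): string
{
    // preg_quote() escapes regex metacharacters in the keyword;
    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    $pattern = '/' . preg_quote($keyword, '/') . '/iu';
    $style = htmlspecialchars($color, ENT_QUOTES, 'UTF-8');

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($style): string {
            return '<span style="background:' . $style . '">' . $m[0] . '</span>';
        },
        $text
    );
}

echo highlight_word("\u{e4}pfel", "Gr\u{fc}ne \u{c4}pfel", 'yellow'), "\n";
// Grüne <span style="background:yellow">Äpfel</span>
```

Note that this relies on simple case folding, so it matches ä against Ä but does not treat ß and SS as equal.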
