I'm writing a basic function in PHP which takes an input string and converts a list of "weird" characters to URL-friendly ones. Writing the function is not the issue; the problem is how PHP interprets strings with weird characters.
For example, right now I have this problem:
$string = "år";
echo $string[0]; // Output: �
echo $string[1]; // Output: �
echo $string[0] . $string[1]; // Output: å
echo $string[2]; // Output: r
So basically it interprets the letter "å" as two characters, which causes a problem for me, because I want to be able to look at each character of the string individually and replace it if needed.
I encode everything in UTF-8 and I know my issue has something to do with UTF-8 storing weird characters as two bytes, as we've seen above.
But how do I work around this? Basically I want to achieve this:
$string = "år";
echo $string[0]; // Output: å
echo $string[1]; // Output: r
$string = "år";
mb_internal_encoding('UTF-8');
echo mb_substr($string, 0, 1); // å
echo mb_substr($string, 1, 1); // r
UTF-8 is a variable-width encoding: ASCII characters take one byte, while non-ASCII letters such as "å" take two or more bytes. Array-like access to a string variable returns a single byte, not a letter. To get a whole letter, you should use the multibyte string functions:
echo mb_substr($string, 0,1);// Output: å
echo mb_substr($string, 1,1);// Output: r
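To make the byte versus character distinction concrete, here is a minimal sketch. It assumes the file is saved as UTF-8 and the mbstring extension is loaded; `char_at` is a hypothetical helper name, not a built-in.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$string = "år";

// strlen() counts bytes; "å" is encoded as the two bytes 0xC3 0xA5.
$byteCount = strlen($string);
// mb_strlen() counts characters in the encoding it is given.
$charCount = mb_strlen($string, 'UTF-8');

// Hypothetical helper: return the character at a zero-based position.
function char_at(string $s, int $index): string
{
    return mb_substr($s, $index, 1, 'UTF-8');
}

echo $byteCount, ' bytes, ', $charCount, " characters\n"; // 3 bytes, 2 characters
echo bin2hex($string[0] . $string[1]), "\n";             // c3a5, the two bytes of "å"
echo char_at($string, 0), char_at($string, 1), "\n";     // år
```

Passing the encoding explicitly avoids depending on whatever mb_internal_encoding() happens to be set to.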
Related
To grab the first letter of a string I use substr:
$string = "John doe";
echo substr($string,0,1);
// output: J
But this does not work when the string is, for example, in Russian or starts with an accented letter:
$string = "Марина Матвиенко";
echo substr($string,0,1);
// output: nothing
$string = "Éduard Rousseaux";
echo substr($string,0,1);
// output: nothing
Do I need to convert the string into Latin characters first, or is there another way to grab the first letter of a string with non-Latin characters?
You are looking for mb_substr when using multibyte chars.
echo mb_substr($string,0,1);
An answer to another question here led me to the following "problem" or challenge:
Is it somehow possible to get a character from a specific position if the string is UTF-8 encoded and contains special chars?
For strings without special chars, this works:
$str = 'abcd';
echo $str[1]; // will print "b" (the older $str{1} syntax was removed in PHP 8.0)
But for a string like this:
$str = 'abc★';
echo $str[1]; // will return "b"
echo $str[3]; // leads to a question mark
Of course the PHP file is encoded in UTF-8 and <meta charset="utf-8"> is in the head of the HTML.
So is there any way to make this kind of per-character access work?
One possible way
$str = 'abc★';
preg_match_all('/./su', $str, $m);
$chars = $m[0];
echo $chars[1]; // b
echo $chars[3]; // ★
/./su means "any character, including newline ("s"), in utf8 mode ("u")".
Or like this
echo mb_substr($str, 3, 1, 'utf8'); // ★
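If you need to index into the same string many times, it can be simpler to split it into an array of characters once. The sketch below uses preg_split() with an empty pattern and the "u" modifier; `utf8_chars` is a hypothetical helper name, and the input is assumed to be valid UTF-8. On PHP 7.4 and later, mb_str_split() does the same job.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8_chars(string $s): array
{
    // An empty pattern with the "u" modifier splits between code points.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false when $s is not valid UTF-8.
    return $chars === false ? [] : $chars;
}

$chars = utf8_chars('abc★');
echo count($chars), "\n"; // 4
echo $chars[1], "\n";     // b
echo $chars[3], "\n";     // ★
```

After the split, `$chars[3]` behaves the way `$str[3]` was expected to in the question.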
I have a string with all letters capitalized. I'm using the ucwords() and mb_strtolower() functions to capitalize only the first letter of each word. But I'm having some problems when the first letter of a word has an accent. For example:
ucwords(mb_strtolower('GRANDE ÁRVORE')); //outputs 'Grande árvore'
Why is the first letter of the second word not being capitalized? What can I do to solve this?
ucwords is one of the core PHP functions which is blissfully oblivious to non-ASCII or non-Latin-1 encodings.* For handling multibyte strings and/or non-ASCII strings, you should use the multibyte aware mb_convert_case:
mb_convert_case($str, MB_CASE_TITLE, 'UTF-8')
// your string encoding here --------^^^^^^^
* I'm not entirely sure whether it works only with ASCII or at least with Latin-1, but I wouldn't even bother to find out.
If you only want to capitalize the first letter of the string, here's a way to achieve it:
$s = "économie collégiale";
echo mb_strtoupper(mb_substr($s, 0, 1, 'UTF-8'), 'UTF-8') . mb_substr($s, 1, null, 'UTF-8');
// output : Économie collégiale
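To show the two behaviours side by side (first letter of the string only versus first letter of every word), here is a small sketch. `ucfirst_utf8` and `title_case_utf8` are hypothetical helper names, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: upper-case only the first character of the string.
function ucfirst_utf8(string $s): string
{
    $first = mb_substr($s, 0, 1, 'UTF-8');
    $rest  = mb_substr($s, 1, null, 'UTF-8');
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

// Hypothetical helper: upper-case the first character of every word
// and lower-case the rest.
function title_case_utf8(string $s): string
{
    return mb_convert_case($s, MB_CASE_TITLE, 'UTF-8');
}

echo ucfirst_utf8('économie collégiale'), "\n"; // Économie collégiale
echo title_case_utf8('GRANDE ÁRVORE'), "\n";    // Grande Árvore
```

Note that MB_CASE_TITLE also lower-cases the remaining letters, so the separate mb_strtolower() call from the question is not needed.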
ucwords doesn't recognize the accented character. Try using mb_convert_case.
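$str = 'GRANDE ÁRVORE';
function ucwords_accent($string)
{
    // Convert to UTF-8 first when the input is not already valid UTF-8.
    // Assumes such input is ISO-8859-1, which is what utf8_encode()
    // (deprecated since PHP 8.2) assumed as well.
    if (!mb_check_encoding($string, 'UTF-8')) {
        $string = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
    }
    return mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
}
echo ucwords_accent($str); // Grande Árvore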
I'm currently using the substr() function, which works fine for text written in English. But when I apply it to text written in Greek, the text is cut off with a strange character (a question mark inside a diamond shape) appearing before the three full stops (...).
Below is the code, thanks:
// $string is a varchar value written in Greek, loaded from the database
if (strlen($string) > 200) {
    echo substr($string, 0, 200).'...';
}
Use multibyte functions like so:
mb_internal_encoding( "UTF-8" );
if( mb_strlen( $string ) > 200 ) {
echo mb_substr( $string, 0, 200 ) . "...";
}
The normal functions work on bytes and don't have the character awareness you are expecting from them. Common English characters all take 1 byte each in UTF-8, so the normal functions happen to work for them. Greek letters take 2 bytes each, so cutting at byte 200 can split a letter in half, which is where the strange character comes from.
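A reusable version of the same idea might look like this. It is only a sketch: `truncate_utf8` is a hypothetical helper name, and it assumes the text is UTF-8 and mbstring is available.

```php
<?php
// Hypothetical helper: cut $text to at most $limit characters (not bytes)
// and append $suffix only when something was actually cut off.
function truncate_utf8(string $text, int $limit, string $suffix = '...'): string
{
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }
    return mb_substr($text, 0, $limit, 'UTF-8') . $suffix;
}

$greek = str_repeat('α', 10); // 10 characters, 20 bytes

$safe = truncate_utf8($greek, 4);
echo $safe, "\n"; // αααα...

// Cutting by bytes at an odd offset splits a two-byte letter in half.
$broken = substr($greek, 0, 3);
var_dump(mb_check_encoding($safe, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```

The half letter left behind by substr() is what the browser renders as the question mark in a diamond.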
str_replace does not replace accented letters with unaccented ones. What is going wrong?
This returns the expected result:
<?php
$string = get_post_custom_values ("text");
// Say get_post_custom_values ("text") equals "José José"
$string = str_replace(" ", "-", $string);
echo $string [0];
// Output "José-José"
?>
This does not work:
<?php
$string = get_post_custom_values ("text");
// Say get_post_custom_values ("text") equals "José José"
$string = str_replace("é", "e", $string);
echo $string [0];
// Output "José José". Nothing has changed
?>
Note: Translated from the Portuguese language with GoogleTranslate.
An easy way to remove the accent from every accented letter is to use iconv (the exact transliteration depends on the locale and on the iconv implementation):
setlocale(LC_ALL, "fr_CA.utf8"); // for instance
$output = iconv("utf-8", "ascii//TRANSLIT", $input);
Your current problem is most likely caused by a different encoding.
The character é as saved in your source code is not in the same encoding as the data you get back from get_post_custom_values. Encoding doesn't match → not recognized as the same character → not replaced.
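A quick way to check for this kind of mismatch is to compare the raw bytes with bin2hex(). The sketch below assumes, purely for illustration, that the data arrives as ISO-8859-1 while the source file is UTF-8; the question does not say which encodings are actually involved.

```php
<?php
// "é" as stored in a UTF-8 source file: two bytes.
$utf8 = "\xC3\xA9";
// The same letter in ISO-8859-1 (Latin-1): one byte.
$latin1 = "\xE9";

echo bin2hex($utf8), ' vs ', bin2hex($latin1), "\n"; // c3a9 vs e9

// str_replace() compares bytes, so a UTF-8 needle never matches Latin-1 data.
$data = "Jos\xE9 Jos\xE9"; // hypothetical Latin-1 input
$unchanged = str_replace($utf8, 'e', $data);
var_dump($unchanged === $data); // bool(true)

// Converting the data to the needle's encoding first makes the bytes line up.
$converted = mb_convert_encoding($data, 'UTF-8', 'ISO-8859-1');
echo str_replace($utf8, 'e', $converted), "\n"; // Jose Jose
```

Running bin2hex() on the real value returned by get_post_custom_values would show which byte sequence you actually need to search for.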