Access specific characters in UTF-8 string - php

An answer to a question here led me to the following "problem" or challenge:
Is it somehow possible to get the character at a specific position if the string is UTF-8 encoded and contains special characters?
For strings without special characters this works:
$str = 'abcd';
echo $str[1]; // will print "b"
But for a string like this:
$str = 'abc★';
echo $str[1]; // will print "b"
echo $str[3]; // prints a question mark
Of course the PHP file is encoded in UTF-8 and <meta charset="utf-8"> is in the head of the HTML.
So is there a way to make this kind of character access work?

One possible way
$str = 'abc★';
preg_match_all('/./su', $str, $m);
$chars = $m[0];
echo $chars[1]; // b
echo $chars[3]; // ★
/./su means "any character, including newline ("s"), in utf8 mode ("u")".
Or like this:
echo mb_substr($str, 3, 1, 'UTF-8'); // ★
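If positional access is needed in several places, the mb_substr call can be wrapped in a small helper. This is only a sketch: it assumes the mbstring extension is loaded and that the source file is saved as UTF-8, and utf8_char_at is a hypothetical name, not a built-in function.

```php
<?php
// Hypothetical helper: return the character (not the byte) at a given index.
// Assumes the mbstring extension is available and $str is valid UTF-8.
function utf8_char_at(string $str, int $index): string
{
    return mb_substr($str, $index, 1, 'UTF-8');
}

$str = "abc★";
echo utf8_char_at($str, 1), "\n";    // b
echo utf8_char_at($str, 3), "\n";    // ★
echo strlen($str), "\n";             // 6: bytes (★ takes three bytes in UTF-8)
echo mb_strlen($str, 'UTF-8'), "\n"; // 4: characters
```

The difference between strlen and mb_strlen is the reason plain offset access fails: the offset counts bytes, and the star occupies bytes 3 to 5.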

Related

how to get the first letter of a non latin string with php

To grab the first letter of a string I use substr:
$string = "John doe";
echo substr($string,0,1);
// output: J
But this does not work when the string is, for example, in Russian or starts with an accented letter:
$string = "Марина Матвиенко";
echo substr($string,0,1);
// output: nothing
$string = "Éduard Rousseaux";
echo substr($string,0,1);
// output: nothing
Do I need to convert the string to Latin first, or is there another way to grab the first letter of a string with non-Latin characters?
Use mb_substr when working with multibyte characters.
echo mb_substr($string,0,1);
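To see why substr prints nothing useful, it helps to look at the bytes. The sketch below assumes the strings are UTF-8 and that mbstring is loaded; the encoding is passed explicitly so the result does not depend on the configured internal encoding.

```php
<?php
$russian = "Марина Матвиенко";
$french  = "Éduard Rousseaux";

// substr() cuts after one byte, which is only half of a two-byte character.
echo bin2hex(substr($russian, 0, 1)), "\n"; // d0
echo bin2hex(substr($french, 0, 1)), "\n";  // c3

// mb_substr() counts characters instead of bytes.
echo mb_substr($russian, 0, 1, 'UTF-8'), "\n"; // М
echo mb_substr($french, 0, 1, 'UTF-8'), "\n";  // É
```

A lone d0 or c3 byte is not valid UTF-8 on its own, so a browser or terminal shows nothing or a replacement character.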

unexpected output of ltrim in php

Can anybody explain this unusual output of ltrim?
var_dump(ltrim('/btcapi/participation/set-user-event-participation','/btcapi'));
rticipation/set-user-event-participation //output
While the expected output was
/participation/set-user-event-participation
Use str_replace if you are sure there is only one occurrence in your string.
$str = '/btcapi/participation/set-user-event-participation';
echo str_replace('/btcapi', '', $str); // prints: /participation/set-user-event-participation
Or use a regex if you need to remove only the occurrence at the beginning of the string.
$str = '/btcapi/participation/set-user-event-participation';
echo preg_replace('~^/btcapi~', '', $str); // prints: /participation/set-user-event-participation
The trim characters are read as individual characters, not as a string.
The second / is stripped as well, for example, because / is one of those characters.
Just use str_replace or a custom loop.
RTM: http://php.net/ltrim
the second argument is a character MASK, e.g. characters you want to strip. CHARACTERS, not STRING.
php > $foo = 'abc123';
php > echo ltrim($foo, 'abpq');
c123
php > echo ltrim($foo, 'a1');
bc123
^---not stripped, because 'bc' are not in the mask.
php >
PHP strips characters from the left of the string as long as they appear in the mask, and stops at the first character that is NOT in the mask.
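The mask behaviour and a prefix-only alternative can be compared side by side. This is a minimal sketch; strip_prefix is a hypothetical helper name, not part of PHP.

```php
<?php
// Hypothetical helper: remove $prefix from the start of $str only if it is there.
function strip_prefix(string $str, string $prefix): string
{
    if ($prefix !== '' && strncmp($str, $prefix, strlen($prefix)) === 0) {
        return substr($str, strlen($prefix));
    }
    return $str;
}

$path = '/btcapi/participation/set-user-event-participation';

// ltrim() treats '/btcapi' as the character set {/, b, t, c, a, p, i}
// and keeps stripping until it reaches 'r'.
echo ltrim($path, '/btcapi'), "\n";        // rticipation/set-user-event-participation

// The helper removes the literal prefix once and leaves the rest alone.
echo strip_prefix($path, '/btcapi'), "\n"; // /participation/set-user-event-participation
```

Unlike str_replace, the helper touches only the start of the string, so a later /btcapi inside the path would be kept.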

PHP charset issue

I'm writing a basic function in PHP which takes an input string and converts a list of "weird" characters to URL-friendly ones. Writing the function is not the issue, but rather how it interprets strings with weird characters.
For example, right now I have this problem:
$string = "år";
echo $string[0]; // Output: �
echo $string[1]; // Output: �
echo $string[0] . $string[1]; // Output: å
echo $string[2]; // Output: r
So basically it interprets the letter "å" as two characters, which causes a problem for me, because I want to be able to look at each character of the string individually and replace it if needed.
I encode everything in UTF-8 and I know my issue has something to do with UTF-8 treating weird characters as two chars, as we've seen above.
But how do I work around this? Basically I want to achieve this:
$string = "år";
echo $string[0]; // Output: å
echo $string[1]; // Output: r
$string = "år";
mb_internal_encoding('UTF-8');
echo mb_substr($string, 0, 1); // å
echo mb_substr($string, 1, 1); // r
UTF-8 does not always use one byte per letter: non-ASCII letters take more than one byte of memory. Array-like access to a string variable returns a single byte, not a letter. To get a whole letter, use the multibyte functions:
echo mb_substr($string, 0,1);// Output: å
echo mb_substr($string, 1,1);// Output: r
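Since the goal is to inspect and replace each character individually, the string can also be split into characters once and then looped over. A rough sketch, assuming valid UTF-8 input and a UTF-8 source file; the $map table is made up for illustration.

```php
<?php
// Split a UTF-8 string into characters. The 'u' modifier makes PCRE treat
// the subject as UTF-8, and PREG_SPLIT_NO_EMPTY drops the empty edge pieces.
$string = "år";
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

// Hypothetical replacement table, for illustration only.
$map = ['å' => 'a', 'ä' => 'a', 'ö' => 'o'];

$result = '';
foreach ($chars as $char) {
    // Replace the character if it is in the table, otherwise keep it.
    $result .= $map[$char] ?? $char;
}

echo $chars[0], "\n"; // å
echo $chars[1], "\n"; // r
echo $result, "\n";   // ar
```

On PHP 7.4 or later, mb_str_split($string, 1, 'UTF-8') should give the same array without a regular expression.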

Remove accents - Replace accented letters by letters without accents with str_replace

str_replace does not replace accented letters by letters without accent. What's wrong with that?
This returns the expected result:
<?php
$string = get_post_custom_values ("text");
// Say get_post_custom_values("text") equals "José José"
$string = str_replace(" ", "-", $string);
echo $string [0];
// Output "José-José"
?>
This does not work:
<?php
$string = get_post_custom_values ("text");
// Say get_post_custom_values("text") equals "José José"
$string = str_replace("é", "e", $string);
echo $string [0];
// Output "José José". Nothing has changed
?>
Note: Translated from the Portuguese language with GoogleTranslate.
An easy way to remove all accented letters is to use iconv:
setlocale(LC_ALL, "fr_CA.utf8"); // for instance
$output = iconv("utf-8", "ascii//TRANSLIT", $input);
Your current problem is most likely caused by a different encoding.
The character é as saved in your source code is not in the same encoding as the data you get back from get_post_custom_values. Encoding doesn't match → not recognized as the same character → not replaced.
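The mismatch can be reproduced with explicit byte sequences. The sketch below assumes, purely for illustration, that the data arrives as ISO-8859-1 while the source file is UTF-8; the real encoding of the data in the question is not known. It requires the mbstring extension.

```php
<?php
$utf8   = "Jos\xC3\xA9";                                     // "José" in UTF-8
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'); // "Jos\xE9"

$needle = "\xC3\xA9"; // "é" as a UTF-8 source file would store it

// Same encoding on both sides: the replacement works.
var_dump(str_replace($needle, 'e', $utf8) === 'Jose');    // bool(true)

// Different encoding: the bytes never match, so nothing is replaced.
var_dump(str_replace($needle, 'e', $latin1) === $latin1); // bool(true)

// Converting the data to the encoding of the source file first fixes it.
$fixed = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(str_replace($needle, 'e', $fixed) === 'Jose');   // bool(true)
```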

reversing a regular expression in php

Suppose I have this function:
function f($string){
    $string = preg_replace("`\[.*\]`U","",$string);
    $string = preg_replace('`&(amp;)?#?[a-z0-9]+;`i','-',$string);
    $string = htmlentities($string, ENT_COMPAT, 'utf-8');
    $string = preg_replace( "`&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);`i","\\1", $string );
    $string = preg_replace( array("`[^a-z0-9]`i","`[-]+`") , "-", $string);
    return $string;
}
How can I reverse this function? That is, how should I write a function fReverse() such that the following holds:
$s = f("some string223---");
$reversed = fReverse($s);
echo $reversed;
and the output is: some string223---
f is lossy, so it is impossible to find an exact inverse. For example, both "some string223---" and "some string223--------" give the same output (see http://ideone.com/DtGQZ).
Nevertheless, we can find a pre-image of f. The 5 replacements of f are:
1. Strip everything between [ and ].
2. Replace entities like &lt;, &#123; and encoded entities like &amp;lt; with a hyphen -.
3. Escape special HTML characters (< → &lt;, & → &amp;, etc.)
4. Remove accents of accented characters (&eacute; (= é) → e, etc.)
5. Turn non-alphanumerics and consecutive hyphens into a single hyphen -.
Of these, steps 1, 2, 4 and 5 may all be identity transforms, since they leave some inputs unchanged. Therefore, one possible pre-image is obtained by reversing only step 3:
function fReverse($string) {
    return html_entity_decode($string, ENT_COMPAT, 'utf-8');
}
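Both claims, that f is lossy and that fReverse still yields a valid pre-image, can be checked with a short script. The functions are copied from above; only the checks at the end are new, and they are a sketch for this one example rather than a proof.

```php
<?php
function f($string){
    $string = preg_replace("`\[.*\]`U","",$string);
    $string = preg_replace('`&(amp;)?#?[a-z0-9]+;`i','-',$string);
    $string = htmlentities($string, ENT_COMPAT, 'utf-8');
    $string = preg_replace( "`&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);`i","\\1", $string );
    $string = preg_replace( array("`[^a-z0-9]`i","`[-]+`") , "-", $string);
    return $string;
}

function fReverse($string) {
    return html_entity_decode($string, ENT_COMPAT, 'utf-8');
}

// Two different inputs collapse to the same output, so f has no exact inverse.
$a = f("some string223---");
$b = f("some string223--------");
echo $a, "\n";          // some-string223-
var_dump($a === $b);    // bool(true)

// fReverse() does not recover the original input, but it returns a string
// that f() maps back to the same output, which is what a pre-image means.
$pre = fReverse($a);
var_dump(f($pre) === $a); // bool(true)
```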
