Browser does not display umlaut correctly when concatenating - php

My browsers (Chrome and Firefox) do not display the umlaut "ö" correctly once I concatenate a string with the umlaut character.
// words inside string with umlaute, later add http://www.lageplan23.de instead of "zahnstocher" as the correct solution
$string = "apfelsaft siebenundvierzig zahnstocher gelb ethereum österreich";
// get length of string
$l = mb_strlen($string);
$f = '';
// loop through length and output each letter by itself
for ($i = 0; $i <= $l; $i++){
// umlaute buggy when there is a concatenation
$f .= $string[$i] . " ";
}
var_dump($f);
When I replace $string[$i] . " "; with $string[$i]; everything works as expected.
Why is that and how can I fix it so I can concatenate each letter with another string?

In PHP, a string is a series of bytes. The documentation clumsily refers to those bytes as characters at times:
"A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support."
And then later:
"It has no information about how those bytes translate to characters, leaving that task to the programmer."
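Using mb_strlen rather than strlen is the correct way to get the number of actual characters in a string (assuming the string's encoding matches the internal encoding). However, array notation such as $string[$i] is wrong here because it accesses individual bytes, not characters. In UTF-8, "ö" occupies two bytes; echoed back to back they still form a valid character, but once a space is concatenated between them each byte is invalid on its own, which is why the browser shows a replacement symbol.
The proper way to do what you want is to split the string into characters using mb_str_split (available since PHP 7.4):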
// words inside string with umlaute, later add http://zahnstocher47.de instead of "zahnstocher" as the correct solution
$string = "apfelsaft siebenundvierzig zahnstocher gelb ethereum österreich";
// get length of string
$l = mb_strlen($string);
$chars = mb_str_split($string);
$f = '';
// loop through the characters and append each one followed by a space
for ($i = 0; $i < $l; $i++){
    // $chars[$i] is a whole character, so concatenation is safe
    $f .= $chars[$i] . " ";
}
var_dump($f);
Demo here: https://3v4l.org/JIQoE
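To see the byte/character difference directly, here is a minimal sketch. It assumes PHP 7.4 or later (for mb_str_split), the mbstring extension, and a source file saved as UTF-8; the shorter sample string and the $spaced variable are chosen only for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8, so "ö" is stored as the bytes C3 B6.
$string = "österreich";

echo strlen($string), "\n";             // 11 (bytes)
echo mb_strlen($string, 'UTF-8'), "\n"; // 10 (characters)

// Byte access: the first "character" is really two separate bytes.
echo bin2hex($string[0]), ' ', bin2hex($string[1]), "\n"; // c3 b6

// Character access: split first, then join with the separator.
$spaced = implode(' ', mb_str_split($string, 1, 'UTF-8'));
echo $spaced, "\n"; // ö s t e r r e i c h
```

Using implode also avoids the trailing space that the loop version appends after the last character.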

Related

UTF-8 char is not showing well in <td> elements

I have a strange problem...
I have the following string:
$sString = "This is my encoded string é à";
First, I remove html entities:
$sString = html_entity_decode($sString, ENT_COMPAT, 'UTF-8');
What I want is to split this string properly to show each char in a different column of the same table's line.
Well, logically, I used:
$aString = str_split($sString); // Fill an array with each char
It doesn't work. The boxes show the characters as if I hadn't used html_entity_decode...
So, I decided to try the following:
for($i = 0; $i < 16; $i++) {
echo "<td>";
echo $sLine1[$i];
echo "</td>";
}
It works, BUT special chars are shown as a ? in a black box (encoding problem).
What's really strange is that when I don't put it in <td> elements, it displays fine and there are no encoding problems!
My HTML page sets the charset to UTF-8 and is correctly formatted (with doctype, html, body, etc...)
I have to admit that at this point, I have no idea where this problem comes from...
UPDATE
I just realized that when I show char by char outside the <td>, it doesn't work either. The bytes of an encoded char need to be output as a pair to work!
It's a problem for me because the string comes from a database, and special chars won't always be in the same place!
Example:
This will show the broken character:
$sString = "Paëlla";
echo $sString[2];
But this way, it will show the ë:
$sString = "Paëlla";
echo $sString[2];
echo $sString[3];
str_split splits the string into bytes. But in UTF-8, characters like é and à are each encoded as a sequence of 2 bytes. You need to use mbstring to be UTF-8 aware.
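mb_internal_encoding('UTF-8');
// PHP 7.4+ ships a built-in mb_str_split; only define this fallback on older versions
if (!function_exists('mb_str_split')) {
    function mb_str_split($string, $length = 1) {
        $ret = array();
        $l = mb_strlen($string);
        for ($i = 0; $i < $l; $i += $length) {
            $ret[] = mb_substr($string, $i, $length);
        }
        return $ret;
    }
}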
The same applies if you use [offset] on a string: you get a byte, not a character, whenever the string's encoding can use more than one byte per character. In this case, use mb_substr.
mb_internal_encoding('UTF-8');
echo mb_substr("Paëlla", 2, 1);
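For the table itself, one possible sketch that works without mb_str_split is preg_split with the u modifier, which splits on character boundaries rather than bytes. The sample string and the $chars and $row variables are made up for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Assumption: UTF-8 source file and UTF-8 page output.
$sString = "Paëlla é à";

// An empty pattern with the "u" modifier matches between every UTF-8 character;
// PREG_SPLIT_NO_EMPTY drops the empty pieces at the start and end.
$chars = preg_split('//u', $sString, -1, PREG_SPLIT_NO_EMPTY);

// Build one table row with one whole character per cell.
$row = '<tr>';
foreach ($chars as $char) {
    $row .= '<td>' . htmlspecialchars($char, ENT_QUOTES, 'UTF-8') . '</td>';
}
$row .= '</tr>';

echo $row, "\n";
```

Because each array element is a complete character, it no longer matters where in the string the accented letters appear.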
Some additions to dinesh123's answer:
Try stripping the HTML with strip_tags before you build the string ($sString)
Check the file encoding
Try sending header("Content-Type: text/html; charset=UTF-8") at the start of the file

Non ASCII Characters being converted to squares

I've got the following code which searches a string for Non ASCII characters and returns it via an AJAX query.
$asciistring = $strDescription;
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127){
$display_string .= $asciistring[$i];
}
}
If $strDescription contains £ (character # 156) the above code works fine. However, I want to separate each non-ASCII character found with a comma. When I modify my code as shown below, it converts the £ character into squares.
$asciistring = $strDescription;
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127){
$display_string .= $asciistring[$i] . ", ";
}
}
What am I doing wrong and how do I fix it?
You assume 1 character = 1 byte.
This assumption is wrong when it comes to UTF-8, UTF-16, etc.
UTF-8 uses multi-byte characters: 1 character = 1 to 4 bytes.
So your loop over 8-bit bytes cannot handle UTF-8 characters; appending ", " after each byte pulls the bytes of a multi-byte character apart.
Use the mb_... functions instead - multibyte string functions.
Additionally, converting ASCII to UTF-8 and vice versa:
is in general not needed
will always leave certain characters unavailable in one of the encodings (e.g. the € sign is one of them)
will be a maintenance nightmare in the long run
My recommendation: it's worth the effort to switch all and everything from dev to production to entirely use UTF-8. All problems are gone afterwards.
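As a hedged illustration of staying in UTF-8 end to end, the non-ASCII characters can be collected as whole characters with a Unicode-aware regular expression and then joined with commas. The sample text is invented, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Assumption: $strDescription holds valid UTF-8 text.
$strDescription = "Price: £5 or €6";

// With the "u" modifier, the class matches whole characters above U+007F,
// not the individual bytes that make them up.
preg_match_all('/[^\x00-\x7F]/u', $strDescription, $matches);

$display_string = implode(', ', $matches[0]);
echo $display_string, "\n"; // £, €
```

Since the separator is added between complete characters, no multi-byte sequence is split.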
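Here are two ways. First, convert the string to ISO-8859-1 with utf8_decode (deprecated since PHP 8.2; mb_convert_encoding($asciistring, 'ISO-8859-1', 'UTF-8') does the same conversion). You can try these: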
$asciistring = 'a£bÂc£d';
$asciistring = utf8_decode($asciistring);
First way: preg_match_all
if (preg_match_all('/[\x80-\xFF]/', $asciistring, $matches)) {
$display_string = implode(',', $matches[0]);
}
Second way: a loop, as you wrote
$display_string = array();
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127)
{
$display_string[] = $asciistring[$i];
}
}
$display_string = implode(',', $display_string);
Both give me the same output:
£,Â,£
I hope this helps!

Character replacement encoding php

I have a string in which I want to replace all 'a' characters with the Greek 'α' character. I don't want to convert the HTML elements inside the string ie text.
The function:
function grstrletter($string){
$skip = false;
$str_length = strlen($string);
for ($i=0; $i < $str_length; $i++){
if($string[$i] == '<'){
$skip = true;
}
if($string[$i] == '>'){
$skip = false;
}
if ($string[$i]=='a' && !$skip){
$string[$i] = 'α';
}
}
return $string;
}
Another function I have made works perfectly, but it doesn't take the HTML elements into account.
function grstrletter_no_html($string){
return strtr($string, array('a' => 'α'));
}
I also tried a lot of encoding functions that php offers with no luck.
When I echo the Greek letter the browser outputs it without a problem. When I return the string the browser outputs the classic strange question mark inside a triangle wherever a replacement occurred.
My header has <meta http-equiv="content-type" content="text/html; charset=UTF-8"> and I also tried it with php header('Content-Type: text/html; charset=utf-8'); but again with no luck.
The string comes from a database in UTF-8 and the site is in wordpress so I just use the wordpress functions to get the content I want. I don't think is a db problem because when I use my function grstrletter_no_html() everything works fine.
The problem seems to happen when I iterate the string character by character.
The file is saved as UTF-8 without BOM (notepad++). I tried also to change the encoding of the file with no luck again.
I also tried to replace the Greek letter with the corresponding HTML entities &alpha; and &#945; but again got the same results.
I haven't tried yet any regex.
I would appreciate any help and thanks in advance.
Tried: Greek characters encoding works in HTML but not in PHP
EDIT
The solution based on deceze's brilliant answer:
function grstrletter($string){
    $skip = false;
    // Re-read the length on every pass: each replacement grows the string by one byte
    for ($i = 0; $i < strlen($string); $i++){
        if ($string[$i] == '<'){
            $skip = true;
        }
        if ($string[$i] == '>'){
            $skip = false;
        }
        if ($string[$i] == 'a' && !$skip){
            // Splice the two-byte 'α' in place of the one-byte 'a'
            $string = substr($string, 0, $i) . 'α' . substr($string, $i + 1);
            $i++; // step over the second byte of 'α'
        }
    }
    return $string;
}
The problem is that you're setting only a single byte of your string. Example:
$str = "\x00\x00\x00";
var_dump(bin2hex($str));
$str[1] = "\xff\xff";
var_dump(bin2hex($str));
Output:
string(6) "000000"
string(6) "00ff00"
You're setting a two-byte character, but only one byte of it is actually pushed into the string. The second result here would have to be 00ffff for your code to work.
What you need is to cut the string from 0 to $i - 1, concatenate the 'α' into it, then concatenate the rest of the string $i + 1 to end onto it if you want to insert a multibyte character. That, or work with characters instead of bytes using the mbstring functions.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
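An alternative sketch that avoids byte offsets altogether is to split the string into tags and text, then replace only inside the text pieces. The function name grstrletter_split and the sample input are hypothetical, and the pattern assumes that no tag contains a literal ">" inside an attribute value.

```php
<?php
// Hypothetical helper: replaces 'a' with 'α' outside of HTML tags.
function grstrletter_split($string) {
    // PREG_SPLIT_DELIM_CAPTURE keeps the tags: even indexes are text, odd indexes are tags.
    $parts = preg_split('/(<[^>]*>)/u', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // str_replace is safe here: the byte for 'a' never occurs inside a
            // multi-byte UTF-8 sequence, and whole substrings are swapped, not single bytes.
            $parts[$i] = str_replace('a', 'α', $part);
        }
    }
    return implode('', $parts);
}

echo grstrletter_split('<a href="banana.html">banana</a> and data'), "\n";
// <a href="banana.html">bαnαnα</a> αnd dαtα
```

This keeps the tag names and attributes untouched while the visible text is converted.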

PHP method for stripping duplicate chars from a multibyte string?

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abcgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
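In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched. Note that this only collapses runs of adjacent repeats, as the kanji output above shows; duplicates separated by other characters remain.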
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of PHP's ordered arrays: keys keep the order in which they were first inserted. Once we've gone through the string and identified all of the characters, we grab the keys and join 'em back together in the order in which they first appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split. (PHP 7.4 later added exactly that function.)
(No Kanji example here, I'm experiencing a copy/paste bug.)
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
Take a look at the iconv_strlen function. I can't speak for Asian encodings, but it works fine for Western and Eastern European languages. In any case it gives you some freedom!
$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
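Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way. Note that str_split splits on bytes, so this only works for single-byte characters; a multibyte UTF-8 string needs a character-aware split instead.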
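A multibyte-aware variant of the same split-then-deduplicate idea might look like the sketch below. It assumes PHP 7.4 or later for mb_str_split and UTF-8 input; the function name mb_unique_chars is hypothetical.

```php
<?php
// Hypothetical helper: returns each distinct character once, in order of first appearance.
function mb_unique_chars($string) {
    // Split into whole characters, drop repeats, and glue the survivors back together.
    return implode('', array_unique(mb_str_split($string, 1, 'UTF-8')));
}

echo mb_unique_chars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
echo mb_unique_chars('aaabggxxyxzxxgggghq xcccxxxzxxyx'), "\n";       // abgxyzhq c
```

Unlike count_chars($string, 3), which returns the bytes in sorted order, this keeps the order in which characters first occur.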

php outputting strange character

I have the following code to generate a random password string:
<?php
$password = '';
for($i=0; $i<10; $i++) {
$chars = array('lower' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'), 'upper' => array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'), 'num' => array('1','2','3','4','5','6','7','8','9','0'), 'sym' => array('!','£','$','%','^','&','*','(',')','-','=','+','{','}','[',']',':','#','~',';','#','<','>','?',',','.','/'));
$set = rand(1, 4);
switch($set) {
case 1:
$set = 'lower';
break;
case 2:
$set = 'upper';
break;
case 3:
$set = 'num';
break;
case 4:
$set = 'sym';
break;
}
$count = count($chars[$set]);
$digit = rand(0, ($count-1));
$output = $chars[$set][$digit];
$password.= $output;
}
echo $password;
?>
However, every now and then one of the characters it outputs will be a capital A with a ^ above it. French or something. How is this possible? It can only pick what's in my arrays!
The only non-ascii character is the pound character, so my guess is that it has to do with this.
First off, it's probably a good idea to avoid that one, as not many people will be able to easily type it.
There's a good chance that the encoding of your PHP file (or the encoding set by your editor) is not the same as your output encoding.
Are you sure it is indeed a character not in your array, or is the browser just unable to output? For example your monetary pound sign. Ensure that both PHP, DB, and HTML output all use the same encoding.
On a separate note, your loop is slightly more complicated than it needs to be. I typically see password generators randomize a string versus several arrays. A quick example:
$chars = "abcdefghijkABCDEFG1289398$%#^&";
$pos = rand(0, strlen($chars) - 1);
$password .= $chars[$pos];
i think you generate special HTML characters
for example here and iso8859-1 table
You may be seeing the byte sequence C2 A3, appearing as your capital A with a circumflex followed by a pound symbol. This is because C2A3 is the UTF-8 sequence for a pound sign. As such, if you've managed to enter the UTF-8 character in your PHP file (possibly without noticing it, depending on your editor and environment) you'd see the separate byte sequence as output if your environment is then ASCII / ISO8859-1 or similar.
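The effect can be reproduced in a few lines. This is only a sketch: it assumes the mbstring extension is available, and it uses mb_convert_encoding to imitate what a page declared as ISO-8859-1 does with the two UTF-8 bytes; the variable names are made up.

```php
<?php
// The pound sign written out as its two UTF-8 bytes.
$pound = "\xC2\xA3";

echo bin2hex($pound), "\n"; // c2a3
echo strlen($pound), "\n";  // 2

// Treat each byte as a separate ISO-8859-1 character and re-encode as UTF-8,
// which is how a mis-declared page ends up displaying the text.
$misread = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n"; // Â£
```

Byte C2 is "Â" in ISO-8859-1 and byte A3 is "£", which matches the stray capital A with a circumflex described in the question.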
As per Jason McCreary, I use this function for such Password Creation
function randomString($length) {
    $characters = "0123456789abcdefghijklmnopqrstuvwxyz" .
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&";
    $string = '';
    for ($p = 0; $p < $length; $p++) {
        // mt_rand is inclusive at both ends, so the upper bound is the last valid index
        $string .= $characters[mt_rand(0, strlen($characters) - 1)];
    }
    return $string;
}
The pound symbol (£) is what is breaking, since it is not part of the basic ASCII character set.
You need to do one of the following:
Drop the pound symbol (this will also help people using non-UK keyboards!)
Convert the pound symbol to an HTML entity when outputting it to the site (&pound;)
Set your site's character set encoding to UTF-8, which will allow extended characters to be displayed. This is probably the best option in the long run, and should be fairly quick and easy to achieve.
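If the pound sign has to stay in the pool, the UTF-8 option could be sketched as below. It assumes PHP 7.4 or later (mb_str_split, plus random_int from PHP 7), a source file saved as UTF-8, and a page served with a UTF-8 charset; the function name random_password and the character pool are hypothetical.

```php
<?php
// Hypothetical helper: picks $length random characters from a UTF-8 aware pool.
function random_password(int $length): string {
    // Single quotes keep "$" literal; "£" is one character but two bytes in UTF-8,
    // so the pool is split into characters rather than indexed by byte.
    $pool = mb_str_split(
        'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!£$%^&*',
        1,
        'UTF-8'
    );
    $password = '';
    for ($i = 0; $i < $length; $i++) {
        // random_int is inclusive at both ends, hence count - 1.
        $password .= $pool[random_int(0, count($pool) - 1)];
    }
    return $password;
}

echo random_password(10), "\n";
```

Picking from an array of whole characters means the two bytes of "£" are always emitted together, so the output stays valid UTF-8.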
