UTF-8 char is not showing well in <td> elements - php

I have a strange problem...
I have the following string:
$sString = "This is my encoded string é à";
First, I remove html entities:
$sString = html_entity_decode($sString, ENT_COMPAT, 'UTF-8');
I want to split this string so that each character goes in its own column of a single table row.
The obvious approach was:
$aString = str_split($sString); // Fill an array with each char
It doesn't work: the special characters show up as boxes, as if I had never called html_entity_decode.
So I tried the following instead:
for($i = 0; $i < 16; $i++) {
echo "<td>";
echo $sLine1[$i];
echo "</td>";
}
It works, BUT special characters are shown as a ? in a black box (an encoding problem).
What is really strange is that when I don't put the output in <td> elements, it displays correctly and there are no encoding problems!
My HTML page declares the UTF-8 charset and is correctly formed (with doctype, html, body, etc.).
I have to admit that at this point I have no idea where this problem comes from.
UPDATE
I just realized that printing character by character outside the <td> doesn't work either. The bytes of an encoded character have to be output as a pair to work!
That's a problem for me because the string comes from a database, and special characters won't always be in the same place!
Example:
This shows the broken character:
$sString = "Paëlla";
echo $sString[3];
But this way, it shows the ë:
$sString = "Paëlla";
echo $sString[3];
echo $sString[4];

str_split splits the string on bytes. But in UTF-8, characters like é and à are encoded as a sequence of 2 bytes. You need to use the mbstring functions, which are UTF-8 aware.
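mb_internal_encoding('UTF-8');
// PHP 7.4+ ships a native mb_str_split(); only define this fallback on older versions.
if (!function_exists('mb_str_split')) {
function mb_str_split($string, $length = 1) {
$ret = array();
$l = mb_strlen($string);
for ($i = 0; $i < $l; $i += $length) {
$ret[] = mb_substr($string, $i, $length);
}
return $ret;
}
}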
The same applies when you use [offset] on a string: you get a byte, not a character, whenever the string's charset can encode a character on more than one byte. In that case, use mb_substr.
mb_internal_encoding('UTF-8');
echo mb_substr("Paëlla", 2, 1);
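To tie this back to the table in the question, here is a minimal sketch of a per-character cell builder. The function name string_to_cells is hypothetical (it is not from the question), and the sketch assumes the input is valid UTF-8. It uses preg_split with the u flag so it does not depend on which PHP version provides mb_str_split.

```php
<?php
// Hypothetical helper (name is illustrative): one <td> per UTF-8 character.
function string_to_cells(string $text): string
{
    // '//u' splits between code points; PREG_SPLIT_NO_EMPTY drops the empty edges.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // the input was not valid UTF-8
    }

    $cells = '';
    foreach ($chars as $char) {
        // Escape each character so that <, > and & cannot break the markup.
        $cells .= '<td>' . htmlspecialchars($char, ENT_QUOTES, 'UTF-8') . '</td>';
    }
    return $cells;
}

echo '<table><tr>' . string_to_cells('Paëlla') . '</tr></table>';
```

Each cell receives a complete character, so the browser never sees half of a multibyte sequence.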

Some additions to dinesh123's answer:
Try stripping HTML with strip_tags before you build the string ($sString).
Check the file's encoding.
Try sending header("Content-Type: text/html; charset=UTF-8") at the start of the file.

Related

Browser does not display umlaut correctly when concatenating

My browser (Chrome and Firefox) does not display the umlaut "ö" correctly once I concatenate a string with the umlaut character.
// words inside string with umlaute, later add http://www.lageplan23.de instead of "zahnstocher" as the correct solution
$string = "apfelsaft siebenundvierzig zahnstocher gelb ethereum österreich";
// get length of string
$l = mb_strlen($string);
$f = '';
// loop through length and output each letter by itself
for ($i = 0; $i <= $l; $i++){
// umlaute buggy when there is a concatenation
$f .= $string[$i] . " ";
}
var_dump($f);
When I replace $string[$i] . " "; with $string[$i]; everything works as expected.
Why is that and how can I fix it so I can concatenate each letter with another string?
In PHP, a string is a series of bytes. The documentation clumsily refers to those bytes as characters at times.
A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support.
And then later
It has no information about how those bytes translate to characters, leaving that task to the programmer.
Using mb_strlen rather than plain strlen is the correct way to get the number of actual characters in a string (assuming a sane byte order and internal encoding to begin with). However, using array notation, $string[$i], is wrong because it only accesses the bytes, not the characters.
The proper way to do what you want is to split the string into characters using mb_str_split (built in since PHP 7.4):
// words inside string with umlaute, later add http://zahnstocher47.de instead of "zahnstocher" as the correct solution
$string = "apfelsaft siebenundvierzig zahnstocher gelb ethereum österreich";
// get length of string
$l = mb_strlen($string);
$chars = mb_str_split($string);
$f = '';
// loop through length and output each letter by itself
for ($i = 0; $i < $l; $i++){
// $chars[$i] is a whole character, so concatenating it is safe
$f .= $chars[$i] . " ";
}
var_dump($f);
Demo here: https://3v4l.org/JIQoE
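As a shorter variant of the loop above, the characters can be joined directly. This is only a sketch: space_out is a hypothetical name, it assumes PHP 7.4+ with the mbstring extension, and unlike the loop it does not leave a trailing space.

```php
<?php
// Hypothetical helper; requires PHP 7.4+ for the native mb_str_split().
function space_out(string $text): string
{
    // Split into whole UTF-8 characters, then join them with single spaces.
    return implode(' ', mb_str_split($text, 1, 'UTF-8'));
}

var_dump(space_out('österreich'));
```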

Arabic not urlencoding correctly

I have the string:
$str = 'ماجد';
This needs to be encoded as:
'%E3%C7%CC%CF'
But I cannot figure out how to reach this encoded string. I believe it is Windows-1256. The above encoded string is how it is being encoded by a program I have.
Does anyone know how to reach this string?
If you know you want to use Windows-1256 then all you have to do is to change the encoding of the input string (which is UTF-8) to Windows-1256. Then you apply urlencode() to the returned string and that's all.
There are several ways to change the encoding of a string in PHP. One of them (that I tested and provides the result you expect) is using iconv():
$str = 'ماجد';
$conv = iconv('utf-8', 'windows-1256', $str);
echo(urlencode($conv));
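The conversion above can be wrapped with a failure check, since iconv() returns false when the text cannot be represented in the target charset. The function names urlencode_as and urldecode_from below are hypothetical, and the sketch assumes the input is UTF-8 and that the local iconv build knows the windows-1256 charset.

```php
<?php
// Hypothetical helpers (names are illustrative, not part of PHP).
function urlencode_as(string $utf8Text, string $targetCharset): string
{
    // Re-encode from UTF-8 first; urlencode() then percent-encodes the raw bytes.
    $converted = iconv('UTF-8', $targetCharset, $utf8Text);
    if ($converted === false) {
        throw new RuntimeException("Cannot represent the text in $targetCharset");
    }
    return urlencode($converted);
}

function urldecode_from(string $encoded, string $sourceCharset): string
{
    // Reverse direction: percent-decode to bytes, then convert back to UTF-8.
    $converted = iconv($sourceCharset, 'UTF-8', urldecode($encoded));
    if ($converted === false) {
        throw new RuntimeException("Input is not valid $sourceCharset");
    }
    return $converted;
}

echo urlencode_as('ماجد', 'windows-1256');
```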
You need to somehow split the string into its hexadecimal representation and then put a % sign in front of each pair of hex digits.
<?php
$hexString = bin2hex("ماجد");
for($i = 0; $i < strlen($hexString); $i += 2){
echo "%".substr($hexString, $i, 2);
}
?>
This will do the trick, but I'm sure there is a more elegant way.

Non ASCII Characters being converted to squares

I've got the following code which searches a string for Non ASCII characters and returns it via an AJAX query.
$asciistring = $strDescription;
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127){
$display_string .= $asciistring[$i];
}
}
If $strDescription contains £ (character # 156) the above code works fine. However, I want to separate each Non ASCII character found with a comma. When I modify my code as below, it converts the £ character into squares.
$asciistring = $strDescription;
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127){
$display_string .= $asciistring[$i] . ", ";
}
}
What am I doing wrong and how do I fix it?
You assume 1 character = 1 byte.
This assumption is wrong when it comes to UTF-8, UTF-16, etc.
UTF-8 and similar encodings consist of multi-byte characters: in UTF-8, 1 character = 1 to 4 bytes.
So your loop over 8-bit bytes cannot handle UTF-8 characters outside the ASCII range. Appending ", " after each byte splits the bytes of one character apart, which is why you get squares.
Use the mb_... functions instead (the multibyte string functions).
Additionally, converting ASCII to UTF-8 and vice versa:
is in general not needed,
will always leave certain characters unavailable in one of the encodings (the € sign is one of them),
will be a maintenance nightmare in the long run.
My recommendation: it's worth the effort to switch everything, from dev to production, to use UTF-8 exclusively. All these problems are gone afterwards.
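A minimal sketch of the comma-separated list the question asks for, working on characters rather than bytes. It assumes $text is valid UTF-8; non_ascii_chars is a hypothetical name, and it uses PCRE's u flag instead of the mb_ functions because a single pattern covers the whole task.

```php
<?php
// Hypothetical helper: list every non-ASCII character, separated by ", ".
function non_ascii_chars(string $text): string
{
    // With the /u flag the class matches whole code points, never single bytes.
    if (!preg_match_all('/[^\x00-\x7F]/u', $text, $matches)) {
        return ''; // no non-ASCII characters, or the input was not valid UTF-8
    }
    return implode(', ', $matches[0]);
}

echo non_ascii_chars('a£bÂc£d');
```

Because every match is a complete character, the separator can never land between the two bytes of £.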
Here are two ways. First, apply utf8_decode (note that it is deprecated as of PHP 8.2). You can try this:
$asciistring = 'a£bÂc£d';
$asciistring = utf8_decode($asciistring);
First way, with preg_match_all:
if (preg_match_all('/[\x80-\xFF]/', $asciistring, $matches)) {
$display_string = implode(',', $matches[0]);
}
Second way, as you wrote it:
$display_string = array();
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127)
{
$display_string[] = $asciistring[$i];
}
}
$display_string = implode(',', $display_string);
Both give me the same output:
£,Â,£
I hope this helps!

Character replacement encoding php

I have a string in which I want to replace all 'a' characters with the Greek 'α' character. I don't want to convert the HTML elements inside the string, i.e. text.
The function:
function grstrletter($string){
$skip = false;
$str_length = strlen($string);
for ($i=0; $i < $str_length; $i++){
if($string[$i] == '<'){
$skip = true;
}
if($string[$i] == '>'){
$skip = false;
}
if ($string[$i]=='a' && !$skip){
$string[$i] = 'α';
}
}
return $string;
}
Another function I have made works perfectly, but it doesn't take the HTML elements into account.
function grstrletter_no_html($string){
return strtr($string, array('a' => 'α'));
}
I also tried a lot of the encoding functions that PHP offers, with no luck.
When I echo the Greek letter, the browser outputs it without a problem. When I return the string, the browser outputs the classic strange question mark inside a triangle wherever the replacement occurred.
My header has <meta http-equiv="content-type" content="text/html; charset=UTF-8"> and I also tried it with PHP's header('Content-Type: text/html; charset=utf-8'); but again with no luck.
The string comes from a database in UTF-8 and the site runs on WordPress, so I just use the WordPress functions to get the content I want. I don't think it is a DB problem, because when I use my function grstrletter_no_html() everything works fine.
The problem seems to happen when I iterate over the string character by character.
The file is saved as UTF-8 without BOM (Notepad++). I also tried changing the encoding of the file, with no luck again.
I also tried to replace the Greek letter with the corresponding HTML entities &alpha; and &#945;, but again with the same results.
I haven't tried yet any regex.
I would appreciate any help and thanks in advance.
Tried: Greek characters encoding works in HTML but not in PHP
EDIT
The solution, based on deceze's brilliant answer:
function grstrletter($string){
$skip = false;
// Re-check the length on every pass: each replacement makes the string one byte longer.
for ($i=0; $i < strlen($string); $i++){
if($string[$i] == '<'){
$skip = true;
}
if($string[$i] == '>'){
$skip = false;
}
if ($string[$i]=='a' && !$skip){
$part1 = substr($string, 0, $i);
$part1 = $part1 . 'α';
$string = $part1 . substr($string, $i+1);
}
}
return $string;
}
The problem is that you're setting only a single byte of your string. Example:
$str = "\x00\x00\x00";
var_dump(bin2hex($str));
$str[1] = "\xff\xff";
var_dump(bin2hex($str));
Output:
string(6) "000000"
string(6) "00ff00"
You're assigning a two-byte character, but only one byte of it is actually written into the string. The second result here would have to be 00ffff for your code to work.
If you want to insert a multibyte character, you need to cut the string from 0 to $i - 1, append the 'α', then append the rest of the string from $i + 1 to the end. Either that, or work with characters instead of bytes using the mbstring functions.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
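Another way to sidestep byte offsets entirely is to split the markup into tags and text and only touch the text. The following is a rough sketch, not the approach from the answer above: replace_outside_tags is a hypothetical name, and the simple tag pattern assumes no literal > appears inside attribute values.

```php
<?php
// Hypothetical helper: replace $search with $replace only outside HTML tags.
function replace_outside_tags(string $html, string $search, string $replace): string
{
    // With PREG_SPLIT_DELIM_CAPTURE, even indexes hold text and odd indexes hold tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // str_replace works on whole substrings, so the multibyte
            // replacement is inserted intact.
            $parts[$i] = str_replace($search, $replace, $part);
        }
    }
    return implode('', $parts);
}

echo replace_outside_tags('<a href="a.html">banana</a> a', 'a', 'α');
```

Searching for the ASCII byte 'a' is safe in UTF-8 because ASCII bytes never occur inside a multibyte sequence.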

php multibyte string acessing via key [$i]

There is a string $string = "öşğüçı"; (note that the last character is ı, not i).
When I try to print the first character with echo $string[0], it prints nothing. I know these are multibyte characters. Printing the first character can be done with
echo $string[0].$string[1], but that is not what I want. The question is:
how can I handle this so that I can write the loop below
for($i = 0; $i < sizeof($string); $i++)
echo $string[$i] . " ";
and have it print the following:
ö ş ğ ü ç ı
PHP masters, please help...
To split a string into characters:
$string = "öşğüçı";
preg_match_all('/./u', $string, $m);
$chars = $m[0];
Note the "u" flag in the regular expression.
<?php
// inform the browser you are sending text encoded with utf-8
header("Content-type: text/plain; charset=utf-8");
// if you're using a literal string make sure the file
// is saved using utf-8 as encoding
// or if you're getting it from another source make sure
// you get it in utf-8
$string = "öşğüçı";
// if you do not have your string in utf-8
// you need to find out the actual encoding
// and use "iconv" to convert it to utf-8
// process the string using the mb_* functions
// knowing that it is encoded in utf-8 at this point
$encoding = "UTF-8";
for($i = 0; $i < mb_strlen($string, $encoding); $i++) {
echo mb_substr($string, $i, 1, $encoding);
}
Of course, if you prefer another encoding (I don't see why you would; maybe just UTF-16), you can replace each instance of "utf-8" above with your desired encoding, then read and use the string accordingly.
Example for UTF-16 output (file/input is encoded in UTF-8)
<?php
header("Content-type: text/plain; charset=utf-16");
$string = "öşğüçı";
$string = iconv("UTF-8", "UTF-16", $string);
$encoding = "UTF-16";
for($i = 0; $i < mb_strlen($string, $encoding); $i++) {
echo mb_substr($string, $i, 1, $encoding);
}
You cannot handle multi-byte strings in this way in PHP. If it's a fixed-length encoding, where every character takes up, say, two bytes, you can simply take two bytes at a time. If it's a variable-length encoding like UTF-8 though, you will need to use mb_substr and mb_strlen.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which explains this in more detail.
Use iconv_substr or mb_substr to get a character, and iconv_strlen or mb_strlen to get the length of the string in characters.
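To make the difference between bytes and characters concrete, here is a small sketch. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; utf8_chars is a hypothetical helper name.

```php
<?php
$string = "öşğüçı";

echo strlen($string), "\n";             // length in bytes
echo mb_strlen($string, 'UTF-8'), "\n"; // length in characters

// Hypothetical helper: yield one whole UTF-8 character at a time.
function utf8_chars(string $text): Generator
{
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        yield mb_substr($text, $i, 1, 'UTF-8');
    }
}

foreach (utf8_chars($string) as $char) {
    echo $char, ' ';
}
```

Every character in this particular string takes two bytes in UTF-8, which is why $string[0] on its own prints nothing useful.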
