working with UTF-8 encoded text


I have a problem. I need to find some UTF-8 characters in my text file and output them, but instead of the letters it outputs question marks ("?").
ini_set( 'default_charset', 'UTF-8' );
$homepage = file_get_contents('t1.txt');
echo $homepage;
echo "\t";
echo "\t!!!!!!!!!!!!";
echo $homepage[14];
This is the strange part: if I use an existing index it outputs nothing, but if I put
echo $homepage[35];
it outputs "?",
but my $homepage string is only 30 characters long. What's wrong?
PHP reads the string from the file correctly and outputs it correctly, but when I ask for a character by index, it doesn't work. Here is what's in my text file:
advhasgdvgv
олыолоываи
ouhh
The whole string is output correctly when I just echo $homepage, but $homepage[14] doesn't work. Here is the output:
advhasgdvgv олыолоываи ouhh !!!!!!!!!!!!

Try mb_convert_encoding, and see if that fixes the problem.
http://www.php.net/manual/en/function.mb-convert-encoding.php
string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )
$homepage = mb_convert_encoding(
file_get_contents('t1.txt'),
"UTF-8"
);
You should also check on the encodings of both the PHP file and the text file you have there.

I used this approach for dealing with UTF-8:
<?php
$string = 'ئاکام';//my name
mb_internal_encoding("UTF-8");
$mystring = mb_substr($string, 0, 1); // ئ
// without mb_internal_encoding the return was Ø
echo $mystring;
?>
I also saved all files with UTF-8 encoding.

In UTF-8, non-ASCII characters take more than one byte each, so to access one through byte indexes you would have to do:
echo $homepage[30] . $homepage[31];
> и
That assumes the character is exactly 2 bytes long, but it could be longer, so a more general solution would be:
function charAt($str, $pos, $encoding = "UTF-8")
{
    // $pos counts characters, not bytes
    return mb_substr($str, $pos, 1, $encoding);
}

PHP strings are byte sequences with no built-in notion of UTF-8, which means that $text[n] gets the n-th byte instead of the n-th character. A UTF-8 character takes 1 to 4 bytes, so you cannot locate a character by index without scanning the string: you don't know at which byte it starts. Nor can $text[n] return a multi-byte character, since it always yields a single byte.
Depending on what you want, you can either convert the string to ISO-8859-1 using utf8_decode() (which loses every character Latin-1 cannot represent, and is deprecated as of PHP 8.2), or use some UTF-8-aware mechanism to iterate through the string from the beginning and extract the characters you need.
Be aware that Linux and Windows versions of PHP might produce different output on certain conversions, such as mb_strtoupper(), and that not all regex functions support UTF-8.
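As a rough illustration of the byte-versus-character distinction described above, the sketch below splits a UTF-8 string into characters with preg_split and the /u modifier, then indexes the resulting array. The helper name utf8_chars is made up for this example, and the sample string mirrors the asker's file under the assumption of LF line endings.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
// Assumes $str is valid UTF-8; preg_split returns false on invalid input,
// in which case an empty array is returned here.
function utf8_chars(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

// Same content as the asker's t1.txt, assuming LF line endings.
$homepage = "advhasgdvgv\nолыолоываи\nouhh";
$chars = utf8_chars($homepage);

echo strlen($homepage), "\n"; // byte count: 37
echo count($chars), "\n";     // character count: 27
echo $chars[14], "\n";        // 15th character: ы
```

Splitting costs one pass over the string, so for a single lookup mb_substr is simpler; the array is mainly useful when many positions are needed.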

Related

Manipulating Thai Characters in PHP

I'm struggling to get Thai characters and PHP working together. This is what I'd like to do:
<?php
mb_internal_encoding('UTF-8');
$string = "ทาง";
echo $string[0];
?>
But instead of giving me the first character of $string (ท), I just get some messed up output. However, displaying $string itself works fine.
The file itself is of course UTF-8 as well. The Content-Type in the header is also set to UTF-8. I changed the necessary lines in php.ini according to this site.
utf8_encode() and utf8_decode() also don't help. Does anyone have an idea?

In PHP, when you access a string with $string[0], it doesn't return the first character but the first byte.
You should use mb_substr instead. For example:
mb_substr($string, 0, 1, 'UTF-8');
Note: Since you are using mb_internal_encoding('UTF-8'); you may as well omit the last parameter.
This happens because PHP is not aware of the encoding a string is in (that is, the encoding is not stored in the string object), so it treats the string as a plain sequence of single bytes. If you don't want that, you must use the Multibyte String functions (mb_*).
When you set mb_internal_encoding('UTF-8'); you are telling PHP to use UTF-8 for all the Multibyte String functions, but not for anything else.
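To make the byte-versus-character difference concrete for the Thai string from the question, here is a small sketch. It only uses strlen, mb_strlen, mb_substr and bin2hex, and assumes the script file itself is saved as UTF-8.

```php
<?php
$string = "ทาง";

// $string[0] is only the first byte of a three-byte sequence.
echo bin2hex($string[0]), "\n"; // e0

// mb_substr returns the whole first character (three bytes).
$first = mb_substr($string, 0, 1, 'UTF-8');
echo bin2hex($first), "\n";     // e0b897

// Byte length versus character length.
echo strlen($string), " bytes, ", mb_strlen($string, 'UTF-8'), " characters\n"; // 9 bytes, 3 characters
```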

PHP Utf8 Decoding Issue

I have the following address line: Praha 5, Staré Město,
I need to use utf8_decode() function on this string before I can write it to a PDF file (using domPDF lib).
However, the output of PHP's utf8_decode() for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?

utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer; the function performs the same conversion as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
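One way to check in advance whether a string survives the trip to Latin-1 is a round-trip comparison. The sketch below is an illustration, not part of the original answer: fits_latin1 is a hypothetical helper name, and it relies on mb_convert_encoding substituting characters it cannot represent.

```php
<?php
// Hypothetical helper: true if every character of a UTF-8 string
// can be represented in ISO-8859-1 (Latin-1).
function fits_latin1(string $utf8): bool
{
    // Unrepresentable characters are replaced during this conversion...
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    // ...so converting back no longer yields the original string.
    $back = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $back === $utf8;
}

var_dump(fits_latin1('Staré')); // bool(true)
var_dump(fits_latin1('Město')); // bool(false)
```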

I wound up using a home-grown UTF-8 / UTF-16 decoding function (it converts to &#number; representations). I haven't found any pattern to why UTF-8 isn't detected; I suspect it's because the "encoded-as" sequence isn't always in exactly the same position in the returned string. You might do some additional checking on that.
Three-byte UTF-8 indicator (the byte order mark): $startutf8 = chr(0xEF).chr(0xBB).chr(0xBF); (if you see this ANYWHERE, not just in the first three bytes, the string is UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through byte by byte:
function charset_decode_utf_8($string) {
    // Only do the slow conversion if there are 8-bit bytes.
    // (ereg() and the /e modifier of the original version were removed in PHP 7,
    // so this uses preg_match() and preg_replace_callback() instead.)
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }
    // decode three-byte sequences into numeric character references
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
        },
        $string);
    // decode two-byte sequences into numeric character references
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
        },
        $string);
    return $string;
}
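For comparison, the mbstring extension can produce the same kind of &#number; output without hand-written regular expressions. This is an alternative sketch, not the answerer's function; it assumes mbstring is loaded and that the input is valid UTF-8.

```php
<?php
// Map every code point from U+0080 upward to a decimal entity.
// Each group of four values is: first code point, last code point, offset, mask.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$encoded = mb_encode_numericentity('Praha 5, Staré Město,', $map, 'UTF-8');
echo $encoded, "\n"; // Praha 5, Star&#233; M&#283;sto,
```

Unlike the two regular expressions above, the map also covers four-byte sequences, since its range extends to U+10FFFF.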

The problem is in your PHP file's encoding. Save the file as UTF-8 and you won't even need utf8_decode. If the data ('Praha 5, Staré Město,') comes from a database, it is better to change its charset to UTF-8 as well.

You don't need that (@Rajeev: this string is automatically detected as UTF-8 encoded:
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8).
You should rather look at:
https://code.google.com/p/dompdf/wiki/CPDFUnicode

Wrong output when using array indexing on UTF-8 string

I have encountered a problem when using a UTF-8 string. I want to read a single character from the string, for example:
$string = "üÜöÖäÄ";
echo $string[0];
I am expecting to see ü, but I get � -- why?

Use mb_substr($string, 0, 1, 'utf-8') to get the character instead.
What happens in your code is that the expression $string[0] gets the first byte of the UTF-8 encoded representation of your string because PHP strings are effectively arrays of bytes (PHP does not internally recognize encodings).
Since the first character in your string is encoded in more than one byte (per the UTF-8 encoding rules), you are effectively only getting part of the character. Furthermore, these rules make the byte you are retrieving invalid as a character on its own, which is why you see the question mark.
mb_substr knows the encoding rules, so it will not naively give you back just one byte; it will get as many as needed to encode the first character.
You can see that $string[0] gives you back just one byte with:
$string = "üÜöÖäÄ";
echo strlen($string[0]);
While mb_substr gives you back two bytes:
$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8'));
And these two bytes are in fact just one character (you need to use mb_strlen for this):
$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8');
Finally, as Marwelln points out below, the situation becomes more tolerable if you use mb_internal_encoding to get rid of the 'utf-8' redundancy:
$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));

Non-UTF8 files (Google CSV file)

I'm running into weird encoding issues when handling uploaded files.
I need to accept any sort of text file, and be able to read the contents. Specifically having trouble with files downloaded from a Google Contacts export.
I've tried the usual utf8_encode/decode, mb_detect_encoding, etc. Detection always reports the string as UTF-8, and I tried many iconv options to convert the encoding, without success.
test.php
header('Content-type: text/html; charset=UTF-8');
if ($stream = fopen($_FILES['list']['tmp_name'], 'r'))
{
$string = stream_get_contents($stream);
fclose($stream);
}
echo substr($string, 0, 50);
var_dump(substr($string, 0, 50));
echo base64_encode(serialize(substr($string, 0, 50)));
Output
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
czo1MDoi//5OAGEAbQBlACwARwBpAHYAZQBuACAATgBhAG0AZQAsAEEAZABkAGkAdABpAG8AbgAiOw==

The beginning of the string carries the bytes \xFF \xFE, which are the byte order mark for UTF-16 little-endian. All letters are actually two-byte sequences: here, the ASCII character followed by a \0 byte.
Printing the raw bytes to a console may appear to work, because many terminals silently ignore the \0 bytes. To display the whole string properly, you need to decode it explicitly (best via iconv).
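A minimal sketch of such a decoding step, keyed on the byte order mark, might look like the following. The helper name bom_to_utf8 is invented for this example, it uses mb_convert_encoding rather than iconv, and it deliberately ignores UTF-32 and BOM-less input.

```php
<?php
// Hypothetical helper: convert a byte string to UTF-8 based on its BOM.
// Only UTF-8 and UTF-16 BOMs are handled; input without a BOM is returned unchanged.
function bom_to_utf8(string $bytes): string
{
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        // UTF-16 little-endian: strip the BOM, then convert.
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        // UTF-16 big-endian.
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        // Already UTF-8: just drop the BOM.
        return substr($bytes, 3);
    }
    return $bytes;
}

// First bytes of the uploaded file, as seen in the question's output.
$raw = "\xFF\xFE" . "N\x00a\x00m\x00e\x00,\x00";
echo bom_to_utf8($raw), "\n"; // Name,
```

In the asker's script this would be applied to $string right after stream_get_contents(), before any substr() call.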

When I decoded the base64 piece, I saw a strange mixed string: s:50:"\xff\xfeN\x00a\x00m\x00e\x00,\x00G\x00i\x00v\x00e\x00n\x00 \x00N\x00a\x00m\x00e\x00,\x00A\x00d\x00d\x00i\x00t\x00i\x00o\x00n\x00". The part after the second : is a 2-byte Unicode (UCS-2) string enclosed in ASCII ", while "s" and "50" are plain ASCII. That \xff\xfe piece is the byte order mark of a UCS-2 string. This is insane but parseable.
I suppose you could split the input string by :, strip " from the beginning and end, and try to decode each resulting string separately.

Unicode unknown "�" character detection in PHP

Is there any way in PHP of detecting the following character �?
I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect if � is present in a string. How do I do so with strpos?
Simply pasting the character into my codebase does not seem to work.
if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)

Converting a UTF-8 string into UTF-8 using iconv() using the //IGNORE parameter produces a result where invalid UTF-8 characters are dropped.
Therefore, you can detect a broken character by comparing the length of the string before and after the iconv operation. If they differ, they contained a broken character.
Test case (make sure you save the file as UTF-8):
<?php
header("Content-type: text/html; charset=utf-8");
$teststring = "Düsseldorf";
// Deliberately create broken string
// by encoding the original string as ISO-8859-1
$teststring_broken = utf8_decode($teststring);
echo "Broken string: ".$teststring_broken ;
echo "<br>";
$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );
echo $teststring_converted;
echo "<br>";
if (strlen($teststring_converted) != strlen($teststring_broken ))
echo "The string contained an invalid character";
In theory, you could drop //IGNORE and simply test for a failed (empty) iconv operation, but there might be other reasons for an iconv call to fail than just invalid characters... I don't know. I would use the comparison method.
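Two related checks can be sketched without iconv. This is offered as an alternative, not as the answer's method, and assumes the mbstring extension is available: mb_check_encoding reports whether the bytes are valid UTF-8 at all, and strpos can look for the replacement character U+FFFD when it is written as its UTF-8 byte sequence instead of being pasted into the source.

```php
<?php
// U+FFFD REPLACEMENT CHARACTER, spelled out as UTF-8 bytes.
$replacement = "\xEF\xBF\xBD";

$valid   = "D\xC3\xBCsseldorf";             // "Düsseldorf" in UTF-8
$broken  = "D\xFCsseldorf";                 // same text in ISO-8859-1, invalid as UTF-8
$damaged = "D" . $replacement . "sseldorf"; // valid UTF-8 that already contains U+FFFD

var_dump(mb_check_encoding($valid, 'UTF-8'));       // bool(true)
var_dump(mb_check_encoding($broken, 'UTF-8'));      // bool(false)
var_dump(strpos($damaged, $replacement) !== false); // bool(true)
```

The two cases differ: an invalid byte sequence can still be repaired by converting from the right source encoding, whereas a stored U+FFFD means the original character was already lost upstream.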

Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:
$encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
if (strcasecmp($encoding, 'UTF-8') !== 0) {
$str = iconv($encoding, 'utf-8', $str);
}

As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.

I use the CUSTOM method (using str_replace) to sanitize undefined characters:
$text='a³';
$text=str_replace("\n\n", "sample000" ,$text);
$text=str_replace("\n", "sample111" ,$text);
$text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);
$text=str_replace("sample000", "<br/><br/>" ,$text);
$text=str_replace("sample111", "<br/>" ,$text);
echo $text; //outputs ------------> a3
