Manipulating Thai Characters in PHP - php

I'm struggling to get Thai characters and PHP working together. This is what I'd like to do:
<?php
mb_internal_encoding('UTF-8');
$string = "ทาง";
echo $string[0];
?>
But instead of giving me the first character of $string (ท), I just get some messed up output. However, displaying $string itself works fine.
The file itself is of course UTF-8 as well. The Content-Type header is also set to UTF-8. I changed the necessary lines in php.ini according to this site.
utf8_encode() and utf8_decode() also don't help. Does anyone have an idea?

In PHP, when you access a string with $string[0], it doesn't return the first character but the first byte.
You should use mb_substr instead. For example:
mb_substr($string, 0, 1, 'UTF-8');
Note: Since you are using mb_internal_encoding('UTF-8'); you may as well omit the last parameter.
This happens because PHP is not aware of the encoding a string is in (that is, the encoding is not stored in the string object), so its native string functions treat a string as a plain sequence of single bytes. If you don't want that, you must use the Multibyte String functions (mb_*).
When you set mb_internal_encoding('UTF-8'); you are telling PHP to use UTF-8 for all the Multibyte String functions, but not for anything else.
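As a rough illustration of the byte-versus-character difference described above, the following sketch reads the "first character" both ways. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$string = "ทาง";

$firstByte = $string[0];                          // one raw byte, not a full character
$firstChar = mb_substr($string, 0, 1, 'UTF-8');   // one full character

echo strlen($string), "\n";               // 9 (bytes: each Thai letter takes 3 bytes)
echo mb_strlen($string, 'UTF-8'), "\n";   // 3 (characters)
echo bin2hex($firstByte), "\n";           // e0 (only the lead byte of the first letter)
echo $firstChar, "\n";                    // ท
```

Printing $firstByte on its own sends an incomplete UTF-8 sequence to the browser, which is why the output looks garbled.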

Related

How to replace a symbol in a text string in PHP?

I want to do a search & replace in PHP with a symbol.
This is the symbol: ➤
I want to replace it with a dash, but that doesn't work. The problem seems to be that the symbol cannot be found, even though it's there.
Other 'normal' search and replace operations work as expected. But replacing this symbol does not.
Any ideas how to address this symbol, so that the search and replace function actually can find it and replace it?
Your problem is (almost certainly) related to text/character encoding.
Special characters such as the ➤ you are referring to are not part of the classical ISO-8859-1 character set; they are, however, part of Unicode (code point U+27A4 to be exact). This means that, in order to use this (multibyte) character, you have to use a Unicode encoding, which generally means UTF-8.
All the basic characters (think A-Z, numbers, spaces, ...) overlap between UTF-8 and ISO-8859-1 (which is effectively the default character set), so when you don't use any special characters, you could use the wrong charset and things will pretty much continue to work just fine; that is until you try to use a character that is not part of the basic set.
Since your problem takes place entirely on the server side (inside PHP), and doesn't really touch upon the HTTP and HTML layers, we won't have to go into utf-8 content-type headers and the like, but you should be aware of them for future issues (if you weren't already).
The issue you have should be resolved once you meet 2 criteria:
Not all PHP functions are multibyte-aware, and str_replace is one of those that is not. It compares raw bytes, though, so it still finds the symbol as long as the search string and the subject are in the same encoding. The preg_replace function with its u flag enabled definitely is multibyte-aware, and can serve the exact same function.
The text editor or IDE that you used to create the .php file may or may not be set to UTF-8 encoding; if it wasn't, you should switch that in order to be able to use such characters literally inside the source code.
Something like this should function correctly assuming the .php-file is stored in UTF-8 format:
$output = preg_replace('#➤#u', '-', $input);
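One way to see why the symbol "cannot be found" is to compare bytes. The sketch below is only an illustration: it assumes PHP 7.0+ (for the "\u{...}" escape) and mbstring, and the UTF-16LE input is a hypothetical stand-in for text that arrived in some encoding other than the one used in the source file.

```php
<?php
// Assumption: PHP 7.0+ for the "\u{...}" escape, and mbstring is loaded.
$needle = "\u{27A4}";                          // the arrow symbol as UTF-8 bytes
$utf8Input = "This is the symbol: \u{27A4}";

// Hypothetical case: the same text arrives in another encoding (UTF-16LE here).
$utf16Input = mb_convert_encoding($utf8Input, 'UTF-16LE', 'UTF-8');

$fixed = str_replace($needle, '-', $utf8Input);
$untouched = str_replace($needle, '-', $utf16Input);

echo bin2hex($needle), "\n";             // e29ea4
echo $fixed, "\n";                       // This is the symbol: -
var_dump($untouched === $utf16Input);    // bool(true): nothing was replaced
```

Dumping bin2hex() of both the search string and the input is a quick check for this kind of mismatch.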
Most likely you did not set the header of your PHP script to use the UTF-8 character set. Consider the following:
header('Content-type: text/plain; charset=utf-8');
$input = "This is the symbol: ➤";
$output = str_replace("➤", "-", $input);
echo $input . "\n" . $output;
This prints:
This is the symbol: ➤
This is the symbol: -
The symbol can simply be replaced using the built-in PHP str_replace function, so it would help if you shared your code so we can check it further.
$str = "hey same let's change this to a dash: ➤";
echo "before: $str \n";
echo "after: ".str_replace("➤", "-", $str);
before: hey same let's change this to a dash: ➤
after: hey same let's change this to a dash: -

Understanding character encoding in PHP

I am struggling to understand character encoding in PHP.
Consider the following script (you can run it here):
$string = "\xe2\x82\xac";
var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));
mb_internal_encoding("UTF-8");
var_dump($string);
var_dump($utf8string);
I have a string, actually the € character, represented with its unicode code points. Up to PHP 5.5 the internal encoding used is ISO-8859-1, hence I think that my string will be encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I use to define the string.
Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).
If I also change the PHP internal encoding to UTF-8, I'd expect $utf8string to be displayed correctly on the screen, but this doesn't happen.
What am I missing?
The script you show doesn't use any non-ascii characters, so its internal encoding does not make any difference. mb_internal_encoding does convert your data on output. This question will tell you more about how it works; it will also tell you it's better not to use it.
The three-byte string $string in your code is the UTF-8 representation of the Euro symbol, not its "unicode code point" (which is U+20AC, a value that fits in 2 bytes like all code points in the Basic Multilingual Plane).
Does this clear up the behavior you see?
You started with a string that is the utf-8 representation of the Euro symbol. If you run echo($string) all versions of PHP produce the three bytes you put in $string. How they are interpreted by the browser depends on the character set specified in the Content-Type header. If it is text/html; charset=utf-8 then you get the Euro sign in the rendered page.
Then you make a wrong move. You call mb_convert_encoding() with only two arguments. This lets PHP use the current internal encoding of the mbstring extension for the third argument ($from_encoding). Why is this wrong?
For PHP 5.6 and newer, the default value returned by mb_internal_encoding() is utf-8 and the call to mb_convert_encoding() is a no-op.
But for previous versions of PHP, the default value returned by mb_internal_encoding() is iso-8859-1 and it doesn't match the encoding of your string. Accordingly, mb_convert_encoding() interprets the bytes of $string as three individual characters and encodes them using the rules of utf-8. The outcome is obviously wrong.
Btw, if you initialize $string with '€' you get the same output on all PHP versions (even on PHP 4, iirc).
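The effect described above can be reproduced on any PHP version by passing the source encoding explicitly instead of relying on the default. This is a minimal sketch, assuming mbstring is loaded; the variable names are made up for illustration.

```php
<?php
// Assumption: mbstring is loaded. The third argument names the source encoding explicitly.
$euro = "\xe2\x82\xac";   // UTF-8 bytes of the Euro sign

// Correct source encoding: the bytes are already UTF-8, so nothing changes.
$same = mb_convert_encoding($euro, 'UTF-8', 'UTF-8');

// Wrong source encoding: each byte is read as one ISO-8859-1 character
// and re-encoded as a two-byte UTF-8 sequence.
$doubleEncoded = mb_convert_encoding($euro, 'UTF-8', 'ISO-8859-1');

echo bin2hex($same), "\n";            // e282ac
echo bin2hex($doubleEncoded), "\n";   // c3a2c282c2ac
```

Always passing the third argument avoids depending on whichever internal encoding the running PHP version happens to default to.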

How to change encoding of a web page retrieved by Simple HTML DOM?

I am trying to read contents of a web page
$html = file_get_html('http://www.example.com/somepage.aspx');
Since the page's encoding is Windows-1254, and I work on a page encoded as UTF-8, I cannot replace some words which have language-specific characters.
For Example:
If I try to
$str2 = str_replace('TÜRKÇE', 'TURKCE', $str);
it does not replace.
I have tried the htmlentities() function. It worked, but it deleted some words which contain special characters.
Work in UTF-8 only. If you have data in other encodings, convert it. If you do not know the encoding, try to detect it. If you cannot, ask the users. Then use only mb_* functions for all string operations; this is important! Some functions have no native mb_* version in PHP, but you can find hand-made implementations in the comments on php.net/..
After getting the strings I used the iconv('Windows-1254', 'utf-8', $str) function (thanks to @pguardiario). This solved my problem.
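The convert-first approach can be sketched as follows. This is only an illustration, assuming PHP 7.0+ and the iconv extension; the sample bytes stand in for text scraped from a Windows-1254 page and are not taken from the original question.

```php
<?php
// Assumption: PHP 7.0+ and the iconv extension; the sample bytes are made up for illustration.
$needle = "T\u{DC}RK\u{C7}E";   // 'TÜRKÇE' as UTF-8 (what a UTF-8 source file contains)
$raw = "T\xDCRK\xC7E";          // the same word as Windows-1254 bytes

// Before conversion the byte sequences differ, so nothing matches.
$before = str_replace($needle, 'TURKCE', $raw);

// After conversion both sides are UTF-8 and the replacement succeeds.
$utf8 = iconv('Windows-1254', 'UTF-8', $raw);
$after = str_replace($needle, 'TURKCE', $utf8);

var_dump($before === $raw);   // bool(true)
echo $after, "\n";            // TURKCE
```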

substr doesn't work fine with utf8

I am using the substr function to access the first 20 characters of a string. It works fine in normal situations, but when working with RTL languages (UTF-8) it gives me wrong results (only about 10 characters are shown). I have searched the web but found nothing useful to solve this issue. This is my line of code:
substr($article['CBody'],0,20);
Thanks in advance.
If you're working with strings encoded as UTF-8 you may lose characters when you try to get a part of them using the PHP substr function. This happens because in UTF-8 characters are not restricted to one byte, they have variable length to match Unicode characters, between 1 and 4 bytes.
You can use mb_substr(). It works almost the same way as substr, the difference being that you can add an extra parameter to specify the encoding, whether it is UTF-8 or a different one.
Try this:
$str = mb_substr($article['CBody'], 0, 20, 'UTF-8');
echo $str; // no utf8_decode(): it converts to ISO-8859-1 and would turn RTL letters into "?"
Hope this helps.
Use this instead; it will handle multi-byte characters.
http://php.net/manual/en/function.mb-substr.php
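To make the difference concrete, here is a small sketch with an Arabic sample string (a made-up stand-in for $article['CBody']). It assumes the file is saved as UTF-8 and mbstring is loaded. Each Arabic letter takes two bytes in UTF-8, which is why a byte-based cut returns roughly half the expected characters.

```php
<?php
// Assumption: file saved as UTF-8, mbstring loaded; $text is an illustrative sample.
$text = "مرحبا بالعالم";

$byteCut = substr($text, 0, 5);                // 5 bytes: two letters plus half of a third
$charCut = mb_substr($text, 0, 5, 'UTF-8');    // 5 characters

var_dump(mb_check_encoding($byteCut, 'UTF-8'));   // bool(false): cut in the middle of a letter
echo mb_strlen($charCut, 'UTF-8'), "\n";          // 5
echo strlen($charCut), "\n";                      // 10 (bytes)
echo $charCut, "\n";                              // مرحبا
```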

strtolower() for unicode/multibyte strings

I have some text in a non-English/foreign language in my page,
but when I try to make it lowercase, its characters are converted into black diamonds containing question marks.
$a = "Երկիր Ավելացնել";
echo $b = strtolower($a);
//returns ����� ���������
I've set my charset in a metatag, but this didn't fix it.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
What can I do to convert my string to lowercase without corrupting it?
Have you tried using mb_strtolower()?
PHP 5's native string functions are not UTF-8 aware, so you still need to resort to the mb extension. I suggest you set the internal encoding of mb to UTF-8, and then you can freely use its functions without specifying the charset all the time:
mb_internal_encoding('UTF-8');
...
$b = mb_strtolower($a);
echo $b;
I found this solution here:
$string = 'Թ';
echo 'Uppercase: ' . mb_convert_case($string, MB_CASE_UPPER, "UTF-8") . "\n";
echo 'Lowercase: ' . mb_convert_case($string, MB_CASE_LOWER, "UTF-8") . "\n";
echo 'Original: ' . $string . "\n";
works for me (lower case)
Have you tried mb_strtolower() and specifying the encoding as the second parameter?
The examples on that page appear to work.
You could also try:
$str = mb_strtolower($str, mb_detect_encoding($str));
By default, PHP does not know about UTF-8. It assumes every character is a single byte, so strtolower converts each byte on its own. Non-ASCII letters in UTF-8 are written with two or more bytes, and if one of those bytes happens to match the code of an uppercase letter in the current single-byte locale, it is converted. As a result the sequence is broken, and it no longer represents a valid character.
To change this you need to configure the mbstring extension:
http://www.php.net/manual/en/book.mbstring.php
so that it replaces strtolower with mb_strtolower, or use mb_strtolower directly. In any case, you need to spend some time configuring the mbstring settings to match your requirements.
Use mb_strtolower instead, as strtolower doesn't work on multi-byte characters.
strtolower() will perform the conversion in the currently selected locale only.
I would try mb_convert_case(). Make sure you explicitly specify an encoding.
You will need to set the locale; see the first example at http://ca3.php.net/manual/en/function.strtolower.php
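Putting the mb_* suggestions together, a minimal sketch for the Armenian string from the question might look like this. It assumes the file is saved as UTF-8 and mbstring is loaded; the variable names $lower and $viaCase are only for illustration.

```php
<?php
// Assumption: file saved as UTF-8 and mbstring loaded.
mb_internal_encoding('UTF-8');

$a = "Երկիր Ավելացնել";

$lower = mb_strtolower($a);                                // uses the internal encoding set above
$viaCase = mb_convert_case($a, MB_CASE_LOWER, 'UTF-8');    // encoding passed explicitly

echo $lower, "\n";               // երկիր ավելացնել
var_dump($lower === $viaCase);   // bool(true)
```

Passing the encoding explicitly, as in the second call, keeps the result independent of whatever mb_internal_encoding() is set to elsewhere in the application.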
