How to convert an ASCII-encoded string to UTF-8 in PHP? [duplicate]

This question already has answers here:
Convert ASCII TO UTF-8 Encoding
(5 answers)
Closed 6 years ago.
I tried to do:
file_put_contents ( $file_name, utf8_encode($data) ) ;
But when I check the file encoding from the shell with the Linux command 'file file_name',
I get: 'file_name: ASCII text'
Does that mean utf8_encode didn't work? If so, what is the right way to convert from ASCII to UTF-8?

If your string doesn't contain any non-ASCII characters, then you likely won't see differences, since UTF-8 is backwards compatible with ASCII. Try writing, for example, the text "1000 さくら" and see what happens.
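
To make that concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that the source file itself is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Assumes this source file is saved as UTF-8 and mbstring is available.
$ascii = 'plain text 1000';
$mixed = '1000 さくら';

// Every byte of an ASCII-only string is below 0x80,
// so the string is valid ASCII and valid UTF-8 at the same time.
$asciiIsAscii = mb_check_encoding($ascii, 'ASCII');
$asciiIsUtf8  = mb_check_encoding($ascii, 'UTF-8');

// Converting ASCII to UTF-8 therefore returns exactly the same bytes.
$converted = mb_convert_encoding($ascii, 'UTF-8', 'ASCII');
$unchanged = ($converted === $ascii);

// The Japanese characters need multi-byte sequences,
// so this string is valid UTF-8 but not valid ASCII.
$mixedIsAscii = mb_check_encoding($mixed, 'ASCII');
$mixedIsUtf8  = mb_check_encoding($mixed, 'UTF-8');

var_dump($asciiIsAscii, $asciiIsUtf8, $unchanged, $mixedIsAscii, $mixedIsUtf8);
```

Only the second string would make the 'file' command report something other than ASCII text once it is written to disk.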

Please note that utf8_encode only converts a string encoded in
ISO-8859-1 to UTF-8. A more appropriate name for it would be
"iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do
not need this function. If your text is already in UTF-8, you do not
need this function. In fact, applying this function to text that is
not encoded in ISO-8859-1 will most likely simply garble that text.
If you need to convert text from any encoding to any other encoding,
look at iconv() instead.
See http://php.net/manual/en/function.utf8-encode.php
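
Since utf8_encode is deprecated as of PHP 8.2, an explicit conversion is probably the safer habit anyway. A small sketch of the same ISO-8859-1 to UTF-8 step, assuming the mbstring and iconv extensions are available (the sample string is made up for the example):

```php
<?php
// "caf\xE9" is "café" in ISO-8859-1: a single byte 0xE9 for "é".
$latin1 = "caf\xE9";

// Both calls name the source encoding explicitly, which utf8_encode never did.
$viaMbstring = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$viaIconv    = iconv('ISO-8859-1', 'UTF-8', $latin1);

// In UTF-8 the "é" becomes the two bytes C3 A9.
echo bin2hex($viaMbstring), PHP_EOL; // 636166c3a9
echo bin2hex($viaIconv), PHP_EOL;    // 636166c3a9
```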

ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8
Found at: Convert ASCII TO UTF-8 Encoding

Try this:
$data = mb_convert_encoding($data, 'UTF-8', 'ASCII');
file_put_contents ( $file_name, $data );
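or use a stream filter to change the encoding of an existing file:
$fd = fopen($file, 'r');
// Filter name format: convert.iconv.<input-encoding>/<output-encoding>
stream_filter_append($fd, 'convert.iconv.ASCII/UTF-8');
stream_copy_to_stream($fd, fopen($output, 'w'));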
Reference: How to write file in UTF-8 format?

Related

UTF-8 encoding of UTF-8 encoded text is not the same as original UTF-8 encoded text

Here is a PHP code snippet I came up with when I found a bug in my project.
print(($str == utf8_encode($str) ? "the same text" : "not the same text") . PHP_EOL);
print(mb_detect_encoding($str));
What this does is tell me whether a string $str is identical to the result of utf8_encode($str); after that it prints the encoding detected for the original string.
What I expected is that either the UTF-8 text is the same as the original, or that the original text is already UTF-8 and therefore the UTF-8 encoded text is the same as the original.
But what really happened is the following output:
not the same text
UTF-8
This is only the case if I set $str = array_keys($_POST)[0]; and use a key with special characters in my request body, like äöü=test, so that $str will be äöü (defining it directly in the code does not produce the same output).
I interpret from the output that the original character encoding is UTF-8, but the two strings are not the same. If I print the initial string it is empty and the encoded string would be äöü.
I don't understand how a string can be different when encoded with its own encoding. Can someone please explain this to me?
The problem is your assumption that "the original text is already UTF-8 and therefore the UTF-8 encoded text is the same as the original".
From the official PHP documentation for utf8_encode (https://www.php.net/manual/en/function.utf8-encode.php):
This function converts the string data from the ISO-8859-1 encoding to UTF-8.
In other words, this function is an ISO-8859-1 to UTF-8 converter. It expects an ISO-8859-1 string and nothing else. If you pass it a string in another encoding, including one that is already UTF-8, every byte is still treated as an ISO-8859-1 character and you should expect garbage.
This thread (PHP: Convert any string to UTF-8 without knowing the original character set, or at least try) discusses converting from any character encoding to UTF-8.
Hope it helps.
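
A rough sketch of what happens to the bytes, using mb_convert_encoding from ISO-8859-1 as a stand-in for utf8_encode (which is deprecated since PHP 8.2). The input is written with hex escapes so the result does not depend on how the source file is saved:

```php
<?php
// "äöü" already encoded as UTF-8: two bytes per character.
$utf8 = "\xC3\xA4\xC3\xB6\xC3\xBC";

// Treating those six bytes as ISO-8859-1 re-encodes each byte on its own,
// so every original character turns into two characters (four bytes).
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL;   // c3a4c3b6c3bc
echo bin2hex($double), PHP_EOL; // c383c2a4c383c2b6c383c2bc
var_dump($utf8 === $double);    // bool(false)

// Reversing the mistaken step restores the original bytes.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8);  // bool(true)
```

Both strings are valid UTF-8, which is why a detected encoding of UTF-8 and "not the same text" can appear together.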

How to get a UTF-8 encoded file with PHP [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 6 years ago.
I want to get a .html or .txt file from a folder with PHP, but this file is UTF-8 encoded, and if I use $html=file_get_contents('somewhere/somewhat.html'); and after that I echo $html; then this won't be UTF-8 encoded. I see many "�" in the text. Any idea? How can I prevent this?
You need to convert it to UTF-8 yourself. To do that, use the PHP functions mb_convert_encoding() and mb_detect_encoding().
Like this:
$html=file_get_contents('somewhere/somewhat.html');
$html=mb_convert_encoding($html, 'UTF-8',mb_detect_encoding($html, 'UTF-8, ISO-8859-1', true));
echo $html;
mb_convert_encoding() converts character encoding
mb_detect_encoding() detects character encoding
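
Detection is a guess, so one possible variation is to leave the text alone when it is already valid UTF-8 and only convert otherwise. The helper below is hypothetical (to_utf8 is not a PHP built-in), and the ISO-8859-1 fallback is an assumption about where the files come from:

```php
<?php
/**
 * Hypothetical helper: return $text as UTF-8.
 * Assumes input that is not valid UTF-8 is in $fallback (ISO-8859-1 by default).
 */
function to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8; converting again would garble it
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8("\xC3\xA9")), PHP_EOL; // c3a9 (UTF-8 "é", left unchanged)
echo bin2hex(to_utf8("\xE9")), PHP_EOL;     // c3a9 (ISO-8859-1 "é", converted)
```

If the file really is UTF-8 and the page still shows "�", the output is probably being declared as something else; sending header('Content-Type: text/html; charset=utf-8') before any output is worth checking.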
Try to use iconv on your string:
http://php.net/manual/pl/function.iconv.php
Other solution:
http://php.net/manual/en/function.mb-convert-encoding.php
Or:
http://php.net/manual/en/function.utf8-encode.php

PHP, convert string into UTF-8 and then hexadecimal

In PHP, I want to convert a string which contains non-ASCII characters into a sequence of hexadecimal numbers which represents the UTF-8 encoding of these characters. For instance, given this:
$text = 'ąćę';
I need to produce this:
C4=85=C4=87=C4=99
How do I do that?
As your question is written, and assuming that your text is properly UTF-8 encoded to start with, this should work:
$text = 'ąćę';
$result = implode('=', str_split(strtoupper(bin2hex($text)), 2));
If your text is not UTF-8, but some other encoding, then you can use
$utf8 = mb_convert_encoding($text, 'UTF-8', $yourEncoding);
to get it into UTF-8, where $yourEncoding is some other character encoding like 'ISO-8859-1'.
This works because in PHP, strings are just arrays of bytes. So as long as your text is encoded properly to start with, you don't have to do anything special to treat it as bytes. In fact, this code will work for any character encoding you want without modification.
Now, if you want to do quoted-printable, then that's another story. You could try using the function quoted_printable_encode (requires PHP 5.3 or higher).
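
For comparison, here is a small sketch of both variants side by side. The utf8_hex helper is a hypothetical wrapper around the one-liner above, and the input is spelled out as bytes so the example does not depend on the encoding of the source file:

```php
<?php
// Hypothetical helper: hex pairs of the raw bytes, joined by a separator.
function utf8_hex(string $utf8, string $sep = '='): string
{
    return implode($sep, str_split(strtoupper(bin2hex($utf8)), 2));
}

// "ąćę" as UTF-8 bytes (U+0105, U+0107, U+0119).
$text = "\xC4\x85\xC4\x87\xC4\x99";

echo utf8_hex($text), PHP_EOL;                // C4=85=C4=87=C4=99
echo quoted_printable_encode($text), PHP_EOL; // =C4=85=C4=87=C4=99
```

The quoted-printable form puts "=" in front of every encoded byte and leaves printable ASCII untouched, so the two outputs only look this similar for purely non-ASCII input.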

Converting encoding in PHP doesn't convert between ASCII and UTF-8 [duplicate]

This question already has answers here:
Convert ASCII TO UTF-8 Encoding
(5 answers)
Closed 6 years ago.
Is UTF-8 not the same as ASCII? How would you explain the different results I get from:
$result = mb_detect_encoding($PLAINText, mb_detect_order(), true);
Sometimes I get "UTF-8" in $result and sometimes I get "ASCII", so they are different. But that is not my question. My question is: why doesn't this iconv() code convert from ASCII to UTF-8?
$result = iconv("ASCII","UTF-8//IGNORE",$PLAINText);
I check the encoding of $result later using the mb_detect_encoding() function and it is still "ASCII", not "UTF-8".
The reason is that a UTF-8 string containing only ASCII characters is byte-for-byte indistinguishable from an ASCII string. (Unless a byte order mark is used, but it's optional.)
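
A quick way to see this, sketched under the assumption that the iconv and mbstring extensions are enabled (the sample text is arbitrary):

```php
<?php
$plain = 'Hello, world';

// iconv has nothing to change: each ASCII byte is already its own UTF-8 encoding.
$result = iconv('ASCII', 'UTF-8', $plain);
var_dump($result === $plain); // bool(true)

// Prepending a byte order mark is the only thing that makes
// otherwise pure-ASCII text stop being valid ASCII.
$withBom = "\xEF\xBB\xBF" . $plain;
$bomIsAscii = mb_check_encoding($withBom, 'ASCII');
$bomIsUtf8  = mb_check_encoding($withBom, 'UTF-8');
var_dump($bomIsAscii, $bomIsUtf8); // bool(false), bool(true)
```

So a detector reporting "ASCII" after the conversion is not a sign that iconv failed; the output is valid in both encodings.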

How to list files with special (Norwegian) characters

I'm doing a simple (I thought) directory listing of files, like so:
$files = scandir(DOCROOT.'files');
foreach ($files as $file)
{
    echo ' <li>'.$file.PHP_EOL;
}
The problem is that the file names contain Norwegian characters (æ, ø, å), and for some reason they come out as question marks. Why is this?
I can apparently fix(?) it by doing this before I echo it out:
$file = mb_convert_encoding($file, 'UTF-8', 'pass');
But it makes little sense to me why this helps, since pass should mean no character encoding conversion is performed, according to the docs... *confused*
Here is an example: http://random.geekality.net/files/index.php
It appears the file names are encoded in ISO Latin 1, but the page is interpreted by default as UTF-8. The characters do not come out as "question marks", but as Unicode replacement characters (�). That means the browser, which tries to interpret the byte stream as UTF-8, has encountered a byte that is invalid in UTF-8 and inserts the replacement character at that point instead. Switch your browser to ISO Latin 1 and see the difference (View > Encoding > ...).
So what you need to do is to convert the strings from ISO Latin 1 to UTF-8, if you designate your page to be UTF-8 encoded. Use mb_convert_encoding($file, 'UTF-8', 'ISO-8859-1') to do so.
Why it works if you specify the $from encoding as pass I can only guess. What you're telling mb_convert_encoding with that is to convert from pass to UTF-8. I guess that makes mb_convert_encoding take the mb_internal_encoding value as the $from encoding, which happens to be ISO Latin 1. I suppose it's equivalent to 'auto' when used as the $from parameter.
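
Putting the explicit conversion into the listing loop could look roughly like this. The list_item helper is hypothetical, and it assumes that any name which is not valid UTF-8 is ISO-8859-1, as described above:

```php
<?php
/**
 * Hypothetical helper: turn a file name into an HTML list item in UTF-8.
 * Assumes names that are not valid UTF-8 are encoded in ISO-8859-1.
 */
function list_item(string $file): string
{
    if (!mb_check_encoding($file, 'UTF-8')) {
        $file = mb_convert_encoding($file, 'UTF-8', 'ISO-8859-1');
    }
    // Escape the name so characters like < or & cannot break the markup.
    return '<li>' . htmlspecialchars($file, ENT_QUOTES, 'UTF-8') . '</li>';
}

// "bl\xE5b\xE6r.txt" is "blåbær.txt" in ISO-8859-1.
echo list_item("bl\xE5b\xE6r.txt"), PHP_EOL;
```

In the original loop this would replace the echo line, for example echo list_item($file).PHP_EOL; for each entry returned by scandir().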
