How to decode string with mixed content (Latin & UTF-8) in PHP

I have a PHP script that reads email/Usenet messages. I found a case where the text is a mix of Arabic and Latin words, i.e.
PHP and ARABIC_WORD
for example:
PHP and الساعة
The problem is that the text arrives encoded, e.g.
Some Text =?utf-8?b?RVByaW50cyBhbmQg2KfZhNi52LHYqNmK2Kk=?=
My question is: how can I decode this =?utf-8?... part when it's mixed with Latin text?
I'm using PHP 5.4.15.

What you've got is the MIME Encoded-Word syntax used in email messages for non US-ASCII encoded texts:
The form is: "=?charset?encoding?encoded text?=".
charset may be any character set registered with IANA. Typically it would be the same charset as the message body.
encoding can be either "Q" denoting Q-encoding that is similar to the quoted-printable encoding, or "B" denoting base64 encoding.
encoded text is the Q-encoded or base64-encoded text.
An encoded-word may not be more than 75 characters long, including charset, encoding, encoded text, and delimiters. If it is desirable to encode more text than will fit in an encoded-word of 75 characters, multiple encoded-words (separated by CRLF SPACE) may be used.
So this little excerpt from Wikipedia also tells you how to decode the string. You're surely not the first one who needs to do this, so libraries exist. See also:
Best way to handle email parsing/decoding in PHP?
proper way to decode incoming email subject (utf 8)
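
The format quoted above can be decoded by hand in a few lines. What follows is a minimal sketch, not a full RFC 2047 implementation: decode_encoded_words() is a hypothetical helper name, and it assumes the mbstring extension is loaded and that the declared charset is one mbstring recognises. Everything outside the encoded-words is left untouched, which is why the Latin prefix survives as-is.

```php
<?php
// Hypothetical helper: decode MIME encoded-words embedded in otherwise plain text.
function decode_encoded_words(string $header): string
{
    // Whitespace between two adjacent encoded-words is not part of the text.
    $header = preg_replace('/(\?=)\s+(=\?)/', '$1$2', $header);

    return preg_replace_callback(
        '/=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/',
        function (array $m): string {
            [, $charset, $encoding, $payload] = $m;
            if (strtoupper($encoding) === 'B') {
                $bytes = base64_decode($payload);
            } else {
                // Q-encoding: "_" stands for a space, "=XX" is a hex-encoded byte.
                $bytes = quoted_printable_decode(str_replace('_', ' ', $payload));
            }
            // Convert from the charset named in the encoded-word to UTF-8.
            return mb_convert_encoding($bytes, 'UTF-8', $charset);
        },
        $header
    );
}

echo decode_encoded_words('Some Text =?utf-8?b?RVByaW50cyBhbmQg2KfZhNi52LHYqNmK2Kk=?='), PHP_EOL;
// Some Text EPrints and العربية
```

For production use, one of the existing libraries or extensions is probably the safer choice, since this sketch does no error handling for unknown charsets or malformed payloads.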

It seems to be base64-encoded text: try PHP's base64_decode function.
$my_string = 'test string';
$res = base64_encode($my_string);
echo $res; //dGVzdCBzdHJpbmc=
echo base64_decode($res); // test string
In fact, decoding your string:
base64_decode("RVByaW50cyBhbmQg2KfZhNi52LHYqNmK2Kk=")
returns something like this:
EPrints and Ø§Ù„Ø¹Ø±Ø¨ÙŠØ©
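
The garbled tail is not necessarily a decoding failure. A short check, assuming the mbstring extension is available and assuming the output above came from a viewer that treated the bytes as Windows-1252 (the answer does not say), suggests that base64_decode already returned valid UTF-8:

```php
<?php
// The base64 payload from the question, decoded to raw bytes.
$bytes = base64_decode('RVByaW50cyBhbmQg2KfZhNi52LHYqNmK2Kk=');

// The bytes are already valid UTF-8: 26 bytes forming 19 characters.
var_dump(mb_check_encoding($bytes, 'UTF-8'));         // bool(true)
var_dump(strlen($bytes), mb_strlen($bytes, 'UTF-8')); // int(26) int(19)

// Reading the same bytes as Windows-1252 turns every byte into one character,
// which is how a two-byte Arabic letter ends up displayed as two Latin symbols.
$misread = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
var_dump(mb_strlen($misread, 'UTF-8'));               // int(26)
```

So the remaining step is to output the decoded bytes on a page or terminal that is declared as UTF-8, not to convert them again.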

Related

List of known troublesome characters that causes PHP to fail to detect the proper character encoding before converting to UTF-8 resulting in lost data

PHP isn't always correct, but what I write always has to be. In this case an email subject contains an en dash character. This thread is about detecting oddball characters that, when alone (let's say, among otherwise purely ASCII text), are incorrectly detected by PHP. I've already determined one static example, though my goal here is to create a definitive thread containing code that is as close to drop-in as we can make it.
Here is my starting string from the subject header of an email:
<?php
//This is AFTER exploding the : of the header and using trim on $p[1]:
$s = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//orkut – convite enviado por Lais Piccirillo
?>
Typically the next step is to do the following:
$s = imap_mime_header_decode($s);//orkut � convite enviado por Lais Piccirillo
Typically past that point I'd do the following:
$s = mb_convert_encoding($subject, 'UTF-8', mb_detect_encoding($s));//en dash missing!
Now I received a static answer for an earlier static question. Eventually I was able to put this working set of code together:
<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//Attempt to determine the character set:
$en = mb_detect_encoding($s1);//ASCII; wrong!!!
$p = explode('?', $s1, 3)[1];//ISO-8859-1; wrong!!!
//Necessary to decode the q-encoded header text any way FIRST:
$s2 = imap_mime_header_decode($s1);
//Now scan for character exceptions in the original text to compensate for PHP:
if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8', 'CP1252');}
else {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8');}
//String is finally ready for client output:
echo '<pre>'.print_r($s2,1).'</pre>';//orkut – convite enviado por Lais Piccirillo
?>
Now either I've still programmed this incorrectly and there is something in PHP I'm missing (I tried many combinations of html_entity_decode, iconv, mb_convert_encoding and utf8_encode) or, at least for the moment with PHP 8, we'll be forced to detect specific characters and manually override the encoding as I've done in the strpos check above. In the latter case a bug report needs to be created, or more likely updated if one specific to this issue already exists.
So technically the question is:
How do we properly detect all character encodings to prevent any characters from being lost during the conversion of strings to UTF-8?
If no such proper answer exists, valid answers include characters that, when they appear among otherwise purely ASCII text, cause PHP to fail to detect the correct character encoding and so produce an incorrect UTF-8 string. If this issue is fixed in the future and the fix can be validated against all the oddball characters listed in the other relevant answers, then a proper answer can be accepted.

You are blaming PHP for something that PHP could not possibly solve:
$s1 is an ASCII string; just as the string "smiling face emoji" is ASCII, even though it describes the string "🙂".
$s2 is decoded according to the information you were sent. In fact, it's decoded into a raw sequence of bytes, and a label which was provided in the input.
Your actual problem is that the information you were sent was wrong - the system that sent it to you has made the common mistake of mislabelling Windows-1252 as ISO-8859-1.
Those two encodings agree on the meanings of 224 out of the 256 possible 8-bit values. They disagree on the values from 0x80 to 0x9F: those are control characters in ISO 8859 and (mostly) assigned to printable characters in Windows-1252.
Note that there is no way for any system to automatically tell you which interpretation was intended - either way, there is simply a byte in memory containing (for instance) 0x96. However, the extra control characters from ISO 8859 are very rarely used, so if the string claims to be ISO-8859-1 but contains bytes in that range, it's almost certainly in some other encoding. Since Windows-1252 is very widely used (and often mislabelled in this way), a common solution is simply to assume that any data labelled ISO-8859-1 is actually Windows-1252.
That makes the solution really very simple:
// $input is the ASCII string you've received
$input = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
// Decode the string into its labelled encoding and a string of raw bytes
// (imap_mime_header_decode() requires the imap extension)
$mime_decoded = imap_mime_header_decode($input);
$input_encoding = $mime_decoded[0]->charset;
$raw_bytes = $mime_decoded[0]->text;
// If it claims to be ISO-8859-1, assume it's lying
// (charset labels are case-insensitive, hence strcasecmp)
if ( strcasecmp($input_encoding, 'ISO-8859-1') === 0 ) {
    $input_encoding = 'Windows-1252';
}
// Now convert from a known encoding to UTF-8 for the use of your application
$utf8_string = mb_convert_encoding($raw_bytes, 'UTF-8', $input_encoding);
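
The disagreement between the two encodings over 0x80 to 0x9F can be checked directly. This is a small sketch that only assumes the mbstring extension is loaded; it converts the single byte 0x96 from the question under both labels and prints the resulting UTF-8 bytes in hex:

```php
<?php
// Byte 0x96 under the two interpretations, converted to UTF-8.
$byte = "\x96";
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1')), PHP_EOL;   // c296   (U+0096, a control character)
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'Windows-1252')), PHP_EOL; // e28093 (U+2013, en dash)

// Bytes outside 0x80-0x9F convert identically under both labels.
$same = mb_convert_encoding("\xE9", 'UTF-8', 'ISO-8859-1')
    === mb_convert_encoding("\xE9", 'UTF-8', 'Windows-1252');
var_dump($same); // bool(true)
```

Because the two labels only differ in that range, relabelling ISO-8859-1 as Windows-1252 changes nothing for text that really was ISO-8859-1 without C1 control characters.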

UTF-8 encoding of UTF-8 encoded text is not the same as original UTF-8 encoded text

Here is a PHP code snippet I came up with when I found a bug in my project.
print(($str == utf8_encode($str) ? "the same text" : "not the same text") . PHP_EOL);
print(mb_detect_encoding($str));
Now what this does, is tell me if a string $str has the same encoding as its UTF-8 encoded version, after that it prints its initial encoding.
What I expected is that either the UTF-8 text is the same as the original, or that the original text is already UTF-8 and therefore the UTF-8 encoded text is the same as the original.
But what really happened is the following output:
not the same text
UTF-8
This is only the case if I set $str = array_keys($_POST)[0]; and use a key with special characters in my request body, like äöü=test, so that $str will be äöü (defining it directly in the code does not produce the same output).
I interpret from the output that the original character encoding is UTF-8, but the two strings are not the same. If I print the initial string it is empty and the encoded string would be äöü.
I don't understand how a string can be different when encoded with its own encoding. Can someone please explain this to me?

The problem is your assumption that "the original text is already UTF-8 and therefore the UTF-8 encoded text is the same as the original".
From the PHP Official Documentation regarding utf8_encode (https://www.php.net/manual/en/function.utf8-encode.php):
This function converts the string data from the ISO-8859-1 encoding to UTF-8.
In other words, this function is an ISO-8859-1 to UTF-8 converter. Used properly, as described above, it expects only an ISO-8859-1 string. If you pass it a string in any other encoding, you should expect garbage.
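
A small sketch of what happens when UTF-8 input goes through that conversion a second time. It uses mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1') as a stand-in for utf8_encode(), which is deprecated as of PHP 8.2, and to_utf8_once() is a hypothetical helper name. The validity check is only a heuristic: some ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so knowing the real source encoding remains the better option.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not valid UTF-8 already.
function to_utf8_once(string $str, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already UTF-8: converting again would garble it
    }
    return mb_convert_encoding($str, 'UTF-8', $assumedSource);
}

$utf8 = "\xC3\xA4\xC3\xB6\xC3\xBC"; // "äöü" as UTF-8 bytes

// Treating UTF-8 input as ISO-8859-1 re-encodes every single byte.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), PHP_EOL;   // c3a4c3b6c3bc
echo bin2hex($double), PHP_EOL; // c383c2a4c383c2b6c383c2bc

var_dump($utf8 === $double);                   // bool(false)
var_dump(to_utf8_once($utf8) === $utf8);       // bool(true)
var_dump(to_utf8_once("\xE4") === "\xC3\xA4"); // bool(true): a lone ISO-8859-1 "ä" is converted
```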
This thread (PHP: Convert any string to UTF-8 without knowing the original character set, or at least try) discusses an "any character encoding to UTF-8" approach.
Hope it helps.

convert string with UTF-16 and UTF-8 text to UTF-8

I have read many posts on how to convert between UTF-16 and UTF-8, but none say what to do if I have both. I'm trying to insert an email body that contains UTF-16 and UTF-8 characters, using PHP, into a SQL Server 2008 table column (UTF-8).
I use iconv() to convert from UTF-16 to UTF-8 but as I said it is not enough because it doesn't handle UTF-8:
$email->description_html = iconv("UTF-16","UTF-8//TRANSLIT",$that->getMessageText(
$msgNo, 'HTML', $structure, $fullHeader,$clean_email));
$email->description = iconv("UTF-16","UTF-8//TRANSLIT",$that->getMessageText(
$msgNo, 'PLAIN', $structure, $fullHeader,$clean_email));
I tried this for both UTF-16 and UTF-8 but it doesn't work, gives a database error:
can't convert UTF-16 to UTF-8
$email->description_html= iconv('','UTF-8',$that->getMessageText(
$msgNo, 'HTML', $structure, $fullHeader,$clean_email));
I don't know what else to do, please help.

There shouldn't be such a thing as "having both UTF-16 and UTF-8" in one text string. If this is the case, the string is broken. There must be an indicator stating which encoding was used, and this encoding only. This indicator must be trusted for converting characters into another encoding. If it doesn't work: Blame the source for incorrectly stating an encoding that wasn't true.
As for email: It might be possible to have a multipart mail that has two (read: more than one) different parts with two different multipart headers, both of them stating different encoding. Dealing with this must be done by applying the rules for parsing multipart mails, i.e. you cannot treat the whole mail as a single string, but must separate these parts first - and then you have a perfectly valid single encoding case for each part. :)
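
A minimal sketch of the per-part approach, assuming the mail has already been split into parts by a MIME parser. The $parts array and its keys are hypothetical stand-ins for whatever the parser returns; the point is only that each part is converted from its own declared charset.

```php
<?php
// Hypothetical, already-separated MIME parts: each carries its own declared charset.
$parts = [
    ['charset' => 'UTF-16LE', 'body' => "h\x00i\x00"],
    ['charset' => 'UTF-8',    'body' => "caf\xC3\xA9"],
];

// Convert every part from its own declared charset, never the mail as a whole.
$converted = [];
foreach ($parts as $part) {
    $converted[] = mb_convert_encoding($part['body'], 'UTF-8', $part['charset']);
}

echo bin2hex($converted[0]), PHP_EOL; // 6869       ("hi")
echo bin2hex($converted[1]), PHP_EOL; // 636166c3a9 ("café")
```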

utf8_encode function purpose

Suppose that I'm encoding my files as UTF-8.
Within PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); //Do i need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Is that string really UTF-8 without the utf8_encode() function?
If you encode your files as UTF-8, do you not need this function?

If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
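
The byte sequence mentioned above can be inspected directly. This sketch assumes the script file itself is saved as UTF-8 and that mbstring is loaded; mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1') stands in for utf8_encode(), which is deprecated as of PHP 8.2.

```php
<?php
// Assumes this file itself is saved as UTF-8.
$string = "あ";
echo bin2hex($string), PHP_EOL;          // e38182 = 11100011 10000001 10000010
var_dump(preg_match('/あ/u', $string));  // int(1): matches without any conversion

// Same conversion as utf8_encode(): the input is (wrongly) treated as ISO-8859-1.
$garbled = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled), PHP_EOL;         // c3a3c281c282
var_dump(preg_match('/あ/u', $garbled)); // int(0): the garbled string no longer matches
```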

PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variable's content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a question mark or similar. So before you go on, you already see that something might not be as intended. (It turned out to be a missing font on my end.)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff

Your string is not a utf-8 character so it can't preg match it, hence why you need to utf8_encode it. Try encoding the PHP file as utf-8 (use something like Notepad++) and it may work without it.

Summary:
The utf8_encode() function encodes every byte of a given string to UTF-8, no matter what encoding was previously used to store the file.
Its purpose is to encode strings¹ that aren't UTF-8 yet.
1. The correct use of this function is to pass it an ISO-8859-1 string.
Why? Because Unicode and ISO-8859-1 have the same characters at the same positions for the first 256 code points.
[Encoding] [Char][Value/Position] ----> [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of the [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of the [¢]? Yes
The function seems to work with other encodings: it works if the string to encode contains only characters that have the same values as in ISO-8859-1 (e.g. in Windows-1252, positions 00-7F and A0-FF).
We should take into account that if the function receives a UTF-8 string (for example from a file encoded as UTF-8), it will encode that UTF-8 string again and produce garbage.

How to properly display utf encoded characters on my utf-8 encoded page?

I am retrieving emails, and some of them contain UTF-8 encoded text. However, even though my page is encoded as UTF-8, in some places when I try to output that text I get funny characters like:
=?utf-8?B?Rlc6INqp24zYpyDYotm+INin2LMg2YXYs9qp2LHYp9uB2bkg2qnbjCDZhtmC?=
=?utf-8?B?2YQg2qnYsdiz2qnYqtuSINuB24zaug==?=
Whereas in other areas of the same page it displays fine. What is going on?

It's not "funny characters", those are legitimate ASCII characters. It's just that the string is MIME encoded for transport, so you'll need to put it through mb_decode_mimeheader.
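
A minimal sketch of that call on the header from the question, assuming the mbstring extension is loaded. The two encoded-words are joined here with a single space, which is an assumption about how they were separated in the original header; mb_decode_mimeheader() decodes into the internal encoding, so that is set to UTF-8 explicitly.

```php
<?php
// The two encoded-words from the question, joined by the whitespace between them.
$raw = '=?utf-8?B?Rlc6INqp24zYpyDYotm+INin2LMg2YXYs9qp2LHYp9uB2bkg2qnbjCDZhtmC?='
     . ' '
     . '=?utf-8?B?2YQg2qnYsdiz2qnYqtuSINuB24zaug==?=';

// mb_decode_mimeheader() decodes into the internal encoding, so pin it to UTF-8.
mb_internal_encoding('UTF-8');
$subject = mb_decode_mimeheader($raw);

// The result is ordinary UTF-8 text, ready to be escaped and printed on a UTF-8 page.
var_dump(mb_check_encoding($subject, 'UTF-8'));
echo htmlspecialchars($subject, ENT_QUOTES, 'UTF-8'), PHP_EOL;
```

The places on the page that already display fine are presumably body text, which is not MIME-encoded in this way; only header fields such as the subject need this step.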

You may be seeing undecoded e-mail headers: =? is the starting delimiter, utf-8 means the text is in utf-8 and B means base-64 encoded. ?= is the ending delimiter. So, base64_decode() the part between the question marks and you'll get the content.
