PHP imap: how to decode and convert Windows-1252 charset emails?

My PHP app processes incoming emails. The processing code usually works fine, but the app crashed recently with the below exception:
Unexpected encoding - UTF-8 or ASCII was expected (View: /home/customer/www/gonativeguide.com/gng2-core/vendor/laravel/framework/src/Illuminate/Mail/resources/views/html/panel.blade.php) {"exception":"[object] (Facade\\Ignition\\Exceptions\\ViewException(code: 0): Unexpected encoding - UTF-8 or ASCII was expected (View: /home/customer/www/gonativeguide.com/gng2-core/vendor/laravel/framework/src/Illuminate/Mail/resources/views/html/panel.blade.php) at /home/customer/www/gonativeguide.com/gng2-core/vendor/league/commonmark/src/Input/MarkdownInput.php:30)
It seems that there was an incoming email whose text was not properly decoded and this made the app crash later on.
I realized that the email had a Windows-1252 encoding:
Content-Type: text/html; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable
The email decoding code currently looks like this:
// DECODE DATA
$data = ($partno)
    ? imap_fetchbody($mbox, $mid, $partno) // multipart
    : imap_body($mbox, $mid);              // simple
// Any part may be encoded, even plain text messages, so check everything.
if ($p->encoding == 4) {
    $data = quoted_printable_decode($data);
} elseif ($p->encoding == 3) {
    $data = base64_decode($data);
}
I checked this page to understand what I need to change to decode emails with Windows-1252, but it is not clear to me which value corresponds to Windows-1252 and how to decode and convert the data to UTF-8. I would highly appreciate any hints, preferably with suggested code.
Thanks,
W.

In your case, this line:
$data = quoted_printable_decode($data);
needs to be adapted like this:
$data = mb_convert_encoding(quoted_printable_decode($data), 'UTF-8', 'Windows-1252');
More generally, to cope with non-UTF-8 encodings, you may want to extract the charset of the body part:
from the body part structure, returned by imap_bodystruct(), or
from the body part MIME headers, returned by imap_fetchmime().
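As a minimal sketch of the first option, the helper below reads the charset from the parameters of a body part structure and converts the decoded body to UTF-8. The function name decodePartToUtf8 is hypothetical, and the US-ASCII default and the "return the bytes unchanged" fallback are assumptions, not something the imap extension prescribes.

```php
<?php
// Hypothetical helper: decodes one body part and converts it to UTF-8.
// $part is an object shaped like the result of imap_bodystruct();
// $raw is the still-encoded body from imap_fetchbody() or imap_body().
function decodePartToUtf8(object $part, string $raw): string
{
    // Undo the transfer encoding first (3 = BASE64, 4 = QUOTED-PRINTABLE).
    $encoding = (int) ($part->encoding ?? 0);
    if ($encoding === 4) {
        $raw = quoted_printable_decode($raw);
    } elseif ($encoding === 3) {
        $raw = base64_decode($raw);
    }

    // Look for a charset among the Content-Type parameters.
    $charset = 'US-ASCII'; // assumed default when no charset is declared
    if (!empty($part->ifparameters)) {
        foreach ($part->parameters as $param) {
            if (strcasecmp($param->attribute, 'charset') === 0) {
                $charset = $param->value;
            }
        }
    }

    // Convert to UTF-8; unknown charset names fall back to the raw bytes.
    try {
        $converted = @mb_convert_encoding($raw, 'UTF-8', $charset);
    } catch (ValueError $e) {
        $converted = false;
    }
    return is_string($converted) ? $converted : $raw;
}

// With a live connection ($mbox, $mid, $partno as in the question):
// $p    = imap_bodystruct($mbox, $mid, $partno);
// $data = decodePartToUtf8($p, imap_fetchbody($mbox, $mid, $partno));
```

Keeping the transfer decoding and the charset conversion in one place means every part goes through the same path, whatever charset the sender declared.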

Related

PHP read a line from a csv file return wrong in charset

I have a CSV file. If I set the charset to ISO-8859-2 (Eastern Europe) in LibreOffice Calc, it renders the characters correctly, but the server's locale is set to EN-UK, so I cannot read the characters correctly. For example:
it returns T�t instead of Tót.
I tried many things like:
echo (mb_detect_encoding("T�t","ISO-8859-2","UTF-8"));
I know probably the char does not exist in UTF-8 but I tried.
Also tried to setup the correct charset in the header:
header('Content-Type: text/html; charset=iso-8859-2');
echo "T�th";
but it returns TÄĹźËth instead of Tóth.
Please help me solve this, thanks in advance.

I advise against setting the header to charset=iso-8859-2. It is usual to work with UTF-8. If the data is available in a different encoding, it should be converted to UTF-8 and then processed as CSV. The following example code can be kept this simple because the newline characters are the same in UTF-8 and ISO-8859-2.
$fileName = "yourpath/Iso8859_2.csv";
$fp = fopen($fileName, "r");
// Compare against false so a line containing only "0" does not end the loop.
while (($row = fgets($fp)) !== false) {
    $strUtf8 = mb_convert_encoding($row, 'UTF-8', 'ISO-8859-2');
    $arr = str_getcsv($strUtf8);
    var_dump($arr);
}
fclose($fp);
The exact encoding of the CSV file must be known. mb_detect_encoding is not suitable for determining the encoding of a file.

Transfer encrypted data in json with utf8 formatted values

I'm trying to stringify the output from openssl_public_encrypt and other openssl functions in PHP, and the output doesn't seem to be UTF-8 encoded. Here is some sample code that generates the error, which is my problem in a nutshell.
<?php
$pubkey=<<<EOD
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAueWffJhr4j+PZhf4QlFF
1HEmcu9d93YYBIQdZBZLWx4uqxsZ6Q3FaBVMkHh0h+sDHx1je2fQprTEMjWGSIu0
HlXRZqPLkVUCpQg2j1oQk2BbZExS6kyziVa1G9ai094WqMz3MjyimOvJxuCAsb+i
rQ/HaC2+vBAdm8wjLYEkqe/q7Q6Tnf+U6bpPYASXTz0WlLJj/G2LLTpEYzF3IgTB
tRsTI6hwpmpHzpKUucEvliEesEPMAs3xp4AaKBdqKQoGFsiA2p1jxJIRUXC/ur7f
2ZgWI59AtemVd+FRZfUapfe5uDD3M5cJy/6Uh9Yg+7vMzuCzi/yBDDFwyy4hD2RJ
YwIDAQAB
-----END PUBLIC KEY-----
EOD;
$jsontest= new \stdClass();
$data="Testing some text ÆØåæøåéè";
openssl_public_encrypt($data,$encrypted,$pubkey,OPENSSL_PKCS1_OAEP_PADDING);
//Next line outputs encoding UTF8 sometimes but not consistently
echo "\n\ndata1: ".mb_detect_encoding($encrypted)."\n";
$jsontest->data1=$encrypted;
$data="Testing some other text ÆØåæøåéè";
openssl_public_encrypt($data,$encrypted,$pubkey,OPENSSL_PKCS1_OAEP_PADDING);
//Next line outputs encoding UTF8 sometimes but not consistently
echo "\n\ndata2: ".mb_detect_encoding($encrypted)."\n";
$jsontest->data2=$encrypted;
header('Content-Type: application/json; charset=UTF-8');
//print_r($jsontest);
$json=null;
try {
$json = json_encode($jsontest, JSON_THROW_ON_ERROR);
} catch (JsonException $e) {
echo 'Error:'.$e;
}
if($json)echo "JSON output:\n$json";
?>
The expected output would be a stringified JSON object with UTF-8 encoded property values. Instead I get this error message:
"Error:JsonException: Malformed UTF-8 characters, possibly incorrectly encoded in 'the php file':24"
When I run the above code snippet, the mb_detect_encoding lines output 'UTF-8' sometimes, but not always.
There seems to be a problem in openssl_public_encrypt, whose output does not conform to UTF-8 encoding.
A very strange behavior: mb_detect_encoding probably does not detect correctly, because the json_encode function fails every time, and openssl_public_encrypt is probably to blame for this.
Anyway, I can't stringify the supposedly UTF-8 encoded output from openssl_public_encrypt. I use base64 encoding of the encrypted data for now as a solution, but the data overhead is around double the original data.
I use OpenSSL in PHP to encrypt/decrypt with RSA, ECDH and AES, in conjunction with JavaScript WebCrypto.
Can anybody help me solve this problem? I am probably not the only one who has it.
Edit:
Got it wrong! The function json_encode in PHP is the showstopper! It doesn't accept UTF-8 encoded JSON strings, although JSON is specified for UTF-8 to my knowledge. It certainly is accepted by and retrieved OK in file_get_contents("php://input"). Is there any reason for that?

"Malformed UTF-8 characters" means the input data contains byte sequences that are not valid UTF-8.
If the data is hard coded, save your file as UTF-8 (without BOM).
If not, use iconv to convert or check the input data.
Encrypted data is binary, so you may need to base64-encode it before running json_encode:
$jsontest->data1 = base64_encode($encrypted);
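The full round trip can be sketched as follows. To keep the example self-contained, random_bytes() stands in for the ciphertext (an assumption made for illustration; the output of openssl_public_encrypt is arbitrary binary in the same way), and the receiving side is shown in PHP as well.

```php
<?php
// Stand-in for ciphertext: arbitrary binary that is almost never valid UTF-8.
$encrypted = random_bytes(256);

// Sender: wrap the binary in base64 so json_encode() only sees ASCII.
$json = json_encode(['data1' => base64_encode($encrypted)], JSON_THROW_ON_ERROR);

// Receiver: parse the JSON, then undo the base64 step to recover the exact bytes.
$decoded  = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
$restored = base64_decode($decoded['data1'], true);

var_dump($restored === $encrypted); // compares the recovered bytes with the original
```

Base64 turns every 3 bytes into 4 characters, so the encoded value is roughly one third larger than the binary it wraps.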

Send UTF-16 encoded data with PHP curl

I'm building a PHP client for a web service that requires posted data to be encoded as UTF-16. How do I configure curl to encode my data in UTF-16 and also to decode the answer from UTF-16?
Some sample code:
$s = curl_init($url);
curl_setopt($s,CURLOPT_POST,1);
curl_setopt($s,CURLOPT_RETURNTRANSFER, 1 );
curl_setopt($s,CURLOPT_POSTFIELDS,$data);
curl_setopt($s,CURLOPT_TIMEOUT, 5);
curl_setopt($s,CURLOPT_HTTPHEADER,array('Content-Type: text/plain'));
$result = curl_exec($s);
curl_close($s);
Adding an Accept-Encoding header does not seem to do the trick. Is it possible to encode my $data string in UTF-16 first and then pass a byte array to curl instead of a string?
Thank you for your answers!

First, you need to find out what encoding your data is in. Then the choice of conversion function is yours: both iconv() and mb_convert_encoding() work pretty well.
Additionally, you should declare the encoding in the HTTP header:
curl_setopt($s,CURLOPT_HTTPHEADER,array('Content-Type: text/plain; charset=UTF-16'));
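Put together, the conversion and the request might look like the sketch below. The helper names toUtf16, fromUtf16 and postUtf16 are hypothetical, the source data is assumed to be UTF-8, and the UTF-16LE byte order is an assumption: check which variant the web service actually expects.

```php
<?php
// Hypothetical helpers; 'UTF-16LE' is an assumption about the service.
function toUtf16(string $utf8, string $variant = 'UTF-16LE'): string
{
    return mb_convert_encoding($utf8, $variant, 'UTF-8');
}

function fromUtf16(string $utf16, string $variant = 'UTF-16LE'): string
{
    return mb_convert_encoding($utf16, 'UTF-8', $variant);
}

// Sketch of the request itself; it is only defined here, not called.
function postUtf16(string $url, string $utf8Body): string
{
    $s = curl_init($url);
    curl_setopt($s, CURLOPT_POST, true);
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
    // PHP strings are byte strings, so the converted data is passed as is.
    curl_setopt($s, CURLOPT_POSTFIELDS, toUtf16($utf8Body));
    curl_setopt($s, CURLOPT_TIMEOUT, 5);
    curl_setopt($s, CURLOPT_HTTPHEADER, ['Content-Type: text/plain; charset=UTF-16LE']);
    $result = curl_exec($s);
    curl_close($s);
    if ($result === false) {
        throw new RuntimeException('Request failed');
    }
    return fromUtf16($result); // back to UTF-8 for the rest of the application
}

echo bin2hex(toUtf16("H\u{e9}")), PHP_EOL; // prints 4800e900
```

Naming the byte order explicitly in both the conversion and the charset parameter avoids relying on a byte-order mark that the service may or may not understand.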

fwrite() and UTF8

I am creating a file using PHP fwrite() and I know all my data is in UTF8 (I have done extensive testing on this: when saving data to the DB and outputting it on a normal webpage, everything works fine and reports as utf8), but I am being told the file I am outputting contains non-utf8 data :( Is there a command in bash (CentOS) to check the format of a file?
When using vim it shows the content as:
Donâ~#~Yt do anything .... Itâ~#~Ys a
great site with
everything....Weâ~#~Yve only just
launched/
Any help would be appreciated: Either confirming the file is UTF8 or how to write utf8 content to a file.
UPDATE
To clarify how I know my data is in UTF8, I have done the following:
The DB is set to utf8. When saving data to the database I run this first:
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "UTF-8", $enc);
Just before I run fwrite I check the data with the code below. Note that each piece of data returns 'IS utf-8':
if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'NOT UTF-8';
else print 'IS utf-8';
Thanks!

If you know the data is in UTF8, then you want to set up the header.
I wrote a solution when answering another thread.
The solution is the following: as the UTF-8 byte-order mark is \xef\xbb\xbf, we should add it to the beginning of the document.
<?php
function writeStringToFile($file, $string){
$f=fopen($file, "wb");
$string="\xEF\xBB\xBF".$string; // this is what makes the magic: prepend the BOM to the content
fputs($f, $string);
fclose($f);
}
?>
You can adapt it to your code, basically you just want to make sure that you write a UTF8 file (as you said you know your content is UTF8 encoded).

fwrite() itself writes the bytes it is given, but depending on how the file was opened, your data (correctly encoded or not) might still get altered by the underlying routines.
To be on the safe side, you should use fopen() with the binary mode flag, that is b. Afterwards, fwrite() will save your string data "as-is", which in PHP means binary data, because strings in PHP are binary strings.
Background: Some systems distinguish between text and binary data. The binary flag explicitly tells PHP on such systems to use binary output. When you deal with UTF-8 you should take care that the data does not get mangled. That is prevented by handling the string data as binary data.
However: If, contrary to what you said in your question, the UTF-8 encoding of the data was not preserved, then your encoding is already broken and even binary-safe handling will keep the broken state. With the binary flag you at least ensure that it is not the fwrite() part of your application that is breaking things.
It has been rightly pointed out in another answer here that you cannot know the encoding if you have only the data. However, you can check whether the data is valid UTF-8, which gives you at least some chance to check the encoding. I posted a PHP function that does this in a UTF-8 related question, so it might be of use if you need to debug things: Answer to: SimpleXML and Chinese. Look for can_be_valid_utf8_statemachine, that's the name of the function.
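A validity check plus a binary-mode write can be sketched like this. It uses mbstring's mb_check_encoding() rather than the state-machine function mentioned above, and the helper name writeUtf8File is hypothetical.

```php
<?php
// Hypothetical helper: writes $data only if it is well-formed UTF-8.
function writeUtf8File(string $path, string $data): void
{
    if (!mb_check_encoding($data, 'UTF-8')) {
        throw new InvalidArgumentException('Data is not valid UTF-8');
    }
    $f = fopen($path, 'wb'); // b: write the bytes untranslated
    if ($f === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    fwrite($f, $data);
    fclose($f);
}

var_dump(mb_check_encoding("Don\u{2019}t", 'UTF-8')); // bool(true)
var_dump(mb_check_encoding("Don\x92t", 'UTF-8'));     // bool(false), a lone Windows-1252 byte
```

A check like this only proves the bytes are well-formed. Text that was encoded twice is still valid UTF-8, so it passes the check even though it displays wrongly.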

//add BOM to fix UTF-8 in Excel
fputs($fp, $bom =( chr(0xEF) . chr(0xBB) . chr(0xBF) ));
I find this piece works for me :)

The problem is your data is double encoded. I assume your original text is something like:
Don’t do anything
with ’, i.e., not the straight apostrophe, but the right single quotation mark.
If you write a PHP script with this content and encoded in UTF-8:
<?php
//File in UTF-8
echo utf8_encode("Don’t"); //this will double encode
You will get something similar to your output.
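The effect can be made visible by looking at the bytes. The sketch below uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode() (deprecated since PHP 8.2), and then reverses it. The repair step is an assumption that only holds when the text was encoded exactly twice.

```php
<?php
$original = "Don\u{2019}t"; // UTF-8 text with a right single quotation mark

// Treating UTF-8 bytes as ISO-8859-1 and encoding them again is the double encoding.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
echo bin2hex($original), PHP_EOL; // 446f6ee2809974
echo bin2hex($double), PHP_EOL;   // 446f6ec3a2c280c29974

// Reversing the extra step recovers the original bytes.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $original); // bool(true)
```

The three bytes of the quotation mark (e2 80 99) become three two-byte sequences, which is why one character shows up as three odd-looking ones in an editor.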

$handle = fopen($file, "wb");
fwrite($handle, pack("CCC", 0xef, 0xbb, 0xbf)); // UTF-8 BOM
fwrite($handle, $data); // $data holds the UTF-8 content to write
fclose($handle);

"I know all my data is in UTF8": wrong.
Encoding is not the format of a file. So check the charset in the headers of the page where you are taking the data from:
header("Content-type: text/html; charset=utf-8;");
And check whether the data really is in a multi-byte encoding:
if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'not UTF-8';
else print 'utf-8';

There may be a reason for this:
first, the information you get from the database may not be UTF-8.
If you are sure it is, use this. I always use it and it works:
$file= fopen('../logs/logs.txt','a');
fwrite($file,PHP_EOL."_____________________output_____________________".PHP_EOL);
fwrite($file,print_r($value,true));

The only thing I had to do was add a UTF8 BOM to the CSV. The data was correct, but the file reader (an external application) couldn't read the file properly without the BOM.

Try this simple method, which is more useful: add the following to the top of the page, before the <body> tag:
<head>
<meta charset="utf-8">
</head>

How to format incoming email text for HTML display

I've set up a script that processes incoming emails and creates blog entries on Blogger. I'm using PEAR's Mail_Mime libs (for now) to read the incoming message. The messages often have characters in them that cannot be read by browsers--this happens most often when people use Outlook or cut/paste from MS Word.
So the output at the other end is something like this:
Here is a test post with “quotes” and ‘apostrophes�for what it�s worth, it also has dashes�and other strange formatting cut and paste from MS Word.
You can also see the output in the wild.
It's not hard to fix any specific instance, but each client (hotmail, gmail, outlook, etc) seems to handle things just a bit differently. Mail_Mime only seems to munge the output and, if I turn off Mail_Mime's parsing and try to translate the encoded characters myself using mb_convert_encoding or some manual simulation of this, it's even worse.
Please note that this is not going to be solved by selecting the right encoding type and using decode/encode/convert functions. The incoming formats vary from Windows-1252 to UTF8 to just about anything else mail clients can think of.
Has anyone scripted this before that could save me some time by offering up a sample or advice on the best approach? I've tried all the simple answers and done plenty of experimenting, so please don't bother responding unless you've dealt with a similar issue successfully or have a deep understanding of encoding issues.

The only way to do this is to do it by the spec, which I'm afraid means pulling in the 'Content-Type' MIME header, picking up the charset (it will look like Content-Type: text/plain; charset="us-ascii"), then converting to UTF-8, and of course ensuring your output on the web is sent as UTF-8 with the right headers.
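A minimal sketch of that approach, assuming the raw Content-Type header value is available as a string and the body has already been transfer-decoded. The helper name charsetFromContentType and the US-ASCII default are illustrative assumptions.

```php
<?php
// Hypothetical helper: pulls the charset out of a raw Content-Type header value.
function charsetFromContentType(string $header, string $default = 'US-ASCII'): string
{
    // Accepts charset=utf-8, charset="Windows-1252" and charset='x', case-insensitively.
    if (preg_match('/charset\s*=\s*["\']?([^"\';\s]+)/i', $header, $m)) {
        return $m[1];
    }
    return $default;
}

$header = 'Content-Type: text/plain; charset="Windows-1252"';
$body   = "\x93quotes\x94 and dashes\x97"; // bytes as they arrive from an Outlook-style mail

$charset = charsetFromContentType($header);
$utf8    = mb_convert_encoding($body, 'UTF-8', $charset);

echo $utf8, PHP_EOL; // curly quotes and an em dash, now as UTF-8
```

When the result is sent to a browser, the page should declare UTF-8 as well, for example with header('Content-Type: text/html; charset=utf-8').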

To solve this problem, and get my message into valid UTF-8 that is readable from a browser, I found this PHP lib, ConvertCharset by Mikolaj Jedrzejak, which worked on almost everything. It still had issues with a specific symbol (=A0) when converting from Windows-1252 or iso-8859-1. So I converted this character manually before setting the code loose.
Here's what it looks like overall:
// decode using Mail_Mime
require 'Mail.php';
require 'Mail/mime.php';
require 'Mail/mimeDecode.php';
$params['include_bodies'] = true;
$params['decode_bodies'] = true; // this decodes it!
$params['decode_headers'] = true;
$decoder = new Mail_mimeDecode($input);
$mime = $decoder->decode($params);
// too much work to put in this example
$charset = ...; //do some magic with $mime->parts to get the character set
$text = ...; //do some magic with $mime->parts to get the text
// fix the =A0 character (a non-breaking space); it's already been decoded
// by Mail_Mime, so we need the actual byte now
// this has to be done before trying to convert to UTF-8
$char = chr(hexdec('A0')); // the single byte 0xA0
$text = str_replace($char, '', $text);
// convert to UTF-8 using ConvertCharset
require 'ConvertCharset.class.php';
if( strtolower($charset) != 'utf-8' ) {
    $converter = new ConvertCharset($charset, 'utf-8', false);
    // convert only when needed, so $converter is always defined here
    $text = $converter->Convert($text);
}
Then everything is spiffy. It even does the infamous Iñtërnâtiônàlizætiøn conversion, as well as accepting french, spanish, and pastes directly from MS Word :)
