Decoding Windows-1252 characters in imap subject line to UTF-8

I have a website that will allow people to post things to it using the subject line of an email in Outlook. Using PHP and imap, I get the subject line of the text and store it in a mysql db. But every once in a while, someone will copy text from a website into the subject line of that email and I will get garbled text. Similar to this:
=?Windows-1252?Q?_Every_day_in_our_offices_we_recycle_cardboard,aluminum?=
=?Windows-1252?Q?=96_won=92t_you_join_us=3F?=
What I've done is try to decode this text so it will appear normal on the page using the following code:
$subject = strip_tags($mailHeader->subject);
$header = imap_mime_header_decode($subject);
$subject = "";
for ($i = 0; $i < count($header); $i++)
{
    $subject .= $header[$i]->text;
}
When finished, most of the garbled text is gone, but I am left with replacement characters where the original subject line had an em dash and a curly quote. See the result below:
Every day in our offices we recycle cardboard, aluminum, � won�t you join us?
The charset for the website is set to UTF-8. When I set the website charset to ISO-8859-1, the replacement characters are replaced with the curly quote and em dash, which is great but I want to leave the website's charset at UTF-8.
Any help on how to get rid of the replacement characters without changing the charset to ISO-8859-1 would be great. Thanks.

Code above works except for one small change to the very end:
$subject .= mb_convert_encoding($header[$i]->text, "UTF-8", $header[$i]->charset);

Each of the objects returned by imap_mime_header_decode includes a charset property, which you are ignoring. You would need to convert each one to UTF-8 in your loop, using something like:
$subject .= mb_convert_encoding($header[$i]->text, "UTF-8", $header[$i]->charset);
As an alternative, consider using the mb_decode_mimeheader or iconv_mime_decode functions. Each of them does the entire job of decoding a MIME header for you, returning a string in PHP's internal encoding (which is usually UTF-8). iconv_mime_decode_headers does the same for a whole block of headers, but returns an array rather than a single string.
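The per-part conversion is not specific to PHP. As a minimal sketch in Python (used here only for illustration; `decode_subject` is a hypothetical helper name, not something from the question), the standard library's `email.header.decode_header` returns the same kind of (bytes, charset) pairs, and each part is decoded with the charset it declares rather than being passed through untouched:

```python
from email.header import decode_header


def decode_subject(raw: str) -> str:
    """Decode an RFC 2047 encoded subject into a normal Unicode string."""
    parts = []
    for payload, charset in decode_header(raw):
        if isinstance(payload, bytes):
            # Use the charset named in the encoded-word; fall back to ASCII
            # for unencoded parts, which carry no charset.
            parts.append(payload.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(payload)
    return "".join(parts)


# The two encoded-words from the question, joined by a space.
raw = (
    "=?Windows-1252?Q?_Every_day_in_our_offices_we_recycle_cardboard,aluminum?= "
    "=?Windows-1252?Q?=96_won=92t_you_join_us=3F?="
)
subject = decode_subject(raw)
```

The result is an ordinary Unicode string, so writing it to a UTF-8 page or database no longer produces replacement characters.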

Related

â€™ in PHP is converting to &rsquo; when using mb_convert_encoding in Outlook subject

I have a mail() function set up in PHP, when emailing to my email to test I noticed the subject was converting my ' into â€™.
$subject="Please provide an updated copy of your company's certification";
result: Please provide an updated copy of your companyâ€™s certification.
I followed Getting â€™ instead of an apostrophe(') in PHP adding mb_convert_encoding but now I am getting &rsquo instead of '.
$subjectBad="Please provide an updated copy of your company's certification";
$subject= mb_convert_encoding($subjectBad, "HTML-ENTITIES", 'UTF-8');
result: Please provide an updated copy of your company&rsquo ;s certification.
It comes through fine to my personal email, so is there a way to properly display a ' in Outlook's subject, or am I at the whim of whatever their system settings are?

Whatever you used to type the subject did not use a simple apostrophe ' which has a common representation across virtually all single-byte encodings and UTF8, instead it used a "fancy" right single quote ’, which is represented differently between single-byte encodings and UTF-8.
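To make the difference concrete, here is a small Python sketch (Python is used only for illustration; the variable names are made up) comparing the bytes of the curly quote in Windows-1252 and UTF-8, and showing what happens when the UTF-8 bytes are read as Windows-1252:

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the "fancy" apostrophe

cp1252_bytes = quote.encode("cp1252")  # a single byte in Windows-1252
utf8_bytes = quote.encode("utf-8")     # three bytes in UTF-8

# Reading the UTF-8 bytes as if they were Windows-1252 yields three
# unrelated characters, which is the garbage seen in the subject line.
misread = utf8_bytes.decode("cp1252")

# A plain ASCII apostrophe has the same single byte in both encodings.
plain_is_same = "'".encode("cp1252") == "'".encode("utf-8")

print(cp1252_bytes.hex(), utf8_bytes.hex(), ascii(misread), plain_is_same)
```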
mb_convert_encoding() is converting to an HTML entity because you are literally telling it to, and email headers are not HTML, so it's going to display as the literal string &rsquo;. Among the common single-byte encodings, "smart quotes" exist mainly in Microsoft's Windows code pages such as cp1252 (ISO-8859-1 has none), and cp1252 is still the wrong answer for email headers.
The simplest answer is: Don't do that. Use a normal apostrophe. Everyone hates dealing with "smart" quotes.
The more complex answer is that email headers MUST be 7-bit safe "ASCII" text, and anything else requires additional handwaving. Ideally you should be using a proper email library that handles this, and the dozens of other annoyances that will malform your emails and impact deliverability.
If you're dead-set on eroding your sanity and using mail() directly, then you're going to want to properly encode your subject line and use an explicitly-defined character set, which you should be doing anyways. Eg:
$subject = 'Please provide an updated copy of your company’s certification';
var_dump(
sprintf('=?UTF-8?Q?%s?=', quoted_printable_encode($subject))
);
Output:
string(82) "=?UTF-8?Q?Please provide an updated copy of your company=E2=80=99s certification?="
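For comparison, this is roughly what "a proper email library" does with the same subject. The sketch below uses Python's standard `email.header` module purely as an illustration (it is not the PHP library the answer has in mind); it encodes the subject as RFC 2047 encoded-words and checks that decoding gives the original text back:

```python
from email.header import Header, decode_header, make_header

subject = "Please provide an updated copy of your company\u2019s certification"

# Header chooses an RFC 2047 encoding for the charset and folds long
# values into several encoded-words on continuation lines.
encoded = Header(subject, "utf-8", header_name="Subject").encode()

# Round trip: decoding the header value gives back the original text.
decoded = str(make_header(decode_header(encoded)))
```

The encoded value contains only ASCII characters, which is what the 7-bit requirement for headers comes down to.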

Email Broken Character in Japan [duplicate]

Apparently, encoding Japanese emails is somewhat challenging, which I am slowly discovering myself. In case there are any experts (even those with limited experience will do), can I please have some guidelines as to how to do it, how to test it and how to verify it?
Bear in mind that I've never set foot anywhere near Japan, it is simply that the product I'm developing is used there, among other places.
What (I think) I know so far is following:
- Japanese emails should be encoded in ISO-2022-JP, Japanese JIS codepage 50220 or possibly SHIFT_JIS codepage 932
- Email transfer encoding should be set to Base64 for plain text and 7Bit for Html
- Email subject should be encoded separately to start with "=?ISO-2022-JP?B?" (don't know what this is supposed to mean). I've tried encoding the subject with
"=?ISO-2022-JP?B?" + Convert.ToBase64String(Encoding.Unicode.GetBytes(subject))
which basically gives the encoded string as expected, but it doesn't get presented as Japanese text in any email program
- I've tested in Outlook 2003, Outlook Express and GMail
Any help would be greatly appreciated
Ok, so to post a short update, thanks to the two helpful answers, I've managed to get the right format and encoding. Now, Outlook gives something that resembles the correct subject:
=?iso-2022-jp?B?6 Japanese test に各々の視点で語ってもらった。 6相当の防水?=
However, the exact same email in Outlook Express gives subject like this:
=?iso-2022-jp?B?6 Japanese test 縺ｫ蜷・・・隕也せ縺ｧ隱槭▲縺ｦ繧ゅｉ縺｣縺溘・ 6逶ｸ蠖薙・髦ｲ豌ｴ?=
Furthermore, when viewed in the Inbox view in Outlook Express, the email subject is even more weird, like this:
=?iso-2022-jp?B?6 Japanese test ??????????????? 6???????=
Gmail seems to be working in the similar fashion to Outlook, which looks correct.
I just can't get my head around this one.

I've been dealing with Japanese encodings for almost 20 years and so I can sympathize with your difficulties. Websites that I've worked on send hundreds of emails daily to Japanese customers so I can share with you what's worked for us.
First of all, do not use Shift-JIS. I personally receive tons of Japanese emails and almost never are they encoded using Shift-JIS. I think an old (circa Win 98?) version of Outlook Express encoded outgoing mail using Shift-JIS, but nowadays you just don't see it.
As you've figured out, you need to use ISO-2022-JP as your encoding for at least anything that goes in the mail header. This includes the Subject, To line, and CC line. UTF-8 will also work in most cases, but it will not work on Yahoo Japan mail, and as you can imagine, many Japanese users use Yahoo Japan mail.
You can use UTF-8 in the body of the email, but it is recommended that you base64 encode the UTF-8 encoded Japanese text and put that in the body instead of raw UTF-8 text. However, in practice, I believe that raw UTF-8 text will work fine these days, for the body of the email.
As I alluded to above, you need to at least test on Outlook (Exchange), Outlook Express (IMAP/POP3), and Yahoo Japan web mail. Yahoo Japan is the trickiest because I believe they use EUC for the encoding of their web pages, and so you need to follow the correct standards for your emails or they won't work (ISO-2022-JP is the standard for sending Japanese emails).
Also, your subject line should not exceed 75 characters per line. That is, 75 characters after you've encoded in ISO-2022-JP and base64, not 75 characters before conversion. If you exceed 75 characters, you need to break your encoded subject into multiple lines, starting with "=?iso-2022-jp?B?" and ending with "?=" on each line. If you don't do this, your subject might get truncated (depending on the email reader, and also the content of your subject text). According to RFC 2047:
"An 'encoded-word' may not be more than 75 characters long, including 'charset', 'encoding', 'encoded-text', and delimiters. If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used."
Here's some sample PHP code to encode the subject:
// Convert Japanese subject to ISO-2022-JP (JIS is essentially ISO-2022-JP)
$subject = mb_convert_encoding ($subject, "JIS", "SJIS");
// Now, base64 encode the subject
$subject = base64_encode ($subject);
// Add the encoding markers to the subject
$subject = "=?iso-2022-jp?B?" . $subject . "?=";
// Now, $subject can be placed as-is into the raw mail header.
See RFC 2047 for a complete description of how to encode your email header.
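The PHP snippet above produces a single encoded-word and does not enforce the 75-character limit. As a rough sketch of the splitting rule in Python (for illustration only; `encode_word` and `encode_subject` are hypothetical helper names, and the input is assumed to be a Unicode string rather than Shift-JIS bytes), each encoded-word is built from whole characters so that no word exceeds 75 characters:

```python
import base64
from email.header import decode_header, make_header

CHARSET = "iso-2022-jp"
MAX_WORD = 75  # RFC 2047 limit for a single encoded-word


def encode_word(text: str) -> str:
    """Build one =?charset?B?...?= encoded-word for the given text."""
    payload = base64.b64encode(text.encode(CHARSET)).decode("ascii")
    return f"=?{CHARSET}?B?{payload}?="


def encode_subject(subject: str) -> str:
    """Split the subject into encoded-words of at most MAX_WORD characters."""
    words, current = [], ""
    for ch in subject:
        # Start a new encoded-word when adding this character would make
        # the current one too long.
        if current and len(encode_word(current + ch)) > MAX_WORD:
            words.append(encode_word(current))
            current = ch
        else:
            current += ch
    if current:
        words.append(encode_word(current))
    # Encoded-words are separated by CRLF followed by a space.
    return "\r\n ".join(words)


subject = "日本語の件名テスト" * 5
encoded = encode_subject(subject)
decoded = str(make_header(decode_header(encoded)))
```

Encoding each chunk separately means every encoded-word carries its own ISO-2022-JP escape sequences and can be decoded on its own, which is why the split is done on characters before encoding rather than on the base64 text afterwards.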

Check http://en.wikipedia.org/wiki/MIME#Encoded-Word for a description on how to encode header fields in MIME-compliant messages. You seem to be missing a “?=” at the end of your subject.

=?ISO-2022-JP?B?TEXTTEXT...
ISO-2022-JP means that the string is encoded in the ISO-2022-JP codepage (i.e. not Unicode)
B means that the string is base64 encoded
In your example, you should just supply your string in ISO-2022-JP instead of Unicode.

I have some experience composing and sending email in Japanese. Normally you have to be aware of which encoding the operating system uses and how you store your Japanese strings!
My Mail objects are normally encoded as follows:
string s = "V‚µ‚¢ŠwK–#‚Ì‚²’ñˆÄ"; // Our japanese are shift-jis encoded, so it appears like garbled
MailMessage message = new MailMessage();
message.BodyEncoding = Encoding.GetEncoding("iso-2022-jp");
message.SubjectEncoding = Encoding.GetEncoding("iso-2022-jp");
message.Subject = s.ToEncoding(Encoding.GetEncoding("Shift-Jis")); // Change the encoding to whatever your source is
message.Body = s.ToEncoding(Encoding.GetEncoding("Shift-Jis")); // Change the encoding to whatever your source is
Then I have an extension method which does the conversion for me:
public static string ToEncoding(this string s, Encoding targetEncoding)
{
return s == null ? null : targetEncoding.GetString(Encoding.GetEncoding(1252).GetBytes(s)); //1252 is the windows OS codepage
}

something like this should get the job done in python:
#!/usr/bin/env python3
# -*- mode: python; coding: utf-8 -*-
import smtplib
import sys
from email.header import Header
from email.mime.text import MIMEText
from email.utils import formatdate

def send_from_gmail(from_addr, to_addr, subject, body, password, encoding="iso-2022-jp"):
    # MIMEText and Header take text and encode it with the given charset themselves
    msg = MIMEText(body, 'plain', encoding)
    msg['Subject'] = Header(subject, encoding)
    msg['From'] = from_addr
    msg['To'] = to_addr
    msg['Date'] = formatdate()
    s = smtplib.SMTP('smtp.gmail.com', 587)
    s.ehlo(); s.starttls(); s.ehlo()
    s.login(from_addr, password)
    s.sendmail(from_addr, to_addr, msg.as_string())
    s.close()
    return "Sent mail to: %s" % to_addr

if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) == 5:
        print(send_from_gmail(*args))
    elif len(args) == 6:
        print(send_from_gmail(*args[:5], encoding=args[5]))
    else:
        sys.exit("SYNTAX: %s <from_addr> <to_addr> <subject> <body> <password> [encoding]" % sys.argv[0])
**blatantly stolen/adapted from:
http://mtokyo.blog9.fc2.com/blog-entry-127.html

First of all you should be using:
Encoding.GetEncoding("ISO-2022-JP")
to convert your subject line into bytes that will be processed by Convert.ToBase64String().
=?ISO-2022-JP?B?TEXTTEXT...?= tells the receiving mail client which encoding was used on the sender's side to convert japanese "letters" into a byte stream.
Currently you're using UTF-16 to encode, but specifying ISO-2022-JP to decode. These are obviously two different encodings, I guess, just like ISO-8859-1 is different from Unicode (most extended western-europe chars are represented by one byte in ISO-XXX, but two bytes in Unicode).
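The mismatch can be reproduced outside of .NET. The following Python sketch (for illustration only; `encoded_word` and `read_back` are made-up helper names) builds the encoded-word once from UTF-16 bytes, which is what `Encoding.Unicode` produces, and once from ISO-2022-JP bytes, then decodes both the way a mail client would, using the charset named in the header:

```python
import base64

PREFIX = "=?ISO-2022-JP?B?"
subject = "日本語"


def encoded_word(payload: bytes) -> str:
    """Wrap raw bytes as a base64 encoded-word labelled ISO-2022-JP."""
    return PREFIX + base64.b64encode(payload).decode("ascii") + "?="


def read_back(word: str) -> str:
    """Decode the way a client would: trust the charset in the label."""
    payload = base64.b64decode(word[len(PREFIX):-2])
    return payload.decode("iso-2022-jp", errors="replace")


wrong = encoded_word(subject.encode("utf-16-le"))    # bytes do not match the label
right = encoded_word(subject.encode("iso-2022-jp"))  # bytes match the label

wrong_result = read_back(wrong)
right_result = read_back(right)
```

Only the second variant survives the round trip, because the bytes inside the encoded-word are in the charset the label announces.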
I'm not sure what you mean about UTF-8 being second-class citizen. As long as the receiving mail client understands UTF-8 and is able to convert it to the current japanese locale, everything is fine.

<?php
function sendMail($to, $subject, $body, $from_email, $from_name)
{
    $headers = "MIME-Version: 1.0 \n";
    $headers .= "From: " .
        "".mb_encode_mimeheader(mb_convert_encoding($from_name, "ISO-2022-JP", "AUTO")) ."" .
        "<".$from_email."> \n";
    $headers .= "Reply-To: " .
        "".mb_encode_mimeheader(mb_convert_encoding($from_name, "ISO-2022-JP", "AUTO")) ."" .
        "<".$from_email."> \n";
    $headers .= "Content-Type: text/plain;charset=ISO-2022-JP \n";
    /* Convert body to same encoding as stated
       in Content-Type header above */
    $body = mb_convert_encoding($body, "ISO-2022-JP", "AUTO");
    /* Mail, optional parameters. */
    $sendmail_params = "-f$from_email";
    mb_language("ja");
    $subject = mb_convert_encoding($subject, "ISO-2022-JP", "AUTO");
    $subject = mb_encode_mimeheader($subject);
    $result = mail($to, $subject, $body, $headers, $sendmail_params);
    return $result;
}

Introduction of Japanese encoding to e-mail happened at JUNET(UUCP based nation-wide network) in early 90's.
At that time, RFC1468 was defined.
If you follow RFC1468 in plain text mail, there would be no problem.
If you want to handle html mail, RFC1468 is useless except for header parts.

Here's what I use to send Japanese emails. Subject line looks fine in Outlook 2010, gmail and on iPhone.
Encoding encoding = Encoding.GetEncoding("iso-2022-jp");
byte[] bytes = encoding.GetBytes(subject);
string uuEncoded = Convert.ToBase64String(bytes);
subject = "=?iso-2022-jp?B?" + uuEncoded + "?=";
// not sure this is actually necessary...
mailMessage.SubjectEncoding = Encoding.GetEncoding("iso-2022-jp");

Russian Language encoded when using imap_fetch from gmail

I'm reading a log file pasted into the body of an email. The logs are in various languages, and all language characters seem to display correctly except for Russian.
Here is an example of what the Russian says in the log file:
Ссылка на объект не указывает на экземпляр объекта.
в
From what I have read, I need to specify a decoding or encoding, something along the lines of mb_encoding (UTF-8), but I am a bit lost on how to actually structure it without affecting text that isn't Russian. When echoed out, the text above gets converted to this:
Ð¡ÑÑ‹Ð»ÐºÐ° Ð½Ð° Ð¾Ð±ÑŠÐµÐºÑ‚ Ð½Ðµ ÑƒÐºÐ°Ð·Ñ‹Ð²Ð°ÐµÑ‚ Ð½Ð° ÑÐºÐ·ÐµÐ¼Ð¿Ð»ÑÑ€ Ð¾Ð±ÑŠÐµÐºÑ‚Ð°.
Ð²
Here is the code I'm using already. I am a PHP beginner and some of this isn't my code; I have edited it to suit, but I'm not 100% sure what everything is doing:
$mailbox = "xxx#gmail.com";
$mailboxPassword = "xxx";
$mailbox = imap_open("{imap.gmail.com:993/imap/ssl}INBOX",
$mailbox, $mailboxPassword);
mb_internal_encoding("UTF-8");
$subject = mb_decode_mimeheader(str_replace('_', ' ', $subject));
$body = imap_fetchbody($mailbox, $val, 1);
$body = base64_decode($body);
echo $body;
Once I echo out the body, the Russian text is converted into that garbled form. Any pointers on similar code I can dissect to learn how to fix this?
Please bear in mind that numerous languages are being read from the email. For the most part it's just a few snippets and the rest is basic logging, but I am worried that if I set a new decoding it will mess up characters from other languages.

Despite its large adoption, email is still tricky to work with. If your IMAP client has a limited set of requirements, your job will be easy. Otherwise, for a truly general-purpose GMail client, there's no silver bullet and you have to understand how email works: SMTP, MIME and finally IMAP.
Basic MIME knowledge is absolutely needed, and I won't paste the whole wikipedia article, but you should really read it and understand how it works. IMAP is somewhat easier to understand.
Usually, email messages contain either a single text/plain body, or a multipart/alternative body with both a text/plain and a text/html part. But, you know, there are attachments, so you are also likely to find a multipart/mixed, which can contain really anything, and binary content should be treated differently from text. Two headers (which you can find on the message as a whole or on a part inside a multipart envelope) are involved in charset issues: Content-Type and Content-Transfer-Encoding.
From your code, we must assume that you are only interested in base64-encoded textual parts. Once you have decoded them, they are a sequence of bytes representing text in the charset specified by the sender in the Content-Type header, which is non-ASCII here and thus looks like this:
Content-Type: text/plain; charset=ISO-8859-1
Note that charset may be utf-8 or really any other you can think of; you have to check this in your program. Your job is to transcode this piece of input into the output charset of your HTML page. If your page does not use a Unicode encoding (like UTF-8), chances are that you won't be able to show the message correctly at all, and '?' will be printed in place of missing characters. Since you require your application to be used worldwide (not just in Russia), and since it's good practice anyway, you should use UTF-8 in your HTML responses, and thus when you want to echo the message body:
echo mb_convert_encoding(imap_base64($body), "UTF-8", $input_charset);
where $input_charset is the one found in the Content-Type header for the processed part. For the subject line, you should use imap_mime_header_decode(), which returns an array of tuples (binary string, charset) which you have to output in the same manner as above.
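The same steps (undo the transfer encoding, read the charset from Content-Type, transcode to Unicode) can be sketched with Python's standard `email` package. This is only an illustration of the procedure, not the PHP code from the question; `text_parts` is a hypothetical helper, and the sample message with a KOI8-R body is made up for the demonstration:

```python
import base64
from email import message_from_bytes


def text_parts(raw_message: bytes):
    """Yield (content type, decoded text) for every textual MIME part."""
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and binary attachments
        # decode=True undoes the Content-Transfer-Encoding (base64 here).
        payload = part.get_payload(decode=True)
        # The charset parameter of Content-Type says how to read the bytes.
        charset = part.get_content_charset() or "ascii"
        yield part.get_content_type(), payload.decode(charset, errors="replace")


russian = "Ссылка на объект не указывает на экземпляр объекта."
raw = (
    b"Subject: log\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=koi8-r\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode(russian.encode("koi8-r"))
    + b"\r\n"
)
decoded = list(text_parts(raw))
```

Because the charset is read per part, a mailbox that mixes Russian, Japanese and Western European messages is handled by the same code path.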
TL;DR
The bytes in the UTF-8 encoded input text map quite nicely to the output if we assume it's CP-1252 encoded (maybe you didn't copy some non printable ones). This means that the input is UTF-8, but the browser thinks the page is Windows-1252. Likely this is the default browser behavior for your locale, and you can easily correct it by sending the appropriate header before any other input:
header("Content-Type: text/html; charset=utf-8");
This should be enough to solve this issue, but it will also likely cause problems with non-ASCII characters in string literals and in the database (if any). If you want a multilingual application, Unicode is the way, but you have to transcode your database and your PHP files from CP-1252 to UTF-8.
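The diagnosis can be checked with a few lines of Python (used only as an illustration). Taking the single Cyrillic letter from the second line of the log excerpt, its UTF-8 bytes read as CP-1252 give exactly the two characters shown in the garbled output, and the reverse operation restores the original:

```python
original = "в"  # the lone Cyrillic letter from the log excerpt

# What a browser shows when it reads UTF-8 bytes as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the mistake recovers the text, which confirms that the data
# itself was valid UTF-8 all along.
repaired = garbled.encode("cp1252").decode("utf-8")

print(ascii(garbled), ascii(repaired))
```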

Encoding issues with PHP and apostrophes when pasted from MS Word

I have a form that emails to my email address. Everything works fine, except when someone pastes something from MS Word into the form. All the text comes through, but the encoding on the apostrophes and double quotes are all messed up. They come through as strange characters.
Is there any way to easily fix this issue?

For me this solution works fine:
Convert the Windows-encoded string to UTF-8:
$str = iconv("cp1252", "UTF-8", $str);
cp866 MS-DOS Cyrillic
cp1251 Windows Cyrillic
cp1252 Windows Western European languages
Further information about iconv()
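The same conversion as a short Python sketch, for illustration only (`cp1252_to_utf8` is a made-up helper name, and the sample bytes assume the form data really arrived as Windows-1252):

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


# Curly apostrophe (0x92) and curly double quotes (0x93, 0x94) as Word
# would produce them in Windows-1252.
pasted = b"It\x92s a \x93quoted\x94 phrase"
converted = cp1252_to_utf8(pasted)
```

This only helps when the input is known to be Windows-1252; applying it to text that is already UTF-8 would garble it instead.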

MS Word uses curly apostrophes and quotes whose Windows-1252 byte values are not valid UTF-8. Here's an article on SO about it:
PHP - Getting rid of curly apostrophes

Have you tried using strip_tags() ?

You need to have same charset on both the form html page, and the page which generates the email content. For example, set utf-8 on the html page which displays the form. Also, when creating the mail message on submit, set the charset in header to be utf-8. That works fine.
If you are using PHPMailer for email, then you can set the charset through the PHPMailer object, like $mail->CharSet = 'utf-8'
This works well when you are storing and retrieving from database. The trick is to keep the encoding scheme same all through.

Weird utf8 conversion problem in php

So I'm working on a project that is taking data from a file, in the file some lines require utf8 symbols but are encoded oddly, they are \xC6 for example rather than being \Æ
If I do as follows:
$name = "\xC6ther";
$name = preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
echo utf8_encode($name);
It works fine. I get this:
Æther
But if I pull the same data from MySQL, and do as follows:
$name = $row['OracleName'];
$name = preg_replace('/x([a-fA-F0-9]{2})/', '\&#$1;', $name);
$name = utf8_encode($name);
Then I receive this as output:
\&#C6;ther
Anyone know why this is?
As requested, vardump of $row['OracleName'];
string(15) "xC6ther Barrier"

Why is there a \ in your second preg_replace? It should be:
preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
OK, I think there is some confusion here. Your regular expression matches something like x66 and would replace it with '&#66;', which looks like an HTML entity to me. But you then call utf8_encode, which does this (from the manual):
utf8_encode — Encodes an ISO-8859-1 string to UTF-8
so nothing would ever get converted (or, to be more precise, '&#66;' would remain '&#66;', since those characters are identical in ISO-8859-1 and UTF-8).
Also note that your first snippet uses \xC6 in a double-quoted string, which the preg_replace would never catch, because PHP has already turned it into a single byte. The \x escape means the following hex number (0x00 to 0xFF) is dropped into the string as a raw byte; it does not produce the literal text xC6.
So I am kind of confused about what you really want to do. What is the preg_replace for?
If you want to convert HTML entities to UTF-8, look into mb_convert_encoding (manual); if you want to do the reverse and produce HTML entities from UTF-8, look into htmlentities (manual).
And if it has nothing to do with any of that and you simply want to change the encoding, mb_convert_encoding is still the tool.

Figured out the problem, on the SQL pull I missed an 'x' in the preg_replace
preg_replace('/x([a-fA-F0-9]{2})/', '&#x$1;', $name);
Once I added in the x, it worked like a charm.
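If the goal is the actual character rather than an HTML entity, the literal xC6 sequences can be expanded directly. A minimal Python sketch of that idea (for illustration only; `expand_hex_escapes` is a hypothetical helper, and it assumes the two hex digits are a Latin-1 code point, as in the example data):

```python
import re

# A literal "x" followed by exactly two hex digits, as stored in the database.
HEX_ESCAPE = re.compile(r"x([0-9A-Fa-f]{2})")


def expand_hex_escapes(text: str) -> str:
    """Replace each literal xHH sequence with the character U+00HH."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


name = expand_hex_escapes("xC6ther Barrier")
```

Note that the pattern also matches an ordinary "x" that happens to be followed by two hex digits inside a normal word, so this is only safe when the data is known not to contain such words.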
