Email Broken Character in Japan [duplicate] - php

Apparently, encoding Japanese emails is somewhat challenging, which I am slowly discovering myself. If there are any experts here (even those with limited experience will do), could I please have some guidelines on how to do it, how to test it, and how to verify it?
Bear in mind that I've never set foot anywhere near Japan; it is simply that the product I'm developing is used there, among other places.
What (I think) I know so far is the following:
- Japanese emails should be encoded in ISO-2022-JP, Japanese JIS codepage 50220 or possibly SHIFT_JIS codepage 932
- Email transfer encoding should be set to Base64 for plain text and 7Bit for HTML
- Email subject should be encoded separately to start with "=?ISO-2022-JP?B?" (don't know what this is supposed to mean). I've tried encoding the subject with
"=?ISO-2022-JP?B?" + Convert.ToBase64String(Encoding.Unicode.GetBytes(subject))
which basically gives the encoded string as expected, but it doesn't get displayed as Japanese text in an email program
- I've tested in Outlook 2003, Outlook Express and GMail
Any help would be greatly appreciated
Ok, so to post a short update, thanks to the two helpful answers, I've managed to get the right format and encoding. Now, Outlook gives something that resembles the correct subject:
=?iso-2022-jp?B?6 Japanese test に各々の視点で語ってもらった。 6相当の防水?=
However, the exact same email in Outlook Express gives subject like this:
=?iso-2022-jp?B?6 Japanese test 縺ォ蜷・・・隕也せ縺ァ隱槭▲縺ヲ繧ゅi縺」縺溘・ 6逶ク蠖薙・髦イ豌エ?=
Furthermore, when viewed in the Inbox view in Outlook Express, the email subject is even more weird, like this:
=?iso-2022-jp?B?6 Japanese test ??????????????? 6???????=
Gmail seems to be working in the similar fashion to Outlook, which looks correct.
I just can't get my head around this one.

I've been dealing with Japanese encodings for almost 20 years and so I can sympathize with your difficulties. Websites that I've worked on send hundreds of emails daily to Japanese customers so I can share with you what's worked for us.
First of all, do not use Shift-JIS. I personally receive tons of Japanese emails and almost never are they encoded using Shift-JIS. I think an old (circa Win 98?) version of Outlook Express encoded outgoing mail using Shift-JIS, but nowadays you just don't see it.
As you've figured out, you need to use ISO-2022-JP as your encoding for at least anything that goes in the mail header. This includes the Subject, To line, and CC line. UTF-8 will also work in most cases, but it will not work on Yahoo Japan mail, and as you can imagine, many Japanese users use Yahoo Japan mail.
You can use UTF-8 in the body of the email, but it is recommended that you base64 encode the UTF-8 encoded Japanese text and put that in the body instead of raw UTF-8 text. However, in practice, I believe that raw UTF-8 text will work fine these days, for the body of the email.
As I alluded to above, you need to at least test on Outlook (Exchange), Outlook Express (IMAP/POP3), and Yahoo Japan web mail. Yahoo Japan is the trickiest because I believe they use EUC for the encoding of their web pages, and so you need to follow the correct standards for your emails or they won't work (ISO-2022-JP is the standard for sending Japanese emails).
Also, your subject line should not exceed 75 characters per line. That is, 75 characters after you've encoded in ISO-2022-JP and base64, not 75 characters before conversion. If you exceed 75 characters, you need to break your encoded subject into multiple lines, starting with "=?iso-2022-jp?B?" and ending with "?=" on each line. If you don't do this, your subject might get truncated (depending on the email reader, and also the content of your subject text). According to RFC 2047:
"An 'encoded-word' may not be more than 75 characters long, including 'charset', 'encoding', 'encoded-text', and delimiters. If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used."
Here's some sample PHP code to encode the subject:
// Convert Japanese subject to ISO-2022-JP (JIS is essentially ISO-2022-JP)
$subject = mb_convert_encoding ($subject, "JIS", "SJIS");
// Now, base64 encode the subject
$subject = base64_encode ($subject);
// Add the encoding markers to the subject
$subject = "=?iso-2022-jp?B?" . $subject . "?=";
// Now, $subject can be placed as-is into the raw mail header.
See RFC 2047 for a complete description of how to encode your email header.
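The 75-character rule above can be sketched in Python. This is a minimal illustration, not the answerer's code: the helper names `encode_word` and `encode_subject` are made up for this example, and the greedy one-character-at-a-time split is just one simple way to stay under the limit. Each chunk is encoded on its own, so every encoded-word carries its own ISO-2022-JP escape sequences and can be decoded independently.

```python
import base64

PREFIX = "=?ISO-2022-JP?B?"
SUFFIX = "?="
MAX_LEN = 75  # RFC 2047 limit for a single encoded-word


def encode_word(text):
    # Hypothetical helper: one chunk of text -> one RFC 2047 encoded-word.
    raw = text.encode("iso2022_jp")
    return PREFIX + base64.b64encode(raw).decode("ascii") + SUFFIX


def encode_subject(subject):
    # Greedily grow a chunk until adding one more character would push
    # the encoded-word past 75 characters, then start a new chunk.
    words, chunk = [], ""
    for ch in subject:
        if chunk and len(encode_word(chunk + ch)) > MAX_LEN:
            words.append(encode_word(chunk))
            chunk = ch
        else:
            chunk += ch
    if chunk:
        words.append(encode_word(chunk))
    # Encoded-words are separated by CRLF SPACE when folded.
    return "\r\n ".join(words)


subject = "に各々の視点で語ってもらった。" * 3
folded = encode_subject(subject)
print(folded)
```

The length is measured after conversion and base64 encoding, which is why the function re-encodes the candidate chunk on each step instead of counting source characters.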

Check http://en.wikipedia.org/wiki/MIME#Encoded-Word for a description on how to encode header fields in MIME-compliant messages. You seem to be missing a “?=” at the end of your subject.

=?ISO-2022-JP?B?TEXTTEXT...
ISO-2022-JP means that the string is encoded in the ISO-2022-JP codepage (i.e. not Unicode)
B means that the string is base64 encoded
In your example, you should just supply your string in ISO-2022-JP instead of Unicode.

I have some experience composing and sending email in Japanese. Normally you have to be aware of which encoding the operating system uses and how you store your Japanese strings!
My Mail objects are normally encoded as follows:
string s = "V‚µ‚¢ŠwK–#‚Ì‚²’ñˆÄ"; // Our japanese are shift-jis encoded, so it appears like garbled
MailMessage message = new MailMessage();
message.BodyEncoding = Encoding.GetEncoding("iso-2022-jp");
message.SubjectEncoding = Encoding.GetEncoding("iso-2022-jp");
message.Subject = s.ToEncoding(Encoding.GetEncoding("Shift-Jis")); // Change the encoding to whatever your source is
message.Body = s.ToEncoding(Encoding.GetEncoding("Shift-Jis")); // Change the encoding to whatever your source is
Then I have an extension method which does the conversion for me:
public static string ToEncoding(this string s, Encoding targetEncoding)
{
return s == null ? null : targetEncoding.GetString(Encoding.GetEncoding(1252).GetBytes(s)); //1252 is the windows OS codepage
}

Something like this should get the job done in Python (updated here for Python 3):
#!/usr/bin/env python3
import smtplib
import sys
from email.mime.text import MIMEText
from email.header import Header
from email.utils import formatdate

def send_from_gmail(from_addr, to_addr, subject, body, password, encoding="iso-2022-jp"):
    # MIMEText and Header take str and encode it with the given charset
    msg = MIMEText(body, 'plain', encoding)
    msg['Subject'] = Header(subject, encoding)
    msg['From'] = from_addr
    msg['To'] = to_addr
    msg['Date'] = formatdate()
    s = smtplib.SMTP('smtp.gmail.com', 587)
    s.ehlo(); s.starttls(); s.ehlo()
    s.login(from_addr, password)
    s.sendmail(from_addr, to_addr, msg.as_string())
    s.close()
    return "Sent mail to: %s" % to_addr

if __name__ == "__main__":
    # Python 3 already decodes sys.argv to str, so no manual decode is needed
    if len(sys.argv) == 6:
        print(send_from_gmail(*sys.argv[1:6]))
    elif len(sys.argv) == 7:
        print(send_from_gmail(*sys.argv[1:6], encoding=sys.argv[6]))
    else:
        sys.exit("SYNTAX: %s <from_addr> <to_addr> <subject> <body> <password> [encoding]" % sys.argv[0])
**blatantly stolen/adapted from:
http://mtokyo.blog9.fc2.com/blog-entry-127.html

First of all you should be using:
Encoding.GetEncoding("ISO-2022-JP")
to convert your subject line into bytes that will be processed by Convert.ToBase64String().
=?ISO-2022-JP?B?TEXTTEXT...?= tells the receiving mail client which encoding was used on the sender's side to convert japanese "letters" into a byte stream.
Currently you're using UTF-16 to encode, but specifying ISO-2022-JP to decode. These are obviously two different encodings, I guess, just like ISO-8859-1 is different from Unicode (most extended western-europe chars are represented by one byte in ISO-XXX, but two bytes in Unicode).
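The mismatch described above can be reproduced in a few lines of Python. This is only an illustrative sketch: the sample subject and the helper names `b_word` and `decode` are assumptions for the example. Python's `utf-16-le` codec stands in for .NET's `Encoding.Unicode`, which is UTF-16 little-endian.

```python
import base64
from email.header import decode_header

subject = "日本語のテスト"


def b_word(raw):
    # Wrap raw bytes in an encoded-word that CLAIMS to be ISO-2022-JP.
    return "=?ISO-2022-JP?B?" + base64.b64encode(raw).decode("ascii") + "?="


def decode(word):
    # Decode the way a mail client would: trust the declared charset.
    raw, charset = decode_header(word)[0]
    return raw.decode(charset, errors="replace")


wrong = b_word(subject.encode("utf-16-le"))   # bytes do not match the label
right = b_word(subject.encode("iso2022_jp"))  # bytes match the label

print(decode(wrong))  # garbage, because UTF-16 bytes are read as ISO-2022-JP
print(decode(right))  # the original subject
```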
I'm not sure what you mean about UTF-8 being second-class citizen. As long as the receiving mail client understands UTF-8 and is able to convert it to the current japanese locale, everything is fine.

<?php
function sendMail($to, $subject, $body, $from_email,$from_name)
{
$headers = "MIME-Version: 1.0 \n" ;
$headers .= "From: " .
"".mb_encode_mimeheader (mb_convert_encoding($from_name,"ISO-2022-JP","AUTO")) ."" .
"<".$from_email."> \n";
$headers .= "Reply-To: " .
"".mb_encode_mimeheader (mb_convert_encoding($from_name,"ISO-2022-JP","AUTO")) ."" .
"<".$from_email."> \n";
$headers .= "Content-Type: text/plain;charset=ISO-2022-JP \n";
/* Convert body to same encoding as stated
in Content-Type header above */
$body = mb_convert_encoding($body, "ISO-2022-JP","AUTO");
/* Mail, optional parameters. */
$sendmail_params = "-f$from_email";
mb_language("ja");
$subject = mb_convert_encoding($subject, "ISO-2022-JP","AUTO");
$subject = mb_encode_mimeheader($subject);
$result = mail($to, $subject, $body, $headers, $sendmail_params);
return $result;
}

Japanese encoding for e-mail was introduced on JUNET (a UUCP-based nationwide network) in the early '90s.
RFC 1468 was defined at that time.
If you follow RFC 1468 for plain-text mail, there should be no problem.
If you want to handle HTML mail, RFC 1468 is useless except for the header parts.

Here's what I use to send Japanese emails. Subject line looks fine in Outlook 2010, gmail and on iPhone.
Encoding encoding = Encoding.GetEncoding("iso-2022-jp");
byte[] bytes = encoding.GetBytes(subject);
string base64Encoded = Convert.ToBase64String(bytes); // base64, not uuencode
subject = "=?iso-2022-jp?B?" + base64Encoded + "?=";
// not sure this is actually necessary...
mailMessage.SubjectEncoding = Encoding.GetEncoding("iso-2022-jp");

Related

’ in PHP is converting to &rsquo; when using mb_convert_encoding in Outlook subject

I have a mail() function set up in PHP. When sending a test message to my own address, I noticed the subject was converting my ' into â€™.
$subject="Please provide an updated copy of your company's certification";
result: Please provide an updated copy of your companyâ€™s certification.
I followed "Getting â€™ instead of an apostrophe(') in PHP" and added mb_convert_encoding, but now I am getting &rsquo; instead of '.
$subjectBad="Please provide an updated copy of your company's certification";
$subject= mb_convert_encoding($subjectBad, "HTML-ENTITIES", 'UTF-8');
result: Please provide an updated copy of your company&rsquo;s certification.
It comes through fine to my personal email, so is there a way to properly display a ' in Outlooks subject or am I at the whim of whatever their system settings are?
Whatever you used to type the subject did not use a simple apostrophe ' which has a common representation across virtually all single-byte encodings and UTF8, instead it used a "fancy" right single quote ’, which is represented differently between single-byte encodings and UTF-8.
mb_convert_encoding() is converting to an HTML entity because you are literally telling it to, and email headers are not HTML so it's going to display as the literal string &rsquo;. The only character set other than UTF-8 that has "smart quotes" is Microsoft's cp1252, and that is still the wrong answer for email headers.
The simplest answer is: Don't do that. Use a normal apostrophe. Everyone hates dealing with "smart" quotes.
The more complex answer is that email headers MUST be 7-bit safe "ASCII" text, and anything else requires additional handwaving. Ideally you should be using a proper email library that handles this, and the dozens of other annoyances that will malform your emails and impact deliverability.
If you're dead-set on eroding your sanity and using mail() directly, then you're going to want to properly encode your subject line and use an explicitly-defined character set, which you should be doing anyways. Eg:
$subject = 'Please provide an updated copy of your company’s certification';
var_dump(
sprintf('=?UTF-8?Q?%s?=', quoted_printable_encode($subject))
);
Output:
string(82) "=?UTF-8?Q?Please provide an updated copy of your company=E2=80=99s certification?="
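For comparison, here is roughly what "a proper email library" does with the same subject, sketched with Python's standard `email.header` module. This is not part of the original answer. Python picks base64 rather than Q encoding for UTF-8 headers and folds long values itself, so the output differs in form from the PHP example while decoding to the same text.

```python
from email.header import Header, decode_header, make_header

# The subject contains a typographic right single quote (U+2019).
subject = "Please provide an updated copy of your company\u2019s certification"

# Header() encodes the text as RFC 2047 encoded-words and folds long
# values so that each encoded-word stays within the length limit.
encoded = Header(subject, "utf-8").encode()
print(encoded)

# Round trip: decode the header the way a receiving client would.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

The encoded value is pure ASCII, which is what makes it safe to place in a header.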

Decoding Windows-1252 characters in imap subject line to UTF-8

I have a website that will allow people to post things to it using the subject line of an email in Outlook. Using PHP and imap, I get the subject line of the text and store it in a mysql db. But every once in a while, someone will copy text from a website into the subject line of that email and I will get garbled text. Similar to this:
=?Windows-1252?Q?_Every_day_in_our_offices_we_recycle_cardboard,aluminum?=
=?Windows-1252?Q?=96_won=92t_you_join_us=3F?=
What I've done is try to decode this text so it will appear normal on the page using the following code:
$subject = strip_tags($mailHeader->subject);
$header = imap_mime_header_decode($subject);
$subject = "";
for($i=0;$i<count($header);$i++)
{
$subject .= $header[$i]->text;
}
When finished I get rid of most of the garbled text, but am left behind with replacement characters for an em dash and a curly quote that was in the original subject line text. See the result below:
Every day in our offices we recycle cardboard, aluminum, � won�t you join us?
The charset for the website is set to UTF-8. When I set the website charset to ISO-8859-1, the replacement characters are replaced with the curly quote and em dash, which is great but I want to leave the website's charset at UTF-8.
Any help on how to get rid of the replacement characters without changing the charset to ISO-8859-1 would be great. Thanks.
Code above works except for one small change to the very end:
$subject .= mb_convert_encoding($header[$i]->text, "UTF-8", $header[$i]->charset);
Each of the objects returned by imap_mime_header_decode includes a charset property, which you are ignoring. You would need to convert each one to UTF-8 in your loop, using something like:
$subject .= mb_convert_encoding($header[$i]->text, "UTF-8", $header[$i]->charset);
As an alternative, consider using the mb_decode_mimeheader or iconv_mime_decode_headers functions. Both of these functions do the entire job of decoding a MIME header for you, returning a string in PHP's internal encoding (which is usually UTF-8).
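The same idea (decode each part using the charset it declares, then work in Unicode) looks like this in Python. It is only a sketch for comparison with the PHP loop above: `decode_subject` is a made-up helper name, and the header value is the sample from the question, folded onto two lines.

```python
from email.header import decode_header

raw_subject = (
    "=?Windows-1252?Q?_Every_day_in_our_offices_we_recycle_cardboard,aluminum?=\r\n"
    " =?Windows-1252?Q?=96_won=92t_you_join_us=3F?="
)


def decode_subject(value):
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Use the charset declared in the encoded-word, not a guess.
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(data)
    return "".join(parts)


subject = decode_subject(raw_subject)
print(subject)
```

Bytes 0x96 and 0x92 are a dash and a curly apostrophe in Windows-1252. They only become replacement characters when they are treated as UTF-8 without conversion.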

Russian Language encoded when using imap_fetch from gmail

I'm reading a log file pasted into the body of an email. Some entries are in various different languages, and all language characters seem to display correctly except for Russian.
Here is an example of what the Russian says in the log file:
Ссылка на объект не указывает на экземпляр объекта.
в
From what I have read, I need to specify a decoding or encoding, something along the lines of mb_encoding (UTF-8), but I am a bit lost on how to actually structure it without affecting text that isn't Russian. When echoed out, it gets converted to this:
СÑылка на объект не указывает на ÑкземплÑÑ€ объекта.
в
Here is the code I'm using already. I am a PHP beginner and some of this isn't my code; I have edited it to suit, but I'm not 100% sure what everything is doing:
$mailbox = "xxx@gmail.com";
$mailboxPassword = "xxx";
$mailbox = imap_open("{imap.gmail.com:993/imap/ssl}INBOX",
$mailbox, $mailboxPassword);
mb_internal_encoding("UTF-8");
$subject = mb_decode_mimeheader(str_replace('_', ' ', $subject));
$body = imap_fetchbody($mailbox, $val, 1);
$body = base64_decode($body);
echo $body;
Once I echo out the body, it converts from Russian into that encoding. Any pointers on similar code I can dissect to learn how to fix this?
Please bear in mind that numerous languages are being read from the email. For the most part it's just a few snippets and the rest is basic logging, but what I am worried about is that if I set a new decoding, it will mess up other languages' characters.
Despite its large adoption, email is still tricky to work with. If your IMAP client has a limited set of requirements, your job will be easy. Otherwise, for a truly general-purpose Gmail client, there's no silver bullet and you have to understand how email works: SMTP, MIME and finally IMAP.
Basic MIME knowledge is absolutely needed, and I won't paste the whole wikipedia article, but you should really read it and understand how it works. IMAP is somewhat easier to understand.
Usually, email messages contains either a single text/plain body, or a multipart/alternative body with both a text/plain and a text/html part. But, you know, there are attachments, so you can also likely find a multipart/mixed and it can really contain anything, and if it's binary content you should treat it differently than text. There are two headers (which you can find in the global message or in part inside a multipart envelope) somewhat involved in charset issues: Content-Type and Content-Transfer-Encoding.
From your code, we must assume that you are only interested in base64-encoded textual parts. Once you have decoded them, they are a sequence of bytes representing text in the charset specified by the sender in the Content-Type header, which is non-ASCII here and thus looks like this:
Content-Type: text/plain; charset=ISO-8859-1
Note that the charset may be UTF-8 or really any other you can think of; you have to check this in your program. Your job is transcoding this piece of input into the output charset of your HTML page. If your page does not use a Unicode encoding (like UTF-8), chances are that you won't be able to show the message correctly, and '?' will be printed in place of missing characters. Since you require your application to be used worldwide (not just in Russia), and since it's good practice anyway, you should use UTF-8 in your HTML responses, and thus when you want to echo the message body:
echo mb_convert_encoding(imap_base64($body), "UTF-8", $input_charset);
where $input_charset is the one found in the Content-Type header for the processed part. For the subject line, you should use imap_mime_header_decode(), which returns an array of tuples (binary string, charset) which you have to output in the same manner as above.
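The steps described above (undo the transfer encoding, read the charset from Content-Type, then transcode) can be sketched with Python's standard `email` package. The message below is constructed by hand purely for illustration: the KOI8-R charset and the helper name `body_as_text` are assumptions, not something taken from the question.

```python
import base64
from email import message_from_bytes

russian = "Ссылка на объект не указывает на экземпляр объекта."

# A minimal single-part message whose body is base64-encoded KOI8-R text.
raw = (
    b"Content-Type: text/plain; charset=koi8-r\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + base64.b64encode(russian.encode("koi8-r")) + b"\r\n"
)


def body_as_text(raw_message):
    msg = message_from_bytes(raw_message)
    # decode=True undoes the Content-Transfer-Encoding (base64 here).
    payload = msg.get_payload(decode=True)
    # The charset comes from the Content-Type header of this part.
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")


text = body_as_text(raw)
print(text)
```

Once the body is a Unicode string, it can be written out as UTF-8 regardless of which charset the sender used, so other languages are not affected.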
TL;DR
The bytes in the UTF-8 encoded input text map quite nicely to the output if we assume it's CP-1252 encoded (maybe you didn't copy some non-printable ones). This means that the input is UTF-8, but the browser thinks the page is Windows-1252. Likely this is the default browser behavior for your locale, and you can easily correct it by sending the appropriate header before any other output:
header("Content-Type: text/html; charset=utf-8");
This should be enough to solve this issue, but it will also likely cause problems with non-ASCII characters in string literals and the database (if any). If you want a multilingual application, Unicode is the way, but you have to transcode your database and your PHP files from CP-1252 to UTF-8.

PHP sent email subject - Hotmail/Outlook show £ as Â£

When I send an email containing £ using PHP mail, it appears in Outlook/Hotmail as Â£. In Gmail/Thunderbird it's fine.
Any idea how I can fix this?
The problem is, the client doesn't know what encoding is used to encode the subject. Whatever your application sets in Content-Type header only applies to the body of the email, not the headers.
Usually this affects the following headers:
Subject
From
To
In order to use different encodings your internationalized header lines should be MIME-encoded (as of RFC 2047), using one of the two methods: base64 (B) or modified quoted-printable (Q). The encoded subject usually looks like this:
Subject: =?ISO-8859-1?Q?Pr=FCfung_f=FCr?= Entwerfen von einer MIME kopfzeile
This may look difficult, but there is one very handy helper function in PHP which does all the magic:
iconv_mime_encode() - Composes a MIME header field
Alternatively you may look into discussion under:
quoted_printable_encode() - Convert a 8 bit string to a quoted-printable string
Before using quoted_printable_encode() directly you need to take into account that long lines need to be split at a certain length and spaces need to be replaced with an underscore "_".
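A short Python sketch of both halves of this problem: what a client may display when it has to guess the charset of raw header bytes, and what a MIME-encoded subject looks like instead. The Latin-1 guess is an assumption about the misbehaving client, chosen because it produces a stray "Â" before the pound sign. The sample subject is borrowed from the PHPMailer answer in this thread.

```python
from email.header import Header, decode_header, make_header

subject = "You won £10000000!"

# Raw UTF-8 bytes in a header, misread by a client that assumes Latin-1:
# the two bytes of "£" (0xC2 0xA3) turn into two separate characters.
misread = subject.encode("utf-8").decode("latin-1")
print(misread)

# MIME-encoded form: pure ASCII, with the charset stated explicitly.
encoded = Header(subject, "utf-8").encode()
print(encoded)

# Any RFC 2047 aware client can recover the original text.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```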
Just today I fixed a similar subject encoding issue by using phpmailer instead of php's builtin mail:
$mail = new PHPMailer(true);
$mail->IsSMTP();
$mail->CharSet = "UTF-8"; // use the registered charset name rather than "utf8"
$mail->Subject = $mail->EncodeHeader("You won £10000000!");
....
$retval = $mail->Send();
Usually I use the mb_convert_encoding() function
mb_convert_encoding($string, "UTF-8"); //AUTO DETECT AND CONVERT
mb_convert_encoding($string, "UTF-8", "latin1"); //MANUAL SET - CHANGE latin1 TO CURRENT ENCODING
Try to use UTF-8 encoding in your email.
This worked for me:
<?php $subject = "=?UTF-8?B?" . base64_encode($subject) . "?="; ?>

PHP sent emails have =0A=0A instead of new lines

For some time now I've had the problem of some of my users getting =0A=0A instead of new lines in emails I send to them via PHP. Correspondence via email client works well, but PHP generated emails always look like this with some users (a minority). Googling revealed no decent results, all search results seem to be connected with outlook somehow - and it is unacceptable to think that all outlook users would suffer from this problem. Does anyone know a correct way of handling this and avoiding these new line encoding issues?
Edit: FYI I'm using Zend's Mailer class.
Thanks
Edit 2:
Changing the encoding type did not work. I encoded the headers to base64, and the body to 64, got garbled stuff. Then I tried with base64 headers, and did base64_decode(base64_decode($body)) on the body, and that was fine on the user's "CNR Server but not in the inbox" whatever that means. When I tried mb_convert_encoding to base64, I got the encoded string instead of the body again, so no use.
What else can I try? Zend Mailer only supports Quoted Printable and Base64 header encoding. Not sure what to do to the body for it to match the quoted printable encoding...
The email body has been encoded using quoted-printable - but the mime type declared in the email is text/html (or text/plain or undefined).
How you make the encoding of the body of the email match the mime header is up to you.
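One reading of this answer is that the body was quoted-printable encoded but the message did not declare that in its Content-Transfer-Encoding header, so some clients show the raw "=0A" sequences. The Python sketch below illustrates that reading with two hand-built messages. It is an assumption about the cause, not a diagnosis of the Zend Mailer setup in the question.

```python
from email import message_from_string

# "=0A" is the quoted-printable escape for a line feed.
body = "Hello=0A=0AWorld"

declared = message_from_string(
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n" + body
)
undeclared = message_from_string(
    "Content-Type: text/plain; charset=utf-8\n"
    "\n" + body
)

# With the header present, the escapes are decoded back to newlines.
declared_body = declared.get_payload(decode=True)
# Without it, the reader has no reason to decode and keeps them verbatim.
undeclared_body = undeclared.get_payload(decode=True)

print(declared_body)
print(undeclared_body)
```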
