extract body from raw email with regex

--047d7b33d6decd251504bfe78895
Content-Type: multipart/alternative; boundary=047d7b33d6decd250d04bfe78893

--047d7b33d6decd250d04bfe78893
Content-Type: text/plain; charset=UTF-8

twest
ini sebuah proiduct abru
awdawdawdawdwa
aw
awdawdaw
--047d7b33d6decd250d04bfe78893
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div class=3D"gmail_quote">twest=C2=A0<div><br></div><div>ini sebuah proidu=
ct abru</div><div><br></div><div>awdawdawdawdwa</div><div><br></div><div>aw=
</div><div>awdawdaw</div>
</div><br>
--047d7b33d6decd250d04bfe78893--
How can I get the text/plain and the text/html content of the mail with regex?
Does an email only have one content body, consisting of a text/html and a text/plain part?
Here's a snippet of what I'm currently doing wrong:
$parts = explode('--', $this->rawemail);
$this->headers = imap_rfc822_parse_headers($this->rawemail);
# var_dump($parts);
# Process the parts
foreach ($parts as $part)
{
    # Get Content text/plain
    if (preg_match('/Content-Type: text\/plain;/', $part))
    {
        $body_parts = preg_split('/\n\n/', $part);
        # If Above the newline (Headers)
        if (!empty($body_parts[0]))
        {
            # var_dump($body_parts[0]);
        }
        # If Below the newline (Data)
        if (!empty($body_parts[1]))
        {
            var_dump($body_parts[1]);
        }
    }
    # Get Content text/html
    if (preg_match('/Content-Type: text\/html;/', $part))
    {
        $body_parts = preg_split('/\n\n/', $part);
        # If Above the newline (Headers)
        if (!empty($body_parts[0]))
        {
            # var_dump($body_parts[0]);
        }
        # If Below the newline (Data)
        if (!empty($body_parts[1]))
        {
            var_dump($body_parts[1]);
        }
    }
}

I think you'd be better off going through the email a line at a time, as it's the line breaks that are most critical in how an e-mail is formed.
Your rules would be:
If you get a double line break, then the body is starting; treat it as plain text (as there are no headers to indicate which type it is).
Otherwise, carry on until you get the "boundary=" bit, then record the boundary and hop into a "looking for boundary" mode.
Then, when you find a boundary, hop into "looking for Content-Type or double newline" mode, and look for Content-Type (and note the content type) or a double newline (the header has finished, and the body comes next until the next boundary).
While reading the body of the message, you're back in "looking for boundary" mode to repeat the process.
Something I remember from a long time ago, so the following may not be 100% accurate, but I'll mention it just in case: be careful with messages that have attachments, as you can get two "boundary" markers. One boundary is nested within the other, so if you follow the rules above (i.e. grab the first boundary and stick with it) you should be fine. But test your script with some attachments :)
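The rules above can be sketched as a small line-by-line state machine. This is only a minimal illustration, assuming the blank lines between headers and bodies are intact and that sticking with the first boundary is acceptable; the function name splitByFirstBoundary and the shape of the returned array are hypothetical, not part of any library.

```php
<?php
// Hypothetical helper: split a raw message into parts using the FIRST
// boundary that appears, ignoring any nested boundaries.
function splitByFirstBoundary(string $raw): array
{
    $boundary = null;    // first boundary found (without the leading "--")
    $parts = [];         // finished parts
    $current = null;     // part being read, or null outside any part
    $inHeaders = false;  // true between a boundary line and the blank line
    $lastHeader = null;  // header that a folded (indented) line continues

    foreach (preg_split('/\r\n|\n|\r/', $raw) as $line) {
        if ($boundary === null) {
            // Mode 1: looking for the "boundary=" parameter.
            if (preg_match('/boundary="?([^";\s]+)"?/i', $line, $m)) {
                $boundary = $m[1];
            }
            continue;
        }
        $isOpen = (rtrim($line) === '--' . $boundary);
        $isClose = (rtrim($line) === '--' . $boundary . '--');
        if ($isOpen || $isClose) {
            // Mode 2: a boundary line ends the current part.
            if ($current !== null) {
                $parts[] = $current;
            }
            $current = $isOpen ? ['headers' => [], 'body' => ''] : null;
            $inHeaders = $isOpen;
            $lastHeader = null;
            continue;
        }
        if ($current === null) {
            continue; // preamble or epilogue text outside any part
        }
        if ($inHeaders) {
            // Mode 3: looking for Content-Type etc. or the blank line.
            if ($line === '') {
                $inHeaders = false;
            } elseif (preg_match('/^\s/', $line) && $lastHeader !== null) {
                $current['headers'][$lastHeader] .= ' ' . trim($line);
            } elseif (strpos($line, ':') !== false) {
                [$name, $value] = explode(':', $line, 2);
                $lastHeader = strtolower(trim($name));
                $current['headers'][$lastHeader] = trim($value);
            }
        } else {
            $current['body'] .= $line . "\n";
        }
    }
    if ($current !== null) {
        $parts[] = $current; // message ended without a closing boundary
    }
    return $parts;
}

$raw = "Content-Type: multipart/alternative; boundary=XYZ\n\n"
     . "--XYZ\nContent-Type: text/plain; charset=UTF-8\n\nhello\n"
     . "--XYZ\nContent-Type: text/html; charset=UTF-8\n\n<p>hello</p>\n"
     . "--XYZ--\n";
foreach (splitByFirstBoundary($raw) as $part) {
    echo $part['headers']['content-type'], "\n";
}
```

Each returned part still carries its body in transfer-encoded form, and a nested multipart part is returned as one undivided body, which matches the "stick with the first boundary" advice.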
Edit: additional info as asked in the question. An e-mail can have as many "bodies" as the sender wishes to encode. You can have a plain text version, an HTML version, a UTF encoded version, an RTF version or even a Morse Code version (if the client knew how to handle "Content-Type: Morse/Code"!). Sometimes you don't get plain text, but only an HTML version (naughty users). Sometimes the HTML actually comes without the content type declaration (which may or may not get displayed as HTML, depending on the client). The boundary also splits off the attachments. Rich text is a gotcha from Outlook (although, to be fair, it usually IS converted to HTML). So no, there are somewhere between 0 and X bodies.

Related

Reliably Clean Email Message Body Encoding

I am writing a small piece of software in PHP which connects to an IMAP mailbox and stores the messages contained therein in a MySQL DB for later processing and other goodness.
I have noticed during testing that some strange characters appear in the message body when I attempt to save the message body raw. I am using imap_fetchbody() to extract the message body.
I noticed that using quoted_printable_decode() to clean up the message body helps! However, after a lot of research I have also learned that this will not always help, and that other functions such as utf8_encode() and base64_decode() should sometimes be used instead to clean up the message body.
So, my question is: what is the best method for reliably cleaning an email message body with PHP to cover all encoding scenarios?

An "email body" is nowadays actually a tree of individual MIME parts. Sometimes there's just one of them, e.g. a text/plain mail. Sometimes there's a multipart/alternative which wraps inside it two "equivalent" copies of the message, one as text/plain and other as text/html. Sometimes the structure is much more complicated, with many levels of nesting. It is quite common that some of these parts are actually binary content, like images, attached ZIP files and what not.
Each of these individual MIME parts can be encoded for transport; these are specified in the Content-Transfer-Encoding header of the corresponding MIME part. The two encoding schemes which you absolutely must support to interoperate are quoted-printable and base64. An important observation is that this encoding happens separately for each part, i.e. it's perfectly legal to have a multipart/alternative with a text/plain encoded with quoted-printable and another part, text/html encoded in base64.
When you have decoded this transfer encoding, you still have to decode the text from its character encoding to Unicode, i.e. to turn the stream of bytes into Unicode text. You need to consult the charset parameter of the Content-Type MIME header (again, the part header, not the whole-message header, unless the message itself has only one part).
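The two decoding steps described above can be sketched as follows. This is a minimal illustration, assuming the caller has already read the part's Content-Transfer-Encoding value and charset parameter from the part headers; decodePartBody is a hypothetical name, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: undo the transfer encoding of ONE MIME part,
// then convert the resulting bytes from the part's charset to UTF-8.
function decodePartBody(string $body, string $transferEncoding, string $charset): string
{
    switch (strtolower(trim($transferEncoding))) {
        case 'base64':
            $bytes = base64_decode($body);
            if ($bytes === false) {
                $bytes = ''; // undecodable input; a real client might log this
            }
            break;
        case 'quoted-printable':
            $bytes = quoted_printable_decode($body);
            break;
        default:
            // 7bit, 8bit and binary need no transfer decoding.
            $bytes = $body;
    }
    // Assumption: $charset is a name mbstring knows; unknown names make
    // mb_convert_encoding() fail, so real code should validate it first.
    return mb_convert_encoding($bytes, 'UTF-8', $charset);
}

// The same text arriving in two different encodings, as the answer allows.
echo decodePartBody("caf=E9", 'quoted-printable', 'ISO-8859-1'), "\n";
echo decodePartBody("Y2Fm6Q==", 'base64', 'ISO-8859-1'), "\n";
```

The function is called once per part, because each part may declare its own transfer encoding and its own charset.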
All details you need to know are in RFC 2045, RFC 2046, RFC 2047 and RFC 2048 (and their corresponding updates).
Finally, there's also the interesting question of what the "main part" of an e-mail is. Suppose you have something like this:
1 multipart/mixed
  + 1.1 text/plain: "Hi, I'm forwarding Jeff's message..."
  + 1.2 message/rfc822
    + 1.2.1 multipart/alternative
      + 1.2.1.1 text/plain "Hi colleagues, I'm sending the meeting notes from..."
      + 1.2.1.2 text/html "<p>Hi colleagues,..."
This is what happens when Fred forwards Jeff's message to you. What is the "main part" here?

Russian Language encoded when using imap_fetch from gmail

I'm reading a log file pasted into the body of an email. The logs are in various languages, and all language characters seem to display correctly except for Russian.
Here is an example of what the Russian says in the log file:
Ссылка на объект не указывает на экземпляр объекта.
в
From what I have read, I need to specify a decoding or encoding, something along the lines of mb_encoding (UTF-8), but I am a bit lost on how to actually structure it without affecting text that isn't Russian. When echoed out, the text gets converted to this:
Ð¡ÑÑ‹Ð»ÐºÐ° Ð½Ð° Ð¾Ð±ÑŠÐµÐºÑ‚ Ð½Ðµ ÑƒÐºÐ°Ð·Ñ‹Ð²Ð°ÐµÑ‚ Ð½Ð° ÑÐºÐ·ÐµÐ¼Ð¿Ð»ÑÑ€ Ð¾Ð±ÑŠÐµÐºÑ‚Ð°.
Ð²
Here is the code I'm using already. I am a PHP beginner and some of this isn't my code; I have edited it to suit, but I'm not 100% sure what everything is doing:
$mailbox = "xxx@gmail.com";
$mailboxPassword = "xxx";
$mailbox = imap_open("{imap.gmail.com:993/imap/ssl}INBOX",
    $mailbox, $mailboxPassword);
mb_internal_encoding("UTF-8");
$subject = mb_decode_mimeheader(str_replace('_', ' ', $subject));
$body = imap_fetchbody($mailbox, $val, 1);
$body = base64_decode($body);
echo $body;
Once I echo out the body, it is converted from Russian into that garbled text. Any pointers to similar code I can dissect to learn how to fix this?
Please bear in mind that numerous languages are being read from the email. For the most part it's just a few snippets and the rest is basic logging, but what I am worried about is that if I set a new decoding it will mess up the characters of other languages.

Despite its wide adoption, email is still tricky to work with. If your IMAP client has a limited set of requirements, your job will be easy. Otherwise, for a truly general-purpose Gmail client, there's no silver bullet and you have to understand how email works: SMTP, MIME and finally IMAP.
Basic MIME knowledge is absolutely needed, and I won't paste the whole Wikipedia article, but you should really read it and understand how it works. IMAP is somewhat easier to understand.
Usually, email messages contain either a single text/plain body, or a multipart/alternative body with both a text/plain and a text/html part. But, you know, there are attachments, so you are also likely to find a multipart/mixed, which can really contain anything, and if it's binary content you should treat it differently from text. There are two headers (which you can find in the global message or in a part inside a multipart envelope) involved in charset issues: Content-Type and Content-Transfer-Encoding.
From your code, we must assume that you are only interested in base64-encoded textual parts. Once you have decoded them, they are a sequence of bytes representing text in the charset specified by the sender in the Content-Type header, which is non-ASCII here and thus looks like this:
Content-Type: text/plain; charset=ISO-8859-1
Note that the charset may be utf8 or really any other you can think of; you have to check this in your program. Your job is to transcode this piece of input into the output charset of your HTML page. If your page does not use a Unicode encoding (like UTF-8), chances are that you won't even be able to show the message correctly, and '?' will be printed in place of the missing characters. Since you require your application to be used worldwide (not just in Russia), and since it's good practice anyway, you should use UTF-8 in your HTML responses, and thus when you want to echo the message body:
echo mb_convert_encoding(imap_base64($body), "UTF-8", $input_charset);
where $input_charset is the one found in the Content-Type header of the part being processed. For the subject line, you should use imap_mime_header_decode(), which returns an array of objects, each with a text and a charset property, which you have to output in the same manner as above.
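A minimal sketch of where $input_charset could come from, assuming the Content-Type header value of the part has already been read into a string. The helper charsetFromContentType is hypothetical, and the US-ASCII fallback is an assumption based on the MIME default for text parts. For the subject line, iconv_mime_decode() is swapped in here for imap_mime_header_decode() only so that the sketch runs without the IMAP extension.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type
// header value, falling back to a default when it is absent.
function charsetFromContentType(string $contentType, string $default = 'US-ASCII'): string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return $m[1];
    }
    return $default;
}

// Body: base64-decode first, then transcode from the declared charset.
$input_charset = charsetFromContentType('text/plain; charset=ISO-8859-1');
$body = base64_encode("caf\xE9"); // stands in for the fetched part body
echo mb_convert_encoding(base64_decode($body), 'UTF-8', $input_charset), "\n";

// Subject: decode the MIME encoded-words straight to UTF-8.
echo iconv_mime_decode('Subject: =?UTF-8?B?0LI=?=', 0, 'UTF-8'), "\n";
```

Because the charset is read per part, text in other languages is not affected by a fix made for the Russian messages.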
TL;DR
The bytes of the UTF-8 encoded input text map quite nicely to the output if we assume they are being read as CP-1252 (maybe you didn't copy some non-printable ones). This means that the input is UTF-8, but the browser thinks the page is Windows-1252. Likely this is the default browser behavior for your locale, and you can easily correct it by sending the appropriate header before any other output:
header("Content-Type: text/html; charset=utf-8");
This should be enough to solve this issue, but it will also likely cause problems with non-ASCII characters in string literals and in the database (if any). If you want a multilingual application, Unicode is the way to go, but you have to transcode your database and your PHP files from CP-1252 to UTF-8.
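The misreading described in the TL;DR can be reproduced in a few lines. This is only a demonstration of the effect, assuming the mbstring extension is available; it takes the last character of the question's sample (the Cyrillic letter "в") and deliberately reads its UTF-8 bytes as Windows-1252.

```php
<?php
$utf8 = "\xD0\xB2"; // the Cyrillic letter "в" as UTF-8 bytes

// Pretend the bytes are Windows-1252 and re-encode them as UTF-8:
// this is what a browser effectively shows when it guesses the wrong charset.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // shows the two-character garbage from the question

// Reversing the wrong step recovers the original bytes, which confirms
// that the data itself was valid UTF-8 all along.
$restored = mb_convert_encoding($misread, 'Windows-1252', 'UTF-8');
var_dump($restored === $utf8);
```

Since the bytes survive the round trip, the fix belongs in the declared page charset rather than in the decoding of the message.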

PHP sent emails have =0A=0A instead of new lines

For some time now I've had the problem of some of my users getting =0A=0A instead of new lines in emails I send to them via PHP. Correspondence via an email client works well, but PHP-generated emails always look like this for some users (a minority). Googling revealed no decent results; all search results seem to be connected with Outlook somehow, and it is unacceptable to think that all Outlook users would suffer from this problem. Does anyone know a correct way of handling this and avoiding these new line encoding issues?
Edit: FYI I'm using Zend's Mailer class.
Thanks
Edit 2:
Changing the encoding type did not work. I encoded the headers to base64, and the body to base64, and got garbled stuff. Then I tried with base64 headers, and did base64_decode(base64_decode($body)) on the body, and that was fine on the user's "CNR Server but not in the inbox", whatever that means. When I tried mb_convert_encoding to base64, I got the encoded string instead of the body again, so that was no use.
What else can I try? Zend Mailer only supports Quoted Printable and Base64 header encoding. I'm not sure what to do to the body for it to match the quoted-printable encoding...

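The email body has been encoded using quoted-printable, but the headers of the email do not say so: the MIME type is declared as text/html (or text/plain, or left undefined) without a matching Content-Transfer-Encoding, so the client shows the encoded text as it is.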
How you make the encoding of the body of the email match the mime header is up to you.
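A minimal sketch of making the body and the headers agree, using only core PHP functions rather than Zend's Mailer (whose API is not shown in the question). It assumes the =0A sequences come from bare LF line endings being quoted-printable encoded; the variable names are illustrative.

```php
<?php
$text = "Hello,\n\nThis is line two.";

// A bare LF is a control character, so quoted-printable encodes it as =0A.
$encodedLf = quoted_printable_encode($text);
echo $encodedLf, "\n";

// Normalising to CRLF first lets the encoder keep real line breaks.
$crlfText = preg_replace('/\r\n|\r|\n/', "\r\n", $text);
$body = quoted_printable_encode($crlfText);

// Whatever encoding is applied to the body, declare the same one here,
// otherwise the recipient's client will not decode it.
$headers = [
    'MIME-Version: 1.0',
    'Content-Type: text/plain; charset=UTF-8',
    'Content-Transfer-Encoding: quoted-printable',
];

// A client that honours the header gets the original text back.
var_dump(quoted_printable_decode($encodedLf) === $text);
```

The $headers and $body values would then be handed to whatever mail function or mailer class is in use.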

Content-Transfer-Encoding in file uploading request

I'm trying to upload a file using XMLHttpRequest, sending these headers:
Content-Type:multipart/form-data, boundary=xxxxxxxxx
--xxxxxxxxx
Content-Disposition: form-data; name='uploadfile'; filename='123_logo.jpg'
Content-Transfer-Encoding: base64
Content-Type: image/jpeg
/*base64data*/
But on the server side, PHP ignores the "Content-Transfer-Encoding: base64" header and writes the undecoded base64 data directly into the file!
Is there any way to fix it?
P.S. It is very important to send the data using base64.

Xavier's answer doesn't sound right. RFC2616 also has this to say (section 3.7):
In general, HTTP treats a multipart message-body no differently than any other media type: strictly as payload. The one exception is the "multipart/byteranges"
It seems to me that section 19.4 of RFC2616 is talking about HTTP as a whole, in the sense that it uses a syntax similar to MIME (like headers format), but is not MIME-compliant.
Also, there is RFC2388. In section 3, last paragraph, it says:
Each part may be encoded and the "content-transfer-encoding" header supplied if the value of that part does not conform to the default encoding.
Section 4.3 elaborates on this:
4.3 Encoding
While the HTTP protocol can transport arbitrary binary data, the
default for mail transport is the 7BIT encoding. The value supplied
for a part may need to be encoded and the "content-transfer-encoding"
header supplied if the value does not conform to the default
encoding. [See section 5 of RFC 2046 for more details.]

My previous answer was wrong.
Content-Transfer-Encoding may appear in a composite body:
http://www.ietf.org/rfc/rfc2616.txt
There are several consequences of this. The entity-body for composite
types MAY contain many body-parts, each with its own MIME and HTTP
headers (including Content-MD5, Content-Transfer-Encoding, and
Content-Encoding headers).
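Since the question reports that PHP stores the part undecoded, one possible workaround is to decode the uploaded file on the server after it has been received. The sketch below is an assumption-laden illustration, not something the RFCs above prescribe: decodeBase64Upload is a hypothetical helper, and it assumes the whole temporary file contains nothing but base64 text.

```php
<?php
// Hypothetical helper: replace the base64 text stored in $path with the
// bytes it represents. Returns false if the file is not valid base64.
function decodeBase64Upload(string $path): bool
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        return false;
    }
    // Remove line breaks and spaces, then decode in strict mode so that
    // files that were never base64 are rejected instead of corrupted.
    $decoded = base64_decode(preg_replace('/\s+/', '', $raw), true);
    if ($decoded === false) {
        return false;
    }
    return file_put_contents($path, $decoded) !== false;
}

// Intended use inside an upload handler (field name taken from the question):
// decodeBase64Upload($_FILES['uploadfile']['tmp_name']);
```

Calling it before move_uploaded_file() keeps the decoding step in one place, at the cost of reading the whole file into memory.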

Proper PHP way to parse email attachments from EML format

I have a file containing an email in "plain text MIME message format". I am not sure if this is the EML format. The email contains attachments, and I want to extract them and create those files again. This is what the attachment part looks like:
...
...
Receive, deliver details
...
...
From: sac ascsac <sacsac@sacascsac.ascsac>
Date: Thu, 20 Jan 2011 18:05:16 +0530
Message-ID: <AANLkTimmSL0iGW4rA3tvSJ9M3eT5yZLTGsqvCvf2fFC3@mail.gmail.com>
Subject: Test attachments
To: ascsacsa@ascsac.com
Content-Type: multipart/mixed; boundary=20cf3054ac85d97721049a465e12

--20cf3054ac85d97721049a465e12
Content-Type: multipart/alternative; boundary=20cf3054ac85d97717049a465e10

--20cf3054ac85d97717049a465e10
Content-Type: text/plain; charset=ISO-8859-1

hello this is a test mail. It contains two attachments
--20cf3054ac85d97717049a465e10
Content-Type: text/html; charset=ISO-8859-1

hello this is a test mail. It contains two attachments<br>
--20cf3054ac85d97717049a465e10--
--20cf3054ac85d97721049a465e12
Content-Type: text/plain; charset=US-ASCII; name="simple_test.txt"
Content-Disposition: attachment; filename="simple_test.txt"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_gj5n2yx60

aGVsbG8gd29ybGQKYWMgYXNj
...
encoded things here
...
ZyBmZyAKCjIKNDIzCnQ2Mwo=
--20cf3054ac85d97721049a465e12
Content-Type: application/x-httpd-php; name="oscomm_backup_code.php"
Content-Disposition: attachment; filename="oscomm_backup_code.php"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_gj5n5gxn1

PD9waHAKCg ...
...
encoded things here
...
X2xpbmsoRklMRU5BTUVfQkFDS1VQKSk7Cgo/Pgo=
--20cf3054ac85d97721049a465e12--
I can see that the part between X-Attachment-Id: f_gj5n2yx60 and ZyBmZyAKCjIKNDIzCnQ2Mwo= (both inclusive) is the content of the first attachment. I want to parse those attachments (file names and contents) and create those files.
I got this file after parsing a dbx format file using a DBX Parser class available in PHP classes.
I searched in many places and did not find much discussion of this here on SO other than "Script to parse emails for attachments". Maybe I missed some terms while searching. In that answer it is mentioned:
you can use the boundaries to extract the base64 encoded information
But I am not sure which are the boundaries and how exactly to use them. There must already be some libraries or some well-defined method of doing this. I guess I will make many mistakes if I try reinventing the wheel here.

There's a PHP Mailparse extension; have you tried it?
The manual way would be to process the mail line by line. When you hit your first Content-Type header (this one in your example):
Content-Type: multipart/mixed; boundary=20cf3054ac85d97721049a465e12
you have the boundary. This string is used as the boundary between your multiple parts (that's why it's called multipart).
Every time a line starts with two dashes followed by this string, a new part begins. In your example:
--20cf3054ac85d97721049a465e12
Every part starts with headers, then a blank line, then content. By looking at the headers of each part (Content-Type and Content-Disposition) you can determine which parts are attachments, what their type is and their filename.
Read the whole content, strip the whitespace, base64_decode it, and you've got the binary contents of the file. Does this help?
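The last two steps can be sketched for a single part. This is a minimal illustration, assuming the part has already been cut out between two boundary lines and that its headers are separated from the content by a blank line; extractAttachment is a hypothetical helper and it does not handle folded headers or encoded file names. The Mailparse extension mentioned above is likely the more robust route.

```php
<?php
// Hypothetical helper: given one MIME part (headers, blank line, content),
// return its file name and decoded bytes, or null if it is no attachment.
function extractAttachment(string $part): ?array
{
    // Headers end at the first blank line.
    $split = preg_split('/\r?\n\r?\n/', $part, 2);
    if (count($split) < 2) {
        return null;
    }
    [$headers, $content] = $split;

    $pattern = '/^Content-Disposition:\s*attachment;.*?filename="?([^";\r\n]+)"?/im';
    if (!preg_match($pattern, $headers, $m)) {
        return null;
    }
    // basename() guards against path components in the declared name.
    $filename = basename(trim($m[1]));

    if (preg_match('/^Content-Transfer-Encoding:\s*base64/im', $headers)) {
        // Strip all whitespace (line breaks included) before decoding.
        $content = base64_decode(preg_replace('/\s+/', '', $content));
    }
    return ['filename' => $filename, 'data' => $content];
}

$part = "Content-Type: text/plain; charset=US-ASCII; name=\"simple_test.txt\"\n"
      . "Content-Disposition: attachment; filename=\"simple_test.txt\"\n"
      . "Content-Transfer-Encoding: base64\n"
      . "\n"
      . chunk_split(base64_encode("hello world\n"), 8, "\n");
$attachment = extractAttachment($part);
echo $attachment['filename'], ': ', strlen($attachment['data']), " bytes\n";
// file_put_contents($attachment['filename'], $attachment['data']); // recreate the file
```

Returning the bytes instead of writing them directly leaves it to the caller to decide where the files are recreated.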
