Proper PHP way to parse email attachments from EML format

I have a file containing an email in "plain text MIME message format". I am not sure if this is the EML format. The email contains attachments, and I want to extract them and recreate the original files. This is what the attachment part looks like:
...
...
Receive, deliver details
...
...
From: sac ascsac <sacsac@sacascsac.ascsac>
Date: Thu, 20 Jan 2011 18:05:16 +0530
Message-ID: <AANLkTimmSL0iGW4rA3tvSJ9M3eT5yZLTGsqvCvf2fFC3@mail.gmail.com>
Subject: Test attachments
To: ascsacsa@ascsac.com
Content-Type: multipart/mixed; boundary=20cf3054ac85d97721049a465e12
--20cf3054ac85d97721049a465e12
Content-Type: multipart/alternative; boundary=20cf3054ac85d97717049a465e10
--20cf3054ac85d97717049a465e10
Content-Type: text/plain; charset=ISO-8859-1
hello this is a test mail. It contains two attachments
--20cf3054ac85d97717049a465e10
Content-Type: text/html; charset=ISO-8859-1
hello this is a test mail. It contains two attachments<br>
--20cf3054ac85d97717049a465e10--
--20cf3054ac85d97721049a465e12
Content-Type: text/plain; charset=US-ASCII; name="simple_test.txt"
Content-Disposition: attachment; filename="simple_test.txt"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_gj5n2yx60
aGVsbG8gd29ybGQKYWMgYXNj
...
encoded things here
...
ZyBmZyAKCjIKNDIzCnQ2Mwo=
--20cf3054ac85d97721049a465e12
Content-Type: application/x-httpd-php; name="oscomm_backup_code.php"
Content-Disposition: attachment; filename="oscomm_backup_code.php"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_gj5n5gxn1
PD9waHAKCg ...
...
encoded things here
...
X2xpbmsoRklMRU5BTUVfQkFDS1VQKSk7Cgo/Pgo=
--20cf3054ac85d97721049a465e12--
I can see that the part after the X-Attachment-Id: f_gj5n2yx60 header, up to and including the line ZyBmZyAKCjIKNDIzCnQ2Mwo=, is the content of the first attachment. I want to parse those attachments: extract the file names and contents, and create those files.
I got this file after parsing a DBX format file with a DBX Parser class available on PHP Classes.
I searched in many places and did not find much discussion about this on SO other than "Script to parse emails for attachments". Maybe I missed some terms while searching. That answer mentions:
you can use the boundaries to extract
the base64 encoded information
But I am not sure which strings are the boundaries, or how exactly to use them. There must already be some library or well-defined method for doing this. I guess I will make many mistakes if I try reinventing the wheel here.

There's a PHP Mailparse extension; have you tried it?
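For reference, a minimal sketch of what using that extension might look like. It assumes the PECL mailparse extension is installed and loaded; the function name listAttachmentsWithMailparse is made up for this illustration, and the array keys are the ones mailparse_msg_get_part_data() is documented to return.

```php
<?php
// Hypothetical helper: returns a list of ['filename' => ..., 'content' => ...]
// for every part that carries a Content-Disposition filename.
// Assumes the PECL mailparse extension is loaded.
function listAttachmentsWithMailparse(string $rawMessage): array
{
    $mime = mailparse_msg_create();
    mailparse_msg_parse($mime, $rawMessage);

    $attachments = [];
    // The structure is a flat list of part ids such as "1", "1.1", "1.2".
    foreach (mailparse_msg_get_structure($mime) as $partId) {
        $part = mailparse_msg_get_part($mime, $partId);
        $info = mailparse_msg_get_part_data($part);

        if (isset($info['disposition-filename'])) {
            $attachments[] = [
                // basename() guards against path tricks in the file name.
                'filename' => basename($info['disposition-filename']),
                // A null callback makes the function return the decoded body.
                'content'  => mailparse_msg_extract_part($part, $rawMessage, null),
            ];
        }
    }

    mailparse_msg_free($mime);
    return $attachments;
}
```

Each entry could then be written out with file_put_contents(), using a target directory of your choice.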
The manual way is to process the mail line by line. When you hit your first Content-Type header (this one in your example):
Content-Type: multipart/mixed; boundary=20cf3054ac85d97721049a465e12
you have the boundary. This string is used as the separator between the multiple parts (that's why it is called multipart).
Every time a line consists of two dashes followed by this string, a new part begins. The same line with two extra dashes appended marks the end of the last part. In your example:
--20cf3054ac85d97721049a465e12
Every part consists of headers, a blank line, and content. By looking at the Content-Type and Content-Disposition headers of each part you can determine which parts are attachments, what their type is, and what their filename is.
Read the whole content, strip the whitespace (including line breaks), pass it to base64_decode(), and you've got the binary contents of the file. Does this help?
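The manual steps above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the whole message fits in memory, file names are plain quoted strings (no RFC 2231 encoding), and the helper names splitHeadersAndBody, splitParts and extractAttachments are invented here.

```php
<?php
// Split one MIME entity into [lower-cased headers, body].
function splitHeadersAndBody(string $entity): array
{
    $entity = str_replace("\r\n", "\n", $entity);
    if ($entity !== '' && $entity[0] === "\n") {
        return [[], substr($entity, 1)];           // part without headers
    }
    $parts = explode("\n\n", $entity, 2);
    // Unfold header lines that continue on the next line.
    $rawHeaders = preg_replace('/\n[ \t]+/', ' ', $parts[0]);
    $headers = [];
    foreach (explode("\n", $rawHeaders) as $line) {
        if (strpos($line, ':') !== false) {
            [$name, $value] = explode(':', $line, 2);
            $headers[strtolower(trim($name))] = trim($value);
        }
    }
    return [$headers, $parts[1] ?? ''];
}

// Walk the body line by line and cut it at "--boundary" lines.
function splitParts(string $body, string $boundary): array
{
    $parts = [];
    $current = null;                               // null = before the first part
    foreach (explode("\n", $body) as $line) {
        $trimmed = rtrim($line);
        if ($trimmed === "--$boundary" || $trimmed === "--$boundary--") {
            if ($current !== null) {
                $parts[] = implode("\n", $current);
            }
            $current = [];
            if ($trimmed === "--$boundary--") {    // closing delimiter
                $current = null;
                break;
            }
            continue;
        }
        if ($current !== null) {
            $current[] = $line;
        }
    }
    if ($current !== null) {                       // tolerate a missing closing line
        $parts[] = implode("\n", $current);
    }
    return $parts;
}

// Recursively collect every part that has a Content-Disposition filename.
function extractAttachments(string $entity): array
{
    [$headers, $body] = splitHeadersAndBody($entity);
    $type = $headers['content-type'] ?? 'text/plain';

    if (stripos($type, 'multipart/') === 0
        && preg_match('/boundary="?([^";]+)"?/i', $type, $m)) {
        $found = [];
        foreach (splitParts($body, $m[1]) as $part) {
            $found = array_merge($found, extractAttachments($part));
        }
        return $found;
    }

    $disposition = $headers['content-disposition'] ?? '';
    if (!preg_match('/filename="?([^";]+)"?/i', $disposition, $m)) {
        return [];                                 // not an attachment
    }
    $encoding = strtolower($headers['content-transfer-encoding'] ?? '7bit');
    if ($encoding === 'base64') {
        $body = base64_decode(preg_replace('/\s+/', '', $body));
    } elseif ($encoding === 'quoted-printable') {
        $body = quoted_printable_decode($body);
    }
    return [['filename' => basename($m[1]), 'content' => $body]];
}
```

Calling extractAttachments(file_get_contents($path)) on the saved message should give one entry per attachment, which can then be written with file_put_contents(). Because the function recurses, the nested multipart/alternative block in the example is handled without special casing.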

Related

extract body from raw email with regex

--047d7b33d6decd251504bfe78895
Content-Type: multipart/alternative; boundary=047d7b33d6decd250d04bfe78893
--047d7b33d6decd250d04bfe78893
Content-Type: text/plain; charset=UTF-8
twest
ini sebuah proiduct abru
awdawdawdawdwa
aw
awdawdaw
--047d7b33d6decd250d04bfe78893
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
<div class=3D"gmail_quote">twest=C2=A0<div><br></div><div>ini sebuah proidu=
ct abru</div><div><br></div><div>awdawdawdawdwa</div><div><br></div><div>aw=
</div><div>awdawdaw</div>
</div><br>
--047d7b33d6decd250d04bfe78893--
How can I get the text/plain and the text/html content of the mail with a regex?
Does an email have only one content body, consisting of a text/html and a text/plain part?
Here's a snippet of what I'm currently doing, which doesn't work:
$parts = explode('--', $this->rawemail);
$this->headers = imap_rfc822_parse_headers($this->rawemail);
# var_dump($parts);
# Process the parts
foreach ($parts as $part)
{
    # Get Content text/plain
    if (preg_match('/Content-Type: text\/plain;/', $part))
    {
        $body_parts = preg_split('/\n\n/', $part);
        # If Above the newline (Headers)
        if ($body_parts[0])
        {
            # var_dump($body_parts[0]);
        }
        # If Below the newline (Data)
        if ($body_parts[1])
        {
            var_dump($body_parts[1]);
        }
    }
    # Get Content text/html
    if (preg_match('/Content-Type: text\/html;/', $part))
    {
        $body_parts = preg_split('/\n\n/', $part);
        # If Above the newline (Headers)
        if ($body_parts[0])
        {
            # var_dump($body_parts[0]);
        }
        # If Below the newline (Data)
        if ($body_parts[1])
        {
            var_dump($body_parts[1]);
        }
    }
}

I think you'd be better off going through the email one line at a time, because line breaks are what matter most in the structure of an e-mail.
Your rules would be:
If you get a double line break, the body is starting; treat it as plain text (as there are no headers to indicate otherwise).
Otherwise, carry on until you get the "boundary=" bit, then record the boundary and switch to "looking for boundary" mode.
Then, when you find a boundary, switch to "looking for Content-Type or double newline" mode: look for Content-Type (and note its value) or a double newline (the headers have finished and the body follows, up to the next boundary).
While reading the body of the message, you're back in "looking for boundary" mode, and the process repeats.
Something I remember from a long time ago, so the following may not be 100% accurate, but I'll mention it just in case: be careful with messages that have attachments, as you can get two "boundary" markers. One boundary is nested within the other, so if you follow the rules above (i.e. grab the first boundary and stick with it) you should be fine. But test your script with some attachments :)
Edit: additional info as asked in the question. An e-mail can have as many "bodies" as the user wishes to encode. You can have a plain-text version, an HTML version, a UTF-encoded version, an RTF version or even a Morse code version (if the client knew how to handle "Content-Type: Morse/Code"!). Sometimes you don't get plain text, only an HTML version (naughty users). Sometimes the HTML comes without a content type declaration (and may or may not be displayed as HTML, depending on the client). The boundary also separates the attachments. Rich text is a gotcha from Outlook (although, to be fair, it usually IS converted to HTML). So no, there can be anywhere between 0 and X bodies.
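The line-by-line modes described above could look something like this. It is a rough sketch, not a complete MIME parser: collectBodies is a made-up name, the sketch remembers every boundary it sees (rather than only the first) so that nested multipart blocks are split too, and it keeps only the first body found for each content type.

```php
<?php
// Hypothetical helper: returns [content type => decoded body].
function collectBodies(string $raw): array
{
    $boundaries = [];        // every boundary string announced so far
    $bodies = [];            // content type => body (first one wins)
    $state = 'headers';      // 'headers', 'body' or 'skip'
    $type = 'text/plain';    // default when no Content-Type is given
    $encoding = '7bit';
    $buffer = [];

    $lines = preg_split('/\r\n|\n|\r/', $raw);
    $lines[] = null;         // sentinel so the last body gets flushed

    foreach ($lines as $line) {
        // Is this line a delimiter ("open"), a closing delimiter or the end?
        $delimiter = $line === null ? 'close' : null;
        foreach ($boundaries as $b) {
            if ($line !== null && rtrim($line) === "--$b") {
                $delimiter = 'open';
            } elseif ($line !== null && rtrim($line) === "--$b--") {
                $delimiter = 'close';
            }
        }

        if ($delimiter !== null) {
            // Store the body we were reading, unless it was a multipart container.
            if ($state === 'body' && strpos($type, 'multipart/') !== 0
                && !isset($bodies[$type])) {
                $text = implode("\n", $buffer);
                if ($encoding === 'quoted-printable') {
                    $text = quoted_printable_decode($text);
                } elseif ($encoding === 'base64') {
                    $text = base64_decode($text);
                }
                $bodies[$type] = $text;
            }
            // After a closing delimiter, ignore everything until the next boundary.
            $state = $delimiter === 'open' ? 'headers' : 'skip';
            $type = 'text/plain';
            $encoding = '7bit';
            $buffer = [];
            continue;
        }

        if ($state === 'headers') {
            if ($line === '') {                    // blank line: body starts
                $state = 'body';
                continue;
            }
            if (preg_match('/^Content-Type:\s*([^;\s]+)/i', $line, $m)) {
                $type = strtolower($m[1]);
            }
            if (preg_match('/^Content-Transfer-Encoding:\s*(\S+)/i', $line, $m)) {
                $encoding = strtolower($m[1]);
            }
            // The boundary may sit on the Content-Type line or on a folded line.
            if (preg_match('/^(?:Content-Type:|[ \t]).*?boundary="?([^";]+)"?/i', $line, $m)) {
                $boundaries[] = $m[1];
            }
        } elseif ($state === 'body') {
            $buffer[] = $line;
        }
    }
    return $bodies;
}
```

With a message like the one in the question, $bodies['text/plain'] and $bodies['text/html'] would hold the two versions of the text. Attachments end up in the same array under their own content type, so a real script would also want to look at Content-Disposition.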

How can I send email with UTF-8 encoding?

I have inherited a script that sends some content out in three languages (all in the same message, repeated). However, when it is received the characters are broken, which I assume is a UTF-8 issue.
Am I right that all I need to do is change the charset part to UTF-8, or does anything else need to change, like the 7bit part?
You can see where I inserted one UTF-8 reference (not tested yet).
There was something here, http://bitprison.net/php_mail_utf-8_subject_and_message, which seems to mention base64 encoding, but I'm not sure if I need that here.
// Construct message body.
$body = "";
// Add message for non-mime clients.
$body .= "This is a multi-part message in MIME format.\n";
// Add text body.
$body .= "\n--$boundary\nContent-Type: text/plain; charset=UTF-8; format=flowed\nContent-Transfer-Encoding: 7bit\n\n" . $textContent;
// Add HTML body.
$body .= "\n--$boundary\nContent-Type: text/html; charset=ISO-8859-1; format=flowed\nContent-Transfer-Encoding: 7bit\n\n" . $htmlContent;
mail( $row["email"], "Update Your ArtsDB Listing", $body, $headers );
I looked at another post on here for an example:
$body .= "\n--$boundary\nContent-Type: text/plain; charset=UTF-8; format=flowed\nContent-Transfer-Encoding: 8bit\n\n" . $textContent;
// Add HTML body.
$body .= "\n--$boundary\nContent-Type: text/html; charset=UTF-8; format=flowed\nContent-Transfer-Encoding: 8bit\n\n" . $htmlContent;

You are using Content-Type: text/plain; charset=UTF-8 to tell the mail reader that such message part uses UTF-8, which is fine, but... What does the $textContent variable contain? That's the important bit. According to Content-Transfer-Encoding: 7bit, it's a 7 bit encoding so it can't be raw UTF-8. However, you are not using any of the usual 7-bit encodings used for e-mail. Otherwise, there would be a (e.g.) Content-Transfer-Encoding: quoted-printable header.
To sum up, you need to:
Have a source string that contains valid UTF-8.
Pick an encoding for the transfer and apply it, for example with quoted_printable_encode().
Add a header to tell which transfer encoding you chose.
You could also send the raw UTF-8 as-is and set Content-Transfer-Encoding: 8bit but I would not recommend it. You risk breaking the SMTP standard just by sending very long lines. Also, you have no idea of what kind of legacy programs this will go through.
E-mail is harder than it seems; that's why sooner or later you end up using a third-party library: PHPMailer, Swift Mailer, PEAR Mail...
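As an illustration of the three steps, here is a small sketch that builds the multipart body with both parts declared as UTF-8 and encoded as quoted-printable. The function name buildAlternativeBody is invented for this example, and it assumes $text and $html already contain valid UTF-8.

```php
<?php
// Hypothetical helper: builds a multipart body with a text and an HTML part,
// both quoted-printable encoded so that only 7-bit ASCII goes over the wire.
function buildAlternativeBody(string $text, string $html, string $boundary): string
{
    $eol = "\r\n";
    // Message for non-MIME clients, as in the original script.
    $body = 'This is a multi-part message in MIME format.' . $eol;

    foreach (['text/plain' => $text, 'text/html' => $html] as $type => $content) {
        $body .= '--' . $boundary . $eol
            . "Content-Type: $type; charset=UTF-8" . $eol
            . 'Content-Transfer-Encoding: quoted-printable' . $eol
            . $eol
            . quoted_printable_encode($content) . $eol;
    }

    // The closing delimiter has two extra dashes at the end.
    return $body . '--' . $boundary . '--' . $eol;
}
```

The result would replace $body in the mail() call from the question, with $headers still declaring the same boundary in its multipart Content-Type header.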

Content-Transfer-Encoding: 7bit
This makes no sense: UTF-8 text with non-ASCII characters is 8-bit data, so it cannot be labelled as 7bit. You need to change the MIME headers to state which encoding you are actually using.
For SMTP, the transferred data should be 7-bit ASCII. To send your UTF-8 data you need to encode it; common encodings are base64 and quoted-printable (PHP provides encode and decode functions for both).
Why not just use a good library like PHPMailer or Swift Mailer?
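For the base64 route, a minimal sketch could look like this. Both function names are made up for the illustration; the header helper builds a single RFC 2047 encoded-word, so it is only suitable for short subjects.

```php
<?php
// Hypothetical helper: base64-encode a UTF-8 body, wrapped at 76 characters
// per line (the default of chunk_split()). Send it together with the header
// "Content-Transfer-Encoding: base64".
function encodeBodyBase64(string $utf8Text): string
{
    return chunk_split(base64_encode($utf8Text));
}

// Hypothetical helper: encode a short UTF-8 header value (e.g. the subject)
// as one encoded-word. Long values would need to be split into several words.
function encodeHeaderUtf8(string $value): string
{
    return '=?UTF-8?B?' . base64_encode($value) . '?=';
}
```

The matching part header would then read Content-Type: text/plain; charset=UTF-8 followed by Content-Transfer-Encoding: base64.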

how to get email body text while getting duplicate entry?

After MIME parsing I am getting the email body with a duplicate entry (plain and HTML), and I am wondering how I can get the true message body. I am using PHP/MySQL. Is there anything in PHP string handling or MySQL to solve this?
email message body Sample:
testing body from hotmail. testing word can be repeated.
testing body from hotmail. testing word can be repeated.

OK, so as I said, you receive the email twice because you receive it in both text/plain and text/html format.
The best way I have found so far to read email from POP3 is Manuel Lemos' POP3 Access class.
Emails are usually received in parts, one for each type or image.
text/plain:
------=_Part_38964_33016848.1312149074828
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
email content in text format
text/html:
------=_Part_38964_33016848.1312149074828
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
email content in html format
You will find the boundary, the unique identifier that separates the parts, in the headers:
Content-Type: multipart/alternative;
boundary="----=_Part_38964_33016848.1312149074828"
There isn't a simple way to get only the text/plain or only the text/html version without parsing the parts; they are most likely sent together if the mail comes from a public email service. If you send email from your own scripts, you probably won't bother sending it in both formats.
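If only one copy of the body is wanted, one option is to keep a single alternative and drop the other. A rough sketch, assuming you already have the multipart/alternative body and its boundary string (pickAlternative is a made-up name, and splitting on the boundary with explode() relies on the boundary never occurring inside the content):

```php
<?php
// Hypothetical helper: returns the body of the preferred content type,
// or the first body found if the preferred type is missing.
function pickAlternative(string $body, string $boundary, string $preferred = 'text/plain'): ?string
{
    $body = str_replace("\r\n", "\n", $body);
    $first = null;

    foreach (explode("--$boundary", $body) as $chunk) {
        $split = explode("\n\n", ltrim($chunk, "\n"), 2);
        if (count($split) < 2) {
            continue;                       // preamble or the closing "--"
        }
        [$headers, $content] = $split;
        if (!preg_match('/^Content-Type:\s*([^;\s]+)/im', $headers, $m)) {
            continue;
        }
        $content = rtrim($content, "\n");
        if (preg_match('/^Content-Transfer-Encoding:\s*quoted-printable/im', $headers)) {
            $content = quoted_printable_decode($content);
        }
        if (strtolower($m[1]) === $preferred) {
            return $content;
        }
        $first ??= $content;                // remember a fallback
    }
    return $first;
}
```

Storing only the returned string in MySQL would avoid the duplicated text.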

Content-Transfer-Encoding in file uploading request

I'm trying to upload a file using XMLHttpRequest, sending these headers:
Content-Type:multipart/form-data, boundary=xxxxxxxxx
--xxxxxxxxx
Content-Disposition: form-data; name='uploadfile'; filename='123_logo.jpg'
Content-Transfer-Encoding: base64
Content-Type: image/jpeg
/*base64data*/
But on the server side PHP ignores the "Content-Transfer-Encoding: base64" header
and writes the undecoded base64 data directly into the file!
Is there any way to fix it?
P.S. It is very important to send the data as base64.

Xavier's answer doesn't sound right. RFC2616 also has this to say (section 3.7):
In general, HTTP treats a multipart
message-body no differently than
any other media type: strictly as
payload. The one exception is the
"multipart/byteranges"
It seems to me that section 19.4 of RFC2616 is talking about HTTP as a whole, in the sense that it uses a syntax similar to MIME (like headers format), but is not MIME-compliant.
Also, there is RFC2388. In section 3, last paragraph, it says:
Each part may be encoded and the
"content-transfer-encoding" header
supplied if the value of that part
does not conform to the default
encoding.
Section 4.3 elaborates on this:
4.3 Encoding
While the HTTP protocol can transport arbitrary binary data, the
default for mail transport is the 7BIT encoding. The value supplied
for a part may need to be encoded and the "content-transfer-encoding"
header supplied if the value does not conform to the default
encoding. [See section 5 of RFC 2046 for more details.]

My previous answer was wrong
Content-Transfer-Encoding may appear in a composite body:
http://www.ietf.org/rfc/rfc2616.txt
There are several consequences of this. The entity-body for composite
types MAY contain many body-parts, each with its own MIME and HTTP
headers (including Content-MD5, Content-Transfer-Encoding, and
Content-Encoding headers).
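Whatever the RFCs allow, if PHP stores the part undecoded, one workaround is to decode it yourself on the server after the upload. A minimal sketch, where saveBase64Upload is a hypothetical helper and the destination path is up to you:

```php
<?php
// Hypothetical helper: reads an uploaded file that still contains base64
// text, decodes it and writes the binary result to $destination.
function saveBase64Upload(string $uploadedPath, string $destination): bool
{
    $encoded = file_get_contents($uploadedPath);
    if ($encoded === false) {
        return false;
    }
    // Remove line breaks first; strict mode then rejects any character
    // outside the base64 alphabet instead of silently skipping it.
    $binary = base64_decode(preg_replace('/\s+/', '', $encoded), true);
    if ($binary === false) {
        return false;
    }
    return file_put_contents($destination, $binary) !== false;
}
```

With the form field from the question this might be called as saveBase64Upload($_FILES['uploadfile']['tmp_name'], $target), where $target is a path you choose.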

How to pull html encoding from email data using PHP

I'm working with emails and want to display the HTML part in the browser, but I'm not sure how to deal with the encoding. My plan is to run an HTML parser over the entire email and pick out the data in between the tags of the HTML section. Is there an easier or more efficient way to do this?
Here's the text part:
------=_Part_29856965_540743623.1285814590176
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Here's the HTML part:
------=_Part_29856965_540743623.1285814590176
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

You can have a look at the eZ Components Mail component. It has a lot of operations for building and using MIME messages:
http://ezcomponents.org/docs/tutorials/Mail
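If you end up handling a part yourself, the two headers shown in the question tell you what to undo: the Content-Transfer-Encoding first, then the charset. A small sketch, assuming the iconv extension is available and that you have already isolated the part's body (decodePartForDisplay is a made-up name):

```php
<?php
// Hypothetical helper: undo the transfer encoding of one MIME part and
// convert its text from the declared charset to UTF-8 for display.
function decodePartForDisplay(string $body, string $transferEncoding, string $charset): string
{
    switch (strtolower(trim($transferEncoding))) {
        case 'base64':
            $body = base64_decode($body);
            break;
        case 'quoted-printable':
            $body = quoted_printable_decode($body);
            break;
        // 7bit, 8bit and binary need no decoding.
    }

    if (strcasecmp($charset, 'UTF-8') !== 0) {
        $converted = iconv($charset, 'UTF-8', $body);
        if ($converted !== false) {
            $body = $converted;
        }
    }
    return $body;
}
```

For the parts in the question, the call would be decodePartForDisplay($partBody, '7bit', 'ISO-8859-1'), and the page that shows the result should be served as UTF-8.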
