parsing email attachments in php

parsing email attachments in php - php

I have a .forward.postix that is piping incoming emails to a shell account, and consequently to a PHP script that parses the emails - awesome.
As of now, and based on everything I've seen online, I'm using the explode('From: ', $email); approach to split everything down into the vars I need for my database.
Enter attachments! Using the same approach, I'm doing this to pull out a jpg attachment - but it appears not every email client formats the raw source the same way and it's causing some headaches!
$contentType1 = explode('Content-type: image/jpg;', $email); //start ContentType
$contentType2 = explode("--Boundary_", $contentType1[1]); //end ContentType
$jpg1 = explode("\n\n", $contentType2[0]); //double return starts base64 blob
$jpg2 = explode("\n\n", $jpg1[1]); //double return marks end of base64 blob
$image = base64_decode($jpg2[0]); //decode base64 blob into jpg raw
Here's my issue:
Not all mail clients - like GMail for example - list an attachment as 'Content-type: image/jpg'. GMail puts in 'Content-Type: IMAGE/JPEG' and sometimes sends it as 'Content-Type: image/jpeg'! Which don't match my explode...
My Question: Does someone know a better way of finding the bounds of a base64 blob? or maybe a case-insensitive way of exploding the 'Content-Type:' so I can then try against image/jpg or image-jpeg to see what matches?

You can try to split your string using preg_split(). Maybe something like this
$pattern = '#content-type: image/jpe?g#i';
$split_array = preg_split($pattern, $email);

Related

Remove 'E' output headers from base64 string in TCPDF

I'm using TCPDF using
$base64String = $pdf->Output('file.pdf', 'E');
So I can send the data via AJAX
The only problem is that it comes with header information in addition to the Base64 string
Content-Type: application/pdf;
name="FILE-31154d59f28c63efae86e4f3d6a00e13.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="FILE-31154d59f28c63efae86e4f3d6a00e13.pdf"
So if I take the string that is created to base64_decode() or use with phpMailer in my case it errors. Is it possible to remove the headers so I only have the base64 string?
(The error is that the pdf can't be read by any PDF reader when opened)
I thought I'd be able to find something that solves this but I haven't found anything!!
UPDATE
This is what I've put in place to solve the issue
$base64String = preg_replace('/Content-[\s\S]+?;/', '', $base64String);
$base64String = preg_replace('/name=[\s\S]+?pdf"/', '', $base64String);
$base64String = preg_replace('/filename=[\s\S]+?"/', '', $base64String);
However it's not very elegant! So if anyone has a better solution please post it below :)

TCPDF docs are huge but unusable – it's easier to read the source code directly. It has those extra headers because you're asking for them by using the E output mode, which is intended for generating email messages.
For sending the PDF data as a PHPMailer attachment, you want the straight binary PDF data as a string, as provided by the S output mode, which you can pass straight into addStringAttachment(), and PHPMailer will handle all the encoding for you. All you have to do is this:
$mail->addStringAttachment($pdf->Output('file.pdf', 'S'), 'file.pdf');
To convert the PDF binary into base64, for example to us it in a JSON string, simply pass it through base64_encode:
$base64String = base64_encode($pdf->Output('file.pdf', 'S'));

how to base64 encode a doc file

I am working on a php project that I need to send base64 encoded doc document to the server using CURL. Here is the code:
$c = file_get_contents("test.doc");
$encoded = base64_encode($c);
I found the resulting $encoded is invalid.
In this website: https://www.base64encode.org/
I uploaded the file and base64 encoded online, the resultant encoded string is correct.
I then tried to cut and paste the text from the doc file and encoded them online at the above website, the resultant string is different than the valid string I got. Therefore, I guess I could not just extract the text from the doc file and base64 encode them.

try this-
$c = file_get_contents("test.doc");
file_put_contents('temp.txt',$c);
$cc = file_get_contents('temp.txt');
if(strlen($cc)=='0'){
$cc = fopen("temp.txt","r");}
$encoded = base64_encode($cc);
unlink('temp.txt');

Decode email body in php gmail api?

I'm trying to read the email body usign Gmail API
I was using IMAP but for performance reasons(reading emails takes so much time in IMAP) I have to move to Gmail API which is considerably faster.
The issue is where I'm trying to decode the Body, with IMAP is simple, just read the transfer encodings from imap_fetchstructure return and use the appropriate function to decode. (imap_qprint,imap_7bit, etc)
And for Gmail
$message = $service->users_messages->get($user, $msg->id, ["format"=>"full"]);
$payload = $message->getPayload();
$mime = $payload->getMimeType();
$body = $payload->getBody();
$headers = $payload->getHeaders();
$content = $body->getData();
$decoded = base64_decode($content);
The variable $contents is the body in base64 but If I decode includes strange characters like ��ѽ��� that didn't happened with IMAP.
The content is plain text UTF-8 no additional parts or attachment, just plain text. And it happens with HTML as well.
These are the relevants headers
[{"name":"MIME-Version","value":"1.0"},
{"name":"Content-Type","value":"text\/plain; charset=utf-8"},
{"name":"Content-Transfer-Encoding","value":"quoted-printable"}]
I think the issue is that the body is quoted-printable but even if I use imap_qprint or quoted_printable_decode over the decoded base64 these strange characters continue.

I was having the same problem...Felipe Morales got at the solution...but to be a bit more explicit:
Take the base64-url-encoded string from the API response, and run it through this function:
function gmailBodyDecode($data) {
$data = base64_decode(str_replace(array('-', '_'), array('+', '/'), $data));
//from php.net/manual/es/function.base64-decode.php#118244
$data = imap_qprint($data);
return($data);
}
Works like a champ...

imap_open function in PHP sometimes see blank message body

I am using imap_open function in PHP to download emails and insert them into a mysql database
Here is my code to get the headers and body message etc:
$emails = imap_search($inbox,'ALL');
//if emails are returned, cycle through each...
if($emails)
{
//begin output var
$output = '';
//put the newest emails on top
rsort($emails);
//for every email...
foreach($emails as $email_number)
{
//get information specific to this email
$header=imap_headerinfo($inbox,$email_number);
$structure = imap_fetchstructure($inbox,$email_number);
$from = $header->from[0]->mailbox . "#" . $header->from[0]->host;
$toaddress=$header->to[0]->mailbox."#".$header->to[0]->host;
$replyto=$header->reply_to[0]->mailbox."#".$header->reply_to[0]->host;
$datetime=date("Y-m-d H:i:s",$header->udate);
$subject=$header->subject;
$message = quoted_printable_decode(imap_fetchbody($inbox,$email_number,1.1));
if($message == '')
{
$message = quoted_printable_decode(imap_fetchbody($inbox,$email_number,1));
}
}
}
but it doesnt seem to get the body of all emails. For example, when it receives Read Receipts the body is just blank and the same with some other emails people send.
sometimes, the email body looks like:
PGh0bWw+DQo8aGVhZD4NCjxtZXRhIGh0dHAtZXF1aXY9IkNvbnRlbnQtVHlwZSIgY29udGVudD0i dGV4dC9odG1sOyBjaGFyc2V0PXV0Zi04Ij4NCjwvaGVhZD4NCjxib2R5IHN0eWxlPSJ3b3JkLXdy YXA6IGJyZWFrLXdvcmQ7IC13ZWJraXQtbmJzcC1tb2RlOiBzcGFjZTsgLXdlYmtpdC1saW5lLWJy ZWFrOiBhZnRlci13aGl0ZS1zcGFjZTsgY29sb3I6IHJnYigwLCAwLCAwKTsgZm9udC1zaXplOiAx NHB4OyBmb250LWZhbWlseTogQ2FsaWJyaSwgc2Fucy1zZXJpZjsiPg0KPGRpdj4NCjxkaXY+DQo8 ZGl2PnJlcGx5PC9kaXY+DQo8ZGl2Pg0KPHAgc3R5bGU9ImZvbnQtZmFtaWx5OiBDYWxpYnJpOyBt YXJnaW46IDBweCAwcHggMTJweDsiPjxiPktpbmQgUmVnYXJkcyw8YnI+DQo8YnI+DQpDaGFybGll IEZvcmQgfCZuYnNwOzwvYj48c3BhbiBzdHlsZT0iY29sb3I6IHJnYigyNTIsIDc5LCA4KTsiPjxi PlRlY2huaWNhbCBNYW5hZ2VyJm5ic3A7PC9iPjwvc3Bhbj48Yj58Jm5ic3A7SW50ZWdyYSBEaWdp dGFsPC9iPjxmb250IGNvbG9yPSIjNTk1OTU ... continued
How can i convert the whole message body to be plain text

Here's what I use, in general. $email refers to one of the objects from the return of eg imap_fetch_overview:
$structure = imap_fetchstructure($email->msgno);
$body = imap_fetchbody($email->msgno, '1');
if (3 === $structure->encoding) {
$body = imap_base64($body);
} else if (4 === $structure->encoding) {
$body = imap_qprint($body);
}
Note there are 6 possible encodings (ranging from 0 to 5), and I'm only handling 2 of them (3 and 4) -- you might want to handle all of them.
Also note I'm also getting only the 1st part (in imap_fetchbody) -- you might want to loop over the pieces to get them as needed.
Update
One other thing I noticed about your code. You're doing imap_fetchbody($inbox,$email_number,1.1). That third argument should be a string, not a number. Do this instead:
imap_fetchbody($inbox, $email_number, '1.1')

The code given handles only simple text messages having at most one sub-part and no encoding. This is basically the simplest kind of email there is. The world used to be that simple, sadly no more!
To handle more email, your code must be expanded to handle:
Multi-parts
Encodings
Multi-part is the concept that a single email message (a bunch of data) can be divided into multiple, logically-separate pieces. In the simplest case, there is only one part: the text of the message. In the next simplest case, there is message text with a single attachment. The next simplest case is message text plus multiple attachments. Then it starts to get hard, when the text of the message refers inline or embeds the attachments (think of an HTML message with an image -- that image could be an attachment that's linked with "local" CSS or embedded as eg base64 data url).
Encoding is the idea that email needs to accommodate the lowest common denominator of SMTP servers on the Internet. From 1971 to the early 1990s, most email messages were plain text using 7-bit US ASCII character set -- and SMTP mailers in the middle relied on this 7-bit framework. As the need for character sets became more apparent, simultaneously with the need to send binary data (eg images), 8-bit SMTP mailers cropped up as did various methods to shoe-horn 8-bit clean data into 7-bits. These include quoted-printable and base64. While 7-bit is virtually dead, we still have all the hoops of this history to jump through.
Rather than re-invent the wheel, there is a good piece of code on PHP.net that handles multi-part encoded messages. See the comment by david at hundsness dot com. You would use that code like this:
$mailbox = imap_open($service, $username, $password) or die('Cannot open mailbox');
// for all messages
$emails = imap_fetch_overview($mailbox, '1:1'/* . imap_check($mbox)->Nmsgs*/);
foreach ($emails as $email) {
// get the info
getmsg($mailbox, $email->msgno);
// now you have info from this message in these global vars:
// $charset,$htmlmsg,$plainmsg,$attachments
echo $plainmsg; // for example
}
imap_close($mailbox);
(Side note: his code has three parse errors, where he does ". =" to mean ".=". Fix those and you're good to go.)
Also, if you're looking for a good blog on doing this "from the ground up", check out this: http://www.electrictoolbox.com/php-imap-message-parts/

How to format incoming email text for HTML display

I've set up a script that processes incoming emails and creates blog entries on Blogger. I'm using PEAR's Mail_Mime libs (for now) to read the incoming message. The messages often have characters in them that cannot be read by browsers--this happens most often when people use Outlook or cut/paste from MS Word.
So the output at the other end is something like this:
Here is a test post with “quotes” and ‘apostrophes�for what it�s worth, it also has dashes�and other strange formatting cut and paste from MS Word.
You can also see the output in the wild.
It's not hard to fix any specific instance, but each client (hotmail, gmail, outlook, etc) seems to handle things just a bit differently. Mail_Mime only seems to munge the output and, if I turn off Mail_Mime's parsing and try to translate the encoded characters myself using mb_convert_encoding or some manual simulation of this, it's even worse.
Please not that this is not going to be solved by selecting the right encoding type and using decode/encode/convert functions. The incoming formats vary from Windows-1252 to UTF8 to just about anything else mail clients can think of.
Has anyone scripted this before that could save me some time by offering up a sample or advice on the best approach? I've tried all the simple answers and done plenty of experimenting, so please don't bother responding unless you've dealt with a similar issue successfully or have a deep understanding of encoding issues.

The only way to do this is to do it by the spec's which is I'm afraid to pull in the 'Content-Type' mime header, pick up the charset (it'll look like Content-Type: text/plain; charset="us-ascii") then convert to UTF-8, and of course ensure your output on the web is sent as UTF-8 with the right headers.

To solve this problem, and get my message into valid UTF-8 that is readable from a browser, I found this PHP lib, ConvertCharset by Mikolaj Jedrzejak, which worked on almost everything. It still had issues with a specific symbol (=A0) when converting from Windows-1252 or iso-8859-1. So I converted this character manually before setting the code loose.
Here's what it looks like overall:
// decode using Mail_Mime
require 'Mail.php';
require 'Mail/mime.php';
require 'Mail/mimeDecode.php';
$params['include_bodies'] = true;
$params['decode_bodies'] = true; // this decodes it!
$params['decode_headers'] = true;
$decoder = new Mail_mimeDecode($input);
$mime = $decoder->decode($params);
// too much work to put in this example
$charset = ...; //do some magic with $mime->parts to get the character set
$text = ...; //do some magic with $mime->parts to get the text
// fix the =A0 control character; it's already been decoded
// by Mail_Mime, so we need the actual byte code now
// this has to be done before trying to convert to UTF-8
$char = chr(hexdec(substr('A0',1)));
$text = str_replace($char, '', $text);
// convert to UTF-8 using ConvertCharset
require 'ConvertCharset.class.php';
if( strtolower($charset) != 'utf-8' ) {
$converter = new ConvertCharset($charset, 'utf-8', false);
}
$text = $converter->Convert($text);
Then everything is spiffy. It even does the infamous Iñtërnâtiônàlizætiøn conversion, as well as accepting french, spanish, and pastes directly from MS Word :)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.