Threading emails by subject

We're parsing an email inbox signed up to a mailing list (Mailman) that does nothing except sit there and capture emails from other users on the mailing list. This is going to be PHP connecting to an email box, grabbing new emails and putting them into a MySQL database for use as a web archive that's searchable.
I noticed that many of the subjects have RE:, FW: or FWD: in front of them (obviously), and wondered whether I need to strip these out manually to get a grouping by subject when outputting database results to the web page.
Maybe there's a PHP/Mail or PEAR class that will automatically handle message grouping/threading that I'm not aware of. Thanks for your help!
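For the prefix stripping asked about here, a minimal sketch in plain PHP could look like the following. The function name normalizeSubject is made up for illustration, and the pattern only covers the English markers mentioned in the question (plus the "Re[2]:" variant); list tags such as "[listname]" that Mailman may prepend are not handled.

```php
<?php

/**
 * Strip leading reply/forward markers such as "Re:", "FW:", "Fwd:" and
 * "Re[2]:" so that subjects can be compared for grouping.
 * (Hypothetical helper, not part of any library.)
 */
function normalizeSubject(string $subject): string
{
    // One or more markers at the start, each followed by a colon.
    $pattern = '/^\s*(?:(?:re|fwd?)\s*(?:\[\d+\])?\s*:\s*)+/i';
    $stripped = preg_replace($pattern, '', $subject);

    // Collapse runs of whitespace and lowercase for case-insensitive grouping.
    $stripped = preg_replace('/\s+/', ' ', trim($stripped));

    return strtolower($stripped);
}
```

The normalized value could be stored in its own column and used in a GROUP BY, keeping the original subject for display.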

The proper way to thread them is not by subject, but rather by the Message-ID and References headers. The References header contains a whitespace-separated list of the Message-ID values of the earlier messages in the thread. By using these, the actual content of the subject line becomes less relevant, since it can get modified and mangled. In other cases, you might get many separate threads with subjects like "Need help please" that should not be threaded together.
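As a rough sketch of how these headers might be used, the following pulls the message IDs out of a References value and treats the first one as the thread root and the last one as the direct parent, which is the usual convention. The function names are made up for illustration, and the header values are assumed to be already unfolded raw strings.

```php
<?php

/**
 * Extract every "<...>" message identifier from a References or
 * In-Reply-To header value, in the order they appear.
 *
 * @return string[]
 */
function parseMessageIds(string $headerValue): array
{
    preg_match_all('/<[^<>\s]+>/', $headerValue, $matches);
    return $matches[0];
}

/**
 * Work out where a message sits in its thread.
 * Returns ['root' => ..., 'parent' => ...]; both are null for a message
 * that starts a new thread.
 */
function threadPosition(?string $references, ?string $inReplyTo): array
{
    $ids = parseMessageIds($references ?? '');
    if ($ids === []) {
        // No usable References header: fall back to In-Reply-To.
        $ids = parseMessageIds($inReplyTo ?? '');
    }
    if ($ids === []) {
        return ['root' => null, 'parent' => null];
    }
    return ['root' => $ids[0], 'parent' => $ids[count($ids) - 1]];
}
```

Storing the root ID next to each message in the database would then allow grouping a thread with a single indexed column, independent of the subject line.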

You probably want to look into the References and In-Reply-To email headers. These give you information about which email the current email is in reply to.
There's a good algorithm for threading email based on this information here: http://www.jwz.org/doc/threading.html

Related

Threading without In-Reply-To: Message-ID?

So our system is sending us email notifications regarding specific incidents. The problem is, things get really spammy and disorganized, and we really want to have our email client (Gmail) thread notifications regarding the same incident.
So for instance:
Email 1 is sent about incident 1.
Email 2 is sent about incident 1.
Gmail displays them in their own threads while we want them to appear as "replies" in the same thread.
I thought all I needed was to append "Re: Subject Line" but that doesn't work. Then I read about the References and In-Reply-To headers, but these require storing message IDs and we'd want to avoid the overhead if we can.
Is there any workaround to have clients thread emails received about the same topic without having to store message IDs and add them to In-Reply-To headers?
Thank you so much!
Note: Using PHPMailer

You really can't get around this. You must have something that uniquely identifies a message, and as far as email messages go, that something is the message ID, and that's also exactly the thing referenced by References and In-Reply-To headers, so you really can't get away from that. It has to be implemented using message headers because that's universally supported by all email clients.
The definitive threading algorithm is, like email itself, now very old, but you can find it here, and it's still as valid as it ever was. I used that approach to implement the email threading used in the chamsocial.com social network.
Some email clients try to do threading without using message IDs, and it's almost always a failure - I sometimes see Apple Mail doing this when IDs are missing, pulling completely unrelated messages into a thread just because they happen to have the same subject.
If you wanted to do message threading without message IDs, you'd need to think up something that acts in exactly the same way - which is pointless, so you're best off biting the bullet and implementing message ID support properly.
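One way to stay with message IDs while avoiding a lookup table might be to derive the thread's root ID from the incident number, so it can be recomputed instead of stored. The sketch below is an assumption-laden illustration, not something the answer above prescribes: the function name, the ID format and the domain are all made up, and whether a given client (Gmail included) threads on a parent message it has never actually received is something you would need to test.

```php
<?php

/**
 * Build threading headers for a notification about one incident.
 * Every notification gets its own unique Message-ID but points back to
 * the same derivable root ID, so nothing has to be looked up.
 * (Hypothetical helper; the ID format is an assumption.)
 */
function incidentThreadHeaders(int $incidentId, string $domain): array
{
    // Deterministic: the same incident always yields the same root ID.
    $rootId = sprintf('<incident-%d@%s>', $incidentId, $domain);

    // Unique per message, so no two notifications share a Message-ID.
    $messageId = sprintf(
        '<incident-%d.%s@%s>',
        $incidentId,
        bin2hex(random_bytes(8)),
        $domain
    );

    return [
        'Message-ID'  => $messageId,
        'In-Reply-To' => $rootId,
        'References'  => $rootId,
    ];
}
```

With PHPMailer this would presumably map to setting the MessageID property and calling addCustomHeader() for the other two headers; check the PHPMailer documentation for the version you use.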

What is the normal way to create a unique ID for POP3 emails?

IMAP messages have a UID for which we all rejoice. However, I'm trying to figure out how to generate a unique ID for a POP3 message and having trouble (old systems like hotmail.com only allow POP3).
Available messages to the client are fixed when a POP session opens
the maildrop, and are identified by message-number local to that
session or, optionally, by a unique identifier assigned to the message
by the POP server. This unique identifier is permanent and unique to
the maildrop and allows a client to access the same message in
different POP sessions. Mail is retrieved and marked for deletion by
message-number. When the client exits the session, the mail marked for
deletion is removed from the maildrop. - wikipedia
It seems however, that the basic LIST command simply returns an array of temp numbers to allow you to fetch the email. Those numbers are in no way unique though so another extension called UIDL seems to have been added: CAPA (POP3 Extension Mechanism).
POP3 states that a UIDL is unique as long as the message exists.
The unique-id of a message is an arbitrary server-determined string,
consisting of one to 70 characters in the range 0x21 to 0x7E, which
uniquely identifies a message within a maildrop and which persists
across sessions. This persistence is required even if a session ends
without entering the UPDATE state. The server should never reuse an
unique-id in a given maildrop, for as long as the entity using the
unique-id exists.
Note that messages marked as deleted are not listed.
While it is generally preferable for server implementations to store
arbitrarily assigned unique-ids in the maildrop, this specification is
intended to permit unique-ids to be calculated as a hash of the
message. Clients should be able to handle a situation where two
identical copies of a message in a maildrop have the same unique-id.
Which makes me think that it's possible that I might download another message a year later (after the first one was deleted) which has the same UIDL and might clash in my system.
Should I just hash the whole message body and use that as an ID?
Rather than fetching the whole email to hash it, perhaps I should just use TOP [id] 1 to hash the headers (and first line), which shouldn't ever match an existing email since the receiving server will always add some kind of information, correct? So an attacker could never cause a collision, since the Received header or something similar would have been modified, right?
The MDaemon program seems to tackle the issue with partial header hashing:
MDaemon constructs the UIDL results using the message name, date stamp, size, and a few other details about the messages. As a result, if a message is modified on the server, it will appear as “new” to mail clients even if you don’t rename it.
What is the proper way to make an ID for a POP3 email?
Note: Emails often contain a Message-ID header, but I can't rely on that because it could be used as an attack vector to confuse my system. It is also left out by some email clients.

Personally, I would just hash a small subset of the email headers: something like Date, From, Subject, and Message-ID if available.
I often subscribe to mailing lists where you tend to receive multiple copies of the same message when someone is replying to you: one that comes directly from them, and another via the mail server. Under those circumstances, many of the headers are different, but I'd really rather not receive two copies of the message.
And the chance of me receiving two different emails, from the same person at the same time, with the same subject and the same message-id seems extremely unlikely.
Of course, it's not impossible. They might not generate message-ids, they might have a blank subject line, they might have a broken clock, and they might have all of those things at the same time. But then again, the router through which their email is passing might be wiped out by a giant meteor from space.
Frankly, the most likely scenario is the email will end up being detected by spam and I'll never see it anyway. Email just isn't that reliable a form of communication. You need something that works reasonably well, but if it doesn't handle that 1 in a million edge case, you'll probably still be ok.
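A minimal sketch of that header-subset hash might look like this. It assumes the headers have already been parsed into an array keyed by lowercase header name; the function name messageFingerprint and the choice of SHA-256 are illustrative, not taken from the answer.

```php
<?php

/**
 * Compute an ID from a small, stable subset of headers.
 * Headers added in transit (Received, etc.) are deliberately ignored,
 * so two copies of the same message arriving by different routes
 * produce the same fingerprint.
 *
 * @param array<string, string> $headers keyed by lowercase header name (assumption)
 */
function messageFingerprint(array $headers): string
{
    $parts = [];
    foreach (['date', 'from', 'subject', 'message-id'] as $name) {
        // A missing header contributes an empty string so field positions stay fixed.
        $parts[] = trim($headers[$name] ?? '');
    }
    return hash('sha256', implode("\n", $parts));
}
```

Because the Received headers are excluded, the direct copy and the mailing-list copy of a reply collapse to one ID, which is the deduplication behaviour described above.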

Excuse me for questioning your question, but – the real question is: why do you care? It seems to me you are trying real hard to come up with a natural primary key for emails. You shouldn’t need to – and there isn’t really one, anyway. What’s the real problem you are trying to solve?
Your understanding of UIDLs is correct. A message must keep the same UIDL while it is in a particular mailbox, identical messages can have identical UIDLs (but don’t need to), and UIDLs should not repeat within the context of a mailbox, but are not strictly required to. The last requirement in particular highlights the scope and purpose of UIDLs. Once the client has deleted a message from a mailbox, it must (and can) forget about its UIDL, because that value, should it appear again, will henceforth never convey any relationship to the former message.

I would hash the UIDL you mention together with the current timestamp, which should ensure the uniqueness of the ID. If the UIDL is unique for as long as the message lives, adding the timestamp would ensure that the scenario you describe (another message with the same UIDL) cannot cause a clash!

Forwarding IMAP Messages with Perl

I need to handle some mail. I already have a script built that can parse through a mailbox and perform several actions like saving attachments, moving email to folders and other administrative tasks. A few of the emails are identified as rogue during this process and need to be forwarded. The messages may or may not have one or more attachments and are dumped into their own folder labeled fwd.
I can create and send new email messages but am having trouble finding information on forwarding or replying to existing email. One solution would be to save the parts (body, subject, attachments) to a database and construct a new message with MIME::Lite but this seems inefficient at best.
I am handling the email with Net::IMAP::Simple::SSL and MIME::Parser.
Since the email is dumped into a temporary folder for holding I am not totally against using a PHP script to handle the messages, but prefer something in line with my current Perl handler to execute the task.
Looking for some helpful info to help complete this task.

You might want to look on CPAN at Mail::Box, a rich (and a bit complex) module for handling mail messages, including primitives such as message->copy and message->reply.
For documentation and examples, the author's website is at http://perl.overmeer.net/mailbox/

Are custom mail headers preserved after reply?

I'm currently trying to design a PHP webapp that allows users to send emails to other users. The recipient can then reply to the email and the message will be updated in the webapp.
Now, to keep track of each individual user message, I would like to add a custom header (e.g. conversation_id) to the email. When the recipient replies to the email in their email client, will the custom mail header (e.g. conversation_id) be preserved?
There will be a cron job that executes every minute and opens a POP3 stream to the web server to retrieve new emails (replies that the user may have sent with their mail client) to update my DB.
I'm not sure if this is a good way for designing such an app. Any suggestions?
EDIT: Also, I'm wondering how I can strip out the quoted messages in the reply.

You can't rely on mail headers being preserved - it is pretty much up to the individual mail client to decide what to include.
I would generally put the conversation ID within [] brackets in the subject which makes it really easy to parse out with a regular expression.
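As a rough illustration of the bracket approach, the sketch below parses a tag out of the subject with a regular expression. The "[conv:ID]" format and the function name are made up for this example; any tag format that is unlikely to occur by accident would do.

```php
<?php

/**
 * Pull a conversation ID out of a subject such as
 * "Re: Your question [conv:8f3a2c]".
 * Returns null when no tag is present (for example, if the user removed it).
 * (Hypothetical helper; the tag format is an assumption.)
 */
function extractConversationId(string $subject): ?string
{
    if (preg_match('/\[conv:([A-Za-z0-9]+)\]/', $subject, $m) === 1) {
        return $m[1];
    }
    return null;
}
```

The null return is worth handling explicitly, since nothing stops a recipient from editing the subject before replying.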

Each message already contains the Message-ID field which is used by the mail clients to create the content of the In-Reply-To field.
Wouldn't the usual, standards-based way be to rely on the user's mail client setting the In-Reply-To field correctly? As far as I know, all email clients do this correctly (even though, according to this thread, Outlook may have an occasional bug).
So I think, emails already feature this and you don't have to worry about creating a custom mail header entry and unpredictably behaving mail clients.
EDIT: I remember a friend telling me about his frustration at work over how many people remove or even edit these tags in [ ] brackets in the subject field. It also seems to be a very dirty workaround, and all of your software would need to handle it while somehow preventing users from changing it, which is practically impossible.
EDIT: I think it will be hard to reliably strip out the quoted message in the reply, because each mail client handles it differently.

Handling Incoming Mail to Multiple Recipients in PHP

Alright, this may take a moment or two to explain:
I'm working on creating an Email<>SMS Bridge (like Teleflip). I have a few set parameters to work in:
Dreamhost Webhosting
PHP 5 (without PEAR)
Postfix
MySQL (If Needed)
What I have right now, is a catch-all email address that forwards the email sent to a shell account. The shell account in turn forwards it to my PHP script.
The PHP script reads it, strips a few Email Headers in an effort to make sure it sends properly, then forwards it to the number specified as the recipient. 5551234567#sms.bridge.gvoms.com of course sends an SMS to +1 (555) 123-4567.
This works really well, as I am parsing the To field and grabbing just the email address it is sending to. However, what I realized that I did not account for is multiple recipients. For example, an email sent to both 5551234567 and 1235554567 (using the To line, the CC line, or any combination of those).
The way email works of course, is I get two emails received, end up parsing each of them separately, and 5551234567 ends up getting the same message twice.
What is the best way to handle this situation, so that each number specified in TO and CC gets exactly one copy of the message?
In addition, though I doubt it's possible: is there a way to handle BCC the same way?

If you check the headers of the mail, you should find a Message-ID field (according to RFC2822 - section 3.6.4). So you could test if you have already sent an SMS for a mail with the same Message-ID & phone number to prevent sending the same message to the same number twice.
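The check described above could be sketched as follows. This version keeps the seen pairs in memory purely for illustration and sticks to PHP 5 compatible syntax; since each incoming mail presumably starts a fresh PHP process, a real version would need to persist the pairs, for example in a MySQL table with a unique key on (message_id, phone). The class and method names are made up.

```php
<?php

/**
 * Remembers which (Message-ID, phone number) pairs have already been
 * handled so the same SMS is not sent twice.
 * (Hypothetical in-memory sketch; real storage would be a database.)
 */
class SentSmsLog
{
    /** @var array map of "messageId|phone" => true */
    private $seen = array();

    /**
     * Returns true the first time a pair is seen and records it;
     * returns false on every later call with the same pair.
     */
    public function shouldSend($messageId, $phone)
    {
        $key = $messageId . '|' . $phone;
        if (isset($this->seen[$key])) {
            return false;
        }
        $this->seen[$key] = true;
        return true;
    }
}
```

When the two copies of a mail addressed to 5551234567 and 1235554567 arrive, each copy would loop over all recipients, and shouldSend() would let each number through only once.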

Why not use something like IMAP to check the catch-all mailbox, loop through the messages and then delete them once finished? That way you don't need to forward them to a separate account.

Stupid dirty solution: parse all recipients from the mail, then send them SMS, then put em all into temporary table with md5 of message text. And check all incoming mails against this table.

Although wimvds had the best answer here, I found out elsewhere that Dreamhost includes a "X-DH-Original-To" header in the way I'm running it through the system. Using this, I'm able to send to each number individually upon receipt of the email without checking it against a database. This should also work with Blind Carbon Copy (I don't know the specifics of how email works enough to tell you how that works).
