I am trying to parse Gmail emails, but have one problem: how do I know which message a reply corresponds to?
I tried sorting email by subject. For example, if a message has the subject "hi Jack", then all messages with the subject "Re: hi Jack" are replies to this mail.
But what do I do if I have many emails with the same subject? How do I know which email they are replies to?
Do emails perhaps have a unique code for what the reply goes to? Maybe there is an ID or something like that which tells me what the children of a message are?
Threading by subject is not a good idea because, as you noticed, there may be several different threads with identical subjects.
You need to examine three headers in the message to make threading (or any other kind of grouping) possible:
Message-ID: contains a unique message identifier (what you call a "unique code") in a string surrounded by < and > characters, e.g. <123456@User1PC>. Most MUAs create identifiers in this form or something similar. This header should be generated when a new message is sent.
In-Reply-To: contains the identifier of the message this particular reply responds to, e.g. <789abcd@User2PC>. Its value should be copied from the Message-ID of the message being replied to.
References: contains a list of identifiers of recent messages in this "thread". The format is the same as above, with the identifiers separated by whitespace, e.g. <123456@User1PC> <789abcd@User2PC>. It is there so that you can use it to locate the message's place in the thread.
If a message was replied to or posted a few days later, it might be hard to locate its parent without the list of references. Mail clients usually trim the list of references: it has to stay long enough to locate the message in a thread, but short enough to keep the header at a reasonable size. For example, it may contain 5-10 references, which is usually more than enough to connect the message to the others. References: is also useful if the original (first) message has been deleted, because even without it you can still use the References: list to build the threaded (grouped) messages.
So, in order to thread messages, you would need to read all of them, and then sort them into threads based on the information you can extract from the above headers.
If the references or message IDs are not in a form you can recognize (e.g. <example@something>), you can bail out by not threading these messages and displaying them as unthreaded. So a generic algorithm for threading/locating might look something like this:
Take the first message ID.
Examine nearby (by date) messages to see if one of them contains that message ID in its References list or In-Reply-To header. If there are none, you can't group it, so keep it as a standalone message.
Group the matching messages together and order them within the group, perhaps based on the Date: or Received: header.
Place this message into a "Done" list so you don't need to examine it (or its related references) further.
Continue until you can't find any more references, then move to the next message that is not already in the "Done" list, and repeat these steps until you have processed the entire message list.
It will probably take you a while to get this done properly but now at least you have a starting point to look into.
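The steps above can be sketched roughly as follows. This is only a minimal illustration in Python, and it swaps the "scan nearby messages and keep a Done list" bookkeeping for a union-find structure: any two messages connected through Message-ID, In-Reply-To, or References end up in the same group. The function name group_into_threads and the no-id-N placeholder keys are made up for this example.

```python
import re
from collections import defaultdict
from email import message_from_string

# Matches identifiers of the form <something>, as used by the three headers.
MSG_ID = re.compile(r"<[^<>\s]+>")


def group_into_threads(raw_messages):
    """Return lists of message indexes, one list per thread."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    def union(a, b):
        parent[find(b)] = find(a)

    keys = []
    for index, raw in enumerate(raw_messages):
        msg = message_from_string(raw)
        ids = MSG_ID.findall(msg.get("Message-ID", ""))
        # Messages without a recognizable ID stay unthreaded (hypothetical key).
        key = ids[0] if ids else f"no-id-{index}"
        keys.append(key)
        find(key)
        related = msg.get("In-Reply-To", "") + " " + msg.get("References", "")
        for ref in MSG_ID.findall(related):
            union(ref, key)

    threads = defaultdict(list)
    for index, key in enumerate(keys):
        threads[find(key)].append(index)
    return sorted(threads.values())
```

Because the grouping relies only on the identifiers, two unrelated conversations that share the subject "hi Jack" stay separate, and a thread can still be rebuilt through References when its first message is missing. Ordering inside each group (for example by Date:) is left out here.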
This is my first post on the forums; I've been lurking forever now. About time to say hello! I did use the search, but either nobody else has this specific issue, or they don't use the comment section the way we do.
We like to send updates to customers through the order comment section of the order page. The email that is sent does not keep any of the line breaks that were used in the original comment. If you have 5 separate sentences, the email shows one big paragraph.
This is really annoying, because our message becomes a big mess. We have to give the customer a series of details about their order, and instructions on how to proceed with the issue at hand.
Here are two images of what I'm talking about.
I have images but I can't post them because I need 10 rep. Hmm...
As you can see, this is just an example and not a long comment, but our regular emails can have a lot of information, and maybe some up-selling.
I haven't tested it myself, but David Manners' answer on this question on the Magento Stack Exchange sounds promising.
Basically he suggests running the comments through nl2br to convert the newlines to tags that would render in HTML email, like this:
{{var data.comment:escape|nl2br}}
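To make the effect of that directive concrete, here is a small sketch in Python of the same two steps: escape the text for HTML, then put a line-break tag in front of every newline (which is what PHP's nl2br does). The function name escape_nl2br is invented for this illustration, and it is assumed here that the escape modifier behaves like ordinary HTML escaping.

```python
import html
import re


def escape_nl2br(comment):
    """Escape HTML special characters, then insert <br /> before each newline."""
    escaped = html.escape(comment)
    # Keep the original newline after the tag, as PHP's nl2br does.
    return re.sub(r"(\r\n|\n|\r)", r"<br />\1", escaped)
```

The order matters: escaping after inserting the tags would turn the line-break tags themselves into visible text.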
I'm a bit confused here, and because of the keywords involved, Google has not been much help. I want to send text messages from my website to users with cellphones via SMS. Sounds simple enough using addresses like 1234567890@carrier.com with PHP mail scripts, right? Well, one thing I've encountered before is that sometimes it sends an MMS (multimedia message) instead of a regular text message.
What I am trying to find out (using Google keywords such as "send text message phone number php") is how to get a phone number for your website. I've seen sites that actually have a phone number linked to the site that you can receive messages from.
For example, Facebook uses 32665 to send and receive messages.
How do you accomplish this?
Is there any way, using 1234567890@carrier.com with PHP mail, to send strictly a text message and not a multimedia message?
Example php mail script using a phone number:
<?PHP
mail("5558349324@vtext.com", "", "Test SMS", "From: RRPowered <test@rrpowered.com>\r\n");
?>
Also, thanks in advance for any help, because I am really, really confused at this point.
UPDATE: I just tried using the regular PHP mail functions to send to a phone number, and it appears to cut the messages off at around 140 characters (I believe that is the standard message size).
UPDATE 2: Before anyone mentions APIs like Text Magic, I am aware of those things, BUT I want to build my own code instead of integrating someone else's API (if possible).
I'm the webmaster and lead developer for a set of websites providing assessment services. I have no background in server administration and am ashamed to say that being a sysadmin is pretty intimidating to me; I'm comfortable with basic command line, but I'm frequently overwhelmed by all of the myriad configurations needed to get a standard LAMP stack working with several subdomains and an email server running.
I've finally reached something that thoroughly stumps me. Let me first list our basic server configuration:
VPS running Ubuntu 12.04
Standard Apache, PHP, Mysql, BIND services
1 main domain + 4 subdomain websites configured in the above services, two of which make extensive use of mysql databases having 300-400 tables each
Citadel email server (a choice I've increasingly come to see as a mistake)
Swiftmailer (for sending automated emails from PHP)
Some of our pages (relating to account and assessment management) send alert emails to the test taker, coach, and/or us, when a specific action is taken. The text for these emails is fetched from a "boilerplate" table in the mysql db. Then in PHP, a couple "placeholder" tags are swapped out for test taker name, date, etc. The PHP code then uses Swift Mailer to send the prepared text as an email, with our local Citadel email server, to a given set of (usually external) addresses. We are CC'd on most of these emails, to provide us with a second record of all of the automated messages we've sent.
(Feel free to skip over this) Here's the gist of my code that fetches and processes the text, and sends it as an email:
$subject = $bp['email_report_title'];
$subject = str_replace("[NAME]", $name, $subject);
$subject = str_replace("[INSTRUMENT_NAME]", $instrument_name, $subject);
$body = $bp['email_report'];
$body = str_replace("[NAME]", $name, $body);
$body = str_replace("[INSTRUMENT_NAME]", $instrument_name, $body);
$body = str_replace("[INSTRUMENT_BASE]", $instrument_base, $body);
sendEmail($email_to, $email_from, $email_cc, $subject, $body);
In a nutshell, here's my problem: the emails we see (as a result of the above events) sometimes have typos appear in them -- inconsistently. Here are a couple images as example, showing a set of auto emails generated & sent on the same day using the same PHP code drawing from the same (unaltered) string in our database. These are screenshots from our internal (CC'd) account, viewed in Mac Mail; but based on test taker complaints, all email recipients appear to be seeing the same text. In the first image, note the differences in spacing in the titles.
In the second & third images, note the incorrect link (two periods at the end). I have checked the PHP code and the mysql source text, and can find absolutely no reason that the login link would be changed to have two periods in that spot. EDIT: Based on commenter request, I've copied the full-length email text below the images.
[name removed]'s LSUA report has been completed. To view the report, click on the following link:
https://www.devtestservice.org/security/logIn..php
Once you have logged in, click the LSUA icon, and follow the link that says "View your report".
The commenting on our assessments is done by highly trained human beings who occasionally make mistakes. If you notice any inconsistencies or errors in your report, please let us know and we will make prompt repairs. If you have any problems logging in or viewing your report, please contact us using the following link:
https://www.devtestservice.org/contactus.php
Lectica, Inc.
Northampton, MA 01060
Our mission is to develop standardized, formative, and diagnostic developmental assessments of the knowledge and skills required to meet the challenges of the 21st century. Our aims are threefold: (1) to build engaging, educative, and feedback-rich developmental assessments and learning resources for K-18 students and their teachers, (2) to create equally rich assessments that diagnose learning needs and support the development of adults (in the workplace and beyond), and (3) to build (and share) knowledge about learning and its role in the future of society.
This message contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s) named above. Any disclosure, distribution, copying or use of the information by others is strictly prohibited. If you have received this message in error, please advise the sender by immediate reply and delete the original message and any and all attachments. Thank you.
Sorry for the long background explanation. But my question is simple: How could these typos happen? At what point in the workflow could a (verified consistent) mysql data cell, fetched and processed by (straightforward) PHP code, plopped into a standard PHP -> SMTP mailer plugin, lead to the creation of an email that appears to have random typos?
I'd love any ideas you have, and I'm glad to provide more detailed code if needed. Thanks in advance.
EDIT: Thanks to Eggyal for pointing out that only the longer subject lines have the spaces removed. This pattern holds true across other older emails too. Also, I've provided the full text of one email alert above; note the '..php' at the end of the link. All emails have the same signature line, and all alert emails of the same type have the same length body (plus or minus 15 characters for different instrument names and test taker names).
I have experienced very similar behavior in emails when using plain PHP's mail() function and the Postfix SMTP server. It took me weeks to figure out why the HTML in those emails was randomly corrupted: it was long lines in the email source, exactly as @eggyal writes in the comments.
RFC 2822 states that lines should be no more than 78 characters (and must not exceed 998), excluding the CRLF. To comply, the SMTP server inserts newlines at selected places; however, this can easily corrupt whitespace-sensitive content. Check the Content-Transfer-Encoding header in your emails; I assume it will be 7bit, 8bit, or quoted-printable.
Try changing it to base64; that's the easy way. I'm not familiar with Swift Mailer, but I guess this could work:
$transferEncoding = $message->getHeaders()->get('Content-Transfer-Encoding');
$transferEncoding->setValue('base64');
Alternatively, you can try to wrap lines manually before handing over to SMTP server.
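To illustrate why base64 sidesteps the problem, here is a minimal sketch using Python's standard email package rather than Swift Mailer (the function name and the subject text are made up for the example). With base64 as the transfer encoding, the body travels as short encoded lines, so no server along the way has a reason to re-wrap it, and the recipient decodes exactly the original text.

```python
from email.message import EmailMessage


def build_base64_message(body):
    """Build a plain-text message whose body is base64 transfer-encoded."""
    msg = EmailMessage()
    msg["Subject"] = "Report ready"  # placeholder subject for the example
    # cte="base64" makes the library emit the body as short base64 lines.
    msg.set_content(body, cte="base64")
    return msg


if __name__ == "__main__":
    long_body = "x" * 2000 + " https://www.example.org/security/logIn.php\n"
    message = build_base64_message(long_body)
    print(max(len(line) for line in message.as_string().splitlines()))
```

The trade-off is that the raw source is no longer human-readable and the message grows by about a third.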
I need to create a system using IMAP that can read the replies in the inbox. I don't need a piece of code, but an explanation of how a reply email is structured.
At the moment I'm sending multipart emails, and my intention is to put the two pieces of information I need (an 87-character identifier plus the user's email address) inside the boundary and add a string like:
=======================================================
Reply above the previous line
But my doubts are:
What if, for some insane reason, the user uses the same identifier inside the reply? How can I safely identify where the reply starts?
Does it make sense to save the information inside the boundary, or will that information be lost from the header once the user replies?
How can I identify whether the reply email is plain text, HTML, or both?
If anyone has a suggestion, even just about where to put that information, I would be glad to hear it.
There's no standard for the structure of a reply email. It's not usually done using multipart email, it just uses human-readable text, often with > prefixes to denote quoted text. This allows replies to be interspersed inline with the quoted material.
The only standard features of replies are a couple of headers:
In-Reply-To: <ID>
and
References: <ID1> <ID2> <ID3> ...
In-Reply-To contains the message ID of the message that was replied to. References is a growing list of message IDs -- when you reply, you take the original message's reference list and append the ID of the message being replied to at the end.
See RFC 5322 for more details about these headers.
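As a rough illustration of that rule, here is a sketch in Python's standard email package that builds the two headers for a reply. The function name make_reply is invented for the example, and the original message is assumed to carry a Message-ID.

```python
from email.message import EmailMessage


def make_reply(original, body):
    """Create a reply whose headers point back at the original message."""
    parent_id = str(original["Message-ID"]).strip()
    # Start from the original's reference list (empty for a first message).
    references = str(original.get("References", "")).split()

    reply = EmailMessage()
    reply["In-Reply-To"] = parent_id
    # The list grows by one: the ID of the message being replied to.
    reply["References"] = " ".join(references + [parent_id])
    reply.set_content(body)
    return reply
```

A program reading replies over IMAP can apply the same logic in reverse: the last identifier in References (or the one in In-Reply-To) names the message that was answered.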
I'm using SendGrid and their Parse API to send and receive email. The Parse API allows a web app to receive email as a $_POST, but the problem is that I want to be able to extract the message itself from the prior messages and metadata that get chained along in the $_POST.
To show you what I mean, in the following picture I'd just like to capture the text "trying sending from 12373 to 12373 from GMAIL" and not all the junk below it. If that is not possible, does anyone have any suggestions on how to parse the email body ($_POST['text']) so that I can separate out the message itself?
The problem I see is that, depending on the email client (Gmail, Outlook, etc.), it's not clear to me that the date information, in this case "On Wed, Jan 23, 2013...", will always follow the message itself. If all email clients put the date beneath the message, then it would seem I could design a fancy regex to look for a line break followed by a date or something. Thoughts?
You have a couple of options:
1) Insert a token that splits the emails
You could do something like --- reply above this line --- and then cut out everything below that token.
2) Use an email reply parsing library
There is a really good one done by GitHub, but it's in Ruby. There's a PHP port, though, that might be good for what you need.
Fully working code:
<?php
require_once 'application/third_party/EmailReplyParser-master/src/autoload.php';
$email = new \EmailReplyParser\Email();
$reply = $email->read($_POST['text']);
$message=$reply[0]->getContent();
$message=preg_replace('~On(.*?)wrote:(.*?)$~si', '', $message);
//Last line is needed for some email clients, e.g., some university e-mails (foo@bar.edu) but not Gmail or Hotmail, to get rid of "On Jan 23...wrote:"
//This failure to remove "On Jan 23...wrote:" is a known issue and is documented in their README
?>
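Option 1 above (the token) could look roughly like this. It is a sketch in Python under a few assumptions: the marker text is whatever you put into your outgoing mail, the function name extract_reply is made up, and the clean-up of the "On ... wrote:" line only handles English-language clients.

```python
REPLY_MARKER = "--- reply above this line ---"


def extract_reply(text, marker=REPLY_MARKER):
    """Return only the text the user typed above the quoted marker."""
    head, found, _quoted = text.partition(marker)
    if not found:
        return text.strip()  # marker missing: nothing can be cut safely

    lines = head.splitlines()
    # Drop what the mail client added just before the quoted marker:
    # blank lines, bare ">" prefixes, and an English "On ... wrote:" line.
    while lines and (
        lines[-1].strip(" >") == "" or lines[-1].rstrip().endswith("wrote:")
    ):
        lines.pop()
    return "\n".join(lines).strip()
```

If the user deletes the marker or types a reply below it, this returns too much or too little, so it is worth keeping the full original text as well.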
There's simply no guaranteed way to parse quoted message threads from an email message, so you won't find a regex or any other code that will work in all cases. There's no standard to define formatting of replies, and as you've already observed different mail clients use different conventions. Many, in fact, will allow the user to edit the quoted text. Also, users can paste in unrelated messages, with or without headers, resulting in a mix-and-match of formats.
If you can record and keep the history of all messages as they are sent and received, then you can (usually, but not always) use the In-Reply-To header (see RFC 5322) to locate the previous message by matching its Message-ID header, then do a diff on the body and remove duplicate text runs. It's apparent that some email systems do this to improve their presentation, but I'm not aware of any available open source code.
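A very reduced version of that idea, sketched in Python: look up the stored original by the reply's In-Reply-To value and drop every reply line that repeats one of its lines. Note that this is simple line matching standing in for a real diff, and the names strip_known_quotes and sent_bodies (a dict from Message-ID to body text) are assumptions made for the example.

```python
def strip_known_quotes(reply_body, in_reply_to, sent_bodies):
    """Remove reply lines that repeat lines of the stored original message."""
    original = sent_bodies.get(in_reply_to.strip())
    if original is None:
        return reply_body.strip()  # no history for this ID: leave text alone

    def normalize(line):
        # Ignore quote prefixes such as "> " or ">> " when comparing.
        return line.lstrip("> \t").strip()

    original_lines = {
        normalize(line) for line in original.splitlines() if line.strip()
    }
    kept = [
        line
        for line in reply_body.splitlines()
        if normalize(line) not in original_lines
    ]
    return "\n".join(kept).strip()
```

Anything the client adds that was not in the original, such as an attribution line, survives this step, and quoted text that the client re-wrapped will not match, which is part of why the approach works only "usually".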
// cut quoted text, https://regex101.com/r/xO8nI1/5
$message = preg_replace('/(On\s.*<\n){0,1}(.*\n(\n){0,1}((^>+\s?.*$)+\n?)+)/mi', '', $message);
How about replies in languages other than English? We came up with the solution of adding a marker, but instead of translating it for every email (based on the user's language), we put some invisible characters into it (the zero-width space, U+200B, to be precise). Relying on an "On..." regexp is error-prone; it can easily cut off some of the email's content.