Parsing incoming email content from a range of set templates - PHP

Working on a project that requires incoming email to be parsed, with certain information extracted and stored in the database. We're using Postmark (postmarkapp) to extract the body content of the email, so we only have the plain-text guts of it, but I'm currently a bit stuck on how to parse the email in the most efficient way.
Over time we'll be adding more 'accepted' formats of incoming mail, but to start with we'll probably have 4 common emails coming in; that is, they'll follow the same format, and the information we want to extract (contact details, IDs, links, bio) will be in the same place per supported format.
I'm thinking we'll have an interface that handles the common tasks, and each supported format will implement it; however, just how to get that information out is where I'm stuck.
Open to any thoughts and ideas on different methods / technologies to do this, ideally PHP, but if we need to use something else, that's fine.

There is a similar feature on a site that I developed. Our users get emails from their suppliers with pricing. They copy and paste the body of the email into a textarea on our site and click a button. Then we parse the text to find products and prices and stick the info into a database.
To do the parsing, we first have to determine the supplier, like you'll need to do to determine which template was used. We look for certain strings in the text - the supplier's name usually, or a line that's unique to their emails. We do that in a method called something like getParserForText(). That method returns a Parser object which implements a simple interface with a parseText() method.
There's a Parser implementation class for each format. The parseText() method in each class is responsible for getting the data out of the text. We looked for ways of making these elegant and generic and have simply not found a really good way to do that. We're using a combination of regular expressions, splitting the string into smaller sections, and walking through the string.
Pseudocode:
$text = $_POST['emailBody'];
$parser = getParserForText($text);
$result = $parser->parseText($text);
if (count($result["errors"]) > 0)
{
    // handle errors
}
else
{
    saveToDatabase($result["prices"]);
}
We have no control over the formats the suppliers use, so we have to resort to things like:
split the text into an array of strings around each line with a date (preg_split())
for each element in that array, the first line contains the date, the next three to six lines contain products and prices
pull the date out and then split the string on new lines
for each line, use a regex to find the price ($000.0000) and pull it out
trim the rest of the line to use as the product name
We use a lot of preg_split(), preg_match_all() and explode(). While it doesn't seem to me to be particularly elegant or generic, the system has been very robust. By leaving a little wiggle room in the regular expressions, we've made it through a number of small format changes without needing to change the code. By "wiggle room" I mean things like: don't search for a space, search for any whitespace; don't search for a dollar sign and two digits, search for a dollar sign and any number of digits. Little things like that.
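A stripped-down sketch of what one of those parseText() implementations might look like, written here as a standalone function; the date and price patterns are illustrative, not our production code:
// Illustrative only: split on date lines, then pull a price and product name out of each line.
function parseSupplierText($text)
{
    $errors = array();
    $prices = array();

    // Split the text into blocks, each starting at a line that begins with a date.
    $blocks = preg_split('/^(?=\d{1,2}\/\d{1,2}\/\d{2,4})/m', $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($blocks as $block) {
        $lines = preg_split('/\R/', trim($block), -1, PREG_SPLIT_NO_EMPTY);
        $date  = trim(array_shift($lines)); // the first line holds the date

        foreach ($lines as $line) {
            // Wiggle room: any whitespace after the dollar sign, any number of digits.
            if (preg_match('/\$\s*(\d+(?:\.\d+)?)/', $line, $m)) {
                $prices[] = array(
                    'date'    => $date,
                    'product' => trim(str_replace($m[0], '', $line)),
                    'price'   => $m[1],
                );
            }
        }
    }

    if (count($prices) === 0) {
        $errors[] = 'No prices found in text';
    }

    return array('errors' => $errors, 'prices' => $prices);
}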
EDIT:
Here's a question I asked about it a few years ago:
Algorithms or Patterns for reading text

Since it's generated email, it most likely comes in an easily parsable format, such as one line per instruction: key=value. You can then split each line on the first = sign and use the key-value pairs that this gives you.
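A minimal sketch of that approach (assuming one key=value pair per line; $emailBody is just a placeholder for the raw text):
// Minimal sketch: one key=value pair per line, split on the first '=' only.
$pairs = array();
foreach (preg_split('/\R/', trim($emailBody), -1, PREG_SPLIT_NO_EMPTY) as $line) {
    if (strpos($line, '=') === false) {
        continue; // ignore lines that aren't key=value
    }
    list($key, $value) = explode('=', $line, 2); // a limit of 2 splits on the first '=' only
    $pairs[trim($key)] = trim($value);
}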
Regular expressions are great for when you don't have control over the incoming data format, but when you do, it's easier to make sure it is parsable without a regexp.
If the format is too complex for such simple parsing, please give an example of a file using the format, so I can make the answer more specific. Same thing if this isn't an answer to what you mean to ask: please give an example of the sort of answer you want.

Related

Choosing effective function names

I'm looking for advice on writing a good function name as part of a web page I'm developing. It's coded in PHP, and the function basically reassembles array data holding customer attendance information for a music venue (e.g. time, date, entrance, etc.). The function takes in array data and returns the information formatted as a string that includes HTML structuring.
For instance:
//function formats array
...
$returnStr = "<span class='bold'>Entrance</span>" . $customerData['entrance'];
The reason I ask is that any function name I come up with seems either too verbose or isn't completely clear about what it means. I have to maintain a lot of code, so I'm trying to choose effective names such that when I revisit code, I can quickly grasp what is going on.
Any online resources or personal insight would be appreciated.
There's no black or white in this case. But I believe that the best practice should be:
Logical - Describe what the function does
Comfortable - short and to the point
So you won't have to think, "Wait, what was the name of the function that does X and Y?", and you won't have to type out overly long names like printMusicVenueFromArray.
Both the "logical" and "comfortable" aspects are subjective and may differ from one person to another, so as long as it's only you working on the project, do what feels right.
When you have a team of developers working on a single project, consider drawing up some guidelines first.
Start by describing the function in words: what's the input, what's the output?
Consider other functions that already exist in your code with similar names (you don't want to get confused).
According to your description:
function basically reassembles array data holding customer attendance information for a music venue (e.g. time, date, entrance, etc.). The function takes in array data and returns the information formatted as a string that includes HTML structuring.
Input
array data holding customer attendance information to a music venue
Output
(return) the information formatted as a string that includes HTML structuring.
Usually when I write a function that returns something I'll start its name with get, but your function returns an HTML string, so it's more of a view function and you can skip that prefix.
Now you should think about what best describes the returned string; in my opinion, something like "MusicProfile" or "MusicDetails".
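For example (the exact name is of course up to you, this is purely illustrative):
// Purely illustrative name combining the points above: it reads as a
// "view"-style function that returns the details as an HTML string.
function musicDetailsHtml(array $customerData)
{
    return "<span class='bold'>Entrance</span> " . htmlspecialchars($customerData['entrance']);
}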
BTW, double-check the quoting and concatenation in that snippet so the array value actually ends up in the string.

regular expression that will extract sentences from text file

I need a regular expression that will extract sentences from a text file.
example text :
Consider, for example, the Asian tsunami disaster that happened in the end of 2004. A query to Google News (http://news.google.com) returned more than 80,000 online news articles about this event within one month (Jan.17 through Feb.17, 2005). information by mr. Kahana.
here's my code :
$re = '/(?<=[.!?]|[.!?][\'"])\s+/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
but the last sentence still gets split between "mr." and "Kahana."
How do I solve this? Thank you :)
You Can't Do this with Regular Expressions
English as a language does not follow neat, well-defined formatting rules. As such, regular expressions are not fit for the purpose you're after. What you are really looking for is something like a Natural Language Processor.
Unless this is critical to your program, I suggest you instead determine the following things:
What is an acceptable level of error? Nothing you do will be perfect. But if it works 80% of the time, is that okay? 90%? 99%? How critical is this to you/your client?
Where is the text coming from? For example, a textbook will most likely be written differently than people's twitter feeds. You can do research and make exceptions based on what you see in the actual text you are using.
What am I doing with the text? If you are just indexing things like keywords, then it doesn't matter (as much) if you get the sentences split correctly. It's all about tuning the program to get the appropriate output for this specific purpose.
My recommendation is to use trial and error to get your error rate down as much as possible. Run your program on a large set of text, and keep adding exceptions until you get an acceptable error rate. If, however, you need more than a couple dozen rules or so, you will probably just want to rethink the problem.
In short, PHP and Regular Expressions aren't meant for this because English is funky. So either live with adding exceptions to get a small(er) error rate, or rethink the point altogether.
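If you do go the exception route, one common trick is to protect known abbreviations before splitting and restore them afterwards. A rough sketch, with a made-up abbreviation list:
// Rough sketch: protect known abbreviations so the split doesn't fire on them.
// The abbreviation list is made up; extend it as you hit new exceptions.
$protected = preg_replace('/\b(mr|mrs|dr|prof|jan|feb)\./i', '$1<DOT>', $text);

$sentences = preg_split('/(?<=[.!?]|[.!?][\'"])\s+/', $protected, -1, PREG_SPLIT_NO_EMPTY);

// Put the dots back in each sentence.
$sentences = str_replace('<DOT>', '.', $sentences);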

Extract URL containing /find/ from numerous URL's?

I'm really a major novice at RegEx and could do with some help.
I have a long string containing lots of URLs and other text, and one of the URLs has /find/ in it, i.e.:
1. http://www.example.com/not/index.html
2. http://www.example.com/sat/index.html
3. http://www.example.com/find/index.html
4. http://www.example.com/rat/mine.html
5. http://www.example.com/mat/find.html
What sort of RegEx would I use to return the URL that is number 3 in that list, but not also return number 5? I suppose basically what I'm looking for is a way of returning a whole word that contains a specific set of letters and slashes in order.
TIA
I would assume you want preg_match("%/find/%",$input); or similar.
EDIT: To get the full line, use:
preg_match("%^.*?/find/.*$%m",$input);
I can suggest using RegExr to build regular expressions.
You can type in a sample list (like the one above) and use a palette to create a RegExp and test it in real time. The program is available both online and as a downloadable Adobe AIR package.
Unfortunately I cannot access their site right now, so I'm attaching the AIR package of the downloadable version.
I really recommend it, since it helped a RegExp newbie like me design even quite complex patterns.
However, for your question, I think that just
\/find\/
works if you only need a yes/no result (i.e. whether or not it contains /find/); otherwise, to get the full line, use
.*\/find\/.*
In addition to Kolink's answer, in case you wanted to regex match the whole URI:
This is by no means an exhaustive regex for URIs, but it's a good starting point. I threw in a few options at key points, like .com, .net, and .org. In reality you'll have a fairly hard time matching URIs with regular expressions due to the lack of conformity, but you can come very close.
The regex:
/(https?:\/\/)?(www\.)?([a-zA-Z0-9-_]+)\.(com|org|net)\/(find)\/([a-zA-Z0-9-_]+)\.(html|php|aspx)?/is
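As a quick illustration (assuming $text holds the whole list from the question), this picks out URL 3 and leaves URL 5 alone:
$pattern = '/(https?:\/\/)?(www\.)?([a-zA-Z0-9-_]+)\.(com|org|net)\/(find)\/([a-zA-Z0-9-_]+)\.(html|php|aspx)?/is';

// $text is assumed to hold the block of URLs and other text.
if (preg_match($pattern, $text, $matches)) {
    echo $matches[0]; // http://www.example.com/find/index.html
}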

PHP library for word clustering/NLP?

What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?
Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count the occurrences of each, and sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation into separate words first, e.g. "Something, like this." -> "Something , like this ." Or you can just remove all punctuation.
$content = strtolower($content);
$content = preg_replace('/[^a-z\s]/', ' ', $content); // remove punctuation
$words = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split into individual words

$stopwords = 'the|and|is|your|me|for|where|etc...';
$stopwords = explode('|', $stopwords);
$stopwords = array_flip($stopwords);

$result = array();
$temp = array();
foreach ($words as $s) {
    if (isset($stopwords[$s]) OR strlen($s) < 3) {
        // A stopword (or a very short word) ends the current phrase.
        if (sizeof($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s;
    }
}
if (sizeof($temp) > 0) {
    $result[] = implode(' ', $temp);
}

$phrases = array_count_values($result);
arsort($phrases);
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.
I would see if any of the top 3 array keys match any of the top 3 from any other in the data. These are then your groups.
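A rough sketch of that matching step; $phrasesPerResult is a made-up variable mapping each search result to its own $phrases array from the code above:
// Two results are grouped together if any of their top 3 phrases overlap.
function topPhrases($phrases, $n = 3)
{
    return array_slice(array_keys($phrases), 0, $n);
}

$matches = array();
$ids = array_keys($phrasesPerResult);
for ($i = 0; $i < count($ids); $i++) {
    for ($j = $i + 1; $j < count($ids); $j++) {
        $shared = array_intersect(
            topPhrases($phrasesPerResult[$ids[$i]]),
            topPhrases($phrasesPerResult[$ids[$j]])
        );
        if (count($shared) > 0) {
            $matches[] = array($ids[$i], $ids[$j], array_values($shared));
        }
    }
}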
Let me know if you have any trouble with this.
"... cluster them into meaningful groups" is a bit to vague, you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
EDIT: You could try some data mining yourself by cross-referencing search results with something like the Open Directory (dmoz) RDF data dump and then enumerating the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?
If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off, but check out OpenCalais. They have a web service which allows you to pass in a block of text, and it will pass back a parseable response of things that it found in the text, such as places, people, facts, etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in php and it's always been quite easy to work with.
Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
I'll work on query examples if you want.

Improve a regex statement in order to be as efficient as it can be

I have a PHP program that, at some point, needs to analyze a large amount of HTML + JavaScript text to extract info.
The parsing needs to happen in two parts:
Separate all "HTML groups" to parse
Parse each HTML group to get the needed information.
In the 1st parse it needs to find:
<div id="myHome"
And start capturing after that tag. Then stop capturing before
<span id="nReaders"
And capture the number that comes after this tag and stop.
In the 2nd parse, use capture no. 1 (0 has the whole thing and 2 has the number) from the parse made before, and then apply the second pattern below.
I already have code to do that and it works. Is there a way to improve this, make it easier for the machine to parse?
preg_match_all('%<div id="myHome"[^>]*>(.*?)<span id="nReaders"[^>]*>([0-9]+)<%msi', $data, $results, PREG_SET_ORDER);
foreach ($results as $result) {
    preg_match_all('%<div class="myplacement".*?[.]php[?]((?:next|before))=([0-9]+).*?<tbody.*?<td[^>]*>.*?[0-9]+%msi', $result[1], $mydata, PREG_SET_ORDER);
    // takes care of the data and finishes the program
}
Note: I need this for a freeware program, so it must be as general as possible and, if possible, not use PHP extensions.
ADD:
I omitted some parts here because I wasn't expecting answers like these.
There is also a need to parse text inside one of the tags in the document. It may be the 6th, 7th or 8th tag, but I know it comes after a certain tag. The parser I've checked (thx profitphp) does work for finding the script tag. What now?
There is more than one tag with the same class, and I want them all, but only the ones that also have one of a list of other classes.
Where can I find documentation, demos and the limitations of DOM parsers (like the one at http://simplehtmldom.sourceforge.net/)? I need something that will work on at least a large number of free hosts.
Another thing. How do I parse this part:
"php?=([0-9]+)"
with those HTML parsers?
If you're concerned about efficiency (and indeed accuracy), don't attempt to parse HTML using regex.
You should use a parser, such as PHP's DOM.
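For the two tags in the question, a rough DOM sketch could look like this (untested, and it assumes the number you want is the text content of the span):
// Rough DOM-based sketch for the two tags from the question.
$doc = new DOMDocument();
libxml_use_internal_errors(true);      // real-world HTML is rarely well-formed
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$home = $xpath->query('//div[@id="myHome"]')->item(0);
$nReaders = $xpath->query('//span[@id="nReaders"]')->item(0);

if ($home !== null && $nReaders !== null) {
    $blockHtml = $doc->saveHTML($home);               // the whole "HTML group"
    $readers   = (int) trim($nReaders->textContent);  // the number in the span
}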
As noted above, regex is not a good fit for this. You'll be better off using something like this:
Robust and Mature HTML Parser for PHP
Efficiency doesn't matter if your results are incorrect. Parsing HTML with regexes will lead to incorrect results down the road. Use a parser.
I found a way to create efficient searches.
If you want to search for "A huge string in a whole text" you can do it this way:
(?:(?:[^A]*A)+? huge string in a whole text)
It always works, and it only creates a backtracking point at every 'A' character rather than at every single character. Because of that it is efficient in both memory and processing power. If there are two options, it also works without a problem:
(?:(?:[^AB]*AB)+?(?: huge string in a whole text|e the huge string in a whole text))
Up until now it has never failed.
