Email string parsing using regex - php

I am trying to do a complicated (to me) regex on a multi-line snip from an e-mail. I have tried hard, with no luck. I am trying to get rid of anything from "On " through " wrote:"
Would be nice if you can also check to see if it contains the word "AcmeCompany", so it doesn't check for everything "On " "wrote:"
So far, I have this: /On(.*)AcmeCompany(.*)/im but it does not work...
say hello, world!
On Tue, Jun 7, 2011 at 6:18 AM, AcmeCompany <
24a95f49f7ce573fds2d+c#AcmeCompany.com> wrote:
Thank you for the responses, but it seems like there's another problem.
EDIT: I found out that this works: /On[\s\S]+?AcmeCompany[\s\S]+?wrote:/m, but it seems to fail when the e-mail body itself contains the word "On".
say hello, world!
On a plane!
On Tue, Jun 7, 2011 at 6:18 AM, AcmeCompany <
24a95f49f7ce573fds2d+c#AcmeCompany.com> wrote:
EDIT2: Every mail client is different... Gmail tends to do it in 2 lines, the Mail app on the iPhone does it in 1 line, so it doesn't always follow a strict format.
1 thing for sure: beginning always uses "On " and ends with " wrote:". It also contains a hash and AcmeCompany, which I can also use to verify.

For the new requirement I am adding another reply. Hope you won't mind.
Can you try something like this?
/On\s(Mon|Tue|Wed|Thu|Fri|Sat|Sun)[\s\S]+?AcmeCompany[\s\S]+?wrote:/
I am trying again... how about using this?
/On.+?AcmeCompany[\s\S]+?wrote:/
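As a rough sketch of the weekday idea above, applied to the "On a plane!" sample from the question: the `^` anchor with the `m` modifier and the non-capturing group are additions of mine, not part of the original suggestion, and the variable names are only illustrative.

```php
<?php
// Sample mail body from the question, including a stray line starting with "On".
$mail = "say hello, world!\n"
      . "On a plane!\n"
      . "On Tue, Jun 7, 2011 at 6:18 AM, AcmeCompany <\n"
      . "24a95f49f7ce573fds2d+c#AcmeCompany.com> wrote:\n";

// Assumption: the quote header starts a line with "On " followed by a weekday.
$pattern = '/^On\s(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)[\s\S]+?AcmeCompany[\s\S]+?wrote:/m';

// Remove the header and trim the leftover line breaks.
$body = trim(preg_replace($pattern, '', $mail));

echo $body;
```

Because "On a plane!" is not followed by a weekday, that line is left alone and only the quote header is removed.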

Hope this helps:
/On[\s\S]+?AcmeCompany[\s\S]+?wrote:/
The regular expression above first matches On. Then [\s\S]+? lazily matches any characters (whitespace plus non-whitespace together cover every character, including newlines) until it finds AcmeCompany. A second [\s\S]+? again lazily matches any characters, including newlines, until it finds wrote:
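A small sketch of why [\s\S] is used here instead of the dot, run against the two-line sample from the question. The `s` modifier variant is my own addition for comparison and the variable names are illustrative.

```php
<?php
$mail = "say hello, world!\n"
      . "On Tue, Jun 7, 2011 at 6:18 AM, AcmeCompany <\n"
      . "24a95f49f7ce573fds2d+c#AcmeCompany.com> wrote:\n";

// Without the s modifier the dot does not match a newline,
// so the pattern cannot reach "wrote:" on the next line.
$dotMatched = preg_match('/On.+?AcmeCompany.+?wrote:/', $mail);

// [\s\S] matches any character, newlines included.
$classMatched = preg_match('/On[\s\S]+?AcmeCompany[\s\S]+?wrote:/', $mail, $m);

// The s (dot-all) modifier makes the dot behave the same way.
$dotAllMatched = preg_match('/On.+?AcmeCompany.+?wrote:/s', $mail);

var_dump($dotMatched, $classMatched, $dotAllMatched);
```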

This will work:
On.*AcmeCompany.*
Maybe offtopic but...
If you want to learn regex you should try Expresso

To get the string before On Tue,Jun...:
$str = explode ('On', $yourstring);
$oldstr = array_pop($str); //Remove the last value of the $str array
echo trim( implode('On',$str) ); //Trim the string to remove any unnecessary line breaks
To find if the hidden message contains AcmeCompany:
if( strstr ( $oldstr , 'AcmeCompany' ) ) {
echo "I found AcmeCompany!";
} else {
echo "I didn't find AcmeCompany!";
}
Hope my answer is useful, even though I didn't use regex.

Try this: /On.*AcmeCompany <$[^:]+:/im, the m is important as it lets the $ match line breaks.

Related

PHP preg_replace cuts of my $subject string

I was working on this project of mine when I encountered the following problem. I have a link which goes to:
file.php?page=1&color=all&pos=all&nat=all&mine=all&tree=all
Now, I wanted to change the color to 'gold' so I looked around on Google and found this php function called preg_replace(). So I implemented it in my code like this:
$pre='?page=1&color=all&pos=all&nat=all&mine=all&tree=all';
preg_replace('/color=(.*)&/', 'color=gold&', $pre);
For some reason my output is ?page=1&color=gold&tree=all so it seems that it cut off the middle of the string somehow.
This is the link I expect as my output: ?page=1&color=gold&pos=all&nat=all&mine=all&tree=all
Can anybody tell me what it is I'm doing wrong? Thanks!
Regular expression quantifiers are greedy by default. You said "find color=" and then "get as much as you can up to the last &". What you want is "get as much as you can as long as it is not a &". That would be:
preg_replace('/color=[^&]*/','color=gold',$pre);
The [^&] means "anything except &". Also - you aren't using the match, so you don't need the parentheses.
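To see the difference side by side, here is a short sketch using the query string from the question. The lazy `.*?` variant is my own addition for comparison; it was not suggested in the answer.

```php
<?php
$pre = '?page=1&color=all&pos=all&nat=all&mine=all&tree=all';

// Greedy: .* runs to the end and backtracks to the LAST "&".
$greedy = preg_replace('/color=(.*)&/', 'color=gold&', $pre);

// Negated class: stops at the first "&" because "&" is excluded.
$negated = preg_replace('/color=[^&]*/', 'color=gold', $pre);

// Lazy quantifier: expands only until the first "&".
$lazy = preg_replace('/color=.*?&/', 'color=gold&', $pre);

echo $greedy, "\n", $negated, "\n", $lazy, "\n";
```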

Regex PHP code to scrape street address that has a line break

Been searching for two days now with Google, and a lot on SOF here, but I can't solve this regex preg_match problem. I want to simply scrape a street address, and normally I can do this easily, but because some street addresses have line breaks in the middle of them with around 25 characters of whitespace, my code displays an empty array or just NULL.
Below I have included the source code to show an example of what I'm trying to scrape, and also the failed code I have so far. Any help from someone with more experience than I, would be greatly appreciated this Sunday morning.
Sample of source code here;
<span style="font-size:14px;">736
E 17th St</span><br />
My attempt so far;
$new_data = file_get_contents('someURLaddress');
$street_address_regex = '~14px\;\"\>(.*?)\<\/span\>\<br\s\/\>\s~s';
preg_match($street_address_regex,$new_data,$extracted_street_address);
var_dump ($extracted_street_address);
I'm only doing this because it is horrible practice to use a dot. The giveaway that you're doing something wrong in Regular Expressions is when you use the Single-Line option. That's a huge waste of resources and bound to break at some point.
This is 99.9% positively what you need to use:
$street_address_regex = '~14px;">([^<]*)~i';
Or, if you are (for some reason) expecting a < as a legitimate character, either meaning Less-than or formatting tags like bold or italics, then you can do this:
$street_address_regex = '~14px;">([^<]*<)*?\/span~i';
And if it bothers you enough that you don't want to have to format out the last < character you'll get in your string, you can do this:
$street_address_regex = '~14px;">((?:[^<]*(?(?!<\/span)<))*)~i';
But honestly, you shouldn't even be using Regex. Find the stripos of <span style="font-size:14px;"> and add its length (to get the Address Starting Point)... Then find the stripos of </span> and input the offset point of the previously found Index (to get the Address Ending Point). Subtract them to get the length. Then pull the substr using the OriginalString, StartIndex, And Length.
Sounds like a lot, but make that a small function that you use instead of Regex, and just input the OriginalString, StartString, and EndString... then return the contents between StartString and EndString using the method I just said. Make the function re-usable.
With that function, that portion of your code will literally run 10 times faster, at least. Regex is powerful as hell for patterns, but you don't have a pattern, you have two static strings from which you want the contents between them. Regex is slow as hell for static string manipulation... Especially using the Dot with Single-Line ~Shiver~
$Input = '<span style="font-size:14px;">736 E 17th St</span><br />';
echo GetBetween($Input, '14px;">', '</span');
function GetBetween($OrigStr, $StartStr, $EndStr) {
$StartPos = stripos($OrigStr, $StartStr) + strlen($StartStr);
$EndPos = stripos($OrigStr, $EndStr, $StartPos);
return substr($OrigStr, $StartPos, $EndPos - $StartPos);
}
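Since the address in the question contains a line break followed by a run of spaces, the extracted text still needs its whitespace collapsed. A minimal sketch, assuming the markup looks like the sample in the question (the `$html` value and variable names are made up for illustration):

```php
<?php
// Sample markup with a line break and padding inside the address.
$html = "<span style=\"font-size:14px;\">736\n"
      . "                         E 17th St</span><br />";

$address = null;
if (preg_match('~14px;">([^<]*)~i', $html, $m)) {
    // Collapse any run of whitespace (including the newline) into one space.
    $address = trim(preg_replace('/\s+/', ' ', $m[1]));
}

var_dump($address);
```

The same whitespace clean-up can be applied to the return value of GetBetween().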

preg_match and remove multiple characters in string?

Hi I'm using php to program my site and I've been reading loads about preg_match and trying lots of examples. What I'm trying to do is something like below...
if(preg_match('These characters', $myVariable, $matches)){
Find and remove found characters from $myVariable;
}
I'm pretty sure this is obvious to php experts but it's had me stuck after hours of trying and reading.
Thanks in advance
You don't need to check for a match before doing a replace. It's like if you were to do:
str_replace("A","B","ZZZZZZZ");
It just won't replace anything. Same goes for preg_replace: If there is no match, it just does nothing.
It sounds like you should be using preg_replace. If you wanted to remove all y's and o's for example you would do this:
$string = 'hey you guys!';
$ans = preg_replace('/[yo]/','',$string);
print_r($ans); //outputs 'he u gus!'
Whatever characters you want to remove, just put them between the brackets [...]
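If you still want to know whether anything was removed, preg_replace can report it through its optional fifth argument, so a separate preg_match is not needed. A small sketch (the variable names are illustrative):

```php
<?php
$string = 'hey you guys!';

// The fifth argument receives the number of replacements made.
$ans = preg_replace('/[yo]/', '', $string, -1, $count);

// With no match the subject comes back unchanged and the count is 0.
$same = preg_replace('/[yo]/', '', 'ZZZZZZZ', -1, $noneCount);

var_dump($ans, $count, $same, $noneCount);
```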

Regex to conditionally replace Twitter hashtags with hyperlinks

I'm writing a small PHP script to grab the latest half dozen Twitter status updates from a user feed and format them for display on a webpage. As part of this I need a regex replace to rewrite hashtags as hyperlinks to search.twitter.com. Initially I tried to use:
<?php
$strTweet = preg_replace('/(^|\s)#(\w+)/', '\1#\2', $strTweet);
?>
(taken from https://gist.github.com/445729)
In the course of testing I discovered that #test is converted into a link on the Twitter website, however #123 is not. After a bit of checking on the internet and playing around with various tags I came to the conclusion that a hashtag must contain alphabetic characters or an underscore in it somewhere to constitute a link; tags with only numeric characters are ignored (presumably to stop things like "Good presentation Bob, slide #3 was my favourite!" from being linked). This makes the above code incorrect, as it will happily convert #123 into a link.
I've not done much regex in a while, so in my rustiness I came up with the following PHP solution:
<?php
$test = 'This is a test tweet to see if #123 and #4 are not encoded but #test, #l33t and #8oo8s are.';
// Get all hashtags out into an array
if (preg_match_all('/(^|\s)(#\w+)/', $test, $arrHashtags) > 0) {
foreach ($arrHashtags[2] as $strHashtag) {
// Check each tag to see if there are letters or an underscore in there somewhere
if (preg_match('/#\d*[a-z_]+/i', $strHashtag)) {
$test = str_replace($strHashtag, ''.$strHashtag.'', $test);
}
}
}
echo $test;
?>
It works; but it seems fairly long-winded for what it does. My question is, is there a single preg_replace similar to the one I got from gist.github that will conditionally rewrite hashtags into hyperlinks ONLY if they DO NOT contain just numbers?
(^|\s)#(\w*[a-zA-Z_]+\w*)
PHP
$strTweet = preg_replace('/(^|\s)#(\w*[a-zA-Z_]+\w*)/', '\1#\2', $strTweet);
This regular expression says a # followed by 0 or more characters [a-zA-Z0-9_], followed by an alphabetic character or an underscore (1 or more), followed by 0 or more word characters.
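Here is a quick check of that pattern against the test tweet from the question. The anchor markup and search URL in the replacement are my assumption of what the link should look like; adjust them to whatever your page needs.

```php
<?php
$tweet = 'This is a test tweet to see if #123 and #4 are not encoded '
       . 'but #test, #l33t and #8oo8s are.';

// Hashtag must contain at least one letter or underscore somewhere.
$pattern = '/(^|\s)#(\w*[a-zA-Z_]+\w*)/';

// Assumed link format; \1 keeps the leading whitespace, \2 is the tag text.
$replacement = '\1<a href="http://search.twitter.com/search?q=%23\2">#\2</a>';

$linked = preg_replace($pattern, $replacement, $tweet);

echo $linked;
```

Only #test, #l33t and #8oo8s are wrapped in links; #123 and #4 stay as plain text.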
http://rubular.com/r/opNX6qC4sG <- test it here.
It's actually better to search for characters that aren't allowed in a hashtag, otherwise tags like "#Trentemøller" won't work.
The following works well for me...
preg_match('/([ ,.]+)/', $string, $matches);
I have devised this: /(^|\s)#([[:alnum:]])+/gi
I found Gazler's answer to work, although the regex added a blank space at the beginning of the hashtag, so I removed the first part:
(^|\s)
This works perfectly for me now:
#(\w*[a-zA-Z_0-9]+\w*)
Example here: http://rubular.com/r/dS2QYZP45n

regular expression and forward slash

I'm searching for keywords in a string via a regular expression. It works fine for all keywords, except one which contains a forward slash: "time/emit".
Even using preg_quote($find,'/'), which escapes it, I still get the message:
Unknown modifier 't' in /frontend.functions.php on line 71
If I print the find pattern, it shows /time\\/emit/ . Without preg_quote, it shows /time/emit/ and both return the same error message.
Any bit of knowledge would be useful.
Try to begin and end your regular expression with different sign than /
I personally use `
I've seen people using #
I think most chars are good. You can read more about it here: http://pl.php.net/manual/en/regexp.reference.delimiters.php
Like this:
preg_match('#time/emit#', $subject); // instead of /time/emit/
To put it another way: Your $find variable should contain rather #time/emit# than /time/emit/
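A minimal sketch of both options, building the pattern from a keyword. The variable names and the sample subject string are made up for illustration.

```php
<?php
$find = 'time/emit';

// Option 1: keep "/" as the delimiter and let preg_quote escape it.
$slashPattern = '/' . preg_quote($find, '/') . '/';

// Option 2: use "#" as the delimiter, so "/" needs no escaping.
$hashPattern = '#' . preg_quote($find, '#') . '#';

$subject = 'set time/emit now';
var_dump(preg_match($slashPattern, $subject), preg_match($hashPattern, $subject));
```

Passing the delimiter as the second argument to preg_quote matters: without it, the delimiter character itself is not escaped.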
looks like you have something already escaping it..
preg_quote('time/emit', '/') // returns time\/emit
preg_quote('time\/emit') // returns time\\/emit
as a hack you could simply do:
preg_quote(stripslashes($find), '/') // will return time\/emit
bit of code?
The 'regex' for that particular term should look something like '/time\/emit/'. With a set of keywords there may be a more efficient method, so seeing what you are doing would be good.
this should work:
$a="Hello////////";
$b=str_replace("//","/",$a); // argument order is search, replace, subject
echo $b;
