I'm having a little trouble figuring out a pattern to identify the beginning of inline replies/forwards in an email body. Some are easy because they simply begin with something like "Begin forwarded message", but the replies are a little more complicated:
On 12-06-13 10:56 AM, "John Doe" <john.doe@some.tld> wrote:
Obviously the constants will be "On" and "wrote:". I'd like to be able to find only the first match and then either wrap everything after it in a div with display:none applied or even just eliminate it using substr($body,0, POSITION_OF_MATCH).
One issue I'm having is that it's not catching the FIRST occurrence; the other is that I can't get the greediness to work properly.
My progress (having fallen back to at least a partially working version) so far is:
preg_match("/On [^>]* wrote:/i",$content,$matches,PREG_OFFSET_CAPTURE);
Any help would be greatly appreciated!
You can probably break this down by elements; so you basically have:
On DATE, "NAME" <EMAIL> wrote:
You can then characterize DATE, NAME, and EMAIL.
DATE is composed of numbers, dashes, spaces, colons, and letters. However, it ends with a comma, so you can use that instead.
NAME is composed of letters and spaces, though it is delimited by quotes, and you can probably handle that.
EMAIL is a bit more complicated, but emails cannot contain the character >, so you should be able to capture everything but that.
So you basically get:
On [anything but comma], "[anything but "]" <[anything but >]> wrote:
Which, in regex, is something like:
/^On ([^,]+), \"([^\"]+)\" <([^>]+)> wrote:$/
Then, when using preg_match, you can get your matches from some $matches array, indices 1 through 3.
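As a rough sketch of how that could look in practice (the sample body and the variable names are made up for illustration, and the m modifier is an assumption added so that ^ and $ match at line boundaries inside a longer body):

```php
<?php
// Hypothetical email body containing one reply header.
$body = "Sounds good, I will be there.\n\n"
      . "On 12-06-13 10:56 AM, \"John Doe\" <john.doe@some.tld> wrote:\n"
      . "> Are you coming to the meeting?\n";

// Same pattern as above; m lets ^ and $ anchor to individual lines.
$pattern = '/^On ([^,]+), "([^"]+)" <([^>]+)> wrote:$/m';

$date = $name = $email = null;
if (preg_match($pattern, $body, $matches)) {
    // Index 0 is the whole match; 1 through 3 are DATE, NAME and EMAIL.
    [, $date, $name, $email] = $matches;
    echo "date: $date\nname: $name\nemail: $email\n";
}
```

Headers that deviate from this exact layout (no quotes around the name, a comma inside the date, and so on) will not match this pattern.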
I wonder how your current version works at all, because you cannot possibly match the closing >. But you could do something like this:
$content = preg_replace('/(On [^>]*> wrote:).*$/s', '$1', $content);
This matches the first On ... wrote: and everything after it up to the end of the string, and replaces all of that with just the On ... wrote: part.
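A small sketch of that replacement in action, assuming an invented sample body; the s modifier is what lets .* run across line breaks:

```php
<?php
// Hypothetical body: new text, then a reply header, then quoted text.
$content = "Reply text.\n\n"
         . "On 12-06-13 10:56 AM, \"John Doe\" <john.doe@some.tld> wrote:\n"
         . "> quoted line one\n"
         . "> quoted line two\n";

// Keep the header (group 1) and drop everything after it.
$trimmed = preg_replace('/(On [^>]*> wrote:).*$/s', '$1', $content);

echo $trimmed, "\n";
```

Because the match is case-sensitive and requires a > before " wrote:", headers without an angle-bracketed address are left untouched.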
I suggest
$email = preg_match('/^On [^"]*"[^"]*" <([^>]*)> wrote:$/', $str, $re) ? $re[1] : '';
I appreciate the other answers, but none of them really took into account the many possible variations in the reply strings I was dealing with. That might have been my fault for not explaining properly or providing more examples. I've +1'd everyone for their efforts, though.
The final solution which seems to be working best after a day of fiddling with it on and off is this:
/On (Mon|Tue|Wed|Thu|Fri|Sat|Sun|[[:digit:]]{1,2})(.*?) wrote:/i
The option list that it begins with covers a range of different reply types that start with "On Tue...", "On 23...", "On 1...", etc. This ensures that the greediness I was complaining about doesn't take in too much from random "on" strings elsewhere. The (.*?) takes care of the rest of the name/email portion, and "wrote:" finishes it off.
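For completeness, here is a minimal sketch of how this pattern could be combined with PREG_OFFSET_CAPTURE and substr() to cut the body at the first reply header, as described in the question. The function name stripQuotedReply and the sample body are hypothetical:

```php
<?php
// Returns the body up to (not including) the first "On ... wrote:" header.
function stripQuotedReply(string $body): string
{
    $pattern = '/On (Mon|Tue|Wed|Thu|Fri|Sat|Sun|[[:digit:]]{1,2})(.*?) wrote:/i';

    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
    if (preg_match($pattern, $body, $matches, PREG_OFFSET_CAPTURE)) {
        return rtrim(substr($body, 0, $matches[0][1]));
    }

    return $body; // no reply header found, leave the body alone
}

$body = "Sounds good, I will be there.\n\n"
      . "On 12-06-13 10:56 AM, \"John Doe\" <john.doe@some.tld> wrote:\n"
      . "> Are you coming to the meeting?\n";

echo stripQuotedReply($body), "\n";
```

Since . does not match a newline without the s modifier, the header has to sit on a single line for this to match; clients that wrap the header across lines would need a different approach.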
I am using a regex pattern to find instances of [code][/code] BB tags. (This is in PHP with Perl-type regex using preg_match / preg_replace / etc.)
'~\[code\](.*?)\[\/code\]~is'
Well, my question is how can I make it so somebody could type something like:
[code][code]code here[/code][/code]
The purpose of typing this would be to demonstrate to a newbie how to place their code into [code][/code] tags.
Currently, if I type that, the regex stops at the 1st instance of "[/code]" and does not keep looking ahead to see the 2nd instance of "[/code]".
I can't post images since I'm a new user, but here is a screenshot of the output:
http://i.imgur.com/t8zNh.png
I know there is a term in regex called "positive look ahead" and "negative look ahead", but I'm not quite sure what they mean, or if they are relevant to my situation. Could someone please give me a hand? Thank you.
EDIT: I'm sorry but I don't seem to have enough rep to +1 anything. I really appreciate your help, and it was so fast.
You can try this pattern:
'~\[code\](.*?)(\[\/code\])+~is'
If you want to cover also the case that there can be more code between the two closing tags you can try this one
\[code\].*?\[\/code\](?:(?:(?!\[code\]).)*\[\/code\])?
The first part \[code\].*?\[\/code\] is matching from the first opening tag to the first closing tag.
Then comes the tricky part, which is optional: (?:(?:(?!\[code\]).)*\[\/code\])? matches further characters, as long as no opening tag follows, up to the last closing tag found.
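To see how the optional second part behaves, here is a short sketch. The sample text is invented, and the ~ delimiters and is modifiers are carried over from the question:

```php
<?php
// First part: opening tag to first closing tag.
// Optional part: more characters (never stepping onto a new [code]) up to one more [/code].
$pattern = '~\[code\].*?\[\/code\](?:(?:(?!\[code\]).)*\[\/code\])?~is';

$text = 'Wrap it like [code][code]echo 1;[/code][/code] and later [code]echo 2;[/code]';

preg_match_all($pattern, $text, $found);
print_r($found[0]);
```

The nested demonstration is picked up as one match, while the later, separate block stays a match of its own, because the (?!\[code\]) check prevents the optional part from running into the next opening tag.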
All you need to do is remove the ? in (.*?).
With the ?, it is a lazy match that matches as few characters as possible, which is why it stops at the first [/code].
Without the ?, it is a greedy match that matches as many characters as possible, so it runs to the outer end tag.
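A quick sketch of the difference, using the example from the question plus one invented string with two separate blocks to show a side effect of the greedy version:

```php
<?php
$text = '[code][code]code here[/code][/code]';

// Lazy: stops at the first closing tag.
preg_match('~\[code\](.*?)\[\/code\]~is', $text, $lazy);

// Greedy: runs to the last closing tag.
preg_match('~\[code\](.*)\[\/code\]~is', $text, $greedy);

// Side effect of greedy matching: two separate blocks are merged into one match.
$two = '[code]a[/code] text [code]b[/code]';
preg_match('~\[code\](.*)\[\/code\]~is', $two, $merged);

var_dump($lazy[1], $greedy[1], $merged[1]);
```

So the greedy form handles the nested demonstration, but it only behaves well when there is a single [code] block in the text.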
Here's what I have so far:
/(^|\s)(http:\/\/(\S+)(?!(.png|.gif|.jpg)($|\.\s|\.$|\s)))($|\.\s|\.$|\s)/i
And I'm replacing it like so:
'$1$2$6'
Sometimes, my users type something like this: http://google.com. <- How do I avoid including that final period without parsing out other periods that are in URLs?
Also, in case you're wondering what the .gif .png etc is for, I'm parsing out images to automatically create elements.
Edit:
This is for PHP.
This is for a forum where users post lots of things including links. It successfully handles every situation except for punctuation after the URL.
Edit 2:
Parse out might be the wrong word. I'm not trying to remove the punctuation, just separate it from the URL so I can display a working link to my users.
Edit 3:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
I haven't tested it fully yet, but it seems to work. I'll make it a solution after I've tested. Or if someone else wants points, feel free to test and I'll vote for your solution.
So updated solution:
/\b(http:\/\/(\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))))(?<!(?:\.(?:png|gif|jpg)))/i
I replaced your (^|\s) with \b, a word boundary, which is exactly what you want here.
I changed your (\S+) to (\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))). Basically, it matches non-whitespace characters up to a point where $|\s|\.(?:$|\s) lies ahead and there is no dot immediately to the left (the (?<!\.) part).
The look-around that follows it needs to be a lookbehind.
I also cleaned up your brackets and alternations a bit and used some non-capturing groups (the groups that start with (?:).
So for your test string, "users type something like this: http://google.com. <- How do I avoid", it will match http://google.com, with http://google.com in the first group and google.com in the second group.
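A short sketch of the updated pattern in PHP. The sample strings and the anchor markup in the replacement are assumptions for illustration; the original replacement string may have looked different:

```php
<?php
$pattern = '/\b(http:\/\/(\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))))(?<!(?:\.(?:png|gif|jpg)))/i';

$text = 'Sometimes users type http://google.com. <- like this';

// Group 1 is the full URL, group 2 is the part after http://
preg_match($pattern, $text, $m);

// Hypothetical replacement: wrap the URL in a link, leaving the final period outside.
$linked = preg_replace($pattern, '<a href="$1">$1</a>', $text);

// Image URLs are rejected by the trailing lookbehind.
$imageMatches = preg_match($pattern, 'http://example.com/pic.png');

echo $linked, "\n";
```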
PHP solution:
$line = 'http://www.google.com.';
echo preg_replace(
    "/(\s*)((http:\/\/)?(\S+?(.png|.gif|.jpg)?))(\W*)$/i",
    '$1$2$6',
    $line
), "\n";
I'm building this regex with a positive look ahead in it. Basically it must select all text in the line up to the last period that precedes a ":" and add a "|" to the end to delimit it. Some sample text is below. I am testing this in gskinner and EditPad Pro, which apparently has full grep regex support, so I'd appreciate answers in that form.
The regex below works to a degree but I am unsure if it is correct. Also it falls down if the text contains brackets.
Finally I would like to add another ignore rule like the one that ignores but includes "Co." in the selection. This second ignore rule would ignore but include periods that have a single Capital letter before them. Sample text below too. Thanks for all the help.
^(?:[^|]+\|){3}(.*?)[^(?:Co)]\.(?=[^:]*?\:)
121| Ryan, T.N. |2001. |I like regex. But does it like me (2) 2: 615-631.
122| O' Toole, H.Y. |2004. |(Note on the regex). Pages 90-91 In: Ryan, A. & Toole, B.L. (Editors) Guide to the regex functionality in php. Timmy, Tommy& Stewie, Quohog. * Produced for Family Guy in Quohog.
I don't think I understand what you want to do. But this part [^(?:Co)] is definitely not correct.
With the square brackets you are creating a character class, because of the ^ it is a negated class. That means at this place you don't want to match one of those characters (?:Co), in other words it will match any other character than "?)(:Co".
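A tiny sketch makes the character-class behaviour visible. The test characters are arbitrary, and the lookbehind at the end is only one possible way to express "a dot not preceded by Co", not something taken from the question:

```php
<?php
// [^(?:Co)] is a negated class over the six characters ( ? : C o )
$class = '/^[^(?:Co)]$/';

$rejected = [];
$accepted = [];
foreach (['C', 'o', '(', '?', ':', ')', 'x', '.'] as $ch) {
    if (preg_match($class, $ch)) {
        $accepted[] = $ch;
    } else {
        $rejected[] = $ch;
    }
}

// Possible alternative (an assumption): a negative lookbehind on the literal dot.
$afterCo    = preg_match('/(?<!Co)\./', 'Acme Co.');
$afterRegex = preg_match('/(?<!Co)\./', 'I like regex.');

print_r(['accepted' => $accepted, 'rejected' => $rejected]);
```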
Update:
I don't think it's possible. How should I distinguish between L. Co. or something similar and the end of a sentence?
But I found another error in your regex. The last part, (?=[^:]*?\:), should be (?=[^.]*?\:) if you want to match the last dot before the colon; with your expression it will match on the first dot.
This seems to do what you want.
(.*\.)(?=[^:]*?:)
It quite simply matches all text up to the last full stop that occurs before the colon.
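As a sketch of how this could be used to add the "|" delimiter mentioned in the question, applied to the first sample line (the '$1|' replacement and the limit of 1 are assumptions):

```php
<?php
$line = "121| Ryan, T.N. |2001. |I like regex. But does it like me (2) 2: 615-631.";

// Greedy .* backs off from the end of the line until the dot it stops on
// still has a colon somewhere after it.
$pattern = '/(.*\.)(?=[^:]*?:)/';

// Append the delimiter right after the captured text; replace at most once.
$delimited = preg_replace($pattern, '$1|', $line, 1);

echo $delimited, "\n";
```

Note that this simple version does not treat "Co." or single capital initials specially; it just takes the last period that still has a colon after it.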
I'm looking for a simple replacement of [[wiki:Title]] into Title.
So far, I have:
$text = preg_replace("/\[\[wiki:(\w+)\]\]/","\\1", $text);
The above works for single words, but I'm trying to include spaces and on occasion special characters.
I get the \w+, but \w\s+ and/or \.+ aren't doing anything.
Could someone improve my understanding of basic regex? And I don't mean for anyone to simply point me to a webpage.
\w\s+ means "a word-character, followed by 1 or more spaces". You probably meant (\w|\s)+ ("1 or more of a word character or a space character").
\.+ means "one or more dots". You probably meant .+ (1 or more of any character - except newlines, unless in single-line mode).
The more robust way is to use
\[\[wiki:(.+?)\]\]
This means "1 or more of any character, but stop at the first position where the rest of the pattern matches", i.e. at the first closing ]] in this case. Without the ? it would look for the longest available match, i.e. it could run past the first closing brackets.
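A brief sketch of the lazy version next to a greedy one, assuming the double-bracket [[wiki:...]] form from the question; the sample sentence is invented, and '$1' stands in for whatever replacement markup is actually wanted:

```php
<?php
$text = 'See [[wiki:Main Page]] and [[wiki:C++ (language)]].';

// Lazy: each tag is matched on its own, spaces and special characters included.
$lazy = preg_replace('/\[\[wiki:(.+?)\]\]/', '$1', $text);

// Greedy: the match runs from the first opening to the last closing brackets.
$greedy = preg_replace('/\[\[wiki:(.+)\]\]/', '$1', $text);

echo $lazy, "\n", $greedy, "\n";
```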
You need to use \[\[wiki:([\w\s]+)\]\]. Notice square brackets around \w\s.
If you are learning regular expressions, you will find this site useful for testing: http://rexv.org/
You're definitely getting there, but you've got a couple syntax errors.
When you're combining multiple character classes like \w and \s so that either can match at a given position, you have to put them in [square brackets], like so: ([\w\s]+). This basically means one or more word characters or whitespace characters.
Putting a backslash in front of the period escapes it, meaning the regex is searching for a period.
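The difference is easy to check with a few one-liners. This is just an illustrative sketch with made-up inputs:

```php
<?php
// \w\s+ : exactly one word character followed by one or more whitespace characters.
preg_match('/\w\s+/', 'Main Page', $sequence);

// [\w\s]+ : one or more characters, each either a word character or whitespace.
preg_match('/[\w\s]+/', 'Main Page', $class);

// \.+ : one or more literal dots.
preg_match('/\.+/', 'a...b', $dots);

// .+ : one or more of any character (except newline by default).
preg_match('/.+/', 'a...b', $any);

var_dump($sequence[0], $class[0], $dots[0], $any[0]);
```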
As for matching special characters, that's more of a pain. I tried to come up with something quickly, but hopefully someone else can help you with that.
(Great cheat sheet here, I keep a copy on my desk at all times: http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/ )
I have a load of user-submitted content. It is HTML, and may contain URLs. Some of them will be <a>'s already (if the user is good) but sometimes users are lazy and just type www.something.com or at best http://www.something.com.
I can't find a decent regex to capture URLs but ignore ones that are immediately to the right of either a double quote or '>'. Anyone got one?
Jan Goyvaerts, creator of RegexBuddy, has written a response to Jeff Atwood's blog that addresses the issues Jeff had and provides a nice solution.
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
In order to ignore matches that occur right next to a " or >, you could add (?<![">]) to the start of the regex, so you get
(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
This will match full addresses (http://...) and addresses that start with www. or ftp. - you're out of luck with addresses like ars.userfriendly.org...
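Here is a rough sketch of the lookbehind version in PHP. The HTML sample is invented, the slashes are escaped so that / can serve as the delimiter, and the i modifier is added because the character classes only list uppercase letters:

```php
<?php
$pattern = '/(?<![">])\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)'
         . '[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i';

$html = '<a href="http://linked.example/">site</a> and plain www.example.com, '
      . 'plus http://example.org/page.';

// The URL inside href="..." is skipped; trailing punctuation is left out of the matches.
preg_match_all($pattern, $html, $found);

print_r($found[0]);
```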
This thread is old as the hills, but I came across it while working on my own problem: That is, convert any urls into links, but leave alone any that are already within anchor tags. After a while, this is what has popped out:
(?!(?!.*?<a)[^<]*<\/a>)(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]
With the following input:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
This is the output of a preg_replace:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
Just wanted to contribute back to save somebody some time.
I made a slight modification to the Regex contained in the original answer:
(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]
which allows for more subdomains, and also runs a fuller check on tags. To apply this with PHP's preg_replace, you can use:
$convertedText = preg_replace( '#(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]#i', '\0', $originalText );
Note, I removed # from the regex, in order to use it as a delimiter for preg_replace. It's pretty rare that # would be used in a URL anyway.
Obviously, you can modify the replacement text, and remove target="_blank", or add rel="nofollow" etc.
Hope that helps.
To skip existing ones just use a look-behind - add (?<!href=") to the beginning of your regular expression, so it would look something like this:
/(?<!href=")http:\/\/\S*/
Obviously this isn't a complete solution for finding all types of URLs, but this should solve your problem of messing with existing ones.
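A minimal sketch of the look-behind in use, with an invented HTML sample and the slashes escaped so that / can act as the delimiter:

```php
<?php
$pattern = '/(?<!href=")http:\/\/\S*/';

$html = '<a href="http://a.example/">A</a> see also http://b.example/x';

// Only the URL that is not directly preceded by href=" is reported.
preg_match_all($pattern, $html, $found);

print_r($found[0]);
```

Since \S* takes every non-whitespace character, trailing punctuation or markup glued to a bare URL would be included in the match.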
if (preg_match('/\b(?<!=")(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|](?!.*".*>)(?!.*<\/a>)/i', $subject)) {
    # Successful match
} else {
    # Match attempt failed
}
Shameless plug: You can look here (regular expression replace a word by a link) for inspiration.
The question asked to replace some word with a certain link, unless there already was a link. So the problem you have is more or less the same thing.
All you need is a regex that matches a URL (in place of the word). The simplest assumption would be this: a URL (optionally) starts with "http://", "ftp://" or "mailto:" and lasts as long as there are no white-space characters, line breaks, tag brackets or quotes.
Beware, long regex ahead. Apply case-insensitively.
(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)
Be warned: this will also match URLs that are technically invalid, and it will recognize things.formatted.like.this as a URL. Whether it is too permissive depends on your data. I can fine-tune the regex if you have examples where it returns false positives.
The regex will produce two match groups. Group 2 will contain the matched thing, which is most likely a URL. Group 1 will contain either an empty string or 'href="'. You can use it as an indicator that the match occurred inside the href attribute of an existing link, so you don't have to touch that one.
Once you confirm that this does the right thing for you most of the time (with user supplied data, you can never be sure), you can do the rest in two steps, as I proposed it in the other question:
Make a link around every URL there is (unless there is something in match group 1!). This will produce doubly nested <a> tags for things that already have a link.
Scan for incorrectly nested <a> tags, removing the innermost one.
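The first of those two steps could be sketched roughly as follows. This is only an illustration under assumptions: the linkify() name, the sample HTML and the generated anchor markup are all hypothetical, and the second step (cleaning up nested <a> tags) is not covered here:

```php
<?php
// Nowdoc keeps the pattern verbatim, since it contains both quote characters.
$pattern = <<<'REGEX'
~(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)~i
REGEX;

// Hypothetical helper: wraps URL-like text in a link unless group 1 captured href=
function linkify(string $html, string $pattern): string
{
    return preg_replace_callback($pattern, function (array $m): string {
        if ($m[1] !== '') {
            return $m[0]; // already inside an href attribute, leave untouched
        }
        return '<a href="' . $m[2] . '">' . $m[2] . '</a>';
    }, $html);
}

$html = 'Visit www.example.com or <a href="http://linked.example/page">this</a>.';
$result = linkify($html, $pattern);

echo $result, "\n";
```

A URL used as the visible text of an existing link would still be wrapped a second time by this sketch, which is exactly the case the second step is meant to clean up.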