PHP Regex to get multiline value out of Email Header

I have a quick question. How do I get a value from an email header that spans multiple lines?
Here is an example subject value in the email header:
Subject: =?UTF-8?B?RGVhbHMgZm9yIHRoZSBEYXkgfCBQbHVzLCBzYXZlIDI1JSBvbiA=?=
=?UTF-8?B?bmVhcmx5IEVWRVJZVEhJTkch?=
MIME-Version: 1.0
I am using the following regex but it only returns a single line:
'/Subject: (.*)/i'
I then tried the following, which returns both lines; however, when the subject is only one line it also returns unwanted header information (MIME-Version...).
'/Subject: (.*)(\n\s*(.*))/i'
How can I modify the regex to pull in the following line only if it starts with whitespace (\s*), and to cope with a "Subject" that varies in length and can span multiple lines?
Thanks for your help!
UPDATE SOLUTION
Thanks to @G-Nugget, below is a regex that does what I want and groups the result:
/Subject: ((.*)(\n\s+(.*))*)/i

Your second regex is close. This modified version should do the trick:
/Subject: (.*)(\n\s+(.*))*/i
By switching the * in the middle to a +, there must be whitespace at the start of a line for it to be grabbed. The * at the end allows the regex to match any number of lines, as long as all but the first start with whitespace.
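A minimal sketch of how the grouped variant from the question's update might be used. The $headers string is a made-up sample (the From line and the leading space on the continuation line are assumptions), and collapsing each fold into a single space is just one way to rejoin the value:

```php
<?php
// Hypothetical raw header block; the folded Subject continuation starts with a space.
$headers = "From: sender@example.com\n"
    . "Subject: =?UTF-8?B?RGVhbHMgZm9yIHRoZSBEYXkgfCBQbHVzLCBzYXZlIDI1JSBvbiA=?=\n"
    . " =?UTF-8?B?bmVhcmx5IEVWRVJZVEhJTkch?=\n"
    . "MIME-Version: 1.0\n";

$subject = null;
if (preg_match('/Subject: ((.*)(\n\s+(.*))*)/i', $headers, $m)) {
    // Group 1 holds the first line plus every continuation line.
    // Collapse each "newline + leading whitespace" fold into one space.
    $subject = preg_replace('/\n\s+/', ' ', $m[1]);
}

echo $subject, "\n";
```

Because \s also matches a newline, a Subject that is the last header before a blank line could pull in the first line of the body, so it is safer to run this on the header block only.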

I strongly recommend using the regex with the "m" modifier and "^" so that it matches only at the beginning of a line:
/^Subject: (.*)(\n\s+(.*))*/im
to avoid matching a completely different header than expected, for example:
"X-Subject" instead of "Subject"
"X-Google-DKIM-Signature" instead of "DKIM-Signature"
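To see the difference the anchor makes, here is a small sketch with made-up headers in which an X-Subject line comes before the real Subject. The header values are assumptions chosen only for illustration:

```php
<?php
// Hypothetical headers where an "X-Subject" line precedes the real Subject.
$headers = "X-Subject: not the real one\n"
    . "Subject: the real subject\n"
    . " continued here\n"
    . "MIME-Version: 1.0\n";

// Unanchored: "Subject: " is found inside "X-Subject: " first.
preg_match('/Subject: (.*)(\n\s+(.*))*/i', $headers, $loose);

// Anchored with ^ and the m modifier: only a line that starts with "Subject: " matches.
preg_match('/^Subject: (.*)(\n\s+(.*))*/im', $headers, $strict);

echo $loose[1], "\n";
echo $strict[1], "\n";
```

Here $loose[1] holds the value of X-Subject, while $strict[1] holds the first line of the real Subject and $strict[0] the whole folded header.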

Related

How do I extract one group from a URL using regex for use in a redirect?

I've read the Best RegEx Trick Ever and tried to wrap my head around the other answers here on Stack Exchange and just can't seem to get it right. Take these three strings:
http://www.test.com/newyork/class-schedule
http://www.test.com/location/newyork/class-schedule
http://www.test.com/location/newyork/training
I need a regex that will extract the newyork from the first string and save it for a replace later, but will NOT match any part of the other strings. Also, for obscure reasons, I can not include http://www.test.com as a condition for matching (so I can't use anything before the slash that precedes newyork). Note that in this scenario, newyork could easily be chicago, atlanta, or any other city name with no spaces or punctuation.
The only thing I've been able to figure out that isolates only newyork in the first string is the following:
/.*\.com\/(.[^\/]*)\/class-schedule/g
However, this relies on using the URL first which I can't use.
Any ideas on how to achieve this WITHOUT using the URL?
[EDIT]
To clarify what I'm looking for, I'm trying to take the results from the first string and add "location" to it, still using regex. So:
http://www.test.com/newyork/class-schedule
would become
http://www.test.com/location/newyork/class-schedule
using something like
http://www.test.com/location/$1/class-schedule

Try this: ~/(\w+)/[-a-z]+?/?(?:\?.*?)*(:?\s|$)~gm
See it working here: https://regex101.com/r/4VMazZ/3.
It anchors on the end of the URL instead of the beginning and matches only the word between the second and third slashes from the end. A trailing query string does not stop it from working.
[EDIT 1]
I had transposed two characters at the end (a typo), so it was capturing one extra group. Corrected: /(\w+)/[-a-z]+?/?(?:\?.*?)*(?:\s|$). See: https://regex101.com/r/4VMazZ/4
If you use preg_match($pattern, $string, $matches), the result you want (newyork) will be in $matches[1]; $matches[0] contains the whole match.
You can see the captures in 'MATCH INFORMATION' panel on regex101 in my example!
[EDIT 2] after your comment.
If you want to replace the whole URL, you have to match the whole URL; something like this will do in this example: .*?/(\w+)/[-a-z]+?/?(?:\?.*?)*(?:\s|$). See it working here: https://regex101.com/r/4VMazZ/5
[EDIT 3] Add capturing of last part for replacement.
Since you want to reuse the last part, you need to add capturing parentheses around it: .*?/(\w+)/([-a-z]+?)/?(?:\?.*?)*(?:\s|$).
See it working here: https://regex101.com/r/4VMazZ/6
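A rough sketch of how the EDIT 3 pattern might be plugged into preg_replace. The ~ delimiters and the hard-coded host in the replacement are assumptions, the latter following the question's own example:

```php
<?php
// Pattern from EDIT 3, wrapped in ~ delimiters.
// Group 1 is the city, group 2 the last path segment.
$pattern = '~.*?/(\w+)/([-a-z]+?)/?(?:\?.*?)*(?:\s|$)~';

// The host is hard-coded in the replacement, as in the question's example.
$replacement = 'http://www.test.com/location/$1/$2';

$before = 'http://www.test.com/newyork/class-schedule';
$after  = preg_replace($pattern, $replacement, $before);

// A URL that already contains /location/ also matches the pattern,
// but this replacement rebuilds it unchanged.
$already = 'http://www.test.com/location/newyork/class-schedule';
$same    = preg_replace($pattern, $replacement, $already);

echo $after, "\n";
echo $same, "\n";
```

Note that the pattern does not capture a query string, so a URL with one would lose it in this replacement.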

Could this work? See it here.
(?<=location\/|\.\w{3}\/|\.\w{2}\/)(?!location).*?(?=\/|$)
It matches everything following .xxx/ or .xx/ or location/. I don't know whether one-letter top-level domains exist; if they do, you can add |\.\w\/ to the lookbehind at the start of the regex.
(?<=location\/|\.\w{3}\/|\.\w{2}\/) is a lookbehind, so the pattern that follows matches only if preceded by location/, .xxx/ or .xx/
.*? matches every character (lazy)
(?=\/|$) end match if next character is / or on line end
Note: If location is counted as part of the url, I don't think what you are asking is possible in regex, as the city name could be anywhere in string. If so, then you could have a list of cities and check what part of the url matches one of them.
EDIT: You need the multiline m flag so $ also matches end of line

Regular expression to replace broken email links

Problem: authors have added email addresses wrongly in a CMS - missing out the 'mailto:' text.
I need a regular expression, if possible, to do a search and replace on the stored MySQL content table.
Cases I need to cope with are:
No 'mailto:'
'mailto:' is already included (correct)
web address not email - no replace
multiple mailto: required (more than one in string)
Sample string would be: (line breaks added for readability)
add1@test.com and
add2@test.com and
real web link
second one to replace add3@test.com
Required output would be:
add1@test.com and
add2@test.com and
real web link
second one to replace add3@test.com
What I tried (in PHP) and issues:
pattern: /href="(.+?)(@)(.+?)(<\/a> )/iU
replacement: href="mailto:$1$2$3$4
This is adding mailto: to the correctly formatted mailto: and acting greedily over the last two links.
Thanks for any help. I have looked about, but am running out of time on this as it was an unexpected content issue.
If you are able to save me time and give the SQL expression, that would be even better.

Try replace
/href="(?!(mailto:|http:\/\/|www\.))/iU
with
href="mailto:
?! loosely means "the next characters aren't these".
Alternative:
Replace
/(href=")(?!mailto:)([^"]+@)/iU
with
$1mailto:$2
[^"]+ means 1 or more characters that aren't ".
You'd probably need a more complex matching pattern for guaranteed correctness.
MySQL REGEX matching:
See this or this.
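The alternative pattern from this answer can be sketched as follows. The $html sample is hypothetical (the anchors and the web address are made up for illustration), and the U modifier is left off here since the character class already keeps the match inside one attribute value:

```php
<?php
// Hypothetical CMS content: the anchors are made up for illustration.
$html = '<a href="add1@test.com">add1@test.com</a> and '
      . '<a href="mailto:add2@test.com">add2@test.com</a> and '
      . '<a href="http://www.test.com/">real web link</a> '
      . 'second one to replace <a href="add3@test.com">add3@test.com</a>';

// (href=")     group 1: the attribute opener
// (?!mailto:)  skip links that already have the scheme
// ([^"]+@)     group 2: everything up to the @, without leaving the attribute value
$fixed = preg_replace('/(href=")(?!mailto:)([^"]+@)/i', '$1mailto:$2', $html);

echo $fixed, "\n";
```

Only the two bare addresses gain a mailto: prefix; the link that already had one and the web link are left alone.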

You need to apply a proper email pattern first (e.g. see "Using a regular expression to validate an email address"), then look for mailto: or nothing before the address (e.g. (mailto:|)); preg_replace_callback is well suited to this.
This appears to work as you wish (it only looks for email addresses in double quotes):
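$s = 'add1@test.com and
add2@test.com and
real web link
second one to replace add3@test.com';
echo preg_replace_callback(
    // Optional "mailto:" (group 1) followed by an email address (group 2), all inside double quotes
    '~"(mailto:|)([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))"~i',
    function ($m) {
        // print_r($m); #debug
        // Always emit the address with exactly one mailto: prefix
        return '"mailto:' . $m[2] . '"';
    },
    $s
);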
Output, as desired:
add1@test.com and
add2@test.com and
real web link
second one to replace add3@test.com

Use the following as pattern:
/(href=")(?!mailto:)(.+?@.+?")/i
and replace it with
$1mailto:$2
(?!mailto:) is a negative lookahead that checks that mailto: does not follow; only then is the remaining part tried. (.+?@.+?") matches one or more characters followed by an @ followed by one or more characters followed by a ". Both + are non-greedy. Note that the U modifier is left off, because it inverts greediness and would turn +? back into a greedy quantifier.
The matched text is replaced with the first capture group (href=") followed by mailto: followed by the second capture group (up to the closing ").

Where did I go wrong in my regex lookaround?

I'm trying to pull the first paragraph out of Markdown formatted documents:
This is the first paragraph.
This is the second paragraph.
The answer here gives me a solution that matches the first string ending in a double line break.
Perfect, except some of the texts begin with Markdown-style headers:
### This is an h3 header.
This is the first paragraph.
So I need to:
Skip any line that begins with one or more # symbols.
Match the first string ending in a double line break.
In other words, return 'This is the first paragraph' in both of the examples above.
So far, I've tried many variations on:
"/(?s)(?:(?!\#))((?!(\r?\n){2}).)*+/
But I can't get it to return the proper match.
Where did I go wrong in my lookaround?
I'm doing this in PHP (preg_match()), if that makes a difference.
Thanks!

You could try
"/(?sm)^[^#](?:(?!(?:\r\n|\r|\n){2}).)*/"
I enable the multiline option by using (?sm) instead of (?s) and start each check at the beginning of a line, which must not start with a #. I also used \r\n|\r|\n instead of \r?\n because my testing environment had funny line breaks =)
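A small sketch of this pattern with preg_match(). The function name firstParagraph is hypothetical, and the trim() call is an addition: when a blank line follows the header, [^#] can match that line break, so the raw match may start with a newline:

```php
<?php
// Hypothetical helper: returns the first non-header paragraph, or null if none is found.
function firstParagraph(string $markdown): ?string
{
    $pattern = '/(?sm)^[^#](?:(?!(?:\r\n|\r|\n){2}).)*/';
    if (preg_match($pattern, $markdown, $m)) {
        // The match can begin with the line break of a blank line; trim it away.
        return trim($m[0]);
    }
    return null;
}

$plain      = "This is the first paragraph.\n\nThis is the second paragraph.";
$withHeader = "### This is an h3 header.\n\nThis is the first paragraph.\n\nThis is the second paragraph.";

echo firstParagraph($plain), "\n";
echo firstParagraph($withHeader), "\n";
```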

PHP preg_replace non-greedy trouble

I've been using the following site to test a PHP regex so I don't have to constantly upload:
http://www.spaweditor.com/scripts/regex/index.php
I'm using the following regex:
/(.*?)\.{3}/
on the following string (replacing with nothing):
Non-important data...important data...more important data
and preg_replace is returning:
more important data
yet I expect it to return:
important data...more important data
I thought the ? is the non-greedy modifier. What's going on here?

Your non-greedy modifier is working as expected. But preg_replace replaces all occurrences of the (non-greedy) match with the replacement text ("" in your case). If you want only the first one replaced, you can pass 1 as the optional 4th argument (limit) to preg_replace (see the PHP docs for preg_replace). On the website you linked, this can be accomplished by typing 1 into the text input between the word "Flags" and the word "limit".
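Applied to the string from the question, the difference might look like this (a minimal sketch; the variable names are made up):

```php
<?php
$text = 'Non-important data...important data...more important data';

// Without a limit, every lazy match is removed, one after another.
$all = preg_replace('/(.*?)\.{3}/', '', $text);

// With limit = 1 (the 4th argument), only the first match is removed.
$first = preg_replace('/(.*?)\.{3}/', '', $text, 1);

echo $all, "\n";
echo $first, "\n";
```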

Just an actual example of @Asaph's solution. In this example you don't need non-greediness because you can specify a limit.
Replace just the first occurrence of # in a line with a marker:
$line=preg_replace('/#/','zzzzxxxzzz',$line,1);

Remove all characters starting from last occurrence of specific sequence of characters

I am parsing out some emails. Mobile Mail, iPhone and I assume iPod touch append a signature as a separate boundary, making it simple to remove. Not all mail clients do, and just use '--' as a signature delimiter.
I need to chop off the '--' from a string, but only the last occurrence of it.
Sample copy
hello, this is some email copy-- check this out
--
Tom Foolery
I thought about splitting on '--' and removing the last part, but neither explode() nor split() seems to return a useful value for telling me whether it did anything when there is no match.
I cannot get preg_replace() to go across more than one line. I have standardized all line endings to \n.
What is the best way to end up with hello, this is some email copy-- check this out, keeping in mind that there will be cases where there is no signature, and of course cases that I cannot cover at all?

Actually, the correct signature delimiter is "-- \n" (note the space before the newline), so the delimiter regexp should be '^-- $'. You might consider using '^--\s*$' instead, so that it also works with OE, which gets it wrong.

Try this:
preg_replace('/--[\r\n]+.*/s', '', $body)
This will remove everything from the first occurrence of -- followed by one or more line break characters. If you just want to remove from the last occurrence, use /(.*)--[\r\n]+.*/s with '$1' as the replacement instead, so that the text captured before the last delimiter is kept.

Instead of just chopping off everything after --, could you not cache the last few emails sent by that user or service and compare them? The bit at the bottom that looks like the others can be safely removed, leaving the proper message intact.

I think, in the interest of being more bulletproof, I will take the non-regex route:
echo substr($body, 0, strrpos($body, "\n--"));
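One caveat with the line above: when there is no "\n--" in the body, strrpos() returns false and substr() would then return an empty string. A guarded variant could look like the sketch below; the function name stripSignature is hypothetical:

```php
<?php
// Hypothetical helper: cuts at the last "\n--" and leaves the body alone when there is none.
function stripSignature(string $body): string
{
    $pos = strrpos($body, "\n--");
    // strrpos() returns false when the needle is absent, so return the body unchanged then.
    return $pos === false ? $body : substr($body, 0, $pos);
}

$withSignature = "hello, this is some email copy-- check this out\n--\nTom Foolery";
$noSignature   = "hello, no signature here";

echo stripSignature($withSignature), "\n";
echo stripSignature($noSignature), "\n";
```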

This seems to give me the best result:
$body = preg_replace('/\s*(.+)\s*[\r\n]--\s+.*/s', '$1', $body);
It will match and trim the last "(newline)--(optional whitespace/newlines)(signature)"
Trim all remaining newlines before the signature
Trim beginning/ending whitespace from the body (remaining newlines before the signature, whitespace at the start of the body, etc)
Will only work if there's some text (non-whitespace) before the signature (otherwise it won't strip the signature and return it intact)

To cleanly remove the whole signature and its leading newline characters, perform greedy matching up to the last occurring --. Just before matching that last -- (preceded by a system-agnostic newline, \R, and followed by zero or more spaces and another \R), restart the full-string match with \K, then match all of the remaining string so that it is replaced.
Code: (Demo)
$string = <<<BODY
hello, this is some email copy-- check this out
--
Tom Foolery
BODY;
var_export(preg_replace('~.*\K\R-- *\R.*~s', '', $string));
Output:
'hello, this is some email copy-- check this out'
