Remove all characters starting from last occurrence of specific sequence of characters - php

I am parsing out some emails. Mobile Mail, iPhone and I assume iPod touch append a signature as a separate boundary, making it simple to remove. Not all mail clients do; some just use '--' as a signature delimiter.
I need to chop off the '--' from a string, but only the last occurrence of it.
Sample copy
hello, this is some email copy-- check this out
--
Tom Foolery
I thought about splitting on '--' and removing the last part, but neither explode() nor split() seems to return a value that tells me whether it did anything when there is no match.
I cannot get preg_replace() to match across more than one line. I have standardized all line endings to \n.
What is the best way to end up with hello, this is some email copy-- check this out, taking note that there will be cases where there is no signature, and there are of course going to be cases I cannot cover?

Actually, the correct signature delimiter is "-- \n" (note the space before the newline), so the delimiter regexp should be '^-- $'. You might consider using '^--\s*$' instead, so that it also works with OE, which gets it wrong.
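A minimal sketch of how that delimiter pattern could be applied, assuming line endings are already normalized to \n as the question states. The function name strip_signature is hypothetical, and [ \t]* is used in place of \s* so the match cannot run across blank lines; that is a choice made here, not part of the answer above.

```php
<?php
// Hypothetical helper: cut the body at the LAST line that consists only of
// "--" plus optional trailing spaces or tabs.
function strip_signature(string $body): string
{
    // The m modifier lets ^ and $ match at line boundaries.
    if (!preg_match_all('/^--[ \t]*$/m', $body, $matches, PREG_OFFSET_CAPTURE)) {
        return $body; // no delimiter line found: leave the body untouched
    }
    $last = end($matches[0]); // [matched text, byte offset] of the last delimiter
    return rtrim(substr($body, 0, $last[1]));
}

$body = "hello, this is some email copy-- check this out\n--\nTom Foolery";
echo strip_signature($body), "\n";
```

Because the function returns the input unchanged when no delimiter line exists, the caller does not need a separate "did anything match" check.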

Try this:
preg_replace('/--[\r\n]+.*/s', '', $body)
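This will remove everything after the first occurrence of -- followed by one or more line break characters. To cut at the last occurrence instead, capture everything before it greedily and put it back: use /(.*)--[\r\n]+.*/s as the pattern with '$1' as the replacement. Without the capture group, the leading .* would be removed along with the signature.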

Instead of just chopping off everything after --, could you not cache the last few emails sent by that user or service and compare? The part at the bottom that looks like the others can be safely removed, leaving the proper message intact.

I think in the interest of being more bulletproof, I will take the non regex route
echo substr($body, 0, strrpos($body, "\n--"));
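One caveat with this one-liner: strrpos() returns false when "\n--" is not found, and substr($body, 0, false) then yields an empty string. A guarded variant might look like the sketch below; the function name chop_last_delimiter and its default argument are hypothetical.

```php
<?php
// Hypothetical helper: cut at the last occurrence of the delimiter,
// but return the body unchanged when the delimiter is absent.
function chop_last_delimiter(string $body, string $delimiter = "\n--"): string
{
    $pos = strrpos($body, $delimiter);
    // strrpos() returns false (not 0) when nothing is found, so compare strictly.
    return $pos === false ? $body : substr($body, 0, $pos);
}

echo chop_last_delimiter("hello, this is some email copy-- check this out\n--\nTom Foolery"), "\n";
echo chop_last_delimiter("an email with no signature"), "\n";
```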

This seems to give me the best result:
$body = preg_replace('/\s*(.+)\s*[\r\n]--\s+.*/s', '$1', $body);
It will match and trim the last "(newline)--(optional whitespace/newlines)(signature)"
Trim all remaining newlines before the signature
Trim beginning/ending whitespace from the body (remaining newlines before the signature, whitespace at the start of the body, etc)
Will only work if there's some text (non-whitespace) before the signature (otherwise it won't strip the signature and return it intact)

To cleanly remove all of the signature and its leading newline characters, perform greedy matching up to the last occurring --. Before matching the last -- followed by zero or more spaces then a system-agnostic newline character, restart the fullstring match using \K, then match all of the remaining string to be replaced.
Code: (Demo)
$string = <<<BODY
hello, this is some email copy-- check this out
--
Tom Foolery
BODY;
var_export(preg_replace('~.*\K\R-- *\R.*~s', '', $string));
Output:
'hello, this is some email copy-- check this out'

Related

Regular expression to replace broken email links

Problem: authors have added email addresses wrongly in a CMS - missing out the 'mailto:' text.
I need a regular expression, if possible, to do a search and replace on the stored MySQL content table.
Cases I need to cope with are:
No 'mailto:'
'mailto:' is already included (correct)
web address not email - no replace
multiple mailto: required (more than one in string)
Sample string would be: (line breaks added for readability)
add1#test.com and
add2#test.com and
real web link
second one to replace add3#test.com
Required output would be:
add1#test.com and
add2#test.com and
real web link
second one to replace add3#test.com
What I tried (in PHP) and issues:
pattern: /href="(.+?)(#)(.+?)(<\/a> )/iU
replacement: href="mailto:$1$2$3$4
This is adding mailto: to the correctly formatted mailto: and acting greedily over the last two links.
Thanks for any help. I have looked about, but am running out of time on this as it was an unexpected content issue.
If you are able to save me time and give the SQL expression, that would be even better.
Try replacing
/href="(?!(mailto:|http:\/\/|www\.))/iU
with
href="mailto:
?! loosely means "the next characters aren't these".
Alternative:
Replace
/(href=")(?!mailto:)([^"]+#)/iU
with
$1mailto:$2
[^"]+ means 1 or more characters that aren't ".
You'd probably need a more complex matching pattern for guaranteed correctness.
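A small sketch of the alternative pattern in use. The anchor markup in $html is an assumption about what the sample string looked like before its tags were lost, and the U modifier is left off here since the pattern does not need it.

```php
<?php
// Assumed sample markup; the original tags are not shown in the question.
$html = '<a href="add1#test.com">add1#test.com</a> and '
      . '<a href="mailto:add2#test.com">add2#test.com</a> and '
      . '<a href="http://www.test.com/">real web link</a>';

// Add mailto: only where href=" is not already followed by it and the
// attribute value contains a # before its closing quote.
$fixed = preg_replace('/(href=")(?!mailto:)([^"]+#)/i', '$1mailto:$2', $html);

echo $fixed, "\n";
```

Because [^"]+ cannot cross a double quote, the web link (which has no # inside its href value) is left alone.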
MySQL REGEX matching:
See this or this.
You need to apply a proper mail pattern first (e.g. Using a regular expression to validate an email address), then search for an optional mailto: before the address (e.g. (mailto:|)); preg_replace_callback is well suited to this.
This appears to work as you wish (it only looks at email addresses in double quotes):
$s = 'add1#test.com and
add2#test.com and
real web link
second one to replace add3#test.com';
echo preg_replace_callback(
'~"(mailto:|)([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))"~i',
function($m) {
// print_r($m); #debug
return '"mailto:'. $m[2] .'"';
},
$s
);
Output as you desired;
add1#test.com and
add2#test.com and
real web link
second one to replace add3#test.com
Use the following as pattern:
/(href=")(?!mailto:)(.+?#.+?")/i
and replace it with
$1mailto:$2
(?!mailto:) is a negative lookahead that checks that mailto: does not follow. If it does not, the remaining part is checked for a match. (.+?#.+?") matches one or more characters followed by a #, followed by one or more characters, followed by a ". Both + are non-greedy; note that the U modifier must be left off here, because it inverts greediness and would make +? greedy again.
The matched text is replaced with the first capture group (href=") followed by mailto: followed by the second capture group (up to the closing ").

Regex For PHP Code?

I have the following code
<?php
drupal_set_message("Your registration submission has been received.");
drupal_goto("/events-initiatives/events-listing");
?>
And I want to remove everything except Your registration submission has been received. This message will change, so I need it to be a wildcard. For example, it could also be
<?php
drupal_set_message("Testing!!!");
drupal_goto("/events-initiatives/events-listing");
?>
But I can't figure out how to do the PHP code, my current one is
preg_replace('#(<?php drupal_set_message(").*?("); drupal_goto("/guidelines-resources/professionals/lending-library"); ?>)#', '$1$2', $string);
but that isn't working, it seems to have problems with the ( in it.
Any idea how I could do this?
From looking at your original post, (before your regex was changed into a PHP snippet) I'd suggest you are looking for a regex along these lines:
#<\?php\s+drupal_set_message\(".*?"\);\s+drupal_goto\("/guidelines-resources/professionals/lending-library"\);\s+\?>#
Note that this regex:
escapes all special characters (e.g., ?, ( and )) with preceding backslashes
replaces a single space with \s+ which matches one or more consecutive whitespace characters
EDIT
After rereading your question, if the only thing you want left is the text that is passed as an argument to drupal_set_message, then try this:
$pattern = '#\bdrupal_set_message\("(.*?)"\)#';
$found = preg_match($pattern, $subject, $matches);
// if found, $matches[1] will contain the argument to drupal_set_message
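For illustration, here is the same pattern run against the snippet from the question. The contents of $subject are an assumption about how the code is stored, here as a single string.

```php
<?php
// Assumed input: the PHP snippet from the question held in a string.
$subject = '<?php drupal_set_message("Your registration submission has been received."); '
         . 'drupal_goto("/events-initiatives/events-listing"); ?>';

$pattern = '#\bdrupal_set_message\("(.*?)"\)#';

// preg_match() returns 1 on a match, 0 on no match, false on error.
$message = preg_match($pattern, $subject, $matches) === 1 ? $matches[1] : null;

echo $message, "\n";
```

The lazy (.*?) stops at the first "), so the argument to drupal_goto is not captured.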
You can escape the special characters (though really, just the open and close parentheses) with backslashes. On a side note, if you have a decent IDE then it should have sophisticated regex-capable search-and-replace; use it (although if you do, you'll probably need to also escape the forward slashes, as those are the most likely delimiters that your IDE would use).

till the end of the string - Regex

I'm parsing an external feed which contains location and date inside post title which I want to get rid of, so:
This happened on Date in Location
I need to find on (space on space) and remove everything till the end of the line, same for in(space in space).
I googled a bit, but regex is really unfathomable for me so I'd appreciate any help.
Thanks!
Well, a literal "on" does match exactly. Then tell the regex engine to match everything after: ".*". (Note, that the . doesn't match newlines, so it works as needed.)
In the case of "in" you need an alternative, which is marked by parentheses () and the vertical bar |: "(on|in)". You could also make that a bit tighter with character classes []: "[oi]n".
With that we arrive at this regex:
/ [oi]n .*/
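Applied with preg_replace(), a sketch using the title from the question might look like this (the variable names are illustrative):

```php
<?php
$title = 'This happened on Date in Location';

// The leftmost " on " or " in " and everything after it on the line is removed.
$clean = preg_replace('/ [oi]n .*/', '', $title);

echo $clean, "\n"; // This happened
```

The spaces around [oi]n keep the pattern from matching "on" or "in" inside other words.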
To the end of the line? Then I suppose:
preg_replace("/(?:on|in).*?(\n|$)/", "", 'This happened on Date in Location');
Would do it.
Use a positive lookbehind if you want to remove everything after the on and in but not the on and in themselves.
(?<=\son\s).*
and
(?<=\sin\s).*
http://regexr.com?30ops

Preg_match when string is sometimes a single word?

I'm trying to pull a word out of an email subject line to use as a category for attached email. Preg_match works great as long as it's not just a single word (which is what I'd like to do anyway). If there is only one word in the subject line, I just get an empty array. I've tried to treat $matches as just a variable in that case, but that doesn't work either. Can anyone tell me if preg_match will work on a single word, or what the better way to do this would be?
Thanks very much
Assuming \b(?:word1|word2|word3)\b
One reason it might not match "word1" is the word boundary, \b (although \b normally also matches at the very start and end of a string).
What you can do is simply always inject a word separator (note that the pattern also needs delimiters):
preg_match('/\b(?:word1|word2|word3)\b/', "." . $subject . ".", $matches);
Crude but effective.
preg_match will work on a string one character long. I think that the issue here is probably your regex. My guess is that you're testing for whitespace and because it isn't finding any it says that there is no match. Try prepending '^([^\s]*)$|' to your regex and I wager it will start picking up those one-word values. ([^\s] means "any character that is not whitespace", and | means "or". By adding it to the front of your regex, you match either a subject with no whitespace at all or whatever you already had.)
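A quick way to check that preg_match() itself copes with a one-word subject is a sketch like the following. The category names are hypothetical, since the asker's actual pattern was not posted.

```php
<?php
// Hypothetical category names, matched as whole words, case-insensitively.
$pattern = '/\b(?:invoices|receipts|travel)\b/i';

foreach (['Receipts', 'Fwd: travel plans for May'] as $subject) {
    // preg_match() returns 1 on a match and fills $matches[0] with the matched word.
    if (preg_match($pattern, $subject, $matches) === 1) {
        echo $subject, ' => ', $matches[0], "\n";
    }
}
```

If a check like this matches the single-word subject, the empty $matches array most likely comes from something in the original regex that requires surrounding whitespace.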

Regex to strip some lines out of a text file

I need to try and strip out lines in a text file that match a pattern something like this:
anything SEARCHTEXT;anything;anything
where SEARCHTEXT will always be a static value and each line ends with a line break. Any chance someone could help with the regex for this please? Or give me some ideas on where to start (it's been too many years since I looked at regex).
I am planning on using PHP's preg_replace() for this.
Thanks.
This solution removes all lines in $text which contain the sub-string SEARCHTEXT:
$text = preg_replace('/^.*?SEARCHTEXT.*\n?/m', '', $text);
My benchmark tests indicate that this solution is more than 10 times faster than '/\n?.*SEARCHTEXT.*$/m' (and this one correctly handles the case where the first line matches and the second one doesn't).
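As a small illustration of that pattern, here is a sketch with made-up sample lines that include a matching first line:

```php
<?php
// Made-up sample text; lines 1 and 3 contain SEARCHTEXT.
$text = "foo SEARCHTEXT;a;b\nkeep this line\nbar SEARCHTEXT;c;d\nkeep this too\n";

// With the m modifier, ^ matches at the start of every line; the trailing
// \n? removes the line break so no empty line is left behind.
$text = preg_replace('/^.*?SEARCHTEXT.*\n?/m', '', $text);

echo $text;
```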
Use a regex to match the whole line like so:
^.*SEARCHTEXT.*$
preg_replace would be a good option for this.
$str = preg_replace('/\n?.*SEARCHTEXT.*$/m', '', $str);
The \n escape matches the line break for the matched line. This way matched lines are removed and the replace method does not just leave empty lines in the string.
The /m flag makes the caret (^) match the start of each line instead of the start of the string.
