I am trying to clean up user-submitted comments in PHP using regex, but I have become rather stuck and confused!
Is it possible using regex to:
Remove punctuation repeated more than twice so that:
OMG it was AWESOME!!!! becomes OMG it was AWESOME!!
!!!!!!!!!!.........------ becomes !!..--
!?!?!? becomes !?
Remove duplicate words or phrases (for example, when a user has copied and pasted a message) so:
spamspamspamspam becomes spam
I love copy and paste. I love copy and paste. I love copy and paste. becomes I love copy and paste.
Remove collections of letters and spaces longer than say 10 letters in caps:
I LOVE CAPITALS THEY ARE SO AWESOME becomes I love capitals they are so awesome
GOOD that sounds stays the same
Any suggestions you have?
This is for a student system (hence the urge to at least try and tidy up what they post), although I do not wish to go as far as filtering it or blocking their messages, just "correct" it with some regex.
Thanks for your time,
Edit:
If it isn't possible using regex (or regex mixed with other PHP), how would you do it?
1:
// same punctuation repeated more than 2 times
$string = preg_replace('#([?!.-])\1{2,}#', '$1$1', $string);
// sequence of different punctuations repeated more than one time
$string = preg_replace('#([?!.-][?!.-]+?)\1+#', '$1', $string);
2:
// any sequence of characters repeated more than one time
$string = preg_replace('#(.{2,}?)\1+#', '$1', $string);
3:
// sequence of uppercase letters and spaces
function tolower_cb($match) {
    // $match[0] holds the whole matched run of capitals and spaces
    return strtolower($match[0]);
}
$string = preg_replace_callback('#([A-Z ]{10,})#', 'tolower_cb', $string);
Try it here: http://codepad.org/iQsZ2vJ0
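As a rough sketch, the three steps above can be chained into a single helper. The function name tidy_comment and the order in which the steps run are assumptions made for illustration; the patterns are the ones from this answer.

```php
<?php
// Hypothetical helper: applies the three clean-up steps in order.
function tidy_comment(string $text): string
{
    // 1a. The same punctuation mark three or more times in a row -> two.
    $text = preg_replace('#([?!.-])\1{2,}#', '$1$1', $text);
    // 1b. A repeated run of mixed punctuation (e.g. "!?!?!?") -> one copy.
    $text = preg_replace('#([?!.-][?!.-]+?)\1+#', '$1', $text);
    // 2. Any sequence of two or more characters repeated back to back -> one copy.
    $text = preg_replace('#(.{2,}?)\1+#', '$1', $text);
    // 3. Runs of ten or more capitals and spaces -> lower case.
    return preg_replace_callback('#[A-Z ]{10,}#', function (array $m): string {
        return strtolower($m[0]);
    }, $text);
}

echo tidy_comment('OMG it was AWESOME!!!!'), "\n";
echo tidy_comment('spamspamspamspam'), "\n";
```

Step 2 is aggressive: it also shortens ordinary words that contain a doubled sequence, so banana becomes bana. It may be worth restricting it to longer sequences before using it on real comments.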
A good rule of thumb is to never, ever try to "fix" user input. If a user wants to type 4 exclamation points after a sentence, then allow it. There is no reason not to.
You should be more concerned with injection attacks than with things like this.
I'm having a little trouble figuring out the pattern to identify the beginning of inline replies/forwards in an email body, there are some easier ones that simply begin with something like "Begin forwarded message" but the replies are a little more complicated:
On 12-06-13 10:56 AM, "John Doe" <john.doe#some.tld> wrote:
Obviously the constants will be "On" and "wrote:". I'd like to be able to find only the first match and then either wrap everything after it in a div with display:none applied or even just eliminate it using substr($body,0, POSITION_OF_MATCH).
One issue I'm having is that it's not catching the FIRST occurrence; the other is that I can't get the greediness to work properly.
My progress (having fallen back to at least a partially working version) so far is:
preg_match("/On [^>]* wrote:/i",$content,$matches,PREG_OFFSET_CAPTURE);
Any help would be greatly appreciated!
You can probably break this down by elements; so you basically have:
On DATE, "NAME" <EMAIL> wrote:
You can then characterize DATE, NAME, and EMAIL.
DATE is composed of numbers, dashes, spaces, colons, and letters. However, it ends with a comma, so you can use that instead.
NAME is composed of letters and spaces, though it is delimited by quotes, and you can probably handle that.
EMAIL is a bit more complicated, but emails cannot contain the character >, so you should be able to capture everything but that.
So you basically get:
On [anything but comma], "[anything but "]" <[anything but >]> wrote:
Which, in regex, is something like:
/^On ([^,]+), \"([^\"]+)\" <([^>]+)> wrote:$/
Then, when using preg_match, you can get your matches from some $matches array, indices 1 through 3.
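A minimal sketch of how the three capture groups come back from preg_match, using the sample header line from the question. The variable names and the $parts array are illustrative assumptions.

```php
<?php
$line = 'On 12-06-13 10:56 AM, "John Doe" <john.doe#some.tld> wrote:';
// Same pattern as above, written in a single-quoted string so the
// double quotes need no escaping.
$pattern = '/^On ([^,]+), "([^"]+)" <([^>]+)> wrote:$/';

$parts = null;
if (preg_match($pattern, $line, $matches)) {
    $parts = [
        'date'  => $matches[1], // everything up to the first comma
        'name'  => $matches[2], // text between the double quotes
        'email' => $matches[3], // text between < and >
    ];
}
print_r($parts);
```

Because of ^ and $, this matches only when the subject is the header line by itself; to search a whole message body, the m modifier would be needed so the anchors apply per line.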
I wonder how your current version works at all, because you cannot possibly match the closing >. But you could do something like this:
$content = preg_replace('/(On [^>]*> wrote:).*$/s', '$1', $content);
Which will match the first On ... wrote: and everything after that up until the end of the string. And replace it by just the On ... wrote:.
I suggest
$email = preg_match('/^On [^"]*"[^"]*" <([^>]*)> wrote:$/', $str, $re) ? $re[1] : '';
See this demo.
I appreciate the other answers, but none of them really took into account the many possible variations in the reply strings I was dealing with, that might have been my fault for not explaining properly or providing more options. I've +1'd everyone for their efforts though.
The final solution which seems to be working best after a day of fiddling with it on and off is this:
/On (Mon|Tue|Wed|Thu|Fri|Sat|Sun|[[:digit:]]{1,2})(.*?) wrote:/i
The option list that it begins with covers a range of different reply types that start with "On Tue..." or "On 23..." or "On 1...", etc. ensuring that the greediness I was complaining about wasn't taking in too much from random "on" strings elsewhere, the (.*?) takes care of the rest of the name/email portion, finally following up with "wrote:" to finish it off.
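Combined with PREG_OFFSET_CAPTURE and substr(), as mentioned in the question, this pattern could be used roughly as follows. The helper name strip_quoted_reply and the sample body are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the body up to the first reply header.
function strip_quoted_reply(string $body): string
{
    $pattern = '/On (Mon|Tue|Wed|Thu|Fri|Sat|Sun|[[:digit:]]{1,2})(.*?) wrote:/i';
    if (preg_match($pattern, $body, $matches, PREG_OFFSET_CAPTURE)) {
        // With PREG_OFFSET_CAPTURE, $matches[0][1] is the byte offset
        // at which the whole match starts.
        return rtrim(substr($body, 0, $matches[0][1]));
    }
    // No reply header found: leave the body untouched.
    return $body;
}

$body = "Sounds good, see you on Monday.\n\n"
      . "On 12-06-13 10:56 AM, \"John Doe\" <john.doe#some.tld> wrote:\n"
      . "> Are we still meeting?\n";
echo strip_quoted_reply($body), "\n";
```

Since . does not match a newline without the s modifier, the "on Monday." in the first line cannot run on to the later "wrote:", so the first real header is the one that matches.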
Here's what I have so far:
/(^|\s)(http:\/\/(\S+)(?!(.png|.gif|.jpg)($|\.\s|\.$|\s)))($|\.\s|\.$|\s)/i
And I'm replacing it like so:
'$1$2$6'
Sometimes, my users type something like this: http://google.com. <- How do I avoid including that final period without parsing out other periods that are in URLs?
Also, in case you're wondering what the .gif .png etc is for, I'm parsing out images to automatically create elements.
Edit:
This is for PHP.
This is for a forum where users post lots of things including links. It successfully handles every situation except for punctuation after the URL.
Edit 2:
Parse out might be the wrong word. I'm not trying to remove the punctuation, just separate it from the URL so I can display a working link to my users.
Edit 3:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
I haven't tested it fully yet, but it seems to work. I'll make it a solution after I've tested. Or if someone else wants points, feel free to test and I'll vote for your solution.
So updated solution:
/\b(http:\/\/(\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))))(?<!(?:\.(?:png|gif|jpg)))/i
See it here online on Regexr
I replaced your (^|\s) with \b, a word boundary, which is exactly what you want here.
I changed your (\S+) to (\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))). Basically, I match every non-whitespace character until $|\s|\.(?:$|\s) is ahead and there is no dot on the left (the (?<!\.) part).
The lookaround that follows needs to be a lookbehind.
Then I cleaned up your brackets and alternations a bit and used some non-capturing groups (the groups that start with (?:).
So for your test string, "users type something like this: http://google.com. <- How do I avoid", it will match http://google.com in the first group and google.com in the second group.
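A short sketch of how this pattern might be used to build links while leaving a trailing period outside the anchor. The helper name linkify and the anchor markup are assumptions, and the sketch assumes the text has already been HTML-escaped.

```php
<?php
// Hypothetical helper: wraps plain http:// URLs in anchor tags.
// Assumes $text has already been HTML-escaped.
function linkify(string $text): string
{
    $pattern = '/\b(http:\/\/(\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))))(?<!(?:\.(?:png|gif|jpg)))/i';
    // $1 is the full URL without any trailing period.
    return preg_replace($pattern, '<a href="$1">$1</a>', $text);
}

echo linkify('See http://google.com. Thanks'), "\n";
```

URLs ending in .png, .gif or .jpg are rejected by the final lookbehind, so they stay untouched for the separate image handling.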
PHP solution:
$line = 'http://www.google.com.';
echo preg_replace(
    "/(\s*)((http:\/\/)?(\S+?(\.png|\.gif|\.jpg)?))(\W*)$/i",
    '$1$2$6',
    $line), "\n";
I'm looking for a simple replacement of [[wiki:Title]] into Title.
So far, I have:
$text = preg_replace("/\[\[wiki:(\w+)\]\]/","\\1", $text);
The above works for single words, but I'm trying to include spaces and on occasion special characters.
I get the \w+, but \w\s+ and/or \.+ aren't doing anything.
Could someone improve my understanding of basic regex? And I don't mean for anyone to simply point me to a webpage.
\w\s+ means "a word-character, followed by 1 or more spaces". You probably meant (\w|\s)+ ("1 or more of a word character or a space character").
\.+ means "one or more dots". You probably meant .+ (1 or more of any character - except newlines, unless in single-line mode).
The more robust way is to use
\[\[wiki:(.+?)\]\]
This means "1 or more of any character, but stop at the first position where the rest matches", i.e. stop at the first pair of right brackets in this case. Without the ? it would look for the longest available match, i.e. past the first closing brackets.
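To make the difference concrete, here is a small sketch comparing the lazy and greedy quantifiers, using the double-bracket delimiters from the question. The sample text is made up for illustration.

```php
<?php
$text = 'See [[wiki:Main Page]] and [[wiki:C++ (language)]].';

// Lazy: each match stops at the first "]]" it can reach.
$lazy = preg_replace('/\[\[wiki:(.+?)\]\]/', '$1', $text);

// Greedy: a single match runs on to the last "]]" in the string.
$greedy = preg_replace('/\[\[wiki:(.+)\]\]/', '$1', $text);

echo $lazy, "\n";   // See Main Page and C++ (language).
echo $greedy, "\n"; // See Main Page]] and [[wiki:C++ (language).
```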
You need to use \[\[wiki:([\w\s]+)\]\]. Notice square brackets around \w\s.
If you are learning regular expressions, you will find this site useful for testing: http://rexv.org/
You're definitely getting there, but you've got a couple syntax errors.
When you're using multiple character classes like \w and \s, in order to match either of them within that group, you have to put them in [square brackets] like so: ([\w\s]+). This basically means one or more word characters or whitespace characters.
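A quick sketch of the bracketed class in use; the sample strings are invented. It handles titles with spaces, but a title containing other characters is left alone.

```php
<?php
$pattern = '/\[\[wiki:([\w\s]+)\]\]/';

// Spaces are now allowed inside the title.
$a = preg_replace($pattern, '$1', 'Read [[wiki:Main Page]] first.');

// "+" is neither a word character nor whitespace, so nothing is replaced.
$b = preg_replace($pattern, '$1', 'Read [[wiki:C++]] first.');

echo $a, "\n"; // Read Main Page first.
echo $b, "\n"; // Read [[wiki:C++]] first.
```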
Putting a backslash in front of the period escapes it, meaning the regex is searching for a period.
As for matching special characters, that's more of a pain. I tried to come up with something quickly, but hopefully someone else can help you with that.
(Great cheat sheet here, I keep a copy on my desk at all times: http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/ )
IMHO,
I have heard about this a few times lately.
On some web portals, I have seen a search with whitespace at the beginning of the keywords return an empty result, while the same search without the whitespace works.
Are there some cases when this can be harmful?
Can somebody give arguments for this kind of practice?
In almost all cases it's beneficial to clean the input because you can't trust what you're going to get. But note that you don't want to always blindly do it. There are circumstances where you might actually want a leading or trailing space to be there. (E.g., in a password.)
It's generally a good idea to clean up user-entered text, which usually includes removing extraneous whitespace, problematic punctuation characters, and so forth. This can also include replacing multiple adjacent spaces with a single space.
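As a minimal sketch of that kind of clean-up in PHP (the function name normalize_spaces is hypothetical):

```php
<?php
// Hypothetical helper: trims the ends and collapses runs of spaces.
function normalize_spaces(string $input): string
{
    // trim() removes leading and trailing whitespace; the regex then
    // replaces two or more adjacent spaces with a single space.
    return preg_replace('/ {2,}/', ' ', trim($input));
}

echo '[', normalize_spaces('  hello   world '), "]\n"; // [hello world]
```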
It goes without saying that you should protect yourself from SQL- and HTML-injection attacks, too, by scrubbing (preprocessing) user-supplied text appropriately. The easiest way is to ignore punctuation; another approach is to convert punctuation into harmless escape sequences.
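For the HTML side, one common way to turn punctuation into harmless escape sequences in PHP is htmlspecialchars(). This is only a sketch with an invented sample comment; SQL injection is usually handled separately (for example with prepared statements) and is not shown here.

```php
<?php
$comment = '<b>"hi"</b> & bye';

// ENT_QUOTES escapes both single and double quotes as well as <, > and &.
$safe = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $safe, "\n"; // &lt;b&gt;&quot;hi&quot;&lt;/b&gt; &amp; bye
```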
No, there isn't anything wrong with it. If whitespace is necessary for the user's input, don't trim it away, but if it isn't, I would suggest trimming it.
For example, suppose someone enters a multi word string that you want to split apart.
Normally, you would break strings apart by splitting them using whitespaces as a delimiter, but if whitespaces aren't trimmed, you may or may not get an empty variable at the beginning. This will almost always have you guessing whether or not to use the first element of the split string. It really makes it a lot easier if you just trim whitespaces. Otherwise, you'll have a large block of code to figure out whether the first element of the split string is a valid entry or not.
" This string" would be split into an array that looks like this.
$string[0] = ''
$string[1] = 'This'
$string[2] = 'string'
but "This string" is simply
$string[0] = 'This'
$string[1] = 'string'
If you are doing string operations, you may want to find out how many words are in a string; the first case (above) would show you 3, while the latter would show you 2. There are just too many things to look out for, unless the leading or trailing whitespace is really necessary.
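The split described above can be reproduced with explode(). The last line shows one possible alternative, preg_split() with PREG_SPLIT_NO_EMPTY, which drops empty pieces without a separate trim; the variable names are illustrative.

```php
<?php
$raw = ' This string';

// Splitting without trimming leaves an empty first element.
$untrimmed = explode(' ', $raw);       // ['', 'This', 'string']

// Trimming first gives just the two words.
$trimmed = explode(' ', trim($raw));   // ['This', 'string']

// Alternative: split on any whitespace run and discard empty pieces.
$words = preg_split('/\s+/', $raw, -1, PREG_SPLIT_NO_EMPTY);

echo count($untrimmed), ' ', count($trimmed), ' ', count($words), "\n"; // 3 2 2
```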
I am looking for something like trim() but for within the bounds of a string. Users sometimes put 2, 3, 4, or more line returns after they type, and I need to sanitize this input.
Sample input
i like cats
my cat is happy
i love my cat
hope you have a nice day
Desired output
i like cats
my cat is happy
i love my cat
hope you have a nice day
I am not seeing anything built in, and a string replace would take many iterations of it to do the work. Before I whip up a small recursive string replace, I wanted to see what other suggestions you all had.
I have an odd feeling there is a regex for this one as well.
function str_squeeze($body) {
return preg_replace("/\n\n+/", "\n\n", $body);
}
How much text do you need to do this on? If it is less than about 100k then you could probably just use a simple search and replace regex (searching something like /\n+/ and replace with \n)
On the other hand, if you need to go through megabytes of data, then you could parse the text character by character, copying the input to the output, except when multiple newlines are encountered, in which case you would just copy one newline and ignore the rest.
I would not recommend a recursive string replace though, sounds like that would be very very slow.
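The character-by-character approach described above could look roughly like this. The function name collapse_newlines is hypothetical, and the sketch follows this answer's description literally, so every run of newlines becomes a single newline (no blank lines are kept).

```php
<?php
// Hypothetical helper: copies input to output, keeping only the first
// newline of each run of consecutive newlines.
function collapse_newlines(string $text): string
{
    $out = '';
    $previousWasNewline = false;
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];
        if ($ch === "\n") {
            if ($previousWasNewline) {
                continue; // skip the extra newline
            }
            $previousWasNewline = true;
        } else {
            $previousWasNewline = false;
        }
        $out .= $ch;
    }
    return $out;
}

echo collapse_newlines("a\n\n\nb\nc"), "\n";
```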
I finally managed to get it. It needs preg_replace, so you are using the PCRE functions in PHP, and it also needs \n\n as the replacement string so that it does not wipe out all line endings but one:
$body = preg_replace("/\n\n+/", "\n\n", $body);
Thanks for getting me on the right track.
To consider all three line break sequences:
preg_replace('/(?:\r\n|[\r\n]){2,}/', "\n\n", $str)
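A small sketch of that pattern wrapped in a function, with a made-up input that mixes \r\n and \n line breaks. The name squeeze_blank_lines is hypothetical.

```php
<?php
// Hypothetical helper: any run of two or more line breaks (\r\n, \r or \n)
// becomes exactly one blank line; single line breaks are left as they are.
function squeeze_blank_lines(string $text): string
{
    return preg_replace('/(?:\r\n|[\r\n]){2,}/', "\n\n", $text);
}

$input = "i like cats\r\n\r\n\r\nmy cat is happy\ni love my cat\n\n\n\nhope you have a nice day";
echo squeeze_blank_lines($input), "\n";
```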
The following regular expression should remove multiple linebreaks while ignoring single line breaks, which are okay by your definition:
// ereg_replace() was removed in PHP 7, so this uses the PCRE equivalent
$string = preg_replace("/\n\n+/", "\n\n", $string);
You can test it with this PHP Regular Expression test tool, which is very handy (but as it seems not in perfect parity with PHP).
[EDIT] Fixed the ' to ", as they didn't seem to work. Have to admit I just tested the regex in the web tool. ;)