Regex to replace punctuation - php

I've been trying for a few hours to get this to work to the effect I need but nothing works quite like it should. I'm building a discussion board type thing and have made a way to tag other users by putting #username in the post text.
Currently I have this code to strip anything that wouldn't be part of the username once the tags have already been pulled out of the entire text:
$name= preg_replace("/[^A-Za-z0-9_]/",'',$name);
This works well because it correctly captures names written as, for example, (#username), #username:, or #username, some text (that is, it removes the ,, : and )).
However, this does not work when the user has non-ASCII characters in their username. For example, if it's #üsername, the line above gives sername, which is not useful.
Is there a way, using preg_replace, to still strip this additional punctuation but retain any non-ASCII letters?
Any help is much appreciated :)

You are entering the territory of Unicode regexps. The short property names \p{L} (letter) and \p{N} (number) are the ones PCRE documents:
$name = preg_replace('/[^\p{L}\p{N}_]/u', '', $name);
or the other way round. The link I provided contains more examples.
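A minimal sketch of the negated-class approach above, assuming PHP 7 or later and UTF-8 input. The helper name clean_username is hypothetical and does not come from the question.

```php
<?php
// Hypothetical helper: keep Unicode letters, digits and underscores, drop the rest.
// Assumes $name is valid UTF-8; preg_replace() returns null on invalid input,
// which is mapped to an empty string here.
function clean_username(string $name): string
{
    return preg_replace('/[^\p{L}\p{N}_]/u', '', $name) ?? '';
}

echo clean_username("(#\u{00FC}sername):"), "\n"; // üsername
echo clean_username('#user_name1,'), "\n";        // user_name1
```

Note that \p{L} does not include combining marks, so a name typed with a decomposed accent (base letter plus combining mark) would lose the accent. Adding \p{M} to the class is one way to keep it.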

To detect punctuation characters, you can use the Unicode property \p{P} instead (with the u modifier, so that UTF-8 input is treated as characters rather than bytes):
$name = preg_replace('/[\p{P} ]+/u', '', $name);
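As a rough illustration of the \p{P} approach, assuming PHP 7 or later and UTF-8 input (the function name strip_punctuation is made up for this sketch):

```php
<?php
// Sketch: strip Unicode punctuation (\p{P}) and spaces, keep everything else.
// Symbols such as "$" or "+" belong to category S, not P, so they survive.
function strip_punctuation(string $text): string
{
    return preg_replace('/[\p{P} ]+/u', '', $text) ?? '';
}

echo strip_punctuation("(#\u{00FC}sername): "), "\n"; // üsername
echo strip_punctuation('a+b, c$'), "\n";               // a+bc$
```

If symbols should go as well, \p{S} could be added to the class. Whether that is wanted depends on which characters usernames may contain.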

Related

using regex for filtering some words in persian in php

I'm working on a script that identifies offensive words in text messages. The problem is that users sometimes alter the words to make them unidentifiable. My code has to identify those too, as far as possible.
First of all, I replace all non-alphanumeric characters with spaces.
And then:
I've written two regex patterns.
One removes repeating characters from a string.
For example, if the user has written seeeeex, it replaces it with sex:
preg_replace('/(.)\1+/', '$1', $text)
This regex works fine for English words but not for Farsi words, which is my case.
For example, if you write:
امیییییییییین
it does nothing with it.
I also tried
mb_ereg_replace
But it didn't work either.
My other regex is meant to remove the spaces around one-letter words.
For example, I want it to convert S E X to sex:
preg_replace('/( [a-zA-Zآ-ی] )\1+/', trim('$1'), $text);
This regex doesn't work at all and needs to be corrected.
Thank you for your help
When working with multi-byte characters, you should enable the Unicode-aware modifier (u) so that tokens such as . match whole characters rather than single bytes. In your first case it should be:
/(.)\1+/u
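To see why the modifier matters, here is a small sketch that runs the same pattern with and without u. It assumes PHP 7 or later, because it uses \u{...} escapes, and it builds the sample word from code points so the bytes are unambiguous.

```php
<?php
// Sample word: alef, meem, ten Farsi yehs (U+06CC), noon.
$word = "\u{0627}\u{0645}" . str_repeat("\u{06CC}", 10) . "\u{0646}";

// Without /u the dot matches a single byte. U+06CC is encoded as the two
// bytes 0xDB 0x8C, so the bytes alternate and no byte is immediately repeated.
$bytewise = preg_replace('/(.)\1+/', '$1', $word);

// With /u the dot matches a whole code point, so the run of yehs collapses.
$unicode = preg_replace('/(.)\1+/u', '$1', $word);

var_dump($bytewise === $word);                                // bool(true)
var_dump($unicode === "\u{0627}\u{0645}\u{06CC}\u{0646}");    // bool(true)
```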
In your second regex, however, I see both syntax and semantic errors. I would change it to:
/\b(\pL)\s+/u
PHP:
preg_replace('/\b(\pL)\s+/u', '$1', $text);
Putting all together:
$text = 'سسس ککک سسس';
echo preg_replace(['/(.)\1+/u', '/\b(\pL)\s+/u'], '$1', $text); // outputs: سکس

Building a regex expression for PHP

I am stuck trying to create a regex that will allow for letters, numbers, and the following chars: _ - ! ? . ,
Here is what I have so far:
/^[-\'a-zA-Z0-9_!?,.\s]+$/ //not escaping the ?
and this version too:
/^[-\'a-zA-Z0-9_!\?,.\s]+$/ //attempting to escape the ?
Neither of these seems to be able to match the following:
"Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?"
Can somebody point out what I am doing wrong? I must point out that my script takes the user input (the paragraph in quotes in this case) and strips all white space so actual input has no white space.
Thanks!
UPDATE:
Thanks to Lix's advice, this is what I have so far:
/^[-\'a-zA-Z0-9_!\?,\.\s]+$/
However, it's still not working??
UPDATE2
Ok, based on input this is what's happening.
User inputs string, then I run the string through following functions:
$comment = preg_replace('/\s+/', '',
htmlspecialchars(strip_tags(trim($user_comment_orig))));
So in the end, user input is just a long string of chars without any spaces. Then that string of chars is run using:
preg_match("#^[-_!?.,a-zA-Z0-9]+$#",$comment)
What could possibly be causing trouble here?
FINAL UPDATE:
Ended up using this regex:
"#[-'A-Z0-9_?!,.]+#i"
Thanks all! lol, y'all are going to kill me once you find out where my mistake was!
Ok, so I had this piece of code:
if(!preg_match($pattern,$comment) || strlen($comment) < 2 || strlen($comment) > 60){
GEEZ!!! I never bothered to look at the strlen part of the code. Of course it was going to fail every time...I only allowed 60 chars!!!!
When in doubt, it's always safe to escape non-alphanumeric characters in a character class, so the following is fine:
/^[\-\'a-zA-Z0-9\_\!\?\,\.\s]+$/
When run through a regular expression tester, this finds a match with your target just fine, so I would suggest you may have a problem elsewhere if that doesn't take care of everything.
I assume you're not including the quotes you used around the target when actually trying for a match, since you didn't build double-quote matching in?
You also wrote: "my script takes the user input (the paragraph in quotes in this case) and strips all white space so actual input has no white space."
In that case you don't need the \s, provided the stripping works correctly.
I got the following code to work as expected (running PHP 5):
<?php
$pattern = "#[-'A-Z0-9_?!,.\s]+#i";
$string = "Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?";
$results = array();
preg_match($pattern, $string, $results);
echo '<pre>';
print_r($results);
echo '</pre>';
?>
The output from print_r($results) was as follows:
Array
(
[0] => Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?
)
Tested on http://writecodeonline.com/php/.
It's not necessary to escape most characters inside []. If you suspect that \s is not being honoured inside the class (PCRE, which PHP's preg_* functions use, does honour it, but not every regex flavor does), you have two options: either expand it manually (/^[-\'a-zA-Z0-9_!?,. \t\n\r]+$/) or use alternation (/^(?:[-\'a-zA-Z0-9_!?,.]|\s)+$/).
Note that I left the \ before the ' because I'm assuming you're putting this in a PHP string and I wouldn't want to suggest a syntax error.
The only characters with a special meaning within a character class are:
the dash (since it denotes a range), except when it comes first or last (in which case it cannot be part of a range),
the caret, but only when it is the first character (where it negates the class),
the closing bracket,
the backslash.
In "pure regex parlance", your character class can be written as:
[-_!?.,a-zA-Z0-9\s]
Now, you need to escape whatever needs to be escaped according to your language and how strings are written. Given that this is PHP, you can take the above sample as is. Note that \s is interpreted in character classes as well, so this will match anything which is matched by \s outside of a character class.
While some manuals recommend using escapes for safety, knowing the general regex rules for character classes and applying them leads to shorter and easier to read results ;)
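A short sketch of those rules in PHP, offered as an illustration and not a drop-in fix. The apostrophe is added to the class here only because the sample sentence contains "It's"; it is escaped for the PHP string literal, not for the regex.

```php
<?php
// Inside a class, "?", ".", "!" and "," are literal; "-" is literal because
// it comes first; \s keeps its usual meaning (any whitespace character).
$pattern = '/^[-_!?.,\'a-zA-Z0-9\s]+$/';
$comment = "Oh why, oh why is this regex not working! It's getting pretty frustrating?";

var_dump(preg_match($pattern, $comment));            // int(1)
var_dump(preg_match($pattern, 'no <tags> please'));  // int(0)
```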

Regex For PHP Code?

I have the following code
<?php
drupal_set_message("Your registration submission has been received.");
drupal_goto("/events-initiatives/events-listing");
?>
And I want to remove everything except the text "Your registration submission has been received." This message will change, so it needs to be matched by a wildcard. For example, it should also handle:
<?php
drupal_set_message("Testing!!!");
drupal_goto("/events-initiatives/events-listing");
?>
But I can't figure out how to do the PHP code, my current one is
preg_replace('#(<?php drupal_set_message(").*?("); drupal_goto("/guidelines-resources/professionals/lending-library"); ?>)#', '$1$2', $string);
but that isn't working, it seems to have problems with the ( in it.
Any idea how I could do this?
From looking at your original post, (before your regex was changed into a PHP snippet) I'd suggest you are looking for a regex along these lines:
#<\?php\s+drupal_set_message\(".*?"\);\s+drupal_goto\("/guidelines-resources/professionals/lending-library"\);\s+\?>#
Note that this regex:
escapes all special characters (e.g., ?, ( and )) with preceding backslashes
replaces each single space with \s+, which matches one or more consecutive whitespace characters
EDIT
After rereading your question, if the only thing you want left is the text that is passed as an argument to drupal_set_message, then try this:
$pattern = '#\bdrupal_set_message\("(.*?)"\)#';
$found = preg_match($pattern, $subject, $matches);
// if found, $matches[1] will contain the argument to drupal_set_message
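For completeness, here is the same pattern run against a sample input. The wrapper function extract_message and the one-line sample string are assumptions made for this sketch.

```php
<?php
// Hypothetical wrapper: return the first string argument passed to
// drupal_set_message(), or null when the call is not found.
function extract_message(string $subject): ?string
{
    $pattern = '#\bdrupal_set_message\("(.*?)"\)#';
    return preg_match($pattern, $subject, $matches) === 1 ? $matches[1] : null;
}

$subject = '<?php drupal_set_message("Testing!!!"); '
         . 'drupal_goto("/events-initiatives/events-listing"); ?>';

var_dump(extract_message($subject));        // string(10) "Testing!!!"
var_dump(extract_message('no call here'));  // NULL
```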
You can escape the special characters (though really, just the open and close parentheses) with backslashes. On a side note, if you have a decent IDE then it should have sophisticated regex-capable search-and-replace; use it (although if you do, you'll probably need to also escape the forward slashes, as those are the most likely delimiters that your IDE would use).

Preg_match when string is sometimes a single word?

I'm trying to pull a word out of an email subject line to use as a category for attached email. Preg_match works great as long as it's not just a single word (which is what I'd like to do anyway). If there is only one word in the subject line, I just get an empty array. I've tried to treat $matches as just a variable in that case, but that doesn't work either. Can anyone tell me if preg_match will work on a single word, or what the better way to do this would be?
Thanks very much
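Assuming your pattern is something like \b(?:word1|word2|word3)\b
Note that \b is a word boundary, and it also matches at the very start and end of the subject, so this pattern on its own does match a subject that is just "word1". If your actual pattern instead requires a separator character (a space, say) on either side of the word, a one-word subject has none, and the match fails.
In that case, what you can do is simply always inject a separator: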
preg_match('/\b(?:word1|word2|word3)\b/', "." . $subject . ".", $matches);
Crude but effective.
preg_match will work on a string one character long. I think that the issue here is probably your regex. My guess is that you're testing for whitespace and, because it isn't finding any, it says that there is no match. Try prepending '^([^\s]*)$|' to your regex and I wager it will start picking up those one-word values. ([^\s] means "any character that is not whitespace", and | means "or". By adding it to the front of your regex, you match either a subject with no whitespace at all or whatever you already had.)
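A quick way to check how preg_match behaves on a one-word subject is sketched below. The category words and the function name find_category are placeholders, since the question does not show the real pattern.

```php
<?php
// Hypothetical category list; the real words were not given in the question.
function find_category(string $subject): ?string
{
    $pattern = '/\b(?:invoice|report|photo)\b/i';
    return preg_match($pattern, $subject, $matches) === 1 ? $matches[0] : null;
}

foreach (['report', 'Weekly report attached', 'reports'] as $subject) {
    echo $subject, ' => ', find_category($subject) ?? '(no match)', "\n";
}
// report => report
// Weekly report attached => report
// reports => (no match)
```

The first line shows that \b is satisfied at the start and end of the subject, so a pattern built only on word boundaries does not need extra separator characters.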

filter non-alphanumeric "repeating" characters

What's the best way to filter non-alphanumeric "repeating" characters
I would rather not build a list of characters to check for. Is there a good regex for this that I can use in PHP?
Examples:
...........
*****************
!!!!!!!!
###########
------------------
~~~~~~~~~~~~~
Special case patterns:
=*=*=*=*=*=
->->->->
Based on @sln's answer:
$str = preg_replace('~([^0-9a-zA-Z])\1+|(?:=[*])+|(?:->)+~', '', $str);
The pattern could be something like this : s/([\W_]|=\*|->)\1+//g
or, if you want to replace by just a single instance: s/([\W_]|=\*|->)\1+/$1/g
Edit: any special sequence should probably come first in the alternation; in case you need to make something like == special, it then won't be grabbed by [\W_].
So something like s/(==>|=\*|->|[\W_])\1+/$1/g where special cases are first.
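Translated to PHP, the "special cases first" idea might look like the sketch below. The function name collapse_repeats is made up here, and the u modifier is an addition so that non-ASCII input is handled as characters instead of bytes.

```php
<?php
// Sketch: collapse runs of a repeated non-alphanumeric unit to one instance.
// Multi-character special cases come first so that [\W_] does not grab
// their first character.
function collapse_repeats(string $str): string
{
    return preg_replace('~(==>|=\*|->|[\W_])\1+~u', '$1', $str) ?? '';
}

echo collapse_repeats('Hello!!!!! ->->-> world'), "\n"; // Hello! -> world
echo collapse_repeats('=*=*=*=*=*='), "\n";             // =*=
```

Because [\W_] also covers whitespace, runs of spaces or tabs are collapsed too.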
preg_replace('~\W+~', '', $str);
sln's solution is pretty good, but the \W "non-word" class includes whitespace. I don't think you want to be removing sequences of tabs or spaces! Using a negated class (something like '[^A-Za-z0-9\s]') would work better.
This will filter out all symbols (ereg_replace was removed in PHP 7, so preg_replace is used here):
$q = preg_replace('/[^A-Za-z0-9 ]/', '', $q);
replace(/([^A-Za-z0-9\s]+)\1+/, "")
will remove repeated patterns of non-alphanumeric non-whitespace strings.
However, this is a bad practice because you'll also be removing all non-ASCII European and other international language characters in the Unicode base.
The only place where you really won't ever care about internationalization is in processing source code, but then you are not handling text quoted in strings and you may also accidentally de-comment a block.
You may want to be more restrictive in what you try to remove by giving a list of characters to replace instead of the catch-all.
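One possible shape for the more restrictive version is sketched here. The symbol list and the threshold of three are assumptions chosen for illustration, as is the function name strip_symbol_runs.

```php
<?php
// Sketch: remove runs of three or more of an explicitly listed symbol.
// Letters, digits, whitespace and any unlisted character are left alone,
// including non-ASCII letters.
function strip_symbol_runs(string $str): string
{
    return preg_replace('/([.*!#~=<>-])\1{2,}/u', '', $str) ?? '';
}

echo strip_symbol_runs('<<!!!!--- GREAT MUSIC STATION ---!!!!>>'), "\n";
// << GREAT MUSIC STATION >>
echo strip_symbol_runs("Caf\u{00E9}......"), "\n";
// Café
```

The doubled << and >> survive because they fall below the threshold; lowering {2,} to + would remove them as well.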
Edit: I have done similar things before when trying to process early-version ShoutCAST radio names. At that time, stations tried to call attention to themselves by having obnoxious names like: <<!!!!--- GREAT MUSIC STATION ---!!!!>>. I used similar code to get rid of repeated symbols, but then learnt (the hard way) to be careful about what I eventually remove.
This works for me:
preg_replace('/(.)\1{3,}/i', '', $sourceStr);
It removes any character that repeats four or more times in a row (the captured character plus at least three repeats).
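A small check of how the quantifier counts, written with the u modifier instead of the original i (an assumption that the input is UTF-8):

```php
<?php
// (.)\1{3,} needs the captured character plus at least three repeats,
// so it only fires on runs of four or more identical characters.
$pattern = '/(.)\1{3,}/u';

echo preg_replace($pattern, '', 'aaa----bbb'), "\n"; // aaabbb
echo preg_replace($pattern, '', 'ooooh...'), "\n";   // h...
```

Since the dot matches letters and digits as well, a word such as "ooooh" loses its vowels; restricting the capture group to a symbol class avoids that.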
