I'm attempting to create a bad word filter in PHP that will analyze the word and match against an array of known bad words, but keep the first letter of the word and replace the rest with asterisks. Example:
fook would become f***
shoot would become s****
The only part I don't know is how to keep the first letter in the string, and how to replace the remaining letters with something else while keeping the same string length.
$string = preg_replace("/\b(". $word .")\b/i", "***", $string);
Thanks!
$string = 'fook would become';
$word = 'fook';
$string = preg_replace("~\b". preg_quote($word, '~') ."\b~i", $word[0] . str_repeat('*', strlen($word) - 1), $string);
var_dump($string);
$string = preg_replace("/\b(" . preg_quote($word[0], '/') . ")" . preg_quote(substr($word, 1), '/') . "\b/i", '$1' . str_repeat('*', strlen($word) - 1), $string);
This can be done in many ways, with very weird auto-generated regexps...
But I believe using preg_replace_callback() would end up being more robust
<?php
# as already pointed out, your words *may* need sanitization
foreach($words as $k=>$v)
$words[$k]=preg_quote($v,'/');
# and to be collapsed into a **big regexpy goodness**
$words=implode('|',$words);
# after that, a single preg_replace_callback() would do
$string = preg_replace_callback('/\b('. $words .')\b/i', "my_beloved_callback", $string);
function my_beloved_callback($m)
{
$len=strlen($m[1])-1;
return $m[1][0].str_repeat('*',$len);
}
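For reference, here is a minimal self-contained sketch of the callback approach above. The function name censor_words() and the sample word list are made up for illustration. It keeps the first letter exactly as it was typed and masks the rest, and it counts bytes with strlen(), so it assumes the bad words are plain ASCII.

```php
<?php
// Hypothetical helper: mask every listed word, keeping its first letter.
function censor_words(string $text, array $badWords): string
{
    if ($badWords === []) {
        return $text; // nothing to filter
    }
    // Escape each word so regex metacharacters are treated literally.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $badWords);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    return preg_replace_callback($pattern, function ($m) {
        // First byte of the match as typed, then one asterisk per remaining byte.
        return $m[1][0] . str_repeat('*', strlen($m[1]) - 1);
    }, $text);
}

echo censor_words('Fook that, I said shoot!', ['fook', 'shoot']), "\n";
// F*** that, I said s****!
```

Because of the \b word boundaries, a longer word such as "fooking" is left untouched.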
Here is a Unicode-friendly regular expression for PHP:
function lowercase_except_first_letter($s) {
// the following line SKIP the first word and pass it to callback func...
// \W it allows to keep the first letter even in words in quotes and brackets
return preg_replace_callback('/(?<!^|\s|\W)(\w)/u', function($m) {
return mb_strtolower($m[1]);
}, $s);
}
An example:
THIS IS A Sentence that should be TAKEN Care of
The output should be:
This is a Sentence that should be taken Care of
Rules
Convert UPPERCASE words to lowercase
Keep the lowercase words with an uppercase first character intact
Set the first character in the sentence to uppercase.
Code
$string = ucfirst(strtolower($string));
Fails
It fails because the ucfirst words are not being kept.
This is a sentence that should be taken care of
You can test each word for those rules:
$str = 'THIS IS A Sentence that should be TAKEN Care of';
$words = explode(' ', $str);
foreach($words as $k => $word){
if(strtoupper($word) === $word || // first rule
ucfirst($word) !== $word){ // second rule
$words[$k] = strtolower($word);
}
}
$sentence = ucfirst(implode(' ', $words)); // third rule
Output:
This is a Sentence that should be taken Care of
A little bit of explanation:
Since you have overlapping rules, you need to individually compare them, so...
Break down the sentence into separate words and check each of them based on the rules;
If the word is UPPERCASE, turn it into lowercase; (THIS, IS, A, TAKEN)
If the word is ucfirst, leave it alone; (Sentence, Care)
If the word is NOT ucfirst, turn it into lowercase. (that, should, be, of)
You can break the sentence down into individual words, then apply a formatting function to each of them:
$sentence = 'THIS IS A Sentence that should be TAKEN Care of';
$words = array_map(function ($word) {
// If the word only has its first letter capitalised, leave it alone
if ($word === ucfirst(strtolower($word)) && $word != strtoupper($word)) {
return $word;
}
// Otherwise set to all lower case
return strtolower($word);
}, explode(' ', $sentence));
// Re-combine the sentence, and capitalise the first character
echo ucfirst(implode(' ', $words));
See https://eval.in/936462
$str = "THIS IS A Sentence that should be TAKEN Care of";
$str_array = explode(" ", $str);
foreach ($str_array as $testcase =>$str1) {
//Check the first word
if ($testcase ==0 && ctype_upper($str1)) {
echo ucfirst(strtolower($str1))." ";
}
//Convert every other uppercase word to lowercase
elseif( ctype_upper($str1)) {
echo strtolower($str1)." ";
}
//Do nothing with lowercase
else {
echo $str1." ";
}
}
Output:
This is a Sentence that should be taken Care of
I find preg_replace_callback() to be a direct tool for this task. Create a pattern that will capture the two required strings:
The leading word
Any non-leading, ALL-CAPS word
Code: (Demo)
echo preg_replace_callback(
'~(^\pL+\b)|(\b\p{Lu}+\b)~u',
function($m) {
return $m[1]
? mb_convert_case($m[1], MB_CASE_TITLE, 'UTF-8')
: mb_strtolower($m[2], 'UTF-8');
},
'THIS IS A Sentence that should be TAKEN Care of'
);
// This is a Sentence that should be taken Care of
I did not test this with multibyte input strings, but I have tried to build it with multibyte characters in mind.
The custom function works like this:
There will always be either two or three elements in $m. If the first capture group matches the first word of the string, then there will be no $m[2]. When a non-first word is matched, then $m[2] will be populated and $m[1] will be an empty string. There is a modern flag that can be used to force that empty string to be null, but it is not advantageous in this case.
\pL+ means one or more of any letter (single or multi-byte)
\p{Lu}+ means one or more uppercase letters
\b is a word boundary. It is a zero-width assertion -- it doesn't match a character, it checks that two consecutive characters change from a word character to a non-word character or vice versa.
My answer makes just 4 matches/replacements on the sample input string (THIS, IS, A and TAKEN).
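To check which words the pattern actually hands to the callback, one can list the matches with preg_match_all() before doing any replacing. This is only a diagnostic sketch that reuses the pattern and sample sentence from the answer; nothing here is needed for the replacement itself.

```php
<?php
$pattern = '~(^\pL+\b)|(\b\p{Lu}+\b)~u';
$input   = 'THIS IS A Sentence that should be TAKEN Care of';

// Collect every full match to see which words the callback would receive.
preg_match_all($pattern, $input, $matches);

// $matches[0] holds the full matches in the order they were found.
print_r($matches[0]);
```

Words such as "Sentence" and "Care" are skipped because \p{Lu}+ can only consume the leading capital, and the \b that follows it fails in the middle of a word.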
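$string='THIS IS A Sentence that should be TAKEN Care of';
$arr=explode(" ", $string);
$stry = ''; // initialise the accumulator to avoid an undefined-variable notice
foreach($arr as $v)
{
// note: this capitalises the first letter of EVERY word
$v = ucfirst(strtolower($v));
$stry = $stry . ' ' . $v;
}
echo ltrim($stry); // drop the leading space added by the loop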
I'm trying to remove all words of less than 3 characters from a string, specifically with RegEx.
The following doesn't work because it is looking for double spaces. I suppose I could convert all spaces to double spaces beforehand and then convert them back after, but that doesn't seem very efficient. Any ideas?
$text='an of and then some an ee halved or or whenever';
$text=preg_replace('# [a-z]{1,2} #',' ',' '.$text.' ');
echo trim($text);
Removing the Short Words
You can use this:
$replaced = preg_replace('~\b[a-z]{1,2}\b~', '', $yourstring);
In the demo, see the substitutions at the bottom.
Explanation
\b is a word boundary that matches a position where one side is a word character (a letter, digit or underscore) and the other side is not (for instance a space character, or the beginning of the string)
[a-z]{1,2} matches one or two letters
\b another word boundary
Replace with the empty string.
Option 2: Also Remove Trailing Spaces
If you also want to remove the spaces after the words, we can add \s* at the end of the regex:
$replaced = preg_replace('~\b[a-z]{1,2}\b\s*~', '', $yourstring);
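As a quick comparison, here is a small sketch that runs both options on the sample string from the question. The whitespace-collapsing step after option 1 and the trim() calls are additions for the sake of the comparison, not part of the answer above.

```php
<?php
$text = 'an of and then some an ee halved or or whenever';

// Option 1: drop the short words only; the surrounding spaces stay behind.
$option1 = preg_replace('~\b[a-z]{1,2}\b~', '', $text);
// Collapse the leftover runs of whitespace (an extra step, not in the answer).
$option1 = trim(preg_replace('~\s+~', ' ', $option1));

// Option 2: also swallow the whitespace that follows each short word.
$option2 = trim(preg_replace('~\b[a-z]{1,2}\b\s*~', '', $text));

echo $option1, "\n"; // and then some halved whenever
echo $option2, "\n"; // and then some halved whenever
```

Option 2 can still leave a trailing space when the string ends with a kept word followed by a removed one, which is why trim() is applied here.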
Reference
Word Boundaries
You can use the word boundary tag: \b:
Replace: \b[a-z]{1,2}\b with ''
Use this
preg_replace('/\b\w{1,2}\b\s?/','',$your_string);
Although some of the solutions here worked, they had a problem with my language's "multichar characters", such as "ch". A simple explode and implode worked for me.
$minWordLength = 3;
$string = "my super string";
$exploded = explode(" ", $string);
foreach($exploded as $key => $word) {
if(mb_strlen($word) < $minWordLength) unset($exploded[$key]);
}
$string = implode(" ", $exploded);
echo $string;
// outputs "super string"
To me, it seems that this hack works fine with most PHP versions:
$string2 = preg_replace("/\b[a-zA-Z0-9]{1,2}\b/i", "", trim($string1));
Where [a-zA-Z0-9] are the accepted Char/Number range.
While attempting a question on SO, I tried to write a regular expression that matches strings containing all three given characters.
I am following the answer to "Regular Expressions: Is there an AND operator?"
<?php
$words = "systematic,gear,synthesis,mysterious";
$words=explode(",",$words);
$your_array = preg_grep("/^(^s|^m|^e)/", $words);
print_r($your_array);
?>
The output should be systematic and mysterious, but I am getting synthesis as well.
Why is that? What am I doing wrong?
** I don't want a new solution :)
SEE HERE
You can do this:
$wordlist = 'systematic,gear,synthesis,mysterious';
$words = explode(',', $wordlist);
foreach($words as $word) {
if (preg_match('~(?=[^s]*s)(?=[^m]*m)(?=[^e]*e)~', $word))
echo '<br/>' . $word;
}
//or
$res = preg_grep('~(?=[^s]*s)(?=[^m]*m)(?=[^e]*e)~', $words);
print_r($res);
To test the presence of a character in the string, I use (?=[^s]*s).
[^s]*s means anything that is not an "s", zero or more times, followed by an "s".
(?=..) is a lookahead assertion and means "followed by". It is only a check: a lookahead adds no characters to the match result, and the main benefit of this feature is that you can check the same substring several times.
What is wrong with your pattern?
/^(^s|^m|^e)/ will give you only words that begin with "s" or "m" or "e", because ^ is an anchor and means "start of the string". In other words, your pattern is the same as /^([sme])/.
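The lookahead idea generalises to any set of required characters. Below is a rough sketch; the helper name words_containing_all() is invented for this example, and it assumes each required character is a single ASCII character.

```php
<?php
// Hypothetical helper: keep only the words that contain every required character.
function words_containing_all(array $words, array $chars): array
{
    $pattern = '~^';
    foreach ($chars as $c) {
        $q = preg_quote($c, '~');
        // One lookahead per character: "anything that is not $c, then $c".
        // Each lookahead starts from the beginning, so the order does not matter.
        $pattern .= '(?=[^' . $q . ']*' . $q . ')';
    }
    $pattern .= '~';

    // preg_grep() preserves the original keys, so re-index the result.
    return array_values(preg_grep($pattern, $words));
}

$words = ['systematic', 'gear', 'synthesis', 'mysterious'];
print_r(words_containing_all($words, ['s', 'm', 'e']));
// systematic and mysterious
```

"synthesis" is dropped because it has no "m", and "gear" because it has neither "s" nor "m".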
I have a string that contains several underscore-separated parts, e.g. "Field_4_txtbox". I need to find the last underscore in the string and remove everything from it onward (including the "_"), so that I get back "Field_4". The ending can be of any length, so I can't just trim a fixed number of characters.
I know I can do an If statement that checks for certain endings like
if(strstr($key,'chkbox')) {
$string= rtrim($key, '_chkbox');
}
but I would like to do this in one go with a regex pattern, how can I accomplish this?
The matching regex would be:
/_[^_]*$/
Just replace that with '':
preg_replace( '/_[^_]*$/', '', $your_string );
There is no need to use an extremely costly regex, a simple strrpos() would do the job:
$string=substr($key,0,strrpos($key,"_"));
strrpos — Find the position of the last occurrence of a substring in a string
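One thing to watch with the strrpos() approach: strrpos() returns false when the string contains no underscore at all, and as far as I can tell the one-liner then produces an empty string. A small guarded sketch follows; the function name strip_last_segment() is made up for this example.

```php
<?php
// Hypothetical helper: cut off the last "_suffix", if there is one.
function strip_last_segment(string $key): string
{
    $pos = strrpos($key, '_');
    // strrpos() returns false when no underscore exists; keep the key unchanged then.
    return $pos === false ? $key : substr($key, 0, $pos);
}

echo strip_last_segment('Field_4_txtbox'), "\n"; // Field_4
echo strip_last_segment('Field_4'), "\n";        // Field
echo strip_last_segment('Field'), "\n";          // Field
```

The regex version preg_replace('/_[^_]*$/', '', $key) behaves the same way on these inputs, since it simply finds nothing to replace when there is no underscore.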
You can also just use explode():
$string = 'Field_4_txtbox';
$temp = explode('_', strrev($string), 2);
$string = strrev($temp[1]);
echo $string;
As of PHP 5.4+
$string = 'Field_4_txtbox';
$string = strrev(explode('_', strrev($string), 2)[1]);
echo $string;
For example, if my sentence is $sent = 'how are you'; and if I search for $key = 'ho' using strstr($sent, $key) it will return true because my sentence has ho in it.
What I'm looking for is a way to return true if I only search for how, are or you. How can I do this?
You can use the function preg_match() with a regex that uses word boundaries:
if(preg_match('/\byou\b/', $input)) {
echo $input.' has the word you';
}
If you want to check for multiple words in the same string, and you're dealing with large strings, then this is faster:
$text = explode(' ',$text);
$text = array_flip($text);
Then you can check for words with:
if (isset($text[$word])) doSomething();
This method is lightning fast.
But for checking for a couple of words in short strings then use preg_match.
UPDATE:
If you're actually going to use this I suggest you implement it like this to avoid problems:
$text = preg_replace('/[^a-z\s]/', '', strtolower($text));
$text = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
$text = array_flip($text);
$word = strtolower($word);
if (isset($text[$word])) doSomething();
Then double spaces, linebreaks, punctuation and capitals won't produce false negatives.
This method is much faster in checking for multiple words in large strings (i.e. entire documents of text), but it is more efficient to use preg_match if all you want to do is find if a single word exists in a normal size string.
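Put together, the lookup-table idea might look like the sketch below. The helper name build_word_index() and the sample text are invented for illustration; note that the cleanup regex also throws away digits and non-ASCII letters, so this assumes plain English text.

```php
<?php
// Hypothetical helper: turn a text into a word => position lookup table.
function build_word_index(string $text): array
{
    // Lowercase, then keep only a-z and whitespace.
    $text  = preg_replace('/[^a-z\s]/', '', strtolower($text));
    // Split on any run of whitespace, skipping empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Words become array keys, so each later check is a single isset().
    return array_flip($words);
}

$index = build_word_index("How are you?\nFine,  thanks.");

var_dump(isset($index['how']));    // bool(true)
var_dump(isset($index['ho']));     // bool(false)
var_dump(isset($index['thanks'])); // bool(true)
```

The index is built once per text, which is where the saving comes from when many words are checked against the same document.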
One thing you can do is breaking up your sentence by spaces into an array.
Firstly, you would need to remove any unwanted punctuation marks.
The following code removes anything that isn't a letter, number, or space:
$sent = preg_replace("/[^a-zA-Z 0-9]+/", " ", $sent);
Now, all you have are the words, separated by spaces. To create an array that splits by space...
$sent_split = explode(" ", $sent);
Finally, you can do your check. Here are all the steps combined.
// The information you give
$sent = 'how are you';
$key = 'ho';
// Isolate only words and spaces
$sent = preg_replace("/[^a-zA-Z 0-9]+/", " ", $sent);
$sent_split = explode(" ", $sent);
// Do the check
if (in_array($key, $sent_split))
{
echo "Word found";
}
else
{
echo "Word not found";
}
// Outputs: Word not found
// because 'ho' isn't a word in 'how are you'
@codaddict's answer is technically correct, but if the word you are searching for is provided by the user, you need to escape any characters with special regular expression meaning in the search word. For example:
$searchWord = $_GET['search'];
$searchWord = preg_quote($searchWord, '/');
if (preg_match("/\b$searchWord\b/", $input)) {
echo "$input has the word $searchWord";
}
With recognition to Abhi's answer, a couple of suggestions:
I added /i to the regex since sentence-words are probably treated case-insensitively
I added explicit === 1 to the comparison based on the documented preg_match return values
$needle = preg_quote($needle, '/');
return preg_match("/\b$needle\b/i", $haystack) === 1;
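Wrapped into a function, the two lines above could look like this. The name contains_word() is only a placeholder chosen for this sketch.

```php
<?php
// Hypothetical wrapper around the two lines above.
function contains_word(string $haystack, string $needle): bool
{
    // Pass the delimiter so that a "/" inside the needle is escaped too.
    $needle = preg_quote($needle, '/');
    // preg_match() returns 1 on a match, 0 on no match and false on error.
    return preg_match("/\b$needle\b/i", $haystack) === 1;
}

var_dump(contains_word('how are you', 'ho'));  // bool(false)
var_dump(contains_word('how are you', 'how')); // bool(true)
var_dump(contains_word('how are you', 'YOU')); // bool(true)
```

Keep in mind that \b only makes sense next to word characters, so a needle that starts or ends with punctuation may not match the way one expects.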