Check a string for bad words? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Efficient way to test string for certain words
I want to check if a string contains any of these words: ban, bad, user, pass, stack, name, html.
If it contains any of the words, I need to echo the number of bad words.
$str = 'Hello my name is user';

I think something like this would work:
$badWords = array("ban","bad","user","pass","stack","name","html");
$string = "Hello my name is user.";
$matches = array();
// Separator first: the array-first argument order was removed in PHP 8
$matchFound = preg_match_all(
"/\b(" . implode("|", $badWords) . ")\b/i",
$string,
$matches
);
if ($matchFound) {
$words = array_unique($matches[0]);
echo "<ul>"; // open the list that is closed below
foreach($words as $word) {
echo "<li>" . $word . "</li>";
}
echo "</ul>";
}
This creates an array of banned words, and uses a regular expression to find instances of these words:
\b in the Regex indicates a word boundary (i.e. the beginning or end of a word, determined by either the beginning/end of the string or a non-word character). This is done to prevent "clbuttic" mistakes - i.e. you don't want to ban the word "banner" when you only want to match the word "ban".
The implode function creates a single string containing all your banned words, separated by a pipe character, which is the or operator in the Regex.
The implode portion of the Regex is surrounded with parentheses so that preg_match_all will capture the banned word as the match.
The i modifier at the end of the Regex indicates that the match should be case-insensitive - i.e. it will match each word regardless of capitalization - "Ban", "ban", and "BAN" will all match against the word "ban" in the $badWords array.
Next, the code checks if any matches were found. If there are, it uses array_unique to ensure only one instance of each word is reported, and then it outputs the list of matches in an unordered list.
Is this what you're looking for?
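The question also asks for the number of bad words. Here is a minimal sketch of that, assuming the count should cover distinct banned words matched case-insensitively. The function name count_bad_words is illustrative and not part of the answer above.

```php
<?php
// Hypothetical helper: counts distinct banned words that occur as whole words.
function count_bad_words(string $text, array $badWords): int
{
    if (!$badWords) {
        return 0; // an empty list would otherwise build an always-matching pattern
    }
    // Escape each word so regex metacharacters are treated literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $badWords);
    $pattern = '/\b(' . implode('|', $escaped) . ')\b/i';

    if (!preg_match_all($pattern, $text, $matches)) {
        return 0;
    }
    // Lower-case before de-duplicating so "Name" and "name" count once.
    return count(array_unique(array_map('strtolower', $matches[0])));
}

$badWords = array("ban", "bad", "user", "pass", "stack", "name", "html");
echo count_bad_words("Hello my name is user", $badWords); // prints 2
```

To count every occurrence rather than distinct words, the return value of preg_match_all could be used directly.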

This is what you want.
function teststringforbadwords($string,$banned_words) {
foreach($banned_words as $banned_word) {
if(stristr($string,$banned_word)){
return false;
}
}
return true;
}
$string = "test string";
$banned_words = array('ban','bad','user','pass','stack','name','html');
if (teststringforbadwords($string,$banned_words)) { // the function returns true when no banned word is found
echo 'string is clean';
}else{
echo 'string contains banned words';
}

The \b in the pattern indicates a word boundary, so only the distinct
word "web" is matched, and not a word partial like "webbing" or "cobweb"
if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
if (preg_match("/\bweb\b/i", "PHP is the website scripting language of choice.")) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
This is your best bet. As stated at the beginning you can control your regex.
This is directly from php.net

function check_words($text) {
$text=$text;
$bad_words = file('bad_words.txt');
$bad = explode(" | ",$bad_words[0]);
$b = '/\W' . implode('\W|\W', $bad) . '\W/i';
if(preg_match($b, $text)){
echo $text ." - Contain Bad words!"; // other function here
} else {
echo $text ." - Not containing bad words :D";
// other function here
}
}
Example: check_words('He is good');
This works well, although anything after the final / does not seem to get checked. For example, in http://www.mysite.com/thisbit, thisbit is not checked for bad words.
It does work, however, if the URL is typed with a trailing slash: http://www.mysite.com/thisbit/.
Not sure if this can be fixed or not.

function check_words($text) {
$text=$text;
$bad_words = file('bad_words.txt');
$bad = explode(" | ",$bad_words[0]);
$b = '/\W' . implode('\W|\W', $bad) . '\W/i';
if(preg_match($b, $text)){
echo $text ." - Contain Bad words!";
# - other function here
}
else{
echo $text ." - Not containing bad words :D";
# - other function here
}
}
# - Example
check_words('He is good');
Hope this helps. You can put all the bad words in a bad_words.txt file.
Arrange the bad words in txt as:
bad_words1 | bad_words2 | bad_words3 | bad_words4 ...
Note: you can also put something like:
bad words 1 | bad words 2 | bad words 3
as long as it is in the "|" format.
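The trailing-slash behaviour reported above comes from \W: it has to consume an actual non-word character on each side, so a bad word at the very start or end of the text has nothing to match against. Below is a sketch of one way around it, assuming the same "word1 | word2" file format and swapping \W for the zero-width \b. The function name build_bad_word_pattern is hypothetical, and the sketch assumes at least one word is listed.

```php
<?php
// Hypothetical helper: turns one "a | b | c" line into a whole-word regex.
function build_bad_word_pattern(string $line): string
{
    // trim() also drops the newline that file() leaves at the end of the line.
    $words = array_filter(array_map('trim', explode('|', $line)), 'strlen');
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words);
    // \b is zero-width, so it also succeeds at the start and end of the text.
    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

$pattern = build_bad_word_pattern("thisbit | other word\n");
var_dump(preg_match($pattern, 'http://www.mysite.com/thisbit') === 1);  // bool(true)
var_dump(preg_match($pattern, 'http://www.mysite.com/thisbit/') === 1); // bool(true)
```

With the file from the answer, the line would come from file('bad_words.txt')[0]. Note that \b only helps when each bad word starts and ends with a word character.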

Related

PHP convert uppercase words to lowercase, but keep ucfirst on lowercase words

An example:
THIS IS A Sentence that should be TAKEN Care of
The output should be:
This is a Sentence that should be taken Care of
Rules
Convert UPPERCASE words to lowercase
Keep the lowercase words with an uppercase first character intact
Set the first character in the sentence to uppercase.
Code
$string = ucfirst(strtolower($string));
Fails
It fails because the ucfirst words are not being kept.
This is a sentence that should be taken care of

You can test each word for those rules:
$str = 'THIS IS A Sentence that should be TAKEN Care of';
$words = explode(' ', $str);
foreach($words as $k => $word){
if(strtoupper($word) === $word || // first rule
ucfirst($word) !== $word){ // second rule
$words[$k] = strtolower($word);
}
}
$sentence = ucfirst(implode(' ', $words)); // third rule
Output:
This is a Sentence that should be taken Care of
A little bit of explanation:
Since you have overlapping rules, you need to individually compare them, so...
Break down the sentence into separate words and check each of them based on the rules;
If the word is UPPERCASE, turn it into lowercase; (THIS, IS, A, TAKEN)
If the word is ucfirst, leave it alone; (Sentence, Care)
If the word is NOT ucfirst, turn it into lowercase. (that, should, be, of)

You can break the sentence down into individual words, then apply a formatting function to each of them:
$sentence = 'THIS IS A Sentence that should be TAKEN Care of';
$words = array_map(function ($word) {
// If the word only has its first letter capitalised, leave it alone
if ($word === ucfirst(strtolower($word)) && $word != strtoupper($word)) {
return $word;
}
// Otherwise set to all lower case
return strtolower($word);
}, explode(' ', $sentence));
// Re-combine the sentence, and capitalise the first character
echo ucfirst(implode(' ', $words));
See https://eval.in/936462

$str = "THIS IS A Sentence that should be TAKEN Care of";
$str_array = explode(" ", $str);
foreach ($str_array as $testcase =>$str1) {
//Check the first word
if ($testcase ==0 && ctype_upper($str1)) {
echo ucfirst(strtolower($str1))." ";
}
//Convert every other uppercase word to lowercase
elseif( ctype_upper($str1)) {
echo strtolower($str1)." ";
}
//Do nothing with lowercase
else {
echo $str1." ";
}
}
Output:
This is a Sentence that should be taken Care of

I find preg_replace_callback() to be a direct tool for this task. Create a pattern that will capture the two required strings:
The leading word
Any non-leading, ALL-CAPS word
Code:
echo preg_replace_callback(
'~(^\pL+\b)|(\b\p{Lu}+\b)~u',
function($m) {
return $m[1]
? mb_convert_case($m[1], MB_CASE_TITLE, 'UTF-8')
: mb_strtolower($m[2], 'UTF-8');
},
'THIS IS A Sentence that should be TAKEN Care of'
);
// This is a Sentence that should be taken Care of
I did not test this with multibyte input strings, but I have tried to build it with multibyte characters in mind.
The custom function works like this:
There will always be either two or three elements in $m. If the first capture group matches the first word of the string, then there will be no $m[2]. When a non-first word is matched, then $m[2] will be populated and $m[1] will be an empty string. There is a modern flag that can be used to force that empty string to be null, but it is not advantageous in this case.
\pL+ means one or more of any letter (single or multi-byte)
\p{Lu}+ means one or more uppercase letters
\b is a word boundary. It is a zero-width assertion: it doesn't consume a character, it checks that two consecutive characters change from a word character to a non-word character or vice versa.
My answer makes just 4 matches/replacements on the sample input string (THIS, IS, A and TAKEN).

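$string='THIS IS A Sentence that should be TAKEN Care of';
$arr=explode(" ", $string);
$stry = ''; // initialise to avoid an undefined-variable warning
foreach($arr as $v)
{
// Note: this capitalises the first letter of every word
$v = ucfirst(strtolower($v));
$stry = $stry . ' ' . $v;
}
echo ltrim($stry);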

How to get substrings on both sides of hyphen and trailing substring?

I am currently working on a web app which is using a specific string to call a function. Here is a sample string:
$string = "translate from-to word for translate";
First I need to validate the string, and it should be like the above $string. How should I validate the string?
Then I need to extract 3 substrings from $string.
The word that precedes the hyphen. (To be named: $source)
The word that follows the hyphen. (To be named: $target)
The text (not including the first space) that follows $source to the end of the string. (To be named: $translate)
This is my coding attempt to get the from and to:
$found = false;
$source ="";
$target = "";
$next = 3;
$prev = 1;
for($i=0;$i<strlen($string);$i++){
if($found== false){
if($string[$i] == "-"){
$found = true;
while($string[$i+$prev] != " "){
$target .= $string[$i+$prev];
$prev +=1;
}
/*$next -=1;
while($string[$i-$next] != " " && $next > 0){
$source .= $string[$i-$next];
$next -=1;
}*/
}
}
}
From that code, I can only get $target, which contains the to after the hyphen. I don't know how to get $source.
Please show me the fastest way to get the from as $source and to as $target.
Then I need to get word for translate (all of the string after from-to).
So the result should be
$target = "to";
$source = "from";
$translate = "word for translate";
Finally, if the $string has more than one hyphen, like translate from-to from-to test-test word for translate, it should return false.
Note: to and from are random strings.

Consider the following possible input strings:
translate from-to word for translate (1 hyphen, no accents or non-English characters)
translate dari-ke dari-ke word for translate (2 hyphens)
translate clé-solution word for translate (1 hyphen, accented character used)
translate goodbye-さようなら word for translate (1 hyphen , Japanese characters used)
A case-insensitive pattern like: /^[a-z]+? ([a-z]+)-([a-z]+?) ([a-z ]+)$/i will perform as requested on the first two sample strings with high efficiency, but not the last two.
Using the "word character" (\w) to match the substrings (instead of case-insensitive [a-z]) will perform as intended with the first two samples, but also allows 0-9 and _ as valid characters. This means a slight drop in pattern accuracy (this may be of no noticeable consequence to your project).
If you are translating strings that may go beyond English characters, it can be simpler / more forgiving to use a "negated character class" for matching. If you want to allow letters beyond a-z, like accented and other multibyte characters, then [^-] will offer a broad allowance of characters (at the expense of allowing many unwanted characters too).
It is important to only write "capture groups" for substrings that you want to subsequently use. For this reason, I do not capture the leading substring translate.
list() is a handy "language construct" for assigning array values to variables. Notice that the first element (the full-string match) is not assigned to a variable, which is why list()'s argument list starts with a comma. If you don't wish to use list(), you can assign the three variables manually over three lines like this:
$source=$out[1];
$target=$out[2];
$translate=$out[3];
Code:
$strings=[
"translate from-to word for translate",
"translate dari-ke dari-ke word for translate",
"translate clé-solution word for translate",
"translate goodbye-さようなら word for translate"
];
foreach($strings as $string){
if(preg_match('/^[a-z]+? ([^-]+)-([^-]+?) ([a-z ]+)$/i',$string,$out)){
list(,$source,$target,$translate)=$out;
echo "source=$source; target=$target; translate=$translate";
}else{
var_export(false); // $found=false;
}
echo "<br>";
}
Output:
source=from; target=to; translate=word for translate
false
source=clé; target=solution; translate=word for translate
source=goodbye; target=さようなら; translate=word for translate
While regex provides a much more concise method with fewer function calls, this is a non-regex method:
if(substr_count($string,'-')!=1){
var_export(false); // $found=false;
}else{
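// Split into: leading word, "source-target", and the remaining text.
// (ltrim() takes a character mask, so it cannot safely strip the "translate " prefix.)
$array=explode(' ',$string,3);
list($source,$target)=explode('-',$array[1]);
$translate=$array[2];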
echo "source=$source; target=$target; translate=$translate";
}

If I understand your question correctly, this can be done with a regular expression:
<?php
$string = "translate from-to word for translate";
$result = preg_match("/^([\w ]+?) (\w+)-(\w+) ([\w ]+)$/", $string, $matches);
if ($result) {
print_r($matches);
$source = $matches[2];
$target = $matches[3];
$translate = $matches[4];
} else {
echo "No match";
}
Output:
Array
(
[0] => translate from-to word for translate
[1] => translate
[2] => from
[3] => to
[4] => word for translate
)
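The regular expression breaks down as follows: ^([\w ]+?) lazily matches the leading text (here translate); (\w+)-(\w+) captures the word before the hyphen as the source and the word after it as the target; ([\w ]+)$ captures the remaining words and spaces up to the end of the string. Because \w does not match a hyphen, a string containing a second hyphen produces no match.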
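The validation and extraction steps can be folded into one function. This is only a sketch under some assumptions: the name parse_translate_command and the returned array keys are made up for illustration, the leading word is not required to be translate, and the /u modifier is added so that multibyte input is handled as UTF-8.

```php
<?php
// Hypothetical helper: returns the three parts, or false if the string is invalid.
function parse_translate_command(string $string)
{
    // [^\s-]+              the leading command word (not captured)
    // ([^\s-]+)-([^\s-]+)  source and target around the single hyphen
    // ([^-]+)              the remaining text, which may not contain a hyphen
    $pattern = '/^[^\s-]+ ([^\s-]+)-([^\s-]+) ([^-]+)$/u';
    if (preg_match($pattern, $string, $m) !== 1) {
        return false;
    }
    return array('source' => $m[1], 'target' => $m[2], 'translate' => $m[3]);
}

var_export(parse_translate_command('translate from-to word for translate'));
echo "\n";
// More than one hyphen is rejected:
var_export(parse_translate_command('translate from-to from-to test-test word for translate')); // false
```

Since no part of the pattern other than the literal hyphen can match a hyphen, any input with zero or several hyphens fails the match and yields false.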

preg_match: can't find substring which has trailing special characters

I have a function which uses preg_match to check whether a substring is in another string.
Today I realized that if the substring has trailing special characters, such as regular expression metacharacters (. \ + * ? [ ^ ] $ ( ) { } = ! < > | : -) or #, my preg_match can't find the substring even though it is there.
This works, returns "A match was found."
$find = "website scripting";
$string = "PHP is the website scripting language of choice.";
if (preg_match("/\b" . $find . "\b/i", $string)) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
But this doesn't, returns "A match was not found."
$find = "website scripting #";
$string = "PHP is the website scripting # language of choice.";
if (preg_match("/\b" . $find . "\b/i", $string)) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
I have tried preg_quote, but it doesn't help.
Thank you for any suggestions!
Edit: Word boundary is required, that's why I use \b. I don't want to find "phone" in "smartphone".

You can just check if the characters around the search word are not word characters with look-arounds:
$find = "website scripting #";
$string = "PHP is the website scripting # language of choice.";
if (preg_match("/(?<!\\w)" . preg_quote($find, '/') . "(?!\\w)/i", $string)) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
See IDEONE demo
Result: A match was found.
Note the double backslash used with \w in (?<!\\w) and (?!\\w): inside a double-quoted PHP string, a literal backslash should itself be escaped.
The preg_quote function is necessary as the search word - from what I see - can have special characters, and some of them must be escaped if intended to be matched as literal characters.
UPDATE
There is a way to build a regex with smartly placed word boundaries around the keyword, but the performance will be worse compared with the approach above. Here is sample code:
$string = "PHP is the website scripting # language of choice.";
$find = "website scripting #";
$find = preg_quote($find, '/'); // also escape the / delimiter
if (preg_match('/\w$/u', $find)) { // Setting trailing word boundary
$find .= '\\b';
}
if (preg_match('/^\w/u', $find)) { // Setting leading word boundary
$find = '\\b' . $find;
}
if (preg_match("/" . $find . "/ui", $string)) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
See another IDEONE demo
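For reuse, the look-around approach can be wrapped in a small function. This is a sketch only; the name contains_phrase is hypothetical and does not come from the answer.

```php
<?php
// Hypothetical helper wrapping the look-around pattern from the answer.
function contains_phrase(string $haystack, string $needle): bool
{
    // The needle must not be directly preceded or followed by a word character.
    $pattern = '/(?<!\w)' . preg_quote($needle, '/') . '(?!\w)/i';
    return preg_match($pattern, $haystack) === 1;
}

var_dump(contains_phrase('PHP is the website scripting # language of choice.', 'website scripting #')); // bool(true)
var_dump(contains_phrase('I bought a smartphone', 'phone')); // bool(false)
```

Unlike \b, the look-arounds do not care whether the needle itself starts or ends with a word character, which is why the trailing # no longer breaks the match.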

If you only need to find one string inside another, you can use strpos().
Example:
<?php
$find = "website scripting";
$string = "PHP is the website scripting language of choice.";
if (strpos($string,$find) !== false) {
echo 'true';
} else {
echo 'false';
}

How to check if a word exists in a sentence

For example, if my sentence is $sent = 'how are you'; and if I search for $key = 'ho' using strstr($sent, $key) it will return true because my sentence has ho in it.
What I'm looking for is a way to return true if I only search for how, are or you. How can I do this?

You can use the function preg_match with a regex that uses word boundaries:
if(preg_match('/\byou\b/', $input)) {
echo $input.' has the word you';
}

If you want to check for multiple words in the same string, and you're dealing with large strings, then this is faster:
$text = explode(' ',$text);
$text = array_flip($text);
Then you can check for words with:
if (isset($text[$word])) doSomething();
This method is lightning fast.
But for checking for a couple of words in short strings then use preg_match.
UPDATE:
If you're actually going to use this I suggest you implement it like this to avoid problems:
$text = preg_replace('/[^a-z\s]/', '', strtolower($text));
$text = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
$text = array_flip($text);
$word = strtolower($word);
if (isset($text[$word])) doSomething();
Then double spaces, linebreaks, punctuation and capitals won't produce false negatives.
This method is much faster in checking for multiple words in large strings (i.e. entire documents of text), but it is more efficient to use preg_match if all you want to do is find if a single word exists in a normal size string.
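Put together, the lookup-table approach might look like the sketch below. The function name build_word_set is an assumption made for illustration, and like the snippet above it keeps ASCII letters only.

```php
<?php
// Hypothetical helper: builds a lookup table keyed by lower-cased word.
function build_word_set(string $text): array
{
    // Keep only ASCII letters and whitespace, as in the snippet above.
    $clean = preg_replace('/[^a-z\s]/', '', strtolower($text));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    // After flipping, each word is a key, so isset() is a hash lookup.
    return array_flip($words);
}

$set = build_word_set("How are you?\nFine,  thanks.");
foreach (array('how', 'ho', 'thanks') as $word) {
    echo $word, ': ', isset($set[$word]) ? 'found' : 'not found', "\n";
}
```

The set is built once per text, after which each word check is a single isset() call instead of a scan of the whole string.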

One thing you can do is break up your sentence by spaces into an array.
Firstly, you would need to remove any unwanted punctuation marks.
The following code removes anything that isn't a letter, number, or space:
$sent = preg_replace("/[^a-zA-Z 0-9]+/", " ", $sent);
Now, all you have are the words, separated by spaces. To create an array that splits by space...
$sent_split = explode(" ", $sent);
Finally, you can do your check. Here are all the steps combined.
// The information you give
$sent = 'how are you';
$key = 'ho';
// Isolate only words and spaces
$sent = preg_replace("/[^a-zA-Z 0-9]+/", " ", $sent);
$sent_split = explode(" ", $sent);
// Do the check
if (in_array($key, $sent_split))
{
echo "Word found";
}
else
{
echo "Word not found";
}
// Outputs: Word not found
// because 'ho' isn't a word in 'how are you'

@codaddict's answer is technically correct, but if the word you are searching for is provided by the user, you need to escape any characters with special regular expression meaning in the search word. For example:
$searchWord = $_GET['search'];
$searchWord = preg_quote($searchWord, '/');
if (preg_match("/\b$searchWord\b/", $input)) {
echo "$input has the word $searchWord";
}

With recognition to Abhi's answer, a couple of suggestions:
I added /i to the regex since sentence-words are probably treated case-insensitively
I added explicit === 1 to the comparison based on the documented preg_match return values
$needle = preg_quote($needle, '/');
return preg_match("/\b$needle\b/i", $haystack) === 1;

Identifying a random repeating pattern in a structured text string

I have a string that has the following structure:
ABC_ABC_PQR_XYZ
Where PQR has the structure:
ABC+JKL
and
ABC itself is a string that can contain alphanumeric characters and a few other characters like "_", "-", "+", "." and follows no set structure:
eg.qWe_rtY-asdf or pkl123
so, in effect, the string can look like this:
qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ
My goal is to find out what string constitutes ABC.
I was initially just using
$arrString = explode("_",$string);
to return $arrString[0] before I was made aware that ABC ($arrString[0]) itself can contain underscores, thus rendering it incorrect.
My next attempt was exploding it on "_" anyway and then comparing each of the exploded string parts with the first string part until I get a semblance of a pattern:
function getPatternABC($string)
{
$count = 0;
$pattern ="";
$arrString = explode("_", $string);
foreach($arrString as $expString)
{
if(strcmp($expString,$arrString[0])!==0 || $count==0)
{
$pattern = $pattern ."_". $arrString[$count];
$count++;
}
else break;
}
return substr($pattern,1);
}
This works great - but I wanted to know if there was a more elegant way of doing this using regular expressions?

Here is the regex solution:
'^([a-zA-Z0-9_+-]+)_\1_\1\+'
What this does is match (starting from the beginning of the string) the longest possible sequence consisting of the characters inside the square brackets (edit that per your spec). The sequence must appear exactly twice, each time followed by an underscore, and then must appear once more followed by a plus sign (this is actually the first half of PQR with the delimiter before JKL). The rest of the input is ignored.
You will find ABC captured as capture group 1.
So:
$input = 'qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ';
$result = preg_match('/^([a-zA-Z0-9_+-]+)_\1_\1\+/', $input, $matches);
if ($result) {
echo $matches[1]; // ABC is capture group 1
}
See it in action.

Sure, just make a regular expression that matches your pattern. In this case, something like this:
preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+JKL_XYZ$/', $string, $match);
Your ABC is in $match[1].
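As a quick check of the backreference idea, here is a sketch that wraps the pattern in a function and runs it on the sample strings from the question. The name extract_abc is hypothetical, and the character class includes "." on the assumption that the question's list of allowed characters is complete.

```php
<?php
// Hypothetical helper: returns the repeated ABC prefix, or null if the
// string does not start with ABC_ABC_ABC+ .
function extract_abc(string $input): ?string
{
    // Group 1 must be repeated twice via \1, the last time followed by "+".
    if (preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+/', $input, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(extract_abc('qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ')); // string(12) "qWe_rtY-asdf"
var_dump(extract_abc('pkl123_pkl123_pkl123+JKL_XYZ'));                   // string(6) "pkl123"
```

The group is greedy, so the engine first tries the longest candidate and backtracks until the two backreferences and the plus sign line up.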

If the presence of underscores in these strings has a low frequency, it may be worth checking to see if a simple explode() will do it before bothering with regex.
<?php
$str = 'ABC_ABC_PQR_XYZ';
if(substr_count($str, '_') == 3)
$abc = explode('_', $str)[0]; // reset() expects a variable, not a function result
else
$abc = regexy_function($str);
?>
