Blocking Cuss/Vulgar/Obscenity Terms in PHP

I know you might laugh, but actually this is a common need in most apps. Many apps that take in customer/visitor input may need to filter cuss words or vulgar terms.
Sometimes PHP changes and new stuff gets added in. For instance, just the other day I learned about MultiCurl API in PHP5. So, anyway, is there a new native function in PHP that lets me filter most common English-based cuss words in a string, as well as flip a boolean to say, "string had English-based cuss words in it"? It doesn't need to be perfect, obviously, but cut out a good bit of garbage and let me replace it with ### for instance.
If that's not part of PHP yet, then does anyone have a function that I can use which cloaks the cuss word list? For instance, I want it such that I can drop the class in a project and not have to worry about another programmer getting offended. In other words, a decently encoded cuss word list -- not one actually spelled out.
Now, obviously it needs to be flexible and let words like "rebuttal" get through.
tl;dr: Does PHP5 now have a native function that can filter obscene words? And if not, does anyone have a class that encodes a cuss word list so that it doesn't offend other programmers?

I doubt this is something that would be a high priority for the core PHP team since that treads dangerously close to censorship. Censorship in that they would have a 'master' list of 'inappropriate' language which should be filtered.
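You can do this fairly simply. Make an array of all the words you want filtered out, turn each one into a regex pattern, and run preg_replace() over the text whenever a page that contains user input is displayed. (preg_filter() takes the same arguments, but it returns null when nothing matches, so clean text would come back empty.)
$bad_words = array('/bleeping/i', '/blooping/i'); // each pattern needs delimiters
$replace = '###';
$submitted_text = 'bleh blah....';
echo preg_replace($bad_words, $replace, $submitted_text);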
Note: you will have to deal with the edge cases where a bad word might be inside of a good word (i.e.- 'shitzu[sic] dog')
EDIT
For the bad-words-inside-good-words issue, you can add to the regular expression to require space at the beginning and end of the bad word. If you have lots of submissions though, it's going to be a constant battle to keep up with the trolls.
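A minimal sketch of the word-boundary idea, which also returns the boolean the question asks for. The function name filter_bad_words() and the sample sentence are made up for illustration; \b is used instead of literal spaces so that punctuation next to a word does not hide it.

```php
<?php
// Hypothetical helper: masks whole-word matches and reports whether any were found.
function filter_bad_words($text, array $badWords, $mask = '###')
{
    if (!$badWords) {
        return array($text, false); // nothing to look for
    }
    // Escape each word and join them into one pattern: \b(?:bleeping|blooping)\b
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $badWords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // The fifth argument receives the number of replacements made.
    $clean = preg_replace($pattern, $mask, $text, -1, $count);

    return array($clean, $count > 0);
}

list($clean, $hadBadWords) = filter_bad_words(
    'What a bleeping mess, said the bleepingly calm man.',
    array('bleeping', 'blooping')
);
echo $clean, "\n";
var_dump($hadBadWords);
```

Because of the word boundaries, "bleepingly" passes through untouched while the standalone word is masked.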

<?php
$badwords = "fuc";
$replacebad = "****";
$string = isset($_POST['something']) ? $_POST['something'] : ''; // avoid a notice when the field is missing
$filtered = str_ireplace($badwords, $replacebad, $string);
echo $filtered;
?>
Something like this?
Edit:
Sorry, I didn't notice the PHP5 part.

Related

Recursive Regex in PHP with variable names

I'm trying to make a BBCode-ish engine for my website. The catch is that the set of available codes is not fixed, because the codes are defined by the users. On top of that, the whole thing has to be recursive.
For example:
Hello my name is [name user-id="1"]
I [bold]really[/bold] like cheeseburgers
These are the easy ones, and I got them working.
Now the problem is what happens when two of those codes follow each other:
I [bold]really[/bold] like [bold]cheeseburgers[/bold]
Or are nested inside each other:
I [bold]really like [italic]cheeseburgers[/italic][/bold]
These codes can also have attributes:
I [bold strengh="600"]really like [text font-size="24px"]cheeseburgers[/text][/bold]
The following one worked quite well, but lacks the recursive part (?R):
(?P<code>\[(?P<code_open>\w+)\s?(?P<attributes>[a-zA-Z-0-1-_=" .]*?)](?:(?P<content>.*?)\[\/(?P<code_close>\w+)\])?)
I just don't know where to put the (?R) recursion token.
Also, the system has to know that in this string
I [bold]really like [italic]cheeseburgers[/italic][/bold] and [bold]football[/bold]
there are 2 "code-objects":
1. [bold]really like [italic]cheeseburgers[/italic][/bold]
and
2. [bold]football[/bold]
... and the content of the first one is
really like [italic]cheeseburgers[/italic]
which again has a code in it
[italic]cheeseburgers[/italic]
whose content is
cheeseburgers
I have searched the web for two days now and I can't figure it out.
I thought of something like this:
Look for something like [**** attr="foo"] where the attributes are optional and store it in a capturing group
Look up whether there is a closing tag somewhere (can be optional too)
If a closing tag exists, everything between the two tags should be stored as a "content"-capturing group - which then has to go through the same procedure again.
I hope there are some regex specialists who are willing to help me. :(
Thank you!
EDIT
As this might be difficult to understand, here is an input and an expected output:
Input:
[heading icon="rocket"]I'm a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]
I'd like to have an array like
array[0][name] = heading
array[0][attributes][icon] = rocket
array[0][content] = I'm a cool heading
array[1][name] = textrow
array[1][content] = [text]<p>Hi!</p>[/text]
array[1][0][name] = text
array[1][0][content] = <p>Hi!</p>

Having written multiple BBCode parsing systems, I can suggest NOT using regexes only. Instead, you should actually parse the text.
How you do this is up to you, but as a general idea you would want to use something like strpos to locate the first [ in your string, then check what comes after it to see if it looks like a BBCode tag and process it if so. Then, search for [ again starting from where you ended up.
This has certain advantages, such as being able to examine each code and skip it if it's invalid, as well as enforcing proper tag closing order ([bold][italic]Nesting![/bold][/italic] should be considered invalid) and being able to provide meaningful error messages to the user if something is wrong (invalid parameter, perhaps) because the parser knows exactly what is going on, whereas a regex would output something unexpected and potentially harmful.
It might be more work (or less, depending on your skill with regex), but it's worth it.
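A minimal sketch of the parse-it-yourself idea, assuming every opening tag has a matching closing tag (a standalone tag such as [name user-id="1"] would need extra handling). The function names parse_bbcode() and parse_nodes() and the node layout are made up for illustration. It uses strpos to find each [ and recursion to collect the content of a tag, and it throws when tags are closed in the wrong order.

```php
<?php
// Hypothetical sketch: recursive-descent parser for BBCode-like tags.
// Assumption: every opening tag is closed; a ']' inside an attribute value is not supported.
function parse_nodes($text, &$pos, $closing = null)
{
    $nodes = array();
    $len = strlen($text);
    while ($pos < $len) {
        $open = strpos($text, '[', $pos);
        $close = ($open === false) ? false : strpos($text, ']', $open);
        if ($close === false) {
            $nodes[] = substr($text, $pos); // no complete tag left: plain text
            $pos = $len;
            break;
        }
        if ($open > $pos) {
            $nodes[] = substr($text, $pos, $open - $pos); // text before the tag
        }
        $tag = substr($text, $open + 1, $close - $open - 1);
        $pos = $close + 1;
        if ($closing !== null && $tag === '/' . $closing) {
            return $nodes; // found the close of the tag we are inside
        }
        if ($tag !== '' && $tag[0] === '/') {
            throw new RuntimeException("Unexpected closing tag [$tag]");
        }
        if (!preg_match('/^(\w+)((?:\s+[\w-]+="[^"]*")*)\s*$/', $tag, $m)) {
            $nodes[] = '[' . $tag . ']'; // does not look like a tag: keep as text
            continue;
        }
        $attributes = array();
        preg_match_all('/([\w-]+)="([^"]*)"/', $m[2], $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $attributes[$pair[1]] = $pair[2];
        }
        $nodes[] = array(
            'name'       => $m[1],
            'attributes' => $attributes,
            'content'    => parse_nodes($text, $pos, $m[1]), // recurse into the tag
        );
    }
    if ($closing !== null) {
        throw new RuntimeException("Missing closing tag [/$closing]");
    }
    return $nodes;
}

function parse_bbcode($text)
{
    $pos = 0;
    return parse_nodes($text, $pos);
}

$tree = parse_bbcode('I [bold]really like [italic]cheeseburgers[/italic][/bold] and [bold]football[/bold]');
print_r($tree);
```

Each node is either a plain string or an array with name, attributes and content, where content is again a list of nodes. Something like [bold][italic]Nesting![/bold][/italic] raises an exception instead of producing surprising output.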

How to remove offensive words from post by php?

Assume "xyza" is a bad word. I'm using the following method to replace offensive words:
$text = str_replace("xyza","(Offensive words detected & removed!)",$text);
This code replaces xyza with "(Offensive words detected & removed!)".
The problem is case: if someone types XYZA, my code doesn't detect it. How can I solve this?

No matter what you do, users will find ways to get around your filters. They will use unicode characters (аss, for example, uses a Cyrillic а and will not get captured by any of the regex solutions). They will use spaces, dollar signs, asterisks, whatever you haven't managed to catch yet.
If family-friendliness is essential to your application, have a person review the content before it goes live. Otherwise, add a flag feature so other people can flag offensive content. Better yet, use some sort of machine learning or Bayesian filter to automatically flag potentially offensive posts and have humans check them out manually. People read human languages better than computers.

The problem with whitelists/blacklists is—as other users have pointed out—your users will make it their priority to find ways around your filter for satisfaction rather than using your website for what it was intended for, whatever that may be.
One approach would be to use Google’s undocumented profanity API it created for its “What Do You Love?” website. If you get a response of true then just give the user a message saying their post couldn’t be submitted due to detected profanity.
You could approach this as follows:
<?php
if (isset($_POST['submit'])) {
$result = json_decode(file_get_contents(sprintf('http://www.wdyl.com/profanity?q=%s', urlencode($_POST['comments']))));
if ($result->response == true) {
// profanity detected
}
else {
// save comments to database as normal
}
}

Other answers and comments say that programming is not the best solution to this problem. I agree with them. Those answers should be moved to Moderators - Stack Exchange or Webmasters - Stack Exchange.
Since this is stackoverflow, my answer is going to be based on computer programming.
If you want to use str_replace, do something like this.
For the sake of this post, since some people are offended by actual cusswords, let's pretend that these are bad words:
'fug', 'schnitt', 'dam'.
$text = str_ireplace(" fug ","(Offensive words detected & removed!)",$text);
Notice, it's str_ireplace not str_replace. The i is for "case insensitive".
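Without the surrounding spaces, that would erroneously match "fuggedaboudit," for example; with them, it misses the word at the start or end of the text, or next to punctuation.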
If you want to do a more reliable job, you need to use regex.
$bad_text = "Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn";
$hit_words = array("fug","schnitt","dam"); // these words are 'hits' that we need to replace. hit words...
array_walk($hit_words, function(&$value, $key) { // this prepares the regex; closures require PHP 5.3+
$value = '~\b' . preg_quote( $value ,'~') . '\b~i'; // \b means word boundary, like space, line-break, period, dash, and many others. Prevents "refudgee" from being matched when searching for "fudge"
});
/*print_r($hit_words);*/
$good_words = array("fudge","shoot","dang");
$good_text = preg_replace($hit_words,$good_words,$bad_text); // does all search/replace actions at once
echo '<br />' . $good_text . '<br />';
That will do all your search/replacements at once. The two arrays should contain the same number of elements, matching up searches and replace terms. It will not match parts of words, only whole words. And of course, determined cussers will find ways of getting their swearing onto your website. But it will stop lazy cussers.
I've decided to add some links to sites that obviously use programming to do a first run through removing profanity. I'll add more as I come across them. Other than yahoo:
1.) Dell.com - replace matching words with <profanity deleted>.
http://en.community.dell.com/support-forums/peripherals/f/3529/t/19502072.aspx
2.) Watson, the supercomputer, apparently developed a cursing problem. How do you tell the difference between cursing and slang? Apparently, it's so hard that the researchers just decided to purge it all. But they could have just used a list of curse words ( exact matching is a subset of regex, I would say) and forbidden their use. That's kind of how it works in real life, anyway.
Watson develops a profanity problem
3.) Content Compliance section of Gmail custom settings in Apps for Business:
Add expressions that describe the content you want to search for in each message
The "Expressions" used can be of several types, including "Advanced content match", which, among other things, allows you to choose "Match type" options very similar to what you'd have in an Excel filter: Starts with, Ends with, Contains, Not contains, Equals, Is Empty, all of which presumably use regex. But wait, there's more: Matches regex, Not matches regex, Matches any word, Matches all words. So, the mighty Google implements regex filtering options for its business users. Why would it do that, when regex is supposedly so ineffective? Because it actually is effective enough. It is a simple, fast, programming solution that will only fail when people are hell-bent on circumventing it.
Besides that list, I wonder if anyone else has noticed the similarity between weeding out profanity and filtering out spam. Clearly, regex has uses in both arenas but nitpickers who learned by rote that "all regex is bad" will always downvote any answer to any question if regex is even mentioned.
Try googling "how spam filters work". You'll get results like this one that covers spam assassin:
http://www.seas.upenn.edu/cets/answers/spamblock-filter.html
Another example where I'm sure regex is used is when communicating via Amazon.com's Amazon Marketplace. You receive emails at your usual email address. So, naturally, when responding to a seller, your email program will include all kinds of sender information, like your email address, cc email addresses, and any you enter into the body. But Amazon.com strips these out "for your protection." Can I find a way around this regex? Probably, but it would take more trouble than it's worth and is therefore effective to a degree. They also keep the emails for 2 years, presumably so that a human can go over them in case of any fraud claims.
SpamAssassin also looks at the subject and body of the message for the same sort of things that a person notices when a message "looks like spam". It searches for strings like "viagra", "buy now", "lowest prices", "click here", etc. It also looks for flashy HTML such as large fonts, blinking text, bright colors, etc.
Regex is not mentioned, but I'm sure it's in use.

Use the str_ireplace() function, which is the case-insensitive version of str_replace():
$text = str_ireplace("flip","(Offensive words detected & removed!)", $text);

Use str_ireplace() to replace strings regardless of case.
Probably this will help you:
$text = 'contains offensive_word .... so on';
$array = array(
'offensive_word' => '****',
'offensive_word2' => '****',
'offensive_word3' => '****',
//.....
);
$text = str_ireplace(array_keys($array),array_values($array), $text);
echo $text;

You should use a regex replacement and add the i flag to the end of your regex so it searches your text regardless of case:
$text = preg_replace("/xyza/i","(Offensive words detected & removed!)", $text);
str_ireplace can also be used if you don't need complex regex rules.
$text = str_ireplace("xyza","(Offensive words detected & removed!)", $text);
In fact, the latter is the preferred way as it's faster than regex manipulation. From PHP docs:
If you don't need fancy replacing rules, you should generally use this function instead of preg_replace() with the i modifier.
BUT, as the commenter pointed out, simple string/regex replacements can break your strings if the substring you're replacing appears as part of another non-offensive word. For this, you could either use word boundaries in your regexes or replace only those words that can't be part of other strings (e.g. the word xyza).

What is the best way to censor inappropriate words that might contain markup within them?

I run a large website that contains millions of user generated posts that contain HTML. Some of these posts contain sensitive words my advertisers don't want to advertise next to. Instead of deleting these posts, I'd rather censor out the "bad" words. I also need to preserve the markup because letting the users mark up their posts is a major feature of the site.
I am currently using a search and replace with str_ireplace(), but our authors have become clever and are doing things (below) that slip through my primitive filter. I can strip the tags and detect the inappropriate words, but am looking for a way of replacing the words while leaving the markup untouched.
Examples:
Successfully censored:
input: "<p>Mary is a bitch.</p>"
output: "<p>Mary is a *****.</p>"
Unsuccessfully censored:
input: "<p>Mary is a <strong>b</strong>itch.</p>"
failed output: "<p>Mary is a <strong>b</strong>itch.</p>"
desired output: "<p>Mary is a <strong>*</strong>****.</p>"

My advice would be to use other methods to stop this, as it is extremely hard.
From this amusing piece by Jeff Atwood about the 'clbuttic' problems that arise from trying to do so:
Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. But it seems some companies and developers can't stop tilting at that windmill. Which means you might want to think twice before you move to Scunthorpe.

Just for fun here is a quick and dirty way:
$badWords = array('bitch', 'jerk');
$input = '<p>Mary is a <strong>b</strong>itch. </p>';
$arr = explode(' ', $input);
foreach($arr as $key => $word)
{
$word = str_replace('.', '', strip_tags($word));
if(in_array($word, $badWords))
{
$arr[$key] = '*****';
}
}
$output = implode(' ', $arr);
echo $output;
Output
<p>Mary is a ***** </p>
The above splits the text into words, and applies strip_tags() on each of the words, so that it doesn't affect the entire content.
There are still many ways around it though, as the comments point out. You'll never get a perfect solution that can handle everything they throw at it - you would need to create something close to artificial intelligence. I think the best real solution would be to strip_tags() on the whole post and search for the bad words, then if any found, flag the post for moderator attention. Or just simply have a report post system with active moderators.

You're going to have an extremely tough time accomplishing this in your way, but my recommendation would be to not change the words out with asterisks, but rather just reject the posting and let the user know why. Here's why:
Simplify your searching. If your algorithm only has to check if some form of a bad word exists in the text, then you can strip_tags the text and search for your words. If you were to try to replace the words with asterisks, you can't strip_tags since you need to leave the original text in its prior condition.
It's what people expect. What people don't expect is for their text to be modified with no notification to them. You'd likely be better off sending people back with a message that says "this post contains inappropriate words/text".
If you are insistent that you replace with asterisks instead of sending the user back, you'll need to write a basic character-by-character parser that ignores HTML tags and constructs words out of it.
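A rough sketch of such a character-by-character pass, under loud assumptions: single-byte text, no < or > inside attribute values, and no special handling of entities, comments or scripts. The function name censor_html() is made up. It collects the visible characters together with their offsets in the original HTML, searches the visible text, and then overwrites only the matched characters, so the markup stays in place.

```php
<?php
// Hypothetical sketch: mask bad words even when tags are wedged inside them.
// Assumptions: single-byte text; '<' always starts a tag; entities are not decoded.
function censor_html($html, array $badWords)
{
    $plain = '';        // visible text with the tags skipped
    $map   = array();   // $map[i] is the offset in $html of character i of $plain
    $inTag = false;
    $len   = strlen($html);
    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if ($ch === '<') {
            $inTag = true;
        } elseif ($ch === '>' && $inTag) {
            $inTag = false;
        } elseif (!$inTag) {
            $plain .= $ch;
            $map[] = $i;
        }
    }

    foreach ($badWords as $word) {
        $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
        if (preg_match_all($pattern, $plain, $matches, PREG_OFFSET_CAPTURE)) {
            foreach ($matches[0] as $match) {
                list($found, $start) = $match;
                // Overwrite each matched character at its original position.
                for ($k = 0; $k < strlen($found); $k++) {
                    $html[$map[$start + $k]] = '*';
                }
            }
        }
    }
    return $html;
}

$censored = censor_html('<p>Mary is a <strong>j</strong>erk.</p>', array('jerk'));
echo $censored, "\n";
```

One side effect of skipping tags entirely is that text from adjacent elements is joined, so a word split across two paragraphs would also be treated as one word.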

You could start from a "bad words" list and check the tag-clean string (that is, filtered via strip_tags()) against the "bad words".
Then you could iterate each bad word through a series of possible single-letter alterations, eg S=>5, 1=>L, 0=>O etc.
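A small sketch of that check, done in the reverse direction: instead of generating every altered spelling of each bad word, it maps the look-alike characters in the input back to letters before comparing. The substitution table and the function names normalize_leet() and contains_bad_word() are assumptions for illustration, and some mappings are ambiguous (1 could stand for l or i).

```php
<?php
// Hypothetical sketch: undo common single-character substitutions, then match.
function normalize_leet($text)
{
    // Assumed substitution table; extend it to suit your own list.
    $from = array('5', '$', '1', '0', '3', '4', '@');
    $to   = array('s', 's', 'l', 'o', 'e', 'a', 'a');
    return str_replace($from, $to, strtolower($text));
}

function contains_bad_word($html, array $badWords)
{
    // Work on the tag-clean, normalized text only; the original is left untouched.
    $plain = normalize_leet(strip_tags($html));
    foreach ($badWords as $word) {
        if (preg_match('/\b' . preg_quote($word, '/') . '\b/', $plain)) {
            return true;
        }
    }
    return false;
}

var_dump(contains_bad_word('<p>You <b>j</b>3rk</p>', array('jerk')));
var_dump(contains_bad_word('<p>10 jerky treats</p>', array('jerk')));
```

The result is only a yes/no flag, which fits rejecting the post or sending it to a moderator rather than rewriting it.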

What regex pattern do I need for this?

I need a regex (to work in PHP) to replace American English words in HTML with British English words. So color would be replaced by colour, meters by metres and so on [I know that meters is also a British English word, but for the copy we'll be using it will always be referring to units of distance rather than measuring devices]. The pattern would need to work accurately in the following (slightly contrived) examples (although as I have no control over the actual input these could exist):
<span style="color:red">This is the color red</span>
[should not replace color in the HTML tag but should replace it in the sentence]
<p>Color: red</p>
[should replace word]
<p>Tony Brammeter lives 2000 meters from his sister</p>
[should replace meters for the word but not in the name]
I know there are edge cases where replacement wouldn't be useful (if his name was Tony Meter for example), but these are rare enough that we can deal with them when they come up.

HTML/XML should not be processed with regular expressions; it is really hard to write one that handles every case. But you can use the built-in DOM extension and process your string recursively:
# Warning: untested code!
function process($node, $replaceRules) {
foreach ($node->childNodes as $childNode) {
if ($childNode instanceof DOMText) {
// Text nodes hold the visible copy; rewrite them in place.
$childNode->nodeValue = preg_replace(
array_keys($replaceRules),
array_values($replaceRules),
$childNode->nodeValue
);
} elseif ($childNode->hasChildNodes()) {
process($childNode, $replaceRules);
}
}
}
$replaceRules = array(
'/\bcolor\b/i' => 'colour',
'/\bmeter\b/i' => 'metre',
'/\bmeters\b/i' => 'metres',
);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
process($doc, $replaceRules);
$htmlString = $doc->saveHTML();

I think you'd rather need a dictionary and maybe even some grammatical analysis in order to get this working correctly, since you don't have control over the input. A pure regex solution is not really going to be able to process this kind of data correctly.
So I'd suggest to first come up with a list of words that need to be replaced, those are not only "color" and "meter". Wikipedia has some information on the topic.

You do not want a regular expression for this. Regular expressions are by their very nature stateless, and you need some measure of state to be able to tell the difference between 'in a html tag' and 'in data'.
You want to be using an HTML parser in combination with something like str_replace, or even better, use a proper grammar dictionary and so on, as Lucero suggests.

The second problem is easier - you want to replace when there are word boundaries around the word: http://www.regular-expressions.info/wordboundaries.html -- this will make sure you don't replace the meter in Brammeter.
The first problem is much harder. You don't want to replace words inside HTML tags, that is, anything between < and > characters. So your match must make sure that the last bracket you saw was > (or none at all), but never <. This is either hard, requiring some combination of lookahead/lookbehind assertions, or just plain impossible with regular expressions.
A script implementing a state machine would work much better here.
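As a simpler stand-in for a full state machine, here is a two-state sketch: split the input into tag chunks and text chunks, and apply the word-boundary replacements to the text chunks only. The function name replace_outside_tags() and the rule list are made up for illustration, and it assumes no > appears inside an attribute value and that scripts, styles and comments need no special treatment.

```php
<?php
// Hypothetical sketch: apply replacements to text between tags, never inside them.
function replace_outside_tags($html, array $rules)
{
    // The capture group makes preg_split keep the tags in the result.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        // With one capture group, even indexes are text and odd indexes are tags.
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace(array_keys($rules), array_values($rules), $part);
        }
    }
    return implode('', $parts);
}

// Assumed rules; separate entries keep the capitalization of the original word.
$rules = array(
    '/\bcolor\b/'  => 'colour',
    '/\bColor\b/'  => 'Colour',
    '/\bmeters\b/' => 'metres',
);

echo replace_outside_tags('<span style="color:red">This is the color red</span>', $rules), "\n";
echo replace_outside_tags('<p>Tony Brammeter lives 2000 meters from his sister</p>', $rules), "\n";
```

The style attribute and the name Brammeter are left alone, while the standalone words in the sentence are replaced.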

You don't need to use a regex explicitly. You can try the str_replace function, or if you need it to be case insensitive use the str_ireplace function.
Example:
$str = "<p>Color: red</p>";
$new_str = str_ireplace('color', 'colour', $str); // plain search string: no regex delimiters needed
You can pass an array with all the words that you want to search for, instead of the string.

preg_replace() help in PHP

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurrence of awesome which doesn't appear within the title attribute of the anchor?
So far, this is what I've come up with (sadly, it doesn't work):
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plantDetails = end($this->_model->getPlantById($plantId));
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach($plantDetails as $key=>$detail) {
$detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
$plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
$botanyWords = $this->_botanyWords;
$word = $match[0];
if (array_key_exists($word, $botanyWords)) {
return '' . $word . '';
} else {
return $word;
}
}
Hope this will help someone else some day! Thanks again for all your answers.

This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.

Assuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake, and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.

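A runnable sketch of the single-pass approach from the answer above, for anyone who wants to try it. The keyword list, the link markup and the variable names are invented for illustration. Because every word is visited exactly once, a keyword that shows up inside a generated title attribute is never processed a second time.

```php
<?php
// Hypothetical keyword list: word => link target.
$keywords = array(
    'awesome' => '/glossary/awesome',
    'stuff'   => '/glossary/stuff',
);

// Callback: wrap known keywords in a link, return everything else unchanged.
$link_keyword = function ($match) use ($keywords) {
    $key = strtolower($match[0]);
    if (isset($keywords[$key])) {
        return '<a href="' . $keywords[$key] . '" title="so ' . $key . ' is cool">'
            . $match[0] . '</a>';
    }
    return $match[0];
};

$text = 'hello awesome stuff';
$result = preg_replace_callback('/\b[A-Za-z]+\b/', $link_keyword, $text);
echo $result, "\n";
```

The word "awesome" inside the generated title is part of the replacement text, not of the subject being scanned, so it is not linked again.
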
Sure, using a parsing library is the industrial-strength solution, but we all have times where we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try running your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.

This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's a hack.
