Assume "xyza" is a bad word. I'm using the following method to replace offensive words:
$text = str_replace("xyza","(Offensive words detected & removed!)",$text);
This code replaces xyza with "(Offensive words detected & removed!)".
But the problem is case: if someone types XYZA, my code can't detect it. How can I solve this?
No matter what you do, users will find ways to get around your filters. They will use unicode characters (аss, for example, uses a Cyrillic а and will not get captured by any of the regex solutions). They will use spaces, dollar signs, asterisks, whatever you haven't managed to catch yet.
If family-friendliness is essential to your application, have a person review the content before it goes live. Otherwise, add a flag feature so other people can flag offensive content. Better yet, use some sort of machine learning or Bayesian filter to automatically flag potentially offensive posts and have humans check them out manually. People read human languages better than computers.
The problem with whitelists/blacklists is—as other users have pointed out—your users will make it their priority to find ways around your filter for satisfaction rather than using your website for what it was intended for, whatever that may be.
One approach would be to use Google’s undocumented profanity API it created for its “What Do You Love?” website. If you get a response of true then just give the user a message saying their post couldn’t be submitted due to detected profanity.
You could approach this as follows:
<?php
if (isset($_POST['submit'])) {
    $result = json_decode(file_get_contents(sprintf('http://www.wdyl.com/profanity?q=%s', urlencode($_POST['comments']))));
    if ($result->response == true) {
        // profanity detected
    }
    else {
        // save comments to database as normal
    }
}
Other answers and comments say that programming is not the best solution to this problem. I agree with them. Those answers should be moved to Moderators - Stack Exchange or Webmasters - Stack Exchange.
Since this is Stack Overflow, my answer is going to be based on computer programming.
If you want to use str_replace, do something like this.
For the sake of this post, since some people are offended by actual cusswords, let's pretend that these are bad words:
'fug', 'schnitt', 'dam'.
$text = str_ireplace("fug","(Offensive words detected & removed!)",$text);
Notice, it's str_ireplace not str_replace. The i is for "case insensitive".
But that will erroneously match "fuggedaboudit," for example.
If you want to do a more reliable job, you need to use regex.
$bad_text = "Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn";
$hit_words = array("fug","schnitt","dam"); // these words are 'hits' that we need to replace
// Turn each hit word into a regex. Closures require PHP 5.3+.
array_walk($hit_words, function(&$value, $key) {
    // \b means word boundary, like space, line break, period, dash, and many others.
    // It prevents "refudgee" from being matched when searching for "fudge".
    $value = '~\b' . preg_quote($value, '~') . '\b~i';
});
/*print_r($hit_words);*/
$good_words = array("fudge","shoot","dang");
$good_text = preg_replace($hit_words,$good_words,$bad_text); // does all search/replace actions at once
echo '<br />' . $good_text . '<br />';
That will do all your search/replacements at once. The two arrays should contain the same number of elements, matching up searches and replace terms. It will not match parts of words, only whole words. And of course, determined cussers will find ways of getting their swearing onto your website. But it will stop lazy cussers.
I've decided to add some links to sites that obviously use programming to do a first pass at removing profanity. I'll add more as I come across them. Other than Yahoo:
1.) Dell.com - replaces matching words with <profanity deleted>.
http://en.community.dell.com/support-forums/peripherals/f/3529/t/19502072.aspx
2.) Watson, the supercomputer, apparently developed a cursing problem. How do you tell the difference between cursing and slang? Apparently, it's so hard that the researchers just decided to purge it all. But they could have just used a list of curse words ( exact matching is a subset of regex, I would say) and forbidden their use. That's kind of how it works in real life, anyway.
Watson develops a profanity problem
3.) Content Compliance section of Gmail custom settings in Apps for Business:
Add expressions that describe the content you want to search for in each message
The "Expressions" used can be of several types, including "Advanced content match", which, among other things, allows you to choose "Match type" options very similar to what you'd have in an Excel filter: Starts with, Ends with, Contains, Not contains, Equals, Is Empty, all of which presumably use regex. But wait, there's more: Matches regex, Not matches regex, Matches any word, Matches all words. So, the mighty Google implements regex filtering options for its business users. Why would it do that, when regex is supposedly so ineffective? Because it actually is effective enough. It is a simple, fast, programming solution that will only fail when people are hell-bent on circumventing it.
Besides that list, I wonder if anyone else has noticed the similarity between weeding out profanity and filtering out spam. Clearly, regex has uses in both arenas but nitpickers who learned by rote that "all regex is bad" will always downvote any answer to any question if regex is even mentioned.
Try googling "how spam filters work". You'll get results like this one that covers spam assassin:
http://www.seas.upenn.edu/cets/answers/spamblock-filter.html
Another example where I'm sure regex is used is when communicating via Amazon.com's Amazon Marketplace. You receive emails at your usual email address. So, naturally, when responding to a seller, your email program will include all kinds of sender information, like your email address, cc email addresses, and any you enter into the body. But Amazon.com strips these out "for your protection." Can I find a way around this regex? Probably, but it would take more trouble than it's worth and is therefore effective to a degree. They also keep the emails for 2 years, presumably so that a human can go over them in case of any fraud claims.
SpamAssassin also looks at the subject and body of the message for the same sort of things that a person notices when a message "looks like spam". It searches for strings like "viagra", "buy now", "lowest prices", "click here", etc. It also looks for flashy HTML such as large fonts, blinking text, bright colors, etc.
Regex is not mentioned, but I'm sure it's in use.
Use the str_ireplace() function, which is the case-insensitive version of str_replace():
$text = str_ireplace("flip","(Offensive words detected & removed!)", $text);
Use str_ireplace() to replace strings regardless of case.
This will probably help you:
$text = 'contains offensive_word .... so on';
$array = array(
    'offensive_word' => '****',
    'offensive_word2' => '****',
    'offensive_word3' => '****',
    //.....
);
$text = str_ireplace(array_keys($array), array_values($array), $text);
echo $text;
You should use a regex replacement and add the i flag to the end of your regex so that it matches your text regardless of case:
$text = preg_replace("/xyza/i","(Offensive words detected & removed!)", $text);
str_ireplace can also be used if you don't need complex regex rules.
$text = str_ireplace("xyza","(Offensive words detected & removed!)", $text);
In fact, the latter is the preferred way as it's faster than regex manipulation. From PHP docs:
If you don't need fancy replacing rules, you should generally use this function instead of preg_replace() with the i modifier.
BUT, as the commenter pointed out, simple string/regex replacements can break your strings if the substring you're replacing appears as part of another non-offensive word. For this, you could either use word boundaries in your regexes or replace only those words that can't be part of other strings (e.g. the word xyza).
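To make the word-boundary option concrete, here is a minimal sketch using the question's placeholder word xyza. The helper name censorWord is made up for illustration; it is not part of PHP.

```php
<?php
// Hypothetical helper: replaces $word only when it appears as a whole word,
// in any letter case.
function censorWord(string $text, string $word, string $replacement): string
{
    // preg_quote() escapes regex metacharacters in the word, \b anchors the
    // match at word boundaries, and the i modifier ignores case.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    return preg_replace($pattern, $replacement, $text);
}

echo censorWord('XYZA and xyzabc', 'xyza', '****'), "\n"; // prints "**** and xyzabc"
```

A plain str_ireplace() call would also have altered "xyzabc" here, which is the trade-off described above.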
Related
I am trying to prevent certain kinds of posts on my site, which are mostly meant to make it look like they contain some content but are just spam. Specifically, the posts are a few random words, some newline characters, and a random character.
So, I know some legit users might have use for using two newline chars (to create a blank line between paragraphs), but I figure 3+ can be marked as spam.
I tested this regex on regex101 and it works fine, but it is never triggered when I test on my site. Any ideas as to why? When I uncomment the echo line, it shows me the number 4 for my test data, so I know it sees the newlines. Is my regex formed incorrectly?
Test data:
This is a potential
spam post
Code:
//echo substr_count($lowercaseBody, "\n");
if (preg_match('/\n{3,}./', $lowercaseBody)){
    error("Stop Spamming my chan you .");
}
The data likely contains CRLFs, not just LFs.
The substr_count test does not care about the interleaved CRs, but your regex pattern does.
Use (\r?\n) instead of \n to allow both CRLFs and LFs (different browsers and operating systems may use different newlines):
if (preg_match('/(\r?\n){3,}./', $lowercaseBody)){
    error("Stop Spamming my chan you .");
}
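A small sketch that reproduces the difference, assuming the form data arrives with CRLF line endings as this answer suspects. The helper name hasExcessiveNewlines and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: true when the text contains three or more consecutive
// line breaks (LF or CRLF) followed by at least one more character.
function hasExcessiveNewlines(string $body): bool
{
    return preg_match('/(?:\r?\n){3,}./', $body) === 1;
}

$crlfPost = "This is a potential\r\n\r\n\r\n\r\nspam post"; // CRLF line endings
$lfPost   = "Paragraph one\n\nParagraph two";               // a single blank line

var_dump(preg_match('/\n{3,}./', $crlfPost) === 1); // bool(false): the CRs break up the run of LFs
var_dump(hasExcessiveNewlines($crlfPost));          // bool(true)
var_dump(hasExcessiveNewlines($lfPost));            // bool(false)
```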
My website form is getting hammered with spam. I have noticed that in the "Phone" field the spam bots always insert text rather than a number, so I would like to add an if statement to the PHP mailer that blocks the email unless the phone field meets the following rules:
1) I want users to be able to leave the field blank, so empty field must be accepted.
2) Must contain "numbers" or "plus sign" or "spaces"
How would I write this in PHP?
Any help is appreciated
EDIT: Just thought, lol, it would be much easier to just check if the field contains alphabetical characters. How would I do this?
EDIT2: Sorted. I used "if (ctype_alpha ($phone) !== false)"
Regular expressions are probably the best way, although not necessarily the easiest to understand at first. But regular expressions are definitely a good thing to learn if you are not familiar with them. My favorite introduction is this site: http://www.zytrax.com/tech/web/regex.htm And this is a good site for interactively building a regex and seeing how it works in realtime: http://www.regexr.com/ I'm sure there are plenty of other similar sites but those are the two I always go back to.
If you search around for a regular expression solution you will find countless possibilities and variations. My personal advice is to keep it simple. I would start with considering how you store the phone number data. I usually just keep the numbers, so I would simplify it by first removing those "allowed" characters and then checking if what's left over is just numbers.
$phone = str_replace(Array('+', ' ', '(', ')'), '', $phone);
That will replace all pluses, spaces, and parentheses with an empty string (i.e. remove them). Then you can check if the string is numeric, and if it is store it, otherwise print/return an error.
if (!is_numeric($phone))
    // stop processing and output an error
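For comparison, here is a minimal sketch of a check that follows the question's two rules directly: an empty field passes, and otherwise only digits, plus signs and spaces are allowed. The helper name isAcceptablePhone is made up for illustration. Two details are worth knowing: is_numeric() rejects an empty string and accepts values such as "1e5", and ctype_alpha() is true only when every character is a letter, so mixed input like "call 123" gets past it.

```php
<?php
// Hypothetical helper: accepts an empty string, or any mix of digits,
// plus signs and spaces; rejects everything else.
function isAcceptablePhone(string $phone): bool
{
    // * (rather than +) lets the empty string through, as the question asks.
    // The D modifier stops $ from matching before a trailing newline.
    return preg_match('/^[0-9+ ]*$/D', $phone) === 1;
}

var_dump(isAcceptablePhone(''));                 // bool(true)
var_dump(isAcceptablePhone('+44 20 7946 0000')); // bool(true)
var_dump(isAcceptablePhone('buy now'));          // bool(false)
var_dump(isAcceptablePhone('1e5'));              // bool(false), although is_numeric('1e5') is true
```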
First of all, you should use some form of spam protection, for example a token, a honeypot, or a CAPTCHA.
In my country a mobile or landline phone number contains only 9 digits, not counting the country code (+XX). So I create an INT(10) field in the database. After the form is submitted, I remove everything that is not a digit.
For example:
$phoneNumber = (int) substr( preg_replace( '#[^\d]+#', '', $_POST['phone_numer'] ), 0, 9 );
This has always worked for me across many projects.
I run a large website that contains millions of user generated posts that contain HTML. Some of these posts contain sensitive words my advertisers don't want to advertise next to. Instead of deleting these posts, I'd rather censor out the "bad" words. I also need to preserve the markup because letting the users mark up their posts is a major feature of the site.
I am currently using a search and replace with str_ireplace(), but our authors have become clever and are doing things (below) that slip through my primitive filter. I can strip the tags and detect the inappropriate words, but am looking for a way of replacing the words while leaving the markup untouched.
Examples:
Successfully censored:
input: "<p>Mary is a bitch.</p>"
output: "<p>Mary is a *****.</p>"
Unsuccessfully censored:
input: "<p>Mary is a <strong>b</strong>itch.</p>"
failed output: "<p>Mary is a <strong>b</strong>itch.</p>"
desired output: "<p>Mary is a <strong>*</strong>****.</p>"
My advice would be to use other methods to stop this, as it is extremely hard.
From this amusing piece by Jeff Atwood about the 'clbuttic' problems that arise from trying to do so:
Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. But it seems some companies and developers can't stop tilting at that windmill. Which means you might want to think twice before you move to Scunthorpe.
Just for fun here is a quick and dirty way:
$badWords = array('bitch', 'jerk');
$input = '<p>Mary is a <strong>b</strong>itch. </p>';
$arr = explode(' ', $input);
foreach($arr as $key => $word)
{
    $word = str_replace('.', '', strip_tags($word));
    if(in_array($word, $badWords))
    {
        $arr[$key] = '*****';
    }
}
$output = implode(' ', $arr);
echo $output;
Output
<p>Mary is a ***** </p>
The above splits the text into words, and applies strip_tags() on each of the words, so that it doesn't affect the entire content.
There are still many ways around it though, as the comments point out. You'll never get a perfect solution that can handle everything they throw at it - you would need to create something close to artificial intelligence. I think the best real solution would be to strip_tags() on the whole post and search for the bad words, then if any found, flag the post for moderator attention. Or just simply have a report post system with active moderators.
You're going to have an extremely tough time accomplishing this in your way, but my recommendation would be to not change the words out with asterisks, but rather just reject the posting and let the user know why. Here's why:
Simplify your searching. If your algorithm only has to check whether some form of a bad word exists in the text, then you can strip_tags the text and search for your words. If you were to replace the words with asterisks, you couldn't use strip_tags, since you need to leave the original text in its prior condition.
It's what people expect. What people don't expect is for their text to be modified with no notification to them. You'd likely be better off sending people back with a message that says "this post contains inappropriate words/text".
If you are insistent that you replace with asterisks instead of sending the user back, you'll need to write a basic character-by-character parser that ignores HTML tags and constructs words out of it.
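As a rough illustration of such a parser, the sketch below walks the HTML one character at a time, skips anything between < and >, and remembers where each visible character sits in the original string so that matches can be masked in place. The function name censorHtml is made up, and the sketch assumes single-byte text and tags whose attribute values contain no < or >. A real HTML parser would be more robust.

```php
<?php
// Hypothetical helper: masks bad words with asterisks even when markup
// splits them, leaving every tag in place.
function censorHtml(string $html, array $badWords): string
{
    $plain = '';    // the visible text with tags skipped
    $map   = [];    // offset in $plain => offset in $html
    $inTag = false;

    for ($i = 0, $n = strlen($html); $i < $n; $i++) {
        $ch = $html[$i];
        if ($ch === '<') {
            $inTag = true;
        } elseif ($ch === '>' && $inTag) {
            $inTag = false;
        } elseif (!$inTag) {
            $map[strlen($plain)] = $i;
            $plain .= $ch;
        }
    }

    foreach ($badWords as $word) {
        $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
        if (!preg_match_all($pattern, $plain, $matches, PREG_OFFSET_CAPTURE)) {
            continue;
        }
        foreach ($matches[0] as [$text, $offset]) {
            // Overwrite each matched character at its original position.
            for ($j = 0, $len = strlen($text); $j < $len; $j++) {
                $html[$map[$offset + $j]] = '*';
            }
        }
    }

    return $html;
}

echo censorHtml('<p>Mary is a <strong>b</strong>itch.</p>', ['bitch']), "\n";
// prints "<p>Mary is a <strong>*</strong>****.</p>"
```

Because every masked character is replaced one for one, the offsets stay valid while several words are processed in turn.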
You could start from a "bad words" list and check the tag-clean string (that is, filtered via strip_tags()) against the "bad words".
Then you could iterate each bad word through a series of possible single-letter alterations, e.g. S=>5, 1=>L, 0=>O, etc.
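One way to sketch this: instead of generating every altered spelling of each bad word, the example below goes the other way and folds look-alike characters in the input back to letters before checking. That is a swapped-in variation on the idea above, not the same algorithm. The helper name containsBadWord and the substitution table are made up for illustration and are far from exhaustive.

```php
<?php
// Hypothetical helper: reports whether any listed word appears in the
// tag-stripped text after common look-alike characters are folded back
// to letters.
function containsBadWord(string $html, array $badWords): bool
{
    $text = strtolower(strip_tags($html));
    // Illustrative substitution table only.
    $text = strtr($text, ['5' => 's', '$' => 's', '1' => 'l', '0' => 'o', '3' => 'e', '@' => 'a']);

    foreach ($badWords as $word) {
        if (preg_match('/\b' . preg_quote($word, '/') . '\b/', $text) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(containsBadWord('<p>What a <b>j</b>3rk</p>', ['jerk']));  // bool(true)
var_dump(containsBadWord('<p>Call me at 555-0100</p>', ['jerk'])); // bool(false)
```

Folding digits into letters can produce false positives on text that legitimately contains numbers, so a hit is probably better treated as a reason to flag the post than to reject it outright.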
I know you might laugh, but actually this is a common need in most apps. Many apps that take in customer/visitor input may need to filter cuss words or vulgar terms.
Sometimes PHP changes and new stuff gets added in. For instance, just the other day I learned about MultiCurl API in PHP5. So, anyway, is there a new native function in PHP that lets me filter most common English-based cuss words in a string, as well as flip a boolean to say, "string had English-based cuss words in it"? It doesn't need to be perfect, obviously, but cut out a good bit of garbage and let me replace it with ### for instance.
If that's not part of PHP yet, then does anyone have a function that I can use which cloaks the cuss word list? For instance, I want it such that I can drop the class in a project and not have to worry about another programmer getting offended. In other words, a decently encoded cuss word list -- not one actually spelled out.
Now, obviously it needs to be flexible and let words like "rebuttal" get through.
tl;dr: Does PHP5 now have a native function that can filter obscene words? And if not, does anyone have a class that encodes a cuss word list so that it doesn't offend other programmers?
I doubt this is something that would be a high priority for the core PHP team since that treads dangerously close to censorship. Censorship in that they would have a 'master' list of 'inappropriate' language which should be filtered.
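You can do this fairly simply. Make up an array of all the words you want filtered out, and when a page is displayed that contains user input, run preg_replace() on it. (preg_filter() takes the same arguments, but it returns null for a string that contains no match, so clean text would disappear.)
$bad_words = array('/bleeping/i', '/blooping/i'); // each entry must be a delimited regex
$replace = '###'; // replacement used for every match
$submitted_text = 'bleh blah....';
echo preg_replace($bad_words, $replace, $submitted_text);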
Note: you will have to deal with the edge cases where a bad word might be inside of a good word (e.g. 'shitzu[sic] dog').
EDIT
For the bad-words-inside-good-words issue, you can add to the regular expression to require space at the beginning and end of the bad word. If you have lots of submissions though, it's going to be a constant battle to keep up with the trolls.
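The question also asks for a word list that is not spelled out in the source and for a boolean flag. A minimal sketch of both, assuming ROT13 is an acceptable way to cloak the list (it only hides the words from a casual reader, it is not encryption). The function name filterCloaked is made up, the placeholder words from this answer stand in for real entries, and \b is used in place of the surrounding-space requirement mentioned above.

```php
<?php
// Hypothetical helper: decodes a ROT13-encoded word list at runtime,
// replaces whole-word matches with ###, and reports whether anything matched.
function filterCloaked(string $text, array $encodedWords, ?bool &$found = null): string
{
    $patterns = array_map(
        function (string $encoded): string {
            return '/\b' . preg_quote(str_rot13($encoded), '/') . '\b/i';
        },
        $encodedWords
    );

    // preg_replace()'s fifth argument receives the number of replacements made.
    $clean = preg_replace($patterns, '###', $text, -1, $count);
    $found = $count > 0;
    return $clean;
}

$encoded = ['oyrrcvat', 'oybbcvat']; // str_rot13('bleeping'), str_rot13('blooping')
echo filterCloaked('That bleeping rebuttal', $encoded, $found), "\n"; // prints "That ### rebuttal"
var_dump($found); // bool(true)
```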
<?php
$badwords = "fuc";
$replacebad = "****";
$string = $_POST['something'];
$filtered = str_ireplace($badwords, $replacebad, $string);
echo $filtered;
?>
Something like this?
Edit:
Sorry, I didn't notice the PHP5 part.
I have the following part of a validation script:
$invalidEmailError .= "<br/>» You did not enter a valid E-mail address";
$match = "/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/";
That's the expression, here is the validation:
if ( !(preg_match($match,$email)) ) {
    $errors .= $invalidEmailError; // checks validity of email
}
I think that's enough info, let me know if more is needed.
Basically, what happens is the message "You did not enter a valid E-mail address" gets echoed no matter what. Whether a correct email address or an incorrect email address is entered.
Does anyone have any idea or a clue as to why?
EDIT: I'm running this on localhost (using Apache), could that be the reason as to why the preg_match ain't working?
Thanks!
Amit
Your regex only includes [A-Z], not [a-z]. Try
$match = "/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i";
to make the regex case-insensitive.
You can test this live on http://regexpal.com.
However, I'd advise you to try one of the expressions on the page mentioned by strager: http://fightingforalostcause.net/misc/2006/compare-email-regex.php. They have been perfected over time and will probably behave better. But Gmail users will be satisfied with yours, since they'll be able to use plus aliases which are rejected incorrectly by many validators.
You likely got the regular expression you're using from regular-expressions.info. On that page, the author states (emphasis added):
If you want to use the regular expression above, there's two things you need to understand. First, long regexes make it difficult to nicely format paragraphs. So I didn't include a-z in any of the three character classes. This regex is intended to be used with your regex engine's "case insensitive" option turned on. (You'd be surprised how many "bug" reports I get about that.) Second, the above regex is delimited with word boundaries, which makes it suitable for extracting email addresses from files or larger blocks of text. If you want to check whether the user typed in a valid email address, replace the word boundaries with start-of-string and end-of-string anchors, like this: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.
To solve this problem, add the i PCRE flag after your regular expression.
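Putting both points from the quote together, a sketch of the validation with the i flag and start/end anchors might look like this. It assumes the separator in the pattern is the usual @, and the sample addresses are made up. PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL is shown as an alternative to a hand-written pattern.

```php
<?php
// Same character classes as the quoted pattern, anchored to the whole string
// (^ and $) and case-insensitive (i). The D modifier keeps $ from matching
// before a trailing newline.
$match = '/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/iD';

var_dump(preg_match($match, 'user@example.com') === 1);     // bool(true)
var_dump(preg_match($match, 'USER+tag@EXAMPLE.COM') === 1); // bool(true)
var_dump(preg_match($match, 'not an email') === 1);         // bool(false)

// Built-in alternative: returns the address on success, false on failure.
var_dump(filter_var('user@example.com', FILTER_VALIDATE_EMAIL) !== false); // bool(true)
```

Note that {2,4} rejects top-level domains longer than four letters, so the built-in filter may be the more forgiving choice.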
You can always try debugging your regex using a simpler tool (I'm quite fond of using Notepad++ for this purpose) and performing iterative tests - ie. making the expression more/less complicated and seeing if that fixes/breaks things.