Many of us need to deal with user input, search queries, and situations where the input text can potentially contain profanity or undesirable language. Oftentimes this needs to be filtered out.
Where can one find a good list of swear words in various languages and dialects?
Are there APIs available to sources that contain good lists? Or maybe an API that simply says "yes this is clean" or "no this is dirty" with some parameters?
What are some good methods for catching folks trying to trick the system, like a$$, azz, or a55?
Bonus points if you offer solutions for PHP. :)
Edit: Response to answers that say simply avoid the programmatic issue:
I think there is a place for this kind of filter when, for instance, a user can use public image search to find pictures that get added to a sensitive community pool. If they can search for "penis", then they will likely get many pictures of, yep. If we don't want pictures of that, then preventing the word as a search term is a good gatekeeper, though admittedly not a foolproof method. Getting the list of words in the first place is the real question.
So I'm really referring to a way to figure out if a single token is dirty or not and then simply disallow it. I wouldn't bother trying to prevent a sentiment like the totally hilarious "long necked giraffe" reference. Nothing you can do there. :)
Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
Also, one can't forget The Untold History of Toontown's SpeedChat, where even using a "safe-word whitelist" resulted in a 14-year-old quickly circumventing it with:
"I want to stick my long-necked Giraffe up your fluffy white bunny."
Bottom line: Ultimately, for any system that you implement, there is absolutely no substitute for human review (whether peer or otherwise). Feel free to implement a rudimentary tool to get rid of the drive-bys, but for the determined troll, you absolutely must have a non-algorithm-based approach.
A system that removes anonymity and introduces accountability (something that Stack Overflow does well) is helpful also, particularly in order to help combat John Gabriel's G.I.F.T.
You also asked where you can get profanity lists to get you started -- one open-source project to check out is Dansguardian -- check out the source code for their default profanity lists. There is also an additional third party Phrase List that you can download for the proxy that may be a helpful gleaning point for you.
Edit in response to the question edit: Thanks for the clarification on what you're trying to do. In that case, if you're just trying to do a simple word filter, there are two ways you can do it. One is to create a single long regexp with all of the banned phrases that you want to censor, and merely do a regex find/replace with it. A regex like:
$filterRegex = "/(boogers|snot|poop|shucks|argh)/";
and run it on your input string using preg_match() to wholesale test for a hit,
or preg_replace() to blank them out.
You can also load those functions up with arrays rather than a single long regex, and for long word lists, it may be more manageable. See the preg_replace() for some good examples as to how arrays can be used flexibly.
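As a minimal sketch of the single-regex approach, the word list can be kept as an array and joined into one pattern. The function names (buildFilterRegex, containsBannedWord, censor) are hypothetical, and the \b word boundaries and the i flag are assumptions added on top of the plain alternation shown above, so that "snotty" is not caught by "snot".

```php
<?php
// Join a list of banned words into one case-insensitive, whole-word pattern.
function buildFilterRegex(array $bannedWords): string
{
    $escaped = array_map(function (string $word): string {
        return preg_quote($word, '/'); // escape regex metacharacters in each word
    }, $bannedWords);

    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

// Wholesale test for a hit.
function containsBannedWord(string $input, string $regex): bool
{
    return preg_match($regex, $input) === 1;
}

// Blank out each hit with one asterisk per character.
function censor(string $input, string $regex): string
{
    return preg_replace_callback($regex, function (array $match): string {
        return str_repeat('*', strlen($match[0]));
    }, $input);
}

$regex = buildFilterRegex(['boogers', 'snot', 'poop', 'shucks', 'argh']);
echo censor('poop happens', $regex), "\n";
```

Whether whole-word matching is right for you depends on how much you care about embedded matches versus false positives; dropping the \b anchors gives the more aggressive behaviour.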
For additional PHP programming examples, see this page for a somewhat advanced generic class for word filtering that *'s out the center letters from censored words, and this previous Stack Overflow question that also has a PHP example (the main valuable part in there is the SQL-based filtered word approach -- the leet-speak compensator can be dispensed with if you find it unnecessary).
You also added: "Getting the list of words in the first place is the real question." -- in addition to some of the previous Dansguardian links, you may find this handy .zip of 458 words to be helpful.
Whilst I know that this question is fairly old, it's a commonly occurring question...
There is both a reason and a distinct need for profanity filters (see Wikipedia entry here), but they often fall short of being 100% accurate for two distinct reasons: context and accuracy.
It depends (wholly) on what you're trying to achieve - at its most basic, you're probably trying to cover the "seven dirty words" and then some... Some businesses need to filter the most basic of profanity: basic swear words, URLs or even personal information and so on, but others need to prevent illicit account naming (Xbox Live is an example) or far more...
User generated content doesn't just contain potential swear words, it can also contain offensive references to:
Sexual acts
Sexual orientation
Religion
Ethnicity
Etc...
And potentially, in multiple languages. Shutterstock has developed basic dirty-words lists in 10 languages to date, but it's still basic and very much oriented towards their 'tagging' needs. There are a number of other lists available on the web.
I agree with the accepted answer that it's not a defined science, and language is a continually evolving challenge, but it's one where a 90% catch rate is better than 0%. It depends purely on your goals - what you're trying to achieve, the level of support you have and how important it is to remove profanities of different types.
In building a filter, you need to consider the following elements and how they relate to your project:
Words/phrases
Acronyms (FOAD/LMFAO etc)
False positives (words, places and names like 'mishit', 'scunthorpe' and 'titsworth')
URLs (porn sites are an obvious target)
Personal information (email, address, phone etc - if applicable)
Language choice (usually English by default)
Moderation (how, if at all, you can interact with user generated content and what you can do with it)
You can easily build a profanity filter that captures 90%+ of profanities, but you'll never hit 100%. It's just not possible. The closer you want to get to 100%, the harder it becomes... Having built a complex profanity engine in the past that dealt with more than 500K realtime messages per day, I'd offer the following advice:
A basic filter would involve:
Building a list of applicable profanities
Developing a method of dealing with derivations of profanities
A moderately complex filter would involve (in addition to a basic filter):
Using complex pattern matching to deal with extended derivations (using advanced regex)
Dealing with Leetspeak (l33t)
Dealing with false positives
A complex filter would involve a number of the following (in addition to a moderate filter):
Whitelists and blacklists
Naive bayesian inference filtering of phrases/terms
Soundex functions (where a word sounds like another)
Levenshtein distance
Stemming
Human moderators to help guide a filtering engine to learn by example or where matches aren't accurate enough without guidance (a self/continually-improving system)
Perhaps some form of AI engine
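To make the Levenshtein item concrete, here is a minimal sketch using PHP's built-in levenshtein() function. The helper name nearestBannedWord and the default threshold of one edit are assumptions for illustration, not part of any particular filtering engine.

```php
<?php
// Hypothetical helper: returns the banned word closest to $token if it is
// within $maxDistance single-character edits, or null if nothing is that close.
function nearestBannedWord(string $token, array $bannedWords, int $maxDistance = 1): ?string
{
    $token = strtolower($token);
    $best = null;
    $bestDistance = $maxDistance + 1; // anything at or above this is "too far"

    foreach ($bannedWords as $word) {
        $distance = levenshtein($token, strtolower($word));
        if ($distance < $bestDistance) {
            $best = $word;
            $bestDistance = $distance;
        }
    }

    return $best;
}

var_dump(nearestBannedWord('bogers', ['boogers', 'snot'])); // one missing letter
var_dump(nearestBannedWord('shot', ['boogers', 'snot']));   // innocent word, one edit away
```

The second call matches "snot" even though "shot" is harmless, which is exactly why fuzzy matching belongs alongside false-positive handling and human moderation rather than replacing them.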
I don't know of any good libraries for this, but whatever you do, make sure that you err in the direction of letting stuff through. I've dealt with systems that wouldn't allow me to use "mpassell" as a username, because it contains "ass" as a substring. That's a great way to alienate users!
During a job interview of mine, the company CTO who was interviewing me tried out a word/web game I wrote in Java. Out of a word list of the entire Oxford English dictionary, what was the first word that came up to be guessed?
Of course, the most foul word in the English language.
Somehow, I still got the job offer, but I then tracked down a profanity word list (not unlike this one) and wrote a quick script to generate a new dictionary without all of the bad words (without even having to look at the list).
For your particular case, I think comparing the search to real words sounds like the way to go with a word list like that. The alternative styles/punctuation require a bit more work, but I doubt users will use that often enough to be an issue.
a profanity filtering system will never be perfect, even if the programmer is cocksure and keeps abreast of all nude developments
that said, any list of 'naughty words' is likely to perform as well as any other list, since the underlying problem is language understanding which is pretty much intractable with current technology
so, the only practical solution is twofold:
be prepared to update your dictionary frequently
hire a human editor to correct false positives (e.g. "clbuttic" instead of "classic") and false negatives (oops! missed one!)
The only way to prevent offensive user input is to prevent all user input.
If you insist on allowing user input and need moderation, then incorporate human moderators.
Have a look at CDYNE's Profanity Filter Web Service
Testing URL
Beware of localization issues: what is a swearword in one language might be a perfectly normal word in another.
One current example of this: eBay uses a dictionary approach to filter "bad words" from feedback. If you try to enter the German translation of "this was a perfect transaction" ("das war eine perfekte Transaktion"), eBay will reject the feedback due to bad words.
Why? Because the German word for "was" is "war", and "war" is in eBay's dictionary of "bad words".
So beware of localisation issues.
Regarding your "trick the system" subquestion, you can handle that by normalizing both the "bad word" list and the user-entered text before doing your search. e.g., Use a series of regexes (or tr if PHP has it) to convert [z$5] to "s", [4#] to "a", etc., then compare the normalized "bad word" list against the normalized text. Note that the normalization could potentially lead to additional false positives, although I can't think of any actual cases at the moment.
The larger challenge is to come up with something that will let people quote "The pen is mightier than the sword" while blocking "p e n i s".
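One possible way to approach the spaced-out case, sketched under assumptions: only collapse runs of three or more single letters that are separated by spaces, dots, dashes or underscores, and leave ordinary words alone. The function name collapseSpacedLetters and the threshold of three letters are hypothetical choices, and the pattern only handles ASCII letters.

```php
<?php
// Collapse runs such as "p e n i s" or "b.a-d" into a single token so a
// word-list check can see them, without touching normal words like "pen is".
function collapseSpacedLetters(string $text): string
{
    // A lone letter, followed at least twice by separators and another lone letter.
    $pattern = '/(?<![a-z])[a-z](?:[ ._-]+[a-z](?![a-z])){2,}/i';

    return preg_replace_callback($pattern, function (array $match): string {
        return preg_replace('/[ ._-]+/', '', $match[0]); // drop the separators
    }, $text);
}

echo collapseSpacedLetters('p e n i s'), "\n";
echo collapseSpacedLetters('The pen is mightier than the sword'), "\n";
```

The collapsed text would then go through the same normalization and word-list comparison as any other input. This is a heuristic, so legitimate spelled-out initials can be collapsed too.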
I collected 2200 bad words in 25 languages: en, ar, cs, da, de, eo, es, fa, fi, fr, hi, hu, it, ja, ko, nl, no, pl, pt, ru, sv, th, tlh, tr, zh.
MySQL dump, JSON, XML or CSV options are available.
https://github.com/turalus/openDB
I'd suggest importing this SQL into your DB and checking against it every time a user inputs something.
If you can do something like Digg/Stackoverflow where the users can downvote/mark obscene content... do so.
Then all you need to do is review the "naughty" users, and block them if they break the rules.
I'm a little late to the party, but I have a solution that might work for some who read this. It's in javascript instead of php, but there's a valid reason for it.
Full disclosure, I wrote this plugin...
Anyways.
The approach I've gone with is to allow a user to "Opt-In" to their profanity filtering. Basically profanity will be allowed by default, but if my users don't want to read it, they don't have to. This also helps with the "l33t sp3#k" issue.
The concept is a simple jQuery plugin that gets injected by the server if the client's account has profanity filtering enabled. From there, it's just a couple of simple lines that blot out the swears.
Here's the demo page
https://chaseflorell.github.io/jQuery.ProfanityFilter/demo/
<div id="foo">
ass will fail but password will not
</div>
<script>
// code:
$('#foo').profanityFilter({
customSwears: ['ass']
});
</script>
result
*** will fail but password will not
I'm also late to the game, but I was doing some research and stumbled across this question. As others have mentioned, fully automated filtering is close to impossible, but if your design/requirements can involve human interaction in some cases (not all the time) to review whether something is profane or not, you may consider ML. https://learn.microsoft.com/en-us/azure/cognitive-services/content-moderator/text-moderation-api#profanity is my current choice for multiple reasons:
Supports many localizations
They keep updating the database, so I don't have to keep up with the latest slang or languages (maintenance issue)
When there is a high probability (i.e. 90% or more) you can just deny it programmatically
You can observe for category which causes a flag that may or may not be profanity, and can have somebody review it to teach that it is or isn't profane.
My use case is a public-friendly commercial service (OK, video games) where other users may/will see the username, so the design requires usernames to go through a profanity filter that rejects offensive ones. The sad part is that the classic "clbuttic" issue will most likely occur, since usernames are usually a single word (up to N characters) or sometimes multiple words concatenated... Again, Microsoft's cognitive service will not flag "Assist" as Text.HasProfanity=true, but it may report a high probability for one of the categories.
As the OP asks about "a$$", here's the result when I passed it through the filter: it determined that the text is not profane, but gave a high probability that it is, so it flagged it with a recommendation for (human) review.
When the probability is high, I can either return "I'm sorry, that name is already taken" (even if it isn't), so that it is less offensive to anti-censorship persons, if we don't want to integrate human review, or return "Your username has been sent to the live operations department; you may wait for it to be reviewed and approved, or choose another username". Or whatever...
By the way, the cost of this service is quite low for my purpose (how often does a username get changed?), but for the OP the design may demand more intensive queries, so paying for an ML service may not be ideal, or human review may not be possible. It all depends on the design... But if the design does fit the bill, perhaps this can be the OP's solution.
If interested, I can list the cons in the comment in the future.
Once you have a good MySQL table of some bad words you want to filter (I started with one of the links in this thread), you can do something like this:
$errors = array(); //Initialize error array (I use this with all my PHP form validations)
$errorString = ''; //Initialize the string that collects the error markup echoed below.
$SCREENNAME = $_POST['SCREENNAME']; //Raw input. The prepared statement below keeps it from causing SQL injection when you query the profanity table.
$ProfanityCheckString = strtoupper($SCREENNAME); //Make the input string uppercase (so that 'BaDwOrD' is the same as 'BADWORD'). All your values in the profanity table will need to be UPPERCASE for this to work.
$ProfanityCheckString = preg_replace('/[_-]/','',$ProfanityCheckString); //I allow alphanumeric, underscores, and dashes...nothing else (I control this with PHP form validation). Pull out non-alphanumeric characters so 'B-A-D-W-O-R-D' shows up as 'BADWORD'.
$ProfanityCheckString = preg_replace('/1/','I',$ProfanityCheckString); //Replace common numeric representations of letters so '84DW0RD' shows up as 'BADWORD'.
$ProfanityCheckString = preg_replace('/3/','E',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/4/','A',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/5/','S',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/6/','G',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/7/','T',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/8/','B',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/0/','O',$ProfanityCheckString); //Replace ZERO's with O's (Capital letter o's).
$ProfanityCheckString = preg_replace('/Z/','S',$ProfanityCheckString); //Replace Z's with S's, another common substitution. Make sure you replace Z's with S's in your profanity database for this to work properly. Same with all the numbers too--having S3XY in your database won't work, since this code would render that string as 'SEXY'. The profanity table should have the "rendered" version of the bad words.
$CheckProfanity = $pdo->prepare("SELECT COUNT(*) FROM DATABASE.TABLE p WHERE p.WORD = ?"); //$pdo is assumed to be an already-open PDO connection to your MySQL server.
$CheckProfanity->execute(array($ProfanityCheckString));
if($CheckProfanity->fetchColumn() > 0) {$errors[] = 'Please select another Screen Name.';} //Check your profanity table for the scrubbed input. You could get real crazy using LIKE and wildcards, but I only want a simple profanity filter.
if (count($errors) > 0) {foreach($errors as $error) {$errorString .= "<span class='PHPError'>$error</span><br /><br />";} echo $errorString;} //Echo any PHP errors that come out of the validation, including any profanity flagging.
//You can also use these lines to troubleshoot.
//echo $ProfanityCheckString;
//echo "<br />";
//print_r($CheckProfanity->errorInfo());
//echo "<br />";
I'm sure there is a more efficient way to do all those replacements, but I'm not smart enough to figure it out (and this seems to work okay, albeit inefficiently).
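One way those replacements could be collapsed, as a sketch rather than a drop-in: PHP's strtr() translates characters one-for-one in a single pass, and str_replace() accepts an array of needles. The function name scrubScreenName is hypothetical; the character mapping is the same one used in the code above.

```php
<?php
// Uppercase the name, drop underscores and dashes, then map the common
// digit/letter substitutions in one pass (1->I, 3->E, 4->A, 5->S, 6->G,
// 7->T, 8->B, 0->O, Z->S).
function scrubScreenName(string $name): string
{
    $upper = strtoupper($name);
    $stripped = str_replace(['_', '-'], '', $upper);

    return strtr($stripped, '13456780Z', 'IEASGTBOS');
}

echo scrubScreenName('84DW0RD'), "\n";
echo scrubScreenName('b-a-d_w-o-r-d'), "\n";
```

Both calls produce the same scrubbed string, which is then what gets compared against the "rendered" words in the profanity table.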
I believe that you should err on the side of allowing users to register, and use humans to filter and add to your profanity table as required. Though it all depends on the cost of a false positive (okay word flagged as bad) versus a false negative (bad word gets through). That should ultimately govern how aggressive or conservative you are in your filtering strategy.
I would also be very careful if you want to use wildcards, since they can sometimes behave more onerously than you intend.
I agree with HanClinto's post higher up in this discussion. I generally use regular expressions to string-match input text. And this is a vain effort because, as you originally mentioned, you have to explicitly account for every trick form of writing popular on the net in your "blocked" list.
On a side note, while others are debating the ethics of censorship, I must agree that some form is necessary on the web. Some people simply enjoy posting vulgarity because it can be instantly offensive to a large body of people, and requires absolutely no thought on the author's part.
Thank you for the ideas.
HanClinto rules!
Frankly, I'd let them get the "trick the system" words out and ban them instead, which is just me. But it also makes the programming simpler.
What I'd do is implement a regex filter like so: /[\s]dooby (doo?)[\s]/i or, if the word is a prefix of others, /[\s]doob(er|ed|est)[\s]/. These would prevent filtering words like assuaged, which is perfectly valid, but would also require knowledge of the other variants and updating the actual filter if you learn a new one. Obviously these are all examples, but you'd have to decide how to do it yourself.
I'm not about to type out all the words I know, not when I don't actually want to know them.
Don't. It just leads to problems. One clbuttic personal experience I have with profanity filters is the time where I was kick/banned from an IRC channel for mentioning that I was "heading over the bridge to Hancock for a couple hours" or something to that effect.
I agree with the futility of the subject, but if you have to have a filter, check out Ning's Boxwood:
Boxwood is a PHP extension for fast replacement of multiple words in a piece of text. It supports case-sensitive and case-insensitive matching. It requires that the text it operates on be encoded as UTF-8.
Also see this blog post for more details:
Fast Multiple String Replacement in PHP
With Boxwood, you can have your list of search terms be as long as you like -- the search and replace algorithm doesn't get slower with more words on the list of words to look for. It works by building a trie of all the search terms and then scans your subject text just once, walking down elements of the trie and comparing them to characters in your text. It supports US-ASCII and UTF-8, case-sensitive or insensitive matching, and has some English-centric word boundary checking logic.
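To illustrate the trie idea in plain PHP (this is not Boxwood's code or API, just a sketch of the technique), the functions below build a trie from nested arrays and mask the longest match at each position. The names buildTrie and replaceWords and the '#end' marker key are hypothetical. Unlike the extension as described, this version works byte by byte, so it is only safe for ASCII text, and it does no word-boundary checking.

```php
<?php
// Build a trie as nested arrays: one level per character, with a '#end'
// key marking the end of a complete search term.
function buildTrie(array $words): array
{
    $root = [];
    foreach ($words as $word) {
        if ($word === '') {
            continue;
        }
        $node = &$root;
        foreach (str_split(strtolower($word)) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node['#end'] = true;
        unset($node); // break the reference before the next word
    }

    return $root;
}

// Walk the text once; at each position follow the trie as far as it goes
// and mask the longest term found there.
function replaceWords(string $text, array $trie, string $mask = '*'): string
{
    $lower = strtolower($text);
    $length = strlen($text);
    $out = '';
    $i = 0;
    while ($i < $length) {
        $node = $trie;
        $matchLength = 0;
        for ($j = $i; $j < $length && isset($node[$lower[$j]]); $j++) {
            $node = $node[$lower[$j]];
            if (isset($node['#end'])) {
                $matchLength = $j - $i + 1; // longest complete term so far
            }
        }
        if ($matchLength > 0) {
            $out .= str_repeat($mask, $matchLength);
            $i += $matchLength;
        } else {
            $out .= $text[$i];
            $i++;
        }
    }

    return $out;
}

echo replaceWords('Snot and poop', buildTrie(['snot', 'poop'])), "\n";
```

The cost of each lookup depends on the length of the text and of the matched terms, not on how many terms are in the list, which is the property the description above highlights.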
I concluded that, in order to create a good profanity filter, we need 3 main components, or at least that is what I am going to do. They are:
The filter: a background service that verifies input against a blacklist, dictionary or something like that.
No anonymous accounts
Abuse reporting
As a bonus, somehow reward those who contribute accurate abuse reports and punish the offenders, e.g. by suspending their accounts.
Don't.
Because:
Clbuttic
Profanity is not OMG EVIL
Profanity cannot be effectively defined
Most people quite probably don't appreciate being "protected" from profanity
Edit: While I agree with the commenter who said "censorship is wrong", that is not the nature of this answer.
Let's say I am building a simple dictionary where users type a word and see a definition.
In an oversimplification, are there any problems with setting up my dictionary as a MySQL table, and each user request for a word will call a PHP script to find the word, and display its definition?
What's the optimal way to build this to minimize user lag time/not overheat the server? How does dictionary.com do it? My resources are limited, so I can't afford a dedicated server
Since this question is tagged as architecture, I'll try to provide a basic architecture overview.
The problem statement consists of the following points.
Online application - a single service/application will provide services to multiple users.
Text search - most of the time, queries are not complete words that can be found in the database.
Frequent database queries - as the number of users grows, this might become a problem.
So, you might think of the following solutions.
Google for text-searching tools/libraries; you will find lots of them, and they help return relevant search results. Or you can look at how WordWeb does it.
To avoid frequent database queries, you can cache the last 1000 results (or some configurable number of results) in a file, as Lucene Search does.
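A minimal sketch of the "cache the last N results" idea, assuming an in-memory least-recently-used policy rather than a file. The class name LruCache is hypothetical, and the lookup that fills the cache (your MySQL query for a definition) is left out.

```php
<?php
// Keeps at most $capacity entries; the least recently used entry is
// evicted first. Relies on PHP arrays preserving insertion order.
final class LruCache
{
    /** @var array<string, mixed> */
    private $items = [];
    /** @var int */
    private $capacity;

    public function __construct(int $capacity = 1000)
    {
        $this->capacity = $capacity;
    }

    // Returns the cached value, or null on a miss.
    public function get(string $key)
    {
        if (!array_key_exists($key, $this->items)) {
            return null;
        }
        $value = $this->items[$key];
        unset($this->items[$key]);   // move the entry to the "most recent" end
        $this->items[$key] = $value;

        return $value;
    }

    public function put(string $key, $value): void
    {
        unset($this->items[$key]);
        $this->items[$key] = $value;
        if (count($this->items) > $this->capacity) {
            reset($this->items);                      // oldest entry is first
            unset($this->items[key($this->items)]);
        }
    }
}

$cache = new LruCache(2);
$cache->put('apple', 'a round fruit');
var_dump($cache->get('apple'));
```

In a typical PHP setup each request starts with fresh memory, so a cache like this would need to live in shared storage (a file, APCu, memcached or similar) to help across requests; the eviction logic stays the same.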
DISCLAIMER
The above architecture will hold up well if there are multiple simultaneous users, and only if it is needed at all. Otherwise it might be more effort than required.
The best way to develop an architecture is to make the system adaptable to change. So start with the basics and keep adapting to changes.
I'm currently working on an indexer for a search feature. The indexer will work over data from "fields".
Fields look like:
  Field_id  Field_type  Field_name   Field_Data
- 101       text        Name         Intel i7
- 102       integer     Cores        4 physical, 4 virtual
- 103       select      Vendor       Intel
- 104       multitext   Description  The i7 is intel's next gen range of cpus.
The indexer would generate the following results/index:
  Keyword   Occurrences
- intel     101, 103, 104
- i7        101, 104
- physical  102
- virtual   102
- next      104
- gen       104
- range     104
- cpus      104 (*)
- cpu       104 (*)
So it somewhat looks all nice and fine, however, there are some issues which I'd like to sort out:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
I'm currently using MySql and I'm very concerned with performance issues; we have 500+ categories and we didn't even launch the site
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
There were other issues which I probably forgot about, if you spot any, you're welcome to yell at me ;-)
I do not need the search engine to crawl links, in fact, I specifically want it to not crawl links.
(by the way, I'm not biased towards intel, it simply happens that I own an i7-based pc ;-) )
Grab a list of stop words (non-keywords) from here; the guy has even formatted them in PHP for you.
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
Then simply do a preg_replace on the string you are indexing.
What I've done in the past is remove suffixes like 's', 'ed' etc with regex and use the same regex on the search string. It's not ideal though. This was for a basic website with only 200 pages.
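A rough sketch of that stop-word and suffix-stripping approach, applied to the field data from the question. The function names extractKeywords and buildIndex are hypothetical, the three stop words stand in for the full linked list, and the "strip a trailing ed or s from words longer than three letters" rule is deliberately naive (it would also mangle words like "need").

```php
<?php
// Lower-case the text, drop possessive 's, remove stop words, and strip a
// trailing "ed" or "s" from words longer than three characters.
function extractKeywords(string $text, array $stopWords): array
{
    $text = preg_replace("/'s\\b/", '', strtolower($text));
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $keywords = [];
    foreach ($words as $word) {
        if (in_array($word, $stopWords, true)) {
            continue;
        }
        if (strlen($word) > 3) {
            $word = preg_replace('/(?:ed|s)$/', '', $word);
        }
        $keywords[] = $word;
    }

    return $keywords;
}

// Map each keyword to the list of field ids it occurs in.
function buildIndex(array $fields, array $stopWords): array
{
    $index = [];
    foreach ($fields as $fieldId => $data) {
        foreach (array_unique(extractKeywords($data, $stopWords)) as $keyword) {
            $index[$keyword][] = $fieldId;
        }
    }

    return $index;
}

$fields = [
    101 => 'Intel i7',
    102 => '4 physical, 4 virtual',
    103 => 'Intel',
    104 => "The i7 is intel's next gen range of cpus.",
];
print_r(buildIndex($fields, ['the', 'is', 'of']));
```

Running the search string through extractKeywords() as well keeps queries and index entries in the same normalized form, so "cpus" and "cpu" land on the same key.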
If you are concerned about performance you might want to consider using a search engine like Lucene (Solr) instead of a database. This will make indexing much easier. You don't want to reinvent the wheel here.
This is in response to your original question, and your later answer/question.
I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.
I'm sure there are other ways to do this, both with your own custom code, or with other search engines—Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.
I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.
In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
11.2.8. stopwords
Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
11.2.9. wordforms
Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (eg. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
Sphinx supports the Porter Stemming Algorithm
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
3.2. Attributes
A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.
This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to setup filtering, sorting, and grouping.
You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):
field search operator:
@vendor intel
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
8.6.1. Query
On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics. The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:
"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).
"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.
"total_found":
Total amount of matching documents in index (that were found and processed on server).
"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").
"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.
"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.
Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
Find (or create) a list of common words and filter user input.
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
Depends. I would search for both if that's not a big burden; or for the singular form using the LIKE clause if possible.
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
Create an Inflector method or class. ie: Inflect::plural('fish') gives you 'fish'. There might be classes like these for the English language, look them up.
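A toy version of such an Inflect class, covering just the three flavors from the question plus two common suffix rules. The word lists and rules here are illustrative assumptions, nowhere near complete English coverage, and an existing inflector library would be the more realistic choice.

```php
<?php
final class Inflect
{
    // Words whose plural equals the singular (sample entries only).
    private const UNCOUNTABLE = ['fish', 'sheep', 'series'];

    // Irregular plurals that no suffix rule captures (sample entries only).
    private const IRREGULAR = ['leaf' => 'leaves', 'person' => 'people'];

    public static function plural(string $word): string
    {
        $lower = strtolower($word);

        if (in_array($lower, self::UNCOUNTABLE, true)) {
            return $word;
        }
        if (array_key_exists($lower, self::IRREGULAR)) {
            return self::IRREGULAR[$lower];
        }
        if (preg_match('/(?:s|x|z|ch|sh)$/', $lower)) {
            return $word . 'es';                  // box => boxes
        }
        if (preg_match('/[^aeiou]y$/', $lower)) {
            return substr($word, 0, -1) . 'ies';  // category => categories
        }

        return $word . 's';                       // test => tests
    }
}

echo Inflect::plural('test'), ' ', Inflect::plural('fish'), ' ', Inflect::plural('leaf'), "\n";
```

For indexing, the reverse direction (plural to singular) is usually what you need, and it can be built the same way by flipping the irregular table and reversing the suffix rules.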
I'm currently using MySql and I'm very concerned with performance issues; we have 500+ categories and we didn't even launch the site
Having good schema and code design helps, but I can't really give you much advice on that one.
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
That would really help, since you'd be looking up a single column instead of multiple. Just be careful to filter user input and/or allow looking up only particular columns.
Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
Not many options here. To help here and in performance, you should consider having some sort of caching.
I would heartily suggest you take a look at Solr. It's a Java based self contained Search and index system and probably has more benefits than a PHP solution.
Search is tough to implement. Would recommend using a package if you're new to it.
Have you considered http://framework.zend.com/manual/en/zend.search.lucene.html ?
Since many are suggesting to use an existing package, (and I want to make it harder for you than just suggesting a package ;-) ), let's presume I will use such a package (over in this answer thread).
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
That's not the question I want answered, at least not directly. My issue is, how easy is it to make the search engine work as I want?
Given my above requirements, is this even possible/feasible?
From personal experience, I'd rather waste some time tweaking my own system than fix someone else's code, which I would first have to waste far more time understanding.
Call me conservative, but I rarely stick to someone else's code/programs, and when I did, it was because of a desperate situation - and I usually end up somehow contributing to said project.
There's a PHP implementation of a Brill Part of Speech tagger on php/ir. This might provide a framework for identifying those words that should be discarded and those you want to index, while it also identifies plurals (and the root singular). It's not perfect, though with a custom dictionary to handle technical terms, it could prove useful for resolving your first three questions.
I'd like to find a way to take a piece of user supplied text and determine what addresses on the map are mentioned within the text. I'd be happy to use a free web service if it exists or use a script which will not consume too many resources.
One way I can imagine doing this is taking a gigantic database of addresses and searching for each of them individually in the text, but this does not seem efficient. Is there a better algorithm or technique one can suggest?
My basic idea is to take the location information and turn it into markers on a Google Map. If it is too difficult or CPU intensive to determine the locations automatically, I could require users to add information in a location field if necessary but I would prefer not to do this as some of the users are going to be quite young students.
This needs to be done in PHP as that is the scripting language available on my school hosted server.
Note this whole set-up will happen within the context of a Drupal node, and I plan on using a filter to collect the necessary location information from the individual node, so this parsing would only happen once (when the new text enters the database).
You could get something like OpenCalais to tag your text. One of the categories it returns is "city"; you could then use another third-party module to show the location of the city.
If you did have a gigantic list of locations in a relational database, and you're only concerned about 500 to 1000 words, then you could definitely just pass the SQL command to find matches for the 500-1000 words and it would be quite efficient.
But even if you did have to call a slow API, you could feasibly request for 500 words one by one. If you kept a cache of the matches, then the cache would probably quickly fill up with all the stop words (you know, like "the", "if", "and") and then using the cache, it'd be likely that you would be searching much less than 500 words each time.
I think you might be surprised at how fast the brute force approach would work.
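To make the caching idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: `$slowLookup` stands in for the SQL query or web service call, `$knownPlaces` is a made-up two-entry gazetteer, and only single-word place names are handled.

```php
<?php
// Hypothetical gazetteer standing in for the locations table or slow API.
$knownPlaces = ['paris' => true, 'london' => true];

// Counts how often the slow lookup is actually hit.
$lookupCalls = 0;

// Assumed stand-in for the expensive lookup (SQL query or web service call).
$slowLookup = function (string $word) use ($knownPlaces, &$lookupCalls): bool {
    $lookupCalls++;
    return isset($knownPlaces[$word]);
};

/**
 * Return the words of $text that the lookup recognises as places,
 * remembering every answer (hits and misses) in $cache.
 */
function findPlaces(string $text, callable $lookup, array &$cache): array
{
    $found = [];
    // Split on anything that is not a letter; drop empty pieces.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (!array_key_exists($word, $cache)) {
            $cache[$word] = $lookup($word); // only unseen words cost a lookup
        }
        if ($cache[$word]) {
            $found[$word] = true;
        }
    }
    return array_keys($found);
}

$cache = [];
$first = findPlaces('The train from Paris to London and the ferry', $slowLookup, $cache);
$callsAfterFirst = $lookupCalls;
$second = findPlaces('The bus to London', $slowLookup, $cache);
```

The first text costs eight lookups (one per distinct word). The second costs only one, because "the", "to" and "london" are already in the cache. In a real setup the cache would have to live somewhere persistent, such as a table or a key-value store, rather than in a request-local array.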
For future reference I would just like to mention the Yahoo API called Placemaker and the service GeoMaker that is built on top of it.
Those tools can be used to parse out locations from a text as requested here. Unfortunately no Drupal module seems to exist right now, but a custom solution seems easy to code.
Many of us need to deal with user input, search queries, and situations where the input text can potentially contain profanity or undesirable language. Oftentimes this needs to be filtered out.
Where can one find a good list of swear words in various languages and dialects?
Are there APIs available to sources that contain good lists? Or maybe an API that simply says "yes this is clean" or "no this is dirty" with some parameters?
What are some good methods for catching folks trying to trick the system, like a$$, azz, or a55?
Bonus points if you offer solutions for PHP. :)
Edit: Response to answers that say simply avoid the programmatic issue:
I think there is a place for this kind of filter when, for instance, a user can use public image search to find pictures that get added to a sensitive community pool. If they can search for "penis", then they will likely get many pictures of, yep. If we don't want pictures of that, then preventing the word as a search term is a good gatekeeper, though admittedly not a foolproof method. Getting the list of words in the first place is the real question.
So I'm really referring to a way to figure out if a single token is dirty or not and then simply disallow it. I'd not bother preventing a sentiment like the totally hilarious "long necked giraffe" reference. Nothing you can do there. :)
Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
Also, one can't forget The Untold History of Toontown's SpeedChat, where even using a "safe-word whitelist" resulted in a 14-year-old quickly circumventing it with:
"I want to stick my long-necked Giraffe up your fluffy white bunny."
Bottom line: Ultimately, for any system that you implement, there is absolutely no substitute for human review (whether peer or otherwise). Feel free to implement a rudimentary tool to get rid of the drive-by's, but for the determined troll, you absolutely must have a non-algorithm-based approach.
A system that removes anonymity and introduces accountability (something that Stack Overflow does well) is helpful also, particularly in order to help combat John Gabriel's G.I.F.T.
You also asked where you can get profanity lists to get you started -- one open-source project to check out is DansGuardian -- check out the source code for their default profanity lists. There is also an additional third-party Phrase List that you can download for the proxy that may be a helpful gleaning point for you.
Edit in response to the question edit: Thanks for the clarification on what you're trying to do. In that case, if you're just trying to do a simple word filter, there are two ways you can do it. One is to create a single long regexp with all of the banned phrases that you want to censor, and merely do a regex find/replace with it. A regex like:
$filterRegex = "/(boogers|snot|poop|shucks|argh)/";
and run it on your input string using preg_match() to wholesale test for a hit,
or preg_replace() to blank them out.
You can also load those functions up with arrays rather than a single long regex, and for long word lists, it may be more manageable. See the preg_replace() for some good examples as to how arrays can be used flexibly.
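As a rough sketch of both variants (the word list is just the placeholder one from above, and the variable names are made up for illustration), something like this should work. The `i` flag and the `preg_quote()` call are additions of mine, not part of the original regex:

```php
<?php
// Placeholder word list taken from the example above.
$banned = ['boogers', 'snot', 'poop', 'shucks', 'argh'];

// Build one alternation; preg_quote() escapes regex metacharacters in each word.
$quoted = array_map(fn($w) => preg_quote($w, '/'), $banned);
$filterRegex = '/(' . implode('|', $quoted) . ')/i';

$input = 'Argh, I stepped in poop!';

// preg_match() returns 1 on a hit, 0 on none.
$isDirty = preg_match($filterRegex, $input) === 1;

// Single-regex replacement: every hit becomes "***".
$censored = preg_replace($filterRegex, '***', $input);

// Array form: one pattern per word, one replacement per pattern.
$patterns = array_map(fn($w) => '/' . preg_quote($w, '/') . '/i', $banned);
$replacements = array_map(fn($w) => str_repeat('*', strlen($w)), $banned);
$starred = preg_replace($patterns, $replacements, $input);
```

Here `$censored` comes out as `***, I stepped in ***!` and `$starred` as `****, I stepped in ****!`. The array form makes it easy to give each word its own replacement, at the cost of one regex pass per word.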
For additional PHP programming examples, see this page for a somewhat advanced generic class for word filtering that *'s out the center letters from censored words, and this previous Stack Overflow question that also has a PHP example (the main valuable part in there is the SQL-based filtered word approach -- the leet-speak compensator can be dispensed with if you find it unnecessary).
You also added: "Getting the list of words in the first place is the real question." -- in addition to some of the previous DansGuardian links, you may find this handy .zip of 458 words to be helpful.
Whilst I know that this question is fairly old, it's a commonly occurring question...
There is both a reason and a distinct need for profanity filters (see Wikipedia entry here), but they often fall short of being 100% accurate for two distinct reasons: context and accuracy.
It depends (wholly) on what you're trying to achieve - at its most basic, you're probably trying to cover the "seven dirty words" and then some... Some businesses need to filter the most basic of profanity: basic swear words, URLs or even personal information and so on, but others need to prevent illicit account naming (Xbox Live is an example) or far more...
User generated content doesn't just contain potential swear words, it can also contain offensive references to:
Sexual acts
Sexual orientation
Religion
Ethnicity
Etc...
And potentially, in multiple languages. Shutterstock has developed basic dirty-words lists in 10 languages to date, but it's still basic and very much oriented towards their 'tagging' needs. There are a number of other lists available on the web.
I agree with the accepted answer that it's not a defined science, and because language is continually evolving it remains a challenge, but one where a 90% catch rate is better than 0%. It depends purely on your goals - what you're trying to achieve, the level of support you have and how important it is to remove profanities of different types.
In building a filter, you need to consider the following elements and how they relate to your project:
Words/phrases
Acronyms (FOAD/LMFAO etc)
False positives (words, places and names like 'mishit', 'scunthorpe' and 'titsworth')
URLs (porn sites are an obvious target)
Personal information (email, address, phone etc - if applicable)
Language choice (usually English by default)
Moderation (how, if at all, you can interact with user generated content and what you can do with it)
You can easily build a profanity filter that captures 90%+ of profanities, but you'll never hit 100%. It's just not possible. The closer you want to get to 100%, the harder it becomes... Having built a complex profanity engine in the past that dealt with more than 500K realtime messages per day, I'd offer the following advice:
A basic filter would involve:
Building a list of applicable profanities
Developing a method of dealing with derivations of profanities
A moderately complex filter would involve (in addition to a basic filter):
Using complex pattern matching to deal with extended derivations (using advanced regex)
Dealing with Leetspeak (l33t)
Dealing with false positives
A complex filter would involve a number of the following (In addition to a moderate filter):
Whitelists and blacklists
Naive bayesian inference filtering of phrases/terms
Soundex functions (where a word sounds like another)
Levenshtein distance
Stemming
Human moderators to help guide a filtering engine to learn by example or where matches aren't accurate enough without guidance (a self/continually-improving system)
Perhaps some form of AI engine
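To give a feel for two of the items above, here is a minimal sketch using PHP's built-in `soundex()` and `levenshtein()`. The `resembles()` helper, the word list and the one-edit threshold are assumptions for illustration, not a recommended configuration:

```php
<?php
// Hypothetical blocked list, reusing harmless placeholder words.
$blocked = ['snot', 'poop'];

/**
 * Return the blocked word that $token resembles, or null.
 * A token "resembles" a word if it sounds alike (same soundex key)
 * or is within $maxEdits single-character edits of it.
 */
function resembles(string $token, array $blocked, int $maxEdits = 1): ?string
{
    $token = strtolower($token);
    foreach ($blocked as $word) {
        if (soundex($token) === soundex($word)) {
            return $word;
        }
        if (levenshtein($token, $word) <= $maxEdits) {
            return $word;
        }
    }
    return null;
}
```

With this, `resembles('snott', $blocked)` returns `'snot'` and `resembles('hello', $blocked)` returns `null`. It also shows why these techniques need the false-positive handling mentioned earlier: `resembles('pope', $blocked)` returns `'poop'`, because both words share the soundex key P100.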
I don't know of any good libraries for this, but whatever you do, make sure that you err in the direction of letting stuff through. I've dealt with systems that wouldn't allow me to use "mpassell" as a username, because it contains "ass" as a substring. That's a great way to alienate users!
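A small sketch of the difference between a substring check and a whole-word check, using the username from above. The function names are made up for illustration:

```php
<?php
// Placeholder blocked word from the username example above.
$blocked = ['ass'];

// Naive check: flags any text that merely contains a blocked word.
function containsBlockedSubstring(string $text, array $blocked): bool
{
    foreach ($blocked as $word) {
        if (stripos($text, $word) !== false) {
            return true;
        }
    }
    return false;
}

// Stricter check: \b only matches where the blocked word stands on its own.
function containsBlockedWord(string $text, array $blocked): bool
{
    foreach ($blocked as $word) {
        if (preg_match('/\b' . preg_quote($word, '/') . '\b/i', $text) === 1) {
            return true;
        }
    }
    return false;
}
```

The substring check rejects "mpassell". The word-boundary check lets it (and "password") through while still catching the word on its own. Note that `\b` treats digits and underscores as word characters, so this version errs in the direction of letting stuff through.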
During a job interview of mine, the company CTO who was interviewing me tried out a word/web game I wrote in Java. Out of a word list of the entire Oxford English dictionary, what was the first word that came up to be guessed?
Of course, the most foul word in the English language.
Somehow, I still got the job offer, but I then tracked down a profanity word list (not unlike this one) and wrote a quick script to generate a new dictionary without all of the bad words (without even having to look at the list).
For your particular case, I think comparing the search to real words sounds like the way to go with a word list like that. The alternative styles/punctuation require a bit more work, but I doubt users will use that often enough to be an issue.
a profanity filtering system will never be perfect, even if the programmer is cocksure and keeps abreast of all nude developments
that said, any list of 'naughty words' is likely to perform as well as any other list, since the underlying problem is language understanding which is pretty much intractable with current technology
so, the only practical solution is twofold:
be prepared to update your dictionary frequently
hire a human editor to correct false positives (e.g. "clbuttic" instead of "classic") and false negatives (oops! missed one!)
The only way to prevent offensive user input is to prevent all user input.
If you insist on allowing user input and need moderation, then incorporate human moderators.
Have a look at CDYNE's Profanity Filter Web Service
Testing URL
Beware of localization issues: what is a swearword in one language might be a perfectly normal word in another.
One current example of this: eBay uses a dictionary approach to filter "bad words" from feedback. If you try to enter the German translation of "this was a perfect transaction" ("das war eine perfekte Transaktion"), eBay will reject the feedback due to bad words.
Why? Because the German word for "was" is "war", and "war" is in eBay's dictionary of "bad words".
So beware of localisation issues.
Regarding your "trick the system" subquestion, you can handle that by normalizing both the "bad word" list and the user-entered text before doing your search. e.g., Use a series of regexes (or tr if PHP has it) to convert [z$5] to "s", [4#] to "a", etc., then compare the normalized "bad word" list against the normalized text. Note that the normalization could potentially lead to additional false positives, although I can't think of any actual cases at the moment.
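PHP does have an equivalent of tr: `strtr()`. A minimal sketch of the normalization idea, assuming exact-match comparison against a tiny hypothetical word list (the `'@'` entry is an extra I added; the other mappings are the ones mentioned above):

```php
<?php
// Substitution table. The first five entries come from the text above;
// '@' => 'a' is an extra assumption added for illustration.
const LEET_MAP = [
    'z' => 's', '$' => 's', '5' => 's',
    '4' => 'a', '#' => 'a',
    '@' => 'a',
];

function normalize(string $text): string
{
    // strtr() with an array replaces every key with its value in one pass.
    return strtr(strtolower($text), LEET_MAP);
}

// Hypothetical bad-word list; it is normalized the same way as the input.
$badWords = array_map('normalize', ['ass']);

function isBlocked(string $token, array $badWords): bool
{
    return in_array(normalize($token), $badWords, true);
}
```

All three variants from the question ("a$$", "azz", "a55") normalize to the same string and are blocked. As for the extra false positives: "jazz" normalizes to "jass", which is harmless with the exact match used here but would trip a filter that looks for blocked words as substrings.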
The larger challenge is to come up with something that will let people quote "The pen is mightier than the sword" while blocking "p e n i s".
I collected 2200 bad words in 25 languages: en, ar, cs, da, de, eo, es, fa, fi, fr, hi, hu, it, ja, ko, nl, no, pl, pt, ru, sv, th, tlh, tr, zh.
MySQL dump, JSON, XML or CSV options are available.
https://github.com/turalus/openDB
I'd suggest importing this SQL into your DB and checking against it every time a user inputs something.
If you can do something like Digg/Stackoverflow where the users can downvote/mark obscene content... do so.
Then all you need to do is review the "naughty" users, and block them if they break the rules.
I'm a little late to the party, but I have a solution that might work for some who read this. It's in JavaScript instead of PHP, but there's a valid reason for it.
Full disclosure, I wrote this plugin...
Anyways.
The approach I've gone with is to allow a user to "Opt-In" to their profanity filtering. Basically profanity will be allowed by default, but if my users don't want to read it, they don't have to. This also helps with the "l33t sp3#k" issue.
The concept is a simple jQuery plugin that gets injected by the server if the client's account has profanity filtering enabled. From there, it's just a couple of simple lines that blot out the swears.
Here's the demo page
https://chaseflorell.github.io/jQuery.ProfanityFilter/demo/
<div id="foo">
ass will fail but password will not
</div>
<script>
// code:
$('#foo').profanityFilter({
customSwears: ['ass']
});
</script>
result
*** will fail but password will not
Also late to the game, but I stumbled across this while doing some research. As others have mentioned, it's close to impossible if fully automated, but if your design/requirements can involve human interaction in some cases (but not all the time) to review whether something is profane or not, you may consider ML. https://learn.microsoft.com/en-us/azure/cognitive-services/content-moderator/text-moderation-api#profanity is my current choice for multiple reasons:
Supports many locales
They keep updating the database, so I don't have to keep up with the latest slang or languages (maintenance issue)
When there is a high probability (i.e. 90% or more) you can just deny it programmatically
You can watch which category caused a flag that may or may not be profanity, and have somebody review it to teach the system whether it is or isn't profane.
For my needs, it is a public-friendly commercial service (OK, video games) where other users may/will see the username, so the design requires that it go through a profanity filter to reject offensive usernames. The sad part about this is that the classic "clbuttic" issue will most likely occur, since usernames are usually a single word (up to N characters) or sometimes multiple words concatenated... Again, Microsoft's cognitive service will not flag "Assist" as Text.HasProfanity=true but may report a high probability for one of the categories.
As the OP inquires about "a$$", here's the result when I passed it through the filter: it determined that the text is not profane, but that there is a high probability that it is, so it flags it with a recommendation for (human) review.
When the probability is high, I can either return "I'm sorry, that name is already taken" (even if it isn't), so that it is less offensive to anti-censorship persons, if we don't want to integrate human review, or return "Your username has been sent to the live operations department; you may wait for it to be reviewed and approved, or choose another username". Or whatever...
By the way, the cost of this service is quite low for my purpose (how often does a username get changed?), but again, for the OP maybe the design demands more intensive queries and it may not be ideal to pay/subscribe for ML services, or human review/interaction may not be possible. It all depends on the design... But if the design does fit the bill, perhaps this can be the OP's solution.
If interested, I can list the cons in the comment in the future.
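Once you have a good MySQL table of some bad words you want to filter (I started with one of the links in this thread), you can do something like this (note that the mysql_* functions used below were removed in PHP 7.0, so on current PHP you would use mysqli or PDO instead):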
$errors = array(); //Initialize error array (I use this with all my PHP form validations)
$SCREENNAME = mysql_real_escape_string($_POST['SCREENNAME']); //Escape the input data to prevent SQL injection when you query the profanity table.
$ProfanityCheckString = strtoupper($SCREENNAME); //Make the input string uppercase (so that 'BaDwOrD' is the same as 'BADWORD'). All your values in the profanity table will need to be UPPERCASE for this to work.
$ProfanityCheckString = preg_replace('/[_-]/','',$ProfanityCheckString); //I allow alphanumeric, underscores, and dashes...nothing else (I control this with PHP form validation). Pull out non-alphanumeric characters so 'B-A-D-W-O-R-D' shows up as 'BADWORD'.
$ProfanityCheckString = preg_replace('/1/','I',$ProfanityCheckString); //Replace common numeric representations of letters so '84DW0RD' shows up as 'BADWORD'.
$ProfanityCheckString = preg_replace('/3/','E',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/4/','A',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/5/','S',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/6/','G',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/7/','T',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/8/','B',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/0/','O',$ProfanityCheckString); //Replace ZERO's with O's (Capital letter o's).
$ProfanityCheckString = preg_replace('/Z/','S',$ProfanityCheckString); //Replace Z's with S's, another common substitution. Make sure you replace Z's with S's in your profanity database for this to work properly. Same with all the numbers too--having S3X7 in your database won't work, since this code would render that string as 'SEXT'. The profanity table should have the "rendered" version of the bad words.
$CheckProfanity = mysql_query("SELECT * FROM DATABASE.TABLE p WHERE p.WORD = '".$ProfanityCheckString."'");
if(mysql_num_rows($CheckProfanity) > 0) {$errors[] = 'Please select another Screen Name.';} //Check your profanity table for the scrubbed input. You could get real crazy using LIKE and wildcards, but I only want a simple profanity filter.
$errorString = ''; //Initialize the output string so the .= below does not hit an undefined variable.
if (count($errors) > 0) {foreach($errors as $error) {$errorString .= "<span class='PHPError'>$error</span><br /><br />";} echo $errorString;} //Echo any PHP errors that come out of the validation, including any profanity flagging.
//You can also use these lines to troubleshoot.
//echo $ProfanityCheckString;
//echo "<br />";
//echo mysql_error();
//echo "<br />";
I'm sure there is a more efficient way to do all those replacements, but I'm not smart enough to figure it out (and this seems to work okay, albeit inefficiently).
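For what it's worth, one way the replacement chain could be collapsed is a single `strtr()` call. This is only a sketch: `scrubScreenName()` is a made-up helper name, and the in-memory `$profanityTable` array is a stand-in for the real database table so the example runs on its own.

```php
<?php
/**
 * Scrub a screen name the same way as the step-by-step code above:
 * uppercase, drop underscores and dashes, then undo common substitutions.
 */
function scrubScreenName(string $name): string
{
    $name = strtoupper($name);
    $name = str_replace(['_', '-'], '', $name);
    // One strtr() call replaces the whole chain of single-character swaps.
    return strtr($name, '13456780Z', 'IEASGTBOS');
}

// Stand-in for the profanity table (hypothetical contents, stored "rendered").
$profanityTable = ['BADWORD'];

$errors = [];
if (in_array(scrubScreenName('b-4-d_w0rd'), $profanityTable, true)) {
    $errors[] = 'Please select another Screen Name.';
}
```

The three-argument form of `strtr()` maps each character of the second argument to the character at the same position in the third, so both '84DW0RD' and 'b-4-d_w0rd' come out as 'BADWORD'. With a real database, the scrubbed value would be passed to a prepared statement rather than concatenated into the query.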
I believe that you should err on the side of allowing users to register, and use humans to filter and add to your profanity table as required. Though it all depends on the cost of a false positive (okay word flagged as bad) versus a false negative (bad word gets through). That should ultimately govern how aggressive or conservative you are in your filtering strategy.
I would also be very careful if you want to use wildcards, since they can sometimes match far more than you intend.
I agree with HanClinto's post higher up in this discussion. I generally use regular expressions to string-match input text. And this is a vain effort since, as you originally mentioned, you have to explicitly account for every trick form of writing popular on the net in your "blocked" list.
On a side note, while others are debating the ethics of censorship, I must agree that some form is necessary on the web. Some people simply enjoy posting vulgarity because it can be instantly offensive to a large body of people, and requires absolutely no thought on the author's part.
Thank you for the ideas.
HanClinto rules!
Frankly, I'd let them get the "trick the system" words out and ban them instead, which is just me. But it also makes the programming simpler.
What I'd do is implement a regex filter like so: /[\s]dooby (doo?)[\s]/i or, if the word is a prefix of others, /[\s]doob(er|ed|est)[\s]/. These would prevent filtering words like assuaged, which is perfectly valid, but would also require knowledge of the other variants and updating the actual filter if you learn a new one. Obviously these are all examples, but you'd have to decide how to do it yourself.
I'm not about to type out all the words I know, not when I don't actually want to know them.
Don't. It just leads to problems. One clbuttic personal experience I have with profanity filters is the time where I was kick/banned from an IRC channel for mentioning that I was "heading over the bridge to Hancock for a couple hours" or something to that effect.
I agree with the futility of the subject, but if you have to have a filter, check out Ning's Boxwood:
Boxwood is a PHP extension for fast replacement of multiple words in a piece of text. It supports case-sensitive and case-insensitive matching. It requires that the text it operates on be encoded as UTF-8.
Also see this blog post for more details:
Fast Multiple String Replacement in PHP
With Boxwood, you can have your list of search terms be as long as you like -- the search and replace algorithm doesn't get slower with more words on the list of words to look for. It works by building a trie of all the search terms and then scans your subject text just once, walking down elements of the trie and comparing them to characters in your text. It supports US-ASCII and UTF-8, case-sensitive or insensitive matching, and has some English-centric word boundary checking logic.
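To illustrate the trie idea in plain PHP (this is not Boxwood's code or API, just a byte-wise sketch with made-up function names and no word-boundary or UTF-8 handling):

```php
<?php
/** Build a trie (nested arrays) from the search terms; '$end' marks a complete term. */
function buildTrie(array $terms): array
{
    $trie = [];
    foreach ($terms as $term) {
        $node = &$trie;
        foreach (str_split(strtolower($term)) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['$end'] = true;
        unset($node); // break the reference before the next term
    }
    return $trie;
}

/** Scan $text left to right, starring out the longest term found at each position. */
function censorWithTrie(string $text, array $trie): string
{
    $lower = strtolower($text);
    $length = strlen($text);
    $out = '';
    $i = 0;
    while ($i < $length) {
        $node = $trie;
        $matchLen = 0;
        // Walk down the trie for as long as the text keeps matching.
        for ($j = $i; $j < $length && isset($node[$lower[$j]]); $j++) {
            $node = $node[$lower[$j]];
            if (isset($node['$end'])) {
                $matchLen = $j - $i + 1; // remember the longest complete term
            }
        }
        if ($matchLen > 0) {
            $out .= str_repeat('*', $matchLen);
            $i += $matchLen;
        } else {
            $out .= $text[$i];
            $i++;
        }
    }
    return $out;
}

$trie = buildTrie(['snot', 'poop', 'poo']);
$clean = censorWithTrie('Snot and poop', $trie);
```

`$clean` ends up as `**** and ****`. The work per text position depends on the length of the longest term rather than on how many terms are in the list, which is the property described above. Without boundary checks it has the usual substring problem: `censorWithTrie('spoon', $trie)` gives `s***n`.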
I concluded that in order to create a good profanity filter we need 3 main components, or at least that is what I am going to do. They are:
The filter: a background service that verifies input against a blacklist, dictionary or something like that.
No anonymous accounts
Abuse reporting
As a bonus, somehow reward those who contribute accurate abuse reports and punish the offenders, e.g. by suspending their accounts.
Don't.
Because:
Clbuttic
Profanity is not OMG EVIL
Profanity cannot be effectively defined
Most people quite probably don't appreciate being "protected" from profanity
Edit: While I agree with the commenter who said "censorship is wrong", that is not the nature of this answer.