Related
I have been having an issue with building a search function for a while now that I'm building for a cooking blog.
In Dutch (similar to German), one can add as many compound words together to create a new word. This has been giving me a headache when wanting to include search results that include a relevant singular word inside compound words. It's kind of like a reverse Scunthorpe problem, I actually want to include certain words inside other words, but only sometimes.
For example, the word rice in Dutch is rijst. Brown rice is zilvervliesrijst and pandan rice is pandanrijst. If I want these two to pop up in search results, I have to search whether words exist inside a word, rather than whether they are the word.
However, this immediately causes issues for smaller words that can exist inside other words accidentally. For example, the word for egg is ei, while leek is prei. Onion is ui, while Brussel sprouts are spruitjes. You can see that accepting subsections of strings being matching the search strings could cause major problems.
I initially tried to grade what percentage of a word contains the search string, but this also causes issues as prei is 50% ei, while zilvervliesrijst is only about 25% rijst. This also makes using a levenshtein distance to solve this very impractical.
My current solution is as follows: I have an SQL table list of ingredients that are being used to automatically calculate the price and calorie total for each recipe based on the ingredient list, and I have used this to add all relevant synonyms to the name column. Basically, zilvervliesrijst is listed as zilvervliesrijst|rijst. I also use this to add both the plural and singular version of a term such that I will not have to test those.
However, this excludes any compound words in any place other than the ingredient list. Things such as title, cuisine, cooking equipment, dietary preferences and so on are still having this problem.
My question is this, is there a non-library-esque method that addresses this within the field of computer science? Or will I be doomed to include every single possible searchable compound word and its singular components, every time I want to add in a new recipe? I just hope that's not the case, as that will massively increase the processing time required for each additional library entry.
I think it will be hard to do this well without using a library, and probably also a dictionary (which may be bundled as part of the library).
There are really two somewhat orthogonal problems:
Splitting compound words into their constituent parts.
Identifying the stem of a simple (non-compound) word. (For example, removing plural markers and inflections.) This is often called "stemming" but that's not really the best strategy; you'll also find the rather awkward term "lemmatization".
Both of these tasks are plagued with ambiguities in all the languages I know about. (A German example, taken from an Arxiv paper describing the German-language morphological analyser DEMorphy, is "Rohrohrzucker", which means "raw cane sugar" -- Roh Rohr Zucker -- but could equally be split into Rohr Ohr Zucker, pipe-ear sugar, if there were such a thing.)
The basic outline of how these tasks can be done in reasonable time (with lots of CPU power) is:
Using ngram analysis to figure out plausible word division points.
Lemmatize each candidate component word to get plausible POS (part-of-speech) markers.
Use a trained machine-learning model (or something of that form) to reject non-sensical (or at least highly improbable) divisions.
At each step, check possible corner cases in a dictionary (of corner cases).
That's just a rough outline, of course.
I was able to find, without too much trouble, a couple of fairly recent discussions of how to do this with Dutch words. I'm not even vaguely competent to discuss the validity of these papers, so I'll leave you to do the search yourself. (I used the search query "split compound words in Dutch".) But I can tell you two things:
The problem is being worked on, but not necessarily to produce freely-available products.
If you choose to tackle it yourself, you'll end up devoting quite a lot of time to the project, although you might find it interesting. If you do succeed, you'll end up with a useful product and the beginning of a thesis (perhaps useful if you have academic ambitions).
However you choose to do it, you're best off only doing it once for each new recipe. Analyse the contents of each recipe as it is entered, to build a list of search terms which you can store in your database along with the recipe. You will probably also want to split and lemmatize search queries, but those are generally short enough that the CPU time is reasonable. Even so, consider caching the analyses in order to save time on common queries.
Many of us need to deal with user input, search queries, and situations where the input text can potentially contain profanity or undesirable language. Oftentimes this needs to be filtered out.
Where can one find a good list of swear words in various languages and dialects?
Are there APIs available to sources that contain good lists? Or maybe an API that simply says "yes this is clean" or "no this is dirty" with some parameters?
What are some good methods for catching folks trying to trick the system, like a$$, azz, or a55?
Bonus points if you offer solutions for PHP. :)
Edit: Response to answers that say simply avoid the programmatic issue:
I think there is a place for this kind of filter when, for instance, a user can use public image search to find pictures that get added to a sensitive community pool. If they can search for "penis", then they will likely get many pictures of, yep. If we don't want pictures of that, then preventing the word as a search term is a good gatekeeper, though admittedly not a foolproof method. Getting the list of words in the first place is the real question.
So I'm really referring to a way to figure out of a single token is dirty or not and then simply disallow it. I'd not bother preventing a sentiment like the totally hilarious "long necked giraffe" reference. Nothing you can do there. :)
Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
Also, one can't forget The Untold History of Toontown's SpeedChat, where even using a "safe-word whitelist" resulted in a 14-year-old quickly circumventing it with:
"I want to stick my long-necked Giraffe up your fluffy white bunny."
Bottom line: Ultimately, for any system that you implement, there is absolutely no substitute for human review (whether peer or otherwise). Feel free to implement a rudimentary tool to get rid of the drive-by's, but for the determined troll, you absolutely must have a non-algorithm-based approach.
A system that removes anonymity and introduces accountability (something that Stack Overflow does well) is helpful also, particularly in order to help combat John Gabriel's G.I.F.T.
You also asked where you can get profanity lists to get you started -- one open-source project to check out is Dansguardian -- check out the source code for their default profanity lists. There is also an additional third party Phrase List that you can download for the proxy that may be a helpful gleaning point for you.
Edit in response to the question edit: Thanks for the clarification on what you're trying to do. In that case, if you're just trying to do a simple word filter, there are two ways you can do it. One is to create a single long regexp with all of the banned phrases that you want to censor, and merely do a regex find/replace with it. A regex like:
$filterRegex = "(boogers|snot|poop|shucks|argh)"
and run it on your input string using preg_match() to wholesale test for a hit,
or preg_replace() to blank them out.
You can also load those functions up with arrays rather than a single long regex, and for long word lists, it may be more manageable. See the preg_replace() for some good examples as to how arrays can be used flexibly.
For additional PHP programming examples, see this page for a somewhat advanced generic class for word filtering that *'s out the center letters from censored words, and this previous Stack Overflow question that also has a PHP example (the main valuable part in there is the SQL-based filtered word approach -- the leet-speak compensator can be dispensed with if you find it unnecessary).
You also added: "Getting the list of words in the first place is the real question." -- in addition to some of the previous Dansgaurdian links, you may find this handy .zip of 458 words to be helpful.
Whilst I know that this question is fairly old, but it's a commonly occurring question...
There is both a reason and a distinct need for profanity filters (see Wikipedia entry here), but they often fall short of being 100% accurate for very distinct reasons; Context and accuracy.
It depends (wholly) on what you're trying to achieve - at it's most basic, you're probably trying to cover the "seven dirty words" and then some... Some businesses need to filter the most basic of profanity: basic swear words, URLs or even personal information and so on, but others need to prevent illicit account naming (Xbox live is an example) or far more...
User generated content doesn't just contain potential swear words, it can also contain offensive references to:
Sexual acts
Sexual orientation
Religion
Ethnicity
Etc...
And potentially, in multiple languages. Shutterstock has developed basic dirty-words lists in 10 languages to date, but it's still basic and very much oriented towards their 'tagging' needs. There are a number of other lists available on the web.
I agree with the accepted answer that it's not a defined science and as language is a continually evolving challenge but one where a 90% catch rate is better than 0%. It depends purely on your goals - what you're trying to achieve, the level of support you have and how important it is to remove profanities of different types.
In building a filter, you need to consider the following elements and how they relate to your project:
Words/phrases
Acronyms (FOAD/LMFAO etc)
False positives (words, places and names like 'mishit', 'scunthorpe' and 'titsworth')
URLs (porn sites are an obvious target)
Personal information (email, address, phone etc - if applicable)
Language choice (usually English by default)
Moderation (how, if at all, you can interact with user generated content and what you can do with it)
You can easily build a profanity filter that captures 90%+ of profanities, but you'll never hit 100%. It's just not possible. The closer you want to get to 100%, the harder it becomes... Having built a complex profanity engine in the past that dealt with more than 500K realtime messages per day, I'd offer the following advice:
A basic filter would involve:
Building a list of applicable profanities
Developing a method of dealing with derivations of profanities
A moderately complex filer would involve, (In addition to a basic filter):
Using complex pattern matching to deal with extended derivations (using advanced regex)
Dealing with Leetspeak (l33t)
Dealing with false positives
A complex filter would involve a number of the following (In addition to a moderate filter):
Whitelists and blacklists
Naive bayesian inference filtering of phrases/terms
Soundex functions (where a word sounds like another)
Levenshtein distance
Stemming
Human moderators to help guide a filtering engine to learn by example or where matches aren't accurate enough without guidance (a self/continually-improving system)
Perhaps some form of AI engine
I don't know of any good libraries for this, but whatever you do, make sure that you err in the direction of letting stuff through. I've dealt with systems that wouldn't allow me to use "mpassell" as a username, because it contains "ass" as a substring. That's a great way to alienate users!
During a job interview of mine, the company CTO who was interviewing me tried out a word/web game I wrote in Java. Out of a word list of the entire Oxford English dictionary, what was the first word that came up to be guessed?
Of course, the most foul word in the English language.
Somehow, I still got the job offer, but I then tracked down a profanity word list (not unlike this one) and wrote a quick script to generate a new dictionary without all of the bad words (without even having to look at the list).
For your particular case, I think comparing the search to real words sounds like the way to go with a word list like that. The alternative styles/punctuation require a bit more work, but I doubt users will use that often enough to be an issue.
a profanity filtering system will never be perfect, even if the programmer is cocksure and keeps abreast of all nude developments
that said, any list of 'naughty words' is likely to perform as well as any other list, since the underlying problem is language understanding which is pretty much intractable with current technology
so, the only practical solution is twofold:
be prepared to update your dictionary frequently
hire a human editor to correct false positives (e.g. "clbuttic" instead of "classic") and false negatives (oops! missed one!)
The only way to prevent offensive user input is to prevent all user input.
If you insist on allowing user input and need moderation, then incorporate human moderators.
Have a look at CDYNE's Profanity Filter Web Service
Testing URL
Beware of localization issues: what is a swearword in one language might be a perfectly normal word in another.
One current example of this: ebay uses a dictionary approach to filter "bad words" from feedback. If you try to enter the german translation of "this was a perfect transaction" ("das war eine perfekte Transaktion"), ebay will reject the feedback due to bad words.
Why? Because the german word for "was" is "war", and "war" is in ebay dictionary of "bad words".
So beware of localisation issues.
Regarding your "trick the system" subquestion, you can handle that by normalizing both the "bad word" list and the user-entered text before doing your search. e.g., Use a series of regexes (or tr if PHP has it) to convert [z$5] to "s", [4#] to "a", etc., then compare the normalized "bad word" list against the normalized text. Note that the normalization could potentially lead to additional false positives, although I can't think of any actual cases at the moment.
The larger challenge is to come up with something that will let people quote "The pen is mightier than the sword" while blocking "p e n i s".
I collected 2200 bad words in 12 languages: en, ar, cs, da, de, eo, es, fa, fi, fr, hi, hu, it, ja, ko, nl, no, pl, pt, ru, sv, th, tlh, tr, zh.
MySQL dump, JSON, XML or CSV options are available.
https://github.com/turalus/openDB
I'd suggest you to execute this SQL into your DB and check everytime when user inputs something.
If you can do something like Digg/Stackoverflow where the users can downvote/mark obscene content... do so.
Then all you need to do is review the "naughty" users, and block them if they break the rules.
I'm a little late to the party, but I have a solution that might work for some who read this. It's in javascript instead of php, but there's a valid reason for it.
Full disclosure, I wrote this plugin...
Anyways.
The approach I've gone with is to allow a user to "Opt-In" to their profanity filtering. Basically profanity will be allowed by default, but if my users don't want to read it, they don't have to. This also helps with the "l33t sp3#k" issue.
The concept is a simple jquery plugin that gets injected by the server if the client's account is enabling profanity filtering. From there, it's just a couple simple lines that blot out the swears.
Here's the demo page
https://chaseflorell.github.io/jQuery.ProfanityFilter/demo/
<div id="foo">
ass will fail but password will not
</div>
<script>
// code:
$('#foo').profanityFilter({
customSwears: ['ass']
});
</script>
result
*** will fail but password will not
Also late in the game, but doing some researches and stumbled across here. As others have mentioned, it's just almost close to impossible if it was automated, but if your design/requirement can involve in some cases (but not all the time) human interactions to review whether it is profane or not, you may consider ML. https://learn.microsoft.com/en-us/azure/cognitive-services/content-moderator/text-moderation-api#profanity is my current choice right now for multiple reasons:
Supports many localization
They keep updating the database, so I don't have to keep up with latest slangs or languages (maintenance issue)
When there is a high probability (I.e. 90% or more) you can just deny it pragmatically
You can observe for category which causes a flag that may or may not be profanity, and can have somebody review it to teach that it is or isn't profane.
For my need, it was/is based on public-friendly commercial service (OK, videogames) which other users may/will see the username, but the design requires that it has to go through profanity filter to reject offensive username. The sad part about this is the classic "clbuttic" issue will most likely occur since usernames are usually single word (up to N characters) of sometimes multiple words concatenated... Again, Microsoft's cognitive service will not flag "Assist" as Text.HasProfanity=true but may flag one of the categories probability to be high.
As the OP inquires, what about "a$$", here's a result when I passed it through the filter:, as you can see, it has determined it's not profane, but it has high probability that it is, so flags as recommendations of reviewing (human interactions).
When probability is high, I can either return back "I'm sorry, that name is already taken" (even if it isn't) so that it is less offensive to anti-censorship persons or something, if we don't want to integrate human review, or return "Your username have been notified to the live operation department, you may wait for your username to be reviewed and approved or chose another username". Or whatever...
By the way, the cost/price for this service is quite low for my purpose (how often does the username gets changed?), but again, for OP maybe the design demands more intensive queries and may not be ideal to pay/subscribe for ML-services, or cannot have human-review/interactions. It all depends on the design... But if design does fit the bill, perhaps this can be OP's solution.
If interested, I can list the cons in the comment in the future.
Once you have a good MYSQL table of some bad words you want to filter (I started with one of the links in this thread), you can do something like this:
$errors = array(); //Initialize error array (I use this with all my PHP form validations)
$SCREENNAME = mysql_real_escape_string($_POST['SCREENNAME']); //Escape the input data to prevent SQL injection when you query the profanity table.
$ProfanityCheckString = strtoupper($SCREENNAME); //Make the input string uppercase (so that 'BaDwOrD' is the same as 'BADWORD'). All your values in the profanity table will need to be UPPERCASE for this to work.
$ProfanityCheckString = preg_replace('/[_-]/','',$ProfanityCheckString); //I allow alphanumeric, underscores, and dashes...nothing else (I control this with PHP form validation). Pull out non-alphanumeric characters so 'B-A-D-W-O-R-D' shows up as 'BADWORD'.
$ProfanityCheckString = preg_replace('/1/','I',$ProfanityCheckString); //Replace common numeric representations of letters so '84DW0RD' shows up as 'BADWORD'.
$ProfanityCheckString = preg_replace('/3/','E',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/4/','A',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/5/','S',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/6/','G',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/7/','T',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/8/','B',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/0/','O',$ProfanityCheckString); //Replace ZERO's with O's (Capital letter o's).
$ProfanityCheckString = preg_replace('/Z/','S',$ProfanityCheckString); //Replace Z's with S's, another common substitution. Make sure you replace Z's with S's in your profanity database for this to work properly. Same with all the numbers too--having S3X7 in your database won't work, since this code would render that string as 'SEXY'. The profanity table should have the "rendered" version of the bad words.
$CheckProfanity = mysql_query("SELECT * FROM DATABASE.TABLE p WHERE p.WORD = '".$ProfanityCheckString."'");
if(mysql_num_rows($CheckProfanity) > 0) {$errors[] = 'Please select another Screen Name.';} //Check your profanity table for the scrubbed input. You could get real crazy using LIKE and wildcards, but I only want a simple profanity filter.
if (count($errors) > 0) {foreach($errors as $error) {$errorString .= "<span class='PHPError'>$error</span><br /><br />";} echo $errorString;} //Echo any PHP errors that come out of the validation, including any profanity flagging.
//You can also use these lines to troubleshoot.
//echo $ProfanityCheckString;
//echo "<br />";
//echo mysql_error();
//echo "<br />";
I'm sure there is a more efficient way to do all those replacements, but I'm not smart enough to figure it out (and this seems to work okay, albeit inefficiently).
I believe that you should err on the side of allowing users to register, and use humans to filter and add to your profanity table as required. Though it all depends on the cost of a false positive (okay word flagged as bad) versus a false negative (bad word gets through). That should ultimately govern how aggressive or conservative you are in your filtering strategy.
I would also be very careful if you want to use wildcards, since they can sometimes behave more onerously than you intend.
I agree with HanClinto's post higher up in this discussion. I generally use regular expressions to string-match input text. And this is a vain effort, as, like you originally mentioned you have to explicitly account for every trick form of writing popular on the net in your "blocked" list.
On a side note, while others are debating the ethics of censorship, I must agree that some form is necessary on the web. Some people simply enjoy posting vulgarity because it can be instantly offensive to a large body of people, and requires absolutely no thought on the author's part.
Thank you for the ideas.
HanClinto rules!
Frankly, I'd let them get the "trick the system" words out and ban them instead, which is just me. But it also makes the programming simpler.
What I'd do is implement a regex filter like so: /[\s]dooby (doo?)[\s]/i or it the word is prefixed on others, /[\s]doob(er|ed|est)[\s]/. These would prevent filtering words like assuaged, which is perfectly valid, but would also require knowledge of the other variants and updating the actual filter if you learn a new one. Obviously these are all examples, but you'd have to decide how to do it yourself.
I'm not about to type out all the words I know, not when I don't actually want to know them.
Don't. It just leads to problems. One clbuttic personal experience I have with profanity filters is the time where I was kick/banned from an IRC channel for mentioning that I was "heading over the bridge to Hancock for a couple hours" or something to that effect.
I agree with the futility of the subject, but if you have to have a filter, check out Ning's Boxwood:
Boxwood is a PHP extension for fast replacement of multiple words in a piece of text. It supports case-sensitive and case-insensitive matching. It requires that the text it operates on be encoded as UTF-8.
Also see this blog post for more details:
Fast Multiple String Replacement in PHP
With Boxwood, you can have your list of search terms be as long as you like -- the search and replace algorithm doesn't get slower with more words on the list of words to look for. It works by building a trie of all the search terms and then scans your subject text just once, walking down elements of the trie and comparing them to characters in your text. It supports US-ASCII and UTF-8, case-sensitive or insensitive matching, and has some English-centric word boundary checking logic.
I concluded, in order to create a good profanity filter we need 3 main components, or at least it is what I am going to do. These they are:
The filter: a background service that verify against a blacklist, dictionary or something like that.
Not allow anonymous account
Report abuse
A bonus, it will be to reward somehow those who contribute with accurate abuse reporters and punish the offender, e.g. suspend their accounts.
Don't.
Because:
Clbuttic
Profanity is not OMG EVIL
Profanity cannot be effectively defined
Most people quite probably don't appreciate being "protected" from profanity
Edit: While I agree with the commenter who said "censorship is wrong", that is not the nature of this answer.
When searching the db with terms that retrieve no results I want to allow "did you mean..." suggestion (like Google).
So for example if someone looks for "jquyer"
", it would output "did you mean jquery?"
Of course, suggestion results have to be matched against the values inside the db (i'm using mysql).
Do you know a library that can do this? I've googled this but haven't found any great results.
Or perhaps you have an idea how to construct this on my own?
A quick and easy solution involves SOUNDEX or SOUNDEX-like functions.
In a nutshell the SOUNDEX function was originally used to deal with common typos and alternate spellings for family names, and this function, encapsulates very well many common spelling mistakes (in the english language). Because of its focus on family names, the original soundex function may be limiting (for example encoding stops after the third or fourth non-repeating consonant letter), but it is easy to expend the algorithm.
The interest of this type of function is that it allows computing, ahead of time, a single value which can be associated with the word. This is unlike string distance functions such as edit distance functions (such as Levenshtein, Hamming or even Ratcliff/Obershelp) which provide a value relative to a pair of strings.
By pre-computing and indexing the SOUNDEX value for all words in the dictionary, one can, at run-time, quickly search the dictionary/database based on the [run-time] calculated SOUNDEX value of the user-supplied search terms. This Soundex search can be done systematically, as complement to the plain keyword search, or only performed when the keyword search didn't yield a satisfactory number of records, hence providing the hint that maybe the user-supplied keyword(s) is (are) misspelled.
A totally different approach, only applicable on user queries which include several words, is based on running multiple queries against the dictionary/database, excluding one (or several) of the user-supplied keywords. These alternate queries' result lists provide a list of distinct words; This [reduced] list of words is typically small enough that pair-based distance functions can be applied to select, within the list, the words which are closer to the allegedly misspelled word(s). The word frequency (within the results lists) can be used to both limit the number of words (only evaluate similarity for the words which are found more than x times), as well as to provide weight, to slightly skew the similarity measurements (i.e favoring words found "in quantity" in the database, even if their similarity measurement is slightly less).
How about the levenshtein function, or similar_text function?
Actually, I believe Google's "did you mean" function is generated by what users type in after they've made a typo. However, that's obviously a lot easier for them since they have unbelievable amounts of data.
You could use Levenshtein distance as mgroves suggested (or Soundex), but store results in a database. Or, run separate scripts based on common misspellings and your most popular misspelled search terms.
http://www.phpclasses.org/browse/package/4859.html
Here's an off-the-shelf class that's rather easy to implement, which employs minimum edit distance. All you need to do is have a token (not type) list of all the words you want to work with handy. My suggestion is to make sure it's the complete list of words within your search index, and only within your search index. This helps in two ways:
Domain specificity helps avoid misleading probabilities from overtaking your implementation
Ex: "Memoize" may be spell-corrected to "Memorize" for most off-the-shelf, dictionaries, but that's a perfectly good search term for a computer science page.
Proper nouns that are available within your search index are now accounted for.
Ex: If you're Dell, and someone searches for 'inspiran', there's absolutely no chance the spell-correct function will know you mean 'inspiron'. It will probably spell-correct to 'inspiring' or something more common, and, again, less domain-specific.
When I did this a couple of years ago, I already had a custom built index of words that the search engine used. I studied what kinds of errors people made the most (based on logs) and sorted the suggestions based on how common the mistake was.
If someone searched for jQuery, I would build a select-statement that went
SELECT Word, 1 AS Relevance
FROM keywords
WHERE Word IN ('qjuery','juqery','jqeury' etc)
UNION
SELECT Word, 2 AS Relevance
FROM keywords
WHERE Word LIKE 'j_query' OR Word LIKE 'jq_uery' etc etc
ORDER BY Relevance, Word
The resulting words were my suggestions and it worked really well.
You should keep track of common misspellings that come through your search (or generate some yourself with a typo generator) and store the misspelling and the word it matches in a database. Then, when you have nothing matching any search results, you can check against the misspelling table, and use the suggested word.
Writing your own custom solution will take quite some time and is not guaranteed to work if your dataset isn't big enough, so I'd recommend using an API from a search giant such as Yahoo. Yahoo's results aren't as good as Google's but I'm not sure whether Google's is meant to be public.
You can simply use an Api like this one https://www.mashape.com/marrouchi/did-you-mean
I have a constantly growing database of keywords. I need to parse incoming text inputs (articles, feeds etc) and find which keywords from the database are present in the text. The database of keywords is much larger than the text.
Since the database is constantly growing (users add more and more keywords to watch for), I figure the best option will be to break the text input into words and compare those against the database. My main dilemma is implementing this comparison scheme (PHP and MySQL will be used for this project).
The most naive implementation would be to create a simple SELECT query against the keywords table, with a giant IN clause listing all the found keywords.
SELECT user_id,keyword FROM keywords WHERE keyword IN ('keyword1','keyword2',...,'keywordN');
Another approach would be to create a hash-table in memory (using something like memcache) and to check against it in the same manner.
Does anyone has any experience with this kind of searching and has any suggestions on how to better implement this? I haven't tried yet any of those approaches, I'm just gathering ideas at this point.
The classic way of searching a text stream for multiple keywords is the Aho-Corasick finite automaton, which uses time linear in the text to be searched. You'll want minor adaptations to recognize strings only on word boundaries, or perhaps it would be simpler just to check the keywords found and make sure they are not embedded in larger words.
You can find an implementation in fgrep. Even better, Preston Briggs wrote a pretty nice implementation in C that does exactly the kind of keyword search you are talking about. (It searches programs for occurrences of 'interesting' identifiers'.) Preston's implementation is distributed as part of the Noweb literate-programming tool. You could find a way to call this code from PHP or you could rewrite it in PHP---the recognize itself is about 220 lines of C, and the main program is another 135 lines.
All the proposed solutions, including Aho-Corasick, have these properties in common:
A preprocessing step that takes time and space proportional to the number of keywords in the database.
A search step that takes time and space proportional to the length of the text plus the number of keywords found.
Aho-Corasick offers considerably better constants of proportionality on the search step, but if your texts are small, this won't matter. In fact, if your texts are small and your database is large, you probably want to minimize the amount of memory used in the preprocessing step. Andrew Appel's DAWG data structure from the world's fastest scrabble program will probably do the trick.
In general,
break the text into words
b. convert words back to canonical root form
c. drop common conjunction words
d. strip duplicates
insert the words into a temporary table then do an inner join against the keywords table,
or (as you suggested) build the keywords into a complex query criteria
It may be worthwhile to cache a 3- or 4-letter hash array with which to pre-filter potential keywords; you will have to experiment to find the best tradeoff between memory size and effectiveness.
I'm not 100% clear on what you're asking, but maybe what you're looking for is an inverted index?
Update:
You can use an inverted index to match multiple keywords at once.
Split up the new document into tokens, and insert the tokens paired with an identifier for the document into the inverted index table. A (rather denormalized) inverted index table:
inverted_index
-----
document_id keyword
If you're searching for 3 keywords manually:
select document_id, count(*) from inverted_index
where keyword in (keyword1, keyword2, keyword3)
group by document_id
having count(*) = 3
If you have a table of the keywords you care about, just use an inner join rather than an in() operation:
keyword_table
----
keyword othercols
select keyword_table.keyword, keyword_table.othercols from inverted_index
inner join keyword_table on keyword_table.keyword=inverted_index.keyword
where inverted_index.document_id=id_of_some_new_document
is any of this closer to what you want?
Have you considered graduating to a fulltext solution such as Sphinx?
I'm talking out of my hat here, because I haven't used it myself. But it's getting a lot of attention as a high-speed fulltext search solution. It will probably scale better than any relational solution you use.
Here's a blog about using Sphinx as a fulltext search solution in MySQL.
I would do 2 things here.
First (and this isn't directly related to the question) I'd break up and partition user keywords by users. Having more tables with fewer data, ideally on different servers for distributed lookups where slices or ranges of users exist on different slices. Aka, all of usera's data exists on slice one, userb on slice two, etc.
Second, I'd have some sort of in-memory hash table to determine existence of keywords. This would likely be federated as well to distribute the lookups. For n keyword-existence servers, hash the keyword and mod it by n then distribute ranges of those keys across all of the memcached servers. This quick way lets you say is keyword x being watched, hash it and determine what server it would live on. Then make the lookup and collect/aggregate keywords being tracked.
At that point you'll at least know which keywords are being tracked and you can take your user slices and perform subsequent lookups to determine which users are tracking which keywords.
In short: SQL is not an ideal solution here.
I hacked up some code for scanning for multiple keywords using a dawg (as suggested above referencing the Scrabble paper) although I wrote it from first principles and I don't know whether it is anything like the AHO algorithm or not.
http://www.gtoal.com/wordgames/spell/multiscan.c.html
A friend made some hacks to my code after I first posted it on the wordgame programmers mailing list, and his version is probably more efficient:
http://www.gtoal.com/wordgames/spell/multidawg.c.html
Scales fairly well...
G
Many of us need to deal with user input, search queries, and situations where the input text can potentially contain profanity or undesirable language. Oftentimes this needs to be filtered out.
Where can one find a good list of swear words in various languages and dialects?
Are there APIs available to sources that contain good lists? Or maybe an API that simply says "yes this is clean" or "no this is dirty" with some parameters?
What are some good methods for catching folks trying to trick the system, like a$$, azz, or a55?
Bonus points if you offer solutions for PHP. :)
Edit: Response to answers that say simply avoid the programmatic issue:
I think there is a place for this kind of filter when, for instance, a user can use public image search to find pictures that get added to a sensitive community pool. If they can search for "penis", then they will likely get many pictures of, yep. If we don't want pictures of that, then preventing the word as a search term is a good gatekeeper, though admittedly not a foolproof method. Getting the list of words in the first place is the real question.
So I'm really referring to a way to figure out of a single token is dirty or not and then simply disallow it. I'd not bother preventing a sentiment like the totally hilarious "long necked giraffe" reference. Nothing you can do there. :)
Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
Also, one can't forget The Untold History of Toontown's SpeedChat, where even using a "safe-word whitelist" resulted in a 14-year-old quickly circumventing it with:
"I want to stick my long-necked Giraffe up your fluffy white bunny."
Bottom line: Ultimately, for any system that you implement, there is absolutely no substitute for human review (whether peer or otherwise). Feel free to implement a rudimentary tool to get rid of the drive-by's, but for the determined troll, you absolutely must have a non-algorithm-based approach.
A system that removes anonymity and introduces accountability (something that Stack Overflow does well) is helpful also, particularly in order to help combat John Gabriel's G.I.F.T.
You also asked where you can get profanity lists to get you started -- one open-source project to check out is Dansguardian -- check out the source code for their default profanity lists. There is also an additional third party Phrase List that you can download for the proxy that may be a helpful gleaning point for you.
Edit in response to the question edit: Thanks for the clarification on what you're trying to do. In that case, if you're just trying to do a simple word filter, there are two ways you can do it. One is to create a single long regexp with all of the banned phrases that you want to censor, and merely do a regex find/replace with it. A regex like:
$filterRegex = "(boogers|snot|poop|shucks|argh)"
and run it on your input string using preg_match() to wholesale test for a hit,
or preg_replace() to blank them out.
You can also load those functions up with arrays rather than a single long regex, and for long word lists, it may be more manageable. See the preg_replace() for some good examples as to how arrays can be used flexibly.
For additional PHP programming examples, see this page for a somewhat advanced generic class for word filtering that *'s out the center letters from censored words, and this previous Stack Overflow question that also has a PHP example (the main valuable part in there is the SQL-based filtered word approach -- the leet-speak compensator can be dispensed with if you find it unnecessary).
You also added: "Getting the list of words in the first place is the real question." -- in addition to some of the previous Dansgaurdian links, you may find this handy .zip of 458 words to be helpful.
Whilst I know that this question is fairly old, but it's a commonly occurring question...
There is both a reason and a distinct need for profanity filters (see Wikipedia entry here), but they often fall short of being 100% accurate for very distinct reasons; Context and accuracy.
It depends (wholly) on what you're trying to achieve - at it's most basic, you're probably trying to cover the "seven dirty words" and then some... Some businesses need to filter the most basic of profanity: basic swear words, URLs or even personal information and so on, but others need to prevent illicit account naming (Xbox live is an example) or far more...
User generated content doesn't just contain potential swear words, it can also contain offensive references to:
Sexual acts
Sexual orientation
Religion
Ethnicity
Etc...
And potentially, in multiple languages. Shutterstock has developed basic dirty-words lists in 10 languages to date, but it's still basic and very much oriented towards their 'tagging' needs. There are a number of other lists available on the web.
I agree with the accepted answer that it's not a defined science and as language is a continually evolving challenge but one where a 90% catch rate is better than 0%. It depends purely on your goals - what you're trying to achieve, the level of support you have and how important it is to remove profanities of different types.
In building a filter, you need to consider the following elements and how they relate to your project:
Words/phrases
Acronyms (FOAD/LMFAO etc)
False positives (words, places and names like 'mishit', 'scunthorpe' and 'titsworth')
URLs (porn sites are an obvious target)
Personal information (email, address, phone etc - if applicable)
Language choice (usually English by default)
Moderation (how, if at all, you can interact with user generated content and what you can do with it)
You can easily build a profanity filter that captures 90%+ of profanities, but you'll never hit 100%. It's just not possible. The closer you want to get to 100%, the harder it becomes... Having built a complex profanity engine in the past that dealt with more than 500K realtime messages per day, I'd offer the following advice:
A basic filter would involve:
Building a list of applicable profanities
Developing a method of dealing with derivations of profanities
A moderately complex filer would involve, (In addition to a basic filter):
Using complex pattern matching to deal with extended derivations (using advanced regex)
Dealing with Leetspeak (l33t)
Dealing with false positives
A complex filter would involve a number of the following (In addition to a moderate filter):
Whitelists and blacklists
Naive bayesian inference filtering of phrases/terms
Soundex functions (where a word sounds like another)
Levenshtein distance
Stemming
Human moderators to help guide a filtering engine to learn by example or where matches aren't accurate enough without guidance (a self/continually-improving system)
Perhaps some form of AI engine
I don't know of any good libraries for this, but whatever you do, make sure that you err in the direction of letting stuff through. I've dealt with systems that wouldn't allow me to use "mpassell" as a username, because it contains "ass" as a substring. That's a great way to alienate users!
During a job interview of mine, the company CTO who was interviewing me tried out a word/web game I wrote in Java. Out of a word list of the entire Oxford English dictionary, what was the first word that came up to be guessed?
Of course, the most foul word in the English language.
Somehow, I still got the job offer, but I then tracked down a profanity word list (not unlike this one) and wrote a quick script to generate a new dictionary without all of the bad words (without even having to look at the list).
For your particular case, I think comparing the search to real words sounds like the way to go with a word list like that. The alternative styles/punctuation require a bit more work, but I doubt users will use that often enough to be an issue.
a profanity filtering system will never be perfect, even if the programmer is cocksure and keeps abreast of all nude developments
that said, any list of 'naughty words' is likely to perform as well as any other list, since the underlying problem is language understanding which is pretty much intractable with current technology
so, the only practical solution is twofold:
be prepared to update your dictionary frequently
hire a human editor to correct false positives (e.g. "clbuttic" instead of "classic") and false negatives (oops! missed one!)
The only way to prevent offensive user input is to prevent all user input.
If you insist on allowing user input and need moderation, then incorporate human moderators.
Have a look at CDYNE's Profanity Filter Web Service
Testing URL
Beware of localization issues: what is a swearword in one language might be a perfectly normal word in another.
One current example of this: ebay uses a dictionary approach to filter "bad words" from feedback. If you try to enter the german translation of "this was a perfect transaction" ("das war eine perfekte Transaktion"), ebay will reject the feedback due to bad words.
Why? Because the german word for "was" is "war", and "war" is in ebay dictionary of "bad words".
So beware of localisation issues.
Regarding your "trick the system" subquestion, you can handle that by normalizing both the "bad word" list and the user-entered text before doing your search. e.g., Use a series of regexes (or tr if PHP has it) to convert [z$5] to "s", [4#] to "a", etc., then compare the normalized "bad word" list against the normalized text. Note that the normalization could potentially lead to additional false positives, although I can't think of any actual cases at the moment.
The larger challenge is to come up with something that will let people quote "The pen is mightier than the sword" while blocking "p e n i s".
I collected 2200 bad words in 12 languages: en, ar, cs, da, de, eo, es, fa, fi, fr, hi, hu, it, ja, ko, nl, no, pl, pt, ru, sv, th, tlh, tr, zh.
MySQL dump, JSON, XML or CSV options are available.
https://github.com/turalus/openDB
I'd suggest you to execute this SQL into your DB and check everytime when user inputs something.
If you can do something like Digg/Stackoverflow where the users can downvote/mark obscene content... do so.
Then all you need to do is review the "naughty" users, and block them if they break the rules.
I'm a little late to the party, but I have a solution that might work for some who read this. It's in javascript instead of php, but there's a valid reason for it.
Full disclosure, I wrote this plugin...
Anyways.
The approach I've gone with is to allow a user to "Opt-In" to their profanity filtering. Basically profanity will be allowed by default, but if my users don't want to read it, they don't have to. This also helps with the "l33t sp3#k" issue.
The concept is a simple jquery plugin that gets injected by the server if the client's account is enabling profanity filtering. From there, it's just a couple simple lines that blot out the swears.
Here's the demo page
https://chaseflorell.github.io/jQuery.ProfanityFilter/demo/
<div id="foo">
ass will fail but password will not
</div>
<script>
// code:
$('#foo').profanityFilter({
customSwears: ['ass']
});
</script>
result
*** will fail but password will not
Also late in the game, but doing some researches and stumbled across here. As others have mentioned, it's just almost close to impossible if it was automated, but if your design/requirement can involve in some cases (but not all the time) human interactions to review whether it is profane or not, you may consider ML. https://learn.microsoft.com/en-us/azure/cognitive-services/content-moderator/text-moderation-api#profanity is my current choice right now for multiple reasons:
Supports many localization
They keep updating the database, so I don't have to keep up with latest slangs or languages (maintenance issue)
When there is a high probability (I.e. 90% or more) you can just deny it pragmatically
You can observe for category which causes a flag that may or may not be profanity, and can have somebody review it to teach that it is or isn't profane.
For my need, it was/is based on public-friendly commercial service (OK, videogames) which other users may/will see the username, but the design requires that it has to go through profanity filter to reject offensive username. The sad part about this is the classic "clbuttic" issue will most likely occur since usernames are usually single word (up to N characters) of sometimes multiple words concatenated... Again, Microsoft's cognitive service will not flag "Assist" as Text.HasProfanity=true but may flag one of the categories probability to be high.
As the OP inquires, what about "a$$", here's a result when I passed it through the filter:, as you can see, it has determined it's not profane, but it has high probability that it is, so flags as recommendations of reviewing (human interactions).
When probability is high, I can either return back "I'm sorry, that name is already taken" (even if it isn't) so that it is less offensive to anti-censorship persons or something, if we don't want to integrate human review, or return "Your username have been notified to the live operation department, you may wait for your username to be reviewed and approved or chose another username". Or whatever...
By the way, the cost/price for this service is quite low for my purpose (how often does the username gets changed?), but again, for OP maybe the design demands more intensive queries and may not be ideal to pay/subscribe for ML-services, or cannot have human-review/interactions. It all depends on the design... But if design does fit the bill, perhaps this can be OP's solution.
If interested, I can list the cons in the comment in the future.
Once you have a good MYSQL table of some bad words you want to filter (I started with one of the links in this thread), you can do something like this:
$errors = array(); //Initialize error array (I use this with all my PHP form validations)
$SCREENNAME = mysql_real_escape_string($_POST['SCREENNAME']); //Escape the input data to prevent SQL injection when you query the profanity table.
$ProfanityCheckString = strtoupper($SCREENNAME); //Make the input string uppercase (so that 'BaDwOrD' is the same as 'BADWORD'). All your values in the profanity table will need to be UPPERCASE for this to work.
$ProfanityCheckString = preg_replace('/[_-]/','',$ProfanityCheckString); //I allow alphanumeric, underscores, and dashes...nothing else (I control this with PHP form validation). Pull out non-alphanumeric characters so 'B-A-D-W-O-R-D' shows up as 'BADWORD'.
$ProfanityCheckString = preg_replace('/1/','I',$ProfanityCheckString); //Replace common numeric representations of letters so '84DW0RD' shows up as 'BADWORD'.
$ProfanityCheckString = preg_replace('/3/','E',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/4/','A',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/5/','S',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/6/','G',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/7/','T',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/8/','B',$ProfanityCheckString);
$ProfanityCheckString = preg_replace('/0/','O',$ProfanityCheckString); //Replace ZERO's with O's (Capital letter o's).
$ProfanityCheckString = preg_replace('/Z/','S',$ProfanityCheckString); //Replace Z's with S's, another common substitution. Make sure you replace Z's with S's in your profanity database for this to work properly. Same with all the numbers too--having S3X7 in your database won't work, since this code would render that string as 'SEXY'. The profanity table should have the "rendered" version of the bad words.
$CheckProfanity = mysql_query("SELECT * FROM DATABASE.TABLE p WHERE p.WORD = '".$ProfanityCheckString."'");
if(mysql_num_rows($CheckProfanity) > 0) {$errors[] = 'Please select another Screen Name.';} //Check your profanity table for the scrubbed input. You could get real crazy using LIKE and wildcards, but I only want a simple profanity filter.
if (count($errors) > 0) {foreach($errors as $error) {$errorString .= "<span class='PHPError'>$error</span><br /><br />";} echo $errorString;} //Echo any PHP errors that come out of the validation, including any profanity flagging.
//You can also use these lines to troubleshoot.
//echo $ProfanityCheckString;
//echo "<br />";
//echo mysql_error();
//echo "<br />";
I'm sure there is a more efficient way to do all those replacements, but I'm not smart enough to figure it out (and this seems to work okay, albeit inefficiently).
I believe that you should err on the side of allowing users to register, and use humans to filter and add to your profanity table as required. Though it all depends on the cost of a false positive (okay word flagged as bad) versus a false negative (bad word gets through). That should ultimately govern how aggressive or conservative you are in your filtering strategy.
I would also be very careful if you want to use wildcards, since they can sometimes behave more onerously than you intend.
I agree with HanClinto's post higher up in this discussion. I generally use regular expressions to string-match input text. And this is a vain effort, as, like you originally mentioned you have to explicitly account for every trick form of writing popular on the net in your "blocked" list.
On a side note, while others are debating the ethics of censorship, I must agree that some form is necessary on the web. Some people simply enjoy posting vulgarity because it can be instantly offensive to a large body of people, and requires absolutely no thought on the author's part.
Thank you for the ideas.
HanClinto rules!
Frankly, I'd let them get the "trick the system" words out and ban them instead, which is just me. But it also makes the programming simpler.
What I'd do is implement a regex filter like so: /[\s]dooby (doo?)[\s]/i or it the word is prefixed on others, /[\s]doob(er|ed|est)[\s]/. These would prevent filtering words like assuaged, which is perfectly valid, but would also require knowledge of the other variants and updating the actual filter if you learn a new one. Obviously these are all examples, but you'd have to decide how to do it yourself.
I'm not about to type out all the words I know, not when I don't actually want to know them.
Don't. It just leads to problems. One clbuttic personal experience I have with profanity filters is the time where I was kick/banned from an IRC channel for mentioning that I was "heading over the bridge to Hancock for a couple hours" or something to that effect.
I agree with the futility of the subject, but if you have to have a filter, check out Ning's Boxwood:
Boxwood is a PHP extension for fast replacement of multiple words in a piece of text. It supports case-sensitive and case-insensitive matching. It requires that the text it operates on be encoded as UTF-8.
Also see this blog post for more details:
Fast Multiple String Replacement in PHP
With Boxwood, you can have your list of search terms be as long as you like -- the search and replace algorithm doesn't get slower with more words on the list of words to look for. It works by building a trie of all the search terms and then scans your subject text just once, walking down elements of the trie and comparing them to characters in your text. It supports US-ASCII and UTF-8, case-sensitive or insensitive matching, and has some English-centric word boundary checking logic.
I concluded, in order to create a good profanity filter we need 3 main components, or at least it is what I am going to do. These they are:
The filter: a background service that verify against a blacklist, dictionary or something like that.
Not allow anonymous account
Report abuse
A bonus, it will be to reward somehow those who contribute with accurate abuse reporters and punish the offender, e.g. suspend their accounts.
Don't.
Because:
Clbuttic
Profanity is not OMG EVIL
Profanity cannot be effectively defined
Most people quite probably don't appreciate being "protected" from profanity
Edit: While I agree with the commenter who said "censorship is wrong", that is not the nature of this answer.