How to detect nonsensical text in PHP?

I have comments enabled on my site and I require users to enter at least 30 characters to publish their comments (just to get some value, because otherwise they usually just submitted "I like it").
But some users now use a simple technique to get around this and enter e.g.:
"I like it. asdsdf dfdsfsdf tt erretrt re"
As you can see, the rest of the text is nonsense. Is there a way (an algorithm) to filter these comments out in PHP?

Get a dictionary of English words from the net. Check that the post has a certain percentage (maybe 50%? maybe 70%?) of words that are in the dictionary. You can't look for 100%, or names and technical jargon will not be found.
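A minimal sketch of that percentage check, assuming a plain-text word list with one word per line; the path and the 60% threshold are placeholders:

// Load a word list (one word per line) into a hash set for fast lookups.
// /usr/share/dict/words is just an example path; any downloaded word list works.
$dictionary = array_flip(array_map('strtolower', file('/usr/share/dict/words', FILE_IGNORE_NEW_LINES)));

function looksLikeRealText($comment, array $dictionary, $threshold = 0.6)
{
    // Lowercase, strip punctuation, split into words.
    $clean = preg_replace('/[^a-z\s]/', ' ', strtolower($comment));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return false;
    }
    $known = 0;
    foreach ($words as $word) {
        if (isset($dictionary[$word])) {
            $known++;
        }
    }
    // Accept the comment only if enough of its words are real dictionary words.
    return ($known / count($words)) >= $threshold;
}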
Users will get around this by entering:
I like it ....................................................
So then you add logic to parse out punctuation.
Then users will get around it with:
I like it. the the the the the the the the
Then you will need to parse it for proper English grammar.
Then no one will be able to post on your site because it has too many rules.
Better suggestion: Add comment moderation. Dumb posts get downvoted and go away. Good posts stay.

Check out the Akismet PHP5 class.
$WordPressAPIKey = 'KEYHERE';
$MyBlogURL = 'http://www.example.com/blog/';
$akismet = new Akismet($MyBlogURL, $WordPressAPIKey);
$akismet->setCommentAuthor($name);
$akismet->setCommentAuthorEmail($email);
$akismet->setCommentAuthorURL($url);
$akismet->setCommentContent($comment);
$akismet->setPermalink('http://www.example.com/blog/alex/someurl/');
if ($akismet->isCommentSpam()) {
    // Flag the comment as spam, e.g. hold it for moderation instead of publishing.
}

You can use a naive Bayesian filter for this: http://www.paulgraham.com/better.html
There are probably existing libraries for this kind of thing. Check out SpamAssassin.
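A minimal sketch of the underlying idea, not a drop-in library: count how often each word appears in known-spam and known-good comments, turn the counts into per-word spam probabilities, and combine them along the lines of Graham's formula. The $spamCounts/$hamCounts arrays and the 0.9 cutoff below are placeholders you would build from your own moderated comments:

// $spamCounts and $hamCounts map word => occurrences in each training corpus;
// $nSpam and $nHam are the number of training comments in each corpus.
function spamProbability($comment, array $spamCounts, array $hamCounts, $nSpam, $nHam)
{
    $words   = preg_split('/\W+/', strtolower($comment), -1, PREG_SPLIT_NO_EMPTY);
    $product = 1.0;
    $inverse = 1.0;
    foreach (array_unique($words) as $word) {
        $s = isset($spamCounts[$word]) ? $spamCounts[$word] / $nSpam : 0.0;
        $h = isset($hamCounts[$word])  ? $hamCounts[$word]  / $nHam  : 0.0;
        if ($s + $h == 0) {
            continue; // word never seen in training, skip it
        }
        // Clamp so a single word can never be absolutely decisive.
        $p = max(0.01, min(0.99, $s / ($s + $h)));
        $product *= $p;
        $inverse *= (1.0 - $p);
    }
    // Combined probability that the comment is spam.
    return $product / ($product + $inverse);
}

// if (spamProbability($comment, $spamCounts, $hamCounts, $nSpam, $nHam) > 0.9) { /* reject */ }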

I'd do a simple check for consecutive consonants or vowels. If there are more than four of either in a row, there is a high probability of nonsense. Furthermore, check for more than two repetitions of the same character. When looking at some nonsense text, I'm sure you'll find some pragmatic recipes ;-)
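A quick sketch of those two heuristics as regular expressions; the run lengths match the ones suggested above and are easy to tune:

function looksLikeNonsense($text)
{
    $text = strtolower($text);
    // Five or more consonants in a row, five or more vowels in a row,
    // or the same character three or more times in a row.
    return preg_match('/[bcdfghjklmnpqrstvwxyz]{5,}/', $text)
        || preg_match('/[aeiou]{5,}/', $text)
        || preg_match('/(.)\1{2,}/', $text);
}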

Personally, I would say there's not much you can do about it. Even if you had a dictionary and a parser, what if I were to leave the comment: "I like it. As do I like your car."? Depending on what they're commenting on, that could be complete nonsense. The best I can suggest is to have an edit option available for each comment so that you, a mod, or whoever can fix it. Sorry this isn't of much help.
I had this same issue when trying to create password restrictions. Words couldn't be used, so we needed to use a dictionary, but there is never a comprehensive dictionary. And the biggest thing was eliminating l33t speak. :)

Unfortunately not; your best bet is to adapt something like this: Get Spelling Corrections From Google. When messages are close to the character minimum, you could look up each word individually and, if it doesn't have a direct hit, reject the input.

Related

Best way to save info in hash

I have a webpage where the user inputs data into a textarea, which I then process and display with some JavaScript. For example, if the user types:
_Hello_ *World* it would produce something like:
<underline>Hello</underline> <b>World</b>
Or something like that; the details aren't important. Now the user can "save" the page, turning it into something like site.com/page#_Hello_%20*World*, and share that link with others.
My question is: is this the best way to do this? Is there a limit on URL length that I should be worried about? Should I do something like what jsfiddle does?
I would prefer not to, as the site works offline when the full text is in the hash; since the nature of the site is to be used offline, the user would otherwise have to cache the jsfiddle-like page first before they could use it.
What's the best way to do this?
EDIT: OK, the example I gave is nothing like what I'm actually doing. I'm not cloning Markdown or using underline or b tags; I just wanted to illustrate what I wanted.
Instead of trying to save stuff in the URL, you should use the same approach that is common in pastebins: you store the data and provide the user with a URL containing a unique string that identifies the stored document, something like http://foo.bar/g4jg64
From the URL you get state or identifiers, not the data.
URLs are typically limited to 2KB total, but there is no officially designated limit. It is browser-dependent.
Other than that, make sure you properly URL encode what you're putting up there, and you're fine... although I certainly would not want to deal with obnoxiously long URLs. I might suggest you also avoid tags such as <underline> and <b>, as they have been deprecated for a very, very long time.
Use the JavaScript function:
encodeURIComponent('_Hello_ *World*');

Need an algorithm to find near-duplicate text values

I run a photo website where users are free to enter any tag they like, even tags not used before. As a result, a photo may sometimes be tagged as "insect" whilst somebody else tags it as "insects".
I'd like to keep the free-tagging capability, yet would like to have a way to filter out such near-duplicates. The total collection of tags is currently at 1,500. My idea is to read all of them from the DB into memory and then run an algorithm on them that displays "suspects".
My idea of a suspect is that x% of the characters in the string are the same (same characters, same order), where x is configurable. I could probably code a really inefficient way to do this, but I was wondering if there is an existing solution to this problem?
Edit: Forgot to mention: just sorting the tags isn't enough, as that would still require me to go through the entire set to find dupes.
There are some flaws in your logic. For example, what happens when the plural of a word is different from the singular (e.g. person vs. people, or even candy vs. candies)?
If English is the primary language, check out Soundex, which allows phonetic matches. Also consider using a crowd-sourced synonym model where users can create links to existing tags.
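PHP ships with soundex() (and metaphone()) built in, so a rough first pass could group tags by their phonetic code and flag any group with more than one member; the tag list here is just an illustration:

$tags = array('insect', 'insects', 'butterfly', 'buterfly');

// Group tags by their Soundex code; groups with more than one member are suspects.
$groups = array();
foreach ($tags as $tag) {
    $groups[soundex($tag)][] = $tag;
}
foreach ($groups as $code => $members) {
    if (count($members) > 1) {
        echo "Possible duplicates ($code): " . implode(', ', $members) . "\n";
    }
}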
Maybe the algorithm you are looking for is approximate string matching:
http://en.wikipedia.org/wiki/Approximate_string_matching
For a given word, you can match it against the list of words and, if the 'distance' is small, add it to the suspects.
A fast implementation is to use dynamic programming, like the Needleman–Wunsch algorithm.
I have made a blog example of this in C# where you can configure the 'distance' using a matrix character lookup file:
http://kunuk.wordpress.com/2010/10/17/dynamic-programming-example-with-c-using-needleman-wunsch-algorithm/
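If a full Needleman–Wunsch implementation is more than you need, PHP's built-in levenshtein() gives a related edit distance that works the same way for this purpose. A sketch, with the 25% distance-to-length ratio as an arbitrary threshold:

function findSuspects(array $tags, $maxRatio = 0.25)
{
    $suspects = array();
    $n = count($tags);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $distance = levenshtein($tags[$i], $tags[$j]);
            $longest  = max(strlen($tags[$i]), strlen($tags[$j]));
            if ($longest > 0 && ($distance / $longest) <= $maxRatio) {
                $suspects[] = array($tags[$i], $tags[$j], $distance);
            }
        }
    }
    return $suspects; // each entry: [tagA, tagB, edit distance]
}

With roughly 1,500 tags this is about a million pairwise comparisons, which is fine for an offline "suspects" report.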
Is "either contains either" fine? You could do a SQL query something like this, if your images are in a database (which would only make sense):
SELECT * FROM ImageTags WHERE INSTR('theNewTag', TagName) > 0 OR INSTR(TagName, 'theNewTag') > 0 LIMIT 1;
If you really want to do this efficiently, I would suggest some sort of JavaScript implementation that displays possibilities as the user types a tag. Not only will it save the user time to see five suggestions as they type, it will also automatically stop them from typing "suspects" when "suspect" shows up as a suggestion, unless, of course, they really do want "suspects".
You could load a huge list of words and narrow them down as the user types. I get the feeling that this could be kept very simple, especially if you only want to anticipate correctly spelled words. If someone misses a letter, they'll probably go back to fix it when they see a list of suggestions that isn't at all what they meant to type. And when they do correctly type a word, it'll pop up in the suggestions.

Is there any way to detect strings like putjbtghguhjjjanika?

People search on my website, and some of these searches look like:
tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a
My question is: is there any way to detect strings similar to the ones above?
I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)
edit: I mean gibberish searches. For example, some people search for strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect such gibberish searches.
It doesn't matter whether the search returns 0 results or anything else; I can't use that logic.
Some new brands or products would be ignored if I only considered "regular words".
Thank you for your help
You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
For background, read about Markov Chains.
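A compact sketch of that approach in PHP, assuming you have a large sample of ordinary English text to train on; the floor probability for unseen transitions and the cutoff you compare scores against are both values you would calibrate against known-good and known-gibberish examples:

// Train: count character-to-character transitions over a-z and space.
function trainTransitions($trainingText)
{
    $counts = array();
    $text = preg_replace('/[^a-z ]+/', ' ', strtolower($trainingText));
    for ($i = 0, $len = strlen($text) - 1; $i < $len; $i++) {
        $from = $text[$i];
        $to   = $text[$i + 1];
        $counts[$from][$to] = isset($counts[$from][$to]) ? $counts[$from][$to] + 1 : 1;
    }
    // Normalise each row of counts into probabilities.
    $model = array();
    foreach ($counts as $from => $row) {
        $total = array_sum($row);
        foreach ($row as $to => $count) {
            $model[$from][$to] = $count / $total;
        }
    }
    return $model;
}

// Score: average log-probability of the transitions in a query.
function transitionScore($query, array $model)
{
    $text = preg_replace('/[^a-z ]+/', ' ', strtolower($query));
    $logProb = 0.0;
    $steps = 0;
    for ($i = 0, $len = strlen($text) - 1; $i < $len; $i++) {
        // Unseen transitions get a small floor probability instead of zero.
        $p = isset($model[$text[$i]][$text[$i + 1]]) ? $model[$text[$i]][$text[$i + 1]] : 1e-6;
        $logProb += log($p);
        $steps++;
    }
    return $steps > 0 ? $logProb / $steps : 0.0;
}

// Usage (placeholder training file and cutoff):
// $model = trainTransitions(file_get_contents('big_english_sample.txt'));
// $isGibberish = transitionScore($userQuery, $model) < $cutoff;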
Edit, I implemented this here in Python:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
You could do what Stack Overflow does and calculate the entropy of the string.
Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
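Shannon entropy over the characters of a string takes only a few lines in PHP; very low values indicate heavy repetition ("aaaaaaa") and unusually high values can indicate keyboard mashing, so treat it as one rough signal among several:

function shannonEntropy($string)
{
    $length = strlen($string);
    if ($length === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    foreach (count_chars($string, 1) as $frequency) {
        $p = $frequency / $length;
        $entropy -= $p * log($p, 2);
    }
    return $entropy; // bits per character
}

// shannonEntropy('aaaaaaaaaa')        -> 0
// shannonEntropy('I like this post') -> roughly 3.3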
Assuming you mean gibberish searches... it would be more trouble than it's worth. You are providing them with search functionality; let them use it however they please. I'm sure there are algorithms out there that detect strange character groupings, but it would probably be more resource/labour intensive than simply returning no results.
I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not PHP, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector in the answer by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.
Example: the following code,
from nostril import nonsense

real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']

for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
The project is on GitHub and I welcome contributions.
I'd think you could detect these strings the same way you could detect "regular words." It's just pattern matching, no?
As to why users are searching for these strings, that's the bigger question. You may be able to head off the gibberish searches some other way. For example, if it's comment-spam phrases that people (or a script) are looking for, then install a CAPTCHA.
Edit: Another end-run around interpreting the input is to throttle it slightly. Allow a search every 10 seconds or so. (I recall seeing this on forum software, as well as various places on SO.) This will take some of the fun out of searching for sdfpjheroptuhdfj over and over again, and at the same time won't interfere with the users who are searching for, and finding, their stuff.
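A minimal session-based throttle along those lines (the 10-second window is the one suggested above):

session_start();

$now = time();
if (isset($_SESSION['last_search']) && ($now - $_SESSION['last_search']) < 10) {
    // Too soon since the previous search; show a friendly message instead.
    exit('Please wait a few seconds before searching again.');
}
$_SESSION['last_search'] = $now;
// ... run the actual search here ...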
As some people commented, there are no hits in Google for tapoktrpasawe or putjbtghguhjjjanika (well, there are now, of course), so if you have a way to do a quick Google search through an API, you could throw out any search terms that got no Google results and weren't the names of one of your products. Why you would want to do this is a whole other question: are you trying to save effort for your search library? Make your hand-review of "popular search terms" more meaningful? Or are you just frustrated at the inexplicable behaviour of some of the people out on the big wide internet? If it's the latter, my advice is just to let it go, even if there is a way to prevent it. Some other weirdness will come along.
Short answer: gibberish search
A probabilistic language model works.
Logic
A word is made up of a sequence of characters. If pairs of characters frequently occur together in English, and the summed frequency of all contiguous character pairs in the word crosses a threshold, the word is likely a proper English word. In brief, this is the idea behind Markov chains.
Link
For the mathematics of gibberish detection and a better understanding, refer to this video: https://www.youtube.com/watch?v=l15C8UJu17s. Thanks!
If the search is performed on products, you could cache their names or codes and check search terms against that list before querying the database. Otherwise, if your site is for English users, you could build a dictionary of strings that aren't used in the English language, like "qwkfagsd". That said, and agreeing with the other answer, this will be more resource intensive than doing nothing at all.

Check for misspelt words using PHP

I am using this code to create an instant search for my site...
http://woorkup.com/2010/09/13/how-to-create-your-own-instant-search/
Some of the phrases in our database are very complex and can easily be misspelt, so on top of this I wanted to use spelling suggestions.
Does anyone know of any ways to offer correct spellings based on a string provided?
Any help would be greatly appreciated.
Yes, there is a jQuery plugin called After the Deadline.
If someone searches for a phrase, doesn't click any of the results, and then searches again with a new, similar phrase (check out levenshtein()) and does click a result, write the original phrase and the new phrase to your database.
Record each time this happens. If the phrase pair is already recorded, increment a counter for that phrase.
Then, if someone searches for a phrase that matches one of your possibly misspelt phrases (perhaps with a threshold based on your counter), you can display a "Did you mean to search for...?" as well as the results (if any) for the original phrase.
This isn't a spell check per se, but I think it would be useful for picking up on common mistakes. Unfortunately, though, you probably don't have as many people helping you build an index as Google's "Did you mean?" does.
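If you also want suggestions before any search history accumulates, PHP's built-in levenshtein() can find the closest phrase you already know about. This is only a sketch; the distance cutoff of 3 is an arbitrary example:

function suggestPhrase($query, array $knownPhrases, $maxDistance = 3)
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($knownPhrases as $phrase) {
        $distance = levenshtein(strtolower($query), strtolower($phrase));
        if ($distance > 0 && $distance < $bestDistance) {
            $best = $phrase;
            $bestDistance = $distance;
        }
    }
    return $best; // null when nothing is close enough
}

// $suggestion = suggestPhrase($userQuery, $phrasesFromDatabase);
// if ($suggestion !== null) { echo "Did you mean to search for \"$suggestion\"?"; }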
Peter Norvig has written (and explained) a fairly basic spelling corrector, which makes for a very interesting read. It's in Python, but his explanations are invaluable (he does work for Google, and this is a very bare-bones representation of the Google "Did you mean" algorithm).

Fast way to match an array of words with a block of text?

The subject is probably not as clear as it could be, but I was struggling to think of a better way to easily describe it.
I am implementing a bad-word filter on some articles that we pick up from an XML feed. At the moment I have the bad words in an array and simply check the text like so:
str_replace($badwords, '', $text, $count);
if ($count > 0) // We have bad words...
But this is SLOW! So slow! And when I am trying to process 30,000+ articles at a time, I start wondering if there is a better way to achieve this. If only strpos supported arrays! Even then I don't think it'd be faster...
I'd love any suggestions. Thanks in advance!
EDIT:
I have now tested a few methods between calls to microtime() to time them.
str_replace() = 990 seconds
preg_match() = 1029 seconds (remember, I only need to identify them, not replace them)
no bad word filtering = 1057 seconds (presumably because it had another thousand or so bad-worded articles to process)
Thanks for all the answers; I will just stick with str_replace. :)
How about combining all the words into one regex and replacing everything in one go? I'm not sure how it will go for performance, but it might be faster.
E.g.
preg_replace('/(' . implode('|', $badwords) . ')/i', '', $text);
I used to work at my local newspaper office. Instead of modifying the text to delete bad words from the original files, what I did was just run a filter when a user requested to view the article. This way you preserve the original text should you ever need it, but also dish out a clean version for your viewers. There should be no need to process 30,000 articles at once unless I am misunderstanding something.
Define "slow"? Anything that's going to be processing 30,000 articles is probably going to take a bit of time to complete.
That said, one option (which I have not benchmarked, just tossing it out there for consideration) would be to combine the words into a regex and run that through preg_replace (just using the | operator to put them together).
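Since you only need to identify bad words rather than replace them, a combined pattern with preg_match (which stops at the first match) is another thing worth benchmarking. This is just a sketch; preg_quote keeps any words containing regex metacharacters from breaking the pattern:

$quoted  = array_map(function ($word) { return preg_quote($word, '/'); }, $badwords);
$pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

if (preg_match($pattern, $text)) {
    // We have bad words...
}

Note that the \b word boundaries also stop a bad word embedded inside a harmless longer word from matching, which plain str_replace does not.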
In case these previous questions are useful:
How do you implement a good profanity filter?
How do I replace bad words with php?
Blacklist of words on content to filter message.
Trouble with simple PHP profanity filter
