How can I use PHP to find base info about text quality? - php

I have a PHP/MySQL-driven site that I have not maintained for the past 6 months. It is a site where users submit their articles. I have 50,000 articles, and from some ad hoc tests I would say that about 50-60% are spam or text copied and pasted from other sites.
I am looking to write a PHP script that uses some basic parameters to mark or remove spam text (not copy/pasted text; for this step only pure spam). My idea is a script that takes every article, counts characters, words, distinct words, phrase usage, and word density, and depending on those factors removes pure spam (text with many repeated phrases, etc.). Writing this would cost me a whole day, so my question is:
Is there some solution already developed in PHP?
If I need to code it myself, what parameters on determining spam should I use?

Here's a PHP class that I've used in the past - Basic Spam Class
I am not the author, so I don't take any responsibility for potential damage done by the code. I've used it for checking short texts though - user comments on a site, so I'm not sure about the performance on 50k of long articles, maybe you will need to do some enhancements on it. But at least you have something to start from.

Maybe you could take a look at Akismet and Bad Behaviour. The first one to analyze the articles you already have (as well as future ones) and Bad Behaviour to combat spam before it ever gets into your database.
They may not be ideal, but they could help you on your way.

I've observed that a lot of spam posts on sites like that have a lack of articles. They contain just a bunch of keywords and links. You could add a parameter for minimum number of articles. If less than 1% of the post is articles you could reject it as spam.
For example, if you count the number of "the"s, "an"s, "a"s, and "some"s in the above paragraph, you get 3 "a"s and 1 "the" (4 articles out of 43 words, or 9.3%).
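The article-ratio heuristic described above could be sketched roughly as follows. This is only an illustration, not a tested spam filter: the function names `articleRatio` and `looksLikeKeywordSpam` are made up for this example, the word list follows the answer (including "some", which is not strictly an article), and the 1% threshold is the figure suggested above and would need tuning on real data.

```php
<?php
// Hypothetical helper: share of words in $text that are "a", "an", "the", or "some".
function articleRatio(string $text): float
{
    // Split on anything that is not a letter or an apostrophe.
    $words = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0.0;
    }

    $articles = ['a', 'an', 'the', 'some'];
    $hits = 0;
    foreach ($words as $word) {
        if (in_array($word, $articles, true)) {
            $hits++;
        }
    }

    return $hits / count($words);
}

// Hypothetical check: flags text whose article ratio is below $minRatio (assumed 1%).
function looksLikeKeywordSpam(string $text, float $minRatio = 0.01): bool
{
    return articleRatio($text) < $minRatio;
}
```

A flagged article is probably better queued for review than deleted outright, since short legitimate texts (titles, lists) can also contain no articles.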

Related

proposed nlp algorithm for text tagging

I was looking for an open-source tool that can identify tags for any user post on social media and flag on-topic, off-topic, or spam comments on that post. Even after looking for an entire day, I could not find a suitable tool or library.
Here I propose my own algorithm for tagging user posts that belong to 7 categories (jobs, discussion, events, articles, services, buy/sell, talents).
Initially, when a user makes a post, he tags it. Tags can be things like marketing, suggestion, entrepreneurship, MNC, etc. So for some posts I have the tags and the category they belong to.
Steps:
Perform POS (part-of-speech) tagging on the user's post. Two things can be done here:
Consider only nouns. Nouns probably represent the tag for a post more intuitively.
Consider both nouns and adjectives. This collects a large number of nouns and adjectives, and the frequency of those words can be used to identify the tag for the post.
For each user-defined tag, collect the POS-filtered words from the posts that belong to that tag. For example, consider the user-assigned tag marketing, whose posts contain the words SEO and adwords. Suppose 10 posts tagged marketing contain SEO and adwords 5 and 7 times respectively. The next time a post arrives that has no tag but contains the word SEO, we predict the marketing tag for it, because SEO occurs most often under marketing.
The next step is to identify spam or off-topic comments on a post.
Consider one user post in the Job category that carries the tag marketing. I check the database for the 10-15 most frequent POS-filtered words (i.e. nouns and adjectives) for marketing.
In parallel, I have the POS-tagged words for the comment. I check whether the nouns and adjectives in the comment include any of the most frequent words (we can consider the top 15-20) for marketing.
If the words in the comment do not match any of the most frequent words for marketing, the comment can be considered off-topic or spam.
Do you have any suggestions to make this algorithm more intuitive?
I guess an SVM can help with classification. Any suggestions for this?
Apart from this, which machine learning technique could help the system learn to predict tags and spam (off-topic) comments?
The main problem as I see it is with your feature modeling. While picking out only nouns would help reduce the feature space, it is an extra step with a potentially significant error rate. And do you really care whether you are looking at market/N and not market/V?
Most mainline text classification implementations using naive bayesian classifiers just ignore the POS, and simply count each distinct word form as an independent feature. (You could also do brute-force stemming to reduce market, markets, and marketing to a single stem form and thus a single feature. This tends to work in English, but might not be very adequate if you are actually working in a different language.)
A compromise could be to do POS filtering when you train your classifier. Then word forms which do not have a noun reading end up with a zero score in the classifier, so you don't have to do anything to filter them out when you use the resulting classifier.
Empirically, SVM tends to achieve a high accuracy, but it comes at the cost of complexity, both in implementation and behavior. A naive bayesian classifier has the distinct advantage that you can understand precisely how it arrived at a particular conclusion. (Well, most of us mortals cannot claim to have the same grasp of the mathematics behind SVM.) Perhaps a good way to proceed would be to prototype with Bayes, and iron out any kinks while learning how the system as a whole behaves, then maybe later consider switching to SVM once the other parts are stable?
The "spam" category is going to be harder than any well-defined content category. It would be tempting to suggest that anything which doesn't fit any of your content categories is off-topic, but if you are going to use the verdict for automatic spam filtering, this is likely to cause some false positives at least in the early stages. A possible alternative could be to train classifiers for particular spam categories -- one for medications, another for running shoes, etc.
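To make the "prototype with Bayes" suggestion concrete, here is a minimal multinomial naive Bayes sketch in PHP. It is an illustration under stated assumptions rather than a reference implementation: the class name `NaiveBayesTagger` and its methods are invented for this example, every distinct word form is treated as a feature (no POS filtering or stemming), add-one smoothing is used, and words never seen in training are simply ignored.

```php
<?php
// Minimal multinomial naive Bayes sketch; class and method names are hypothetical.
final class NaiveBayesTagger
{
    /** @var array<string, array<string, int>> word counts per tag */
    private array $wordCounts = [];
    /** @var array<string, int> number of training posts per tag */
    private array $docCounts = [];
    /** @var array<string, bool> set of all words seen during training */
    private array $vocabulary = [];

    private function tokenize(string $text): array
    {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    public function train(string $tag, string $text): void
    {
        $this->docCounts[$tag] = ($this->docCounts[$tag] ?? 0) + 1;
        foreach ($this->tokenize($text) as $word) {
            $this->wordCounts[$tag][$word] = ($this->wordCounts[$tag][$word] ?? 0) + 1;
            $this->vocabulary[$word] = true;
        }
    }

    /** Returns a log-score per tag, best first. */
    public function scores(string $text): array
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $words = $this->tokenize($text);
        $scores = [];

        foreach ($this->docCounts as $tag => $docs) {
            $tagWords = array_sum($this->wordCounts[$tag] ?? []);
            $score = log($docs / $totalDocs); // prior: share of training posts with this tag
            foreach ($words as $word) {
                if (!isset($this->vocabulary[$word])) {
                    continue; // unseen words carry no evidence either way
                }
                $count = $this->wordCounts[$tag][$word] ?? 0;
                // Add-one (Laplace) smoothing keeps the probability above zero.
                $score += log(($count + 1) / ($tagWords + $vocabSize));
            }
            $scores[$tag] = $score;
        }

        arsort($scores);
        return $scores;
    }

    public function predict(string $text): ?string
    {
        $scores = $this->scores($text);
        return $scores === [] ? null : array_key_first($scores);
    }
}
```

Because the per-tag scores are just sums of per-word log-probabilities, you can print the contribution of each word to see why a post landed in a given tag, which is the transparency advantage mentioned above.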
Any linear classifier works well for text classification. In my experience, logistic regression and SVM both do well.
You may also try a multinomial naive Bayes classifier, which has been used in several email spam filters.
Have a look at this for POS tagging.
http://nlp.stanford.edu/software/tagger.shtml
Rely more on data collection (labelled data), and then use it to build a classifier, whether linear, Bayesian, or SVM: whichever matches your requirements (performs best).
Also see whether you could make a multi-class prediction (i.e. create a new class that is a combination of two or more classes), or try finding the probability of a string sequence belonging to each class.
Hope this helps

Prevent search abuse

I am unable to google something useful on this subject, so I'd appreciate either links to articles that deal in this subject, or direct answers here, either is fine.
I am implementing a search system in PHP/MySQL on a site that has quite a lot of visitors, so I am going to implement some restrictions on the length of the text a visitor is allowed to enter in the search field and on the minimum time required between two searches. Since I'm kind of new to these problems and don't really know the "real reasons" why this is usually done, I can only assume that the minimum character length is there to limit the number of results the database will return, and that the time between searches is there to prevent robots from spamming the search system and slowing down the site. Is that right?
And finally, the question of how to implement the minimum time between two searches. The solution I came up with, in pseudo-code, is this:
Set a test cookie at the URL where the search form is submitted to
Redirect user to the URL where the search results should be output
Check if the test cookie exists
If not, output a warning that he isn't allowed to use the search system (is probably a robot)
Check if a cookie exists that tells the time of the last search
If this was less than 5 seconds ago, output a warning that he should wait before searching again
Search
Set a cookie with the time of last search to current time
Output search results
Is this the best way to do it?
I understand this means visitors that have cookies disabled will not be able to use the search system, but is that really a problem these days? I couldn't find the statistics for 2012, but I managed to find data saying 3.7% of people had disabled cookies in 2009. That doesn't seem like a lot and I suppose should probably be even less these days.
"only my assumptions that the character minimum length is implemented to minimize the number of results the database will return". Your assumption is absolutely correct. It reduces the number of potential results, by forcing the user to think about, what it is they wish to search.
As far as bots spamming your search, you could implement a captcha; the most frequently used is reCAPTCHA. If you don't want to show a captcha right away, you can track (via the session) the number of times the user has submitted a search, and render the captcha only if X searches occur within a certain time frame.
I've seen sites like SO and thechive.com implement this type of strategy, where captcha isn't rendered right away, but will be rendered if a threshold is encountered.
This way you're preventing search engines from indexing your search results. A cleaner way of doing this would be:
Get the IP where the search originated
Store that IP and the time the query was made in a cache system such as memcached
If another query is sent from the same IP and fewer than x seconds have passed, simply reject it or make the user wait
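The three steps above can be sketched as a small throttle class. This is a simplified illustration: the class name `SearchThrottle` is hypothetical, and it keeps timestamps in a plain PHP array so that the example runs on its own. Because PHP does not share memory between requests, a real deployment would keep the same key/timestamp pairs in a shared cache such as memcached instead.

```php
<?php
// Hypothetical throttle; the array stands in for a shared cache such as memcached.
final class SearchThrottle
{
    /** @var array<string, int> last accepted search time (Unix timestamp) per client key */
    private array $lastSeen = [];
    private int $minInterval;

    public function __construct(int $minIntervalSeconds = 5)
    {
        $this->minInterval = $minIntervalSeconds;
    }

    /** Returns true if the search may run now, and records it if so. */
    public function allow(string $clientKey, ?int $now = null): bool
    {
        $now = $now ?? time();

        if (isset($this->lastSeen[$clientKey])
            && $now - $this->lastSeen[$clientKey] < $this->minInterval) {
            return false; // too soon: reject or make the user wait
        }

        $this->lastSeen[$clientKey] = $now;
        return true;
    }
}

// Usage sketch (the client key here is the visitor's IP address):
// $throttle = new SearchThrottle(5);
// if (!$throttle->allow($_SERVER['REMOTE_ADDR'])) {
//     // show a "please wait" message instead of running the query
// }
```

Keying on the IP alone will also throttle several visitors behind the same proxy or NAT, so the interval is best kept short.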
Another thing you can do to increase performance is to take a look at analytics and see which queries are made most often and cache those so when a request comes in you serve the cached version and not make a full db query, parsing, etc...
Another naive option would be to have a script run 1-2 times a day running all common queries and create static HTML files that users hit when making particular search queries instead of hitting the db.

PHP auto answering script

I was thinking about an idea of auto generated answers, well the answer would actually be a url instead of an actual answer, but that's not the point.
The idea is this:
In our app we've got a reporting module which basically shows page views, clicks, conversions, and details about visitors such as where they're from: pretty much the same thing as Google Analytics, but far more simplified.
And now I was thinking that instead of making users select things like countries and traffic sources from dropdown menus (these features would still be available), it would be pretty cool to let them type in questions that result in a link to the part of the report they expect. An example:
How many conversions I had from Japan on variant (one page can have many variants) 3.
would result in:
/campaign/report/filter/campaign/(current campaign id they're on)/country/Japan/variant/3/
It doesn't seem too hard to do it myself, but it's just that it would take quite a while to make it accurate enough.
I've tried googling but had no luck finding an existing script, so maybe you know of something similar to my idea that's open source and reliable and flexible enough to suit my needs.
Thanks!
You are talking about natural language processing - an artificial intelligence topic. This can never be perfect, and eventually boils down to the system only responding to a finite number of permutations of one question.
That said, if that is fine with you - then you simply need to identify "tokens". For example,
how many - evaluate to count
conversions - evaluate to all "conversions"
from - apply a filter...
japan - ...using japan
etc.
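The token idea above might look something like this in PHP. It is a rough sketch, not a general natural-language parser: the function name `questionToReportUrl`, the country whitelist parameter, and the two patterns are assumptions made for this example, and the URL layout simply follows the one shown in the question.

```php
<?php
// Hypothetical token matcher: turns a typed question into a report filter URL.
function questionToReportUrl(string $question, int $campaignId, array $knownCountries): ?string
{
    $segments = ['campaign', (string) $campaignId];

    // "from <country>": only countries from the whitelist are recognized.
    foreach ($knownCountries as $country) {
        if (preg_match('/\bfrom\s+' . preg_quote($country, '/') . '\b/i', $question)) {
            $segments[] = 'country';
            $segments[] = $country;
            break;
        }
    }

    // "variant 3", optionally with a parenthetical remark between word and number.
    if (preg_match('/\bvariant\b(?:\s*\([^)]*\))?\s*(\d+)/i', $question, $m)) {
        $segments[] = 'variant';
        $segments[] = $m[1];
    }

    if (count($segments) === 2) {
        return null; // no known token found, fall back to the dropdown filters
    }

    return '/campaign/report/filter/' . implode('/', array_map('rawurlencode', $segments)) . '/';
}
```

Each new kind of question needs another pattern, which illustrates the point above that the system only ever responds to a finite set of phrasings.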

Integrating a 1-10 voting system effectively without common pitfalls

I'm planning on integrating a reasonable ranking/voting system into an existing application.
I'm familiar with how traditional 5-star rating systems work and know the common pitfalls associated with them, so I was wondering if there are other ways (I've heard of Wilson's, Bayesian, etc., but I'm not really sure how to implement them with the structure below):
I'm planning on allowing users to vote on content between 1 to 10 via the contents page.
The score and total votes for that content will be displayed on the contents page.
I will also be displaying/listing the Top 10 Content so I'd need the method to be fair/realistic and not make a vote of 10 with total votes of 1 to go straight to number 1.
I'm using PHP and MySQL, I have a table for the content (which has a content_id which I guess I can JOIN on).
I'm wondering if you can suggest a method that achieves the above. I'd appreciate it if you could attach some example PHP code and an example MySQL schema so I can better understand it. I've googled and may have found potential solutions such as Wilson's and Bayesian, yet they come with lengthy articles full of confusing mathematical equations and show no way to achieve the above (i.e. computing the score and implementing the method in PHP/MySQL), or at least, without any example PHP/MySQL code, I'm misunderstanding them.
Perhaps this is easier than I think. I don't know, as I've never needed to implement this sort of "more complex" ranking/voting functionality before, so I'd appreciate your responses.
You should start by watching this video on youtube : Building Web Reputation Systems.
To emphasize the point, let me direct you to XKCD.
As for DB structure, you need following parts:
list of items (with a total_votes column)
list of users who have voted
intersection table for items and users (with a rating column, if you go with the 5-star thing)
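Neither answer nor question settles on a formula, so the following is just one possible approach: a Bayesian average, which pulls items with few votes toward the site-wide mean so that a single vote of 10 cannot jump to the top of the list. The function names, the `rating_sum` field, and the default prior weight of 10 are assumptions for this sketch; `content_id` and `total_votes` come from the text above.

```php
<?php
// Bayesian average sketch: (priorWeight * globalMean + ratingSum) / (priorWeight + voteCount).
// $priorWeight must be greater than zero; it acts like that many "virtual" votes at the mean.
function bayesianAverage(float $ratingSum, int $voteCount, float $globalMean, float $priorWeight = 10.0): float
{
    return ($priorWeight * $globalMean + $ratingSum) / ($priorWeight + $voteCount);
}

// Each item is assumed to look like:
// ['content_id' => int, 'rating_sum' => float, 'total_votes' => int]
function rankContent(array $items, int $limit = 10, float $priorWeight = 10.0): array
{
    $totalVotes = array_sum(array_column($items, 'total_votes'));
    $totalSum = array_sum(array_column($items, 'rating_sum'));
    $globalMean = $totalVotes > 0 ? $totalSum / $totalVotes : 0.0;

    foreach ($items as $i => $item) {
        $items[$i]['score'] = bayesianAverage(
            (float) $item['rating_sum'],
            (int) $item['total_votes'],
            $globalMean,
            $priorWeight
        );
    }

    // Highest score first.
    usort($items, fn(array $a, array $b): int => $b['score'] <=> $a['score']);

    return array_slice($items, 0, $limit);
}
```

The per-item sums could be kept alongside total_votes in the items table and updated on each vote, so the Top 10 query does not have to aggregate the intersection table every time. A larger prior weight demands more votes before an item can move far from the mean.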

check if a name seems "human"?

I have an online RPG game which I'm taking seriously. Lately I've been having problem with users making bogus characters with bogus names, just a bunch of different letters. Like Ghytjrhfsdjfnsdms, Yiiiedawdmnwe, Hhhhhhhhhhejejekk. I force them to change names but it's becoming too much.
What can I do about this?
Could I somehow check so at least you can't use more than 2 of the same letter beside each other?? And also maybe if it contains vowels
I would recommend concentrating your energy on building a user interface that makes it brain-dead easy to list all new names to an administrator, and a big fat "force to rename" mechanism that minimizes the admin's workload, rather than trying to define the incredibly complex and varied rules that make a name (and program a regular expression to match them!).
Update - one thing comes to mind, though: Second Life used to allow you to freely specify a first name (maybe they check against a database of first names, I don't know) and then gives you a selection of a few hundred pre-defined last names to choose from. For an online RPG, that may already be enough.
You could use a metaphone implementation and then look for "unnatural" patterns:
http://www.php.net/manual/en/function.metaphone.php
This is the PHP function for metaphone string generation. You pass in a string and it returns the phonetic representation of the text. You could, in theory, pass a large number of "human" names and then store a database of valid combinations of phonemes. To test a questionable name, just see if the combinations of phonemes are in the database.
Hope this helps!
Would limiting the amount of consonants or vowels in a row, and preventing repeating help?
As a regex:
if (preg_match('/[bcdfghjklmnpqrstvwxyz]{4}|[aeiou]{4}|([a-z])\1{2}/i', $name)) {
    // reject: 4 consonants in a row, 4 vowels in a row, or the same letter 3 times in a row
}
Possibly use iconv with ASCII//TRANSLIT if you allow accented characters.
What if you would use the Google Search API to see if the name returns any results?
I say take @Unicron's approach of easy admin rejection, but on each rejection, add the name to a database of banned names. You might be able to use this data to detect specific attacks that generate large numbers of users based on patterns. It will of course be very difficult to detect one-offs.
I had this issue as well. An easy way to solve it is to force user names to validate against a database of world-wide names. Essentially you have a database on the backend with a few hundred thousand first and last names for both genders, and make their name match.
With a little bit of searching on google, you can find many name databases.
Could I somehow check so at least you cant use more than 2 of the same letter beside each other?? and also maybe if it contains vowels
If you just want this, you can do:
preg_match('/(.)\\1\\1/i', $name);
This will return 1 if anything appears three times in a row or more.
This link might help. You might also be able to plug it through a (possibly modified) speech synthesiser engine and analyse how much trouble it's having generating the speech, without actually generating it.
You should try implementing a modified version of a Naive Bayes spam filter. For example, in normal spam detection you calculate the probability of a word being spam and use individual word probabilities to determine if the whole message is spam.
Similarly, you could download a word list, and compute the probability that a pair of letters belongs to a real word.
E.g., create a 26x26 table, say T. Let the 5th row represent the letter e and let entry T(5,1) be the number of times ea appeared in your word list. Once you're done counting, divide each element in each row by the sum of the row, so that T(5,1) is now the fraction of letter pairs starting with e in your word list that are ea.
Now you can use the individual pair probabilities (e.g. for Jimy that would be {Ji, im, my}) to check whether Jimy is an acceptable name or not. You'll probably have to determine the right probability to threshold at, but try it out; it's not that hard to implement.
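A rough PHP sketch of the letter-pair table described above follows. The function names, the use of nested arrays instead of a fixed 26x26 matrix, the averaging of log-probabilities, and the floor value for unseen pairs are all choices made for this illustration; the answer itself only specifies the counting and row normalization.

```php
<?php
// Builds P(second letter | first letter) from a word list; function names are hypothetical.
function buildBigramTable(array $words): array
{
    $counts = [];
    foreach ($words as $word) {
        $letters = preg_replace('/[^a-z]/', '', strtolower($word));
        for ($i = 0; $i + 1 < strlen($letters); $i++) {
            $first = $letters[$i];
            $second = $letters[$i + 1];
            $counts[$first][$second] = ($counts[$first][$second] ?? 0) + 1;
        }
    }

    // Divide each row by its sum so that every row adds up to 1.
    $table = [];
    foreach ($counts as $first => $row) {
        $total = array_sum($row);
        foreach ($row as $second => $n) {
            $table[$first][$second] = $n / $total;
        }
    }

    return $table;
}

// Average log-probability of the letter pairs in $name; unseen pairs get a small floor value.
function nameScore(string $name, array $table, float $floor = 1e-4): float
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($name));
    $pairs = strlen($letters) - 1;
    if ($pairs < 1) {
        return log($floor); // nothing to score
    }

    $sum = 0.0;
    for ($i = 0; $i < $pairs; $i++) {
        $p = $table[$letters[$i]][$letters[$i + 1]] ?? $floor;
        $sum += log(max($p, $floor));
    }

    return $sum / $pairs;
}
```

With a table built from a large list of real names, names whose score falls below some cutoff could be sent to the admin review queue; the cutoff has to be found by trying it on known good and known bogus names.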
What do you think about delegating the responsibility of creating users to a third party source (like Facebook, Twitter, OpenId...)?
Doing that will not solve your problem, but it will be more work for a user to create additional accounts - which (assuming that the users are lazy, since most are) should discourage the creation of additional "dummy" users.
It seems as though you are going to need a fairly complex preg function. I don't want to take the time to write one for you, as you will learn more writing it yourself, but I will help along the way if you post some attempts.
http://php.net/manual/en/function.preg-match.php
