Prevent search abuse - PHP

I am unable to google anything useful on this subject, so I'd appreciate either links to articles that deal with this subject, or direct answers here; either is fine.
I am implementing a search system in PHP/MySQL on a site that gets quite a lot of visitors, so I am going to restrict the minimum length of the search term a visitor is allowed to enter, and the minimum time required between two searches. Since I'm kind of new to these problems and don't really know the "real reasons" why this is usually done, it's only my assumption that the character minimum length is implemented to minimize the number of results the database will return, and that the time between searches is implemented to prevent robots from spamming the search system and slowing down the site. Is that right?
And finally, there's the question of how to implement the minimum time between two searches. The solution I came up with, in pseudo-code, is this:
Set a test cookie at the URL where the search form is submitted to
Redirect user to the URL where the search results should be output
Check if the test cookie exists
If not, output a warning that he isn't allowed to use the search system (is probably a robot)
Check if a cookie exists that tells the time of the last search
If this was less than 5 seconds ago, output a warning that he should wait before searching again
Search
Set a cookie with the time of last search to current time
Output search results
Is this the best way to do it?
I understand this means visitors who have cookies disabled will not be able to use the search system, but is that really a problem these days? I couldn't find statistics for 2012, but I managed to find data saying 3.7% of people had cookies disabled in 2009. That doesn't seem like a lot, and I suppose it's probably even lower these days.
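For illustration, here is a minimal PHP sketch of the cookie-based flow described in the steps above. The 5-second window, the cookie names and the run_search() helper are placeholders, not a finished implementation:

    <?php
    // search.php - a sketch of the cookie-based throttle described in the steps above.
    // Assumes the form page already set a test cookie before redirecting here;
    // run_search() is a placeholder for the real query.

    $now = time();

    // 1. If the test cookie is missing, assume a robot.
    if (!isset($_COOKIE['search_test'])) {
        exit('You are not allowed to use the search system (cookies are required).');
    }

    // 2. Enforce the minimum delay between searches.
    if (isset($_COOKIE['last_search']) && ($now - (int) $_COOKIE['last_search']) < 5) {
        exit('Please wait a few seconds before searching again.');
    }

    // 3. Run the actual search.
    $term    = isset($_GET['q']) ? $_GET['q'] : '';
    $results = run_search($term);

    // 4. Remember when this search happened (before any output is sent).
    setcookie('last_search', (string) $now, $now + 3600, '/');

    // 5. Output the results.
    foreach ($results as $row) {
        echo htmlspecialchars($row['title']) . "<br>\n";
    }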

"only my assumptions that the character minimum length is implemented to minimize the number of results the database will return". Your assumption is absolutely correct. It reduces the number of potential results, by forcing the user to think about, what it is they wish to search.
As far as bots spamming your search goes, you could implement a captcha; the most frequently used is reCAPTCHA. If you don't want to show a captcha right away, you can track (via the session) the number of times the user has submitted a search, and if X searches occur within a certain time frame, then render the captcha.
I've seen sites like SO and thechive.com implement this type of strategy, where the captcha isn't rendered right away, but will be rendered once a threshold is exceeded.
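As a rough sketch of that session-based threshold (the limits and the show_captcha() helper are hypothetical):

    <?php
    // Count searches per session; show a captcha once the visitor crosses a threshold.
    // The numbers and show_captcha() are placeholders.
    session_start();

    $window  = 60;  // seconds
    $maxFree = 10;  // searches allowed per window before a captcha is required

    if (!isset($_SESSION['search_count']) || time() - $_SESSION['window_start'] > $window) {
        $_SESSION['search_count'] = 0;
        $_SESSION['window_start'] = time();
    }

    $_SESSION['search_count']++;

    if ($_SESSION['search_count'] > $maxFree) {
        show_captcha(); // e.g. render reCAPTCHA and verify the response before searching
        exit;
    }

    // ... otherwise run the search as normal ...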

This way you're also preventing search engines from indexing your search results. A cleaner way of doing this would be:
Get IP where search originated
Store that IP in a cache system such as memcached and the time that query was made
If another query is sent from the same IP and less than X seconds have passed, simply reject it or make the user wait
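As a sketch, that IP check could look something like this with the Memcached extension; the key naming and the 5-second window are just examples:

    <?php
    // Reject searches from the same IP that arrive less than $minDelay seconds apart.
    $memcached = new Memcached();
    $memcached->addServer('127.0.0.1', 11211);

    $ip       = $_SERVER['REMOTE_ADDR'];
    $minDelay = 5;
    $key      = 'last_search_' . $ip;

    $last = $memcached->get($key);
    if ($last !== false && (time() - $last) < $minDelay) {
        exit('Please wait a few seconds before searching again.');
    }

    $memcached->set($key, time(), 60); // let the entry expire after a minute
    // ... run the search ...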
Another thing you can do to increase performance is to take a look at your analytics and see which queries are made most often, and cache those, so when a request comes in you serve the cached version instead of doing the full DB query, parsing, etc.
Another, more naive option would be to have a script run 1-2 times a day that runs all the common queries and creates static HTML files that users hit for those particular search queries instead of hitting the DB.
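For the static-file idea, the daily script could be as simple as this sketch; the query list, paths and helper functions are all made up:

    <?php
    // Run 1-2 times a day from cron: pre-render HTML for the most common queries.
    $commonQueries = array('action', 'comedy', 'drama'); // e.g. pulled from your analytics

    foreach ($commonQueries as $q) {
        $results = run_search($q);                 // placeholder for the real query
        $html    = render_results_page($results);  // placeholder for your template code
        file_put_contents('/var/www/cache/search_' . md5($q) . '.html', $html);
    }
    // The search front end then checks for a cached file before hitting the DB.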

Related

Page hit count or Google Analytics?

I'm developing a website that is sensitive to page visits. For instance, it has sections that will show users which parts of the website (which items) have been visited the most. To implement this feature, two strategies come to mind:
Create a page hit counter, sort the pages by the number of visits and pick the highest ones.
Create a Google Analytics account and use its info.
If the first strategy is chosen, I would need a very fast and accurate hit counter with the ability to distinguish unique IPs (or users). I believe that using MySQL wouldn't be a good choice, since a lot of page visits means a lot of DB locks and performance problems. I think a fast logging class would be a good option.
The second option seems very interesting given all the problems the first one raises, but I don't know if there is a way (like an API) for Google Analytics to let me access the information I want. And if there is, is it fast enough?
Which approach (or even an alternative approach) do you suggest I take? Which one is faster? Performance is my top priority. Thanks.
UPDATE:
Thank you. It's interesting to see the different answers. These answers reminded me of an important factor. My website updates the "most visited" items every 8 minutes, so I don't need the data in real time, but I need it to be accurate enough every 8 minutes or so. What I had in mind was this:
Log every page visit to a simple text log file
Send a cookie to the user to separate unique users
Every 8 minutes, load the log file, collect the info and update the MySQL tables.
That said, I wouldn't want to reinvent the wheel. If a 3rd party service can meet my requirements, I would be happy to use it.
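A minimal sketch of that plan, assuming a cookie-tagged visitor and an append-only log file (the cookie name and log path are placeholders):

    <?php
    // On every page view: tag the visitor with a cookie and append one line to a log file.
    if (isset($_COOKIE['visitor_id'])) {
        $visitorId = $_COOKIE['visitor_id'];
    } else {
        $visitorId = uniqid('', true);
        setcookie('visitor_id', $visitorId, time() + 86400 * 365, '/');
    }

    $line = time() . "\t" . $visitorId . "\t" . $_SERVER['REQUEST_URI'] . "\n";
    file_put_contents('/var/log/myapp/hits.log', $line, FILE_APPEND | LOCK_EX);

    // A cron job running every 8 minutes would then read the file, aggregate the
    // counts per page, update the MySQL tables and truncate the log.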
Given you are planning to use the page hit data to determine what to display on your site, I'd suggest logging the page hit info yourself. You don't want to be reliant upon some 3rd-party service that you'd have to interrogate in order to build your page. This is especially true if you are loading that data in real time, as you'd have to interrogate that service for every incoming request to your site.
I'd be inclined to save the data yourself in a database. If you're really concerned about the performance of the inserts, then you could investigate intercepting requests (I'm not sure how you go about this in PHP, but I'm assuming it's possible) and then passing the request data off to a separate thread to store the request info. By having a separate thread handle the logging, you won't interrupt your response to the end user.
Also, given you are planning on using the data collected to "... show the users which parts of the website (which items) have been visited the most", you'll need to think about how you access this data to build your dynamic page. Maybe it'd be good to store a consolidated count for each resource. For example, rather than having 30,000 rows showing that index.php was requested, have one row showing index.php was requested 30,000 times. This would certainly be quicker to reference than having to run queries on what could become quite a large table.
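For that consolidated-count idea, a single upsert per hit keeps the table at one row per resource; this sketch assumes a hypothetical page_hits table:

    <?php
    // One row per page: bump a counter instead of inserting one row per hit.
    // Assumes a table like: page_hits(page VARCHAR(255) PRIMARY KEY, hits INT)
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

    $stmt = $pdo->prepare(
        'INSERT INTO page_hits (page, hits) VALUES (:page, 1)
         ON DUPLICATE KEY UPDATE hits = hits + 1'
    );
    $stmt->execute(array(':page' => $_SERVER['REQUEST_URI']));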
Google Analytics has a latency to it and it samples some of the data returned to the API so that's out.
You could try the API from Clicky. Bear in mind that:
Free accounts are limited to the last 30 days of history, and 100 results per request.
There are many examples of hit counters out there, but it sounds like you didn't find one that met your needs.
I'm assuming you don't need real-time data. If that's the case, I'd probably just read the data out of the web server log files.
Your web server can distinguish IP addresses. There's no fully reliable way to distinguish users. I live in a university town; half the dormitory students have the same university IP address. I think Google Analytics relies on cookies to identify users, but shared computers make that somewhat less than 100% reliable. (But that might not be a big deal.)
"Visited the most" is also a little fuzzy. The easy way out is to count every hit on a particular page as a visit. But a "visit" of 300 milliseconds is of questionable worth. (Probably realized they clicked the wrong link, and hit the "back" button before the page rendered.)
Unless there are requirements I don't know about, I'd probably start by using awk to extract timestamp, ip address, and page name into a CSV file, then load the CSV file into a database.
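If you'd rather stay in PHP than use awk, the extraction step might look roughly like this, assuming the Apache combined log format and made-up file paths:

    <?php
    // Pull timestamp, IP address and requested page out of an Apache combined log
    // and write them to a CSV that can then be loaded into MySQL (LOAD DATA INFILE).
    $in  = fopen('/var/log/apache2/access.log', 'r');
    $out = fopen('/tmp/hits.csv', 'w');

    while (($line = fgets($in)) !== false) {
        // e.g. 1.2.3.4 - - [10/Oct/2012:13:55:36 +0000] "GET /index.php HTTP/1.1" 200 ...
        if (preg_match('/^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)/', $line, $m)) {
            fputcsv($out, array($m[2], $m[1], $m[3])); // timestamp, ip, page
        }
    }

    fclose($in);
    fclose($out);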

API autocomplete call (performance, etc...)

I'm just starting to plan out a web app which allows users to save information about movies. This relies on the TMDb API. Now, I'd like to include an autocomplete feature for when they are searching for a movie. Do you think it's wise to make an API call onKeyUp? Or to wait for a certain amount of time after a keyUp? Overall, is this the best way to go about it?
I will be using PHP, jQuery and saving/retrieving user data with MySQL
Delay after keyup unless you (or the server you are hitting) have the speed to handle it. You'll have to account for race conditions anyway, but making that many calls isn't going to be very helpful. Your queries to the API are going to be slower than most users' typing speed, which means you'll be making unnecessary calls to the API, using both your and your users' bandwidth.
Also, I would set a minimum number of characters entered before you query (~3 is probably good). This will also help lower the number of queries, and you won't be running a query for 'a' or even 'ap', which could both be a lot of things. Once you get to 3 ('app') you can return a much smaller list of results, which is more helpful for the user.
You could use the TypeWatch jQuery plugin or something similar to only call the API after the user has stopped typing for a certain amount of time. Stack Overflow's Tags and Users pages, for example, use TypeWatch to only process the input after the user has stopped typing for 500ms.
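On the server side, the endpoint your delayed jQuery call hits can enforce the minimum length itself before touching TMDb. A rough sketch, where the API key and the exact TMDb URL are assumptions you'd replace with your own client code:

    <?php
    // autocomplete.php - called by the client once the user has paused typing.
    header('Content-Type: application/json');

    $q = isset($_GET['q']) ? trim($_GET['q']) : '';

    // Don't bother the API for 'a' or 'ap'.
    if (strlen($q) < 3) {
        echo json_encode(array());
        exit;
    }

    // Assumes the v3 movie search endpoint; adjust to whatever your TMDb client uses.
    $url      = 'https://api.themoviedb.org/3/search/movie?api_key=YOUR_KEY&query=' . urlencode($q);
    $response = file_get_contents($url);

    echo ($response !== false) ? $response : json_encode(array());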

Big sites showing less data

I look after a large site and have been studying other similar sites. In particular, I have had a look at flickr and deviantart. I have noticed that although they say they have a whole lot of data, they only ever display a certain amount of it.
I presume this is for performance reasons, but does anyone have an idea as to how they decide what to show and what not to show? Classic example: go to flickr, search a tag. Note the number of results stated just under the page links. Now calculate which page that would be and go to that page. You will find there is no data on that page. In fact, in my test, flickr said there were 5,500,000 results, but only displayed 4,000. What is this all about?
Do larger sites get so big that they have to start taking old data offline? Deviantart has a wayback function, but I'm not quite sure what that does.
Any input would be great!
This is a type of performance optimisation. You don't need to scan the full table if you already have 4,000 results; a user will not go to page 3,897. When flickr runs a search query, it finds the first 4,000 results and then stops, so it doesn't spend CPU and I/O time finding useless additional results.
I guess in a way it makes sense. If, upon searching, the user does not click on any link up to page 400 (assuming each page has 10 results), then either the user is a moron or a crawler is involved in some way.
Seriously speaking, if no favorable result is yielded by page 40, the company concerned might need to fire their whole search team & adopt Lucene or Sphinx :)
What I mean is they will be better off trying to improve their search accuracy than battling infrastructure problems trying to show more than 4000 search results.

Complex search workaround

Before anything, this is not necessarily a question, but I really want to know your opinion about the performance and possible problems of this "mode" of search.
I need to create a really complex search on multiple tables with lots of filters, ranges and rules... And I realize that I can create something like this:
Submit the search form
Internally I run every filter and logic step-by-step (this may take some seconds)
After I find all the matching records (the result that I want), I create a record in my searches table, generating a token for this search (based on the search params) like 86f7e437faa5, and save all the matching records' IDs
Redirect the visitor to a page like mysite.com/search?token=86f7e437faa5
And, on the results page, I only need to work out which search I'm dealing with and paginate the result IDs (retrieved from the searches table).
This will make refreshing & pagination much faster since I don't need to run all the search logic on every pageview. And if the user changes a filter or search criterion, I go back to step 2 and generate a new search token.
I never saw a tutorial or anything about this, but I think that's what some forums like BBForum or Invision do with search, right? After the search I'm redirected to something like search.php?id=1231 (I don't see the search params in the URL or inside the POST args).
This "token" will no last longer than 30min~1h.. So the "static search" is just for performance reasons.
What do you think about this? Will it work? Any considerations? :)
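For reference, a rough sketch of steps 2-4 above; the searches table layout and the run_complex_search() helper are made up:

    <?php
    // Run the heavy search once, save the matching IDs under a token,
    // then redirect so pagination only ever reads the saved IDs.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    $params = $_POST;                        // all the filters / ranges / rules
    ksort($params);                          // same params => same token
    $token  = substr(md5(serialize($params)), 0, 12);

    $stmt = $pdo->prepare('SELECT 1 FROM searches WHERE token = ?');
    $stmt->execute(array($token));

    if (!$stmt->fetchColumn()) {
        $ids = run_complex_search($params);  // placeholder for the slow, step-by-step part
        $ins = $pdo->prepare(
            'INSERT INTO searches (token, result_ids, created_at) VALUES (?, ?, NOW())'
        );
        $ins->execute(array($token, implode(',', $ids)));
    }

    header('Location: /search?token=' . $token);
    exit;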
Your system can use a special token like 86f7e437faa5 to cache search requests. It's a very useful mechanism for system efficiency and scalability.
But the user must still be able to see all the parameters, in accordance with usability principles.
So generating a hash of the parameters on the fly on the server side is a good solution. The system checks for the existence of the generated hash in the searches table and returns the result if found.
If there is no hash, the system runs the query against the base tables and saves the new result into the searches table.
Seems logical enough to me.
Having said that, given the description of your application, have you considered using Sphinx? Regardless of the number of tables and/or filters and/or rules, all the time-consuming work is in the indexing, which is done beforehand/behind the scenes. The filtering/rules/fields/tables are all handled quickly and on the fly after the fact.
So, similar to your situation, Sphinx could give you your set of ID's very quickly, since all the hard work was pre-done.
TiuTalk,
Are you considering keeping searches saved in your "searches" table? If so, remember that your param-based generated token will remain the same for a given set of parameters over time. If your search base is frequently altered, you can't rely on saved searches, as they may return outdated results. Otherwise, it seems like a good solution overall.
I'd rather base the token on the user session. What do you think?
#g0nc1n
Sphinx seems to be a nice solution if you have control of your server (in a VPS for example).
If you don't, and a simple full-text search isn't enough for you, I guess this is a nice solution. But it doesn't seem so different to me from a paginated search with caching; it seems better than a paginated search with simple URL-referred caching, but you still have the problem of the searches remaining static. I recommend you flush the saved searches from time to time.

Twitter competition ~ saving tweets (PHP & MySQL)

I am creating an application to help our team manage a twitter competition. So far I've managed to interact with the API fine, and return a set of tweets that I need.
I'm struggling to decide on the best way to handle the storage of the tweets in the database, how often to check for them and how to ensure there are no overlaps or gaps.
You can get a maximum of 100 tweets per page. At the moment, my current idea is to run a cron script, say, once every 5 minutes or so, grab a full 100 tweets at a time, and loop through them, looking in the DB to see if each one already exists before adding it.
This has the obvious drawback of running 100 queries against the DB every 5 minutes, plus however many INSERTs there are, which I really don't like. Plus I would much rather have something a little more real-time. As Twitter is a live service, it stands to reason that we should update our list of entrants as soon as they enter.
This again throws up the drawback of having to repeatedly poll Twitter which, although it might be necessary, means hammering their API in a way I'm not sure I want to.
Does anyone have any ideas on an elegant solution? I need to ensure that I capture all the tweets and don't leave anyone out, while keeping each user unique in the DB. I have considered just adding everything and then grouping the resulting table by username, but it's not tidy.
I'm happy to deal with the display side of things separately as that's just a pull from mysql and display. But the backend design is giving me a headache as I can't see an efficient way to keep it ticking over without hammering either the api or the db.
100 queries in 5 minutes is nothing, especially since a tweet has essentially only four pieces of data associated with it: user ID, timestamp, tweet text, tweet ID - say, about 170 characters' worth of data per tweet. Unless you're running your database on a 4.77MHz 8088, your database won't even blink at that kind of "load".
The Twitter API offers a streaming API that is probably what you want to do to ensure you capture everything:
http://dev.twitter.com/pages/streaming_api_methods
If I understand what you're looking for, you'll probably want to use statuses/filter, with the track parameter set to whatever distinguishing characteristics (hashtags, words, phrases, locations, users) you're looking for.
Many Twitter API libraries have this built in, but basically you keep an HTTP connection open and Twitter continuously sends you tweets as they happen. See the streaming API overview for details on this. If your library doesn't do it for you, you'll have to check for dropped connections and reconnect, check the error codes, etc - it's all in the overview. But adding them as they come in will allow you to completely eliminate duplicates in the first place (unless you only allow one entry per user - but that's client-side restrictions you'll deal with later).
As far as not hammering your DB, once you have Twitter just sending you stuff, you're in control on your end - you could easily have your client cache up the tweets as they come in, and then write them to the db at given time or count intervals - write whatever it has gathered every 5 minutes, or write once it has 100 tweets, or both (obviously these numbers are just placeholders). This is when you could check for existing usernames if you need to - writing a cached-up list would allow you the best chance to make things efficient however you want to.
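The batching idea might look something like this sketch; the buffer size and the insert_tweets() helper are placeholders:

    <?php
    // Buffer incoming tweets and flush them to MySQL in one multi-row INSERT.
    // insert_tweets() is a placeholder; your streaming library would call
    // handle_tweet() once per incoming status.
    $buffer  = array();
    $flushAt = 100; // or flush on a timer, whichever comes first

    function handle_tweet(array $tweet)
    {
        global $buffer, $flushAt;
        $buffer[] = $tweet;
        if (count($buffer) >= $flushAt) {
            flush_buffer();
        }
    }

    function flush_buffer()
    {
        global $buffer;
        if (empty($buffer)) {
            return;
        }
        insert_tweets($buffer); // builds one INSERT ... VALUES (...), (...), ... statement
        $buffer = array();
    }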
Update:
My solution above is probably the best way to do it if you want to get live results (which it seems like you do). But as is mentioned in another answer, it may well be possible to just use the Search API to gather entries after the contest is over, and not worry about storing them at all - you can specify pages when you ask for results (as outlined in the Search API link), but there are limits as to how many results you can fetch overall, which may cause you to miss some entries. What solution works best for your application is up to you.
I read over your question and it seems to me that you want to duplicate data already stored by Twitter. Without more specifics on the competition you're running (how users enter, for example, or the estimated number of entries), it's impossible to know whether or not storing this information locally in a database is the best way to approach the problem.
A better solution might be to skip storing duplicate data locally and pull the entrants directly from Twitter, i.e. when you're attempting to find a winner.
You could then eliminate duplicate entries on the fly while the code is running. You would just need to call "the next page" once it's finished processing the 100 entries it's already fetched. Although I'm not sure if this is possible directly through the Twitter API.
I think running a cron every X minutes and basing it on the tweets' creation date may work. You can query your database to find the date/time of the last recorded tweet, and then only check for duplicates among tweets with matching times. Then, when you do your inserts into the database, use one or two INSERT statements containing all the entries you want to record, to keep performance up.
INSERT INTO `tweets` (id, date, ...) VALUES (..., ..., ...), (..., ..., ...), ...;
This doesn't seem too intensive...also depends on the number of tweets you expect to record though. Also make sure to index the table properly.
