Using Data Scraping Scripts The Right Way - php

I am playing around with data scraping scripts. For now I am starting with PHP/cURL. The first reason I am interested in learning this is to understand how these scripts are designed, so I can better protect my own websites against the sneaky malicious ones. The second reason is to design mine to act like a human, so it avoids placing an undue burden on a website owner's server.
If I use this in real life, it would simply be to automate what I currently do manually. I don't want to abuse the process, but I am a bit lazy and would rather not do it by hand.
To perform like a human (a rough cURL sketch follows this list):
1) Send headers that look like a browser's.
2) Send a referrer that reflects the source of the link (the natural sequence of pages).
3) Add randomized delays between page fetches, similar to how a human would browse.
4) Clear cookies when done. (I have to learn more about this; I am not sure how cookies function in a web scraping environment.)
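Here is a minimal PHP/cURL sketch of points 1-4. The URLs, user agent string, delay range and cookie handling are placeholder choices for illustration, not recommendations:

<?php
// Hypothetical target and referrer - use the same sequence of pages you would visit manually.
$url      = 'https://example.com/listing?page=2';
$referrer = 'https://example.com/listing?page=1';

$cookieJar = tempnam(sys_get_temp_dir(), 'scrape_cookies_');

$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    // 1) Headers that look like a normal browser request.
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0',
    CURLOPT_HTTPHEADER     => array(
        'Accept: text/html,application/xhtml+xml',
        'Accept-Language: en-US,en;q=0.5',
    ),
    // 2) A referrer that matches the page the link would have come from.
    CURLOPT_REFERER        => $referrer,
    // 4) Cookies are kept in a jar file for the session...
    CURLOPT_COOKIEJAR      => $cookieJar,
    CURLOPT_COOKIEFILE     => $cookieJar,
));
$html = curl_exec($ch);
curl_close($ch);

// 3) Randomized delay, roughly how long a human would spend reading a page.
sleep(rand(5, 20));

// 4) ...and the jar is deleted when the run is finished.
unlink($cookieJar);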
If the tools above are used correctly, is IP proxy switching necessary? Are there any other considerations I should be aware of? Still learning about this, so just curious at this point.

Related

How to prevent someone from scraping my website data?

I am using PHP to write my server side code for my website. What is the best way to prevent someone from scraping my data?
For example, how can I prevent someone from using file_get_contents() on my pages, fetching my login form in an iframe element, or capturing the data entered in the login form?
I am using PHP 5.4.7, MySQL, HTML and CSS.
I think that being a web-developer these days is terrifying and that maybe there is a temptation to go into "overkill" when it comes to web security. As the other answers have mentioned, it is impossible to stop automated scraping and it shouldn't worry you if you follow these guidelines:
It is great that you are considering website security. Never change.
Never send anything from the server you don't want the user to see. If the user is not authorised to see it, don't send it. Don't "hide" important bits and pieces in jQuery.data() or data-attributes. Don't squirrel things away in obfuscated JavaScript. Don't use techniques to hide data on the page until the user logs in, etc, etc.
Everything - everything - is visible if it leaves the server.
If you have content you want to protect from "content farm" scraping, use email-verified user registration (including some form of GOOD reCAPTCHA to confound most of the bots).
Protect your server!!! As best you can, make sure you don't leave any common exploits. Read this -> http://owasp.org/index.php/Category:How_To <- Yes. All of it ;)
Prevent direct access to your files. The more traditional approach is defined('_SOMECONSTANT') or die('No peeking, hacker!'); at the top of your PHP document. If the file is not accessed through the proper channels, nothing important will be sent from the server.
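A bare-bones sketch of that pattern, with made-up file and constant names:

<?php
// index.php - the proper entry point
define('_SOMECONSTANT', 1);
require 'includes/members.php';

<?php
// includes/members.php - bails out if it is requested directly
defined('_SOMECONSTANT') or die('No peeking, hacker!');
// ...the markup and queries you actually want to serve go here...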
You can also meddle with your .htaccess or go large and in charge.
Are you perhaps worried about cross site scripting (XSS)?
If you are worried about data being intercepted when the user enters login information, you can implement double verification (like Facebook) or use SSL.
It really all boils down to what your site will do. If it is a run of the mill site, cover the basics in the bullet points and hope for the best ;) If it is something sensitive like a banking site... well... don't do a banking site just yet :P
Just as an aside: I never touch credit card numbers and such. Any website I develop will politely API on to a company with insurance and fleets of staff dedicated to security (not just little old me and my shattered nerves).
No, there is no way to make sure of this. You can implement some JavaScript functions that try to prevent it, but if the client simply deactivates JS (or a server just ignores it), you can't prevent it.
It is really hard to prevent this. I have found a similar discussion here. That will answer most of your queries, but if you want even stronger protection, sophisticated programs and services like Scrapesentry and Distil would be needed.
Using JavaScript or PHP you can only reduce data scraping; you can't stop it.
The browser has to read the HTML, so a user can always view your page source and take the data from there. You can disable key events, but you still can't stop the scraping.

How do I protect my AJAX services?

Right now I'm working on a service that handles reviews/recommendations of local restaurants overlaid on Google Maps. Basically Yelp, but restricted to a certain niche. Anyhow, since I don't want to have to load every location and review at once, I'm finally getting into using jQuery and AJAX calls.
The question I have is: How do I prevent other people from 'scraping' data from my ajax scripts on the server?
The main map/location info functionality needs to be public, in that users should not have to log in to use the application, so it may simply boil down to making it difficult to scrape. I'm hoping that one of you AJAX veterans out there can point me in the direction of a better idea, or some 'best practices' docs that I haven't been able to find yet.
So far all I've been able to come up with is:
The user-facing scripts open a short-lived session on the server and the AJAX calls will not run without an active session.
Send some sort of access key along with the application code and require that in all of the AJAX calls. But not sure how to best implement this in a way that's not trivially easy to get around.
You can't completely protect your AJAX web services. Even if you mangle your data and obfuscate your source code, it is trivial to just fire up a packet sniffer or debugging proxy, figure it out, and scrape from it.
What I would do is exactly what you propose... only users with an active session on the site can make calls. Then from there, throttle requests.
Even a busy normal user won't make more than a handful of requests per minute. You can analyze your logs to figure out what a good number would be. Even if you limited your service to 20 calls per minute, that kind of limitation makes it fairly useless for folks that want to duplicate all of your content.
Don't limit just on session data either... keep an eye on IP addresses. It's entirely possible to fire off a request and get a new session at any time. Periodically check your logs to see if anything is getting through, and adjust your strategy accordingly.
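One rough way to combine the session check with per-IP throttling, assuming the APCu extension is available for the counters; the $_SESSION['user'] key and the limits are invented, so tune them from your own logs:

<?php
// Include at the top of each AJAX endpoint.
session_start();
if (!isset($_SESSION['user'])) {         // no active session on the site: refuse
    http_response_code(403);
    exit;
}

$key   = 'hits_' . $_SERVER['REMOTE_ADDR'];
$count = apcu_fetch($key);
if ($count === false) {
    apcu_store($key, 1, 60);             // first request in this 60-second window
} elseif ($count >= 20) {                // ~20 calls a minute is generous for a human
    http_response_code(429);             // Too Many Requests
    exit;
} else {
    apcu_inc($key);
}
// ...normal request handling continues below...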
Finally, regularly search for your content. Google is a great tool for finding copyright infringers. If you use specific data, such as GPS coordinates, you can actually watermark the coordinates with a specific value in the noise area of the coordinate.
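A sketch of that kind of coordinate watermark: plant a chosen digit in a decimal place that sits below GPS accuracy, so copied data can later be identified. The digit, precision and example value are arbitrary:

<?php
// Plant $digit in the 6th decimal place (roughly 0.1 m, below consumer GPS accuracy).
function watermarkCoordinate($coordinate, $digit)
{
    $rounded = round($coordinate, 5);       // discard the real sixth decimal
    return $rounded + ($digit / 1000000);   // ...and replace it with your marker
}

echo watermarkCoordinate(40.712776, 7);     // prints 40.712787

If the trailing digit pattern shows up in someone else's database, you have a strong hint of where the data came from.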
From what I gather, you want to protect the JavaScript side of the service. This is not possible, as JavaScript is essentially fully open source (albeit not public domain).
Google offers a tool called Google Closure, which can compact the script by removing whitespace and tabs. It can also obfuscate a document for you by replacing variable names and function names with random characters. It is customizable, so you can tell it what you want. From what I can tell, Google uses it for their own website (evident by viewing the source of their pages).

Security of Flex for payment website

So, it's been about 3 years since I wrote and went live with my company's main internet-facing website. It was originally written in PHP, and I've since just been making minor changes here and there to progress the site as we've needed to.
I've wanted to rewrite it from the ground up in the last year or so and now, we want to add some major features so this is a perfect time.
The website in question is as close to a banking website as you'd get (without being a bank; sorry for the obscurity, but the less info I can give out, the better).
For the rewrite, I want to separate the presentation layer from the processing layer as much as I can. I want the end user to be stuck in a box and not be able to get out, so to speak
(this is all because of PCI compliance, being pen tested every 3 months, etc...)
So, being probed every 3 months has increasingly made me nervous. We haven't failed yet and there hasn't been a breach yet, but I want to make sure I continue to pass (as much as I can, anyway).
So, I'm considering rewriting the presentation layer in Adobe Flex and doing all the processing in PHP (effectively, in my opinion, separating presentation from processing). I would do all my normal form validation in Flex (as opposed to JavaScript or PHP) and do my reads and writes to the database via PHP.
My questions are:
I know Flash has something like 99% market penetration - do people find this to be true? Has anyone seen, on their own Flash-based sites, cases where someone couldn't access them?
Flash in general has come under a lot of attacks about security and the like - I know this. I would use a SWF encryptor, disable debugging (which I got snagged on once in a different application), continue to use HTTPS and any other means I can think of.
At the end of the day, everyone knows that if someone wants the data badly enough, they're going to find a way in; I just want to make it as difficult for them as I can.
Any thoughts are appreciated.
-Mario
There are always people who, for one reason or another, don't install the Flash plugin. Bear in mind that these are distinctly in the minority. Realize also that some people still refuse to enable Javascript. The question you have to ask yourself is whether this small group is enough to get you to move off of some newer technologies.
If the answer to that is yes, you will have to resort to vanilla HTML form processing, sending everything to the server for validation, etc.
If the answer is no, don't be afraid to use Flex. It works fine with https protocol, and is as secure as you want. That said, I wouldn't use it for username/password validation on the client; that information should always be encrypted and sent to a secure server. But validation of other types of field (phone number, etc.) shouldn't be a problem.
There are definitely people who don't have Flash installed and yes, there are people who have JavaScript disabled. But no matter whether you develop for the common denominator which is plain HTML forms or if you go high end, e.g. Flex or AJAX, never ever rely on the client to validate the inputs. It's a good first step, but everything that comes from the client, be it Flash or Ajax or Silverlight or whatever, could be forged.

Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course (a rough sketch follows at the end of this answer).
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
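As a rough illustration of the "encode data as images" option using PHP's GD extension; the value, dimensions and built-in font are arbitrary, and a real version would look the value up by ID and cache the result:

<?php
// phone.php - renders a single value (e.g. a phone number) as a PNG,
// so the page embeds <img src="phone.php?id=123"> instead of printing text.
$value = '555-0100';                      // in practice, fetched from the database

$img = imagecreatetruecolor(120, 20);
$bg  = imagecolorallocate($img, 255, 255, 255);
$fg  = imagecolorallocate($img, 0, 0, 0);
imagefilledrectangle($img, 0, 0, 119, 19, $bg);
imagestring($img, 4, 5, 2, $value, $fg);  // built-in GD font, no TTF file needed

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);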
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (i.e. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but would prevent simple page scraping.
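A minimal sketch of that time-limited hash, where the shared secret and the 10-second lifetime are placeholder choices:

<?php
// page.php - renders the page shell and issues the hash
session_start();
$secret = 'replace-with-a-long-random-server-side-secret';
$_SESSION['ajax_issued'] = time();
$hash = sha1($secret . session_id() . $_SESSION['ajax_issued']);
// Echo $hash into the page so the AJAX call can send it back, e.g. ?hash=...

<?php
// data.php - the AJAX endpoint; the hash is only honoured for 10 seconds
session_start();
$secret = 'replace-with-a-long-random-server-side-secret';
$ok = isset($_GET['hash'], $_SESSION['ajax_issued'])
    && (time() - $_SESSION['ajax_issued']) <= 10
    && $_GET['hash'] === sha1($secret . session_id() . $_SESSION['ajax_issued']);
if (!$ok) {
    http_response_code(403);
    exit;
}
echo json_encode(array(/* ...the requested data... */));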
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
Force a reCAPTCHA every 10 page loads for each unique IP.
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not able to be scraped.
He's asking for two contradictory things, and maybe having a discussion about his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to a bulletin board's troll protection... If a scrape is detected (perhaps a certain number of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
Normally, to screen-scrape a decent amount of data, one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
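A rough sketch of that escalating delay; the 60-second window, the thresholds and the cap are invented for illustration, and since sleeping ties up a PHP worker a real implementation might prefer to return a 429 or the CAPTCHA straight away:

<?php
session_start();

// Keep only the page loads from the last 60 seconds.
$now    = time();
$recent = isset($_SESSION['loads']) ? $_SESSION['loads'] : array();
$recent = array_values(array_filter($recent, function ($t) use ($now) {
    return $now - $t < 60;
}));
$recent[] = $now;
$_SESSION['loads'] = $recent;
$count = count($recent);

if ($count > 30) {
    header('Location: /captcha.php');        // hypothetical CAPTCHA page
    exit;
} elseif ($count > 10) {
    sleep(min(pow(2, $count - 10), 60));     // 11th load waits 2s, 15th 32s, capped at 60s
}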
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would be to just include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.

Load testing the UI

I have been working on a site that makes some pretty big use of AJAX and dynamic JavaScript on the front end and it's time to start stress testing. But how do you properly stress test something that requires clicking several links on the front-end? One way I was able to easily hit every page of the site quickly and repeatedly was to point a Google Mini at it. But that's not going to click links and then navigate Modal windows and things like that.
Edit - I should point out that the site is done in PHP5 and the JavaScript library used is jQuery. Not sure if this would make any difference but felt it might be useful to know.
JMeter is great at this. You may record your sessions and tweak them to your liking.
So-called 'ajax load testing' is a recurring subject on this site, and is often confused. So let's get it straight: There is really no difference between load testing a normal web page and load testing with ajax. It all boils down to discrete requests; they just happen to not be full page refreshes.
One thing to keep in mind is that there is a distinct difference between load testing the server processing the requests (a load test) and the on-screen performance of the UI components being updated (how well your JavaScript performs).
Simple load test example:
initial page load
login
navigate?
5-10 'ajax' requests (or whatever may fit your application usage pattern)
logout
There are load testing tools that can support AJAX. For example, WebLoad
http://www.radview.com/solutions/ajax-load-testing.aspx
What you really want to stress test is the server's ability to handle the AJAX requests. Use a load tool that looks at the requests while "recording" the test, and then tune as appropriate. I have only used the VS Test Edition one, so I can't point you to a low-cost one.
I disagree with Nathan and Freddy to some degree. They are correct that "AJAX testing" is really no different in that HTTP requests are made. But it's not that simple. See my article on Ajaxian.com on Why Load Testing Ajax is Hard.
JMeter, Pylot, and The Grinder are all great tools for generating HTTP requests (I personally recommend Pylot). But at their core, they don't act as a browser and process JavaScript, meaning all they do is replay the traffic they saw at record time. If those AJAX requests were unique to that session, they may not be suitable/correct to replay in large volumes.
The fact is that as more logic is pushed down in to the browser, it becomes much more difficult (if not impossible) to properly simulate the traffic using traditional load testing tools.
In my article, I give a simple example of how difficult it becomes to test something like Google's home page when you want to query thousands of different search terms (an important goal during load testing). To do it with JMeter/Pylot/Grinder you effectively end up re-writing parts of the AJAX code (in your case with jQuery) over again in the native language of the tool.
It gets even more complex if your goal is to measure the response time as perceived by the user (which is arguably the most important thing at the end of the day). For really complex applications that use Comet/"Reverse Ajax" (a technique that keeps open sockets for long periods of time), traditional load tools don't work at all.
My company, BrowserMob, provides a load testing service that uses Firefox browsers, powered by Selenium, to drive hundreds or thousands of real browsers, allowing you to measure and time the performance of visual elements as seen in the browser. We also support traditional virtual users (blind HTTP traffic) and a simulated browser (via HtmlUnit).
All that said, usually a mix of a service like BrowserMob plus traditional load testing is the right approach. That is, real browsers are great for a full-fidelity load test, but they will never be as economical as "virtual users", since they require 10-100X more RAM and CPU. See my recent blog post on whether to simulate or not to simulate virtual users.
Hope that helps!
You could use something like openSTA.
This allows a session with a web site to be recorded and then played back via a relatively simple script language.
You can also easily test web services and write your own scripts.
It allows you to put scripts together in a test in any way you want and configure the number of iterations, the number of users in each iteration, the ramp up time to introduce each new user and the delay between each iteration. Tests can also be scheduled in the future.
It's open source and free.
It produces a number of reports which can be saved to a spreadsheet. We then use a pivot table to easily analyse and graph the results.
