Legal script that scrapes and indexes? - php

I want to create a website that scrapes certain websites (specified by me) to collect data and pricing and then offer that data as search results on my own site. So basically like a search engine, but for specific sites, indexed in a specific way. I can write this myself, but would like to know:
Is it legal? Can I grab for example, all the items off ebay, put it in a search engine and allow users to search ebay using my site?
What if I make money off this?
Are there any popular PHP scripts that already do this?
The legal aspect has been covered. I found a way around this (well, I got permission from the people creating the content)... so the only real question is: what can I use to crawl the content, especially keeping in mind that each site will have different rules that I will have to set up? It must also be clever enough not to spider the same content twice.

Is it legal?
Yes. And no. Probably.
There isn't one set of laws covering the entire planet, and SO isn't really the place for legal advice; you need to find a lawyer in your jurisdiction.
My own thoughts are that you would probably be okay in most jurisdictions as long as you use only the information. So, no eBay logos, no representations that you may be associated with them and so on.
But I am not a lawyer (though I deal a lot with the US sub-species as part of my work), certainly not your lawyer, and this advice (which isn't legal advice) is worth every cent you paid for it, which is ZERO!
What if I make money off this?
Good for you :-) Make mega-bucks. But see above point.
Are there any popular PHP scripts that already do this?
That's the bit I can't answer. My experience with PHP ranges somewhere between zero and nothing.

The legality is a bit shady in this area. You should look for the presence of a robots.txt ( http://www.robotstxt.org/robotstxt.html ) file to first determine if the website welcomes web spiders.
Also, there is a very good PHP search script called Sphider ( http://www.sphider.eu/ ) that you should have a look at.
EDIT:
I can't see many websites having an issue with you taking snippets of their content and then linking users to the page it came from. However, if you plan on just taking all their content and displaying it on your own website in order to make a profit, I can only assume many websites would have an issue, as they are the ones who should be profiting from the content.
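As a rough illustration of the robots.txt check mentioned above, here is a minimal PHP sketch. It is not a complete robots.txt parser (it ignores per-agent sections, Allow rules and wildcards), and the example URL and path are made up:

<?php
// Minimal robots.txt check: fetch the file and see whether a path falls under
// any Disallow rule. Not a full parser; per-agent sections, Allow rules and
// wildcards are ignored.
function isPathDisallowed($siteUrl, $path)
{
    $robots = @file_get_contents(rtrim($siteUrl, '/') . '/robots.txt');
    if ($robots === false) {
        return false; // no robots.txt found, so crawling is not explicitly forbidden
    }
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)
            && strpos($path, $m[1]) === 0) {
            return true; // $path falls under a Disallow rule
        }
    }
    return false;
}

var_dump(isPathDisallowed('http://example.com', '/search')); // made-up example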

1) Is it legal? Can I grab for example, all the items off ebay, put it in a search engine and allow users to search ebay using my site?
This is technically feasible. You can build a PHP script that does this quite easily. I would say it is borderline illegal, however, because by scraping content from somebody else's site you will be using their intellectual property and their data without permission.
2) What if I make money off this?
Then the original owners of the data are very likely to come after you, issue a cease and desist notice then sue you. An organization as large as ebay could do this without blinking.
3) Are there any popular PHP scripts that already do this?
Because of the questionable legal nature of your question, I highly doubt there are any scripts that already do this.
The correct technique for getting data from eBay and other large data providers is to use APIs, or application programming interfaces. These are special protocols designed for programs to communicate with each other. This has the benefit of being significantly more efficient than page-scraping, while also being a known legal way to get data from a provider.
More information about the eBay-specific API can be found here: http://developer.ebay.com/common/api/
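To make the API route concrete, here is a minimal sketch of querying a provider's HTTP API with cURL instead of scraping pages. The endpoint, parameter names and app key are invented placeholders, not the real eBay API; the actual names come from the provider's documentation linked above:

<?php
// Sketch of querying an HTTP API with cURL. The endpoint, parameters and app
// key below are invented placeholders; substitute the real ones from the
// provider's documentation (e.g. the eBay developer site).
$endpoint = 'https://api.example.com/search';          // hypothetical endpoint
$params   = http_build_query(array(
    'q'      => 'vintage camera',
    'appid'  => 'YOUR_APP_ID',                         // issued when you register
    'format' => 'json',
));

$ch = curl_init($endpoint . '?' . $params);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);        // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$body = curl_exec($ch);
curl_close($ch);

$data = json_decode($body, true);                      // most modern APIs return JSON
print_r($data);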

Related

How to extract/fetch and show data ( from other ecommerce site ) to my website using PHP

I have seen many of the eCommerce portals which are showing the list of products from another bigger eCommerce websites from across the world.
The fetching itself is not a big problem, I think, using file_get_contents or cURL in PHP. But the questions are:
Do they provide some api to allow others to fetch their data/product info?
Do we need to get their permissions to fetch data from their sites.
Is there some elegant and specific method/way to fetch data to show on our site (instead of cURL & file_get_contents)?
Some websites provide an API to access their data. Some cost money, some may be free. In any case, yes, you need permission.
But you can always scrape their sites without permission.
Here's some general guidelines on the subject.
You should check to see if they have a robots.txt file denying permission to spider some areas of the site.
Although there are copyright issues with reproducing content, search engines publish excerpts of site content all the time. Therefore to some extent, reproducing content is legally permissible.
APIs are sometimes available, but search engines scrape sites all the time without any sort of permission (except for perhaps the robots.txt files).
Respect the site owner's wishes concerning their bandwidth. Poorly written robot code can wastefully tie up server resources.
If you can get permission, all the better.
I use cURL and the DOMDocument class. I don't know what else you would want in terms of elegance.
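For what it's worth, the cURL + DOMDocument combination mentioned above might look roughly like this; the URL and the XPath query for product titles are made-up examples that you would adapt per site:

<?php
// Sketch of scraping with cURL and DOMDocument/DOMXPath. The URL and the
// <h2 class="title"> markup are invented examples.
$ch = curl_init('http://example.com/products');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyAggregator/1.0 (+http://mysite.example)');
$html = curl_exec($ch);
curl_close($ch);

if ($html === false) {
    die('Request failed');
}

$doc = new DOMDocument();
@$doc->loadHTML($html);                  // @ suppresses warnings from sloppy real-world HTML
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//h2[@class="title"]') as $node) {
    echo trim($node->textContent), PHP_EOL;
}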
Write a crawler to get all the data you want from those websites.
Use the APIs if provided, but they usually cost money.
Create your own APIs using third-party software.

PHP - CMS Recommendation For Licensing Type Script

So I'm trying to make my own mini CMS, partly just for my own learning, and once I get it good enough I'd like to sell it. Now for licensing, I know there are tons of licensing scripts you can pay for, but would the following approach be advisable?
I'd like to plant a script hidden in my CMS where instead of checking for some sort of key, it checks if your domain is allowed to run the CMS by running it past the main CMS database. Now I have two questions.
1.) Could I encrypt the code, so that if I wanted it to redirect to a page that just says "CMS Deactivated", for example, people couldn't go through the code just Ctrl-F searching for the key text?
2.) I was going to get the domain name using $_SERVER['SERVER_NAME']. Is that going to be a reliable way of checking the domain? I.e., will IIS pick up on it?
I'm not trying to completely extinguish cracking of the CMS, I know that is impossible.
Maybe you should consider housing the whole thing on your own servers and making the content accessible via a REST API. You can certainly restrict and control that way.
Providing a CMS with source code to any client opens you to evaluation and cleansing. Not saying there's no way, but I am saying it may be easier for you to provide the content via REST than to write perfect security. Especially if you're asking this question.
As I said in my comment, I think worrying about money is irrelevant for now, but here's some information for you to learn from.
1.) I haven't found an encryption solution that works. Any will require you to install additional PHP components (and no one wants to deal with that when there are plenty of free CMS's out there). There is code obfuscation, but that's iffy at best.
2.) According to this page, that should work on IIS!
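For what it's worth, a bare-bones sketch of the "run it past the main CMS database" idea could look like this. The licensing URL, the "ok" response and the deactivation page are invented, and keep in mind that anyone with the source can simply delete this check:

<?php
// Licence "phone home" sketch: the installed CMS reports its domain to your
// central server and redirects if the answer isn't "ok". URL, response format
// and redirect target are invented for illustration.
$domain = $_SERVER['SERVER_NAME'];        // available under both Apache and IIS
$reply  = @file_get_contents(
    'https://licensing.example.com/check.php?domain=' . urlencode($domain)
);

if ($reply !== 'ok') {
    header('Location: /deactivated.php'); // your "CMS Deactivated" page
    exit;
}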

To have multiple sub-domains or multiple separate domains?

My client has a host of Facebook pages that have become very successful. In order to move away from big brother Facebook my client wishes to create a large dynamic site that incorporates the more successful parts of the Facebook empire.
One of my client's spin-off sites has been created and is getting a lot of traffic. I'm not sure exactly how much, but it hit 90 gigs in a month and the allocated space needed to be increased.
In any case, my client has dreamed up a massive website with its own community, looking to put everything under the one banner. However, I am concerned that it will get thrashed: bottlenecks, long load times, etc.
My questions:
Will a managed dedicated server be able to handle a potentially large amount of traffic?
Is it going to be better to give various parts of the empire their own separate hosting and domains (normal hosting or VPS), or is it better to have them all under the one hood (i.e. using sub-domains)?
If they were all together, would it be better for SEO and easier to manage? Or, if they are separate, they may be quicker, but would that need some sort of Passport user system so people can log into any of the websites with the same user details?
What's the best way to implement a Passport-style user system? Do you remotely connect to the databases? Or run a regular cron job that updates each individual user's details on each domain? Maybe run a cURL request to the other sites whenever there is any new data?
Any other Pros/Cons to keeping all the section together or separating them?
Large sites like Facebook manage to have everything under the one root. Then sites like eBay have separate domain names, but you can use the same user login across all of them.
I'm not sure what the best option is and would appreciate any guidance.
It is a very general question but to give some hints:
1. Measure, measure and measure again. Know what kind of parts are used heavily and which are not.
2. Fix things and go back to 1.
Really: without knowing what takes lots of time, what is used most heavily, etc., you cannot say anything useful.
VPS or dedicated servers are not the right question. You start with: What do I have to do for the users. Then: How am I going to do that? (for example: in database, in scripts, in message queue) and then finally you see how much hardware you need.
One or multiple domains doesn't really matter, though there is one exception: if you have lots of static content, it might be interesting to use a CDN like Amazon's. Read for example: http://highscalability.com/blog/2011/12/27/plentyoffish-update-6-billion-pageviews-and-32-billion-image.html where you can read about the possibilities a CDN offers.
In general, serving static content from a separate static domain is useful; most other things don't really need that. So otherwise you could just keep everything on one domain.

PHP - detecting changes in external database-driven site

For a homework project, I'm creating a PHP-driven website whose main function is aggregating news about various university courses.
The main problem is this: (almost) each course has its own website. These are usually just plain HTML or built using some simple free CMS system.
As a student participating in 6-7 courses, almost every day you go through 6-7 websites checking whether there is any news. The idea behind the project is that you don't have to do that; instead, you just check the aggregation site.
My idea is the following: each time a student logs in, go through his course list. For every course, get its website (recursively, like with wget), and create a hash value of it. If the hash is different from the one stored in the database, we know the site has changed, and we notify the student.
So, what do you think, is this reasonable way to achieve the functionality?
And if yes, what is (technically) the best way to go about this? I was checking php_curl, but I don't know if it can get a website recursively.
Furthermore, there's a slight problem: I have somewhat limited resources, only a few MB of quota on a public (university) server. However, if that's a big problem, I could use a separate hosting solution.
Thanks :)
Just use file_get_contents, or cURL if you absolutely have to (in case you need cookies).
You can use your hashing trick to check for modifications, but it's not very elegant. What you really want to know is when it was last changed. I doubt this information is on the website, but maybe they offer an RSS feed or some web service or API you can use for this purpose.
Don't worry about doing recursive requests. Just make a new request each time.
"When all else fails, build a scraper"

Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection attacks, there is nothing to prevent anyone from viewing all the records as html and running some sort of script to parse this data back into another database. Even if I was to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address (see the sketch after this list).
Require JavaScript - to ensure the client has some resemblance to an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robots meta tags. This would stop conforming spiders. This will prevent Google from indexing you, for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
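As promised above, here is a very rough per-IP rate-limiting sketch. It counts requests in APCu shared memory (the APCu extension is assumed to be installed), and the limit and window are arbitrary numbers; a database or memcached would work just as well:

<?php
// Count requests per IP and refuse service once a threshold is exceeded.
// Requires the APCu extension; limit and window are arbitrary examples.
$ip     = $_SERVER['REMOTE_ADDR'];
$key    = 'hits_' . $ip;
$limit  = 100;               // allow at most 100 requests ...
$window = 600;               // ... per 10-minute window

if (apcu_add($key, 1, $window)) {
    $hits = 1;               // first request in this window
} else {
    $hits = apcu_inc($key);  // key already existed, bump the counter
}

if ($hits > $limit) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('Too many requests from your address, please slow down.');
}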
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (i.e. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping (see the sketch after this list).
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
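A sketch of the time-limited session hash from the second point in the list above; the token name, the 10-second window and the endpoint layout are illustrative only, and the two halves would live in the page script and the AJAX endpoint respectively:

<?php
// When rendering the page shell: create a short-lived token in the session.
session_start();
$_SESSION['ajax_token']    = bin2hex(random_bytes(16));
$_SESSION['ajax_token_ts'] = time();
// Echo the token into an inline script so the client-side AJAX call can send it back.

// In the AJAX endpoint that actually returns the data:
$valid = isset($_SESSION['ajax_token'], $_GET['token'])
    && hash_equals($_SESSION['ajax_token'], $_GET['token'])
    && (time() - $_SESSION['ajax_token_ts']) <= 10;   // 10-second validity window

if (!$valid) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
// ...otherwise query the database and echo the JSON payload...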
Force a reCAPTCHA every 10 page loads for each unique IP.
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client the reason why he wants the data to be visible but not be able to be scraped?
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to the bulletin board's troll protection... If a scrape is detected (perhaps a certain amount of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
Normally, to screen-scrape a decent amount of data, one has to make hundreds, thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
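A rough sketch of that throttling idea, counting page loads per session; the thresholds are arbitrary, the captcha.php page is hypothetical, and a real implementation would also track by IP, since a scraper can simply drop the session cookie:

<?php
// Count page loads per session and slow things down exponentially past a
// threshold; every so often demand a CAPTCHA instead. Numbers are arbitrary
// and /captcha.php is a hypothetical page.
session_start();

$_SESSION['loads'] = isset($_SESSION['loads']) ? $_SESSION['loads'] + 1 : 1;

$threshold = 30;                                  // free page loads before throttling
if ($_SESSION['loads'] > $threshold) {
    $over  = $_SESSION['loads'] - $threshold;
    $delay = min(30, pow(2, $over / 10));         // grows exponentially, capped at 30 seconds
    sleep((int) $delay);

    if ($_SESSION['loads'] % 50 === 0) {          // occasionally demand a CAPTCHA instead
        header('Location: /captcha.php?return=' . urlencode($_SERVER['REQUEST_URI']));
        exit;
    }
}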
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would just be to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.
