I have a project that serves many images. That project also has an API that serves, among other things, the image links.
I would like a way to successfully prevent the scraping of my images. I don't mind users downloading each image individually, but I would not like someone to be able to scrape all the images at once, to avoid high bandwidth usage.
I thought about using .htaccess to deny direct access to the image folders.
I also thought about using PHP (on the website) with a dynamic link to show the image (for example loadimage.php?id=XXXXX) so my users don't know the full image link.
How could I do this in the API (and even on the website) to prevent scraping? I thought of something like a token, where each request would generate a new "image id", but either I'm missing something or I can't figure out how to make it work.
I know it will be impossible to have a method that is 100% effective, but any suggestions on how to make scraping harder would be appreciated.
Thanks.
You're looking for a rate limit policy. It involves tracking how many times the images are being requested (or the number of bytes being exchanged), and issuing a (typically) 429 Too Many Requests response when a threshold is exceeded.
Nginx has some pretty good built-in tools for rate limiting. You mention .htaccess which implies Apache, for which there is also a rate limiting module.
You could do this with or without PHP. You could identify a URL pattern that you want rate limited, and apply the rate limit policy to that URL pattern (could be a PHP script or just a directory somewhere).
For Apache:
<Location ".../path/to/script.php">
    # mod_ratelimit: throttle responses to roughly 400 KiB/s,
    # allowing an initial unthrottled burst of 512 KiB
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 400
    SetEnv rate-initial-burst 512
</Location>
Or, you could write code in your PHP that records accesses in a database and enforces a limit based on how many accesses occur in a given window.
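If you do go the PHP route, here is a minimal sketch of that idea using a fixed window, backed by SQLite through PDO (the database path, table name, and thresholds are placeholders; any PDO-supported database would work the same way):

<?php
// Minimal fixed-window rate limiter: at most $limit requests per $window seconds per IP.
$limit  = 100;
$window = 60;
$ip     = $_SERVER['REMOTE_ADDR'];
$now    = time();

$db = new PDO('sqlite:/var/data/ratelimit.sqlite');   // placeholder path
$db->exec('CREATE TABLE IF NOT EXISTS hits (ip TEXT, ts INTEGER)');

// Discard entries that have fallen out of the window, then count what is left for this IP.
$db->prepare('DELETE FROM hits WHERE ts < ?')->execute([$now - $window]);
$stmt = $db->prepare('SELECT COUNT(*) FROM hits WHERE ip = ?');
$stmt->execute([$ip]);

if ((int) $stmt->fetchColumn() >= $limit) {
    http_response_code(429);               // Too Many Requests
    header('Retry-After: ' . $window);
    exit('Rate limit exceeded.');
}

// Record this access and fall through to serving the image as usual.
$db->prepare('INSERT INTO hits (ip, ts) VALUES (?, ?)')->execute([$ip, $now]);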
I would not generally recommend writing your own when there are such good tools supported in the web server itself. One exception would be if you run several web servers in a cluster, where it is not easy to synchronize rate limiting thresholds and counts across the servers.
For the past few hours, I've been getting multiple requests to our website from various IPs every second (maybe 4 or 5 requests per second).
The website's usual traffic is about 3 to 5 requests per minute.
The requests are very random, for example:
/gtalczp/197zbcylgxpoaj-26228e-dtmlnaibx/
/109/jxwhezsivr/10445_xwvpfdyzhea.cgi
/nouyaku.html
/index.php/43e3133-pmuwbfgoedakvxs/
/keyword_list/s_index=L
The site's indexing in Google is now all in Japanese characters and messed up.
I have tried blocking the IPs (via .htaccess) that make all these random requests, but every time a new IP makes a new request. How can I stop all of these requests? Can I use an .htaccess rule that only allows the links that actually exist on the site?
EDIT: Our site is running the latest version of WordPress, with custom-built features. If this was some kind of hack, how can I find the infected files/database tables?
EDIT 2: These look like legitimate Google bots, but why are they trying to access these random links, which don't exist?
This traffic is coming from automated security scanners. They scan blocks of IP ranges used by AWS, DigitalOcean, etc., looking for known security bugs on the web server.
Can you stop it? Sort of.
One quick way to catch the low hanging fruit is to put a /password.txt on the root of the webserver. Every scanner on this planet will scan for that. Block any IP that accesses it. You can use Fail2Ban for this.
You can also rate limit access to your web server. If a client is requesting pages very quickly, it's likely a scanner, in which case ban the IP. But it could also be a search engine spider, in which case this will likely hurt your SEO.
Requests for slugs containing Japanese keywords like nouyaku, together with Google indexing your pages in Japanese, might well indicate the Japanese keyword hack. This Google article provides an explanation and some general fixes and preventive measures: https://developers.google.com/web/fundamentals/security/hacked/fixing_the_japanese_keyword_hack
Fixing WordPress hacks is already covered elsewhere: you will find numerous questions and answers about this on Stack Overflow or via Google.
.htaccess: Google's article advises replacing your .htaccess. A useful start would be adding and tweaking Jeff Starr's 6G "Firewall" or the 7G (beta) code.
The rate of requests is DDoS-like, so it makes sense to cater for this at the same time (e.g. mod_evasive, Fail2ban, and ModSecurity); search for how to protect Apache from DDoS attacks.
DDoS, brute force and WordPress: stopping dodgy requests before PHP/WordPress code/SQL is run will massively reduce server load. If there is no need for the public to log in to WordPress, then use .htaccess to password-protect wp-login.php and maybe also the wp-admin folder (this may cause problems on some sites).
I have a very large database of all the items in a massive online game on my website, and so do my competitors. I, however, am the only site to have pictures of all these items. All these pictures are on my server, e.g. from 1.png to 99999.png (all in the same directory).
It's very easy for my competitors to create a simple file_get_contents/file_put_contents script to pull all of these images over to their own server and redistribute them their own way. Is there anything I can do about this?
Is there a way to limit everyone to (for example) seeing/loading only 100 images per minute (I'm sure those scripts would grab all of the images very quickly otherwise)? Or even better, to only allow real users to visit the URLs? I'm sure those scripts won't listen to a robots.txt file, so what would be a better solution? Does anybody have an idea?
Place a watermark in your images that states that the images are copyrighted by you or your company. Your competitors would have to remove the watermark and make the image look like there never was one, so that would definitely be a good measure to take.
If you're using the Apache web server, create an image folder and upload an .htaccess file that tells the server that only you and the server itself are allowed to see the files. This will help hide the images from the scraping bots, as Apache will see that they are not authorized to see what's in the folder. You'd need to have PHP load the images (not just pass img tags on) so that, as far as the permissions system is concerned, it is the server that is accessing the raw files.
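As a rough sketch of the "have PHP load the images" idea: assuming the originals sit in a folder that .htaccess denies to browsers and that your login code sets $_SESSION['user_id'] (both of these are assumptions, not part of the question), the loader script could look something like this:

<?php
// loadimage.php?id=12345 -- serves an image only to logged-in users.
session_start();

if (empty($_SESSION['user_id'])) {          // assumed to be set by your login code
    http_response_code(403);
    exit('Not authorized.');
}

$id   = isset($_GET['id']) ? (int) $_GET['id'] : 0;      // cast blocks path tricks
$path = __DIR__ . '/protected_images/' . $id . '.png';   // folder denied by .htaccess

if ($id <= 0 || !is_file($path)) {
    http_response_code(404);
    exit('Not found.');
}

header('Content-Type: image/png');
header('Content-Length: ' . filesize($path));
readfile($path);

Combined with a per-IP or per-session request limit, this makes bulk scraping much slower and easier to spot.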
On your PHP page itself, use a CAPTCHA device or some other robot detection method.
I wonder about the pros and cons of the two methods below for getting an image:
http://image.anobii.com/anobi/image_book.php?type=3&item_id=013c23a6dd4c6115e4&time=1282904048
http://static.anobii.com/anobii/static/image/welcome/icon_welcome.png
One uses PHP to get the image and the other is just a direct URL to the file.
E.g., which one is faster?
They potentially serve very different purposes. If you are able to link directly to the .png resource, it is likely (but not guaranteed to be) a real file which is world accessible on the web. When using a PHP script to serve the image content, a lot of different things may be happening behind the scenes.
For example, PHP is able to check the user's session or authentication credentials to provide authorization for the image. The image binary data could be stored in a database instead of in the filesystem, or if in the filesystem, the image file could be stored outside the web server's document root, preventing direct access to it. One common usage is to deny access to the file when a user is not authorized, and instead serve other image data in its place, like an "access denied" default image.
Another potential use of the PHP script could be per-session hit counting on the resource, or rate limiting clients from hitting a resource too many times.
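To make those two uses concrete, a hypothetical script in the style of image_book.php might look roughly like this; the session key, the per-session budget, and the fallback image are all invented for illustration:

<?php
// Sketch: count image hits per session and fall back to a default
// "access denied" image when the budget is exceeded or the item is unknown.
session_start();

$_SESSION['image_hits'] = isset($_SESSION['image_hits']) ? $_SESSION['image_hits'] + 1 : 1;

$itemId = isset($_GET['item_id']) ? preg_replace('/[^0-9a-f]/', '', $_GET['item_id']) : '';
$file   = '/srv/images/' . $itemId . '.png';     // stored outside the document root

if ($_SESSION['image_hits'] > 500 || $itemId === '' || !is_file($file)) {
    $file = '/srv/images/access_denied.png';     // placeholder served instead
}

header('Content-Type: image/png');
header('Content-Length: ' . filesize($file));
readfile($file);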
When serving a static file, authorization, logging, etc. are limited to the capabilities of the web server as it is configured.
The question to ask isn't really which is faster, but which suits the application's business need.
I appreciate your comments to help me decide on the following.
My requirements:
I have a site hosted on a shared server and I'm going to provide content to my users: about 60 GB of content (about 2,000 files of 30 MB each; users will have access to only 20 files at a time). I calculate about 100 GB of monthly bandwidth usage.
Once a user registers for the content, links will be made accessible for the user to download. But I want the links to expire in 7 days, with the possibility of increasing the expiration time.
I think that the disk space and bandwidth call for a service like Amazon S3 or Rackspace Cloud Files (or is there an alternative?).
To manage the expiration, I plan to either somehow obtain links that expire (I think S3 has that feature, but not Rackspace) OR control the expiration date in my database and have a batch process that renames, on a daily basis, all 200 files in the cloud and in my database (so in case a user copied the direct link, it won't work the next day; only my web page will have the updated links). PHP is used for the programming.
So what do you think? Is cloud file hosting the way to go? Which one? Does managing the links make sense that way, or is it too difficult to do through programming (sending commands to the cloud server...)?
EDIT:
Some hosting companies offer unlimited space and bandwidth on their shared plans. I asked their support staff and they said that they really do honor the "unlimited" deal. So 100 GB of transfer a month is OK; the only thing to watch out for is CPU usage. So shared hosting is one more alternative to choose from.
FOLLOWUP:
So, digging more into this, I found that the TOS of the unlimited plans say that it is not permitted to use the space primarily to host multimedia files. So I decided to go with Amazon S3 and the solution provided by Tom Andersen.
Thanks for the input.
I personally don't think you necessarily need to go to a cloud-based solution for this. It may be a little costly. You could simply get a dedicated server instead. One provider that comes to mind gives 3,000 GB/month of bandwidth on some of their lowest-level plans. That is on a 10 Mbit uplink; you can upgrade to 100 Mbps for $10/mo or 1 Gbit for $20/mo. I won't mention any names, but you can search for dedicated servers and possibly find one to your liking.
As for expiring the files, just implement that in PHP backed by a database. You won't have to move files around: store all the files in a directory not accessible from the web, and use a PHP script to determine whether the link is valid; if so, read the contents of the file and pass them through to the browser. If the link is invalid, you can show an error message instead. It's a pretty simple concept, and I think there are a lot of pre-written scripts that do this available, but depending on your needs, it isn't too difficult to do it yourself.
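A bare-bones version of that script might look like the following; the table layout, column names, and paths are assumptions for the sake of the example:

<?php
// download.php?token=abc123 -- streams a file only while the link is unexpired.
$token = isset($_GET['token']) ? $_GET['token'] : '';

$db   = new PDO('mysql:host=localhost;dbname=myapp', 'dbuser', 'dbpass');  // placeholders
$stmt = $db->prepare(
    'SELECT filename FROM download_links WHERE token = ? AND expires_at > NOW()'
);
$stmt->execute([$token]);
$filename = $stmt->fetchColumn();

if ($filename === false) {
    http_response_code(410);                    // link missing or expired
    exit('This download link has expired.');
}

$path = '/home/myapp/private_files/' . basename($filename);   // not web-accessible

header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . basename($filename) . '"');
header('Content-Length: ' . filesize($path));
readfile($path);

Extending the expiration then only means updating expires_at in the database; the files themselves never move.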
Cloud hosting has advantages, but right now I think it's costly, and if you aren't trying to spread the load geographically or planning to support thousands of simultaneous users and need the elasticity of the cloud, you could possibly use a dedicated server instead.
Hope that helps.
I can't speak for S3 but I use Rackspace Cloud files and servers.
It's good in that you don't pay for incoming bandwidth, so uploads are super cheap.
I would do it like this:
Upload all the files you need to a 'private' container
Create a public container with CDN enabled
That'll give you a special url like http://c3214146.r65.ce3.rackcdn.com
Make your own CNAME DNS record for your domain pointing to that, like: http://cdn.yourdomain.com
When a user requests a file, use the COPY API operation with a long random filename to do a server-side copy from the private container to the public container (see the sketch after this list).
Store the filename in a MySQL DB for your app.
Once the file expires, use the DELETE API operation, then the PURGE API operation to get it out of the CDN; finally, delete the record from the MySQL table.
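Cloud Files speaks the OpenStack Swift REST API, so the server-side copy step above can be done with a plain cURL call roughly like this (the auth token, storage URL, container names, and file names are placeholders; in practice you would normally use an SDK such as php-opencloud):

<?php
// Server-side copy: private/book.pdf -> public/<long random name>.pdf
$token      = 'YOUR_AUTH_TOKEN';                          // obtained from the auth endpoint
$storageUrl = 'https://storage.example.com/v1/ACCOUNT';   // placeholder storage URL
$randomName = bin2hex(random_bytes(16)) . '.pdf';         // hard-to-guess public filename

$ch = curl_init($storageUrl . '/public/' . $randomName);
curl_setopt_array($ch, [
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_HTTPHEADER     => [
        'X-Auth-Token: ' . $token,
        'X-Copy-From: /private/book.pdf',                 // source container/object
        'Content-Length: 0',
    ],
    CURLOPT_RETURNTRANSFER => true,
]);
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);          // expect 201 Created
curl_close($ch);
// Then store $randomName (and its expiry date) in your MySQL table.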
As for the PURGE command: I heard it doesn't work 100% of the time and it may leave the file around for an extra day; also, the docs say to reserve its use for emergencies only.
Edit: I just heard there's a limit of 25 purges per day.
However, personally I've just used DELETE on objects and found that it took them out of the CDN straight away. In summary, the worst case would be that the file would still be accessible on some CDN nodes for 24 hours after deletion.
Edit: You can change the TTL (caching time) on the CDN nodes; the default is 72 hours, so it might pay to set it to something lower, but not so low that you lose the advantage of the CDN.
The advantages I find with the CDN are:
It pushes content right out to end users far away from the USA servers and gives super fast download times for them
If you have a super popular file, it won't take out your site when 1,000 people start trying to download it, as they'd all get copies pushed out from whatever CDN node they were closest to.
You don't have to rename the files on S3 every day. Just make them private (which is the default), and hand out time-limited URLs, valid for a day or a week, to anyone who is authorized.
I would consider making the links only good for 20 minutes, so that a user has to log in again in order to re-download the files. Then they can't even share the links they get from you.
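For reference, with the AWS SDK for PHP a time-limited S3 URL can be generated roughly like this (bucket, key, and region are placeholders):

<?php
require 'vendor/autoload.php';              // AWS SDK for PHP, installed via Composer

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',               // placeholder region
]);

// Build a GetObject request and pre-sign it so the URL only works for 20 minutes.
$cmd = $s3->getCommand('GetObject', [
    'Bucket' => 'my-private-bucket',        // placeholder bucket
    'Key'    => 'files/file-0001.zip',      // placeholder key
]);
$request = $s3->createPresignedRequest($cmd, '+20 minutes');

// Hand this URL to the authorized user; it stops working after the 20 minutes are up.
echo (string) $request->getUri();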
I'm downloading a website with heaps of GPL-licensed free content; however, my computer exceeds the daily download limit of 20 files (out of some 10,000!).
Is there a proxy service I can use (via PHP) to continue accessing such content?
Yes, this is possible. See PHP's cURL:
- http://php.net/manual/en/book.curl.php
Specifically: http://www.php.net/manual/en/function.curl-setopt.php : CURLOPT_HTTPPROXYTUNNEL
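A minimal example of fetching a file through an HTTP proxy with cURL; the proxy address and the URL are placeholders:

<?php
// Fetch a URL through an HTTP proxy using cURL.
$ch = curl_init('http://example.com/files/file1.tar.gz');    // placeholder URL
curl_setopt_array($ch, [
    CURLOPT_PROXY           => '203.0.113.10:8080',   // placeholder proxy host:port
    CURLOPT_HTTPPROXYTUNNEL => true,                   // tunnel requests through the proxy
    CURLOPT_FOLLOWLOCATION  => true,
    CURLOPT_RETURNTRANSFER  => true,
]);
$data = curl_exec($ch);
curl_close($ch);

if ($data !== false) {
    file_put_contents('file1.tar.gz', $data);
}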
You may want to check that what you're doing is legal. Also, I'd imagine that you'll run into the same download limit, with a proxy.
You shouldn't be trying to bypass that limit. It's there to stop people like you from overloading their connection by trying to download everything.
If you really need to download the entire site, and the content is really free, maybe it's mirrored on another site where you can get at it more easily.
Edit: Or you could email the site administrator and ask nicely. Maybe he can give it to you in a convenient format or disable the limit for you.
Technically, shouldn't a proxy only get you an extra 20 files per day? I hope you have a lot of proxies lined up.
Another option would be to use Tor, which could potentially spread your requests amongst hundreds of end points.
Personally, I'd approach the site owner first. If the files truly are GPL and the host is following the spirit of the GPL and not just trying to maximize advertising revenue, they shouldn't have too much of an issue giving you the lot.