How to use proxies to download files - php

I'm downloading a website with heaps of GPL-licensed free content; however, I keep hitting the site's daily download limit of 20 files (out of some 10,000!).
Is there a proxy service I can use (via PHP) to continue accessing such content?

Yes, this is possible. See PHP's cURL:
- http://php.net/manual/en/book.curl.php
Specifically: http://www.php.net/manual/en/function.curl-setopt.php : CURLOPT_HTTPPROXYTUNNEL
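For illustration, here is a minimal sketch of fetching one file through a proxy with cURL; the proxy address, port, and target URL are placeholders, and CURLOPT_PROXY is what actually points cURL at the proxy:

    <?php
    // Minimal sketch: fetch one file through a proxy with cURL.
    // The proxy address/port and the target URL below are placeholders.
    $ch = curl_init('http://example.com/files/some-file.zip');
    curl_setopt($ch, CURLOPT_PROXY, '203.0.113.10:8080');    // proxy host:port
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, true);         // tunnel through the proxy
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);          // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);          // return the body as a string
    $data = curl_exec($ch);
    if ($data === false) {
        die('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    file_put_contents('some-file.zip', $data);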
You may want to check that what you're doing is legal. Also, I'd imagine that you'll run into the same download limit, with a proxy.

You shouldn't be trying to bypass that limit. It's there to stop people like you from overloading their connection by trying to download everything.
If you really need to download the entire site, and the content is really free, maybe it's mirrored on another site where you can get at it more easily.
Edit: Or you could email the site administrator and ask nicely. Maybe he can give it to you in a convenient format or disable the limit for you.

Technically, shouldn't a proxy only get you an extra 20 files per day? I hope you have a lot of proxies lined up.
Another option would be to use Tor, which could potentially spread your requests amongst hundreds of end points.
Personally, I'd approach the site owner first. If the files truly are GPL and the host is following the spirit of the GPL and not just trying to maximize advertising revenue, they shouldn't have too much of an issue giving you the lot.

Related

Check the cause of slow loading time among different servers

An e-shop was developed with PrestaShop and deployed to three servers;
the first two are on Amazon and should have identical settings.
Server 1:
http://be-pure.com/en/women/3-slim-y-tank.html
Server 2:
http://52.77.216.83/en/women/3-slim-y-tank.html
The last one is just local hosting.
Server 3:
http://internal001.zizsoft.com/be_pure/en/women/3-slim-y-tank.html
The problem is that server 1 loads very slowly compared to the other two, even though its performance should be the best of the three.
It looks as if server 1 hasn't cached the files,
but in fact all of them have:
Smarty caching is turned on, using the file system, with recompile-when-modified enabled,
and
the file system cache is turned on as well.
The code and server settings are the same: both Amazon servers have the same configuration, and the localhost one is a different machine that should, if anything, be slower than server 1.
1) How can I debug/check whether the files are being served from cache?
(The cache files are located in cache/smarty and cache/cachefs on the server.)
2) And what is taking server 1 so long to load? Treating it simply as a PHP site, is there any way to check why it is slow?
Thanks a lot for helping
Refer to the comments - I misinterpreted the data I was looking at earlier. It appears the server can only handle maybe 5-10 requests at a time, so requests get blocked until the earlier ones are done loading. You likely just need to update your web server's configuration to handle more requests.
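As a rough illustration only: if the box runs Apache with the prefork MPM (an assumption; the question doesn't say which web server is used), the limits involved look something like this:

    # Assumption: Apache 2.4 with the prefork MPM. Raise the worker count so
    # parallel asset requests are not queued. (The directive is MaxClients on Apache 2.2.)
    <IfModule mpm_prefork_module>
        StartServers             5
        MinSpareServers          5
        MaxSpareServers         10
        MaxRequestWorkers      150
        MaxConnectionsPerChild   0
    </IfModule>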
There is also a lot of JS data on the page. It is 318KB just to load the page, and it has to make many requests for JS/CSS files before it even gets to any of the HTML. So it is 318KB + all the external JS/CSS it needs to fetch (wow!). That's like 4MB of stuff just to load a page.
Check the modify timestamp on the files generated by your caching system to verify that caching is working.
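For question 1, one quick way to do that check from the PHP side, using the cache paths mentioned in the question (adjust to your install):

    <?php
    // List cache files with their last-modified times to confirm the cache
    // is being written and then reused rather than regenerated every request.
    foreach (array('cache/smarty', 'cache/cachefs') as $dir) {
        foreach (glob($dir . '/*') as $file) {
            if (is_file($file)) {
                echo date('Y-m-d H:i:s', filemtime($file)), '  ', $file, PHP_EOL;
            }
        }
    }

If the timestamps change on every reload of the same page, the cache is being rebuilt instead of reused.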
Edit:
Since there is now a bounty out - please review the comment discussion we had. There is an issue where a traceroute doesn't make it to the server destination and I suspect that is related to the slowness but that type of network issue is over my head.
This is an answer to your 2nd question.
I don't know the exact cause of the slow loading, but we faced the same issue in one of our projects last month, also on Amazon.
One of our instances was very slow. We tried many solutions, but none of them worked. Then we found a fix that looks very crude but worked for us:
we simply restarted the slow instance, and that solved it.
I hope this works for you too.
All the best :)
Answer to question 1:
You can use Chrome's developer tools (F12), then the Network tab. It will show you all the files being downloaded, and the Size column tells you whether each one was loaded from cache or not.
Answer to question 2:
You can use the YSlow plugin for Chrome. It will help you a lot, but it's already obvious that your site has too many files: CSS, JS, and lots of images. Try to merge your CSS and JS files and use a sprite/image map for your images.
Hope you can solve your problem.
Nobody can say for sure what the problem with the servers is, but you can find it with profiling. If you have the budget, I highly recommend buying a profiling tool such as Tideways, Blackfire, or New Relic. I used New Relic, and it was really helpful for finding bottlenecks. If you don't have the budget for a profiling tool, you can use the PHP profiling extension Xdebug. It is helpful too, though reading Xdebug's profiling output can be a bit difficult. Its setup is really easy, however, and you can do partial profiling (profile only the URLs you want) with Xdebug's profile trigger instead of profiling every request.
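A minimal php.ini sketch of that trigger-based profiling, assuming Xdebug 2.x (the directive names changed in Xdebug 3):

    ; php.ini -- Xdebug 2.x trigger-based profiling (assumption: Xdebug 2.x)
    zend_extension = xdebug.so
    xdebug.profiler_enable         = 0
    xdebug.profiler_enable_trigger = 1          ; profile only when XDEBUG_PROFILE is passed
    xdebug.profiler_output_dir     = /tmp/xdebug

Then request a page with ?XDEBUG_PROFILE=1 appended and open the resulting cachegrind file in a viewer such as QCacheGrind or Webgrind.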
Use this solid Google tool to get insight into your page performance:
https://developers.google.com/speed/pagespeed/insights/?url=http%3A%2F%2Fbe-pure.com%2Fen%2Fwomen%2F3-slim-y-tank.html
Also check whether server 1's configuration matches server 2's.
This tool shows me lots of suggestions for improving your website.

Input on decision: file hosting with amazon s3 or similar and php

I appreciate your comments to help me decide on the following.
My requirements:
I have a site hosted on a shared server, and I'm going to provide content to my users: about 60 GB of content (about 2,000 files of 30 MB each; users will have access to only 20 files at a time). I calculate about 100 GB of monthly bandwidth usage.
Once a user registers for the content, links will be made accessible for the user to download. But I want the links to expire in 7 days, with the possibility of increasing the expiration time.
I think the disk space and bandwidth call for a service like Amazon S3 or Rackspace Cloud Files (or is there an alternative?).
To manage the expiration, I plan to somehow obtain links that expire (I think S3 has that feature, not Rackspace) OR control the expiration date in my database and have a batch process that renames, on a daily basis, all 200 files on the cloud and in my database (so if a user copied the direct link, it won't work the next day; only my webpage will have the updated links). PHP is used for programming.
So what do you think? Is cloud file hosting the way to go? Which one? Does managing the links that way make sense, or is it too difficult to do through programming (sending commands to the cloud server...)?
EDIT:
Some hosting companies have unlimited space and bandwidth on their shared plans. I asked their support staff and they said that they really honor the "unlimited" deal, so 100 GB of transfer a month is OK; the only thing to watch out for is CPU usage. So shared hosting is one more alternative to choose from.
FOLLOWUP:
So, digging more into this, I found that the TOS of the unlimited plans say that it is not permitted to use the space primarily to host multimedia files. So I decided to go with Amazon S3 and the solution provided by Tom Andersen.
Thanks for the input.
I personally don't think you necessarily need to go to a cloud-based solution for this. It may be a little costly. You could simply get a dedicated server instead. One provider that comes to mind gives 3,000 GB/month of bandwidth on some of their lowest-level plans. That is on a 10Mbit uplink; you can upgrade to 100Mbps for $10/mo or 1Gbit for $20/mo. I won't mention any names, but you can search for dedicated servers and possibly find one to your liking.
As for expiring the files, just implement that in PHP backed by a database. You won't have to move files around: store all the files in a directory not accessible from the web and use a PHP script to determine whether the link is valid; if so, read the contents of the file and pass them through to the browser, and if not, show an error message instead. It's a pretty simple concept, and I think there are a lot of pre-written scripts that do that available, but depending on your needs, it isn't too difficult to do it yourself.
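A rough sketch of that idea; the DSN, table, column names, and storage path below are hypothetical:

    <?php
    // download.php?token=... -- serve a file only while its link is still valid.
    $pdo  = new PDO('mysql:host=localhost;dbname=myapp', 'dbuser', 'dbpass');
    $stmt = $pdo->prepare(
        'SELECT file_name FROM download_links WHERE token = ? AND expires_at > NOW()'
    );
    $stmt->execute(array(isset($_GET['token']) ? $_GET['token'] : ''));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if (!$row) {
        header('HTTP/1.1 403 Forbidden');
        exit('This download link is invalid or has expired.');
    }

    // Files live outside the web root so they can only be reached via this script.
    $path = '/home/myapp/files/' . basename($row['file_name']);
    header('Content-Type: application/octet-stream');
    header('Content-Disposition: attachment; filename="' . basename($path) . '"');
    header('Content-Length: ' . filesize($path));
    readfile($path);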
Cloud hosting has advantages, but right now I think it's costly, and unless you are trying to spread the load geographically, or plan on supporting thousands of simultaneous users and need the elasticity of the cloud, you could probably use a dedicated server instead.
Hope that helps.
I can't speak for S3 but I use Rackspace Cloud files and servers.
It's good in that you don't pay for incoming bandwidth, so uploads are super cheap.
I would do it like this:
Upload all the files you need to a 'private' container
Create a public container with CDN enabled
That'll give you a special url like http://c3214146.r65.ce3.rackcdn.com
Make your own CNAME DNS record for your domain pointing to that, like: http://cdn.yourdomain.com
When a user requests a file, use the COPY API operation with a long random filename to do a server-side copy from the private container to the public container (see the sketch after these steps).
Store the filename in a MySQL DB for your app.
Once the file expires, use the DELETE API operation, then the PURGE API operation to get it out of the CDN; finally, delete the record from the MySQL table.
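A rough sketch of that server-side copy step against the raw Cloud Files (OpenStack Swift) REST API from PHP; the storage URL, auth token, and container names are placeholders, and the official SDK wraps the same call:

    <?php
    // Sketch: server-side copy of an object from a private container to the
    // CDN-enabled public container. All credentials/names below are placeholders.
    $storageUrl = 'https://storage.example.test/v1/MossoCloudFS_xxxx'; // from auth step
    $authToken  = 'your-auth-token';                                   // from auth step
    $randomName = md5(uniqid(mt_rand(), true)) . '.zip';  // hard-to-guess public name

    $ch = curl_init($storageUrl . '/private-container/original-file.zip');
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'COPY');
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'X-Auth-Token: ' . $authToken,
        'Destination: public-container/' . $randomName,
    ));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // A 201 response means the copy succeeded; store $randomName and its expiry in MySQL.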
With the PURGE command, I heard it doesn't work 100% of the time and may leave the file around for an extra day; the docs also say to reserve its use for emergencies only.
Edit: I just heard there's a limit of 25 purges per day.
However, personally I've just used DELETE on objects and found that took them out of the CDN straight away. In summary, the worst case would be that the file would still be accessible on some CDN nodes for 24 hours after deletion.
Edit: You can change the TTL (caching time) on the CDN nodes; the default is 72 hours, so it might pay to set it to something lower, but not so low that you lose the advantage of the CDN.
The advantages I find with the CDN are:
It pushes content right out to end users far away from the USA servers and gives super fast download times for them
If you have a super popular file, it won't take out your site when 1000 people start trying to download it, as they'd all get copies pushed out from whichever CDN node they were closest to.
You don't have to rename the files on S3 every day. Just make them private (which is the default), and hand out time-limited URLs, good for a day or a week, to anyone who is authorized.
I would consider making the links only good for 20 mins, so that a user has to re-login in order to re-download the files. Then they can't even share the links they get from you.
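For reference, a hedged sketch of those time-limited URLs using the AWS SDK for PHP (v3); the bucket name, key, and region are placeholders, and credentials are assumed to come from the environment or an instance profile:

    <?php
    // Sketch: generate a pre-signed S3 URL that stops working after 7 days.
    require 'vendor/autoload.php';

    use Aws\S3\S3Client;

    $s3 = new S3Client(array(
        'version' => 'latest',
        'region'  => 'us-east-1',
    ));

    $command = $s3->getCommand('GetObject', array(
        'Bucket' => 'my-content-bucket',
        'Key'    => 'files/file-0001.zip',
    ));

    // Hand this URL to the authorized user; S3 rejects it after the expiry time.
    $presigned = $s3->createPresignedRequest($command, '+7 days');
    echo (string) $presigned->getUri();

For shorter-lived links like the 20-minute ones suggested above, just change the expiry argument.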

Large file uploads from web pages

I code primarily in PHP and Perl. I have a client who is insisting on seeking video submissions (any encoding) from the public via one of their pages rather than letting YouTube do its job.
Server in question is a virtual machine and I can adjust ini settings for max post, max upload size etc as needed.
My initial thought is to use a Flash based uploader with PHP on the back end but I wondered if someone might have useful advice and experience on the subject?
Doing large file transfers over HTTP is not usually fun -- but sometimes it's necessary.
For large files, you'll definitely want to provide some kind of progress gauge for end-users.
There are flash-based tools that do this (swfUpload comes to mind).
If you want to avoid flash and do it with pretty html/javascript/css, you can leverage PHP's APC extension, which for some reason provides support for getting upload status from the server, as explained here
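A rough sketch of what polling that APC upload status looks like, assuming apc.rfc1867 = 1 in php.ini; the field and key names are only illustrative:

    <?php
    // progress.php -- poll upload progress via APC's RFC 1867 hooks.
    // The upload form must contain a hidden field named APC_UPLOAD_PROGRESS
    // (placed before the file field) whose value is the same $key polled here.
    $key    = isset($_GET['key']) ? $_GET['key'] : '';
    $status = apc_fetch('upload_' . $key);   // 'upload_' is the default apc.rfc1867_prefix

    header('Content-Type: application/json');
    if ($status === false) {
        echo json_encode(array('state' => 'unknown'));
    } else {
        echo json_encode(array(
            'state'   => !empty($status['done']) ? 'done' : 'uploading',
            'current' => $status['current'],   // bytes received so far
            'total'   => $status['total'],     // total bytes expected
        ));
    }

Your JavaScript then polls this endpoint every second or so and draws the progress bar.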
You can adjust the post size and use a normal HTML form. The big problem is not Apache, it's HTTP: if anything goes wrong in the transmission, you will have no way to detect the error. Furthermore, there is no way to resume the transfer. This is exactly why BitTorrent is so popular.
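For reference, the php.ini directives involved in raising the post size; the values here are only examples to tune for your expected file sizes:

    ; php.ini -- example values only
    upload_max_filesize = 512M
    post_max_size       = 520M    ; must be >= upload_max_filesize plus form overhead
    max_input_time      = 3600    ; seconds allowed for the upload to arrive
    max_execution_time  = 300     ; seconds the script may run after the upload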
I don't know how opposed to YouTube your client is, but you can use their API to do the uploads from a page on your site.
http://code.google.com/apis/youtube/2.0/developers_guide_protocol.html#Uploading_Videos
See: browser based uploading.
For web-based uploads, there's not many options. Regardless of web platform, web server, etc. you're still transferring over HTTP. The transfer is all or nothing.
Your best option might be to find a Flash, Java, or other client side option that can chunk files and upload them piecemeal, then do a checksum to verify. That will allow for resuming uploads. Unfortunately, I don't know of any such open source component that does this.
Try to convince your client to change their point of view.
Using HTTP (and the browser, of all things!) for this kind of task is rarely a good deal; will his users really wait 40 minutes with the computer and the browser running until the upload is complete?
I don't think so.
Maybe you could set up a public FTP account where users can upload but not download or see other users' files; then whoever wants to use FTP software can, and whoever prefers to do it via the browser can too.
The big problem with using a browser is that if something goes wrong, you can't resume and have to restart from zero again.
Last year I had the same issue. I took a look at ZUpload, but I didn't use it, so I can't really vouch for it. (We wrote a small Python script that we send to our customer; the script creates a torrent of the folder the customer needs to send to us, and we download it via uTorrent ;)
P.S.: Again, sorry for my bad English ;)
I used JUpload. Yes, it looks horrible, but it just works.
With that said, it's still a better idea to convince the client that doing so is stupid.
I would agree with others stating that using HTML is a poor option. I believe there is a size limitation using Flash as well. I know of a script that uses a JavaScript Applet to perform an actual FTP transfer. It is called Simple2FTP and can be found at http://www.simple2ftp.com
Not sure but perhaps worth a try?

Quantify streamed video

I'm developing a PHP application which will charge users for the videos they watch. The business model is "everyone pays for how much she watches". For this purpose, I need to;
Implement secure video (FLV) access. (Authorized sessions will gain access)
Calculate how much video (FLV) data is sent from the server.
A trivial solution for this is to read the FLV with PHP ("fread") and send it to the client chunk by chunk (just "echo"). However, I have real performance concerns about this method, because the application server has 1.7 GB of RAM and just a single core.
In the short run we're expecting a large number of impressions; however, we would like to upgrade the hardware as late as possible. That's why I want to implement the requirement with minimum overhead, in the most effective way.
I'm not tied to a webserver. I prefer Apache 2.2, however lighttpd can also be deployed if it offers a feature for the implementation.
Any idea is appreciated.
Thanks!
The PHP fread solution looks like the way to go, but with the server restriction, I think you will need to tweak the Flash player. The Flash player could send the server messages based on how much of the video has been played; this might be something to think about. Take a look at the JW FLV Media Player: its customisation and JavaScript integration will allow you to send XMLHttpRequests to the server.
Why not use a video streaming server like Red5? I'm sure it has triggers that could write some statistics to a DB or something similar.
Another advantage would be that users could skip forward in the video.
So, to sum up and for future reference, I decided to go with the PHP fread method (a rough sketch follows), since no satisfactory alternative was suggested.
Thanks to all contributors.
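For future readers, a minimal sketch of that fread approach with byte counting for the pay-per-view accounting; the authorization check, file path, and the final accounting step are placeholders:

    <?php
    // stream.php -- send an FLV in chunks and record how many bytes were
    // actually pushed to this client.
    session_start();
    if (empty($_SESSION['user_id'])) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }

    $path = '/var/videos/' . basename($_GET['video']) . '.flv';
    if (!is_file($path)) {
        header('HTTP/1.1 404 Not Found');
        exit;
    }

    header('Content-Type: video/x-flv');
    header('Content-Length: ' . filesize($path));

    $fp   = fopen($path, 'rb');
    $sent = 0;
    while (!feof($fp) && connection_status() === CONNECTION_NORMAL) {
        $chunk = fread($fp, 8192);
        echo $chunk;
        flush();
        $sent += strlen($chunk);
    }
    fclose($fp);

    // Charge the user for $sent bytes, e.g. via an UPDATE on a usage table -- omitted here.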

What's the best way to manage multiple media servers, and file allocations between them?

I have a file hosting website that's burning through 2 Gbit of bandwidth, so I need to start adding secondary media servers to store the files. What would be the best way to manage a multiple-server setup with a large number of files? Preferably through PHP only.
Currently, I only have around 100 GB of files... so I could get a 2nd server, mirror all content between them, and then round-robin the traffic 50/50, 33/33/33, etc. But once the total amount of files grows beyond the capacity of a single server, this won't work.
The idea that I had was to have a list of media servers stored in the DB with the amount of free space left on each server. Once a file is uploaded, PHP will choose which server the file is actually uploaded to, and spread all the files evenly among the servers.
Was hoping to get some more input/inspiration.
I can't use any 3rd-party services like Amazon. The files range from several bytes to a gigabyte.
Thanks
You could try MogileFS. It is a distributed file system. Has a good API for PHP. You can create categories and upload a file to that category. For each category you can define on how many servers it should be distributed. You can use the API to get a URL to that file on a random node.
If you are doing as much data transfer as you say, it would seem whatever it is you are doing is growing quite rapidly.
It might be worth your while to contact your hosting provider and see if they offer any sort of shared storage solutions via iscsi, nas, or other means. Ideally the storage would not only start out large enough to store everything you have on it, but it would also be able to dynamically grow beyond your needs. I know my hosting provider offers a solution like this.
If they do not, you might consider colocating your servers somewhere that either does offer a service like that, or would allow you to install your own storage server (which could be built cheaply from off-the-shelf components and software like FreeNAS or Openfiler).
Once you have a centralized storage platform, you could then add web servers to your heart's content and load balance them based on load, all while accessing the same central storage repository.
Not only is this the correct way to do it, it would offer you much more redundancy and expandability in the future if your endeavor continues to grow at the pace it is currently growing.
The other solutions offered, using a database repository of what is stored where, would work, but they not only add an extra layer of complexity into the fold, they also add an extra layer of processing between your visitors and the data they wish to access.
What if you lost a hard disk: do you lose 1/3 or 1/2 of all your data?
Should the heavy I/O of static content be on the same spindles as the rest of your operating system and application data?
Your best bet is really to get your files into some sort of storage that scales. Storing files locally should only be done with good reason (they are sensitive, private, etc.)
Your best bet is to move your content into the cloud. Mosso's CloudFiles or Amazon's S3 will both allow you to store an almost infinite amount of files. All your content is then accessible through an API. If you want, you can then use MySQL to track meta-data for easy searching, and let the service handle the actual storage of the files.
I think your own idea is not the worst one: get a bunch of servers, and for every file, store which server(s) it's on. If new files are uploaded, use most-free-space first* (sketched after this answer). Every server handles its own delivery (instead of piping through the main server).
pros:
you can use multiple servers for a single file, e.g. for cutekitten.jpg: filepath="server1\cutekitten.jpg;server2\cutekitten.jpg", and then choose the server depending on the server load (or randomly, or alternating, ...)
if you're careful, you may be able to move files around automatically depending on the current load. So if your cute-kitten image gets reddited/slashdotted hard, move it to the server with the lowest load and update the entry.
you could do this with a cron job: just log the downloads for the last xx minutes, try some formula like (downloads-per-minute * filesize * (product of server loads)) for weighting, and pick thresholds for increasing/decreasing the number of servers those files are distributed to.
if you add a new server, it's relatively painless (just add the address to the server pool)
cons:
homebrew solutions are always risky
your load distribution algorithm must be well tested, otherwise bad things could happen (everything mirrored everywhere)
constantly moving files around for balancing adds additional server load
* or use a mixed weighting algorithm: free space, server load, file popularity
disclaimer: never been in the situation myself, just guessing.
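A minimal sketch of that most-free-space-first pick, assuming a `servers` table that tracks free space in bytes (all table and column names are hypothetical):

    <?php
    // Pick the storage server with the most free space for a new upload
    // and reserve the space for it.
    function pick_server(PDO $pdo, $fileSize)
    {
        $stmt = $pdo->prepare(
            'SELECT id, hostname, free_space FROM servers
              WHERE free_space >= ? ORDER BY free_space DESC LIMIT 1'
        );
        $stmt->execute(array($fileSize));
        $server = $stmt->fetch(PDO::FETCH_ASSOC);

        if (!$server) {
            throw new RuntimeException('No media server has enough free space.');
        }

        $upd = $pdo->prepare('UPDATE servers SET free_space = free_space - ? WHERE id = ?');
        $upd->execute(array($fileSize, $server['id']));

        return $server;   // the caller then pushes the file to $server['hostname']
    }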
Consider HDFS, which is part of Apache's Hadoop. This will integrate with PHP, but you'll be setting up a second application. This will also solve all your points of balancing among servers and handling things when your file space usage exceeds one server's ability. It's not purely in PHP, though, but I don't think that's what you meant when you said "pure" anyway.
See http://hadoop.apache.org/core/docs/current/hdfs_design.html for the idea of it. They cover the whole idea of how it handles large files, many files, replication, etc.
