Use PHP cURL to get around hotlink protection

I'm working on an Android application that interacts with a forum I visit. The staff of the forum allow the app, but won't provide an API to work with.
In order to get the information I need, I use an intermediate PHP script that scrapes the forum with cURL. Everything works just great, except for one small detail.
To view topics I scrape all the data I need, such as poster name, date and post content. But since the images stored on their server are hotlink protected, I am unable to see them. The funny thing is that viewing individual images is no problem, but whenever they are placed in a context, they are replaced by the site's copyright image.
I have the feeling that the website checks the HTTP referer that I send (which is empty), and hence responds with the copyright image (hotlink protection).
Can someone give me some tips on how to solve this problem?
The code I use:
$url = 'someurliwanttoscrape';
$cookie_string = 'somecookies';
$useragent = 'someuseragent';
$timeout = 60;
$rawhtml = curl_init();
curl_setopt($rawhtml, CURLOPT_URL, $url);
curl_setopt($rawhtml, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($rawhtml, CURLOPT_REFERER, '');
curl_setopt($rawhtml, CURLOPT_COOKIE, $cookie_string);
curl_setopt($rawhtml, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($rawhtml, CURLOPT_USERAGENT, $useragent);
$output = curl_exec($rawhtml);
curl_close($rawhtml);
This works whenever I put the URL of the image in there. No problem, I can see the image, no hotlink protection. But as soon as I request the URL of a page where the image is embedded in the text, the hotlink protection kicks in.

You can use curl_setopt to tell cURL what referrer to send:
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com');
See the documentation for more details, but that's pretty much all there is to it.
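In the forum case above, that means sending a referer from the forum's own domain when fetching the embedded images. A minimal sketch, reusing the placeholder values from the question (the image URL, topic URL, cookies and user agent are all assumptions):
<?php
// Placeholder values, as in the question; reuse the same cookies/user agent as the page scrape.
$image_url     = 'someurliwanttoscrape/image.jpg'; // hypothetical image embedded in a topic
$topic_url     = 'someurliwanttoscrape';           // page the image appears on
$cookie_string = 'somecookies';
$useragent     = 'someuseragent';

$ch = curl_init($image_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Pretend the request came from the topic page so the hotlink check passes.
curl_setopt($ch, CURLOPT_REFERER, $topic_url);
curl_setopt($ch, CURLOPT_COOKIE, $cookie_string);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
$image = curl_exec($ch);
curl_close($ch);

header('Content-Type: image/jpeg'); // adjust to the actual image type
echo $image;
?>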

Related

Setting up a proxy for an Axis IP camera in PHP

I have been searching for a way to proxy an MJPEG stream from the AXIS M1114 Network Camera.
Using the following URL setup:
http://host:port/axis-cgi/mjpg/video.cgi?resolution=320x240&camera=1
I try to capture the output and make it available to users with a PHP script running on an Apache server on Ubuntu.
Having browsed the web looking for an answer, to no avail, I come to you.
My ultimate goal is to have users able to link to the proxy like this:
<img src='proxy.php'>
and have all the details handled inside proxy.php.
I have tried the cURL approach (advised in a similar thread here), but I can't get it to work, probably due to a lack of knowledge of the inner workings.
Currently my very simple proxy.php looks like this:
<?php
$camurl = "http://ip:port";
$campath = "axis-cgi/mjpg/video.cgi";
$userpass = "user:pw";
$ch = curl_init();
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_URL, $camurl + $campath);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'resolution=320x240&camera=1');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_USERPWD, $userpass);
$result = curl_exec($ch);
header('Content-type: image/jpeg');
echo $result;
curl_close($ch);
?>
My understanding is that this would produce an acceptable output for my plan. But alas.
My question is whether there is a blatant error I do not see. Any simpler option/way of getting the result I'm aiming for is welcome too.
Please point me in the right direction. I'll happily provide any relevant information I might have missed. Thank you in advance.
Solved edit:
After commenting out:
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
changing
curl_setopt($ch, CURLOPT_URL, $camurl + $campath);
to
curl_setopt($ch, CURLOPT_URL, $camurl . $campath); (I was mixing up languages; '.' is the string concatenation operator in PHP)
and, most importantly, removing a stray space in the .php file so that the header() call actually sends the header, it sort of does what I wanted.
Adding a
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
seems to be needed to get the image displayed as an image and not as raw data.
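Putting those fixes together, a working proxy.php might look roughly like this (a sketch based on the edits above; the host, path and credentials remain placeholders, and a slash is assumed between $camurl and $campath):
<?php
// Placeholder values from the question.
$camurl   = "http://ip:port/";
$campath  = "axis-cgi/mjpg/video.cgi";
$userpass = "user:pw";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $camurl . $campath);      // '.' concatenates strings in PHP
curl_setopt($ch, CURLOPT_POSTFIELDS, 'resolution=320x240&camera=1');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_USERPWD, $userpass);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);         // capture the response instead of printing raw data
$result = curl_exec($ch);
curl_close($ch);

// No output (not even a leading space in the file) may be sent ahead of this header.
header('Content-type: image/jpeg');
echo $result;
?>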

PHP cURL redirects to localhost

I'm trying to log in to an external webpage using a PHP script with cURL. I'm new to cURL, so I feel like I'm missing a lot of pieces. I found a few examples and modified them to allow access to HTTPS pages. Ultimately, my goal is to be able to log in to the page and download a .csv by following a specified link once logged in. So far, what I have is a script that tests logging in to the page; the script is shown below:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.websiteurl.com/login');
curl_setopt($ch, CURLOPT_POSTFIELDS,'Email='.urlencode($login_email).'&Password='.urlencode($login_pass).'&submit=1');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
curl_setopt($ch, CURLOPT_REFERER, "https://www.websiteurl.com/login");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$output = curl_exec($ch);
I have a few questions. First, is there a reason this does not redirect on its own? The only way for me to view the contents of the page is to
echo $output
even though CURLOPT_RETURNTRANSFER and CURLOPT_FOLLOWLOCATION are both set to True.
Second, the URL for the page stays at "localhost/folderName/test.php" instead of directing to the actual website. Can anyone explain why this happens? Because the script doesn't actually redirect to a logged-in webpage, I can't seem to do anything that I need to do.
Does my issue have to do with cookies? My cookie.txt file is in the same folder as my .php script. (I'm using WampServer, btw.) Should it be located elsewhere?
Once I'm able to fix these two issues, it seems that all I need to be able to do is to redirect to the link that starts the download process for the .csv file.
Thanks for any help, much appreciated!
Answering part of your question:
From http://php.net/manual/en/function.curl-setopt.php :
CURLOPT_RETURNTRANSFER TRUE to return the transfer as a string of the
return value of curl_exec() instead of outputting it directly.
In other words, it's doing exactly what you described: it returns the response as a string, and you echo it to see it. As requested...
EDIT:
As for the second part of your question - when I change the last three lines of the script to
$output = curl_exec($ch);
header('Location:'.$website);
echo $output;
The address of the page as displayed changes to $website, which in my case is the variable I use to store my equivalent of your 'https://www.websiteurl.com/login'.
I am not sure that is what you wanted to do - because I'm not sure I understand what your next steps are. If you were getting redirected by the login site, wouldn't the new address be part of the header that is returned? And wouldn't you need to extract that address in order to perform the next request (wget or whatever) in order to download the file you wanted to get?
To do so, you need to set CURLOPT_HEADER to TRUE.
You can get the URL where you ended up from
$last_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
(see cURL, get redirect url to a variable).
The same link also has a useful script for completely parsing the header information (returned when CURLOPT_HEADER == true); it's in the answer by nico limpica.
Bottom line: CURL gets the information that your browser would have received if you had pointed it to a particular site; that doesn't mean your browser behaves as though you pointed it to that site...
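If it helps, here is a rough sketch of that flow: log in, check where cURL actually ended up, then fetch the CSV with the same cookie jar. The CSV URL is a hypothetical placeholder, and $login_email / $login_pass are assumed to be set as in your script:
<?php
$login_url = 'https://www.websiteurl.com/login';
$csv_url   = 'https://www.websiteurl.com/path/to/export.csv'; // hypothetical download link

$ch = curl_init($login_url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'Email=' . urlencode($login_email) . '&Password=' . urlencode($login_pass) . '&submit=1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_exec($ch);

// Where did the login redirects actually take us?
$last_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

// Reuse the same handle (and cookies) to fetch the CSV.
curl_setopt($ch, CURLOPT_URL, $csv_url);
curl_setopt($ch, CURLOPT_POST, false);
$csv = curl_exec($ch);
curl_close($ch);
file_put_contents('export.csv', $csv);
?>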

Set up a PHP proxy to access censored websites and bypass a firewall

I'm currently using this plugin, http://wordpress.org/extend/plugins/repress/, which basically makes my website a proxy so that users can access censored websites like this:
www.mywebsite.com/proxy/www.cnn.com
The plugin works well enough, but when it comes to absolute links it doesn't parse them properly and the links are still blocked. Development of that plugin has stopped, so I need to write my own script. I've been searching everywhere and reading up on the tutorials I can find, but none specifically helps me with this.
I know how to use PHP cURL to fetch a website and echo it on a blank page. What I don't know is how to set up a proxy script to work like the above example, where users can type
www.mywebsite.com
followed by
/proxy.php
then their target website
/www.cnn.com
Currently I have this set up:
<?php
$url = 'http://www.cnn.com';
$proxy_port = 80;
$proxy = '92.105.140.115';
$timeout = 0;
$referer = 'http://www.mydomain.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_PROXYPORT, $proxy_port);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 0);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_REFERER, $referer);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
?>
This pulls the home page, but no CSS or images are retrieved. Likewise, all relative links are broken. I have no idea how to apply the proxy_port and proxy variables. I tried
92.105.140.115:80/www.cnn.com
but that doesn't work. I don't quite fully understand this code either since I found it on an example site.
Any answers or links to tutorials are greatly welcome.
Thank You!
Having a completely functioning proxy isn't that simple. There are many such projects already available. Give any of these a shot:
http://www.surrogafier.info/
https://github.com/Alexxz/Simple-php-proxy-script
http://www.glype.com/
Have fun!
You cannot just echo the result of cURL fetching a website, because the browser will interpret the URIs incorrectly. When the user clicks a link, they need to go to your proxy site, not to the original site, so you can't just print with echo; you need to rewrite every link in the fetched page before printing it to the users. I have a fully functional proxy made with PHP at p.listascuba.com that you can try.
Contact me for more info.
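As a very rough illustration of that link-rewriting idea (not a complete proxy), something like the following could rewrite absolute href/src URLs in the fetched page so they point back through a hypothetical proxy.php before you echo it. The regular expression is a simplification and ignores relative URLs, CSS and scripts:
<?php
// $data holds the HTML fetched with cURL, as in the question's script.
function rewrite_links($html, $proxy_base = '/proxy.php/')
{
    return preg_replace_callback(
        '~\b(href|src)=(["\'])(https?://[^"\']+)\2~i',
        function ($m) use ($proxy_base) {
            // Strip the scheme so the result looks like /proxy.php/www.cnn.com/...
            $target = preg_replace('~^https?://~i', '', $m[3]);
            return $m[1] . '=' . $m[2] . $proxy_base . $target . $m[2];
        },
        $html
    );
}

echo rewrite_links($data);
?>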

Copying an image with PHP

With permission from the other site, I am supposed to periodically download an image hosted on another site and include it in a collection of webcam photos on our site with an external link to the contributing site.
This hasn't been an issue for any of the other sites, but with this particular one, I can't open the image to resize it and save it to our server. It's not a hotlinking issue, because I can create a plain ole' <img src="http://THEIRIMAGE" /> on a page on our site and it works fine.
I've tried using $img = new Imagick($sourceFilePath) directly, as with all the others, as well as trying to use PHP's copy() and also trying to copy the image using cURL, but when doing so the page just times out with no results at all.
Here's the image in question: http://island-alpaca.selfip.com:10202/SnapshotJPEG?Resolution=640x480&Quality=Standard
Like I've said, I'm able to do this sort of thing with several other webcams, but it isn't working with this one, and I am stuck as to why it isn't. Any help would be greatly appreciated.
Thank you.
In order to reduce bandwidth and server load some sites block certain bots from accessing their content. Your cURL request needs to more closely mimic an actual browser, which would include a referrer (usually), user agent, etc. It could also be that there is a redirect and you haven't told cURL to follow redirects.
Try setting more opts like this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch,CURLOPT_ENCODING , "gzip");
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'PHP');
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
If that doesn't work, get the user agent from an actual browser and put it in there and see if that makes a difference.
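For example, a rough sketch that fetches the snapshot with browser-like headers and then resizes it with Imagick; the user agent string, sizes and destination path are just assumptions:
<?php
$url = 'http://island-alpaca.selfip.com:10202/SnapshotJPEG?Resolution=640x480&Quality=Standard';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_REFERER, $url);
// Example desktop browser user agent; any current one should do.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36');
$blob = curl_exec($ch);
curl_close($ch);

if ($blob !== false) {
    $img = new Imagick();
    $img->readImageBlob($blob);     // load the downloaded JPEG
    $img->thumbnailImage(320, 0);   // resize to 320px wide, keeping the aspect ratio
    $img->writeImage('/path/to/webcams/island-alpaca.jpg'); // hypothetical destination
}
?>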
Try using file_get_contents().
for example:
$url = 'http://island-alpaca.selfip.com:10202/SnapshotJPEG?Resolution=640x480&Quality=Standard';
$outputfile = "tmp/image" . date("Y-m-d_H.i.s");
$cmd = "wget -q \"$url\" -O $outputfile";
exec($cmd);
$temp_img = file_get_contents($outputfile);
$img = new Imagick();
$img->readImageBlob($temp_img); // Imagick's constructor expects a filename, so load the raw data with readImageBlob()
Can you try this and get back to me?

PHP / Curl: HEAD Request takes a long time on some sites

I have simple code that does a HEAD request for a URL and then prints the response headers. I've noticed that on some sites, this can take a long time to complete.
For example, requesting http://www.arstechnica.com takes about two minutes. I've tried the same request using another web site that does the same basic task, and it comes back immediately. So there must be something I have set incorrectly that's causing this delay.
Here's the code I have:
$ch = curl_init();
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt ($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
// Only calling the head
curl_setopt($ch, CURLOPT_HEADER, true); // header will be at output
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'
$content = curl_exec ($ch);
curl_close ($ch);
Here's a link to the web site that does the same function: http://www.seoconsultants.com/tools/headers.asp
The code above, at least on my server, takes two minutes to retrieve www.arstechnica.com, but the service at the link above returns it right away.
What am I missing?
Try simplifying it a little bit:
print htmlentities(file_get_contents("http://www.arstechnica.com"));
The above outputs instantly on my webserver. If it doesn't on yours, there's a good chance your web host has some kind of setting in place to throttle these kinds of requests.
EDIT:
Since the above happens instantly for you, try setting this curl setting on your original code:
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
Using the tool you posted, I noticed that http://www.arstechnica.com sends a 301 header for any request made to it. It is possible that cURL is getting this and not following the new Location specified to it, thus causing your script to hang.
SECOND EDIT:
Curiously enough, trying the same code you have above was making my webserver hang too. I replaced this code:
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'
With this:
curl_setopt($ch, CURLOPT_NOBODY, true);
Which is the way the manual recommends you do a HEAD request. It made it work instantly.
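For reference, a complete HEAD request along those lines might look like this (a sketch; arstechnica.com is just the example URL from the question):
<?php
$url = 'http://www.arstechnica.com';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the 301 mentioned above
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_HEADER, true);         // include the headers in the output
curl_setopt($ch, CURLOPT_NOBODY, true);         // issue a real HEAD request
$headers = curl_exec($ch);
curl_close($ch);

echo $headers;
?>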
You have to remember that HEAD is only a suggestion to the web server. For HEAD to do the right thing, it often takes some explicit effort on the part of the admins. If you HEAD a static file, Apache (or whatever your web server is) will often step in and do the right thing. If you HEAD a dynamic page, the default for most setups is to execute the GET path, collect all the results, and just send back the headers without the content. If that application is in a 3 (or more) tier setup, that call could potentially be very expensive and needless for a HEAD context. For instance, in a Java servlet, by default doHead() just calls doGet(). To do something a little smarter for the application, the developer would have to explicitly implement doHead() (and more often than not, they will not).
I encountered an app from a Fortune 100 company that is used for downloading several hundred megabytes of pricing information. We'd check for updates to that data by executing HEAD requests fairly regularly until the modified date changed. It turns out that this request would actually make back-end calls to generate the list every time we made it, which involved gigabytes of data on their back end and transferring it between several internal servers. They weren't terribly happy with us, but once we explained the use case they quickly came up with an alternate solution. If they had implemented HEAD, rather than relying on their web server to fake it, it would not have been an issue.
If my memory doesn't fail me, doing a HEAD request in cURL changes the HTTP protocol version to 1.0 (which is slow and probably the guilty party here). Try changing that to:
$ch = curl_init();
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt ($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
// Only calling the head
curl_setopt($ch, CURLOPT_HEADER, true); // header will be at output
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1); // ADD THIS
$content = curl_exec ($ch);
curl_close ($ch);
I used the below function to find out the redirected URL.
$head = get_headers($url, 1);
The second argument makes it return an array with keys. For example, the following will give the Location value.
$head["Location"]
http://php.net/manual/en/function.get-headers.php
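For example, a quick sketch of checking where a URL redirects with get_headers():
<?php
$url  = 'http://www.arstechnica.com';
$head = get_headers($url, 1); // second argument: return an associative array

if (isset($head['Location'])) {
    // A string, or an array if there were several redirects in a row.
    var_dump($head['Location']);
}
?>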
This:
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
I wasn't trying to get headers.
I was just trying to make the page load of some data not take 2 minutes, similar to what's described above.
That magical little option dropped it down to 2 seconds.
