I am hesitated to ask this question because it looks weird.
But anyway.
Just in case someone had encountered the same problem already...
filesystem functions (fopem, file, file_get_contents) behave very strange for http:// wrapper
it seemingly works. no errors raised. fopen() returns resource.
it returns no data for all certainly working urls (e.g. http://google.com/).
file returns empty array, file_get_contents() returns empty string, fread returns false
for all intentionally wrong urls (e.g. http://goog973jd23le.com/) it behaves exactly the same, save for little [supposedly domain lookup] timeout, after which I get no error (while should!) but empty string.
url_fopen_wrapper is turned on
curl (both command line and php versions) works fine, all other utilities and applications works fine, local files opened fine
This error seems inapplicable because in my case it doesn't work for every url or host.
php-fpm 5.2.11
Linux version 2.6.35.6-48.fc14.i686 (mockbuild#x86-18.phx2.fedoraproject.org)
I fixed this issue on my server (running PHP 5.3.3 on Fedora 14) by removing the --with-curlwrapper from the PHP configuration and rebuilding it.
Sounds like a bug. But just for posterity, here are a few things you might want to debug.
allow_url_fopen: already tested
PHP under Apache might behave differently than PHP-CLI, and would hint at chroot/selinux/fastcgi/etc. security restrictions
local firewall: unlikely since curl works
user-agent blocking: this is quite common actually, websites block crawlers and unknown clients
transparent proxy from your ISP, which either mangles or blocks (PHP user-agent or non-user-agent could be interpreted as malware)
PHP stream wrapper problems
Anyway, first let's proof that PHPs stream handlers are functional:
<?php
if (!file_get_contents("data:,ok")) {
die("Houston, we have a stream wrapper problem.");
}
Then try to see if PHP makes real HTTP requests at all. First open netcat on the console:
nc -l 80000
And debug with just:
<?php
print file_get_contents("http://localhost:8000/hello");
And from here you can try to communicate with PHP, see if anything returns if you variate the response. Enter an invalid response first into netcat. If there's no error thrown, your PHP package is borked.
(You might also try communicating over a "tcp://.." handle then.)
Next up is experimenting with http stream wrapper parameters. Use http://example.com/ literally, which is known to work and never block user-agents.
$context = stream_context_create(array("http"=>array(
"method" => "GET",
"header" => "Accept: xml/*, text/*, */*\r\n",
"ignore_errors" => false,
"timeout" => 50,
));
print file_get_contents("http://www.example.com/", false, $context, 0, 1000);
I think ignore_errors is very relevant here. But check out http://www.php.net/manual/en/context.http.php and specifically try to set protocol_version to 1.1 (will get chunked and misinterpreted response, but at least we'll see if anything returns).
If even this remains unsuccessful, then try to hack the http wrapper.
<?php
ini_set("user_agent" , "Mozilla/3.0\r\nAccept: */*\r\nX-Padding: Foo");
This will not only set the User-Agent, but inject extra headers. If there is a processing issue with construction the request within the http stream wrapper, then this could very eventually catch it.
Otherwise try to disable any Zend extensions, Suhosin, PHP xdebug, APC and other core modules. There could be interferences. Else this is potentiallyan issue specific to the Fedora package. Try a new version, see if it persists on your system.
When you use the http stream wrapper PHP creates an array for you called $http_response_header after file_get_contents() (or any of the other f family of functions) is called. This contains useful info on the state of the response. Could you do a var_dump() of this array and see if it gives you any more info on the response?
It's a really weird error that you're getting. The only thing I can think of is that something else on the server is blocking the http requests from PHP, but then I can't see why cURL would still be ok...
Is http stream registered in your PHP installation? Look for "Registered PHP Streams" in your phpinfo() output. Mine says "https, ftps, compress.zlib, compress.bzip2, php, file, glob, data, http, ftp, phar, zip".
If there is no http, set allow_url_fopen to on in your php.ini.
My problem was solved dealing with the SSL:
$arrContextOptions = array(
"ssl" => array(
"verify_peer" => false,
"verify_peer_name" => false,
),
);
$context = stream_context_create($arrContextOptions);
$jsonContent = file_get_contents("https://www.yoursite.com", false, $context);
What does a test with fsockopen tell you?
Is the test isolated from other code?
I had the same issue in Windows after installing XAMPP 1.7.7. Eventually I managed to solve it by adding the following line to php.ini (while having allow_url_fopen = On):
extension=php_openssl.dll
Use http://pear.php.net/reference/PHP_Compat-latest/__filesource/fsource_PHP_Compat__PHP_Compat-1.6.0a2CompatFunctionfile_get_contents.php.html and rename it and test if the error occurs with this rewritten function.
Related
we have a php web app using Guzzle 5 to download Wordpress RSS feeds.
It's working fine except for this feed https://www.socialquant.net/blog/feed/
The owner of this site does want us to pull the feed, and is not knowingly attempting to block access.
I can successfully download the file from my local machine and from the production web server (where we initially noticed the problem) using wget or curl with no special options.
This happened once before and that time we believed the issue to be caused by mod_security on Apache and it was solved by adding an arbitrary User-Agent header. But that time I was able to reproduce the issue consistently on the command line, this time it's only failing through Guzzle/PHP
I've copied the response headers from a browser request to the problem feed, and another feed that is working. I crossed off those that were the same and was left with the below
Server:Apache/2.2.22
Vary:User-Agent
X-Powered-By:PHP/5.3.29
Content-Encoding:gzip
Server:Apache
Vary:Accept-Encoding
X-Powered-By:PHP/5.5.30
That's not offering much insight. The gzip content encoding jumps out, I'm trying to find another working feed using gzip to verify this but it shouldn't matter as Guzzle's default mode is to automatically handle encoding. And we're using the same settings to download images from CDNs which are using gzip.
Does anyone have any ideas please? Thanks :)
EDIT
Using Guzzle 5.3.0
Code:
$client = new \GuzzleHttp\Client();
try {
$res = $client->get( $feed, [
'headers' => ['User-Agent' => 'Mozilla/4.0']
] );
} catch (\Exception $e) {
}
I'm afraid I don't have a proper solution to your problem, but I have it working again.
tl;dr version
It's the User-Agent header, changing it to pretty much anything else works.
This wget call fails:
wget -d --header="User-Agent: Mozilla/4.0" https://www.socialquant.net/blog/feed/
but this works
wget -d --header="User-Agent: SomeRandomText" https://www.socialquant.net/blog/feed/
And with that, the PHP below now also works:
require 'vendor/autoload.php';
$client = new \GuzzleHttp\Client();
$feed = 'https://www.socialquant.net/blog/feed/';
try {
$res = $client->get(
$feed,
[
'headers' => [
'User-Agent' => 'SomeRandomText',
]
]
);
echo $res->getBody();
} catch (\Exception $e) {
echo 'Exception: ' . $e->getMessage();
}
My thoughts
I started with wget and curl as you pointed out, which works when no special headers or options are set. Opening it in my browser also worked. I also tried using Guzzle without the User-Agent set and that also works.
Once I set the User-Agent to Mozilla/4.0 or even Mozilla/5.0 it started failing with 406 Not Acceptable
According to the HTTP Status Code definitions, a 406 means
The resource identified by the request is only capable of generating response entities which have content characteristics not acceptable according to the accept headers sent in the request.
In theory, adding Accept and Accept-Encoding headers should resolve the issue, but it didn't. Not via Guzzle or wget.
I then found the Mozilla Developer Network definition which states:
This response is sent when the web server, after performing server-driven content negotiation, doesn't find any content following the criteria given by the user agent.
This kinda points at the User-Agent again. This led me to believe that you are indeed correct that mod_security is doing something odd. I am convinced that an update to mod_security or Apache on the client's servers added a rule to parse the Mozilla/* user agents in a specific way since sending the User-Agent: Mozilla/4.0 () also works.
That's why I'm saying I don't have a proper solution for you. Even though the client wants you to pull the feed, they (or their hosting) is still in control of the rules.
Note: I noticed my IP getting blacklisted after a number of failed 406 attempts, after which I had to wait an hour before I could access the site again. Most likely a mod_security rule. mod_security might even be picking up on the automated requests with your user agent and start blocking it or rejecting it with the 406.
I don't have a solution for you either, as I'm also experiencing this same issue (except I get error 503 and it fails 60% of the time). Let me know if you have found a solution.
However, I would like to share with you what I have found through my recent research. I found that certain User-Agents work better than others for me. This makes me believe that it's not what Donovan states to be the case (at least for me).
When I set User-Agent to null, it works 100% of the time. However, I haven't made any large requests yet, as I'm afraid of getting IP banned, as I know I would with a large request.
When I do a var_dump of the request itself, I see a lot of arrays which include Guzzle markers. I'm thinking, maybe Amazons detection services can tell that I'm spoofing the headers? I don't know.
Hope you figured it out.
This is driving me crazy. I can download a file from internet through browser and its working fine. However if I write a script that does this for me using PHP, the file ends up being completely different (as per online diff tools).
$link = "https://torcache.net/torrent/9AE1726935FF9C08DF422CCE3C4445FC9484478B.torrent?title=[kat.cr]the.big.bang.theory.s08e24.720p.hdtv.x264.dimension.rartv";
file_put_contents($file_name, fopen( $link, 'r'));
First I tried to play around with encoding, but that should not matter as the file is binary, right? Also tried file_get_contents instead of fopen first, the same problem.
My PHP app is running on UTF-8 w/o BOM files. Can someone help? What am I doing wrong?
I think I got what you mean. The responses from torcache.net are all compressed with gzip. If you download the file from your browser, the file is automatically decoded, but if you do the same with php you get the same file but still encoded. You can use gzdecode to decodes it.
file_put_contents($file_name, gzdecode(file_get_contents($link)));
You have some errors in your code.
First you try to add a string to a variable without parentheses.
$link = "https://torcache.net/torrent/9AE1726935FF9C08DF422CCE3C4445FC9484478B.torrent?title=[kat.cr]the.big.bang.theory.s08e24.720p.hdtv.x264.dimension.rartv"
The next one you try to download that file over https. Check if your openssl is enabled and if allow_url_fopen is set correctly in your php.ini
https://secure.php.net/manual/en/features.remote-files.php
There are 3 ways to solve your problem:
enable openssl for https calls
Use curl and set the ssl verifier to 0
Replace the https with http to make a normal call.
Option 3 is the easiest one.
The URL in question : http://www.roblox.com/asset/?id=149996624
When accessed in a browser, it will correctly download a file (which is an XML document). I wanted to get the file in php, and simply display its contents on a page.
$contents = file_get_contents("http://www.roblox.com/asset/?id=149996624");
The above is what I've tried using (as far as I know, the page does not expect any headers). I get a 500 HTTP error. However, in Python, the following code works and I receive the file.
r = requests.get("http://www.roblox.com/asset/?id=147781188")
I'm confused as to what the distinction is between how these two requests are sent. I am almost 100% it is not a header problem. I've also tried the cURL library in PHP to no avail. Nothing I've tried in PHP seems to succeed with the URL (with any valid id parameter); but Python is able to bring success nonchalantly.
Any insight as to why this issue may be happening would be great.
EDIT : I have already tried copying Python's headers into my PHP request.
EDIT2 : It also appears that there are two requests happening upon navigating to the link.
Is this on a linux/mac host by chance? If so you could use ngrep to see the differences on the request themselves on the wire. Something like the following should work
ngrep -t '^(GET) ' 'src host 127.0.0.1 and tcp and dst port 80'
EDIT - The problem is that your server is responding with a 302 and the PHP library is not following it automatically. Cheers!
$output = file_get_contents("http://www.canadapost.ca/cpc2/addrm/hh/current/indexa/caONu-e.asp");
var_dump($output);
HTTP 505 Status means the webserver does not support the HTTP version used by the client (in this case, your PHP program).
What version of PHP are you running, and what HTTP/Web package(s) are you using in your PHP program?
[edit...]
Some servers deliberately block some browsers -- your code may "look like" a browser that the server is configured to ignore. I would particularly check the user agent string that your code is passing along to the server.
Check in your PHP installation (php.ini file) if the allow_url_fopen is enabled.
If not, any calls to file_get_contents will fail.
It works fine for me.
That site could be blocking the server that you're using to access it.
When you run the URL from your browser, your own ISP is used to get the information and display in your browser. But when you run from PHP, the ISP of your web host is used to get the information, then it passes it back to you.
Maybe you can do this to check and see what kind of headers its returning for you?
$headers=get_headers("http://www.canadapost.ca/cpc2/addrm/hh/current/indexa/caONu-e.asp");
print_r($headers);
I'm using the file_get_contents function to control a client,
(e.g. http://ip:port/?light=on)
When using the corresponding command in the browser it works, when i use the same url in combination with file_get_contents function it doesnt work.
When i wireshark the requests i notice that the browser is using http/1.1 and file_get_contents is using http/1.0.
I believe that the version of http is the problem why my code is not working,
How can i change this version of http in de file_get_contents function? or work around it?
You can set the HTTP version using a context:
$context = stream_context_create(array('http'=>array('protocol_version'=>'1.1')));
file_get_contents('http://ip:port/?light=on', false, $context);
See also the full list of context options http://www.php.net/manual/en/context.http.php
Note that if the server uses chunked encoding, you must use PHP 5.3 or superior.