I am running an IIS 8 / PHP web server and am attempting to write a so-called 'proxy script' as a means of fetching HTTP content and loading it onto an HTTPS page.
Although the script does run successfully (outputting whatever the HTTP page sends) in some cases - for example, Google.com, Amazon.com, etc. - it does not work in fetching my own website and a few others.
Here is the code of proxy.php:
<?php
$url = $_GET['url'];
echo "FETCHING URL<br/>"; // displays this no matter what URL I enter
$ctx_array = array('http' =>
array(
'method' => 'GET',
'timeout' => 10,
)
);
$ctx = stream_context_create($ctx_array);
$output = file_get_contents($url, false, $ctx); // times out for certain requests
echo $output;
When I set $_GET['url'] to http://www.ucomc.net, the script fails. With most other URLs, it works fine.
I have checked other answers on here and other places but none of them describe my issue, nor do the solutions offered solve it.
I've seen suggestions for similar problems that involve changing the user agent, but when I do this it not only fails to solve the existing problem but also prevents other sites from loading. I do not want to rely on third-party proxies (I don't trust the free ones, don't want to deal with their query limits, and don't want to pay for the expensive ones).
Turns out that it was just a problem with the firewall. Testing it on a PHP sandbox worked fine, so I just had to modify the outgoing connections settings in the server firewall to allow the request through.
I am trying to use a Http client to store the HTML from a web page. The following code snippet shows how I have configured the Http client, it uses php-http/guzzle6-adapter.
I know from my tests that the client works properly when pointed at other webpages.
<?php
require_once(__DIR__.'/vendor/autoload.php');
use Http\Adapter\Guzzle6\Client as GuzzleAdapter;
use GuzzleHttp\Psr7\Request;
$config = [
'verify' => false,
'timeout' => 2
];
$adapter = GuzzleAdapter::createWithConfig($config);
$request = new Request('GET', 'https://workingwithchildren.wa.gov.au/');
// Returns a Psr\Http\Message\ResponseInterface
$response = $adapter->sendRequest($request);
echo $response->getBody();
?>
However, the page I am trying to resolve, https://workingwithchildren.wa.gov.au/, returns the following error no matter what I do:
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 9283834035315018727
I pointed my browser at the website and used Chrome Developer Tools to examine the Request/Response data being exchanged.
I noticed that the site is setting cookies that seem to relate to security and CPFS, and I would imagine these cookies are what is stopping my client from resolving the web page successfully. But I don't know how to fix this. I'd imagine this is a problem others have faced before. Any help would be much appreciated.
For anyone experiencing a similar problem the solution I found was to, as the commenter Scuzzy suggested, add User-Agent data to my guzzle config.
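For reference, here is a minimal sketch of what that can look like with php-http/guzzle6-adapter; the 'headers' key is standard Guzzle request-option config, and the User-Agent string itself is just an illustrative browser-like value:
<?php
require_once(__DIR__.'/vendor/autoload.php');
use Http\Adapter\Guzzle6\Client as GuzzleAdapter;
use GuzzleHttp\Psr7\Request;
// 'headers' sets default headers on every request the adapter sends;
// any plausible browser-like User-Agent value should do here.
$config = [
    'verify' => false,
    'timeout' => 2,
    'headers' => ['User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)']
];
$adapter = GuzzleAdapter::createWithConfig($config);
$request = new Request('GET', 'https://workingwithchildren.wa.gov.au/');
$response = $adapter->sendRequest($request);
echo $response->getBody();
?>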
I have a situation like this.
A crawler script fetches the content of a URL using file_get_contents(). It sets the user agent to "CrawlerBot" via ini_set('user_agent') just above the line where file_get_contents() is called.
My concern is that when I call ini_get('user_agent') in the code of the requested URL, it returns a blank value, whereas $_SERVER['HTTP_USER_AGENT'] detects the correct user agent. Both files are hosted on the same server.
Does anybody know why this happens?
That's not what ini_get() does. It's for retrieving server configuration values (the configuration of your server), not request-specific values like the user agent sent by a requesting browser/script/whatever.
So, you can use ini_get() to find out what user agent value, if any, is set for requests made by your server, like the one you are actually making. You cannot use it to find out the user agent of a request made to your server.
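To make the distinction concrete, here is a small sketch (the file names crawler.php and target.php are made up for illustration):
// crawler.php -- the script that makes the request
ini_set('user_agent', 'CrawlerBot');
echo ini_get('user_agent'); // "CrawlerBot": our own outgoing configuration value
$html = file_get_contents('http://example.com/target.php');
// target.php -- the script that receives the request
echo ini_get('user_agent'); // blank: this reads target.php's own configuration, not the caller's
echo $_SERVER['HTTP_USER_AGENT']; // "CrawlerBot": the User-Agent header the caller actually sent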
Here is an example of code that sets the user agent and retrieves a resource with file_get_contents().
//Set uri
$uri = 'http://example.com';
//Init context
$ctx = stream_context_create(
array(
'http' => array(
'user_agent' => 'MySuperAgent/3.0'
)
)
);
//Try to retrieve content
if (($data = file_get_contents($uri, false, $ctx)) === false) {
die('file_get_contents error');
}
PS: Note that the context array should be under the 'http' key even for HTTPS resources.
PS2: I strongly suggest setting the timeout and the maximum number of redirects on the context to avoid slowdowns in your application.
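For example, the same context with a timeout and a redirect cap added (the values are arbitrary):
$ctx = stream_context_create(
    array(
        'http' => array(
            'user_agent' => 'MySuperAgent/3.0',
            'timeout' => 5,        // give up after 5 seconds
            'max_redirects' => 3   // follow at most 3 redirects
        )
    )
);
if (($data = file_get_contents($uri, false, $ctx)) === false) {
    die('file_get_contents error');
}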
I'm trying to POST some data (a JSON string) from a PHP script to a Java server (both written by myself) and get the response back.
I tried the following code:
$url="http://localhost:8000/hashmap";
$opts = array('http' => array('method' => 'POST', 'content' => $JSONDATA,'header'=>"Content-Type: application/x-www-form-urlencoded"));
$st = stream_context_create($opts);
echo file_get_contents($url, false,$st);
Now, this code actually works (I get the right answer back as the result), but file_get_contents() hangs for 20 seconds every time it executes (I printed the time before and after the call). The operations performed by the server complete quickly, and I'm sure it's not normal to wait this long for the response.
Am I missing something?
Possibly a badly misconfigured server that doesn't send the right content length while speaking HTTP/1.1, so the client keeps the connection open waiting for more data until the timeout hits.
Either fix the server or request the data as HTTP/1.0.
Try adding Connection: close and Content-Length: strlen($JSONDATA) headers to the $opts.
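Untested, but the adjusted context would look something along these lines (protocol_version and the extra headers are the only changes from the question's code):
$opts = array('http' => array(
    'method' => 'POST',
    'content' => $JSONDATA,
    'protocol_version' => 1.0, // request as HTTP/1.0 instead of 1.1
    'header' => "Content-Type: application/x-www-form-urlencoded\r\n" .
                "Connection: close\r\n" .
                "Content-Length: " . strlen($JSONDATA) . "\r\n"
));
$st = stream_context_create($opts);
echo file_get_contents($url, false, $st);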
Also, if you want to avoid using extensions, have a look at this class I wrote some time ago to perform HTTP requests using PHP core only. It works on PHP4 (which is why I wrote it) and PHP5, and the only extension it ever requires is OpenSSL, and you only need that if you want to do an HTTPS request. Documented(ish) in comments at the top.
Supports all sorts of stuff - GET, POST, PUT and more, including file uploads, cookies, automatic redirect handling. I have used it quite a lot on a platform I work with regularly that is stuck with PHP/4.3.10 and it works beautifully... Even if I do say so myself...
I'm trying to get the contents from another file with file_get_contents (don't ask why).
I have two files: test1.php and test2.php. test1.php returns a string based on the user that is logged in.
test2.php tries to get the contents of test1.php; it is the script the browser requests, so it receives the cookies.
To send the cookies with file_get_contents, I create a streaming context:
$opts = array('http' => array('header'=> 'Cookie: ' . $_SERVER['HTTP_COOKIE']."\r\n"));
I'm retrieving the contents with:
$contents = file_get_contents("http://www.example.com/test1.php", false, $opts);
But now I get the error:
Warning: file_get_contents(http://www.example.com/test1.php) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found
Does somebody knows what I'm doing wrong here?
Edit:
I forgot to mention: without the stream context the page loads just fine, but without the cookies I don't get the info I need.
First, this is probably just a typo in your question, but the third argument to file_get_contents() needs to be your stream context, NOT the array of options. I ran a quick test with something like this, and everything worked as expected:
$opts = array('http' => array('header'=> 'Cookie: ' . $_SERVER['HTTP_COOKIE']."\r\n"));
$context = stream_context_create($opts);
$contents = file_get_contents('http://example.com/test1.txt', false, $context);
echo $contents;
The error indicates the server is returning a 404. Try fetching the URL from the machine PHP is running on and not from your workstation/desktop/laptop. It may be that your web server is having trouble reaching the site, your local machine has a cached copy, or some other network screwiness.
Be sure you repeat your exact request when running this test, including the cookie you're sending (command line curl is good for this). It's entirely possible that the page in question may load fine in a browser without the cookie, but when you send the cookie the site actually is returning a 404.
Make sure that $_SERVER['HTTP_COOKIE'] has the raw cookie you think it does.
If you're screen scraping, download Firefox and a copy of the LiveHTTPHeaders extension. Perform all the necessary steps to reach whatever page it is you want in Firefox. Then, using the output from LiveHTTPHeaders, recreate the exact same request sequence. Include every header, not just the cookies.
Finally, PHP Curl exists for a reason. If at all possible, (I'm not asking!) use it instead. :)
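For completeness, here is a rough cURL equivalent of the call in the question, forwarding the same raw cookie string (not tested against the site in question):
$ch = curl_init('http://www.example.com/test1.php');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,              // return the body instead of printing it
    CURLOPT_COOKIE => $_SERVER['HTTP_COOKIE'],   // pass the browser's cookie string straight through
    CURLOPT_FOLLOWLOCATION => true               // follow any redirects
));
$contents = curl_exec($ch);
curl_close($ch);
echo $contents;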
Just to share this information.
When using session_start(), the session file is locked by PHP, so the current script is the only one that can access it. If you then try to request the same site via fsockopen() or file_get_contents(), you can wait a long time because you are trying to open a file that is still locked.
One way to solve this problem is to use session_write_close() to unlock the file, and then lock it again afterwards with session_start().
Example:
<?php
$opts = array('http' => array('header'=> 'Cookie: ' . $_SERVER['HTTP_COOKIE']."\r\n"));
$context = stream_context_create($opts);
session_write_close(); // unlock the file
$contents = file_get_contents('http://127.0.0.1/controler.php?c=test_session', false, $context);
session_start(); // Lock the file
echo $contents;
?>
Since file_get_contents() is a blocking function, the two scripts won't try to modify the session file concurrently.
But I'm sure this is not the best way to manipulate the session over an extra connection.
Btw: it's faster than cURL and fsockopen()
Let me know if you find something better.
Just out of curiosity, are you attempting file_get_contents on a page that has a space in it? I remember trying to use fgc on a URL that had a space in the name and while my web browser parsed it just fine, fgc didn't. I ended up having to use a str_replace to replace ' ' with '%20'.
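To illustrate that workaround (the URL here is made up):
$url = 'http://example.com/my page.html';
$contents = file_get_contents(str_replace(' ', '%20', $url)); // encode the space so the stream wrapper accepts it
// rawurlencode() on the individual path segments is the more general fix, but the simple replace was enough here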
I would think this should have been relatively easy to spot, though, as it would report only half of the filename. Also, I noticed in one of these posts someone used \r\n while defining the headers. Keep in mind that PHP does not interpret these escape sequences inside single quotes, but they work fine in double quotes.
Make sure that test1.php exists on the server. Try opening it in your own browser to make sure!
Given a list of urls, I would like to check that each url:
Returns a 200 OK status code
Returns a response within X amount of time
The end goal is a system that is capable of flagging urls as potentially broken so that an administrator can review them.
The script will be written in PHP and will most likely run on a daily basis via cron.
The script will be processing approximately 1000 urls at a go.
Question has two parts:
Are there any bigtime gotchas with an operation like this, what issues have you run into?
What is the best method for checking the status of a url in PHP considering both accuracy and performance?
Use the PHP cURL extension. Unlike fopen(), it can also make HTTP HEAD requests, which are sufficient to check the availability of a URL and save you a ton of bandwidth, since you don't have to download the entire body of the page.
As a starting point you could use some function like this:
function is_available($url, $timeout = 30) {
$ch = curl_init(); // get cURL handle
// set cURL options
$opts = array(CURLOPT_RETURNTRANSFER => true, // do not output to browser
CURLOPT_URL => $url, // set URL
CURLOPT_NOBODY => true, // do a HEAD request only
CURLOPT_TIMEOUT => $timeout); // set timeout
curl_setopt_array($ch, $opts);
curl_exec($ch); // do it!
$retval = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200; // check if HTTP OK
curl_close($ch); // close handle
return $retval;
}
However, there's a ton of possible optimizations: You might want to re-use the cURL instance and, if checking more than one URL per host, even re-use the connection.
Oh, and this code does check strictly for HTTP response code 200. It does not follow redirects (302) -- but there also is a cURL-option for that.
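That option is CURLOPT_FOLLOWLOCATION; if redirects should count as "available", a small addition to the $opts array from the function above would be enough:
$opts[CURLOPT_FOLLOWLOCATION] = true; // follow 301/302 responses to their target
$opts[CURLOPT_MAXREDIRS] = 5;         // but cap the number of hops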
Look into cURL. There's a library for PHP.
There's also an executable version of cURL so you could even write the script in bash.
I actually wrote something in PHP that does this over a database of 5k+ URLs. I used the PEAR class HTTP_Request, which has a method called getResponseCode(). I just iterate over the URLs, passing them to getResponseCode and evaluate the response.
However, it doesn't work for FTP addresses, URLs that don't begin with http or https (unconfirmed, but I believe that's the case), or sites with invalid security certificates (a 0 is returned). Also, a 0 is returned for server-not-found (there's no status code for that).
And it's probably easier than cURL: you include a few files and use a single function to get an integer code back.
fopen() supports http URI.
If you need more flexibility (such as timeout), look into the cURL extension.
Seems like it might be a job for curl.
If you're not stuck on PHP, Perl's LWP might be an answer too.
You should also be aware of URLs returning 301 or 302 HTTP responses which redirect to another page. Generally this doesn't mean the link is invalid. For example, http://amazon.com returns 301 and redirects to http://www.amazon.com/.
Just returning a 200 response is not enough; many valid links will continue to return "200" after they change into porn / gambling portals when the former owner fails to renew the domain.
Domain squatters typically ensure that every URL in their domains returns 200.
One potential problem you will undoubtedly run into is when the box this script is running on loses access to the Internet... you'll get 1000 false positives.
It would probably be better for your script to keep some type of history and only report a failure after 5 days of failure.
Also, the script should be self-checking in some way (like checking a known good web site [google?]) before continuing with the standard checks.
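A rough sketch combining both ideas, reusing the is_available() function from the earlier answer; flag_for_review() and the persistence of $failCount between cron runs are left as assumptions:
// Bail out early if a known-good site is unreachable (no point flagging 1000 URLs)
if (!is_available('http://www.google.com', 5)) {
    exit("No outbound connectivity, skipping this run\n");
}
foreach ($urls as $url) {
    if (is_available($url)) {
        $failCount[$url] = 0; // reset the streak on success
    } else {
        $failCount[$url] = isset($failCount[$url]) ? $failCount[$url] + 1 : 1;
        if ($failCount[$url] >= 5) {      // roughly 5 daily runs in a row
            flag_for_review($url);        // hypothetical reporting helper
        }
    }
}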
You only need a bash script to do this. Please check my answer on a similar post here. It is a one-liner that reuses HTTP connections to dramatically improve speed, retries n times for temporary errors and follows redirects.