I use the PHP function file_get_contents as a proxy to fetch websites on two different web hosts.
It works for all websites except Wikipedia.
It gives me this output every time:
WIKIMEDIA FOUNDATION
Error
Our servers are currently experiencing a technical problem. This is probably temporary and
should be fixed soon. Please try again in a few minutes.
Anyone know what the problem is?
You're probably not passing the correct User-Agent. See here.
You should pass a context to file_get_contents:
PHP: file_get_contents - Manual
PHP: stream_context_create - Manual
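For instance, a minimal sketch of that approach, assuming a descriptive User-Agent of your own (the tool name and contact address below are placeholders):

$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "User-Agent: MyWikiFetcher/1.0 (https://example.com/; contact@example.com)\r\n",
    ),
));
$html = file_get_contents('https://en.wikipedia.org/wiki/Main_Page', false, $context);
if ($html === false) {
    die('Request failed');
}
echo $html;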
Wikimedia Foundation policy is to block requests with non-descriptive or missing User-Agent headers because these tend to originate from misbehaving scripts. "PHP" is one of the blacklisted values for this header.
You should change the default User-Agent header to one that identifies your script and how the system administrators can contact you if necessary:
ini_set('user_agent', 'MyCoolTool/1.1 (http://example.com/MyCoolTool/; MyCoolTool@example.com)');
Of course, be sure to change the name, URL, and e-mail address rather than copying the code verbatim.
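With that ini setting in place, a plain file_get_contents call should then be accepted. A minimal sketch (the page URL is just an example):

ini_set('user_agent', 'MyCoolTool/1.1 (http://example.com/MyCoolTool/; MyCoolTool@example.com)');
$html = file_get_contents('https://en.wikipedia.org/wiki/PHP');
if ($html === false) {
    die('Request failed');
}
echo $html;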
Wikipedia requires a User-Agent HTTP header be sent with the request. By default, file_get_contents does not send this.
You should use fsockopen, fputs, feof and fgets to send a full HTTP request, or you may be able to do it with cURL. My personal experience is with the f* functions, so here's an example:
// Retry the connection up to 5 times, with a 5-second timeout per attempt
$attempts = 0;
do {
    $fp = @fsockopen("en.wikipedia.org", 80, $errno, $errstr, 5);
    $attempts++;
} while (!$fp && $attempts < 5);
if (!$fp) die("Failed to connect");

// Send a minimal HTTP/1.0 request with a descriptive User-Agent
fputs($fp, "GET /wiki/Page_name_here HTTP/1.0\r\n"
    . "Host: en.wikipedia.org\r\n"
    . "User-Agent: PHP-scraper (your-email@yourwebsite.com)\r\n\r\n");

// Read the full response
$out = "";
while (!feof($fp)) {
    $out .= fgets($fp);
}
fclose($fp);

// Split headers from body (limit of 2 so blank lines in the body are preserved)
list($head, $body) = explode("\r\n\r\n", $out, 2);
$head = explode("\r\n", $head);
list($http, $status, $statustext) = explode(" ", array_shift($head), 3);
if ($status != 200) die("HTTP status " . $status . " " . $statustext);
echo $body;
Use cURL for this:
$ch = curl_init('http://wikipedia.org');
curl_setopt_array($ch, array(
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 5.1; rv:18.0) Gecko/20100101 Firefox/18.0',
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_RETURNTRANSFER => true
));
$data = curl_exec($ch);
echo $data;
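If it still prints nothing, checking the cURL error and HTTP status usually tells you why. A small sketch that could replace the final echo in the snippet above:

if ($data === false) {
    die('cURL error: ' . curl_error($ch));
}
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($status != 200) {
    die('Unexpected HTTP status: ' . $status);
}
curl_close($ch);
echo $data;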
I assume you have already "tried again in a few minutes".
Next thing you could try is using cURL instead of file_get_contents, and setting the User-Agent to that of a common browser.
If it still doesn't work, it should at least give you some more info.
I'm trying to read a small part of a webpage. At first I thought the problem was URL encoding, so I added urlencode, but the problem still arises.
The script reads fine from simple links like google.com; however, it isn't working with the link I want.
<?php
$link = "http://www.adidas.co.uk/nmd_r1-bape-camouflage-shoes/BA7326.html";
$newlink = urlencode($link);
$linkcontents = file_get_contents($newlink);
$needle = "Sold out";
if (strpos($linkcontents, $needle) == true) {
    echo "String found";
} else {
    echo "String not found";
}
?>
I'm changing my answer because I ran the code below:
$link = "http://www.adidas.co.uk/nmd_r1-bape-camouflage-shoes/BA7326.html";
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, $link);
// $output contains the output string
$output = curl_exec($ch);
die(var_dump($output));
and it gave me this response
Sorry, you have been blocked
You are unable to access this website
Why have I been blocked?
This website is using a security service to protect itself from online
attacks. The action you just performed triggered the security
solution. There are several actions that could trigger this block
including submitting a certain word or phrase, a SQL command or
malformed data.
What can I do to resolve this?
If you are on a personal connection, like at home, you can run an
anti-virus scan on your device to make sure it is not infected with
malware.
If you are at an office or shared network, you can ask the network
administrator to run a scan across the network looking for
misconfigured or infected devices.
HTTP 403 - Forbidden
It seems you are unable to do any web scraping on the Adidas website.
You don't need urlencode.
The site you are trying to access responds with 403 Forbidden.
file_get_contents(http://www.adidas.co.uk/nmd_r1-bape-camouflage-shoes/BA7326.html): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden
This is because file_get_contents doesn't send a properly formed request from the site's perspective.
You should use curl or another request tool such as Guzzle, and send a properly formed request that is understood by the site you are trying to scrape (e.g., Guzzle will send some headers by default, so it should work).
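A minimal sketch of the Guzzle route (this assumes Guzzle is installed via Composer; the headers are illustrative, and 'http_errors' => false just keeps Guzzle from throwing on a 403 so you can inspect the response):

require 'vendor/autoload.php';

$client = new \GuzzleHttp\Client();
$response = $client->request('GET', 'http://www.adidas.co.uk/nmd_r1-bape-camouflage-shoes/BA7326.html', [
    'http_errors' => false,
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2228.0 Safari/537.36',
        'Accept'     => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    ],
]);

echo $response->getStatusCode(), "\n";  // inspect the status before trusting the body
echo $response->getBody();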
It's likely the remote host is filtering out connections without valid header information (e.g. a User-Agent).
You can spoof it (usually better to use cURL for these things) by creating a stream context:
$url = "http://www.adidas.co.uk/nmd_r1-bape-camouflage-shoes/BA7326.html";

$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n" .
                    "User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2228.0 Safari/537.36\r\n"
    )
);

$ctx = stream_context_create($opts);
$content = file_get_contents($url, false, $ctx);
Disclaimer: While this may work in returning the HTML, the fact that the remote host put these checks in place may indicate that doing this would go against their terms of use. Don't blame me if your IP gets blacklisted.
I'm trying to download the contents of a web page using PHP.
When I issue the command:
$f = file_get_contents("http://mobile.mybustracker.co.uk/mobile.php?searchMode=2");
It returns a page that reports that the server is down. Yet when I paste the same URL into my browser I get the expected page.
Does anyone have any idea what's causing this? Does file_get_contents transmit any headers that differentiate it from a browser request?
Yes, there are differences -- the browser tends to send plenty of additional HTTP headers, I'd say, and the ones that are sent by both probably don't have the same value.
Here, after doing a couple of tests, it seems that passing the HTTP header called Accept is necessary.
This can be done using the third parameter of file_get_contents, to specify additional context information:
$opts = array('http' =>
    array(
        'method' => 'GET',
        //'user_agent' => "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6",
        'header' => array(
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        ),
    )
);
$context = stream_context_create($opts);
$f = file_get_contents("http://mobile.mybustracker.co.uk/mobile.php?searchMode=2", false, $context);
echo $f;
echo $f;
With this, I'm able to get the HTML code of the page.
Notes:
I first tested passing the User-Agent, but it doesn't seem to be necessary -- which is why the corresponding line is here as a comment.
The value used for the Accept header is the one Firefox sent when I requested that page with Firefox before trying with file_get_contents.
Some other values might be OK, but I didn't do any test to determine which value is the required one.
For more information, you can take a look at:
file_get_contents
stream_context_create
Context options and parameters
HTTP context options -- that's the interesting page, here ;-)
Replace all spaces with %20.
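For example, a minimal sketch of that for a URL held in a string (the URL here is hypothetical; rawurlencode on the individual path or query parts is the more general fix):

$url = "http://example.com/some page.html";   // hypothetical URL containing a space
$url = str_replace(' ', '%20', $url);
$contents = file_get_contents($url);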
I noticed there was a question somewhat similar to mine, only with C#: link text.
Let me explain: I'm very new to the whole web-services implementation and so I'm experiencing some difficulty understanding (especially due to the vague MediaWiki API manual).
I want to retrieve the entire page as a string in PHP (XML file) and then process it in PHP (I'm pretty sure there are other more sophisticated ways to parse XML files but whatever):
Main Page wikipedia.
I tried doing $fp = fopen($url,'r');. It outputs: HTTP request failed! HTTP/1.0 400 Bad Request. The API does not require a key to connect to it.
Can you describe in detail how to connect to the API and get the page as a string?
EDIT:
The URL is $url='http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main Page';. I simply want to read the entire content of the file into a string to use it.
Connecting to that API is as simple as retrieving the file:
fopen
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$fp = fopen($url, 'r');
$c = '';
while (!feof($fp)) {
    $c .= fread($fp, 8192);
}
echo $c;
file_get_contents
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$c = file_get_contents($url);
echo $c;
The above two can only be used if your server has URL fopen wrappers enabled (the allow_url_fopen setting).
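If you're not sure whether that's the case, you can check the setting at runtime; a minimal sketch:

if (!ini_get('allow_url_fopen')) {
    die('URL fopen wrappers are disabled on this server');
}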
Otherwise, if your server has cURL installed, you can use that:
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$c = curl_exec($ch);
echo $c;
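Since the question mentions processing the XML afterwards, here is a minimal sketch of doing that with SimpleXML (this assumes the response has the usual api/query/pages/page/revisions/rev layout that format=xml produces; adjust the XPath if the structure differs):

$xml = simplexml_load_string($c);
if ($xml === false) {
    die('Could not parse the response as XML');
}
// Grab the wikitext of the first revision, if present
$revs = $xml->xpath('//rev');
if (!empty($revs)) {
    echo (string) $revs[0];
}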
You probably need to urlencode the parameters that you are passing in the query string; here, at least the "Main Page" title requires encoding -- without this encoding, I get a 400 error too.
If you try this, it should work better (note the space is replaced by %20):
$url='http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$str = file_get_contents($url);
var_dump($str);
With this, I'm getting the content of the page.
A solution is to use urlencode, so you don't have to do the encoding yourself:
$url='http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=' . urlencode('Main Page');
$str = file_get_contents($url);
var_dump($str);
According to the MediaWiki API docs, if you don't specify a User-Agent in your PHP request, Wikimedia will refuse the connection with a 4xx HTTP response code:
https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
You might try updating your code to add that request header, or change the default setting in php.ini if you have edit access to that.
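For example, a minimal sketch of adding such a header through a stream context (the tool name and contact address are placeholders you should replace with your own):

$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=' . urlencode('Main Page');
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: MyWikiBot/1.0 (https://example.com/; contact@example.com)\r\n",
    ),
));
$str = file_get_contents($url, false, $context);
var_dump($str);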
Are there any alternatives to using curl on hosts that have curl disabled?
To fetch content via HTTP, you can first try file_get_contents; your host might not have disabled the http:// stream wrapper:
$str = file_get_contents('http://www.google.fr');
But this might be disabled (see allow_url_fopen); and sometimes it is...
If it's disabled, you can try using fsockopen; the example given in the manual says this (quoting):
$fp = fsockopen("www.example.com", 80, $errno, $errstr, 30);
if (!$fp) {
echo "$errstr ($errno)<br />\n";
} else {
$out = "GET / HTTP/1.1\r\n";
$out .= "Host: www.example.com\r\n";
$out .= "Connection: Close\r\n\r\n";
fwrite($fp, $out);
while (!feof($fp)) {
echo fgets($fp, 128);
}
fclose($fp);
}
Considering it's quite low-level, though (you are working directly with a socket, and the HTTP protocol is not that simple), using a library that builds on it will make life easier for you.
For instance, you can take a look at Snoopy; here is an example.
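From memory of Snoopy's interface, usage looks roughly like this (treat it as a sketch and check the class's own documentation for the exact property and method names):

include 'Snoopy.class.php';

$snoopy = new Snoopy();
$snoopy->agent = 'MyFetcher/1.0 (contact@example.com)';   // optional: a descriptive User-Agent

if ($snoopy->fetch('http://www.example.com/')) {
    echo $snoopy->results;                                 // the fetched body
} else {
    echo 'Fetch failed: ' . $snoopy->error;
}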
http://www.phpclasses.org is full of these "alternatives", you can try this one: http://www.phpclasses.org/browse/package/3588.html
You can write a plain cURL proxy script with PHP and place it on cURL-enabled hosting; when you need cURL, you make client calls to it from the cURL-less machine, and it returns the data you need. It could be a strange solution, but it was helpful once.
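A rough sketch of that setup (all names here are hypothetical, and you should add some form of authentication so you don't run an open proxy):

// proxy.php, deployed on the cURL-enabled host
$url = isset($_GET['url']) ? $_GET['url'] : '';
if (!preg_match('#^https?://#', $url)) {
    header('HTTP/1.0 400 Bad Request');
    exit('Missing or invalid url parameter');
}
$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT        => 10,
));
echo curl_exec($ch);
curl_close($ch);

// On the cURL-less host, call the proxy with file_get_contents
$target = 'http://en.wikipedia.org/wiki/Main_Page';
$data = file_get_contents('http://curl-host.example.com/proxy.php?url=' . urlencode($target));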
All the answers in this thread present valid workarounds, but there is one thing you should keep in mind. Your host has, for whatever reason, deemed that making HTTP requests from your web server via PHP code is a "bad thing", and has therefore disabled (or not enabled) the curl extension. There's a really good chance that if you find a workaround and they notice it, they'll block your requests some other way. Unless there are political reasons forcing you onto this particular host, seriously consider moving your app/page elsewhere if it needs to make HTTP requests.