I'm trying to scrape data from some websites. For several sites it all seems to go fine, but for one website it doesn't seem to be able to get any HTML. This is my code:
<?php
include_once('simple_html_dom.php');
$html = file_get_html('https://www.magiccardmarket.eu/?mainPage=showSearchResult&searchFor=' . $_POST['data']);
echo $html;
?>
I'm using ajax to fetch the data. When I log the returned value in my JS it's completely empty.
Could it be due to the fact that this website is running on https? And if so, is there any way to work around it? (I've tried changing the URL to http, but I get the same result.)
Update:
If I var_dump the $html variable, I get bool(false).
My PHP error log says this:
[27-Feb-2014 22:20:50 Europe/Amsterdam] PHP Warning: file_get_contents(http://www.magiccardmarket.eu/?mainPage=showSearchResult&searchFor=tarmogoyf): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden
in /Users/leondewit/PhpstormProjects/Magic/stores/simple_html_dom.php on line 75
It's your user agent: file_get_contents() doesn't send one by default, so:
$url = 'http://www.magiccardmarket.eu/?mainPage=showSearchResult&searchFor=tarmogoyf';
$context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
$response = file_get_contents($url, false, $context);
$html = str_get_html($response);
echo $html;
Related
I call myself an experienced PHP developer, but this one drives me crazy. I'm trying to get release information for a repository in order to display update warnings, but I keep getting 403 errors. To simplify things, I used the most basic usage of GitHub's API: GET https://api.github.com/zen. It's a kind of hello world.
This works
directly in the browser
with a plain curl https://api.github.com/zen in a terminal
with a PHP GitHub API class like php-github-api
This does not work
with a simple file_get_contents() from a PHP script
This is my whole simplified code:
<?php
$content = file_get_contents("https://api.github.com/zen");
var_dump($content);
?>
The browser shows Warning: file_get_contents(https://api.github.com/zen): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden, and the variable $content is boolean false.
I guess I'm missing some sort of HTTP header fields, but I can't find that information in the API docs, and my terminal curl call works without sending any special headers.
This happens because GitHub requires you to send a User-Agent header. It doesn't need to be anything specific. This will do:
$opts = [
'http' => [
'method' => 'GET',
'header' => [
'User-Agent: PHP'
]
]
];
$context = stream_context_create($opts);
$content = file_get_contents("https://api.github.com/zen", false, $context);
var_dump($content);
The output is:
string(35) "Approachable is better than simple."
I'm trying to scrape information from the site http://steamstat.us - the thing I want to get is the status and such from the site.
I'm currently only using this code:
<?php
$homepage = file_get_contents('http://www.steamstat.us/');
echo $homepage;
?>
The problem I have here is that "Normal (16h)" and the rest just return 3 dots.
Can't figure out what the problem is.
Anyone have any clue?
EDIT
This is now fixed.
I solved the problem as follows:
<?php
$opts = array('http' => array('header' => "User-Agent:MyAgent/1.0\r\n"));
$context = stream_context_create($opts);
$json_url = file_get_contents('https://crowbar.steamdb.info/Barney', FALSE, $context);
$data = json_decode($json_url);
?>
It's an https site, which is not easy to scrape. Though the website allows it, as the "access-control-allow-origin" header is set to *, which means the content can be requested by any other site.
You are not receiving the content because "Normal (16h)" is not yet populated on page load. It's coming from ajax.
The HTML source says <span class="status" id="repo">…</span>. file_get_contents() is giving you those three dots inside the span tag.
The only way to do it is to look for the ajax call in the network log and then use file_get_contents on the URL called by ajax.
I'm trying to scrape some product details from a website using the following code:
$list_url = "http://www.topshop.com/en/tsuk/category/sale-offers-436/sale-799";
$html = file_get_contents($list_url);
echo $html;
However, I'm getting this error:
Warning:
file_get_contents(http://www.topshop.com/en/tsuk/category/sale-offers-436/sale-799)
[function.file-get-contents]: failed to open stream: HTTP request
failed! HTTP/1.0 403 Forbidden in
/homepages/19/d361310357/htdocs/shopaholic/rss/topshop_f_uk.php on
line 123
I gather that this is some sort of block by the website to prevent scraping. Is there a way around this - perhaps using cURL and setting a user agent?
If not, is there another way of getting basic product data like item name and price?
EDIT
The context of my code is that I'd eventually still want to be able to achieve the following:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
I've managed to fix it by adding the following code...
ini_set('user_agent','Mozilla/4.0 (compatible; MSIE 6.0)');
...as per this answer.
You should use cURL, not the simple way with file_get_contents().
Use cURL and set up the proper HTTP headers to mimic a proper HTTP request (a real request).
P.S.: set cURL up to follow redirects (see the PHP cURL documentation).
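As a minimal sketch of that advice (the helper name fetch_url and the user-agent string are my own illustrative choices, not from the question):

```php
<?php
// Sketch: fetch a page with cURL, sending a User-Agent and following redirects.
// fetch_url() and the agent string below are illustrative, not prescribed.
function fetch_url($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);         // cap redirects to avoid loops
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyFetcher/1.0)');
    $body = curl_exec($ch);
    if ($body === false) {
        $err = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL error: $err");
    }
    curl_close($ch);
    return $body;
}
```

The returned string can then go straight into DOMDocument::loadHTML() as in the question.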
I am having a problem with PHP's file_get_contents command.
$url = "http://api.rememberthemilk.com/services/rest/".$format.$auth_token.$filter."&api_sig=".$md5.$apikey.$method;
$content = file_get_contents($url);
$array = json_decode($content, true);
$taskname = $array['rsp']['tasks']['list']['taskseries']['name'];
$duedate = $array['rsp']['tasks']['list']['taskseries']['task']['due'];
($format, $auth_token, $filter, $md5, $apikey, $method are already defined in the script)
When I try to run this code this error is returned:
[function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad request for line 101
line 101 is $content = file_get_contents($url);
How can I fix this? Thanks!
That URL does not look right:
http://api.rememberthemilk.com/services/rest/?format=json&auth_token=AUTH_TOKEN&filter=dueWithin:"3 days of today"&api_sig=API_SIG&api_key=API_KEY&method=rtm.tasks.getList
Encode the tokens as follows:
$filter = 'filter='.urlencode( 'dueWithin:"3 days of today"' );
Use urlencode().
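A sketch of assembling the whole query safely (the token values are placeholders; a real api_sig must be computed per the API's signing rules): http_build_query() URL-encodes every value, including the quotes and spaces in the filter.

```php
<?php
// Sketch: build the request URL with http_build_query(), which URL-encodes
// each value. The token values below are placeholders, not real credentials.
$params = array(
    'method'     => 'rtm.tasks.getList',
    'format'     => 'json',
    'auth_token' => 'AUTH_TOKEN',
    'filter'     => 'dueWithin:"3 days of today"',
    'api_key'    => 'API_KEY',
    'api_sig'    => 'API_SIG',
);
$url = 'http://api.rememberthemilk.com/services/rest/?' . http_build_query($params);
// The filter becomes dueWithin%3A%223+days+of+today%22 in the query string.
```
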
Try printing the URL after concatenating the variables. Then paste the URL into the address bar of your browser and see what comes back. Because it's a web service call, your browser might not know what to do with the response. In that case you might get additional information using the command-line user agent "curl", e.g.
curl -v 'http://some-url'
curl is built in to Macs and other *nix machines and is also available for Windows.
I want to read and parse the XML data from a URL. My URL is http://xml.gamebookers.com/sports/bandy.xml. I can access the XML data from a browser. However, when I attempt to read it using PHP it doesn't work. The error is:
Warning: file_get_contents(http://xml.gamebookers.com/sports/bandy.xml): failed to open stream: Connection timed out in
How can I fix this? Any comments on this?
Thanks in advance.
Please see here for an answer:
This error is most likely connected to too many (HTTP) redirects on the way from your script to the file you want to open. The default redirection limit is 20. As 20 redirects are quite a lot, there could be some error in the filename itself (causing e.g. the webserver on the other end to do some spellcheck redirects), or the other server is misconfigured, or there are some security measures in place, or...
If you feel the need to extend the 20-redirect limit, you can use a stream context.
$context = array(
'http'=>array('max_redirects' => 99)
);
$context = stream_context_create($context);
// hand over the context to file_get_contents()
$data = file_get_contents('http://xml.gamebookers.com/sports/bandy.xml', false, $context);
// ...
Please see:
Streams
stream_context_create()
HTTP context options
Try this snippet:
$request_url = 'http://xml.gamebookers.com/sports/bandy.xml';
$xml = simplexml_load_file($request_url) or die("feed not loading");
/*
Then just parse out the child nodes of your XML,
for example:
*/
foreach($xml->children() as $child)
{
echo $child->getName().": ".$child."\n";
}
Hope this helps.
PS: Open your php.ini and look for
allow_url_fopen = On // make sure it is On
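A quick runtime check for that setting, as a sketch (the error message wording is my own):

```php
<?php
// Sketch: fail fast when URL wrappers are disabled, since file_get_contents()
// on an http:// URL requires allow_url_fopen = On.
if (!ini_get('allow_url_fopen')) {
    die("allow_url_fopen is disabled in php.ini; enable it or use cURL instead.\n");
}
```
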