I'm trying to scrape some product details from a website using the following code:
$list_url = "http://www.topshop.com/en/tsuk/category/sale-offers-436/sale-799";
$html = file_get_contents($list_url);
echo $html;
However, I'm getting this error:
Warning:
file_get_contents(http://www.topshop.com/en/tsuk/category/sale-offers-436/sale-799)
[function.file-get-contents]: failed to open stream: HTTP request
failed! HTTP/1.0 403 Forbidden in
/homepages/19/d361310357/htdocs/shopaholic/rss/topshop_f_uk.php on
line 123
I gather that this is some sort of block by the website to prevent scraping. Is there a way around this - perhaps using cURL and setting a user agent?
If not, is there another way of getting basic product data like item name and price?
EDIT
The context of my code is that I'd eventually still want to be able to achieve the following:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
I've managed to fix it by adding the following code...
ini_set('user_agent','Mozilla/4.0 (compatible; MSIE 6.0)');
...as per this answer.
You should use cURL rather than plain file_get_contents().
Use cURL and set the HTTP headers so that the request looks like one a real browser would send.
P.S.: also set up cURL to follow redirects.
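As a rough sketch of what that advice could look like: the helper names `build_curl_options` and `fetch_html`, the User-Agent string, and the extra headers are assumptions made up for illustration, not something the site is known to require.

```php
<?php
// Hypothetical helper: collects cURL options for a browser-like GET request.
function build_curl_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)', // assumed value
        CURLOPT_HTTPHEADER     => ['Accept: text/html', 'Accept-Language: en-GB,en;q=0.8'],
        CURLOPT_TIMEOUT        => 15,
    ];
}

// Hypothetical helper: returns the page body, or null on a transport
// error or an HTTP status of 400 and above.
function fetch_html(string $url): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url));
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

If `fetch_html($list_url)` returns a string, it can be passed to `DOMDocument::loadHTML()` exactly as in the question. Whether the site accepts the request still depends on its own rules, so treat this as a starting point.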
I need to add an svg file to a website and apply a class to this svg. This is frustrating, I've tried different solutions posted on here and none of them have worked for me. This worked on a different server, but after being moved to a new server it no longer works. Here is how I am calling it in the php:
<?php $svg = file_get_contents("http://www.folklorecoffee.com/wp-content/uploads/2018/04/folkloretextwhite-1.svg");
$dom = new DOMDocument();
$dom->loadHTML($svg);
foreach($dom->getElementsByTagName('svg') as $element) {
$element->setAttribute('class','logo-light');
}
$dom->saveHTML();
$svg = $dom->saveHTML();
echo $svg;?>
I'm getting these warnings:
WARNING: file_get_contents(http://www.folklorecoffee.com/wp-content/uploads/2018/04/folkloretextwhite-1.svg): failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found
in /home/folklorecoffee/public_html/wp-content/themes/lily/header.php on line 34
WARNING: DOMDocument::loadHTML(): Empty string supplied as input in /home/folklorecoffee/public_html/wp-content/themes/lily/header.php on line 36
But when I test the url in my browser, it comes up fine. Not sure why I'm getting a 404 error. What am I doing wrong? Thank you in advance!
The request is working for me:
echo $svg = file_get_contents("http://www.folklorecoffee.com/wp-content/uploads/2018/04/folkloretextwhite-1.svg"); // prints the SVG data as expected
You could try the same request via a different mechanism, such as cURL:
$url = "http://www.folklorecoffee.com/wp-content/uploads/2018/04/folkloretextwhite-1.svg";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
echo $data; // prints the SVG data as expected
But if the URL is the same, I expect you will get the same result.
Given the above, the best answer is for you to do troubleshooting on your own about what is happening, since it is very specific to your setup.
You may gain some insight by looking at the Apache logs for the request. Does it show up? Does the URL it displays match what you expect?
Can you make other requests with file_get_contents to your domain that work? To the uploads folder? To other domains?
Check with your hosting provider to see if they can explain. There may be a configuration item that is interfering in some way.
Finally, you may want to try investigating more carefully why the logo cannot be loaded from the file system. Is it a permissions issue? Can you load any file from the permissions directory?
I tried everyone's suggestions, and I thank you all for taking the time to offer help. I was never able to get it to work and continued to get the same error.
Interestingly, any file on my server that I tried to call with file_get_contents would get the 404 error. It didn't matter whether I used a local path or the exact URL; the paths were correct. Here's what's interesting: it would work if I pointed to a file that was NOT on my server. So I'm thinking there must be a configuration setting somewhere that I would need to change. I do not know what this setting would be, but if someone out there is reading this and experiencing the same issue, I hope this helps point you in the right direction.
I was in a time crunch and decided instead to simply write the svg file in with an object tag and that did the trick for now. I'd prefer using a different method, and I'll revisit this again in the future and update this post if I figure out what would fix it. Thank you again for all of your help, I hope this helps someone else.
I'm trying to develop an Instagram application, but I'm struggling to fetch the JSON when I insert the access token via a variable.
That's my code:
$personal = json_decode(file_get_contents('https://api.instagram.com/v1/users/self/?access_token={$accesstoken}'));
And the error that I receive is this:
PHP Warning: file_get_contents(https://api.instagram.com/v1/users/self/?access_token={$accesstoken}): failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request
I already checked that I get the variable $accesstoken from the previous form, and it's fine: if I echo it on the page, it shows up.
I also tried preg_replace to find out whether whitespace was the problem, but that made no difference.
I don't want to use cURL if it is not mandatory.
What's wrong with the code (and me)?
Thanks in advance!
Edit:
As I answered to @FirstOne: yes, I already tried putting the access token in manually, and that way it works.
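One thing that stands out in the code above: the URL is written in single quotes, and PHP does not interpolate variables inside single-quoted strings. The warning shows the literal text {$accesstoken} in the requested URL, which is consistent with that. A minimal sketch that builds the URL by concatenation instead; the helper name `instagram_self_url` is made up for illustration, and this is a likely cause rather than a confirmed diagnosis.

```php
<?php
// Hypothetical helper: builds the request URL by concatenation, so the
// token's value (not the text "{$accesstoken}") ends up in the query string.
function instagram_self_url(string $accesstoken): string
{
    // trim() guards against stray whitespace; urlencode() escapes special characters
    return 'https://api.instagram.com/v1/users/self/?access_token='
        . urlencode(trim($accesstoken));
}

$accesstoken = 'abc.123';            // placeholder value, not a real token
$single = 'token={$accesstoken}';    // single quotes: stays literal
$double = "token={$accesstoken}";    // double quotes: interpolated
```

With that helper, the call from the question would become `json_decode(file_get_contents(instagram_self_url($accesstoken)))`. Simply switching the original string to double quotes should have the same effect.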
I'm trying to scrape data from some websites. For several sites it all seems to go fine, but for one website it doesn't seem to be able to get any HTML. This is my code:
<?php include_once('simple_html_dom.php');
$html = file_get_html('https://www.magiccardmarket.eu/?mainPage=showSearchResult&searchFor=' . $_POST['data']);
echo $html; ?>
I'm using ajax to fetch the data. When I log the returned value in my js it's completely empty.
Could it be due to the fact that this website is running on https? And if so, is there any way to work around it? (I've tried changing the URL to http, but I get the same result.)
Update:
If I var_dump the $html variable, I get bool(false).
My PHP error log says this:
[27-Feb-2014 22:20:50 Europe/Amsterdam] PHP Warning: file_get_contents(http://www.magiccardmarket.eu/?mainPage=showSearchResult&searchFor=tarmogoyf): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden
in /Users/leondewit/PhpstormProjects/Magic/stores/simple_html_dom.php on line 75
It's your user agent: file_get_contents doesn't send one by default, so:
$url = 'http://www.magiccardmarket.eu/?mainPage=showSearchResult&searchFor=tarmogoyf';
$context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
$response = file_get_contents($url, false, $context);
$html = str_get_html($response);
echo $html;
I've been using the following code to scrape keywords from Google:
$data=file_get_contents('http://clients1.google.com/complete/search?hl=en&gl=us&q='.$keyword);
However, my script has just started showing these errors:
Warning: file_get_contents(http://clients1.google.com/complete/search?hl=en&gl=us&q=money) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 400 Bad Request in /home/username/public_html/keywords.php on line 10
I'm guessing this is being caused by Google changing the link? Does anyone know what the new link would be or what I need to change in my code?
Try this:
$url = 'http://suggestqueries.google.com/complete/search?output=firefox&client=firefox&hl=en-US&q=';
$data=file_get_contents($url . urlencode( $keyword ) );
Hope it helps.
I am using the following PHP:
$xml = simplexml_load_file($request_url) or die("url not loading");
I use:
$status = $xml->Response->Status->code;
To check the status of the response. 200 being everything is OK, carry on.
However if I get a 403 access denied error, how do I catch this in PHP so I can return a user friendly warning?
To retrieve the HTTP response code from a call to simplexml_load_file(), the only way I know is to use PHP's little-known $http_response_header. This variable is automagically created as an array containing each response header separately, every time you make an HTTP request through the HTTP wrapper. In other words, every time you use simplexml_load_file() or file_get_contents() with a URL that starts with "http://".
You can inspect its content with a print_r() such as
$xml = @simplexml_load_file($request_url);
print_r($http_response_header);
In your case, though, you might want to retrieve the XML separately with file_get_contents(), test whether you got a 4xx response, and if not, pass the body to simplexml_load_string(). For instance:
$response = @file_get_contents($request_url);
if (preg_match('#^HTTP/... 4..#', $http_response_header[0]))
{
// received a 4xx response
}
$xml = simplexml_load_string($response);
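A slightly fuller sketch of the same idea, for the case where you want a message you can show to the user. The function names `last_status_code` and `load_remote_xml` and the message texts are hypothetical. It assumes the HTTP wrapper is enabled, and it uses the last status line because $http_response_header holds one status line per hop when redirects are followed.

```php
<?php
// Hypothetical helper: extracts the final HTTP status code from the array
// PHP stores in $http_response_header (null if no status line is present).
function last_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1]; // keep overwriting so the last status line wins
        }
    }
    return $code;
}

// Hypothetical wrapper: returns the parsed XML, or null with a
// user-friendly message stored in $error.
function load_remote_xml(string $url, ?string &$error = null): ?SimpleXMLElement
{
    // ignore_errors makes the wrapper return the body even for 4xx/5xx responses
    $context  = stream_context_create(['http' => ['ignore_errors' => true]]);
    $response = @file_get_contents($url, false, $context);
    $status   = last_status_code($http_response_header ?? []);

    if ($response === false || $status === null) {
        $error = 'The service could not be reached.';
        return null;
    }
    if ($status >= 400) {
        $error = "The service refused the request (HTTP $status).";
        return null;
    }
    $xml = @simplexml_load_string($response);
    if ($xml === false) {
        $error = 'The service returned invalid XML.';
        return null;
    }
    return $xml;
}
```

Calling `load_remote_xml($request_url, $error)` then gives either the XML object or null, with `$error` holding the text to display.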
You'll have to use something like the cURL module or the HTTP module to fetch the file, then use the functionality provided by them to detect an HTTP error, then pass the string from them into simplexml_load_string.
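A minimal sketch of the cURL variant, assuming the cURL and SimpleXML extensions are available. The helper names `xml_or_null` and `fetch_xml` are hypothetical, and treating anything outside the 2xx range as a failure is a design choice of this example.

```php
<?php
// Hypothetical helper: turns a status code and body into XML,
// or null if the request failed or the body is not valid XML.
function xml_or_null(int $status, $body): ?SimpleXMLElement
{
    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $xml = simplexml_load_string($body);
    return $xml === false ? null : $xml;
}

// Hypothetical helper: fetches $url with cURL and returns [status, xml-or-null].
function fetch_xml(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no response arrived
    return [$status, xml_or_null($status, $body)];
}
```

The status returned by `fetch_xml($request_url)` can then drive the user-facing message, for example a specific warning for 403 and a generic one for everything else.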