PHP 500 internal server error with file_get_contents

Using PHP I'm trying to crawl a website page and then grab an image automatically.
I've tried the following:
<?php
$url = "http://www.domain.co.uk/news/local-news";
$str = file_get_contents($url);
?>
and
<?php
$opts = array('http'=>array('header' => "User-Agent:Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.75 Safari/537.1\r\n"));
$context = stream_context_create($opts);
$header = file_get_contents('http://www.domain.co.uk/news/local-news',false,$context);
?>
and also
<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.domain.co.uk/news/local-news');
$result = $html->find('section article img', 0)->outertext;
?>
but these all return an Internal Server Error. I can view the site perfectly in the browser, but when I try to grab the page in PHP it fails.
Is there anything I can try?

Try the code below; it will save the content to a local file.
<?php
$ch = curl_init("http://www.domain.co.uk/news/local-news");
$fp = fopen("localfile.html", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
Now you can read localfile.html.
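For example, a minimal sketch that parses the saved copy with the simple_html_dom class from the question and grabs the first image (the selector is the one from the question; adjust as needed):
<?php
include('simple_html_dom.php');
// parse the locally saved copy instead of fetching the URL again
$html = file_get_html('localfile.html');
$img = $html->find('section article img', 0);
if ($img) {
    echo $img->src; // URL of the first matching image
}
?>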

Sometimes you might get an error opening an HTTP URL with file_get_contents, even though you have set allow_url_fopen = On in php.ini.
For me the solution was to also set "user_agent" to something.
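For example (a minimal sketch; the user-agent string is an arbitrary placeholder):
ini_set('user_agent', 'Mozilla/5.0 (compatible; MyFetcher/1.0)');
$str = file_get_contents('http://www.domain.co.uk/news/local-news');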

Related

simple_html_dom: 403 Access denied

I implemented this function in order to parse HTML pages using two different "methods".
As you can see, both use the very handy class called simple_html_dom.
The difference is that the first method also uses curl to load the HTML, while the second does not.
Both methods work fine on a lot of pages, but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method to avoid this type of denial?
function searchThroughDOM($url, $method)
{
    echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
    $time_start = microtime(true);
    switch ($method) {
        case 'curl':
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($curl, CURLOPT_HEADER, false);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_REFERER, $url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
            $str = curl_exec($curl);
            curl_close($curl);
            // Create a DOM object
            $html = new simple_html_dom();
            // Load HTML from a string
            $html->load($str);
            break;
        case 'simple_html_dom':
            $html = new simple_html_dom();
            $html->load_file($url);
            break;
    }
    $collection = $html->find('h1');
    foreach ($collection as $x => $x_value) {
        echo 'x = '.$x.' => value = '.$x_value.'<br>';
    }
    $html->save('result.htm');
    $html->clear();
    $time_end = microtime(true);
    echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view, there is nothing wrong with simple_html_dom. You may remove the simple_html_dom part of the code and leave only the cURL part, which I assume is the source of the problem.
There are lots of reasons why cURL might not work on a page. First of all, I can see you added
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
You should also try setting CURLOPT_SSL_VERIFYHOST to false.
Secondly, check your cURL version and see if it is too old.
As a third option, if none of the above works, you may want to enable cookies; it is possible that disabled cookies cause the website to detect that a machine, not a real person, sent the request.
Lastly, if all the above attempts fail, try another library or even file_get_contents.
cURL is not your only option, although it is of course the most powerful one.
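To illustrate the SSL and cookie suggestions, here is a minimal sketch (the cookie file path is just an example and must be writable by PHP):
$url = 'https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
// relax both SSL checks, not only the peer check
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
// persist cookies so the site sees a more browser-like client
curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
$str = curl_exec($curl);
curl_close($curl);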

php file_get_contents() get stuck in loading an image

As mentioned above, the PHP file_get_contents() function, or even the fopen()/fread() combination, gets stuck and times out when trying to read this simple image URL:
http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png
but the same image is easily loaded by browsers. What's the catch?
EDITED:
as requested in comments, I am showing the function I used to get the data:
function customRead($url)
{
    $contents = '';
    $handle = fopen($url, "rb");
    $dex = 0;
    while ( !feof($handle) )
    {
        if ( $dex++ > 100 )
            break;
        $contents .= fread($handle, 2048);
    }
    fclose($handle);
    echo "\nbreaking due to too many calls...\n";
    return $contents;
}
I also tried simply this:
echo file_get_contents('http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png');
Both give the same issue
EDITED:
As suggested in the comments, I used cURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.1 Safari/537.11');
$res = curl_exec($ch);
$rescode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch) ;
echo "\n\n\n[DATA:";
echo $res;
echo "]\n\n\n[CODE:";
print_r($rescode);
echo "]\n\n\n[ERROR:";
echo curl_error($ch); // note: curl_close() has already been called above, so this prints an empty string even if there was an error
echo "]\n\n\n";
this is the result:
[DATA:]
[CODE:0]
[ERROR:]
If you don't get the remote data with file_get_contents, you can try it with cURL, as it can provide error messages via curl_error(). If you get nothing, not even an error, then something on your server is blocking outgoing connections. Maybe you even want to try curl over SSH; I'm not sure if that makes any difference, but it's worth the try. If you still don't get anything, you may want to consider contacting the server admin (if you're not that) or the provider.
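For example, a minimal diagnostic sketch that reads curl_error() before curl_close(), so the message isn't lost (URL taken from the question):
$ch = curl_init('http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$res = curl_exec($ch);
// capture the error and status while the handle is still open
$err = curl_error($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo "[CODE: $code]\n[ERROR: $err]\n";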

RSS-Feed returns an empty string

I have a news portal that displays RSS feed items. Approximately 50 sources are read, and it works very well.
Only with one source do I always get an empty string. The W3C RSS Validator can read the feed, and even my program Vienna receives data.
What can I do?
Here is my simple code:
$link = 'http://blog.bosch-si.com/feed/';
$response = file_get_contents($link);
if ($response !== false) {
    var_dump($response);
} else {
    echo 'Error ';
}
The server serving that feed expects a User Agent to be set. You apparently don't have a User Agent set in your php.ini, nor do you set it in the call to file_get_contents.
You can either set the User Agent for this particular request through a stream context:
echo file_get_contents(
    'http://blog.bosch-si.com/feed/',
    FALSE,
    stream_context_create(
        array(
            'http' => array(
                'user_agent' => 'php'
            )
        )
    )
);
Or globally for any http calls:
ini_set('user_agent', 'php');
echo file_get_contents($link);
Both will give you the desired result.
The blog at http://blog.bosch-si.com/feed/ requires some headers to fetch content from the website, so it's better to use cURL here.
See the solution below:
<?php
$link = 'http://blog.bosch-si.com/feed/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $link);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: blog.bosch-si.com', 'User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'));
$result = curl_exec($ch);
if ( ! $result)
{
    echo curl_error($ch);
}
curl_close($ch);
echo $result;

"Checking browser before accessing..." error when using Curl

I am trying to use cURL to get the contents of a website. The error that I am getting is:
"Checking your browser before accessing roosterteeth.com"
I tried changing different attributes in cURL but still no luck. I have also tried PHP Simple HTML DOM Parser, but once again no luck.
Below is my current code.
<?php
$divContents = array();
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';
$html = curl_init("http://roosterteeth.com/home.php");
curl_setopt($html, CURLOPT_RETURNTRANSFER, true);
curl_setopt($html, CURLOPT_BINARYTRANSFER, true);
curl_setopt($html, CURLOPT_USERAGENT, $userAgent);
curl_setopt($html, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($html);
// curl_exec() returns a string here, so the ->find() call below fails:
// the HTML would first have to be loaded into a parser such as simple_html_dom
foreach ($content->find("div.streamIndividual") as $div) {
    $divContents[] = $div->outertext;
}
file_put_contents("cache.htm", implode(PHP_EOL, $divContents));
$hash = file_get_contents("pg_1_hash.htm");
$cache = file_get_contents("cache.htm");
if ($hash == ($pageHash = md5($cache))) {
    // cached copy unchanged, nothing to do
} else {
    $fpa = fopen("pg_1.htm", "w");
    fwrite($fpa, $cache);
    fclose($fpa);
    $fpb = fopen("pg_1_hash.htm", "w");
    fwrite($fpb, $pageHash);
    fclose($fpb);
}
?>
As it stands the code above shows a different error due to the find command not being able to get any content. The code below shows the error I get from the site.
<?php
$divContents = array();
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';
$html = curl_init("http://roosterteeth.com/home.php");
curl_setopt($html, CURLOPT_RETURNTRANSFER, true);
curl_setopt($html, CURLOPT_BINARYTRANSFER, true);
curl_setopt($html, CURLOPT_USERAGENT, $userAgent);
curl_setopt($html, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($html);
echo $content;
?>
My hunch is that the server thinks I am a bot (which I don't blame it for believing). I used cURL to see if I could pretend to be a regular client and bypass the checker, but was unsuccessful. I hope someone can shed some light on this.
Thank you for your time :)
If the site you're trying to access uses WordPress, it definitely has security issues. This is a known malicious modification for WP that redirects users to various other sites. So in this case, the problem is not in your code.

Can't load the XML file?

http://westwood-backup.com/podcast?categoryID2=403
This is the XML file that I want to load and echo via PHP. I tried file_get_contents and load(). Both of them return an empty string. If I change the URL to another XML file, the functions work great. What can be special about this URL?
<?php
$content = file_get_contents("http://westwood-backup.com/podcast?categoryID2=403");
echo $content;
?>
Another try with load, same empty result.
<?php
$feed = new DOMDocument();
if (@$feed->load("http://westwood-backup.com/podcast?categoryID2=403")) {
    $xpath = new DOMXpath($feed);
    $linkPath = $xpath->query("/rss/channel/link");
    echo $linkPath;
}
?>
Use CURL and you can do it like this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,'http://westwood-backup.com/podcast?categoryID2=403');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, ' Mozilla/1.22 (compatible; MSIE 2.0d; Windows NT)');
$xml = curl_exec($ch);
curl_close($ch);
$xml = new SimpleXMLElement($xml);
echo "<pre>";
print_r($xml);
echo "</pre>";
I think the server implements a "User-Agent" check to make sure the XML data is only loaded within a browser (not via bots, file_get_contents, etc.), so by using cURL and setting a dummy user agent, you can get around the check and load the data.
You need to set a user-agent header that the server is happy with. No need for cURL if you don't want to use it; you can use stream_context_create with file_get_contents:
$options = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.102011-10-16 20:23:10\r\n" // i.e. an iPad
    )
);
$context = stream_context_create($options);
$content = file_get_contents("http://westwood-backup.com/podcast?categoryID2=403", false, $context);
echo $content;
