I have a news portal that displays RSS feed items. Roughly 50 sources are read, and it works very well.
With one source, however, I always get an empty string. The W3C RSS Validator can read the feed, and my feed reader Vienna receives data from it as well.
What can I do?
Here is my simple code:
$link = 'http://blog.bosch-si.com/feed/';
$response = file_get_contents($link);

if ($response !== false) {
    var_dump($response);
} else {
    echo 'Error';
}
The server serving that feed expects a User-Agent header to be set. You apparently don't have a user_agent set in your php.ini, nor do you set one in the call to file_get_contents.
You can either set the User-Agent for this particular request through a stream context:
echo file_get_contents(
    'http://blog.bosch-si.com/feed/',
    false,
    stream_context_create(
        array(
            'http' => array(
                'user_agent' => 'php'
            )
        )
    )
);
Or globally for all HTTP calls:
ini_set('user_agent', 'php');
echo file_get_contents($link);
Both will give you the desired result.
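If you want to confirm that the server is rejecting the bare request in the first place, note that PHP populates the magic $http_response_header variable after every http wrapper call, so you can inspect the status line of the failed attempt. A small diagnostic sketch (the 403 in the comment is an assumption about what this particular server returns):
<?php
// Diagnostic sketch: see what the server says to a request without a User-Agent.
$response = @file_get_contents('http://blog.bosch-si.com/feed/');
var_dump($response); // false or "" when the server refuses the request
if (isset($http_response_header)) {
    print_r($http_response_header); // e.g. "HTTP/1.1 403 Forbidden" plus the other headers
}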
The blog at http://blog.bosch-si.com/feed/ requires certain request headers before it serves any content; it is better to use cURL for this.
See the solution below:
<?php
$link = 'http://blog.bosch-si.com/feed/';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $link);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it directly
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Host: blog.bosch-si.com',
    'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
));

$result = curl_exec($ch);
if (!$result) {
    echo curl_error($ch);
}
curl_close($ch);

echo $result;
Something strange is going on, and I would like to know why.
This URL works well in the browser: http://api.promasters.net.br/cotacao/v1/valores?moedas=USD&alt=json. But when I tried to retrieve its content with PHP:
echo file_get_contents('http://api.promasters.net.br/cotacao/v1/valores?moedas=USD&alt=json');
it printed nothing, and var_dump(...) showed string(0) "". So I went a little further and used:
function get_page($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_URL, $url);
    $return = curl_exec($curl);
    curl_close($curl);
    return $return;
}

echo get_page('http://api.promasters.net.br/cotacao/v1/valores?moedas=USD&alt=json');
This also printed nothing, so I tried Python (3.x):
import requests
print(requests.get('http://api.promasters.net.br/cotacao/v1/valores?moedas=USD&alt=json').text)
And it WORKED. Why is this happening? What's going on?
It looks like they're blocking requests with a missing or unrecognized User-Agent, considering that neither PHP's cURL nor file_get_contents sets that value in the request headers by default.
You can fake it by setting the User-Agent to something like Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:7.0.1) Gecko/20100101 Firefox/7.0.1:
<?php
function get_page($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:7.0.1) Gecko/20100101 Firefox/7.0.1');
    $return = curl_exec($curl);
    curl_close($curl);
    return $return;
}

echo get_page('http://api.promasters.net.br/cotacao/v1/valores?moedas=USD&alt=json');
I experienced the same behaviour.
Fetching the URL with the curl command-line tool worked for me.
I then wrote a script that made a file_get_contents call to another script, which dumped all request headers to a file using getallheaders():
<?php
file_put_contents('/tmp/request_headers.txt', var_export(getallheaders(),true));
Output of the file:
array (
'Host' => 'localhost',
)
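For reference, the calling side was nothing more than a bare file_get_contents request against that script (the local URL here is a hypothetical placeholder):
<?php
// Bare request with no context options - this is what sent only the Host header.
file_get_contents('http://localhost/dump_headers.php'); // hypothetical URL of the dumping script above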
I then inspected the request headers curl sends:
$ curl -v URL
and tried adding them one at a time to the file_get_contents request. It turned out that a User-Agent header was needed:
<?php
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "User-Agent: examplebot\r\n"
    )
);
$context = stream_context_create($opts);
$response = file_get_contents($url, false, $context);
This gave me a useful response.
I am currently trying to manipulate the DOM through PHP to extract the view count from a Facebook video page. The code below was working until a little while ago, but now it doesn't find the node that contains the view count. This information is inside a div with the id fbPhotoPageMediaInfo. What would be the best way to manipulate the DOM through PHP to get the views of a Facebook video page?
private function _callCurl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    $http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return array(
        $http,
        $response,
    );
}
function test()
{
    $url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
    $request = $this->_callCurl($url);

    if ($request[0] == 200) {
        $dom = new DOMDocument();
        @$dom->loadHTML($request[1]); // suppress warnings from Facebook's malformed HTML
        $elm = $dom->getElementById('fbPhotoPageMediaInfo');
        if (isset($elm->nodeValue)) {
            $views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
        } else {
            $views = null;
        }
    } else {
        echo "Error!";
    }

    return isset($views) ? $views : null;
}
Here is what I've determined...
If you var_dump() $request, you can see that it's returning a 302 code (redirect) rather than a 200 (OK).
Changing CURLOPT_FOLLOWLOCATION to true, or commenting it out entirely, makes the error go away, but then we get a different page from the one expected.
I ran the following to see where I was being redirected to:
$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);
This gave me a page saying I was using an outdated browser and needed to update it. So apparently Facebook doesn't like that User-Agent.
I updated it as follows:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
That appears to solve the problem.
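Putting both findings together, the relevant cURL options in _callCurl would look something like this (a minimal sketch of just the two changes described above, not the full function):
// Sketch: follow the 302 and send a User-Agent Facebook accepts.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // was false, which surfaced the 302
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2'); // replaces the Android UA string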
Personally, I prefer to use simple_html_dom.
Facebook, like other high-traffic sites, updates its markup to help prevent scraping, so you may have to adjust your node search in the future.
<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid User-Agent
ini_set('user_agent', $ua);

require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/

function Scrape_FB_Views($url) {
    if (!filter_var($url, FILTER_VALIDATE_URL) === false) {
        // Create DOM from URL
        $html = file_get_html($url);
        if ($html) {
            if ($html->find('span[class=fcg]', 3)) { // 4th instance of span with the fcg class
                $text = trim($html->find('span[class=fcg]', 3)->plaintext); // get the content of the span as plain text
                $result = preg_replace('/[^0-9]/', '', $text); // strip all non-numeric characters
            } else {
                $result = "Node is no longer valid.";
            }
        } else {
            $result = "Could not get HTML.";
        }
    } else {
        $result = "URL is invalid.";
    }
    return $result;
}

$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>" . Scrape_FB_Views($url) . "</p>");
?>
I am stuck on a problem: I have a URL with a geo-location restriction, so it can only be viewed from Europe or the USA. My location is in Asia. I want to extract all hrefs from the URL.
I am using cURL, but the problem is that it sends the server's IP address, and I want the request to be made with the user's IP address in order to track which links the user has visited. If you can guide me on how to send the request with the user's IP address, without using cURL, I'll be grateful.
The following is the source code. The URL I am accessing is:
http://partnerads.ysm.yahoo.com/ypa/?ct=2&c=000000809&u=http%3A%2F%2Ftrouve.autocult.fr%2F_test.php%3Fq%3Dtarif%2520skoda%2520superb%2520combi&r=&w=1&tv=&tt=&lo=&ty=&ts=1458721731523&ao=&h=1&CoNo=3292b85181511c0a&dT=1&er=0&si=p-Autocult_FRA_SERP_2%3A600x796
<?php
include_once 'simple_html_dom.php';

$html = file_get_html('iframe.html');

// find each iframe within the HTML doc
foreach ($html->find('iframe') as $iframe) {
    $src = $iframe->getAttribute('src'); // src extracted

    $ch = curl_init(); // initialise a cURL handle
    // set any other cURL options that are required
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
    curl_setopt($ch, CURLOPT_COOKIESESSION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_URL, $src);
    $results = curl_exec($ch); // execute the cURL request
    //echo curl_error($ch);
    curl_close($ch); // close the handle

    // preg_match_all collects every match at once; the original
    // while (preg_match(...)) on an unchanging string never terminated
    if (preg_match_all('/<a[^>]+href=([\'"])(.+?)\1[^>]*>/i', $results, $matches)) {
        // $matches[2] holds the captured URLs you are searching for
        foreach ($matches[2] as $href) {
            echo $href . '<br>';
        }
    }
}
You can use a proxy:
$ip = '100.100.100.100:234'; // example $ip
curl_setopt($ch, CURLOPT_PROXY, $ip);
Without cURL:
$aContext = array(
    'http' => array(
        'proxy' => 'tcp://' . $ip,
        'request_fulluri' => true,
    ),
);
$cxContext = stream_context_create($aContext);
$sFile = file_get_contents("http://www.google.com", false, $cxContext);
If you're looking for proxies, there are some addresses that are easy to scrape:
'http://proxylist.hidemyass.com/',
'http://ipaddress.com/proxy-list/',
'http://nntime.com/proxy-ip-'.$i.'.htm',
'http://www.proxylisty.com/ip-proxylist-'.$i
over 2000 IPs
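For the paginated lists, a simple loop over $i is enough to fetch each page. A rough sketch (the page count and the assumption that plain ip:port strings appear in the HTML are mine; these sites change often):
<?php
// Rough sketch: iterate a paginated proxy list and collect ip:port pairs.
for ($i = 1; $i <= 10; $i++) { // assumed number of pages
    $page = @file_get_contents('http://nntime.com/proxy-ip-' . $i . '.htm');
    if ($page !== false) {
        preg_match_all('/\b\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}\b/', $page, $m);
        print_r($m[0]);
    }
}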
As mentioned above, the PHP file_get_contents() function, and even the fopen()/fread() combination, gets stuck and times out when trying to read this simple image URL:
http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png
but the same image is easily loaded by browsers. What's the catch?
EDIT:
As requested in the comments, here is the function I used to get the data:
function customRead($url)
{
    $contents = '';
    $handle = fopen($url, "rb");
    $dex = 0;
    while (!feof($handle)) {
        if ($dex++ > 100) {
            echo "\nbreaking due to too many calls...\n";
            break;
        }
        $contents .= fread($handle, 2048);
    }
    fclose($handle);
    return $contents;
}
I also tried simply this:
echo file_get_contents('http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png');
Both show the same issue.
EDIT:
As suggested in a comment, I used cURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.1 Safari/537.11');
$res = curl_exec($ch);
$rescode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$err = curl_error($ch); // curl_error() must be read before curl_close()
curl_close($ch);

echo "\n\n\n[DATA:";
echo $res;
echo "]\n\n\n[CODE:";
print_r($rescode);
echo "]\n\n\n[ERROR:";
echo $err;
echo "]\n\n\n";
This is the result:
[DATA:]
[CODE:0]
[ERROR:]
If you can't get the remote data with file_get_contents, try cURL, as it can provide error messages via curl_error(). If you get nothing, not even an error, then something on your server is probably blocking outgoing connections. You may even want to try running curl from a shell over SSH; I'm not sure whether that makes any difference, but it's worth a try. If you still get nothing, consider contacting the server admin (if you're not the admin yourself) or your hosting provider.
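A minimal sketch of that kind of check, reading the error before the handle is closed:
<?php
$ch = curl_init('http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // fail fast if outgoing connections hang
$res = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$err = curl_error($ch); // must be read before curl_close()
curl_close($ch);

if ($res === false) {
    echo "cURL failed (HTTP $code): $err\n"; // HTTP code 0 means the request never completed
}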
http://westwood-backup.com/podcast?categoryID2=403
This is the XML file that I want to load and echo via PHP. I tried file_get_contents and DOMDocument::load. Both return an empty string. If I change the URL to another XML file, the functions work great. What could be special about this URL?
<?php
$content = file_get_contents("http://westwood-backup.com/podcast?categoryID2=403");
echo $content;
?>
Another try with load(), same empty result:
<?php
$feed = new DOMDocument();
if (@$feed->load("http://westwood-backup.com/podcast?categoryID2=403")) {
    $xpath = new DOMXPath($feed);
    $linkPath = $xpath->query("/rss/channel/link");
    echo $linkPath->item(0)->nodeValue; // query() returns a DOMNodeList, not a string
}
?>
Use cURL, and you can do it like this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://westwood-backup.com/podcast?categoryID2=403');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/1.22 (compatible; MSIE 2.0d; Windows NT)');
$xml = curl_exec($ch);
curl_close($ch);

$xml = new SimpleXMLElement($xml);
echo "<pre>";
print_r($xml);
echo "</pre>";
I think the server implements a User-Agent check to make sure the XML data is only loaded within a browser (not via bots, file_get_contents, etc.), so by using cURL and setting a dummy User-Agent you can get around the check and load the data.
You need to set a User-Agent header that the server is happy with. There is no need for cURL if you don't want to use it; you can pass a context from stream_context_create to file_get_contents:
$options = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10\r\n" // i.e. an iPad
    )
);
$context = stream_context_create($options);
$content = file_get_contents("http://westwood-backup.com/podcast?categoryID2=403", false, $context);
echo $content;