fetching a page from facebook with curl/file_get_contents - php

I'm trying to get the content of my Facebook page like so:
echo file_get_contents("http://www.facebook.com/dma.y");
The problem is that it doesn't give me the page but redirects me to another page saying I need to upgrade my browser. So I thought I'd use curl and fetch the page by sending a request with some headers.
echo get_follow_url('http://www.facebook.com/dma.y');
function get_follow_url($url){
    // must set $url first. Duh...
    $http = curl_init($url);
    curl_setopt($http, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($http, CURLOPT_HTTPHEADER, get_headers('http://google.com'));
    // do your curl thing here
    $result = curl_exec($http);
    if(curl_errno($http)){
        echo "<br/>An error has been thrown!<br/>";
        exit();
    }
    $http_status = curl_getinfo($http, CURLINFO_HTTP_CODE);
    curl_close($http);
    return $http_status;
}
Still no luck. I should get a status code of either 404 or 200, depending on whether I am logged into Facebook, but it returns 301 because it identifies my request as not coming from a regular browser. So what am I missing in the curl option settings?
UPDATE
What I am actually trying to do is replicate this functionality:
The script will trigger the function onload or onerror, depending on the status code returned.
That code will retrieve the page. However, the JavaScript method is clumsy and breaks in some browsers, such as Firefox, because the fetched resource isn't a JavaScript file.

What you might want to try is setting the user agent with cURL.
$url = 'https://www.facebook.com/cocacola';
$http = curl_init($url);
$fake_user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040803 Firefox/0.9.3';
curl_setopt($http, CURLOPT_USERAGENT, $fake_user_agent);
$result = curl_exec($http);
This is the header that servers look at to see which browser you are using. I'm not 100% sure whether this will bypass Facebook's checks and give you ALL the information on the page, but it's definitely worth a try! :)
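For completeness, here is a sketch that builds on the snippet above: it sends a browser user agent, follows the redirect chain, and reports the final HTTP status code, which is what the asker wants to inspect. The URL and user-agent string are just examples; Facebook may still refuse non-browser clients.

```php
<?php
// Fetch a URL with a browser-like user agent and return the final HTTP status.
// The user-agent string is only an example of a desktop browser signature.
function fetch_status($url, $userAgent) {
    $http = curl_init($url);
    curl_setopt($http, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($http, CURLOPT_RETURNTRANSFER, true); // capture instead of echoing
    curl_setopt($http, CURLOPT_FOLLOWLOCATION, true); // follow the 301/302 hops
    curl_setopt($http, CURLOPT_TIMEOUT, 15);          // don't hang forever
    curl_exec($http);
    $status = curl_getinfo($http, CURLINFO_HTTP_CODE); // status of the final hop
    curl_close($http);
    return $status; // 0 means the request itself failed (DNS, timeout, ...)
}
```

With CURLOPT_FOLLOWLOCATION set, the 301 the asker sees is followed automatically, so curl_getinfo reports the status of the page curl finally lands on rather than the redirect itself.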

Related

cURL issue with Google consent redirect

I'm running into an issue with cURL while getting customer review data from Google (without the API). My cURL request was working just fine before, but it seems Google now redirects all requests to a cookie consent page.
Below you'll find my current code:
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);
$result now just prints "302 Moved. The document had moved here."
I also tried setting curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0); but that didn't help either.
Does anyone have an idea how to overcome this? Can I programmatically deny (or accept) Google's cookies somehow? Or maybe there is a better way of handling this?
What you need is the following:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
The above curl option tells curl to follow redirects. However, I am not sure whether what is returned will be of much use for the specific URL you are trying to fetch. By adding the option you will obtain the HTML source of the final page Google redirects to, but that page contains scripts which, when executed, load the map and the other content that is ultimately displayed in your browser. So if you need to fetch data that is subsequently loaded by JavaScript, you will not find it in the returned results. Instead, you should look into using a tool like Selenium with PHP (you might take a look at this post).
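If following the redirect only lands you on the consent page itself, one workaround sometimes used is to pre-set a consent cookie on the request so the interstitial is skipped. This is a sketch under the assumption that Google still honours a cookie of this shape; the `CONSENT=YES+cb` value is a guess, not a documented API, and may stop working at any time:

```php
<?php
// Sketch: send a pre-made consent cookie so Google may skip the consent page.
// The cookie value 'CONSENT=YES+cb' is an assumption, not a documented API.
function fetch_with_consent_cookie($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow the 302
    curl_setopt($ch, CURLOPT_COOKIE, 'CONSENT=YES+cb');  // hypothetical consent cookie
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result; // page HTML, or false on failure
}
```

Even if this gets past the consent page, the caveat above still applies: content injected by JavaScript will not be in the returned HTML.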

Webpage detecting / displaying different content for curl request - Why?

I need to retrieve and parse the text of public domain books, such as those found on gutenberg.org, with PHP.
To retrieve the content of most webpages I am able to use cURL requests, getting the HTML exactly as I would find it had I navigated to the URL in a browser.
Unfortunately, some pages, most importantly gutenberg.org pages, display different content or send a redirect header.
For example, when attempting to load this target gutenberg.org page, a curl request gets redirected to this different but logically related gutenberg.org page. I can successfully visit the target page with both cookies and JavaScript turned off in my browser.
Why is the curl request being redirected while a regular browser request to the same site is not?
Here is the code I use to retrieve the webpage:
$urlToScan = "http://www.gutenberg.org/cache/epub/34175/pg34175.txt";
if(!isset($userAgent)){
    $userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36";
}
$ch = curl_init();
$timeout = 15;
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
#curl_setopt($ch, CURLOPT_HEADER, 1); // return HTTP headers with response
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_URL, $urlToScan);
$html = curl_exec($ch);
curl_close($ch);
if($html == null){
    return false;
}
print $html;
The hint is probably in the URL: it says "welcome stranger". They redirect every "first-time" visitor to this page. Once you have visited it, they will not redirect you anymore.
They don't seem to be saving a lot of stuff in your browser, but they do set a cookie with a session id. This is the most logical thing really: check if there is a session.
What you need to do is connect with curl AND a cookie. You can use your browser's cookie for this, but in case it expires, you'd be better off doing the following:
1. Request the page.
2. If the page is redirected, save the cookie (you now have a session).
3. Request the page again with that cookie.
If all goes well, the second request will not redirect, until the cookie / session expires, and then you start again. See the manual to learn how to work with cookies/cookie-jars.
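The three steps above can be sketched with a cookie jar, a temp file that curl both writes cookies to and reads them from. This assumes, as described, that the site only redirects visitors who have no session cookie yet:

```php
<?php
// Fetch a URL using a cookie jar: cookies received are written to $jar on
// curl_close, and sent back on the next call that reads the same file.
function fetch_with_jar($url, $jar) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);  // cookies are written here on close
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar); // and sent from here on the next call
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$jar = tempnam(sys_get_temp_dir(), 'cookies');
$url = 'http://www.gutenberg.org/cache/epub/34175/pg34175.txt';
fetch_with_jar($url, $jar);         // 1st request: gets redirected, stores the session cookie
$html = fetch_with_jar($url, $jar); // 2nd request: sends the cookie, should not redirect
```

Using a fresh tempnam() file sidesteps the expiry problem of copying a cookie out of your browser: when the session expires, the next first request simply collects a new one.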
The reason that one could navigate to the target page in a browser without cookies or JavaScript, yet not with curl, was that the website checks the referrer in the request headers. The page can be loaded without cookies by setting the appropriate referrer header:
curl_setopt($ch, CURLOPT_REFERER, "http://www.gutenberg.org/ebooks/34175?msg=welcome_stranger");
As pointed out by @madshvero, the page can also, surprisingly, be loaded by simply excluding the user agent.

curl returns 404 on valid page

I've got a PHP function that checks a URL to make sure that (a.) there's some kind of server response, and (b.) it's not a 404.
It works just fine on every domain/URL I've tested, with the exception of bostonglobe.com, where it's returning a 404 for valid URLs. I'm guessing it has something to do with their paywall, but my function works fine on nytimes.com and other newspaper sites.
Here's an example URL that returns a 404:
https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html
What am I doing wrong?
function check_url($url){
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');
    curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
    $result = curl_exec($curl);
    if ($result == false) {
        // There was no response
        $message = "No information found for that URL";
    } else {
        // What was the response?
        $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        if ($statusCode == 404) {
            $message = "No information found for that URL";
        } else {
            $message = "Good";
        }
    }
    return $message;
}
The problem seems to come from your CURLOPT_NOBODY option.
I've tested your code both with and without this line and the http code returns 404 when CURLOPT_NOBODY is present, and 200 when it's not.
The PHP manual informs us that setting the CURLOPT_NOBODY option will transform your request method to HEAD, my guess is that the server on which bostonglobe.com is hosted doesn't support that method.
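If the server refuses HEAD, one workaround is to keep a normal GET but discard the response body as it streams in, so the status check stays cheap. This is a sketch of that idea; whether bostonglobe.com's servers accept it has not been verified here:

```php
<?php
// Check a URL's status with a GET request instead of HEAD, throwing the
// body away chunk by chunk instead of buffering the whole page.
function check_url_status($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);
    // Consume each body chunk as it arrives without storing it.
    curl_setopt($curl, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) {
        return strlen($chunk); // report the chunk as handled
    });
    curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);
    return $status; // 0 means the request itself failed
}
```

The trade-off is that the server still sends the body over the wire; the WRITEFUNCTION callback merely stops PHP from keeping it in memory.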
I checked this URL with the curl command:
curl --head https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html
It returned an error (HTTP/1.1 404 Not Found).
I also tried wget, and the result was the same:
wget --server-response --spider https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html
I also checked this case with a web service (HTTP request generator: http://web-sniffer.net/). The result was the same.
Other URLs on https://www.bostonglobe.com/ do respond to HEAD requests, but the article pages (with the .html extension) apparently do not support them.
Perhaps the server administrator or programmer shut down HEAD requests, for example in PHP:
if($_SERVER["REQUEST_METHOD"] == "HEAD"){
    // respond with 404, or use header() to redirect
    exit;
}
or the server software (Apache, etc.) limits the allowed HTTP methods, for example to reduce server load.

SMS url not working with file operation

My question is: I'm using
$url = "http://sms.emefocus.com/sendsms.jsp?user=$uname&password=$pwd&mobiles=$mobiil_no&sms=$msg&senderid=$sender_id";
$ret = file($url);
to send SMS to users from a user panel, and I'm using the file() operation to execute this URL, as shown above.
After executing this, when I try to print $ret, it gives me a status of true and generates a message id and a sending id.
But the SMS is not getting delivered to the user.
When I execute the same URL in a browser, e.g.
$url = "http://sms.emefocus.com/sendsms.jsp?user=$uname&password=$pwd&mobiles=98xxxxxx02&sms=Hi..&senderid=$sender_id";
it gets delivered immediately. Can anyone help me out? Thanks in advance.
It is possible that this SMS service needs to believe a browser, and not a bot, is executing the request, or that there is some "protection" we don't know about. Is there any documentation regarding this particular service? Is it intended to be used the way you're trying to use it?
You can try with CURL and see if the behaviour is still the same:
<?php
// create curl resource
$ch = curl_init();
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)';
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
// Fake real browser
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$ret = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
Does it help?
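One more detail worth checking before blaming browser detection: when the URL is built by plain string interpolation, a message containing spaces or an ampersand will be mangled or truncated in transit, which can look exactly like "accepted but never delivered". A sketch using http_build_query to encode the parameters; the parameter names are taken from the question, the values are placeholders:

```php
<?php
// Build the gateway URL with properly encoded query parameters.
// Parameter names come from the question; the values here are placeholders.
function build_sms_url($base, array $params) {
    return $base . '?' . http_build_query($params);
}

$url = build_sms_url('http://sms.emefocus.com/sendsms.jsp', [
    'user'     => 'myuser',
    'password' => 'mypass',
    'mobiles'  => '98xxxxxx02',
    'sms'      => 'Hi there & welcome', // the space and & are now safely encoded
    'senderid' => 'MYID',
]);
// $url now ends in ...&sms=Hi+there+%26+welcome&senderid=MYID
```

http_build_query percent-encodes every value, so the sms text can contain spaces, ampersands, or non-ASCII characters without corrupting the other parameters.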

file_get_contents won't return the source code

When I execute this code:
var_dump(file_get_contents('http://www.zahnarzt-gisler.ch'));
I get this error:
Warning: file_get_contents(http://www.zahnarzt-gisler.ch): failed to
open stream: HTTP request failed! HTTP/1.1 403 Forbidden in
/home/httpd/vhosts/your-click.ch/httpdocs/wp-content/themes/your-click/ajax-request.php
on line 146 bool(false)
I don't know why it returns false, since when I change the URL, e.g. to http://www.google.com or any other URL, it works and returns the source code of the page.
I guess something must be wrong with the URL, but that seems weird to me, because the URL is online and available.
Site owners can disallow you from scraping their data without asking.
You can still scrape the page, but you have to set a user agent; cURL is the way to go.
file_get_contents() is a simple screwdriver: great for simple GET requests where the headers, HTTP request method, timeout, cookie jar, redirects, and other important things do not matter.
<?php
$config['useragent'] = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0';
$ch = curl_init();
// Set the url, number of GET vars, GET data
curl_setopt($ch, CURLOPT_URL, 'http://www.zahnarzt-gisler.ch');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt($ch, CURLOPT_USERAGENT, $config['useragent']);
// Execute request
$result = curl_exec($ch);
curl_close($ch);
echo $result;
?>
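That said, if cURL is not available, file_get_contents itself can send a user agent through a stream context. A sketch; whether this particular site accepts the request once the header is set is not guaranteed:

```php
<?php
// file_get_contents can send custom headers via a stream context.
$context = stream_context_create([
    'http' => [
        'method'  => 'GET',
        'header'  => "User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0\r\n",
        'timeout' => 15, // seconds before giving up
    ],
]);
// Returns the page source, or false if the server still refuses the request.
$html = @file_get_contents('http://www.zahnarzt-gisler.ch', false, $context);
```

This keeps the one-liner convenience of file_get_contents while fixing the most common cause of a 403: the missing User-Agent header.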
