I'm running into an issue with cURL while getting customer review data from Google (without API). Before my cURL request was working just fine, but it seems Google now redirects all requests to a cookie consent page.
Below you'll find my current code:
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);
$result now just prints "302 Moved. The document had moved here."
I also tried setting curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0); but that didn't help either.
Does anyone has an idea on how to overcome this? Can I programmatically deny (or accept) Google's cookies somehow? Or maybe there is a better way of handling this?
What you need is the following:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
The above curl option is what tells curl to follow redirects. However, I am not sure whether what is returned will be of much use for the specific URL you are trying to fetch. By adding the above option you will obtain the HTML source for the final page Google redirects to. But this page contains scripts that when executed load the map and other content that is ultimately displayed in your browser. So if you need to fetch data from what is subsequently loaded by JavaScript, then you will not find it in the returned results. Instead you should look into using a tool like selenium with PHP (you might take a look at this post).
Related
I had a simple parser for an external site that's required to confirm that the link user submitted leads to an account this user owns (by parsing a link to their profile from linked page). And it worked for a good long while with just this wordpress function:
function fetch_body_url($fetch_link){
$response = wp_remote_get($fetch_link, array('timeout' => 120));
return wp_remote_retrieve_body($response);
}
But then the website changed something in their cloudflare defense, and now this results in "Please wait..." page of cloudflare with no option to pass it.
Thing is, I don't even need it done automatically - if there was a captcha, the user could've complete it. But it won't show anything other than endlessly spinning "checking your browser".
Googled a bunch of curl examples, and best I could get so far is this:
<?php
$url='https://ficbook.net/authors/1000'; //random profile from requrested website
$agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_REFERER, 'https://facebook.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
$response = curl_exec($ch);
curl_close($ch);
echo '<textarea>'.$response.'</textarea>';
?>
Yet it still returns the browser check screen. Adding random free proxy to it doesn't seem to work either, or maybe I wasn't lucky finding a working one (or couldn't figure out how to insert it correctly in this case). Is there any way around it? Or perhaps there is some other way to see if there is a specific keyword/link on the page?
Ok, I've spent most of the day on this problem, and seems like I got it more or less sorted. Not exactly the way I expected, but hey, it works... sort of.
Instead of solving this on the server side, I ended up looking for solution to parse it on my own PC (it has better uptime than my hosting's server anyway). Turns out, there are plenty of ready-to-use open source scrapers, including those that know how to bypass cloudflare being extra defensive for no good reason.
Solution for python dummies like myself:
Install Anaconda if you don't have python installed yet.
In cmd type pip install cloudscraper
Open Spyder (it comes along with Anaconda) and paste this:
import cloudscraper
scraper = cloudscraper.create_scraper()
print(scraper.get("https://your-parse-target/").text)
Save it anywhere and poke at run button to test. If it works, you got your data in the console window of same app.
Replace print with whatever you're gonna do with that data.
For my specific case it also required to install mysql-connector-python and to enable remote access for mysql database (and my hosting had it available for free all this time, huh?). So instead of directly verifying that user is the owner of the profile they input, there's now a queue - which isn't perfect, but oh well, they'll have to wait.
First, user request is saved to mysql. My local python script will check that table every now and then to see if anything's in line to be verified. It'll get the page's content and save it back to mysql. Then the old php parser will do its job like before, but from mysql fetch instead of actual website.
Perhaps there are better solutions that don't require resorting to measures like creating a separate local parser, but maybe this will help to someone running into similar issue.
I need to retrieve and parse the text of public domain books, such as those found on gutenberg.org, with PHP.
To retrieve the content of most webpages I am able to use CURL requests to retrieve the HTML exactly as I would find had I navigated to the URL in a browser.
Unfortunately on some pages, most importantly gutenberg.org pages, the websites display different content or send a redirect header.
For example, when attempting to load this target, gutenberg.org, page a curl request gets redirected to this different but logically related, gutenberg.org, page. I am successfully able to visit the target page with both cookies and javascript turned off on my browser.
Why is the curl request being redirected while a regular browser request to the same site is not?
Here is the code I use to retrieve the webpage:
$urlToScan = "http://www.gutenberg.org/cache/epub/34175/pg34175.txt";
if(!isset($userAgent)){
$userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36";
}
$ch = curl_init();
$timeout = 15;
curl_setopt($ch, CURLOPT_COOKIESESSION, true );
curl_setopt($ch, CURLOPT_USERAGENT,$userAgent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
#curl_setopt($ch, CURLOPT_HEADER, 1); // return HTTP headers with response
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_URL, $urlToScan);
$html = curl_exec($ch);
curl_close($ch);
if($html == null){
return false;
}
print $html;
The hint is probably in the url: it says "welcome stranger". They are redirecting every "first" time visitor to this page. Once you have visited the page, they will not redirect you anymore.
THey don't seem to be saving a lot of stuff in your browser, but they do set a cookie with a session id. This is the most logical thing really: check if there is a session.
What you need to do is connect with curl AND a cookie. You can use your browsers cookie for this, but in case it expires, you'd be better of doing
request the page.
if the page is redirected, safe the cookie (you now have a session)
request the page again with that cookie.
If all goes well, the second request will not redirect. Until the cookie / session expires, and then you start again. see the manual to see how to work with cookies/cookie-jars
The reason that one could navigate to the target page in a browser without cookies or javascript, yet not by curl, was due to the website tracking the referrer in the header. The page can be loaded without cookies by setting the appropriate referrer header:
curl_setopt($ch, CURLOPT_REFERER, "http://www.gutenberg.org/ebooks/34175?msg=welcome_stranger");
As pointed out by #madshvero, the page also be, surprisingly, loaded by simply excluding the user agent.
I have a peculiar issue.
I have a script which fetches a JSON. It works perfectly fine in the browser (gives the correct json). For eg. accessing the URL
http://example.com/json_feed.php?sid=21662567
in browser gives me following JSON (snippet shown):
{"id":"21662567","title":"Camp and Kayak in Amchi Mumbai. for 1 Booking...
As can be seen, the sid (of URL) and id of JSON match and is the correct json.
But the same URL when accessed via file_get_contents, gives me wrong result. The code is rather trivial and hence, I am completely stumped as to why this will happen.
$json = file_get_contents("http://example.com/json_feed.php?sid=21662567");
echo "<pre>";
var_dump($json);
echo "</pre>";
The JSON response of above code is:
string(573) "{"id":"23160210","title":"Learn about Commodity Markets (Gold\/Silver) for...
As can be seen, sid and id don't match now and the JSON fetched is incorrect.
I tried using curl also, thinking that it could be some format issue, but to no avail. curl also fetches the same incorrect JSON.
At the same time, accessing the original URL in browser will fetch the correct JSON.
Any ideas on what's happening here?
EDIT by Talvinder (14 April, 2014 at 0913 IST)
ISSUE SPOTTED: the script json_feed.php is session dependent and file_get_contents doesn't pass session values. I am not sure how to build the HTTP_REQUEST in cURL. Can someone help me with this? My current cURL code is:
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 OPR/20.0.1387.91');
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
Where $url is the url given at the beginning of the question.
EDIT by TALVINDER(14 april, 1805 IST)
Killed the links shared earlier as they are dead now.
EDIT by TALVINDER (14 april, 0810 IST):
JSON can be seen here: JSON GENERATOR
file_get_content results can be seen here: file_get_contents script
Without any links that we can investigate, I guess there's some sort of user-agent magic going on.
Try spoofing it with cURL.
Someting like this:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 OPR/20.0.1387.91');
You can use your own user agent or find something else here.
Not 100% this is the issue, but considering the data provided by you, this is the only solution I can think of. I am sure you've double checked that the urls etc are correct in the scripts etc.
Figured out the issue and sharing it here for others to take note of:
Lesson 1
file_get_contents doesn't pass session or cookies to the URL, so don't use it to fetch data from URLs which are session or cookie dependents
cURL is your friend. You will have to build full HTTP request to pass proper session variables
Lesson 2
Session dependent scripts will behave properly when accessed via browser but not when accessed via file_get_contents or partially formed cURL
List item
Lesson 3
When the code is too trivial and yet buggy, Devil is in details - question every little function (apologies for philosophical connotation here :) )
SOLUTION
The json_feed.php I created is session dependent. So it was being naughty when accessed via file_get_contents. With cURL too it wasn't behaving properly.
I changed the cURL to include suggestions given here: Maintaining PHP session while accessing URL via cURL
My final cURL code (which worked is below):
$strCookie = 'PHPSESSID=' . $_COOKIE['PHPSESSID'] . '; path=/';
session_write_close();
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt( $ch, CURLOPT_COOKIE, $strCookie );
curl_setopt($ch, CURLOPT_URL,$url);
$result = curl_exec($ch);
curl_close($ch);
I hope it saves some time for someone.
Thanks for all the replies.
I am just playing around trying to learn php and decided to write a php page that could pull info from the leagueoflegends boards. Problem I am having is the site needs me to login first. Ive tried just
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://forums.euw.leagueoflegends.com/board');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_REFERER, "http://leagueoflegends.com");
$html = curl_exec($ch);
curl_close($ch);
echo $html;
and I have tried
file_get_contents('http://forums.euw.leagueoflegends.com/board/')
but every time I get nowhere. I was hoping that being logged in on another tab would allow me to get the source of pages on the forums, but that doesn't seem to be the case. I honestly don't even know where to go from here or what I should be searching for to give me a clue. Normally I like to post a little more info, but like I said I am trying to learn PHP; i've seem to learn best by just jumping in.
First, good luck on your path of learning PHP! Curl is mighty powerful, but lately I've been using Guzzle instead (guzzlephp.org) for it's ease of use.
Most sites that have login mechanisms do in fact use sessions or cookies to map users so you are on the right path. What you have above will simply retrieve the main board page. From here, you'll submit a second curl request to login. The login page there is:
https://account.leagueoflegends.com/login
That actually pops up a modal window though and uses a captcha. You'll submit the following form fields:
username
password
recaptcha_response_field
to: https://account.leagueoflegends.com/auth
Since this has a captcha, your best bet may be to login as yourself and export your cookie data for this domain and see if you can reuse it in your script. It'll expire at some point so this won't be fully automated.
I have a small web page that, every day, displays a one word answer - either Yes or No - depending on some other factor that changes daily.
Underneath this, I have a Facebook like button. I want this button to post, in the title/description, either "Yes" or "No", depending on the verdict that day.
I have set up the OG metadata dynamically using php to echo the correct string into the og:title etc. But Facebook caches the value, so someone sharing my page on Tuesday can easily end up posting the wrong content to Facebook.
I have confirmed this is the issue by using the Facebook object debugger. As soon as I force a refresh, all is well. I attempted to automate this using curl, but this doesn't seem to work.
$ch = curl_init();
$timeout = 30;
curl_setopt($ch, CURLOPT_URL, "http://developers.facebook.com/tools/lint/?url={http://ispizzahalfprice.com}");
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
Am I missing some easy fix here? Or do I need to re-evaluate my website structure to acheive what I am looking for (e.g. use two separate pages)?
Here's the page in case it's useful: http://ispizzahalfprice.com
Using two separate URL's would be the safe bet. As you have observed, Facebook does quite heavy caching on URL scrapes. You've also seen that you, as the admin of the App, can flush and refresh Facebook's cache by pulling the page through the debugger again.
Using two URL's would solve this issue because Facebook could cache the results all they want! There will still be a separate URL for "yes" and one for "no".