I have a peculiar issue.
I have a script which fetches JSON. It works perfectly fine in the browser (it returns the correct JSON). For example, accessing the URL
http://example.com/json_feed.php?sid=21662567
in the browser gives me the following JSON (snippet shown):
{"id":"21662567","title":"Camp and Kayak in Amchi Mumbai. for 1 Booking...
As can be seen, the sid in the URL and the id in the JSON match, and the JSON is correct.
But the same URL, when accessed via file_get_contents, gives me the wrong result. The code is rather trivial, hence I am completely stumped as to why this happens.
$json = file_get_contents("http://example.com/json_feed.php?sid=21662567");
echo "<pre>";
var_dump($json);
echo "</pre>";
The JSON response of the above code is:
string(573) "{"id":"23160210","title":"Learn about Commodity Markets (Gold\/Silver) for...
As can be seen, the sid and id no longer match and the fetched JSON is incorrect.
I also tried using cURL, thinking it could be some format issue, but to no avail: cURL fetches the same incorrect JSON.
At the same time, accessing the original URL in the browser still fetches the correct JSON.
Any ideas on what's happening here?
EDIT by Talvinder (14 April, 2014 at 0913 IST)
ISSUE SPOTTED: the script json_feed.php is session dependent, and file_get_contents doesn't pass session values. I am not sure how to build the full HTTP request in cURL. Can someone help me with this? My current cURL code is:
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 OPR/20.0.1387.91');
curl_setopt($ch, CURLOPT_URL,$url);
$result = curl_exec($ch);
curl_close($ch);
Where $url is the url given at the beginning of the question.
EDIT by TALVINDER (14 April, 1805 IST)
Killed the links shared earlier as they are dead now.
Without any links that we can investigate, I guess there's some sort of user-agent magic going on.
Try spoofing it with cURL.
Something like this:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 OPR/20.0.1387.91');
You can use your own user agent or find something else here.
Not 100% sure this is the issue, but considering the data you provided, this is the only solution I can think of. I am sure you've double-checked that the URLs in the scripts are correct.
Figured out the issue and sharing it here for others to take note of:
Lesson 1
file_get_contents doesn't pass the session or cookies to the URL, so don't use it to fetch data from URLs that are session or cookie dependent.
cURL is your friend. You will have to build a full HTTP request to pass the proper session variables.
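For completeness, file_get_contents can forward a cookie if you build a stream context yourself; the sketch below is an untested illustration that assumes the calling script shares the same PHP session as json_feed.php.
// Sketch: manually forwarding the session cookie with file_get_contents.
session_write_close(); // release the session lock so json_feed.php can open the same session
$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => 'Cookie: PHPSESSID=' . $_COOKIE['PHPSESSID'] . "\r\n",
    ),
));
$json = file_get_contents('http://example.com/json_feed.php?sid=21662567', false, $context);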
Lesson 2
Session-dependent scripts will behave properly when accessed via the browser, but not when accessed via file_get_contents or a partially formed cURL request.
Lesson 3
When the code is too trivial and yet buggy, the devil is in the details: question every little function (apologies for the philosophical connotation here :) )
SOLUTION
The json_feed.php I created is session dependent, so it misbehaved when accessed via file_get_contents. It wasn't behaving properly with a plain cURL request either.
I changed the cURL code to include the suggestions given here: Maintaining PHP session while accessing URL via cURL
My final cURL code (which worked) is below:
// Forward the current PHP session cookie to the sub-request.
$strCookie = 'PHPSESSID=' . $_COOKIE['PHPSESSID'] . '; path=/';
session_write_close(); // release the session lock so json_feed.php can open the same session
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIE, $strCookie);
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
curl_close($ch);
I hope it saves some time for someone.
Thanks for all the replies.
Related
I'm running into an issue with cURL while getting customer review data from Google (without the API). Before, my cURL request was working just fine, but it seems Google now redirects all requests to a cookie consent page.
Below you'll find my current code:
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);
$result now just prints "302 Moved. The document had moved here."
I also tried setting curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0); but that didn't help either.
Does anyone have an idea how to overcome this? Can I programmatically deny (or accept) Google's cookies somehow? Or maybe there is a better way of handling this?
What you need is the following:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
The above cURL option tells cURL to follow redirects. However, I am not sure whether what is returned will be of much use for the specific URL you are trying to fetch. By adding the option you will obtain the HTML source of the final page Google redirects to, but this page contains scripts that, when executed, load the map and other content that is ultimately displayed in your browser. So if you need data that is loaded later by JavaScript, you will not find it in the returned results. Instead, you should look into using a tool like Selenium with PHP (you might take a look at this post).
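For reference, here is a minimal sketch with the redirect options added; the cookie-jar path is an assumption, it simply lets cURL keep any cookies set along the redirect chain:
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the 302 to the final page
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);        // guard against redirect loops
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/google_cookies.txt');  // hypothetical path
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/google_cookies.txt'); // re-send cookies during the chain
$html = curl_exec($ch);
curl_close($ch);
// $html is the final page's source; anything rendered by JavaScript will still be missing.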
I had a simple parser for an external site that's required to confirm that the link a user submitted leads to an account this user owns (by parsing a link to their profile from the linked page). It worked for a good long while with just this WordPress function:
function fetch_body_url($fetch_link){
    $response = wp_remote_get($fetch_link, array('timeout' => 120));
    return wp_remote_retrieve_body($response);
}
But then the website changed something in their Cloudflare defense, and now this results in Cloudflare's "Please wait..." page with no option to get past it.
Thing is, I don't even need it done automatically: if there were a captcha, the user could've completed it. But it won't show anything other than an endlessly spinning "Checking your browser".
I googled a bunch of cURL examples, and the best I could get so far is this:
<?php
$url='https://ficbook.net/authors/1000'; //random profile from requrested website
$agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_REFERER, 'https://facebook.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
$response = curl_exec($ch);
curl_close($ch);
echo '<textarea>'.$response.'</textarea>';
?>
Yet it still returns the browser check screen. Adding a random free proxy doesn't seem to work either, or maybe I wasn't lucky enough to find a working one (or couldn't figure out how to insert it correctly in this case). Is there any way around it? Or perhaps there is some other way to check whether a specific keyword/link is present on the page?
OK, I've spent most of the day on this problem, and it seems like I got it more or less sorted. Not exactly the way I expected, but hey, it works... sort of.
Instead of solving this on the server side, I ended up looking for a solution to parse it on my own PC (it has better uptime than my hosting's server anyway). It turns out there are plenty of ready-to-use open-source scrapers, including ones that know how to bypass Cloudflare when it's being extra defensive for no good reason.
Solution for Python dummies like myself:
Install Anaconda if you don't have Python installed yet.
In cmd, type pip install cloudscraper
Open Spyder (it comes with Anaconda) and paste this:
import cloudscraper
scraper = cloudscraper.create_scraper()
print(scraper.get("https://your-parse-target/").text)
Save it anywhere and hit the Run button to test. If it works, you'll see your data in the console window of the same app.
Replace print with whatever you're gonna do with that data.
For my specific case it also required installing mysql-connector-python and enabling remote access to the MySQL database (which my hosting had available for free all this time, huh?). So instead of directly verifying that a user is the owner of the profile they submit, there's now a queue, which isn't perfect, but oh well, they'll have to wait.
First, the user's request is saved to MySQL. My local Python script checks that table every now and then to see if anything is in line to be verified. It gets the page's content and saves it back to MySQL. Then the old PHP parser does its job like before, but reads from MySQL instead of the actual website.
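A rough sketch of the PHP side of that queue (the table and column names are made up for illustration, and the local Python worker is assumed to fill page_html for pending rows):
// Hypothetical queue table: parse_queue(id, profile_url, page_html, status)
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'dbuser', 'dbpass');

// 1. Enqueue the link the user submitted so the local scraper can pick it up.
$stmt = $pdo->prepare("INSERT INTO parse_queue (profile_url, status) VALUES (?, 'pending')");
$stmt->execute(array($fetch_link));

// 2. Later, once the worker has stored the fetched HTML, the old parser reads it
//    from MySQL instead of hitting the Cloudflare-protected site directly.
$stmt = $pdo->prepare("SELECT page_html FROM parse_queue WHERE profile_url = ? AND status = 'done'");
$stmt->execute(array($fetch_link));
$html = $stmt->fetchColumn();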
Perhaps there are better solutions that don't require resorting to measures like creating a separate local parser, but maybe this will help someone running into a similar issue.
I have a website on my local network. It is hidden behind a login. I want my PHP code to get into this website and copy its content. The content isn't present right away; it is loaded only after 1-3 seconds.
I already figured out how to log in and copy the website via cURL. But it shows only what is present right away; the content I'm aiming for is added after those 1-3 seconds.
<?php
$url = "http://#192.168.1.101/cgi-bin/minerStatus.cgi";
$username = 'User';
$password = 'Password';
$ch = curl_init($url);
curl_setopt($ch,CURLOPT_HTTPHEADER,array('User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0'));
curl_setopt($ch, CURLOPT_USERPWD, $username . ":" . $password);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if(curl_errno($ch)){
    // If an error occurred, throw an Exception.
    throw new Exception(curl_error($ch));
}
echo $response;
?>
The output is empty tables, and I'm expecting them to be filled with the data that shows up a bit later on the website.
The problem is that cURL simply makes an HTTP request and returns the response body to you. The table on the target page is probably populated asynchronously using JavaScript. You have two options here:
Find out what resources are requested and use cURL to get them directly. For this, open the page in your browser and check the developer tools for outgoing AJAX requests. Once you've figured out what file is actually loaded there, simply request that instead of your $url (see the sketch after these options).
Use an emulated / headless browser to execute the JavaScript. If for any reason the first option does not work for you, you could use a headless browser to simulate a real user navigating the site. This allows for full JavaScript capabilities. For PHP there is the great Symfony Panther library, which uses Facebook's php-webdriver under the hood and works really well. It will be more work than the first solution, so try that first.
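For the first option, the request could look much like your existing one, just pointed at whatever endpoint the developer tools reveal. The endpoint name below is purely a placeholder:
// Sketch of option 1: request the AJAX/JSON endpoint directly (replace the
// placeholder path with the real one from the browser's network tab).
$username = 'User';     // credentials from your original script
$password = 'Password';
$url = 'http://192.168.1.101/cgi-bin/minerStats.cgi'; // placeholder endpoint
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERPWD, $username . ':' . $password); // same basic auth as before
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
// If the endpoint returns JSON, decode it instead of scraping HTML:
$rows = json_decode($data, true);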
When using cURL to send data via POST, if that data string is URL encoded or if parts of it are URL encoded, cURL automatically decodes the data when sending it.
This happens when using cURL in PHP or directly in the command line.
I've tested with two different versions of cURL, 7.19 and 7.49. Both exhibit the same behavior.
I've sent the cURL request from two different servers, thinking that the way the servers were configured somehow influenced this, but the result was the same.
Here is a simple PHP cURL request that I've used for my test:
$ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
$data = "https%3A%2F%2Fexample.com%3A8081%2Ftemoignez%3FQid%3D%26"
$ch = curl_init( "https://example.com/test/webhook.php" );
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "payload=".$data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec( $ch );
The data will be sent decoded even though the initial string is URL encoded.
I'm retrieving the data by dumping the POST data into a file on disk using PHP.
Is this normal? Any idea what may cause this?
You have two different assertions here:
cURL automatically decodes the data when sending it.
...
I've simply dumped the POST data into a file after retrieving it.
It is PHP that automatically DECODES the data when receiving it. It is NOT getting decoded upon sending it!
This is consistent with the behaviour of other values, like cookie data, POST and GET variables, and header information such as the referrer: everything gets decoded automatically when it is received, because it is expected to be sent encoded.
When you want to see the exact data that is being sent over the wire, use a tool like ngrep on port 80 to sniff the HTTP traffic.
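You can also see the difference on the receiving end itself: the raw request body still contains the encoded string, while $_POST holds the decoded copy. A minimal sketch for the webhook.php from the question (the log file paths are arbitrary):
// webhook.php (receiving side)
$raw = file_get_contents('php://input'); // raw body: payload=https%3A%2F%2Fexample.com%3A8081...
file_put_contents('/tmp/raw_body.log', $raw . PHP_EOL, FILE_APPEND);
file_put_contents('/tmp/decoded.log', $_POST['payload'] . PHP_EOL, FILE_APPEND); // already decoded by PHP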
I have a small web page that, every day, displays a one word answer - either Yes or No - depending on some other factor that changes daily.
Underneath this, I have a Facebook like button. I want this button to post, in the title/description, either "Yes" or "No", depending on the verdict that day.
I have set up the OG metadata dynamically, using PHP to echo the correct string into og:title etc. But Facebook caches the value, so someone sharing my page on Tuesday can easily end up posting the wrong content to Facebook.
I have confirmed this is the issue by using the Facebook Object Debugger. As soon as I force a refresh, all is well. I attempted to automate this using cURL, but it doesn't seem to work.
$ch = curl_init();
$timeout = 30;
curl_setopt($ch, CURLOPT_URL, "http://developers.facebook.com/tools/lint/?url={http://ispizzahalfprice.com}");
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
Am I missing some easy fix here? Or do I need to re-evaluate my website structure to achieve what I am looking for (e.g. use two separate pages)?
Here's the page in case it's useful: http://ispizzahalfprice.com
Using two separate URLs would be the safe bet. As you have observed, Facebook caches URL scrapes quite heavily. You've also seen that you, as the admin of the app, can flush and refresh Facebook's cache by pulling the page through the debugger again.
Using two URLs would solve this issue because Facebook could cache the results all they want: there would still be a separate URL for "yes" and one for "no".
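A minimal sketch of the two-URL idea (the page names and the verdict helper are made up for illustration): the daily page points the Like button at whichever static URL matches today's verdict, and each of those pages carries fixed OG tags that Facebook can cache for as long as it likes.
// index.php (sketch): choose the share URL based on today's verdict.
$verdict = is_pizza_half_price_today(); // hypothetical helper returning true/false
$shareUrl = $verdict ? 'http://ispizzahalfprice.com/yes' : 'http://ispizzahalfprice.com/no';

// /yes and /no are static pages with unchanging OG metadata, e.g. on /yes:
// <meta property="og:title" content="Yes" />

// The Like button then targets the verdict-specific URL instead of the homepage:
echo '<div class="fb-like" data-href="' . htmlspecialchars($shareUrl) . '"></div>';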