Webpage detecting / displaying different content for curl request - Why? - php

I need to retrieve and parse the text of public domain books, such as those found on gutenberg.org, with PHP.
For most webpages I can use a cURL request to retrieve the HTML exactly as I would find it by navigating to the URL in a browser.
Unfortunately on some pages, most importantly gutenberg.org pages, the websites display different content or send a redirect header.
For example, when attempting to load this target page on gutenberg.org, the cURL request gets redirected to a different but logically related gutenberg.org page. I can successfully visit the target page in my browser with both cookies and JavaScript turned off.
Why is the curl request being redirected while a regular browser request to the same site is not?
Here is the code I use to retrieve the webpage:
$urlToScan = "http://www.gutenberg.org/cache/epub/34175/pg34175.txt";
if (!isset($userAgent)) {
    $userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36";
}
$ch = curl_init();
$timeout = 15;
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
#curl_setopt($ch, CURLOPT_HEADER, 1); // return HTTP headers with response
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_URL, $urlToScan);
$html = curl_exec($ch);
curl_close($ch);
if ($html == null) {
    return false;
}
print $html;

The hint is probably in the URL: it says "welcome stranger". They redirect every first-time visitor to that page. Once you have visited it, they will not redirect you anymore.
They don't seem to save a lot of stuff in your browser, but they do set a cookie with a session ID. This is the most logical thing really: check whether there is a session.
What you need to do is connect with curl AND a cookie. You could reuse your browser's cookie for this, but in case it expires you'd be better off doing:

1. Request the page.
2. If the page redirects, save the cookie (you now have a session).
3. Request the page again with that cookie.

If all goes well, the second request will not redirect, until the cookie/session expires, and then you start again. See the manual to learn how to work with cookies/cookie jars.
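Those three steps can be sketched with curl's cookie jar like this; the helper function name is my own, and the URL is the one from the question (the actual network calls are left commented out):

```php
<?php
// Sketch: fetch a URL while persisting cookies across requests, so the
// session cookie set during the first (redirected) visit is sent back
// on the second request.
function fetchWithSession($url, $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,        // follow the "welcome stranger" redirect
        CURLOPT_COOKIEJAR      => $cookieFile, // write received cookies on close
        CURLOPT_COOKIEFILE     => $cookieFile, // send stored cookies with the request
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$cookieFile = tempnam(sys_get_temp_dir(), 'pgcookies');
// First request: redirected, but the session cookie is saved.
// fetchWithSession('http://www.gutenberg.org/cache/epub/34175/pg34175.txt', $cookieFile);
// Second request: the cookie is sent, so the real page should come back.
// $html = fetchWithSession('http://www.gutenberg.org/cache/epub/34175/pg34175.txt', $cookieFile);
```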

The reason one could navigate to the target page in a browser without cookies or JavaScript, yet not with curl, is that the website checks the referrer in the request headers. The page can be loaded without cookies by setting the appropriate referrer header:
curl_setopt($ch, CURLOPT_REFERER, "http://www.gutenberg.org/ebooks/34175?msg=welcome_stranger");
As pointed out by @madshvero, the page can also, surprisingly, be loaded by simply excluding the user agent.

Related

cURL issue with Google consent redirect

I'm running into an issue with cURL while getting customer review data from Google (without the API). Previously my cURL request was working just fine, but it seems Google now redirects all requests to a cookie consent page.
Below you'll find my current code:
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);
$result now just prints "302 Moved. The document has moved here."
I also tried setting curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0); but that didn't help either.
Does anyone have an idea on how to overcome this? Can I programmatically deny (or accept) Google's cookies somehow? Or maybe there is a better way of handling this?
What you need is the following:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
The above curl option is what tells curl to follow redirects. However, I am not sure how useful the result will be for the specific URL you are trying to fetch. By adding the above option you will obtain the HTML source of the final page Google redirects to, but that page contains scripts that, when executed, load the map and the other content that is ultimately displayed in your browser. So if you need data that is subsequently loaded by JavaScript, you will not find it in the returned result. Instead, look into using a tool like Selenium with PHP (you might take a look at this post).
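As a sketch, here is the redirect-following version, plus a check of where curl actually ended up; the timeout values and the use of CURLINFO_EFFECTIVE_URL are my own additions:

```php
<?php
// Sketch: follow redirects, then inspect the final URL. If it is Google's
// consent page, the review data will not be in the returned HTML.
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true, // follow the 302 instead of printing it
    CURLOPT_MAXREDIRS      => 10,   // avoid endless redirect loops
    CURLOPT_CONNECTTIMEOUT => 5,
    CURLOPT_TIMEOUT        => 15,
]);
$result   = curl_exec($ch);                               // false on failure
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);    // URL after all redirects
curl_close($ch);
```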

LogOn to remote protected site using PHP and cURL

I am trying to log on to my company's intranet which is protected by an RSA token. I managed to find out all the necessary data for the log on and it works using this code.
<?php
//init curl
$ch = curl_init();
//Set the URL to work with
curl_setopt($ch, CURLOPT_URL, $loginUrl);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5");
// ENABLE HTTP POST
curl_setopt($ch, CURLOPT_POST, 1);
//Set the post parameters
curl_setopt($ch, CURLOPT_POSTFIELDS, $var);
//Handle cookies for the login
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
//Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
//not to print out the results of its query.
//Instead, it will return the results as a string return value
//from curl_exec() instead of the usual true/false.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
//execute the request (the login)
$store = curl_exec($ch);
?>
After the logon I am logged out instantly. There is a maximum of 2 hours set for a session. How is that normally set? Where would I find this information in the original site's code? I guess it is stored in a cookie? What do I have to do to avoid being logged out right after logging in?
Best regards,
Michael
I believe it's due to missing session and cookie information. Even though the login succeeds, on the very next request the server will check the cookie value generated at login against its session value.
If they don't match, you will be logged out.
I don't think this code will work for your purpose as-is: it stores cookies with CURLOPT_COOKIEJAR, but the saved cookies are never sent back on later requests because CURLOPT_COOKIEFILE is not set.

How to use token for login without refresh curl

I'm trying to log in to a website remotely, let's say example.com/login. That page uses a request token for login, so I am first fetching the request token from the URL like this:
// code for getting token cookies etc
$url = 'http://example.com/login/';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$doc = curl_exec($ch);
curl_close($ch);
// extract __RequestVerificationToken input field
preg_match('#<input name="__RequestVerificationToken" type="hidden" value="(.*?)"#is', $doc, $match);
$token = $match[1];
// code for redirect to dashboard
$postinfo = "Email=".$username."&Password=".$password."&__RequestVerificationToken=".$token;
// var_dump($token); //debug info
$useragent="Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36";
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postinfo);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
$html = curl_exec($ch);
echo $html;
if (curl_errno($ch)) print curl_error($ch);
curl_close($ch);
So, I am getting the token, but when I try to log in with the next cURL request, $token has obviously changed, because a fresh token is generated every time the login page is fetched. How can I log in to example.com/login with the same cURL script so that $token stays valid?
TIA!
First off, a proper DOM parser is much more reliable than a regex for extracting the token, so use that:
$dom = new DOMDocument();
@$dom->loadHTML($doc); // @ silences warnings about imperfect real-world HTML
$token = (new DOMXPath($dom))->query("//input[@name='__RequestVerificationToken']")->item(0)->getAttribute("value");
Now, the token DEFINITELY changes for each new cookie session, POSSIBLY changes for each failed login attempt, and POSSIBLY changes for each still-not-logged-in page refresh.
When you first get the token, you are also assigned a session cookie ID. To "log in with the correct token", you must send that same session cookie with the login request. The easiest way to do this is to let curl handle cookies automatically with CURLOPT_COOKIEFILE (you don't need a dedicated file for the cookies; just set an empty string and curl will take care of them for you). With that enabled, curl automatically sends the session cookie with the next login request.
And a pro tip: whenever you're debugging curl code, enable CURLOPT_VERBOSE; it gives lots of useful information (like showing you all the cookies it received).
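Here is a small, self-contained illustration of that DOM-based extraction; the HTML snippet and token value are made up for the example:

```php
<?php
// Sketch: extract the hidden __RequestVerificationToken with a DOM parser
// instead of a regex. $doc stands in for the fetched login page.
$doc = '<form><input name="__RequestVerificationToken" type="hidden" value="abc123"></form>';

$dom = new DOMDocument();
@$dom->loadHTML($doc); // @ silences warnings about imperfect real-world HTML
$xpath = new DOMXPath($dom);
$node  = $xpath->query("//input[@name='__RequestVerificationToken']")->item(0);
$token = $node ? $node->getAttribute('value') : null;
// $token is now "abc123"
```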

PHP curl web crawling fail suddenly

I could crawl a newspaper website successfully before, but it fails today.
I can still access the site using Firefox; it only happens with cURL. That means my IP is allowed and has not been banned.
Here is the error shown by the web
Please enable cookies.
Error 1010 Ray ID: 1a17d04d7c4f8888
Access denied
What happened?
The owner of this website (www1.hkej.com) has banned your access based
on your browser's signature (1a17d04d7c4f8888-ua45).
CloudFlare Ray ID: 1a17d04d7c4f8888 • Your IP: 2xx.1x.1xx.2xx •
Performance & security by CloudFlare
Here is my code which work before:
$cookieMain = "cookieHKEJ.txt"; // need 2 different cookie files since curl overwrites the old one; files are stored under the Apache folder
$cookieMobile = "cookieMobile.txt"; // same reason as above
$agent = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";
// submit a login
function cLogin($url, $post, $agent, $cookiefile, $referer) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects automatically (use CURLOPT_MAXREDIRS to cap them)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Get returned value as string (don’t put to screen)
curl_setopt($ch, CURLOPT_USERAGENT, $agent); // Spoof the user-agent to be the browser that the user is on (and accessing the php script)
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); // Use cookie.txt for STORING cookies
curl_setopt($ch, CURLOPT_POST, true); // Tell curl that we are posting data
curl_setopt($ch, CURLOPT_POSTFIELDS, $post); // Post the data in the array above
curl_setopt($ch, CURLOPT_REFERER, $referer);
$output = curl_exec($ch); // execute
curl_close($ch);
return $output;
}
$input = cDisplay("http://www1.hkej.com/dailynews/toc", $agent, $cookieMain);
echo $input;
How can I use cURL to successfully pretend to be a browser? Did I miss some parameters?
As I said in the post, I can access the site with Firefox, so my IP is not banned.
At last, I got it to work after I changed the code from
$agent = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";
to
$agent = $_SERVER['HTTP_USER_AGENT'];
In hindsight this makes sense: CURLOPT_USERAGENT expects only the header value, so including the literal "User-Agent: " prefix produces a malformed header that Cloudflare apparently started rejecting.
Thanks all anyway.
The site owners have used Cloudflare's security features to prevent you from crawling their website; more than likely you got flagged as a malicious bot. They will have done this based on your user agent and IP address.
Try changing your IP (if you are a home user, try rebooting your router; you will sometimes get a different IP address), try using a proxy, and try sending different headers with cURL.
More importantly, they do not want people crawling their site and affecting their traffic. You should really ask permission for this.
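For the "different headers" suggestion, here is a sketch of a more browser-like request. Note that the User-Agent value deliberately omits the "User-Agent: " prefix, which is what tripped up the original code; the extra header values are just plausible examples:

```php
<?php
// Sketch: browser-like request headers. Cloudflare fingerprints more than
// the User-Agent, so this may or may not be enough on its own.
$headers = [
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.5',
    'Connection: keep-alive',
];
$ch = curl_init('http://www1.hkej.com/dailynews/toc');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    // Value only -- curl adds the "User-Agent: " header name itself.
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0',
    CURLOPT_HTTPHEADER     => $headers,
]);
// $output = curl_exec($ch);
curl_close($ch);
```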

PHP cURL redirects to localhost

I'm trying to login to an external webpage using a php script with cURL. I'm new to cURL, so I feel like I'm missing a lot of pieces. I found a few examples and modified them to allow access to https pages. Ultimately, my goal is to be able to login to the page and download a .csv by following a specified link once logged in. So far, what I have is a script that tests logging in to the page; the script is shown below:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.websiteurl.com/login');
curl_setopt($ch, CURLOPT_POSTFIELDS,'Email='.urlencode($login_email).'&Password='.urlencode($login_pass).'&submit=1');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
curl_setopt($ch, CURLOPT_REFERER, "https://www.websiteurl.com/login");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$output = curl_exec($ch);
I have a few questions. First, is there a reason this does not redirect on its own? The only way for me to view the contents of the page is to
echo $output
even though CURLOPT_RETURNTRANSFER and CURLOPT_FOLLOWLOCATION are both set to True.
Second, the URL for the page stays at "localhost/folderName/test.php" instead of directing to the actual website. Can anyone explain why this happens? Because the script doesn't actually redirect to a logged in webpage, I can't seem to do anything that I need to do.
Does my issue have to do with cookies? My cookie.txt file is in the same folder as my .php script (I'm using WampServer, btw). Should it be located elsewhere?
Once I'm able to fix these two issues, all I should need to do is follow the link that starts the download process for the .csv file.
Thanks for any help, much appreciated!
Answering part of your question:
From http://php.net/manual/en/function.curl-setopt.php :
CURLOPT_RETURNTRANSFER TRUE to return the transfer as a string of the
return value of curl_exec() instead of outputting it out directly.
In other words, it is doing exactly what you described: returning the response as a string, which you then echo to see it. As requested...
----- EDIT-----
As for the second part of your question - when I change the last three lines of the script to
$output = curl_exec($ch);
header('Location:'.$website);
echo $output;
The address of the page as displayed changes to $website - which in my case is the variable I use to store my equivalent of your 'https://www.websiteurl.com/login'
I am not sure that is what you wanted to do - because I'm not sure I understand what your next steps are. If you were getting redirected by the login site, wouldn't the new address be part of the header that is returned? And wouldn't you need to extract that address in order to perform the next request (wget or whatever) in order to download the file you wanted to get?
To do so, you need to set CURLOPT_HEADER to TRUE.
You can get the URL where you ended up from
$last_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
(see cURL , get redirect url to a variable ).
The same link also has a useful script for completely parsing the header information (returned when CURLOPT_HEADER == true); it's in the answer by nico limpica.
Bottom line: CURL gets the information that your browser would have received if you had pointed it to a particular site; that doesn't mean your browser behaves as though you pointed it to that site...
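Here is a sketch of that header/body split, using the header size curl reports; the sample response string below is made up so the cut can be shown without a live request:

```php
<?php
// Sketch: with CURLOPT_HEADER enabled, the raw response contains the
// headers followed by the body; CURLINFO_HEADER_SIZE tells where to cut.
function splitResponse($response, $headerSize)
{
    return [
        'headers' => substr($response, 0, $headerSize),
        'body'    => substr($response, $headerSize),
    ];
}

// After a real transfer you would get the size from the handle:
//   $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$head  = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$raw   = $head . "<html></html>";
$parts = splitResponse($raw, strlen($head));
// $parts['body'] is "<html></html>"
```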
