I am trying to grab the meta data from a news article on the NY Times website, specifically http://www.nytimes.com/2014/06/25/us/politics/thad-cochran-chris-mcdaniel-mississippi-senate-primary.html
Whenever I try however I am getting redirects from the sight because my "browser" does not accept cookies. I have enabled the curl options to save cookies and tried following the accepted answers in a few other StackOverflow questions (here, here, and here) and while the answer worked on those websites it doesn't seem to work on the nytimes site.
My current php curl function looks like this:
function get_extra_meta_tags_curl($url) {
$ckfile = tempnam("/public_html/commentarium/", "cookies.txt");
$ch = curl_init($main_url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($ch);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
}
The problem appears to be that when I request the URL, nytimes.com checks if the browser accepts cookies. I checks a couple of times before redirecting to the login page with a REFUSE_COOKIE_ERROR. Instead of posting the full redirect list here you can see it on my test page here along with the raw html that the final redirect returns and what my current get_extra_meta_tags_curl function is returning under CURL test
Thanks for any help!
You enable cookies auto-handling in wrong manner. CURLOPT_COOKIEJAR only enables cookies saving (storing), but you need also enable cookies loading and passing them with request (by CURLOPT_COOKIEFILE option). Otherwise cookies auto-handling won't work and you will experienced mentioned "Browser does not accept cookies" problem.
So you have to set both CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options to the same value ($ckfile) at each CURL request:
...
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
...
Related
I'm trying to record data from Philippine Stock Exchange website. I have found that they have an endpoint which is http://www.pse.com.ph/stockMarket/companyInfo.html?method=fetchHeaderData&company=29&security=146
I can clearly access it using any browsers except when I go into incognito mode where I'm being shown with a content saying Access Denied and it never stops loading. When I try to access it using PHP I'm quite sure that what is happening is the same as the later.
I'm trying to access it using PHP to no avail, here are the attempts I tried:
file_get_contents
cURL with user agent
cURL with temporary cookies
Tried all in localhost and in live server.
Code:
$c = tempnam ("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.pse.com.ph/stockMarket/companyInfo.html");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_COOKIESESSION, true );
curl_setopt($ch, CURLOPT_COOKIEJAR, $c);
curl_setopt($ch, CURLOPT_COOKIEFILE, $c);
curl_setopt($ch, CURLOPT_POSTFIELDS, "method=fetchHeaderData&ajax=true&company=29&security=146");
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
var_dump(curl_exec($ch));
curl_close ($ch);
I don't have any clear idea on why and how does this happen. Can someone explain to me why it happens and what are the possible solutions (PHP only if possible)
I have reviewed other developer's approach on this API (They all implemented it using Java) and it is just a simple POST request and it is done. I have not verified though if their code is still working. I can't post links to their repository (limited).
SOLUTIONS:
Problem 1. Can't access API
$posts = array(
"method"=>"fetchHeaderData",
"ajax"=>"true",
"company"=>29,
"security"=>146
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.pse.com.ph/stockMarket/companyInfo.html");
curl_setopt($ch, CURLOPT_POSTFIELDS,$posts);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
var_dump(curl_exec($ch));
curl_close ($ch);
It seems I have two different problems. I can now access and use the API using the code above. No need for other options. Turning the post data into array fixed the problem.
Problem 2. Access Denied
On the problem about the Access Denied, it is cookie related. Answered below by #Wayne.
Unfortunately, I can't accept two answers.
Try this solution. convert your post data in array then pass this array in CURLOPT_POSTFIELDS
$posts = array(
"method"=>"fetchHeaderData",
"ajax"=>"true",
"company"=>29,
"security"=>146
);
$c = tempnam ("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.pse.com.ph/stockMarket/companyInfo.html");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_COOKIESESSION, true );
curl_setopt($ch, CURLOPT_COOKIEJAR, $c);
curl_setopt($ch, CURLOPT_COOKIEFILE, $c);
curl_setopt($ch, CURLOPT_POSTFIELDS,$posts);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
var_dump(curl_exec($ch));
curl_close ($ch);
It is because they have their server setup to stop you from doing that. They are securing the data with a cookie.
Cookie details
When you visit the site http://www.pse.com.ph/stockMarket/companyInfo.html it gives you a cookie as it knows you are a human visitor.
In your browser tools enter
document.cookie
to see your cookie. It will provide you an individual the data because you have the cookie.
Remove the cookie
document.cookie = "JSESSIONID=; expires=Thu, 01 Jan 1970 00:00:00 UTC; path=/;";
and visit
http://www.pse.com.ph/stockMarket/companyInfo.html?method=fetchHeaderData&company=29&security=146
without going to get a cookie http://www.pse.com.ph/stockMarket/companyInfo.html first you will get the 403 (Forbidden)
Also they do not have jsonp with a callback so an ajax request will violate the cross domain security. Requests for the JSON must be from pages that originate from their domain or an approved domain.
Why would they do that.
Likely their licence to the information does not allow them to give it to other websites, or they need/want to get paid to provide the information to other websites. Or they have terms of use for the information.
Where can you get the data ... data wants to be free
I don't see anyplace on their site http://www.pse.com.ph where they have API information and how to request permission to access it.
Programable web has been the number one source for finding APIs, they have 96 stock APIs listed ... Obviously I can not just copy their data and past it here, but one of these API may work for you?
Im trying to set a cookie through PHP CURL for more than twenty four hour for no avail.
Before i have been setting cookies in my browser by adding them as parameters in a url as shown below
http://localhost/setc.php?userid=123&panelid=1
but now i need to set the cookie when i run a script(setcookie.php)
below is the latest of various types of code that i tried.
setcookie.php
$c = curl_init('http://localhost/atst.php?userid=628929&panelid=1');
curl_setopt($c, CURLOPT_VERBOSE, 1);
curl_setopt($c, CURLOPT_COOKIE, 'userid=123; panelid=1');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);
it still does not create the cookie, can anybody help out
P.S : if you guys too cant figure this out at least give me a hint/guide on how to set a simple cookie without any complications
The cookiejar is only saved when you close the curl handle using curl_close($ch).
From the manual:
CURLOPT_COOKIEFILE The name of the file containing the cookie data. The cookie file can be in Netscape format, or just plain HTTP-style headers dumped into a file. If the name is an empty string, no cookies are loaded, but cookie handling is still enabled.
CURLOPT_COOKIEJAR The name of a file to save all internal cookies to when the handle is closed, e.g. after a call to curl_close.
http://www.php.net/manual/en/function.curl-setopt.php
$ckfile = tempnam ("/tmp/", "CURLCOOKIE");
$BASEURL='http://localhost/openx/www/api/json/index.php/main/authenticate/';
$POSTFIELDS='username='.$username.'&password='.$password.'';
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/cookieFileName");
curl_setopt($ch, CURLOPT_URL,'http://localhost/openx/www/api/json/index.php/main/authenticate/');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $POSTFIELDS);
ob_start(); // prevent any output
$result=curl_exec ($ch); // execute the curl command
ob_end_clean(); // stop preventing output
$result = curl_exec($ch);
curl_close($ch);
I've done scraping for lots of sites but one in particular isn't saving it's cookies to my cookie file. Any ideas?
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_TIMEOUT,8200);
curl_setopt($ch,CURLOPT_TIMEOUT_MS,8200);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT ,8200);
$cookie_file = "cookies/zapper.txt";
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
if ($fields) {curl_setopt($ch,CURLOPT_POST, count($fields)); }
if ($fields) {curl_setopt($ch,CURLOPT_POSTFIELDS, $fields_string); }
This is the first site that I've done that doesn't respond to my cookie saves. All others use the same code and work perfectly. I've even emulated the post of their forms and faked the header in case it was checking [those.
The site I'm trying to mimic an add to cart process for is http://zapper.co.uk/
Read a possible solution directly from the php.net site about curl_setopt. It's a workaround to get Cookie content from the header output. Seems to be a cool alternative.
Also, you can get surprising results modifiying some of your rules at curl_setop. Sometimes we use more options than needed.
I also recommend you to echo the whole $ch content (It will print page like the browser does). Sometimes you get a detailed error not present at headers seeing the live result content.
I've got NTLM (Active Directory) based service, and I need to write a PHP application. Normally, users are logging in to website with Activre Directory credentials, and it's ok.
But what I want to do, is to let them type in their credentials to PHP-written site, which in next step will use cURL to authenticate users to that Active Directory based site where they normally log in.
And this part is hard. I need then to keep session of users that through PHP cURL script authenticated to Active Directory based site in order to use them again later
(CRON querying site to determine that it has changed and automatically do some operations when this happens, which normally user has do manually).
In order to NOT store their credentials to authenticate again when this change happens, I somehow need to store NTLM session in PHP cURL site to every user that authenticated to
that site through this PHP cURL site.
My question is: Is that even possible?
Thanks in advance.
#Willem Mulder
The code you've posted actually does cookie-storing, but that is not my point becouse I've already done that (sorry for not writing it before). What I got so far is:
$cookie_file_path = dirname(__FILE__) . '/cookies.txt';
$ch = curl_init();
//==========================================================================
curl_setopt($ch, CURLOPT_USERPWD, $username. ':' . $password);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
curl_setopt($ch, CURLOPT_FAILONERROR, 0);
curl_setopt($ch, CURLOPT_MAXREDIRS, 100);
//==========================================================================
$ret = curl_exec($ch);
By using options CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR, cURL does the cookie storing in local file "cookies.txt". The problem is, that when I comment CURLOPT_USERPWD option (after authenticating and storing cookie, so theoretically I have session), I cannot authorize to website. Perhaps it reinitializes NTLM Handshake authorisation and is expecting username and password, which I don't want to store.
I want to store session info only, to provide service this session info and omit second authentication, but cURL seems to not take this data from cookie file, and REWRITES it with not relevant data send to me from service as response to NOT AUTHRORISED access request.
Well, yes you could
$ch = curl_init('http://www.google.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Get headers too with this line
curl_setopt($ch, CURLOPT_HEADER, 1);
$result = curl_exec($ch);
// Get cookie
preg_match('/^Set-Cookie:\s*([^;]*)/mi', $result, $m);
var_dump(parse_url($m[1]));
// And then of course store it somewhere :-)
As seen here how to get the cookies from a php curl into a variable
I have been experimenting with curl for accessing the PayPal payment authorisation site using PHP.
e.g.
...
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $nvp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$res = curl_exec ($ch);
preg_match_all('/Set-Cookie: .*/', $res, $cookieMatches);
foreach ($cookieMatches[0] as $cookieMatch)
header($cookieMatch);
preg_match('/Location: .*/', $res, $locMatches);
header($locMatches[0]);
header('Vary: Accept-Encoding');
header('Strict-Transport-Security: max-age=500');
header('Transfer-Encoding: chunked');
header('Content-Type: text/html');
The principle being simply to reflect the original redirect (I am sure there is a simpler way to do this). However, the response from PayPal seems to indicate some kind of cookie error.
My hunch is that the cookie has been linked to the originating machine in some way. Can anyone confirm this, or am I just missing something obvious!
The CURL has built-in support for cookies (as you know). But it's been tricky. I haven't managed cookies to work until I declared option
curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
Third parameter is a name of the file storing cookies - preferably in temp folder. Maybe you should just try this approach.
With this the redirects work "automatically".
curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
//SAVE THE COOKIES
curl_setopt ($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
USE THE COOKIES
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1');
// Follow Where the location will take you, maybe you catch the issue.
Since it's working on browser it has to work using CURL, unless they are using javascript to set cookies.
even if they are using cookies depending on IP address, try to start the session from beginning using curl so they set your server ip address with generated cookies.