I'm trying to record data from the Philippine Stock Exchange website. I found that they have an endpoint: http://www.pse.com.ph/stockMarket/companyInfo.html?method=fetchHeaderData&company=29&security=146
I can access it from any browser, except in incognito mode, where I am shown a page saying Access Denied that never stops loading. When I try to access it using PHP, I'm fairly sure the same thing as the latter is happening.
I'm trying to access it using PHP to no avail. Here are the attempts I tried:
file_get_contents
cURL with user agent
cURL with temporary cookies
Tried all in localhost and in live server.
Code:
$c = tempnam ("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.pse.com.ph/stockMarket/companyInfo.html");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_COOKIESESSION, true );
curl_setopt($ch, CURLOPT_COOKIEJAR, $c);
curl_setopt($ch, CURLOPT_COOKIEFILE, $c);
curl_setopt($ch, CURLOPT_POSTFIELDS, "method=fetchHeaderData&ajax=true&company=29&security=146");
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
var_dump(curl_exec($ch));
curl_close ($ch);
I don't have any clear idea of why and how this happens. Can someone explain why it happens and what the possible solutions are (PHP only if possible)?
I have reviewed other developers' approaches to this API (they all implemented it in Java), and it is just a simple POST request. I have not verified, though, whether their code still works. I can't post links to their repositories (limited).
SOLUTIONS:
Problem 1. Can't access API
$posts = array(
"method"=>"fetchHeaderData",
"ajax"=>"true",
"company"=>29,
"security"=>146
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.pse.com.ph/stockMarket/companyInfo.html");
curl_setopt($ch, CURLOPT_POSTFIELDS,$posts);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
var_dump(curl_exec($ch));
curl_close ($ch);
It seems I had two different problems. I can now access and use the API with the code above, with no other options needed. Turning the POST data into an array fixed the problem.
Problem 2. Access Denied
The Access Denied problem is cookie related. Answered below by @Wayne.
Unfortunately, I can't accept two answers.
Try this solution: convert your POST data into an array, then pass that array to CURLOPT_POSTFIELDS.
$posts = array(
"method"=>"fetchHeaderData",
"ajax"=>"true",
"company"=>29,
"security"=>146
);
$c = tempnam ("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.pse.com.ph/stockMarket/companyInfo.html");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_COOKIESESSION, true );
curl_setopt($ch, CURLOPT_COOKIEJAR, $c);
curl_setopt($ch, CURLOPT_COOKIEFILE, $c);
curl_setopt($ch, CURLOPT_POSTFIELDS,$posts);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
var_dump(curl_exec($ch));
curl_close ($ch);
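Why the array form helps is not stated above, so here is a minimal sketch of what changes. According to the PHP manual, passing an array to CURLOPT_POSTFIELDS sends the body as multipart/form-data, while passing a string sends it as application/x-www-form-urlencoded. The helper name post_fields is hypothetical, and which encoding the PSE endpoint actually expects is an assumption that has not been verified here.

```php
<?php
// Hypothetical helper: choose how the POST body is encoded.
// An array makes cURL send multipart/form-data; a string is sent
// as application/x-www-form-urlencoded.
function post_fields(array $fields, bool $multipart)
{
    return $multipart ? $fields : http_build_query($fields, '', '&');
}

$posts = [
    "method"   => "fetchHeaderData",
    "ajax"     => "true",
    "company"  => 29,
    "security" => 146,
];

// URL-encoded string, equivalent to the hand-written one in the question.
echo post_fields($posts, false), "\n";
```

Either return value can be handed to curl_setopt($ch, CURLOPT_POSTFIELDS, ...), so it is easy to try both encodings against the endpoint.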
It is because they have their server setup to stop you from doing that. They are securing the data with a cookie.
Cookie details
When you visit the site http://www.pse.com.ph/stockMarket/companyInfo.html it gives you a cookie, as it assumes you are a human visitor.
In your browser tools, enter
document.cookie
to see your cookie. The server will give you the data because you have the cookie.
Remove the cookie
document.cookie = "JSESSIONID=; expires=Thu, 01 Jan 1970 00:00:00 UTC; path=/;";
and visit
http://www.pse.com.ph/stockMarket/companyInfo.html?method=fetchHeaderData&company=29&security=146
without first going to http://www.pse.com.ph/stockMarket/companyInfo.html to get a cookie, and you will get a 403 (Forbidden).
Also they do not have jsonp with a callback so an ajax request will violate the cross domain security. Requests for the JSON must be from pages that originate from their domain or an approved domain.
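The cookie behaviour described above can be sketched in PHP as two requests that share one cookie jar: the first visits the page to pick up the session cookie, the second asks for the data. This is only a sketch under assumptions: session_options and fetch_with_session are hypothetical names, and whether the site accepts this sequence from a script has not been verified.

```php
<?php
// Build the cURL options shared by both requests. The same file is used
// for reading and writing cookies.
function session_options(string $url, string $cookieJar): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR      => $cookieJar, // cookies are written here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieJar, // cookies are read from here; also enables the cookie engine
    ];
}

// Hypothetical helper: visit $pageUrl first to obtain the session cookie,
// then request $dataUrl on the same handle so the cookie is sent back.
function fetch_with_session(string $pageUrl, string $dataUrl, string $cookieJar)
{
    $ch = curl_init();
    curl_setopt_array($ch, session_options($pageUrl, $cookieJar));
    curl_exec($ch);                          // body ignored; only the cookie matters
    curl_setopt($ch, CURLOPT_URL, $dataUrl);
    $body = curl_exec($ch);                  // string on success, false on failure
    curl_close($ch);
    return $body;
}
```

A call would pass the companyInfo.html page as $pageUrl, the fetchHeaderData URL as $dataUrl, and a path from tempnam() as $cookieJar.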
Why would they do that?
Likely their licence to the information does not allow them to give it to other websites, or they need/want to get paid to provide the information to other websites. Or they have terms of use for the information.
Where can you get the data ... data wants to be free
I don't see anyplace on their site http://www.pse.com.ph where they have API information and how to request permission to access it.
ProgrammableWeb has been the number one source for finding APIs; they have 96 stock APIs listed ... Obviously I cannot just copy their data and paste it here, but one of these APIs may work for you.
Related
I am trying to retrieve information from an external website using cURL, but the website returns a blank page.
I took a closer look at the network tab in Chrome and I think I found the problem, but I have no idea how to fix it. As seen in the image below, the browser posts to a specific URL and is then redirected to another one showing the final result.
This is the code I have right now:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"https://www.politie.nl/aangifte-of-melding-doen/controleer-handelspartij.html?_hn:type=action&_hn:ref=r199_r1_r1_r1");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,"url=&query=test");
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/x-www-form-urlencoded'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec ($ch);
curl_close ($ch);
The website is in Dutch, but what I am trying to do is check a certain email, phone number or bank account number to see if they have been involved in any scams, so I would like to have the information that a user gets after submitting the form on the website.
The form is on this website: https://www.politie.nl/aangifte-of-melding-doen/controleer-handelspartij.html
I hope someone can help me and thank you for your time.
As was pointed out in one of the comments to your question, a redirect occurs after the form is submitted. But not only that - information transfer between the form submit request and the request after redirect happens through a session, with session id stored in a cookie, so in order to get the results you have to enable cookies, too.
// follow redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// store and send cookies
$tmpfname = dirname(__FILE__).'/cookie.txt';
curl_setopt($ch, CURLOPT_COOKIEJAR, $tmpfname);
curl_setopt($ch, CURLOPT_COOKIEFILE, $tmpfname);
I am using cURL to try to scrape an ASP site that is not on my server, with the following option to automatically follow redirects it comes across:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
but it is not properly following all redirects that the website sends back: it is putting several of the redirect URLs as relative to my server and my PHP script's path, instead of the website's server and the path that the website's pages should be relative to. Is there any way to set the base path or server path in cURL, so my script can properly follow the relative redirects it comes across when scraping through the other website?
For example: If I authenticate on their site and then try to access "https://www.theirserver.com/theirapp/mainForm/securepage.aspx" with my script at "https://www.myserver.com/php/myscript.php", then, under some circumstances, their website tries to redirect back to their login page, but this causes a big problem, because the redirect sends my cURL client to "https://www.myserver.com/php/mainForm/login.aspx", that is, '/mainForm/login.aspx' relative to my script on my server, instead of the correct "https://www.theirserver.com/theirapp/mainForm/login.aspx" relative to the site I am scraping on their server.
I would expect cURL's FOLLOWLOCATION option to properly follow relative redirects based on the "Location:" header of the web pages I am accessing, but it seems that it doesn't and can't. Since this seems to not work, preferably I want a way to tell cURL a base path for the server or for all relative redirects it sees, so I can just use FOLLOWLOCATION. If not, then I need to figure out some code that will do the same thing FOLLOWLOCATION does, but that can let me specify a base path to handle these relative URLs when it comes across them.
I see several similar questions about following relative paths with cURL, but none of the answers have any good suggestions for dealing with this problem, where I don't own the website's server and I don't know every single redirect that might come up. In fact, none of the answers I've seen for similar questions seem to even understand that a person might be trying to scrape an external website and would want any relative redirects they come across while scraping the site to just be relative to that site.
EDIT: Here is the code in question:
$urlLogin = "https://www.theirsite.com/theirApp/MainForm/login.aspx";
$urlSecuredPage = "https://www.theirsite.com/theirApp/ContentPages/content.aspx";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; yie8)");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// GET login page
$data=curl_exec($ch);
// Read ASP viewstate and eventvalidation fields
$viewstate = parseExtract($data,$regexViewstate, 1);
$eventval = parseExtract($data, $regexEventVal, 1);
//set POST data
$postData = '__EVENTTARGET='.$eventtarget
.'&__EVENTARGUMENT='.$eventargument
.'&__VIEWSTATE='.$viewstate
.'&__EVENTVALIDATION='.$eventval
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn;
// POST authentication
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/******************************************************************
GET secure page (This is where a redirect fails... when getting
the secure page, it redirects to /mainForm/login.aspx relative to my
script, instead of /mainForm/login.aspx on their site.
*****************************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
echo $data; // Page Not Found
You may be running into redirects that are JavaScript redirects.
To find out what is there:
This will give you additional info.
curl_setopt($ch, CURLOPT_FILETIME, true);
You should set fail on error:
curl_setopt($ch, CURLOPT_FAILONERROR,true);
You may also need to see all the Request and Response headers:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
The big thing you are missing is curl_getinfo($ch);
It has info on all the redirects and the headers.
You may want to turn off CURLOPT_FOLLOWLOCATION
and do each request individually. You can get the redirect location from curl_getinfo($ch, CURLINFO_REDIRECT_URL).
Or you can set CURLOPT_MAXREDIRS to the number of successful redirects, then do a separate curl request for the problem redirect location.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
When you get the response, if there is no curl error, get the response header:
$data = curl_exec($ch);
if (curl_errno($ch)){
// Transfer failed: report the cURL error
echo 'Retrieve Base Page Error: ' . curl_error($ch);
}
else {
// With CURLOPT_HEADER enabled, the response headers come first in $data
$skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
$responseHeader = substr($data,0,$skip);
$data= substr($data,$skip);
$info = curl_getinfo($ch);
$info = var_export($info,true);
// Print inside this branch so the variables are always defined
echo $responseHeader . $info . $data;
}
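If you do follow each hop yourself and read the raw Location header, a relative value has to be resolved against the URL of the response that carried it, not against your own script. Below is a minimal sketch of such a resolver. The name resolve_redirect is hypothetical, and the function covers only the common cases (absolute URL, scheme-relative, root-relative, directory-relative), not every corner of RFC 3986.

```php
<?php
// Hypothetical helper: resolve a Location header value against the URL
// of the response that returned it.
function resolve_redirect(string $base, string $location): string
{
    // Absolute URL: nothing to resolve.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }
    $b = parse_url($base);
    // Scheme-relative form ("//host/path"): reuse the base scheme.
    if (strncmp($location, '//', 2) === 0) {
        return $b['scheme'] . ':' . $location;
    }
    $origin = $b['scheme'] . '://' . $b['host']
            . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Set any query string or fragment aside so only the path is normalised.
    $suffix = '';
    $cut = strcspn($location, '?#');
    if ($cut < strlen($location)) {
        $suffix   = substr($location, $cut);
        $location = substr($location, 0, $cut);
    }

    if ($location === '') {
        $path = $basePath;                  // query only: keep the base path
    } elseif ($location[0] === '/') {
        $path = $location;                  // relative to the site root
    } else {
        // Relative to the directory of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $location;
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return $origin . implode('/', $out) . $suffix;
}

$base = 'https://www.theirserver.com/theirapp/ContentPages/content.aspx';
echo resolve_redirect($base, '../mainForm/login.aspx'), "\n";
echo resolve_redirect($base, '/mainForm/login.aspx'), "\n";
```

The resolved URL can then be set with CURLOPT_URL for the next request, which keeps every hop on the remote site regardless of where the script itself runs.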
A better way to scrape a webpage is to use two PHP packages together: Guzzle and DomCrawler.
I ran a lot of tests with this combination and came to the conclusion that it is the best choice.
Here, you will find an example for your implementation.
Let me know if you have any problems! ;)
I am trying to grab the meta data from a news article on the NY Times website, specifically http://www.nytimes.com/2014/06/25/us/politics/thad-cochran-chris-mcdaniel-mississippi-senate-primary.html
Whenever I try, however, I get redirects from the site because my "browser" does not accept cookies. I have enabled the curl options to save cookies and tried following the accepted answers in a few other StackOverflow questions (here, here, and here), and while the answers worked on those websites, they don't seem to work on the nytimes site.
My current php curl function looks like this:
function get_extra_meta_tags_curl($url) {
$ckfile = tempnam("/public_html/commentarium/", "cookies.txt");
$ch = curl_init($main_url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($ch);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
}
The problem appears to be that when I request the URL, nytimes.com checks whether the browser accepts cookies. It checks a couple of times before redirecting to the login page with a REFUSE_COOKIE_ERROR. Instead of posting the full redirect list here, you can see it on my test page here, along with the raw HTML that the final redirect returns and what my current get_extra_meta_tags_curl function returns under CURL test.
Thanks for any help!
You enable cookie auto-handling in the wrong manner. CURLOPT_COOKIEJAR only enables saving (storing) cookies; you also need to enable loading cookies and passing them with the request (via the CURLOPT_COOKIEFILE option). Otherwise cookie auto-handling won't work and you will experience the mentioned "Browser does not accept cookies" problem.
So you have to set both the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options to the same value ($ckfile) for each cURL request:
...
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
...
I've got an NTLM (Active Directory) based service, and I need to write a PHP application. Normally, users log in to the website with their Active Directory credentials, and that works fine.
What I want to do is let them type their credentials into a PHP-written site, which in the next step will use cURL to authenticate them to the Active Directory based site where they normally log in.
And this part is hard. I then need to keep the sessions of users who authenticated to the Active Directory based site through the PHP cURL script, in order to use them again later (a CRON job queries the site to determine whether it has changed and automatically performs some operations when this happens, which the user normally has to do manually).
In order NOT to store their credentials for authenticating again when this change happens, I somehow need to store, in the PHP cURL site, the NTLM session of every user who authenticated to that site through it.
My question is: Is that even possible?
Thanks in advance.
@Willem Mulder
The code you've posted actually does cookie storing, but that is not my point because I've already done that (sorry for not writing it before). What I've got so far is:
$cookie_file_path = dirname(__FILE__) . '/cookies.txt';
$ch = curl_init();
//==========================================================================
curl_setopt($ch, CURLOPT_USERPWD, $username. ':' . $password);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
curl_setopt($ch, CURLOPT_FAILONERROR, 0);
curl_setopt($ch, CURLOPT_MAXREDIRS, 100);
//==========================================================================
$ret = curl_exec($ch);
With the CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR options, cURL stores cookies in the local file "cookies.txt". The problem is that when I comment out the CURLOPT_USERPWD option (after authenticating and storing the cookie, so theoretically I have a session), I cannot authorize to the website. Perhaps it restarts the NTLM handshake and expects a username and password, which I don't want to store.
I want to store only the session info, so I can present it to the service and skip the second authentication, but cURL does not seem to take this data from the cookie file, and REWRITES it with irrelevant data that the service sends me in response to the NOT AUTHORISED access request.
Well, yes you could
$ch = curl_init('http://www.google.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Get headers too with this line
curl_setopt($ch, CURLOPT_HEADER, 1);
$result = curl_exec($ch);
// Get cookie
preg_match('/^Set-Cookie:\s*([^;]*)/mi', $result, $m);
var_dump(parse_url($m[1]));
// And then of course store it somewhere :-)
As seen here how to get the cookies from a php curl into a variable
So a little trivia first..
There is a website written in ASP.NET that uses the NTLM protocol to authenticate users who want to log in. It works perfectly well when they use it normally: they type in the website URL, provide their credentials, authenticate, and maintain a session in the web browser.
What I want to do is create a PHP website that will act as a bot. It is my company's internal website and I am approved to do so. The problem I have run into is managing the session. Users will be able to type their credentials into my PHP website, and my PHP website will authenticate them to the target site using cURL.
The code I got so far is:
$cookie_file_path = dirname(__FILE__) . '/cookies.txt';
$ch = curl_init();
//==============================================================
curl_setopt($ch, CURLOPT_USERPWD, $username. ':' . $password);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
curl_setopt($ch, CURLOPT_FAILONERROR, 0);
curl_setopt($ch, CURLOPT_MAXREDIRS, 100);
//=============================================================
$ret = curl_exec($ch);
The above code logs in to the target website via cURL (which handles the NTLM handshake, it seems) and fetches the website's content. It also stores the session ID that is sent back in the cookie file.
What I'm trying to do next is comment out the CURLOPT_USERPWD option, in the hope that the script will use the session ID stored in the cookie file to authenticate the previously logged-in user on its second execution. That way it could get rid of the user credentials and not store them anywhere, because it is not safe to store them in a manually created session, a database, or anywhere else.
I need this because the bot will use CRON to periodically check whether the website status has changed and perform some user actions in reaction to this. But to do this, the user must first be authenticated, and his username and password must not be stored anywhere, so I have to use the session information established when he initially logged in.
cURL seems to NOT DO THIS. When I execute the script a second time with the CURLOPT_USERPWD option commented out, it does not use the stored cookie to stay authenticated. Instead, it REWRITES the cookie file with irrelevant data that the service sends me in response to the NOT AUTHORISED access request.
My questions are:
Why doesn't cURL use the stored session information to stay authenticated?
Is there any way to maintain this session with cURL and an NTLM protocol based website?
Thanks in advance.
A few months ago I had a problem similar to yours. I tried to connect to a Navision SOAP API. Navision uses NTLM authentication. The problem is that curl doesn't natively support NTLM, so you have to do it yourself.
A blog post that helped me a lot in this situation was the following:
http://rabaix.net/en/articles/2008/03/13/using-soap-php-with-ntlm-authentication
** Edit
Sorry, I misread your question.
Your problem is simple.
Just receive the headers along with the response by setting these options:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
You can then get the Set-Cookie header from the result of the curl_exec function.
preg_match('/^Set-Cookie:\s*([^;]*)/mi', $ret, $match);
$cookie = $match[1]; // "name=value" of the first cookie
Now you can store it somewhere and use it on the second request.
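A response can carry more than one Set-Cookie header, so here is a minimal sketch that collects all of them and builds the Cookie header for the next request. The function name extract_cookies and the sample header text are made up for illustration. Note also that NTLM authenticates the connection itself, so whether replaying a session cookie is enough for the target site is an assumption that would need testing.

```php
<?php
// Hypothetical helper: collect every cookie from raw response headers
// (as returned by curl_exec when CURLOPT_HEADER is enabled).
function extract_cookies(string $rawHeaders): array
{
    $cookies = [];
    $pattern = '/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/mi';
    if (preg_match_all($pattern, $rawHeaders, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $cookies[$m[1]] = $m[2];   // name => value, attributes dropped
        }
    }
    return $cookies;
}

// Made-up sample headers, only for demonstration.
$sample = "HTTP/1.1 200 OK\r\n"
        . "Set-Cookie: ASP.NET_SessionId=abc123; path=/; HttpOnly\r\n"
        . "Set-Cookie: lang=en\r\n\r\n";

$pairs = [];
foreach (extract_cookies($sample) as $name => $value) {
    $pairs[] = $name . '=' . $value;
}
// Header line that could be passed via CURLOPT_HTTPHEADER on the next request.
$cookieHeader = 'Cookie: ' . implode('; ', $pairs);
echo $cookieHeader, "\n";
```

Keeping the cookie names as plain array keys avoids parse_str, which would rewrite the dot in a name such as ASP.NET_SessionId.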
I had the same problem and solved it with the line curl_setopt($ch, CURLOPT_COOKIEFILE, ""); The string must be exactly empty.