Set session to scrape page - PHP

URL1: https://duapp3.drexel.edu/webtms_du/
URL2: https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX
URL3: https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX
As a personal programming project, I want to scrape my University's course catalog and provide it as a RESTful API.
However, I'm running into the following issue.
The page that I need to scrape is URL3, but URL3 only returns meaningful information after I visit URL2 (the term is set there via Colleges.asp?Term=201125), and URL2 can only be visited after visiting URL1.
I tried monitoring the HTTP data going to and fro using Fiddler and I don't think they are using cookies. Closing the browser instantly resets everything, so I suspect they are using Session.
How can I scrape URL3? I tried programmatically visiting URLs 1 and 2 first, and then doing file_get_contents(url3), but that doesn't work (probably because it registers as three different sessions).

A session needs a mechanism to identify you as well. Popular methods include cookies and a session ID in the URL.
A curl -v on URL 1 reveals a session cookie is indeed being set.
Set-Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA; path=/
You need to send this cookie back to the server on any subsequent requests to keep your session alive.
If you want to use file_get_contents, you need to manually create a context for it with stream_context_create to include the cookie with the request.
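For example, a minimal sketch, assuming you grab the cookie value from the first response's Set-Cookie header (the value shown is the one from the curl -v output above):
$cookie = 'ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA'; // from the Set-Cookie header
$context = stream_context_create(array(
    'http' => array(
        'header' => "Cookie: $cookie\r\n"
    )
));
// The third argument makes file_get_contents send the cookie along
$html = file_get_contents(
    'https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX',
    false,
    $context
);
You would still need to hit URL1 and URL2 first (sending the same cookie) so the server-side session has the term set.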
An alternative (which I would personally prefer) is to use the cURL functions conveniently provided by PHP; cURL can even take care of the cookie traffic for you. But that's just my preference.
Edit:
Here's a working example to scrape the path in your question.
$scrape = array(
"https://duapp3.drexel.edu/webtms_du/",
"https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX",
"https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX"
);
$data = '';
$ch = curl_init();
// Set the cookie jar to a temporary file: setting CURLOPT_COOKIEJAR
// enables curl's cookie engine, so cookies received are kept and
// included in the subsequent requests on this handle
curl_setopt($ch, CURLOPT_COOKIEJAR, tempnam(sys_get_temp_dir(), 'curl'));
// We don't want direct output by curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Then run along the scrape path
foreach ($scrape as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
}
curl_close($ch);
echo $data;

Related

How do I use PHP file_get_contents or cURL with a log-in page that I'm already logged into?

If I go to a website, example.com/page1.php, it is a log-in page. When I log-in, it takes me to example.com/page2.php. If I close my browser and come back to page1 later, I’m still logged in and it automatically takes me to page2. That means there’s a cookie set and it knows I already logged in.
I want to use file_get_contents to get page2.php. When I try it, I get the contents of the log-in page instead. I assume that’s because file_get_contents doesn’t know a cookie is set and page2 is saying, you shouldn’t be here, you’re not logged in, so it bumps me back to page 1.
I realize I can use cURL to do the log-in, create a cookie, and get the contents, like this:
$url = 'https://www.example.com/page1.php'; // the url of the login page
$post_data = "userid=myusername&password=mypassword"; // The login data to post
$ch = curl_init(); // Create a curl object
curl_setopt($ch, CURLOPT_URL, $url ); // Set the URL
curl_setopt($ch, CURLOPT_POST, 1 ); // This is a POST query
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data); //Set the post data
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Get the content
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // Follow Location redirects
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt'); // Set cookie storing files
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
$output = curl_exec($ch); // Execute the action to login
My problem is, I don't want to log in again (for reasons I don't want to get into). Is there a way to let file_get_contents, cURL, or some other function know I'm previously logged in and get the contents of page2? Since example.com is setting a cookie, can I access that cookie somehow and use it to avoid logging in again?
Why it won't work:
If the website creates a security cookie to protect against XSS, you can't simply take one user's cookies and send requests from a different IP while using them.
Even if the website is not using a security hash, you can't access cookies belonging to a different domain, for security reasons (you don't want gmail.com to be able to access your microsoft.com cookies).
To cut it short, the only ways that could work are:
Use SSO (in partnership with the destination domain).
Use cross-domain support (in partnership with the destination domain).
Get access tokens (like Facebook does), if supported by the destination domain.
Ask your users to log in from your domain (by trusting you, which is bad) in order to let your site access the other domain's data.
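That said, if you can legitimately obtain the session cookie value yourself (for example, by copying it from your own browser on the same machine), cURL can replay it without logging in again. A minimal sketch; the cookie name PHPSESSID and its value are placeholders for whatever the site actually sets:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.example.com/page2.php');
// Replay a session cookie obtained elsewhere (placeholder value)
curl_setopt($ch, CURLOPT_COOKIE, 'PHPSESSID=your-session-id-here');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch); // contents of page2, if the session is still valid
curl_close($ch);
Note this only works from your server; as explained above, it does not let a page on a different domain use that cookie.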

Retrieve / send back HTTP headers with PHP / Curl

I have an HTML/PHP/JS page that I use for an automation process.
On load, it performs a cURL request like:
function get_data($url) {
    $curl = curl_init();
    $timeout = 5;
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;
}
$html = get_data($url);
Then it uses DOMDocument to retrieve a specific element on the remote page. My PHP code handles it, makes some operations, then stores it in a variable.
My purpose, as you can guess, is to simulate a "normal" connection. To do so, I used the Tamper tool to see which requests are performed when I physically interact with the remote page. The HTTP headers consist of a UA, cookies (among them, a session cookie), and so on. The only POST variable I have to send back is my PHP variable (you know, the one which was calculated and stored in a PHP var). I also tested the process with Chrome, which allows me to copy/paste requests as cURL.
My question is simple: is there a way to handle HTTP requests/cookies in a simple way? Or do I have to retrieve them, parse them, store them, and send them back one by one?
Indeed, a request and a response are slightly different, but in this case they share many things in common. So I wonder if there is a way to explore the remote page as a browser would do, and interact with it, using for instance an extra PHP library.
Or maybe I'm doing it the wrong way and should use another language (Perl...)?
The code shown above does not handle requests and cookies; I've tried, but it was a bit too tricky to handle, hence I ask this question here :) I'm not lazy, but I wonder if there is a simpler way to achieve my goal.
Thanks for your advice, and sorry for my English.
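One simple way to handle cookies without parsing them yourself is to let cURL's cookie engine do the bookkeeping: point CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE at the same file, and cookies are stored and replayed automatically across calls. A minimal sketch, reusing the get_data() shape from above (the user-agent string and file name are arbitrary examples):
function get_data($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    // Same file for both options: cookies received on one call
    // are saved and sent back on the next, no manual parsing needed
    curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');
    curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
    // Browser-like User-Agent (hypothetical string)
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/10.0');
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;
}
POST fields, including the computed PHP variable, can then be added per request with CURLOPT_POST and CURLOPT_POSTFIELDS.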

Need to scrape contents of website that requires an "i agree" cookie to be set

From everything I've read, it seems that this is impossible. But here is my scenario:
I need to scrape the contents of a table containing for-sale housing information. The page is not password protected or anything, but you first have to click an "I Agree" link on the previous page so that a cookie gets set saying you agree that the content may not be 100% accurate. Only then are you shown the data. Is there any way at all to accomplish this using PHP/jQuery/JavaScript? I know you cannot create an iframe because it is cross-domain. I also do not have access to this other website.
Thanks for any answers, as I'm not really expecting anything positive. :) And many thanks if you can tell me how to do this. :D
Use a server-side script (PHP using cURL) to crawl the website and return the information you need. Make sure you set the appropriate HTTP header with your request to represent the "I agree" cookie.
Sample:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/');
curl_setopt($ch, CURLOPT_COOKIE, 'I_Agree=1'); // example name/value; check the real cookie in your browser
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$responseBody = curl_exec($ch);
curl_close($ch);
// Read the information you need from $responseBody and return it as response body
?>
Now you can access the information from your website by calling your server side script above. For details about how to use cURL take a look at the documentation.
CURL can store or recall cookies from a file depending on the options you set. Here is the "cookiejar" example:
http://curl.haxx.se/libcurl/php/examples/cookiejar.html
Check out the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options.
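A minimal sketch of the jar in action (the URLs and file name are placeholders): the first handle visits the agreement page and persists its cookies on curl_close; the second handle replays them:
// First request: trigger the "I agree" cookie and persist it
$ch = curl_init('http://www.example.com/agree');
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt'); // cookies are written here on curl_close
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);
// Second request: send the stored cookies back
$ch = curl_init('http://www.example.com/housing-table');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // cookies are read from here
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);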

PHP: use logged-in cURL session on normal web browser

Good evening!
I have a script in PHP which makes a cURL call to a remote host's login page.
After logging in and keeping the session via the cookiejar and cookiefile options, I use the same cURL connection handle to log in to the immediately following page, which needs an upload.
When that's done, I have the full session parameters and I can call any page I want from the site, but only IN CURL!
The idea is that this script, which uses cURL, finally needs to redirect the user to one of those pages on the remote host using the cURL session, but this is not possible, because from cURL you cannot show the results as a redirected page.
So I've tried a lot of options. None of them works at all.
Schema:
PHP script on a local server.
Call to domain.com/login.php (creates cURL ch).
Keep the cURL session in a cookie.txt file.
Call to domain.com/login_2.php with the same ch (the previous one, not closed).
Fully logged in on the remote site.
Back to the PHP script. I need to redirect to domain.com/index.php, which needs session variables filled in by the full login process.
What to do then?
1) After fully logging in, read the cookie.txt file to get the PHPSESSID.
Then I tried to use setcookie(), or header("Set-Cookie: ..."), and immediately after, header("Location: domain.com/index.php").
Doesn't work.
2) Tried the same thing via an AJAX call and finally document.cookie = ...
Doesn't work.
3) Added a third cURL call to a file on my remote host which prints a JSON-encoded $_SESSION.
I fetch it in my PHP script, decode it, and load it into my local session value by value (foreach () ... $_SESSION[$c] = $v).
I added a session_start() before this foreach, and immediately after, a header("Location: domain.com/index.php").
Doesn't work.
4) Added a session_write_close() before the header("Location: domain.com/index.php").
Doesn't work.
So I don't really know how to reuse the cURL session.
I've tried manually setting the PHPSESSID via the Web Developer Firefox plugin, using the cURL-generated session ID I wrote down, and it works perfectly. So it should be possible to set it via scripting in my PHP script! But I can't!
Give me a hand, please!
Thanks!
I may have gotten lost a bit, but I think I understand.
You can use
CURLOPT_HEADER for some debugging (the output will contain the headers of the current/redirected page)
and CURLOPT_FOLLOWLOCATION, like so:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://domain.com/login.php');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
I also use
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
to return the response as a string, which is much more useful for debugging or parsing.
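As for attempt 1) in the question, the jar file cURL writes is in the Netscape cookie format: one cookie per line, seven tab-separated fields (domain, flag, path, secure, expiry, name, value). A sketch of pulling the session ID out of cookie.txt (the file name matches the question; the cookie name PHPSESSID is assumed):
$sessionId = null;
foreach (file('cookie.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $fields = explode("\t", $line);
    // Comment lines don't have 7 tab-separated fields; cookie lines do
    // (HttpOnly cookies carry a "#HttpOnly_" domain prefix but still parse)
    if (count($fields) === 7 && $fields[5] === 'PHPSESSID') {
        $sessionId = $fields[6];
    }
}
Keep in mind this only gives you the value on your server; it does not let you set a cookie for the remote domain in the visitor's browser, which is why the redirect attempts in the question fail.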

How can I get data after making a POST to an external HTTPS web page?

I need to make a POST in JSON format to an HTTPS web page on a remote server and receive an answer in JSON format.
The data to be sent to the remote server is taken from the URL bar <-- done in PHP.
My problem is sending this data and receiving the answer.
I tried doing it in PHP and in HTML, using cURL (PHP) and submit (HTML).
The results: in PHP I can't send anything.
In HTML I can submit the data and get an answer, but I can't catch it in my code.
I can see the answer using Wireshark; from what I see, the POST is made after a negotiation protocol, and as I said, I receive an answer (encrypted due to HTTPS, I think).
Now I need to receive that answer in my code to generate a URL link, so I'm considering using JavaScript.
I have never done anything like this before.
Any suggestions will be appreciated, thanks.
I'm using the following code, with no result but a 20-second delay and then a blank page.
<?php
$url = 'https://www.google.com/loc/json';
$body = '{"version":"1.1.0","cell_towers":[{"cell_id":"48","location_area_code":1158,"mobile_country_code":752,"mobile_network_code":7,"age":0,"signal_strength":-71,"timing_advance":2255}]}';
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_POST, true);
curl_setopt($c, CURLOPT_POSTFIELDS, $body);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_HTTPHEADER, array('Content-Type: application/json')); // the option is CURLOPT_HTTPHEADER and it takes an array
$page = curl_exec($c);
echo($page);
//print_r($page);
curl_close($c);
?>
New info
I just got some very important new info:
"The Gears Terms of Service prohibits direct use of the Google location server (http://www.google.com/loc/json) via HTTP requests. This service may only be accessed through the Geolocation API."
So I was going down the wrong path, and from now on I will start learning about Gears in order to use the Gears API.
Cheers!
There's no real reason PHP couldn't do the POST for you, if you set things up properly.
For instance, the server may require a cookie that it set in the client browser at some point, which your PHP/cURL request doesn't have.
To do proper debugging, use HttpFox or Firebug in Firefox, which monitor the requests from within the browser itself and can show the actual data, not the encrypted garbage that Wireshark would capture.
Of course, you could use the client browser as a sort of proxy for your server. Browser posts to the HTTPS server, gets a response, then sends that response to your server. But if that data is "important" and shouldn't be exposed, then the client-side solution is a bad one.
