I'm building a web bot to log in to some of my accounts on websites, but one of the URLs sets its cookie from JavaScript, and cURL is unable to store it. Any suggestions?
You could parse the JavaScript file using whatever language you're using and look for the document.cookie statement. You could then use this data to set the cookie manually in cURL (CURLOPT_COOKIE).
It wouldn't exactly be the best idea if you're hoping for this to work across a number of sites, but since you state that you know the site you'll need to load, it's a possibility, as you'll have an idea of how the JavaScript will look.
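A minimal sketch of that approach, assuming the cookie is set with a plain document.cookie = "name=value" assignment (the URLs and the regex are placeholders you'd adapt to the actual page):

<?php
// Fetch the page whose JavaScript sets the cookie.
$ch = curl_init('https://example.com/login'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Look for a document.cookie = "name=value" assignment in the markup.
// Deliberately simple; adapt the pattern to the site's actual script.
if (preg_match('/document\.cookie\s*=\s*["\']([^;"\']+)/', $html, $m)) {
    $cookie = $m[1]; // e.g. "session=abc123"

    // Replay the cookie manually on the next request.
    $ch = curl_init('https://example.com/account'); // placeholder URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    $page = curl_exec($ch);
    curl_close($ch);
}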
If you have months of free time on your hands, you could compile WebKit and its JavaScript engine and modify the cookie-setting functionality so that it exports the cookies to stdout (and then grab them with PHP's exec). Good luck with that, though. Considering you're asking this question in relation to cURL, I don't think this is quite up your alley...
I'd sort of go with Kewley's answer if you're desperate, though. You should be able to reverse-engineer the JavaScript and see the logic behind how the web application sets its cookies. If it authenticates and returns the login result with XHR, watch what's sent and received by the browser (with Firebug). Add breakpoints on the document.cookie lines and observe which cookies are being set (and what they're being set to). Once you know the precise logic behind authentication, you can perform the behind-the-scenes requests necessary to snag a session on the site with cURL.
cURL doesn't parse JavaScript, so those cookies won't ever be set.
It's rare, but I have to pay MS a compliment: ASP.NET WebMethod (AJAX) authorization is a dream, given my desire for both security and laziness.
Encosia's "ASP.NET page methods are only as secure as you make them" absolutely fits those needs. ASP.NET is actually workable for me now. Free at last! (From the noble but disastrous AJAXControlToolkit.)
Anyway, the problem is that that's for work. I'm not buying into the MS architecture when LAMP's out there for free. I'm new to AJAX, and I can't seem to find a clear answer on how to authorize AJAX calls to PHP the same way Encosia does in the link above.
Can anyone suggest the PHP equivalent of what Encosia does in the link above?
Thanks in advance!
More Details
OK, let me be more specific. Encosia's solution above gives a 401 Denied to anyone not logged in who tries to access a WebMethod. Neat, clean, easy. Before, I tried to use session data to grant access, but, unknown to me at the time, it forced synchronous mode. No, no.
I need both for my site. I need to be able to give 401 Denieds on certain pages if a user isn't logged in, and I need to be able to allow anyone to call other PHP scripts via AJAX regardless of login.
Clarity
Bottom line: I don't want anyone accessing certain AJAX PHP scripts unless they are logged in. I don't care about the response or any other details, as long as it's still AJAX. How?
It's not really clear from the question, but if you want to allow access to your server-side AJAX listener scripts (maybe with XML or JSON output) only to users who have either authed or are on the related page, then how about adding a session identifier to your JS AJAX requests? In the server-side script you can check that identifier against, say, a DB table holding your current sessions.
For extra security, you could also check against the IP, a cookie, etc. These are all values you can set when the session is started.
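A minimal sketch of that server-side guard, assuming PHP's built-in sessions with a user_id key set at login (the file name and session keys are made up for illustration):

<?php
// ajax-endpoint.php: hypothetical protected AJAX listener
session_start();

// Read what we need, then release the session lock right away so
// concurrent AJAX calls aren't serialized by the session file.
$userId  = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null;
$boundIp = isset($_SESSION['ip']) ? $_SESSION['ip'] : null;
session_write_close();

// No authenticated session: give the same 401 the ASP.NET setup gives.
if ($userId === null) {
    header('HTTP/1.1 401 Unauthorized');
    exit;
}

// Optional extra check: tie the session to the IP it was started from.
if ($boundIp !== null && $boundIp !== $_SERVER['REMOTE_ADDR']) {
    header('HTTP/1.1 401 Unauthorized');
    exit;
}

header('Content-Type: application/json');
echo json_encode(array('status' => 'ok'));

Endpoints that should stay open to everyone simply skip the check, which gives you both behaviors side by side.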
The main thing you need to ask yourself is this:
If a user is either logged in or just browsing, what kind of access to the database do you really want or need to give? Each application will have its own needs. If you are going to have AJAX listeners on your server, then all it takes is a quick look at Firebug to see where your scripts are and what format the requests take, which could expose a potential security hole. Make sure all incoming requests are correctly sanitized to remove the possibility of injection attacks.
For my school, we have to do these "Advisory Lessons" that tell you about college, etc. After completing one, I am wondering whether I could replicate the same process using a set of requests from a PHP script with cURL.
I went through the lesson again, this time with Firebug on and an HTTP Analyzer.
Much to my surprise, only GET requests were sent out during the entire lesson.
In case you're curious, here is what the "Lesson" window looks like. It's a sort of PowerPoint-type thing where you read the slide, and some slides have questions on them. At the end there is a quiz, and if you don't pass it, the lesson doesn't count.
My question is this: if I were to set up a PHP/cURL script that logged into my account and then made every single one of those requests, would the lesson be counted as complete?
Now obviously it's impossible for you guys to know how their server works and such...
I guess what I am asking is: is there any hidden content, or are there fields that can be passed through a GET request? It just doesn't seem like the lesson window is passing enough info to the server for it to know whether the lesson was complete or not.
Thanks so much for any advice and tips on my project!
EDIT: Here is my official test run (please don't do it too many times):
As many of you hinted, it did not work... but I am still not completely sure why.
As you say, we can't speak to the details of their server, but it is possible to do these kinds of things with GET requests only, because servers can use cookies and store state (associated with those cookies) on the server.
This gives the appearance, probably, of passing extra hidden information to the server.
You can research cookies, and even that jsessionid thing that is appearing in their URLs. That BTW tips you off that they are using at least some Java. :)
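A minimal sketch of what replaying those GETs with cURL might look like, letting a cookie jar carry the session between requests (all URLs and parameters are placeholders):

<?php
$jar = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);  // write cookies here on close
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar); // ...and send them back out
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// Replay each GET the browser made, in order; the session cookie (or that
// jsessionid) is what ties them together into one server-side session.
$steps = array(
    'https://lessons.example.edu/login?user=me&pass=secret', // placeholders
    'https://lessons.example.edu/slide?n=1',
    'https://lessons.example.edu/quiz?answer=42',
);
foreach ($steps as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_exec($ch);
}
curl_close($ch);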
The lesson application may very well be storing data in a session or some other persistent data store server-side, and using a token from your browser (usually a cookie or a GET parameter) to look up that data when needed.
It's a rather complicated task. With cURL alone you can't emulate the execution of JavaScript code, AJAX requests, etc.
I am not sure what you are trying to do. For one thing, HTTP is a stateless protocol, meaning the server gets a request and gives a response to that particular request (which might be GET, POST, or whatever, and might carry some request parameters). Statefulness is usually achieved by the server creating a session and setting a cookie on the client side, so that a session ID is passed in later requests; the session ID is used to recognize the client and track their session. Everything you send during a request is plain text, and the response you get will most likely depend on session state and will also be plain text. There is nothing hidden on the client side about the client side. You just don't get to know what information the server keeps in the session, or how requests are processed based on that and on the information you give during requests.
I've done web scraping before, but it was never this complex. I want to grab course information from a school website. However, all the course information is presented in a web scraper's nightmare.
First off, when you click the "Schedule of Classes" URL, it directs you through several other pages first (I believe to set cookies and check other crap).
Then it finally loads a page with an iframe that apparently only likes to load when it's loaded from within the institution's webpage (i.e. arizona.edu).
From there, the form submissions have to be made via buttons that don't actually reload the page but merely submit an AJAX query, which I think just manipulates the iframe.
This query is particularly hard for me to replicate. I've been using PHP and cURL to simulate a browser visiting the initial pages, gathering the proper cookies and such. But I think there's a problem with the headers my cURL function is sending, because the site never lets me execute any sort of query after the initial search form loads.
Any help would be awesome...
http://www.arizona.edu/students/registering-classes -> "Schedule of Classes"
Or just here:
http://schedule.arizona.edu/
If you need to scrape a site with heavy JS/AJAX usage, you need something more powerful than PHP ;)
First, it must be a full browser with the capability to execute JS, and second, there must be some API for auto-browsing.
Assuming that you are a kid (who else would need to scrape a school site?), try Firefox with iMacros. If you are a more seasoned veteran, look towards Selenium.
I used to scrape a lot of pages with JS, iframes and all kinds of that stuff. I used PhantomJS as a headless browser, which I later wrapped with the PhantomCurl wrapper. The wrapper is a Python script that can be run from the command line or imported as a module.
Are you sure you are allowed to scrape the site?
If so, could they not just give you a simple REST API?
In the rare case where they would allow you to get to the data but would not provide an API, my advice would be to install some software to record your HTTP interaction with the web site, maybe Wireshark or some HTTP proxy; the important thing is that you get all the details of the HTTP requests recorded. Once you have that, analyze it and try to replay it down to the last bit.
Among the possible chores, it might be that at some point the server sends you generated JavaScript that needs to be executed by the client browser in order to get to the next step. In this case you would need to figure out how to parse the received JavaScript and work out how to move on.
It would also be a good idea not to fire all your HTTP requests in burst mode; put in some random delays so that your traffic appears more "human" to the server.
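For instance, a single hedged line between requests (the bounds here are arbitrary):

<?php
// Pause 2-8 seconds between requests so the cadence looks less mechanical.
usleep(mt_rand(2000000, 8000000));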
But in the end you need to figure out whether all this is worth the trouble. Almost any roadblock to scraping can be worked around, but it can get quite involved and time-consuming.
I have a script that uses JSONP to make cross-domain AJAX calls. This works great, but my question is: is there a way to prevent other sites from accessing and getting data from these URLs? I basically would like to make a list of sites that are allowed, and only return data if they are on the list. I am using PHP, and figured I might be able to use HTTP_REFERER, but have read that some browsers will not send this info... ??? Any ideas?
Thanks!
There really is no effective solution. If your JSON is accessible through the browser, then it is equally accessible to other sites. To the web server, requests originating from a browser and from another server are virtually indistinguishable aside from the headers. As ILMV commented, referers (and other headers) can be falsified; they are, after all, self-reported.
Security is never perfect. A sufficiently determined person can overcome any security measures in place, but the goal of security is to create a deterrent high enough that most people would be dissuaded from putting in the time and resources necessary to compromise it.
With that thought in mind, you can create a barrier to entry high enough that other sites would probably not bother making requests. You can generate single-use tokens that are required to grab the JSON data; once a token is used, it is invalidated. In order to retrieve a token, the web page must be requested, with the token embedded within the page in JavaScript and then put into the AJAX call for the JSON data. Combine this with time-expiring tokens and sufficient obfuscation in the JavaScript, and you've created a high enough barrier.
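A minimal sketch of the single-use token idea, using the PHP session as the token store (the file names, the 300-second lifetime, and the payload are all placeholders):

<?php
// page.php: embed a single-use, time-limited token into the page
session_start();
$token = bin2hex(openssl_random_pseudo_bytes(16));
$_SESSION['json_token'] = array('value' => $token, 'expires' => time() + 300);
?>
<script>
// The AJAX call for the JSON data carries the token along.
var url = '/data.php?token=<?php echo $token; ?>';
</script>

<?php
// data.php: serve the JSON only for a valid, unused, unexpired token
session_start();
$stored = isset($_SESSION['json_token']) ? $_SESSION['json_token'] : null;
unset($_SESSION['json_token']); // single use: invalidate it immediately

$given = isset($_GET['token']) ? $_GET['token'] : '';
if ($stored === null || $given !== $stored['value'] || time() > $stored['expires']) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
header('Content-Type: application/json');
echo json_encode(array('secret' => 'payload')); // placeholder data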
Just remember, this isn't impossible to circumvent. Another website could extract the token from the JavaScript, or intercept the AJAX call and hijack the data at multiple points.
Do you have access to the servers/sites that you would like to give access to the JSONP?
What you could do, although it's not ideal, is add a record to a DB on page load with the IP that is allowed to view the JSONP; then, on the JSONP load, check whether that record exists. Perhaps give the record an expiry, if appropriate; a rough sketch follows the example below.
e.g.
http://mysite.com/some_page/ - user loads page, add their IP to the database of allowed users
http://anothersite.com/anotherpage - as above, add to database
Load the JSONP; check that the IP exists in the database.
After one hour, delete the record from the DB, so another page load would be required, for example.
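Here is that sketch, assuming a MySQL table named allowed_ips with ip and created_at columns (the table, credentials, and payload are made up for illustration):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholders

// On page load: record the visitor's IP as allowed, with a timestamp.
$pdo->prepare('REPLACE INTO allowed_ips (ip, created_at) VALUES (?, NOW())')
    ->execute(array($_SERVER['REMOTE_ADDR']));

// On the JSONP request: only answer if the IP was recorded within the hour.
$stmt = $pdo->prepare(
    'SELECT 1 FROM allowed_ips
     WHERE ip = ? AND created_at > NOW() - INTERVAL 1 HOUR'
);
$stmt->execute(array($_SERVER['REMOTE_ADDR']));
if (!$stmt->fetchColumn()) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

// Whitelist the callback name so it can't be used for script injection.
$callback = preg_replace('/[^\w.]/', '', isset($_GET['callback']) ? $_GET['callback'] : 'cb');
header('Content-Type: application/javascript');
echo $callback . '(' . json_encode(array('data' => 'payload')) . ');';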
Although this could quite easily be worked around if the scraper (or other sites) figured out what method you are using to allow users to view the JSONP; they'd only have to hit the page first.
How about using a cookie that holds a token used with every JSONP request?
Depending on the setup you can also use a variable if you don't want to use cookies.
Working with importScripts from a Web Worker is much the same as JSONP.
Make a double check, like theAlexPoon said: main script to web worker, web worker to server, and back, each step with a security query. If the web worker answers the main script without being asked, or with the wrong token, you're better off redirecting your website to nirvana. If the server is asked with the wrong token, don't answer. Cookies will not be sent with an importScripts request, because document is not available at the web worker level. Always send security-relevant cookies with a POST request.
But there are still a lot of risks. The man in the middle knows how.
I'm certain you can do this with .htaccess:
Ensure the requests carry a Referer header (PHP exposes it as HTTP_REFERER); I don't know of any browser that won't send it by default. (If you're still worried, fall back gracefully.)
Then use .htaccess to allow or deny access based on that referer.
# deny all except requests whose Referer matches the allowed domain
SetEnvIf Referer "domain\.com" allowed_referer
Order Deny,Allow
Deny from all
Allow from env=allowed_referer
Looking around for a solution to this, I have found different methods. Some use regex, some use DOM scripting, and so on.
I want to go to a site, log in, fill out a form, and then check that the form was sent. It's the logging-in part I can't find anything on.
Anyone know of an easy way to do this?
I'd agree with Les. cURL plus Charles (or Fiddler, Firefox's Tamper Data extension, Wireshark, etc.) is the way I've always done this. The one trick I've found is that some sites require a three-step process:
Hit the login page with a GET request first to get any session IDs, cookies, and/or required fields (e.g. .NET sites have __VIEWSTATE and __EVENTVALIDATION).
Once you have those values, POST to the login page.
Finally, request whatever resource you're after.
Don't plan on cURL's cookie jar and cookie file being much help. You'll probably be best off parsing the session ID and cookies out of the headers with a simple regex.
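A hedged sketch of that three-step dance (the URLs, form field names, and regexes are placeholders to adapt to the actual site):

<?php
// Step 1: GET the login page, keeping the headers so we can grab cookies,
// and parse out the hidden .NET fields from the body.
$ch = curl_init('https://example.com/login.aspx'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
$response = curl_exec($ch);

preg_match_all('/^Set-Cookie:\s*([^;\r\n]+)/mi', $response, $m);
$cookies   = implode('; ', $m[1]); // e.g. "ASP.NET_SessionId=..."
$viewstate = preg_match('/id="__VIEWSTATE" value="([^"]*)"/', $response, $v) ? $v[1] : '';
$eventval  = preg_match('/id="__EVENTVALIDATION" value="([^"]*)"/', $response, $e) ? $e[1] : '';

// Step 2: POST the credentials plus the hidden fields, replaying the cookies.
// Watch this response too: the fresh auth cookie often arrives here and
// would need to be added to $cookies the same way.
curl_setopt($ch, CURLOPT_COOKIE, $cookies);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    '__VIEWSTATE'       => $viewstate,
    '__EVENTVALIDATION' => $eventval,
    'username'          => 'me',     // placeholder field names
    'password'          => 'secret',
)));
curl_exec($ch);

// Step 3: GET whatever resource you're actually after.
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_URL, 'https://example.com/protected-page'); // placeholder
$page = curl_exec($ch);
curl_close($ch);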
Hope this helps!
You might be better off with some sort of scriptable browser if you need to do a lot of GUI stuff. If you need to use PHP, check out curl: http://us2.php.net/curl
What I usually do is fire up Charles, go through the login process in a browser, and record the raw requests. Then I copy and paste the requests and send them through fopen or cURL (with some small adjustments according to the responses).
You may want to take a look at Perl's LWP library (I know it isn't PHP, but it's very useful for screen scraping, web unit testing, and such):
Perl LWP at CPAN
LWP::Simple tutorial
I have a fair bit of experience with this. I used to use cURL, but it is no fun. In particular, many sites exchange XSRF tokens, pass hidden variables, or set all kinds of cookies. Tracking all this with cURL becomes difficult, at least for me.
I then explored Selenium, and I love it. There are two things to do: 1) install Selenium IDE (works only in Firefox), and 2) install Selenium RC Server.
After starting Selenium IDE, go to the site that you are trying to automate and start recording the events you perform on the site. Think of it as recording a macro in the browser. Afterwards, you get code output for the language you want.
Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.
I've uploaded a PPT that I made a while back. It should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html
In the above link, select the regular download option.