Automate getting report from webpage - php

I'm a Java developer and I have a question about automating a task I've been given.
Three times daily, I have to log in to a website we have at work, select a few form elements, and then click submit to get a report printed out.
I'm wondering how I can write some sort of script to automate this task. Where should I start, and what language should I use? I was thinking PHP might be able to do this, or possibly even a Greasemonkey script?
Thanks a lot.

Check out cURL in PHP. It allows you to do all the normal functions of a web browser with code (other than moving the mouse). And yes, you'll need to do screen scraping.
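For example, submitting a form with cURL might look roughly like this (a sketch only; the URL and field names are placeholders, not the real ones from your site):

    <?php
    // Rough sketch only -- the URL and field names below are placeholders.
    $ch = curl_init('http://intranet.example.com/report.php');
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(array(
            'report_type' => 'daily',       // hypothetical form fields
            'format'      => 'pdf',
        )),
        CURLOPT_RETURNTRANSFER => true,     // return the response instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow any redirect after the POST
    ));
    $report = curl_exec($ch);
    curl_close($ch);
    file_put_contents('report.pdf', $report);   // save the result locally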

I think the potential sticking point that hasn't been touched on yet is your phrase "login to this website"... Depending on how you need to log in, you may need to go in through a back door to access the report.
I had problems with this kind of thing in the past when I had to download a report from a third party site. The issue was that I couldn't authenticate to access the report parameters because of the hard-coded and less-than-script-friendly way I was required to log in to the site. However, I presume that your site is internal to your organisation, so it may be possible to bypass/rework the security requirements in order to access the data. If this is the case, then you should be able to use one of the screen scraping methods outlined above.
If not, you may need to incorporate the actual login procedure into your script or application, download and capture any cookies that may be set and incorporate them into your data request.
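With cURL, for example, the login-then-request flow might look roughly like this (everything named here -- URLs, field names, credentials -- is a placeholder):

    <?php
    // Sketch only: log in first, keep whatever cookies the site sets, then ask
    // for the report in the same session.
    $cookieFile = tempnam(sys_get_temp_dir(), 'cookies');

    $ch = curl_init('http://intranet.example.com/login.php');
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(array('user' => 'me', 'pass' => 'secret')),
        CURLOPT_COOKIEJAR      => $cookieFile,   // write cookies set by the login response
        CURLOPT_COOKIEFILE     => $cookieFile,   // send them back on later requests
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ));
    curl_exec($ch);

    // Reuse the same handle (and cookie jar) for the report request.
    curl_setopt($ch, CURLOPT_URL, 'http://intranet.example.com/report.php');
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('report_type' => 'daily')));
    $report = curl_exec($ch);
    curl_close($ch);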

I don't know what language your form is written in, but what you could do is:
rewrite the form to a script which generates the report when called
use a cron entry to schedule this task to be done daily and mail the output to you
A cron is basically a scheduled task on Unix systems. Windows-based servers can use the Task Scheduler to much the same end.
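If you do end up with a standalone script, the cron-driven version might look roughly like this -- generate_report() and the file names are placeholders for whatever actually produces the report -- and the cron entry then just runs php /path/to/report_mailer.php on the schedule you need:

    <?php
    // report_mailer.php -- rough sketch of the kind of script a cron entry could
    // call three times a day. generate_report() stands in for whatever code
    // currently builds the report; the recipient address is a placeholder.
    require 'report_functions.php';   // hypothetical include with the report logic

    $report = generate_report();      // hypothetical: returns the report as text/HTML
    mail('you@example.com', 'Scheduled report', $report);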
The above assumes that you have access to the script which generates the report at the moment and can modify it / copy it to a new file which will email the output to you. If not, then you may need to look into screen scraping. As you're a Java developer, you may find this list of Java screen scraping utilities handy to get you started.

It's called "web scraping" or "screen scraping", and there are a lot of libraries out there to do this. I couldn't speak to a Java-specific tool, though: I'm a .Net guy (the .Net way would be System.Net.WebClient or System.Net.HttpWebRequest/System.Net.HttpWebResponse). But I'm sure there's something.
In the meantime, the first step is to go to the page where you input the form values and view the source of the page. Look for the specific <form> element you're filling out, and see where it posts to (its action). Then, find any <input>, <select>, <textarea> elements you use, including any hidden inputs for the form, and figure out what values you need to send. That will tell you how to build your request once you find a library that will let you send it.
If you need to log in to the site first to get to the page, things can be more complicated. You may need to retrieve and parse a session value or be able to send certain cookies to the server.
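Here's a sketch of that inspection step done in code with PHP's DOMDocument: pull the form's action and its fields (including hidden ones) out of the page. The URL is a placeholder, and this assumes the page's first <form> is the one you want.

    <?php
    $html = file_get_contents('http://example.com/report-form');
    $dom  = new DOMDocument();
    @$dom->loadHTML($html);                    // suppress warnings from sloppy markup

    $form   = $dom->getElementsByTagName('form')->item(0);
    $action = $form->getAttribute('action');   // where the form posts to

    $fields = array();
    foreach ($form->getElementsByTagName('input') as $input) {
        // Hidden inputs often carry tokens the server expects to get back.
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    // $action and $fields now tell you where to POST and what values to include.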

Related

How to prevent automated request?

I was wondering how to set up a system in which an authenticated user could send, with a simple graphical interaction (clicking a button or so), a non-replayable request/message to the server from an application or a web page.
It's critical that there must not be a way to set up an automated system that replaces the user interaction by automating the request, as this would totally break my entire project.
Moreover, as this action must be frequently repeated, it should not involve tedious stuff like captchas.
A practical example: let's say the web page, shown after the login, displays a button that sends the server a request. How can I be sure the request was sent because the user actually clicked the button and it wasn't some sort of bot that forged the message?
Is that even possible to check? I'm sure it is, and I'm quite sure there must be some simple implementation I'm missing, so I'm sorry if this is a trivial question.
Also, if the solution is hiding out there ('cause I already searched a lot!), please point me to it.
Thanks for your attention.
You could use a non-graphical captcha like a simple question.
Generate a simple addition of two random integers between 0 and 10.
Add a text field to ask for the result.
The result is very easy to find (for a human being), and very quick to type.
Example:
What is the result of 7+5? Write your result here: [_]
It should only block robots and very young or very stupid people.
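A rough sketch of the idea in PHP (not production code): generate the question, remember the answer in the session, and check it when the form comes back.

    <?php
    session_start();

    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
        $given = isset($_POST['captcha']) ? (int)$_POST['captcha'] : -1;
        if (isset($_SESSION['captcha_answer']) && $given === $_SESSION['captcha_answer']) {
            // handle the real, human-initiated request here
        } else {
            echo 'Wrong answer, please try again.';
        }
    }

    // Generate a fresh question for the next submission.
    $a = rand(0, 10);
    $b = rand(0, 10);
    $_SESSION['captcha_answer'] = $a + $b;

    echo "What is the result of $a+$b? <input type=\"text\" name=\"captcha\">";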

How to scrape website content (*COMPLEX* iframe, javascript submission)

I've done web scraping before but it was never this complex. I want to grab course information from a school website. However, all the course information is displayed in a web scraper's nightmare.
First off, when you click the "Schedule of Classes" URL, it directs you through several other pages first (I believe to set cookies and check other crap).
Then it finally loads a page with an iframe that apparently only likes to load when it's loaded from within the institution's webpage (i.e. arizona.edu).
From there the form submissions have to be made via buttons that don't actually reload the page but merely submit an AJAX query, and I think it just manipulates the iframe.
This query is particularly hard for me to replicate. I've been using PHP and cURL to simulate a browser visiting the initial page, gathering the proper cookies and such. But I think I have a problem with the headers that my cURL function is sending, because it never lets me execute any sort of query after the initial "search form" loads.
Any help would be awesome...
http://www.arizona.edu/students/registering-classes -> "Schedule of Classes"
Or just here:
http://schedule.arizona.edu/
If you need to scrape a site with heavy JS / AJAX usage, you need something more powerful than PHP ;)
First, it must be a full browser capable of executing JS, and second, there must be some API for automated browsing.
Assuming that you are a kid (who else would need to scrape a school site), try Firefox with iMacros. If you are a more seasoned veteran, look towards Selenium.
I used to scrape a lot of pages with JS, iframes and all kinds of that stuff. I used PhantomJS as a headless browser, which I later wrapped with the PhantomCurl wrapper. The wrapper is a Python script that can be run from the command line or imported as a module.
Are you sure you are allowed to scrape the site?
If yes, then couldn't they just give you a simple REST API?
In the rare case where they allow you to get to the data but won't provide an API, my advice would be to install some software to record your HTTP interaction with the web site, maybe Wireshark or some HTTP proxy, but it is important that you get all the details of the HTTP requests recorded. After you have that, analyze it and try to replay it down to the last bit.
Among the possible chores, it might be that at some point the server sends you generated Javascript that needs to be executed by the client browser in order to get to the next step. In this case you would need to figure out how to parse the received Javascript and work out how to move on.
It is also a good idea not to fire all your HTTP requests in burst mode; put in some random delays so that your traffic appears more "human" to the server.
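Something as simple as this would do for the delays (a sketch; fetch_page() is a hypothetical wrapper around your existing cURL code, and the URL list is a placeholder):

    <?php
    $urls = array('http://example.com/page1', 'http://example.com/page2');
    foreach ($urls as $url) {
        fetch_page($url);    // hypothetical wrapper around your curl request
        sleep(rand(2, 8));   // random 2-8 second pause between requests
    }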
But in the end you need to figure out whether all this is worth the trouble. Almost any roadblock to scraping can be worked around, but it can get quite involved and time-consuming.

Do I have to display a specific error message for each error in a PHP script (these error messages are included in the Javascript code)?

I am developing a website using PHP. I use Javascript/jQuery to control/regulate user input. Of course I need to validate user input against these rules in the PHP script as well. However, do I have to display specific error messages once an error is detected in the PHP script? I mean, if the Javascript runs properly, the input errors won't come out in the PHP script; only when a user disables Javascript or bypasses it somehow can the input errors come out there. How should I handle this?
I know I need to validate user input in the PHP script; the question is whether I need to display a specific error message to the user once an error is detected in the PHP script.
Well, the problem is that whether Javascript is active or not changes from user to user, while PHP applies to all users, as it is run before the page even leaves the server.
Ideally, you should always design a site as if you don't have Javascript, and then layer Javascript on as an extra layer of usability. You never know what the user's browser is running, and it is best to design for the worst.
Javascript tends to work best, especially with jQuery, when you simply create a valid website that has the basic features you want, and then bring in Javascript to make it run without refreshing. You hook Javascript into the page by making submit buttons not refresh but simply tell Javascript to submit the form, or by having Javascript produce the errors instead of the PHP page.
The benefit of designing this way is that because you have made a fully PHP-based functioning website, the people who don't have Javascript can still use your website, and the people who do have Javascript get a nicer, cleaner, more usable website.
In some instances, if your audience is well known, like a company application, you can forget about not having Javascript. But with a public website, it is best to design for the worst.
So, yes. I would have PHP display errors just as Javascript does.
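A rough sketch of what that looks like on the server: the same checks the jQuery code performs, repeated in PHP, with a specific message per field. The field names here are placeholders.

    <?php
    $errors = array();

    if (empty($_POST['name'])) {
        $errors['name'] = 'Please enter your name.';
    }
    if (empty($_POST['email']) || !filter_var($_POST['email'], FILTER_VALIDATE_EMAIL)) {
        $errors['email'] = 'Please enter a valid email address.';
    }

    if ($errors) {
        // Re-render the form with the message next to each field, or return the
        // array as JSON if the request came from your jQuery code.
        foreach ($errors as $field => $message) {
            echo '<p class="error">' . htmlspecialchars($message) . '</p>';
        }
    } else {
        // process the input
    }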
I would say you are the only one who can choose whether or not your website should display specific error messages ;-)
Considering Javascript can be disabled, you should probably display specific (useful) error messages -- but, considering it's not often disabled... Well, maybe you have some stuff to do that has a higher priority?
Which, of course, doesn't prevent you from implementing those specific error messages a bit later, when there's nothing with a higher priority left to do ;-)
In the end, the question you are asking shows you started by developing the JS part of your application, and only at the end thought about the "what if JS is disabled" part.
In theory, things should go the other way around:
First, develop the website without depending on Javascript
And when it's working OK, start enhancing it with Javascript
Which means:
Users who have JS enabled will have a nice website
And users without Javascript will also have a fully-functional website, even if it's a bit less user-friendly.
Still, what really matters is that you're doing verifications on the server side -- and you are.
Whether or not to implement a complete fallback for non-Javascript clients/users has been an ongoing debate for years and years.
I guess it depends on the clientele you expect for your website and how much effort you want to put in for the (alleged) minority. And there are many shades of not implementing the whole user-pampering experience for non-Javascript clients. E.g. your server could send maybe two or three different general error messages, while at input time the Javascript shows detailed messages.

Offloading script function to post-response: methods and best-practices?

First,
the set up:
I have a script that executes several tasks after a user hits the "upload" button that sends the script the data it needs. Now, this part is currently mandatory; we don't have the option at this point to cut out the upload and draw from a live source.
This section is intentionally long-winded to make a point. Skip ahead if you hate that.
Right now the data is parsed from a really funky source using regex, then broken down into an array. It then checks the DB for any data already in the uploaded data's date range. If the data's date ranges don't already exist in the DB, it inserts the data and outputs success to the user (there are also some security checks, data source validation, and basic upload validation)... If the data does exist, the script then gets the data already in the DB, finds the differences between the two sets, deletes the old data that doesn't match, adds the new data, and then sends an email to each person affected by these changes (one email per person with all relevant changes in said email, which is a whole other step). The email addresses are pulled by means of an LDAP search, as our DB has their work email but the LDAP has their personal email, which ensures they get the email before they come in the next day and get caught unaware. Finally, the data-uploader is told "Changes have been made, emails have been sent," which is really all they care about.
Now I may be adding a Google Calendar API that posts the data (when it's scheduling data) to the user's Google Calendar. I would do it via their work calendar, but I thought I'd get my toes wet with Google's API before dealing with setting up a WebDav system for Exchange.
</backstory>
Now!
The practical question
At this point, pre-Google integration, the script takes at most a second and a half to run. It's pretty impressive, at least I think so (the server, not my coding). But the Google bit, in tests, is SLOOOOW. We can probably fix that, but it raises the bigger question...
What is the best way to off-load some of the work after the user has gotten confirmation that the DB has been updated? This is the part he's most concerned with and the part most critical. Email notifications and Google Calendar updates are only there for the benefit of those affected by the upload, and if there is a problem with these notifications, he'll hear about it (and then I'll hear about it) regardless of the script telling him first.
So is there a way, for example, to run a cronjob that's triggered by a script's last execution? Can PHP create cronjobs with exec() ability? Is there some normalized way of handling post-execution work that needs getting done?
Any advice on this is really appreciated. I feel like the script's bloatedness reflects my stage of development and the need for me to finally learn how to do division of labor in web apps.
But I also get worried that this is not how it's done, as users need to know when all tasks are completed, etc. So this brings up:
The best-practices/more-subjective question
Basically, is the idea that progress bars, real-time offloading, and other ways of keeping the user tethered to the script are -- when combined with optimization of the code, of course -- the better, more-preferred method than simply saying "We're done with your part; if you need us, we'll be notifying users", etc.?
Are there any BIG things to avoid (other than obviously not giving the user any feedback at all)?
Thanks for reading. The coding part is crucial, so don't feel obliged to cover the second part or forget to cover the coding part!
A cron job is good for this. If all you want to do when a user uploads data is say "Hey user, thanks for the data!" then this will be fine.
If you prefer a more immediate approach, then you can use exec() to start a background process. In a Linux environment it would look something like this:
exec("php /path/to/your/worker/script.php >/dev/null &");
The & part says "run me in the background." The >/dev/null part redirects output to a black hole. As far as handling all errors and notifying appropriate parties -- this is all down to the design of your worker script.
For a more flexible cross-platform approach, check out this PHP Manual post
There are a number of ways to go about this. You could exec(), like the above says, but you could potentially run into a DoS situation if there are too many submit clicks. The pcntl extension is arguably better at managing processes like this. Check out this post to see a discussion (there are 3 parts).
You could use Javascript to send a second, AJAX post that runs the appropriate worker script afterwards. By using ignore_user_abort() and sending a Content-Length, the browser can disconnect early, but your Apache process will continue to run and process your data. The upside is no fork-bomb potential; the downside is that it will tie up more Apache processes.
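A rough sketch of that early-disconnect trick (the two function calls at the end are placeholders for the slow email/calendar work described in the question):

    <?php
    // Tell the client we're done, let it disconnect, then keep working.
    ignore_user_abort(true);                    // keep running after the client goes away
    ob_start();
    echo json_encode(array('status' => 'ok'));  // the quick confirmation the uploader sees
    header('Connection: close');
    header('Content-Length: ' . ob_get_length());
    ob_end_flush();
    flush();

    // The browser has its response by now; do the slow work in the same request.
    send_notification_emails();                 // hypothetical slow tasks
    update_google_calendar();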
Yet another option is to use a cron in the background that looks at a process-queue table for things to do 'later' - you stick items into this table on the front end, remove them on the backend while processing (see Zend_Queue).
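A sketch of the queue-table idea: the upload script only records the job, and a cron-driven worker does the slow part later. The table, column, and function names (process_job) are made up for illustration.

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // Front end: enqueue instead of doing the slow work inline.
    $stmt = $pdo->prepare("INSERT INTO job_queue (type, payload, status) VALUES (?, ?, 'pending')");
    $stmt->execute(array('send_emails', json_encode(array('example' => 'payload'))));

    // Worker script (run from cron): pick up pending jobs and mark them done.
    foreach ($pdo->query("SELECT * FROM job_queue WHERE status = 'pending'") as $job) {
        process_job($job);   // hypothetical handler for the actual emails/calendar updates
        $pdo->prepare("UPDATE job_queue SET status = 'done' WHERE id = ?")
            ->execute(array($job['id']));
    }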
Yet another is to use a more distributed job framework like gearmand - which can process items on other machines.
It all depends on your overall capabilities and requirements.

Sending data from one page to another server. Language agnostic

I'll try to keep this short and simple.
I haven't begun writing the code for this project yet, but I'm trying to work out the pre-coding logistics as of right now.
What I am looking to do is create a method of sending data from one/any site to another remote server, which would generate a response for the user requesting the data be sent.
Specifically, a web designer should be able to take a snippet of code I would have available on my website, implement it into their own page(s), and when a user browses their page(s), my service would be available.
A specific example: a web designer with a website (helloworld.com) could implement my code in their page (helloworld.com/index.html). When a user viewing the page (/index.html) hovers the mouse over a text word (lemonade) for a couple of seconds, a small dialog box pops up beside it, providing the user with some options specific to the text (an example of an option would be "Define 'lemonade' at Dictionary.com") that, if selected, would be processed at, and a response returned from, my remote server (myremoteserver.com).
Obviously, I would want the code that designers would be using to be as lightweight and standalone as possible. Any ideas as to how I could execute this? Resources I should study, and methods I could implement?
Please do not create another one of those services that annoyingly double-underlines words in web site content and then pops up an ugly, slow-to-load ad over the content if I accidentally mouse over the word. Because that sounds like what you're doing.
If you're going to do it anyway, then what talks to the "remote server" will probably actually be a bit of client-side JavaScript, in which case JSON is probably your best bet. XML could also work, but even when JavaScript isn't on the other side, I rather like JSON as a serialization format due to its compactness and readability.
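For example, the remote endpoint might send back something like this -- a small JSON menu of options for the word the user hovered over. This is only a sketch; the parameter name and the option URLs are made up for illustration.

    <?php
    $word = isset($_GET['word']) ? $_GET['word'] : '';

    header('Content-Type: application/json');
    echo json_encode(array(
        'word'    => $word,
        'options' => array(
            array(
                'label' => "Define '$word' at Dictionary.com",
                'url'   => 'http://dictionary.reference.com/browse/' . urlencode($word),
            ),
            array(
                'label' => "Search Wikipedia for '$word'",
                'url'   => 'http://en.wikipedia.org/wiki/' . urlencode($word),
            ),
        ),
    ));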
I think you are talking about a hyperlink.
The part that has you confused is the level of interactivity you want on the client site. Whatever sort of neat UI you want to wrap around the link will probably be done in Javascript and will need to be supplied to that site. The core of what you're asking for
text ... that if selected, would be processed at, and a response would be returned from my remote server (myremoteserver.com)
is just a hyperlink.
There's probably more to it than that though. Explain and we'll try to help.
I'll elaborate, and furthermore explain that I am not making one of those 'annoying' web services that turns resourceful information into a clunky billboard. I intend to make a low-graphics (16x16 icons at most per menu option) resource-linking tool that can be used to connect resources on the local server to other resources, whether local or remote.
This data would be accessed by sending a request to my web server, which returns a response shown in a popup box (this response would be based on the query, of course). The response would be displayed as a brief menu of options, for example Wikipedia entries, links to torrent searches on popular engines, etc.
It won't inhibit selecting, scrolling, clicking predefined hyperlinks, or anything else, as you would need to hover over the text for a few seconds.
I'm just looking for resources that would be helpful in designing such a service.
