I've done web scraping before, but it was never this complex. I want to grab course information from a school website. However, all the course information is presented in a way that is a web scraper's nightmare.
First off, when you click the "Schedule of Classes" URL, it directs you through several other pages first (I believe to set cookies and check other crap).
Then it finally loads a page with an iframe that apparently only likes to load when it's loaded from within the institution's webpage (i.e. arizona.edu).
From there, the form submissions have to be made via buttons that don't actually reload the page but merely submit an AJAX query, which I think just manipulates the iframe.
This query is particularly hard for me to replicate. I've been using PHP and curl to simulate a browser visiting the initial page and gathering the proper cookies and such. But I think I have a problem with the headers that my curl function is sending, because it never lets me execute any sort of query after the initial "search form" loads.
Any help would be awesome...
http://www.arizona.edu/students/registering-classes -> "Schedule of Classes"
Or just here:
http://schedule.arizona.edu/
If you need to scrape a site with heavy JS / AJAX usage, you need something more powerful than PHP ;)
First, it must be a full browser capable of executing JS, and second, there must be some API for automated browsing.
Assuming that you are a kid (who else would need to parse a school) - try Firefox with iMacros. If you are a more seasoned veteran, look towards Selenium.
I used to scrape a lot of pages with JS, iframes and all that kind of stuff. I used PhantomJS as a headless browser, which I later wrapped with the PhantomCurl wrapper. The wrapper is a Python script that can be run from the command line or imported as a module.
Are you sure you are allowed to scrape the site?
If yes, could they not just give you a simple REST API?
In the rare case where they would allow you to get to the data but would not provide an API, my advice would be to install some software to record your HTTP interaction with the web site, maybe Wireshark or some HTTP proxy. The important thing is that you get all the details of the HTTP requests recorded. After you have that, analyze it and try to replay it down to the last bit.
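As a rough illustration of the replay idea, here is a minimal sketch in Python (chosen only for illustration; the same approach works with PHP and curl). It keeps cookies across requests and sends browser-like headers. The header values and the helper names `build_session` and `build_request` are assumptions, not something taken from the actual site; the real values should be copied from the recorded traffic.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumed header values: replace them with what the recorded browser session sent.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml",
    "X-Requested-With": "XMLHttpRequest",
}


def build_session():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_request(url, referer, form=None):
    """Build a GET (form is None) or POST request with browser-like headers."""
    data = urllib.parse.urlencode(form).encode("ascii") if form else None
    headers = dict(BROWSER_HEADERS, Referer=referer)
    return urllib.request.Request(url, data=data, headers=headers)
```

Calling `opener.open(build_request(...))` for each recorded request, in the recorded order and with the same opener, would carry the cookies from one step to the next.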
Among the possible chores, it might be that at some point the server sends you generated JavaScript that needs to be executed by the client browser in order to get to the next step. In this case you would need to figure out how to parse the received JavaScript and work out how to proceed.
Another good idea would be not to fire all your HTTP requests in burst mode, but to put in some random delays so that the traffic appears more "human" to the server.
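A minimal sketch of such random delays, in Python for illustration. The function names, the default pause lengths, and the `fetch` callable are all assumptions; `fetch` stands for whatever actually performs the request.

```python
import random
import time


def polite_delays(count, base=2.0, jitter=3.0, rng=None):
    """Return `count` pauses in seconds, each between base and base + jitter."""
    rng = rng or random.Random()
    return [base + rng.uniform(0.0, jitter) for _ in range(count)]


def fetch_all(urls, fetch, sleep=time.sleep, rng=None):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    delays = polite_delays(max(len(urls) - 1, 0), rng=rng)
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delays[index - 1])  # no pause before the first request
        results.append(fetch(url))
    return results
```

The `sleep` and `rng` parameters are injectable only so that the pacing can be checked without actually waiting.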
But in the end you need to figure out whether all this is worth the trouble. Almost any roadblock to scraping can be worked around, but it can get quite involved and time-consuming.
Related
I am creating an application that will use Web Sockets for a notification system. Is it better to have the application in an iframe with the Web Sockets in the parent so there isn't a new connection every time a page is loaded? Or maybe it should re-connect?
What are your thoughts?
If anyone knows any other way in PHP to get push-like notifications without sending an AJAX request every 10 seconds, then let me know.
Thanks.
This is one of the options you have described.
The problem with that option is that the parent won't have direct control over the content of the inner iframe. You will need to implement window messaging between the parent and the iframe so that the iframe's src attribute can be changed: if someone refreshes the parent page, the iframe should return to its actual state, not to the initial page.
The second problem is that there will be no SEO at all, so your page won't be crawlable by search engine robots. If SEO is important for your application, then this is not an option.
With WebSockets, if you work with sessions, it is important to make the session available to both the normal PHP scripts and the WebSockets logic, in order to keep consistent access to the data itself. PHP does not make this an easy task at all.
You might consider the long polling technique as well. It opens one AJAX request and waits for the response; the request can last some time, but it will eventually close and has to be reopened on the client side.
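The long-held request described above can be sketched on the server side roughly as follows. This is a simplified Python illustration, not a PHP implementation: an in-process queue stands in for whatever produces notifications, and the names `long_poll` and `publish_later` are made up for the example.

```python
import queue
import threading


def long_poll(events, timeout=25.0):
    """Block until an event arrives or the timeout passes.

    Returns the event, or None on timeout, in which case the client
    is expected to open a new request.
    """
    try:
        return events.get(timeout=timeout)
    except queue.Empty:
        return None


def publish_later(events, message, delay):
    """Simulate another part of the server producing an event after `delay` seconds."""
    timer = threading.Timer(delay, events.put, args=(message,))
    timer.start()
    return timer
```

The client side is a loop: send the request, handle the response (or the empty timeout reply), and immediately send the next request.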
Another option is to review the actual application architecture and think about a single-page application. It has pros and cons as well.
The good thing about it is that the UX will be much better. Response times improve as well, since you load less content and data.
The cons are that it requires a lot of front-end development in JavaScript. There are also two major routes you can take with single-page applications: consistent and inconsistent. In the first case, you need to make sure that on a refresh, or when navigating directly to a specific link, your back-end serves the same static HTML that your single-page application would generate using JavaScript. That solves the SEO issues. The inconsistent approach is purely JavaScript (front-end) and will have issues with SEO.
WebSockets are usually used with single-page applications. Facebook Chat is a great example of this, as is Google Talk while you are in your Gmail account.
They are not meant to be used with frequently refreshed pages, as the handshake process is a bit heavier than a normal HTTP request.
A couple of years ago, before I knew about Stack Overflow, I was working in an office with a lot of competition between the programmers. There, I had to code a web page in PHP with Drupal that needed to get data from another site by RSS. What happened was that there was no way to get the data beforehand: the data depended on the content of the page, which itself was dynamic, so the page stopped loading for a couple of seconds while PHP went to get the RSS data. That was bad. The page depended on a couple of parameters out of a huge list, so fetching all possible combinations in advance was out of the question. It was some sort of search page that included the results of a sister site, I think.
The first thing I did to improve that was to set up a caching system. When the page was loaded, it launched a JavaScript method that saved the RSS data back into the database for this specific page, using AJAX. That meant that if the same page was requested again, the old data would be sent immediately, and the AJAX script would update the cache with the new data if needed. The JavaScript pretty much opened a hidden page on the site with a GET request that matched the current page's parameters. It was only a couple of days later that I realised that I could have cached the data without the AJAX. (Trust me, it's easier to spot in hindsight.) But that's not the issue I'm asking about.
But I was told not to do any caching at all. I was told that my AJAX page "exposed the API". That a malicious user could hit the hidden page again and again to do a Denial of Service attack. I thought my AJAX was a temporary solution anyway, but that caching was needed. But mostly: wasn't the DoS argument true of ANY page on the site? Did the fact that my hidden page did not appear in the menus and returned no content make it worse?
As I said, there was a lot of competition between programmers, so the people around me, who were unanimous, might have been right, or they may have tried to stop me from doing something that was bad because they were not the ones doing it. (It happened a lot.) But I'm still curious. I was fully aware that my AJAX thing was a hack. I wanted to change that system as soon as I found something better, but I thought that no caching at all was even worse. Which was true? Doesn't, by that logic, ALL AJAX expose the API? If we look past the fact that my AJAX was an ugly hack, was it really that dangerous?
I'll admit again and again that it was an ugly, temporary fix, but my question is about having a "hidden" page that returns no content that makes the server do something. How horrible is that?
Both sides are right. Yes, it does "expose" the API, but AJAX requests can only access publicly accessible documents/scripts in the first place, so yes, all AJAX requests "expose" their target script in the same way. DoS attacks are not script-specific, they are server-specific, so one can perform a DoS using anything pointing to the server, not just the script your AJAX calls. I would tell your buddies that their argument is weak and grasping at straws, and not to be jealous :P
If I read your post correctly, it seems as if the AJAX requested version of the page would know to invalidate the cache each time?
If that's the case, then I suppose your co-worker might have been saying that the hidden page would be susceptible to a DDoS attack in a way that the full pageload wasn't, i.e. the full pageload would get a cached version on each pageload after the first, whereas the AJAX version would get fresh content each time. If that's the case, then s/he's right.
By "expose the API", your co-worker was saying that you were exposing the URL of a page that was doing work that should be done in the background. The outside world should not know about a URL whose sole purpose is to do some heavy lifting task. As you even said, you found a backend solution that didn't require the user's browser knowing about your worker process at all.
Yes, having no cache at all when the page relies on heavy content is worse than having an ajax version of the page do the caching, but I think the warning from your coworker was that no page, EVEN if it's AJAX, should have the power to break the cache in a way you didn't expect or intend.
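To make the backend-only alternative concrete, here is a minimal sketch of a time-limited cache that refreshes inside the normal page request, so no separate URL exists that can force a refresh. It is written in Python for illustration rather than PHP/Drupal, and the class name `FeedCache`, the five-minute default, and the `fetch` callable are assumptions.

```python
import time


class FeedCache:
    """Cache fetched results per parameter set for ttl seconds."""

    def __init__(self, fetch, ttl=300.0, clock=time.monotonic):
        self._fetch = fetch      # slow call, e.g. downloading the RSS feed
        self._ttl = ttl
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: no remote call
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With this shape, repeated requests for the same parameters trigger at most one remote fetch per `ttl` window, however often the page is hit.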
The only way this would be a problem is if said "hidden page that returns no content that makes the server do something" had a different authentication scheme or permissions from the rest of the pages, or if what it made the back-end do was inordinately heavy compared to any other page on the site that posted something.
I am trying to make a "status monitor" for our small network. After the page has loaded, I ping every IP address that I have added. That works. But I would like to repeat this ping every X minutes without reloading the whole page.
I can do it by reloading the page with a refresh header, but I would like to do this without a reload.
I think I have to do this with AJAX, but I don't know how.
Thank you
I would strongly suggest you have a look at Nagios or something similar:
1) you don't need to have a web page constantly open to detect problems
2) it can automatically verify and escalate issues
3) there are lots of probes available out of the box which can be used to measure all sorts of things - not just ping times
4) responding to a ping is not the same thing as working
5) it automatically collates stats to identify patterns of issues
6) it also provides SLA type reporting
7) Nagios is simple enough that even I can understand it
8) it's what I chose after a lot of work researching a replacement for a system similar to what you are suggesting.
HTH
C.
If that is the entire code of the page, I suggest setting up a cron job.
And if you want to use AJAX (i.e. jQuery AJAX; there is a plugin called jQuery Timers), use it to send an AJAX request to the page with the code you want to run.
http://plugins.jquery.com/project/timers
check this out
I suggest you take a look at some of the "other-way-around" approaches, such as COMET, here is an interesting article covering basic usage with PHP.
This would put the implementation of "ping" in your server instead of the client.
You could for instance instead of setting a fixed interval push out updates at will. Meaning you would get almost realtime status notifications instead of the fixed interval updates.
In web development, Comet is a neologism to describe a web application model in which a long-held HTTP request allows a web server to push data to a browser, without the browser explicitly requesting it. Comet is an umbrella term for multiple techniques for achieving this interaction. All these methods rely on features included by default in browsers, such as JavaScript, rather than on non-default plugins.
COMET (Wikipedia)
Why don't you try a cron?
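A cron-driven check could look roughly like the sketch below: a script that pings each host and writes the results to a JSON file, which the status page can then load without doing any pinging itself. It is written in Python for illustration; the function names are made up, and `ping -c 1` assumes a Unix-like `ping` command.

```python
import json
import subprocess


def ping_once(host, timeout=2.0):
    """Return True if one ICMP echo to host succeeds (Unix-style `ping -c 1`)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def check_hosts(hosts, probe=ping_once):
    """Map each host to "up" or "down" using the given probe."""
    return {host: ("up" if probe(host) else "down") for host in hosts}


def write_status(path, statuses):
    """Write the statuses as JSON so a web page can fetch the file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(statuses, handle, indent=2, sort_keys=True)
```

Cron would run something like `write_status(path, check_hosts(hosts))` every X minutes, with `path` and `hosts` being whatever fits the local setup.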
I'm not sure exactly what you want to do here, but this quick tutorial shows you how to call a PHP file every second and update a div block with the results. It is quick and simple using jQuery.
I was looking at Twitter Streaming API for getting a real-time feed.
But I don't want it stored on my server.
I just want it pulled from Twitter by my server, and the browser page will retrieve the data from my server's Twitter pull URL.
But I want to avoid polling my server every few milliseconds.
Is there a way for my server script to keep pushing to my browser page?
Just how live do you want it? There are ways to set up sockets, but they can be fairly complex, and still consume their fair share of bandwidth.
Is it acceptable to poll every 5, 10 seconds or so? Every few milliseconds would give you pretty darn "live" results, but I would not be upset as a user if it took a matter of seconds for something to appear on your website. That would be satisfactorily "live" for me.
Check out COMET.
In web development, Comet is a neologism to describe a web application model in which a long-held HTTP request allows a web server to push data to a browser, without the browser explicitly requesting it.
I've always wanted to try this method but I haven't gotten around to it:
Hidden IFrame
A basic technique for dynamic web application is to use a hidden IFrame HTML element (an inline frame, which allows a website to embed one HTML document inside another). This invisible IFrame is sent as a chunked block, which implicitly declares it as infinitely long (sometimes called “forever frame”). As events occur, the iframe is gradually filled with script tags, containing JavaScript to be executed in the browser. Because browsers render HTML pages incrementally, each script tag is executed as it is received.[8]
One benefit of the IFrame method is that it works in every common browser. Two downsides of this technique are the lack of a reliable error handling method, and the impossibility of tracking the state of the request calling process.[8]
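To show what the server actually sends with this technique, here is a small Python sketch that builds the stream of script tags. The callback name `parent.onEvent` is a hypothetical function assumed to exist in the parent page, and the generator only produces the pieces; a real server would have to flush each one to the open connection as the event occurs.

```python
import json


def script_chunk(event, callback="parent.onEvent"):
    """Wrap one event in a script tag that calls a function in the parent page."""
    # Escape "</" so the payload cannot close the script tag early.
    payload = json.dumps(event).replace("</", "<\\/")
    return "<script>{}({});</script>\n".format(callback, payload)


def forever_frame(events):
    """Yield the iframe document piece by piece; a server would flush each piece."""
    yield "<!DOCTYPE html><html><body>\n"
    for event in events:
        yield script_chunk(event)
```

The document is never closed on purpose: the browser keeps the iframe loading and runs each script tag as it arrives.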
When is it appropriate to use AJAX?
What are the pros and cons of using AJAX?
In response to my last question: some people seemed very adamant that I should only use AJAX if the situation was appropriate:
Should I add AJAX logic to my PHP classes/scripts?
In response to Chad Birch's answer:
Yes, I'm referring to developing a "standard" site that would employ AJAX for its benefits and wouldn't be crippled by its application. Using AJAX in a way that would kill search rankings would not be acceptable. So if "keeping the site intact" requires more work, then that would be a "con".
It's a pretty large subject, but you should be using AJAX to enhance the user experience, without making the site totally dependent on it. Remember that search engines and some other visitors won't be able to execute the AJAX, so if you rely on it to load your content, that will not work in your favor.
For example, you might think that it would be nice to have users visit your blog, and then have the page dynamically load the newest article(s) with AJAX once they're already there. However, when Google tries to index your blog, it's just going to get the blank site.
A good search term to find resources related to this subject is "progressive enhancement". There's plenty of good stuff out there, spend some time following the links around. Here's one to start you off:
http://www.alistapart.com/articles/progressiveenhancementwithjavascript/
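One common way to apply progressive enhancement on the server is to render the same content either as a full page or as a fragment, depending on how it was requested. The Python sketch below assumes the client-side library sets the `X-Requested-With: XMLHttpRequest` header on AJAX calls (jQuery does this by default; browsers do not add it on their own), and `render_article` is a made-up name.

```python
import html


def render_article(article, headers):
    """Return a fragment for AJAX requests and a complete page otherwise."""
    fragment = "<article><h1>{}</h1><p>{}</p></article>".format(
        html.escape(article["title"]), html.escape(article["body"])
    )
    if headers.get("X-Requested-With") == "XMLHttpRequest":
        return fragment  # script-enabled client updates part of the page
    # Crawlers and script-less browsers get the same content in a full document.
    return "<!DOCTYPE html><html><body>{}</body></html>".format(fragment)
```

Because the plain request already contains the article, a search engine that never runs the script still sees the content.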
When you are only updating part of a page or perhaps performing an action that doesn't update the page at all AJAX can be a very good tool. It's much more lightweight than an entire page refresh for something like this. Conversely, if your entire page reloads or you change to a different view, you really should just link (or post) to the new page rather than download it via AJAX and replace the entire contents.
One downside to using AJAX is that it requires javascript to be working OR you to construct your view in such a way that the UI still works without it. This is more complicated than doing it just via normal links/posts.
AJAX is usually used to perform an HTTP request while the page is already loaded (without loading another page).
The most common use is to update part of the view. Note that this does not include refreshing the whole view since you could just navigate to a new page.
Another common use is to submit forms. In all cases, but especially for forms, it is important to have good ways of handling browsers that do not have javascript or where it is disabled.
I think the advantage of using AJAX technologies isn't only better user experiences. The ability to make server calls for only specific data is a huge performance benefit.
Imagine a site that eats a huge amount of bandwidth (like Stack Overflow): most of the navigation done by users happens through page reloads, with data continuously sent over HTTP.
Of course caching and other techniques help with this bandwidth overhead problem, but personally I think that sending huge chunks of HTML every time is really a waste.
Cons are SEO (which doesn't work with heavily AJAX-based sites) and people who have JavaScript disabled.
When your application (or your users) demand a richer user experience than a traditional webpage is able to provide.
Ajax gives you two big things:
Responsiveness - you can update only parts of a web page at a time if need be (saving the time to re-load a page). It also makes it easier to page data that is presented in a table for instance.
User Experience - This goes along with responsiveness. With AJAX you can add animations, cooler popups and special effects to give your web pages a newer, cleaner and cooler look and feel. If no one thinks this is important then look to the iPhone. User Experience draws people into an application and makes them want to use it, one of the key steps in ensuring an application's success.
For a good case study, look at this site. AJAX effects like animating your new Answer when posted, popups to tell you you can't do certain things and hints that new answers have been posted since you started your own answer are all part of drawing people into this site and making it successful.
Javascript should always just be an addition to the functionality of your website. You should be able to use and navigate the site without any Javascript involved. You can use Javascript as an addition to existing functionality, for example to avoid full-page reloads. This is an important factor for accessibility. Javascript should never be used as the only possibility to reach or complete a request on your site.
As AJAX makes use of Javascript, the same applies here.
Ajax is primarily used when you want to reload part of a page without reposting all the information to the server.
Cons:
More complicated than doing a normal post (working with different browsers, writing server-side code to handle partial postbacks)
Introduces potential security vulnerabilities: you are introducing additional code that interacts with the server, which can be a problem on both the client and the server. On the client, you need ways of sending requests and receiving responses. It's another way of interacting with the browser, which means there is another point of entry that has to be guarded against things like executing arbitrary code or posting data to an unintended destination. Several exploits for Ajax apps have been plugged over time, but there will always be more.
Pros:
It looks flashier to end users
Allows a lot of information to be displayed on the page without having to load all at the same time
Page is more interactive.