How to grab dynamic content from a website and save it? - php

For example, I need to grab the amount of free storage from http://gmail.com/:
Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.
And then store those numbers in a MySQL database.
The number, as you can see, is dynamically changing.
Is there a way I can set up a server-side script that will grab that number every time it changes and save it to the database?
Thanks.

Since Gmail doesn't provide any API to get this information, it sounds like you want to do some web scraping.
Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites.
There are numerous ways of doing this, as described in the Wikipedia article linked above:
Human copy-and-paste: Sometimes even the best Web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites being scraped explicitly set up barriers to prevent machine automation.
Text grepping and regular expression matching: A simple yet powerful approach to extracting information from Web pages can be based on the UNIX grep command or the regular expression matching facilities of programming languages (for instance Perl or Python).
HTTP programming: Static and dynamic Web pages can be retrieved by posting HTTP requests to the remote Web server using socket programming.
DOM parsing: By embedding a full-fledged Web browser, such as Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse Web pages into a DOM tree, from which programs can retrieve parts of the pages.
HTML parsers: Some semi-structured data query languages, such as the XML query language (XQL) and the hyper-text query language (HTQL), can be used to parse HTML pages and to retrieve and transform Web content.
Web-scraping software: There are many Web-scraping software packages available that can be used to customize Web-scraping solutions. They may provide a Web recording interface that removes the need to manually write scraping code, scripting functions that can be used to extract and transform Web content, and database interfaces that can store the scraped data in local databases.
Semantic annotation recognizing: Web pages may embrace metadata or semantic markup/annotations which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformats do, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the Web pages, so the scrapers can retrieve the data schema and instructions from this layer before scraping the pages.
And before I continue, please keep in mind the legal implications of all this. I don't know whether it's compliant with Gmail's terms of service, and I would recommend checking them before moving forward. You might also end up being blacklisted or run into other issues.
All that being said, I'd say that in your case you need some kind of spider and DOM parser to log into Gmail and find the data you want. The choice of tool will depend on your technology stack.
As a Ruby dev, I like using Mechanize and Nokogiri. In PHP you could take a look at solutions like Sphider.
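For the PHP route, a minimal sketch of the DOM-parsing step might look like the following. It assumes the page is reachable without authentication and that the element really is a span with id "quota" as in your example; in reality Gmail sits behind a login, so a session/cookie step would come first.

<?php
// Sketch only: fetch a page and extract the text of <span id="quota">...</span>.
$html = file_get_contents('http://gmail.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node  = $xpath->query('//span[@id="quota"]')->item(0);

if ($node !== null) {
    $megabytes = (float) trim($node->textContent);
    echo "Current quota: $megabytes MB\n";
}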

Initially I thought this was not possible, assuming the number was generated by JavaScript.
But if you switch off JavaScript, the number is still there in the span tag; a JavaScript function probably just increments it at a regular interval.
So you can use cURL, fopen, etc. to read the contents from the URL, then parse the contents for this value and store it in the database. Set this up as a cron job to do it on a regular basis.
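A rough sketch of that idea in PHP (cURL plus a regular expression, then an INSERT), assuming a made-up table quota_history with columns grabbed_at and megabytes:

<?php
// Sketch: fetch the page with cURL, pull out the number, store it in MySQL.
// Table and column names are placeholders.
$ch = curl_init('http://gmail.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

if ($html !== false && preg_match('/<span id=["\']?quota["\']?>([\d.]+)<\/span>/', $html, $m)) {
    $pdo  = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'password');
    $stmt = $pdo->prepare('INSERT INTO quota_history (grabbed_at, megabytes) VALUES (NOW(), ?)');
    $stmt->execute([$m[1]]);
}

Saved as grab_quota.php, a crontab entry such as */5 * * * * php /path/to/grab_quota.php would sample the value every five minutes.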
There are many references on how to do this, including on SO. If you get stuck, just open another question.
Warning: Google has ways of finding out if its apps are being scraped, and it will block your IP for a certain period of time. Read the Google small print. It's happened to me.

One way I can see you doing this (which may not be the most efficient way) is to use PHP and YQL (from Yahoo!). With YQL, you can specify the webpage (www.gmail.com) and the XPath to get you the value inside the span tag. It's essentially web scraping, but YQL provides a nice way to do it in maybe 4-5 lines of code.
You can wrap this whole thing inside a function that gets called every x seconds, or whatever time period you are looking for.
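A hedged sketch of what that could look like in PHP: the YQL endpoint, the html table, and the exact shape of the JSON response are assumptions from memory, so check the current YQL documentation before relying on them.

<?php
// Sketch: ask YQL to fetch the page and return only the XPath match as JSON.
$yql  = 'select * from html where url="http://gmail.com/" and xpath=\'//span[@id="quota"]\'';
$url  = 'http://query.yahooapis.com/v1/public/yql?q=' . urlencode($yql) . '&format=json';
$data = json_decode(file_get_contents($url), true);

// The nesting below depends on what YQL actually returns for this page.
$quota = $data['query']['results']['span'] ?? null;
var_dump($quota);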

Leaving aside the legality issues in this particular case, I would suggest the following:
When you find yourself attacking something impossible, stop and think about where the impossibility comes from, and whether you have chosen the correct approach.
Do you really think anyone in their right mind would issue a new HTTP request, or even worse hold an open Comet connection, just to check whether the shared storage counter has grown? For an anonymous user? Just look for the function that computes a value from some initial value and the current time.
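In other words, the page most likely ships a base value and a growth rate and lets the client do the arithmetic. A toy reconstruction in PHP, with invented constants, just to illustrate the idea:

<?php
// Hypothetical "counting" quota: base value plus elapsed time times a rate.
// All three constants below are made up for illustration.
$baseMegabytes   = 2757.0;                        // quota at some reference moment
$referenceTime   = strtotime('2009-01-01 00:00:00');
$growthPerSecond = 0.0000042;                     // invented growth rate

$currentQuota = $baseMegabytes + (time() - $referenceTime) * $growthPerSecond;
printf("Over %.6f megabytes (and counting)\n", $currentQuota);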

Related

elm and database transactions

I have a static website where all the content is rendered by elm.
Right now all the data is hard-coded into the elm source code. In the future I would like to add a small amount of database interaction to the project.
The web server I use supports MySQL databases and PHP.
I was thinking it would be nice to be able to use the get function in the elm Http package to point to a php script on the server, which would query the database, and return json data that my elm program could interpret and render.
I would like to know if:
This approach is possible
There is a better (more convenient or correct) way to do this
What you describe is a good way to do it. See the chapter of elm-tutorial that covers this: http://www.elm-tutorial.org/080_fetching_resources/cover.html
As an alternative you could seed the data in the html and pass it via ports.
That approach is very much possible (I do the same to access TCP connections on my server by using a GET request to a CGI module on the same server as the web page).
This is, as far as I know, the best way to do this for all client side pages. I work for a company and we use PHP, Node and MySQL, with about half of the scripts in Node and the other half in PHP, all of them just talk between the front end and the database.
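On the PHP side, the endpoint that Elm's Http.get would call can stay very small. A sketch, assuming a hypothetical posts table and PDO credentials you would replace with your own:

<?php
// posts.php - returns rows from a `posts` table as JSON for an Elm decoder.
header('Content-Type: application/json');

$pdo = new PDO('mysql:host=localhost;dbname=mysite;charset=utf8', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$rows = $pdo->query('SELECT id, title, body FROM posts ORDER BY id DESC')
            ->fetchAll(PDO::FETCH_ASSOC);

echo json_encode($rows);

Elm then only needs a decoder matching the id/title/body fields and an Http.get pointing at /posts.php.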

Using Data Scraping Scripts The Right Way

I am playing around with data scraping scripts. For now I am starting with PHP/cURL. The first reason I am interested in learning this is to understand how these scripts are designed, so I can protect my own websites against the sneaky, malicious ones. The second reason is to design mine to act like a human, to avoid placing undue burden on a website owner's server.
If I use this in real life, it would simply be to automate what I currently do manually. I don't want to abuse the process; I am just a bit lazy and would rather not do it by hand.
To perform like a human:
1) Send headers that look like a browser's.
2) Send a referrer that represents the source of the link (a sequence of pages).
3) Randomize the delay between page fetches, similar to how a human would browse.
4) Clear cookies when done. (I have to learn more about this; I'm not sure how cookies function in a web-scraper environment.)
If the tools above are used correctly, is IP proxy switching necessary? Are there any other considerations I should be aware of? Still learning about this, so just curious at this point.
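For what it's worth, here is a minimal cURL sketch of points 1-4 above; the user-agent string, referrer, delay range, and target URL are all placeholders.

<?php
// "Polite" fetch: browser-like header, a referrer, a random delay,
// and a throwaway cookie jar.
function politeFetch($url, $referrer)
{
    sleep(rand(3, 10));                              // 3) randomized delay per fetch

    $cookieJar = tempnam(sys_get_temp_dir(), 'ck');  // 4) temporary cookie store
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0', // 1)
        CURLOPT_REFERER        => $referrer,         // 2) where the "click" came from
        CURLOPT_COOKIEJAR      => $cookieJar,
        CURLOPT_COOKIEFILE     => $cookieJar,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    unlink($cookieJar);                              // 4) clear cookies when done

    return $html;
}

$page = politeFetch('http://example.com/listing', 'http://example.com/');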

Many HTTP requests (API) vs. everything in a single PHP file

I need some advice on website design.
Let's take Twitter as an example for my question. Let's say I am building Twitter. On home_page.php I need both data about tweets (tweet id, who tweeted, tweet time, etc.) and data about the user (userId, username, user profile pic).
Now to display all this, I have two options in mind:
1) Making separate php files like tweets.php and userDetails.php. By using AJAX queries, I can get the data on the home_page.php.
2) Adding all the php code (connecting to db, fetching data ) in the home_page.php itself.
With option one, I need to make many HTTP requests, which (I think) will put load on the network, so it might slow down the website.
On the other hand, with option one I will have a defined REST API, which will be good for adding more features in the future.
Please give me some advice on picking the best approach. I am still a learner, so if there are other ways of implementing this, please share.
With option 1 you're reliant on JavaScript, which doesn't follow progressive enhancement or graceful degradation; if a user doesn't have JS they will see zero content, which is obviously bad.
Split your code into manageable PHP files to make it easier to read and require them all in one main PHP file; this won't cost any extra HTTP requests because all the includes happen server side and one page is sent back.
You can add additional JavaScript to grab more "tweets" like Twitter does, but don't make the main functionality rely on JavaScript.
Don't think of PHP applications as a collection of PHP files that map to different URLs. A single PHP file should handle all your requests and include functionality as needed.
In network programming, it's usually good to minimize the number of network requests, because each request introduces an overhead beyond the time it takes for the raw data to be transmitted (due to protocol-specific information being transmitted and the time it takes to establish a connection for example).
Don't rely on JavaScript. JavaScript can be used for usability enhancements, but must not be used to provide essential functionality of your application.
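A sketch of the single-entry-point idea from the two answers above, with made-up file and function names:

<?php
// home_page.php - one request, one response; the data-access code lives in
// included files rather than behind extra HTTP calls. All names are placeholders.
session_start();

require_once 'db.php';           // defines getConnection(): PDO
require_once 'tweets.php';       // defines getLatestTweets($pdo, $limit)
require_once 'userDetails.php';  // defines getUser($pdo, $userId)

$pdo    = getConnection();
$tweets = getLatestTweets($pdo, 20);
$user   = getUser($pdo, $_SESSION['user_id'] ?? 0);

// Render everything server side; JavaScript can still fetch more tweets later.
foreach ($tweets as $tweet) {
    printf("<p>%s: %s</p>\n",
        htmlspecialchars($tweet['username']),
        htmlspecialchars($tweet['text']));
}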
Adding to Kiee's answer:
It can also depend on the size of your content. If your tweets and user info are very large, the response from the single PHP file will take considerable time to prepare and deliver. In that case you should go for a "minimal viable response" (i.e. the last 10 tweets plus the 10 most popular users, or similar).
But one thing you will definitely have to do is create an API to bring your page to life, no matter which approach you use...

Changing web content based on browser type

I'm writing a web application and I'd like to work out what type of browser/OS the request is coming from, and customise the returned content accordingly. So if someone visits the site from an iPhone/Android, they get a more streamlined experience, or if it's a desktop, they get the full version. I will pretty much take a completely different path, rather than try to mix the content together.
What is the recommended approach for this in ASP.NET/IIS and PHP? Is there a single place I can catch incoming HTTP requests, make a decision, then redirect? Or is this usually done on a page by page case? Any gotchas I should look out for?
Edit: A good point was made to make sure there is a link to the full version on the reduced version. That's a good point, but it raises the problem that once the user makes this choice, all future redirections have to point to the full version. I'd really rather be doing all of this in one place.
Cheers,
Shane
ASP.NET has a built-in browser detection mechanism. It's driven by a fully extensible collection of XML files (*.browser) that contain regular expressions for matching the incoming User-Agent string and associated properties for the matched agents.
You can access the properties from the Request.Browser object; you can also tag control properties based on browser specifics.
There's a bunch of info on the Web about this -- I also cover it in detail in my book: Ultra-Fast ASP.NET.
Not a direct answer, but it's worth checking out CSS media types. You can specify the handheld type to streamline the page for phones and other small-screen devices.
http://www.w3.org/TR/CSS21/media.html
You could take a look at the User-Agent header in the HTTP request and redirect accordingly.
In PHP that would be $_SERVER['HTTP_USER_AGENT'].
You should however watch out that you don't write a lot of duplicate code when doing this.
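A sketch of that in PHP, with a deliberately crude pattern and a cookie so the "full site" choice sticks across later requests (all names are placeholders):

<?php
// Front-controller snippet: send small-screen browsers to a mobile version
// unless they have explicitly asked for the full site.
$ua       = $_SERVER['HTTP_USER_AGENT'] ?? '';
$isMobile = preg_match('/iPhone|Android|Mobile|Opera Mini/i', $ua);

if (isset($_GET['full'])) {
    setcookie('full_site', '1', time() + 30 * 24 * 3600, '/'); // remember the choice
}
$wantFull = isset($_COOKIE['full_site']) || isset($_GET['full']);

if ($isMobile && !$wantFull) {
    header('Location: /mobile/index.php');
    exit;
}
// ...otherwise fall through and render the full version, which should link
// to ?full=1 so mobile users can switch and stay switched.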
For ASP.NET applications you can check out the Global.asax file and Session_BeginRequest event.
You should probably look at Conditional Comments:
http://msdn.microsoft.com/en-us/library/ms537512%28VS.85%29.aspx

Light Blogging system sans database

This is a general programming question.
What is the best way to make a light blogging system that can handle images, bbcode-ish styling and text without a database back end? Light means not more than 50 to 100 posts in extreme cases.
What language(s) should be used? Is there any preferred data format for the information? How does security play out?
EDIT: Client has no database, is on a shared server. Can't change that. Therefore, no DB.
EDIT2:
Someone mentioned SQL Compact - does that require anything more than copying files to the server? The key here is again that things shouldn't require any more permissions than FTP access.
If you're looking to do it yourself, store each post as a file in a directory. Then, to sort and limit the posts, you rely partly on the file names to order and limit them, and potentially (in the case of a search) on reading every last file. Don't go letting users make 10,000 posts, though. But yeah, the above is considered a flat-file data format. You can get fancy by using a standard format like JSON, YAML, or XML within each post file, and even fancier by requesting these with Ajax calls in mostly client-side code.
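A sketch of that flat-file layout in PHP, assuming one JSON file per post named like 2009-03-14-my-title.json so the filenames sort chronologically:

<?php
// List the most recent posts from a posts/ directory of JSON files.
// The filename convention (YYYY-MM-DD-slug.json) is an assumption of this sketch.
$files = glob(__DIR__ . '/posts/*.json');
rsort($files);                          // newest first, thanks to the date prefix

foreach (array_slice($files, 0, 10) as $file) {
    $post = json_decode(file_get_contents($file), true);
    printf("<h2>%s</h2>\n<div>%s</div>\n",
        htmlspecialchars($post['title']),
        $post['html']);                 // body already converted from bbcode-ish markup
}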
Now, if the reason you want to work with flat files is that you just don't want to install a database server, there's nothing stopping you from reading a local (to the server) file as a Berkeley DB, a Lucene index, or an SQLite DB from within your webapp using the appropriate client library. You'll find any of these approaches a little more sane (a bit faster, a bit more readable in code) than the aforementioned, with all the same requirements for installing on the server (read-write file permissions). Many web frameworks and languages (like PHP) come with an API to these client libraries; SQLite and Lucy (C Lucene) particularly.
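For the SQLite route in particular, PHP's PDO driver needs nothing more than write permission on a single file, which fits the FTP-only constraint:

<?php
// A single-file database: no server to install, just a .sqlite file
// sitting next to the scripts (PHP creates it if it doesn't exist).
$db = new PDO('sqlite:' . __DIR__ . '/blog.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS posts (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title      TEXT NOT NULL,
    body       TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)');

$stmt = $db->prepare('INSERT INTO posts (title, body) VALUES (?, ?)');
$stmt->execute(['Hello world', 'First post, stored without a database server.']);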
If you're just looking for examples of it being done, I first (I think 1999 or 2000) came across blosxom which is a perl script that either runs as a cgi script per request or as a cron job. It builds a dated index of "posts" based on whatever you throw into the directory it's meant to scan. It also builds an RSS feed.
Jekyll or Blogofile are my favorite kind of solution for that, "compiling pages before upload".
I'm going to go out on a limb here and say that it's not always the destination, but the Journey.
If you're going to set out to do this, I recommend using a language you are comfortable with. Personally, this would be C#/.NET for me, but from your tagging, I'll assume PHP would be the server-side scripting language you would choose.
I would lay out how I want my application to behave. If there is going to be a lot of data, you should consider (as dlamblin mentioned) a DB of some sort for lookup and retrieval. (A light blog isn't much data... but 1000 users can edit? Then maybe you should consider a DB.) Once you've decided how to store the data, decide how to present it.
Write some proof of concept code for each of the features you want to implement (blog templating, bbcode, user authentication, text searching...) and start to work them all together.
Search for flat-file CMSes on Google, for example:
http://www.flatcms.org/
This has already been done, so there is no need to create such a CMS again; there are plenty of them.
I concur with dusoft that this has already been done.
DotNetBlogEngine.net is an ASP.NET (C#) based blogging system that has a nice XML back-end as an option.
Doesn't answer your question directly but check Unify.
If you do not want to write a new one or want to get some inspiration:
Flatpress
Simple PHP Blog
Ninja Designs are working on a DB-free WordPress clone
You could either use XML, or use SQL Server Compact (which lets you handle things much like SQL Server, but uses simple files instead of a database server).
