I've bumped into a problem while working at a project. I want to "crawl" certain websites of interest and save them as "full web page" including styles and images in order to build a mirror for them. It happened to me several times to bookmark a website in order to read it later and after few days the website was down because it got hacked and the owner didn't have a backup of the database.
Of course, I can read the files with php very easily with fopen("http://website.com", "r") or fsockopen() but the main target is to save the full web pages so in case it goes down, it can still be available to others like a "programming time machine" :)
Is there a way to do this without read and save each and every link on the page?
Objective-C solutions are also welcome since I'm trying to figure out more of it also.
Thanks!
You actually need to parse the html and all css files that are referenced, which is NOT easy. However a fast way to do it is to use an external tool like wget. After installing wget you could run from the command line
wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://example.com/mypage.html
This will download the mypage.html and all linked css files, images and those images linked inside css.
After installing wget on your system you could use php's system() function to control programmatically wget.
NOTE: You need at least wget 1.12 to properly save images that are references through css files.
Is there a way to do this without read and save each and every link on the page?
Short answer: No.
Longer answer: if you want to save every page in a website, you're going to have to read every page in a website with something on some level.
It's probably worth looking into the Linux app wget, which may do something like what you want.
One word of warning - sites often have links out to other sites, which have links to other sites and so on. Make sure you put some kind of stop if different domain condition in your spider!
If you prefer an Objective-C solution, you could use the WebArchive class from Webkit.
It provides a public API that allows you to store whole web pages as .webarchive file. (Like Safari does when you save a webpage).
Some nice features of the webarchive format:
completely self-contained (incl. css,
scripts, images)
QuickLook support
Easy to decompose
Whatever app is going to do the work (your code, or code that you find) is going to have to do exactly that: download a page, parse it for references to external resources and links to other pages, and then download all of that stuff. That's how the web works.
But rather than doing the heavy lifting yourself, why not check out curl and wget? They're standard on most Unix-like OSes, and do pretty much exactly what you want. For that matter, your browser probably does, too, at least on a single page basis (though it'd also be harder to schedule that).
I'm not sure if you need a programming solution to 'crawl websites' or personally need to save websites for offline viewing, but if its the latter, there's a great app for Windows — Teleport Pro and SiteCrawler for Mac.
You can use IDM (internet downloader management) for downloading full webpages, there's also HTTrack.
Related
I read a lot of topics about scripts that compute html and output pdf; I tried lots of them, and I am always disapointed in the results. Lots of them don't consider the external CSS, lots of them can't be executed from shared hosting (need to be installed in some unaccessible places, like DOMPDF), etc. Also, lots of the threads on the question are pretty old (most of them've been asked in 2010).
Question: Is there a simple way to cURL (from a php script) a remote web page and simply save a pdf of the "print" (like in css media print) version of the page, or even a jpeg, or a docx, or anything that "contains" the images and the styling for offline viewing? And more important, can it be free/open source?
All the web browsers do that with no effort. Once on the page, only press ctrl-p and there it goes (almost). Why is it so trivial to find a good script that can do this? Is there a way to emulate a browser, or what...?
Isn't it possible to cURL and force css media print, then take a snapshot of this?
The difficulty to find this seems very strange to me... I feel like it's a quite simple task.
Try to call wkhtmltox from PHP.
wkhtmltox/bin/wkhtmltopdf www.stackoverflow.com stackoverflow.pdf
This PHP library seems to work with wkhtmltox:
http://thejoyofcoding.org/php-wkhtmltox/
This might help:
http://davidbomba.com/php-wkhtmltox/
I am developing an application in the Kohana PHP framework that assesses performance. The end result of the process is a webpage listing the overall scoring and a color coded list of divs and results.
The original idea was to have the option to save this as a non-editable PDF file and email the user. After further research I have found this to be non as straight forward as I hoped.
The best solution seemed to be installing the unix application wkhtmltopdf but as the destination is shared hosting I am unable to install this on the server.
My question is, what's the best option to save a non editable review of the assessment to the user?
Thank you for help with this.
I guess the only way to generate a snapshot, or review how you call it, is by storing it on the server side and only grant access via a read only protocol. So basically by offering it as a 'web page'.
Still everyone can save and modify the markup. But that is the case for every file you generate, regardless of the type of file. Ok, maybe except DRM infected files. But you don't want to do that, trust me.
Oh, and you could also print the files. Printouts are pretty hard to be edited. Though even that is not impossible...
I found a PHP version that is pre-built as a Kohana Module - github.com/ryross/pdfview
I have a PHP/mySQL site that is no longer going to get any new content added. But I'd like to keep what I do have as an archive and keep it online. Ideally I'd like to convert it to a static site so that it no longer requires a database.
If anyone else has gone through this process, are there any tools, scripts, or methodologies that can automate this or at least make this easier? I'd want to be able to do things like make sure that all the links still work (so they'd have to somehow be converted to correctly point to the new static versions), things like that.
I have ssh access to the server in question. I'm relatively comfortable with both PHP and Python so tools using those languages would be ideal.
Note: there are two basic reasons I'm doing this:
cost, as it's much cheaper to host just a collection of static files than a dynamic website (I'm using NearlyFreeSpeech and with the bandwidth I'm using I estimate my costs would go down to well under $1/month).
spammers have somehow found my site and keep signing up for accounts (at which point, they're blocked from making comments anyway, but it's still annoying).
If you have shell access to any linux machine (perhaps even your own webserver would suffice), I'd recommend that you just spider and download a mirror of your own site using wget. Wget is a utility which is designed to mirror sites as flat files, and it has been in use for quite some time. I believe it should serve you well:
http://www.gnu.org/software/wget/manual/wget.html
I hope that's helpful.
Chris
I have recently used the following to good effect:
wget --mirror -w 2 -p --html-extension --convert-links -P folder_to_save_to http://mysite.com
You might need to use the full path to your wget script. This will change all the links so that your site is fully static and self contained.
Using PHP you could write a simple script that would do this:
Save current page.
Follow links from that page and saving those pages (and for each page repeat from 1).
Replace URLs on current page with those leading to saved pages.
Don't think that I'm mad, I understand how php works!
That being said. I develop personal website and I usually take advantage of php to avoid repetion during the development phase nothing truly dynamic, only includes for the menus, a couple of foreach and the likes.
When the development phase ends I need to give the website in html files to the client. Is there a tool (crawler?) that can do this for me instead of visiting each page and saving the interpreted html?
You can use wget to download recursively all the pages linked.
You can read more about this here: http://en.wikipedia.org/wiki/Wget#Recursive_download
If you need something more powerful that recursive wget, httrack works pretty well. http://www.httrack.com/
Pavuk offers much finer control than wget. And will rewrite the URLs in the grabbed pages if required.
If you want to use a crawler, I would go for the mighty wget.
Otherwise you could also use some build tool like make.
You need to create a file nameed Makefile in the same folder of your php files.
It should contain this:
all: 1st_page.html 2nd_page.html 3rd_page.html
1st_page.html: 1st_page.php
php command
2nd_page.html: 2nd_page.php
php command
3rd_page.html: 3rd_page.php
php command
Note that the php command is not preceded by spaces but by a tabulation.
(See this page for the php command line syntax.)
After that, whenever you want to update your html files just type
make
in your terminal to automatically generate them.
It could seem a lot of work for just a simple job, but make is a very handy tool that you will find useful to automate other tasks as well.
Maybe, command line will help?
If you're on windows, you can use Free Download Manager to crawl a web-site.
I need to make snapshots of web pages programmatically using PHP and get them into a HTML E-Mail.
I tried wget --page-requisites. It downloads everything all right, but it doesn't change the HTML page's source code to point to the downloaded files rather than the on-line originals. Also, that HTML is of course a long way from being displayed properly in a HTML E-Mail.
I am interested to know whether there are ready-made solutions for this. I would already be happy with a solution that takes a HTML snapshot and changes the HTML accordingly. Being able to E-Mail it would be the icing on the cake.
I control the web pages being snapshot, so I have the possibility to adjust the content to optimize the results.
My server-side platform is PHP but with very liberal settings, I can execute things like wget and Perl scripts from within PHP. I do however not have root access and can not install additional packages or programs.
The task is to make a snapshot of a product page each time somebody places an order, so there is documentation about what the page looked like at the time.
wget has a -k (--convert-links) option, which will convert both links and references to embedded content (like images). See e.g. wget advanced use (also here).
For the email-part of your question - I'm sure you can use one of the existing libraries. For example, PHP has some PEAR package (do no remember the exact name) to handle HTML emails; I'm pretty sure both Perl and Python have something similar.
In this case, you try to do a website mirroring using wget. The simple solution is to use httrack which is a simple command-line tool. It's very powerful and configurable, try it!
The httrack website presents a GUI, but you don't need it, all is possible from the command-line (or from PHP).