I have a PHP/MySQL site that is no longer going to get any new content added. But I'd like to keep what I do have as an archive and keep it online. Ideally I'd like to convert it to a static site so that it no longer requires a database.
If anyone else has gone through this process, are there any tools, scripts, or methodologies that can automate this or at least make it easier? I'd want to be able to do things like make sure that all the links still work (so they'd have to somehow be converted to point correctly to the new static versions).
I have ssh access to the server in question. I'm relatively comfortable with both PHP and Python so tools using those languages would be ideal.
Note: there are two basic reasons I'm doing this:
cost, as it's much cheaper to host just a collection of static files than a dynamic website (I'm using NearlyFreeSpeech and with the bandwidth I'm using I estimate my costs would go down to well under $1/month).
spammers have somehow found my site and keep signing up for accounts (at which point, they're blocked from making comments anyway, but it's still annoying).
If you have shell access to any linux machine (perhaps even your own webserver would suffice), I'd recommend that you just spider and download a mirror of your own site using wget. Wget is a utility which is designed to mirror sites as flat files, and it has been in use for quite some time. I believe it should serve you well:
http://www.gnu.org/software/wget/manual/wget.html
I hope that's helpful.
Chris
I have recently used the following to good effect:
wget --mirror -w 2 -p --html-extension --convert-links -P folder_to_save_to http://mysite.com
You might need to use the full path to the wget binary. This will convert all the links so that your site is fully static and self-contained.
Using PHP you could write a simple script that would do this (a rough sketch follows the list):
1. Save the current page.
2. Follow the links from that page and save those pages (repeating from step 1 for each).
3. Replace the URLs on the current page with ones pointing to the saved pages.
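Here is a rough sketch of that approach, assuming the whole site lives under a single base URL and that links are plain href attributes; the file naming scheme and helper names are made up for illustration:

<?php
// Rough sketch only: a real crawler also needs to handle images, CSS, and
// error cases (wget's --mirror --convert-links already does all of this).
$base   = 'http://mysite.com';                 // placeholder site root
$outDir = __DIR__ . '/static';
$seen   = array();

// Map a URL to a flat file name, e.g. /post.php?id=3 -> post_php_id_3.html
function url_to_file($url)
{
    $path  = parse_url($url, PHP_URL_PATH);
    $query = parse_url($url, PHP_URL_QUERY);
    $name  = trim(preg_replace('/[^a-z0-9]+/i', '_', $path . '_' . $query), '_');
    return ($name === '' ? 'index' : $name) . '.html';
}

function save_page($url, $base, $outDir, &$seen)
{
    if (isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;

    $html = file_get_contents($url);           // step 1: fetch the current page
    if ($html === false) {
        return;
    }

    // steps 2 and 3: collect internal links and rewrite them to the static names
    preg_match_all('/href="([^"#]+)"/i', $html, $matches);
    $queue = array();
    foreach ($matches[1] as $link) {
        $abs = (strpos($link, 'http') === 0)
            ? $link
            : rtrim($base, '/') . '/' . ltrim($link, '/');
        if (strpos($abs, $base) !== 0) {
            continue;                          // leave external links untouched
        }
        $html    = str_replace('href="' . $link . '"', 'href="' . url_to_file($abs) . '"', $html);
        $queue[] = $abs;
    }

    if (!is_dir($outDir)) {
        mkdir($outDir, 0755, true);
    }
    file_put_contents($outDir . '/' . url_to_file($url), $html);

    foreach ($queue as $next) {                // repeat from step 1 for each link
        save_page($next, $base, $outDir, $seen);
    }
}

save_page($base . '/', $base, $outDir, $seen);

In practice the wget command in the answer above does all of this, plus images and stylesheets, far more robustly.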
I am developing an application in the Kohana PHP framework that assesses performance. The end result of the process is a webpage listing the overall score and a color-coded list of divs and results.
The original idea was to offer the option to save this as a non-editable PDF file and email it to the user. After further research I have found this to be not as straightforward as I hoped.
The best solution seemed to be installing the Unix application wkhtmltopdf, but as the destination is shared hosting I am unable to install this on the server.
My question is: what's the best option for giving the user a non-editable record of the assessment?
Thank you for your help with this.
I guess the only way to generate a snapshot, or review as you call it, is to store it on the server side and only grant access via a read-only protocol. So basically by offering it as a 'web page'.
Still, everyone can save and modify the markup. But that is the case for every file you generate, regardless of the type of file. OK, maybe except for DRM-infected files. But you don't want to do that, trust me.
Oh, and you could also print the files. Printouts are pretty hard to edit, though even that is not impossible...
I found a PHP version that is pre-built as a Kohana Module - github.com/ryross/pdfview
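If you'd rather not depend on a framework module, a pure-PHP HTML-to-PDF library such as dompdf can also run on shared hosting, since it needs no binary installed on the server. A minimal sketch, assuming dompdf was installed via Composer (the HTML and file name are placeholders):

<?php
// Sketch only: dompdf renders HTML to PDF entirely in PHP.
require 'vendor/autoload.php';          // assumes installation via Composer

use Dompdf\Dompdf;

$html = '<h1>Assessment results</h1><p>Overall score: 87%</p>'; // your rendered view

$dompdf = new Dompdf();
$dompdf->loadHtml($html);
$dompdf->setPaper('A4', 'portrait');
$dompdf->render();

// Save the PDF so it can be attached to the email sent to the user.
file_put_contents('assessment.pdf', $dompdf->output());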
I've built a CMS (using the Codeigniter PHP framework) that we use for all our clients. I'm constantly tweaking it, and it gets hard to keep track of which clients have which version. We really want everyone to always have the latest version.
I've written it in a way so that updates and upgrades generally only involve uploading the new version via FTP and deleting the old one - I just don't touch the /uploads or /themes directories (everything specific to the site is either there or in the database). Everything is a module, and each module has its own version number (as does the core CMS), as well as an install and uninstall script for each version, but I have to manually FTP the files first, then run the module's install script from the control panel. I wrote and will continue to write everything personally, so I have complete control over the code.
What I'd like is to be able to upgrade the core CMS and individual modules from the control panel of the CMS itself. This is a "CMS for Dummies", so asking people to FTP or do anything remotely technical is out of the question. I'm envisioning something like a message popping up on login, or in the list of installed modules, like "New version available".
I'm confident that I can sort out most of the technical details once I get this going, but I'm not sure which direction to take. I can think of ways to attempt this with cURL (to authenticate and pull source files from somewhere on our server) and PHP's native filesystem functions like unlink(), file_put_contents(), etc. to perform the actual updates to files, or stuff the "old" CMS in a backup directory and set up the new one, but even as I'm writing this post it sounds like a recipe for disaster.
I don't use git/github or anything, but I have the feeling something like that could help? How should (or shouldn't) I approach this?
There are a bunch of ways to do this, but the least complicated is just to have Git installed on your client servers and set up a cron job that runs a git pull origin master every now and then (an example entry follows). If your application uses migrations it should be easy as hell to do.
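For example, a crontab entry on each client server might look something like this (the path and schedule are placeholders):
*/30 * * * * cd /var/www/client-site && git pull origin master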
You can do this as it sounds like you are in full control of your clients. For something like PyroCMS or PancakeApp that doesn't work because anyone can have it on any server and we have to be a little smarter. We just download a ZIP which contains all changed files and a list of deleted files, which means the file system is updated nicely.
We have a list of installations which we can ping with an HTTP request so the system knows to run the download, or the client can hit "Upgrade" when they log in.
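A rough sketch of that ZIP-plus-deleted-list approach in PHP (the URLs and the plain-text manifest format are invented for illustration, not how PyroCMS actually does it):

<?php
// Hypothetical self-update: download a zip of changed files, extract it over
// the installation, then delete files the new version no longer ships.
$updateZipUrl = 'https://updates.example.com/cms/latest.zip';   // made-up URL
$deletedUrl   = 'https://updates.example.com/cms/deleted.txt';  // made-up URL
$appRoot      = dirname(__FILE__);

// Download the archive of changed files.
$tmpZip = tempnam(sys_get_temp_dir(), 'cmsupdate');
file_put_contents($tmpZip, file_get_contents($updateZipUrl));

// Extract it over the existing installation.
$zip = new ZipArchive();
if ($zip->open($tmpZip) === true) {
    $zip->extractTo($appRoot);
    $zip->close();
}

// Remove files listed in the "deleted files" manifest.
foreach (array_filter(array_map('trim', file($deletedUrl))) as $relativePath) {
    $target = $appRoot . '/' . $relativePath;
    if (is_file($target)) {
        unlink($target);
    }
}

unlink($tmpZip);
// A real updater should also verify a checksum/signature, back up the old
// files first, and run the module's install/migration scripts afterwards.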
You can use Git from your CMS via Glip, a Git implementation in pure PHP. The cron job would then just request a URL on your own system, without installing Git on the client server.
@Obsidian Wouldn't a DNS poisoning attack also compromise most of the methods mentioned in this thread?
Additionally, SSH could be compromised by a man-in-the-middle attack as well.
While total paranoia is a good thing when dealing with security, WordPress being a GPL codebase would make it easy to detect an unauthorized code change if such an attack did occur, so resolution would be easy.
SSH and Git do sound like a good solution, but what is the intended audience?
Have you taken a look at how WordPress does it?
That would seem to do what you want.
Check this page for a description of how it works.
http://tech.ipstenu.org/2011/how-the-wordpress-upgrade-works/
There exist numerous solutions for generating a thumbnail or an image preview of a webpage. Some of these solutions are web-based, like websnapshots; some are Windows libraries, such as PHP's imagegrabscreen (which only works on Windows); and there's KDE's wkhtml. Many more exist.
However, I'm looking for a GUI-less solution. Something I can create an API around and link to PHP or Python.
I'm comfortable with Python, PHP, C, and shell. This is a personal project, so I'm not interested in commercial applications, as I'm aware of their existence.
Any ideas?
You can run a web browser or web control within Xvfb, and use something like ImageMagick's import to capture it.
I'll never get back the time I wasted on wkhtml and Xvfb, along with the joy of embedding a monolithic binary from Google onto my system. You can save yourself a lot of time and headache by abandoning wkhtml2whatever completely and installing phantom.js. Once I did that, I had five lines of shell code and beautiful images in no time.
I had a single problem: using ww instead of www in a URL caused the process to fail without meaningful error messages. Eventually I saw the DNS lookup problem, and my faith was restored.
But seriously, every other avenue of thumbnailing seemed to be out of date and/or buggy.
phantom.js = it changed my life.
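For what it's worth, here is roughly what driving phantom.js from PHP can look like: a tiny capture script is written out and run with shell_exec. This is only a sketch; it assumes the phantomjs binary is on the PATH, and the file names are made up:

<?php
// Render a page to a PNG with phantom.js from PHP.
$url = 'http://example.com/';
$out = '/tmp/snapshot.png';

// Minimal PhantomJS capture script, written out to a temporary file.
$js = <<<'JS'
var page = require('webpage').create();
var args = require('system').args;            // args[1] = url, args[2] = output
page.viewportSize = { width: 1024, height: 768 };
page.open(args[1], function () {
    page.render(args[2]);
    phantom.exit();
});
JS;

$script = tempnam(sys_get_temp_dir(), 'capture');
file_put_contents($script, $js);

shell_exec('phantomjs ' . escapeshellarg($script) . ' '
    . escapeshellarg($url) . ' ' . escapeshellarg($out));

unlink($script);
// $out now holds the screenshot; scale it down with GD or ImageMagick if you
// need an actual thumbnail.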
I need to make snapshots of web pages programmatically using PHP and get them into an HTML e-mail.
I tried wget --page-requisites. It downloads everything all right, but it doesn't change the HTML page's source code to point to the downloaded files rather than the online originals. Also, that HTML is of course a long way from displaying properly in an HTML e-mail.
I am interested to know whether there are ready-made solutions for this. I would already be happy with a solution that takes an HTML snapshot and changes the HTML accordingly. Being able to e-mail it would be the icing on the cake.
I control the web pages being snapshot, so I have the possibility to adjust the content to optimize the results.
My server-side platform is PHP, but the settings are very liberal, so I can execute things like wget and Perl scripts from within PHP. However, I do not have root access and cannot install additional packages or programs.
The task is to make a snapshot of a product page each time somebody places an order, so there is documentation about what the page looked like at the time.
wget has a -k (--convert-links) option, which will convert both links and references to embedded content (like images). See e.g. the section on advanced usage in the wget manual.
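For example, something like this (the URL is a placeholder for your product page):
wget --page-requisites --convert-links http://example.com/product-page.php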
For the email part of your question: I'm sure you can use one of the existing libraries. For example, PHP has a PEAR package (I don't remember the exact name) to handle HTML emails; I'm pretty sure both Perl and Python have something similar.
In this case, what you are trying to do with wget is website mirroring. A simpler solution is to use httrack, a command-line tool. It's very powerful and configurable; try it!
The httrack website presents a GUI, but you don't need it; everything is possible from the command line (or from PHP).
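A typical invocation looks something like this (the URL and output directory are placeholders):
httrack "http://www.example.com/" -O "/home/user/mirror" "+*.example.com/*" -v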
I've bumped into a problem while working on a project. I want to "crawl" certain websites of interest and save them as "full web pages", including styles and images, in order to build a mirror of them. It has happened to me several times that I bookmarked a website to read later, and a few days later the website was down because it got hacked and the owner didn't have a backup of the database.
Of course, I can read the files with PHP very easily with fopen("http://website.com", "r") or fsockopen(), but the main goal is to save the full web pages so that, in case a site goes down, it can still be available to others, like a "programming time machine" :)
Is there a way to do this without reading and saving each and every link on the page?
Objective-C solutions are also welcome since I'm trying to figure out more of it also.
Thanks!
You actually need to parse the HTML and all the CSS files that are referenced, which is NOT easy. However, a fast way to do it is to use an external tool like wget. After installing wget you could run the following from the command line:
wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://example.com/mypage.html
This will download mypage.html and all linked CSS files, images, and the images referenced inside the CSS.
After installing wget on your system you could use PHP's system() function to control wget programmatically (see the example below).
NOTE: You need at least wget 1.12 to properly save images that are referenced through CSS files.
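A minimal example of driving wget from PHP, as mentioned above (the URL and output directory are placeholders):
system('wget --convert-links --page-requisites --no-parent -P ' . escapeshellarg('/var/mirrors/example') . ' ' . escapeshellarg('http://example.com/mypage.html'));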
Is there a way to do this without reading and saving each and every link on the page?
Short answer: No.
Longer answer: if you want to save every page in a website, you're going to have to read every page in a website with something on some level.
It's probably worth looking into the Linux app wget, which may do something like what you want.
One word of warning: sites often have links out to other sites, which have links to other sites, and so on. Make sure you put some kind of "stop if different domain" condition in your spider!
If you prefer an Objective-C solution, you could use the WebArchive class from WebKit.
It provides a public API that allows you to store whole web pages as a .webarchive file (like Safari does when you save a webpage).
Some nice features of the webarchive format:
completely self-contained (incl. CSS, scripts, images)
QuickLook support
easy to decompose
Whatever app is going to do the work (your code, or code that you find) is going to have to do exactly that: download a page, parse it for references to external resources and links to other pages, and then download all of that stuff. That's how the web works.
But rather than doing the heavy lifting yourself, why not check out curl and wget? They're standard on most Unix-like OSes, and do pretty much exactly what you want. For that matter, your browser probably does, too, at least on a single page basis (though it'd also be harder to schedule that).
I'm not sure if you need a programming solution to "crawl websites" or personally need to save websites for offline viewing, but if it's the latter, there are great apps for this: Teleport Pro for Windows and SiteCrawler for Mac.
You can use IDM (Internet Download Manager) for downloading full webpages; there's also HTTrack.