There exist numerous solutions for generating a thumbnail or image preview of a webpage. Some of these solutions are web-based, like websnapshots; some are Windows libraries, such as PHP's imagegrabscreen (which only works on Windows); and there's KDE's wkhtml. Many more exist.
However, I'm looking for a GUI-less solution: something I can build an API around and call from PHP or Python.
I'm comfortable with Python, PHP, C, and shell. This is a personal project, so I'm not interested in commercial applications, as I'm already aware of their existence.
Any ideas?
You can run a web browser or web control within Xvfb and use something like ImageMagick's import to capture it.
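A rough sketch of that approach from PHP, assuming Xvfb, Firefox, and ImageMagick's import are installed and that display :99 is free (the display number, sleep time, and paths are only placeholders):

<?php
// Sketch: render a page inside a virtual X display and grab it with ImageMagick.
// Xvfb, Firefox and `import` must be installed; :99 and the 10-second wait are guesses.
$url = 'http://example.com/';
$out = '/tmp/snapshot.png';
exec('Xvfb :99 -screen 0 1024x768x24 & echo $!', $xvfbPid);              // start the virtual display
exec('DISPLAY=:99 firefox ' . escapeshellarg($url) . ' & echo $!', $ffPid);
sleep(10);                                                               // crude wait for the page to render
exec('DISPLAY=:99 import -window root ' . escapeshellarg($out));         // capture the whole virtual screen
exec('kill ' . (int) $ffPid[0] . ' ' . (int) $xvfbPid[0]);               // clean up both processes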
I'll never get back the time I wasted on wkhtml and Xvfb, along with the joy of embedding a monolithic binary from Google onto my system. You can save yourself a lot of time and headache by abandoning wkhtml2whatever completely and installing phantom.js. Once I did that, I had five lines of shell code and beautiful images in no time.
I had a single problem: using ww instead of www in a URL caused the process to fail without meaningful error messages. Eventually I spotted the DNS lookup problem, and my faith was restored.
But seriously, every other avenue of thumbnailing seemed to be out of date and/or buggy.
phantom.js = it changed my life.
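For anyone who wants the PHP side of that, a minimal sketch, assuming phantomjs is on the PATH and that you use the rasterize.js script shipped in PhantomJS's examples directory:

<?php
// Sketch: capture a page with PhantomJS from PHP.
// Assumes `phantomjs` is installed and examples/rasterize.js is in the working directory.
$url = 'http://www.example.com/';
$out = 'example.png';
exec('phantomjs rasterize.js ' . escapeshellarg($url) . ' ' . escapeshellarg($out), $output, $status);
if ($status !== 0) {
    // PhantomJS can be quiet about failures (e.g. DNS errors), so check the exit code.
    echo "Capture failed with status $status\n";
}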
I have a PHP/mySQL site that is no longer going to get any new content added. But I'd like to keep what I do have as an archive and keep it online. Ideally I'd like to convert it to a static site so that it no longer requires a database.
If anyone else has gone through this process, are there any tools, scripts, or methodologies that can automate this or at least make this easier? I'd want to be able to do things like make sure that all the links still work (so they'd have to somehow be converted to correctly point to the new static versions), things like that.
I have ssh access to the server in question. I'm relatively comfortable with both PHP and Python so tools using those languages would be ideal.
Note: there are two basic reasons I'm doing this:
cost, as it's much cheaper to host just a collection of static files than a dynamic website (I'm using NearlyFreeSpeech and with the bandwidth I'm using I estimate my costs would go down to well under $1/month).
spammers have somehow found my site and keep signing up for accounts (at which point, they're blocked from making comments anyway, but it's still annoying).
If you have shell access to any Linux machine (perhaps even your own web server would suffice), I'd recommend that you just spider and download a mirror of your own site using wget. wget is a utility designed to mirror sites as flat files, and it has been in use for quite some time. I believe it should serve you well:
http://www.gnu.org/software/wget/manual/wget.html
I hope that's helpful.
Chris
I have recently used the following to good effect:
wget --mirror -w 2 -p --html-extension --convert-links -P folder_to_save_to http://mysite.com
You might need to use the full path to the wget binary. This will change all the links so that your site is fully static and self-contained.
Using PHP you could write a simple script that would do this:
1. Save the current page.
2. Follow the links from that page and save those pages (then repeat from step 1 for each of them).
3. Replace the URLs on the current page with ones pointing to the saved pages.
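A rough sketch of that loop, assuming DOMDocument is available and allow_url_fopen is on; it only follows relative links and rewrites .php links to .html (error handling omitted):

<?php
// Sketch of the three steps above: save a page, queue its internal links,
// and rewrite those links to point at the static copies.
$base  = 'http://mysite.com/';
$queue = array('index.php');
$seen  = array();
while ($queue) {
    $page = array_shift($queue);
    if (isset($seen[$page])) {
        continue;
    }
    $seen[$page] = true;
    $html = file_get_contents($base . $page);              // step 1: fetch the current page
    $doc  = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $a) {      // step 2: follow links on the page
        $href = $a->getAttribute('href');
        if ($href !== '' && strpos($href, 'http') !== 0 && strpos($href, '#') !== 0) {
            $queue[] = $href;
            $a->setAttribute('href', preg_replace('/\.php$/', '.html', $href));  // step 3: rewrite the link
        }
    }
    file_put_contents(preg_replace('/\.php$/', '.html', $page), $doc->saveHTML());
}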
Hello, I am wondering what processes are involved in converting files online. What programming languages are required? Basically, I am wondering how the files on Scribd, Issuu, and SlideShare are converted.
Thank you..!
There is a media-processing library called FFmpeg. It will read, write, and convert pretty much any common media format. It can resize, crop, scale, resample, etc., all the files it can load, meaning videos can be shrunk in size, or whatever else you might want to do!
The great thing is, if you have PHP and FFmpeg installed on a server, you can use PHP's exec() command to convert/modify/save videos.
A little note: beware of using exec() with any commands that are influenced by what is sent to the server, like frame sizes, file names, etc.; attackers can latch onto them and mess your server up!
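As a hedge against that, here's a minimal sketch of calling ffmpeg through exec() with the user-supplied parts escaped or cast (the paths and the scale filter are only for illustration):

<?php
// Sketch: convert an uploaded video with ffmpeg, escaping anything user-supplied.
// Assumes the ffmpeg binary is installed and on the PATH.
$input  = $_FILES['video']['tmp_name'];      // user-supplied, never trust it
$output = '/var/www/converted/' . uniqid() . '.mp4';
$width  = (int) $_POST['width'];             // cast to int so it cannot carry shell syntax
$cmd = sprintf('ffmpeg -i %s -vf scale=%d:-2 %s 2>&1',
    escapeshellarg($input), $width, escapeshellarg($output));
exec($cmd, $out, $status);
if ($status !== 0) {
    error_log("ffmpeg failed: " . implode("\n", $out));
}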
To get around the problem of pathetic people (script kiddies, etc.), try ffmpeg-php. I don't know how popular this is on web hosts, but I haven't seen it in many places in the wild.
James
EDIT
Unfortunately not. FFmpeg is primarily for video/audio conversion. However, there is a program called pdf2swf (part of SWFTools) that should do the trick; its man page describes the options.
The only concern about pdf2swf is whether your web host actually has it installed on their server. If you have a VPS or a dedicated server, that's no problem, but if you're on shared hosting and have no access to the root filesystem, this is where issues arise: you can't install pdf2swf if you don't have it.
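If it is installed, wiring pdf2swf up from PHP looks much like the FFmpeg example; a sketch with made-up paths:

<?php
// Sketch: turn a PDF into a Flash file with pdf2swf (from SWFTools).
// Assumes pdf2swf is installed on the host; the paths are only illustrative.
$pdf = '/var/www/uploads/document.pdf';
$swf = '/var/www/converted/document.swf';
exec('pdf2swf ' . escapeshellarg($pdf) . ' -o ' . escapeshellarg($swf) . ' 2>&1', $out, $status);
if ($status !== 0) {
    error_log("pdf2swf failed: " . implode("\n", $out));
}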
I've been taken on board to work on a PHP-based web application. One part of the application generates thumbnail images for MS Office documents on demand, and it uses MS Office + the VeryPDF docprint utility to do this. Because of this one requirement, the system is running on Windows Server 2003 + IIS.
I would prefer to have the system running on a Linux server, rather than MS, as I have far more experience in administering Linux systems than Windows and we have no other in-house technical staff.
Does anyone know a way to handle the document conversion using native Linux software? I would love something PHP native, but am willing to look outside that if necessary.
I have never done anything like this, so I'm just throwing an idea off the top of my head.
Have you thought about utilizing OpenOffice's capabilities to create thumbnail images? I know OO saves thumbnail images within a created document, so all you need to do is extract the image to display it. (This is demonstrated on the Ubuntu forums.) You could always do something sort of "hackish" where you run a file through OpenOffice and extract the image to display a small thumbnail.
Again, I have no idea how well this will work, but it may be worth a shot.
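For what it's worth, OpenDocument files are ordinary ZIP archives with a Thumbnails/thumbnail.png entry, so the extraction step can be done with ZipArchive. The headless conversion command below is an assumption (flags differ between OpenOffice/LibreOffice versions), so treat this as an outline:

<?php
// Sketch: convert an Office file to ODF with a headless office suite,
// then pull the thumbnail embedded in the resulting ZIP container.
$doc = '/var/www/uploads/report.doc';
exec('soffice --headless --convert-to odt --outdir /tmp ' . escapeshellarg($doc), $out, $status);
$odt = '/tmp/' . pathinfo($doc, PATHINFO_FILENAME) . '.odt';
$zip = new ZipArchive();
if ($status === 0 && $zip->open($odt) === true) {
    // OpenDocument packages keep a preview image at Thumbnails/thumbnail.png
    file_put_contents('/var/www/thumbs/report.png', $zip->getFromName('Thumbnails/thumbnail.png'));
    $zip->close();
}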
To anyone else who comes across this, I ended up going with the newer version of jodconverter. The sample code includes a basic web page that can be POSTed to using something like PEAR's HTTP_Request2. A sample class (by yours truly) which uses this is mentioned in the comments in jodconverter's group on Google Code.
I've bumped into a problem while working on a project. I want to "crawl" certain websites of interest and save them as "full web pages", including styles and images, in order to build a mirror of them. It has happened to me several times that I bookmarked a website in order to read it later, and after a few days the website was down because it got hacked and the owner didn't have a backup of the database.
Of course, I can read the files with PHP very easily with fopen("http://website.com", "r") or fsockopen(), but the main target is to save the full web pages so that, in case a site goes down, it can still be available to others, like a "programming time machine" :)
Is there a way to do this without read and save each and every link on the page?
Objective-C solutions are also welcome, since I'm trying to learn more of the language as well.
Thanks!
You actually need to parse the HTML and all the CSS files that are referenced, which is NOT easy. However, a fast way to do it is to use an external tool like wget. After installing wget, you could run the following from the command line:
wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://example.com/mypage.html
This will download mypage.html and all linked CSS files, images, and the images referenced inside the CSS.
Once wget is installed on your system, you can use PHP's system() function to control it programmatically.
NOTE: You need at least wget 1.12 to properly save images that are referenced through CSS files.
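A minimal sketch of driving that same wget invocation from PHP (same flags as above; the URL is a placeholder):

<?php
// Sketch: run the wget command above from PHP and check that it succeeded.
$url = 'http://example.com/mypage.html';
$cmd = 'wget --no-parent --timestamping --convert-links --page-requisites'
     . ' --no-directories --no-host-directories -erobots=off ' . escapeshellarg($url);
system($cmd, $status);
if ($status !== 0) {
    echo "wget exited with status $status\n";
}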
Is there a way to do this without read and save each and every link on the page?
Short answer: No.
Longer answer: if you want to save every page on a website, something, at some level, is going to have to read every page on that website.
It's probably worth looking into the Linux app wget, which may do something like what you want.
One word of warning: sites often have links out to other sites, which have links to other sites, and so on. Make sure you put some kind of "stop if different domain" condition in your spider!
If you prefer an Objective-C solution, you could use the WebArchive class from Webkit.
It provides a public API that allows you to store whole web pages as a .webarchive file (like Safari does when you save a webpage).
Some nice features of the webarchive format:
completely self-contained (incl. CSS, scripts, images)
QuickLook support
easy to decompose
Whatever app is going to do the work (your code, or code that you find) is going to have to do exactly that: download a page, parse it for references to external resources and links to other pages, and then download all of that stuff. That's how the web works.
But rather than doing the heavy lifting yourself, why not check out curl and wget? They're standard on most Unix-like OSes, and do pretty much exactly what you want. For that matter, your browser probably does, too, at least on a single page basis (though it'd also be harder to schedule that).
I'm not sure if you need a programming solution to 'crawl websites' or personally need to save websites for offline viewing, but if it's the latter, there's a great app for Windows, Teleport Pro, and SiteCrawler for Mac.
You can use IDM (Internet Download Manager) for downloading full webpages; there's also HTTrack.
I know there is not a direct way to take a screenshot of a web page with PHP. What would be the most straightforward way to accomplish this? Are there any command-line tools that could do this that I might be able to execute from a PHP script? (I'm thinking of something that would run on a 'NIX OS, OS X and/or Linux in particular.)
Edit: Or maybe some sort of web service I could access via SOAP or REST or ...
Edit #2: I found a related question discussing the CLI option, but I'd still be open to other methods if anyone knows of anything.
See webkit2png for an OS X command-line program that does this.
The page also mentions Linux alternatives.
[edit]: wkhtml2image is the newest kid in town, and it works better than anything else I've ever used.
[edit2]: As of 2014, PhantomJS seems to be the way to go, as it has the newest WebKit version of the alternatives I know about.
[edit3]: In 2019, Puppeteer is the way to go. Official headless Chrome, always up to date.
You can use the GD functions imagegrabscreen() or imagegrabwindow() to take a screenshot, but they're only available on Windows at the moment.
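Where those functions are available (a Windows server with GD), usage is about as short as it gets; a quick sketch:

<?php
// Sketch: grab the whole screen with GD and save it as a PNG.
// imagegrabscreen() only exists on Windows builds of PHP with GD enabled.
$img = imagegrabscreen();
if ($img !== false) {
    imagepng($img, 'screenshot.png');
    imagedestroy($img);
}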
http://www.thumbshots.org/
html2ps does a decent job for relatively simple pages, and it requires very little in terms of external binaries, meaning it's very easy to install/use. If you control the pages you'll be capturing, then you can ensure that they'll render appropriately in html2ps. If you're hoping to capture arbitrary URLs, however, I'm not sure that the PHP port of HTML2PS is up to the task. It's also not the fastest thing in the world (expect render times in the seconds for complex pages), but that doesn't really matter for some applications.
Not sure if this would be enough for you, because it has some added stuff there, but it would be worth giving it a try: http://www.snap.com
It's possible to get a base64-encoded image of a site by using the Google PageSpeed API.
You can specify desktop or mobile views, but you are limited to an image of a certain size.
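A rough sketch of that trick, assuming the older v2 runPagespeed endpoint with screenshot=true and its URL-safe base64 screenshot field; the API has changed over the years, so verify the current response shape before relying on this:

<?php
// Sketch: fetch a page screenshot via the Google PageSpeed Insights API.
// Endpoint, parameters and response fields reflect the older v2 API and are assumptions here.
$api = 'https://www.googleapis.com/pagespeedonline/v2/runPagespeed?screenshot=true&strategy=desktop&url='
     . urlencode('http://www.example.com/');
$json = json_decode(file_get_contents($api), true);
$data = $json['screenshot']['data'];
$data = str_replace(array('_', '-'), array('/', '+'), $data);   // undo the URL-safe base64 variant
file_put_contents('screenshot.jpg', base64_decode($data));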