I'm building a web app where users can build custom web pages that pull content from other web pages. I know of a few options for doing this, and I'm not sure which is best, and if there are better solutions out there. Right now, I could:
Use iframes, which will (sort of) accomplish what I want, but will force the client to download and render all the web content, which seems slow. I've also heard a lot of people say iframes are passé and should not be used, etc.
Use a library like wkhtmltopdf, which will render the HTML on the server side and generate a PDF image of it. This would work nicely, but the result is just an image, so text won't be selectable, links won't be clickable, etc. Also, I've heard that you can get into legal trouble for hosting other people's web content on your site without permission.
Use something like phpquery to literally scrape content off of other sites. This option could have the same legal issues as the above option.
Has anyone done anything like this, or does anyone have any thoughts?
The cleanest solution would be to send off an HTTP request server-side, then render the HTML into your page as you require. This will also require changing all the URLs of content and links to be absolute.
For example:
<img src="/images/banner.png">
This will work on the remote server, but once inside your page, the image will not exist. The most workable solution would be to limit the functionality to images and links, then do a find/replace with a regex to match relative URLs and prepend the source site's address to them.
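For example, a rough PHP sketch of that approach (the URL and the regex here are purely illustrative; file_get_contents() assumes allow_url_fopen is enabled, otherwise use cURL):

<?php
// Fetch the remote page server-side (hypothetical source site).
$base = 'http://example.com';
$html = file_get_contents($base . '/some/page.html');

// Rewrite root-relative src/href attributes (e.g. "/images/banner.png")
// so they point back at the original host.
$html = preg_replace('~(src|href)="(/[^"]*)"~i', '$1="' . $base . '$2"', $html);

// Render the rewritten markup inside your own page.
echo $html;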
You will, however, run into legal issues if you are re-serving other people's content from your server, even just the HTML.
Using an iframe would be the quick and dirty solution and probably has the least legal ramifications, as the browser sends a normal request to the site for the content.
I'd recommend DocRaptor for generating PDF files from HTML. It works in a similar fashion as wkhtmltopdf, but produces fully functional PDF files.
Here's a link to its homepage:
http://docraptor.com/
And a link to its API documentation:
http://docraptor.com/documentation
On Imgur, you can input an image URL and a few seconds later there's a thumbnail of the image. Or in Bing Search, you can (or used to be able to) view a thumbnail of the website in the search results before visiting it.
I would love to implement something similar for my website, but I can't wrap my head around how it is done. Moreover, are there not security concerns? I'd imagine the servers have to at least download the website, render it and take a screenshot. What if it's a malicious website, and you download something malicious onto your server?
A headless web browser engine like PhantomJS can be used for this; see the example on their wiki. Yes, it would be prudent to run this in some sort of sandbox, feeding a queue of URLs into it and then taking the generated thumbnails from the file system.
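For example, a rough sketch of that queue in PHP, assuming the phantomjs binary and its bundled examples/rasterize.js script are installed (all paths here are placeholders), ideally run under a locked-down user account or container:

<?php
// Hypothetical queue of URLs to thumbnail.
$queue = array('http://example.com/', 'http://example.org/');

foreach ($queue as $i => $url) {
    $out = sprintf('/var/thumbs/%d.png', $i);
    // Render the page headlessly; escapeshellarg() guards against shell injection.
    shell_exec(sprintf(
        'phantomjs /opt/phantomjs/examples/rasterize.js %s %s',
        escapeshellarg($url),
        escapeshellarg($out)
    ));
    // $out now holds a full-size screenshot; scale it down to a thumbnail with GD or ImageMagick.
}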
While I don't know the internal workings of any of the aforementioned services, I'd guess that they download/create a local copy of the images and generate a thumbnail from that.
Imgur, as an image hosting service, definitely needs a copy of the image prior to being able to generate thumbnails or anything else from it. The image may be stored locally or just in memory, but either way, it must be downloaded.
The search engines displaying screenshots of the sites likely have services that periodically take a screenshot of the viewable area when the content is getting indexed, and then serve those screenshots (or derivatives) along with the search results. Taking a screenshot really isn't dangerous, so there's nothing to worry about there, and whatever tools are used to load/parse/index the websites will obviously be written with security considerations in mind.
Of course, there are security concerns about the data you're downloading, too; the images can easily contain executable code (such as PHP) in their EXIF data, so you need to be careful about what you do with the images and how.
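One common mitigation (just a sketch, not a complete security answer) is to re-encode every downloaded image with GD, which throws away EXIF data and anything else that isn't pixel data:

<?php
// Re-encode an untrusted JPEG; the new file keeps only the pixel data.
$src = @imagecreatefromjpeg('/tmp/untrusted.jpg'); // returns false if it isn't a valid JPEG
if ($src !== false) {
    imagejpeg($src, '/var/www/thumbs/clean.jpg', 85); // 85 = JPEG quality
    imagedestroy($src);
}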
I want the option of converting HTML to an image and showing the result to the user. I would be creating an $html variable with PHP, and instead of displaying it using echo $html, I want to display it as an image so the user can save the file if they need to.
I was hoping there would be something as simple as $image = convertHTML2Image($html); :p Does that exist?!
Thanks!!
As @Pekka says, turning HTML code into an image is a job for a full-blown web browser.
If you want to do this sort of thing, you therefore need to have a script that does the following:
Opens the page in a browser.
Captures the rendered page from the browser as a graphic.
Outputs that graphic to your user.
Traditionally, this would have been a tough task, because web browsers are typically driven by the user and not easy to automate in this way.
Fortunately, there is now a solution, in the form of PhantomJS.
PhantomJS is a headless browser, designed for exactly this kind of thing -- automated tasks that require a full-blown rendering engine.
It's basically a full browser, but without the user interface. It renders the page content exactly as another browser would (it's based on WebKit, so results are similar to Chrome), and it can be controlled by a script.
As it says on the PhantomJS homepage, one of its target use-cases is for taking screenshots or thumbnail images of websites.
(Another good use for it is automated testing of your site.)
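As a rough sketch (not the only possible wiring), you could drive PhantomJS from PHP by writing your $html to a temporary file and letting the bundled examples/rasterize.js script capture it; the paths below are assumptions about where PhantomJS is installed:

<?php
// $html is the markup you would otherwise echo.
$htmlFile = tempnam(sys_get_temp_dir(), 'page') . '.html';
file_put_contents($htmlFile, $html);

// Let PhantomJS render the file and save a screenshot of it.
$pngFile = tempnam(sys_get_temp_dir(), 'shot') . '.png';
shell_exec(sprintf(
    'phantomjs /opt/phantomjs/examples/rasterize.js %s %s',
    escapeshellarg($htmlFile),
    escapeshellarg($pngFile)
));

// Send the graphic to the user so they can save it.
header('Content-Type: image/png');
readfile($pngFile);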
Hope that helps.
This is not possible in pure PHP.
What you call "converting" is in fact a huge, non-trivial task: the HTML page has to be rendered. To do this in PHP, you'd have to rewrite an entire web browser.
You'll either have to use an external tool (which usually taps into a browser's rendering engine) or a web service (which does the same).
It is possible to convert HTML to an image; however, you must first convert it to PDF. See link.
You may have a look at dompdf, which is a PHP library for converting an HTML file to a PDF.
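A minimal sketch, assuming dompdf is installed via Composer (older releases expose a DOMPDF class instead of the namespaced Dompdf\Dompdf, so check the version you have):

<?php
require 'vendor/autoload.php';

use Dompdf\Dompdf;

$dompdf = new Dompdf();
$dompdf->loadHtml('<h1>Hello world</h1>');
$dompdf->setPaper('A4', 'portrait');
$dompdf->render();
$dompdf->stream('document.pdf'); // sends the generated PDF to the browser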
Use wkhtmltopdf. Works like a charm. It converts any page to PDF;
a JPEG can then be obtained from the PDF in a later step.
http://code.google.com/p/wkhtmltopdf/
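For example, a quick sketch driving it from PHP on the command line; this assumes the wkhtmltopdf binaries are installed, and uses wkhtmltoimage (the companion tool shipped with the same project) as one way to get a JPEG without the intermediate PDF:

<?php
$url = 'http://example.com/';

// HTML -> PDF
shell_exec(sprintf('wkhtmltopdf %s %s',
    escapeshellarg($url), escapeshellarg('/tmp/page.pdf')));

// HTML -> JPEG directly, skipping the PDF step
shell_exec(sprintf('wkhtmltoimage %s %s',
    escapeshellarg($url), escapeshellarg('/tmp/page.jpg')));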
Currently working on an offshoot of the idea more adequately addressed here.
Creating a Secure File Hosting Server for PDFs
I'm developing a secure PDF hosting website where certain users can download certain PDFs that I have stored outside of the webroot, to prevent people from accessing documents they shouldn't access.
I've got the download working using the first solution, but I want to implement a 'view/preview' feature too. I still don't understand content headers as well as I should, but I believe the bulk of my issues comes from not being able to put a 'src' attribute on the embed/object/iframe/whatever. And that's kind of the point of the system.
My question is, is there any way to feed a file (as opposed to a url) to an embed/object? I would like to keep my current system and I'm going for simplicity at the moment so the easier the better.
I saw Recommended way to embed PDF in HTML? and will probably check out pdf.js if I'm trying something that isn't doable.
I have not yet had the chance to play with pdf.js, but it's either that or a Flash player of some sort.
Or you rely on the browser to display it as a webpage and you can iframe it, but that's so lame... it would work only for a fraction of your users.
PDF2SWF - convert the PDF to SWF (1 page = 1 SWF).
Use another SWF (reader) to load the SWF pages via XML or something else.
Use $_SESSION to store the ID of the PDF document, which should be served through e.g. /preview (the same link for previewing all documents; see the sketch below).
Don't serve the original PDFs; put a watermark on them or make them low-res.
Otherwise, your PDF will never be "secure".
http://www.swftools.org/
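For the $_SESSION + /preview idea above, a minimal sketch (the directory and session keys are made up; the point is that the file lives outside the webroot and the document ID never appears in the URL):

<?php
// preview.php -- used as the src of an <iframe>/<embed>/<object>.
session_start();

if (!isset($_SESSION['user_id'], $_SESSION['preview_doc'])) {
    header('HTTP/1.1 403 Forbidden');
    exit('Not allowed');
}

// The file lives outside the webroot; basename() blocks path traversal.
$file = '/srv/private-pdfs/' . basename($_SESSION['preview_doc']) . '.pdf';

if (!is_file($file)) {
    header('HTTP/1.1 404 Not Found');
    exit;
}

header('Content-Type: application/pdf');
header('Content-Length: ' . filesize($file));
header('Content-Disposition: inline; filename="preview.pdf"');
readfile($file);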
We download images to our computers when we open webpages. For example: if a webpage has an image (image.jpg), our computer downloads it while we are surfing that page.
Some webpages use Ajax methods. For example: you don't see an image in the page's source code, yet your computer downloads an image, because if you click a link on that page, Ajax will show that image...
Let me show an example:
<div id="ajax_will_load_image_here"></div>
Okay, how can PHP cURL see (or download) that image? cURL can't see the image when I try to use the preg_match function, even though the image is really there. I want to download that image using PHP cURL. Any advice?
If I understand the question correctly, there is no convenient way of doing that.
Your crawler/spider would have to parse the website and evaluate javascript.
There are libraries for that but support is very limited.
There are, however, methods where an actual browser is used to evaluate the page (without displaying it, but with proper environment variables like resolution etc. set).
Then the generated source including javascript dom modifications is available.
This is for example how the google search previews are generated.
But if you require user interaction it gets pretty specific and complicated.
I am sorry to disappoint you, but using cURL and preg_match the old-school way, like we did when JavaScript was not yet so common, won't work.
However, for most legitimate use cases this is more than sufficient, and websites today are increasingly designed to work without JavaScript, especially the content meant for crawling. It is a must in search engine optimization, and which website doesn't want that?
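To make the browser-based approach concrete, here is a rough sketch (the phantomjs binary is assumed to be installed; the URL and regex are placeholders) that lets a headless browser run the JavaScript first, after which the familiar preg_match step works on the rendered source:

<?php
// Tiny PhantomJS control script: open the page, give the Ajax call time to finish,
// then print the DOM as it looks after JavaScript has run.
$js = <<<'JS'
var page = require('webpage').create();
page.open(require('system').args[1], function () {
    window.setTimeout(function () {
        console.log(page.content);
        phantom.exit();
    }, 3000);
});
JS;

$jsFile = tempnam(sys_get_temp_dir(), 'dump') . '.js';
file_put_contents($jsFile, $js);

$html = shell_exec(sprintf('phantomjs %s %s',
    escapeshellarg($jsFile),
    escapeshellarg('http://example.com/page-with-ajax')));

// Now the image URL is present in the source and can be grabbed the old-school way.
if (preg_match('~<img[^>]+src="([^"]+)"~i', $html, $m)) {
    file_put_contents('/tmp/found.jpg', file_get_contents($m[1]));
}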
I'm putting together a portfolio website which includes a number of images, some of which I don't want to be viewable by the general public. I imagine that I'll email someone a user name and password, with which they can "log-in" to view my work.
I've seen various solutions to the "hide-an-image" problem online, including the following, which uses PHP's readfile. I've also seen another that uses .htaccess.
Use php's readfile() or redirect to display a image file?
I'm not crazy about the readfile solution, as it seems slow to load the images, and I'd like to be able to use Cabel Sasser's FancyZoom, which needs unfettered access to the image (his library wants a link to the full-sized image), so that rules out .htaccess.
To recap what I'm trying to do:
1) Provide a site where I give users the ability to authenticate themselves as someone I'd like looking at my images.
2) Restrict random web users from being able see those images.
3) Use FancyZoom to blow up thumbnails.
I don't care what technology this ends up using -- Javascript, PHP, etc. -- whatever's cleanest and easiest.
By the way, I'm a Java Developer, not a web developer, so I'm probably not thinking about the problem correctly.
Instead of providing a link to an image, provide a link to a CGI script that will automatically provide the proper header and content of the image.
For example:
image.php?sample.jpg
You can then make sure they are already authenticated (e.g. pass a session id) as part of the link.
This would be part of the header, and then your image data can follow:
header('Content-Type: image/jpeg');
Edit: If it has to be fast, you can write this in C/C++ instead of php.
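For example, a sketch of such an image.php (the directory is a placeholder, and the 'authenticated' session flag stands in for whatever your login code sets):

<?php
// image.php?sample.jpg -- only serves the file to users who are logged in.
session_start();

if (empty($_SESSION['authenticated'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

// Images live outside the webroot; basename() blocks "../" traversal.
$file = '/srv/private-images/' . basename($_SERVER['QUERY_STRING']);

if (!is_file($file)) {
    header('HTTP/1.1 404 Not Found');
    exit;
}

header('Content-Type: image/jpeg');
header('Content-Length: ' . filesize($file));
readfile($file);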
Using .htaccess should be the safest/simplest method, as it's built-in functionality of the webserver itself.
I do not know if it fits your needs, but I solved a similar problem (giving pictures to a restricted group of people) by using TinyWebGallery, which is a small gallery application without a database.
You can allow access to different directories via password, and you can upload pictures directly into the filesystem, as TinyWebGallery will check for new dirs/pics on the fly. It generates thumbnails and gives users the possibility to rate/comment on pictures (you can disable this).
This is not the smallest tool; however, I think it is far easier to set up than using Apache directives, and it looks better than naked images.
If you're using Nginx, you could use the Secure Link module.