Retrieve images from a link - php

Is there a script, service, snippet or method, or anything else, that can get a thumbnail from a URL? By thumbnail I don't mean a snapshot of the site, but an image that can automatically be fetched and used as a post thumbnail, much like the one used on Facebook. The image should be fetched like this: img src="xxxxxxx?url=google.com", which would fetch the Google logo.

Maybe there are existing solutions for this, but it's not really hard to implement (a minimal sketch follows below):
you need to fetch the remote site, e.g. with file_get_contents
optionally use Tidy to clean up the source HTML
parse the output with an XML parser (if you used Tidy to clean the fetched data) or with an HTML parser
fetch the first n images from the site (n should be a relatively small number)
store this fetched image set in a cache, because the fetching and parsing can take time
Comments:
you may want to fetch the site's robots.txt to check whether you are allowed to use/index the content
set a timeout for fetching the remote website, because if that site is down or slow the request would otherwise hang on your site as well
limit the number of concurrent fetches, per site and globally, to protect against DoS-ing
you could use an HTTP client and limit the size of the fetched HTML, or use the HEAD HTTP method to fetch the Content-Length before downloading the actual content, if the server allows it
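A minimal sketch of those steps, assuming allow_url_fopen and the DOM extension are available; the URL, the image limit and the timeout are placeholders, and the robots.txt check and caching are left out:

    <?php
    // Fetch the remote page with a timeout so a slow or dead site doesn't hang our request.
    $url = 'http://www.example.com/'; // placeholder target URL
    $context = stream_context_create(array('http' => array('timeout' => 5)));
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        die('Could not fetch the remote page');
    }

    // Parse the (possibly malformed) HTML; libxml warnings are suppressed.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Collect the first n image URLs (relative URLs are not resolved here).
    $n = 5;
    $images = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $images[] = $src;
        }
        if (count($images) >= $n) {
            break;
        }
    }
    print_r($images);

Relative image URLs would still need to be resolved against the page URL before they can be fetched.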

Related

When loading multiple images from network requests, should I return the image data or a link to the image?

I have an iOS app that lists local places in a table view. Each cell has a picture, text, and subtext.
Each cell's detail view also has multiple pictures of the relevant location, as well as a decent amount of text. JSON is the interchange format.
Currently I am sending bit blobs and constructing them into a JPEG once loaded on the device, but I am worried this is intensive for both the device and the server. So I was considering sending a link to the picture and asynchronously downloading each picture, but I am unaware of what repercussions this would have, especially considering that I am currently using a cheap PHP/MySQL shared hosting plan for the backend.
I am looking for a list of pros and cons for sending the raw image data through JSON vs a link to that image. Any other options for quickly and efficiently populating a view with multiple network images is welcome.
I think the differences are as follows:
1- the user will download more data (link + image > image alone).
2- if the image is on another server, that server might be slower or faster than yours, which affects the image loading speed for the user but reduces the amount of data transferred between your server and the client.
3- if the image is on another server, can you guarantee it will still be there whenever your website is up?
4- loading data using AJAX is already asynchronous, so you don't have to worry about another server if you use it; unless your server is very slow, in which case you should consider using another server for the big images, since the real concern there is the load on your server rather than synchronization.
If other points come to my mind, I'll post them here.
I've done a little bit of research into this and asked a few colleagues, so I'll share what tidbits I've come up with.
At some point, the raw image data is going to have to be sent; that is unavoidable.
But I can benefit from lazily loading the image data- so that if my user only looks through 14 tableview cells, I only spend time loading 14 images instead of however many total results are returned from the server (And even less if I implement proper caching).
My solution so far is to return 30 (the number of tableview cells I load at one time) JSON objects, each having an "Image_URL":"..." field and putting those into a dictionary. Then, in cellForRowAtIndexPath:, I check to see if the image for that cell is already cached and if not, I make a request for that picture and update the cell.image in the network callback.
This is pretty simple to do on your own, but SDWebImage seems like a pretty good library for handling corner cases, caching, and other things that aren't covered in a basic implementation. I should note that AFNetworking also includes functionality for asynchronous image downloading.

a way to implement facebook's functionality of link sharing

I am looking for a way to create functionality similar to what happens when you post a link to an existing web site on Facebook. If this statement is ambiguous, I will try to elaborate.
When you paste your link and submit your post, Facebook shows alongside your link a small preview of the page you are posting (text and maybe a small image).
What are the ways to achieve this?
I read the similar post, but the thing is that I do not need an image so much, text will be sufficient.
Working in PHP, but language is not important, because I am looking for a high level idea.
Previously I was thinking about parsing the content of the link with cURL, but the thing is that in a lot of situations the text returned by Facebook is not available on the page.
Is there other ways?
From what I can tell, Facebook pulls from the meta name="description" tag's content attribute on the linked page.
If no meta description tag is available, it seems to pull from the beginning of the first paragraph <p> tag it can find on the page.
Images are pulled from available <img> tags on the page, with a carousel selection available to pick from when posting.
Finally, the link subtext is also user-editable (start a status update, include a link, and then click in the link subtext area that appears).
Personally I would go with this route: cURL the page; parse it for a meta description tag and, if none is present, grab some likely text with a basic algorithm or just take the first paragraph tag; then allow the user to edit whatever was presented (it's friendlier to the user and also sidesteps pages that return different content depending on user agent). Do the user-facing control as AJAX so that you don't have issues with however long it takes your site to access the link you want to preview.
I'd recommend using a DOM library (you could even use DOMDocument if you're comfortable with it and know how to handle possibly malformed html pages) instead of regex to parse the page for the <meta>, <p>, and potentially also <img> tags. Building a regex which will properly handle all of the myriad potential different cases you will encounter "in the wild" versus from a known set of sites can get very rough. QueryPath usually comes recommended, and there are stackoverflow threads covering many of the available options.
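As a rough sketch of that DOMDocument approach, assuming $html already holds the page source fetched with cURL (variable names are placeholders):

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // tolerate malformed HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Prefer the meta description, which seems to be Facebook's first choice.
    $description = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = trim($meta->getAttribute('content'));
            break;
        }
    }

    // Fall back to the first paragraph if there is no meta description.
    if ($description === '') {
        $firstP = $doc->getElementsByTagName('p')->item(0);
        if ($firstP !== null) {
            $description = trim($firstP->textContent);
        }
    }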
Most modern sites, especially larger ones, are good about populating the meta description tag, especially for dynamically generated pages.
You can scrape the page for <img> tags as well, but you'll then want to host the images locally: either host all of the images and delete all except the one chosen, or host thumbnails (assuming you have an image processing library installed and enabled). Which you choose depends on whether bandwidth and storage matter more than the one-time processing cost of running imagecopyresampled, imagecopyresized, Gmagick::thumbnailimage, etc. (pick whatever you have at hand or prefer). You don't want to hotlink to the images on the page, both because of the bandwidth cost you impose on the other site and especially because of the likelihood of ending up with broken images when linking to any site with hotlink prevention (referrer checks, etc.) or when the images expire. Personally I would probably go for storing thumbnails.
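A minimal GD-based thumbnail sketch, assuming GD is installed and $sourcePath is a locally stored copy of the scraped image (the function name, sizes and JPEG output are placeholders; error handling is omitted):

    function makeThumbnail($sourcePath, $destPath, $maxWidth = 200, $maxHeight = 200)
    {
        list($width, $height) = getimagesize($sourcePath);

        // Scale down (never up) so the image fits inside the bounding box.
        $scale = min($maxWidth / $width, $maxHeight / $height, 1);
        $newWidth  = max(1, (int) round($width * $scale));
        $newHeight = max(1, (int) round($height * $scale));

        // imagecreatefromstring handles JPEG, PNG and GIF sources alike.
        $src   = imagecreatefromstring(file_get_contents($sourcePath));
        $thumb = imagecreatetruecolor($newWidth, $newHeight);
        imagecopyresampled($thumb, $src, 0, 0, 0, 0, $newWidth, $newHeight, $width, $height);

        imagejpeg($thumb, $destPath, 85);
        imagedestroy($src);
        imagedestroy($thumb);
    }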
You can wrap the entire link entity up as an object for handling expiration/etc if you want to eventually delete the image/thumbnail files on your own server. I'll leave particular implementation up to you since you asked for a high level idea.
but the thing is that in a lot of situations the text returned by facebook is not available on the page.
Have you looked at the page's meta tags? I've tested with a few pages so far and this is generally where content not otherwise visible on the rendered linked pages is coming from, and seems to be the first choice for Facebook's algorithm.
Full disclosure upfront, I'm a developer at ThumbnailApp.com.
It's a JSON API service with an optional JavaScript SDK which I think does exactly what you're after: it will parse a string to detect any URLs and return the title, description and thumbnail of the asset. If the page has OpenGraph tags, it will use those for the image thumbnail. It's currently in private beta but we're adding more accounts each week.
If you feel that you really need a do-it-yourself solution:
Check out the Python-based webkit2png and the headless browser PhantomJS. They can render web pages to an image (the default size is 800x600); then you'll have to write some code to resize and crop the image like taswyn mentioned. Ideally you would then upload the resized image to Amazon S3 and have it served through a CDN such as CloudFront.
To get the title and description, first get the URL content (with cURL or whatever) and check the Content-Type header to make sure it's a web page. If it is, you can then use an HTML parser such as the SimpleHTMLDOM PHP library to grab the title and the description meta data. If you want it exactly like Facebook, you will also need to check for any OpenGraph tags, specifically og:image.
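For the content-type check, one possible sketch is a headers-only cURL request before downloading the body (the $url variable and the timeout are placeholders):

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD-style request: headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);

    if (strpos((string) $contentType, 'text/html') === 0) {
        // It's a web page: fetch the full body, then look for OpenGraph tags such as
        // <meta property="og:image" content="..."> before falling back to <img> tags.
    }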
Also don't forget about caching. The first render and description parsing can take a long time. Even if your site is fast, the webpage you're rendering could be slow and the best approach is to render / parse it once, then just save and return the resized image and meta data for subsequent requests. Depending on what your requirements are you may need to refresh the cached data every hour or you could get away with refreshing it once a day.
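A very small file-based cache sketch along those lines, assuming a writable cache directory and a one-day lifetime; buildPreview() is a hypothetical function standing in for the expensive render/parse step:

    function getCachedPreview($url, $ttl = 86400)
    {
        $cacheFile = '/tmp/preview_' . md5($url) . '.json'; // placeholder cache location

        // Serve the cached preview if it is still fresh.
        if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
            return json_decode(file_get_contents($cacheFile), true);
        }

        // Otherwise do the expensive render/parse once and cache the result.
        $preview = buildPreview($url); // hypothetical: returns title, description, thumbnail URL
        file_put_contents($cacheFile, json_encode($preview));
        return $preview;
    }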
To do it yourself takes quite a bit of work and lots of server configuration. I feel using a 3rd party service is a better way to go, but obviously I have a biased opinion :)

Crawling custom URL's for images and resizing them via AJAX

I'm building a web app where users can store links, with 200x200 pictures associated. By default, I'd like to crawl the link for images and then return thumbnails of the biggest ones (from which the user can select the "official" thumbnail). I want this all to happen via AJAX. My question is: what is the best way to do this?
Currently, I'm using the PHP Simple HTTP Parser to scan a URL. I then find the src attribute of all the <img> tags, use getimagesize to store the image size located at that URL, sort the array from biggest to smallest and return the top 5 biggest image URL's via AJAX to the client. Then the client sends a different AJAX request for each one which makes a server-side ImageMagick script download and cut the image to a thumbnail, save it in a temporary folder and then return the URL of this thumbnail, which the client finally loads on his browser.
Needless to say, this is a little complicated and probably really inefficient. Running this process on http://en.wikipedia.org takes about 10-15 seconds from start to finish. I'm not certain there are any more efficient ways, however.
I'd do it in one AJAX request, with the script automatically resizing the biggest 5 images on the first pass, saving them, and returning a JSON array with the resized image URLs for the client.
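A condensed sketch of that one-request flow, assuming $imageUrls already holds the scraped image URLs and the ImageMagick convert binary is installed; the paths and sizes are placeholders and stand in for publicly accessible thumbnail URLs:

    // Measure every image, sort by area, keep the five biggest.
    $sizes = array();
    foreach ($imageUrls as $imgUrl) {
        $info = @getimagesize($imgUrl);   // still downloads each image, so this is the slow part
        if ($info !== false) {
            $sizes[$imgUrl] = $info[0] * $info[1];
        }
    }
    arsort($sizes);
    $biggest = array_slice(array_keys($sizes), 0, 5);

    // Resize the five biggest in the same request and return all thumbnails at once.
    $thumbs = array();
    foreach ($biggest as $i => $imgUrl) {
        $local = "/tmp/orig_$i";
        file_put_contents($local, file_get_contents($imgUrl));
        $thumbPath = "/tmp/thumb_$i.jpg";
        exec('convert ' . escapeshellarg($local) . ' -thumbnail 200x200 ' . escapeshellarg($thumbPath));
        $thumbs[] = $thumbPath;
    }

    header('Content-Type: application/json');
    echo json_encode($thumbs);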
You should probably use PHP's DOMDocument class to grab/parse the html page.
getimagesize() means you have to download each image and process it. Perhaps you should consider simply showing the user ALL the images, by placing <img> tags that point back to the images on the original HTML page. You could size these however you like using the tag attributes. This way you do not have to download or process a single image until the user has actually selected one for the thumbnail.
Interested if / how you solved this?
In the end I looped through the images doing getimagesize() until both height and width were over a certain size, and then broke the loop.
This way it's slightly more efficient, as it only downloads as many images as it needs.
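That early-exit loop might look roughly like this, with 200 as a placeholder threshold:

    $chosen = null;
    foreach ($imageUrls as $imgUrl) {
        $info = @getimagesize($imgUrl);
        // Stop at the first image whose width and height both exceed the threshold.
        if ($info !== false && $info[0] > 200 && $info[1] > 200) {
            $chosen = $imgUrl;
            break;
        }
    }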

reduce http requests by saving images in database?

Would it make sense to improve pageload speed by serving smaller images from the database rather than make multiple HTTP requests given that the website is PHP driven?
I'm thinking of smaller page design elements, buttons, thumbnails for galleries etc.
No. Since:
A browser only communicates with the server via HTTP, so you would have to pull the images from the database, wrap them in an HTTP response, and then return them to the browser
It is more expensive to pull large chunks of binary data from a database than it is to pull them from the filesystem.
If you want to make fewer HTTP requests, you can sprite the images, but don't do that with content images (which should have proper <img> elements with alt text).
Also, you can serve the images from multiple subdomains, so you can have more concurrent HTTP requests, which could help speed things up.
No.
The user isn't directly connected to the database and you can't (well you can but it's so ugly I'm ignoring it) output the image data inside the HTML. They have to be loaded on separate requests.
If you store them in a database, you need something to access the database and then stream it out. It's actually seriously worse than just letting your httpd serve it. If a server hosts it, only the core server and the filesystem get touched. If it's in a database it's the core server, the connector to the language (eg mod_php), the language (eg php), the database connection and the filesystem (which the database is written on).
Keep it simple. Keep it as a file.
If you're drowning in requests:
If you're on Apache consider using a server like lighttpd or nginx. Massively more efficient on static/dynamic mixed environments. You can still keep apache or you can dump it altogether.
Shift your images off to a CDN like S3, Akamai, etc. There are plenty of providers and it usually only works out a little bit more expensive than hosting (this is assuming you've got quite a lot of traffic).
It is possible; you can embed images in HTML using the data URI scheme. But I doubt it will pay off: you will decrease the number of HTTP requests, but separate images can be cached on the client anyway, and embedding them will greatly increase the length of each response.
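For reference, a data URI embed looks roughly like this in PHP (the file path is a placeholder):

    $path = 'images/button.png'; // placeholder path to a small design element
    $dataUri = 'data:image/png;base64,' . base64_encode(file_get_contents($path));
    echo '<img src="' . $dataUri . '" alt="button">';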
Either way, it will be faster to load those files directly from disk than from the DB.
The number of HTTP requests remains the same whether the browser loads images from a script that loads image data from a database or regular files. In fact, loading image data from a database rather than static files would probably introduce additional overhead.
If you're looking to reduce the amount of HTTP requests a browser has to make to load your documents, you should look into CSS Sprites.
You would save the HTTP overhead, but how would you insert the images into the HTML? Otherwise you still have to make an HTTP request to get each image.
If you serve the images as a byte stream from the DB, you don't let browsers cache the content. If you use one HTTP request per image, you let them cache the content, but you pay the price of more requests. You also have to consider the time spent fetching the images from the DB and processing them.
I think that your best option in this case is to put all the small images in just one file (a sprite) and then use CSS to display them. That's what high-load sites do. This way you just do one request and get all the images; the browser will cache the file and it will improve your performance. The price you pay is that you need to write more CSS, but that's just plain text and the same number of files. It's a win-win situation :)
There are various ways to improve image performance in a website
Use an alternate domain just for static content. This has two benefits - cookies from your main domain are not sent with each request, and a separate domain gets its own allocation of connections
Combine images into sprites
Configure caching correctly. Set far-future expiry headers so that the image is not downloaded again between visits to the site. When an image is requested, the ETag can also be checked, and if it matches, a 304 Not Modified response is returned and the content is not downloaded again.
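If you do end up streaming an image through PHP, a rough sketch of far-future expiry plus the ETag/304 handshake looks like this (the path is a placeholder):

    $path = 'images/logo.png';                 // placeholder path
    $etag = '"' . md5_file($path) . '"';

    header('Cache-Control: public, max-age=31536000');  // far-future: one year
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 31536000) . ' GMT');
    header('ETag: ' . $etag);

    // If the browser already has this exact version, answer 304 and skip the body.
    if (isset($_SERVER['HTTP_IF_NONE_MATCH']) && trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
        header('HTTP/1.1 304 Not Modified');
        exit;
    }

    header('Content-Type: image/png');
    readfile($path);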
I don't see why streaming images from a database is going to be better than from the file system. Your performance numbers are subjective, I suspect, because of caching.

PHP: I want to create a page that extracts images from a forum thread, doable? codeigniter?

You have a forum (vBulletin) that has a bunch of images. How easy would it be to have a page that visits a thread, steps through each page, and forwards the images to the user (via AJAX or whatever)? I'm not asking about filtering (that's easy, of course).
doable in a day? :)
I have a site that uses codeigniter as well - would it be even simpler using it?
Assuming this is to be carried out on the server, cURL + regexp are your friends... and yes, doable in a day.
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vbulletin, but probably it offers a plugin api that allows for high level database access. That would simplify querying all posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as an HTTP client. It could fetch all pages of a thread (either automatically, by searching for a NEXT link in each page, or manually, by having all pages specified as parameters) and search the HTML source code for image tags (<img .../>).
Then a regular expression could be used to extract the image urls. Finally, the script could use these image urls to construct another page displaying all these images, or it could download them and create a package.
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. The simplest way to do this is probably just to sleep for X seconds between each fetch.
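A small sketch of that spider loop, using a regular expression only for this narrow img-src case and sleeping between fetches (the URLs and the sleep interval are placeholders):

    // Thread page URLs, either generated or collected by following the NEXT link.
    $pageUrls = array(
        'http://forum.example.com/showthread.php?t=123&page=1',
        'http://forum.example.com/showthread.php?t=123&page=2',
    );

    $imageUrls = array();
    foreach ($pageUrls as $pageUrl) {
        $pageHtml = @file_get_contents($pageUrl);
        if ($pageHtml !== false
            && preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $pageHtml, $matches)) {
            $imageUrls = array_merge($imageUrls, $matches[1]);
        }
        sleep(2); // rate-limit: be polite to the forum server
    }
    $imageUrls = array_unique($imageUrls);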
Yes doable in a day
Since you already have a working CI setup I would use it.
I would use the following approach:
1) Make a model in CI capable of:
logging in to vbulletin (images are often added as attachments and you need to be logged in before you can download them). Use something like snoopy.
collecting the URL for the "last" button using preg_match(), parsing that URL with parse_url() and parse_str(), and generating links from page 1 to the last page (see the sketch after this list)
collecting html from all generated links. Still using snoopy.
finding all images in html using preg_match_all()
downloading all images. Still using snoopy.
moving the downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists.
saving the image name and precise bytesize in a db table. Then you can avoid downloading the same image more than once.
2) Make a method in a controller that collects all images
3) Set up a cronjob that collects images at regular intervals. wget -o /tmp/useless.html http://localhost/imageminer/collect should do nicely
4) Write the code that outputs pretty HTML for the end user, using the db table to get the images.
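As a sketch of the page-link generation from step 1, assuming $lastPageUrl has already been extracted with preg_match() and follows a vBulletin-style query string (the parameter names are a guess):

    // e.g. $lastPageUrl = 'showthread.php?t=123&page=17' as found by preg_match()
    $parts = parse_url($lastPageUrl);
    parse_str($parts['query'], $query);
    $lastPage = (int) $query['page'];

    $pageUrls = array();
    for ($page = 1; $page <= $lastPage; $page++) {
        $query['page'] = $page;
        $pageUrls[] = $parts['path'] . '?' . http_build_query($query);
    }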
