What is the fastest way to get images from external webpage? - php

I need a way to get the biggest 5 images from a generic external webpage.
I know that I can't do this with Ajax alone (maybe I am wrong) due to cross-site security restrictions.
So I must use PHP + JavaScript.
I have just written this PHP code to get all the images from an external URL:
$html = file_get_contents($link);

// Configure the parser before loading, otherwise preserveWhiteSpace has no effect
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);

// Collect the src of every <img> on the page
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    echo $image->getAttribute('src'), "\n";
}
So now, what is the fastest way to get only the 5 biggest images of that page?
By "biggest" I mean the images with the highest resolution.

If you mean "biggest" as in in largest file size, then I think you are somehwat on the right track already. You would just need to find all the images in the source document, then likely make a HEAD request to the server where the image lies to (hopefully) get the file size information from the headers without downloading the file.
If "fastest" really is your concern, you could use cURL which has "multi" support for making parallel requests. Once you get the header information from the requests, you can determine the 5 biggest files and display the URL to them.
If the URL you are calling doesn't change much, you could probably cache the results locally to prevent the need to parse through the page and/or make HEAD requests on the images.
If "biggest" as in largest image size, then you are likely going to need to inspect the images on your server using an image library.

What is the fastest way to get images from external webpage?
With any method you use, the network connection is by far your limiting factor. It makes little sense to optimize anything else.
I need a way to get the biggest 5 images from a generic external webpage.
An HTTP HEAD request should give you information about how many bytes would need to be transferred to download the image. The response to a HEAD request should be the HTTP header that would have been sent if it were a GET request; in particular, the HTTP body (which contains the actual image data) is omitted. Notice the word should rather than the (IMHO preferable) word must.
Furthermore, the number of bytes is not an adequate measure of the number of pixels in the image. You might employ some heuristics based on the content type (PNG, GIF and JPEG produce different file sizes for the same number of pixels). I don't know if this is accurate enough for you; JPEG images, for example, can vary widely due to different compression levels.
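If the byte count turns out to be too rough a proxy for resolution, one possible workaround (my own sketch, not part of the answer above) is to download only the first chunk of each image and read the dimensions from it, since most formats store width and height near the start of the file:

<?php
// Sketch: fetch only the first ~32 KB of an image and try to read its dimensions.
// Works for most PNG/GIF/JPEG files; may fail if the server ignores Range requests
// or the size marker sits later in the file.
function probe_dimensions(string $url, int $bytes = 32768): ?array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_RANGE          => '0-' . ($bytes - 1), // the server may ignore this
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 5,
    ]);
    $data = curl_exec($ch);
    curl_close($ch);

    if ($data === false) {
        return null;
    }
    $info = @getimagesizefromstring($data);  // warnings suppressed for truncated data
    return $info === false ? null : ['width' => $info[0], 'height' => $info[1]];
}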

Related

Get compressed and uncompressed page size with PHP

http://www.gidnetwork.com/tools/gzip-test.php is a great tool, but I need to test a site behind some walls and want to find similar results: compression type, markup size, compressed page size, and compression ratio.
How can I get both compressed and uncompressed page sizes? What is a better method for fetching this data (file_get_contents() or curl)?
The response headers contain everything else needed - I'm just not sure about the sizes.
You should use cURL, for two reasons:
You need the headers of the response, and cURL gives you easy access to them.
file_get_contents() is generally used for local includes rather than for fetching data from external sources. If you use it for external requests, issues such as memory limits and response size have to be handled with great care.
// Pass 1 as the second argument so the headers come back as an associative array
$headerArray = get_headers($pointer, 1);
$fileSize = $headerArray['Content-Length'];
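For the compressed vs. uncompressed comparison itself, a rough cURL sketch (assuming the server honours Accept-Encoding: gzip and sends a Content-Length header; the URL is a placeholder) could look like this:

<?php
// Sketch: request the page gzipped, let cURL decode it, then compare the sizes.
$ch = curl_init('http://example.com/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING       => 'gzip',     // send Accept-Encoding and auto-decode the body
]);
$body = curl_exec($ch);

$compressed   = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // from the Content-Length header
$uncompressed = strlen($body);                                       // size after decoding
curl_close($ch);

printf("compressed: %d bytes, uncompressed: %d bytes, ratio: %.1f%%\n",
    $compressed, $uncompressed, 100 * $compressed / max(1, $uncompressed));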

Retrieve images from a link

Is there a script, service, snippet, or method that can get a thumbnail from a URL? By thumbnail I don't mean a snapshot of the site, but an image that can automatically be fetched and used as a post thumbnail, much like the one Facebook uses. The image should be fetched like this: img src="xxxxxxx?url=google.com", which would fetch the Google logo.
Maybe there are existing solutions for this, but it's not really hard to implement (a rough sketch follows the notes below):
you need to fetch the remote site, e.g. with file_get_contents
optionally use Tidy to clean up the source HTML
parse the output with an XML parser if you used Tidy to clean the fetched data, or with an HTML parser otherwise
fetch the first n images from the site (n should be a relatively small number)
store the fetched image set in a cache, because this fetch-and-parse process can take time
Comments:
you may fetch the site's robots.txt to check whether you are allowed to use/index the content
set a timeout for fetching the remote website, because if that site is down or slow, the request would hang on your site as well
limit concurrent fetches per site and globally to protect against DoS-ing
you could use an HTTP client and limit the fetched HTML data size, or use HEAD HTTP method to fetch the Content-Length before downloading the actual content if it's allowed
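A minimal sketch of those steps (the function name, cache location, and one-hour cache lifetime are my own assumptions):

<?php
// Sketch: fetch a page with a timeout, grab the first few <img> URLs,
// and cache the result so repeat requests are cheap.
function fetch_first_images(string $url, int $max = 5, int $timeout = 5): array
{
    $cacheFile = sys_get_temp_dir() . '/imgcache_' . md5($url);
    if (is_file($cacheFile) && filemtime($cacheFile) > time() - 3600) {
        return json_decode(file_get_contents($cacheFile), true) ?: [];
    }

    $context = stream_context_create(['http' => ['timeout' => $timeout]]);
    $html    = file_get_contents($url, false, $context);
    if ($html === false) {
        return [];
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);                     // suppress warnings on messy markup
    $srcs = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $srcs[] = $img->getAttribute('src');
        if (count($srcs) >= $max) {
            break;                              // only the first few images are needed
        }
    }

    file_put_contents($cacheFile, json_encode($srcs));
    return $srcs;
}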

Crawling custom URLs for images and resizing them via AJAX

I'm building a web app where users can store links, with 200x200 pictures associated. By default, I'd like to crawl the link for images and then return thumbnails of the biggest ones (from which the user can select the "official" thumbnail). I want this all to happen via AJAX. My question is: what is the best way to do this?
Currently, I'm using the PHP Simple HTML DOM Parser to scan a URL. I then find the src attribute of all the <img> tags, use getimagesize to store the size of the image located at each URL, sort the array from biggest to smallest, and return the top 5 image URLs via AJAX to the client. The client then sends a separate AJAX request for each one, which makes a server-side ImageMagick script download and cut the image to a thumbnail, save it in a temporary folder, and return the URL of this thumbnail, which the client finally loads in his browser.
Needless to say, this is a little complicated and probably really inefficient. Running this process on http://en.wikipedia.org takes about 10-15 seconds from start to finish. I'm not certain there are any more efficient ways, however.
I'd do it in one AJAX request, with the script automatically resizing the biggest 5 images on the first pass, saving them, and returning a JSON array with the resized image URLs for the client.
You should probably use PHP's DOMDocument class to grab/parse the html page.
getimagesize() means you have to download and process each image. Perhaps you should consider simply showing the user ALL the images, by placing img tags that point back at the images on the original page. You could size these however you like using the width/height attributes. This way you do not have to download/process a single image until the user has actually selected one for the thumbnail.
Interested in whether / how you solved this?
In the end I looped through the images calling getimagesize() until both the height and width were over a certain size, and then broke out of the loop.
This way it's slightly more efficient, as it only downloads as many images as it needs.
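A short sketch of that loop (the 200x200 threshold and the $imageUrls array of collected src values are assumptions for illustration):

<?php
// Sketch: stop probing as soon as an image is "big enough",
// so only a few images actually get downloaded.
$minWidth  = 200;
$minHeight = 200;
$chosen    = null;

foreach ($imageUrls as $url) {       // $imageUrls: src values collected from the page
    $info = @getimagesize($url);     // downloads the image to read its dimensions
    if ($info !== false && $info[0] >= $minWidth && $info[1] >= $minHeight) {
        $chosen = $url;              // first sufficiently large image wins
        break;
    }
}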

Getting an image with PHP

Is it bad practice to retrieve images this way? I have a page that calls this script about 100 times (there are 100 images). Can I cause server overload or too many HTTP requests or something? I have problems with the server and I don't know if this is causing it :(
// SET THE CONTENT TYPE HEADER
header('Content-type: image/jpeg');
// GET THE IMAGE TO DISPLAY
$image = imagecreatefromjpeg('../path/to/image/' . $_SESSION['ID'] . '/thumbnail/' . $_GET['image']);
// OUTPUT IMAGE AND FREE MEMORY
imagejpeg($image);
imagedestroy($image);
I call the script from regular <img> tags. The reason I call them through PHP is that the images are private to the user.
All help greatly appreciated!!
With this, you are:
Reading the content of a file
Evaluating that content to an in-memory image
Re-rendering that image
If you just want to send an image (that you have on disk) to your users, why not just use readfile(), like this:
header('Content-type: image/jpeg');
readfile('../path/to/image/' . $_SESSION['ID'] . '/thumbnail/' . $_GET['image']);
With that, you'll just:
Read the file
and send its content
Without evaluating it to an image, eliminating some useless computation in the process.
As a side note: you should not use $_GET['image'] like that in your path: you must make sure no malicious data is injected via that parameter!
Otherwise, anyone could potentially access any file on your server; they just have to pass some relative path in the image parameter...
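A minimal sketch of the kind of sanitising this answer asks for (the 404 handling is my own addition):

<?php
// Sketch: basename() strips any directory components, so a value like
// "../../etc/passwd" cannot escape the thumbnail folder.
$file = basename($_GET['image'] ?? '');
$path = '../path/to/image/' . $_SESSION['ID'] . '/thumbnail/' . $file;

if ($file === '' || !is_file($path)) {
    http_response_code(404);   // unknown or missing image
    exit;
}

header('Content-type: image/jpeg');
readfile($path);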
Yes, it's very bad. You're decoding a .jpg into a memory-based bitmap (which is "huge" compared to the original binary .jpg). You then recompress the bitmap into a JPEG.
So you're wasting a ton of:
a) memory
b) CPU time
c) image quality, because JPEG is a lossy format and every re-encode loses a bit more.
why not just do:
<?php
header('Content-type: image/jpeg');
readfile('/path/to/your/image.jpg');
instead?
To answer two particular questions from your post:
Can I cause server overload or too many HTTP requests or something?
Yes, of course,
through both the numerous HTTP requests and the image processing.
You have to reduce the number of images and implement some pagination to show them in smaller batches.
You may also implement some conditional GET functionality to reduce bandwidth and load.
If things keep getting worse and you have some resources to spare, consider installing a content-delivery proxy; nginx with the X-Accel-Redirect header is a common example.
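A rough sketch of the X-Accel-Redirect idea (the /protected-images/ location is an assumption and must be declared internal in the nginx config):

<?php
// Sketch: PHP only checks access, then hands the file transfer off to nginx.
if (empty($_SESSION['ID'])) {
    http_response_code(403);   // not logged in, no image for you
    exit;
}
header('Content-type: image/jpeg');
header('X-Accel-Redirect: /protected-images/' . basename($_GET['image'] ?? '')); // nginx serves the bytes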
I have problems with the server and I don't know if this is causing it :(
You shouldn't shoot in the dark then. Profile your site first.

Base 64 encode vs loading an image file

So I am working on something in PHP where I have to get my images from a SQL database, where they will be encoded in base64. The speed of displaying these images is critical, so I am trying to figure out whether it would be faster to turn the database data into an image file and then load it in the browser, or to just echo the raw base64 data and use:
<img src="data:image/jpeg;base64,/9j/4AAQ..." />
which is supported in Firefox and other Gecko browsers.
So to recap: would it be faster to transfer an actual image file or the base64 code? Would it require fewer HTTP requests when using AJAX to load the images?
The images would be no more than 100 pixels total.
Base64 encoding makes the file bigger and therefore slower to transfer.
If you include the image in the page, it has to be downloaded every time. External images are normally downloaded only once and then cached by the browser.
It isn't compatible with all browsers
Well, I don't agree with any of you. There are cases when you have to load more and more images. Not every page contains just 3 images. Actually, I'm working on a site where you have to load more than 200 images. What happens when 100,000 users request those 200 images on a heavily loaded site? The disks of the server returning the images could collapse. Even worse, you have to make that many requests to the server instead of one with base64. For that many thumbnails I'd prefer the base64 representation, pre-saved in the database. I found the solution and a strong argument at http://www.stoimen.com/2009/04/23/when-you-should-use-base64-for-images/. The author is really in that situation and ran some tests. I was impressed and ran my own tests as well. The reality is as he says: for that many images loaded in one page, a single response from the server is really helpful.
Why regenerate the image again and again if it will not be modified? Hypothetically, even if there are 1000 different possible images to be shown based on 1000 different conditions, I still think that 1000 images on disk are better. Remember, disk-based images can be cached by the browser and save bandwidth, etc.
It's a very fast and easy solution. Although the image size will increase by about 33%, using base64 will significantly reduce the number of HTTP requests.
Google Images and Yahoo Images are using base64 and serving images inline. Check the source code and you'll see it.
Of course there are drawbacks to this approach, but I believe the benefits outweigh the costs.
A con I have found is on slow devices. For example, on the iPhone 3GS the images served by Google Images are very slow to render, since the images come gzipped from the server and must be decompressed in the browser. So if the customer has a slow device, he will suffer a little when rendering the images.
To answer the initial question, I ran a test measuring a 400x300 px JPEG image at 96 ppi:
base64ImageData.Length: 177732
bitmap.Length: 129882
I have used base64 images once or twice for icons (10x10 pixels or so).
Base64 images pros:
compact - you have a single file. Also, if the file is compressed, the base64 image compresses to almost the size of the normal image.
the page is retrieved in a single request.
Base64 images cons:
to be realistic, you probably need to use a scripting engine (such as PHP) on all pages that contain the image.
if the image is changed, all cached pages must be re-downloaded.
because the image is inline, you cannot use a CDN or a static-content web server.
Normal images pros:
if you use the SPDY protocol, at least in theory, the page + images + CSS will load in a single request too.
you can set an expiration header on the image, so the content will be cached by browsers.
I don't think data: URIs work in IE7 or below.
When an image is requested, you could save it to the filesystem and serve it from there from then on. If the image data in the database changes, just delete the file. Serve it from another domain too, like img.domain.com. You get all the benefits of Last-Modified or ETags for free from your web server without having to start up PHP unless you need to.
If you're using apache:
# If the file doesn't exist:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^/?(image123)\.jpg$ makeimage.php?image=$1
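A rough sketch of what the makeimage.php side of that could look like (the table and column names, the database credentials, and the assumption that the script lives in the directory the images are requested from are all mine):

<?php
// makeimage.php - sketch of the "write once, then serve statically" idea above.
$name = basename($_GET['image'] ?? '');               // e.g. "image123"
if (!preg_match('/^[A-Za-z0-9_-]+$/', $name)) {
    http_response_code(404);
    exit;
}

$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');   // placeholder credentials
$stmt = $pdo->prepare('SELECT data FROM images WHERE name = ?');      // assumed schema
$stmt->execute([$name]);
$data = $stmt->fetchColumn();

if ($data === false) {
    http_response_code(404);
    exit;
}

file_put_contents(__DIR__ . '/' . $name . '.jpg', $data);  // next request is served by Apache directly
header('Content-type: image/jpeg');
echo $data;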
Generally, base64 encoding increases the byte size by about a third. Because of that, you have to move an extra third of the bytes out of the database to the server, and then move those same extra bytes over the wire to the browser.
Of course, as the size of the image grows, the overhead mentioned will increase proportionately.
That being said, I think it is a good idea to change the files into their byte representations in the db, and transmit those.
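A quick illustration of that roughly one-third overhead ('photo.jpg' is just a placeholder file):

<?php
// Sketch: compare a raw file size with its base64-encoded size.
$raw = file_get_contents('photo.jpg');
$b64 = base64_encode($raw);
printf("raw: %d bytes, base64: %d bytes (+%.0f%%)\n",
    strlen($raw), strlen($b64), 100 * (strlen($b64) / strlen($raw) - 1));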
To answer the OP's question:
Serve them as static files, directly from disk through the web server.
At only 100px they are ideally suited to in-memory caching by the web server.
There is a plethora of info, caching strategies, configs, and how-tos for just about every web server out there.
In fact, the best option in terms of user experience (the image speed you refer to) is to use a CDN-capable object store. Period.
The "DB as static storage" choice is simply expensive: in terms of all the overhead processing and the burden on the DB, as well as financially and in terms of technical debt.
A few things, from several answers:
Google Images and Yahoo Images are using base64 and serving images inline. Check the source code and you'll see it.
No. They absolutely do NOT. Images are mostly served from a static file web server, specifically gstatic.com:
e.g. https://ssl.gstatic.com/gb/images/p1_2446527d.png
compact - you have a single file. Also, if the file is compressed, the base64 image compresses to almost the size of the normal image.
So actually, no advantage at all, plus the processing needed to compress?
the page is retrieved in a single request.
Again, multiple parallel requests as opposed to a single larger load.
What happens when 100,000 users request those 200 images on a heavily loaded site? The disks of the server returning the images could collapse.
You will still be sending the same amount of data, but with a longer connection time, as well as stressing your database. Secondly, the odds of a run-of-the-mill site having 100,000 concurrent connections... and even if so, if you are running this all off a single server, you are a foolish admin.
By storing the images, binary blobs or base64, in the DB, all you are doing is adding huge overhead to the DB. Either you have masses and masses of RAM, or your query via the DB will come off the disk anyway.
And if you DID have such unlimited RAM, then serving the binary images off a ramdisk, ideally via an alternative dedicated, lightweight web server optimised for static files and caching, configured on a subdomain, would be the fastest, lightest load possible!
Forward planning? You can only scale up so far, and scaling a DB is expensive (relatively speaking). Again the disks you say will "sp
In such a case, where you are serving hundreds of images to 100,000 concurrent users, serving your images should be the domain of a CDN object store.
If you want the fastest speed, then you should write the images to disk when they are uploaded/modified and let the web server serve static files. Rojoca's suggestions are good too, since they minimize the invocation of PHP. An additional benefit of serving from another domain is that (most) browsers will issue the requests in parallel.
Barring all that, when you query for the data, check whether it was last modified, then write it to disk and serve it from there. You'll want to make sure you respect the If-Modified-Since header so you don't transfer data needlessly.
If you can't write to disk, or to some other cache, then it would be fastest to store the images as binary data in the database and stream them out. Adjusting buffer sizes will help at that point.
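A small sketch of honouring If-Modified-Since for a disk-cached image, as suggested above (the cache path and the image parameter are assumptions):

<?php
// Sketch: answer 304 when the browser's copy is still current, otherwise stream the file.
$path = '/var/cache/images/' . basename($_GET['image'] ?? '');
if (!is_file($path)) {
    http_response_code(404);
    exit;
}
$mtime = filemtime($path);

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    && strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $mtime) {
    http_response_code(304);   // nothing changed, send no body
    exit;
}

header('Content-type: image/jpeg');
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
readfile($path);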
