I'm trying to build a mechanism that will scan a website at a given URL and get all of its images. Currently I'm using simple_html_dom, which is slow.
Scanning a website from localhost is taking me about 30s - 1 min.
What I need to do is:
load a URL.
scan for images (if possible, only those above a specific size, e.g. width > x)
print them.
I'm looking for the fastest way.
There is no fastest way.
You cannot reduce network latency.
You cannot avoid fetching an image to detect its size.
The rest of the operations are already a negligible part of the process.
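The other answer is oversimplified: you can reduce the overall network traffic by sending HEAD requests to the server to get each image's size before downloading it, immediately saving you almost all of the bandwidth for images with size < x. (A HEAD response reports the file size in bytes via Content-Length, when the server sends it; pixel dimensions still require reading at least the start of the file.)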
Depending on the size of the pages involved, the choice of string operations used to extract the image URLs could be important as well. PHP's perfectly adequate for the needs it caters for but it's still a moderately slow, interpreted language at the end of the day and I find calling routines which involve moving large substrings around appreciably laggy sometimes. In this case parsing it fully, even using a simple library, is overkill.
The reason I would go to extreme lengths to download only the bare minimum of images is that some PHP methods for doing so are very slow. If I use copy() to download a file and then do the same thing using raw sockets or cURL, copy() sometimes takes at least twice as long.
So choice of transfer method and choice of parsing method both have a noticeable effect.
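As a rough sketch of the approach in this answer (pull the image URLs out with a cheap string operation, then ask for headers only before deciding what to download), something like the following could work. The function names extractImageUrls and remoteContentLength are made up for illustration, the regex is deliberately naive (it ignores srcset, CSS backgrounds and relative-URL resolution), and the second function assumes the cURL extension is available.

```php
<?php
// Hypothetical helper: pull src attributes out of <img> tags with one regex
// instead of building a full DOM. Naive on purpose; unusual markup is missed.
function extractImageUrls(string $html): array
{
    preg_match_all('/<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1/is', $html, $matches);
    return array_values(array_unique($matches[2]));
}

// Hypothetical helper: send a HEAD request with cURL and return the reported
// Content-Length in bytes, or null if the server does not report one.
function remoteContentLength(string $url): ?int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD: headers only, no body
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 5,
    ]);
    $ok = curl_exec($ch);
    // cURL reports -1 when the length is unknown.
    $length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    return ($ok !== false && $length >= 0) ? (int) $length : null;
}
```

Only the URLs whose reported length passes the size threshold would then be downloaded in full; images served without a Content-Length still have to be fetched to be measured.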
Theorizing here on how to get lightning-fast media and prevent hotlinking, and <img src="data:image-kj134332k4" /> is coming to mind, among other things. Scrapers don't need our src and real clients need instant load (esp. on cell networks). Considering the recent Google HTTPS-everywhere move, this would drastically decrease handshakes as well.
What disadvantages are there to crafting lists such as ecom categories/widgets/slideshows using data:image?
Are there any implications to extra KB of actual source code versus serving a vastly larger total page size?
Do y'all prefer any PHP data:image gen script over another for parsing images as data at certain controller levels (leaving standard src images in other areas)?
Are there caching/CDN concerns? Would the parse wonk the cache somehow? Seems not, but I'm not a cache expert.
Any guidance or case thoughts are much appreciated. Thank you!
Generally, the idea is worth considering, but in most cases the problems outweigh the benefits.
It is true that these images won't be cached on the client side anymore. Especially Expires-based caching saves you tons of bandwidth.
As a rule of thumb I'd say: If these are small images that change frequently, embedding is a good idea. If images are larger and clients load the same image more than once in subsequent requests, do by all means deliver images separately and put some effort into caching.
As for the other points:
Most browsers support this; however, some old IEs don't, so think of a fallback solution or be ready to get bug reports (may be negligible, depending on your user base).
The number of SSL handshakes is negligible if you're using HTTP keep-alive, which is standard. Follow-up requests do indeed require a new handshake, but if you cache properly (see next point) and maybe put static files on a CDN, this is no problem.
Read about caching, especially the Expires/Cache-Control headers and their friends.
If you decide to embed, you don't really need a generator script; embedded images are base64-encoded image files, so this shouldn't take more than 3 lines of code.
However, if you process/convert your images in PHP, there's even another disadvantage: Instead of statically serving them (maybe even from a different machine or CDN), images have to be on the same machine and go through the PHP engine, thus increasing the used memory of each process that serves a page with these images.
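To illustrate how little code the embedding itself needs, here is a minimal sketch. The helper name dataUri and the file name icon.png are assumptions for the example; the MIME type has to match the actual image.

```php
<?php
// Hypothetical helper: wrap raw image bytes in a data: URI for use in <img src>.
function dataUri(string $bytes, string $mimeType): string
{
    return 'data:' . $mimeType . ';base64,' . base64_encode($bytes);
}

// Usage (assumes a small local file named icon.png exists):
// echo '<img src="' . dataUri(file_get_contents('icon.png'), 'image/png') . '" alt="">';
```

Note that this runs on every page render unless the encoded string is itself cached, which is the memory and CPU cost described above.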
I've been on a project for the past few days and hit a problem displaying large quantities of images (20+ GB total, ~1-2 GB per directory) in a gallery in one area of the site. The site is built on the Bootstrap framework. I've been trying to make massive carousels that ultimately do not function fluidly due to the combined /images size. Question A: in this situation, do I need I/O from a database and to store the images there? Is that faster than keeping them in the /images folder on the front end?
And B) in my PHP script I need to set directories to variables, iterate through them and display the images in <li> elements, but how do I go about putting controls on memory usage so as not to overload the browser? Any additions, suggestions, or alternatives would be greatly appreciated. I'm looking for the most direct means to an end here.
Though the question is a little generic, here are some thoughts in regards to your two questions:
A) No, performance pulling images from a database would most likely be worse than pulling straight from the file system. In general, it is not a good idea to store images or other binary data in databases unless you absolutely have to, because databases can't do much with this information and you are just adding an extra layer on top of the file system that doesn't need to be there. You would, however, want to store paths to images in your database, potentially along with other characteristics such as image dimensions, thumbnail paths, keywords, etc. Then your application would read the entries for the images to return the correct paths to the images.
B) You will almost certainly want to implement some sort of paging if you are displaying many hundreds or thousands of photos. If the final display must be a carousel, you will want to investigate the Javascript that drives it to determine how you could hook in a function that retrieves more results from your PHP application via an AJAX call when it reaches the end or near end of the current listing of images. If you are having problems with the browser crashing due to too many images, you will also want to remove images from the first part of the list of <li>s when you load new ones so that it keeps the DOM under control.
A) It's a bad idea to store that much binary data in a database. Even if the DB allows it, you shouldn't use it. It also costs much more memory: all your data is held in the database's memory space, then copied into PHP's memory space for you to handle, which eats up twice the memory, plus the overhead of running a database server, querying, etc. So no, using a database is slower; accessing the filesystem directly is faster. If you also use Varnish or another front-end caching system, you'll be able to serve content faster still.
What I would do is store the files on the filesystem. The best servers for static serving like that are G-WAN or NGINX, but do read up and decide for yourself what suits you best. The point is: stay away from Apache, and probably host all those static files on a separate server running a lightweight HTTP server.
ProTip: Save multiple copies of the same image at scaled-down sizes, for example one at 50% and another at 25% of the original size. This way you can send the thumbnails first for quick browsing, and when a user decides to view an image you serve the 50% or 100% size, depending on their screen size. This saves you bandwidth and memory, and it saves mobile users a big 3G bill.
B) This is where it makes some sense to use a database: you can index all the directories into a database, and use it to store the location of each image in the FS, and perhaps some tags, and maybe even the number of views, etc.
In the frontend you'll implement a script that fetches, for example, 50 thumbnails per page; the user can then scroll around using some fancy jQuery, and when you need more, you simply get a new result set with 50 more thumbs, and so on.
This way you'll save yourself memory and bandwidth, and the users will thank you for such a lightweight browsing experience!
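The server side of that paging could look roughly like this. It is only a sketch: the function name thumbnailPage is made up, and the list of paths is assumed to come from the database index described above rather than from a directory scan on every request.

```php
<?php
// Hypothetical helper: return one page of thumbnail paths from an indexed list.
// $page is 1-based; $perPage defaults to the 50 thumbs per page mentioned above.
function thumbnailPage(array $paths, int $page, int $perPage = 50): array
{
    $page   = max(1, $page);
    $offset = ($page - 1) * $perPage;
    return [
        'items'   => array_slice($paths, $offset, $perPage),
        'hasMore' => $offset + $perPage < count($paths),
    ];
}

// Usage: echo json_encode(thumbnailPage($allPaths, (int) ($_GET['page'] ?? 1)));
```

The hasMore flag lets the frontend script know when to stop requesting further pages.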
Another tip:
If you want to be able to handle bigger traffic, you might want to consider using a CDN, there are many CDN services that aren't as expensive as Amazon S3, a simple search will give you tons of resources !
Happy hacking !
I've been searching about this for a while but I didn't find what I wanted, so here is my problem:
using PHP,
I want to create a very big image file, let's say 20000 gigapixels, then I want to add a small image at a specific location on this big image. My computer doesn't have enough RAM to load the entire image and manipulate pixels that way, so I think I need to access the image data on the hard disk and manipulate it in some way. Does anyone know how to do this?
thanks for helping me out :)
ImageMagick supports operations on very large files. I don't see support in the PHP/ImageMagick API, but you could call out (exec) to the command line program and use one of its disk caching or streaming options.
There is some documentation for dealing with large files here: www.imagemagick.org.
What would you do with an image that size? You couldn't serve it to a browser, and even if you did manage to load it into the server, it would take up all the server resources, so you wouldn't be using the server for anything else in the meanwhile.
The short answer is that handling an image of that kind of scale as a single file in RAM is out of the question unless you've got an extremely powerful machine dedicated to it, and nothing else. At 20k x 20k pixels, even a simple monochrome image at one byte per pixel is going to take 400 MB. Scale that up to any useful colour depth, and you're talking about gigabytes of RAM just to hold the graphic, and that's before we even start thinking about actually doing stuff with it.
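The arithmetic behind those figures can be checked with a couple of lines. This is just an illustration: the function name rawImageBytes is made up, and the byte-per-pixel values are assumptions (1 for 8-bit greyscale, 4 for 32-bit RGBA) for an uncompressed in-memory bitmap on a 64-bit PHP build.

```php
<?php
// Bytes needed to hold an uncompressed image in memory.
function rawImageBytes(int $width, int $height, int $bytesPerPixel): int
{
    return $width * $height * $bytesPerPixel;
}

echo rawImageBytes(20000, 20000, 1) / 1e6, " MB\n"; // 8-bit greyscale
echo rawImageBytes(20000, 20000, 4) / 1e9, " GB\n"; // 32-bit RGBA
```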
I guess the solution is to look at what other people do, given the same problem.
Real applications that use images of that scale (eg mapping apps or panorama photos like this one) store their image as a series of much smaller blocks. Each block is a smaller image in its own right. They'd also usually have separate sets of blocks for each zoom level too. Handling a single massive image file is implausible for any realistic server environment, but smaller chunks make it easy to handle for both browser and server. The server just sends the blocks to the user that are in the current view; when the user scrolls or zooms, they get sent more blocks.
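A minimal sketch of the tile lookup such a scheme implies: given the part of the big image currently in view, work out which blocks need to be sent. The function name tilesForViewport and the 256-pixel tile size are assumptions for the example, and coordinates are assumed to be non-negative.

```php
<?php
// Hypothetical helper: list the [column, row] indices of the tiles that
// intersect a viewport, given the viewport's top-left corner and size in pixels.
function tilesForViewport(int $x, int $y, int $width, int $height, int $tileSize = 256): array
{
    $firstCol = intdiv($x, $tileSize);
    $firstRow = intdiv($y, $tileSize);
    $lastCol  = intdiv($x + $width - 1, $tileSize);
    $lastRow  = intdiv($y + $height - 1, $tileSize);

    $tiles = [];
    for ($row = $firstRow; $row <= $lastRow; $row++) {
        for ($col = $firstCol; $col <= $lastCol; $col++) {
            $tiles[] = [$col, $row];
        }
    }
    return $tiles;
}
```

Each [column, row] pair would map to one small image file on disk, with a separate set of files per zoom level.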
Your question mentions adding a smaller image to a specific location on the big one. Again, looking at how others do this, Google Maps and others handle this kind of thing using a layering system. The layers are built up and sent to the browser separately.
I know that doesn't directly answer the question, but I hope it gives you some options to think about.
Just keep a simple file, not an image, and store the pixel data in it in any custom format. PHP has an fseek function, which allows you to jump to any location in the file, so you can calculate the needed location and perform a read/write on it. If you have an image of size W x H, and each pixel takes 3 bytes, then the address of pixel (X, Y) in the file will be (W * Y + X) * 3.
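The offset formula above can be tried out on a tiny file. This is a sketch under the stated assumptions (3 bytes per pixel, rows stored one after another, no header); the helper names pixelOffset and writePixel are made up for illustration.

```php
<?php
// Byte offset of pixel (x, y) in a raw file of 3-byte pixels stored row by row.
function pixelOffset(int $x, int $y, int $width): int
{
    return ($width * $y + $x) * 3;
}

// Hypothetical helper: overwrite one RGB pixel in place without loading the file.
function writePixel($handle, int $x, int $y, int $width, int $r, int $g, int $b): void
{
    fseek($handle, pixelOffset($x, $y, $width));
    fwrite($handle, pack('C3', $r, $g, $b));
}

// Usage: create a 4 x 2 "image" of black pixels, then set pixel (2, 1) to red.
$path = tempnam(sys_get_temp_dir(), 'raw');
file_put_contents($path, str_repeat("\0", 4 * 2 * 3));
$fh = fopen($path, 'r+b');
writePixel($fh, 2, 1, 4, 255, 0, 0);
fclose($fh);
```

Pasting a small image into the big one is then a loop over the small image's rows, seeking once per row and writing that row's bytes in one fwrite call.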
I want to create multiple thumbnails using the GD library in PHP, and I already have a script to do this. The question is which is better for me: is it better to create the thumbnail on the fly, or to create a physical file on my server each time I want a thumb? And why?
Please consider time consumption, storage capacity and other disadvantages for both.
When you create the thumbnail depends on a couple of factors (that I'll get into) but you should never discard the output of something like this (unless you'll never use it again) as it's a really expensive operation.
Anyway your two main choices for "when to generate the thumbnail" are:
When it's first requested. This is common and it means that you don't generate thumbnails that are never used but it does mean if you have a page full of first-time-thumbnails that the server might become overwhelmed with PHP processes generating the thumbnails.
I had a similar issue with Sorl+Django where I was generating 100+ thumbnails per request for the first few requests after uploading and it basically made the entire server hang for 20 minutes. Not good.
Generate all required thumbnails when you upload. Because it takes a long time to upload, you break down the processing quite a lot. You can also pull it out-of-process (ie use another script to process uploads - perhaps not even in PHP).
The obvious downside is you're using up disk space that you otherwise might not need to use up... But unless you're talking about hundreds of thousands of thumbnails, a small percentage of unused ones probably won't break the bank.
Of course, if disk space is an issue, there might be an argument for pushing the thumbnail up to a CDN at the same time as you process it.
One note: when you save the thumbnails, keep in mind that it's fairly common to want to resize them at some point down the line, or to want two small variants. I find it really useful to make the filenames very specific, so if the original image is image.jpg, the 200x200 version is image-200x200.jpg.
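That naming convention is easy to generate from the original path. A small sketch, with thumbnailName as a made-up helper name:

```php
<?php
// Hypothetical helper: derive "image-200x200.jpg" from "image.jpg".
function thumbnailName(string $original, int $width, int $height): string
{
    $info = pathinfo($original);
    $dir  = (isset($info['dirname']) && $info['dirname'] !== '.') ? $info['dirname'] . '/' : '';
    $ext  = isset($info['extension']) ? '.' . $info['extension'] : '';
    return $dir . $info['filename'] . '-' . $width . 'x' . $height . $ext;
}
```

Because the size is part of the name, adding a new variant later never collides with thumbnails that already exist.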
Neither/both - don't generate the thumbnails till you need them - but keep the files you generate.
That way you'll minimise the amount of work needed and have a self-repairing system
GD is really resource heavy, so you should look at if you can use ImageMagick instead (which also has a clearer syntax).
You definitely will be better off caching the created thumbnail after the first run (regardless of if you run GD or ImageMagick) and serve them from the cache. If you are worried about storage, clear out old files from the cache now and then.
Always cache (= write out to disk) the results of GD operations. They are too expensive both regarding processor time and memory to be done on the fly every time. This becomes increasingly true the more visitors/hits you have.
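The "generate once, then serve from disk" pattern that these answers agree on can be sketched as follows. The function name cachedThumbnail is hypothetical, and the actual resizing is passed in as a callback so the sketch does not assume GD or ImageMagick.

```php
<?php
// Hypothetical helper: return the path of a cached thumbnail, generating it
// only if it is not on disk yet. $generate receives the source and target
// paths and is expected to write the thumbnail (e.g. with GD or ImageMagick).
function cachedThumbnail(string $source, string $cachePath, callable $generate): string
{
    if (!is_file($cachePath)) {
        $generate($source, $cachePath);
    }
    return $cachePath;
}
```

Deleting a cached file is enough to force regeneration on the next request, which is what makes the setup self-repairing.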
So I am working on something in PHP where I have to get my images from a SQL database, where they will be encoded in base64. The speed of displaying these images is critical, so I am trying to figure out whether it would be faster to turn the database data into an image file and then load it in the browser, or just echo the raw base64 data and use:
<img src="..." />
Which is supported in Firefox and other Gecko browsers.
So to recap, would it be faster to transfer an actual image file or the base64 code? Would it require fewer HTTP requests when using Ajax to load the images?
The images would be no more than 100 pixels total.
Base64 encoding makes the file bigger and therefore slower to transfer.
By including the image in the page, it has to be downloaded every time. External images are normally only downloaded once and then cached by the browser.
It isn't compatible with all browsers
Well, I don't agree with any of you. There are cases where you have to load a great many images; not every page contains just 3 images. I'm actually working on a site where you have to load more than 200 images. What happens when 100000 users request that 200 images on a very loaded site. The disks of the server, returning the images should collapse. Even worse, you have to make that many requests to the server instead of one with base64. For that many thumbnails I'd prefer the base64 representation, pre-saved in the database. I found the solution and a strong argument at http://www.stoimen.com/2009/04/23/when-you-should-use-base64-for-images/. The author is in exactly that situation and ran some tests. I was impressed and ran my own tests as well. The results match what he says. With that many images loaded in one page, a single response from the server is really helpful.
Why regenerate the image again and again if it will not be modified? Hypothetically, even if there are 1000 different possible images to be shown based on 1000 different conditions, I still think that 1000 images on disk are better. Remember, disk-based images can be cached by the browser, saving bandwidth and so on.
It's a very fast and easy solution. Although the image size will increase about 33% in size, using base64 will reduce significantly the number of http requests.
Google images and Yahoo images are using base64 and serving images inline. Check source code and you'll see it.
Of course there are drawbacks to this approach, but I believe the benefits outweigh the costs.
One con I have found concerns slow devices. For example, on an iPhone 3GS the images served by Google Images are very slow to render, since the images come gzipped from the server and must be uncompressed in the browser. So, if the customer has a slow device, he will suffer a little when rendering the images.
To answer the initial question, I ran a test measuring a jpeg image 400x300 px in 96 ppi:
base64ImageData.Length
177732
bitmap.Length
129882
I have used base64 images once or twice for icons (10x10 pixels or so).
Base64 images pros:
compact - you have single file. also if file is compressed, base64 image is compressed almost to the size of normal image.
page is retrieved in single request.
Base64 images cons:
to be realistic, you probably need to use a scripting engine (such as PHP) on all pages that contain the image.
if image is changed, all cached pages must be re-downloaded.
because image is inline, you can not use CDN or static content web server.
Normal images pros:
if you use the SPDY protocol, at least theoretically, page + images + CSS will load with a single request too.
you can set an expiration on the image, so the content will be cached by browsers.
Don't think data: URIs work in IE7 or below.
When an image is requested you could save it to the filesystem, then serve that from then on. If the image data in the database changes, just delete the file. Serve it from another domain too, like img.domain.com. You can get all the benefits of Last-Modified or ETags for free from your webserver without having to start up PHP unless you need to.
If you're using apache:
RewriteEngine On
# If the file doesn't exist, hand the request to PHP:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^/?(image123)\.jpg$ makeimage.php?image=$1 [L]
Generally, using base64 encoding is going to increase the byte size by about 1/3. Because of that, you are going to have to move 1/3 more bytes from the database into the server, and then move those same extra bytes over the wire to the browser.
Of course, as the size of the image grows, the overhead mentioned will increase proportionately.
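The one-third figure follows from base64 turning every 3 input bytes into 4 output characters. A quick check, with base64Length as a made-up helper name and a dummy 3000-byte payload standing in for an image:

```php
<?php
// Length of the base64 text for a payload of $bytes bytes (including padding).
function base64Length(int $bytes): int
{
    return 4 * intdiv($bytes + 2, 3);
}

$raw     = str_repeat("\xAB", 3000);   // stand-in for image data
$encoded = base64_encode($raw);
printf(
    "%d bytes -> %d characters (+%.0f%%)\n",
    strlen($raw),
    strlen($encoded),
    100 * (strlen($encoded) - strlen($raw)) / strlen($raw)
);
```

Encoders that insert line breaks into the output add a little more on top of this.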
That being said, I think it is a good idea to change the files into their byte representations in the db, and transmit those.
To answer the OP Question.
As static files, served directly from disk by the web server.
At only 100px they are ideally suited to in-memory caching by the web server.
There is a plethora of info, caching strategies, configs and how-tos for just about every web server out there.
In fact, the best option in terms of user experience (the image speed you refer to) is to use a CDN-capable object store. Period.
The "DB" as static storage choice is simply expensive - in terms of all the overhead processing, the burden on the DB, as well as financially, and in terms of technical debt.
A few things, from several answers
"Google images and Yahoo images are using base64 and serving images inline. Check source code and you'll see it."
No. They absolutely do NOT. Images are mostly served from a static file "web server", specifically gstatic.com:
e.g. https://ssl.gstatic.com/gb/images/p1_2446527d.png
"compact - you have single file. also if file is compressed, base64 image is compressed almost to the size of normal image."
So actually, no advantage at all, plus the processing needed to compress?
"page is retrieved in single request."
Again, multiple parallel requests as opposed to a single larger load.
"What happens when 100000 users request that 200 images on a very loaded site. The disks of the server, returning the images should collapse."
You will still be sending the same amount of data, but with a longer connection time, as well as stressing your database. Secondly, the odds of a run-of-the-mill site having 100000 concurrent connections... and even if so, if you are running this all off a single server you are a foolish admin.
By storing the images, whether binary blobs or base64, in the DB, all you are doing is adding huge overhead to the DB. Either you have masses and masses of RAM, or your query via the DB will come off the disk anyway.
And, if you DID have such unlimited RAM, then serving the bin images off a Ramdisk - ideally via an alternative dedicated, lightweight webserver static file & caching optimised, configured on a subdomain, would be the fastest, lightest load possible!
Forward planning? You can only scale up so far, and scaling a DB is expensive (relatively speaking). Again the disks you say will "sp
In such a case, where you are serving hundreds of images to 100000 concurrent users, the serving of your images should be the domain of a CDN object store.
If you want the fastest speed, then you should write them to disk when they are uploaded/modified and let the webserver serve static files. Rojoca's suggestions are good, too, since they minimize the invocation of php. An additional benefit of serving from another domain is (most) browsers will issue the requests in parallel.
Barring all that, when you query for the data, check if it was last modified, then write it to disk and serve from there. You'll want to make sure you respect the If-Modified-Since header so you don't transfer data needlessly.
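Respecting If-Modified-Since in a PHP script that serves the image could look roughly like this. It is a sketch only: the function name isNotModified is made up, and $path in the usage comment is assumed to point at the file written to disk.

```php
<?php
// Hypothetical helper: true if the client's cached copy is still current,
// i.e. the If-Modified-Since date is not older than the resource's mtime.
function isNotModified(?string $ifModifiedSince, int $lastModified): bool
{
    if ($ifModifiedSince === null || $ifModifiedSince === '') {
        return false;
    }
    $since = strtotime($ifModifiedSince);
    return $since !== false && $since >= $lastModified;
}

// Usage inside a script that serves an image (sketch; $path is assumed to exist):
// $mtime = filemtime($path);
// header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
// if (isNotModified($_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null, $mtime)) {
//     http_response_code(304);
//     exit;
// }
// readfile($path);
```

When the web server serves the file statically, it normally handles this header itself, so the check is only needed on the path that goes through PHP.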
If you can't write to disk, or some other cache, then it would be fastest to store it as binary data in the database and stream it out. Adjusting buffer sizes will help at that point.