I have seen, on various sites, a query string with a number appended to image and CSS file URLs. When I look at the source code (via Chrome Developer Tools), the cached CSS files and images do not have the query-string number in their names. I have also seen sites where the number in the query string changes when I refresh the page.
For example:
myimage.jpg?num=12345
myStyles.css?num=82943
After refresh:
myimage.jpg?num=67948
myStyles.css?num=62972
Can anyone explain to me what the purpose of these query strings could be, other than tracking?
Developers often use those query strings with random numbers (or version numbers) to force the browser to request a fresh copy and avoid caching of those files, since the request URL is different each time.
So if you have a file /image.png that is a generated image, perhaps a captcha, you could request it with a random query string, /image.png?399532. The browser would then not pull image.png from its cache, but would instead download a fresh copy from the server.
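A minimal PHP sketch of that idea (the captcha file name is just a placeholder):

```php
<?php
// Append a random number so every request URL is unique and the browser
// never serves the generated image from its cache.
$src = 'captcha.png?num=' . mt_rand(10000, 99999);
echo '<img src="' . htmlspecialchars($src) . '" alt="captcha">';
```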
1. Prevent caching: the query string can provide a unique URL each time the file is updated, causing the browser to download a new copy rather than load a stale one from its cache.
2. Versioning: similar to #1, but with a more specific purpose.
The query string is used for versioning; it forces the browser to reload the CSS and images instead of using its cached copies.
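For the versioning variant, a common pattern is to derive the number from the file's modification time, so the URL changes only when the file actually changes. A sketch with a hypothetical asset_url() helper:

```php
<?php
// Hypothetical helper: append the file's mtime as the version number.
// The browser keeps its cached copy until the file is actually updated.
function asset_url(string $path): string
{
    return $path . '?v=' . (is_file($path) ? filemtime($path) : 0);
}

echo '<link rel="stylesheet" href="' . asset_url('myStyles.css') . '">';
```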
Is it possible to cache a whole page in the browser, but not cache one part of it?
For example, I have a page with a date on it. Only the date changes daily; the rest of the page never changes. How should I cache such a page in the browser?
Can a page cached in the browser contain dynamic content?
I am new to caching, and I do not understand how it works with dynamic content and browser caching. Is it right that, from the moment a dynamic page is cached, it is always served as it was at the time of caching, and new dynamic content is not displayed?
I am not asking about server-side caching, only browser-side caching.
There is no specific mechanism for excluding part of a page from the cache, but you can use a few tricks:
- You can cache the whole page and swap out the part you want with an iframe.
- You can cache the whole page and update the part you want with AJAX (see the sketch at the end of this answer).
- You can cache the whole page and update the part you want with a JavaScript file.
I have not tested the iframe solution and am not sure it works.
If the JS files are cached, you can add a version to their file names, like scripts.v.2.3.js, and load them by version name.
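As an example of the AJAX trick: the cached page can fetch the changing fragment from a tiny endpoint that is itself marked uncacheable. A sketch, with date.php as a made-up name:

```php
<?php
// date.php -- returns only the fragment that changes daily.
// The no-store header keeps this one response out of the browser cache
// while the surrounding page remains fully cacheable.
header('Content-Type: text/plain; charset=utf-8');
header('Cache-Control: no-store, max-age=0');
echo date('Y-m-d');
```

The cached page then loads this fragment with XMLHttpRequest (or fetch) and writes the result into the date placeholder.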
You can't really cache "part" of a file dynamically; you can, however, cache separate assets, and the more you split your page into separate assets, the more of them you can cache individually:
- Your index.html could have a don't-cache setting (using the Cache-Control header).
- Your logo.png could have a long cache lifetime of, say, 10 days.
Both are sketched below.
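If those responses happen to go through PHP, the headers could be set roughly like this (a sketch; the two calls belong in two different scripts):

```php
<?php
// For the HTML page (e.g. index.php): always revalidate with the server.
header('Cache-Control: no-cache, must-revalidate');

// For a long-lived asset streamed through PHP (e.g. a logo):
// cache for 10 days (10 * 86400 = 864000 seconds).
// NOTE: this belongs in the asset-serving script, not next to the call above.
header('Cache-Control: public, max-age=864000');
```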
Now, if you want certain elements to change while the core stays the same, I believe this is a better job for JavaScript. You could write a JavaScript function to display the date; then you can fully cache the HTML page and the JavaScript file, and since the raw content never changes (only the DOM manipulation does), you have very few client-to-server requests.
I'm currently rewriting a website that needs a lot of different sizes for each image. In the past I did this by creating thumbnails in every size at upload time, but now I have doubts about that approach, because I have to change my design and half of my images are no longer the right size. So I'm considering two solutions:
1. Keep doing this and add a button to the back end to regenerate all the images. The problem is that I always need to know every size used by every part of the site.
2. Upload only the full-size image and, when displaying it, put something like src="thumbs.php?img=my-image-path/image.jpg&width=120&height=120" in the src attribute, then create the thumbnail and serve it. The script would check whether the thumbnail already exists; if it does, it doesn't need to be recreated and is simply served. Every 5 days, a cron task would run a script to delete all the thumbnails (to be sure only the useful ones are kept).
I think the second solution is better, but I'm a little concerned by the fact that PHP has to be called every time an image is shown; even if the thumbnail is already created, PHP is what serves it.
Thanks for your advice!
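For reference, a minimal sketch of what such a thumbs.php could look like, assuming the GD extension and JPEG originals; every path here is illustrative:

```php
<?php
// thumbs.php?img=my-image-path/image.jpg&width=120&height=120
// Sketch only: serves the cached thumbnail if it exists, otherwise
// creates it first with GD. Assumes JPEG originals under ./uploads.

$img    = basename($_GET['img'] ?? '');               // crude traversal guard
$width  = max(1, (int) ($_GET['width'] ?? 120));
$height = max(1, (int) ($_GET['height'] ?? 120));
$cached = __DIR__ . "/thumbs/{$width}x{$height}_{$img}";

if (!is_file($cached)) {
    if (!is_dir(__DIR__ . '/thumbs')) {
        mkdir(__DIR__ . '/thumbs', 0755, true);
    }
    $src = imagecreatefromjpeg(__DIR__ . '/uploads/' . $img);
    $dst = imagecreatetruecolor($width, $height);
    imagecopyresampled($dst, $src, 0, 0, 0, 0,
                       $width, $height, imagesx($src), imagesy($src));
    imagejpeg($dst, $cached, 85);
}

header('Content-Type: image/jpeg');
readfile($cached);
```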
Based on the original question and subsequent comments, it sounds like on-demand generation would be suitable for you, as it doesn't sound like you have a demanding environment in terms of absolutely minimizing download time for the end client.
It seems you already have a grasp of the option of giving your <img> tags a src value that is a PHP script, with that script either serving up a cached thumbnail if it exists, or generating it on the fly, caching it, and then serving it up; so let me give you another option.
Generally speaking, using PHP to serve static resources is not a great idea as you begin to scale your site, because:
- It requires the additional overhead of invoking PHP to serve these requests, something a basic web server like Apache or Nginx does in a much more optimized way. Your site will be able to handle less traffic per server, because it is spending extra memory, CPU, etc. on serving this static content.
- It makes it hard to move those static resources into a single repository outside the server (such as a CDN). You would have to duplicate your files on each and every web server powering the site.
As such, my suggestion would be to still serve the images as static files via the web server, but generate thumbnails on the fly when they are missing. To achieve this, you can create a custom redirect rule or 404 handler on the web server, so that requests in your thumbnail directory that do not match an existing thumbnail image are redirected to a PHP script that generates the thumbnail and serves the image (without the browser even knowing). Future requests for that thumbnail are then served as a static image.
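With Apache, for instance, a directive like ErrorDocument 404 /thumb404.php can route requests for missing thumbnails to a script along these lines; this is only a sketch, and the directory layout, script name, and use of GD are assumptions:

```php
<?php
// thumb404.php -- sketch of a 404 handler that creates missing thumbnails.
// Assumed layout: /thumbs/<W>x<H>/<name>.jpg for thumbnails and
// /images/<name>.jpg for the originals. Requires the GD extension.

$requested = $_SERVER['REDIRECT_URL'] ?? '';   // e.g. /thumbs/120x120/cat.jpg
if (!preg_match('#^/thumbs/(\d+)x(\d+)/([\w.-]+\.jpg)$#', $requested, $m)) {
    http_response_code(404);                   // not a thumbnail request
    exit;
}
[, $w, $h, $name] = $m;
$source = __DIR__ . '/images/' . $name;
$target = __DIR__ . "/thumbs/{$w}x{$h}/" . $name;

if (!is_file($source)) {
    http_response_code(404);                   // no original to resize
    exit;
}

// Resize with GD (same idea as the thumbs.php sketch earlier) and save
// the result where it was requested, so every later request is served
// by the web server as a plain static file.
$src = imagecreatefromjpeg($source);
$dst = imagecreatetruecolor((int) $w, (int) $h);
imagecopyresampled($dst, $src, 0, 0, 0, 0,
                   (int) $w, (int) $h, imagesx($src), imagesy($src));
if (!is_dir(dirname($target))) {
    mkdir(dirname($target), 0755, true);
}
imagejpeg($dst, $target, 85);

// Serve this first request directly so the browser never sees the 404.
http_response_code(200);
header('Content-Type: image/jpeg');
readfile($target);
```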
This scales quite nicely: if in the future you need to move your static images to a single server (or a CDN), you can just use an origin-pull mechanism to fetch the content from your main servers, which will auto-generate the thumbnails via the same mechanism.
Use the second option if you don't have too much storage, and the first if you don't have too much CPU.
Or you can combine them: generate and store the image on the first request to the PHP thumbnail generator, and on subsequent requests just give back the cached image.
With this solution you'll have only the necessary images, and you can delete the older ones from time to time (see the purge sketch below).
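The periodic cleanup could be a cron-driven script like this sketch (the path and schedule are assumptions):

```php
<?php
// purge_thumbs.php -- run from cron every 5 days, e.g.:
//   0 3 */5 * * php /path/to/purge_thumbs.php
// Deletes cached thumbnails older than 5 days; any that are still in
// use will simply be regenerated on their next request.

$maxAge = 5 * 86400; // 5 days in seconds
foreach (glob(__DIR__ . '/thumbs/*') as $thumb) {
    if (is_file($thumb) && time() - filemtime($thumb) > $maxAge) {
        unlink($thumb);
    }
}
```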
I need a way to remove "unused" images from my filesystem, i.e. images that are never referenced from any point in my website (it doesn't matter if I break external links; I might disable external hotlinking altogether). What's the best way to go about this? Regular users can add multiple attachments to topics/posts, and content contributors can bulk upload large numbers of images to be used in articles or image galleries.
The problem is that the images could be referenced in any of the following ways:
From user content (text/html, possibly Markdown or BBCode) stored in the database
Hardcoded into an HTML page
Hardcoded into a PHP file
Hardcoded into a CSS file
As an "attachment" field in a database table, usually containing only the filename itself with no path, because the application assumes that it would be in a certain folder.
And to top it off, the path of the image could be an absolute or relative HTTP or PHP path and may or may not be built with string concatenation in PHP.
So obviously find/replace or regexing the database or filesystem is out of the question. But luckily for you and me, this system isn't fully implemented yet and I don't need anything that deals with an existing hoard of images. I just need to set up some efficient structure that will allow this in the future.
Some ideas I've thought of:
Intercepting the HTTP request for the image with PHP and keeping track of the HTTP_REFERER. The problem with this is that just because no one has followed a link by the time you check doesn't mean the link doesn't exist.
Use extreme database normalization - i.e. make a table for images and use foreign keys for anything that references it. However this would result in making a metric craptonne of many-to-many relationships (and the crosstables) in addition to being impractical for any regular user to use.
Backup all the images and delete them, and check every single 404 request and run a script each time that attempts to find the image from the backup folder and puts it in the "real" folder. The problem is that this cache would have to be purged every so often and the server might be strained when rebuilding the cache.
Ideas/suggestions? Is this just something you have to ignore and live with even if you're making a site with a ridiculous amount of images? Even if it's not worth it, how would something work just for proof-of-concept (I added the garbage-collection tag just because this might be going into that area conceptually).
I will admit that my experience with this was simpler than yours: I had no 'user-generated content', so to speak, and my images were all referenced only in templates or in the database with a full path. What I did was create a Perl script that:
- Analyzed my HTML templates, database table, and CSS, and generated a list of files:
  - In the HTML it looked for <img> tags
  - In the CSS it looked for any .png, .jp*g, or .gif strings with a regex
  - The tables were easy because I had an Image table for the image data
- The file list was then sorted to remove duplicates
- The script iterated through the list and wrote a CSV for auditing, like: filename,(CSS filename|HTML filename|DBTABLE),(exists|notexists)
- In another pass it renamed all files not in the list by appending .del to the filename
- After regression testing I called the script with a -docleanup flag, which told it to go through and delete all the .del-appended files
- If for whatever reason an image was tagged as .del and shouldn't have been, I just manually renamed it back to its original form
A couple of notes: I realize that I could have made this script 'smoother' and done multiple things in fewer steps, but its use grew over time and I wanted clearly delineated processing steps so it could never run amok. I used the CSV to go back and clean up the records where the image didn't exist.
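For what it's worth, a rough PHP translation of that audit approach might look like this; every path, the regex, and the DB step are assumptions for the sketch:

```php
<?php
// audit_images.php -- sketch of the audit pass described above.
// Assumptions: images live in ./images, templates in ./templates,
// stylesheets in ./css, and DB-referenced filenames are in $dbFilenames.

// 1) Collect every image reference from templates and CSS.
$referenced = [];
$sources = array_merge(glob('templates/*.html'), glob('css/*.css'));
foreach ($sources as $file) {
    $text = file_get_contents($file);
    if (preg_match_all('/[\w\/.-]+\.(?:png|jpe?g|gif)/i', $text, $m)) {
        foreach ($m[0] as $ref) {
            $referenced[basename($ref)] = $file;
        }
    }
}

// 2) Add filenames referenced from the database (fetched elsewhere).
$dbFilenames = []; // e.g. SELECT filename FROM images
foreach ($dbFilenames as $name) {
    $referenced[basename($name)] = 'DBTABLE';
}

// 3) Write a CSV audit line per image on disk, and rename orphans to
//    *.del so they can be restored by hand if the audit was wrong.
$csv = fopen('audit.csv', 'w');
foreach (glob('images/*') as $image) {
    $name  = basename($image);
    $where = $referenced[$name] ?? '';
    fputcsv($csv, [$name, $where, $where ? 'exists' : 'notexists']);
    if ($where === '') {
        rename($image, $image . '.del');  // soft delete, reversible
    }
}
fclose($csv);
```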
How do you push new images in a web application so that the cached copies are not used?
When I have a new JS or CSS file, it's easy, because they are referenced from Smarty templates and I have a version number in the URL (like a.js?v=9).
Now, the problem with images is that they are referenced from the CSS files, and I cannot put a version variable there.
So, how do you do it?
As a middle ground between the cleanest way and the easiest way, I would:
- In the CSS, point to images with URLs containing a distinct marker, like "image.png?VERSION_NUMBER" (literally)
  - This allows the CSS file to be used as-is while developing
  - To avoid any problem with caching during development, I would configure Apache (on the development machine) to tell browsers not to cache these files
- Use some kind of build process that replaces this VERSION_NUMBER marker with the real version number in every CSS file (and possibly JS, PHP, HTML, ...); see the sketch after this answer
  - This creates modified files containing the right version number
  - Those files are the ones deployed to the web server
Ideally, the VERSION_NUMBER would be the SVN revision of each file; that way, only files that were really modified would change. But it's also harder: for each file (each URL in the CSS file!), you have to determine its revision number before replacing the marker.
If some browsers refuse to cache images/JS/CSS because of the query string, the marker could be moved into the file names instead.
And now that you have a build process, you can also use it for other manipulations, like minifying JS and CSS files.
As a side note: yes, creating and testing the build process/script takes some time. It might be easier to serve CSS files through PHP and use a variable there to output the version number, but for performance reasons, serving CSS files through PHP (at least one per page, probably more) wouldn't be wise; so it's probably better to take a bit more time and write the build process.
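A sketch of that marker-replacement step, using each CSS file's own mtime as the version number (simpler than looking up per-URL SVN revisions; the directory layout is an assumption):

```php
<?php
// build.php -- sketch of the marker-replacement step of the build.
// VERSION_NUMBER is the literal marker written in the source CSS; the
// CSS file's own mtime stands in for an SVN revision here, since it is
// cheap to obtain and changes whenever the file changes.

foreach (glob(__DIR__ . '/src/css/*.css') as $source) {
    $css = file_get_contents($source);
    $css = str_replace('VERSION_NUMBER', (string) filemtime($source), $css);
    file_put_contents(__DIR__ . '/build/css/' . basename($source), $css);
}
```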
You could manually change the CSS file at the same time you change the image file, putting "?v=1" on the end of the image URL.
You could also configure your server to send CSS files through the PHP processor, so you could use a bit of PHP code to set the "?v=8" query string on the image URLs.
As part of my build process, I append a query string to JavaScript includes and image URLs containing the file's last-modified date/time as a long integer. This lets caching work when it should, and automates something that is easy for a developer to forget.
You have a forum (vBulletin) with a bunch of images in it. How easy would it be to build a page that visits a thread, steps through each of its pages, and forwards the images to the user (via AJAX or whatever)? I'm not asking about filtering (that's easy, of course).
Doable in a day? :)
I also have a site that uses CodeIgniter; would it be even simpler using that?
Assuming this is to be carried out on the server, cURL and regular expressions are your friends. And yes, doable in a day.
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vbulletin, but probably it offers a plugin api that allows for high level database access. That would simplify querying all posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as an HTTP client. It could fetch all pages of a thread (either automatically, by searching for a NEXT link in each page, or manually, by having all pages specified as parameters) and search the HTML source code for image tags (<img .../>).
Then a regular expression could be used to extract the image urls. Finally, the script could use these image urls to construct another page displaying all these images, or it could download them and create a package.
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. Simplest way to do this is probably just to sleep for X seconds between each fetch.
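Putting those pieces together, a sketch of the client approach with cURL and a regex might look like this (the URLs are placeholders, and a proper HTML parser would be more robust than the regex):

```php
<?php
// scrape_thread.php -- sketch of the HTTP-client approach described above.
// $pages would normally come from reading the thread's pagination; here
// it is a hard-coded placeholder list.

$pages = ['http://forum.example.com/thread?page=1',
          'http://forum.example.com/thread?page=2'];

$imageUrls = [];
foreach ($pages as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // Pull the src attribute out of every <img> tag.
    if ($html !== false &&
        preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m)) {
        $imageUrls = array_merge($imageUrls, $m[1]);
    }

    sleep(2); // rate-limit: be polite to the forum server
}

print_r(array_unique($imageUrls));
```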
Yes, doable in a day.
Since you already have a working CI setup, I would use it.
I would use the following approach:
1) Make a model in CI capable of:
- logging in to vBulletin (images are often added as attachments, and you need to be logged in before you can download them); use something like Snoopy
- collecting the URL behind the "last page" button using preg_match(), parsing the URL with parse_url() and parse_str(), and generating links from page 1 to the last page
- collecting the HTML from all generated links, still using Snoopy
- finding all images in the HTML using preg_match_all()
- downloading all images, still using Snoopy
- moving each downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists (see the sketch below)
- saving the image name and exact byte size in a DB table, so you can avoid downloading the same image more than once
2) Make a method in a controller that collects all the images
3) Set up a cronjob that collects images at regular intervals; wget -o /tmp/useless.html http://localhost/imageminer/collect should do nicely
4) Write the code that outputs pretty HTML for the end user, using the DB table to get the images.
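A sketch of the rename-on-collision part of step 1 (the paths are placeholders, and the DB duplicate check is left as a commented placeholder):

```php
<?php
// Sketch of the move/rename step: find a target name that does not
// collide, appending _01, _02, ... when the image name already exists.

function unique_target(string $dir, string $name): string
{
    $info   = pathinfo($name);
    $target = $dir . '/' . $name;
    for ($i = 1; file_exists($target); $i++) {
        $target = sprintf('%s/%s_%02d.%s',
                          $dir, $info['filename'], $i, $info['extension']);
    }
    return $target;
}

$tmpFile = '/tmp/imageminer/cat.jpg';  // placeholder for a downloaded file
if (is_file($tmpFile)) {
    // A real version would first check the (name, bytesize) DB table here
    // and skip images that were already downloaded.
    rename($tmpFile, unique_target(__DIR__ . '/images', basename($tmpFile)));
}
```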