Best way of caching sorted data in PHP/MySQL - php

Assume we have an application which presents continuous data to the user. E.g. a blog: we present a list of blog entries, and this list is divided into pages, so we end up with /page1, /page2 etc.
The first page is obviously requested most often, and the higher the page number, the less often it is requested.
If we implement a cache for our app we have two choices:
update the cache of every page after every new entry
when a page is requested, PHP looks for a cached version; if it exists it is returned, otherwise the cache is created with its expiration set to, let's say, an hour
The first solution seems like a real waste of resources to me. The second one creates the possibility of a dangerous scenario:
What happens if a user requests page x and then page (x+1), where page x is cached and page (x+1) is not? If the cache for page x is outdated, then on page (x+1) the user will see the same entries again. Or worse, what if the user goes from page x to page (x-1)? They'll miss some entries!
How can caching be implemented to avoid this problem?

It's usually best to cache on demand, not cache eagerly unless you can be assured that the work done there won't be wasted.
Typically you use a backing store like Memcached to hold your transient data. Entries can be stored with a "time-to-live" (TTL) that will automatically expire anything that becomes stale or hasn't been used.
Generally you cache a good chunk of the page into a string, then save that using an identifying key of some sort. In your case the page URL or some subset of the parameters might be sufficiently unique. Remember that if the user session has an impact on the contents of this section, then something relating to that, such as user_id, must be part of the cache key as well.
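To make the cache-on-demand idea concrete, here is a minimal sketch. It assumes a tiny in-memory class standing in for Memcached so that it runs anywhere; ArrayCache, page_cache_key() and cached_fragment() are made-up names for illustration, not part of any library.

```php
<?php
// Illustrative in-memory stand-in for a store such as Memcached.
// All class and function names here are hypothetical.
final class ArrayCache
{
    private array $items = []; // key => ['value' => ..., 'expires' => int]

    public function get(string $key, int $now)
    {
        if (!isset($this->items[$key]) || $this->items[$key]['expires'] <= $now) {
            unset($this->items[$key]); // drop expired entries lazily
            return null;               // null signals a miss
        }
        return $this->items[$key]['value'];
    }

    public function set(string $key, $value, int $ttl, int $now): void
    {
        $this->items[$key] = ['value' => $value, 'expires' => $now + $ttl];
    }
}

// Build the key from the URL path plus anything else that changes the output.
function page_cache_key(string $path, ?int $userId = null): string
{
    return 'page:' . md5($path . '|' . ($userId ?? 'anon'));
}

// Return the cached string, or render and store it on a miss.
function cached_fragment(ArrayCache $cache, string $key, int $ttl, callable $render, ?int $now = null): string
{
    $now = $now ?? time();
    $html = $cache->get($key, $now);
    if ($html === null) {
        $html = $render();
        $cache->set($key, $html, $ttl, $now);
    }
    return $html;
}

$cache = new ArrayCache();
$calls = 0;
$render = function () use (&$calls): string {
    $calls++; // stands in for the expensive query + templating
    return "<ul>entries v{$calls}</ul>";
};

$key    = page_cache_key('/page2', 42);
$first  = cached_fragment($cache, $key, 3600, $render, 1000); // miss, rendered
$second = cached_fragment($cache, $key, 3600, $render, 1500); // hit
$third  = cached_fragment($cache, $key, 3600, $render, 5000); // expired, rendered again
```

The timestamps are passed in explicitly only to make the expiry visible; in real use you would let $now default to time() and let the backing store handle the TTL.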

Related

Caching debate/forum entries in PHP

Just looking for a piece of advice. On one of our webpages we have a debate/forum site. Every time a user requests the debate page, he/she gets a list of all topics (and their answer counts etc.).
Also, when the user requests a specific topic/thread, all answers to the thread are shown along with the username, user picture, age, and total number of forum posts of each answer's poster.
All content is currently retrieved using a MySQL query every time the page is accessed. However, this is starting to get painfully slow (especially with large threads, 3000+ answers).
I would like to cache the debate entries somehow to speed up this process. The problem, however, is that if I cache the entries themselves, the number of posts etc. (which is dynamic, of course) will not always be up to date.
Is there any smart way of caching the pages/recaching them when stuff like this is updated? :)
Thanks in advance,
fischer
You should create a tag or a name for the cache based on its data.
For example for the post named Jake's Post you could create an md5 of the name, this would give you the tag 49fec15add24931728652baacc08b8ee.
Now cache the contents and everything to do with this post against the tag 49fec15add24931728652baacc08b8ee. When the post is updated or a comment is added go to the cache and delete everything associated with 49fec15add24931728652baacc08b8ee.
Now there is no cache, and it will be rebuilt when the next visitor arrives to view the post.
You could break this down further by having multiple tags per post. E.g. you could have one tag for comments and one for answers; when a comment is added, delete the comments tag but not the answers tag. This reduces the work the server has to do when rebuilding the cache, as only the comments are now missing.
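A rough sketch of the tag idea, assuming a simple in-memory store (TaggedCache and its methods are hypothetical names; a real setup would sit on top of files, Memcached or a framework's cache component):

```php
<?php
// Illustrative in-memory cache with tag-based invalidation (hypothetical class).
final class TaggedCache
{
    private array $values = [];   // key => cached string
    private array $tagIndex = []; // tag => [key => true]

    public function set(string $key, string $value, array $tags): void
    {
        $this->values[$key] = $value;
        foreach ($tags as $tag) {
            $this->tagIndex[$tag][$key] = true;
        }
    }

    public function get(string $key): ?string
    {
        return $this->values[$key] ?? null;
    }

    // Delete every entry stored against the given tag.
    public function invalidateTag(string $tag): void
    {
        foreach (array_keys($this->tagIndex[$tag] ?? []) as $key) {
            unset($this->values[$key]);
        }
        // Leftover references in other tags are harmless: they point at missing keys.
        unset($this->tagIndex[$tag]);
    }
}

$postTag = md5("Jake's Post"); // one tag for everything belonging to the post
$cache = new TaggedCache();
$cache->set("$postTag:answers", '<ol>2 answers</ol>', [$postTag, "$postTag:answers"]);
$cache->set("$postTag:comments", '<ul>3 comments</ul>', [$postTag, "$postTag:comments"]);

// A comment is added: drop only the comments fragment.
$cache->invalidateTag("$postTag:comments");
$commentsAfter = $cache->get("$postTag:comments"); // null, rebuilt on next visit
$answersAfter  = $cache->get("$postTag:answers");  // still cached

// The post itself is edited: drop everything tagged with the post.
$cache->invalidateTag($postTag);
$answersAfterEdit = $cache->get("$postTag:answers"); // null
```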
There are a number of libraries and frameworks that can aid you in doing this.
Jake
EDIT
I'd use files to store the data, more specifically the HTML output of the page. You can then do something like:
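if (file_exists($tag))
{
    // Cache hit: send the stored HTML straight to the browser
    readfile($tag);
}
else
{
    // Cache miss: do the complex database look up and build the page HTML
    // into a string ($html here is a placeholder for that output),
    // then store it for later and send it to this visitor
    file_put_contents($tag, $html, LOCK_EX);
    echo $html;
}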
Remember that frameworks like Zend have this sort of stuff built in. I would seriously consider using a framework.
Interesting topic!
The first thing I'd look at is optimizing your database - even if you have to spend money upgrading the hardware, it will be significantly easier and cheaper than introducing a cache - fewer moving parts, fewer things that can go wrong...
If you can't squeeze more performance out of your database, the next thing I'd consider is de-normalizing the data a little. For instance, maintain a "reply_count" column, rather than counting the replies against each topic. This is ugly, but introduces fewer opportunities for things to go wrong - with a bit of luck, you can localize all the logic in your data access layer.
The next option I'd consider is to cache pages. For instance, just caching the "debate page" for 30 seconds should dramatically reduce the load on your database if you've got reasonable levels of traffic, and even if it all goes wrong, because you're caching the entire page, it will sort itself out the next time the page goes stale. In most situations, caching an entire page is okay - it's not the end of the world if a new post has appeared in the last 30 seconds and you don't see it on your page.
If you really have to provide more "up to date" content on the page, you might introduce caching at the database access level. I have, in the past, built a database access layer which cached the results of SQL queries based on hard-wired logic about how long to cache the results. In our case, we built a function to call the database which allowed you to specify the query (e.g. get posts for user), an array of parameters (e.g. username, date-from), and the cache duration. The database access function would cache results for the cache duration based on the query and the parameters; if the cache duration had expired, it would refresh the cache.
This scheme was fairly bug-proof - as an end user, you'd rarely notice weirdness due to caching, and because we kept the cache period fairly short, it all sorted itself out very quickly.
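I no longer have that code, but the shape of such a caching database access function is roughly the following. This is a sketch under assumptions: CachingDb is a made-up name, the cache is a plain array, and the real database call is replaced by a callable so the example runs on its own.

```php
<?php
// Hypothetical data-access wrapper: caches results per (query, params) for $ttl seconds.
final class CachingDb
{
    private array $cache = []; // key => ['rows' => array, 'expires' => int]
    private $runQuery;         // callable(string $sql, array $params): array

    public function __construct(callable $runQuery)
    {
        $this->runQuery = $runQuery;
    }

    public function query(string $sql, array $params, int $ttl, ?int $now = null): array
    {
        $now = $now ?? time();
        $key = md5($sql . '|' . serialize($params));

        if (isset($this->cache[$key]) && $this->cache[$key]['expires'] > $now) {
            return $this->cache[$key]['rows']; // still fresh
        }

        // Missing or expired: hit the database and refresh the cache.
        $rows = ($this->runQuery)($sql, $params);
        $this->cache[$key] = ['rows' => $rows, 'expires' => $now + $ttl];
        return $rows;
    }
}

$hits = 0;
$db = new CachingDb(function (string $sql, array $params) use (&$hits): array {
    $hits++; // stands in for a real MySQL call
    return [['user' => $params['username'], 'post' => "post #{$hits}"]];
});

$sql = 'SELECT * FROM posts WHERE username = :username';
$a = $db->query($sql, ['username' => 'fischer'], 30, 100);
$b = $db->query($sql, ['username' => 'fischer'], 30, 120); // served from cache
$c = $db->query($sql, ['username' => 'fischer'], 30, 131); // expired, re-queried
$d = $db->query($sql, ['username' => 'jake'], 30, 131);    // other params, separate entry
```

Because the key is derived from both the query and its parameters, each caller picks its own cache duration without having to know about anyone else's entries.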
Building up your page by caching snippets of content is possible, but very quickly becomes horribly complex. It's very easy to create a page that makes no sense to the end user due to the different caching policies - "unread posts" doesn't add up to the number of posts in the breakdown because of different caching policies between "summary" and "detail".

Patterns for caching related data

I'm currently developing the foundation of an application, and looking for ways to optimize performance. My setup is based on the CakePHP framework, but I believe my question is relevant to any technology stack, as it relates to data caching.
Let's take a typical post-author relation, which is represented by 2 tables in my db. When I query the database for a specific blog post, at the same time the built-in ORM functionality in CakePHP also fetches the author of the post, comments on the post, etc. All of this is returned as one big-ass nested array, which I store in cache using a unique identifier for the concerned blog post.
When updating the blog post, it is child's play to destroy the cache for the post and have it regenerated on the next request.
But what happens when not the main entity (in this case the blog post) gets updated, but rather some of the related data? For example, a comment could be deleted, or the author could update his avatar. Are there any approaches (patterns) which I could consider for tracking updates to related data, and applying updates to my cache accordingly?
I'm curious to hear whether you've also run into similar challenges, and how you have managed to potentially overcome the hurdle. Feel free to provide an abstract perspective, if you're using another stack on your end. Your views are anyhow much appreciated, many thanks!
It is rather simple, cache entries can be
added
destroyed
You should take care of destroying cache entries when related data change (so in application layer in addition to updating the data you should destroy certain types of cached entries when you update certain tables; you keep track of dependencies by hard-coding it).
If you'd like to be smart about it, you could have your cache objects state their dependencies, and cache the last update times of your DB tables as well.
Then you could
fetch cached data, examine dependencies,
get update times for relevant DB tables and
in case the record is stale (the update time of a table that your big-ass cache entry depends on is later than the time of the cache entry), drop it and get fresh data from the database.
You could even integrate the above into your persistence layer.
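The dependency check described above could look something like this. It is only a sketch: DependencyCache is a hypothetical class, everything lives in arrays, and the table update times would in practice be written by whatever code performs the updates.

```php
<?php
// Sketch: each entry records when it was built and which tables it depends on.
final class DependencyCache
{
    private array $entries = [];      // key => ['value', 'builtAt', 'tables']
    private array $tableUpdated = []; // table name => last update timestamp

    // Call this from the application layer whenever a table is written to.
    public function touchTable(string $table, int $now): void
    {
        $this->tableUpdated[$table] = $now;
    }

    public function set(string $key, $value, array $tables, int $now): void
    {
        $this->entries[$key] = ['value' => $value, 'builtAt' => $now, 'tables' => $tables];
    }

    public function get(string $key)
    {
        if (!isset($this->entries[$key])) {
            return null;
        }
        $entry = $this->entries[$key];
        foreach ($entry['tables'] as $table) {
            // ">=" treats a same-second update as stale, which is the safe side.
            if (($this->tableUpdated[$table] ?? 0) >= $entry['builtAt']) {
                unset($this->entries[$key]);
                return null; // caller fetches fresh data and calls set() again
            }
        }
        return $entry['value'];
    }
}

$cache = new DependencyCache();
$cache->touchTable('posts', 100);
$cache->set('post:7', ['title' => 'Hello', 'author' => 'jake'], ['posts', 'comments', 'authors'], 200);

$fresh = $cache->get('post:7');   // all tables were last updated before t=200
$cache->touchTable('authors', 250); // the author changes his avatar
$stale = $cache->get('post:7');   // null: a dependency changed after the entry was built
```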
EDIT:
Of course the above is for when you want to have consistent cache. Sometimes, and for some data, you can relax the consistency requirements and there are scenarios where simple TTL will be good enough (for a trivial example, if you have ttl of 1 sec, you should mostly be out of trouble with users and can help data processing; and with higher times you might still be ok - for example let's say you are caching the list of country ISO codes; your application might be perfectly ok if you say let's cache this for 86400 sec).
Furthermore, you could also track the times of information presented to user, for example
let's say user has seen data A from cache and that we know that this data was created/modified at time t1
user makes changes to the data A (and makes it data B) and commits the change
the application layer can then examine if the data A is still as in DB (if the cached data upon which the user made decisions and/or changes was indeed fresh)
if it was not fresh then there is a conflict and user should confirm the changes
This has a cost of extra read of data A from DB, but it occurs only on writes.
Also, the conflict can occur not only because of the cache, but also because of multiple users trying to change the data (i.e. it is related to locking strategies).
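As a sketch of that conflict check (assuming each record carries a modified_at value, and using a plain array in place of the DB table; save_with_conflict_check() is a made-up helper):

```php
<?php
// Sketch: before writing, compare the modification time the user saw (t1)
// with what is in the "database" now. $table is an array standing in for a DB table.
function save_with_conflict_check(array &$table, int $id, array $newData, int $seenModifiedAt, int $now): bool
{
    $current = $table[$id]; // the extra read, done only on writes
    if ($current['modified_at'] !== $seenModifiedAt) {
        return false;       // data changed since the user saw it: conflict
    }
    $table[$id] = ['data' => $newData, 'modified_at' => $now];
    return true;
}

$posts = [1 => ['data' => ['title' => 'A'], 'modified_at' => 100]];
$cachedCopy = $posts[1]; // what the user was shown, possibly from cache

// Meanwhile someone else edits the record.
$posts[1] = ['data' => ['title' => 'A2'], 'modified_at' => 150];

// The user submits changes based on the stale copy.
$saved = save_with_conflict_check($posts, 1, ['title' => 'B'], $cachedCopy['modified_at'], 200);

// After the user confirms against the fresh record, the write goes through.
$retried = save_with_conflict_check($posts, 1, ['title' => 'B'], 150, 200);
```

With a real database the read and the write would have to happen atomically, for example inside a transaction or as a single UPDATE whose WHERE clause includes the expected modification time.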
One approach for memcached is to use tags ( http://code.google.com/p/memcached-tag/ ). For example, take your post's "big-ass nested array": let's say it includes the author's information and the post itself, and is shown on the front page and in some box in the sidebar. So it gets the tags: frontpage, {author-id}, sidebar, {post-id}. Now if someone changes the author information, you flush every cache entry with the tag {author-id}. But that's only one solution, and only for cache backends that support tags, which for example APC does not (afaik). Hope that gives you an example.

PHP: How to do caching?

So I'm looking to do caching for a forum I'm building and I want to understand the best method. I've been doing some reading and the way that the Zend Framework handles caching (here) explains the idea well, but there are a few things I'm not sure about.
Let's say that I want to cache posts: should I simply "dump" the contents of the query into a file and then retrieve from that, or should I be building the layout around the data and then simply returning the contents of the file? And how would I handle user information? Historically the standard forum display includes a user's total post count next to each post; this can change very often (assuming 30 posts per page) and would mean I'd have to constantly clear the cache, which would make the cache pretty redundant.
I can't find any articles about how I should approach this and I'd be interested to learn more, does anyone have any insight or relevant articles to help?
There's always a trade-off between how often you will hit the cache (and hence how useful the cache is), how much you want to cache, and how long its lifetime should be.
You should identify the bottlenecks in your application. If it's the query that's holding the performance back, by all means cache the query. If it's building some parts of the page, cache those instead.
As for retrieving the users' post counts, if you want them to be as live as possible, then you can't cache them (or if you do, you'll have to invalidate all the cached threads where that user has ever posted...). Retrieving post counts from the database (if done right) shouldn't be too taxing. You can just cache a template where the post count is left blank to be filled in later, or you can do some tricks with JavaScript.

Personal Cache vs Memcache?

I have a personal caching class, which can be seen here ( based off WordPress' ):
http://pastie.org/988427
I recently learned about memcache and it said to memcache EVERYTHING:
http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
My first thought was just to keep my class with the current functions and make it use memcache instead -- is there any downside to doing this?
The main difference I see is that memcache persists on the server from page to page, while mine lasts for one page load. The problem I see arising, and this is with any system, is that the data is dynamic. It changes all the time, whether it's search results, visible products, etc. If it's all cached, won't that create a problem?
Is there a way to handle this? Obviously if something brings back the same results every time it should be cached, but that's why I was doing it on a per-page-load basis. I'm sure there is a way to handle this, or is the cache time usually set between 5 minutes and an hour?
You certainly need a good caching strategy to avoid problems with stale data. With dynamic data and using memcached, you would have to delete cache entries on certain data updates. You can't just rely on cache entries to time out. With memcached you can cache just parts of your dynamic content for a specific page generation. If you want to cache complete html documents, I would recommend using a reverse proxy like varnish (http://varnish-cache.org/).

How to properly cache files in php

I have a page with a post and multiple comments, by using PHP's ob_start() I am able to cache it successfully.
Next to each comment I have a username with that user's current number of posts and reputation. Right now I keep the cached page for the post until someone adds a new comment; only then do I update the cache file.
Now the problem is that a user's post count and reputation will increase as he posts/comments on other topics, but his post count and reputation will not change on older, cached posts.
What would be the best practice to tackle this issue?
If you are by any means concerned with your site's performance you should switch to APC as it provides both opcode caching as well as means for caching as a key/value store.
You can store entire blocks of content, arrays, objects, you name it:
// you must supply:
// 1. a key you will later use to retrieve your content
// 2. the data you wish to cache
// 3. how long the cache should remain valid
apc_store($key, $data, $ttl);
As far as retrieval goes, you simply make a call like:
$data = apc_fetch($key);
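One detail worth handling is telling a cache miss apart from a cached value. Below is a small sketch of a fetch-or-build helper. Assumptions: it uses the APCu functions (apcu_fetch, apcu_store, apcu_enabled), the user-cache successor to the apc_* calls above, and falls back to a per-request array when APCu is not loaded or not enabled; cache_remember() is a made-up helper name.

```php
<?php
// Fetch-or-build helper. Uses APCu when it is loaded and enabled;
// otherwise falls back to a per-request array (the fallback ignores $ttl).
function cache_remember(string $key, int $ttl, callable $build)
{
    static $local = [];
    $useApcu = function_exists('apcu_enabled') && apcu_enabled();

    if ($useApcu) {
        $value = apcu_fetch($key, $found); // $found is set to false on a miss
        if ($found) {
            return $value;
        }
    } elseif (array_key_exists($key, $local)) {
        return $local[$key];
    }

    $value = $build(); // the expensive part, run only on a miss
    if ($useApcu) {
        apcu_store($key, $value, $ttl);
    } else {
        $local[$key] = $value;
    }
    return $value;
}

$builds = 0;
$build = function () use (&$builds): array {
    $builds++; // stands in for a database lookup
    return ['post_count' => 12, 'reputation' => 340];
};

$first  = cache_remember('user:123:stats', 300, $build);
$second = cache_remember('user:123:stats', 300, $build); // served from cache
```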
I sort of hope to be proven wrong, but I don't think there's currently any easy way around this other than limiting the duration of the cache.
You could of course update the relevant reputations, etc. via AJAX but it's quite possible that the connections & bandwidth that this consumes would ultimately outweigh the benefit of caching the page in the first place.
If one of the main goals of caching is to reduce processing overhead (as opposed to bandwidth consumption) you could of course simply flatten out the non-dynamic parts of the page (each post as a static text file or similar - hence reducing the need to re-generate the HTML if you're using Markdown or BBCode, etc.) and include these as required/update them if they're edited.
Some of my thoughts:
You could choose to keep the post pages cached for a certain period of time, like one hour or 15 minutes. This time depends on the number of visitors the page gets, how frequently the details change, and your personal preference, because it does not really matter whether a user's post count is slightly outdated. After this period, remove the cached version (this also saves resources), and if the page is visited again it will be re-cached with the updated details.
By cleverly (re-)using ob_start() you can buffer multiple parts of the page, like the post part and the comments part. Store these parts separately and you only need to regenerate one part instead of the complete page. Most of the time, the post part does not change very often.
Keep track of the pages where a certain user posted comments (or the page itself, if he created it). Upon changes in the user details (new post/comment added), make these pages obsolete (i.e. remove the cached version). If you have a lot of changes in a short period of time, you could use a background process to re-cache the pages and keep your web server responsive.
Inserting tokens (unique pieces of text, like %user:123,postcount%) in place of frequently changing details is another possibility. Store this version in your cache, and upon a page request replace the tokens with the current details. This could also be combined with other caching techniques if the number of page views per period of time is very high (or at least much higher than the frequency of the detail changes).
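To illustrate how the ob_start() and token ideas can fit together, here is a rough sketch. render_comment() and fill_tokens() are made-up helper names, and the post counts are passed in as a plain array where a real page would look them up per request.

```php
<?php
// Render one comment into a string via output buffering, leaving a token
// where the author's live post count belongs. This string is what gets cached.
function render_comment(array $comment): string
{
    ob_start();
    echo '<div class="comment">';
    echo htmlspecialchars($comment['author']);
    echo ' (%user:' . (int) $comment['user_id'] . ',postcount% posts): ';
    echo htmlspecialchars($comment['text']);
    echo '</div>';
    return ob_get_clean();
}

// On each request, replace every %user:ID,postcount% token with the current value.
function fill_tokens(string $html, array $postCounts): string
{
    return preg_replace_callback(
        '/%user:(\d+),postcount%/',
        function (array $m) use ($postCounts): string {
            return (string) ($postCounts[(int) $m[1]] ?? 0);
        },
        $html
    );
}

// Built once and stored in the cache:
$cachedFragment = render_comment(['user_id' => 123, 'author' => 'fischer', 'text' => 'Nice post!']);

// Done on every page view, with cheap, current numbers:
$page = fill_tokens($cachedFragment, [123 => 57]);
// $page is now: <div class="comment">fischer (57 posts): Nice post!</div>
```

One caveat: htmlspecialchars() does not touch the % character, so user-submitted text that happens to contain a token would be substituted too. Escaping % in user content or using a less guessable token format avoids that.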
