So I'm looking to do caching for a forum I'm building and I want to understand the best method. I've been doing some reading and the way that the Zend Framework handles caching (here) explains the idea well, but there are a few things I'm not sure about.
Let's say that I want to cache posts, should I simply "dump" the contents of the query into a file and then retrieve from that, or should I be building the layout around the data and then simply returning the contents of the file? How would I handle user information, historically the standard forum display includes a users total postcount next to a post, this can change (assuming 30 posts per page) very often and would mean I'd have to constantly clear the cache, which would seem pretty redundant.
I can't find any articles about how I should approach this and I'd be interested to learn more, does anyone have any insight or relevant articles to help?
There's always a trade-off between how often you will hit the cache (and hence who useful the cache is) and how much you want to cache and how big the lifetime should be.
You should identify the bottlenecks in your application. If it's the query that's holding the performance back, by all means cache the query. If it's building some parts of the page, cache those instead.
As to retrieving the user posts, if you want that be as live as possible, then you can't cache those (or if you do, you'll have to invalidate all the cached threads where that user has ever posted...). Retrieving post counts from the database (if done right) shouldn't be too taxing. You can just cache a template where the post count is left blank to be filled later or you can do some tricks with Javascript.
Related
Well this is kind of a question of how to design a website which uses less resources than normal websites. Mobile optimized as well.
Here it goes: I was about to display a specific overview of e.g. 5 posts (from e.g. a blog). Then if I'd click for example on the first post, I'd load this post in a new window. But instead of connecting to the Database again and getting this specific post with the specific id, I'd just look up that post (in PHP) in my array of 5 posts, that I've created earlier, when I fetched the website for the first time.
Would it save data to download? Because PHP works server-side as well, so that's why I'm not sure.
Ok, I'll explain again:
Method 1:
User connects to my website
5 Posts become displayed & saved to an array (with all its data)
User clicks on the first Post and expects more Information about this post.
My program looks up the post in my array and displays it.
Method 2:
User connects to my website
5 Posts become displayed
User clicks on the first Post and expects more Information about this post.
My program connects to MySQL again and fetches the post from the server.
First off, this sounds like a case of premature optimization. I would not start caching anything outside of the database until measurements prove that it's a wise thing to do. Caching takes your focus away from the core task at hand, and introduces complexity.
If you do want to keep DB results in memory, just using an array allocated in a PHP-processed HTTP request will not be sufficient. Once the page is processed, memory allocated at that scope is no longer available.
You could certainly put the results in SESSION scope. The advantage of saving some DB results in the SESSION is that you avoid DB round trips. Disadvantages include the increased complexity to program the solution, use of memory in the web server for data that may never be accessed, and increased initial load in the DB to retrieve the extra pages that may or may not every be requested by the user.
If DB performance, after measurement, really is causing you to miss your performance objectives you can use a well-proven caching system such as memcached to keep frequently accessed data in the web server's (or dedicated cache server's) memory.
Final note: You say
PHP works server-side as well
That's not accurate. PHP works server-side only.
Have you think in saving the posts in divs, and only make it visible when the user click somewhere? Here how to do that.
Put some sort of cache between your code and the database.
So your code will look like
if(isPostInCache()) {
loadPostFromCache();
} else {
loadPostFromDatabase();
}
Go for some caching system, the web is full of them. You can use memcached or a static caching you can made by yourself (i.e. save post in txt files on the server)
To me, this is a little more inefficient than making a 2nd call to the database and here is why.
The first query should only be pulling the fields you want like: title, author, date. The content of the post maybe a heavy query, so I'd exclude that (you can pull a teaser if you'd like).
Then if the user wants the details of the post, i would then query for the content with an indexed key column.
That way you're not pulling content for 5 posts that may never been seen.
If your PHP code is constantly re-connecting to the database you've configured it wrong and aren't using connection pooling properly. The execution time of a query should be a few milliseconds at most if you've got your stack properly tuned. Do not cache unless you absolutely have to.
What you're advocating here is side-stepping a serious problem. Database queries should be effortless provided your database is properly configured. Fix that issue and you won't need to go down the caching road.
Saving data from one request to the other is a broken design and if not done perfectly could lead to embarrassing data bleed situations where one user is seeing content intended for another. This is why caching is an option usually pursued after all other avenues have been exhausted.
Just looking for a piece of advice. On one of our webpages we have a debate/forum site. Everytime a user request the debate page, he/she will get a list of all topics (and their count of answers etc.).
Too when the user request a specific topic/thread, all answers to the thread will be shown to the user a long with username, user picture, age, number of totalt forum-posts from the poster of the answer.
All content is currently retrieved by using an MySQL-query everytime the page is accessed. But this is however starting to get painfully slow (especially with large threads, +3000 answers).
I would like to cache the debate entries somehow, to speed up this proces. However the problem is, that if I cache the entries it self, number of post etc. (which is dynamic, of course), will not always be up to date.
Is there any smart way of caching the pages/recaching them when stuff like this is updated? :)
Thanks in advance,
fischer
You should create a tag or a name for the cache based on it's data.
For example for the post named Jake's Post you could create an md5 of the name, this would give you the tag 49fec15add24931728652baacc08b8ee.
Now cache the contents and everything to do with this post against the tag 49fec15add24931728652baacc08b8ee. When the post is updated or a comment is added go to the cache and delete everything associated with 49fec15add24931728652baacc08b8ee.
Now there is no cache and it will be rebuilt when the next visitors arrives to new the post.
You could break this down further by having multiple tags per post. E.g you could have a tag for comments and answers, when a comment is added delete the comments tag, but not the answers tag. This reduces the work the server has to do when rebuilding the cache as only the comments are now missing.
There are number of libraries and frameworks that can aid you in doing this.
Jake
EDIT
I'd use files to store the data, more specifically the HTML output of the page. You can then do something like:
if(file_exists($tag))
{
// Load the contents of the cache file here and output it
}
else
{
// Do complex database look up and cache the file for later
}
Remember that frameworks like Zend have this sort of stuff built in. I would seriously considering using a framework.
Interesting topic!
The first thing I'd look at is optimizing your database - even if you have to spend money upgrading the hardware, it will be significantly easier and cheaper than introducing a cache - fewer moving parts, fewer things that can go wrong...
If you can't squeeze more performance out of your database, the next thing I'd consider is de-normalizing the data a little. For instance, maintain a "reply_count" column, rather than counting the replies against each topic. This is ugly, but introduces fewer opportunities for things to go wrong - with a bit of luck, you can localize all the logic in your data access layer.
The next option I'd consider is to cache pages. For instance, just caching the "debate page" for 30 seconds should dramatically reduce the load on your database if you've got reasonable levels of traffic, and even if it all goes wrong, because you're caching the entire page, it will sort itself out the next time the page goes stale. In most situations, caching an entire page is okay - it's not the end of the world if a new post has appeared in the last 30 seconds and you don't see it on your page.
If you really have to provide more "up to date" content on the page, you might introduce caching at the database access level. I have, in the past, built a database access layer which cached the results of SQL queries based on hard-wired logic about how long to cache the results. In our case, we built a function to call the database which allowed you to specify the query (e.g. get posts for user), an array of parameters (e.g. username, date-from), and the cache duration. The database access function would cache results for the cache duration based on the query and the parameters; if the cache duration had expired, it would refresh the cache.
This scheme was fairly bug-proof - as an end user, you'd rarely notice weirdness due to caching, and because we kept the cache period fairly short, it all sorted itself out very quickly.
Building up your page by caching snippets of content is possible, but very quickly becomes horribly complex. It's very easy to create a page that makes no sense to the end user due to the different caching policies - "unread posts" doesn't add up to the number of posts in the breakdown because of different caching policies between "summary" and "detail".
I don't really have any experience with caching at all, so this may seem like a stupid question, but how do you know when to cache your data? I wasn't even able to find one site that talked about this, but it may just be my searching skills or maybe too many variables to consider?
I will most likely be using APC. Does anyone have any examples of what would be the least amount of data you would need in order to cache it? For example, let's say you have an array with 100 items and you use a foreach loop on it and perform some simple array manipulation, should you cache the result? How about if it had a 1000 items, 10000 items, etc.?
Should you be caching the results of your database query? What kind of queries should you be caching? I assume a simple select and maybe a couple joins statement to a mysql db doesn't need caching, or does it? Assuming the mysql query cache is turned on, does that mean you don't need to cache in the application layer, or should you still do it?
If you instantiate an object, should you cache it? How to determine whether it should be cached or not? So a general guide on what to cache would be nice, examples would also be really helpful, thanks.
When you're looking at caching data that has been read from the database in APC/memcache/WinCache/redis/etc, you should be aware that it will not be updated when the database is updated unless you explicitly code to keep the database and cache in synch. Therefore, caching is most effective when the data from the database doesn't change often, but also requires a more complex and/or expensive query to retrieve that data from the database (otherwise, you may as well read it from the database when you need it)... so expensive join queries that return the same data records whenever they're run are prime candidates.
And always test to see if queries are faster read from the database than from cache. Correct database indexing can vastly improve database access times, especially as most databases maintain their own internal cache as well, so don't use APC or equivalent to cache data unless the database overheads justify it.
You also need to be aware of space usage in the cache. Most caches are a fixed size and you don't want to overfill them... so don't use them to store large volumes of data. Use the apc.php script available with APC to monitor cache usage (though make sure that it's not publicly accessible to anybody and everybody that accesses your site.... bad security).
When holding objects in cache, the object will be serialized() when it's stored, and unserialized() when it's retrieved, so there is an overhead. Objects with resource attributes will lose that resource; so don't store your database access objects.
It's sensible only to use cache to store information that is accessed by many/all users, rather than user-specific data. For user session information, stick with normal PHP sessions.
The simple answer is that you cache data when things get slow. Obviously for any medium to large sized application, you need to do much more planning than just a wait and see approach. But for the vast majority of websites out there, the question to ask yourself is "Are you happy with the load time". Of course if you are obsessive about load time, like myself, you are going to want to try to make it even faster regardless.
Next, you have to identify what specifically is the cause of the slowness. You assumed that your application code was the source but its worth examining if there are other external factors such as large page file size, excessive requests, no gzip, etc. Use a site like http://tools.pingdom.com/ or an extension like yslow as a start for that. (quick tip make sure keepalives and gzip are working).
Assuming the problem is the duration of execution of your application code, you are going to want to profile your code with something like xdebug (http://www.xdebug.org/) and view the output with kcachegrind or wincachegrind. That will let you know what parts of your code are taking long to run. From there you will make decisions on what to cache and how to cache it (or make improvements in the logic of your code).
There are so many possibilities for what the problem could be and the associated solutions, that it is not worth me guessing. So, once you identify the problem you may want to post a new question related to solving that specific problem. I will say that if not used properly, the mysql query cache can be counter productive. Also, I generally avoid the APC user cache in favor of memcached.
I need help find the right caching solution for a clients site. Current site is centoOS, php, mysql, apache using smarty templates (i know they suck but it as built by someone else). The current models/methods use fairly good OO structure but there are WAY to many queries being done for some of the simple page functions. I'm looking try find some sort of caching solution but i'm a noob when it comes to this and don't know what is available that would fit the current site setup.
It is an auction type site with say 10 auctions displayed on one page at one time -- the time and current bid on each auction being updated via an ajax call returning json every 1 second (it's a penny auction site like beezid.com so updates every second are necessary). As you can see, if the site gets any sort of traffic the number of simultaneous requests could be huge. Obviously this data changes every second because the json data returned has the updated time left in the auction, and possibly updated bid amounts and bid users for each auction.
What i want is the ability to cache certain pages for a given amount of time or based on other changed variable. For example, memory caching the page that displays 10 auctions and only updating that cache copy when one of the auctions ends. Or even the script above that returns json string data every second. If i was able to cache the first request to this page in memory, serve the following requests from memory and then re-cache it again after 1 second, that could potentially reduce the serverload a lot. But i don't know if this is even possible or if the overhead of doing something like this outweights any request load savings.
I looked into xcache some but i couldn't find a way that i could set a particular cache time on a specific page or based on other variables?!? Maybe i'm missed something, but does anyone have a recommendation on a caching scheme that would work for these requirements?
Mucho thanks for any input you might have...
Cacheing can be done using many methods. Memcached springs to mind as being suited to your task. but if the site is ultra busy you may run out of ram.
When I do caching I often use a simple file cache, while it does involve at least one stat call to determine the freshness of the cached content it is still fast and marginally better than calling a sql server.
If you must call a sql server then it may pay to use a memory(heap) table to store much of the precomputed data. this technique is no more efficient than memcached, probably less so but saves you installing memcached.
DC
Zend_Cache can do what you want, and a lot more. It supports a lot of backends, including xcache and memcache, and allows you to cache data, full pages, partial pages, and well, just about anything you can imagine :p.
And in case you are wondering : you can use the Zend_Cache component by itself, you don't have to use the complete Zend framework for your application.
I'm struggling with a conceptual question. When you have a forum with thousands of posts and/or threads, how do you retrieve all those posts to be displayed on your site? Do you connect to your database every time someone visits your page then capture every post in an array and display it? Surely this seems like it would be very taxing on your server and would cause a whole bunch of unnecessary database reads. Can anyone shine some light on this topic?
Thanks.
You never retrieve all those posts at once. In most case, forums show a page of X threads/posts, and you just get those X threads/posts from the database each time a page is served. RDBMS are pretty good at this. A forum is (should be) quite dynamic so indeed it generates a pretty good load on the database, but this is what database are made for, storing and retrieving data.
One new(ish) way of doing this is to use a Document Oriented Database like CouchDB where everything about an individual post is stored in the same document and that document gets loaded on request.
It seems in this case a Document Oriented Database would work very well for a forum or blog type site.
As far as Relational Databases go, I'm pretty sure the database gets hit every time the page loads unless there is some sort of caching implemented (then you'd have to worry about data getting stale though, which brings up a whole new mess of problems.)
Don't worry a lot about stale data. Facebook doesn't... their database is only "eventually consistent". The idea is like this: making sure that the comments are 100% always, always up-to-date is very expensive. That does put a large load on your DB. Although as Serty says, that's what the DB is made for, but whether or not your physical box is sufficient for the load is another matter.
Facebook and Digg to name a few took a different approach... Is it really all that important that every load of every page be 100% accurate? How many page loads actually result in every single comment being read by the end user anyways? It's a lot cheaper to get the comments right 'most' of the time and by 'most' I mean something you get to decide. Is a 10% chance of a page with missing comments ok? is a 1% chance? How many nodes need to have the right data NOW. When I write a new comment, how many nodes have to say they got the update for it to be successful.
I like the idea behind Cassandra which is in summary, "how much are we willing to spend to get Aunt Martha's comment about her nephew's baptism picture 100% correct?"
But that's a fine question for a free website, but this wouldn't work so good for a business application.