I'm wondering if it's worth caching a result that is only going to change every couple of hours but is shown on every page of the website. Makes sense to me to do it but just wanting a second or third opinion.
The situation is that on each page there is 2 top 5 lists. These do change but only really need to be updated every hour for example. The result of this top 5 is found through a MySQL query. Would it be a smart idea to cache these results? What would be the best way to cache them? (Using PHP).
I've had a look through some of the questions asked here but they don't relate well enough so thought it'd be best to ask again.
Thanks.
Your use case is precisely the reason that caching was developed! The most common pattern is to use a caching service, such as memcached. (see Caching with PHP to take stress away from MySQL)
However, I would encourage you to think about whether or not this is a premature optimization. Is your database unable to cope with the requests? Are all pages loading slower than you need? Further, are these database queries slow for some reason? Are you properly using indexes?
I would encourage you to think only add a cache when you need to, and to ensure you have optimized as best as possible before then.
Checkout https://stackoverflow.com/a/1600252/1058853 for a good commentary.
Related
I'm hoping to develop a LAMP application that will centre around a small table, probably less than 100 rows, maybe 5 fields per row. This table will need to have the data stored within accessed rapidly, maybe up to once a second per user (though this is the 'ideal', in practice, this could probably drop slightly). There will be a number of updates made to this table, but SELECTs will far outstrip UPDATES.
Available hardware isn't massively powerful (it'll be launched on a VPS with perhaps 512mb RAM) and it needs to be scalable - there may only be 10 concurrent users at launch, but this could raise to the thousands (and, as we all hope with these things, maybe 10,000s, but this level there will be more powerful hardware available).
As such I was wondering if anyone could point me in the right direction for a starting point - all the data retrieved will be the same for all users, so I'm trying to investigate if there is anyway of sharing this data across all users, rather than performing 10,000 identical selects a second. Soooo:
1) Would the mysql_query_cache cache these results and allow access to the data, WITHOUT requiring a re-select for each user?
2) (Apologies for how broad this question is, I'd appreciate even the briefest of reponses greatly!) I've been looking into the APC cache as we already use this for an opcode cache - is there a method of caching the data in the APC cache, and just doing one MYSQL select per second to update this cache - and then just accessing the APC for each user? Or perhaps an alternative cache?
Failing all of this, I may look into having a seperate script which handles the queries and outputs the data, and somehow just piping this one script's data to all users. This isn't a fully formed thought and I'm not sure of the implementation, but perhaps a combo of AJAX to pull the outputted data from... "Somewhere"... :)
Once again, apologies for the breadth of these question - a couple of brief pointers from anyone would be very, very greatly appreciated.
Thanks again in advance
If you're doing something like an AJAX chat which polls the server constantly, you may want to look at node.js instead, which keeps an open connection between server and browser. This way, you can have changes pushed to the user when they happen and you won't need to do all that redundant checking once per second. This can scale very well to thousands of users and is written in javascript on the server-side, so not too difficult.
The problem with using the MySQL cache is that the entire table cache gets invalidated on any write to that table. You're better off using a caching solution like memcached or APC if you're trying to control that behavior more precisely. And yes, APC would be able to cache that information.
One other thing to keep in mind is that you need to know when to invalidate the cache as well, so you don't have stale data.
You can use apc,xcache or memcache for database query caching or you can use vanish or squid for gateway caching...
I don't really have any experience with caching at all, so this may seem like a stupid question, but how do you know when to cache your data? I wasn't even able to find one site that talked about this, but it may just be my searching skills or maybe too many variables to consider?
I will most likely be using APC. Does anyone have any examples of what would be the least amount of data you would need in order to cache it? For example, let's say you have an array with 100 items and you use a foreach loop on it and perform some simple array manipulation, should you cache the result? How about if it had a 1000 items, 10000 items, etc.?
Should you be caching the results of your database query? What kind of queries should you be caching? I assume a simple select and maybe a couple joins statement to a mysql db doesn't need caching, or does it? Assuming the mysql query cache is turned on, does that mean you don't need to cache in the application layer, or should you still do it?
If you instantiate an object, should you cache it? How to determine whether it should be cached or not? So a general guide on what to cache would be nice, examples would also be really helpful, thanks.
When you're looking at caching data that has been read from the database in APC/memcache/WinCache/redis/etc, you should be aware that it will not be updated when the database is updated unless you explicitly code to keep the database and cache in synch. Therefore, caching is most effective when the data from the database doesn't change often, but also requires a more complex and/or expensive query to retrieve that data from the database (otherwise, you may as well read it from the database when you need it)... so expensive join queries that return the same data records whenever they're run are prime candidates.
And always test to see if queries are faster read from the database than from cache. Correct database indexing can vastly improve database access times, especially as most databases maintain their own internal cache as well, so don't use APC or equivalent to cache data unless the database overheads justify it.
You also need to be aware of space usage in the cache. Most caches are a fixed size and you don't want to overfill them... so don't use them to store large volumes of data. Use the apc.php script available with APC to monitor cache usage (though make sure that it's not publicly accessible to anybody and everybody that accesses your site.... bad security).
When holding objects in cache, the object will be serialized() when it's stored, and unserialized() when it's retrieved, so there is an overhead. Objects with resource attributes will lose that resource; so don't store your database access objects.
It's sensible only to use cache to store information that is accessed by many/all users, rather than user-specific data. For user session information, stick with normal PHP sessions.
The simple answer is that you cache data when things get slow. Obviously for any medium to large sized application, you need to do much more planning than just a wait and see approach. But for the vast majority of websites out there, the question to ask yourself is "Are you happy with the load time". Of course if you are obsessive about load time, like myself, you are going to want to try to make it even faster regardless.
Next, you have to identify what specifically is the cause of the slowness. You assumed that your application code was the source but its worth examining if there are other external factors such as large page file size, excessive requests, no gzip, etc. Use a site like http://tools.pingdom.com/ or an extension like yslow as a start for that. (quick tip make sure keepalives and gzip are working).
Assuming the problem is the duration of execution of your application code, you are going to want to profile your code with something like xdebug (http://www.xdebug.org/) and view the output with kcachegrind or wincachegrind. That will let you know what parts of your code are taking long to run. From there you will make decisions on what to cache and how to cache it (or make improvements in the logic of your code).
There are so many possibilities for what the problem could be and the associated solutions, that it is not worth me guessing. So, once you identify the problem you may want to post a new question related to solving that specific problem. I will say that if not used properly, the mysql query cache can be counter productive. Also, I generally avoid the APC user cache in favor of memcached.
I have a Wordpress Plug-in with users requesting a feature that is the view counter.
I know only a few approaches on making a view counter, and the problem is that I want to optimize for performance and memory issues.
I have done a small amount of research and it seems, "mod_log_mysql", may be a great approach, but I do not have any prior knowledge on how this mod works, nor could have any ideas on how to connect it with the Wordpress Plug-in.
Or I could use a database. When the page is viewed, an update or a insert (which is said to be faster than a update) event occurs.
Thus my following option are:
Update/Insert on server side when
the page is called for.
Research more on mod_log_mysql and
find a way to connect it with the
plug-in.
Find a Premade view counter.
If there is a better approach, I would like to hear them in hoping this will solve my problem.
It really depends on what you want to achieve and how much time you want to spend on it.
If you need more than a simple view counter per page/event, go for a premade one.
If you need something simple, I would go for option #1.
If you're worried about performance, use a memory table for staging the 'counts', and then have a php script move that into a regular table periodically (i.e. using a cronjob). I wouldn't expect updating a view counter in a memory table to have any significant performance impact.
Option #2 could easily fall into premature optimization category.
Option 1 by far seems the easiest and probably most efficient. There is little overhead associated with making a single call to a database whose connection is already open from other operations done on the page previously.
I'm having somewhat theoretical question: I'm designing my own CMS/app-framework (as many PHP programmers on various levels did before... and always will) to either make production-ready solution or develop various modules/plugins that I'll use later.
Anyway, I'm thinking on gathering SQL connections from whole app and then run them on one place:
index.php:
<?php
include ('latestposts.php');
include ('sidebar.php');
?>
latestposts.php:
<?php
function gather_data ($arg){ $sql= ""; }
function draw ($data) {...}
?>
sidebar.php:
<?php
function gather_data ($arg){ $sql= ""; }
function draw ($data) {...}
?>
Now, while whole module system application is yet-to-be-figured, it's idea is already floating somewhere in my brain. However, I'm thinking, if I'm able to first load all gather_data functions, then run sql and then run draw functions - and if I'm able to reuse results!
If, in example, $sql is SELECT * FROM POSTS LIMIT 10 and $sql2 is SELECT * FROM POSTS LIMIT 5, is it possible to program PHP to see: "ah, it's the same SQL, I'll call it just once and reuse the first 5 rows"?
Or is it possible to add this behavior to some DRM?
However, as tags say, this is still just an idea in progress. If it proves to be easy to accomplish, then I will post more question how :)
So, basically: Is it possible, does it make sense? If both are yes, then... any ideas how?
Don't get me wrong, that sounds like a plausible idea and you can probably get it running. But I wonder if it is really going to be beneficial. Will it cause a system to be faster? Give you more control? Make development easier?
I would just look into using (or building) a system using well practiced MVC style coding standards, build a good DB structure, and tweak the heck out of Apache (or use something like Lighttpd). You will have a lot more widespread acceptance of your code if you ever decide to make it open source, and if you ever need a hand with it another developer could step right in and pick up the keyboard.
Also, check out query caching in MySQL--you will see a similar (though not one-to-one) benefit from caching your query results server side with regard to your query example. Even better that is stored in server memory so PHP/MySQL overhead is dropped AND you don't have to code it.
All of that aside, I do think it is possible. =)
Generally speaking, such a cache system can generate significant time savings, but at the cost of memory and complexity. The more results you want to keep, the more memory it will take; and there's no guarantee that your results will ever be used again, particularly the larger result sets.
Second, there are certain queries that should never be cached, or that should be run again even if they're in the cache. For the most part, only SELECT and SHOW queries can be cached effectively, but you need to worry about invalidating them when you modify the underlying data. Even in the same pageview, you might find yourself working around your own cache system on occasion.
Third, this kind of problem has already been solved several times. First, consider turning on the MySQL query cache. Most of the time, it will speed things up a bit without requiring any code changes on your end. However, it's a bit aggressive about invalidating entries, so you could gain some performance at a higher level.
If you need another level, consider memcached. You'll have to store and invalidate entries manually, but it can store results across page views (where you'll really find the performance benefit), and will let unused entries expire before running out of memory.
Okay, so I'm sure plenty of you have built crazy database intensive pages...
I am building a page that I'd like to pull all sorts of unrelated database information from. Here are some sample different queries for this one page:
article content and info
IF the author is a registered user, their info
UPDATE the article's view counter
retrieve comments on the article
retrieve information for the authors of the comments
if the reader of the article is signed in, query for info on them
etc...
I know these are basically going to be pretty lightning quick, and that I could combine some; but I wanted to make sure that this isn't abnormal?
How many fairly normal and un-heavy queries would you limit yourself to on a page?
As many as needed, but not more.
Really: don't worry about optimization (right now). Build it first, measure performance second, and IFF there is a performance problem somewhere, then start with optimization.
Otherwise, you risk spending a lot of time on optimizing something that doesn't need optimization.
I've had pages with 50 queries on them without a problem. A fast query to a non-large (ie, fits in main memory) table can happen in 1 millisecond or less, so you can do quite a few of those.
If a page loads in less than 200 ms, you will have a snappy site. A big chunk of that is being used by latency between your server and the browser, so I like to aim for < 100ms of time spent on the server. Do as many queries as you want in that time period.
The big bottleneck is probably going to be the amount of time you have to spend on the project, so optimize for that first :) Optimize the code later, if you have to. That being said, if you are going to write any code related to this problem, write something that makes it obvious how long your queries are taking. That way you can at least find out you have a problem.
I don't think there is any one correct answer to this. I'd say as long as the queries are fast, and the page follows a logical flow, there shouldn't be any arbitrary cap imposed on them. I've seen pages fly with a dozen queries, and I've seen them crawl with one.
Every query requires a round-trip to your database server, so the cost of many queries grows larger with the latency to it.
If it runs on the same host there will still be a slight speed penalty, not only because a socket is between your application but also because the server has to parse your query, build the response, check access and whatever else overhead you got with SQL servers.
So in general it's better to have less queries.
You should try to do as much as possible in SQL, though: don't get stuff as input for some algorithm in your client language when the same algorithm could be implemented without hassle in SQL itself. This will not only reduce the number of your queries but also help a great deal in selecting only the rows you need.
Piskvor's answer still applies in any case.
Wordpress, for instance, can pull up to 30 queries a page. There are several things you can use to stop MySQL pull down - one of them being memchache - but right now and, as you say, if it will be straightforward just make sure all data you pull is properly indexed in MySQL and don't worry much about the number of queries.
If you're using a Framework (CodeIgniter for example) you can generally pull data for the page creation times and check whats pulling your site down.
As other have said, there is no single number. Whenever possible please use SQL for what it was built for and retrieve sets of data together.
Generally an indication that you may be doing something wrong is when you have a SQL inside a loop.
When possible Use joins to retrieve data that belongs together versus sending several statements.
Always try to make sure your statements retrieve exactly what you need with no extra fields/rows.
If you need the queries, you should just use them.
What I always try to do, is to have them executed all at once at the same place, so that there is no need for different parts (if they're separated...) of the page to make database connections. I figure it´s more efficient to store everything in variables than have every part of a page connect to the database.
In my experience, it is better to make two queries and post-process the results than to make one that takes ten times longer to run that you don't have to post-process. That said, it is also better to not repeat queries if you already have the result, and there are many different ways this can be achieved.
But all of that is oriented around performance optimization. So unless you really know what you're doing (hint: most people in this situation don't), just make the queries you need for the data you need and refactor it later.
I think that you should be limiting yourself to as few queries as possible. Try and combine queries to mutlitask and save time.
Premature optimisation is a problem like people have mentioned before, but that's where you're crapping up your code to make it run 'fast'. But people take this 'maxim' too far.
If you want to design with scalability in mind, just make sure whatever you do to load data is sufficiently abstracted and calls are centralized, this will make it easier when you need to implement a shared memory cache, as you'll only have to change a few things in a few places.