Smart (?) Database Cache - php

I've seen several database cache engines, all of them pretty dumb (i.e.: keep this query cached for X minutes) and all requiring that you manually delete the whole cache repository after an INSERT / UPDATE / DELETE query has been executed.
About 2 or 3 years ago I developed an alternative DB cache system for a project I was working on. The idea was basically to use regular expressions to find the table(s) involved in a particular SQL query:
$query_patterns = array
(
    'INSERT'   => '/INTO\s+(\w+)\s+/i',
    'SELECT'   => '/FROM\s+((?:[\w]|,\s*)+)(?:\s+(?:[LEFT|RIGHT|OUTER|INNER|NATURAL|CROSS]\s*)*JOIN\s+((?:[\w]|,\s*)+)\s*)*/i',
    'UPDATE'   => '/UPDATE\s+(\w+)\s+SET/i',
    'DELETE'   => '/FROM\s+((?:[\w]|,\s*)+)/i',
    'REPLACE'  => '/INTO\s+(\w+)\s+/i',
    'TRUNCATE' => '/TRUNCATE\s+(\w+)/i',
    'LOAD'     => '/INTO\s+TABLE\s+(\w+)/i',
);
I know that these regexes probably have some flaws (my regex skills were pretty green back then) and obviously don't match nested queries, but since I never use those that isn't a problem for me.
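Applying these patterns to pull out the table list looks roughly like this (a simplified sketch with a made-up helper name, not my actual code):

// Sketch: extract the table names a query touches (hypothetical helper).
function tables_for($sql, array $query_patterns)
{
    $verb = strtoupper(strtok(ltrim($sql), " \t\r\n"));   // first keyword: SELECT, UPDATE, ...
    if (!isset($query_patterns[$verb]) || !preg_match($query_patterns[$verb], $sql, $m)) {
        return array();                                   // unknown or unmatched statement: don't cache it
    }
    // Captured groups may hold comma-separated lists such as "table_a, table_b".
    $parts = preg_split('/\s*,\s*/', implode(',', array_slice($m, 1)), -1, PREG_SPLIT_NO_EMPTY);
    return array_unique(array_map('trim', $parts));
}

// e.g. tables_for('SELECT id FROM table_a, table_b WHERE ...', $query_patterns)
// returns array('table_a', 'table_b')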
Anyway, after finding the involved tables I would alphabetically sort them and create a new folder in the cache repository with the following naming convention:
+table_a+table_b+table_c+table_...+
In case of a SELECT query, I would fetch the results from the database, serialize() them and store them in the appropriate cache folder, so for instance the results of the following query:
SELECT `table_a`.`title`, `table_b`.`description` FROM `table_a`, `table_b` WHERE `table_a`.`id` <= 10 ORDER BY `table_a`.`id` ASC;
Would be stored in:
/cache/+table_a+table_b+/079138e64d88039ab9cb2eab3b6bdb7b.md5
The MD5 being the hash of the query itself. Upon a subsequent identical SELECT query, the results would be trivial to fetch.
In case of any other type of write query (INSERT, REPLACE, UPDATE, DELETE and so on) I would glob() all the folders that had +matched_table(s)+ in their name and delete all the file contents. This way it wouldn't be necessary to delete the whole cache, just the cache used by the affected and related tables.
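Putting it together, the read path and the invalidation path look roughly like this (a minimal sketch assuming PDO; helper names are made up for illustration, and tables_for() is the helper sketched above):

// Build the folder name from the alphabetically sorted table list.
function cache_dir_for(array $tables)
{
    sort($tables);                                    // alphabetical order
    return '/cache/+' . implode('+', $tables) . '+';  // e.g. /cache/+table_a+table_b+
}

// Read path: serve serialized results from disk, or run the query and store them.
function cached_select(PDO $db, $sql, array $tables)
{
    $dir  = cache_dir_for($tables);
    $file = $dir . '/' . md5($sql) . '.md5';

    if (is_file($file)) {                             // cache hit
        return unserialize(file_get_contents($file));
    }

    $rows = $db->query($sql)->fetchAll(PDO::FETCH_ASSOC);

    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    file_put_contents($file, serialize($rows));       // cache miss: store serialized results
    return $rows;
}

// Write path: wipe every cache folder whose name mentions an affected table.
function invalidate_cache(array $tables)
{
    foreach ($tables as $table) {
        foreach (glob('/cache/*+' . $table . '+*', GLOB_ONLYDIR) as $dir) {
            array_map('unlink', glob($dir . '/*.md5'));
        }
    }
}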
The system worked pretty well and the performance difference was noticeable - although the project had many more read queries than write queries. Since then I've started using transactions and FK CASCADE UPDATE / DELETE, and never had the time to perfect the system to make it work with these features.
I've used MySQL Query Cache in the past but I must say the performance doesn't even compare.
I'm wondering: am I the only one who sees beauty in this system? Are there any bottlenecks I may not be aware of? Why do popular frameworks like CodeIgniter and Kohana (I'm not aware of Zend Framework) have such rudimentary DB cache systems?
More importantly, do you see this as a feature worth pursuing? If yes, is there anything I could do / use to make it even faster (my main concerns are disk I/O and (de)serialization of query results)?
I appreciate all input, thanks.

I can see the beauty in this solution; however, I believe it only works for a very specific set of applications. Scenarios where it is not applicable include:
Databases which utilize cascading deletes/updates or any kind of triggers. E.g., your DELETE to table A may cause a DELETE from table B. The regex will never catch this.
Accessing the database from points which do not go through your cache invalidation scheme, e.g. crontab scripts etc. If you ever decide to implement replication across machines (introduce read-only slaves), it may also disturb the cache (because it does not go through cache invalidation, etc.)
Even if these scenarios are not realistic for your case, this still answers the question of why frameworks do not implement this kind of cache.
Regarding whether this is worth pursuing, it all depends on your application. Maybe you care to supply more information?

The solution, as you describe it, is at risk for concurrency issues. When you're receiving hundreds of queries per second, you're bound to hit a case where an UPDATE statement runs, but before you can clear your cache, a SELECT reads from it and gets stale data. Additionally, you may run into issues when several UPDATEs hit the same set of rows in a short time period.
In a broader sense, best practice with caching is to cache the largest objects possible. E.g., rather than having a bunch of "user"-related rows cached all over the place, it's better to just cache the "user" object itself.
Better still, if you can cache whole pages (e.g., you show the same homepage to everyone; a profile page appears identical to almost everyone, etc.), do that. One cache fetch for a whole, pre-rendered page will dramatically outperform dozens of cache fetches for row/query-level caches followed by re-rendering the page.
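As a rough illustration of page-level caching with APC (apc_fetch()/apc_store() are real APC calls; render_page() and the key scheme are just placeholders):

// Serve the whole pre-rendered page from cache when possible.
$key = 'page:' . $_SERVER['REQUEST_URI'];

$html = apc_fetch($key, $hit);
if ($hit) {
    echo $html;                   // one cache fetch serves the entire page
    exit;
}

ob_start();
render_page();                    // hypothetical: runs the queries and renders the templates
$html = ob_get_clean();

apc_store($key, $html, 300);      // keep the rendered page for 5 minutes
echo $html;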
Long story short: profile. If you take the time to do some measurement, you'll likely find that caching large objects, or even pages, rather than small queries used to build those things, is a huge performance win.

While I do see the beauty in this - especially for environments where resources are limited and cannot easily be extended, like on shared hosting - I personally would fear complications in the future: What if somebody, newly hired and unaware of the caching mechanism, starts using nested queries? What if some external service starts updating the table, with the cache not noticing?
For a specialized, defined project that urgently needs a speedup that cannot be helped by adding processor power or RAM, this looks like a great solution. As a general component, I find it too shaky, and would fear subtle problems in the long run that stem from people forgetting that there is a cache to be aware of.

I suspect that the regexes may not provide for every case - certainly they don't seem to deal with the scenario of mixing schema-qualified table names with bare table names. E.g. consider
update stats.measures set amount=50 where id=1;
and
use stats;
update measures set amount=50 where id=1;
Then there's PL/SQL.
Then there's the fact that it depends on every client opting in to an advisory control mechanism i.e. it pre-supposes that all the database access is from machines implementing the caching control mechanism on a shared filesystem.
(As a small point - wouldn't it be simpler to just check the modification times on the data files to determine if the cached version of a query on a defined set of tables is still current, rather than trying to detect whether the cache control mechanism has spotted an update? It would certainly be a lot more robust.)
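A sketch of that mtime check, assuming per-table data files as with MyISAM (the datadir path below is only an example):

// Is the cached result newer than the last change to every table it depends on?
function cache_is_current($cacheFile, array $tables)
{
    if (!is_file($cacheFile)) {
        return false;
    }
    $cachedAt = filemtime($cacheFile);
    foreach ($tables as $table) {
        $dataFile = '/var/lib/mysql/mydb/' . $table . '.MYD';   // hypothetical datadir/schema
        if (is_file($dataFile) && filemtime($dataFile) > $cachedAt) {
            return false;   // the table changed after the cache entry was written
        }
    }
    return true;
}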
Stepping back a bit, implementing this from scratch using a robust architecture would mean that all queries would have to be intercepted by the control mechanism. The control mechanism would probably need a more sophisticated query parser. It certainly requires a common storage substrate for all the instances of the control mechanism. It probably needs an understanding of the data dictionary - all things which are already implemented by the database itself.
You state that "I've used MySQL Query Cache in the past but I must say the performance doesn't even compare."
I find this rather odd. Certainly when dealing with large result sets from queries, my experience is that loading the data into the heap from a database is a lot faster than unserializing large arrays - although large result sets are rather atypical of web based applications.
When I've tried to speed up database access (after fixing everything else of course) then I've gone down the route of replicating and partitioning data across multiple DBMS instances.
C.

This is related to the problem of session splitting when working with multiple databases in a master-slave configuration. Basically, a similar set of regular expressions are used to determine which tables (or even which rows) are being read from or written to. The system keeps track of which tables were written to and when, and when a read to one of those tables comes up, it's routed to the master. If a query is reading from a table whose data needn't be up-to-the-second accurate, then it's routed to the slave. Generally, information only really needs to be current when it's something a user changed themselves (i.e., editing a user's profile).
They talk about this a good bit in the O'Reilly book High Performance MySQL. I used it quite a bit when developing a system for handling session splits back in the day.
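For a rough idea of the mechanics, the routing logic might look like this (a hypothetical sketch, not the book's code):

// Route writes (and reads of recently written tables) to the master, other reads to the slave.
class SplitConnection
{
    private $master;
    private $slave;
    private $lastWrite = array();   // table name => timestamp of last write

    public function __construct(PDO $master, PDO $slave)
    {
        $this->master = $master;
        $this->slave  = $slave;
    }

    public function query($sql, array $tables)
    {
        if (!preg_match('/^\s*SELECT\b/i', $sql)) {       // a write: record it, send to master
            foreach ($tables as $t) {
                $this->lastWrite[$t] = time();
            }
            return $this->master->query($sql);
        }

        // A read goes to the master if any involved table was written to recently.
        foreach ($tables as $t) {
            if (isset($this->lastWrite[$t]) && time() - $this->lastWrite[$t] < 5) {
                return $this->master->query($sql);
            }
        }
        return $this->slave->query($sql);                 // otherwise the slave can serve it
    }
}
// (A real system would keep the last-write bookkeeping in the session or a shared store across requests.)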

The improvement you describe is to avoid invalidating caches that are guaranteed to not have been affected by an update because they draw data from a different table.
That is of course nice, but I am not sure if it is fine-grained enough to make a real difference. You would still be invalidating lots of caches that did not really need to be invalidated (because the update was on the same table, but on different rows).
Also, even this "simple" scheme relies on being able to detect the relevant tables by looking at the SQL query string. This can be difficult to do in the general case, because of views, table aliases, and multiple catalogs.
It is very difficult to automatically (and efficiently) detect whether a cache needs to be invalidated. Because of that, you can either use a very simple scheme (such as invalidating on every update, or per table, as in your system, which does not work too well when there are many updates), or a very hand-crafted cache for the specific application with deep hooks into the query logic (probably difficult to write and hard to maintain), or accept that the cache can contain stale data and just refresh it periodically.

Store some records in the application and some in the database?

I have an application where it seems as if it would make sense to store some records hard-coded in the application code rather than as entries in the database, and to be able to merge the two into a common result set when viewing the records. Are there any pitfalls to this approach?
First, it would seem to make it easier to enforce that a record is never edited/deleted, other than when the application developer wants it to be. Second, in some scenarios, such as installing a 3rd-party module, the records could be read from its configuration rather than performing an insert in the DB (with the related maintenance issues).
Some common examples:
                                      In the application      In the database
------------------------------------  ----------------------  ----------------------
customers                             (none)                  all customers
HTML templates                        default templates       user-defined templates
'control panel' interface languages   default language        additional languages
Online shop payment processors        all payment processors  (none)
So, I think I have three options depending on the scenario:
All records in the database
Some records in the application, some records in the database
All records in the application
And it seems that there are two ways to implement it:
All records in the database:
A column could be flagged as 'editable' or 'locked'
Negative IDs could represent locked values and positive IDs could represent editable
Odd IDs represent locked and even IDs represent editable...
Some records live in the application (as variables, arrays or objects...)
Are there any standard ways to deal with this scenario? Am I missing some really obvious solutions?
I'm using MySQL and php, if that changes your answer!
By "in the application", do you mean these records live in the filesystem, accessible to the application?
It all depends on the app you're building. There are a few things to consider, especially when it comes to code complexity and performance. While I don't have enough info about your project to suggest specifics, here are a few pointers to keep in mind:
Having two possible repositories for everything ramps up the complexity of your code. That means readability will go down and weird errors will start cropping up that are hard to trace. In most cases, it's in your best interest to go with the simplest solution that can possibly work. If you look at big PHP/MySQL software packages you will see that even though there are a lot of default values in the code itself, the data comes almost exclusively from the database. This is probably a reasonable policy when you can't get away with the simplest solution ever (namely storing everything in files).
The big downside of heavy database involvement is performance. You should definitely keep track of all the database calls of any typical codepath in your app. If you rely heavily on lots of queries, you have to employ a lot of caching. Track everything that happens and keep in mind what the computer has to do in order to fulfill the request. It's your job to make the computer's task as easy as possible.
If you store templates in the DB, another big performance penalty will be the lack of opcode re-use and caching. Normal web hosting environments compile a PHP file once and then keep the bytecode version of it around for a while. This saves subsequent recompiles and speeds up execution substantially. But if you feed PHP template code into an eval() statement, this code will have to be recompiled by PHP every single time it's called.
Also, if you're using eval() in this fashion and you allow users to edit templates, you have to make sure those users are trusted - because they'll have access to the entire PHP environment. If you're going the other route and are using a template engine, you'll potentially have a much bigger performance problem (but not a security problem). In any case, consider caching template outputs wherever possible.
Regarding the locking mechanism: it seems you are introducing a big architectural issue here since you now have to make each repository (file and DB) understand what records are off-limits to the other one. I'd suggest you reconsider this approach entirely, but if you must, I'd strongly urge you to flag records using a separate column for it (the ID-based stuff sounds like a nightmare).
The standard way would be to keep classical DB-shaped stuff in the DB (these would be user accounts and other stuff that fits nicely into tables) and keep the configuration, all your code and template things in the filesystem.
I think that keeping some fixed values hard-coded in the application may be a good way to deal with the problem. In most cases, it will even reduce the load on the database server, because not all the values have to be retrieved via SQL.
But there are cases when it could lead to performance issues, mainly if you have to join values coming from the database with your hard-coded values. In this case, storing all the values in database may have better performance, because all values could be optimized and processed by the database server, rather than getting all the values from SQL query and joining them manually in the code.
To deal with this case, you can store the values in the database, but inserts and updates must be handled only by your maintenance or upgrade routines. If you have a bigger concern about not letting the data be modified, you can set up a maintenance routine that checks from time to time that the values in the database are the same as those in the code. In this case, these database tables act much like a "cache" of the hard-coded values. And when you don't need to join the fixed values with the database values, you can still get them from the code, avoiding an unnecessary SQL query (because you're sure the values are the same).
In general, any time you're performing a database query and want to include something hard-coded in the workflow, there isn't any joining that needs to happen. You would simply perform the action on your hard-coded data as well as on the data you pulled from the database. This is especially true if we're talking about information that is formed into an object once it is in the application. For instance, I can see this being useful if you want there to always be a dev user in the application. You could have this user hard-coded in the application, and whenever you would query the database, such as when you're logging in a user, you would check your hard-coded user's values before querying the database.
For instance:
// You would place this on the login page
$DevUser = new User($devUserInfo);   // $devUserInfo: the hard-coded credentials/details
$_SESSION['DevUser'] = $DevUser;

// This would go in the user authentication logic
if ($_SESSION['DevUser']->GetValue('Username') == $GivenUName
    && $_SESSION['DevUser']->GetValue('PassHash') == $GivenPassHash)
{
    // log in the hard-coded dev user
}
else
{
    // query for a user that matches the given username and password hash
}
This shows how there doesn't need to be any special or tricky database stuff going on. Hard-coding variables to include in your database-driven workflow is extremely simple when you don't overthink it.
There could be a case where you might have a lot of hard-coded variables/objects and/or you might want to execute a large block of logic on both sets of information. In this case it could be beneficial to have an array that holds the hard-coded information and then you could just add the queried information to that array before you perform any logic on it.
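For instance, something along these lines (a sketch; the table and field names are made up):

// Merge the hard-coded records with queried ones before running the shared logic.
$hardCoded = array(
    array('id' => 'dev-1', 'name' => 'Dev User'),
);
$fromDb = $pdo->query('SELECT id, name FROM users')->fetchAll(PDO::FETCH_ASSOC);

$all = array_merge($hardCoded, $fromDb);   // one list, one pass of whatever logic follows
foreach ($all as $record) {
    // ... identical processing for both sources ...
}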
In the case of payment processors, I would assume that you're referring to online payments using different services such as PayPal, or a credit card, or something else. This would make the most sense as a Payment class that has a separate function for each payment method. That way you can call whichever method the client chooses. I can't think of any other way you would want to handle this. If you're maybe talking about the payment options available to your customers, that would be something hard-coded on your payment page.
Hopefully this helps. Remember, don't make it more complicated than it needs to be.

Is this a viable solution to a list in Memcached?

Basically we have sales people that request leads to call. Right now it tries a "fresh lead" query to get those.
If there aren't any fresh leads it moves on to a "relatively new" query. We call these "sources" and essentially a closer will go through sources until they find a viable lead.
These queries all query the same table, just different groups of data. However, there is a lot of complex sorting on each query, and between that and inserts/updates to the table (the table being InnoDB) we're experiencing lots of waits (no deadlocks, I'm pretty sure, since they don't show in InnoDB status), so my guess is we have slow selects coupled with lots of inserts/updates.
NOW, the ultimate question IS:
Should we query the DB for each source, grab about 100ish leads (obviously variable depending on the system) and cache them in memcached? Then, as closers request leads, serve them from the cache but update the cache to reflect an "is_accepted" flag. This way we only query each source as we run out of cached leads - so just once per refill, instead of once per closer requesting a lead.
Then we can use simulated locking with memcached - http://code.google.com/p/memcached/wiki/FAQ#Emulating_locking_with_the_add_command
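The add()-based lock emulation that the FAQ describes looks roughly like this (a sketch using the pecl/memcached extension; key names are just examples):

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$lockKey = 'lock:leads:source_1';

// add() only succeeds when the key does not exist yet, so it works as a mutex.
if ($mc->add($lockKey, 1, 30)) {            // 30s TTL so a crashed worker can't hold it forever
    $leads = $mc->get('leads:source_1') ?: array();
    // ... hand the next lead to the closer, flag it is_accepted, write the list back ...
    $mc->set('leads:source_1', $leads);
    $mc->delete($lockKey);                  // release the lock
} else {
    // someone else holds the lock: retry after a short usleep(), or fall back to the DB
}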
Does this seem like a viable solution? Any recommendations? We need to minimize the chances of lock waits desperately and quickly.
Sounds viable, but have you looked at your indexes and are you using proper isolation levels on your selects?
This previous SO question may help with the answer you're seeking: Any way to select without causing locking in MySQL?
If you perform your select/update in an SP with full transactions, this could also speed things up quite a bit due to optimization. Of course, there are times when SPs in MySQL are much slower :(
I'd have put this as a comment, but haven't reached that level yet :)
And I did read the part about InnoDB, but experience has shown me improvements even with InnoDB when using isolation levels.
You should definitely look at making sure your DB queries are fully optimized before you employ another datastore.
If you do decide to cache this data then consider using Redis, which makes lists first class citizens.

How best to implement Memcached on an Ecommerce Website

I've got a large ECommerce website running LAMP and was wondering how best to easily implement Memcached?
Store all queries in memcached for a certain period - sounds pointless
Store only certain important data like product information into Memcached and make sure the proper updates can expire it correctly - sounds like an end to end solution.
Store complex query results which do not change often - involves a lot of static code
Trying to get an overview of what changes I should make to take the best advantage of memcached.
Thanks :)
I'd let your users decide.
In other words, rather than trying to second-guess what will work best, I'd rework ALL the database queries to use memcached along these lines:
1) Can memcached answer this query? If so, return the results from the cache.
2) If not 1), pull the results from the database and write them back to memcached so the next time they're in the cache.
3) Ensure all your updates / inserts / deletes invalidate the appropriate cache keys.
Now, given that 3) might be complex, I'd use that factor to choose which queries to load through the cache - if it's hard and/or time-consuming to invalidate the cache, don't cache those queries to start with.
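A minimal sketch of that read-through pattern using the pecl/memcached extension (the key scheme and the never-expire choice are only examples):

// Steps 1) and 2): try the cache, fall back to the database and write back.
function cached_query(Memcached $mc, PDO $db, $sql, array $params = array())
{
    $key = 'q:' . md5($sql . serialize($params));

    $rows = $mc->get($key);
    if ($mc->getResultCode() === Memcached::RES_SUCCESS) {
        return $rows;                            // 1) the cache answered the query
    }

    $stmt = $db->prepare($sql);                  // 2) fall through to the database...
    $stmt->execute($params);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    $mc->set($key, $rows, 0);                    // ...and write back; 0 = never expire, let LRU evict
    return $rows;
}

// 3) every update / insert / delete has to drop the keys it affects, e.g.:
// $mc->delete('q:' . md5($productSql . serialize(array($productId))));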
Because memcached will automatically dump the least recently used keys when the store approaches capacity, you can set everything to never expire and just allow available resources to determine what is currently in the cache. This will largely be determined by user behaviour (which products are popular etc) and hence my first comment about letting the users decide.
It's also worth saying that you should ensure your MySQL database is well tuned first as that can often be an easier win. Query caching, checking heavy queries with Explain to tune your indexes etc, all of this can have a greater impact.
There is no way to get optimization advice tailored specifically to your system here. Either post the details of the stack you use, or pay someone to analyze what you have. There are no "common threads" here. (Besides, to cache queries you can do it at the DB level, given enough memory.)

PHP APC To cache or not to cache?

I don't really have any experience with caching at all, so this may seem like a stupid question, but how do you know when to cache your data? I wasn't even able to find one site that talked about this, but it may just be my searching skills or maybe too many variables to consider?
I will most likely be using APC. Does anyone have any examples of what would be the least amount of data you would need in order to cache it? For example, let's say you have an array with 100 items and you use a foreach loop on it and perform some simple array manipulation, should you cache the result? How about if it had a 1000 items, 10000 items, etc.?
Should you be caching the results of your database queries? What kind of queries should you be caching? I assume a simple SELECT, and maybe a couple of JOINs, against a MySQL DB doesn't need caching - or does it? Assuming the MySQL query cache is turned on, does that mean you don't need to cache in the application layer, or should you still do it?
If you instantiate an object, should you cache it? How to determine whether it should be cached or not? So a general guide on what to cache would be nice, examples would also be really helpful, thanks.
When you're looking at caching data that has been read from the database in APC/memcache/WinCache/redis/etc, you should be aware that it will not be updated when the database is updated unless you explicitly code to keep the database and cache in synch. Therefore, caching is most effective when the data from the database doesn't change often, but also requires a more complex and/or expensive query to retrieve that data from the database (otherwise, you may as well read it from the database when you need it)... so expensive join queries that return the same data records whenever they're run are prime candidates.
And always test to see if queries are faster read from the database than from cache. Correct database indexing can vastly improve database access times, especially as most databases maintain their own internal cache as well, so don't use APC or equivalent to cache data unless the database overheads justify it.
You also need to be aware of space usage in the cache. Most caches are a fixed size and you don't want to overfill them... so don't use them to store large volumes of data. Use the apc.php script available with APC to monitor cache usage (though make sure that it's not publicly accessible to anybody and everybody that accesses your site.... bad security).
When holding objects in cache, the object will be serialized() when it's stored, and unserialized() when it's retrieved, so there is an overhead. Objects with resource attributes will lose that resource; so don't store your database access objects.
It's sensible only to use cache to store information that is accessed by many/all users, rather than user-specific data. For user session information, stick with normal PHP sessions.
The simple answer is that you cache data when things get slow. Obviously for any medium to large sized application, you need to do much more planning than just a wait and see approach. But for the vast majority of websites out there, the question to ask yourself is "Are you happy with the load time". Of course if you are obsessive about load time, like myself, you are going to want to try to make it even faster regardless.
Next, you have to identify what specifically is the cause of the slowness. You assumed that your application code was the source, but it's worth examining whether there are other external factors, such as large page file size, excessive requests, no gzip, etc. Use a site like http://tools.pingdom.com/ or an extension like YSlow as a start for that. (Quick tip: make sure keep-alives and gzip are working.)
Assuming the problem is the duration of execution of your application code, you are going to want to profile your code with something like xdebug (http://www.xdebug.org/) and view the output with kcachegrind or wincachegrind. That will let you know what parts of your code are taking long to run. From there you will make decisions on what to cache and how to cache it (or make improvements in the logic of your code).
There are so many possibilities for what the problem could be and the associated solutions, that it is not worth me guessing. So, once you identify the problem you may want to post a new question related to solving that specific problem. I will say that if not used properly, the mysql query cache can be counter productive. Also, I generally avoid the APC user cache in favor of memcached.

Which one is less costly in terms of resources?

I'm on an optimization crusade for one of my sites, trying to cut down as many MySQL queries as I can.
I'm implementing partial caching, which writes .txt files for various modules of the site and updates them on demand. I've come across one that cannot remain static for all users, so the .txt file that's written to disk will need to be altered on the fly via PHP.
Which is done via
flush();                      // push out anything already buffered
ob_start();                   // start capturing output
include('file.txt');          // emit the cached file into the buffer
$contents = ob_get_clean();   // grab the buffered contents and stop buffering
Then I modify the html in the $contents variable, and echo it out for different users.
Alternatively, I can leave it as it is, which runs a MySQL query against a small table that has category names (about 13 of them).
Which one is less expensive: running a query every single time, or using the method I posted above to inject HTML code on the fly into a static .txt file?
Reading the file (save in very weird setups) will be marginally faster than querying the DB (no network interaction, etc.), but the difference will hardly be measurable -- just try and see if you can measure it!
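If you want to put a number on it, a crude comparison along these lines will do (assumes a PDO connection in $pdo and a hypothetical categories table; run it a few times):

// Time 1000 reads of the cached file vs. 1000 runs of the small query.
$t0 = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    ob_start();
    include 'file.txt';
    $contents = ob_get_clean();
}
$fileTime = microtime(true) - $t0;

$t0 = microtime(true);
$stmt = $pdo->prepare('SELECT name FROM categories');
for ($i = 0; $i < 1000; $i++) {
    $stmt->execute();
    $rows = $stmt->fetchAll(PDO::FETCH_COLUMN);
}
$dbTime = microtime(true) - $t0;

printf("file: %.4fs  db: %.4fs\n", $fileTime, $dbTime);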
Optimize your queries first! Then use memcached or a similar caching system for data that is accessed frequently, and then you can add file caching. We use all three combined and it runs very smoothly. Small optimized queries aren't so bad. If your DB is on a local server, the network is not an issue. And don't forget to use the MySQL query cache (I guess you do use MySQL).
Where is your performance bottleneck?
If you don't know the bottleneck, you can't make any sensible assessment about optimisations.
Collect some metrics, and optimise accordingly.
Try both and choose the one that is either a clear winner or, failing that, more maintainable. This depends on where the DB is, how much load it's getting, and whether you'll need to run more than one application instance (in which case they'd need to share this file over the network, and it's not local anymore).
Here are the patterns that work for me when I'm refactoring PHP/MySQL site code.
The number of queries per page is absolutely critical - one complex query with joins is fastest as long as indexes are proper. A single page can almost always be generated with five or fewer queries in my experience, plus good use of classes and arrays of classes. Often one query for the session and one query for the app.
After indexes the biggest thing to work on is the caching configuration parameters.
Never have queries in loops (see the sketch at the end of this answer).
Moving database queries to files has never been a useful strategy, especially since it often ends up screwing up your query integrity.
Alex and the others are right about testing. If your pages are noticeably slow, then they are slow for a reason (or reasons) - don't even start changing anything until you know what the reasons are and can measure the consequences of your changes. Refactoring by guessing is always a losing strategy, especially when (as in your case) you're adding complexity.
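To illustrate the point about queries in loops (a sketch; assumes a PDO connection and an array of integer IDs):

// Instead of one query per ID inside a loop...
//   foreach ($ids as $id) { $pdo->query("SELECT * FROM products WHERE id = $id"); }
// ...fetch them all in one round trip:
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT * FROM products WHERE id IN ($placeholders)");
$stmt->execute($ids);
$products = $stmt->fetchAll(PDO::FETCH_ASSOC);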
