How can I avoid duplicate copies of an object in a cache? - php

I'm using memcache to design a cache for the model layer of a web application, one of my biggest problems is data consistency.
It came to my mind caching data like this:
(key=query, value=list of object ids result of the query)
for each id of the list:
(key=object.id, value=object)
So, every time a query is done:
If the query already exists, I retrieve the objects referenced in the list from the cache.
If it doesn't, all the objects in the list are stored in the cache, replacing any old values.
Has anyone used this approach? Is it good? Any other ideas?
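A minimal sketch of the scheme described above, assuming a connected Memcache instance ($memcache) and illustrative key names:

// Level 1: query text -> list of object IDs.
// Level 2: one cache entry per object, so each object is stored only once.
$queryKey = md5($sql);
$ids = $memcache->get($queryKey);

if ($ids === false) {
    $objects = run_query($sql);    // hypothetical database call
    $ids = array();
    foreach ($objects as $object) {
        $ids[] = $object->id;
        $memcache->set('obj_' . $object->id, $object);  // single copy per object
    }
    $memcache->set($queryKey, $ids);
} else {
    $objects = array();
    foreach ($ids as $id) {
        $objects[] = $memcache->get('obj_' . $id);
    }
}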

Caching is one of those topics where there is no one right answer - it depends on your domain.
The caching policy that you describe may be sufficient for your domain. However, you don't appear to be worried about stale data. Often I would expect to see a timestamp against some of the entities - if the cached value is older than some system-defined parameter, then it would be considered stale and re-fetched.
For more discussion on caching algorithms, see Wikipedia (for starters)
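With memcache, that staleness window can also be delegated to the cache itself by passing an expiry when storing; a minimal sketch (the 300-second window is illustrative):

$memcache = new Memcache();
$memcache->connect('localhost', 11211);

// Entries older than the system-defined window are evicted automatically.
$ttl = 300; // seconds
$memcache->set('user_42', $userObject, 0, $ttl); // flags = 0, expire after $ttl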

Welcome to the world of concurrency programming. You'll want to learn a bit about mutual exclusion. If you tell us what language/platform you are developing for we can describe more specifically your options.

Related

REST service and Memcache

I am considering enabling Memcache support for my large-scale REST service. However I have some questions regarding best approaches for these key-value stores.
The setup:
A database wrapper which has functions for select, update, etc.
A REST framework which contains all the API functions (getUser, createUser, etc.)
In my head, the ideal approach would be to integrate Memcache in the database wrapper so that, for example, every SQL query would get md5-hashed and saved in the cache (this is, btw, what most online resources suggest). However, there is an obvious problem with this approach: if a search query has been cached, and one of the users from the search result has been updated after the result was cached, this won't be reflected in the next request (because the stale result is still served from the cache).
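A minimal sketch of that wrapper-level idea (which is also the source of the staleness problem just described); the $db->select() call and key scheme are illustrative:

// Cache every SELECT under the md5 of its SQL text.
function cached_select($memcache, $db, $sql) {
    $key = md5($sql);
    $result = $memcache->get($key);
    if ($result === false) {
        $result = $db->select($sql);  // hypothetical wrapper method
        $memcache->set($key, $result);
    }
    return $result;
}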
As I see it, I have several ways of handling this:
Implement Memcache in the REST framework for each function (getUser, createUser, etc.) and thereby explicitly handle the updating of the cache when users get updated. This could result in redundant code.
Let the cached values expire very quickly and live with the fact that some requests show old cached values.
Do a more advanced implementation of Memcache in the database wrapper so that I can identify which parts (e.g. users) to update in, say, a search request.
Could you guide me on which of these, or a completely different approach, to take?
Thanks in advance.
Enabling cache for a web application is not something to take lightly.
Maybe you have done that already, but I recommend you first come up with a goal based on business needs or forecasts (ex: must accept 1000 requests per second), then properly stress-test your system to have numbers before you start changing anything, and then identify your bottleneck.
http://en.wikipedia.org/wiki/Performance_tuning
I usually use profiling tools such as XHProf (by Facebook).
https://github.com/facebook/xhprof
Caching all your data to mirror your database might not be the best approach.
Find out how much memory you can allocate for your cache. If your architecture only allows you to allocate 100MB for memcache, then that will affect your decisions about what you cache and how long you cache it.
The best cache is one you never have to expire, but we all know that data changes. You can start by caching data that is requested often and requires the most resources to fetch.
Always make sure you are not spending effort improving something that will yield only a small gain.
Without understanding your architecture in depth, it would be hazardous for anyone to recommend a caching strategy that best fit your needs.
Maybe you should cache the resulting output of your web services instead? Using a reverse proxy, for example (what @Darrel is talking about), or using output buffering...
http://en.wikipedia.org/wiki/Reverse_proxy
http://php.net/manual/en/book.outcontrol.php
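A minimal sketch of the output-buffering approach, using an illustrative file-based store and a 5-minute expiry:

// Cache the whole rendered page to a file.
$cacheFile = '/tmp/page_' . md5($_SERVER['REQUEST_URI']) . '.html';
if (file_exists($cacheFile) && time() - filemtime($cacheFile) < 300) {
    readfile($cacheFile);  // serve the cached copy and stop
    exit;
}
ob_start();
// ... render the page as usual ...
file_put_contents($cacheFile, ob_get_flush());  // send to client and store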
Optimize your database queries before you think about caching. Make sure you use a PHP opcode cache (like APC) and all those things that are standard practice.
http://phplens.com/lens/php-book/optimizing-debugging-php.php
http://blog.digitalstruct.com/2008/01/31/performance-tuning-overview/
If you want to cache data and prevent stale/old data from being served, the trick is to identify your data (by primary key, maybe?) and, when the data is updated or deleted, delete or update the cache for that identifier.
<?php
// Connect to memcache (assuming a local memcached instance)
$memcache = new Memcache();
$memcache->connect('localhost', 11211);

// After inserting into the DB, you can also put the record in the cache
$memcache->set($userId, $userData);

// After updating or deleting the user, you update or delete the cached data
$memcache->delete($userId);
A lot of sites will show stale data. When I'm on Stack Overflow and my reputation increases, and I then go into the Stack Overflow chat, the reputation shown is my old reputation. When I reached a reputation of 20 (the reputation required to chat), I still could not chat for another 5 minutes because the chat system had my old reputation data and did not yet know my reputation had increased enough to allow me to chat. Some data can be stale while other types of data should never be stale. Consider that when caching data.
Conclusion
Your approaches can all be valid depending on the factors discussed above. In fact, you can use a combination of them for the different types of data you want to cache, based on how long it is acceptable to show old data for each. Maybe the categories or the list of countries (since they do not change often) can be cached for a long time, while the reputation (or whatever data changes all the time for all users) should be cached for a short period only.

PHP APC: To cache or not to cache?

I don't really have any experience with caching at all, so this may seem like a stupid question, but how do you know when to cache your data? I wasn't even able to find one site that talked about this, but it may just be my searching skills - or maybe there are too many variables to consider?
I will most likely be using APC. Does anyone have an example of the least amount of data that would be worth caching? For example, let's say you have an array with 100 items, you use a foreach loop on it, and you perform some simple array manipulation - should you cache the result? How about if it had 1000 items, 10,000 items, etc.?
Should you be caching the results of your database queries? What kind of queries should you be caching? I assume a simple select, and maybe a couple of joins, against a MySQL db doesn't need caching, or does it? Assuming the MySQL query cache is turned on, does that mean you don't need to cache in the application layer, or should you still do it?
If you instantiate an object, should you cache it? How do you determine whether it should be cached or not? A general guide on what to cache would be nice; examples would also be really helpful. Thanks.
When you're looking at caching data that has been read from the database in APC/memcache/WinCache/redis/etc., you should be aware that it will not be updated when the database is updated unless you explicitly write code to keep the database and cache in sync. Therefore, caching is most effective when the data from the database doesn't change often, but also requires a complex and/or expensive query to retrieve (otherwise, you may as well read it from the database when you need it)... so expensive join queries that return the same data records whenever they're run are prime candidates.
And always test to see if queries are faster read from the database than from cache. Correct database indexing can vastly improve database access times, especially as most databases maintain their own internal cache as well, so don't use APC or equivalent to cache data unless the database overheads justify it.
You also need to be aware of space usage in the cache. Most caches are a fixed size and you don't want to overfill them... so don't use them to store large volumes of data. Use the apc.php script that ships with APC to monitor cache usage (though make sure it's not publicly accessible to anybody and everybody that accesses your site... bad security).
When holding objects in cache, the object will be serialized() when it's stored, and unserialized() when it's retrieved, so there is an overhead. Objects with resource attributes will lose that resource; so don't store your database access objects.
It's sensible only to use cache to store information that is accessed by many/all users, rather than user-specific data. For user session information, stick with normal PHP sessions.
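A minimal sketch of that read-through pattern with APC (the query, key, and $db handle are illustrative):

// Read-through cache for an expensive join query whose result
// is the same for all users.
function get_report($db, $ttl = 600) {
    $report = apc_fetch('sales_report', $success);
    if (!$success) {
        // hypothetical expensive join returning stable records
        $report = $db->query('SELECT ... FROM a JOIN b ...')->fetchAll();
        apc_store('sales_report', $report, $ttl);
    }
    return $report;
}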
The simple answer is that you cache data when things get slow. Obviously for any medium to large sized application, you need to do much more planning than just a wait and see approach. But for the vast majority of websites out there, the question to ask yourself is "Are you happy with the load time". Of course if you are obsessive about load time, like myself, you are going to want to try to make it even faster regardless.
Next, you have to identify what specifically is the cause of the slowness. You assumed that your application code was the source, but it's worth examining whether there are other external factors such as large page file size, excessive requests, no gzip, etc. Use a site like http://tools.pingdom.com/ or an extension like YSlow as a start. (Quick tip: make sure keep-alives and gzip are working.)
Assuming the problem is the duration of execution of your application code, you are going to want to profile your code with something like xdebug (http://www.xdebug.org/) and view the output with KCachegrind or WinCacheGrind. That will let you know what parts of your code are taking a long time to run. From there you will make decisions on what to cache and how to cache it (or make improvements in the logic of your code).
There are so many possibilities for what the problem could be and the associated solutions that it is not worth me guessing. So, once you identify the problem, you may want to post a new question related to solving that specific issue. I will say that if not used properly, the MySQL query cache can be counterproductive. Also, I generally avoid the APC user cache in favor of memcached.

Smart (?) Database Cache

I've seen several database cache engines, and all of them are pretty dumb (i.e.: keep this query cached for X minutes) and require that you manually delete the whole cache repository after an INSERT / UPDATE / DELETE query has been executed.
About 2 or 3 years ago I developed an alternative DB cache system for a project I was working on, the idea was basically to use regular expressions to find the table(s) involved in a particular SQL query:
$query_patterns = array(
    'INSERT'   => '/INTO\s+(\w+)\s+/i',
    'SELECT'   => '/FROM\s+((?:[\w]|,\s*)+)(?:\s+(?:[LEFT|RIGHT|OUTER|INNER|NATURAL|CROSS]\s*)*JOIN\s+((?:[\w]|,\s*)+)\s*)*/i',
    'UPDATE'   => '/UPDATE\s+(\w+)\s+SET/i',
    'DELETE'   => '/FROM\s+((?:[\w]|,\s*)+)/i',
    'REPLACE'  => '/INTO\s+(\w+)\s+/i',
    'TRUNCATE' => '/TRUNCATE\s+(\w+)/i',
    'LOAD'     => '/INTO\s+TABLE\s+(\w+)/i',
);
I know these regexes probably have some flaws (my regex skills were pretty green back then) and obviously don't match nested queries, but since I never use those, that isn't a problem for me.
Anyway, after finding the involved tables I would alphabetically sort them and create a new folder in the cache repository with the following naming convention:
+table_a+table_b+table_c+table_...+
In case of a SELECT query, I would fetch the results from the database, serialize() them and store them in the appropriate cache folder, so for instance the results of the following query:
SELECT `table_a`.`title`, `table_b`.`description` FROM `table_a`, `table_b` WHERE `table_a`.`id` <= 10 ORDER BY `table_a`.`id` ASC;
Would be stored in:
/cache/+table_a+table_b+/079138e64d88039ab9cb2eab3b6bdb7b.md5
The MD5 is the hash of the query itself. Upon a subsequent SELECT query, the results would be trivial to fetch.
In case of any other type of write query (INSERT, REPLACE, UPDATE, DELETE and so on), I would glob() all the folders that had +matched_table(s)+ in their name and delete all the file contents. This way it wouldn't be necessary to delete the whole cache, just the cache used by the affected and related tables.
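A minimal sketch of that invalidation step, following the folder layout described above (function and directory names are illustrative):

// Wipe every cache folder whose name mentions a table touched by a write.
function invalidate_tables(array $tables, $cacheDir = '/cache') {
    foreach ($tables as $table) {
        foreach (glob("$cacheDir/*+$table+*", GLOB_ONLYDIR) as $dir) {
            foreach (glob("$dir/*.md5") as $file) {
                unlink($file);  // drop the serialized result set
            }
        }
    }
}

// e.g. after "UPDATE table_a SET ...":
invalidate_tables(array('table_a'));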
The system worked pretty well and the difference in performance was visible - although the project had many more read queries than write queries. Since then I started using transactions and FK CASCADE UPDATES / DELETES, and never had the time to perfect the system to make it work with these features.
I've used MySQL Query Cache in the past but I must say the performance doesn't even compare.
I'm wondering: am I the only one who sees beauty in this system? Are there any bottlenecks I may not be aware of? Why do popular frameworks like CodeIgniter and Kohana (I'm not aware of Zend Framework) have such rudimentary DB cache systems?
More importantly, do you see this as a feature worth pursuing? If yes, is there anything I could do / use to make it even faster (my main concerns are disk I/O and (de)serialization of query results)?
I appreciate all input, thanks.
I can see the beauty in this solution; however, I believe it only works for a very specific set of applications. Scenarios where it is not applicable include:
Databases which utilize cascading deletes/updates or any kind of triggers. E.g., your DELETE to table A may cause a DELETE from table B. The regex will never catch this.
Accessing the database from points which do not go through your cache invalidation scheme, e.g. crontab scripts etc. If you ever decide to implement replication across machines (introduce read-only slaves), it may also disturb the cache (because it does not go through cache invalidation, etc.)
Even if these scenarios are not realistic for your case, they do still answer the question of why frameworks do not implement this kind of cache.
Regarding whether this is worth pursuing, it all depends on your application. Maybe you care to supply more information?
The solution, as you describe it, is at risk of concurrency issues. When you're receiving hundreds of queries per second, you're bound to hit a case where an UPDATE statement runs, but before you can clear your cache, a SELECT reads from it and gets stale data. Additionally, you may run into issues when several UPDATEs hit the same set of rows in a short time period.
In a broader sense, best practice with caching is to cache the largest objects possible. E.g., rather than having a bunch of "user"-related rows cached all over the place, it's better to just cache the "user" object itself.
Better still, if you can cache whole pages (e.g., you show the same homepage to everyone; a profile page appears identical to almost everyone, etc.), do that. One cache fetch for a whole, pre-rendered page will dramatically outperform dozens of cache fetches for row/query-level caches followed by re-rendering the page.
Long story short: profile. If you take the time to do some measurement, you'll likely find that caching large objects, or even pages, rather than small queries used to build those things, is a huge performance win.
While I do see the beauty in this - especially for environments where resources are limited and cannot easily be extended, like on shared hosting - I personally would fear complications in the future: what if somebody, newly hired and unaware of the caching mechanism, starts using nested queries? What if some external service starts updating the table, with the cache not noticing?
For a specialized, defined project that urgently needs a speedup that cannot be helped by adding processor power or RAM, this looks like a great solution. As a general component, I find it too shaky, and would fear subtle problems in the long run that stem from people forgetting that there is a cache to be aware of.
I suspect that the regexes may not provide for every case - certainly they don't seem to deal with the scenario of mixing schema-qualified table names with bare table names. e.g. consider
update stats.measures set amount=50 where id=1;
and
use stats;
update measures set amount=50 where id=1;
Then there's PL/SQL.
Then there's the fact that it depends on every client opting in to an advisory control mechanism i.e. it pre-supposes that all the database access is from machines implementing the caching control mechanism on a shared filesystem.
(As a small point - wouldn't it be simpler to just check the modification times on the data files to determine if the cached version of a query on a defined set of tables is still current, rather than trying to identify whether the cache control mechanism has spotted an update - it would certainly be a lot more robust.)
Stepping back a bit, implementing this from scratch using a robust architecture would mean that all queries would have to be intercepted by the control mechanism. The control mechanism would probably need a more sophisticated query parser. It certainly requires a common storage substrate for all the instances of the control mechanism. It probably needs an understanding of the data dictionary - all things which are already implemented by the database itself.
You state that "I've used MySQL Query Cache in the past but I must say the performance doesn't even compare."
I find this rather odd. Certainly when dealing with large result sets from queries, my experience is that loading the data into the heap from a database is a lot faster than unserializing large arrays - although large result sets are rather atypical of web-based applications.
When I've tried to speed up database access (after fixing everything else of course) then I've gone down the route of replicating and partitioning data across multiple DBMS instances.
C.
This is related to the problem of session splitting when working with multiple databases in a master-slave configuration. Basically, a similar set of regular expressions are used to determine which tables (or even which rows) are being read from or written to. The system keeps track of which tables were written to and when, and when a read to one of those tables comes up, it's routed to the master. If a query is reading from a table whose data needn't be up-to-the-second accurate, then it's routed to the slave. Generally, information only really needs to be current when it's something a user changed themselves (i.e., editing a user's profile).
They talk about this a good bit in the O'Reilly book High Performance MySQL. I used it quite a bit when developing a system for handling session splits back in the day.
The improvement you describe is to avoid invalidating caches that are guaranteed to not have been affected by an update because they draw data from a different table.
That is of course nice, but I am not sure if it is fine-grained enough to make a real difference. You would still be invalidating lots of caches that did not really need to be invalidated (because the update was on the same table, but affected different rows).
Also, even this "simple" scheme relies on being able to detect the relevant tables by looking at the SQL query string. This can be difficult to do in the general case, because of views, table aliases, and multiple catalogs.
It is very difficult to automatically (and efficiently) detect whether a cache needs to be invalidated. Because of that, you can either use a very simple scheme (such as invalidating on every update, or per table, as in your system, which does not work too well when there are many updates), or a very hand-crafted cache for the specific application with deep hooks into the query logic (probably difficult to write and hard to maintain), or accept that the cache can contain stale data and just refresh it periodically.

How to increase performance for MySQL database

How can I increase the performance of my MySQL database? My website is hosted on a shared server, and the host has suspended my account because of "too many queries".
The staff asked me to "index" or "cache" or trim my database.
I don't know what "index" and "cache" mean, or how to do this in PHP.
thanks
What an index is:
Think of a database table as a library - you have a big collection of books (records), each with associated data (author name, publisher, publication date, ISBN, content). Also assume that this is a very naive library, where all the books are shelved in order by ISBN (primary key). Just as the books can only have one physical ordering, a database table can only have one primary key index.
Now imagine someone comes to the librarian (database program) and says, "I would like to know how many Nora Roberts books are in the library". To answer this question, the librarian has to walk the aisles and look at every book in the library, which is very slow. If the librarian gets many requests like this, it is worth his time to set up a card catalog by author name (index on name) - then he can answer such questions much more quickly by referring to the catalog instead of walking the shelves. Essentially, the index sets up an 'alternative ordering' of the books - it treats them as if they were sorted alphabetically by author.
Notice that 1) it takes time to set up the catalog, 2) the catalog takes up extra space in the library, and 3) it complicates the process of adding a book to the library - instead of just sticking a book on the shelf in order, the librarian also has to fill out an index card and add it to the catalog. In just the same way, adding an index on a database field can speed up your queries, but the index itself takes storage space and slows down inserts. For this reason, you should only create indexes in response to need - there is no point in indexing a field you rarely search on.
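In MySQL terms, adding such a "card catalog" is a single statement; a minimal sketch against a hypothetical books table:

// Create an index on the author column so searches by author
// no longer have to "walk the aisles" (scan every row).
mysql_query("ALTER TABLE books ADD INDEX idx_author (author)");

// This query can now use the index instead of a full table scan:
$res = mysql_query("SELECT COUNT(*) FROM books WHERE author = 'Nora Roberts'");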
What caching is:
If the librarian has many people coming in and asking the same questions over and over, it may be worth his time to write the answer down at the front desk. Instead of checking the stacks or the catalog, he can simply say, "here is the answer I gave to the last person who asked that question".
In your script, this may apply in different ways. You can store the results of a database query or a calculation or part of a rendered web page; you can store it to a secondary database table or a file or a session variable or to a memory service like memcached. You can store a pre-parsed database query, ready to run. Some libraries like Smarty will automatically store part or all of a page for you. By storing the result and reusing it you can avoid doing the same work many times.
In every case, you have to worry about how long the answer will remain valid. What if the library got a new book in? Is it OK to use an answer that may be five minutes out of date? What about a day out of date?
Caching is very application-specific; you will have to think about what your data means, how often it changes, how expensive the calculation is, how often the result is needed. If the data changes slowly, it may be best to recalculate and store the result every time a change is made; if it changes often but is not crucial, it may be sufficient to update only if the cached value is more than a certain age.
Set up a copy of your application locally, enable the MySQL query log, and set up xdebug or some other profiler. Then start collecting data and testing your application. There are lots of guides and books available about how to optimize things. It is important that you spend time testing and collecting data first, so you optimize the right things.
Using the data you have collected, try to reduce the number of queries per page view. Ideally, you should be able to get everything you need in fewer than 5-10 queries.
Look at the logs and see if you are asking for the same thing twice. It is a bad idea to request a record in one portion of your code and then request it again from the database a few lines later, unless you are sure the value is likely to have changed.
Look for queries embedded in loops, and try to refactor them so you make a single query and simply loop over the results, as in the sketch below.
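A minimal sketch of that refactoring (hypothetical users table and IDs):

// Instead of one query per ID inside a loop...
$userIds = array(1, 2, 3);
$rows = array();
foreach ($userIds as $id) {
    $res = mysql_query("SELECT name FROM users WHERE id = " . (int)$id);
    $rows[] = mysql_fetch_assoc($res);
}

// ...fetch everything in a single query and loop over the results:
$idList = implode(',', array_map('intval', $userIds));
$res = mysql_query("SELECT name FROM users WHERE id IN ($idList)");
$rows = array();
while ($row = mysql_fetch_assoc($res)) {
    $rows[] = $row;
}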
The select * you mention using is an indication that you may be doing something wrong. You should probably be listing only the fields you explicitly need. Check this site or Google for lots of good arguments about why select * is evil.
Start looking at your queries and then using EXPLAIN on them. For queries that are frequently used, make sure they are using a good index and not doing a full table scan. Tweak indexes on your development database and test.
There are a couple things you can look into:
Query Design - look into more advanced and faster solutions
Hardware - throw better and faster hardware at the problem
Database Design - use indexes and practice good database design
All of these are easier said than done, but it is a start.
Firstly, sack your host and get off shared hosting into an environment you have full control over, where you stand a chance of being able to tune decently.
Replicate that environment in your lab, ideally with the same hardware as production; this includes things like RAID controller.
Did I mention that you need a RAID controller? Yes, you do. You can't achieve decent write performance without one - specifically, one with a battery-backed cache. If you don't have one, each write needs to physically hit the disc, which is ruinous for performance.
Anyway, back to read performance: once you've got a machine with the same spec RAID controller (and same discs, obviously) as production in your lab, you can try to tune stuff up.
More RAM is usually the cheapest way of achieving better performance - make sure that you've got MySQL configured to use it, which means tuning storage-engine-specific parameters.
I am assuming here that you have at least 100G of data; if not, just buy enough RAM that your entire DB fits in memory, and read performance is essentially solved.
Software changes that others have mentioned, such as optimising queries and adding indexes, are helpful too, but only once you've got a development hardware environment that enables you to usefully do performance work - i.e. to measure the performance of your application meaningfully - which means real hardware (not VMs), consistent with the hardware environment used in production.
Oh yes - one more thing - don't even THINK about deploying a database server on a 32-bit OS; it's a ruinous waste of good RAM.
Indexing is done on database tables in order to speed up queries. If you don't know what it means, you have none. At a minimum you should have indexes on every foreign key and on most fields that are used frequently in the WHERE clauses of your queries. Primary keys get indexes automatically, assuming you set them up to begin with, which I would find unlikely in someone who doesn't know what an index is. Are your tables normalized?
BTW, since you are doing a division in your math (why, I haven't a clue), you should Google integer math. You may not be getting correct results.
You should not select * ever. Instead, select only the data you need for that particular call. And what is your intention here?
order by votes*1000+((1440 - ($server_date - date))/60)*2+visites*600 desc
You may have poorly-written queries, and/or poorly-written pages that run too many queries. Could you give us specific examples of queries you're using that are run on a regular basis?
sure
this query to fetch the last 3 posts
select * from posts where visible = 1 and date > ($server_date - 86400) and dont_show_in_frontpage = 0 order by votes*1000+((1440 - ($server_date - date))/60)*2+visites*600 desc limit 3
what do you think?

PHP memcache design patterns

We use memcache basically as an afterthought to just cache query results.
Invalidation is a nightmare due to the way it was implemented. We have since learned some techniques with memcache through reading the mailing list, for example the trick that allows group invalidation of a bunch of keys. For those who know it, skip the next paragraph...
For those who don't know and are interested, the trick is adding a sequence number to your keys and storing that sequence number in memcache. Then, every time before you do your "get", you grab the current sequence number and build your keys around it. To invalidate the whole group, you just increment that sequence number (see the sketch below).
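A minimal sketch of the sequence-number trick, assuming a connected Memcache instance (key names are illustrative):

// Build a group-namespaced key around the current sequence number.
function ns_key($memcache, $group, $key) {
    $seq = $memcache->get("seq_$group");
    if ($seq === false) {
        $seq = 1;
        $memcache->set("seq_$group", $seq);
    }
    return "{$group}_{$seq}_{$key}";
}

// Normal reads and writes go through the namespaced key...
$memcache->set(ns_key($memcache, 'users', 'top10'), $topUsers);
$topUsers = $memcache->get(ns_key($memcache, 'users', 'top10'));

// ...and the whole group is invalidated by bumping the sequence:
// old keys are simply never requested again and expire naturally.
$memcache->increment('seq_users');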
So anyway, I'm currently revising our model to implement this.
My question is..
We didn't know about this pattern, and I'm sure there are others we don't know about. I've searched and haven't been able to find any design patterns on the web for implementing memcache, best practices, etc.
Can someone point me to something like this or even just write up an example? I would like to make sure we don't make a beginners mistake in our new refactoring.
One point to remember with object caching is that it's just that - a cache of objects/complex structures. A lot of people make the mistake of hitting their caches for straightforward, efficient queries, which incurs the overhead of a cache check/miss, when the database would have obtained the result far faster.
This piece of advice is one I've taken to heart since it was taught to me; know when not to cache, that is, when the overhead cancels out the perceived benefits. I know it doesn't answer the specific question here, but I thought it was worth pointing out as a general hint.
What Rob is saying is good advice. From my experience, there are two common ways to identify and invalidate cache records: unique identification and tag-based identification. These are usually combined to form a complete solution in which:
A cache record is assigned a unique identifier (which usually depends somehow on the data that it caches) and optionally any number of tags.
Cache records are recalled by their unique identifier.
Cache records can be invalidated by their unique identifier (one at a time), or by any tag they are tagged with (possibly invalidating multiple records at the same time).
This is relatively simple to implement and generally works very well. I have yet to come across a system that needed more, though there are probably some edge cases out there that require specific solutions.
I use the Zend Cache component (you don't have to use the entire framework, just the Zend cache stuff if you want). It abstracts some of the caching logic (it supports grouping cache entries by 'tags', and though that feature is not supported by the memcache back end, I've rolled my own support for 'tags' with relative ease). So the pattern I use for functions that access the cache (generally in my model) is:
public function getBySlug($slug, $ignoreCache = false)
{
    if ($ignoreCache || !($result = $this->cache->load('someKeyBasedOnQuery')))
    {
        $select = $this->select()
                       ->where('slug = ?', $slug);
        $result = $this->fetchRow($select);
        try
        {
            $this->cache->save($result, 'someKeyBasedOnQuery');
        }
        catch (Zend_Exception $error)
        {
            // log the exception
        }
    }
    else
    {
        $this->registry->logger->info('someKeyBasedOnQuery came from cache');
    }
    return $result;
}
Basing the cache key on a hash of the query means that if another developer bypasses my models or uses another function elsewhere that does the same thing, it's still pulled from cache. Generally I tag the cache entry with a couple of generated tags (the name of the table is one, and the name of the function is the other). So by default our code invalidates, on insert, delete and update, the cached items tagged with that table. All in all, caching is pretty automatic in our base code, and developers can be confident that caching 'just works' in projects that we do. (Also, a great side effect of using tagging is that we have a page offering granular cache clearing/management, with options to clear the cache by model function or by table.)
We also store the query results from our database (PostgreSQL) in memcache, and we use triggers on the tables to invalidate the cache - there are several APIs out there (e.g. pgmemcache; I think MySQL has something like that too, but I don't know for sure). The benefit is that the database itself (via triggers) can handle the invalidation of data on changes (update, insert, delete); you don't need to write all that stuff into your application.
mysqlnd_qc, which inserts memcaching at the level where the database returns query results, auto-caches result sets from MySQL. It is FANTASTIC and automatic.
