APC lets you store data inside keys, but you cannot group these keys.
So if I want to have a group called "articles", and inside this group keys that take the form of the article ID, I can't do this easily.
articles -> 5  -> cached data
         -> 10 -> cached data
         -> 17 -> cached data
         ...
I could prefix the key with the "group" name like:
article_5 -> cached data
article_10 -> cached data
article_17 -> cached data
...
But this makes it impossible to delete the entire group if I want to :(
A working solution would be to store multidimensional arrays (this is what I'm doing now), but I don't think it's good, because when I want to access or delete cached data I need to fetch the entire group first. So if the group has one zillion articles in it, you can imagine the kind of array I would be iterating over and searching through.
Do you have better ideas on how could I achieve the group thing?
edit: found another solution, not sure if it's much better because I don't know how reliable it is yet. I'm adding a special key called __paths which is basically a multidimensional array containing the full prefixed key paths for all the other entries in the cache. When I request or delete cached data I use this array as a reference to quickly find the key (or group of keys) I need to remove, so I don't have to store arrays and iterate through all the keys...
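For illustration, here is a minimal sketch of that __paths index idea using plain apc_* calls (the helper names are placeholders I made up, and it ignores race conditions between concurrent writers):

function cache_group_store($group, $id, $value)
{
    $key = $group . '_' . $id;
    apc_store($key, $value);

    // keep the __paths index of prefixed keys per group up to date
    $paths = apc_fetch('__paths') ?: array();
    $paths[$group][$id] = $key;
    apc_store('__paths', $paths);
}

function cache_group_delete($group)
{
    $paths = apc_fetch('__paths') ?: array();
    if (isset($paths[$group])) {
        foreach ($paths[$group] as $key) {
            apc_delete($key);          // flush each real entry
        }
        unset($paths[$group]);         // drop the group from the index
        apc_store('__paths', $paths);
    }
}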
Based upon your observations, I looked at the underlying C implementation of APC's caching model (apc_cache.c) to see what I could find.
The source corroborates your observations that no grouping structure exists in the backing data store, such that any loosely-grouped collection of objects will need to be done based on some namespace constraint or a modification to the cache layer itself. I'd hoped to find some backdoor relying on key chaining by way of a linked list, but unfortunately it seems collisions are reconciled by way of a direct reallocation of the colliding slot instead of chaining.
Further confounding this problem, APC appears to use an explicit cache model for user entries, preventing them from aging off. So, the solution Emil Vikström provided that relies on the LRU model of memcached will, unfortunately, not work.
Without modifying the source code of APC itself, here's what I would do:
Define a namespace constraint that your entries conform to. As you've originally defined above, this would be something like article_ prepended to each of your entries.
Define a separate list of elements in this set. Effectively, this would be the 5, 10, and 17 scheme you'd described above, but in this case, you could use some numeric type to make this more efficient than storing a whole lot of string values.
Define an interface to updating this set of pointers and reconciling them with the backing memory cache, including (at minimum) the methods insert, delete, and clear. When clear is called, walk each of your pointers, reconstruct the key you used in the backing data store, and flush each from your cache.
What I'm advocating for here is a well-defined object that performs the operations you seek efficiently. This scales linearly with the number of entries in your sub-cache, but because you're using a numeric type for each element, you'd need over 100 million entries or so before you started to experience real memory pain at a constraint of, for example, a few hundred megabytes.
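As a rough sketch (not a drop-in implementation), such a wrapper could look like the following, assuming only the standard apc_* functions; the class and method names simply mirror the insert / delete / clear interface described above:

class ApcGroup
{
    private $namespace;

    public function __construct($namespace)
    {
        $this->namespace = $namespace;                    // e.g. "article"
    }

    private function ids()
    {
        return apc_fetch($this->namespace . ':ids') ?: array();
    }

    public function insert($id, $value)
    {
        apc_store($this->namespace . '_' . $id, $value);
        $ids = $this->ids();
        $ids[$id] = true;                                 // numeric set of member ids
        apc_store($this->namespace . ':ids', $ids);
    }

    public function delete($id)
    {
        apc_delete($this->namespace . '_' . $id);
        $ids = $this->ids();
        unset($ids[$id]);
        apc_store($this->namespace . ':ids', $ids);
    }

    public function clear()
    {
        // walk the id set, reconstruct each backing key and flush it
        foreach (array_keys($this->ids()) as $id) {
            apc_delete($this->namespace . '_' . $id);
        }
        apc_delete($this->namespace . ':ids');
    }
}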
Tamas Imrei beat me to suggesting an alternate strategy I was already in the process of documenting, but this has some major flaws I'd like to discuss.
As defined in the backing C code, APCIterator is a linear time operation over the full data set when performing searches (using its constructor, public __construct ( string $cache [, mixed $search = null [, ...]] )).
This is flatly undesirable in the case where the backing elements you're searching for represent a small percentage of your total data, because it would walk every single element in your cache to find the ones you desire. Citing apc_cache.c:
/* {{{ apc_cache_user_find */
apc_cache_entry_t* apc_cache_user_find(apc_cache_t* cache, char *strkey,
                                       int keylen, time_t t TSRMLS_DC)
{
    slot_t** slot;
    ...
    slot = &cache->slots[h % cache->num_slots];
    while (*slot) {
        ...
        slot = &(*slot)->next;
    }
}
Therefore, I would most strongly recommend using an efficient, pointer-based virtual grouping solution to your problem as I've sketched out above. In the case where you're severely memory-restricted, though, the iterator approach may be the more correct choice, conserving as much memory as possible at the expense of computation.
Best of luck with your application.
I have had this problem once with memcached and I solved it by using a version number in my keys, like this:
version -> 5
article_5_5 -> cached data
article_10_5 -> cached data
article_17_5 -> cached data
Just change the version number and the group will be effectively "gone"!
memcached uses a least-recently-used policy to remove old data, so the old-versioned group will be removed from the cache when the space is needed. I don't know if APC has the same feature.
According to MrGomez this does NOT work for APC. Please read his post, and keep my post in mind only for other cache systems which use a least-recently-used policy (not APC).
You may use the APCIterator class, which seems to exist especially for tasks like this:
The APCIterator class makes it easier to iterate over large APC caches. This is helpful as it allows iterating over large caches in steps ...
Unfortunately, APC can't do this. I've wished often enough that it could. So I looked for alternatives.
Zend_Cache has an interesting way of doing it, but it simply uses caches to cache the tagging information. It's a component that can in turn use backends (like apc).
If you want to go a step further, you could install Redis. It has all of that natively included, plus some other really interesting features. This would probably be the cleanest solution to go with. If you are able to use APC, you should also be able to use Redis.
I've been browsing the net trying to find a solution that will allow us to generate unique IDs in a regionally distributed environment.
I looked at the following options (among others):
SNOWFLAKE (by Twitter)
It seems like a great solution, but I just don't like the added complexity of having to manage another piece of software just to create IDs;
It lacks documentation at this stage, so I don't think it will be a good investment;
The nodes need to be able to communicate with one another using ZooKeeper (what about latency / communication failure?)
UUID
Just look at it: 550e8400-e29b-41d4-a716-446655440000;
It's a 128-bit ID;
There have been some known collisions (depending on the version, I guess); see this post.
AUTOINCREMENT IN RELATIONAL DATABASE LIKE MYSQL
This seems safe, but unfortunately, we are not using relational databases (scalability preferences);
We could deploy a MySQL server for this like what Flickr does, but again, this introduces another point of failure / bottleneck. Also added complexity.
AUTOINCREMENT IN A NON-RELATIONAL DATABASE LIKE COUCHBASE
This could work since we are using Couchbase as our database server, but;
This will not work when we have more than one cluster in different regions (latency issues, network failures): at some point, IDs will collide, depending on the amount of traffic;
MY PROPOSED SOLUTION (this is what I need help with)
Let's say that we have clusters consisting of 10 Couchbase nodes and 10 application nodes in 5 different regions (Africa, Europe, Asia, America and Oceania). This is to ensure that content is served from a location closest to the user (to boost speed) and to ensure redundancy in case of disasters etc.
Now, the task is to generate IDs that won't collide when the replication (and balancing) occurs, and I think this can be achieved in 3 steps:
Step 1
All regions will be assigned integer IDs (unique identifiers):
1 - Africa;
2 - America;
3 - Asia;
4 - Europe;
5 - Oceania.
Step 2
Assign an ID to every application node that is added to the cluster, keeping in mind that there may be up to 99 999 servers in one cluster (even though I doubt it; just as a safety precaution). This will look something like this (fake IPs):
00001 - 192.187.22.14
00002 - 164.254.58.22
00003 - 142.77.22.45
and so forth.
Please note that all of these are in the same cluster, which means you can have a node 00001 in each region.
Step 3
For every record inserted into the database, an incremented ID will be used to identify it, and this is how it will work:
Couchbase offers an increment feature that we can use to create IDs internally within the cluster. To ensure redundancy, 3 replicas will be created within the cluster. Since these are in the same place, I think it is safe to assume that unless the whole cluster is down, one of the nodes responsible for this will be available; otherwise the number of replicas can be increased.
Bringing it all together
Say a user is signing up from Europe:
The application node serving the request will grab the region code (4 in this case), get its own ID (say 00005) and then get an incremented ID (1) from Couchbase (from the same cluster).
We end up with 3 components: 4, 00005, 1. Now, to create an ID from this, we can just join these components into 4.00005.1. To make it even better (I'm not too sure about this), we can concatenate (not add up) the components to end up with: 4000051.
In code, this will look something like this:
$id = '4'.'00005'.'1';
NB: Not $id = 4+00005+1;.
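As a rough sketch of that generation step (the constants and the next_counter_value() helper are placeholders I made up; the helper would be backed by Couchbase's atomic increment/counter operation in whatever SDK version you use):

define('REGION_CODE', '4');       // Europe, from Step 1
define('NODE_ID', '00005');       // this application node, from Step 2

function next_counter_value()
{
    // placeholder: replace with an atomic increment on a counter document
    // in the local Couchbase cluster; it should return the new integer value
    return 1;
}

$counter = next_counter_value();            // e.g. 1
$id = REGION_CODE . NODE_ID . $counter;     // "4000051" (concatenation, not addition)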
Pros
IDs look better than UUIDs;
They seem unique enough. Even if a node in another region generates the same incremented ID and has the same node ID as the one above, we always have the region code to set them apart;
They can still be stored as integers (probably Big Unsigned integers);
It's all part of the architecture, no added complexities.
Cons
No sorting (or is there)?
This is where I need your input (most)
I know that every solution has flaws, and possibly more than what we see on the surface. Can you spot any issues with this whole approach?
Thank you in advance for your help :-)
EDIT
As @DaveRandom suggested, we can add a 4th step:
Step 4
We can just generate a random number and append it to the ID to prevent predictability. Effectively, you end up with something like this:
4000051357 instead of just 4000051.
I think this looks pretty solid. Each region maintains consistency, and if you use XDCR there are no collisions. INCR is atomic within a cluster, so you will have no issues there. You don't actually need to have the machine-code part of it. If all the app servers within a region are connected to the same cluster, there's no need to infix the 00005 part of it. If that is useful for you for other reasons (some sort of analytics), then by all means, but it isn't necessary.
So it can simply be '4' . '1' (using your example).
Can you give me an example of what kind of "sorting" you need?
First: one downside of adding entropy (and I am not sure why you would need it) is that you cannot iterate over the ID collection as easily.
For example: if your IDs run from 1-100, which you will know from a simple GET on the counter key, you could assign tasks by group: this task takes 1-10, the next 11-20, and so on, and workers can execute in parallel. If you add entropy, you will need to use a Map/Reduce view to pull the collections down, so you lose the benefit of the key-value pattern.
Second: Since you are concerned with readability, it can be valuable to add a document/object type identifier as well, and this can be used in Map/Reduce Views (or you can use a json key to identify that).
Ex: 'u:' . '4' . '1'
If you are referring to ID's externally, you might want to obscure in other ways. If you need an example, let me know and I can append my answer with something you could do.
@scalabl3
You are concerned about IDs for two reasons:
Potential for collisions in a complex network infrastructure
Appearance
Starting with the second issue, Appearance. While a UUID certainly isn't a great beauty when it comes to an identifier, there are diminishing returns as you introduce a truly unique number across a complex data center (or data centers) as you mention. I'm not convinced that there is a dramatic change in perception of an application when a long number versus a UUID is used for example in a URL to a web application. Ideally, neither would be shown, and the ID would only ever be sent via Ajax requests, etc. While a nice clean memorable URL is preferable, it's never stopped me from shopping at Amazon (where they have absolutely hideous URLs). :)
Even with your proposal, the identifiers, while shorter in character count than a UUID, are no more memorable than a UUID. So the appearance would likely remain debatable.
As for the first point: yes, there are a few cases where UUIDs have been known to generate conflicts. While that shouldn't happen with a properly configured architecture and a consistent way of obtaining them, I can see how it might happen (but I'm personally a lot less concerned about it).
So, if you're talking about alternatives, I've become a fan of the simplicity of the MongoDB ObjectId and its techniques for avoiding duplication when generating an ID. The full documentation is here. The quick relevant pieces are similar to your potential design in several ways:
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
The timestamp can often be useful for sorting. The machine identifier is similar to your application server having a unique ID. The process id is just additional entropy, and finally, to prevent conflicts, there is a counter that is auto-incremented whenever the timestamp is the same as the last time an ObjectId was generated (so that ObjectIds can be created rapidly). ObjectIds can be generated on the client or on the database. Further, ObjectIds do take up fewer bytes than a UUID (but only 4). Of course, you could skip the timestamp and drop 4 bytes.
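Just to illustrate the layout, a rough PHP sketch of an ObjectId-style generator might look like this (this is not MongoDB's own implementation; the machine hash and counter handling are simplified):

function objectid_like(&$counter)
{
    $time    = pack('N', time());                              // 4-byte timestamp
    $machine = substr(md5(gethostname(), true), 0, 3);         // 3-byte machine hash
    $pid     = pack('n', getmypid() & 0xFFFF);                 // 2-byte process id
    $count   = substr(pack('N', $counter++ & 0xFFFFFF), 1);    // 3-byte counter

    return bin2hex($time . $machine . $pid . $count);          // 24 hex characters
}

$counter = mt_rand(0, 0xFFFFFF);    // counter starts at a random value
echo objectid_like($counter), "\n";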
For clarification, I'm not suggesting you use MongoDB, but be inspired by the technique they use for ID generation.
So, I think your solution is decent (and maybe you want to be inspired by MongoDB's implementation of a unique ID) and doable. As to whether you need to do it, I think that's a question only you can answer.
Every site should have options, changeable via a Control Panel. My new project is going to have this. However, I am curious as to which is the best method for storing these options.
Here are my known methods:
I have tested these, although not very thoroughly, as in all likelihood I will not find the long-term problems or benefits.
Using a MySQL table with fields key and value, where each row is a new key/value pair. The downside to this is that MySQL can be slow, and in fact, it would require a loop on every page to fetch the options from the database and parse them into an array.

Using a MySQL table with a field for each value and a single record. The downside to this is that each new option requires a new field, and this is not the standard use of MySQL tables, but a big benefit is that it requires only a single function to bring it into a PHP indexed array.

Using a flat file containing the options array in serialized form, using the PHP functions serialize and unserialize. The main problem with this method is that I would have to first traverse to the file, read in the whole file, and unserializing can be slow, so it would get slower as more options are created. It also offers a small layer of obfuscation for the data.

Using an INI file. INI parsers are rather fast, and this option would make it easy to pass around a site configuration. However, as above, I would have to traverse to the INI file, and also, using an INI file with PHP is fairly uncommon.

Other formats, such as XML and JSON, have also been considered. However, they all require some sort of storage, and I am mostly curious about the benefits of each kind of storage.
These are my specific idealistic requirements:
So the basic thing I am looking for is speed, security, and portability. I want the configuration to not be human readable (hence, an unencrypted flat file is bad), be easily portable (ruling out MySQL), and have almost zero but constant performance impact (ruling out most options).
I am not trying to ask people to write code for me, or anything like that. I just need a second pair of eyes on this problem, possibly bringing up points that I never factored in.
Thank you for your help.
- Daniel
Using a MySQL table with fields key and value, where each row is a new key/value pair. The downside to this is that MySQL can be slow, and in fact, would require a loop before every page to fetch the options from the database and parse it into an array.
That is false. Unless you plan on storing a couple hundred million configuration pairs, you will be fine and dandy. If you worry about performance using this method, simply cache the query result (and wipe the cache only when you make changes to the table).
This will also give you most flexibility, ease of use and so on.
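A minimal sketch of that approach, assuming PDO for the database and APC for the cache (the table layout, cache key and function name are just examples):

function load_options(PDO $db)
{
    $options = apc_fetch('site_options');
    if ($options === false) {
        $options = array();
        foreach ($db->query('SELECT `key`, `value` FROM options') as $row) {
            $options[$row['key']] = $row['value'];
        }
        apc_store('site_options', $options);
    }
    return $options;
}

// after saving changes in the control panel, wipe the cache:
// apc_delete('site_options');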
I am working on a search application that uses a form with 16 filter options that are either 1 (selected) or 0 (not selected). The result as JSON is retrieved via AJAX using a GET request.
The query string then looks like this:
filter_1=0&filter_2=1 ...omitted... &filter_16=1&page=20
Each search result has at least 2 pages which can be browsed by the user.
My question is: how can I cache the search results based on the input parameters? My first idea was to md5 the request parameters and then write a cache file using the hash as the filename.
Every time a new request comes in, I search for the cache file and if it is there, then use the data from that file instead of querying the database and converting the rows to a json result.
But this doesn't seem like a good idea because of the many search options. There would be quite a lot of cache files (2^16 = 65536 ???), and because the application is only used by a few users, I doubt that all possible combinations will ever get cached. And each result contains X pages, so each of those pages would be a cache file of its own (65536 * X).
What would be a good caching strategy for an application like this? Is it actually possible to implement a cache?
Because all of your search parameters are flags that can be either 0 or 1, you might consider bitmasking.
Each of your filters would represent a value that is a power of 2:
$filter_1 = 1;
$filter_2 = 2;
$filter_3 = 4;
...
$filter_8 = 128;
...
$filter_16 = 32768;
By using PHP's bitwise operators, you can easily store all 16 filter values in a single integer. For instance, the value "129" can only be reached using a combination of filter_1 and filter_8. If the user selected filter_1 and filter_8, you could determine the bitmask by doing:
$bitmask = $filter_1 | $filter_8; // gives 129
With a unique bitmask representing the state of all your filters, you can simply use that as your cache key as well, with no expensive md5 operations needed. So in this case, you would save a file named "129" into your cache.
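A small sketch of building that bitmask from the request, assuming the filters arrive as filter_1 ... filter_16 in $_GET (the function name and cache file format are just examples):

function filter_bitmask(array $params)
{
    $mask = 0;
    for ($i = 1; $i <= 16; $i++) {
        if (!empty($params["filter_$i"])) {
            $mask |= 1 << ($i - 1);    // filter_1 => bit 0, filter_2 => bit 1, ...
        }
    }
    return $mask;
}

// filter_1 and filter_8 selected: 1 | 128 = 129
$mask = filter_bitmask($_GET);
$cacheFile = $mask . '_page' . (int) $_GET['page'] . '.json';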
This technique gives you an easy tool to invalidate your cache with as well, as you can check new and updated records to determine which filters they match, and delete any file that has that "bit" set in its name, i.e. if ((((int)$filename) & $filter) === $filter) unlink($filename); (the extra parentheses matter, since == binds tighter than & in PHP). If your tables have frequent writes, this could cause some performance issues when scanning your cache, but it's a decent technique for a read-heavy application.
This is an approach I love to use when dealing with bits or flags. You should consider carefully if you really need caching like this however. If you only have a few users of the system, are you really going to be having performance problems based on a few search queries? As well, MySQL has built-in query caching which performs very well on a high-read application. If your result page generation routines are expensive, then caching the output fragments can definitely be beneficial, but if you're only talking about microseconds of performance here for a handful of users, it might not be worth it.
Why do you need the cache?
If the app is only used by a few users then caching may not actually be required.
Given the requirements you describe (a small number of users), it seems to me that caching all combinations is reasonable, if caching makes sense at all. How much time does a typical query take? Since you say that the application will be used by only a few people, is it even worth caching? My very rough estimate is that if the query does not take several seconds in this case, don't worry about caching. If it takes less than a second, then unless you really want to make the application super responsive, no caching should be needed.
Otherwise, I would say (again, given the small number of users) that caching all combinations is OK. Even if a very large number of them were used, there are still at most 65536 of them, and many modern operating systems can easily handle thousands of files in a directory (in case you plan to cache into files). But in any case, it would be reasonable to limit the number of items in the cache and purge old entries regularly. Also, I would not use an MD5; I would just concatenate the zeros and ones from your filters for the cache key (e.g. 0101100010010100).
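For instance, such a concatenated key could be built like this (a sketch, assuming the same filter_1 ... filter_16 request parameters as above):

$key = '';
for ($i = 1; $i <= 16; $i++) {
    $key .= empty($_GET["filter_$i"]) ? '0' : '1';    // e.g. "0101100010010100"
}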
First verify you actually need a cache (like Toby suggested).
After that, think about how fresh the information needs to be - you'll need to flush out old values. You may want to use a preexisting solution for this, such as memcached.
$memcache = new Memcache();
$memcache->connect('localhost', 11211);

$key = calc_key(); // build a key from the filter parameters (see above)
$result = $memcache->get($key);
if (!$result) {
    $result = get_data_from_db();
    /* cache result for 3600 seconds == 1 hour */
    $memcache->set($key, $result, 0, 3600);
}
/* use $result */
We use memcache basically as an afterthought to just cache query results.
Invalidation is a nightmare due to the way it was implemented. We have since learned some techniques with memcache through reading the mailing list, for example the trick that allows group invalidation of a bunch of keys. For those who know it, skip the next paragraph.
For those who don't know and are interested, the trick is adding a sequence number to your keys and storing that sequence number in memcache. Then every time before you do your "get" you grab the current sequence number and build your keys around that. Then, to invalidate the whole group you just increment that sequence number.
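A minimal sketch of that sequence-number (namespace) trick with the Memcached extension (key names are just examples; the sequence value has to be stored as a plain number so increment() works on it):

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function group_key(Memcached $mc, $group, $key)
{
    // grab the current sequence number for the group, creating it if missing
    $seq = $mc->get("seq_$group");
    if ($seq === false) {
        $seq = 1;
        $mc->set("seq_$group", $seq);
    }
    return "{$group}_{$seq}_{$key}";    // e.g. "articles_1_17"
}

// normal reads/writes go through the sequenced key
$cachedData = 'cached data';
$mc->set(group_key($mc, 'articles', 17), $cachedData);
$value = $mc->get(group_key($mc, 'articles', 17));

// invalidate the whole group by bumping the sequence number
$mc->increment('seq_articles');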
So anyway, I'm currently revising our model to implement this.
My question is..
We didn't know about this pattern, and I'm sure there are others we don't know about. I've searched and haven't been able to find any design patterns on the web for implementing memcache, best practices, etc.
Can someone point me to something like this or even just write up an example? I would like to make sure we don't make a beginner's mistake in our new refactoring.
One point to remember with object caching is that it's just that - a cache of objects/complex structures. A lot of people make the mistake of hitting their caches for straightforward, efficient queries, which incurs the overhead of a cache check/miss, when the database would have obtained the result far faster.
This piece of advice is one I've taken to heart since it was taught to me; know when not to cache, that is, when the overhead cancels out the perceived benefits. I know it doesn't answer the specific question here, but I thought it was worth pointing out as a general hint.
What rob is saying is good advice. From my experience, there are two common ways to identify and invalidate cache records: unique identification and tag-based identification. Those are usually combined to form a complete solution in which:
A cache record is assigned a unique identifier (which usually depends somehow on the data that it caches) and optionally any number of tags.
Cache records are recalled by their unique identifier.
Cache records can be invalidated by their unique identifier (one at a time), or by any tag they are tagged with (possibly invalidating multiple records at the same time).
This is relatively simple to implement and generally works very well. I have yet to come across a system that needed more, though there are probably some edge cases out there that require specific solutions.
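A minimal sketch of that id-plus-tags model on top of Memcached (the class and the tag bookkeeping are my own illustration; real implementations such as Zend_Cache do essentially this with more care around concurrency and expiry):

class TaggedCache
{
    private $mc;

    public function __construct(Memcached $mc)
    {
        $this->mc = $mc;
    }

    public function save($id, $value, array $tags = array())
    {
        $this->mc->set($id, $value);
        foreach ($tags as $tag) {
            // each tag keeps a list of the record ids it is attached to
            $ids = $this->mc->get("tag:$tag") ?: array();
            $ids[$id] = true;
            $this->mc->set("tag:$tag", $ids);
        }
    }

    public function load($id)
    {
        return $this->mc->get($id);
    }

    public function invalidate($id)
    {
        $this->mc->delete($id);
    }

    public function invalidateTag($tag)
    {
        foreach (array_keys($this->mc->get("tag:$tag") ?: array()) as $id) {
            $this->mc->delete($id);
        }
        $this->mc->delete("tag:$tag");
    }
}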
I use the Zend Cache component (you don't have to use the entire framework, just the Zend Cache stuff if you want). It abstracts some of the caching details and supports grouping cache entries by 'tags', though that feature is not supported for the memcache backend, so I've rolled my own support for 'tags' with relative ease. The pattern I use for functions that access the cache (generally in my models) is:
public function getBySlug($slug, $ignoreCache = false)
{
    if ($ignoreCache || !$result = $this->cache->load('someKeyBasedOnQuery'))
    {
        $select = $this->select()
                       ->where('slug = ?', $slug);
        $result = $this->fetchRow($select);
        try
        {
            $this->cache->save($result, 'someKeyBasedOnQuery');
        }
        catch (Zend_Exception $error)
        {
            // log exception
        }
    }
    else
    {
        $this->registry->logger->info('someKeyBasedOnQuery came from cache');
    }
    return $result;
}
Basing the cache key on a hash of the query means that if another developer bypasses my models or uses another function elsewhere that does the same thing, it's still pulled from cache. Generally I tag the cache entry with a couple of generated tags (the name of the table is one, and the name of the function is the other). So by default our code invalidates, on insert, delete and update, the cached items carrying the tag of the table. All in all, caching is pretty automatic in our base code, and developers can be confident that caching 'just works' in the projects we do. (A great side effect of using tagging is that we have a page that offers granular cache clearing/management, with options to clear the cache by model function or by table.)
We also store the query results from our database (PostgreSQL) in memcache, and we are using triggers on the tables to invalidate the cache; there are several APIs out there (e.g. pgmemcache; I think MySQL has something like that too, but I don't know for sure). The benefit is that the database itself (via triggers) can handle the invalidation of data on changes (update, insert, delete); you don't need to write all that logic into your application.
mysqlnd_qc, which plugs memcache-backed caching in at the point where query results are returned from MySQL, auto-caches result sets. It is FANTASTIC and automatic.
I'm using memcache to design a cache for the model layer of a web application, one of my biggest problems is data consistency.
It came to my mind to cache data like this:
(key = query, value = list of object ids resulting from the query)
and, for each id in the list:
    (key = object id, value = object)
So, every time a query is done:
If the query already exists, I retrieve the objects referenced in the list from the cache.
If it doesn't, all the objects in the list are stored in the cache, replacing any old values.
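In code, the two-level scheme might look roughly like this (a sketch with the Memcached extension; the key format and helper are my own, and it assumes the per-object entries have not been evicted in the meantime):

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function cached_query(Memcached $mc, $queryKey, callable $runQuery)
{
    $ids = $mc->get($queryKey);
    if ($ids !== false) {
        // query already cached: pull each referenced object by its own key
        $result = array();
        foreach ($ids as $id) {
            $result[$id] = $mc->get("obj:$id");
        }
        return $result;
    }

    // not cached: run the query, then store the id list and every object
    $objects = $runQuery();                      // id => object
    $mc->set($queryKey, array_keys($objects));
    foreach ($objects as $id => $object) {
        $mc->set("obj:$id", $object);            // replaces any old value
    }
    return $objects;
}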
Has anyone used this approach? Is it good? Any other ideas?
Caching is one of those topics where there is no one right answer - it depends on your domain.
The caching policy that you describe may be sufficient for your domain. However, you don't appear to be worried about stale data. Often I would expect to see a timestamp against some of the entities - if the cached value is older than some system defined parameter, then it would be considered stale and re-fetched.
For more discussion on caching algorithms, see Wikipedia (for starters)
Welcome to the world of concurrency programming. You'll want to learn a bit about mutual exclusion. If you tell us what language/platform you are developing for we can describe more specifically your options.