As it is best practice to group related keys that are frequently retrieved together (using multiGet) on a single server for optimum performance, I have a couple of questions about the mechanics of the client functions built for doing this.
I have seen two different approaches for serving what I assume is the same purpose using libmemcached (php-memcached specifically). The first and most obvious approach is to use getByKey/setByKey to map keys to servers. The second is to use the option OPT_PREFIX_KEY (there is a simple example in the PHP documentation under Memcached::__construct), which according to the documentation is "used to create a 'domain' for your item keys". The caveat of the second approach is that it can only be set on a per-instance basis, which may or may not be a good thing.
So unless I am completely mistaken and these two approaches don't actually serve the same purpose: is there any clear benefit to going with one approach over the other?
And while I'm on this topic, my other question is: what are the implications, if any, of mapping keys to servers in a consistently hashed scenario? I'm assuming that if a node were to fail, the freeform key would simply be remapped to a new server without any issue.
Thanks!
If these keys really are almost always retrieved together, you probably want to cache them together in a single key/value pair, for example by sorting and concatenating the keys to form one key, and storing the values serialized as a dictionary in JSON or a similar format.
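For illustration, a minimal sketch of that idea using the php-memcached extension (the key names and data below are made up):

// Group values that are always fetched together into one serialized entry.
$m = new Memcached();
$m->addServer('127.0.0.1', 11211);

$group = array('user:42:name' => 'Alice', 'user:42:email' => 'alice@example.com');

// Build one group key from the sorted member keys.
$members = array_keys($group);
sort($members);
$groupKey = 'grp_' . md5(implode('|', $members));

$m->set($groupKey, json_encode($group), 300);

// Later: one get() instead of a multiGet spread across servers.
$data = json_decode($m->get($groupKey), true);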
Returning to your question:
OPT_PREFIX_KEY has almost nothing to do with grouping values by key; it just prefixes every key used by that particular client, so "1" becomes "foo1" and is distributed by consistent hashing using this new value, without any grouping by "foo".
getByKey/setByKey is the closest thing to what you want, since it can pass different keys to libketama (used to choose the server) and to the memcached server itself. If you specify the same first key and different second keys, the items will end up on the same memcached server but won't overwrite each other.
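A minimal sketch of the *ByKey approach (server addresses and key names are made up):

$m = new Memcached();
$m->addServers(array(array('10.0.0.1', 11211), array('10.0.0.2', 11211)));

// The first argument only picks the server; the second is the actual item key.
$m->setByKey('user42', 'user42:name', 'Alice');
$m->setByKey('user42', 'user42:email', 'alice@example.com');

// Both items live on the same node, so this multi-get hits a single server.
$values = $m->getMultiByKey('user42', array('user42:name', 'user42:email'));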
Premature optimization is the root of all evil
I wonder why many web sites choose to use random IDs instead of incrementing from 1 on their database tables. I've searched without finding any good reasons; are there any?
Also, which is the best method to use? It seems quite inefficient to check whether an ID already exists before inserting the data (it takes a second query).
Thanks for your help!
Under the hood, it is likely that they are using incremental ids in the database to identify rows, but the value that gets exposed to end users via the URL parameters is often made into a random string to make the sequence of available objects harder to guess.
It is really a matter of security through obscurity. It hinders automated scripts from proceeding through incremental values and attempting attacks via the URL, and it hinders automated scraping of site content.
If YouTube, for example, used incremental IDs instead of values like v=HSsdaX4s, you could download every video by simply starting at v=1 and incrementing that value millions of times.
Sequential IDs do not scale well (they become a synchronization bottleneck in distributed systems).
Also, you don't need to check if a newly generated random id already exists, you can just assume that it does not (because there are so many of them).
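As an illustration, a minimal sketch of generating such an ID in PHP (the helper name is made up; random_bytes() requires PHP 7+):

// Generate a short, URL-safe random ID.
// 8 random bytes give roughly 1.8e19 possibilities, so collisions are extremely
// unlikely; a UNIQUE index on the column still catches the rare unlucky case.
function random_id($bytes = 8) {
    return rtrim(strtr(base64_encode(random_bytes($bytes)), '+/', '-_'), '=');
}

echo random_id(); // e.g. "kJ3x9vQ2ZfA"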
Are you sure that the IDs are random, or are they encoded? Either way, it is for security.
Every site should have options, changeable via a Control Panel. My new project is going to have this. However, I am curious as to which is the best method for storing these options.
Here are my known methods:
I have tested these, although not very thoroughly, as in all likelihood I will not find the long-term problems or benefits.
1. Using a MySQL table with fields key and value, where each record is a new key/value pair. The downside to this is that MySQL can be slow, and in fact it would require a query and a loop before every page to fetch the options from the database and parse them into an array.

2. Using a MySQL table with a field for each value and a single record. The downside to this is that each new option requires a new field, and this is not the standard use of MySQL tables, but a big benefit is that it requires a single function call to bring it into a PHP indexed array.

3. Using a flat file containing the options array in serialized form, using the PHP functions serialize and unserialize. The main problem with this method is that I would have to first traverse to the file and read in the whole file, and serializing can be slow, so it would get slower as more options are created. It also offers a small layer of obfuscation for the data.

4. Using an INI file. INI parsers are rather fast, and this option would make it easy to pass a site configuration around. However, as above, I would have to traverse to the INI file, and using an INI file with PHP is fairly uncommon.

Other formats, such as XML and JSON, have all been considered too. However, they all require some sort of storage, and I am mostly curious about the benefits of each kind of storage.
These are my specific idealistic requirements:
So the basic things I am looking for are speed, security, and portability. I want the configuration not to be human readable (hence an unencrypted flat file is bad), to be easily portable (ruling out MySQL), and to have an almost-zero but constant performance impact (ruling out most options).
I am not trying to ask people to write code for me, or anything like that. I just need a second pair of eyes on this problem, possibly bringing up points that I never factored in.
Thank you for your help. Daniel.
Using a MySQL table with fields key and value, where each record is a new key/value pair. The downside to this is that MySQL can be slow, and in fact it would require a query and a loop before every page to fetch the options from the database and parse them into an array.
That is false. Unless you plan on storing a couple hundred million configuration pairs, you will be fine and dandy. If you are worried about performance with this method, simply cache the query (and wipe the cache only when you make changes to the table).
This will also give you the most flexibility, ease of use, and so on.
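A minimal sketch of that approach (the table layout, cache key, and the use of APC for the query cache are my own assumptions):

// Table: CREATE TABLE options (name VARCHAR(64) PRIMARY KEY, value TEXT);
function load_options(PDO $db) {
    $opts = apc_fetch('site_options');
    if ($opts === false) {
        $opts = array();
        foreach ($db->query('SELECT name, value FROM options') as $row) {
            $opts[$row['name']] = $row['value'];
        }
        apc_store('site_options', $opts);
    }
    return $opts;
}

// After saving a change in the control panel, wipe the cached copy:
// apc_delete('site_options');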
Lately I had an issue with an array that contained some hundreds of thousands of values, and the only thing I wanted to do was to check whether a value was already present.
In my case these were IPs from a webserver log.
So basically something like:
in_array(ip2long($ip), $myarray) did the job
However, the lookup time increased dramatically, and 10k lookups took around 17 seconds or so.
So in this case I didn't care whether I had duplicates or not; I just needed to check for existence, so I could store the IPs in the array index (as keys) like this:
isset($myarray[ip2long($ip)])
And boom, lookup times went down from 17 seconds (and more) to a constant time of 0.8 seconds for 10k lookups. As the value for each array entry I just used the int 1.
I think the array index is probably based on some B-tree, which should give log(n) lookup time, or perhaps the index is a hashmap.
In my case using the index worked fine, but are there any data structures where I can use a hashmap as a value index and where multiple values may also occur? (I realize that this only makes sense if I do not have too many duplicates and I cannot use range/search requests efficiently, which is the primary benefit of tree structures.)
There is a whole range of alternative data structures beyond simple arrays in the SPL library bundled with PHP, including linked lists, stacks, heaps, queues, etc.
However, I suspect you could make your logic a whole lot more efficient if you flipped your array, allowing you to do a lookup on the key (using the array_key_exists() function) rather than search for the value. The array index is a hash, rather than a btree, making for very fast direct access via the key.
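A minimal sketch of that flipped layout (the variable names are made up):

$ips = array('192.0.2.1', '192.0.2.2');   // e.g. read from the access log
$candidate = '192.0.2.1';

// Build the lookup once: the values you want to test for become the keys.
$seen = array_fill_keys(array_map('ip2long', $ips), true);

// Hash lookup on the key instead of in_array()'s linear scan over the values.
if (array_key_exists(ip2long($candidate), $seen)) {
    // already present
}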
However, if you're working with 10k entries in an array, you'd probably be better taking advantage of a database, where you can define your own indexes.
You also have the chdb (constant hash database) extension - which is perfect for this.
Arrays have a sequential order, and it's quick to access certain elements, because you don't need to traverse a tree or work through a sequential list structure.
A set is of course faster here, because you only check unique elements and not all elements (in the array).
Trees are fine for, for example, sorted structures. You could implement a tree with IPs sorted by their ranges; then you could decide faster whether an IP exists or not.
I'm not sure if PHP provides such customised tree structures. I guess you'll need to implement this yourself, but this will take about half an hour.
You'll find sample codes on the web for such tree structures.
As already answered, you can use the brand new classes provided by SPL: http://www.php.net/spl
BUT apparently they are not as fast as people think. Probably they are not implemented as we expect. It is my opinion that SplFixedArray, for example, is not a real array but a hashtable, like classic PHP arrays.
BUT you also have some alternative solutions.
First, you can store your results in a database. Queries are fast because DB indexes may be better optimized than a PHP data structure.
You can use http://www.php.net/sqlite3 and store results in a temporary database (a file or in memory).
I suggest a temporary file, because you don't have to load everything into memory, and in addition you can add each row individually (using http://www.php.net/fgets for example).
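A minimal sketch of the SQLite3 route (the file name and table layout are made up):

$db = new SQLite3('/tmp/seen_ips.sqlite');   // or ':memory:' to keep it in RAM
$db->exec('CREATE TABLE IF NOT EXISTS ips (ip INTEGER PRIMARY KEY)');

$ip = '192.0.2.1';

// INSERT OR IGNORE makes duplicates a no-op thanks to the primary key.
$stmt = $db->prepare('INSERT OR IGNORE INTO ips (ip) VALUES (:ip)');
$stmt->bindValue(':ip', ip2long($ip), SQLITE3_INTEGER);
$stmt->execute();

// Existence check uses the primary-key index.
$exists = (bool) $db->querySingle('SELECT 1 FROM ips WHERE ip = ' . ip2long($ip));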
HTH!
feel free to correct my English
APC lets you store data inside keys, but you cannot group these keys.
So if I want to have a group called "articles", and inside this group I would have keys that take the form of the article ID, I can't do this easily.
articles -> 5  -> cached data
         -> 10 -> cached data
         -> 17 -> cached data
         ...
I could prefix the key with the "group" name like:
article_5 -> cached data
article_10 -> cached data
article_17 -> cached data
...
But this makes it impossible to delete the entire group if I want to :(
A working solution would be to store multidimensional arrays (this is what I'm doing now), but I don't think it's good, because when I want to access or delete cached data, I need to get the entire group first. So if the group has one zillion articles in it, you can imagine what kind of array I will be iterating over and searching.
Do you have better ideas on how could I achieve the group thing?
edit: found another solution, not sure if it's much better because I don't know how reliable it is yet. I'm adding a special key called __paths, which is basically a multidimensional array containing the full prefixed key paths for all the other entries in the cache. When I request or delete the cache I use this array as a reference to quickly find the key (or group of keys) I need to remove, so I don't have to store arrays and iterate through all keys...
Based upon your observations, I looked at the underlying C implementation of APC's caching model (apc_cache.c) to see what I could find.
The source corroborates your observations that no grouping structure exists in the backing data store, such that any loosely-grouped collection of objects will need to be done based on some namespace constraint or a modification to the cache layer itself. I'd hoped to find some backdoor relying on key chaining by way of a linked list, but unfortunately it seems collisions are reconciled by way of a direct reallocation of the colliding slot instead of chaining.
Further confounding this problem, APC appears to use an explicit cache model for user entries, preventing them from aging off. So, the solution Emil Vikström provided that relies on the LRU model of memcached will, unfortunately, not work.
Without modifying the source code of APC itself, here's what I would do:
Define a namespace constraint that your entries conform to. As you've originally defined above, this would be something like article_ prepended to each of your entries.
Define a separate list of elements in this set. Effectively, this would be the 5, 10, and 17 scheme you'd described above, but in this case, you could use some numeric type to make this more efficient than storing a whole lot of string values.
Define an interface to updating this set of pointers and reconciling them with the backing memory cache, including (at minimum) the methods insert, delete, and clear. When clear is called, walk each of your pointers, reconstruct the key you used in the backing data store, and flush each from your cache.
What I'm advocating for here is a well-defined object that performs the operations you seek efficiently. This scales linearly with the number of entries in your sub-cache, but because you're using a numeric type for each element, you'd need over 100 million entries or so before you started to experience real memory pain at a constraint of, for example, a few hundred megabytes.
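A minimal sketch of such a wrapper (the class name and key layout are made up; it assumes the apc_* user-cache functions and ignores concurrency concerns):

// Virtual "group" on top of APC: a namespace prefix plus an index of member ids.
class ApcGroup {
    private $ns;
    public function __construct($ns) { $this->ns = $ns; }

    public function set($id, $value) {
        apc_store($this->ns . '_' . $id, $value);
        $ids = apc_fetch($this->ns . '__index');
        if ($ids === false) { $ids = array(); }
        $ids[$id] = true;
        apc_store($this->ns . '__index', $ids);
    }

    public function get($id) {
        return apc_fetch($this->ns . '_' . $id);
    }

    public function delete($id) {
        apc_delete($this->ns . '_' . $id);
        $ids = apc_fetch($this->ns . '__index');
        if ($ids !== false && isset($ids[$id])) {
            unset($ids[$id]);
            apc_store($this->ns . '__index', $ids);
        }
    }

    public function clear() {
        $ids = apc_fetch($this->ns . '__index');
        if ($ids !== false) {
            foreach (array_keys($ids) as $id) {
                apc_delete($this->ns . '_' . $id);
            }
        }
        apc_delete($this->ns . '__index');
    }
}

// Usage: $articles = new ApcGroup('article');
//        $articles->set(5, $data); ... $articles->clear();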
Tamas Imrei beat me to suggesting an alternate strategy I was already in the process of documenting, but this has some major flaws I'd like to discuss.
As defined in the backing C code, APCIterator is a linear time operation over the full data set when performing searches (using its constructor, public __construct ( string $cache [, mixed $search = null ...]] )).
This is flatly undesirable in the case where the backing elements you're searching for represent a small percentage of your total data, because it would walk every single element in your cache to find the ones you desire. Citing apc_cache.c:
/* {{{ apc_cache_user_find */
apc_cache_entry_t* apc_cache_user_find(apc_cache_t* cache, char *strkey, \
                                       int keylen, time_t t TSRMLS_DC)
{
    slot_t** slot;
    ...
    slot = &cache->slots[h % cache->num_slots];
    while (*slot) {
        ...
        slot = &(*slot)->next;
    }
}
Therefore, I would most strongly recommend using an efficient, pointer-based virtual grouping solution to your problem as I've sketched out above. Although, in the case where you're severely memory-restricted, the iterator approach may be most correct to conserve as much memory as possible at the expense of computation.
Best of luck with your application.
I have had this problem once with memcached and I solved it by using a version number in my keys, like this:
version -> 5
article_5_5 -> cached data
article_10_5 -> cached data
article_17_5 -> cached data
Just change the version number and the group will be effectively "gone"!
memcached uses a least-recently-used policy to remove old data, so the old-versioned group will be removed from the cache when the space is needed. I don't know if APC has the same feature.
According to MrGomez this does NOT work for APC. Please read his post, and keep my post in mind only for other cache systems that use a least-recently-used policy (not APC).
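For cache systems where it does apply, a minimal sketch of the version-key trick with the php-memcached extension (key names and data are made up):

$m = new Memcached();
$m->addServer('127.0.0.1', 11211);

$v = $m->get('articles_version');
if ($v === false) { $v = 5; $m->set('articles_version', $v); }

$articleData = array('title' => 'Hello');        // whatever you are caching
$m->set("article_5_$v", $articleData);           // write into the current "group"
$data = $m->get("article_5_$v");                 // read from it

// "Delete" the whole group at once by bumping the version; the old keys are
// never read again and eventually fall out via the LRU policy.
$m->set('articles_version', $v + 1);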
You may use the APCIterator class, which seems to exist especially for tasks like this:
The APCIterator class makes it easier to iterate over large APC caches. This is helpful as it allows iterating over large caches in steps ...
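For example, a minimal sketch (the "article_" prefix is made up; apc_delete() accepting an APCIterator requires APC 3.1.1+):

// Walk every user-cache entry whose key starts with "article_" ...
foreach (new APCIterator('user', '/^article_/') as $entry) {
    // $entry['key'] and $entry['value'] are available here
}

// ... or delete the whole "group" in one call.
apc_delete(new APCIterator('user', '/^article_/'));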
Unfortunately, APC can't do this. I have wished often enough that it could, so I looked for alternatives.
Zend_Cache has an interesting way of doing it, but it simply uses caches to cache the tagging information. It's a component that can in turn use backends (like APC).
If you want to go a step further, you could install Redis. It has all of that natively included, along with some other really interesting features. This would probably be the cleanest solution to go with. If you were able to use APC, you should also be able to use Redis.
In PHP an Associative Array keeps its order.
// this will keep its insertion order in PHP
$a['kiwis']    = 1;
$a['bananas']  = 2;
$a['potatoes'] = 3;
$a['peaches']  = 4;
However, in Flex it doesn't, with a perfectly valid explanation. I really can't remember how C treats this problem, but I am inclined to believe it works like PHP, as the array has its space pre-reserved in memory and we can just walk the memory. Am I right?
The real question here is why. Why does the C/PHP interpretation of this vary from Flash/Flex, and what is the main reason Adobe has made Flash work this way?
Thank you.
There isn't a single C implementation; you roll your own as needed, or choose from a pre-existing one. As such, a given C implementation may be ordered or unordered.
As to why, the reason is that the advantages are different. Ordered allows you (obviously enough) to depend on that ordering. However, it's wasteful when you don't need that ordering.
Different people will consider the advantage of ordering more or less important than the advantage of not ordering.
The greatest flexibility comes from not ordering: if you also have some sort of ordered structure (a list, linked list, or vector would all do), then you can easily create an ordered hashmap out of that (not the optimal solution, but it is easy, so you can't complain you didn't have one given to you). This makes it the obvious choice for something intended, from early in its design, to be general purpose.
On the other hand, the disadvantage of ordering is generally only in terms of performance, so it's the obvious choice for something intended to give relatively wide-ranging support with a small number of types for a new developer to learn.
The march of history sometimes makes these decisions optimal and sometimes sub-optimal, in ways that no developer can really plan for.
For PHP arrays: these beasts are unique constructs and are somewhat complicated; an overview is given in a Slashdot response from Kendall Hopkins (scroll down to his answer):
Ken: The PHP array is a chained hash table (lookup of O(c) and O(n) on key collisions) that allows for int and string keys. It uses 2 different hashing algorithms to fit the two types into the same hash key space. Also each value stored in the hash is linked to the value stored before it and the value stored after (linked list). It also has a temporary pointer which is used to hold the current item so the hash can be iterated.
In C/C++, there is, as has been said, no "associative array" in the core language. C++ has a map (ordered) in the STL, and there will be hash-based ones in the new standard library (hash_map, unordered_map); there was also a GNU hash_map (unordered) on some implementations (which was very good imho).
Furthermore, the "order" of elements in an "ordered" C/C++ map is usually not the "insertion order" (as in PHP), it's the "key sort order" or "string hash value sort order".
To answer your question: your view of the equivalence of PHP and C/C++ associative arrays does not hold. In PHP, they made a design decision in order to provide maximum comfort under a single interface (and failed or succeeded, whatever). In C/C++, there are many different implementations (with advantages and tradeoffs) available.
Regards
rbo