retrieving data from memcache - php

I am starting to learn the benefits of memcache, and would like to implement it on my project. I have understood most of it, such as how data can be retrieved by a key and so on.
Now I understand that I can put a post with all of its details in memcache under a key like POST:123, and I can do that for each post.
But how do I deal with the case where I query the posts table to get the list of all posts with their titles? Can this be done with memcache, or should it always be queried from the table?

Memcache is a key-value cache, so as you've described, it is typically used when you know exactly what data you want to retrieve (i.e., it is not used for querying and returning an unknown list of results based on some filters).
Generally, the goal is not to replace all database queries with memcache calls, especially if the optimization isn't necessary.
Assuming the data won't fit in a single value, and you did have a need for quick-access to all the data in your table, you could consider dumping it into keys based on some chunk-value, like a timestamp range.
However, it's probably best to either keep the database query or load it into memory directly (if we're talking about a single server that writes all the updates).

You could have a key called "posts" with a value of "1,2,3,4,10,12" and then update and retrieve it every time a post is created, updated, or deleted (see the sketch below).
In any case, memcached is not a database replacement; it's a key/value store (a fast and scalable one at that). So you have to decide what data to store in the DB and what to offload to memcached.
Alternatively, you could use the "cache_money" plugin, which will do read/write-through caching of your ActiveRecord models in memory (even in memcached).
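For example, here's a minimal sketch of that index-key idea using the PHP Memcached extension (the key names and TTLs are just examples, and the read-modify-write on the list is not atomic):
<?php
// Minimal sketch: one key per post, plus a "posts" key holding the list of IDs.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// Cache one post under its own key.
$mc->set('POST:123', array('id' => 123, 'title' => 'Hello'), 300);

// Maintain a separate key holding the list of post IDs.
// Note: this read-modify-write is not atomic; fine for a sketch.
$ids = $mc->get('posts');
if ($ids === false) {
    $ids = array();
}
if (!in_array(123, $ids)) {
    $ids[] = 123;
    $mc->set('posts', $ids, 300);
}

// Later: rebuild the list view from the index key plus the per-post keys.
$keys = array();
foreach ($ids as $id) {
    $keys[] = 'POST:' . $id;
}
$posts = $mc->getMulti($keys);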

Look at the cache_money gem from Twitter. This gem adds caching at the ActiveRecord level.
The find calls by id will go through the cache. You can add support for indexing by other fields in your table (i.e. title in your case)
class Post < ActiveRecord::Base
  index :title
end
Now the find calls that filter by title will go through cache.
Post.all(:conditions => {:title => "xxxx"})

Basically what you should do is check the cache first to see if it has the information you need. If there is nothing in the cache (or if the cached data is stale), then you should query the table, place the returned data in the cache, and return it to the front-end.
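Something like this, for example, assuming the PHP Memcached extension and a PDO connection (the key name and 60-second TTL are arbitrary):
<?php
// Cache-aside sketch: try the cache, fall back to the DB, then populate the cache.
function getAllPostTitles(Memcached $mc, PDO $db) {
    $key = 'POSTS:ALL_TITLES';
    $titles = $mc->get($key);
    if ($titles === false) {                      // cache miss (or evicted/expired)
        $titles = $db->query('SELECT id, title FROM posts')
                     ->fetchAll(PDO::FETCH_ASSOC);
        $mc->set($key, $titles, 60);              // keep it for 60 seconds
    }
    return $titles;
}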
Does that help?

You cannot enumerate memcached contents - it's not designed to be queried the way you're trying. You need to store the complete result set of a SQL query instead of single records.
So, yes: do a SELECT title FROM posts and store the result in memcached (e.g. as POSTS:ALL_UNORDERED). This is how it is designed. And yes, if you query single records like SELECT * FROM posts WHERE id = XX, then store that too. Just keep in mind to create unique memcached keys for each query you perform.
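One simple way to build those unique keys is to hash the SQL together with its parameters; a sketch:
<?php
// Derive a unique memcached key per query from the SQL text and its parameters.
$sql    = 'SELECT * FROM posts WHERE id = ?';
$params = array(42);
$key    = 'SQL:' . md5($sql . '|' . serialize($params));
// ...then get()/set() against $key as in the other examples.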

As far as I know you can't query memcache by value or with filters; you can only fetch values for keys you already know. But if you need to access the same list of posts often, why don't you store that list under a key like "posts" in your memcache?

Related

NewsFeed with Redis At Scale Strategy

I'm running into an interesting dilemma while trying to solve a scaling problem.
Currently we have a social platform that has a pretty typical feed. We are using a graph database, and each time a feed is requested by a user we hit the DB. While this is fine for now, it will come to a grinding halt as we grow our user base. Enter Redis.
Currently we store things like comments, likes and such in individual Redis keys, as JSON-encoded strings keyed by post ID, and update them when there are updates, additions or deletions. Then in code we loop through the DB results of posts and pull in the data from the Redis store. This causes multiple calls to Redis to construct each post, which is far better than touching the DB each time. The challenge is keeping up with changing data such as commenters'/likers' avatars, screen names, closed accounts, new likes, new comments, etc. associated with each individual post.
I am trying to decide on a strategy to handle this in the most effective way. Redis will only take us so far, since we will top out at about 12 GB of RAM per machine.
One of the concepts in discussion is to use a beacon for each user that stores new post IDs. So when a user shares something new, all of their connected friends' beacons get the post ID, so that when a user logs in their feed is seen as dirty and requiring an update; the feed is then stored by ID in a Redis sorted set scored by timestamp. To retrieve the feed data we can do a single query by IDs rather than a full traversal, which is hundreds of times faster. That still does not solve the problem of the interacting users' account information, their likes, and their comments, which are ever-changing, but it does solve part of the feed-building problem.
Another idea is to store a user's entire feed (JSON-encoded) in a MySQL record and update it on the fly when the user requests it and the beacon shows a dirty feed. Otherwise it's just a single select and a JSON decode to build the feed. Again, the dynamic components are the hurdle.
Has anyone dealt with this challenge successfully, or have working knowledge of a strategy to approach this problem?
Currently we store things like comments, likes and such in individual Redis keys in JSON encoded strings by post ID
Use a more efficient serializer, like igbinary or msgpack. I suggest igbinary (check http://kiss-web.blogspot.com/).
Then in code we loop through the DB results of posts and pull in the data from the Redis store.
Be sure to use pipelining for maximum performance.
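For example, with the phpredis extension (the post:<id> key pattern is just an assumption; Predis has an equivalent pipeline helper, and for plain GETs an MGET would also do):
<?php
// Pipelining sketch: queue many GETs and send them in one round trip.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$postIds = array(1, 2, 3, 4, 5);
$pipe = $redis->multi(Redis::PIPELINE);
foreach ($postIds as $id) {
    $pipe->get('post:' . $id);
}
$results = $pipe->exec();   // array of replies, one per queued GET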
This is causing multiple calls to Redis to construct each post, which is far better than touching the DB each time.
Do not underestimate the power of DB primary keys. Try to do the same (not a join, but a select by keys) with your DB:
SELECT * FROM comments WHERE id IN (1,2,3,4,5,6);
A single Redis call is faster than a single DB call, but doing lots of Redis calls (even pipelined) is not necessarily faster than one SQL query on primary keys, especially when you give your DB enough memory for buffers.
You can use the DB and Redis together by "caching" DB data in Redis. You do something like this (see the sketch after these steps):
Every time you update data, you update it in the DB and delete it from Redis.
When fetching data, you first try to get it from Redis. If the data is not found in Redis, you fetch it from the DB and insert it into Redis for future use, with some expire time.
That way you store only useful data in Redis. Unused (old) data will stay only in the DB.
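A minimal sketch of that pattern with phpredis and PDO (the key names, TTL, and the fetchCommentFromDb() helper are hypothetical):
<?php
// Write path: update the DB, then invalidate the cached copy.
function updateComment(Redis $redis, PDO $db, $id, $text) {
    $db->prepare('UPDATE comments SET body = ? WHERE id = ?')
       ->execute(array($text, $id));
    $redis->del('comment:' . $id);
}

// Read path: try Redis first, fall back to the DB and cache with an expire time.
function getComment(Redis $redis, PDO $db, $id) {
    $cached = $redis->get('comment:' . $id);
    if ($cached !== false) {
        return json_decode($cached, true);
    }
    $row = fetchCommentFromDb($db, $id);                       // hypothetical DB helper
    $redis->setex('comment:' . $id, 600, json_encode($row));   // expire in 10 minutes
    return $row;
}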
You may want to use MongoDB as #solocommand has described, but you may want to drop the expectation that you "update" data on demand. Instead, push your users' changes into a "write" queue, which will then update the database as needed. Then you can load from the database (MongoDB) and work with it as needed, or update the other Redis records.
Moving to a messaging system (Amazon SQS, IronMQ, or RabbitMQ) may help you scale better. You can also use Redis queues as a basic message bus.

Which caching technique should be used for Apache + PHP

Currently I am using shared hosting for my site, but we have nearly 1,100,000 rows in one of the tables, so it's taking a lot of time to load the webpage. We want to implement database caching techniques like APC or memcache for our site, but on shared hosting those facilities aren't available; we only have eAccelerator, and eAccelerator does not cache DB calls, if I'm not wrong. So considering all these points we want to move to a VPS, and in that case, which database caching technique should we use, APC or memcache, to decrease the page load time? Please advise on the VPS and the better caching technique of the two.
We have a similar website and we use APC.
APC will cache the opcode as well as the HTML that is generated. This helps avoid unneeded hits to the page.
You should also enable the MySQL query cache to cache the results of your queries.
I had a task where I needed to fetch rows from a database table that had more than 100,000 records, displayed on a scrollable page. So what I did was fetch the first 50 records and cache the next 50 in the first call, and on scroll-down events an ajax request checked whether the data was available in the cache; if not, I fetched it from the database and also cached the next 50. It worked pretty well and solved the inconvenient load time.
If you have a similar scenario you might benefit from this approach (a rough sketch is below).
PS: I used memcache.
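A rough sketch of that prefetch idea, assuming the Memcached extension and PDO (page size, table and key names are illustrative):
<?php
// Serve page N from cache if present, otherwise query it, then warm page N+1.
function fetchPage(PDO $db, $page, $perPage) {
    $limit  = (int) $perPage;
    $offset = (int) ($page * $perPage);
    return $db->query("SELECT * FROM big_table ORDER BY id LIMIT $limit OFFSET $offset")
              ->fetchAll(PDO::FETCH_ASSOC);
}

function getPage(Memcached $mc, PDO $db, $page, $perPage = 50) {
    $rows = $mc->get("rows:page:$page");
    if ($rows === false) {
        $rows = fetchPage($db, $page, $perPage);
        $mc->set("rows:page:$page", $rows, 300);
    }
    // Warm the cache for the next page so the upcoming scroll event is a hit.
    if ($mc->get('rows:page:' . ($page + 1)) === false) {
        $mc->set('rows:page:' . ($page + 1), fetchPage($db, $page + 1, $perPage), 300);
    }
    return $rows;
}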
From your comment I take it you're doing a LIKE %..% query and want to paginate the result. First of all, investigate whether FULLTEXT indices are an option for you, as they should perform better. If that's not an option, you can add a simple cache like so:
Treat each unique search term as an id, i.e. if in your URL you have ..?search=foobar, then "foobar" is the id of the result set. Keep that in all your links, e.g. ..?search=foobar&page=2.
If the result set does not yet exist (see below), create it:
Query the database with your slow query.
Get all the results into an array. Don't overdo it, you don't want to be storing hundreds of megabytes.
Create a unique filename per query, e.g. sha1($query), or maybe sha1(strtolower($query)).
serialize the data and store it in the file.
Get the data from the file, unserialize it, display the portion of the array corresponding to the requested page.
Occasionally, delete old cached results. You can do that with something like if (rand(0, 100) == 1) .., which will run the cleanup job every 100 queries on average. Strike a balance between server load and data freshness. Cache invalidation is a topic whole books can be written about, BTW.
That's a simple poor man's cache implementation. It's not great, but if you have absolutely nothing else to work with, it's better than running slow queries over and over.
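Here's a rough sketch of those steps, assuming PDO (the paths, lifetimes, and query are arbitrary):
<?php
// File-based cache: key the file on the search term, reuse it while fresh,
// and occasionally purge old files.
function cachedSearch(PDO $db, $search) {
    $cacheDir  = '/tmp/search-cache';
    $cacheFile = $cacheDir . '/' . sha1(strtolower(trim($search))) . '.dat';

    if (is_file($cacheFile) && filemtime($cacheFile) > time() - 600) {
        return unserialize(file_get_contents($cacheFile));   // still fresh: reuse it
    }

    $stmt = $db->prepare('SELECT id, title FROM posts WHERE title LIKE ?');
    $stmt->execute(array('%' . $search . '%'));
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);

    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0700, true);
    }
    file_put_contents($cacheFile, serialize($results));

    // Occasionally purge cache files older than a day.
    if (rand(0, 100) == 1) {
        foreach (glob($cacheDir . '/*.dat') as $file) {
            if (filemtime($file) < time() - 86400) {
                unlink($file);
            }
        }
    }
    return $results;
}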
APC is the Alternative PHP Cache and works only with PHP, whereas Memcached works independently with any language.

How do I use memcached to invalidate updated records?

On our site, we have a variety of data collections. We pull a lot of records by ID, and we also have some records that are paginated content.
We are caching our queries using memcached - we essentially serialize the query array and then md5() it. This provides us a key to use for this specific query. However, this query can return multiple records - and if one of those records is changed, we want to invalidate the cached query that resulted in that record being returned.
What would be the best way to accomplish this? I toyed with the idea of having two instances of memcached, with one acting as an index server, if you will.
Thanks!
One of the ways you can address this (if I read you correctly) is to store the collection of IDs under the MD5 key and then cache the rows separately. For example, if you query for a collection and it returns 10 results, you save only the IDs under the MD5 of the query key. For each ID you then query the database for the full details, which are individually stored in memcached. When an individual item is updated, it's now easy to invalidate its memcache entry, and those record updates won't affect the collection, since you only have the IDs stored. Hope this helps.
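A sketch of that two-level layout, assuming the Memcached extension and PDO (the key names are illustrative, and the query is expected to return an id column):
<?php
// The MD5 query key stores only IDs; each row is cached under its own key,
// so updating a record only requires deleting that record's key.
function getCollection(Memcached $mc, PDO $db, $sql, array $params) {
    $queryKey = 'Q:' . md5($sql . serialize($params));
    $ids = $mc->get($queryKey);
    if ($ids === false) {
        $stmt = $db->prepare($sql);
        $stmt->execute($params);
        $ids = $stmt->fetchAll(PDO::FETCH_COLUMN);   // first column = id
        $mc->set($queryKey, $ids, 300);
    }

    $rows = array();
    foreach ($ids as $id) {
        $row = $mc->get('record:' . $id);
        if ($row === false) {
            $stmt = $db->prepare('SELECT * FROM records WHERE id = ?');
            $stmt->execute(array($id));
            $row = $stmt->fetch(PDO::FETCH_ASSOC);
            $mc->set('record:' . $id, $row, 3600);
        }
        $rows[] = $row;
    }
    return $rows;
}
// When a single record changes: $mc->delete('record:' . $id);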

Mysql count rows using filters on high traffic database

Let's say you have a search form with multiple select fields, and a user selects an option from a dropdown; before he submits the form I need to display the count of matching rows in the database.
Let's say the site has at least 300k (300,000) visitors a day, and a user selects options from the form at least 40 times per visit; that would mean 12M ajax requests plus 12M count queries on the database, which seems a bit too much.
The question is how one can implement a fast count (using PHP (Zend Framework) and MySQL) so that the additional 12M queries on the database won't affect the load of the site.
One solution would be to have a table that stores all combinations of select fields and their respective counts (when a product is added to or deleted from the products table, the count table would be updated). Although this is not such a good idea when, for 8 filters (select options) out of 43, there would be 8M+ rows to insert and manage.
Any other thoughts on how to achieve this?
p.s. I don't need code examples but the idea itself that would work in this scenario.
I would probably have a pre-calculated table, as you suggest yourself. What's important is that you have a smart mechanism for 2 things:
Easily query which entries are affected by which change.
Have a unique lookup field for an entire form request.
The 8M entries wouldn't be very significant if you have solid keys, as you would only require a direct lookup.
I would go through the trouble of writing specific updates for this table in all the places where it is necessary. Even with the high number of changes, this is still efficient. If done correctly, you will know which rows you need to update or invalidate when inserting/updating/deleting a product.
Sidenote:
Based on your comment: if you need to add code in eight places to cover all the spots where deletions can happen, it might be a good time to refactor and centralize some code.
There are a few scenarios:
MySQL has the query cache; you don't have to bother with caching if the table is not updated that frequently.
99% of users won't care exactly how many results matched; they just need the top few records.
Use EXPLAIN - it returns an estimate of how many rows the query will match. It's not 100% precise, but it should be good enough to act as a rough row count.
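For example, with PDO (the table and filter are placeholders; the number comes from the optimizer, so treat it as approximate):
<?php
// Use EXPLAIN's "rows" estimate as a cheap, rough count. Assumes a PDO connection in $db.
$plan = $db->query('EXPLAIN SELECT * FROM products WHERE category_id = 7')
           ->fetch(PDO::FETCH_ASSOC);
$approxCount = isset($plan['rows']) ? (int) $plan['rows'] : 0;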
Not really what you asked for, but since you have a lot of options and want to count the items available based on the options you should take a look at Lucene and its faceted search. It was made to solve problems like this.
If you do not need up-to-date information from the search, you can use a queue system to push updates and inserts to Lucene every now and then (so you don't have to bother Lucene with a couple of thousand updates and inserts every day).
You really only have three options, and no amount of searching is likely to reveal a fourth:
Count the results manually. O(n) with the total number of the results at query-time.
Store and maintain counts for every combination of filters. O(1) to retrieve the count, but requires O(2^n) storage and O(2^n) time to update all the counts when records change.
Cache counts, only calculating them (per #1) when they're not found in the cache. O(1) when data is in the cache, O(n) otherwise.
It's for this reason that systems that have to scale beyond the trivial - that is, most of them - either cap the number of results they'll count (eg, items in your GMail inbox or unread in Google Reader), estimate the count based on statistics (eg, Google search result counts), or both.
I suppose it's possible you might actually require an exact count for your users, with no limitation, but it's hard to envisage a scenario where that might actually be necessary.
I would suggest a separate table that caches the counts, combined with triggers.
In order for it to be fast you make it a memory table and you update it using triggers on the inserts, deletes and updates.
pseudo code:
CREATE TABLE counts (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `option` INT,
  user_id INT,
  rowcount INT UNSIGNED,
  UNIQUE KEY user_option (user_id, `option`) USING HASH
) ENGINE = MEMORY;

DELIMITER $$
-- similar triggers are needed for INSERT and DELETE
CREATE TRIGGER au_tablex_each AFTER UPDATE ON tablex FOR EACH ROW
BEGIN
  IF (old.`option` <> new.`option`) OR (old.user_id <> new.user_id) THEN
    -- decrement the count for the old combination
    UPDATE counts c SET c.rowcount = c.rowcount - 1
    WHERE c.user_id = old.user_id AND c.`option` = old.`option`;
    -- increment (or create) the count for the new combination
    INSERT INTO counts (rowcount, user_id, `option`)
    VALUES (1, new.user_id, new.`option`)
    ON DUPLICATE KEY UPDATE rowcount = rowcount + 1;
  END IF;
END $$
DELIMITER ;
Selection of the counts will be instant, and the updates in the trigger should not take very long either because you're using a memory table with hash indexes which have O(1) lookup time.
Links:
Memory engine: http://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html
Triggers: http://dev.mysql.com/doc/refman/5.5/en/triggers.html
A few things you can easily optimise:
Cache all you can allow yourself to cache. The options for your dropdowns, for example, do they need to be fetched by ajax calls? This page answered many of my questions when I implemented memcache, and of course memcached.org has great documentation available too.
Serve anything that can be served statically. I.e., options that don't change frequently could be dumped to a flat file as an array by an hourly cron job, for example, and included by the script at runtime.
MySQL with default configuration settings is often sub-optimal for any serious application load and should be tweaked to fit the needs of the task at hand. Maybe look into the MEMORY engine for high-performance read access.
You can have a look at these 3 great-but-very-technical posts on materialized views; as a matter of fact, that whole blog is truly a goldmine of performance tips for MySQL.
Good luck.
Presumably you're using ajax to make the call to the back end that you're talking about. Use some kind of cached flat file as an intermediary for the data. Set an expiry time of 5 seconds or whatever is appropriate. Name the data file after the query key=value string. In the ajax request, if the data file is older than your cooldown time, refresh it; if not, use the value stored in the data file.
Also, you might be underestimating the strength of the MySQL query cache mechanism. If you're using the MySQL query cache, I doubt there would be any significant performance dip over doing it the way I just described. If the query were being cached by MySQL, then virtually the only slowdown would come from the network layer between your application and MySQL.
Consider what role replication can play in your architecture. If you need to scale out, you might consider replicating your tables from InnoDB to MyISAM. The MyISAM engine automatically maintains a table row count, which makes unfiltered count(*) queries instant. If you are doing count(col) ... WHERE queries, then you need to rely heavily on well-designed indexes. In that case your count queries might take shape like so:
alter table A add index ixA (a, b);
select count(a) from A use index (ixA) where a=1 and b=2;
I feel crazy for suggesting this as it seems that no-one else has, but have you considered client-side caching? JavaScript isn't terrible at dealing with large lists, especially if they're relatively simple lists.
I know that your ideal is that you have a desire to make the numbers completely accurate, but heuristics are your friend here, especially since synchronization will never be 100% -- a slow connection or high latency due to server-side traffic will make the AJAX request out of date, especially if that data is not a constant. IF THE DATA CAN BE EDITED BY OTHER USERS, SYNCHRONICITY IS IMPOSSIBLE USING AJAX. IF IT CANNOT BE EDITED BY ANYONE ELSE, THEN CLIENT-SIDE CACHING WILL WORK AND IS LIKELY YOUR BEST OPTION. Oh, and if you're using some sort of port connection, then whatever is pushing to the server can simply update all of the other clients until a sync can be accomplished.
If you're willing to do that form of caching, you can also cache the results on the server too and simply refresh the query periodically.
As others have suggested, you really need some sort of caching mechanism on the server side. Whether it's a MySQL table or memcache, either would work. But to reduce the number of calls to the server, retrieve the full list of cached counts in one request and cache that locally in javascript. That's a pretty simple way to eliminate almost 12M server hits.
You could probably even store the count information in a cookie which expires in an hour, so subsequent page loads don't need to query again. That's if you don't need real time numbers.
Many of the latest browsers also support local storage, which doesn't get passed to the server with every request like cookies do.
You can fit a lot of data into a 1-2K json data structure. So even if you have thousands of possible count options, that is still smaller than your typical image. Just keep in mind maximum cookie sizes if you use cookie caching.

Transfer table to Memcache

I have a large table and I'd like to store the results in Memcache. I have Memcache set up on my server, but there doesn't seem to be any useful documentation (that I can find) on how to efficiently transfer large amounts of data. The only way I can currently think of is to write a mysql query that grabs the key and value of the table and then saves that in Memcache. It's not a particularly scalable solution (especially when my query generates a few hundred thousand rows). Any advice on how to do this?
EDIT: there is some confusion about what I'm attempting to do. Let's say I have a table with two fields (key and value). I am pulling in information on the fly and have to match it to the key and return the value. I'd like to avoid having to execute ~1000 queries per page load. Memcache seems like a perfect alternative because it's set up to use key/value. Let's say this table has 100K rows. The only way I know to get that data from the db table into memcache is to run a query that loops through every row in the table and creates an individual memcache entry.
Questions: Is this a good way to use memcache? If yes, is there a better way to transfer my table?
You can actually pull all the rows into an array and store the array in memcache:
memcache_set($memcache_obj, 'var_key', $your_array);
But you have to remember a few things:
PHP will serialize/unserialize the array going into and out of memcache, so if you have many rows it might be slower than actually querying the DB.
You cannot do any filtering (no SQL); if you want to filter some items you have to implement the filter yourself, and it would probably perform worse than the DB engine.
memcache won't store more than 1 megabyte per item by default...
I don't know what you're trying to achieve, but the general uses of memcache are:
storing the result of SQL or other time-consuming processing, where the number of resulting rows is small
storing some pre-created (X)HTML blobs to avoid DB access
user session storage
Russ,
It sounds almost as if using a MySQL table with the storage engine set to MEMORY might be your way to go.
A RAM based table gives you the flexibility of using SQL, and also prevents disk thrashing due to a large amount of reads/writes (like memcached).
However, a RAM based table is very volatile. If anything is stored in the table and not flushed to a disk based table, and you lose power... well, you just lost your data. That being said, ensure you flush to a real disk-based table every once in a while.
Also, another plus of using memory tables is that you can store most of the typical MySQL data types (though not BLOB or TEXT columns), and there is no 1MB size limit as with memcached.
