Best way to do caching on a reddit-like website

Best way to do caching on a reddit-like website - php

We have a PHP website like reddit, users can vote for the stories.
We tried to use APC, memcached etc. for the website but we gave up. The problem is we want to use a caching mechanism, but users can vote anytime on site and the cached data may be old and confusing for the other visitors.
Let me explain with an example, We have an array of 100 stories and stored in cache for 5 mins., a user voted for some stories so the ratings of the stories are changed. When the other user enter the website, he/she will see the cached data, therefore the old data. (This is the same if the voter user refreshes the page, he'll also see the old vote number for the stories.)
We cannot figure it out, any help will be highly appreciated

This is a matter of finding a balance between low-latency updates, and overall system/network load (aka, performance vs. cost).
If you have capacity to spare, the simplest solution is to keep your votes in a database, and always look them up during a page load. Of course, there's no caching here.
Another low-latency (but high-cost) solution is to have a pub-sub type system that publishes votes to all other caches on the fly. In addition to the high cost, there are various synchronization issues you'll need to deal with here.
The next alternative is to have a shared cache (e.g., memcached, but shared across different machines). Updates to the database will always update the cache. This reduces the load on the database and would get you lower latency responses (since cache lookups are usually cheaper than queries to a relational database). But if you do this, you'll need to size the cache carefully, and have enough redundancy such that the shared cache isn't a single point of failure.
Another, more commonly used, alternative is to have some kind of background vote aggregation, where votes are only stored as transactions on each of the front-end servers, and you have a background process that continuously (e.g., every five seconds) aggregates the votes and populates all the caches.
AFAIK, reddit does not do live low-latency vote propagation. If you vote something up, it isn't immediately reflected across other clients. My guess is that they're doing some kind of aggregation (as in #4), but that's just me speculating.

Perhaps this is a solution you've already considered, but why not just cache everything but the ratings? Instead, just update a single array, where the ith position contains the rating for the ith top story. Keep this in memory all the time, and flush ratings back to the database as it's available.
If you only care about the top N stories being the most up-to-date, then i only needs to be the size of the number of stories on the front page, which is presumably a very small number like 50 or so.

Related

Should I cache everything

I'm very close to finishing my application. I'm currently using file cache on laravel to cache basically all of the data that is saved in my tables every time a record is saved, updated or deleted. These tables over a period of time perhaps in 3 years or less will have over 2 million records. I was wondering if there are any pitfalls that I need to be aware of in file caching all of my records. Does anyone foresee a problem with file-caching hundreds of thousands of records over a period of time?
Currently, the way how my cache system works is when I save a new record, delete or make an update to a record it resets the cache for the table/record in question. Each of my tables/and most records has it's own cache. Does anyone foresee any problems with this design with a very large database?
Regards,
Darren

You have millions of data, which is a bit concerning if the application has high hits per second. In that case, more hits mean increased disk IO. Any bottleneck in the disk IO should impact the overall performance of the application.
From my personal experience, the best approach is to decide which data should be cached and which should not, what layers and architecture of caching should be used, etc. For example, you may not want to cache any data that is very dynamic. For instance, user carts, balance, view count, etc. As these data are always changing, caching them usually results in (close to) doubling the resource intake rather than increasing the actual throughput in practical scenarios. If a set of data is rarely fetched, it may not need caching at all.
On the other hand, if some part of your data has a very high hit ratio, for instance, home page elements, top posts, best-selling products, etc. then those should be cached using a fast and high availability caching mechanism such as object stores. In this way, high demand can be met in case of thousands of concurrent hits for those data without impacting the disk IO or database performance.
Simply put, different segments of the application data may or may not need different approaches for caching.

Caching debate/forum entries in PHP

Just looking for a piece of advice. On one of our webpages we have a debate/forum site. Everytime a user request the debate page, he/she will get a list of all topics (and their count of answers etc.).
Too when the user request a specific topic/thread, all answers to the thread will be shown to the user a long with username, user picture, age, number of totalt forum-posts from the poster of the answer.
All content is currently retrieved by using an MySQL-query everytime the page is accessed. But this is however starting to get painfully slow (especially with large threads, +3000 answers).
I would like to cache the debate entries somehow, to speed up this proces. However the problem is, that if I cache the entries it self, number of post etc. (which is dynamic, of course), will not always be up to date.
Is there any smart way of caching the pages/recaching them when stuff like this is updated? :)
Thanks in advance,
fischer

You should create a tag or a name for the cache based on it's data.
For example for the post named Jake's Post you could create an md5 of the name, this would give you the tag 49fec15add24931728652baacc08b8ee.
Now cache the contents and everything to do with this post against the tag 49fec15add24931728652baacc08b8ee. When the post is updated or a comment is added go to the cache and delete everything associated with 49fec15add24931728652baacc08b8ee.
Now there is no cache and it will be rebuilt when the next visitors arrives to new the post.
You could break this down further by having multiple tags per post. E.g you could have a tag for comments and answers, when a comment is added delete the comments tag, but not the answers tag. This reduces the work the server has to do when rebuilding the cache as only the comments are now missing.
There are number of libraries and frameworks that can aid you in doing this.
Jake
EDIT
I'd use files to store the data, more specifically the HTML output of the page. You can then do something like:
if(file_exists($tag))
{
// Load the contents of the cache file here and output it
}
else
{
// Do complex database look up and cache the file for later
}
Remember that frameworks like Zend have this sort of stuff built in. I would seriously considering using a framework.

Interesting topic!
The first thing I'd look at is optimizing your database - even if you have to spend money upgrading the hardware, it will be significantly easier and cheaper than introducing a cache - fewer moving parts, fewer things that can go wrong...
If you can't squeeze more performance out of your database, the next thing I'd consider is de-normalizing the data a little. For instance, maintain a "reply_count" column, rather than counting the replies against each topic. This is ugly, but introduces fewer opportunities for things to go wrong - with a bit of luck, you can localize all the logic in your data access layer.
The next option I'd consider is to cache pages. For instance, just caching the "debate page" for 30 seconds should dramatically reduce the load on your database if you've got reasonable levels of traffic, and even if it all goes wrong, because you're caching the entire page, it will sort itself out the next time the page goes stale. In most situations, caching an entire page is okay - it's not the end of the world if a new post has appeared in the last 30 seconds and you don't see it on your page.
If you really have to provide more "up to date" content on the page, you might introduce caching at the database access level. I have, in the past, built a database access layer which cached the results of SQL queries based on hard-wired logic about how long to cache the results. In our case, we built a function to call the database which allowed you to specify the query (e.g. get posts for user), an array of parameters (e.g. username, date-from), and the cache duration. The database access function would cache results for the cache duration based on the query and the parameters; if the cache duration had expired, it would refresh the cache.
This scheme was fairly bug-proof - as an end user, you'd rarely notice weirdness due to caching, and because we kept the cache period fairly short, it all sorted itself out very quickly.
Building up your page by caching snippets of content is possible, but very quickly becomes horribly complex. It's very easy to create a page that makes no sense to the end user due to the different caching policies - "unread posts" doesn't add up to the number of posts in the breakdown because of different caching policies between "summary" and "detail".

"Caching" identical MySQL results for all users

I'm hoping to develop a LAMP application that will centre around a small table, probably less than 100 rows, maybe 5 fields per row. This table will need to have the data stored within accessed rapidly, maybe up to once a second per user (though this is the 'ideal', in practice, this could probably drop slightly). There will be a number of updates made to this table, but SELECTs will far outstrip UPDATES.
Available hardware isn't massively powerful (it'll be launched on a VPS with perhaps 512mb RAM) and it needs to be scalable - there may only be 10 concurrent users at launch, but this could raise to the thousands (and, as we all hope with these things, maybe 10,000s, but this level there will be more powerful hardware available).
As such I was wondering if anyone could point me in the right direction for a starting point - all the data retrieved will be the same for all users, so I'm trying to investigate if there is anyway of sharing this data across all users, rather than performing 10,000 identical selects a second. Soooo:
1) Would the mysql_query_cache cache these results and allow access to the data, WITHOUT requiring a re-select for each user?
2) (Apologies for how broad this question is, I'd appreciate even the briefest of reponses greatly!) I've been looking into the APC cache as we already use this for an opcode cache - is there a method of caching the data in the APC cache, and just doing one MYSQL select per second to update this cache - and then just accessing the APC for each user? Or perhaps an alternative cache?
Failing all of this, I may look into having a seperate script which handles the queries and outputs the data, and somehow just piping this one script's data to all users. This isn't a fully formed thought and I'm not sure of the implementation, but perhaps a combo of AJAX to pull the outputted data from... "Somewhere"... :)
Once again, apologies for the breadth of these question - a couple of brief pointers from anyone would be very, very greatly appreciated.
Thanks again in advance

If you're doing something like an AJAX chat which polls the server constantly, you may want to look at node.js instead, which keeps an open connection between server and browser. This way, you can have changes pushed to the user when they happen and you won't need to do all that redundant checking once per second. This can scale very well to thousands of users and is written in javascript on the server-side, so not too difficult.

The problem with using the MySQL cache is that the entire table cache gets invalidated on any write to that table. You're better off using a caching solution like memcached or APC if you're trying to control that behavior more precisely. And yes, APC would be able to cache that information.
One other thing to keep in mind is that you need to know when to invalidate the cache as well, so you don't have stale data.

You can use apc,xcache or memcache for database query caching or you can use vanish or squid for gateway caching...

How to best cache calculated metrics on database-stored information

I've got a very simple algorithm that I'm playing with to determine what user-submitted content should display on the front page of a website I'm building. Right now the algorithm is triggered each time the page is loaded. The algorithm goes through all of the posts, calculates the newest weighted score, and displays the posts accordingly.
With no one on the site, this works fine, but seems unlikely to scale well. Is it typical in industry to optimize these cases? If I need to cache the calculations, how should that best be done? I'm very much self-taught so although I can get things to work, I don't always know if its ideal.
Part of the algorithm is time, which is important here. Aside from time, there are some other variables at play that I weight differently. I add all these "scores" together to form one "weighted score", the higher the score, the higher the post.

You could cache the results in the database, say in a field "Score", then upon a user accessing the page, run a SQL select to find any articles with a null score.
SQL: SELECT * FROM Articles WHERE Score IS NULL
Calculate these scores and store them to their associated articles, then utilize them through an ordered select statement to find which articles to display, possibly limiting how many articles to fetch and even performing pagination entirely through the cache.
Note: The scores should be absolute, based entirely on the article in question, not relative to the contents of other articles in the database.
SQL: SELECT * FROM Articles ORDER BY Score
Further improvements to efficiency could be done by limiting the cache generation to only events which actually change the articles. For example, you could call the cache generation event on submission of a new article, or upon the editing of an article.

There is no standard, really. Some systems run on an interval, like once a day, or once an hour. Others run each time the page is accessed. Caching can be used to reduce the load in the latter case.
It all depends on how efficiently the algorithm scales, how many posts it will have to deal with and how often the information is needed. If the operation is quick and cheap, you may as well just run it every time the page is accessed for your initial version. If it is fast enough in your testing and doesn't kill the server's memory usage then doing any more work is a waste of time. If it isn't good enough, think about caching the result, investing in better hardware or look for opportunities to improve the code.
If the results don't need to change very often, just schedule it once an hour/inute or so and make sure it meets your needs before shipping.
It is generally better to test the simplest solution before worrying about optimisation.

You're currently following the "don't store if you can calculate" tactic taught as a first step in database design classes.
However, if the "score" is not likely to change frequently, then it may be better to process all entries on a schedule, storing their score in a database, and then just pull the highest-scored items when the page loads.

The standard way to run anything periodically is cron. You can make it run any command periodically, including PHP scripts.
You could also cache the score of the post, or at least the part of the score related to its content, to increase efficiency. Full-text processing is expensive, so the score is certainly worth caching in the database from this point of view.
The trick is to figure out how to implement it in a way that allows you to score the post based on both content and age, whilst still allowing you to cache it. I would create a base score that is calculated from the content, then cache that. When you want to get the real score, you retrieve the cached base score and adjust it based on the age of the post.
Example:
// fetch cached post score, which doesn't take time into account
$base_score = get_post_base_score($post_id);
// now adjust the base score given how old the post is
$score = adjust_score($base_score, time() - $post_time);

Caching of current user specific pages

We are using Smarty Templates on our LAMP site but my question would also apply to a site running Memcached (which we are planning to also bring online). Many of the pages of our user generated site have different views depending on who is looking at them. For instance, a list of comments where your own comments are highlighted. There would need to be a unique cache-id for each logged in user for this specific view. My question is, in this scenario, would you not even cache these views? Or is the overhead in creating/using the cache (either for smarty or memcached), low enough that you still would see some benefit to the cache?

Unless individual users are requesting the pages over and over again, there's no point caching this sort of thing, and I expect the overhead of caching will vastly exceed the performance benefits, simply since the cache hit ratio will be poor.
You may be better off looking into caching fragments of your site that do not depend on the individual user, or fragments that will be the same for a large number of page impressions (e.g. content that is the same for a large subset of your users).
For example - on this page you might want to cache the list of related questions, or the tag information, but there's probably little point caching the top-bar with reputation info too aggressively, since it will be requested relatively infrequently.

If the view code isn't too complicated just cache the data and generate the view each time.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.