Should I cache everything

Should I cache everything - php

I'm very close to finishing my application. I'm currently using file cache on laravel to cache basically all of the data that is saved in my tables every time a record is saved, updated or deleted. These tables over a period of time perhaps in 3 years or less will have over 2 million records. I was wondering if there are any pitfalls that I need to be aware of in file caching all of my records. Does anyone foresee a problem with file-caching hundreds of thousands of records over a period of time?
Currently, the way how my cache system works is when I save a new record, delete or make an update to a record it resets the cache for the table/record in question. Each of my tables/and most records has it's own cache. Does anyone foresee any problems with this design with a very large database?
Regards,
Darren

You have millions of data, which is a bit concerning if the application has high hits per second. In that case, more hits mean increased disk IO. Any bottleneck in the disk IO should impact the overall performance of the application.
From my personal experience, the best approach is to decide which data should be cached and which should not, what layers and architecture of caching should be used, etc. For example, you may not want to cache any data that is very dynamic. For instance, user carts, balance, view count, etc. As these data are always changing, caching them usually results in (close to) doubling the resource intake rather than increasing the actual throughput in practical scenarios. If a set of data is rarely fetched, it may not need caching at all.
On the other hand, if some part of your data has a very high hit ratio, for instance, home page elements, top posts, best-selling products, etc. then those should be cached using a fast and high availability caching mechanism such as object stores. In this way, high demand can be met in case of thousands of concurrent hits for those data without impacting the disk IO or database performance.
Simply put, different segments of the application data may or may not need different approaches for caching.

Related

SQL views and JSON disk caching

I'm programming a reddit-like website.
The user can display items from categories of its choice.
For this I'm querying a JOIN of the categories he subscribed to and the items.
Hardcore query
First solution : store the data on disk in a "categories_1-2-4-7-10.json" and serve it to the users browsing the same categories.
Cons : takes space on disk, heavy load.
I'm thinking about a new solution : views. But I don't really know how they work, do they regenerate often enough to be a heavyload on the server?
View would let me query data that already has been JOINED
Further : I'm only making a view for the frontpage items. I don't need to optimize later pages as they're not as frequently accessed.

It's a bad idea to store things to disk and then load them for a site. Disk operations are insanely slow compared to in memory operations.
You can still store JSON documents, but consider storing them in a caching layer.
Something like Redis, which is the new hotness these days (http://redis.io/) or Couchbase (http://www.couchbase.com/)
Store everything in memory and the site will be much faster.
As far as how often to regenerate your views ... a good idea is to give them an expiration time. Read about how that might work with caching in general. You would set a category view to exist in the cache for maybe 1 minute. After a minute the item leaves memory and you make a database query to put a newer version back in. Rinse and repeat.

Scalable web application

We are building a social website using PHP (Zend Framework), MySQL, server running Apache.
There is a requirement where in dashboard the application will fetch data for different events (there are about 12 events) on which this dashboard for user will be updated. We expect the total no of users to be around 500k to 700k. While at one time on average about 20% users would be online (for peak time we expect 50% users to be online).
So the problem is the event data as per our current design will be placed in a MySQL database. I think running a few hundred thousands queries concurrently on MySQL wouldn't be a good idea even if we use Amazon RDS. So we are considering to use both DynamoDB (or Redis or any NoSQL db option) along with MySQL.
So the question is: Having data both in MySQL and any NoSQL database would give us this benefit to have this power of scalability for our web application? Or we should consider any other solution?
Thanks.

You do not need to duplicate your data. One option is to use the ElastiCache that amazon provides to give your self in memory caching. This will get rid of your database calls and in a sense remove that bottleneck, but this can be very expensive. If you can sacrifice rela time updates then you can get away with just slowing down the requests or caching data locally for the user. Say, cache the next N events if possible on the browser and display them instead of making another request to the servers.
If it has to be real time then look at the ElastiCache and then tweak with the scaling of how many of them you require to handle your estimated amount of traffic. There is no point in duplicating your data. Keep it in a single DB if it makes sense to keep it there, IE you have some relational information that you need and then also have a variable schema system then you can use both databases, but not to load balance them together.
I would also start to think of some bottle necks in your architecture and think of how well your application will/can scale in the event that you reach your estimated numbers.

I agree with #sean, there’s no need to duplicate the database. Have you thought about a something with auto-scalability, like Xeround. A solution like that can scale out automatically across several nodes when you have throughput peaks and later scale back in, so you don’t have to commit to a larger, more expansive instance just because of seasonal peaks.
Additionally, if I understand correctly, no code changes are required for this auto-scalability. So, I’d say that unless you need to duplicate your data on both MySQL and NoSQL DB’s for reasons other than scalability-related issues, go for a single DB with auto-scaling.

What is the best practice in caching? What are the limits?

Im using Zend_cache to cache results of some complicated db queries, services etc.
My site is social, that means that there is a lot of user interaction.
Im able to cache users data here and there as well. But taht means, that i will have nearly tens of thousands cache files (with 10 000 users). Is this approach to cache almost everything coming from db still good for performance? Or there are some limits of filesystem?
Was looking for some article around, didnt find.
Thanks for an advice!
Jaroušek

The question you should be asking is if the overhead of creating/populating/maintaining that cache exceeds the cost of generating the cacheable data in the first place.
If it costs you $1 to generate some data, $10 to cache it, and $0.8 to retrieve from cache, then you'd have to be able to retrieve that data from cache 50 times to break even.
If you only access the cached data 10 times before it expires/invalidates, then you're losing $8.

Identify data to cache in which layer - PHP/MySQL

Think you are the proud owner of Facebook, then
which data you want to store in app layer [memcached/ APC] and which data in MySQL cache ?
Please explain also why you think so.
[I want to have an idea on which data to cache where]

For memcache, store session data. You have to typically query from a large table or from the filesystem to get it, depending on how it's stored. Putting that on memory removes hitting the disk for a relatively small amount data (that is typically critical to one's web application).
For your database cache, put stuff in there that is not changing so often. We're talking about wall posts, comments, etc. They are queried a lot and rarely change, all things considered. You may also want to consider doing a flat file cache, so you can purge individual files with greater ease, and divide it up as you see fit.
I generally don't directly cache any arbitrary data with APC, usually I will just let it cache stuff automatically and get lessened memory loads.
This is only one way to do it, but as far as the industry goes, this is a somewhat well-used model.

Best way to do caching on a reddit-like website

We have a PHP website like reddit, users can vote for the stories.
We tried to use APC, memcached etc. for the website but we gave up. The problem is we want to use a caching mechanism, but users can vote anytime on site and the cached data may be old and confusing for the other visitors.
Let me explain with an example, We have an array of 100 stories and stored in cache for 5 mins., a user voted for some stories so the ratings of the stories are changed. When the other user enter the website, he/she will see the cached data, therefore the old data. (This is the same if the voter user refreshes the page, he'll also see the old vote number for the stories.)
We cannot figure it out, any help will be highly appreciated

This is a matter of finding a balance between low-latency updates, and overall system/network load (aka, performance vs. cost).
If you have capacity to spare, the simplest solution is to keep your votes in a database, and always look them up during a page load. Of course, there's no caching here.
Another low-latency (but high-cost) solution is to have a pub-sub type system that publishes votes to all other caches on the fly. In addition to the high cost, there are various synchronization issues you'll need to deal with here.
The next alternative is to have a shared cache (e.g., memcached, but shared across different machines). Updates to the database will always update the cache. This reduces the load on the database and would get you lower latency responses (since cache lookups are usually cheaper than queries to a relational database). But if you do this, you'll need to size the cache carefully, and have enough redundancy such that the shared cache isn't a single point of failure.
Another, more commonly used, alternative is to have some kind of background vote aggregation, where votes are only stored as transactions on each of the front-end servers, and you have a background process that continuously (e.g., every five seconds) aggregates the votes and populates all the caches.
AFAIK, reddit does not do live low-latency vote propagation. If you vote something up, it isn't immediately reflected across other clients. My guess is that they're doing some kind of aggregation (as in #4), but that's just me speculating.

Perhaps this is a solution you've already considered, but why not just cache everything but the ratings? Instead, just update a single array, where the ith position contains the rating for the ith top story. Keep this in memory all the time, and flush ratings back to the database as it's available.
If you only care about the top N stories being the most up-to-date, then i only needs to be the size of the number of stories on the front page, which is presumably a very small number like 50 or so.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.