On my website, I'm using Apache, MySQL, and Memcached.
I have implemented a system where a user sees certain updates from his friends on his feed page. Although it's working, it's slow. Of course it is! Every time the page is loaded, the data is fetched again and again from MySQL. So I was banging my head, thinking about a proper way to cache this data with Memcached. I have thought of the following:
Fetch the data every time (Current) (Hell slow)
Store updates from each user in Memcached in separate keys. Then when the page is being loaded, get the data from Memcached, sort it, trim it, or whatever I want to do with it.
Store each user's feed separately. But this would mean that when a user posts an update, all of his friends' caches would have to be updated, causing extra overhead.
With the 2nd and 3rd options there is another problem: I do not believe that Memcached's 1 MB object limit can hold this much data. And then there is the risk of race conditions. What if two friends post an update at the same time, causing a user's feed cache to be updated twice? (Of course that wouldn't be a problem if Memcached operations are atomic, but still, I don't think this is the right way to do it.)
What are your thoughts on this? What will be the best method to achieve what I want?
Have your logged-in clients register a listener for each of their friends' channels. If someone adds something to his timeline, check whether there are registered listeners for it. If so, publish to those clients (use a queue, push, whatever).
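One way to realize this (the answer doesn't prescribe a transport) is Redis pub/sub through the phpredis extension; the channel naming "feed:{userId}" and the payload below are assumptions for illustration, not part of the original suggestion:

<?php
// Sketch only: the channel name "feed:{userId}" and the payload shape are made up.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Publisher: called when a user adds something to his timeline.
function publishUpdate(Redis $redis, int $userId, array $update): void
{
    $redis->publish("feed:$userId", json_encode($update));
}

// Listener: a long-running worker subscribed to all of one user's friends.
function listenToFriends(Redis $redis, array $friendIds): void
{
    $channels = array_map(fn (int $id): string => "feed:$id", $friendIds);
    $redis->subscribe($channels, function ($redis, $channel, $message) {
        // Relay the update to the connected client (WebSocket, long poll, etc.).
        echo "Update on $channel: $message\n";
    });
}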
I'm running into an interesting dilemma while trying to solve a scaling problem.
Currently we have a social platform with a pretty typical feed. We are using a graph database, and each time a feed is requested by the user we hit the DB. While this is fine now, it will come to a grinding halt as we grow our user base. Enter Redis.
Currently we store things like comments, likes and such in individual Redis keys in JSON encoded strings by post ID and update them when there are updates, additions or deletes. Then in code we loop through the DB results of posts and pull in the data from the Redis store. This is causing multiple calls to Redis to construct each post, which is far better than touching the DB each time. The challenge is keeping up with changing data such as commenters'/likers' avatars, screen names, closed accounts, new likes, new comments, etc. associated with each individual post.
I am trying to decide on a strategy to handle this in the most effective way. Redis will only take us so far, since we will top out at about 12 GB of RAM per machine.
One of the concepts in discussion is to use a beacon for each user that stores new post IDs. So when a user shares something new, all of their connected friends' beacons get the post ID, so that when a friend logs in their feed is marked dirty and requires an update; the feed is then stored by ID in a Redis sorted set ordered by timestamp. To retrieve the feed data we can do a single query by IDs rather than a full traversal, which is hundreds of times faster. That still does not solve the problem of the interacting users' account information, their likes and their comments, which are ever changing, but it does partly solve the problem of building the feed.
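A rough sketch of that beacon idea with phpredis, assuming hypothetical key names beacon:{userId} (set of new post IDs) and feed:{userId} (sorted set of post IDs scored by timestamp):

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// When a user shares a post, fan the post ID out to each connected friend's beacon.
function fanOutPost(Redis $redis, int $postId, array $friendIds): void
{
    foreach ($friendIds as $friendId) {
        $redis->sAdd("beacon:$friendId", $postId); // marks the friend's feed dirty
    }
}

// On login / feed request: a non-empty beacon means the feed is dirty, so merge
// the new IDs into the timestamp-sorted feed and clear the beacon.
function refreshFeed(Redis $redis, int $userId, callable $postTimestamp): array
{
    $newIds = $redis->sMembers("beacon:$userId");
    foreach ($newIds as $postId) {
        $redis->zAdd("feed:$userId", $postTimestamp($postId), $postId);
    }
    if ($newIds) {
        $redis->del("beacon:$userId");
    }
    // Newest 50 post IDs; hydrate them elsewhere with a single query by IDs.
    return $redis->zRevRange("feed:$userId", 0, 49);
}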
Another idea is to store a user's entire feed (JSON encoded) in a MySQL record and update it on the fly when the user requests it and the beacon shows a dirty feed. Otherwise it's just a single select and a JSON decode to build the feed. Again, the dynamic components are the hurdle.
Has anyone dealt with this challenge successfully, or does anyone have working knowledge of a strategy to approach this problem?
Currently we store things like comments, likes and such in individual Redis keys in JSON encoded strings by post ID
Use a more efficient serializer, like igbinary or msgpack. I suggest igbinary (check http://kiss-web.blogspot.com/).
Then in code we loop through the DB results of posts and pull in the data from the Redis store.
Be sure to use pipelining for maximum performance.
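With phpredis that looks roughly like this (the per-post key names are assumptions):

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$postIds = [101, 102, 103]; // IDs returned by the graph DB query

// Queue all GETs and send them in a single round trip instead of one call per post.
$pipe = $redis->multi(Redis::PIPELINE);
foreach ($postIds as $postId) {
    $pipe->get("post:$postId:comments");
    $pipe->get("post:$postId:likes");
}
$replies = $pipe->exec(); // flat array of replies, in the order the GETs were queued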
This is causing multiple calls to Redis to construct each post, which is far better than touching the DB each time.
Do not underestimate the power of DB primary keys. Try to do the same (not a join, but a select by keys) with your DB:
SELECT * FROM comments WHERE id IN (1,2,3,4,5,6);
A single Redis call is faster than a single DB call, but doing lots of Redis calls (even pipelined) is not much faster than one SQL query on primary keys, especially when you give your DB enough memory for buffers.
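For reference, the same primary-key lookup done from PHP with PDO (DSN, credentials and the $ids list are placeholders):

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret');
$ids = [1, 2, 3, 4, 5, 6];

// One placeholder per ID keeps the IN () clause safely parameterized.
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT * FROM comments WHERE id IN ($placeholders)");
$stmt->execute($ids);
$comments = $stmt->fetchAll(PDO::FETCH_ASSOC);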
You can use the DB and Redis together by "caching" DB data in Redis. You do something like this (sketched in code below):
Every time you update data, you update it in the DB and delete it from Redis.
When fetching data, you first try to get it from Redis. If the data is not found in Redis, you fetch it from the DB and insert it into Redis for future use, with some expire time.
That way you store only useful data in Redis. Unused (old) data will stay only in the DB.
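A minimal sketch of that read/write flow with phpredis and PDO; the key format comment:{id}, the one-hour expire time and the table layout are assumptions:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret');

// Read path: try Redis first, fall back to the DB and repopulate with an expire time.
function getComment(Redis $redis, PDO $pdo, int $id): ?array
{
    $cached = $redis->get("comment:$id");
    if ($cached !== false) {
        return json_decode($cached, true);
    }
    $stmt = $pdo->prepare('SELECT * FROM comments WHERE id = ?');
    $stmt->execute([$id]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
    if ($row !== null) {
        $redis->setex("comment:$id", 3600, json_encode($row)); // expire after one hour
    }
    return $row;
}

// Write path: update the DB, then delete the cache entry.
function updateComment(Redis $redis, PDO $pdo, int $id, string $body): void
{
    $stmt = $pdo->prepare('UPDATE comments SET body = ? WHERE id = ?');
    $stmt->execute([$body, $id]);
    $redis->del("comment:$id");
}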
You may want to use MongoDB as #solocommand has described, but you may want to drop the expectation that you "update" data on demand. Instead, push your users' changes into a "write" queue, which then updates the database as needed. Then you can load from the database (MongoDB) and work with it as needed, or update the other Redis records.
Moving to a messaging system such as Amazon SQS, IronMQ, or RabbitMQ may help you scale better. You can also use Redis queues as a basic message bus.
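A bare-bones Redis list queue in phpredis, with LPUSH on the producer side and a blocking BRPOP worker; the queue key and job format are made up for the example:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Producer: enqueue the user's change instead of writing to the DB inline.
$redis->lPush('write-queue', json_encode(['user' => 42, 'action' => 'like', 'post' => 7]));

// Worker (a separate long-running process): block until a job arrives, then apply it.
while (true) {
    $job = $redis->brPop(['write-queue'], 5); // returns [key, payload], or nothing on timeout
    if (!empty($job)) {
        $payload = json_decode($job[1], true);
        // ... apply $payload to the database here ...
    }
}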
So I have a website with Flash games; each time a user gets an answer correct or wrong, the values and the timestamps are sent to the database.
I wanted to reduce the number of accesses to the database, so I made a solution with the APC cache: it accumulates the values and also keeps a table that stores the references between the users and the sessions that have data to send. When the user logs out/in or changes games, the values are sent to the MySQL DB. For example, if the user shuts down the PC without logging out, the data stays in the cache until he logs in again. But I found out that the APC cache is very unreliable; sometimes it deletes the values without warning.
Is there any PHP cache where I can achieve that, or another similar solution?
Caches are .. well, they are caches. They're not meant for information that is going to live forever, and (for write caches) they will need to be flushed at regular intervals. Both memcached and APC will evict entries when the cache is full (or the TTL expires), and if they didn't, they'd be in-memory databases instead.
Without knowing the details of why you're having trouble storing data directly in the database, you could use a simple table without indexes, with one row for each entry, to store the data. The insert time should be negligible, and you can process the data every five / ten / fifteen minutes and move it into its proper location. This will give you more permanence, at least. That's probably the simplest way of doing it with the stack you have today.
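A sketch of that approach, assuming a hypothetical unindexed game_events_buffer table and a game_scores table as the proper location:

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=games', 'user', 'secret');

// Hot path: one cheap insert per answer, no cache involved.
function logAnswer(PDO $pdo, int $userId, int $gameId, bool $correct): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO game_events_buffer (user_id, game_id, correct, created_at)
         VALUES (?, ?, ?, NOW())'
    );
    $stmt->execute([$userId, $gameId, (int) $correct]);
}

// Cron, every five / ten / fifteen minutes: move buffered rows to their proper location.
function flushBuffer(PDO $pdo): void
{
    $pdo->beginTransaction();
    $pdo->exec('INSERT INTO game_scores (user_id, game_id, correct, created_at)
                SELECT user_id, game_id, correct, created_at FROM game_events_buffer');
    $pdo->exec('DELETE FROM game_events_buffer');
    $pdo->commit();
}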
You can also look into other solutions, such as Redis, message queues (RabbitMQ, Gearman, etc.) and a whole slew of other technologies. The important part is to avoid using technology made for non-permanent data to store permanent data.
I've been working on a website lately and want to speed up my application.
I want to cache my users' pages, but the pages are dynamic: if someone posts a new feed item, the homepage is updated with it. If I cache the homepage for one user and a friend of his posts a new feed item, I want that cache to be expired, so that the next time he visits the homepage the application contacts the database, fetches the new feed items, and caches them again.
I'm using memcache and PHP and MySQL for my DB.
I have tables called friends, feeds and users.
Would it be efficient to cache every user's friends and, when that user posts a feed item, have my app fetch his/her friends and cache a notification with their user IDs, so that when those friends log in, the app checks on every page whether there is a notification to act on (in this case, deleting the cached homepage)?
Regards,
Resul
Profile your application and locate places where you access data that is expensive to fetch (or calculate). Those places are good places to start with memcached, unless you're doing more writes than reads (where you'd likely have to update the cache more often than you could make use of it).
Caching everything you ever access could well lead to nothing but a rather full memcached that holds mostly data that is rarely accessed (while potentially pushing out of the cache the things you actually should cache). In many cases you shouldn't use memcached as a 1:1 copy of your database in key-value form.
Before you even start server-side optimizations, you should run YSlow and try to get an A rating. Take a hard look at your JavaScript too. If you are using jQuery, then getting rid of it would vastly improve the overall performance of the site. Front-end optimization is usually much more important.
The next step would be optimizing and cleaning up the server-side code. Try testing your SQL queries with EXPLAIN and see if you are missing some indexes. Then do some profiling on the PHP side with Xdebug and see where the bottlenecks are.
And only then start messing with caching. As for Memcached, unless your website runs on top of a cluster of servers, you do not need it. Hell .. it might even be harmful. If your site lives on a single box, you will get much better results with APC, which, unlike Memcached, is not distributed by nature.
Write a class that handles all the DB queries, caches the tables, and runs the queries against the cached tables instead of your DB. Update your cache each time you do an insert or an update on a table.
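A stripped-down sketch of such a class with Memcached and PDO; the cache-key scheme, the TTL and the blunt flush-on-write invalidation are choices made for illustration, not something the answer specifies:

<?php
class CachedDb
{
    public function __construct(private PDO $pdo, private Memcached $cache) {}

    // Read through the cache: the query text plus parameters form the cache key.
    public function select(string $sql, array $params = [], int $ttl = 300): array
    {
        $key = 'q:' . md5($sql . serialize($params));
        $rows = $this->cache->get($key);
        if ($rows === false) {
            $stmt = $this->pdo->prepare($sql);
            $stmt->execute($params);
            $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
            $this->cache->set($key, $rows, $ttl);
        }
        return $rows;
    }

    // Writes go to the DB and invalidate cached queries.
    public function execute(string $sql, array $params = []): void
    {
        $stmt = $this->pdo->prepare($sql);
        $stmt->execute($params);
        // Simplest possible invalidation: flush everything. Real code would
        // track per-table keys (or versions) and only drop what was affected.
        $this->cache->flush();
    }
}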
I'm developing a website that is sensitive to page visits. For instance it has sections that will show the users which parts of the website (which items) have been visited the most. To implement this features, two strategies come to my mind:
Create a page hit counter, sort the pages by the number of visits and pick the highest ones.
Create a Google Analytics account and use its info.
If I go with the first strategy, I would need a very fast and accurate hit counter with the ability to distinguish unique IPs (or users). I believe that using MySQL wouldn't be a good choice, since a lot of page visits means a lot of DB locks and performance problems. I think a fast logging class would be a good option.
The second option seems very interesting once all the problems of the first one emerge, but I don't know whether Google Analytics offers a way (like an API) to access the information I want. And if it does, is it fast enough?
Which approach (or even an alternative approach) do you suggest I take? Which one is faster? Performance is my top priority. Thanks.
UPDATE:
Thank you. It's interesting to see different answers. These answers reminded me of an important factor. My website updates the "most visited" items every 8 minutes, so I don't need the data in real time, but I do need it to be accurate enough every 8 minutes or so. What I had in mind was this (a rough sketch in code follows the list):
Log every page visit to a simple text log file
Send a cookie to the user to separate unique users
Every 8 minutes, load the log file, collect the info and update the MySQL tables.
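Roughly what those three steps could look like; the log path, cookie name and page_hits table are placeholders, and the two halves would live in separate scripts (page request vs. cron):

<?php
// 1) + 2) On every page view: tag the visitor with a cookie and append one log line.
if (empty($_COOKIE['visitor_id'])) {
    $visitorId = bin2hex(random_bytes(16));
    setcookie('visitor_id', $visitorId, time() + 86400 * 365, '/');
} else {
    $visitorId = $_COOKIE['visitor_id'];
}
$line = time() . "\t" . $visitorId . "\t" . ($_SERVER['REQUEST_URI'] ?? '/') . "\n";
file_put_contents('/var/log/app/hits.log', $line, FILE_APPEND | LOCK_EX);

// 3) Cron, every 8 minutes: rotate the log, count unique visitors per page,
//    and fold the counts into MySQL.
rename('/var/log/app/hits.log', '/var/log/app/hits.processing');
$unique = [];
foreach (file('/var/log/app/hits.processing', FILE_IGNORE_NEW_LINES) as $entry) {
    [, $visitor, $uri] = explode("\t", $entry);
    $unique[$uri][$visitor] = true;
}
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret');
$stmt = $pdo->prepare(
    'INSERT INTO page_hits (uri, visits) VALUES (?, ?)
     ON DUPLICATE KEY UPDATE visits = visits + VALUES(visits)'
);
foreach ($unique as $uri => $visitors) {
    $stmt->execute([$uri, count($visitors)]);
}
unlink('/var/log/app/hits.processing');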
That said, I wouldn't want to reinvent the wheel. If a 3rd party service can meet my requirements, I would be happy to use it.
Given you are planning to use the page hit data to determine what data to display on your site, I'd suggest logging the page hit info yourself. You don't want to be reliant upon some 3rd-party service that you'd have to interrogate in order to create your page. This is especially true if you are loading that data in real time, as you'd have to interrogate that service for every incoming request to your site.
I'd be inclined to save the data yourself in a database. If you're really concerned about the performance of the inserts, then you could investigate intercepting requests (I'm not sure how you go about this in PHP, but I'm assuming it's possible) and passing the request data off to a separate thread to store the request info. By having a separate thread handle the logging, you won't interrupt your response to the end user.
Also, given you are planning on using the collected data to "... show the users which parts of the website (which items) have been visited the most", you'll need to think about accessing this data to build your dynamic page. Maybe it'd be good to store a consolidated count for each resource. For example, rather than having 30000 rows showing that index.php was requested, have one row showing that index.php was requested 30000 times. This would certainly be quicker to reference than performing queries on what could become quite a large table.
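That consolidated count can be as simple as an upsert into a one-row-per-page table; the table and column names here are assumed:

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret');

// Assumes page_counts(page VARCHAR(255) PRIMARY KEY, hits INT NOT NULL).
$stmt = $pdo->prepare(
    'INSERT INTO page_counts (page, hits) VALUES (?, 1)
     ON DUPLICATE KEY UPDATE hits = hits + 1'
);
$stmt->execute([$_SERVER['REQUEST_URI'] ?? '/index.php']);

// Building the "most visited" section is then a cheap read of a small table.
$top = $pdo->query('SELECT page, hits FROM page_counts ORDER BY hits DESC LIMIT 10')
           ->fetchAll(PDO::FETCH_ASSOC);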
Google Analytics has latency, and it samples some of the data returned by the API, so that's out.
You could try the API from Clicky. Bear in mind that:
Free accounts are limited to the last 30 days of history, and 100 results per request.
There are many examples of hit counters out there, but it sounds like you didn't find one that met your needs.
I'm assuming you don't need real-time data. If that's the case, I'd probably just read the data out of the web server log files.
Your web server can distinguish IP addresses. There's no fully reliable way to distinguish users. I live in a university town; half the dormitory students have the same university IP address. I think Google Analytics relies on cookies to identify users, but shared computers make that somewhat less than 100% reliable. (But that might not be a big deal.)
"Visited the most" is also a little fuzzy. The easy way out is to count every hit on a particular page as a visit. But a "visit" of 300 milliseconds is of questionable worth. (Probably realized they clicked the wrong link, and hit the "back" button before the page rendered.)
Unless there are requirements I don't know about, I'd probably start by using awk to extract timestamp, ip address, and page name into a CSV file, then load the CSV file into a database.
I'm currently developing the foundation of an application, and I'm looking for ways to optimize performance. My setup is based on the CakePHP framework, but I believe my question is relevant to any technology stack, as it relates to data caching.
Let's take a typical post-author relation, which is represented by 2 tables in my db. When I query the database for a specific blog post, at the same time the built-in ORM functionality in CakePHP also fetches the author of the post, comments on the post, etc. All of this is returned as one big-ass nested array, which I store in cache using a unique identifier for the concerned blog post.
When updating the blog post, it is child's play to destroy the cache for the post and have it regenerated on the next request.
But what happens when not the main entity (in this case the blog post) gets updated, but rather some of the related data? For example, a comment could be deleted, or the author could update his avatar. Are there any approaches (patterns) which I could consider for tracking updates to related data, and applying updates to my cache accordingly?
I'm curious to hear whether you've also run into similar challenges, and how you have managed to potentially overcome the hurdle. Feel free to provide an abstract perspective, if you're using another stack on your end. Your views are anyhow much appreciated, many thanks!
It is rather simple: cache entries can be
added
destroyed
You should take care of destroying cache entries when related data changes (so in the application layer, in addition to updating the data, you destroy certain types of cached entries whenever you update certain tables; you keep track of the dependencies by hard-coding them).
If you'd like to be smart about it, you could have your cache entries state their dependencies, and cache the last update times of your DB tables as well.
Then you could
fetch cached data, examine dependencies,
get update times for relevant DB tables and
in case the record is stale (the update time of a table that your big-ass cache entry depends on is later than the time of the cache entry), drop it and get fresh data from the database.
You could even integrate the above into your persistence layer.
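A sketch of that staleness check, assuming Memcached as the backend, cache entries that record their dependencies and creation time, and hypothetical lastmod:{table} keys for the table update times:

<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

// Write path: whenever a table changes, record its last update time.
function touchTable(Memcached $cache, string $table): void
{
    $cache->set("lastmod:$table", time());
}

// Store data together with the tables it was built from.
function putWithDeps(Memcached $cache, string $key, array $data, array $tables): void
{
    $cache->set($key, ['data' => $data, 'deps' => $tables, 'cached_at' => time()]);
}

// Read path: drop the entry if any dependency changed after it was cached.
function getFresh(Memcached $cache, string $key): ?array
{
    $entry = $cache->get($key);
    if ($entry === false) {
        return null; // miss: caller rebuilds the big nested array from the DB
    }
    foreach ($entry['deps'] as $table) {
        $lastMod = $cache->get("lastmod:$table");
        if ($lastMod !== false && $lastMod > $entry['cached_at']) {
            $cache->delete($key); // stale: a dependency changed after caching
            return null;
        }
    }
    return $entry['data'];
}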
EDIT:
Of course, the above is for when you want a consistent cache. Sometimes, and for some data, you can relax the consistency requirements, and there are scenarios where a simple TTL will be good enough (for a trivial example, with a TTL of 1 second you should mostly be out of trouble with users and it can still help data processing; with longer times you might still be OK - for example, say you are caching the list of country ISO codes; your application might be perfectly fine caching that for 86400 seconds).
Furthermore, you could also track the times of the information presented to the user, for example:
let's say the user has seen data A from the cache, and we know this data was created/modified at time t1
the user makes changes to data A (turning it into data B) and commits the change
the application layer can then check whether data A is still what is in the DB (i.e. whether the cached data upon which the user based decisions and/or changes was indeed fresh)
if it was not fresh, then there is a conflict and the user should confirm the changes
This has the cost of an extra read of data A from the DB, but it occurs only on writes.
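In code this boils down to comparing the timestamp the user saw against what is currently in the DB before committing (column names are illustrative; a conditional UPDATE ... WHERE modified_at = ? would also close the small race between the read and the write):

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'secret');

// $seenModifiedAt is t1: the modification time of data A as the user saw it.
function saveWithConflictCheck(PDO $pdo, int $postId, string $newBody, string $seenModifiedAt): bool
{
    // The extra read of data A happens only on writes.
    $stmt = $pdo->prepare('SELECT modified_at FROM posts WHERE id = ?');
    $stmt->execute([$postId]);
    $current = $stmt->fetchColumn();

    if ($current !== $seenModifiedAt) {
        return false; // conflict: ask the user to confirm against the fresh data
    }
    $stmt = $pdo->prepare('UPDATE posts SET body = ?, modified_at = NOW() WHERE id = ?');
    $stmt->execute([$newBody, $postId]);
    return true;
}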
Also, the conflict can occur not only because of the cache, but also because of multiple users trying to change the data (i.e. it is related to locking strategies).
One approach for memcached is to use tags ( http://code.google.com/p/memcached-tag/ ). For example, take your post's "big-ass nested array": let's say it includes the author's information and the post itself, and it is shown on the frontpage and in some box in the sidebar. So it gets the tags: frontpage, {author-id}, sidebar, {post-id}. Now if someone changes the author's information, you flush every cache entry with the tag {author-id}. But that's only one solution, and only for cache backends that support tags, which APC, for example, does not (afaik). Hope that gives you an example.
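memcached-tag patches tag support into the server itself; with stock Memcached a similar effect can be approximated by versioning each tag and baking the versions into the key, so that bumping a tag's version orphans every entry stored under it. This is a hand-rolled emulation for illustration, not memcached-tag's actual API:

<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

// Current version of a tag; bumping it implicitly invalidates every key built with it.
function tagVersion(Memcached $cache, string $tag): int
{
    $v = $cache->get("tagv:$tag");
    if ($v === false) {
        $v = 1;
        $cache->set("tagv:$tag", $v);
    }
    return (int) $v;
}

// Build a cache key that embeds the current version of each of its tags.
function taggedKey(Memcached $cache, string $key, array $tags): string
{
    $parts = [$key];
    foreach ($tags as $tag) {
        $parts[] = $tag . ':' . tagVersion($cache, $tag);
    }
    return implode('|', $parts);
}

// Cache the post's big nested array under its tags.
$post = ['post' => ['id' => 7, 'title' => '...'], 'author' => ['id' => 42, 'name' => '...']];
$cache->set(taggedKey($cache, 'post:7', ['frontpage', 'author-42', 'sidebar', 'post-7']), $post, 3600);

// The author changed their information: effectively flush everything tagged author-42.
$cache->set('tagv:author-42', tagVersion($cache, 'author-42') + 1); // old keys just expire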