NewsFeed with Redis At Scale Strategy - PHP

I'm running into an interesting dilemma while trying to solve a scale problem.
Currently we have a social platform with a pretty typical feed. We are using a graph database, and each time a feed is requested by a user we hit the DB. While this is fine now, it will come to a grinding halt as we grow our user base. Enter Redis.
Currently we store things like comments, likes and such in individual Redis keys, in JSON-encoded strings keyed by post ID, and update them when there are updates, additions or deletes. Then in code we loop through the DB results of posts and pull in the data from the Redis store. This causes multiple calls to Redis to construct each post, which is still far better than touching the DB each time. The challenge is keeping up with the changing data associated with each individual post: commenters' and likers' avatars, screen names, closed accounts, new likes, new comments, etc.
I am trying to decide on a strategy to handle this in the most effective way. Redis will only take us so far, since we will top out at about 12 GB of RAM per machine.
One of the concepts in discussion is to use a beacon for each user that stores new post IDs. When a user shares something new, all of their connected friends' beacons get the post ID, so that when a friend logs in their feed is seen as dirty and requiring an update; the feed itself is stored by post ID in a Redis sorted set ordered by timestamp. To retrieve the feed data we can then do a single query by IDs rather than a full traversal, which is hundreds of times faster. That still does not solve the ever-changing data problem (the interacting users' account information, their likes, and their comments), but it does solve, in part, the feed-building problem. A rough sketch of the idea follows.
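A minimal sketch of that beacon/sorted-set approach, assuming the phpredis extension; the key names (feed:{id}, feed_dirty:{id}) and the way friend IDs are obtained are made up for illustration:

<?php
// Fan-out on write: push the new post ID into each friend's feed
// (a sorted set scored by timestamp) and mark their feed dirty.
function fanOutPost(Redis $redis, int $postId, array $friendIds): void
{
    $now  = time();
    $pipe = $redis->multi(Redis::PIPELINE);
    foreach ($friendIds as $friendId) {
        $pipe->zAdd("feed:$friendId", $now, $postId); // feed ordered by timestamp
        $pipe->set("feed_dirty:$friendId", 1);        // beacon: feed needs a rebuild
    }
    $pipe->exec(); // one round trip for the whole fan-out
}

// On login/read: the newest 20 post IDs in a single call, no graph traversal.
function getFeedIds(Redis $redis, int $userId): array
{
    return $redis->zRevRange("feed:$userId", 0, 19);
}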
Another idea is to store a user's entire feed (JSON encoded) in a MySQL record and update it on the fly when the user requests it and the beacon shows a dirty feed. Otherwise it's just a single SELECT and a json_decode to build the feed. Again, the dynamic components are the hurdle.
Has anyone dealt with this challenge successfully, or does anyone have working knowledge of a strategy to approach this problem?

Currently we store things like comments, likes and such in individual Redis keys, in JSON-encoded strings keyed by post ID
Use a more efficient serializer, like igbinary or msgpack. I suggest igbinary (check http://kiss-web.blogspot.com/).
Then in code we loop through the DB results of posts and pull in the data from the Redis store.
Be sure to use pipelining for maximum performance.
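For example, with the phpredis extension the per-post lookups can be batched into one round trip (a sketch; the post:{id} key naming is an assumption):

// Fetch the cached JSON for many posts in a single network round trip.
$pipe = $redis->multi(Redis::PIPELINE);
foreach ($postIds as $id) {
    $pipe->get("post:$id");
}
$posts = $pipe->exec(); // one round trip instead of count($postIds)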
This causes multiple calls to Redis to construct each post, which is still far better than touching the DB each time.
Do not underestimate the power of DB primary keys. Try to do the same (not a join, but a select by keys) with your DB:
SELECT * FROM comments WHERE id IN (1,2,3,4,5,6);
A single Redis call is faster than a single DB call, but doing lots of Redis calls (even pipelined) is not that much faster than one SQL query on primary keys, especially when you give your DB enough memory for buffers.
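In PHP that primary-key select can be built safely with PDO placeholders (a sketch following the example query above):

// Select many comments by primary key in one query.
$ids = [1, 2, 3, 4, 5, 6];
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT * FROM comments WHERE id IN ($placeholders)");
$stmt->execute($ids);
$comments = $stmt->fetchAll(PDO::FETCH_ASSOC);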
You can use the DB and Redis together by "caching" DB data in Redis. You do something like this:
Every time you update data, you update it in the DB and delete it from Redis.
When fetching data, you first try to get it from Redis. If the data is not found in Redis, you fetch it from the DB and insert it into Redis for future use, with some expire time.
That way you store only useful data in Redis; unused (old) data lives only in the DB.
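A minimal cache-aside sketch of that pattern, assuming phpredis and PDO; the comment:{id} key naming and the one-hour expire time are arbitrary:

// Read path: try Redis first, fall back to the DB, then prime the cache.
function getComment(Redis $redis, PDO $pdo, int $id): ?array
{
    $cached = $redis->get("comment:$id");
    if ($cached !== false) {
        return json_decode($cached, true);
    }
    $stmt = $pdo->prepare('SELECT * FROM comments WHERE id = ?');
    $stmt->execute([$id]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
    if ($row !== null) {
        $redis->setex("comment:$id", 3600, json_encode($row)); // expire after 1h
    }
    return $row;
}

// Write path: update the DB, then delete the cache entry.
function updateComment(Redis $redis, PDO $pdo, int $id, string $body): void
{
    $stmt = $pdo->prepare('UPDATE comments SET body = ? WHERE id = ?');
    $stmt->execute([$body, $id]);
    $redis->del("comment:$id");
}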

You may want to use MongoDB as #solocommand has described, but you may also want to drop the expectation that you "update" data on demand. Instead, push your users' changes into a "write" queue, which then updates the database as needed. Then you can load from the database (MongoDB) and work with it as needed, or update the other Redis records.
Moving to a messaging system such as Amazon SQS, IronMQ, or RabbitMQ may help you scale better. You can also use Redis queues as a basic message bus, as in the sketch below.
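A minimal sketch of a Redis-backed write queue, assuming phpredis; the queue name, the job shape, and the applyChangeToDatabase() helper are made up for illustration:

// Producer (web request): enqueue the change instead of writing it inline.
$redis->lPush('write_queue', json_encode([
    'type'   => 'like',
    'postId' => $postId,
    'userId' => $userId,
]));

// Consumer (worker process): drain the queue and apply changes to the DB.
while (true) {
    $item = $redis->brPop(['write_queue'], 5); // block for up to 5 seconds
    if (!$item) {
        continue; // timed out, poll again
    }
    $job = json_decode($item[1], true); // $item is [queueName, payload]
    applyChangeToDatabase($job);        // hypothetical helper
}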

Related

MySQL or JSON for data retrieval

So, I have a situation and I need a second opinion. I have a database and it's working great with all foreign keys, indexes and stuff, but when I reach a certain number of visitors, around 700-800 concurrent visitors, my server hits a bottleneck and displays "Service temporarily unavailable." So I had an idea: what if I pull data from JSON instead of the database? I mean, I would still update the database, but on each update I would regenerate a JSON file and pull data from it to show on my homepage. That way I would not press my CPU too hard, and I would be able to do some kind of caching on the user end.
What you are describing is caching.
Yes, it's a common optimization to avoid over-burdening your database with query load.
The idea is you store a copy of data you had fetched from the database, and you hold it in some form that is quick to access on the application end. You could store it in RAM, or in a JSON file. Some people operate a Memcached or Redis in-memory database as a shared resource, so your app can run many processes or threads that access the same copy of data in RAM.
It's typical that your app reads some given data many times for every single time it updates the data. The greater this ratio of reads to writes, the better the savings in terms of lightening the load on your database.
It can be tricky, however, to keep the data in cache in sync with the most recent changes in the database. In other words, how do all the cache copies know when they should re-fetch the data from the database?
There's an old joke about this:
There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton
So after another few days of exploring and trying to get the right answer, this is what I have done. I decided to create another table instead of a JSON file, and put all the data that was supposed to go in the JSON file into the table.
WHY?
Number one reason is that MySQL has the ability to lock tables while they're being updated; a plain JSON file has not.
Number two is that I go from a few dozen queries down to just one, the simplest query there is: SELECT * FROM table.
Number three is that I have better control over content this way.
Number four: while I was searching for an answer, I found out that some people had issues with JSON file availability when a lot of concurrent connections requested the same JSON; this way I will never have a problem with availability.

Redis - Best data structure to store and then fetch large data

I've recently implemented Redis in one of my Laravel projects. It's currently more of a technical exercise than a production setup, as I want to see what it's capable of.
What I've done is created a list of payment transactions. What I'm pushing to the list is the payload which I receive from a webhook every time a transaction is processed. The payload is essentially an object containing all the information to do with that particular transaction.
I've created a VueJS frontend that displays all the data in a table, with pagination so it shows 10 rows at a time.
Initially this was working super quick, but now that the list contains 30,000 rows, which is about 11 MB worth of data, the request is taking about 11 seconds.
I think the issue here is that I'm using a list and am fetching all the rows from the list using LRANGE.
The reason I used a list is that it has the LPUSH command, so the latest transactions go to the start of the list.
I decided to do a test where I fetched all the data from the list and output the value to a blank page, and this took about the same time, so it's not an issue with Vue, Axios, etc.
Firstly, is this read speed normal? I've always heard that Redis is blazing fast.
Secondly, is there a better way to increase read performance when using Redis?
Thirdly, am I using the wrong data type?
In time I need to be able to store 1m rows of data.
As I understand it, you fetch all 30,000 rows on every update and then paginate them in the frontend. In my opinion, the right strategy is to fetch lighter data packs in each request.
For example, use Laravel pagination in response to your request.
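With a Redis list that means fetching just one page per request instead of the whole list (a sketch using Laravel's Redis facade; the transactions key name and page size are assumptions):

use Illuminate\Support\Facades\Redis;

// Fetch page $page (1-based), 10 rows per page, newest first:
// LPUSH already keeps the newest items at the head of the list.
function getTransactionsPage(int $page, int $perPage = 10): array
{
    $start = ($page - 1) * $perPage;
    $stop  = $start + $perPage - 1;
    $rows  = Redis::lrange('transactions', $start, $stop);
    return array_map(fn ($row) => json_decode($row, true), $rows);
}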
In my opinion:
Firstly: Redis really is blazing fast, because Redis data always lives in memory. If reading 11 MB of data takes about 11 seconds, you should check your bandwidth.
Secondly: I'm sorry, I don't know how to increase read performance in this environment.
Thirdly: I think your choice of data type is OK.
So, check the bandwidth to your Redis server first.

Using Memcache, Should I use PDO or an ORM?

I am working on a project with a custom HTML5 front end and a backend I've designed from experience. The backend is composed of a message queue and a cache; currently I've chosen Beanstalk and Memcache because I'm familiar with them, but I am open to suggestions.
My question, though, comes from how my coder is interfacing with the MySQL DB we are using to store the data. The idea is to pre-cache most or all of the DB so the site runs really fast. It's not a huge DB, so RAM for Memcache shouldn't be an issue. However, my coder is using CodeIgniter with GreenBean. I've never heard of GreenBean before, and when I google it I get almost nothing that isn't related to green beans the food. What little I could find suggested it was an ORM, which fits what my coder has told me.
The problem is this. With raw PDO my pre-caching scheme is simple: I would grab each row from each table and store it in the cache with a key. Then every time I needed that data I would look in the cache first, and then the DB. If something is changed on the backend, I only need to update that row in the DB and the associated key in the cache, along the lines of the sketch below.
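A minimal sketch of that per-row scheme, assuming the Memcached extension and PDO; the employees table and the employees:{id} key convention are just for illustration:

// Pre-cache every row of a table under its own key, e.g. "employees:42".
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

foreach ($pdo->query('SELECT * FROM employees') as $row) {
    $memcached->set('employees:' . $row['id'], $row);
}

// When a row changes, update the DB and invalidate just that one key.
$stmt = $pdo->prepare('UPDATE employees SET office_id = ? WHERE id = ?');
$stmt->execute([$newOfficeId, $employeeId]);
$memcached->delete('employees:' . $employeeId); // next read re-caches it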
With an ORM, if I store the entire ORM object serialized in the cache, then it holds a bunch of related data, data that could be incorrect if something were changed. For example, say you have a DB of employees that is linked to the office they work in and the department they work in. The ORM grabs the office and the department, and we store all of that in the cache. But if the office address changes, the ORM object for every employee in that office is now stale/incorrect.
In that example, just letting the cache expire probably isn't an issue most of the time. But in my application, that data should really get updated immediately. So in a simple PDO scheme you flush the cache keys related to the data that changed, and every future page call gets the updated data. But with an ORM you have lots and lots of cached object instances that might be incorrect, and no good way of finding them. So it seems to me you are left with some form of indexing of your cached objects, and when you change something simple you could be flushing and refilling a big chunk of the cache. The site gets really slow then.
Typically I would just cache a DB result after the first time I needed it, but in this case I think that could end up being really slow for the users that make the first requests for a particular set of data. Additionally, there are some search features that could require a lot of data from the DB. Thus my desire to pre-cache.
So in this case I'm thinking an ORM would hurt the site's performance. I'm thinking I'm not the first person to have this issue though. Is there an ORM out there that would handle this scenario well? Is there a better backend architecture I'm missing?
Thanks

What will be my choices for implementing cache in this case?

On my website, I'm using Apache, MySQL, and Memcached.
I have implemented a system where a user sees certain updates from his friends on his feed page. Although it's working, it's slow. Of course it is! Every time the page loads, the data is fetched again and again from MySQL. So I was banging my head thinking about a proper implementation for caching this data with Memcached. I have thought of the following:
Fetch the data every time (Current) (Hell slow)
Store updates from each user in Memcached in separate keys. Then when the page is being loaded, get the data from Memcached, sort it, trim it, or whatever I want to do with it.
Store each user's feed separately. But this would mean that when a user posts an update, all of his friends' caches will have to be updated, causing extra overhead.
With the 2nd and 3rd options there will be another problem. I do not believe that Memcached's object limit of 1 MB can store this much data. And then there are several risks of race conditions. What if 2 friends post an update at the same time, causing a user's feed cache to be updated twice? (Of course that wouldn't be a problem if Memcached operations were atomic, but still, I don't think this is the right way to do it.)
What are your thoughts on this? What will be the best method to achieve what I want?
Have your logged-in clients register a listener for all of their friends' channels. If someone adds something to his timeline, check whether there are registered listeners for it. If so, publish to those clients (use a queue, push, whatever).
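One possible transport for those channels is Redis pub/sub; the answer doesn't name a technology, so this is only a sketch, assuming phpredis, one channel per user, and a hypothetical pushToClient() delivery helper:

// Publisher: when a user posts, publish the update to his own channel.
$redis->publish("timeline:$userId", json_encode($update));

// Subscriber (a long-running process per connected client): listen on one
// channel per friend and forward anything received. Note subscribe() blocks.
$channels = array_map(fn ($id) => "timeline:$id", $friendIds);
$redis->subscribe($channels, function ($redis, $channel, $message) {
    pushToClient($message); // hypothetical: deliver via websocket, queue, etc.
});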

Filtering data stored in Couchbase

I currently have a read-heavy mobile app (90% reads, 10% writes) that communicates with a single web server through PHP calls and a single MySQL DB. The DB stores user profile information and the messages that users send and receive. We get a few messages per second added to the DB.
I'm in the process of scaling horizontally: load balancing, etc. So we'll have a load balancer in front of a cluster of web servers, and then I plan to put a layer of Couchbase nodes on top of a MySQL cluster so we can have fast access to user profile info and message info. We'll cache all user info in Couchbase, but I want to cache only the latest 24 hours' worth of messages in Couchbase, since that is the timeframe where most of the read activity will happen.
For the message data stored in the cache, I want to be able to filter messages based on various fields of a message, like country, city, time, etc. I know Couchbase uses a KV approach, so I can't query using WHERE clauses like I would with MySQL.
Is there a way to do this? Is Couchbase Views the answer? Or am I totally barking up the wrong tree with Couchbase?
The views in Couchbase Server 2.0 and later are what you're looking for. If the data being put in Couchbase is JSON, you can use those views to perform queries across the data you put in the Couchbase cluster.
Note that you can use a view that emits a date/time as an array (a common technique) and even use that to restrict the view to your time period, so you could, potentially, just store all of your data in Couchbase without needing to put it in another system too. If you have other reasons, though, you can certainly just have the items expire 24 hours after you put them in the cache. Then, if you're using one of the clients that supports it, you'll be able to get-and-touch a document in the cache, extending its expiration if needed. The only downside there is that you'll need to come up with a method of invalidating the document on update.
One way to do that is a trigger in MySQL which would delete the given key; another way is to invalidate it from the application layer.
p.s.: full disclosure: I'm one of the Couchbase folks
