filtering data stored in Couchbase

I currently have a read-heavy mobile app (90% reads, 10% writes) that communicates with a single web server through PHP calls and a single MySQL database. The database stores user profile information and the messages users send and receive. A few messages per second are added to the database.
I'm in the process of scaling horizontally with load balancing, so we'll have a load balancer in front of a cluster of web servers. I then plan to put a layer of Couchbase nodes in front of a MySQL cluster so we get fast access to user profile and message data. We'll cache all user info in Couchbase, but only the latest 24 hours' worth of messages, since that is the timeframe where most of the read activity happens.
For the cached messages, I want to be able to filter on various message fields such as country, city, time, etc. I know Couchbase uses a key-value approach, so I can't query with WHERE clauses like I would in MySQL.
Is there a way to do this? Are Couchbase views the answer? Or am I barking up the wrong tree with Couchbase?

The views in Couchbase Server 2.0 and later are what you're looking for. If the data you put in Couchbase is JSON, you can use views to query across the data stored in the cluster.
Note that a view can emit a date/time as an array (a common technique), and you can use that key to restrict a query to a time period. So you could, potentially, store all of your data in Couchbase without needing another system. If you have other reasons for keeping MySQL, you can certainly have the items expire 24 hours after you put them in the cache. Then, if you're using a client that supports it, you can get-and-touch a document to extend its expiration when needed. The only downside is that you'll need a way to invalidate the cached document when it is updated.
One way to do that is a trigger in MySQL that deletes the given key; another is to invalidate it from the application layer.
P.S. Full disclosure: I'm one of the Couchbase folks.
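A rough illustration of the date-array technique described in this answer. Real Couchbase views are defined as JavaScript map functions on the server; the Python below only simulates the idea with an in-memory sorted index. The field names (type, country, sent_at) and the helper names are assumptions made for this sketch, not part of any Couchbase API.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def map_message(doc):
    """Plays the role of a view's map function: emit (key, value) rows."""
    if doc.get("type") != "message":
        return
    sent = datetime.fromisoformat(doc["sent_at"])
    # Compound key: country first, then the date broken into its parts.
    key = (doc["country"], sent.year, sent.month, sent.day, sent.hour, sent.minute)
    yield key, doc["id"]


def build_index(docs):
    """Collect every emitted row and keep the rows sorted by key."""
    rows = [row for doc in docs for row in map_message(doc)]
    return sorted(rows, key=lambda row: row[0])


def query(index, startkey, endkey):
    """Return the values whose keys fall in [startkey, endkey], like a key-range query."""
    keys = [key for key, _ in index]
    lo = bisect_left(keys, startkey)
    hi = bisect_right(keys, endkey)
    return [value for _, value in index[lo:hi]]


# Hypothetical documents, for illustration only.
docs = [
    {"id": "m1", "type": "message", "country": "US", "sent_at": "2013-01-05T10:30:00"},
    {"id": "m2", "type": "message", "country": "US", "sent_at": "2013-01-03T08:00:00"},
    {"id": "m3", "type": "message", "country": "DE", "sent_at": "2013-01-05T11:00:00"},
    {"id": "u1", "type": "user", "name": "example"},
]
index = build_index(docs)

# All US messages sent on 2013-01-05.
us_jan_5 = query(index, ("US", 2013, 1, 5, 0, 0), ("US", 2013, 1, 5, 23, 59))
```

Because the keys sort field by field, putting country first and the date parts after it lets one range cover "this country, this time window". Filtering on a different leading field (for example city) would need its own view.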

Related

MySQL or JSON for data retrieval

So, I have a situation and need a second opinion. I have a database and it works great with all its foreign keys, indexes and so on, but when I reach a certain number of visitors, around 700-800 concurrent visitors, my server hits a bottleneck and displays "Service temporarily unavailable." So I had an idea: what if I pull data from JSON instead of the database? I would still update the database, but on each update I would regenerate a JSON file and read from it to show data on my homepage. That way I would not push my CPU too hard, and I would be able to set up some kind of cache on the user end.

What you are describing is caching.
Yes, it's a common optimization to avoid over-burdening your database with query load.
The idea is you store a copy of data you had fetched from the database, and you hold it in some form that is quick to access on the application end. You could store it in RAM, or in a JSON file. Some people operate a Memcached or Redis in-memory database as a shared resource, so your app can run many processes or threads that access the same copy of data in RAM.
It's typical that your app reads some given data many times for every single time it updates the data. The greater this ratio of reads to writes, the better the savings in terms of lightening the load on your database.
It can be tricky, however, to keep the data in cache in sync with the most recent changes in the database. In other words, how do all the cache copies know when they should re-fetch the data from the database?
There's an old joke about this:
There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton

So after another few days of exploring and trying to find the right answer, this is what I have done. I decided to create another table, instead of a JSON file, and put all the data that was supposed to go in the JSON file into that table.
WHY?
Number one: MySQL can lock tables while they're being updated; a JSON file cannot.
Number two: I go from a few dozen queries down to just one, the simplest possible query: SELECT * FROM table.
Number three: I have better control over the content this way.
Number four: while searching for an answer I found that some people had availability issues when many concurrent connections requested the same JSON file. With a table I should not have that problem.

NewsFeed with Redis At Scale Strategy

I'm running into an interesting dilemma while trying to solve a scale problem.
Currently we have a social platform with a pretty typical feed. We are using a graph database, and each time a user requests a feed we hit the DB. While this is fine now, it will come to a grinding halt as we grow our user base. Enter Redis.
Currently we store things like comments, likes and such in individual Redis keys as JSON-encoded strings by post ID, and update them when there are updates, additions or deletes. Then in code we loop through the DB results of posts and pull in the data from the Redis store. This causes multiple calls to Redis to construct each post, which is still far better than touching the DB each time. The challenge is keeping up with changing data associated with each individual post, such as commenters' and likers' avatars, screen names, closed accounts, new likes, new comments, etc.
I am trying to decide on the most effective strategy to handle this. Redis will only take us so far, since we will top out at about 12 GB of RAM per machine.
One of the concepts under discussion is to use a beacon for each user that stores new post IDs. When a user shares something new, the beacons of all their connected friends get the post ID, so that when one of those friends logs in, their feed is seen as dirty and requiring an update. The feed is then stored by ID in a Redis sorted set ordered by timestamp. To retrieve the feed data we can do a single query by IDs rather than a full traversal, which is hundreds of times faster. That solves part of the feed-building problem, but it still does not solve the problem of the interacting users' account information, likes and comments, which are ever changing.
Another idea is to store a user's entire feed (JSON encoded) in a MySQL record and update it on the fly when the user requests it and the beacon shows a dirty feed. Otherwise it's just a single select and JSON decode to build the feed. Again, the dynamic components are the hurdle.
Has anyone dealt with this challenge successfully, or does anyone have working knowledge of a strategy to approach this problem?
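One way to picture the beacon idea described in this question is the sketch below. It is only an in-memory simulation: plain Python dictionaries stand in for the Redis structures, and every name (friends, beacons, feeds, share_post, load_feed) is hypothetical rather than taken from the asker's system.

```python
from collections import defaultdict

# Hypothetical friend graph: author -> friends who should see the author's posts.
friends = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": []}

beacons = defaultdict(list)  # user -> [(timestamp, post_id)] not yet merged into the feed
feeds = defaultdict(list)    # user -> [(timestamp, post_id)], stand-in for a sorted set


def share_post(author, post_id, timestamp):
    """Fan out on write: drop the new post ID into each friend's beacon."""
    for friend in friends.get(author, []):
        beacons[friend].append((timestamp, post_id))


def is_dirty(user):
    """A feed is dirty when its beacon holds post IDs that are not merged yet."""
    return bool(beacons[user])


def load_feed(user, limit=20):
    """Merge pending beacon entries if needed, then return the newest post IDs."""
    if is_dirty(user):
        feeds[user].extend(beacons[user])
        feeds[user].sort(reverse=True)  # newest first
        beacons[user].clear()
    return [post_id for _, post_id in feeds[user][:limit]]


share_post("alice", "p1", 100)
share_post("alice", "p2", 200)
bob_feed = load_feed("bob")
```

The returned IDs would then be used for a single fetch by ID. As the question notes, this covers building the feed only; the changing per-post data (avatars, likes, comments) still has to be refreshed separately.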

"Currently we store things like comments, likes and such in individual Redis keys in JSON encoded strings by post ID"
Use a more efficient serializer, like igbinary or msgpack. I suggest igbinary (check http://kiss-web.blogspot.com/).
"Then in code we loop through the DB results of posts and pull in the data from the Redis store."
Be sure to use pipelining for maximum performance.
"This is causing multiple calls to Redis to construct each post, which is far better than touching the DB each time."
Do not underestimate the power of DB primary keys. Try doing the same thing (not a join, but a select by keys) in your DB:
SELECT * FROM comments WHERE id IN (1,2,3,4,5,6);
A single Redis call is faster than a single DB call, but many Redis calls (even pipelined) are not much faster than one SQL query on primary keys, especially when you give your DB enough memory for buffers.
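A minimal sketch of the select-by-keys query above, issued from application code with bound parameters. It uses Python's built-in sqlite3 module and an in-memory database purely so the example runs on its own; the comments table layout and the fetch_comments helper are assumptions for illustration, and the same pattern applies to MySQL through its own driver.

```python
import sqlite3


def fetch_comments(conn, comment_ids):
    """Fetch many rows by primary key in a single query, keyed by id."""
    if not comment_ids:
        return {}
    # One placeholder per id keeps the values parameterized.
    placeholders = ",".join("?" for _ in comment_ids)
    sql = f"SELECT id, body FROM comments WHERE id IN ({placeholders})"
    return {row[0]: row[1] for row in conn.execute(sql, list(comment_ids))}


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO comments (id, body) VALUES (?, ?)",
    [(i, f"comment {i}") for i in range(1, 11)],
)

found = fetch_comments(conn, [1, 2, 3, 42])  # id 42 does not exist and is simply absent
```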
You can combine the DB and Redis by caching DB data in Redis, along these lines:
Every time you update data, update it in the DB and delete it from Redis.
When fetching data, first try to get it from Redis. If it is not found there, look it up in the DB and insert it into Redis for future use with some expire time.
That way Redis holds only useful data. Unused (old) data stays only in the DB.
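The update and fetch steps above can be sketched as follows. This is an in-memory illustration only: the TTLCache class is a tiny stand-in for Redis get, set-with-expiry and delete, the database dictionary stands in for the real DB, and all names are hypothetical.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a cache with get / set-with-expiry / delete."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def delete(self, key):
        self._data.pop(key, None)


database = {}            # stand-in for the real DB
cache = TTLCache()
stats = {"db_reads": 0}  # counts how often the DB is actually read


def save_post(post_id, data):
    """On update: write to the DB, then delete the cached copy."""
    database[post_id] = data
    cache.delete(f"post:{post_id}")


def load_post(post_id, ttl=300):
    """On read: try the cache first, fall back to the DB and repopulate."""
    key = f"post:{post_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    stats["db_reads"] += 1
    data = database.get(post_id)
    if data is not None:
        cache.set(key, data, ttl)
    return data
```

The expire time acts as a safety net: even if a delete is missed somewhere, a stale entry disappears after ttl seconds.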

You may want to use MongoDB as @solocommand has described, but you may also want to drop the expectation that you "update" data on demand. Instead, push your users' changes into a "write" queue, which then updates the database as needed. Then you can load from the database (MongoDB) and work with it as needed, or update the other Redis records.
Moving to a messaging system such as Amazon SQS, IronMQ, or RabbitMQ may help you scale better. You can also use Redis queues as a basic message bus.
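A small sketch of the write-queue idea, assuming a single background worker. Python's standard queue module stands in for a real message system such as the ones named above, the database dictionary stands in for the real store, and the function names are hypothetical.

```python
import queue
import threading

write_queue = queue.Queue()
database = {}  # stand-in for the real database


def enqueue_change(user_id, field, value):
    """Request handlers only enqueue the change and return immediately."""
    write_queue.put((user_id, field, value))


def writer():
    """Background worker: applies queued changes to the database one by one."""
    while True:
        user_id, field, value = write_queue.get()
        try:
            database.setdefault(user_id, {})[field] = value
        finally:
            write_queue.task_done()


# Daemon thread so the worker does not keep the process alive on exit.
threading.Thread(target=writer, daemon=True).start()

enqueue_change(7, "avatar", "new.png")
enqueue_change(7, "screen_name", "sam")
write_queue.join()  # wait until the worker has applied everything queued so far
```

The trade-off is that reads may briefly see old data until the worker catches up, which is the "stop expecting on-demand updates" point made above.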

Using Memcache, Should I use PDO or an ORM?

I am working on a project with a custom HTML5 front end and a backend I've designed from experience. The backend is composed of a message queue and a cache. Currently I've chosen Beanstalk and Memcache because I'm familiar with them, but I am open to suggestions.
My question though comes from how my coder is interfacing with the MySQL DB we are using to store the data. The idea is to pre-cache most or all of the DB so the site runs really fast. It's not a huge DB so RAM for Memcache shouldn't be an issue. However, my coder is using CodeIgniter with GreenBean. I've never heard of GreenBean before and when I google it I get almost nothing that isn't related to greenbeans the food. What little I could find suggested it was an ORM which fits from what my coder has told me.
The problem is this. With raw PDO my pre-caching scheme is simple - I would grab each row from each table and store it in the cache with a key. Then every time I needed that data I would look at the cache first for it and then the DB. If something is changed on the backend then I only need to update that row in the DB and the associated key in the cache.
With an ORM, if I store the entire ORM object serialized into the cache then it holds a bunch of related data. Data that could be incorrect if something were changed. For example, you have a DB of employees that is linked to the office they work in and the dept they work in. The ORM grabs the office and the dept and we store all of that in the cache. But if the office address changes the ORM object for every employee in that office is now stale/incorrect.
In that example, just letting the cache expire probably isn't an issue most of the time. But in my application, that data should really get updated immediately. So in a simple PDO scheme you flush the cache keys related to the data that changed and every future page call gets the updated data. But with an ORM you have lots and lots of cached object instances that might be incorrect and no good way of finding them. So it seems to me you are now left with some form of indexing of your cached objects and when you change something simple you could be flushing and refilling a big chunk of the cache. The site gets really slow then.
Typically I would just cache a DB result after the first time I needed it, but in this case I think that could end up being really slow for the many users who make the first request for a particular set of data. Additionally, there are some search features that could require a lot of data from the DB. Hence my desire to pre-cache.
So in this case I'm thinking an ORM would hurt the site's performance. I'm thinking I'm not the first person to have this issue though. Is there an ORM out there that would handle this scenario well? Is there a better backend architecture I'm missing?
Thanks
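The "indexing of your cached objects" that this question worries about could look roughly like the sketch below: each cached object records which rows it was built from, so a change to one row can find and drop every object that depends on it. All names and the key format are hypothetical, and plain dictionaries stand in for the cache.

```python
from collections import defaultdict

cache = {}                     # cache key -> cached (serialized) object
dependents = defaultdict(set)  # row tag -> cache keys built from that row


def cache_put(key, value, depends_on):
    """Store an object and remember every row it was built from."""
    cache[key] = value
    for tag in depends_on:
        dependents[tag].add(key)


def invalidate(tag):
    """Drop every cached object that was built from the changed row."""
    for key in dependents.pop(tag, set()):
        cache.pop(key, None)


# Employee objects that embed data from their office row.
cache_put("employee:1", {"name": "Ann", "office": "1 Main St"}, ["employee:1", "office:10"])
cache_put("employee:2", {"name": "Raj", "office": "1 Main St"}, ["employee:2", "office:10"])
cache_put("employee:3", {"name": "Lee", "office": "9 Oak Ave"}, ["employee:3", "office:20"])

invalidate("office:10")  # the address of office 10 changed
remaining = sorted(cache)
```

The example also shows the cost described in the question: one office change evicts every employee cached for that office, whereas caching plain rows under their own keys would touch a single entry.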

Scalable web application

We are building a social website using PHP (Zend Framework), MySQL, server running Apache.
There is a requirement that the dashboard fetches data for different events (there are about 12 events), based on which each user's dashboard is updated. We expect the total number of users to be around 500k to 700k. On average about 20% of users would be online at any one time (at peak time we expect 50% of users to be online).
So the problem is that, as per our current design, the event data will be placed in a MySQL database. I think running a few hundred thousand queries concurrently on MySQL wouldn't be a good idea, even if we use Amazon RDS. So we are considering using DynamoDB (or Redis, or any other NoSQL option) along with MySQL.
So the question is: would having the data in both MySQL and a NoSQL database give us the scalability we need for our web application? Or should we consider another solution?
Thanks.

You do not need to duplicate your data. One option is to use the ElastiCache service that Amazon provides to give yourself in-memory caching. This will get rid of many of your database calls and in a sense remove that bottleneck, but it can be very expensive. If you can sacrifice real-time updates, you can get away with just slowing down the requests or caching data locally for the user. Say, cache the next N events in the browser if possible and display them instead of making another request to the servers.
If it has to be real time, look at ElastiCache and then tune how many nodes you need to handle your estimated traffic. There is no point in duplicating your data. Keep it in a single DB if it makes sense to keep it there. If you have some relational information and also some data with a variable schema, you can use both databases, but not to load balance them against each other.
I would also start looking for bottlenecks in your architecture and think about how well your application will scale if you reach your estimated numbers.
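To put the estimated numbers from the question into perspective, here is a back-of-the-envelope calculation. The user counts, online percentages and the figure of 12 event types come from the question; the 30-second refresh interval and the idea of one query per event type per refresh are assumptions chosen only to show how the load figure is derived.

```python
def concurrent_users(total_users, online_percent):
    """Number of users online for a given total and online percentage."""
    return total_users * online_percent // 100


avg_low = concurrent_users(500_000, 20)    # 100000
avg_high = concurrent_users(700_000, 20)   # 140000
peak_low = concurrent_users(500_000, 50)   # 250000
peak_high = concurrent_users(700_000, 50)  # 350000

REFRESH_INTERVAL_SECONDS = 30  # assumption: each dashboard refreshes every 30 s
EVENT_TYPES = 12               # from the question


def queries_per_second(users_online, queries_per_refresh, interval_seconds):
    """Average query rate if every online user refreshes once per interval."""
    return users_online * queries_per_refresh / interval_seconds


# Worst case under these assumptions: one query per event type at peak.
worst_case_qps = queries_per_second(peak_high, EVENT_TYPES, REFRESH_INTERVAL_SECONDS)
# Best case: all event data fetched with a single query per refresh.
best_case_qps = queries_per_second(peak_high, 1, REFRESH_INTERVAL_SECONDS)
```

Under these assumptions the gap between the two cases is a factor of 12, which suggests that batching or caching the per-event lookups matters at least as much as the choice of datastore.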

I agree with @sean, there's no need to duplicate the database. Have you thought about something with auto-scalability, like Xeround? A solution like that can scale out automatically across several nodes when you have throughput peaks and later scale back in, so you don't have to commit to a larger, more expensive instance just because of seasonal peaks.
Additionally, if I understand correctly, no code changes are required for this auto-scalability. So I'd say that unless you need to duplicate your data on both MySQL and NoSQL DBs for reasons other than scalability, go for a single DB with auto-scaling.

How to speed up my PHP app with Memcached

I've been working on a website lately and want to speed up my application.
I want to cache my users' pages, but the pages are dynamic: if someone posts a new feed, the homepage is updated with that new feed. If I cache the homepage for one user and a friend of his posts a new feed, I want that cache entry to expire, so that the next time he visits the homepage the application contacts the database, fetches the new feeds and caches them.
I'm using memcache with PHP, and MySQL for my DB.
I have tables called friends, feeds and users.
Would it be efficient to cache every user's friends and, when that user posts a feed, have my app fetch his/her friends and cache a notification with their user ID, so that when those friends log in the app checks on every page whether there is a notification and takes action (in this case deleting the homepage from the cache)?
Regards,
Resul

Profile your application and locate the places where you access data that is expensive to fetch (or calculate). Those are good places to start with memcached, unless you're doing more writes than reads (in which case you'd likely have to update the cache more often than you could make use of it).
Caching everything you ever access could well lead to nothing more than a quite full memcached that holds mostly rarely accessed data (while potentially pushing out of the cache the things you actually should cache). In many cases you shouldn't use memcached as a 1:1 copy of your database in key-value form.
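The "pushing things out" effect can be seen in a simplified model. The sketch below is a plain least-recently-used cache with a fixed capacity; it is not memcached's actual implementation, and the keys and values are made up for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the oldest entry


cache = LRUCache(capacity=3)
cache.set("hot:homepage", "expensive result")

# Caching "everything": a burst of rarely read items fills the cache.
for i in range(3):
    cache.set(f"rare:{i}", "cheap, rarely read")

hot_was_evicted = cache.get("hot:homepage") is None
```

Here the expensive entry is gone after only three cheap writes, so the next request for it pays the full cost again. Being selective about what goes into the cache avoids that.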

Before you even start server-side optimizations, you should run YSlow and try to get an A rating. Take a hard look at your JavaScript too. If you are using jQuery, then getting rid of it would vastly improve the overall performance of the site. Front-end optimization is usually much more important.
The next step would be optimizing and cleaning up the server-side code. Try testing your SQL queries with EXPLAIN. See if you are missing some indexes. And then do some profiling on the PHP side with Xdebug. See where the bottlenecks are.
And only then start messing with caching. As for Memcached, unless your website runs on top of a cluster of servers, you do not need it. Hell, it might even be harmful. If your site is located on a single box, you will get much better results with APC, which, unlike Memcached, is not distributed by nature.
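To illustrate the EXPLAIN step mentioned above, the sketch below checks a query plan before and after adding an index. It uses SQLite's EXPLAIN QUERY PLAN through Python's built-in sqlite3 module only so that it runs on its own; MySQL's EXPLAIN output looks different, and the feeds table and index name are assumptions for the example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feeds (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")

sql = "SELECT body FROM feeds WHERE user_id = ?"

# Without an index on user_id the plan is a full table scan.
plan_before = query_plan(conn, sql, (1,))

conn.execute("CREATE INDEX idx_feeds_user_id ON feeds (user_id)")

# With the index in place the plan searches via idx_feeds_user_id instead.
plan_after = query_plan(conn, sql, (1,))
```

Printing plan_before and plan_after side by side shows whether a query is missing an index, which is the check suggested above before reaching for a cache.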

Write a class that handles all the DB queries, caches the tables, and runs the queries against the cached tables instead of your DB. Update your cache each time you do an INSERT or an UPDATE on a table.
