I want to save a lengthy form's inputs at the server. But I don't think making db calls on each auto-save action is the best approach to go for.
What would constitute a good approach to solve this?
Another problem is that I have 3 app servers, so an in-memory cache wouldn't work.
I was thinking of keeping the data in Redis, updating it on every call, and finally updating the db. But since I have 3 servers, how do I make sure the calls are queued in order?
Can anyone help with the architecture?
"But I don't think making db calls on each auto-save action is the best approach to go for."
That's the real question, let's start with that. Why would you think that?
You want auto-save, right? This is the only thing that saves user work.
All the other options you listed (memcached/redis, in-process caching) - not only do they not save user work, they're ticking time bombs. Think of all things that can fail there: redis dies, network is split, the whole data center is hit by lightning.
Why create all the complexity when you can just... save? You may find out that it's not that slow (if this was your concern).
This is a classic problem when scaling an architecture, but let's come to scaling later. Your app makes many calls at the database level, so let's optimize that first.
Since you haven't given any details about your IOPS, I'll describe the approaches below in increasing order of load. The approaches are cumulative: each one builds on the ones before it.
The Direct Database Approach {db calls on each auto-save action}:
client -> application layer -> database (save data here directly)
Here every data update is propagated directly to the database. The main block in this scheme is the database.
Things to take care of in this approach:
For the most popular relational databases, like MySQL:
An insert query generally takes less time than an update query. For a college project you could keep updating the row against the same primary key and it would work beautifully.
But in any mid-sized application, where a single form edit takes 30-40 requests, the update queries would tie up your db resources.
So keep an addendum (append-only) kind of schema: maintain a secondary key such as status that tracks how far the user has filled in the form, keep inserting a new row for each update, and always read the most recently inserted status.
For further optimization, add indexes on the columns you read by (for example the foreign key columns).
When this step is no longer enough, the next thing to optimise is the database itself. Depending on the type of data you're dealing with, you can choose a schema-less db like MongoDB or DynamoDB for non-transactional data.
A schema-less db can be very helpful for large amounts of non-transactional data, as it inherently allows the addendum approach on the same record.
For additional optimization, use secondary indexes to query data faster.
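The addendum schema described above for relational databases can be sketched roughly as follows. This is only an illustration: in-memory SQLite stands in for MySQL, and the table name `form_draft`, its columns, and the helper functions are hypothetical names chosen for the example.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for MySQL in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE form_draft (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id INTEGER NOT NULL,
               status INTEGER NOT NULL,
               payload TEXT NOT NULL
           )"""
    )
    # Index so that "latest draft for this user" does not scan the table.
    conn.execute("CREATE INDEX idx_draft_user ON form_draft (user_id, id)")
    return conn


def autosave(conn, user_id, status, payload):
    # Every auto-save inserts a new row; existing rows are never updated.
    conn.execute(
        "INSERT INTO form_draft (user_id, status, payload) VALUES (?, ?, ?)",
        (user_id, status, payload),
    )
    conn.commit()


def latest_draft(conn, user_id):
    # Always read the most recently inserted row for the user.
    return conn.execute(
        "SELECT status, payload FROM form_draft "
        "WHERE user_id = ? ORDER BY id DESC LIMIT 1",
        (user_id,),
    ).fetchone()
```

Old rows pile up with this design, so a real system would also need some cleanup once the form is finally submitted.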
Involving The Application Layer Approach {Application layer caching }:
client -> application layer (save some data here then propagate later) -> database (Finally save here)
When the database alone is unable to serve and scale as required, the easiest way is to shift some load onto your API servers, since scaling those horizontally is much easier and cheaper.
This is exactly the memory cache approach as you understand it, but do not go about designing your own. It takes years of experience with large-scale web infrastructure, and even then people struggle to design an efficient cache at the application layer.
Use something like Express Session. I was using this for a web app with 10 EC2 Node.js instances, storing about 15 MB of user data per session. It scales beautifully and maintains unique user session data across all servers. On each auto-save, write the data to the user session; on form submit, write from the session to the database. It scales easily by adding more API servers, and it works best with Redis as the session store {Why re-invent the wheel?}
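As a rough illustration of the session flow above (auto-saves go to the session, the database is written once on submit), here is a minimal sketch. It does not use Express Session: the `SessionStore` class is a hypothetical in-process stand-in for a shared session store such as Redis, and the `submissions` table is an assumed name.

```python
import json
import sqlite3


class SessionStore:
    """Hypothetical stand-in for a session store shared by all app servers."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return dict(self._data.get(session_id, {}))

    def set(self, session_id, value):
        self._data[session_id] = dict(value)

    def delete(self, session_id):
        self._data.pop(session_id, None)


def autosave(store, session_id, fields):
    # Merge the changed fields into the session; no database call here.
    draft = store.get(session_id)
    draft.update(fields)
    store.set(session_id, draft)


def submit(store, conn, session_id):
    # One database write per form, taken from the session on final submit.
    draft = store.get(session_id)
    conn.execute(
        "INSERT INTO submissions (session_id, payload) VALUES (?, ?)",
        (session_id, json.dumps(draft, sort_keys=True)),
    )
    conn.commit()
    store.delete(session_id)
    return draft
```

The trade-off is the one raised earlier in the thread: anything that only lives in the session is lost if the session store goes down before submit.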
Re-inventing the wheel : Custom level application layer caching + db caching + db optimize
client -> application layer (save some data here) ->{ Add your db cache}-> database (save data here finally)
This is where we come to designing our own thing, using Redis/DynamoDB/MongoDB as a cache in front of your primary database. (Please note: if you are using a non-transactional db in the first place, go for the addendum approach. This approach is much better suited to scaling transactional databases by wrapping them in a non-transactional database.)
Express Session actually works the same way, by caching the data in Redis at the application layer. Always try to reduce the number of calls to the db as much as possible.
Go for this approach only if you already have fully functioning application-layer caching and an optimized db, as it needs experienced developers and is resource intensive. It is usually employed for very large-scale applications; for example, I had a layer of Redis caching after Express Session for an application that averaged 300k requests per second.
The approach here is to save data in the user session and apply a lazy write-back to your cache. Maintain a cache ledger, then write to your main database. For queuing in such a large-scale system I wrote an entire separate microservice that worked in the background to transport data from session to Redis to MySQL. For your concern about how to maintain a queue, read more about priority queues and background workers. I used Kue, a priority job queue backed by Redis and built for Node.js.
Maintaining parallel and sequential queues
php client for KUE
Background Services
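The lazy write-back with a background worker described above can be sketched with the Python standard library. This is not Kue and not the microservice from the answer: it uses a plain FIFO `queue.Queue`, SQLite instead of MySQL, and it assumes (my assumption, not stated in the thread) that each auto-save carries an increasing `version` number so that a late-arriving older save from another app server cannot overwrite a newer one.

```python
import queue
import sqlite3
import threading


def writer(conn, jobs):
    # Single background worker: drains the queue and writes to the database.
    while True:
        job = jobs.get()
        if job is None:  # sentinel: stop the worker
            break
        user_id, version, payload = job
        row = conn.execute(
            "SELECT version FROM drafts WHERE user_id = ?", (user_id,)
        ).fetchone()
        # Skip stale saves that arrive after a newer version was stored.
        if row is None or version > row[0]:
            conn.execute(
                "INSERT OR REPLACE INTO drafts (user_id, version, payload) "
                "VALUES (?, ?, ?)",
                (user_id, version, payload),
            )
            conn.commit()


def start_writer(conn, jobs):
    thread = threading.Thread(target=writer, args=(conn, jobs))
    thread.start()
    return thread
```

Request handlers would only call `jobs.put((user_id, version, payload))` and return immediately. With three app servers the queue itself would have to live in a shared service; the in-process queue here only shows the shape of the worker.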
This solution has been adopted successfully industry-wide.
Configure a 3-node Redis cluster, which takes care of replicating the data.
Writes happen only on the master node.
redis (master) - app server 1,
redis (slave1) - app server 2,
redis (slave2) - app server 3
Adding a slave is straightforward using the SLAVEOF host port command (REPLICAOF in Redis 5 and later).
Replication is done over the wire (not temporary disk storage)
Link: https://redis.io/topics/replication
I'm working on a site that has a store locator built in.
Having developed similar sites in the past, I ran into trouble when search peaks hit the database (MySQL) hard.
All these past location search engines were querying the database to get the results.
Now I have taken a different approach, but since I'm not 100% sure about it, I thought that asking this great community could either make me feel more secure about this direction or tell me to stick to what I did before.
So for this new search, instead of hitting the database for requests, I'm serving the search with a JSON file that regenerates (querying the database) only when something is updated, created or deleted on the locations list.
My doubt is: can a high load of requests for the JSON file have the same effect as a high load of queries against the database?
Is serving the search results from a JSON file to lower the impact on the db (and server resources) a good approach, or is it a bad idea?
Maybe someone out there had to take the same decision and can share the experience with me, or maybe you just know how things really are and recommend me a certain approach.
Flat files are the poor man's db and can be even more problematic than a heavily pounded database. For example reading and writing the file still requires a lock, and will not scale, as the same file may not be accessible to all app servers.
My suggestion would be any one of the following:
Benchmark your current hardware, identify bottlenecks, scale out or up accordingly.
Implement a caching layer; this will save on costly queries for read-only data.
Consider higher-performance storage solutions such as Aerospike or Redis.
Implement a real full text search engine such as ElasticSearch or SOLR.
Response to comment #1:
You could accomplish the same thing without having to read/write a flat file (which must be accessible by all app servers), by caching the data. Here's a quick and dirty rundown of how I would do it:
Zip + 10 miles:
Query the database, pull the store data, json_encode it, and store it in the cache under a key construct like 92562_10. Now when other users enter 92562 + 10 they will pull the data from the cache instead of the database (or flat file).
City, State + 50 miles:
Same as above, except key construct may look like murrieta_ca_50.
But with the caching layer you get better performance, and the cache server will be available to all your app servers, which would be much easier than having to install/configure NFS to share the file on a network.
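The key construct above can be sketched like this. A plain dict stands in for the shared cache server, and `cache_key`, `find_stores` and the `query_db` callable are hypothetical names used only for the illustration.

```python
import json
import re


def cache_key(location: str, radius_miles: int) -> str:
    # "92562", 10 -> "92562_10"; "Murrieta, CA", 50 -> "murrieta_ca_50"
    slug = re.sub(r"[^a-z0-9]+", "_", location.lower()).strip("_")
    return f"{slug}_{radius_miles}"


def find_stores(cache, location, radius_miles, query_db):
    key = cache_key(location, radius_miles)
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: no database query at all.
        return json.loads(cached)
    # Cache miss: query the database once, then store the encoded result.
    stores = query_db(location, radius_miles)
    cache[key] = json.dumps(stores)
    return stores
```

When a location is created, updated or deleted, the affected keys (or the whole cache) would need to be cleared, much like the JSON file is regenerated in the question.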
I am working on a project with a custom HTML5 front end and a backend I've designed from experience. The backend is composed of a message queue and a cache. Currently I've chosen Beanstalk and Memcache because I'm familiar with them, but I am open to suggestions.
My question though comes from how my coder is interfacing with the MySQL DB we are using to store the data. The idea is to pre-cache most or all of the DB so the site runs really fast. It's not a huge DB so RAM for Memcache shouldn't be an issue. However, my coder is using CodeIgniter with GreenBean. I've never heard of GreenBean before and when I google it I get almost nothing that isn't related to greenbeans the food. What little I could find suggested it was an ORM which fits from what my coder has told me.
The problem is this. With raw PDO my pre-caching scheme is simple - I would grab each row from each table and store it in the cache with a key. Then every time I needed that data I would look at the cache first for it and then the DB. If something is changed on the backend then I only need to update that row in the DB and the associated key in the cache.
With an ORM, if I store the entire ORM object serialized into the cache then it holds a bunch of related data. Data that could be incorrect if something were changed. For example, you have a DB of employees that is linked to the office they work in and the dept they work in. The ORM grabs the office and the dept and we store all of that in the cache. But if the office address changes the ORM object for every employee in that office is now stale/incorrect.
In that example, just letting the cache expire probably isn't an issue most of the time. But in my application, that data should really get updated immediately. So in a simple PDO scheme you flush the cache keys related to the data that changed and every future page call gets the updated data. But with an ORM you have lots and lots of cached object instances that might be incorrect and no good way of finding them. So it seems to me you are now left with some form of indexing of your cached objects and when you change something simple you could be flushing and refilling a big chunk of the cache. The site gets really slow then.
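To make the contrast concrete, here is a minimal sketch of the row-level scheme from the raw PDO description: each row is cached under its own key, an employee row holds only the office id, and changing an office means flushing exactly one key. `RowCache`, `employee_view` and the `load_row` callable are hypothetical names, and a dict stands in for Memcache.

```python
class RowCache:
    def __init__(self, load_row):
        # load_row is a hypothetical loader: (table, row_id) -> dict from the DB.
        self._load_row = load_row
        self._rows = {}

    def get(self, table, row_id):
        key = f"{table}:{row_id}"
        if key not in self._rows:
            self._rows[key] = self._load_row(table, row_id)
        return self._rows[key]

    def invalidate(self, table, row_id):
        # Flush only the key for the row that changed.
        self._rows.pop(f"{table}:{row_id}", None)


def employee_view(cache, employee_id):
    # Related rows are joined at read time from their own cache entries,
    # so no employee entry embeds a copy of the office data.
    employee = cache.get("employee", employee_id)
    office = cache.get("office", employee["office_id"])
    return {"name": employee["name"], "office_address": office["address"]}
```

After an office address changes, `cache.invalidate("office", office_id)` is enough; every employee view picks up the new address on the next read.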
Typically I would just cache a DB result after the first time I needed it, but in this case I think that could end up being really slow for the many users who make the first request for a particular set of data. Additionally, there are some search features that could require a lot of data from the DB. Thus my desire to pre-cache.
So in this case I'm thinking an ORM would hurt the site's performance. I'm thinking I'm not the first person to have this issue though. Is there an ORM out there that would handle this scenario well? Is there a better backend architecture I'm missing?
Thanks
We are building a social website using PHP (Zend Framework), MySQL, server running Apache.
There is a requirement where in dashboard the application will fetch data for different events (there are about 12 events) on which this dashboard for user will be updated. We expect the total no of users to be around 500k to 700k. While at one time on average about 20% users would be online (for peak time we expect 50% users to be online).
So the problem is the event data as per our current design will be placed in a MySQL database. I think running a few hundred thousands queries concurrently on MySQL wouldn't be a good idea even if we use Amazon RDS. So we are considering to use both DynamoDB (or Redis or any NoSQL db option) along with MySQL.
So the question is: Having data both in MySQL and any NoSQL database would give us this benefit to have this power of scalability for our web application? Or we should consider any other solution?
Thanks.
You do not need to duplicate your data. One option is to use the ElastiCache service that Amazon provides to give yourself in-memory caching. This will get rid of most of your database calls and in a sense remove that bottleneck, but it can be very expensive. If you can sacrifice real-time updates, then you can get away with just slowing down the requests or caching data locally for the user. Say, cache the next N events in the browser if possible and display them instead of making another request to the servers.
If it has to be real time, then look at ElastiCache and tune how many nodes you require to handle your estimated amount of traffic. There is no point in duplicating your data. Keep it in a single DB if it makes sense to keep it there. If you have some relational information that you need and also a variable-schema part, then you can use both databases, but not to load balance between them.
I would also start to think about the bottlenecks in your architecture and about how well your application will/can scale in the event that you reach your estimated numbers.
I agree with @sean, there's no need to duplicate the database. Have you thought about something with auto-scalability, like Xeround? A solution like that can scale out automatically across several nodes when you have throughput peaks and later scale back in, so you don't have to commit to a larger, more expensive instance just because of seasonal peaks.
Additionally, if I understand correctly, no code changes are required for this auto-scalability. So, I’d say that unless you need to duplicate your data on both MySQL and NoSQL DB’s for reasons other than scalability-related issues, go for a single DB with auto-scaling.
Replication
I have an app that is polling data from a large number of data feeds. It processes thousands of records per day and this number is ever increasing. The data is stored in MySQL.
I then have a website that utilises this data.
I'm trying to build my environment with the future in mind.
I thought of MySQL replication so that the website can use its own database on a different server and not get bogged down by the thousands of write commands that are happening on the main database.
I am having difficulty getting this set up, despite MySQL reporting it's all working fine.
I then started to think: is there not a better way?
From what I understand, MySQL sends the same write commands to the slave database as to the master.
Does this not mean that what I am trying to avoid is just happening anyway?
Does this mean that the slave database will suffer thousands of writes?
I am a one-man band, doing this venture with my own money, so I need to do this the cheapest way. I am getting a bit lost!
I have a dedicated server,
A vps
Using Php5, mysql 5 in a lamp stack.
I cannot begin to tell you how much I would appreciate some guidance!
If the slaves are a 1:1 clone of the master, then all writes to the master MUST be propagated down to the slaves. Otherwise replication would be useless.
Thousands of records per day is actually very small. Assuming the same processing time for each, and doing 5000 records, you'd have 86400/5000 = 17.28 seconds per record. That's very minimal write overhead.
If you were doing millions of records a day, THEN you'd have a write bottleneck.
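The arithmetic above is easy to redo for other volumes. A small sketch; the 5,000,000 figure in the usage note is just an illustrative value for "millions of records a day", not a number from the question.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def seconds_per_record(records_per_day: int) -> float:
    # Average time budget per record if writes are spread evenly over the day.
    return SECONDS_PER_DAY / records_per_day


def records_per_second(records_per_day: int) -> float:
    # The same figure expressed as an average write rate.
    return records_per_day / SECONDS_PER_DAY
```

For 5000 records a day, `seconds_per_record(5000)` gives 17.28, matching the figure above. For 5,000,000 records a day, `records_per_second(5_000_000)` comes to roughly 58 writes per second on average, and real traffic is usually burstier than the average.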
I would split this in three layers.
Data Feed layer. Data read from the feeds is preprocessed and posted into a queue. The queue also serves as temporary storage, a buffer that lets every data feed post its data. I'd use a message queue system. It's fast and reliable.
Data Store layer. This layer reads from the queue, processes the data further if needed, and stores it in the database.
Data Analysis layer. This is your "slave" database. It's a data warehouse. It periodically runs ETL (extract, transform and load) from the Data Store layer into this secondary database.
This layered approach allows you to isolate concerns (speed, reliability, security) and implementation details, and allows for future scalability.
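The three layers above can be sketched in a few functions. This is only a toy: `queue.Queue` stands in for a real message queue system, two SQLite connections stand in for the store and warehouse databases, and the `readings` and `source_summary` tables are hypothetical.

```python
import queue
import sqlite3


def feed_layer(records, buffer):
    # Data Feed layer: preprocess each record and post it to the queue.
    for source, value in records:
        buffer.put((source.strip().lower(), float(value)))


def store_layer(buffer, store):
    # Data Store layer: drain the queue and insert the batch in one go.
    batch = []
    while True:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    store.executemany("INSERT INTO readings (source, value) VALUES (?, ?)", batch)
    store.commit()
    return len(batch)


def analysis_layer(store, warehouse):
    # Data Analysis layer: periodic ETL from the store into the warehouse.
    rows = store.execute(
        "SELECT source, COUNT(*), AVG(value) FROM readings GROUP BY source"
    ).fetchall()
    warehouse.execute("DELETE FROM source_summary")
    warehouse.executemany(
        "INSERT INTO source_summary (source, n, avg_value) VALUES (?, ?, ?)", rows
    )
    warehouse.commit()
```

The website would read only from the warehouse connection, so feed writes never compete with page queries on the same database.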
Replication is literally what the word suggests: replicating queries on another machine.
MySQL creates a log filled with the queries that were used to create the dataset on the original machine (master) and sends it to the slave(s), which read the log and re-execute those queries.
Basically, what you want is to increase your write throughput. That's achievable through using different storage engines; TokuDB is one of them (it isn't free, but you are allowed to store 50 GB of user data for free).
What you want (for the moment) is a fast HDD subsystem more than a monolithic write-scalable storage system. InnoDB is capable of handling a lot of queries per second on a properly configured machine with sufficient hardware. I am not sure about pricing, but an SSD and 4-8 GB of RAM shouldn't be that expensive. As Marc B said, until you reach millions of records per day you don't have to worry about scaling reads and writes through replication.
You say you have an app "polling" your data from data feeds. Does that mean you are doing full text searches? I'm making an assumption here that you are batch processing data feeds and then querying that. If that is the case, I'd offload all your full-text queries to something like Solr. It actually isn't too time consuming to set up, depending on the size of your DB you can get away with running it on a fairly small VPS or on your dedicated server, and best of all the difference in search speed is incredible. I've had full-text MySQL queries that would take 20 minutes to run be done in Solr in under a second.
Just make sure you use a try statement in the event your solr instance goes down.
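The try statement mentioned above might look like this in outline. Both `solr_search` and `mysql_search` are hypothetical callables supplied by the application, not functions from any Solr client, and which exceptions a real client raises when Solr is down depends on the library used.

```python
def search(term, solr_search, mysql_search):
    """Try the search engine first; fall back to the database if it is down."""
    try:
        return solr_search(term)
    except (ConnectionError, TimeoutError):
        # Solr is unreachable: degrade to the slower database query
        # instead of failing the whole page.
        return mysql_search(term)
```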
I constantly read on the Internet how it's important to correctly architect my PHP applications so that they can scale.
I have built a simple/small CMS that is written in PHP (think of Wordpress, but waaaay simpler).
I essentially have URLs like such: http://example.com/?page_id=X where X is the id in my MySQL database that has the page content.
How can I configure my application to be load balanced when I'm simply performing PHP read activities?
Would something like Nginx as the front door setup to route traffic to multi-nodes running my same code to handle example.com/?page_id=X be enough to "load balance" my site?
Obviously, MySQL is not being load balanced in this situation, though for simplicity - that makes that out of scope for this question.
These are some well known techniques for scaling such an app.
Reduce DB hits
Most often the bottleneck will be your DB, so cache recent pages to reduce DB activity, perhaps in something like memcached.
Design your schema such that it is partition-able.
In the simplest case, separate your data into logical partitions, and store each partition in a separate mysql DB. Craigslist, for example, partitions data by city, and in some cases, by section within that. In your case, you could partition by Id quite simply.
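Partitioning by id, as suggested above, can be as simple as the following sketch. The shard host names are made up for the example, and plain modulo is only the simplest possible scheme.

```python
# Hypothetical database hosts, one per partition.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for(page_id: int, shards=SHARDS) -> str:
    # The same page_id always maps to the same database.
    return shards[page_id % len(shards)]
```

Note that with modulo, adding a shard later changes where most ids map, so existing rows would have to be moved.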
Manage php sessions
Putting nginx in front of a PHP website will not work out of the box if you use sessions. Load balancing PHP has issues because sessions are persisted on local storage by default. Therefore you need to do session management explicitly. The traditional solution is to use memcached to store the session data and look it up by some kind of cookie.
Don't optimize prematurely.
Focus on getting your application out so that the next magnitude of current users gets the optimal experience.
Note: Your main potential pain points are discussed here on SO
No, it is not at all important to scale your application if you don't need to.
My view on this is:
Make it work
Make sure it works correctly - testability, robustness
Make it work efficiently enough to be cost effective to run
Then, if you have so much traffic that your system cannot handle it, AND you've already thrown all the hardware that (sensible) money can buy at it, then you need to scale. Not sooner.
Yes it is relatively easy to scale read-workloads, because you can simply perform reads against readonly database replicas. The challenge is to scale write-workloads.
A lot of sites have few writes, even if they're really busy.
The correct approach is to use some kind of load balancer such as:
http://www.softwareprojects.com/resources/programming/t-how-to-install-and-configure-haproxy-as-an-http-loa-1752.html
What this does is forward a given user session only to a given server, so you don't have to worry about sessions and where they are stored at all. What you do have to worry about is how to distribute the filesystem if the 2 servers are running on two different machines, especially if you make heavy use of the filesystem. Hope the article above helps...
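The idea of pinning a session to one server can be illustrated in a few lines. This is not how HAProxy is configured or implemented; it is just a sketch of one way to map a session id to a backend deterministically, with `backend_for` and the backend names being hypothetical.

```python
import hashlib


def backend_for(session_id: str, backends: list) -> str:
    # A stable hash of the session id picks the backend, so every request
    # carrying the same session id lands on the same server.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]
```

If a backend is removed from the list, sessions that mapped to it (and many others) move to a different server and lose their locally stored session data, which is why a shared session store is often preferred.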