I've got a gaming-oriented website with 200+ users. The site has a large database tracking user plays, and one of the motivations for continued participation is the extensive statistics and rankings (S&R) the site provides to each user.
As the list of S&Rs tracked has grown, some of the more intricate calculations have been moved into tables within the database, rather than being generated on the fly, in order to improve page-loading speed.
However, I plan to move from extensive to exhaustive S&Rs by the end of the year, increasing the overall number of datapoints available to the user by a factor of 10. I've already decided to stop doing on-the-fly queries and to move all the calculations to a cron job, but I'm unsure where to store the data.
Given a user base <1000, would it make more sense to place this data within the database or read/write a text file for each user's stats?
These are the main pros and cons in my mind:
Storing S&Rs in the Database
+ cross-user comparisons are easy and fast
+ faster cron jobs because there's no need to write to many, many files
- database table count will jump from ~50 to 200+ (at least)
- one point of failure (database corruption) for all site data
- modifying S&R structure requires modifying database as well
Storing S&Rs in Text Files
+ neatly organized and distributes data corruption risk
+ database is easier to navigate
+ redesigning S&R structure is done by simply modifying the script and overwriting all text files, rather than adjusting database tables
- cron job will have to read/update XXX files each time
- cross-user comparisons are difficult and time-consuming
But I've never done something of this magnitude before, so I'm not really sure (for example) whether a 200+ table MySQL database is even really a problem.
I'd appreciate any suggestions you can provide! :-)
Any popular database software should be able to handle millions of entries; having 200+ tables is not an issue on that end.
Corruption is unlikely, but on a site of that nature you should be doing backups fairly frequently, and preferably storing a copy off the server. Using individual files distributes the risk and decreases the likelihood of a general failure, but there is still a small chance of problems occurring.
Database software excels at performing operations on its data; using flat files would probably force you to write your own code to process them, which could easily prove to be a major task, and likely at the cost of speed compared to using a database (I'm just assuming this, I might be very wrong).
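To make the cross-user comparison point concrete, here is a minimal sketch of a ranking query against a hypothetical user_stats table (the table, its columns, and the connection details are all just assumptions). With the data in MySQL this is one query; with per-user text files you would have to open and parse every file to build the same list.

    <?php
    // Hypothetical stats table filled by the nightly cron job:
    //   user_stats(user_id INT, stat_key VARCHAR(64), stat_value DECIMAL(12,4))
    $pdo = new PDO('mysql:host=localhost;dbname=game', 'user', 'pass');

    $stmt = $pdo->prepare(
        'SELECT user_id, stat_value
           FROM user_stats
          WHERE stat_key = :stat
          ORDER BY stat_value DESC
          LIMIT 25'
    );
    $stmt->execute([':stat' => 'win_rate']);
    $leaderboard = $stmt->fetchAll(PDO::FETCH_ASSOC);   // top-25 ranking, one round trip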
Related
I currently have a MySQL database that handles a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) arriving in real time. The structure is such that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the click balance by 1 (there is actually more processing, depending on the event) for each of the user, the sub-affiliate, and the affiliate. Currently I do it very simply - once I receive the event, I run sequential queries in PHP: I read the user's balance, increment it by one and store the new value, then I read the sub-affiliate's balance, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as close to real time as possible. The other metrics at the sub-affiliate and affiliate level are less important, but the closer they are to real time the better; I think a 5-minute delay might be OK.
As the project grows, this is already becoming a bottleneck, and I am now looking at alternatives - how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million events per day. It is also important for me not to lose a single event, and I actually wrap each cycle of changes to click balances in an SQL transaction.
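Roughly, the current per-event cycle looks like this (table and column names are simplified, and the IDs come from the incoming event):

    <?php
    // Simplified sketch of the current approach: three sequential
    // read-increment-write round trips per event, wrapped in one transaction.
    $pdo = new PDO('mysql:host=localhost;dbname=tracker', 'user', 'pass');

    $targets = [
        ['users',          $userId],     // from the incoming event
        ['sub_affiliates', $subAffId],
        ['affiliates',     $affId],
    ];

    $pdo->beginTransaction();
    try {
        foreach ($targets as $t) {
            list($table, $id) = $t;

            $stmt = $pdo->prepare("SELECT clicks FROM $table WHERE id = ?");
            $stmt->execute([$id]);
            $balance = (int)$stmt->fetchColumn() + 1;   // read + increment

            $stmt = $pdo->prepare("UPDATE $table SET clicks = ? WHERE id = ?");
            $stmt->execute([$balance, $id]);            // write back
        }
        $pdo->commit();
    } catch (Exception $e) {
        $pdo->rollBack();
        throw $e;
    }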
Some things I am considering:
1 - Create a cron job that updates the balances at the sub-affiliate and affiliate level not in real time, say every 5 minutes.
2 - Move the number crunching and balance updates into the database itself by using stored procedures. I am considering adding a separate database; maybe Postgres would be better suited for the job? I tried to see if there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like Hadoop with Parquet (or Apache Kudu?) and just adding more servers as needed.
4 - Sharding the existing DB, basically adding a separate DB server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize that table and do mass UPDATEs of the counters. When there is a burst of traffic, this becomes more efficient, so it does not keel over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
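A rough sketch of that flip-and-summarize loop, with made-up table names (clicks_staging, clicks_staging_spare, user_balances - the last being the separate counter table mentioned above):

    <?php
    // The web tier INSERTs raw events into clicks_staging. This worker
    // atomically swaps in an empty table, then folds the swapped-out batch
    // into the counter table with one set-based statement instead of one
    // UPDATE per event. All table names here are placeholders.
    $pdo = new PDO('mysql:host=localhost;dbname=tracker', 'user', 'pass');

    while (true) {
        // Atomic "ping-pong" flip: writers carry on filling the fresh table.
        $pdo->exec('RENAME TABLE clicks_staging TO clicks_processing,
                                 clicks_staging_spare TO clicks_staging');

        // Mass UPDATE of the user counters from the captured batch.
        $pdo->exec(
            'INSERT INTO user_balances (user_id, clicks)
                  SELECT user_id, COUNT(*) FROM clicks_processing GROUP BY user_id
             ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks)'
        );
        // (Repeat the same idea for the sub-affiliate and affiliate counters.)

        $pdo->exec('TRUNCATE TABLE clicks_processing');
        $pdo->exec('RENAME TABLE clicks_processing TO clicks_staging_spare');
    }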
If I were you, I would use Redis in-memory storage and increment your metrics there. It's very fast and reliable, and you can read from this DB as well. Then create a cron job that saves that data into the MySQL DB.
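A minimal sketch with the phpredis extension (key names and the MySQL column are just assumptions):

    <?php
    // Hot path: one in-memory increment per counter, no SQL round trip.
    // $userId, $subAffId, $affId come from the incoming event.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    $redis->incrBy("clicks:user:$userId", 1);
    $redis->incrBy("clicks:subaff:$subAffId", 1);
    $redis->incrBy("clicks:aff:$affId", 1);

    // Cron job (e.g. every 5 minutes): drain each counter and fold the
    // delta into MySQL. GETSET reads the value and resets it to 0 in one
    // atomic step. (SCAN would be kinder than KEYS on a big keyspace.)
    $pdo = new PDO('mysql:host=localhost;dbname=tracker', 'user', 'pass');
    foreach ($redis->keys('clicks:user:*') as $key) {
        $delta = (int)$redis->getSet($key, 0);
        if ($delta > 0) {
            $userId = (int)substr($key, strlen('clicks:user:'));
            $stmt = $pdo->prepare('UPDATE users SET clicks = clicks + ? WHERE id = ?');
            $stmt->execute([$delta, $userId]);
        }
    }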
Is your web tier doing the number crunching as it receives and processes the HTTP request? If so, the very first thing you will want to do is move this to a work queue and process these events asynchronously. I believe you hint at this in your item 3.
There are many solutions, and choosing one is beyond the scope of this answer, but some packages to consider:
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
In terms of storage, it really depends on what you're trying to achieve: fast reads, fast writes, bulk reads, sharding/distribution, high availability... the answer to each points you in a different direction.
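As a sketch of the producer/worker split, using Gearman/PHP from the list above (the process_click job name and payload fields are made up):

    <?php
    // Producer (runs in the web request): hand the raw event to the queue
    // and return immediately.
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    $client->doBackground('process_click', json_encode([
        'user_id'   => $userId,
        'subaff_id' => $subAffId,
        'aff_id'    => $affId,
    ]));

    // Worker (separate long-running process): crunch the numbers asynchronously.
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction('process_click', function (GearmanJob $job) {
        $event = json_decode($job->workload(), true);
        // ...update the balances here, ideally in batches...
    });
    while ($worker->work());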
This sounds like an excellent candidate for Clustrix, which is a drop-in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you, transparently to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OLTP DB that happens to be able to absorb mad amounts of analytical processing on real-time data as well.
I'm the developer for a real estate syndication website and am currently having trouble figuring out a way to update massive numbers of listings/records efficiently (2,000,000+ listings).
We currently accept XML feeds, containing real estate listings, from about ~20 different websites. Most of the incoming feeds are small (~100 or so listings), but we have a couple of XML feeds that contain ~1,000,000 listings. The small feeds are parsed quickly and easily; the large feeds, however, take upwards of 2-3 hours each.
The current "live" database table that contains the listings for viewing on the site is MyISAM. I chose MyISAM because ~95% of the queries to the table are SELECTs. Really the only time there are writes (UPDATE/INSERT queries) are during the time the XML feeds are being processed.
The current process is as follows:
There is a cron job in place that starts the main parsing script.
It loops through a feeds table and grabs the external XML feed source files. It then runs through each file, and for each record in the XML it checks against the listings table to see whether a listing needs to be updated or inserted (if it's a new listing).
This is all happening against the live table. What I'd like to find out is if anybody has any better logic to make these updates/inserts happen in the background so as to not slow down the production tables, and ultimately, the user experience.
Would a delta table be the best choice? Maybe do all the heavy work on a separate database and just copy the new table over to the production database? On a separate workhorse domain altogether? Should I have a separate listings table that does all the parsing which would be InnoDB instead of MyISAM?
What we're trying to accomplish is to have our system be able to update listings frequently throughout the day without slowing the site down. Our competitors boast that they are updating their listings every 5 minutes in some cases. I just don't see how that's even possible.
I'm working right now so this is more of a brain dump just to get the ball rolling. If anybody would like me to provide table schematics, I'd be more than happy.
In summary: I'm looking for a way to frequently update millions of records in our database (daily) via a couple dozen external XML feeds/files. I just need some logic on how to effectively, and efficiently, make this happen so as to not drag the production server down with it.
Firstly, for your existing 3-hour import, try wrapping every 100 inserts in a transaction. They will be written to the database in one go, and that might speed things up dramatically. Play around with the value of 100 - the best value will depend on how resilient you want the import to be and how much memory your transaction cache has. (This will of course require you to switch to an engine that supports transactions, such as InnoDB.)
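As a rough sketch, assuming the listings table has been switched to InnoDB and has a unique key on the feed's listing ID so the update-or-insert check can be expressed as a single upsert (column names are just placeholders):

    <?php
    // Batched import: commit every $batchSize rows instead of one
    // implicit commit per statement.
    $pdo = new PDO('mysql:host=localhost;dbname=realestate', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $stmt = $pdo->prepare(
        'INSERT INTO listings (external_id, price, address, updated_at)
         VALUES (:id, :price, :address, NOW())
         ON DUPLICATE KEY UPDATE
             price = VALUES(price),
             address = VALUES(address),
             updated_at = NOW()'
    );

    $batchSize = 100;                      // play around with this value
    $xml = simplexml_load_file($feedFile); // the downloaded feed file
    $i = 0;

    $pdo->beginTransaction();
    foreach ($xml->listing as $listing) {
        $stmt->execute([
            ':id'      => (string)$listing->id,
            ':price'   => (string)$listing->price,
            ':address' => (string)$listing->address,
        ]);
        if (++$i % $batchSize === 0) {     // flush the batch
            $pdo->commit();
            $pdo->beginTransaction();
        }
    }
    $pdo->commit();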
For providers that are known to offer larger files, try keeping a copy of the previous XML download, and then do a text diff between the old one and the new one. If you set your context settings (i.e. the number of unchanged lines shown around each changed line) wide enough, you might be able to capture the primary keys of changed items. You would then just do a small number of updates.
Of course, it would help if your providers maintained the order of their XML listings. If they don't, a text sort followed by a diff may still be faster than importing everything.
FWIW, I think a complete refresh every 5 minutes is probably not feasible. I expect your providers would not be happy with you downloading 1M records at this frequency!
I would like to know what you think about storing chat messages in a database.
I need to be able to bind other stuff to them (like files, or contacts) and using a database is the best way I see for now.
The same question goes for files: because they can be bound to chat messages, I would have to store them in the database too.
With thousands of messages and files I wonder about performance drops and database size.
What do you think considering I'm using PHP with MySQL/Doctrine?
I think it is OK to store any textual information in the database (names, message history, etc.) provided that you structure your database properly. I have worked for big websites (many thousands of visits a day) and telecom companies that store information about their users (including their traffic statistics) in databases that have grown to hundreds of gigabytes, and the applications were working fine.
But regarding binary information like images and files, it would be better to store them on the file system and store only their paths in the database, because it is cheaper to read them off the disk than to tie up a database process reading a multi-megabyte file.
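A minimal sketch of that approach (the chat_attachments table, its columns, and the upload directory are just examples):

    <?php
    // Keep the uploaded file on disk; store only its path and the message
    // it belongs to in the database.
    $pdo = new PDO('mysql:host=localhost;dbname=chat', 'user', 'pass');

    $storageDir = '/var/www/uploads/' . date('Y/m');
    if (!is_dir($storageDir)) {
        mkdir($storageDir, 0755, true);
    }

    $original = basename($_FILES['attachment']['name']);
    $path     = $storageDir . '/' . uniqid('', true) . '_' . $original;
    move_uploaded_file($_FILES['attachment']['tmp_name'], $path);

    $stmt = $pdo->prepare(
        'INSERT INTO chat_attachments (message_id, file_path, original_name)
         VALUES (?, ?, ?)'
    );
    $stmt->execute([$messageId, $path, $original]);  // $messageId: the chat message row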
As I said, it is important that you do several things:
Structure your information properly - it is very important to design your database properly, dividing it into tables and tables into fields with your performance goals in mind, because this will form the basis for your application and queries. Get that wrong and your queries will be slow.
Make proper decisions on the table engine pertinent to every table. This is an important step because it will greatly affect the performance of your queries. For example, MyISAM blocks read access to the table while it is being updated. That will be a problem for a web application like a social network or a news site, because in many situations your users will basically have to wait for an update to complete before they see a generated page.
Create proper indexes - very important for performance, especially for applications with rapidly growing big databases.
Measure performance of your queries as data grows and look for the ways to improve it - you will always find bottlenecks that have to be removed, this is an ongoing non-stop process. Every popular web application has to do it.
I think a NoSQL database like CouchDB or MongoDB is an option. You can also store the files separately and link them via a known filename, but it depends on your system architecture.
I'm fairly familiar with most aspects of web development and I consider myself a junior level programmer. I'm always anxious when I think about application scaling and would like to learn a little more about it. Let's have a hypothetical situation.
I'm working on a web application that polls a device and fetches about 2kb of XML data at 15 minute intervals. This data must be stored for A Very Long Time (at least a couple years?). Now imagine that this web application has 100 users that each have this device.
After 10 years we're talking tens of millions of table rows. With 100 users we have a cron job that is querying each users device, getting 2kb of XML, and inserting it into the SQL database every 15 minutes.
Assuming my queries are relatively simple, only collecting the columns necessary, using joins, and avoiding subqueries, is there any reason this should not scale?
Inserting doesn't generally get slower as a table gets larger, but index updates may take longer. At some point you may want to split the table into two parts: one for archival storage, optimized for data retrieval (basically index the heck out of it), and a second table to handle the newer data, optimized more for insertion (fewer indexes).
But as always, the only way to tell for sure is to benchmark things. Set up some cloned tables with a few thousand rows, and some with multi-millions of rows, and see what happens.
You could always consider using partitioning to automagically split your data files by date, and age older records off to a slower, high-capacity disk array while keeping the newer records (and the INSERTs) on a high-speed array. Then your index builds will only have to work on a subset of the data rather than the whole deal, and should go quickly (disk I/O is typically the slowest part of a database system).
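In MySQL that could look roughly like this, assuming a readings table with a recorded_at DATETIME column (note that the partitioning column must be part of every unique key on the table):

    <?php
    // Range-partition the readings by year: older rows land in their own
    // partitions, which can be archived or moved to slower storage, and
    // index maintenance mostly touches the partitions that still change.
    $pdo = new PDO('mysql:host=localhost;dbname=telemetry', 'user', 'pass');
    $pdo->exec(
        'ALTER TABLE readings
         PARTITION BY RANGE (YEAR(recorded_at)) (
             PARTITION p2015 VALUES LESS THAN (2016),
             PARTITION p2016 VALUES LESS THAN (2017),
             PARTITION pmax  VALUES LESS THAN MAXVALUE
         )'
    );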
"Assuming my queries are relatively simple, only collecting the columns necessary, using joins, and avoiding subqueries, is there any reason this should not scale?"
When you get large you should put your active dataset in an in-memory store (faster than disk), just like Facebook, Twitter, etc. do. Twitter became very slow when they did not keep the active dataset in memory / scale up; a lot of people called this the fail whale. Both use memcached for this, but you could also use Redis (I like this) or APC if you are on just a single box. You should always install APC if you want performance, because APC is used for caching the compiled bytecode:
"Most PHP accelerators work by caching the compiled bytecode of PHP scripts to avoid the overhead of parsing and compiling source code on each request (some or all of which may never even be executed). To further improve performance, the cached code is stored in shared memory and directly executed from there, minimizing the amount of slow disk reads and memory copying at runtime."
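A cache-aside sketch of the memcached idea above, using the pecl memcached extension (key names and the query are placeholders):

    <?php
    // Serve the hot ("active") dataset from memory; fall back to MySQL on
    // a cache miss and repopulate the cache.
    $cache = new Memcached();
    $cache->addServer('127.0.0.1', 11211);
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    function getUserTimeline(PDO $pdo, Memcached $cache, $userId)
    {
        $key  = "timeline:$userId";
        $data = $cache->get($key);
        if ($data !== false) {
            return $data;                          // cache hit: no disk touched
        }

        $stmt = $pdo->prepare(
            'SELECT * FROM posts WHERE user_id = ? ORDER BY created_at DESC LIMIT 50'
        );
        $stmt->execute([$userId]);
        $data = $stmt->fetchAll(PDO::FETCH_ASSOC);

        $cache->set($key, $data, 300);             // keep it hot for 5 minutes
        return $data;
    }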
I am currently in a debate with a coworker about the best practices concerning the database design of a PHP web application we're creating. The application is designed for businesses, and each company that signs up will have multiple users using the application.
My design methodology is to create a new database for every company that signs up. This way everything is sandboxed, modular, and small. My coworker's philosophy is to put everyone into one database. His argument is that if we have 1000+ companies sign up, we wind up with 1000+ databases to deal with. Not to mention the mess that doing Business Intelligence becomes.
For the sake of example, assume that the application is an order entry system. With separate databases, table size can remain manageable even if each company is doing 100+ orders a day. In a single-bucket application, tables can get very big very quickly.
Is there a best practice for this? I tried hunting around the web, but haven't had much success. Links, whitepapers, and presentations welcome.
Thanks in advance,
The1Rob
I talked to the database architect from wordpress.com, the hosting service for WordPress. He said that they started out with one database, hosting all customers together. The content of a single blog site really isn't that much, after all. It stands to reason that a single database is more manageable.
This worked well for them until they got hundreds and thousands of customers; then they realized that they needed to scale out, running multiple physical servers and hosting a subset of their customers on each server. When they add a server, it would be easy to migrate individual customers to the new server, but much harder to separate out the data within a single database that belongs to an individual customer's blog.
As customers come and go, and some customers' blogs have high-volume activity while others go stale, the rebalancing over multiple servers becomes an even more complex maintenance job. Monitoring size and activity per individual database is easier too.
Likewise, doing a database backup or restore of a single database containing terabytes of data, versus individual database backups and restores of a few megabytes each, is an important factor. Consider: a customer calls and says their data got SNAFU'd due to some bad data entry, and could you please restore the data from yesterday's backup? How would you restore one customer's data if all your customers share a single database?
Eventually they decided that splitting into a separate database per customer, though complex to manage, offered them greater flexibility and they re-architected their hosting service to this model.
So, while from a data modeling perspective it seems like the right thing to do to keep everything in a single database, some database administration tasks become easier as you pass a certain breakpoint of data volume.
I would never create a new database for each company. If you want a modular design, you can achieve it using tables and properly connected primary and foreign keys. This is where I learned about database normalization, and I'm sure it will help you out here.
This is the method I would use. SQL Article
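For example (table and column names are only illustrative), this is the sort of thing I mean: one shared database where every row is tied back to its company through primary and foreign keys:

    <?php
    // Single shared database: every company-owned row carries a company_id
    // foreign key instead of living in its own database.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    $pdo->exec(
        'CREATE TABLE companies (
             id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
             name VARCHAR(255) NOT NULL
         ) ENGINE=InnoDB'
    );

    $pdo->exec(
        'CREATE TABLE orders (
             id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
             company_id INT UNSIGNED NOT NULL,
             total      DECIMAL(10,2) NOT NULL,
             created_at DATETIME NOT NULL,
             INDEX idx_company (company_id),
             FOREIGN KEY (company_id) REFERENCES companies (id)
         ) ENGINE=InnoDB'
    );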
I'd have to agree with your co-worker. Relational databases are designed to handle large amounts of data, and the numbers you're talking about (1000+ companies, multiple users per company, 100+ orders/day) are well within the expected bounds. Separate databases would mean:
multiple database connections in each script (memory and speed penalty)
maintenance is harder (DB systems generally do not provide tools for acting on databases as a group) so schema changes, backups, and similar tasks will be more difficult
harder to run queries on data from multiple companies
If your site becomes huge, you may eventually need to distribute your data across multiple servers. Deal with that when it happens. To start out that way for performance reasons sounds like premature optimization.
I haven't personally dealt with this situation, but I would think that if you want to do business intelligence, you should aggregate the data into an offline database that you can then run any analysis you want on.
Also, keeping them in separate databases makes it easier to partition across servers (which you will likely have to do if you have 1000+ customers) without resorting to messy replication technologies.
I had a similar question a while back and came to the conclusion that a single database is drastically more manageable. Right now, we have multiple databases (around 10) and it is already becoming a pain to manage especially when we upgrade the code. We have to migrate every single database.
The upside is that the data is segregated cleanly. Due to the sensitivity of our data, this is a good thing, but it does make it quite a bit more difficult to keep up with.
The separate-database methodology has one very big advantage over the other:
+ You can break the system up into smaller groups, so this architecture scales much better.
+ You can set up standalone servers easily.
That depends on how likely your schemas are to change. If they ever have to change, will you be able to safely make those changes to 1000 separate databases? If a scalability problem is found with your design, how are you going to fix it for 1000 databases?
We run a SaaS (Software-as-a-Service) business with a large number of customers and have elected to keep all customers in the same database. Managing 1000's of separate databases is an operational nightmare.
You do have to be very diligent creating your data model and the business objects / reporting queries that access them. One approach you may want to consider is to carry the company ID in every table and ensure that every WHERE clause includes the company ID for the currently logged-in user. If you use a data access layer, you can enforce that condition there.
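A minimal sketch of what that enforcement might look like in a PHP data access layer (class and method names are made up):

    <?php
    // Every query built through this helper automatically gets the tenant
    // filter, so no code path can forget the company ID.
    class TenantDb
    {
        private $pdo;
        private $companyId;   // company of the currently logged-in user

        public function __construct(PDO $pdo, $companyId)
        {
            $this->pdo       = $pdo;
            $this->companyId = $companyId;
        }

        public function select($table, $where = '1 = 1', array $params = [])
        {
            $sql  = "SELECT * FROM $table WHERE company_id = ? AND ($where)";
            $stmt = $this->pdo->prepare($sql);
            $stmt->execute(array_merge([$this->companyId], $params));
            return $stmt->fetchAll(PDO::FETCH_ASSOC);
        }
    }

    // Usage:
    //   $db     = new TenantDb($pdo, $_SESSION['company_id']);
    //   $orders = $db->select('orders', 'created_at >= ?', [$since]);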
As you grow large, you can still partition by placing groups of companies on each physical server, e.g. the first 100 companies on Server A, the next 100 companies on Server B.