We are planning to create an advertisement network. Like any normal online advertisement network, we would provide ad serving, reporting (stats) and a small browsing site for publishers/advertisers.
Because the application would get a huge number of impression (ad serving) requests, it must be able to insert data quickly to log impressions and clicks, and to keep a count of impressions and clicks for every publisher/advertiser. This data would then be used to monitor impressions/clicks from publishers and to generate reports.
Right now we have planned the whole system to be based on PHP, MySQL (InnoDB), php-eAccelerator and Memcached (just to store active ads).
Problems/Issues
Scaling...
I seriously feel that our application is not going to scale well when our traffic grows.
MySQL INSERTs and UPDATEs would surely be the bottleneck. Also, how do we distribute all of this across multiple servers so that our application can scale with load?
Can anyone please help propose a structure for the application, especially for impression logging and calculation? Would MongoDB be a better solution in any way?
Any help would be highly appreciated.
I've built several high-volume statistical collection systems using MySQL. They perform fairly well so long as you keep ahead of the scaling curve with careful planning. In particular, if you're doing lots of INSERT or UPDATE queries, heavy writes, you'll need to keep your row sizes smaller, using INT from a look-up table instead of VARCHAR columns for instance, and pay careful attention to how big your indexes are getting.
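To make the look-up table point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 purely as a runnable stand-in for MySQL, and the table names, column names and publisher names are made up for illustration. The idea is that the hot `impressions` table stores a small integer instead of repeating a string on every row.

```python
import sqlite3

# sqlite3 is only a stand-in here; the same schema idea applies to MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publishers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE impressions (
        publisher_id INTEGER NOT NULL REFERENCES publishers(id),
        ts           INTEGER NOT NULL
    );
""")

def publisher_id(name):
    """Return the integer id for a publisher name, creating it on first use."""
    conn.execute("INSERT OR IGNORE INTO publishers (name) VALUES (?)", (name,))
    row = conn.execute("SELECT id FROM publishers WHERE name = ?", (name,)).fetchone()
    return row[0]

def log_impression(name, ts):
    """Write one narrow row: two integers instead of a repeated string."""
    conn.execute(
        "INSERT INTO impressions (publisher_id, ts) VALUES (?, ?)",
        (publisher_id(name), ts),
    )

for i in range(3):
    log_impression("example-publisher.test", 1_700_000_000 + i)
log_impression("another-publisher.test", 1_700_000_010)

publisher_count = conn.execute("SELECT COUNT(*) FROM publishers").fetchone()[0]
impression_count = conn.execute("SELECT COUNT(*) FROM impressions").fetchone()[0]
```

In a real system you would probably cache the name-to-id mapping in the application so the look-up does not cost a query per impression.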
Always, always simulate your schema with massive amounts of test data. Abuse it to the breaking point, fix it, and abuse it all over again. You want to see smoke or you're not trying hard enough. Remember, hardware makes a massive difference, so be careful to use something as close as possible to the deployment target. Your SSD notebook will blow the doors off of a server with 15K enterprise drives in a RAID10 setup, for example, if you're doing heavy writes.
That being said, you might want to look at Redis. It's not a relational database, but it's several orders of magnitude faster than MySQL for things like "add one to column X" or "give me Y count for Z interval" type operations.
I am planning to create a card game engine using SQL. The game consists of 4 human players, and the cards are in an SQL table. Everything regarding game logic and points is already done. Each game is managed by a separate SQL table, and players are able to create rooms.
Each room shall have a game table containing card data, with each player represented in a column, and a separate chat table.
If there were 1000 games running at the same time, and each time a card is played a request is made to the server either to remove a card from a player's deck or to record the player's score and total game score, can this be handled in a single SQL database without delays and performance issues?
Can I use global temporary tables (##sometable) for each game room, or do I have to create the tables manually and delete them after the game ends?
I would also like to know whether storing chat data in a single SQL table would cause issues. One thing I thought of is saving chat data for all open rooms in a single table with a game id column, but would this cause performance issues if there were thousands of lines of chat data?
Also, what about a database for each game? Would that be overkill?
How are such applications normally managed?
Do I have to use multiple servers and distribute the running games across them?
Any ideas you have about optimizing such things would be welcome.
You should consider a memory-based cache system such as Velocity or Memcached to address the performance issues.
Yes. The discussion of how to scale a task like this is a long one.
You could. But you should rather consider a smarter model whereby multiple games occur in a single table.
I would use SQL Server Service Broker for the chat
Yes.
I recommend you break your questions up into multiple questions so that contributors who specialise in a single aspect of your problem domain can contribute accordingly.
I don't know how PHP works, but I am fairly sure that it would be far more efficient for a lot of the game logic to occur client-side. Making a server call for every game action would work; my opinion is just that it is sub-optimal.
Yes. I would expect live players to take at least 1 second before making their moves, and only one player is making a move at a time per game. So that is roughly 1000 transactions per second at peak for 1000 games. Not an excessive load on modern architectures.
There is more overhead in most DBMSs for creating and destroying tables. Keep it all in the same table.
Chat would be fine in a single table. You could keep performance up by archiving chat from previous inactive games and removing from the primary live db.
Yes, very inefficient. Added complexity for no gain.
Not sure what you are asking.
Only as you scale. I would imagine you would start with a single db server until you needed more capacity.
Good db design from the beginning, from someone with experience, will go a long way. Don't waste too much time micro-optimizing at the start or you will never get off the ground. Optimize as you need to as you scale.
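As a rough illustration of the "keep it all in the same table" and "archive old chat" advice above, here is a small sketch. It uses Python's sqlite3 only so that it runs anywhere; the question is about SQL Server, and the table and column names (`hands`, `chat`, `chat_archive`, `game_id`) are assumptions of mine, not anything from the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One shared table for every running game, keyed by game_id.
    CREATE TABLE hands (
        game_id INTEGER NOT NULL,
        player  INTEGER NOT NULL,   -- seat 1..4
        card    TEXT    NOT NULL,
        PRIMARY KEY (game_id, player, card)
    );
    -- One shared chat table, indexed so a room only reads its own lines.
    CREATE TABLE chat (
        game_id INTEGER NOT NULL,
        ts      INTEGER NOT NULL,
        player  INTEGER NOT NULL,
        message TEXT    NOT NULL
    );
    CREATE INDEX idx_chat_game ON chat (game_id, ts);
    CREATE TABLE chat_archive AS SELECT * FROM chat WHERE 0;
""")

def play_card(game_id, player, card):
    """Remove one card from a player's hand; True if the card was there."""
    with conn:
        cur = conn.execute(
            "DELETE FROM hands WHERE game_id = ? AND player = ? AND card = ?",
            (game_id, player, card),
        )
    return cur.rowcount == 1

def archive_chat(game_id):
    """Move a finished game's chat out of the live table."""
    with conn:
        conn.execute("INSERT INTO chat_archive SELECT * FROM chat WHERE game_id = ?", (game_id,))
        conn.execute("DELETE FROM chat WHERE game_id = ?", (game_id,))

conn.executemany("INSERT INTO hands VALUES (?, ?, ?)",
                 [(1, 1, "AS"), (1, 2, "KH"), (2, 1, "AS")])
conn.executemany("INSERT INTO chat VALUES (?, ?, ?, ?)",
                 [(1, 100, 1, "hi"), (1, 101, 2, "hello"), (2, 102, 1, "gl")])

played = play_card(1, 1, "AS")        # card is in the hand
played_again = play_card(1, 1, "AS")  # already played
archive_chat(1)                       # game 1 is over

live_chat = conn.execute("SELECT COUNT(*) FROM chat").fetchone()[0]
archived_chat = conn.execute("SELECT COUNT(*) FROM chat_archive").fetchone()[0]
game2_cards = conn.execute("SELECT COUNT(*) FROM hands WHERE game_id = 2").fetchone()[0]
```

No tables are created or dropped per game; ending a game is just a couple of statements filtered by `game_id`.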
The short version is that relational DBs such as SQL Server are not very useful for games because they cannot efficiently store heavily structured hierarchical data.
I would still advocate avoiding SQL, but there are now many more options in the NoSQL space, and for real performance you should consider using fast temporary storage such as Redis or Memcached.
You can quickly look at the "Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris" comparison.
Optimizing is a different topic entirely: too wide and too project-specific.
We are building a social website using PHP (Zend Framework), MySQL, server running Apache.
There is a requirement that the dashboard fetch data for different events (there are about 12 event types), and each user's dashboard is updated based on them. We expect the total number of users to be around 500k to 700k. On average about 20% of users would be online at any one time (at peak we expect 50% of users to be online).
So the problem is that, as per our current design, the event data will be placed in a MySQL database. I think running a few hundred thousand queries concurrently on MySQL wouldn't be a good idea even if we use Amazon RDS. So we are considering using DynamoDB (or Redis, or any other NoSQL db option) along with MySQL.
So the question is: would having the data both in MySQL and in a NoSQL database give us the scalability we need for our web application? Or should we consider another solution?
Thanks.
You do not need to duplicate your data. One option is to use the ElastiCache service that Amazon provides to give yourself in-memory caching. This will get rid of most of your database calls and in a sense remove that bottleneck, but it can be very expensive. If you can sacrifice real-time updates then you can get away with just slowing down the requests or caching data locally for the user. Say, cache the next N events in the browser if possible and display them instead of making another request to the servers.
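To show the shape of the caching idea, here is a tiny in-process sketch. It is not ElastiCache or Memcached, just a dictionary with expiry times, and `load_events_from_db` is a hypothetical placeholder for whatever query fills the dashboard. The point is only that repeated dashboard loads within the TTL never reach the database.

```python
import time

class TTLCache:
    """Minimal time-to-live cache; a stand-in for an external cache service."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # still fresh: no database call
        value = loader(key)                    # miss or expired: go to the database
        self._store[key] = (now + self.ttl, value)
        return value

db_calls = []

def load_events_from_db(user_id):
    """Hypothetical loader; records each call so we can count database hits."""
    db_calls.append(user_id)
    return [f"event-for-{user_id}"]

fake_now = [0.0]                               # controllable clock for the demo
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])

first = cache.get_or_load(42, load_events_from_db)
second = cache.get_or_load(42, load_events_from_db)   # served from the cache
fake_now[0] = 31.0
third = cache.get_or_load(42, load_events_from_db)    # entry expired, reloads
```

The TTL is the knob that trades freshness for load: a longer TTL means staler dashboards but fewer queries.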
If it has to be real time then look at ElastiCache and tune how many cache nodes you need to handle your estimated traffic. There is no point in duplicating your data. Keep it in a single DB if it makes sense to keep it there. If you have some relational information and also some data with a variable schema, then you can use both kinds of database, each for its own data, but not to load balance between them.
I would also start to think about the bottlenecks in your architecture and how well your application can scale if you reach your estimated numbers.
I agree with @sean, there's no need to duplicate the database. Have you thought about something with auto-scalability, like Xeround? A solution like that can scale out automatically across several nodes when you have throughput peaks and later scale back in, so you don't have to commit to a larger, more expensive instance just because of seasonal peaks.
Additionally, if I understand correctly, no code changes are required for this auto-scalability. So I'd say that unless you need to duplicate your data on both MySQL and NoSQL DBs for reasons other than scalability, go for a single DB with auto-scaling.
I've been given a big project by a big client and I've been working on it for 2 months now. I'm getting closer and closer to a solution but it's just so insanely complex that I can't quite get there, and so I need ideas.
The project is quite simple: there is a database of 1 million+ lat/lng coordinates with lots of additional data for each record. A user will visit a page and enter some search terms, which will filter out quite a lot of the records. All of the records that match the filter are displayed (often clustered) on a Google Map.
The problem with this is that the client demands it be fast, lean, and low-bandwidth. Hence, I'm stuck. What I'm currently doing is: present the first clusters, and when the user hovers over a cluster, begin loading the data for that cluster's children.
However, I've upped it to 30,000 of the millions of listings and it's starting to drag a little. I've made as many optimizations as I possibly can. When the filter is changed, I AJAX a query to the DB and return all the IDs of the matches, then update the map to reflect this.
So, optimization is not an option. I need an entirely new conceptual model for this. Any input at all would be highly appreciated, as this is an incredibly complex project and I can't find anything even remotely close to it. I even looked at MMORPGs, which have a lot of similar problems (and I have made a few), but the concept of having a million players in one room is still something MMORPG makers cringe at. People commonly assume there must be bottlenecks to fix, but let me say that this is not a case of optimizing that way. I need a new model in which a huge database stays on the server but is displayed fluidly to the user.
I'll be awarding 500 rep as soon as it becomes available for anything that solves this.
Thanks- Daniel.
I think there are a number of possible answers to your question depending on where it is slowing down, so here goes a few thoughts.
A wider table can affect the speed with which a query is returned. Longer records mean that more disk is being accessed to get the right data, so you might want to think about limiting your initial table to hold only the columns that are used for filtering. Having said that, it will also depend on the db engine you are using; some suffer more than others.
Ensuring that your tables are correctly indexed makes a HUGE difference in performance. You need to make sure that the query is using the indexes to quickly get to the records that it needs.
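One way to check whether a query actually uses an index is to ask the database for its plan. The sketch below does that with Python's sqlite3 and `EXPLAIN QUERY PLAN`; MySQL has its own `EXPLAIN` with different output, and the `listings` table, the bounding-box query and the index name are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, lat REAL, lng REAL, title TEXT)")
# Synthetic grid of points, just to have something to query.
conn.executemany(
    "INSERT INTO listings (lat, lng, title) VALUES (?, ?, ?)",
    [(lat, lng, f"listing {lat},{lng}")
     for lat in range(-90, 90) for lng in range(-180, 180, 4)],
)

# Only fetch ids inside a lat/lng box instead of every match everywhere.
QUERY = "SELECT id FROM listings WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?"
BOX = (10, 12, 20, 40)

def plan(sql, params):
    """Return SQLite's query plan as one string (last column is the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_before = plan(QUERY, BOX)   # expected to be a full table scan
conn.execute("CREATE INDEX idx_listings_lat_lng ON listings (lat, lng)")
plan_after = plan(QUERY, BOX)    # expected to mention idx_listings_lat_lng

matches = conn.execute(QUERY, BOX).fetchall()
print(plan_before)
print(plan_after)
```

The exact plan text varies between SQLite versions, so treat the printed strings as something to read, not to parse.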
A friend was working with Google Maps and said that the API really suffered if too much was displayed on the maps. This might just be totally out of your control.
Having worked for Epic Games in the past, the reason that "millions of players in a room" is something to cringe at is more often hardware driven. In a game, having that number of players would grind the graphics card to a halt as it tries to render all the polygons of the models. Secondly (and likely more importantly) the problem would be that you have to send each client information about what each item/player is doing. This means that your bandwidth use will spike very heavily. Your server might handle the load, but the players internet connection might not.
I do think that you need to edit your question though with some extra information on WHAT is slowing down. Your database? Your query? Google API? The transfer of data between server and client machine?
Let's be honest here: a db with 1 million records being accessed by presumably a large number of users is not going to run very well unless you put some extremely powerful hardware behind it.
In this type of case, I would suggest using several different database servers and setting up some decent load balancing in order to keep them running as smoothly as possible. First and foremost, you will need to find out the "average" load you can place on a db server before it starts to lag; let's say, for example, this is 50,000 records. Setting a low MaxClients per server may help with server performance and prevent crashes, but it might aggravate your users when they can't execute any queries due to high load. It's something to keep in mind if your budget doesn't allow for much wiggle room hardware-wise.
On the topic of hardware however, that's something you really need to take a look at. Databases typically don't use a huge amount of CPU/RAM, but they can be quite taxing on your HDD. I would recommend going for SAS or SSD before looking at other components on your setup; these will make the world of a difference for you.
As far as load balancing goes, a very common technique used by most content providers is that when one query or content item (such as a popular video on YouTube) is pulling in an above-average amount of traffic, you cache its result. A quick and dirty approach to this is to use an if statement in your search handler, which will then grab a static HTML page instead of actually running the query.
Another approach to this is to have a separate, standalone db server used only for running the queries that are taking an excessive amount of traffic.
With that, never underestimate your code optimisation. While the differences may seem subtle to you, when run across millions of queries by thousands of users, those tiny differences really do add up.
Best of luck with it - let me know if you need any further assistance.
Eoghan
Google has a service named BigQuery. It is a hosted data warehouse in the cloud that you query with SQL. It runs on Google's servers and can search millions of rows quickly. Unfortunately it is not free, but maybe it will help you out:
https://developers.google.com/bigquery/
I am currently in a debate with a coworker about the best practices concerning the database design of a PHP web application we're creating. The application is designed for businesses, and each company that signs up will have multiple users using the application.
My design methodology is to create a new database for every company that signs up. This way everything is sand-boxed, modular, and small. My coworker's philosophy is to put everyone into one database. His argument is that if we have 1000+ companies sign up, we wind up with 1000+ databases to deal with. Not to mention the mess that doing Business Intelligence becomes.
For the sake of example, assume that the application is an order entry system. With separate databases, table size can remain manageable even if each company is doing 100+ orders a day. In a single-bucket application, tables can get very big very quickly.
Is there a best practice for this? I tried hunting around the web, but haven't had much success. Links, whitepapers, and presentations welcome.
Thanks in advance,
The1Rob
I talked to the database architect from wordpress.com, the hosting service for WordPress. He said that they started out with one database, hosting all customers together. The content of a single blog site really isn't that much, after all. It stands to reason that a single database is more manageable.
This did work well for them until they got hundreds and thousands of customers. Then they realized that they needed to scale out, running multiple physical servers and hosting a subset of their customers on each server. When they add a server, it would be easy to migrate individual customers to the new server, but it is harder to separate out the data within a single database that belongs to an individual customer's blog.
As customers come and go, and some customers' blogs have high-volume activity while others go stale, the rebalancing over multiple servers becomes an even more complex maintenance job. Monitoring size and activity per individual database is easier too.
Likewise, doing a database backup or restore of a single database containing terabytes of data, versus individual database backups and restores of a few megabytes each, is an important factor. Consider: a customer calls and says their data got SNAFU'd due to some bad data entry, and could you please restore the data from yesterday's backup? How would you restore one customer's data if all your customers share a single database?
Eventually they decided that splitting into a separate database per customer, though complex to manage, offered them greater flexibility and they re-architected their hosting service to this model.
So, while from a data modeling perspective it seems like the right thing to do to keep everything in a single database, some database administration tasks become easier as you pass a certain breakpoint of data volume.
I would never create a new database for each company. If you want a modular design, you can create this using tables and properly connected primary and foreign keys. This is where I learned about database normalization, and I'm sure it will help you out here.
This is the method I would use. SQL Article
I'd have to agree with your co-worker. Relational databases are designed to handle large amounts of data, and the numbers you're talking about (1000+ companies, multiple users per company, 100+ orders/day) are well within the expected bounds. Separate databases means:
multiple database connections in each script (memory and speed penalty)
maintenance is harder (DB systems generally do not provide tools for acting on databases as a group) so schema changes, backups, and similar tasks will be more difficult
harder to run queries on data from multiple companies
If your site becomes huge, you may eventually need to distribute your data across multiple servers. Deal with that when it happens. To start out that way for performance reasons sounds like premature optimization.
I haven't personally dealt with this situation, but I would think that if you want to do business intelligence, you should aggregate the data into an offline database that you can then run any analysis you want on.
Also, keeping them in separate databases makes it easier to partition across servers (which you will likely have to do if you have 1000+ customers) without resorting to messy replication technologies.
I had a similar question a while back and came to the conclusion that a single database is drastically more manageable. Right now, we have multiple databases (around 10) and it is already becoming a pain to manage especially when we upgrade the code. We have to migrate every single database.
The upside is that the data is segregated cleanly. Due to the sensitivity of our data, this is a good thing, but it does make it quite a bit more difficult to keep up with.
The separate database methodology has a very big advantage over the other:
+ You could break it up into smaller groups; this architecture scales much better.
+ You could make stand alone servers in an easy way.
That depends on how likely your schemas are to change. If they ever have to change, will you be able to safely make those changes to 1000 separate databases? If a scalability problem is found with your design, how are you going to fix it for 1000 databases?
We run a SaaS (Software-as-a-Service) business with a large number of customers and have elected to keep all customers in the same database. Managing thousands of separate databases is an operational nightmare.
You do have to be very diligent creating your data model and the business objects / reporting queries that access them. One approach you may want to consider is to carry the company ID in every table and ensure that every WHERE clause includes the company ID for the currently logged-in user. If you use a data access layer, you can enforce that condition there.
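Here is a minimal sketch of enforcing the company ID in a data access layer. The `TenantScopedOrders` class, the `orders` table and its columns are hypothetical names of mine, and sqlite3 stands in for whatever database you actually use. Application code is handed an object already bound to one company, so it cannot forget the WHERE clause.

```python
import sqlite3

class TenantScopedOrders:
    """Hypothetical data access object bound to a single company."""

    def __init__(self, conn, company_id):
        self._conn = conn
        self._company_id = company_id

    def add(self, item, quantity):
        # company_id is filled in here, never by the caller.
        with self._conn:
            self._conn.execute(
                "INSERT INTO orders (company_id, item, quantity) VALUES (?, ?, ?)",
                (self._company_id, item, quantity),
            )

    def all(self):
        # Every read is filtered by the bound company_id.
        return self._conn.execute(
            "SELECT item, quantity FROM orders WHERE company_id = ? ORDER BY id",
            (self._company_id,),
        ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY, company_id INTEGER NOT NULL,"
    " item TEXT NOT NULL, quantity INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_orders_company ON orders (company_id)")

acme = TenantScopedOrders(conn, company_id=1)
globex = TenantScopedOrders(conn, company_id=2)
acme.add("widget", 3)
globex.add("gadget", 5)
globex.add("gizmo", 1)

acme_orders = acme.all()
globex_orders = globex.all()
```

In a web application the object would be created once per request from the logged-in user's company, and raw connections would not be exposed to the rest of the code.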
As you grow large, you can still partition horizontally by placing groups of companies on each physical server, e.g. the first 100 companies on Server A, the next 100 companies on Server B.
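The routing for that kind of split can be very small. The sketch below assumes company IDs are assigned sequentially starting at 1, and the server names and the group size of 100 are placeholders taken from the example above, not a recommendation.

```python
# Hypothetical server list; in practice this would come from configuration.
SERVERS = ["server-a", "server-b", "server-c"]
COMPANIES_PER_SERVER = 100

def server_for(company_id):
    """Map a company to its server: ids 1-100 -> first server, 101-200 -> second, ..."""
    index = (company_id - 1) // COMPANIES_PER_SERVER
    if not 0 <= index < len(SERVERS):
        raise LookupError(f"no server configured for company {company_id}")
    return SERVERS[index]
```

A fixed range like this is easy to reason about; if individual companies later need to move between servers, a look-up table from company to server is more flexible.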
I've been coding PHP for a while now and have a pretty firm grip on it. MySQL, well, let's just say I can make it work.
I'd like to make a stats script to track the stats of other websites similar to the obvious statcounter, google analytics, mint, etc.
I, of course, would like to code this properly, and I don't see MySQL liking 20,000,000 to 80,000,000 inserts daily (roughly 925 inserts per second at the top end).
I've been doing some research and it looks like I should store each visit, "entry", into a csv or some other form of flat file and then import the data I need from it.
Am I on the right track here? I just need a push in the right direction, the direction being a way to inhale 1,000 pseudo "MySQL" inserts per second and the proper way of doing it.
Example Insert: IP, time(), http_referer, etc.
I need to collect this data for the day, and then at the end of the day, or in certain intervals, update ONE row in the database with, for example, how many extra unique hits we got. I know how to do that of course, just trying to give a visualization since I'm horrible at explaining things.
If anyone can help me, I'm a great coder, I would be more than willing to return the favor.
We tackled this at the place I've been working for the last year or so, over the summer. We didn't require much granularity in the information, so what worked very well for us was coalescing data by different time periods. For example, we'd have a single day's worth of real-time stats; after that it'd be pushed into some daily sums, and then off into a monthly table.
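A rough sketch of that coalescing step, for illustration only. It uses Python's sqlite3 as a runnable stand-in for MySQL, and the table names `hits_raw` and `hits_daily` are my own. Raw rows older than a cutoff are summed per site and day, added to the daily table, and then deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hits_raw   (site_id INTEGER NOT NULL, ts INTEGER NOT NULL);
    CREATE TABLE hits_daily (site_id INTEGER NOT NULL, day TEXT NOT NULL,
                             hits INTEGER NOT NULL, PRIMARY KEY (site_id, day));
""")

def roll_up(before_ts):
    """Fold raw hits older than before_ts into per-day sums, then drop them."""
    with conn:  # one transaction: sums and delete succeed or fail together
        rows = conn.execute(
            "SELECT site_id, date(ts, 'unixepoch') AS day, COUNT(*) "
            "FROM hits_raw WHERE ts < ? GROUP BY site_id, day",
            (before_ts,),
        ).fetchall()
        for site_id, day, count in rows:
            cur = conn.execute(
                "UPDATE hits_daily SET hits = hits + ? WHERE site_id = ? AND day = ?",
                (count, site_id, day),
            )
            if cur.rowcount == 0:  # first roll-up for this site and day
                conn.execute("INSERT INTO hits_daily VALUES (?, ?, ?)", (site_id, day, count))
        conn.execute("DELETE FROM hits_raw WHERE ts < ?", (before_ts,))

DAY1 = 1704067200  # 2024-01-01 00:00:00 UTC
DAY2 = DAY1 + 86400
conn.executemany("INSERT INTO hits_raw VALUES (?, ?)",
                 [(1, DAY1 + 10), (1, DAY1 + 20), (2, DAY1 + 30), (1, DAY2 + 5)])

roll_up(before_ts=DAY2)  # coalesce everything from before day 2

daily = conn.execute("SELECT site_id, day, hits FROM hits_daily ORDER BY site_id").fetchall()
raw_left = conn.execute("SELECT COUNT(*) FROM hits_raw").fetchone()[0]
```

The same pattern repeats one level up for daily-to-monthly sums; what you lose is the ability to ask new questions about individual hits once they are gone.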
This obviously has some huge drawbacks, namely a loss of granularity. We considered a lot of different approaches at the time. For example, as you said, CSV or some similar format could potentially serve as a way to handle a month of data at a time. The big problem is inserts however.
Start by setting out some sample schema in terms of EXACTLY what information you need to keep, and in doing so, you'll guide yourself (through revisions) to what will work for you.
Another note on the vast number of inserts: we had talked through the idea of dumping real-time statistics into a little daemon which would store up to an hour's worth of data and then, outside the real-time path, inject it into the database before the next hour was up. Just a thought.
For the kind of activity you're looking at, you need to look at the problem from a new point of view: decoupling. That is, you need to figure out how to decouple the data-recording steps so that delays and problems don't propagate back up the line.
You have the right idea in logging hits to a database table, insofar as that guarantees in-order, non-contended access. This is something the database provides. Unfortunately, it comes at a price, one of which is that the database completes the INSERT before getting back to you. Thus the recording of the hit is coupled with the invocation of the hit. Any delay in recording the hit will slow the invocation.
MySQL offers a way to decouple that; it's called INSERT DELAYED. In effect, you tell the database "insert this row, but I can't stick around while you do it" and the database says "okay, I got your row, I'll insert it when I have a minute". It is conceivable that this reduces locking issues because it lets one thread in MySQL do the insert, not whichever one you connect to. Unfortunately, it only works with MyISAM and a few other non-transactional engines, not InnoDB, and it was deprecated in MySQL 5.6.
Another solution, which is a more general solution to the problem, is to have a logging daemon that accepts your logging information and just enqueues it to wherever it has to go. The trick to making this fast is the enqueueing step. This is the sort of solution syslogd would provide.
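To illustrate the decoupling, here is a small in-process sketch: the request path only puts a tuple on a queue, and a background thread writes to the database in batches. A real logging daemon would be a separate process, and the `hits` table, the batch size and the temp-file database are assumptions for the demo (sqlite3 standing in for MySQL).

```python
import os
import queue
import sqlite3
import tempfile
import threading

def writer(hit_queue, db_path, batch_size=100):
    """Drain the queue and insert in batches; a None item means 'flush and stop'."""
    conn = sqlite3.connect(db_path)  # connection lives in the writer thread only
    conn.execute("CREATE TABLE IF NOT EXISTS hits (ip TEXT, ts INTEGER, referer TEXT)")
    batch = []
    while True:
        item = hit_queue.get()
        if item is not None:
            batch.append(item)
        if batch and (item is None or len(batch) >= batch_size):
            with conn:  # one transaction per batch, not per hit
                conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", batch)
            batch.clear()
        if item is None:
            break
    conn.close()

db_path = os.path.join(tempfile.mkdtemp(), "hits.db")
hit_queue = queue.Queue()
worker = threading.Thread(target=writer, args=(hit_queue, db_path))
worker.start()

def record_hit(ip, ts, referer):
    """What the request path does: enqueue and return immediately."""
    hit_queue.put((ip, ts, referer))

for i in range(250):
    record_hit("203.0.113.7", 1_700_000_000 + i, "https://example.test/")

hit_queue.put(None)  # ask the writer to flush what is left and exit
worker.join()

check = sqlite3.connect(db_path)
stored = check.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
check.close()
```

The trade-off is durability: hits still sitting in the queue are lost if the process dies, so how long you let a batch wait is a decision about how much data you can afford to lose.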
In my opinion it's a good thing to stick to MySQL for registering the visits, because it provides tools to analyze your data. To decrease the load I would have the following suggestions.
Make a fast collecting table with no indexes except the primary key, MyISAM, one row per hit.
Make a normalized data structure for the hits and move the records once a day to that database.
This gives you a smaller performance hit for logging and a well indexed normalized structure for querying/analyzing.
Presuming that your MySQL server is on a different physical machine to your web server, then yes it probably would be a bit more efficient to log the hit to a file on the local filesystem and then push those to the database periodically.
That would add some complexity though. Have you tested or considered testing it with regular queries? Ie, increment a counter using an UPDATE query (because you don't need each entry in a separate row). You may find that this doesn't slow things down as much as you had thought, though obviously if you are pushing 80,000,000 page views a day you probably don't have much wiggle room at all.
You should be able to get that kind of volume quite easily, provided that you do some stuff sensibly. Here are some ideas.
You will need to partition your audit table on a regular (hourly, daily?) basis, if nothing else only so you can drop old partitions to manage space sensibly. DELETEing 10M rows is not cool.
Your web servers (as you will be running quite a large farm, right?) will probably want to do the inserts in large batches, asynchronously. You'll have a daemon process which reads flat-file logs on a per-web-server machine and batches them up. This is important for InnoDB performance and to avoid auditing slowing down the web servers. Moreover, if your database is unavailable, your web servers need to continue servicing web requests and still have them audited (eventually)
As you're collecting large volumes of data, some summarisation is going to be required in order to report on it at a sensible speed - how you do this is very much a matter of taste. Make sensible summaries.
InnoDB engine tuning - you will need to tune the InnoDB engine quite significantly. In particular, have a look at the variables controlling how it flushes to disk. Writing out the log on each commit is not going to be cool (unless maybe it's on an SSD; if you need performance AND durability, consider an SSD for the logs) :) Ensure your buffer pool is big enough. Personally I'd use the InnoDB plugin and the file-per-table option, but you could also use MyISAM if you fully understand its characteristics and limitations.
I'm not going to further explain any of the above as if you have the developer skills on your team to build an application of that scale anyway, you'll either know what it means or be capable of finding it out.
Provided you don't have too many indexes, 1000 rows/sec is not unrealistic with your data sizes on modern hardware; we insert that many sometimes (and probably have a lot more indexes).
Remember to performance test it all on production-spec hardware (I don't really need to tell you this, right?).
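For the partitioning point above, the key property is that old data is removed by dropping a whole partition rather than DELETEing rows. MySQL has native table partitioning for this; the sketch below only emulates the idea with one table per day in sqlite3 so that it is runnable, and the `audit_YYYYMMDD` naming is an assumption of mine.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")

def table_for(day):
    """Table name derived from a date object (never from user input)."""
    return f"audit_{day:%Y%m%d}"

def log_event(day, message):
    table = table_for(day)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (message TEXT NOT NULL)")
    conn.execute(f"INSERT INTO {table} (message) VALUES (?)", (message,))

def drop_older_than(cutoff):
    """Drop every per-day table older than cutoff; returns the dropped names."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name LIKE 'audit_%'"
    )]
    dropped = []
    for name in names:
        if name < table_for(cutoff):  # zero-padded dates sort correctly as text
            conn.execute(f"DROP TABLE {name}")
            dropped.append(name)
    return sorted(dropped)

log_event(date(2024, 1, 1), "hit")
log_event(date(2024, 1, 2), "hit")
log_event(date(2024, 1, 3), "hit")

dropped = drop_older_than(date(2024, 1, 3))
remaining = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
```

Dropping a table or partition is a metadata operation, which is why it stays cheap no matter how many rows were in it.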
I think that using MySQL is overkill for the task of collecting the logs and summarizing them. I'd stick to plain log files in your case. They do not provide the full power of a relational database management system, but they are quite enough to generate summaries. A simple lock-append-unlock file operation on a modern OS is seamless and near-instant. By contrast, using MySQL for the same simple operation loads the CPU and may lead to swapping and other scalability hell.
Mind the storage as well. With plain text files you'll be able to store years of logs for a highly loaded website, given the current HDD price/capacity ratio and the compressibility of plain text logs.
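A minimal sketch of the lock-append-unlock operation, assuming a Unix-like system (Python's `fcntl` module is not available on Windows). The tab-separated line format and the `unique_ips` summary are illustrative choices, not a prescribed log format.

```python
import fcntl
import os
import tempfile

def append_hit(path, ip, ts, referer):
    """Append one tab-separated line under an exclusive advisory lock."""
    line = f"{ts}\t{ip}\t{referer}\n"
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # lock
        try:
            f.write(line)                  # append
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # unlock

def unique_ips(path):
    """Example end-of-day summary: number of distinct IPs in the log."""
    with open(path, encoding="utf-8") as f:
        return len({line.split("\t")[1] for line in f})

log_path = os.path.join(tempfile.mkdtemp(), "hits.log")
append_hit(log_path, "203.0.113.7", 1_700_000_000, "https://example.test/a")
append_hit(log_path, "203.0.113.7", 1_700_000_005, "https://example.test/b")
append_hit(log_path, "198.51.100.9", 1_700_000_009, "https://example.test/a")

uniques = unique_ips(log_path)
with open(log_path, encoding="utf-8") as f:
    line_count = sum(1 for _ in f)
```

The summary number is what you would then write into the single database row per day that the question describes.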