There are some other SO questions about concurrency but they don't quite address my scenario.
So let's say that I have a game where users interact with each other, fighting and whatnot. At any given time, a player could potentially be involved in multiple interactions with other players, all of whom can see the event happening. When any one of these players hits the site, it needs to update any data involved and show that to the user.
Example situation: Player A is fighting with player B, and events happen every few minutes in this fight. At the same time, player A is also interacting with player C. By dumb luck, the events for both interactions happen to next be due at the exact same second.
When that second arrives, by dumb luck again, both player B and player C hit the site at the same time, in order to check the status of their fights with player A. Fighting requires updates to information about player A. If I don't code properly, A's data can get messed up.
I have two games with this situation, each with a different solution and different issues. One of them uses a lock, so when a user hits the site, they acquire a lock on a db row, read the data for locks they successfully acquired, then write the changes and release the lock. But sometimes, for reasons still unknown, this fails and the lock gets stuck forever, users complain and we have to fix it manually. My other game uses a daemon to execute these transactions, making the issue (nearly) moot as there is only one process ever making these changes. But players could still do other things at the same time, and potentially cause the same issue.
I've read a bit about different solutions to this, like optimistic or timestamp-based control. I would like to ask:
Which of these is most commonly used for situations like mine, and which is easiest to implement?
My next project is using Kohana (PHP) and its ORM, so my db writes will by default take the form "just overwrite all these fields." Will I need to write my own update queries for this or can I get a solution that is compatible with the ORM?
What about transactions that involve multiple tables? The outcome of a combat has to change the table of combats, and the table of player information, possibly more things too. Which solutions are easier to work with here? Will all of my tables need transaction timestamp columns?
A lot of these solutions say that when there is a conflict, either retry or ignore. What does this mean for me? Does "retry" mean restart my entire script, which would cause additional load time for the user? I don't think ignore is a valid option, since the events have to execute at some point. In the other questions I found, presenting a conflict error to the user was usually a valid option - for me, it isn't.
What are the performance implications of concurrency control - is it even worth it?
I think what you are looking for is already contained in your question : transactions.
If you are using MySQL, you will need to setup your tables with the innoDb engine to be able to use transactions. Some documentation :
http://dev.mysql.com/doc/refman/5.1/en/commit.html
http://www.php.net/manual/en/pdo.begintransaction.php
http://www.php.net/manual/en/mysqli.autocommit.php
Don't try to reinvent the wheel when you can.
Related
I currently have a MySQL database which deals a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in real time. The structure is such, that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the clicks balance by 1 (there is actually more processing depending on an event) for each of - the user, for the sub-affiliate and the affiliate. Currently I do it very simply - once I receive the event, I do sequential queries in PHP - I read the balance of user, increment by one and store the new value, then I read the balance of the sub-affiliate, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as real time, as possible. Other metrics on the sub-aff and affiliate level are less important, but the closer they are to real-time, the better, however I think 5 minute delay might be ok.
As the project grows, it is already becoming a bottleneck, and I am now looking at alternatives - how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million of events per day. It is also important for me not to lose a single event and I actually wrap each cycle of changes to click balances in an sql transaction.
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates to the database itself by using stored procedures. I am considering adding a separate database, maybe Postgress will be better suited for the job? I tried to see if there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like hadoop with parquet (or Apache Kudu?) and just add more servers if needed.
4 - Sharding the existing db, basically adding a separate db server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize the table to do mass UPDATEs of the counters. When there is a burst of traffic, it become more efficient, thereby not keeling over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
If I were you, I would implement Redis in-memory storage, and increase there your metrics. It's very fast and reliable. You can also read from this DB. Create also cron job, which will save those data into MySQL DB.
Is your web tier doing the number crunching as it receives & processes the HTTP request? If so, the very first thing you will want to do is move this to work queue and process these events asynchronously. I believe you hint at this in your Item 3.
There are many solutions and the scope of choosing one is outside the scope of this answer, but some packages to consider:
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
In terms of storage it really depends on what you're trying to achieve, fast reads, fast writes, bulk reads, sharding/distribution, high-availability... the answer to each points you in different directions
This sounds like an excellent candidate for Clustrix which is a drop in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you. And it is transparent to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OTLP DB that happens to be able to absorb mad amounts of analytical processing on real time data as well.
I have a small scale PHP social network, running with a MySQL database. Users on the network can join various groups and receive updates.
I want to notify the user when a new update has been released by a group.
I don't want to do anything fancy with sockets, I'd just like a display of how many updates have been posted since the user was last active.
I'd thought of recording the current time against a user every time they refresh the page, this way I can compare the date of updates vs. the last time a user was active.
I'm not sure that writing to the database on every page load is a good idea though. Any other suitable suggestions?
I think the best way to do this is indeed writing to the database. However, in terms of performance there are a few ways to make it faster. Two ways I can think of are caching the updates for popular groups and creating a table which will only have users and timestamps, both indexed, so that should work very quickly.
Your solution is fine. Make sure you index the correct fields in your database.
Now if you ever have a question like this again.. go through the following steps:
Try it
Measure
If you're worried about scalability problems down the road.. do the same thing again, except you measure with how many records you expect (or want to support). It will be easy to just generate 10.000 records or even millions and try again.
Now if you actually run into unacceptable speeds, write down your problem and ask again here how you could potentially optimize.
This probably seems like a very simple question, and I would probably know if I had a more in depth knowledge of computer processes and the like, but anyway..
If two people request the same page from my server, is the PHP page processed once for the first person, and then a second time for the second person, or might these run along side each other at the same time?
Take this as an example. I have one stock Item left in my PHP driven online shop. A user adds this to their cart. Php script 1) checks to see if it is in stock, Yup, its in stock, so it 2)reserves it for him.
What If, in between checking if its in stock and reserving it, the same PHP page was loading for someone else, and just after user A checked if it was in stock, so did user B, before user A got a chance to reserve it, so they both end up reserving it!
Sorry if this seems silly, can't seem to find an answer on it, which is it?
Congratulations, you have identified a race condition! :-)
Whether PHP pages run in parallel or one after the other depends on the web server. Typically a web server allocates several threads to handle multiple incoming requests at once. So it may indeed happen that several instances of the same script are run in parallel if two or more users request the same page at the same time. Due to timing and scheduling differences it is unpredictable when each page will execute which action exactly.
Hence for such situations as you describe it is important to program actions in an atomic way, meaning that they either complete in their entirety or not at all. In your case you could use locks, transactions, cleverly formed UPDATE statements, UNIQUE indexes or a number of other techniques that avoid the possibility of two users reserving the same thing.
Yes, in general, without getting into too much detail: PHP scripts are executed simultanously for each request separately.
For making sure the problem you mentioned does not occur, you should probably implement feature of your database management system called "transactions". This way if you do something on the database layer and at the end you will find out the reservation can not happen, all the actions made within transaction will be rolled back.
In addition to transactions you should design your application keeping in mind that the problem you mentioned may occur. Thus you should design your database & application in a way allowing you to 1) shorten the time between "checking" and "reserving" as much as possible, 2) stopping the action if you cannot make reservation, and finally - in case of emergency - 3) identifying which reservation came first and which should be revoked.
Another idea, falling into category of "your application's design", may be something we could call "temporary reservation". That means you can temporarily (eg. for a couple of seconds) lock your reservation if you are about to make reservation. After that you can check if you really can make that reservation and either turn it into permanent reservation or just revoke it. I believe some systems also make longer temporary reservations right after the customer begins the process of reserving his/her places. Then, if the process is successful, the reservation is changed into permanent, but if some specific amount of time passes without success, the reservation can be simply revoked, allowing another customer to begin the process.
yes definately, they are parallel for php but when the database concerns you should learn transaction portion of database management system.
Yes and no. PHP may run in simultaneous processes depending on server setup, but on a small-scale, you'll only have one database. Database queries are handled sequentially, so you'll never have that kind of conflict. (As long as you check to see if an item's in stock immediately before you reserve it for someone.) More information
Of course, Users A + B might both see that it's in stock, and A might request it before B. But your code can realize that it's now out of stock and display an error to User B.
(You get into trouble with multiple database servers. If you have the same data stored across multiple servers, there's lag time before data can be fully replicated. But you won't have that issue. We're talking like top 1,000 sites here.)
I have a site where the users can view quite a large number of posts. Every time this is done I run a query similar to UPDATE table SET views=views+1 WHERE id = ?. However, there are a number of disadvantages to this approach:
There is no way of tracking when the pageviews occur - they are simply incremented.
Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
Therefore I consider employing an approach where I create a table, say:
object_views { object_id, year, month, day, views }, so that each object has one row pr. day in this table. I would then periodically update the views column in the objects table so that I wouldn't have to do expensive joins all the time.
This is the simplest solution I can think of, and it seems that it is also the one with the least performance impact. Do you agree?
(The site is build on PHP 5.2, Symfony 1.4 and Doctrine 1.2 in case you wonder)
Edit:
The purpose is not web analytics - I know how to do that, and that is already in place. There are two purposes:
Allow the user to see how many times a given object has been shown, for example today or yesterday.
Allow the moderators of the site to see simple view statistics without going into Google Analytics, Omniture or whatever solution. Furthermore, the results in the backend must be realtime, a feature which GA cannot offer at this time. I do not wish to use the Analytics API to retrieve the usage data (not realtime, GA requires JavaScript).
Quote : Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
There is much more than this. This is database killer.
I suggest u make table like this :
object_views { object_id, timestamp}
This way you can aggregate on object_id (count() function).
So every time someone view the page you will INSERT record in the table.
Once in a while you must clean the old records in the table. UPDATE statement is EVIL :)
On most platforms it will basically mark the row as deleted and insert a new one thus making the table fragmented. Not to mention locking issues .
Hope that helps
Along the same lines as Rage, you simply are not going to get the same results doing it yourself when there are a million third party log tools out there. If you are tracking on a daily basis, then a basic program such as webtrends is perfectly capable of tracking the hits especially if your URL contains the ID's of the items you want to track... I can't stress this enough, it's all about the URL when it comes to these tools (Wordpress for example allows lots of different URL constructs)
Now, if you are looking into "impression" tracking then it's another ball game because you are probably tracking each object, the page, the user, and possibly a weighted value based upon location on the page. If this is the case you can keep your performance up by hosting the tracking on another server where you can fire and forget. In the past I worked this using SQL updating against the ID and a string version of the date... that way when the date changes from 20091125 to 20091126 it's a simple query without the overhead of let's say a datediff function.
First just a quick remark why not aggregate the year,month,day in DATETIME, it would make more sense in my mind.
Also I am not really sure what is the exact reason you are doing that, if it's for a marketing/web stats purpose you have better to use tool made for that purpose.
Now there is two big family of tool capable to give you an idea of your website access statistics, log based one (awstats is probably the most popular), ajax/1pixel image based one (google analytics would be the most popular).
If you prefer to build your own stats database you can probably manage to build a log parser easily using PHP. If you find parsing apache logs (or IIS logs) too much a burden, you would probably make your application ouput some custom logs formated in a simpler way.
Also one other possible solution is to use memcached, the daemon provide some kind of counter that you can increment. You can log view there and have a script collecting the result everyday.
If you're going to do that, why not just log each access? MySQL can cache inserts in continuous tables quite well, so there shouldn't be a notable slowdown due to the insert. You can always run Show Profiles to see what the performance penalty actually is.
On the datetime issue, you can always use GROUP BY MONTH( accessed_at ) , YEAR( accessed_at) or WHERE MONTH(accessed_at) = 11 AND YEAR(accessed_at) = 2009.
I've been coding php for a while now and have a pretty firm grip on it, MySQL, well, lets just say I can make it work.
I'd like to make a stats script to track the stats of other websites similar to the obvious statcounter, google analytics, mint, etc.
I, of course, would like to code this properly and I don't see MySQL liking 20,000,000 to 80,000,000 inserts ( 925 inserts per second "roughly**" ) daily.
I've been doing some research and it looks like I should store each visit, "entry", into a csv or some other form of flat file and then import the data I need from it.
Am I on the right track here? I just need a push in the right direction, the direction being a way to inhale 1,000 psuedo "MySQL" inserts per second and the proper way of doing it.
Example Insert: IP, time(), http_referer, etc.
I need to collect this data for the day, and then at the end of the day, or in certain intervals, update ONE row in the database with, for example, how many extra unique hits we got. I know how to do that of course, just trying to give a visualization since I'm horrible at explaining things.
If anyone can help me, I'm a great coder, I would be more than willing to return the favor.
We tackled this at the place I've been working the last year so over summer. We didn't require much granularity in the information, so what worked very well for us was coalescing data by different time periods. For example, we'd have a single day's worth of real time stats, after that it'd be pushed into some daily sums, and then off into a monthly table.
This obviously has some huge drawbacks, namely a loss of granularity. We considered a lot of different approaches at the time. For example, as you said, CSV or some similar format could potentially serve as a way to handle a month of data at a time. The big problem is inserts however.
Start by setting out some sample schema in terms of EXACTLY what information you need to keep, and in doing so, you'll guide yourself (through revisions) to what will work for you.
Another note for the vast number of inserts: we had potentially talked through the idea of dumping realtime statistics into a little daemon which would serve to store up to an hours worth of data, then non-realtime, inject that into the database before the next hour was up. Just a thought.
For the kind of activity you're looking at, you need to look at the problem from a new point of view: decoupling. That is, you need to figure out how to decouple the data-recording steps so that delays and problems don't propogate back up the line.
You have the right idea in logging hits to a database table, insofar as that guarantees in-order, non-contended access. This is something the database provides. Unfortunately, it comes at a price, one of which is that the database completes the INSERT before getting back to you. Thus the recording of the hit is coupled with the invocation of the hit. Any delay in recording the hit will slow the invocation.
MySQL offers a way to decouple that; it's called INSERT DELAYED. In effect, you tell the database "insert this row, but I can't stick around while you do it" and the database says "okay, I got your row, I'll insert it when I have a minute". It is conceivable that this reduces locking issues because it lets one thread in MySQL do the insert, not whichever you connect to. Unfortuantely, it only works with MyISAM tables.
Another solution, which is a more general solution to the problem, is to have a logging daemon that accepts your logging information and just en-queues it to wherever it has to go. The trick to making this fast is the en-queueing step. This the sort of solution syslogd would provide.
In my opinion it's a good thing to stick to MySQL for registering the visits, because it provides tools to analyze your data. To decrease the load I would have the following suggestions.
Make a fast collecting table, with no indixes except primary key, myisam, one row per hit
Make a normalized data structure for the hits and move the records once a day to that database.
This gives you a smaller performance hit for logging and a well indexed normalized structure for querying/analyzing.
Presuming that your MySQL server is on a different physical machine to your web server, then yes it probably would be a bit more efficient to log the hit to a file on the local filesystem and then push those to the database periodically.
That would add some complexity though. Have you tested or considered testing it with regular queries? Ie, increment a counter using an UPDATE query (because you don't need each entry in a separate row). You may find that this doesn't slow things down as much as you had thought, though obviously if you are pushing 80,000,000 page views a day you probably don't have much wiggle room at all.
You should be able to get that kind of volume quite easily, provided that you do some stuff sensibly. Here are some ideas.
You will need to partition your audit table on a regular (hourly, daily?) basis, if nothing else only so you can drop old partitions to manage space sensibly. DELETEing 10M rows is not cool.
Your web servers (as you will be running quite a large farm, right?) will probably want to do the inserts in large batches, asynchronously. You'll have a daemon process which reads flat-file logs on a per-web-server machine and batches them up. This is important for InnoDB performance and to avoid auditing slowing down the web servers. Moreover, if your database is unavailable, your web servers need to continue servicing web requests and still have them audited (eventually)
As you're collecting large volumes of data, some summarisation is going to be required in order to report on it at a sensible speed - how you do this is very much a matter of taste. Make sensible summaries.
InnoDB engine tuning - you will need to tune the InnoDB engine quite significantly - in particular, have a look at the variables controlling its use of disc flushing. Writing out the log on each commit is not going to be cool (maybe unless it's on a SSD - if you need performance AND durability, consider a SSD for the logs) :) Ensure your buffer pool is big enough. Personally I'd use the InnoDB plugin and the file per table option, but you could also use MyISAM if you fully understand its characteristics and limitations.
I'm not going to further explain any of the above as if you have the developer skills on your team to build an application of that scale anyway, you'll either know what it means or be capable of finding it out.
Provided you don't have too many indexes, 1000 rows/sec is not unrealistic with your data sizes on modern hardware; we insert that many sometimes (and probably have a lot more indexes).
Remember to performance test it all on production-spec hardware (I don't really need to tell you this, right?).
I think that using MySQL is an overkill for the task of collecting the logs and summarizing them. I'd stick to plain log files in your case. It does not provide the full power of relational database management but it's quite enough to generate summaries. A simple lock-append-unlock file operation on a modern OS is seamless and instant. On the contrary, using MySQL for the same simple operation loads the CPU and may lead to swapping and other hell of scalability.
Mind the storage as well. With plain text file you'll be able to store years of logs of a highly loaded website taking into account current HDD price/capacity ratio and compressability of plain text logs