How would I handle user statistics in PHP?
There are two obvious methods that I can choose. Both have their flaws.
Select MySQL COUNTs when necessary. The flaw here is that if you have many rows to count, it may be slow, especially when you have to do it on seemingly every page load. The benefit is that the count will always be correct.
Store user statistics in a statistics table. The flaw here is that you have to continuously update it whenever a change is made, and this makes code overly complicated if you need to update in bulk. The benefit is that it will be fast to select a single row of stats for a user as opposed to performing counts.
Another possible method that I'm a bit "eh" about is storing a job in a queue (and having Laravel handle it). These jobs would update the necessary statistics using the other tables so that everything stays synchronised. The benefit is that it takes the load off the web server; the flaw is that a user may see incorrect statistics. It's not desirable for your own friends list to say there are, for example, 15 friends and 7 friend requests when the actual numbers vary greatly.
I've detailed the methods I've come up with, and I'm not sure which is best in terms of giving correct results to the user while balancing speed and simplicity. If I go with the COUNT method I could potentially cache the result and remove the cache record whenever the statistics need updating, but I'd imagine storing a cache row for EACH user is a bit of overkill. Maybe that isn't really a problem as long as the database has enough space, but surely searching through a massive cache table is going to be slow anyway?
Maybe someone can give me the best choice to handle user statistics. My head is spinning as it's over-thinking everything and I need to be put on the straight and narrow.
Thanks in advance.
Don't exaggerate the cost of COUNT(*) as you plan this part of your app. If you have the correct index on your table, row counting is very quick. In fact, if your table is MyISAM it can be O(1) in complexity.
For example, if you have an index on user, the query SELECT COUNT(*) AS num FROM friend WHERE user = 'mickey@disney.com' will be very fast.
Build your app the easy way. When you have ten thousand users, you can rework this kind of statistical computation to be more elaborate and efficient. When you have more users, it will not be as obvious if you present approximate results.
Be careful, though: COUNT(*) is much faster than COUNT(expression) in most cases. The * lets MySQL count rows without evaluating an expression against every row.
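As a quick illustration of how cheap that is in practice, here is a minimal PDO sketch; the friend table, its columns, the index name and the connection details are assumptions for the example, not something from the question.

<?php
// Hypothetical table/column names; the index is what keeps COUNT(*) cheap:
//   ALTER TABLE friend ADD INDEX idx_user (user);
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'dbuser', 'dbpass');

$stmt = $pdo->prepare('SELECT COUNT(*) AS num FROM friend WHERE user = ?');
$stmt->execute(['mickey@disney.com']);
$friendCount = (int) $stmt->fetchColumn(); // a single indexed count, typically milliseconds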
Let's say you have a search form with multiple select fields. A user picks an option from a dropdown, and before he submits the form I need to display the count of matching rows in the database.
So let's say the site has at least 300,000 visitors a day, and a user selects options from the form at least 40 times per visit. That would mean 12M AJAX requests plus 12M count queries on the database, which seems a bit too much.
The question is how one can implement a fast count (using PHP (Zend Framework) and MySQL) so that the additional 12M queries don't affect the load on the site.
One solution would be a table that stores every combination of select fields together with its count (when a product is added to or deleted from the products table, the table storing the counts would be updated). That isn't such a good idea, though, when for 8 filters (select options) out of 43 there would be 8M+ rows to insert and manage.
Any other thoughts on how to achieve this?
p.s. I don't need code examples but the idea itself that would work in this scenario.
I would probably have a pre-calculated table, as you suggest yourself. What's important is that you have a smart mechanism for two things:
Easily querying which entries are affected by which change.
Having a unique lookup field for an entire form request.
The 8M entries wouldn't be very significant if you have solid keys, as you would only require a direct lookup.
I would go through the trouble of writing specific updates for this table in all the places where it's necessary. Even with the high number of changes, this is still efficient. Done correctly, you will know exactly which rows to update or invalidate when inserting/updating/deleting a product.
Sidenote:
Based on your comment: if you need to add code in eight places to cover all the spots where a product can be deleted, it might be a good time to refactor and centralize some code.
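A rough sketch of what such a pre-calculated table and its single-lookup access could look like; the hash-key idea, table and column names are all assumptions for illustration, not something the answer prescribes.

<?php
// Illustrative schema: one row per filter combination, keyed by a hash of the
// normalized filter values so a form request maps to one direct lookup.
//
// CREATE TABLE filter_counts (
//   filter_hash CHAR(32) NOT NULL PRIMARY KEY,
//   row_count   INT UNSIGNED NOT NULL
// );

function filterHash(array $filters): string
{
    ksort($filters);                       // normalize order so the key is stable
    return md5(json_encode($filters));
}

$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'dbuser', 'dbpass');

$stmt = $pdo->prepare('SELECT row_count FROM filter_counts WHERE filter_hash = ?');
$stmt->execute([filterHash(['color' => 'red', 'size' => 'M'])]);
$count = $stmt->fetchColumn();             // false if the combination was never pre-computed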
There are a few scenarios:
MySQL has a query cache; you don't have to bother with your own caching if the table isn't updated all that frequently.
99% of users don't care exactly how many results matched; they just need the top few records.
Use EXPLAIN - it reports how many rows the query is expected to match. It's not 100% precise, but it should be good enough to act as a rough row count.
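As a sketch of that last point, the estimate can be read straight out of the EXPLAIN output's rows column; the products table and its filter columns are assumptions for illustration.

<?php
// Run EXPLAIN on the same SELECT the form would issue and use the optimizer's
// row estimate as a rough count. It is only an estimate.
$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'dbuser', 'dbpass');

$stmt = $pdo->query("EXPLAIN SELECT * FROM products WHERE color = 'red' AND size = 'M'");
$plan = $stmt->fetchAll(PDO::FETCH_ASSOC);

$estimate = 0;
foreach ($plan as $row) {
    $estimate = max($estimate, (int) $row['rows']);  // 'rows' = estimated rows examined
}
echo "About {$estimate} matching products";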
Not really what you asked for, but since you have a lot of options and want to count the items available based on those options, you should take a look at Lucene and its faceted search. It was made to solve problems like this.
If you don't need up-to-the-minute information in the search, you can use a queue system to push updates and inserts to Lucene every now and then (so you don't have to bother Lucene with a couple of thousand updates and inserts every day).
You really only have three options, and no amount of searching is likely to reveal a fourth:
Count the results manually. O(n) with the total number of the results at query-time.
Store and maintain counts for every combination of filters. O(1) to retrieve the count, but requires O(2^n) storage and O(2^n) time to update all the counts when records change.
Cache counts, only calculating them (per #1) when they're not found in the cache. O(1) when data is in the cache, O(n) otherwise.
It's for this reason that systems which have to scale beyond the trivial - that is, most of them - either cap the number of results they'll count (e.g. items in your GMail inbox or unread items in Google Reader), estimate the count based on statistics (e.g. Google search result counts), or both.
I suppose it's possible you might actually require an exact count for your users, with no limitation, but it's hard to envisage a scenario where that might actually be necessary.
I would suggest a separate table that caches the counts, combined with triggers.
In order for it to be fast, you make it a MEMORY table and you update it using triggers on inserts, deletes and updates.
pseudo code:
CREATE TABLE counts (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `option` INT NOT NULL,
  user_id INT NOT NULL,
  rowcount INT UNSIGNED NOT NULL DEFAULT 0,
  KEY idx_option (`option`) USING HASH,
  KEY idx_user (user_id) USING HASH,
  UNIQUE KEY user_option (user_id, `option`)
) ENGINE = MEMORY;
DELIMITER $$
CREATE TRIGGER ai_tablex_each AFTER UPDATE ON tablex FOR EACH ROW
BEGIN
  IF (old.`option` <> new.`option`) OR (old.user_id <> new.user_id) THEN
    -- the old combination loses one row...
    UPDATE counts SET rowcount = rowcount - 1
    WHERE user_id = old.user_id AND `option` = old.`option`;
    -- ...and the new combination gains one (insert it if it does not exist yet)
    INSERT INTO counts (rowcount, user_id, `option`)
    VALUES (1, new.user_id, new.`option`)
    ON DUPLICATE KEY UPDATE rowcount = rowcount + 1;
  END IF;
END $$
DELIMITER ;
Selection of the counts will be instant, and the updates in the trigger should not take very long either because you're using a memory table with hash indexes which have O(1) lookup time.
Links:
Memory engine: http://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html
Triggers: http://dev.mysql.com/doc/refman/5.5/en/triggers.html
A few things you can easily optimise:
Cache everything you can allow yourself to cache. The options for your dropdowns, for example - do they really need to be fetched by AJAX calls? This page answered many of my questions when I implemented memcache, and of course memcached.org has great documentation available too.
Serve anything that can be served statically. I.e., options that don't change frequently could be dumped to a flat file as a PHP array by an hourly cron job, for example, and included by the script at runtime.
MySQL with default configuration settings is often sub-optimal for any serious application load and should be tweaked to fit the needs of the task at hand. Maybe look into the MEMORY engine for high-performance read access.
You can have a look at these 3 great-but-very-technical posts on materialized views; as a matter of fact, that whole blog is truly a goldmine of performance tips for MySQL.
Good luck!
Presumably you're using AJAX to make the call to the back end you're talking about. Use some kind of cached flat file as an intermediate for the data. Set an expiry time of 5 seconds, or whatever is appropriate. Name the data file after the query key=value string. In the AJAX request, if the data file is older than your cooldown time, refresh it; if not, use the value stored in the data file.
Also, you might be underestimating the strength of the MySQL query cache mechanism. If you're using the MySQL query cache, I doubt there would be any significant performance dip over doing it the way I just described. If the query is being cached by MySQL, virtually the only slowdown would come from the network layer between your application and MySQL.
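A minimal sketch of that intermediate file cache; the 5-second TTL and the query-string naming come from the answer above, while the count query, table and column names are assumptions for illustration.

<?php
// Cache the count for a given filter combination in a flat file named after the
// query string, and only hit MySQL when the file is older than the cooldown.
function cachedCount(PDO $pdo, array $filters, int $ttl = 5): int
{
    ksort($filters);
    $file = sys_get_temp_dir() . '/count_' . md5(http_build_query($filters)) . '.txt';

    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return (int) file_get_contents($file);      // still fresh, skip the database
    }

    // Illustrative query; the real WHERE clause depends on your filters.
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM products WHERE color = ? AND size = ?');
    $stmt->execute([$filters['color'], $filters['size']]);
    $count = (int) $stmt->fetchColumn();

    file_put_contents($file, (string) $count);      // refresh the cache file
    return $count;
}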
Consider what role replication can play in your architecture. If you need to scale out, you might consider replicating your tables from InnoDB to MyISAM. The MyISAM engine automatically maintains a table count if you are doing count(*) queries without a WHERE clause. If you are doing count(col) WHERE queries, then you need to rely heavily on well-designed indices. In that case your count queries might take shape like so:
alter table A add index ixA (a, b);
select count(a) from A use index (ixA) where a=1 and b=2;
I feel crazy for suggesting this as it seems that no-one else has, but have you considered client-side caching? JavaScript isn't terrible at dealing with large lists, especially if they're relatively simple lists.
I know your ideal is to make the numbers completely accurate, but heuristics are your friend here, especially since synchronization will never be 100% - a slow connection or high latency due to server-side traffic will make the AJAX request out of date, especially if that data is not constant. IF THE DATA CAN BE EDITED BY OTHER USERS, SYNCHRONICITY IS IMPOSSIBLE USING AJAX. IF IT CANNOT BE EDITED BY ANYONE ELSE, THEN CLIENT-SIDE CACHING WILL WORK AND IS LIKELY YOUR BEST OPTION. Oh, and if you're using some sort of persistent connection, then whatever is pushing to the server can simply update all of the other clients until a sync can be accomplished.
If you're willing to do that form of caching, you can also cache the results on the server too and simply refresh the query periodically.
As others have suggested, you really need some sort of caching mechanism on the server side. Whether it's a MySQL table or memcache, either would work. But to reduce the number of calls to the server, retrieve the full list of cached counts in one request and cache that locally in javascript. That's a pretty simple way to eliminate almost 12M server hits.
You could probably even store the count information in a cookie which expires in an hour, so subsequent page loads don't need to query again. That's if you don't need real time numbers.
Many of the latest browsers also support local storage, which doesn't get passed to the server with every request the way cookies do.
You can fit a lot of data into a 1-2K json data structure. So even if you have thousands of possible count options, that is still smaller than your typical image. Just keep in mind maximum cookie sizes if you use cookie caching.
Fairly simple concept: I'm making an extremely basic message board system and I want users to have a post count. I was debating whether to keep a tally in their row that is incremented each time they create a post and decremented each time a post of theirs is deleted. However, I'm sure that performing a count query whenever the post count is requested would be more accurate, due to unforeseen circumstances (say a thread gets deleted and it doesn't lower their tally properly). On the other hand, that seems less efficient, since it means running a query EVERY time their post count is loaded - especially when they have 10 posts on the same page and the post count is listed next to each post.
Thoughts/Advice?
Thanks
post_count should definitely be a column in the user table. The little extra effort to get this right is minimal compared to the additional database load you produce by running a few count queries on every thread view.
If you use some sort of ORM or database abstraction, it should be quite simple to add the counting to its create/delete hooks.
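As a rough illustration, not tied to any particular ORM, the hooks only need to do something like the following; the helper names and the users/posts table layout are assumptions for the example.

<?php
// Hypothetical helpers; wire them into whatever create/delete hooks your ORM
// or data layer exposes (observers, model events, etc.).
function onPostCreated(PDO $pdo, int $userId): void
{
    $pdo->prepare('UPDATE users SET post_count = post_count + 1 WHERE id = ?')
        ->execute([$userId]);
}

function onPostDeleted(PDO $pdo, int $userId): void
{
    // GREATEST() guards against the count ever dropping below zero.
    $pdo->prepare('UPDATE users SET post_count = GREATEST(post_count - 1, 0) WHERE id = ?')
        ->execute([$userId]);
}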
Just go for count each time. Unless your load is going to be astronomical, COUNT shouldn't be a problem, and reduces the amount of effort involved in saving and updating data.
Just make sure you put an index on your user_id column, so that you can filter the data with a WHERE clause efficiently.
If you get to the point where this doesn't do it for you, you can implement caching strategies, but given that it's a simple message board, you shouldn't encounter that problem for a while.
EDIT:
Just saw your second concern about the same query repeating 10 times on a page. Don't do that :) Just pull the data once and store it in a variable. No need to repeat the same query multiple times.
Just use COUNT. It will be more accurate and will avoid any possible missed cases.
The case you mention of displaying the post count multiple times on a page won't be a problem unless you have an extremely high traffic site.
In any other case, the query cache of your database server will execute the query, then keep a cache of the response until any of the tables that the query relies on change. In the course of a single page load, nothing else should change, so you will only be executing the query once.
If you really need to worry about it, you can just cache it yourself in a variable and just execute the query once.
Generally speaking, your database queries will always be extremely efficient compared to your app logic. As such, the time saved by maintaining a post_count in the user table will most probably be far, far less than the cost of running an extra query to update the user table whenever a comment is posted.
Also, it is usually considered bad DB structure to have a field such as you are describing.
There are arguments for both, so ultimately it depends on the volume of traffic you expect. If your code is solid and properly layered, you can confidently keep a row count in your users' records without worrying about losing accuracy; over time, count() will potentially get heavy, but updating a row count also adds overhead.
For a small site, it makes next to no difference, so if (and only if) you're a stickler for efficiency, the only way to get a useful answer is to run some benchmarks and find out for yourself. One way or another, it's going to be 3/10ths of 2/8ths of diddley squat, so do whatever feels right :)
It's totally reasonable to store the post counts in a column in your Users table. Then, to ensure that your post counts don't become increasingly inaccurate over time, run a scheduled task (e.g. nightly) to update them based on your Posts table.
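A sketch of that scheduled reconciliation: one UPDATE run from cron rebuilds every count from the posts table, so any drift from missed increments or decrements is corrected. Table and column names are assumptions for illustration.

<?php
// Run nightly from cron: recompute every user's post_count from the posts table.
$pdo = new PDO('mysql:host=localhost;dbname=board;charset=utf8mb4', 'dbuser', 'dbpass');

$pdo->exec(
    'UPDATE users u
     SET u.post_count = (
         SELECT COUNT(*) FROM posts p WHERE p.user_id = u.id
     )'
);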
I am trying to create a point system in my program similar to stack overflow i.e. when the user does some good deed (activity) his/her points are increased. I am wondering what is the best way to go about implementing this in terms of db schema + logic.
I can think of three options:
Add an extra field called points to the users table and, every time a user does something, add to that field (but this won't give you any kind of activity history).
Create a function that runs every time the user does a good deed and recalculates the value of the points field from scratch, then updates it.
Calculate the points every time they're needed, using a function and no points field at all.
What is the best way to go about this? Thank you for your time.
Personally, I would use the second option to approach this problem.
The first option limits functionality, so I eliminate that right away.
The third option is inefficient in terms of performance - it is likely that you will be fetching that number a lot, and, if your program is anything like stackoverflow, perhaps showing (calculating) that number many times per pageview/action.
To me, the second option is a decent hybrid solution. Normally, I hate having duplicated data in my system (actions and points, rather than one or the other), but in this case, an integer field is a rather small amount of space per user that saves you a lot of time in recalculating the values unnecessarily.
We must, at times, trade data storage space for performance or vice versa, and I would say that #2 is a trade-off that greatly benefits the application.
This depends very much on the number of computations you expect. In fact, SO apparently uses a method similar to your approach 1), for performance reasons I assume.
This also prevents jumps in the numbers when underlying factors change (such as deleted items that had awarded points, replies here on SO that become community wiki, changes in the point rules, external actions such as joining another account here on SO, etc.).
If a recalc solution (2) is what you want, you could implement "smart" caching by clearing the value (setting it to NULL, meaning "dirty") each time a point modification may take place, and recomputing it whenever it is NULL, using the cached value otherwise. You could also (as a self-correcting measure for when non-explicit things happen) clear the values after an hour, a day, or whatever you think fits, so that a recalculation is forced after a certain time, independently of the "dirty" state.
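A minimal sketch of that dirty-flag caching in PHP, assuming a nullable users.points column and a user_actions log table; both names are illustrative, not from the question.

<?php
// users.points is nullable: NULL means "dirty, recompute me".
function getPoints(PDO $pdo, int $userId): int
{
    $stmt = $pdo->prepare('SELECT points FROM users WHERE id = ?');
    $stmt->execute([$userId]);
    $points = $stmt->fetchColumn();

    if ($points === null) {                      // dirty: recompute from the activity log
        $stmt = $pdo->prepare('SELECT COALESCE(SUM(points), 0) FROM user_actions WHERE user_id = ?');
        $stmt->execute([$userId]);
        $points = (int) $stmt->fetchColumn();

        $pdo->prepare('UPDATE users SET points = ? WHERE id = ?')
            ->execute([$points, $userId]);
    }
    return (int) $points;
}

// Call this whenever something happens that might change the score.
function invalidatePoints(PDO $pdo, int $userId): void
{
    $pdo->prepare('UPDATE users SET points = NULL WHERE id = ?')->execute([$userId]);
}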
I would go for a combination of 1 and 2 (with the recalculation run from cron every minute or so).
So that:
- The extra field would act as a cache of the point total.
- The function to calculate the points could be a single SQL query that recalculates the points for all users at once, to gain some speed.
I think that recalculating the field from scratch every time a point is received would be overkill.
Personally, I'd go with the first option, and add an "Actions" table to keep track of your activity history.
When a user does something good, they get an entry in the "Actions" table with the action and a point value. The point value can come from another table or a config file. That same value gets added to the user record.
At any point in time you could sum up the actions to get the user's total, but for performance, simply updating the total when you add the action record is simple enough.
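Sketching that out, the action log and the cached total can be written in one transaction so a single award can't leave them out of sync; the table layout, helper name and point values are illustrative assumptions.

<?php
// Award points for an action: log it in `actions` and add the same value to the
// user's running total, inside one transaction.
function awardPoints(PDO $pdo, int $userId, string $action, int $points): void
{
    $pdo->beginTransaction();
    try {
        $pdo->prepare('INSERT INTO actions (user_id, action, points, created_at) VALUES (?, ?, ?, NOW())')
            ->execute([$userId, $action, $points]);

        $pdo->prepare('UPDATE users SET points = points + ? WHERE id = ?')
            ->execute([$points, $userId]);

        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }
}

// Example: awardPoints($pdo, 42, 'answer_upvoted', 10);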
How simple is your points system going to be?
I reckon some kind of logging/journalling is good so that you can track activity on a daily/weekly/monthly basis across all users.
Check out http://code.google.com/p/userinfuser/
It's open source and allows you to add points and badges to your application. It has Java, Python, PHP, and Ruby bindings.
I'm designing a very simple (in terms of functionality) but difficult (in terms of scalability) system where users can message each other. Think of it as a very simple chatting service. A user can insert a message through a php page. The message is short and has a recipient name.
On another PHP page, the user can view all the messages that were sent to him at once and then delete them from the database. That's it. That's all the functionality needed for this system. How should I go about designing this (from a database/PHP point of view)?
So far I have the table like this:
field1 -> message (varchar)
field2 -> recipient (varchar)
Now, for SQL inserts I find that the time taken is constant regardless of the number of rows in the database, so my send.php has a guaranteed response time, which is good.
But for pulling down messages, my pull.php takes longer as the number of rows increases! I find the SQL SELECT (and DELETE) take longer as the rows grow, and this is true even after I added an index on the recipient field.
Now, if it were simply the case that users have to wait a little longer before their messages are pulled down, that would be OK. What I'm worried about is that when each pull.php request takes really long, the PHP server will start to refuse connections to some requests, or worse, the server might just die.
So the question is, how to design this such that it scales? Any tips/hints?
P.S. Some estimates on numbers:
The number of users starts at 50,000 and goes up.
Each user has on average around 10 messages stored before the other end pulls them down.
Each user sends around 10-20 messages a day.
UPDATE from reading the answers so far:
I just want to clarify that pulling down fewer messages in pull.php does not help. Even pulling just one message takes a long time when the table is huge, because the table holds all the messages, so you have to do a select like this:
select message from DB where recipient = 'John'
Even if you change it to this, it doesn't help much:
select message from DB where recipient = 'John' limit 1
So far, from the answers, it seems the bigger the table, the slower the select will be - O(n) or slightly better - with no way around it. If that is the case, how should I handle this from the PHP side? I don't want the PHP page to fail at the HTTP level, because the user will be confused and end up refreshing like mad, which only makes things worse.
The database design for this is simple, as you suggest. As far as it taking longer once a user has more messages, just paginate the results: show the first 10/50/100 or whatever makes sense and only pull those records. Generally speaking, your times shouldn't increase very much unless the volume of messages increases by an order of magnitude or more. You should be able to pull back 1,000 short messages in well under a second. The page may take more time to display at that point, but that's where the pagination helps.
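A rough sketch of that pagination with PDO; the messages table name, its columns and the presence of an auto-incrementing id column are assumptions, since the question's schema only has message and recipient.

<?php
// Pull one page of a recipient's messages instead of the whole mailbox.
function getMessages(PDO $pdo, string $recipient, int $page = 1, int $perPage = 50): array
{
    $perPage = max(1, $perPage);
    $offset  = max(0, ($page - 1) * $perPage);

    // LIMIT/OFFSET are already cast to int above, so interpolating them is safe.
    $stmt = $pdo->prepare(
        "SELECT message FROM messages WHERE recipient = ? ORDER BY id DESC LIMIT $perPage OFFSET $offset"
    );
    $stmt->execute([$recipient]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}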
I would suggest, though, going through and thinking about future features and building your database out a little more based on that. Adding more features to the software is easy; changing the database is comparatively harder.
Follow the rules of normalization. Try to reach third normal form; going further for this type of application probably isn't worth it. Keep your tables thin.
Don't actually delete rows; just mark them as deleted with a bit flag. If you really need to remove them for some kind of maintenance/cleanup to reduce size, mark them as deleted and then create a cleanup process to archive or remove the records during low-usage hours.
Integer values are easier for the SQL server to deal with than character values, so instead of WHERE recipient = 'John' use WHERE recipient_id = 23. You will get this kind of behaviour naturally once you normalize your database.
Don't use VARCHAR for your recipient. It's best to make a Recipient table with a primary key that is an integer (or bigint if you are expecting extremely large quantities of people).
Then when you do your select statements:
SELECT message FROM DB WHERE recipient = 52;
The speed retrieving rows will be much faster.
Plus, I believe MySQL indexes are B-Trees, which is O(log n) for most cases.
A database table without an index is called a heap; querying a heap results in each row of the table being evaluated even with a WHERE clause, so the big-O notation for a heap is O(n), with n being the number of rows in the table. Adding an index (and this really depends on the underlying aspects of your database engine) reduces the complexity of finding the matching row to O(log(n)), because the index is almost certainly implemented as some kind of b-tree. Adding rows to the table, even with an index present, is an O(1) operation.
> But for pulling down messages, my pull.php will take longer as the number of rows increase! I find the sql select (and delete) will take longer as the rows grow and this is true even after I have added an index for the recipient field.
UNLESS you are inserting into the middle of an index, at which point the database engine will need to shift rows down to accommodate. The same occurs when you delete from the index. Remember there is more than one kind of index. Be sure that the index you are using is not a clustered index as more data must be sifted through and moved with inserts and deletes.
FlySwat has given the best option available to you... do not use an RDBMS because your messages are not relational in a formal sense. You will get much better performance from a file system.
dbarker has also given correct answers. I do not know why he has been voted down 3 times, but I will vote him up at the risk of losing points myself. dbarker is referring to "Vertical Partitioning", and his suggestion is both acceptable and good. This isn't rocket surgery, people.
My suggestion is to not implement this kind of functionality in your RDBMS. If you do, remember that SELECT, UPDATE, INSERT and DELETE all place locks on pages in your table. If you go forward with putting this functionality into a database, then run your selects with a NOLOCK locking hint, if it is available on your platform, to increase concurrency. Additionally, if you have that many concurrent users, partition your tables vertically as dbarker suggested and place those database files on separate drives (not just separate volumes but separate hardware) to increase I/O concurrency.
> So the question is, how to design this such that it scales? Any tips/hints?
Yes: you don't want to use a relational database for message queuing. What you are trying to do is not what a relational database is best designed for, and while you can do it, it's kind of like driving in a nail with a screwdriver.
Instead, look at one of the many open source message queues out there, the guys at SecondLife have a neat wiki where they reviewed a lot of them.
http://wiki.secondlife.com/wiki/Message_Queue_Evaluation_Notes
This is an unavoidable problem - more messages, more time to find the requested ones. The only thing you can do is what you already did - add an index, turning the O(n) lookup time of a complete table scan into O(log(u) + m) for a clustered index lookup, where n is the total number of messages, u the number of users, and m the number of messages per user.
Limit the number of rows that your pull.php will display at any one time.
The more data you transfer, the longer it will take to display the page, regardless of how great your DB is.
You must limit your data in the SQL; return only the most recent N rows.
EDIT
Put an index on Recipient and it will speed things up. You'll need another column to distinguish rows if you want to take the top 50 or so - possibly SendDate or an auto-incrementing field. A clustered index will slow down inserts, so use a regular (secondary) index there.
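Concretely, that might look something like the sketch below (all names are assumptions); the composite index lets MySQL jump straight to one recipient's newest rows instead of scanning the table.

<?php
// One-time schema change (illustrative):
//   ALTER TABLE messages
//     ADD COLUMN id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
//     ADD INDEX idx_recipient_id (recipient, id);
$pdo = new PDO('mysql:host=localhost;dbname=chat;charset=utf8mb4', 'dbuser', 'dbpass');

// Return only the 50 newest messages for this recipient.
$stmt = $pdo->prepare('SELECT message FROM messages WHERE recipient = ? ORDER BY id DESC LIMIT 50');
$stmt->execute(['John']);
$messages = $stmt->fetchAll(PDO::FETCH_COLUMN);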
You could always have only one row per user and just concatenate messages together into one long record. If you're keeping messages for a long period of time, that isn't the best way to go, but it reduces the problem to a single find-and-concatenate at storage time and a single find at retrieval time. It's hard to say without more detail - part of what makes DB design hard is meeting all the goals of the system in a well-compromised way. Without all the details, it's hard to give advice on the best compromise.
EDIT: I thought I was fairly clear on this, but evidently not: You would not do this unless you were blanking a reader's queue when he reads it. This is why I prompted for clarification.