How would I handle user statistics in PHP?
There are two obvious methods that I can choose. Both have their flaws.
Select MySQL COUNTs when necessary. The flaw here is that if you have many rows to count then it may be slow especially when you have to do it on seemingly every page load. The benefit is that the count will always be correct.
Store user statistics in a statistics table. The flaw here is that you have to continuously update it whenever a change is made, and this makes code overly complicated if you need to update in bulk. The benefit is that it will be fast to select a single row of stats for a user as opposed to performing counts.
Another possible method that I'm a bit "eh" about is storing a job in a queue (and have Laravel handle it). These jobs will update the statistics necessary using other tables so that it's synchronised properly. The benefit is that it takes the load off of the web server and the flaw is that a user may get incorrect statistics. It is not desirable for your own friends list to say there is for example, 15 friends and 7 friend requests when the actual numbers vary greatly.
I've put into detail the methods I have come up with and I'm not sure what's best in terms of giving correct results for the user, as well as balancing speed and simplicity. If I'm doing the COUNT method then potentially I could cache the result and remove the cache record if the statistics are to be updated but I'd imagine storing a row in the cache table for EACH user is a bit overkill. Maybe this isn't really a problem as long as the database has enough space but surely searching through a massive cache table is going to be slow anyway?
Maybe someone can give me the best choice to handle user statistics. My head is spinning as it's over-thinking everything and I need to be put on the straight and narrow.
Thanks in advance.
Don't exaggerate the cost of COUNT(*) as you plan this part of your app. If you have the correct index on your table, row counting is very quick. In fact, in if your table is MyISAM it can be O(1) in complexity.
For example, if you have an index on user the query SELECT COUNT(*) AS num FROM friend WHERE user = 'mickey#disney.com' will be very fast.
Build your app the easy way. When you have ten thousand users, you can rework this kind of statistical computation to be more elaborate and efficient. When you have more users, it will not be as obvious if you present approximate results.
Be careful, though. COUNT(*) is much faster than COUNT(expression) in most cases. The * allows MySQL to avoid evaluating every row.
I understand that multiple variables are part of this equation like number of tables, number of columns, number of returned rows, used indexes etc. But if we speak overall
Is more efficient to run a query with multiple (say 5+) joins where most of the tables will contain rows with information corresponding to rows in the main table and the returned result would be in the 20.000 rows range. For the sake of argument let's say the first table would contain users with a creation date and it's on this date we decide the users to pick out. The other tables contain stuff such as session information, user notes etc. All users should be picked out but depending on the values of fields in the secondary tables we might ignore the session data for one user and do some work with the session data on another user when we go through the results. This way we would get all needed data in one query but might get some redundant data for some users at the same time.
Or would it be more efficient to pick the users by date and when iterating the results we fetch data from the other tables per user when it's necessary?
Let's say that the work on the returned rows is done within PHP5+.
I'll say, do a benchmark.
It will depends on the frequency of "when it's necessary". If you need the extra date for 10% of the users, the seconde approach will be better I think. If you need them for 90%, it will be better to retrieve everything in one big query.
Big join.
I can cite absolutely no evidence to back that up. I do speak from some experience, though: in the system i work with, we do millions of little tiny simple queries, rather than a few big ones, and all the data-intensive work takes ages. For example, it takes an hour to load data that a direct SQL load can do in a couple of minutes. The per-query cost completely dominates the equation.
If your tables have the proper indexes (which will help a lot, when it comes to joins), one single SQL query, even a bit complex, will probably be faster than several queries, which will each imply an exchange between PHP and the MySQL server.
(But, of course, the only way to know for sure what applies the best in your specific situation is to test both solutions, benchmarking them !)
I am looking for the best way to retrieve the next and previous records of a record without running a full query. I have a fully implemented solution in place, and would like to know whether there are any better approaches to do this out there.
Let's say we are building a web site for a fictitious greengrocer. In addition to his HTML pages, every week, he wants to publish a list of special offers on his site. He wants those offers to reside in an actual database table, and users have to be able to sort the offers in three ways.
Every item also has to have a detail page with more, textual information on the offer and "previous" and "next" buttons. The "previous" and "next" buttons need to point to the neighboring entries depending on the sorting the user had chosen for the list.
(source: pekkagaiser.com)
Obviously, the "next" button for "Tomatoes, Class I" has to be "Apples, class 1" in the first example, "Pears, class I" in the second, and none in the third.
The task in the detail view is to determine the next and previous items without running a query every time, with the sort order of the list as the only available information (Let's say we get that through a GET parameter ?sort=offeroftheweek_price, and ignore the security implications).
Obviously, simply passing the IDs of the next and previous elements as a parameter is the first solution that comes to mind. After all, we already know the ID's at this point. But, this is not an option here - it would work in this simplified example, but not in many of my real world use cases.
My current approach in my CMS is using something I have named "sorting cache". When a list is loaded, I store the item positions in records in a table named sortingcache.
name (VARCHAR) items (TEXT)
offeroftheweek_unsorted Lettuce; Tomatoes; Apples I; Apples II; Pears
offeroftheweek_price Tomatoes;Pears;Apples I; Apples II; Lettuce
offeroftheweek_class_asc Apples II;Lettuce;Apples;Pears;Tomatoes
obviously, the items column is really populated with numeric IDs.
In the detail page, I now access the appropriate sortingcache record, fetch the items column, explode it, search for the current item ID, and return the previous and next neighbour.
array("current" => "Tomatoes",
"next" => "Pears",
"previous" => null
);
This is obviously expensive, works for a limited number of records only and creates redundant data, but let's assume that in the real world, the query to create the lists is very expensive (it is), running it in every detail view is out of the question, and some caching is needed.
My questions:
Do you think this is a good practice to find out the neighbouring records for varying query orders?
Do you know better practices in terms of performance and simplicity? Do you know something that makes this completely obsolete?
In programming theory, is there a name for this problem?
Is the name "Sorting cache" is appropriate and understandable for this technique?
Are there any recognized, common patterns to solve this problem? What are they called?
Note: My question is not about building the list, or how to display the detail view. Those are just examples. My question is the basic functionality of determining the neighbors of a record when a re-query is impossible, and the fastest and cheapest way to get there.
If something is unclear, please leave a comment and I will clarify.
Starting a bounty - maybe there is some more info on this out there.
Here is an idea. You could offload the expensive operations to an update when the grocer inserts/updates new offers rather than when the end user selects the data to view. This may seem like a non-dynamic way to handle the sort data, but it may increase speed. And, as we know, there is always a trade off between performance and other coding factors.
Create a table to hold next and previous for each offer and each sort option. (Alternatively, you could store this in the offer table if you will always have three sort options -- query speed is a good reason to denormalize your database)
So you would have these columns:
Sort Type (Unsorted, Price, Class and Price Desc)
Offer ID
Prev ID
Next ID
When the detail information for the offer detail page is queried from the database, the NextID and PrevID would be part of the results. So you would only need one query for each detail page.
Each time an offer is inserted, updated or deleted, you would need to run a process which validates the integrity/accuracy of the sorttype table.
I have an idea somewhat similar to Jessica's. However, instead of storing links to the next and previous sort items, you store the sort order for each sort type. To find the previous or next record, just get the row with SortX=currentSort++ or SortX=currentSort--.
Example:
Type Class Price Sort1 Sort2 Sort3
Lettuce 2 0.89 0 4 0
Tomatoes 1 1.50 1 0 4
Apples 1 1.10 2 2 2
Apples 2 0.95 3 3 1
Pears 1 1.25 4 1 3
This solution would yield very short query times, and would take up less disk space than Jessica's idea. However, as I'm sure you realize, the cost of updating one row of data is notably higher, since you have to recalculate and store all sort orders. But still, depending on your situation, if data updates are rare and especially if they always happen in bulk, then this solution might be the best.
i.e.
once_per_day
add/delete/update all records
recalculate sort orders
Hope this is useful.
I've had nightmares with this one as well. Your current approach seems to be the best solution even for lists of 10k items. Caching the IDs of the list view in the http session and then using that for displaying the (personalized to current user) previous/next. This works well especially when there are too many ways to filter and sort the initial list of items instead of just 3.
Also, by storing the whole IDs list you get to display a "you are at X out of Y" usability enhancing text.
By the way, this is what JIRA does as well.
To directly answer your questions:
Yes it's good practice because it scales without any added code complexity when your filter/sorting and item types crow more complex. I'm using it in a production system with 250k articles with "infinite" filter/sort variations. Trimming the cacheable IDs to 1000 is also a possibility since the user will most probably never click on prev or next more than 500 times (He'll most probably go back and refine the search or paginate).
I don't know of a better way. But if the sorts where limited and this was a public site (with no http session) then I'd most probably denormalize.
Dunno.
Yes, sorting cache sounds good. In my project I call it "previous/next on search results" or "navigation on search results".
Dunno.
In general, I denormalize the data from the indexes. They may be stored in the same rows, but I almost always retrieve my result IDs, then make a separate trip for the data. This makes caching the data very simple. It's not so important in PHP where the latency is low and the bandwidth high, but such a strategy is very useful when you have a high latency, low bandwidth application, such as an AJAX website where much of the site is rendered in JavaScript.
I always cache the lists of results, and the results themselves separately. If anything affects the results of a list query, the cache of the list results is refreshed. If anything affects the results themselves, those particular results are refreshed. This allows me to update either one without having to regenerate everything, resulting in effective caching.
Since my lists of results rarely change, I generate all the lists at the same time. This may make the initial response slightly slower, but it simplifies cache refreshing (all the lists get stored in a single cache entry).
Because I have the entire list cached, it's trivial to find neighbouring items without revisiting the database. With luck, the data for those items will also be cached. This is especially handy when sorting data in JavaScript. If I already have a copy cached on the client, I can resort instantly.
To answer your questions specifically:
Yes, it's a fantastic idea to find out the neighbours ahead of time, or whatever information the client is likely to access next, especially if the cost is low now and the cost to recalculate is high. Then it's simply a trade off of extra pre-calculation and storage versus speed.
In terms of performance and simplicity, avoid tying things together that are logically different things. Indexes and data are different, are likely to be changed at different times (e.g. adding a new datum will affect the indexes, but not the existing data), and thus should be accessed separately. This may be slightly less efficient from a single-threaded standpoint, but every time you tie something together, you lose caching effectiveness and asychronosity (the key to scaling is asychronosity).
The term for getting data ahead of time is pre-fetching. Pre-fetching can happen at the time of access or in the background, but before the pre-fetched data is actually needed. Likewise with pre-calculation. It's a trade-off of cost now, storage cost, and cost to get when needed.
"Sorting cache" is an apt name.
I don't know.
Also, when you cache things, cache them at the most generic level possible. Some stuff might be user specific (such as results for a search query), where others might be user agnostic, such as browsing a catalog. Both can benefit from caching. The catalog query might be frequent and save a little each time, and the search query may be expensive and save a lot a few times.
I'm not sure whether I understood right, so if not, just tell me ;)
Let's say, that the givens are the query for the sorted list and the current offset in that list, i.e. we have a $query and an $n.
A very obvious solution to minimize the queries, would be to fetch all the data at once:
list($prev, $current, $next) = DB::q($query . ' LIMIT ?i, 3', $n - 1)->fetchAll(PDO::FETCH_NUM);
That statement fetches the previous, the current and the next elements from the database in the current sorting order and puts the associated information into the corresponding variables.
But as this solution is too simple, I assume I misunderstood something.
There are as many ways to do this as to skin the proverbial cat. So here are a couple of mine.
If your original query is expensive, which you say it is, then create another table possibly a memory table populating it with the results of your expensive and seldom run main query.
This second table could then be queried on every view and the sorting is as simple as setting the appropriate sort order.
As is required repopulate the second table with results from the first table, thus keeping the data fresh, but minimising the use of the expensive query.
Alternately, If you want to avoid even connecting to the db then you could store all the data in a php array and store it using memcached. this would be very fast and provided your lists weren't too huge would be resource efficient. and can be easily sorted.
DC
Basic assumptions:
Specials are weekly
We can expect the site to change infrequently... probably daily?
We can control updates to the database with ether an API or respond via triggers
If the site changes on a daily basis, I suggest that all the pages are statically generated overnight. One query for each sort-order iterates through and makes all the related pages. Even if there are dynamic elements, odds are that you can address them by including the static page elements. This would provide optimal page service and no database load. In fact, you could possibly generate separate pages and prev / next elements that are included into the pages. This may be crazier with 200 ways to sort, but with 3 I'm a big fan of it.
?sort=price
include(/sorts/$sort/tomatoes_class_1)
/*tomatoes_class_1 is probably a numeric id; sanitize your sort key... use numerics?*/
If for some reason this isn't feasible, I'd resort to memorization. Memcache is popular for this sort of thing (pun!). When something is pushed to the database, you can issue a trigger to update your cache with the correct values. Do this in the same way you would if as if your updated item existed in 3 linked lists -- relink as appropriate (this.next.prev = this.prev, etc). From that, as long as your cache doesn't overfill, you'll be pulling simple values from memory in a primary key fashion.
This method will take some extra coding on the select and update / insert methods, but it should be fairly minimal. In the end, you'll be looking up [id of tomatoes class 1].price.next. If that key is in your cache, golden. If not, insert into cache and display.
Do you think this is a good practice to find out the neighboring records for varying query orders? Yes. It is wise to perform look-aheads on expected upcoming requests.
Do you know better practices in terms of performance and simplicity? Do you know something that makes this completely obsolete? Hopefully the above
In programming theory, is there a name for this problem? Optimization?
Is the name "Sorting cache" is appropriate and understandable for this technique? I'm not sure of a specific appropriate name. It is caching, it is a cache of sorts, but I'm not sure that telling me you have a "sorting cache" would convey instant understanding.
Are there any recognized, common patterns to solve this problem? What are they called? Caching?
Sorry my tailing answers are kind of useless, but I think my narrative solutions should be quite useful.
You could save the row numbers of the ordered lists into views, and you could reach the previous and next items in the list under (current_rownum-1) and (current_rownum+1) row numbers.
The problem / datastructur is named bi-directional graph or you could say you've got several linked lists.
If you think of it as a linked list, you could just add fields to the items table for every sorting and prev / next key. But the DB Person will kill you for that, it's like GOTO.
If you think of it as a (bi-)directional graph, you go with Jessica's answer. The main problem there is that order updates are expensive operations.
Item Next Prev
A B -
B C A
C D B
...
If you change one items position to the new order A, C, B, D, you will have to update 4 rows.
Apologies if I have misunderstood, but I think you want to retain the ordered list between user accesses to the server. If so, your answer may well lie in your caching strategy and technologies rather than in database query/ schema optimization.
My approach would be to serialize() the array once its first retrieved, and then cache that in to a separate storage area; whether that's memcached/ APC/ hard-drive/ mongoDb/ etc. and retain its cache location details for each user individually through their session data. The actual storage backend would naturally be dependent upon the size of the array, which you don't go into much detail about, but memcached scales great over multiple servers and mongo even further at a slightly greater latency cost.
You also don't indicate how many sort permutations there are in the real-world; e.g. do you need to cache separate lists per user, or can you globally cache per sort permutation and then filter out what you don't need via PHP?. In the example you give, I'd simply cache both permutations and store which of the two I needed to unserialize() in the session data.
When the user returns to the site, check the Time To Live value of the cached data and re-use it if still valid. I'd also have a trigger running on INSERT/ UPDATE/ DELETE for the special offers that simply sets a timestamp field in a separate table. This would immediately indicate whether the cache was stale and the query needed to be re-run for a very low query cost. The great thing about only using the trigger to set a single field is that there's no need to worry about pruning old/ redundant values out of that table.
Whether this is suitable would depend upon the size of the data being returned, how frequently it was modified, and what caching technologies are available on your server.
So you have two tasks:
build sorted list of items (SELECTs with different ORDER BY)
show details about each item (SELECT details from database with possible caching).
What is the problem?
PS: if ordered list may be too big you just need PAGER functionality implemented. There could be different implementations, e.g. you may wish to add "LIMIT 5" into query and provide "Show next 5" button. When this button is pressed, condition like "WHERE price < 0.89 LIMIT 5" is added.
I'm making a forum.
And I'm wondering if i should store the number of replies in the topic table or count the posts of the topic?
How much slower will it be if i use sql and count them? Lets say i have a billion posts.
Will it be much slower? Im not planning on being that big but what if? How much slower would i be compared to stroing the num in topics?
Thanks
It will be slower as your db grows in size. If you are planning on having a large post table, store the value in the topic table
I just ran some tests on a MySQL 4.0 box we have using a table with over 1 million records.
SELECT COUNT(*) FROM MyTable; ~1 million took 22ms
SELECT COUNT(*) FROM MyTable WHERE Role=1; ~800,000 took 3.2s
SELECT COUNT(*) FROM MyTable WHERE Role=2; ~20 took 12ms
The Role column in this case was indexed and this was connecting to the MySQL remotely.
I think your posts table will have to get very large for the query times to really become an issue. I also think it is a pre-optimization to put the cache of the count in your topics table. Build it without it for now and if it becomes a problem its a pretty easy update to change it.
Do not store the value in a table.
Cache the value in the application for some time so the count(*) query wont be executed too often.
Choose cache time depending on the server load: higher for very busy and zero for couple of users.
The count(*) in SqlServer is pretty fast (assuming you have index on the field you are counting on). So you just need to reduce number of hits under the heavy load.
If you will store the value in a table you will have a lot of hassle maintaining it.
This is going to affect scaling and is an issue of normalization. Hardcore normalization nerds will tell you that you shouldn't keep the number of posts on the topic because it causes redundant data. But you need to keep in mind that if you don't store it there you need to do an extra query on every load to fetch the number. The alternative is to do an extra query on every update/insert instead, which will almost always occur much less often than select's. As you scale a site to support a lot of traffic it becomes almost inevitable that you have to eventually start to de-normalize some of your data, especially in cases like this.
Redundant data isn't inherently bad. Poorly managed redundancy is. As long as you have the proper checks in place to prevent the data from getting out of sync then the potential benefit of storing the number of posts on the thread is worth the extra bit of code IMO.
I think a lot of this will depend on how rapidly you're pushing data in. If you store the value in a topic table, then you may find that you're needing to increment (or decrement if you delete records) very frequently too.
Indexes (indices?) may be a nicer option, as you can store a tiny subset of the data, and be able to access richer information. Consider the fact that it can be quite quick to count how many Farleys there are in the phone-book, because I can go straight there and easily count them.
So, as is often the case, the answer is probably 'It depends'.
I like storing counts in the table rather than counting them every time. It's such an easy operation and you never have to think about the expense of showing it when you're retrieving it. With a forum you're going to be displaying it more often than you're going to be changing it anyway so it makes sense to make that as cheap as possible. It might be a bit premature but it might save you some headaches later.
I'm designing a very simple (in terms of functionality) but difficult (in terms of scalability) system where users can message each other. Think of it as a very simple chatting service. A user can insert a message through a php page. The message is short and has a recipient name.
On another php page, the user can view all the messages that were sent to him all at once and then deletes them on the database. That's it. That's all the functionality needed for this system. How should I go about designing this (from a database/php point of view)?
So far I have the table like this:
field1 -> message (varchar)
field2 -> recipient (varchar)
Now for sql insert, I find that the time it takes is constant regardless of number of rows in the database. So my send.php will have a guaranteed return time which is good.
But for pulling down messages, my pull.php will take longer as the number of rows increase! I find the sql select (and delete) will take longer as the rows grow and this is true even after I have added an index for the recipient field.
Now, if it was simply the case that users will have to wait a longer time before their messages are pulled on the php then it would have been OK. But what I am worried is that when each pull.php service time takes really long, the php server will start to refuse connections to some request. Or worse the server might just die.
So the question is, how to design this such that it scales? Any tips/hints?
PS. Some estiamte on numbers:
number of users starts with 50,000 and goes up.
each user on average have around 10 messages stored before the other end might pull it down.
each user sends around 10-20 messages a day.
UPDATE from reading the answers so far:
I just want to clarify that by pulling down less messages from pull.php does not help. Even just pull one message will take a long time when the table is huge. This is because the table has all the messages so you have to do a select like this:
select message from DB where recipient = 'John'
even if you change it to this it doesn't help much
select top 1 message from DB where recipient = 'John'
So far from the answers it seems like the longer the table the slower the select will be O(n) or slightly better, no way around it. If that is the case, how should I handle this from the php side? I don't want the php page to fail on the http because the user will be confused and end up refreshing like mad which makes it even worse.
the database design for this is simple as you suggest. As far as it taking longer once the user has more messages, what you can do is just paginate the results. Show the first 10/50/100 or whatever makes sense and only pull those records. Generally speaking, your times shouldn't increase very much unless the volume of messages increases by an order of magnatude or more. You should be able to pull back 1000 short messages in way less than a second. Now it may take more time for the page to display at that point, but thats where the pagination should help.
I would suggest though going through and thinking of future features and building your database out a little more based on that. Adding more features to the software is easy, changing the database is comparatively harder.
Follow the rules of normalization. Try to reach 3rd normal form. To go further for this type of application probably isn’t worth it. Keep your tables thin.
Don’t actually delete rows just mark them as deleted with a bit flag. If you really need to remove them for some type of maintenance / cleanup to reduce size. Mark them as deleted and then create a cleanup process to archive or remove the records during low usage hours.
Integer values are easier for SQL server to deal with then character values. So instead of where recipient = 'John' use WHERE Recipient_ID = 23 You will gain this type of behavior when you normalize your database.
Don't use VARCHAR for your recipient. It's best to make a Recipient table with a primary key that is an integer (or bigint if you are expecting extremely large quantities of people).
Then when you do your select statements:
SELECT message FROM DB WHERE recipient = 52;
The speed retrieving rows will be much faster.
Plus, I believe MySQL indexes are B-Trees, which is O(log n) for most cases.
A database table without an index is called a heap, querying a heap results in each row of the table being evaluated even with a 'where' clause, the big-o notation for a heap is O(n) with n being the number of rows in the table. Adding an index (and this really depends on the underlying aspects of your database engine) results in a complexity of O(log(n)) to find the matching row in the table. This is because the index most certainly is implemented in a b-tree sort of way. Adding rows to the table, even with an index present is an O(1) operation.
> But for pulling down messages, my pull.php will take longer as the number of rows
increase! I find the sql select (and delete) will take longer as the rows grow and
this is true even after I have added an index for the recipient field.
UNLESS you are inserting into the middle of an index, at which point the database engine will need to shift rows down to accommodate. The same occurs when you delete from the index. Remember there is more than one kind of index. Be sure that the index you are using is not a clustered index as more data must be sifted through and moved with inserts and deletes.
FlySwat has given the best option available to you... do not use an RDBMS because your messages are not relational in a formal sense. You will get much better performance from a file system.
dbarker has also given correct answers. I do not know why he has been voted down 3 times, but I will vote him up at the risk that I may lose points. dbarker is referring to "Vertical Partitioning" and his suggestion is both acceptable and good. This isn't rocket surgery people.
My suggestion is to not implement this kind of functionality in your RDBMS, if you do remember that select, update, insert, delete all place locks on pages in your table. If you do go forward with putting this functionality into a database then run your selects with a nolock locking hint if it is available on your platform to increase concurrency. Additionally if you have so many concurrent users, partition your tables vertically as dbarker suggested and place these database files on separate drives (not just volumes but separate hardware) to increase I/O concurrency.
So the question is, how to design this such that it scales? Any tips/hints?
Yes, you don't want to use a relational database for message queuing. What you are trying to do is not what a relational database is best designed for, and while you can do it, its kinda like driving in a nail with a screwdriver.
Instead, look at one of the many open source message queues out there, the guys at SecondLife have a neat wiki where they reviewed a lot of them.
http://wiki.secondlife.com/wiki/Message_Queue_Evaluation_Notes
This is an unavoidable problem - more messages, more time to find the requested ones. The only thing you can do is what you already did - add an index and turn O(n) look up time for a complete table scan into O(log(u) + m) for a clustered index look up where n is the number of total messages, u the number of users, and m the number of messages per user.
Limit the number of rows that your pull.php will display at any one time.
The more data you transfer, longer it will take to display the page, regardless of how great your DB is.
You must limit your data in the SQL, return the most recent N rows.
EDIT
Put an index on Recipient and it will speed it up. You'll need another column to distinguish rows if you want to take the top 50 or something, possibly SendDate or an auto incrementing field. A Clustered index will slow down inserts, so use a regular index there.
You could always have only one row per user and just concatenate messages together into one long record. If you're keeping messages for a long period of time, that isn't the best way to go, but it reduces your problem to a single find and concatenate at storage time and a single find at retrieve time. It's hard to say without more detail - part of what makes DB design hard is meeting all the goals of the system in a well-compromised way. Without all the details, its hard to give advice on the best compromise.
EDIT: I thought I was fairly clear on this, but evidently not: You would not do this unless you were blanking a reader's queue when he reads it. This is why I prompted for clarification.