MySQL big chunks of data - fast access without recalculation


I would like to store big chunks of data in RAM using Sphinx, Solr, Elasticsearch, or whatever else suits such needs. (The problem is that I don't know which tool suits best; I have only heard that people use them.)
I build reports about sales. I get nearly 800-900k lines of sales per month, and the user wants to scroll the page and see them smoothly.
I can't give them all the data at once because the browser will just hang.
At the same time, I can't use LIMIT from MySQL because the queries require joins across several tables.
Recalculating it on the fly is not an option.
Creating a temp table in MySQL is a bad idea because there are a bunch of criteria and more than one user can view the data.
Temporary_table
id product_id product_count order_id order_status.... .....user_id
With such a table, I would store all results for the current user in the table and hold them there until the user makes a new query. But I don't like this solution. There must be something better.
I feel like it's over my head.
Any ideas?

"Drill down", don't "Scroll down"!
When I need to present a million lines of info, I start by thinking of ways to slice and dice it -- subtotals by hour, by region, by product type, by whatever. Each slice might be a hundred lines -- quite manageable, especially with summary tables.
In that hundred lines would be clickable items that take the user to a more detailed page about one of the items. That would also have a hundred lines (or 10 or 1000 -- whatever makes sense; but >1000 is usually unreasonable). That page may have further links to drill further down. And/or links to move laterally.
With suitable slicing and dicing, you are very unlikely to need to send him a million lines; only a few hundred.
With suitable Summary Tables, the "tmp tables", etc, go away.
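A minimal sketch of the summary-table idea, using Python's built-in sqlite3 as a stand-in for MySQL. The table and column names (sales, sales_by_day_region, and so on) are made up for illustration; the point is only that the top-level page reads the small summary table, and the raw rows are touched only for the one slice the user clicks on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY, sold_on TEXT, region TEXT,
        product_id INTEGER, amount REAL
    );
    -- Summary table: one row per (day, region) instead of one row per sale.
    CREATE TABLE sales_by_day_region (
        sold_on TEXT, region TEXT, sale_count INTEGER, total REAL,
        PRIMARY KEY (sold_on, region)
    );
""")
conn.executemany(
    "INSERT INTO sales (sold_on, region, product_id, amount) VALUES (?, ?, ?, ?)",
    [("2024-01-01", "north", 1, 10.0), ("2024-01-01", "north", 2, 5.0),
     ("2024-01-01", "south", 1, 7.5), ("2024-01-02", "north", 1, 10.0)],
)
# Build the summary once (for example from a nightly job), not per page view.
conn.execute("""
    INSERT INTO sales_by_day_region
    SELECT sold_on, region, COUNT(*), SUM(amount)
    FROM sales GROUP BY sold_on, region
""")

def top_level():
    """First page: a short list of subtotals, each one a clickable slice."""
    return conn.execute(
        "SELECT sold_on, region, sale_count, total "
        "FROM sales_by_day_region ORDER BY sold_on, region"
    ).fetchall()

def drill_down(sold_on, region):
    """Detail page: only the raw rows behind one slice."""
    return conn.execute(
        "SELECT product_id, amount FROM sales "
        "WHERE sold_on = ? AND region = ? ORDER BY id",
        (sold_on, region),
    ).fetchall()
```

In a real system the detail query would need an index on the slicing columns, and the slice size would be chosen so that each page stays in the hundreds of rows.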

Related

How do I make a huge table of data load faster on a webpage assuming DB is already indexed?

Say I have a webmail service and I have a table with fields - username, IP, location, login_time. Let us say my service is hugely popular, with hundreds of users logging in every minute. At the end of the day, if I want to display this table for today's list of users, there are, say, half a million rows. Now, even after indexing the DB table, it's taking a huge amount of time to load this page. How do I make it faster (or give a feel of a speedy load)? Maybe I can do pagination and load, say, 50 rows at a time as users shift pages. What if I do not have that option?

Best would be to use a jQuery "Load more" plugin and fetch only a restricted amount of data at once... Users can click the "Load more" button and see the whole table if they want.

Use backend pagination, as you said.
Imagine that you have an Excel file containing that many rows - how fast do you think it will open? And, unlike your browser, Excel is specialized software for working with rows of data.
Put it in another perspective - is it helpful in some way for the user to see half a million rows at once? I doubt it. The user can get exactly the same functionality from your software if you offer a paged results list with a search form.

I think table partitioning based on the login_time column is the solution. Partitioning lets you store parts of your table in their own logical space. Your query will only have to look at a subset of the data to get a result, and not the whole table, which can make the query many times faster depending on the number of rows. More about partitioning at the link below:
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Once you have partitioned your table, you can use a pagination mechanism since showing all 0.5 million rows to the user would not serve any purpose.

Archive Tables - Old Copies

I have a program that keeps track of games being played for the year. Once that year is done I was planning to allow admin to archive that season. To do this I created a copy of that table with a name for the previous season. I stored that name in a separate table.
The problem is that, I guess, I cannot run an SQL query with a table name that is itself read from another table.
Does anyone have any advice on how to handle this? The table could have 1000-2000 records per year, and this could be in place for 20 years (probably not, given how technology changes). I could add a column for year and filter on that, but is it going to get slow when I hit 20K - 40K worth of records?
Another idea I had is to keep the active year in one table and archive all the history to a separate one. Users don't care if history takes a few seconds to load, but they do care about current data.
Thanks

Slowness is a function of query complexity as much as it is of record count. You can make slow queries with 100 rows or fast ones with millions. If you need specific help on query optimization, post the query; otherwise there isn't much to work with, except the previously suggested indexes.
Generally, a table on the order of 10k rows will do nothing special. Don't try to overoptimize before you actually identify the weak points. Partitioning the tables is very rarely necessary before hitting millions, if not tens of millions, of rows. You're falling into the 'premature optimization' trap. Don't do it unless you're certain it's going to cause problems.
Me, I'd throw an index on the columns you use to JOIN the table (probably user_id) and be done with it. There's an automatic one on it if you're using foreign keys. Until you see your queries slowing down the application, you'd probably do more harm than good by forcing it to work around partitioning.

MySQL managing catalogue views

A friend of mine has a catalogue that currently holds about 500 rows, or 500 items. We are looking at ways that we can provide reports on the catalogue, including the number of times an item was viewed and the dates when it's viewed.
His site is averaging around 25,000 page impressions per month and if we assumed for a minute that half of these were catalogue items then we'd assume roughly 12,000 catalogue items viewed each month.
My question is the best way to manage item views in the database.
The first option is to insert the catalogue ID into a table and then increment the number of times it's viewed. The advantage of this is its compact nature. There will only ever be as many rows in the table as there are catalogue items.
`catalogue_id`, `views`
The disadvantage is that no date information is being held, short of maintaining the last time an item was viewed.
The second option is to insert a new row each time an item is viewed.
`catalogue_id`, `timestamp`
If we continue with the assumed figure of 12,000 item views, that means adding 12,000 rows to the table each month, or 144,000 rows each year. The advantage of this is that we know the number of times the item is viewed, and also the dates when it's viewed.
The disadvantage is the size of the table. Is a table with 144,000 rows becoming too large for MySQL?
Interested to hear any thoughts or suggestions on how to achieve this.
Thanks.

As you mentioned, the first option is a lot more compact but limited. However, look at option 2 in more detail: you may wish to store more than just the view count, for instance the entry/exit page, host IP, etc. This information may be invaluable for stats and tracking. The other question is whether these 25,000 impressions are unique. If not, and you are able to track by username, IP, or some other unique identifier, you could avoid using as many rows. The answer to your question depends on how much detail you wish to store and how important the data is.
Update:
True, limiting the repeats on a given item within a time interval would be a good solution. Also, knowing whether someone visited the same item could be useful for suggested-item prediction widgets similar to what Amazon does. Knowing that someone visited an item many times also says to me that this is a good item to promote to them or others in a mail-out, newsletter, or popular product page. Tracking unique views will give a more honest view count, which you can choose to display or store. On the issue of limiting the value of repeat visitors, this mainly comes into play depending on what information you display. It is all about framing the information in the way that best suits you.

Your problem statement: We want to be able to track number of views for a particular catalogue item.
Let's review your options.
First Option:
In this option you will be storing the catalogue_id and an integer value for the number of views of the item.
Advantages:
Since you really have a one-to-one relationship, the new table is going to be small. If you have 500 items you will have 500 rows. If you choose this route, I would suggest not creating a new table but adding another column to the catalogue table with the number of views in it.
Disadvantages:
The problem here is that since you are going to be updating this table relatively frequently, it is going to be a very busy little table. For example, say 10 users are viewing the same item. These 10 updates will have to run one after the other. Assuming you are using InnoDB, the first view action would come in, lock the row, update the counter, and release the lock. The other updates would queue behind it. So while the data in the table is small, it could potentially become a bottleneck later on, especially if you start scaling the system.
You are losing granular data, i.e. you are not keeping track of the raw data. For example, let's say the website starts growing and you have an interested investor who wants to see a breakdown of the views per week over the last 6 months. If you use this option you won't have the data to provide to the investor. Essentially you are keeping only a summary.
Second Option:
In this option you would create a logging table with at least the following minimal fields catalogue_id and timestamp. You could expand this to add a username/ip address or some other information to make it even more granular.
Advantages:
You are keeping granular data. This will allow you to summarise the data in a variety of ways. You could, for example, add an IP address column, store the visitor's IP, and then do a monthly report showing products viewed by country (you could do an IP address lookup to get an idea of which country they were from). Another example would be to see which products were viewed the most over the last quarter. This data is pretty essential in helping you make decisions on how to grow your business. If you want to know what is working and what is not working as far as products are concerned, this detail is absolutely critical.
Your new table will be a logging table. It will only be insert operations. Inserts can pretty much happen in parallel. If you go with this option it will probably scale better as the site grows compared to a constantly updated table.
Disadvantages:
This table will be bigger, probably the biggest table in the database. However, this is not a problem. I regularly deal with tables of 500 000 000+ rows. Some of my tables are over 750GB by themselves and I can still run reporting on them. You just need to understand your queries and how to optimise them. MySQL was designed to handle millions of rows with ease. Just keep in mind that you could archive some information into other tables. Say you archive every 3 years: you could move data older than 3 years into another table. You don't have to keep all the data there. Your estimate of 144 000 rows per year means you could probably safely keep 15+ years' worth without ever worrying about the performance of the table.
My suggestion to you is to seriously consider the second option. If you decide to go this route, update your question with the proposed table structures and let us have a look at them. Don't be scared of big data; rather, be scared of BAD design, which is much more difficult to deal with.
However as always the choice is yours.
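To make the second option concrete, here is a rough sketch using Python's sqlite3 as a stand-in for MySQL. The table name catalogue_view, the viewed_at column, and the helper functions are my own placeholders. It shows that the log table is insert-only and that a per-period breakdown falls out of a single GROUP BY; it groups by month here, and a weekly breakdown would only change the grouping expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalogue_view ("
    " catalogue_id INTEGER NOT NULL,"
    " viewed_at TEXT NOT NULL)"  # ISO timestamp, e.g. '2024-01-03 10:00:00'
)
# Index so that per-item reports do not scan the whole log.
conn.execute(
    "CREATE INDEX idx_view_item_time ON catalogue_view (catalogue_id, viewed_at)"
)

def record_view(catalogue_id, viewed_at):
    """One INSERT per page view; nothing is ever updated in place."""
    conn.execute(
        "INSERT INTO catalogue_view (catalogue_id, viewed_at) VALUES (?, ?)",
        (catalogue_id, viewed_at),
    )

def views_per_month(catalogue_id):
    """Breakdown of views per calendar month for one item."""
    return conn.execute(
        "SELECT strftime('%Y-%m', viewed_at) AS month, COUNT(*) "
        "FROM catalogue_view WHERE catalogue_id = ? "
        "GROUP BY month ORDER BY month",
        (catalogue_id,),
    ).fetchall()

def total_views(catalogue_id):
    """The simple counter from option 1 is still available as a COUNT."""
    return conn.execute(
        "SELECT COUNT(*) FROM catalogue_view WHERE catalogue_id = ?",
        (catalogue_id,),
    ).fetchone()[0]

record_view(7, "2024-01-03 10:00:00")
record_view(7, "2024-01-20 18:30:00")
record_view(7, "2024-02-01 09:15:00")
record_view(8, "2024-01-05 12:00:00")
```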

Personalized Search Results based on History

What are some of the techniques for providing personalized search results to a logged in user? One way I can think of will be by analyzing the user's browsing history.
Tracking: A log of a user's activities, like pages viewed and 'like' buttons clicked, can be used to bias search results.
Question 1: How do you track a user's browsing history? A table with columns user_id, number_of_hits, page_id? If I have 1000 daily visitors, each browsing 10 pages on average, won't there be a large number of records to select each time a personalized recommendation is required? The table will grow at 300K rows a month! It will take longer and longer to select the rows each time a search is made. I guess the table for recording 'likes' will use the same table design.
Question 2: How do you bias the results of a search? For example, if a user has been searching for Apple products, how does the search engine realise that the user likes Apple products and subsequently bias the search towards them? Tag the pages and accumulate a record of tags on the pages visited?

You probably don't want to use a relational database for this type of thing, take a look at mongodb or cassandra. That's because you basically want to add a new column to the user's history so a column-oriented database makes more sense.

300k rows per month is not really that much; in fact, that's almost nothing. It doesn't matter whether you use a relational or non-relational database for this.
Straightforward approach is the following:
put entries into the table/collection like this:
timestamp, user, action, misc information
(make sure that you put as much information as possible, such that you don't need to join this data warehousing table with any other table)
partition by timestamp (one partition per month)
never go against this table directly, instead have say daily report jobs running over all data and collect and compute the necessary statistics and write them to a summary table.
reflect on your report queries and put appropriate partition local indexes
only go against the summary table from your web frontend
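The steps above could look roughly like this. It is a sketch only, with sqlite3 standing in for the real database, and activity_log, daily_page_hits, summarize_day and top_pages are invented names. Partitioning is left out because SQLite has no equivalent; the part shown is the report job that rolls one day of raw log rows into a summary table, and the frontend query that reads only the summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw log: timestamp, user, action, misc information.
    CREATE TABLE activity_log (ts TEXT, user_id INTEGER, action TEXT, page_id INTEGER);
    -- Summary table: the only table the web frontend should query.
    CREATE TABLE daily_page_hits (
        day TEXT, user_id INTEGER, page_id INTEGER, hits INTEGER,
        PRIMARY KEY (day, user_id, page_id)
    );
""")
conn.executemany(
    "INSERT INTO activity_log VALUES (?, ?, ?, ?)",
    [("2024-03-01 09:00:00", 1, "view", 5),
     ("2024-03-01 09:05:00", 1, "view", 5),
     ("2024-03-01 09:10:00", 1, "view", 6),
     ("2024-03-01 09:20:00", 1, "like", 6),
     ("2024-03-01 10:00:00", 2, "view", 5)],
)

def summarize_day(day):
    """Daily report job: recompute the summary rows for one day.

    Deleting first makes the job safe to re-run for the same day.
    """
    with conn:  # one transaction for the delete and the insert
        conn.execute("DELETE FROM daily_page_hits WHERE day = ?", (day,))
        conn.execute(
            "INSERT INTO daily_page_hits "
            "SELECT date(ts), user_id, page_id, COUNT(*) FROM activity_log "
            "WHERE date(ts) = ? AND action = 'view' "
            "GROUP BY date(ts), user_id, page_id",
            (day,),
        )

def top_pages(user_id, limit=10):
    """Frontend query: reads the summary table, never the raw log."""
    return conn.execute(
        "SELECT page_id, SUM(hits) AS total FROM daily_page_hits "
        "WHERE user_id = ? GROUP BY page_id "
        "ORDER BY total DESC, page_id LIMIT ?",
        (user_id, limit),
    ).fetchall()

summarize_day("2024-03-01")
```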

If you stored only the last X results as opposed to everything, it would probably be doable. It might slow things down, but it'd work. Any time you're writing more data and reading more data, there's going to be an impact. Proper DBA methods such as indexing and query optimization can help, but no matter what you use there's going to be an effect.
I'd personally look at storing just a default view for the user in a DB and use the session to keep track of the rest. Sure, when you login there'd be no history. But you could take advantage of that to highlight a set of special pages that you think are important or relevant to steer the user to. A highlight system of sorts. Faster, easier, and more user-friendly.
As for bias, you could write a set of keywords for each record and array sort them accordingly. Wouldn't be terribly difficult using PHP.

I use MySQL and over 2M records (page views) a month and we run reports on that table daily and often.
The table is partitioned by month (like already suggested) and indexed where needed.
I also clear data older than 6 months out of the table by moving it into a new table called "page_view_YYMM" (YY=year, MM=month) and using some UNIONs when necessary.
For the second question, the way I would approach it is by creating a table with the list of your products that is simply:
url, description
The description will be a tag-stripped version of the content of your page or item (depending on how you want to influence the search). Then add a full-text index on description and run the search on that table, adding possible extra terms that you have been collecting while the user was surfing your site and that you think are relevant (for example category name or brand).

Optimizing queries for the next and previous element

I am looking for the best way to retrieve the next and previous records of a record without running a full query. I have a fully implemented solution in place, and would like to know whether there are any better approaches to do this out there.
Let's say we are building a web site for a fictitious greengrocer. In addition to his HTML pages, every week, he wants to publish a list of special offers on his site. He wants those offers to reside in an actual database table, and users have to be able to sort the offers in three ways.
Every item also has to have a detail page with more, textual information on the offer and "previous" and "next" buttons. The "previous" and "next" buttons need to point to the neighboring entries depending on the sorting the user had chosen for the list.
(source: pekkagaiser.com)
Obviously, the "next" button for "Tomatoes, Class I" has to be "Apples, class 1" in the first example, "Pears, class I" in the second, and none in the third.
The task in the detail view is to determine the next and previous items without running a query every time, with the sort order of the list as the only available information (Let's say we get that through a GET parameter ?sort=offeroftheweek_price, and ignore the security implications).
Obviously, simply passing the IDs of the next and previous elements as a parameter is the first solution that comes to mind. After all, we already know the ID's at this point. But, this is not an option here - it would work in this simplified example, but not in many of my real world use cases.
My current approach in my CMS is using something I have named "sorting cache". When a list is loaded, I store the item positions in records in a table named sortingcache.
name (VARCHAR)             items (TEXT)
offeroftheweek_unsorted    Lettuce; Tomatoes; Apples I; Apples II; Pears
offeroftheweek_price       Tomatoes; Pears; Apples I; Apples II; Lettuce
offeroftheweek_class_asc   Apples II; Lettuce; Apples; Pears; Tomatoes
Obviously, the items column is really populated with numeric IDs.
In the detail page, I now access the appropriate sortingcache record, fetch the items column, explode it, search for the current item ID, and return the previous and next neighbour.
array("current" => "Tomatoes",
"next" => "Pears",
"previous" => null
);
This is obviously expensive, works for a limited number of records only and creates redundant data, but let's assume that in the real world, the query to create the lists is very expensive (it is), running it in every detail view is out of the question, and some caching is needed.
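For readers who want to see the lookup spelled out, here is a small Python sketch of the "explode and search" step described above. The function name neighbours and the in-memory dict standing in for the sortingcache table are placeholders, and item names are used instead of numeric IDs to match the example.

```python
def neighbours(items, current, sep=";"):
    """Split a cached item list and return the entries around `current`.

    Raises ValueError if `current` is not in the cached list.
    """
    ids = [part.strip() for part in items.split(sep) if part.strip()]
    pos = ids.index(current)
    return {
        "current": current,
        "next": ids[pos + 1] if pos + 1 < len(ids) else None,
        "previous": ids[pos - 1] if pos > 0 else None,
    }

# Stand-in for one row of the sortingcache table (name -> items).
sorting_cache = {
    "offeroftheweek_price": "Tomatoes;Pears;Apples I; Apples II; Lettuce",
}

result = neighbours(sorting_cache["offeroftheweek_price"], "Tomatoes")
```

The lookup is linear in the length of the cached list, which is part of why this only works for a limited number of records.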
My questions:
Do you think this is a good practice to find out the neighbouring records for varying query orders?
Do you know better practices in terms of performance and simplicity? Do you know something that makes this completely obsolete?
In programming theory, is there a name for this problem?
Is the name "Sorting cache" appropriate and understandable for this technique?
Are there any recognized, common patterns to solve this problem? What are they called?
Note: My question is not about building the list, or how to display the detail view. Those are just examples. My question is the basic functionality of determining the neighbors of a record when a re-query is impossible, and the fastest and cheapest way to get there.
If something is unclear, please leave a comment and I will clarify.
Starting a bounty - maybe there is some more info on this out there.

Here is an idea. You could offload the expensive operations to an update when the grocer inserts/updates new offers rather than when the end user selects the data to view. This may seem like a non-dynamic way to handle the sort data, but it may increase speed. And, as we know, there is always a trade off between performance and other coding factors.
Create a table to hold next and previous for each offer and each sort option. (Alternatively, you could store this in the offer table if you will always have three sort options -- query speed is a good reason to denormalize your database)
So you would have these columns:
Sort Type (Unsorted, Price, Class and Price Desc)
Offer ID
Prev ID
Next ID
When the detail information for the offer detail page is queried from the database, the NextID and PrevID would be part of the results. So you would only need one query for each detail page.
Each time an offer is inserted, updated or deleted, you would need to run a process which validates the integrity/accuracy of the sorttype table.
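A rough sketch of that table and the rebuild process, with sqlite3 standing in for the real database. The names offer, offer_neighbour, SORTS, rebuild_neighbours and detail are all assumptions for the sake of the example, and only two sort options are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offer (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE offer_neighbour (
        sort_type TEXT, offer_id INTEGER, prev_id INTEGER, next_id INTEGER,
        PRIMARY KEY (sort_type, offer_id)
    );
""")
conn.executemany(
    "INSERT INTO offer VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (3, "Apples I", 1.10),
     (4, "Apples II", 0.95), (5, "Pears", 1.25)],
)

# Fixed whitelist of sort options; never built from user input.
SORTS = {"unsorted": "id ASC", "price_desc": "price DESC, id ASC"}

def rebuild_neighbours():
    """Run after every insert/update/delete of an offer."""
    with conn:
        conn.execute("DELETE FROM offer_neighbour")
        for sort_type, order_by in SORTS.items():
            ids = [row[0] for row in
                   conn.execute(f"SELECT id FROM offer ORDER BY {order_by}")]
            for i, offer_id in enumerate(ids):
                prev_id = ids[i - 1] if i > 0 else None
                next_id = ids[i + 1] if i + 1 < len(ids) else None
                conn.execute(
                    "INSERT INTO offer_neighbour VALUES (?, ?, ?, ?)",
                    (sort_type, offer_id, prev_id, next_id),
                )

def detail(offer_id, sort_type):
    """One query per detail page: offer data plus prev/next IDs."""
    return conn.execute(
        "SELECT o.name, n.prev_id, n.next_id FROM offer o "
        "JOIN offer_neighbour n ON n.offer_id = o.id "
        "WHERE o.id = ? AND n.sort_type = ?",
        (offer_id, sort_type),
    ).fetchone()

rebuild_neighbours()
```

This version rebuilds everything on each change, which is simple but only reasonable when offers change rarely.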

I have an idea somewhat similar to Jessica's. However, instead of storing links to the next and previous sort items, you store the sort order for each sort type. To find the previous or next record, just get the row with SortX = currentSort - 1 or SortX = currentSort + 1.
Example:
Type      Class  Price  Sort1  Sort2  Sort3
Lettuce   2      0.89   0      4      0
Tomatoes  1      1.50   1      0      4
Apples    1      1.10   2      2      2
Apples    2      0.95   3      3      1
Pears     1      1.25   4      1      3
This solution would yield very short query times, and would take up less disk space than Jessica's idea. However, as I'm sure you realize, the cost of updating one row of data is notably higher, since you have to recalculate and store all sort orders. But still, depending on your situation, if data updates are rare and especially if they always happen in bulk, then this solution might be the best.
i.e.
once_per_day
add/delete/update all records
recalculate sort orders
Hope this is useful.
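Here is one way the rank-column idea could be sketched, again with sqlite3 in place of the real database. Only one sort column (sort_price, my name for it) is shown, and the helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE offer ("
    " id INTEGER PRIMARY KEY, name TEXT, price REAL,"
    " sort_price INTEGER)"  # position of the row when sorted by price
)
conn.executemany(
    "INSERT INTO offer (id, name, price) VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (3, "Apples I", 1.10),
     (4, "Apples II", 0.95), (5, "Pears", 1.25)],
)

def recalculate_sort_orders():
    """Bulk step: run once after all adds/deletes/updates are done."""
    with conn:
        ids = [row[0] for row in
               conn.execute("SELECT id FROM offer ORDER BY price DESC, id")]
        # enumerate() yields (rank, id) pairs, matching the two placeholders.
        conn.executemany(
            "UPDATE offer SET sort_price = ? WHERE id = ?",
            list(enumerate(ids)),
        )

def neighbours_by_price(offer_id):
    """Return (previous_id, next_id) using rank - 1 and rank + 1."""
    (rank,) = conn.execute(
        "SELECT sort_price FROM offer WHERE id = ?", (offer_id,)
    ).fetchone()
    by_rank = dict(conn.execute(
        "SELECT sort_price, id FROM offer WHERE sort_price IN (?, ?)",
        (rank - 1, rank + 1),
    ).fetchall())
    return by_rank.get(rank - 1), by_rank.get(rank + 1)

recalculate_sort_orders()
```

With an index on each sort column, both lookups are cheap; the cost has moved entirely into recalculate_sort_orders.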

I've had nightmares with this one as well. Your current approach seems to be the best solution, even for lists of 10k items. Caching the IDs of the list view in the HTTP session, and then using that to display the previous/next links (personalized to the current user), works well, especially when there are too many ways to filter and sort the initial list of items instead of just 3.
Also, by storing the whole IDs list you get to display a "you are at X out of Y" usability enhancing text.
By the way, this is what JIRA does as well.
To directly answer your questions:
Yes, it's good practice because it scales without any added code complexity when your filter/sorting and item types grow more complex. I'm using it in a production system with 250k articles with "infinite" filter/sort variations. Trimming the cacheable IDs to 1000 is also a possibility, since the user will most probably never click on prev or next more than 500 times (he'll most probably go back and refine the search or paginate).
I don't know of a better way. But if the sorts were limited and this was a public site (with no HTTP session), then I'd most probably denormalize.
Dunno.
Yes, sorting cache sounds good. In my project I call it "previous/next on search results" or "navigation on search results".
Dunno.

In general, I denormalize the data from the indexes. They may be stored in the same rows, but I almost always retrieve my result IDs, then make a separate trip for the data. This makes caching the data very simple. It's not so important in PHP where the latency is low and the bandwidth high, but such a strategy is very useful when you have a high latency, low bandwidth application, such as an AJAX website where much of the site is rendered in JavaScript.
I always cache the lists of results, and the results themselves separately. If anything affects the results of a list query, the cache of the list results is refreshed. If anything affects the results themselves, those particular results are refreshed. This allows me to update either one without having to regenerate everything, resulting in effective caching.
Since my lists of results rarely change, I generate all the lists at the same time. This may make the initial response slightly slower, but it simplifies cache refreshing (all the lists get stored in a single cache entry).
Because I have the entire list cached, it's trivial to find neighbouring items without revisiting the database. With luck, the data for those items will also be cached. This is especially handy when sorting data in JavaScript. If I already have a copy cached on the client, I can re-sort instantly.
To answer your questions specifically:
Yes, it's a fantastic idea to find out the neighbours ahead of time, or whatever information the client is likely to access next, especially if the cost is low now and the cost to recalculate is high. Then it's simply a trade off of extra pre-calculation and storage versus speed.
In terms of performance and simplicity, avoid tying together things that are logically different. Indexes and data are different, are likely to be changed at different times (e.g. adding a new datum will affect the indexes, but not the existing data), and thus should be accessed separately. This may be slightly less efficient from a single-threaded standpoint, but every time you tie something together, you lose caching effectiveness and asynchronicity (the key to scaling is asynchronicity).
The term for getting data ahead of time is pre-fetching. Pre-fetching can happen at the time of access or in the background, but before the pre-fetched data is actually needed. Likewise with pre-calculation. It's a trade-off of cost now, storage cost, and cost to get when needed.
"Sorting cache" is an apt name.
I don't know.
Also, when you cache things, cache them at the most generic level possible. Some stuff might be user specific (such as results for a search query), where others might be user agnostic, such as browsing a catalog. Both can benefit from caching. The catalog query might be frequent and save a little each time, and the search query may be expensive and save a lot a few times.

I'm not sure whether I understood right, so if not, just tell me ;)
Let's say, that the givens are the query for the sorted list and the current offset in that list, i.e. we have a $query and an $n.
A very obvious solution to minimize the queries, would be to fetch all the data at once:
list($prev, $current, $next) = DB::q($query . ' LIMIT ?i, 3', $n - 1)->fetchAll(PDO::FETCH_NUM);
That statement fetches the previous, the current and the next elements from the database in the current sorting order and puts the associated information into the corresponding variables.
But as this solution is too simple, I assume I misunderstood something.
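The same three-row window can be sketched in Python with sqlite3 as follows. The offer table, the SORTS whitelist and the window function are assumptions of mine. One detail the one-liner above glosses over is the first position: with $n = 0 the offset would be negative, so this sketch clamps it and pads the missing neighbour with None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offer (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO offer VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (3, "Apples I", 1.10),
     (4, "Apples II", 0.95), (5, "Pears", 1.25)],
)

# Whitelisted ORDER BY clauses keyed by the ?sort= parameter value.
SORTS = {"price": "price DESC, id ASC", "unsorted": "id ASC"}

def window(sort_key, n):
    """Return (previous, current, next) names for zero-based position n."""
    order_by = SORTS[sort_key]            # KeyError for unknown sort keys
    offset = max(n - 1, 0)                # clamp: there is no row before 0
    limit = 3 if n > 0 else 2
    rows = [row[0] for row in conn.execute(
        f"SELECT name FROM offer ORDER BY {order_by} LIMIT ? OFFSET ?",
        (limit, offset),
    )]
    if n == 0:
        rows = [None] + rows              # no previous item at the start
    rows += [None] * (3 - len(rows))      # no next item at the end
    return tuple(rows)
```

Note that this still runs the sorted query on every detail view, which is exactly what the question wants to avoid when that query is expensive.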

There are as many ways to do this as to skin the proverbial cat. So here are a couple of mine.
If your original query is expensive, which you say it is, then create another table possibly a memory table populating it with the results of your expensive and seldom run main query.
This second table could then be queried on every view and the sorting is as simple as setting the appropriate sort order.
As required, repopulate the second table with results from the first table, thus keeping the data fresh but minimising the use of the expensive query.
Alternatively, if you want to avoid even connecting to the DB, you could put all the data in a PHP array and store it using memcached. This would be very fast and, provided your lists weren't too huge, resource efficient. It can also be easily sorted.
DC

Basic assumptions:
Specials are weekly
We can expect the site to change infrequently... probably daily?
We can control updates to the database with either an API or respond via triggers
If the site changes on a daily basis, I suggest that all the pages are statically generated overnight. One query for each sort-order iterates through and makes all the related pages. Even if there are dynamic elements, odds are that you can address them by including the static page elements. This would provide optimal page service and no database load. In fact, you could possibly generate separate pages and prev / next elements that are included into the pages. This may be crazier with 200 ways to sort, but with 3 I'm a big fan of it.
?sort=price
include(/sorts/$sort/tomatoes_class_1)
/*tomatoes_class_1 is probably a numeric id; sanitize your sort key... use numerics?*/
If for some reason this isn't feasible, I'd resort to memoization. Memcache is popular for this sort of thing (pun!). When something is pushed to the database, you can issue a trigger to update your cache with the correct values. Do this in the same way you would as if your updated item existed in 3 linked lists -- relink as appropriate (this.next.prev = this.prev, etc). From that, as long as your cache doesn't overfill, you'll be pulling simple values from memory in a primary key fashion.
This method will take some extra coding on the select and update / insert methods, but it should be fairly minimal. In the end, you'll be looking up [id of tomatoes class 1].price.next. If that key is in your cache, golden. If not, insert into cache and display.
Do you think this is a good practice to find out the neighboring records for varying query orders? Yes. It is wise to perform look-aheads on expected upcoming requests.
Do you know better practices in terms of performance and simplicity? Do you know something that makes this completely obsolete? Hopefully the above
In programming theory, is there a name for this problem? Optimization?
Is the name "Sorting cache" appropriate and understandable for this technique? I'm not sure of a specific appropriate name. It is caching, it is a cache of sorts, but I'm not sure that telling me you have a "sorting cache" would convey instant understanding.
Are there any recognized, common patterns to solve this problem? What are they called? Caching?
Sorry my trailing answers are kind of useless, but I think my narrative solutions should be quite useful.

You could save the row numbers of the ordered lists into views, and you could reach the previous and next items in the list under (current_rownum-1) and (current_rownum+1) row numbers.

The problem / data structure is called a bi-directional graph, or you could say you've got several linked lists.
If you think of it as a linked list, you could just add fields to the items table for every sorting and prev / next key. But the DB Person will kill you for that, it's like GOTO.
If you think of it as a (bi-)directional graph, you go with Jessica's answer. The main problem there is that order updates are expensive operations.
Item  Next  Prev
A     B     -
B     C     A
C     D     B
...
If you change one item's position so that the new order is A, C, B, D, you will have to update 4 rows.

Apologies if I have misunderstood, but I think you want to retain the ordered list between user accesses to the server. If so, your answer may well lie in your caching strategy and technologies rather than in database query/ schema optimization.
My approach would be to serialize() the array once it's first retrieved, and then cache that in a separate storage area, whether that's memcached/ APC/ hard-drive/ mongoDb/ etc., and retain its cache location details for each user individually through their session data. The actual storage backend would naturally be dependent upon the size of the array, which you don't go into much detail about, but memcached scales great over multiple servers and mongo even further at a slightly greater latency cost.
You also don't indicate how many sort permutations there are in the real world; e.g. do you need to cache separate lists per user, or can you globally cache per sort permutation and then filter out what you don't need via PHP? In the example you give, I'd simply cache both permutations and store which of the two I needed to unserialize() in the session data.
When the user returns to the site, check the Time To Live value of the cached data and re-use it if still valid. I'd also have a trigger running on INSERT/ UPDATE/ DELETE for the special offers that simply sets a timestamp field in a separate table. This would immediately indicate whether the cache was stale and the query needed to be re-run for a very low query cost. The great thing about only using the trigger to set a single field is that there's no need to worry about pruning old/ redundant values out of that table.
Whether this is suitable would depend upon the size of the data being returned, how frequently it was modified, and what caching technologies are available on your server.
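A sketch of the staleness check, with sqlite3 standing in for MySQL and a plain dict standing in for memcached/APC. Two liberties are taken and should be read as my assumptions: the trigger bumps an integer version counter rather than setting a timestamp (this avoids clock-resolution issues in a short demo), and all names (offer_change, price_sorted_ids, stats) are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offer (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    -- Single-row table holding the change marker; nothing to prune.
    CREATE TABLE offer_change (id INTEGER PRIMARY KEY CHECK (id = 1),
                               version INTEGER NOT NULL);
    INSERT INTO offer_change VALUES (1, 0);
    CREATE TRIGGER offer_ai AFTER INSERT ON offer
        BEGIN UPDATE offer_change SET version = version + 1; END;
    CREATE TRIGGER offer_au AFTER UPDATE ON offer
        BEGIN UPDATE offer_change SET version = version + 1; END;
    CREATE TRIGGER offer_ad AFTER DELETE ON offer
        BEGIN UPDATE offer_change SET version = version + 1; END;
""")
conn.executemany(
    "INSERT INTO offer VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (3, "Pears", 1.25)],
)

cache = {}                    # stand-in for memcached / APC
stats = {"query_runs": 0}     # counts how often the "expensive" query runs

def price_sorted_ids():
    """Return the cached ID list unless the offers changed since caching."""
    (version,) = conn.execute("SELECT version FROM offer_change").fetchone()
    hit = cache.get("price")
    if hit is not None and hit[0] == version:
        return hit[1]                       # cache still valid
    stats["query_runs"] += 1                # stale or missing: re-run query
    ids = [row[0] for row in
           conn.execute("SELECT id FROM offer ORDER BY price DESC, id")]
    cache["price"] = (version, ids)
    return ids
```

The cheap single-row SELECT replaces the expensive query on every request where nothing has changed.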

So you have two tasks:
build sorted list of items (SELECTs with different ORDER BY)
show details about each item (SELECT details from database with possible caching).
What is the problem?
PS: if ordered list may be too big you just need PAGER functionality implemented. There could be different implementations, e.g. you may wish to add "LIMIT 5" into query and provide "Show next 5" button. When this button is pressed, condition like "WHERE price < 0.89 LIMIT 5" is added.
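The "Show next 5" idea can be sketched like this, using sqlite3 as a stand-in and a page size of 2 to keep the sample small. The names offer and next_page are placeholders. Compared with the bare "WHERE price < 0.89" condition, this sketch also compares the id as a tie-breaker, since otherwise rows sharing the same price as the last row shown would be skipped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offer (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO offer VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (3, "Apples I", 1.10),
     (4, "Apples II", 0.95), (5, "Pears", 1.25)],
)

def next_page(after=None, page_size=2):
    """Return (rows, cursor); pass the cursor back in to get the next page.

    `after` is the (price, id) of the last row already shown, or None
    for the first page.
    """
    if after is None:
        rows = conn.execute(
            "SELECT id, name, price FROM offer "
            "ORDER BY price DESC, id DESC LIMIT ?",
            (page_size,),
        ).fetchall()
    else:
        price, last_id = after
        rows = conn.execute(
            "SELECT id, name, price FROM offer "
            "WHERE price < ? OR (price = ? AND id < ?) "
            "ORDER BY price DESC, id DESC LIMIT ?",
            (price, price, last_id, page_size),
        ).fetchall()
    cursor = (rows[-1][2], rows[-1][0]) if rows else None
    return rows, cursor
```

Usage: call next_page() for the first page, then next_page(cursor) with the cursor from the previous call until the returned list is empty.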
