This is for an upcoming project. I have two tables: the first one keeps track of photos, and the second one keeps track of each photo's rank.
Photos:
+-------+-----------+------------------+
| id | photo | current_rank |
+-------+-----------+------------------+
| 1 | apple | 5 |
| 2 | orange | 9 |
+-------+-----------+------------------+
The photo rank keeps changing on a regular basis, and this is the table that tracks it:
Ranks:
+-------+-----------+----------+-------------+
| id | photo_id | ranks | timestamp |
+-------+-----------+----------+-------------+
| 1 | 1 | 8 | * |
| 2 | 2 | 2 | * |
| 3 | 1 | 3 | * |
| 4 | 1 | 7 | * |
| 5 | 1 | 5 | * |
| 6 | 2 | 9 | * |
+-------+-----------+----------+-------------+
* = current timestamp
Every rank is tracked for reporting/analysis purposes.
[Edit] Users will have access to the statistics on demand.
I talked to someone who has experience in this field, and he told me that storing ranks like above is the way to go. But I'm not so sure yet.
The problem here is data redundancy. There are going to be tens of thousands of photos. The photo rank changes on an hourly basis (often many times within minutes) for recent photos, but less frequently for older photos. At this rate the table will have millions of records within months, and since I do not have experience working with large databases, this makes me a little nervous.
I thought of this:
Ranks:
+-------+-----------+--------------------+
| id | photo_id | ranks |
+-------+-----------+--------------------+
| 1 | 1 | 8:*,3:*,7:*,5:* |
| 2 | 2 | 2:*,9:* |
+-------+-----------+--------------------+
* = current timestamp
That means some extra code in PHP to split the rank/time (and sorting), but that looks OK to me.
Is this a correct way to optimize the table for performance? What would you recommend?
The first one. Period.
Actually you'll lose much more. A timestamp stored in an int column occupies only 4 bytes of space, while the same timestamp stored in string format takes 10 bytes.
Your first design is correct for a relational database. The redundancy in the key columns is preferable because it gives you a lot more flexibility in how you validate and query the rankings. You can do sorts, counts, averages, etc. in SQL without having to write any PHP code to split your string six ways from Sunday.
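For example, a couple of things that become one-liners with the first design (a quick sketch using the column names from your tables):
SELECT photo_id, COUNT(*) AS times_ranked, AVG(ranks) AS avg_rank
FROM Ranks
GROUP BY photo_id;

-- rank history of one photo, newest first
SELECT ranks, `timestamp`
FROM Ranks
WHERE photo_id = 1
ORDER BY `timestamp` DESC;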
It sounds like you would like to use a non-SQL database like CouchDB or MongoDB. These would allow you to store a semi-structured list of rankings right in the record for the photo and then query the rankings efficiently, with the caveat that you can't be as sure the rankings are in the right format as you can be with SQL.
I would stick with your first approach. With the second, you will have more and more data stored in a single row as time goes by and a photo collects thousands and thousands of rankings.
The first approach is also more maintainable, for example if you ever wish to delete a single rank.
I'd think the database 'hit' of over-normalisation (querying the ranks table over and over) is nicely avoided by 'caching' the last rank in current_rank. It does not really matter that ranks grows tremendously if it is seldom queried (analysis/reporting, you said), never updated, and only gets records appended at the end: even a very light box would have no problem with millions of rows in that table.
Your alternative would require lots of updates at different locations on the disk, possibly resulting in degraded performance.
Of course, if you need all the old data, and always by photo_id, you could schedule a run that moves data to another table, say rankings_old (possibly with photo_id, year, month and the rankings including timestamps), once a month is over. Retrieving old data stays easily possible, and there are no updates needed in rankings_old or rankings, only inserts at the end of the table.
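A minimal sketch of such a scheduled run, with illustrative names for the archive table and its columns:
-- copy everything before the start of the current month into the archive
INSERT INTO rankings_old (photo_id, `year`, `month`, ranks, `timestamp`)
SELECT photo_id, YEAR(`timestamp`), MONTH(`timestamp`), ranks, `timestamp`
FROM Ranks
WHERE `timestamp` < DATE_FORMAT(NOW(), '%Y-%m-01');

-- optionally prune the hot table afterwards
DELETE FROM Ranks
WHERE `timestamp` < DATE_FORMAT(NOW(), '%Y-%m-01');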
And take it from me: millions of records in a pure logging table should be absolutely no problem.
Normalized data or denormalized data: you will find thousands of articles about that. :)
It really depends on your needs.
If you want to build your database with only performance (speed, RAM consumption, ...) in mind, you should trust only the numbers. To do that, you have to profile your queries against the expected data volume (you can generate the data with a script you write). To profile your queries, learn how to read the results of the 2 following statements:
EXPLAIN extended...
SHOW STATUS
Then learn what to do to improve the figures (mysql settings, data structure, hardware, etc).
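As a rough illustration of such a profiling session (the SELECT is just a placeholder for your own query):
FLUSH STATUS;                                    -- reset the session status counters
SELECT photo_id, AVG(ranks) FROM Ranks GROUP BY photo_id;
SHOW SESSION STATUS LIKE 'Handler%';             -- how many rows were touched, and how
EXPLAIN EXTENDED SELECT photo_id, AVG(ranks) FROM Ranks GROUP BY photo_id;
SHOW WARNINGS;                                   -- the query as the optimizer rewrote it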
As a starter, I really advise these 2 great articles:
http://www.xaprb.com/blog/2006/10/12/how-to-profile-a-query-in-mysql/
http://ajohnstone.com/archives/mysql-php-performance-optimization-tips/
If you want to build for the academic beauty of normalization: just follow the books and the general recommendations. :)
Out of the two options - like everyone before me said - it has to be option 1.
What you should really be concerned about are the bottlenecks in the application itself. Are users going to refer to the historical data often, or does it only show up for a few select users? If everyone gets to see the historical rank data, then option 1 is good enough. If you are not going to refer to the historical ranks that often, then you could create a third "archive" table: before updating the ranks, copy the rows of the original rank table to the archive table. This keeps the number of rows minimal in the main table that is being hit.
Remember, if you're updating rows and there are tens of thousands of them, it might be more fruitful to compute the results in your code (PHP/Python/etc.), truncate the table and insert the results back in, rather than updating it row by row, as that would be a potential bottleneck.
You may want to look up sharding as well (horizontal partitioning) - http://en.wikipedia.org/wiki/Shard_%28database_architecture%29
And never forget to index well.
Hope that helped.
You stated the rank is only linked to the image, in which case all you need is table 1, updating the rank in real time. Table 2 just stores unnecessary data. The disadvantage of this approach is that a user can't change their vote.
You said the second table is for analysis/statistics, so it isn't something that needs to be stored in the database. My suggestion is to get rid of the second table and use a logging facility to record rank changes.
Your second design is very dangerous in case a photo gets 1 million votes. Can PHP handle that?
With the first design you can do all the math at the database level, which returns you a small result set.
I have a MySQL table with area and lat/lon location columns. Every area has many locations, say 20,000. Is there a way to pick just a few, say 100, that look somewhat evenly distributed on the map?
The distribution doesn't have to be perfect, query speed is more important. If that is not possible directly with MySQL a very fast algorithm that somehow picks evenly distributed locations might also work.
Thanks in advance.
Edit: answering some requests in comments. The data doesn't have something that can be used, it's just area and coordinates of locations, example:
+-------+--------------+----------+-----------+------------+--------+--------+
| id | area | postcode | lat | lon | colour | size |
+-------+--------------+----------+-----------+------------+--------+--------+
| 16895 | Athens | 10431 | 37.983917 | 23.7293599 | red | big |
| 16995 | Athens | 11523 | 37.883917 | 23.8293599 | green | medium |
| 16996 | Athens | 10432 | 37.783917 | 23.7293599 | yellow | small |
| 17000 | Thessaloniki | 54453 | 40.783917 | 22.7293599 | green | small |
+-------+--------------+----------+-----------+------------+--------+--------+
There are some more columns with characteristics but those are just used for filtering.
In the meantime I did try taking every nth row; it seems to work, although it's a bit slow:
SET @a = 0;
SELECT * FROM `locations` WHERE (@a := @a + 1) % 200 = 0;
Using RAND() also works, but it's a bit slow too.
Edit2: It turns out it was easy to add postal codes to the table. With those, grouping by postal code gives a result that looks nice to the eye. The only issue is that some very large areas have around 3000 distinct postcodes, so picking just 100 may leave many of them clustered in one place, and I will probably need to post-process in PHP.
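For reference, one way the "one location per postcode" idea could look, using the column names from the sample table (a sketch, not tuned against real data):
SELECT l.*
FROM locations AS l
JOIN (
    SELECT MIN(id) AS id          -- one representative row per postcode
    FROM locations
    WHERE area = 'Athens'
    GROUP BY postcode
) AS pick ON pick.id = l.id
LIMIT 100;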
Edit3: answering @RickJames's questions from the comments so they are in one place:
Please define "evenly distributed" -- evenly spaced in latitude? no two are "close" to each other? etc.
"Evenly distributed" was a bad choice of words. We just want to show some locations on the area that are not all in one place
Are the "areas" rectangles? Hexagons? Or gerrymandered congressional districts?
They can be thought of roughly as rectangles, but it shouldn't matter. An important thing I missed: we also need to show locations from multiple areas. Areas may be far apart from each other or neighboring (but not overlapping). In that case we'd want the sample of 100 to be split between the areas.
Is the "100 per area" fixed? Or can it be "about 100"
It's not fixed, it's around 100 but we can change this if it doesn't look nice
Is there an AUTO_INCREMENT id on the table? Are there gaps in the numbers?
Yes there is an AUTO_INCREMENT id and can have gaps
Has the problem changed from "100 per area" to "1 per postal code"?
Nope, the problem is still the same: "show 100 per area in a way that not all of them are in the same place"; how this is done doesn't matter.
What are the total row count and desired number of rows in output?
Total row count depends on the area and the criteria; it can be up to 40k in an area. If the total is more than 1000 we want to fall back to showing just a random 100. If it's 1000 or less we can just show all of them.
Do you need a different sample each time you run the query?
Same sample or different sample even with the same criteria is fine
Are you willing to add a column to the table?
It's not up to me, but if I have a good argument then we can most probably add a new column.
Here's an approach that may satisfy the goals.
Preprocess the table, making a new table, to get rid of "duplicate" items.
If the new table is small enough, a full scan of it may be fast enough.
As for "duplicates", consider this as a crude way to discover that two items land in the same spot:
SELECT ROUND(latitude * 5),
       ROUND(longitude * 3),
       MIN(id) AS id_to_keep
FROM tbl
GROUP BY 1, 2
The "5" and "3" can be tweaked upward (or downard) to cause more (or fewer) ids to be kept. "5" and "3" are different because of the way the lat/lng are laid out; that ratio might work most temperate latitudes. (Use equal numbers near the equator, use a bigger ration for higher latitudes.)
There is a minor flaw... Two items very close to each other might be across the boundaries created by those ROUNDs.
How many rows are in the original table? How many rows does the above query generate? ( SELECT COUNT(*) FROM ( ... ) x; )
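Putting it together, a rough sketch (the thinned table name is made up, and an area column is assumed to exist in tbl as in the question's table):
CREATE TABLE tbl_thinned AS
SELECT ROUND(latitude * 5)  AS lat_cell,
       ROUND(longitude * 3) AS lng_cell,
       MIN(id)              AS id_to_keep
FROM tbl
GROUP BY 1, 2;

-- then sample roughly 100 of the surviving rows for one area
SELECT t.*
FROM tbl AS t
JOIN tbl_thinned AS k ON k.id_to_keep = t.id
WHERE t.area = 'Athens'
ORDER BY RAND()
LIMIT 100;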
So I need an opinion / a possible solution on the matter below.
There is this questionnaire with 67 questions, coded in PHP and using a MySQL database. By design the data table contains an ID plus one column per question number.
So, I will generate a report from these answers, i.e. I'll get the mean and median for each question and show them on a user report screen. There are 493 rows now, and I want to build something that will not take longer and longer to process over time.
Any opinions or an approach that makes the process easier (bearable)? Shall I create a class for the calculations, run it for each question, and store the values in a view? I found an answer here for a similar issue but just could not be sure. I would really love to hear any ideas.
Personally, I'd avoid using a table 67 columns wide, and do a 3-column table with a two-column Primary-key instead.
ID | Q | Result
1 | 1 | 1
1 | 2 | 3
1 | 3 | 2
...
4 | 5 | 4
Then run stats on that; it'll be 67 times longer, but your stats will all be primary-key lookups. And anything less than a couple million rows will be pretty damned fast anyway.
Oh, and do the stats using MySQL; it's good at that sort of thing. For example (assuming the three-column table is named answers):
SELECT AVG(Result) FROM answers WHERE Q = 1;
And use this solution for the median.
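If you want a MySQL-only median as well, one common approximation is the following sketch (it assumes the same three-column table named answers and returns the lower middle value for even counts):
SELECT Q,
       AVG(Result) AS mean,
       SUBSTRING_INDEX(
           SUBSTRING_INDEX(GROUP_CONCAT(Result ORDER BY Result), ',', CEIL(COUNT(*) / 2)),
           ',', -1) AS median_low
FROM answers
GROUP BY Q;
-- for much larger groups you may need to raise group_concat_max_len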
What's the best way to store site statistics for specific users? Basically I want to store how many times a user has done a specific task. The data will be coming from a potentially large table and will be referenced frequently, so I want to avoid COUNT() and store them in their own table.
Method A
Have a table with the following fields, then have a row for each user to store the count for each field:
User_id | posted_comments | comment_replies | post_upvotes | post_downvotes
50      | 12              | 7               | 23           | 54
Method B
Have one table storing the actions, and another storing the count for that action:
Table 1:
Id | Action
1 | posted_comments
2 | comment_replies
3 | post_upvotes
4 | post_downvotes
Table 2
User_id | Action | Count
50 | 1 | 12
50 | 2 | 7
50 | 3 | 23
50 | 4 | 54
I can't see me having more than 25-30 actions in total, but I'm not sure if that is too many to store horizontally as in method A.
I think you answered your question. If you don't know what the actions are, then store each action in a separate row. That would be the second option.
Be sure that you have the proper indexes on the table. One possibility is (user_id, action, count). With this index, it will be fast to denormalize the table at the user level.
If you have a well-defined problem and won't need to be adding/removing/renaming columns in a table, then the first version is also feasible. Otherwise, just stick with inserting rows. The queries may seem a little bit more complicated, but the application is more flexible.
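For instance, denormalizing Method B back into the Method A shape is one pivot query (a sketch; the counts table is hypothetically named user_action_counts here, and the action ids follow Table 1):
SELECT user_id,
       SUM(CASE WHEN action = 1 THEN `count` ELSE 0 END) AS posted_comments,
       SUM(CASE WHEN action = 2 THEN `count` ELSE 0 END) AS comment_replies,
       SUM(CASE WHEN action = 3 THEN `count` ELSE 0 END) AS post_upvotes,
       SUM(CASE WHEN action = 4 THEN `count` ELSE 0 END) AS post_downvotes
FROM user_action_counts
GROUP BY user_id;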
Seems like a typical BI question to me. The real question is not how many "actions" you have in your dimension, but how often they change.
Table A is denormalized and quick and easy to read: with a "SELECT" you get your information in the proper format.
Table B is normalized and easier to maintain. It is highly recommended if your list of actions is difficult to define in advance, and it is a must if the list is dynamic.
Passing back and forth between Table A and Table B is known as a pivot operation, for which you can find standard tools, but which is never easy to code manually. So do not jump too quickly to the conclusion that Table B is better just because everybody has said so since Codd in 1970.
I suggest you ask yourself how often your COUNT(*) table(s) will be read. If you can live with yesterday's statistics, then compute BOTH tables every night.
Apologies if this has been covered thoroughly in the past - I've seen some related posts but haven't found anything that satisfies me with regards to this specific scenario.
I've been recently looking over a relatively simple game with around 10k players. In the game you can catch and breed pets that have certain attributes (i.e. wings, horns, manes). There's currently a table in the database that looks something like this:
-------------------------------------------------------------------------------
| pet_id | wings1 | wings1_hex | wings2 | wings2_hex | horns1 | horns1_hex | ...
-------------------------------------------------------------------------------
| 1 | 1 | ffffff | NULL | NULL | 2 | 000000 | ...
| 2 | NULL | NULL | NULL | NULL | NULL | NULL | ...
| 3 | 2 | ff0000 | 1 | ffffff | 3 | 00ff00 | ...
| 4 | NULL | NULL | NULL | NULL | 1 | 0000ff | ...
etc...
The table goes on like that and currently has 100+ columns, but in general a single pet will only have around 1-8 of these attributes. A new attribute is added every 1-2 months which requires table columns to be added. The table is rarely updated and read frequently.
I've been proposing that we move to a more vertical design scheme for better flexibility as we want to start adding larger volumes of attributes in the future, i.e.:
----------------------------------------------------------------
| pet_id | attribute_id | attribute_color | attribute_position |
----------------------------------------------------------------
| 1 | 1 | ffffff | 1 |
| 1 | 3 | 000000 | 2 |
| 3 | 2 | ffffff | 1 |
| 3 | 1 | ff0000 | 2 |
| 3 | 3 | 00ff00 | 3 |
| 4 | 3 | 0000ff | 1 |
etc...
The old developer has raised concerns that this will create performance issues, as users very frequently search for pets with specific attributes (e.g. must have these attributes, must have at least one in this colour or position, must have > 30 attributes). Currently the search is quite fast as no JOINs are required, but introducing a vertical table would presumably mean an additional join for every attribute searched, and would also roughly triple the number of rows.
The first part of my question is if anyone has any recommendations with regards to this? I'm not particularly experienced with database design or optimisation.
I've run tests for a variety of cases but they've been largely inconclusive - the times vary quite significantly for all of the queries that I ran (i.e. between half a second and 20+ seconds), so I suppose the second part of my question is whether there's a more reliable way of profiling query times than using microtime(true) in PHP.
Thanks.
This is called the Entity-Attribute-Value (EAV) model, and relational database systems are really not suited for it at all.
To quote someone who deems it one of the five errors not to make:
So what are the benefits that are touted for EAV? Well, there are none. Since EAV tables will contain any kind of data, we have to PIVOT the data to a tabular representation, with appropriate columns, in order to make it useful. In many cases, there is middleware or client-side software that does this behind the scenes, thereby providing the illusion to the user that they are dealing with well-designed data.
EAV models have a host of problems.
Firstly, the massive amount of data is, in itself, essentially unmanageable.
Secondly, there is no possible way to define the necessary constraints -- any potential check constraints will have to include extensive hard-coding for appropriate attribute names. Since a single column holds all possible values, the datatype is usually VARCHAR(n).
Thirdly, don't even think about having any useful foreign keys.
Finally, there is the complexity and awkwardness of queries. Some folks consider it a benefit to be able to jam a variety of data into a single table when necessary -- they call it "scalable". In reality, since EAV mixes up data with metadata, it is a lot more difficult to manipulate data even for simple requirements.
The solution to the EAV nightmare is simple: Analyze and research the users' needs and identify the data requirements up-front. A relational database maintains the integrity and consistency of data. It is virtually impossible to make a case for designing such a database without well-defined requirements. Period.
The table goes on like that and currently has 100+ columns, but in general a single pet will only have around 1-8 of these attributes.
That looks like a case for normalization: Break the table into multiple, for example one for horns, one for wings, all connected by foreign key to the main entity table. But do make sure that every attribute still maps to one or more columns, so that you can define constraints, data types, indexes, and so on.
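A minimal sketch of what one such attribute table could look like (all names are illustrative, and a main pets table keyed by pet_id is assumed):
CREATE TABLE pet_wings (
    pet_id    INT      NOT NULL,
    slot      TINYINT  NOT NULL,            -- replaces the wings1 / wings2 columns
    wing_type INT      NOT NULL,
    hex       CHAR(6)  NOT NULL,
    PRIMARY KEY (pet_id, slot),
    FOREIGN KEY (pet_id) REFERENCES pets (pet_id)
);
-- pet_horns, pet_manes, ... follow the same pattern, each with its own types and constraints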
Do the join. The database was specifically designed to support joins for your use case. If there is any doubt, then benchmark.
EDIT: A better way to profile the queries is to run the query directly in the MySQL interpreter on the CLI. It will give you the exact time it took to run the query. The PHP microtime() function will also pick up other latencies (Apache, PHP, server resource allocation, network if connecting to a remote MySQL instance, etc.).
What you are proposing is called 'normalization'. This is exactly what relational databases were made for - if you take care of your indexes, the joins will run almost as fast as if the data were in one table.
Actually, they might even go faster: instead of loading 1 table row with 100 columns, you can just load the columns you need. If a pet only has 8 attributes, you only load those 8.
This question is very subjective. If you have the resources to update the middleware to reflect a column that has been added then, by all means, go with horizontal; there is nothing safer and easier to learn than a fixed structure. One thing to remember: any time you update a table's structure you have to update each one of its dependencies, unless there is some catch-all like *, which I suggest you stay away from unless you are just dumping data to a screen and the order of columns is irrelevant.
With that said, vertical is the way to go if you don't have all of your requirements in place or don't have the desire to update code in n number of areas. Most of the time you just need storage containers to store data. I would segregate things like numbers, dates, binary, and text into separate columns to preserve some data integrity, but there is nothing wrong with vertical storage, as long as you know how to formulate and structure queries to bring the data back in the appropriate format.
FYI, WordPress uses vertical data storage for the majority of the dynamic content it has to store for its millions of users.
The first thing, from a database point of view, is that your data should grow vertically, not horizontally, so adding a new column is not a good design at all. Second, this is a very common scenario in DB design, and the way to solve it is to create three tables: the 1st for Pets, the 2nd for Attributes, and the 3rd as a mapping table between those two. Here is the example:
Table 1 (Pet)
Pet_ID | Pet_Name
1 | Dog
2 | Cat
Table 2 (Attribute)
Attribute_ID | Attribute_Name
1 | Wings
2 | Eyes
Table 3 (Pet_Attribute)
Pet_ID | Attribute_ID | Attribute_Value
1 | 1 | 0
1 | 2 | 2
About Performance:
Pet_ID and Attribute_ID are the primary keys, which are indexed (http://developer.mimer.com/documentation/html_92/Mimer_SQL_Engine_DocSet/Basic_concepts4.html), so the search is very fast. And this is the right way to solve the problem. I hope it is clear to you now.
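For example, the searches the old developer worried about stay single queries on the mapping table (a sketch using the names above):
-- pets that have BOTH attribute 1 and attribute 2
SELECT Pet_ID
FROM Pet_Attribute
WHERE Attribute_ID IN (1, 2)
GROUP BY Pet_ID
HAVING COUNT(DISTINCT Attribute_ID) = 2;

-- pets with more than 30 attributes
SELECT Pet_ID
FROM Pet_Attribute
GROUP BY Pet_ID
HAVING COUNT(*) > 30;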
Here is the scenario: 2 web servers in two separate locations, each with a MySQL database containing identical tables. The data within the tables is also expected to be identical in real time.
Here is the problem: if users in both locations simultaneously enter a new record into the identical tables, as illustrated in the first two tables below (where the third record in each table has been entered simultaneously by different people), the data in the tables is no longer identical. What is the best way to ensure the data remains identical in real time, as illustrated in the third table below, regardless of where the updates take place? That way, instead of ending up with 3 rows in each table, the new records are replicated bi-directionally and inserted into both tables, creating 2 identical tables again, with 4 rows this time.
Server A in Location A
==============
Table Names
| ID| NAME |
|-----------|
| 1 | Tom |
| 2 | Scott |
|-----------|
| 3 | John |
|-----------|
Server B in Location B
==============
Table Names
| ID| NAME |
|-----------|
| 1 | Tom |
| 2 | Scott |
|-----------|
| 3 | Peter |
|-----------|
Expected Scenario
===========
Table Names
| ID| NAME |
|-----------|
| 1 | Tom |
| 2 | Scott |
| 3 | Peter |
| 4 | John |
|-----------|
There isn't much performance to be gained from replicating your database on two masters. However, there is a nifty bit of failover if you code your application correctly.
A master-master setup is essentially the same as a slave-master setup, but with the slave threads started on both boxes and an important change to your config files on each box.
Master MySQL 1:
auto_increment_increment = 2
auto_increment_offset = 1
Master MySQL 2:
auto_increment_increment = 2
auto_increment_offset = 2
These two parameters ensure that when two servers are fighting over a primary key for some reason, they do not create duplicates and kill the replication. Instead of incrementing by 1, any auto-increment field will by default increment by 2. On one box it will start with an offset of 1 and run the sequence 1, 3, 5, 7, 9, 11, 13, etc. On the second box it will start with an offset of 2 and run along 2, 4, 6, 8, 10, 12, etc. From current testing, the auto-increment appears to take the next free number, not one that was skipped earlier.
E.g. if server 1 inserts the first 3 records (1, 3 and 5), then when server 2 inserts the 4th it will be given the key 6 (not 2, which is left unused).
Once you've set that up, start both of them up as Slaves.
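Starting each box as a slave of the other looks roughly like this (host, credentials and binlog coordinates are placeholders; run the mirror image on the other box, pointing back at this one):
CHANGE MASTER TO
    MASTER_HOST = 'the-other-box',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '...',
    MASTER_LOG_FILE = 'mysql-bin.000001',   -- taken from SHOW MASTER STATUS on the other box
    MASTER_LOG_POS = 4;
START SLAVE;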
Then, to check both are working OK, connect to both machines and run SHOW SLAVE STATUS; Slave_IO_Running and Slave_SQL_Running should both say "Yes" on each box.
Then, of course, create a few records in a table and ensure one box is only inserting odd-numbered primary keys and the other only even-numbered ones.
Then do all the tests to ensure that you can perform all the standard applications on each box with it replicating to the other.
It's relatively simple once it's going.
But as has been mentioned, MySQL does discourage it and advise that you ensure you are mindful of this functionality when writing your application code.
Edit: I suppose it's theoretically possible to add more masters if you ensure that the offsets are correct and so on. You might more realistically though, add some additional slaves.
MySQL does not support synchronous replication, however, even if it did, you would probably not want to use it (can't take the performance hit of waiting for the other server to sync on every transaction commit).
You will have to consider more appropriate architectural solutions to it - there are third party products which will do a merge and resolve conflicts in a predetermined way - this is the only way really.
Expecting your architecture to function in this way is naive - there is no "easy fix" for any database, not just MySQL.
Is it important that the UIDs are the same? Or would you entertain the thought of having a table or column mapping the remote UID to the local UID, and writing custom synchronisation code for the objects you wish to replicate that does any necessary mapping of UIDs for foreign-key columns, etc.?
The only way to ensure your tables stay synchronized is to set up two-way replication between the databases.
But MySQL only permits one-way replication, so you can't simply resolve your problem in this configuration.
To be clear, you can "set up" two-way replication, but MySQL AB discourages this.