I've got myself into a bit of a tiss over averaging and joining tables.
Essentially I want to display the average heights of different plant species using Highcharts, pulling the data from a MySQL database. Unfortunately the height data and the species names were set up to be stored in different tables.
I've got it working, however when I download the data and find the averages in Excel the figures are different to those being displayed - so I'm obviously not doing it right. I've double checked I'm doing it right in Excel so almost certain it's my MySQL query that's stuffing up.
There's loads of entries in the actual tables, so I've just put an example below.
The query I have at the moment is:
<?php
$result = mysql_query("
SELECT DISTINCT(plant_records.plant_id), ROUND(AVG(plant_records.height),2) as plant_average, plant_list.id, plant_list.plant_species
FROM plant_records
INNER JOIN plant_list
ON plant_records.plant_id=plant_list.id
GROUP BY plant_list.plant_species
") or die(mysql_error());
while ($row = mysql_fetch_array($result)) {
$xAxisValues[] = "'" . $row['plant_species'] . "'";
$AseriesValues[] = $row['plant_average'];
}
?>
Am I doing it right? I found some nice tutorials explaining joins, like this one, but I'm still confused. I'm wondering if I'm averaging before I've joined them, or something??
"plant_id" in the Records table corresponds with "id" in the List table
plant_records:
id plant_id date_recorded height
1 3 01/01/2013 0.2523123
2 1 02/01/2013 0.123
3 3 03/02/2013 0.446
4 3 04/03/2013 0.52
5 1 05/03/2013 0.3
6 2 06/03/2013 0.111
7 2 07/05/2013 0.30
8 4 08/05/2013 0.22564
9 1 09/05/2013 1.27
10 3 10/05/2013 1.8
plant_list:
id registration_date contact_name plant_species plant_parent
1 01/01/2013 Dave ilex_prinos London_Holly
2 02/01/2013 Bill acer_saccharum Brighton_Maple
3 01/01/2013 Bob ilex_prinos London_Holly
4 04/01/2013 Bruno junip_communis Park_Juniper
EDIT:
I've tried every possible way of finding the data using Excel (e.g. deliberately not filtering unique IDs, different average types, selecting multiple species, etc) to find the calculation my query is using, but I can't get the same results.
I notice two issues with your query at the moment.
Selecting plant_list.id while having a GROUP BY plant_list.plant_species will not yield anything of interest, due to the fact that MySQL will return an arbitrary id from any of the plants that match each species.
You state that you are only interested in the most recent recording, but nothing in your query reflects that fact.
Given that information, try this query:
SELECT ROUND(AVG(pr.height),2) as plant_average, plant_list.plant_species
FROM plant_records pr
INNER JOIN plant_list
ON pr.plant_id=plant_list.id
WHERE pr.date_recorded = (
SELECT MAX(pri.date_recorded) FROM plant_records pri
WHERE pri.plant_id = pr.plant_id
)
GROUP BY plant_list.plant_species
Alternately, if you want just the average heights for a specific date, simply pass that directly into the query, instead of using the subquery.
If we assume that each plant_id identifies one single plant of a given species, and you want to know the average height of each species across all of its plants and recordings, you can do this:
SELECT PL.plant_species, ROUND(AVG(PR.height),2) as plant_average
FROM plant_records AS PR
JOIN plant_list AS PL
ON PR.plant_id=PL.id
GROUP BY PL.plant_species
This will return something like:
plant_species plant_average
acer_saccharum 0.2100000
ilex_prinos 0.6700000
junip_communis 0.2300000
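As a quick check, the query above can be run against the sample rows from the question. This is a minimal sketch that uses Python's built-in sqlite3 module in place of MySQL (an assumption, made only so the example is self-contained), and it creates only the columns the query needs.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL one (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plant_list (id INTEGER PRIMARY KEY, plant_species TEXT);
CREATE TABLE plant_records (id INTEGER PRIMARY KEY, plant_id INTEGER, height REAL);
INSERT INTO plant_list VALUES
  (1, 'ilex_prinos'), (2, 'acer_saccharum'), (3, 'ilex_prinos'), (4, 'junip_communis');
INSERT INTO plant_records VALUES
  (1, 3, 0.2523123), (2, 1, 0.123), (3, 3, 0.446), (4, 3, 0.52), (5, 1, 0.3),
  (6, 2, 0.111), (7, 2, 0.30), (8, 4, 0.22564), (9, 1, 1.27), (10, 3, 1.8);
""")

# Join first, then average every recording that belongs to each species.
species_averages = conn.execute("""
SELECT pl.plant_species, ROUND(AVG(pr.height), 2) AS plant_average
FROM plant_records AS pr
JOIN plant_list AS pl ON pr.plant_id = pl.id
GROUP BY pl.plant_species
ORDER BY pl.plant_species
""").fetchall()

for species, average in species_averages:
    print(species, average)
```

Because the GROUP BY is on the species only, the two ilex_prinos plants (ids 1 and 3) are pooled into one average over all seven of their recordings.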
I need to sum several columns together on each row, like a leaderboard. Here is how it looks:
Name | country | track 1 | track 2 | track 3 | Total
John ENG 32 56 24
Peter POL 45 43 35
Two issues here, I could use the
update 'table' set Total = track 1 + track 2 + track 3
But it's not always 3 tracks; it can be anywhere from 3 to 20.
Second, if I don't SUM it in MySQL I cannot sort it when I present the data in HTML/PHP.
Or is there some other smart way to build leaderboards?
You need to redesign your table to have columns for name, country, track number and data. Then, instead of having a wide table with just 3 track numbers, you have a tall, thin table with each row being the data for a given name, country and track.
Then you can summarise using something like
SELECT
country,
name,
sum(data) as total
FROM trackdata
GROUP BY
name,
country
ORDER BY
sum(data) desc
Take a look here where I have made a SQL fiddle showing this working the way you want it
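The tall-table idea can be sketched with the two players from the question. This uses Python's built-in sqlite3 module rather than MySQL (an assumption, so the example runs on its own); the `track` column name and the composite primary key are illustrative choices, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (name, country, track) instead of one column per track.
conn.execute("""
CREATE TABLE trackdata (
    name    TEXT,
    country TEXT,
    track   INTEGER,
    data    INTEGER,
    PRIMARY KEY (name, country, track)
)
""")
conn.executemany(
    "INSERT INTO trackdata VALUES (?, ?, ?, ?)",
    [
        ("John", "ENG", 1, 32), ("John", "ENG", 2, 56), ("John", "ENG", 3, 24),
        ("Peter", "POL", 1, 45), ("Peter", "POL", 2, 43), ("Peter", "POL", 3, 35),
    ],
)

# The same query works whether a player has 3 tracks or 20.
leaderboard = conn.execute("""
SELECT country, name, SUM(data) AS total
FROM trackdata
GROUP BY name, country
ORDER BY total DESC
""").fetchall()

for country, name, total in leaderboard:
    print(name, country, total)
```

Adding a fourth track is then just another INSERT; neither the table definition nor the query has to change.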
Depending upon your expected data, however, you might really be better off having a separate table for Country, where each country name only appears once (and maybe one for name as well). For example, if John is always associated with ENG then you have a repeating group, and it's better to remove that association from the table above (which is really about scores on a track, not about who is in what country) and put it into its own table, which is then joined to the track data.
A full solution might have the following tables
**Athlete**
athlete_id
athlete_name
(other data about athletes)
**Country**
country_id
country_name
(other data about countries)
**Track**
Track_id
Track_number
(other data about tracks)
**country_athlete** (this joining table allows for the one to many of one country having many athletes)
country_athlete_id
country_id
athlete_id
**Times**
country_athlete_id <--- this identifies a given combination of athlete and country
track_id <--- this identifies the track
data <--- this is where you store the actual time
It can get more complex depending on your data, eg can the same track number appear in different countries? if so then you need another joining table to join one track number to many countries.
Alternatively, even with the poor design of my SQL fiddle example, it might be good to make name,country and track a primary key so that you can only ever have one 'data' value for a given combination of name, country and track. However, this decision, and that of normalising your table into multiple joined tables would be based upon the data you expect to get.
But either way as soon as you say 'I don't know how many tracks there will be' then you should start thinking 'each track's data appears in one ROW and not one COLUMN'.
Like others mentioned, you need to redesign your database. You need a one-to-many relationship between your Leaderboard table and a new Tracks table. This means that one user can have many tracks, with each track being represented by a record in the Tracks table.
These two tables should be connected by a foreign key; in this case it could be a user_id field.
The total field in the leaderboard table could be updated every time a new track is inserted or updated, or you could have a query similar to the one you wanted. Here is what such a query could look like:
UPDATE leaderboard SET total = (
SELECT SUM(track) FROM tracks WHERE user_id = leaderboard.user_id
)
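A minimal sketch of that UPDATE in action, using Python's built-in sqlite3 module instead of MySQL (an assumption for the sake of a runnable example). The `name` and `track_no` columns and the sample rows are hypothetical; only `user_id`, `track` and `total` come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leaderboard (user_id INTEGER PRIMARY KEY, name TEXT, total INTEGER);
CREATE TABLE tracks (
    user_id  INTEGER REFERENCES leaderboard(user_id),  -- foreign key to the user
    track_no INTEGER,                                  -- hypothetical: which track
    track    INTEGER                                   -- the score on that track
);
INSERT INTO leaderboard (user_id, name) VALUES (1, 'John'), (2, 'Peter');
INSERT INTO tracks VALUES
  (1, 1, 32), (1, 2, 56), (1, 3, 24),
  (2, 1, 45), (2, 2, 43), (2, 3, 35);

-- Correlated subquery: recompute each user's total from their track rows.
UPDATE leaderboard SET total = (
    SELECT SUM(track) FROM tracks WHERE user_id = leaderboard.user_id
);
""")

totals = conn.execute(
    "SELECT name, total FROM leaderboard ORDER BY total DESC"
).fetchall()
print(totals)
```

Storing the total is a cache of something the tracks table already knows, so it has to be refreshed whenever track rows change; computing it with SUM at query time avoids that bookkeeping.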
I recommend you read about database relationships, here is a link:
https://code.tutsplus.com/articles/sql-for-beginners-part-3-database-relationships--net-8561
I still get a lot of issues with this... I don't think that the issue is the database though; I think it's more the way I present the data on the web.
I'm able to get all the data etc. The only thing is my table is not filling up the right way.
What I do now is like: "SELECT * FROM `times` NATURAL JOIN `players`"
Then <?php foreach... ?>
<tr>
<td> <?php echo $row['playerID'];?> </td>
<td> <?php echo $row['Time'];?> </td>
....
The thing is it's hard to get sorting, ordering and SUM all at once with this static table solution.
I searched around for leaderboards and I really don't understand how they build theirs with active ordering etc., like https://www.pgatour.com/leaderboard.html
How do they build leaderboards like that? With sorting and everything.
I have a table like this
d_id | d_name | d_desc | sid
1 |flu | .... |4,13,19
Where sid is VARCHAR. What I want to do is: when the user enters 4 or 13 or 19, it will display flu. However, my query only works when the user selects all of those values. Here is my query:
SELECT * FROM diseases where sid LIKE '%sid1++%'
In the above query, I work with PHP and use a for loop to put the sid values inside the LIKE pattern, so I just wrote sid++ here to keep it simple. My query only works when all of the values are present. If, say, the user selects 4 and 19, which gives '%4,19%', then it displays nothing. Thanks all.
If you must do what you ask for, you can try to use FIND_IN_SET().
SELECT d_id, d_name, d_desc
FROM diseases
WHERE FIND_IN_SET(13,sid)<>0
But this query will not be sargable, so it will be outrageously slow if your table contains more than a few dozen rows. And the ICD10 list of disease codes contains almost 92,000 rows. You don't want your patient to die or get well before you finish looking up her disease. :-)
So, you should create a separate table. Let's call it diseases_sid.
It will contain two columns. For your example the contents will be
d_id sid
1 4
1 13
1 19
If you want to find a row from your diseases table by sid, do this.
SELECT d.d_id, d.d_name, d.d_desc
FROM diseases d
JOIN diseases_sid ds ON d.d_id = ds.d_id
WHERE ds.sid = 13
That's what my colleagues are talking about in the comments when they mention normalization.
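To make the normalized lookup concrete, here is a small sketch using Python's built-in sqlite3 module in place of MySQL (an assumption, so that it runs standalone). The second disease row ('cold', sid 7) and the helper name `diseases_for` are hypothetical, added only to show that non-matching rows are excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diseases (d_id INTEGER PRIMARY KEY, d_name TEXT, d_desc TEXT);
CREATE TABLE diseases_sid (
    d_id INTEGER REFERENCES diseases(d_id),
    sid  INTEGER,
    PRIMARY KEY (d_id, sid)        -- one row per disease/sid pair
);
INSERT INTO diseases VALUES (1, 'flu', '....'), (2, 'cold', '....');
INSERT INTO diseases_sid VALUES (1, 4), (1, 13), (1, 19), (2, 7);
""")


def diseases_for(sids):
    """Return diseases linked to ANY of the given sids (hypothetical helper)."""
    placeholders = ", ".join("?" for _ in sids)
    sql = f"""
        SELECT DISTINCT d.d_id, d.d_name
        FROM diseases d
        JOIN diseases_sid ds ON d.d_id = ds.d_id
        WHERE ds.sid IN ({placeholders})
        ORDER BY d.d_id
    """
    return conn.execute(sql, list(sids)).fetchall()


print(diseases_for([4, 19]))
print(diseases_for([13]))
```

The sid values are passed as bound parameters rather than concatenated into the SQL string, and DISTINCT keeps a disease from appearing twice when more than one of the selected sids matches it.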
I have a one-to-many relationship of rooms and their occupants:
Room | User
1 | 1
1 | 2
1 | 4
2 | 1
2 | 2
2 | 3
2 | 5
3 | 1
3 | 3
Given a list of users, e.g. 1, 3, what is the most efficient way to determine which room is completely/perfectly filled by them? So in this case, it should return room 3 because, although they are both in room 2, room 2 has other occupants as well, which is not a "perfect" fit.
I can think of several solutions to this, but am not sure about the efficiency. For example, I can do a group concatenate on the user (ordered ascending) grouping by room, which will give me comma separated strings such as "1,2,4", "1,2,3,5" and "1,3". I can then order my input list ascending and look for a perfect match to "1,3".
Or I can do a count of the total number of users in a room AND containing both users 1 and 3. I will then select the room which has the count of users equal to two.
Note I want the most efficient way, or at least a way that scales up to millions of users and rooms. Each room will have around 25 users. Another thing I want to consider is how to pass this list to the database. Should I construct a query by concatenating AND userid = 1 AND userid = 3 AND userid = 5 and so on? Or is there a way to pass the values as an array into a stored procedure?
Any help would be appreciated.
For example, I can do a group concatenate on the user (ordered ascending) grouping by room, which will give me comma separated strings such as "1,2,4", "1,2,3,5" and "1,3". I can then order my input list ascending and look for a perfect match to "1,3".
First, a word of advice, to improve your level of function as a developer. Stop thinking of the data, and of the solution, in terms of CSVs. It limits you to thinking in spreadsheet terms, and prevents you from thinking in Relational Data terms. You do not need to construct strings, and then match strings, when the data is in the database, you can match it there.
Solution
Now then, in Relational data terms, what exactly do you want ? You want the rooms where the count of users that match your argument user list is highest. Is that correct ? If so, the code is simple.
You haven't given the tables. I will assume room, user, room_user, with deadly ids on the first two, and a composite key on the third. I can give you the SQL solution, you will have to work out how to do it in the non-SQL.
Another thing I want to consider is how to pass this list to the database. Should I construct a query by concatenating AND userid = 1 AND userid = 3 AND userid = 5 and so on? Or is there a way to pass the values as an array into a stored procedure?
To pass the list to the stored proc, because it needs a single calling parm, the length of which is variable, you have to create a CSV list of users. Let's call that parm #user_list. (Note, that is not contemplating the data, that is passing a list to a proc in a single parm, because you can't pass an unknown number of identified users to a proc otherwise.)
Since you constructed the #user_list on the client, you may as well compute #user_count (the number of members in the list) while you are at it, on the client, and pass that to the proc.
Something like:
CREATE PROC room_user_match_sp (
#user_list CHAR(255),
#user_count INT
...
)
AS
-- validate parms, etc
...
SELECT room_id,
match_count,
match_count / #user_count * 100 AS match_pct
FROM (
SELECT room_id,
COUNT(user_id) AS match_count -- no of users matched
FROM room_user
WHERE user_id IN ( #user_list )
GROUP BY room_id -- get one row per room
) AS match_room -- has any matched users
WHERE match_count = MAX( match_count ) -- remove this while testing
It is not clear, if you want full matches only. In that case, use:
WHERE match_count = #user_count
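For the "perfect fit" reading of the question (every listed user is in the room and nobody else is), one possible single-statement formulation is sketched below. It is a swapped-in variant rather than the procedure above: it runs on Python's built-in sqlite3 module (an assumption), passes the users as bound parameters instead of a CSV string, and also compares the room's total occupant count. The function name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE room_user (
    room_id INTEGER,
    user_id INTEGER,
    PRIMARY KEY (room_id, user_id)   -- composite key, as assumed in the answer
)
""")
conn.executemany(
    "INSERT INTO room_user VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (2, 3), (2, 5), (3, 1), (3, 3)],
)


def rooms_exactly_filled_by(user_ids):
    """Rooms whose occupants are exactly the given users (hypothetical helper)."""
    users = sorted(set(user_ids))
    placeholders = ", ".join("?" for _ in users)
    sql = f"""
        SELECT room_id
        FROM room_user
        GROUP BY room_id
        HAVING COUNT(*) = ?                              -- no extra occupants
           AND SUM(user_id IN ({placeholders})) = ?      -- all listed users present
        ORDER BY room_id
    """
    params = [len(users), *users, len(users)]
    return [row[0] for row in conn.execute(sql, params)]


print(rooms_exactly_filled_by([1, 3]))
```

The first HAVING condition rules out room 2, which contains users 1 and 3 but also two others; the second rules out rooms that merely have the right head count.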
Expectation
You have asked for a proc-based solution, so I have given that. Yes, it is the fastest. But keep in mind that for this kind of requirement and solution, you could construct the SQL string on the client, and execute it on the "server" in the usual manner, without using a proc. The proc is faster here only because the code is compiled and that step is removed, as opposed to that step being performed every time the client calls the "server" with the SQL string.
The point I am making here is, with the data in a reasonably Relational form, you can obtain the result you are seeking using a single SELECT statement, you don't have to mess around with work tables or temp tables or intermediate steps, which requires a proc. Here, the proc is not required, you are implementing a proc for performance reasons.
I make this point because it is clear from your question that your expectation of the solution is "gee, I can't get the result directly, I have to work with the data first, I am ready and willing to do that". Such intermediate work steps are required only when the data is not Relational.
Maybe not the most efficient SQL, but something like:
SELECT x.room_id,
SUM(x.occupants) AS occupants,
SUM(x.selectees) AS selectees,
SUM(x.selectees) / SUM(x.occupants) as percentage
FROM ( SELECT room_id,
COUNT(user_id) AS occupants,
NULL AS selectees
FROM Rooms
GROUP BY room_id
UNION
SELECT room_id,
NULL AS occupants,
COUNT(user_id) AS selectees
FROM Rooms
WHERE user_id IN (1,3)
GROUP BY room_id
) x
GROUP BY x.room_id
ORDER BY percentage DESC
will give you a list of rooms ordered by the "best fit" percentage
ie. it works out a percentage of fulfilment based on the number of people in the room, and the number of people from your set who are in the room
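The same "best fit" ranking can be sketched in a single pass with conditional aggregation instead of the UNION. This is a variant, not the query above, run on Python's built-in sqlite3 module (an assumption); the `1.0 *` factor is there because SQLite would otherwise perform integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rooms (room_id INTEGER, user_id INTEGER)")
conn.executemany(
    "INSERT INTO Rooms VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (2, 3), (2, 5), (3, 1), (3, 3)],
)

# occupants: everyone in the room; selectees: occupants who are in the input set.
ranked = conn.execute("""
SELECT room_id,
       COUNT(*)                                 AS occupants,
       SUM(user_id IN (1, 3))                   AS selectees,
       1.0 * SUM(user_id IN (1, 3)) / COUNT(*)  AS fit
FROM Rooms
GROUP BY room_id
ORDER BY fit DESC, room_id
""").fetchall()

for room_id, occupants, selectees, fit in ranked:
    print(room_id, occupants, selectees, round(fit, 2))
```

A fit of 1.0 only says that every occupant is in the input set, so for a perfect match the selectees count still has to equal the number of users searched for.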
I'm struggling a bit on the best way to do this with as little performance hit as possible.
Here's the setup...
Search results page with search refining filters that make an AJAX call to a PHP handler which returns a new (refined) set of results.
I have 4 tables that contain all of the data I need to connect to in the PHP handler code.
Table 1 - Main table of records with main details
Table 2 - Ratings for each product from professional rating company #1
Table 3 - Ratings for each product from professional rating company #2
Table 4 - Ratings for each product from professional rating company #3
The refiners on the search results page are jquery sliders with ranges from the lowest allowed rating to the highest for each.
When a slider handle is moved, a new AJAX call is made with the new value(s) and the database query will run to create a fresh set of refined results.
Getting the data I need from Table 1 is the easy part. What I'm struggling with is how to efficiently include a join on the other 3 tables and only picking up rows that match the refining values/ranges. Table 2, 3, and 4 all have multiple columns for year (2004-2012) and when I made an initial attempt to put it all into one query, it bogged down.
Table 2, 3, and 4 hold the various ratings for each record in Table 1.
The columns in Table 2, 3, and 4 are...
id - productID - y2004 - y2005 - y2006 - y2007 - ... you get the idea.
Each year column has a numeric value for each record (default is 0).
What I need to do is efficiently select records that match the refiner ranges selected by the user across all 4 tables at once.
An example refiner search would be...get all records from Table 1 where price is between $25 and $50 AND where Table 2 records have a rating (from any year/column) between 1 - 4 AND where Table 3 records have a rating (from any year/column) between 80 - 100 AND where Table 4 records have a rating (from any year/column) between 80 - 100.
Any advice on how to set this up with as much performance as possible?
My suggestion would be to use a different table structure. You should merge Table 2, 3 and 4 into a single ratings table with the following structure:
id | productID | companyID | year | rating
Then you could rewrite your query as:
SELECT *
FROM products p
JOIN ratings r ON p.id = r.productID
WHERE p.price BETWEEN 25 AND 50
AND (
( r.companyID = 1 AND r.rating BETWEEN 1 AND 4 )
OR ( r.companyID = 2 AND r.rating BETWEEN 80 AND 100 )
OR ( r.companyID = 3 AND r.rating BETWEEN 80 AND 100 )
)
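This should perform better, since each rating filter becomes a comparison on a single column that can be indexed, rather than a check across nine separate year columns. Also, your tables will be more scalable, both with the years and the number of companies.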
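One thing to watch with the OR conditions: each joined row satisfies at most one branch, so as written the query returns a product when any one company's rating is in range. If the refiners must all hold at once, as in the question's example, one option is to group by product and require matches from all three companies. A minimal sketch on Python's built-in sqlite3 module (an assumption), with hypothetical sample products:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE ratings (
    id        INTEGER PRIMARY KEY,
    productID INTEGER REFERENCES products(id),
    companyID INTEGER,
    year      INTEGER,
    rating    REAL
);
-- Hypothetical data: A passes every filter, B fails company 2, C is too expensive.
INSERT INTO products VALUES (1, 'A', 30), (2, 'B', 40), (3, 'C', 100);
INSERT INTO ratings (productID, companyID, year, rating) VALUES
  (1, 1, 2011, 3), (1, 2, 2011, 85), (1, 3, 2012, 90),
  (2, 1, 2011, 2), (2, 2, 2011, 50), (2, 3, 2012, 95),
  (3, 1, 2011, 3), (3, 2, 2011, 90), (3, 3, 2012, 90);
""")

matching_products = conn.execute("""
SELECT p.id, p.name
FROM products p
JOIN ratings r ON p.id = r.productID
WHERE p.price BETWEEN 25 AND 50
  AND (
       (r.companyID = 1 AND r.rating BETWEEN 1 AND 4)
    OR (r.companyID = 2 AND r.rating BETWEEN 80 AND 100)
    OR (r.companyID = 3 AND r.rating BETWEEN 80 AND 100)
  )
GROUP BY p.id, p.name
HAVING COUNT(DISTINCT r.companyID) = 3   -- every company had an in-range rating
ORDER BY p.id
""").fetchall()

print(matching_products)
```

Selecting only the product columns and grouping also avoids returning one row per matching rating.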
One more thing: if you have a lot of fields in your products table, it might be more useful to execute 2 queries instead of joining. The reason for this is that you are fetching redundant data - every joined row will have the columns for product, even though you only need it once. This is a side-effect of joins, and there is probably a performance threshold where it will be more useful to query twice than to join. It is up to you to decide if/when that is the case.
I always struggle with dealing with normalised data, and how I display it. Maybe it's because I don't fully understand the normalisation rules, like how to get it fully into Boyce-Codd. Performance is not really an issue at this stage, though maintainability of the schema is.
user
ID Name
1 Alice
2 Bob
3 Charlie
skills
ID Name
1 Karate
2 Marksmen
3 Cook
event
ID Name
1 Island
2 Volcano
user-m2m-skill
MemberID SkillID
1 1
1 2
2 1
2 3
3 1
user-m2m-event
MemberID EventID
1 1
1 2
2 1
3 2
How do I get this information out of the database? I'd like to display a table like this, where I've got the total count of each skill:
Skills at event
Event Karate Marksmen Cook
Island 2 1 1
Volcano 2 1 0
It is unlikely that the skills table will change very much. This means I could do a set of subqueries like this (obviously shortened and incorrect syntax)
SELECT event.name,
(SELECT COUNT(*) FROM ... WHERE skill = 'Karate'),
(SELECT COUNT(*) FROM ... WHERE skill = 'Marksmen') FROM event
And that's what I've been doing, putting it into a view. But it's a bit horrible, no? I have to edit the view every time I add a new skill.
The other way is to process it client side. So I just get back something like this:
Event Skill Count
Island Karate 2
Island Marksmen 1
Island Cook 1
Volcano Karate 2
Volcano Marksmen 1
And I loop through the results, reformatting it. But I hate that even more. Isn't the database supposed to do data?
So: what am I doing wrong? Am I expecting too much? Which is the lesser evil?
(As b3ta would say, apologies for length of post and for bad markup. :( )
This is a typical pivot query, because you are looking to convert data in rows into columns.
SELECT e.name,
MAX(CASE WHEN x.skill_name = 'Karate' THEN x.num_skill ELSE 0 END) AS Karate,
MAX(CASE WHEN x.skill_name = 'Marksmen' THEN x.num_skill ELSE 0 END) AS Marksmen
FROM EVENT e
LEFT JOIN (SELECT um.eventid,
s.name AS skill_name,
COUNT(*) AS num_skill
FROM SKILLS s
JOIN `user-m2m-skill` us ON us.skillid = s.id
JOIN `user-m2m-event` um ON um.memberid = us.memberid
GROUP BY um.eventid, s.name) x ON x.eventid = e.id
GROUP BY e.name
Followup question:
...what does this have that a load of sub queries doesn't?
SELECTs as statements within the SELECT clause. IE:
SELECT x.name,
(SELECT COUNT(*) FROM TABLE)
...means that a separate query is run for every skill. If the queries were correlated (that is, tied together by an ID so that the records stay in sync with an event), then the count would also be running once for every event.
Conclusion to Followup
The approach is terribly inefficient. It is better to fetch the necessary values once, as I provided in my answer.
Addendum
Regarding updating the query - it is possible to minimize the maintenance by implementing the query with dynamic SQL.
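One way to read "dynamic SQL" here is to build the pivot columns from whatever is currently in the skills table, so a new skill needs no edit to the query. The sketch below does that from client code with Python's built-in sqlite3 module (an assumption, as are the underscore table names and the `quote_ident` helper); in MySQL the same idea would typically be done with a prepared statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE skills (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE event  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE user_m2m_skill (MemberID INTEGER, SkillID INTEGER);
CREATE TABLE user_m2m_event (MemberID INTEGER, EventID INTEGER);
INSERT INTO skills VALUES (1, 'Karate'), (2, 'Marksmen'), (3, 'Cook');
INSERT INTO event  VALUES (1, 'Island'), (2, 'Volcano');
INSERT INTO user_m2m_skill VALUES (1, 1), (1, 2), (2, 1), (2, 3), (3, 1);
INSERT INTO user_m2m_event VALUES (1, 1), (1, 2), (2, 1), (3, 2);
""")


def quote_ident(name):
    """Quote a value for use as a column alias (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


# Read the current skills, then generate one counting column per skill.
skill_names = [row[0] for row in conn.execute("SELECT name FROM skills ORDER BY id")]
columns = ",\n       ".join(
    f"SUM(CASE WHEN s.name = ? THEN 1 ELSE 0 END) AS {quote_ident(n)}"
    for n in skill_names
)
sql = f"""
SELECT e.name,
       {columns}
FROM event e
LEFT JOIN user_m2m_event ue ON ue.EventID = e.id
LEFT JOIN user_m2m_skill us ON us.MemberID = ue.MemberID
LEFT JOIN skills s ON s.id = us.SkillID
GROUP BY e.id, e.name
ORDER BY e.id
"""
pivot = conn.execute(sql, skill_names).fetchall()

print(["Event"] + skill_names)
for row in pivot:
    print(list(row))
```

The skill names are bound as parameters for the comparisons and quoted only where they are used as column aliases, which keeps the generated statement safe even if a skill name contains unusual characters.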
Your second example, with "Event", "Skill", and "Count" as headers, is what you should expect from dynamically generated results from normalized data. Databases are not designed to format data for display (this isn't an Excel spreadsheet), they're designed to store data and return the meaning of that data. It's up to your code to display it in a nice fashion.
Who's b3ta?
As for the database "doing data", that doesn't mean that client code will be free of all parsing and processing. And normalization in practice shouldn't be a goal in itself. You should also take into account ease of querying and performance.