I have a project for some local high school sports leagues that want real-time statistics updates. There will be people at events (American football, basketball, volleyball, golf, wrestling, etc.) using my CMS to update the stats.
I can't seem to wrap my head around how to store those stats so that when the REST API calls happen, the latest events are sent back (e.g. gathering all basketball games happening at that time on the server and returning them).
The data coming to the server is in JSON format, and I would like to store it as such: each sport being the main key, then the stats on a game-by-game basis. It seems to me that using an RDBMS or another DB type would be pointless, because adding the stats in real time would mean a ton of rows where the data barely differs, and collecting the most recent games would be a pain if I broke up each person's POST and saved it as its own row.
On the other hand, I could just store everything in a file, gathering the stats as they come in and updating the file. But if there are many writes happening, the responses to the API calls might get slow.
Any suggestions? Which of my thoughts is wrong here?
Storing data as JSON generally limits your ability to query it; I would advise against that. JSON is a perfectly acceptable format to accept on the server, but you should immediately deserialize it into an object and store it in a way that will meet your use cases. In my opinion, your use cases call for a relational database. For example, a schema like this would give you good performance for finding all games that are happening:
Sport:
pk int sportId
varchar description
Game:
pk int gameId
fk int sportId
datetime start
datetime end
Player:
pk int playerId
varchar name
StatType:
pk int statTypeId
varchar description
Stat:
pk bigint statId
fk int gameId
fk int playerId
fk int statTypeId
datetime time
real value
To get the current games:
SELECT * FROM Game WHERE start <= CURRENT_TIMESTAMP AND end IS NULL
To get all-time stats for a player:
SELECT st.description, SUM(s.value)
FROM Stat s
LEFT JOIN StatType st ON s.statTypeId = st.statTypeId
LEFT JOIN Player p ON s.playerId = p.playerId
WHERE p.name = 'John Smith'
GROUP BY st.statTypeId, st.description
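To tie this back to the JSON side: each incoming POST just deserializes into one small insert. A minimal sketch, assuming a hypothetical payload shape:
-- Incoming JSON, e.g. {"gameId": 12, "playerId": 7, "statTypeId": 3, "value": 2}
-- After deserializing on the server, the write is a single row:
INSERT INTO Stat (gameId, playerId, statTypeId, time, value)
VALUES (12, 7, 3, CURRENT_TIMESTAMP, 2);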
I've been asked to develop a web application that stores reading data from heat metering devices and divides the heat expenses among all the flat owners. I chose to work in PHP with the MySQL MyISAM engine.
I was not used to working with large amounts of data, so I simply created a logical database where we have:
a table for buildings, with an indexed id as primary key (we now have ~1,200 buildings in the db)
a table with all the flats in all the buildings, with an indexed id as primary key and a building_id linking to the building (around 32k+ flats in total)
a table with all the heaters in all the flats, with an indexed id as primary key and a flat_id linking to the flat (around 280k+ heaters)
a table with all the reading values, with the timestamp of the reading, an id as primary key and a heater_id linking to the heater (around 2.7M+ readings now)
There is also a separate table, linked to the building, which stores the start date and end date between which the division of expenses has to be done.
When I need all the data for a building, the approach I used is to get raw data from the DB with a single query, process it in PHP, then make the next query.
So here is roughly the sequence of operations I used:
get the start and end dates from the specific table with a single query
store the dates in a PHP variable
get all the flats of the building: SELECT * FROM flats WHERE building_id=my_building_id
loop over the results with a PHP while loop
on each step of the while loop, make a query getting all the heaters of that specific flat: SELECT * FROM heaters WHERE flat_id=my_flat_id
loop over the heater data with another PHP while loop
on each step of this inner loop, get the last reading value of that specific heater: SELECT * FROM reading_values WHERE heater_id=my_heater_id AND data<my_data
Now the problem is that I have serious performance issues.
Before someone points it out: I cannot skip the first six steps of the list above and fetch only the reading values, since I need to print bills, and on each bill I have to write all the flat information and all the heater information, so I have to get all the flats and heaters data anyway.
So I'd like some suggestions on how to improve the script's performance:
all the tables are indexed, but do I have to add an index somewhere else?
would using a single query with subqueries, instead of several queries interleaved with PHP code, improve performance?
any other suggestions?
I haven't included specific code as I think it would have made the question too heavy, but if asked I can add some.
Some suggestions:
Don't use SELECT * if you can avoid it; fetch just the fields you really need.
I didn't test it in your particular case, but usually a single query that joins all three tables achieves much better performance than looping through results with PHP.
If you need to loop for some reason, then at least use MySQL prepared statements, which again should increase performance given the amount of queries :)
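On the index question: a composite index that matches the innermost lookup usually matters most here. A sketch, assuming the column names from your question (data being the reading timestamp):
-- Assumed names from the question; adjust to the actual schema.
ALTER TABLE flats ADD INDEX idx_building (building_id);
ALTER TABLE heaters ADD INDEX idx_flat (flat_id);
ALTER TABLE reading_values ADD INDEX idx_heater_data (heater_id, data);
The (heater_id, data) index lets MySQL find the newest reading for a heater without scanning all of that heater's rows.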
Hope it helps!
Regards
EDIT:
just to exemplify an alternative query; not sure if this suits your specific needs, and I haven't tested it (which probably means I forgot something):
SELECT
a.field1,
b.field2,
c.field3,
d.field4
FROM heaters a
JOIN reading_values b ON (b.heater_id = a.heater_id)
JOIN flats c ON (c.flat_id = a.flat_id)
JOIN buildings d ON (d.building_id = c.building_id)
WHERE
a.heater_id = my_heater_id
AND b.date < my_date
GROUP BY a.heater_id
EDIT 2
Following your comments, I modified the query so that it retrieves the information as you want it: Given a building id, it will list all the heaters and their newest reading value according to a given date:
SELECT
a.name,
b.name,
c.name,
d.reading_value,
d.created
FROM buildings a
JOIN flats b ON (b.building_id = a.building_id)
JOIN heaters c ON (c.flat_id = b.flat_id)
JOIN reading_values d ON (d.reading_value_id = (SELECT reading_value_id FROM reading_values WHERE created <= my_date AND heater_id = c.heater_id ORDER BY created DESC LIMIT 1))
WHERE
a.building_id = my_building_id
It would be interesting to know how it performs in your environment.
Regards
I'm looking for some advice/help on quite a complex search algorithm. Any links to relevant articles, techniques, etc. would be much appreciated.
Background
I'm building an application, which, in a nutshell, allows users to set their "availability" for any given day. The User first sets a general availability template which allows them to say:
Monday - AM
Tuesday - PM
Wednesday - All Day
Thursday - None
Friday - All Day
So this User is generally available Monday AM, Tuesday PM etc.
Schema:
id
user_id
day_of_week (1-7)(Monday to Sunday)
availability
They can then override specific dates manually, for example:
2013-03-03 - am
2013-03-04 - pm
2013-03-05 - all_day
Schema:
id
user_id
date
availability
This all works well - I have a Calendar being generated which combines the template and overrides and allows Users to modify their availability etc.
The Problem
I now need to allow Admin Users to search for Users who have specific availability. The Admin User would use a calendar to select the required dates and availabilities and hit search.
For example, find me Users who are available:
2013-03-03 - pm
2013-03-04 - pm
2013-03-05 - pm
The search process would have to search for available Users using the templated availability and overrides, then return the best results. Ideally, it would return Users who are available all of the time, but if no single User can match all the dates, I need to provide a combination of Users who can.
I know this is quite a complex problem and I'm not looking for a complete answer, perhaps just some guidance or links to potentially relevant techniques etc.
What I've tried
At the moment, I have a halfway solution. I'm grabbing all the available Users, looping through each of them, and within that loop, looping through all of the required dates and breaking as soon as a User doesn't meet a required date. This is obviously very unscalable, and it also only returns "perfect matches".
Possible Solutions
Full-Text Searching with an Aggregate Table
I thought about creating a separate table which had the following schema:
user_id
body
The body field would be populated with the Users template days and overrides so an example record might look like:
user_id: 2
body: monday_am tuesday_pm wednesday_pm thursday_am friday_allday 2013-03-03_all_day 2013-03-03_pm
I would then convert a User's search query into a similar format. So if a User was looking for someone who was available on 19th March 2013 (all day) and 20th March 2013 (PM), I'd convert that into a string.
Firstly, as 19th March is a Tuesday, I'd convert that into tuesday_allday, and likewise the 20th (a Wednesday) into wednesday_pm. I'd therefore end up with:
tuesday_allday wednesday_pm 2013-03-19_allday 2013-03-20_pm
I'd then do a full text search against our aggregate table and return a "weighted" result set which I can then loop through and further interrogate.
I'm not sure how this would work in practice, which is why I'm asking if anyone has links to relevant techniques or articles I could use.
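For illustration, here is a minimal sketch of how that full-text match might look in MySQL. The availability_index table is hypothetical, and note that the default full-text tokenizer splits on punctuation such as hyphens, so the date tokens would need a hyphen-free form (e.g. 20130319_allday):
ALTER TABLE availability_index ADD FULLTEXT INDEX ft_body (body);
SELECT user_id,
MATCH(body) AGAINST('tuesday_allday wednesday_pm 20130319_allday 20130320_pm') AS score
FROM availability_index
WHERE MATCH(body) AGAINST('tuesday_allday wednesday_pm 20130319_allday 20130320_pm')
ORDER BY score DESC;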
I am confident this problem can be solved with a better-defined DB schema.
By utilizing a more detailed DB schema you will be able to find any available user for any given time frame (not just AM and PM), should you so choose.
It will also allow you to keep template data while not polluting your availability data with template information (instead, you would select from the template table to programmatically fill in the availability for a given date, which the user can then modify).
I spent some time diagramming this problem and came up with a schema structure that I believe solves the problem you specified and allows you to grow your application with a minimum of schema changes.
(To make this easier to read, I've added the SQL at the end of this proposed answer.)
I have also included an example SELECT statement that would allow you to pull availability data with any number of arguments.
For clarity, that SELECT is above the SQL for the schema, at the end of my explanatory text.
Please don't be intimidated by the SELECT; it may look complicated at first glance, but it is really a map to the entire schema (save the Templates table).
(By the way, I'm not saying that because I have any doubt that you can understand it; I'm sure you can. But I've known many programmers who ignore more complex DB structures, to their own detriment, because a structure LOOKS overly complex, when analyzed it is often less complex than the acrobatics they have to do in their program to get similar results. Relational DBs are based on a branch of mathematics that is good at accurately, consistently, and relatively succinctly associating data.)
General Use:
(for more details read the comments in the SQL CREATE TABLE statements)
-Populate the DaysOfWeek table.
-Populate the TimeFrames table with the time frames you want to track (an AM time frame might have a StartTime of 00:00:00 and an EndTime of 11:59:59, while PM might have a StartTime of 12:00:00 and an EndTime of 23:59:59)
-Add Users
-Add Dates to be tracked (see notes in SQL for thoughts on avoiding bloat & also the virtues of this table)
-Populate the Templates table for each user
-Generate the list of default Availabilities (with their associated AvailableTimes data) for each user
-Expose the default Availabilities to the users so they can override the defaults
NOTE: you can also add an optional table for Engagements to be the opposite of Availabilities (or maybe there is a better abstraction that would include both concepts...)
Disclaimer: I did not take the additional time to fully populate my local DB and verify everything, so there may be some weaknesses/errors I did not see in my diagrams (sorry, I spent far longer than intended on this and must get work done on an overdue project).
While I have worked fairly extensively with DB structures, and with DBs others have created, for 12+ years, I'm sure I am not without fault; I hope others on Stack Overflow will round out any mistakes I may have included.
I apologize for not including more fully worked example data; a small starter set is sketched after the schema below (think adding George, Fred, and Harry to the Users table, adding some dates to the Dates table, then detailing how busy George and Fred are compared to Harry during their school week using the Availabilities, AvailableTimes, and TimeFrames tables).
The SELECT statement (NOTE: I would highly recommend making this into a view; that way you can select whatever columns you want and add whatever arguments/conditions you want in a WHERE clause without having to write the joins out every time, so the view would NOT include the WHERE clause, just to make that clear; a sketch of such a view follows the query):
SELECT *
FROM Users Us
JOIN Availabilities Av
ON Us.User_ID=Av.User_ID
JOIN Dates Da
ON Av.Date_ID=Da.Date_ID
JOIN AvailableTimes Avt
ON Av.Av_ID=Avt.Av_ID
WHERE Da.Date='2014-01-03' -- whatever date
-- alternately: WHERE Da.DayOWeek_ID=4 -- which would be Wednesday
-- WHERE Da.Date BETWEEN ... AND ... -- whatever date range...
-- etc...
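To make the view recommendation concrete, a minimal sketch (the view name is my own invention):
CREATE VIEW UserAvailability AS
SELECT Us.User_ID, Us.UserName, Da.Date, Da.DayOWeek_ID, Avt.Time_ID
FROM Users Us
JOIN Availabilities Av ON Us.User_ID=Av.User_ID
JOIN Dates Da ON Av.Date_ID=Da.Date_ID
JOIN AvailableTimes Avt ON Av.Av_ID=Avt.Av_ID;
-- then, for example:
SELECT * FROM UserAvailability WHERE Date='2014-01-03';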
Recommended data in DaysOfWeek (which is effectively a lookup table):
INSERT INTO DaysOfWeek(DayOWeek_ID,Name,Description)
VALUES (1,'Sunday', 'First Day of the Week'),(2,'Monday', 'Second Day of the Week')...(7,'Saturday', 'Last Day of the Week'),(8,'AllWeek','The entire week'),(9,'Weekdays', 'Monday through Friday'),(10,'Weekends','Saturday & Sunday')
Example Templates data:
INSERT INTO Templates(Time_ID,User_ID,DayOWeek_ID)
VALUES (1,1,9) -- this would show the first user is available for the first time frame every weekday as their default...
,(2,1,3) -- this would show the first user available on Tuesdays for the second time frame
The following is the recommended schema structure:
CREATE TABLE `test`.`Users` (
User_ID INT NOT NULL AUTO_INCREMENT ,
UserName VARCHAR(45) NULL ,
PRIMARY KEY (User_ID) );
CREATE TABLE `test`.`Templates` (
`Template_ID` INT NOT NULL AUTO_INCREMENT ,
`Time_ID` INT NULL ,
`User_ID` INT NULL ,
`DayOWeek_ID` INT NULL ,
PRIMARY KEY (`Template_ID`) )
COMMENT = 'This table holds the template data for general expected availability of a user/agent/person (so the person would use this to set their general availability)';
CREATE TABLE `test`.`Availabilities` (
`Av_ID` INT NOT NULL AUTO_INCREMENT ,
`User_ID` INT NULL ,
`Date_ID` INT NULL ,
PRIMARY KEY (`Av_ID`) )
COMMENT = 'This table holds a user''s actual availability for a particular date.\nIf the user is not available for a date then this table has no entry for that user for that date.\n(btw, this suggests the possibility of an alternate table, utilizing all other structures except the templates, called Engagements, which would record when a user is actually busy... in order to use this table & the other table together you would need to always join to AvailableTimes, as a date could appear in both tables but associated with different time frames).';
CREATE TABLE `test`.`Dates` (
`Date_ID` INT NOT NULL AUTO_INCREMENT ,
`DayOWeek_ID` INT NULL ,
`Date` DATE NULL ,
PRIMARY KEY (`Date_ID`) )
COMMENT = 'This table is utilized to hold actual dates with which users/agents can be associated.\nThe important thing to note here: this may end up holding every day of every year, which suggests a need to archive this data (and everything associated with it) for performance reasons as this database is utilized.\nOne more important detail: this is more efficient than associating actual dates directly with each user/agent with an availability on that date; this way the date is only recorded once, whereas the other approach records the date with the user for each availability.';
CREATE TABLE `test`.`AvailableTimes` (
`AvTime_ID` INT NOT NULL AUTO_INCREMENT ,
`Av_ID` INT NULL ,
`Time_ID` INT NULL ,
PRIMARY KEY (`AvTime_ID`) )
COMMENT = 'This table records the time frames that a user is available on a particular date.\nThis allows the time frames to be flexible without affecting the structure of the DB.\n(e.g. if you only keep track of AM & PM at the beginning of the use of the DB but later decide to keep track on an hourly basis you simply add the hourly time frames & start populating them, no changes to the DB schema need to be made)';
CREATE TABLE `test`.`TimeFrames` (
`Time_ID` INT NOT NULL AUTO_INCREMENT ,
`StartTime` TIME NOT NULL ,
`EndTime` TIME NOT NULL ,
`Name` VARCHAR(45) NOT NULL ,
`Desc` VARCHAR(128) NULL ,
PRIMARY KEY (`Time_ID`) ,
UNIQUE INDEX `Name_UNIQUE` (`Name` ASC) )
COMMENT = 'Utilize this table to record the times that are being tracked.\nThis allows the flexibility of having multiple time frames on the same day.\nIt also provides the flexibility to change the time frames being tracked without changing the DB structure.';
CREATE TABLE `test`.`DaysOfWeek` (
`DayOWeek_ID` INT NOT NULL AUTO_INCREMENT ,
`Name` VARCHAR(45) NOT NULL ,
`Description` VARCHAR(128) NULL ,
PRIMARY KEY (`DayOWeek_ID`) ,
UNIQUE INDEX `Name_UNIQUE` (`Name` ASC) )
COMMENT = 'This table is a lookup table to hold the days of the week.\nI personally would recommend adding a row for:\nWeekends, AllWeek, & Weekdays.\nThese will often be used in conjunction with the templates and, with those 3 entries in this table, will allow fewer entries to be made in that table.';
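Following the George/Fred/Harry idea above, a minimal, hedged sample population (IDs assume empty tables, so AUTO_INCREMENT starts at 1):
INSERT INTO Users (UserName) VALUES ('George'),('Fred'),('Harry');
INSERT INTO TimeFrames (StartTime,EndTime,Name) VALUES
('00:00:00','11:59:59','AM'),('12:00:00','23:59:59','PM');
INSERT INTO Dates (DayOWeek_ID,Date) VALUES (6,'2014-01-03'); -- a Friday
-- George (User_ID 1) is free that Friday morning; Harry (User_ID 3) is free all day.
INSERT INTO Availabilities (User_ID,Date_ID) VALUES (1,1),(3,1);
INSERT INTO AvailableTimes (Av_ID,Time_ID) VALUES (1,1),(2,1),(2,2);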
OK, this is what I would do:
In the users table create fields for Sunday, Monday ... Saturday.
Use pm, am or both as the values in those fields.
You should also index each field in the db for faster querying.
Then make a separate table with user/date/meridian fields (meridian meaning AM or PM). Again, the meridian field values would be pm, am or both.
You will need to do a little research with PHP's date function to pull out the day-of-week number, and perhaps use a switch statement against it.
Use the requested dates to pull out the day of the week and query the users table for the users' day-of-week availability.
Then use the requested date/meridian itself and query the new user/date/meridian table for the users' individual availability dates/meridians. (A sketch of both lookups follows.)
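As a hedged sketch of those two lookups (all table and column names here are my assumptions, not the poster's):
-- Template check: who is generally free on the weekday of the requested date?
-- PHP's date('N') or MySQL's DAYOFWEEK() picks which day column to test;
-- for 2013-03-05 (a Tuesday) that is the tuesday column.
SELECT id FROM users WHERE tuesday IN ('pm','both');
-- Override check: per-date exceptions from the user/date/meridian table.
SELECT user_id FROM user_date_meridian
WHERE date = '2013-03-05' AND meridian IN ('pm','both');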
I don't think there is going to be much of an algorithm here, except when extracting the days of the week from the date requests. If you are doing a date range then you could benefit from an algorithm, but if it is just a bunch of cherry-picked dates then you are going to have to do them one by one. Let me know and maybe I'll throw an algo together for you.
Description:
I am building a rating system with MySQL/PHP. I am confused as to how I should set up the database.
Here is my article setup:
Article table:
id | user_id | title | body | date_posted
This is my assumed rating table:
Rating table:
id | article_id | score | ? user_id ?
Problem:
I don't know if I should place the user_id in the rating table. My plan is to use a query like this:
SELECT ... WHERE user_id = 1 AND article_id = 10
But I know that it's redundant data, as it stores the user_id twice. Should I figure out a JOIN on the tables, or is the structure good as is?
It depends. I'm assuming that the articles are unique to individual users? In that case, I would retain the user_id in your rating table and then just alter your query to:
SELECT ... WHERE article_id = 10
or
SELECT ... WHERE user_id = 1
Depending on what info you're trying to pull.
You're not "storing the user_id twice" so much as using the user_id to link the article to unique data associated with the user in another table. You're taking the right approach, except in your query.
I don't see anything wrong with this approach. The user id being stored twice is not particularly relevant, since one relates to a rating entry and the other, I assume, to the article owner.
The benefit of this way is that you can prevent multiple scores being recorded per user by making (article_id, user_id) unique and using REPLACE INTO to manage scoring, as sketched below.
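A minimal sketch of that idea, assuming the rating table from the question:
ALTER TABLE rating ADD UNIQUE KEY uq_article_user (article_id, user_id);
-- Re-rating overwrites the previous row for that (article, user) pair.
REPLACE INTO rating (article_id, user_id, score) VALUES (10, 1, 4);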
There are many things to elaborate on here, depending on whether or not this rating system needs to be intelligent enough to prevent gaming, how large the user base is, etc.
I'd bet that for most purposes, this setup would not be detrimental even in a relatively large-scale system.
... semi-irrelevant:
Just FYI, depending on the importance and gaming aspects of this score, you could use STDDEV() to get the standard deviation of the score column, and use it to compute an average that discounts outliers...
SELECT STDDEV(`score`) FROM `rating` WHERE `article_id` = {article_id}
That would factor out outliers, supposing you cared whether or not people were ganging up on a particular article to shoot it down or praise it without valid cause.
You should not, due to third normal form (3NF): you need to keep the attributes independent.
"The third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd in 1971.[1] Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:
The relation R (table) is in second normal form (2NF)
Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R."
Source here: http://en.wikipedia.org/wiki/Third_normal_form
First normal Form: http://en.wikipedia.org/wiki/First_normal_form
Second normal Form: http://en.wikipedia.org/wiki/Second_normal_form
You should take a look at normalization and the E/R model; it will help you a lot.
normalization in wikipedia: http://en.wikipedia.org/wiki/Database_normalization
What I am trying to do is make a trending algorithm. I need help with the SQL code, as I can't get it to work.
There are three aspects to the algorithm (I am completely open to ideas for a better trending algorithm):
1. Plays during the last 24h / total plays of the song
2. Plays during the last 7d / total plays of the song
3. Plays during the last 24h / the play count of the most-played item over the last 24h (whichever item leads the play count over 24h)
Each aspect is worth up to 0.33, for a maximum possible value of 1.0.
The third aspect is necessary, as newly uploaded items would otherwise automatically sit at the top unless there was a way to drop them down.
The table is called aud_plays and the columns are:
PlayID: Just an auto-incrementing ID for the table
AID: The id of the song
IP: IP address of the user listening
time: UNIX timestamp
I have tried a few SQL queries, but I'm pretty stuck and unable to get this to work.
In your ?aud_songs? table (the one the AID points to), add the following columns:
Last24hrPlays INT -- use BIGINT if you plan on getting billion+
Last7dPlays INT
TotalPlays INT
In your aud_plays table create an AFTER INSERT trigger that will increment aud_songs.TotalPlays:
UPDATE aud_songs SET TotalPlays = TotalPlays + 1 WHERE id = INSERTED.aid
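Since the question looks like MySQL rather than SQL Server, that trigger might be written like this (a sketch, assuming id is the aud_songs primary key):
DELIMITER //
CREATE TRIGGER aud_plays_after_insert
AFTER INSERT ON aud_plays
FOR EACH ROW
BEGIN
  -- NEW.AID is the song that was just played
  UPDATE aud_songs SET TotalPlays = TotalPlays + 1 WHERE id = NEW.AID;
END//
DELIMITER ;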
Calculating your trending score in real time for every request would be taxing on your server, so it's best to just run a job that updates the data every ~5 minutes. So create a scheduled job (SQL Agent on SQL Server, or an EVENT/cron job on MySQL) to run every X minutes that updates Last7dPlays and Last24hrPlays:
UPDATE aud_songs SET Last7dPlays = (SELECT COUNT(*) FROM aud_plays WHERE aud_plays.aid = aud_songs.id AND aud_plays.time BETWEEN GetDate()-7 AND GetDate()),
Last24hrPlays = (SELECT COUNT(*) FROM aud_plays WHERE aud_plays.aid = aud_songs.id AND aud_plays.time BETWEEN GetDate()-1 AND GetDate())
I would also recommend removing old records from aud_plays (possibly those older than 7 days, since you will have the TotalPlays trigger).
It should be easy to figure out how to calculate aspects 1 and 2 (from the question). Here's the SQL for aspect 3:
SELECT cast(Last24hrPlays as float) / (SELECT MAX(Last24hrPlays) FROM aud_songs) FROM aud_songs WHERE aud_songs.id = #ID
NOTE: I made the T-SQL pretty generic and unoptimized to illustrate how the process works.
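Since the question's time column is a UNIX timestamp and the engine appears to be MySQL, a rough translation might look like this (a sketch, untested against the real schema):
-- Refresh the rolling counters (run from cron or a MySQL EVENT every few minutes)
UPDATE aud_songs s
SET s.Last24hrPlays = (SELECT COUNT(*) FROM aud_plays p
                       WHERE p.AID = s.id AND p.time >= UNIX_TIMESTAMP() - 86400),
    s.Last7dPlays   = (SELECT COUNT(*) FROM aud_plays p
                       WHERE p.AID = s.id AND p.time >= UNIX_TIMESTAMP() - 604800);
-- Combined trend score; NULLIF guards against dividing by zero plays
SELECT id,
       0.33 * Last24hrPlays / NULLIF(TotalPlays, 0)
     + 0.33 * Last7dPlays   / NULLIF(TotalPlays, 0)
     + 0.33 * Last24hrPlays / NULLIF((SELECT MAX(Last24hrPlays) FROM aud_songs), 0) AS trend_score
FROM aud_songs
ORDER BY trend_score DESC;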
The website I have to manage is a search engine for workers (yellow-pages style).
I have created a database like this:
People: <---- 4,000,000 records
id
name
address
id_activity <--- linked to the activites table
tel
fax
id_region <--- linked to the regions table
activites: <---- 1,500 activities
id
name_activity
regions: <--- 95 regions
id
region_name
locations: <---- 4,000,000 records
id_people
lat
lon
So basically the query I am having trouble with selects all the "workers" around a city chosen by the user.
The query I have created works, but it takes 5-6 seconds to return results...
Basically I do a select on the locations table to find everything within a certain radius of the city, then join to the people table:
SELECT people.*, lat, lon, poi,
(6371 * ACOS(COS(RADIANS(plat)) * COS(RADIANS(lat)) * COS(RADIANS(lon) - RADIANS(plon)) + SIN(RADIANS(plat)) * SIN(RADIANS(lat)))) AS distance
FROM locations
JOIN people ON locations.id_people = people.id
HAVING distance < dist
ORDER BY distance LIMIT 0, 20;
My questions are:
Is my database nicely designed? I don't know if it's a good idea to have two tables with 4,000,000 records each. Is it OK to do a select on them?
Is my query badly designed?
How can I speed up the search?
The design looks normalized; this is what I would expect to see in most well-designed databases. The amount of data in the tables is important, but secondary. However, if there is a 1-to-1 correspondence between People and Locations, as appears from your query, I would say the tables should be merged into one. This will certainly help.
Your SQL looks OK, though adding constraints to reduce the number of rows involved would help; see the bounding-box sketch below.
You need to index your tables. This is what will normally help most with slowness (as most developers don't consider database indexes at all).
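A hedged sketch of that constraint idea: pre-filter with a bounding box that an index on (lat, lon) can use, then let the exact distance formula do the fine filtering (plat, plon and dist are the placeholders from the question; 111.045 km per degree of latitude is an approximation):
ALTER TABLE locations ADD INDEX idx_lat_lon (lat, lon);

SELECT people.*, lat, lon,
(6371 * ACOS(COS(RADIANS(plat)) * COS(RADIANS(lat)) * COS(RADIANS(lon) - RADIANS(plon)) + SIN(RADIANS(plat)) * SIN(RADIANS(lat)))) AS distance
FROM locations
JOIN people ON locations.id_people = people.id
WHERE lat BETWEEN plat - (dist / 111.045) AND plat + (dist / 111.045)
AND lon BETWEEN plon - (dist / (111.045 * COS(RADIANS(plat)))) AND plon + (dist / (111.045 * COS(RADIANS(plat))))
HAVING distance < dist
ORDER BY distance LIMIT 0, 20;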
There are a couple of basic things that could be making your query run slowly.
What are your indexes like on your tables? Have you declared primary keys on the tables? Joining two tables of 4M rows each without indexes causes a lot of work for the DB. Make sure you get this right first.
If you've already built the right indexes for your DB, you can look at caching data. You're doing a calculation in your query. Are the locations (lat/lon) generally fixed? How often do they change? Are the items in your locations table actual places (cities, buildings, etc.), or are they records of where the people have been (like Foursquare check-ins)?
If your locations are places, you can make a lot of nice optimizations by isolating the parts of your data that change infrequently and pre-calculating the distances between them.
If all else fails, make sure your database server has enough RAM. If the server can keep your data in memory it will speed things up a lot.