How to capture chronological changes to an array in SQL? - php

I have a simple, location-based, key-value array (PHP), which changes throughout the day. I intend to capture the variation in this array.
I can calculate the difference between the previous array and the current array values. I could then save them in the SQL DB as:
Location, Date, Key, NewValue
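For illustration, the diff step in PHP might look something like this (a sketch; array_diff_assoc catches changed and added keys, and the reverse diff would be needed to catch removed ones):

$changes = array_diff_assoc($current, $previous); // pairs whose value changed or is new
foreach ($changes as $key => $newValue) {
    $rows[] = array($location, date('Y-m-d H:i:s'), $key, $newValue);
}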
What would the schema look like for this? My newbie attempt is as follows:
CREATE TABLE `Variations` (
    `Location` TEXT(128),
    `Date` DATETIME,
    `Key` TEXT(64),
    `Value` TEXT(256),
    `ID` INT NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (`ID`)
);
How would I find all the (latest) key-value pairs as of a given date?
Looking for guidance on the SQL query for data retrieval.

In SQLite, an autoincrementing ID must be of type INTEGER, not INT, and AUTOINCREMENT (not AUTO_INCREMENT) has a meaning different from what you think it has: a plain INTEGER PRIMARY KEY already autoincrements, and the AUTOINCREMENT keyword merely prevents the reuse of deleted row IDs.
To get the latest values for a date, you need rows for which no other row with a later date exists:
SELECT *
FROM Variations
WHERE date(Date) <= 'xxxx-xx-xx'
  AND NOT EXISTS (SELECT 1
                  FROM Variations AS V2
                  WHERE V2.Location = Variations.Location
                    AND V2.Key = Variations.Key
                    AND V2.Date <= 'xxxx-xx-xx'
                    AND V2.Date > Variations.Date)
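From PHP, the cut-off date would be bound as a parameter rather than interpolated into the string (a sketch using PDO with the SQLite driver this answer assumes; the file name and date are placeholders):

$pdo = new PDO('sqlite:/path/to/mydb.sqlite');
$stmt = $pdo->prepare("SELECT *
                       FROM Variations
                       WHERE date(Date) <= :d
                         AND NOT EXISTS (SELECT 1
                                         FROM Variations AS V2
                                         WHERE V2.Location = Variations.Location
                                           AND V2.Key = Variations.Key
                                           AND V2.Date <= :d
                                           AND V2.Date > Variations.Date)");
$stmt->execute(array(':d' => '2011-06-15')); // placeholder date
$latest = $stmt->fetchAll(PDO::FETCH_ASSOC);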

Related

Most efficient way to store data for a graph

I have come up with a total of three different, equally viable methods for saving data for a graph.
The graph in question is "player's score in various categories over time". Categories include "buildings", "items", "quest completion", "achievements" and so on.
Method 1:
CREATE TABLE `graphdata` (
    `userid` INT UNSIGNED NOT NULL,
    `date` DATE NOT NULL,
    `category` ENUM('buildings','items',...) NOT NULL,
    `score` FLOAT UNSIGNED NOT NULL,
    PRIMARY KEY (`userid`, `date`, `category`),
    INDEX `userid` (`userid`),
    INDEX `date` (`date`)
) ENGINE=InnoDB
This table contains one row for each user/date/category combination. To show a user's data, select by userid. Old entries are cleared out by:
DELETE FROM `graphdata` WHERE `date` < DATE_ADD(NOW(),INTERVAL -1 WEEK)
Method 2:
CREATE TABLE `graphdata` (
    `userid` INT UNSIGNED NOT NULL,
    `buildings-1day` FLOAT UNSIGNED NOT NULL,
    `buildings-2day` FLOAT UNSIGNED NOT NULL,
    ... (and so on for each category, up to `-7day`)
    PRIMARY KEY (`userid`)
)
Selecting by user id is faster due to being a primary key. Every day, scores are shifted down the fields, as in:
... SET `buildings-3day`=`buildings-2day`, `buildings-2day`=`buildings-1day`...
Entries are not deleted (unless a user deletes their account). Rows can be added/updated with an INSERT...ON DUPLICATE KEY UPDATE query.
Method 3:
Use one file for each user, containing a JSON-encoded array of their score data. Since the data is being fetched by an AJAX JSON call anyway, this means the file can be fetched statically (and even cached until the following midnight) without any stress on the server. Every day the server runs through each file, shift()s the oldest score off each array and push()es the new one on the end.
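In PHP, the nightly rotation would be roughly this (a sketch; the file layout and getTodaysScore() are stand-ins for however the data is actually laid out and fetched):

foreach (glob('scores/*.json') as $file) {
    $userId = basename($file, '.json');
    $data = json_decode(file_get_contents($file), true);
    foreach ($data as $category => $scores) {
        array_shift($scores);                           // drop the oldest day
        $scores[] = getTodaysScore($userId, $category); // hypothetical fetch of today's value
        $data[$category] = $scores;
    }
    file_put_contents($file, json_encode($data));
}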
Personally I think Method 3 is by far the best, however I've heard bad things about using files instead of databases - for instance if I wanted to be able to rank users by their scores in different categories, this solution would be very bad.
Out of the two database solutions, I've implemented Method 2 on one of my older projects, and that seems to work quite well. Method 1 seems "better" in that it makes better use of relational databases and all that stuff, but I'm a little concerned that it will contain (number of users) * (number of categories) * 7 rows, which could turn out to be a big number.
Is there anything I'm missing that could help me make a final decision on which method to use? 1, 2, 3 or none of the above?
If you're going to use a relational db, method 1 is much better than method 2. It's normalized, so it's easy to maintain and search. I'd change the date field to a timestamp and call it added_on (or something that's not a reserved word like 'date' is). And I'd add an auto_increment primary key score_id so that user_id/date/category doesn't have to be unique. That way, if a user managed to increment his building score twice in the same second, both would still be recorded.
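Put together, that suggestion might look like this (a sketch of the changes described above, not a definitive schema; the ENUM list is abbreviated as in the question):

CREATE TABLE `scores` (
    `score_id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `userid` INT UNSIGNED NOT NULL,
    `category` ENUM('buildings','items') NOT NULL, -- extend with the remaining categories
    `score` FLOAT UNSIGNED NOT NULL,
    `added_on` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`score_id`),
    INDEX `userid` (`userid`),
    INDEX `added_on` (`added_on`)
) ENGINE=InnoDB;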
The second method requires you to update all the records every day. The first method only does inserts, no updates, so each record is only written to once.
... SET buildings-3day=buildings-2day, buildings-2day=buildings-1day...
You really want to update every single record in the table every day until the end of time?!
Selecting by user id is faster due to being a primary key
Since user_id is the first field in your Method 1 primary key, it will be similarly fast for lookups. As the first field in a regular index (which is what I've suggested above), it will still be very fast.
The idea with a relational db is that each row represents a single instance/action/occurrence. So when a user does something to affect his score, do an INSERT that records what he did. You can always create a summary from data like this. But you can't get this kind of data from a summary.
Secondly, you seem unduly concerned about getting rid of old data. Why? Your select queries would have a date range on them that would exclude old data automatically. And if you're concerned about performance, you can partition your tables based on row age or set up a cronjob to delete old records periodically.
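For example, the periodic cleanup could be a single scheduled statement (shown here with MySQL's event scheduler rather than cron; the one-week retention window is just an assumption):

CREATE EVENT purge_old_scores
    ON SCHEDULE EVERY 1 DAY
    DO DELETE FROM scores WHERE added_on < NOW() - INTERVAL 1 WEEK;
-- requires the scheduler to be on: SET GLOBAL event_scheduler = ON;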
ETA: Regarding JSON stored in files
This seems to me to combine the drawbacks of Method 2 (difficult to search, every file must be updated every day) with the additional drawbacks of file access. File accesses are expensive. File writes are even more so. If you really want to store summary data, I'd run a query only when the data is requested and I'd store the results in a summary table by user_id. The table could hold a JSON string:
CREATE TABLE score_summaries (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,
    gen_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    json_data TEXT NOT NULL -- note: MySQL TEXT columns can't take a literal DEFAULT such as '{}'
);
For example:
Bob (user_id=7) logs into the game for the first time. He's on his profile page which displays his weekly stats. These queries ran:
SELECT json_data FROM score_summaries
WHERE user_id = 7
  AND gen_date > DATE_SUB(CURDATE(), INTERVAL 1 DAY);
-- returns nothing, so generate a summary record

SELECT DATE(added_on), category, SUM(score)
FROM scores
WHERE user_id = 7
  AND added_on < CURDATE()
  AND added_on > DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
GROUP BY DATE(added_on), category;
-- never include today's data; encode as JSON with PHP

INSERT INTO score_summaries (user_id, json_data)
VALUES (7, '$json') -- from PHP; in this case $json == NULL
ON DUPLICATE KEY UPDATE json_data = VALUES(json_data);
-- use $json for presentation too
Today's scores are generated as needed and not stored in the summary. If Bob views his scores again today, the historical ones can come from the summary table or could be stored in a session after the first request. If Bob doesn't visit for a week, no summary needs to be generated.
Method 1 seems like a clear winner to me. If you are concerned about the single table (graphdata) being too big, you could reduce it by creating:
CREATE TABLE `graphdata` (
    `graphDataId` INT UNSIGNED NOT NULL,
    `categoryId` INT NOT NULL,
    `score` FLOAT UNSIGNED NOT NULL,
    PRIMARY KEY (`graphDataId`)
) ENGINE=InnoDB
then create two more tables, because you obviously need info connecting graphDataId with userId:
CREATE TABLE `graphDataUser` (
    `graphDataId` INT UNSIGNED NOT NULL,
    `userId` INT NOT NULL
) ENGINE=InnoDB
and one connecting graphDataId with its date:
CREATE TABLE `graphDataDate` (
    `graphDataId` INT UNSIGNED NOT NULL,
    `graphDataDate` DATE NOT NULL
) ENGINE=InnoDB
I don't think you really need to worry about the number of rows a table contains, because most database systems do a good job of handling large row counts. Your job is only to store the data in a form that is easily retrieved, no matter what task it is retrieved for. Following that advice should pay off in the long run.

Getting the total for two queries in PHP

I'm tracking costs to clients by session and by items specific to each session. I'm trying to get the total session costs and session item costs (cost * count from tbl_sessionitem). But when I check the results, the code outputs the error:
Warning: mysql_fetch_array(): supplied argument is not a valid MySQL result resource
Here are my tables:
CREATE TABLE tbl_session (
    `clientid` INT UNSIGNED NOT NULL,
    `sessioncost` DECIMAL(6,2) NOT NULL,
    `datetoday` DATETIME NOT NULL
);
CREATE TABLE tbl_sessionitem (
    `clientid` INT UNSIGNED NOT NULL,
    `cost` DECIMAL(6,2) NOT NULL,
    `count` INT UNSIGNED NOT NULL,
    `datetoday` DATETIME NOT NULL
);
Here is my php code:
<?php
$date = $_POST['date'];
mysql_connect("localhost", "root", "");
mysql_select_db("database");
$sql = mysql_query("
    SELECT id
         , SUM(tbl_session.sessioncost) AS 'totalcost'
         , SUM(tbl_sessionitem.count) * SUM(tbl_sessionitem.cost) AS 'totalquantitycost'
    FROM (
        SELECT clientid
             , sessioncost
        FROM tbl_session
        WHERE datetoday = ('$date')
        UNION ALL
        SELECT clientid
             , cost
             , count
        FROM tbl_sessionitem
        WHERE datetoday = ('$date')
    )
    GROUP BY id");
while ($row = mysql_fetch_array($sql)) {
    echo $row['totalcost'];
    echo $row['totalquantitycost'];
}
mysql_close();
?>
The warning means what it says: the value passed to mysql_fetch_array isn't a result resource. mysql_query returns a mixed value; when the query fails, it returns false. You need to perform error checking. mysql_error will give you an error message from MySQL, though be careful never to output database error messages to non-admins.
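A minimal sketch of that check, staying with the legacy mysql_* API the question uses ($query stands in for the SQL from the question):

$sql = mysql_query($query);
if ($sql === false) {
    error_log(mysql_error());     // log the real error for admins
    die('Something went wrong.'); // show non-admins only a generic message
}
while ($row = mysql_fetch_array($sql)) {
    // ...
}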
If you had done that, you would have seen a number of problems:
the subselect result must be given an alias.
the selects being UNIONed have a different number of columns
there's no column named "id" in the subselect results.
the aggregate functions reference the tables from the subselect, but the outer select can only access the result table (the one missing an alias).
Even if you fix those SQL errors, the query itself won't give the results you're looking for, due to the way grouping and aggregate functions work.
There's a much better approach. Session items are associated with sessions, but in the schema this association is loose, via the datetoday column. As a result, you have the odd use of unions. Instead, create surrogate keys for the tables and give the session items table a column that refers to the session table. While you're at it, drop the redundant "tbl_" prefix.
CREATE TABLE sessions (
    id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    client INT UNSIGNED NOT NULL,
    cost DECIMAL(5,2),
    `date` TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (client) REFERENCES clients (id)
) ENGINE=InnoDB;

CREATE TABLE session_items (
    id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    session INT UNSIGNED NOT NULL,
    cost DECIMAL(5,2),
    `count` INT UNSIGNED,
    FOREIGN KEY (session) REFERENCES sessions (id)
) ENGINE=InnoDB;
To get the total session cost and quantity cost for a given day, you can use a subquery to get the quantity cost for a session (necessary to prevent including session costs multiple times in the totalcost sum), then sum the session and quantity costs in an outer query for each client's total costs for a given day.
SELECT client,
       SUM(cost) AS totalcost,
       SUM(quantitycost) AS totalquantitycost
FROM (
    SELECT client,
           sessions.cost,
           SUM(session_items.cost * session_items.`count`) AS quantitycost
    FROM sessions
    JOIN session_items ON sessions.id = session_items.session
    WHERE DATE(sessions.`date`) = CURDATE()
    GROUP BY sessions.id
) AS session_invoices
GROUP BY client
;
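To run this for an arbitrary day from PHP without interpolating user input, bind the date as a parameter (a mysqli sketch; the $mysqli connection is assumed, and get_result() needs the mysqlnd driver):

$date = $_POST['date']; // expected as 'YYYY-MM-DD'
$sql = "SELECT client,
               SUM(cost) AS totalcost,
               SUM(quantitycost) AS totalquantitycost
        FROM (SELECT client,
                     sessions.cost,
                     SUM(session_items.cost * session_items.`count`) AS quantitycost
              FROM sessions
              JOIN session_items ON sessions.id = session_items.session
              WHERE DATE(sessions.`date`) = ?
              GROUP BY sessions.id) AS session_invoices
        GROUP BY client";
$stmt = $mysqli->prepare($sql);
$stmt->bind_param('s', $date);
$stmt->execute();
$result = $stmt->get_result();
while ($row = $result->fetch_assoc()) {
    echo $row['totalcost'], ' ', $row['totalquantitycost'];
}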
COUNT is not to be used as a column name; it's a function, used like this:
SELECT COUNT(id) AS countOfId FROM table
Also, I would recommend doing all those calculations in PHP: much easier to maintain and probably better performance, since MySQL isn't meant to be a calculator.
If you want to use reserved keywords as column names, you need to add backticks, and don't write them in capitals, because that decreases readability in this case:
SELECT `count` FROM table
And what is COST?

Optimizing an SQL query with generated GROUP BY statement

I have this query:
SELECT ROUND(AVG(temp)*multT + conT, 2) AS temp,
       FLOOR(timestamp/$secondInterval) AS meh
FROM sensor_locass
LEFT JOIN sensor_data USING (sensor_id)
WHERE sensor_id = '$id'
  AND project_id = '$project'
GROUP BY meh
ORDER BY timestamp ASC
The purpose is to select data for drawing a graph; I use the average over a pixel's worth of data to make the graph faithful to the data.
So far, optimization has included adding indexes and switching between MyISAM and InnoDB, but no luck.
Since the time interval changes with graph zoom and the period of data collection, I cannot make a separate column for the GROUP BY statement; the query, however, is slow. Does anyone have ideas for optimizing this query or the table to make this grouping faster? I currently have an index on the timestamp, sensor_id, and project_id columns; the timestamp index is not used, however.
When running EXPLAIN EXTENDED with the query I get the following:
id  select_type  table          type  possible_keys                       key               key_len  ref                               rows   filtered  Extra
1   SIMPLE       sensor_locass  ref   sensor_id_lookup,project_id_lookup  sensor_id_lookup  4        const                             2      100.00    Using where; Using temporary; Using filesort
1   SIMPLE       sensor_data    ref   idsensor_lookup                     idsensor_lookup   4        webstech.sensor_locass.sensor_id  66857  100.00
The sensor_data table currently contains 2.7 million datapoints, which is only a small fraction of the amount of data I will end up having to work with. Any helpful ideas, comments, or solutions would be most welcome.
EDIT: table definitions:
CREATE TABLE `sensor_data` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `gateway_id` int(11) NOT NULL,
    `timestamp` int(10) NOT NULL,
    `v1` int(11) NOT NULL,
    `v2` int(11) NOT NULL,
    `v3` int(11) NOT NULL,
    `sensor_id` int(11) NOT NULL,
    `temp` decimal(5,3) NOT NULL,
    `oxygen` decimal(5,3) NOT NULL,
    `batVol` decimal(4,3) NOT NULL,
    PRIMARY KEY (`id`),
    KEY `gateway_id` (`gateway_id`),
    KEY `time_lookup` (`timestamp`),
    KEY `idsensor_lookup` (`sensor_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2741126 DEFAULT CHARSET=latin1
CREATE TABLE `sensor_locass` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `project_id` int(11) NOT NULL,
    `sensor_id` int(11) NOT NULL,
    `start` date NOT NULL,
    `end` date NOT NULL,
    `multT` decimal(6,3) NOT NULL,
    `conT` decimal(6,3) NOT NULL,
    `multO` decimal(6,3) NOT NULL,
    `conO` decimal(6,3) NOT NULL,
    `xpos` decimal(4,2) NOT NULL,
    `ypos` decimal(4,2) NOT NULL,
    `lat` decimal(9,6) NOT NULL,
    `lon` decimal(9,6) NOT NULL,
    `isRef` tinyint(1) NOT NULL,
    PRIMARY KEY (`id`),
    KEY `sensor_id_lookup` (`sensor_id`),
    KEY `project_id_lookup` (`project_id`)
) ENGINE=MyISAM AUTO_INCREMENT=238 DEFAULT CHARSET=latin1
Despite everyone's answers, changing the primary key to optimize the search on the table with 238 rows isn't going to change anything, especially when the EXPLAIN shows a single key narrowing the search to two rows. And adding timestamp to the primary key on sensor_data won't work either, since nothing is querying the timestamp, just calculating on it (unless you can restrict on the timestamp values, as galymzhan suggests).
Oh, and you can drop the LEFT in your query, since matching on project_id makes it irrelevant anyway (but doesn't slow anything down). And please don't interpolate variables directly into a query if those variables come from customer input, to avoid $project_id = "'; DROP TABLES; --"-style SQL injection exploits.
Adjusting your heap sizes could work for a while, but you'll have to keep adjusting them if you need to scale.
The answer vdrmrt suggests might work, but then you'd need to populate your aggregate table with every single possible value for $secondInterval, which I'm assuming isn't very plausible given the flexibility that you said you needed. In the same vein, you could consider rrdtool, either using it directly or modifying your data in the same way that it does. What I'm referring to specifically is that it keeps the raw data for a given period of time (usually a few days), then averages the data points together over larger and larger periods of time. The end result is that you can zoom in to high detail for recent periods of time, but if you look back further, the data has been effectively lossy-compressed to averages over large periods of time (e.g. one data point per second for a day, one data point per minute for a week, one data point per hour for a month, etc). You could customize those averages initially, but unless you kept both the raw data and the summarized data, you wouldn't be able to go back and adjust. In particular, you could not dynamically zoom in to high detail on some older arbitrary point (such as looking at the per-second data for an hour of time occurring six months ago).
So you'll have to decide whether such restrictions are reasonable given your requirements.
If not, I would then argue that you are trying to do something in MySQL that it was not designed for. I would suggest pulling the raw data you need and taking the averages in PHP, rather than in your query. As has already been pointed out, the main reason your query takes a long time is that the GROUP BY clause forces MySQL to crunch all the data in memory, but since it's too much data, it actually writes that data temporarily to disk (hence the Using filesort). However, you have much more flexibility in terms of how much memory you can use in PHP. Furthermore, since you are combining nearby rows, you could pull the data out row by row, combining it on the fly and thereby never needing to keep all the rows in memory in your PHP process. You could then drop the GROUP BY and avoid the filesort. Use an ORDER BY timestamp instead, and if MySQL doesn't optimize it correctly, then make sure you use FORCE INDEX FOR ORDER BY (timestamp).
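A sketch of that row-by-row idea (mysqli with an unbuffered result so the full set never sits in PHP's memory; $mysqli, $id, $project, and $secondInterval are assumed to exist):

$id = (int) $id;
$project = (int) $project;
$result = $mysqli->query(
    "SELECT temp, multT, conT, timestamp
     FROM sensor_locass
     JOIN sensor_data USING (sensor_id)
     WHERE sensor_id = $id AND project_id = $project
     ORDER BY timestamp ASC",
    MYSQLI_USE_RESULT);                 // unbuffered: rows stream one at a time

$points = array();
$bucket = null; $sum = 0.0; $n = 0; $multT = 0; $conT = 0;
while ($row = $result->fetch_assoc()) {
    $multT = $row['multT'];             // constant for this sensor
    $conT  = $row['conT'];
    $b = (int) floor($row['timestamp'] / $secondInterval);
    if ($bucket !== null && $b !== $bucket) { // crossed a bucket boundary: flush
        $points[$bucket] = round(($sum / $n) * $multT + $conT, 2);
        $sum = 0.0; $n = 0;
    }
    $bucket = $b;
    $sum += $row['temp'];
    $n++;
}
if ($n > 0) {                               // flush the final bucket
    $points[$bucket] = round(($sum / $n) * $multT + $conT, 2);
}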
I'd suggest that you find a natural primary key for your tables and switch to InnoDB. This is a guess at what your data looks like:
sensor_data:
PRIMARY KEY (sensor_id, timestamp)
sensor_locass:
PRIMARY KEY (sensor_id, project_id)
InnoDB will order all the data in this way, so rows you're likely to SELECT together will be together on disk. I think your GROUP BY will always cause some trouble. If you can keep it below the size where it switches over to a file sort (tmp_table_size and max_heap_table_size), it'll be much faster.
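Those two thresholds can be checked and raised at runtime (the sizes here are only illustrative; the lower of the two is what actually applies):

SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
SET SESSION tmp_table_size      = 64 * 1024 * 1024; -- 64 MB
SET SESSION max_heap_table_size = 64 * 1024 * 1024;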
How many rows are you generally returning? How long is it taking now?
As Joshua suggested, you should define (sensor_id, project_id) as a primary key for the sensor_locass table, because at the moment the table has 2 separate indexes, one on each of the columns. According to the MySQL docs, SELECT will choose only one of them (the most restrictive, which finds fewer rows), while a composite primary key allows both columns to be used for indexing data.
However, EXPLAIN shows that MySQL examined 66857 rows on the joined table, so you should somehow optimize that too. Maybe you could query sensor data for a given interval of time, like timestamp BETWEEN begin AND end?
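That restriction would just be an extra condition on the existing query ($begin and $end being the edges of the visible graph window, as Unix timestamps; they are placeholders here):

SELECT ROUND(AVG(temp)*multT + conT, 2) AS temp,
       FLOOR(timestamp/$secondInterval) AS meh
FROM sensor_locass
JOIN sensor_data USING (sensor_id)
WHERE sensor_id = '$id'
  AND project_id = '$project'
  AND timestamp BETWEEN $begin AND $end
GROUP BY meh
ORDER BY timestamp ASC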
I agree that the first step should be to define (sensor_id, project_id) as the primary key for sensor_locass.
If that is not enough and your data is relatively static, you can create an aggregated table that you refresh, for example, every day, and then query from there.
What you still have to do is define a range for secondInterval, store that in a new table, and add that field to the primary key of your aggregated table.
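The aggregated table itself might look something like this (a sketch; names and types are guesses based on the query below):

CREATE TABLE `aggregated_sensor_data` (
    `sensor_id` INT NOT NULL,
    `project_id` INT NOT NULL,
    `secondInterval` INT NOT NULL,
    `timestamp` INT NOT NULL,
    `temp` DECIMAL(6,3) NOT NULL,
    `meh` INT NOT NULL,
    PRIMARY KEY (`sensor_id`, `project_id`, `secondInterval`, `meh`)
) ENGINE=InnoDB;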
The query to populate the aggregated table will be something like this:
INSERT INTO aggregated_sensor_data (sensor_id, project_id, secondInterval, timestamp, temp, meh)
SELECT sensor_locass.sensor_id,
       sensor_locass.project_id,
       secondInterval,
       timestamp,
       ROUND(AVG(temp)*multT + conT, 2) AS temp,
       FLOOR(timestamp/secondInterval) AS meh
FROM sensor_locass
LEFT JOIN sensor_data USING (sensor_id)
LEFT JOIN secondIntervalRange ON 1 = 1
WHERE sensor_id = '$id'
  AND project_id = '$project'
GROUP BY sensor_locass.sensor_id,
         sensor_locass.project_id,
         meh
ORDER BY timestamp ASC
And you can use this query to extract the aggregated data:
SELECT temp,
       meh
FROM aggregated_sensor_data
WHERE sensor_id = '$id'
  AND project_id = '$project'
  AND secondInterval = $secondInterval
ORDER BY timestamp ASC
If you want to use the timestamp index, you will have to tell MySQL explicitly to use it. MySQL 5.1 supports USE INDEX FOR ORDER BY / FORCE INDEX FOR ORDER BY. Have a look at it here: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
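Applied here, the hint sits right after the table name (a sketch against the simplified single-table case; time_lookup is the index name from the table definition above):

SELECT temp, timestamp
FROM sensor_data FORCE INDEX FOR ORDER BY (time_lookup)
WHERE sensor_id = '$id'
ORDER BY timestamp ASC;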

Creating Unique Key in MySQL table referring to date

A question on preventing duplicated entries in my simple web form.
My table records user input from a web form, distinguished by date, e.g. DATE(). How do I prevent a user with the same name from entering information twice on a single date? E.g. the same username cannot be entered twice on the same date, but can be entered on another date.
Your table should have these:
create table tablename (
...
user_id bigint, -- or whatever
date_created date,
unique key(user_id, date_created)
...
);
You can simply create a composite primary key. For your case this means that your primary key must consist of a date field as well as the username field.
There are several ways.
First, you can create a unique index on your table (I'm using a simple table as an example):
CREATE TABLE `test` (
    `id` INT NOT NULL,
    `name` VARCHAR(255) NOT NULL,
    `date` DATE NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE = MYISAM;

ALTER TABLE `test` ADD UNIQUE (
    `name`,
    `date`
);
This is the MySQL way.
You also should make checks in PHP, although you can do this when inserting (MySQL will return an error, and you can check for it). Alternatively, you can run an additional SELECT before inserting (SELECT * FROM test WHERE name = USER AND date = DATE) and check the record count; if it's more than 0, you show an error.
When saving, you seldom need to worry about one additional query. If you do, just check the MySQL statement for errors (the MySQL way :)).
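The insert-and-check-the-error variant might look like this (a mysqli sketch against the test table above; 1062 is MySQL's duplicate-key error number, and $mysqli, $id, $name are assumed to exist):

$stmt = $mysqli->prepare("INSERT INTO `test` (`id`, `name`, `date`) VALUES (?, ?, CURDATE())");
$stmt->bind_param('is', $id, $name);
if (!$stmt->execute()) {
    if ($stmt->errno === 1062) {  // the unique (name, date) slot is already taken
        echo 'This name has already been entered today.';
    } else {
        error_log($stmt->error);  // some other problem; keep details off the page
    }
}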
Create a unique key on the user and date columns:
http://dev.mysql.com/doc/refman/5.1/en/create-table.html

Counting how many times a rating was entered in a MySQL Database using PHP

I'm trying to count how many times an article has been rated by my members, by using PHP to count a certain article's total entered ratings that have been stored in my MySQL database.
I really want to use PHP and not MySQL to do this, and was wondering how I can do that?
I hope I explained this right.
An example would be very helpful. The MySQL tables that hold the ratings are listed below.
CREATE TABLE articles_ratings (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ratings_id INT UNSIGNED NOT NULL,
    users_articles_id INT UNSIGNED NOT NULL,
    user_id INT UNSIGNED NOT NULL,
    date_created DATETIME NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE ratings (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    points FLOAT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (id)
);
It's much easier just to do it with SQL (filtering on users_articles_id, which identifies the article being rated):
SELECT COUNT(*) FROM articles_ratings WHERE users_articles_id = (article id)
You could of course just SELECT * FROM articles_ratings WHERE users_articles_id = (article id), then loop through all the rows to count them -- but if the database can do all this work for you, then it's usually best to let it!
If that's really what you want, you could SELECT the ratings and then use http://php.net/manual/en/function.mysql-num-rows.php to count them. Is this what you had in mind?
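That would look roughly like this (legacy mysql_* API to match the question's era; $articleId is a placeholder):

$articleId = 123; // placeholder article id
$result = mysql_query("SELECT id FROM articles_ratings
                       WHERE users_articles_id = " . (int) $articleId);
$ratingCount = mysql_num_rows($result);
echo "This article has been rated $ratingCount times.";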
"I really want to use PHP" - this will mean you will retrieve all rows from MySQL server and count them using PHP loop?
This is wrong - use SQL to aggregate information, then retrieve it from database.
