Find % of days of the week using timestamp from MySQL database - php

I have a MySQL DB table called 'purchases' populated with records of various transactions. I wish to find out what % of transactions have occurred on a particular day of the week, and was wondering how to achieve that using SQL. I have a field in this table called 'submitted' which holds a standard unix timestamp of when the transaction was recorded.
What I have so far:
SELECT COUNT(DISTINCT DATE_FORMAT(submitted,%a)) FROM purchases;
My end goal is to create a table with % of all transactions on given days, Monday - Sunday.
My table structure is as follows:
CREATE TABLE IF NOT EXISTS `purchases` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`orderinfo` varchar(50) NOT NULL,
`amount` varchar(5) NOT NULL,
`reference` varchar(15) NOT NULL,
`name` varchar(50) NOT NULL,
`address` varchar(100) NOT NULL,
`town` varchar(50) NOT NULL,
`postcode` varchar(12) NOT NULL,
`submitted` int(11) NOT NULL,
UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

Convert the timestamp into a date and then take the COUNT. This fetches the count as well as the percentage for each calendar date:
SELECT DATE(FROM_UNIXTIME(submitted)), COUNT(*) AS `count`, COUNT(*)/(SELECT COUNT(*) FROM purchases)*100 AS percentage
FROM purchases
GROUP BY DATE(FROM_UNIXTIME(submitted));
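Since the end goal is a per-weekday breakdown rather than per calendar date, a variant of the same idea (a sketch against the purchases table above) groups on DAYNAME instead:
-- One row per weekday name, with its share of all transactions:
SELECT DAYNAME(FROM_UNIXTIME(submitted)) AS weekday,
COUNT(*) AS `count`,
COUNT(*)/(SELECT COUNT(*) FROM purchases)*100 AS percentage
FROM purchases
GROUP BY weekday;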

You can get what you want with two nested queries like this (split across multiple lines and indented for readability):
SELECT
    DAYOFWEEK(FROM_UNIXTIME(`submitted`)),
    100 * COUNT(*) /
        (SELECT COUNT(*) FROM `purchases`)
FROM
    `purchases`
GROUP BY
    DAYOFWEEK(FROM_UNIXTIME(`submitted`))
Note that the inner query (SELECT COUNT(*) FROM purchases) is used to retrieve the total number of purchases, since you want a percentage.
MySQL's DAYOFWEEK() returns not the name of the day but an index number from 1 (Sunday) to 7 (Saturday).
You can look here for reference on MySQL's date and time functions:
http://dev.mysql.com/doc/refman/5.6/en/date-and-time-functions.html
If you just need a query then you're done.
If you want to build a view with that query, you can't: MySQL doesn't allow a nested SELECT in a view's definition.
You would then split the job into two views: one that just calculates the total number of records, and a second that calculates the percentage.
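For example (a sketch with hypothetical view names; the first view supplies the total that the second one divides by):
CREATE VIEW purchases_total AS
SELECT COUNT(*) AS total FROM purchases;

CREATE VIEW purchases_by_dayofweek AS
SELECT DAYOFWEEK(FROM_UNIXTIME(p.submitted)) AS dayofweek,
100 * COUNT(*) / t.total AS percentage
FROM purchases p, purchases_total t
GROUP BY DAYOFWEEK(FROM_UNIXTIME(p.submitted)), t.total;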
A final note about performance:
if you have a huge number of records to process, this query will not be fast, which matters if you plan to run it often.
In that case, since submitted is most likely written once and never changed, you may consider adding another int field to the table called submitted_dayofweek.
Whenever you write submitted, also store DAYOFWEEK(FROM_UNIXTIME(submitted)) in submitted_dayofweek.
Finally, make sure submitted_dayofweek is indexed.
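A minimal sketch of that schema change, backfilling existing rows (the index name is arbitrary):
ALTER TABLE purchases ADD COLUMN submitted_dayofweek int(11) NOT NULL DEFAULT 0;
UPDATE purchases SET submitted_dayofweek = DAYOFWEEK(FROM_UNIXTIME(submitted));
CREATE INDEX idx_submitted_dayofweek ON purchases (submitted_dayofweek);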
The query becomes
SELECT
    `submitted_dayofweek`,
    100 * COUNT(*) /
        (SELECT COUNT(*) FROM `purchases`)
FROM
    `purchases`
GROUP BY
    `submitted_dayofweek`
and will run very fast, at the price of a little more storage space.

Related

Slow GroupBy query for filter results Laravel [duplicate]

I have 2 tables: music and listenTrack. listenTrack tracks the unique plays of each song. I am trying to get results for popular songs of the month. I'm getting my results, but they are just taking too long. Below are my tables and query.
430,000 rows
CREATE TABLE `listentrack` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sessionId` varchar(50) NOT NULL,
`url` varchar(50) NOT NULL,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`ip` varchar(150) NOT NULL,
`user_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=731306 DEFAULT CHARSET=utf8
12500 rows
CREATE TABLE `music` (
`music_id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`title` varchar(50) DEFAULT NULL,
`artist` varchar(50) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`genre` int(4) DEFAULT NULL,
`file` varchar(255) NOT NULL,
`url` varchar(50) NOT NULL,
`allow_download` int(2) NOT NULL DEFAULT '1',
`plays` bigint(20) NOT NULL,
`downloads` bigint(20) NOT NULL,
`faved` bigint(20) NOT NULL,
`dateadded` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`music_id`)
) ENGINE=MyISAM AUTO_INCREMENT=15146 DEFAULT CHARSET=utf8
SELECT COUNT(listenTrack.url) AS total, listenTrack.url
FROM listenTrack
LEFT JOIN music ON music.url = listenTrack.url
WHERE DATEDIFF(DATE(date_created),'2009-08-15') = 0
GROUP BY listenTrack.url
ORDER BY total DESC
LIMIT 0,10
This query isn't very complex and the rows aren't too large, I don't think.
Is there any way to speed this up? Or can you suggest a better solution? This is going to be a cron job at the beginning of every month, but I would also like to do by-the-day results as well.
Oh, btw, I am running this locally: over 4 min to run, but on prod it takes about 45 secs.
I'm more of a SQL Server guy but these concepts should apply.
I'd add indexes:
On listenTrack, add an index on (url, date_created)
On music, add an index on (url)
These indexes should speed the query up tremendously (I originally had the table names mixed up - fixed in the latest edit).
For the most part you should also index any column that is used in a JOIN. In your case, you should index both listentrack.url and music.url
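In MySQL terms that would be something like the following (a sketch; the index names are arbitrary):
-- Composite index serving the WHERE and GROUP BY, plus the join column on music:
CREATE INDEX idx_url_date ON listentrack (url, date_created);
CREATE INDEX idx_url ON music (url);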
#jeff s - An index on listentrack.date_created wouldn't help, because that column is run through a function first, so MySQL cannot use an index on it. Often you can rewrite a query so that the indexed column is compared statically:
DATEDIFF(DATE(date_created),'2009-08-15') = 0
becomes
date_created >= '2009-08-15' AND date_created < '2009-08-16'
This will filter down to records from 2009-08-15 and allow any indexes on that column to be candidates. Note that MySQL might NOT use that index; it depends on other factors.
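Putting that rewrite into the original query:
SELECT COUNT(listenTrack.url) AS total, listenTrack.url
FROM listenTrack
LEFT JOIN music ON music.url = listenTrack.url
WHERE date_created >= '2009-08-15' AND date_created < '2009-08-16'
GROUP BY listenTrack.url
ORDER BY total DESC
LIMIT 0,10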
Your best bet is to make a dual index on listentrack(url, date_created)
and then another index on music.url
These 2 indexes will cover this particular query.
Note that if you run EXPLAIN on this query you are still going to get a using filesort because it has to write the records to a temporary table on disk to do the ORDER BY.
In general you should always run your query under EXPLAIN to get an idea on how MySQL will execute the query and then go from there. See the EXPLAIN documentation:
http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
Try creating an index that will help with the join:
CREATE INDEX idx_url ON music (url);
I think I might have missed the obvious before. Why are you joining the music table at all? You do not appear to be using the data in that table, and you are performing a left join which is not required, right? Having this table in the query will make it much slower and will not add any value. Take all references to music out, unless the url match is required, in which case use an inner join so that rows without a matching value are excluded.
I would add new indexes, as the others mention. Specifically I would add:
music url
listentrack date_created,url
This will improve your join a ton.
Then I would look at the query, you are forcing the system to perform work on each row of the table. It would be better to rephrase the date restriction as a range.
Not sure of the syntax off the top of my head:
WHERE date_created >= '2009-08-15 00:00:00' AND date_created < '2009-08-16 00:00:00'
That should allow it to rapidly use the index to locate the appropriate records. The combined two-column index on listentrack should allow it to find the records based on the date and URL. You should experiment; the columns might be better off in the other order, (url, date_created).
The explain plan for this query should say "using index" on the right hand column for both. That means that it will not have to hit the data in the table to calculate your sums.
I would also check the memory settings that you have configured for MySQL. It sounds like you do not have enough memory allocated. Be very careful on the differences between server based settings and thread based settings. The server with a 10MB cache is pretty small, a thread with a 10MB cache can use a lot of memory quickly.
Jacob
Pre-grouping and then joining makes things a lot faster with MySQL/MyISAM. (I suspect less of this is needed with other DBs.)
This should perform about as fast as the non-joined version:
SELECT
total, a.url, title
FROM
(
SELECT COUNT(*) as total, url
from listenTrack
WHERE DATEDIFF(DATE(date_created),'2009-08-15') = 0
GROUP BY url
ORDER BY total DESC
LIMIT 0,10
) as a
LEFT JOIN music ON music.url = a.url
;
P.S. - Mapping between the two tables with an id instead of a url is sound advice.
Why are you repeating the url in both tables?
Have listentrack hold a music_id instead, and join on that. Gets rid of the text search as well as the extra index.
Besides, it's arguably more correct. You're tracking the times that a particular track was listened to, not the url. What if the url changes?
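A sketch of that restructuring (the column and index names are hypothetical; with MyISAM the relationship is by convention, since foreign keys are not enforced):
-- Add the new key column and backfill it from the existing url values:
ALTER TABLE listentrack ADD COLUMN music_id int(11) NOT NULL;
UPDATE listentrack lt JOIN music m ON m.url = lt.url SET lt.music_id = m.music_id;
ALTER TABLE listentrack ADD INDEX idx_music_id (music_id);
-- The join then becomes: LEFT JOIN music ON music.music_id = listentrack.music_id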
After you add indexes, you may want to explore adding a new column storing date_created as a unix timestamp, which will make math operations quicker.
I am not certain why you have the DATEDIFF function though, as it appears you are looking for all rows created on a particular date.
You may want to look at your query, as it seems to have an error.
If you use unit tests, you can compare the results of your query with a query using a unix timestamp instead.
You might want to add an index to the url field of both tables.
Having said that, when I converted from MySQL to SQL Server 2008, with the same queries and same database structures, the queries ran 1-3 orders of magnitude faster.
I think some of it had to do with the RDBMS (MySQL's optimizer is not so good...) and some of it might have had to do with how the RDBMS reserves system resources, although the comparisons were made on production systems where only the DB would run.
The two indexes below would probably speed up the query.
CREATE INDEX music_url_index ON music (url) USING BTREE;
CREATE INDEX listenTrack_url_index ON listenTrack (url) USING BTREE;
You really need to know the total number of comparisons and row scans that are happening. To get that answer, look at the code here showing how to do that using EXPLAIN: http://www.siteconsortium.com/h/p1.php?id=mysql002.

Database design for optimisation

A couple of years ago I designed a reward system for 11-16yo students in PHP, JavaScript and MySQL.
The premise is straightforward:
Members of staff issue points to students under various categories ("Positive attitude and behaviour", "Model citizen", etc)
Students accrue these points then spend them in our online store (iTunes vouchers, etc)
Existing system
The database structure is also straightforward (probably too much so):
Transactions
239,189 rows
CREATE TABLE `transactions` (
`Transaction_ID` int(9) NOT NULL auto_increment,
`Datetime` date NOT NULL,
`Giver_ID` int(9) NOT NULL,
`Recipient_ID` int(9) NOT NULL,
`Points` int(4) NOT NULL,
`Category_ID` int(3) NOT NULL,
`Reason` text NOT NULL,
PRIMARY KEY (`Transaction_ID`),
KEY `Giver_ID` (`Giver_ID`),
KEY `Datetime` (`Datetime`),
KEY `DatetimeAndGiverID` (`Datetime`,`Giver_ID`),
KEY `Recipient_ID` (`Recipient_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=249069 DEFAULT CHARSET=latin1
Categories
34 rows
CREATE TABLE `categories` (
`Category_ID` int(9) NOT NULL,
`Title` varchar(255) NOT NULL,
`Description` text NOT NULL,
`Default_Points` int(3) NOT NULL,
`Groups` varchar(125) NOT NULL,
`Display_Start` datetime default NULL,
`Display_End` datetime default NULL,
PRIMARY KEY (`Category_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Rewards
82 rows
CREATE TABLE `rewards` (
`Reward_ID` int(9) NOT NULL auto_increment,
`Title` varchar(255) NOT NULL,
`Description` text NOT NULL,
`Image_URL` varchar(255) NOT NULL,
`Date_Inactive` datetime NOT NULL,
`Stock_Count` int(3) NOT NULL,
`Cost_to_User` float NOT NULL,
`Cost_to_System` float NOT NULL,
PRIMARY KEY (`Reward_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=91 DEFAULT CHARSET=latin1
Purchases
5,889 rows
CREATE TABLE `purchases` (
`Purchase_ID` int(9) NOT NULL auto_increment,
`Datetime` datetime NOT NULL,
`Reward_ID` int(9) NOT NULL,
`Quantity` int(4) NOT NULL,
`Student_ID` int(9) NOT NULL,
`Student_Name` varchar(255) NOT NULL,
`Date_DealtWith` datetime default NULL,
`Date_Collected` datetime default NULL,
PRIMARY KEY (`Purchase_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=6133 DEFAULT CHARSET=latin1
Problems
The system ran perfectly well for a period of time. It's now starting to slow down massively on certain queries.
Essentially, every time I need to access a student's reward points total, the required query takes ages. Here are a few example queries and their run-times:
Top 15 students, excluding attendance categories, across whole school
SELECT CONCAT( s.Firstname, " ", s.Surname ) AS `Student` , s.Year_Group AS `Year Group`, SUM( t.Points ) AS `Points`
FROM frog_rewards.transactions t
LEFT JOIN frog_shared.student s ON t.Recipient_ID = s.id
WHERE t.Datetime > '2013-09-01' AND t.Category_ID NOT IN ( 12, 13, 14, 26 )
GROUP BY t.Recipient_ID
ORDER BY `Points` DESC
LIMIT 0 , 15
Run-time: 44.8425 sec
SELECT Recipient_ID, SUM(points) AS Total_Points FROM transactions GROUP BY Recipient_ID
Run-time: 9.8698 sec
Now I appreciate that, especially with the second query, I shouldn't ever be running a call which returns such a vast quantity of rows, but the limitations of the framework within which the system runs meant that I had no other choice if I wanted to display students' total reward points for teachers/tutors/year managers/leadership to view and analyse.
Time for a solution
Fortunately the framework we've been forced to use is changing. We'll now be using oAuth rather than a horrible, outdated JavaScript widget format.
Unfortunately - or, I guess, fortunately - it means we'll have to rewrite quite a lot of the system.
One of the main areas I intend to look at when rewriting the system is the database structure. As time goes on it will only get bigger, so I need to do a bit of future-proofing.
As such, my main question is this: what is the most efficient and effective way of storing students' point totals?
The only idea I can come up with is to have a separate table called totals with Student_ID and Points fields. Every time a member of staff gives out some points, it adds a row into the transactions table but also updates the totals table.
Is that efficient? Would it be efficient to also have a Points_Since_Monday type field? How would I update/keep on top of that?
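For concreteness, that idea would look something like this (a hypothetical sketch; the table name and the 123/5 values are placeholders):
CREATE TABLE totals (
`Student_ID` int(9) NOT NULL,
`Points` int(9) NOT NULL DEFAULT 0,
PRIMARY KEY (`Student_ID`)
) ENGINE=InnoDB;

-- Run alongside each insert into transactions, e.g. student 123 receiving 5 points:
INSERT INTO totals (Student_ID, Points) VALUES (123, 5)
ON DUPLICATE KEY UPDATE Points = Points + 5;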
On top of the main question, if anyone has suggestions for general improvement with regard to optimisation of the database table, please let me know.
Thanks in advance,
Duncan
There is nothing particularly wrong with your design that should make it as slow as you have reported. I'm thinking there must be other factors at work, such as the server it is running on being overloaded or slow. Only you will be able to find out if that is the case.
In order to test your design I recreated it on the SQL Server 2008 instance I have running on my desktop computer. I have a standard computer, single hard disc, not SSD, not RAID etc., so on a proper database server the results should be even better. I had to make some changes to the design as you are using MySQL, but none of the changes should affect performance; it's just so I can run it on my database.
Here's the table structure I used. I had to guess at what you would have in the Student and Staff tables as you do not describe those. I also took the liberty of changing the field names in the Transaction table for Giver_ID and Recipient_ID, as I assume only staff give points and students receive them.
I generated random data to fill the tables with the same number of rows as you said you have in your database.
I ran the two queries you said are taking a long time. I've changed them to suit my design, but (I hope) the result is the same:
SELECT TOP 15
Firstname + ' ' + Surname
,Year_Group
,SUM(Points) AS Points
FROM points.[Transaction]
INNER JOIN points.Student ON points.[Transaction].Student_ID = points.Student.Student_ID
WHERE [Datetime] > '2013-09-01'
AND Category_ID NOT IN ( 12, 13, 14, 26 )
GROUP BY Firstname + ' ' + Surname
,Year_Group
ORDER BY SUM(Points) DESC
SELECT Student_ID
,SUM(Points) AS Total_Points
FROM points.[Transaction]
GROUP BY Student_ID
Both queries returned results in about 1s. I have not created any additional indexes on the tables other than the CLUSTERED indexes generated by default on the primary keys. Looking at the execution plan the query processor estimates that implementing the following index could improve the query cost by 81.0309%
CREATE NONCLUSTERED INDEX [<Name of Missing Index>]
ON [points].[Transaction] ([Datetime],[Category_ID])
INCLUDE ([Student_ID],[Points])
As others have commented I would look elsewhere for bottlenecks before spending a lot of time redesigning your database.
Update:
I realised I never actually addressed your specific question:
what is the most efficient and effective way of storing students'
point totals?
The only idea I can come up with is to have a separate table called
totals with Student_ID and Points fields. Every time a member of staff
gives out some points, it adds a row into the transactions table but
also updates the totals table.
I would not recommend keeping a separate point total unless you have explored every other possible way to speed up the database. A separate tally can become out of sync with the transactions and then you have to reconcile everything and track down what went wrong, and what the correct total should be.
You should always focus on maintaining the correctness and consistency of the data before trying to increase speed. Most of the time a correct (normalised) data model will operate quickly enough.
In one place I worked we found the most cost effective way to speed up our database was simply to upgrade the hardware; much quicker and cheaper than spending many man-hours redesigning the database :)

SQL for Top10 Companies by Hits on Company Profile - or a better solution?

I browsed summaries for all 500+ answers related to this question but found no apparent SQL solution to my problem. Maybe there is none.
I wish to display a Top10 Companies by Hits on Company Profile from nnn to nnn time period.
MySQL Table to track hits on company profile
CREATE TABLE `hit_company` (
`id` mediumint(9) unsigned NOT NULL AUTO_INCREMENT,
`hitdate` date NOT NULL DEFAULT '0000-00-00',
`customerid` mediumint(9) unsigned NOT NULL DEFAULT '0',
`organization` varchar(80) DEFAULT NULL,
`hitstamp` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`remote_addr` varchar(20) DEFAULT NULL,
`hittime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=123456 ;
The MySQL query for a 24 hour report looks like this:
SELECT hitstamp,hitdate,customerid,organization, COUNT(organization) AS 'count' FROM hit_company
WHERE hitstamp >= '2014-01-12 21:23' AND hitstamp <= '2014-01-13 21:23'
GROUP BY customerid ORDER BY count DESC LIMIT 10
This yields a Top10. The problem is that the counts are significantly inflated by individual remote_addr values that refresh their company profile page and thus artificially increase their hit counter. I only want to count 1 vote per IP per company within the report time range.
I have considered but am not sure how to code something similar to:
WHERE COUNT(remote_addr < 2)
or
GROUP BY customerid, remote_addr HAVING 'votes' < 2
... so that the same IP address is not counted more than once per company in the final result.
What is the SQL to do this?
If there is no sql, what is the optimum solution in php/mysql ?
Thank you.
Note
While this report spans 24 hours, other reports span from 12 hours to 30 days to all time.
SELECT whatever FROM (
SELECT whatever FROM hit_company
WHERE hitstamp >= whatever
GROUP BY organization, remote_addr
) AS per_ip GROUP BY customerid ORDER BY count
Hand-written, not tested. The basic idea is to use a subquery. It will be very slow if computed over large data sets.
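Filled in concretely against the hit_company table above (a sketch, equally untested):
-- Inner query collapses each company/IP pair to one row,
-- so each IP counts only once per company in the outer count.
SELECT customerid, organization, COUNT(*) AS `count`
FROM (
SELECT customerid, organization, remote_addr
FROM hit_company
WHERE hitstamp >= '2014-01-12 21:23' AND hitstamp <= '2014-01-13 21:23'
GROUP BY customerid, remote_addr
) AS one_vote_per_ip
GROUP BY customerid
ORDER BY `count` DESC
LIMIT 10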

mysql/php - Determine where the data is from

I need to determine what table data is from for a news feed. The feed must say something like "Person has uploaded a video" or "Person has updated their bio". Therefore I need to determine where data came from, as different types of data are in different tables, obviously. I am hoping this can be done with SQL, but probably not, so PHP is the option. I have no idea how to do this, so I just need pointing in the right direction.
I'll briefly describe the database as I don't have time to make a diagram.
1. There is a table titled members with all basic info such as email, password and ID. The ID is the primary key.
2. All other tables have foreign keys for the ID linking to the ID in the members table.
3. Other tables include: tracks, status, pics, videos. All pretty self-explanatory from there.
I need to determine somehow what table the updated data comes from so I can then tell the user what so and so has done. Preferably I would want only one SQL statement for the whole feed so all the tables are joined and ordered by timestamp making everything much simpler for me. Hopefully I can do both but as I said really not sure.
A basic outline of the statement, will be longer but have simplified;
SELECT N.article, N.ID, A.ID, A.name,a.url, N.timestamp
FROM news N
LEFT JOIN artists A ON N.ID = A.ID
WHERE N.ID = A.ID
ORDER BY N.timestamp DESC
LIMIT 10
Members table;
CREATE TABLE `members` (
`ID` int(111) NOT NULL AUTO_INCREMENT,
`email` varchar(100) COLLATE latin1_general_ci NOT NULL,
`password` varchar(100) COLLATE latin1_general_ci NOT NULL,
`FNAME` varchar(100) COLLATE latin1_general_ci NOT NULL,
`SURNAME` varchar(100) COLLATE latin1_general_ci NOT NULL,
`timestamp` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`ID`),
UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci
Tracks table, all other tables are pretty much the same;
CREATE TABLE `tracks` (
`ID` int(11) NOT NULL,
`url` varchar(200) COLLATE latin1_general_ci NOT NULL,
`name` varchar(100) COLLATE latin1_general_ci NOT NULL,
`timestamp` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
`track_ID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`track_ID`),
UNIQUE KEY `url` (`url`),
UNIQUE KEY `track_ID` (`track_ID`),
KEY `ID` (`ID`),
CONSTRAINT `tracks_ibfk_1` FOREIGN KEY (`ID`) REFERENCES `members` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci
Previously I tried using a MySQL query for each table, putting everything into an array and echoing it out. This seemed long and tiresome and I had no luck with it. I have since deleted all that code, as it was from a week or so ago.
Please do not feel you have to go into depth with this just point me in the right direction.
ADDITION:
Here is the SQL query I have made for a trigger that was suggested. Not sure what is wrong, as I have never used triggers before. When inserting something into tracks this error comes up:
#1054 - Unknown column 'test' in 'field list'
The values in the query are just for testing at the moment.
delimiter $$
CREATE
TRIGGER tracks_event AFTER INSERT
ON tracks FOR EACH ROW
BEGIN
INSERT into events(ID, action)
VALUES (3, test);
END$$
delimiter ;
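The #1054 error arises because test is unquoted, so MySQL parses it as a column name rather than a string literal; quoting it makes the trigger valid:
delimiter $$
CREATE
TRIGGER tracks_event AFTER INSERT
ON tracks FOR EACH ROW
BEGIN
INSERT INTO events(ID, action)
VALUES (3, 'test'); -- 'test' is now a string literal; in practice this would be NEW.ID and a real message
END$$
delimiter ;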
UPDATE!
I have now created a table called events as suggested and used triggers to update it AFTER an insert in one of several tables.
Here is the query I have tried but it is wrong. The query needs to get info referenced in the events table from all the other tables and order by timestamp.
SELECT T.url, E.ID, T.ID, E.action, T.name, T.timestamp
FROM tracks T
LEFT JOIN events E ON T.ID = E.ID
WHERE T.ID = E.ID
ORDER BY T.timestamp DESC
In that query I have only included the events and tracks tables for simplicity, as the problem is still there. There will be many more tables, so the problem will worsen.
It's hard to describe the problem, but basically, because there is an ID in every table and one ID can perform several actions, an action can be shown with the wrong outcome, in this case the url.
I will explain what's in the events table and the tracks table and give the outcome to further explain.
In the events table;
4 has uploaded a track.
3 has some news.
4 has become an NBS artist.
In the tracks;
2 uploads/abc.wav Cannonballs & Stones 2012-08-20 23:59:59 1
3 uploads/19c9aa51c821952c81be46ca9b2e9056.mp3 test 2012-08-31 23:59:59 2
4 uploads/2b412dd197d464fedcecb1e244e18faf.mp3 testing 2012-08-31 00:32:56 3
4 111 111111 0000-00-00 00:00:00 111111
Outcome of query;
uploads/19c9aa51c821952c81be46ca9b2e9056.mp3 3 3 has some news. test 2012-08-31 23:59:59
uploads/2b412dd197d464fedcecb1e244e18faf.mp3 4 4 has uploaded a track. testing 2012-08-31 00:32:56
uploads/2b412dd197d464fedcecb1e244e18faf.mp3 4 4 has become an NBS artist. testing 2012-08-31 00:32:56
111 4 4 has become an NBS artist. 111111 0000-00-00 00:00:00
111 4 4 has uploaded a track. 111111 0000-00-00 00:00:00
As you can see, the query gives unwanted results. The action for each ID is applied to every url, so a url can be shown more than once and with the wrong action. Because only the tracks table is in that query, the only action I would want showing is 'has uploaded a track.'
It's hard to provide the statement you want without the full details of your schema. For example, the question refers to a news table and an artists table, but doesn't provide the schemas for those, or indicate how the statement containing those references relates to any of the other tables mentioned in the question.
Still, I think what you want can be done entirely in MySQL, without any fun PHP tricks, especially if there are common fields in each of the various tables.
But first: this might not be the answer you're really wanting, but using triggers on your various tables to update an "events feed" table is likely the best solution. i.e., when an insert or update happens on the "status" table, have a trigger on the status table that inserts into the "events feed" table the ID of the person, and their type of action. You could have a separate insert and update trigger to indicate different events for the same data type.
Then it'd be super-easy to have an events feed, because you're just selecting straight from that events feed table.
Check out the create trigger syntax.
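A sketch of that pattern (hypothetical table, trigger and column names; the status table's columns are assumed):
CREATE TABLE events_feed (
`event_id` int(11) NOT NULL AUTO_INCREMENT,
`member_id` int(11) NOT NULL,
`action` varchar(255) NOT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`event_id`)
) ENGINE=InnoDB;

-- One trigger per source table; a single-statement body needs no BEGIN/END:
CREATE TRIGGER status_feed AFTER INSERT ON status
FOR EACH ROW
INSERT INTO events_feed (member_id, action)
VALUES (NEW.ID, 'has updated their status');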
That said, I think you might have a look at the CASE and UNION keywords.
You can then construct a query that grabs data from all tables and outputs strings indicating something. You could then turn that query into a view, and use that as an "events feed" table to select directly from.
Say you have a list of members (which you do), and the various tables that contain actions from those members (i.e., tracks, status, pics, videos), which all have a key pointing back to your members table. You don't need to select from members to generate a list of activity, then; you can just UNION together the tables that have certain events.
SELECT
    events.member_id
    , events.table_id
    , events.source_table
    , events.action
    , events.when_it_happened
    , CASE
        WHEN events.source_table = "tracks" THEN "Did something with tracks"
        WHEN events.source_table = "status" THEN "Did something with status"
      END
      AS feed_description
FROM (
    (SELECT
        tracks.ID AS member_id
        , tracks.track_ID AS table_id
        , "tracks" AS source_table
        , CONCAT(tracks.url, ' ', tracks.name) AS action
        , tracks.timestamp AS when_it_happened
    FROM tracks
    ORDER BY tracks.timestamp DESC
    LIMIT 10)
    UNION
    (SELECT
        status.ID AS member_id
        , status.status_id AS table_id
        , "status" AS source_table
        , status.value AS action
        , status.timestamp AS when_it_happened
    FROM status
    ORDER BY status.timestamp DESC
    LIMIT 10)
    UNION
    ...
) events
ORDER BY events.when_it_happened DESC
I still think you'd be better off creating a feed table built by triggers, because it'll perform a lot better if you're querying for the feed more often than generating events.

Optimizing an SQL query with generated GROUP BY statement

I have this query:
SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/$secondInterval) as meh
FROM sensor_locass
LEFT JOIN sensor_data USING(sensor_id)
WHERE sensor_id = '$id'
AND project_id = '$project'
GROUP BY meh
ORDER BY timestamp ASC
The purpose is to select data for drawing a graph; I use the average over a pixel's worth of data to make the graph faithful to the data.
So far optimization has included adding indexes and switching between MyISAM and InnoDB, but with no luck.
Since the time interval changes with graph zoom and the period of data collection, I cannot make a separate column for the GROUP BY statement; the query, however, is slow. Does anyone have ideas for optimizing this query or the table to make this grouping faster? I currently have an index on each of the timestamp, sensor_id and project_id columns; the timestamp index is not used, however.
When running explain extended with the query I get the following:
1 SIMPLE sensor_locass ref sensor_id_lookup,project_id_lookup sensor_id_lookup 4 const 2 100.00 Using where; Using temporary; Using filesort
1 SIMPLE sensor_data ref idsensor_lookup idsensor_lookup 4 webstech.sensor_locass.sensor_id 66857 100.00
The sensor_data table currently contains 2.7 million data points, which is only a small fraction of the amount of data I will end up having to work with. Any helpful ideas, comments or solutions would be most welcome.
EDIT table definitions:
CREATE TABLE `sensor_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`gateway_id` int(11) NOT NULL,
`timestamp` int(10) NOT NULL,
`v1` int(11) NOT NULL,
`v2` int(11) NOT NULL,
`v3` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`temp` decimal(5,3) NOT NULL,
`oxygen` decimal(5,3) NOT NULL,
`batVol` decimal(4,3) NOT NULL,
PRIMARY KEY (`id`),
KEY `gateway_id` (`gateway_id`),
KEY `time_lookup` (`timestamp`),
KEY `idsensor_lookup` (`sensor_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2741126 DEFAULT CHARSET=latin1
CREATE TABLE `sensor_locass` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`project_id` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`start` date NOT NULL,
`end` date NOT NULL,
`multT` decimal(6,3) NOT NULL,
`conT` decimal(6,3) NOT NULL,
`multO` decimal(6,3) NOT NULL,
`conO` decimal(6,3) NOT NULL,
`xpos` decimal(4,2) NOT NULL,
`ypos` decimal(4,2) NOT NULL,
`lat` decimal(9,6) NOT NULL,
`lon` decimal(9,6) NOT NULL,
`isRef` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `sensor_id_lookup` (`sensor_id`),
KEY `project_id_lookup` (`project_id`)
) ENGINE=MyISAM AUTO_INCREMENT=238 DEFAULT CHARSET=latin1
Despite everyone's answers, changing the primary key to optimize the search on the table with 238 rows isn't gonna change anything, especially when the EXPLAIN shows a single key narrowing the search to two rows. And adding timestamp to the primary key on sensor_data won't work either since nothing is querying the timestamp, just calculating on it (unless you can restrict on the timestamp values as galymzhan suggests).
Oh, and you can drop the LEFT in your query, since matching on project_id makes it irrelevant anyway (but doesn't slow anything down). And please don't interpolate variables directly into a query if those variables come from customer input to avoid $project_id = "'; DROP TABLES; --" type sql injection exploits.
Adjusting your heap sizes could work for a while but you'll have to continue adjusting it if you need to scale.
The answer vdrmrt suggests might work, but then you'd need to populate your aggregate table with every single possible value for $secondInterval, which I'm assuming isn't very plausible given the flexibility you said you needed.
In the same vein, you could consider rrdtool, either using it directly or modifying your data in the same way that it does. What I'm referring to specifically is that it keeps the raw data for a given period of time (usually a few days), then averages the data points together over larger and larger periods of time. The end result is that you can zoom in to high detail for recent periods of time, but if you look back further, the data has effectively been lossy-compressed to averages over large periods of time (e.g. one data point per second for a day, one data point per minute for a week, one data point per hour for a month, etc.).
You could customize those averages initially, but unless you kept both the raw data and the summarized data, you wouldn't be able to go back and adjust. In particular, you could not dynamically zoom in to high detail on some older arbitrary point (such as looking at the per-second data for an hour of time occurring six months ago).
So you'll have to decide whether such restrictions are reasonable given your requirements.
If not, I would then argue that you are trying to do something in MySQL that it was not designed for. I would suggest pulling the raw data you need and taking the averages in PHP, rather than in your query.
As has already been pointed out, the main reason your query takes a long time is that the GROUP BY clause forces MySQL to crunch all the data in memory, but since it's too much data, it actually writes that data temporarily to disk (hence the Using filesort). However, you have much more flexibility in terms of how much memory you can use in PHP. Furthermore, since you are combining nearby rows, you can pull the data out row by row, combining it on the fly, and thereby never need to keep all the rows in memory in your PHP process.
You can then drop the GROUP BY and avoid the filesort. Use an ORDER BY timestamp instead, and if MySQL doesn't optimize it correctly, make sure you use FORCE INDEX FOR ORDER BY (timestamp).
I'd suggest that you find a natural primary key for your tables and switch to InnoDB. This is a guess at what your data looks like:
sensor_data:
PRIMARY KEY (sensor_id, timestamp)
sensor_locass:
PRIMARY KEY (sensor_id, project_id)
InnoDB will order all the data this way, so rows you're likely to SELECT together will be together on disk. I think your GROUP BY will always cause some trouble. If you can keep the result below the size where it switches over to a file sort (tmp_table_size and max_heap_table_size), it'll be much faster.
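A sketch of that conversion (assuming (sensor_id, timestamp) and (sensor_id, project_id) really are unique; the auto-increment id columns are kept as secondary unique keys, since InnoDB requires an auto-increment column to be covered by some index):
ALTER TABLE sensor_data
ENGINE = InnoDB,
DROP PRIMARY KEY,
ADD PRIMARY KEY (sensor_id, `timestamp`),
ADD UNIQUE KEY (id);

ALTER TABLE sensor_locass
ENGINE = InnoDB,
DROP PRIMARY KEY,
ADD PRIMARY KEY (sensor_id, project_id),
ADD UNIQUE KEY (id);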
How many rows are you generally returning? How long is it taking now?
As Joshua suggested, you should define (sensor_id, project_id) as the primary key for the sensor_locass table, because at the moment the table has 2 separate indexes, one on each column. According to the MySQL docs, SELECT will choose only one of them (the most restrictive, which finds the fewest rows), while a composite primary key allows both columns to be used for indexing the data.
However, EXPLAIN shows that MySQL examined 66857 rows of the joined table, so you should somehow optimize that too. Maybe you could query sensor data for a given interval of time, like timestamp BETWEEN begin AND end?
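For instance (a sketch; $begin and $end are hypothetical placeholders for the graph's visible time range):
SELECT ROUND(AVG(temp)*multT + conT,2) AS temp,
FLOOR(timestamp/$secondInterval) AS meh
FROM sensor_locass
JOIN sensor_data USING(sensor_id)
WHERE sensor_id = '$id'
AND project_id = '$project'
-- The range predicate lets the time_lookup index narrow the rows examined:
AND timestamp BETWEEN $begin AND $end
GROUP BY meh
ORDER BY timestamp ASC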
I agree that the first step should be to define (sensor_id, project_id) as the primary key for sensor_locass.
If that is not enough and your data is relatively static, you can create an aggregated table that you refresh, for example, every day, and then query from there.
What you still have to do is define a set of ranges for secondInterval, store them in a new table, and add that field to the primary key of your aggregated table.
The query to populate the aggregated table will be something like this:
INSERT INTO aggregated_sensor_data (sensor_id,project_id,secondInterval,timestamp,temp,meh)
SELECT
sensor_locass.sensor_id,
sensor_locass.project_id,
secondInterval,
timestamp,
ROUND(AVG(temp)*multT + conT,2) as temp,
FLOOR(timestamp/secondInterval) as meh
FROM
sensor_locass
LEFT JOIN sensor_data
USING(sensor_id)
LEFT JOIN secondIntervalRange
ON 1 = 1
WHERE
sensor_id = '$id'
AND
project_id = '$project'
GROUP BY
sensor_locass.sensor_id,
sensor_locass.project_id,
meh
ORDER BY
timestamp ASC
And you can use this query to extract the aggregated data:
SELECT
temp,
meh
FROM
aggregated_sensor_data
WHERE
sensor_id = '$id'
AND project_id = '$project'
AND secondInterval = $secondInterval
ORDER BY
timestamp ASC
If you want to use the timestamp index, you will have to tell MySQL explicitly to use it. MySQL 5.1 supports USE INDEX FOR ORDER BY / FORCE INDEX FOR ORDER BY. Have a look at it here: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
