Best approach to select most viewed posts from last n hours - php

I'm using PHP and MySQL (InnoDB engine).
As the MySQL reference manual says, a query that filters on a range of one column and orders by another cannot make full use of an index.
I have a table named News.
This table has at least 1 million records with two important columns: time_added and number_of_views.
I need to select the most viewed records from the last n hours. What is the best index for this? Is it even possible to run this kind of query very fast on a table with millions of records?
I've already done this for "last day": I can select the most viewed records from the last day by adding a new column (date_added). But if I decide to select these records from the last week, I'm in trouble again.

First, write the query:
select n.*
from news n
where time_added >= date_sub(now(), interval <n> hour)
order by number_of_views desc
limit ??;
The best index is (time_added, number_of_views). Actually, number_of_views won't be used for this query (only the range scan on time_added will be), but I would include it for other possible queries.
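As a concrete sketch, creating that index might look like this (idx_time_views is just an illustrative name):
-- time_added comes first so the range filter can use the index.
ALTER TABLE news ADD INDEX idx_time_views (time_added, number_of_views);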

First, add the following lines to my.cnf (in the [mysqld] section):
query_cache_size = 32M (or more)
query_cache_limit = 32M (or more)
query_cache_size sets the size of the cache. The other option to pay attention to, query_cache_limit, sets the maximum size of a single query result that can be placed in the cache.
You can check the status of the cache with the following:
show global status like 'Qcache%';
http://dev.mysql.com/doc/refman/5.7/en/mysql-indexes.html
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3). For more information, see http://dev.mysql.com/doc/refman/5.7/en/multiple-column-indexes.html

You need a summary table. Since 'hour' is your granularity, something like this might work:
CREATE TABLE HourlyViews (
    the_hour DATETIME NOT NULL,
    ct SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY(the_hour)
) ENGINE=InnoDB;
It might need another column (and add it to the PK) if there is some breakdown of the items you are counting. And you might want some other things SUM'd or COUNT'd in this table.
Build and maintain this table incrementally. That is, every hour, add another row to the table. (Or you could keep it updated with INSERT .. ON DUPLICATE KEY UPDATE ...)
More on Summary Tables
Then change the query to use that table; it will be a lot faster.
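A minimal sketch of how that might look, assuming the summary table gains a news_id column with PRIMARY KEY(news_id, the_hour) as suggested above, and assuming raw views are logged in a hypothetical view_log (news_id, viewed_at) table that can be counted per hour:

-- Hourly maintenance: fold the last hour's views into the summary.
INSERT INTO HourlyViews (news_id, the_hour, ct)
SELECT news_id,
       DATE_FORMAT(viewed_at, '%Y-%m-%d %H:00:00') AS the_hour,
       COUNT(*)
FROM view_log
WHERE viewed_at >= NOW() - INTERVAL 1 HOUR
GROUP BY news_id, the_hour
ON DUPLICATE KEY UPDATE ct = ct + VALUES(ct);

-- "Most viewed in the last n hours" then scans only the small summary table.
SELECT news_id, SUM(ct) AS views
FROM HourlyViews
WHERE the_hour >= NOW() - INTERVAL 48 HOUR  -- n = 48, for example
GROUP BY news_id
ORDER BY views DESC
LIMIT 10;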

Related

MySQL Select MAX COUNT in DATE RANGE

I have looked at a lot of examples, but I couldn't find an answer. What I need to do is check whether there is free space for new orders. At any one time I can have a maximum of 5 customers. Order duration is not limited; customers just select a date-time range in a picker.
For example my DB records are:
id  start             end
1   2017/06/10 10:00  2017/06/15 08:00
2   2017/06/11 10:00  2017/06/16 08:00
3   2017/06/12 10:00  2017/06/17 08:00
4   2017/06/13 10:00  2017/06/18 08:00
5   2017/06/14 10:00  2017/06/19 08:00
A customer wants to reserve from 2017/06/11 08:00 until 2017/06/15 12:00, but I can't let him, because this period overlaps all 5 existing records in my DB (and 5 customers is my maximum). How can I check this in a MySQL select query?
Try:
SELECT COUNT(*) < 5 AS vacant
FROM orders
WHERE input_start < `end` AND input_end > `start`;
where input_start and input_end are your search values.
You can count the number of matching records using logic like this:
select count(*)
from t
where $start < end and $end > start;
This counts the number of records that overlap with ($start, $end). You can put the comparison to "5" in your application (or just use select count(*) < 5).
Note: You should be passing in the values as parameters.
Try like this:
select id from table where id not in
(select id from table where yourstartdate between start and end or yourenddate between start and end);
Be careful, though: the BETWEEN checks in the subquery miss reservations that completely contain the search period, so this query can give the wrong result.
There are many, many solutions. You've tagged this as PHP, so there is scope to do this in the application logic as well as in the storage layer.
You could separate the 5 slots across attributes (which are constrained by definition) rather than across rows (which are not constrained), then let the application decide whether the transaction is viable (although you could map this to a single slot-per-record representation using views and triggers).
Alternatively, as others have suggested, you could count the number of slots currently booked to determine whether there is availability (but this would need to be done at least twice, and you need to avoid race conditions; see the sketch after this answer). For preference this would be implemented as a before-insert trigger.
An alternative approach, or an enhancement to this, would be to add a primary/unique key to the table containing the reservations, including an enumerated column (with 5 possible values and a non-null constraint), making it impossible to inject more than 5 records for one slot.
But this is all about preventing a state which is considered inconsistent. Your first line of defence is presenting the user with options which are unlikely to cause such conflicts. The obvious case is not to present slots which already have 5 bookings.
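As a rough illustration of the race-free counting approach described above, here is a minimal sketch using a transaction with SELECT ... FOR UPDATE, assuming the orders table from the question and user-supplied @input_start / @input_end values set beforehand:

START TRANSACTION;

-- Lock the overlapping rows so a concurrent booking cannot slip in
-- between the check and the insert.
SELECT COUNT(*) INTO @overlapping
FROM orders
WHERE `start` < @input_end AND `end` > @input_start
FOR UPDATE;

-- Insert only if fewer than 5 existing orders overlap the requested slot.
INSERT INTO orders (`start`, `end`)
SELECT @input_start, @input_end FROM dual
WHERE @overlapping < 5;

COMMIT;

Note that FOR UPDATE only locks rows that already exist, so this is a sketch rather than a bulletproof solution; two simultaneous inserts into an empty range still need stricter isolation or an explicit table lock.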

JOIN or 2 queries - 1 large table, 1 small, hardware limited

I have a page in which there is a <select> menu, which contains all of the values from a small table (229 rows), such that <option value='KEY'>VALUE</option>.
This select menu is a filter for a query which runs on a large table (3.5M rows).
In the large table is a foreign key which references KEY from small table.
However, in the results of the large-table query, I also need to display the corresponding VALUE from the small table.
I could quite easily do an INNER JOIN to retrieve the results, OR I could do a separate 'pre'-query to my smaller table, fetch its values into an array, and then let the application look up the VALUE from the small-table results.
The application is written in PHP.
Hardware resources ARE an issue (I cannot upgrade to a higher instance right now, boss-constrained) - I am running this on a t2.micro RDS instance on Amazon Web Services.
I have added both single and covering indexes on the columns in the WHERE & HAVING clauses, and my server is reporting that I have 46 MB of RAM available.
Given the above, I know that JOINs can be expensive, especially on big tables. Does it just make sense here to do 2 queries and let the application handle some of the work, until I can negotiate better resources?
EDIT:
No join: 6.9 sec
SELECT nationality_id, COUNT(DISTINCT(txn_id)) as numtrans,
SUM(sales) as sales, SUM(units) as units, YrQtr
FROM 1_txns
GROUP BY nationality_id;
EXPLAIN
'1', 'SIMPLE', '1_txns', 'index', 'covering,nat', 'nat', '5', NULL, '3141206', NULL
With join: 59.03 sec
SELECT 4_nationality.nationality, COUNT(DISTINCT(txn_id)) as numtrans,
SUM(sales) as sales, SUM(units) as units, YrQtr
FROM 1_txns INNER JOIN 4_nationality USING (nationality_id)
GROUP BY nationality_id
HAVING YrQtr LIKE :period;
EXPLAIN
'1', 'SIMPLE', '4_nationality', 'ALL', 'PRIMARY', NULL, NULL, NULL, '229', 'Using temporary; Using filesort'
'1', 'SIMPLE', '1_txns', 'ref', 'covering,nat', 'nat', '5', 'reports.4_nationality.nationality_id', '7932', NULL
Schema is
Table 1_txns (txn_id, nationality_id, yrqtr, sales, units)
Table 4_nationality (nationality_id, nationality)
I have separate indexes on each of nationality_id, txn_id, and yrqtr in my large transactions table, and just a primary key index on my small table.
Something strange also: the query WITHOUT the join is missing a row from its results!
If your lookup "menu" table is only the 229 rows as stated, it has a unique key, and it has an index on (key, value), the join cost would be negligible... especially if you're only querying the results based on a single key anyhow.
The bigger question to me would be your table of 3.5 million records. Across 229 "menu" items, it would be returning an average of over 15k records each time. And I am sure that not every category is evenly balanced... some could have a few hundred or a few thousand entries, others could have 30k+ entries. Is there some other criterion that would allow smaller subsets to be returned? Obviously there is not enough info to quantify that.
Now, after seeing your revised post while entering this, I see you are trying to get aggregations. The table is otherwise fixed historical data, so I would suggest a summary table built on a per-Nationality/YrQtr basis. This way, you can query that table directly if the period is PRIOR to the current period in question; if it is the current period, sum the aggregates from production. Again, since transactions won't change historically, neither will their counts, and you will get an immediate response from the pre-summary table.
Feedback
As for how and when to implement a summary table: I would create the table with the respective columns you need... Nationality, Period (YrQtr), and the respective counts for distinct transactions, etc.
I would then pre-aggregate once over all your existing data for everything UP TO but not including the current period. Now you have your baseline established in the summary.
Then, add a trigger to your transaction table on insert that processes something like the following (NOTE: this is not an actual trigger, just the context of what it should do; a concrete sketch follows the pseudocode):
update summaryTable
   set numTrans = numTrans + 1,
       TotSales = TotSales + NEWENTRY.Sales,
       TotUnits = TotUnits + NEWENTRY.Units
 where Nationality = NEWENTRY.Nationality
   and YrQtr = NEWENTRY.YrQtr

if # records affected by the update = 0
    insert into SummaryTable
        (Nationality, YrQtr, NumTrans, TotSales, TotUnits)
    values
        (NEWENTRY.Nationality, NEWENTRY.YrQtr, 1, NEWENTRY.Sales, NEWENTRY.Units)
Now, your aggregates will ALWAYS be in sync in the summary table after EVERY record inserted into the transaction table, and you can ALWAYS query the summary table instead of the full transaction table. If there is no activity for a given Nationality/YrQtr, no record will exist.
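For MySQL specifically, a minimal sketch of such a trigger (the trigger name is illustrative, and it assumes SummaryTable has a unique key on (Nationality, YrQtr)); INSERT ... ON DUPLICATE KEY UPDATE collapses the update-then-insert logic above into a single statement:

DELIMITER $$
CREATE TRIGGER trg_txns_summary
AFTER INSERT ON 1_txns
FOR EACH ROW
BEGIN
    INSERT INTO SummaryTable (Nationality, YrQtr, NumTrans, TotSales, TotUnits)
    VALUES (NEW.nationality_id, NEW.yrqtr, 1, NEW.sales, NEW.units)
    ON DUPLICATE KEY UPDATE
        NumTrans = NumTrans + 1,
        TotSales = TotSales + NEW.sales,
        TotUnits = TotUnits + NEW.units;
END$$
DELIMITER ;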
First, move the HAVING to WHERE so that the rest of the query has less to do. Second, delay the lookup of nationality until after the GROUP BY:
SELECT
    ( SELECT nationality
          FROM 4_nationality
          WHERE nationality_id = t.nationality_id
    ) AS nationality,
    COUNT(DISTINCT(txn_id)) as numtrans,
    SUM(sales) as sales,
    SUM(units) as units,
    YrQtr
FROM 1_txns AS t
WHERE YrQtr LIKE :period
GROUP BY nationality_id;
If possible, avoid wild cards and simply do YrQtr = :period. That would allow INDEX(YrQtr, nationality_id) for even more performance.
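A sketch of that index, using the schema above (the index name is illustrative):

ALTER TABLE 1_txns ADD INDEX yrqtr_nat (YrQtr, nationality_id);

With an equality test on YrQtr, both the WHERE filter and the GROUP BY on nationality_id can then be satisfied from this one index.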

SQL INSERT INTO SELECT and Return the SELECT data to Create Row View Counts

So I'm creating a system that will be pulling 50-150 records at a time from a table and displaying them to the user, and I'm trying to keep a view count for each record.
I figured the most efficient way would be to create a MEMORY table into which I INSERT the IDs of the pulled rows, and then have a cron job that runs regularly to aggregate the view counts by ID, clear out the memory table, and update the original table with the latest view counts. This avoids constantly updating the table that will likely be getting accessed the most, so I'm not locking 150 rows at a time with each query (or the whole table, if I'm using MyISAM).
Basically, the method explained here.
However, I would of course like to do this at the same time as I pull the records' information for viewing, and I'd like to avoid running a second, separate query just to get the same set of data for its counts.
Is there any way to SELECT a dataset, return that dataset, and simultaneously insert a single column from that dataset into another table?
It looks like PostgreSQL might have something similar to what I want with the RETURNING keyword, but I'm using MySQL.
First of all, I would not add a counter column to the Main table. I would create a separate Audit table that holds the ID of the item from the Main table plus at least a timestamp of when that ID was requested. In essence, the Audit table stores a history of requests. With this approach you can easily generate much more interesting reports: you can always calculate grand totals per item, and you can also calculate summaries by day, week, month, etc., per item or across all items. Depending on the volume of data, you can periodically delete Audit entries older than some threshold (a month, a year, etc.).
Also, you can easily store more information in the Audit table as needed, for example a user ID, to calculate stats per user.
To populate the Audit table "automatically" I would create a stored procedure. The client code would call this stored procedure instead of performing the original SELECT. The stored procedure returns exactly the same result as the original SELECT does, but also adds the necessary details to the Audit table, transparently to the client code.
So, let's assume that the Audit table looks like this:
CREATE TABLE AuditTable
(
    ID int
        IDENTITY       -- SQL Server
        SERIAL         -- Postgres
        AUTO_INCREMENT -- MySQL
        NOT NULL,
    ItemID int NOT NULL,
    RequestDateTime datetime NOT NULL
)
and your main SELECT looks like this:
SELECT ItemID, Col1, Col2, ...
FROM MainTable
WHERE <complex criteria>
To perform both the INSERT and SELECT in one statement, in SQL Server I'd use the OUTPUT clause, and in Postgres the RETURNING clause. In MySQL - ??? I don't think it has anything like this. So, the MySQL procedure would have several separate statements.
MySQL
First do your SELECT and insert the results into a temporary (possibly MEMORY) table. Then copy the item IDs from the temporary table into the Audit table. Then SELECT from the temporary table to return the result to the client.
CREATE TEMPORARY TABLE TempTable
(
    ItemID int NOT NULL,
    Col1 ...,
    Col2 ...,
    ...
)
ENGINE = MEMORY
SELECT ItemID, Col1, Col2, ...
FROM MainTable
WHERE <complex criteria>;

INSERT INTO AuditTable (ItemID, RequestDateTime)
SELECT ItemID, NOW()
FROM TempTable;

SELECT ItemID, Col1, Col2, ...
FROM TempTable
ORDER BY ...;
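A minimal sketch wrapping those three statements into the stored procedure described earlier; the procedure name, the Col1/Col2 columns, and the WHERE condition are illustrative stand-ins for the real query:

DELIMITER $$
CREATE PROCEDURE FetchAndAudit()
BEGIN
    DROP TEMPORARY TABLE IF EXISTS TempTable;

    -- 1. Run the real SELECT once, materializing the result set.
    CREATE TEMPORARY TABLE TempTable ENGINE = MEMORY
        SELECT ItemID, Col1, Col2
        FROM MainTable
        WHERE Col1 > 0;

    -- 2. Record one audit row per returned item.
    INSERT INTO AuditTable (ItemID, RequestDateTime)
    SELECT ItemID, NOW() FROM TempTable;

    -- 3. Return the materialized result to the client.
    SELECT ItemID, Col1, Col2 FROM TempTable ORDER BY ItemID;

    DROP TEMPORARY TABLE TempTable;
END$$
DELIMITER ;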
SQL Server (just to tease you - this single statement does both the INSERT and the SELECT):
MERGE INTO AuditTable
USING
(
    SELECT ItemID, Col1, Col2, ...
    FROM MainTable
    WHERE <complex criteria>
) AS Src
ON 1 = 0
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ItemID, RequestDateTime)
    VALUES (Src.ItemID, GETDATE())
OUTPUT Src.ItemID, Src.Col1, Src.Col2, ...;
You can leave the Audit table as it is, or you can set up cron to summarize it periodically. It really depends on the volume of data. In our system we store individual rows for a week; we also summarize stats per hour and keep those for 6 weeks, and we keep a daily summary for 18 months. But, importantly, all these summaries are separate Audit tables; we don't keep auditing information in the Main table, so we never need to update it.
Joe Celko explained it very well in SQL Style Habits: Attack of the Skeuomorphs:
Now go to any SQL Forum text search the postings. You will find thousands of postings with DDL that include columns named createdby, createddate, modifiedby and modifieddate with that particular meta data on the end of the row declaration. It is the old mag tape header label written in a new language! Deja Vu!
The header records appeared only once on a tape. But these meta data values appear over and over on every row in the table. One of the main reasons for using databases (not just SQL) was to remove redundancy from the data; this just adds more redundancy. But now think about what happens to the audit trail when a row is deleted? What happens to the audit trail when a row is updated? The trail is destroyed. The audit data should be separated from the schema. Would you put the log file on the same disk drive as the database? Would an accountant let the same person approve and receive a payment?
You're kind of asking whether MySQL supports a SELECT trigger. It doesn't. You'll need to do this as two queries; however, you can stick those inside a stored procedure - then you can pass in the range you're fetching and have it both return the results AND do the INSERT into the other table.
Updated answer with skeleton example for stored procedure:
DELIMITER $$
CREATE PROCEDURE `FetchRows`(IN StartID INT, IN EndID INT)
BEGIN
    UPDATE Blah SET ViewCount = ViewCount + 1 WHERE id >= StartID AND id <= EndID;
    # ^ Assumes counts are stored in the same table. If they're in a separate table,
    #   do an INSERT INTO ... ON DUPLICATE KEY UPDATE ViewCount = ViewCount+1 instead.
    SELECT * FROM Blah WHERE id >= StartID AND id <= EndID;
END$$
DELIMITER ;
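The client code would then make a single call, for example to fetch (and count views for) rows 1 through 150:

CALL FetchRows(1, 150);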

MySQL count slow when filtering by category

Why is my query fast when I run:
select count(*) as aggregate from `news` where `news`.`deleted_at` is null and `status` = '1'
but slow when I run:
select count(*) as aggregate from `news` where `news`.`deleted_at` is null and `status` = '1' and `newscategory_id` = '17'
Here is my news table structure image; have a look here.
Sorry, because my reputation is less than 8 I can't attach the image directly.
Try adding a composite index on the three columns you are using in your select:
ALTER TABLE news ADD INDEX comp_index (deleted_at, status, newscategory_id);
and check it again.
Also use EXPLAIN to see whether any of the indexes you have are used (an example for this query follows the quote below).
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.
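For the query from the question, that check might look like this; with the composite index above in place, the key column of the EXPLAIN output should show comp_index rather than NULL:

EXPLAIN select count(*) as aggregate from `news`
where `news`.`deleted_at` is null and `status` = '1' and `newscategory_id` = '17';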
Try adding these indexes in your DB:
CREATE INDEX newCategory_indx ON news (newscategory_id)
CREATE INDEX status_indx ON news (status)
This will give you quick results compared to the previous (non-indexed column) ones.
To learn more about indexes and their importance, visit here.

Does MySQL load the whole table into cache every time?

Let's say I have a table with, say, 1 million rows, with the first column being a primary key.
Then, if I run the following:
SELECT * FROM table WHERE id='tomato117' LIMIT 1
Does the table ALL get put into the cache (thereby causing the query to slow down as more and more rows get added), or does the number of rows in the table not matter, since the query uses the primary key?
edit: (added LIMIT 1)
If id is defined as the primary key, there is only one record with the value tomato117, so the LIMIT is not useful.
Using SELECT * may force MySQL to read from disk, because it is unlikely that all columns are stored in the index (MySQL cannot fetch everything from the index alone). In theory, this will affect performance.
However, your SQL matches the query cache conditions, so MySQL will store the result in the query cache for subsequent use.
If your query cache size is huge, MySQL will keep storing SQL results in the query cache until memory is full.
This comes with a cost: if there is an update on your table, query cache invalidation will be harder for MySQL.
http://www.mysqlperformanceblog.com/2007/03/23/beware-large-query_cache-sizes/
http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/
Nothing of the sort.
It will only fetch the row you selected, and perhaps a few other blocks. They will remain in cache until something pushes them out.
By cache, I refer to the InnoDB buffer pool, not the query cache, which should probably be off anyway.
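If you do want the query cache off, a minimal my.cnf sketch (note that the query cache was removed entirely in MySQL 8.0):

[mysqld]
query_cache_type = 0
query_cache_size = 0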
SELECT * FROM table WHERE id = 'tomato117' LIMIT 1
With LIMIT 1, the scan stops as soon as tomato117 is found; if you don't set LIMIT 1, it will search until the end of the table. tomato117 can be the second row, and without the limit it will still scan the remaining 1,000,000 rows looking for another tomato117.
http://forge.mysql.com/wiki/Top10SQLPerformanceTips
Showing rows 0 - 0 (1 total, Query took 0.0159 sec)
SELECT *
FROM `forum_posts`
WHERE pid = 643154
LIMIT 0, 30

Showing rows 0 - 0 (1 total, Query took 0.0003 sec)
SELECT *
FROM `forum_posts`
WHERE pid = 643154
LIMIT 1
Table is about 1GB, 600 000+ rows.
If you add the word EXPLAIN before the word SELECT, it will show you a table with a summary of how many rows it is reading, instead of the normal results.
If your table has an index on the id column (including if it's set as the primary key), the engine will be able to jump straight to the exact row (or rows, for a non-unique index) and only read the minimal amount of data. If there's no index, it will need to read the whole table.
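For the query from the question, that check might look like this; with a primary key on id, the EXPLAIN output should report type const and rows 1 (the backticks are needed because table is a reserved word):

EXPLAIN SELECT * FROM `table` WHERE id = 'tomato117' LIMIT 1;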
