Algorithms for processing visit logs

Suppose I have a MySQL table that looks like the following, where I keep track of when (Date) a user (User.id) read an article on my website (Article.id):
------------------------------------------
Article_Impressions
------------------------------------------
date | user_id | article_id
--------------------+---------+-----------
2013-04-02 15:33:23 | 815 | 2342
2013-04-02 15:38:21 | 815 | 108
2013-04-02 15:39:33 | 161 | 4815
...
I'm trying to determine how many sessions I had, as well as the average session duration per user, on a given day. A session ends when no further article is read within 30 minutes of the previous one.
Question
How can I efficiently determine how many sessions I had on a given day? I'm using PHP and MySQL.
My first idea is to query all the data for a given day, sorted by user and then by date. Then I iterate through each user's impressions, check whether each impression was within 30 minutes of the previous one, and tally the total number of sessions each user had that day.
Since we have around 2 million impressions a day on our site, I'm trying to optimize this report generator.
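The first idea described above (sort by user and date, then start a new session whenever the gap exceeds 30 minutes) can be sketched as follows. This is a minimal illustration in Python rather than PHP; the function name `count_sessions` and the in-memory row list are assumptions, and the fourth sample row is added only to show a gap longer than 30 minutes. The same loop ports directly to a PHP foreach over an ordered result set.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # a longer gap starts a new session

def count_sessions(rows):
    """Count sessions per user.

    `rows` is an iterable of (user_id, timestamp) pairs already sorted by
    user_id and then by timestamp, as ORDER BY user_id, date returns them.
    """
    sessions = {}      # user_id -> number of sessions
    prev_user = None
    prev_time = None
    for user_id, ts in rows:
        new_user = user_id != prev_user
        # A new user, or a gap of more than 30 minutes, opens a new session.
        if new_user or ts - prev_time > SESSION_GAP:
            sessions[user_id] = sessions.get(user_id, 0) + 1
        prev_user, prev_time = user_id, ts
    return sessions

fmt = "%Y-%m-%d %H:%M:%S"
rows = [
    (161, datetime.strptime("2013-04-02 15:39:33", fmt)),
    (815, datetime.strptime("2013-04-02 15:33:23", fmt)),
    (815, datetime.strptime("2013-04-02 15:38:21", fmt)),
    (815, datetime.strptime("2013-04-02 16:38:21", fmt)),  # illustrative extra row
]
per_user = count_sessions(rows)
total = sum(per_user.values())
print(per_user, total)
```

Because only the previous row is remembered, the loop needs constant memory regardless of how many impressions the day has.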

Try this query
Query 1:
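-- Note: the @variables carry state from row to row. This relies on MySQL
-- evaluating the select list left to right, which is not guaranteed.
select
  -- start a new session when the user changes or the gap exceeds 1800 s (30 min)
  @sessionId:=if(@prevUser=user_id AND diff <= 1800, @sessionId, @sessionId+1) as sessionId,
  @prevUser:=user_id AS user_id,
  article_id,
  date,
  diff
from
  (select @sessionId:=0, @prevUser:=0) b
join
  (select
     -- seconds since the same user's previous impression (0 for the first one)
     TIME_TO_SEC(if(@prevU=user_id, TIMEDIFF(date, @prevD), '00:00')) as diff,
     @prevU:=user_id as user_id,
     @prevD:=date as date,
     article_id
   from
     tbl
   join
     (select @prevD:=NULL, @prevU:=0) a
   order by
     user_id,
     date) a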
[Results]:
| SESSIONID | USER_ID | ARTICLE_ID | DATE                | DIFF |
-----------------------------------------------------------------
|         1 |     161 |       4815 | 2013-04-02 15:39:33 |    0 |
|         2 |     815 |       2342 | 2013-04-02 15:33:23 |    0 |
|         2 |     815 |        108 | 2013-04-02 15:38:21 |  298 |
|         3 |     815 |        108 | 2013-04-02 16:38:21 | 3600 |
This query assigns a new session ID for every new user, and also for the same user when the next article is read more than 30 minutes after the previous one, as your question requires. The diff column holds the number of seconds between two consecutive impressions by the same user, which is what drives the sessionId counter. From this result it is straightforward to compute the average time per user and the total time per session.
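As a rough illustration of that last step, the sketch below takes rows already tagged with a session ID and derives the duration of each session and the average session duration per user. It is written in Python for brevity; the function name `session_stats` is hypothetical, and it assumes a session's duration is the time between its first and last impression, so a single-impression session counts as 0 seconds.

```python
from collections import defaultdict
from datetime import datetime

def session_stats(rows):
    """rows: iterable of (session_id, user_id, timestamp).

    Returns ({session_id: seconds}, {user_id: average seconds per session}).
    """
    first, last, owner = {}, {}, {}
    for session_id, user_id, ts in rows:
        owner[session_id] = user_id
        if session_id not in first or ts < first[session_id]:
            first[session_id] = ts
        if session_id not in last or ts > last[session_id]:
            last[session_id] = ts
    # Duration = last impression minus first impression of the session.
    durations = {s: (last[s] - first[s]).total_seconds() for s in first}
    per_user = defaultdict(list)
    for s, secs in durations.items():
        per_user[owner[s]].append(secs)
    averages = {u: sum(v) / len(v) for u, v in per_user.items()}
    return durations, averages

fmt = "%Y-%m-%d %H:%M:%S"
tagged = [
    (1, 161, datetime.strptime("2013-04-02 15:39:33", fmt)),
    (2, 815, datetime.strptime("2013-04-02 15:33:23", fmt)),
    (2, 815, datetime.strptime("2013-04-02 15:38:21", fmt)),
    (3, 815, datetime.strptime("2013-04-02 16:38:21", fmt)),
]
durations, averages = session_stats(tagged)
print(durations, averages)
```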
Hope this helps you...

If the concept of the user "session" is important to your analytics, then I would start logging data in your table to make querying of session-related data not such a painful process. A simple approach would be to log your PHP session ID. If your PHP session id is set to have the same 30 minute expiry, and you log the PHP session ID to this table then you would basically have exactly what you are looking for.
Of course that won't help you with your existing records. I would probably go ahead and create the session field and then back-populate it with randomly generated "session" id's. I wouldn't look for a fully SQL solution for this, as it may not do what you want in terms of handling edge cases (sessions spanning across days, etc.). I would write a script to perform this backfill, which would contain all the logic you need.
My general approach would be to SELECT all the records like this:
SELECT user_id, date /* plus any other fields like unique id that you would need for insert */
FROM Article_Impressions
WHERE session_id IS NULL
ORDER BY user_id ASC, date ASC
Note: make sure you have an index on the user_id and date fields, ideally a composite index on (user_id, date) so that the ORDER BY can be served from it.
I would then loop through the result set, collecting each user's rows in a temporary array, and walk that array in date order, assigning a randomly generated session ID that changes whenever the gap between consecutive dates exceeds 30 minutes. When the user_id changes, I would issue the updates that set session_id for the previous user's rows, reset the temporary array, and continue with the next user.
Keeping the working array this small matters: with the number of records you are talking about, you are unlikely to be able to read the entire result set into memory at once.
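The backfill loop described above might look roughly like this. It is a sketch in Python, not a finished script: `assign_session_ids`, the `row_id` column (some unique key per impression, which the table in the question does not show) and the `uuid`-based ID generator are all assumptions. Instead of a temporary array it streams the rows with `itertools.groupby`, which keeps memory use flat in the same spirit.

```python
import uuid
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

SESSION_GAP = timedelta(minutes=30)

def assign_session_ids(rows, new_id=lambda: uuid.uuid4().hex):
    """Yield (row_id, session_id) for rows sorted by user_id, then date.

    rows: iterable of (row_id, user_id, timestamp) tuples.
    Each yielded pair corresponds to one UPDATE ... SET session_id = ?.
    """
    # groupby consumes the sorted rows one user at a time.
    for _user, user_rows in groupby(rows, key=itemgetter(1)):
        session_id = None
        prev_ts = None
        for row_id, _uid, ts in user_rows:
            # First row of a user, or a gap over 30 minutes: new session ID.
            if prev_ts is None or ts - prev_ts > SESSION_GAP:
                session_id = new_id()
            prev_ts = ts
            yield row_id, session_id

t = datetime(2013, 4, 2, 15, 0, 0)
rows = [
    (1, 161, t),
    (2, 815, t),
    (3, 815, t + timedelta(minutes=5)),
    (4, 815, t + timedelta(minutes=65)),
]
counter = iter(range(1, 100))  # deterministic IDs, for the demo only
assigned = list(assign_session_ids(rows, new_id=lambda: next(counter)))
print(assigned)
```

Because no date boundary is involved, a session that spans midnight keeps a single ID, which is one of the edge cases mentioned above.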
Once your data is populated, the query becomes trivial:
Unique sessions for each day:
SELECT DATE(date) as `day`, COUNT(DISTINCT session_id) AS `unique_sessions`
FROM Article_Impressions
GROUP BY `day`
ORDER BY `day` DESC /* or ASC depending on how you want to view it */
Average sessions per day:
SELECT AVG(sessions_per_day.`unique_sessions`) AS `average_sessions_per_day`
FROM
(
  SELECT DATE(date) as `day`, COUNT(DISTINCT session_id) AS `unique_sessions`
  FROM Article_Impressions
  GROUP BY `day`
) AS sessions_per_day
/* no outer GROUP BY: grouping by day again would return each day's count instead of one overall average */
Note: you need an index on the new session_id field.

Related

How to select all entries of a user from a database?

I am trying to make a "top purchaser" module on my store and I am a bit confused about the MySQL query.
I have a table with all transactions and I need to select the person (which could have one or many transactions) with the highest amount of money spent in the past month.
What I have:
name | money spent
------------------
john | 50
mike | 12
john | 10
jane | 504
carl | 99
jane | 12
jane | 1
What I want to see:
With a query, I need to see:
name | money spent last month
-----------------------------
jane | 517
carl | 99
john | 60
mike | 12
How do I do that?
I do not really seem to find many good solutions since my MySQL query skills are quite basic. I thought of making a table in which money is added to the user when he buys something.

That's a simple aggregate query:
SELECT t.name, SUM(t.moneyspent) AS money_spent_last_month
FROM mytable t
GROUP BY t.name
ORDER BY money_spent_last_month DESC
LIMIT 1
The query sums the money spent per customer name. The results are ordered by descending total, and LIMIT 1 keeps only the top spender; drop the LIMIT to get the full ranking shown in the question.
To restrict the data to last month, you need a column in the table that records the transaction date, say transaction_date. You can then add a WHERE clause that covers the previous calendar month as a half-open range:
SELECT t.name, SUM(t.moneyspent) AS money_spent_last_month
FROM mytable t
WHERE
  -- first day of the previous month (inclusive)
  t.transaction_date >=
    DATE_ADD(LAST_DAY(DATE_SUB(NOW(), INTERVAL 2 MONTH)), INTERVAL 1 DAY)
  -- first day of the current month (exclusive)
  AND t.transaction_date <
    DATE_ADD(LAST_DAY(DATE_SUB(NOW(), INTERVAL 1 MONTH)), INTERVAL 1 DAY)
GROUP BY t.name
ORDER BY money_spent_last_month DESC
LIMIT 1
Comparing the column directly against computed bounds is usually more efficient than formatting dates as strings with DATE_FORMAT and comparing the results, because the bare column can use an index.
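If you prefer to compute the date bounds in application code and pass them to the query as parameters, the logic could look like the sketch below. It is written in Python for illustration; `previous_month_range`, `top_spenders` and the transaction dates are hypothetical, while the names and amounts are the ones from the question. It treats "last month" as the previous calendar month, expressed as a half-open range.

```python
from collections import defaultdict
from datetime import date

def previous_month_range(today):
    """Return (first day of previous month, first day of current month)."""
    first_of_this_month = today.replace(day=1)
    if first_of_this_month.month == 1:
        start = date(first_of_this_month.year - 1, 12, 1)
    else:
        start = date(first_of_this_month.year, first_of_this_month.month - 1, 1)
    return start, first_of_this_month

def top_spenders(transactions, today):
    """transactions: iterable of (name, amount, day). Highest total first."""
    start, end = previous_month_range(today)
    totals = defaultdict(int)
    for name, amount, day in transactions:
        if start <= day < end:  # start inclusive, end exclusive
            totals[name] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

today = date(2024, 3, 15)
feb = date(2024, 2, 10)  # hypothetical transaction date
transactions = [
    ("john", 50, feb), ("mike", 12, feb), ("john", 10, feb),
    ("jane", 504, feb), ("carl", 99, feb), ("jane", 12, feb),
    ("jane", 1, feb),
    ("mike", 999, date(2024, 3, 1)),  # current month, must be ignored
]
ranking = top_spenders(transactions, today)
print(ranking)
```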

Create API endpoint for fetching dynamic data based on time

I have a scraper that periodically scrapes articles from news sites and stores them in a MySQL database.
The scraper processes the oldest articles first and then moves on to more recent ones.
For example, an article written on the 1st of January would be scraped first and given ID 1, and an article from the 2nd of January would get ID 2.
So recent articles generally have a higher ID than older ones.
There are multiple scrapers running at the same time.
Now I need an endpoint that I can query by article timestamp, with a limit of 10 articles per fetch.
The problem arises when, for example, 20 articles share the timestamp 1499241705. If I query the endpoint with 1499241705 and the condition is >= 1499241705, I always get the same 10 articles; changing the condition to > means I skip articles 11-20. Adding another WHERE condition on id does not work either, because the scrapers run concurrently and articles are not always inserted in date order.
Is there a way to query this endpoint so that I always get consistent data, with the latest articles first and then the older ones?
EDIT:
+-----------------------+
| id | unix_timestamp |
+-----------------------+
| 1 | 1000 |
| 2 | 1001 |
| 3 | 1002 |
| 4 | 1003 |
| 11 | 1000 |
| 12 | 1001 |
| 13 | 1002 |
| 14 | 1003 |
+-----------------------+
The last timestamp and ID is being sent through the WHERE clause.
E.g.
$this->db->where('unix_timestamp <=', $timestamp);
$this->db->where('id <', $offset);
$this->db->order_by('unix_timestamp ', 'DESC');
$this->db->order_by('id', 'DESC');
When querying with a timestamp of 1003, ids 14 and 4 are fetched. On the next call, id 4 becomes the offset, so id 13 is not fetched and only id 3 is returned. Data ends up missing.

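Keep track of two parts of where you left off: the timestamp and the id.
WHERE timestamp <= $ts_leftoff
AND ( timestamp < $ts_leftoff
OR id <= $id_leftoff )
ORDER BY timestamp DESC, id DESC
Here ($ts_leftoff, $id_leftoff) identifies the first row not yet delivered; if you pass the last row already delivered instead, use id < $id_leftoff.
So, assuming id is unique, it does not matter if lots of rows have the same timestamp; the order is fully deterministic.
There is a row-constructor syntax for this, but unfortunately it is not well optimized:
WHERE (timestamp, id) <= ($ts_leftoff, $id_leftoff)
So I advise against using it.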
More on the concept of "left off": http://mysql.rjweb.org/doc.php/pagination
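To see that the two-part condition visits every row exactly once, here is a small in-memory simulation using the sample data from the question. It is a Python sketch with a hypothetical `fetch_page` helper, not the endpoint itself; it follows the "left off" idea by peeking at one extra row, whose timestamp and id become the starting point of the next page.

```python
def fetch_page(rows, left_off, page_size):
    """Return (page, next_left_off) for rows given as (id, timestamp) tuples.

    left_off is None for the first page, otherwise the (timestamp, id) of
    the first row that has not been delivered yet.
    """
    # Newest first; id breaks ties so the order is fully deterministic.
    ordered = sorted(rows, key=lambda r: (r[1], r[0]), reverse=True)
    if left_off is not None:
        ts0, id0 = left_off
        # Same predicate as the WHERE clause:
        # timestamp < ts0 OR (timestamp = ts0 AND id <= id0)
        ordered = [r for r in ordered
                   if r[1] < ts0 or (r[1] == ts0 and r[0] <= id0)]
    page = ordered[:page_size]
    # Peek at one extra row: it becomes the start of the next page.
    if len(ordered) > page_size:
        nxt = ordered[page_size]
        return page, (nxt[1], nxt[0])
    return page, None

rows = [(1, 1000), (2, 1001), (3, 1002), (4, 1003),
        (11, 1000), (12, 1001), (13, 1002), (14, 1003)]

seen = []
left_off = None
while True:
    page, left_off = fetch_page(rows, left_off, page_size=3)
    seen.extend(row_id for row_id, _ts in page)
    if left_off is None:
        break
print(seen)
```

In SQL the equivalent of the peek is to request LIMIT 11 when the page size is 10 and hand the eleventh row's timestamp and id back to the client.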

Current day total login time mysql

I am trying to calculate user's login time in hours and mins for current day.
I have my session table as
id| user_id | logintime |logouttime | isactive
1| 100 | 2017-06-12 22:53:53 |2017-06-13 02:53:53 | 0
2| 100 | 2017-06-13 08:53:53 |2017-06-13 09:13:53 | 0
3| 100 | 2017-06-13 10:53:53 |2017-06-13 11:33:53 | 0
4| 100 | 2017-06-13 11:53:53 |2017-06-13 12:13:53 | 0
5| 100 | 2017-06-13 12:53:53 |NULL (As user is currently logged in)| 1
I want a query that calculates the total login time for the current day; let's say today is the 13th. Note that record id 1 shows the user logging in on 2017-06-12, but since I only need the time for the 13th, that session should be counted from 2017-06-13 00:00:00.
Thanks in advance.
EDIT:
The query I have tried so far gives the wrong result for ids 1 and 5: for id 1 it also counts yesterday's minutes, and for id 5 it returns NULL because the user is still logged in.
SELECT TIMESTAMPDIFF(minute,logintime,logouttime) FROM `table` WHERE user_id= 17 and DATE(logouttime) = DATE(UTC_TIMESTAMP)

For active sessions you can use CURRENT_TIMESTAMP in place of the missing logouttime value. Create a view with the corrected timestamps so that you do not have to repeat the same subquery:
CREATE VIEW sessions_view AS
SELECT
user_id,
logintime,
IF(isactive, CURRENT_TIMESTAMP, logouttime) AS logouttime
FROM sessions_table;
To split sessions that cross from one day into the next, use UNION ALL with conditions (the split below assumes a session crosses at most one midnight). Use the TIMESTAMPDIFF function to get the difference between logintime and logouttime in seconds, SUM to aggregate it, and SEC_TO_TIME to convert the total seconds to a TIME value:
SELECT
user_id,
DATE(logintime) AS `day`,
SEC_TO_TIME(SUM(TIMESTAMPDIFF(SECOND, logintime, logouttime)))
FROM
(SELECT
user_id,
logintime,
IF(DATE(logouttime)>DATE(logintime), TIMESTAMP(DATE(logouttime)), logouttime) AS logouttime
FROM sessions_view
UNION ALL
SELECT
user_id,
TIMESTAMP(DATE(logouttime)) AS logintime,
logouttime
FROM sessions_view
WHERE DATE(logouttime)>DATE(logintime)) splited_sessions
GROUP BY user_id, `day`;
If you want data for a specific date only, add the following WHERE clause to the outer query, before the GROUP BY:
WHERE DATE(logintime) = '2017-06-13'
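The same calculation can be cross-checked outside the database by clipping each session to the day in question. The sketch below is in Python and uses the rows from the question; `seconds_logged_in` is a hypothetical helper and the value of `now` is an assumed current time. Unlike the UNION split, clipping also copes with sessions that span more than two calendar days.

```python
from datetime import datetime, timedelta

def seconds_logged_in(sessions, day_start, now):
    """Total seconds of the given sessions that fall on one day.

    sessions: iterable of (logintime, logouttime); logouttime is None
    while the user is still logged in, in which case `now` is used.
    """
    day_end = day_start + timedelta(days=1)
    total = 0.0
    for login, logout in sessions:
        if logout is None:
            logout = now
        # Clip the session to [day_start, day_end).
        start = max(login, day_start)
        end = min(logout, day_end)
        if end > start:
            total += (end - start).total_seconds()
    return total

sessions = [
    (datetime(2017, 6, 12, 22, 53, 53), datetime(2017, 6, 13, 2, 53, 53)),
    (datetime(2017, 6, 13, 8, 53, 53), datetime(2017, 6, 13, 9, 13, 53)),
    (datetime(2017, 6, 13, 10, 53, 53), datetime(2017, 6, 13, 11, 33, 53)),
    (datetime(2017, 6, 13, 11, 53, 53), datetime(2017, 6, 13, 12, 13, 53)),
    (datetime(2017, 6, 13, 12, 53, 53), None),  # still logged in
]
now = datetime(2017, 6, 13, 13, 53, 53)  # assumed current time
total = seconds_logged_in(sessions, datetime(2017, 6, 13), now)
hours, remainder = divmod(int(total), 3600)
minutes = remainder // 60
print(total, hours, minutes)
```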

How to make a timed reward system in php and mysql

How can I make a reward system in PHP that has a timer and sets a timer for a certain amount of time after they click on it and mysql inserts a random value into the table?
I am making a project and I want users to accumulate the in-game currency as if they were using a Bitcoin faucet.

A simple way to do this would be to keep track of the user's "currency" in a table that stores both amounts and datetimes. The user's balance would only count the entries whose datetime is before the current time. That way, you can insert the entry right away and it only goes live when you need it to.
rowid | user_id | amount | activetime |
-------------------------------------------------------
90000 | 1 | 0.01 | 12:13:12 01-16-15 |
90001 | 1 | 0.01 | 12:13:12 02-16-15 |
On inserts you can set the active time to something like DATE_ADD(NOW(), INTERVAL 2 HOUR) so it will show up 2 hours later.
Example of adding a row:
INSERT INTO transactions
(user_id, amount, activetime)
VALUES
(1, 0.01, DATE_ADD(NOW(), INTERVAL 2 HOUR))
Example of getting the current balance for a user (here user 1):
SELECT SUM(amount) AS balance FROM transactions WHERE user_id = 1 AND activetime <= NOW()
Also, if you want to rate limit the user, it is now easy to check whether they have already clicked the button, because there will be an entry with an activetime later than the current time. You could check quickly like this:
SELECT 1 FROM transactions WHERE user_id = 1 AND activetime > NOW() LIMIT 1
Then in PHP, if the result has one row (num_rows is 1), they have already clicked the button and are waiting for payment; otherwise the query returns no rows.
Hopefully this helps. Let me know if you have any other questions on this topic, or if you need some clarification.
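The flow described above (insert a delayed entry on click, sum only the entries that are already active, refuse a new click while one is pending) can be modelled in a few lines. This is an in-memory Python sketch for illustration only; the `Ledger` class, its method names and the integer amounts are assumptions, and in the real application each method would be one of the SQL statements shown.

```python
from datetime import datetime, timedelta

class Ledger:
    """In-memory stand-in for the transactions table."""

    def __init__(self, delay=timedelta(hours=2)):
        self.delay = delay
        self.entries = []  # (user_id, amount, active_time)

    def has_pending(self, user_id, now):
        # Mirrors: SELECT 1 ... WHERE user_id = ? AND activetime > NOW() LIMIT 1
        return any(u == user_id and t > now for u, _a, t in self.entries)

    def claim(self, user_id, amount, now):
        """Record a reward that goes live after the delay; False if rate limited."""
        if self.has_pending(user_id, now):
            return False
        self.entries.append((user_id, amount, now + self.delay))
        return True

    def balance(self, user_id, now):
        # Mirrors: SELECT SUM(amount) ... WHERE user_id = ? AND activetime <= NOW()
        return sum(a for u, a, t in self.entries if u == user_id and t <= now)

t0 = datetime(2015, 1, 16, 12, 0, 0)
ledger = Ledger()
first_claim = ledger.claim(1, 5, t0)                          # accepted
second_claim = ledger.claim(1, 5, t0 + timedelta(hours=1))    # still waiting
balance_early = ledger.balance(1, t0 + timedelta(hours=1))    # nothing live yet
balance_later = ledger.balance(1, t0 + timedelta(hours=2))    # first reward live
third_claim = ledger.claim(1, 3, t0 + timedelta(hours=2))     # accepted again
print(first_claim, second_claim, balance_early, balance_later, third_claim)
```

The random reward value mentioned in the question would simply be generated by the caller and passed in as the amount.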

SQL group & sort by two columns [duplicate]

Possible Duplicate:
SQL ORDER BY total within GROUP BY
UPDATE: I've found my solution, which I've posted here. Thanks to everyone for your help!
I'm developing a Facebook application which requires a leaderboard. Scores and time taken to complete the game are recorded and these are organised by score first, then in the case of two identical scores, the time is used. If a user has played multiple times, their best score is used.
The lower the score, the better the performance in the game.
My table structure is:
id
facebook_id - (Unique Identifier for the user)
name
email
score
time - (time to complete game in seconds)
timestamp - (unix timestamp of entry)
date - (readable format of timestamp)
ip
The query I thought would work is:
SELECT *
FROM entries
ORDER BY score ASC, time ASC
GROUP BY facebook_id
The problem I'm having is in some cases it's pulling in the user's first score in the database, not their highest score. I think this is down to the GROUP BY statement. I would have thought the ORDER BY statement would have fixed this, but apparently not.
For example:
----------------------------------------------------------------------------
| ID | NAME | SCORE | TIME | TIMESTAMP | DATE | IP |
----------------------------------------------------------------------------
| 1 | Joe Bloggs | 65 | 300 | 1234567890 | XXX | XXX |
----------------------------------------------------------------------------
| 2 | Jane Doe | 72 | 280 | 1234567890 | XXX | XXX |
----------------------------------------------------------------------------
| 3 | Joe Bloggs | 55 | 285 | 1234567890 | XXX | XXX |
----------------------------------------------------------------------------
| 4 | Jane Doe | 78 | 320 | 1234567890 | XXX | XXX |
----------------------------------------------------------------------------
When I use the query above, I get the following result:
1. Joe Bloggs - 65 - 300 - (Joes First Entry, not his best entry)
2. Jane Doe - 72 - 280
I would have expected...
1. Joe Bloggs - 55 - 285 - (Joe's best entry)
2. Jane Doe - 72 - 280
It's like the Group By is ignoring the Order - and just overwriting the values.
Using MIN(score) with the group by selects the lowest score, which is correct - however it merges the time from the users first record in the database, so often returns incorrectly.
So, how can I select a user's highest score and the associated time, name, etc and order the results by score, then time?
Thanks in advance!

Your query does not actually make sense, because the order by should be after the group by. What SQL engine are you using? Most would give an error.
I think what you want is more like:
select e.facebook_id, esum.minscore, min(e.time) as mintime -- or do you want maxtime?
from entries e join
     (select facebook_id, min(score) as minscore
      from entries
      group by facebook_id
     ) esum
     on e.facebook_id = esum.facebook_id and
        e.score = esum.minscore
group by e.facebook_id, esum.minscore
You can also do this with window functions, but that depends on your database.

One approach would be this:
SELECT entries.facebook_id, MIN(entries.score) AS score, MIN(entries.time) AS time
FROM entries
INNER JOIN (
SELECT facebook_id, MIN(score) AS score
FROM entries
GROUP BY facebook_id) highscores
ON highscores.facebook_id = entries.facebook_id
AND entries.score = highscores.score
GROUP BY entries.facebook_id
ORDER BY MIN(entries.score) ASC, MIN(entries.time) ASC
If you need more information from the entries table, you can then use this as a subquery, and join again on the information presented (facebook_id, score, time) to get one row per user.
The crux is that you need to aggregate twice: once to find the minimum score for each user, and again to find the minimum time for that user and score. You could reverse the order of the aggregation, but I would expect this order to filter most quickly and thus be the most efficient.
You might also want to check which is faster for the second aggregation: taking MIN(score) or adding score to the GROUP BY.
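The "best score, then best time for that score" rule is the same as taking the minimum of the pair (score, time) per user. The following Python sketch, with a hypothetical `leaderboard` function and the sample rows from the question, shows the result the SQL is expected to produce.

```python
def leaderboard(entries):
    """entries: iterable of (user, score, time); lower is better for both.

    Returns one (user, score, time) row per user, best row first.
    """
    best = {}
    for user, score, time in entries:
        key = (score, time)  # tuples compare by score first, then by time
        if user not in best or key < best[user]:
            best[user] = key
    rows = [(user, score, time) for user, (score, time) in best.items()]
    return sorted(rows, key=lambda r: (r[1], r[2]))

entries = [
    ("Joe Bloggs", 65, 300),
    ("Jane Doe", 72, 280),
    ("Joe Bloggs", 55, 285),
    ("Jane Doe", 78, 320),
]
board = leaderboard(entries)
print(board)
```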

You need to min the score
SELECT
facebook_id,
name,
email,
min(score) as high_score
FROM
entries
GROUP BY
facebook_id,
name,
email
ORDER BY
min(score) ASC

Thanks for your help. @Penguat had the closest answer. Here was my final query for anyone who might have the same issue:
SELECT f.facebook_id, f.name, f.score, f.time FROM
(SELECT facebook_id, name, min(score)
AS highscore FROM golf_entries
WHERE time > 0
GROUP BY facebook_id)
AS x
INNER JOIN golf_entries as f
ON f.facebook_id = x.facebook_id
AND f.score = x.highscore
ORDER BY score ASC, time ASC
Thanks again!

If you want their best time, you want to use the MIN() function - you said that the lower the score, the better they did.
SELECT facebook_id, MIN(score), time, name, ...
FROM entries
GROUP BY facebook_id, time, name, ...
ORDER BY score, time
