Context:
Data is loaded into a table every 5 minutes. Each row represents a given department together with the number of calls received so far in the last 30 minutes, i.e. a running count that resets every 30 minutes.
Problem:
We need a query that will show the running count for the last 15 minutes.
Hint:
We can get a running count for the last 15 minutes if we can show how many calls were actually received in each 5-minute load. We get that "actual" count by subtracting the previous interval (row) from the new incoming interval (row).
My question:
What is the best way to solve this so that it executes as fast as possible? We are at the initial design stage, so we can either work from one table, or work with two separate tables (one for the previous load and one for the new load).
Ultimately the result (the difference) needs to be inserted into a different table.
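One way to express the hint above directly in SQL is a window-function query run after each load. This is only a sketch: the table and column names (call_loads, call_deltas, department_id, loaded_at, running_count_30m) are assumptions, and it needs a database that supports LAG(), e.g. MySQL 8+.

    <?php
    // Sketch only: table/column names are assumptions, not from the original post.
    $pdo = new PDO('mysql:host=localhost;dbname=calls', 'user', 'pass');

    // Turn the 30-minute running counts into per-5-minute "actual" counts by
    // subtracting the previous row per department; when the counter has reset,
    // the new row already equals the calls received in that interval.
    // (In practice you would restrict this to the newly loaded rows.)
    $pdo->exec("
        INSERT INTO call_deltas (department_id, loaded_at, calls_in_interval)
        SELECT department_id, loaded_at,
               CASE WHEN running_count_30m >= prev_count
                    THEN running_count_30m - prev_count
                    ELSE running_count_30m      -- counter reset in this interval
               END
        FROM (
            SELECT department_id, loaded_at, running_count_30m,
                   COALESCE(LAG(running_count_30m) OVER (
                       PARTITION BY department_id ORDER BY loaded_at), 0) AS prev_count
            FROM call_loads
        ) AS per_interval
    ");
    // The 15-minute running count is then just SUM(calls_in_interval) over the
    // last three rows (15 minutes) per department in call_deltas.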
I recently came upon this theoretical problem:
There are two PHP scripts in an application;
The first script connects to a DB each day at 00:00 and inserts 1 million rows into an existing DB table;
The second script has a foreach loop iterating through that same table's rows; for each row it makes an API call which takes exactly 1 second to complete (request + response = 1 s), and then, independently of the content of the response, deletes one row from the DB table;
Hence, each day the DB table gains 1 million rows but loses only 1 row per second, i.e. 86,400 rows per day, and because of that it grows without bound;
What modification should be made to the second script so that the DB table does not grow infinitely big?
Does this problem sound familiar to anyone? If so, is there a 'canonical' solution to it? The first thing that crossed my mind was: if the row deletion does not depend on the API response, why not simply take the API call outside of the foreach loop? Unfortunately, I didn't get a chance to ask my question.
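For what it's worth, a hedged sketch of that idea in PHP. The table name (big_table) and the api_call() helper are hypothetical placeholders; the point is only that the 1-second API call no longer gates each single-row DELETE, so the deletions run at database speed.

    <?php
    // Sketch of the "move the API call out of the loop" idea; big_table and
    // api_call() are hypothetical placeholders.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    api_call();   // still takes ~1 s, but only once per run instead of once per row

    // The deletions no longer wait on the API, so they run at database speed.
    $delete = $pdo->prepare("DELETE FROM big_table WHERE id = ?");
    foreach ($pdo->query("SELECT id FROM big_table") as $row) {
        $delete->execute([$row['id']]);
    }
    // (A single bulk DELETE would remove all processed rows even faster.)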
Any other ideas?
I have an online shop application and a database of around 1000 ITEMS.
ITEM {
    categories: up to 5 out of 60
    types: up to 2 out of 10
    styles: up to 2 out of 10
    rating: 0-5
}
Now I want to create an item-to-item comparison with predefined conditions:
- At least one common category: += 25 points
- At least one common type: += 25 points
- If the first item has no styles: += 0 points
- If no styles in common: -= 10 points
- For each point of rating difference: -= 5 points
And store the result in a table, as item_to_item_similarity.score.
Now I made the whole thing with nice and shiny PHP functions and classes, plus a function to calculate and update all the relations.
In the test with 20 items, all went well.
But when I increased the test data to 1000 items, resulting in 1000x1000 relations, the server started complaining about script timeouts and running out of memory :)
Indexes, transactions and pre-loading some of the data helped me about halfway.
Is there a smarter way to compare and evaluate this type of data?
I was thinking of representing the related categories, styles etc. as a set of IDs, possibly as some binary mask, so that they can be easily compared (even in SQL?) without the need to create classes and loop through arrays millions of times.
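That bitmask idea can work; here is a rough sketch under assumed column names (category_mask, type_mask, style_mask, rating). With at most 60 categories, a BIGINT UNSIGNED column is wide enough, and the whole scoring pass can then run as a single INSERT ... SELECT with bitwise ANDs (MySQL syntax) instead of PHP loops over a million pairs.

    <?php
    // Rough sketch of the bitmask idea; table and column names are assumptions.

    // Pack a list of category/type/style IDs (1-based) into one integer mask.
    function toMask(array $ids): int {
        $mask = 0;
        foreach ($ids as $id) {
            $mask |= 1 << ($id - 1);
        }
        return $mask;
    }

    $pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

    // Score every pair directly in SQL: a non-zero bitwise AND means at least
    // one shared category/type/style, so no PHP loops over 1,000,000 pairs.
    $pdo->exec("
        INSERT INTO item_to_item_similarity (item_a, item_b, score)
        SELECT a.id, b.id,
               IF(a.category_mask & b.category_mask, 25, 0)
             + IF(a.type_mask     & b.type_mask,     25, 0)
             + IF(a.style_mask = 0, 0,
                  IF(a.style_mask & b.style_mask, 0, -10))
             - 5 * ABS(a.rating - b.rating)
        FROM items a
        JOIN items b ON b.id <> a.id
    ");

If the similarity table only ever needs each pair once, joining ON b.id > a.id halves the work.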
I know this isn't the best, but what about the following:
You have a table which links the two items, carries a timestamp, and holds their score. This table will hold the 1,000,000 records.
You have a CRON script, which runs every 15 mins.
The first time the cron runs, it creates the 1,000,000 rows; no scores are calculated. Detecting the first run can be done by counting the rows in the table: if count == 0, then it's the first run.
On the second and subsequent runs, it selects 1000 records, calculates their scores and updates their timestamps. It should select the records ordered by timestamp, so that it always picks the 1000 oldest records.
Leave this to run in the background every 15 mins or so. At 1000 rows per run it will take roughly 10 days to work through all the scores.
Whenever you update a product, you need to reset the date on the linking table, so that when the cron runs it recalculates the score for all rows that mention that item.
When you create a new product, you must create the linking rows, so it has to add a row for every other item.
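A rough sketch of that recurring cron step, with assumed table/column names (item_to_item_similarity with item_a, item_b, score, scored_at) and calculateScore() standing in for your existing PHP scoring logic:

    <?php
    // Sketch of the "second run and thereafter" step; names are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

    // Pick the 1000 pairs whose score is the most stale.
    $pairs = $pdo->query("
        SELECT item_a, item_b
        FROM item_to_item_similarity
        ORDER BY scored_at ASC
        LIMIT 1000
    ")->fetchAll(PDO::FETCH_ASSOC);

    $update = $pdo->prepare("
        UPDATE item_to_item_similarity
        SET score = ?, scored_at = NOW()
        WHERE item_a = ? AND item_b = ?
    ");

    foreach ($pairs as $pair) {
        $score = calculateScore($pair['item_a'], $pair['item_b']); // existing logic
        $update->execute([$score, $pair['item_a'], $pair['item_b']]);
    }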
Personally, I'd consider using a different method altogether; there are plenty of algorithms out there, you just have to find one which applies to this scenario. Here is one example:
How to find "related items" in PHP
Also, here is the Jaccard Index written in PHP, which may be more efficient than your current method:
https://gist.github.com/henriquea/540303
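For reference, the Jaccard index itself is tiny. This is just a sketch over two arrays of IDs, not the code from the linked gist:

    <?php
    // Jaccard index = |intersection| / |union| over two ID sets.
    function jaccard(array $a, array $b): float {
        $intersection = count(array_intersect($a, $b));
        $union        = count(array_unique(array_merge($a, $b)));
        return $union === 0 ? 0.0 : $intersection / $union;
    }

    echo jaccard([1, 2, 5], [2, 7]);   // 0.25 (1 shared ID out of 4 distinct)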
Say we are a site receiving massive amounts of traffic, Amazon.com-scale traffic. And say we wanted to display a counter on the home page showing the total number of sales since December the first, with the counter refreshing via ajax every 10 seconds.
How would we go about doing this?
Would we have a summary database table holding the total sales, where each checkout adds +1 to the counter, and we fetch that number every 10 seconds? Would we COUNT() the entire 'sales' table every 10 seconds? Is there an external API we could push the stats to and then pull from via ajax?
Hope you can help, Thanks
If your site is ecomm based, in that you are conducting sales, then you MUST have a sales tracking table somewhere. You could simply make the database count part of the page render when a user visits or refreshes your site.
IMO, there is no need to ajax this count as most visitors won't really care.
Also, I would recommend this query be run against a readonly (slave) database if your traffic is truly at amazon levels.
I would put triggers on the tables to maintain the counter tables. When a new sale is inserted, the sum table gets the new value added to the row for the current day. That also gives you sales per day historically without ever querying the big table.
Also, it allows orders to be entered manually for dates other than today, and that day's statistics get updated.
As for the Ajax part, that's just going to be a query into that sum table.
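A hedged sketch of such a trigger (MySQL syntax, created here through PDO). The table and column names (sales, created_at, and a sales_summary table with a unique sale_date and a total_sales column) are assumptions:

    <?php
    // Sketch only: table/column names are assumptions. The trigger keeps a
    // per-day total up to date on every insert, so the Ajax endpoint reads one
    // small row instead of COUNT()ing the sales table.
    $pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

    // sale_date must be the primary/unique key of sales_summary for the
    // ON DUPLICATE KEY UPDATE to work.
    $pdo->exec("
        CREATE TRIGGER sales_summary_after_insert
        AFTER INSERT ON sales
        FOR EACH ROW
        INSERT INTO sales_summary (sale_date, total_sales)
        VALUES (DATE(NEW.created_at), 1)
        ON DUPLICATE KEY UPDATE total_sales = total_sales + 1
    ");

    // The Ajax handler then only needs something like:
    // SELECT SUM(total_sales) FROM sales_summary WHERE sale_date >= '2012-12-01';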
Whatever you do, do not re-COUNT everything every 10 seconds. Why not have a cron job that does the counting every 10 seconds? It could take the rows from the last 10 seconds and, on a slave database, add the difference to the current count.
Still, 10 seconds sounds excessive. Every minute, maybe?
I have a person's username, and he is allowed ten requests per day. Every day the requests go back to 0, and old data is not needed.
What is the best way to do this?
This is the way that comes to mind, but I am not sure if it's the best way
(two fields, today_date, request_count):
Query the DB for the date of the last request and the request count.
Get the result and check whether it was today.
If it was today, check the request count; if it is less than 10, update the database to increment the count.
If it was not today, update the DB with today's date and count = 1.
Is there another way with fewer DB queries?
I think your solution is good. It is also possible to reset the count on a daily basis; that lets you skip a column, but you do need to run a cron job. If there are many users who won't make any requests at all, it is needless to reset their count each day.
But whichever you pick, both solutions are very similar in performance, data size and development time/complexity.
Just one column, request_count. Then query this column and update it. As far as I know, with stored procedures this may be possible in a single query; even if not, it will be just two. Then create a cron job that calls a script that resets the column to 0 every day at 00:00.
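A sketch of that single-column approach, assuming a users table with a request_count column; one conditional UPDATE both enforces and increments the quota:

    <?php
    // Sketch only: table/column names are assumptions; $username comes from the
    // authenticated request.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    $stmt = $pdo->prepare("
        UPDATE users
        SET request_count = request_count + 1
        WHERE username = ? AND request_count < 10
    ");
    $stmt->execute([$username]);

    if ($stmt->rowCount() === 0) {
        // Either the user does not exist or they already used their 10 requests.
        exit('Daily limit reached');
    }

    // The nightly cron then just runs: UPDATE users SET request_count = 0;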
To spare you some requests to the DB, define:
- the maximum number of requests per day allowed;
- the first day available to your application (the date offset).
Then add a requestcount field per user to the database.
On the first request, get the count from the DB.
The count is always the day number multiplied by (the maximum number of requests per day + 1), plus the actual number of requests by that user:
day * (max + 1) + n
So if, on the first request, the count from the DB is already at the current day's allowed maximum or higher, block.
Otherwise, if it's lower than the current day's base, reset it to the current day's base (in the PHP variable).
Then count up and store this value back into the DB.
This is one read operation, and in case the request is still valid, one write operation to the DB per request.
There is no need to run a cron job to clean this up.
That's actually the same as you propose in your question, but the day information is part of the counter value. So you can do more with one value at once, while counting up by +1 per request still works for the blocking check.
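A small worked sketch of this encoding; the constant and helper names are illustrative, not from the answer above:

    <?php
    // Sketch of the day-based counter encoding; getRequestCountFromDb() and
    // saveRequestCountToDb() are hypothetical wrappers around the DB access,
    // and $username comes from the request.
    const MAX_REQUESTS = 10;
    const DATE_OFFSET  = '2013-01-01';   // "first day available", an assumption

    $day  = (int) floor((time() - strtotime(DATE_OFFSET)) / 86400);
    $base = $day * (MAX_REQUESTS + 1);   // today's base value: day * (max + 1)

    $stored = getRequestCountFromDb($username);        // one read

    if ($stored >= $base + MAX_REQUESTS) {
        exit('Daily limit reached');      // already used max requests today
    }
    if ($stored < $base) {
        $stored = $base;                  // first request today: reset to today's base
    }
    $stored++;                            // count this request
    saveRequestCountToDb($username, $stored);           // one write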
You have to take into account that each user may be in a different time zone than your server, so you can't just store the count or use the "day * max" trick. Try to get the time offset; then the start of the user's day can be stored in your "quotas" database. In MySQL, that would look like:
`start_day`=ADDTIME(CURDATE()+INTERVAL 0 SECOND,'$offsetClientServer')
Then simply look at this time the next time you check the quota. The quota check can all be done in one query.
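A rough sketch of that one-query check, building on the `start_day` column above. The table and column names (quotas, username, start_day, request_count) are assumptions, and the quotas row is assumed to already exist for the user:

    <?php
    // Sketch only: $offsetClientServer is the client-to-server offset mentioned
    // above, $username comes from the request.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // If the stored day is stale, restart the counter at 1 and move start_day to
    // the user's current day; otherwise only increment while under the limit.
    $stmt = $pdo->prepare("
        UPDATE quotas
           SET request_count = IF(NOW() >= `start_day` + INTERVAL 1 DAY, 1, request_count + 1),
               `start_day`   = IF(NOW() >= `start_day` + INTERVAL 1 DAY,
                                  ADDTIME(CURDATE() + INTERVAL 0 SECOND, ?),
                                  `start_day`)
         WHERE username = ?
           AND (NOW() >= `start_day` + INTERVAL 1 DAY OR request_count < 10)
    ");
    $stmt->execute([$offsetClientServer, $username]);

    if ($stmt->rowCount() === 0) {
        exit('Daily limit reached');   // over quota for the user's current day
    }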
Which method do you suggest and why?
Creating a summary table and ...
1) Updating the table as the action occurs in real time.
2) Running group by queries every 15 minutes to update the summary table.
3) Something else?
The data must be near real time, it can't wait an hour, a day, etc.
I think there is a 3rd option, which might allow you to manage your CPU resources a little better. How about writing a separate process that periodically updates the summarized data tables? Rather than recreating the summary with a group by, which is GUARANTEED to run slower over time because there will be more rows every time you do it, maybe you can just update the values. Depending on the nature of the data, it may be impossible, but if it is so important that it can't wait and has to be near-real-time, then I think you can afford the time to tweak the schema and allow the process to update it without having to read every row in the source tables.
For example, say your data is just login_data (cols username, login_timestamp, logout_timestamp). Your summary could be login_summary (cols username, count). Once every 15 mins you could truncate the login_summary table, and then insert using select username, count(*) kind of code. But then you'd have to rescan the entire table each time. To speed things up, you could change the summary table to have a last_update column. Then every 15 mins you'd just do an update for every record newer than the last_update record for that user. More complicated of course, but it has some benefits: 1) You only update the rows that changed, and 2) You only read the new rows.
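A hedged sketch of that incremental pass, using the login_data / login_summary example; keeping a single global watermark (instead of one per user) is my simplification, and login_summary is assumed to have username as its unique key:

    <?php
    // Sketch of the incremental summary update described above.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // Read the previous watermark and fix the new one before counting, so rows
    // that arrive while we run are picked up on the next pass.
    $since = $pdo->query("SELECT COALESCE(MAX(last_update), '1970-01-01') FROM login_summary")
                 ->fetchColumn();
    $now   = $pdo->query("SELECT NOW()")->fetchColumn();

    // Only read the new rows, and only touch the users that actually changed.
    $stmt = $pdo->prepare("
        INSERT INTO login_summary (username, `count`, last_update)
        SELECT username, COUNT(*), ?
        FROM login_data
        WHERE login_timestamp > ? AND login_timestamp <= ?
        GROUP BY username
        ON DUPLICATE KEY UPDATE
            `count`     = `count` + VALUES(`count`),
            last_update = VALUES(last_update)
    ");
    $stmt->execute([$now, $since, $now]);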
And if 15 minutes turned out to be too old for your users, you could adjust it to run every 10 mins. That would have some impact on CPU of course, but not as much as redoing the entire summary every 15 mins.