I'm not used to working with values that should decrease after a set period, such as user warnings: for example, a warning that persists for 30 days, where a user who reaches a maximum of 3 warnings gets banned.
I thought of designing a user table like this, but now that I have to work with it, I find it unsuitable for decrementing the values every 30 days:
table_user
- username
- email
- warnings (integer)
- last_warn (timestamp data type)
Should I use some kind of PHP timer?
Is there any standard technique for user warnings?
You could create another table
User_warnings:
user_id
warn_timestamp
Whenever the user is warned, you first delete all entries older than 30 days, then you check whether two or more warnings still exist. If so, ban the user.
If you want a history about all warnings, don't delete old warnings, but just query for warnings within the last 30 days.
This way you don't have to decrement every day, but just have to check when another warning appears.
Normalize your tables, by breaking out the warnings from the user, like:
Table: Users
UserID int auto generate PK
UserName
UserEmail
Table: UserWarnings
UserID
WarningDate
You can now write a query to determine whether there are three warnings in the last 30 days. Run this query when a "warn" happens, and if a row is returned, ban the user.
The query would look something like this:
SELECT
COUNT(*)
FROM UserWarnings
WHERE UserID=...your user id... AND WarningDate>=...current date time minus 30 days...
HAVING COUNT(*)>2
By making a warning table, you can keep a complete warning history, which may be useful.
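As a rough sketch of the approach above, here is the same idea in Python with the standard-library sqlite3 module standing in for PHP/MySQL. The helper names (make_db, warn_user) and the constants are made up for illustration; only the UserWarnings table and its two columns come from the answer.

```python
import sqlite3
from datetime import datetime, timedelta

WARN_WINDOW = timedelta(days=30)  # assumed lifetime of a warning
BAN_THRESHOLD = 3                 # assumed number of active warnings that triggers a ban


def make_db():
    """Create an in-memory database with the UserWarnings table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE UserWarnings (UserID INTEGER NOT NULL, WarningDate TEXT NOT NULL)"
    )
    return conn


def warn_user(conn, user_id, now):
    """Record a warning at time `now`; return True if the user should be banned."""
    stamp = now.isoformat(sep=" ", timespec="seconds")
    conn.execute(
        "INSERT INTO UserWarnings (UserID, WarningDate) VALUES (?, ?)",
        (user_id, stamp),
    )
    # Old rows are kept as history; only rows inside the window are counted.
    cutoff = (now - WARN_WINDOW).isoformat(sep=" ", timespec="seconds")
    (active,) = conn.execute(
        "SELECT COUNT(*) FROM UserWarnings WHERE UserID = ? AND WarningDate >= ?",
        (user_id, cutoff),
    ).fetchone()
    return active >= BAN_THRESHOLD
```

Nothing has to run on a timer: a warning simply stops counting once it falls outside the window.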
There's really no standard design for user warning systems, I believe. The "three strikes and you're out" is a typical approach, but not always the best. For example, if I have N rules on my website, and we'll say that K of those rules are serious offenses, then the offenses that aren't so serious I would say give three strikes. But maybe the serious offenses are autoban or give two strikes?
If I had to set up something like this, I would create a table that looked like this:
user_warnings:
- warning_id
- user_id
- created_at
- offense_level
And then maybe have a query set up to find any users whose summed offense level over the last T days is greater than or equal to the bannable offense level. If their total offense level reaches that value, ban the user. I'd say set the offense level to be something like 5, and have tiered levels of offenses.
Never delete past offenses, though, in my opinion. You never know when it's important to remember the stuff that happened previously, and it's good to keep records of it. Just make sure this query only checks warnings that are less than 30 days old (or however long you want a warning to last).
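A minimal sketch of that query, again in Python with sqlite3 as a stand-in for MySQL. The threshold of 5 is one reading of the "something like 5" above, and the 30-day window and the helper names are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

BAN_LEVEL = 5     # assumed bannable total, one reading of "something like 5"
WINDOW_DAYS = 30  # assumed look-back period T


def _fmt(moment):
    return moment.isoformat(sep=" ", timespec="seconds")


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE user_warnings (
               warning_id    INTEGER PRIMARY KEY,
               user_id       INTEGER NOT NULL,
               created_at    TEXT    NOT NULL,
               offense_level INTEGER NOT NULL)"""
    )
    return conn


def add_warning(conn, user_id, offense_level, when):
    conn.execute(
        "INSERT INTO user_warnings (user_id, created_at, offense_level) VALUES (?, ?, ?)",
        (user_id, _fmt(when), offense_level),
    )


def users_to_ban(conn, now):
    """Return ids of users whose offense levels inside the window sum to BAN_LEVEL or more."""
    cutoff = _fmt(now - timedelta(days=WINDOW_DAYS))
    rows = conn.execute(
        """SELECT user_id
             FROM user_warnings
            WHERE created_at >= ?
            GROUP BY user_id
           HAVING SUM(offense_level) >= ?
            ORDER BY user_id""",
        (cutoff, BAN_LEVEL),
    ).fetchall()
    return [row[0] for row in rows]
```

With tiered levels, a serious offense could be stored with offense_level 5 (an immediate ban under these assumptions) and a minor one with 1 or 2.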
Well, I'm afraid that I will not be able to post a minimum reproducible example, and for that I apologize. But, here goes nothing.
Ours is a weekly prepared meals service. I track order volume in many ways. Here is the structure of the relevant table:
So then I utilize the highlighted fields in many ways, such as indicating to delivery drivers if a customer is returning from the prior order being more than a month ago (last_order_w - prev_order_w > 4), for instance.
Lately I have been noticing that the data is not consistently updating properly. In the past 3 weeks, I would say it happens about 5% of the time. If it were more consistent, I would be more confident in my ability to track down the issue, but I am not even sure how to provoke it, as I only really notice it after the fact.
The code that should cause the update is below:
<?php
//retrieve and iterate over IDs of orders placed since last synchronization.
$newOrders=array_map('reset',$dbh->query("select id from wp_posts where id > (select max(synced) from fitaf_weeks) and post_type='shop_order' and post_status='wc-processing'")->fetchAll(PDO::FETCH_NUM));
foreach($newOrders as $no){
//retrieve the metadata for the current order
$newMetas=array_map('reset',$dbh->query("select meta_key,meta_value from wp_postmeta where post_id=$no")->fetchAll(PDO::FETCH_GROUP|PDO::FETCH_UNIQUE));
//check if the current order is associated with an existing customer
$exist=$dbh->query("select * from fitaf_customers where id=".$newMetas['_customer_user'])->fetch();
//if not, gather the information we want to store from this post
$noExist=[$newMetas['_customer_user'],$newMetas['_shipping_first_name'],$newMetas['_shipping_last_name'],$newMetas['_shipping_address_1'],(strlen($newMetas['_shipping_address_2'])==0?NULL:$newMetas['_shipping_address_2']),$newMetas['_shipping_city'],$newMetas['_shipping_state'],$newMetas['_shipping_postcode'],$phone,$newMetas['_billing_email'],1,1,$no,$newMetas['_paid_date'],$week[3],$newMetas['_order_total']];
if($exist){
//if we found a record in the customer table, retrieve the data we want to modify
$oldO=$dbh->query("select last_order_id,last_order,last_order_w,lo,num_orders from fitaf_customers where id=".$newMetas['_customer_user'])->fetch(PDO::FETCH_GROUP|PDO::FETCH_ASSOC|PDO::FETCH_UNIQUE);
//make changes to the retrieved data, and make sure we are storing the most recently used delivery address and prepare the data points for the update command
$exists=[$phone,$newMetas['_shipping_first_name'],$newMetas['_shipping_last_name'],$newMetas['_shipping_postcode'],$newMetas['_shipping_address_1'],(strlen($newMetas['_shipping_address_2'])==0?NULL:$newMetas['_shipping_address_2']),$newMetas['_shipping_city'],$newMetas['_shipping_state'],$newMetas['_paid_date'],$no,$week[3],$oldO['last_order'],$oldO['last_order_id'],$oldO['last_order_w'],($oldO['num_orders']+1),($oldO['lo']+$newMetas['_order_total']),$newMetas['_customer_user']];
}
if(!$exist){
//if the customer did not exist, perform an insert
$dbh->prepare("insert into fitaf_customers(id,fname,lname,addr1,addr2,city,state,zip,phone,email,num_orders,num_weeks,last_order_id,last_order,last_order_w,lo) values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)")->execute($noExist);
}
else{
//if the customer did exist, update their data
$dbh->prepare("update fitaf_customers set phone=?,fname=?,lname=?,zip=?,addr1=?,addr2=?,city=?,`state`=?,last_order=?,last_order_id=?,last_order_w=?,prev_order=?,prev_order_id=?,prev_order_w=?,num_orders=?,lo=? where id=?")->execute($exists);
}
}
//finally retrieve the most recent post ID and update the field we check against when the synchronization script runs
$lastPlaced=$dbh->query('select max(id) from wp_posts where post_type="shop_order"')->fetch()[0];
$updateSync=$dbh-> query("update fitaf_weeks set synced=$lastPlaced order by id desc limit 1");
?>
Unfortunately I don't have any relevant error logs to show. However, as I documented the code for this post, I realized a potential shortcoming: I should be using the data retrieved from the initial query of new posts, rather than selecting the highest post ID after performing this logic. That said, I have timers running on my scripts, and this section hasn't taken over 3 seconds to run in a long time. So it seems unlikely that the script, which runs on a cron every 5 minutes, is experiencing this unintended overlap?
While I have made the change to pop the highest ID off of $newOrders, and hope it solves the issue, I am still curious to see if anyone has any insights on what could cause this logic to fail at such a low occurrence.
It seems likely your problem comes from race conditions between multiple operations accessing your db.
First of all, your last few lines of code do SELECT MAX(ID) and then use that value to update something. You Can't Do That™. If somebody else adds a row to that wp_posts table anytime after the entry you think is relevant, you'll use the wrong ID. I don't understand your app well enough to recommend a fix. But I do know this is a serious and notorious problem.
You have another possible race condition as well. Your logic is this:
1. SELECT something.
2. Make a decision based on what you SELECTED.
3. INSERT or UPDATE based on that decision.
If some other operation, done by some other user of the db, intervenes between step 1 and step 3, your decision might be wrong.
You fix this with a db transaction. The ->beginTransaction() operation, well, begins the transaction. The ->commit() operation concludes it. And, the SELECT operation you use for step one should say SELECT ... FOR UPDATE.
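To make the shape of that fix concrete, here is a hedged sketch in Python with sqlite3. SQLite has no SELECT ... FOR UPDATE, so BEGIN IMMEDIATE (which takes the write lock before the read) is swapped in as the closest equivalent; with PDO and MySQL you would use beginTransaction(), SELECT ... FOR UPDATE and commit() as described above. The cut-down customers table and the record_order helper are hypothetical.

```python
import sqlite3


def make_db():
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN / COMMIT statements below control the transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        """CREATE TABLE customers (
               id         INTEGER PRIMARY KEY,
               num_orders INTEGER NOT NULL,
               lo         REAL    NOT NULL)"""
    )
    return conn


def record_order(conn, customer_id, order_total):
    """Select, decide, then insert or update, all inside one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # stand-in for SELECT ... FOR UPDATE
    try:
        row = conn.execute(
            "SELECT num_orders, lo FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO customers (id, num_orders, lo) VALUES (?, 1, ?)",
                (customer_id, order_total),
            )
        else:
            conn.execute(
                "UPDATE customers SET num_orders = ?, lo = ? WHERE id = ?",
                (row[0] + 1, row[1] + order_total, customer_id),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # leave the table untouched on failure
        raise
```

Because the lock is held from the read through the write, no other writer can change the row between the decision and the INSERT or UPDATE.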
My situation is this... I have a table of opportunities that is sorted. We have a paid service that will allow people to view the opportunities on the website any time. However, we want an unpaid view that will show a random %/# of opportunities, which will always be the same. The opportunities are sorted by dates; e.g. they will expire and be removed from the list, and a new one should then appear in the free search. The only problem is that the free view always has to show the same opportunities. (For example, I can't just pick random rows because it will cycle through them if they keep refreshing, and likewise I can't just take the ones about to expire or furthest from expiry because people still end up seeing the entire list.)
My only solution thus far is to add an extra column to the table to mark that it is open display. Then to count them on display, and if we are missing rows then to randomly select a few more. Below is a mock up...
SELECT count(id) as total FROM opportunities WHERE display_status="open" LIMIT 1000;
...
while(total < requiredNumber) {
UPDATE opportunities SET display_status="open" WHERE display_status="private" ORDER BY random() LIMIT (required-total);
}
Can anyone think of a better way to solve this problem, preferably one that does not leave me adding another column to the table and risking conflicts if many people load the page at the same time? One final note as well: it can't be a fixed pattern (e.g. pick one, skip a few, take the next).
Any thoughts/comments would be very helpful,
Thanks.
One way to make sure that a user only sees the same set of random rows is to feed the random number generator a seed that is linked to that user (such as their user_id). That means every user gets a random ordering of rows but it's always the same random ordering for each user.
Your code would be something like:
SELECT ...
FROM ...
WHERE ...
ORDER BY random(<user id>)
LIMIT <however many>
Note: as Twelfth pointed out, as new rows are created, they will get new order values and may end up in your random selection.
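If the database's random function cannot be seeded per row in a stable way, a related technique (not the seeded random() call above, but a substitute for it) is to order rows by a hash of a fixed seed plus the row id. The sketch below is in Python; the seed string and function names are made up for illustration.

```python
import hashlib


def stable_rank(seed, row_id):
    """Deterministic pseudo-random sort key for one row under a given seed."""
    return hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()


def free_sample(row_ids, seed, limit):
    """Return the `limit` row ids that sort first under the seeded hash order."""
    return sorted(row_ids, key=lambda row_id: stable_rank(seed, row_id))[:limit]
```

Each row's position depends only on the seed and its own id, so refreshing the page returns the same rows, and when one of them expires the next row in hash order takes its place. As with the note above, a newly added row can still enter the selection if its hash happens to sort early.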
I'm the type that doesn't like to lose information... including what random rows someone got to see. However, I do not like the idea of modifying your existing table...
Create a second table called random_rows or something to that effect to save the IDs of the users and the IDs of the random records they got to see. Inner join to the table whenever you need to find those same rows again. You can also put expiry dates and the sort order in the table as well, so the user isn't permanently stuck with the same 10 rows.
Hello again Stackoverflow!
I'm currently working on custom forum software, and one of the things you like to see on a forum is a view counter.
All the approaches for a view counter that I found would just select the topic from the database, retrieve the number from a "views" column, add one and update it.
But here's my thought: if, let's say, 400 people open a topic at the exact same time, the MySQL database probably won't count all the views, because it takes time for the queries to complete, and so the last person's (of the 400) view might overwrite the first person's (of the 400) view.
Of course one could argue that on a normal site this is never going to happen, but if you have ~7 people opening that topic in the exact same second and the server is struggling at that moment, you could have the same problem.
Is there any other good approach to count views?
EDIT
Woah, could the one who voted down specify why?
By "retrieving the number of views and adding one" I meant that I would use SELECT to retrieve the number, add one using PHP (note the tags) and update it using UPDATE. I had no idea of the other methods specified below; that's why I asked.
If, let's say, 400 people open a topic at the exact same time, the MySQL database would count all the views, because this is exactly what databases were invented for.
All the approaches for a view counter that you have found are wrong. To update a field you don't need to retrieve it first; just update it in place:
UPDATE forum SET views = views + 1 WHERE id = ?
So something like that will work:
UPDATE tbl SET cnt = cnt+1 WHERE ...
UPDATE is guaranteed to be atomic. That means no one will be able to alter cnt between the time it is read and the time it is replaced. If you have several concurrent UPDATEs for the same row (InnoDB) or table (MyISAM), they have to wait their turn to update the data.
See Is incrementing a field in MySQL atomic?
and http://dev.mysql.com/doc/refman/5.1/en/ansi-diff-transactions.html
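For illustration, here is the single-statement increment in Python with sqlite3 as a stand-in for MySQL. The forum table follows the answer above, while the helper names are hypothetical. The sketch runs sequentially, so it shows the shape of the statement rather than proving anything about concurrency.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE forum (id INTEGER PRIMARY KEY, views INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO forum (id) VALUES (1)")
    conn.commit()
    return conn


def count_view(conn, topic_id):
    # Read and write happen inside one UPDATE, so there is no gap in which
    # application code could overwrite another request's increment.
    conn.execute("UPDATE forum SET views = views + 1 WHERE id = ?", (topic_id,))
    conn.commit()


def get_views(conn, topic_id):
    (views,) = conn.execute(
        "SELECT views FROM forum WHERE id = ?", (topic_id,)
    ).fetchone()
    return views
```

The point is that PHP never sees the old value: the database reads, adds and writes in one statement.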
I've done some searching for this but haven't come up with anything, maybe someone could point me in the right direction.
I have a website with lots of content in a MySQL database and a PHP script that loads the most popular content by hits. It does this by logging each content hit in a table along with the access time. Then a select query is run to find the most popular content in the past 24 hours, 7 days or maximum 30 days. A cronjob deletes anything older than 30 days in the log table.
The problem I'm facing now is that as the website grows, the log table has 1m+ hit records and it is really slowing down my select query (10-20s). At first I thought the problem was a join I had in the query to get the content title, url, etc. But now I'm not sure, as in testing, removing the join does not speed up the query as much as I thought it would.
So my question is: what is the best practice for this kind of popularity storing/selecting? Are there any good open source scripts for this? Or what would you suggest?
Table scheme
"popularity" hit log table
nid | insert_time | tid
nid: Node ID of the content
insert_time: timestamp (2011-06-02 04:08:45)
tid: Term/category ID
"node" content table
nid | title | status | (there are more but these are the important ones)
nid: Node ID
title: content title
status: is the content published (0=false, 1=true)
SQL
SELECT node.nid, node.title, COUNT(popularity.nid) AS count
FROM `node` INNER JOIN `popularity` USING (nid)
WHERE node.status = 1
AND popularity.insert_time >= DATE_SUB(CURDATE(),INTERVAL 7 DAY)
GROUP BY popularity.nid
ORDER BY count DESC
LIMIT 10;
We've just come across a similar situation and this is how we got around it. We decided we didn't really care about what exact 'time' something happened, only the day it happened on. We then did this:
Every record has a 'total hits' record which is incremented every time something happens
A logs table records these 'total hits' per record, per day (in a cron job)
By selecting the difference between two given dates in this log table, we can deduce the 'hits' between two dates, very quickly.
The advantage of this is the size of your log table is only as big as NumRecords * NumDays which in our case is very small. Also any queries on this logs table are very quick.
The disadvantage is you lose the ability to deduce hits by time of day but if you don't need this then it might be worth considering.
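A rough sketch of that scheme in Python with sqlite3; the table and function names (records, hit_log, snapshot, hits_between) are made up for illustration, and snapshot() is what the daily cron job would call.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, total_hits INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        """CREATE TABLE hit_log (
               record_id  INTEGER NOT NULL,
               day        TEXT    NOT NULL,
               total_hits INTEGER NOT NULL,
               PRIMARY KEY (record_id, day))"""
    )
    return conn


def add_hits(conn, record_id, n=1):
    """Increment the running total for a record, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO records (id) VALUES (?)", (record_id,))
    conn.execute(
        "UPDATE records SET total_hits = total_hits + ? WHERE id = ?", (n, record_id)
    )


def snapshot(conn, day):
    """Daily job: copy every record's running total into the log for `day`."""
    conn.execute(
        """INSERT OR REPLACE INTO hit_log (record_id, day, total_hits)
           SELECT id, ?, total_hits FROM records""",
        (day,),
    )


def hits_between(conn, record_id, start_day, end_day):
    """Hits gained after the start_day snapshot, up to the end_day snapshot."""
    totals = dict(
        conn.execute(
            "SELECT day, total_hits FROM hit_log WHERE record_id = ? AND day IN (?, ?)",
            (record_id, start_day, end_day),
        )
    )
    return totals[end_day] - totals[start_day]
```

The log holds one row per record per day, so a range query touches two rows per record no matter how many hits occurred.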
You actually have two problems to solve further down the road.
One, which you've yet to run into but which you might earlier than you want, is going to be insert throughput within your stats table.
The other, which you've outlined in your question, is actually using the stats.
Let's start with input throughput.
Firstly, in case you're doing so, don't track statistics on pages that could use caching. Use a php script that advertises itself as an empty javascript, or as a one-pixel image, and include the latter on pages you're tracking. Doing so allows you to readily cache the remaining content of your site.
In a telco business, rather than doing actual inserts related to billing on phone calls, things are placed in memory and periodically sync'ed with the disk. Doing so allows you to manage gigantic throughputs while keeping the hard-drives happy.
To proceed similarly on your end, you'll need an atomic operation and some in-memory storage. Here's some memcache-based pseudo-code for doing the first part...
For each page, you need a Memcache variable. In Memcache, increment() is atomic, but add(), set(), and so forth aren't. So you need to be careful not to miscount hits when concurrent processes add the same page at the same time:
$ns = $memcache->get('stats-namespace');
while (!$memcache->increment("stats-$ns-$page_id")) {
$memcache->add("stats-$ns-$page_id", 0, 1800); // garbage collect in 30 minutes
$db->upsert('needs_stats_refresh', array($ns, $page_id)); // engine = memory
}
Periodically, say every 5 minutes (configure the timeout accordingly), you'll want to sync all of this to the database, without any possibility of concurrent processes affecting each other or existing hit counts. For this, you increment the namespace before doing anything (this gives you a lock on existing data for all intents and purposes), and sleep a bit so that existing processes that reference the prior namespace finish up if needed:
$ns = $memcache->get('stats-namespace');
$memcache->increment('stats-namespace');
sleep(60); // allow concurrent page loads to finish
Once that is done, you can safely loop through your page ids, update stats accordingly, and clean up the needs_stats_refresh table. (The latter only needs two fields: page_id int pkey, ns_id int.) There's a bit more to it than simple select, insert, update and delete statements run from your scripts, however, so continuing...
As another replier suggested, it's quite appropriate to maintain intermediate stats for your purpose: store batches of hits rather than individual hits. At the very most, I'm assuming you want hourly stats or quarter-hourly stats, so it's fine to deal with subtotals that are batch-loaded every 15 minutes.
Even more importantly for your sake, since you're ordering posts using these totals, you want to store the aggregated totals and have an index on the latter. (We'll get to where further down.)
One way to maintain the totals is to add a trigger which, on insert or update to the stats table, will adjust the stats total as needed.
When doing so, be especially wary about dead-locks. While no two $ns runs will be mixing their respective stats, there is still a (however slim) possibility that two or more processes fire up the "increment $ns" step described above concurrently, and subsequently issue statements that seek to update the counts concurrently. Obtaining an advisory lock is the simplest, safest, and fastest way to avoid problems related to this.
Assuming you use an advisory lock, it's perfectly OK to use: total = total + subtotal in the update statement.
While on the topic of locks, note that updating the totals will require an exclusive lock on each affected row. Since you're ordering by them, you don't want them processed all in one go because it might mean keeping an exclusive lock for an extended duration. The simplest here is to process the inserts into stats in smaller batches (say, 1000), each followed by a commit.
For intermediary stats (monthly, weekly), add a few boolean fields (bit or tinyint in MySQL) to your stats table. Have each of these store whether they're to be counted for with monthly, weekly, daily stats, etc. Place a trigger on them as well, in such a way that they increase or decrease the applicable totals in your stat_totals table.
As a closing note, give some thought to where you want the actual count to be stored. It needs to be an indexed field, and the latter is going to be heavily updated. Typically, you'll want it stored in its own table, rather than in the pages table, in order to avoid cluttering your pages table with (much larger) dead rows.
Assuming you did all the above your final query becomes:
select p.*
from pages p join stat_totals s using (page_id)
order by s.weekly_total desc limit 10
It should be plenty fast with the index on weekly_total.
Lastly, let's not forget the most obvious of all: if you're running these same total/monthly/weekly/etc queries over and over, their result should be placed in memcache too.
You can add indexes and try tweaking your SQL, but the real solution here is to cache the results.
You should really only need to calculate the last 7/30 days of traffic once daily,
and you could do the past 24 hours hourly.
Even if you did it once every 5 minutes, that's still a huge savings over running the (expensive) query for every hit of every user.
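As a minimal illustration of that idea, here is a small in-process cache with a time-to-live in Python. The class is hypothetical; a real deployment would more likely keep the cached result in memcache or a summary table.

```python
import time


class TimedCache:
    """Cache computed values and recompute them after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._store = {}     # key -> (time stored, value)

    def get(self, key, compute):
        """Return the cached value for key, calling compute() only when stale."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

A call such as cache.get("top-7-days", run_expensive_query), with run_expensive_query as a placeholder for the slow SELECT, would then hit the database at most once per TTL period.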
RRDtool
Many tools/systems do not build their own logging and log aggregation but use RRDtool (round-robin database tool) to efficiently handle time-series data. RRDtool also comes with a powerful graphing subsystem, and (according to Wikipedia) there are bindings for PHP and other languages.
From your questions I assume you don't need any special and fancy analysis and RRDtool would efficiently do what you need without you having to implement and tune your own system.
You can do some 'aggregation' in the background, for example by a cron job. Some suggestions (in no particular order) that might help:
1. Create a table with hourly results. This means you can still create the statistics you want, but you reduce the amount of data to (24*7*4 = about 672 records per page per month).
your table can be somewhere along the lines of this:
hourly_results (
nid integer,
start_time datetime,
amount integer
)
After you parse the raw hits into your aggregate table you can more or less delete them.
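A sketch of that roll-up in Python with sqlite3 as a stand-in for MySQL. The popularity and hourly_results tables follow the question and the answer above, while roll_up and top_nodes are hypothetical helpers. It assumes the cut-off passed to roll_up falls on an hour boundary, so each hour is aggregated exactly once.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE popularity (nid INTEGER, insert_time TEXT, tid INTEGER)")
    conn.execute(
        "CREATE TABLE hourly_results (nid INTEGER, start_time TEXT, amount INTEGER)"
    )
    return conn


def roll_up(conn, before):
    """Aggregate raw hits older than `before` (an hour boundary), then delete them."""
    conn.execute(
        """INSERT INTO hourly_results (nid, start_time, amount)
           SELECT nid, strftime('%Y-%m-%d %H:00:00', insert_time), COUNT(*)
             FROM popularity
            WHERE insert_time < ?
            GROUP BY nid, strftime('%Y-%m-%d %H:00:00', insert_time)""",
        (before,),
    )
    conn.execute("DELETE FROM popularity WHERE insert_time < ?", (before,))
    conn.commit()


def top_nodes(conn, since, limit=10):
    """Most popular nodes since `since`, read from the small aggregate table."""
    return conn.execute(
        """SELECT nid, SUM(amount) AS hits
             FROM hourly_results
            WHERE start_time >= ?
            GROUP BY nid
            ORDER BY hits DESC, nid
            LIMIT ?""",
        (since, limit),
    ).fetchall()
```

The popularity query then reads at most one row per node per hour instead of one row per hit.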
2. Use result caching (memcache, apc)
You can easily store the results (which should not change every minute, but rather every hour?), either in a memcache database (which again you can update from a cronjob), use the apc user cache (which you can't update from a cronjob) or use file caching by serializing objects/results if you're short on memory.
3. Optimize your database
10 seconds is a long time. Try to find out what is happening with your database. Is it running out of memory? Do you need more indexes?
Say you've got a database like this:
books
-----
id
name
And you wanted to get the total number of books in the database, easiest possible sql:
"select count(id) from books"
But now you want to get the total number of books last month...
Edit: but some of the books have been deleted from the table since last month
Well, obviously you can't total for a month that's already past - the "books" table is always current and some of the records have already been deleted
My approach was to run a cron job (or scheduled task) at the end of the month and store the total in another table, called report_data, but this seems clunky. Any better ideas?
Add a default column that has the value GETDATE(), call it "DateAdded". Then you can query between any two dates to find out how many books there were during that date period or you can just specify one date to find out how many books there were before a certain date (all the way into history).
Per comment: You should not delete, you should soft delete.
I agree with JP, do a soft delete/logical delete. For the one extra AND statement per query it makes everything a lot easier. Plus, you never lose data.
Granted, if extreme size becomes an issue, then yeah, you'll potentially have to start physically moving/removing rows.
My approach was to run a cron job (or scheduled task) at the end of the month and store the total in another table, called report_data, but this seems clunky.
I have used this method to collect and store historical data. It was simpler than a soft-delete solution because:
The "report_data" table is very easy to generate reports/graphs from
You don't have to implement special soft-delete code for anything that needs to delete a book
You don't have to add "and active = 1" to the end of every query that selects from the books table
Because the code to do the historical reporting is isolated from everything else that uses books, this was actually the less clunky solution.
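For illustration, the month-end job could look roughly like this in Python with sqlite3. The report_data table name comes from the question; its columns and the snapshot_book_count helper are assumptions.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE report_data (month TEXT PRIMARY KEY, total_books INTEGER NOT NULL)"
    )
    return conn


def snapshot_book_count(conn, month):
    """Month-end job: store the current number of books under `month`."""
    # INSERT OR REPLACE keeps the job safe to re-run for the same month.
    conn.execute(
        """INSERT OR REPLACE INTO report_data (month, total_books)
           SELECT ?, COUNT(id) FROM books""",
        (month,),
    )
    conn.commit()
```

Reports then read from report_data only, so hard deletes in books do not affect past totals.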
If you needed data from the previous month then you should not have deleted the old data. Instead you can have a "logical delete."
I would add a status field and some dates to the table.
books
_____
id
bookname
date_added
date_deleted
status (active/deleted)
From there you would be able to query:
SELECT count(id) FROM books WHERE date_added <= '06/30/2009' AND (status = 'active' OR date_deleted > '06/30/2009')
NOTE: It may not be the best schema, but you get the idea... ;)
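To show how a logical delete can answer "how many books existed on a given date", here is a hedged sketch in Python with sqlite3. It uses ISO dates (YYYY-MM-DD) so that string comparison matches date order, and it relies on date_deleted alone rather than a separate status column; both are simplifications of the schema above, and the helper names are hypothetical.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE books (
               id           INTEGER PRIMARY KEY,
               bookname     TEXT NOT NULL,
               date_added   TEXT NOT NULL,
               date_deleted TEXT)"""  # NULL means the book is still active
    )
    return conn


def add_book(conn, bookname, day):
    conn.execute(
        "INSERT INTO books (bookname, date_added) VALUES (?, ?)", (bookname, day)
    )


def soft_delete(conn, book_id, day):
    """Mark a book as deleted instead of removing the row."""
    conn.execute("UPDATE books SET date_deleted = ? WHERE id = ?", (day, book_id))


def count_books_as_of(conn, day):
    """Books added on or before `day` and not yet deleted on that day."""
    (total,) = conn.execute(
        """SELECT COUNT(id) FROM books
            WHERE date_added <= ?
              AND (date_deleted IS NULL OR date_deleted > ?)""",
        (day, day),
    ).fetchone()
    return total
```

A book deleted in July still counts toward the end-of-June total, which is the case a plain "active only" filter would miss.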
If changing the schema of the tables is too much work I would add triggers that would track the changes. With this approach you can track all kinds of things like date added, date deleted etc.
Looking at your problem and the reluctance to change the schema and the code, I would suggest you go with your idea of counting the books at the end of each month and storing the count for the month in another table. You can use a database scheduler to invoke a stored procedure to do this.
You have just taken a baby step down the road of history databases or data warehousing.
A data warehouse typically stores data about the way things were in a format such that later data will be added to current data instead of superseding current data. There is a lot to learn about data warehousing. If you are headed down that road in a serious way, I suggest a book by Ralph Kimball or Bill Inmon. I prefer Kimball.
Here are the websites: http://www.ralphkimball.com/
http://www.inmoncif.com/home/
If, on the other hand, your first step into this territory is the only step you plan to take, your proposed solution is good enough.
The only way to do what you want is to add a column to the books table "date_added". Then you could run a query like
select count(id) from books where date_added <= '06/30/2009';