Checking if a user changed any of their data in multiple tables - php

In my database, I have several tables. One is a checkpoint table that makes note of a user choosing to finalize one of their projects. This table contains a timestamp that is automatically created. Whenever a user finalizes their project a new row is added to the checkpoint table (that way we can also keep a history of previous times the project was finalized).
I have several other tables with timestamps (or tables that I could add timestamp columns to) that automatically update when their tables change.
Is there a simple way to tell if any of the other tables have updated their data since the project was last finalized? I do not need to know which tables have changed data, just that there are tables that have changed data.
For example, if a user changes data in one of their tables I want to be able to display a message indicating that their project has unfinalized data.
There are a couple of ways that I have thought about doing this:
Checking every single table to see if any timestamps are newer than the latest timestamp in the checkpoint table.
Add an additional timestamp column (I already have a created and updated timestamp column) to the main project table. Most of the other tables are linked directly or indirectly to this main project table. Add triggers to every other table to update this timestamp when their data changes. I am not quite sure yet how to correctly set up a proper trigger for this.
Creating a new table with just the project_id and a timestamp column. Add a trigger to the other tables as shown in option 2.
As new modules are added, I will be adding more tables to the project so will need something that is easy to scale as well.
Each of these approaches seems to involve a lot of steps.
Would one of these approaches be more efficient or viable than another? Is there another approach that I am not thinking about? If triggers are the best way to do this how would I go about setting up the trigger?
A simplified overview of my tables looks like this:
main_project_table
id
user_id (FK to user_table)
created_timestamp
updated_timestamp
checkpoint_group_table (users can choose which group to finalize their project to)
id
user_id (FK to user_table)
group_name
checkpoint_table (the table that records the finalized data and time of finalization)
id
checkpoint_group_id (FK to checkpoint_group_table)
project_id (FK to main_project_table)
project_finalized_timestamp
parent_table (several of these)
id
project_id (FK to main_project_table)
child_table (0 or more of these for each parent_table)
id
parent_id (FK to parent_table)

You really only have three solutions: Middleware, Triggers, and General Log File.
Middleware solution:
Add a timestamp field to each relevant table, defined as DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP. This will update the timestamp field to the current time on every update.
Assuming that users are going through some API, you can write a JOIN query that returns a flag indicating whether anything changed after the last finalization. It would look like this.
SELECT
CASE
WHEN b.timestamp IS NOT NULL THEN 0
WHEN c.timestamp IS NOT NULL THEN 0
WHEN d.timestamp IS NOT NULL THEN 0
WHEN e.timestamp IS NOT NULL THEN 0
ELSE 1
END AS `test`
FROM checkpoint_table a
LEFT JOIN main_project_table b
ON a.project_id = b.id
AND b.timestamp > a.project_finalized_timestamp
LEFT JOIN checkpoint_group_table c
ON b.user_id = c.user_id
AND c.timestamp > a.project_finalized_timestamp
LEFT JOIN parent_table d
ON b.id = d.project_id
AND d.timestamp > a.project_finalized_timestamp
LEFT JOIN child_table e
ON d.id = e.parent_id
AND e.timestamp > a.project_finalized_timestamp
Now when a request is routed to the tables you can run this query, and if test == 0, you return the message.
<?php
class Middleware
{
    private $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
    }

    public function getMessage($sql)
    {
        // $sql is the query shown above; test == 0 means something changed after finalization
        $data = $this->pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
        if ($data[0]['test'] == 0) {
            return "project has unfinalized data";
        }
        return null;
    }
}
Trigger Solution:
For example, for parent_table (which references the project directly):
CREATE TRIGGER parent_table_after_update
AFTER UPDATE ON parent_table
FOR EACH ROW
UPDATE main_project_table
SET updated_timestamp = CURRENT_TIMESTAMP
WHERE main_project_table.id = NEW.project_id;
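A child table only references its parent, so its trigger has to hop through parent_table to reach the project. A sketch under that assumption:
CREATE TRIGGER child_table_after_update
AFTER UPDATE ON child_table
FOR EACH ROW
UPDATE main_project_table m
JOIN parent_table p ON p.project_id = m.id
SET m.updated_timestamp = CURRENT_TIMESTAMP
WHERE p.id = NEW.parent_id;
You would repeat the pattern (plus AFTER INSERT and AFTER DELETE variants) for each table a new module adds.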
The advantage is that this is perhaps more elegant than the middleware solution. The disadvantages are that triggers are not in plain view, and in my experience, when processes run in the background they eventually get forgotten. In the long term you could be left with a Jenga puzzle, which would make life difficult.
General Log File Solution:
MySQL can log every query on the server. It is possible to read this log file at runtime, parse it, and figure out whether any tables were updated. This way you can tell if anything was updated after the project was finalized.
Turn on a general log file.
SET GLOBAL general_log = 'ON';
Set the path of the log file.
SET GLOBAL general_log_file = '/var/log/mysql/mysql_general.log';
Confirm by going to the command terminal.
mysql -se "SHOW VARIABLES" | grep -e general_log
You might need to restart MySQL.
sudo service mysql restart
This script can get you started...
$v = shell_exec("sudo less /var/log/mysql/mysql_general.log");
$lines = explode("\n", $v);

// Re-join wrapped entries: continuation lines start with whitespace
$entries = array();
foreach ($lines as $line) {
    if (substr($line, 0, 1) != " ") {
        if (isset($l)) {
            array_push($entries, $l);
        }
        $l = $line;
    } else {
        $l .= preg_replace('/\s+/', ' ', $line);
    }
}
if (isset($l)) {
    array_push($entries, $l); // don't drop the final entry
}

// Split each entry on tabs: timestamp, connection id + command type, query text
$index = array();
foreach ($entries as $i => $entry) {
    $parts = array();
    foreach (explode("\t", $entry) as $key => $value) {
        $parts[$key] = trim($value);
    }
    $index[$i] = $parts;
}
This will result in entries like this...
array(3) {
[0]=> string(27) "2017-10-01T08:17:04.659274Z"
[1]=> string(8) "70 Query"
[2]=> string(129) "UPDATE checkpoint_group_table SET group_name = 'Dev Group' Where id=6"
}
From here you can use a library called PHP-SQL-Parser to parse out the query.
The advantage of this approach is that it might scale well, since you will not have to add any columns to your database. The disadvantages are that it involves more code, and that means more complexity. You probably cannot do this solution justice without writing unit tests for it.

If I were in your situation, I would make a table with the fields project_id (FK) and a boolean is_finalized. Every time a project is finalized, I would add an entry to it.
+-----------------+--------------+
| project_id | is_finalized |
+-----------------+--------------+
| 12 | 1 |
+-----------------+--------------+
Before any update/insert, just check whether a row exists for the project. If it exists, change is_finalized to 0. While loading the page, check the value: if it is 0, show the message: project has unfinalized data.
The message should show only if the row exists and the value is 0. If the project has never been finalized, the table won't have a row for it, hence no message.
This is quite easy, faster to process (rather than checking each timestamp), and extensible, since it only depends on the update and insert queries, which you can reuse in your upcoming modules.
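A minimal sketch of that flow, assuming the table is called project_status and a PDO connection in $pdo (both names are mine, not from the answer):
// On any insert/update to the project's data, flip the flag
$pdo->prepare("UPDATE project_status SET is_finalized = 0 WHERE project_id = ?")
    ->execute(array($projectId));

// On page load, show the message only if a row exists and is 0
$stmt = $pdo->prepare("SELECT is_finalized FROM project_status WHERE project_id = ?");
$stmt->execute(array($projectId));
$flag = $stmt->fetchColumn();
if ($flag !== false && (int)$flag === 0) {
    echo "project has unfinalized data";
}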

Timestamp comparison can get messy when you have to do multiple checks.
...I do not need to know which tables have changed data just that there are tables that have changed data...
Use a join query to generate one data set, JSON-encode or serialize it, then MD5 it, and keep the hashed string in the DB. Next time, compute the hash again and compare: if there is ANY difference, the data set has changed. This is the general idea behind large data/file comparison and code repositories.
but in light of...
..more tables to the project..
Then just MD5 each data row in the table. Once a row changes, its hashed string will be different.
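A rough sketch of the per-project variant, assuming a PDO handle in $pdo (the query and column names are illustrative):
// Build one deterministic blob from the project's rows and hash it
$stmt = $pdo->prepare("SELECT * FROM parent_table WHERE project_id = ? ORDER BY id");
$stmt->execute(array($projectId));
$hash = md5(json_encode($stmt->fetchAll(PDO::FETCH_ASSOC)));

// Compare against the hash stored when the project was finalized
if ($hash !== $storedHash) {
    // the data set has changed since finalization
}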

Plan A: An off-the-wall solution:
Set up Master-Slave. The Slave will contain an 'old' copy of the data.
Establish "delayed" replication. Let's say 1 hour.
Get pt-table-checksum; run it twice an hour.
That will discover changes within an hour. (The timings may need tweaking if data size is quite large or small.)
Plan B
Deny all direct access from actual humans. Instead, build an application that handles all normal accesses through some API. Then I would instrument the API to collect whatever I choose.
Ad Hoc queries (for which there is no API):
Perhaps disallow them
Perhaps have a review board (me) to approve them before they run.
Perhaps have an API that runs the query, but immediately logs/emails/rings bells/whatever.

Really not sure why these answers are suggesting reliance on IDs or complex data logging; this is a fairly common problem with some very simple solutions.
Use those parent/child relationships
Note: when documenting a schema, it is important to note more than just FK relationships, but also the type of relationship (one-to-one, many-to-one, one-to-many, many-to-many).
You already have a fairly well defined parent/child relationships, I assume to be:
main_project one<--many parent one<--many child
Use them one of two ways:
Update a date for parent and main_project which stores the most recent date any child was modified.
Use a combination of join/max/modified in a query utilizing main_project, parent, and child.
child_updated date
main_project.child_updated
parent.child_updated
Whenever updating any child, also update the child_updated dates for main_project and parent. Similarly for parent, update main_project. This can be done with triggers, PHP, or some clever use of joins or views as you save the main_project objects. I would highly advise sticking to doing this with PHP models of those tables.
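A skeletal version of the model approach (every name here is illustrative; it assumes a PDO connection and the child_updated columns from above):
class ChildModel
{
    private $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
    }

    public function save(array $child)
    {
        // ... persist the child row itself here ...

        // Bubble the modification date up to parent and main_project
        $this->pdo->prepare("UPDATE parent_table SET child_updated = NOW() WHERE id = ?")
            ->execute(array($child['parent_id']));
        $this->pdo->prepare("UPDATE main_project_table m
                             JOIN parent_table p ON p.project_id = m.id
                             SET m.child_updated = NOW()
                             WHERE p.id = ?")
            ->execute(array($child['parent_id']));
    }
}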
join/max/modified
Just build a query to get you four values, then check them:
checkpoint_table.main_project_finalized
main_project.modified
MAX(parent.modified)
MAX(child.modified)
These joins can get a bit tricky, so you'll have to play with this a bit.
SELECT m.modified AS modified,
MAX(cp.project_finalized_timestamp) AS finalized,
MAX(p.modified) AS parent_modified,
MAX(ch.modified) AS child_modified
FROM main_project_table m
LEFT JOIN checkpoint_table cp
ON m.id = cp.project_id
LEFT JOIN parent_table p
ON m.id = p.project_id
LEFT JOIN child_table ch
ON p.id = ch.parent_id
GROUP BY m.id
This will give you ONE row of all the dates you care about, allowing you to create some simple logic for it in PHP.
$result = // retrieve joined data as above
if ($result['finalized'] < max($result['modified'], $result['parent_modified'], $result['child_modified'])) {
// changed
}

There are some good solutions mentioned so far. Another one is to make use of MySQL's information schema. Doing this, you can for example select all tables that have a timestamp field with the name you know, and check their modification times. This is probably the most dynamic and seamless approach but not really the best one. I would typically only do something like this if I was building an interface on top of legacy or third party code and didn't have control of that part of the application.
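For instance, a query along these lines lists every table carrying a known timestamp column (this assumes you name the column consistently, e.g. updated_timestamp):
SELECT TABLE_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
AND COLUMN_NAME = 'updated_timestamp';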
Architecturally I think the best approach is to have your application aware of the pertinent tables/fields and do an audit of them. I am assuming that the data is related to the object in question, and therefore, although they are foreign tables, they can still be easily checked for modifications.
Another good idea would be to add versioning to all of the tables in question so that during this step in your application you can show what changed.

Related

Can a field change during the execution of a MySQL query and both values be present in the result set?

I've come across some odd defensive code. Basically it does a query like this:
select * from A join B on (A.b_id=B.id)
Then it iterates through the result set and whenever it meets a new row from table B, it caches it (by id). Afterwards only the cached copy is used, even for subsequent rows.
It looks like it was trying to safeguard against a result set like this:
A.id | A.value | B.id | B.value
------+---------+------+---------
1 | First | 1 | Yay
2 | Second | 1 | Nay
But is this even possible? Even if the row in table B is updated while the select query is fetched half way, will it really be visible? Can the update even proceed while someone is querying the table?
For what it's worth, I think the table at the time was MyISAM, although it's been since converted to InnoDB. Also, the code which is running the query is written in PHP. As far as I can tell, it uses the default transaction isolation level and fetch mode.
OK, it seems I need some clarifications. Here's code similar to what I've found:
$sql = "select A.id a_id, A.value a_value, B.id b_id, B.value b_value from A join B on (A.b_id=B.id)";
$res = mysql_query($sql);
$cacheB = array();
$A = new classA();
$B = new classB();
while ($row = mysql_fetch_assoc($res)) {
$A->setData($row);
if ( !isset($cacheB[$row['b_id']]) ) {
$cacheB[$row['b_id']] = $row;
}
$B->setData($cacheB[$row['b_id']]);
// Do some processing depending on $A and $B
}
This code is a CLI application running from a cron job. The data from $A and $B isn't returned to anything, but depending on the contents, some external services may be called and some other DB tables may be modified. The contents of classA, classB and the processing are not relevant to this question.
My question is - is there a point for this "safeguard", or is it a deadweight that can be deleted? Let's assume that the processing part would actually be sensitive to a change in the values of B (although in reality I doubt it, but still).
Can a field change during the execution of a MySQL query and both values be present in the result set?
No.
In MyISAM the entire table is locked by each query, so it's not possible at all, by design (see table locking).
In InnoDB queries are isolated and a select is a consistent read as mentioned in the doc Locks Set by Different SQL Statements in InnoDB. A consistent read is defined as "A read operation that uses snapshot information to present query results based on a point in time, regardless of changes performed by other transactions running at the same time."
Even if the row in table B is updated while the select query is fetched half way, will it really be visible?
No. Even then, the change won't be visible; it's impossible.
Can the update even proceed while someone is querying the table?
In MyISAM no, it'll have to wait, as explained in the doc: "Table locking enables many sessions to read from a table at the same time, but if a session wants to write to a table, it must first get exclusive access, meaning it might have to wait for other sessions to finish with the table first. During the update, all other sessions that want to access this particular table must wait until the update is done."
In InnoDB yes, but the queries are isolated and work on different "snapshots" of the database as explained, so it doesn't matter. Transactions are particularly useful in this case if you have any doubts, by the way.
The code you are showing might or might not have another purpose; that I can't say. But if its only purpose is to prevent something that cannot happen, then it's completely redundant and can be safely removed.
Just in case:
Right now in your while loop you have a row
{ a_id1, a_value1, b_id1, b_value1 }
And you set $B and save in the cache the whole row, not just the values from B.
So the next row in the loop will have a different a_id, but the same b_id:
{ a_id2, a_value2, b_id1, b_value1 }
But in this case you will set $B using the cached version of $row, so you will have a_id1 instead of a_id2.
My guess is $B->setData() only cares about fields related to B, so using the cached version doesn't make any difference; but if that isn't the case, you are cloning the A values from the first row onto the following rows with the same b_id.
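If the safeguard has to stay, a safer variant is to cache only B's columns so stale A values cannot leak into $B (a sketch using the aliases from the query above):
if (!isset($cacheB[$row['b_id']])) {
    // Keep only B's columns in the cache, not the whole joined row
    $cacheB[$row['b_id']] = array(
        'b_id'    => $row['b_id'],
        'b_value' => $row['b_value'],
    );
}
$B->setData($cacheB[$row['b_id']]);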

To relation or not to relation? A MySQL, PHP database workflow

I'm kinda new with MySQL and I'm trying to create a kind of complex database and need some help.
My db structure
Tables(columns)
1.patients (Id,name,dob,etc....)
2.visits (Id,doctor,clinic,Patient_id,etc....)
3.prescription (Id,visit_id,drug_name,dose,tdi,etc....)
4.payments (id,doctor_id,clinic_id,patient_id,amount,etc...) etc..
I have about 9 tables; in all of them the primary key is 'id' and it is set to auto-increment.
I don't use relations in my DB (because I don't know if it would be better or not! I never got really deep into MySQL, so I just use PHP to run a query to fetch info from one table and use that to run another query to get more info/store it, etc.)
for example:
If I want to view all drugs I gave to one of my patients, for example the one with id 100:
1. Click the patient name (the link is generated from tbl patients, column id).
2. Search tbl visits WHERE patient_id = 100; that returns all his visits ($x array).
3. Loop over the prescription tbl, searching for drugs whose visit_id matches an entry in $x.
4. Return all rows found.
As my database expands more and more (1k+ records in the visits table), one patient can have more than 40 visits, and that means 40 loops into the prescription table to get all his previous prescriptions.
So I came up with a small tweak where I edited my DB so that patient_id and visit_id are columns in nearly all tables, letting me merge steps 2 and 3 into one (search the prescription tbl WHERE patient_id = 100). But that left me with many duplicates in my DB, and I feel it's kind of a stupid way to do it!
Should I start considering using a relational database?
If so, can someone explain a bit how this will ease my life?
Can I do this redesign by altering the current tables, or must I recreate all the tables?
Thank you very much
Yes, you should exploit MySQL's relational database capabilities. They will make your life much easier as this project scales up.
Actually you're already using them well. You've discovered that patients can have zero or more visits, for example. What you need to do now is learn to use JOIN queries to MySQL.
Once you know how to use JOIN, you may want to declare some foreign keys and other database constraints. But your system will work OK without them.
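For instance, declaring the patient/visit relationship as a real foreign key would look like this (a sketch against the tables as currently named):
ALTER TABLE visits
ADD CONSTRAINT fk_visits_patient
FOREIGN KEY (patient_id) REFERENCES patients (id);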
You have already decided to denormalize your database by including both patient_id and visit_id in nearly all tables. Denormalization is the adding of data that's formally redundant to various tables. It's usually done for performance reasons. This may or may not be a wise decision as your system scales up. But I think you can trust your instinct about the need for the denormalization you have chosen. Read up on "database normalization" to get some background.
One little bit of advice: Don't use columns named simply "id". Name columns the same in every table. For example, use patients.patient_id, visits.patient_id, and so forth. This is because there are a bunch of automated software engineering tools that help you understand the relationships in your database. If your ID columns are named consistently these tools work better.
So, here's an example about how to do the steps numbered 2 and 3 in your question with a single JOIN query.
SELECT p.patient_id, p.name, v.visit_id, rx.drug_name, rx.drug_dose
FROM patients AS p
LEFT JOIN visits AS v ON p.patient_id = v.patient_id
LEFT JOIN prescription AS rx ON v.visit_id = rx.visit_id
WHERE p.patient_id = '100'
ORDER BY p.patient_id, v.visit_id, rx.prescription_id
Like all SQL queries, this returns a virtual table of rows and columns. In this case each row of your virtual table has patient, visit, and drug data. I used LEFT JOIN in this example. That means that a patient with no visits will have a row with NULL data in it. If you specify JOIN MySQL will omit those patients from the virtual table.

Multiple joins in database

This situation is pretty difficult to explain, but I'll do my best.
For school, we have to create a web application (written in PHP) which allows teachers to manage their students' projects and allows the students to do peer evaluation. As there are many students, every project has multiple project groups (and of course you should only peer-evaluate your own group members).
My database structure looks like this at the moment:
Table users: contains all user info (user_id is primary)
Table projects: Contains a project_id, a name, a description and a start date.
So far this is pretty easy. But now it gets more difficult.
Table groups: Contains a group_id, a groupname and as a group is specific for a project, it also holds a project_id.
Table groupmembers: A group contains multiple users, but users can be in multiple groups (as they can be active in multiple projects). So this table contains a user_id and a group_id to link these.
Lastly, admins can decide when users need to do their peer evaluation and how much time they have for it. So there is a last table, evaluations, containing an evaluation_id, a start and end date and a project_id (the actual evaluations are stored in a sixth table, which is not relevant for now).
I think this is a good design, but it gets harder when I actually have to use this data. I would like to show a list of evaluations you still have to fill in. The only thing you know is your user_id as this is stored in the session.
So this would have to be done:
1) Run a query on groupmembers to see in which groups the user is.
2) With this result, run a query on groups to see to which projects these groups are related.
3) Now that we know which projects the user is in, the evaluations table should be queried to see if there are ongoing evaluations for these projects.
4) We now know which evaluations are available, but now we also need to check the sixth table to see if the user has already completed this evaluation.
All these steps depend on each other's results, so they would each need their own error handling. Once the user has chosen the evaluation they wish to fill in (an evaluationID will be sent via GET), a lot of new queries will have to be run: one to check which users are in this member's group and need to be evaluated, and another to see which group members have already been evaluated.
As you see, this is quite complex. With all the error handling included, my script will be a real mess. Someone told me a "view" might help in this situation, but I don't really understand why it would help me here.
Is there a good way to do this?
Thank you very much!
You are thinking too procedurally.
All your conditions should be easily entered into one single WHERE clause of a SQL statement.
You will end up with a single list of the items to be evaluated: only one list, only one set of error handling.
Not sure if this is exactly right, but try this basic approach. I didn't run this against an actual database so the syntax may need to be tweaked.
select p.project_name
from projects p inner join evaluations e on p.project_id = e.project_id
where p.project_id in (
select g.project_id
from projects p2 inner join groups g on p2.project_id = g.project_id
inner join groupmembers gm on gm.group_id = g.group_id
where gm.user_id = $_SESSION['user_id'])
Also, you'll need to make sure that you properly escape your user_id when making it a part of the query, but that is a whole other topic.
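On that note, a prepared statement sidesteps the escaping problem entirely (a sketch assuming a PDO connection in $pdo):
$sql = "select p.project_name
        from projects p
        inner join evaluations e on p.project_id = e.project_id
        where p.project_id in (
            select g.project_id
            from groups g
            inner join groupmembers gm on gm.group_id = g.group_id
            where gm.user_id = ?)";
$stmt = $pdo->prepare($sql);
$stmt->execute(array($_SESSION['user_id']));
$evaluations = $stmt->fetchAll(PDO::FETCH_ASSOC);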

MySql Temp Tables VS Views VS php arrays

I have currently created a Facebook-like page that pulls notifications from different tables, let's say about 8 tables. Each table has a different structure with different columns, so the first thing that came to mind was to have a global table, like a table of contents, and refresh it with every new hit. I know inserts are resource intensive, but I was hoping that since it is a static table, I'd only add maybe one new record every 100 visitors, so I thought "MAYBE" I could get away with this, but I was wrong. I managed to get deadlocks from just three people hammering the website.
So anyway, now I have to redo it using a different method. Initially I was going to use views, but I have an issue with views: the selected result will have to contain the id of a user. Here is an example of a select statement from PHP:
$get_events = "
SELECT id, " . $userId . ", 'admin_events', 0, event_start_time
FROM admin_events
WHERE CURDATE() < event_start_time AND
NOT EXISTS(SELECT id
FROM admin_event_registrations
WHERE user_id = " . $userId . " AND admin_events.id = event_id) AND
NOT EXISTS(SELECT id
FROM admin_event_declines
WHERE user_id = " . $userId . " AND admin_events.id = event_id) AND
event_capacity > (SELECT COUNT(*) FROM admin_event_registrations WHERE event_id = admin_events.id)
LIMIT 1";
Sorry about the messiness. In any event, as you can see, I need to return the user id from the page as a selected column from the table. I could not figure out how to do that with views, so I don't think views are the way I will be heading, because there are a lot more of these types of queries. I come from an MSSQL background and I love stored procedures, so if there are stored procedures for MySQL, that would be excellent.
Next I started thinking about temp tables. The table will be in memory, will hold probably 150 rows max, and there will be no deadlocks. Is it still very expensive to do inserts on a temp table? Will I end up crashing the server? Right now we have maybe 100 users per day, but I want to be future-proof for when we get more users.
After long thought, I figured that the only other way is to use PHP and get all the results as an array. The problem is that I'd get something like:
$my_array[0]["date_created"] = <current_date>
The problem with the above is that I have to sort by date_created, but this is a multi-dimensional array.
Anyway, to pull 150 to 200 records max from a database, which approach would you take: temp table, view, or PHP?
Some thoughts:
Temp Tables:
Temporary tables only last as long as the session is alive. If you run the code in a PHP script, the temporary table will be destroyed automatically when the script finishes executing.
Views:
These are mainly for hiding complexity, in that you create one with a join and then access it like a single table. The underlying code is a SELECT statement.
PHP Array:
A bit more cumbersome to get data from than SQL. However, PHP does have some functions to make life easier, but no real query language.
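For example, the sorting problem from the question is a short usort call (a sketch; it assumes date_created holds a date string strtotime can parse):
usort($my_array, function ($a, $b) {
    // compare the rows by their date_created values
    return strtotime($a['date_created']) - strtotime($b['date_created']);
});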
Stored Procedures:
There are stored procedures in MySQL - see: http://dev.mysql.com/doc/refman/5.0/en/stored-routines-syntax.html
My Recommendation:
First, re-write your query using the MySQL Query Analyzer: http://www.mysql.com/products/enterprise/query.html
Now I would use PDO to put my values into an array using PHP. This still leaves the initial heavy lifting to the DB engine and keeps you from making multiple calls to the DB server.
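A minimal sketch of that, assuming a configured PDO connection in $pdo and the query rewritten with ? placeholders wherever $userId was concatenated in:
$stmt = $pdo->prepare($sql);                 // $sql holds the rewritten query with ? placeholders
$stmt->execute(array($userId, $userId));     // one value per placeholder
$events = $stmt->fetchAll(PDO::FETCH_ASSOC); // single round-trip; results land in a PHP array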
Try this:
SELECT id, " . $userId . ", 'admin_events', 0, event_start_time
FROM admin_events AS ae
LEFT JOIN admin_event_registrations AS aer
ON ae.id = aer.event_id
LEFT JOIN admin_event_declines AS aed
ON ae.id = aed.event_id
WHERE aed.user_id = ". $userid ."
AND aer.user_id = ". $userid ."
AND aed.id IS NULL
AND aer.id IS NULL
AND CURDATE() < ae.event_start_time
AND ae.event_capacity > (
SELECT SUM(IF(aer2.event_id IS NOT NULL, 1, 0))
FROM admin_event_registrations aer2
JOIN admin_events AS ae2
ON aer2.event_id = ae2.id
WHERE aer2.user_id = ". $userid .")
LIMIT 1
It still has a subquery, but you will find that it is much faster than the other options given. MySQL can join tables easily (they should all be of the same table type, though). With the user-specific conditions moved into the LEFT JOIN clauses, a NULL aer.id / aed.id means the user has neither registered for nor declined the event, which is exactly what the WHERE clause tests for. This can all be done in a flash, and with the join statements it should reduce your overall query time significantly.
The problem is that you are using correlated subqueries. I imagine that your query takes a little while to run if it's not in the query cache? That's what would be causing your table to lock and causing contention.
Switching the table type to InnoDB would help, but your core problem is your query.
150 to 200 records is a very small amount. MySQL does support stored procedures, but this isn't something you would need them for. Inserts are not resource intensive, but a lot of them at once or in sequence can cause issues (use bulk insert syntax).
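For reference, bulk insert syntax packs many rows into a single statement, which is far cheaper than issuing them one by one (table and column names here are illustrative):
INSERT INTO notifications (user_id, message)
VALUES (1, 'first'), (2, 'second'), (3, 'third');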

A better logging design or some SQL magic?

I'm knee deep in modifying some old logging code that I didn't write, and I'm wondering what you think of it. This is an event logger, written in PHP with MySQL, that logs messages like:
Sarah added a user, slick101
Mike deleted a user, slick101
Bob edited a service, Payment
Broken up like so:
Sarah [user_id] added a user [message], slick101 [reference_id, reference_table_name]
Into a table like this:
log
---
id
user_id
reference_id
reference_table_name
message
Please note that the "Bob" and "Payment" in the above example messages are ids pointing to other tables, not the actual names. A join is needed to get the names.
It looks like the reference_table_name is for finding the proper names in the correct table, since only the reference_id is stored. This would probably be good if somehow I could join on the table name stored in reference_table_name, like so:
select * from log l
join {{reference_table_name}} r on r.id = l.reference_id
I think I see where he was going with this table layout: how much better to have ids for statistics instead of storing the entire message in a single column (which would require text parsing). Now I'm wondering..
Is there a better way or is it possible to do the make-believe join somehow?
Cheers
To get the join based on the modelling, you'd be looking at a two stage process:
Get the table name from LOG for a particular message
Use dynamic SQL by constructing the actual query as a string, i.e.:
"SELECT l.* FROM log l JOIN " . $tableName . " r ON r.id = l.reference_id"
There's not a lot of value to logged deletions because there's no record to join to in order to see what was deleted.
How much history does the application need?
Do you need to know who did what to a value months/years in the past? If such records are required, they should be archived and removed from the table. If you don't need all the history, consider using the following audit columns on each table:
ENTRY_USERID, NOT NULL
ENTRY_TIMESTAMP, DATE, NOT NULL
UPDATE_USERID, NOT NULL
UPDATE_TIMESTAMP, DATE, NOT NULL
These columns let you know who created the record and when, and who last successfully updated it and when. I'd create audit tables on a case-by-case basis; it just depends on what functionality the user needs.
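As a concrete sketch, retrofitting those audit columns onto an existing table might look like this (the table name and column types are assumptions):
ALTER TABLE service
ADD ENTRY_USERID INT NOT NULL,
ADD ENTRY_TIMESTAMP DATETIME NOT NULL,
ADD UPDATE_USERID INT NOT NULL,
ADD UPDATE_TIMESTAMP DATETIME NOT NULL;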
