Basic version control for MySQL table - php

I'm trying to setup a (I thought) fairly simple versioning system for static html pages on a site. The goal is to keep previous versions of the content, then restore to them if needed (I guess basically creating a new version that's a duplicate of an old one), and optionally to toss out data older than X versions ago.
The table's setup is fairly straightforward:
id
reference_id (string/used to determine what page the item pertains to)
content (document/html page sized amount of data)
e_user (user who changed it last)
e_timestamp (when it was changed)
I just want to have something setup to create a previous version for each edit to the content, then be able to restore to it if needed.
What's the best method for accomplishing this? Should everything be in the same table, or spread across a few different ones?
I read through a few pages on the subject, but a lot of them seemed like overkill for what i'm trying to accomplish (ex http://www.jasny.net/articles/versioning-mysql-data/ )
Are there any platforms/guides about that will help me in this endeavorer?

Ideally you would want everything in the same table with something in your query to get the correct version, however you should be careful how you do this as an inefficient query will put extra load on your server. If normally you would select a single item like this:
SELECT * FROM your_table WHERE id = 42
This would then become:
SELECT * FROM your_table
WHERE id = 42
AND date < '2010-10-12 15:23:24'
ORDER BY date DESC
LIMIT 1
Index (id, e_timestamp) to allow this to perform efficiently.
Selecting multiple rows in a single query is more tricky and requires a groupwise-maximum approach but it can be done.

You can use a technique called "auditing". You would set up audit tables. Then you would either write it into your code or setup triggers on the DB side so that every time a change is made, an entry is added into the appropriate audit table. Then you can go back through the audit table and see things like:
"Oh, yesterday Sue went in and fixed a typo"
"Uh oh, steve wiped out an entire paragraph by accident earlier today while trying to rewrite this section"
Your primary table that stores the data doesn't keep all that data, so it can stay slim. If you ever need to look at that data and say roll stuff back, you can go look in your audit table and do that. You can setup the audit table however you want, so each audit row can have the entire content BEFORE edit, and not just what was edited. That should make "rolling back" fairly easy.

Add a version column and a delete column (bool) and create some functions that compare the versions of rows with the same id. You'll definitely want to be able to easily find the current version and the previous version. To get rid of the data you'll want to write another function that sorts all of the versions of id, figures out which are old enough to be deleted, and marks them for deletion by another function. You'll probably want to have an option to make certain pages immune to deletion or postpone it.

Related

What do you think of this approach for logging changes in mysql and have some kind of audit trail

I've been reading through several topics now and did some research about logging changes to a mysql table. First let me explain my situation:
I've a ticket system with a table: 'ticket'
As of now I've created triggers which will enter a duplicate entry in my table: 'ticket_history' which has "action" "user" and "timestamp" as additional columns. After some weeks and testing I'm somewhat not happy with that build since every change is creating a full copy of my row in the history table. I do understand that disk space is cheap and I should not worry about it but in order to retrieve some kind of log or nice looking history for the user is painful, at least for me. Also with the trigger I've written I get a new row in the history even if there is no change. But this is just a design flaw of my trigger!
Here my trigger:
BEFORE UPDATE ON ticket FOR EACH ROW
BEGIN
INSERT INTO ticket_history
SET
idticket = NEW.idticket,
time_arrival = NEW.time_arrival,
idticket_status = NEW.idticket_status,
tmp_user = NEW.tmp_user,
action = 'update',
timestamp = NOW();
END
My new approach in order to avoid having triggers
After spening some time on this topic I came up with an approach I would like to discuss and implement. But first I would have some questions about that:
My idea is to create a new table:
id sql_fwd sql_bwd keys values user timestamp
-------------------------------------------------------------------------
1 UPDATE... UPDATE... status 5 14 12345678
2 UPDATE... UPDATE... status 4 7 12345678
The flow would look like this in my mind:
At first I would select something or more from the DB:
SELECT keys FROM ticket;
Then I display the data in 2 input fields:
<input name="key" value="value" />
<input type="hidden" name="key" value="value" />
Hit submit and give it to my function:
I would start with a SELECT again: SELECT * FROM ticket;
and make sure that the hidden input field == the value from the latest select. If so I can proceed and know that no other user has changed something in the meanwhile. If the hidden field does not match I bring the user back to the form and display a message.
Next I would build the SQL Queries for the action and also the query to undo those changes.
$sql_fwd = "UPDATE ticket
SET idticket_status = 1
WHERE idticket = '".$c_get['id']."';";
$sql_bwd = "UPDATE ticket
SET idticket_status = 0
WHERE idticket = '".$c_get['id']."';";
Having that I run the UPDATE on ticket and insert a new entry in my new table for logging.
With that I can try to catch possible overwrites while two users are editing the same ticket in the same time and for my history I could simply look up the keys and values and generate some kind of list. Also having the SQL_BWD I simply can undo changes.
My questions to that would be:
Would it be noticeable doing an additional select everytime I want to update something?
Do I lose some benefits I would have with triggers?
Are there any big disadvantages
Are there any functions on my mysql server or with php which already do something like that?
Or is there might be a much easier way to do something like that
Is maybe a slight change to my trigger I've now already enough?
If I understad this right MySQL is only performing an update if the value has changed but the trigger is executed anyways right?
If I'm able to change the trigger, can I still prevent somehow the overwriting of data while 2 users try to edit the ticket the same time on the mysql server or would I do this anyways with PHP?
Thank you for the help already
Another approach...
When a worker starts to make a change...
Store the time and worker_id in the row.
Proceed to do the tasks.
When the worker finishes, fetch the last worker_id that touched the record; if it is himself, all is well. Clear the time and worker_id.
If, on the other hand, another worker slips in, then some resolution is needed. This gets into your concept that some things can proceed in parallel.
Comments could be added to a different table, hence no conflict.
Changing the priority may not be an issue by itself.
Other things may be messier.
It may be better to have another table for the time & worker_ids (& ticket_id). This would allow for flagging that multiple workers are currently touching a single record.
As for History versus Current, I (usually) like to have 2 tables:
History -- blow-by-blow list of what changes were made, when, and by whom. This is table is only INSERTed into.
Current -- the current status of the ticket. This table is mostly UPDATEd.
Also, I prefer to write the History directly from the "database layer" of the app, not via Triggers. This gives me much better control over the details of what goes into each table and when. Plus the 'transactions' are clear. This gives me confidence that I am keeping the two tables in sync:
BEGIN; INSERT INTO History...; UPDATE Current...; COMMIT;
I've answered a similar question before. You'll see some good alternatives in that question.
In your case, I think you're merging several concerns - one is "storing an audit trail", and the other is "managing the case where many clients may want to update a single row".
Firstly, I don't like triggers. They are a side effect of some other action, and for non-trivial cases, they make debugging much harder. A poorly designed trigger or audit table can really slow down your application, and you have to make sure that your trigger logic is coordinated between lots of developers. I realize this is personal preference and bias.
Secondly, in my experience, the requirement is rarely "show the status of this one table over time" - it's nearly always "allow me to see what happened to the system over time", and if that requirement exists at all, it's usually fairly high priority. With a ticketing system, for instance, you probably want the name and email address of the users who created, and changed the ticket status; the name of the category/classification, perhaps the name of the project etc. All of those attributes are likely to be foreign keys on to other tables. And when something does happen that requires audit, the requirement is likely "let me see immediately", not "get a database developer to spend hours trying to piece together the picture from 8 different history tables. In a ticketing system, it's likely a requirement for the ticket detail screen to show this.
If all that is true, then I don't think history tables populated by triggers are a good idea - you have to build all the business logic into two sets of code, one to show the "regular" application, and one to show the "audit trail".
Instead, you might want to build "time" into your data model (that was the point of my answer to the other question).
Since then, a new style of data architecture has come along, known as CQRS. This requires a very different way of looking at application design, but it is explicitly designed for reactive applications; these offer much nicer ways of dealing with the "what happens if someone edits the record while the current user is completing the form" question. Stack Overflow is an example - we can see, whilst typing our comments or answers, whether the question was updated, or other answers or comments are posted. There's a reactive library for PHP.
I do understand that disk space is cheap and I should not worry about it but in order to retrieve some kind of log or nice looking history for the user is painful, at least for me.
A large history table is not necessarily a problem. Huge tables only use disk space, which is cheap. They slow things down only when making queries on them. Fortunately, the history is not something you'd use all the time, most likely it is only used to solve problems or for auditing.
It is useful to partition the history table, for example by month or week. This allows you to simply drop very old records, and more important, since the history of the previous months has already been backed up, your daily backup schedule only needs to backup the current month. This means a huge history table will not slow down your backups.
With that I can try to catch possible overwrites while two users are editing the same ticket in the same time
There is a simple solution:
Add a column "version_number".
When you select with intent to modify, you grab this version_number.
Then, when the user submits new data, you do:
UPDATE ...
SET all modified columns,
version_number=version_number+1
WHERE ticket_id=...
AND version_number = (the value you got)
If someone came in-between and modified it, then they will have incremented the version number, so the WHERE will not find the row. The query will return a row count of 0. Thus you know it was modified. You can then SELECT it, compare the values, and offer conflict resolution options to the user.
You can also add columns like who modified it last, and when, and present this information to the user.
If you want the user who opens the modification page to lock out other users, it can be done too, but this needs a timeout (in case they leave the window open and go home, for example). So this is more complex.
Now, about history:
You don't want to have, say, one large TEXT column called "comments" where everyone enters stuff, because it will need to be copied into the history every time someone adds even a single letter.
It is much better to view it like a forum: each ticket is like a topic, which can have a string of comments (like posts), stored in another table, with the info about who wrote it, when, etc. You can also historize that.
The drawback of using a trigger is that the trigger does not know about the user who is logged in, only the MySQL user. So if you want to record who did what, you will have to add a column with the user_id as I proposed above. You can also use Rick James' solution. Both would work.
Remember though that MySQL triggers don't fire on foreign key cascade deletes... so if the row is deleted in this way, it won't work. In this case doing it in the application is better.

RSS aggregator; how to insert only new items

A tutorial here shows how to build an agregator in PHP but I'm having some trouble finding the best way not to insert the same items in my database.
If I were to run the script on http://visualwebsiteoptimizer.com/split-testing-blog/feed/ and then run it again in 5 minutes it'll just insert the same items again.
That tutorial just has an interval time specified in wich it will reload the RSS feed and save all the items.
I was wondering if RSS implement some request header that will only send the items after a certain date. I see here that I could use lastBuildDate and mabe ignore channels that have a date older than last fetched but it doesn't say if that is mandatory.
My question here is: how can I check RSS feeds regularly and insert it in a database without inserting the same item more than once?
I'm thinking the only way to do it is to check if a record already exist using link and only insert if it doesn't exist already. I know link is optional but I won't save items that don't have one anyway. This seems a bit inefficient though; checking before every insert might be fine in the beginning but when the database starts filling up it might get very slow.
You might have to use a few different strategies depending on how well the site you are consuming has implemented the spec.
First I would try adding a unique index on the database for the GUID value, GUIDs by there nature should be unique, http://en.wikipedia.org/wiki/Globally_unique_identifier - then depending on which DB you are using you should be able to use syntax like INSERT IGNORE INTO... or INSERT ... ON DUPLICATE KEY UPDATE... and just have the update syntax not really do anything
If some sites don't have a guid field (I am assuming you will end up consuming more than just the example) you could add the unique on the siteId field and the either the time or the title, both are less than ideal of course contacting the site own to get them to implement a guid might work too ;)
You could also run an md5 hash on the post content and store that alongside the post, that should stop duplicates too.
How big are you expecting the DB to get? with proper indexing I would have thought that it would have to be huge before it runs slow; indexes on siteId, guid, time and/or hash and limited to just 1 row and just the rowId should be quick enough, epscialyl if you can get your script to run commandline / on a cron job rather than through a webserver

Save changes in contenteditable with timestamp

I have a <div contenteditable="true" /> that the user can write in that is pretty much unlimited in length.
The data in the div is saved on change with a timestamp in a MySQL database.
Now my goal is to have a little note on the left that tells the user when each part of the document has been created (resolution should be in days).
Now, the question is: How can I save the information (what part has changed when) best?
I considered the following options so far which both seem improvable:
Every time the user visits the site at the end of the document I insert a flag (e.g. an emtpy span with a class and a data attribute that stores the start of editing). This flag is then saved into the database when the save script is called. This option would make it very easy to show the date on the side - I would just put them on the same height as the empty span and the span tells me the date. Downsides are: The user might accidently delete the timestamp span and if the user doesn't close the window for a long time no new timespan spans are inserted (this could probably be avoid by inserting new timestamp spans every X minutes so the deleting part is more relevant)
Trying to do a string diff comparision each time the data is passed to the saving script and only save the diff with a timestamp. Then when the page is loaded put all parts together in the right order and in Javascript put the date notes in the right place. This sounds like a lot of overhead for me though + when older parts are changed two parts could become one, etc. All in all this options sounds very complicated.
Any input / ideas / suggestions highly appreciated!
What you are trying to implement is the feature called "annotate" or "blame" in the source code control world (though you just want the date of update rather than date+author or simply author).
To do that properly, you need a way to make diffs (php-diff might do the job) and obtain the versions of the text.
There are several strategy:
store the latest version and keep only deltas (such as unified diffs, my preference)
store all the versions and compute deltas on the fly
Once you have your current version and the list of deltas (you can definitely shorten the list if more than say a few dozen delta and let the user ask more if really important). You compose the deltas together, this is where the annotation phase happens as you can do this part remembering from which version comes each line. Composing is pretty simple in fact (start from latest, all lines added in the patch leading to it are from the latest, the other need to be explained, so start again with next patch until you reach the most ancient patch that you want to treat, the remaining lines are from that version or earliest so some flag saying 'at or before ' can be used).
Google has a diff library for Javascript so could do all the hard work on user machine. They have the patch part in this library as well. I did not find an annotate/blame library.
One way that you could do it is by having a table for revisions of the div. Every so often, you can add a new entry into this table, containing the content and the timestamp, and therefore keep track of all the revisions of editing. Then, to find out what is changed, you can simply compare two of the entries to find out what has been added, stayed the same, and removed.
You should save this information when the user submits the information. The moment that the user wants to see the information, there should be hardly any computation.
In the backend you create two tables. In one table, lets call it 'currentdocs', you store always the latest version of the data. When the user loads the document, all the information is coming from this table 'currentdocs'.
In the other table, lets call it 'docsintime', you save every new save. It has a foreign key to the table 'currentdocs' and you can find the last row in this table 'docsintime' by selecting the maximum id with that foreign key. A select statement can be something like:
select id from docsintime where cur_key = x order desc limit 1;
In both tables do you store the for each relevant part the latest timestamp that it has been changed.
When a new document is saved, you get that last saved version in the table 'docsintime'. You compare all relevant parts with the data from that record. If it does not differ then you copy the timestamp of that relevant part in the new record to be saved. If it does differ then do you create a new timestamp for that relevant part.
After the comparison do you save the new record in both tables 'currentdocs' and 'docsintime'. You update the table 'currentdocs' and insert a new record in the table 'docsintime'. Only with a new document will you make an insertion in the table 'currentdocs'.
With the next request for that document will you only have to collect the information from the table 'currentdocs'. And the process starts all over again.

Display latest search results from MySQL with PHP

Google unfortunately didn't seem to have the answers I wanted. I currently own a small search engine website for specific content using PHP GET.
I want to add a latest searches page, meaning to have each search recorded, saved, and then displayed on another page, with the "most searched" at the top, or even the "latest search" at the top.
In short: Store my latest searches in a MySQL database (or anything that'll work), and display them on a page afterwards.
I'm guessing this would best be accomplished with MySQL, and then I'd like to output it in to PHP.
Any help is greatly appreciated.
Recent searches could be abused easily. All I have to do is to go onto your site and search for "your site sucks" or worse and they've essentially defaced your site. I'd really think about adding that feature.
In terms of building the most popular searches and scaling it nicely I'd recommend:
Log queries somewhere. Could be a MySQL db table but a logfile would be more sensible as it's a log.
Run a script/job periodically to extract/group data from the log
Have that periodic script job populate some table with the most popular searches
I like this approach because:
A backend script does all of the hard work - there's no GROUP BY, etc made by user requests
You can introduce filtering or any other logic to the backend script and it doesn't effect user requests
You don't ever need to put big volumes of data into the database
Create a database, create a table (for example recent_searches) and fields such as query (the query searched) and timestamp (unix timestamp that the query was made) said, then for your script your MySQL query will be something like:
SELECT * FROM `recent_searches` ORDER BY `timestamp` DESC LIMIT 0, 5
This should return the 5 most recent searches, with the most recent one appearing first.
Create table (something named like latest_searches) with fields query, searched_count, results_count.
Then after each search (if results_count>0), check, if this search query exists in that table. And update or insert new line into table.
And on some page you can just use data from this table.
It's pretty simple.
Ok, your question is not yet clear. But I'm guessing that you mean you want to READ the latest results first.
To achieve this, follow these steps:
When storing the results use an extra field to hold DATETIME. So your insert query will look like this:
Insert into Table (SearchItem, When) Values ($strSearchItem, Now() )
When retrieving, make sure you include an order by like this:
Select * from Table Order by When Desc
I hope this is what you meant to do :)
You simply store the link and name of the link/search in MySQL and then add a timestamp to record what time sb searched for them. Then you pull them out of the DB ordered by the timestamp and display them on the website with PHP.
Create a table with three rows: search link timestamp.
Then write a PHP script to insert rows when needed (this is done when the user actually searches)
Your main page where you want stuff to be displayed simply gets the data back out and puts them into a link container $nameOfWebsite
It's probably best to use a for/while loop to do step 3
You could additionally add sth like a counter to know what searches are the most popular / this would be another field in MySQL and you just keep updating it (increasing it by one, but limited to the IP)

Tracking data changes

I work on a market research database centric website, developed in PHP and MySQL.
It consists of two big parts – one in which users insert and update own data (let say one table T with an user_id field) and another in which an website administrator can insert new or update existing records (same table).
Obviously, in some cases end users will have their data overridden by the administrator while in other cases, administrator entered data is updated by end users (it is fine both ways).
The requirement is to highlight the view/edit forms with (let’s say) blue if end user was the last to update a certain field or red if the administrator is to “blame”.
I am looking into an efficient and consistent method to implement this.
So far, I have the following options:
For each record in table T, add another one ( char(1) ) in which write ‘U’ if end user inserted/updated the field or ‘A’ if the administrator did so. When the view/edit form is rendered, use this information to highlight each field accordingly.
Create a new table H storing an edit history containing something like user_id, field_name, last_update_user_id. Keep table H up-to-date when fields are updated in main table T. When the view/edit form is rendered, use this information to highlight each form field accordingly.
What are the pros/cons of these options; can you suggest others?
I suppose it just depends how forward-looking you want to be.
Your first approach has the advantage of being very simple to implement, is very straightforward to update and utilize, and also will only increase your storage requirements very slightly, but it's also the extreme minimum in terms of the amount of information you're storing.
If you go with the second approach and store a more complete history, if you need to add an "edit history" in the future, you'll already have things set up for that, and a lot of data waiting around. But if you end up never needing this data, it's a bit of a waste.
Or if you want the best of both worlds, you could combine them. Keep a full edit history but also update the single-character flag in the main record. That way you don't have to do any processing of the history to find the most recent edit, just look at the flag. But if you ever do need the full history, it's available.
Personally, I prefer keeping more information than I think I'll need at the time. Storage space is very cheap, and you never know when it's going to come in handy. I'd probably go even further than what you proposed, and also make it so the edit history keeps track of what they changed, and the before/after values. That can be very handy for debugging, and could be useful in the future depending on the project's exact needs.
Yes, implement an audit table that holds copies of the historical data, by/from whom &c. I work on a system currently that keeps it simple and writes the value changes as simple name-value string pairs along with date and by whom. It requires mandatory master record adjustment, but works well for tracking. You could implement this easily with a trigger.
The best way to audit data changes is through a trigger on the database table. In your case you may want to just update the last person to make the change. Or you may want a full auditing solution where you store the previous values making it easy to restore them if they were made in error. But the key to this is to do this on the database and not through the application. Database changes are often made through sources other than the application and you will want to know if this happened as well. Suppose someone hacked into the database and updated the data, wouldn't you like to be able to find the old data easily or know who did it even if he or she did it through a query window and not through the application? You might also need to know if the data was changed through a data import if you ever have to get large amounts of data at one time.

Categories