File+database transaction safety - php

I have a MySQL table which basically serves as a file index. The primary key of each record is also the name of a file in a directory on my web host.
When the user wants to delete a file from the system, I want to ensure some kind of transaction safety, i.e. if something goes wrong while deleting the file the record is not erased, and if for some reason the database server dies the file won't be erased. Either event occurring would be very unlikely, but if there's even the slightest chance of a problem I want to prevent it.
Unfortunately I have absolutely no idea how to implement this. Would I need to work out which is less likely to fail, and simply assume that it never will? Are there any known best practices for this?
Oh and here's the kicker - my web host only supports MyISAM tables, so no MySQL transactions for me.
In case it matters, I'm using PHP as my server-side scripting language.

Whether the file is "deleted" from the DB via an UPDATE or a DELETE of a row, the problem is the same -- the database and file operations are not atomic. Neither an UPDATE nor a DELETE is safer than the other; they're both transactional in the database, whereas the file operation is not.
The solution is that there is never any conflict as to the state of the data. Only one source is considered "the truth" and the other reflects that truth. That way if there's ever an inconsistency between the two, you know what the "truth" is. In fact, there is never a "logical" inconsistency, only the aftermath manifested by physical artifacts on the disk.
In most cases, the Database is a better representation of The Truth.
Here's the truth table:
File exists    DB record exists    Truth
Yes            No                  File does not exist
Yes            Yes                 File does exist
No             Yes                 File does exist, but it's in error
No             No                  File does not exist
Operationally, here's how this works.
To create a file, copy the file to the final destination, then make an entry in the DB.
If the file copy fails, you don't update the DB.
If the file copy succeeds, but the DB is not updated, the file "does not exist", so back to step one.
If the file copy succeeds and the DB update succeeds, then everything is A-OK
To delete a file, first update the DB to show the file is deleted.
If the DB update succeeds, then delete the actual file.
If the DB update does not succeed, then do not delete the file.
If the file delete fails, no problem -- the file is still "deleted" because the DB says so.
If you follow the workflow, there's "no way" that the file should be missing while the DB says it exists. If the file does go missing, you have an undefined state that you will need to resolve. But this shouldn't happen barring someone walking on your file system.
The DB transactions help keep things honest.
Occasionally, as Jonathan mentioned, you should run some kind of scavenging/syncing process to make sure there aren't any rogue files. But even then, that's really not an issue save for file space, especially if the names of the actual files have nothing to do with the original file names (i.e. they're synthetic file names). That way you don't have to worry about overwrites etc.
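Here is a minimal PHP/PDO sketch of that workflow. The files table, its is_deleted column and the $pdo connection are placeholders of mine, not something from the original question:

<?php
// Assumed schema: files(filename VARCHAR PRIMARY KEY, is_deleted TINYINT NOT NULL DEFAULT 0)

// Create: copy the file to its final destination first, then record it in the DB.
function createFile(PDO $pdo, $tmpPath, $filename, $dir)
{
    if (!move_uploaded_file($tmpPath, $dir . '/' . $filename)) {
        return false;                    // copy failed: DB untouched, the file "does not exist"
    }
    $stmt = $pdo->prepare('INSERT INTO files (filename) VALUES (?)');
    return $stmt->execute([$filename]);  // if this fails, the orphan file is swept up later
}

// Delete: mark the record deleted in the DB first, then remove the physical file.
function deleteFile(PDO $pdo, $filename, $dir)
{
    $stmt = $pdo->prepare('UPDATE files SET is_deleted = 1 WHERE filename = ?');
    if (!$stmt->execute([$filename])) {
        return false;                    // DB not updated: leave the file alone
    }
    @unlink($dir . '/' . $filename);     // if this fails, the file is still "deleted" per the DB
    return true;
}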

In the circumstances, I think I'd use a logical deletion mechanism where a record is marked as deleted even though it is still present in the database, and the file is still present in the file system. I might move the file to a new location when it is logically deleted (think 'recycle bin') and update the record with the new location as well as the 'logically deleted' marker.
Then, sometime later, you could do an offline scavenge operation to physically delete files and records marked as logically deleted.
This reduces the risk to the live data. It slightly complicates the SQL, but a view might work - rename the main table, then create a view with the same name as the main table used to have, but with criteria that eliminate logically deleted records:
CREATE VIEW MainTable(...) AS
SELECT * FROM RenamedTable WHERE DeleteFlag = 'N';
Even moving to a host that provides MySQL transactions is not a huge help. You would need a transaction manager that can run two-phase commit protocols between the file system and the DBMS, which is non-trivial.
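A rough SQL sketch of the logical-delete-and-scavenge flow described above. RenamedTable and DeleteFlag come from the view example; FileName, RecyclePath and the literal values are placeholders of mine:

-- Logical delete: flag the record (the application also moves the file to the recycle area).
UPDATE RenamedTable
   SET DeleteFlag = 'Y',
       RecyclePath = '/recycle/some-file'
 WHERE FileName = 'some-file';

-- Later, an offline scavenge job lists the flagged files, removes them from disk, then:
SELECT FileName, RecyclePath FROM RenamedTable WHERE DeleteFlag = 'Y';
DELETE FROM RenamedTable WHERE DeleteFlag = 'Y';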

You can create a Status column (or an "is_active" column) in the File table with two values: 0=Active, 1=Deleted.
When a user deletes a file, only the Status field is changed and the file remains intact.
When a user browses files, only files with Status=0 are shown.
The Administrator can view/delete files with Status=1.
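A quick sketch of the queries this implies; the files table and its columns are assumed names:

-- "Delete": flip the flag only; the physical file stays where it is.
UPDATE files SET status = 1 WHERE id = 123;

-- Normal browsing shows only active files.
SELECT * FROM files WHERE status = 0;

-- The administrator reviews (and later physically purges) the rest.
SELECT * FROM files WHERE status = 1;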

Related

I'm hitting a race condition in my Laravel application when trying to conditionally INSERT or UPDATE, any suggestions...?

My users need to be able to upload files to my site, so I've implemented a file uploader widget on the frontend. It allows for multiple uploads at once, and each upload triggers code one file at a time to save the file to the DB.
The problem is that files need to be stored as an array in a single row in the database (I know, I know... legacy reasons).
In English pseudocode, here's what's happening:
Laravel sees a new file has been uploaded
Laravel checks whether or not any files (at all) have been uploaded to this entity
No files have been uploaded yet? Create a new record to store that file.
There are already files for this entity? Update the existing record to add this file to the array.
The problem is that when multiple files are uploaded at once in quick succession for the first time, Laravel enters the first file in the database moments after the second file has already conducted its check to see if any files exist. So we end up with duplicate rows, rather than everything being merged into a single record.
If I upload 5 files at once, typically I'll get 4 rows in the database - 3 single entries and one double entry that managed to catch up in time.
Any practical ways to get around this problem? I know I should be using a many-to-one database schema here, but I've greatly simplified an already complex situation for brevity!
This is Laravel 5.2 using a MySQL InnoDB database.
Plan A: When you see one new file, wait a little while. Then look for 'all' the files and deal with them.
Plan B: Store a timestamp with the record. When another file is noticed, see if there is an existing record with a 'recent' timestamp. If so, assume it is part of the same 'group'.
Both Plans will occasionally have hiccups -- mostly because of the vague definition of "at once".
Perhaps you currently have another problem: A file is half uploaded when you grab it, and you get an incomplete file?
The 'real' answer is to have some 'message' uploaded when they are finished. Then, don't start the processing until you see it. (I realize this is probably not practical.)
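For what it's worth, here is a rough sketch of Plan B using Laravel's query builder. The entity_files table, its columns and the 10-second 'recent' window are assumptions, and the FOR UPDATE lock only narrows the race once a row exists - it does not remove it entirely:

<?php
use Carbon\Carbon;
use Illuminate\Support\Facades\DB;

// Called once per uploaded file: append to a row created recently for this
// entity, otherwise start a new row.
function storeUpload($entityId, $path)
{
    DB::transaction(function () use ($entityId, $path) {
        $recent = DB::table('entity_files')
            ->where('entity_id', $entityId)
            ->where('created_at', '>=', Carbon::now()->subSeconds(10))
            ->lockForUpdate()               // serialize concurrent appends to the same row
            ->first();

        if ($recent) {
            $files   = json_decode($recent->files, true);
            $files[] = $path;
            DB::table('entity_files')
                ->where('id', $recent->id)
                ->update(['files' => json_encode($files)]);
        } else {
            DB::table('entity_files')->insert([
                'entity_id'  => $entityId,
                'files'      => json_encode([$path]),
                'created_at' => Carbon::now(),
            ]);
        }
    });
}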

Do I need (will I ever need) LOCK IN SHARE MODE | FOR UPDATE within a transaction?

I'm painfully struggling to understand how best to write my code and queries.
Straight to the question: do I need, or will I ever need, to explicitly write LOCK IN SHARE MODE or FOR UPDATE in a transaction (apart from READ UNCOMMITTED ones)?
And if I have foreign keys, do I need to select the referenced rows explicitly to lock them, or is the foreign key definition enough?
The short answer: absolutely yes.
The complete answer: it depends on the use case. In most scenarios the default locks used by InnoDB are sufficient; they make sure your data is consistent within a transaction. But here's a scenario that needs an explicit SELECT ... FOR UPDATE lock:
Consider a web application in which session data is stored in the database. Race conditions are a concern when it comes to session data. That concern is handled for you when files are used to store session data (PHP locks the session file), but if you move sessions to the database it's your duty to make sure concurrent requests won't overwrite each other's session changes. In this scenario you need to read the session data from MySQL using FOR UPDATE, so that other requests have to wait for this one to write the session back and commit the transaction before they can read it.
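For example, a database-backed session handler might do something along these lines (table and column names are illustrative):

START TRANSACTION;

-- Other requests for the same session block here until this transaction commits.
SELECT data FROM sessions WHERE session_id = 'abc123' FOR UPDATE;

-- ...the application reads and modifies the session data...

UPDATE sessions SET data = '...new payload...' WHERE session_id = 'abc123';
COMMIT;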
[UPDATE]
Here's a use case for shared mode:
Shared mode is useful when you want to make sure that some record remains unchanged until the end of your transaction, e.g. when you are inserting a child record with a foreign key to a parent that was inserted in a previous transaction. In this case you first select the parent record locked in share mode and then insert the child, so you know the parent still exists at the moment the child is inserted. Other sessions can still read the parent record, but no one can change it. That's just off the top of my head, but in all use cases of a share-mode lock the theme is the same: you want a record to remain unchanged while it stays readable by others.
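A minimal illustration of that parent/child case (parent and child are assumed table names):

START TRANSACTION;

-- Lock the parent so it cannot be changed or deleted until we commit;
-- other sessions can still read it.
SELECT id FROM parent WHERE id = 42 LOCK IN SHARE MODE;

-- Safe to insert: the parent is guaranteed to still exist at this point.
INSERT INTO child (parent_id, payload) VALUES (42, 'something');

COMMIT;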
[UPDATE]
Regarding the SERIALIZABLE transaction isolation level, MySQL's documentation is pretty clear. If you set the isolation level this way and also SET autocommit = 0;, it behaves exactly like REPEATABLE READ with LOCK IN SHARE MODE appended to every SELECT (unless you specify FOR UPDATE explicitly). It means everything you touch, explicitly or implicitly, gets locked: rows selected without any lock clause, or with LOCK IN SHARE MODE, are locked in share mode, and the rest in exclusive mode.
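In other words, reusing the hypothetical parent table from above, these two are roughly equivalent:

-- SERIALIZABLE with autocommit disabled...
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET autocommit = 0;
SELECT * FROM parent WHERE id = 42;                      -- implicitly share-locked

-- ...behaves like REPEATABLE READ with an explicit share lock:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
SELECT * FROM parent WHERE id = 42 LOCK IN SHARE MODE;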

Doctrine entity manager and multiple threads updating database

I currently have an XHR request that can fire off N times from the client. These requests are handled by the server, and each request typically creates a new row in the database (all Doctrine / XML mappings).
Before I persist() the object, I make sure it has a unique filename (I am uploading assets). I do this by overriding persist(), calling my getUniqueFilename() and then calling parent::persist().
I get a race condition when I perform multiple XHR requests with the same filename. What happens is that multiple threads run at the exact same time, each checking the database for duplicates in order to generate the unique filename.
Upload file
Check database if filename exists
Increment filename (eg: filename_1)
However, when multiple XHR requests occur in multiple threads, a race condition occurs where multiple files are inserted into the database with the same name (filename_1 is generated multiple times).
I think the way to solve this is either
A mysql trigger
Adding a unique constraint to the table for filename, and wrapping in try/catch in code
What would you do?
Adding a unique constraint is the safest way to ensure consistent data. Anything at the PHP level could have some race conditions unless you have some other forms of locking, which will be less efficient.
You could also avoid the problem by making the file names unique using other attributes, such as the user they came from, or by keeping version history, which would just make it look like two versions were created at almost the same time.
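A sketch of the unique-constraint route. The Asset entity and its filename column are hypothetical, and UniqueConstraintViolationException requires Doctrine DBAL 2.5 or later:

<?php
use Doctrine\DBAL\Exception\UniqueConstraintViolationException;

// Schema side (one-off): ALTER TABLE asset ADD UNIQUE (filename);

try {
    $asset = new Asset();                                  // hypothetical entity
    $asset->setFilename($this->getUniqueFilename($name));
    $em->persist($asset);
    $em->flush();
} catch (UniqueConstraintViolationException $e) {
    // Another request won the race. Note that the EntityManager is closed after
    // a failed flush, so a retry needs a fresh EntityManager (or report the conflict).
}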
I would suggest using a different strategy: use the MySQL auto_increment to get an id, save the asset to the database, retrieve the id and then add it to the filename. All the other ways have drawbacks where you have to perform partial rollbacks, handle duplicated filenames, etc.
I would also suggest not using the original filename for storing the object: you run into trouble with forbidden characters on different operating systems, character encoding, and possible duplicates for some reason (e.g. because the database is case-sensitive where the file system is not). There may be other drawbacks, like maximum file name length, which you might not be aware of right now.
My solution is simply to use a MySQL auto increment as the filename. If you think about it, it makes sense: the auto increment is used as a unique identifier. If you make sure to only store objects from one table into one folder, you have no problems with identifying different assets, filenames, etc.
If you insist on keeping your way, you could make the filename unique in the database and then restart on a failing flush as you suggested.
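And a rough PDO sketch of the auto-increment-as-filename strategy from the start of this answer; the asset table, its columns and the upload directory are assumptions:

<?php
$pdo->beginTransaction();

$stmt = $pdo->prepare('INSERT INTO asset (original_name, mime_type) VALUES (?, ?)');
$stmt->execute([$originalName, $mimeType]);
$id = (int) $pdo->lastInsertId();                 // the synthetic, guaranteed-unique name

// The file is stored under the id; the original name only lives in the DB.
if (!move_uploaded_file($tmpPath, "/var/uploads/{$id}")) {
    $pdo->rollBack();                             // no file on disk, no record in the DB
    throw new RuntimeException('Upload failed');
}

$pdo->commit();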

What is the best way to do database testing (MySQL specific)?

Right now I'm testing something in a database. It's a WordPress database. I have to write to it, delete from it and do other operations on it. As you know, it has an auto-increment mechanism that always gives every new post the next highest possible ID.
Please consider that this database is a copy of a database that's in use; it has been written to before. So I need to make sure that when I finish my testing, it is the same as before.
Right now, my only solution is making backups. When I reach the end of one section of planned testing, I back the database up and start the next test on another copy of it.
Fortunately, the database is only a small one, so deleting, copying and backing it up is easy. But I know this way of testing is only a partial solution: it forces me to create too many backup copies. I don't know what I would do if the database were bigger; it would be a very long testing nightmare.
So I wonder, is there any solution that works like a rollback? It would just lock the database and put new entries into some kind of cache, which I could then either erase or write into the database.
I use MySQL and phpMyAdmin and use them to develop some custom solutions.
EDIT: How do you effectively test a database when developing a PHP solution?
If your file system supports writeable snapshots - LVM, ZFS, Veritas, etc. - you can take an instant copy of the entire database partition, mount it in another place, start a new instance of MySQL which uses the snapshot, perform your testing, then remove the snapshot - all without disturbing your replica.
The snapshot only needs storage for the amount of data that gets changed during your testing, and so might only need a few GB.
After you delete a record, when you insert the new record, if you set the ID explicitly to the number you want rather than leaving it blank, it should just fill that ID again.
Is the post number an autoincrement field? If not, hack into WordPress (temporarily) and look for the code where new posts are stored. Before saving, why don't you add a constant, e.g. 1000, to the post ID number? When you are finished with your testing, simply delete the ID numbers that are greater than your constant.

What Would be a Suitable Way to Log Changes Within a Database Using CodeIgniter

I want to create a simple auditing system for my small CodeIgniter application, such that it takes a snapshot of a table entry before the entry is edited. One way I can think of would be to create a news_audit table, which would replicate all the columns in the news table and receive a new record for each change, with an added 'date added' column. What are your views and opinions on building such functionality into a PHP web application?
There are a few things to take into account before you decide which solution to use:
If your table is large (or could become large) your audit trail needs to be in a separate table, as you describe, or performance will suffer.
If you need an audit that can't (potentially) be modified except to add new entries, the application needs INSERT permissions only on it (and to be cast-iron it needs to be on a dedicated logging server...).
I would avoid creating audit records in the same table: it might confuse another developer (who might not realize they need to filter out the old rows), and it will clutter the table with audit rows, which forces the db to cache more disk blocks than it needs to (== performance cost). Indexing this properly might also be a problem if your db does not index NULLs, and querying for the most recent version will involve a sub-query if you choose to timestamp them all.
The cleanest way to solve this, if your database supports it, is to create an UPDATE TRIGGER on your news table that copies the old values to a separate audit table (which needs only INSERT permissions). This way the logic is built into the database, so your applications need not be concerned with it; they just UPDATE the data and the db takes care of keeping the change log. The body of the trigger will just be an INSERT statement, so if you haven't written one before it should not take long to do.
If I knew which db you are using I might be able to post an example...
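Assuming MySQL (the usual pairing with CodeIgniter) and a news table with just id, title and body, the trigger might look something like this; adjust the columns to your real schema:

CREATE TABLE news_audit (
    audit_id   INT AUTO_INCREMENT PRIMARY KEY,
    news_id    INT NOT NULL,
    title      VARCHAR(255),
    body       TEXT,
    changed_at DATETIME NOT NULL
);

DELIMITER //
CREATE TRIGGER news_before_update
BEFORE UPDATE ON news
FOR EACH ROW
BEGIN
    -- Copy the pre-update values into the audit table.
    INSERT INTO news_audit (news_id, title, body, changed_at)
    VALUES (OLD.id, OLD.title, OLD.body, NOW());
END//
DELIMITER ;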
What we do (and you would want to set up archiving beforehand, depending on size and use) is create an audit table that stores user information, the time, and then the changes as XML along with the table name.
If you are on SQL Server 2005+ you can then easily search the XML for changes if needed.
We then added triggers to our table to catch what we wanted to audit (inserts, deletes, updates...)
Then with simple serialization we are able to restore and replicate changes.
What scale are we looking at here? On average, are entries going to be edited often or infrequently?
Depending on how many edits you expect for the average item, it might make more sense to store diffs of large blocks of data as opposed to a full copy of the data.
One way I like is to put it into the table itself. You would simply add a 'valid_until' column. When you "edit" a row, you make a copy of it and stamp the 'valid_until' field on the old row. The valid rows are the ones without 'valid_until' set. In short, you make it copy-on-write. Don't forget to make your primary key a combination of the original primary key and the valid_until field. Also set up constraints or triggers to make sure that for each ID there can be only one row that does not have its valid_until set.
This has upsides and downsides. The upside is fewer tables. The downside is far more rows in your tables. I would recommend this structure if you often need to access old data. By adding a simple WHERE to your queries you can query the state of the table at a previous date/time.
If you only need to access your old data occasionally then I would not recommend this though.
You can take this all the way to the extreme by building a temporal database.
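A sketch of what an 'edit' looks like under that copy-on-write scheme, using an assumed news table; the columns and timestamps are illustrative only:

START TRANSACTION;

-- Close off the current version of row 42...
UPDATE news SET valid_until = NOW()
 WHERE id = 42 AND valid_until IS NULL;

-- ...and insert the new version as the live row.
INSERT INTO news (id, title, body, valid_until)
VALUES (42, 'New title', 'New body', NULL);

COMMIT;

-- Current state of the table:
SELECT * FROM news WHERE valid_until IS NULL;

-- State as of a past moment (assumes rows also carry a created_at column):
SELECT * FROM news
 WHERE created_at <= '2015-06-01'
   AND (valid_until IS NULL OR valid_until > '2015-06-01');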
In small to medium-sized projects I use the following set of rules:
All code is stored under Revision Control System (i.e. Subversion)
There is a directory for SQL patches in source code (i.e. patches/)
All files in this directory start with serial number followed by short description (i.e. 086_added_login_unique_constraint.sql)
All changes to the DB schema must be recorded as separate files. No file can be changed after it's checked in to the version control system. All bugs must be fixed by issuing another patch. It is important to stick closely to this rule.
A small script remembers the serial number of the last executed patch in the local environment and runs subsequent patches when needed (a sketch follows below).
This way you can guarantee that you can recreate your DB schema easily without needing to import a whole data dump. Creating such patches is a no-brainer: just run the command in the console/UI/web frontend and copy-paste it into a patch file if it succeeds. Then add it to the repo and commit the changes.
This approach scales reasonably well. It worked for a PHP/PostgreSQL project consisting of 1300+ classes and 200+ tables/views.
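A minimal PHP/PDO sketch of that patch-runner script; the schema_version table, the paths and the one-statement-per-patch assumption are mine, not part of the original setup:

<?php
// Applies any patches in patches/ that have not yet been run locally.
// A one-row table, schema_version(last_patch INT), remembers where we are.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

$last = (int) $pdo->query('SELECT last_patch FROM schema_version')->fetchColumn();

$patches = glob(__DIR__ . '/patches/*.sql');
sort($patches);                                    // relies on the zero-padded serial prefix

foreach ($patches as $patch) {
    $serial = (int) basename($patch);              // "086_added_login_unique_constraint.sql" -> 86
    if ($serial <= $last) {
        continue;                                  // already applied here
    }
    $pdo->exec(file_get_contents($patch));         // assumes one statement per patch file
    $pdo->prepare('UPDATE schema_version SET last_patch = ?')->execute([$serial]);
    echo 'applied ' . basename($patch) . PHP_EOL;
}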
