Implementing order in a PHP/MySQL CMS & dealing with concurrency - php

I have the following tables:
=======================        =======================
| galleries           |        | images              |
|---------------------|        |---------------------|
| PK | gallery_id     |<---+   | PK | image_id       |
|    | name           |    |   |    | title          |
|    | description    |    |   |    | description    |
|    | max_images     |    |   |    | filename       |
=======================    +---| FK | gallery_id     |
                               =======================
I need to implement a way for the images that are associated with a gallery to be sorted into a specific order. It is my understanding that relational databases are not designed for hierarchical ordering.
I also wish to prepare for the possibility of concurrency, even though it is highly unlikely to be an issue in my current project, as it is a single-user app. (So, the priority here is dealing with allowing the user to rearrange the order).
I am not sure of the best way to go about this, as I have never implemented ordering in a database and am new to concurrency. Because of this I have read about locking MySQL tables, but I am not sure whether this is a situation where I should use it.
Here are my two ideas:
1) Add a column named order_num to the images table. Lock the table and allow the client to rearrange the order of the images, then update the table and unlock it.
2) Add a column named order_num to the images table (just as in idea 1 above). Allow the client to update one image's place at a time without locking.
Thanks!

Here's my thought: you don't want to put too many man-hours into a problem that isn't likely to happen. Therefore, take a simple solution that's not going to cause a lot of side effects, and fix it later if it's a problem.
In a web-based world, you don't want to lock a table for a user to do edits and then wait until they're done to unlock the table. User 1 in this scenario may never come back, they may lose their session, or their browser could crash, etc. That means you have to do a lot of work to figure out when to unlock the table, plus code to let user 2 know that the table's locked, and they can't do anything with it.
I'd suggest this design instead: let them both go into edit mode, probably in their browser, with some javascript. They can drag images around in order until they're happy, then they submit the order in full. You update your order_num field in a single transaction to the database.
In this scenario the worst thing that happens is that user 1 and user 2 are editing at the same time, and whoever edits last is the one whose order is preserved. Maybe they update at the exact same time, but the database will handle that, as it's going to queue up transactions.
The downside is that whoever got their order overwritten has to do it again. Annoying, but there's no loss, and the code to implement this is much simpler than the code to handle locking.
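For what it's worth, a minimal sketch of that single-transaction update, assuming a PDO connection in $pdo and that the client submits the full list of image ids in their new order (both names are assumptions, not from the question):

<?php
// $orderedImageIds is the full ordering the client submitted, e.g. [7, 3, 9, 2],
// coming from the drag-and-drop UI described above.
function saveGalleryOrder(PDO $pdo, int $galleryId, array $orderedImageIds): void
{
    $pdo->beginTransaction();
    try {
        $stmt = $pdo->prepare(
            'UPDATE images SET order_num = :pos
             WHERE image_id = :id AND gallery_id = :gallery'
        );
        foreach ($orderedImageIds as $position => $imageId) {
            $stmt->execute([
                ':pos'     => $position + 1,   // 1-based order
                ':id'      => $imageId,
                ':gallery' => $galleryId,
            ]);
        }
        $pdo->commit();   // the whole reordering becomes visible at once
    } catch (Throwable $e) {
        $pdo->rollBack(); // leave the previous order intact on failure
        throw $e;
    }
}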
I hate to sidestep your question, but that's my thoughts about it.

If you don't want "per-user sorting", the order_num column seems the right way to go.
If you choose InnoDB as your storage engine you can use transactions and won't have to lock the table.

Relational database and hierarchy:
I use id (auto increment) and parent columns to achieve hierarchy. A parent of zero is always the root element. You could order by id, parent.
Concurrency:
This is an easy way to deal with concurrency. Use a version column. If the version has changed since user 1 started editing, block the save and offer to reload and re-edit. Increment the version after each successful edit.
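A minimal sketch of that version check with PDO (the table and column names are just illustrations):

<?php
// Optimistic concurrency: the UPDATE only succeeds if the row still has the
// version the user started editing from; otherwise we report a conflict.
function saveWithVersionCheck(PDO $pdo, int $imageId, string $newTitle, int $versionSeen): bool
{
    $stmt = $pdo->prepare(
        'UPDATE images
            SET title = :title, version = version + 1
          WHERE image_id = :id AND version = :seen'
    );
    $stmt->execute([
        ':title' => $newTitle,
        ':id'    => $imageId,
        ':seen'  => $versionSeen,
    ]);

    // 0 affected rows means someone else saved first: block the save and
    // offer the user a reload/re-edit, as described above.
    return $stmt->rowCount() === 1;
}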


Postgres - UPDATEs become slower over time

I have a table like this (more columns but these will do):
events
+----------+----------------+--------------------+------------------+------------------+---------+
| event_id | user_ipaddress | network_userid | domain_userid | user_fingerprint | user_id |
+----------+----------------+--------------------+------------------+------------------+---------+
| 1 | 127.0.0.1 | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221 | |
| 2 | 127.0.0.1 | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221 | |
| 3 | 127.0.0.1 | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221 | |
| 4 | 127.0.0.1 | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221 | |
+----------+----------------+--------------------+------------------+------------------+---------+
The table contains around 1M records. I'm trying to update all records to set the user_id.
I'm using a very simple PHP script for that.
I'm looping over each record with user_id = NULL and SELECT from the entire table to find existing user_id based on user_ipaddress, network_userid, domain_userid and/or user_fingerprint.
If nothing was found I will generate a unique user_id and UPDATE the record.
If a match was found I will UPDATE the record with the corresponding user_id.
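For reference, a rough PHP sketch of the loop described above, assuming a PDO connection in $pdo and simplifying the matching rules:

<?php
// Walk the rows that still have no user_id, look for an existing user_id on
// matching identifiers, otherwise generate a new one, then UPDATE the row.
$rows = $pdo->query("SELECT event_id, user_ipaddress, network_userid,
                            domain_userid, user_fingerprint
                       FROM events
                      WHERE user_id IS NULL");

$find = $pdo->prepare("SELECT user_id FROM events
                        WHERE user_id IS NOT NULL
                          AND (user_ipaddress = :ip OR network_userid = :net
                               OR domain_userid = :dom OR user_fingerprint = :fp)
                        LIMIT 1");

$update = $pdo->prepare("UPDATE events SET user_id = :uid WHERE event_id = :eid");

foreach ($rows as $row) {
    $find->execute([
        ':ip'  => $row['user_ipaddress'],
        ':net' => $row['network_userid'],
        ':dom' => $row['domain_userid'],
        ':fp'  => $row['user_fingerprint'],
    ]);
    // Reuse a matching user_id, or generate a new unique one (however you prefer).
    $userId = $find->fetchColumn() ?: uniqid('u', true);

    $update->execute([':uid' => $userId, ':eid' => $row['event_id']]);
}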
The query looks like this:
UPDATE events SET user_id = 'abc' WHERE event_id = '1'
The SELECT part is super fast (~5ms).
The UPDATE part starts fast (~10ms) but becomes slower (~800ms) after a few hundred updates.
If I wait for around 10-20 minutes it becomes fast again.
I'm running a PostgreSQL 9.3.3 on AWS RDS (db.m1.medium) with General Purpose SSD storage.
I have indexes on all columns combined and individually.
I have played with FILLFACTOR and currently it's set to 70. I have tried to run VACUUM FULL events, but I never know if it finished (waited more than 1h). Also I've tried REINDEX TABLE events.
I'm the only one using this server.
Here's an EXPLAIN ANALYZE of the UPDATE query:
Update on events  (cost=0.43..8.45 rows=1 width=7479) (actual time=0.118..0.118 rows=0 loops=1)
  ->  Index Scan using events_event_id_idx on events  (cost=0.43..8.45 rows=1 width=7479) (actual time=0.062..0.065 rows=1 loops=1)
        Index Cond: (event_id = '1'::bpchar)
Total runtime: 0.224 ms
Any good ideas on how I can keep the query fast?
Over the 10-20 minutes it takes to become fast again, do you see a gradual improvement?
Things I'd check:
are you creating new connections with each update and leaving them open? They would then timeout and close sometime later.
what is the system load (CPU, memory, IO) doing? I did wonder whether the instance might support bursts, but I don't think so.
I am just guessing, but it may be because your primary key is char, not int. Try converting your primary key to int and see the result.
Your EXPLAIN ANALYZE result says Index Cond: (event_id = '1'::bpchar)
The best choice for a primary key is an integer data type, since integer values are processed faster than character values. A character data type (as a primary key) needs to be converted to its ASCII equivalent values before processing.
Fetching a record by primary key will be faster with an integer key, as more index records fit on a single page, so the total search time decreases. Joins will also be faster. But this applies when your query uses a clustered index seek rather than a scan, and when only one table is used. In the case of a scan, not having the wider column means more rows fit on one data page.
SQL Index - Difference Between char and int
I found out that the problem was caused by the filesystem chosen for my RDS instance.
I was running with General Purpose Storage (SSD). It apparently has some I/O limits. So the solution was to switch storage. Now I'm running the Provisioned IOPS Storage and the performance improved instantly.
Also a solution could be to stick to the General Purpose Storage (SSD) and increase storage size, since that would increase I/O limits as well.
Read more:
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#Concepts.Storage.GeneralSSD
Thanks for all the replies. And thanks to @Dan and @ArtemGr for pointing me in that direction.

Copying records across multiple databases using PHP

I want to do something which may sound weird. I have a database for my main application which holds a few HTML templates created using my application. These templates are stored in a traditional RDBMS style: a table for template details and another for the page details of the template.
I have a similar application for a different purpose on another domain. It has a different database with the same structure as the main app. I want to move the templates from one database to the other, with all columns intact. I cannot simply export, as both have independent content of their own, i.e. they are the same in structure but differ in content. The 1st is the template table and the 2nd is the page table:
+----+--------------+
| id | templatename |
+----+--------------+
| 1  | File A       |
| 2  | File B       |
| 3  | File C       |
| 4  | File 123     |
| .. | ........     |
+----+--------------+

+----+-----------+-------------+
| id | page_name | template_id |   (template_id is a foreign key from the table above)
+----+-----------+-------------+
| 1  | index     | 1           |
| 2  | about     | 1           |
| 3  | contact   | 2           |
| 4  |           |             |
| .. | ......... | ........    |
+----+-----------+-------------+
I want to select records from the 1st database and insert them into the other. Both are on different domains.
I thought of writing a PHP script which will use two DB connections, one to select and the other to insert into the other DB, but I want to know if I can achieve this in any other, more efficient way using the command line or an export feature.
EDIT: for better understanding
I have two databases, A and B, on different servers. Both have two tables, say tbl_site and tbl_pages. Both are independently updated on their domains via the application interface. I have a few templates created in database A, stored in tbl_site and tbl_pages as mentioned in the question above. I want the template records to be moved to database B.
You can do this in phpMyAdmin (and other query tools, but you mention PHP so I assume phpMyAdmin is available to you).
On the first database run a query to select the records that you want to copy to the second server. In the "Query results operations" section of the results screen, choose "Export" and select "SQL" as the format.
This will produce a text file containing SQL INSERT statements with the records from the first database.
Then connect to the second database and run the INSERT statements from the generated file.
As others mentioned you can use phpMyAdmin, but if your second database's table fields are different, you can write a small PHP script to do it for you. Please follow these steps.
Note : Consider two databases A and B, and you want to move some data from A to B and both are on different servers.
1) First allow remote access on database A server for the database A. Also get a host, username and password for database A.
2) Now using mysqli_ extension, connect to that database. As you have the host for the other database A server, so you have to use that, not localhost. On most servers, the host is the IP of the other remote server.
3) Query database table and get your results. After you get results, close the database connection.
4) Connect to database B. Please note that in this case, database B host may be localhost. Check your server settings for that.
5) Process the data you got from database A and insert them to database B table(s).
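A minimal sketch of those steps, with made-up credentials/hosts and assuming the tbl_site table from the question; adapt the columns to your schema:

<?php
// Step 2: connect to the remote database A (use its server IP, not localhost).
$dbA = new mysqli('203.0.113.10', 'user_a', 'secret_a', 'database_a');

// Step 3: read the templates you want to move, then close the connection.
$templates = $dbA->query('SELECT id, templatename FROM tbl_site')->fetch_all(MYSQLI_ASSOC);
$dbA->close();

// Step 4: connect to database B (often localhost on the second server).
$dbB = new mysqli('localhost', 'user_b', 'secret_b', 'database_b');

// Step 5: insert the rows into database B, letting B assign new ids, which
// avoids the duplicate-key problem mentioned in another answer.
$insert = $dbB->prepare('INSERT INTO tbl_site (templatename) VALUES (?)');
foreach ($templates as $row) {
    $insert->bind_param('s', $row['templatename']);
    $insert->execute();
    $newTemplateId = $dbB->insert_id; // use this when copying the matching tbl_pages rows
}
$dbB->close();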
I use this same method to import data from different systems (Drupal to Prestashop, Joomla to a customized system), and it works fine.
I hope this will help
Export just the data of db A (to .sql), or use a PHP script - this can then be automated if you need to do it again.
Result:
INSERT INTO table_A VALUES (1, 'File A');
....
INSERT INTO table_B VALUES (1, 'index', 1);
....
Be careful when importing the data - if any ids are the same you will get errors (keep this in mind). Make any mods to the script to solve these problems (remember that if you change an id in table_A you will have to change the foreign key in table_B). Again, this is a process you might be forced to automate.
Run the insert scripts in db B.
As my question was a bit different I preferred answering it myself. Also, the above answers are relevant in different scenarios, so I won't say they are totally wrong.
I had to run a script to make the inserts happen with new ids in the target database.
To make it a bit easier and avoid cross-domain requests to the database, I took a dump of the first database and restored it on the target.
Then I wrote a script to select records from one database and insert them into the other, i.e. the target. So the ids were taken care of automatically. The only problem (not really a problem) was that I had to run the script for each record independently.

EAV vs. Column based organization for my data

I'm in the process of rebuilding an application (lone developer here) using PHP and PostgreSQL. For most of the data, I'm storing it using a table with multiple columns for each attribute. However, I'm now starting to build some of the tables for the content storage. The content in this case, is multiple sections that each contain different data sets; some of the data is common and shared (and foreign key'd) and other data is very unique. In the current iteration of the application we have a table structure like this:
id | project_name | project_owner | site | customer_name | last_updated
-----------------------------------------------------------------------
1 | test1 | some guy | 12 | some company | 1/2/2012
2 | test2 | another guy | 04 | another co | 2/22/2012
Now, this works - but it gets hard to maintain for a few reasons. Adding new columns (happens rarely) requires modifying the database table. Audit/history tracking requires a separate table that mirrors the main table with additional information - which also requires modification if the main table is changed. Finally, there are a lot of columns - over 100 in some tables.
I've been brainstorming alternative approaches, including breaking out one large table into a number of smaller tables. That introduces other issues that I feel also cause problems.
The approach I am currently considering seems to be called the EAV model. I have a table that looks like this:
id | project_name | col_name | data_varchar | data_int | data_timestamp | update_time
--------------------------------------------------------------------------------------------------
1 | test1 | site | | 12 | | 1/2/2012
2 | test1 | customer_name | some company | | | 1/2/2012
3 | test1 | project_owner | some guy | | | 1/2/2012
...and so on. This has the advantage that I'm never updating, always inserting. Data is never over-written, only added. Of course, the table will eventually grow to be rather large. I have an 'index' table that lists the projects and is used to reference the 'data' table. However I feel I am missing something large with this approach. Will it scale? I originally wanted to do a simple key -> value type table, but realized I need to be able to have different data types within the table. This seems manageable because the database abstraction layer I'm using will include a type that selects data from the proper column.
Am I making too much work for myself? Should I stick with a simple table with a ton of columns?
My advice is that if you can avoid using an EAV table, do so. They tend to be performance killers. They are also difficult to query properly, especially for reporting (yes, let me join to this table an unknown number of times to get all of the data out of it that I need and, oh by the way, I don't know what columns I have available, so I have no idea what columns the report will need to contain). It is hard to get the kind of database constraints you need to ensure data integrity (how do you ensure that the required fields are filled in, for instance), and it can push you into using bad datatypes. It is far better in the long run to define tables that store the data you need.
If you really need the functionality, then at least look into NoSQL databases, which are more optimized for this sort of undefined data.
Moving your entire structure to EAV can lead to a lot of problems down the line, but it might be acceptable for the audit-trail portion of your problem since often foreign key relationships and strict datatyping may disappear over time anyway. You can probably even generate your audit tables automatically with triggers and stored procedures.
Note, however, that reconstructing old versions of records is non-trivial with an EAV audit trail and will require a fair amount of application code. The database will not be able to do it by itself.
An alternative you could consider is to store all your data (new and old records) in the same table. You can either include audit fields in the same table and leave NULL when unnecessary, or store some rows in the table being "current" and with audit-related fields stored in another table. To simplify your application, you can create a view which only shows current rows and issue queries against the view.
You can accomplish this with a joined table inheritance pattern. With joined table inheritance, you put common attributes into a base table along with a "type" column, and you can join to additional tables (which have the same primary key which is also a foreign key) based on type. Many Data-Mapper-Pattern ORMs have native support for this pattern, often called "polymorphism".
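A small sketch of joined table inheritance in SQL terms (the content_base/content_project table names are hypothetical, not from the question):

<?php
// Joined table inheritance: a base table holds the shared columns plus a
// "type" discriminator, and each type gets its own extension table sharing
// the base primary key.
$pdo->exec("CREATE TABLE content_base (
                id          serial PRIMARY KEY,
                type        varchar(30) NOT NULL,      -- e.g. 'project', 'report'
                name        varchar(100) NOT NULL,
                updated_at  timestamp NOT NULL DEFAULT now()
            )");

$pdo->exec("CREATE TABLE content_project (
                id            integer PRIMARY KEY REFERENCES content_base(id),
                project_owner varchar(100),
                site          integer
            )");

// Reading a 'project' row joins the base table to its type-specific table.
$stmt = $pdo->prepare(
    "SELECT b.id, b.name, b.updated_at, p.project_owner, p.site
       FROM content_base b
       JOIN content_project p ON p.id = b.id
      WHERE b.type = 'project' AND b.id = :id"
);
$stmt->execute([':id' => 1]);
$project = $stmt->fetch(PDO::FETCH_ASSOC);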
You could also use PostgreSQL's native table inheritance mechanism, but note the caveats carefully!

Soft delete best practices (PHP/MySQL)

Problem
In a web application dealing with products and orders, I want to maintain information and relationships between former employees (users) and the orders they handled. I want to maintain information and relationships between obsolete products and orders which include these products.
However I want employees to be able to de-clutter the administration interfaces, such as removing former employees, obsolete products, obsolete product groups etc.
I'm thinking of implementing soft-deletion. So, how does one usually do this?
My immediate thoughts
My first thought is to stick a "flag_softdeleted TINYINT NOT NULL DEFAULT 0" column in every table of objects that should be soft deletable. Or maybe use a timestamp instead?
Then, I provide a "Show deleted" or "Undelete" button in each relevant GUI. Clicking this button you will include soft-deleted records in the result. Each deleted record has a "Restore" button. Does this make sense?
Your thoughts?
Also, I'd appreciate any links to relevant resources.
That's how I do it. I have a is_deleted field which defaults to 0. Then queries just check WHERE is_deleted = 0.
I try to stay away from any hard-deletes as much as possible. They are necessary sometimes, but I make that an admin-only feature. That way we can hard-delete, but users can't...
Edit: In fact, you could use this to have multiple "layers" of soft-deletion in your app. So each could be a code:
0 -> Not Deleted
1 -> Soft Deleted, shows up in lists of deleted items for management users
2 -> Soft Deleted, does not show up for any user except admin users
3 -> Only shows up for developers.
Having the other 2 levels will still allow managers and admins to clean up the deleted lists if they get too long. And since the front-end code just checks for is_deleted = 0, it's transparent to the frontend...
Using soft-deletes is a common thing to implement, and they are dead useful for lots of things, like:
Saving a user's data when they deleted something
Saving your own data when you delete something
Keep a track record of what really happened (a kind of audit)
etcetera
There is one thing I want to point out that almost everyone misses, and it always comes back to bite you in the rear. The users of your application do not have the same understanding of a delete as you have.
There are different degrees of deletion. The typical user deletes stuff when (s)he:
Made a mistake and wants to remove the bad data
Doesn't want to see something on the screen anymore
The problem is that if you don't record the intention of the delete, your application cannot distinguish between erroneous data (that should never have been created) and historically correct data.
Have a look at the following data:
PRICES
+------+-------+---------+
| item | price | deleted |
+------+-------+---------+
| A    |   101 |       1 |
| B    |   110 |       1 |
| C    |   120 |       0 |
+------+-------+---------+
Some user doesn't want to show the price of item B, since they don't sell that item anymore. So he deletes it. Another user created the price for item A by mistake, so he deleted it and created the price for item C, as intended. Now, can you show me a list of the prices for all products? No, because either you have to display potentially erroneous data (A), or you have to exclude all but the current prices (C).
Of course the above can be dealt with in any number of ways. My point is that YOU need to be very clear about what YOU mean by a delete, and make sure that there is no way for the users to misunderstand it. One way would be to force the user to make a choice (hide/delete).
If I had existing code that hits that table, I would add the column and change the name of the table. Then I would create a view with the same name as the current table which selects only the active records. That way none of the existing code would break, and you could have the soft-delete column. If you want to see the deleted records, you select from the base table; otherwise you use the view.
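A minimal sketch of that rename-plus-view idea, using a hypothetical products table:

<?php
// Rename the real table, add the flag, and put a view with the old name on
// top of it so existing queries keep seeing only active rows.
$pdo->exec("RENAME TABLE products TO products_base");
$pdo->exec("ALTER TABLE products_base
                ADD COLUMN is_deleted TINYINT NOT NULL DEFAULT 0");
$pdo->exec("CREATE VIEW products AS
                SELECT * FROM products_base WHERE is_deleted = 0");

// Existing code keeps selecting from `products`; admin screens that need the
// soft-deleted rows query `products_base` directly.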
I've always just used a deleted column as you mentioned. There's really not much more to it than that. Instead of deleting the record, just set the deleted field to true.
Some components I build allow the user to view all deleted records and restore them, others just display all records where deleted = 0
Your idea does make sense and is used frequently in production but, to implement it you will need to update quite a bit of code to account for the new field. Another option could be to archive (move) the "soft-deleted" records to a separate table or database. This is done frequently as well and makes the issue one of maintenance rather than (re)programming. (You could have a table trigger react to the delete to archive the deleted record.)
I would do the archiving to avoid a major update to production code. But if you want to use deleted-flag field, use it as a timestamp to give you additional useful info beyond a boolean. (Null = not deleted.) You might also want to add a DeletedBy field to track the user responsible for deleting the record. Using two fields gives you a lot of info tells you who deleted what and when. (The two extra field solution is also something that can be done in an archive table/database.)
The most common scenario I've come across is what you describe, a tinyint or even bit representing a status of IsActive or IsDeleted. Depending on whether this is considered "business" or "persistence" data it may be baked into the application/domain logic as transparently as possible, such as directly in stored procedures and not known to the application code. But it sounds like this is legitimate business information for your needs so would need to be known throughout the code. (So users can view deleted records, as you suggest.)
Another approach I've seen is to use a combination of two timestamps to show a "window" of activity for a given record. It's a little more code to maintain it, but the benefit is that something can be scheduled to soft-delete itself at a pre-determined time. Limited-time products can be set that way when they're created, for example. (To make a record active indefinitely one could use a max value (or just some absurdly distant future date) or just have the end date be null if you're ok with that.)
Then of course there's further consideration of things being deleted/undeleted from time to time and tracking some kind of audit for that. The flag approach knows only the current status, the timestamp approach knows only the most recent window. But anything as complex as an audit trail should definitely be stored separately than the records in question.
Instead, I would use a bin table into which to move all the records deleted from the other tables. The main problem with the delete flag is that with linked tables you will eventually run into a duplicate key error when trying to insert a new record.
The bin table could have a structure like this:
id, table_name, data, date_time, user
Where
id is the primary key with auto increment
table_name is the name of the table from which the record was deleted
data contains the record in JSON format with name and value of all fields
date_time is the date and time of the deletion
user is the identifier of the user (if the system provides for it) who performed the operation
this method will not only save you from checking the delete flag in every query (imagine the ones with many joins), but will let you keep only the really necessary data in the tables, making any searches and corrections with SQL client programs easier
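A rough PHP sketch of moving a deleted row into such a bin table (the $pdo connection and the calling code are assumed; $table and $pkColumn must come from your own whitelist, never from user input):

<?php
// Move a record into the bin table instead of flagging it: copy the full row
// as JSON, then delete the original, all in one transaction.
function moveToBin(PDO $pdo, string $table, string $pkColumn, int $id, ?int $userId): void
{
    $pdo->beginTransaction();
    try {
        $row = $pdo->query("SELECT * FROM {$table} WHERE {$pkColumn} = " . (int)$id)
                   ->fetch(PDO::FETCH_ASSOC);

        $bin = $pdo->prepare(
            'INSERT INTO bin (table_name, data, date_time, user)
             VALUES (:tbl, :data, NOW(), :usr)'
        );
        $bin->execute([
            ':tbl'  => $table,
            ':data' => json_encode($row),   // full record, field names and values
            ':usr'  => $userId,
        ]);

        $pdo->exec("DELETE FROM {$table} WHERE {$pkColumn} = " . (int)$id);
        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }
}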

How to code a simple versioning system?

I want to build a simple versioning system, but I don't have ideas on how to structure my data and my code.
Here is a short example:
User logs in
User has two options when uploading a file:
Submit a new file
Submit a new version of a file
Users should be able to see the tree. (the different version)
The tree can only be up to 2 levels:
|
|--File_A_0
\--File_A_1
\--File_A_2
\--File_A_3
\--File_A_4
There are also 2 types of file: a final version (which is the latest approved version) and a draft version (which is the latest uploaded file)
The file will be physically stored on the server.
Each file is owned by a user (or more) and only one group.
Edit: Groups represent a group of documents; a document can only be owned by ONE group at a time. Users do NOT depend on groups.
Begin edit:
Here is what I did, but it is not really efficient!
id_article | relative_group_id | id_group | title | submited | date | abstract | reference | draft_version | count | status
id_draft | id_file | version | date
But it's difficult to manage and to extend.
I think it's because of the group parameter...
End edit
So the questions are:
How can I schematize my database?
What kind of info would be useful to version this work?
What kind of structure for the folders and files?
What kind of tips and hints do you have for this kind of work?
(The application is developed with PHP and Zend Framework; the database will be MySQL or PostgreSQL.)
For God's sake, don't. You really don't want to go down this road.
Stop and think about the bigger picture for a moment. You want to keep earlier versions of documents, which means that at some point, somebody is going to want to see some of those earlier versions, right? And then they are going to ask, "What's the difference between version 3 and version 7"? And then they are going to say, "I want to roll back to version 3, but keep some of the changes that I put in version 5, ummm, ok?"
Version control is non-trivial, and there's no need to reinvent the wheel-- there are lots of viable version control systems out there, some of them free, even.
In the long run, it will be much easier to learn the API of one of these systems, and code a web front-end that offers your users the subset of features they are looking for (now.)
You wouldn't code a text editor for your users, would you?
You may get inspiration from there.
Concerning your comment:
As for a database structure, you may try this kind of structure (MySQL SQL):
CREATE TABLE `Users` (
`UserID` INT NOT NULL AUTO_INCREMENT
, `UserName` CHAR(50) NOT NULL
, `UserLogin` CHAR(20) NOT NULL
, PRIMARY KEY (`UserID`)
);
CREATE TABLE `Groups` (
`GroupID` INT NOT NULL AUTO_INCREMENT
, `GroupName` CHAR(20) NOT NULL
, PRIMARY KEY (`GroupID`)
);
CREATE TABLE `Documents` (
`DocID` INT NOT NULL AUTO_INCREMENT
, `GroupID` INT NOT NULL
, `DocName` CHAR(50) NOT NULL
, `DocDateCreated` DATETIME NOT NULL
, PRIMARY KEY (`DocID`)
, INDEX (`GroupID`)
, CONSTRAINT `FK_Documents_1` FOREIGN KEY (`GroupID`)
REFERENCES `Groups` (`GroupID`)
);
CREATE TABLE `Revisions` (
`RevID` INT NOT NULL AUTO_INCREMENT
, `DocID` INT
, `RevUserFileName` CHAR(30) NOT NULL
, `RevServerFilePath` CHAR(255) NOT NULL
, `RevDateUpload` DATETIME NOT NULL
, `RevAccepted` BOOLEAN NOT NULL
, PRIMARY KEY (`RevID`)
, INDEX (`DocID`)
, CONSTRAINT `FK_Revisions_1` FOREIGN KEY (`DocID`)
REFERENCES `Documents` (`DocID`)
);
CREATE TABLE `M2M_UserRev` (
`UserID` INT NOT NULL
, `RevID` INT NOT NULL
, INDEX (`UserID`)
, CONSTRAINT `FK_M2M_UserRev_1` FOREIGN KEY (`UserID`)
REFERENCES `Users` (`UserID`)
, INDEX (`RevID`)
, CONSTRAINT `FK_M2M_UserRev_2` FOREIGN KEY (`RevID`)
REFERENCES `Revisions` (`RevID`)
);
Documents is a logical container, and Revisions contains actual links to the files.
Whenever a person uploads a new file, create an entry in each of these tables, with the one in Revisions containing a link to the one inserted in Documents.
The table M2M_UserRev allows you to associate several users with each revision of a document.
When you update a document, insert only into Revisions, with a link to the corresponding Document. To know which document to link to, you may use naming conventions, or ask the user to select the right document.
For the file system architecture of your files, it really doesn't matter. I would just rename my files to something unique before they are stored on the server, and keep the user file name in the database. Just store the files renamed in a folder anywhere, and keep the path to it in the database. This way, you know how to rename it when the user asks for it. You may as well keep the original name given by the user if you are sure it will be unique, but I wouldn't rely on it too much. You may soon see two different revisions having the same name and one overwriting the other on your file system.
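To illustrate the upload flow against that schema, a rough PHP sketch (the $pdo connection and the $groupId/$userFileName/$storedPath/$currentUserId variables are assumed to be set by the surrounding application):

<?php
// New document upload: create the Documents row, then its first Revisions row,
// then link the uploading user to that revision in M2M_UserRev.
$pdo->beginTransaction();

$pdo->prepare('INSERT INTO Documents (GroupID, DocName, DocDateCreated)
               VALUES (:grp, :name, NOW())')
    ->execute([':grp' => $groupId, ':name' => $userFileName]);
$docId = (int)$pdo->lastInsertId();

$pdo->prepare('INSERT INTO Revisions
                   (DocID, RevUserFileName, RevServerFilePath, RevDateUpload, RevAccepted)
               VALUES (:doc, :name, :path, NOW(), 0)')
    ->execute([':doc' => $docId, ':name' => $userFileName, ':path' => $storedPath]);
$revId = (int)$pdo->lastInsertId();

$pdo->prepare('INSERT INTO M2M_UserRev (UserID, RevID) VALUES (:usr, :rev)')
    ->execute([':usr' => $currentUserId, ':rev' => $revId]);

$pdo->commit();

// For a new version of an existing document, skip the Documents insert and
// reuse its DocID, as described above.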
Database schema
To keep it extremely simple, I would choose the following database design. I'm separating the "file" concept (same as a filesystem file) from the "document" concept (the hierarchical group of documents).
User entity:
userId
userName
Group entity:
groupId
groupName
File entity:
fileId (a sequence)
fileName (the name the user gives to the file)
filesystemFullPath
uploadTime
uploaderId (id of the uploader User)
ownerGroupId
Document entity:
documentId
parentDocumentId
fileId
versionNumber
creationTime
isApproved
Every time a new file is uploaded, a "File" record is created, and also a new "Document". If it's the first time that file is uploaded, parentDocumentId for that document would be NULL. Otherwise, the new document record would point to the first version.
The "isApproved" field (boolean) would handle the document being a draft or an approved revision. You get the latest draft of a document simply ordering descending by version number or upload time.
Hints
From how you describe the problem, you should analyze these aspects better before moving on to database schema design:
what is the role of the "group" entity?
how are groups/users/files related?
what if two users of different groups try to upload the same document?
will you need folders? (probably you will; my solution is still valid, giving a type, "folder" or "document", to the "document" entity)
Hope this helps.
Might an existing version-control solution work better than rolling your own? Subversion can be made to do most of what you want, and it's right there.
Creating a rich data structure in a traditional relational database such as MySQL can often be difficult, and there are much better ways of going about it. When working with a path based data structure with a hierarchy I like to create a flat-file based system that uses a data-serialization format such as JSON to store information about a specific file, directory or an entire repository.
This way you can use current available tools to navigate and manipulate the structure easily, and you can read, edit and understand the structure easily. XML is good for this too - it's slightly more verbose than JSON but easy to read and good for messaging and other XML-based systems too.
A quick example. If we have a repository that has a directory and three files. Looking at it front on it will look like this:
/repo
    /folder
    code.php
    file.txt
    image.jpg
We can have a metadata folder, which contains our JSON files, hidden from the OS, at the root of each directory, which describe that directory's contents. This is how traditional versioning systems work, except they use a custom language instead of JSON.
/repo
    /.folderdata
    /code
        /.folderdata
        code.php
        file.txt
        image.jpg
Each .folderdata folder could contain its own structure that we can use to organize the folder's data properly. Each .folderdata folder could then be compressed to save disk space. If we look at the .folderdata folder inside the /code directory:
/.folderdata
    /revisions
        code.php.r1
        code.php.r2
        code.php.r3
    folderstructure.json
    filerevisions.json
The folder structure defines the structure of our folder, where the files and folders are in relation to one another etc. This could look something like this:
{
    '.': 'code',
    '..': 'repo',
    'code.php': {
        'author_id': 11543,
        'author_name': 'Jamie Rumbelow',
        'file_hash': 'a26hb3vpq22',
        'access': 'public'
    }
}
This allows us to associate metadata about that file, check for authenticity and integrity, keep persistent data, specify file attributes and do much more. We can then keep information about specific revisions in the filerevisions.json file:
{
    'code.php': {
        '1': {
            'commit': 'ah32mncnj654oidfd',
            'commit_author_id': 11543,
            'commit_author_name': 'Jamie Rumbelow',
            'commit_message': 'Made some changes to code.php',
            'additions': 2,
            'subtractions': 4
        },
        '2': {
            'commit': 'ljk4klj34khn5nkk5',
            'commit_author_id': 18676,
            'commit_author_name': 'Jo Johnson',
            'commit_message': 'Fixed Jamie\'s bad code!',
            'additions': 2,
            'subtractions': 0
        },
        '3': {
            'commit': '77sdnjhhh4ife943r',
            'commit_author_id': 11543,
            'commit_author_name': 'Jamie Rumbelow',
            'commit_message': 'Whoah, showstopper found and fixed',
            'additions': 8,
            'subtractions': 5
        }
    }
}
This is a basic outline plan for a file versioning system - I like this idea and how it works, and I've used JSON in the past to great effect with rich datastructures like this. This sort of data just isn't suitable for a relational database such as MySQL - as you get more revisions and more files the database will grow bigger and bigger, this way you can stagger the revisions across multiple files, keep backups of everything, make sure you have persistent data across interfaces and platforms etc.
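If you go this route, reading and updating one of those JSON files from PHP is straightforward; a small sketch, with the paths and values above assumed:

<?php
// Load the folder's revision metadata, record a new revision entry for
// code.php, and write the file back.
$path = '/repo/code/.folderdata/filerevisions.json';
$revisions = json_decode(file_get_contents($path), true) ?: [];

$next = count($revisions['code.php'] ?? []) + 1;   // next revision number
$revisions['code.php'][$next] = [
    'commit'             => sha1(uniqid('', true)),
    'commit_author_id'   => 11543,
    'commit_author_name' => 'Jamie Rumbelow',
    'commit_message'     => 'Made some changes to code.php',
    'additions'          => 2,
    'subtractions'       => 4,
];

file_put_contents($path, json_encode($revisions, JSON_PRETTY_PRINT), LOCK_EX);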
Hope this has given you some insight, and hopefully it'll provide some food for thought for the community too!
For a database schema, you likely need two sets of information: files and file versions. When a new file is stored, an initial version is created as well. The latest approved version would have to be stored explicitly, while the newest version can be selected from the versions table (either by finding the highest version related to the file, or the newest date if you store when versions are created).
files(id,name,approved_version)
file_versions(id,fileId)
File versions could then be stored using their ids (e.g. '/fileId/versionId' or '/fileId/versionId_fileName') on the server, with their original name stored in the database.
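A small sketch of storing a new version under that layout (the base directory, $pdo connection and upload handling are assumptions):

<?php
// Create the version row first so its id is known, then store the uploaded
// file under /fileId/versionId_fileName; the original name stays in the files table.
$pdo->prepare('INSERT INTO file_versions (fileId) VALUES (:file)')
    ->execute([':file' => $fileId]);
$versionId = (int)$pdo->lastInsertId();

$target = "/var/data/files/{$fileId}/{$versionId}_" . basename($_FILES['upload']['name']);
if (!is_dir(dirname($target))) {
    mkdir(dirname($target), 0750, true);
}
move_uploaded_file($_FILES['upload']['tmp_name'], $target);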
I recently built a simple versioning system for some static data entities. The requirement was to have an 'Active' version and 0 or 1 'pending' versions.
In the end, my versioned entity had the following attributes relevant to versioning.
VersionNumber (int/long)
ActiveVersionFlag (boolean)
Where:
only 1 entity can have ActiveVersionFlag = 'Y'
only 1 entity can have a VersionNumber greater than the 'Active' version (i.e. the 'pending' version)
The kind of operations I allowed were:
Clone current version
    Fail if there is already a version greater than the VersionNumber of the 'Active' version
    Copy all of the data to the new version
    Increment the version number by one
Activate Pending Version
    Fail if the specified version is not the 'Active' version + 1
    Find the 'Active' version and set its ActiveVersionFlag to 'N'
    Set the ActiveVersionFlag of the 'pending' version to 'Y'
Delete Pending Version
    Delete the pending entity
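As an illustration, the "Activate Pending Version" operation could look roughly like this against a hypothetical my_entity table ($pdo and $pendingVersion are assumed):

<?php
// Activate Pending Version: the pending row must be exactly Active + 1;
// flip the flags inside one transaction.
$pdo->beginTransaction();

$active = (int)$pdo->query(
    "SELECT VersionNumber FROM my_entity WHERE ActiveVersionFlag = 'Y'"
)->fetchColumn();

if ($pendingVersion !== $active + 1) {
    $pdo->rollBack();
    throw new RuntimeException('Specified version is not the Active version + 1');
}

$pdo->exec("UPDATE my_entity SET ActiveVersionFlag = 'N' WHERE ActiveVersionFlag = 'Y'");
$pdo->prepare("UPDATE my_entity SET ActiveVersionFlag = 'Y' WHERE VersionNumber = :v")
    ->execute([':v' => $pendingVersion]);

$pdo->commit();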
This was reasonably successful, and my users now clone and activate all the time :)
Michael
Start from an existing content management system, written in PHP and MySQL if those are your requirements, such as eZ Publish or KnowledgeTree. For rapid testing of these applications, BitNami provides quick-to-install "stacks" of them as well (WAMP stacks on steroids).
Then you can tailor these applications to your organization's needs, and stay up to date with changes upstream.
As an alternative to my previous post, if you think a hierarchical structure would be best, you may want to use flat-file storage, and expose an API through a Web service.
The server would have its data root directory, and you can store groups (of files) in folders, with a root meta-data entry in each folder. (XML perhaps?)
Then you can use an existing revision control tool wrapped in an API, or roll your own, keeping revisions of files in a revisions folder underneath the item in the folder. Check for revisions, and do file I/O with file I/O commands. Expose the API to the Web application, or other client application, and let the server determine file permissions and user mapping through the XML files.
Migrate servers? Zip and copy.
Cross platform? Zip and copy.
Backup? Zip and copy.
It's the flat-file back-end that I love about Mercurial DVCS, for example.
Of course, in this little example, the .rev files could have dates, times, compression, etc., defined in the revisions.xml file. When you want to access one of these revisions, you expose an AccessFile() method; your server application will look at revisions.xml and determine how to open that file, whether access is granted, and so on.
So you have
DATA
| + ROOT
| | . metadata.xml
| | |
| | + REVISIONS
| | | . revisionsdata.xml
| | | . documenta.doc.00.rev
| | | . documenta.doc.01.rev
| | | . documentb.ppt.00.rev
| | | . documentb.ppt.03.rev
| | |___
| | |
| | . documenta.doc
| | . documentb.ppt
| | |
| | + GROUP_A
| | | . metadata.xml
| | | |
| | | + REVISIONS
| | | | . revisionsdata.xml
| | | | . documentc.doc.00.rev
| | | | . documentc.doc.01.rev
| | | | . documentd.ppt.00.rev
| | | | . documentd.ppt.03.rev
| | | |___
| | |
| | | . documentc.doc
| | | . documentd.ppt
| | |___
| | |
| | + GROUP_B
| | | . metadata.xml
| | |___
| |
| |___
|
|___
Uploading files is so 1990s =)
Look at Google Wave! You can just build your entire application around their 'version control' framework.
