Anyone know of an API (php preferable but I'd be interested in any language) for creating wiki-like data storage?
How about any resources on rolling your own plaintext wiki? How do other plaintext wikis handle the format of the text file?
I understand I can use Markdown or Textile for the formatting. But what I'm most interested in is how to approach the plaintext storage of multi-user edits.
I'm writing a web application that is primarily database driven. I want at least one text field of this database to be in a wiki-like format. Specifically, this text can be edited by multiple users with the ability to roll back to any version. Think the wiki/bio section of Last.FM (almost the entire site is strictly structured by a database except for this one section per artist).
So far, my approach of taking apart MediaWiki and wedging it into a database seems like overkill. I'm thinking it would be much easier to roll my own plaintext wiki, and store this file in the database's appropriate text field.
So, basically this is a "how do I version text information in my DB" question.
Well, the simplest way is to simply copy the data.
Create a "versions" table that holds old versions of the document, and link it back to your main table.
create table docs (
    id integer primary key not null,
    version integer not null,
    create_date date,
    change_date date,
    create_user_id integer not null references users(id),
    change_user_id integer references users(id),
    text_data text
);

create table versions (
    id integer primary key not null,
    doc_id integer not null references docs(id),
    version integer,
    change_date date,
    change_user integer not null references users(id),
    text_data text
);
Whenever you update your original document, you copy the old text value into this table, copy the user and change date, and bump the version.
select version, change_date, change_user_id, text_data
  into l_version, l_change_date, l_change_user, l_text_data
  from docs where id = l_doc_id;

insert into versions values (newid, l_doc_id, l_version,
  l_change_date, l_change_user, l_text_data);

update docs set version = version + 1, change_date = now(),
  change_user_id = cur_user, text_data = l_new_text where id = l_doc_id;
You could even do this in a trigger if your DB supports those.
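If you'd rather keep the logic in the application (or your DB doesn't have triggers), a rough PHP/PDO sketch of the same copy-then-update flow could look like the following. It assumes the schema above, that versions.id is auto-generated, that $pdo is a connected PDO instance with exceptions enabled, and that $docId, $userId and $newText come from your app:

$pdo->beginTransaction();
try {
    // Copy the current row into the versions table
    // (assumes versions.id is AUTO_INCREMENT or otherwise auto-generated).
    $pdo->prepare(
        'INSERT INTO versions (doc_id, version, change_date, change_user, text_data)
         SELECT id, version, change_date, change_user_id, text_data
           FROM docs WHERE id = ?'
    )->execute([$docId]);

    // Bump the version and store the new text.
    $pdo->prepare(
        'UPDATE docs
            SET version = version + 1,
                change_date = NOW(),
                change_user_id = ?,
                text_data = ?
          WHERE id = ?'
    )->execute([$userId, $newText, $docId]);

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}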
Faults with this method are that it's a full copy of the data (so if you have a large document, the versions stay large). You can mitigate that by using something like diff(1) and patch(1).
For example:
diff version2.txt version1.txt > difffile
Then you can store that difffile as "version 1".
In order to recover version 1 from version 2, you grab the version 2 data, run patch on it using the diff file data, and that gives you v1.
If you want to go from v3 to v1, you need to do this twice (once to get v2, and then again to get v1).
This lowers your storage burden, but increases your processing (obviously), so you'll have to judge how you want to do this.
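If you go the diff/patch route from PHP, one option (my own suggestion, not something the answer prescribes) is the PECL xdiff extension, which exposes diff and patch as string functions. A sketch:

// Requires the PECL xdiff extension.
// Build a reverse diff: a patch that turns the NEW text back into the OLD one,
// mirroring the `diff version2.txt version1.txt` example above.
$reverseDiff = xdiff_string_diff($newText, $oldText);

// Store $reverseDiff as "version 1" instead of the full old text.

// Later, recover the old version from the current text:
$recoveredOld = xdiff_string_patch($newText, $reverseDiff);

// Walking back several versions means applying each stored diff in turn (v3 -> v2 -> v1).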
Will's huge answer is right on, but can be summed up, I think: you need to store the versions, and then you need to store the metadata (the who, what, and when of the data).
But your question was about resources on wiki-like versioning. I have none (well, one: Will's answer above). However, on the storage side of wikis, I do have one: check out the comparison matrix from DokuWiki. I know, you're thinking "what do I care what brand of DB different wikis use?" The point is that DokuWiki uses plain text files. You can open them and they are indeed plain. So that's one approach, and they've got some interesting arguments as to why a DBMS is not the best way to go. They don't even hold much metadata: most of the stuff is done through the flat files themselves.
The point of DokuWiki for you is that maybe it's a relatively simple problem (depending on how well you want to solve it :)
Here's a list of all 12 wikis on WikiMatrix that are written in PHP and do their storage using text files. Perhaps one of them will have a storage method you can adapt into the database:
http://www.wikimatrix.org/search.php?sid=1760
It sounds like you are essentially just looking for version control. If that is the case, you may want to look into a diff algorithm.
Here is the Wikipedia Diff page.
I did a quick php diff google search, but nothing really stood out as a decent example, since I only have basic PHP knowledge.
I would like to create a column (not a PK) whose value represents as a unique identifier. It is not used for encryption or security purposes - strictly to identify a record. Each time a new record is inserted, I want to generate and store this unique identifier. Not sure if this is relevant, but I have 1 million records now, and anticipate ~3 million in 2 years. I'm using a web app in PHP.
I initially just assumed I'd call UUID() and store it directly as some sort of char data type, but I really wanted to do some research and learn of a more efficient/optimized approach. I found a lot of great articles here on SO, but I'm having a hard time with all of the posts because many of them are somewhat older or disagree on the approach, which has ultimately left me very confused. I wanted to ask if someone wiser/more experienced could lend me a hand.
In various posts I saw folks link to this article and suggest implementing things this way:
https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
but I'm having a hard time fully knowing what to do after reading that article. Ordered UUID? What should I store it as? I think maybe that particular page is a tad over my head. I wanted to ask if someone could help clarify some of this for me. Specifically:
What data type should my column be for storing binary data (that represents my UUID)?
What function should I use to convert my UUID to and from some binary value?
Any more advice or tips someone could share?
Thanks so much!
If you call MySQL's UUID(), you get a variant that is roughly chronological. So, if you tend to reference "recent" records and ignore "old" records, then rearranging the bits in the UUID can provide better "locality of reference" (that is, better performance).
Version 4 UUIDs do not provide that.
You can turn the UUID from the bulky 36-character string into a more compact 16-byte BINARY(16) (Q1) using the code (Q2) in my UUID blog. That document discusses various other aspects of your question (Q3).
The Percona link you provided gives some benchmarks 'proving' the benefit.
3M uuids taking 16 bytes each = 48MB. It is bulky, but not likely to cause serious problems. Still, I recommend avoiding uuids whenever practical.
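To make the "rearranged" BINARY(16) idea concrete, here is a hedged PHP sketch (my own helper names, not from the article): it moves the time fields of a version-1 UUID to the front so that values generated close in time sort close together, and converts between the 36-character string and the 16 bytes you'd store.

// Convert a version-1 UUID string to 16 bytes for a BINARY(16) column,
// with the time-high/mid/low fields moved to the front (Percona-style ordering).
function uuidToOrderedBin(string $uuid): string
{
    $hex = str_replace('-', '', $uuid);          // 32 hex chars
    $ordered = substr($hex, 12, 4)               // time-high + version
             . substr($hex, 8, 4)                // time-mid
             . substr($hex, 0, 8)                // time-low
             . substr($hex, 16);                 // clock sequence + node
    return hex2bin($ordered);
}

// Reverse the rearrangement and restore the dashed string form.
function orderedBinToUuid(string $bin): string
{
    $hex = bin2hex($bin);
    $plain = substr($hex, 8, 8)                  // time-low
           . substr($hex, 4, 4)                  // time-mid
           . substr($hex, 0, 4)                  // time-high + version
           . substr($hex, 16);
    return sprintf('%s-%s-%s-%s-%s',
        substr($plain, 0, 8), substr($plain, 8, 4), substr($plain, 12, 4),
        substr($plain, 16, 4), substr($plain, 20));
}

If you are on MySQL 8.0 or later, UUID_TO_BIN(uuid, 1) and BIN_TO_UUID(bin, 1) do the same swap for you on the server side.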
I used UUID v4 on a recent project. The code to generate UUID v4 can be sourced here: PHP function to generate v4 UUID
The main difference is that we compressed it to a 22-byte, case-sensitive format. This approach is also used by ElasticSearch.
The resulting values are stored simply as char(22).
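For reference, one way to get a 22-character, case-sensitive, URL-safe id out of 128 random bits looks roughly like this (my own sketch, not necessarily the exact scheme used on that project): base64 of 16 bytes is 24 characters including two '=' padding characters, so stripping the padding leaves 22.

// 128 random bits (same entropy as a v4 UUID) encoded as a 22-character,
// URL-safe, case-sensitive string. Requires PHP 7+ for random_bytes().
function compactId(): string
{
    return rtrim(strtr(base64_encode(random_bytes(16)), '+/', '-_'), '=');
}
// Store the result in a CHAR(22) column with a case-sensitive (binary) collation.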
I'm trying to optimize my PHP and MySQL, but my understanding of SQL databases is shoddy at best. I'm creating a website (mostly for learning purposes) which allows users to make different kinds of posts (image/video/text/link).
Here are the basics of what I'm storing:
Auto - int (key index)
User ID - varchar
Post id - varchar
Post Type - varchar (YouTube, vimeo, image, text, link)
File Name - varchar (original image name or link title)
Source - varchar (external link or name of file + ext)
Title - varchar (post title picked by user)
Message - text (user's actual post)
Date - int (unix timestamp)
I have other data relevant to the post stored in other tables, which I grab with the post id (like user information), but I'm really doubting whether this is the way I should be storing information. I do use PDO, but I'm afraid this format might just be extremely slow.
Would there be any sense in storing the post information in another format? I don't want excessively large tables, so from a performance standpoint should I store some information as a blob/binary/xml/json?
I can't seem to find any good resources on PHP/MySQL optimization. Most information I come across tends to be 5-10 years old, content you have to pay for, too low-level, or just straight documentation which can't hold my attention for more than half an hour.
Databases are made to store data, and are fast at retrieving it. Do not switch to anything else; stick with a database.
Try not to store pictures and videos in a database. Store them on disk, and keep a reference to them in a database table.
Finally, catch up on database normalization, it will help you in getting your database in optimal condition.
What you have seems okay, but you have missed the important bit about indexes and keys.
Firstly, I am assuming that your primary key will be field 1. Okay, no problems there, but make sure that you also stick an index on userID, PostID, Date and probably a composite on UserID, Date.
Secondly, are you planning on having search functions on these? In that case you may need to enable full text searches.
Don't muck around trying to store data in JSON or other such things. Store it plain and simple. The last thing you want to be doing is extracting a field from the database just to see what is inside. If your database can't work it out, it is bad design.
On that note, there isn't anything wrong with large tables. As long as they are indexed nicely, a small table or a large table makes very little difference in access time (short of huge, badly written SQL joins), so worry about keeping it simple enough to get the data back out.
Edit: A primary key is a lovely way to identify a row by a unique column of some sort. So, if you want to delete a row, in your example you might specify delete from yourTable where ID=6, and you know that this will only delete one row, as only one row can have ID=6.
On the other hand, an index is different to a key, in that it is like a cheat-sheet for the database to know where certain information is inside the table. For example, if you have an index on the UserID column, when you pass a userID in a query the database won't have to look through the entire table; it looks at the index and knows the location of all the rows for that user.
A composite index takes this one step further: if you know you will constantly query data by both UserID and ContentType, you can add a composite index (meaning an index on BOTH fields in one index), which allows the database to return only the data you specify in a query using both those columns without having to sift through the entire table, or even through all of a user's posts to find the right content type.
Now, indexes take up some extra space on the server, so keep that in mind, but if your tables grow to be larger (which is perfectly fine) the improved efficiency is staggering.
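To make the index advice concrete, here's a hedged sketch of what it could look like; the table name, column names, and the $pdo connection are assumptions on my part, based on the fields listed in the question:

// Hypothetical migration snippet: single-column and composite indexes.
$pdo->exec('CREATE INDEX idx_posts_user      ON posts (user_id)');
$pdo->exec('CREATE INDEX idx_posts_post      ON posts (post_id)');
$pdo->exec('CREATE INDEX idx_posts_date      ON posts (`date`)');
$pdo->exec('CREATE INDEX idx_posts_user_date ON posts (user_id, `date`)');
$pdo->exec('CREATE INDEX idx_posts_user_type ON posts (user_id, post_type)');

// A query filtering on both columns of a composite index can then use that
// index instead of scanning the whole table:
$stmt = $pdo->prepare(
    'SELECT id, title FROM posts WHERE user_id = ? AND post_type = ? ORDER BY `date` DESC'
);
$stmt->execute([42, 'image']);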
At this time, stick with an RDBMS. Once you are comfortable with PHP and MySQL there will be more to learn later on, like NoSQL, MongoDB, etc., but for your current purpose this is the right choice and will not slow you down. Your table schema seems right, with a few modifications.
The user id and post id should be integers, and since this table holds posts, the post id should be auto-incremented and made the primary key.
Another thing is that you are using two fields, filename and source. Note that filename should be the name of the uploaded file, but if by source you mean the complete path of the file, then the DB is not the place to store a complete path. Generate the path with a PHP function every time you need to access it, rather than storing it in the DB; otherwise, if you ever need to change the path, it will be a lot of overhead.
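For example, a trivial sketch of that idea (the base directory and function name are my own assumptions):

// Keep only the stored file name in the DB and build the full path in PHP,
// so the base directory can change later without touching any rows.
function uploadPath(string $storedName): string
{
    $baseDir = '/var/www/uploads';               // assumed, configured in one place
    return $baseDir . '/' . basename($storedName);
}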
You also asked about blobs etc. Note that it is better to store files in the file system, not in the DB; fields like blob are for when you want to store a file in a DB table, which I don't recommend here.
Building a system that has the potential to get hammered pretty hard with hits and traffic.
It's a typical Apache/PHP/MySql setup.
Have built plenty of systems before, but never had a scenario where I really had to make decisions regarding potential scalability of this size. I have dozens of questions regarding building a system of this magnitude, but for this particular question, I am trying to decide on what to use as the data type.
Here is the 100ft view:
We have a table which (among other things) has a description field. We have decided to limit it to 255 characters. It will be searchable (ie: show me all entries with description that contains ...). Problem: this table is likely to have millions upon millions of entries at some point (or so we think).
I have not yet figured out the strategy for the search (the MySQL LIKE operator is likely to be slow and/or a resource hog, I'm guessing, for such a large number of records), but that's for another SO question. For this question, I am wondering what the pros and cons are of creating this field as a tinytext, varchar, or char.
I am not a database expert, so any and all commentary is helpful. Thanks -
Use a CHAR.
BLOBs and TEXTs are stored outside the row, so there will be an access penalty when reading them.
VARCHARs are variable length, which saves storage space but could introduce a small access penalty (since the rows aren't all fixed length).
If you create your index properly, however, either VARCHAR or CHAR can be stored entirely in the index, which will make access a lot faster.
See: varchar(255) v tinyblob v tinytext
And: http://213.136.52.31/mysql/540
And: http://forums.mysql.com/read.php?10,254231,254231#msg-254231
And: http://forums.mysql.com/read.php?20,223006,223683#msg-223683
Incidentally, in my experience the MySQL REGEXP operator is a lot faster than LIKE for simple queries (e.g., SELECT ID FROM SOME_TABLE WHERE SOME_COLUMN REGEXP 'search.*'), and obviously more versatile.
I believe with varchar you've got a variable length stored in the actual database at the low level, which means it could take less disk space, while with the text field it's fixed length even if a row doesn't use all of it. The fixed-length string should be faster to query.
Edit: I just looked it up; text types are stored as variable length as well. The best thing to do would be to benchmark it with something like mysqlslap.
In regards to your other, un-asked question: you'd probably want to build some sort of search index that ties every useful word in the description field individually to a description; then you can index that and search it instead. It will be way, way faster than using %LIKE%.
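MySQL's built-in FULLTEXT index is one ready-made version of that idea. A hedged sketch (it assumes MyISAM, or InnoDB on MySQL 5.6+, and the table/column names are my own placeholders):

// One-time: add a full-text index on the description column.
$pdo->exec('ALTER TABLE entries ADD FULLTEXT INDEX ft_description (description)');

// Query it with MATCH ... AGAINST instead of LIKE '%...%'.
$stmt = $pdo->prepare(
    'SELECT id, description FROM entries WHERE MATCH(description) AGAINST (?)'
);
$stmt->execute(['search terms']);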
In your situation all three types are bad if you'll use LIKE (a LIKE '%string%' won't use any index created on that column, regardless of its type). Everything else is just noise.
I am not aware of any major difference between TINYTEXT and VARCHAR up to 255 chars, and CHAR is just not meant for variable length strings.
So my suggestion: pick VARCHAR or TINYTEXT (I'd personally go for VARCHAR) and index the content of that column using a full text search engine like Lucene, Sphinx or any other that does the job for you. Just forget about LIKE (even if that means you need to custom build the full text search index engine yourself for whatever reasons you might have, i.e. you need support for a set of features that no engine out there can satisfy).
If you want to search among millions of rows, store all these texts in a different table (which will decrease row size of your big table) and use VARCHAR if your text data is short, or TEXT if you require greater length.
Instead of searching with LIKE use a specialized solution like Lucene, Sphinx or Solr. I don't remember which, but at least one of them can be easily configured for real-time or near real-time indexing.
EDIT
My proposal of storing the text in a different table reduces the IO required for the main table, but when data is inserted it requires maintaining an additional index, and it adds join overhead in selects, so it is only valid if you use your table to read a few descriptions at a time and the other data in the table is used more often.
idea
I would like to create a little app for myself to store ideas (the thing is - I want it done MY WAY).
database
I'm thinking going simple:
id - unique id of revision in database
text_id - identification number of text
rev_id - number of revision
flags - various purposes - expl. later
title - self expl.
desc - description
text - self expl
flags - if I add, for example, the flag rb;65, then instead of storing the whole text I'm just saying that whenever I ask for the latest revision, I go back into the DB and fetch revision 65
Question: Is this setup the best? Is it better to store the diff, or the whole text (I know, space is cheap...)? Does that revision flag make sense (wouldn't it be better to just copy the text - more disk space, but less DB and PHP processing)?
php
I'm thinking that I'll go with PEAR here. Although the main point is open-edit-save, the possibility of viewing revisions can't be that hard to program and can be a life-saver in certain situations (a good idea got deleted, the wrong version was saved, etc.).
However, I've never used PEAR in a long-term or full-project relationship, and brief encounters in my previous experience left a rather bad feeling - as I remember, it was too difficult to implement, and slow and humongous to play with - so I don't know if there's anything better.
Update: It seems that there are more pre-made text diff libraries, some even more lightweight than PEAR, so I'll probably have to dig into them.
why?
Although there are bazillions of various time/project/idea management tools, every one of them lacks something for me, whether it's sharing with users, syncing across multiple PCs, time-tracking, project management... And I believe that this text diff webapp will later be used internally alongside various other tools. So if you know of a good project management app with a nice UI and support for text-heavy usage, just let me know, so I can save my time for something better than reinventing the wheel.
I think your question just boils down to this one line (if there's something else, let me know and I'll add on):
Is it better to store the diff, or whole text (i know, space is cheap...)?
It's definitely better to store the whole text, unless you really need to save space. Viewing the text will be a much more common action than checking a diff, and if something has a lot of revisions it could be a significant process to "build" the text for the latest one. Imagine a heavily-used page where you've done thousands of revisions, and the "whole text" is only stored with the original. Then you have to process thousands of diffs just to view the latest text, instead of just pulling it straight out of the database.
If you want to compromise, every time you calculate a diff between any two revisions, store it in a separate table. Then you only have to calculate any given diff once, so it'll be instant the next time you view the same diff. If necessary, this table could be pruned every once in a while to remove diffs that haven't been accessed in a long time.
Here is a PHP diff function: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/
And here is another: holomind.de/phpnet/diff.php
If you're storing a lot of different versions of files, git can help you quite a lot.
The CMS I'm currently working with only supports live editing of data (news, events, blogs, files, etc.), and I've been asked to build a system that supports drafting (with moderation) plus a revision history. The CMS I'm using was developed in-house, so I'll probably have to code it from scratch.
At every save of an item, it would create a snapshot of the data in a "timeline". The same would go for drafts. Automated functionality would pull the timeline draft into the originating record when required.
The timeline table would store the data type and primary key, a serialised version of the data, created/modified dates, and a drafting date (if in the future).
I've had a quick look around at other systems, but I've yet to find anything that improves on my current idea.
I'm sure someone has already built a system like this and I would like to improve on my design before I start building. Any good articles/resources would help as well.
Thanks
I think using serialize() to encode each row into a single string, then saving that to a central database table, may be a solution.
You'd have your 'live' database with the relevant tables etc., but when you edit or create something (without clicking publish) it would, instead of being saved in your main table, go into a table like:
id - PRI INT
date - DATETIME
table - VARCHAR
table_id - INT
type - ENUM('UNPUBLISHED','ARCHIVED','DELETED');
data - TEXT/BLOB
...with the type set to 'unpublished', and the table and table_id stored so it knows where it came from. Clicking publish would then serialize the current table's contents, store it in the above table with the type set to 'archived', then read out the latest change (marked as unpublished) and place it in the live database. The same could also apply to deleting rows - place them in the table and mark them as 'deleted' for potential undelete/rollback functionality.
It'll require quite a lot of legwork to get it all working, but should provide full publish/unpublish and rollback facilities. Integrated correctly into custom database functions it may also be possible to do all this transparently (from a SQL point of view).
I have been planning on implementing this as a solution to the same problem you appear to be having; it's still theoretical from my point of view, but I reckon the idea is sound.
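A very rough PHP sketch of the publish step described above; the timeline table, its columns, and the function shape are all assumptions, and $table must be a trusted identifier (never user input) since it is interpolated into the SQL:

function publish(PDO $pdo, string $table, int $id): void
{
    $pdo->beginTransaction();

    // 1. Snapshot the current live row into the timeline as 'ARCHIVED'.
    $stmt = $pdo->prepare("SELECT * FROM `$table` WHERE id = ?");
    $stmt->execute([$id]);
    $liveRow = $stmt->fetch(PDO::FETCH_ASSOC);

    $pdo->prepare(
        'INSERT INTO timeline (`date`, `table`, table_id, type, data)
         VALUES (NOW(), ?, ?, ?, ?)'
    )->execute([$table, $id, 'ARCHIVED', serialize($liveRow)]);

    // 2. Pull the newest unpublished draft for this row...
    $stmt = $pdo->prepare(
        'SELECT data FROM timeline
          WHERE `table` = ? AND table_id = ? AND type = ?
          ORDER BY `date` DESC LIMIT 1'
    );
    $stmt->execute([$table, $id, 'UNPUBLISHED']);
    $draft = $stmt->fetchColumn();

    // 3. ...and write its columns back over the live row.
    if ($draft !== false) {
        $data = unserialize($draft);
        $sets = implode(', ', array_map(
            static function ($col) { return "`$col` = ?"; },
            array_keys($data)
        ));
        $pdo->prepare("UPDATE `$table` SET $sets WHERE id = ?")
            ->execute(array_merge(array_values($data), [$id]));
    }

    $pdo->commit();
}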
This sounds very wiki-like to me. You may want to look at MediaWiki, the system used by Wikipedia, which also uses PHP and MySQL.
DotNetNuke is a good open-source CMS; you could read the source for that system to get ideas. Or you could simply use DotNetNuke.
http://www.dotnetnuke.com/
I think that there are many systems out there that would support this functionality out of the box. Although I don't know all your considerations for doing a custom build, consider looking at some of these. It is very likely that they will be able to support what you need, and then some.
Consider having a look at Drupal, which I think is still the leading CMS for publishing. Drupal in combination with the Workflow module contains everything you need:
http://drupal.org
http://drupal.org/project/workflow
And add save draft for usability:
http://drupal.org/project/save_draft