How to create a text diff web app - PHP

idea
I would like to create a little app for myself to store ideas (the thing is, I want it done MY WAY).
database
I'm thinking of going simple:
id - unique id of the revision in the database
text_id - identification number of the text
rev_id - number of the revision
flags - various purposes - explained later
title - self-explanatory
desc - description
text - self-explanatory
flags - if I add, e.g., the flag rb;65, then instead of storing the whole text, the row just says: whenever I ask for the latest revision, go back to the DB and fetch revision 65 instead.
Question: Is this setup the best? Is it better to store the diff or the whole text (I know, space is cheap...)? Does that revision flag make sense, or would it be better to just copy the text (more disk space, but less DB and PHP processing)?
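For reference, the table described above could look like this in MySQL. This is only a sketch: the column types and the `rb;65` flag convention are assumptions based on the description.

```sql
-- Hypothetical schema for the revision table sketched above.
CREATE TABLE revisions (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT, -- unique id of the row
    text_id INT UNSIGNED NOT NULL,                -- which text this belongs to
    rev_id  INT UNSIGNED NOT NULL,                -- revision number within that text
    flags   VARCHAR(64)  NULL,                    -- e.g. 'rb;65' = "reuse body of revision 65"
    title   VARCHAR(255) NOT NULL,
    `desc`  TEXT         NULL,                    -- description
    `text`  MEDIUMTEXT   NULL,                    -- full text (NULL when flags point elsewhere)
    PRIMARY KEY (id),
    UNIQUE KEY (text_id, rev_id)                  -- one row per (text, revision)
);
```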
php
I'm thinking that I'll go with PEAR here. Although the main point is open-edit-save, the possibility to view revisions can't be that hard to program and can be a life-saver in certain situations (good ideas get deleted, the wrong version gets saved, etc.).
However, I've never used PEAR in a long-term or full-project relationship, and the brief encounters in my previous experience left a rather bad feeling - as I remember, it was too difficult to implement, slow, and unwieldy to work with. So I don't know if there's anything better.
Update: It seems that there are more pre-made text diff libraries, some even more lightweight than PEAR, so I'll probably have to dig into them.
why?
Although there are bazillions of various time/project/idea management tools, every one of them lacks something for me, whether it's sharing with users, syncing across PCs, time-tracking, or project management... And I believe that this text diff webapp will later be for internal use alongside various other tools. So if you know any good project management app with a nice UI and support for text-heavy usage, just let me know, so I can save my time for something better than reinventing the wheel.

I think your question just boils down to this one line (if there's something else, let me know and I'll add on):
Is it better to store the diff or the whole text (I know, space is cheap...)?
It's definitely better to store the whole text, unless you really need to save space. Viewing the text will be a much more common action than checking a diff, and if something has a lot of revisions it could be a significant process to "build" the text for the latest one. Imagine a heavily-used page where you've done thousands of revisions, and the "whole text" is only stored with the original. Then you have to process thousands of diffs just to view the latest text, instead of just pulling it straight out of the database.
If you want to compromise, every time you calculate a diff between any two revisions, store it in a separate table. Then you only have to calculate any given diff once, so it'll be instant the next time you view the same diff. If necessary, this table could be pruned every once in a while to remove diffs that haven't been accessed in a long time.
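That compromise can be sketched in a few lines of PHP. The diff function below is a naive LCS-based line diff purely for illustration (a real app would use a library such as PEAR's Text_Diff), and the static array stands in for the separate diff-cache table:

```php
<?php
// Sketch of the compromise above: full text is stored per revision,
// diffs are computed on demand and cached so each pair is only ever
// diffed once. line_diff() is a toy implementation for illustration.

function line_diff(string $old, string $new): array
{
    $a = explode("\n", $old);
    $b = explode("\n", $new);
    $n = count($a); $m = count($b);

    // Longest-common-subsequence table.
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($a[$i] === $b[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    // Walk the table, emitting '-' (removed) and '+' (added) lines.
    $out = []; $i = $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $out[] = '  ' . $a[$i]; $i++; $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $out[] = '- ' . $a[$i++];
        } else {
            $out[] = '+ ' . $b[$j++];
        }
    }
    while ($i < $n) { $out[] = '- ' . $a[$i++]; }
    while ($j < $m) { $out[] = '+ ' . $b[$j++]; }
    return $out;
}

function cached_diff(string $old, string $new): array
{
    static $cache = [];                // stand-in for the diff-cache table
    $key = md5($old) . md5($new);
    return $cache[$key] ??= line_diff($old, $new);
}
```

In the real thing, `$cache` would be the separate table keyed by (text_id, rev_a, rev_b), with a last-accessed column so old entries can be pruned as described above.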

Here is a PHP diff function: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/
And here is another: holomind.de/phpnet/diff.php

If you're storing a lot of different versions of files, git can help you quite a lot.


Reading a file or searching in a database?

I am creating a web-based app for Android and I came to the point of the account system. Previously I stored all data for a person inside a text file, located at users/<name>.txt. Now that I'm thinking about doing it in a database (like you probably should), wouldn't that take longer to load, since it has to look for the row where the name is equal to the input?
So, my question is: is it faster to read data from a text file, which is easy to open because the app knows its location, or would it be faster to get the information from a database, even though it would first have to scan row by row until it reaches the one with the correct name?
I don't care about safety; I know the first option is not safe at all. It doesn't really matter in this case.
Thanks,
Merijn
In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads the records in units called data pages. So, it doesn't read one record at a time, it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage to a database is that it then gives you the flexibility to easily add more data, to log interactions, and to store other types of data that might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.
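The trade-off can be sketched in PHP: the file approach has to scan every record, while an index effectively gives keyed access. The names and the `name:data` layout below are made up for illustration.

```php
<?php
// Toy comparison of the two lookups described above: a line-by-line
// file scan versus a keyed (indexed) structure.

// users.txt style: one "name:data" record per line.
function find_in_lines(array $lines, string $name): ?string
{
    foreach ($lines as $line) {              // O(n): touches every line
        [$n, $data] = explode(':', $line, 2);
        if ($n === $name) {
            return $data;
        }
    }
    return null;
}

// What a database index effectively gives you: direct keyed access.
function find_indexed(array $byName, string $name): ?string
{
    return $byName[$name] ?? null;           // roughly O(1) lookup
}

$lines = ['alice:1', 'bob:2', 'merijn:3'];
$index = ['alice' => '1', 'bob' => '2', 'merijn' => '3'];
```

With three names the scan is fine; with hundreds of thousands, the difference between the two shapes is exactly the file-vs-index difference described above.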
A bit of googling turned up this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as
operating systems maintain a sort of index. However, the contents of a
txt file won't be indexed, which is one of the main advantages of a
database. Another is understanding the relational model, so that data
doesn't need to be repeated over and over. Another is understanding
types. If you have a txt file, you'll need to parse numbers, dates,
etc.
So - the file system might work for you in some cases, but certainly
not all.
That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)
It is quite a simple decision - use a database.
Not because it's faster or slower, but because it has mechanisms to prevent data loss or corruption.
A failed write to the text file can happen, and you will lose a user's profile info.
With a database engine, it's much more difficult to lose data like that.
EDIT:
Also, a big question - is this about the server side or the app side?
Because, for the app side, realistically you won't have more than 100 users per smartphone... More likely you will have 1-5 users who share the phone and thus need their own profiles, and in the majority of cases you will have a single user.

Storing language and styles. What would be best? Files or DB (i18n)

I'm starting an Incident Tracking System for IT, and it's likely my first PHP project.
I've been designing it in my mind based on software I've seen, like vBulletin, and I'd like it to have editable i18n and styles.
So my first question goes here:
What is the best method to store these things, knowing they will likely be static? I've been thinking about getting the file content with PHP, showing it in a text editor, and when a save is made, replacing the old one (making a copy if it hasn't ever been edited before, so we keep the "original").
I think this would be considerably faster than using MySQL and storing the language/style there.
What about security here? Should I create an .htaccess that asks for a password on this folder?
I know how to do a replace using foreach, getting an array from the database and using str_replace($name, $value, $file), but if I store the language in a file, can I build an associative array from its content (like JSON)?
Thanks a lot, and sorry for so many questions; I'm a newbie.
This is what I'm doing in my CMS:
For each plugin/program/entity (you name it) I develop, I create a /translations folder.
I put all my translations there, named like el.txt, de.txt, uk.txt, etc. - all languages.
I store the translation data in JSON, because it's easy to write to, easy to read from, and easiest for everyone to contribute theirs.
Files can easily be UTF-8 encoded in-file without messing with databases, making it possible to read them in file mode (just JSON-parse them).
On installation of such plugins, I just loop through all the translations and put them in the database, one language per table row (e.g. a data column of TEXT datatype).
On each page render I query the database once for the row of the selected language and call json_decode() on the whole result; then I put it in $_SESSION, so subsequent requests get flash-speed translated strings for the currently selected language.
The whole thing was developed with both performance and compatibility in mind.
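A minimal sketch of that flow in PHP. The table layout, file paths, and the `translations` table name are assumptions for illustration; `$db` is assumed to be a PDO connection.

```php
<?php
// Sketch of the flow described above: translations live in JSON files,
// are imported into the DB at install time, and are cached in the
// session after the first query.

// Install-time import: loop over the /translations/*.txt JSON files.
function import_translations(PDO $db, string $dir): void
{
    foreach (glob($dir . '/*.txt') as $file) {
        $lang = basename($file, '.txt'); // e.g. 'de' from de.txt
        $db->prepare('REPLACE INTO translations (lang, data) VALUES (?, ?)')
           ->execute([$lang, file_get_contents($file)]);
    }
}

// Per-request lookup: one query and one json_decode(), then $_SESSION.
function load_translations(PDO $db, string $lang): array
{
    if (session_status() === PHP_SESSION_NONE) {
        @session_start();
    }
    if (isset($_SESSION['i18n'][$lang])) {
        return $_SESSION['i18n'][$lang]; // the "flash-speed" path
    }
    $stmt = $db->prepare('SELECT data FROM translations WHERE lang = ?');
    $stmt->execute([$lang]);
    $strings = json_decode((string) $stmt->fetchColumn(), true) ?? [];
    return $_SESSION['i18n'][$lang] = $strings;
}
```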
The benefit of storing on the HDD vs the DB is that backups won't waste as much space: once a file has been backed up, it doesn't take up tape the next day, whereas a DB gets fully backed up every day and takes up increasing amounts of space. The downside to writing to disk is that it increases the chance of somebody uploading something malicious and being clever enough to figure out how to execute it. You just need to be more careful, that's all.
Yes, use .htaccess to limit any action on a writable folder. Good job thinking ahead about that risk.
Your approach sounds like a good strategy.
Good luck.

Caching large amounts of content with PHP + MySQL

I'm making an engine/CMS for story-based web browser games. I have quite a bit of data: characters, items, and the bits of story that the player will interact with. The intention behind this project is that writers don't have to be programmers in order to create a narratively-driven web game. It would only require basic knowledge of FTP and website management in order to start creating content.
The problem is that I think the database is going to bog these games down. Each character can have a lot to them, and the stories are going to be extensive. Each bit of story will have its own written text, which could be 100 characters or 500 characters. There's no way I could cache all that with memcached or something similar!
Thankfully, each state of the game is "pushed" through a deploy, meaning you don't just add a character and they appear in the world; you have to add them, and then push a build of the game. I believe I can use this to my advantage. My working notion right now is:
There will be three databases total. One will be the 'working' content DB, another the 'live' content DB, and then finally the DB that holds all user data. (where they are in the story, items they've obtained, etc.) My idea is that I'll push with the working DB, completely destroy the live, and rebuild the live based on what's in the working DB at the time of the push. The live DB will then benefit from read-only abilities: such as the ARCHIVE storage engine and quite a bit of indexing. This sounds pretty solid, but I'm not experienced enough to be confident that this is the best way to go about my business.
I'd love to know if anyone has any suggestions for a new model, or even a suggestion to my current model.
What you're saying sounds like it'll work. You'll have to build your framework and then inject some dummy game data to see how it responds.
One nice thing about gaming is that you can get away with numerous loading screens/bars, so take advantage of that. :)

What's the most compact way to store diffs in a database?

I want to implement something similar to Wikimedia's revision history. What would be the best PHP functions/libraries/extensions/algorithms to use?
I would like the diffs to be as compact as possible, but I'm happy to be restricted to only showing the difference between each revision and its sibling, and only being able to roll back one revision at a time.
In some cases only a few characters may change, whereas in other cases the whole string could change, so I'm keen to understand whether some techniques are better for small changes than for large ones, and if in some cases it's more efficient to simply store whole copies.
Backing the whole system with something like Git or SVN seems a bit extreme, and I don't really want to store files on disk.
It is much easier to store each record in its entirety than it is to store diffs of them. Then, if you want a diff of two revisions, you can generate one as needed using the PEAR Text_Diff library.
I like to store all versions of the record in a single table and retrieve the most recent one with MAX(revision), a "current" boolean attribute, or similar. Others prefer to denormalize and have a mirror table that holds non-current revisions.
If you store diffs instead, your schema and algorithms become much more complex. You then need to store at least one "full" revision and multiple "diff" versions, and reconstruct a full version from a set of diffs whenever you need a full version. (This is how SVN stores things. Git stores a full copy of each revision, not diffs.)
Programmer time is expensive, but disk space is usually cheap. Please consider whether storing each revision in full is really a problem.
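A minimal sketch of the store-everything-in-full approach described above (table and column names are made up for illustration; `$db` is assumed to be a PDO connection):

```php
<?php
// Every save inserts a complete copy with the next revision number;
// the latest version is fetched directly, no diff reconstruction.

function save_revision(PDO $db, int $pageId, string $text): void
{
    // Next revision = highest existing revision for this page + 1.
    $db->prepare(
        'INSERT INTO page_revisions (page_id, revision, body)
         SELECT ?, COALESCE(MAX(revision), 0) + 1, ?
         FROM page_revisions WHERE page_id = ?'
    )->execute([$pageId, $text, $pageId]);
}

function latest_revision(PDO $db, int $pageId): ?string
{
    $stmt = $db->prepare(
        'SELECT body FROM page_revisions
         WHERE page_id = ?
         ORDER BY revision DESC LIMIT 1'
    );
    $stmt->execute([$pageId]);
    $body = $stmt->fetchColumn();
    return $body === false ? null : $body;
}
```

Rolling back one revision is then just reading `revision - 1`, and a diff between any two revisions can be generated on demand from the two full bodies.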
You must ask yourself: what will the end user want to retrieve more often - revisions, or diffs of revisions?
I would use the standard unix diff for that and, depending on the answer to the question above, store either diffs or whole revisions in the database.
Backing the whole system with something like Git or SVN seems a bit extreme
Why? GitHub, AFAIR, stores wikis that way ;)
I would implement it using diff to create the delta and patch to apply one or more edits in sequence to build a document at a known state. Of course, the more you do this, the clearer it becomes that you can offload this task to a version control tool. I have twice re-designed diff/patch systems to use SVN for this type of task.
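On the command line, that diff/patch round trip looks like this (GNU diff and patch; the file names are made up):

```shell
# Create two versions, derive the delta, then rebuild v2 from v1 + patch.
printf 'one\ntwo\nthree\n' > doc.v1
printf 'one\n2\nthree\n'   > doc.v2

diff -u doc.v1 doc.v2 > v1-to-v2.patch || true  # diff exits 1 when files differ

cp doc.v1 rebuilt
patch rebuilt < v1-to-v2.patch                  # apply the delta to reach v2

cmp -s rebuilt doc.v2 && echo "rebuilt matches v2"
```

Chaining several patches in sequence is exactly how a diff-based revision store reconstructs a document at a known state.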

Collaboration script - track user contributions

I am developing a collaboration tool in PHP and MySQL, and I wanted to ask what would be the most efficient way to do the following: say I have a block of text that will get edited by different users. I need to record each change, and when the changed text is viewed, the text changed by a particular user should be highlighted (possibly with CSS and/or jQuery).
I am not looking for a particular code snippet (and you can see that my question is fairly vague), but I was hoping to get an idea of how to approach this particular problem.
As always cheers for all suggestions.
One way to do this, if you're using git for version control, would be to use the git blame command. It'll show you, line by line, who changed what, at what time, and in what commit. Here's some documentation, and here's a GUI. I prefer to run this from the command line with
git blame path/to/filename.m
If you don't use git, and want to learn a bit more, you might check out the Git Community Book.
If you are going to build it from scratch, then I have an idea. You can create an 'Event' table in which you record every change made to your document.
The table covers 3 concepts: "modified time", "who modified", and "what changed". The "what changed" part is the main problem here. In my opinion, since you will very likely need to support a "revert" ability, you should save every version of the document. So for "what changed" you only need 2 columns: "before_change_text_link", which refers to the file before the change, and "after_change_text_link", which refers to the file after the change. That way, you can record all the changes.
You can then highlight the changes made by different users with jQuery/CSS, using some text-comparison procedure on the server.
Ok, so I've come up with a solution: when a user submits the new text, run diff on it and record the lines that have been changed. These will be stored in a MySQL table with the user id, the string that diff returns, and its respective line number. This way I can return the original text, return the changed strings for a particular user, and use a regex to highlight them.
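That submit-time step could be sketched like this. The per-line comparison below is a deliberately naive stand-in for a real diff, and the record layout mirrors the hypothetical MySQL table (user id, line number, changed string):

```php
<?php
// Sketch of the flow above: diff the submitted text against the stored
// one and keep each changed line with the editor's id and line number.

function changed_lines(string $old, string $new, int $userId): array
{
    $a = explode("\n", $old);
    $b = explode("\n", $new);
    $changes = [];
    foreach ($b as $i => $line) {
        // A line counts as changed if it differs from, or extends past,
        // the previous version at the same position.
        if (!isset($a[$i]) || $a[$i] !== $line) {
            $changes[] = [
                'user_id' => $userId,
                'line_no' => $i + 1,   // 1-based, as stored in the table
                'text'    => $line,
            ];
        }
    }
    return $changes;
}
```

On display, each stored line number for a given user could then be wrapped in, say, `<span class="user-7">…</span>` for the CSS highlighting mentioned above.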
