Implementing Bayes classifier (in PHP) - php

I have a theoretical question about a Naive Bayes Classifier. Assume I have trained the classifier with the following training data:
class word count
-----------------
pos good 1
sun 1
neu tree 1
neg bad 1
sad 1
Assume I now classify "good sun great". There are now two options:
1) classify against the trainingdata, which remains static. Meaning both "good" and "sun" come from the positive category, classifying this string as a positive. After classification, the training table remains unchanged. All strings are thus classified against the static set of training data.
2) You classify the string, but then update the training data, as in the table underneath. Thus, the next string will be classified against a more "advanced" set of training data than this one. By the end of (automatic) classification, the table that started out as a simple training set, will have grown in size, having been expanded with many words (and updated word counts)
class word count
-----------------
pos good 2
sun 2
great 1
neu tree 1
neg bad 1
sad 1
In my implementation of NMB I used the first method, but I'm now second-guessing I should have done the latter. Please enlighten me :-)

The method you've implemented is indeed the popular and accepted way of building classifiers (and not just Bayesian ones).
Using "unlabeled" data, i.e. data you have no ground-truth about, to update the classifier, is a more advanced and complicated technique, sometimes called "semi-supervised learning".
Using this class of algorithms might or might not be a good fit to your specific task - it's usually a matter of trial and error.
If you do decide to incorporate unlabeled data into your model, you should probably try out one of the popular algorithms of doing that, e.g. EM.

Related

Convert text in specific format into real PHP code assignments

I'm having some problems to get a text in a specific format into real working PHP code.
My text file:
#T1:The German sociologist Max Weber once proposed
#S:Jos Bleau
#C:jos.bleau#domain.com
#L:"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic
#R:At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.
#ST:Fusions
#R:Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#ST:Fusions
#R:The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#BV:Thirty years later, Breidbart remembers
#CP:(Picture: Credit – Jos Bleau) or #CP:(Picture: Thanks)
The expected output I need (Half pseudo code; Unescaped quotes):
<?php
$title1 = 'The German sociologist Max Weber once proposed';
$signature = 'Jos Bleau';
$email = 'jos.bleau#domain.com';
$lead = '"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
$text[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
$subtitle[] = 'Fusions';
//etc...
?>
Note:
The names like $title1 and #T1 are completely unrelated to each other and $title1 is just used as example. It could also be $xy or something else
If #XY appears more than once in the file then the values should be added as array element, else as simple assignment
I don't know if preg_split() is the correct direction and I can do it with it? Or do I have to use other functions to accomplish this?
Explanation
First we get the data from the text file into a variable with file_get_contents() and also initialize our $output array, where each element is a line in the output, with a php tag <?php.
You can also modify $lookup with shortcut => variable name elements, where you can define which #XY: gets replaced with which variable name. If not defined the shortcut will be used as variable name.
Now that we have prepared some stuff we match each #XY: with the corresponding data with preg_match_all().
Regular Expression
/#(\w+):(.*?)(?=#\w+:)/s
\w+ matches all word characters \[a-zA-Z0-9_\], which is the XY part from #XY: and we keep it with a capturing group
+ is a quantifier and says that \w should match 1 or more times
(.*?) matches everything as much as needed
With the flag s, * also matches new lines
(?=#\w+:) makes sure (.*?) matches everything until the next #XY: and not more. Where ?= is a positive lookahead and as it says it looks ahead if that regex in the parentheses(#\w+) can be matched
We also preemptively save the amount each shortcut appears in the data with array_count_values().
Now that we have matched all data which we want we can loop through all shortcuts, which are saved in $m[1]. In the foreach loop we simply check if you have defined a lookup variable name or if we use the shortcut as variable name.
Then we simply add each assignment as new element to the output array. Where you have to note three things:
Complex (curly) syntax is used, so that you don't get problems with invalid variable names, see: How can I access a property with an invalid name?
Depending on how many times a shortcut appeared in the data we decide if it should be added as array element or normal assignment. If the shortcut appears more than once in the data it will be adding the value as array element else as simple string assignment
We use trim() to remove spaces, new lines, ... from the start and end of the string. And we use addslashes(), so we don't get problems with quotes
Done. And now we are already done. Just depending on how you want to output the result you can save it to a file with file_put_contents() or just print out the array.
Code
<?php
$text = file_get_contents("test.txt");
$output = ["<?php"];
$lookup = []; //Example: ["ST" => "subtitle"]
preg_match_all("/#(\w+):(.*?)(?=#\w+:)/s", $text, $m);
$variableShortcutCount = array_count_values($m[1]);
foreach($m[1] as $key => $variableShortcut){
if(isset($lookup[$variableShortcut])){
$output[] = '${"' . $lookup[$variableShortcut] . ($variableShortcutCount[$variableShortcut] > 1 ? '"}[]' : '"}') . " = '". addslashes(trim($m[2][$key])) . "';" ;
} else {
$output[] = '${"' . $variableShortcut . ($variableShortcutCount[$variableShortcut] > 1 ? '"}[]' : '"}') . " = '". addslashes(trim($m[2][$key])) . "';" ;
}
}
//Output to file
//file_put_contents("output.txt", implode(PHP_EOL, $output));
//Output to browser
echo "<pre><code>";
highlight_string(implode(PHP_EOL, $output));
?>
output:
<?php
${"T1"} = 'The German sociologist Max Weber once proposed';
${"S"} = 'Jos Bleau';
${"C"} = 'jos.bleau#domain.com';
${"L"} = '\"He used to be so conservative,\" she says, throwing up her hands in mock exasperation. \"We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
${"R"}[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn\'t borrow source code from past programs, and yet, with a single stroke of the president\'s pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab\'s scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard\'s father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son\'s aversion to authority. She can also attest to her son\'s lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"BV"} = 'Thirty years later, Breidbart remembers';
${"CP"} = '(Picture: Credit – Jos Bleau) or';

Formula/Class to find best lineup based on salary cap/points

With all of the daily fantasy games out there, I am looking to see if I can easily implement a platform that will help identify the optimal lineup for a fantasy league based on a salary cap and projected points for each player.
If given a pool of ~500 players and you need to find the highest scoring lineup of within the maximium salary cap restraints.
1 Quarter Back
2 Running Back
3 Wide Receiver
1 Tight End
1 Kicker
1 Defense
Each player is assigned a salary (that changes weekly) and I will assign projected points for those players. I have this information in a MySQL DB and would prefer to use PHP/Pear or JQuery if that's the best option for calculating this.
The Table looks something like this
player_id name position salary ranking projected_points
1 Joe Smith QB 1000 2 21.7
2 Jake Plummer QB 2500 6 11.9
I've tried sorting by projected points and filling in the roster, but it obviously will provide the highest scoring team, but also exceeds the salary cap. I cannot think of a way to have it intelligently remove players and continue to loop through and find the highest scoring lineup based on the salary constraints.
So, is there any PHP or Pear class that you know of that will help "Solve" this type of problem? Any articles you can point me to for reference? I'm not asking for someone to do this, but I've been Googleing for a while and the best solution I currently have is this. http://office.microsoft.com/en-us/excel-help/pick-your-fantasy-football-team-with-solver-HA001124603.aspx and that's using Excel and limited to 200 objects.
I'll suggest two approaches to this problem.
The first is dynamic programming. For brute force, we could initialize a list containing the empty partial team, then, for each successive player, for each partial team currently in the list, add a copy of that partial team with the new player, assuming that this new partial team respects the positional and budget constraints. This is an exponential-time algorithm, but we can reduce the running time by quite a lot (to O(#partial position breakdowns * budget * #players), assuming that all monetary values are integer) if we throw away all but the best possibility so far for each combination of partial position breakdown and budget.
The second is to find an integer programming library callable from PHP that works like Excel's solver. It looks like (e.g.) lpsolve comes with a PHP interface. Then we can formulate an integer program like so.
maximize sum_{player p} value_p x_p
subject to
sum_{quarterback player p} x_p <= 1
sum_{running back player p} x_p <= 2
...
sum_{defense player p} x_p <= 1
sum_{player p} cost_p <= budget
for each player p, x_p in {0, 1} (i.e., x_p is binary)

Levenshtein - grouping hotel names

I have to group some hotel into the same category based on their names. I'm using levenshtein for grouping, but how much I've tried, some hotel are leaved outside the category they supposed to be, or in another category.
For example: all these hotel should be in the same category:
=============================
Best Western Bercy Rive Gauche
Best Western Colisee
Best Western Ducs De Bourgogne
Best Western Folkestone Opera
Best Western France Europe
Best Western Hotel Sydney Opera
Best Western Paris Louvre Opera
Best Western Hotel De Neuville
=============================
I'm having a list with all hotel names( like 1000 rows ). I also have how they should be grouped.
Any idea how to optimize levenshtein, making it more flexible for my situation?
$inserted = false;
foreach($hotelList as $key => $value){
if (levenshtein($key, $hotelName, 2, 5, 1) <= abs(strlen($key) - strlen($hotelName))){
array_push($hotelList[$key], trim($line));
$inserted = true;
}
}
// if no match was found add another entry
if (!$inserted){
$hotelList[$hotelName] = array(
trim($line)
);
}
I'll wade in with my thoughts. Firstly, grouping or "clustering" data like this is a pretty big topic, I won't really go into it particularly but perhaps point things in an ideal direction.
You did a brilliant thing by normalizing Levenshtein on the length of the strings compared- that's exactly right because you avoid the problem that the length of the string would overdetermine the similarity in many cases.
But the algorithm didn't solve the problem. For a start, we want to compare words. "Bent Eastern French Hotels" is obviously very different to "Best Western French Hotels", yet it would score better than "Best Western Paris Bed and Breakfasts", say. The intution to grasp here is that your tokens shouldn't be characters but words.
I like #saury's answer, but I'm not sure about the assumption at the beginning. Instead, let's start with something nice and easy often called "bag of words". We then implement a hashing trick, which would allow you to idetify the key phrases based on the intuition that the least used words contain the most information.
If you subscribe to the idea that hotel brand names are near the beginning you could always skew on their proximity to the start of the string too. Thing is, your groups will as likely end up being "France" as "Best" / "Western" (but not "hotel"- why?).
You want your results to be more accurate?
From here on in, we're gonna have to take a step up to some serious algorithms- enjoy surfing the many stack overflow topics. My instinct is that I bet many hotel names aren't branded at all, so you'll need different categories for them too. And my instinct is also that the number of repeated words in hotel names is going to be relatively slim- some words will be frequent members of hotel names. These facts would be problems for the above. In this case, there's a really popular (if cliched for SO) technique called k-means, a fun introduction to which would be to extend an algorithm like this (very bravely written in php) to take your chosen n keyphrases as the n dimensions of the cluster, then take the majority components of the cluster center-points as your categorization tags. (That would eliminate "France", say, because hits for "France" would be spread across the n-dimensional space pretty evenly).
This is probably all a bit much to take on for something that would seem like a small problem- but I want to emphasize that if your data isn't structured, there really aren't any short-cuts to doing things properly.
what levenshtein distance value do you take as the delta between words to be treated as part of same group ? Seems that you tend to group hotels based on the initial few words and that will require a different approach altogether (like do dictionary sort , compare current string with next strings etc). However if your use-case still requires to calculate levenshtein distance then I would suggest you to sort the Strings based on their length and then start comparing each string with other strings of similar length (apply you own heuristic to what you consider as 'similar' like you may say isSimilar = Math.abs(str1.length - str2.length) < SOME_LOWEST_DELTA_VALUE or something like that)
You might want to read about http://en.wikipedia.org/wiki/K-means_clustering and http://en.wikipedia.org/wiki/Cluster_analysis in general.

Optimal way to find common negative space in map of student_ids to class_times

I am writing a proof of concept for a scheduling application in PHP. I have a 2D array of the student schedule in the format of (str) class_time => (array) student_ids, printout: http://d.pr/i/UKAy.
At this point in the processing, I need to determine which class_time is the most appropriate to host a new course with say 10 students requesting it. To this end, I would I want to determine how many students have n class_times available, ideally stored as class_time => student_ids => n_available_class_times.
So, what is an ideal way to build/search this data? The end result is a list of all class_times and an idea of what students can utilize a given class as each new course is scheduled. This allows me to sort by available_class_times to find the students who are most constrained in their schedule, and who need priority in being schedule to a given class given how hard it would be to schedule them in the future, given a number of present/potential constraints.
Something as followed would help. Each array student_ids needs to be sorted. You can use quick sort to do that in nlog(n) time. Then you would have to start planning. I think something like AB pruning would work here because you have some optimal states at the end and decisions along the way that affect your optimal state. (the sorting bit at the start is just to make it faster)
Here is some stuff on AB pruning:
To start, there is a decision algorithm called min-max that states that all the decisions in a "game" lead to a final state that is either infinitely good or infinitely bad i.e. winning or losing. So you build this tree each node representing a "game state", in your case a state of students being scheduled. And then you search the tree. Transversing it for best move states. In your case optimal scheduling. At each node, you decide if it is an end state and call it at either infinite or negative infinite or you branch off into other nodes. Note that this is not a binary tree. The decision tree nodes have n branches where n is the number of decisions you can make there. This is not too great for what your doing, but it requires explaining to understand AB pruning.
Now assume that instead of just asking if a node is a win or a lose, you could weight how good of a game state it was. In your case based on the number of students that could be optimally scheduled. The as you tranversed the huge decision tree, you can cut out huge sections because you know they lead to crappy "game states" i.e. states where the students you want easily placed and not easily placed. The way you do this is by considering nodes that lead to game states B that you know to be worst than A (the node you previously evaluated). This is good because searching this tree is a serious computational task. This allows you to evaluate even deeper by ignoring huge sections (something that is really awesome huge computational gain). This gets you your answer of best class schedule states. Good Luck Dude.
// HERE IS SOME CODE FROM THE INTERNET
function alphabeta(node, depth, α, β, Player)
if depth = 0 or node is a terminal node
return the heuristic value of node
if Player = MaxPlayer
for each child of node
α := max(α, alphabeta(child, depth-1, α, β, not(Player) ))
if β ≤ α
break (* Beta cut-off *)
return α
else
for each child of node
β := min(β, alphabeta(child, depth-1, α, β, not(Player) ))
if β ≤ α
break (* Alpha cut-off *)
return β
(* Initial call *)
alphabeta(origin, depth, -infinity, +infinity, MaxPlayer)
Here is a link on the subject:
http://en.wikipedia.org/wiki/Alpha%E2%80%93beta_pruning

LSA - Latent Semantic Analysis - How to code it in PHP?

I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.
Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to chose?
I don't want to use any external libraries. I've already an implementation for the Singular Value Decomposition (SVD).
Extract all words from the given text.
Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
Build up a matrix: The columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weight.
Do the Singular Value Decomposition (SVD).
Use the values in the matrix S (SVD) to do the dimension reduction (how?).
I hope you can help me. Thank you very much in advance!
LSA links:
Landauer (co-creator) article on LSA
the R-project lsa user guide
Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.
Assumptions:
your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.
M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tfidf or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tfidf).
U,Sigma,V = singular_value_decomposition(M)
U: w x w
Sigma: min(w,d) length vector, or w * d matrix with diagonal filled in the first min(w,d) spots with the singular values
V: d x d matrix
Thus U * Sigma * V = M
# you might have to do some transposes depending on how your SVD code
# returns U and V. verify this so that you don't go crazy :)
Then the reductionality.... the actual LSA paper suggests a good approximation for the basis is to keep enough vectors such that their singular values are more than 50% of the total of the singular values.
More succintly... (pseudocode)
Let s1 = sum(Sigma).
total = 0
for ii in range(len(Sigma)):
val = Sigma[ii]
total += val
if total > .5 * s1:
return ii
This will return the rank of the new basis, which was min(d,w) before, and we'll now approximate with {ii}.
(here, ' -> prime, not transpose)
We create new matrices: U',Sigma', V', with sizes w x ii, ii x ii, and ii x d.
That's the essence of the LSA algorithm.
This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yeilds more than a simple tf-idf is a matter of some debate.
To me, LSA performs poorly in real world data sets because of polysemy, and data sets with too many topics. It's mathematical / probabilistic basis is unsound (it assumes normal-ish (Gaussian) distributions, which don't makes sense for word counts).
Your mileage will definitely vary.
Tagging using LSA (one method!)
Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic
By hand, look over the U' matrix, and come up with terms that describe each "topic". For example, if the the biggest parts of that vector were "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in a associative array, or list. This step should be reasonable since the number of vectors will be finite.
Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then give their "topics" as computed in the previous step.
This answer isn't directly to the posters' question, but to the meta question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the line of autotagging. If they really mean NER, then this response is hogwash :)
Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:
By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of "hand-tagged", even if it's only 50 or 100, will be a good basis for comparing whatever the autogenerated methods below get you.
Dimentionality reductions, using LSA, Topic-Models (Latent Dirichlet Allocation), and the like.... I've had really poor luck with LSA on real-world data sets and I'm unsatisfied with its statistical basis. LDA I find much better, and has an incredible mailing list that has the best thinking on how to assign topics to texts.
Simple heuristics... if you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out all the common words (stop words) and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence, and see where that gets you. If the texts are all in english, then do part of speech analysis on the whole shebang, and see what that gets you. With structured items, like news reports, LSA and other order independent methods (tf-idf) throws out a lot of information.
Good luck!
(if you like this answer, maybe retag the question to fit it)
That all looks right, up to the last step. The usual notation for SVD is that it returns three matrices A = USV*. S is a diagonal matrix (meaning all zero off the diagonal) that, in this case, basically gives a measure of how much each dimension captures of the original data. The numbers ("singular values") will go down, and you can look for a drop-off for how many dimensions are useful. Otherwise, you'll want to just choose an arbitrary number N for how many dimensions to take.
Here I get a little fuzzy. The coordinates of the terms (words) in the reduced-dimension space is either in U or V, I think depending on whether they are in the rows or columns of the input matrix. Off hand, I think the coordinates for the words will be the rows of U. i.e. the first row of U corresponds to the first row of the input matrix, i.e. the first word. Then you just take the first N columns of that row as the word's coordinate in the reduced space.
HTH
Update:
This process so far doesn't tell you exactly how to pick out tags. I've never heard of anyone using LSI to choose tags (a machine learning algorithm might be more suited to the task, like, say, decision trees). LSI tells you whether two words are similar. That's a long way from assigning tags.
There are two tasks- a) what are the set of tags to use? b) how to choose the best three tags?. I don't have much of a sense of how LSI is going to help you answer (a). You can choose the set of tags by hand. But, if you're using LSI, the tags probably should be words that occur in the documents. Then for (b), you want to pick out the tags that are closest to words found in the document. You could experiment with a few ways of implementing that. Choose the three tags that are closest to any word in the document, where closeness is measured by the cosine similarity (see Wikipedia) between the tag's coordinate (its row in U) and the word's coordinate (its row in U).
There is an additional SO thread on the perils of doing this all in PHP at link text.
Specifically, there is a link there to this paper on Latent Semantic Mapping, which describes how to get the resultant "topics" for a text.

Categories