I need to parse a file with the following format.
0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS
The ID and ISBN are not a problem, the title is. There is no set length for these fields, and there are no solid delimiters; the space can be used for most of the file.
Another issue is that there is not always an entry in the comments field. When there is, there are spaces within the content.
So I can get the first two, and the last fourteen. I need some help figuring out how to parse the middle six fields.
This file was generated by an older program that I cannot change. I am using php to parse this file.
I would also ask myself 'How good does this have to be' and 'How many records are there'?
If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as good as you can, but expect that you will miss some titles and build in a feedback mechanism so your users can help you fix the issues (and make it easy for you to fix them in your new format).
On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close, and then doing a human review of the entire file.
(In my first job, we spent six weeks on a data conversion project to convert 150 records; not a good use of time.)
Find the title and publisher of the book by ISBN (in some on-line database) and parse only the rest :)
BTW. are you sure that what looks like space actually is a space? There are more "invisible" characters (like non-break space). I know, not a good idea, but apparently author of that format was pretty creative...
You need to analyze your data by hand and find out what year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, this is something you can start with.
While I don't see any way other than guessing a bit, I'd go about it something like this:
I'd peel off what I know I can parse out reliably, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM
From there I'd try to locate the Edition, store and remove it, and split the string in two at that position, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) and 64 RANDOM. Another option is to split on the year, but titles such as 1984 might present a problem. (Locating the edition of course assumes it's written as 7th, 51st etc. for all editions.)
Finally I'd assume I could somewhat reliably guess the year 64 at the start of the second string and further narrow down the Publisher(/Comment) part.
The rest is pure guesswork unless you got a list of authors/publishers somewhere to match against as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to 2 strings containing Author/Title in one and Publisher(/Comments) in the other.
All in all it should limit the manual part a bit.
Once done I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
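The peeling steps above can be sketched in PHP roughly as follows. This is only a sketch under assumptions taken from the single sample line: two leading fields, 15 trailing tokens (14 numbers plus a trailing code), and an edition written like "1st" followed by a two-digit year. The function name `peelCatalogLine` and the returned keys are made up for illustration.

```php
<?php
// Sketch: peel off the reliable head and tail, then split the middle at the edition.
// Assumption: 15 trailing tokens, as in the sample line (14 numbers + "ABS").
function peelCatalogLine(string $line, int $tailCount = 15): ?array
{
    $tokens = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    if (count($tokens) < 2 + $tailCount + 2) {
        return null; // too few fields to split safely
    }

    $head   = array_slice($tokens, 0, 2);            // ID, ISBN
    $tail   = array_slice($tokens, -$tailCount);     // numeric columns + trailing code
    $middle = array_slice($tokens, 2, -$tailCount);  // author, title, edition, year, publisher, comments

    // Walk backwards: take the last token that looks like an edition
    // and is directly followed by a two-digit year.
    for ($i = count($middle) - 2; $i >= 0; $i--) {
        if (preg_match('/^\d+(?:st|nd|rd|th)$/i', $middle[$i])
            && preg_match('/^\d{2}$/', $middle[$i + 1])) {
            return [
                'id'                => $head[0],
                'isbn'              => $head[1],
                'authorTitle'       => implode(' ', array_slice($middle, 0, $i)),
                'edition'           => $middle[$i],
                'year'              => $middle[$i + 1],
                'publisherComments' => implode(' ', array_slice($middle, $i + 2)),
                'tail'              => $tail,
            ];
        }
    }

    return null; // no edition/year pair found: flag the line for manual review
}
```

Lines that return `null` are the ones to review by hand; author/title and publisher/comments still have to be separated with a lookup list or by eye.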
I don't know whether the PCRE engine can capture every repetition of a repeated group separately, so each field is written out (the line breaks are for readability only; use the x flag):
([A-Z0-9]{7})\s(\d-\d{3}-\d{5}-\d)\s
(.+)\s(\d+(?:st|nd|rd|th))\s(\d{2})\s
([^\d.]+)\s(\d+\.\d{2})\s(\d+\.\d{2})\s
(\d+\.\d{2})\s(\d+\.\d{2})\s(\d)\s
(\d+\.\d{2})\s(\d+\.\d{2})\s(\d+\.\d{2})\s
(\d+\.\d{2})\s(\d)\s(\d+\.\d{2})\s
(\d+\.\d{2})\s(\d+\.\d{2})\s(\d+\.\d{2})\s
(\w{3})
It does look quite ugly and doesn't fix your author-title problem, but it matches quite well for the rest of it.
Concerning your problem, I don't see any solution other than having a lookup table for authors or using other services to look up title and author via the ISBN.
That's if, unlike in your example above, the authors are not just represented by their first name.
Also double check all exceptions that might occur with the above regex, as titles may contain 1st or the like.
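As a rough illustration, a pattern of this kind could be applied in PHP as below. The version here is a cleaned-up variant (character class `[A-Z0-9]`, escaped dots, a captured year, `th` editions); those adjustments, the anchors, and the names `CATALOG_LINE_PATTERN` and `matchCatalogLine` are assumptions for this sketch, checked only against the one sample line from the question.

```php
<?php
// The x flag lets the pattern span several lines; whitespace inside it is ignored.
const CATALOG_LINE_PATTERN = '/^
    ([A-Z0-9]{7})\s(\d-\d{3}-\d{5}-\d)\s
    (.+)\s(\d+(?:st|nd|rd|th))\s(\d{2})\s
    ([^\d.]+)\s(\d+\.\d{2})\s(\d+\.\d{2})\s
    (\d+\.\d{2})\s(\d+\.\d{2})\s(\d)\s
    (\d+\.\d{2})\s(\d+\.\d{2})\s(\d+\.\d{2})\s
    (\d+\.\d{2})\s(\d)\s(\d+\.\d{2})\s
    (\d+\.\d{2})\s(\d+\.\d{2})\s(\d+\.\d{2})\s
    (\w{3})
$/x';

// Returns the captured fields, or null when the line does not fit the pattern.
function matchCatalogLine(string $line): ?array
{
    if (preg_match(CATALOG_LINE_PATTERN, trim($line), $m) !== 1) {
        return null;
    }
    return [
        'id'          => $m[1],
        'isbn'        => $m[2],
        'authorTitle' => $m[3],               // still unsplit, as noted above
        'edition'     => $m[4],
        'year'        => $m[5],
        'publisher'   => $m[6],
        'numbers'     => array_slice($m, 7, 14),
        'code'        => $m[21],
    ];
}
```

Lines with comments, publishers containing digits, or differently hyphenated ISBNs will not match and come back as `null`, which makes them easy to collect for manual review.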
I'm having trouble converting a text in a specific format into real working PHP code.
My text file:
#T1:The German sociologist Max Weber once proposed
#S:Jos Bleau
#C:jos.bleau#domain.com
#L:"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic
#R:At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.
#ST:Fusions
#R:Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#ST:Fusions
#R:The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#BV:Thirty years later, Breidbart remembers
#CP:(Picture: Credit – Jos Bleau) or #CP:(Picture: Thanks)
The expected output I need (Half pseudo code; Unescaped quotes):
<?php
$title1 = 'The German sociologist Max Weber once proposed';
$signature = 'Jos Bleau';
$email = 'jos.bleau#domain.com';
$lead = '"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
$text[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
$subtitle[] = 'Fusions';
//etc...
?>
Note:
The names like $title1 and #T1 are completely unrelated to each other and $title1 is just used as example. It could also be $xy or something else
If #XY appears more than once in the file then the values should be added as array element, else as simple assignment
I don't know if preg_split() is the right direction. Can I do it with that, or do I have to use other functions to accomplish this?
Explanation
First we get the data from the text file into a variable with file_get_contents() and also initialize our $output array, where each element is a line in the output, with a php tag <?php.
You can also modify $lookup with shortcut => variable name elements, where you can define which #XY: gets replaced with which variable name. If not defined the shortcut will be used as variable name.
Now that we have prepared some stuff we match each #XY: with the corresponding data with preg_match_all().
Regular Expression
/#(\w+):(.*?)(?=#\w+:)/s
\w+ matches all word characters [a-zA-Z0-9_], which is the XY part from #XY:, and we keep it with a capturing group
+ is a quantifier and says that \w should match 1 or more times
(.*?) matches any characters, but as few as needed (lazy)
With the flag s, . also matches new lines
(?=#\w+:) makes sure (.*?) matches everything until the next #XY: and not more. ?= is a positive lookahead: it checks whether the regex in the parentheses (#\w+:) can be matched next, without consuming it
We also preemptively save the number of times each shortcut appears in the data with array_count_values().
Now that we have matched all data which we want we can loop through all shortcuts, which are saved in $m[1]. In the foreach loop we simply check if you have defined a lookup variable name or if we use the shortcut as variable name.
Then we simply add each assignment as new element to the output array. Where you have to note three things:
Complex (curly) syntax is used, so that you don't get problems with invalid variable names, see: How can I access a property with an invalid name?
Depending on how many times a shortcut appears in the data, the value is added as an array element (more than once) or as a simple string assignment (once)
We use trim() to remove spaces, new lines, ... from the start and end of the string. And we use addslashes(), so we don't get problems with quotes
Done. Depending on how you want to output the result, you can save it to a file with file_put_contents() or just print out the array.
Code
<?php
$text = file_get_contents("test.txt");
$output = ["<?php"];
$lookup = []; //Example: ["ST" => "subtitle"]
preg_match_all("/#(\w+):(.*?)(?=#\w+:)/s", $text, $m);
$variableShortcutCount = array_count_values($m[1]);
foreach($m[1] as $key => $variableShortcut){
    if(isset($lookup[$variableShortcut])){
        $output[] = '${"' . $lookup[$variableShortcut] . ($variableShortcutCount[$variableShortcut] > 1 ? '"}[]' : '"}') . " = '". addslashes(trim($m[2][$key])) . "';" ;
    } else {
        $output[] = '${"' . $variableShortcut . ($variableShortcutCount[$variableShortcut] > 1 ? '"}[]' : '"}') . " = '". addslashes(trim($m[2][$key])) . "';" ;
    }
}
//Output to file
//file_put_contents("output.txt", implode(PHP_EOL, $output));
//Output to browser
echo "<pre><code>";
highlight_string(implode(PHP_EOL, $output));
?>
output:
<?php
${"T1"} = 'The German sociologist Max Weber once proposed';
${"S"} = 'Jos Bleau';
${"C"} = 'jos.bleau#domain.com';
${"L"} = '\"He used to be so conservative,\" she says, throwing up her hands in mock exasperation. \"We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
${"R"}[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn\'t borrow source code from past programs, and yet, with a single stroke of the president\'s pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab\'s scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard\'s father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son\'s aversion to authority. She can also attest to her son\'s lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"BV"} = 'Thirty years later, Breidbart remembers';
${"CP"} = '(Picture: Credit – Jos Bleau) or';
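Two details of the output above are worth noting. The lookahead only succeeds when another #XY: follows, so the very last entry in the file is never matched (the second #CP is missing). And addslashes() also escapes double quotes, whose backslash stays literal inside a single-quoted PHP string. A possible variant, offered only as a sketch: the `\z` alternative in the lookahead and the use of var_export() for quoting are changes made here, and the function name `convertTaggedText` is hypothetical.

```php
<?php
// Sketch: turn "#XY:value" entries into lines of PHP assignments.
// Assumed changes vs. the code above: the lookahead also accepts end of
// input (\z), and var_export() produces the quoted string literals.
function convertTaggedText(string $text, array $lookup = []): array
{
    preg_match_all('/#(\w+):(.*?)(?=#\w+:|\z)/s', $text, $m);
    $counts = array_count_values($m[1]);

    $output = ['<?php'];
    foreach ($m[1] as $i => $shortcut) {
        $name   = $lookup[$shortcut] ?? $shortcut;    // optional renaming
        $suffix = $counts[$shortcut] > 1 ? '[]' : ''; // repeated shortcut => array element
        $output[] = '${' . var_export($name, true) . '}' . $suffix
            . ' = ' . var_export(trim($m[2][$i]), true) . ';';
    }
    return $output;
}

// Usage: echo implode(PHP_EOL, convertTaggedText(file_get_contents('test.txt'), ['ST' => 'subtitle']));
```

As before, a value that itself contains something shaped like `#word:` would be split at that point, so the input still needs a quick check.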
I have to group some hotels into the same category based on their names. I'm using levenshtein for grouping, but however much I've tried, some hotels are left outside the category they're supposed to be in, or end up in another category.
For example: all these hotels should be in the same category:
=============================
Best Western Bercy Rive Gauche
Best Western Colisee
Best Western Ducs De Bourgogne
Best Western Folkestone Opera
Best Western France Europe
Best Western Hotel Sydney Opera
Best Western Paris Louvre Opera
Best Western Hotel De Neuville
=============================
I have a list with all the hotel names (about 1000 rows). I also have how they should be grouped.
Any idea how to optimize levenshtein, making it more flexible for my situation?
$inserted = false;
foreach($hotelList as $key => $value){
    if (levenshtein($key, $hotelName, 2, 5, 1) <= abs(strlen($key) - strlen($hotelName))){
        array_push($hotelList[$key], trim($line));
        $inserted = true;
    }
}
// if no match was found add another entry
if (!$inserted){
    $hotelList[$hotelName] = array(
        trim($line)
    );
}
I'll wade in with my thoughts. Firstly, grouping or "clustering" data like this is a pretty big topic, I won't really go into it particularly but perhaps point things in an ideal direction.
You did a brilliant thing by normalizing Levenshtein on the length of the strings compared- that's exactly right because you avoid the problem that the length of the string would overdetermine the similarity in many cases.
But the algorithm didn't solve the problem. For a start, we want to compare words. "Bent Eastern French Hotels" is obviously very different to "Best Western French Hotels", yet it would score better than "Best Western Paris Bed and Breakfasts", say. The intuition to grasp here is that your tokens shouldn't be characters but words.
I like @saury's answer, but I'm not sure about the assumption at the beginning. Instead, let's start with something nice and easy often called "bag of words". We then implement a hashing trick, which would allow you to identify the key phrases based on the intuition that the least used words contain the most information.
If you subscribe to the idea that hotel brand names are near the beginning you could always skew on their proximity to the start of the string too. Thing is, your groups will as likely end up being "France" as "Best" / "Western" (but not "hotel"- why?).
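A first step towards the word-based view could be to count in how many names each word occurs. The sketch below does only that; how to turn the counts into key phrases (for example, ignoring words that occur almost everywhere or only once) is left open, and the function name `wordDocumentFrequency` is made up here.

```php
<?php
// Sketch: treat each hotel name as a bag of words and count, per word,
// the number of names that contain it (document frequency).
function wordDocumentFrequency(array $names): array
{
    $frequency = [];
    foreach ($names as $name) {
        $words = preg_split('/\s+/', strtolower(trim($name)), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $word) {
            $frequency[$word] = ($frequency[$word] ?? 0) + 1;
        }
    }
    arsort($frequency); // most widespread words first
    return $frequency;
}
```

On the eight names from the question this gives 8 for "best" and "western", 3 for "opera" and 2 for "hotel", which shows both the brand words and the kind of noise the answer warns about.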
You want your results to be more accurate?
From here on in, we're gonna have to take a step up to some serious algorithms- enjoy surfing the many stack overflow topics. My instinct is that I bet many hotel names aren't branded at all, so you'll need different categories for them too. And my instinct is also that the number of repeated words in hotel names is going to be relatively slim- some words will be frequent members of hotel names. These facts would be problems for the above. In this case, there's a really popular (if cliched for SO) technique called k-means, a fun introduction to which would be to extend an algorithm like this (very bravely written in php) to take your chosen n keyphrases as the n dimensions of the cluster, then take the majority components of the cluster center-points as your categorization tags. (That would eliminate "France", say, because hits for "France" would be spread across the n-dimensional space pretty evenly).
This is probably all a bit much to take on for something that would seem like a small problem- but I want to emphasize that if your data isn't structured, there really aren't any short-cuts to doing things properly.
What levenshtein distance value do you take as the delta between words to be treated as part of the same group? It seems that you tend to group hotels based on the initial few words, and that will require a different approach altogether (like doing a dictionary sort, comparing the current string with the next strings, etc). However, if your use-case still requires calculating the levenshtein distance, then I would suggest sorting the strings by length and then comparing each string with other strings of similar length (apply your own heuristic for what you consider 'similar', e.g. isSimilar = Math.abs(str1.length - str2.length) < SOME_LOWEST_DELTA_VALUE or something like that)
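One simple reading of "group by the initial few words" is to use the first n words of each name as the group key. This is a sketch, not a full clustering: the choice of n = 2 and the name `groupByLeadingWords` are assumptions for illustration.

```php
<?php
// Sketch: group names by their first $n words (case-insensitive).
function groupByLeadingWords(array $names, int $n = 2): array
{
    $groups = [];
    foreach ($names as $name) {
        $words = preg_split('/\s+/', trim($name), -1, PREG_SPLIT_NO_EMPTY);
        $key   = strtolower(implode(' ', array_slice($words, 0, $n)));
        $groups[$key][] = trim($name);
    }
    return $groups;
}
```

Brands with a one-word or three-word name would need a different n, so a fixed prefix length only works as a first pass.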
You might want to read about http://en.wikipedia.org/wiki/K-means_clustering and http://en.wikipedia.org/wiki/Cluster_analysis in general.
I am coding a social network and I need a way to list the most used trends. All statuses are stored in a content field, so what I need to do is match hashtag mentions such as: #trend1 #trend2 #anothertrend
and sort by them. Is there a way I can do this with MySQL, or would I have to do this only with PHP?
Thanks in advance
The maths behind trends are somewhat complex; machine learning may be a bit over the top, but you probably need to work through some examples.
If you go with @deadtrunk's sample code, you would miss trends that have fired up in the last half hour; if you go with @eggyal's example, you miss trends that have been going strong all day, but calmed down in the last half hour.
The classic solution to this problem is to use a derivative function (http://en.wikipedia.org/wiki/Derivative); it's worth building a sample database and experimenting with this, and making your solution flexible enough to change this over time.
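A very crude discrete stand-in for a derivative is the change in mention count between two consecutive time windows. The sketch below assumes the mentions are already available as `[tag, unixTimestamp]` pairs; the half-hour window, the input shape and the name `trendVelocity` are illustrative choices, not part of any answer here.

```php
<?php
// Sketch: per tag, (mentions in the current window) - (mentions in the previous window).
function trendVelocity(array $mentions, int $now, int $window = 1800): array
{
    $velocity = [];
    foreach ($mentions as [$tag, $time]) {
        $age = $now - $time;
        if ($age < 0 || $age >= 2 * $window) {
            continue; // outside both windows
        }
        // +1 for the current window, -1 for the one before it
        $velocity[$tag] = ($velocity[$tag] ?? 0) + ($age < $window ? 1 : -1);
    }
    arsort($velocity); // fastest-rising tags first
    return $velocity;
}
```

A tag that is steady all day scores around zero here, so in practice this number would be combined with the plain count rather than used alone.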
Whilst you want to build something simple, your users will be used to trends, and will assume it's broken if it doesn't work the way they expect.
You should probably extract the hashtags using PHP code, and then store them in your database separately from the content of the post. This way you'll be able to query them directly, rather than parsing the content every time you sort.
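Extracting the tags when a status is saved could look roughly like this. It is a minimal sketch: the rule that a tag is `#` followed by word characters, the lower-casing, and the name `extractHashtags` are assumptions.

```php
<?php
// Sketch: return the distinct hashtags of one status, lower-cased, without the "#".
function extractHashtags(string $content): array
{
    // The lookbehind skips a "#" glued to a preceding word character or "&" (e.g. HTML entities).
    preg_match_all('/(?<![\w&])#(\w+)/u', $content, $matches);
    return array_values(array_unique(array_map('strtolower', $matches[1])));
}
```

Each returned tag can then be written as one row to the separate table, together with the status id and timestamp.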
I think it is better to store tags in dedicated table and then perform queries on it.
So if you have the following table layout
trend | date
You'll be able to get trends using following query:
SELECT COUNT(*), trend FROM `trends` WHERE `date` = '2012-05-10' GROUP BY trend
18 test2
7 test3
Create a table that associates hashtags with statuses.
Select all status updates from some recent period - say, the last half hour - joined with the hashtag association table and group by hashtag.
The count in each group is an indication of "trend".
My app is supposed to search the stored images based on a search query. The user can search in label, description, people tagged in, and time posted. For that I am trying to make a search query parser that accepts wild card (*) @TaggedPeopleName (DateFrom - DateTo) #Place and all other text to match label and description. My question is: am I reinventing the wheel, or does such a parser already exist, maybe with similar functionality?
Example Queries are:
@JohnLenon 500 Miles
will return the images that match 500 Miles in label or in description and have a tag of John Lenon
(24 Dec - 30 Dec)
will return all Images uploaded in that time Frame.
#Kolkata (24 dec - 31 Dec) Occupy Together
will return all images that match the string Occupy Together in label or in description, within the time frame 24 dec to 31 Dec, and taken at the place Kolkata
If some library already does this, maybe with a different syntax, I'll accept that, as I am not tied to this syntax
To my knowledge, there is nothing that does this automatically for you - it's way too particular to your situation.
I would break it down in chunks to make it easy.
Search for all terms starting with @ - remove them.
Search for all terms starting with # - remove them.
Search for all terms surrounded by () - remove them.
What's left is the general search term.
Things to think about:
What if someone wants to search for a term in a description that starts with @ or #?
What is the format of the () terms - most people won't naturally format dates as you have?
What if someone just puts in junk between the ()?
What if two words are separated by another token, but after removing them are put together?
Or what if the person puts two words together but doesn't want them searched for as one term?
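The chunk-by-chunk approach could be sketched as below. The markers are an assumption (`@` for tagged people, `#` for places), the date range is returned as raw text without validation, and the function name `parseSearchQuery` is made up; none of the open questions listed above are handled.

```php
<?php
// Sketch: split a query into people (@...), places (#...), one "(from - to)" range and free text.
function parseSearchQuery(string $query): array
{
    $parsed = ['people' => [], 'places' => [], 'dateRange' => null, 'text' => ''];

    // 1. First parenthesised "from - to" pair, kept as raw strings.
    if (preg_match('/\(([^()]+?)\s+-\s+([^()]+?)\)/', $query, $m)) {
        $parsed['dateRange'] = ['from' => trim($m[1]), 'to' => trim($m[2])];
        $query = str_replace($m[0], ' ', $query);
    }

    // 2. Terms starting with @ or # (only at the start of a word).
    if (preg_match_all('/(?<!\S)@(\w+)/', $query, $m)) {
        $parsed['people'] = $m[1];
    }
    if (preg_match_all('/(?<!\S)#(\w+)/', $query, $m)) {
        $parsed['places'] = $m[1];
    }
    $rest = preg_replace('/(?<!\S)[@#]\w+/', ' ', $query);

    // 3. Whatever is left is the general search term.
    $parsed['text'] = trim(preg_replace('/\s+/', ' ', $rest));
    return $parsed;
}
```

Turning the raw "24 dec" style strings into real dates, and deciding what to do with junk inside the parentheses, would be separate steps.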
I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.
Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to choose?
I don't want to use any external libraries. I've already an implementation for the Singular Value Decomposition (SVD).
Extract all words from the given text.
Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
Build up a matrix: The columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weight.
Do the Singular Value Decomposition (SVD).
Use the values in the matrix S (SVD) to do the dimension reduction (how?).
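The first three steps (extract words, count them, build the matrix) might be sketched like this, using raw counts instead of tf-idf. The tokenisation (lower-cased ASCII letter runs only) and the name `buildTermDocumentMatrix` are simplifying assumptions.

```php
<?php
// Sketch: rows = unique words (sorted), columns = documents, values = raw counts.
function buildTermDocumentMatrix(array $documents): array
{
    $documents = array_values($documents);
    $counts = []; // word => [documentIndex => count]
    foreach ($documents as $d => $text) {
        preg_match_all('/[a-z]+/', strtolower($text), $m);
        foreach ($m[0] as $word) {
            $counts[$word][$d] = ($counts[$word][$d] ?? 0) + 1;
        }
    }
    ksort($counts);

    $matrix = [];
    foreach ($counts as $perDocument) {
        $row = array_fill(0, count($documents), 0);
        foreach ($perDocument as $d => $count) {
            $row[$d] = $count;
        }
        $matrix[] = $row;
    }
    return ['terms' => array_keys($counts), 'matrix' => $matrix];
}
```

The resulting `matrix` is what would be passed to the existing SVD implementation; swapping the counts for tf-idf weights only changes the values, not the shape.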
I hope you can help me. Thank you very much in advance!
LSA links:
Landauer (co-creator) article on LSA
the R-project lsa user guide
Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.
Assumptions:
your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.
M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tfidf or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tfidf).
U,Sigma,V = singular_value_decomposition(M)
U: w x w
Sigma: min(w,d) length vector, or w * d matrix with diagonal filled in the first min(w,d) spots with the singular values
V: d x d matrix
Thus U * Sigma * V = M
# you might have to do some transposes depending on how your SVD code
# returns U and V. verify this so that you don't go crazy :)
Then the dimensionality reduction: the actual LSA paper suggests a good approximation for the basis is to keep enough vectors such that their singular values are more than 50% of the total of the singular values.
More succinctly... (pseudocode)
Let s1 = sum(Sigma).
total = 0
for ii in range(len(Sigma)):
    val = Sigma[ii]
    total += val
    if total > .5 * s1:
        return ii
This will return the rank of the new basis, which was min(d,w) before, and we'll now approximate with {ii}.
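In PHP the same heuristic might look like the sketch below. One deliberate difference from the pseudocode: with zero-based arrays, the number of singular values kept is the index plus one, so this version returns that count. The function name `chooseRank` and the `$fraction` parameter are illustrative.

```php
<?php
// Sketch: smallest number of leading singular values whose sum exceeds
// $fraction of the total. Assumes $singularValues is sorted in descending order.
function chooseRank(array $singularValues, float $fraction = 0.5): int
{
    $target  = $fraction * array_sum($singularValues);
    $running = 0.0;
    foreach (array_values($singularValues) as $index => $value) {
        $running += $value;
        if ($running > $target) {
            return $index + 1; // count of values kept, not the zero-based index
        }
    }
    return count($singularValues);
}
```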
(here, ' -> prime, not transpose)
We create new matrices: U',Sigma', V', with sizes w x ii, ii x ii, and ii x d.
That's the essence of the LSA algorithm.
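Cutting the three factors down to the chosen rank is mostly array slicing. The sketch assumes the SVD routine returned U as w rows of columns, the singular values as a flat descending list, and the third factor already transposed (rows are singular vectors, d columns); `truncateSvd` and the array keys are hypothetical names.

```php
<?php
// Sketch: keep the first $k columns of U, the first $k singular values,
// and the first $k rows of the (already transposed) V factor.
function truncateSvd(array $U, array $sigma, array $Vt, int $k): array
{
    return [
        'U'     => array_map(fn(array $row) => array_slice($row, 0, $k), $U), // w x k
        'sigma' => array_slice($sigma, 0, $k),                                  // k values
        'Vt'    => array_slice($Vt, 0, $k),                                     // k x d
    ];
}
```

If the SVD code returns V untransposed, slice its columns instead, exactly as done for U.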
This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than a simple tf-idf is a matter of some debate.
To me, LSA performs poorly on real world data sets because of polysemy, and on data sets with too many topics. Its mathematical / probabilistic basis is unsound (it assumes normal-ish (Gaussian) distributions, which don't make sense for word counts).
Your mileage will definitely vary.
Tagging using LSA (one method!)
Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic
By hand, look over the U' matrix, and come up with terms that describe each "topic". For example, if the biggest parts of that vector were "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in an associative array, or list. This step should be reasonable since the number of vectors will be finite.
Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then give their "topics" as computed in the previous step.
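The last step amounts to a dot product of the document's word vector with each column of U'. A sketch, assuming `$docVector` has one weight per row of U' (same word order), `$Uk` is the w x k truncated matrix, and `$topicLabels` holds the hand-made names from the previous step; the function name `strongestTopics` is made up. Signs of singular vectors are arbitrary, so comparing raw scores as done here is itself an assumption.

```php
<?php
// Sketch: score each topic (column of U') against one document and return the top labels.
function strongestTopics(array $docVector, array $Uk, array $topicLabels, int $top = 3): array
{
    $topicCount = count($Uk[0] ?? []);
    $scores = [];
    for ($j = 0; $j < $topicCount; $j++) {
        $score = 0.0;
        foreach ($docVector as $i => $weight) {
            $score += $weight * $Uk[$i][$j];
        }
        $scores[$j] = $score;
    }
    arsort($scores); // highest score first, topic index kept as key

    $labels = [];
    foreach (array_slice(array_keys($scores), 0, $top) as $j) {
        $labels[] = $topicLabels[$j] ?? "topic $j";
    }
    return $labels;
}
```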
This answer isn't directly to the poster's question, but to the meta question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the line of autotagging. If they really mean NER, then this response is hogwash :)
Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:
By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of "hand-tagged", even if it's only 50 or 100, will be a good basis for comparing whatever the autogenerated methods below get you.
Dimensionality reductions, using LSA, Topic-Models (Latent Dirichlet Allocation), and the like.... I've had really poor luck with LSA on real-world data sets and I'm unsatisfied with its statistical basis. LDA I find much better, and it has an incredible mailing list that has the best thinking on how to assign topics to texts.
Simple heuristics... if you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out all the common words (stop words) and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence, and see where that gets you. If the texts are all in English, then do part of speech analysis on the whole shebang, and see what that gets you. With structured items, like news reports, LSA and other order independent methods (tf-idf) throw out a lot of information.
Good luck!
(if you like this answer, maybe retag the question to fit it)
That all looks right, up to the last step. The usual notation for SVD is that it returns three matrices A = USV*. S is a diagonal matrix (meaning all zero off the diagonal) that, in this case, basically gives a measure of how much each dimension captures of the original data. The numbers ("singular values") will go down, and you can look for a drop-off for how many dimensions are useful. Otherwise, you'll want to just choose an arbitrary number N for how many dimensions to take.
Here I get a little fuzzy. The coordinates of the terms (words) in the reduced-dimension space is either in U or V, I think depending on whether they are in the rows or columns of the input matrix. Off hand, I think the coordinates for the words will be the rows of U. i.e. the first row of U corresponds to the first row of the input matrix, i.e. the first word. Then you just take the first N columns of that row as the word's coordinate in the reduced space.
HTH
Update:
This process so far doesn't tell you exactly how to pick out tags. I've never heard of anyone using LSI to choose tags (a machine learning algorithm might be more suited to the task, like, say, decision trees). LSI tells you whether two words are similar. That's a long way from assigning tags.
There are two tasks- a) what are the set of tags to use? b) how to choose the best three tags?. I don't have much of a sense of how LSI is going to help you answer (a). You can choose the set of tags by hand. But, if you're using LSI, the tags probably should be words that occur in the documents. Then for (b), you want to pick out the tags that are closest to words found in the document. You could experiment with a few ways of implementing that. Choose the three tags that are closest to any word in the document, where closeness is measured by the cosine similarity (see Wikipedia) between the tag's coordinate (its row in U) and the word's coordinate (its row in U).
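The "closest tags" idea for (b) could be tried out with something like the following. It is a sketch: coordinates are assumed to be equal-length numeric arrays (rows of U cut to N columns), a tag's score is its best cosine similarity to any word of the document, and both function names are made up.

```php
<?php
// Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros).
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $y = $b[$i] ?? 0.0;
        $dot   += $x * $y;
        $normA += $x * $x;
        $normB += $y * $y;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Sketch: rank tags by their best similarity to any word of the document.
// $tagCoords: tag => coordinate vector, $wordCoords: list of coordinate vectors.
function closestTags(array $tagCoords, array $wordCoords, int $top = 3): array
{
    $best = [];
    foreach ($tagCoords as $tag => $tagVector) {
        $best[$tag] = -1.0;
        foreach ($wordCoords as $wordVector) {
            $best[$tag] = max($best[$tag], cosineSimilarity($tagVector, $wordVector));
        }
    }
    arsort($best);
    return array_slice(array_keys($best), 0, $top);
}
```

Other scoring rules (average similarity, or similarity to the summed document vector) fit the same shape and are worth comparing.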
There is an additional SO thread on the perils of doing this all in PHP.
Specifically, there is a link there to this paper on Latent Semantic Mapping, which describes how to get the resultant "topics" for a text.