LSA - Latent Semantic Analysis - How to code it in PHP?

I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.
Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to choose?
I don't want to use any external libraries. I already have an implementation of the Singular Value Decomposition (SVD).
Extract all words from the given text.
Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
Build up a matrix: The columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weight. (A rough sketch of steps 1-3 follows below this list.)
Do the Singular Value Decomposition (SVD).
Use the values in the matrix S (SVD) to do the dimension reduction (how?).
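Here is a rough sketch of what I imagine steps 1-3 could look like in PHP; the tokenizer and the stop-word list are only placeholders:

<?php
// Rough sketch of steps 1-3: build a word-by-document count matrix.
// The tokenizer and the stop-word list are placeholders.

function tokenize(string $text): array
{
    // crude tokenizer: lowercase, keep letter sequences only
    preg_match_all('/\pL+/u', mb_strtolower($text), $matches);
    return $matches[0];
}

$stopWords = ['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in'];   // placeholder list
$documents = [/* ... texts loaded from the database ... */];

$counts = [];   // $counts[$word][$docIndex] = number of occurrences
foreach ($documents as $docIndex => $text) {
    foreach (tokenize($text) as $word) {
        if (in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word][$docIndex] = ($counts[$word][$docIndex] ?? 0) + 1;
    }
}

// Dense matrix: rows = unique words, columns = documents.
$words  = array_keys($counts);
$matrix = [];
foreach ($words as $r => $word) {
    for ($c = 0, $n = count($documents); $c < $n; $c++) {
        $matrix[$r][$c] = $counts[$word][$c] ?? 0;   // or a tf-idf weight instead
    }
}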
I hope you can help me. Thank you very much in advance!

LSA links:
Landauer (co-creator) article on LSA
the R-project lsa user guide
Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.
Assumptions:
your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.
M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tfidf or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tfidf).
U,Sigma,V = singular_value_decomposition(M)
U: w x w
Sigma: min(w,d) length vector, or a w x d matrix with the diagonal filled in the first min(w,d) spots with the singular values
V: d x d matrix
Thus U * Sigma * transpose(V) = M
# you might have to do some transposes depending on how your SVD code
# returns U and V. verify this so that you don't go crazy :)
Then the dimensionality reduction... the actual LSA paper suggests that a good approximation for the basis is to keep enough vectors such that their singular values make up more than 50% of the total of the singular values.
More succinctly... (pseudocode)
Let s1 = sum(Sigma).
total = 0
for ii in range(len(Sigma)):
    val = Sigma[ii]
    total += val
    if total > .5 * s1:
        return ii
This will return the rank of the new basis, which was min(d,w) before and which we now approximate with ii.
(here, ' -> prime, not transpose)
We create new matrices: U',Sigma', V', with sizes w x ii, ii x ii, and ii x d.
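A rough PHP version of that reduction step, assuming $Sigma is an array of singular values in descending order and $U, $V are row-major arrays of arrays (the variable names are mine, not from any library):

<?php
// Pick the smallest rank k whose singular values account for more than 50%
// of the total, then truncate U, Sigma, V accordingly.

function reducedRank(array $sigma): int
{
    $total   = array_sum($sigma);
    $running = 0.0;
    foreach ($sigma as $i => $value) {
        $running += $value;
        if ($running > 0.5 * $total) {
            return $i + 1;          // keep i+1 singular values
        }
    }
    return count($sigma);
}

$k = reducedRank($Sigma);

// U': the first k columns of U            -> w x k
$Uk = array_map(fn(array $row) => array_slice($row, 0, $k), $U);

// Sigma': the first k singular values (make it a k x k diagonal if needed)
$Sk = array_slice($Sigma, 0, $k);

// V': here the first k rows of V, giving k x d; adjust if your SVD
// routine returns V already transposed.
$Vk = array_slice($V, 0, $k);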
That's the essence of the LSA algorithm.
This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than a simple tf-idf is a matter of some debate.
To me, LSA performs poorly on real-world data sets because of polysemy and data sets with too many topics. Its mathematical/probabilistic basis is unsound (it assumes roughly normal (Gaussian) distributions, which don't make sense for word counts).
Your mileage will definitely vary.
Tagging using LSA (one method!)
Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic
By hand, look over the U' matrix, and come up with terms that describe each "topic". For example, if the biggest parts of that vector were "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in an associative array, or list. This step should be reasonable since the number of vectors will be finite.
Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then give their "topics" as computed in the previous step (a small sketch of this step follows).
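A minimal PHP sketch of that projection step, assuming $Uk holds U' as w rows of k columns (rows in the same order as your vocabulary), $v1 is a length-w word-count vector for the document, and $topicLabels is the hand-made topic-to-label map from step 2:

<?php
// Project the document's word-count vector onto the reduced topic space
// and pick the 3 strongest topics.

$k = count($Uk[0]);
$strengths = array_fill(0, $k, 0.0);

foreach ($v1 as $wordIndex => $count) {
    if ($count == 0) {
        continue;
    }
    for ($t = 0; $t < $k; $t++) {
        $strengths[$t] += $count * $Uk[$wordIndex][$t];
    }
}

arsort($strengths);                              // strongest topics first, keys kept
$top3 = array_slice(array_keys($strengths), 0, 3);
$tags = array_map(fn(int $t) => $topicLabels[$t], $top3);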

This answer isn't directly to the poster's question, but to the meta question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the lines of autotagging. If they really mean NER, then this response is hogwash :)
Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:
By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of hand-tagged items, even if it's only 50 or 100, will be a good basis for comparing whatever the autogenerated methods below get you.
Dimensionality reduction, using LSA, topic models (Latent Dirichlet Allocation), and the like... I've had really poor luck with LSA on real-world data sets and I'm unsatisfied with its statistical basis. I find LDA much better, and it has an incredible mailing list with the best thinking on how to assign topics to texts.
Simple heuristics... if you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out all the common words (stop words), and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence and see where that gets you. If the texts are all in English, then do part-of-speech analysis on the whole shebang and see what that gets you. With structured items like news reports, LSA and other order-independent methods (tf-idf) throw out a lot of information.
Good luck!
(if you like this answer, maybe retag the question to fit it)

That all looks right, up to the last step. The usual notation for SVD is that it returns three matrices A = USV*. S is a diagonal matrix (meaning all zero off the diagonal) that, in this case, basically gives a measure of how much each dimension captures of the original data. The numbers ("singular values") will go down, and you can look for a drop-off for how many dimensions are useful. Otherwise, you'll want to just choose an arbitrary number N for how many dimensions to take.
Here I get a little fuzzy. The coordinates of the terms (words) in the reduced-dimension space are in either U or V, I think depending on whether they are in the rows or columns of the input matrix. Offhand, I think the coordinates for the words will be the rows of U, i.e. the first row of U corresponds to the first row of the input matrix, i.e. the first word. Then you just take the first N columns of that row as the word's coordinate in the reduced space.
HTH
Update:
This process so far doesn't tell you exactly how to pick out tags. I've never heard of anyone using LSI to choose tags (a machine learning algorithm might be more suited to the task, like, say, decision trees). LSI tells you whether two words are similar. That's a long way from assigning tags.
There are two tasks: (a) what is the set of tags to use, and (b) how to choose the best three tags? I don't have much of a sense of how LSI is going to help you answer (a). You can choose the set of tags by hand. But if you're using LSI, the tags probably should be words that occur in the documents. Then for (b), you want to pick out the tags that are closest to words found in the document. You could experiment with a few ways of implementing that. Choose the three tags that are closest to any word in the document, where closeness is measured by the cosine similarity (see Wikipedia) between the tag's coordinate (its row in U) and the word's coordinate (its row in U).
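For (b), a minimal cosine-similarity helper in PHP; each argument would be a row of U, i.e. a word's (or tag's) coordinates in the reduced space:

<?php
// Cosine similarity between two coordinate vectors (e.g. two rows of U).
function cosineSimilarity(array $a, array $b): float
{
    $dot   = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    return ($normA > 0 && $normB > 0) ? $dot / (sqrt($normA) * sqrt($normB)) : 0.0;
}

// e.g. similarity between a candidate tag's coordinates and a document word's:
// $score = cosineSimilarity($U[$tagRow], $U[$wordRow]);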

There is an additional SO thread on the perils of doing this all in PHP at link text.
Specifically, there is a link there to this paper on Latent Semantic Mapping, which describes how to get the resultant "topics" for a text.

Related

From 6 random numbers calculate random three-digit number?

I have 4 years of PHP and C# experience, but math is not my strong side.
I think that I need to use some math algorithms in this project.
When the page loads I need to randomly create 7 numbers; 6 are numbers that I can use to calculate the given three-digit number:
rand 1-9
rand 1-9
rand 1-9
rand 1-9
rand 10-100 //5 steps
rand 10-100 //5 steps
and the given number to calculate is 100-999.
I can use these operations: +, -, /, *, (, )
What is the best algorithm for this?
I probably need to try all possible combinations of these 6 numbers to calculate the given number, or the closest number I can reach.
example:
let's say that the given three-digit number is
350, and I need to calculate this number from these numbers:
3, 6, 9, 5, 10, 100
so the formula for this is:
(100*3)+(5*10) = 350
If it is not possible to calculate the exact number, then calculate the closest one.
You don't need to solve this problem completely; you can point me in the right direction by pasting some pseudocode or describing how to do it.
I have no actual experience that might help you with this, though since you're asking for some insight, I'll share my thoughts on how to do this.
As I typed my answer, I realised that this is in fact a knapsack problem, which means you can solve it to optimality using any algorithm that solves the knapsack problem. I recommend using dynamic programming to make your program run faster.
What you need to do is construct all numbers you can generate by combining two numbers with an operator, so that after this you have a list containing the numbers you started with, and the numbers you generated.
Then you solve the knapsack problem using the numbers as items with their value as their weight, and the target number as the maximum weight you can store.
The only thing that is slightly different is that you have an extra constraint saying that you may only use each number once. So you need to add to your implementation that, if you add a combination of numbers, you must remove the option of storing another combination constructed from the same numbers.
You could enumerate all the solutions by building "abstract syntax trees": binary trees with the following information:
the leaves are the 6 numbers
the nodes are the operations; for example, a node '+' with the leaf '7' as its left son and, as its right son, another node 'x' with '140' as its left son and '8' as its right son, would represent (7+(140*8)). Additionally, at each node you store the numbers that you have already used (the leaves used in the tree), and the total.
Let's say you store all the constructed trees in the associative map TreeSets, but indexed by the number of leaves you use. For example, the tree (7+(140*8)) would not be stored directly in TreeSets but in TreeSets[3] (TreeSets[3] contains several trees, it is also a set).
You store the closest score in BestScore and one solution achieving BestScore in BestSolution.
You start by constructing the 6 leaves (that gives you 6 different trees consisting of only one leaf). You save the closest number in BestScore and the corresponding leaf in BestSolution.
Then at each step, you try to construct the trees with i leaves, i from 2 to 6, and store them in TreeSets[i].
You take j from 1 to i-1, you take each tree in TreeSets[j] and each tree in TreeSets[i-j], and you check that those two trees don't use the same leaves (you don't have to check at the bottom of the tree, since you have stored the leaves used in the node). If so, you build the four nodes '+', 'x', '/', '-' with the tree from TreeSets[j] as left son and the tree from TreeSets[i-j] as right son, and store all four of them in TreeSets[i]. While building a node, you take the totals from both trees and apply the operation, you store the total, and you check whether it is closer than BestScore (if so, you update BestScore and BestSolution with this new total and with the new node). If the total is exactly the value you were looking for, you can stop here.
If you didn't stop the program by finding an exact solution, there is no such solution, and the closest one is in BestSolution at the end.
Note: You don't have to build a complete tree each time, just build a node with two pointers to other trees.
P.S.: You may avoid enumerating all the solutions by using the dynamic programming approach, as Glubus said. In this case, it would consist, at each step i, of removing some solutions that are considered sub-optimal. But with this problem I'm not sure that is possible (except maybe removing the nodes with a total of 0).
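For a rough idea of scale, here is a brute-force sketch in PHP that simply combines two remaining numbers at a time with +, -, *, / and tracks the expression closest to the target. It is not the tree/DP approach described above, just the naive enumeration; negative and fractional intermediate results are skipped as a simplification, all names are made up, and it stops early once an exact hit is found:

<?php
// Brute force: pick any two remaining numbers, combine them with an operator,
// recurse on the smaller set, and remember the expression closest to the target.

function search(array $nums, array $exprs, int $target, array &$best): void
{
    foreach ($nums as $i => $v) {
        if (abs($v - $target) < abs($best['value'] - $target)) {
            $best = ['value' => $v, 'expr' => $exprs[$i]];
        }
    }
    if ($best['value'] === $target || count($nums) < 2) {
        return;                                   // exact hit or nothing left to combine
    }

    $n = count($nums);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            [$a, $b]   = [max($nums[$i], $nums[$j]), min($nums[$i], $nums[$j])];
            [$ea, $eb] = $nums[$i] >= $nums[$j] ? [$exprs[$i], $exprs[$j]] : [$exprs[$j], $exprs[$i]];

            $candidates = ['+' => $a + $b, '*' => $a * $b, '-' => $a - $b];
            if ($b !== 0 && $a % $b === 0) {
                $candidates['/'] = intdiv($a, $b); // keep results integral
            }

            foreach ($candidates as $op => $value) {
                if ($value <= 0) {
                    continue;                      // pruning heuristic
                }
                $restNums  = $nums;
                $restExprs = $exprs;
                unset($restNums[$i], $restNums[$j], $restExprs[$i], $restExprs[$j]);
                $restNums    = array_values($restNums);
                $restExprs   = array_values($restExprs);
                $restNums[]  = $value;
                $restExprs[] = "($ea $op $eb)";
                search($restNums, $restExprs, $target, $best);
            }
        }
    }
}

$numbers = [3, 6, 9, 5, 10, 100];
$target  = 350;
$best    = ['value' => PHP_INT_MAX, 'expr' => ''];
search($numbers, array_map('strval', $numbers), $target, $best);
echo "{$best['expr']} = {$best['value']}\n";       // e.g. ((100 * 3) + (5 * 10)) = 350

For 6 numbers this enumeration finishes quickly enough for a proof of concept; the tree-set bookkeeping or the DP pruning described above would be the way to make it scale further.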

Optimal way to find common negative space in map of student_ids to class_times

I am writing a proof of concept for a scheduling application in PHP. I have a 2D array of the student schedule in the format of (str) class_time => (array) student_ids, printout: http://d.pr/i/UKAy.
At this point in the processing, I need to determine which class_time is the most appropriate to host a new course with, say, 10 students requesting it. To this end, I want to determine how many students have n class_times available, ideally stored as class_time => student_ids => n_available_class_times.
So, what is an ideal way to build/search this data? The end result is a list of all class_times and an idea of which students can utilize a given class as each new course is scheduled. This allows me to sort by available_class_times to find the students who are most constrained in their schedule, and who need priority in being scheduled into a given class, given how hard it would be to schedule them in the future under a number of present/potential constraints.
Something like the following would help. Each student_ids array needs to be sorted; you can use quicksort to do that in n log(n) time. Then you would have to start planning. I think something like alpha-beta (AB) pruning would work here, because you have some optimal states at the end and decisions along the way that affect your optimal state. (The sorting bit at the start is just to make it faster.)
Here is some stuff on AB pruning:
To start, there is a decision algorithm called min-max that states that all the decisions in a "game" lead to a final state that is either infinitely good or infinitely bad, i.e. winning or losing. So you build a tree, each node representing a "game state", in your case a state of students being scheduled. Then you search the tree, traversing it for the best move states. In your case, optimal scheduling. At each node, you decide whether it is an end state and score it as either positive or negative infinity, or you branch off into other nodes. Note that this is not a binary tree. The decision tree nodes have n branches, where n is the number of decisions you can make there. This is not too great for what you're doing, but it requires explaining to understand AB pruning.
Now assume that instead of just asking whether a node is a win or a loss, you could weight how good a game state it is. In your case, based on the number of students that could be optimally scheduled. Then as you traverse the huge decision tree, you can cut out huge sections because you know they lead to bad "game states", i.e. states where the students you want placed are not easily placed. The way you do this is by considering nodes that lead to game states B that you know to be worse than A (the node you previously evaluated). This is good because searching this tree is a serious computational task. It allows you to evaluate even deeper by ignoring huge sections (a really awesome, huge computational gain). This gets you your answer of the best class schedule states. Good luck, dude.
// HERE IS SOME CODE FROM THE INTERNET
function alphabeta(node, depth, α, β, Player)
    if depth = 0 or node is a terminal node
        return the heuristic value of node
    if Player = MaxPlayer
        for each child of node
            α := max(α, alphabeta(child, depth-1, α, β, not(Player) ))
            if β ≤ α
                break (* Beta cut-off *)
        return α
    else
        for each child of node
            β := min(β, alphabeta(child, depth-1, α, β, not(Player) ))
            if β ≤ α
                break (* Alpha cut-off *)
        return β

(* Initial call *)
alphabeta(origin, depth, -infinity, +infinity, MaxPlayer)
Here is a link on the subject:
http://en.wikipedia.org/wiki/Alpha%E2%80%93beta_pruning

Searching for matrix way finding algorithm

I am developing a board game in PHP and now I have problems writing an algorithm...
The game board is a multidimensional array ($board[10][10]) defining the rows and columns of the board matrix or vector...
Now I have to loop through the complete board, but with a dynamic start point. For example, the user selects cell [5,6]; this is the start point for the loop. The goal is to find all available board cells around the selected cell, to find the target cells for a move method. I think I need a performant and efficient way to do this. Does anyone know an algorithm to loop through a matrix/vector, visiting every field only once, to find the available and used cells?
Extra rule...
In the attached picture a blue field is selected (it is a little bigger than the others). The available fields are only on the right side. The fields on the left side are free but not reachable from the currently selected position... I think this is extra information which makes the algorithm a little bit more complicated...
big thx so far!
kind regards
Not completely sure that I got the requirements right, so let me restate them:
You want an efficient algorithm to loop through all elements of an nxn matrix with n approximately 10, which starts at a given element (i,j) and is ordered by distance from (i,j)!?
I'd loop through a distance variable d from 0 to n/2
then for each value of d loop for l through -(2*d) to +(2*d)-1
pick the cells (i+d, j+l); if i>=0 also pick (i+l, j-d), (i+l, j+d)
For each cell you have to apply a modulo n, to map negative indices back onto the matrix.
This considers the matrix basically a torus, gluing the upper and lower edges as well as the left and right edges together.
If you don't like that, you can let d run up to n and, instead of a modulo operation, just ignore values outside the matrix.
These approaches give you the fields directly in the correct order. For small boards I doubt any kind of optimization on this level has much of an effect in most situations; Nicholas' approach might be just as good.
Update
I slightly modified the cells to pick, in order to honor the rule "only consider fields that are to the right of the current column or in the same column".
If your map is only 10x10, I'd loop through from [0][0], collecting all the possible spaces for the player to move into, then grade the spaces by distance to the current player position. N is small, so the fact that the algorithm is O(N^2) shouldn't affect your performance much.
Maybe someone with more background in algorithms has something up their sleeve.
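For what it's worth, a breadth-first flood fill is another common way to get "every reachable field, visited only once". Here is a minimal PHP sketch; it assumes $board[$r][$c] === 0 marks a free field (which may not match your encoding), and it naturally respects the reachability rule because occupied cells block the search:

<?php
// Breadth-first flood fill: starting from the selected cell, visit each free
// neighbouring cell exactly once and collect it as a reachable target cell.

function reachableCells(array $board, int $startRow, int $startCol): array
{
    $rows = count($board);
    $cols = count($board[0]);

    $visited = ["$startRow,$startCol" => true];
    $queue   = [[$startRow, $startCol]];
    $result  = [];

    $directions = [[-1, 0], [1, 0], [0, -1], [0, 1]];  // add diagonals if moves allow them

    while ($queue) {
        [$r, $c] = array_shift($queue);
        foreach ($directions as [$dr, $dc]) {
            $nr = $r + $dr;
            $nc = $c + $dc;
            if ($nr < 0 || $nr >= $rows || $nc < 0 || $nc >= $cols) {
                continue;                              // off the board
            }
            if (isset($visited["$nr,$nc"]) || $board[$nr][$nc] !== 0) {
                continue;                              // already seen, or occupied
            }
            $visited["$nr,$nc"] = true;
            $result[] = [$nr, $nc];
            $queue[]  = [$nr, $nc];
        }
    }
    return $result;                                    // cells come out ordered by distance
}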

What is the most efficient way to find the euclidean distance in 3d using mysql?

I have a MySQL table with thousands of data points stored in 3 columns R, G, B. How can I find which data point is closest to a given point (a,b,c) using Euclidean distance?
I'm saving RGB values of colors separately in a table, so the values are limited to 0-255 in each column. What I'm trying to do is find the closest color match by finding the color with the smallest euclidean distance.
I could obviously run through every point in the table to calculate the distance but that wouldn't be efficient enough to scale. Any ideas?
I think the above comments are all true, but they are - in my humble opinion - not answering the original question. (Correct me if I'm wrong.) So let me add my 50 cents here:
You are asking for a select statement. Given that your table is called 'colors', your columns are called r, g and b (integers in the range 0..255), and you are looking for the row in your table closest to a given value, let's say rr, gg, bb, I would dare to try the following:
select r, g, b, sqrt((rr-r)*(rr-r) + (gg-g)*(gg-g) + (bb-b)*(bb-b)) as dist
from colors
order by dist
limit 1;
Now, this answer is given with a lot of caveats, as I am not sure I got your question right, so please confirm whether it's right, or correct me so that I can be of assistance.
Since you're looking for the minimum distance and not exact distance you can skip the square root. I think Squared Euclidean Distance applies here.
You've said the values are bounded to 0-255, so you can make an indexed look-up table of squared values, with one entry per possible difference.
Here is what I'm thinking in terms of SQL. r0, g0, and b0 represent the target color. The table Vector would hold the square values mentioned above in #2. This solution would visit all the records but the result set can be set to 1 by sorting and selecting only the first row.
select
    c.r, c.g, c.b,
    mR.dist + mG.dist + mB.dist as squared_dist
from
    colors c,
    vector mR,
    vector mG,
    vector mB
where
    c.r - r0 = mR.point and
    c.g - g0 = mG.point and
    c.b - b0 = mB.point
group by
    c.r, c.g, c.b
order by
    squared_dist
limit 1
The first level of optimization that I see you can do would be to square the distance to which you want to limit the query, so that you don't need to perform the square root for each row.
The second level of optimization I would encourage would be some preprocessing to alleviate the need for extraneous squaring for each query (which could possibly create some extra run time for large tables of RGB's). You'd have to do some benchmarking to see, but by substituting in values for a, b, c, and d and then performing the query, you could alleviate some stress from MySQL.
Note that the performance difference between the last two lines may be negligible. You'll have to use test queries on your system to determine which is faster.
I just re-read and noticed that you are ordering by distance. In which case, the d should be removed and everything should be moved to one side. You can still plug in the constants to prevent extra processing on MySQL's end.
I believe there are two options.
You either have to, as you say, iterate across the entire set and compare each point against the best (smallest) distance found so far, initialized to a sentinel value such as -1. This runs in linear time, O(n) (since you're only comparing 1 point to every point in the set, this scales in a linear way).
I'm still thinking of another option... something along the lines of doing a breadth-first search away from the input point until a point from the set is found, but this requires a bit more thought (I imagine the 3D space would have to be pretty heavily populated for this to be more efficient on average, though).
If you run through every point and calculate the distance, don't use the square root function, it isn't necessary. The smallest sum of squares will be enough.
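If you do end up doing that scan in PHP rather than SQL, the linear pass with squared distances might look like this (the row layout is an assumption):

<?php
// Linear scan: find the row with the smallest squared Euclidean distance
// to the target colour ($r0, $g0, $b0). No sqrt needed for comparison.

function closestColor(array $rows, int $r0, int $g0, int $b0): ?array
{
    $best     = null;
    $bestDist = PHP_INT_MAX;

    foreach ($rows as $row) {                   // each $row = ['r' => .., 'g' => .., 'b' => ..]
        $d = ($row['r'] - $r0) ** 2
           + ($row['g'] - $g0) ** 2
           + ($row['b'] - $b0) ** 2;
        if ($d < $bestDist) {
            $bestDist = $d;
            $best     = $row;
        }
    }
    return $best;
}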
This is the problem you are trying to solve. (Planar case: select all points sorted by an x, y, or z axis, then use PHP to process them.)
MySQL also has spatial extensions, which may have this as a function. I'm not positive, though.

A good approximation algorithm for the maximum weight perfect match in non-bipartite graphs?

Drake and Hougardy present a simple approximation algorithm for the maximum weighted matching problem. I think understanding the academic paper is beyond my capabilities, so I'm looking for an easy implementation, preferably in PHP, C, or JavaScript?
Problem Definition and References
Given a simple graph (undirected, no self-edges, no multi-edges) a matching
is a subset of edges such that no two of them are incident to the same vertex.
A perfect matching is one in which all vertices are incident to an edge of
the matching, something not possible if there are an odd number of vertices.
More generally we can ask for a maximum matching (largest possible number of
edges in a matching) or for a maximal matching (a matching to which no more
edges can be added).
If positive real "weights" are assigned to the edges, we can generalize the
problem to ask for a maximum-weighted matching, one that maximizes the
sum of edges' weights. The exact maximum-weighted matching problem can be
solved in O(nm log(n)) time, where n is the number of vertices and m the
number of edges.
Note that a maximum-weighted matching need not be a perfect matching. For
example:
*--1--*--3--*--1--*
has only one perfect matching, whose total weight is 2, and a maximum
weighted matching with total weight 3.
Discussion and further references for exact and approximate solutions of
these, and of the minimum-weighted perfect matching problem, may be found
in these papers:
"A Simple Approximation Algorithm for the Weighted Matching Problem"
Drake, Doratha E. and Hougardy, Stefan (2002)
Implementation of O(nm log n) Weighted Matchings The Power of Data Structures
Melhorn, Kurt and Schäfer, Guido (2000)
Computing Minimum-Weight Perfect Matchings
Cook, William and Rohe, André (1997)
Approximating Maximum Weight Matching in Near-linear Time
Duan, Ran and Pettie, Seth (2010)
Drake and Hougardy's Simple Approximation Algorithm
The first approximation algorithm of Drake-Hougardy uses the idea
of growing paths using the locally heaviest edge at each vertex met. It
has a "performance ratio" of 1/2 like the greedy algorithm, but linear
time complexity in the number of edges (the greedy algorithm uses
a globally heaviest edge and incurs greater time complexity to find that).
The main implementation task is to identify data structures that support
the steps of their algorithm efficiently.
The idea of the PathGrowing algorithm:
Given: a simple undirected graph G with weighted edges
(0) Define two sets of edges L and R, initially empty.
(1) While the set of edges of G is not empty, do:
(2)     Choose arbitrary vertex v to which an edge is incident.
(3)     While v has incident edges, do:
(4)         Choose heaviest edge {u,v} incident to v.
(5)         Add edge {u,v} to L or R in alternating fashion.
(6)         Remove vertex v (and its incident edges) from G.
(7)         Let u take the role of v.
(8)     Repeat 3.
(9) Repeat 1.
Return L or R, whichever has the greater total weight.
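For a concrete starting point, here is a rough PHP sketch of PathGrowing (the question mentions PHP). It uses plain arrays rather than the linked structures discussed below, so it is a correctness sketch, not the linear-time implementation this answer is designing:

<?php
// PathGrowing sketch. $graph is an adjacency map:
//   $graph[$u][$v] = weight of edge {u, v} (stored in both directions).
// Plain arrays mean edge removal is not O(1) as in the linked-list design
// discussed below; this is only meant to show the control flow.

function pathGrowingMatching(array $graph): array
{
    $sets    = [0 => [], 1 => []];   // the two edge sets L and R
    $weights = [0 => 0.0, 1 => 0.0];
    $side    = 0;

    foreach (array_keys($graph) as $v) {
        // grow a path starting at $v, always taking the heaviest remaining edge
        while (!empty($graph[$v])) {
            // (4) heaviest edge {v, u} incident to v
            $u = array_keys($graph[$v], max($graph[$v]))[0];
            $w = $graph[$v][$u];

            // (5) add it to L or R in alternating fashion
            $sets[$side][]   = [$v, $u, $w];
            $weights[$side] += $w;
            $side = 1 - $side;

            // (6) remove v and all its incident edges from the graph
            foreach (array_keys($graph[$v]) as $x) {
                unset($graph[$x][$v]);
            }
            unset($graph[$v]);

            // (7) continue the path from u
            $v = $u;
        }
    }

    // return whichever of L and R has the greater total weight
    return $weights[0] >= $weights[1] ? $sets[0] : $sets[1];
}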
Data structures to represent the graph and the output
As a "set" is not in any immediate sense a data structure of C, we
need to decide what kinds of container for edges and vertices will
suit this algorithm. The critical operations are removing vertices
and incident edges in a way that allows us to find if any edges are
left and to compare weights of the remaining edges incident to a
given vertex.
The edges need to be searchable, but only to see if any is still left.
One thinks first of a simple linked list of edges, without any special
ordering. But this list also needs to be maintained through essentially
random deletions. This suggests a doubly-linked list (back links as
well as forward at each node), so that deletion of an edge may be done
by fixing up the links to skip over any "removed" node. Edge weights
can also be stored in this same structure.
Further we need the ability to scan all (remaining) edges incident to
a given vertex. We can do this by creating a linked list for each vertex
of (pointers to) incident edges. I will assume that the vertices have
been preprocessed to ordinal values that can be used as an index into
an array of pointers to these linked lists.
Finally we need to represent the edge sets L and R, one of which is to
be returned as the approximate maximum matching. Our requirements
are to be able to add edges to either set, and to be able to total the
edge weights for both of them. Linked lists with dynamically allocated
nodes can serve this purpose, perhaps storing pointers to the edge nodes
in the original doubly-linked lists as the weight attribute will still
persist there even after an edge becomes "removed" by link manipulation.
Such linked and doubly-linked lists can be created in time proportional
to the number of edges, since the doubly-linked list entries may be
allocated to vertex-specific links on input. With such a design in
mind we can analyze the effort required by each step of the algorithm.
(to be continued)
