I have designed a weighted graph using a normalized adjacency list in MySQL. Now I need to find the shortest path between two given nodes.
I have tried to use Dijkstra in PHP but I couldn't manage to implement it (too difficult for me). Another problem I felt was that if I use Dijkstra I would need to consider all the nodes, which may be very inefficient in a large graph. So does anybody have code relating to the above problem? It would be great if somebody could at least show me a way of solving this problem. I have been stuck here for almost a week now. Please help.
This sounds like a classic case for the A* algorithm, but if you can't implement Dijkstra, I can't see you implementing A*.
A* on Wikipedia
edit: this assumes that you have a good way to estimate (but it is crucial you don't over-estimate) the cost of getting from one node to the goal.
edit2: you are having trouble with the adjacency list representation. It occurs to me that if you create an object for each vertex in the map then you can link directly to these objects when there is a link. So what you'd have essentially is a list of objects that each contain a list of pointers (or references, if you will) to the nodes they are adjacent to. Now, if you want to access the path for a new node, you just follow the links. Be sure to maintain a list of the paths you've followed for a given vertex to avoid infinite cycles.
As far as querying the DB each time you need to access something, you're going to need to do this anyway. Your best hope is to only query the DB when you NEED to... this means only querying it when you want to get info on a specific edge in the graph, or for all edges for one vertex in the graph (the latter would likely be the better route), so you hit the slow I/O in small pieces once in a while rather than in gigantic chunks all at once.
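To make the "only query when you need to" idea concrete, here is a minimal sketch in Python rather than PHP. It is plain Dijkstra with an early exit (which is what A* reduces to when the cost estimate is always zero), not the poster's code. `fetch_edges` and the `EDGES` dictionary are hypothetical stand-ins for one database query per vertex against an edge table; the real table and column names are not known from the question.

```python
import heapq

# Hypothetical in-memory stand-in for the MySQL edge table:
# source vertex -> list of (target vertex, weight)
EDGES = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 1)],
    "C": [("B", 2), ("D", 5)],
    "D": [],
}

def fetch_edges(vertex):
    """Stand-in for one DB query returning all outgoing edges of a vertex."""
    return EDGES.get(vertex, [])

def shortest_path(source, target, neighbors=fetch_edges):
    """Dijkstra with early exit: edges are fetched only for expanded vertices."""
    dist = {source: 0}
    prev = {}
    done = set()  # vertices whose shortest distance is final
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u in done:
            continue  # stale queue entry
        done.add(u)
        if u == target:
            # Walk the predecessor links back to the source.
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in neighbors(u):
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                prev[v] = u
                heapq.heappush(queue, (nd, v))
    return None  # target is not reachable from source

result = shortest_path("A", "D")
```

Because the loop stops as soon as the target is popped from the queue, vertices farther away than the target are never expanded, so their edges are never fetched.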
Here is a literate version of the Dijkstra algorithm, in Java, that may help you to figure out how to implement it in PHP.
http://en.literateprograms.org/Dijkstra%27s_algorithm_%28Java%29
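Dijkstra's algorithm returns the shortest paths from a given vertex to all other vertices.
You can find its pseudo-code on Wikipedia.
But I think you need the Floyd (Floyd-Warshall) algorithm, which finds the shortest paths between all pairs of vertices in a DIRECTED graph.
For all pairs on a dense graph, the complexity of the two is pretty close: Floyd-Warshall is O(V^3), and so is running a simple array-based Dijkstra from every vertex.
I could find PHP implementations of both of them via the Wikipedia pages.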
Apologies if the answer to this is obvious, please be kind, this is my first time on here :-)
I would greatly appreciate it if someone could give me a steer on the appropriate input data structure for k-means. I am working on a masters dissertation in which I am proposing a new TF-IDF term weighting approach specific to my domain. I want to use k-means to cluster the results and then apply a number of internal and external evaluation criteria to see if my new term weighting method has any merit.
My steps so far (implemented in PHP, all working) are:
Step 1: Read in document collection
Step 2: Clean document collection, feature extraction, feature selection
Step 3: Term Frequency (TF)
Step 4: Inverse Document Frequency (IDF)
Step 5: TF * IDF
Step 6: Normalise TF-IDF to fixed length vectors
Where I am struggling is
Step 7: Vector Space Model – Cosine Similarity
The only examples I can find compare an input query to each document and find the similarity. Where there is no input query (this is not an information retrieval system), do I compare every single document in the corpus with every other document in the corpus (every pair of documents)? I cannot find any example of Cosine Similarity applied to a full document collection rather than a single example/query compared to the collection.
Step 8: K-Means
I am struggling here to understand whether the input for k-means should be a matrix of the cosine similarity score of every document in the collection against every other document (a matrix of cosine similarity), or whether k-means is supposed to be applied over a term vector model. If it is the latter, every example I can find of k-means is quite basic and plots singular terms. How do I handle the fact that there are multiple terms in my document collection?
Cosine Similarity and K-Means are implied as the solution to document clustering in so many examples, so I am missing something very obvious.
If anyone could give me a steer I would be forever grateful.
Thanks
Claire
K-means cannot operate on a similarity matrix, because k-means computes point-to-mean distances, not pairwise distances.
You need an implementation of spherical k-means if you want to use Cosine distance: at every iteration, the centers should be L2 normalized.
If I'm not mistaken, it should be equivalent to run k-means with cosine similarity, and only normalize the center to unit length at the end. But regular spherical k-means may be faster, because you can exploit data normalization to simplify cosine distance to the dot product.
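As a rough illustration of spherical k-means, here is a minimal pure-Python sketch (Python rather than PHP, standard library only). The function names, the fixed iteration count and the toy vectors are assumptions made for the example; a real document collection would use sparse vectors and a proper convergence check.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(vectors, k, iterations=20, seed=0):
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centers = rng.sample(points, k)  # pick k documents as initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment: on unit vectors the dot product is the cosine similarity.
        labels = [max(range(k), key=lambda c: dot(p, centers[c])) for p in points]
        # Update: mean of the members, re-normalized to unit length.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[c] = normalize(mean)
    return labels, centers

# Toy TF-IDF vectors: documents 0 and 1 share terms, as do documents 2 and 3.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]
labels, centers = spherical_kmeans(docs, k=2)
```

Note that the input is the list of term vectors itself, one per document, with as many dimensions as there are terms; no pairwise similarity matrix is built.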
You may want to reconsider using PHP. It is one of the worst possible choices for this type of programming task. It's good for interactive web pages, but it doesn't shine at data analysis at all.
I second Anony-Mousse's opinion that you should reconsider PHP, and would like to suggest Python, as there are several useful libraries for these kinds of problems:
Numpy: a great and efficient package for scientific computing.
SciPy: Actually has several routines for k-means clustering: see here
Theano: For more machine learning needs, especially deep learning.
Also there is this great tutorial about the k-means algorithm. It also supplies pseudo code in Python. You can use this, and maybe an implementation done by yourself, to understand the algorithm better, but ultimately I would make use of the libraries mentioned above as they are optimized for performance, which is definitely something to keep in mind if you have a big collection of documents.
If it helps anybody else out, I have found that it is possible to apply k-means clustering to a multi-dimensional term vector, but if more than 3 dimensions are included (which will be the case for any document collection), you cannot visualise it. I believe this is what threw me here: all of the examples I saw of k-means included a graph visualisation, and this led me to believe, incorrectly, that perhaps the source data for k-means was expected to be two-dimensional, such as 0 and the cosine similarity. Thank you kindly to the respondents for your help, much appreciated.
Use TF-IDF to calculate the cosine similarity. Use cosine similarity scores as the input data for your clustering algorithm.
Look at ..
Simple Search: The Vector Space Model
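For the case with no query, comparing every pair of documents could look like the following sketch (Python, standard library only; the toy vectors and function names are assumptions). A matrix like this suits clustering methods that accept pairwise similarities, such as hierarchical clustering; k-means, as the earlier answer notes, works on the vectors themselves.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_matrix(vectors):
    """Symmetric n x n matrix of cosine similarities between all document pairs."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # compute each pair once, then mirror it
            sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim

# Toy TF-IDF vectors: one row per document, one column per term.
tfidf = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
sim = similarity_matrix(tfidf)
```

Documents 0 and 1 point in the same direction, so their similarity is 1 (up to floating-point rounding), while document 2 shares no terms with either and scores 0.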
I am constructing a PHP framework from scratch (unfortunately I don't have any choice in this matter). The framework is required to rely heavily on object-oriented data, and therefore needs to have the ability to store large amounts of object-oriented data efficiently.
I am struggling with the second part.
I've been working on this for a few months. Initially I was introduced to the idea of an ORM. After trying a few pre-built libraries (Doctrine 2, Redbean etc.) I liked the idea, but none of what I could find functioned the way that was required, so I set out to create my own ORM, which turned out quite well. The only real issue is that it suffers in performance, and after spending some time trying to optimize it, I am now convinced that an ORM is not quite the solution to the problem. Although close, it just doesn't quite cut it.
I have briefly looked into other solutions, but due to my lack of experience in this area I am struggling to pin-point the solution.
Here are the requirements of the data storage engine:
Ultimately, it needs to be able to store key-value pairs
The "value" part can be a simple data type, but can also be an object, or an array of the same type of object.
The application defines the structure of each object (the SCHEMA), sort of in the same way that a .wsdl file works, so the engine would need to work well with strict formats.
Objects can either have their instances re-used, or not. Meaning that if an object exists as a child object in multiple locations (across many objects), its values are the same everywhere that it is located (if it is re-used). Otherwise, a new instance of the object exists for every parent object (not re-used).
There needs to be the ability to query the data efficiently, to make comparisons on any part of an object to find it. For example: find a customer where customer.address.postcode LIKE ('%XXX%')
Any suggestions would be greatly appreciated
EDIT
Thanks to those that have attempted to aid me so far in my somewhat crazy endeavour. To answer some questions that have so far been asked:
What solutions have you tried, and why did they not work?
ORM systems
I had tried a small number of pre-built ORM libraries for PHP, including Doctrine 2 and Redbean. With Doctrine it was more to do with how you specify the SCHEMA of a model, in that you are required to do so in docblocks. I found this particularly awkward to use given the requirements that I had, particularly because I knew of a number of ways this could be avoided. I did eventually manage to get Doctrine to work the way that I wanted, but this was after hacking away at the code. Again, this was fun, but it wasn't right.
Redbean actively required me to change the property names of objects. One of my requirements was to basically be able to plug in any sort of document-oriented object, and store it. So having to specifically name properties in order to do this was counter-intuitive. Again, I did play with Redbean for a bit to get it to work, which wasn't right.
It was after playing with a few more ORM systems that I felt I had the knowledge to make my own. Again, the ORM system that I made was good, in that it met the requirements precisely. It was massively let down by poor performance, specifically when dealing with large sets of data, but more so when dealing with highly complex models.
Storing objects in XML files
I considered this for a very short time, thinking that maybe my requirements meant that I was always going to end up with performance being a problem. So I set out designing a way to generate text-based storage and ultimately ended up creating a whole SCHEMA engine and a bunch of other interesting things. This turned out to be just a fun project in the end; I just couldn't get it to perform at all.
NoSQL
My most recent endeavours have pushed me down the route of systems such as MongoDB and a few other NoSQL systems that I didn't much get into like Cassandra.
MongoDB comes very close to being a tool I could use, however it would require that I add an additional layer because I do in fact require a SCHEMA, since my objects always conform to a specific structure. I am slowly coming to terms with MongoDB possibly being the solution, however I want to make sure before I spend more time on this.
What exactly do you mean by efficient?
I'm not talking 100% about performance when I mention efficiency, although performance is most certainly an important factor in weighing my options. I understand that by going down this route, rather than using something like a relational database, performance is naturally going to be a problem.
I am more talking about using the right tools. I never like to have to hack away at someone's code to get things to work. To me, it feels as if I am pushing things down a road that the system wasn't designed to go down, and at some point in the future it will bite me in the a**.
So really, when I mention I am looking for something "efficient", I'm meaning tools that match the requirements as closely as possible, so that I am only using/extending the functionality, rather than re-writing it.
Here are some routes to look into. Your requirement for storing "objects" (quite a broad term when it comes to databases) makes me think of:
Storing data in databases in a serialised format, e.g. JSON. PostgreSQL these days has ways to reach into such a column to do search operations on it, so it is not as non-searchable as has been previously regarded (though I would expect it to be slower than querying correctly normalised data).
The requirement to store customer.address.postcode makes me think that you could store your data as a hierarchy, in which case there are several algorithms available to you. Look into nested sets. This is designed to work well with relational databases, without resorting to recursive SQL.
It's not an area of my expertise, but graph databases may be worth looking into.
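To give a feel for the nested-set idea mentioned above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever RDBMS is used. The `node` table, its `lft`/`rgt` columns and the tiny customer tree are all assumptions for illustration. Each node gets a left and right number from a depth-first walk, and every descendant's numbers fall strictly inside its ancestor's, so a subtree can be read with one non-recursive query.

```python
import sqlite3

def number_tree(tree, name, rows, counter=1):
    """Depth-first walk assigning (lft, rgt) to each node; returns the next free number."""
    lft = counter
    counter += 1
    for child in tree.get(name, []):
        counter = number_tree(tree, child, rows, counter)
    rows.append((name, lft, counter))  # counter is this node's rgt value
    return counter + 1

# Hypothetical object structure: parent -> list of children
tree = {"customer": ["name", "address"], "address": ["street", "postcode"]}
rows = []
number_tree(tree, "customer", rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (name TEXT, lft INTEGER, rgt INTEGER)")
db.executemany("INSERT INTO node VALUES (?, ?, ?)", rows)

def descendants(db, name):
    """All nodes nested inside the named node, in tree order, without recursion."""
    sql = """SELECT child.name
             FROM node AS parent
             JOIN node AS child
               ON child.lft > parent.lft AND child.rgt < parent.rgt
             WHERE parent.name = ?
             ORDER BY child.lft"""
    return [row[0] for row in db.execute(sql, (name,))]
```

For example, `descendants(db, "address")` returns the street and postcode nodes. The trade-off is that inserting a node means renumbering part of the tree, so this layout favours reads over writes.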
On a side note, Doctrine is a great library from what I hear, but I suspect you need to work out what technology to use first. It is designed broadly to map onto a relational database, so if you can't express your problem cleanly in a raw RDBMS, Doctrine may not help.
(This could be an XY question, it's hard to tell. You've said you need Y, but if you can tell us that you want to achieve X, maybe the feedback you're getting would be more concrete - and take you in a better direction).
So!
I am working in PHP and have a huge list of taxonomy/tags, say around 100,000.
A similar list of tags can be found in the wealth of tags listed under products at Zazzle.com.
I am attempting to programmatically organize this list into a tiered menu of sorts based on the relationship between words, similar strings, and specificity.
I have toyed around with the levenshtein function, similar_text, searching for sub_str(ings), using the Princeton WordNet database, etc. and just can't crack this nut. Essentially, I am trying to build an Ontology out of this database that goes from very general to very specific in tiers. It doesn't have to be perfect, but I have run out of simple keyphrases to search for and ideas of how to go about doing this in a programmatic way and yet still having some semblance of order.
For instance:
If I use sub_str, I might end up with Dog->Dogma,Dogra, etc.
If I use levenshtein or similar_text, I might end up with Bog, Log, Cog, and Dog all very closely related.
This database, or taxonomy if you will, is also consistently changing, and thus at least part of the analysis has to be done on the fly. The good news is that only one level of the result needs to be available. For instance, the near results of a query such as Dog might be small dog, large dog, red dog, blue dog, canine, etc.
I know this is a terrible question, but does anyone have a ray of light on at least what steps I should take, any useful functions I could use, queries to research, methodologies, etc.?
Thank you for your time.
So far, I have two suggestions for programmatically organizing tags into an ontology.
Find co-occurrences of tags to organize them into groups. The idea is that if tags occur together, they are probably related.
Use algorithmic stemming to reduce multiple forms/derivations of words to a common stem. This should reduce the quantity of tags the script needs to sift through, in addition to possibly identifying similar tags based on the shared stem.
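The co-occurrence suggestion can be sketched in a few lines of Python (standard library only). The product tag lists and function names below are made up for illustration; the point is that tags are related by how often they appear on the same item, not by how similar their spelling is.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(tagged_items):
    """Count how often each unordered pair of tags appears on the same item."""
    pairs = Counter()
    for tags in tagged_items:
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

def related(tag, pairs, top=3):
    """Tags that most often share an item with the given tag."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == tag:
            scores[b] += n
        elif b == tag:
            scores[a] += n
    return [t for t, _ in scores.most_common(top)]

# Hypothetical tag lists, one per product.
products = [
    ["dog", "puppy", "pet"],
    ["dog", "pet", "leash"],
    ["dog", "puppy"],
    ["dogma", "religion"],
]
pairs = cooccurrence(products)
near_dog = related("dog", pairs)
```

With this data, "dogma" never shows up as related to "dog", because the two never occur on the same product even though the strings are similar.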
If you have whole sentences, or at least more than just single words available, you might want to have a look at Latent semantic analysis.
Don't be scared by the math; once you get the basic idea behind it, it's fairly simple:
create a (high-dimensional) term-document matrix of your data
essential step: transform your huge sparse matrix into a lower dimension (Singular value decomposition)
every [collection of tags/terms] can then be specified by a vector in your lower-dimensional model
the (cosine) similarity between two such vectors is a good measure of the similarity of your tags, even if they do not share the same stem (you may find dog and barking related)
a good input for the term-document matrix is vital
An excellent read on this [and other IR topics] (Free eBook): Introduction to Information Retrieval
Have a look at the book, it's very well written and helped me a lot with my IR thesis.
We are developing an application in which we will show some available houses for sale on a Google map. The user can select any houses from the map and find the shortest driving route between all the houses he/she selected.
Can anyone please tell me how we can find the shortest route and show it on the map? Is there any PHP-based TSP library that can help us to achieve what we are trying to do?
A Google search shows many results.
http://scrivna.com/blog/travelling-salesman-problem/ - Brute force PHP implementation guaranteed to get the optimal answer. Only suitable for a limited number of nodes.
http://www.renownedmedia.com/blog/genetic-algorithm-traveling-salesperson-php/ - Genetic algorithm PHP implementation which will approximate the answer. Suitable for large numbers of nodes.
You could probably combine the two, choosing which to run based on the size of the graph.
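Choosing by size could look like the following Python sketch. Two things here are assumptions rather than what the linked posts do: a nearest-neighbour heuristic is swapped in for the genetic algorithm because it fits in a few lines, and the route is treated as open (it does not return to the start). The `dist` function is passed in; straight-line distance is used below, whereas real driving distances would have to come from a routing service.

```python
import math
from itertools import permutations

def route_length(route, dist):
    """Total length of an open route visiting the points in order."""
    return sum(dist(a, b) for a, b in zip(route, route[1:]))

def brute_force(start, stops, dist):
    """Try every visiting order; optimal, but only feasible for a handful of stops."""
    best = min(permutations(stops),
               key=lambda order: route_length((start,) + order, dist))
    return [start, *best]

def nearest_neighbour(start, stops, dist):
    """Greedy approximation: always drive to the closest unvisited stop."""
    route, remaining = [start], set(stops)
    while remaining:
        nxt = min(remaining, key=lambda s: dist(route[-1], s))
        route.append(nxt)
        remaining.remove(nxt)
    return route

def plan_route(start, stops, dist, exact_limit=8):
    """Exact search for small inputs, heuristic for large ones."""
    if len(stops) <= exact_limit:
        return brute_force(start, stops, dist)
    return nearest_neighbour(start, stops, dist)

# Hypothetical house coordinates on a straight road.
start = (0, 0)
houses = [(5, 0), (1, 0), (3, 0)]
route = plan_route(start, houses, math.dist)
```

The `exact_limit` of 8 is an arbitrary cut-off: brute force checks n! orderings, so 8 stops already means 40,320 routes and the count grows very quickly from there.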
As @Barbar points out in the comments, there is an existing app that does what you're attempting. There is a blog post explaining how it works.
It's old, but it may be useful to people:
https://developers.google.com/maps/documentation/javascript/v2/services#RoutesAndSteps
Just create waypoints for each house and let Google do the math for you...
If the problem satisfies the triangle inequality, you can try the Christofides algorithm.
Say I have a collection of 100,000 articles across 10 different topics. I don't know which articles actually belong to which topic but I have the entire news article (can analyze them for keywords). I would like to group these articles according to their topics. Any idea how I would do that? Any engine (sphinx, lucene) is ok.
In terms of machine learning/data mining, we call this kind of problem a classification problem. The easiest approach is to use past data for future prediction, i.e. a statistical approach:
http://en.wikipedia.org/wiki/Statistical_classification, in which you can start by using the Naive Bayes classifier (commonly used in spam detection)
I would suggest reading this book (although written for Python): Programming Collective Intelligence (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325); it has a good example.
Well, an Apache project providing machine learning libraries is Mahout. Its features include:
[...] Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. [...]
You can find Mahout under http://mahout.apache.org/
Although I have never used Mahout, just considered it ;-), it always seemed to require a decent amount of theoretical knowledge. So if you plan to spend some time on the issue, Mahout would probably be a good starting point, especially since it's well documented. But don't expect it to be easy ;-)
Dirt simple way to create a classifier:
Hand read and bucket N example documents from the 100K into each one of your 10 topics. Generally, the more example documents the better.
Create a Lucene/Sphinx index with 10 documents corresponding to each topic. Each document will contain all of the example documents for that topic concatenated together.
To classify a document, submit that document as a query by making every word an OR term. You'll almost always get all 10 results back. Lucene/Sphinx will assign a score to each result, which you can interpret as the document's "similarity" to each topic.
Might not be super-accurate, but it's easy if you don't want to go through the trouble of training a real Naive Bayes classifier. If you want to go that route you can Google for WEKA or MALLET, two good machine learning libraries.
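The same three steps can be tried without a search engine at all. The sketch below (Python, standard library only) replaces the Lucene/Sphinx relevance score with a much cruder stand-in: the share of a topic's example words that also occur in the document. The example texts and function names are made up, and a real engine's scoring is considerably more refined.

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def build_topic_index(examples):
    """examples: topic -> list of example texts. One merged word count per topic."""
    return {topic: Counter(w for doc in docs for w in words(doc))
            for topic, docs in examples.items()}

def score_topics(text, index):
    """Treat every distinct word of the text as an OR term; rank topics by overlap."""
    query = set(words(text))
    scores = {}
    for topic, counts in index.items():
        total = sum(counts.values())
        scores[topic] = sum(counts[w] for w in query) / total if total else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical hand-bucketed examples (step 1), merged per topic (step 2).
examples = {
    "sports": ["the team won the match", "the striker scored a goal"],
    "finance": ["the bank raised interest rates", "shares fell on the stock market"],
}
index = build_topic_index(examples)
ranking = score_topics("the striker scored in the match", index)  # step 3
```

Every topic gets a score, as in the search-engine version, and the first entry of `ranking` is the best guess. Common words such as "the" inflate all scores, which is one reason a stop-word list or a real engine's weighting helps.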
Excerpt from Chapter 7 of "Algorithms of the Intelligent Web" (Manning 2009):
"In other words, we’ll discuss the adoption of our algorithms in the context of a hypothetical web application. In particular, our example refers to a news portal, which is inspired by the Google News website."
So, the content of Chapter 7 from that book should provide you with code for, and an understanding of, the problem that you are trying to solve.
You could use Sphinx to search all the articles for each of the 10 different topics, and then set a threshold on the number of matches that would link an article to a particular topic, and so on.
I recommend the book "Algorithms of the Intelligent Web" by Haralambos Marmanis and Dmitry Babenko. There's a chapter on how to do this.
I don't think it is possible to completely automate this, but you could do most of it. The problem is: where would the topics come from?
Extract a list of the least common words and phrases from each article and use those as tags.
Then I would make a list of Topics and assign words and phrases which would fall within that topic and then match that to the tags. The problem is that you might get more than one topic per article.
Perhaps the best way would be to use some form of Bayesian classifiers to determine which topic best describes the article. It will require that you train the system initially.
This sort of technique is used for determining whether an email is SPAM or not.
This article might be of some help
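For a sense of what training such a Bayesian classifier involves, here is a minimal multinomial Naive Bayes sketch in Python (standard library only). The class name, the add-one smoothing and the four training sentences are assumptions for the example; a real system would be trained on many hand-labelled articles per topic.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()               # topic -> number of training docs
        self.word_counts = defaultdict(Counter)   # topic -> word -> count
        self.vocab = set()

    def train(self, text, topic):
        self.doc_counts[topic] += 1
        for w in tokens(text):
            self.word_counts[topic][w] += 1
            self.vocab.add(w)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for topic, n in self.doc_counts.items():
            counts = self.word_counts[topic]
            # Add-one (Laplace) smoothing so unseen words do not zero out a topic.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total_docs)  # log prior
            for w in tokens(text):
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = topic, score
        return best

nb = NaiveBayes()
nb.train("the team won the match", "sports")
nb.train("the striker scored a goal", "sports")
nb.train("the bank raised interest rates", "finance")
nb.train("shares fell on the stock market", "finance")
guess = nb.classify("the striker scored")
```

Working in log probabilities avoids numeric underflow on long articles. The classifier always returns exactly one topic, which sidesteps the "more than one topic per article" problem at the cost of hiding close calls.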