I'm looking for a way to compare a number of silhouettes and determine which two are most alike; obviously I would like to do this in the most efficient way possible. I thought perhaps this could be done using the ImageMagick morphology functionality, but perhaps I'm misunderstanding the function: http://www.imagemagick.org/Usage/morphology/#intro
Any thoughts?
Mathematical morphology is a technique used in digital image processing, mostly for image shape analysis. It can therefore be used to compare two images and discover whether their shapes are similar.
Various methods can be used to achieve this. The basic operations are erosion and dilation, and the others are more or less built on them. You can apply iterations of erosion with an appropriately chosen structuring element (chosen with regard to the nature of the images), obtain the basic shape of each image, and then compare them pixel-wise. Alternatively, you can check after each iteration whether anything remains of each image and base a similarity measure on that. This of course works only for some simple cases; more complex methods with special treatment have to be implemented for general use.
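As a rough illustration of the erosion-iteration idea, here is a minimal Python sketch using scipy.ndimage; the function names and the choice of a 3x3 cross structuring element are mine, and real images will need a structuring element suited to their content.

import numpy as np
from scipy import ndimage

def erosion_survival(img, max_iters=50):
    # Count how many erosions it takes until nothing remains;
    # a crude descriptor of how "thick" the silhouette is.
    cross = ndimage.generate_binary_structure(2, 1)  # 3x3 cross
    for i in range(max_iters):
        if not img.any():
            return i
        img = ndimage.binary_erosion(img, structure=cross)
    return max_iters

def pixelwise_similarity(a, b, iters=3):
    # Erode both silhouettes a few times, then compare pixel-wise:
    # the fraction of matching pixels serves as a similarity score.
    cross = ndimage.generate_binary_structure(2, 1)
    ea = ndimage.binary_erosion(a, structure=cross, iterations=iters)
    eb = ndimage.binary_erosion(b, structure=cross, iterations=iters)
    return (ea == eb).mean()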
If you would like to use mathematical morphology for comparing images, I recommend familiarizing yourself with the concept by reading some related material, e.g. Mathematical Morphology. There are also other approaches you may find more suitable for this task, but the problem of comparing images is in general very complicated and not at all straightforward.
A silhouette is a view of an object or scene as a solid shape of a single color, usually black. The shape therefore depicts the outline of the object, while the interior is featureless. It would have been useful if you could have given examples of your test images. In such a simplified case it is good to observe the spatial distribution of components of different sizes. This is efficiently obtained from the granulometry operation in mathematical morphology, which basically provides the different scales of connected components in the image. You could look at this paper, which uses granulometry to characterize human silhouettes using a dictionary learnt from positive and negative human shapes.
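For a flavour of what granulometry computes, here is a minimal sketch using scipy.ndimage; it assumes img is a binary NumPy array, and the disc radii and the comparison strategy are my own illustrative choices.

import numpy as np
from scipy import ndimage

def granulometry(img, max_radius=10):
    # Pixel count surviving an opening with discs of increasing radius;
    # the resulting curve summarizes the size distribution of components.
    curve = []
    for r in range(1, max_radius + 1):
        y, x = np.ogrid[-r:r + 1, -r:r + 1]
        disc = (x * x + y * y) <= r * r  # disc structuring element
        opened = ndimage.binary_opening(img, structure=disc)
        curve.append(int(opened.sum()))
    return np.array(curve)

# Two silhouettes with similar size distributions give similar curves,
# so e.g. the Euclidean distance between curves gives a coarse comparison.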
Related
I am developing a sports website that would keep a record of all tournaments in tennis, football and rugby. My database structure is built to hold who plays whom in which tournament, so it would just be a select to display all the information. The type of workflow I am talking about is the one commonly used in the sports arena, where players' names are listed head to head and the level of each match (knockout, quarter final, semi final, etc.) is also shown. I do not know the correct term for this, though. I will give you an example of how it would look.
I am sure this is possible by using web technology, I am just finding it hard on where to start. Any advice or suggestions are much appreciated. Also if there are any libraries I could use for this, that would be immensely helpful.
Depending on how you want to format the information you should be able to do it in a few ways.
You could use GD like you mentioned, but that may become a bit tedious once you get to larger and larger brackets. (I don't have a lot of experience with GD, but I know the basics.)
I have implemented a 256-person ladder/bracket using HTML and CSS. This proved to be pretty simple to do, and it should scale easily and be easy to change.
Well on a first glance I would see the following data:
Teams
Cups (having Rounds)
Rounds (of Matches)
Matches (of Teams)
You could model that into a relational database, e.g. MySQL.
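As a minimal sketch of that model, here is one possible schema using Python's built-in sqlite3 in place of MySQL; all table and column names are my own invention.

import sqlite3

conn = sqlite3.connect("tournaments.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS teams  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS cups   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS rounds (
    id     INTEGER PRIMARY KEY,
    cup_id INTEGER REFERENCES cups(id),
    label  TEXT  -- e.g. 'knockout', 'quarter final', 'semi final'
);
CREATE TABLE IF NOT EXISTS matches (
    id        INTEGER PRIMARY KEY,
    round_id  INTEGER REFERENCES rounds(id),
    home_team INTEGER REFERENCES teams(id),
    away_team INTEGER REFERENCES teams(id),
    winner    INTEGER REFERENCES teams(id)
);
""")
conn.commit()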
You can then create models in classes for your application, e.g. in PHP.
You can then create a Web UI to display the data you've entered into the database. You can use GD for that if you need to, though I think HTML is not bad for this; I would do it with simple text-based output first before turning everything into an image.
Maybe that's helpful. This was a bit lengthy for a comment, so I added it as an answer.
People search in my website and some of these searches are these ones:
tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a
My question is: is there any way to detect strings similar to the ones above?
I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)
edit: I mean gibberish searches. For example, some people search for strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect such gibberish searches.
It doesn't matter whether the search returns 0 results or something else; I can't use that logic.
Some new brands or products would be wrongly ignored if I only considered "regular words".
Thank you for your help
You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
For background, read about Markov Chains.
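Here is a minimal sketch of the approach in Python (not the linked implementation; the function names, alphabet, smoothing floor and threshold choice are all mine):

import math
from collections import defaultdict

ALPHABET = set("abcdefghijklmnopqrstuvwxyz ")

def char_pairs(text):
    chars = [c for c in text.lower() if c in ALPHABET]
    return list(zip(chars, chars[1:]))

def train(corpus):
    # Count character-to-character transitions, then normalize each row
    # so the table holds P(next char | current char).
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in char_pairs(corpus):
        counts[a][b] += 1
    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: n / total for b, n in row.items()}
    return model

def avg_log_prob(query, model, floor=1e-6):
    # Walk the matrix, take the product of transition probabilities
    # (sum of logs), and normalize by the query length.
    pairs = char_pairs(query)
    if not pairs:
        return 0.0
    logp = sum(math.log(model.get(a, {}).get(b, floor)) for a, b in pairs)
    return logp / len(pairs)

# Queries scoring well below typical English text are likely gibberish.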
Edit: I implemented this in Python here:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
You could do what Stackoverflow does and calculate the entropy of the string.
Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
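For illustration, a minimal Shannon-entropy function in Python (the threshold you pick is up to you, and short strings make the measure noisy):

import math
from collections import Counter

def shannon_entropy(s):
    # Bits per character; keyboard mashing tends to score higher than
    # English text because it repeats characters less.
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("the cat sat on the mat"))  # roughly 3.0 bits
print(shannon_entropy("tapoktrpasawe qweasd"))    # roughly 3.4 bits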
Assuming you mean gibberish searches... it would be more trouble than it's worth. You are providing them with search functionality; let them use it however they please. I'm sure there are algorithms out there that detect strange character groupings, but it would probably be more resource/labour-intensive than simply returning no results.
I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not PHP, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector in the answer by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.
Example: the following code,
from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
The project is on GitHub and I welcome contributions.
I'd think you could detect these strings the same way you could detect "regular words." It's just pattern matching, no?
As to why users are searching for these strings, that's the bigger question. You may be able to stem off the gibberish searches some other way. For example, if it's comment spam phrases that people (or a script) are looking for, then install a CAPTCHA.
Edit: Another end-run around interpreting the input is to throttle it slightly. Allow a search every 10 seconds or so. (I recall seeing this on forum software, as well as various places on SO.) This will take some of the fun out of searching for sdfpjheroptuhdfj over and over again, and at the same time won't interfere with the users who are searching for, and finding, their stuff.
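A tiny sketch of that throttle in Python, assuming each search request carries some user or session key (the names and the 10-second interval are illustrative):

import time

last_search = {}    # user/session key -> timestamp of last allowed search
MIN_INTERVAL = 10   # seconds between searches

def allow_search(user_key):
    # Reject searches arriving sooner than MIN_INTERVAL after the last one.
    now = time.time()
    if now - last_search.get(user_key, 0.0) < MIN_INTERVAL:
        return False
    last_search[user_key] = now
    return True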
As some people commented, there are no hits in google for tapoktrpasawe or putjbtghguhjjjanika (Well, there are now, of course) so if you have a way to do a quick google search through an API, you could throw out any search terms that got no Google results and weren't the names of one of your products. Why you would want to do this is a whole other question - are you trying to save effort for your search library? Make your hand-review of "popular search terms" more meaningful? Or are you just frustrated at the inexplicable behaviour of some of the people out on the big wide internet? If it's the latter, my advice is just let it go, even if there is a way to prevent it. Some other weirdness will come along.
Short answer: for gibberish searches, a probabilistic language model works.
Logic
A word is made up of a sequence of characters, and some pairs of adjacent characters occur together more frequently than others. If you sum the frequencies of all contiguous character pairs in a word and the sum crosses a threshold (derived from English words), it is judged to be a proper English word. In brief, this logic is known from Markov chains.
Link
For the mathematics of gibberish detection and a better understanding, refer to this video: https://www.youtube.com/watch?v=l15C8UJu17s. Thanks!
If the search is performed on products, you could cache their names or codes and check the search term against that list before querying the database. Otherwise, if your site is for English users, you can build a dictionary of strings that aren't used in the English language, like qwkfagsd, although, agreeing with another answer, this will be more resource-intensive than not having it at all.
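A minimal sketch of that pre-query check in Python (the cache contents and names are made up):

# Hypothetical cache of product names/codes, loaded once at startup.
product_terms = {"acme widget", "ar-1500", "road runner trap"}

def known_product(query):
    # Cheap set lookup before the (more expensive) database query.
    return query.strip().lower() in product_terms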
Of course Google has been doing this for years! However, rather than starting from scratch, spending 10+ years and squandering large sums of money :) I was wondering if anyone knows of a simple PHP library that would return a list of important words (and/or some sort of context) from a web page or chunk of text?
On a basic level, I am guessing most spiders will pull in words, remove words without real meaning, then count the rest. The most frequently occurring words would most likely be what I'm interested in.
Any sort of pointers would be really appreciated!
Latent Semantic Indexing.
I can give you pointers, but you want to look up/research Latent Semantic Indexing.
Rather than explain it, here is a quick snippet from a webpage.
Latent semantic indexing is essentially a way of extracting the meaning from a document without matching a specific phrase. A simple example would be that a document featuring the words ‘Windows’, ‘Bing’, ‘Excel’ and ‘Outlook’ would be about Microsoft. You wouldn’t need ‘Microsoft’ to appear again and again to know that.
This example also highlights the importance of taking into account related words, because if ‘windows’ appeared on a page that also featured ‘glazing’, it would most likely be an entirely different meaning.
You can of course go down the easy route of dropping all stop words from the text corpus, but LSI is definitely more accurate.
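For reference, the "easy route" is only a few lines of Python (the stop-word list here is a tiny illustrative sample; a real one would be much longer):

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def top_keywords(text, n=10):
    # Tokenize, drop stop words, and count what remains.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(n)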
I will update this post with more info in about 30 minutes.
(Still intending to update this post - Got too busy with work).
Update
Okay, so the basic idea behind LSA is to offer a new/different approach to retrieving a document based on a particular search term. However, you could very easily use it for determining the meaning of a document too.
One of the problems with the search engines of yesteryear was that they were based on keyword analysis. If you take Yahoo/AltaVista from late 1999 through to probably 2002/03 (don't quote me on this), they were extremely dependent on ONLY using keywords as the factor for retrieving a document from their index. Keywords, however, don't translate to anything other than the keyword they represent.
However, the keyword "hot" means lots of things depending on the context in which it is placed. If you were to take the term "hot" and identify that it was placed around other terms such as "chillies", "spices" or "herbs", then conceptually it means something totally different from the term "hot" when surrounded by other terms such as "heat" or "warmth", or "sexy" and "girl".
LSA attempts to overcome these deficiencies by working on a matrix of statistical probabilities (which you build yourself).
Anyway, on to some tools that help you build this matrix of documents/terms (and cluster them by proximity within the corpus). This works to the benefit of search engines by transposing keywords into concepts, so that if you search for a particular keyword, that keyword might not even appear in the documents retrieved, but the concept the keyword represents does.
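To make the matrix idea concrete, here is a minimal LSA sketch in Python using scikit-learn (my choice of library, not something this answer prescribes); the documents are toy examples echoing the 'windows' ambiguity above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "Windows Bing Excel Outlook office software",
    "double glazing windows doors conservatory",
    "Microsoft releases a new Office and Outlook update",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)          # term/document matrix

svd = TruncatedSVD(n_components=2)     # project into a small concept space
concepts = svd.fit_transform(X)

# Documents 0 and 2 share terms (office, outlook) and should land close
# together, while document 1 (the glazing sense of "windows") sits apart.
print(concepts)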
I've always used Lucene / Solr for search, and a quick Google search for Solr LSA LSI returned a few links.
http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
This guy seems to have created a plugin for it.
http://github.com/algoriffic/lsa4solr
I might check it out over the next few weeks and see how it gets on.
Go have a look at Calais and Zemanta. Very cool stuff!
Personally, I'd be inclined to use something like a Brill parser to identify the part of speech of each word, discard pronouns, verbs, etc., and use the rest to extract a list of nouns (possibly with any qualifying adjectives) to build the list of keywords. You can find a PHP implementation of a Brill parser on Ian Barber's PHP/IR site.
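A rough Python equivalent of that idea, using NLTK's part-of-speech tagger as a stand-in for the Brill parser (resource names vary slightly between NLTK versions):

import nltk

# Download tokenizer and tagger data on first run; the exact resource
# names differ across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The quick brown fox jumps over the lazy dog near the old barn."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Keep nouns (NN*) and their qualifying adjectives (JJ*).
keywords = [word for word, tag in tagged if tag.startswith(("NN", "JJ"))]
print(keywords)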
Given a set of floorplans (in AutoCAD, SVG, or whatever format need be...), I would like to programmatically generate directions from point A to point B. Basically I would like to say: "How do I get from room 101 to room 143?" (or, for triple bonus points, from room 101 to room 323). Anyone have any ideas how to go about this? I am pretty language-agnostic at this point, although I know C(++), Erlang, PHP and Python best. I do realize this is a tall order.
Thanks!
The general term for this is pathfinding. The problem has been studied extensively for 2D diagrams. I would break apart the problem into these sections:
Convert CAD model of floor into a simple model of rooms, doors, halways.
Run a pathfinding algorithm on that floor from source to destination, with constraints for human motion.
Convert the results to text directions (turn right, go straight, etc.). The addition of landmarks may be helpful.
For multiple floors, you could just use the one floor implementation and go from (e.g.) 104 to the 1st floor stairs, 3rd floor stairs to 311. The conversion of the CAD drawing to a semantically useful format seems like the most difficult step to me.
I know you want to use PHP, but I recommend Python and networkx. You have to convert your building into a set of (origin, destination, cost) triples and then run either a TSP solver (as mentioned by still standing), A*, or Dijkstra, as sketched below.
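A minimal networkx sketch of the shortest-path step, assuming the floorplan has already been reduced to a graph of rooms, doors and hallways (all node names and costs here are invented):

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("room101", "hallway_a", 3),
    ("hallway_a", "hallway_b", 10),
    ("hallway_b", "room143", 4),
    ("hallway_a", "stairs_1", 6),
])

# shortest_path with a weight attribute runs Dijkstra under the hood.
path = nx.shortest_path(G, "room101", "room143", weight="weight")
print(path)  # ['room101', 'hallway_a', 'hallway_b', 'room143']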
Read about the travelling salesman problem. There are an infinite number of paths from point A to point B. Are you looking for the shortest? What is your means of transport? Can you fly, or are you forced to walk or drive? These are factors in determining a solution.
I want to store multiple MP3 files and search them by providing some part of a song, to detect which song it is.
I am thinking of storing all the binary content in MySQL, and when I want to search for a specific song by content, I will take some middle portion of the song and match it against the binary data in MySQL.
My questions are:
Is this a reasonable way to find songs by their content?
Is it right to store the songs' content in the database or should I use the filesystem?
This is not going to work. MP3 is a "lossy" format. That means that it constantly alters subtle nuances of the music when encoding, thus producing totally different byte-wise data on almost every encoding for the same song.
Also, even in an uncompressed format like WAV, two identical recordings at different volumes will produce different byte data. So it is impossible to compare music by comparing the byte values of the files' contents.
A binary comparison will work only for two exact identical copies of the same MP3 file. It won't even work anymore when you re-encode the same MP3 file with identical settings.
Comparing music is not a trivial matter, several approaches exist but to my knowledge none that can be used in PHP.
If you're lucky, there exists a web service that allows some kind of matching. Expect it to be commercial in some way, though - I doubt we are at the stage where this kind of thing can be used free of charge.
Is this a reasonable way to find songs by their content?
Only if you can be sure that the part you get as a search criterion is actually an excerpt from that particular MP3 file... and that is very, very unlikely. If the part can come from a different source (i.e. a different recording of the same song, or just a differently compressed MP3), you'll have to use audio fingerprinting, which is vastly more complicated.
Is it right to store the songs' content in the database or should I use the filesystem?
If you do simple binary matching, there is no point in using a database. If you have a more complex indexing technique (such as audio fingerprints) then using a database can make sense.
As others have pointed out - comparing MP3s by looking at the binary content of files is not going to work.
I wrote something like this in Java whilst at university for my final year project. I'd be more than happy to send you the source code. It dealt in relative similarities - "song X is more similar to song Y than it is to song Z", rather than matches, but it might be a step in the right direction.
And please, whatever you do, don't try to do this in PHP. The algorithm I used needed me to compute (if I remember correctly - I worked on this around 3 years ago) 30 30x30 matrices for each MP3 it analysed. Each song took around 30 seconds to process into a set of matrices on my clunky old machine (I'm sure my new PC could get the job done significantly quicker). Once I had those matrices for n songs, a second step computed differences between each pair of songs, and a third step reduced those differences down to m-dimensional space. Each of these 3 steps takes a fair amount of horsepower, and PHP definitely isn't the right horse for the job.
What PHP might work for is a frontend - I ended up with a queryable web-app written in Ruby on Rails, where I had a simple backend which stored the co-ordinates of each song in m-dimensional space (I happened to choose m = 6) - given a particular song, or fragment, X, you could then compute songs within a certain "distance" of X.
NB. I should probably point out that all the code I wrote was basically just a wrapper around libraries others had written - which were by some smart people at a university in Austria - those libraries took two songs and generated the matrices - all I did was compute distances and map distances of lots of songs into m-dimensional space. Wish I was smart enough to have done the first bit too!
I don't fully understand what you're trying to do, but if you're going to index an MP3 collection, it's probably a better idea to store a hash (of sufficient length) rather than the actual file.
The problem is that the bytes don't give you any insight to the CONTENT of the file, i.e. the music in it. Even if you cut the metadata from the bytes to compare (to get rid of noise like changes in spelling/capitalisation of metadata), you only know something about the unique file itself. So you could compare two identical files (i.e. exact duplicates) for equality, but you couldn't compare any two random files for similarity.
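For the exact-duplicate case, a minimal Python sketch of indexing by hash (bearing in mind, as above, that this can never match re-encodings of the same song):

import hashlib

def file_hash(path, chunk_size=1 << 16):
    # Stream the file so large MP3s don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Store the hex digest in the index; equal digests mean byte-identical files.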
To search songs, you probably want to index their tags and focus on a nice, easy-to-use UI so users can look for them in flexible ways.
As said above, the same song will show different content bytes depending on the encoding.
However, one idea pointing in your direction, and I'm not sure how feasible it is, would be to index some patterns of a song that may uniquely identify it. For example, what do all Johnny Cash songs have in common? Volume, tone, a combination of them? Then, when you get a portion of content, you could extract that same pattern from it and match. That would be an interesting concept.