how to store and search mp3 by its content - php

I want to store multiple mp3 files and search them by giving some part of song, to detect which song it is.
I am thinking of storing all binary content in mysql and when I want to search for a specific song by content I will take some middle portion of song and actually match it with the binary data in MySQL.
My questions are:
Is this a reasonable way to find songs by their content?
Is it right to store the songs' content in the database or should I use the filesystem?

This is not going to work. MP3 is a "lossy" format. That means that it constantly alters subtle nuances of the music when encoding, thus producing totally different byte-wise data on almost every encoding for the same song.
Also, even in an uncompressed format like WAV, two identical recordings at different volumes will produce different byte data. So it is impossible to compare music by comparing the byte values of the files' contents.
A binary comparison will work only for two exact identical copies of the same MP3 file. It won't even work anymore when you re-encode the same MP3 file with identical settings.
Comparing music is not a trivial matter. Several approaches exist, but to my knowledge none that can be used in PHP.
If you're lucky, there exists a web service that allows some kind of matching. Expect it to be commercial in some way, though - I doubt we are at the stage where this kind of thing can be used free of charge.

Is it a right way to find songs by content of song.
Only if you can be sure that the part you get as a search criterion will actually be an excerpt from that particular MP3 file, and that is very, very unlikely. If the part can come from a different source (e.g. a different recording of the same song, or just a differently compressed MP3), you'll have to use audio fingerprinting, which is vastly more complicated.
Is it right to store songs content in database or file store normally will work?
If you do simple binary matching, there is no point in using a database. If you have a more complex indexing technique (such as audio fingerprints) then using a database can make sense.

As others have pointed out - comparing MP3s by looking at the binary content of files is not going to work.
I wrote something like this in Java whilst at university for my final year project. I'd be more than happy to send you the source code. It dealt in relative similarities - "song X is more similar to song Y than it is to song Z", rather than matches, but it might be a step in the right direction.
And please, whatever you do, don't try and do this in PHP. The algorithm I used needed me to compute (if I remember correctly - I worked on this around 3 years ago) 30 30x30 matrices for each MP3 it analysed. Each song took around 30 seconds to process to a set of matrices on my clunky old machine (I'm sure my new PC could get the job done significantly quicker). Once I had those matrices for n songs a second step computed differences between each pair of songs, and a third step reduced those differences down to m-dimensional space. Each of these 3 steps takes a fair amount of horsepower, and PHP definitely isn't the right horse for the job.
What PHP might work for is a frontend - I ended up with a queryable web-app written in Ruby on Rails, where I had a simple backend which stored the co-ordinates of each song in m-dimensional space (I happened to choose m = 6) - given a particular song, or fragment, X, you could then compute songs within a certain "distance" of X.
NB. I should probably point out that all the code I wrote was basically just a wrapper around libraries others had written - which were by some smart people at a university in Austria - those libraries took two songs and generated the matrices - all I did was compute distances and map distances of lots of songs into m-dimensional space. Wish I was smart enough to have done the first bit too!

I don't fully understand what you're trying to do, but if you're going to index an MP3 collection, it's probably a better idea to store a hash (of sufficient length) rather than the actual file.
The problem is that the bytes don't give you any insight to the CONTENT of the file, i.e. the music in it. Even if you cut the metadata from the bytes to compare (to get rid of noise like changes in spelling/capitalisation of metadata), you only know something about the unique file itself. So you could compare two identical files (i.e. exact duplicates) for equality, but you couldn't compare any two random files for similarity.
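To make the "hash instead of the file" idea concrete, here is a minimal sketch in Python. The function names are my own, and SHA-256 is just one reasonable choice of hash. It only finds byte-identical copies, which is exactly the limitation described above: a re-encoded version of the same song gets a completely different digest.

```python
import hashlib


def bytes_digest(data):
    """SHA-256 hex digest of in-memory bytes (handy for quick experiments)."""
    return hashlib.sha256(data).hexdigest()


def file_digest(path, chunk_size=65536):
    """SHA-256 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(paths):
    """Group paths by digest; only byte-identical files end up together."""
    groups = {}
    for path in paths:
        groups.setdefault(file_digest(path), []).append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Storing the digest in an indexed database column next to the file path would be enough for duplicate detection; the audio itself can stay on the filesystem.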

To search songs, you will probably want to index their tags and focus on a nice, easy-to-use UI so users can look for them in flexible ways.
As said above, the same song will produce different content bytes depending on the encoding.
However, one idea in the direction you are heading, and I'm not sure how feasible it is, would be to index patterns that may uniquely identify a song. For example, what do all Johnny Cash songs have in common? Volume, tone, a combination of them? When you get a portion of content, you could extract that same pattern from it and match. That would be an interesting concept.

Related

Need an algorithm to find near-duplicate text values

I run a photo website where users are free to enter any tag they like, even tags not used before. As a result, a photo of an insect may sometimes be tagged as "insect" whilst somebody else tags it as "insects".
I'd like to keep the free-tagging capability, yet would like to have a way to filter out such near-duplicates. The total collection of tags is currently at 1,500. My idea is to read all of them from the DB into memory and then run an algorithm on them that displays "suspects".
My idea of a suspect is that x% of the characters in the string are the same (same char and order), where x is configurable. I could probably code a really inefficient way to do this but I was wondering if there is an existing solution to this problem?
Edit: Forgot to mention: just sorting the tags isn't enough, as that would require me to go through the entire set to find dupes.
There are some flaws in your logic. For example, what happens when the plural of an object is different from the singular (i.e. person vs. people or even candy vs. candies).
If English is the primary language, check out Soundex which allows phonetic matches. Also consider using a crowd-sourced synonym model where users can create links to existing tags.
Maybe the algorithm you are looking for is approximate string matching:
http://en.wikipedia.org/wiki/Approximate_string_matching
Given a word, you can match it against the list of words, and if the 'distance' is small, add it to the suspects.
A fast implementation is to use dynamic programming like the Needleman–Wunsch algorithm.
I have made a blog example of this in C# where you can configure the 'distance' using a matrix character lookup file.
http://kunuk.wordpress.com/2010/10/17/dynamic-programming-example-with-c-using-needleman-wunsch-algorithm/
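As a rough illustration of the distance idea, here is a small Python sketch. Note that it uses plain Levenshtein edit distance (a close relative of Needleman–Wunsch with unit costs), not the configurable matrix from the blog post, and that the function names and the `max_distance` default are my own assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def suspects(tags, max_distance=1):
    """Return pairs of distinct tags that are within max_distance edits."""
    unique = sorted(set(tags))
    pairs = []
    for i, first in enumerate(unique):
        for second in unique[i + 1:]:
            if levenshtein(first, second) <= max_distance:
                pairs.append((first, second))
    return pairs
```

With 1,500 tags this compares roughly 1.1 million pairs, which should be acceptable for an occasional batch job. Short tags will produce false positives ("cat" and "car" are one edit apart), so the output is a list of suspects for a human to review, not an automatic merge.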
Is "either contains either" fine? You could do a SQL query something like this, if your images are in a database (which would only make sense):
SELECT * FROM ImageTags WHERE INSTR('theNewTag', TagName) > 0 OR INSTR(TagName, 'theNewTag') > 0 LIMIT 1;
If you really want to do this efficiently I would suggest some sort of JavaScript implementation that displays possibilities as the user is typing in a tag that they want. Not only will it save the user time to happily see 5 suggestions as they type. It will automatically stop them from typing "suspects" when "suspect" shows up as a suggestion. That is, of course, unless they really want "suspects" as a point of urgency.
You could load a huge list of words and as the user types narrow them down. I get the feeling that this could be very simplistic esp if you want to anticipate correctly spelled words. If someone misses a letter, they'll probably go back to fix it when they see a list of suggestions that isn't at all what they meant to type. And when they do correctly type a word it'll pop up in the suggestions.
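The suggest-as-you-type idea needs some lookup behind the UI. Here is a minimal sketch of that lookup in Python, assuming the existing tags are kept in a sorted list; the function name and the limit of 5 are made up for illustration, and the JavaScript front end that would call it is not shown.

```python
import bisect


def suggest(sorted_tags, prefix, limit=5):
    """Return up to `limit` tags that start with `prefix`.

    `sorted_tags` must already be sorted, so that all tags sharing a
    prefix sit next to each other and binary search can find the first.
    """
    matches = []
    index = bisect.bisect_left(sorted_tags, prefix)
    while (index < len(sorted_tags)
           and len(matches) < limit
           and sorted_tags[index].startswith(prefix)):
        matches.append(sorted_tags[index])
        index += 1
    return matches
```

Typing "ins" against a list containing "insect" and "insects" would offer both, which nudges the user towards a tag that already exists.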

Use GD or any other php library to build a workflow

I am developing a sports website that will keep a record of all tournaments in tennis, football and rugby. My database structure is built to hold who plays whom in which tournament, so it would just be a select to display all the information. The type of workflow that I am talking about is the one commonly used in the sports arena, where players' names are listed head to head and the level of each match (knockout, quarter-final, semi-final, etc.) is also listed. I do not know the correct term for this though. I will give you an example of how it would look.
I am sure this is possible by using web technology, I am just finding it hard on where to start. Any advice or suggestions are much appreciated. Also if there are any libraries I could use for this, that would be immensely helpful.
Depending on how you want to format the information you should be able to do it in a few ways.
You could use GD like you mentioned but that may be a bit tedious once you get larger and larger brackets. (I don't have a lot of exp. with GD but I know the basics)
I have implemented a 256 person ladder or bracket using html and css. This proved to be pretty simple to do and it should be able to scale easily and be easy to make changes to.
Well on a first glance I would see the following data:
Teams
Cups (having Rounds)
Rounds (of Matches)
Matches (of Teams)
You could model that into a relational database, e.g. MySQL.
You can then create models in classes for your application, e.g. in PHP.
You can then create a Web UI to display the data you've entered into the database. You can use GD for that (if it's a need, I think HTML is not that bad for that, would do it with simple text based output first before turning everything into an image).
Maybe that's helpful. Was a bit lengthy for a comment, so I added it as an answer.
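To illustrate the Rounds/Matches part of that model with simple text output first, here is a small sketch in Python (the same shape would carry over to PHP classes). The class and function names are my own, and it assumes a single-elimination cup with a power-of-two number of teams.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Match:
    home: Optional[str]  # None until the feeding match has been played
    away: Optional[str]


@dataclass
class Round:
    name: str
    matches: List[Match] = field(default_factory=list)


ROUND_NAMES = {1: "Final", 2: "Semi-final", 4: "Quarter-final"}


def round_name(match_count):
    """Name a round by how many matches it holds."""
    return ROUND_NAMES.get(match_count, f"Round of {match_count * 2}")


def build_bracket(teams):
    """Create all rounds of a knockout cup; later rounds start out empty."""
    count = len(teams)
    if count < 2 or count & (count - 1):
        raise ValueError("expected a power-of-two number of teams")
    first = [Match(teams[i], teams[i + 1]) for i in range(0, count, 2)]
    rounds = [Round(round_name(len(first)), first)]
    matches = len(first)
    while matches > 1:
        matches //= 2
        empty = [Match(None, None) for _ in range(matches)]
        rounds.append(Round(round_name(matches), empty))
    return rounds


def render(rounds):
    """Plain-text view of the bracket, one line per match."""
    lines = []
    for rnd in rounds:
        lines.append(rnd.name)
        for match in rnd.matches:
            lines.append(f"  {match.home or 'TBD'} vs {match.away or 'TBD'}")
    return "\n".join(lines)
```

Once the text version looks right, the same rounds list can drive an HTML/CSS layout or a GD image.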

Is there any way to detect strings like putjbtghguhjjjanika?

People search in my website and some of these searches are these ones:
tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a
My question is: is there any way to detect strings similar to the ones above?
I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)
Edit: I mean the "gibberish searches". For example, some people search for strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect gibberish searches.
It doesn't matter whether the search returns 0 results or anything else, so I can't use that logic.
Some new brands or products would be ignored if I only accepted "regular words".
Thank you for your help
You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
For background, read about Markov Chains.
Edit, I implemented this here in Python:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
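For readers who want to see the idea without opening the repositories, here is a condensed sketch of a character-transition model in Python. It is not the code from either linked project: the names, the add-one smoothing and the use of a geometric mean are my own choices, and the cutoff between "real" and "gibberish" has to be picked from your own data.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def normalize(text):
    """Lowercase the text and keep only letters and spaces."""
    return "".join(c for c in text.lower() if c in ALPHABET)


def train(corpus_lines, smoothing=1.0):
    """Build log-probabilities for every character-to-character transition."""
    counts = {a: {b: smoothing for b in ALPHABET} for a in ALPHABET}
    for line in corpus_lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: math.log(n / total) for b, n in row.items()}
    return model


def score(model, query):
    """Geometric mean of the transition probabilities along the query.

    Summing logs and dividing by the number of transitions is the
    length-normalized product described above; low values suggest gibberish.
    """
    chars = normalize(query)
    if len(chars) < 2:
        return 0.0
    log_sum = sum(model[a][b] for a, b in zip(chars, chars[1:]))
    return math.exp(log_sum / (len(chars) - 1))
```

In practice you would train on a large body of English text plus your own query logs, score a set of known-good and known-bad queries, and put the threshold somewhere between the two groups.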
You could do what Stackoverflow does and calculate the entropy of the string.
Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
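For reference, the entropy of a string can be computed in a few lines. This is a minimal Python sketch of Shannon entropy over characters; how Stack Overflow actually uses such a number is not something I can show, so treat the function as an illustration only.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Repetitive input such as "aaaaaaaa" scores 0, while text with many distinct characters scores higher. Keep in mind that entropy mainly separates repetitive strings from varied ones, so a keyboard-mash query can still land in the same range as a normal word of similar length.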
Assuming you mean jibberish searches... It would be more trouble than it's worth. You are providing them with a search functionality, let them use it however they please. I'm sure there are some algorithms out there that detect strange character groupings, but it would probably be more resource/labour intensive than just simply returning no results.
I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not PHP, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector in the answer by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.
Example: the following code,
from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
The project is on GitHub and I welcome contributions.
I'd think you could detect these strings the same way you could detect "regular words." It's just pattern matching, no?
As to why users are searching for these strings, that's the bigger question. You may be able to stem off the gibberish searches some other way. For example, if it's comment spam phrases that people (or a script) is looking for, then install a CAPTCHA.
Edit: Another end-run around interpreting the input is to throttle it slightly. Allow a search every 10 seconds or so. (I recall seeing this on forum software, as well as various places on SO.) This will take some of the fun out of searching for sdfpjheroptuhdfj over and over again, and at the same time won't interfere with the users who are searching for, and finding, their stuff.
As some people commented, there are no hits in google for tapoktrpasawe or putjbtghguhjjjanika (Well, there are now, of course) so if you have a way to do a quick google search through an API, you could throw out any search terms that got no Google results and weren't the names of one of your products. Why you would want to do this is a whole other question - are you trying to save effort for your search library? Make your hand-review of "popular search terms" more meaningful? Or are you just frustrated at the inexplicable behaviour of some of the people out on the big wide internet? If it's the latter, my advice is just let it go, even if there is a way to prevent it. Some other weirdness will come along.
Short answer - Jibberish Search
A probabilistic language model works.
Logic
A word is a sequence of characters, and some pairs of characters occur together more frequently than others. If we sum the frequencies of every pair of adjacent characters in a word and the sum crosses a threshold (chosen for English words), the word is considered proper English. In brief, this is the logic behind Markov chains.
Link
For Mathematics of Gibberish and better understanding, refer to video https://www.youtube.com/watch?v=l15C8UJu17s . Thanks !!
If the search is performed on products, you could cache their names or codes and check queries against that list before querying the database. Otherwise, if your site is for English-speaking users, you could build a dictionary of strings that aren't used in the English language, like qwkfagsd. As another answer notes, though, that will be more resource-intensive than not checking at all.

How do search engines find relevant content?

How does Google find relevant content when it's parsing the web?
Let's say, for instance, Google uses the PHP native DOM Library to parse content. What methods would they be for it to find the most relevant content on a web page?
My thoughts would be that it would search for all paragraphs, order by the length of each paragraph and then from possible search strings and query params work out the percentage of relevance each paragraph is.
Let's say we had this URL:
http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html
Now from that URL I would work out that the HTML file name would be of high relevance so then I would see how close that string compares with all the paragraphs in the page!
A really good example of this would be Facebook share, when you share a page. Facebook quickly bots the link and brings back images, content, etc., etc.
I was thinking that some sort of calculative method would be best, to work out the % of relevancy depending on surrounding elements and meta data.
Are there any books / information on the best practices of content parsing that covers how to get the best content from a site, any algorithms that may be talked about or any in-depth reply?
Some ideas that I have in mind are:
Find all paragraphs and order by plain text length
Somehow find the width and height of div containers and order by (W+H) - @Benoit
Check meta keywords, title, description and check relevancy within the paragraphs
Find all image tags and order by largest, and length of nodes away from main paragraph
Check for object data, such as videos and count the nodes from the largest paragraph / content div
Work out resemblances from previous pages parsed
The reason why I need this information:
I'm building a website where webmasters send us links and then we list their pages, but I want the webmaster to submit a link, then I go and crawl that page finding the following information.
An image (if applicable)
A < 255 paragraph from the best slice of text
Keywords that would be used for our search engine, (Stack Overflow style)
Meta data Keywords, Description, all images, change-log (for moderation and administration purposes)
Hope you guys can understand that this is not for a search engine but the way search engines tackle content discovery is in the same context as what I need it for.
I'm not asking for trade secrets, I'm asking what your personal approach to this would be.
This is a very general question but a very nice topic! Definitely upvoted :)
However I am not satisfied with the answers provided so far, so I decided to write a rather lengthy answer on this.
The reason I am not satisfied is that the answers are basically all true (I especially like the answer of kovshenin (+1), which is very graph theory related...), but they are all either too specific on certain factors or too general.
It's like asking how to bake a cake and you get the following answers:
You make a cake and you put it in the oven.
You definitely need sugar in it!
What is a cake?
The cake is a lie!
You won't be satisfied because you want to know what makes a good cake.
And of course there are a lot of recipes.
Of course Google is the most important player, but, depending on the use case, a search engine might include very different factors or weight them differently.
For example, a search engine for discovering new independent music artists may penalize artists' websites that have a lot of external links pointing in.
A mainstream search engine will probably do the exact opposite to provide you with "relevant results".
There are (as already said) over 200 factors that are published by Google.
So webmasters know how to optimize their websites.
There are very likely many many more that the public is not aware of (in Google's case).
But under the very broad and abstract term "search engine optimization" you can generally break the important ones apart into two groups:
How well does the answer match the question? Or:
How well does the pages content match the search terms?
How popular/good is the answer? Or:
What's the pagerank?
In both cases the important thing is that I am not talking about whole websites or domains, I am talking about single pages with a unique URL.
It's also important that pagerank doesn't represent all factors, only the ones that Google categorizes as Popularity. And by good I mean other factors that just have nothing to do with popularity.
In case of Google the official statement is that they want to give relevant results to the user.
Meaning that all algorithms will be optimized towards what the user wants.
So after this long introduction (glad you are still with me...) I will give you a list of factors that I consider to be very important (at the moment):
Category 1 (how well does the answer match the question?)
You will notice that a lot comes down to the structure of the document!
The page primarily deals with the exact question.
Meaning: the question words appear in the page's title text or in headings or paragraphs.
The same goes for the position of these keywords. The earlier in the page the better.
Repeated often as well (but not too often, which goes under the name of keyword stuffing).
The whole website deals with the topic (keywords appear in the domain/subdomain)
The words are an important topic in this page (internal links anchor texts jump to positions of the keyword or anchor texts / link texts contain the keyword).
The same goes if external links use the keywords in link text to link to this page
Category 2 (how important/popular is the page?)
You will notice that not all factors point towards this exact goal.
Some are included (especially by Google) just to give pages a boost,
that... well... that just deserved/earned it.
Content is king!
The existence of unique content that can't be found or only very little in the rest of the web gives a boost.
This is mostly measured by unordered combinations of words on a website that are generally used very little (important words). But there are much more sophisticated methods as well.
Recency - newer is better
Historical change (how often the page has updated in the past. Changing is good.)
External link popularity (how many links in?)
If a page links another page the link is worth more if the page itself has a high pagerank.
External link diversity
basically links from different root domains, but other factors play a role too.
Factors like even how geographically separated the web servers of the linking sites are (according to their IP addresses).
Trust Rank
For example, if big, trusted, established sites with editorial content link to you, you get a trust rank.
That's why a link from The New York Times is worth much more than one from some strange new website, even if its PageRank is higher!
Domain trust
Your whole website gives a boost to your content if your domain is trusted.
Well, different factors count here. Of course links from trusted sites to your domain, but it will even do good if you are in the same datacenter as important websites.
Topic specific links in.
If websites that can be resolved to a topic link to you and the query can be resolved to this topic as well, it's good.
Distribution of links in over time.
If you earned a lot of links in in a short period of time, this will do you good at this time and the near future afterwards. But not so good later in time.
If you slow and steady earn links it will do you good for content that is "timeless".
Links from restricted domains
A link from a .gov domain is worth a lot.
User click behaviour
What's the click rate of your search result?
Time spent on site
Google analytics tracking, etc. It's also tracked if the user clicks back or clicks another result after opening yours.
Collected user data
Votes, rating, etc., references in Gmail, etc.
Now I will introduce a third category, and one or two points from above would go into this category, but I haven't thought of that... The category is:
** How important/good is your website in general **
All your pages will be ranked up a bit depending on the quality of your websites
Factors include:
Good site architecture (easy to navigate, structured, sitemaps, etc.)
How established (long existing domains are worth more).
Hoster information (what other websites are hosted near you?)
Search frequency of your exact name.
Last, but not least, I want to say that a lot of these factors can be enriched by semantic technology and new ones can be introduced.
For example someone may search for Titanic and you have a website about icebergs ... that can be set into correlation which may be reflected.
Newly introduced semantic identifiers. For example OWL tags may have a huge impact in the future.
For example a blog about the movie Titanic could put a sign on this page that it's the same content as on the Wikipedia article about the same movie.
This kind of linking is currently under heavy development and establishment and nobody knows how it will be used.
Maybe duplicate content is filtered, and only the most important copy of the same content is displayed? Or maybe the other way round, so that you are presented with a lot of pages that match your query, even if they don't contain your keywords?
Google even applies factors in different relevance depending on the topic of your search query!
Tricky, but I'll take a stab:
An image (If applicable)
The first image on the page
the image with a name that includes the letters "logo"
the image that renders closest to the top-left (or top-right)
the image that appears most often on other pages of the site
an image smaller than some maximum dimensions
A < 255 paragraph from the best slice of text
contents of the title tag
contents of the meta content description tag
contents of the first h1 tag
contents of the first p tag
Keywords that would be used for our search engine, (stack overflow style)
substring of the domain name
substring of the url
substring of the title tag
proximity between the term and the most common word on the page and the top of the page
Meta data Keywords,Description, all images, change-log (for moderation and administration purposes)
ak! gag! Syntax Error.
I don't work at Google but around a year ago I read they had over 200 factors for ranking their search results. Of course the top ranking would be relevance, so your question is quite interesting in that sense.
What is relevance and how do you calculate it? There are several algorithms and I bet Google have their own, but ones I'm aware of are Pearson Correlation and Euclidean Distance.
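For anyone who has not met those two measures, here is a minimal sketch of both in plain Python. The inputs are assumed to be equal-length lists of numbers (for example word counts or ratings for two pages); the function names are mine.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation: +1 same trend, -1 opposite trend, 0 unrelated."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    if spread_a == 0 or spread_b == 0:
        return 0.0  # a constant vector has no trend to correlate with
    return covariance / math.sqrt(spread_a * spread_b)
```

Euclidean distance is sensitive to scale (a long page has bigger counts than a short one), whereas Pearson only looks at whether the values rise and fall together, which is often the more useful property when comparing documents of different lengths.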
A good book I'd suggest on this topic (not necessarily search engines) is Programming Collective Intelligence by Toby Segaran (O'Reilly). A few samples from the book show how to fetch data from third-party websites via APIs or screen-scraping, and finding similar entries, which is quite nice.
Anyways, back to Google. Other relevance techniques are of course full-text searching and you may want to get a good book on MySQL or Sphinx for that matter. Suggested by @Chaoley was TSEP which is also quite interesting.
But really, I know people from a Russian search engine called Yandex here, and everything they do is under NDA, so I guess you can get close, but you cannot get perfect, unless you work at Google ;)
Cheers.
Actually answering your question (and not just generally about search engines):
I believe going bit like Instapaper does would be the best option.
Logic behind instapaper (I didn't create it so I certainly don't know inner-workings, but it's pretty easy to predict how it works):
Find biggest bunch of text in text-like elements (relying on paragraph tags, while very elegant, won't work with those crappy sites that use div's instead of p's). Basically, you need to find good balance between block elements (divs, ps, etc.) and amount of text. Come up with some threshold: if X number of words stays undivided by markup, that text belongs to main body text. Then expand to siblings keeping the text / markup threshold of some sort.
Once you do the most difficult part (finding which text belongs to the actual article), it becomes pretty easy. You can find the first image around that text and use it as your thumbnail. This way you will avoid ads, because they will not be that close to the body text markup-wise.
Finally, coming up with keywords is the fun part. You can do tons of things: order words by frequency, remove noise (ands, ors and so on) and you have something nice. Mix that with "prominent short text element above detected body text area" (i.e. your article's heading), page title, meta and you have something pretty tasty.
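A rough sketch of that keyword step in Python, assuming the body text has already been extracted. The stop-word list is a tiny made-up sample, and the `boost` weight given to heading/title words is an arbitrary assumption to be tuned.

```python
import re
from collections import Counter

# Deliberately tiny noise-word list for illustration; use a fuller one in practice.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
}


def _words(text):
    """Lowercased words longer than two characters, minus noise words."""
    found = re.findall(r"[a-z']+", text.lower())
    return [w for w in found if len(w) > 2 and w not in STOP_WORDS]


def keywords(body_text, limit=5, boost_text="", boost=3):
    """Rank words by frequency, giving extra weight to heading/title words."""
    counts = Counter(_words(body_text))
    for word in _words(boost_text):
        counts[word] += boost
    return [word for word, _ in counts.most_common(limit)]
```

Pass the detected heading, page title and meta description as `boost_text` so that words appearing there outrank words that are merely frequent in the body.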
All these ideas, if implemented properly, will be very bullet-proof, because they do not rely on semantic markup — by making your code complex you ensure even very sloppy-coded websites will be detected properly.
Of course, it comes with downside of poor performance, but I guess it shouldn't be that poor.
Tip: for large-scale websites, to which people link very often, you can set HTML element that contains the body text (that I was describing on point #1) manually. This will ensure correctness and speed things up.
Hope this helps a bit.
There are lots of highly sophisticated algorithms for extracting the relevant content from a tag soup. If you're looking to build something usable your self, you could take a look at the source code for readability and port it over to php. I did something similar recently (Can't share the code, unfortunately).
The basic logic of readability is to find all block level tags and count the length of text in them, not counting children. Then each parent node is awarded a fraction (half) of the weight of each of its children. This is used to find the block level tag that has the largest amount of plain text. From here, the content is further cleaned up.
It's not bullet proof by any means, but it works well in the majority of cases.
Most search engines look for the title and meta description in the head of the document, then heading one and text content in the body. Image alt tags and link titles are also considered. Last I read Yahoo was using the meta keyword tag but most don't.
You might want to download the open source files from The Search Engine Project (TSEP) on Sourceforge https://sourceforge.net/projects/tsep/ and have a look at how they do it.
I'd just grab the first 'paragraph' of text. The way most people write stories/problems/whatever is that they first state the most important thing, and then elaborate. If you look at any random text, you can see that this holds most of the time.
For example, you do it yourself in your original question. If you take the first three sentences of your original question, you have a pretty good summary of what you are trying to do.
And, I just did it myself too: the gist of my comment is summarized in the first paragraph. The rest is just examples and elaborations. If you're not convinced, take a look at a few recent articles I semi-randomly picked from Google News. Ok, that last one was not semi-random, I admit ;)
Anyway, I think that this is a really simple approach that works most of the time. You can always look at meta-descriptions, titles and keywords, but if they aren't there, this might be an option.
Hope this helps.
I would consider these when building the code:
Check for synonyms and acronyms
Apply OCR to images so they can be searched as text (ABBYY FineReader and Recostar are nice; Tesseract is free and fine, though not as fine as FineReader :) )
Weight fonts as well (size, boldness, underline, color)
Weight content depending on its place on the page (e.g. content on the upper side of the page is more relevant)
Also:
An optional text asked from the webmaster to define the page
You can also check if you can find anything useful at Google search API: http://code.google.com/intl/tr/apis/ajaxsearch/
I'm facing the same problem right now, and after some tries I found something that works for creating a webpage snippet (must be fine-tuned):
take all the HTML
remove script and style tags inside the body WITH THEIR CONTENT (important)
remove unnecessary spaces, tabs, newlines.
now navigate through the DOM to catch div, p, article, td (others?) and, for each one
. take the HTML of the current element
. take a "text only" version of the element content
. assign to this element the score: text length * text length / HTML length
now sort all the scores, take the greatest.
This is a quick (and dirty) way to identify the longest texts with a relatively low proportion of markup, which is what real content usually looks like. In my tests this seems really good. Just add water ;)
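The steps above can be sketched roughly as follows. This is only an illustration in Python using the standard library's html.parser (the same logic ports to PHP's DOM extension). The class and function names are made up for the example, and the "HTML length" is approximated by summing the lengths of the tags and whitespace-collapsed text seen inside each element.

```python
from html.parser import HTMLParser

CANDIDATES = {"div", "p", "article", "td"}   # elements worth scoring
SKIPPED = {"script", "style"}                # dropped together with their content


class SnippetScorer(HTMLParser):
    """Scores candidate elements by text_length ** 2 / html_length."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_elements = []   # stack of [tag, html_length, text_parts]
        self.scored = []          # finished elements as (score, text)
        self.skip_depth = 0       # > 0 while inside <script> or <style>

    def _add_html(self, length):
        # Every currently open candidate contains this piece of markup.
        for element in self.open_elements:
            element[1] += length

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag in CANDIDATES:
            self.open_elements.append([tag, 0, []])
        self._add_html(len(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        self._add_html(len(tag) + 3)  # length of "</tag>"
        # Close the most recently opened candidate with this tag name, if any.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                _, html_length, parts = self.open_elements.pop(i)
                text = " ".join(parts)
                if text:
                    self.scored.append((len(text) ** 2 / html_length, text))
                break

    def handle_data(self, data):
        if self.skip_depth:
            return
        collapsed = " ".join(data.split())  # drop extra spaces, tabs, newlines
        if not collapsed:
            return
        for element in self.open_elements:
            element[1] += len(collapsed)
            element[2].append(collapsed)


def best_snippet(html):
    """Return the text of the highest-scoring element, or '' if none was found."""
    scorer = SnippetScorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.scored)[1] if scorer.scored else ""
```

Squaring the text length favours long passages, while dividing by the HTML length penalises link-heavy blocks such as navigation bars. Unclosed tags and other malformed markup are not handled here and would need extra care.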
In addition to this you can search for "og:" meta tags, title and description, h1 and a lot of other minor techniques.
Google for 'web crawlers, robots, spiders, and intelligent agents'. You might also try the terms separately to get individual results:
Web Crawler
User-Agents
Bots
Data/Screen Scraping
What I think you're looking for is Screen Scraping (with DOM), which Stack Overflow has a ton of Q&A on.
Google also uses a system called PageRank, where it examines how many links to a site there are. Let's say that you're looking for a C++ tutorial, and you search Google for one. You find one as the top result, and it's a great tutorial. Google knows this because it searched through its cache of the web and saw that everyone was linking to this tutorial, while raving about how good it was. Google decides that it's a good tutorial, and puts it as the top result.
It actually does this for everything it caches, giving each page a PageRank based on the links pointing to it, as described above.
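To make the idea concrete, here is a heavily simplified sketch of link-based ranking by repeated iteration, written in Python. It is not Google's actual implementation. The damping factor of 0.85, the iteration count, and the tiny link graph in the usage note are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank for each page from who links to whom.

    links maps a page name to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal ranks

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

For example, with `links = {"blog_a": ["tutorial"], "blog_b": ["tutorial"], "forum": ["tutorial", "blog_a"], "tutorial": []}`, the page that everyone links to ("tutorial") ends up with the highest rank.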
Hope this helps!
To answer one of your questions, I am reading the following book right now, and I recommend it: Google's PageRank and Beyond, by Amy Langville and Carl Meyer.
Mildly mathematical. Uses some linear algebra in a graph theoretic context, eigenanalysis, Markov models, etc. I enjoyed the parts that talk about iterative methods for solving linear equations. I had no idea Google employed these iterative methods.
Short book, just 200 pages. Contains "asides" that diverge from the main flow of the text, plus historical perspective. Also points to other recent ranking systems.
There are some good answers on here, but it sounds like they don't answer your question. Perhaps this one will.
What you're looking for is called Information Retrieval
It usually uses the Bag Of Words model
Say you have two documents:
DOCUMENT A
Seize the time, Meribor. Live now; make now always the most precious time. Now will never come again
and this one
DOCUMENT B
Worf, it was what it was glorious and wonderful and all that, but it doesn't mean anything
and you have a query, or something you want to find other relevant documents for
QUERY aka DOCUMENT C
precious wonderful life
Anyways, how do you calculate the most "relevant" of the two documents? Here's how:
tokenize each document (break into words, removing all non letters)
lowercase everything
remove stopwords (and, the, etc.)
consider stemming (removing the suffix, see Porter or Snowball stemming algorithms)
consider using n-grams
You can count the word frequency, to get the "keywords".
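A minimal Python sketch of these preprocessing steps, using DOCUMENT A from above. The stopword list is a tiny hand-picked assumption (real lists are much longer), and stemming is left out because it needs a proper Porter or Snowball implementation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption, not a standard list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "it",
             "was", "will", "most", "all", "that", "but", "what"}


def preprocess(text):
    """Lowercase, keep only runs of letters, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def keywords(text, top=3):
    """Return the most frequent remaining words with their counts."""
    return Counter(preprocess(text)).most_common(top)


doc_a = ("Seize the time, Meribor. Live now; make now always the most "
         "precious time. Now will never come again")
print(keywords(doc_a, top=2))   # [('now', 3), ('time', 2)]
```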
Then, you make one column for each word, and calculate the word's importance to the document, with respect to its importance in all the documents. This is called the TF-IDF metric.
Now you have this:
Doc  precious  worf  life  ...
A    0.5       0.0   0.2
B    0.0       0.9   0.0
C    0.7       0.0   0.9
Then, you calculate the similarity between the documents, using the Cosine Similarity measure. The document with the highest similarity to DOCUMENT C is the most relevant.
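Here is a rough Python sketch of TF-IDF weighting followed by cosine similarity, using the three example documents. The stopword list and the smoothed IDF formula are assumptions (several TF-IDF variants exist), and no stemming is done, so "Live" and "life" do not match here. The numbers it produces will not equal the made-up values in the table above.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption, not a standard list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "it",
             "was", "will", "all", "that", "but", "what"}


def tokenize(text):
    """Lowercase, keep only runs of letters, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Map each document name to a sparse {word: weight} TF-IDF vector."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()                      # in how many documents a word occurs
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors[name] = {
            # term frequency times a smoothed inverse document frequency
            word: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = {
    "A": "Seize the time, Meribor. Live now; make now always the most "
         "precious time. Now will never come again",
    "B": "Worf, it was what it was glorious and wonderful and all that, "
         "but it doesn't mean anything",
    "C": "precious wonderful life",
}
vectors = tfidf_vectors(docs)
for name in ("A", "B"):
    print(name, round(cosine(vectors[name], vectors["C"]), 3))
```

Whichever of A and B prints the larger value is the more relevant document for the query C. In a real system the documents would be indexed once and only the query vector computed per search.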
Now, you seem to want to find the most similar paragraphs, so just call each paragraph a document, or consider using Sliding Windows over the document instead.
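A sliding window can be as simple as the following Python sketch. The function name and the window size and step in the usage note are arbitrary assumptions. Each window would then be treated as its own small "document" for the TF-IDF and cosine steps.

```python
def sliding_windows(words, size, step):
    """Split a list of words into overlapping windows of `size` words.

    Consecutive windows start `step` words apart. Trailing words that do
    not fill a whole window are dropped in this simple version.
    """
    if len(words) <= size:
        return [words] if words else []
    return [words[i:i + size] for i in range(0, len(words) - size + 1, step)]
```

For example, `sliding_windows("a b c d e f".split(), size=4, step=2)` yields two windows: a b c d and c d e f.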
You can see my video here. It uses a graphical Java tool, but explains the concepts:
http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-4.html
here is a decent IR book:
http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Find duplicate content using MySQL and PHP

I am facing a problem on developing my web app, here is the description:
This webapp (still in alpha) is based on user generated content (usually short articles although their length can become quite large, about one quarter of screen), every user submits at least 10 of these articles, so the number should grow pretty fast. By nature, about 10% of the articles will be duplicated, so I need an algorithm to fetch them.
I have come up with the following steps:
On submission, compute the length of the text and store it in a separate table (article_id, length). The problem is that the articles are encoded with PHP's HTML entity functions (htmlspecialchars()/htmlentities()), and users post content with slight modifications (someone will miss a comma or an accent, or even skip some words)
Then retrieve all the entries from the database whose length is within new_post_length +/- 5% (should I use another threshold, keeping in mind the human factor in article submission?)
Fetch the first 3 keywords and compare them against the articles fetched in step 2
Having a final array with the most probable matches, compare the new entry against them using PHP's levenshtein() function
This process must be executed on article submission, not using cron. However I suspect it will create heavy loads on the server.
Could you provide any idea please?
Thank you!
Mike
Text similarity / plagiarism / duplicate detection is a big topic. There are many algorithms and solutions.
Levenshtein will not work in your case. You can only use it on small texts (due to its "complexity" it would kill your CPU).
Some projects use the "adaptive local alignment of keywords" (you will find info on that on Google).
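To see where the cost comes from, here is a plain Python sketch of the standard dynamic-programming edit distance (an illustration, not PHP's internal implementation). It fills one table cell per pair of characters, so comparing two 2,000-character articles takes about 2,000 * 2,000 = 4,000,000 cell updates, and that is repeated for every candidate article.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a                       # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))    # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or match for free
            ))
        previous = current
    return previous[-1]
```

The running time grows with len(a) * len(b), which is why it is reasonable for short strings such as titles but expensive for whole articles checked against many candidates on every submission.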
Also, you can check this (Check the 3 links in the answer, very instructive):
Cosine similarity vs Hamming distance
Hope this will help.
I'd like to point out that git, the version control system, has excellent algorithms for detecting duplicate or near-duplicate content. When you make a commit, it will show you the files modified (regardless of rename), and what percentage changed.
It's open source, and largely written in small, focused C programs. Perhaps there is something you could use.
You could design your app to reduce the load by not having to check text strings and keywords against all other posts in the same category. What if you had the users submit the third party content they are referencing as urls? See Tumblr implementation-- basically there is a free-form text field so each user can comment and create their own narrative portion of the post content, but then there are formatted fields also depending on the type of reference the user is adding (video, image, link, quote, etc.) An improvement on Tumblr would be letting the user add as many/few types of formatted content as they want in any given post.
Then you are only checking against known types like a url or embed video code. Combine that with rexem's suggestion to force users to classify by category or genre of some kind, and you'll have a much smaller scope to search for duplicates.
Also if you can give each user some way of posting to their own "stream" then it doesn't matter if many people duplicate the same content. Give people some way to vote up from the individual streams to a main "front page" level stream so the community can regulate when they see duplicate items. Instead of a vote up/down like Digg or Reddit, you could add a way for people to merge/append posts to related posts (letting them sort and manage the content as an activity on your app rather than making it an issue of behind the scenes processing).
