Finding the similarity between two pages - PHP

I am developing a project where, at a certain point, someone shares a news item, and to confirm it another user has to attach the same news from another source, so that the original item's score increases as trusted news.
What is a good way to compare two pages and find the probability that they are similar?
I code in PHP, so any solution compatible with this language is highly appreciated.
I have read that neural networks are heavy and costly in performance, so I need some kind of lightweight solution for this.

You can use similar_text():
http://php.net/manual/en/function.similar-text.php
Or use any comparison software (there are a lot of them out there), run it from the terminal, and process the result in PHP with exec():
http://php.net/manual/en/function.exec.php
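As a rough illustration of the similar_text() suggestion, here is a minimal sketch that strips the markup from two pages and compares the remaining text. The helper names pageToText() and pageSimilarity() are made up for this example, and the clean-up step is an assumption: strip_tags() leaves the contents of script and style elements in place, so real pages would probably need more careful extraction of the article body first.

```php
<?php
// Hypothetical helper: reduce an HTML page to lowercase plain text.
function pageToText(string $html): string
{
    $text = strip_tags($html);                               // drop tags, keep their text
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');  // &amp; -> &
    $text = preg_replace('/\s+/', ' ', $text);               // collapse whitespace
    return strtolower(trim($text));
}

// Hypothetical helper: similarity of two pages as a percentage (0 to 100).
function pageSimilarity(string $htmlA, string $htmlB): float
{
    // similar_text() writes the percentage into its third argument.
    similar_text(pageToText($htmlA), pageToText($htmlB), $percent);
    return (float) $percent;
}
```

similar_text() can get slow on long strings, so comparing only the headline and the first paragraphs may be more practical than comparing whole pages.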

Related

Conceptual help on implementing a rating system that lets items decrease in time

I am running a website that lets users contribute by letting them upload files on specific subjects. Right now my rating system is the worst possible (number of downloads of the file). Not only is this highly inaccurate in terms of quality control, but it also prevents new content from being listed on top anytime soon.
This is why I want to change my rating system so that users can up-/down-vote each item. However, this should not be the only factor used to display the popularity of an item. I would like older content to decrease in rating over time. Maybe I could even factor in the number of downloads, but only to a very low percentage.
So, my questions are:
Which formula would you suggest under the assumption that there is 1 new upload every day?
How would you implement this in a php/mysql environment?
My problem is that right now I am simply sorting my stuff by the downloads column in the database. How can I sort a query by a factor that is calculated externally (in PHP), or do I have to update a column in my table with the rating factor each time someone opens the site in their browser?
(Please excuse any mistakes, I am not a native speaker)
I am not really fluent in PHP or MySQL, but as for the rating system, if you want to damp things over time, have you considered a decaying exponential? Off the top of my head, I would probably do something like
$rating = $downloads * exp(-1 * $elapsedTime);
You can read up on it here: http://en.wikipedia.org/wiki/Exponential_decay. Maybe build in a delay of a week or a month before you start damping the results, or people are going to get their uploads downrated immediately.
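A minimal sketch of that idea, with the delay built in. The function name decayedRating() and the half-life and grace-period parameters are assumptions added for illustration; the formula above has no rate constant, so with the elapsed time in seconds or days it would drop to nearly zero almost immediately.

```php
<?php
// Hypothetical helper: exponentially decay a score after a grace period.
// $halfLifeDays and $graceDays are illustrative defaults, not tuned values.
function decayedRating(
    float $score,
    float $ageDays,
    float $halfLifeDays = 30.0,
    float $graceDays = 7.0
): float {
    // No decay at all during the grace period.
    $effectiveAge = max(0.0, $ageDays - $graceDays);
    // The score halves every $halfLifeDays once decay has started.
    return $score * exp(-M_LN2 * $effectiveAge / $halfLifeDays);
}
```

With these defaults an item keeps its full score for the first week and is worth half as much 30 days after that.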
First of all, in any case, you will need to add at least one column to your table. The best thing would be to have a separate table with id, upvotes, downvotes, datetime.
If you want to take into consideration the freshness of posts (or uploads or comments or...), I think the best current method is the Wilson score with a gravity parameter.
For a good start with a Wilson score implementation in PHP, check this.
Then you will need to read this to understand the pros and cons of other solutions and use SQL directly.
Remark: gravity is not explicitly detailed in the SQL code, but thanks to the PHP one you should be able to make it work.
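For orientation, here is a minimal sketch of the lower bound of the Wilson score interval for up/down votes. It is not the implementation linked above, the function name wilsonLowerBound() is made up for this example, and it leaves out the gravity (time) factor entirely.

```php
<?php
// Hypothetical helper: lower bound of the Wilson score interval.
// $z = 1.96 corresponds to roughly 95% confidence.
function wilsonLowerBound(int $upvotes, int $downvotes, float $z = 1.96): float
{
    $n = $upvotes + $downvotes;
    if ($n === 0) {
        return 0.0; // no votes yet, nothing to estimate
    }
    $p = $upvotes / $n;   // observed fraction of upvotes
    $z2 = $z * $z;
    $centre = $p + $z2 / (2 * $n);
    $margin = $z * sqrt(($p * (1 - $p) + $z2 / (4 * $n)) / $n);
    // Clamp at zero to absorb tiny negative rounding errors.
    return max(0.0, ($centre - $margin) / (1 + $z2 / $n));
}
```

An item with a single upvote scores about 0.21, while one with 100 upvotes and no downvotes scores about 0.96, so a handful of votes cannot push a new item straight to the top.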
Note that if you would like something simpler but still not lame, you could look into the Bayesian average. IMDb uses Bayesian estimation to calculate its Top 250.
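A minimal sketch of a Bayesian average for up/down votes, assuming a prior mean and prior weight that you would pick yourself (the function name bayesianAverage() and both defaults are illustrative, and this is not IMDb's exact formula).

```php
<?php
// Hypothetical helper: Bayesian average of the upvote fraction.
// $priorMean is the score assumed for an item with no votes;
// $priorWeight (> 0) is how many "virtual votes" that assumption is worth.
function bayesianAverage(
    int $upvotes,
    int $downvotes,
    float $priorMean = 0.5,
    float $priorWeight = 10.0
): float {
    $n = $upvotes + $downvotes;
    return ($priorWeight * $priorMean + $upvotes) / ($priorWeight + $n);
}
```

An item with no votes sits at the prior mean, and it moves towards its real upvote fraction as votes accumulate.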
Implementing your own statistical model will only result in drawbacks that you had not imagined at first (scores too far from the mean, downvotes weighing more than upvotes, decay that is too quick, etc.).
Finally, you are talking about rating uploads directly, not the user who uploads the files. If you would like to do the same with the user, the simplest approach would be to use a Bayesian estimate with the results from your upload ratings.
You have a lot to read, just on Stack Overflow, to exhaust the subject.
Your journey starts here...

Using Neural Networks For Time Series Predictions In PHP

I'm currently building a website which will aggregate news articles and then categorise them based on their content. However, I would like to analyse the times at which the articles are published so that I can determine whether there is some sort of trend occurring, thus allowing me to predict the time when the next article is likely to be published and eliminating the need for frequent / unnecessary crawling attempts.
I've had a look on the internet and it seems that neural networks can be used for analysing time series; however, I haven't found any examples / code snippets that are beneficial or easy for a beginner to understand and then adapt.
These articles also seem to suggest that the inputs to a neural network should be either 0 or 1. How, then, would you go about creating a neural network in PHP that takes the following inputs, i.e. Unix timestamps, and is capable of outputting a single value?
1332193520
1342194916
1342196716
1342197376
1352197856
1362198756
Any help would be greatly appreciated.
It would be hard to train a NN in PHP. This should be done offline using some fast non-scripting language. Also, the implementation, training and especially tuning of a NN's parameters are not trivial tasks.
If I were you I would go for linear SVMs, which outperform other methods for text classification and are quite simple to deploy (and theoretically elegant; by the way, NNs are old-fashioned). There are excellent implementations of SVMs, such as SVM Light and LIBSVM, written in multiple languages.
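On the question of feeding Unix timestamps to a model that expects inputs between 0 and 1: one possible preprocessing step, sketched here as an assumption rather than anything the answer above prescribes, is to work with the gaps between consecutive publications and min-max scale them. The function name scaledIntervals() is made up for this example.

```php
<?php
// Hypothetical helper: turn Unix timestamps into publication gaps
// scaled to the range 0..1 (min-max scaling).
function scaledIntervals(array $timestamps): array
{
    sort($timestamps);
    $intervals = [];
    for ($i = 1; $i < count($timestamps); $i++) {
        $intervals[] = $timestamps[$i] - $timestamps[$i - 1];
    }
    if ($intervals === []) {
        return []; // fewer than two timestamps, no gaps to scale
    }
    $min = min($intervals);
    $range = max($intervals) - $min;
    return array_map(function ($gap) use ($min, $range) {
        // If all gaps are equal there is nothing to scale.
        return $range > 0 ? (float) (($gap - $min) / $range) : 0.0;
    }, $intervals);
}
```

For the six timestamps in the question this yields five values, where the shortest gap (660 seconds) maps to 0 and the longest maps to 1. The same scaled values can be fed to an SVM or any other model.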

Hypothetical web dictionary architecture

Let's say I am building a simple dictionary where users type a word and see a definition.
In an oversimplification, are there any problems with setting up my dictionary as a MySQL table, and each user request for a word will call a PHP script to find the word, and display its definition?
What's the optimal way to build this to minimize user lag time/not overheat the server? How does dictionary.com do it? My resources are limited, so I can't afford a dedicated server.
As this question is tagged as architecture, I will try to provide a basic architecture overview for this case.
The problem statement consists of the following points.
Online application - A single service/application will serve multiple users.
Text search - Most of the time, queries are not complete words that can be found in the database.
Frequent database queries - As the number of users grows, this might become a problem.
So, you might think of the following solutions.
Search for text-searching tools/libraries; you will find lots of them, and they will give you relevant search results. Or you can look at how WordWeb does it.
To avoid frequent database queries, you can cache the last 1000 results, or some configurable number of results, in a file, as Lucene Search does.
DISCLAIMER
The above architecture holds up if there are multiple simultaneous users, and only if it is even needed. Otherwise it might be more effort than required.
The best way to develop an architecture is to make the system adaptable to change. So start with the basic work and keep adapting to changes.
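The caching idea above can be sketched as a small least-recently-used cache placed in front of the database lookup. The class name LruCache and the loader callback are made up for this example. Note that this only shows the eviction logic: in a typical PHP setup memory is not shared between requests, so a real deployment would need to keep the cache in a shared store such as a file, APCu or Memcached.

```php
<?php
// Hypothetical in-memory LRU cache; $loader stands in for the MySQL lookup.
final class LruCache
{
    private $capacity;
    private $items = []; // PHP arrays keep insertion order: oldest entry first

    public function __construct(int $capacity = 1000)
    {
        $this->capacity = $capacity;
    }

    public function get(string $key, callable $loader)
    {
        if (array_key_exists($key, $this->items)) {
            // Cache hit: move the entry to the most-recently-used end.
            $value = $this->items[$key];
            unset($this->items[$key]);
            $this->items[$key] = $value;
            return $value;
        }
        // Cache miss: ask the loader (e.g. a database query) and remember it.
        $value = $loader($key);
        $this->items[$key] = $value;
        if (count($this->items) > $this->capacity) {
            // Evict the least-recently-used entry (the first one).
            reset($this->items);
            unset($this->items[key($this->items)]);
        }
        return $value;
    }
}
```

Usage would look like $cache->get($word, $lookupDefinition), where $lookupDefinition is whatever function runs the actual query.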

How is Gmail search so fast?

What is the most efficient way to search through so many characters? What do you think?
Let's say the website is built in PHP and MySQL.
What should I learn to be able to build this as efficiently as possible? Are there any algorithms I should learn?
Text indexing algorithm
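To give a feel for what text indexing means, here is a minimal sketch of an inverted index: a map from each word to the documents that contain it, so that a search only has to intersect short lists instead of scanning every character. The function names are made up for this example, and this is a textbook illustration, not a description of how Gmail works.

```php
<?php
// Hypothetical helper: split text into lowercase words.
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Build word => list of document ids, from an array of id => text.
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $docId => $text) {
        foreach (array_unique(tokenize($text)) as $word) {
            $index[$word][] = $docId;
        }
    }
    return $index;
}

// Return the ids of documents that contain every word of the query.
function searchIndex(array $index, string $query): array
{
    $result = null;
    foreach (tokenize($query) as $word) {
        $postings = $index[$word] ?? [];
        $result = $result === null
            ? $postings
            : array_values(array_intersect($result, $postings));
    }
    return $result ?? [];
}
```

The index is built once (and updated when documents change), so each query costs a few array lookups rather than a full scan.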
Google uses a custom-made database solution called BigTable, http://en.wikipedia.org/wiki/Big_table, which runs linked across hundreds of servers all over the world. So they're fast because they wrote the software specifically to be fast, and set up the hardware in such a way that they could squeeze the most out of it.
You can get to a decent setup with PHP and MySQL, but once you start dealing with very large data sets, MySQL, and any other generic database, will start to buckle under the stress. If you want to learn more about this, a good place to start is to do a search for concurrency in database design (briefly explained in http://en.wikipedia.org/wiki/Concurrency_control amongst others), which is a topic way too large to cover in a Stack Overflow reply =)
Google goes beyond simply optimizing the databases and the code. They also do a lot of distributed programming. While the exact mechanisms they use to power systems such as Gmail are guarded secrets, it is known that they have entire farms of computers networked, each working on parts of the index at any given time, rather than just one server.
For MySQL, look at the Full-Text Search Functions.
This is assuming your content is stored in the database (such as in a CMS).

Music Recognition and Signal Processing

I want to build something similar to Tunatic or Midomi (try them out if you're not sure what they do) and I'm wondering what algorithms I'd have to use. The idea I have about the workings of such applications is something like this:
1. have a big database with several songs
2. for each song in 1. reduce quality / bit-rate (to 64kbps for instance) and calculate the sound "hash"
3. have the sound / excerpt of the music you want to identify
4. for the song in 3. reduce quality / bit-rate (again to 64kbps) and calculate sound "hash"
5. if 4. sound hash is in any of the 2. sound hashes return the matched music
I thought of reducing the quality / bit-rate because of environmental noise and encoding differences.
Am I on the right track here? Can anyone provide me with any specific documentation or examples? Midomi seems to even recognize hums, that's pretty awesomely impressive! How do they do that?
Do sound hashes exist or is it something I just made up? If they do, how can I calculate them? And more importantly, how can I check if child-hash is in father-hash?
How would I go about building a similar system with Python (maybe a built-in module) or PHP?
Some examples (preferably in Python or PHP) will be greatly appreciated. Thanks in advance!
I do research in music information retrieval (MIR). The seminal paper on music fingerprinting is the one by Haitsma and Kalker around 2002-03. Google should get you it.
I read an early (really early; before 2000) white paper about Shazam's method. At that point, they just basically detected spectrotemporal peaks, and then hashed the peaks. I'm sure that procedure has evolved.
Both of these methods address music similarity at the signal level, i.e., it is robust to environment distortions. I don't think it works well for query-by-humming (QBH). However, that is a different (yet related) problem with different (yet related) solutions, so you can find solutions in the literature. (Too many to name here.)
The ISMIR proceedings are freely available online. You can find valuable stuff there: http://www.ismir.net/
I agree with using an existing library like Marsyas. It depends on what you want. NumPy/SciPy is indispensable here, I think. Simple stuff can be written in Python on your own. Heck, if you need stuff like STFT, MFCC, I can email you code.
I worked on the periphery of a cool framework that implements several Music Information Retrieval techniques. I'm hardly an expert (edit: actually I'm nowhere close to an expert, just to clarify), but I can tell that the Fast Fourier Transform is used all over the place with this stuff. Fourier analysis is wacky but its application is pretty straightforward. Basically you can get a lot of information about audio when you analyze it in the frequency domain rather than the time domain. This is what Fourier analysis gives you.
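To make the time-domain versus frequency-domain point concrete, here is a minimal sketch of a naive discrete Fourier transform in plain PHP. It is the slow O(n²) textbook form, not an FFT, and the function name dftMagnitudes() is made up for this example; for real audio you would use an existing FFT library.

```php
<?php
// Hypothetical helper: magnitude of each frequency bin 0..n/2
// for a list of real-valued samples (naive DFT, not an FFT).
function dftMagnitudes(array $samples): array
{
    $samples = array_values($samples);
    $n = count($samples);
    $magnitudes = [];
    for ($k = 0; $k <= intdiv($n, 2); $k++) {
        $re = 0.0;
        $im = 0.0;
        foreach ($samples as $t => $x) {
            $angle = 2 * M_PI * $k * $t / $n;
            $re += $x * cos($angle);
            $im -= $x * sin($angle);
        }
        $magnitudes[$k] = sqrt($re * $re + $im * $im);
    }
    return $magnitudes;
}
```

Fed with a pure tone, the result has one dominant bin at the tone's frequency, which is the kind of information that is hard to read off the raw samples.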
That may be a bit off topic from what you want to do. In any case, there are some cool tools in the project to play with, as well as viewing the sourcecode for the core library itself: http://marsyas.sness.net
I recently ported my audio landmark-based fingerprinting system to Python:
https://github.com/dpwe/audfprint
It can recognize small (5-10 sec) excerpts from a reference database of 10s of thousands of tracks, and is quite robust to noise and channel distortions. It uses combinations of local spectral peaks, similar to the Shazam system.
This can only match the exact same track, since it relies on fine details of frequencies and time differences - it wouldn't even match different takes, certainly not cover versions or hums. As far as I understand, Midomi/SoundHound works by matching hums to each other (e.g. via dynamic time warping), then has a set of human-curated links between sets of hums and the intended music track.
Matching a hum directly to a music track ("Query by humming") is an ongoing research problem in music information retrieval, but is still pretty difficult. You can see abstracts for a set of systems evaluated last year at the MIREX 2013 QBSH Results.
MFCCs extracted from the music are very useful in finding the timbral similarity between songs. This is most often used to find similar songs. As pointed out by darren, Marsyas is a tool that can be used to extract MFCCs and find similar songs by converting the MFCCs into a single vector representation.
Other than MFCCs, rhythm is also used to find song similarity. There are a few papers presented at MIREX 2009 that will give you a good overview of the different algorithms and features that are most helpful in detecting music similarity.
The MusicBrainz project maintains such a database. You can make queries to it based on a fingerprint.
The project has existed for a while and has used different fingerprints in the past. See here for a list.
The latest fingerprint they are using is AcoustId. There is the Chromaprint library (also with Python bindings) where you can create such fingerprints. You must feed it raw PCM data.
I have recently written a library in Python which does the decoding (using FFmpeg) and provides such functions as to generate the AcoustId fingerprint (using Chromaprint) and other things (also to play the stream via PortAudio). See here.
It's been a while since I last did signal processing, but rather than downsampling you should look at frequency-domain representations (e.g. FFT or DCT). Then you could make a hash of sorts and search for the database song that contains that sequence.
The tricky part is making this search fast (maybe some papers on gene search might be of interest). I suspect that iTunes also does some detection of instruments to narrow down the search.
I did read a paper about the method by which a certain music information retrieval service (no names mentioned) does it: by calculating the Short Time Fourier transform over the sample of audio. The algorithm then picks out 'peaks' in the frequency domain, i.e. time positions and frequencies that have particularly high amplitude, and uses the time and frequency of these peaks to generate a hash. It turns out the hash has surprisingly few collisions between different samples, and also stands up against approximately 50% data loss of the peak information.
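The peak-hashing step described above can be sketched as follows, assuming the peaks have already been extracted as [time frame, frequency bin] pairs. This is a simplified illustration of the general landmark idea, not the algorithm of any particular service; the function name and the fan-out and maximum-gap parameters are made up for this example.

```php
<?php
// Hypothetical helper: pair each peak with a few later peaks and hash
// (freq1, freq2, time difference). Returns a list of [hash, anchorTime].
function landmarkHashes(array $peaks, int $fanOut = 3, int $maxGap = 64): array
{
    // Sort peaks by time frame.
    usort($peaks, function ($a, $b) {
        return $a[0] <=> $b[0];
    });
    $hashes = [];
    $n = count($peaks);
    for ($i = 0; $i < $n; $i++) {
        $paired = 0;
        for ($j = $i + 1; $j < $n && $paired < $fanOut; $j++) {
            $gap = $peaks[$j][0] - $peaks[$i][0];
            if ($gap <= 0) {
                continue; // same frame, skip
            }
            if ($gap > $maxGap) {
                break;    // too far apart in time
            }
            $hash = sprintf('%d|%d|%d', $peaks[$i][1], $peaks[$j][1], $gap);
            $hashes[] = [$hash, $peaks[$i][0]];
            $paired++;
        }
    }
    return $hashes;
}
```

Because the hash only contains frequencies and a time difference, an excerpt taken from the middle of a song produces the same hash strings as the full track. Matching then comes down to looking the hashes up in a database and checking that the hits agree on a single time offset.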
Currently I'm developing a music search engine using ActionScript 3. The idea is analyzing the chords first and marking the frames (it's limited to mp3 files at the moment) where the frequency changes drastically (melody changes and ignoring noises). After that I do the same thing to the input sound, and match the results with the inverted files. The matching one determines the matching song.
For Axel's method, I think you shouldn't worry about whether the query is singing or just humming, since you aren't implementing a speech recognition program. But I'm curious about your method which uses hash functions. Could you explain that to me?
The query-by-humming feature is more complicated than the audio fingerprinting solution. The difficulty comes from:
how to efficiently collect the melody database in a real-world application? Many demo systems use MIDI to build it up, but the cost of a MIDI solution is not affordable for a company.
how to deal with time variance? For example, a user may hum fast or slow. Use DTW? Yes, DTW is a very good solution for dealing with time series with time variance, BUT it costs too much CPU load.
how to build a time series index?
Here is a demo open source query-by-humming project, https://github.com/EmilioMolina/QueryBySingingHumming, which could be a reference.
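For reference, the DTW mentioned above can be sketched in a few lines. This is the plain textbook dynamic time warping distance between two numeric sequences (for example pitch contours), with the function name dtwDistance() made up for this example. The nested loop over both sequences is where the CPU cost comes from.

```php
<?php
// Hypothetical helper: classic O(n*m) dynamic time warping distance
// between two lists of numbers, keeping only two rows of the table.
function dtwDistance(array $a, array $b): float
{
    $a = array_values($a);
    $b = array_values($b);
    $n = count($a);
    $m = count($b);
    if ($n === 0 || $m === 0) {
        return INF; // nothing to align
    }
    $prev = array_fill(0, $m + 1, INF);
    $prev[0] = 0.0;
    for ($i = 1; $i <= $n; $i++) {
        $curr = array_fill(0, $m + 1, INF);
        for ($j = 1; $j <= $m; $j++) {
            $cost = abs($a[$i - 1] - $b[$j - 1]);
            // Cheapest way to reach this cell: stretch a, stretch b, or step both.
            $curr[$j] = $cost + min($prev[$j], $curr[$j - 1], $prev[$j - 1]);
        }
        $prev = $curr;
    }
    return (float) $prev[$m];
}
```

A melody hummed at half speed, such as [1, 1, 2, 2, 3, 3] against [1, 2, 3], gets a distance of 0, which is exactly the time-variance tolerance the answer is after.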
