I'm currently building a website which will aggregate news articles and then categorise them based on their content. I would also like to analyse the times at which the articles are published so that I can determine whether there is some sort of trend occurring. That would let me predict when the next article is likely to be published and eliminate the need for frequent, unnecessary crawling attempts.
I've had a look on the internet and it seems that neural networks can be used for analysing time series; however, I haven't found any examples or code snippets that are beneficial or easy for a beginner to understand and then adapt.
These articles also seem to suggest that the inputs to a neural network should be either 0 or 1. How, then, would you go about creating a neural network in PHP that takes inputs like the following Unix timestamps and is capable of outputting a single value?
1332193520
1342194916
1342196716
1342197376
1352197856
1362198756
Any help would be greatly appreciated.
It would be hard to train a NN in PHP. This should be done offline using a fast, compiled language. Also, implementing, training and especially tuning the parameters of a NN is not a trivial task.
If I were you I would go for linear SVMs, which outperform other methods for text classification and are quite simple to deploy (and theoretically elegant; by the way, NNs are old-fashioned). There are excellent implementations of SVMs, such as SVM Light and LIBSVM, available for multiple languages.
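For the timestamp part of the question, a much simpler baseline than either a neural network or an SVM is to look at the gaps between consecutive publication times. The sketch below is an assumption rather than anything the answer above prescribes: it takes the six sample timestamps from the question, computes the gaps, and guesses the next publication time as the last timestamp plus the median gap. `predict_next` is a hypothetical helper name, and the example is in Python rather than PHP.

```python
from statistics import median

def predict_next(timestamps):
    """Guess the next publication time as last timestamp + median gap."""
    ts = sorted(timestamps)
    if len(ts) < 2:
        raise ValueError("need at least two timestamps")
    # Gaps (in seconds) between consecutive publications.
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return ts[-1] + median(gaps)

# Sample Unix timestamps from the question.
samples = [1332193520, 1342194916, 1342196716,
           1342197376, 1352197856, 1362198756]
next_crawl = predict_next(samples)
```

The median is used instead of the mean so that a short burst of articles (the 1800 s and 660 s gaps in the sample) does not dominate the estimate. With real data you would probably want many more observations per source before trusting the guess.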
I work for a small company that wants to expand an existing system, but doing so raises some issues.
The system itself is used to store images and videos.
Always being available
So we talked to our host and they recommended us to use Ceph and Cassandra. Now I did some research on both of them and I really like the idea of Ceph - but Cassandra.. well, it would take a while for the existing system to be adapted to it.
The reason they recommended Cassandra was so our database would always be available. Now - the DB won't increase massively, it would only be used to keep some user-info, image-tags and other small meta-data.
Another issue is that many queries use "like" in order to find the tags. CQL does not support this.
Now we don't have any developers who have knowledge of Cassandra, so it might take some time to get used to it.
My question
Is there an alternative to Cassandra, preferably a relational database (so not NoSQL), which is still highly available (e.g. when one server goes down, another takes over)?
In case there isn't - how long would it take to get used to Cassandra's query language for relatively inexperienced developers (~ 1 year of experience), including the knowhow of how to adapt a system to it.
Just in case I didn't do my research properly, is Cassandra even the system we are looking for, the DB is pretty much only used for storage and some small functions.
In case you recommend a NoSQL language, what other options do I have of searching and finding data (as we now do by using " ..where x like 'something'")
Another issue is that many queries use "like" in order to find the tags. CQL does not support this.
It does, since Cassandra 3.5, via a SASI secondary index.
Now we don't have any developers who have knowledge of Cassandra, so it might take some time to get used to it.
That's not a problem: there are hours of free online videos and training at DataStax Academy.
In case there isn't - how long would it take to get used to Cassandra's query language for relatively inexperienced developers (~ 1 year of experience), including the knowhow of how to adapt a system to it.
Just watch the data modeling videos; you should quickly grasp the fundamental ideas behind the Cassandra data model.
Just in case I didn't do my research properly, is Cassandra even the system we are looking for, the DB is pretty much only used for storage and some small functions.
If you're looking for extreme high availability (0 downtime) Cassandra is for you.
If you can handle a downtime of some minutes, there are also other systems that can be a good fit.
I'm developing a project where, at a certain point, someone shares a news item, and to confirm it another user has to attach the same news from a different source, which increases the original item's score as trusted news.
What is a good way to compare two pages and estimate how similar they are?
I code in PHP, so any solution compatible with this language is highly appreciated.
I read that neural networks are heavy and costly in performance, so I need some kind of lightweight solution for this.
You can use similar_text():
http://php.net/manual/en/function.similar-text.php
Or use any comparison software (there are a lot of them out there), run it from the terminal, and process its output in PHP with exec():
http://php.net/manual/en/function.exec.php
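As a lightweight alternative for whole articles, one common approach is to compare sets of overlapping word sequences ("shingles") with the Jaccard coefficient. This is a swapped-in technique, not what similar_text() does, and the sketch is in Python rather than PHP. The function names and the shingle size `k=3` are assumptions chosen for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_similarity(a, b, k=3):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 suggests the two pages carry largely the same wording; where to put the threshold for "same news" would have to be tuned on real articles.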
I've been working on a new site of mine for a couple of days now which will be retrieving almost all of its most used content from a MySQL database. Seeing as the database and website are still under development, the tables are really small at the moment and speed is of no concern yet.
But you know what they say, a little bit of hard work now saves you a headache later on.
Now I'm only 17, the only database I've ever been taught was through Microsoft Access, and we were practically given the database completed - we learned up to 3NF, but that was about it.
I remember reading once when I was looking to pull data (randomly) out of a database how large databases were taking several seconds/minutes to complete a single query, so this just got me thinking. In a fraction of a second I can submit a search to google, google processes the query and returns the result, and then my browser renders it - all done in the blink of an eye. And google has billions of records to search through. And they're also doing this for millions of users simultaneously.
I'm thinking, how do they do it? I know that they have huge data centers, but still.
I realize that it probably comes down to the design of the database, how it's been optimized, and obviously the configuration. And I guess that's my question really. Could someone please tell me how to design high performance databases for millions/billions of rows (yes, I'm being optimistic), and possibly point me towards some good reading material to help me learn further?
Also, all my queries are done via PHP, if that's at all relevant to any answers.
The blog http://highscalability.com/ has some good articles and pointers to how companies handle large problems.
Specifically related to MySQL, you can Google for craigslist.org's use of MySQL.
http://www.slideshare.net/jzawodn/mysql-and-search-at-craigslist
First the good news... MySQL scales well (depending on the hardware) to at least hundreds of millions of rows.
Once you get to a certain point, a single database server will have trouble managing the load. That's when you get into the realm of partitioning or sharding... spreading the load across multiple database servers using any one of a number of different schemes (e.g. putting unrelated tables on different servers, spreading a single table across multiple servers e.g. by using the ID or date range as a partitioning key).
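To make the two partitioning schemes mentioned above a little more concrete, here is a minimal Python sketch of how an application might decide which server holds a row. The function names, the shard count, and the date boundaries are all hypothetical; real deployments usually add a lookup layer so that shards can be rebalanced.

```python
from datetime import date

def shard_by_id(record_id, shard_count):
    """Spread rows evenly by taking the ID modulo the number of shards."""
    return record_id % shard_count

def shard_by_date(created, boundaries):
    """Pick a shard by date range.

    boundaries is a sorted list of dates; shard i holds rows created
    before boundaries[i], and the last shard holds everything newer.
    """
    for index, boundary in enumerate(boundaries):
        if created < boundary:
            return index
    return len(boundaries)
```

For example, with boundaries of 2011-01-01 and 2012-01-01, a row created in June 2011 lands on shard 1 and anything from 2012 onwards lands on shard 2.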
SQL does shard, but is not fundamentally designed to shard well. There's a whole category of storage alternatives collectively referred to as NoSQL that are designed to solve that very problem (MongoDB, Cassandra, HBase are a few).
When you use SQL at very large scale, you run into any number of issues such as making data model changes across a DB server farm, trouble keeping up with data backups, etc. That's a very complex topic, and people that solve it well are rare. For a glimpse at the issues, have a look at http://gigaom.com/cloud/facebook-trapped-in-mysql-fate-worse-than-death/
When selecting a database platform for a specific project, benchmark the solution early and often to understand whether or not it will meet the performance requirements that you envision. Having a framework to do that will help you learn about scalability, and will help you decide whether to invest effort in improving the data storage part of your solution, and will help you know where best to invest your time.
No one can tell you how to design databases. It comes after much reading and many hours working on them, and good design is the product of many years of practice. As you've only seen Access, you have little knowledge of databases yet. Search through Amazon.com and you'll find tons of titles; for someone who is just starting, any one of them will do.
I mean no disrespect. I've been there, and I also tutor people learning programming and database design. I do know that there's no silver bullet or shortcut for the work you have ahead.
If you intend to work with high-performance databases, you should keep something in mind: their design is per application. A good design depends on learning more and more about how the app's users interact with the system, the usage patterns, etc. The things you'll learn from books will give you options; which ones you use will depend heavily on the scenario.
Good luck!
It doesn't all come down to the design of the database, though that is indeed a big part of it. The guys who made Google are geniuses, and if I'm not completely wrong about Google you won't be able to find out exactly how they do what they do. Also, I know that years back they had more than 10,000 computers processing queries, and today they probably have many more. I also suspect they cache most of the recent/popular keywords. And all the websites have been indexed and analyzed using an unknown algorithm, which makes sure the computers don't have to look through all the words on every page.
In fact, Google crawls the entire internet around every 14 days, so when you do a search you do not search the entire internet. Your search gets broken down into keywords and then these keywords are used to narrow the number of relevant pages - and I'm pretty sure all pages have already been analyzed for important and/or relevant keywords before you even thought of visiting google.com.
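The idea of analysing pages for keywords ahead of time is usually called an inverted index. The Python sketch below is a toy illustration of that general technique under simple assumptions (whitespace-style tokenising, all query words must match); it says nothing about how Google actually implements it, and the function names are made up.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the IDs of pages that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        # Each extra keyword narrows the candidate set further.
        result &= index.get(word, set())
    return result
```

The expensive work (reading every page) happens once in `build_index`; a search then only touches the entries for the words in the query.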
Have a look at this question.
Have a look into Sphinx server.
http://sphinxsearch.com/
Craigslist uses that for their search engine. Basically, you give it a source and it indexes whatever you want (mysql database/table, text files, etc.). If it works for craigslist, it should work for you.
Let's say I am building a simple dictionary where users type a word and see a definition.
In an oversimplification, are there any problems with setting up my dictionary as a MySQL table, and each user request for a word will call a PHP script to find the word, and display its definition?
What's the optimal way to build this to minimize user lag time/not overheat the server? How does dictionary.com do it? My resources are limited, so I can't afford a dedicated server
As this question is tagged architecture, I'll try to provide a basic architecture overview.
The problem statement consists of the following points:
Online application: a single service/application will serve multiple users.
Text search: most of the time, queries are not complete words that can be found in the database.
Frequent database queries: as the number of users grows, this might become a problem.
So, you might consider the following solutions:
Search for text-searching tools/libraries; you will find lots of them, and they help return relevant search results. Or you can look at how WordWeb does it.
To avoid frequent database queries, you can cache the last 1000 results (or some configurable number of results) in a file, as Lucene Search does.
DISCLAIMER
The above architecture holds up well if there are multiple simultaneous users, and only if it is actually needed. Otherwise it might be more effort than required.
The best way to develop an architecture is to make the system adaptable to change. So start with the basics and keep adapting as things change.
I want to build something similar to Tunatic or Midomi (try them out if you're not sure what they do) and I'm wondering what algorithms I'd have to use; The idea I have about the workings of such applications is something like this:
1. have a big database with several songs
2. for each song in 1. reduce quality / bit-rate (to 64kbps for instance) and calculate the sound "hash"
3. have the sound / excerpt of the music you want to identify
4. for the song in 3. reduce quality / bit-rate (again to 64kbps) and calculate sound "hash"
5. if 4. sound hash is in any of the 2. sound hashes return the matched music
I thought of reducing the quality / bit-rate to account for environmental noise and encoding differences.
Am I on the right track here? Can anyone point me to specific documentation or examples? Midomi even seems to recognize hums, which is pretty awesome! How do they do that?
Do sound hashes exist or is it something I just made up? If they do, how can I calculate them? And more importantly, how can I check if child-hash is in father-hash?
How would I go about building a similar system with Python (maybe a built-in module) or PHP?
Some examples (preferably in Python or PHP) will be greatly appreciated. Thanks in advance!
I do research in music information retrieval (MIR). The seminal paper on music fingerprinting is the one by Haitsma and Kalker around 2002-03. Google should get you it.
I read an early (really early; before 2000) white paper about Shazam's method. At that point, they just basically detected spectrotemporal peaks, and then hashed the peaks. I'm sure that procedure has evolved.
Both of these methods address music similarity at the signal level, i.e., they are robust to environmental distortions. I don't think they work well for query-by-humming (QBH). However, that is a different (yet related) problem with different (yet related) solutions, which you can find in the literature. (Too many to name here.)
The ISMIR proceedings are freely available online. You can find valuable stuff there: http://www.ismir.net/
I agree with using an existing library like Marsyas. It depends on what you want. NumPy/SciPy is indispensable here, I think. Simple stuff can be written in Python on your own. Heck, if you need stuff like STFT or MFCC, I can email you code.
I worked on the periphery of a cool framework that implements several Music Information Retrieval techniques. I'm hardly an expert (edit: actually I'm nowhere close to an expert, just to clarify), but I can tell you that the Fast Fourier Transform is used all over the place with this stuff. Fourier analysis is wacky but its application is pretty straightforward. Basically you can get a lot of information about audio when you analyze it in the frequency domain rather than the time domain. This is what Fourier analysis gives you.
That may be a bit off topic from what you want to do. In any case, there are some cool tools in the project to play with, as well as viewing the sourcecode for the core library itself: http://marsyas.sness.net
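To illustrate what "analysing in the frequency domain" means, here is a small pure-Python sketch that computes a naive discrete Fourier transform of a synthetic tone and finds its strongest frequency bin. It is deliberately simple and slow (quadratic in the number of samples); real code would use an FFT implementation such as the one in NumPy. The function names and the test tone are assumptions for illustration.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each frequency bin up to half the sample count."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags

def dominant_bin(samples):
    """Index of the frequency bin with the largest magnitude."""
    mags = dft_magnitudes(samples)
    return max(range(len(mags)), key=mags.__getitem__)

# A pure sine wave that completes 5 cycles over 64 samples.
n = 64
tone = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
```

In the time domain the tone is just 64 numbers going up and down; in the frequency domain almost all of the energy sits in bin 5, which is the kind of compact description fingerprinting methods build on.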
I recently ported my audio landmark-based fingerprinting system to Python:
https://github.com/dpwe/audfprint
It can recognize small (5-10 sec) excerpts from a reference database of 10s of thousands of tracks, and is quite robust to noise and channel distortions. It uses combinations of local spectral peaks, similar to the Shazam system.
This can only match the exact same track, since it relies on fine details of frequencies and time differences - it wouldn't even match different takes, certainly not cover versions or hums. As far as I understand, Midomi/SoundHound works by matching hums to each other (e.g. via dynamic time warping), then has a set of human-curated links between sets of hums and the intended music track.
Matching a hum directly to a music track ("Query by humming") is an ongoing research problem in music information retrieval, but is still pretty difficult. You can see abstracts for a set of systems evaluated last year at the MIREX 2013 QBSH Results.
MFCCs extracted from the music are very useful for finding the timbral similarity between songs; this is most often used to find similar songs. As pointed out by darren, Marsyas is a tool that can be used to extract MFCCs and find similar songs by converting the MFCCs into a single vector representation.
Besides MFCCs, rhythm is also used to find song similarity. There are a few papers presented at MIREX 2009 that will give you a good overview of the different algorithms and features that are most helpful in detecting music similarity.
The MusicBrainz project maintains such a database. You can make queries to it based on a fingerprint.
The project has existed for a while and has used different fingerprints in the past. See here for a list.
The latest fingerprint they are using is AcoustId. There is the Chromaprint library (also with Python bindings) where you can create such fingerprints. You must feed it raw PCM data.
I have recently written a library in Python which does the decoding (using FFmpeg) and provides such functions as to generate the AcoustId fingerprint (using Chromaprint) and other things (also to play the stream via PortAudio). See here.
It's been a while since I last did signal processing, but rather than downsampling you should look at frequency-domain representations (e.g. FFT or DCT). Then you could make a hash of sorts and search for the database song containing that sequence.
Tricky part is making this search fast (maybe some papers on gene search might be of interest). I suspect that iTunes also does some detection of instruments to narrow down the search.
I did read a paper about the method that a certain music information retrieval service (no names mentioned) uses: it calculates the short-time Fourier transform over the audio sample. The algorithm then picks out 'peaks' in the frequency domain, i.e. time positions and frequencies with particularly high amplitude, and uses the time and frequency of these peaks to generate a hash. It turns out the hash has surprisingly few collisions between different samples, and also stands up to approximately 50% loss of the peak information.
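The peak-hashing idea can be sketched as follows. This is a toy Python illustration of the general "pair up peaks and hash them" technique, not the actual algorithm of any named service or library. It assumes peak picking has already been done, so each track is just a list of `(time_frame, frequency_bin)` tuples, and `fan_out` and `max_dt` are hypothetical tuning parameters.

```python
from collections import Counter

def landmark_hashes(peaks, fan_out=3, max_dt=50):
    """Pair each peak with a few later peaks.

    Returns a list of ((f1, f2, dt), t1): the key describes the pair,
    and t1 is the time of the first peak.
    """
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt <= 0:
                continue
            hashes.append(((f1, f2, dt), t1))
            paired += 1
            if paired >= fan_out:
                break
    return hashes

def build_db(tracks):
    """Index every track's hashes as key -> [(track name, time), ...]."""
    db = {}
    for name, peaks in tracks.items():
        for key, t in landmark_hashes(peaks):
            db.setdefault(key, []).append((name, t))
    return db

def best_match(db, query_peaks):
    """Vote for (track, time offset); the most consistent offset wins."""
    votes = Counter()
    for key, tq in landmark_hashes(query_peaks):
        for name, tr in db.get(key, []):
            votes[(name, tr - tq)] += 1
    if not votes:
        return None
    (name, _offset), _count = votes.most_common(1)[0]
    return name
```

Because the key only stores the two frequencies and the time difference, an excerpt taken from the middle of a track produces the same keys as the full track, and the votes pile up on one consistent time offset. That also hints at an answer to the "child-hash in father-hash" question: instead of one hash per song, you store many small hashes and look each one up.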
Currently I'm developing a music search engine using ActionScript 3. The idea is analyzing the chords first and marking the frames (it's limited to mp3 files at the moment) where the frequency changes drastically (melody changes and ignoring noises). After that I do the same thing to the input sound, and match the results with the inverted files. The matching one determines the matching song.
For Axel's method, I think you shouldn't worry about whether the query is singing or just humming, since you aren't implementing a speech recognition program. But I'm curious about your method which uses hash functions. Could you explain that to me?
Query by humming is more complicated than audio fingerprinting. The difficulty comes from:
How to efficiently collect the melody database in a real-world application? Many demo systems build it from MIDI, but the cost of a MIDI-based solution is far too high for a company.
How to deal with time variance? For example, a user may hum fast or slow. Use DTW? Yes, DTW is a very good solution for time series with time variance, but it costs too much CPU.
How to build a time series index?
Here is a demo open source query-by-humming project that could serve as a reference: https://github.com/EmilioMolina/QueryBySingingHumming
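For reference, the basic dynamic time warping (DTW) computation mentioned above looks roughly like this in Python. It is a minimal textbook sketch using the absolute difference as the local cost, with no windowing or other speed-ups; the function name is made up and the inputs are assumed to be plain lists of numbers such as pitch values.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b waits
                                 cost[i][j - 1],      # b advances, a waits
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

The two nested loops are where the CPU cost comes from: comparing one hummed query of length n against one melody of length m takes n times m steps, and that has to be repeated for every candidate melody unless some index narrows the candidates first.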