I have been given a task: given discrete data like this,
I need to slice it into 5 pieces, determined by the template it creates.
I am not allowed to guess a template, because every input looks different.
My approach was to find peaks in the data (above or below zero), then use that pattern of peaks to slice the data. Here is what I got (not for the data above):
The top graph shows the detected peaks. Because I know I have exactly 5 pieces and 15 peak points, I can say that every piece has 3 points and slice accordingly, which is the second graph in that picture.
Out of 40 inputs, I managed to do this for only 5 of them, because my "peak detection" algorithm is very, very basic.
What peak detection algorithm should I use that can also find local minima and has a PHP implementation or simple pseudocode? I am a beginner in data analysis, so I need your tips.
Finally, am I even going in the right direction with how I slice this data, or is there a better-known way to do it?
EDIT:
My bad for not explaining earlier: the goal of this slicing is to create a uniform, time-independent model of a slice, meaning that long and short pieces are brought to the same length, and this is done for each peak. If it is done per slice by just stretching, the data looks noisy, like this (this is still in development, so I didn't mention it before):
And I don't know how to do it without the peaks, because every slice has different durations for its different parts (1 second, 1.1 seconds, etc.).
Find the 4 longest non-overlapping runs in your data where the values remain within some tolerance of zero. If you don't know how many beats you have to isolate, peak detection becomes more relevant, because the number of peaks above a given threshold defines how many sections you cut.
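The "longest quiet runs" idea above can be sketched in PHP roughly as follows. This is only a minimal sketch: the function name find_cut_points, the tolerance value, and the choice to cut in the middle of each run are my own assumptions, and runs that touch the very start or end of the data are ignored because they cannot separate two pieces.

```php
<?php
// Hypothetical helper: returns the indices at which to cut $data into $pieces slices.
function find_cut_points(array $data, float $tolerance, int $pieces): array
{
    $runs = [];      // each entry is [start index, length] of a near-zero run
    $start = null;   // start of the run currently being scanned, if any
    $n = count($data);

    for ($i = 0; $i <= $n; $i++) {
        $quiet = $i < $n && abs($data[$i]) <= $tolerance;
        if ($quiet && $start === null) {
            $start = $i;                          // a quiet run begins
        } elseif (!$quiet && $start !== null) {
            if ($start > 0 && $i < $n) {          // skip leading/trailing runs
                $runs[] = [$start, $i - $start];
            }
            $start = null;
        }
    }

    // Longest runs first; the ($pieces - 1) longest act as separators.
    usort($runs, fn($a, $b) => $b[1] <=> $a[1]);

    $cuts = [];
    foreach (array_slice($runs, 0, $pieces - 1) as [$s, $len]) {
        $cuts[] = $s + intdiv($len, 2);           // cut in the middle of the run
    }
    sort($cuts);
    return $cuts;
}
```

For 5 pieces you would call it with $pieces = 5 and tune $tolerance to the noise level of your signal.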
I don't think you're the first person to attack this sort of problem...
https://www.biopac.com/knowledge-base/extracting-heart-rate-from-a-noisy-ecg-signal/
Edit:
As for a peak-finding algorithm, I think this paper provides some methods.
http://www.ifi.uzh.ch/dbtg/teaching/thesesarch/ReportRSchneider.pdf
The approach labeled Significant Peak-Valley Algorithm more or less boils down to finding local extrema (minimum and maximum) in regions beyond (below and above respectively) a given threshold defined by some arbitrary number of standard deviations from the mean.
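As a rough illustration of that description (not the paper's exact algorithm), the sketch below marks one peak per contiguous region above mean + k standard deviations and one valley per region below mean - k standard deviations. The function name significant_extrema and the default k = 1 are assumptions made for this example.

```php
<?php
// Simplified threshold-based peak/valley finder; returns sample indices.
function significant_extrema(array $data, float $k = 1.0): array
{
    $n = count($data);
    $mean = array_sum($data) / $n;
    $var = 0.0;
    foreach ($data as $v) {
        $var += ($v - $mean) ** 2;
    }
    $std = sqrt($var / $n);           // population standard deviation
    $hi = $mean + $k * $std;
    $lo = $mean - $k * $std;

    $peaks = [];
    $valleys = [];
    $state = 0;      // +1 inside a region above $hi, -1 below $lo, 0 otherwise
    $best = null;    // index of the most extreme sample in the current region

    for ($i = 0; $i <= $n; $i++) {
        $s = 0;
        if ($i < $n) {
            $s = $data[$i] > $hi ? 1 : ($data[$i] < $lo ? -1 : 0);
        }
        if ($s !== $state) {
            // The previous region just ended: record its extremum.
            if ($state === 1) {
                $peaks[] = $best;
            } elseif ($state === -1) {
                $valleys[] = $best;
            }
            $state = $s;
            $best = $i;
        } elseif ($s === 1 && $data[$i] > $data[$best]) {
            $best = $i;               // new maximum within this region
        } elseif ($s === -1 && $data[$i] < $data[$best]) {
            $best = $i;               // new minimum within this region
        }
    }
    return ['peaks' => $peaks, 'valleys' => $valleys];
}
```

Raising $k makes the detector stricter; for noisy signals it may help to smooth the data first.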
I am not skilled in statistics, so I hope this will be easy for someone. My lack of skill also made it hard to find the right search terms, so I may have missed an existing answer. I am looking at arrays of data, say CPU usage. How can I capture accurate information in as few data points as possible for, say, one hour of per-core CPU usage sampled at 1-second intervals, where the first 30 minutes were at 0% and the second 30 minutes at 100%? Right now the only single data point I can think of is the mean, which is 50% and not useful at all in this case. Another case is when the usage graph is a wave, evenly bouncing between 0 and 100, yet still giving a mean of 50%. How can I capture this data? Thanks.
If I understand your question, it is really more of a statistics question than a programming question. Do you mean, what is the best way to capture a population curve with the fewest variables possible?
Firstly, most standard statistics assume that the system is more or less stable (although, if the system is unstable, the numbers you get will let you know, because they will be nonsensical).
The main measures you need statistically are the mean, the population size and the standard deviation. From these you can calculate the rough bell curve describing the population, and gauge the accuracy of that curve from the scale of the standard deviation.
This gives you a three variable schema for a standard bell curve.
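A minimal PHP sketch of that three-value summary, using the population standard deviation (dividing by n). The function name summarize and the two sample series are made up for illustration; they show how the standard deviation separates the 0%/100% case from a steady 50% load even though both have the same mean.

```php
<?php
// Returns the sample count, mean and population standard deviation.
function summarize(array $samples): array
{
    $n = count($samples);
    $mean = array_sum($samples) / $n;
    $sumSq = 0.0;
    foreach ($samples as $x) {
        $sumSq += ($x - $mean) ** 2;
    }
    return ['n' => $n, 'mean' => $mean, 'stddev' => sqrt($sumSq / $n)];
}

// One hour at 1-second resolution: 30 minutes at 0%, then 30 minutes at 100%.
$step = array_merge(array_fill(0, 1800, 0), array_fill(0, 1800, 100));
// One hour at a constant 50%.
$flat = array_fill(0, 3600, 50);

$stepStats = summarize($step);   // mean 50, stddev 50
$flatStats = summarize($flat);   // mean 50, stddev 0
```

Note that the standard deviation alone still cannot tell the step from an evenly bouncing wave, which is where the curve-fitting mentioned in this answer comes in.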
If you want to get in further detail, you can add Cpk, Ppk, which are calculated fields.
Otherwise, you may need to get into non-linear regression and curve fitting which is best handled on a case by case basis (not great for programming).
Check out the following sites for calculating the Cp, Cpk, Pp and Ppk:
http://www.qimacros.com/control-chart-formulas/cp-cpk-formula/
http://www.macroption.com/population-sample-variance-standard-deviation/
In nearly any programming language, if I do $number = rand(1,100) then I have created a flat probability, in which each number has a 1% chance of coming up.
What if I'm trying to abstract something weird, like launching rockets into space, so I want a curved (or angled) probability chart. But I don't want a "stepped" chart. (important: I'm not a math nerd, so there are probably terms or concepts that I'm completely skipping or ignorant of!) An angled chart is fine though.
So, if I wanted a probability that gave results of 1 through 100: 1 would be the most common result, 2 the next most common, and so on in a straight line until a certain point, let's say 50. There the chart angles, so the probability of rolling 51 is less than that of rolling 49. Then it angles again at 75, so the probability of getting a result above 75 is not simply 25% but some much smaller number, depending on the chart, perhaps only 10% or 5% or so.
Does this question make any sense? I'd specifically like to see how this can be done in PHP, but I wager the required logic will be rather portable.
The short answers to your questions are, yes this makes sense, and yes it is possible.
The technical term for what you're talking about is a probability density function. Intuitively, it's just what it sounds like: It is a function that tells you, if you draw random samples, how densely those samples will cluster (and what those clusters look like.) What you identify as a "flat" function is also called a uniform density; another very common one often built into standard libraries is a "normal" or Gaussian distribution. You've seen it, it's also called a bell curve distribution.
But subject to some limitations, you can have any distribution you like, and it's relatively straightforward to build one from the other.
That's the good news. The bad news is that it's math nerd territory. The ideas behind probability density functions are pretty intuitive and easy to understand, but the full power of working with them is only unlocked with a little bit of calculus. For instance, one of the limitations on your function is that the total probability has to be unity, which is the same as saying that the area under your curve needs to be exactly one. In the exact case you describe, the function is all straight lines, so you don't strictly need calculus to help you with that constraint... but in the general case, you really do.
Two good terms to look for are "Transformation methods" (there are several) and "rejection sampling." The basic idea behind rejection sampling is that you have a function you can use (in this case, your uniform distribution) and a function you want. You use the uniform distribution to make a bunch of points (x,y), and then use your desired function as a test vs the y coordinate to accept or reject the x coordinates.
That makes almost no sense without pictures, though, and unfortunately, all the best ways to talk about this are calculus based. The link below has a pretty good description and pretty good illustrations.
http://www.stats.bris.ac.uk/~manpw/teaching/folien2.pdf
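For readers who prefer code to pictures, here is a minimal rejection-sampling sketch in PHP. The weight() function is a made-up piecewise-linear shape loosely following the question (steady decline to 50, steeper to 75, then nearly flat and very low); its exact slopes are assumptions, and the weights do not need to sum to one because only their ratios matter here.

```php
<?php
// Hypothetical relative likelihood of each result 1..100 (always positive).
function weight(int $x): float
{
    if ($x <= 50) {
        return 100 - $x;                  // 99 at x = 1 down to 50 at x = 50
    }
    if ($x <= 75) {
        return 50 - 1.6 * ($x - 50);      // drops to 10 at x = 75
    }
    return 10 - 0.36 * ($x - 75);         // drops to 1 at x = 100
}

// Draw one value from 1..100 distributed according to weight().
function sample_curved(): int
{
    $maxWeight = weight(1);               // tallest point of the chart
    while (true) {
        $x = mt_rand(1, 100);                              // uniform candidate
        $y = $maxWeight * mt_rand() / mt_getrandmax();     // uniform height
        if ($y <= weight($x)) {
            return $x;                    // the point fell under the curve: accept
        }
        // Otherwise reject the candidate and try again.
    }
}
```

With this particular shape, a bit under half of the candidates are accepted, so the loop usually ends after a few iterations.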
Essentially you need only to pick a random number and then feed into a function, probably exponential, to pick the number.
Figuring out how weighted you want the results to be will make the formula you use different.
PHP can produce a random float between 0 and 1 with mt_rand() / mt_getrandmax(), so:
$num = 100 * pow(mt_rand() / mt_getrandmax(), 2);
This squares the random number (multiplies it by itself). Since that number is between 0 and 1, squaring makes it smaller, which increases the chance of a low result. To get the exact ratio you want, you'd have to play with the exponent.
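To me it seems like you need a logarithmic scale (which is curved). You'd still pull a random number, but the value you'd get would be closer to 1 than to 100 most of the time. Note that taking the log of a uniform number does the opposite (it bunches results near the top, and log(0) is undefined), so pick uniformly on the log scale and exponentiate instead. I guess this could work:
function random_value($min = 1, $max = 100) {
    // Uniform float in [0, 1].
    $u = mt_rand() / mt_getrandmax();
    // Uniform on the log scale: half of the results fall below sqrt($min * $max).
    return exp(log($min) + $u * (log($max) - log($min)));
}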
However you may want to look into it yourself to make sure.
The easiest way to achieve a curved probability is to think about how you want to distribute, for example, prizes in a game across many winners and losers. To simplify your example, take 16 players and 4 prizes. Make an array holding one entry per player with the symbol of that player's prize, (1,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4), and pick a random element from this array. Mathematically, the probability is 1:16 for prize 1, 3:16 for prize 2, 5:16 for prize 3 and 7:16 for prize 4.
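In PHP, that weighted pool could be built and sampled like this. The names $pool and draw_prize are just illustrative.

```php
<?php
// One array entry per "ticket": 1 ticket for prize 1, 3 for prize 2,
// 5 for prize 3 and 7 for prize 4 (16 in total).
$pool = array_merge(
    array_fill(0, 1, 1),
    array_fill(0, 3, 2),
    array_fill(0, 5, 3),
    array_fill(0, 7, 4)
);

// Pick one entry uniformly at random; frequent symbols come up more often.
function draw_prize(array $pool): int
{
    return $pool[array_rand($pool)];
}
```

The same trick maps onto the 1 to 100 case by repeating each number as many times as its relative weight, at the cost of a larger array.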
I'm having trouble wording my problem to search for it, so if anyone could point me in the right direction it would be appreciated.
I have multiple scores given out of 5 for a series of objects.
How can I find which object has the best overall rating? A similar formula to Amazon's reviews or Reddit's best comments (probably a lot more basic?), so not necessarily finding the highest average score but incorporating the number of reviews given to get the "best".
Any ideas?
This seems to be a classical application of the Friedman test: "n wine judges each rate k different wines. Are any wines ranked consistently higher or lower than the others?" Friedman test is implemented in many statistical packages, e.g., in R: friedman.test.
The Friedman test returns a p-value. If the p-value is not significant, there is no reason to assume that some of the objects are consistently ranked higher than others. If the p-value is significant, then you know that some objects have been ranked higher than others, but you still do not know which ones. Hence, appropriate post-hoc multiple-comparison tests should be performed.
A number of different post-hoc tests can be performed; for R code of an example post-hoc analysis, see http://www.r-statistics.com/2010/02/post-hoc-analysis-for-friedmans-test-r-code/
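If you would rather stay in PHP than call R, the Friedman statistic itself is simple to compute. The sketch below is an assumption-laden illustration rather than a replacement for friedman.test: it assumes every judge has rated every object ($ratings[judge][object] with 0-based keys), gives tied scores their average rank, applies no tie correction, and stops at the statistic, which you would then compare against a chi-square distribution with k - 1 degrees of freedom.

```php
<?php
// Rank one judge's scores (1 = lowest); tied scores share their average rank.
function average_ranks(array $scores): array
{
    $order = array_keys($scores);
    usort($order, fn($a, $b) => $scores[$a] <=> $scores[$b]);
    $ranks = [];
    $k = count($order);
    for ($i = 0; $i < $k; $i = $j) {
        $j = $i;
        while ($j < $k && $scores[$order[$j]] == $scores[$order[$i]]) {
            $j++;                              // extend over the group of ties
        }
        $avg = ($i + 1 + $j) / 2;              // mean of ranks i+1 .. j
        for ($t = $i; $t < $j; $t++) {
            $ranks[$order[$t]] = $avg;
        }
    }
    return $ranks;
}

// Friedman chi-square statistic for n judges rating k objects.
function friedman_statistic(array $ratings): float
{
    $n = count($ratings);
    $k = count($ratings[0]);
    $rankSums = array_fill(0, $k, 0.0);
    foreach ($ratings as $row) {
        foreach (average_ranks($row) as $obj => $rank) {
            $rankSums[$obj] += $rank;
        }
    }
    $sumSq = 0.0;
    foreach ($rankSums as $r) {
        $sumSq += $r * $r;
    }
    return 12.0 / ($n * $k * ($k + 1)) * $sumSq - 3.0 * $n * ($k + 1);
}
```

The per-object rank sums computed inside friedman_statistic are also a reasonable starting point for deciding which object came out on top.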
I'm working on a full text index system for a project of mine. As one part of the process of indexing pages it splits the data into a very, very large number of very small pieces.
I have gotten the size of the pieces down to a constant 20-30 bytes, and it could be less; the actual data is basically two 8-byte integers and a float.
Because of the scale I'm looking for and the number of pieces this creates I'm looking for an alternative to mysql which has shown significant issues at value sets well below my goal.
My current thinking is that a key-value store would be the best option for this and I have adjusted my code accordingly.
I have tried a number of them, but for some reason they all seem to scale even worse than MySQL.
I'm looking to store on the order of hundreds of millions or billions or more key-value pairs so I need something that won't have a large performance degradation with size.
I have tried memcachedb, membase, and mongo and while they were all easy enough to set up, none of them scaled that well for me.
membase had the most issues due to the number of keys required and the limited memory available. Write speed is very important here, as this is close to an even read/write workload: I write a thing once, then read it back a few times and store it for eventual update.
I don't need much performance on deletes and I would prefer something that can cluster well as I'm hoping to eventually have this able to scale across machines but it needs to work on a single machine for now.
I'm also hoping to make this project easy to deploy, so an easy setup would be much better. The project is written in PHP, so the store needs to be easily accessible from PHP.
I don't need to have rows or other higher level abstractions, they are mostly useless in this case and I have already made the code from some of my other tests to get down to a key-value store and that seems to likely be the fastest as I only have 2 things that would be retrieved from a row keyed off a third so there is little additional work done to use a key-value store. Does anyone know any easy to use projects that can scale like this?
I am using this store to hold individual sets of three numbers (the sizes are based on how they were stored in MySQL and may differ in other storage systems): two 8-byte integers, one for the ID of the document and one for the ID of the word, and a float representing the proportion of the document made up by that word (the number of times the word appeared divided by the number of words in the document). The index for this data is the word ID and the range the document ID falls into; every time I need to retrieve this data, it will be all of the results for a given word ID. I currently turn the word ID, the range, and a counter for that word/range combination into binary representations and concatenate them to form the key, along with a 2-digit number that says which value I am storing under that key: the document ID or the float value.
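To make the key layout described above concrete, here is a rough PHP sketch using pack(). The field widths (64-bit word ID and range, 32-bit counter), the function name make_key, and the field codes 1 and 2 are assumptions for illustration, not the exact format used in the project; the 'J' format code requires a 64-bit PHP build.

```php
<?php
// Hypothetical field codes for the 2-digit suffix.
const FIELD_DOC_ID = 1;   // the value stored under this key is the document ID
const FIELD_SCORE  = 2;   // the value stored under this key is the float proportion

// Build a fixed-length binary key: word ID + range + counter + 2-digit field code.
function make_key(int $wordId, int $range, int $counter, int $field): string
{
    // 'J' = unsigned 64-bit big-endian, 'N' = unsigned 32-bit big-endian.
    // Big-endian keeps byte-wise key order equal to numeric order.
    return pack('JJN', $wordId, $range, $counter) . sprintf('%02d', $field);
}

$key = make_key(42, 7, 3, FIELD_DOC_ID);   // 22 bytes: 8 + 8 + 4 + 2
```

Because all keys for one word ID share the same 8-byte prefix, a store that supports ordered or prefix scans could return every entry for a word in one pass.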
Performance measurement was somewhat subjective: I looked at the output of the processes putting data into or pulling data out of the storage to see how fast they were processing documents, rapidly refreshed my statistics counters (which track more accurate figures on how fast the system is working), and compared the results across storage methods.
You would need to provide some more data about what you really want to do...
Depending on how you define "fast" and "large scale", you have several options:
memcache
redis
voldemort
riak
and so on. The list gets pretty big.
Edit 1:
Per the comments on this post, I would say take a look at Cassandra or Voldemort. Cassandra isn't a simple KV store per se, since you can store much more complex objects than just K -> V.
If you care to try Cassandra with PHP, take a look at phpcassa. But Redis is also a good option if you set up a replica.
Here are a few products and ideas that weren't mentioned above:
OrientDB - this is a graph/document database, but you can use it to store very small "documents" - it is extremely fast, highly scalable, and optimized to handle vast amounts of records.
Berkeley DB - Berkeley DB is a key-value store used at the heart of a number of graph and document databases - supposedly has a SQLite-compatible API that works with PHP.
shmop - Shared memory operations might be one possible approach, if you're willing to do some dirty work. If your records are small and have a fixed size, this might work for you, using a fixed record size and padding with zeroes.
handlersocket - this has been in development for a long time, and I don't know how reliable it is. It basically lets you use MySQL at a "lower level", almost like a key/value-store. Because you're bypassing the query parser etc. it's much faster than MySQL in general.
If you have a fixed record-size, few writes and lots of reads, you may even consider reading/writing to/from a flat file. Likely nowhere near as fast as reading/writing to shared memory, but it may be worth considering. I suggest you weigh all the pros/cons specifically for your project's requirements, not only for products, but for any approach you can think of. Your requirements aren't exactly "mainstream", and the solution may not be as obvious as picking the right product.
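A minimal sketch of the fixed-record flat-file idea in PHP. The 24-byte layout (two 64-bit integers plus a double), the function names, and the use of php://temp as a stand-in for a real file are all assumptions for this example; the 'E' pack code needs PHP 7.0.15 or later, and 'J' needs a 64-bit build.

```php
<?php
const RECORD_SIZE = 24;   // 8 (doc ID) + 8 (word ID) + 8 (score as double)

// Write one record at slot $index; the offset is simply index * record size.
function write_record($fh, int $index, int $docId, int $wordId, float $score): void
{
    fseek($fh, $index * RECORD_SIZE);
    fwrite($fh, pack('JJE', $docId, $wordId, $score));
}

// Read the record at slot $index back into an associative array.
function read_record($fh, int $index): array
{
    fseek($fh, $index * RECORD_SIZE);
    return unpack('Jdoc/Jword/Escore', fread($fh, RECORD_SIZE));
}

// Stand-in for a real data file; records are written in slot order here.
$fh = fopen('php://temp', 'w+b');
write_record($fh, 0, 10, 500, 0.25);
write_record($fh, 1, 11, 500, 0.125);
write_record($fh, 2, 12, 501, 0.5);

$second = read_record($fh, 1);   // doc 11, word 500, score 0.125
```

The same pack/unpack layout would also work for a shmop segment, since both approaches rely on computing a byte offset from the record number.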
I have a large data set of around 200,000 values, all of them strings. Which data structure should I use so that searching and retrieval are fast? Insertion happens only once, so even if insertion is slow it wouldn't matter much.
Hash Map could be one solution, but what are the other choices??
Thanks
Edit:
some pointers
1. I am looking for exact matches and not the partial ones.
2. I have to accomplish this in PHP.
3. Is there any way I can keep this amount of data in a cache, in the form of a tree or in some other format?
You really should consider not using maps or hash dictionaries if all you need is a string lookup. When using those, your complexity guarantees for N items in a lookup of string size M are O(M x log(N)) or, at best, amortised O(M) for the hash, with a large constant multiplier. It is much more efficient to use an acyclic deterministic finite automaton (ADFA) for basic lookups, or a trie if there is a need to associate data. These walk the data structure one character at a time, giving O(M) complexity with a very small multiplier.
Basically, you want a data structure that parses your string as it is consumed by the data structure, not one that must do full string compares at each node of the lookup. The common orders of complexity you see thrown around for red-black trees and such assume an O(1) compare, which is not true for strings. String compares are O(M), and that propagates to all compares used.
Maybe a trie data structure.
A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings
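Since the question asks for PHP, here is a minimal trie sketch built from nested arrays. The function names and the use of the empty-string key as an end-of-word marker are choices made for this example; it works byte by byte, so multibyte characters simply occupy several levels.

```php
<?php
// Insert $word into the trie, creating one nested array per character.
function trie_insert(array &$trie, string $word): void
{
    $node = &$trie;
    for ($i = 0, $len = strlen($word); $i < $len; $i++) {
        $ch = $word[$i];
        if (!isset($node[$ch])) {
            $node[$ch] = [];
        }
        $node = &$node[$ch];      // descend one level
    }
    $node[''] = true;             // empty key marks "a word ends here"
}

// Exact-match lookup: walk the trie one character at a time.
function trie_contains(array $trie, string $word): bool
{
    $node = $trie;
    for ($i = 0, $len = strlen($word); $i < $len; $i++) {
        $ch = $word[$i];
        if (!isset($node[$ch])) {
            return false;         // no branch for this character
        }
        $node = $node[$ch];
    }
    return isset($node['']);      // a prefix alone does not count as a match
}

$trie = [];
foreach (['car', 'cart', 'dog'] as $w) {
    trie_insert($trie, $w);
}
```

Here trie_contains($trie, 'cart') is true while trie_contains($trie, 'ca') is false, because only complete words carry the end marker.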
Use a TreeMap in that case. Search and Retrieval will be O(log n). In case of HashMap search can be O(n) worst case, but retrieval is O(1).
For 200000 values, it probably won't matter much though unless you are working with hardware constraints. I have used HashMaps with 2 million Strings and they were still fast enough. YMMV.
You can use B+ trees if you want to keep search time minimal at the cost of insertion time.
You can also try bucket push and search.
Use a hashmap. Assuming implementation similar to Java's, and a normal collision rate, retrieval is O(m) - the main cost is computing the hashcode and then one string-compare. That's hard to beat.
For any tree/trie implementation, factor in the hard-to-quantify costs of the additional pipeline stalls caused by additional non-localized data fetches. The only reason to use one (a trie, in particular) would be to possibly save memory. Memory will be saved only with long strings. With short strings, the memory savings from reduced character storage are more than offset by all the additional pointers/indices.
Fine print: worse behavior can occur when there are lots of hashcode collisions due to an ill-chosen hashing function. Your mileage may vary. But it probably won't.
I don't do PHP - there may be language characteristics that skew the answer here.
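For what it's worth, in PHP the hashmap approach needs no extra library, because PHP arrays are hash tables. A minimal sketch for exact-match lookups follows; the sample values and the helper name has_value are made up for illustration.

```php
<?php
// The one-time "insertion": turn the list of strings into array keys.
$values = ['apple', 'banana', 'cherry'];
$lookup = array_flip($values);   // maps each string to its original index

// Exact-match test via a hash lookup instead of scanning the list.
function has_value(array $lookup, string $needle): bool
{
    return isset($lookup[$needle]);
}

$found = has_value($lookup, 'banana');    // true
$missing = has_value($lookup, 'durian');  // false
```

This avoids the linear scan that in_array() would do on the original list, which is where most of the difference shows up at 200,000 entries.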