I am not skilled in statistics, so I hope this will be easy for someone; my lack of skill also made it hard to find the right search terms, so I may have missed an existing answer. I am looking at arrays of data, say CPU usage, and I want to capture accurate information in as few data points as possible. For example, take a set of 1-second samples of CPU usage across the cores over 1 hour, where the first 30 minutes are 0% and the second 30 minutes are 100%. Right now, the only single data point I can think of is the mean, which is 50% and not useful at all in this case. Another case is when the usage graph looks like a wave, bouncing evenly up and down between 0 and 100, yet still giving a mean of 50%. How can I capture this data? Thanks.
If I understand your question, it is really more of a statistics question than a programming question. Do you mean, what is the best way to capture a population curve with the fewest variables possible?
Firstly, most standard statistics assume that the system is more or less stable (although if the system is unstable, the numbers you get will let you know, because they will be nonsensical).
The main measures you need statistically are the mean, the population size and the standard deviation. From these you can calculate a rough bell curve describing the population, and the scale of the standard deviation tells you how accurate that curve is.
This gives you a three variable schema for a standard bell curve.
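As a minimal sketch in PHP (assuming the samples are already in a plain array, e.g. one CPU-usage percentage per second):
// Compute the three summary values: sample count, mean and standard deviation.
function summarize(array $samples): array
{
    $n = count($samples);
    $mean = array_sum($samples) / $n;
    // Population variance: average squared distance from the mean.
    $variance = 0.0;
    foreach ($samples as $x) {
        $variance += ($x - $mean) ** 2;
    }
    $variance /= $n;
    return ['count' => $n, 'mean' => $mean, 'stddev' => sqrt($variance)];
}
// Example: 30 minutes at 0% then 30 minutes at 100% gives mean 50 but stddev 50,
// which already tells you the data is nothing like a steady 50%.
print_r(summarize(array_merge(array_fill(0, 1800, 0), array_fill(0, 1800, 100))));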
If you want to get in further detail, you can add Cpk, Ppk, which are calculated fields.
Otherwise, you may need to get into non-linear regression and curve fitting which is best handled on a case by case basis (not great for programming).
Check out the following sites for calculating the Cp, Cpk, Pp and Ppk:
http://www.qimacros.com/control-chart-formulas/cp-cpk-formula/
http://www.macroption.com/population-sample-variance-standard-deviation/
I have been given a task: given discrete data like this,
I need to slice it into 5 pieces, determined by the template it creates.
I am not allowed to guess a template, because every input looks different.
My approach was to find peaks in the data (above or below zero), then use that pattern of peaks to slice the data. Here is what I got: (not for the above data)
The top graph shows the peaks in the data, and because I know I have exactly 5 pieces and 15 peak points, I can say that every piece has 3 points and slice accordingly, which is the second graph in that picture.
Out of 40 inputs, I managed to do this only for 5 of them, because my "peak detection" algorithm is very very basic.
What peak detection algorithm should I use that can also find local minima and has a PHP implementation or simple pseudocode? I am a beginner in this field of data analysis, so I need your tips.
Finally, am I even going in the right direction on how to slice this data? or is there a better known way to do it?
EDIT:
My bad for not explaining before: the goal of this slicing is to create a uniform, time-independent model of a slice, meaning that long and short pieces end up the same length, and that this holds for each peak. If this is done per slice by just stretching, the data looks noisy, like this: (this is still in development, so I didn't write it before)
And I don't know how to do it without the peaks, because every slice has different durations for its different parts (1 second, 1.1 seconds, etc.).
Find the 4 longest non-overlapping subsets of your data where the values remain within some tolerance of zero. If you don't know how many beats you have to isolate, peak detection becomes more relevant, as the number of peaks above a given threshold defines how many sections you dissect.
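A rough PHP sketch of that idea (the tolerance and how many runs you keep are assumptions you would tune to your data):
// Collect runs of consecutive samples that stay within $tolerance of zero,
// longest first; the first few runs are candidate places to cut.
function nearZeroRuns(array $data, float $tolerance): array
{
    $runs = [];
    $start = null;
    foreach ($data as $i => $value) {
        if (abs($value) <= $tolerance) {
            if ($start === null) {
                $start = $i;
            }
        } elseif ($start !== null) {
            $runs[] = ['start' => $start, 'end' => $i - 1, 'length' => $i - $start];
            $start = null;
        }
    }
    if ($start !== null) {
        $runs[] = ['start' => $start, 'end' => count($data) - 1, 'length' => count($data) - $start];
    }
    usort($runs, fn($a, $b) => $b['length'] <=> $a['length']);
    return $runs;
}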
I don't think you're the first person to attack this sort of problem...
https://www.biopac.com/knowledge-base/extracting-heart-rate-from-a-noisy-ecg-signal/
Edit:
As far as a peak finding algorithm I think this paper provides some methods.
http://www.ifi.uzh.ch/dbtg/teaching/thesesarch/ReportRSchneider.pdf
The approach labeled the Significant Peak-Valley Algorithm more or less boils down to finding local extrema (minima and maxima) in regions beyond (below and above, respectively) a threshold defined by some arbitrary number of standard deviations from the mean.
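A minimal PHP sketch of that idea (single pass, no windowing; $k, the number of standard deviations, is an assumption you would tune):
// Mark local maxima above mean + $k * stddev and local minima below mean - $k * stddev.
function significantExtrema(array $data, float $k = 1.0): array
{
    $n = count($data);
    $mean = array_sum($data) / $n;
    $var = 0.0;
    foreach ($data as $x) {
        $var += ($x - $mean) ** 2;
    }
    $std = sqrt($var / $n);
    $upper = $mean + $k * $std;
    $lower = $mean - $k * $std;
    $extrema = [];
    for ($i = 1; $i < $n - 1; $i++) {
        $isMax = $data[$i] >= $data[$i - 1] && $data[$i] >= $data[$i + 1];
        $isMin = $data[$i] <= $data[$i - 1] && $data[$i] <= $data[$i + 1];
        if ($isMax && $data[$i] > $upper) {
            $extrema[] = ['index' => $i, 'type' => 'peak'];
        } elseif ($isMin && $data[$i] < $lower) {
            $extrema[] = ['index' => $i, 'type' => 'valley'];
        }
    }
    return $extrema;
}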
In nearly any programming language, if I do $number = rand(1,100) then I have created a flat probability, in which each number has a 1% chance of coming up.
What if I'm trying to abstract something weird, like launching rockets into space, so I want a curved (or angled) probability chart, but not a "stepped" chart? (Important: I'm not a math nerd, so there are probably terms or concepts that I'm completely skipping or ignorant of!) An angled chart is fine, though.
So, say I wanted a probability that gave results of 1 through 100: 1 would be the most common result, 2 the next most common, and so on in a straight line until a certain point - let's say 50 - where the chart angles, so the probability of rolling 51 is less than that of rolling 49. Then it angles again at 75, so the probability of getting a result above 75 is not simply 25%, but some much smaller number depending on the chart - perhaps only 10% or 5% or so.
Does this question make any sense? I'd specifically like to see how this can be done in PHP, but I wager the required logic will be rather portable.
The short answers to your questions are, yes this makes sense, and yes it is possible.
The technical term for what you're talking about is a probability density function. Intuitively, it's just what it sounds like: It is a function that tells you, if you draw random samples, how densely those samples will cluster (and what those clusters look like.) What you identify as a "flat" function is also called a uniform density; another very common one often built into standard libraries is a "normal" or Gaussian distribution. You've seen it, it's also called a bell curve distribution.
But subject to some limitations, you can have any distribution you like, and it's relatively straightforward to build one from the other.
That's the good news. The bad news is that it's math nerd territory. The ideas behind probability density functions are pretty intuitive and easy to understand, but the full power of working with them is only unlocked with a little bit of calculus. For instance, one of the limitations on your function is that the total probability has to be unity, which is the same as saying that the area under your curve needs to be exactly one. In the exact case you describe, the function is all straight lines, so you don't strictly need calculus to help you with that constraint... but in the general case, you really do.
Two good terms to look for are "transformation methods" (there are several) and "rejection sampling." The basic idea behind rejection sampling is that you have a distribution you can already sample from (in this case, your uniform distribution) and a distribution you want. You use the uniform distribution to generate a bunch of points (x, y), and then keep an x only if its y value falls under your desired function; otherwise you reject it.
That makes almost no sense without pictures, though, and unfortunately, all the best ways to talk about this are calculus based. The link below has a pretty good description and pretty good illustrations.
http://www.stats.bris.ac.uk/~manpw/teaching/folien2.pdf
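For the shape described in the question, a rejection-sampling sketch in PHP could look like this (the three line segments and their slopes are arbitrary placeholders, not a tuned model):
// Rejection sampling: draw a uniform point (x, y) in the bounding box and
// keep x only if y falls under the desired (unnormalized) density curve.
function sampleFromDensity(callable $density, float $min, float $max, float $peak): float
{
    while (true) {
        $x = $min + ($max - $min) * mt_rand() / mt_getrandmax();
        $y = $peak * mt_rand() / mt_getrandmax();
        if ($y <= $density($x)) {
            return $x;
        }
    }
}
// Gentle slope up to 50, steeper up to 75, flattest above 75 (angled, not stepped).
$density = function (float $x): float {
    if ($x <= 50) return 1.0 - 0.004 * $x;        // 1.0 down to 0.8
    if ($x <= 75) return 0.8 - 0.02 * ($x - 50);  // 0.8 down to 0.3
    return 0.3 - 0.008 * ($x - 75);               // 0.3 down to 0.1
};
$roll = (int) ceil(sampleFromDensity($density, 0, 100, 1.0));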
Essentially you only need to pick a random number and then feed it into a function, probably exponential, to get the final number.
How heavily you want the results weighted will determine which formula you use.
PHP doesn't have a built-in random double function, but mt_rand() / mt_getrandmax() gives a double between 0 and 1:
$num = 100 * pow(mt_rand() / mt_getrandmax(), 2);
This squares the random value, and since that value is between 0 and 1, squaring makes it smaller, which increases the chance of ending up with a lower number. To get the exact ratio you want, you'd just have to play with the exponent.
To me it seems like you need a logarithmic function (which is curved). You'd still pull a random number, but the value that you'd get would be closer to 1 than 100 most of the time. So I guess this could work:
function random_value($min = 1, $max = 100) {
    // Make log(result) uniform, so results land near $min far more often than near $max.
    // Note: $min must be at least 1 for this to work.
    $u = mt_rand() / mt_getrandmax();     // uniform double in [0, 1]
    return $min * pow($max / $min, $u);   // log-uniform value in [$min, $max]
}
However you may want to look into it yourself to make sure.
The easiest way to achieve a curved probability is to think about how you would distribute, for example, a prize in a game across many winners and losers. To simplify your example, take 16 players and 4 prizes. Then make an array with one entry per slot for each prize, (1,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4), and pick a number out of this array at random. Mathematically you then have a probability of 1:16 for prize 1, 3:16 for prize 2, 5:16 for prize 3 and 7:16 for prize 4.
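That lookup-array idea is easy to code directly; a short sketch using the counts from the example above:
// One array entry per slot, so the share of slots a prize occupies is its
// probability: 1/16, 3/16, 5/16 and 7/16 respectively.
$slots = array_merge(
    array_fill(0, 1, 1),  // prize 1: 1 slot
    array_fill(0, 3, 2),  // prize 2: 3 slots
    array_fill(0, 5, 3),  // prize 3: 5 slots
    array_fill(0, 7, 4)   // prize 4: 7 slots
);
// Draw a prize by picking a random index into the lookup array.
$prize = $slots[mt_rand(0, count($slots) - 1)];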
I'm trying to nest material with the least drop or waste.
Table A
Qty Type Description Length
2 W 16x19 16'
3 W 16x19 12'
5 W 16x19 5'
2 W 5x9 3'
Table B
Type Description StockLength
W 16X19 20'
W 16X19 25'
W 16X19 40'
W 5X9 20'
I've looked all over, into greedy algorithms, bin packing, knapsack, 1D-CSP, branch and bound, brute force, and others. I'm pretty sure it is a cutting stock problem. I just need help coming up with the function(s) to run this. I don't have just one stock length but several, and a user may enter his own inventory of less common lengths. Any help figuring out a function or algorithm to use in PHP to come up with the optimized cutting pattern and the stock lengths needed with the least waste would be greatly appreciated.
Thanks
If your question is "gimme the code", I am afraid that you have not given enough information to implement a good solution. If you read the whole of this answer, you will see why.
If your question is "gimme the algorithm", I am afraid you are looking for an answer in the wrong place. This is a technology-oriented site, not an algorithms-oriented one. Even though we programmers do of course understand algorithms (e.g., why it is inefficient to pass the same string to strlen in every iteration of a loop, or why bubble sort is not okay except for very short lists), most questions here are like "how do I use API X using language/framework Y?".
Answering complex algorithm questions like this one requires a certain kind of expertise (including, but not limited to, a lot of mathematical ability). People in the field of operations research have worked on this kind of problem more than most of us ever have. Here is an introductory book on the topic.
As an engineer trying to find a practical solution to a real-world problem, I would first get answers for these questions:
How big is the average problem instance you are trying to solve? Since your generic problem is NP-complete (as Jitamaro already said), moderately big problem instances require the use of heuristics. If you are only going to solve small problem instances, you might be able to get away with implementing an algorithm that finds the exact optimum, but of course you would have to warn your users that they should not use your software to solve big problem instances.
Are there any patterns you could use to reduce the complexity of the problem? For example, do the items always or almost always come in specific sizes or quantities? If so, you could implement a greedy algorithm that focuses on yielding high-quality solutions for common scenarios.
What would be your optimality vs. computational efficiency tradeoff? If you only need a good answer, then you should not waste mental or computational effort trying to provide an optimal answer. Information, whether provided by a person or by a computer, is only useful if it is available when it is needed.
How much are your customers willing to pay for a high-quality solution? Unlike database or Web programming, which can be done by practically everyone because algorithms are kept to a minimum (e.g. you seldom code the exact procedure by which a SQL database provides the result of a query), operations research does require both mathematical and engineering skills. If you are not charging for them, you are losing money.
This looks to me like a variation of 1D bin packing. You may try best-fit and then try it again with different sortings of Table B. In any case, no polynomial-time algorithm can guarantee a result better than 3/2 of the optimum, because this is an NP-complete problem. Here is a nice tutorial: http://m.developerfusion.com/article/5540/bin-packing. I used it a lot to solve my problem.
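As a baseline, a first-fit-decreasing sketch in PHP (this assumes a single stock length and ignores saw kerf, so treat it as a starting point rather than the optimizer you ultimately want):
// Sort pieces longest-first, place each into the first stock bar with room,
// and open a new bar when nothing fits. Waste per bar is the 'remaining' value.
function firstFitDecreasing(array $pieces, float $stockLength): array
{
    rsort($pieces);
    $bars = [];
    foreach ($pieces as $piece) {
        $placed = false;
        foreach ($bars as &$bar) {
            if ($bar['remaining'] >= $piece) {
                $bar['remaining'] -= $piece;
                $bar['cuts'][] = $piece;
                $placed = true;
                break;
            }
        }
        unset($bar);
        if (!$placed) {
            $bars[] = ['remaining' => $stockLength - $piece, 'cuts' => [$piece]];
        }
    }
    return $bars;
}
// Example: the W 16x19 pieces from Table A against 20' stock.
$plan = firstFitDecreasing([16, 16, 12, 12, 12, 5, 5, 5, 5, 5], 20);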
I'm running Eclipse in Linux and I was told I could use Xdebug to optimize my program. I use a combination algorithm in my script that takes too long to run.
I am just asking for a starting point to debug this. I know how to do the basics...break points, conditional break points, start, stop, step over, etc... but I want to learn more advanced techniques so I can write better, optimized code.
The first step is to know how to calculate the asymptotic memory usage, which means how fast the memory grows as the problem gets bigger. You do this by saying that one recursion takes up X bytes (X is a constant; the easiest is to set it to 1). Then you write down the recurrence, i.e., the way the function calls itself or loops, and try to conclude how the memory grows (is it quadratic in the problem size, linear, or maybe less?).
This is taught in elementary computer science classes at the universities since it's really useful when concluding how effective an algorithm is. The exact method is hard to describe in a simple forum post, so I recommend you to pick up a book on algorithms (I recommend "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein - MIT Press).
But if you don't have a clue about this type of work, start by calling memory_get_usage() and echoing how much memory you're using in your loop/recursion. This can give you a hint about where the problem is. Try to reduce the amount of data you keep in memory. Throw away everything you don't need (for example, don't build up a giant array of all the data if you can boil it down to intermediate values earlier).
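A minimal sketch of that kind of instrumentation (the loop body is just a placeholder for one step of your combination routine):
// Print memory and elapsed time every $every iterations so you can see whether
// usage grows roughly linearly, quadratically, or worse with the input size.
$start = microtime(true);
$every = 1000;
for ($i = 1; $i <= 100000; $i++) {
    // ... one step of the combination algorithm goes here ...
    if ($i % $every === 0) {
        printf(
            "iteration %d: %.1f MiB in use, %.2f s elapsed\n",
            $i,
            memory_get_usage(true) / 1048576,
            microtime(true) - $start
        );
    }
}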
Looking at http://www.nearmap.com/,
Just wondering if you can approximate how much storage is needed to store the images?
(NearMap’s monthly city PhotoMaps are captured at 3cm, 5cm, 7.5cm, or 10cm resolution)
And what kind of systems/architecture is suitable to deliver those data/images?
(say you are not Google, and want to implement this from scratch, what would you do? )
ie. would you store the images in Hadoop, and use apache/php/memcache to deliver etc ?
It's pretty hard to estimate how much space is required without being able to determine the compression ratio. Simply put, if aerial photographs of houses compress well, then it can significantly change how much data needs to be stored.
But, in the interests of math we can try to figure out what is required.
So, if each pixel measures 3cm by 3cm, it covers 9cm^2. A quick Wikipedia search tells us that London is about 1700km^2, which at 10 billion cm^2 per km^2 is 17,000,000,000,000 cm^2. This means we need 1,888,888,888,888 pixels to cover London at a resolution of 3cm. Putting this into bytes, at 4 bytes per pixel, that is about 7000 GiB. If you get 50% compression, that drops down to 3500 GiB for London. Multiply this out by every city you want to cover to get an idea of what kind of data storage you will need.
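The same arithmetic as a quick sketch, so the assumptions (3cm pixels, 4 bytes per pixel, 50% compression) are easy to change:
// Back-of-the-envelope storage estimate for one city.
$areaKm2       = 1700;   // London, roughly
$pixelCm       = 3;      // 3cm resolution
$bytesPerPixel = 4;
$compression   = 0.5;    // assume 50% compression
$areaCm2 = $areaKm2 * 1e10;                    // 1 km^2 = 10^10 cm^2
$pixels  = $areaCm2 / ($pixelCm * $pixelCm);
$bytes   = $pixels * $bytesPerPixel * $compression;
printf("%.0f pixels, about %.0f GiB\n", $pixels, $bytes / (1024 ** 3));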
Delivering the content is simple compared to gathering it. Since this is an embarrassingly parallel problem, a shared-nothing cluster with an appropriate front end to route traffic to the right nodes would probably be the easiest way to implement it, because the nodes don't have to maintain state or communicate with each other. The ideal method depends on how much data you are pushing through; if you push enough, it might be worthwhile to implement your own web server that just responds to HTTP GETs.
I'm not sure a distributed FS would be the best way to distribute things since you'd have to spend a significant amount of time trying to pull data from somewhere else in the cluster.