Curved (or angled) probability - PHP

In nearly any programming language, if I do $number = rand(1,100) then I have created a flat probability, in which each number has a 1% chance of coming up.
What if I'm trying to model something weird, like launching rockets into space, and I want a curved (or angled) probability chart instead? But I don't want a "stepped" chart. (Important: I'm not a math nerd, so there are probably terms or concepts that I'm completely skipping or ignorant of!) An angled chart is fine, though.
So, if I wanted a probability that gave results of 1 through 100... 1 would be the most common result, 2 the next most common, and so on in a straight line until a certain point, let's say 50. Then the chart angles, so the probability drops off more steeply after 50 than before it. Then it angles again at 75, so the probability of getting a result above 75 is not simply 25%, but instead some much smaller number, depending on the chart, perhaps only 10% or 5% or so.
Does this question make any sense? I'd specifically like to see how this can be done in PHP, but I wager the required logic will be rather portable.

The short answers to your questions are: yes, this makes sense, and yes, it is possible.
The technical term for what you're talking about is a probability density function. Intuitively, it's just what it sounds like: it is a function that tells you, if you draw random samples, how densely those samples will cluster (and what those clusters look like). What you describe as a "flat" function is also called a uniform density; another very common one, often built into standard libraries, is the "normal" or Gaussian distribution. You've seen it; it's the familiar bell curve.
But subject to some limitations, you can have any distribution you like, and it's relatively straightforward to build one from the other.
That's the good news. The bad news is that it's math nerd territory. The ideas behind probability density functions are pretty intuitive and easy to understand, but the full power of working with them is only unlocked with a little bit of calculus. For instance, one of the limitations on your function is that the total probability has to be unity, which is the same as saying that the area under your curve needs to be exactly one. In the exact case you describe, the function is all straight lines, so you don't strictly need calculus to help you with that constraint... but in the general case, you really do.
Two good terms to look for are "transformation methods" (there are several) and "rejection sampling." The basic idea behind rejection sampling is that you have a distribution you can sample from (in this case, your uniform distribution) and a distribution you want. You use the uniform distribution to generate a bunch of points (x, y), and then use your desired function as a test against the y coordinate to accept or reject each x coordinate.
That makes almost no sense without pictures, though, and unfortunately, all the best ways to talk about this are calculus based. The link below has a pretty good description and pretty good illustrations.
http://www.stats.bris.ac.uk/~manpw/teaching/folien2.pdf
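For what it's worth, here is a minimal rejection-sampling sketch in PHP for the piecewise-linear shape described in the question. The shape of target_density() and its breakpoints (50 and 75) are just illustrative assumptions; any non-negative function bounded by $maxDensity would work the same way.

// Illustrative piecewise-linear "density" for 1..100 (not normalized;
// rejection sampling only needs relative heights plus an upper bound).
function target_density($x) {
    if ($x <= 50) {
        return 100 - $x;                 // gentle slope down to x = 50
    } elseif ($x <= 75) {
        return 50 - ($x - 50) * 1.2;     // steeper slope from 50 to 75
    }
    return 20 - ($x - 75) * 0.5;         // steepest tail above 75
}

function curved_rand($min = 1, $max = 100, $maxDensity = 99) {
    while (true) {
        $x = mt_rand($min, $max);                        // uniform candidate
        $y = mt_rand() / mt_getrandmax() * $maxDensity;  // uniform height
        if ($y <= target_density($x)) {
            return $x;   // accept: the candidate falls under the curve
        }
        // otherwise reject and draw again
    }
}

echo curved_rand(), "\n";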

Essentially you only need to pick a random number and then feed it into a function, probably an exponential one, to get your result.
How heavily you want to weight the results determines which formula you use.
Assuming PHP has a random double function, I'm going to call it random.
$num = 100 * pow(random(), 2);
This squares the random number; since it starts out between 0 and 1, squaring makes it smaller, which increases the chance of ending up with a lower number. To get the exact ratio you want, you'd just have to play with this formula.
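Since PHP has no built-in random-float function, here is one way to realize the same idea with mt_rand(); the exponent of 2 is an assumption you would tune to control how strongly results skew toward 1.

// Uniform float in [0, 1), then raised to a power to bias toward small values.
function skewed_rand($max = 100, $exponent = 2) {
    $u = mt_rand() / (mt_getrandmax() + 1);             // uniform in [0, 1)
    return (int) floor(pow($u, $exponent) * $max) + 1;  // 1..$max, low numbers most common
}

echo skewed_rand(), "\n";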

To me it seems like you need a logarithmic function (which is curved). You'd still pull a random number, but the value that you'd get would be closer to 1 than 100 most of the time. So I guess this could work:
function random_value($min = 1, $max = 100) {
    // log(0) is undefined, so $min defaults to 1; log10 of 1..100 spans 0..2, scaled here to 0..20
    return log(rand($min, $max), 10) * 10;
}
However you may want to look into it yourself to make sure.

The easiest way to achieve a curved probability is to think about how you want to distribute, for example, a prize in a game across many winners and losers. To simplify your example, take 16 players and 4 prizes. Then make an array with one entry per chance of each prize (1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4) and pick a random element out of this array. Mathematically you would have a probability of 1:16 for prize 1, 3:16 for prize 2, 5:16 for prize 3 and 7:16 for prize 4.
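A minimal sketch of that lookup-table idea in PHP, using the weights from the example above:

// Lookup-table approach: each prize appears as many times as its weight.
$pool = array(1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4);

// Pick one element uniformly; prize 4 comes up 7 times out of 16 on average.
$prize = $pool[array_rand($pool)];
echo $prize, "\n";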

Related

Peak detection / Slice discrete data

I have been given a task: given discrete data like this,
I need to slice it into 5 pieces, determined by the template it creates.
I am not allowed to guess a template, because every input looks different.
My approach was to find peaks in the data (above or below zero), then use that pattern of peaks to slice the data. Here is what I got: (not for the above data)
The top graph is the peaks in the graph, and because I know I have exactly 5 pieces, and 15 points, I can say that every piece has 3 points, and then slice it, which is the second graph in that picture.
Out of 40 inputs, I managed to do this only for 5 of them, because my "peak detection" algorithm is very very basic.
What peak detection algorithm should I use that can also find local minima and has a PHP implementation or simple pseudocode? I am a beginner in this field of data analysis, so I need your tips.
Finally, am I even going in the right direction on how to slice this data? or is there a better known way to do it?
EDIT:
My bad for not explaining before: the goal of this slicing is to create a uniform, time-independent model for a slice, meaning that long and short pieces end up the same length, and that holds for each peak. If this is done per slice by just stretching, the data looks noisy, like this: (this is still in development, so I didn't mention it before)
And I don't know how to do it without the peaks, because every slice has different times for different parts (1 second, 1.1 seconds, etc)
Find the 4 longest non-intersecting subsets of your data where values remain within some tolerance of zero. In the case that you don't know how many beats you have to isolate, peak detection becomes more relevant, as the number of peaks above a given threshold defines how many sections you dissect.
I don't think you're the first person to attack this sort of problem...
https://www.biopac.com/knowledge-base/extracting-heart-rate-from-a-noisy-ecg-signal/
Edit:
As far as a peak finding algorithm I think this paper provides some methods.
http://www.ifi.uzh.ch/dbtg/teaching/thesesarch/ReportRSchneider.pdf
The approach labeled Significant Peak-Valley Algorithm more or less boils down to finding local extrema (minimum and maximum) in regions beyond (below and above respectively) a given threshold defined by some arbitrary number of standard deviations from the mean.
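A minimal sketch of that threshold idea in PHP (not the paper's exact algorithm): mark a point as a peak if it is a local maximum above mean + k·σ, and as a valley if it is a local minimum below mean − k·σ. The threshold factor $k is an assumption you would tune to your data.

// Find local extrema beyond a threshold of k standard deviations from the mean.
function find_peaks_and_valleys(array $data, $k = 1.0) {
    $n = count($data);
    $mean = array_sum($data) / $n;
    $var = 0.0;
    foreach ($data as $v) {
        $var += ($v - $mean) * ($v - $mean);
    }
    $sigma = sqrt($var / $n);

    $upper = $mean + $k * $sigma;
    $lower = $mean - $k * $sigma;

    $peaks = array();
    $valleys = array();
    for ($i = 1; $i < $n - 1; $i++) {
        $isMax = $data[$i] >= $data[$i - 1] && $data[$i] >= $data[$i + 1];
        $isMin = $data[$i] <= $data[$i - 1] && $data[$i] <= $data[$i + 1];
        if ($isMax && $data[$i] > $upper) {
            $peaks[] = $i;     // index of a significant peak
        } elseif ($isMin && $data[$i] < $lower) {
            $valleys[] = $i;   // index of a significant valley
        }
    }
    return array('peaks' => $peaks, 'valleys' => $valleys);
}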

Mixing other sources of random numbers with ones generated by /dev/urandom

Further to my question here, I'll be using the random_compat polyfill (which uses /dev/urandom) to generate random numbers in the 1 to 10,000,000 range.
I do realise that, all things being correct with how I code my project, the above tools should produce good (as in random/secure, etc.) data. However, I'd like to add extra sources of randomness into the mix, just in case six months down the line I read there is a patch available for my specific OS version to fix a major bug in /dev/urandom (or any other issue).
So, I was thinking I can get numbers from random.org and fourmilab.ch/hotbits
An alternative source would be some logs from a web site I operate - timed to the microsecond, if I ignore the date/time part and just take the microseconds - this has in effect been generated by when humans decide to click on a link. I know this may be classed as haphazard rather than random, but would it be good for my use?
Edit re timestamp logs: I will use PHP microtime(), which will create a log like:
0.**832742**00 1438282477
0.**57241**000 1438282483
0.**437752**00 1438282538
0.**622097**00 1438282572
I will just use the bolded portion.
So let's say I take two sources of extra random numbers, A and B, and the output of /dev/urandom, call that U and set ranges as follows:
A and B are 1 - 500,000
U is 1 - 9,000,000
Final random number is A+B+U
I will be needing several million final numbers between 1 and 10,000,000
The pool of A and B numbers will only contain a few thousand entries, but I think by using prime-numbered pool sizes I can stretch that into millions of A&B combinations, like so:
// In practice this pool will hold integers from two sources; the word arrays below
// just make the cycling visible. Because the pool sizes are coprime primes
// (7 and 11 instead of larger primes), the combined sequence only repeats after 77 draws.
$numbers = array("One", "Two", "Three", "Four", "Five", "Six", "Seven");
$colors  = array("Silver", "Gray", "Black", "Red", "Maroon", "Yellow", "Olive", "Lime", "Green", "Aqua", "Orange");
$ni = 0;
$ci = 0;
for ($i = 0; $i < $num_numbers_required; $i++) {
    // With integer pools this would be an addition; concatenation just shows the pairing.
    $offset = $numbers[$ni] . " " . $colors[$ci];
    if ($ni == 6) {      // wrap around at prime length 7
        $ni = 0;
    } else {
        $ni++;
    }
    if ($ci == 10) {     // wrap around at prime length 11
        $ci = 0;
    } else {
        $ci++;
    }
}
Does this plan make sense - is there any possibility I can actually make my end result less secure by doing all this? And what of my idea to use timestamp data?
Thanks in advance.
I would suggest reading RFC4086, section 5. Basically it talks about how to "mix" different entropy sources without compromising security or bias.
In short, you need a "mixing function". You can do this with xor, where you simply set the result to the xor of the inputs: result = A xor B.
The problem with xor is that if the numbers are correlated in any way, it can introduce strong bias into the result. For example, if bits 1-4 of A and B are the current timestamp, then the result's first 4 bits will always be 0.
Instead, you can use a stronger mixing function based on a cryptographic hash function. So instead of A xor B you can do HMAC-SHA256(A, B). This is slower, but also prevents any correlation from biasing the result.
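A minimal sketch of that mixing step in PHP, assuming $a and $b are byte strings from your two sources; the final fold into the 1 to 10,000,000 range is an illustrative shortcut, and the modulo step carries a negligible but nonzero bias that a rejection loop would remove.

// Mix two independent byte strings with HMAC-SHA256 rather than XOR,
// so correlation between the inputs cannot bias the output.
function mix_entropy($a, $b) {
    return hash_hmac('sha256', $a, $b, true); // 32 raw bytes
}

// Example: fold the mixed bytes into an integer in 1..10,000,000.
$mixed = mix_entropy(random_bytes(16), (string) microtime(true));
$value = (hexdec(bin2hex(substr($mixed, 0, 7))) % 10000000) + 1;
echo $value, "\n";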
This is the strategy that I used in RandomLib. I did this because not every system has every method of generation. So I pull as many methods as I can, and mix them strongly. That way the result is never weaker than the strongest method.
HOWEVER, I would ask why. If /dev/urandom is available, you're not going to do better than it. The reason is simple: even if you call random.org for more entropy, your call is encrypted using random keys generated from /dev/urandom, meaning that if an attacker can compromise /dev/urandom, your server is toast and you will be spinning your wheels trying to make it better.
Instead, simply use /dev/urandom and keep your OS updated...

PHP functions for correct statistics of data

I am not skilled in the world of statistics, so I hope this will be easy for someone; my lack of skill also made it very hard to find the right search terms on this topic, so I may have missed an existing answer. Anyway: I am looking at arrays of data, say CPU usage for example. How can I capture accurate information in as few data points as possible for, say, a set of 1-second samples of CPU usage over one hour, where the first 30 minutes are at 0% and the second 30 minutes are at 100%? Right now the only single data point I can think of is the mean, which is 50% and not useful at all in this case. Another case is when the usage graph is like a wave, bouncing evenly up and down between 0 and 100, yet still giving a mean of 50%. How can I capture this data? Thanks.
If I understand your question, it is really more of a statistics question than a programming question. Do you mean, what is the best way to capture a population curve with the fewest variables possible?
Firstly, the assumptions behind most standard statistics imply that the system is more or less stable (although if the system is unstable, the numbers you get will let you know, because they will be nonsensical).
The main measures that you need statistically are the mean, the population size and the standard deviation. From these, you can calculate the rough bell curve defining the population curve, and know the accuracy of the curve based on the scale of the standard deviation.
This gives you a three variable schema for a standard bell curve.
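For reference, a minimal PHP sketch of those three measures for an array of samples; whether you divide by n or n − 1 for the standard deviation depends on whether you treat the data as the whole population or a sample.

// Summarize a series with count, mean and standard deviation.
function summarize(array $samples) {
    $n = count($samples);
    $mean = array_sum($samples) / $n;
    $sumSq = 0.0;
    foreach ($samples as $v) {
        $sumSq += ($v - $mean) * ($v - $mean);
    }
    return array(
        'n'      => $n,
        'mean'   => $mean,
        'stddev' => sqrt($sumSq / $n), // population form; use ($n - 1) for the sample form
    );
}

// The asker's example: 30 minutes at 0% then 30 minutes at 100%
// gives mean 50 with a standard deviation of 50, which is the missing signal.
print_r(summarize(array_merge(array_fill(0, 1800, 0), array_fill(0, 1800, 100))));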
If you want to get in further detail, you can add Cpk, Ppk, which are calculated fields.
Otherwise, you may need to get into non-linear regression and curve fitting which is best handled on a case by case basis (not great for programming).
Check out the following sites for calculating the Cp, Cpk, Pp and Ppk:
http://www.qimacros.com/control-chart-formulas/cp-cpk-formula/
http://www.macroption.com/population-sample-variance-standard-deviation/

Formula to calculate the best overall score/review/etc from data in PHP

I'm having trouble wording my problem to search for it, so if anyone could point me in the right direction it would be appreciated.
I have multiple scores given out of 5 for a series of objects.
How can I find which object has the best overall rating? A similar formula to Amazon's reviews or Reddit's best comments (probably a lot more basic?), so not necessarily finding the highest average score but incorporating the number of reviews given to get the "best".
Any ideas?
This seems to be a classical application of the Friedman test: "n wine judges each rate k different wines. Are any wines ranked consistently higher or lower than the others?" Friedman test is implemented in many statistical packages, e.g., in R: friedman.test.
The Friedman test will return a p-value. If the p-value is not significant, there is no reason to assume that some of the objects are consistently ranked higher than others. If the p-value is significant, then you know that some objects have been ranked higher than others, but you still do not know which ones. Hence, appropriate post-hoc multiple comparison tests should be performed.
A number of different post-hoc tests can be performed, see e.g., for R code of an example post-hoc analysis http://www.r-statistics.com/2010/02/post-hoc-analysis-for-friedmans-test-r-code/
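If you did want to compute the test statistic in PHP rather than R, here is a minimal sketch. It assumes a complete table (every judge scores every object), gives tied scores their average rank, omits the usual tie-correction factor, and you would still compare the result against a chi-squared distribution with k - 1 degrees of freedom to get a p-value.

// Average ranks (1-based) for one judge's row of scores, ties averaged.
function ranks_with_ties(array $row) {
    $cols = array_keys($row);
    usort($cols, function ($a, $b) use ($row) { return $row[$a] <=> $row[$b]; });
    $ranks = array();
    $i = 0;
    $m = count($cols);
    while ($i < $m) {
        $j = $i;
        while ($j + 1 < $m && $row[$cols[$j + 1]] == $row[$cols[$i]]) {
            $j++;
        }
        $avg = ($i + $j) / 2 + 1; // average rank of the tied group
        for ($t = $i; $t <= $j; $t++) {
            $ranks[$cols[$t]] = $avg;
        }
        $i = $j + 1;
    }
    return $ranks;
}

// Friedman test statistic for an n (judges) x k (objects) table of scores.
function friedman_statistic(array $scores) {
    $n = count($scores);          // judges
    $k = count(reset($scores));   // objects
    $rankSums = array_fill(0, $k, 0.0);
    foreach ($scores as $row) {
        foreach (ranks_with_ties(array_values($row)) as $col => $rank) {
            $rankSums[$col] += $rank;
        }
    }
    $sumSq = 0.0;
    foreach ($rankSums as $R) {
        $sumSq += $R * $R;
    }
    return 12.0 / ($n * $k * ($k + 1)) * $sumSq - 3.0 * $n * ($k + 1);
}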

Algorithm to evenly distribute "prizes" / no variance lottery

My problem: I want to make a "kind" lottery process. This algorithm will distribute prizes evenly if possible. This could be considered unfair to the people who buy a ticket for every prize, since they will be more likely to win the unpopular prizes, but never mind that; we may say that the prizes are roughly the same. The algorithm will help kill variance and reduce the dice-rolling needed to win prizes. (Yep, boring.)
I will have N competitions where you can win a prize. Each of the M persons can buy a ticket for any of the N competitions.
So an example, here are prizes and people who have bought tickets:
Prize1=[Pete,Kim, Jim]
Prize2=[Jim, Kim]
Prize3=[Roger, Kim]
Prize4=[Jim]
There are 4 prizes and 4 unique names, so it should be possible to distribute it evenly.
The example may be easy to solve, you should find it out in 15 seconds, but when M and N increase it gets much worse.
I'm trying to make a general algorithm, but it's hard. I need some good tips or even better the solution or link to a solution.
Theory: you have a bipartite graph, and you have to find a perfect matching. There is a perfect matching in a bipartite graph with parts A and B if:
|A| = |B|
The graph satisfies the Hall condition
If a perfect matching exists, you can run the Hungarian algorithm to find it.
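Since the prizes here are unweighted, a plain augmenting-path matching (Kuhn's algorithm) is already enough; the Hungarian algorithm only becomes necessary once you attach weights. A minimal sketch in PHP, using the example data from the question:

// Maximum bipartite matching via augmenting paths (Kuhn's algorithm).
// $tickets maps each prize to the list of people holding a ticket for it.
function assign_prizes(array $tickets) {
    $prizeOfPerson = array(); // person => prize currently matched to them

    $tryAssign = function ($prize, array &$visited) use (&$tryAssign, &$prizeOfPerson, $tickets) {
        foreach ($tickets[$prize] as $person) {
            if (isset($visited[$person])) {
                continue;
            }
            $visited[$person] = true;
            // The person is free, or their current prize can be re-assigned elsewhere.
            if (!isset($prizeOfPerson[$person])
                || $tryAssign($prizeOfPerson[$person], $visited)) {
                $prizeOfPerson[$person] = $prize;
                return true;
            }
        }
        return false;
    };

    foreach (array_keys($tickets) as $prize) {
        $visited = array();
        $tryAssign($prize, $visited);
    }

    return array_flip($prizeOfPerson); // prize => person (unmatched prizes are absent)
}

$tickets = array(
    'Prize1' => array('Pete', 'Kim', 'Jim'),
    'Prize2' => array('Jim', 'Kim'),
    'Prize3' => array('Roger', 'Kim'),
    'Prize4' => array('Jim'),
);
print_r(assign_prizes($tickets)); // Pete, Kim, Roger and Jim each get one prize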
You want to look for a job-assignment algorithm, for example the Hungarian algorithm, which finds a weighted perfect matching in a bipartite graph (the all-pairs Floyd-Warshall algorithm may also be worth a look). My idea is that this can be represented as a bipartite graph. This is not an easy task to solve.