Collect Lowest Numbers Algorithm - php

I'm looking for an algorithm (or PHP code, I suppose) to end up with the 10 lowest numbers from a group of numbers. I was thinking of making a ten item array, checking to see if the current number is lower than one of the numbers in the array, and if so, finding the highest number in the array and replacing it with the current number.
However, I'm planning on finding the lowest 10 numbers from thousands, and was thinking there might be a faster way to do it. I plan on implementing this in PHP, so any native PHP functions are usable.

Sort the array and use the ten first/last entries.
Honestly: sorting an array with a thousand entries costs less time than it takes you to blink.

What you're looking for is called a selection algorithm. The Wikipedia page on the subject has a few subsections in the selecting k smallest or largest elements section. When the list is large enough, you can beat the time required for the naive "sort the whole list and choose the first 10" algorithm.

Naive approach is to just sort the input. It's likely fast enough, so just try it and profile it before doing anything more complicated.
Potentially faster approach: Linearly search the input, but keep the output array sorted to make it easier to determine if the next input belongs in the array or not. Pseudocode:
output[0-9] = input[0-9];
sort(output);
for i=10..n-1
if input[i] < output[9]
insert(input[i])
where insert(x) will find the right spot (binary search) and do the appropriate shifting.
But seriously, just try the naive approach first.

Where are you getting this group of numbers?
If your List of numbers is already in an array you could simply do a sort(), and then a array_slice() to get the first 10.

I doesn't matter much for a small array, but as it gets larger a fast and easy way to increase processing speed is to take advantage of array key indexing, which for 1 mill. rows will use about 40% of the time. Example:
// sorting array values
$numbers = array();
for($i = 0; $i < 1000000; ++$i)
{
$numbers[$i] = rand(1, 999999);
}
$start = microtime(true);
sort($numbers);
$res = array_slice($numbers, 0, 10, true);
echo microtime(true) - $start . "\n";
// 2.6612658500671
print_r($res);
unset($numbers, $res, $start);
// sorting array keys
$numbers = array();
for($i = 0; $i < 1000000; ++$i)
{
$numbers[rand(1, 999999)] = $i;
}
$start = microtime(true);
ksort($numbers);
$res = array_keys(array_slice($numbers, 0, 10, true));
echo microtime(true) - $start . "\n";
// 0.9651210308075
print_r($res);
But if the array data is from a database the fastest is probably to just sort it there:
SELECT number_column FROM table_with_numbers ORDER BY number_column LIMIT 10

Create a sorted set (TreeSet in Java, I don't know about PHP), and add the first 10 numbers. Now iterate over the rest of the numbers Iterate over all your numbers, add the new one, then remove the biggest number from the set.
This algorithm is O(n) if n >> 10.

I would use a heap with 10 elements and the highest number at the root of the tree. Then start at the beginning of the list of numbers:
If the heap has less than 10 elements: add the number to the list
Otherwise, if the number is smaller than the highest number in the heap, remove the highest number in the heap, and then add the current number to the list
Otherwise, ignore it.
You will end up with the 10 lowest numbers in the heap. If you are using an array as the heap data structure, then you can just use the array directly.
(alternatively: you can slice out the first 10 elements, and heapify them instead of using the first step above, which will be slightly faster).
However, as other people have noted, for 1000 elements, just sort the list and take the first 10 elements.

Related

Getting every combination of X numbers given Y numbers?

I've come to a mathematical problem which for I can't program the logic.
Let me explain it with an example:
Let's say I have 4 holes and 3 marbles, the holes are in order and my marbles are A,B and C and also in order.
I need to get every posible ORDERED combination:
ABC4
AB3C
A2BC
1ABC
This is very simple, but what if the number of holes changes? Let's say now I have 5 holes.
ABC45
AB3C5
A2BC5
1ABC5
AB34C
A2B4C
1AB4C
A23BC
1A3BC
12ABC
Now let's say we have 5 holes and 4 marbles.
ABCD5
ABC4D
AB3CD
A2BCD
1ABCD
And this can be any number of holes and any number of marbles.
The number of combinations is given by:
$combinations = factorial($number_of_holes)/(factorial($number_of_marbles)*factorial($number_of_holes-$number_of_marbles)))
(Here it is the factorial function in case you need it)
function factorial($number) {
if ($number < 2) {
return 1;
} else {
return ($number * factorial($number-1));
}
}
What I need and can't figure out how to program, is a function or a loop or something, that returns an array with the position of the holes, given X numbers of holes and Y number of marbles.
For first example it would be: [[4],[3],[2],[1]], for second: [[4,5],[2,5],[1,5],[3,4],[2,4],[1,5],[2,3],[1,3],[1,2]], for third: [[5],[4],[3],[2],[1]].
It doesn't have to be returned in order, I just need all the elements.
As you can see, another approach is the complementary or inverse or don't know how to call it, but the solution is every combinations of X number of free holes given Y number of holes, so, If I have 10 holes, and 5 marbles, there would be 5 free holes, the array returned would be every combination of 5 that can be formed with (1,2,3,4,5,6,7,8,9,10), which are 252 combinations, and what I need is the 252 combinations.
Examples for the 2nd approach:
Given an array=[1,2,3,4], return every combination for sets of 2 and 3.
Sets of 2
[[1,2],[1,3],[1,4],[2,3],[2,4],[3,4]]
Sets of 3
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
What I need is the logic to do this, I'm trying to do it in PHP, but I just can't figure out how to do it.
The function would receive the array and the set size and would return the array of sets:
function getCombinations($array,$setize){
//magic code which I can't figure out
return array(sets);
}
I hope this is clear enough and someone can help me, I've been stuck for several days now, but it seems to be just too much for me to handle by myself.
This post, PHP algorithm to generate all combinations of a specific size from a single set, is for all possible combinations, repeating the elements and order doesn't matter, its a good lead, I did read it, but it doesn't solve my problem, it's very different. I need them without repeating the elements and ordered as explained.
Let's say if I have already a set of [3,4] in my array, I don't want [4,3] as an other set.
Here's a recursive solution in PHP:
function getCombinations($array, $setsize){
if($setsize == 0)
return [[]];
// generate combinations including the first element by generating combinations for
// the remainder of the array with one less element and prepending the first element:
$sets = getCombinations(array_slice($array, 1), $setsize - 1);
foreach ($sets as &$combo) {
array_unshift($combo, $array[0]);
}
// generate combinations not including the first element and add them to the list:
if(count($array) > $setsize)
$sets = array_merge($sets, getCombinations(array_slice($array, 1), $setsize));
return $sets;
}
// test:
print_r(getCombinations([1, 2, 3, 4], 3));
Algorithm works like this:
If setsize is 0 then you return a single, empty combination
Otherwise, generate all combinations that include the first element, by recursively generating all combinations off the array excluding the first element with setsize - 1 elements, and then prepending the first element to each of them.
Then, if the array size is greater than setsize (meaning including the first element is not compulsory), generate all the combinations for the rest of the list and add them to the ones we generated in the second step.
So basically at each step you need to consider whether an element will be included or excluded in the combination, and merge together the set of combinations representing both choices.

Can I do big oh of 1 lookups on short-hand arrays?

This is a question on ability of PHP to use arrays in the way I want.
Consider arrays below:
1) array(10, 20, 30, 40, 50, 60, 70); //aka "shorthand"
2) array(10 => 1, 20 => 1, 30 => 1, 40 => 1, 50 => 1, 60 => 1, 70 => 1);
For my purposes they both have the same numbers essentially, just represented differently.
Naturally, I prefer the first way, as it lists just the numbers I need. It also makes more sense to me as my values are the numbers that I want to operate on. In #2 the values are all 1s, and if I need to loop through my "values of interest", I actually need to loop through array keys. In my legacy code however my arrays are a tad longer and tend to use function in_array, which has big oh of N. I needed a faster lookup on values of the array, so I rewrote the array as in way #2 and used function array_key_exists, which I presume is big oh of 1. Due to large codebase I am basically ending up with two array representations that require different code to work on them. And I don't like that.
I prefer to write my arrays as in #1... So in some ideal world I could write arrays as in #1 but have big oh of 1 lookup on them, when I need to check whether a value exists in array.
Is there such a way? If not, what is the next best thing?
Update: using array flip option:
3) if (in_array(10, [10,20,30,40,50,60,70])) {}
4) if (array_key_exists(10, array_flip([10,20,30,40,50,60,70]))) {}
You can use hashes/dictionaries to achieve a theoretical O(1) speed. Your suggestion #2 is capable of doing that.
Hashes/dictionaries are O(1) best case and average case, O(N) in worst case.
If you are bonded to the use of arrays, the answer to your question depends on whether the array's content is changed frequently or not.
Array doesn't change frequently
If the array is not changed frequently, you could use the function array_flip once, in order to make a searchable version of your array and keep it for as long as possible.
In this way you have once an operation that is O(n), and every time you search you have operations that are near to O(1), using array_key_exists.
Array changes frequently
In this case to use in_array, wich is O(n), is almost equivalent to (if not better than) using array_flip with array_key_exists, since array_flip is O(n) itself and array_key_exists is O(1) at best, and you would have to perform both of them everytime.
It would be an interesting solution to keep the array sorted, and then use an optimized search algorithm, that, with an ordered array would be at worst O(log n).
Binary search could be a good search algorithm for that purpose:
function binarySearch($needle, $array)
{
$start = 0;
$end = count($array) - 1;
while ($start <= $end) {
$middle = (int) ($start + ($end - $start) / 2);
if ($needle < $array[$middle]) {
$end = $middle - 1;
} else if ($needle > $array[$middle]) {
$start = $middle + 1;
} else {
return $needle;
}
}
return false;
}
This binary search sample is taken from here.
To keep the array sorted you would have to sort it the first time (sort function should be fine for that, since you should use it once), and then add new items splitting the array where the element should be placed, put the element at the end of the first part of the array and merge the arrays back.
This would make the array maintainance more expensive, but would improve search performance.

How unique a 5-digit mt_rand() number is?

I am just wondering, how unique is a mt_rand() number is, if you draw 5-digits number?
In the example, I tried to get a list of 500 random numbers with this function and some of them are repeated.
http://www.php.net/manual/en/function.mt-rand.php
<?php
header('Content-Type: text/plain');
$errors = array();
$uniques = array();
for($i = 0; $i < 500; ++$i)
{
$random_code = mt_rand(10000, 99999);
if(!in_array($random_code, $uniques))
{
$uniques[] = $random_code;
}
else
{
$errors[] = $random_code;
}
}
/**
* If you get any data in this array, it is not exactly unique
* Run this script for few times and you may see some repeats
*/
print_r($errors);
?>
How many digits may be required to ensure that the first 500 random numbers drawn in a loop are unique?
If numbers are truly random, then there's a probability that numbers will be repeated. It doesn't matter how many digits there are -- adding more digits makes it much less likely there will be a repeat, but it's always a possibility.
You're better off checking if there's a conflict, then looping until there isn't like so:
$uniques = array();
for($i = 0; $i < 500; $i++) {
do {
$code = mt_rand(10000, 99999);
} while(in_array($code, $uniques));
$uniques[] = $code
}
Why not use range, shuffle, and slice?
<?php
$uniques = range(10000, 99999);
shuffle($uniques);
$uniques = array_slice($uniques, 0, 500);
print_r($uniques);
Output:
Array
(
[0] => 91652
[1] => 87559
[2] => 68494
[3] => 70561
[4] => 16514
[5] => 71605
[6] => 96725
[7] => 15908
[8] => 14923
[9] => 10752
[10] => 13816
*** truncated ***
)
This method is less expensive as it does not search the array each time to see if the item is already added or not. That said, it does make this approach less "random". More information should be provided on where these numbers are going to be used. If this is an online gambling site, this would be the worst! However if this was used in returning "lucky" numbers for a horoscope website, I think it would be fine.
Furthermore, this method could be extended, changing the shuffle method to use mt_rand (where as the original method simply used rand). It may also use openssl_random_pseudo_bytes, but that might be overkill.
The birthday paradox is at play here. If you pick a random number from 10000-99999 500 times, there's a good chance of duplicates.
Intuitive idea with small numbers
If you flip a coin twice, you'll get a duplicate about half the time. If you roll a six-sided die twice, you'll get a duplicate 1/6 of the time. If you roll it 3 times, you'll get a duplicate 4/9 (44%) of the time. If you roll it 4 times you'll get at least one duplicate 13/18 (63.33%). Roll it a fifth time and it's 49/54 (90.7%). Roll it a sixth time and it's 98.5%. Roll it a seventh time and it's 100%.
If you take replace the six-sided die with a 20-sided die, the probabilities grow a bit more slowly, but grow they do. After 3 rolls you have a 14.5% chance of duplicates. After 6 rolls it's 69.5%. After 10 rolls it's 96.7%, near certainty.
The math
Let's define a function f(num_rolls, num_sides) to generalize this to any number of rolls of any random number generator that chooses out of a finite set of choices. We'll define f(num_rolls, num_sides) to be the probability of getting no duplicates in num_rolls of a num_sides-side die.
Now we can try to build a recursive definition for this. To get num_rolls unique numbers, you'll need to first roll num_rolls-1 unique numbers, then roll one more unique number, now that num_rolls-1 numbers have been taken. Therefore
f(num_rolls, num_sides) =
f(num_rolls-1, num_sides) * (num_sides - (num_rolls - 1)) / num_sides
Alternately,
f(num_rolls + 1, num_side) =
f(num_rolls, num_sides) * (num_sides - num_rolls) / num_sides
This function follows a logistic decay curve, starting at 1 and moving very slowly (since num_rolls is very low, the change with each step is very small), then slowly picking up speed as num_rolls grows, then eventually tapering off as the function's value gets closer and closer to 0.
I've created a Google Docs spreadsheet that has this function built in as a formula to let you play with this here: https://docs.google.com/spreadsheets/d/1bNJ5RFBsXrBr_1BEXgWGein4iXtobsNjw9dCCVeI2_8
Tying this back to your specific problem
You've generated rolled a 90000-sided die 500 times. The spreadsheet above suggests you'd expect at least one duplicate pair about 75% of the time assuming a perfectly random mt_rand. Mathematically, the operation your code was performing is choosing N elements from a set with replacement. In other words, you pick a random number out of the bag of 90000 things, write it down, then put it back in the bag, then pick another random number, repeat 500 times. It sounds like you wanted all of the numbers to be distinct, in other words you wanted to choose N elements from a set without replacement. There are a few algorithms to do this. Dave Chen's suggestion of shuffle and then slice is a relatively straightforward one. Josh from Qaribou's suggestion of separately rejecting duplicates is another possibility.
Your question deals with a variation of the "Birthday Problem" which asks if there are N students in a class, what is the probability that at least two students have the same birthday? See Wikipedia: The "Birthday Problem".
You can easily modify the formula shown there to answer your problem. Instead of having 365 equally probable possibilities for the birthday of each student, you have 90001 (=99999-10000+2) equally probable integers that can be generated between 10000 and 99999. The probability that if you generate 500 such numbers that at least two numbers will be the same is:
P(500)= 1- 90001! / ( 90001^n (90001 - 500)! ) = 0.75
So there is a 75% chance that at least two of the 500 numbers that you generate will be the same or, in other words, only a 25% chance that you will be successful in getting 500 different numbers with the method you are currently using.
As others here have already suggested, I would suggest checking for repeated numbers in your algorithm rather than just blindly generating random numbers and hoping that you don't have a match between any pair of numbers.

Amount of data stored in a PHP array

I'm looking for a way to measure the amount of data stored in a PHP array. I'm not talking about the number of elements in the array (which you can figure out with count($array, COUNT_RECURSIVE)), but the cumulative amount of data from all the keys and their corresponding values. For instance:
array('abc'=>123); // size = 6
array('a'=>1,'b'=>2); // size = 4
As what I'm interested in is order of magnitude rather than the exact amount (I want to compare the processing memory and time usage versus the size of the arrays) I thought about using the following trick:
strlen(print_r($array,true));
However the amount of overhead coming from print_r varies depending on the structure of the array which doesn't give me consistent results:
echo strlen(print_r(array('abc'=>123),true)); // 27
echo strlen(print_r(array('a'=>1,'b'=>2),true)); // 35
Is there a way (ideally in a one-liner and without impacting too much performance as I need to execute this at run-time on production) to measure the amount of data stored in an array in PHP?
Does this do the trick:
<?php
$arr = array('abc'=>123);
echo strlen(implode('',array_keys($arr)).implode('',$arr));
?>
Short answer: mission impossible
You could try something like:
strlen(serialize($myArray)) // either this
strlen(json_encode($myArray)) // or this
But to approximate the true memory footprint of an array, you will have to do a lot more than that. If you're looking for a ballpark estimate, arrays take 3-8x more than their serialized version, depending on what you store in them and how many elements you have. It increases gradually, in bigger and bigger chunks as your array grows. To give you an idea of what's happening, here's an array estimation function I came up with, after many hours of trying, for one-level arrays only:
function estimateArrayFootprint($a) { // copied from one of my failed quests :(
$size = 0;
foreach($a as $k=>$v) {
foreach([$k,$v] as $x) {
$n = strlen($x);
do{
if($n>8192 ) {$n = (1+($n>>12)<<12);break;}
if($n>1024 ) {$n = (1+($n>> 9)<< 9);break;}
if($n>512 ) {$n = (1+($n>> 8)<< 8);break;}
if($n>0 ) {$n = (1+($n>> 5)<< 5);break;}
}while(0);
$size += $n + 96;
}
}
return $size;
}
So that's how easy it is, not. And again, it's not a reliable estimation, it probably depends on the PHP memory limit, the architecture, the PHP version and a lot more. The question is how accurately do you need this value.
Also let's not forget that these values came from a memory_get_usage(1) which is not very exact either. PHP allocates memory in big blocks in order to avoid a noticeable overhead as your string/array/whatever else grows, like in a for(...) $x.="yada" situation.
I wish I could say anything more useful.

What is the best algorithm to see if my number is in an array of ranges?

I have a 2 dimensional arrays in php containing the Ranges. for example:
From.........To
---------------
125..........3957
4000.........5500
5217628......52198281
52272128.....52273151
523030528....523229183
and so on
and it is a very long list. now I want to see if a number given by user is in range.
for example numbers 130, 4200, 52272933 are in my range but numbers 1, 5600 are not.
of course I can count all indexes and see if my number is bigger than first and smaller than second item. but is there a faster algorithm or a more efficient way of doing it using php function?
added later
It is sorted. it is actually numbers created with ip2long() showing all IPs of a country.
I just wrote a code for it:
$ips[1] = array (2,20,100);
$ips[2] = array (10,30,200);
$n=11;// input ip
$count = count($ips);
for ($i = 0; $i <= $count; $i++) {
if ($n>=$ips[1][$i]){
if ($n<=$ips[2][$i]){
echo "$i found";
break;
}
}else if($n<$ips[1][$i]){echo "not found";break;}
}
in this situation numbers 2,8,22,and 200 are in range. but not numbers 1,11,300
Put the ranges in a flat array, sorted from lower to higher, like this:
a[0] = 125
a[1] = 3957
a[2] = 4000
a[3] = 5500
a[4] = 5217628
a[5] = 52198281
a[6] = 52272128
a[7] = 52273151
a[8] = 523030528
a[9] = 523229183
Then do a binary search to determine at what index of this array the number in question should be inserted. If the insertion index is even then the number is not in any sub-range. If the insertion index is odd, then the number falls inside one of the ranges.
Examples:
n = 20 inserts at index 0 ==> not in a range
n = 126 inserts at index 1 ==> within a range
n = 523030529 inserts at index 9 ==> within a range
You can speed things up by implementing a binary search algorithm. Thus, you don't have to look at every range.
Then you can use in_array to check if the number is in the array.
I'm not sure if I got you right, do your arrays really look like this:
array(125, 126, 127, ..., 3957);
If so, what's the point? Why not just have?
array(125, 3957);
That contains all the information necessary.
The example you give suggests that the numbers may be large and the space sparse by comparison.
At that point, you don't have very many options. If the array is sorted, binary search is about all there is. If the array is not sorted, you're down to plain, old CS101 linear search.
The correct data structure to use for this problem is an interval tree. This is, in general, much faster than binary search.
I am assuming that the ranges do not overlap.
If that is the case, you can maintain a map data structure that is keyed on the lower value of the range.
Now all you have to do (given the number N) is to find the key in the map that is just lower than N (using binary search - logarithmic complexity) and then check if the number is lesser than the right value.
Basically, it is a binary search (logarithmic) on the constructed map.
From a pragmatic point of view, a linear search may very well turn out to be the fastest lookup method. Think of page faults and hard disk seek time here.
If your array is large enough (whatever "enough" actually means), it may be wise to stuff your IPs in a SQL database and let the database figure out how to efficiently compute SELECT ID FROM ip_numbers WHERE x BETWEEN start AND end;.

Categories