Multiple foreach with over 37 million possibilities - php

I've been tasked with creating a list of all possibilities using data in 8 blocks.
The 8 blocks have the following number of possibilities:
*Block 1: 12 possibilities
*Block 2: 8 possibilities
*Block 3: 8 possibilities
*Block 4: 11 possibilities
*Block 5: 16 possibilities
*Block 6: 11 possibilities
*Block 7: 5 possibilities
*Block 8: 5 possibilities
This gives a potential number of 37,171,200 possibilities.
I tried simply nesting eight loops and only displaying the values that have the correct string length, like so:
foreach($block1 AS $b1){
    foreach($block2 AS $b2){
        foreach($block3 AS $b3){
            foreach($block4 AS $b4){
                foreach($block5 AS $b5){
                    foreach($block6 AS $b6){
                        foreach($block7 AS $b7){
                            foreach($block8 AS $b8){
                                if (strlen($b1.$b2.$b3.$b4.$b5.$b6.$b7.$b8) == 16)
                                {
                                    echo $b1.$b2.$b3.$b4.$b5.$b6.$b7.$b8.'<br/>';
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
However the execution time was far too long to compute. I was wondering if anyone knew of a simpler way of doing this?

You could improve your algorithm by caching the string prefixes and remembering their lengths. Then you don’t have to recompute them for each combination.
$len = 16;
// array for remaining characters per level
$r = array($len);
// array of level parts
$p = array();
foreach ($block1 AS &$b1) {
    // skip if already too long
    if (($r[0] - strlen($b1)) <= 0) continue;
    $r[1] = $r[0] - strlen($b1);
    $p[0] = $b1;
    foreach ($block2 AS &$b2) {
        if (($r[1] - strlen($b2)) <= 0) continue;
        $r[2] = $r[1] - strlen($b2);
        $p[1] = $b2;
        foreach ($block3 AS &$b3) {
            // … (the elided levels follow the same pattern, setting $r and $p for their level)
            foreach ($block8 AS &$b8) {
                $r[8] = $r[7] - strlen($b8);
                if ($r[8] == 0) {
                    $p[7] = $b8;
                    echo implode('', $p).'<br/>';
                }
            }
        }
    }
}
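Additionally, using references in foreach was meant to stop PHP from working on a copy of the array internally. On PHP 7 and later a by-value foreach does not copy the array unless it is modified inside the loop, so measure before relying on this.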

You could store the part of the concatenated string that is already known at each of the outer levels for later reuse, which avoids concatenating everything in the innermost loop:
foreach($block7 AS $b7){
    $precomputed7 = $precomputed6.$b7;
    foreach($block8 AS $b8){
        $precomputed8 = $precomputed7.$b8;
        if (strlen($precomputed8) == 16) {
            echo $precomputed8.'<br/>';
        }
    }
}
Do the same for the preceding levels. Then you could test at one of the outer loop levels for strings that are already longer than 16 characters, so you can short-circuit and skip the possibilities below them. But beware that the extra length check adds work on every iteration; depending on the input data, this second improvement may not be worth it at all.
Another idea is to precalculate the lengths for each block and then recurse on the arrays of lengths; calculating sums should be faster than concatenating strings and computing their lengths. For each vector of indexes whose lengths sum to 16, you can then easily output the full concatenated string.

Since you have that length requirement of 16 and assuming each (i) possibility in each (b) of the eight blocks has length x_i_b you can get some reduction by some cases becoming impossible.
For example, say we have length requirement 16, but only 4 blocks, with possibilities with lengths indicated
block1: [2,3,4]
block2: [5,6,7]
block3: [8,9,10]
block4: [9,10,11]
Then all of the possibilities are impossible since block 4's lengths are all too large to permit any combination of blocks 1 - 3 of making up the rest of the 16.
Now if your length is really 16, that means that your (possible) lengths range from 1 to 9, assuming no 0 lengths.
I can see two ways of approaching this:
Greedy
Dynamic Programming
Perhaps even combine them. For the Greedy approach, pick the biggest possibility in all the blocks, then the next biggest etc, and follow that through until you cross your threshold of 16. If you got all the blocks, then you can emit that one.
Whether or not you hit the threshold, you can then iterate through the remaining possibilities.
The dynamic approach means that you should store some of the results that you have already computed. For example, once you have a selection from some of the blocks that gives you a length of 7, you don't need to recompute it in future, but you can iterate through the remaining blocks to see if you can find a combination that gives you length 9.
EDIT: This is kind of like the knapsack problem but with the additional restriction of 1 choice per block per instance. Anyway, in terms of other optimizations, definitely preprocess the blocks into arrays of lengths only and keep a running sum at each iteration level. That way you only do 1 sum per iteration of each loop, rather than 8 sums per iteration. Also only concatenate strings if you need to emit the selection.
If you don't want a general solution (probably easier if you don't) then you can hand code a lot of problem-instance-specific speedups by excluding the largest too-small combination of lengths (and all selections smaller than that) and excluding the smallest too-large combination of lengths (and all selections larger).
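The ideas in the answers above (work with precomputed lengths, keep a running remainder, and skip branches that can no longer reach the target) can be sketched roughly as follows. This is only a sketch, not any poster's actual code: the function name combinationsOfLength and the sample blocks are made up for illustration, and it assumes every block is a non-empty array of strings.

```php
<?php
// Hypothetical helper: yield every concatenation that takes one string per
// block and is exactly $target characters long, pruning dead branches early.
function combinationsOfLength(array $blocks, int $target): Generator {
    $count = count($blocks);
    // $minTail[$i] / $maxTail[$i]: smallest / largest total length that
    // blocks $i..end can still contribute.
    $minTail = array_fill(0, $count + 1, 0);
    $maxTail = array_fill(0, $count + 1, 0);
    for ($i = $count - 1; $i >= 0; $i--) {
        $lengths = array_map('strlen', $blocks[$i]);
        $minTail[$i] = $minTail[$i + 1] + min($lengths);
        $maxTail[$i] = $maxTail[$i + 1] + max($lengths);
    }
    $walk = function (int $level, int $remaining, string $prefix)
        use (&$walk, $blocks, $count, $minTail, $maxTail): Generator {
        if ($level === $count) {
            if ($remaining === 0) {
                yield $prefix;
            }
            return;
        }
        foreach ($blocks[$level] as $part) {
            $left = $remaining - strlen($part);
            // Skip if the remaining blocks cannot fill exactly $left characters.
            if ($left < $minTail[$level + 1] || $left > $maxTail[$level + 1]) {
                continue;
            }
            yield from $walk($level + 1, $left, $prefix . $part);
        }
    };
    yield from $walk(0, $target, '');
}

// Made-up sample data: three small blocks, target length 4.
$blocks = [['a', 'bb'], ['c', 'ddd'], ['ee', 'f']];
foreach (combinationsOfLength($blocks, 4) as $string) {
    echo $string, "\n";
}
```

Because the generator yields results one at a time, nothing forces all matches into memory at once. When collecting them, pass false as the second argument to iterator_to_array so the repeated keys produced by yield from do not overwrite each other.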

If you can express this as a nested array, try a RecursiveIteratorIterator, http://php.net/manual/en/class.recursiveiteratoriterator.php

Related

Getting every combination of X numbers given Y numbers?

I've come to a mathematical problem which for I can't program the logic.
Let me explain it with an example:
Let's say I have 4 holes and 3 marbles, the holes are in order and my marbles are A,B and C and also in order.
I need to get every possible ORDERED combination:
ABC4
AB3C
A2BC
1ABC
This is very simple, but what if the number of holes changes? Let's say now I have 5 holes.
ABC45
AB3C5
A2BC5
1ABC5
AB34C
A2B4C
1AB4C
A23BC
1A3BC
12ABC
Now let's say we have 5 holes and 4 marbles.
ABCD5
ABC4D
AB3CD
A2BCD
1ABCD
And this can be any number of holes and any number of marbles.
The number of combinations is given by:
$combinations = factorial($number_of_holes) / (factorial($number_of_marbles) * factorial($number_of_holes - $number_of_marbles));
(Here it is the factorial function in case you need it)
function factorial($number) {
if ($number < 2) {
return 1;
} else {
return ($number * factorial($number-1));
}
}
What I need and can't figure out how to program, is a function or a loop or something, that returns an array with the position of the holes, given X numbers of holes and Y number of marbles.
For the first example it would be: [[4],[3],[2],[1]], for the second: [[4,5],[3,5],[2,5],[1,5],[3,4],[2,4],[1,4],[2,3],[1,3],[1,2]], for the third: [[5],[4],[3],[2],[1]].
It doesn't have to be returned in order, I just need all the elements.
As you can see, another approach is the complementary (or inverse) one: the solution is every combination of X free holes given Y holes. So if I have 10 holes and 5 marbles, there would be 5 free holes, and the array returned would be every combination of 5 that can be formed from (1,2,3,4,5,6,7,8,9,10). That is 252 combinations, and what I need is those 252 combinations.
Examples for the 2nd approach:
Given an array=[1,2,3,4], return every combination for sets of 2 and 3.
Sets of 2
[[1,2],[1,3],[1,4],[2,3],[2,4],[3,4]]
Sets of 3
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
What I need is the logic to do this, I'm trying to do it in PHP, but I just can't figure out how to do it.
The function would receive the array and the set size and would return the array of sets:
function getCombinations($array, $setsize){
    // magic code which I can't figure out
    return $sets;
}
I hope this is clear enough and someone can help me, I've been stuck for several days now, but it seems to be just too much for me to handle by myself.
This post, PHP algorithm to generate all combinations of a specific size from a single set, is for all possible combinations, with repeated elements and without regard to order. It's a good lead and I did read it, but it doesn't solve my problem; mine is very different. I need them without repeated elements and ordered as explained.
Let's say I already have a set of [3,4] in my array; I don't want [4,3] as another set.
Here's a recursive solution in PHP:
function getCombinations($array, $setsize){
    if($setsize == 0)
        return [[]];
    // generate combinations including the first element by generating combinations for
    // the remainder of the array with one less element and prepending the first element:
    $sets = getCombinations(array_slice($array, 1), $setsize - 1);
    foreach ($sets as &$combo) {
        array_unshift($combo, $array[0]);
    }
    unset($combo); // break the reference left over from the loop
    // generate combinations not including the first element and add them to the list:
    if(count($array) > $setsize)
        $sets = array_merge($sets, getCombinations(array_slice($array, 1), $setsize));
    return $sets;
}
// test:
print_r(getCombinations([1, 2, 3, 4], 3));
Algorithm works like this:
If setsize is 0 then you return a single, empty combination
Otherwise, generate all combinations that include the first element, by recursively generating all combinations of the array excluding the first element with setsize - 1 elements, and then prepending the first element to each of them.
Then, if the array size is greater than setsize (meaning including the first element is not compulsory), generate all the combinations for the rest of the list and add them to the ones we generated in the second step.
So basically at each step you need to consider whether an element will be included or excluded in the combination, and merge together the set of combinations representing both choices.
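To tie the include-or-exclude recursion back to the holes and marbles from the question, here is a small sketch that picks the free holes and then renders each layout. The helper names combinations and renderLayout are made up for this illustration, and combinations is a variant of the recursion described above with an extra guard for arrays that are too short.

```php
<?php
// Hypothetical helper: all $k-element combinations of $items, keeping input order.
function combinations(array $items, int $k): array {
    if ($k === 0) {
        return [[]];
    }
    if (count($items) < $k) {
        return [];
    }
    $head = $items[0];
    $tail = array_slice($items, 1);
    $result = [];
    // Combinations that include the first element.
    foreach (combinations($tail, $k - 1) as $combo) {
        array_unshift($combo, $head);
        $result[] = $combo;
    }
    // Combinations that exclude the first element.
    return array_merge($result, combinations($tail, $k));
}

// Hypothetical helper: free holes show their number, the others take the
// marbles in order.
function renderLayout(int $holes, array $free, array $marbles): string {
    $out = '';
    $next = 0;
    for ($hole = 1; $hole <= $holes; $hole++) {
        $out .= in_array($hole, $free, true) ? (string) $hole : $marbles[$next++];
    }
    return $out;
}

// 4 holes and 3 marbles leave exactly 1 free hole.
$layouts = [];
foreach (combinations(range(1, 4), 1) as $free) {
    $layouts[] = renderLayout(4, $free, ['A', 'B', 'C']);
}
print_r($layouts);
```

For 4 holes and 3 marbles this produces the same four layouts listed in the question, only in the opposite order (the free hole moves from position 1 to position 4).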

PHP built in functions complexity (isAnagramOfPalindrome function)

I've been googling for the past 2 hours, and I cannot find a list of php built in functions time and space complexity. I have the isAnagramOfPalindrome problem to solve with the following maximum allowed complexity:
expected worst-case time complexity is O(N)
expected worst-case space complexity is O(1) (not counting the storage required for input arguments).
where N is the input string length. Here is my simplest solution, but I don't know if it is within the complexity limits.
class Solution {
// Function to determine if the input string can make a palindrome by rearranging it
static public function isAnagramOfPalindrome($S) {
// here I am counting how many characters have odd number of occurrences
$odds = count(array_filter(count_chars($S, 1), function($var) {
return($var & 1);
}));
// If the string length is odd, then a palindrome would have 1 character with odd number occurrences
// If the string length is even, all characters should have even number of occurrences
return (int)($odds == (strlen($S) & 1));
}
}
echo Solution :: isAnagramOfPalindrome($_POST['input']);
Anyone have an idea where to find this kind of information?
EDIT
I found out that array_filter has O(N) complexity, and count has O(1) complexity. Now I need to find info on count_chars, but a full list would be very convenient for future problems.
EDIT 2
After some research on space and time complexity in general, I found out that this code has O(N) time complexity and O(1) space complexity because:
count_chars will loop N times (the full length of the input string, giving it a starting complexity of O(N)). This generates an array with a bounded maximum number of entries (26 if the input is limited to lowercase letters; count_chars works on bytes, so never more than 256), and then a filter is applied to this array, which means the filter loops a bounded number of times. When pushing the input length towards infinity, this loop is insignificant and is treated as a constant. count also applies to this bounded array, and besides, it is insignificant because the complexity of the count function is O(1). Hence, the time complexity of the algorithm is O(N).
The same goes for space complexity. When calculating space complexity, we do not count the input, only the objects generated in the process. These objects are the bounded array and the count variable, and both are treated as constants because their size cannot grow beyond that bound, no matter how big the input is. So we can say that the algorithm has a space complexity of O(1).
Anyway, that list would be still valuable so we do not have to look inside the php source code. :)
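For comparison, the same O(N) time and O(1) extra space reasoning can be made visible with a single explicit pass and no built-in counting functions. This is only a sketch; the function name canFormPalindrome is made up, and it treats the input as a sequence of bytes, so the helper array never holds more than 256 entries.

```php
<?php
// Hypothetical helper: true if the bytes of $s can be rearranged into a palindrome.
function canFormPalindrome(string $s): bool {
    $oddBytes = [];   // byte values seen an odd number of times so far
    $length = strlen($s);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($s[$i]);
        if (isset($oddBytes[$byte])) {
            unset($oddBytes[$byte]);   // second occurrence: back to an even count
        } else {
            $oddBytes[$byte] = true;   // odd count so far
        }
    }
    // A palindrome allows at most one byte value with an odd count.
    return count($oddBytes) <= 1;
}

var_dump(canFormPalindrome('kayak'));
var_dump(canFormPalindrome('abc'));
```

The loop touches each input byte once, and the lookup array is bounded by the alphabet size rather than by N.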
A probable reason for not including this information is that it is likely to change per release, as improvements and optimizations for the general case are made.
PHP is built on C, and some of its functions are simply wrappers around their C counterparts, for example hypot. A Google search, a look at man hypot, or the docs for the math library only get you this far:
http://www.gnu.org/software/libc/manual/html_node/Exponents-and-Logarithms.html#Exponents-and-Logarithms
The source actually provides no better info
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/math/w_hypot.c (Not official, Just easy to link to)
Not to mention, This is only glibc, Windows will have a different implementation. So there MAY even be a different big O per OS that PHP is compiled on
Another reason could be that it would confuse most developers.
Most developers I know would simply choose a function with the "best" big O.
A higher bound doesn't always mean it's slower.
http://www.sorting-algorithms.com/
It has a good visual demonstration of what's happening with some algorithms, e.g. bubble sort is a "slow" sort, yet it's one of the fastest for nearly sorted data.
Quick sort is what many will use, and a naive implementation is actually very slow for nearly sorted data.
Big O is usually quoted for the worst case. Between releases PHP may decide to optimize for a certain condition, which would change the big O of the function, and there's no easy way to document that.
There is a partial list here (which I guess you have seen)
List of Big-O for PHP functions
Which does list some of the more common PHP functions.
For this particular example...
It's fairly easy to solve without using the built-in functions.
Example code
function isPalAnagram($string) {
$string = str_replace(" ", "", $string);
$len = strlen($string);
$oddCount = $len & 1;
$string = str_split($string);
while ($len > 0 && $oddCount >= 0) {
$current = reset($string);
$replace_count = 0;
foreach($string as $key => &$char) {
if ($char === $current){
unset($string[$key]);
$len--;
$replace_count++;
continue;
}
}
$oddCount -= ($replace_count & 1);
}
return ($len - $oddCount) === 0;
}
Using the fact that there cannot be more than 1 odd count, you can return early from the loop.
I think mine is also O(N) time: each pass scans the remaining characters once, and the number of passes is bounded by the number of distinct characters, as far as I can tell.
Test
$a = microtime(true);
for($i=1; $i<100000; $i++) {
testMethod("the quick brown fox jumped over the lazy dog");
testMethod("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
testMethod("testest");
}
printf("Took %s seconds, %s memory", microtime(true) - $a, memory_get_peak_usage(true));
Tests run using really old hardware
My way
Took 64.125452041626 seconds, 262144 memory
Your way
Took 112.96145009995 seconds, 262144 memory
I'm fairly sure that my way is not the quickest way either.
I actually can't see much of this information for languages other than PHP either (Java, for example).
I know a lot of this post is speculation about why it's not there, and not much of it draws from credible sources, but I hope it has at least partially explained why big O isn't listed in the documentation pages.

How unique a 5-digit mt_rand() number is?

I am just wondering how unique an mt_rand() number is if you draw 5-digit numbers.
In the example, I tried to get a list of 500 random numbers with this function and some of them are repeated.
http://www.php.net/manual/en/function.mt-rand.php
<?php
header('Content-Type: text/plain');
$errors = array();
$uniques = array();
for($i = 0; $i < 500; ++$i)
{
$random_code = mt_rand(10000, 99999);
if(!in_array($random_code, $uniques))
{
$uniques[] = $random_code;
}
else
{
$errors[] = $random_code;
}
}
/**
* If you get any data in this array, it is not exactly unique
* Run this script for few times and you may see some repeats
*/
print_r($errors);
?>
How many digits may be required to ensure that the first 500 random numbers drawn in a loop are unique?
If numbers are truly random, then there's a probability that numbers will be repeated. It doesn't matter how many digits there are -- adding more digits makes it much less likely there will be a repeat, but it's always a possibility.
You're better off checking if there's a conflict, then looping until there isn't, like so:
$uniques = array();
for($i = 0; $i < 500; $i++) {
    do {
        $code = mt_rand(10000, 99999);
    } while(in_array($code, $uniques));
    $uniques[] = $code;
}
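A variation on the same retry idea, sketched under the assumption that the codes are plain integers: storing each code as an array key makes the duplicate check a hash lookup instead of a linear in_array scan. The function name uniqueCodes is made up for this example.

```php
<?php
// Hypothetical helper: draw $count distinct integers from [$min, $max].
function uniqueCodes(int $count, int $min, int $max): array {
    if ($count > $max - $min + 1) {
        throw new InvalidArgumentException('Range is too small for that many unique codes.');
    }
    $seen = [];
    while (count($seen) < $count) {
        // Assigning an existing key is a no-op, so duplicates are simply redrawn.
        $seen[mt_rand($min, $max)] = true;
    }
    return array_keys($seen);
}

$codes = uniqueCodes(500, 10000, 99999);
echo count($codes), " unique codes\n";
```

The guard at the top matters: without it, asking for more codes than the range contains would loop forever.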
Why not use range, shuffle, and slice?
<?php
$uniques = range(10000, 99999);
shuffle($uniques);
$uniques = array_slice($uniques, 0, 500);
print_r($uniques);
Output:
Array
(
[0] => 91652
[1] => 87559
[2] => 68494
[3] => 70561
[4] => 16514
[5] => 71605
[6] => 96725
[7] => 15908
[8] => 14923
[9] => 10752
[10] => 13816
*** truncated ***
)
This method is less expensive as it does not search the array each time to see if the item is already added or not. That said, it does make this approach less "random". More information should be provided on where these numbers are going to be used. If this is an online gambling site, this would be the worst! However if this was used in returning "lucky" numbers for a horoscope website, I think it would be fine.
Furthermore, this method could be extended by changing the shuffle step to use mt_rand (whereas the original method simply used rand). It could also use openssl_random_pseudo_bytes, but that might be overkill.
The birthday paradox is at play here. If you pick a random number from 10000-99999 500 times, there's a good chance of duplicates.
Intuitive idea with small numbers
If you flip a coin twice, you'll get a duplicate about half the time. If you roll a six-sided die twice, you'll get a duplicate 1/6 of the time. If you roll it 3 times, you'll get a duplicate 4/9 (44%) of the time. If you roll it 4 times you'll get at least one duplicate 13/18 (72.2%) of the time. Roll it a fifth time and it's 49/54 (90.7%). Roll it a sixth time and it's 98.5%. Roll it a seventh time and it's 100%.
If you replace the six-sided die with a 20-sided die, the probabilities grow a bit more slowly, but grow they do. After 3 rolls you have a 14.5% chance of duplicates. After 6 rolls it's 56.4%. After 10 rolls it's 93.5%, near certainty.
The math
Let's define a function f(num_rolls, num_sides) to generalize this to any number of rolls of any random number generator that chooses out of a finite set of choices. We'll define f(num_rolls, num_sides) to be the probability of getting no duplicates in num_rolls of a num_sides-side die.
Now we can try to build a recursive definition for this. To get num_rolls unique numbers, you'll need to first roll num_rolls-1 unique numbers, then roll one more unique number, now that num_rolls-1 numbers have been taken. Therefore
f(num_rolls, num_sides) =
f(num_rolls-1, num_sides) * (num_sides - (num_rolls - 1)) / num_sides
Alternately,
f(num_rolls + 1, num_sides) =
f(num_rolls, num_sides) * (num_sides - num_rolls) / num_sides
This function follows a logistic decay curve, starting at 1 and moving very slowly (since num_rolls is very low, the change with each step is very small), then slowly picking up speed as num_rolls grows, then eventually tapering off as the function's value gets closer and closer to 0.
I've created a Google Docs spreadsheet that has this function built in as a formula to let you play with this here: https://docs.google.com/spreadsheets/d/1bNJ5RFBsXrBr_1BEXgWGein4iXtobsNjw9dCCVeI2_8
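The recurrence above can also be evaluated directly in PHP with a simple loop. This is a minimal sketch; the function name probAllUnique is made up, and it assumes every side is equally likely.

```php
<?php
// Hypothetical helper: probability that $numRolls rolls of a fair
// $numSides-sided die are all different, i.e. f(num_rolls, num_sides) above.
function probAllUnique(int $numRolls, int $numSides): float {
    $probability = 1.0;
    for ($taken = 0; $taken < $numRolls; $taken++) {
        // $taken values are already used, so ($numSides - $taken) remain free.
        $probability *= max(0, $numSides - $taken) / $numSides;
    }
    return $probability;
}

// Chance of at least one duplicate among 500 draws from 90000 values.
printf("%.3f\n", 1 - probAllUnique(500, 90000));
```

For 500 draws from 90000 values this prints a value of roughly 0.75, and for a six-sided die rolled three times it gives the 4/9 duplicate chance from the dice example.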
Tying this back to your specific problem
You've rolled a 90000-sided die 500 times. The spreadsheet above suggests you'd expect at least one duplicate pair about 75% of the time, assuming a perfectly random mt_rand. Mathematically, the operation your code was performing is choosing N elements from a set with replacement. In other words, you pick a random number out of the bag of 90000 things, write it down, then put it back in the bag, then pick another random number, and repeat 500 times. It sounds like you wanted all of the numbers to be distinct; in other words, you wanted to choose N elements from a set without replacement. There are a few algorithms to do this. Dave Chen's suggestion of shuffle and then slice is a relatively straightforward one. Josh from Qaribou's suggestion of separately rejecting duplicates is another possibility.
Your question deals with a variation of the "Birthday Problem" which asks if there are N students in a class, what is the probability that at least two students have the same birthday? See Wikipedia: The "Birthday Problem".
You can easily modify the formula shown there to answer your problem. Instead of having 365 equally probable possibilities for the birthday of each student, you have 90000 (= 99999 - 10000 + 1) equally probable integers that can be generated between 10000 and 99999. The probability that at least two of 500 such numbers will be the same is:
P(500) = 1 - 90000! / (90000^500 * (90000 - 500)!) ≈ 0.75
So there is a 75% chance that at least two of the 500 numbers that you generate will be the same or, in other words, only a 25% chance that you will be successful in getting 500 different numbers with the method you are currently using.
As others here have already suggested, I would suggest checking for repeated numbers in your algorithm rather than just blindly generating random numbers and hoping that you don't have a match between any pair of numbers.

What is the best algorithm to see if my number is in an array of ranges?

I have a 2-dimensional array in PHP containing the ranges, for example:
From.........To
---------------
125..........3957
4000.........5500
5217628......52198281
52272128.....52273151
523030528....523229183
and so on
It is a very long list. Now I want to see if a number given by the user is in range.
For example, numbers 130, 4200 and 52272933 are in my ranges, but numbers 1 and 5600 are not.
Of course I can loop over all indexes and check whether my number is bigger than the first and smaller than the second item, but is there a faster algorithm or a more efficient way of doing it using a PHP function?
added later
It is sorted. It is actually numbers created with ip2long() showing all IPs of a country.
I just wrote some code for it:
$ips[1] = array(2, 20, 100);   // range starts
$ips[2] = array(10, 30, 200);  // range ends
$n = 11; // input ip
$count = count($ips[1]);
for ($i = 0; $i < $count; $i++) {
    if ($n >= $ips[1][$i]) {
        if ($n <= $ips[2][$i]) {
            echo "$i found";
            break;
        }
    } else {
        // the ranges are sorted, so no later range can contain $n
        echo "not found";
        break;
    }
}
In this situation numbers 2, 8, 22 and 200 are in range, but not numbers 1, 11 and 300.
Put the ranges in a flat array, sorted from lower to higher, like this:
a[0] = 125
a[1] = 3957
a[2] = 4000
a[3] = 5500
a[4] = 5217628
a[5] = 52198281
a[6] = 52272128
a[7] = 52273151
a[8] = 523030528
a[9] = 523229183
Then do a binary search to determine at what index of this array the number in question should be inserted. If the insertion index is even then the number is not in any sub-range. If the insertion index is odd, then the number falls inside one of the ranges.
Examples:
n = 20 inserts at index 0 ==> not in a range
n = 126 inserts at index 1 ==> within a range
n = 523030529 inserts at index 9 ==> within a range
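The flat-array idea above can be sketched in PHP as follows. The function name inAnyRange is made up, and the sketch assumes the ranges are sorted, inclusive and non-overlapping. It counts how many boundaries are less than or equal to the number, which is the insertion index, and it adds one extra check so that a number sitting exactly on the end of a range still counts as inside.

```php
<?php
// Hypothetical helper: $bounds is the flat, sorted array
// [from0, to0, from1, to1, ...] of inclusive, non-overlapping ranges.
function inAnyRange(array $bounds, int $n): bool {
    // Binary search for the number of boundaries that are <= $n.
    $low = 0;
    $high = count($bounds);
    while ($low < $high) {
        $mid = intdiv($low + $high, 2);
        if ($bounds[$mid] <= $n) {
            $low = $mid + 1;
        } else {
            $high = $mid;
        }
    }
    if ($low % 2 === 1) {
        return true;   // past a "from" but not past its "to"
    }
    // Even index: only inside if $n equals the "to" boundary just before it.
    return $low > 0 && $bounds[$low - 1] === $n;
}

// Ranges taken from the question.
$bounds = [125, 3957, 4000, 5500, 5217628, 52198281,
           52272128, 52273151, 523030528, 523229183];
foreach ([130, 4200, 52272933, 1, 5600] as $number) {
    echo $number, inAnyRange($bounds, $number) ? " is in range\n" : " is not in range\n";
}
```

Each lookup needs only about log2 of the number of boundaries comparisons, instead of one comparison per range.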
You can speed things up by implementing a binary search algorithm. Thus, you don't have to look at every range.
Then you can use in_array to check if the number is in the array.
I'm not sure if I got you right, do your arrays really look like this:
array(125, 126, 127, ..., 3957);
If so, what's the point? Why not just have?
array(125, 3957);
That contains all the information necessary.
The example you give suggests that the numbers may be large and the space sparse by comparison.
At that point, you don't have very many options. If the array is sorted, binary search is about all there is. If the array is not sorted, you're down to plain, old CS101 linear search.
The correct data structure to use for this problem is an interval tree. This is, in general, much faster than binary search.
I am assuming that the ranges do not overlap.
If that is the case, you can maintain a map data structure that is keyed on the lower value of the range.
Now all you have to do (given the number N) is to find the key in the map that is just lower than N (using binary search - logarithmic complexity) and then check if the number is lesser than the right value.
Basically, it is a binary search (logarithmic) on the constructed map.
From a pragmatic point of view, a linear search may very well turn out to be the fastest lookup method. Think of page faults and hard disk seek time here.
If your array is large enough (whatever "enough" actually means), it may be wise to stuff your IPs in a SQL database and let the database figure out how to efficiently compute SELECT ID FROM ip_numbers WHERE x BETWEEN start AND end;.

random and unique subsets generation

Let's say we have numbers from 1 to 25 and we have to choose sets of 15 numbers.
The number of possible sets is, if I'm right, 3268760.
Of those 3268760 options, you have to generate, say, 100000.
What would be the best way to generate 100000 unique and random subsets?
Is there a way, an algorithm to do that?
If not, what would be the best option to detect duplicates?
I'm planning to do this in PHP but a general solution would be enough,
and any reference that is not too 'academic' (more practical) would help me a lot.
There is a way to generate a sample of the subsets that is random, guaranteed not to have duplicates, uses O(1) storage, and can be re-generated at any time. First, write a function to generate a combination given its lexical index. Second, use a pseudorandom permutation of the first Combin(n, m) integers to step through those combinations in a random order. Simply feed the numbers 0...100000 into the permutation, use the output of the permutation as input to the combination generator, and process the resulting combination.
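The two steps described above might look roughly like this in PHP. Everything here is a sketch under stated assumptions: the function names are made up, 64-bit integers are assumed, and the "pseudorandom permutation" is a simple affine map (a * i + b) mod total with a coprime to total. That map is a true permutation, so duplicates are impossible, but it is far from statistically strong; a keyed block cipher style permutation would be a better choice where quality matters.

```php
<?php
// Binomial coefficient C(n, k) with exact integer arithmetic.
function binomial(int $n, int $k): int {
    if ($k < 0 || $k > $n) {
        return 0;
    }
    $k = min($k, $n - $k);
    $result = 1;
    for ($i = 1; $i <= $k; $i++) {
        $result = intdiv($result * ($n - $k + $i), $i);
    }
    return $result;
}

// Hypothetical helper: the combination of $k numbers out of 1..$n found at
// the 0-based lexicographic position $index.
function unrankCombination(int $index, int $n, int $k): array {
    $combo = [];
    $candidate = 1;
    for ($slots = $k; $slots > 0; $slots--) {
        // Skip every block of combinations that starts with a smaller number.
        while (($block = binomial($n - $candidate, $slots - 1)) <= $index) {
            $index -= $block;
            $candidate++;
        }
        $combo[] = $candidate++;
    }
    return $combo;
}

function gcd(int $a, int $b): int {
    while ($b !== 0) {
        [$a, $b] = [$b, $a % $b];
    }
    return $a;
}

// Hypothetical helper: yield $count distinct random-looking combinations.
function sampleCombinations(int $count, int $n, int $k): Generator {
    $total = binomial($n, $k);
    if ($count > $total) {
        throw new InvalidArgumentException('Not that many combinations exist.');
    }
    // Assumption: an affine map stands in for the pseudorandom permutation.
    do {
        $a = mt_rand(1, max(1, $total - 1));
    } while (gcd($a, $total) !== 1);
    $b = mt_rand(0, $total - 1);
    for ($i = 0; $i < $count; $i++) {
        yield unrankCombination(($a * $i + $b) % $total, $n, $k);
    }
}

foreach (sampleCombinations(5, 25, 15) as $set) {
    echo implode(',', $set), "\n";
}
```

Only the two numbers a and b need to be stored to regenerate the same sample later, which is where the O(1) storage claim comes from.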
Here's a solution in PHP based on mjv's answer, which is how I was thinking about it. If you run it for a full 100k sets, you do indeed see a lot of collisions. However, I'm hard pressed to devise a system to avoid them. Instead, we just check them fairly quickly.
I'll think about better solutions ... on this laptop, I can do 10k sets in 5 seconds, 20k sets in under 20 seconds. 100k takes several minutes.
The sets are represented as (32-bit) ints.
<?PHP
/* (c) 2009 tim - anyone who finds a use for this is very welcome to use it with no restrictions unless they're making a weapon */
//how many sets shall we generate?
$gNumSets = 1000;
//keep track of collisions, just for fun.
$gCollisions = 0;
$starttime = time();
/**
* Generate and return an integer with exactly 15 of the lower 25 bits set (1) and the other 10 unset (0)
*/
function genSetHash(){
$hash = pow(2,25)-1;
$used = array();
for($i=0;$i<10;){
//pick a bit to turn off
$bit = rand(0,24);
if (! in_array($bit,$used)){
$hash = ( $hash & ~pow(2,$bit) );
$i++;
$used[] = $bit;
}
}
return $hash;
}
//we store our solution hashes in here.
$solutions = array();
//generate a bunch of solutions.
for($i=0;$i<$gNumSets;){
$hash = genSetHash();
//ensure no collisions
if (! in_array($hash,$solutions)){
$solutions[] = $hash;
//brag a little.
echo("Generated $i random sets in " . (time()-$starttime) . " seconds.\n");
$i++;
}else {
//there was a collision. There will generally be more the longer the process runs.
echo "thud.\n";
$gCollisions++;
}
}
// okay, we're done with the hard work. $solutions contains a bunch of
// unique, random, ints in the right range. Everything from here on out
// is just output.
//takes an integer with 25 significant bits, and returns an array of 15 numbers between 1 and 25
function hash2set($hash){
$set = array();
//check all 25 bits (0..24), otherwise the number 25 is never reported
for($i=0;$i<25;$i++){
if ($hash & pow(2,$i)){
$set[] = $i+1;
}
}
return $set;
}
//pretty-print our sets.
function formatSet($set){
return "[ " . implode(',',$set) . ']';
}
//if we wanted to print them,
foreach($solutions as $hash){
echo formatSet(hash2set($hash)) . "\n";
}
echo("Generated $gNumSets unique random sets in " . (time()-$starttime) . " seconds.\n");
echo "\n\nDone. $gCollisions collisions.\n";
I think it's all correct, but it's late, and I've been enjoying several very nice bottles of beer.
Do they have to be truly random? Or seemingly random?
Selection: generate a set with all 25 - "shuffle" the first 15 elements using Fisher-Yates / the Knuth shuffle, and then check if you've seen that permutation of the first 15 elements before. If so, disregard, and retry.
Duplicates: You have 25 values that are there or not - this can be trivially hashed to an integer value (if the 1st element is present, add 2^0, if the second is, add 2^1, etc. - it can be directly represented as a 25 bit number), so you can check easily if you've seen it already.
You'll get a fair bit of collisions, but if it's not a performance critical snippet, it might be doable.
The random number generator (RNG) of your environment will supply you with random numbers that are evenly distributed in a particular range. This type of distribution is often what is needed, say if your subsets simulate lottery drawings, but it is important to mention this fact in case you are modeling, say, the age of people found on the grounds of a middle school...
Given this RNG you can "draw" 10 (or 15, read below) numbers between 1 and 25. This may require that you multiply (and round) the random number produced by the generator, and that you ignore numbers that are above 25 (i.e. draw again), depending on the exact API associated with the RNG, but again getting a drawing in a given range is trivial. You will also need to re-draw when a number comes up again.
I suggest you get 10 numbers only, as these can be removed from the complete 1-25 sequence to produce a set of 15. In other words, drawing 15 to put in is the same as drawing 10 to take out...
Next you need to assert the uniqueness of the sets. Rather than storing the whole set, you can use a hash to identify each set uniquely. This should take no more than 25 bits, so it can be stored in a 32-bit integer. You then need efficient storage for up to 100,000 of these values, unless you want to store them in a database.
On this question of uniqueness of 100,000 sets taken out of all the possible sets, the probability of a collision seems relatively low. Edit: Oops... I was optimistic... This probability is not so low: with about a 1.5% chance of a collision on each draw after the 50,000th, there will be quite a few collisions, enough to warrant a system to exclude them...
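The bitmask and rejection ideas from the answers above can be combined in a short sketch. The function names are made up; the sketch assumes at most 25 elements so that the mask fits comfortably in a PHP integer, and it uses array keys so that each duplicate check is a hash lookup rather than an in_array scan.

```php
<?php
// Hypothetical helper: a random $k-element subset of 1..$n encoded as a
// bitmask (bit i set means the number i + 1 is in the subset).
function randomSubsetMask(int $n, int $k): int {
    $positions = range(0, $n - 1);
    shuffle($positions);
    $mask = 0;
    foreach (array_slice($positions, 0, $k) as $bit) {
        $mask |= 1 << $bit;
    }
    return $mask;
}

// Decode a bitmask back into the sorted list of numbers it represents.
function maskToSet(int $mask, int $n): array {
    $set = [];
    for ($bit = 0; $bit < $n; $bit++) {
        if ($mask & (1 << $bit)) {
            $set[] = $bit + 1;
        }
    }
    return $set;
}

// Hypothetical helper: collect $count distinct masks, redrawing on collision.
// Assumes $count does not exceed the number of possible subsets.
function uniqueRandomSubsets(int $count, int $n, int $k): array {
    $seen = [];
    while (count($seen) < $count) {
        $seen[randomSubsetMask($n, $k)] = true;
    }
    return array_keys($seen);
}

$masks = uniqueRandomSubsets(1000, 25, 15);
echo implode(',', maskToSet($masks[0], 25)), "\n";
```

Collisions cost one extra draw each, so even with the collision rate discussed above the loop only runs slightly more often than the number of sets requested.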
