I've been testing the randomness of generated values in PHP, and have been considering 32bit hexadecimal to represent a unique state within a given time frame.
I wrote this simple test script:
$checks = [];
$i = 0;
while (true) {
$hash = hash('crc32b', openssl_random_pseudo_bytes(4));
echo $hash . PHP_EOL;
if (in_array($hash, $checks)) {
echo 'Copy: ' . $i . PHP_EOL;
break;
}
$i++;
$checks[] = $hash;
}
Surprisingly (to me) this script generates a copy in less than 100,000 iterations, and as low as 1000 iterations.
My question is, am I doing something wrong here? Out of 4 billion possibilities, this level of frequency seems too unlikely.
No, this is not surprising, and there is nothing wrong with the random number generator. This is the birthday problem. With just 23 people in a room, the probability that two of them have the same birthday is 50%. This is perhaps counter-intuitive until you realize that there are 253 possible pairings of 23 people, so you get 253 shots at two people having the same birthday.
You are doing the same thing here. You are not looking to see when you hit a particular 32-bit value. Instead you are looking for a match between any two values you have created so far, which gives you a lot more chances. If you consider step 100,000, you have a 1 in 43,000 chance of matching one of the numbers you have created so far, as opposed to a 1 in 4,300,000,000 chance of matching a particular number. In the run up to 100,000, you have added up a lot of those chances.
See this answer here on stackoverflow for the calculation for a 32-bit value. On average you only need about 93,000 values to get a hit.
By the way, the use of a CRC-32 on the four-byte random value has no bearing here. The result would be the same with or without it. All you're doing is mapping each 32-bit number uniquely (one-to-one and onto) to another 32-bit number.
Related
I am trying to generate a random serial number to put on holographic stickers in order to let customers check if the purchased product is authentic or not.
Preface:
Once you input that and query that code it will be nulled, so next time you do it again you receive a message that the product might be fake because the code is already used.
Considering that I should make this system for a factory that produces no more than 2/3 millions pieces a year, for me is a bit hard understand how to set up everything, at least the 1st time…
I thought about 20 digits code in 4 groups (no letters because must be very easy for the user read and input the code)
12345-67890-98765-43210
This is what I think is the easiest way to do everything:
function mycheckdigit()
{
...
return $myserial;
}
$mycustomcode="123";
$qty=20000;
$myfile = fopen("./thefile.txt","w") or die("Houston we got a problem here");
//using a txt file for a test, should be a DB instead...
for($i=0;$i<=$qty;$i++) {
$txt = date("y").$mycustomcode.str_pad(gettimeofday()['usec'],6,STR_PAD_LEFT).random_int(1000000,9999999). "\n";
//here the code to make check digits
mycheckdigit($txt);
fwrite($myfile,$myserial);
}
fclose($myfile);
The 1st group identifying something like year: 18 and 3 custom code
The 2nd group include microtime (gettimeofday()['usec'])
The 3rd completely random
last group including 3 random number and a check digit for group 1 and a check digit for group 2
in short:
Y= year
E= part of the EAN or custom code
M= Microtime generated number (gettimeofday()['usec'])
D= random_int() digits
C= Check Digit
YYEEE-MMMMM-MDDDD-DDDCC
In this way, I have a prefix that changes every year, I can recognize what brand is the product (so I could use one DB source only) and I still have enough random digits to be - maybe - quite unique if I consider that I will “pick-up” only a portion of the numbers from 1,000,000 and 9,999,999 and split it following using above sorting
Some questions for you:
Do you think I have enough combinations to not generate same code in one year considering 2 million codes? I would not use a lookup in the DB for the same code if it is not really necessary because could slow down batch generation (executed in batch during production process)
Could be better put some also unique identifier, like a day of the year (001-365) and make random_int() 3 digits shorter? Please Consider that I will generate codes monthly and not daily (but I think there is no big change in uniqueness)
Considering that backend in PHP I am thinking to use mt_rand() function, could be a good approach?
UPDATE: After the #apokryfos suggestion, I read more about UUID generation and similar I found a good compromise using random_int() instead.
Because I just need digits, so HEX hashes are not useful for my needs and making things more complicated
I would avoid using complex cryptographic things like RSA keys and so on…
I don’t need that level of security and complexity, I just need a way to generate a unique serial number, most unique as possible that is not easy to be guessed and nulled if you don’t scratch the sticker (so number creation should not be made A to Z, but randomly)
You can play with 11 random digits per year so that's 11 digit numbers 1 to 99999999999 (99.9 billion is a lot more than 2 million) so w.r.t. enough combinations I think you're covered.
However using mt_rand you're likely to get collisions. Here's a way to plan your way to 2 million random numbers before using the database:
<?php
$arr = [];
while (count($arr) < 1000000) {
$num = mt_rand(1, 99999999999);
$numStr = str_pad($num,11,0,STR_PAD_LEFT); //Force 11 digits
if (!isset($arr[$numStr])) {
$arr[$numStr] = true;
}
}
$keys= array_keys($arr);
The number of collisions is generally low (the first collision occurs at at about 300 000 - 500 000 numbers generated so it's pretty rare.
Each value in the array $keys is an 11 digit number which is random and unique.
This approach is relatively fast but be aware it will need quite a bit of memory (more than 128MB).
This being said, a more generally used method is to generate a universally unique identifier (UUID) which is a lot more likely to be unique and will therefore does not really need checking for uniqueness.
I don't want my database id's to be sequential, so I'm trying to generate uids with this code:
$bin = openssl_random_pseudo_bytes(12);
$hex = bin2hex($bin);
return base_convert($hex, 16, 36);
My question is: how many bytes would i need to make the ids unique enough to handle large amounts of records (like twitter)?
Use PHP's uniqid(), with an added entropy factor. That'll give you plenty of room.
You might considering something like the way tinyurl and other shortening services work. I've used similar techniques, which guarantees uniqueness until all combinations are exhausted. So basically you choose an alphabet, and how many characters you want as a length. Let's say we use alphanumeric, upper and lower, so that's 62 characters in the alphabet, and let's do 5 characters per code. That's 62^5 = 916,132,832 combinations.
You start with your sequential database ID and you multiply that be some prime number (choose one that's fairly large, like 2097593). All you do is multiply that by your database ID, making sure to wrap around if you exceed 62^5, and then convert that number to base-62 as per your chosen alphabet.
This makes each code look fairly unique, yet because we use a prime number, we're guaranteed not to hit the same number twice until we've used all codes already. And it's very short.
You can use longer keys with a smaller alphabet, too, if length isn't a concern.
Here's a question I asked along the same lines: Tinyurl-style unique code: potential algorithm to prevent collisions
Assuming that openssl_random_pseudo_bytes may generate every possible value, N bytes will give you 2 ^ (N * 8) distinct values. For 12 bytes this is 7.923 * 10^28
use MySQL UUID
insert into `database`(`unique`,`data`) values(UUID(),'Test');
If your not using MySQL search google for UUID (Database Name) and it will give you an option
Source Wikipedia
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%
I have a tricky question that I've looked into a couple of times without figuring it out.
Some backstory: I am making a textbased RPG-game where players fight against animals/monsters etc. It works like any other game where you hit a number of hitpoints on each other every round.
The problem: I am using the random-function in php to generate the final value of the hit, depending on levels, armor and such. But I'd like the higher values (like the max hit) to appear less often than the lower values.
This is an example-graph:
How can I reproduce something like this using PHP and the rand-function? When typing rand(1,100) every number has an equal chance of being picked.
My idea is this: Make a 2nd degree (or quadratic function) and use the random number (x) to do the calculation.
Would this work like I want?
The question is a bit tricky, please let me know if you'd like more information and details.
Please, look at this beatiful article:
http://www.redblobgames.com/articles/probability/damage-rolls.html
There are interactive diagrams considering dice rolling and percentage of results.
This should be very usefull for you.
Pay attention to this kind of rolling random number:
roll1 = rollDice(2, 12);
roll2 = rollDice(2, 12);
damage = min(roll1, roll2);
This should give you what you look for.
OK, here's my idea :
Let's say you've got an array of elements (a,b,c,d) and you won't to randomly pick one of them. Doing a rand(1,4) to get the random element index, would mean that all elements have an equal chance to appear. (25%)
Now, let's say we take this array : (a,b,c,d,d).
Here we still have 4 elements, but not every one of them has equal chances to appear.
a,b,c : 20%
d : 40%
Or, let's take this array :
(1,2,3,...,97,97,97,98,98,98,99,99,99,100,100,100,100)
Hint : This way you won't only bias the random number generation algorithm, but you'll actually set the desired probability of apparition of each one (or of a range of numbers).
So, that's how I would go about that :
If you want numbers from 1 to 100 (with higher numbers appearing more frequently, get a random number from 1 to 1000 and associate it with a wider range. E.g.
rand = 800-1000 => rand/10 (80->100)
rand = 600-800 => rand/9 (66->88)
...
Or something like that. (You could use any math operation you imagine, modulo or whatever... and play with your algorithm). I hope you get my idea.
Good luck! :-)
I need to work out a way to create 10,000 non-repeating random numbers in PHP, and then put it into database table. Number will be 12 digits long.
What is the best way to do this?
At 12 digits long, I don't think the possibility of getting repeats is very large. I would probably just generate the numbers, try to insert them into the table, and if it already exists (assuming you have a unique constraint on that column) just generate another one.
Read e.g. 40000 (=PHP_INT_SIZE * 10000) bytes from /dev/random at once, then split it, modularize it (the % operator), and there you have it.
Then filter it, and repeat the procedure.
That avoids too many syscalls/context switches (between the php runtime, the zend engine, and the operating system itself - I'm not going to dive into details here).
That should be the most performant way of doing it.
Generate 10000 random numbers and place them in an array. Run the array through array_unique. Check the length. If less than 10000, add on a bunch more. Run the array through array_unique. If greater than 10000, then run through array_slice to give 10000. Otherwise, lather, rinse, repeat.
This assumes that you can generate a 12 digit random number without problems (use getrandmax() to see how big you can get....according to php.net on some systems 32k is as large a number as you can get.
$array = array();
while(sizeof($array)<=10000){
$number = mt_rand(0,999999999999);
if(!array_key_exists($number,$array)){
$array[$number] = null;
}
}
foreach($array as $key=>$val){
//write array records to db.
}
You could use either rand() or mt_rand(). mt_rand() is supposed to be faster however.
I have just found this great tutorial as it is something that I need.
However, after having a look, it seems that this might be inefficient. The way it works is, first generate a unique key then check if it exists in the database to make sure it really is unique. However, the larger the database gets the slower the function gets, right?
Instead, I was thinking, is there a way to add ordering to this function? So all that has to be done is check the previous entry in the DB and increment the key. So it will always be unique?
function generate_chars()
{
$num_chars = 4; //max length of random chars
$i = 0;
$my_keys = "123456789abcdefghijklmnopqrstuvwxyz"; //keys to be chosen from
$keys_length = strlen($my_keys);
$url = "";
while($i<$num_chars)
{
$rand_num = mt_rand(1, $keys_length-1);
$url .= $my_keys[$rand_num];
$i++;
}
return $url;
}
function isUnique($chars)
{
//check the uniqueness of the chars
global $link;
$q = "SELECT * FROM `urls` WHERE `unique_chars`='".$chars."'";
$r = mysql_query($q, $link);
//echo mysql_num_rows($r); die();
if( mysql_num_rows($r)>0 ):
return false;
else:
return true;
endif;
}
The tiny url people like to use random tokens because then you can't just troll the tiny url links. "Where does #2 go?" "Oh, cool!" "Where does #3 go?" "Even cooler!" You can type in random characters but it's unlikely you'll hit a valid value.
Since the key is rather sparse (4 values each having 36* possibilities gives you 1,679,616 unique values, 5 gives you 60,466,176) the chance of collisions is small (indeed, it's a desired part of the design) and a good SQL index will make the lookup be trivial (indeed, it's the primary lookup for the url so they optimize around it).
If you really want to avoid the lookup and just unse auto-increment you can create a function that turns an integer into a string of seemingly-random characters with the ability to convert back. So "1" becomes "54jcdn" and "2" becomes "pqmw21". Similar to Base64-encoding, but not using consecutive characters.
(*) I actually like using less than 36 characters -- single-cased, no vowels, and no similar characters (1, l, I). This prevents accidental swear words and also makes it easier for someone to speak the value to someone else. I even map similar charactes to each other, accepting "0" for "O". If you're entirely machine-based you could use upper and lower case and all digits for even greater possibilities.
In the database table, there is an index on the unique_chars field, so I don't see why that would be slow or inefficient.
UNIQUE KEY `unique_chars` (`unique_chars`)
Don't rush to do premature optimization on something that you think might be slow.
Also, there may be some benefit in a url shortening service that generates random urls instead of sequential urls.
I don't know why you'd bother. The premise of the tutorial is to create a "random" URL. If the random space is large enough, then you can simply rely on pure, dumb luck. If you random character space is 62 characters (A-Za-z0-9), the the 4 characters they use, given a reasonable random number generator, is 1 in 62^4, which is 1 in 14,776,336. Five characters is 1 in 916,132,832. So, a conflict is, literally, "1 in a billion".
Obviously, as the documents fill, your odds increase for the chance of a collision.
With 10,000 documents, it's 1 in 91,613, almost 1 in 100,000 (for round numbers).
That means, for every new document, you have a 1 in 91,613 chance of hitting the DB again for another pull on the slot machine.
It is not deterministic. It's random. It's luck. In theory, you can hit a string of really, really, bad luck and just get collision after collision after collision. Also, it WILL, eventually, fill up. How many URLs do you plan on hashing?
But if 1 in 91,613 odds isn't good enough, boosting it to 6 chars makes it more than 1 in 5M for 10,000 documents. We're talking almost LOTTO odds here.
Simply put, make the key big enough (7 characters? 8?) and the problem pretty much "wishes" itself out of existence.
Couldn't you encode the URL as Base36 when it's generated, and then decode it when visited - that would allow you to remove the database completely?
A snippet from Channel9:
The formula is simple, just turn the
Entry ID of our post, which is a long
into a short string by Base-36
encoding it and then stick
'http://ch9.ms/' onto the front of it.
This produces reasonably short URLs,
and can be computed at either end
without any need for a database look
up. The result, a URL like
http://ch9.ms/A49H is then used in
creating the twitter link.
I solved a similar problem by implementing an alogirthm that used to generate serial numbers one-by-one in base36. I had my own oredring of base36 characters all of which are unique. Since it was generating numbers serially I did not have to worry about duplication. Complexity and randomness of the number depends on the ordering of base36 numbers[characters]... that too for public only becuase to my application they are serial numbers :)
Check out this guys functions - http://www.pgregg.com/projects/php/base_conversion/base_conversion.php source - http://www.pgregg.com/projects/php/base_conversion/base_conversion.inc.phps
You can use any base you like, for example to convert 554512 to base 62, call
$tiny = base_base2base(554512, 10, 62); and that evaluates to $tiny = '2KFk'.
So, just pass in the unique id of the database record.
In a project I used this in a removed a few characters from the $sChars string, and am using base 58. You can also rearrange the characters in the string if you want the values to be less easy to guess.
You could of course add ordering by simply numbering the urls:
http://mytinyfier.com/1
http://mytinyfier.com/2
and so on. But if the hash key is indexed in the database (which it obviously should be), the performance boost would be minimal at best.
I wouldn't bother doing ordered enumeration for two reasons:
1) SQL servers are very effective at checking such hash collisions (given correct indexes)
2) That might hurt privacy, as users would be able to easily figure out what other users are tinyurl-ing.
Use autoincrement on the database, and get the latest id as described by http://www.acuras.co.uk/articles/24-php-use-mysqlinsertid-to-get-the-last-entered-auto-increment-value
Perhaps this is a bit off-answer, but, my general rule for creating always unique keys is simple md5( time() * 100 + rand( 0, 100 ) ); There is a one in 100,000 chance that if two people are using the same service at the same second they will get the same result (nie impossible).
That said, md5( rand( 0, n ) ) works too.
That might work, but the easiest way to accomplish the problem would probably be with hashing. Theoretically speaking, hashing runs in O(1) time, as in, it only has to perform the hash, and then does only one actual hit to the database to retrieve the value. Then, you would introduce complications for checking for hash collisions, but it seems like this is probably what most of the tinyurl providers do. And, a good hash function isn't terribly hard to write.
I have also created small tinyurl service.
I wrote a script in Python that was generating keys and store in MySQL table named tokens with status U(Unused).
But, I am doing it in offline mode. I have a corn job on my VPS. It runs a script every 10 minutes. The script check if there are less than 1000 keys in the table, it keep generating keys and inserting them if they are unique and not already exists in the table until the key's count up to 1000.
For my service, 1000 keys for 10 minutes are more than enough, you can set the timing or number of keys generated according to your need.
Now when any tiny url needs to be created on my website, my PHP script just fetch any key which is unused from the table and marked its status as T(taken). PHP script does not have to bother about its uniqueness as my python script already populated only unique keys.
Couldn't you just trim the hash to the length you wish?
$tinyURL = substr(md5($longURL . time()),0,4);
Granted, this may not provide as much pseudo randomness as using the entire string length. But, if you hash the long URL concatenated with the time(), wouldn't this be sufficient? Thoughts on using this method? Thanks!