Hash collision worries - PHP

I have a system where a hash is generated from a space of roughly 1 million possibilities. If there's a 10% chance of a collision, should I worry about the generating algorithm having to run as many as 5 times?
I have a system similar to JSFiddle, where a user can "save" a file on my server. I'm using the alphabet '23456789abcdefghijkmnopqrstuvwxyz', which is 33 chars, and the filename is 4 chars long, for a total of 33^4 = 1,185,921 possibilities.
The "filename" is generated randomly and if there's a collision it reruns to get another filename. Using a birthday paradox calculator I can see that after I have 500 entries I have a 10% chance of a collision.
What are the chances that I'll get a collision more than 5 times in a row? What about 4?
Is there any way to figure this out? Should I worry about it? What happens after 5000 entries?
Is there a program out there that can figure this out with arbitrary inputs?

I don't think that the birthday paradox calculations apply. There's a difference between the odds of 500 random numbers out of 1185921 being all different and the odds of one new number being different once you have 500 known unique numbers.
If you have 500 assigned numbers and generate a new number at random, it has odds of 500/1185921 of being a collision. With 500 names taken, the chance of 4 collisions in a row is (500/1185921)^4 < 10^-13. With 5000 existing file names, the odds of a new name being a collision are 5000/1185921, and the chance of 4 collisions in a row is (5000/1185921)^4 < 10^-9.

My math is a little rusty so bear with me.
The chance of getting x collisions in a row is simply:
(chance of a single collision)^x
where the chance of a single collision is:
entries/space (which is 500/1185921, or about 0.042%).
You can see from this that it gets worse with more entries (and better with a bigger space).
Also note the birthday paradox is perhaps not quite what you want. The 10% chance is the chance that any two entries will have had a collision, not the chance of a collision for the next entry.
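To put numbers on that, a quick sketch in PHP (the function name is mine; type hints need PHP 7+):
<?php
// Chance that the next $k randomly generated names ALL collide,
// given $entries names already taken out of $space possibilities.
function consecutiveCollisionChance(int $entries, int $space, int $k): float
{
    return ($entries / $space) ** $k;
}

$space = 33 ** 4; // 1,185,921 possible 4-char filenames

printf("500 entries, 4 collisions in a row:  %.2e\n", consecutiveCollisionChance(500, $space, 4));
printf("500 entries, 5 collisions in a row:  %.2e\n", consecutiveCollisionChance(500, $space, 5));
printf("5000 entries, 4 collisions in a row: %.2e\n", consecutiveCollisionChance(5000, $space, 4));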

Related

Anti-forgery unique serial number generation

I am trying to generate a random serial number to put on holographic stickers in order to let customers check if the purchased product is authentic or not.
Preface:
Once a code has been entered and queried it will be nulled, so the next time someone queries the same code they receive a message that the product might be fake because the code has already been used.
Considering that this system is for a factory that produces no more than 2-3 million pieces a year, it is a bit hard for me to understand how to set everything up, at least the first time…
I'm thinking of a 20-digit code in 4 groups (no letters, because it must be very easy for the user to read and input the code):
12345-67890-98765-43210
This is what I think is the easiest way to do everything:
function mycheckdigit($txt)
{
    // ... (check-digit calculation omitted in the question)
    return $myserial;
}

$mycustomcode = "123";
$qty = 20000;
$myfile = fopen("./thefile.txt", "w") or die("Houston we got a problem here");
// using a txt file for a test, should be a DB instead...
for ($i = 0; $i < $qty; $i++) {
    $txt = date("y") . $mycustomcode
         . str_pad(gettimeofday()['usec'], 6, "0", STR_PAD_LEFT) // the "0" pad string was missing
         . random_int(1000000, 9999999) . "\n";
    // here the code to add the check digits
    $myserial = mycheckdigit($txt); // the return value was being discarded
    fwrite($myfile, $myserial);
}
fclose($myfile);
The 1st group identifies the year (e.g. 18) plus a 3-digit custom code.
The 2nd group includes the microtime (gettimeofday()['usec']).
The 3rd is completely random.
The last group includes 3 random digits plus a check digit for group 1 and a check digit for group 2.
in short:
Y= year
E= part of the EAN or custom code
M= Microtime generated number (gettimeofday()['usec'])
D= random_int() digits
C= Check Digit
YYEEE-MMMMM-MDDDD-DDDCC
In this way I have a prefix that changes every year, I can recognize which brand the product belongs to (so I could use a single DB source), and I still have enough random digits to be, maybe, reasonably unique, considering that I will "pick up" only a portion of the numbers between 1,000,000 and 9,999,999 and split them following the layout above.
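For the two check digits, one standard choice is a Luhn-style digit. A minimal sketch, illustrative only, since the question's mycheckdigit() body isn't shown:
<?php
// Luhn check digit for a numeric payload: double every second digit
// counting from the right, subtract 9 from results above 9, sum, and
// take the digit that rounds the sum up to a multiple of 10.
function luhnCheckDigit(string $digits): int
{
    $sum = 0;
    $len = strlen($digits);
    for ($i = 0; $i < $len; $i++) {
        $d = (int)$digits[$len - 1 - $i];
        if ($i % 2 === 0) {
            $d *= 2;
            if ($d > 9) $d -= 9;
        }
        $sum += $d;
    }
    return (10 - ($sum % 10)) % 10;
}

echo luhnCheckDigit('7992739871'); // 3 (the standard Luhn test vector)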
Some questions for you:
Do you think I have enough combinations to avoid generating the same code twice in one year, considering 2 million codes? I would rather not look up each code in the DB unless really necessary, because it could slow down batch generation (codes are generated in batches during the production process).
Would it be better to also include some unique identifier, like the day of the year (001-365), and make the random_int() part 3 digits shorter? Please consider that I will generate codes monthly, not daily (but I think that makes no big difference to uniqueness).
Considering that the backend is in PHP, I am thinking of using the mt_rand() function. Could that be a good approach?
UPDATE: After @apokryfos's suggestion I read more about UUID generation and similar approaches, and found a good compromise in using random_int() instead.
I just need digits, so hex hashes are not useful for my needs and would make things more complicated.
I would avoid using complex cryptographic things like RSA keys and so on…
I don't need that level of security and complexity; I just need a way to generate a unique serial number, as unique as possible, that is not easy to guess and null without scratching the sticker (so numbers should not be created sequentially from A to Z, but randomly).
You can play with 11 random digits per year, i.e. 11-digit numbers from 1 to 99,999,999,999 (99.9 billion is a lot more than 2 million), so with respect to having enough combinations I think you're covered.
However, using mt_rand you're likely to get occasional collisions. Here's a way to generate your 2 million unique random numbers up front, before using the database:
<?php
$arr = [];
while (count($arr) < 2000000) { // 2 million unique numbers
    $num = mt_rand(1, 99999999999);
    $numStr = str_pad($num, 11, "0", STR_PAD_LEFT); // force 11 digits
    if (!isset($arr[$numStr])) {
        $arr[$numStr] = true; // array keys are unique, so duplicates are dropped
    }
}
$keys = array_keys($arr);
The number of collisions is generally low (the first collision typically occurs after about 300,000 to 500,000 numbers have been generated), so retries are pretty rare.
Each value in the array $keys is an 11-digit number (stored as a string) which is random and unique.
This approach is relatively fast but be aware it will need quite a bit of memory (more than 128MB).
This being said, a more generally used method is to generate a universally unique identifier (UUID), which is so unlikely to collide that it does not really need to be checked for uniqueness.
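For reference, a minimal UUID v4 sketch built on random_bytes() (PHP 7+); in practice a library such as ramsey/uuid is the usual choice:
<?php
// Generate a random (version 4) UUID from 16 random bytes.
function uuidv4(): string
{
    $bytes = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set version 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set RFC 4122 variant
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

echo uuidv4(); // e.g. 1ee9aa1b-6510-4105-92b9-7171bb2f3089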

What are the chances of getting 100 using mt_rand(1,100)?

I'm wondering what are the chances of getting 100 using mt_rand(1,100)?
Are the chances 1 in 100? Does that mean I'll get 100 at least once if I "roll" 100 times?
I've been wondering this for a while but I can't find any solution.
The reason I wonder is that I'm trying to calculate how many times I have to roll in order to be guaranteed a 100.
<?php
$roll = mt_rand(1,100);
echo $roll;
?>
Regards Dennis
Are the chances 1 in 100? Does that mean I'll get 100 at least once if I "roll" 100 times?
No, that's not how random number generators work. Take an extreme example:
mt_rand(1, 2)
One would assume that over a long enough time frame the number of 1s and the number of 2s would be about the same. However, it is perfectly possible to get a sequence of 10 consecutive 1s. Just because it's random doesn't mean that a specific number must appear; if that were the case it would no longer be random.
I'm trying to calculate how many times I have to roll in order to get 100 guaranteed.
Mathematically, there is no number where 100 is guaranteed to be in the sequence. If each roll is independent there is a 99/100 chance that it won't be 100.
For two rolls this is (99/100)^2, or about 98% likely. For 100 rolls it's about 37% likely that you won't roll a single 100 in that set. In fact, you need to roll in sets of about 459 to have less than a 1% chance of having no 100s in the set (230 rolls still leaves roughly a 10% chance).
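A small simulation, a sketch using mt_rand itself, reproduces these figures (missRate is my name, not a library function):
<?php
// How often does a set of $rolls calls to mt_rand(1, 100) contain no 100 at all?
function missRate(int $rolls, int $trials = 10000): float
{
    $misses = 0;
    for ($t = 0; $t < $trials; $t++) {
        $hit = false;
        for ($i = 0; $i < $rolls; $i++) {
            if (mt_rand(1, 100) === 100) { $hit = true; break; }
        }
        if (!$hit) {
            $misses++;
        }
    }
    return $misses / $trials;
}

printf("100 rolls: %.1f%% of sets had no 100 (theory: ~36.6%%)\n", 100 * missRate(100));
printf("459 rolls: %.1f%% of sets had no 100 (theory: ~1%%)\n", 100 * missRate(459));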
The probability of getting 100 on any call of this function is 1/100, but there is no guarantee of getting a 100 within 100 calls. You have to look at a much bigger sample: for example, if you call this function 100,000,000 times, you can expect roughly 1,000,000 of the results to be 100.
This can be answered in a better way if you let us know about your use case in more detail.
Getting 1 hit out of 100 rolls is just a statistical way of describing it. Though the probability is 1% (meaning 1 out of 100), it doesn't mean you will really get exactly one hit in every 100 rolls; it's a matter of chance.
mt_rand uses the Mersenne Twister to generate pseudo-random numbers that are meant to be uniformly distributed. So if you set min and max values, the output should (most likely) also be uniformly distributed.
So you can only talk about the probability of getting a number in the given range, and about the expected number of tries until you get a specific number (for mt_rand(1, 100), that expectation is 100 tries) or all numbers in the range.
This means: no guarantee that a given number of rolls will produce a specific number at least once.

MySQL RAND() how often can it be used? does it use /dev/random?

I have a table with a few rows (50 tops). I need to get a random row out of the table, which I can do with:
ORDER BY RAND() LIMIT 1
The main question is: at the point where I'm running 6k selects in 5 seconds, is RAND() still 'reliable'?
How is RAND() calculated? Can I re-seed it over time (say, every 5 seconds)?
The MySQL pseudo-random number generator is completely deterministic. The docs say:
RAND() is not meant to be a perfect random generator. It is a fast way to generate random numbers on demand that is portable between platforms for the same MySQL version.
It can't use /dev/random because MySQL is designed to work on a variety of operating systems, some of which don't have a /dev/random.
MySQL initializes a default seed at server startup, using the integer returned by time(0).
If you're interested in the source line, it's in the MySQL source in file sql/mysqld.cc, function init_server_components(). I don't think it ever re-seeds itself.
Then the subsequent "random" numbers are based solely on the seed. See source file mysys_ssl/my_rnd.cc, function my_rnd().
The best practice solution to your random-selection task, for both performance and quality of randomization, is to generate a random value between the minimum primary key value and maximum primary key value. Then use that random value to select a primary key in your table:
SELECT ... FROM MyTable WHERE id > $random ORDER BY id LIMIT 1
The reason you'd use > instead of = is that you might have gaps in the id due to rows being deleted or rolled back, or you might have other conditions in your WHERE clause so that you have gaps in between rows that match your conditions.
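A minimal PDO sketch of this technique, assuming a $pdo connection and an integer primary key id on MyTable (both names are illustrative):
<?php
// Fetch the primary key range, then pick the first row past a random point.
[$min, $max] = $pdo->query('SELECT MIN(id), MAX(id) FROM MyTable')
                   ->fetch(PDO::FETCH_NUM);

// $min - 1 keeps the first row reachable with the `>` comparison.
$random = mt_rand($min - 1, $max - 1);

$stmt = $pdo->prepare('SELECT * FROM MyTable WHERE id > ? ORDER BY id LIMIT 1');
$stmt->execute([$random]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);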
The disadvantages of this greater-than method:
Rows following such a gap have a higher chance of being chosen, and the larger the gap the greater the chance.
You need to know the MIN(id) and MAX(id) before you generate the random value.
Doesn't work as well if you need more than one random row.
Advantages of this method:
It's much faster than ORDER BY RAND(), even for a modest table size.
You can use a random function outside of SQL.
RAND is pseudorandom. Be careful using it for security stuff. I don't think your "choose one row randomly out of fifty" is for security, so you're probably OK.
It's pretty fast for a small table. It will be horrible for picking a random row out of a large table: it has to tag every row with a pseudorandom number and then sort them. For the application you're describing, #TheEwook's suggestion is exactly right; sorting even a small table more often than once a millisecond can swamp even powerful MySQL hardware.
Don't seed RAND, ever, unless you're testing and you want a repeatable sequence of random numbers for some kind of unit test. I learned this the hard way once when generating what I thought were hard-to-guess session tokens. The MySQL guys did a good job with RAND and you can trust them for the application you're talking about.
I think (not sure), if you don't seed it, it starts with a random seed from /dev/random.
If you need crypto-grade random numbers, read /dev/random yourself. But keep in mind that /dev/random can only produce output at a limited rate. /dev/urandom uses /dev/random as a seed to produce output at a faster rate, but its entropy guarantees aren't as strong.
If your table is not too big (let's say max 1000 records) it doesn't really matter. But for big tables you must choose an alternative way.
This article may help you:
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

Is my random generation method flawed? (PHP)

I have this small internal project that inserts occasional entries into a MySQL database. I have a column named "idChar" where I set its value to a randomly generated string of length 31, using 62 possible characters.
Today I discovered that a new entry happened to have the exact same idChar as an entry from several months ago. I am now checking for duplicate entries before saving them, but this made me think about the odds of this happening, and I am curious to know whether my implementation for generating these random keys is flawed. The odds of a duplicate should be roughly 1 in 62^31, right?
function getCode($len)
{
    // Note: this alphabet is actually 57 characters (0, U, V, u and v are
    // missing), not the 62 mentioned above.
    $base = 'ABCDEFGHIJKLMNOPQRSTWXYZabcdefghijklmnopqrstwxyz123456789';
    $max = strlen($base) - 1;
    $linkCode = '';
    mt_srand((double)microtime() * 1000000); // re-seeds the generator on every call
    while (strlen($linkCode) < $len + 1) {
        $linkCode .= $base[mt_rand(0, $max)]; // the $base{...} offset syntax is deprecated
    }
    return $linkCode;
}
$idChar = getCode(30);
// code to insert into MySQL here
The odds of getting a duplicate would be calculated as per the birthday problem, because that's how you calculate the chance of collisions for the output of a function that produces a randomly chosen output from a discrete codomain. In effect you want to calculate the chance that among a pool of selections made randomly any two selections are the same.
You should also completely drop the mt_srand call as it is not necessary, and it's likely to provide worse seeds than what PHP will do automatically. Consider that the output of microtime (in my system at least) is like
0.29574400 1348356024
which means that you only have 1 million different seeds available as the last two digits of the float are always zeroes and the (double)microtime() cast completely ignores the seconds part (it would be a lousy seed anyway).
Assuming that the random number generator produces the same sequence of random numbers whenever it is seeded with the same seed, in effect you only have 1 million possible random codes instead of 62^31, quite a decrease! Fortunately it is documented that this does not happen on PHP 5.2.1 onwards.
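A quick way to see the problem (the output values are illustrative):
<?php
// The (double) cast stops parsing at the space in microtime()'s output,
// so only the fractional part survives.
echo microtime(), "\n";                        // e.g. "0.29574400 1348356024"
echo (double)microtime(), "\n";                // e.g. 0.295744
echo (int)((double)microtime() * 1000000);     // 0 .. 999999, i.e. only ~1M distinct seeds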

How many bytes are unique enough for twitter?

I don't want my database ids to be sequential, so I'm trying to generate uids with this code:
$bin = openssl_random_pseudo_bytes(12);
$hex = bin2hex($bin);
return base_convert($hex, 16, 36);
My question is: how many bytes would I need to make the ids unique enough to handle large numbers of records (like Twitter)?
Use PHP's uniqid(), with an added entropy factor. That'll give you plenty of room.
You might consider something like the way tinyurl and other shortening services work. I've used similar techniques, which guarantee uniqueness until all combinations are exhausted. So basically you choose an alphabet, and how many characters you want as a length. Let's say we use alphanumeric, upper and lower case, so that's 62 characters in the alphabet, and let's do 5 characters per code. That's 62^5 = 916,132,832 combinations.
You start with your sequential database ID and you multiply it by some prime number (choose one that's fairly large, like 2097593), making sure to wrap around if you exceed 62^5, and then convert that number to base 62 as per your chosen alphabet.
This makes each code look fairly unique, yet because we use a prime number, we're guaranteed not to hit the same number twice until we've used all the codes. And it's very short.
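A sketch of that scheme; the alphabet, prime, and 5-character length are the example values from this answer:
<?php
// Map a sequential id to a short, scrambled base-62 code. Because the
// multiplier is prime (and so coprime with 62^5), the mapping is a
// bijection: no two ids share a code until all 62^5 codes are used.
function codeForId(int $id): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
    $space = 62 ** 5;              // 916,132,832 combinations
    $n = ($id * 2097593) % $space; // multiply by the prime, wrap around
    $code = '';
    for ($i = 0; $i < 5; $i++) {
        $code = $alphabet[$n % 62] . $code;
        $n = intdiv($n, 62);
    }
    return $code;
}

echo codeForId(1), "\n", codeForId(2); // consecutive ids, very different codes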
You can use longer keys with a smaller alphabet, too, if length isn't a concern.
Here's a question I asked along the same lines: Tinyurl-style unique code: potential algorithm to prevent collisions
Assuming that openssl_random_pseudo_bytes may generate every possible value, N bytes will give you 2^(8N) distinct values. For 12 bytes this is 2^96 ≈ 7.923 × 10^28.
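To put that in birthday-problem terms, a sketch using the standard approximation p ≈ k^2 / (2N), which holds while p is small:
<?php
// Rough collision probability among $k random draws from $n possible values.
$n = 2 ** 96;  // 12 random bytes => 2^96 possible values
$k = 1e12;     // a trillion ids, far beyond Twitter scale

printf("%.3g\n", ($k ** 2) / (2 * $n)); // ~6.3e-6: still a tiny risk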
Use MySQL's UUID():
insert into `database`(`unique`,`data`) values(UUID(),'Test');
If you're not using MySQL, search Google for "UUID <your database name>" and you will find an equivalent option.
From Wikipedia:
In other words, only after generating 1 billion UUIDs every second for the next 100 years would the probability of creating just one duplicate reach about 50%.
