I was thinking about how to make a small function that generates a YouTube-like ID and does not cause collisions. The charset should be A-Za-z0-9-_ and lets say I want it to be 10 characters long which would equal 64 ^ 10 / Quadrillion - 1153 Quadrillion possibilities.
It's rather important to be small due to the fact that it's one of the major functionality of the website and should be clear enough to be modified easily. Within seconds in case an emergency occurs.
It doesn't have to be secure in the sense that we don't really mind whether it's easily to convert back to an ID as an outsider. It should however not be a pattern such as:
aaaaaaaaaa
aaaaaaaaab
aaaaaaaaac
I'm aware that you could theoretically make a charset and loop through it randomly until you have 10 random characters but this could collide ( I realize this chance is slim and you could perform a query to see whether the key has been generated before but I would like to keep dependencies minimal and it not relying on SQL ).
I can also imagine making a charset of 100 characters as in 0-99 and then prepend and/or append random characters if the length of 10 condition is not met. Eg:
IMPORTANT: the scenario sketched below assumes A is the first character in the set and _ is the last one.
id: 0 = A+9 random characters
id: 1 = B+9 random characters
id: 99 = _+9 random characters
id: 990 = _A+8 random characters
id: 99999999999999999999 = __________
I would think that would prevent collisions but I don't have a charset of 100 but 64.
Hope someone can toss some ideas.
UPDATE1: The above sketch could actually collide in the following (and more) circumstance(s):
If the ID is 99 it doesn't mean that the next character can't be A by chance although very unlikely; it could therefore collide with 990 :-(
UPDATE2: I don't think it would still collide if we use a static component after the ID and before the random component that is not part of the charset.
Updated Sketch (the static component will be $):
id: 0 = A$+8 random characters
id: 1 = B$+8 random characters
id: 99 = _$+8 random characters
id: 990 = _A$+7 random characters
id: 999999999999999999 = _________$
as in your other question here Encode/Decode ID Reversal issue all you want is to transform your ID which is in base 10 to a different base than 10 ...
you could simply base64 encode it and be done ...
if you want it to look like random, transform the number before ... for example, simply xor it with a static 60 bit value ...
Related
I am trying to generate a random serial number to put on holographic stickers in order to let customers check if the purchased product is authentic or not.
Preface:
Once you input that and query that code it will be nulled, so next time you do it again you receive a message that the product might be fake because the code is already used.
Considering that I should make this system for a factory that produces no more than 2/3 millions pieces a year, for me is a bit hard understand how to set up everything, at least the 1st time…
I thought about 20 digits code in 4 groups (no letters because must be very easy for the user read and input the code)
12345-67890-98765-43210
This is what I think is the easiest way to do everything:
function mycheckdigit()
{
...
return $myserial;
}
$mycustomcode="123";
$qty=20000;
$myfile = fopen("./thefile.txt","w") or die("Houston we got a problem here");
//using a txt file for a test, should be a DB instead...
for($i=0;$i<=$qty;$i++) {
$txt = date("y").$mycustomcode.str_pad(gettimeofday()['usec'],6,STR_PAD_LEFT).random_int(1000000,9999999). "\n";
//here the code to make check digits
mycheckdigit($txt);
fwrite($myfile,$myserial);
}
fclose($myfile);
The 1st group identifying something like year: 18 and 3 custom code
The 2nd group include microtime (gettimeofday()['usec'])
The 3rd completely random
last group including 3 random number and a check digit for group 1 and a check digit for group 2
in short:
Y= year
E= part of the EAN or custom code
M= Microtime generated number (gettimeofday()['usec'])
D= random_int() digits
C= Check Digit
YYEEE-MMMMM-MDDDD-DDDCC
In this way, I have a prefix that changes every year, I can recognize what brand is the product (so I could use one DB source only) and I still have enough random digits to be - maybe - quite unique if I consider that I will “pick-up” only a portion of the numbers from 1,000,000 and 9,999,999 and split it following using above sorting
Some questions for you:
Do you think I have enough combinations to not generate same code in one year considering 2 million codes? I would not use a lookup in the DB for the same code if it is not really necessary because could slow down batch generation (executed in batch during production process)
Could be better put some also unique identifier, like a day of the year (001-365) and make random_int() 3 digits shorter? Please Consider that I will generate codes monthly and not daily (but I think there is no big change in uniqueness)
Considering that backend in PHP I am thinking to use mt_rand() function, could be a good approach?
UPDATE: After the #apokryfos suggestion, I read more about UUID generation and similar I found a good compromise using random_int() instead.
Because I just need digits, so HEX hashes are not useful for my needs and making things more complicated
I would avoid using complex cryptographic things like RSA keys and so on…
I don’t need that level of security and complexity, I just need a way to generate a unique serial number, most unique as possible that is not easy to be guessed and nulled if you don’t scratch the sticker (so number creation should not be made A to Z, but randomly)
You can play with 11 random digits per year so that's 11 digit numbers 1 to 99999999999 (99.9 billion is a lot more than 2 million) so w.r.t. enough combinations I think you're covered.
However using mt_rand you're likely to get collisions. Here's a way to plan your way to 2 million random numbers before using the database:
<?php
$arr = [];
while (count($arr) < 1000000) {
$num = mt_rand(1, 99999999999);
$numStr = str_pad($num,11,0,STR_PAD_LEFT); //Force 11 digits
if (!isset($arr[$numStr])) {
$arr[$numStr] = true;
}
}
$keys= array_keys($arr);
The number of collisions is generally low (the first collision occurs at at about 300 000 - 500 000 numbers generated so it's pretty rare.
Each value in the array $keys is an 11 digit number which is random and unique.
This approach is relatively fast but be aware it will need quite a bit of memory (more than 128MB).
This being said, a more generally used method is to generate a universally unique identifier (UUID) which is a lot more likely to be unique and will therefore does not really need checking for uniqueness.
I have a very large integer 12-14 digits long and I want to encrypt/compress this to an alphanumeric value so that the integer can be recovered later from the alphanumeric value. I tried to convert this integer using a 62 base and tried to map those values to a-zA-Z0-9, but the value generated from this is 7 characters long. This length is still long enough and I want to convert to about 4-5 characters.
Is there a general way to do this or some method in which this can be done so that recovering the integer would still be possible? I am asking the mathematical aspects here but I would be programming this in PHP and I recently started programming in php.
Edit:
I was thinking in terms of assigning a masking bit and using this in a fashion to generate less number of Chars. I am aware of the fact that the range is not enough and that is the reason I was focusing on using a mathematical trick or a way of representation. The 62 base was an Idea that I already applied but is not working out.
14 digit decimal numbers can express 100,000,000,000,000 values (1014).
5 characters of a 62 character alphabet can express 916,132,832 values (625).
You cannot cram the equivalent number of values of a 14 digit number into a 5 character base 62 string. It's simply not possible to express each possible value uniquely. See http://en.wikipedia.org/wiki/Pigeonhole_principle. Even base 64 with 7 characters is not enough (only 4,398,046,511,104 possible values). In fact, if you target a 5 character short string you'd need to compensate by using a base 631 alphabet (6315 = 100,033,806,792,151).
Even compression doesn't help you. It would mean that two or more numbers would need to compress to the same compressed string (because there aren't enough possible unique compressed values), which logically means it's impossible to uncompress them into two different values.
To illustrate this very simply: Say my alphabet and target "string length" consists of one bit. That one bit can be 0 or 1. It can express 2 unique possible values. Say I have a compression algorithm which compresses anything and everything into this one bit. ... How could I possibly uncompress 100,000,000,000,000 unique values out of that one bit with two possible values? If you'd solve that problem, bandwidth and storage concerns would immediately evaporate and you'd be a billionaire.
With 95 printable ASCII characters you can switch to base 95 encoding instead of 62:
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
That way an integer string of length X can be compressed into length Y base 95 string, where
Y = X * log 10/ log 95 = roughly X / 2
which is pretty good compression. So from length 12 you get down to 6. If the purpose of compression is to save the bandwidth by using JSON, then base 92 can be good choice (excluding ",\,/ that become escaped in JSON).
Surely you can get better compression but the price to pay is a larger alphabet. Just replace 95 in the above formula by the number of symbols.
Unless of course, you know the structure of your integers. For instance, if they have plenty of zeroes, you can base your compression on this knowledge to get much better results.
because the pigeon principle you will end up with some values that get compressed and other values that get expanded. It simply impossible to create a compression algorithm that compress every possible input string (i.e. in your case your numbers).
If you force the cardinality of the output set to be smaller than the cardinality of the input set you'll get collisions (i.e. more input strings get "compressed" to the same compressed binary string). A compression algorithm should be reversible, right? :)
I am trying to resolve the following problem via PHP. The aim is to generate a unique 6-character string based on an integer seed and containing a predefined range of characters. The second requirement is that the string must appear random (so if code 1 were 100000, it is not acceptable for code 2 to be 100001, and 3 100002)
The range of characters is:
Uppercase A-Z excluding: B, I, O, S and Z
0-9 excluding: 0, 1, 2, 5, 8
So that would be a total of 26 characters if I am not mistaken. My first idea would to be encoding from base 10 to base 24 starting at number 7962624. So do 7962624 + seed, and then base24 encode that number.
This gives me the characters 0-N. If I replace the resulting string in the following fashion, I then meet the first criteria:
B=P, I=Q, 0=R, 1=T, 2=U, 5=V, 8=W
So at this point, my codes will look something like this:
1=TRRRR, 2=TRRRT, 3=TRRRU
So my question to you gurus is: How can I make a method that behaves consistently (so the return string for a given integer is always the same) and meets the 2 requirements above? I have spent 2 full days on this now and short of dumping 700,000,000 codes into a database and retrieving them randomly I'm all out of ideas.
Stephen
You get a reasonably random looking sequence if you take your input sequence 1,2,3... and apply a linear map modulo a prime number. The number of unique codes is limited to the prime number so you should choose a large one. The resulting codes will be unique as long as you choose a multiplier that's not divisible by the prime.
Here's an example: With 6 characters you can make 266=308915776 unique strings, so a suitable prime number could be 308915753. This function therefore will generate over 300.000.000 unique codes:
function encode($num) {
$scrambled = (240049382*$num + 37043083) % 308915753;
return base_convert($scrambled, 10, 26);
}
Make sure that you run this on 64bit PHP though, otherwise the multiplication will overflow. On 32bit you'll have to use bcmath. The codes generated for the numbers 1 through 9 are:
n89a2d
hdh4jo
biopb9
5o6k2k
3eek5
k8m9aj
ee4424
8jbojf
2ojjb0
All that's left is filling in the initial 0s that are sometimes missing and replacing the letters and numbers so that none of the forbidden characters are produced.
As you can see, there's no obvious pattern, but someone with some time on their hands, enough motivation and with access to a few of this codes will be able to find out what's going on. A safer alternative is using an encryption algorithm with a small block size, such as Skip32.
I'd like to generate a long list of 9-digits sequences.
Let's call them ID.
So each ID is unique and the main purpose is to have them all really different. It is unacceptable to have 2 IDs which differs by 1 or 2 digits in sequence.
Do you have any ideas how to implement it without comparing each new generated ID with each previously generated?
Probably there is some algorithm already or simple MYSQL function to compare how close those strings are?
You could try the following formula for your ID's - you would only need to check that the ID value doesn't already exist in the table (salt is a constant between 0 and 100 that doesn't ever change once you pick a value - I would recommend using a prime number, and definitely not 0):
ID = random integer * 101 + salt;
This generates ID values like the following (for salt = 73):
469956305
017775467
001195913
913620520
156482807
577463533
470183959
049290800
078643925
141526626
If you take any two of these ID values and compare them, you'll notice that no two numbers differ by only one or two digits in sequence. I wrote a script to compare all possible ID values between 0 and 3000000, and there were no two ID values of this form differing by 1 or 2 digits in sequence. If you want to test it out yourself, here's the script I used (in C#): http://ideone.com/lFHnlX - I reduced the upper limit because of timeout on IDEone.
You want to get away with not-checking for uniqueness and you don't want IDs to be similar? Then you're really looking for UUIDs/GUIDs.
MySQL's built-in uuid() function will get you there.
As Robert Harvey points out, UUIDs are alphanumeric (not numeric) and longer than 9 characters, but you're going to have to sacrifice something – you cannot satisfy all of your constraints simultaneously.
I need to group a number of parameters into a short, non-predictable, spellable code. Ex:
serial: WJ-JHA5JK7E9RTAS
date: 04/02/2013
days: 30
valid: true
Compressed code could look like this: 3xy9b0laiph3s
My goal is to make the code as short as possible (without losing any information, of course). The algorithm must be easily implemented in other languages as well (so it can't have crazy specific dependencies). Any thoughts?
Most of the time this is handled by storing the data somewhere and creating an ID which is then compressed and used. The most common users of this system are short URL sites.
Store data in DB and get row ID
convert base-10 row ID to base 32 or 64 (base_convert in PHP)
use the new ID which looks like '4F7c'
When that ID is passed just unconvert it bask to base 10 and look up the data in the DB
Code:
$id = 23590;
print $id;
$hash = base_convert($id, 10, 32);
print $hash;
$id = base_convert($hash, 32, 10);
print $id;
For arbitrary short strings there is not enough information to apply generalized predictive methods of compression.
You'll need to exploit the known features of your data.
Example:
Serial numbers appear to be capital letters and numbers - 36 values per character - and 15 characters long. That's 36^15 possible values which will fit in 78 bits.
Date can be converted into number of days since a fixed date. If all the dates are known to fall within 100 years of each other, this can be stored in 16 bits.
If days is never more than years worth, this can be stored in 9 bits.
Valid can be stored in 1 bit.
That's 104 bits, which can be Base64 encoded to 18 characters
Note that oftentimes serial numbers have a checksum character or two. If you know how the checksum is calculated, you can omit this character and recalculate it upon decoding. This could save you a Base64 digit here.
If you want to make the result less predictable, without worrying with heavy encryption, you can just deterministically shuffle your encoded string.
UUencode or Base64, but in these codings case is matched. Eventually you could edit these codings for your purposes (only small letters). If you have exactly the same amount of data this would be the easiest solution. But not the minimal one.