I have a SQL table which uses strings for a key. I need to convert that string (max. 18 Characters) to a unique (!) 4-byte integer using PHP. Can anyone help?
Unique? Not possible, sorry.
Let's take a closer look:
With 18 characters, even if we were assuming only the 128 possible characters of ASCII (7 bits), you'd get 128^18 possible strings (and I'm not even going into the possibility of shorter strings!), which is about 8E37 ( 8 and 37 zeroes ).
With a 4-byte integer, you're getting 256^4 possible integers, which is about 4E9 ( 4 billion ).
So there are about 2E28 times as many possible strings as integers (128^18 / 256^4 = 2^94); you can't have a unique mapping.
Therefore, you'll definitely run into a collision as soon as you enter the 4294967297th key, but it is possible to run into one as soon as you enter more than one.
See also: http://en.wikipedia.org/wiki/Pigeonhole_principle
Keep a lookup table of strings to integers. Every time you encounter a new string, you add it to the mapping table and assign it a new unique ID. This will work for about 2^32 strings, which is probably enough.
There is no way to do this for more than 2^32 distinct strings.
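The lookup-table idea can be sketched in a few lines of PHP. This is only an illustration: the function name `idForString` and the in-memory array are assumptions standing in for a real mapping table with an auto-increment key.

```php
<?php
// Hypothetical in-memory stand-in for the mapping table; a real setup would
// use a database table with an auto-increment primary key.
function idForString(array &$table, string $key): int
{
    if (!isset($table[$key])) {
        // Next free ID: one more than the number of strings seen so far.
        $table[$key] = count($table) + 1;
    }
    return $table[$key];
}

$table = [];
$a = idForString($table, 'customer-0001'); // new string, gets a fresh ID
$b = idForString($table, 'customer-0002'); // another new string
$c = idForString($table, 'customer-0001'); // already known, same ID as $a
```

The IDs stay unique because they are assigned, not computed from the string, which is why this works where a hash cannot.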
You can't. A four-byte integer can represent 2^32 = 4 billion values, which is not enough to hold your target space.
If you currently have fewer than 4 billion rows in the table, you could create a cross table that just assigns an incremental value to each. You'd be limited to 4 billion rows with this approach, but this may be fine for your situation.
Related
I have a very large integer, 12-14 digits long, and I want to encrypt/compress it to an alphanumeric value so that the integer can be recovered later from the alphanumeric value. I tried to convert this integer using base 62 and map those values to a-zA-Z0-9, but the value generated this way is 7 characters long. That is still too long; I want to get it down to about 4-5 characters.
Is there a general way to do this or some method in which this can be done so that recovering the integer would still be possible? I am asking the mathematical aspects here but I would be programming this in PHP and I recently started programming in php.
Edit:
I was thinking in terms of assigning a masking bit and using this in a fashion to generate less number of Chars. I am aware of the fact that the range is not enough and that is the reason I was focusing on using a mathematical trick or a way of representation. The 62 base was an Idea that I already applied but is not working out.
14 digit decimal numbers can express 100,000,000,000,000 values (10^14).
5 characters of a 62 character alphabet can express 916,132,832 values (62^5).
You cannot cram the equivalent number of values of a 14 digit number into a 5 character base 62 string. It's simply not possible to express each possible value uniquely. See http://en.wikipedia.org/wiki/Pigeonhole_principle. Even base 64 with 7 characters is not enough (only 4,398,046,511,104 possible values). In fact, if you target a 5 character short string you'd need to compensate by using a base 631 alphabet (631^5 = 100,033,806,792,151).
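The arithmetic above can be checked with a small PHP helper. This is a sketch under assumptions: the name `minLength` is made up for illustration, and it relies on 64-bit integers, so it only holds for up to 18 decimal digits.

```php
<?php
// Smallest number of symbols needed to give every number with $digits
// decimal digits its own code, using an alphabet of $alphabetSize symbols.
// Assumes 64-bit PHP integers (fine for $digits <= 18).
function minLength(int $digits, int $alphabetSize): int
{
    $needed = 10 ** $digits;   // how many distinct values must be representable
    $capacity = 1;             // how many values $length symbols can represent
    $length = 0;
    while ($capacity < $needed) {
        $capacity *= $alphabetSize;
        $length++;
    }
    return $length;
}

$len62  = minLength(14, 62);   // base 62
$len64  = minLength(14, 64);   // base 64
$len630 = minLength(14, 630);  // just below the threshold
$len631 = minLength(14, 631);  // smallest alphabet that fits in 5 symbols
```

For 14 digits this gives 8 symbols for both base 62 and base 64, and 5 symbols only once the alphabet reaches 631 characters.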
Even compression doesn't help you. It would mean that two or more numbers would need to compress to the same compressed string (because there aren't enough possible unique compressed values), which logically means it's impossible to uncompress them into two different values.
To illustrate this very simply: Say my alphabet and target "string length" consists of one bit. That one bit can be 0 or 1. It can express 2 unique possible values. Say I have a compression algorithm which compresses anything and everything into this one bit. ... How could I possibly uncompress 100,000,000,000,000 unique values out of that one bit with two possible values? If you'd solve that problem, bandwidth and storage concerns would immediately evaporate and you'd be a billionaire.
With the 95 printable ASCII characters (the space plus the 94 below) you can switch to base 95 encoding instead of 62:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
That way an integer string of length X can be compressed into length Y base 95 string, where
Y = X * log 10/ log 95 = roughly X / 2
which is pretty good compression. So from length 12 you get down to about 6 (7 in the worst case, since 95^6 is roughly 7.35E11, just short of 10^12). If the purpose of compression is to save bandwidth when sending JSON, then base 92 can be a good choice (excluding ", \ and /, which get escaped in JSON).
Surely you can get better compression but the price to pay is a larger alphabet. Just replace 95 in the above formula by the number of symbols.
Unless of course, you know the structure of your integers. For instance, if they have plenty of zeroes, you can base your compression on this knowledge to get much better results.
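A minimal sketch of such a base 95 conversion in PHP, assuming 64-bit integers and non-negative input. The function names `encodeInt` and `decodeInt` are illustrative, not from any library.

```php
<?php
// All 95 printable ASCII characters, space (32) through tilde (126).
$alphabet = implode('', array_map('chr', range(32, 126)));

// Encode a non-negative integer; the base is the length of the alphabet.
function encodeInt(int $n, string $alphabet): string
{
    $base = strlen($alphabet);
    $out = '';
    do {
        $out = $alphabet[$n % $base] . $out; // lowest digit first, prepended
        $n = intdiv($n, $base);
    } while ($n > 0);
    return $out;
}

// Reverse of encodeInt(): rebuild the integer from its encoded string.
function decodeInt(string $s, string $alphabet): int
{
    $base = strlen($alphabet);
    $n = 0;
    for ($i = 0; $i < strlen($s); $i++) {
        $n = $n * $base + strpos($alphabet, $s[$i]);
    }
    return $n;
}

$encoded = encodeInt(999999999999, $alphabet); // largest 12-digit number
$decoded = decodeInt($encoded, $alphabet);     // round-trips to the input
```

Swapping in a different alphabet string changes the base without touching the functions. The largest 12-digit number comes out at 7 characters here, one more than the X / 2 estimate suggests.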
Because of the pigeonhole principle, you will end up with some values that get compressed and other values that get expanded. It is simply impossible to create a compression algorithm that compresses every possible input string (i.e. in your case your numbers).
If you force the cardinality of the output set to be smaller than the cardinality of the input set you'll get collisions (i.e. more input strings get "compressed" to the same compressed binary string). A compression algorithm should be reversible, right? :)
I'd like to generate a long list of 9-digit sequences.
Let's call them IDs.
Each ID must be unique, and the main purpose is to have them all really different. It is unacceptable to have 2 IDs which differ by only 1 or 2 digits in sequence.
Do you have any ideas how to implement it without comparing each newly generated ID with each previously generated one?
Perhaps there is already some algorithm or a simple MySQL function to compare how close those strings are?
You could try the following formula for your ID's - you would only need to check that the ID value doesn't already exist in the table (salt is a constant between 0 and 100 that doesn't ever change once you pick a value - I would recommend using a prime number, and definitely not 0):
ID = random integer * 101 + salt;
This generates ID values like the following (for salt = 73):
469956305
017775467
001195913
913620520
156482807
577463533
470183959
049290800
078643925
141526626
If you take any two of these ID values and compare them, you'll notice that no two numbers differ by only one or two digits in sequence. I wrote a script to compare all possible ID values between 0 and 3000000, and there were no two ID values of this form differing by 1 or 2 digits in sequence. If you want to test it out yourself, here's the script I used (in C#): http://ideone.com/lFHnlX - I reduced the upper limit because of timeout on IDEone.
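The formula above might look like this in PHP. It is a sketch, not the answerer's script: the names `generateId` and `digitDistance` are invented here, and the multiplier range is chosen so that the result stays within 9 digits.

```php
<?php
// Sketch of ID = random integer * 101 + salt, kept within 9 digits.
// $salt is a fixed constant in 1..100 (73 here, as in the sample values).
function generateId(int $salt = 73): string
{
    // Largest multiplier that keeps the result below 1,000,000,000.
    $maxMultiplier = intdiv(999999999 - $salt, 101);
    $id = random_int(0, $maxMultiplier) * 101 + $salt;
    // Left-pad with zeroes so every ID is exactly 9 characters long.
    return str_pad((string) $id, 9, '0', STR_PAD_LEFT);
}

// Number of digit positions in which two equal-length IDs differ.
function digitDistance(string $a, string $b): int
{
    $diff = 0;
    for ($i = 0; $i < strlen($a); $i++) {
        if ($a[$i] !== $b[$i]) {
            $diff++;
        }
    }
    return $diff;
}

$id = generateId();
```

The property holds because any two such IDs differ by a multiple of 101, and no change confined to one digit or two adjacent digits is a multiple of 101. The uniqueness check against the table is still needed, since `random_int` can return the same multiplier twice.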
You want to get away with not-checking for uniqueness and you don't want IDs to be similar? Then you're really looking for UUIDs/GUIDs.
MySQL's built-in uuid() function will get you there.
As Robert Harvey points out, UUIDs are alphanumeric (not numeric) and longer than 9 characters, but you're going to have to sacrifice something – you cannot satisfy all of your constraints simultaneously.
How to generate unique numeric value with fixed length from given data in PHP? For instance, I can have a string that contains numbers and characters and I need to generate unique numeric value with length 6. Thanks!
You won't be able to generate a unique numeric value out of an input with any algorithm. That's the problem of converting an input into a pseudorandom output. If you have an input string of 20 characters and an output of only 6, there will be repeated results, because:
input of 20 characters (assuming 58 alphanumerical possibilities):
58^20 = 1.8559226468222606056912232424512e+35 possibilities
output of 6 characters (assuming 10 numerical possibilities):
10^6 = 1000000 possibilities
So, to sum up, you won't be able to generate a unique number out of a string. Your best chances are to use a hashing function like md5 or sha1. They are alphanumerical but you can always convert them into numbers. However, once you crop them to, let's say, 6 digits, their chances to be repeated increase a lot.
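To make the trade-off concrete, here is a small PHP sketch that crops an md5 hash to 6 digits and then searches for the first collision. The function name `sixDigitCode` and the `key0`, `key1`, ... inputs are made up for the demonstration.

```php
<?php
// Map any string to a 6-digit numeric code. NOT unique: many inputs share a code.
function sixDigitCode(string $input): string
{
    // First 8 hex characters of the md5 hash = 32 bits, read as an integer.
    $n = hexdec(substr(md5($input), 0, 8));
    // Keep the last 6 decimal digits and left-pad with zeroes.
    return str_pad((string) ($n % 1000000), 6, '0', STR_PAD_LEFT);
}

// Feed in distinct inputs until two of them produce the same code.
// With only 1,000,000 codes this must stop within 1,000,001 inputs.
$seen = [];
for ($i = 0; ; $i++) {
    $code = sixDigitCode("key$i");
    if (isset($seen[$code])) {
        break; // "key$i" collides with "key{$seen[$code]}"
    }
    $seen[$code] = $i;
}
```

By the birthday effect the loop typically ends after roughly a thousand inputs, far sooner than the million-input worst case.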
It is unfortunately impossible to generate a completely unique value from an arbitrary value when the number of characters is limited. There are an infinite number of possible inputs, while there are only 1,000,000 possible values (000000 to 999999) in a numeric value of length 6.
In PHP however you can do the following:
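// Collect each distinct input value once; its array index serves as its numeric ID.
$list_of_numeric_values = array();
foreach ($list_of_given_values as $value) {
    // Strict comparison avoids loose matches such as "1e1" == "10".
    if (!in_array($value, $list_of_numeric_values, true)) {
        $list_of_numeric_values[] = $value;
    }
}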
After this is complete, the array then will have a unique key for each possible value you can use.
If you don't need to calculate these all at the same time, you can follow a similar algorithm where, instead of searching the array in PHP, you run a SELECT on a MySQL table to see if the entry already exists, and use the auto-increment primary key as your value.
I don't want my database id's to be sequential, so I'm trying to generate uids with this code:
$bin = openssl_random_pseudo_bytes(12);
$hex = bin2hex($bin);
return base_convert($hex, 16, 36);
My question is: how many bytes would i need to make the ids unique enough to handle large amounts of records (like twitter)?
Use PHP's uniqid(), with an added entropy factor. That'll give you plenty of room.
You might consider something like the way tinyurl and other shortening services work. I've used similar techniques, which guarantee uniqueness until all combinations are exhausted. So basically you choose an alphabet, and how many characters you want as a length. Let's say we use alphanumeric, upper and lower, so that's 62 characters in the alphabet, and let's do 5 characters per code. That's 62^5 = 916,132,832 combinations.
You start with your sequential database ID and multiply it by some prime number (choose one that's fairly large, like 2097593). Make sure to wrap around if the product exceeds 62^5 (that is, take it modulo 62^5), and then convert the result to base 62 using your chosen alphabet.
This makes each code look fairly unique, yet because the prime shares no factor with 62^5 (any prime other than 2 or 31 will do), we're guaranteed not to hit the same number twice until we've used up all codes. And it's very short.
You can use longer keys with a smaller alphabet, too, if length isn't a concern.
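The multiply-and-wrap scheme described above could be sketched as follows. The constant and function names are assumptions for illustration, and the code assumes 64-bit PHP integers and IDs between 0 and 62^5 - 1 so the product cannot overflow.

```php
<?php
// 62-character alphabet: digits, upper case, lower case.
const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
const CODE_LENGTH = 5;
const MULTIPLIER = 2097593; // the prime suggested above

function shortCode(int $id): string
{
    $space = 62 ** CODE_LENGTH;            // 916,132,832 possible codes
    $n = ($id * MULTIPLIER) % $space;      // scramble, wrapping around at 62^5
    $code = '';
    for ($i = 0; $i < CODE_LENGTH; $i++) { // fixed-width base 62 conversion
        $code = ALPHABET[$n % 62] . $code;
        $n = intdiv($n, 62);
    }
    return $code;
}

// Consecutive database IDs yield codes that look unrelated.
$codes = array_map('shortCode', range(1, 10000));
```

Because the multiplier is coprime to 62^5, the mapping is a permutation of the code space, so no uniqueness check against earlier codes is needed. It is not secret, though: anyone who knows the multiplier can invert it.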
Here's a question I asked along the same lines: Tinyurl-style unique code: potential algorithm to prevent collisions
Assuming that openssl_random_pseudo_bytes may generate every possible value, N bytes will give you 2 ^ (N * 8) distinct values. For 12 bytes this is 7.923 * 10^28
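How many bytes are "enough" depends on how likely a collision is for the expected number of records. A rough PHP sketch using the standard birthday approximation, p ≈ 1 - exp(-n^2 / (2 * 2^(8N))), assuming the bytes are uniformly random; the function name `collisionProbability` is made up here.

```php
<?php
// Approximate probability of at least one collision among $count random IDs
// of $bytes uniformly random bytes each (birthday approximation).
function collisionProbability(float $count, int $bytes): float
{
    // Number of distinct values; turns into a float beyond PHP_INT_MAX.
    $space = 2 ** (8 * $bytes);
    // -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -expm1(-($count * $count) / (2 * $space));
}

$p4  = collisionProbability(1e9, 4);   // one billion records, 4-byte IDs
$p8  = collisionProbability(1e9, 8);   // one billion records, 8-byte IDs
$p12 = collisionProbability(1e9, 12);  // one billion records, 12-byte IDs
```

For a billion records, 4 bytes make a collision practically certain, 8 bytes leave a chance of a few percent, and 12 bytes bring it down to the order of 1E-11.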
Use MySQL's UUID():
insert into `database`(`unique`,`data`) values(UUID(),'Test');
If you're not using MySQL, search Google for "UUID" plus the name of your database and you'll find an equivalent.
Source: Wikipedia
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%
I need to compare a very large number in php (30 digits long) with 2 numbers in my database. Whats a good way to do this? I tried using floats but its not precise enough and I don't know of a good way to use large numbers in php.
Have you tried using string comparison? Just make sure every number is padded with leading zeroes to the same length.
mysql> select "123123123123123123456456456"<"123123123123123123456456457";
+-------------------------------------------------------------+
| "123123123123123123456456456"<"123123123123123123456456457" |
+-------------------------------------------------------------+
| 1 |
+-------------------------------------------------------------+
Just tested this up to 200+ chars, works like a charm.
Check the bccomp function.
You could compare strings instead.
Depending on how you're fetching the data from the database, you may want to explicitly cast the integer to a string type in the SQL statement.
Other than that, there are several libraries in PHP that handle large integers, like BCMath and GMP.
Handling large numbers in PHP is done through either of two libraries: GMP or BC Math.
I haven't done this myself, so it may not be correct, but I think you'd have to take the string result from GMP or BC Math and feed that into the query. Note that a 30-digit number does not fit in a BIGINT column, which is a 64-bit integer; use DECIMAL(30,0) or a character column instead.
Interesting fact: BIGINT really is limited to about 20 digits (19 for signed values), but within that range MySQL keeps values exact if you pass them as strings:
You can always store an exact integer value in a BIGINT column by storing it using a string. In this case, MySQL performs a string-to-number conversion that involves no intermediate double-precision representation.
If they're -very- big, I'd even compare them as strings. First, strip any leading zeroes; then, if one is longer than the other, it wins. If they're the same length, compare digit by digit left-to-right: if two digits differ, the number with the bigger digit wins. This of course applies to positive integers only.
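The comparison rule just described can be written in plain PHP without any extension. A minimal sketch, assuming both inputs are non-negative integers given as decimal strings; the name `compareBigInts` is hypothetical.

```php
<?php
// Compare two non-negative integers given as decimal strings.
// Returns -1, 0 or 1, like the spaceship operator.
function compareBigInts(string $a, string $b): int
{
    // Leading zeroes would distort the length comparison, so drop them.
    $a = ltrim($a, '0');
    $b = ltrim($b, '0');
    if (strlen($a) !== strlen($b)) {
        return strlen($a) <=> strlen($b);  // the longer number is larger
    }
    // Same length: byte-wise comparison is digit by digit, left to right.
    return strcmp($a, $b) <=> 0;
}

$result = compareBigInts(
    '123123123123123123456456456',
    '123123123123123123456456457'
); // first number is smaller
```

Where the BCMath extension is available, `bccomp` does the same job and also handles negative numbers and decimals.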