Small remark
Reading about the max_input_vars directive led me to read a lot about PHP's internals for handling arrays. This is not really a question, but rather an answer to my own question: "why do we really need max_input_vars?". The issue is not specific to PHP; it actually applies to a lot of other programming languages too.
A problem:
Compare these two small php scripts:
$data = array();
for ($key = 0; $key <= 1073709056; $key += 32767) {
    $data[$key] = 0;
}
You can check it here. Everything is normal, nothing unexpected. Execution time is close to 0.
And compare it with this mostly identical script (the only difference is a single 1 in the step):
$data = array();
for ($key = 0; $key <= 1073709056; $key += 32768) {
    $data[$key] = 0;
}
Check it here. Nothing is normal, everything is unexpected: you exceed the maximum execution time. So it is at least 3000 times slower!
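If you want to reproduce this locally instead of in an online checker, a minimal timing harness looks like this (set_time_limit(0) lifts the execution-time cap so the slow variant can run to completion):

set_time_limit(0); // allow the slow case to finish
$start = microtime(true);
$data = array();
for ($key = 0; $key <= 1073709056; $key += 32768) { // try 32767 vs 32768
    $data[$key] = 0;
}
printf("%.2f seconds\n", microtime(true) - $start);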
The question is why does it happen?
I posted it here together with an answer because it vastly improved my knowledge about PHP internals, and I learned new things about security.
The problem is not in the loop; the problem is in how PHP and many other languages (Java, Python, ASP.NET) store key/value pairs in hash data structures. PHP uses a hash table to store arrays, which in theory makes storing and retrieving data very fast, O(1). The problem arises when more than one key maps to the same hash, creating hash collisions. Inserting an element into such a bucket becomes more expensive, O(n), so inserting n colliding keys jumps from O(n) to O(n^2).
And this is exactly what goes on here. When the step changes from 32767 to 32768, the keys go from having no collisions at all to everything colliding into the same bucket.
This is the case because of the way PHP arrays are implemented in C. The underlying hash table always has a size that is a power of 2 (an array of 9 or 15 elements will be allocated a table of size 16). Also, if the array key is an integer, the hash is that integer with a mask applied on top of it; the mask is the table size minus 1, in binary. This means that if someone inserts the keys 0, 32, 64, 128, 256, ... and so on into an associative array, they all map to the same bucket, and that bucket degenerates into a linked list. The above example creates exactly this situation.
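A minimal sketch of that bucket mapping (this mirrors how the C implementation picks a bucket; the table size here is an assumption about what the table has grown to after ~32k insertions):

$tableSize = 32768;       // power of 2
$mask = $tableSize - 1;   // fifteen 1-bits in binary
foreach (array(0, 32768, 65536, 98304) as $key) {
    echo $key, ' -> bucket ', $key & $mask, PHP_EOL; // every key lands in bucket 0
}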
This requires a lot of CPU to process, and therefore you see the huge time increase. What this means is that developers should be really careful when they accept data from outside that will be parsed into an array (people can craft such data easily and DoS the server). This data can be $_GET or $_POST requests (this is why you can limit their number with max_input_vars), XML, or JSON.
Here are the resources I used to learn about these things:
res 1
res 2
res 3
res 4
res 5
I don't know anything about PHP specifically, but 32767 is the maximum value of a signed 2-byte number. Increasing it to 32768 would require a 3-byte number (which is never used, so it would really be 4 bytes), which would in turn make everything slower.
Related
If I have an array which is using numerical keys and I add a key far outside the range used so far, does it create the intermediate keys as well? For example, if I have
$array = array(1,2,3,4);
$array[9] = 10;
Will this cause PHP to internally reserve memory for the keys 4-8 even though they don't have a value associated with them?
The reason I ask is that I have a large 2D array which I want to use for memoization in a dynamic programming algorithm, because only a small number of the total cells will actually need to be computed. However, it would be taxing on memory to hold an empty array of that size for the full 2D grid. Is there a better way to do 2-key memoization? I could use an associative array and join the keys with a separator or some scheme like that, but if PHP won't create the extra keys I would rather (for simplicity and readability) just use the 2D array. Thoughts?
This may not fully answer your question, but should help in finding an answer, at least to the first question.
Create this program.
$arr = array(1, 2, 3, 4);
sleep(10); // window to read the baseline memory usage
$arr[100000] = 1;
sleep(10); // window to read it again after adding the far-away key
Now run it and monitor its memory usage.
In the first ten seconds, the program reserves memory for a small array.
In the next ten seconds, if the array reserves space for the unused indices, the memory usage goes ridiculously high compared to the previous one. If it doesn't, though, the memory used will only grow slightly.
This should give you an idea of the effect of your final program, whether or not using a 2D array is a good idea.
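Alternatively, the script can report its own numbers via memory_get_usage(), so you don't need an external monitor:

$arr = array(1, 2, 3, 4);
echo memory_get_usage(), PHP_EOL; // baseline with the small array
$arr[100000] = 1;
echo memory_get_usage(), PHP_EOL; // if PHP reserved keys 4..99999, this would jump dramatically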
Don't worry, it won't create any extra keys. PHP is not like that; even arrays that you think are regular are associative arrays underneath. You can even mix key types in a PHP array like this:
array(
    1 => 121,
    2 => 2112,
    'stuff' => array('morestuff'),
    'foo' => 1231
)
PHP is very permissive here, which can be both good and bad.
It doesn't seem like it will allocate a placeholder or use any memory for the unused keys, based on Doug T.'s response. Hope this helps!
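Given that, a sparse nested array is a perfectly reasonable way to do 2-key memoization: only the cells you actually touch get allocated. A minimal sketch, where the grid-path recurrence is just a stand-in for your real DP:

$memo = array();
function paths($i, $j) {
    global $memo;
    if ($i === 0 || $j === 0) return 1;              // base case of the toy recurrence
    if (isset($memo[$i][$j])) return $memo[$i][$j];  // cache hit
    return $memo[$i][$j] = paths($i - 1, $j) + paths($i, $j - 1);
}
echo paths(10, 10), PHP_EOL; // 184756; $memo holds ~100 cells, not a full grid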
There are two ways to specify a key and an IV for a RijndaelManaged object. One is by calling CreateEncryptor:
var encryptor = rij.CreateEncryptor(Encoding.UTF8.GetBytes(key), Encoding.UTF8.GetBytes(iv));
and another one by directly setting Key and IV properties:
rij.Key = "1111222233334444";
rij.IV = "1111222233334444";
As long as the length of the Key and IV is 16 bytes, both methods produce the same result. But if your key is shorter than 16 bytes, the first method still allows you to encrypt the data, while the second method fails with an exception.
Now this may sound like an absolutely abstract question, but I have to use PHP and a key which is only 10 bytes long in order to send an encrypted message to a server which uses the first method.
So the question is: How does CreateEncryptor expand the key and is there a PHP implementation? I cannot alter the C# code so I'm forced to replicate this behaviour in PHP.
I'm going to have to start with some assumptions. (TL;DR - The solution is about two-thirds of the way down but the journey is way cooler).
First, in your example you set IV and Key to strings. This can't actually be done (the properties are byte arrays), so I'm going to assume we call GetBytes() on the strings. That is a terrible idea, by the way, as there are fewer potential byte values in usable ASCII space than in all 256 values of a byte; that's what GenerateIV() and GenerateKey() are for. I'll get back to this at the very end.
Next I'm going to assume you're using the default block, key and feedback size for RijndaelManaged: 128, 256 and 128 respectively.
Now we'll decompile the Rijndael CreateEncryptor() call. When it creates the Transform object, it doesn't do much of anything with the key at all (except set m_Nk, which I'll come to later). Instead, it goes straight to generating a key expansion from the bytes it was given.
Now it gets interesting:
switch (this.m_blockSizeBits > rgbKey.Length * 8 ? this.m_blockSizeBits : rgbKey.Length * 8)
So:
If 128 > len(k) x 8, it switches on 128.
If 128 <= len(k) x 8, it switches on len(k) x 8.
128 / 8 = 16, so if len(k) is 16 we can expect to switch on len(k) x 8. If it's more, then it will switch on len(k) x 8 too. If it's less it will switch on the block size, 128.
Valid switch values are 128, 192 and 256. That means it will only fall to default (and throw an exception) if it's over 16 bytes in length and not a valid block (not key) length of some sort.
In other words, it never checks against the key length specified on the RijndaelManaged object. It goes straight into the key expansion and starts operating at the block level, as long as the key length (in bits) is 128, 192, 256, or anything less than 128. This is effectively a check against the block size, not the key size.
So what happens now that we've patently not checked the key length? The answer has to do with the nature of the key schedule. When you enter a key into Rijndael, the key needs to be expanded before it can be used; in this case, it's going to be expanded to 176 bytes. To accomplish this, it uses an algorithm which is specifically designed to turn a short byte array into a much longer one.
Part of that involves checking the key length. A bit more decompilation fun and we find that this is defined as m_Nk. Sound familiar?
this.m_Nk = rgbKey.Length / 4;
Nk is 4 for a 16-byte key, and less when we enter shorter keys. (That's 4 words, for anyone wondering where the magic number 4 came from.) This causes a curious fork in the key scheduler: there's a specific path for Nk <= 6.
Without going too deep into the details, this actually happens to 'work' (i.e. not crash in a fireball) with a key length of less than 16 bytes... until it gets below 8 bytes.
Then the entire thing crashes spectacularly.
So what have we learned? When you use CreateEncryptor you are actually throwing a completely invalid key straight into the key scheduler, and it's serendipity that it sometimes doesn't outright crash on you (or a horrible contractual integrity breach, depending on your POV); probably an unintended side effect of the fact that there's a specific fork for short key lengths.
For completeness' sake, we can now look at the other implementation, where you set Key and IV on the RijndaelManaged object directly. These are stored in the SymmetricAlgorithm base class, which has the following setter:
if (!this.ValidKeySize(value.Length * 8))
throw new CryptographicException(Environment.GetResourceString("Cryptography_InvalidKeySize"));
Bingo. Contract properly enforced.
The obvious answer is that you cannot replicate this in another library unless that library happens to contain the same glaring issue, which I'm going to call a bug in Microsoft's code because I really can't see any other option.
But that answer would be a cop out. By inspecting the key scheduler we can work out what's actually happening.
When the expanded key is initialised, it populates itself with 0x00s. It then writes our key into the first Nk words (in our case Nk = 2, so it populates the first 2 words, or 8 bytes). Then it enters a second stage, expanding upon that by populating the rest of the expanded key beyond that point.
So now we know it's essentially padding everything past 8 bytes with 0x00. Can we just pad the key with 0x00s ourselves, then? No, because that shifts Nk up to Nk = 4. As a result, although our first 4 words (16 bytes) would be populated as we expect, the second stage would begin expanding at the 17th byte, not the 9th!
The solution then is utterly trivial. Rather than padding our initial key with 6 additional bytes, just chop off the last 2 bytes.
So your direct answer in PHP is:
$key = substr($key, 0, -2);
Simple, right? :)
Now you can interop with this encryption function. But don't. It can be cracked.
Assuming your key uses lowercase, uppercase and digits you have an exhaustive search space of only 218 trillion keys.
62 values (26 + 26 + 10) is the search space of each byte, because you're never using the other 194 (256 - 62) values. Since we have 8 bytes, there are 62^8 possible combinations: 218 trillion.
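For reference, that figure is easy to check:

echo number_format(pow(62, 8)); // prints 218,340,105,584,896 on a 64-bit build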
How fast can we try all the keys in that space? Let's ask openssl what my laptop (running lots of clutter) can do:
Doing aes-256 cbc for 3s on 16 size blocks: 12484844 aes-256 cbc's in 3.00s
That's 4,161,615 passes/sec. 218,340,105,584,896 / 4,161,615 / 3600 / 24 = 607 days.
Okay, 607 days isn't bad. But I can always just fire up a bunch of Amazon servers and cut that down to ~1 day by asking 607 equivalent instances to calculate 1/607th of the search space. How much would that cost? Less than $1000, assuming that each instance was somehow only as efficient as my busy laptop. Cheaper and faster otherwise.
There is also an implementation that is twice the speed of openssl, so cut whatever figure we've ended up with in half.
Then we've got to consider that we'll almost certainly find the key before exhausting the entire search space. So for all we know it might be finished in an hour.
At this point we can assert that if the data is worth encrypting, it's probably worth someone's time to crack the key.
So there you go.
I need to work out a way to create 10,000 non-repeating random numbers in PHP and then put them into a database table. Each number will be 12 digits long.
What is the best way to do this?
At 12 digits long, I don't think the probability of getting repeats is very large. I would probably just generate the numbers, try to insert them into the table, and if one already exists (assuming you have a unique constraint on that column), just generate another one.
Read e.g. 40000 (= PHP_INT_SIZE * 10000) bytes from /dev/random at once, then split the buffer, reduce each chunk with the modulo operator (%), and there you have it.
Then filter out duplicates and repeat the procedure until you have enough numbers.
That avoids too many syscalls/context switches (between the PHP runtime, the Zend engine, and the operating system itself; I'm not going to dive into details here).
This should be the most performant way of doing it, as sketched below.
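A hedged sketch of that approach, assuming 64-bit PHP (reading /dev/urandom rather than /dev/random so the read never blocks; the modulo step introduces a slight bias, which is acceptable here):

$fh = fopen('/dev/urandom', 'rb');
$numbers = array();
while (count($numbers) < 10000) {
    $words = unpack('N*', fread($fh, 4 * 2048));   // one bulk read -> 2048 32-bit words
    for ($i = 1; $i + 1 <= count($words); $i += 2) {
        // two words -> one 12-digit candidate; array keys dedupe for free
        $n = ($words[$i] % 1000000) * 1000000 + ($words[$i + 1] % 1000000);
        $numbers[$n] = true;
    }
}
fclose($fh);
$result = array_slice(array_keys($numbers), 0, 10000);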
Generate 10000 random numbers and place them in an array. Run the array through array_unique. Check the length. If less than 10000, add on a bunch more. Run the array through array_unique. If greater than 10000, then run through array_slice to give 10000. Otherwise, lather, rinse, repeat.
This assumes that you can generate a 12-digit random number without problems (use getrandmax() to see how big you can get; according to php.net, on some systems 32767 is as large a number as you can get).
$array = array();
while (count($array) < 10000) {
    // mt_rand() cannot exceed mt_getrandmax() (often ~2.1 billion),
    // so build a 12-digit number from two 6-digit halves
    $number = mt_rand(0, 999999) * 1000000 + mt_rand(0, 999999);
    if (!array_key_exists($number, $array)) {
        $array[$number] = null;
    }
}
foreach ($array as $key => $val) {
    //write array records to db.
}
You could use either rand() or mt_rand(); mt_rand() is supposed to be faster, however.
I'm trying to write a function in PHP that gets all permutations of all possible sizes. I think an example would be the best way to start off:
$my_array = array(1,1,2,3);
Possible permutations of varying size:
1
1 // * See Note
2
3
1,1
1,2
1,3
// And so forth, for all the sets of size 2
1,1,2
1,1,3
1,2,1
// And so forth, for all the sets of size 3
1,1,2,3
1,1,3,2
// And so forth, for all the sets of size 4
Note: I don't care if there's a duplicate or not. For the purposes of this example, all future duplicates have been omitted.
What I have so far in PHP:
function getPermutations($my_array){
    $permutation_length = 1;
    $keep_going = true;
    while($keep_going){
        while($there_are_still_permutations_with_this_length){
            // Generate the next permutation and return it into an array
            // Of course, the actual important part of the code is what I'm having trouble with.
        }
        $permutation_length++;
        if($permutation_length > count($my_array)){
            $keep_going = false;
        }
        else{
            $keep_going = true;
        }
    }
    return $return_array;
}
The closest thing I can think of is shuffling the array, picking the first n elements, seeing if the result is already in the results array, adding it if it's not, and stopping when there are mathematically no more possible permutations for that length. But it's ugly and resource-inefficient.
Any pseudocode algorithms would be greatly appreciated.
Also, for super-duper (worthless) bonus points, is there a way to get just one permutation from the function, but without it having to recalculate all the previous permutations to get the next?
For example, I pass it the parameter 3, which means it's already done 3 permutations, and it just generates number 4 without redoing the previous 3. (Passing the parameter is not necessary; it could keep track in a global or a static.)
The reason I ask is that as the array grows, so does the number of possible combinations. Suffice it to say that a small data set with only a dozen elements quickly grows into trillions of possible combinations, and I don't want to task PHP with holding trillions of permutations in memory at once.
Sorry no php code, but I can give you an algorithm.
It can be done with small amounts of memory and since you don't care about dupes, the code will be simple too.
First: Generate all possible subsets.
If you view a subset as a bit vector, you can see that there is a 1-1 correspondence between subsets and binary numbers.
So if your array has 12 elements, you will have 2^12 subsets (including the empty set).
So to generate a subset, you start with 0 and keep incrementing until you reach 2^12. At each stage you read the set bits of the number to pick the corresponding elements from the array, as in the sketch below.
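A quick PHP sketch of that bit-vector walk, using the 4-element example array from the question:

$my_array = array(1, 1, 2, 3);
$n = count($my_array);
for ($mask = 1; $mask < (1 << $n); $mask++) {   // skip 0: the empty set
    $subset = array();
    for ($i = 0; $i < $n; $i++) {
        if ($mask & (1 << $i)) {
            $subset[] = $my_array[$i];          // element i is in this subset
        }
    }
    // hand $subset to the permutation generator described next
}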
Once you get one subset, you can now run through its permutations.
The next permutation (of the array indices, not the elements themselves) can be generated in lexicographic order like here: http://www.de-brauwer.be/wiki/wikka.php?wakka=Permutations and can be done with minimal memory.
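For reference, the lexicographic next-permutation step from that link can be sketched in PHP like this (it permutes the array in place and returns false once the last permutation is reached):

function next_permutation(array &$a) {
    $n = count($a);
    $i = $n - 2;
    while ($i >= 0 && $a[$i] >= $a[$i + 1]) $i--;     // find the rightmost ascent
    if ($i < 0) return false;                         // already the last permutation
    $j = $n - 1;
    while ($a[$j] <= $a[$i]) $j--;                    // rightmost element > $a[$i]
    list($a[$i], $a[$j]) = array($a[$j], $a[$i]);     // swap them
    for ($l = $i + 1, $r = $n - 1; $l < $r; $l++, $r--) {
        list($a[$l], $a[$r]) = array($a[$r], $a[$l]); // reverse the tail
    }
    return true;
}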
You should be able to combine these two to give yourself a next_permutation function. Instead of passing in numbers, you could pass in an array of 12 elements which contains the previous permutation, plus possibly some more info (little memory again) about whether you need to move on to the next subset, etc.
You should actually be able to find very fast algorithms which use minimal memory, provide a next_permutation type feature and do not generate dupes: Search the web for multiset permutation/combination generation.
Hope that helps. Good luck!
The best set of functions I've come up with was the one provided by some user in the comments on the shuffle() function on php.net. Here is the link. It works pretty well.
Hope it's useful.
The problem seems to be trying to give an index to every permutation while keeping constant access time. I cannot think of a constant-time algorithm, but maybe you can improve this one to be so. This algorithm has a time complexity of O(n), where n is the length of your set. The space complexity should be reducible to O(1).
Assume our set is 1,1,2,3 and we want the 10th permutation. Also note that we will index each element of the set from 0 to 3. Going by your ordering, the single-element permutations come first, then the two-element ones, and so forth. We are going to subtract from the number 10 until we can completely determine the 10th permutation.
First up are the single element permutations. There are 4 of those, so we can view this as subtracting one four times from 10. We are left with 6, so clearly we need to start considering the two element permutations. There are 12 of these, and we can view this as subtracting three up to four times from 6. We discover that the second time we subtract 3, we are left with 0. This means the indexes of our permutation must be 2 (because we subtracted 3 twice) and 0, because 0 is the remainder. Therefore, our permutation must be 2,1.
Division and modulus may help you.
If we were looking for the 12th permutation, we would run into the case where we have a remainder of 2. Depending on your desired behavior, the permutation 2,2 might not be valid. Getting around this is very simple, however, as we can trivially detect that the indexes 2 and 2 (not to be confused with the element) are the same, so the second one should be bumped to 3. Thus the 12th permutation can trivially be calculated as 2,3.
The biggest confusion right now is that the indexes and the element values happen to match up. I hope my algorithm explanation is not too confusing because of that. If it is, I will use a set other than your example and reword things.
Inputs: Permutation index k, indexed set S.
Pseudocode:
L = {S_1}
for i = 2 to |S| do
    Insert S_i before L_{k % i}
    k <- k / i
loop
return L
This algorithm can also be easily modified to work with duplicates.
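A direct PHP rendering of that pseudocode (k is 0-indexed; the function name is my own):

function kth_permutation(array $s, $k) {
    $l = array($s[0]);                                   // L = {S_1}
    for ($i = 2; $i <= count($s); $i++) {
        array_splice($l, $k % $i, 0, array($s[$i - 1])); // insert S_i before L_{k % i}
        $k = (int) ($k / $i);                            // k <- k / i
    }
    return $l;
}
echo implode(',', kth_permutation(array(1, 1, 2, 3), 10)), PHP_EOL; // one of the 24 orderings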
I have just found this great tutorial, and it is something that I need.
However, after having a look, it seems that it might be inefficient. The way it works is to first generate a unique key and then check if it exists in the database to make sure it really is unique. The larger the database gets, the slower that check gets, right?
Instead, I was thinking: is there a way to add ordering to this function? Then all that has to be done is to check the previous entry in the DB and increment the key, so it will always be unique.
function generate_chars()
{
    $num_chars = 4; //max length of random chars
    $i = 0;
    $my_keys = "123456789abcdefghijklmnopqrstuvwxyz"; //keys to be chosen from
    $keys_length = strlen($my_keys);
    $url = "";
    while ($i < $num_chars)
    {
        // range must start at 0, otherwise $my_keys[0] ('1') can never be picked
        $rand_num = mt_rand(0, $keys_length - 1);
        $url .= $my_keys[$rand_num];
        $i++;
    }
    return $url;
}

function isUnique($chars)
{
    //check the uniqueness of the chars
    global $link;
    $q = "SELECT * FROM `urls` WHERE `unique_chars`='" . $chars . "'";
    $r = mysql_query($q, $link);
    //echo mysql_num_rows($r); die();
    if (mysql_num_rows($r) > 0):
        return false;
    else:
        return true;
    endif;
}
The tiny-url people like to use random tokens because then you can't just troll the tiny-url links: "Where does #2 go?" "Oh, cool!" "Where does #3 go?" "Even cooler!" You can type in random characters, but it's unlikely you'll hit a valid value.
Since the key is rather sparse (4 values each having 36* possibilities gives you 1,679,616 unique values, 5 gives you 60,466,176) the chance of collisions is small (indeed, it's a desired part of the design) and a good SQL index will make the lookup be trivial (indeed, it's the primary lookup for the url so they optimize around it).
If you really want to avoid the lookup and just use auto-increment, you can create a function that turns an integer into a string of seemingly random characters, with the ability to convert back. So "1" becomes "54jcdn" and "2" becomes "pqmw21". Similar to Base64 encoding, but not using consecutive characters; a sketch follows the footnote below.
(*) I actually like using fewer than 36 characters -- single-cased, no vowels, and no similar characters (1, l, I). This prevents accidental swear words and also makes it easier for someone to speak the value to someone else. I even map similar characters to each other, accepting "0" for "O". If you're entirely machine-based, you could use upper and lower case and all digits for even greater possibilities.
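Here's a hedged sketch of that idea: plain base conversion over a scrambled alphabet, so sequential IDs don't look sequential. The alphabet below is an arbitrary example (single-cased, no vowels, none of 0/1/l/I, per the footnote); encoder and decoder just have to share it.

$alphabet = 'n8x2wvb4h7mk3tqzcrj9yd5gfp6s'; // 28 symbols; the ordering is the secret

function id_to_token($n, $alphabet) {
    $base = strlen($alphabet);
    $token = '';
    do {
        $token = $alphabet[$n % $base] . $token; // build from the least-significant digit
        $n = (int) ($n / $base);
    } while ($n > 0);
    return $token;
}

function token_to_id($token, $alphabet) {
    $base = strlen($alphabet);
    $n = 0;
    for ($i = 0; $i < strlen($token); $i++) {
        $n = $n * $base + strpos($alphabet, $token[$i]);
    }
    return $n;
}

echo id_to_token(12345, $alphabet), PHP_EOL;  // "zyp" with this alphabet
echo token_to_id('zyp', $alphabet), PHP_EOL;  // 12345 again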
In the database table, there is an index on the unique_chars field, so I don't see why that would be slow or inefficient.
UNIQUE KEY `unique_chars` (`unique_chars`)
Don't rush to do premature optimization on something that you think might be slow.
Also, there may be some benefit in a url shortening service that generates random urls instead of sequential urls.
I don't know why you'd bother. The premise of the tutorial is to create a "random" URL, and if the random space is large enough, you can simply rely on pure, dumb luck. If your random character space is 62 characters (A-Za-z0-9), then the 4 characters they use, given a reasonable random number generator, give 1 in 62^4 odds, which is 1 in 14,776,336. Five characters is 1 in 916,132,832. So a conflict is, literally, "one in a billion".
Obviously, as the documents fill, your odds increase for the chance of a collision.
With 10,000 documents, it's 1 in 91,613, almost 1 in 100,000 (for round numbers).
That means, for every new document, you have a 1 in 91,613 chance of hitting the DB again for another pull on the slot machine.
It is not deterministic. It's random. It's luck. In theory, you can hit a string of really, really, bad luck and just get collision after collision after collision. Also, it WILL, eventually, fill up. How many URLs do you plan on hashing?
But if 1 in 91,613 odds isn't good enough, boosting it to 6 chars makes it more than 1 in 5M for 10,000 documents. We're talking almost LOTTO odds here.
Simply put, make the key big enough (7 characters? 8?) and the problem pretty much "wishes" itself out of existence.
Couldn't you encode the URL as Base36 when it's generated, and then decode it when visited? That would allow you to remove the database completely.
A snippet from Channel9:
The formula is simple: just turn the Entry ID of our post, which is a long, into a short string by Base-36 encoding it and then stick 'http://ch9.ms/' onto the front of it. This produces reasonably short URLs, and can be computed at either end without any need for a database lookup. The result, a URL like http://ch9.ms/A49H, is then used in creating the twitter link.
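PHP can do the same round trip out of the box with base_convert() (fine for IDs within PHP's integer range; the value is just the example from another answer here):

$id   = 554512;
$code = base_convert($id, 10, 36);          // "bvv4" -- the short token for the URL
$back = (int) base_convert($code, 36, 10);  // 554512 again when the link is visited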
I solved a similar problem by implementing an algorithm that generates serial numbers one by one in base 36. I had my own ordering of the base-36 characters, all of which are unique. Since it was generating numbers serially, I did not have to worry about duplication. The complexity and apparent randomness of the number depend on the ordering of the base-36 characters... and that is only for the public, because to my application they are just serial numbers :)
Check out this guy's functions - http://www.pgregg.com/projects/php/base_conversion/base_conversion.php source - http://www.pgregg.com/projects/php/base_conversion/base_conversion.inc.phps
You can use any base you like, for example to convert 554512 to base 62, call
$tiny = base_base2base(554512, 10, 62); and that evaluates to $tiny = '2KFk'.
So, just pass in the unique id of the database record.
In a project I used this in, I removed a few characters from the $sChars string and am using base 58. You can also rearrange the characters in the string if you want the values to be harder to guess.
You could of course add ordering by simply numbering the urls:
http://mytinyfier.com/1
http://mytinyfier.com/2
and so on. But if the hash key is indexed in the database (which it obviously should be), the performance boost would be minimal at best.
I wouldn't bother doing ordered enumeration for two reasons:
1) SQL servers are very effective at checking such hash collisions (given correct indexes)
2) That might hurt privacy, as users would be able to easily figure out what other users are tinyurl-ing.
Use autoincrement on the database, and get the latest id as described by http://www.acuras.co.uk/articles/24-php-use-mysqlinsertid-to-get-the-last-entered-auto-increment-value
Perhaps this is a bit off-answer, but my general rule for creating always-unique keys is simply md5( time() * 100 + rand( 0, 100 ) ). Since rand(0, 100) only has 101 possible values, there is roughly a one in 100 chance that two people using the service in the same second will get the same result (not impossible).
That said, md5( rand( 0, n ) ) works too.
That might work, but the easiest way to accomplish this would probably be with hashing. Theoretically speaking, hashing runs in O(1) time: it only has to compute the hash and then perform a single hit against the database to retrieve the value. You would then have to deal with hash collisions, but it seems like this is probably what most of the tinyurl providers do. And a good hash function isn't terribly hard to write.
I have also created a small tinyurl service.
I wrote a script in Python that generates keys and stores them in a MySQL table named tokens with status U (unused).
But I am doing it in offline mode: I have a cron job on my VPS that runs a script every 10 minutes. The script checks whether there are fewer than 1000 keys in the table; if so, it keeps generating unique keys (skipping any that already exist) and inserting them until the count reaches 1000.
For my service, 1000 keys per 10 minutes is more than enough; you can tune the timing or the number of keys generated according to your needs.
Now, whenever a tiny URL needs to be created on my website, my PHP script just fetches any unused key from the table and marks its status as T (taken). The PHP script does not have to bother about uniqueness, as my Python script only ever inserted unique keys.
Couldn't you just trim the hash to the length you wish?
$tinyURL = substr(md5($longURL . time()),0,4);
Granted, this may not provide as much pseudo-randomness as using the entire string length. But if you hash the long URL concatenated with time(), wouldn't this be sufficient? Thoughts on using this method? Thanks!