Compressing string using ASCII char codes using backtracking in php - php

I want to compress a string using the ASCII codes of its characters.
I want to compress them using number patterns: because ASCII codes are numbers, I want to find sub-patterns in the list of ASCII char codes.
Theory
This will be the format for every pattern I found:
[nnn][n][nn], where:
[nnn] is the ASCII code for first char, from group numbers with same pattern.
[n] is a custom number for a certain pattern/rule (I will explain more below).
[nn] shows how many times this rule happens.
The number patterns are not concretely established. But let me give you some examples:
same char
linear growth (every number/ASCII code is greater by one than the previous one)
linear decrease (every number/ASCII code is smaller by one than the previous one)
Now let's see some situations:
"adeflk" becomes "097.1.01-100.2.03-108.3.02"
same char once, linear growth three times, linear decrease twice.
"rrrrrrrrrrr" becomes "114.1.11"
same char eleven times.
"tsrqpozh" becomes "116.3.06-122.1.01-104.1.01"
linear decrease six times, same char once, same char once.
I added dots ('.') and dashes ('-') so you can see them easily.
Indeed, these examples don't show good results (compression). I want to use this algorithm for large strings, and by adding more rules (number patterns) we increase the chances of getting a result shorter than the original.
I know the existing compression solutions. I want this one because the result contains only digits, which helps me.
What I've tried
// recursive function
function run (string $data, array &$rules): string {
if (strlen($data) == 1) {
// str_pad for having always ASCII code with 3 digits
return (str_pad(ord($data), 3, '0', STR_PAD_LEFT) .'.'. '1' .'.'. '01');
}
$ord = ord($data); // first char
$strlen = strlen($data);
$nr = str_pad($ord, 3, '0', STR_PAD_LEFT); // str_pad for having always ASCII code with 3 digits
$result = '';
// compares every rule
foreach ($rules as $key => $rule) {
for ($i = 1; $i < $strlen; $i++) {
// check for how many times this rule matches
if (!$rule($ord, $data, $i)) {
// save the shortest result (so we can compress)
if (strlen($r = ($nr .'.'. $key .'.'. $i .' - '. run(substr($data, $i), $rules))) < strlen($result)
|| !$result) {
$result = $r;
}
continue 2; // we are going to next rule
}
}
// if comes here, it means entire $data follow this rule ($key)
if (strlen($r = (($nr .'.'. $key .'.'. $i))) < strlen($result)
|| !$result) {
$result = $r; // entire data follow this $rule
}
}
return $result; // it will return the shortest result it got
}
// ASCII compressor
function compress (string $data): string {
$rules = array( // ASCII rules
1 => function (int $ord, string $data, int $i): bool { // same char
return ($ord == ord($data[$i]));
},
2 => function (int $ord, string $data, int $i): bool { // linear growth
return (($ord+$i) == ord($data[$i]));
},
3 => function (int $ord, string $data, int $i): bool { // progressive growth
return ((ord($data[$i-1])+$i) == ord($data[$i]));
},
4 => function (int $ord, string $data, int $i): bool { // linear decrease
return (($ord-$i) == ord($data[$i]));
},
5 => function (int $ord, string $data, int $i): bool { // progressive decrease
return ((ord($data[$i-1])-$i) == ord($data[$i]));
}
);
// we use base64_encode because we want only ASCII chars
return run(base64_encode($data), $rules);
}
I added dots ('.') and dashes ('-') only for testing easily.
Results
compress("ana ar") => "089.1.1 - 087.1.1 - 053.1.1 - 104.1.1 - 073.1.1 - 071.4.2 - 121.1.01"
Which is ok. And it runs fast. Without a problem.
compress("ana aros") => Fatal error: Maximum execution time of 15 seconds exceeded
If the string is a bit longer, it takes far too long. It works fast and normally for 1-7 chars, but with more chars in the string, the timeout above happens.
The algorithm isn't perfect and doesn't return the exact 6-digit pattern yet, indeed. Before getting there, I'm stuck on this.
Question
How can I improve the performance of this backtracking so that it runs fine now and also with more rules?

Searching for gradients / infix repetitions is not a good match for compressing natural language. Natural language is significantly easier to compress using a dictionary-based approach (both dynamic dictionaries bundled with the compressed data and pre-compiled dictionaries trained on a reference set work), as even repeating sequences in ASCII encoding usually don't follow any trivial geometric pattern, but appear quite random when observing only the individual characters' ordinal representations.
That said, the reason your algorithm is so slow, is because you are exploring all possible patterns, which results in a run time exponential in the input length, precisely O(5^n). For your self-set goal of finding the ideal compression in a set of 5 arbitrary rules, that's already as good as possible. If anything, you can only reduce the run time complexity by a constant factor, but you can't get rid of the exponential run time. In other terms, even if you apply perfect optimizations, that only makes the difference of increasing the maximum input length you can handle by maybe 30-50%, before you inevitably run into timeouts again.
@noam's solution doesn't even attempt to find the ideal pattern, but simply greedily uses the first matching pattern to consume the input. As a result it will incorrectly ignore better matches, but in return it only has to look at each input character once, resulting in a linear run time complexity O(n).
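A greedy single pass of the kind described above could look roughly like the following. This is a minimal sketch, not the code from the other answer: `greedyEncode()` is a hypothetical name, it only knows three of the rules (using the ids from the question's code: 1 = same char, 2 = linear growth, 4 = linear decrease), and at each position it simply takes the longest run and never backtracks.

```php
<?php
// Hypothetical helper: greedy, non-backtracking encoder for three rules.
function greedyEncode(string $data): string
{
    // rule id => expected difference between neighbouring ASCII codes
    $steps = [1 => 0, 2 => 1, 4 => -1];
    $out = [];
    $n = strlen($data);
    $i = 0;
    while ($i < $n) {
        $bestRule = 1; // a single char always counts as "same char, once"
        $bestLen = 1;
        foreach ($steps as $rule => $step) {
            $len = 1;
            while ($i + $len < $n
                && ord($data[$i + $len]) - ord($data[$i + $len - 1]) === $step) {
                $len++;
            }
            if ($len > $bestLen) {
                $bestRule = $rule;
                $bestLen = $len;
            }
        }
        $bestLen = min($bestLen, 99); // the count field only has two digits
        $out[] = sprintf('%03d.%d.%02d', ord($data[$i]), $bestRule, $bestLen);
        $i += $bestLen; // consumed characters are never revisited
    }
    return implode('-', $out);
}

echo greedyEncode('tsrqpozh'), "\n"; // 116.4.06-122.1.01-104.1.01
```

Every character is scanned at most once per rule, so the run time stays linear, at the price of possibly missing a shorter encoding.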
Of course there are some details in your current solution which make it a lot easier to solve, just based on simple observations about your rules. Be wary though that these assumptions will break when you try to add more rules.
Specifically, you can avoid most of the backtracking if you are smart about the order in which you try your rules:
Try to start a new geometric pattern ord(n[i])=ord(n[i-1])+i first, and accept it as a match only when it matches at least 3 characters ahead.
Try to continue current geometric pattern.
Try to continue current gradient pattern.
Try to start new gradient ord(n[i])=ord(n[0])+i, and accept as match only when it matched at least 2 characters ahead.
Try to start / continue simple repetition last, and always accept.
Once a character from the input has been accepted by any of these rules (meaning it has been consumed by a sequence), you no longer need to backtrack from it or check any other rule for it, as you have already found the best possible representation for it. You still need to re-check the rules for every following character you add to the sequence, as a suffix of the gradient rule may be required as the prefix for a geometric rule.
Generically speaking, the pattern in your rule set which allows this, is the fact that for every rule with a higher priority, no match for that rule can have a better match in any following rule. If you like, you can easily prove that formally for every pair of possible rules you have in your set.
If you want to test your implementation, you should specifically test patterns such as ABDHIK. Even though H is a match for the currently running geometric sequence ABDH, using it as the starting point of the new geometric sequence HIK is unconditionally the better choice.

I came up with an initial solution to your problem. Please note:
You will never get a sequence of just one letter, because any 2 consecutive letters form a "linear growth" with a certain difference.
My solution is not very clean. You can, for example, combine $matches and $rules into a single array.
My solution is naive and greedy. For example, in the example adeflk, the sequence def is a sequence of 3, but because my solution is greedy, it will consider ad as a sequence of 2, and ef as another sequence of 2. That being said, you can still improve my code.
The code is hard to test. You should probably make use of OOP and divide the code into many small methods that are easy to test separately.
<?php
function compress($string, $rules, $matches) {
if ($string === '') {
return getBestMatch($matches);
}
$currentCharacter = $string[0];
$matchFound = false;
foreach ($rules as $index => &$rule) {
if ($rule['active']) {
$soFarLength = strlen($matches[$index]);
if ($soFarLength === 0) {
$matchFound = true;
$matches[$index] = $currentCharacter;
} elseif ($rule['callback']($currentCharacter, $matches[$index])) {
$matches[$index] .= $currentCharacter;
$matchFound = true;
} else {
$rule['active'] = false;
}
}
}
if ($matchFound) {
return compress(substr($string, 1), $rules, $matches);
} else {
return getBestMatch($matches) . startNewSequence($string);
}
}
function getBestMatch($matches) {
$rule = -1;
$length = -1;
foreach ($matches as $index => $match) {
if (strlen($match) > $length) {
$length = strlen($match);
$rule = $index;
}
}
if ($length <= 0) {
return '';
}
return ord($matches[$rule][0]) . '.' . $rule . '.' . $length . "\n";
}
function startNewSequence($string) {
$rules = [
// rule number 1 - all characters are the same
1 => [
'active' => true,
'callback' => function ($a, $b) {
return $a === substr($b, -1);
}
],
// rule number 2 - ASCII code of current letter is one more than the last letter ("linear growth")
2 => [
'active' => true,
'callback' => function ($a, $b) {
return ord($a) === (1 + ord(substr($b, -1)));
}
],
// rule number 3 - ASCII code is a geometric progression. The ord() of each character increases with each step.
3 => [
'active' => true,
'callback' => function ($a, $b) {
if (strlen($b) == 1) {
return ord($a) > ord($b);
}
$lastCharOrd = ord(substr($b, -1));
$oneBeforeLastCharOrd = ord(substr($b, -2, 1));
$lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
$currentOrd = ord($a);
return ($currentOrd - $lastCharOrd) === ($lastDiff + 1);
}
],
// rule number 4 - ASCII code of current letter is one less than the last letter ("linear decrease")
4 => [
'active' => true,
'callback' => function ($a, $b) {
return ord($a) === (ord(substr($b, -1)) - 1);
}
],
// rule number 5 - ASCII code is a negative geometric progression. The ord() of each character decreases by one
// with each step.
5 => [
'active' => true,
'callback' => function ($a, $b) {
if (strlen($b) == 1) {
return ord($a) < ord($b);
}
$lastCharOrd = ord(substr($b, -1));
$oneBeforeLastCharOrd = ord(substr($b, -2, 1));
$lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
$currentOrd = ord($a);
return ($currentOrd - $lastCharOrd) === ($lastDiff - 1);
}
],
];
$matches = [
1 => '',
2 => '',
3 => '',
4 => '',
5 => '',
];
return compress($string, $rules, $matches);
}
echo startNewSequence('tsrqpozh');

Related

How to effectively convert positive number in base -2 to a negative one in base - 2?

Old question name: How to effectively split a binary string in a groups of 10, 0, 11?
I have some strings as an input, which are binary representation of a number.
For example:
10011
100111
0111111
11111011101
I need to split these strings (or arrays) into groups of 10, 0, and 11 in order to replace them.
10 => 11
0 => 0
11 => 10
How can I do it? I have tried the following, but it doesn't work.
preg_match('/([10]{2})(0{1})([11]{2})/', $S, $matches);
It should be [10] [0], [11] for 10011 input.
And it should be 11010 when replaced.
UPD1.
Actually, I'm trying to write a negation algorithm for converting a positive number in base -2 to a negative one in base -2.
It could be done with the loop-based algorithm from Wikipedia. But replacing bit groups is much faster. I have implemented it already and am just trying to optimize it.
For the case 0111111 it's possible to add a 0 at the end. Then the rules can be applied, and we can remove the leading zeros from the result. The output will be 101010.
UPD2.
@Wiktor Stribiżew proposed an idea for doing the replacement immediately, without splitting the bits into groups first.
But I have a faster solution already.
$S = strtr($S, $rules);
The point of this question isn't to do a replacement, but to get an array of the desired groups [11] [0] [10].
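For the array of groups itself, one option is an alternation with preg_match_all(). This is only a sketch: `splitIntoGroups()` is a hypothetical name, and it assumes the input has already been padded as described in UPD1, because a lone trailing 1 matches none of the three groups and would be silently skipped.

```php
<?php
// Hypothetical helper: split a bit string into the groups 10, 11 and 0.
// At each position the two-character groups are tried before the single 0.
function splitIntoGroups(string $bits): array
{
    preg_match_all('/10|11|0/', $bits, $matches);
    return $matches[0];
}

print_r(splitIntoGroups('10011')); // the groups 10, 0, 11
```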
UPD3.
This is a solution which I reached with an idea of converting binary groups. It's faster than one with a loop.
function solution2($A)
{
$S = implode('', $A);
//we could add leading 0
if (substr($S, strlen($S) - 1, 1) == 1) {
$S .= '0';
}
$rules = [
'10' => '11',
'0' => '0',
'11' => '10',
];
$S = strtr($S, $rules);
$arr = str_split($S);
//remove leading 0
while ($arr[count($arr) - 1] == 0) {
array_pop($arr);
}
return $arr;
}
But the solution in @Alex Blex's answer is faster.
You may use a simple /11|10/ regex with a preg_replace_callback:
$s = '10011';
echo preg_replace_callback("/11|10/", function($m) {
return $m[0] == "11" ? "10" : "11"; // if 11 is matched, replace with 10 or vice versa
}, $s);
// => 11010
See the online PHP demo.
Answering the question
algorithm for converting a positive number in a base -2 to a negative one in a base -2
I believe following function is more efficient than a regex:
function negate($negabin)
{
$mask = 0xAAAAAAAAAAAAAAA;
return decbin((($mask<<1)-($mask^bindec($negabin)))^$mask);
}
Parameter is a positive int60 in a base -2 notation, e.g. 11111011101.
The function converts the parameter to base 10, negates it, and converts it back to base -2 as described in the wiki: https://en.wikipedia.org/wiki/Negative_base#To_negabinary
It works on a 64-bit system, but can easily be adapted to work on 32-bit.
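To see the mask trick in isolation, the two conversions that negate() chains together can be written as separate helpers. This is a sketch under the same assumptions as above (64-bit PHP, most significant digit first, values well below 60 bits); the names `negabinToInt()` and `intToNegabin()` are made up for illustration.

```php
<?php
// 60-bit mask with 1s on the odd positions, i.e. the digits with negative weight.
const NEGABINARY_MASK = 0xAAAAAAAAAAAAAAA;

// Base -2 string to ordinary integer.
function negabinToInt(string $negabin): int
{
    return (NEGABINARY_MASK ^ bindec($negabin)) - NEGABINARY_MASK;
}

// Ordinary integer to base -2 string.
function intToNegabin(int $n): string
{
    return decbin(($n + NEGABINARY_MASK) ^ NEGABINARY_MASK);
}

echo negabinToInt('111'), "\n"; // 3, because 4 - 2 + 1
echo intToNegabin(-3), "\n";    // 1101, because -8 + 4 + 1
```

Negating is then just intToNegabin(-negabinToInt($x)), which is what the one-liner above computes in a single expression.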

Password Validation With Multiple Rules

I'm attempting to write a regex in PHP that validates the following:
At least 10 chars
Has at least 2 Upper-case characters
Has at least 2 Numbers OR Symbols
I've looked at just about every reference I can find but, to no avail.
I guess I can test individually, but that makes me very sad :(
Can someone please help? (And send me to a spot where I can learn in plain English Reg Ex?)
This picture is worth more than 1000 words
(and that's a lot of entropy)
(image via XKCD)
With this in mind you might want to consider dropping rules 2 & 3 if password length is higher than X (say.. 20) or increase the minimum to at least 16 characters (as the only rule).
As for your requirement:
As opposed to having one big, ugly, hard-to-maintain, advanced RegExp you might want to break the problem in smaller parts and tackle each bit separately using dedicated functions.
For this you could look at ctype_* functions, count_chars() and MultiByte String Functions.
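As a rough illustration of the "small dedicated checks" idea, the three rules can be counted with ctype_* functions. This is a sketch, not a drop-in validator: `isValidPassword()` is a hypothetical name, "symbol" is taken to mean whatever ctype_punct() accepts, and the check is byte-based, so multi-byte input would need the mb_* functions instead.

```php
<?php
// Hypothetical helper: checks length, uppercase count and digit/symbol count.
function isValidPassword(string $password): bool
{
    if (strlen($password) < 10) {
        return false;
    }
    $upper = 0;
    $numOrSymbol = 0;
    foreach (str_split($password) as $char) {
        if (ctype_upper($char)) {
            $upper++;
        } elseif (ctype_digit($char) || ctype_punct($char)) {
            $numOrSymbol++;
        }
    }
    return $upper >= 2 && $numOrSymbol >= 2;
}

var_dump(isValidPassword('ABcdefgh12')); // bool(true)
var_dump(isValidPassword('abcdefgh12')); // bool(false): no uppercase letters
```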
Now the ugly:
This advanced RegEx will return true or false according to your rules:
preg_match('/^(?=.{10,}$)(?=.*?[A-Z].*?[A-Z])(?=.*?([\x20-\x40\x5b-\x60\x7b-\x7e\x80-\xbf]).*?(?1).*?$).*$/',$string);
Test demo here: http://regex101.com/r/qE9eB2
1st part (LookAhead) : (?=.{10,}$) will check string length and continue if it has at least 10 characters. You could drop this and do a check with strlen() or even better mb_strlen().
2nd part (also a LookAhead): (?=.*?[A-Z].*?[A-Z]) will check for the presence of 2 UPPERCASE characters. You could also do a $upper=preg_replace('/[^A-Z]/','',$string) instead and check that $upper contains at least two chars.
3rd LookAhead uses a character class: [\x20-\x40\x5b-\x60\x7b-\x7e\x80-\xbf] with hex escaped character ranges for common symbols (pretty much all the symbols one could find on an average keyboard). You could also do a $sym=preg_replace('/[a-zA-Z]/','',$string) instead and check that $sym contains at least two chars. Note: to make it shorter I used a recursive group (?1) so as not to repeat the same character class.
For learning, the most comprehensive RegExp reference I know of is: regular-expressions.info
You can use lookaheads to make sure that what you are looking for is contained appropriately.
/(?=.*[A-Z].*[A-Z])(?=.*[^a-zA-Z].*[^a-zA-Z]).{10,}/
I have always preferred good old procedural code for handling stuff like this. Regular expressions can be useful but they can also be a little cumbersome, especially for code maintenance and quick scanning (regular expressions are not exactly examples of readability).
function strContains($string, $contains, $n = 1, $exact = false) {
$length = strlen($string);
$tally = 0;
for ($i = 0; $i < $length; $i++) {
if (strpos($contains, $string[$i]) !== false) {
$tally++;
}
}
return ($exact ? $tally == $n : $tally >= $n);
}
function validPassword($password) {
if (strlen($password) < 10) {
return false;
}
$upperChars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
$upperCount = 2;
if (strContains($password, $upperChars, $upperCount) === false) {
return false;
}
$numSymChars = '0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~';
$numSymCount = 2;
if (strContains($password, $numSymChars, $numSymCount) === false) {
return false;
}
return true;
}

Printing all permutations of strings that can be formed from phone number

I'm trying to print the possible words that can be formed from a phone number in php. My general strategy is to map each digit to an array of possible characters. I then iterate through each number, recursively calling the function to iterate over each possible character.
Here's what my code looks like so far, but it's not working out just yet. Any syntax corrections I can make to get it to work?
$pad = array(
array('0'), array('1'), array('abc'), array('def'), array('ghi'),
array('jkl'), array('mno'), array('pqr'), array('stuv'), array('wxyz')
);
function convertNumberToAlpha($number, $next, $alpha){
global $pad;
for($i =0; $i<count($pad[$number[$next]][0]); $i++){
$alpha[$next] = $pad[$next][0][$i];
if($i<strlen($number) -1){
convertNumberToAlpha($number, $next++, $alpha);
}else{
print_r($alpha);
}
}
}
$alpha = array();
convertNumberToAlpha('22', 0, $alpha);
How is this going to be used? This is not a job for a simple recursive algorithm such as what you have suggested, nor even an iterative approach. An average 10-digit number will yield 59,049 (3^10) possibilities, each of which will have to be evaluated against a dictionary if you want to determine actual words.
Many times, the best approach to this is to pre-compile a dictionary which maps 10-digit numbers to various words. Then, your look-up is a constant O(1) algorithm, just selecting by a 10 digit number which is mapped to an array of possible words.
In fact, pre-compiled dictionaries were the way that T9 worked, mapping dictionaries to trees with logarithmic look-up functions.
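A pre-compiled dictionary of this kind can be sketched as a plain array keyed by digit string. The names `wordToDigits()` and `buildDigitIndex()` and the tiny word list are hypothetical; a real index would be built once from a full word list and stored.

```php
<?php
// Map each letter to its keypad digit (abc=2, def=3, ..., pqrs=7, tuv=8, wxyz=9).
function wordToDigits(string $word): string
{
    return strtr(
        strtolower($word),
        'abcdefghijklmnopqrstuvwxyz',
        '22233344455566677778889999'
    );
}

// Build the lookup table once: digit string => list of words spelling it.
function buildDigitIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[wordToDigits($word)][] = $word;
    }
    return $index;
}

$index = buildDigitIndex(['hello', 'good', 'home', 'gone']);
print_r($index['4663'] ?? []); // good, home, gone
```

Looking up a number is then a single array access instead of generating and checking every letter combination.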
The following code should do it. Fairly straight forward: it uses recursion, each level processes one character of input, a copy of current combination is built/passed at each recursive call, recursion stops at the level where last character of input is processed.
function alphaGenerator($input, &$output, $current = "") {
static $lookup = array(
1 => "1", 2 => "abc", 3 => "def",
4 => "ghi", 5 => "jkl", 6 => "mno",
7 => "pqrs", 8 => "tuv", 9 => "wxyz",
0 => "0"
);
$digit = substr($input, 0, 1); // e.g. "4"
$other = substr($input, 1); // e.g. "3556"
$chars = str_split($lookup[$digit], 1); // e.g. "ghi"
foreach ($chars as $char) { // e.g. g, h, i
if ($other === false || $other === '') { // base case: no digits left (substr() returns "" on PHP 8+, false on older versions)
$output[] = $current . $char;
} else { // recursive case
alphaGenerator($other, $output, $current . $char);
}
}
}
$output = array();
alphaGenerator("43556", $output);
var_dump($output);
Output:
array(243) {
[0]=>string(5) "gdjjm"
[1]=>string(5) "gdjjn"
...
[133]=>string(5) "helln"
[134]=>string(5) "hello"
[135]=>string(5) "hfjjm"
...
[241]=>string(5) "iflln"
[242]=>string(5) "ifllo"
}
You should read Norvig's article on writing a spellchecker in Python: http://norvig.com/spell-correct.html. Although it's a spellchecker and in Python rather than PHP, it's the same concept of finding words among possible variations, and it might give you some good ideas.

Regex expression for matching all duplicate substrings of any length

Let's say we have a string: "abcbcdcde"
I want to identify all substrings that are repeated in this string using regex (i.e. no brute-force iterative loops).
For the above string, the result set would be: {"b", "bc", "c", "cd", "d"}
I must confess that my regex is far more rusty than it should be for someone with my experience. I tried using a backreference, but that'll only match consecutive duplicates. I need to match all duplicates, consecutive or otherwise.
In other words, I want to match any character(s) that appears for the >= 2nd time. If a substring occurs 5 times, then I want to capture each of occurrences 2-5. Make sense?
This is my pathetic attempt thus far:
preg_match_all( '/(.+)(.*)\1+/', $string, $matches ); // Way off!
I tried playing with look-aheads but I'm just butchering it. I'm doing this in PHP (PCRE) but the problem is more or less language-agnostic. It's a bit embarrassing that I'm finding myself stumped on this.
Your problem is recursi ... you know what, forget about recursion! =p it wouldn't really work well in PHP and the algorithm is pretty clear without it as well.
function find_repeating_sequences($s)
{
$res = array();
while ((string) $s !== '') { // a plain while ($s) would stop early on the string "0"
$i = 1; $pat = $s[0];
while (false !== strpos($s, $pat, $i)) {
$res[$pat] = 1;
// expand pattern and try again
$pat .= $s[$i++];
}
// move the string forward
$s = substr($s, 1);
}
return array_keys($res);
}
Out of interest, I wrote Tim's answer in PHP as well:
function find_repeating_sequences_re($s)
{
$res = array();
preg_match_all('/(?=(.+).*\1)/', $s, $matches);
foreach ($matches[1] as $match) {
$length = strlen($match);
if ($length > 1) {
for ($i = 0; $i < $length; ++$i) {
for ($j = $i; $j < $length; ++$j) {
$res[substr($match, $i, $j - $i + 1)] = 1;
}
}
} else {
$res[$match] = 1;
}
}
return array_keys($res);
}
I've let them fight it out in a small benchmark of 800 bytes of random data:
$data = base64_encode(openssl_random_pseudo_bytes(600));
Each code is run for 10 rounds and the execution time is measured. The results?
Pure PHP - 0.014s (10 runs)
PCRE - 40.86s <-- ouch!
It gets weirder when you look at 24k bytes (or anything above 1k really):
Pure PHP - 4.565s (10 runs)
PCRE - 0.232s <-- WAT?!
It turns out that the regular expression broke down after 1k characters and so the $matches array was empty. These are my .ini settings:
pcre.backtrack_limit => 1000000 => 1000000
pcre.recursion_limit => 100000 => 100000
It's not clear to me how a backtrack or recursion limit would have been hit after only 1k of characters. But even if those settings are "fixed" somehow, the results are still obvious, PCRE doesn't seem to be the answer.
I suppose writing this in C would speed it up somewhat, but I'm not sure to what degree.
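An empty $matches array like the one described above can be told apart from "no repeats found" by checking preg_last_error() after the call. A minimal sketch, with `matchRepeats()` as a hypothetical wrapper name:

```php
<?php
// Hypothetical wrapper: returns the captured repeats, or null if PCRE gave up
// (for example because the backtrack limit was hit).
function matchRepeats(string $s): ?array
{
    $count = preg_match_all('/(?=(.+).*\1)/', $s, $matches);
    if ($count === false || preg_last_error() !== PREG_NO_ERROR) {
        return null;
    }
    return $matches[1];
}

$result = matchRepeats('abcbcdcde');
if ($result === null) {
    echo "PCRE error code: ", preg_last_error(), "\n";
} else {
    print_r($result); // bc, c, cd, d
}
```

A returned error code equal to PREG_BACKTRACK_LIMIT_ERROR would confirm that the benchmark timing for the larger input measured a failed match rather than a fast one.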
Update
With some help from hakre's answer I put together an improved version that increases performance by ~18% after optimizing the following:
Remove the substr() calls in the outer loop to advance the string pointer; this was a left over from my previous recursive incarnations.
Use the partial results as a positive cache to skip strpos() calls inside the inner loop.
And here it is, in all its glory (:
function find_repeating_sequences3($s)
{
$res = array();
$p = 0;
$len = strlen($s);
while ($p != $len) {
$pat = $s[$p]; $i = ++$p;
while ($i != $len) {
if (!isset($res[$pat])) {
if (false === strpos($s, $pat, $i)) {
break;
}
$res[$pat] = 1;
}
// expand pattern and try again
$pat .= $s[$i++];
}
}
return array_keys($res);
}
You can't get the required result in a single regex because a regex will match either greedily (finding bc...bc) or lazily (finding b...b and c...c), but never both. (In your case, it does find c...c, but only because c is repeated twice.)
But once you've found a repeated substring of length > 1, it logically follows that all the smaller "substrings of that substring" must also be repeated. If you want to get them spelled out for you, you need to do this separately.
Taking your example (using Python because I don't know PHP):
>>> results = set(m.group(1) for m in re.finditer(r"(?=(.+).*\1)", "abcbcdcde"))
>>> results
{'d', 'cd', 'bc', 'c'}
You could then go and apply the following function to each of your results:
def substrings(s):
    return [s[start:stop] for start in range(len(s)-1)
            for stop in range(start+1, len(s)+1)]
For example:
>>> substrings("123456")
['1', '12', '123', '1234', '12345', '123456', '2', '23', '234', '2345', '23456',
'3', '34', '345', '3456', '4', '45', '456', '5', '56']
The closest I can get is /(?=(.+).*\1)/
The purpose of the lookahead is to allow the same characters to be matched more than once (for instance, c and cd). However, for some reason it doesn't seem to be getting the b...
Interesting question. I basically took the function from Jack's answer and tried to see whether the number of tests could be reduced.
I first tried to search only half the string, but it turned out that creating the pattern to search for via substr each time was way too expensive. Appending one character per iteration, the way it is done in Jack's answer, looks to be much better. And then I ran out of time, so I could not look into it further.
However, while looking for such an alternative implementation, I at least found out that some of the differences in the algorithm I had in mind could be applied to Jack's function as well:
There is no need to cut the beginning of the string in each outer iteration as the search is already done with offsets.
If the rest of the subject to look for repetition is smaller than the repetition needle, you do not need to search for the needle.
If it was already searched for the needle, you don't need to search again.
Note: This is a memory trade-off. If you have many repetitions, you will use a similar amount of memory. However, if you have a low number of repetitions, then this variant uses more memory than before.
The function:
function find_repeating_sequences($string) {
$result = array();
$start = 0;
$max = strlen($string);
while ($start < $max) {
$pat = $string[$start];
$i = ++$start;
while ($max - $i > 0) {
$found = isset($result[$pat]) ? $result[$pat] : false !== strpos($string, $pat, $i);
if (!$result[$pat] = $found) break;
// expand pattern and try again
$pat .= $string[$i++];
}
}
return array_keys(array_filter($result));
}
So just see this as an addition to Jack's answer.

Short unique id in php

I want to create a unique id but uniqid() is giving something like '492607b0ee414'. What I would like is something similar to what tinyurl gives: '64k8ra'. The shorter, the better. The only requirements are that it should not have an obvious order and that it should look prettier than a seemingly random sequence of numbers. Letters are preferred over numbers and ideally it would not be mixed case. As the number of entries will not be that many (up to 10000 or so) the risk of collision isn't a huge factor.
Any suggestions appreciated.
Make a small function that returns random letters for a given length:
<?php
function generate_random_letters($length) {
$random = '';
for ($i = 0; $i < $length; $i++) {
$random .= chr(rand(ord('a'), ord('z')));
}
return $random;
}
Then you'll want to call that until it's unique, in pseudo-code depending on where you'd store that information:
do {
$unique = generate_random_letters(6);
} while (is_in_table($unique));
add_to_table($unique);
You might also want to make sure the letters do not form a word in a dictionary, be it the whole English dictionary or just a bad-word dictionary, to avoid things a customer would find in bad taste.
EDIT: I would also add that this only makes sense if, as you intend, it's not used for a large number of items, because it could get pretty slow the more collisions you get (getting an ID already in the table). Of course, you'll want an indexed table and you'll want to tweak the number of letters in the ID to avoid collisions. In this case, with 6 letters, you'd have 26^6 = 308915776 possible unique IDs (minus bad words), which should be enough for your need of 10000.
EDIT:
If you want a combinations of letters and numbers you can use the following code:
$random .= rand(0, 1) ? rand(0, 9) : chr(rand(ord('a'), ord('z')));
@gen_uuid() by gord.
preg_replace has some nasty UTF-8 problems, which sometimes cause the uid to contain "+" or "/".
To get around this, you have to explicitly make the pattern UTF-8:
function gen_uuid($len=8) {
$hex = md5("yourSaltHere" . uniqid("", true));
$pack = pack('H*', $hex);
$tmp = base64_encode($pack);
$uid = preg_replace("#(*UTF8)[^A-Za-z0-9]#", "", $tmp);
$len = max(4, min(128, $len));
while (strlen($uid) < $len)
$uid .= gen_uuid(22);
return substr($uid, 0, $len);
}
It took me quite a while to find that; perhaps it saves somebody else a headache.
You can achieve that with less code:
function gen_uid($l=10){
return substr(str_shuffle("0123456789abcdefghijklmnopqrstuvwxyz"), 0, $l);
}
Result (examples):
cjnp56brdy
9d5uv84zfa
ih162lryez
ri4ocf6tkj
xj04s83egi
There are two ways to obtain a reliably unique ID: Make it so long and variable that the chances of a collision are spectacularly small (as with a GUID) or store all generated IDs in a table for lookup (either in memory or in a DB or a file) to verify uniqueness upon generation.
If you're really asking how you can generate such a short key and guarantee its uniqueness without some kind of duplicate check, the answer is, you can't.
Here's the routine I use for random base62s of any length...
Calling gen_uuid() returns strings like WJX0u0jV, E9EMaZ3P etc.
By default this returns 8 characters, hence a space of 62^8 or roughly 2 x 10^14,
this is often enough to make collisions quite rare.
For a larger or smaller string, pass in $len as desired. No limit in length, as I append until satisfied [up to safety limit of 128 chars, which can be removed].
Note: use a random salt inside the md5 [or sha1 if you prefer], so it can't easily be reverse-engineered.
I didn't find any reliable base62 conversions on the web, hence this approach of stripping chars from the base64 result.
Use freely under BSD licence,
enjoy,
gord
function gen_uuid($len=8)
{
$hex = md5("your_random_salt_here_31415" . uniqid("", true));
$pack = pack('H*', $hex);
$uid = base64_encode($pack); // max 22 chars
$uid = preg_replace("/[^A-Za-z0-9]/", "", $uid); // mixed case (ereg_replace() was removed in PHP 7)
//$uid = preg_replace("/[^A-Z0-9]/", "", strtoupper($uid)); // uppercase only
if ($len<4)
$len=4;
if ($len>128)
$len=128; // prevent silliness, can remove
while (strlen($uid)<$len)
$uid = $uid . gen_uuid(22); // append until length achieved
return substr($uid, 0, $len);
}
Really simple solution:
Make the unique ID with:
$id = 100;
base_convert($id, 10, 36);
Get the original value again:
intval($str,36);
Can't take credit for this as it's from another stack overflow page, but I thought the solution was so elegant and awesome that it was worth copying over to this thread for people referencing this.
You could use the Id and just convert it to base-36 number if you want to convert it back and forth. Can be used for any table with an integer id.
function toUId($baseId, $multiplier = 1) {
return base_convert($baseId * $multiplier, 10, 36);
}
function fromUId($uid, $multiplier = 1) {
return intdiv((int) base_convert($uid, 36, 10), $multiplier); // divide first, so the result is always an int
}
echo toUId(10000, 11111);
1u5h0w
echo fromUId('1u5h0w', 11111);
10000
Smart people can probably figure it out with enough id examples. Don't let this obscurity replace security.
I came up with what I think is a pretty cool solution doing this without a uniqueness check. I thought I'd share for any future visitors.
A counter is a really easy way to guarantee uniqueness, or, if you're using a database, a primary key also guarantees uniqueness. The problem is that it looks bad and might be vulnerable. So I took the sequence and jumbled it up with a cipher. Since the cipher can be reversed, I know each id is unique while still appearing random.
It's Python, not PHP, but I uploaded the code here:
https://github.com/adecker89/Tiny-Unique-Identifiers
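The same idea can be sketched in PHP with a much simpler reversible mapping than a real cipher. To be clear, this is not the method from the linked repository: it is a swapped-in modular multiplication, the constant MULTIPLIER is an arbitrary hypothetical choice, and the function names are made up. It assumes 64-bit PHP and counter values below 36^6.

```php
<?php
const ID_SPACE = 2176782336;   // 36 ** 6: number of distinct 6-character base-36 codes
const MULTIPLIER = 1580030173; // hypothetical; must share no factor with ID_SPACE (odd, not divisible by 3)

// Modular inverse of $a modulo $m via the extended Euclidean algorithm.
function modInverse(int $a, int $m): int
{
    [$oldR, $r] = [$a, $m];
    [$oldS, $s] = [1, 0];
    while ($r !== 0) {
        $q = intdiv($oldR, $r);
        [$oldR, $r] = [$r, $oldR - $q * $r];
        [$oldS, $s] = [$s, $oldS - $q * $s];
    }
    if ($oldR !== 1) {
        throw new InvalidArgumentException('Multiplier and modulus are not coprime');
    }
    return (($oldS % $m) + $m) % $m;
}

// Counter value -> fixed-length code that does not look sequential.
function scrambleId(int $id): string
{
    if ($id < 0 || $id >= ID_SPACE) {
        throw new OutOfRangeException('Id outside the 6-character code space');
    }
    $mixed = ($id * MULTIPLIER) % ID_SPACE;
    return str_pad(base_convert((string) $mixed, 10, 36), 6, '0', STR_PAD_LEFT);
}

// Code -> original counter value.
function unscrambleId(string $code): int
{
    return (intval($code, 36) * modInverse(MULTIPLIER, ID_SPACE)) % ID_SPACE;
}

echo scrambleId(1), ' ', scrambleId(2), "\n";
```

Because the multiplication is invertible modulo ID_SPACE, two different counters can never produce the same code. As with the base-36 trick above, this is obscurity only and is easy to reverse for anyone who collects a few ids.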
Letters are pretty, digits are ugly.
You want random strings, but don't want "ugly" random strings?
Create a random number and print it in alpha-style (base-26), like the reservation "numbers" that airlines give.
PHP's built-in base_convert() only goes up to base 36 and always uses the digits 0-9 before the letters, so for a letters-only alphabet you'd need to code that bit yourself.
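The letters-only conversion is short to write by hand. A minimal sketch, with `toAlpha()` as a hypothetical name and 'a' playing the role of the digit zero:

```php
<?php
// Hypothetical helper: render a non-negative integer in base 26 using a-z only.
function toAlpha(int $number): string
{
    if ($number < 0) {
        throw new InvalidArgumentException('Only non-negative numbers are supported');
    }
    $result = '';
    do {
        $result = chr(ord('a') + $number % 26) . $result; // lowest digit first, prepended
        $number = intdiv($number, 26);
    } while ($number > 0);
    return $result;
}

echo toAlpha(26), "\n";        // ba
echo toAlpha(308915775), "\n"; // zzzzzz, the largest 6-letter value (26^6 - 1)
```

Feeding it a random number below 26^6 gives a code of at most six letters; pad with 'a' on the left if a fixed length is needed.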
Another alternative: use uniqid() and get rid of the digits.
function strip_digits_from_string($string) {
return preg_replace('/[0-9]/', '', $string);
}
Or replace them with letters:
function replace_digits_with_letters($string) {
return strtr($string, '0123456789', 'abcdefghij');
}
You can also do it like this:
public static function generateCode($length = 6)
{
$az = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
$azr = rand(0, 51);
$azs = substr($az, $azr, 10);
$stamp = hash('sha256', time());
$mt = hash('sha256', mt_rand(5, 20));
$alpha = hash('sha256', $azs);
$hash = str_shuffle($stamp . $mt . $alpha);
$code = ucfirst(substr($hash, $azr, $length));
return $code;
}
You can do that without unclean or costly stuff like loops, string concatenation, or multiple calls to rand(), in a clean and easy-to-read way. Also, it is better to use mt_rand():
function createRandomString($length)
{
$random = mt_rand(0, (1 << ($length << 2)) - 1);
return dechex($random);
}
If you need the string to have the exact length in every case, just pad the hex number with zeros:
function createRandomString($length)
{
$random = mt_rand(0, (1 << ($length << 2)) - 1);
$number = dechex($random);
return str_pad($number, $length, '0', STR_PAD_LEFT);
}
The "theoretical drawback" is that you are limited to PHP's capabilities, but this is more of a philosophical issue in this case ;) Let's go through it anyway:
PHP is limited in what it can represent as a hex number this way. That means $length <= 8, at least on a 32-bit system, where PHP's limit for this should be 4,294,967,295.
PHP's random number generator also has a maximum. For mt_rand(), at least on a 32-bit system, it should be 2,147,483,647.
So you are theoretically limited to 2,147,483,647 IDs.
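If those integer limits matter, one possible way around them is to draw random bytes instead of a single random integer. The following is only a sketch, assuming PHP 7+ where random_bytes() is available; the name createRandomHex is made up for this example.

```php
<?php
// Build a hex string of exactly $length characters from random bytes,
// so the length is not tied to the size of a PHP integer.
function createRandomHex(int $length): string
{
    if ($length < 1) {
        throw new InvalidArgumentException('Length must be at least 1');
    }
    // Each byte yields two hex characters; round up, then trim to the exact length.
    $bytes = random_bytes(intdiv($length + 1, 2));
    return substr(bin2hex($bytes), 0, $length);
}
```

Like the zero-padded variant above, this always returns the requested length, and it also works for lengths well beyond 8.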
Coming back to the topic: the intuitive do { (generate ID) } while (ID is not unique); (insert ID) has one drawback and one possible flaw that might drive you straight to darkness...
Drawback: The validation is pessimistic. Doing it like this always requires a check against the database. With enough keyspace (for example, a length of 5 for your 10k entries), collisions are rare, so it is likely to consume fewer resources to just try to store the data and retry only in case of a UNIQUE KEY error.
Flaw: User A retrieves an ID that gets verified as not taken yet. Then the code tries to insert the data. But in the meantime, User B has entered the same loop and unfortunately retrieves the same random number, because User A is not stored yet and this ID was still free. The system now stores whichever user arrives first, and when it attempts to store the second one, the ID is already taken.
You need to handle that exception in any case and retry the insertion with a newly created ID. Adding this while keeping the pessimistic checking loop (which you would need to re-enter) results in quite ugly and hard-to-follow code. Fortunately, the solution to this is the same as the one to the drawback: just go for it in the first place and try to store the data. In case of a UNIQUE KEY error, just retry with a new ID.
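The optimistic approach described above could look roughly like this. It is a sketch only: storeWithRetry, DuplicateIdException and the two callables are hypothetical names for this example, and the storage layer is assumed to signal a UNIQUE KEY violation by throwing DuplicateIdException.

```php
<?php
// Hypothetical exception the storage layer throws on a UNIQUE KEY violation.
final class DuplicateIdException extends RuntimeException
{
}

// Try to store under a fresh ID; on a collision, generate a new ID and retry.
// $generateId returns a candidate ID, $tryInsert performs the actual insert.
function storeWithRetry(callable $generateId, callable $tryInsert, int $maxAttempts = 5): string
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $id = $generateId();
        try {
            $tryInsert($id); // no prior existence check: just go for it
            return $id;
        } catch (DuplicateIdException $e) {
            // The ID was taken (possibly by a concurrent request); try again.
        }
    }
    throw new RuntimeException("No free ID found after $maxAttempts attempts");
}
```

With PDO in exception mode, $tryInsert would typically catch the PDOException for the failed INSERT, check for an integrity constraint violation (SQLSTATE class 23, for example 23000 on MySQL), and rethrow it as DuplicateIdException. The attempt limit keeps a nearly exhausted keyspace from looping forever.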
Take a look at this article:
Create short IDs with PHP - Like Youtube or TinyURL
It explains how to generate short unique IDs from your database IDs, like YouTube does.
Actually, the function in the article is closely related to PHP's base_convert function, which converts a number from one base to another (but only up to base 36).
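To go beyond base 36, the conversion has to be written by hand. Here is a minimal base-62 sketch, assuming PHP 7+; it is not the function from the article, and the alphabet order and the names base62Encode and base62Decode are choices made for this example.

```php
<?php
// Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
const BASE62_ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Encode a non-negative integer (e.g. a database ID) as a base-62 string.
function base62Encode(int $n): string
{
    if ($n < 0) {
        throw new InvalidArgumentException('Only non-negative integers are supported');
    }
    $out = '';
    do {
        $out = substr(BASE62_ALPHABET, $n % 62, 1) . $out;
        $n = intdiv($n, 62);
    } while ($n > 0);
    return $out;
}

// Decode a base-62 string back to the integer.
function base62Decode(string $s): int
{
    if ($s === '') {
        throw new InvalidArgumentException('Empty string');
    }
    $n = 0;
    foreach (str_split($s) as $ch) {
        $pos = strpos(BASE62_ALPHABET, $ch);
        if ($pos === false) {
            throw new InvalidArgumentException("Invalid character: $ch");
        }
        $n = $n * 62 + $pos;
    }
    return $n;
}
```

Using 62 symbols instead of 36 shortens the IDs somewhat, at the cost of making them case-sensitive.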
10 chars:
substr(uniqid(),-10);
5 binary chars:
hex2bin( substr(uniqid(),-10) );
8 base64 chars:
base64_encode( hex2bin( substr(uniqid(),-10) ) );
function rand_str($len = 12, $type = '111', $add = null) {
$rand = ($type[0] == '1' ? 'abcdefghijklmnpqrstuvwxyz' : '') .
($type[1] == '1' ? 'ABCDEFGHIJKLMNPQRSTUVWXYZ' : '') .
($type[2] == '1' ? '123456789' : '') .
($add ?? ''); // null-safe: passing null to strlen() is deprecated since PHP 8.1
if(empty($rand)) $rand = sha1( uniqid(mt_rand(), true) . uniqid( uniqid(mt_rand(), true), true) );
return substr(str_shuffle( str_repeat($rand, 2) ), 0, $len);
}
If you'd like a longer ID, use this (note that time() has one-second resolution, so calls within the same second return the same value):
$uniqueid = sha1(md5(time()));
Best Answer Yet: Smallest Unique "Hash Like" String Given Unique Database ID - PHP Solution, No Third Party Libraries Required.
Here's the code:
<?php
/*
THE FOLLOWING CODE WILL PRINT:
A database_id value of 200 maps to 5K
A database_id value of 1 maps to 1
A database_id value of 1987645 maps to 16LOD
*/
$database_id = 200;
$base36value = dec2string($database_id, 36);
echo "A database_id value of 200 maps to $base36value\n";
$database_id = 1;
$base36value = dec2string($database_id, 36);
echo "A database_id value of 1 maps to $base36value\n";
$database_id = 1987645;
$base36value = dec2string($database_id, 36);
echo "A database_id value of 1987645 maps to $base36value\n";
// HERE'S THE FUNCTION THAT DOES THE HEAVY LIFTING...
function dec2string ($decimal, $base)
// convert a decimal number into a string using $base
{
//DebugBreak();
global $error;
$string = null;
$base = (int)$base;
if ($base < 2 || $base > 36 || $base == 10) {
echo 'BASE must be in the range 2-9 or 11-36';
exit;
} // if
// maximum character string is 36 characters
$charset = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
// strip off excess characters (anything beyond $base)
$charset = substr($charset, 0, $base);
if (!preg_match('/^[0-9]{1,50}$/', trim((string) $decimal))) { // ereg() was removed in PHP 7
$error['dec_input'] = 'Value must be a non-negative integer with at most 50 digits';
return false;
} // if
do {
// get remainder after dividing by BASE
$remainder = bcmod($decimal, $base);
$char = substr($charset, $remainder, 1); // get CHAR from array
$string = "$char$string"; // prepend to output
//$decimal = ($decimal - $remainder) / $base;
$decimal = bcdiv(bcsub($decimal, $remainder), $base);
} while ($decimal > 0);
return $string;
}
?>
