Integer string compression algorithm [closed]

Integer string compression algorithm [closed] - php

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
can someone please name an existing algo which is used for compressing numbers? numbers are integers and totally random with no spaces and decimals, eg. 35637462736423478235687479567456....n
well, so far, all i have is this, it converts the integers into ascii reducing approx 40% of the original size
function intergerToChar($v)
{
$buffer="";
$charsLen=strlen($v);
for($i = 0; $i <= $charsLen; $i++)
{
$asc=$v[$i];
if($asc==0){$buffer[]=0;}
elseif($asc==1){$buffer[]=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
elseif($asc==2)
{
if($v[$i+1]<5){$buffer[]=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
elseif($v[$i+1]==5 && $v[$i+2]<6){$buffer[]=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
else{$buffer[]=$v[$i].$v[$i+1];$i++;}
}
else{$buffer[]=$v[$i].$v[$i+1];$i++;}
}
return $buffer;
}
btw, i know PHP is not meant for building a compression tool. I'll be using C/C++
UPDATE: This is another PHP code with better compressing result than the above code, it can compress upto 66% if the integers on the position 1st,6th,12,th, and so on has vales of less than 256 and the 3 integers following them have a values not more than 256 than the preceding 3 integers egs, 134298156286159.... can be compressed upto 66% i knw its not optimal, please feel free to make suggestions/corrections
function intergerToChar2($v)
{
$buffer="";
$charsLen=strlen($v);
for($i = 0; $i <= $charsLen; $i++)
{
if($v[$i].$v[$i+1].$v[$i+2]<256){$base=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
else{$base=$v[$i].$v[$i+1];$i=$i+1;}$i=$i+1;
if($v[$i].$v[$i+1].$v[$i+2]<256){$next=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
else{$next=$v[$i].$v[$i+1];$i=$i+1;}
if($next!=="")
{
$next=$next-$base;
if($next<0)$next=255+$next;
}
$buffer[]=$base;
$buffer[]=$next;
}
return $buffer;
}
btw, 10 bit encoding or 40 bit encoding can be easily done using base_convert() or 4th comment from http://php.net/manual/en/ref.bc.php page which always shows a compression of about 58.6%.

If the digits are random, then you cannot compress the sequence more than the information-theoretic limit, which is log210 bits/digit. (Actually, it's slightly more than that unless the precise length of the string is fixed.) You can achieve that limit by representing the digits as a (very long) binary number; however, that's awkward and timeconsuming to compress and decompress.
A very near optimal solution results from the fact that 1000 is only slightly less than 210, so you can represent 3 digits using 10 bits. That's 3.33 bits/digits, compared with the theoretically optimal 3.32 bits/digit. (In other words, it's about 99.7% optimal.)
Since there actually 1024 possible 10-bit codes, and you only need 1000 of them to represent 3 digits, you have some left over; one of them can be used to indicate the end of the stream, if necessary.
It's a little bit annoying to output 10-bit numbers. It's easier to output 40-bit numbers, since 40 bits is exactly five bytes. Fortunately, most languages these days support 40-bit arithmetic (actually 64-bit arithmetic).
(Note: This is not that different from your solution. But it's a bit easier and a bit more compressed.)

Related

High Entropy Unique ID for Filenames

I would like to create a unique ID (filename) for some of my files and would like to do so with minimal concern. My original plan was to use a base64 encoded string of random bytes but this will not work because / is a valid base64 character.
The function I currently use looks like this:
static function uniqueID(){
return base64_encode(random_bytes(16));
}
How could I go about generating a high entropy unique ID that will always be a certain length and contain only characters that are valid in filenames?
Note: I do not wish to use a solution that accomplishes this with str_replace().
Edit: I appreciate the responses so far.This is one solution Ive explored locally but I am not sure how to calculate how much entropy I lose when compared to uniqueID().
public static function uniqueFilename(){
return bin2hex(random_bytes(12));
}

As User: Sammitch stated,
12 bytes = 96 bits. If you generated a random 96-bit value and then
tried to intentionally brute-force guess a collision you'd need to
build a Dyson sphere around the sun to power the computer and then
wait a few centuries. Just use the random bytes.
96 bits is like
concatenating 3x 32bit numbers together, the number of possibilities
is: 79,228,162,514,264,337,593,543,950,336
Knowing this we can justify using the following function when $n >= 12.
public static function uniqueFilename($n = 12){
return bin2hex(random_bytes($n));
}

I would recommend Ramsey's UUID: https://github.com/ramsey/uuid .
We use it in production and it does the job.

PHP built in functions complexity (isAnagramOfPalindrome function)

I've been googling for the past 2 hours, and I cannot find a list of php built in functions time and space complexity. I have the isAnagramOfPalindrome problem to solve with the following maximum allowed complexity:
expected worst-case time complexity is O(N)
expected worst-case space complexity is O(1) (not counting the storage required for input arguments).
where N is the input string length. Here is my simplest solution, but I don't know if it is within the complexity limits.
class Solution {
// Function to determine if the input string can make a palindrome by rearranging it
static public function isAnagramOfPalindrome($S) {
// here I am counting how many characters have odd number of occurrences
$odds = count(array_filter(count_chars($S, 1), function($var) {
return($var & 1);
}));
// If the string length is odd, then a palindrome would have 1 character with odd number occurrences
// If the string length is even, all characters should have even number of occurrences
return (int)($odds == (strlen($S) & 1));
}
}
echo Solution :: isAnagramOfPalindrome($_POST['input']);
Anyone have an idea where to find this kind of information?
EDIT
I found out that array_filter has O(N) complexity, and count has O(1) complexity. Now I need to find info on count_chars, but a full list would be very convenient for future porblems.
EDIT 2
After some research on space and time complexity in general, I found out that this code has O(N) time complexity and O(1) space complexity because:
The count_chars will loop N times (full length of the input string, giving it a start complexity of O(N) ). This is generating an array with limited maximum number of fields (26 precisely, the number of different characters), and then it is applying a filter on this array, which means the filter will loop 26 times at most. When pushing the input length towards infinity, this loop is insignificant and it is seen as a constant. Count also applies to this generated constant array, and besides, it is insignificant because the count function complexity is O(1). Hence, the time complexity of the algorithm is O(N).
It goes the same with space complexity. When calculating space complexity, we do not count the input, only the objects generated in the process. These objects are the 26-elements array and the count variable, and both are treated as constants because their size cannot increase over this point, not matter how big the input is. So we can say that the algorithm has a space complexity of O(1).
Anyway, that list would be still valuable so we do not have to look inside the php source code. :)

A probable reason for not including this information is that is is likely to change per release, as improvements are made / optimizations for a general case.
PHP is built on C, Some of the functions are simply wrappers around the c counterparts, for example hypot a google search, a look at man hypot, in the docs for he math lib
http://www.gnu.org/software/libc/manual/html_node/Exponents-and-Logarithms.html#Exponents-and-Logarithms
The source actually provides no better info
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/math/w_hypot.c (Not official, Just easy to link to)
Not to mention, This is only glibc, Windows will have a different implementation. So there MAY even be a different big O per OS that PHP is compiled on
Another reason could be because it would confuse most developers.
Most developers I know would simply choose a function with the "best" big O
a maximum doesnt always mean its slower
http://www.sorting-algorithms.com/
Has a good visual prop of whats happening with some functions, ie bubble sort is a "slow" sort, Yet its one of the fastest for nearly sorted data.
Quick sort is what many will use, which is actually very slow for nearly sorted data.
Big O is worst case - PHP may decide between a release that they should optimize for a certain condition and that will change the big O of the function and theres no easy way to document that.
There is a partial list here (which I guess you have seen)
List of Big-O for PHP functions
Which does list some of the more common PHP functions.
For this particular example....
Its fairly easy to solve without using the built in functions.
Example code
function isPalAnagram($string) {
$string = str_replace(" ", "", $string);
$len = strlen($string);
$oddCount = $len & 1;
$string = str_split($string);
while ($len > 0 && $oddCount >= 0) {
$current = reset($string);
$replace_count = 0;
foreach($string as $key => &$char) {
if ($char === $current){
unset($string[$key]);
$len--;
$replace_count++;
continue;
}
}
$oddCount -= ($replace_count & 1);
}
return ($len - $oddCount) === 0;
}
Using the fact that there can not be more than 1 odd count, you can return early from the array.
I think mine is also O(N) time because its worst case is O(N) as far as I can tell.
Test
$a = microtime(true);
for($i=1; $i<100000; $i++) {
testMethod("the quick brown fox jumped over the lazy dog");
testMethod("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
testMethod("testest");
}
printf("Took %s seconds, %s memory", microtime(true) - $a, memory_get_peak_usage(true));
Tests run using really old hardware
My way
Took 64.125452041626 seconds, 262144 memory
Your way
Took 112.96145009995 seconds, 262144 memory
I'm fairly sure that my way is not the quickest way either.
I actually cant see much info either for languages other than PHP (Java for example).
I know a lot of this post is speculating about why its not there and theres not a lot drawing from credible sources, I hope its an partially explained why big O isnt listed in the documentation page though

Probability of a random variable [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I really feel ashamed to ask this question however I don't have time for revision. Also not a native English speaker, so excuse my lack of math vocabulary.
I am writing a program that requires assigning probabilities to variables then selecting one randomly.
Example:
Imagine that I have I coin, I would like to assign the probably of 70% to heads and 30% to tails. When I toss it I would like to have 70% chance that the heads appears and 30% tails.
A dumb way to do it is to create an array of cells insert the heads 70 cells and the tail in 30. Randomize them and select one randomly.
Edit 1: I also would like to point out that I am not limited to 2 variables. For example lets say that I have 3 characters to select between (*,\$,#) and I want to assign the following probably to each of them * = 30%, \$ = 30%, and # = 40%.
That's why I did not want to to use the random function and wanted to see how it was done mathematically.

You want another way to do it? Most rand functions produce a decimal from [0, 1). For 30%, check produced number is less than 0.3
Though note, if you actually test the perceived "randomness", it's not really random..
In PHP, you can use rand(0, 99) (integer instead of double, 30 instead of 0.3). PHP rand function is a closed interval (both inclusive)
function randWithWeight($chanceToReturnTrue) { // chance in percent
return rand(0, 99) < $chanceToReturnTrue;
}
Edit: for the note about perceived randomness. Some math since you say you're coming from math... Generate numbers from 0-99, adding them to an array. Stop when the array contains a duplicate. It usually takes about ~20 passes (I'm getting 3-21 passes before duplicate, 10+ tries). So it's not what you'd expect as "random". Though, (I know I'm going off track), take a look at the birthday problem. It is "more random" than it seems.

Here is a simple function to calculate weighted rand:
<?php
function weightedRand($weights, $weight_sum = 100){
$r = rand(1,$weight_sum);
$n = count($weights);
$i = 0;
while($r > 0 && $i < $n){
$r -= $weights[$i];
$i++;
}
return $i - 1;
}
This function accepts an array. For example array(30,70) will have 30% chance getting 0 and 70% chance getting 1. This should work for multiple weights.
Its principle is to subtract the generated random number by the weight until it gets less than or equal to zero.
Demo with 30%:70%
Demo with 20%:30%:50%

If you want 30% probability just do
if(rand(1,100) <= 30){
// execute code
}

One way would be
$r=rand(1,100);
if($r<70)
{
echo "Head";
}
else
{
echo "Tail";
}

Same algorithm , different result

Good day, I am making my hashing algorthm, so I am rewriting it to C++ from PHP.
But result in C++ is different than php result. PHP result contains more than 10 characters, C++ result only 6 - 8 characters. But those last 8 characters of PHP result are same as C++ result.
So here is PHP code:
<?php function JL1($text) {
$text.="XQ";
$length=strlen($text);
$hash=0;
for($j=0;$j<$length;$j++) {
$p=$text[$j];
$s=ord($p);
if($s%2==0) $s+=9999;
$hash+=$s*($j+1)*0x40ACEF*0xFF;
}
$hash+=33*0x40ACEF*0xFF;
$hash=sprintf("%x",$hash);
return $hash; } ?>
And here C++ code:
char * JL1(char * str){
int size=(strlen(str)+3),s=0; //Edit here (+2 replaced with +3)
if(size<=6) //Edit here (<9 replaced with <=6)
size=9;
char *final=new char[size],temp;
strcpy(final,str);
strcat(final,"XQ");
long length=strlen(final),hash=0L;
for(int i=0;i<length;i++){
temp=final[i];
s=(int)temp;
if(s%2==0)s+=9999;
hash+=((s)*(i+1)*(0x40ACEF)*(0xFF));
}
hash+=33*(0x40ACEF)*(0xFF);
sprintf(final,"%x",hash); //to hex string
final[8]='\0';
return final; }
Example of C++ result for word: "Hi!" : 053c81be
And PHP result for this word: 324c053c81be
Does anyone know,where is that mistake and how to fix that, whether in php or in cpp code?
By the way, when I cut those first letters in php result I get C++ result, but it wont help, because C++ result have not to be 8 characters long, it can be just 6 characters long in some cases.

Where to begin...
Data types do not have fixed guaranteed sizes in C or C++. As such, hash may overflow every iteration, or it may never do so.
chars can be either signed or unsigned, therefore converting one to an integer may result in negative and positive values on different implementations, for the same character.
You may be writing past the end of final when printing the value of hash into it. You may also be cutting the string off prematurely when setting the 9th character to 0.
strcat will write past the end of final if str is at least 7 characters long.
s, a relatively short-lived temporary variable, is declared way too soon. Same with temp.
Your code looks very crowded with almost no whitespace, and is very hard to read.
The expression "33*(0x40ACEF)*(0xFF)" overflows; did you mean 0x4DF48431L?
Consider using std::string instead of char arrays when dealing with strings in C++.

long hash in C++ is most likely limited to 32 bits on your platform. PHP's number isn't.
sprintf(final, "%x", hash) produces a possibly incorrect result. %x interprets the argument as an unsigned int, which is 32 bits on both Windows and Linux x64. So it's interpreting a long as an unsigned int, if your long is more than 32 bits, your result will get truncated.
See all the issues raised by aib. Especially the premature termination of the result.
You will need to deal with the 3rd point yourself, but I can answer the first two. You need to clamp the result to 32 bits: $hash &= 0xFFFFFFFF;.
If you clamp the final value, the php code will produce the same results as the C++ code would on x64 Linux (that means 64 bit integers for intermediate results).
If you clamp it after every computation, you should get the same results as the C++ code would on 32 bit platforms or Windows x64 (32 bit integers for intermediate results).

There seems to be a bug here...
int size=(strlen(str)+2),s=0;
if(size<9)
size=9;
char *final=new char[size],temp;
strcpy(final,str);
strcat(final,"XQ");
If strlen was say 10, then size will be 12 and 12 chars will be allocated.
You then copy in the original 10 characters, and add XQ, but the final terminating \0 will be outside of the allocated memory.
Not sure if that's your bug or not but it doesn;t look right

RNG gaussian distribution

What I'm trying to do isn't exactly a Gaussian distribution, since it has a finite minimum and maximum. The idea is closer to rolling X dice and counting the total.
I currently have the following function:
function bellcurve($min=0,$max=100,$entropy=-1) {
$sum = 0;
if( $entropy < 0) $entropy = ($max-$min)/15;
for($i=0; $i<$entropy; $i++) $sum += rand(0,15);
return floor($sum/(15*$entropy)*($max-$min)+$min);
}
The idea behind the $entropy variable is to try and roll enough dice to get a more even distribution of fractional results (so that flooring it won't cause problems).
It doesn't need to be a perfect RNG, it's just for a game feature and nothing like gambling or cryptography.
However, I ran a test over 65,536 iterations of bellcurve() with no arguments, and the following graph emerged:
(source: adamhaskell.net)
As you can see, there are a couple of values that are "offset", and drastically so. While overall it doesn't really affect that much (at worst it's offset by 2, and ignoring that the probability is still more or less where I want it), I'm just wondering where I went wrong.
Any additional advice on this function would be appreciated too.
UPDATE: I fixed the problem above just by using round instead of floor, but I'm still having trouble getting a good function for this. I've tried pretty much every function I can think of, including gaussian, exponential, logistic, and so on, but to no avail. The only method that has worked so far is this approximation of rolling dice, which is almost certainly not what I need...

If you are looking for a bell curve distribution, generate multiple random numbers and add them together. If you are looking for more modifiers, simply multiply them to the end result.
Generate a random bell curve number, with a bonus of 50% - 150%.
Sum(rand(0,15), rand(0,15) , rand(0,15))*(rand(2,6)/2)
Though if you're concerned about rand not providing random enough numbers you can use mt_rand which will have a much better distribution (uses mersenne twister)

The main issue turned out to be that I was trying to generate a continuous bell curve based on a discrete variable. That's what caused holes and offsets when scaling the result.
The fix I used for this was: +rand(0,1000000)/1000000 - it essentially takes the whole number discrete variable and adds a random fraction to it, more or less making it continuous.
The function is now:
function bellcurve() {
$sum = 0;
$entropy = 6;
for($i=0; $i<$entropy; $i++) $sum += rand(0,15);
return ($sum+rand(0,1000000)/1000000)/(15*$entropy);
}
It returns a float between 0 and 1 inclusive (although those exact values are extremely unlikely), which can then be scaled and rounded as needed.
Example usage:
$damage *= bellcurve()-0.5; // adjusts $damage by a random amount
// between 50% and 150%, weighted in favour of 100%

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Integer string compression algorithm [closed] - php

Related

High Entropy Unique ID for Filenames

PHP built in functions complexity (isAnagramOfPalindrome function)

Probability of a random variable [closed]

Same algorithm , different result

RNG gaussian distribution

Categories

Resources