Could you please advise how to calculate a file hash for files larger than 2GB in PHP?
The only PHP function known to me is:
string hash_file ( string $algo , string $filename [, bool $raw_output = false ] )
This function, however, has a limitation: it returns a hash only for files smaller than 2GB. For larger files, hash_file() throws an error.
Here are some constraints/requests:
should work on Linux Ubuntu 64bit server
compatible with PHP 5+
there should be no file size limit
should be as fast as possible
This is all the information I have now. Thank you very much.
UPDATE
I have found a solution that is more practical and efficient than hashing >2GB of data.
I have realized that I do not have to generate the hash from the complete contents of files over 2GB. To uniquely identify any file, hashing, say, the first 10KB of its data should be sufficient. Moreover, it will be faster than hashing >2GB. In other words, the ability to hash a data string that is over 2GB is probably not necessary at all.
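A minimal sketch of that idea (my own illustration, with a hypothetical helper name; note that two different files can in principle share the same first 10KB, so a collision check may still be wise):

function partial_hash($path, $bytes = 10240, $algo = 'md5')
{
    // Hash only the first $bytes of the file; works regardless of total
    // file size because only $bytes are ever read into memory.
    $h = fopen($path, 'rb');
    $data = fread($h, $bytes);
    fclose($h);
    return hash($algo, $data);
}

echo partial_hash('/path/to/huge_file.iso'); // '/path/to/huge_file.iso' is a placeholder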
I will wait for your reactions. In a couple of days, I will close this question.
I would use exec() to run a local hashing function in the shell and return the value back to the PHP script. Here's an example with md5, but any available algorithm can be used.
$results = array();
$filename = '/full/path/to/file';
// escapeshellarg() protects against spaces and shell metacharacters in the path
exec('md5sum ' . escapeshellarg($filename), $results);
Then parse the result array (the output of the shell command).
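A minimal parse might look like this (assuming md5sum's usual "hash  filename" output format):

// md5sum prints "<hash>  <filename>"; the hash is the first whitespace-delimited token
list($hash) = explode(' ', $results[0]);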
In general, I like to avoid doing anything directly in PHP that requires more than 1GB of memory, especially when running under php-fpm or as an Apache module (call it a prejudice reinforced by time). This is definitely my advice when a native application can accomplish the goal and you don't particularly need cross-platform portability (i.e., running on both Linux and Windows machines).
Related
So how big can a $variable in PHP get? I've tried to test this, but I'm not sure that I have enough system memory (~2GB). I figure there has to be some kind of limit. What happens when a string gets too large? Is it truncated, or does PHP throw an exception?
http://php.net/manual/en/language.types.string.php says:
Note: As of PHP 7.0.0, there are no particular restrictions regarding the length of a string on 64-bit builds. On 32-bit builds and in earlier versions, a string can be as large as up to 2GB (2147483647 bytes maximum)
In PHP 5.x, strings were limited to 2^31 - 1 bytes, because internal code recorded the length in a signed 32-bit integer.
You can slurp in the contents of an entire file, for instance using file_get_contents().
However, a PHP script has a limit on the total memory it can allocate for all variables in a given script execution, so this effectively places a limit on the length of a single string variable too.
This limit is the memory_limit directive in the php.ini configuration file. The memory limit defaults to 128MB in PHP 5.2, and 8MB in earlier releases.
If you don't specify a memory limit in your php.ini file, it uses the default, which is compiled into the PHP binary. In theory you can modify the source and rebuild PHP to change this default value.
If you specify -1 as the memory limit in your php.ini file, PHP stops checking and permits your script to use as much memory as the operating system will allocate. This is still a practical limit, dependent on system resources and architecture.
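For example, checking and (where the host allows ini_set()) overriding the limit at runtime:

echo ini_get('memory_limit'), "\n"; // e.g. "128M"
ini_set('memory_limit', '512M');    // per-script override
ini_set('memory_limit', '-1');      // no limit: bounded only by what the OS will give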
Re comment from @c2:
Here's a test:
<?php
// limit memory usage to 1MB
ini_set('memory_limit', 1024*1024);
// initially, PHP seems to allocate 768KB for basic operation
printf("memory: %d\n", memory_get_usage(true));
$str = str_repeat('a', 255*1024);
echo "Allocated string of 255KB\n";
// now we have allocated all of the 1MB of memory allowed
printf("memory: %d\n", memory_get_usage(true));
// going over the limit causes a fatal error, so no output follows
$str = str_repeat('a', 256*1024);
echo "Allocated string of 256KB\n";
printf("memory: %d\n", memory_get_usage(true));
A string can be as large as 2GB.
Source
PHP's string length is limited by the way strings are represented in PHP; memory does not have anything to do with it.
According to phpinternalsbook.com, strings are stored in struct { char *val; int len; }, and since an int is 4 bytes on most platforms, this effectively limits the maximum string size to 2GB.
In the upcoming PHP 7, among many other features, support was added for strings bigger than 2^31 bytes:
Support for strings with length >= 2^31 bytes in 64 bit builds.
Sadly, they did not specify how much bigger it can be.
The maximum length of a string variable is only 2 GiB (2^31 bytes): strings are addressed one character (8 bits/1 byte) at a time, and the addressing is done with signed 32-bit integers, which is why the limit is what it is. Arrays can contain multiple variables that each follow this restriction, but their total cumulative size can be as large as memory_limit, to which a single string variable is also subject.
To properly answer this question you need to consider PHP internals, or the target that PHP is built for.
To answer this from a typical Linux perspective on x86...
Sizes of types in C:
https://usrmisc.wordpress.com/2012/12/27/integer-sizes-in-c-on-32-bit-and-64-bit-linux/
Types used in PHP for variables:
http://php.net/manual/en/internals2.variables.intro.php
Strings are capped at 2GB because the length field is always 32 bits, and one bit is wasted because the field is an int rather than a uint. An int is impractical for lengths over 2GB, as it would require casts to avoid breaking arithmetic and greater-/less-than comparisons. The extra bit is likely being used for overflow checks.
Strangely, hash keys might internally support 4GB, as a uint is used, although I have never put this to the test. PHP hash keys add +1 to the length for a trailing null byte, which to my knowledge is otherwise ignored, so the field may need to be unsigned for that edge case rather than to allow longer keys.
A 32-bit system may impose further external limits.
I'm currently using the following two methods in my class to get the job done:
function xseek($h, $pos) {
    rewind($h);
    if ($pos > 0)
        fread($h, $pos);
}

function find($str) {
    return $this->startingindex($this->name, $str);
}

function startingindex($a, $b) {
    $lim = 1 + filesize($a) - strlen($b) / 2;
    $h = fopen($a, "rb");
    rewind($h);
    for ($i = 0; $i < $lim; $i++) {
        $this->xseek($h, $i);
        if ($b == strtoupper(bin2hex(fread($h, strlen($b) / 2)))) {
            fclose($h);
            return $i;
        }
    }
    fclose($h);
    return -1;
}
I realize this is quite inefficient, especially for PHP, but I'm not allowed any other language on my hosting plan.
I ran a couple of tests, and when the hex string is toward the beginning of the file, it runs quickly and returns the offset. When the hex string isn't found, however, the page hangs for a while. This kills me inside, because the last time I tested with PHP and had hanging pages, my web host shut my site down for 24 hours due to too much CPU time.
Is there a better way to accomplish this (finding a hex string's offset in a file)? Are there certain aspects of this that could be improved to speed up execution?
I would read the entire contents of the file into one hex string and use strrpos, but I was getting errors about maximum memory being exceeded. Would this be a better method if I chopped the file up and searched large pieces with strrpos?
edit:
To specify, I'm dealing with a settings file for a game. The settings and their values are in a block where there is a 32-bit int before the setting, then the setting, a 32-bit int before the value, and then the value. Both ints represent the lengths of the following strings. For example, if the setting was "test" and the value was "0", it would look like (in hex): 00000004746573740000000130. Now that you mention it, this does seem like a bad way to go about it. What would you recommend?
edit 2:
I tried a file that was below the maximum memory I'm allowed and tried strrpos, but it was much slower than the way I've been doing it.
edit 3: in reply to Charles:
What's unknown is the length of the settings block and where it starts. What I do know is what the first and last settings USUALLY are. I've been using these searching methods to find the location of the first and last setting and determine the length of the settings block. I also know where the parent block starts. The settings block is generally no more than 50 bytes into its parent, so I could start the search for the first setting there and limit how far it will search. The problem is that I also need to find the last setting. The length of the settings block is variable and could be any length. I could read the file the way I assume the game does, by reading the size of the setting, reading the setting, reading the size of the value, reading the value, etc. until I reached a byte with value -1, or FF in hex. Would a combination of limiting the search for the first setting and reading the settings properly make this much more efficient?
You have a lot of garbage code. For example, this code is doing nearly nothing:
function xseek($h, $pos) {
    rewind($h);
    if ($pos > 0)
        fread($h, $pos);
}
because it reads from the beginning of the file every time. Furthermore, why do you need to read something if you are not returning it? Maybe you are looking for fseek()?
If you need to find a hex string in a binary file, it may be better to use something like this: http://pastebin.com/fpDBdsvV (tell me if there are any bugs/problems).
But if you are parsing a game's settings file, I'd advise you to use fseek(), fread() and unpack(): seek to the place where the setting is, read a portion of bytes, and unpack it into PHP variable types.
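For example, given the length-prefixed format described in the question, a minimal sketch (I'm assuming the 32-bit lengths are big-endian, hence the "N" format, and that the offset of the first setting is already known):

// Sketch: read "int32 len, name, int32 len, value" pairs until a 0xFF terminator.
function read_settings($path, $blockOffset)
{
    $h = fopen($path, 'rb');
    fseek($h, $blockOffset);
    $settings = array();
    while (true) {
        $peek = fread($h, 1);
        if ($peek === false || $peek === '' || $peek === "\xFF")
            break; // end of file or end-of-block marker
        // Name length: the byte we peeked at plus the next three bytes
        $len  = unpack('Nlen', $peek . fread($h, 3));
        $name = fread($h, $len['len']);
        $vlen = unpack('Nlen', fread($h, 4));
        $settings[$name] = fread($h, $vlen['len']);
    }
    fclose($h);
    return $settings;
}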
I'm working with a large array, a 1024x1024 height map, and of course I'm stuck with the memory limit. On my test machine I can increase the memory limit to 1GB if I want, but on my tiny VPS with only 256MB of RAM, that's not an option.
I've been searching Stack Overflow and Google and found several answers along the lines of "well, you aren't using PHP for its memory efficiency; ditch it and rewrite it in C++", and honestly, that's fair, and I recognize PHP loves memory.
But when digging deeper into PHP memory management, I could not find out how much memory each data type consumes, or whether casting to another data type reduces memory consumption.
The only "optimization" technique I found was to unset variables and arrays; that's it.
Would converting the code to C++ using some PHP parser solve the problem?
Thanks!
If you want a real indexed array, use SplFixedArray. It uses less memory. Also, PHP 5.3 has a much better garbage collector.
Other than that, well, PHP will use more memory than a more carefully written C/C++ equivalent.
Memory usage for a 1024x1024 integer array, as measured by memory_get_peak_usage():
Standard array: 218,756,848 bytes
SplFixedArray: 92,914,208 bytes
$array = new SplFixedArray(1024 * 1024); // for the standard-array figure, use: $array = array();
for ($i = 0; $i < 1024 * 1024; ++$i)
    $array[$i] = 0;
echo memory_get_peak_usage();
Note that the same array in C using 64-bit integers would be 8M.
As others have suggested, you could pack the data into a string. This is slower but much more memory efficient. If using 8 bit values it's super easy:
$x = str_repeat(chr(0), 1024*1024);
$x[$i] = chr($v & 0xff); // store value $v into $x[$i]
$v = ord($x[$i]); // get value $v from $x[$i]
Here the memory will only be about 1.5MB (that is, when considering the entire overhead of PHP with just this integer string array).
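For values wider than 8 bits, pack() and unpack() can do the same trick; here is a sketch for 16-bit values (using the little-endian "v" format as an arbitrary choice):

$x = str_repeat("\0", 1024 * 1024 * 2); // room for 1M 16-bit values

// store value $v at index $i (two bytes per element, O(1) per write)
$s = pack('v', $v & 0xffff);
$x[$i * 2]     = $s[0];
$x[$i * 2 + 1] = $s[1];

// read value $v back from index $i
$u = unpack('vval', $x[$i * 2] . $x[$i * 2 + 1]);
$v = $u['val'];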
For the fun of it, I created a simple benchmark of creating 1024x1024 8-bit integers and then looping through them once. The packed versions all used ArrayAccess so that the user code looked the same.
                  mem     write   read
array             218M    0.589s  0.176s
packed array      32.7M   1.85s   1.13s
packed spl array  13.8M   1.91s   1.18s
packed string     1.72M   1.11s   1.08s
The packed arrays used native 64-bit integers (only packing 7 bytes to avoid dealing with signed data) and the packed string used ord and chr. Obviously implementation details and computer specs will affect things a bit, but I would expect you to get similar results.
So while the array was 6x faster it also used 125x the memory as the next best alternative: packed strings. Obviously the speed is irrelevant if you are running out of memory. (When I used packed strings directly without an ArrayAccess class they were only 3x slower than native arrays.)
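For reference, a wrapper along these lines (my reconstruction, not the exact benchmark code) makes a packed string usable like an array:

// 8-bit values packed into a string, exposed through ArrayAccess so calling
// code looks identical to a native array.
class PackedByteArray implements ArrayAccess
{
    private $data;

    public function __construct($size)
    {
        $this->data = str_repeat("\0", $size);
    }
    public function offsetExists($i)  { return $i >= 0 && $i < strlen($this->data); }
    public function offsetGet($i)     { return ord($this->data[$i]); }
    public function offsetSet($i, $v) { $this->data[$i] = chr($v & 0xff); }
    public function offsetUnset($i)   { $this->data[$i] = "\0"; }
}

$map = new PackedByteArray(1024 * 1024);
$map[42] = 200;
echo $map[42]; // 200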
In short, I would use something other than pure PHP to process this data if speed is of any concern.
In addition to the accepted answer and suggestions in the comments, I'd like to suggest PHP Judy array implementation.
Quick tests showed interesting results: an array with 1 million entries using the regular PHP array data structure takes ~200MB, SplFixedArray uses around 90MB, and Judy uses 8MB. The tradeoff is performance: Judy takes about double the time of the regular PHP array implementation.
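Usage looks roughly like this (a sketch, assuming the PECL judy extension with its documented Judy::INT_TO_INT mode; the class implements ArrayAccess):

$judy = new Judy(Judy::INT_TO_INT);
for ($i = 0; $i < 1024 * 1024; ++$i) {
    $judy[$i] = 0; // behaves like a plain integer-indexed array
}
echo memory_get_peak_usage();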
A little bit late to the party, but if you have a multidimensional array you can save a lot of RAM by storing the complete array as JSON.
$array = [];
$data = [];
$data["a"] = "hello";
$data["b"] = "world";
To store this array just use:
$array[] = json_encode($data);
instead of
$array[] = $data;
If you want to get the array back, just use something like:
$myData = json_decode($array[0], true);
I had a big array with 275,000 sets and reduced RAM consumption by about 36%.
EDIT:
I found an even better way: gzip the JSON string:
$array[] = gzencode(json_encode($data));
and unzip it when you need it:
$myData = json_decode(gzdecode($array[0]), true);
This reduced my peak RAM usage by nearly 75%.
I want to synchronize two directories, and I use
file_get_contents($source) === file_get_contents($dest)
to compare two files. Is there any problem with doing this?
I would rather do something like this:
function files_are_equal($a, $b)
{
    // Check if filesize is different
    if (filesize($a) !== filesize($b))
        return false;

    // Check if content is different
    $ah = fopen($a, 'rb');
    $bh = fopen($b, 'rb');

    $result = true;
    while (!feof($ah)) {
        if (fread($ah, 8192) != fread($bh, 8192)) {
            $result = false;
            break;
        }
    }

    fclose($ah);
    fclose($bh);

    return $result;
}
This checks if the filesize is the same, and if it is it goes through the file step by step.
Checking the modified time can be a quick way in some cases, but it doesn't really tell you anything other than that the files have been modified at different times; they might still have the same content.
Using sha1 or md5 might be a good idea, but this requires going through the whole file to create the hash. If the hash is something that could be stored and used later, then it's probably a different story, but yeah...
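For instance, a sketch of the store-and-reuse idea (my own illustration), keyed on path and mtime so unchanged files are never re-hashed:

function cached_sha1($path, array &$cache)
{
    // mtime in the key invalidates the cached hash whenever the file changes
    $key = $path . '|' . filemtime($path);
    if (!isset($cache[$key])) {
        $cache[$key] = sha1_file($path);
    }
    return $cache[$key];
}

$cache = array();
$same = cached_sha1($source, $cache) === cached_sha1($dest, $cache);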
Use sha1_file() instead. It's faster and works fine if you just need to see whether the files differ. If the files are large, comparing the whole strings to each other can be very heavy. As sha1_file() returns a 40-character representation of the file, comparing files will be very fast.
You can also consider other methods like comparing filemtime or filesize, but hashing gives you guaranteed results even if only a single bit has changed.
Memory: e.g., if you have a 32MB memory limit and the files are 20MB each, you get an unrecoverable fatal error while trying to allocate memory. This can be solved by comparing the files in smaller parts.
Speed: string comparisons are not the fastest thing in the world; calculating a sha1 hash should be faster. (If you want to be 110% sure, you can compare the files byte by byte when the hashes match, but the hash already rules out all the cases where the content differs, which is 99%+ of cases.)
Efficiency: do some preliminary checks - e.g., there's no point in comparing two files if their sizes differ.
This will work, but it is inherently less efficient than calculating checksums for both files and comparing those. Good candidates for checksum algorithms are SHA1 and MD5.
http://php.net/sha1_file
http://php.net/md5_file
if (sha1_file($source) == sha1_file($dest)) {
/* ... */
}
Seems a bit heavy. This will load both files completely as strings and then compare.
I think you might be better off opening both files manually and ticking through them, perhaps just doing a filesize check first.
There isn't anything wrong with what you are doing here, except that it is a little inefficient. Getting the contents of each file and comparing them, especially with larger files or binary data, can run into problems.
I would take a look at filemtime (last modified) and filesize, and run some tests to see if that works for you. It should be all you need at a fraction of the computational power.
Check first for the obvious:
Compare size
Compare file type (mime-type).
Compare content.
(Add comparison of date, file name and other metadata to this obvious list if those are also supposed to match.)
When comparing content, hashing doesn't sound very efficient, as @Oli says in his comment. If the files are different, they will most likely already differ near the beginning. Calculating a hash of two 50MB files and then comparing the hashes sounds like a waste of time if the second bit already differs...
Check this post on php.net. It looks very similar to that of @Svish, but it also compares the file mime-type. A smart addition if you ask me.
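Putting the three checks together might look like this sketch (finfo is part of the bundled Fileinfo extension; I use sha1_file() for the content step just for brevity):

function files_match($a, $b)
{
    // 1. Compare size (cheapest check first)
    if (filesize($a) !== filesize($b))
        return false;

    // 2. Compare mime-type
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    if ($finfo->file($a) !== $finfo->file($b))
        return false;

    // 3. Compare content
    return sha1_file($a) === sha1_file($b);
}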
Something I noticed is the lack of attention to the N^2 factor. In other words, to do the filesize() comparison you would have to check every file against all of the other files. Why? What if the first file and the second file are different sizes, but the third file is the same size as the first?
So first, you need to get a list of all of the files you are going to work with. If you want to do the filesize type of thing, use the complete path string as the key for an array and store the filesize() information as the value. Then sort the array so all files of the same size are lined up, and only then check file sizes. However, the same size does not mean the files really are the same; it only means they are the same size.
Then you need something like the sha1_file() command: as above, make an array where the path names are the keys and the values are the hashes returned. Sort those, and then just do a simple walk through the array, keeping each sha1_file() value to test against the next. So, is A == B? Yes. Do any additional tests, then get rid of the second file and continue.
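A sketch of that approach (my own illustration): group by size first, then hash only the groups with more than one member:

function find_duplicates(array $paths)
{
    // Group candidate files by size; a file with a unique size cannot have a twin
    $bySize = array();
    foreach ($paths as $p)
        $bySize[filesize($p)][] = $p;

    // Hash only the files that share a size with at least one other file
    $byHash = array();
    foreach ($bySize as $group) {
        if (count($group) < 2)
            continue;
        foreach ($group as $p)
            $byHash[sha1_file($p)][] = $p;
    }

    // Buckets with 2+ entries hold (almost certainly) identical files
    return array_filter($byHash, function ($g) { return count($g) > 1; });
}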
Why am I commenting? I'm working on this same problem and I just found out my program did not work correctly. So now I'm going to go correct it using the sha1_file() function. :-)