Can I use file_get_contents() to compare two files? - php

I want to synchronize two directories. And I use
file_get_contents($source) === file_get_contents($dest)
to compare two files. Is there any problem to do this?

I would rather do something like this:
function files_are_equal($a, $b)
{
    // Check if filesize is different
    if (filesize($a) !== filesize($b)) {
        return false;
    }

    // Check if content is different
    $ah = fopen($a, 'rb');
    $bh = fopen($b, 'rb');

    $result = true;
    while (!feof($ah)) {
        if (fread($ah, 8192) != fread($bh, 8192)) {
            $result = false;
            break;
        }
    }

    fclose($ah);
    fclose($bh);
    return $result;
}
This checks if the filesize is the same, and if it is, it goes through the file chunk by chunk.
Checking the modified time can be a quick way in some cases, but it doesn't really tell you anything other than that the files have been modified at different times. They might still have the same content.
Using sha1 or md5 might be a good idea, but it requires going through the whole file to create the hash. If that hash is something that could be stored and reused later, then it's probably a different story.

Use sha1_file() instead. It's faster and works fine if you just need to see whether the files differ. If the files are large, comparing the whole strings to each other can be very heavy. As sha1_file() returns a 40-character representation of the file, comparing the hashes is very fast.
You can also consider other methods like comparing filemtime or filesize, but sha1_file() gives you a guaranteed result even if only a single bit has changed.

Memory: e.g. you have a 32 MB memory limit and the files are 20 MB each. You get an unrecoverable fatal error while trying to allocate memory. This can be solved by comparing the files in smaller chunks.
Speed: string comparisons are not the fastest thing in the world; calculating a sha1 hash should be faster. (If you want to be 110% sure, you can compare the files byte by byte when the hashes match, but hashing already rules out all the cases where the content, and therefore the hash, differs, which is 99%+ of cases.)
Efficiency: do some preliminary checks; e.g. there's no point comparing two files if their sizes differ. See the sketch below.
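For example, a minimal sketch that combines the size pre-check with a hash comparison (assuming both paths exist and are readable):
function files_probably_equal($a, $b)
{
    // Cheap preliminary check: different size means different content.
    if (filesize($a) !== filesize($b)) {
        return false;
    }
    // Hash comparison avoids loading both files into memory as full strings.
    return sha1_file($a) === sha1_file($b);
}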

This will work, but it is inherently less efficient than calculating a checksum for both files and comparing those. Good candidates for checksum algorithms are SHA1 and MD5.
http://php.net/sha1_file
http://php.net/md5_file
if (sha1_file($source) == sha1_file($dest)) {
    /* ... */
}

Seems a bit heavy. This will load both files completely as strings and then compare.
I think you might be better off opening both files manually and ticking through them, perhaps just doing a filesize check first.

There isn't anything wrong with what you are doing here, except that it is a little inefficient. Getting the contents of each file and comparing them can run into problems, especially with larger files or binary data.
I would take a look at filemtime (last modified) and filesize, and run some tests to see if that works for you. It should be all you need, at a fraction of the computation power.
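For what it's worth, a minimal sketch of that metadata-only check (the function name is made up, and note it can report files as unchanged even when their contents actually differ):
function files_look_unchanged($a, $b)
{
    // Compares only metadata; a false positive is possible if the content changed
    // but the size and modification time happen to match.
    return filesize($a) === filesize($b)
        && filemtime($a) === filemtime($b);
}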

Check first for the obvious:
Compare size
Compare file type (mime-type).
Compare content.
(add comparison of date, file name, and other metadata to this obvious list if those are also supposed to match).
When comparing content, hashing doesn't sound very efficient, as @Oli says in his comment. If the files are different, they will most likely differ near the beginning. Calculating a hash of two 50 MB files and then comparing the hashes sounds like a waste of time if the second bit is already different...
Check this post on php.net. It looks very similar to that of @Svish, but it also compares the file mime-type. A smart addition if you ask me.
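For reference, the mime-type comparison mentioned there could be sketched with the standard fileinfo extension ($source and $dest are the paths from the question):
$finfo = finfo_open(FILEINFO_MIME_TYPE);
// Different mime-types mean the files cannot have identical content.
$same_type = finfo_file($finfo, $source) === finfo_file($finfo, $dest);
finfo_close($finfo);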

Something I noticed is that there is a lack of attention to the N-squared factor. In other words, to do the filesize() check you would first have to compare every file against all of the other files. Why? What if the first file and the second file are different sizes, but the third file is the same size as the first?
So first you need to get a list of all of the files you are going to work with. If you want to do the filesize type of thing, use the complete path/filename string as the key for an array and store the filesize() information as the value. Then you sort the array so all files which are the same size are lined up. THEN you can check file sizes. However, this does not mean they really are the same, only that they are the same size.
You need to do something like the sha1_file() command and, as above, make an array where the path/filenames are the keys and the returned hashes are the values. Sort those, and then just do a simple walk through the array, storing the sha1_file() value to test against. So is A==B? Yes. Do any additional tests, then get rid of the SECOND file and continue.
Why am I commenting? I'm working on this same problem and I just found out my program did not work correctly. So now I'm going to go correct it using the sha1_file() function. :-)
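A loose sketch of that idea, grouping by size first and hashing only within same-size groups (the function name and structure here are made up for illustration):
function find_duplicate_sets(array $paths)
{
    // Pass 1: bucket files by size; a file with a unique size cannot have a duplicate.
    $by_size = array();
    foreach ($paths as $path) {
        $by_size[filesize($path)][] = $path;
    }

    // Pass 2: hash only within same-size buckets and group identical hashes.
    $duplicate_sets = array();
    foreach ($by_size as $group) {
        if (count($group) < 2) {
            continue;
        }
        $by_hash = array();
        foreach ($group as $path) {
            $by_hash[sha1_file($path)][] = $path;
        }
        foreach ($by_hash as $same) {
            if (count($same) > 1) {
                $duplicate_sets[] = $same; // all paths in $same have identical content (per hash)
            }
        }
    }
    return $duplicate_sets;
}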

Related

Dealing with binary data and mb_function overloading?

I have a piece of code here which I need either assurance, or "no no no!" about in regards to if I'm thinking about this in the right or entirely wrong way.
This has to deal with cutting a variable of binary data at a specific spot, and also dealing with multi-byte overloaded functions. For example substr is actually mb_substr and strlen is mb_strlen etc.
Our server is set to UTF-8 internal encoding, so there's this weird little thing I do to circumvent it for this binary data manipulation:
// $binary_data is the incoming variable with binary
// $clip_size is generally 16, 32 or 64 etc
$curenc = mb_internal_encoding(); // this should be "UTF-8"
mb_internal_encoding('ISO-8859-1'); // change so mb_ overloading doesn't screw this up
if (strlen($binary_data) >= $clip_size) {
    $first_hunk = substr($binary_data, 0, $clip_size);
    $rest_of_it = substr($binary_data, $clip_size);
} else {
    // skip since it's shorter than expected
}
mb_internal_encoding($curenc); // put this back now
I can't really show input and output results, since it's binary data. But tests using the above appear to be working just fine and nothing is breaking...
However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!
Notes:
The binary data coming in is a concatenation of those two parts to begin with.
The first part's size is always known (but changes).
The second part's size is entirely unknown.
This is pretty darn close to encryption and stuffing the IV on front and ripping it off again (which oddly, I found some old code which does this same thing lol ugh).
So, I guess my question is:
Is this actually fine to be doing?
Or is there something super obvious I'm overlooking?
However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!
Your brain is right, you shouldn't be doing that in PHP in the first place. :)
Is this actually fine to be doing?
It depends on the purpose of your code.
I can't see any reason off the top of my head to cut binary data like that. So my first instinct would be "no no no!": use unpack() to properly parse the binary into usable variables.
That being said, if you just need to split your binary because reasons, then I guess this is fine. As long as your tests confirm that the code is working for you, I can't see any problem.
As a side note, this is exactly the kind of use case that keeps me from using mbstring overloading, i.e. whenever you need the default string functions.
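For what it's worth, a hypothetical sketch of the unpack() route, assuming a fixed-size prefix (say 16 bytes, like an IV) followed by an arbitrary tail; the key names 'head' and 'tail' are invented:
$clip_size = 16;
// 'a' reads raw bytes; note that before PHP 5.5 the 'a' code strips trailing
// NUL bytes on unpack(), so plain substr() may be safer on older builds.
$parts = unpack("a{$clip_size}head/a*tail", $binary_data);
$first_hunk = $parts['head'];
$rest_of_it = $parts['tail'];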
MY SOLUTION TO THE WORRY
I dislike answering my own questions... but I wanted to share what I have decided on nonetheless.
Although what I had "worked", I still wanted to change the hack-job altering of the charset encoding. It was old code, I admit, but for some reason I never looked at bin2hex/hex2bin for doing this. So I decided to change it to use those.
The resulting new code:
// $clip_size remains the same value for continuity later,
// only spot-adjusted here... which is why the *2.
$hex_data = bin2hex( $binary_data );
$first_hunk = hex2bin( substr($hex_data,0,($clip_size*2)) );
$rest_of_it = hex2bin( substr($hex_data,($clip_size*2)) );
if ( !empty($rest_of_it) ) { /* process the result for reasons */ }
Using the hex functions turns the mess into something mb_ overloading will not screw with either way. A 1-million-iteration benchmark loop showed the process wasn't anything to be worried about (and it's safer to run in parallel with itself than the mb_internal_encoding mangling method).
So I'm going with this. It sits better in my mind, and resolves my question for now... until I revisit this old code again in a few years and go "what was I thinking ?!".

Overhead of large length in PHP stream_get_line()

I'm testing out a PHP library which relies upon another streams-based library. My experience is with sockets, which are low level compared to streams, so I'm a little unsure whether streams are going to be flexible enough.
Basically, using sockets, I would write a buffer loop which checked each chunk of data for an EOL character such as \n, like so...
PHP
$data = NULL;
while ($buffer = socket_read($connection, 1024)) {
    $data .= $buffer;
    if (strpos($buffer, "\n") !== FALSE) {
        $data = substr($data, 0, -1);
        break;
    }
}
I'm looking to do something similar without having to rewrite their entire library. Here's the problem...
stream_get_line($handle,$length,$EOL) accepts a length value but will truncate everything longer than that. PHP docs state...
Reading ends when length bytes have been read, when the string specified by ending is found (which is not included in the return value), or on EOF (whichever comes first).
... and there's no offset param so I can't use it the same way to get the remainder. This means that if I don't know the length of the data, or if it's inconsistent, I need to set length high enough to deal with ANY possible length of data.
That isn't the big concern; that seems to work. The question is: will setting the $length value to something like 512000 (500 KB) cause a lot of unnecessary overhead for shorter responses?
As far as the statement from the docs:
Reading ends when length bytes have been read, when the string
specified by ending is found (which is not included in the return
value), or on EOF (whichever comes first).
This doesn't mean that if you pass a length of 1,024 or even 200,000 and the line is longer than that, the rest of the data is truncated or lost. That is just the maximum amount of data the call will return if it doesn't reach EOF/EOL before that.
So if you have a line that is 102,500 bytes long and you have the length parameter set to 1024, you will have to call stream_get_line 101 times before the entire line of data is read, but the entire line will be read, just not in one call to the function.
To directly answer your question: there won't be any extra overhead for short responses if you pass a large value. How it works under the hood really depends on what type of stream you are reading from. In the case of a network stream with a large length value, it may take a long time before the call returns any data if it takes a long time for that much data to arrive from the network, whereas reading in smaller chunks might start giving you data before everything has been received.
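For illustration, one way to keep $length small and still collect arbitrarily long lines is to call stream_get_line() in a loop. This is a rough sketch under the assumption that a chunk shorter than $length means the delimiter (or EOF) was hit; a tail that is exactly $length bytes just costs one extra, empty read:
function read_full_line($handle, $length = 8192, $eol = "\n")
{
    $line = '';
    while (!feof($handle)) {
        $chunk = stream_get_line($handle, $length, $eol);
        if ($chunk === false) {
            break; // EOF with nothing left to read
        }
        $line .= $chunk;
        if (strlen($chunk) < $length) {
            break; // stopped on the delimiter or EOF, not on the length cap
        }
    }
    return $line;
}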

PHP: Calculating File HASH for Files Larger than 2GB

Could you please advise how to calculate a file hash for files larger than 2 GB in PHP?
The only PHP function known to me is:
string hash_file ( string $algo , string $filename [, bool $raw_output = false ] )
This function, however, has a limitation: it returns a hash only for files smaller than 2 GB. For larger files, hash_file() throws an error.
Here are some constraints/requests:
should work on Linux Ubuntu 64bit server
compatible with PHP 5+
there should be no file size limit
should be as fast as possible
This is all the information I have now. Thank you very much.
UPDATE
I have a solution that is more practical and efficient than any hash calculation over data larger than 2 GB.
I have realized that I do not have to generate a hash from the complete files that are over 2 GB. To uniquely identify any file, calculating a hash from, say, the first 10 KB of its data should be sufficient. Moreover, it will be faster than a >2 GB calculation. In other words, the ability to calculate a hash from a data string that is over 2 GB is probably not necessary at all.
I will wait for your reactions. In a couple of days, I will close this question.
I would use exec() to run a local hashing function in the shell and return the value back to the PHP script. Here's an example with md5, but any available algorithm can be used.
$results = array();
$filename = '/full/path/to/file';
// escapeshellarg() keeps spaces or shell metacharacters in the path from breaking the command
exec("md5sum " . escapeshellarg($filename), $results);
Then parse the result array (the output of the shell command): md5sum prints the hash followed by the filename, so the first whitespace-separated token of $results[0] is the hash.
In general, I like to avoid doing anything directly in PHP that requires more than 1 GB of memory, especially when running in php-fpm or as an Apache module; call it a sort of time-reinforced prejudice. This is definitely my advice when there is a native application that can accomplish the goal and you don't particularly need portability across platforms (like running on both Linux and Windows machines).
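If you'd rather stay inside PHP, a minimal sketch of chunked hashing with the standard hash_init()/hash_update()/hash_final() functions would look like this; it avoids holding the whole file in memory, assuming the build can fopen() files larger than 2 GB in the first place:
function hash_large_file($algo, $path, $chunk_size = 1048576)
{
    $ctx = hash_init($algo); // e.g. 'md5' or 'sha1'
    $fh = fopen($path, 'rb');
    while (!feof($fh)) {
        hash_update($ctx, fread($fh, $chunk_size)); // feed 1 MB at a time
    }
    fclose($fh);
    return hash_final($ctx);
}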

Searching for hex string in a file in php?

I'm currently using the following methods in my class to get the job done:
function xseek($h, $pos){
    rewind($h);
    if ($pos > 0)
        fread($h, $pos);
}

function find($str){
    return $this->startingindex($this->name, $str);
}

function startingindex($a, $b){
    $lim = 1 + filesize($a) - strlen($b) / 2;
    $h = fopen($a, "rb");
    rewind($h);
    for ($i = 0; $i < $lim; $i++){
        $this->xseek($h, $i);
        if ($b == strtoupper(bin2hex(fread($h, strlen($b) / 2)))){
            fclose($h);
            return $i;
        }
    }
    fclose($h);
    return -1;
}
I realize this is quite inefficient, especially for PHP, but I'm not allowed any other language on my hosting plan.
I ran a couple tests, and when the hex string is towards the beginning of the file, it runs quickly and returns the offset. When the hex string isn't found, however, the page hangs for a while. This kills me inside because last time I tested with PHP and had hanging pages, my webhost shut my site down for 24 hours due to too much cpu time.
Is there a better way to accomplish this (finding a hex string's offset in a file)? Are there certain aspects of this that could be improved to speed up execution?
I would read the entire contents of the file into one hex string and use strrpos, but I was getting errors about maximum memory being exceeded. Would this be a better method if I chopped the file up and searched large pieces with strrpos?
edit:
To specify, I'm dealing with a settings file for a game. The settings and their values are in a block where there is a 32-bit int before the setting, then the setting, a 32-bit int before the value, and then the value. Both ints represent the lengths of the following strings. For example, if the setting was "test" and the value was "0", it would look like (in hex): 00000004746573740000000130. Now that you mention it, this does seem like a bad way to go about it. What would you recommend?
edit 2:
I tried a file that was below the maximum memory I'm allowed and tried strrpos, but it was very much slower than the way I've been trying.
edit 3: in reply to Charles:
What's unknown is the length of the settings block and where it starts. What I do know is what the first and last settings USUALLY are. I've been using these searching methods to find the location of the first and last setting and determine the length of the settings block. I also know where the parent block starts. The settings block is generally no more than 50 bytes into its parent, so I could start the search for the first setting there and limit how far it will search. The problem is that I also need to find the last setting. The length of the settings block is variable and could be any length. I could read the file the way I assume the game does, by reading the size of the setting, reading the setting, reading the size of the value, reading the value, etc. until I reached a byte with value -1, or FF in hex. Would a combination of limiting the search for the first setting and reading the settings properly make this much more efficient?
You have a lot of garbage code. For example, this code is doing nearly nothing:
function xseek($h, $pos){
    rewind($h);
    if ($pos > 0)
        fread($h, $pos);
}
because it reads every time from the beginning of the file. Furthermore, why do you need to read something if you are not returning it? Maybe you are looking for fseek()?
If you need to find a hex string in a binary file, it may be better to use something like this: http://pastebin.com/fpDBdsvV (tell me if there are some bugs/problems).
But if you are parsing a game's settings file, I'd advise you to use fseek(), fread() and unpack(): seek to the place where the setting is, read a portion of bytes and unpack it into PHP variable types.
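To illustrate that suggestion, here is a loose, untested sketch that assumes the layout described in the question: a big-endian 32-bit length before each setting name and each value, a 0xFF byte where the next length would start marking the end of the block, and $handle already positioned at the start of the block:
function read_settings($handle)
{
    $settings = array();
    while (!feof($handle)) {
        $raw = fread($handle, 4);
        // Stop on the 0xFF terminator byte or a truncated length field.
        if (strlen($raw) < 4 || ord($raw[0]) === 0xFF) {
            break;
        }
        $len = unpack('N', $raw); // 'N' = unsigned 32-bit big-endian
        $name = fread($handle, $len[1]);
        $vlen = unpack('N', fread($handle, 4));
        $settings[$name] = fread($handle, $vlen[1]);
    }
    return $settings;
}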

Truly unique random numbers generated by PHP?

I have built a PHP script to host a large number of images uploaded by users. What is the best way to generate random numbers for the image filenames so that there will be no filename conflicts in the future? Something like Imageshack does. Thanks.
$better_token = uniqid(md5(mt_rand()), true);
Easiest way would be a new GUID for each file.
http://www.php.net/manual/en/function.uniqid.php#65879
Here's how I implemented your solution.
This example assumes I want to:
Get a list containing 50 numbers that are unique and random, and
have this list of numbers come from the range of 0 to 1000.
Code:
//developed by www.fatphuc.com
$array = array(); //define the array

//set random # range
$minNum = 0;
$maxNum = 1000;

// I just created this function, since we'll be generating
// #s in various sections, and I just want to make sure that
// if we need to change how we generate random #s, we don't
// have to make multiple changes to the code everywhere.
// (basically, to prevent mistakes)
function GenerateRandomNumber($minNum, $maxNum){
    return round(rand($minNum, $maxNum));
}

//generate 50 unique random #s
for ($i = 1; $i <= 50; $i++){
    $num1 = GenerateRandomNumber($minNum, $maxNum);
    while (in_array($num1, $array)){
        $num1 = GenerateRandomNumber($minNum, $maxNum);
    }
    $array[$i] = $num1;
}

asort($array); //just want to sort the array

//this simply prints the list of #s in list style
echo '<ol>';
foreach ($array as $var){
    echo '<li>';
    echo $var;
    echo '</li>';
}
echo '</ol>';
Keep a persistent list of all the previous numbers you've generated (in a database table or in a file) and check that a newly generated number is not amongst the ones on the list. If you find this to be prohibitively expensive, generate random numbers on a sufficient number of bits to guarantee a very low probability of collision.
You can also use an incremental approach of assigning these numbers, like a concatenation of a timestamp_part based on the current time and a random_part, just to make sure you don't get collisions if multiple users upload files at the same time.
You could use microtime() as suggested above and then append a hash of the original filename to further avoid collisions in the (rare) case of exactly simultaneous uploads.
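A small sketch of that timestamp-plus-hash idea, where $original_name is assumed to be the uploaded file's name:
// Combine the upload time, a random component, and a hash of the original name.
$unique_name = sprintf('%d_%06d_%s', time(), mt_rand(0, 999999), sha1($original_name));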
There are several flaws in your postulate that random values will be unique, regardless of how good the random number generator is. Also, the better the random number generator, the longer it takes to calculate results.
Wouldn't it be better to use a hash of the data file? That way you get the added benefit of detecting duplicate submissions.
If detecting duplicates is known to be a non-issue, then I'd still recommend this approach, but modify the output based on detected collisions (using a MUCH cheaper computation method than that proposed by Lo'oris), e.g.
$candidate_name = generate_hash_of_file($input_file);
$offset = 0;
while (file_exists($candidate_name . strrev($offset)) && ($offset < 50)) {
    $offset++;
}
if ($offset < 50) {
    rename($input_file, $candidate_name . strrev($offset));
} else {
    print "Congratulations - you've got the biggest storage network in the world by far!";
}
this would give you the capacity to store approx 25*2^63 files using a sha1 hash.
As to how to generate the hash, reading the entire file into PHP might be slow (particularly if you try to read it all into a single string to hash it). Most Linux/Posix/Unix systems come with tools like 'md5sum' which will generate a hash from a stream very efficiently.
C.
1. forge a filename
2. try to open that file
3. if it exists, goto 1
4. create the file
Use something based on a timestamp, maybe. See the microtime() function for details. Alternatively, use uniqid() to generate a unique ID based on the current time.
Guaranteed unique cannot be random. Random cannot be guaranteed unique. If you want unique (without the random) then just use the integers: 0, 1, 2, ... 1235, 1236, 1237, ... Definitely unique, but not random.
If that doesn't suit, then you can have definitely unique with the appearance of random. You use encryption on the integers to make them appear random. Using DES will give you 64-bit numbers, while using AES will give you 128-bit numbers. Use either to encrypt 0, 1, 2, ... in order with the same key. All you need to store is the key and the next number to encrypt. Because encryption is reversible, the encrypted numbers are guaranteed unique.
If 128-bit or even 64-bit numbers are too large (64 bits is 16 hex digits) then look at a format-preserving encryption, which will give you a smaller size range at some cost in time.
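A hedged sketch of that encrypt-a-counter idea using OpenSSL's AES-128 in ECB mode, so each counter value maps to exactly one 128-bit block; $key and the persisted $counter are assumptions, and key handling is left out:
// $key is a secret 16-byte string and $counter is the persisted "next number";
// both must be stored and reused between requests.
$plain = sprintf('%016d', $counter); // exactly one 16-byte AES block
$id = bin2hex(openssl_encrypt(
    $plain,
    'aes-128-ecb',
    $key,
    OPENSSL_RAW_DATA | OPENSSL_ZERO_PADDING // disable padding: input is already block-sized
));
// $id is a 32-hex-character name; distinct counter values can never collide.
$counter++;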
My solution is usually a hash (MD5/SHA1/...) of the image contents. This has the added advantage that if people upload the same image twice you still only have one image on the hard disk, saving some space (of course you have to make sure that the image is not deleted if one user deletes it and another user still has the same image in use).
