Can you SHA1 hash a file to a certain length - php

I need to compare two files. One file may be longer than the other and I need to check to see if the longer file contains all of the data of the shorter file.
I could do a binary compare of the two, something like this:
function compareFiles($file_a, $file_b) {
    if (filesize($file_a) > filesize($file_b)) {
        $fp_a = fopen($file_b, 'rb'); // $fp_a is always the smaller (or equal-sized) file
        $fp_b = fopen($file_a, 'rb');
    } else {
        $fp_a = fopen($file_a, 'rb');
        $fp_b = fopen($file_b, 'rb');
    }
    while (!feof($fp_a)) {
        $b = fread($fp_a, 4096);
        if ($b === '' || $b === false) {
            break; // reached the end of the smaller file
        }
        $b_b = fread($fp_b, strlen($b)); // read exactly as many bytes from the other file
        if ($b !== $b_b) {
            fclose($fp_a);
            fclose($fp_b);
            return false;
        }
    }
    fclose($fp_a);
    fclose($fp_b);
    return true;
}
but this would be slow. As an alternative, I could compare the SHA1 hash of the smaller file against the SHA1 hash of the larger file up to the size of the smaller file, something like this:
function compareFiles($file_a, $file_b) {
    $tmpfile = '/dev/shm/tmp_file_copy.bin';
    // copy the first N bytes of the larger file, where N is the size of the smaller file
    if (filesize($file_a) > filesize($file_b)) {
        $readfromfile = $file_a;
        $smallerfile = $file_b;
    } else {
        $readfromfile = $file_b;
        $smallerfile = $file_a;
    }
    $bytes_to_copy = filesize($smallerfile);
    $readfile = fopen($readfromfile, 'rb');
    $writefile = fopen($tmpfile, 'wb');
    while (!feof($readfile) && $bytes_to_copy > 0) {
        if ($bytes_to_copy <= 8192) {
            $contents = fread($readfile, $bytes_to_copy);
            $bytes_to_copy = 0;
        } else {
            $contents = fread($readfile, 8192);
            $bytes_to_copy -= 8192;
        }
        fwrite($writefile, $contents);
    }
    fclose($writefile);
    fclose($readfile);
    $result = sha1_file($smallerfile) === sha1_file($tmpfile);
    unlink($tmpfile);
    return $result;
}
but I fear that this would also be slow as it involves a lot of I/O (to /dev/shm).
In short, I'm looking for a better way...
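For reference, the temporary copy in the second approach could be avoided entirely with PHP's incremental hashing API (hash_init / hash_update / hash_final), by hashing only the first N bytes of the larger file, where N is the size of the smaller one. A rough sketch (the function name sha1_file_prefix is made up here):
// Hash only the first $length bytes of $file with SHA-1, no temporary copy needed.
function sha1_file_prefix($file, $length) {
    $ctx = hash_init('sha1');
    $fp = fopen($file, 'rb');
    $remaining = $length;
    while ($remaining > 0 && !feof($fp)) {
        $chunk = fread($fp, min(8192, $remaining));
        if ($chunk === '' || $chunk === false) {
            break;
        }
        hash_update($ctx, $chunk);
        $remaining -= strlen($chunk);
    }
    fclose($fp);
    return hash_final($ctx);
}
// Usage: does the longer file start with the full contents of the shorter one?
// $matches = sha1_file($smaller) === sha1_file_prefix($larger, filesize($smaller));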

Hashing the files in this case will only be slower. Consider the following case.
File A.txt contents:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
File B.txt contents:
AAAAAAAAAAAAAAAAAAAABBBBBBBBBB
Note that A.txt is 40 characters in total, 10 characters longer than B.txt at 30 characters.
How much I/O do we have to do on each file to determine whether A.txt contains all of B.txt? 40 bytes? 30 bytes? No, the answer is only 21 bytes: the 20 bytes the two files have in common, plus the first byte that differs. You stream each file one byte (or chunk of bytes) at a time and compare them as you go. The result of such a comparison looks like this:
A.txt: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
B.txt: AAAAAAAAAAAAAAAAAAAABBBBBBBBBB
Stream --------------------^
Result ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓X
And then you stop. Why compare the rest?
If you hash both files, you have to read the entire contents of each to compute the hash. Even if you hash them in chunks while streaming them into memory, which do you think is faster: comparing each chunk of bytes from each file, or hashing the chunk? The comparison costs O(number of bytes compared), whereas the SHA-1 algorithm (specified in RFC 3174) must touch every byte of both files and does considerably more work per byte.
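If you want to convince yourself of the relative per-chunk cost, here is a quick, unscientific micro-benchmark sketch (assumes PHP 7.3+ for hrtime()):
$a = random_bytes(8192);
$b = $a;

$t0 = hrtime(true);
for ($i = 0; $i < 100000; $i++) {
    $equal = ($a === $b);       // plain chunk comparison
}
$t1 = hrtime(true);

$t2 = hrtime(true);
for ($i = 0; $i < 100000; $i++) {
    $digest = hash('sha1', $a); // hashing the same chunk
}
$t3 = hrtime(true);

printf("compare: %.1f ms, sha1: %.1f ms\n", ($t1 - $t0) / 1e6, ($t3 - $t2) / 1e6);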

Byte-by-byte comparison is the best method in your case. It compares only the first x bytes of both files and stops as soon as they differ. A hash function has to process every byte in the file; isn't that slower?

Related

Fatal Error - Out of Memory while reading a *.dat file in php [duplicate]

I am reading a file containing around 50k lines using the file() function in PHP. However, it's giving an out-of-memory error, since the contents of the file are stored in memory as an array. Is there any other way?
Also, the lengths of the lines stored are variable.
Here's the code. Also, the file is 700 kB, not MB.
private static function readScoreFile($scoreFile)
{
    $file = file($scoreFile);
    $relations = array();
    for ($i = 1; $i < count($file); $i++)
    {
        $relation = explode("\t", trim($file[$i]));
        $relation = array(
            'pwId_1' => $relation[0],
            'pwId_2' => $relation[1],
            'score' => $relation[2],
        );
        if ($relation['score'] > 0)
        {
            $relations[] = $relation;
        }
    }
    unset($file);
    return $relations;
}
Use fopen, fread and fclose to read a file sequentially:
$handle = fopen($filename, 'r');
if ($handle) {
    while (!feof($handle)) {
        echo fread($handle, 8192);
    }
    fclose($handle);
}
EDIT after the update of the question and the comments on fabjoa's answer:
There is definitely something fishy if a 700 kB file eats up 140 MB of memory with the code you gave (you could unset $relation at the end of each iteration, though). Consider using a debugger to step through it to see what happens. You might also want to consider rewriting the code to use SplFileObject's CSV functions (or their procedural cousins).
SplFileObject::setCsvControl example
$file = new SplFileObject("data.csv");
$file->setFlags(SplFileObject::READ_CSV);
$file->setCsvControl('|');
foreach ($file as $row) {
    list($fruit, $quantity) = $row;
    // Do something with values
}
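Applied to the tab-separated score file from the question, a hypothetical rewrite of readScoreFile along those lines might look like this; it streams one row at a time instead of loading the whole file with file(), although the returned $relations array still grows with the number of matching rows:
private static function readScoreFile($scoreFile)
{
    $file = new SplFileObject($scoreFile);
    $file->setFlags(SplFileObject::READ_CSV);
    $file->setCsvControl("\t");

    $relations = array();
    foreach ($file as $i => $row) {
        // skip the header line, blank lines and malformed rows
        if ($i === 0 || !is_array($row) || count($row) < 3) {
            continue;
        }
        list($pwId_1, $pwId_2, $score) = $row;
        if ($score > 0) {
            $relations[] = array(
                'pwId_1' => $pwId_1,
                'pwId_2' => $pwId_2,
                'score'  => $score,
            );
        }
    }
    return $relations;
}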
For an OOP approach to iterate over the file, try SplFileObject:
SplFileObject::fgets example
$file = new SplFileObject("file.txt");
while (!$file->eof()) {
    echo $file->fgets();
}
SplFileObject::next example
// Read through file line by line
$file = new SplFileObject("misc.txt");
while (!$file->eof()) {
    echo $file->current();
    $file->next();
}
or even
foreach (new SplFileObject("misc.txt") as $line) {
    echo $line;
}
Pretty much related (if not duplicate):
How to save memory when reading a file in Php?
If you don't know the maximum line length and you aren't comfortable using a magic number for it, you'll need to do an initial scan of the file to determine the maximum line length.
Other than that the following code should help you out:
// length is a large number or calculated from an initial file scan
while (!feof($handle)) {
    $buffer = fgets($handle, $length);
    echo $buffer;
}
Old question, but since I haven't seen anyone mention it: PHP generators are a great way to reduce memory consumption.
For example:
function read($fileName)
{
    $fileHandler = fopen($fileName, 'rb');
    while (($line = fgets($fileHandler)) !== false) {
        yield rtrim($line, "\r\n");
    }
    fclose($fileHandler);
}

foreach (read(__DIR__ . '/filenameHere') as $line) {
    echo $line;
}
Allocate more memory for the duration of the operation, maybe something like ini_set('memory_limit', '16M');. Don't forget to restore the initial memory limit once the operation is done.
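For instance, something along these lines (a sketch; the 256M value is arbitrary):
$previousLimit = ini_get('memory_limit'); // remember the current limit
ini_set('memory_limit', '256M');          // raise it for the heavy operation

// ... do the memory-hungry work here ...

ini_set('memory_limit', $previousLimit);  // restore the original limit afterwards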

Most memory-efficient way to split a variable into fixed-size chunks?

Is there a way to do something like an fread, but on a variable?
That is, I want to "read" an in-memory variable 1 MB at a time.
That way I could have something like this:
$data = ... ; // 10MB of data
$handle = fopen($data, "rb"); // Need something instead of fopen here
while (!feof($handle))
{
    $chunk = fread($handle, 1048576); // Want to read 1MB at a time
    doSomethingWithChunk($chunk);
}
fclose($handle);
I have a large binary file loaded into memory, about 10MB. I'd like to split it into an array of 1MB chunks. I don't need all 1MB chunks in memory at once, so I think I could do something like the above more efficiently than just using PHP's built-in str_split function.
There's no way to sequentially 'read' a string that's already loaded into memory; it's not really more efficient to split it up. The overhead of multiple variables will use more memory than a single one as well. Ideally you would load the string into a stream, but PHP doesn't really have a string stream.
If you just want to deal with the string in chunks, you can just loop over substrings of it:
// $data already holds the binary string
$pointer = 0;
$size = strlen($data);
$chunkSize = 1048576; // 1 MB
while ($pointer < $size)
{
    $chunk = substr($data, $pointer, $chunkSize);
    doSomethingWithChunk($chunk);
    $pointer += $chunkSize;
}
I'm not sure how PHP handles large strings internally, but according to the string documentation, a string can only be "as large as up to 2GB (2147483647 bytes maximum)". If your file is about 10MB, it shouldn't be a problem for PHP.
Another option (probably the better option) is to load $data into a memory or temporary stream. If you want to spare the environment from excessive memory, you can use the php://temp stream wrapper, where some of the data is stored in a temporary file if it exceeds 2MB. Just load the string into the stream as soon as possible to conserve memory, and then you can use the file stream functions on it.
$dataStream = fopen("php://temp", "w+b");
fwrite($dataStream, funcThatGetsData()); // try not to put data into a variable, to save memory
rewind($dataStream); // go back to the start of the stream before reading
while (!feof($dataStream))
{
    $chunk = fread($dataStream, 1048576); // want to read 1MB at a time
    doSomethingWithChunk($chunk);
}
fclose($dataStream);
If you get $data from another function you could pass around $dataStream instead. If you must have $data in a string beforehand, be sure to call unset() on it to free the memory:
$data = getData(); // string from some other function
$dataStream = fopen("php://temp", "w+b");
fwrite($dataStream, $data);
unset($data); // free 10MB of memory!
...
If you want to keep it all in memory you can use php://memory, but you might as well just use a string in that case.
You can use something like this:
$handle = @fopen("path_to_your_file", "r");
if ($handle) {
    while (($buffer = fgets($handle, 1024)) !== false) {
        doSomethingWithChunk($buffer);
    }
    fclose($handle);
}

PHP filesize over 4Gb

I'm running a Synology NAS Server,
and I'm trying to use PHP to get the filesize of files.
I'm trying to find a function that will successfully calculate the filesize of files over 4Gb.
filesize($file); only works for files <2Gb
sprintf("%u", filesize($file)); only works for files <4Gb
I also tried another function that I found in the PHP manual, but it doesn't work properly.
It randomly works for certain file sizes but not for others.
function fsize($file) {
    // filesize will only return the lower 32 bits of
    // the file's size! Make it unsigned.
    $fmod = filesize($file);
    if ($fmod < 0) $fmod += 2.0 * (PHP_INT_MAX + 1);
    // find the upper 32 bits
    $i = 0;
    $myfile = fopen($file, "r");
    // feof has undefined behaviour for big files.
    // after we hit the eof with fseek,
    // fread may not be able to detect the eof,
    // but it also can't read bytes, so use it as an
    // indicator.
    while (strlen(fread($myfile, 1)) === 1) {
        fseek($myfile, PHP_INT_MAX, SEEK_CUR);
        $i++;
    }
    fclose($myfile);
    // $i is a multiplier for PHP_INT_MAX byte blocks.
    // return to the last multiple of 4, as filesize has modulo of 4 GB (lower 32 bits)
    if ($i % 2 == 1) $i--;
    // add the lower 32 bit to our PHP_INT_MAX multiplier
    return ((float)($i) * (PHP_INT_MAX + 1)) + $fmod;
}
Any ideas?
You are overflowing PHP's 32-bit integer. On *nix, this will give you the filesize as a string:
<?php $size = trim(shell_exec('stat -c %s '.escapeshellarg($filename))); ?>
How about executing a shell command like:
<?php
echo shell_exec("du 'PATH_TO_FILE'");
?>
where PATH_TO_FILE is the path to the file relative to the PHP script. You will most probably need some regex to extract the file size on its own, as the command returns a string like:
11777928 name_of_file.extension
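For example, a rough sketch ($path_to_file is a placeholder; with GNU du, the -b flag reports the size in bytes rather than disk usage):
$output = shell_exec('du -b ' . escapeshellarg($path_to_file));
// Output looks like "123456789<TAB>/path/to/file"; keep the number as a string
// so it cannot overflow a 32-bit integer.
if (preg_match('/^(\d+)/', trim($output), $m)) {
    $filesize = $m[1];
}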
Here is one complete solution you can try: https://stackoverflow.com/a/48363570/2592415
include_once 'class.os.php';
include_once 'function.filesize.32bit.php';
// Must be real path to file
$file = "/home/username/some-folder/yourfile.zip";
echo get_filesize($file);
This function works on 32-bit Linux:
function my_file_size($file) {
    if (PHP_INT_MAX > 2147483647) { // 64-bit PHP: filesize() is enough
        return filesize($file);
    }
    $ps_cmd = "/bin/ls -l " . escapeshellarg($file);
    exec($ps_cmd, $arr_output, $rtn);
    // /bin/ls -l /data/07f2088a371424c0bdcdca918a3008a9cbd74a25.ic2
    // -rw-r--r-- 1 resin resin 269484032 9月 9 21:36 /data/07f2088a371424c0bdcdca918a3008a9cbd74a25.ic2
    if ($rtn != 0) return floatval(0);
    preg_match("/^[^\s]+\s+\d+\s+[^\s]+\s+[^\s]+\s+(\d+)\s+/", $arr_output[0], $matches);
    if (!empty($matches)) {
        return floatval($matches[1]);
    }
    return floatval(0);
}

file_get_contents => PHP Fatal error: Allowed memory exhausted

I have no experience dealing with large files, so I am not sure what to do about this. I have attempted to read several large files using file_get_contents; the task is to clean and munge them using preg_replace().
My code runs fine on small files; however, the large files (40 MB) trigger a memory exhausted error:
PHP Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 41390283 bytes)
I was thinking of using fread() instead but I am not sure that'll work either. Is there a workaround for this problem?
Thanks for your input.
This is my code:
<?php
error_reporting(E_ALL);
##get find() results and remove DOS carriage returns.
##The error is thrown on the next line for large files!
$myData = file_get_contents("tmp11");
$newData = str_replace("^M", "", $myData);
##cleanup Model-Manufacturer field.
$pattern = '/(Model-Manufacturer:)(\n)(\w+)/i';
$replacement = '$1$3';
$newData = preg_replace($pattern, $replacement, $newData);
##cleanup Test_Version field and create comma delimited layout.
$pattern = '/(Test_Version=)(\d).(\d).(\d)(\n+)/';
$replacement = '$1$2.$3.$4 ';
$newData = preg_replace($pattern, $replacement, $newData);
##cleanup occasional empty Model-Manufacturer field.
$pattern = '/(Test_Version=)(\d).(\d).(\d) (Test_Version=)/';
$replacement = '$1$2.$3.$4 Model-Manufacturer:N/A--$5';
$newData = preg_replace($pattern, $replacement, $newData);
##fix occasional Model-Manufacturer being incorrectly wrapped.
$newData = str_replace("--","\n",$newData);
##fix 'Binary file' message when find() utility cannot id file.
$pattern = '/(Binary file).*/';
$replacement = '';
$newData = preg_replace($pattern, $replacement, $newData);
$newData = removeEmptyLines($newData);
##replace colon with equal sign
$newData = str_replace("Model-Manufacturer:","Model-Manufacturer=",$newData);
##file stuff
$fh2 = fopen("tmp2","w");
fwrite($fh2, $newData);
fclose($fh2);
### Functions.
##Data cleanup
function removeEmptyLines($string)
{
return preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $string);
}
?>
Firstly, you should understand that when using file_get_contents you're fetching the entire string of data into a variable, and that variable is stored in the host's memory.
If that string is greater than the size dedicated to the PHP process, then PHP will halt and display the error message above.
The way around this is to open the file as a pointer and then take a chunk at a time. That way, if you had a 500MB file you could read the first 1MB of data, do what you will with it, delete that 1MB from the system's memory and replace it with the next MB. This allows you to manage how much data you're putting in memory.
An example of this can be seen below; I will create a function that works a bit like node.js, handing each chunk to a callback:
function file_get_contents_chunked($file, $chunk_size, $callback)
{
    try
    {
        $handle = fopen($file, "r");
        $i = 0;
        while (!feof($handle))
        {
            call_user_func_array($callback, array(fread($handle, $chunk_size), &$handle, $i));
            $i++;
        }
        fclose($handle);
    }
    catch (Exception $e)
    {
        trigger_error("file_get_contents_chunked::" . $e->getMessage(), E_USER_NOTICE);
        return false;
    }
    return true;
}
and then use it like so:
$success = file_get_contents_chunked("my/large/file", 4096, function($chunk, &$handle, $iteration) {
    /*
     * Do what you will with the {$chunk} here.
     * {$handle} is passed in case you want to seek
     * to different parts of the file.
     * {$iteration} is the section of the file that has been read, so
     * ($iteration * 4096) is your current offset within the file.
     */
});

if (!$success)
{
    // It failed
}
One of the problems you will find is that you're trying to run regex several times over an extremely large chunk of data. Not only that, but your regex is built for matching across the entire file.
With the above method your regex could become useless, as you may only be matching half of a pattern that got split across two chunks (see the carry-over sketch further below). What you should do is revert to the native string functions such as
strpos
substr
trim
explode
for matching the strings. I have added support in the callback so that the handle and current iteration are passed in. This will allow you to work with the file directly within your callback, allowing you to use functions like fseek, ftruncate and fwrite, for instance.
The way you're building your string manipulation is not efficient whatsoever, and using the proposed method above is by far a much better way.
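Here is a sketch of how a match spanning two chunks can be handled, by carrying the tail of each chunk over to the next read; the 256-byte overlap is an assumption and must be at least as long as the longest match a pattern can produce:
$handle  = fopen('my/large/file', 'r'); // placeholder path
$carry   = '';                          // unprocessed tail of the previous chunk
$overlap = 256;                         // assumed upper bound on match length

while (!feof($handle)) {
    $buffer = $carry . fread($handle, 8192);
    // After the read, feof() tells us whether this was the final chunk.
    $cut     = feof($handle) ? strlen($buffer) : max(0, strlen($buffer) - $overlap);
    $process = substr($buffer, 0, $cut); // safe to process: no match can be cut off here
    $carry   = substr($buffer, $cut);    // re-examined together with the next chunk

    // ... run the string/regex cleanup on $process and write it out ...
}
fclose($handle);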
A pretty ugly solution to adjust your memory limit depending on file size:
$filename = "yourfile.txt";
ini_set ('memory_limit', filesize ($filename) + 4000000);
$contents = file_get_contents ($filename);
The right solution would be to think about whether you can process the file in smaller chunks, or use command line tools from PHP.
If your file is line-based you can also use fgets to process it line-by-line.
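For this particular cleanup, a line-by-line sketch could look roughly like the following; note that the replacements whose patterns span multiple lines (the \n+ ones) would still need the chunk/carry-over handling described in the other answer:
$in  = fopen('tmp11', 'r');
$out = fopen('tmp2', 'w');
while (($line = fgets($in)) !== false) {
    $line = str_replace("^M", "", $line); // the same per-line cleanup as before
    // ... apply any other single-line replacements here ...
    fwrite($out, $line);
}
fclose($in);
fclose($out);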
For processing just n rows at a time, we can use generators in PHP (here n = 1000).
This is how it works:
Read n lines, process them, come back for the next n lines, process them, and so on.
Here's the code for doing so.
<?php
class readLargeCSV {

    private $file;
    private $delimiter;
    private $iterator;
    private $header;

    public function __construct($filename, $delimiter = "\t") {
        $this->file = fopen($filename, 'r');
        $this->delimiter = $delimiter;
        $this->iterator = 0;
        $this->header = null;
    }

    public function csvToArray()
    {
        $data = array();
        $is_mul_1000 = false;
        while (($row = fgetcsv($this->file, 1000, $this->delimiter)) !== false)
        {
            $is_mul_1000 = false;
            if (!$this->header) {
                $this->header = $row;
            }
            else {
                $this->iterator++;
                $data[] = array_combine($this->header, $row);
                if ($this->iterator != 0 && $this->iterator % 1000 == 0) {
                    $is_mul_1000 = true;
                    $chunk = $data;
                    $data = array();
                    yield $chunk;
                }
            }
        }
        fclose($this->file);
        if (!$is_mul_1000) {
            yield $data;
        }
        return;
    }
}
And for reading it, you can use this.
$file = database_path('path/to/csvfile/XYZ.csv');
$csv_reader = new readLargeCSV($file, ",");
foreach ($csv_reader->csvToArray() as $data) {
    // you can do whatever you want with the $data.
}
Here $data contains 1000 entries from the CSV, or the remaining n % 1000 entries for the last batch.
A detailed explanation for this can be found here: https://medium.com/@aashish.gaba097/database-seeding-with-large-files-in-laravel-be5b2aceaa0b
My advice would be to use fread. It may be a little slower, but you won't have to use all your memory...
For instance:
// This uses filesize($oldFile) bytes of memory
file_put_contents($newFile, file_get_contents($oldFile));
// And this only 8192 bytes at a time
$pNew = fopen($newFile, 'w');
$pOld = fopen($oldFile, 'r');
while (!feof($pOld)) {
    fwrite($pNew, fread($pOld, 8192));
}
fclose($pNew);
fclose($pOld);

