Get MD5 Checksum for Very Large Files - php

I've written a script that reads through all files in a directory and returns the MD5 hash for each file. However, it renders nothing for a rather large file. I assume the interpreter has some maximum processing time set, and since the hash takes too long to compute, it just skips along to the other files. Is there any way to get an MD5 checksum for large files through PHP? If not, could it be done through a cron job with cPanel? I gave that a shot, but it doesn't seem that my md5sum command was ever processed: I never get an email with the hash. Here's the PHP I've already written. It's very simple code and works fine for files of a reasonable size:
function md5_dir($dir) {
    if (is_dir($dir)) {
        if ($dh = opendir($dir)) {
            while (($file = readdir($dh)) !== false) {
                echo nl2br($file . "\n" . md5_file($file) . "\n\n");
            }
            closedir($dh);
        }
    }
}

Make sure to use escapeshellarg (http://us3.php.net/manual/en/function.escapeshellarg.php) if you decide to use a shell_exec() or system() call. For example:
shell_exec('md5sum -b ' . escapeshellarg($filename));
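For instance, here is a minimal sketch of that (the helper name md5_file_external() is mine, not from the answer): it escapes the filename and keeps only the hash column of md5sum's output.
// Hypothetical helper: hash a file via the system's md5sum binary.
function md5_file_external($filename) {
    $output = shell_exec('md5sum -b ' . escapeshellarg($filename));
    if ($output === null || $output === false) {
        return false; // md5sum unavailable or the call failed
    }
    // md5sum prints "<hash> *<filename>"; the first 32 hex characters are the hash.
    return substr(trim($output), 0, 32);
}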

While I couldn't reproduce it with PHP 5.2 or 5.3 on a 2 GB file, the issue seems to come up on 32-bit PHP builds.
Even though it's not a really nice solution, you could let the system do the hashing:
echo system("md5sum test.txt");
46d6a7bcbcf7ae0501da341cb3bae27c test.txt

If you're hitting a memory limit or the maximum execution time, PHP should be throwing an error message to that effect. Check your error logs. If you are hitting a limit, you can set the maximum values for PHP memory usage and execution time in your php.ini file:
memory_limit = 16M
will set max memory usage to 16 megs. For maximum execution time:
max_execution_time = 30
will set maximum execution time to 30 seconds.
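If you can't edit php.ini (on shared hosting, for example), here is a minimal sketch of the equivalent per-request overrides; the values below are only examples:
// Per-script overrides; pick values that match your workload.
ini_set('memory_limit', '64M'); // raise the memory ceiling for this request
set_time_limit(120);            // allow up to 120 seconds of execution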

You could also achieve it from the command line:
shell_exec('md5sum -b '. $fileName);

FYI, in case someone needs a fast md5 checksum: PHP's md5_file() is pretty fast even with larger files. This returns the checksum of a Linux Mint .iso (880 MB) in about 3 seconds.
<?php
// checksum
$path = $_SERVER['DOCUMENT_ROOT']; // get upload folder path
$file = $path."/somefolder/linux-mint.iso"; // any file
echo md5_file($file);
?>

Related

PHP filesize() showing old filesize with a file inside a windows shared (network) folder

I have the following script that runs to read new content from a file:
<?php
clearstatcache();
$fileURL = "\\\\saturn\\extern\\seq_ws.csv";
$fileAvailable = file_exists($fileURL);
$bytesRead = file_get_contents("bytes.txt");
if ($fileAvailable) {
    $fileSize = filesize($fileURL);
    // Statuses: 1 = partial read, 2 = complete read, 0 = no read, -1 = file not found, followed by !!
    if ($bytesRead < $fileSize) {
        // $bytesRead till $fileSize bytes read from file.
        $content = file_get_contents($fileURL, NULL, NULL, $bytesRead);
        file_put_contents("bytes.txt", ((int)$bytesRead + strlen($content)));
        echo "1!!$content";
    } else if ($bytesRead > $fileSize) {
        // File edit or delete detected, whole file read again.
        $content = file_get_contents($fileURL);
        file_put_contents("bytes.txt", strlen($content));
        echo "2!!$content";
    } else if ($bytesRead == $fileSize) {
        // No new data found, no action taken.
        echo "0!!";
    }
} else {
    // File delete detected, reading whole file when available.
    echo "-1!!";
    file_put_contents("bytes.txt", "0");
}
?>
It works perfectly when I run it and does what is expected.
When I edit the file from the same PC as my server, it works instantly and returns the correct values.
However, when I edit the file from another PC, my script takes about 4-6 seconds to read the correct filesize of the file.
I added clearstatcache(); at the top of my script because I think it's a caching issue. But the strange thing is that when I change the file from the server PC it responds instantly, yet from another PC it doesn't.
On top of that, as soon as the other PC changes the file, I see the change in Windows, with the new filesize and content, but for some reason it takes Apache about 4-6 seconds to detect the change. In those 4-6 seconds it receives the old filesize from before the change.
So I have the following questions:
Is the filesize information cached anywhere, either on the Apache server or inside Windows?
If question 1 applies, is there any way to remove or disable this caching?
Is it possible this isn't a caching problem?
I think that on your local PC, PHP has development settings.
So I suggest checking php.ini for this parameter: realpath_cache_ttl
Which is:
realpath_cache_ttl integer
Duration of time (in seconds) for which to cache realpath
information for a given file or directory.
For systems with rarely changing files,
consider increasing the value.
To test it, run phpinfo() both locally and on the server and compare that value:
<?php phpinfo();
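Also note that clearstatcache() called with no arguments does not clear the realpath cache; here is a minimal sketch of clearing it for the one file in question (the two-argument form exists since PHP 5.3):
// Clear both the stat cache and the realpath cache entry for this path.
clearstatcache(true, "\\\\saturn\\extern\\seq_ws.csv");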

Can I set max execution time in php.ini to be 30,000 in my case?

The scenario is that I want to save 4046 images to a folder (coded in PHP). I guess it would take a maximum of 5 hours. Initially max_execution_time in php.ini was set to 30 seconds. After 650 images got saved, the browser froze and none of the remaining images got saved, but the process was still running, and there was no error either. Can anybody give me an idea of the max execution time I should set in this case?
P.S. If my approach is wrong, do guide me.
Thanks
I'm not sure if your problem isn't caused simply by the wrong tool - PHP isn't meant for such long-running tasks.
If those images are on some server, better to use an FTP client.
If you have the list of files saved in a text file, use cURL to download them.
I'd highly suggest modifying your script to do the job incrementally: break the job up into smaller parts with a break in between. The basic logic flow would be like this:
<?php
$start = $_GET['start']; // where to start the job at
$end = $start + 250;     // queue this and the next 250 positions
for ($i = $start; $i <= $end; $i++) {
    // do the operations needed for position $i
}
// redirect to this script again, moving on to the next 250 jobs
header("Location: /urlToScript?start=" . ($end + 1));
exit;
?>
This will do small parts of the total job and avoid any issues with the interpreter. Add any INI modifications to increase memory usage and time as needed and you'll be fine.
You can extend the time limit by adding this line to the script that saves the images:
ini_set('max_execution_time', 30000);
A second approach is to use .htaccess:
php_value max_execution_time 30000

PHP - Chunked file copy (via FTP) has missing bytes?

So, I'm writing a chunked file transfer script that is intended to copy files--small and large--to a remote server. It almost works fantastically (and did with a 26 byte file I tested, haha) but when I start to do larger files, I notice it isn't quite working. For example, I uploaded a 96,489,231 byte file, but the final file was 95,504,152 bytes. I tested it with a 928,670,754 byte file, and the copied file only had 927,902,792 bytes.
Has anyone else ever experienced this? I'm guessing feof() may be doing something wonky, but I have no idea how to replace it, or test that. I commented the code, for your convenience. :)
<?php
// FTP credentials
$server = CENSORED;
$username = CENSORED;
$password = CENSORED;
// Destination file (where the copied file should go)
$destination = "ftp://$username:$password@$server/ftp/final.mp4";
// The file on my server that we're copying (in chunks) to $destination.
$read = 'grr.mp4';
// If the file we're trying to copy exists...
if (file_exists($read))
{
    // Set a chunk size (4 MB)
    $chunk_size = 4194304;
    // For reading through the file we want to copy to the FTP server.
    $read_handle = fopen($read, 'rb');
    // For appending to the destination file.
    $destination_handle = fopen($destination, 'ab');
    echo '<span style="font-size:20px;">';
    echo 'Uploading.....';
    // Loop through $read until we reach the end of the file.
    while (!feof($read_handle))
    {
        // So Rackspace doesn't think nothing's happening.
        echo PHP_EOL;
        flush();
        // Read a chunk of the file we're copying.
        $chunk = fread($read_handle, $chunk_size);
        // Write the chunk to the destination file.
        fwrite($destination_handle, $chunk);
        sleep(1);
    }
    echo 'Done!';
    echo '</span>';
    fclose($read_handle);
    fclose($destination_handle);
}
?>
EDIT
I (may have) confirmed that the script is dying at the end somehow, and not corrupting the files. I created a simple file with each line corresponding to the line number, up to 10000, then ran my script. It stopped at line 6253. However, the script is still returning "Done!" at the end, so I can't imagine it's a timeout issue. Strange!
EDIT 2
I have confirmed that the problem exists somewhere in fwrite(). By echoing $chunk inside the loop, the complete file is returned without fail. However, the written file still does not match.
EDIT 3
It appears to work if I add sleep(1) immediately after the fwrite(). However, that makes the script take a million years to run. Is it possible that PHP's append has some inherent flaw?
EDIT 4
Alright, I've further isolated the problem to being an FTP problem, somehow. When I run this file copy locally, it works fine. However, when I use the file transfer protocol (line 9), the bytes are missing. This is occurring despite the binary flags in the two cases of fopen(). What could possibly be causing this?
EDIT 5
I found a fix. The modified code is above--I'll post an answer on my own as soon as I'm able.
I found a fix, though I'm not sure exactly why it works. Simply sleeping after writing each chunk fixes the problem. I upped the chunk size quite a bit to speed things up. Though this is an arguably bad solution, it should work for my uses. Thanks anyway, guys!
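One likely explanation (not confirmed in the thread) is that fwrite() on the ftp:// stream performs short writes. Here is a minimal sketch that checks the return value and keeps writing until the whole chunk is on the wire, instead of sleeping; it would replace the single fwrite() call in the loop above:
// Hypothetical replacement for fwrite($destination_handle, $chunk):
// loop until every byte of $chunk has actually been written to the FTP stream.
$written = 0;
$length = strlen($chunk);
while ($written < $length) {
    $bytes = fwrite($destination_handle, substr($chunk, $written));
    if ($bytes === false || $bytes === 0) {
        die('Write to FTP stream failed');
    }
    $written += $bytes;
}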

PHP having issues saving "large" files

I've got a program that takes 3 arrays (which are the same length and can contain 500 items or so) and writes them to a text file.
However, I'm getting an issue with writing larger files. The arrays are coordinates and timestamps from a canvas drawing app, so I can control the length. I've found that once files start getting larger than 2 MB, it doesn't save the file. The maximum file I've managed to save has been 2.18 MB. From a related question (PHP: Having trouble uploading large files) I've determined that the cause is most likely due to having this hosted on a free hosting server. I've looked at phpinfo() and here are the 4 relevant values:
memory_limit 16M
max_execution_time 30
upload_max_filesize 5M
post_max_size 5M
Here is the relevant writing code:
// retrieve data from the JS
$x_s = $_GET['x_coords'];
$y_s = $_GET['y_coords'];
$new_line = $_GET['new_lines'];
$times = $_GET['time_stamps'];
print_r($_GET);
$randInt = rand(1, 1000);
// first want to open a file
$file_name = "test_logs/data_test_" . $randInt . ".txt";
$file_handler = fopen($file_name, 'w') or die("Couldn't connect");
// For loop to write the data
for ($i = 0; $i < count($x_s); $i++) {
    // If new line want to write new line!
    if (!$new_line[$i]) {
        if ($i != 0) {
            // If not the first line
            fwrite($file_handler, "LINE_END\n");
        }
        fwrite($file_handler, "LINE_START\n");
    }
    // Write the x coord, y coord, timestamp
    fwrite($file_handler, $x_s[$i] . ", " . $y_s[$i] . ", " . $times[$i] . "\n");
    // If last line then write last LINE_END
    if ($i == (count($x_s) - 1)) {
        fwrite($file_handler, "LINE_END\n");
    }
}
fclose($file_handler);
I've setup a php server on my localhost and have access to the error log. This is what I am getting.
[Fri Mar 23 20:03:02 2012] [error] [client ::1] request failed: URI too long (longer than 8190)
PROBLEM RESOLVED: The issue was that I was using GET to send large amounts of data, which was appended to the URI. Once the URI reached 8190 characters, the request failed. Using POST solves this.
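For reference, a minimal sketch of the server-side half of that change, assuming the JavaScript is switched to send the same fields via POST (field names as in the script above):
// Read the same fields from the POST body instead of the URI,
// so the 8190-character URI limit no longer applies.
$x_s      = $_POST['x_coords'];
$y_s      = $_POST['y_coords'];
$new_line = $_POST['new_lines'];
$times    = $_POST['time_stamps'];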
upload_max_filesize and post_max_size determine the maximum size of data that can be posted. But this is probably not your problem, since some of the data is written (if you hit the data limit, the script does not execute at all).
Your script has two other restrictions: max_execution_time and memory_limit. Have a look at your Apache error log file to see if you are getting an error message saying which limit was reached.
You can also try logging inside the for loop to see the progression of time and memory usage:
if (($i % 100) == 0) { // log every 100 entries
    error_log(date("H:i:s ") . memory_get_usage(true) . " bytes used\n", 3, 'test.log');
}
It may also be that the Suhosin patch is preventing you from sending too many data points:
http://www.adityamooley.net/blogs/2012/01/09/php-suhosin-and-post-data/
Maybe the script exceeds the max execution time.
Add this
set_time_limit(0);
at the beginning of your code.
1) check max_input_time
ini_set('max_input_time', 50);
2) Check in phpinfo(): do you have the Suhosin patch?
You should look at the Apache error_log; it will tell you which limit is being reached.
Try
ini_set('error_reporting', E_ALL);
error_reporting(E_ALL);
ini_set('log_errors', true);
ini_set('html_errors', false);
ini_set('error_log', dirname(__FILE__) . '/script_error.log');
ini_set('display_errors', true);
PHP (and hence the web server) is protecting itself. Perhaps use a different mechanism to upload large files; I would imagine they come from known (and trusted) sources. For example, SFTP.

How can I optimize this simple PHP script?

This first script gets called several times for each user via an AJAX request. It calls another script on a different server to get the last line of a text file. It works fine, but I think there is a lot of room for improvement. I am not a very good PHP coder, so I am hoping that with the help of the community I can optimize this for speed and efficiency:
AJAX POST Request made to this script
<?php session_start();
$fileName = $_POST['textFile'];
$result = file_get_contents($_SESSION['serverURL']."fileReader.php?textFile=$fileName");
echo $result;
?>
It makes a GET request to this external script which reads a text file
<?php
$fileName = $_GET['textFile'];
if (file_exists('text/' . $fileName . '.txt')) {
    $lines = file('text/' . $fileName . '.txt');
    echo $lines[sizeof($lines) - 1];
} else {
    echo 0;
}
?>
I would appreciate any help. I think there is more improvement that can be made in the first script. It makes an expensive function call (file_get_contents); well, at least I think it's expensive!
This script should limit the locations and file types that it's going to return.
Think of somebody trying this:
http://www.yoursite.com/yourscript.php?textFile=../../../etc/passwd (or something similar)
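A minimal sketch of one way to lock that down (the whitelist and the error handling are my assumptions, not part of this answer):
// Illustrative hardening: strip any path components and only allow
// a conservative character set in the requested file name.
$fileName = basename($_GET['textFile']);
if (!preg_match('/^[A-Za-z0-9_-]+$/', $fileName)) {
    header('HTTP/1.0 400 Bad Request');
    exit('Invalid file name');
}
$path = 'text/' . $fileName . '.txt';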
Try to find out where the delays occur: does the HTTP request take long, or is the file so large that reading it takes long?
If the request is slow, try caching results locally.
If the file is huge, then you could set up a cron job that extracts the last line of the file at regular intervals (or at every change), and save that to a file that your other script can access directly.
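A minimal sketch of that cron idea (the script name and paths are illustrative): a tiny PHP CLI script that cron runs periodically to copy the last line into a small cache file.
<?php
// extract_last_line.php (hypothetical) - run from cron, e.g.:
//   * * * * * php /path/to/extract_last_line.php
$source = '/path/to/text/bigfile.txt';  // hypothetical source file
$cache  = '/path/to/text/bigfile.last'; // hypothetical cache file the web script reads
$lines = file($source, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($lines !== false && count($lines) > 0) {
    file_put_contents($cache, end($lines));
}
?>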
readfile() is your friend here: it reads a file on disk and streams it to the client.
script 1:
<?php
session_start();
// added basic argument filtering
$fileName = preg_replace('/[^A-Za-z0-9_]/', '', $_POST['textFile']);
$fileName = $_SESSION['serverURL'] . 'text/' . $fileName . '.txt';
if (file_exists($fileName)) {
    // script 2 could be pasted here

    // for the entire file
    //readfile($fileName);

    // for just the last line
    $lines = file($fileName);
    echo $lines[count($lines) - 1];

    exit(0);
}
echo 0;
?>
This script could be improved further by adding caching to it, but that is more complicated.
Very basic caching could look like this.
script 2:
<?php
$lastModifiedTimeStamp = filemtime($fileName);

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
    $browserCachedCopyTimestamp = strtotime(preg_replace('/;.*$/', '', $_SERVER['HTTP_IF_MODIFIED_SINCE']));
    if ($browserCachedCopyTimestamp >= $lastModifiedTimeStamp) {
        header("HTTP/1.0 304 Not Modified");
        exit(0);
    }
}

header('Content-Length: ' . filesize($fileName));
header('Expires: ' . gmdate('D, d M Y H:i:s \G\M\T', time() + 604800)); // (3600 * 24 * 7)
header('Last-Modified: ' . gmdate('D, d M Y H:i:s \G\M\T', $lastModifiedTimeStamp));
?>
First things first: Do you really need to optimize that? Is that the slowest part in your use case? Have you used xdebug to verify that? If you've done that, read on:
You cannot really optimize the first script usefully: if you need an HTTP request, you need an HTTP request. Skipping the HTTP request could be a performance gain, though, if it is possible (i.e. if the first script can access the same files the second script would operate on).
As for the second script: reading the whole file into memory does look like some overhead, but that is negligible if the files are small. The code looks very readable; I would leave it as is in that case.
If your files are big, however, you might want to use fopen() and its friends fseek() and fread():
# Do not forget to sanitize the file name here!
# An attacker could demand the last line of your password
# file or similar! ($fileName = '../../passwords.txt')
$filePointer = fopen($fileName, 'r');
$i = 1;
$chunkSize = 200;
# Read 200 byte chunks from the file and check if the chunk
# contains a newline
do {
    fseek($filePointer, -($i * $chunkSize), SEEK_END);
    $line = fread($filePointer, $i++ * $chunkSize);
} while (($pos = strrpos($line, "\n")) === false);
return substr($line, $pos + 1);
If the files are unchanging, you should cache the last line.
If the files are changing and you control the way they are produced, it might or might not be an improvement to reverse the order lines are written, depending on how often a line is read over its lifetime.
Edit:
Your server could figure out what it wants to write to its log, put it in memcache, and then write it to the log. The request for the last line could then be fulfilled from memcache instead of a file read.
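A minimal sketch of that idea, assuming the Memcached extension is available (the key name is illustrative):
// On the writer side: store the line in memcache, then append it to the log as before.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);
$memcached->set('last_line_' . $fileName, $logLine); // $logLine is the line being written

// On the reader side: serve the last line from memcache, falling back to the file.
$lastLine = $memcached->get('last_line_' . $fileName);
if ($lastLine === false) {
    $lines = file('text/' . $fileName . '.txt');
    $lastLine = $lines[count($lines) - 1];
}
echo $lastLine;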
The most probable source of delay is that cross-server HTTP request. If the files are small, the cost of fopen/fread/fclose is nothing compared to the whole HTTP request.
(Not long ago I used HTTP to retrieve images to dynamically generate image-based menus. Replacing the HTTP request with a local file read reduced the delay from seconds to tenths of a second.)
I assume that the obvious solution of accessing the file server's filesystem directly is out of the question. If not, then that's the best and simplest option.
Otherwise, you could use caching. Instead of getting the whole file, just issue a HEAD request and compare the timestamp to a local copy.
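A rough sketch of that caching idea ($localModified is an assumed variable holding the timestamp of your cached copy, and the remote side has to send a Last-Modified header for this to work, e.g. as in the "script 2" caching example earlier in the thread):
// Ask only for the headers (HEAD) instead of downloading the body.
stream_context_set_default(array('http' => array('method' => 'HEAD')));
$headers = get_headers($_SESSION['serverURL'] . "fileReader.php?textFile=$fileName", 1);
$remoteModified = isset($headers['Last-Modified']) ? strtotime($headers['Last-Modified']) : 0;
if ($remoteModified > $localModified) {
    // The remote file changed: fetch it again and refresh the local cache.
}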
Also, if you are AJAX-updating a lot of clients based on the same files, you might consider looking at using Comet (Meteor, for example). It's used for things like chat, where a single change has to be broadcast to several clients.
