I have a function that writes ~120KB-150KB of HTML and metadata to ~8000 .md files with fixed names every few minutes:
a-agilent-technologies-healthcare-nyse-us-39d4
aa-alcoa-basic-materials-nyse-us-159a
aaau-perth-mint-physical-gold--nyse-us-8ed9
aaba-altaba-financial-services-nasdaq-us-26f5
aac-healthcare-nyse-us-e92a
aadr-advisorshares-dorsey-wright-adr--nyse-us-d842
aal-airlines-industrials-nasdaq-us-29eb
If a file does not exist, it generates/writes it quite fast.
If, however, the file exists, the same write is much slower, since the existing file already carries ~150KB of data.
How do I solve this problem?
Do I generate a new file with a new name in the same directory, and unlink the older file in the for loop?
or do I generate a new folder, write all the files, and then unlink the previous directory? The problem with this method is that sometimes 90% of the files are rewritten while some remain the same. One variant is sketched below.
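For illustration, a common variant of the first idea is to write under a temporary name in the same directory and then rename() it over the old file; on POSIX filesystems rename() replaces the target atomically, so no separate unlink is needed. A minimal sketch, reusing the $new_filename and $md_file_content names from the code below:
$tmp_filename = $new_filename . '.tmp.' . getmypid();
// write the fresh content to a temporary file in the same directory
if (file_put_contents($tmp_filename, $md_file_content) !== false) {
    // rename() replaces the old file in one step, so no separate unlink
    // is needed and readers never observe a half-written file
    rename($tmp_filename, $new_filename);
} else {
    @unlink($tmp_filename); // clean up a partial temp file on failure
}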
Code
This function is being called in a for loop, which you can see in this link
public static function writeFinalStringOnDatabase($equity_symbol, $md_file_content, $no_extension_filename)
{
/**
* @var is the MD file content with meta and entire HTML
*/
$md_file_content = $md_file_content . ConfigConstants::NEW_LINE . ConfigConstants::NEW_LINE;
$dir = __DIR__ . ConfigConstants::DIR_FRONT_SYMBOLS_MD_FILES; // symbols front directory
$new_filename = EQ::generateFileNameFromLeadingURL($no_extension_filename, $dir);
if (file_exists($new_filename)) {
if (is_writable($new_filename)) {
file_put_contents($new_filename, $md_file_content);
if (EQ::isLocalServer()) {
echo $equity_symbol . " 💚 " . ConfigConstants::NEW_LINE;
}
} else {
if (EQ::isLocalServer()) {
echo $equity_symbol . " symbol MD file is not writable in " . __METHOD__ . " 💔 Maybe, check permissions!" . ConfigConstants::NEW_LINE;
}
}
} else {
$fh = fopen($new_filename, 'wb');
fwrite($fh, $md_file_content);
fclose($fh);
if (EQ::isLocalServer()) {
echo $equity_symbol . " front md file does not exit in " . __METHOD__ . " It's writing on the database now 💛" . ConfigConstants::NEW_LINE;
}
}
}
I haven't programmed in PHP for years, but this question has drawn my interest today. :D
Suggestion
How do I solve this problem?
Do I generate a new file with a new name in the same directory, and unlink the older file in the for loop?
Simply use the 3 amigos fopen(), fwrite() & fclose() again; since the file is opened with mode 'wb', fopen() truncates it, so fwrite() effectively overwrites the entire content of an existing file.
if (file_exists($new_filename)) {
if (is_writable($new_filename)) {
$fh = fopen($new_filename,'wb');
fwrite($fh, $md_file_content);
fclose($fh);
if (EQ::isLocalServer()) {
echo $equity_symbol . " 💚 " . ConfigConstants::NEW_LINE;
}
} else {
if (EQ::isLocalServer()) {
echo $equity_symbol . " symbol MD file is not writable in " . __METHOD__ . " 💔 Maybe, check permissions!" . ConfigConstants::NEW_LINE;
}
}
} else {
$fh = fopen($new_filename, 'wb');
fwrite($fh, $md_file_content);
fclose($fh);
if (EQ::isLocalServer()) {
echo $equity_symbol . " front md file does not exit in " . __METHOD__ . " It's writing on the database now 💛" . ConfigConstants::NEW_LINE;
}
}
For the sake of the DRY principle:
// It's smart to put the logging and similar tasks in a separate function,
// after you end up writing the same thing over and over again.
public static function log($content)
{
if (EQ::isLocalServer()) {
echo $content;
}
}
public static function writeFinalStringOnDatabase($equity_symbol, $md_file_content, $no_extension_filename)
{
$md_file_content = $md_file_content . ConfigConstants::NEW_LINE . ConfigConstants::NEW_LINE;
$dir = __DIR__ . ConfigConstants::DIR_FRONT_SYMBOLS_MD_FILES; // symbols front directory
$new_filename = EQ::generateFileNameFromLeadingURL($no_extension_filename, $dir);
$file_already_exists = file_exists($new_filename);
if ($file_already_exists && !is_writable($new_filename)) {
EQ::log($equity_symbol . " symbol MD file is not writable in " . __METHOD__ . " 💔 Maybe, check permissions!" . ConfigConstants::NEW_LINE);
} else {
$fh = fopen($new_filename,'wb'); // you should also check whether fopen succeeded
fwrite($fh, $md_file_content); // you should also check whether fwrite succeeded
if ($file_already_exists) {
EQ::log($equity_symbol . " 💚 " . ConfigConstants::NEW_LINE);
} else {
EQ::log($equity_symbol . " front md file does not exit in " . __METHOD__ . " It's writing on the database now 💛" . ConfigConstants::NEW_LINE);
}
fclose($fh);
}
}
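As the inline comments say, the fopen() and fwrite() results should be checked. A minimal sketch of those checks, reusing the log() helper defined above (the messages are illustrative):
$fh = fopen($new_filename, 'wb');
if ($fh === false) {
    EQ::log($equity_symbol . " fopen failed for " . $new_filename . ConfigConstants::NEW_LINE);
    return;
}
if (fwrite($fh, $md_file_content) === false) {
    EQ::log($equity_symbol . " fwrite failed for " . $new_filename . ConfigConstants::NEW_LINE);
}
fclose($fh);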
Possible cause
tl;dr Too much overhead due to the Zend string API being used.
The official PHP manual says:
file_put_contents() is identical to calling fopen(), fwrite() and fclose() successively to write data to a file.
However, if you look at the source code of PHP on GitHub, you can see that the "writing data" part is done slightly differently in file_put_contents() and fwrite().
In the fwrite function the raw input data (= $md_file_content) is directly accessed in order to write the buffer data to the stream context:
Line 1171:
ret = php_stream_write(stream, input, num_bytes);
In the file_put_contents function, on the other hand, the Zend string API is used (which I had never heard of before).
Here the input data and length are encapsulated for some reason.
Line 662:
numbytes = php_stream_write(stream, Z_STRVAL_P(data), Z_STRLEN_P(data));
(The Z_STR.... macros are defined here, if you are interested).
So, my suspicion is that possibly the Zend string API is causing the overhead while using file_put_contents.
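If you want to verify that suspicion on your own setup rather than take my source reading at face value, a crude micro-benchmark along these lines would compare the two code paths (the file names, payload size, and iteration count are arbitrary assumptions):
$payload = str_repeat('x', 150 * 1024); // ~150KB, like the MD files
$n = 1000;
$t = microtime(true);
for ($i = 0; $i < $n; $i++) {
    file_put_contents('bench_a.md', $payload);
}
echo 'file_put_contents: ' . (microtime(true) - $t) . "s\n";
$t = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $fh = fopen('bench_b.md', 'wb');
    fwrite($fh, $payload);
    fclose($fh);
}
echo 'fopen/fwrite/fclose: ' . (microtime(true) - $t) . "s\n";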
Side note
At first I thought that every file_put_contents() call creates a new stream context, since the lines related to creating the context were also slightly different:
PHP_NAMED_FUNCTION(php_if_fopen) (Reference):
context = php_stream_context_from_zval(zcontext, 0);
PHP_FUNCTION(file_put_contents) (Reference):
context = php_stream_context_from_zval(zcontext, flags & PHP_FILE_NO_DEFAULT_CONTEXT);
However, on closer inspection, the php_stream_context_from_zval call is made with effectively the same params: the first param zcontext is null, and since you don't pass any flags to file_put_contents, flags & PHP_FILE_NO_DEFAULT_CONTEXT also becomes 0 and is passed as the second param.
So, I guess the default stream context is re-used here on every call. Since it's apparently a persistent stream, it is not disposed of after the php_stream_close() call.
So the Fazit, as the Germans say, is that there is apparently either no additional overhead, or the same overhead in both cases, regarding creating or reusing a stream context.
Thank you for reading.
Related
What I'm trying to do:
Use PHP ftp_nlist to retrieve the contents of a directory on the FTP server
The problem:
For directories that contain a lot of files (the one I encountered the problem on has nearly 40 thousand files, and no subfolders), the ftp_nlist function is returning false. For directories that are not as large, the ftp_nlist function returns an array of filenames as expected.
What I've tried:
Enabling passive mode (it already was enabled, but I see it as a common suggestion)
Adding ftp_set_option($conn_id, FTP_USEPASVADDRESS, false); after my ftp_login
using ftp_chdir, although my folder names never have spaces anyways
echoing error_get_last() after ftp_nlist returns false. The error shown seems unrelated, but is included below.
My code:
In case it is useful, here is the function I have created. What it is supposed to do is...
take in $fm (filemaker, unrelated to this problem)
take in $FTPConnectionID (the FTP connection I established prior to the function call)
take in $FolderPath (the path of the folder on the FTP server for which I want to list files/subfolders recursively - ex: "SomeFolder/Testing")
take in $TextFile (I am writing the paths of every file found on the FTP server to a text file, which was created prior to calling the function)
function createAuditFile($fm, $FTPConnectionID, $FolderPath, $TextFile) {
echo "createAuditFile called for " . $FolderPath . "\n";
//Get the contents of the given path. Will include files and folders.
$FolderContents = ftp_nlist($FTPConnectionID, $FolderPath);
if($FolderContents == false) {
echo "Couldn't get " . $FolderPath . "\n";
echo "ERROR: " . print_r(error_get_last()) . "\n";
} else {
print_r($FolderContents);
}
//Loop through the array, call this function recursively if a folder is found.
if(is_array($FolderContents)) {
foreach($FolderContents as $Content) {
//Create a variable for the folder path
$ContentPath = $FolderPath . "/" . $Content;
//Call the function recursively if a folder is found
if(pathinfo($Content, PATHINFO_EXTENSION) == "") {
createAuditFile($fm, $FTPConnectionID, $ContentPath, $TextFile);
echo "Recursive call for " . $ContentPath . "\n";
//If a file is found, write the file's FTP path to the text file
} else {
echo "Writing to file: " . $ContentPath . "\n";
fwrite($TextFile, $ContentPath . "\n");
}
}
}
}
I can provide other code if needed, but I think my question is less of a coding issue, and more of an understanding ftp_nlist issue. I've been stuck on this for hours, so any help is appreciated. And like I said, this function works just fine for most folder paths passed to it, the problem is when there are tens of thousands of files within the folder. Thank you!
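One assumption worth testing (not confirmed anywhere in the question) is that listing ~40,000 entries simply exceeds the default network timeout, so ftp_nlist gives up and returns false. A sketch of raising the timeout and falling back to ftp_rawlist (the 600-second value is arbitrary):
// raise the FTP network timeout before listing very large directories
ftp_set_option($FTPConnectionID, FTP_TIMEOUT_SEC, 600);
$FolderContents = ftp_nlist($FTPConnectionID, $FolderPath);
if ($FolderContents === false) {
    // fallback: ftp_rawlist() returns one raw LIST line per entry,
    // which then has to be parsed for the file names
    $raw = ftp_rawlist($FTPConnectionID, $FolderPath);
}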
I tried to use file_exists(URL/robots.txt) to see if the file exists on randomly chosen websites, and I get a false response.
How do I check if the robots.txt file exists?
I don't want to start the download before I check.
Will using fopen() do the trick? Because: Returns a file pointer resource on success, or FALSE on error.
and I guess that I can put something like:
$f = @fopen($url, "r");
if($f) ...
my code:
http://www1.macys.com/robots.txt
maybe it's not there
http://www.intend.ro/robots.txt
maybe it's not there
http://www.emag.ro/robots.txt
maybe it's not there
http://www1.bloomingdales.com/robots.txt
maybe it's not there
try {
if (file_exists($file))
{
echo 'exists'.PHP_EOL;
$curl_tool = new CurlTool();
$content = $curl_tool->fetchContent($file);
//if the file exists on local disk, delete it
if (file_exists(CRAWLER_FILES . 'robots_' . $website_id . '.txt'))
unlink(CRAWLER_FILES . 'robots_' . $website_id . '.txt'); // the original used $website here, presumably a typo for $website_id
echo CRAWLER_FILES . 'robots_' . $website_id . '.txt', $content . PHP_EOL;
file_put_contents(CRAWLER_FILES . 'robots_' . $website_id . '.txt', $content);
}
else
{
echo 'maybe it\'s not there'.PHP_EOL;
}
} catch (Exception $e) {
echo 'EXCEPTION ' . $e . PHP_EOL;
}
file_exists cannot be used on resources on other websites. It's intended for the local filesystem. Have a look here on how to perform the check properly.
As others have mentioned in the comments, and as the link says, it's (probably) easiest to use the get_headers function to do this:
try {
    $headers = get_headers($url);
    // get_headers() returns an array of header lines; the HTTP status
    // line is element 0, so check that rather than the whole array
    if (strpos($headers[0], "404") !== false) {
        // 404: the file does not exist
        ... your code ...
    } else {
        // no 404: the file is (probably) there
        ... you get the idea ...
    }
} catch (Exception $e) {
    // the original try had no catch; one is required for valid PHP
}
Just to second what other people said,
it's best to use cURL in PHP to find out whether http://example.com/robots.txt returns a 404 status code. If it does, the file does not exist. If it returns a 200, it exists.
Be wary of custom 404 pages though; I've never looked into what they return.
The http:// wrapper does not support stat() functionality, which file_exists() needs; you will need to check the HTTP response code from e.g. cURL.
As of PHP 5.0.0, this function can also be used with some URL wrappers. Refer to Supported Protocols and Wrappers to determine which wrappers support stat() family of functionality.
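A sketch of that cURL-based check, using a HEAD request so the body is never downloaded (treating only a 200 as "exists" is a simplifying assumption; redirects are followed):
$ch = curl_init($url); // e.g. "http://example.com/robots.txt"
curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request: headers only
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the response
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($status === 200) {
    // robots.txt exists; safe to start the download
}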
This script is supposed to write log files using file locks etc to make sure that scripts running at the same time don't have any read/write complications. I got it off someone on php.net. When I tried to run it twice at the same time, I noticed that it completely ignored the lock file. However, when I ran them consecutively, the lock file worked just fine.
That doesn't make any sense whatsoever. The script just checks if a file exists and acts based on that. Whether another script is running or not shouldn't influence it at all. I double checked to make sure the lock file was created in both cases; it was.
So I started to do some testing.
First instance started at 11:21:00 outputs:
Started at: 2012-04-12 11:21:00
Checking if weblog/20120412test.txt.1.wlock exists
Got lock: weblog/20120412test.txt.1.wlock
log file not exists, make new
log file was either appended to or create anew
Wrote: 2012-04-12 11:21:00 xx.xx.xx.xxx "testmsg"
1
Second instance started at 11:21:03 outputs:
Started at: 2012-04-12 11:21:00
Checking if weblog/20120412test.txt.1.wlock exists
Got lock: weblog/20120412test.txt.1.wlock
log file not exists, make new
log file was either appended to or create anew
Wrote: 2012-04-12 11:21:00 xx.xx.xx.xxx "testmsg"
1
So there are two things wrong here: the timestamp, and the fact that the script says the lock file doesn't exist even though it most certainly does.
It's almost as if the second instance of the script simply outputs what the first one did.
<?php
function Weblog_debug($input)
{
echo $input."<br/>";
}
function Weblog($directory, $logfile, $message)
{
// Created 15 september 2010: Mirco Babin
$curtime = time();
$startedat = date('Y-m-d',$curtime) . "\t" . date('H:i:s', $curtime) . "\t";
Weblog_debug("Started at: $startedat");
$logfile = date('Ymd',$curtime) . $logfile;
//Set directory correctly
if (!isset($directory) || $directory === false)
$directory = './';
if (substr($directory,-1) !== '/')
$directory = $directory . '/';
$count = 1;
while(1)
{
//*dir*/*file*.*count*
$logfilename = $directory . $logfile . '.' . $count;
//*dir*/*file*.*count*.lock
$lockfile = $logfilename . '.wlock';
$lockhandle = false;
Weblog_debug("Checking if $lockfile exists");
if (!file_exists($lockfile))
{
$lockhandle = #fopen($lockfile, 'xb'); //lock handle true if lock file opened
Weblog_debug("Got lock: $lockfile");
}
if ($lockhandle !== false) break; //break loop if we got lock
$count++;
if ($count > 100) return false;
}
//log file exists, append
if (file_exists($logfilename))
{
Weblog_debug("log file exists, append");
$created = false;
$loghandle = @fopen($logfilename, 'ab');
}
//log file not exists, make new
else
{
Weblog_debug("log file not exists, make new");
$loghandle = @fopen($logfilename, 'xb');
if ($loghandle !== false) //Did we make it?
{
$created = true;
$str = '#version: 1.0' . "\r\n" .
'#Fields: date time c-ip x-msg' . "\r\n";
fwrite($loghandle,$str);
}
}
//was log file either appended to or create anew?
if ($loghandle !== false)
{
Weblog_debug("log file was either appended to or create anew");
$str = date('Y-m-d',$curtime) . "\t" .
date('H:i:s', $curtime) . "\t" .
(isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-') . "\t" .
'"' . str_replace('"', '""', $message) . '"' . "\r\n";
fwrite($loghandle,$str);
Weblog_debug("Wrote: $str");
fclose($loghandle);
//Only chmod if new file
if ($created) chmod($logfilename,0644); // Read and write for owner, read for everybody else
$result = true;
}
else
{
Weblog_debug("log file was not appended to or create anew");
$result = false;
}
/**
Sleep & disable unlinking of lock file, both for testing purposes.
*/
//Sleep for 10sec to allow other instance(s) of script to run while this one still in progress.
sleep(10);
//fclose($lockhandle);
//#unlink($lockfile);
return $result;
}
echo Weblog("weblog", "test.txt", "testmsg");
?>
UPDATE:
Here's a simple script that just shows the timestamp. I tried it on a different host, so I don't think it's a problem with my server:
<?php
function Weblog_debug($input)
{
echo $input."<br/>";
}
$curtime = time();
$startedat = date('Y-m-d',$curtime) . "\t" . date('H:i:s', $curtime) . "\t";
Weblog_debug("Started at: $startedat");
$timediff = time() - $curtime;
while($timediff < 5)
{
$timediff = time() - $curtime;
}
Weblog_debug("OK");
?>
Again, if I start the second instance of the script while the first is in the while loop, the second script will state it started at the same time as the first.
I can't fricking believe this myself, but it turns out this is just a "feature" in Opera. The script works as intended in Firefox. I kinda wish I'd tested that before I went all berserk on this, but there ya go.
I have strong reason to believe that both rename() and unlink() are asynchronous, which, from my understanding, means that when the functions are called, the code below them continues executing before they finish their work on the filesystem. This is a problem for the internet app I'll explain below, because later code depends on these changes already being set in stone. So, is there a way to make both synchronous, so that execution freezes when it hits these functions until their filesystem tasks are fully carried out?
Here is the code in delete-image.php, which is called via ajax from admin-images.php (the latter will not be shown):
// (snippet begins mid-script: the '} else {' near the end pairs with an if that was not included in the question)
foreach ($dirScan as $key => $value) {
$fileParts = explode('.', $dirScan[$key]);
if (isset($fileParts[1])) {
if ((!($fileParts[1] == "gif") && !($fileParts[1] == "jpg")) && (!($fileParts[1] == "png") && !($fileParts[1] == "jpeg"))) { // the original tested "jpg" twice; "jpeg" was presumably intended
unset($dirScan[$key]);
}
} else {
unset($dirScan[$key]);
}
}
$dirScan = array_values($dirScan);
// for thumbnail
$file = 'galleries/' . $currentGal . '/' . $currentDir . "/" . $dirScan[$imageNum - 1];
unlink($file);
for ($i = ($imageNum - 1) + 1; $i < count($dirScan); $i++) {
$thisFile = 'galleries/' . $currentGal . '/' . $currentDir . '/' . $dirScan[$i];
$thisSplitFileName = explode('.', $dirScan[$i]);
$newName = 'galleries/' . $currentGal . '/' . $currentDir . "/" . ($thisSplitFileName[0] - 1) . "." . $thisSplitFileName[1];
rename($thisFile, $newName);
}
// for large image
$fileParts = explode('.', $dirScan[$imageNum - 1]);
$file = 'galleries/' . $currentGal . '/' . $currentDir . "/large/" . $fileParts[0] . "Large." . $fileParts[1];
unlink($file);
for ($i = ($imageNum - 1) + 1; $i < count($dirScan); $i++) {
$thisSplitFileName = explode('.', $dirScan[$i]);
$thisFile = 'galleries/' . $currentGal . '/' . $currentDir . '/large/' . $thisSplitFileName[0] . "Large." . $thisSplitFileName[1];
$newName = 'galleries/' . $currentGal . '/' . $currentDir . "/large/" . ($thisSplitFileName[0] - 1) . "Large." . $thisSplitFileName[1];
rename($thisFile, $newName);
}
sleep(1);
echo 'deleted ' . $dirScan[$imageNum - 1] . " successfully!";
} else {
echo "please set the post data";
} ?>
After this script returns its completed text, admin-images.php triggers a new function which populates an image table from these renamed and trimmed files. Sometimes it displays old names and files that were supposed to be deleted, and a simple page refresh gets rid of them. This seems to suggest that the above PHP script runs through all the code and spits out the echoed text to the main file before it completes its filesystem manipulation (all of this other code is long and complicated, and hopefully unnecessary for the discussion at hand).
You'll notice I've tried a sleep() call to halt the PHP script and hopefully give it time to finish. This is an inelegant and problematic way of doing things, because I have to allow a large amount of time to ensure it works every time, yet I don't want the user to wait longer than they have to.
Mind that filesystems often use caches to reduce load. Normally you won't notice, but sometimes you need to clear the cache to get at the real information. Check the configuration of your filesystem if your issue is filesystem-related.
PHP itself uses a cache for some file operations as well, so clear that, too.
See clearstatcache to clear the PHP stat cache.
Take note that this is a "view" issue: the file is actually deleted on disk, but PHP might still report that it's there (until you clear the cache).
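For the PHP side of that, a minimal sketch:
unlink($file);
clearstatcache();             // drop PHP's cached stat data for this path
var_dump(file_exists($file)); // now reflects the real filesystem state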
I suppose they are not asynchronous, because they return a result telling whether the operation was successful or not.
I believe the problem happens because when you run scandir after making the modifications, it may be using "cached" data from memory instead of re-scanning the file system.
rename() is not, but unlink() is asynchronous on Windows.
Because there seems to be no way of waiting for a pending delete to finish, this answer suggests to rename a file before deleting it. PHP does not seem to do that, so you can assume it's asynchronous.
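Following that answer's rename-before-delete suggestion, a sketch (the throwaway suffix is an arbitrary assumption):
$trash = $file . '.deleted.' . uniqid();
rename($file, $trash); // rename is synchronous, so $file is gone at once
unlink($trash);        // any pending delete now targets the throwaway name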
For file operations to target the right location, build your paths from $_SERVER["DOCUMENT_ROOT"]; if you don't, the operation may not act on the file you expect. Also, if you are using a Linux server, you will need to set permissions on the folders in which you want to perform the file operation.
And mind that both operations are synchronous, not asynchronous. Behaviour also depends on the type of server or the OS that you are using.
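In other words, something like this (the path fragments are illustrative, borrowed from the question's code):
$file = $_SERVER['DOCUMENT_ROOT'] . '/galleries/' . $currentGal . '/' . $currentDir . '/' . $dirScan[$imageNum - 1];
unlink($file);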
I have a problem with the php_zip.dll's ZipArchive class. I'm using it through the ZipArchiveImproved wrapper class suggested on php.net to avoid the max file-handle issue.
The problem is really simple: 700 files are added properly (jpg image files), and the rest fails. The addFile method returns false.
The PHP version is 5.2.6.
The weird thing is that this actually used to work.
What could be the problem? Can you give me any clues?
Thank you very much in advance!
Edit: sorry, it's not true that I'm not getting any error message (display_errors was switched off in php.ini; I didn't notice it before). From the 701st file on, I'm getting the following error message:
Warning: ZipArchive::addFile() [ziparchive.addfile]: Invalid or uninitialized Zip object in /.../includes/ZipArchiveImproved.class.php on line 104
Looks like the close() call returns false, but issues no error. Any ideas?
Edit 2: the relevant source:
include_once DIR_INCLUDES . 'ZipArchiveImproved.class.php';
ini_set('max_execution_time', 0);
$filePath = $_SESSION['fqm_archivePath'];
$zip = new ZipArchiveImproved();
if(! $zip->open($filePath, ZipArchive::CREATE)) // note: plain ZipArchive::open() returns true or an int error code (not false); if the wrapper passes that through, a strict !== true check is safer
{
echo '<div class="error">Hiba: a célfájl a(z) "' . $filePath . '" útvonalon nem hozható létre.</div>';
return;
}
echo('Starting (' . count($_POST['files']) . ' files)...<br>');
$addedDirs = array();
foreach($_POST['files'] as $i => $f)
{
$d = getUserNameByPicPath($f);
if(! isset($addedDirs[$d]))
{
$addedDirs[$d] = true;
$zip->addEmptyDir($d);
echo('Added dir "' . $d . '".<br>');
}
$addName = $d . '/' . basename($f);
$r = $zip->addFile($f, $addName);
if(! $r)
{
echo('<font color="Red">[' . ($i + 1) . '] Failed to add file "' . $f . '" as "' . $addName . '".</font><br>');
}
}
$a = $zip->addFromString('test.txt', 'Moooo');
if($a)
{
echo 'Added string successfully.<br>';
}
else
{
echo 'Failed to add string.<br>';
}
$zip->close();
It's probably because of the maximum number of open files in your OS (see http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/ for more detailed info; it can be system-wide or a per-user limit).
Zip keeps every added file open until ZipArchive::close() is called.
The solution is to close and reopen the archive every X files (256 or 512 should be a safe value).
The problem is described here: http://www.php.net/manual/en/function.ziparchive-open.php#88765
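A sketch of that close-and-reopen pattern, using 256 as the batch size suggested above (error handling omitted for brevity):
$zip = new ZipArchive();
$zip->open($filePath, ZipArchive::CREATE);
foreach ($_POST['files'] as $i => $f) {
    $zip->addFile($f, basename($f));
    // ZipArchive keeps every added file open until close(), so flush
    // the handles periodically to stay under the OS open-file limit
    if (($i + 1) % 256 === 0) {
        $zip->close();
        $zip->open($filePath); // reopen the existing archive and continue
    }
}
$zip->close();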
Have you tried to specify both flags?
I solved this problem by increasing the ulimit: ulimit -n 8192.