I have a script which constantly appends strings to a file.
For example (this is a test script):
$i = 1;
$file = 'wikipedia/test.txt';
$text = 'the quick brown fox jumps over the lazy dog';
while ($i != 0)
{
    file_put_contents($file, $text, FILE_APPEND);
}
But for an unknown reason my program stops appending strings when the text file reaches a file size of 2097156 bytes. It isn't a disk space issue, since I can still create another text file, yet that one is limited to exactly the same file size.
I tried other PHP functions such as fwrite() and fputs(), but they didn't work either.
Any idea why this problem occurs?
Seems unlikely, but you might have run up against PHP's max_execution_time if its current setting is very low. Try increasing its value in php.ini.
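For example, something like this at the top of the script would lift the limit for that run (just a sketch, not code from the question):

// Remove the execution time limit for this script only (0 = no limit).
set_time_limit(0);
// or, equivalently:
ini_set('max_execution_time', '0');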
Your loop doesn't make sense: it never changes $i.
Try it without the while.
$file = 'wikipedia/test.txt';
$text = 'the quick brown fox jumps over the lazy dog';
file_put_contents($file, $text, FILE_APPEND );
There are several issues that could cause this problem.
You may have encountered the max execution time limit (default: 30 seconds).
You may have exhausted the memory limit (default: varies by version).
Something may have changed on disk (the file permissions may have changed, or you may have exceeded a disk quota).
PHP's error output would be invaluable for identifying which of these issues contributed to the problem.
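For example, while debugging you could turn full error reporting on at the top of the script (a minimal sketch; on a live site you would log rather than display):

error_reporting(E_ALL);           // report every kind of error
ini_set('display_errors', '1');   // show them in the output while debugging
ini_set('log_errors', '1');       // and keep them in the error log as well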
Related
In PHP, I use fopen(), fgets(), and fclose() to read a file line by line. It works well. But I have a script (run from the CLI) that has to process three hundred 5 GB text files. That's approximately 3 billion fgets() calls. It works well enough, but at this scale tiny speed savings add up extremely fast. So I'm wondering if there are any tricks to speed up the process?
The only potential thing I thought of was getting fgets() to read more than one line at once. It doesn't look like it supports that, but I could in theory do, let's say, 20 consecutive $line[] = fgets($file); calls and then process the array. That's not quite the same thing as reading multiple lines in one command, so it may not have any effect. But I know that queueing MySQL inserts and sending them as one giant insert (another trick I'm going to implement in this script after more testing and benchmarking) will save a lot of time.
Update 4/13/19
Here is the solution I went with. Originally I had a much more complicated method of slicing off the end of each read, but then I realized it can be done much more simply.
$read_length = 1024 * 128;                 // 128 KB per fread() (see the note below)
$index_file  = fopen('path/to/file', 'r'); // placeholder path
$chunk = "";
while ( !feof($index_file) )
{
    $chunk .= fread($index_file, $read_length);
    $payload_lines = explode("\n", $chunk);
    if ( !feof($index_file) )
    { $chunk = array_pop($payload_lines); } // keep the (possibly partial) last piece for the next read
    // ... process each complete line in $payload_lines here ...
}
fclose($index_file);
Of course PHP has a function for everything. So I break every read into an array of lines and array_pop() the last item in the array back into the 'read buffer'. That last item is probably a partial line, but not necessarily. Either way, it goes back in and gets processed with the next loop (unless we're done with the file, in which case we don't pop it).
The only thing you have to watch out for here is a line so long that a single read won't capture the whole thing. But if you know your data, that probably won't be a hassle. For me, I'm parsing a JSON-ish file, and I'm reading 128 KB at a time, so there are always many line breaks in my read.
Note: I settled on 128 KB by doing a million benchmarks and finding the size my server processes the absolute fastest. This parsing function will run 300 times, so every second I save per file saves me 5 minutes of total runtime.
One possible approach that might be faster would be to read large chunks of the file with fread(), split them by newlines, and then process the lines. You'd have to take into account that a chunk may sever lines, and you'd have to detect this and glue them back together.
Generally speaking, the larger the chunk you can read in one go, the faster your process should become, within the limits of your available memory.
From fread() docs:
Note that fread() reads from the current position of the file pointer. Use ftell() to find the current position of the pointer and rewind() to rewind the pointer position.
I am reading from log files which can be anything from a small file up to 8-10 MB of logs. The typical size would probably be 1 MB. Now the key thing is that the keyword I'm looking for is normally near the end of the document, in probably 95% of cases. Then I extract 1000 characters after the keyword.
If i use this approach:
$lines = explode("\n",$body);
$reversed = array_reverse($lines);
foreach ($reversed AS $line) {
    // Search for my keyword
}
Would it be more efficent than using:
$pos = stripos($body,$keyword);
$snippet_pre = substr($body, $pos, 1000);
What I am not sure about is whether stripos just starts searching through the document one character at a time. In theory, if there are 10,000 characters after the keyword, I won't have to read those into memory, whereas the first option would have to read everything into memory even though it probably only needs the last 100 lines. Could I alter it to read 100 lines into memory, then search lines 101-200 if the first 100 weren't successful, or is the query so light that it doesn't really matter?
I have a second question, and this assumes the array_reverse approach is best: how would I extract the next 1000 characters after I have found the keyword? Here is my woeful attempt:
$body = $this_is_the_log_content;
$lines = explode("\n",$body);
$reversed = array_reverse($lines);
foreach ($reversed AS $line) {
    $pos = stripos($line, $keyword);
    $snippet_pre = substr($line, $pos, 1000);
}
Why I don't think that will work is that each $line might only be a few hundred characters. So would the better solution be to split the content into chunks of, say, 2,000 characters, and also keep the previous chunk as a backup variable? Something like this:
$body = $this_is_the_log_content;
$lines = str_split($body, 2000);
$reversed = array_reverse($lines);
$previous_line = $line;
foreach ($reversed AS $line) {
    $pos = stripos($line, $keyword);
    if ($pos) {
        $line = $previous_line . ' ' . $line;
        $pos1 = stripos($line, $keyword);
        $snippet_pre = substr($line, $pos, 1000);
    }
}
I'm probably massively over-complicating this?
I would strongly consider using a tool like grep for this. You can call this command line tool from PHP and use it to search the file for the word you are looking for and do things like give you the byte offset of the matching line, give you a matching line plus trailing context lines, etc.
Here is a link to the grep manual: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
Play with the command a bit on the command line to get it the way you want it, then call it from PHP using exec(), passthru(), or similar depending on how you need to capture/display the content.
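As a rough sketch of what the PHP side could look like (the flags and variable names here are illustrative, not a tested command): grep -b prints the byte offset of each matching line, tail -n 1 keeps only the last match, and PHP then reads 1000 bytes from that offset.

// Illustrative only: $logFile and $keyword are assumed to be defined and trusted.
$cmd = sprintf('grep -b -i %s %s | tail -n 1',
    escapeshellarg($keyword), escapeshellarg($logFile));
$lastMatch = exec($cmd);                      // e.g. "1048576:...matching line..."
if ($lastMatch !== '') {
    $offset = (int) strtok($lastMatch, ':');  // byte offset before the first colon
    $fh = fopen($logFile, 'r');
    fseek($fh, $offset);
    $snippet = fread($fh, 1000);              // 1000 bytes starting at that line
    fclose($fh);
}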
Alternatively, you can simply fopen() the file, position the file pointer near the end with fseek(), and search for the string as you move along. Once you find your needle, you can then read the file from that offset until you reach the end of the file or the desired number of log entries.
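A rough sketch of that idea (the 64 KB window and the variable names are assumptions, and the corner case of a keyword split across the window boundary is ignored):

$step   = 64 * 1024;                      // how far back to jump each time
$size   = filesize($logFile);
$fh     = fopen($logFile, 'r');
$offset = max(0, $size - $step);          // start near the end, where the keyword usually is
$snippet = null;
while (true) {
    fseek($fh, $offset);
    $buffer = fread($fh, $size - $offset);        // read from $offset to the end of the file
    $pos = stripos($buffer, $keyword);
    if ($pos !== false) {
        $snippet = substr($buffer, $pos, 1000);   // 1000 chars starting at the match
        break;
    }
    if ($offset === 0) {
        break;                                    // searched the whole file, no match
    }
    $offset = max(0, $offset - $step);            // widen the window towards the start
}
fclose($fh);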
Either of these might be preferable to reading the entire log file into memory and then trying to work with it.
The other thing to consider is whether 1000 characters is meaningful. Typically log files have lines that vary in length. To me it seems you should be more concerned about getting the next X lines from the log file, not the next Y characters. What if a line has 2000 characters? Are you saying you only want to get half of it? That may not be meaningful at all.
I am working on old sites and updating the deprecated PHP functions. I have the following code, which creates an error when I change ereg_replace to preg_replace.
private function stripScriptTags($string) {
    $pattern = array("'\/\*.*\*\/'si", "'<\?.*?\?>'si", "'<%.*?%>'si", "'<script[^>]*?>.*?</script>'si");
    $replace = array("", "", "", "");
    return ereg_replace($pattern, $replace, $string);
}
This is the error I get:
Fatal error: Allowed memory size of 10000000 bytes exhausted (tried to allocate 6249373 bytes) in C:\Inetpub\qcppos.com\library\vSearch.php on line 403
Is there something else in that line of code that I need to be changing along with the ereg_replace?
So your regexes are as follows:
"'\/\*.*\*\/'si"
"'<\?.*?\?>'si"
"'<%.*?%>'si"
"'<script[^>]*?>.*?</script>'si"
Taking those one at a time: you are first greedily stripping out multiline comments. This is almost certainly where your memory problem is coming from; you should make that quantifier ungreedy.
Next up, you are stripping out anything that looks like a PHP tag. This is done with a lazy quantifier, so I don't see any issue with it. The same goes for the ASP tag and, finally, the script tag.
Leaving aside potential XSS threats not covered by your regex, the main issue seems to be coming from your first regex. Try "'\/\*.*?\*\/'si" instead.
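Put together, the whole thing might look like this (sketched as a standalone function rather than the original class method; only the first quantifier changes):

// preg_replace() accepts arrays of patterns and replacements directly.
function stripScriptTags($string) {
    $pattern = array(
        "'\/\*.*?\*\/'si",                 // /* ... */ comments, now lazy
        "'<\?.*?\?>'si",                   // PHP-style tags
        "'<%.*?%>'si",                     // ASP-style tags
        "'<script[^>]*?>.*?</script>'si",  // script blocks
    );
    $replace = array("", "", "", "");
    return preg_replace($pattern, $replace, $string);
}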
Get your memory limit value:
ini_get('memory_limit');
and check your script with memory_get_usage() and memory_get_peak_usage() (the latter needs PHP 5.2 or higher).
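For example, you might drop something like this at a suspect point in the script (just a sketch):

echo 'limit: '   . ini_get('memory_limit') . "\n";
echo 'current: ' . round(memory_get_usage(true) / 1048576, 2) . " MB\n";
echo 'peak: '    . round(memory_get_peak_usage(true) / 1048576, 2) . " MB\n";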
If this turns out to be too low, then you can set it higher with:
ini_set("memory_limit","128M"); // "8M" before PHP 5.2.0, "16M" in PHP 5.2.0, > "128M"
Just put in whatever amount of memory you actually have and that works for your script. Keep in mind that this limits the individual PHP process; to set it globally you need to adapt your php.ini file.
Obviously, if it needs an insane amount to run, then consider this a monkey patch and start rewriting it for a better memory footprint.
For more info, look at the core php.ini directives and search for Resource Limits.
Since allowing a higher memory allocation wasn't working, the functions were updated as follows (they weren't actually doing anything useful, only causing issues):
private function stripScriptTags($string)
{
/* $pattern = array("'\/\*.*\*\/'si", "'<\?.*?\?>'si", "'<%.*?%>'si",
"'<script[^>]*?>.*?</script>'si");
$replace = array("", "", "", "");
return ereg_replace($pattern, $replace, $string);
*/
return $string;
}
private function clearSpaces($string, $clear_enters = true)
{
/*$pattern = ($clear_enters == true) ? ("/\s+/") : ("/[ \t]+/");
return preg_replace($pattern, " ", trim($string));
*/
return $string;
}
What is the best way to write to files in a large PHP application? Let's say there are lots of writes needed per second. What is the best way to go about this?
Could I just open the file and append the data? Or should I open, lock, write, and unlock?
What will happen if the file is being worked on and other data needs to be written? Will that activity be lost, or will it be saved? And if it is saved, will it halt the application?
If you have been, thank you for reading!
Here's a simple example that highlights the danger of simultaneous writes:
<?php
for($i = 0; $i < 100; $i++) {
$pid = pcntl_fork();
//only spawn more children if we're not a child ourselves
if(!$pid)
break;
}
$fh = fopen('test.txt', 'a');
//The following is a simple attempt to get multiple threads to start at the same time.
$until = round(ceil(time() / 10.0) * 10);
echo "Sleeping until $until\n";
time_sleep_until($until);
$myPid = posix_getpid();
//create a line starting with pid, followed by 10,000 copies of
//a "random" char based on pid.
$line = $myPid . str_repeat(chr(ord('A')+$myPid%25), 10000) . "\n";
for($i = 0; $i < 1; $i++) {
fwrite($fh, $line);
}
fclose($fh);
echo "done\n";
If appends were safe, you should get a file with 100 lines, all of them roughly 10,000 chars long and beginning with an integer. And sometimes, when you run this script, that's exactly what you'll get. Sometimes, however, a few appends will conflict and the file will get mangled.
You can find corrupted lines with grep '^[^0-9]' test.txt
This is because file append is only atomic if:
You make a single fwrite() call
and that fwrite() is smaller than PIPE_BUF (somewhere around 1-4k)
and you write to a fully POSIX-compliant filesystem
If you make more than a single call to fwrite during your log append, or you write more than about 4k, all bets are off.
Now, as to whether or not this matters: are you okay with having a few corrupt lines in your log under heavy load? Honestly, most of the time this is perfectly acceptable, and you can avoid the overhead of file locking.
I do have a high-performance, multi-threaded application where all threads write (append) to a single log file. So far I have not noticed any problems with that; each thread writes multiple times per second and nothing gets lost. I think just appending to a huge file should be no issue. But if you want to modify already existing content, especially with concurrency, I would go with locking, otherwise a big mess can happen...
If concurrency is an issue, you should really be using databases.
If you're just writing logs, maybe you should take a look at the syslog() function, since syslog provides an API for this.
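Something along these lines, if you go that route (the prefix and facility are placeholders):

openlog('myapp', LOG_PID, LOG_LOCAL0);        // 'myapp' tags each entry, PID included
syslog(LOG_INFO, 'something worth logging');
closelog();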
You could also delegate writes to a dedicated backend and do the job in an asynchronous manner.
These are my 2p.
Unless a unique file is needed for a specific reason, I would avoid appending everything to one huge file. Instead, I would wrap (rotate) the file by time and size. A couple of configuration parameters (wrap_time and wrap_size) could be defined for this.
Also, I would probably introduce some buffering to avoid waiting for the write operation to complete.
PHP is probably not the most suitable language for this kind of operation, but it is still possible.
Use flock()
See this question
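A minimal sketch of a locked append ($file and $line are placeholders, not code from the linked question):

$fh = fopen($file, 'a');
if ($fh !== false && flock($fh, LOCK_EX)) {   // block until we hold an exclusive lock
    fwrite($fh, $line);
    fflush($fh);                               // flush before releasing the lock
    flock($fh, LOCK_UN);
}
if ($fh !== false) {
    fclose($fh);
}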
If you just need to append data, PHP should be fine with that, as the filesystem should take care of simultaneous appends.
I'm trying to get file contents, replace some parts of it using regular expressions and preg_replace and save it to another file:
$content = file_get_contents('file.txt', true);
$content_replaced = preg_replace('/\[\/m\]{1}\s+(\{\{.*\}\})\s+[\x{4e00}-\x{9fa5}]+/u', 'replaced text', $content);
if ($content_replaced) {
    file_put_contents('file_new.txt', $content_replaced);
    echo "Successful!";
}
else {
    echo "Some error occurred";
}
This piece of code works fine with small files, but when I try the original file, which is about 60 MB, it just keeps giving me the message "Some error occurred".
Any suggestions are greatly appreciated.
Update: no errors in the logs, and the memory limit is set to 1024M.
I've had max/limit issues with file_put_contents.
No idea what the limits might be, but using fwrite solved my troubles and I put down the bottle.
You're probably running out of memory. What's the memory_limit set to? (phpinfo() will tell you). You may be able to increase the memory limit like:
ini_set('memory_limit','128M');
I'm pretty sure you're hitting some regex limit. Heck, some time ago I hit a limit with 1000 chars... With 60 MB of input I bet you will hit regex limits everywhere, even with really simple patterns. I would try at least to simplify the pattern as much as possible, making it ungreedy with .*? instead of .* where possible.
To get more information, just check the return value of preg_last_error().
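For example (a small sketch; $pattern stands in for the regex from the question):

$content_replaced = preg_replace($pattern, 'replaced text', $content);
if ($content_replaced === null) {
    // null means PCRE itself failed, e.g. PREG_BACKTRACK_LIMIT_ERROR
    // or PREG_RECURSION_LIMIT_ERROR; the code tells you which.
    echo 'preg error code: ' . preg_last_error();
}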