Imagine that a campaign has 10,000 to 30,000 files, about 4 KB each, that should be written to disk.
And, there will be a couple of campaigns running at the same time. 10 tops.
Currently, I'm going with the usual way: file_put_contents.
It gets the job done, but slowly, and the PHP process sits at 100% CPU usage the whole time.
fopen, fwrite, fclose: well, the result is similar to file_put_contents.
I've tried some async I/O stuff such as PHP eio and Swoole.
It's faster, but it yields "too many open files" after some time.
php -r 'echo exec("ulimit -n");' reports 800000.
Any help would be appreciated!
Well, this is sort of embarrassing... you guys are correct: the bottleneck is how the file content is generated.
I am assuming that you cannot follow SomeDude's very good advice on using databases instead, and you already have performed what hardware tuning could be performed (e.g. increasing cache, increasing RAM to avoid swap thrashing, purchasing SSD drives).
I'd try and offload the file generation to a different process.
You could e.g. install Redis and store the file content in its key-value store, which is very fast. Then a different, parallel process could extract the data from the store, delete it, and write it to a disk file.
This removes all disk I/O from the main PHP process, lets you monitor the backlog (how many keys are still unflushed: ideally zero), and lets you concentrate on the bottleneck in content generation. You'll possibly need some extra RAM.
On the other hand, this is not too different from writing to a RAM disk. You could also output data to a RAM disk, and it would be probably even faster:
# As root
mkdir /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
mkdir /mnt/ramdisk/temp
mkdir /mnt/ramdisk/ready
# Change ownership and permissions as appropriate
and in PHP:
$fp = fopen("/mnt/ramdisk/temp/{$file}", "w");
fwrite($fp, $data);
fclose($fp);
rename("/mnt/ramdisk/temp/{$file}", "/mnt/ramdisk/ready/{$file}");
and then have a different process (crontab? Or a continuously running daemon?) move files from the "ready" directory of the RAM disk to the disk, then delete the copy in the RAM disk's "ready" directory.
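The mover process could look roughly like the following. This is only a sketch under assumptions: the helper name flush_ready_files and the destination directory are hypothetical, and error handling is minimal. It copies each file under a temporary name and renames it inside the destination directory, so that nothing reading the destination sees a half-written file.

```php
<?php
// Hypothetical helper: move every finished file from the RAM disk's "ready"
// directory to permanent storage. Returns the number of files moved.
function flush_ready_files($readyDir, $destDir)
{
    $moved = 0;
    foreach (scandir($readyDir) as $name) {
        $src = "{$readyDir}/{$name}";
        if (!is_file($src)) {
            continue; // skips "." and ".." as well as subdirectories
        }
        // Copy under a temporary name first, then rename within the
        // destination filesystem so the final name appears all at once.
        $tmp = "{$destDir}/.{$name}.part";
        if (copy($src, $tmp) && rename($tmp, "{$destDir}/{$name}")) {
            unlink($src); // free the RAM only after the disk copy is complete
            $moved++;
        }
    }
    return $moved;
}

// Daemon-style usage (the destination path is an assumption):
// while (true) {
//     flush_ready_files('/mnt/ramdisk/ready', '/var/data/campaigns');
//     sleep(1);
// }
```

Because the writer renames files from temp to ready only when they are complete, the mover never picks up a partially written file.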
File System
The time required to create a file depends on the number of files already in the directory, and how that time grows depends in turn on the file system: ext4, ext3, zfs, btrfs etc. will exhibit different behaviour. Specifically, you might experience significant slowdowns once the number of files exceeds some threshold.
So you might want to time the creation of a large number of sample files in one directory, and see how the time grows with the number of files. Keep in mind that there is also a performance penalty for accessing different directories, so jumping straight to a very large number of subdirectories is not recommended either.
<?php
$payload = str_repeat("Squeamish ossifrage. \n", 253);
$time = microtime(true);
for ($i = 0; $i < 10000; $i++) {
    $fp = fopen("file-{$i}.txt", "w");
    fwrite($fp, $payload);
    fclose($fp);
}
$time = microtime(true) - $time;
for ($i = 0; $i < 10000; $i++) {
    unlink("file-{$i}.txt");
}
print "Elapsed time: {$time} s\n";
Creation of 10000 files takes 0.42 seconds on my system, but creation of 100000 files (10x) takes 5.9 seconds, not 4.2. On the other hand, creating one eighth of those files in 8 separate directories (the best compromise I found) takes 6.1 seconds, so it's not worthwhile.
But suppose that creating 300000 files took 25 seconds instead of 17.7; dividing those files in ten directories might take 22 seconds, and make the directory split worthwhile.
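To try the directory split on your own system, something along these lines could be used. It is a sketch, not the exact script behind the numbers above: the helper name time_sharded_creation and the temporary directory layout are assumptions. Files are assigned to directories in contiguous blocks, so each directory is filled in turn.

```php
<?php
// Hypothetical benchmark helper: create $count files spread over $shards
// subdirectories of $baseDir and return the elapsed time in seconds.
function time_sharded_creation($baseDir, $count, $shards, $payload)
{
    for ($s = 0; $s < $shards; $s++) {
        mkdir("{$baseDir}/{$s}", 0777, true);
    }
    $start = microtime(true);
    for ($i = 0; $i < $count; $i++) {
        // Contiguous blocks: the first $count/$shards files go to directory 0, etc.
        $dir = (int) floor($i * $shards / $count);
        file_put_contents("{$baseDir}/{$dir}/file-{$i}.txt", $payload);
    }
    return microtime(true) - $start;
}

$payload = str_repeat("Squeamish ossifrage. \n", 253);
foreach (array(1, 8) as $shards) {
    $base = sys_get_temp_dir() . "/shard-test-" . getmypid() . "-{$shards}";
    $elapsed = time_sharded_creation($base, 10000, $shards, $payload);
    print "{$shards} dir(s): {$elapsed} s\n";
    array_map('unlink', glob("{$base}/*/*.txt")); // clean up the sample files
}
```

Run it with the file counts you actually expect; the break-even point depends on the file system and hardware.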
Parallel processing: r strategy
TL;DR this doesn't work so well on my system, though your mileage may vary. If the operations to be done are lengthy (here they are not) and bound by a different resource than the main process, then it can be advantageous to offload each of them to a different process, provided you don't spawn too many.
You will need pcntl functions installed.
$payload = str_repeat("Squeamish ossifrage. \n", 253);
$time = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    $pid = pcntl_fork();
    switch ($pid) {
        case 0:
            // Parallel execution.
            $fp = fopen("file-{$i}.txt", "w");
            fwrite($fp, $payload);
            fclose($fp);
            exit();
        case -1:
            echo 'Could not fork Process.';
            exit();
        default:
            break;
    }
}
$time = microtime(true) - $time;
print "Elapsed time: {$time} s\n";
(The fancy name r strategy is taken from biology).
In this example, spawning times are catastrophic if compared to what each child needs to do. Therefore, overall processing time skyrockets. With more complex children things would go better, but you must be careful not to turn the script into a fork bomb.
One option, if feasible, is to divide the files to be created into, say, chunks of 10% each. Each child would then change its working directory with chdir(), and create its files in a different directory. This would negate the penalty for writing files in different subdirectories (each child writes in its current directory), while benefiting from writing fewer files per directory. In this case, with very lightweight and I/O bound operations in the child, again the strategy isn't worthwhile (I get doubled execution time).
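A minimal sketch of that chunked variant, assuming the pcntl extension and the CLI. The helper name create_files_in_chunks and the part-N directory names are hypothetical. Unlike the one-fork-per-file loop, it forks only once per chunk and the parent waits for every child.

```php
<?php
// Hypothetical helper: split $total files into $workers chunks and let one
// forked child write each chunk into its own subdirectory of $baseDir.
function create_files_in_chunks($baseDir, $total, $workers, $payload)
{
    $chunk = (int) ceil($total / $workers);
    for ($w = 0; $w < $workers; $w++) {
        $dir = "{$baseDir}/part-{$w}";
        mkdir($dir, 0777, true);
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("Could not fork process.\n");
        }
        if ($pid === 0) {
            // Child: work inside its own directory, then exit immediately.
            chdir($dir);
            $last = min($total, ($w + 1) * $chunk);
            for ($i = $w * $chunk; $i < $last; $i++) {
                file_put_contents("file-{$i}.txt", $payload);
            }
            exit(0);
        }
    }
    // Parent: reap every child so none is left behind as a zombie.
    while (pcntl_wait($status) > 0) {
        // keep waiting until there are no children left
    }
}

$payload = str_repeat("Squeamish ossifrage. \n", 253);
$base = sys_get_temp_dir() . '/chunk-test-' . getmypid();
$time = microtime(true);
create_files_in_chunks($base, 10000, 10, $payload);
$time = microtime(true) - $time;
print "Elapsed time: {$time} s\n";
array_map('unlink', glob("{$base}/part-*/*.txt")); // clean up the sample files
```

The elapsed time here includes waiting for the children, so it reflects total processing time rather than just the time spent forking.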
Parallel processing: K strategy
TL;DR this is more complex but works well... on my system. Your mileage may vary.
While r strategy involves lots of fire-and-forget processes, K strategy calls for a limited number of children (possibly one) which are nurtured carefully. Here we offload the creation of all the files to one parallel process, and communicate with it via sockets.
$payload = str_repeat("Squeamish ossifrage. \n", 253);

// socket_read() may return fewer bytes than requested, so loop until we have them all.
function read_exact($socket, $len) {
    $buf = '';
    while (strlen($buf) < $len) {
        $chunk = socket_read($socket, $len - strlen($buf), PHP_BINARY_READ);
        if (false === $chunk || '' === $chunk) {
            return false;
        }
        $buf .= $chunk;
    }
    return $buf;
}

// socket_write() may send only part of the buffer, so loop until it is all sent.
function write_all($socket, $data) {
    while ('' !== $data) {
        $sent = socket_write($socket, $data, strlen($data));
        if (false === $sent) {
            die("socket_write failed\n");
        }
        $data = (string) substr($data, $sent);
    }
}

$sockets = array();
$domain = (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN' ? AF_INET : AF_UNIX);
if (socket_create_pair($domain, SOCK_STREAM, 0, $sockets) === false) {
    die("socket_create_pair failed. Reason: ".socket_strerror(socket_last_error())."\n");
}
$pid = pcntl_fork();
if ($pid == -1) {
    die("Could not fork Process.\n");
} elseif ($pid) {
    /*parent*/
    socket_close($sockets[0]);
} else {
    /*child*/
    socket_close($sockets[1]);
    for (;;) {
        // Check for failure before trimming: trim(false) would hide the error.
        $cmd = read_exact($sockets[0], 5);
        if (false === $cmd) {
            die("ERROR\n");
        }
        $cmd = trim($cmd);
        if ('QUIT' === $cmd) {
            socket_write($sockets[0], "OK", 2);
            socket_close($sockets[0]);
            exit(0);
        }
        if ('FILE' === $cmd) {
            // Fixed-width header: 20 bytes of file name, 8 bytes of payload length.
            $file = trim(read_exact($sockets[0], 20));
            $len = (int) read_exact($sockets[0], 8);
            $data = read_exact($sockets[0], $len);
            $fp = fopen($file, "w");
            fwrite($fp, $data);
            fclose($fp);
            continue;
        }
        die("UNKNOWN COMMAND: {$cmd}");
    }
}
$time = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    write_all($sockets[1], sprintf("FILE %20.20s%08.08s", "file-{$i}.txt", strlen($payload)));
    write_all($sockets[1], $payload);
    //$fp = fopen("file-{$i}.txt", "w");
    //fwrite($fp, $payload);
    //fclose($fp);
}
$time = microtime(true) - $time;
print "Elapsed time: {$time} s\n";
write_all($sockets[1], "QUIT\n");
$ok = socket_read($sockets[1], 2, PHP_BINARY_READ);
socket_close($sockets[1]);
THIS IS HUGELY DEPENDENT ON THE SYSTEM CONFIGURATION. For example on a mono-processor, mono-core, non-threading CPU, this is madness - you'll at least double the total runtime, but more likely it will go from three to ten times as slow.
So this is definitely not the way to pimp up something running on an old system.
On a modern multithreading CPU and supposing the main content creation loop is CPU bound, you may experience the reverse - the script might go ten times faster.
On my system, the "forking" solution above runs a bit less than three times faster. I expected more, but there you are.
Of course, whether the performance is worth the added complexity and maintenance, remains to be evaluated.
The bad news
While experimenting above, I came to the conclusion that file creation on a reasonably configured and performant Linux machine is fast as hell, so not only is it difficult to squeeze out more performance, but if you're experiencing slowness, it's very likely not file related. Try giving some more detail about how you create that content.
Having read your description, I understand you're writing many files that are each rather small. The way PHP usually works (at least in the Apache server), there is overhead for each filesystem access: a file pointer and buffer are opened and maintained for each file. Since there are no code samples to review here, it's hard to see where the inefficiencies are.
However, using file_put_contents() for 300,000+ files appears to be slightly less efficient than using fopen() and fwrite() or fflush() directly, then fclose() when you're done. I'm saying that based on a benchmark done by a fellow in the comments of the PHP documentation for file_put_contents() at http://php.net/manual/en/function.file-put-contents.php#105421
Next, when dealing with such small file sizes, it sounds like there's a great opportunity to use a database instead of flat files (I'm sure you've heard that before). A database, whether MySQL or PostgreSQL, is highly optimized for simultaneous access to many records, and can internally balance CPU workload in ways that filesystem access never can (and binary data in records is possible too). Unless you need access to real files directly from your server hard drives, a database can simulate many files by allowing PHP to return individual records as file data over the web (i.e., by using the header() function). Again, I'm assuming this PHP is running as a web interface on a server.
Overall, what I am reading suggests that there may be an inefficiency somewhere else besides filesystem access. How is the file content generated? How does the operating system handle file access? Is there compression or encryption involved? Are these images or text data? Is the OS writing to one hard drive, a software RAID array, or some other layout? Those are some of the questions I can think of just glancing over your problem. Hopefully my answer helped. Cheers.
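To make the database idea concrete, here is a rough sketch. It swaps in an in-memory SQLite database through PDO purely so the example is self-contained (this assumes the pdo_sqlite extension); with MySQL or PostgreSQL only the DSN and credentials would differ. The table layout and the helper names store_files and fetch_file are hypothetical.

```php
<?php
// Assumption: pdo_sqlite is available. A real setup would use a MySQL or
// PostgreSQL DSN here instead of an in-memory SQLite database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE files (name TEXT PRIMARY KEY, data BLOB NOT NULL)');

// Hypothetical helper: store many small "files" in a single transaction.
function store_files(PDO $db, array $files)
{
    $stmt = $db->prepare('INSERT INTO files (name, data) VALUES (?, ?)');
    $db->beginTransaction();
    foreach ($files as $name => $data) {
        $stmt->execute(array($name, $data));
    }
    $db->commit();
}

// Hypothetical helper: fetch one "file", or null if it does not exist.
function fetch_file(PDO $db, $name)
{
    $stmt = $db->prepare('SELECT data FROM files WHERE name = ?');
    $stmt->execute(array($name));
    $data = $stmt->fetchColumn();
    return $data === false ? null : $data;
}

$files = array();
for ($i = 0; $i < 1000; $i++) {
    $files["file-{$i}.txt"] = str_repeat('x', 4096); // about 4 KB each
}
store_files($db, $files);
print strlen(fetch_file($db, 'file-42.txt')) . " bytes\n";
```

In a web context, the record returned by fetch_file could be sent to the client after a suitable header() call, which is how a record can stand in for a file.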
The main idea is to have fewer files.
Ex: 1,000 files can be packed into 100 files, each containing 10 of the originals, and split again with explode when reading. You get roughly 5x faster writes and 14x faster read+parse.
Tuning file_put_contents or fwrite alone will not get you more than a 1.x speedup. This solution can be useful for both reading and writing. Another solution may be MySQL or another database.
On my computer, creating 30k files containing a small string takes 96.38 seconds, while appending the same string 30k times to one file takes 0.075 seconds.
I can offer you an unusual solution: call the file_put_contents function fewer times. Below is a simple piece of code to show how it works.
$start = microtime(true);
$str = "Aaaaaaaaaaaaaaaaaaaaaaaaa";
if( !file_exists("test/") ) mkdir("test/");
foreach( range(1,1000) as $i ) {
    file_put_contents("test/".$i.".txt",$str);
}
$end = microtime(true);
echo "elapsed_file_put_contents_1: ".substr(($end - $start),0,5)." sec\n";
$start = microtime(true);
$out = '';
foreach( range(1,1000) as $i ) {
    $out .= $str;
}
file_put_contents("out.txt",$out);
$end = microtime(true);
echo "elapsed_file_put_contents_2: ".substr(($end - $start),0,5)." sec\n";
This is a full example, with elapsed times for 1000 files:
writing file_put_contents: elapsed: 194.4 sec
writing file_put_contents APPEND: elapsed: 37.83 sec ( 5x faster )
............
reading file_put_contents elapsed: 2.401 sec
reading append elapsed: 0.170 sec ( 14x faster )
$start = microtime(true);
$allow_argvs = array("gen_all","gen_few","read_all","read_few");
$arg = isset($argv[1]) ? $argv[1] : die("php ".$argv[0]." gen_all ( ".implode(", ",$allow_argvs).")");
if( !in_array($arg,$allow_argvs) ) {
die("php ".$argv[0]." gen_all ( ".implode(", ",$allow_argvs).")");
}
if( $arg=='gen_all' ) {
$dir_campain_all_files = "campain_all_files/";
if( !file_exists($dir_campain_all_files) ) die("\nFolder ".$dir_campain_all_files." does not exist!\n");
$exists_campaings = false;
foreach( range(1,10) as $i ) { if( file_exists($dir_campain_all_files.$i) ) { $exists_campaings = true; } }
if( $exists_campaings ) {
die("\nDelete manually all subfolders from ".$dir_campain_all_files." !\n");
}
build_campain_dirs($dir_campain_all_files);
// foreach in campaigns
foreach( range(1,10) as $i ) {
$campain_dir = $dir_campain_all_files.$i."/";
$nr_of_files = 1000;
foreach( range(1,$nr_of_files) as $f ) {
$file_name = $f.".txt";
$data_file = generateRandomString(4*1024);
$dir_file_name = $campain_dir.$file_name;
file_put_contents($dir_file_name,$data_file);
}
echo "campaign #".$i." done! ( ".$nr_of_files." files written ).\n";
}
}
if( $arg=='gen_few' ) {
$delim_file = "###FILE###";
$delim_contents = "###FILE###";
$dir_campain = "campain_few_files/";
if( !file_exists($dir_campain) ) die("\nFolder ".$dir_campain." does not exist!\n");
$exists_campaings = false;
foreach( range(1,10) as $i ) { if( file_exists($dir_campain.$i) ) { $exists_campaings = true; } }
if( $exists_campaings ) {
die("\nDelete manually all files from ".$dir_campain." !\n");
}
$amount = 100; // nr_of_files_to_append
$out = ''; // here will be appended
build_campain_dirs($dir_campain);
// foreach in campaigns
foreach( range(1,10) as $i ) {
$campain_dir = $dir_campain.$i."/";
$nr_of_files = 1000;
$cnt_few=1;
foreach( range(1,$nr_of_files) as $f ) {
$file_name = $f.".txt";
$data_file = generateRandomString(4*1024);
$my_file_and_data = $file_name.$delim_file.$data_file;
$out .= $my_file_and_data.$delim_contents;
// append in a new file
if( $f%$amount==0 ) {
$dir_file_name = $campain_dir.$cnt_few.".txt";
file_put_contents($dir_file_name,$out,FILE_APPEND);
$out = '';
$cnt_few++;
}
}
// append remaning files
if( !empty($out) ) {
$dir_file_name = $campain_dir.$cnt_few.".txt";
file_put_contents($dir_file_name,$out,FILE_APPEND);
$out = '';
}
echo "campaign #".$i." done! ( ".$nr_of_files." files written ).\n";
}
}
if( $arg=='read_all' ) {
$dir_campain = "campain_all_files/";
$exists_campaings = false;
foreach( range(1,10) as $i ) {
if( file_exists($dir_campain.$i) ) {
$exists_campaings = true;
}
}
foreach( range(1,10) as $i ) {
$campain_dir = $dir_campain.$i."/";
$files = getFiles($campain_dir);
foreach( $files as $file ) {
$data = file_get_contents($file);
$substr = substr($data, 100, 5); // read 5 chars after char100
}
echo "campaign #".$i." done! ( ".count($files)." files read ).\n";
}
}
if( $arg=='read_few' ) {
$dir_campain = "campain_few_files/";
$exists_campaings = false;
foreach( range(1,10) as $i ) {
if( file_exists($dir_campain.$i) ) {
$exists_campaings = true;
}
}
foreach( range(1,10) as $i ) {
$campain_dir = $dir_campain.$i."/";
$files = getFiles($campain_dir);
foreach( $files as $file ) {
$data_temp = file_get_contents($file);
$explode = explode("###FILE###",$data_temp);
//#mkdir("test/".$i);
// the name and content delimiters are identical, so the pieces alternate: name, data, name, data, ...
for( $p=0; $p+1<count($explode); $p+=2 ) {
$file_name = $explode[$p];
$file_data = $explode[$p+1];
$substr = substr($file_data, 100, 5); // read 5 chars after char100
//file_put_contents("test/".$i."/".$file_name,$file_data); // test if files are recreated correctly
}
//echo $file." has ".strlen($data_temp)." chars!\n";
}
echo "campaign #".$i." done! ( ".count($files)." files read ).\n";
}
}
$end = microtime(true);
echo "elapsed: ".substr(($end - $start),0,5)." sec\n";
echo "\n\nALL DONE!\n\n";
/*************** FUNCTIONS ******************/
function generateRandomString($length = 10) {
$characters = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
$charactersLength = strlen($characters);
$randomString = '';
for ($i = 0; $i < $length; $i++) {
$randomString .= $characters[rand(0, $charactersLength - 1)];
}
return $randomString;
}
function build_campain_dirs($dir_campain) {
foreach( range(1,10) as $i ) {
$dir = $dir_campain.$i;
if( !file_exists($dir) ) {
mkdir($dir);
}
}
}
function getFiles($dir) {
$arr = array();
if ($handle = opendir($dir)) {
while (false !== ($file = readdir($handle))) {
if ($file != "." && $file != "..") {
$arr[] = $dir.$file;
}
}
closedir($handle);
}
return $arr;
}
I want to impose a time limit on reading, with fgets, from a process opened by popen in PHP.
I have the following code:
$handle = popen("tail -F -n 30 /tmp/pushlog.txt 2>&1", "r");
while(!feof($handle)) {
$buffer = fgets($handle);
echo "data: ".$buffer."\n";
#ob_flush();
flush();
}
pclose($handle);
I tried without success:
set_time_limit(60);
ignore_user_abort(false);
The process is as follows:
1. The browser sends a GET request and waits for an answer in HTML5 server-sent events format.
2. The request is received by the AWS Load Balancer and forwarded to the EC2 instances.
3. The answer is the last 30 lines of the file.
4. The browser receives it as 30 messages and the connection persists.
5. If the tail command sends a new line, it is returned; otherwise fgets waits indefinitely until tail returns a new line.
6. After 60 seconds of network inactivity (no new lines in 60 seconds), the AWS Load Balancer closes the connection to the browser. The connection to the EC2 instance is not closed.
7. The browser detects that the connection is closed and opens a new one; the process goes back to step 1.
As these steps describe, the connection between the AWS Load Balancer and the EC2 instance is never closed; after a few hours or days there are hundreds and hundreds of tail and httpd processes running and the server stops answering.
Of course, it appears to be an AWS Load Balancer bug, but I don't want to start a process to get Amazon's attention and wait for a fix.
My temporary workaround is to sudo kill the tail processes before the server becomes unstable.
I think PHP doesn't stop the script because PHP is "blocked" waiting for fgets to finish.
I know that the AWS Load Balancer time limit is editable, but I want to keep the default value; even a higher limit is not going to fix the problem.
I don't know whether I should change the question to "How to execute a process in Linux with a time limit / timeout?".
PHP 5.5.22 / Apache 2.4 / Linux Kernel 3.14.35-28.38.amzn1.x86_64
Tested with PHP 5.5.20:
//Change configuration.
set_time_limit(0);
ignore_user_abort(true);
//Open pipe & set non-blocking mode.
$descriptors = array(0 => array('file', '/dev/null', 'r'),
1 => array('pipe', 'w'),
2 => array('file', '/dev/null', 'w'));
$process = proc_open('exec tail -F -n 30 /tmp/pushlog.txt 2>&1',
$descriptors, $pipes, NULL, NULL) or exit;
$stream = $pipes[1];
stream_set_blocking($stream, 0);
//Call stream_select with a 10 second timeout.
$read = array($stream); $write = NULL; $except = NULL;
while (!feof($stream) && !connection_aborted()
&& stream_select($read, $write, $except, 10)) {
//Print out all the lines we can.
while (($buffer = fgets($stream)) !== FALSE) {
echo 'data: ' . $buffer . "\n";
#ob_flush();
flush();
}
}
//Clean up.
fclose($stream);
$status = proc_get_status($process);
if ($status !== FALSE && $status['running'] === TRUE)
proc_terminate($process);
proc_close($process);
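For the "process with a time limit" variant of the question, another option may be to let the operating system enforce the limit. The sketch below assumes GNU coreutils timeout is installed (it is on most Linux distributions); the helper name read_with_time_limit is hypothetical. It bounds the lifetime of the child process, but it does not by itself detect a disconnected client.

```php
<?php
// Hypothetical helper around popen(): run $command for at most $seconds
// seconds and pass every line it prints to the $onLine callback.
function read_with_time_limit($command, $seconds, $onLine)
{
    // timeout signals the command once $seconds have elapsed; the pipe then
    // closes and fgets() returns false, which ends the loop.
    $handle = popen('timeout ' . (int) $seconds . ' ' . $command, 'r');
    if ($handle === false) {
        return false;
    }
    while (($line = fgets($handle)) !== false) {
        call_user_func($onLine, $line);
    }
    return pclose($handle); // exit status as reported by pclose()
}

// Usage for the tail case (left commented out because it runs for 60 seconds):
// read_with_time_limit('tail -F -n 30 /tmp/pushlog.txt 2>&1', 60, function ($line) {
//     echo 'data: ' . $line . "\n";
//     flush();
// });
```

A limit slightly below the load balancer's idle timeout would make each tail end at roughly the time the browser reconnects.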
Rather than using a process file pointer, I went with my "multitasking" approach. I use this code to spawn other "processes"; it's kind of a multitasking cheat.
I call a script, hang.php, that just hangs for 90 seconds: sleep(90).
You may want to adjust the stream and stream_select timeouts.
Create stream(s)
header('Content-Type: text/plain; charset=utf-8');
$timeout = 20;
$result = array();
$sockets = array();
$buffer_size = 8192;
$id = 0;
$stream = stream_socket_client("ispeedlink.com:80", $errno,$errstr, $timeout,
STREAM_CLIENT_ASYNC_CONNECT|STREAM_CLIENT_CONNECT);
if ($stream) {
$sockets[$id++] = $stream; // supports multiple sockets
$http = "GET /testbed/hang.php HTTP/1.0\r\nHost: ispeedlink.com\r\n\r\n";
fwrite($stream, $http);
}
else {
echo "$id Failed\n";
}
Additional scripts can be run by adding the stream: $sockets[$id++] = $stream;
The code below puts anything read into the $result[$id] array.
Monitor the streams:
// stream_select() takes its arrays by reference, so they must be real variables.
$write = NULL;
$except = NULL;
while (count($sockets)) {
$read = $sockets;
stream_select($read, $write, $except, $timeout);
if (count($read)) {
foreach ($read as $r) {
$id = array_search($r, $sockets);
$data = fread($r, $buffer_size);
if (strlen($data) == 0) { // either reads data or EOF
echo "$id Closed: " . date('h:i:s') . "\n\n\n";
fclose($r);
unset($sockets[$id]);
}
else {
$result[$id] = (isset($result[$id]) ? $result[$id] : '') . $data; // avoid an undefined index on the first chunk
}
}
}
else {
echo 'Timeout: ' . date('h:i:s') . "\n\n\n";
break;
}
}
echo system('ps auxww');
When I want to kill a process I use system('ps auxww') to get the pid and kill it with system("kill $pid")
kill.php
header('Content-Type: text/plain; charset=utf-8');
//system('kill 220613');
echo system('ps auxww');
My code:
<?
$url = 'http://w1.weather.gov/xml/current_obs/KGJT.xml';
$xml = simplexml_load_file($url);
?>
<?
echo $xml->weather, " ";
echo $xml->temperature_string;
?>
This works great, but I read that caching external data is a must for page speed. How can I cache this for, let's say, 5 hours?
I looked into ob_start(), is this what I should use?
The ob system is for in-script caching. It's not useful for persistent caching across multiple invocations.
To do this properly, you'd write the resulting XML out to a file. Every time the script runs, you'd check the last-modified time of that file. If it's more than 5 hours old, you fetch and save a fresh copy.
e.g.
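$file = 'weather.xml';
// Refresh when the cache file is missing or older than 5 hours.
if (!file_exists($file) || filemtime($file) < (time() - 5*60*60)) {
    $xml = file_get_contents('http://w1.weather.gov/xml/current_obs/KGJT.xml');
    if ($xml !== false) { // keep the old copy if the download fails
        file_put_contents($file, $xml);
    }
}
$xml = simplexml_load_file($file);
echo $xml->weather, " ";
echo $xml->temperature_string;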
ob_start would not be a great solution. That only applies when you need to modify or flush the output buffer. Your XML returned data is not being sent to the buffer, so no need for those calls.
Here's one solution, which I've used in the past. Does not require MySQL or any database, as data is stored in a flat file.
$last_cache = @filemtime( 'weather_cache.txt' ); // Get last modified date stamp of file (false if it is missing)
if ($last_cache === false){ // If date stamp unattainable, treat the cache as very old
$since_last_cache = time() * 9;
} else $since_last_cache = time() - $last_cache; // Measure seconds since cache last set
if ( $since_last_cache >= ( 3600 * 5) ){ // If it's been 5 hours or more since we last cached...
$url = 'http://w1.weather.gov/xml/current_obs/KGJT.xml'; // Pull in the weather
$xml = simplexml_load_file($url);
$weather = $xml->weather . " " . $xml->temperature_string;
$fp = fopen( 'weather_cache.txt', 'a+' ); // Write weather data to cache file
if ($fp){
if (flock($fp, LOCK_EX)) {
ftruncate($fp, 0);
fwrite($fp, "\r\n" . $weather );
flock($fp, LOCK_UN);
}
fclose($fp);
}
}
readfile('weather_cache.txt'); // Output the weather data cache (readfile never executes the file contents as PHP)
function checkServer($domain, $port=80)
{
global $checkTimeout, $testServer;
$status = 0;
$starttime = microtime(true);
$file = @fsockopen ($domain, $port, $errno, $errstr, $checkTimeout);
$stoptime = microtime(true);
if($file)
{
fclose($file);
$status = ($stoptime - $starttime) * 1000;
$status = floor($status);
}
else
{
$testfile = @fsockopen ($testServer, 80, $errno, $errstr, $checkTimeout);
if($testfile)
{
fclose($testfile);
$status = -1;
}
else
{
$status = -2;
}
}
return $status;
}
The test server is google.sk, and checkTimeout is 10 seconds. This actually works, but when I run it in a loop about 50 times while doing other stuff (MySQL queries and things like that), it's not slow, but it causes 100% CPU load until the script ends. It's a single Apache process that drives my CPU crazy... So I wanted to ask if you have any ideas about it. A tip on how to do the same in Python or Bash or the like would also be appreciated.
Thank you for the responses :)
Use cURL.
Here is an example of how to convert fsockopen to cURL:
PHP fsockopen to curl conversion
Good luck
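A minimal sketch of what such a conversion could look like, assuming the curl extension is loaded. The function name check_server_curl is hypothetical; it only measures the connect time, like the first half of checkServer, and does not reproduce the -1/-2 fallback against the test server. Whether it lowers the CPU load in your setup would need to be measured.

```php
<?php
// Hypothetical helper: return the TCP connect time in milliseconds,
// or -1 if the connection could not be established within the timeout.
function check_server_curl($domain, $port = 80, $timeoutSeconds = 10)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://{$domain}:{$port}/");
    curl_setopt($ch, CURLOPT_CONNECT_ONLY, true);       // connect, but send no request
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // never echo anything
    curl_exec($ch);
    if (curl_errno($ch) !== 0) {
        return -1; // refused, timed out, DNS failure, ...
    }
    // CURLINFO_CONNECT_TIME is reported in seconds as a float.
    return (int) floor(curl_getinfo($ch, CURLINFO_CONNECT_TIME) * 1000);
}

// Example call; the host is taken from the question.
// print check_server_curl('google.sk', 80, 10) . " ms\n";
```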