Efficient way to write 2^20 files to ext4 filesystem - php

I'm trying to write 2^20 files with about 45k lines in each; each line is a word.
I need this to be flat files, no SQL, for storage purposes, and I've optimized it so it doesn't take too much disk space.
As of now the files are written into 16 directories, 65,536 files per directory. The file names are 4 characters long.
That didn't seem like too much to me. My script takes a huge file, reads each line, and then writes each line to its dedicated file.
I first tried this with 2^16 files, so 4,096 files in each directory; that worked like a charm, but I wanted to make the lookup faster. With 2^20 files it seems to hit a wall, yet from what I've seen on the internet it's nowhere near ext4's maximum file count.
After roughly 45 GB it became very, very slow (on the order of 100 lines written to each file in more than an hour).
My script takes advantage of the server's 64 GB of RAM by buffering writes. I have a 1.7 TB SSD with more than 700 MB/s write speed, but it still looks like it can't manage that many files. Even a du -h /dir/ takes forever.
Is there any way to do this task faster, or should I just go with 65,536 files?
Thank you for your help :)
Have a great day.
EDIT: Thank you for the answers, and sorry for the mistake of not posting any code at first. Here it is:
$time_pre = microtime(true);
$bufferSize = (1 << 9); // flush a bucket to disk once it exceeds 512 bytes
$dico = fopen('/path/to/file1', "r+");
$hexa = array("0","1","2","3","4","5","6","7","8","9","a","b","c","d","e","f");

// Initialize one in-memory buffer per 5-hex-character prefix (16^5 buckets),
// using variable variables ($$ijklm) as the buffers.
for ($i = 0; $i < 16; $i++) {
    for ($j = 0; $j < 16; $j++) {
        for ($k = 0; $k < 16; $k++) {
            for ($l = 0; $l < 16; $l++) {
                for ($m = 0; $m < 16; $m++) {
                    $ijklm = "$hexa[$i]$hexa[$j]$hexa[$k]$hexa[$l]$hexa[$m]";
                    $$ijklm = "";
                }
            }
        }
    }
}

while (($ligne = fgets($dico)) !== false) {
    $ligne = rtrim($ligne);
    $md4 = hash("md5", $ligne); // md5 hex digest of the word
    $add4 = "$md4[0]$md4[1]$md4[2]$md4[3]$md4[4]";
    $$add4 .= "$md4[5]$md4[6]$ligne\n";
    // String offsets must use []; the {} syntax is a fatal error in PHP 8.
    if (isset(${$add4}[$bufferSize])) {
        $baseFichier = '/path/to/'.$md4[0].'/'.$md4[1].'/'.$md4[2].'/'.$md4[3].'/'.$md4[4];
        $fichier = fopen($baseFichier, "a");
        fwrite($fichier, $$add4);
        fclose($fichier);
        $$add4 = "";
    }
}

// Flush whatever is left in the buffers once the input is exhausted.
for ($i = 0; $i < 16; $i++) {
    for ($j = 0; $j < 16; $j++) {
        for ($k = 0; $k < 16; $k++) {
            for ($l = 0; $l < 16; $l++) {
                for ($m = 0; $m < 16; $m++) {
                    $ijklm = "$hexa[$i]$hexa[$j]$hexa[$k]$hexa[$l]$hexa[$m]";
                    if ($$ijklm != "") {
                        $baseFichier = '/path/to/'.$hexa[$i].'/'.$hexa[$j].'/'.$hexa[$k].'/'.$hexa[$l].'/'.$hexa[$m];
                        $fichier = fopen($baseFichier, "a");
                        fwrite($fichier, $$ijklm); // was $$ijkl, an undefined variable
                        fclose($fichier);
                        $$ijklm = "";
                    }
                }
            }
        }
    }
}
fclose($dico);
I used fflush() and per-file locks at first, because I had 6 workers running at the same time and wanted to avoid write conflicts, but for testing's sake I removed them. On this run I tried creating 16^4 directories instead of one directory with 65,536 files in it.
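Roughly, the locking looked like this (reconstructed from memory, so take it as a sketch; the helper name is made up):
// Hypothetical helper: append a bucket to its file safely when several
// workers may hit the same file at once.
function appendWithLock(string $path, string $data): void
{
    $fh = fopen($path, "a");
    if ($fh === false) {
        return; // real code should report the error
    }
    if (flock($fh, LOCK_EX)) { // exclusive advisory lock
        fwrite($fh, $data);
        fflush($fh);           // push PHP's buffer down to the OS
        flock($fh, LOCK_UN);
    }
    fclose($fh);
}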
I also tried a smaller buffer, to check if this was better on SSD.
It's running right now, but it looks slower than the previous test. The input file is 75 GB, and I have 6 like it to process.
What would you do to speed things up ?
I'm using PHP 8. I tried C but hit the same issue, so I went with a higher-level language for convenience.
Thank you :)
PS: Sorry if the code isn't professional; this is a hobby for me.

Related

Why doesn't my PHP code work anymore for no reason?

I have a for loop in my code. I haven't changed anything in this part of the code for about 5-6 days and never had problems with it.
Since yesterday, every time I reload my code it gives me this error:
Maximum execution time of 30 seconds exceeded - in LogController.php line 270
I can't explain why, but maybe one of you can look it over.
This is my code around line 270.
$topten_sites = [];
for ($i = 0; $i <= $sites_array; $i++) {
    if ($i < 10) { // this is 270
        $topten_sites[] = $sites_array[$i];
    }
}
$topten_sites = collect($topten_sites)->sortByDesc('number')->all();
As I said, it worked perfectly, so why does it give me an error? If I comment out these lines and every other line that touches the $topten_sites array, the code works again.
This looks wrong:
for ($i = 0; $i <= $sites_array; $i++) {
    if ($i < 10) { // this is 270
        $topten_sites[] = $sites_array[$i];
    }
}
If $sites_array is an array, it makes no sense to compare an integer to it: in PHP an array always compares as greater than an integer, so the condition is always true and you have a never-ending loop.
If you just need the first 10 elements in another array, you can replace your loop with:
$topten_sites = array_slice($sites_array, 0, 10);
Why would you iterate over the entire array if you only want the first 10 results?
for ($i = 0; $i < 10; $i++) {
    $topten_sites[] = $sites_array[$i];
}
To answer the actual question: code never stops working "for no reason". Code works or it doesn't, both for a reason. If it stops working, something changed compared to your previous tests.
"Sometimes it works, sometimes it doesn't" falls under the same logic. Code behaves exactly the same way every time; some parameter has changed, and you have to find which one.
In your case, I'm guessing the number of entries in your array has increased. PHP and arrays aren't best friends when it comes to speed; arrays are slow. It could very well be that your array was smaller when you tested it (the code probably wasn't the fastest to begin with), and with the current amount it has just hit the 30-second threshold.
It could also be that a part of the code before this bit takes a lot of time (say suddenly 28 seconds instead of 20), and your loop (which never changed) does its job in the regular 3 seconds it always takes, but now runs into the limit.
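If you want to see which part crossed the threshold, a quick timing probe around the suspect sections will tell you (a sketch; the section markers are placeholders):
// Hypothetical timing probe to locate the slow section.
$t0 = microtime(true);
// ... code that runs before the loop ...
$t1 = microtime(true);
// ... the $topten_sites loop ...
$t2 = microtime(true);
error_log(sprintf('before: %.2fs, loop: %.2fs', $t1 - $t0, $t2 - $t1));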
Use it like this:
$topten_sites = [];
for ($i = 0; $i < 10; $i++) {
    $topten_sites[] = $sites_array[$i];
}
$topten_sites = collect($topten_sites)->sortByDesc('number')->all();

php memory limit test

It seems this is a perennially unsolved question: I ran a simple test of the memory limits on my local machine (from the command line):
<?php
for ($i = 0; $i < 4000*4000; $i++) {
    $R[$i] = 1.00001;
}
?>
and I have the memory limit set at 128M, but PHP still dies with an "Allowed memory size exhausted" message. Why?
Well, I wouldn't call it an unsolved question; there are a few reasons for this. PHP is very inefficient in terms of memory management; that's no secret.
The code you provided could be optimized a little bit, but not enough to make a difference. For example, take the multiplication out of the for loop and store the value in a variable; otherwise you perform that mathematical operation on every iteration. But that makes no significant difference: 2,310,451,248 bytes as it is, and 2,310,451,144 bytes if you do it as I propose.
The point remains that PHP is not a low-level language, so you can't expect the efficiency of C, for example. In your particular case, the memory required for all this with a plain array is a little over 2 GB (2.15 GB):
<?php
ini_set('memory_limit', '4096M');
$ii = 4000*4000;
//$R = new SplFixedArray($ii);
$R = array();
for ($i = 0; $i < $ii; $i++) {
    $R[$i] = 1.00001;
}
echo humanize(memory_get_usage())."\n";

function humanize($size)
{
    $unit = array('b','kb','mb','gb','tb','pb');
    return round($size/pow(1024, ($i = floor(log($size, 1024)))), 2).' '.$unit[$i];
}
?>
But using SplFixedArray things change a lot:
<?php
ini_set('memory_limit', '4096M');
$ii = 4000*4000;
$R = new SplFixedArray($ii);
for ($i = 0; $i < $ii; $i++) {
    $R[$i] = 1.00001;
}
echo humanize(memory_get_usage())."\n";

function humanize($size)
{
    $unit = array('b','kb','mb','gb','tb','pb');
    return round($size/pow(1024, ($i = floor(log($size, 1024)))), 2).' '.$unit[$i];
}
?>
Which requires "only" 854.72 MB. SplFixedArray is so much smaller because it is a fixed-size, integer-indexed structure, so it skips the hashtable bookkeeping a regular PHP array keeps for every element.
This is one of the main reasons why companies that deal with larger amounts of data generally avoid PHP and go for languages such as Python instead. There is a great article describing the problems and causes around this topic, found here. Hope that helps.

Which is the fastest way to get 10 random lines from a file with 10,000 lines?

I have a file with ~10,000 lines in it. Every time a user accesses my website, I want it to automatically pick 10 random lines from among them.
Code I currently used:
$filelog = 'items.txt';
$random_lines = (file_exists($filelog))? file($filelog) : array();
$random_count = count($random_lines);
$random_file_html = '';
if ($random_count > 10)
{
    $random_file_html = '<div><ul>';
    for ($i = 0; $i < 10; $i++)
    {
        $random_number = rand(0, $random_count - 1); // duplicates are accepted
        $random_file_html .= '<li>'.$random_lines[$random_number]."</li>\r\n";
    }
    $random_file_html .= '</ul>
</div>';
}
When I had fewer than 1,000 lines everything was OK, but now, with 10,000 lines, it slows my website down significantly.
So I'm thinking of other methods, like:
Divide the file into 50 files, pick one of them at random, then select 10 random lines inside the selected file.
-- or --
I know the total number of lines (items). Generate 10 random numbers, then read the file using
$file = new SplFileObject('items.txt');
$file->seek($random_number);
echo $file->current();
(My server does not support any type of SQL)
Maybe you have other methods that best suit for me. What is best method for my problem? Thank you very much!
The fastest way would apparently be not to pick 10 random lines out of a file with ~10,000 lines on every user request.
It's impossible to say more, as we don't know the details of this "XY problem".
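That said, one sketch of what "not on every request" could look like: cache a pre-rendered batch and refresh it occasionally (the cache path and 60-second TTL are assumptions):
// Hypothetical cache: rebuild the 10 random lines at most once per minute.
$cacheFile = '/tmp/random10.html'; // assumed location
$ttl = 60;                         // assumed refresh interval in seconds
if (!file_exists($cacheFile) || time() - filemtime($cacheFile) > $ttl) {
    $lines = file('items.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $keys = array_rand($lines, 10); // 10 distinct indexes in one call
    $html = '<div><ul>';
    foreach ($keys as $k) {
        $html .= '<li>'.htmlspecialchars($lines[$k]).'</li>';
    }
    $html .= '</ul></div>';
    file_put_contents($cacheFile, $html, LOCK_EX);
}
echo file_get_contents($cacheFile);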
If it is possible to adjust the contents of the file, then simply pad each of the lines so they have a common length. Then you can access the lines in the file using random access:
$lineLength = 50; // the assumed (padded) length of each line, newline included
$total = filesize($filename);
$numLines = $total/$lineLength;
// pick ten random line numbers and seek straight to them
$fp = fopen($filename, "r");
for ($x = 0; $x < 10; $x++) {
    fseek($fp, (rand(1, $numLines) - 1)*$lineLength, SEEK_SET);
    echo fgets($fp, $lineLength + 1); // fgets reads at most length-1 bytes
}
fclose($fp);
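The file would need a one-off padding pass first. A sketch, assuming 49 content characters plus a newline and that no line is longer than that:
// Hypothetical one-time conversion to fixed-length records.
$lineLength = 50; // 49 chars + "\n", assumed
$in  = fopen('items.txt', 'r');
$out = fopen('items_padded.txt', 'w');
while (($line = fgets($in)) !== false) {
    $line = rtrim($line, "\r\n");
    // space-pad (and truncate) so every record is exactly $lineLength bytes
    fwrite($out, str_pad(substr($line, 0, $lineLength - 1), $lineLength - 1)."\n");
}
fclose($in);
fclose($out);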
try:
$lines = file('YOUR_TXT_FILE.txt');
$rand = array_rand($lines);
echo $lines[$rand];
for 10 of them just put it in a loop:
$lines = file('YOUR_TXT_FILE.txt');
for ($i = 0; $i < 10; $i++) {
    $rand = array_rand($lines);
    echo $lines[$rand];
}
NOTE: the above code does not guarantee that the same line won't be picked twice. To guarantee uniqueness, add an extra while loop and an array that holds every index generated so far; when a new index already exists in that array, generate another one until it doesn't, as sketched below.
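A minimal sketch of that uniqueness check (variable names are just illustrative):
// Keep drawing indexes until 10 distinct ones have been used.
$lines = file('YOUR_TXT_FILE.txt');
$seen = array();
while (count($seen) < 10) {
    $rand = array_rand($lines);
    if (!in_array($rand, $seen)) {
        $seen[] = $rand;
        echo $lines[$rand];
    }
}
(array_rand($lines, 10) would also return 10 distinct keys in a single call.)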
The above solution might not be the fastest, but it may fulfill your needs. Since your server does not support any type of SQL, maybe switch to a different server? I'm also wondering how you store user data; is that in files as well?

Prevent my script from using so much memory?

I have a script which lists all possible permutations of a string in an array and which, admittedly, might be used instead of a wordlist. If I get this to work, it'll be impossible not to get a hit eventually, unless there is a limit on attempts.
Anyway, the script obviously takes a HUGE amount of memory, something which will set any server on fire. What I need help with is finding a way to spread out the memory usage, something like resetting the script and continuing where it left off, perhaps by moving on to another file, possibly using sessions. I have no clue.
Here's what I've got so far:
<?php
ini_set('memory_limit', '-1');
ini_set('max_execution_time', '0');

$possible = "abcdefghi";
$input = "$possible";

// Recursively build every permutation of $characters, appending to &$permutations.
function string_getpermutations($prefix, $characters, &$permutations)
{
    if (count($characters) == 1)
        $permutations[] = $prefix . array_pop($characters);
    else
    {
        for ($i = 0; $i < count($characters); $i++)
        {
            $tmp = $characters;
            unset($tmp[$i]);
            string_getpermutations($prefix . $characters[$i], array_values($tmp), $permutations);
        }
    }
}

$characters = array();
for ($i = 0; $i < strlen($input); $i++)
    $characters[] = $input[$i];

$permutations = array();
print_r($characters);
string_getpermutations("", $characters, $permutations);
print_r($permutations);
?>
Any ideas? :3
You could store the permutations in files every XXX permutations, then reopen the files when needed, in the correct order, to display/use your permutations. (Files or whatever you want, as long as you can free PHP's memory.)
I see that you're just echoing the permutations, but maybe you want to do something else with them? So it depends somewhat.
Also, try to unset as many unused variables as soon as possible while doing your permutations.
Edit: Sometimes, using references as you did for your permutations array can result in higher memory use. In case you haven't tried it, check which is better, with or without the reference.
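A minimal sketch of the batch-to-file idea (the batch size and output path are assumptions):
// Hypothetical: flush permutations to disk every $batchSize entries
// so the in-memory array stays small.
function permutations_to_file($prefix, $characters, $fh, &$batch, $batchSize = 10000)
{
    if (count($characters) == 1) {
        $batch[] = $prefix . array_pop($characters);
        if (count($batch) >= $batchSize) {
            fwrite($fh, implode("\n", $batch)."\n");
            $batch = array(); // release the batch's memory
        }
    } else {
        for ($i = 0; $i < count($characters); $i++) {
            $tmp = $characters;
            unset($tmp[$i]);
            permutations_to_file($prefix . $characters[$i], array_values($tmp), $fh, $batch, $batchSize);
        }
    }
}

$fh = fopen('/tmp/permutations.txt', 'w'); // assumed output file
$batch = array();
permutations_to_file("", str_split("abcdefghi"), $fh, $batch);
if ($batch) { // flush the final partial batch
    fwrite($fh, implode("\n", $batch)."\n");
}
fclose($fh);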

php nested for statements?

I'm trying to process a for loop within a for loop, and I'm just a little wary of the syntax... Will this work? Essentially, I want to run code for every 1,000 records while the count is less than or equal to $count. Will the syntax below work, or is there a better way?
for ($x = 0; $x <= 700000; $x++) {
    for ($i = 0; $i <= 1000; $i++) {
        //run the code
    }
}
The syntax you have will work, but I don't think it will do exactly what you want. Right now it runs the outer loop 700,001 times, and for every single one of those iterations it runs the inner loop in full.
That means that, in total, the inner loop body runs 700,001 x 1,001, roughly 700.7 million, times.
If this isn't what you want, can you give a bit more information? I can't really work out what "I want to run code for every 1,000 records while the count is equal to or less than the $count" means; I don't see any variable named $count at all.
Well, essentially, I'm reading in a text file and inserting each of its lines into a DB. I originally tried while(!feof($f)) [where $f = filename], but it kept complaining of a broken pipe, so I thought this would be another way to go.
$f should be the file handle returned by fopen(), not a filename:
$file_handle = fopen($filename, 'r');
while (!feof($file_handle)) {
    $line = fgets($file_handle);
    $line = trim($line); // remove whitespace at beginning and end
    if (!$line) continue; // we don't need empty lines
    // note: mysql_* is the legacy extension (removed in PHP 7)
    mysql_query('INSERT INTO table (column) '
        .'VALUES ("'.mysql_real_escape_string($line).'")');
}
Read through the documentation at php.net for fopen() and fgets(). You might also need explode() if you have to split your string.
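For instance, if a line held several columns separated by a delimiter (a made-up format here), explode() would split it:
// Hypothetical: a line like "john;42;london" split into columns.
$line = "john;42;london";
list($name, $age, $city) = explode(";", $line);
echo $name; // "john"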
If your file isn't big, you might want to read it into an array at once like this:
$filelines = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($filelines as $line) {
    do_stuff_with($line);
}
To read a text file line by line I usually do:
$file = file("path to file");
foreach ($file as $line) {
    // insert $line into the db
}
Strictly answering the question, you'd want something more like this:
// $x would be 0, then 1000, then 2000, then 3000
for ($x = 0; $x < 700000; $x += 1000) {
    // $i would be $x through $x + 999
    for ($i = $x; $i < $x + 1000; $i++) {
        // run the code
    }
}
However, you should really consider one of the other methods for importing files to a database.
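For example, MySQL's LOAD DATA INFILE can bulk-load a text file far faster than row-by-row INSERTs. A sketch using mysqli (credentials, table, and column names are assumptions, and local_infile must be enabled on the server):
// Hypothetical bulk import: one statement instead of one INSERT per line.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->query(
    "LOAD DATA LOCAL INFILE '/path/to/file.txt'
     INTO TABLE mytable
     LINES TERMINATED BY '\\n'
     (mycolumn)"
);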
