There is an array of numbers, divided into partitions containing the same number of elements (as the output of array_chunk()). They are written into separate files: 1.txt contains the first chunk, 2.txt the second, and so on. Now I want these files to contain a different number of elements of the initial array. Of course, I could read them all back into one array and split it again, but that requires quite a large amount of memory. Could you please help me with a more memory-efficient solution? (The number of files and the size of the last one are stored separately.) I have no other ideas...
Do you know what the new chunk size is? If you do, then you can easily read the data in and, whenever you fill a chunk, write it out. In pseudo-code:
for each original file:
    for each record:
        add record to buffer
        if buffer is desired size:
            write new file
            clear buffer
write new file
Obviously you'll need to keep the new files separate from the old ones. And then, once you've rewritten the data, you can swap them out somehow. (I would personally suggest having two directories, then renaming the directories once you're done.)
If you don't know what the size of your chunks should be (for instance, you want a specific number of files), then first do whatever work is needed to figure that out, and then proceed with the original solution.
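If it helps, here is a minimal PHP sketch of that loop. It assumes one number per line, existing chunks in old/1.txt through old/10.txt, a known file count, and an already existing new/ directory; all of those names and sizes are assumptions, not part of the question.

<?php
// Minimal sketch: re-split the existing chunk files into a new chunk size.
// File names, the file count and the chunk size below are assumptions.
$fileCount = 10;     // known number of existing chunk files
$newSize   = 500;    // desired number of elements per new file

$buffer = [];
$out    = 1;
for ($i = 1; $i <= $fileCount; $i++) {
    $in = fopen("old/{$i}.txt", 'r');
    while (($line = fgets($in)) !== false) {
        $buffer[] = trim($line);
        if (count($buffer) === $newSize) {           // buffer is full: write a new chunk
            file_put_contents("new/{$out}.txt", implode("\n", $buffer) . "\n");
            $out++;
            $buffer = [];
        }
    }
    fclose($in);
}
if ($buffer !== []) {                                // write the final, smaller chunk
    file_put_contents("new/{$out}.txt", implode("\n", $buffer) . "\n");
}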
I know similar questions have been asked, but they don't address the finer details and aren't about PHP...
I have a .txt file with a comma-separated list of numbers. It's simple enough to sort the file, split it and save it into multiple smaller files. What would be a memory-efficient way to put small chunks of these files back together and sort them?
All my current attempts to do this have resulted in huge memory utilisation.
A more specific way to put it could be:
1. how do I stream small pieces of data from a file
2. track which pieces of data I've already processed
3. create a buffer, and when the buffer is full, write it out
4. start back at 1 and repeat until all the data has been sorted (a sketch of this merge loop follows below)
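For illustration, a minimal sketch of that merge loop in PHP, assuming the smaller files are already sorted, hold one number per line, and match the hypothetical name chunk_*.txt; none of those details are given in the question.

<?php
// Merge already-sorted chunk files (hypothetical names chunk_*.txt) into sorted.txt.
$handles = array_map(fn ($f) => fopen($f, 'r'), glob('chunk_*.txt'));
$current = array_map('fgets', $handles);             // first line of every chunk

$out    = fopen('sorted.txt', 'w');
$buffer = '';
while (true) {
    // Pick the chunk whose current value is the smallest.
    $minKey = null;
    foreach ($current as $key => $line) {
        if ($line === false) {
            continue;                                 // this chunk is exhausted
        }
        if ($minKey === null || (float) $line < (float) $current[$minKey]) {
            $minKey = $key;
        }
    }
    if ($minKey === null) {
        break;                                        // every chunk is exhausted
    }
    $buffer .= trim($current[$minKey]) . "\n";
    $current[$minKey] = fgets($handles[$minKey]);     // advance that chunk by one line
    if (strlen($buffer) > 65536) {                    // flush the output buffer periodically
        fwrite($out, $buffer);
        $buffer = '';
    }
}
fwrite($out, $buffer);
fclose($out);
array_map('fclose', $handles);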
I have a text file with 100,000 lines. Does anyone know the fastest way to get one line at random? Thank you very much!
The fastest way would be to build an index (a simple array that contains the position in the file of each new line). Then choose a random key, get the position, fseek() the file to that position, and read the line. This will require updating the index file any time you change the file, but if you want to optimize data retrieval, that's the way.
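A minimal sketch of that approach, assuming the data lives in a hypothetical lines.txt; for repeated lookups you would persist $offsets to an index file instead of rebuilding it on every run.

<?php
// lines.txt is a hypothetical file name.
$handle = fopen('lines.txt', 'r');

// Build the index: the byte offset of the start of every line.
$offsets = [0];
while (fgets($handle) !== false) {
    $offsets[] = ftell($handle);
}
array_pop($offsets);                  // the last entry points past the final line

// Pick a random line and jump straight to it.
fseek($handle, $offsets[array_rand($offsets)]);
echo fgets($handle);
fclose($handle);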
You can optimize further by splitting the file into ranges (e.g. sharding the data), or by keeping several representations of the file (for example, you can have a file with the lines inverted so the last comes first, and if your random number is bigger than half of the element count, you read from the second file).
In a text file, lines are delimited by \n at the end of each line. To find a given line it is therefore necessary to read the entire file, and this is a big problem for large files (say 2 GB). I am looking for a method to read a single line without walking through the entire file (though I know it would be a complicated process).
The first way I know of is to use fseek() with an offset, but that is not practical, since I don't know the byte offset of the line I want.
The second is creating a flat key/value file, but I am not sure there is a way to avoid loading the entire file into RAM (it should work something like reading an array in PHP).
Alternatively, could we put a number at the beginning of each line? I mean, is it possible to read just the first digits at the beginning of a line and skip the rest of its contents (going on to the next line)?
768| line content is here
769| another line
770| something
If only the first digits are read, the total amount of data to read is small, even for large files.
Do you need to read specific lines that can be indexed on line number? If so, just do a binary search. Read (say) 200 characters in the middle of the file to find out which line number you are at, then repeat in either of the halves until you get to the right line.
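A minimal sketch of that binary search, assuming each line starts with its (increasing) line number followed by | as in the example above, and that the file has the hypothetical name data.txt.

<?php
// Binary search over a file whose lines start with an increasing line number.
function findLine(string $path, int $wanted): ?string
{
    $handle = fopen($path, 'r');
    $lo = 0;
    $hi = filesize($path);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        fseek($handle, $mid);
        fgets($handle);                   // discard the partial line we landed in
        $line = fgets($handle);           // first complete line after $mid
        if ($line === false) {            // nothing after $mid: search the left half
            $hi = $mid;
            continue;
        }
        $number = (int) $line;            // only the leading digits are compared
        if ($number === $wanted) {
            fclose($handle);
            return $line;
        }
        if ($number < $wanted) {
            $lo = ftell($handle);         // the wanted line starts at or after here
        } else {
            $hi = $mid;                   // the wanted line starts at or before $mid
        }
    }
    fseek($handle, $lo);                  // $lo is always the start of a line
    $line = fgets($handle);
    fclose($handle);
    return ($line !== false && (int) $line === $wanted) ? $line : null;
}

echo findLine('data.txt', 769);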
I think there is no simple way to do what you want. Records have variable length, and the length cannot be determined in advance, right?
If the file is always the same (or at least not modified frequently), I'd put it in a database, or at least create an index file (record number: offset) and use that with fseek().
Alternatively, you can index your text file once and then proceed with your daily operation of picking out single lines based on your index file. You can find how to index a text file here or here. Indexing a text file is no different from indexing a CSV or variable-record file.
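As a rough illustration of such an index file, here is a sketch that stores one fixed-width 64-bit offset per line, so a single entry can itself be fetched with fseek() instead of loading the whole index into RAM; data.txt and data.idx are hypothetical names.

<?php
// Build the index once (or whenever data.txt changes).
$data = fopen('data.txt', 'r');
$idx  = fopen('data.idx', 'w');
$pos  = 0;
while (fgets($data) !== false) {
    fwrite($idx, pack('J', $pos));    // 8 bytes per line: offset of the line start
    $pos = ftell($data);
}
fclose($idx);

// Later: fetch line number $n (1-based) without reading the whole file.
$n   = 769;
$idx = fopen('data.idx', 'r');
fseek($idx, ($n - 1) * 8);
$offset = unpack('J', fread($idx, 8))[1];
fclose($idx);

fseek($data, $offset);
echo fgets($data);
fclose($data);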
I have multiple CSV files, each with the same set of row/column titles but each with different values. For example:
CSV-1.csv
A,B,C,C,C,X
A,A,A,A,C,X
CSV-2.csv
A,C,C,C,C,X
A,C,A,A,C,X
and so on...
I have been able to figure out how to read the files and convert them into HTML pre-formatted tables. However, I have not been able to figure out how to paginate when there are multiple files with data (as shown above), so that I get only a single table at a time with "Next" and "Previous" buttons (to be able to effectively see the changes in the table and data).
Any ideas would be greatly appreciated.
If you know in advance what the files are, then predetermining the line count for each file would let you do the pagination.
Then it'd be a simple matter of scanning through this line count cache to figure out which file to start reading from, and just keep reading lines/files until you reach the per-page line limit.
Otherwise, your option will be to open/read each file on each request, but only start outputting when you reach the file/line that matches the current "page" offset. For large files with many lines, this would be a serious waste of CPU time and disk bandwidth.
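A minimal sketch of the cached-line-count approach, assuming one CSV row per line, hypothetical names (CSV-*.csv, counts.json), 50 rows per page and a ?page= query parameter; none of these details come from the question.

<?php
// Paginate rows across multiple CSV files using a cached per-file line count.
$files   = glob('CSV-*.csv');
$perPage = 50;

// Build (or load) the cache of line counts per file.
if (!file_exists('counts.json')) {
    $counts = [];
    foreach ($files as $file) {
        $counts[$file] = count(file($file));
    }
    file_put_contents('counts.json', json_encode($counts));
} else {
    $counts = json_decode(file_get_contents('counts.json'), true);
}

// Translate the requested page into a starting file/line and collect one page.
$page = max(1, (int) ($_GET['page'] ?? 1));
$skip = ($page - 1) * $perPage;                      // rows to skip before output starts
$rows = [];
foreach ($files as $file) {
    if ($skip >= $counts[$file]) {                   // the whole file lies before this page
        $skip -= $counts[$file];
        continue;
    }
    $handle = fopen($file, 'r');
    while (count($rows) < $perPage && ($row = fgetcsv($handle)) !== false) {
        if ($skip > 0) { $skip--; continue; }        // still inside the skipped region
        $rows[] = $row;
    }
    fclose($handle);
    if (count($rows) === $perPage) {
        break;
    }
}
// $rows now holds the rows for this page; render them with Next/Previous links.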
Basically, I have a simple form which users use for uploading files. Files should be stored under the /files/ directory, with some subdirectories used to split the files almost equally, e.g. /files/sub1/sub2/file1.txt.
Also, I need to avoid storing duplicate files (by filename).
I have my own solution: calculate the sha1 of the filename, take the first 5 characters (abcde, for example) and put the file in /files/a/b/c/d/e/. This works well, but it produces situations where one folder contains 4k files and another 6k. Is there any way to make the file counts closer to each other? The maximum file count can be 10k or 10kk.
Thanks for any help.
P.S. Maybe I explained something wrong, so once again :) The task is simple: you have only HTML and PHP (without any DB) and a files directory where you should store only the uploaded files, without any data of your own. You should develop a script that can handle storing uploads to the files directory without storing duplicates (by filename), and that splits the uploaded files into subdirectories by the file count in each directory (the file counts in each directory should be close to each other).
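For reference, a minimal sketch of the sha1-prefix layout described above, assuming the upload arrives in a hypothetical "upload" form field.

<?php
// Shard uploads into /files/ by the first 5 characters of sha1(filename).
$name   = basename($_FILES['upload']['name']);
$prefix = substr(sha1($name), 0, 5);                   // e.g. "abcde"
$dir    = 'files/' . implode('/', str_split($prefix)); // files/a/b/c/d/e

if (!is_dir($dir)) {
    mkdir($dir, 0777, true);                           // create the nested path
}
$target = "$dir/$name";
if (!file_exists($target)) {                           // do not store duplicates by filename
    move_uploaded_file($_FILES['upload']['tmp_name'], $target);
}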
I have no idea why you want it that way. But if you REALLY have to do it this way, I would suggest you set a limit on how many bytes are stored in each folder. Every time you have to save the data, you open a log with:
the current sub
the total number of bytes written to that directory
If necessary, you create a new subdirectory (you could use the current timestamp because it won't repeat) and reset the byte count.
Then you save the file and increment the byte count by the number of bytes written.
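A minimal sketch of that log-based scheme, assuming a hypothetical files/current.log holding "subdirectory;bytes-written", an arbitrary 1 MB limit per subdirectory, and a hypothetical "upload" form field.

<?php
// Store uploads in timestamp-named subdirectories, capped by bytes per directory.
$limit = 1048576;                                    // max bytes per subdirectory (assumption)
if (file_exists('files/current.log')) {
    [$sub, $bytes] = explode(';', file_get_contents('files/current.log'));
    $bytes = (int) $bytes;
} else {
    $sub   = time();
    $bytes = 0;
}
if ($bytes >= $limit) {                              // directory is full: start a new one
    $sub   = time();                                 // current timestamp, as suggested above
    $bytes = 0;
}
if (!is_dir("files/$sub")) {
    mkdir("files/$sub", 0777, true);
}

$name   = basename($_FILES['upload']['name']);
$target = "files/$sub/$name";
if (!file_exists($target)) {                         // do not store duplicates by filename
    move_uploaded_file($_FILES['upload']['tmp_name'], $target);
    $bytes += filesize($target);
}
file_put_contents('files/current.log', "$sub;$bytes");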
I highly doubt it is worth the work, but I do not really know why you want to distribute the files that way.