PHP: How to get a random line from a big file fastest?

I have a txt file with 100,000 lines. Do you know any way to get one line at random as fast as possible? Thank you very much!

The fastest way would be to build an index (a simple array that contains the position of each new line in the file). Then choose a random key, get the position, fseek the file to that position, and read the line. This requires updating the index file any time you change the file, but if you want to optimize retrieving the data, that's the way to go.
You can optimize further by splitting the file into ranges (e.g. sharding the data), or by keeping several representations of the file (for example, a copy with the lines inverted so the last comes first; if your random number is bigger than half of the elements, you read from the second file).
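A minimal sketch of that index approach, assuming a data file data.txt and an index file data.idx (both names are placeholders):

$index = array();
$fp = fopen('data.txt', 'rb');
while (true) {
    $offset = ftell($fp);              // byte position where the next line starts
    if (fgets($fp) === false) {        // consume the line; false means end of file
        break;
    }
    $index[] = $offset;
}
fclose($fp);
file_put_contents('data.idx', serialize($index));

// Later, fetch one random line without scanning the whole file:
$index = unserialize(file_get_contents('data.idx'));
$fp = fopen('data.txt', 'rb');
fseek($fp, $index[array_rand($index)]);
echo rtrim(fgets($fp), "\r\n"), PHP_EOL;
fclose($fp);

Rebuilding data.idx is only needed when data.txt changes; the lookup itself is a single fseek plus one fgets.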

Related

Reading a specified number of rows in a huge Excel file

I want to read a huge Excel file in batches so as to reduce reading time.
I wrote the code below.
$xls = new Spreadsheet_Excel_Reader($path);
$data = array();
for ($row = 2; $row <= 10; $row++) {
    $data[] = $xls->val($row, $field);   // $field holds the column to read
}
This takes a lot of time each time the file is read because the file is huge. The file also gets reloaded each time.
How can I read only the required rows of the file to save time?
The file will get reloaded each time the PHP script is executed, simply because PHP does not keep the previous state between runs. When you say a large file, how many records/bytes are we talking about?
To speed up the reading of such a file, you could put it on a RAM disk (if using Linux), which is far faster than an SSD. Or read it once and store a CSV equivalent with fixed record lengths. The fixed record lengths will allow you to jump to any segment you wish and retrieve any number of records easily.
So if your record length were 90 bytes/characters and you wanted records 100 to 109 (zero-based), you would open the file for reading, fseek to position 9000 (90 * 100), and grab the next 900 characters.
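As a sketch, assuming such a fixed-length file already exists (the file name and the 90-byte record length are placeholders):

$recordLength = 90;                      // fixed size of every record, newline included
$first        = 100;                     // zero-based number of the first wanted record
$count        = 10;

$fp = fopen('fixed_records.csv', 'rb');
fseek($fp, $first * $recordLength);      // jump straight to record 100
$chunk = fread($fp, $count * $recordLength);
fclose($fp);

$records = str_split($chunk, $recordLength);   // one 90-byte record per array entry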

performance of two file fopen vs one big file fopen

I am trying to write to a txt file; my data are comma-separated numbers which are appended every time. I have two types of data which I am trying to compare, so if I use one txt file, each type has to be written on its own line.
1,5,10,15,22,32
The other line is
1,5,7,12,8,99
So it is simple data, but it grows over time as new values are appended, and in the worst case it may become big, maybe 10,000-15,000 numbers in each line.
I also worry that if these txt files are opened and closed fast enough, the data may get corrupted.
So I would like to write these two lines to two different txt files instead of one. Which gives better integrity and performance?
Should I use only one file, or are two files preferred?
I don't want to use a database, as a database is always a hassle and may create some load on the server which can be avoided with a txt file.

How do you read Excel files using Spout as efficiently as the website says?

Due to time-outs using PHPExcel, I am experimenting with Spout.
I understood it differs from PHPExcel in that it reads one row at a time. I tried a file with 116,652 rows (about 21 MB), inserting logging at specific points within the code to track how long things take.
To my surprise this statement (right after instantiating $reader with ReaderFactory class):
$reader->open($inputFileName);
took 30 minutes to execute.
Is there some setting I'm missing? I've done nothing different from what is instructed here.
30 minutes for 116,000 rows is definitely not normal. The open function may take a while when the file uses shared strings (as opposed to inline strings). Excel uses shared strings by default (which optimizes the size of the XLSX file), while Spout uses inline strings by default (which optimizes for future read performance).
To understand what could go wrong, let me explain what's done in the open function:
An XLSX file is a bunch of XML files zipped together. Each XML file contains specific data describing the spreadsheet. There is one XML file describing the structure of the spreadsheet (row 1 > cell A1 > value of A1, cell B1 > value of B1; row 2 > cell A2 > value of A2, ...).
The values of the cells can either be inline, so directly present in this file, or, in the case of strings, be stored as references only. So for instance, the value of A1 is the string with reference 1, the value of A2 is the string with reference 3, etc. This is useful for space optimization, as a repeated value will only be stored once; only its reference will be duplicated. That's how shared strings work.
So when Spout reads an XLSX file that uses shared strings, it needs to know which string corresponds to reference 1 or 3. For that, it first needs to read the file containing all these strings and store the mapping somewhere (string 1 = "foo", string 3 = "bar") so that when it reads the file containing the structure of the spreadsheet, it can translate "string with reference 1" into "foo".
If the file containing the shared strings is very big, Spout does not try to keep the mapping "string reference" => "actual value" in memory but splits this file into smaller chunks. This whole process (reading + splitting) takes a while, hence the long wait while open() executes.
Now, unless you have a very large number of cells, it should not take 30 minutes.
The most likely reason is that your spreadsheet contains more than one sheet (they may be hidden, but Spout still processes all the strings).
So check that, and if that's not the reason, please create an issue on Github, attaching the file you have problems with.
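For reference, a minimal read loop using the v2-style ReaderFactory API the question mentions ($inputFileName is taken from the question; processing only the first sheet is just one way to skip extras):

use Box\Spout\Reader\ReaderFactory;
use Box\Spout\Common\Type;

$reader = ReaderFactory::create(Type::XLSX);
$reader->open($inputFileName);          // this is the step that reads and splits the shared strings

foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row) {
        // $row is an array of cell values; process it here
    }
    break;                              // stop after the first sheet
}

$reader->close();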

Mapping a flat text file

In a text file, lines are delimited by \n at the end of each line. To find a given line, therefore, it is necessary to read through the entire file, and this is a big problem for large files (say 2GB). I am looking for a method to read a single line without walking through the entire file (though I know it should be a complicated process).
The first way I know is to use fseek() with an offset, but it is not practical.
Creating a flat file of key/value pairs is another, but I am not sure if there is a way to avoid loading the entire file into RAM (it should be something like reading an array in PHP).
Alternatively, could we put some numbers at the beginning of each line to be read? I mean, is it possible to read just the first digits at the beginning of a line while skipping the line contents (jumping straight to the next line)?
768| line content is here
769| another line
770| something
If only the first digits are read, the total amount of data to read is not much, even for large files.
Do you need to read specific lines that can be indexed by line number? If so, just do a binary search. Read (say) 200 characters in the middle of the file to find a line number. Then repeat in either of the halves until you get to the right line.
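A sketch of that binary-search idea, assuming every line starts with its line number followed by "|" (as in the example above) and that the numbers are in ascending order; the function name is a placeholder:

function findLine($path, $target) {
    $fp = fopen($path, 'rb');
    $lo = 0;
    $hi = filesize($path);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        fseek($fp, $mid);
        if ($mid > 0) {
            fgets($fp);                 // we probably landed mid-line: skip to the next full line
        }
        $line = fgets($fp);
        if ($line === false) {          // ran past the last line: search the lower half
            $hi = $mid;
            continue;
        }
        $num = (int) $line;             // the leading digits before "|"
        if ($num === $target) {
            fclose($fp);
            return rtrim($line, "\r\n");
        }
        if ($num < $target) {
            $lo = ftell($fp);           // target lies after this line
        } else {
            $hi = $mid;                 // target starts at or before $mid
        }
    }
    // The window can close right at the target's first byte without reading it, so check $lo.
    fseek($fp, $lo);
    $line = fgets($fp);
    fclose($fp);
    return ($line !== false && (int) $line === $target) ? rtrim($line, "\r\n") : null;
}

Each probe reads only a line or two, so even a 2GB file needs just a few dozen reads.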
I think there is no simple way to do what you want. Records have variable length, and the length cannot be determined in advance, right?
If the file is always the same (or at least not modified frequently), I'd put it into a database, or at least create an index file (record number: offset) and use that with fseek().
Alternatively, you can index your text file once and then proceed with your daily operation of picking up single lines based on your index file. You can find how to index your text file here or here. Indexing a text file is no different from indexing a CSV or variable-record file.

Rearranging data in files

There is an array of numbers, divided into partitions containing the same number of elements (the output of array_chunk()). They are written into separate files: 1.txt contains the first chunk, 2.txt the second, and so on. Now I want these files to contain a different number of elements of the initial array. Of course, I can read them into one array and split it again, but that requires quite a large amount of memory. Could you please help me with a more efficient solution? (The number of files and the size of the last one are stored separately.) I have no other ideas...
Do you know what the new chunk size is? If you do, then you can easily read the data in and, whenever you fill a chunk, write it out. In pseudo-code:
for each original file:
    for each record:
        add record to buffer
        if buffer is desired size:
            write new file
            clear buffer
write new file
Obviously you'll need to keep new files separate from old ones. And then, once you've rewritten the data, you can swap them out somehow. (I would personally suggest having two directories, then rename directories after you're done.)
If you don't know what the size of your chunks should be (for instance you want a specific number of files) then first do whatever work it needs to figure that out, then proceed with the original solution.
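A PHP sketch of that pseudo-code, assuming the old files are named 1.txt, 2.txt, ... as in the question, store one number per line, and that the new files go into a separate new/ directory (the chunk size and file count below are placeholders):

$newChunkSize = 500;         // desired number of elements per new file
$oldFileCount = 12;          // number of old files, which the question says is stored separately

$buffer   = array();
$newIndex = 1;

for ($i = 1; $i <= $oldFileCount; $i++) {
    // Each old file holds only one chunk, so reading it alone stays cheap on memory.
    foreach (file($i . '.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $number) {
        $buffer[] = $number;
        if (count($buffer) === $newChunkSize) {
            file_put_contents('new/' . $newIndex++ . '.txt', implode("\n", $buffer) . "\n");
            $buffer = array();
        }
    }
}
if ($buffer) {               // write the last, possibly smaller, chunk
    file_put_contents('new/' . $newIndex . '.txt', implode("\n", $buffer) . "\n");
}

Once everything is written, swap the new/ directory with the old one, as suggested above.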
