I want to read a huge Excel file in chunks so as to reduce reading time.
I wrote the code below.
// Spreadsheet_Excel_Reader parses the whole workbook here, which is the slow part
$xls = new Spreadsheet_Excel_Reader($path);
$dd = array();
for ($row = 2; $row <= 10; $row++) {
    // val(row, column) returns a single cell value; $field is the column index
    $dd[] = $xls->val($row, $field);
}
Because the file is huge, this takes a long time every time it is read, and the whole file gets reloaded on every request.
How can I read only the rows I need and save time?
The file will get reloaded each time the PHP script is executed, simply because PHP does not keep any state between requests. When you say a large file, how many records/bytes are we talking about?
To speed up reading such a file, you could put it on a RAM disk (if using Linux), which is far faster than SSDs. Or read it once and store a CSV equivalent with fixed record lengths. The fixed record lengths let you jump to any segment you wish, and work out the number of records easily.
So if your record length became 90 bytes/characters and you wanted records 100 to 109 (counting from record 0), you would open the file for reading, fseek to position 9000 (90 * 100) and grab the next 900 characters. A sketch of that follows.
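For illustration, a minimal sketch of that fixed-length approach, assuming the CSV was rewritten so every record is padded to exactly 90 bytes; the file name and helper function are made up for the example:

<?php
const RECORD_LEN = 90;   // fixed length of every record, newline included

function readRecords(string $path, int $first, int $count): array
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    // Jump straight to the first requested record (records counted from 0).
    fseek($fp, $first * RECORD_LEN);
    $records = [];
    for ($i = 0; $i < $count; $i++) {
        $chunk = fread($fp, RECORD_LEN);
        if ($chunk === false || strlen($chunk) < RECORD_LEN) {
            break;   // reached end of file
        }
        // Strip the padding, then split the CSV fields.
        $records[] = str_getcsv(rtrim($chunk));
    }
    fclose($fp);
    return $records;
}

// Records 100 to 109: seek to 90 * 100 = 9000 and read 10 * 90 = 900 bytes.
$rows = readRecords('records-fixed.csv', 100, 10);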
I have 1000+ txt files whose file names are usernames. I am reading them in a loop. Here is my code:
for ($i = 0; $i < 1240; $i++) {
    $node = $users_array[$i];
    $read_file = "Uploads/" . $node . "/" . $node . ".txt";
    if (file_exists($read_file)) {
        if (filesize($read_file) > 0) {
            $myfile = fopen($read_file, "r");
            $file_str = fread($myfile, filesize($read_file));
            fclose($myfile);
        }
    }
}
When the loop runs, it takes too much time and the server times out.
I don't know why it takes that long, because the files do not contain much data. Reading all the text from a txt file should be fast. Am I right?
Well, you are doing read operations on an HDD/SSD, which is not as fast as memory, so you should expect a long running time depending on how big the text files are. You can try the following (a short sketch follows the list):
If you are running the script from the browser, I recommend running it from the command line instead; that way you will not hit a web-server timeout and the script can finish, provided no execution time limit is set in PHP (and if one is, consider increasing it).
In your script above, store the result of filesize($read_file) in a variable so that you do not call it twice; it may shave a little off the running time.
If you still can't finish the job, consider running it in batches of 100 or 500 files.
Keep an eye on memory usage; maybe that is why the script dies.
If you need the content of the file as a string, you can try file_get_contents and maybe skip the filesize check altogether.
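A minimal sketch that combines those suggestions, assuming $users_array is already populated; the batch offset and size are made-up values for illustration:

$batchStart = 0;      // e.g. taken from a command-line argument
$batchSize  = 500;

$batch = array_slice($users_array, $batchStart, $batchSize);
foreach ($batch as $node) {
    $read_file = "Uploads/" . $node . "/" . $node . ".txt";
    if (!is_file($read_file)) {
        continue;
    }
    // file_get_contents opens, reads and closes the file in one call.
    $file_str = file_get_contents($read_file);
    if ($file_str === false || $file_str === '') {
        continue;
    }
    // ... process $file_str ...
}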
It sounds like your problem is having 1000+ files in a single directory. On a traditional Unix file system, finding a single file by name requires scanning through the directory entries one by one. If you have a list of files and try to read all of them, it'll require traversing about 500000 directory entries, and it will be slow. It's an O(n^2) algorithm and it'll only get worse as you add files.
Newer file systems have options to enable more efficient directory access (for example https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Hash_Tree_Directories) but if you can't/don't want to change file system options you'll have to split your files into directories.
For example, you could take the first two letters of the user name and use that as the directory. That's not great because you'll get an uneven distribution; it would be better to use a hash, but then it becomes difficult to find entries by hand. A rough sketch of both variants is below.
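Both variants as a hedged sketch (the Uploads/ layout is carried over from the question; the helper names are made up):

<?php
function shardedPath($node)
{
    // First two characters of the username become the subdirectory,
    // e.g. "alice" -> "Uploads/al/alice.txt".
    $prefix = substr(strtolower($node), 0, 2);
    return "Uploads/" . $prefix . "/" . $node . ".txt";
}

function shardedPathByHash($node)
{
    // A hash spreads the files more evenly, at the cost of readability.
    $prefix = substr(md5($node), 0, 2);
    return "Uploads/" . $prefix . "/" . $node . ".txt";
}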
Alternatively you could iterate the directory entries (with opendir and readdir) and check if the file names match your users, and leave dealing with the problems the huge directory creates for later.
Alternatively, look into using a database for your storage layer.
I need to insert all the data from an Excel file (.xlsx) into my database. I have tried all the available methods, such as caching and reading chunk by chunk, but nothing seems to work at all. Has anyone tried this with a big file before? My spreadsheet has about 32 columns and about 700,000 rows of records.
The file is already uploaded to the server, and I want to write a cron job to read the Excel file and insert the data into the database. I chunked the reads to 5000, 3000 or even just 10 records at a time, but none of that worked. What happens is it returns this error:
simplexml_load_string(): Memory allocation failed: growing buffer.
I did try the CSV file type and managed to get it to run at 4000k records each time, but each run took about five minutes to process, and anything higher failed with the same error. However, the requirement is for .xlsx files, so I need to stick with that.
Consider converting it to CSV format with an external tool, such as ssconvert from the Gnumeric package, and then reading the CSV line by line with the fgetcsv function.
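For illustration, a minimal sketch of that approach; the shell_exec call, file names and the batching comment are assumptions added here, not part of the original answer:

<?php
// Convert the workbook once (ssconvert ships with Gnumeric).
shell_exec('ssconvert huge.xlsx huge.csv');

$fp = fopen('huge.csv', 'rb');
if ($fp === false) {
    exit("Cannot open CSV\n");
}

// fgetcsv reads one row at a time, so memory use stays flat no matter
// how many rows the file contains.
while (($row = fgetcsv($fp)) !== false) {
    // ... insert $row into the database, ideally in batched transactions ...
}
fclose($fp);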
Your issue occurs because you are trying to read the contents of a whole XML file. Caching and reading chunk by chunk does not help because the library you are using needs to read the entire XML file at one point to determine the structure of the spreadsheet.
So for very large files, the XML is so big that reading it consumes all the available memory. The only workable option is to use a streaming reader and optimize the reading.
This is still a pretty complex problem. For instance, to resolve the data in your sheet, you need to read the shared strings from one XML file and the structure of your sheet from another one. Because of the way shared strings are stored, you need to have those strings in memory when reading the sheet structure. If you have thousands of shared strings, that becomes a problem.
If you are interested, Spout solves this problem. It is open-source so you can take a look at the code!
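For illustration, a minimal sketch using Spout's streaming reader (class names follow the Spout 3.x documentation; verify them against the version you install, as they are not part of the original answer):

<?php
use Box\Spout\Reader\Common\Creator\ReaderEntityFactory;

require 'vendor/autoload.php';

$reader = ReaderEntityFactory::createXLSXReader();
$reader->open('huge.xlsx');

foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row) {
        $cells = $row->toArray();
        // ... insert $cells into the database, batching the inserts ...
    }
}

$reader->close();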
I am building a website where the basic premise is that there are two files: index.php and file.txt.
file.txt currently holds 10 MB of data and could potentially grow to 500 MB. The idea of the site is that people go to index.php and can seek to any position in the file. Another feature is that they can read up to 10 KB of data from the point they seek to. So:
index.php?pos=432 will get the byte at position 432 of the file.
index.php?pos=555&len=5000 will get 5 KB of data from the file, starting at position 555.
Now, imagine the site getting thousands of hits a day.
I currently use fseek and fread to serve the data. Is there any faster way of doing this? Or is my usage too low to consider advanced optimizations such as caching the results of each request or loading the file into memory and reading it from there?
Thousands of hits per day, that's like one every few seconds? That's definitely too low to need optimizing at this point, so just use fseek and fread if that's what's easiest for you.
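For reference, a minimal sketch of that fseek/fread approach (the parameter validation and the 10 KB cap are assumptions added here for safety, not something the answer specified):

<?php
// index.php - serve up to 10 KB from an arbitrary offset in file.txt
$pos = isset($_GET['pos']) ? max(0, (int) $_GET['pos']) : 0;
$len = isset($_GET['len']) ? (int) $_GET['len'] : 1;
$len = max(1, min($len, 10240));    // cap reads at 10 KB

$fp = fopen('file.txt', 'rb');
if ($fp === false) {
    http_response_code(500);
    exit;
}

fseek($fp, $pos);            // jump straight to the requested offset
$data = fread($fp, $len);    // read at most $len bytes
fclose($fp);

header('Content-Type: text/plain');
echo $data;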
If it is crucial for you to keep all the data in a file, I would suggest splitting your file into a set of smaller files.
For example, you could decide that a file should not be larger than 1 MB. That means splitting your 10 MB file.txt into 10 separate files: file-1.txt, file-2.txt, file-3.txt and so on...
When you process a request, you determine which file to pick up by dividing the pos argument by the chunk size, then read the appropriate amount of data from it. In that case fseek will perhaps work a little faster...
But either way you will stick with the fopen and fseek functions (a sketch of the arithmetic is below).
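A minimal sketch of that chunk arithmetic (the 1 MB chunk size and file naming follow the answer's example; handling reads that cross a chunk boundary is an added assumption):

<?php
const CHUNK_SIZE = 1048576;   // 1 MB per file-N.txt

function readRange($pos, $len)
{
    $out = '';
    while ($len > 0) {
        $fileIndex = intdiv($pos, CHUNK_SIZE) + 1;     // file-1.txt holds bytes 0..CHUNK_SIZE-1
        $offset    = $pos % CHUNK_SIZE;
        $take      = min($len, CHUNK_SIZE - $offset);  // stay inside this chunk

        $fp = fopen("file-$fileIndex.txt", 'rb');
        if ($fp === false) {
            break;                                     // ran past the last chunk
        }
        fseek($fp, $offset);
        $out .= fread($fp, $take);
        fclose($fp);

        $pos += $take;
        $len -= $take;
    }
    return $out;
}

// Same request as index.php?pos=555&len=5000
echo readRange(555, 5000);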
Edit: now that I consider it, so long as you're using fseek() to go to a byte offset and then using fread() to get a certain number of bytes, it shouldn't be a problem. For some reason I read your question as serving X number of lines from a file, which would be truly terrible.
The problem is that you are absolutely hammering the disk with IO operations, and you're not just causing performance issues with this one file/script, you're causing performance issues with anything that needs that disk: other users, the OS, etc. If you're on shared hosting, I guarantee that one of the sysadmins is trying to figure out who you are so they can turn you off. [I would be]
You need to find a way to either:
Offload this to memory: set up a daemon on the server that loads the file into memory and serves chunks on request.
Offload this to something more efficient, like MySQL.
You're already serving the data in sequential chunks, e.g. lines 466 to 476, so it will be much faster to retrieve the data from a table like:
CREATE TABLE mydata (
    line INTEGER NOT NULL AUTO_INCREMENT,
    data VARCHAR(2048),
    PRIMARY KEY (line)
);
by:
SELECT data FROM mydata WHERE line BETWEEN 466 AND 476;
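For completeness, a sketch of running that query from PHP with PDO (the connection details are placeholders):

<?php
// Placeholder DSN and credentials; adjust to your environment.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');

$stmt = $pdo->prepare('SELECT data FROM mydata WHERE line BETWEEN :first AND :last');
$stmt->execute([':first' => 466, ':last' => 476]);

// One array entry per line, served from an indexed lookup instead of a file scan.
$lines = $stmt->fetchAll(PDO::FETCH_COLUMN);
echo implode("\n", $lines);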
If the file never changes, and is truly limited in maximum size, I would simply mount a ramdisk, and have a boot script which copies the file from permanent storage to RAM storage.
This probably requires hosting the site on Linux, if you aren't already.
This would allow you to guarantee that the file segments are served from memory, without relying on the OS filesystem cache.
Using PHPExcel I can run each tab separately and get the results I want, but if I add them all into one Excel file it just stops, with no error or anything.
Each tab consists of about 60 to 80 thousand records and I have about 15 to 20 tabs, so about 1,600,000 records split across multiple tabs (this number will probably grow as well).
I have also worked around the 65,000-row limitation of .xls by using the .xlsx extension, with no problems as long as I run each tab in its own Excel file.
Pseudo code:
read data from db
start the PHPExcel process
parse out data for each page (some styling/formatting but not much)
(each numeric field value does get summed up in a totals column at the bottom of the excel using the formula SUM)
save excel (xlsx format)
I have 3GB of RAM so this is not an issue and the script is set to execute with no timeout.
I have used PHPExcel in a number of projects and have had great results but having such a large data set seems to be an issue.
Has anyone ever had this problem? Workarounds? Tips? etc...
UPDATE:
In the error log: memory exhausted.
Besides adding more RAM to the box, are there any other tips that could help?
Has anyone ever saved the current state and then edited the Excel file with new data?
I had the exact same problem, and googling around did not turn up a workable solution.
As PHPExcel builds objects and stores all data in memory before finally generating the document file, which itself is also held in memory, raising the memory limit in PHP will never entirely solve this problem; that solution does not scale very well.
To really solve the problem, you need to generate the XLS file "on the fly". That's what I did, and now I can be sure that the "download SQL result set as XLS" feature works no matter how many (millions of) rows the database returns.
The pity is, I could not find any library that offers on-the-fly XLS(X) generation.
I found this article on IBM developerWorks which gives an example of how to generate the XLS XML on the fly:
http://www.ibm.com/developerworks/opensource/library/os-phpexcel/#N101FC
It works pretty well for me; I have multiple sheets with LOTS of data and did not even touch the PHP memory limit. It scales very well.
Note that this example uses the plain Excel XML format (file extension "xml"), so you can send your uncompressed data directly to the browser.
http://en.wikipedia.org/wiki/Microsoft_Office_XML_formats#Excel_XML_Spreadsheet_example
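As an illustration of the idea (not the IBM article's exact code), a minimal sketch that streams the Excel 2003 XML Spreadsheet format row by row; the PDO connection and query are placeholders:

<?php
// Placeholder connection and query; replace with your own.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->query('SELECT * FROM big_table');

header('Content-Type: application/vnd.ms-excel');
header('Content-Disposition: attachment; filename="export.xml"');

// The workbook header goes out immediately; nothing is buffered in PHP.
echo '<?xml version="1.0"?>' . "\n";
echo '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"' . "\n";
echo ' xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">' . "\n";
echo '<Worksheet ss:Name="Sheet1"><Table>' . "\n";

// Each row is echoed as soon as it is fetched, so memory use stays flat.
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    echo '<Row>';
    foreach ($row as $value) {
        $type = is_numeric($value) ? 'Number' : 'String';
        echo '<Cell><Data ss:Type="' . $type . '">'
            . htmlspecialchars((string) $value, ENT_XML1)
            . '</Data></Cell>';
    }
    echo '</Row>' . "\n";
}

echo '</Table></Worksheet></Workbook>';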
If you really need to generate an XLSX, things get even more complicated. XLSX is a compressed archive containing multiple XML files, so you must write all your data to disk (or keep it in memory, which is the same problem as with PHPExcel) and then create the archive from that data.
http://en.wikipedia.org/wiki/Office_Open_XML
It may also be possible to generate compressed archives "on the fly", but that approach seems really complicated.
I have multiple CSV files, each with the same set of row/column titles but each with different values. For example:
CSV-1.csv
A,B,C,C,C,X
A,A,A,A,C,X
CSV-2.csv
A,C,C,C,C,X
A,C,A,A,C,X
and so on...
I have been able to figure out how to read the files and convert them into pre-formatted HTML tables. However, I have not been able to figure out how to paginate when there are multiple files with data (as shown above), so that I get only a single table at a time with "Next" and "Previous" buttons (to be able to effectively see the changes in the table and data).
Any ideas would be greatly appreciated.
If you know in advance what the files are, then predetermining the line count for each file would let you do the pagination.
Then it'd be a simple matter of scanning through this line count cache to figure out which file to start reading from, and just keep reading lines/files until you reach the per-page line limit.
Otherwise, your option will be to open/read each file upon each request, but only start outputting when you reach the file/line that matches the current "page" offset. For large files with many lines, this would be a serious waste of CPU time and disk bandwidth. A sketch of the first approach is below.
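A minimal sketch of the line-count-cache approach (the file list, page size and request parameter are assumptions for illustration):

<?php
$files   = array('CSV-1.csv', 'CSV-2.csv');   // files in display order
$perPage = 50;
$page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;

// Predetermined line counts; in practice build this once and cache it
// rather than recounting on every request.
$lineCounts = array();
foreach ($files as $f) {
    $lineCounts[] = count(file($f, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
}

// Work out which file, and which line inside it, the requested page starts on.
$skip = ($page - 1) * $perPage;
$startFile = 0;
while ($startFile < count($files) && $skip >= $lineCounts[$startFile]) {
    $skip -= $lineCounts[$startFile];
    $startFile++;
}

// Read rows from that point until the page is full.
$rows = array();
for ($i = $startFile; $i < count($files) && count($rows) < $perPage; $i++) {
    $fp = fopen($files[$i], 'rb');
    $line = 0;
    while (($row = fgetcsv($fp)) !== false && count($rows) < $perPage) {
        if ($i === $startFile && $line++ < $skip) {
            continue;   // skip up to the page offset, but only in the first file
        }
        $rows[] = $row;
    }
    fclose($fp);
}
// ... render $rows as an HTML table with "Next"/"Previous" links ...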