I have multiple CSV files, each with the same set of row/column titles but each with different values. For example:
CSV-1.csv
A,B,C,C,C,X
A,A,A,A,C,X
CSV-2.csv
A,C,C,C,C,X
A,C,A,A,C,X
and so on...
I have been able to figure out how to read the files and convert them into pre-formatted HTML tables. However, I have not been able to figure out how to paginate when there are multiple files with data (as shown above), so that I get only a single table at a time with "Next" and "Previous" buttons (to be able to effectively see the changes in the table and data).
Any ideas would be greatly appreciated.
If you know in advance what the files are, then predetermining the line count for each file would let you do the pagination.
Then it'd be a simple matter of scanning through this line count cache to figure out which file to start reading from, and just keep reading lines/files until you reach the per-page line limit.
Otherwise, your only option will be to open and read each file on each request, but only start outputting once you reach the file/line that matches the current "page" offset. For large files with many lines, this would be a serious waste of CPU time and disk bandwidth.
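For illustration, here is a minimal PHP sketch of the line-count approach, assuming the file names are known in advance; the file list, the $perPage limit and the on-the-fly count() call are placeholders (in practice you would cache the counts rather than recompute them on every request):

<?php
// Sketch only: file names, $perPage and the counting strategy are assumptions.
$files   = array('CSV-1.csv', 'CSV-2.csv');
$perPage = 20;
$page    = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;

// Line count per file (ideally precomputed and cached, not done per request).
$lineCounts = array();
foreach ($files as $f) {
    $lineCounts[$f] = count(file($f, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
}

$skip = ($page - 1) * $perPage;   // rows that belong to earlier pages
$rows = array();

foreach ($files as $f) {
    if ($skip >= $lineCounts[$f]) {        // this whole file lies before our page
        $skip -= $lineCounts[$f];
        continue;
    }
    $fh = fopen($f, 'r');
    while (($line = fgetcsv($fh)) !== false && count($rows) < $perPage) {
        if ($skip > 0) { $skip--; continue; }   // still skipping to the offset
        $rows[] = $line;
    }
    fclose($fh);
    if (count($rows) >= $perPage) {
        break;                              // page is full
    }
}

// Render $rows as a single table, then emit "Previous"/"Next" links
// pointing at ?page=<$page - 1> and ?page=<$page + 1>.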
Related
I have the following problem: I upload Excel files with a form, and on submit they are processed server side for 5 minutes by a background process.
Now, I want to create a snapshot of the Excel file and display it to the user, which I already do, but opening the file with PHPExcel is usually very slow, and I need to make that process faster for the sake of usability.
To be clear, if I click "preview" it may take 10, 20 or 30 seconds, or the AJAX request simply dies. Sometimes I use reduced versions of the Excel file (open it, remove 50k rows, and save it again with 100 rows) for testing purposes, and then the preview is shown in no time.
What I want to do is the same thing with PHP server side. I mean, open the Excel file, remove 50k rows, save it again, and then send the preview back.
Using PHPExcel doesn't help at all; it may achieve what I want, but again, the time is not acceptable.
Is there any way I can do something like:
$excel_info = file_get_contents($file);
//USE SOME REGEX OR RULE TO REMOVE COLUMNS, OR OTHERWISE, EXTRACT ONLY SOME ROWS
$first10ColumnsInfo = customFunction($excel_info);
file_put_contents("tmp/reduced_excel.xlsx", $first10ColumnsInfo);
I tried to look into the PHPExcel library to get an idea of how it handles the data, and to do something similar, but at some point I simply got lost; I could retrieve some info, but not properly formatted.
Thank you in advance
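One approach that may help, rather than physically shrinking the file first: PHPExcel's read filters let you load only a slice of the workbook, which is usually far cheaper than loading everything. A rough sketch, assuming the .xlsx reader and arbitrary limits of 100 rows and 10 columns for the preview:

require_once 'PHPExcel.php';

// Only admit the first 100 rows and columns A-J into memory.
class PreviewReadFilter implements PHPExcel_Reader_IReadFilter
{
    public function readCell($column, $row, $worksheetName = '')
    {
        return $row <= 100 && in_array($column, range('A', 'J'));
    }
}

$reader = PHPExcel_IOFactory::createReader('Excel2007');   // .xlsx
$reader->setReadDataOnly(true);                            // skip styles/formatting
$reader->setReadFilter(new PreviewReadFilter());
$excel = $reader->load($file);                             // much smaller object graph

// Build the preview (e.g. via toArray()) from $excel as usual.

That avoids the remove-and-resave step entirely; whether it is fast enough for your files is something you would have to measure.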
Using PHPExcel I can run each tab separately and get the results I want, but if I add them all into one Excel file it just stops, with no error or anything.
Each tab consists of about 60 to 80 thousand records, and I have about 15 to 20 tabs. So about 1,600,000 records split into multiple tabs (this number will probably grow as well).
I have also tested the 65,000 row limitation of .xls by using the .xlsx extension, with no problems, as long as I run each tab in its own Excel file.
Pseudo code:
read data from db
start the PHPExcel process
parse out data for each page (some styling/formatting but not much)
(each numeric field value does get summed up in a totals column at the bottom of the excel using the formula SUM)
save excel (xlsx format)
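In PHPExcel terms, that pipeline is roughly the following (a simplified sketch only; the sheet layout, column letters and the $dbRows variable are placeholders):

require_once 'PHPExcel.php';

$excel = new PHPExcel();
$sheet = $excel->getActiveSheet();

$rowNum = 1;
foreach ($dbRows as $row) {                 // data already read from the db
    $sheet->fromArray($row, null, 'A' . $rowNum);
    $rowNum++;
}

// totals row: SUM formula over a numeric column
$sheet->setCellValue('C' . $rowNum, '=SUM(C1:C' . ($rowNum - 1) . ')');

$writer = PHPExcel_IOFactory::createWriter($excel, 'Excel2007');   // xlsx
$writer->save('report.xlsx');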
I have 3GB of RAM so this is not an issue and the script is set to execute with no timeout.
I have used PHPExcel in a number of projects and have had great results but having such a large data set seems to be an issue.
Has anyone ever had this problem? A workaround? Tips? etc...
UPDATE:
The error log shows: memory exhausted.
Besides adding more RAM to the box, are there any other tips you could give me?
Has anyone ever saved the current state and edited the Excel file with new data?
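One PHPExcel-level setting worth trying before throwing more hardware at it, assuming a reasonably recent version: its cell caching options, which keep cell objects compressed in memory or on disk instead of as plain PHP objects. A hedged sketch (the chosen method and the 64MB threshold are just examples):

require_once 'PHPExcel.php';

// Must be configured before the PHPExcel object is created.
$cacheMethod   = PHPExcel_CachedObjectStorageFactory::cache_to_phpTemp;
$cacheSettings = array('memoryCacheSize' => '64MB');   // spill to php://temp past 64MB
PHPExcel_Settings::setCacheStorageMethod($cacheMethod, $cacheSettings);

$excel = new PHPExcel();
// ... build the tabs as before ...

It trades speed for memory, so the run gets slower, but it can be the difference between finishing and exhausting memory.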
I had the exact same problem, and googling around did not turn up a workable solution.
As PHPExcel generates Objects and stores all data in memory, before finally generating the document file which itself is also stored in memory, setting higher memory limits in PHP will never entirely solve this problem - that solution does not scale very well.
To really solve the problem, you need to generate the XLS file "on the fly". That's what I did, and now I can be sure that the "download SQL result set as XLS" works no matter how many (million) rows are returned by the database.
The pity is, I could not find any library which features "drive-by" XLS(X) generation.
I found this article on IBM developerWorks which gives an example of how to generate the XLS XML "on the fly":
http://www.ibm.com/developerworks/opensource/library/os-phpexcel/#N101FC
It works pretty well for me - I have multiple sheets with LOTS of data and did not even touch the PHP memory limit. It scales very well.
Note that this example uses the Excel plain XML format (file extension "xml") so you can send your uncompressed data directly to the browser.
http://en.wikipedia.org/wiki/Microsoft_Office_XML_formats#Excel_XML_Spreadsheet_example
If you really need to generate an XLSX, things get even more complicated. XLSX is a compressed archive containing multiple XML files. For that, you must write all your data to disk (or memory - same problem as with PHPExcel) and then create the archive from that data.
http://en.wikipedia.org/wiki/Office_Open_XML
It may also be possible to generate compressed archives "on the fly", but this approach seems really complicated.
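To make the approach concrete, here is a hedged sketch of that "on the fly" SpreadsheetML output; the mysqli result set, the sheet name and treating every value as a String are assumptions, and the element layout follows the Wikipedia example linked above:

header('Content-Type: application/vnd.ms-excel');
header('Content-Disposition: attachment; filename="export.xml"');

echo '<?xml version="1.0"?>' . "\n";
echo '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"'
   . ' xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">' . "\n";
echo '<Worksheet ss:Name="Sheet1"><Table>' . "\n";

// Stream each row straight to the client; nothing accumulates in memory.
while ($row = $result->fetch_assoc()) {
    echo '<Row>';
    foreach ($row as $value) {
        echo '<Cell><Data ss:Type="String">'
           . htmlspecialchars($value)
           . '</Data></Cell>';
    }
    echo '</Row>' . "\n";
}

echo '</Table></Worksheet></Workbook>';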
So, I searched around here but couldn't find anything good; apologies if my search-fu is insufficient...
So, what I have today is that my users upload a CSV text file using a form to my PHP script, and then I import that file into a database, after validating every line in it. The text file can be up to about 70,000 lines long, and each line contains 24 fields of values. That is obviously not a small task when dealing with that kind of data. Every line needs to be validated, plus I check the DB for duplicates (according to a dynamic key generated from the data) to determine whether the data should be inserted or updated.
Right, but my clients are now requesting an automatic API for this, so they don't have to manually create and upload a text file. Sure, but how would I do it?
If I were to use a REST server, memory would run out pretty quickly if one request contained XML for 70k posts to be inserted, so that's pretty much out of the question.
So, how should I do it? I have thought about three options; please help me decide, or add more options to the list:
1) One post per request. Not all clients have 70k posts, but an update to the DB could result in the API handling 70k requests in a short period, and it would probably be daily either way.
2) X posts per request. Set a limit on the number of posts that the API deals with per request, say 100 at a time. That means 700 requests.
3) The API requires the client script to upload a CSV file ready to import using the current routine. This seems "fragile" and not very modern.
Any other ideas?
If you read up on SAX processing (http://en.wikipedia.org/wiki/Simple_API_for_XML) and HTTP chunked transfer encoding (http://en.wikipedia.org/wiki/Chunked_transfer_encoding), you will see that it should be feasible to parse the XML document while it is being sent.
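As a rough illustration of that combination in PHP - the element names and the insertRecord() helper below are placeholders, and the 8KB read size is arbitrary - the expat-based SAX functions let you feed the request body to the parser piece by piece as it arrives, so no full document ever sits in memory:

$current = array();
$field   = null;

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$current, &$field) {
        if ($name === 'post') { $current = array(); }   // a new record begins
        else                  { $field = $name; }       // a field inside the record
    },
    function ($p, $name) use (&$current, &$field) {
        if ($name === 'post') {
            insertRecord($current);   // placeholder: validate + insert/update one record
            $current = array();
        }
        $field = null;
    }
);
xml_set_character_data_handler($parser, function ($p, $data) use (&$current, &$field) {
    if ($field !== null) {
        $current[$field] = (isset($current[$field]) ? $current[$field] : '') . $data;
    }
});

// Feed the body to the parser in small chunks instead of loading it whole.
$in = fopen('php://input', 'r');
while (!feof($in)) {
    xml_parse($parser, fread($in, 8192), false);
}
xml_parse($parser, '', true);   // signal end of document
fclose($in);
xml_parser_free($parser);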
I have now solved this by imposing a limit of 100 posts per request, and I am using REST through PHP to handle the data. Uploading 36,000 posts takes about two minutes with all the validation.
First of all, don't use XML for this! Use JSON; it is faster than XML.
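For what it's worth, a small sketch of what such a batched JSON endpoint could look like; the 100-record cap mirrors the solution above, and the saveRecord() helper stands in for the existing validation/import routine:

// Client POSTs a JSON array of up to 100 records to this endpoint.
$records = json_decode(file_get_contents('php://input'), true);

if (!is_array($records) || count($records) > 100) {
    header('HTTP/1.0 400 Bad Request');
    echo json_encode(array('error' => 'expected a JSON array of at most 100 records'));
    exit;
}

foreach ($records as $record) {
    // Validate each record and insert/update exactly as the CSV import does.
    saveRecord($record);   // placeholder for the existing import routine
}

echo json_encode(array('imported' => count($records)));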
On my project I use an import from XLS. The file is very large, but the script works fine; the client just has to create files with the same structure for the import.
There is an array of numbers, divided into partitions containing the same number of elements (as the output of array_chunk()). They are written into separate files: file 1.txt contains the first chunk, 2.txt the second, and so on. Now I want these files to contain a different number of elements of the initial array. Of course, I can read them into one array and split it again, but that requires quite a large amount of memory. Could you please help me with a more efficient solution? (The number of files and the size of the last chunk are stored separately.) I have no other ideas...
Do you know what the different number is? If you do, then you can easily read the data in, and then whenever you fill a chunk, write the data out. In pseudo-code:
for each original file:
    for each record:
        add record to buffer
        if buffer is desired size:
            write new file
            clear buffer
write new file (whatever is left in the buffer)
Obviously you'll need to keep the new files separate from the old ones. And then, once you've rewritten the data, you can swap them out somehow. (I would personally suggest having two directories, and then renaming the directories after you're done.)
If you don't know what the size of your chunks should be (for instance, you want a specific number of files), then first do whatever work is needed to figure that out, then proceed with the original solution.
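A small PHP sketch of that buffer-and-rewrite loop, assuming one number per line in the chunk files; the directory names, file count and new chunk size are placeholders:

$oldDir       = 'chunks';
$newDir       = 'chunks_new';      // keep new files separate, swap dirs at the end
$fileCount    = 10;                // number of existing files (stored separately)
$newChunkSize = 500;               // desired elements per new file

$buffer  = array();
$outputN = 1;

for ($i = 1; $i <= $fileCount; $i++) {
    $in = fopen("$oldDir/$i.txt", 'r');
    while (($line = fgets($in)) !== false) {
        $buffer[] = rtrim($line, "\r\n");
        if (count($buffer) === $newChunkSize) {
            file_put_contents("$newDir/$outputN.txt", implode("\n", $buffer) . "\n");
            $buffer = array();
            $outputN++;
        }
    }
    fclose($in);
}

if (!empty($buffer)) {             // the last, possibly smaller, chunk
    file_put_contents("$newDir/$outputN.txt", implode("\n", $buffer) . "\n");
}
// Memory use stays bounded by $newChunkSize elements, never the whole array.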
I've got a report that can generate over 30,000 records if given a large enough date range. From the HTML side of things, a resultset this large is not a problem since I implement a pagination system that limits the viewable results to 100 at a given time.
My real problem occurs once the user presses the "Get PDF" button. When this happens, I essentially re-run the portion of the report that prints the data (the results of the report itself are stored in a 'save' table, so there's no need to re-run the data-gathering logic) and store the results in a variable called $html. Keep in mind that this variable now contains 30,000 records of data plus the HTML needed to format it correctly in the PDF. Once I've got this HTML string created, I pass it to TCPDF to try and generate the PDF file for the user. However, instead of generating the PDF file, it just craps out without an error message (the 'Generating PDF...' dialog disappears and the system acts like you never asked it to do anything).
Through tests, I've discovered that the problem lies in the size of the $html variable being passed in. If the report is under 3K records, it works fine. If it's over that, the HTML side of the report will print but not the PDF.
Helpful Info
PHP 5.3
TCPDF for PDF generation (also tried PS2PDF)
Script Memory Limit: 500 MB
How would you guys handle this scale of data when generating a PDF of this size?
Here is how I solved this issue: I noticed that some of the strings in my HTML output had slight encoding issues. I ran htmlentities on those particular strings as I was querying the database for them, and that cleared up the problem.
I don't know if this was what was causing your problem, but my experience was very similar - when I was trying to output a large HTML table, with about 80,000 rows, TCPDF would display the page header but nothing table-related. This behaviour was the same with different sets of data and different table structures.
After many attempts I started adding my own pagination - every 15 table rows, I would break the page and add a new table on the following page. That's when I noticed that every once in a while I would get blank pages between a lot of full and correct ones, and I realised there must be a problem with those particular subsets of data, which is how I discovered the encoding issue. It may be that you had something similar and TCPDF was not making it clear what your problem was.
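In case it helps, applying that amounts to something as small as this when building each row (the column name and charset arguments are just examples):

// Escape each value coming from the database before it goes into the table HTML.
$html .= '<td>' . htmlentities($row['description'], ENT_QUOTES, 'UTF-8') . '</td>';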
Are you using the writeHTML method?
I went through the performance recommendations here: http://www.tcpdf.org/performances.php
It says "Split large HTML blocks in smaller pieces;".
I found that if my blocks of HTML went over 20,000 characters the PDF would take well over 2 minutes to generate.
I simply split my HTML up into blocks and called writeHTML for each block, and it improved dramatically. A file that wouldn't generate in 2 minutes before now takes 16 seconds.
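A rough sketch of that splitting, assuming the report HTML can be assembled row by row; the 20,000-character threshold comes from the observation above, while the TCPDF setup, the $reportRows variable and the per-block <table> wrapper are assumptions:

$pdf = new TCPDF();
$pdf->AddPage();

$block = '';
foreach ($reportRows as $rowHtml) {            // each $rowHtml is one <tr>...</tr>
    $block .= $rowHtml;
    if (strlen($block) > 20000) {              // flush before the block gets too big
        $pdf->writeHTML('<table>' . $block . '</table>', true, false, true, false, '');
        $block = '';
    }
}
if ($block !== '') {                           // whatever is left over
    $pdf->writeHTML('<table>' . $block . '</table>', true, false, true, false, '');
}

$pdf->Output('report.pdf', 'D');               // send as a download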
TCPDF seems to be a native implementation of PDF generation in PHP. You may get better performance using a compiled library like PDFlib or a command-line app like htmldoc. The latter has the best chance of generating a large PDF.
Also, are you breaking the output PDF into multiple pages? I.e. does TCPDF know to take a single HTML document and cut it into multiple pages, or are you generating multiple HTML files for it to combine into a single PDF document? That may also help.
I would break the PDF into parts, just like pagination.
1) Have "Get PDF" button on every paginated HTML page and allow downloading of records from that HTML page only.
2) Limit the maximum number of records that can be downloaded. If the maximum limit reaches, split the PDF and let the user to download multiple PDFs.