So basically I'm trying to parse large xls/xlsx files in PHP (Laravel). The flow is a tad different than usual:
I need to grab the column headers from a specific row (provided by user) and return as array
I need to then parse xls file into an array
I need to iterate over the array, mapping the keys of each value to the column name (because the input is variable and the MySQL columns aren't), so I change e.g. [someKey => value] to [item => value] every time
I then use Laravel's insert() to batch-insert the array into the database
This is basically an API, so it needs to be reasonably fast on, say, 15k-row xls files. I tried two things so far, both having their issues:
Laravel Excel: there is no way to grab just one row, so it parses the entire file twice, once to return the headers and once to map the value keys.
Python openpyxl, called from the Laravel controller with exec(): while faster, it is somewhat harder to control, as mapping the array to specific keys gets a bit more complex.
Is there a better/faster way to do something like this? Laravel can't be changed, but everything else is fair game. I also have zero control over the input files, but the user has to define the row containing the headers.
This is running on a DigitalOcean droplet (LAMP stack, CPU-Optimized, 4 GB / 2 vCPUs) that hosts both the front end (React) and the back end (Laravel). The CPU goes to 100% immediately when I run the parse, which isn't a huge issue since it stays fairly stable, but I'm looking for a possibly better/faster way of doing this: specifically, a way to yank just one row so I can return the headers faster, and then maybe a better way to map the column names in the array.
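For illustration, a minimal sketch of one possible streaming approach, assuming a streaming reader such as box/spout 3.x is installed; the items table, the column names and the chunk size are placeholders, not part of the original setup:

<?php
// Sketch only: stream the spreadsheet once, treat the user-chosen row as the
// headers, re-key every following row with those headers, and batch-insert.
// Assumes box/spout 3.x; 'items' is a placeholder table name.

use Box\Spout\Reader\Common\Creator\ReaderEntityFactory;
use Illuminate\Support\Facades\DB;

function importSpreadsheet(string $path, int $headerRow): void
{
    $reader = ReaderEntityFactory::createReaderFromFile($path);
    $reader->open($path);

    $headers = [];
    $batch   = [];

    foreach ($reader->getSheetIterator() as $sheet) {
        foreach ($sheet->getRowIterator() as $rowNumber => $row) {
            $cells = $row->toArray();

            if ($rowNumber < $headerRow) {
                continue;                      // ignore everything above the header row
            }
            if ($rowNumber === $headerRow) {
                $headers = $cells;             // e.g. ['item', 'price', ...]
                continue;
            }

            // Re-key the positional cells with the user-defined header names.
            $cells   = array_pad(array_slice($cells, 0, count($headers)), count($headers), null);
            $batch[] = array_combine($headers, $cells);

            if (count($batch) >= 1000) {       // keep each insert payload small
                DB::table('items')->insert($batch);
                $batch = [];
            }
        }
        break;                                 // only the first sheet is relevant here
    }

    if ($batch !== []) {
        DB::table('items')->insert($batch);
    }

    $reader->close();
}

Because a streaming reader hands back one row at a time, memory stays roughly flat even on 15k-row files, and the header row is available as soon as it is reached rather than after a full parse.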
The application I am working on needs to obtain a dataset of around 10 MB at most twice an hour. We use that dataset to display paginated results on the site; a simple search by one of the object properties should also be possible.
Currently we are thinking about two different ways to implement this:
1.) Store the JSON dataset in the database or as a file in the file system, read it, and loop over it to display results whenever we need to.
2.) Store the JSON dataset in a relational MySQL table, query the results, and loop over them whenever we need to display them.
Replacing/refreshing the results has to be done multiple times per hour, as I said.
Both ways have cons, and I am trying to choose the one that is less evil overall. Reading 10 MB into memory is not a lot, but on the other hand rewriting a table a few times an hour could produce conflicts, in my opinion.
My concern regarding 1.) is how safe the app will be if we read 10 MB into memory all the time. What happens if multiple users do this at the same time: is this something to worry about, or is PHP able to handle it in the background?
What do you think it will be best for this use case?
Thanks!
When php runs on a web server (as it usually does) the server starts new php processes on demand when they're needed to handle concurrent requests. A powerful web server may allow fifty or so php processes. If each of them is handling this large data set, you'll need enough RAM for fifty copies. And you'll need to load that data somehow for each new request. Reading 10 MB from a file is not an overwhelming burden unless you have some sort of parsing to do. But it is a burden.
As it starts to handle each request, php offers a clean context to the programming environment. php is not good at maintaining in-RAM context from one request to the next. You may be able to figure out how to do it, but it's a dodgy solution. If you're running on a server that's shared with other web applications -- especially applications you don't trust -- you should not attempt to do this; the other applications will have access to your in-RAM data.
You can control the concurrent processes with Apache or nginx configuration settings, and restrict it to five or ten copies of php. But if you have a lot of incoming requests, those requests get serialized and they will slow down.
Will this application need to scale up? Will you eventually need a pool of web servers to handle all your requests? If so, the in-RAM solution looks worse.
Does your json data look like a big array of objects? Do most of the objects in that array have the same elements as each other? If so, that maps naturally onto a SQL table. You can make a table in which the columns correspond to the elements of your objects. Then you can use SQL to avoid touching every row -- every element of each array -- every time you display or update data.
(The same sort of logic applies to Mongo, Redis, and other ways of storing your data.)
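To make that concrete, here is a rough sketch of flattening such an array of objects into a MySQL table; the items table and the id/name/price columns are invented names, and the refresh is done as an upsert so the table never has to be rewritten wholesale:

<?php
// Illustrative sketch only: load the JSON dump and upsert it into a table whose
// columns mirror the object fields. Table and column names are placeholders.

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$objects = json_decode(file_get_contents('dataset.json'), true);

$stmt = $pdo->prepare(
    'INSERT INTO items (id, name, price) VALUES (:id, :name, :price)
     ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)'
);

$pdo->beginTransaction();
foreach ($objects as $object) {
    $stmt->execute([
        ':id'    => $object['id'],
        ':name'  => $object['name'],
        ':price' => $object['price'],
    ]);
}
$pdo->commit();

// Display and search then become plain SQL instead of an in-memory loop, e.g.
// SELECT id, name, price FROM items WHERE name LIKE :q LIMIT 20 OFFSET :offset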
We have an application that calls an API every 4 hours and gets a dump of all objects, returned in JSON format, which is then stored in a file.json.
The reason we do this is that we need up-to-date data, we are not allowed to use the API directly to fetch small portions of this data, and we also need to do a clean-up on it.
There is also another problem: we can't request only the updated records (which is actually what we need).
The way we currently handle this is by getting the data, storing it in a file, loading the previous file into memory, and comparing the values to extract only the new and updated records; once we have those, we insert them into MySQL.
I am currently looking into a different option: since the new file will contain every single record, why not query the needed objects from the file.json when needed?
The problem with that is that some of these files are larger than 50 MB (each file contains one of the related tables, six files in total making up the full relation), and we can't load them into memory every time there is a query. Does anyone know of a DB system that allows querying a file, or an easier way to replace the old data with the new with a quick operation?
I think the approach you're using already is probably the most practical, but I'm intrigued by your idea of searching the JSON file directly.
Here's how I'd take a stab at implementing this, having worked on a Web application that used a similar approach of searching an XML file on disk rather than a database (and, remarkably, was still fast enough for production use):
Sort the JSON data first. Creating a new master file with the objects reordered to match how they're indexed in the database will maximize the efficiency of a linear search through the data.
Use a streaming JSON parser for searches. This will allow the file to be parsed object-by-object without needing to load the entire document in memory first. If the file is sorted, only half the document on average will need to be parsed for each lookup.
Streaming JSON parsers are rare, but they exist. Salsify has created one for PHP.
Benchmark searching the file directly using the above two strategies. You may discover this is enough to make the application usable, especially if it supports only a small number of users. If not:
Build separate indices on disk. Instead of having the application search the entire JSON file directly, parse it once when it's received and create one or more index files that associate key values with byte offsets into the original file. The application can then search a (much smaller) index file for the object it needs; once it retrieves the matching offset, it can seek immediately to the corresponding JSON object in the master file and parse it directly. (A rough sketch of this idea follows at the end of this answer.)
Consider using a more efficient data format. JSON is lightweight, but there may be better options. You might experiment with
generating a new master file using serialize to output a "frozen" representation of each parsed JSON object in PHP's native serialization format. The application can then use unserialize to obtain an array or object it can use immediately.
Combining this with the use of index files, especially if they're generated as trees rather than lists, will probably give you about the best performance you can hope for from a simple, purely filesystem-based solution.
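As a hypothetical sketch of the index-file idea above, assuming the master file has been rewritten as one JSON object per line when the dump arrives, and with 'id' standing in for whatever key the application looks up by:

<?php
// Sketch only, not part of the original answer: one pass builds an index that
// maps a key to the byte offset of its object in the master file; lookups then
// seek straight to that offset instead of scanning the whole file.

function buildIndex(string $masterPath, string $indexPath): void
{
    $master = fopen($masterPath, 'rb');
    $index  = [];

    while (($offset = ftell($master)) !== false && ($line = fgets($master)) !== false) {
        $object = json_decode($line, true);
        if (isset($object['id'])) {            // 'id' is a placeholder key name
            $index[$object['id']] = $offset;
        }
    }
    fclose($master);

    // The index itself is tiny, so PHP's native serialization is good enough here.
    file_put_contents($indexPath, serialize($index));
}

function lookupById(string $masterPath, string $indexPath, $id): ?array
{
    $index = unserialize(file_get_contents($indexPath));
    if (!isset($index[$id])) {
        return null;
    }

    $master = fopen($masterPath, 'rb');
    fseek($master, $index[$id]);               // jump directly to the stored offset
    $object = json_decode(fgets($master), true);
    fclose($master);

    return $object;
}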
I ended up doing my own processing method.
I got a JSON dump of all records, which I then processed into single files, each one holding all of its related records, kind of like a join. To keep the indexing of these files from getting long, I created multiple subfolders, each for a block of records. While creating these files I also built an index file, which points to the directory location of each record and is tiny. Now, every time there is a query, I just load the index file into memory (it is under 1 MB) and check whether the index key, which is the master key of the record, exists; if it does, I have the location of the file, which I then load into memory, and it contains all the information the application needs.
Querying these files ended up being a lot faster than querying the DB, which works for what we need.
Thank you all for your input as it helped me decide which way to go.
For exporting data as a PDF, I need to query a dataset from the database, loop through it to manage indexes, and then export it as a PDF. When querying, the dataset can be shaped in different ways with joins and other mechanisms. The confusion is that, depending on the structure of the dataset, this can be done within a single loop or with multiple loops (not nested). What is the performance effect of these two approaches?
For example: the dataset contains "organizations" and their "inquiries", and "inquiries" has different types. In this situation, I can query to retrieve the dataset as a whole and use one loop to go through it, or retrieve the dataset in sections and then use multiple loops to go through them.
What is the performance impact of these two scenarios?
Thanks in advance
By querying only once per dataset you can bypass the extra latency that occurs when sending additional queries to the database (meaning the time MySQL needs to execute each task; if the database is on the local server, the additional network latency is of course very low).
Each time you query the database, the query itself has to be
sent to the database
parsed by the database
analysed and optimised
executed
and the result set has to be returned. So performance-wise it is a good idea to avoid any additional round trips to the database. (A small sketch of the difference follows below.)
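For illustration, a sketch of the difference; the organizations and inquiries tables and their columns are invented names, not taken from the question:

<?php
// Sketch only. One round trip lets MySQL do the join; the rows are then grouped
// in PHP. Issuing a separate query per organization repeats the whole
// send/parse/optimise/execute cycle N extra times.

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

$rows = $pdo->query(
    'SELECT o.id, o.name, i.type, i.subject
       FROM organizations o
       LEFT JOIN inquiries i ON i.organization_id = o.id'
)->fetchAll(PDO::FETCH_ASSOC);

$byOrganization = [];
foreach ($rows as $row) {
    $byOrganization[$row['id']]['name']        = $row['name'];
    $byOrganization[$row['id']]['inquiries'][] = ['type' => $row['type'], 'subject' => $row['subject']];
}

// The alternative (running SELECT ... FROM inquiries WHERE organization_id = ?
// inside a loop over organizations) pays the per-query overhead once per organization.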
edit: https://stackoverflow.com/a/2215625/3595565 how mysql works
edit2:
I guess it depends on what you want to do with each of those arrays. I would probably treat them separately, as they all handle different things.
In this scenario performance should not be the important factor, as it doesn't matter whether you iterate once over an array with 9k elements or iterate 3 times over arrays with 3k elements each.
You should set your focus on clear communication via code. Separate the responsibilities of the entities.
Let's take your example and say that for each element in your three arrays you want to send an email, and all the data needed (from, to, subject, etc.) is already in your arrays. The presumption is that the data in the three arrays needs to be treated differently:
$tableAR = array('Incident'   => $incidentAR,
                 'Prevention' => $preventionAR,
                 'Undefined'  => $undefinedAR);

foreach ($tableAR['Incident'] as $incident) {
    sendIncidentmail($incident);
}
// same for prevention and any other array; write your code in a way some random programmer can understand what is happening
Of course you could create a single function that looks at the element, checks its type and then decides what kind of mail to send, but it would be a lot less readable at the call site, and the function itself would have to be bigger in order to decide what it has to do with the given data.
$tableAR = array('Incident'   => $incidentAR,
                 'Prevention' => $preventionAR,
                 'Undefined'  => $undefinedAR);

$completeArray = array_merge($tableAR['Incident'], $tableAR['Prevention'], $tableAR['Undefined']);

foreach ($completeArray as $element) {
    sendMail($element);   // sendMail() now has to inspect $element to decide which mail to send
}
In another question, I stated that I was running a php script that grabbed 150 rows of data from a mysql database, then did some calculations on that data, which it put into an array (120 elements with an array of 30 for each, or roughly 3600 elements total). The "results" array is needed, because I create a graph with the data. This script works fine.
I wanted to expand my script to a denser dataset (which would provide better results). The dataset is 1700 rows, which would end up with a "results" array of 1340 elements with an array of 360 for each, or roughly 482,400 elements total. The problem is, I've tried this and ran into some heinous memory errors.
As described to me in the previous question I posted, the size of that results array is probably overwhelming the server's memory:
In your second, larger sample it will be array(1700,1699). At 144 bytes per element that's 415,915,200 bytes, roughly 400 MB plus remaining bucket space, just to hold the results of your calculations.
I am not familiar with the typical ways to deal with this issue. For the larger set of data, I was considering serializing and base64_encode'ing each of the 1340 result array elements as the script runs (or every 10 or 20, since 1340 DB calls might be too much), uploading them to a SQL server, and unsetting the results array to free up memory. I could then build my report and my graph by querying the DB for the specific information, rather than having it ALL in one huge array.
Any other way of doing this?
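For reference, a rough sketch of the chunked-flush idea described above; computeRow(), the calc_results table, the payload column and the flush size of 20 are all invented placeholders:

<?php
// Illustrative sketch: compute the results row by row, periodically write a
// chunk to a scratch table, and free the memory held by that chunk.

function computeRow(int $i): array
{
    // Placeholder for the real per-row calculation (~360 values per row).
    return array_fill(0, 360, $i * 0.5);
}

$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$stmt = $pdo->prepare('INSERT INTO calc_results (row_index, payload) VALUES (:i, :payload)');

$buffer = [];
for ($i = 0; $i < 1340; $i++) {
    $buffer[$i] = computeRow($i);

    if (count($buffer) === 20 || $i === 1339) {    // flush every 20 rows, and at the end
        foreach ($buffer as $index => $row) {
            $stmt->execute([':i' => $index, ':payload' => base64_encode(serialize($row))]);
        }
        $buffer = [];                              // release the chunk's memory
    }
}

// The report/graph step can then SELECT only the rows it needs instead of
// holding all ~482,400 values in one PHP array.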
You should probably use Hadoop MapReduce and/or other such technologies when dealing with large sets of data, and most of the processing you want to do on the data should be a batch process. The results should be put somewhere else, such as another database. You will then only need to query that database; your application will become much faster and you will not run into memory problems.
The easiest and fastest way is probably to continue to use your in-memory array solution and figure out how to solve the memory issues. What are the memory errors you have been encountering?
If you have over 1 GB of RAM, that should be enough to generate your graph. With 1 GB of RAM you can set the memory_limit PHP configuration option to 750 MB. You could only generate it with one process at a time, so you would need to generate it once and use some method to cache the results.
If you don't have enough RAM on your current system, I suggest trying Amazon EC2: you can get a 16 GB machine for about 7 cents an hour on the spot market, which you can stop and start whenever you need to generate the graphs.
Can you provide more specifics on your use case? How many distinct graphs do you need to service? How frequently will the underlying data change? How many concurrent users do you need to serve? Are you actually trying to plot 2 million elements on a single chart?
In the absence of specifics, I would note/recommend some combination of the following:
Build your graphs offline and cache them.
Use a web-based solution to offload all querying and chart generation (Google Charts + Google Fusion Tables).
Use a backend process to do the analysis and generate the graphs, only expose the end result to the client. Check out R and http://www.rstudio.com/shiny/
I have a MySQL table with about 9.5K rows; these won't change much, but I may slowly add to them.
I have a process where, if someone scans a barcode, I have to check whether that barcode matches a value in this table. What would be the fastest way to accomplish this? I should mention that there is no pattern to these values.
Here are some thoughts:
Ajax call to a PHP file that queries the MySQL table (my thought is this would be the slowest).
Load this MySQL table into an array on login. Then, when scanning, make an Ajax call to a PHP file to check the array.
Load this table into an array on login. When viewing the scanning page, somehow load that array into a JavaScript array and check it with JavaScript. (This seems to me to be the fastest because it eliminates the Ajax call and the MySQL query. Would it be efficient to split it into smaller arrays so I don't lag the server and browser?)
Honestly, I'd never load the entire table for anything. All I'd do is make an AJAX request back to a PHP gateway that then queries the database and returns the result (or nothing). It can be very fast (as it only depends on the latency) and you can cache that result heavily (via memcached or something like it).
There's really no reason to ever load the entire array for "validation"...
It is much faster to use a well-indexed MySQL table than to look through an array for something.
But in the end it all depends on what you really want to do with the data.
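A minimal sketch of the kind of PHP gateway described above; the barcodes table, the code column and the file name are assumptions:

<?php
// barcode_check.php: hypothetical AJAX endpoint that looks the scanned code up
// in an indexed column and returns a tiny JSON answer. Names are placeholders.

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$code = $_GET['code'] ?? '';

$stmt = $pdo->prepare('SELECT 1 FROM barcodes WHERE code = :code LIMIT 1');
$stmt->execute([':code' => $code]);

header('Content-Type: application/json');
echo json_encode(['exists' => (bool) $stmt->fetchColumn()]);

The round trip is just the request latency plus a single indexed lookup, and the response can be cached (e.g. in memcached) if the same codes are scanned repeatedly.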
As you mention, your table contains around 9.5K rows. There is no reason to load that data on the login or scanning page.
Better to index your table and make an Ajax call whenever required.
Best of Luck!!
While 9.5K rows is not that much, the corresponding amount of data would still need some time to transfer.
Therefore, and in general, I'd propose running the validation of values on the server side. AJAX is the right technology to do this quite easily.
Loading all 9.5K rows only to find one specific row is definitely a waste of resources. Run a SELECT query for the single value.
Exposing PHP functionality on the client side / AJAX
Have a look at the xajax project, which allows exposing whole PHP classes or single methods as AJAX methods on the client side. Moreover, xajax helps with the exchange of parameters between client and server.
Indexing the searched attributes
Please ensure that the column which holds the barcode value is indexed. If the verification process tends to be slow, look out for MySQL table scans.
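For example (the barcodes table and code column are assumed names), the index is created once, e.g. in a migration or directly against MySQL:

<?php
// One-off sketch: add an index on the column that holds the barcode values so
// lookups avoid a full table scan. Table and column names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$pdo->exec('CREATE INDEX idx_barcodes_code ON barcodes (code)');

// EXPLAIN shows whether the index is actually used:
// EXPLAIN SELECT 1 FROM barcodes WHERE code = 'ABC123';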
Avoiding table scans
To avoid table scans and keep your queries running fast, use fixed-size fields. VARCHAR(), among other types, makes queries slower, since rows no longer have a fixed size, and tables without fixed-size rows prevent the database from easily predicting the location of the next row of the result set. Therefore, use e.g. CHAR(20) instead of VARCHAR().
Finally: Security!
Don't forget that any data transferred to the client side may expose sensitive information. While your 9.5K rows may not get rendered by the client's browser, the rows would still exist in the generated HTML page. Using "Show source", any user would be able to figure out all valid numbers.
Exposing valid barcode values may or may not be a security problem in your project context.
PS: While not related to your question, I'd propose using PHPExcel for reading or writing spreadsheet data. Unlike other solutions, e.g. a PEAR-based framework, PHPExcel has no external dependencies.