I'm working on a cron script that hits an API, receives a JSON file (a large array of objects) and stores it locally. Once that is complete, another script needs to parse the downloaded JSON file and insert each object into a MySQL database.
I'm currently using file_get_contents() along with json_decode(). This attempts to read the whole file into memory before processing it. That would be fine except that my JSON files will usually range from 250MB to 1GB+. I know I can increase my PHP memory limit, but that doesn't seem like the best answer to me. I'm aware that I can use fopen() and fgets() to read the file in line by line, but I need to read the file in by each JSON object.
Is there a way to read in the file per object, or is there another similar approach?
Try this library: https://github.com/shevron/ext-jsonreader
The existing ext/json which is shipped with PHP is very convenient and simple to use - but it is inefficient when working with large amounts of JSON data, as it requires reading the entire JSON data into memory (e.g. using file_get_contents()) and then converting it into a PHP variable at once - for large data sets, this takes up a lot of memory.
JSONReader is designed for memory efficiency - it works on streams and can read JSON data from any PHP stream without loading the entire data into memory. It also allows the developer to extract specific values from a JSON stream without decoding and loading all data into memory.
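A hedged sketch of how JSONReader is used, based on the project's README (the constants and property names below come from that README, so double-check them against the version you build):

<?php
// Pull-style parsing: tokens are read one at a time from a stream,
// so memory use stays flat regardless of file size.
$reader = new JSONReader();
$reader->open('file:///tmp/data.json'); // any PHP stream URL should work

while ($reader->read()) {
    switch ($reader->tokenType) {
        case JSONReader::OBJECT_START:
            // begin collecting fields for one object
            break;
        case JSONReader::KEY:
            $key = $reader->value;
            break;
        case JSONReader::VALUE:
            // handle $key => $reader->value
            break;
        case JSONReader::OBJECT_END:
            // one full object seen: insert it into MySQL here
            break;
    }
}

$reader->close();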
This really depends on what the JSON files contain.
If opening the file into memory in one shot is not an option, your only other option, as you alluded to, is fopen()/fgets().
Reading line by line is possible, and if these JSON objects have a consistent structure, you can easily detect when a JSON object in the file starts and ends.
Once you collect a whole object, you insert it into the db, then move on to the next one.
There isn't much more to it. The algorithm to detect the beginning and end of a JSON object may get complicated depending on your data source, but I have done something like this before with a far more complex structure (XML) and it worked fine.
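To make the idea concrete, here is a minimal sketch, assuming the file is a top-level array of objects. It tracks brace depth to find object boundaries; data.json and insertIntoDb() are placeholders, and a production version would also need to ignore braces inside quoted strings:

<?php
$handle = fopen('data.json', 'r');
$buffer = '';
$depth  = 0;

while (($chunk = fgets($handle)) !== false) {
    for ($i = 0, $len = strlen($chunk); $i < $len; $i++) {
        $char = $chunk[$i];
        if ($char === '{') {
            $depth++;
        }
        if ($depth > 0) {
            $buffer .= $char; // collect everything inside the object
        }
        if ($char === '}' && --$depth === 0) {
            // One complete object collected: decode and store it.
            $object = json_decode($buffer, true);
            if ($object !== null) {
                insertIntoDb($object); // placeholder for your DB insert
            }
            $buffer = '';
        }
    }
}

fclose($handle);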
Best possible solution:
Use some sort of delimiter (pagination, timestamp, object ID etc) that allows you to read the data in smaller chunks over multiple requests. This solution assumes that you have some sort of control of how these JSON files are generated. I'm basing my assumption on:
This would be fine except for the fact that my JSON files will usually
range from 250MB-1GB+.
Reading in and processing 1GB of JSON data is simply ridiculous. A better approach is most definitely needed.
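If you do control the producer, the consuming side can stay as simple as this hypothetical sketch (the URL and the page/per_page parameters are assumptions; the delimiter could just as well be a timestamp or the last-seen object ID):

<?php
$page = 1;
do {
    // Fetch one manageable chunk per request instead of one giant file.
    $url     = sprintf('https://api.example.com/records?page=%d&per_page=1000', $page);
    $records = json_decode(file_get_contents($url), true);

    foreach ($records as $record) {
        insertIntoDb($record); // placeholder for your DB insert
    }

    $page++;
} while (!empty($records));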
Related
I'm going to be using a JSON file to contain a list of links to a few posts, and it can be updated at any time. However, I'm stumped about which mode to use with PHP's fopen() function. This will be a flat-file database, primarily for me to learn to work with files, PHP, and JSON before moving on to a proper relational database (that, and it's not a huge collection of pages, so I'm not worried about needing SQL or anything like that yet...)
The process I'm using is that once a blog post is typed up, it will create a directory, save a new index.php file to it with all of the stuff that lets me view the page, and then, where I'm currently stuck, update a JSON file with the Title, Author, Date, and link to the newly created page.
Based on the PHP Manual, there are three modes I might want to use: r+, w+, or a+.
The process I am looking to use is to take the JSON file and place the data into an array. Update the array, then save it back to the file.
a+ places the pointer at the end of the file and writes are always appended, so I'm assuming this is the worst choice for this situation since I wouldn't add a new JSON entry at the end of the file (I'm tempted to actually insert any new data at the beginning of the JSON object instead of at the end).
w+ mentions read and write, but also truncating the file - does this happen upon saving data to the file, or does this happen the moment the file is opened? If I used this mode on an existing JSON file, would I then be reading a blank file before I can even modify the array and re-save it to the object?
r+ mentions placing the pointer at the beginning of the file - does saving data overwrite what's there or will it insert the data BEFORE what's existing there? If it inserts, how would I manually clear the file and then save the newly-modified array to the JSON object?
Which of those modes are best suited for what I'm looking to do? Is there a better way of doing this, anyway?
If you're always reading or writing an entire file, you don't have to work with file handles at all - PHP provides a pair of functions file_get_contents($file_name) and file_put_contents($file_name, $content) which are much simpler to work with.
File handles with their various modes are most useful when you're working with parts of files. For instance, if you are using CSV, you can read or write one line at a time, without having the full set of data in memory at once. Or, with binary file formats, you might know the location in the file you want to read from, and can "seek" the file handle to that location.
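A quick sketch of that CSV case (the file name is a placeholder): fgetcsv() returns one parsed row per call, so only a single line is ever held in memory:

<?php
$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    // process one row at a time; $row is an array of fields
    echo $row[0], PHP_EOL;
}
fclose($fh);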
You should probably read the entire file first (e.g. with file_get_contents()), and then open it with w+ to write the new data. (Edit: or rather, as the other answer points out, use file_put_contents(), which is always simpler when you are only making one write operation.)
r+ will overwrite as much of the file as you are writing, but won't erase beyond that. If your data always increases in size, this should be the same as overwriting the file entirely, but even if it's true now, that's an assumption that will likely mess up your data in the future.
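A minimal sketch of that read-modify-write cycle (the file name and entry fields are made up for illustration). Because file_put_contents() truncates and rewrites in one step, no stale bytes from a longer previous version can survive, which sidesteps the r+ pitfall entirely:

<?php
$file  = 'posts.json';
$posts = json_decode(file_get_contents($file), true) ?: [];

// Prepend the new entry (newest first, as you're tempted to do).
array_unshift($posts, [
    'title'  => 'My New Post',
    'author' => 'Me',
    'date'   => date('Y-m-d'),
    'link'   => '/posts/my-new-post/',
]);

// Truncate-and-rewrite in a single call.
file_put_contents($file, json_encode($posts, JSON_PRETTY_PRINT));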
Here is the scenario:
I have a variable in PHP that holds the raw contents of an Excel file (it can also be in PDF format), and I want to parse that variable for a certain value. The keyword I'm looking for is near the end of the file's contents, and I need to extract some of the content near the desired value into a PHP variable so I can output it to my webpage. From what I know, the file is binary; its ASCII rendering shows up as readable text mixed with diamond characters (with a question mark), bordered rectangles, and other extraneous characters.
Here are the requirements:
I don't want to parse the contents of the file by first storing or saving it on disk. I want to parse the contents of the retrieved file directly from the PHP variable.
Here is my question:
How do I go about this? Should I rely upon PHPExcel to read this content if possible? If not, what php libraries can accomplish this task?
Should I rely upon PHPExcel to read this content if possible?
It is not possible (see below).
If not, what PHP libraries can accomplish this task?
None that I know of.
How do I go about this?
An Excel file (rather, an Excel 2007+ XLSX file - Excel 97/2003 XLS files are a wholly different can of worms) is a ZIP archive containing XML and other files in a tree structure. So your first stage is to decompress a ZIP archive held in a string; PHPExcel relies on the ZipArchive class, and this, in turn, does not support reading from a string and also bypasses most stream hacks. A similar problem - actually exactly the same problem - is described in this question.
You could think of using stream wrapping to decode the file from a string, and the first part - the reading - would work. The writing of the files would not. And you cannot modify the ZipArchive class so that it writes to a memory object, because it is a native class.
So you can employ a slight variation, from one of the answers above (the one by toster-cx). You need to decode the ZIP structure yourself, and thus get the offset in the ZIP file where the file you need begins. This will either be /xl/worksheets/sheet1.xml or /xl/sharedStrings.xml, depending on whether the string has been inlined by Excel, or not. This also assumes that the format is the newer XLSX. Once you have that, you can extract the data from the string and decompress it, then search it for the token.
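A rough sketch of that decoding step, assuming a simple XLSX where the sizes are present in the local file headers (real archives can use data descriptors and other ZIP features this deliberately ignores):

<?php
// Walk the ZIP local file headers held in $zip (a string) and inflate
// the member named $name directly, without touching the disk.
function extractZipMember(string $zip, string $name): ?string
{
    $offset = 0;
    while (($offset = strpos($zip, "PK\x03\x04", $offset)) !== false) {
        // 26 header bytes follow the 4-byte local-file-header signature.
        $header = unpack(
            'vversion/vflags/vmethod/vmtime/vmdate/Vcrc/VcompSize/VuncompSize/vnameLen/vextraLen',
            substr($zip, $offset + 4, 26)
        );
        $fileName  = substr($zip, $offset + 30, $header['nameLen']);
        $dataStart = $offset + 30 + $header['nameLen'] + $header['extraLen'];

        if ($fileName === $name) {
            $data = substr($zip, $dataStart, $header['compSize']);
            // Method 8 = deflate, method 0 = stored uncompressed.
            return $header['method'] === 8 ? gzinflate($data) : $data;
        }
        $offset = $dataStart + $header['compSize'];
    }
    return null;
}

// $sheetXml = extractZipMember($rawContents, 'xl/worksheets/sheet1.xml');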
Of course, a more efficient use of your time would be to determine exactly why you don't want to use temporary files. Maybe that problem can be solved another way.
Speed problem
Actually, reading/writing an Excel file is not so terrible, because in this case you don't need to do that. You can almost certainly consider it a Zip file, and open it using ZipArchive and getStream() to directly access the internal sub-file you're interested in. This operation will be quite fast, also because you can run the search from the getStream() read cycle. You do need to write the file once, but nothing more.
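Sketched out, with the member path and search token as assumptions, that looks roughly like this (note the small overlap carried between reads so a token split across chunk boundaries is still found):

<?php
$tmp = tempnam(sys_get_temp_dir(), 'xlsx');
file_put_contents($tmp, $rawContents); // write the upload to disk once

$zip = new ZipArchive();
if ($zip->open($tmp) === true) {
    // Stream the inner XML instead of extracting it to disk.
    $stream = $zip->getStream('xl/sharedStrings.xml');
    $carry  = '';
    while (!feof($stream)) {
        $chunk = $carry . fread($stream, 8192);
        if (strpos($chunk, 'InvoiceTotal') !== false) {
            // keyword found: extract the nearby value here
            break;
        }
        $carry = substr($chunk, -32); // overlap between reads
    }
    fclose($stream);
    $zip->close();
}
unlink($tmp);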
In fact, chances are that you can write the file while it is being uploaded (what do you use for Web upload? The plupload JS library has a very nice hook to capture very large files one chunk at a time). You still need a temporary area on the disk where to store the data, but in this case the time expenditure will be exclusively dedicated to the decompression and reading of the XML sub-file - the same thing you'd have needed to do with a string object.
It is also (perhaps, depending on several factors, mainly the platform and operating system) possible to offload this part of the work to a secondary process running in the background, so that the user sees the page reload immediately, while the information appears after a while. This part, however, is pretty tricky and can rapidly turn into a maintenance nightmare (yeah, I do have first-hand experience on this. In my case it was tiled image conversion).
Cheating
OK, fact is I love cheating; it's so efficient. You say that you control the XLSX and PDF being created? Well! It turns out that in both cases, you can add hidden metadata to the file. And those metadata are much more easily read than you might think.
For example, you can add zip archive comments to a XLSX file, since it is a Zip file. Actually you could add a fake file with zero length to the archive, call it INVOICE_TOTAL_12345.xml, and that would mean that the invoice total is 12345. The advantage is that the file names are stored in the clear inside the XLSX file, so you can just use preg_match and look for INVOICE_TOTAL_([0-9]+)\.xml and retrieve your total.
The same goes for PDF. You can store keywords in a PDF: just add a keyword attribute named "InvoiceTotal" (check the PDF to see how that turns out). But there is also a PDF ID inside the PDF, and that ID will be at the very end of the file. It will be something like /ID [<ec144ea3ecbb9ab8c22b413fec06fe29><ec144ea3ecbb9ab8c22b413fec06fe29>], but if you use a known sequence such as deadbeef, then ec144ea3ecbb9ab8c22deadbeef12345 will, again, mean the total is 12345. The part of the ID before the known sequence remains random, so the overall ID will still be random and valid.
In both cases you could now just look for a known token in the string, exactly as requested.
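With the fake-file trick, retrieving the value is a one-liner over the raw string, since ZIP member names are stored uncompressed (the marker name is the hypothetical one from the example above):

<?php
// $rawContents holds the raw XLSX bytes from the question.
if (preg_match('/INVOICE_TOTAL_([0-9]+)\.xml/', $rawContents, $m)) {
    $invoiceTotal = (int) $m[1]; // 12345 in the example above
}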
I tried to load a 16MB file into a PHP array.
It ends up using about 63MB of memory.
Loading it into a string consumes just the 16MB, but I need it in an array so I can access it faster afterwards.
The file consists of about 750k lines (routing table dump).
I should probably load it into a MySQL database, but the issue there is not enough memory to run that thing, so I chose rqlite: https://github.com/rqlite/rqlite, since I also need the replication features.
I am not sure if a SQLite database is fast enough for that.
Does anyone have an idea for this issue?
You can get the actual file here: http://data.caida.org/datasets/routing/routeviews-prefix2as/2018/07/routeviews-rv2-20180715-1400.pfx2as.gz
The code I used:
$data = file('routeviews-rv2-20180715-1400.pfx2as');
var_dump(memory_get_usage());
Thanks.
You may use PHP's fread() function. It reads a fixed number of bytes at a time, so it can be used inside a loop to read the file in sized blocks. It does not consume much memory and is suitable for reading large files.
If you want to sort the data, then you may want to use a database. You can read the large file one line at a time using fgets() (or in fixed-size blocks with fread()) and insert each line into the database.
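A minimal sketch of that, using PHP's built-in SQLite3 class (rqlite's HTTP API would work similarly). The pfx2as file is tab-separated prefix, length, ASN; the table layout here is just an assumption:

<?php
$db = new SQLite3('routes.db');
$db->exec('CREATE TABLE IF NOT EXISTS routes (prefix TEXT, length INTEGER, asn TEXT)');
$db->exec('BEGIN'); // one transaction makes bulk inserts dramatically faster

$stmt = $db->prepare('INSERT INTO routes VALUES (:prefix, :length, :asn)');
$fh = fopen('routeviews-rv2-20180715-1400.pfx2as', 'r');

while (($line = fgets($fh)) !== false) {
    // Each line: prefix <tab> prefix-length <tab> ASN
    [$prefix, $length, $asn] = explode("\t", trim($line));
    $stmt->bindValue(':prefix', $prefix);
    $stmt->bindValue(':length', (int) $length);
    $stmt->bindValue(':asn', $asn);
    $stmt->execute();
    $stmt->reset();
}

fclose($fh);
$db->exec('COMMIT');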
I am using file_get_contents() to get 1 million records from a URL and output the results, which are in JSON format. I can't use pagination, and I'm currently coping by increasing my memory limit. Is there any other solution for this?
If you're processing large amounts of data, fscanf will probably prove valuable and more efficient than, say, using file followed by a split and sprintf command. In contrast, if you're simply echoing a large amount of text with little modification, file, file_get_contents, or readfile might make more sense. This would likely be the case if you're using PHP for caching or even to create a makeshift proxy server.
More: The right way to read files with PHP
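A hedged sketch of the fixed-size-chunk idea applied to an HTTP source (the URL is a placeholder, and each chunk would still need to be fed to an incremental JSON handler or written to disk):

<?php
$fh = fopen('https://api.example.com/records', 'r');
while (!feof($fh)) {
    $chunk = fread($fh, 8192); // 8 KB at a time, never the whole body
    // hand $chunk to an incremental parser, or append it to a local file
}
fclose($fh);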
I have a large XML file (600mb+) and am developing a PHP application which needs to query this file.
My initial approach was to extract all the data from the file and insert it into a MySQL database - then query it that way. The only issue with this was that it was still slow, plus the XML data gets updated regularly - meaning I need to download, parse and insert data from the XML file into the database every time the XML file is updated.
Is it actually possible to query a 600mb file? (for example, searching for records where TITLE="something here"?) Is it possible to get it to do this in a reasonable amount of time?
Ideally would like to do this in PHP, though I could also use JavaScript too.
Any help and suggestions appreciated :)
Constructing an XML DOM for a 600+ MB document is definitely a way to fail. What you need is a SAX-based (streaming) API. SAX, though, does not usually allow XPath to be used, but you can emulate it with imperative code.
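In PHP the streaming counterpart is XMLReader (a pull parser rather than classic SAX, but with the same constant-memory property). A sketch, where <record> and <TITLE> stand in for the real element names:

<?php
$reader = new XMLReader();
$reader->open('data.xml');
$doc = new DOMDocument();

// Skip ahead to the first <record> element.
while ($reader->read() && $reader->name !== 'record');

while ($reader->name === 'record') {
    // Expand only this one record into a tiny DOM for easy querying.
    $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
    if ((string) $node->TITLE === 'something here') {
        // matching record found: process it here
    }
    $reader->next('record'); // jump to the next sibling, skipping children
}

$reader->close();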
As for the file being updated, is it possible to retrieve only differences anyhow? That would massively speed up subsequent processing.