Best way to manipulate large json objects - php

We have an application that calls an API every 4 hours and gets a dump of all objects, returned in JSON format, which is then stored in a file (file.json).
We do this because we need up-to-date data, we are not allowed to query the API directly for small portions of the data, and we also need to do a clean-up pass on it.
There is another problem: we can't request only the updated records (which is actually what we need).
The way we currently handle this is to fetch the data, store it in a file, load the previous file into memory, and compare the two in order to extract only the new and updated records; once we have those, we insert them into MySQL.
I am currently looking into a different option: since the new file will contain every single record, why not query file.json for the needed objects when they're needed?
The problem with that is that some of these files are larger than 50 MB (each file contains one of the related tables, six files in total making up the full relation), and we can't load them into memory every time there is a query. Does anyone know of a DB system that can query a file directly, or an easier way to replace the old data with the new in a quick operation?

I think the approach you're using already is probably the most practical, but I'm intrigued by your idea of searching the JSON file directly.
Here's how I'd take a stab at implementing this, having worked on a Web application that used a similar approach of searching an XML file on disk rather than a database (and, remarkably, was still fast enough for production use):
Sort the JSON data first. Creating a new master file with the objects reordered to match how they're indexed in the database will maximize the efficiency of a linear search through the data.
Use a streaming JSON parser for searches. This will allow the file to be parsed object-by-object without needing to load the entire document in memory first. If the file is sorted, only half the document on average will need to be parsed for each lookup.
Streaming JSON parsers are rare, but they exist. Salsify has created one for PHP.
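To make that concrete, here is a minimal sketch of a lookup built on the salsify/jsonstreamingparser package mentioned above. The listener and parser class names follow that library's documented interface as I understand it, so double-check them against the version you install; the id field and the flat list-of-objects layout are assumptions for the example.

<?php
// Sketch: stream file.json object by object and keep the record whose id matches.
// Assumes composer require salsify/jsonstreamingparser; class names follow its docs
// and may differ between versions. Assumes a flat array of flat objects.
require 'vendor/autoload.php';

use JsonStreamingParser\Listener\IdleListener;
use JsonStreamingParser\Parser;

class FindByIdListener extends IdleListener
{
    private $wantedId;
    private $currentKey;
    private $currentObject = [];
    public $found = null;

    public function __construct($wantedId)
    {
        $this->wantedId = $wantedId;
    }

    public function startObject(): void
    {
        $this->currentObject = [];
    }

    public function key($key): void
    {
        $this->currentKey = $key;
    }

    public function value($value): void
    {
        $this->currentObject[$this->currentKey] = $value;
    }

    public function endObject(): void
    {
        // 'id' stands in for whatever the real master key is.
        if (($this->currentObject['id'] ?? null) == $this->wantedId) {
            $this->found = $this->currentObject;
        }
    }
}

$stream   = fopen('file.json', 'rb');
$listener = new FindByIdListener(12345);
(new Parser($stream, $listener))->parse();
fclose($stream);

var_dump($listener->found); // null if the record isn't in the dump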
Benchmark searching the file directly using the above two strategies. You may discover this is enough to make the application usable, especially if it supports only a small number of users. If not:
Build separate indices on disk. Instead of having the application search the entire JSON file directly, parse it once when it's received and create one or more index files that associate key values with byte offsets into the original file. The application can then search a (much smaller) index file for the object it needs; once it retrieves the matching offset, it can seek immediately to the corresponding JSON object in the master file and parse it directly.
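As a rough sketch of the index idea in plain PHP: one pass records a byte offset per key, and a lookup then fseek()s straight to the matching object. It assumes the master file has been rewritten with one JSON object per line (NDJSON), which is the easiest layout to index by offset, and uses id as a placeholder for the real master key.

<?php
// Build an index mapping each record's key to its byte offset in the master file.
// Assumes one JSON object per line (NDJSON); 'id' is a placeholder for the master key.
function buildIndex(string $masterFile, string $indexFile): void
{
    $in    = fopen($masterFile, 'rb');
    $index = [];
    while (($offset = ftell($in)) !== false && ($line = fgets($in)) !== false) {
        $record = json_decode($line, true);
        if (isset($record['id'])) {
            $index[$record['id']] = $offset;
        }
    }
    fclose($in);
    file_put_contents($indexFile, serialize($index)); // the index itself stays small
}

// Look up one record without parsing the rest of the master file.
function findRecord(string $masterFile, string $indexFile, $id): ?array
{
    $index = unserialize(file_get_contents($indexFile));
    if (!isset($index[$id])) {
        return null;
    }
    $in = fopen($masterFile, 'rb');
    fseek($in, $index[$id]);                 // jump straight to the record
    $record = json_decode(fgets($in), true);
    fclose($in);
    return $record;
}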
Consider using a more efficient data format. JSON is lightweight, but there may be better options. You might experiment with
generating a new master file using serialize to output a "frozen" representation of each parsed JSON object in PHP's native serialization format. The application can then use unserialize to obtain an array or object it can use immediately.
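A minimal sketch of that conversion, done once when the 4-hourly dump arrives (whether it actually beats json_decode depends on your data and PHP version, so benchmark it):

<?php
// One-time conversion when the dump arrives: decode the JSON once and store
// PHP's native serialized form next to it.
$objects = json_decode(file_get_contents('file.json'), true);
file_put_contents('file.ser', serialize($objects));

// Later reads skip json_decode() entirely.
$objects = unserialize(file_get_contents('file.ser'));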
Combining this with the use of index files, especially if they're generated as trees rather than lists, will probably give you about the best performance you can hope for from a simple, purely filesystem-based solution.

I ended up doing my own processing method.
I got a JSON dump of all records, which I then processed into individual files, each one containing all of its related records (rather like a join). To keep any one directory from getting too large to index quickly, I spread these files across multiple subfolders, one per block of records. While creating the files I also built an index file that points to the directory location of each record; this index file is tiny, under 1 MB. Now, whenever there is a query, I load the index file into memory and check whether the index key (the master key of the record) exists. If it does, I have the location of the file, which I then load into memory; it contains all the information the application needs.
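Roughly, the lookup side works like the sketch below (simplified; the folder layout and index format shown are illustrative rather than the exact production code):

<?php
// Simplified sketch of the lookup. The index maps a record's master key to the
// subfolder/file holding that record together with all of its related data.
$index = json_decode(file_get_contents('/data/index.json'), true); // under 1 MB

function loadRecord(array $index, $masterKey): ?array
{
    if (!isset($index[$masterKey])) {
        return null;                            // record not in the dump
    }
    $path = '/data/' . $index[$masterKey];      // e.g. "block_0042/12345.json"
    return json_decode(file_get_contents($path), true);
}

$record = loadRecord($index, 12345);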
Querying these files ended up being a lot faster than querying the DB, which works for what we need.
Thank you all for your input as it helped me decide which way to go.

Related

MySQL + PHP: ~500kb JSON file - Loading data as tables and fields VS as a single serialized variable

I am making a website that interacts with an offline project through json files sent from the offline project to the site.
The site will need to load these files and manipulate the data.
Is it feasible with modern computing power to simply load these files into the database as a single serialized field, which can then be loaded and decoded for every use?
Or would it save significant overhead to properly store the JSON as tables and fields and refer to those for every use?
Without knowing more about the project, a table with multiple fields is probably the better solution.
There will be more options for the data in the long run: for example, indexing fields, searching through fields, and many other MySQL operations that would not be possible if it were all stored in a single variable.
Consider future versions of the project too: adding another field to a table is easy, for example, but adding another field to a block of JSON would be more difficult.
Think about project growth as well: if you experience 100x or 1000x growth, will the table handle the extra load?
500 kB is a relatively small data block, so there shouldn't be any issue with computing power regardless of which method is used, although more information would be handy here: for example, is it 500 kB per user or per upload, how many stores happen a day, and how often is it accessed?
Debugging will also be easier.
The new MySQL Shell has a bulk JSON loader that is not only very quick but also gives you a lot of control over how the data is handled. See https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-json.html
Load it as JSON.
Think about what queries you need to perform.
Copy selected fields into MySQL columns so that those queries can use MySQL's WHERE, GROUP BY, and ORDER BY instead of having to do the processing in the client.
A database table contains a bunch of similarly structured rows. Each row has a constant set of columns. (NULLs can be used to indicate missing columns for a given row.) JSON complicates things by providing a complex column. My advice above is a compromise between the open-ended flexibility of JSON and the need to use the database server to process lots of data. Further discussion here.
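As an illustration of that compromise, here is a sketch that keeps the raw document in a JSON column (MySQL 5.7+) and copies the couple of fields that queries actually filter or sort on into ordinary indexed columns; the table, column, and field names are invented for the example.

<?php
// Sketch of the "JSON column plus extracted columns" compromise described above.
// Table, column, and JSON field names are invented for the example; JSON columns
// require MySQL 5.7 or later.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS documents (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        created_at DATETIME NOT NULL,       -- copied out of the JSON for WHERE / ORDER BY
        category   VARCHAR(64) NOT NULL,    -- copied out of the JSON for GROUP BY
        payload    JSON NOT NULL,           -- the full document, untouched
        KEY idx_category_created (category, created_at)
    )
");

$doc  = json_decode(file_get_contents('incoming.json'), true);
$stmt = $pdo->prepare(
    'INSERT INTO documents (created_at, category, payload) VALUES (?, ?, ?)'
);
$stmt->execute([$doc['created_at'], $doc['category'], json_encode($doc)]);

// Queries now run server-side against the indexed columns.
$rows = $pdo->query(
    'SELECT category, COUNT(*) AS n
       FROM documents
      WHERE created_at >= NOW() - INTERVAL 30 DAY
      GROUP BY category
      ORDER BY n DESC'
)->fetchAll(PDO::FETCH_ASSOC);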

Fastest way to parse large xls files with PHP

So basically I'm trying to parse large xls/xlsx files in PHP (Laravel). The flow is a tad different than usual:
I need to grab the column headers from a specific row (provided by user) and return as array
I need to then parse xls file into an array
I need to iterate over the array, mapping the keys of each value to the column name (because input is variable and mysql columns aren't), so I'll change eg [someKey => value] to [item => value] every time
I then use Laravel's :insert() to batch insert the array into the database
This is basically an api, so it needs to be reasonably fast on eg 15k rows xls files. I tried two things so far, both having their issues:
Laravel Excel: there is no way to grab just one row, so it parses the entire file twice, once to return the headers and once to map the value keys.
Python's openpyxl, called from the Laravel controller with exec(): while faster, it seems somewhat harder to control, as mapping the array to specific keys gets a bit more complex.
Is there a better/faster way to do something like this? Laravel can't be changed, but everything else is fair game. I also have zero control over input files, but a user has to define the row containing headers.
This is running on a Digital Ocean droplet (LAMP stack, CPU-Optimized, 4 GB / 2 vCPUs) that hosts both the front end (React) and the back end (Laravel) on the same machine. The CPU goes to 100% immediately when I run the parse, which isn't a huge issue since it's fairly stable, but I'm looking for a possibly better/faster way of doing this: specifically, a way to yank just one row so I can return the headers faster, and then maybe a better way to map column names in the array.
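For what it's worth, here is a rough sketch of the two pain points (returning one header row quickly, then remapping keys) using phpoffice/phpspreadsheet directly, the library Laravel Excel wraps. It is not what either approach above does, so treat it as a third option to benchmark; the interface and method names follow PhpSpreadsheet's documentation, and the header-to-column mapping is illustrative.

<?php
// Rough sketch using phpoffice/phpspreadsheet directly. $headerRow comes from the
// user; the header-name to MySQL-column mapping is illustrative.
require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;
use PhpOffice\PhpSpreadsheet\Reader\IReadFilter;

class SingleRowFilter implements IReadFilter
{
    private $row;

    public function __construct(int $row)
    {
        $this->row = $row;
    }

    public function readCell($columnAddress, $row, $worksheetName = ''): bool
    {
        return $row === $this->row;              // only load the header row's cells
    }
}

// Step 1: return just the header row without parsing the whole sheet.
function readHeaderRow(string $path, int $headerRow): array
{
    $reader = IOFactory::createReaderForFile($path);
    $reader->setReadDataOnly(true);
    $reader->setReadFilter(new SingleRowFilter($headerRow));
    $rows = $reader->load($path)->getActiveSheet()->toArray();
    return $rows[$headerRow - 1];                // toArray() rows are 0-indexed
}

// Step 2: full parse, remapping each row's keys from the header names to the
// fixed MySQL column names, e.g. someKey => item.
function mapRows(string $path, int $headerRow, array $columnMap): array
{
    $reader = IOFactory::createReaderForFile($path);
    $reader->setReadDataOnly(true);
    $rows    = $reader->load($path)->getActiveSheet()->toArray();
    $headers = $rows[$headerRow - 1];

    $records = [];
    foreach (array_slice($rows, $headerRow) as $row) {
        $assoc  = array_combine($headers, $row); // header name => cell value
        $record = [];
        foreach ($columnMap as $header => $column) {
            $record[$column] = $assoc[$header] ?? null;
        }
        $records[] = $record;                    // ready for ::insert() in chunks
    }
    return $records;
}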

Zf2 Caching difference between $cache->addItem and $cache->setItem

I have 1000 objects, where each object needs a "key".
For example
$this->setItem("1", $object);
$this->setItem("2", $object);
My problem is that each time I use $this->setItem() or $this->addItem() Zend is creating a new folder with a .dat file.
I would like to have only one .dat file for all of the objects, while still being able to retrieve each one with $this->getItem("key").
Therefore, I am asking: what is the difference between these two functions, and could I achieve my goal with the addItem() function?
The purpose of caching is to retrieve your cached results in a fast way.
If ZF2 aggregated all your different cache keys (and their data) in a single file, it would be impossible to fetch your data quickly, because of all the expensive file searching, splitting, etc. that would need to happen.
Generating a single file for each cache key keeps this process simple. ZF2 creates an MD5 hash of the cache key and can directly retrieve the file with that name from the filesystem. The different directories you see are just substrings of the hash, so the number of directories stays limited.
setItem will always write data to the specified key (overwriting if data already exists).
addItem will only write data if there's no data present yet.
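A small sketch of that difference with the filesystem adapter (the factory config keys follow the ZF2 documentation; the cache_dir path is an example, and the serializer plugin is there because the filesystem adapter only stores strings natively):

<?php
// Sketch of setItem vs addItem with the ZF2 filesystem cache adapter.
// Assumes the usual ZF2 autoloading; cache_dir is an example path.
use Zend\Cache\StorageFactory;

$cache = StorageFactory::factory([
    'adapter' => [
        'name'    => 'filesystem',
        'options' => ['cache_dir' => '/tmp/app-cache'],
    ],
    'plugins' => ['serializer'],   // lets the adapter store objects/arrays, not just strings
]);

$objectA = (object) ['name' => 'A'];
$objectB = (object) ['name' => 'B'];

$cache->setItem('1', $objectA);            // writes key "1"
$cache->setItem('1', $objectB);            // overwrites key "1"; setItem always writes

var_dump($cache->addItem('1', $objectA));  // false: key "1" already holds data
var_dump($cache->addItem('2', $objectA));  // true: key "2" was empty, so it is written

$object = $cache->getItem('1');            // still one .dat file per key on disk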

Effective data storage when there is no limit on data size

I have a PHP application that takes objects from users and stores them on a server, similar to an upload/download service. An object could vary from just a string of text to any kind of media object like a movie, a song, etc. Even if simple text is sent, it could be big (probably an entire ebook). At present, the way I'm doing this is to write all this data to files, because files don't impose a size limit.
Edit: To clarify, I'm looking for a generic and efficient way for storing data, no matter what the format. For example, a user could send the string, "Hi, I am XYZ". I can store this using file operations like "fopen", "fwrite". If a user sends an MP3 file, I can again use "fwrite" and the data of the file will be written as is, and the MP3 format is not disturbed. This works perfectly at present, no issues. So "fwrite" is my generic interface here.
My question: Is there some better, efficient way to do this?
Thanking you for your help!
The answer to this question is rather complicated. You can definitely store such objects in the databases as LONGBLOB objects -- unless you are getting into the realm of feature-length movies (the size limit is 4 GB, since the length is 32 bits).
A more important question is how you are getting the objects back to the user. A "blob" object is not going to give you much flexibility in returning the results (it comes from a query in a single row). Reading from a file system might give more flexibility (such as retrieving part of a text file to see the contents).
Another issue is backup and recovery. Storing the objects in the database will greatly increase the database size, requiring that much more time to restore a database. In the event of a problem, you might be happy to have the database back (users can see what objects they have) before they can actually access the objects.
On the other hand, it might be convenient to have a single image of the database and objects, for moving around to, say, a backup server. Or, if you are using replication to keep multiple versions in sync, storing everything in the database gives this for free (assuming you have a high bandwidth connection between the two servers).
Personally, I tend to think that such objects are better suited for the file system. That does require a somewhat more complicated API for the application, which has to retrieve data from two different places.
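A sketch of that split, with the bytes on disk and only a metadata row in MySQL so the application knows where to look (the table, columns, and storage path are invented for the example):

<?php
// Sketch: keep the object's bytes on the filesystem and a metadata row in MySQL.
// Table name, columns, and the storage directory are invented for the example.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');

function storeObject(PDO $pdo, string $tmpPath, string $originalName, string $mimeType): int
{
    $storageDir = '/var/data/objects';
    $target     = $storageDir . '/' . bin2hex(random_bytes(16)); // collision-safe name

    if (!rename($tmpPath, $target)) {         // or move_uploaded_file() for HTTP uploads
        throw new RuntimeException('Could not move object into storage');
    }

    $stmt = $pdo->prepare(
        'INSERT INTO objects (original_name, mime_type, path, size_bytes)
         VALUES (?, ?, ?, ?)'
    );
    $stmt->execute([$originalName, $mimeType, $target, filesize($target)]);
    return (int) $pdo->lastInsertId();
}

function sendObject(PDO $pdo, int $id): void
{
    $stmt = $pdo->prepare('SELECT mime_type, path FROM objects WHERE id = ?');
    $stmt->execute([$id]);
    $meta = $stmt->fetch(PDO::FETCH_ASSOC);

    header('Content-Type: ' . $meta['mime_type']);
    readfile($meta['path']);                  // stream the bytes back to the client
}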
Storing files in a file system is not a bad way to go, as long as your file system doesn't impose limits on the number of files per directory or on file size. It can also be hard to sync the files across a number of servers.
If you do run into such limitations, you can use some kind of virtual filesystem (like MongoDB GridFS).

How can I handle 5M Transactions every day with MySQL and the whole LAMP?

Well, maybe 5M is not that much, but I need to receive an XML document based on the following schema:
http://www.sat.gob.mx/sitio_internet/cfd/3/cfdv3.xsd
Therefore I need to save almost all of that information per row. By law we are required to keep the information for a very long time, so eventually this database will be very, very big.
Maybe create a table every day? Something like _invoices_16_07_2012.
Well, I'm lost... I have no idea how to do this, but I know it is possible.
On top of that, I need to create a PDF and 2 more files based on each XML and keep them on disk.
And the files should be quickly retrievable through a web site.
That's a lot of data to put into one field in a single row (not sure if that was something you were thinking of doing).
Write a script to parse the XML and save each value from it in a separate field, or in whatever way makes sense for you (so you'll have to create a table with all the appropriate fields). You should be able to input your data as one row per XML document.
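For example, a parsing script along these lines; the table, columns, and the attribute names pulled from the XML are illustrative, so map them onto whichever parts of the CFD schema you actually need to query:

<?php
// Sketch: parse one incoming invoice XML and store selected values as one row,
// keeping the original document for the PDF/archive requirement. Table, columns,
// and attribute names are illustrative.
$pdo = new PDO('mysql:host=localhost;dbname=invoices;charset=utf8mb4', 'user', 'pass');

$path = 'incoming-invoice.xml';
$xml  = simplexml_load_file($path);
$attr = $xml->attributes();                   // attributes on the root element

$stmt = $pdo->prepare(
    'INSERT INTO invoices (serie, folio, issued_at, total, raw_xml)
     VALUES (?, ?, ?, ?, ?)'
);
$stmt->execute([
    (string) $attr['serie'],                  // illustrative attribute names
    (string) $attr['folio'],
    (string) $attr['fecha'],
    (string) $attr['total'],
    file_get_contents($path),
]);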
You'll also want to shard your database and spread it across a cluster of servers and many tables. MySQL does support this, but I've only bootstrapped my own sharding mechanism before.
Do not create a table per XML document, as that is overkill.
Now, why do you need MySQL for this? Are you querying the data in the XML? If you're storing this data simply for archival purposes, you don't need MySQL; instead you can compress the files into, say, a tarball and store them directly on disk. Your website can easily fetch the files that way.
If you do need a big data store that can handle 5M transactions with as much data as you're saying, you might also want to look into something like Hadoop and store the data in a Distributed File System. If you want to more easily query your data, look into HBase which can run on top of Hadoop.
Hope this helps.
