Merge Json objects, regardless of Memory - php

OK, here's what I need :
I've got several (lots of them actually) json objects written to files (e.g. result.1.json, result.2.json, etc)
I need to combine all of them in an array, like :
$results = array (
json_decode(file_get_contents("result.1.json"),
json_decode(file_get_contents("result.2.json")),
...
)
And then write all $results back to a json file, like :
file_put_contents("results.json",$results);
And here's the issue :
if we are talking about LOTS of json objects, am I going to face a Memory limit error (that's why I decided to split the objects in the first place - storing all results in memory triggered a Memory related error and crashed)?
If the above is true, how could I circumvent it, and still "concat" the objects?

If the objects from the file(s) get enclosed in to just one big array, there's actually no need to read and decode them, assumed they actually are valid JSON Syntax.
An Array is of the kind:
[
{ obj1},
{ obj2}
]
So you can just do the following:
file_put_contents("results.json","[\n");
while ( ... has more files ... )
{
file_put_contents(file_get_contents($theFile), FILE_APPEND);
if ( .... has more files ....)
file_put_contents("results.json",",\n", FILE_APPEND);
}
file_put_contents("results.json","]\n", FILE_APPEND);
This will use almost no memory and is quite fast.

Yes/No/Maybe. It all depends on what is your memory_limit really and how much memory you need to process all your JSON.
Increase memory limit if you can; process offsite and put all json data in DB for smarter processing

Related

Large SQL query takes 10 times more memory in PHP than SQL data

So we have an existing system, which we are trying to scale up and running out of memory retrieving close to 3M records.
I was trying to determine how viable is increasing server memory as a stop gap solution, by ascertaining data size returned by the query, by doing something like:
select sum(row_size)
from (
SELECT
ifnull(LENGTH(qr.id), 0)+
ifnull(LENGTH(qr.question_id), 0)+
ifnull(LENGTH(qr.form_response_id), 0)+
ifnull(LENGTH(qr.`value`), 0)+
ifnull(LENGTH(qr.deleted_at), 0)+
ifnull(LENGTH(qr.created_at), 0)+
ifnull(LENGTH(qr.updated_at), 0)
as row_size
FROM
....
LIMIT 500000
) as tbl1;
Which returns 30512865 which is roughly 30MB of data.
However when I cross check what PHP actually uses to store the results using:
$memBefore = memory_get_usage();
$formResponses = DB::select($responsesSQL, $questionIDsForSQL);
$memAfter = memory_get_usage();
dd($memBefore, $memAfter);
I am getting 315377552 and 22403248 which means 292974304 bytes or roughly 300MB of memory usage to store simple array!
I would like to understand why the memory footprint is 10 times the data retrieved, and is there anything I could do to reduce that footprint, short of modifying the API response from back end, and front end to not need the entire result set which will take time.
For context, current implementation uses the above results (returned by getQuestionResponses)to transform them into associative array grouped by question_id using Laravel Collections:
collect($this->questionResponseRepo->getQuestionResponses($questions))->groupBy('question_id')->toArray();
I am thinking to replace the collect with own implementation more memory efficient which will use the array returned from the query to reduce memory inflation by converting that array into Laravel's Collection, but thats still not helping with the array itself taking 300MB for 500k records responses instead of 30MB.
One of the solutions online is to use SplFixedArray but I am not sure how to force DB::select to use that instead of array?
Another possible solution involves ensuring it returns simple assoc array instead of array of standard classes https://stackoverflow.com/a/37532052/373091
But when I try that as in:
// get original model
$fetchMode = DB::getFetchMode();
// set mode to custom
DB::setFetchMode(\PDO::FETCH_ASSOC);
$memBefore = memory_get_usage();
$formResponses = DB::select($responsesSQL, $questionIDsForSQL);
DB::setFetchMode($fetchMode);
$memAfter = memory_get_usage();
dd($memBefore, $memAfter, $formResponses);
, I get error Call to undefined method Illuminate\\Database\\MySqlConnection::getFetchMode() which means apparently it can no longer be done from Laravel> 5.4 :(
Any suggestions?
I think the real problem is that you're loading all 3 million records into memory at once. You should instead either process them in chunks or use a cursor.
Chunking
To chunk records into batches, you can use the Laravel's chunk method. This method accepts two parameters, the chunk size and a callback that gets passed the subset of models or objects for processing. This will execute on query per chunk.
Here's the example taken from the documentation:
Flight::chunk(200, function ($flights) {
foreach ($flights as $flight) {
//
}
});
Cursor
Alternatively, you can also use the cursor method if you only want to execute a single query. In this case, Laravel will only hydrate one model at a time so you never have more than one model (or object if you're not using Eloquent) in memory at a time.
foreach (Flight::where('destination', 'Zurich')->cursor() as $flight) {
//
}

Memory efficient to/from file serialization in PHP

I'm looking for a way to serialize large arrays to a file in PHP.
Right now I use a simple JSON format. Unfortunately to store JSON to a file you need to convert it to a string first with json_encode and then write the string to a file. During this process the amount of used memory almost doubles (it's less). And in some cases it can be a problem if things are happening concurrently.
My question is: is there a PHP library (binary preferably) which can serialize an array to a file (a JSON format would be nice) without converting the object to a string and thus 'doubling' the memory. If the output can be compressed with GZIP, what would be even better.
Any other suggestion to write (and read) of large object without intermediate format/state are welcome too.
If memory is the only concern
At the risk of being called Captain Obvious - I'd like to suggest a weird approach I like to use when there's not enough memory and I have to deal with something that only fits in once. Also, if garbage collection doesn't happen, that can be solved by doing the job in several steps as this article explains.
What I mean is something like this:
function packWithoutExhaustingMemory (array $a) {
foreach($a as $key => $value) {
$a[$key] = gzcompress(serialize($value)); // but only one piece at a
time!
}
return $a;
}
Again, not sure if this exact piece will do the job but it illustrates the concept.

How do I pre-allocate memory for an array in PHP?

How do I pre-allocate memory for an array in PHP? I want to pre-allocate space for 351k longs. The function works when I don't use the array, but if I try to save long values in the array, then it fails. If I try a simple test loop to fill up 351k values with a range(), it works. I suspect that the array is causing memory fragmentation and then running out of memory.
In Java, I can use ArrayList al = new ArrayList(351000);.
I saw array_fill and array_pad but those initialize the array to specific values.
Solution:
I used a combination of answers. Kevin's answer worked alone, but I was hoping to prevent problems in the future too as the size grows.
ini_set('memory_limit','512M');
$foundAdIds = new \SplFixedArray(100000); # google doesn't return deleted ads. must keep track and assume everything else was deleted.
$foundAdIdsIndex = 0;
// $foundAdIds = array();
$result = $gaw->getAds(function ($googleAd) use ($adTemplates, &$foundAdIds, &$foundAdIdsIndex) { // use call back to avoid saving in memory
if ($foundAdIdsIndex >= $foundAdIds->count()) $foundAdIds->setSize( $foundAdIds->count() * 1.10 ); // grow the array
$foundAdIds[$foundAdIdsIndex++] = $googleAd->ad->id; # save ids to know which to not set deleted
// $foundAdIds[] = $googleAd->ad->id;
PHP has an Array Class with SplFixedArray
$array = new SplFixedArray(3);
$array[1] = 'test1';
$array[0] = 'test2';
$array[2] = 'test3';
foreach ($array as $k => $v) {
echo "$k => $v\n";
}
$array[] = 'fails';
gives
0 => test1
1 => test2
2 => test3
As other people have pointed out, you can't do this in PHP (well, you can create an array of fixed length, but that's not really want you need). What you can do however is increase the amount of memory for the process.
ini_set('memory_limit', '1024M');
Put that at the top of your PHP script and you should be ok. You can also set this in the php.ini file. This does not allocate 1GB of memory to PHP, but rather allows PHP to expand it's memory usage up to that point.
A couple of things to point out though:
This might not be allowed on some shared hosts
If you're using this much memory, you might need to have a look at how you're doing things and see if they can be done more efficiently
Look out for opportunities to clear out unneeded resources (do you really need to keep hold of $x that contains a huge object you've already used?) using unset($x);
The quick answer is: you can't
PHP is quite different from java.
You can make an array with specific values as you said, but you already know about them. You can 'fake' it by filling it with null values, but that's about the same to be honest.
So unless you want to just create one with array_fill and null (which is a hack in my head), you just can't.
(You might want to check your reasoning about the memory. Are you sure this isn't an XY-problem? As memory is limited by a number (max usage) I don't think the fragmentation would have much effect. Check what is taking your memory rather then try going down this road)
The closest you will get is using SplFixedArray. It doesn't preallocate the memory needed to store the values (because you can't pre-specify the type of values used), but it preallocates the array slots and doesn't need to resize the array itself as you add values.

XML or JSON reponse for Flex applications with PHP backend?

I'm developing a floorplanner Flex mini application. I was just wondering whether JSON or XML would be a better choice in terms of performance when generating responses from PHP. I'm currently leaning for JSON since the responses could also be reused for Javascript. I've read elsewhere that JSON takes longer to parse than XML, is that true? What about flexibility for handling data with XML vs JSON in Flex?
I'd go with JSON. We've added native JSON support to Flash Player, so it will be as fast on the parsing side as XML and it's much less verbose/smaller.
=Ryan ryan#adobe.com
JSON is not a native structure to Flex (strange, huh? You'd think that the {} objects could be easily serialized, but not really), XML is. This means that XML is done behind the scenes by the virtual machine while the JSON Strings are parsed and turned into objects through String manipulation (even if you're using AS3CoreLib)... gross... Personally, I've also seen inconsistencies in JSONEncoder (at one point Arrays were just numerically indexed objects).
Once the data has been translated into an AS3 object, it is still faster to search and parse data in XML than it is with Objects. XPath expressions make data traversal a pleasure (almost easy enough to make you smile compared to other things out there).
On the other hand JS is much better at parsing JSON. MUCH, MUCH BETTER. But, since the move to JavaScript is a "maybe... someday..." then you may want to consider, "will future use of JSON be worth the performance hit right now?"
But here is a question, why not simply have two outputs? Since both JS and AS can provide you POSTs with a virtually arbitrary number of variables, you really only need to concern yourself with how the server send the data not receives it. Here's a potential way to handle this:
// as you are about to output:
$type = './outputs/' . $_GET[ 'type' ] . '.php';
if( file_exists( $type ) && strpos( $type, '.', 1 ) === FALSE )
{
include( $type );
echo output_data( $data );
}
else
{
// add a 404 if you like
die();
}
Then, when getting a $_GET['type'] == 'js', js.php would be:
function output_data( $data ){ return json_encode( $data ); }
When getting $_GET['type'] == 'xml', xml.php would hold something which had output_data return a string which represented XML (plenty of examples here)
Of course, if you're using a framework, then you could just do something like this with a view instead (my suggestion boils down to "you should have two different views and use MVC").
No, JSON is ALWAYS smaller than XML when their structures are completely same. And the cost of parsing text is almost up to the size of the target text.
So, JSON is faster than XML and if you have a plan to reuse them on javascript side, choose JSON.
Benchmark JSON vs XML:
http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php
If you're ever going to use Javascript with it, definitely go with JSON. Both have a very nice structure.
It depends on how well Flex can parse JSON though, so I would look into that. How much data are you going to be passing back? Error/Success messages? User Profiles? What kind of data is this going to contain?
Is it going to need attributes on the tags? Or just a "structure". If it needs attributes and things like that, and you don't want to go too deep into an "array like" structure, go with XML.
If you're just going to have key => value, even multi-dimensional... go with JSON.
All depends on what kind of data you're going to be passing back and forth. That will make your decision for you :)
Download time:
JSON is faster.
Javascript Parse
JSON is faster
Actionscript Parse
XML is faster.
Advanced use within Actionscript
XML is better with all the E4X functionality.
JSON is limited with no knowledge of Vectors meaning you are limited to Arrays or will need to override the JSON Encoder in ascorelib with something such as
else if ( value is Vector.<*> ) {
// converts the vector to an array and
return arrayToString( vectorToArray( value ) );
} else if ( value is Object && value != null ) {
private function vectorToArray(__vector:Object):Array {
var __return : Array;
var __vList : Vector.<*>;
__return = new Array;
if ( !__vector || !(__vector is Vector.<*>) )
{
return __return;
}
for each ( var __obj:* in (__vector as Vector.<*>) )
{
__return.push(__obj);
}
return __return;
}
But I am afraid getting those values back into Vectors is not as nice. I had to make a whole utility class devoted to it.
So which one is all depending on how advanced your objects you are going to be moving.. More advanced, go with XML to make it easier ActionScript side.
Simple stuff go JSON

Handling large datasets with PHP/Drupal

I have a report page that deals with ~700k records from a database table. I can display this on a webpage using paging to break up the results. However, my export to PDF/CSV functions rely on processing the entire data set at once and I'm hitting my 256MB memory limit at around 250k rows.
I don't feel comfortable increasing the memory limit and I haven't got the ability to use MySQL's save into outfile to just serve a pre-generated CSV. However, I can't really see a way of serving up large data sets with Drupal using something like:
$form = array();
$table_headers = array();
$table_rows = array();
$data = db_query("a query to get the whole dataset");
while ($row = db_fetch_object($data)) {
$table_rows[] = $row->some attribute;
}
$form['report'] = array('#value' => theme('table', $table_headers, $table_rows);
return $form;
Is there a way of getting around what is essentially appending to a giant array of arrays? At the moment I don't see how I can offer any meaningful report pages with Drupal due to this.
Thanks
With such a large dataset, I would use Drupal's Batch API which allows for time intensive operations to be broken into batches. It is also better for users because it will give them a progress bar with some indication of how long the operation will take.
Start the batch operation by opening a temporary file, then append new records to it on each new batch until done. The final page can do the final processing to deliver the data as cvs or convert to PDF. You'd probably want to add some cleanup afterwords as well.
http://api.drupal.org/api/group/batch/6
If you are generating PDF or CSV you shouldn't use the Drupal native functions. What about writing to the output file inside your while loop? This way, only one result set is in memory at a given time.
At the moment you store everything in the array $table_rows.
Can't you flush at least parts of the report while you're reading it from the database (e.g. every so and so many lines) in order to free some of the memory? I can't see why it should only be possible to write to a csv at once.
I don't feel comfortable increasing the memory limit
Increasing the memory limit doesn't mean that every php process will use that amount of memory. However you could exec the cli version of php with a custom memory limit - but that's not the right solution either....
and I haven't got the ability to use MySQL's save into outfile to just serve a pre-generated CSV
Then don't save it all in an array - write each line to the output buffer when you fetch it from the database (IIRC the entire result set is buffered outside the limited php memory). Or write it directly to a file then do a redirect when the file is completed and closed.
C.
You should include paging into that with a pager_query, and break results into 50-100 per page. That should help a lot. You say you want to use paging but I don't see it in the code.
Check this out: http://api.drupal.org/api/function/pager_query/6
Another things to keep in mind is that in PHP5 (before 5.3), assigning an array to a new variable or passing it to a function copies the array and does not create a reference. You may be creating many copies of the same data, and if none are unset or go out of scope they cannot be garbage collected to free up memory. Where possible, using references to perform operations on the original array can save memory
function doSomething($arg){
foreach($arg AS $var)
// a new copy is created here internally: 3 copies of data exist
$internal[] = doSomethingToValue($var);
return $internal;
// $arg goes out of scope and can be garbage collected: 2 copies exist
}
$var = array();
// a copy is passed to function: 2 copies of data exist
$var2 = doSomething($var);
// $var2 will be a reference to the same object in memory as $internal,
// so only 2 copies still exist
if the $var is set to the return value of the function, the old value can be garbage collected, but not until after the assignment, so more memory will still be needed for a brief time
function doSomething(&$arg){
foreach($arg AS &$var)
// operations are performed on original array data:
// only two copies of an array element exist at once, not the whole array
$var = doSomethingToValue($var);
unset($var); // not needed here, but good practice in large functions
}
$var = array();
// a reference is passed to function: 1 copy of data exists
doSomething($var);
The way I approach such huge reports is to generate them with the php cli/Java/CPP/C# (i.e. CRONTAB) + use the unbuffered query option mysql has.
Once the file/report creation is done on the disk, you can give a link to it...

Categories