I'm looking for a way to serialize large arrays to a file in PHP.
Right now I use a simple JSON format. Unfortunately, to store JSON to a file you need to convert it to a string first with json_encode and then write that string to the file. During this process the amount of used memory almost doubles (a bit less in practice, but close). And in some cases it can be a problem if things are happening concurrently.
My question is: is there a PHP library (preferably binary) which can serialize an array to a file (a JSON format would be nice) without converting the object to a string first and thus 'doubling' the memory? If the output can be compressed with GZIP, that would be even better.
Any other suggestions for writing (and reading) large objects without an intermediate format/state are welcome too.
If memory is the only concern
At the risk of being called Captain Obvious, I'd like to suggest an approach I like to use when there's not enough memory for a second copy of the data and it only just fits in as it is: deal with it one piece at a time. Also, if garbage collection doesn't kick in, that can be solved by doing the job in several steps, as this article explains.
What I mean is something like this:
function packWithoutExhaustingMemory(array $a) {
    foreach ($a as $key => $value) {
        $a[$key] = gzcompress(serialize($value)); // but only one piece at a time!
    }
    return $a;
}
Again, not sure if this exact piece will do the job but it illustrates the concept.
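Taking it a step further towards the original goal (getting the data onto disk without ever holding a second full copy), you could stream each compressed piece straight to a file. This is only a sketch under the same caveat; the length-prefixed chunk format here is made up, so you would need a matching reader:
function packToFileWithoutExhaustingMemory(array $a, $filename) {
    $fh = fopen($filename, 'wb');
    foreach ($a as $key => $value) {
        // again: only one piece is serialized and compressed at a time
        $chunk = gzcompress(serialize(array($key => $value)));
        // prefix each chunk with its length so it can be read back piece by piece
        fwrite($fh, pack('N', strlen($chunk)) . $chunk);
    }
    fclose($fh);
}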
Related
So... I need to save a large-ish amount of data from a platform with an excruciatingly limited amount of memory.
Because of this, I'm basically storing the data on my webserver, using a php script to just write JSON to a flat file, because I'm lazy af.
I could go to the trouble of having it store the data in my mysql server, but frankly the flat file thing should have been trivial. Still, I've run up against a problem. There are several quick and dirty workarounds that would fix it, but I've been trying to fix it the "right" way. (I know, I know, the right way would be to just store the data in mysql, but I actually need to be able to take the json file this produces and send it back to the platform that needs the data, in a ridiculously roundabout fashion, so it made sense to just have the php save it as a flat file in the first place.) It's already working, aside from this one issue, so I hate to reimplement.
See... Because of the low memory on the platform I'm sending the json to my server from... I'm sending things one field at a time. Each call to the php script is only setting ONE field.
So basically what I'm doing is loading the file from disk if it exists, and running it through json_decode to get my storage object, and then the php file gets a key argument and a value argument, and if the key is something like "object1,object2", it explodes that, gets the length of the resulting array, and then stores the value in $data->$key[0]->$key[1].
Then it's saved back to disk with fwrite($file, json_encode($data));
This is all working perfectly. Except when $value is a simple string. If it's an array, it works perfectly. If it's a number, it works fine. If it's a string, I get null from json_decode. I have tried every way I can think of to force quotes on to the ends of the $value variable in the hopes of getting json_decode to recognize it. Nothing works.
I've tried setting $data->$key[0]->$key[1] = $value in cases where value is a string, and not an array or number. No dice, php just complains that I'm trying to set an object that doesn't exist. It's fine if I'm using the output of json_decode to set the field, but it simply will not accept a string on its own.
So I have no idea.
Does anyone know how I can either get json_decode to not choke on a string that's just a string, or add a new field to an existing php object without using the output of json_decode?
I'm sure there's something obvious I'm missing. It should be clear I'm no php guru. I've never really used arrays and objects in php, so their vagaries are not something I'm familiar with.
Solutions I'm already aware of, but would prefer to avoid, are: I could have the platform that's sending the post requests wrap single, non-numeric values with square braces, creating a single item array, but this shouldn't be necessary, as far as I'm aware, so doing this bothers me (And ends up costing me something like half a kilobyte of storage that shouldn't need to be used).
I could also change some of my json from objects to arrays in order to get php to let me add items more readily, but it seems like there should be a solution that doesn't require that, so I'd really prefer not to...
I skimmed through your post.
And I know this works for stdClass:
$yourClass->newField = $string;
Is this what you wanted?
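For example (just an illustration; the field names are made up):
$obj = json_decode('{"key1":{"key2":["a","b"]}}'); // nested stdClass objects
$obj->key1->key3 = "just a plain string";          // a new field can be added directly
echo json_encode($obj); // {"key1":{"key2":["a","b"],"key3":"just a plain string"}}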
OK so... ultimately, as succinctly as possible, the problem was this:
Assuming we have this JSON in $data:
{
"key1":
{
"key2":["somedata","someotherdata"]
}
}
And we want it to be:
{
"key1":
{
"key2":["somedata","someotherdata"],
"key3":"key3data"
}
}
The php script has received "key=key1,key3&value=key3data" as its post data, and is initialized thusly:
$key = $_POST["key"];
$key = explode(",", $key);
$value = $_POST["value"];
...which provides us with an array ($key) representing the nested json key we want to set as a field, and a variable ($value) holding the value we want to set it to.
Approach #1:
$data->$key[0]->$key[1] = json_decode($value);
...fails. It creates this JSON when we re-encode $data:
{
"key1":
{
"key2":["somedata","someotherdata"],
"key3":null
}
}
Approach #2:
$data->$key[0]->$key[1] = $value;
...also fails. It fails to insert the field into $data at all.
But then I realized... the problem with #2 is that it won't let me set the nonexistent field, and the problem with approach #1 is that it sets the field wrong.
So all I have to do is brute force it thusly:
$data->$key[0]->$key[1] = json_decode($value);
if (json_decode($value) == NULL)
{
$data->$key[0]->$key[1] = $value;
}
This works! Since Approach #1 has created the field (Albeit with the incorrect value), PHP now allows me to set the value of that field without complaint.
It's a very brute force sort of means of fixing the problem, and I'm sure there are better ones, if I understood PHP objects better. But this works, so at least I have my code working.
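For what it's worth, the same idea can be collapsed so json_decode() only runs once. This is just a sketch; it assumes $data->{$key[0]} already exists as an object, and it uses explicit braces since the meaning of $data->$key[0] changed between PHP 5 and PHP 7:
$decoded = json_decode($value);
$data->{$key[0]}->{$key[1]} = ($decoded === NULL) ? $value : $decoded;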
OK, here's what I need:
I've got several (lots of them, actually) json objects written to files (e.g. result.1.json, result.2.json, etc.)
I need to combine all of them in an array, like:
$results = array (
    json_decode(file_get_contents("result.1.json")),
    json_decode(file_get_contents("result.2.json")),
    ...
);
And then write all $results back to a json file, like:
file_put_contents("results.json", json_encode($results));
And here's the issue:
If we are talking about LOTS of json objects, am I going to face a memory limit error? (That's why I decided to split the objects in the first place: storing all results in memory triggered a memory-related error and crashed.)
If the above is true, how could I circumvent it, and still "concat" the objects?
If the objects from the file(s) just get enclosed in one big array, there's actually no need to read and decode them, assuming they actually are valid JSON syntax.
A JSON array looks like this:
[
{ obj1},
{ obj2}
]
So you can just do the following:
file_put_contents("results.json","[\n");
while ( ... has more files ... )
{
file_put_contents("results.json", file_get_contents($theFile), FILE_APPEND);
if ( ... has more files ... )
file_put_contents("results.json",",\n", FILE_APPEND);
}
file_put_contents("results.json","]\n", FILE_APPEND);
This will use almost no memory and is quite fast.
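A concrete version of that loop might look like this (a sketch; it assumes the pieces follow the result.*.json naming from the question and are each valid JSON):
$files = glob("result.*.json");
file_put_contents("results.json", "[\n");
foreach ($files as $i => $file) {
    file_put_contents("results.json", file_get_contents($file), FILE_APPEND);
    if ($i < count($files) - 1) {
        file_put_contents("results.json", ",\n", FILE_APPEND); // comma between objects, but not after the last one
    }
}
file_put_contents("results.json", "]\n", FILE_APPEND);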
Yes / no / maybe. It all depends on what your memory_limit really is and how much memory you need to process all your JSON.
Increase the memory limit if you can, or process it offsite and put all the json data in a DB for smarter processing.
I'm developing a floorplanner Flex mini application. I was just wondering whether JSON or XML would be a better choice in terms of performance when generating responses from PHP. I'm currently leaning towards JSON, since the responses could also be reused for Javascript. I've read elsewhere that JSON takes longer to parse than XML; is that true? What about flexibility for handling data with XML vs JSON in Flex?
I'd go with JSON. We've added native JSON support to Flash Player, so it will be as fast on the parsing side as XML and it's much less verbose/smaller.
=Ryan ryan#adobe.com
JSON is not a native structure to Flex (strange, huh? You'd think that the {} objects could be easily serialized, but not really), XML is. This means that XML is handled behind the scenes by the virtual machine, while JSON strings are parsed and turned into objects through String manipulation (even if you're using AS3CoreLib)... gross... Personally, I've also seen inconsistencies in JSONEncoder (at one point Arrays were just numerically indexed objects).
Once the data has been translated into an AS3 object, it is still faster to search and parse data in XML than it is with Objects. XPath expressions make data traversal a pleasure (almost easy enough to make you smile compared to other things out there).
On the other hand JS is much better at parsing JSON. MUCH, MUCH BETTER. But, since the move to JavaScript is a "maybe... someday..." then you may want to consider, "will future use of JSON be worth the performance hit right now?"
But here is a question: why not simply have two outputs? Since both JS and AS can provide you POSTs with a virtually arbitrary number of variables, you really only need to concern yourself with how the server sends the data, not how it receives it. Here's a potential way to handle this:
// as you are about to output:
$name = $_GET[ 'type' ];
$type = './outputs/' . $name . '.php';
// only dot-free type names are allowed, so nobody can traverse out of ./outputs/
if( file_exists( $type ) && strpos( $name, '.' ) === FALSE )
{
    include( $type );
    echo output_data( $data );
}
else
{
// add a 404 if you like
die();
}
Then, when getting a $_GET['type'] == 'js', js.php would be:
function output_data( $data ){ return json_encode( $data ); }
When getting $_GET['type'] == 'xml', xml.php would hold something where output_data returns a string representing XML (plenty of examples here).
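For instance, xml.php could be as simple as this (a sketch that assumes $data is a flat associative array; nested data would need recursion or XMLWriter):
function output_data( $data ){
    $xml = new SimpleXMLElement( '<response/>' );
    foreach ( $data as $key => $value ) {
        $xml->{$key} = (string) $value; // creates <key>value</key> and escapes the text content
    }
    return $xml->asXML();
}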
Of course, if you're using a framework, then you could just do something like this with a view instead (my suggestion boils down to "you should have two different views and use MVC").
No, JSON is ALWAYS smaller than XML when their structures are exactly the same. And the cost of parsing text is roughly proportional to the size of the text being parsed.
So JSON is faster than XML, and if you have a plan to reuse the data on the javascript side, choose JSON.
Benchmark JSON vs XML:
http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php
If you're ever going to use Javascript with it, definitely go with JSON. Both have a very nice structure.
It depends on how well Flex can parse JSON though, so I would look into that. How much data are you going to be passing back? Error/Success messages? User Profiles? What kind of data is this going to contain?
Is it going to need attributes on the tags? Or just a "structure". If it needs attributes and things like that, and you don't want to go too deep into an "array like" structure, go with XML.
If you're just going to have key => value, even multi-dimensional... go with JSON.
All depends on what kind of data you're going to be passing back and forth. That will make your decision for you :)
Download time:
JSON is faster.
Javascript parse:
JSON is faster.
Actionscript parse:
XML is faster.
Advanced use within Actionscript:
XML is better, with all the E4X functionality.
JSON is limited, with no knowledge of Vectors, meaning you are limited to Arrays or will need to override the JSON Encoder in as3corelib with something such as:
else if ( value is Vector.<*> ) {
    // converts the vector to an array and encodes it like any other array
    return arrayToString( vectorToArray( value ) );
} else if ( value is Object && value != null ) {

private function vectorToArray( __vector:Object ):Array {
    var __return:Array = new Array();
    if ( !__vector || !(__vector is Vector.<*>) )
    {
        return __return;
    }
    for each ( var __obj:* in (__vector as Vector.<*>) )
    {
        __return.push( __obj );
    }
    return __return;
}
But I am afraid getting those values back into Vectors is not as nice. I had to make a whole utility class devoted to it.
So which one to use all depends on how advanced the objects you are going to be moving are. More advanced: go with XML to make it easier on the ActionScript side.
For simple stuff, go with JSON.
I'm storing some "unstructured" data (a keyed array) in one field of my table, and I'm currently using serialize() / unserialize() to "convert" back and forth from array to string.
Every now and then, however, I get errors when unserializing the data. I believe these errors happen because of Unicode data in the strings inside the array I'm serializing, although there are some records with Unicode data that work just fine. (The DB field is UTF-8.)
I'm wondering whether using json_encode instead of serialize will make a difference / make this more resilient. This is not trivial for me to test, since in my dev environment everything works well, but in production, every now and then (about 1% of records) I get an error.
Btw, I know I'm weaseling out of finding an actual explanation for the problem and just blindly trying something; I'm kind of hoping I can get rid of this without spending too much time on it.
Do you think using json_encode instead of serialize will make this more resilient to "serialization errors"? The data format does look more "forgiving" to me...
UPDATE: The actual error i'm getting is:
Notice: unserialize(): Error at offset 401 of 569 bytes in C:\blah.php on line 20
Thanks!
Daniel
JSON has one main advantage:
compatibility with languages other than PHP.
PHP's serialize has one main advantage:
it's specifically designed to store PHP-based data -- most notably, it can store serialized objects, instances of classes, that will be re-instantiated to the right class type when the string is unserialized.
(Yes, those advantages are the exact opposite of each other)
In your case, as you are storing data that's not really structured, both formats should work pretty well.
And the encoding problem you have should not be related to serialize by itself: as long as everything (DB, connection to the DB, PHP files, ...) is in UTF-8, serialization should work too.
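For example, the connection charset is the piece people most often forget. This is just a sketch; adapt it to whatever DB layer you actually use:
// PDO: declare the charset in the DSN
$db = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');

// or with mysqli, set it right after connecting
$mysqli = new mysqli('localhost', 'user', 'pass', 'app');
$mysqli->set_charset('utf8');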
I think, unless you absolutely need to preserve PHP-specific types, that json_encode() is the way to go for storing structured data in a single field in MySQL. Here's why:
https://dev.mysql.com/doc/refman/5.7/en/json.html
As of MySQL 5.7.8, MySQL supports a native JSON data type defined by RFC 7159 that enables efficient access to data in JSON (JavaScript Object Notation) documents
If you are using a version of MySQL that supports the new JSON data type you can benefit from that feature.
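For example (a sketch, assuming MySQL 5.7.8+ and a made-up table settings with a JSON column data; connection details are placeholders):
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');
$pdo->exec("CREATE TABLE IF NOT EXISTS settings (id INT PRIMARY KEY, data JSON)");

$stmt = $pdo->prepare("INSERT INTO settings (id, data) VALUES (1, ?)");
$stmt->execute(array(json_encode(array('url' => 'http://example.com'))));

// the server can now reach inside the document, e.g. pull out a single key:
$url = $pdo->query("SELECT JSON_UNQUOTE(JSON_EXTRACT(data, '$.url')) FROM settings WHERE id = 1")
           ->fetchColumn();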
Another important point of consideration is the ability to perform changes on those JSON strings. Suppose you have a url stored in encoded strings all over your database. Wordpress users who've ever tried to migrate an existing database to a new domain name may sympathize here. If it's serialized, a plain search-and-replace is going to break things, because serialize() stores the byte length of every string and the replacement changes those lengths. If it's JSON you can simply run a query using REPLACE() and everything will be fine. Example:
$arr = ['url' => 'http://example.com'];
$ser = serialize($arr);
$jsn = json_encode($arr);
$ser = str_replace('http://','https://',$ser);
$jsn = str_replace('http://','https://',$jsn);
print_r(unserialize($ser));
PHP Notice: unserialize(): Error at offset 39 of 43 bytes in /root/sandbox/encoding.php on line 10
print_r(json_decode($jsn,true));
Array
(
[url] => https://example.com
)
json_encode() converts non-ASCII characters and symbols (e.g., “Schrödinger” becomes “Schr\u00f6dinger”) but serialize() does not.
Source: https://www.toptal.com/php/10-most-common-mistakes-php-programmers-make#common-mistake-6--ignoring-unicodeutf-8-issues
To leave UTF-8 characters untouched, you can use the option JSON_UNESCAPED_UNICODE as of PHP 5.4.
Source: https://stackoverflow.com/a/804089/1438029
If the problem is (and I believe it is) in the UTF-8 encoding, there is no difference between json_encode and serialize. Both will leave the character encoding unchanged.
You should make sure your database/connection is properly set up to handle all UTF-8 characters, or encode the whole record into a supported encoding before inserting it into the DB.
Also please specify what "I get an error" means.
Found this in the PHP docs...
function mb_unserialize($serial_str) {
    // recomputes the byte length of every s:N:"..." chunk, so strings whose stored
    // lengths no longer match (e.g. after a charset conversion) can still be unserialized;
    // note: the /e modifier used here is deprecated and removed in PHP 7
    $out = preg_replace('!s:(\d+):"(.*?)";!se', "'s:'.strlen('$2').':\"$2\";'", $serial_str);
    return unserialize($out);
}
I don't quite understand it, but it worked to unserialize the data that I couldn't unserialize before. I've moved to JSON now; I'll report back in a couple of weeks whether this solved the problem of randomly getting some records "corrupted".
As I'm going through this I'll give my opinion: both serialize and json_encode are good for storing data in a DB, but for those looking for performance, I've tested it and json_encode is a few microseconds faster than serialize. I used this script to measure the time difference:
$bounced =array();
for($i=count($bounced); $i<9999; ++$i)$bounced[$i]=$i;
$timeStart = microtime(true);
var_dump(serialize ($bounced));
unserialize(serialize ($bounced));
print timer_diff($timeStart) . " sec.\n";
$timeStart = microtime(true);
var_dump(json_encode ($bounced));
json_decode(json_encode ($bounced));
print timer_diff($timeStart) . " sec.\n";
function timer_diff($timeStart)
{
return number_format(microtime(true) - $timeStart, 3);
}
As a design decision, I'd opt for storing JSON because it can only represent a data structure, whereas serialization is bound to a PHP data object signature.
The advantages I see are:
* you are forced to separate the data storage from any logic layer on top.
* you are independent from changes to the data object class (say, for example, that you want to add a field).
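A small illustration of that second point (class and field names made up): serialize() bakes the class name into the string, json_encode() keeps only the data.
class Point { public $x = 1; public $y = 2; }

$ser  = serialize(new Point());   // O:5:"Point":2:{...}  -- tied to the class name
$json = json_encode(new Point()); // {"x":1,"y":2}        -- just the data

// if Point is later renamed or removed, unserialize() gives a __PHP_Incomplete_Class,
// while json_decode() still returns a plain stdClass you can work with:
var_dump(json_decode($json));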
I have a report page that deals with ~700k records from a database table. I can display this on a webpage using paging to break up the results. However, my export to PDF/CSV functions rely on processing the entire data set at once and I'm hitting my 256MB memory limit at around 250k rows.
I don't feel comfortable increasing the memory limit, and I haven't got the ability to use MySQL's SELECT ... INTO OUTFILE to just serve a pre-generated CSV. However, I can't really see a way of serving up large data sets with Drupal without using something like:
$form = array();
$table_headers = array();
$table_rows = array();
$data = db_query("a query to get the whole dataset");
while ($row = db_fetch_object($data)) {
$table_rows[] = $row->some_attribute;
}
$form['report'] = array('#value' => theme('table', $table_headers, $table_rows));
return $form;
Is there a way of getting around what is essentially appending to a giant array of arrays? At the moment I don't see how I can offer any meaningful report pages with Drupal due to this.
Thanks
With such a large dataset, I would use Drupal's Batch API which allows for time intensive operations to be broken into batches. It is also better for users because it will give them a progress bar with some indication of how long the operation will take.
Start the batch operation by opening a temporary file, then append new records to it on each new batch until done. The final page can do the final processing to deliver the data as CSV or convert it to PDF. You'd probably want to add some cleanup afterwards as well.
http://api.drupal.org/api/group/batch/6
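A rough Drupal 6-style sketch of that flow is below. The function names, table and file path are made up for illustration, so treat it as a starting point rather than working code:
function mymodule_report_export() {
  $batch = array(
    'title' => t('Building the CSV export'),
    'operations' => array(array('mymodule_report_export_step', array())),
    'finished' => 'mymodule_report_export_finished',
  );
  batch_set($batch);
  batch_process('admin/reports/export/download');
}

function mymodule_report_export_step(&$context) {
  if (!isset($context['sandbox']['offset'])) {
    $context['sandbox']['offset'] = 0;
    $context['sandbox']['total'] = db_result(db_query("SELECT COUNT(*) FROM {report_data}"));
    $context['sandbox']['file'] = file_directory_path() . '/report.csv';
  }
  $limit = 500; // rows appended per batch pass
  $result = db_query_range("SELECT * FROM {report_data}", $context['sandbox']['offset'], $limit);
  $fh = fopen($context['sandbox']['file'], 'a');
  while ($row = db_fetch_object($result)) {
    fputcsv($fh, (array) $row);
  }
  fclose($fh);
  $context['sandbox']['offset'] += $limit;
  $context['finished'] = min(1, $context['sandbox']['offset'] / max(1, $context['sandbox']['total']));
}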
If you are generating PDF or CSV you shouldn't use the Drupal native functions. What about writing to the output file inside your while loop? This way, only one row of the result set needs to be held in memory at a given time.
At the moment you store everything in the array $table_rows.
Can't you flush at least parts of the report while you're reading it from the database (e.g. every so many lines) in order to free some of the memory? I can't see why it should only be possible to write the csv all at once.
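A minimal sketch of that approach, streaming the rows straight out as CSV instead of collecting them in $table_rows (the query and column names are placeholders):
header('Content-Type: text/csv');
header('Content-Disposition: attachment; filename="report.csv"');

$out = fopen('php://output', 'w');
$data = db_query("a query to get the whole dataset");
while ($row = db_fetch_object($data)) {
  fputcsv($out, array($row->some_attribute, $row->other_attribute)); // one row at a time, nothing accumulates
}
fclose($out);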
I don't feel comfortable increasing the memory limit
Increasing the memory limit doesn't mean that every php process will use that amount of memory. However you could exec the cli version of php with a custom memory limit - but that's not the right solution either....
and I haven't got the ability to use MySQL's save into outfile to just serve a pre-generated CSV
Then don't save it all in an array - write each line to the output buffer when you fetch it from the database (IIRC the entire result set is buffered outside the limited php memory). Or write it directly to a file then do a redirect when the file is completed and closed.
C.
You should include paging into that with a pager_query, and break results into 50-100 per page. That should help a lot. You say you want to use paging but I don't see it in the code.
Check this out: http://api.drupal.org/api/function/pager_query/6
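Roughly like this (a sketch using Drupal 6's pager_query(); the query and columns are placeholders):
$result = pager_query("SELECT nid, title FROM {node} ORDER BY nid", 50); // 50 rows per page
while ($row = db_fetch_object($result)) {
  $table_rows[] = array($row->nid, $row->title);
}
$output  = theme('table', $table_headers, $table_rows);
$output .= theme('pager', NULL, 50); // renders the page links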
Another thing to keep in mind is that in PHP5 (before 5.3), assigning an array to a new variable or passing it to a function copies the array and does not create a reference. You may be creating many copies of the same data, and if none are unset or go out of scope they cannot be garbage collected to free up memory. Where possible, using references to perform operations on the original array can save memory:
function doSomething($arg){
foreach($arg AS $var)
// a new copy is created here internally: 3 copies of data exist
$internal[] = doSomethingToValue($var);
return $internal;
// $arg goes out of scope and can be garbage collected: 2 copies exist
}
$var = array();
// a copy is passed to function: 2 copies of data exist
$var2 = doSomething($var);
// $var2 will be a reference to the same object in memory as $internal,
// so only 2 copies still exist
If $var is set to the return value of the function, the old value can be garbage collected, but not until after the assignment, so more memory will still be needed for a brief time:
function doSomething(&$arg){
foreach($arg AS &$var)
// operations are performed on original array data:
// only two copies of an array element exist at once, not the whole array
$var = doSomethingToValue($var);
unset($var); // not needed here, but good practice in large functions
}
$var = array();
// a reference is passed to function: 1 copy of data exists
doSomething($var);
The way I approach such huge reports is to generate them with the php cli / Java / C++ / C# (e.g. from a CRONTAB job) and use the unbuffered query option mysql has.
Once the file/report has been created on disk, you can just give a link to it...