Handling large datasets with PHP/Drupal - php

I have a report page that deals with ~700k records from a database table. I can display this on a webpage using paging to break up the results. However, my export to PDF/CSV functions rely on processing the entire data set at once and I'm hitting my 256MB memory limit at around 250k rows.
I don't feel comfortable increasing the memory limit, and I don't have the ability to use MySQL's SELECT ... INTO OUTFILE to serve a pre-generated CSV. So I can't really see a way of serving up large data sets with Drupal using something like:
$form = array();
$table_headers = array();
$table_rows = array();
$data = db_query("a query to get the whole dataset");
while ($row = db_fetch_object($data)) {
  $table_rows[] = $row->some_attribute;
}
$form['report'] = array('#value' => theme('table', $table_headers, $table_rows));
return $form;
Is there a way of getting around what is essentially appending to a giant array of arrays? At the moment I don't see how I can offer any meaningful report pages with Drupal due to this.
Thanks

With such a large dataset, I would use Drupal's Batch API, which allows time-intensive operations to be broken into batches. It is also better for users, because it gives them a progress bar and some indication of how long the operation will take.
Start the batch operation by opening a temporary file, then append new records to it on each batch pass until done. The final page can do the final processing to deliver the data as CSV or convert it to PDF. You'd probably want to add some cleanup afterwards as well.
http://api.drupal.org/api/group/batch/6
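A minimal sketch of that idea for Drupal 6 (the callback names and the {mytable} query are hypothetical; adapt to your report):

function report_export_batch_start() {
  $batch = array(
    'title' => t('Exporting report'),
    'operations' => array(array('report_export_batch_process', array())),
    'finished' => 'report_export_batch_finished', // deliver the file and clean up here
  );
  batch_set($batch);
  batch_process('report');
}

function report_export_batch_process(&$context) {
  $limit = 1000; // rows appended per pass, to keep each request small
  if (!isset($context['sandbox']['progress'])) {
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['max'] = db_result(db_query("SELECT COUNT(*) FROM {mytable}"));
    $context['sandbox']['file'] = file_directory_temp() . '/report.csv';
  }
  $fp = fopen($context['sandbox']['file'], 'a');
  $result = db_query_range("SELECT * FROM {mytable}", $context['sandbox']['progress'], $limit);
  while ($row = db_fetch_array($result)) {
    fputcsv($fp, $row);
    $context['sandbox']['progress']++;
  }
  fclose($fp);
  $context['finished'] = $context['sandbox']['max'] ? $context['sandbox']['progress'] / $context['sandbox']['max'] : 1;
}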

If you are generating PDF or CSV you shouldn't use Drupal's native theming functions. What about writing to the output inside your while loop? That way, only one row of the result set is in memory at a given time.
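For example, a sketch of streaming a CSV straight to the browser from the question's loop (the column names here are made up):

drupal_set_header('Content-Type: text/csv');
drupal_set_header('Content-Disposition: attachment; filename="report.csv"');
$out = fopen('php://output', 'w');
fputcsv($out, array('Column A', 'Column B')); // header row
$data = db_query("a query to get the whole dataset");
while ($row = db_fetch_object($data)) {
  // each row is written out and discarded; nothing accumulates in PHP
  fputcsv($out, array($row->col_a, $row->col_b));
}
fclose($out);
exit; // skip normal Drupal page rendering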

At the moment you store everything in the array $table_rows.
Can't you flush at least parts of the report while you're reading it from the database (e.g. every few hundred rows) in order to free some of the memory? I can't see why it should only be possible to write the CSV all at once.

I don't feel comfortable increasing the memory limit
Increasing the memory limit doesn't mean that every PHP process will use that amount of memory. However, you could exec the CLI version of PHP with a custom memory limit - but that's not the right solution either....
and I haven't got the ability to use MySQL's save into outfile to just serve a pre-generated CSV
Then don't save it all in an array - write each line to the output buffer as you fetch it from the database (IIRC the entire result set is buffered outside the limited PHP memory). Or write it directly to a file, then do a redirect when the file is completed and closed.
C.

You should add paging to that with pager_query, and break the results into 50-100 rows per page. That should help a lot. You say you want to use paging, but I don't see it in the code.
Check this out: http://api.drupal.org/api/function/pager_query/6
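A minimal sketch of what that looks like in Drupal 6 (the query and columns are placeholders):

$query = "SELECT nid, title FROM {node} WHERE status = 1";
$result = pager_query($query, 50); // 50 rows per page
$rows = array();
while ($row = db_fetch_object($result)) {
  $rows[] = array($row->nid, check_plain($row->title));
}
$output = theme('table', array(t('ID'), t('Title')), $rows);
$output .= theme('pager', NULL, 50); // render the page links
return $output;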

Another thing to keep in mind is that in PHP 5 (before 5.3), assigning an array to a new variable or passing it to a function copies the array and does not create a reference. You may be creating many copies of the same data, and if none are unset or go out of scope they cannot be garbage collected to free up memory. Where possible, using references to perform operations on the original array can save memory.
function doSomething($arg) {
  foreach ($arg as $var) {
    // a new copy is created here internally: 3 copies of data exist
    $internal[] = doSomethingToValue($var);
  }
  // on return, $arg goes out of scope and can be garbage collected: 2 copies exist
  return $internal;
}

$var = array();
// a copy is passed to the function: 2 copies of data exist
$var2 = doSomething($var);
// $var2 will be a reference to the same object in memory as $internal,
// so only 2 copies still exist
If $var is set to the return value of the function, the old value can be garbage collected, but not until after the assignment, so more memory will still be needed for a brief time:
function doSomething(&$arg) {
  foreach ($arg as &$var) {
    // operations are performed on the original array data:
    // only two copies of an array element exist at once, not the whole array
    $var = doSomethingToValue($var);
  }
  unset($var); // not needed here, but good practice in large functions
}

$var = array();
// a reference is passed to the function: 1 copy of data exists
doSomething($var);

The way I approach such huge reports is to generate them with the PHP CLI (or Java/C++/C#), run from e.g. crontab, using MySQL's unbuffered query option.
Once the file/report creation is done on the disk, you can give a link to it...
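A sketch of what the CLI side could look like (connection details and paths are placeholders); with MYSQLI_USE_RESULT the server streams rows one at a time instead of PHP buffering the whole result set:

$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$out = fopen('/path/to/reports/report.csv', 'w');
// unbuffered: rows are fetched from the server as we consume them
$result = $db->query("SELECT * FROM mytable", MYSQLI_USE_RESULT);
while ($row = $result->fetch_assoc()) {
  fputcsv($out, $row);
}
$result->free();
fclose($out);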

Related

Large SQL query takes 10 times more memory in PHP than SQL data

So we have an existing system which we are trying to scale up, and it is running out of memory retrieving close to 3M records.
I was trying to determine how viable increasing server memory is as a stopgap solution, by ascertaining the data size returned by the query, doing something like:
select sum(row_size)
from (
    SELECT
        ifnull(LENGTH(qr.id), 0) +
        ifnull(LENGTH(qr.question_id), 0) +
        ifnull(LENGTH(qr.form_response_id), 0) +
        ifnull(LENGTH(qr.`value`), 0) +
        ifnull(LENGTH(qr.deleted_at), 0) +
        ifnull(LENGTH(qr.created_at), 0) +
        ifnull(LENGTH(qr.updated_at), 0)
        as row_size
    FROM
        ....
    LIMIT 500000
) as tbl1;
This returns 30512865, which is roughly 30MB of data.
However when I cross check what PHP actually uses to store the results using:
$memBefore = memory_get_usage();
$formResponses = DB::select($responsesSQL, $questionIDsForSQL);
$memAfter = memory_get_usage();
dd($memBefore, $memAfter);
I am getting 315377552 and 22403248, which means 292974304 bytes, or roughly 300MB of memory, used to store a simple array!
I would like to understand why the memory footprint is 10 times the data retrieved, and whether there is anything I could do to reduce that footprint, short of modifying the API response from the back end and the front end to not need the entire result set, which will take time.
For context, the current implementation uses the above results (returned by getQuestionResponses) to transform them into an associative array grouped by question_id using Laravel Collections:
collect($this->questionResponseRepo->getQuestionResponses($questions))->groupBy('question_id')->toArray();
I am thinking of replacing the collect with my own, more memory-efficient implementation, which will use the array returned from the query, to reduce the memory inflation from converting that array into Laravel's Collection. But that still doesn't help with the array itself taking 300MB for 500k record responses instead of 30MB.
One of the solutions online is to use SplFixedArray, but I am not sure how to force DB::select to use that instead of an array.
Another possible solution involves ensuring it returns a simple associative array instead of an array of stdClass objects: https://stackoverflow.com/a/37532052/373091
But when I try that as in:
// get the original fetch mode
$fetchMode = DB::getFetchMode();
// set mode to custom
DB::setFetchMode(\PDO::FETCH_ASSOC);
$memBefore = memory_get_usage();
$formResponses = DB::select($responsesSQL, $questionIDsForSQL);
DB::setFetchMode($fetchMode);
$memAfter = memory_get_usage();
dd($memBefore, $memAfter, $formResponses);
I get the error Call to undefined method Illuminate\Database\MySqlConnection::getFetchMode(), which means it apparently can no longer be done in Laravel > 5.4 :(
Any suggestions?
I think the real problem is that you're loading all 3 million records into memory at once. You should instead either process them in chunks or use a cursor.
Chunking
To chunk records into batches, you can use Laravel's chunk method. This method accepts two parameters: the chunk size, and a callback that gets passed the subset of models or objects for processing. This will execute one query per chunk.
Here's the example taken from the documentation:
Flight::chunk(200, function ($flights) {
    foreach ($flights as $flight) {
        //
    }
});
Cursor
Alternatively, you can also use the cursor method if you only want to execute a single query. In this case, Laravel will only hydrate one model at a time so you never have more than one model (or object if you're not using Eloquent) in memory at a time.
foreach (Flight::where('destination', 'Zurich')->cursor() as $flight) {
    //
}
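As for the fetch-mode dead end in the question: DB::setFetchMode() was removed in Laravel 5.4. A commonly used workaround (a sketch, not specific to this codebase) is to listen for the StatementPrepared event and set the fetch mode on the prepared statement, so rows come back as plain associative arrays instead of stdClass objects:

use Illuminate\Database\Events\StatementPrepared;
use Illuminate\Support\Facades\Event;

// every statement prepared from now on will fetch associative arrays
Event::listen(StatementPrepared::class, function ($event) {
    $event->statement->setFetchMode(\PDO::FETCH_ASSOC);
});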

How can I most efficiently check for the existence of a single value in an array of thousands of values?

Due to a weird set of circumstances, I need to determine if a value exists in a known set, then take an action. Consider:
An included file will look like this:
// Start generated code
$set = array();
$set[] = 'foo';
$set[] = 'bar';
// End generated code
Then another file will look like this:
require('that_last_file.php');
if(in_array($value, $set)) {
// Do thing
}
As noted, the creation of the array will be from generated code -- a process will create a PHP file which will be included above the if statement with require.
How concerned should I be about the size of this mess -- both in bytes, and array values? It could easily get to 5,000 values. How concerned should I be with the overhead of a 5,000-value array? Is there a more efficient way to search for the value, other than using in_array on an array? How painful is including a 5,000-line file via require?
I know there are ultimately better ways of doing this, but my limitations are that the set creation and logic has to be in an included PHP file. There are odd technical restrictions that prevent other options (i.e. -- a database lookup).
A faster way would be to flip the set once and test membership with isset(), which is a hash lookup rather than a linear scan (note that flipping inside the if would rebuild the lookup table on every check):
$lookup = array_flip($set);
if (isset($lookup[$value])) {
    // Do thing
}
A 5000 value array really isn't that bad though if it's just strings
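Since the set file is generated anyway, a variant of the same idea (a sketch, following the generated-code convention from the question) is to emit the values as array keys in the first place, so no flip is needed:

// Start generated code
$set = array();
$set['foo'] = true;
$set['bar'] = true;
// End generated code

Then the check in the other file becomes:

if (isset($set[$value])) {
    // Do thing
}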

How do I efficiently run a PHP script that doesn't take forever to execute in a WAMP environment?

I've made a script that pretty much loads a huge array of objects from a MySQL database, and then loads a huge (but smaller) list of objects from the same MySQL database.
I want to iterate over each list to check for irregular behaviour, using PHP. But every time I run the script it takes forever to execute (so far I haven't seen it complete). Are there any optimizations I can make so it doesn't take this long to execute? There are roughly 64150 entries in the first list, and about 1748 entries in the second list.
This is what the code generally looks like in pseudo code.
// an array of size 64000 containing objects in the form of {"id": 1, "unique_id": "kqiweyu21a)_"}
$items_list = [];
// an array of size 5000 containing objects in the form of {"inventory": "a long string that might have the unique_id", "name": "SomeName", "id": 1}
$user_list = [];
Up until this point the results are instant... But when I do this it takes forever to execute, seems like it never ends...
foreach ($items_list as $item) {
    foreach ($user_list as $user) {
        if (strpos($user["inventory"], $item["unique_id"]) !== false) {
            echo("Found a version of the item");
        }
    }
}
Note that the echo should rarely happen.... The issue isn't with MySQL as the $items_list and $user_list array populate almost instantly.. It only starts to take forever when I try to iterate over the lists...
With ~112M iterations (64150 × 1748), adding a break will help somewhat, even though the match rarely happens...
foreach ($items_list as $item) {
    foreach ($user_list as $user) {
        if (strpos($user["inventory"], $item["unique_id"]) !== false) {
            echo("Found a version of the item");
            break; // stop scanning users for this item once found
        }
    }
}
Alternative solution 1, with PHP 5.6: you could also use pthreads and split your big array into chunks to pool them into threads... combined with break, this will certainly improve it.
Alternative solution 2: use PHP 7; the performance improvements regarding array manipulation and loops are BIG.
Also try to sort your arrays before the loop. Depending on what you are looking for, sorting beforehand can often shorten the loop considerably if the condition is found early.
Your example is almost impossible to reproduce. You need to provide an example that can be replicated: the two loops as given, if they only access arrays, will complete extremely quickly (1-2 seconds). This means that either the strings you're searching are kilobytes or larger (not provided in the question), or something else is happening while the loops are running, such as database access.
You can let SQL do the searching for you. Since you don't share the columns you need, I'll only pull the ones I see.
SELECT i.unique_id, u.inventory
FROM items i, users u
WHERE LOCATE(i.unique_id, u.inventory)
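If it has to stay in PHP, and assuming the inventory string is a delimited list of tokens (an assumption; adjust the split pattern to the real format), another option is to index the 64k unique ids in a hash once and look up each inventory token, avoiding ~112M strpos() calls:

// build a hash of unique ids for O(1) membership checks
$ids = array();
foreach ($items_list as $item) {
    $ids[$item["unique_id"]] = true;
}
foreach ($user_list as $user) {
    // split the inventory into tokens; the delimiter here is assumed
    foreach (preg_split('/[\s,]+/', $user["inventory"]) as $token) {
        if (isset($ids[$token])) {
            echo("Found a version of the item");
            break;
        }
    }
}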

Custom buffering of php mysql results - strange issue

I discovered something very strange with my PHP code and mysqli functions. When I have my code in the format below:
function mainline() {
    $q = mysqli_query($this->conn, "select * from table", MYSQLI_USE_RESULT);
    $dataset = parse($q);
}

function parse($q) {
    if (!$q) { return NULL; }
    while ($res = mysqli_fetch_array($q)) {
        $r[] = $res;
    }
    mysqli_free_result($q);
    $q = NULL;
    $res = NULL;
    return $r;
}
I'm able to retrieve data and process it. In the above example, data is returned to $dataset and each element is retrieved in the form of $dataset[row number][field name].
Now when I change my code so its like this:
function mainline() {
    $q = mysqli_query($this->conn, "select * from table", MYSQLI_USE_RESULT);
    $dataset = parse($q);
}

function parse($q) {
    if (!$q) { return NULL; }
    while ($r[] = mysqli_fetch_array($q)); // I made the change here
    mysqli_free_result($q);
    $q = NULL;
    return $r;
}
The data returned is always nothing even though the select statement is exactly the same and always returns rows. During both tests, nothing has modified the data in the database.
My question then is why does while($res=mysqli_fetch_array($q)){$r[]=$res;} retrieve correct results and while($r[]=mysqli_fetch_array($q)); does not?
With the second while loop, I won't have to allocate an extra variable, and I'm trying to cut down on system memory use so that I can run more Apache processes instead of wasting memory unnecessarily on PHP.
Any ideas why while($r[] = mysqli_fetch_array($q)); won't work? Or any ideas how I can make it work without using an extra variable? Or am I stuck?
If you want to store the whole result in an array, then why not use
mysqli_fetch_all($q)
and store the result in whatever you want? Though if you want quick access, I think caching sounds more appropriate.
mysqli_fetch_all — Fetches all result rows as an associative array, a numeric array, or both
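Applied to the parse() function above, that would look something like this (note that mysqli_fetch_all requires the mysqlnd driver):

function parse($q) {
    if (!$q) { return NULL; }
    // one call replaces the whole fetch loop and the extra $res variable
    $r = mysqli_fetch_all($q, MYSQLI_ASSOC);
    mysqli_free_result($q);
    return $r;
}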

How do I pre-allocate memory for an array in PHP?

How do I pre-allocate memory for an array in PHP? I want to pre-allocate space for 351k longs. The function works when I don't use the array, but if I try to save long values in the array, then it fails. If I try a simple test loop to fill up 351k values with a range(), it works. I suspect that the array is causing memory fragmentation and then running out of memory.
In Java, I can use ArrayList al = new ArrayList(351000);.
I saw array_fill and array_pad but those initialize the array to specific values.
Solution:
I used a combination of answers. Kevin's answer worked alone, but I was hoping to prevent problems in the future too as the size grows.
ini_set('memory_limit', '512M');

# google doesn't return deleted ads. must keep track and assume everything else was deleted.
$foundAdIds = new \SplFixedArray(100000);
$foundAdIdsIndex = 0;
// $foundAdIds = array();
$result = $gaw->getAds(function ($googleAd) use ($adTemplates, &$foundAdIds, &$foundAdIdsIndex) { // use a callback to avoid saving in memory
    if ($foundAdIdsIndex >= $foundAdIds->count()) {
        $foundAdIds->setSize((int) ceil($foundAdIds->count() * 1.10)); // grow the array; setSize() needs an int
    }
    $foundAdIds[$foundAdIdsIndex++] = $googleAd->ad->id; # save ids to know which to not set deleted
    // $foundAdIds[] = $googleAd->ad->id;
});
PHP has a fixed-size array class, SplFixedArray:
$array = new SplFixedArray(3);
$array[1] = 'test1';
$array[0] = 'test2';
$array[2] = 'test3';
foreach ($array as $k => $v) {
    echo "$k => $v\n";
}
$array[] = 'fails'; // throws a RuntimeException: a fixed-size array cannot be appended to
gives
0 => test2
1 => test1
2 => test3
As other people have pointed out, you can't do this in PHP (well, you can create an array of fixed length, but that's not really what you need). What you can do, however, is increase the amount of memory for the process.
ini_set('memory_limit', '1024M');
Put that at the top of your PHP script and you should be OK. You can also set this in the php.ini file. This does not allocate 1GB of memory to PHP, but rather allows PHP to expand its memory usage up to that point.
A couple of things to point out though:
This might not be allowed on some shared hosts
If you're using this much memory, you might need to have a look at how you're doing things and see if they can be done more efficiently
Look out for opportunities to clear out unneeded resources (do you really need to keep hold of $x that contains a huge object you've already used?) using unset($x);
The quick answer is: you can't
PHP is quite different from java.
You can make an array with specific values as you said, but you already know about them. You can 'fake' it by filling it with null values, but that's about the same to be honest.
So unless you want to just create one with array_fill and null (which is a hack in my head), you just can't.
(You might want to check your reasoning about the memory. Are you sure this isn't an XY problem? As memory is limited by a cap on maximum usage, I don't think fragmentation would have much effect. Check what is taking up your memory rather than trying to go down this road.)
The closest you will get is using SplFixedArray. It doesn't preallocate the memory needed to store the values (because you can't pre-specify the type of values used), but it preallocates the array slots and doesn't need to resize the array itself as you add values.
