I'm still relatively new to PHP and trying to use pthreads to solve an issue. I have 20 threads running processes that end at varying times. Most finish in under 10 seconds or so. I don't need all 20 to complete, just the first 10. Once I have 10, I would like to kill the remaining threads, or at least continue on to the next step.
I have tried setting set_time_limit to about 20 seconds in each of the threads, but they ignore it and keep running. I loop through the jobs calling join() because I don't want the rest of the program to run until the results are in, but that blocks me until the slowest thread has finished. While pthreads has reduced the run time from around a minute to about 30 seconds, I could shave off even more, since the first 10 threads finish in about 3 seconds.
Thanks for any help and here is my code:
$count = 0;
foreach ($array as $i) {
    $imgName = $this->smsId . "_$count.jpg";
    $name = "LocalCDN/" . $imgName;
    $stack[] = new AsyncImageModify($i['largePic'], $name);
    $count++;
}

// Run the threads
foreach ($stack as $t) {
    $t->start();
}

// Check if the threads have finished; push the coordinates into an array
foreach ($stack as $t) {
    if ($t->join()) {
        array_push($this->imgArray, $t->data);
    }
}
class AsyncImageModify extends \Thread {
    public $data;

    public function __construct($arg, $name) {
        $this->arg = $arg;
        $this->name = $name;
    }

    public function run() {
        // tried putting the set_time_limit() here, didn't work
        if ($this->arg) {
            // Get the image
            $didWeGetTheImage = Image::getImage($this->arg, $this->name);
            if ($didWeGetTheImage) {
                $timestamp1 = microtime(true);
                print_r("Starting face detection $this->arg" . "\n");
                print_r(" ");
                $j = Image::process1($this->name);
                if ($j) {
                    // lets go ahead and do our image manipulation at this point
                    $userPic = Image::process2($this->name, $this->name, 200, 200, false, $this->name, $j);
                    if ($userPic) {
                        $this->data = $userPic;
                        print_r("Back from process2; the image returned is $userPic");
                    }
                }
                $endTime = microtime(true);
                $td = $endTime - $timestamp1;
                print_r("Finished face detection $this->arg in $td seconds" . "\n");
                print_r($j);
            }
        }
    }
}
It is difficult to guess the functionality of Image::* methods, so I can't really answer in any detail.
What I can say is that there are very few machines I can think of that are suited to running 20 concurrent threads in any case. A more suitable setup would be the worker/stackable model. A Worker thread is a reusable context that can execute task after task, implemented as Stackables; execution in a multi-threaded environment should always use the fewest threads possible to get the most work done.
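To illustrate, here is a minimal sketch of the worker/stackable model (pthreads v1/v2 API). ImageTask and ImageWorker are hypothetical names, and the Image::getImage() call is assumed from the question; swap in your own logic:

class ImageTask extends Stackable {
    public $url, $name, $data;

    public function __construct($url, $name) {
        $this->url = $url;
        $this->name = $name;
    }

    public function run() {
        // Executed inside whichever worker this task was stacked onto
        $this->data = Image::getImage($this->url, $this->name);
    }
}

class ImageWorker extends Worker {
    public function run() {
        // One-time per-worker setup (autoloaders, config) goes here
    }
}

// A few reusable workers serving many tasks
$workers = array();
for ($i = 0; $i < 4; $i++) {
    $workers[$i] = new ImageWorker();
    $workers[$i]->start();
}

$tasks = array();
foreach ($array as $n => $job) {
    $task = new ImageTask($job['largePic'], "LocalCDN/{$n}.jpg");
    $tasks[] = $task;
    $workers[$n % 4]->stack($task); // round-robin the tasks over the workers
}

foreach ($workers as $w) {
    $w->shutdown(); // blocks until each worker has drained its stack
}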
Please see the pooling example and the other examples distributed with pthreads, available on GitHub. Additionally, much information regarding usage is contained in past bug reports, if you are still struggling after that.
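As for the original goal of continuing once the first 10 of 20 threads finish: a rough sketch of my own (not from the answer above) is to poll Thread::isRunning() instead of join()ing every thread, assuming run() stores its result in $this->data as in the question:

$needed = 10;
$collected = array();
while (count($collected) < $needed) {
    foreach ($stack as $k => $t) {
        if (!isset($collected[$k]) && !$t->isRunning()) {
            $collected[$k] = $t->data; // finished (data may be null on failure)
        }
    }
    usleep(100000); // poll every 100 ms instead of blocking in join()
}
$this->imgArray = array_values(array_filter($collected));
// The slower threads simply keep running; pthreads offers no safe way to
// kill them, so let them finish or have run() check a shared "stop" flag.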
Very simply, I have a program that needs to perform a large process (anywhere from 5 seconds to several minutes) and I don't want to make my page wait for the process to finish before loading.
I understand that I need to run this gearman job as a background process but I'm struggling to identify the proper solution to get real-time status updates as to when the worker actually finishes the process. I've used the following code snippet from the PHP examples:
$done = false;
do {
    sleep(3);
    $stat = $gmclient->jobStatus($job_handle);
    if (!$stat[0]) { // the job is no longer known to the server, so it is done
        $done = true;
    }
    echo "Running: " . ($stat[1] ? "true" : "false") . ", numerator: " . $stat[2] . ", denominator: " . $stat[3] . "\n";
} while (!$done);
echo "done!\n";
This works; however, it appears to report back to the client as soon as the worker has been told what to do. Instead, I want to know when the literal process of the job has finished.
My real-life example:
Pull several data feeds from an API (some feeds take longer than others)
Load the couple of feeds that always load fast, and place a "Waiting/Loading" animation on the section whose feed was sent off to the worker queue
When the work is done and the results have been completely retrieved, replace the animation with the results
This is a bit late, but I stumbled across this question looking for the same answer. I was able to get a solution together, so maybe it will help someone else.
For starters, refer to the documentation on GearmanClient::jobStatus. This will be called from the client, and the function accepts a single argument: $job_handle. You retrieve this handle when you dispatch the request:
$client = new GearmanClient( );
$client->addServer( '127.0.0.1', 4730 );
$handle = $client->doBackground( 'serviceRequest', $data );
Later on, you can retrieve the status by calling the jobStatus function on the same $client object:
$status = $client->jobStatus( $handle );
This is only meaningful, though, if you actually change the status from within your worker with the sendStatus method:
$worker = new GearmanWorker( );
$worker->addServer( '127.0.0.1', 4730 ); // same server the client dispatches to
$worker->addFunction( 'serviceRequest', function( $job ) {
    $max = 10;
    // Set initial status - numerator / denominator
    $job->sendStatus( 0, $max );
    for( $i = 1; $i <= $max; $i++ ) {
        sleep( 2 ); // Simulate a long running task
        $job->sendStatus( $i, $max );
    }
    return GEARMAN_SUCCESS;
} );

while( $worker->work( ) );
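Putting the two halves together, the client can then poll the handle it got from doBackground(). A sketch under the same assumptions as above (with $data being your workload, as in the dispatch snippet); jobStatus() returns an array of the form [known, running, numerator, denominator]:

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$handle = $client->doBackground('serviceRequest', $data);

do {
    sleep(1);
    $status = $client->jobStatus($handle); // [known, running, numerator, denominator]
    if ($status[3] > 0) {
        printf("Progress: %d%%\n", 100 * $status[2] / $status[3]);
    }
} while ($status[0]); // keep polling while the server still knows the job
echo "done!\n";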
In versions of Gearman prior to 0.5, you would use the GearmanJob::status method to set the status of a job. Versions 0.6 to current (1.1) use the methods above.
See also this question: Problem With Gearman Job Status
I recently switched from a shared to a dedicated host, giving me a lot more monitoring/control. I've been trying to debug an issue I've had since before I switched: very high memory usage. I think I've narrowed it down to a specific script, a subscription to an Instagram feed/API. It runs within a CodeIgniter framework.
This is a screenshot of my processes. Note the really high httpd memory values.
Here's my controller in CodeIgniter:
class Subscribe extends CI_Controller {

    function __construct() {
        parent::__construct();
        $this->instagram_api->access_token = 'hidden';
    }

    function callback()
    {
        //echo anchor('logs/activity.log', 'LOG');
        $min_id = '';
        $next_min_id = '';
        $this->load->model('Subscribe_model');
        $min_id = $this->Subscribe_model->min_id();
        echo $min_id;
        $pugs = $this->instagram_api->tagsRecent('tagg', '', $min_id);
        if ($pugs) {
            if (property_exists($pugs->pagination, 'min_tag_id')) {
                $next_min_id = $pugs->pagination->min_tag_id;
            }
            foreach ($pugs as $pug) {
                if (is_array($pug)) {
                    foreach ($pug as $media) {
                        $url = $media->images->standard_resolution->url;
                        $m_id = $media->id;
                        $c_time = $media->created_time;
                        $user = $media->user->username;
                        $filter = $media->filter;
                        $comments = $media->comments->count;
                        $caption = $media->caption->text;
                        $link = $media->link;
                        $low_res = $media->images->low_resolution->url;
                        $thumb = $media->images->thumbnail->url;
                        $lat = $media->location->latitude;
                        $long = $media->location->longitude;
                        $loc_id = $media->location->id;
                        $date = new DateTime('2000-01-01', new DateTimeZone('Pacific/Nauru'));
                        $data = array(
                            'media_id' => $m_id,
                            'min_id' => $next_min_id,
                            'url' => $url,
                            'c_time' => $c_time,
                            'user' => $user,
                            'filter' => $filter,
                            'comment_count' => $comments,
                            'caption' => $caption,
                            'link' => $link,
                            'low_res' => $low_res,
                            'thumb' => $thumb,
                            'lat' => $lat,
                            'long' => $long,
                            'loc_id' => $loc_id,
                        );
                        $this->Subscribe_model->add_pug($data);
                    }
                }
            }
        }
    }
}
and here is the model....
class Subscribe_model extends CI_Model {

    function min_id() {
        $min_id = ''; // default when the table is empty
        $this->db->order_by("c_time", "desc");
        $query = $this->db->get("pugs");
        if ($query->num_rows() > 0) {
            $row = $query->row();
            $min_id = $row->min_id;
            if (!$min_id) {
                $min_id = '';
            }
        }
        return $min_id;
    }

    function add_pug($data) {
        $query = $this->db->get_where('pugs', array('media_id' => $data['media_id']));
        if ($query->num_rows() > 0) {
            return FALSE;
        } else {
            $this->db->insert('pugs', $data);
        }
    }
}
//============================EDIT========================//
I've converted some of the services over to FastCGI and it seems to have brought my memory usage down significantly, but I've noticed a bump in CPU. I was hoping that switching to a dedicated server would mean far fewer headaches and make things much easier, but it's been a nightmare so far. Afraid I've bitten off more than I can chew.
Another fear of mine is adding more domain names to the server. Will that add new processes that run as high as the multiple php-cgi processes in the last image?
Here are my most recent outputs...
To ensure there is no true memory leak, try running nothing but httpd/mysqld on the server (killall Xorg / telinit 3), then stop those two services. Note down the output of free|grep Mem |sed 's/\([^0-9]*[^\ ]*\)\{3\}\([^\ ]*\).*/\1/'. This is X free bytes of RAM. Now start the httpd/mysqld services and let them serve a few hundred requests, then stop the services and note down the numbers again; repeat until satisfied with the median results.
It is not uncommon for httpd to consume a lot of RAM, and mysqld also caches in memory. This is simply because, if the same request is encountered consecutive times (in a row), the static caches are already buffered and good to go.
For PHP, a class is pre-compiled; once the system needs it again after the first compile, it will not have to interpret the script line by line, as it has a bytecode-encoded object to work with. The compilation is of course redone if fstat.mtime > bytecode.mtime.
You can analyse real (non-swapped) memory usage with this command:
ps -ylC httpd --sort=rss
Child process size for serving a static file is about 2-3 MB; for dynamic content such as PHP, it may be around 15 MB.
To configure how Apache sets up its workers, these parameters can be set in httpd.conf:
StartServers,
MaxClients,
MinSpareThreads,
MaxSpareThreads,
ThreadsPerChild,
MaxRequestsPerChild
Check this link: http://www.howtoforge.com/configuring_apache_for_maximum_performance
Section 3.5:
The MaxClients directive sets the limit on the maximum number of simultaneous
requests that can be supported by the server; no more than this number of child
processes are spawned. It shouldn't be set too low, such that new connections
are put in a queue which eventually times out, leaving server resources unused.
Setting it too high will cause the server to start swapping, and response time
will degrade drastically. An appropriate value for MaxClients can be calculated
as: MaxClients = Total RAM dedicated to the web server / Max child process size
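For example, with 2 GB of RAM dedicated to Apache and roughly 15 MB per PHP-serving child, that formula gives about 2048 / 15 ≈ 135. A hedged illustration of what the resulting prefork settings might look like; the values are examples only, not recommendations:

<IfModule prefork.c>
    StartServers           5
    MinSpareServers        5
    MaxSpareServers       10
    MaxClients           135
    MaxRequestsPerChild 1000
</IfModule>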
Apache performance-tuning documentation is available at http://httpd.apache.org/docs/2.0/misc/perf-tuning.html; skip down to the section about process creation for more information on the config options mentioned above.
I'd like to create a php script that runs as a daily cron. What I'd like to do is enumerate through all users within an Active Directory, extract certain fields from each entry, and use this information to update fields within a MySQL database.
Basically what I want to to do is sync up certain user information between Active Directory and a MySQL table.
The problem I have is that the sizelimit on the Active Directory server is often set at 1000 entries per search result. I had hoped that the php function "ldap_next_entry" would get around this by only fetching one entry at a time, but before you can call "ldap_next_entry", you first have to call "ldap_search", which can trigger the SizeLimit exceeded error.
Is there any way besides removing the sizelimit from the server? Can I somehow get "pages" of results?
BTW - I am currently not using any 3rd-party libraries or code, just PHP's LDAP functions. Although, I am certainly open to using a library if that will help.
I've been struck by the same problem while developing Zend_Ldap for the Zend Framework. I'll try to explain what the real problem is, but to make it short: until PHP 5.4, it wasn't possible to use paged results from an Active Directory with an unpatched PHP (ext/ldap) version due to limitations in exactly this extension.
Let's try to unravel the whole thing... Microsoft Active Directory uses a so-called server control to accomplish server-side result paging. This control is described in RFC 2696, "LDAP Control Extension for Simple Paged Results Manipulation".
ext/ldap offers access to LDAP control extensions via ldap_set_option() and its LDAP_OPT_SERVER_CONTROLS and LDAP_OPT_CLIENT_CONTROLS options respectively. To set the paged-results control you need the control OID, which is 1.2.840.113556.1.4.319, and you need to know how to encode the control value (this is described in the RFC). The value is an octet string wrapping the BER-encoded version of the following SEQUENCE (copied from the RFC):
realSearchControlValue ::= SEQUENCE {
    size   INTEGER (0..maxInt),
           -- requested page size from client
           -- result set size estimate from server
    cookie OCTET STRING
}
So we can set the appropriate server control prior to executing the LDAP query:
$pageSize = 100;
$pageControl = array(
    'oid'        => '1.2.840.113556.1.4.319', // the control OID
    'iscritical' => true,                     // the operation should fail if the server cannot support this control
    'value'      => sprintf("%c%c%c%c%c%c%c", 48, 5, 2, 1, $pageSize, 4, 0) // the required BER-encoded control value
);
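To make that magic string less opaque, here is the same sprintf() with each byte annotated (this simple encoding is valid as long as $pageSize fits in one byte, i.e. up to 127):

$value = sprintf(
    "%c%c%c%c%c%c%c",
    48,        // 0x30: SEQUENCE tag
    5,         // length of the sequence body in bytes
    2,         // 0x02: INTEGER tag (the page size)
    1,         //   integer length: 1 byte
    $pageSize, //   the page size itself
    4,         // 0x04: OCTET STRING tag (the cookie)
    0          //   zero-length cookie on the first request
);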
This allows us to send a paged query to the LDAP/AD server. But how do we know if there are more pages to follow, and how do we specify the control value for our next query?
This is where we get stuck... The server responds with a result set that includes the required paging information, but PHP lacks a method to retrieve exactly this information from the result set. PHP provides a wrapper for the LDAP API function ldap_parse_result(), but the required last parameter, serverctrlsp, is not exposed to the PHP function, so there is no way to retrieve the required information. A bug report has been filed for this issue, but there has been no response since 2005. If ldap_parse_result() provided the required parameter, using paged results would work like this:
$l = ldap_connect('somehost.mydomain.com');
$pageSize = 100;
$pageControl = array(
    'oid' => '1.2.840.113556.1.4.319',
    'iscritical' => true,
    'value' => sprintf("%c%c%c%c%c%c%c", 48, 5, 2, 1, $pageSize, 4, 0)
);
$controls = array($pageControl);
ldap_set_option($l, LDAP_OPT_PROTOCOL_VERSION, 3);
ldap_bind($l, 'CN=bind-user,OU=my-users,DC=mydomain,DC=com', 'bind-user-password');
$continue = true;
while ($continue) {
    ldap_set_option($l, LDAP_OPT_SERVER_CONTROLS, $controls);
    $sr = ldap_search($l, 'OU=some-ou,DC=mydomain,DC=com', 'cn=*', array('sAMAccountName'), null, null, null, null);
    ldap_parse_result($l, $sr, $errcode, $matcheddn, $errmsg, $referrals, $serverctrls); // (*)
    if (isset($serverctrls)) {
        foreach ($serverctrls as $i) {
            if ($i["oid"] == '1.2.840.113556.1.4.319') {
                // overwrite the page-size byte and reuse the cookie the server returned
                $i["value"]{8} = chr($pageSize);
                $i["iscritical"] = true;
                $controls = array($i);
                break;
            }
        }
    }
    $info = ldap_get_entries($l, $sr);
    if ($info["count"] < $pageSize) {
        $continue = false;
    }
    for ($entry = ldap_first_entry($l, $sr); $entry != false; $entry = ldap_next_entry($l, $entry)) {
        $dn = ldap_get_dn($l, $entry);
    }
}
As you can see, there is a single line of code (*) that renders the whole thing useless. On my way through the sparse information on this subject I found a patch against the PHP 4.3.10 ext/ldap by Iñaki Arenaza, but I neither tried it nor do I know whether the patch can be applied to a PHP 5 ext/ldap. The patch extends ldap_parse_result() to expose the 7th parameter to PHP:
--- ldap.c 2004-06-01 23:05:33.000000000 +0200
+++ /usr/src/php4/php4-4.3.10/ext/ldap/ldap.c 2005-09-03 17:02:03.000000000 +0200
@@ -74,7 +74,7 @@
ZEND_DECLARE_MODULE_GLOBALS(ldap)
static unsigned char third_argument_force_ref[] = { 3, BYREF_NONE, BYREF_NONE, BYREF_FORCE };
-static unsigned char arg3to6of6_force_ref[] = { 6, BYREF_NONE, BYREF_NONE, BYREF_FORCE, BYREF_FORCE, BYREF_FORCE, BYREF_FORCE };
+static unsigned char arg3to7of7_force_ref[] = { 7, BYREF_NONE, BYREF_NONE, BYREF_FORCE, BYREF_FORCE, BYREF_FORCE, BYREF_FORCE, BYREF_FORCE };
static int le_link, le_result, le_result_entry, le_ber_entry;
@@ -124,7 +124,7 @@
#if ( LDAP_API_VERSION > 2000 ) || HAVE_NSLDAP
PHP_FE(ldap_get_option, third_argument_force_ref)
PHP_FE(ldap_set_option, NULL)
- PHP_FE(ldap_parse_result, arg3to6of6_force_ref)
+ PHP_FE(ldap_parse_result, arg3to7of7_force_ref)
PHP_FE(ldap_first_reference, NULL)
PHP_FE(ldap_next_reference, NULL)
#ifdef HAVE_LDAP_PARSE_REFERENCE
@@ -1775,14 +1775,15 @@
Extract information from result */
PHP_FUNCTION(ldap_parse_result)
{
- pval **link, **result, **errcode, **matcheddn, **errmsg, **referrals;
+ pval **link, **result, **errcode, **matcheddn, **errmsg, **referrals, **serverctrls;
ldap_linkdata *ld;
LDAPMessage *ldap_result;
+ LDAPControl **lserverctrls, **ctrlp, *ctrl;
char **lreferrals, **refp;
char *lmatcheddn, *lerrmsg;
int rc, lerrcode, myargcount = ZEND_NUM_ARGS();
- if (myargcount < 3 || myargcount > 6 || zend_get_parameters_ex(myargcount, &link, &result, &errcode, &matcheddn, &errmsg, &referrals) == FAILURE) {
+ if (myargcount < 3 || myargcount > 7 || zend_get_parameters_ex(myargcount, &link, &result, &errcode, &matcheddn, &errmsg, &referrals, &serverctrls) == FAILURE) {
WRONG_PARAM_COUNT;
}
@@ -1793,7 +1794,7 @@
myargcount > 3 ? &lmatcheddn : NULL,
myargcount > 4 ? &lerrmsg : NULL,
myargcount > 5 ? &lreferrals : NULL,
- NULL /* &serverctrls */,
+ myargcount > 6 ? &lserverctrls : NULL,
0 );
if (rc != LDAP_SUCCESS ) {
php_error(E_WARNING, "%s(): Unable to parse result: %s", get_active_function_name(TSRMLS_C), ldap_err2string(rc));
@@ -1805,6 +1806,29 @@
/* Reverse -> fall through */
switch(myargcount) {
+ case 7 :
+ zval_dtor(*serverctrls);
+
+ if (lserverctrls != NULL) {
+ array_init(*serverctrls);
+ ctrlp = lserverctrls;
+
+ while (*ctrlp != NULL) {
+ zval *ctrl_array;
+
+ ctrl = *ctrlp;
+ MAKE_STD_ZVAL(ctrl_array);
+ array_init(ctrl_array);
+
+ add_assoc_string(ctrl_array, "oid", ctrl->ldctl_oid,1);
+ add_assoc_bool(ctrl_array, "iscritical", ctrl->ldctl_iscritical);
+ add_assoc_stringl(ctrl_array, "value", ctrl->ldctl_value.bv_val,
+ ctrl->ldctl_value.bv_len,1);
+ add_next_index_zval (*serverctrls, ctrl_array);
+ ctrlp++;
+ }
+ ldap_controls_free (lserverctrls);
+ }
case 6 :
zval_dtor(*referrals);
if (array_init(*referrals) == FAILURE) {
Actually, the only option left would be to change the Active Directory configuration and raise the maximum result limit. The relevant option is called MaxPageSize and can be altered using ntdsutil.exe; please see "How to view and set LDAP policy in Active Directory by using Ntdsutil.exe".
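From memory of that article, the ntdsutil session looks roughly like this (replace the server name with your own domain controller, and verify the exact prompts against the KB article):

C:\> ntdsutil
ntdsutil: ldap policies
ldap policy: connections
server connections: connect to server dc01.mydomain.com
server connections: quit
ldap policy: set maxpagesize to 5000
ldap policy: commit changes
ldap policy: quit
ntdsutil: quit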
EDIT (reference to COM):
Or you can go the other way round and use the COM-approach via ADODB as suggested in the link provided by eykanal.
Support for paged results was added in PHP 5.4.
See ldap_control_paged_result for more details.
This isn't a full answer, but this guy was able to do it. I don't understand what he did, though.
By the way, a partial answer is that you CAN get "pages" of results. From the documentation:
resource ldap_search ( resource $link_identifier , string $base_dn ,
string $filter [, array $attributes [, int $attrsonly [, int $sizelimit [,
int $timelimit [, int $deref ]]]]] )
...
sizelimit Enables you to limit the count of entries fetched. Setting this to 0 means no limit.
Note: This parameter can NOT override server-side preset sizelimit.
You can set it lower though. Some directory server hosts will be
configured to return no more than a preset number of entries. If this
occurs, the server will indicate that it has only returned a partial
results set. This also occurs if you use this parameter to limit the
count of fetched entries.
I don't know how to specify that you want to search STARTING from a certain position, though. I.e., after you get your first 1000, I don't know how to specify that now you need the next 1000. Hopefully someone else can help you there :)
Here's an alternative (which works pre-PHP 5.4) using attribute range retrieval. If you have 10,000 values to fetch but your AD server only returns 5,000 per request:
$ldapSearch = ldap_search($ldapResource, $basedn, $filter, array('member;range=0-4999'));
$ldapResults = ldap_get_entries($ldapResource, $ldapSearch);
$members = $ldapResults[0]['member;range=0-4999'];

$ldapSearch = ldap_search($ldapResource, $basedn, $filter, array('member;range=5000-10000'));
$ldapResults = ldap_get_entries($ldapResource, $ldapSearch);
// AD names the attribute "member;range=5000-*" in the final slice
$members = array_merge($members, $ldapResults[0]['member;range=5000-*']);
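A sketch generalizing this (my own illustration, under the same $ldapResource/$basedn/$filter assumptions as above): keep requesting the next slice until AD answers with the terminal member;range=N-* form:

$members = array();
$start = 0;
$step = 5000;
do {
    $attr = "member;range=$start-" . ($start + $step - 1);
    $sr = ldap_search($ldapResource, $basedn, $filter, array($attr));
    $entry = ldap_get_entries($ldapResource, $sr);

    // Find the attribute AD actually returned; on the last slice it is
    // named "member;range=N-*" rather than the range we asked for.
    $returned = null;
    foreach (array_keys($entry[0]) as $key) {
        if (strpos($key, 'member;range=') === 0) {
            $returned = $key;
            break;
        }
    }
    if ($returned === null) {
        break; // no more values (or the attribute is absent)
    }

    $values = $entry[0][$returned];
    unset($values['count']); // ldap_get_entries() adds a 'count' element
    $members = array_merge($members, $values);
    $start += $step;
} while (substr($returned, -1) !== '*');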
I was able to get around the size limitation using ldap_control_paged_result.
ldap_control_paged_result is used to enable LDAP pagination by sending the pagination control. The function below worked perfectly in my case.
function retrieves_users($conn)
{
    $dn = 'ou=,dc=,dc=';
    $filter = "(&(objectClass=user)(objectCategory=person)(sn=*))";
    $justthese = array();

    // enable pagination with a page size of 100
    $pageSize = 100;
    $cookie = '';
    $data = array('usersLdap' => array());

    do {
        ldap_control_paged_result($conn, $pageSize, true, $cookie);
        $result = ldap_search($conn, $dn, $filter, $justthese);
        $entries = ldap_get_entries($conn, $result);

        if (!empty($entries)) {
            for ($i = 0; $i < $entries["count"]; $i++) {
                $data['usersLdap'][] = array(
                    'name' => $entries[$i]["cn"][0],
                    'username' => $entries[$i]["userprincipalname"][0]
                );
            }
        }

        // retrieve the cookie for the next page from the response
        ldap_control_paged_result_response($conn, $result, $cookie);
    } while ($cookie !== null && $cookie != '');

    return $data;
}