I need to check whether a set of keys exists in S3, for each of a large number of items (each set of keys relates to one of those items).
I am using the PHP SDK (v2)
Currently I am calling $client->doesObjectExist(BUCKET, $key) for each of the keys, which is a bottleneck (the round-trip time to S3 for each call).
I would prefer to do something like $client->doesObjectExist(BUCKET, $batch) where $batch = array($key1, $key2 ... $keyn), and for the client to check all of those keys then come back with an array of responses (or some other similar structure).
I have come across a few references to a "batch api" which sounds promising, but nothing concrete. I'm guessing that this might have been present only in the v1 SDK.
You can do parallel requests with the AWS SDK for PHP by taking advantage of the underlying Guzzle library's features. Since the doesObjectExist method actually performs a HeadObject operation under the hood, you can create groups of HeadObject commands like this:
use Aws\S3\S3Client;
use Guzzle\Service\Exception\CommandTransferException;

function doObjectsExist(S3Client $s3, $bucket, array $objectKeys)
{
    $headObjectCommands = array();
    foreach ($objectKeys as $key) {
        $headObjectCommands[] = $s3->getCommand('HeadObject', array(
            'Bucket' => $bucket,
            'Key'    => $key,
        ));
    }

    try {
        $s3->execute($headObjectCommands); // Executes in parallel
        return true;
    } catch (CommandTransferException $e) {
        return false;
    }
}
$s3 = S3Client::factory(array(
    'key'    => 'your_aws_access_key_id',
    'secret' => 'your_aws_secret_key',
));
$bucket = 'your_bucket_name';
$objectKeys = array('object_key_1', 'object_key_2','object_key_3');
// Returns true only if ALL of the objects exist
echo doObjectsExist($s3, $bucket, $objectKeys) ? 'YES' : 'NO';
If you want data from the responses, other than just whether or not the keys exist, you can change the try-catch block to do something like this instead.
try {
    $executedCommands = $s3->execute($headObjectCommands);
} catch (CommandTransferException $e) {
    $executedCommands = $e->getAllCommands();
}

// Do stuff with the command objects
foreach ($executedCommands as $command) {
    $exists = $command->getResponse()->isSuccessful() ? "YES" : "NO";
    echo "{$command['Bucket']}/{$command['Key']}: {$exists}\n";
}
Sending commands in parallel is mentioned in the AWS SDK for PHP User Guide, but I would also take a look at the Guzzle batching documentation.
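For reference, here is a minimal sketch of what the Guzzle batching approach might look like with the commands built above. This assumes Guzzle 3's BatchBuilder; the batch size of 20 is arbitrary.

use Guzzle\Batch\BatchBuilder;

// Build a batch that transfers queued commands in parallel, 20 at a time,
// and flushes itself automatically once 20 items are queued.
$batch = BatchBuilder::factory()
    ->transferCommands(20)
    ->autoFlushAt(20)
    ->build();

foreach ($headObjectCommands as $command) {
    $batch->add($command);
}
$batch->flush(); // transfer anything still queued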
The only way to do a bulk check to see if some keys exist would be to list the objects in the bucket.
A list call returns up to 1,000 keys per request, so it is much faster than calling doesObjectExist for each key. But if the bucket holds a large number of keys and you only want to check a couple of them, listing every object in the bucket is not practical; in that case, your only option is to check each object individually.
The problem is not that the PHP v2 SDK lacks bulk functionality, but that the S3 API itself does not offer such bulk processing.
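A rough sketch of the listing approach with the v2 SDK follows. It assumes the keys you care about share a common prefix (the 'common/prefix/' value is a placeholder), so the listing stays small; the iterator transparently pages through results 1,000 keys at a time.

// Build a set of every key currently in the bucket under the prefix.
$existing = array();
$iterator = $s3->getIterator('ListObjects', array(
    'Bucket' => $bucket,
    'Prefix' => 'common/prefix/',
));
foreach ($iterator as $object) {
    $existing[$object['Key']] = true;
}

// Check membership locally, with no further round trips.
foreach ($objectKeys as $key) {
    echo $key . ': ' . (isset($existing[$key]) ? 'exists' : 'missing') . "\n";
}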
I am building on Jeremy Lindblom's answer.
I just want to point out the OnComplete callback that you can set up on each command.
$bucket = 'my-bucket';
$keys = array('page1.txt', 'page2.txt');

$commands = array();
foreach ($keys as $key) {
    $commands[] = $s3Client->getCommand('HeadObject', array('Bucket' => $bucket, 'Key' => $key))
        ->setOnComplete(
            function ($command) use ($bucket, $key) {
                echo "\nBucket: $bucket\n";
                echo "\nKey: $key\n";
                // see http://goo.gl/pIWoYr for more detail on command objects
                var_dump($command->getResult());
            }
        );
}
try {
    $ex_commands = $s3Client->execute($commands);
} catch (\Guzzle\Service\Exception\CommandTransferException $e) {
    $ex_commands = $e->getAllCommands();
}

// This is necessary; without it, the OnComplete handlers wouldn't get called (strange?!?)
foreach ($ex_commands as $command) {
    $command->getResult();
}
It would be wonderful if someone could shed light on why I need to call $command->getResult() to invoke the OnComplete handler.
I'm setting up a client app that uses my BigQuery data. I need to build it with PHP (which I'm no expert in).
I want to authenticate and be able to run queries using PHP on my server.
I'm having trouble authenticating, as the solutions I find seem outdated or incomplete.
I've got the JSON key and the project ID, but I don't know the proper way to authenticate to BigQuery.
Can someone tell me how to do that properly, and whether I need another library (like Google Cloud auth)?
Here is the code snippet:
public function getServiceBuilder() {
    putenv('GOOGLE_APPLICATION_CREDENTIALS=path_to_service_account.json');

    $builder = new ServiceBuilder(array(
        'projectId' => PROJECT_ID
    ));
    return $builder;
}
Then you can use it like this:
$builder = $this->getServiceBuilder();
$bigQuery = $builder->bigQuery();
$job = $bigQuery->runQueryAsJob('SELECT .......');

$backoff = new ExponentialBackoff(8);
$backoff->execute(function () use ($job) {
    $job->reload();
    $this->e('reloading job'); // $this->e() is the author's own logging helper
    if (!$job->isComplete()) {
        throw new \Exception();
    }
});

if (!$job->isComplete()) {
    $this->e('Job failed to complete within the allotted time.');
    return false;
}

$queryResults = $job->queryResults();
if ($queryResults->isComplete()) {
    $i = 0;
    $rows = $queryResults->rows();
    foreach ($rows as $row) {
        $i++;
        // process each $row here
    }
}
You may need to add:
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\ServiceBuilder;
use Google\Cloud\ExponentialBackoff;
and to composer.json (this is just an example; the actual version may differ):
{
    "require": {
        "google/cloud": "^0.99.0"
    }
}
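As an aside, instead of putenv() you can point the ServiceBuilder at the JSON key directly. A minimal sketch, reusing the PROJECT_ID constant and the placeholder key path from above; 'keyFilePath' is the config option for a service account key file:

use Google\Cloud\ServiceBuilder;

public function getServiceBuilder() {
    // Pass the service account key explicitly rather than via the environment.
    return new ServiceBuilder(array(
        'projectId'   => PROJECT_ID,
        'keyFilePath' => 'path_to_service_account.json',
    ));
}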
In my scenario I may need to make over 100 curl requests to get the information I need. There's no way to get this information beforehand, and I don't have access to the server that I will be making the requests to. My plan is to use curl_multi_init(). Each response will come back as JSON. The problem is that I need to receive the information in the order in which I placed the requests; otherwise I won't know where everything goes after the responses come back. How do I solve this problem?
When you get the handles back from curl_multi_info_read, you can compare those handles against your keyed list, then of course use the key to know where your response goes. Here's the direct implementation, based on a model I use for a scraper:
// here's our list of URLs, in the order we care about
$easy_handles['google'] = curl_init('https://google.com/');
$easy_handles['bing'] = curl_init('https://bing.com/');
$easy_handles['duckduckgo'] = curl_init('https://duckduckgo.com/');

// our responses will be here, keyed the same as the URL list
$responses = [];

// here's the code to do the multi-request -- it's all boilerplate
$common_options = [CURLOPT_FOLLOWLOCATION => true, CURLOPT_RETURNTRANSFER => true];
$multi_handle = curl_multi_init();
foreach ($easy_handles as $easy_handle) {
    curl_setopt_array($easy_handle, $common_options);
    curl_multi_add_handle($multi_handle, $easy_handle);
}

do {
    $status = curl_multi_exec($multi_handle, $runCnt);
    assert(CURLM_OK === $status);
    do {
        $status = curl_multi_select($multi_handle, 2/*seconds timeout*/);
        if (-1 === $status) usleep(10); // reported bug in PHP
    } while (0 === $status);
    while (false !== ($info = curl_multi_info_read($multi_handle))) {
        foreach ($easy_handles as $key => $easy_handle) { // find the response handle
            if ($info['handle'] === $easy_handle) {       // from our list
                if (CURLE_OK === $info['result']) {
                    $responses[$key] = curl_multi_getcontent($info['handle']);
                } else {
                    $responses[$key] = new \RuntimeException(
                        curl_strerror($info['result'])
                    );
                }
            }
        }
    }
} while (0 < $runCnt);
Most of this is boilerplate machinery to do the multi fetch. The lines that target your specific question are:
foreach ($easy_handles as $key => $easy_handle) { // find the response handle
    if ($info['handle'] === $easy_handle) {       // from our list
        if (CURLE_OK === $info['result']) {
            $responses[$key] = curl_multi_getcontent($info['handle']);
Loop over your list comparing the returned handle against each stored handle, then use the corresponding key to fill in your response.
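As a small refinement, the inner loop can be replaced with a direct lookup, since array_search() with strict comparison finds the key for the returned handle directly:

// An equivalent lookup without the inner foreach: strict (===) search
// matches the returned handle against the keyed handle list.
$key = array_search($info['handle'], $easy_handles, true);
if (false !== $key) {
    if (CURLE_OK === $info['result']) {
        $responses[$key] = curl_multi_getcontent($info['handle']);
    } else {
        $responses[$key] = new \RuntimeException(curl_strerror($info['result']));
    }
}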
Obviously, since the requests are asynchronous, you cannot predict the order in which the responses will arrive. Therefore, your design must arrange for each request to include "some random bit of information" – a so-called nonce – which the responder is somehow obliged to return to you verbatim.
Based upon this "nonce," you will then be able to pair each response to the request which originated it – and to discard any random bits of garbage that wander in "out of the blue."
Otherwise, there is no(!) solution to your problem.
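With curl_multi specifically, one built-in way to carry such an identifier is CURLOPT_PRIVATE, which stores a string on the handle itself and can be read back when the response completes. A minimal sketch, assuming a hypothetical $urls array keyed however you key your requests, plus the usual curl_multi_exec() loop from the answer above:

$multi_handle = curl_multi_init();
foreach ($urls as $key => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PRIVATE, (string) $key); // the "nonce" rides along on the handle
    curl_multi_add_handle($multi_handle, $ch);
}

// ... run the curl_multi_exec()/curl_multi_select() loop, then for each
// $info returned by curl_multi_info_read():
$key = curl_getinfo($info['handle'], CURLINFO_PRIVATE);
$responses[$key] = curl_multi_getcontent($info['handle']);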
I need to perform some fairly heavy queries with Amazon's AWS SDK for PHP.
The most efficient way would be to use PHP's MultiCurl. It seems that Guzzle already has functionality for MultiCurl built in.
Does using the standard methods provided by the AWS SDK automatically use MultiCurl, or do I have to specify its usage directly? E.g., calling $sns->Publish() 30 times.
Thanks!
Parallel requests work exactly the same in the SDK as in plain Guzzle and do take advantage of MultiCurl. For example, you could do something like this:
$message = 'Hello, world!';
$publishCommands = array();
foreach ($topicArns as $topicArn) {
    $publishCommands[] = $sns->getCommand('Publish', array(
        'TopicArn' => $topicArn,
        'Message'  => $message,
    ));
}

try {
    $successfulCommands = $sns->execute($publishCommands);
    $failedCommands = array();
} catch (\Guzzle\Service\Exception\CommandTransferException $e) {
    $successfulCommands = $e->getSuccessfulCommands();
    $failedCommands = $e->getFailedCommands();
}

foreach ($failedCommands as $failedCommand) { /* Handle any errors */ }

$messageIds = array();
foreach ($successfulCommands as $successfulCommand) {
    $messageIds[] = $successfulCommand->getResult()->get('MessageId');
}
The AWS SDK for PHP User Guide has more information about working with command objects in this way.
Assume Composer is installed and you need to set up an EC2 client, with the SDK installed via the recommended Composer method. First call aws_setup, then create an EC2 client object with your security credentials. Since Composer has been invoked, it will automatically load the required libraries.
Then use DescribeInstances to get all running instances.
I packaged the countInstances function so it can be reused. You can call DescribeInstances with an array to filter results, which is shown at the end.
Set up as follows:
require('/PATH/TO/MY/COMPOSER/vendor/autoload.php');

function aws_setup()
{
    $conf_aws = array();
    $conf_aws['key']    = 'MYKEY';
    $conf_aws['secret'] = 'MYSECRET';
    $conf_aws['region'] = 'us-east-1';
    return $conf_aws;
}

function countInstances($list)
{
    $count = 0;
    foreach ($list['Reservations'] as $instances) {
        foreach ($instances['Instances'] as $instance) {
            $count++;
        }
    }
    return $count;
}
$config = aws_setup();
$ec2Client = \Aws\Ec2\Ec2Client::factory($config);
$list = $ec2Client->DescribeInstances();
echo "Number of running instances: " . countInstances($list);
If you want to filter your results try something like this as a parameter to DescribeInstances:
array('Filters' => array(array('Name' => 'tag-value', 'Values' => array('MY_INSTANCE_TAG'))));
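Putting that filter together with the setup above would look something like this (a sketch reusing countInstances):

// Only count instances carrying the given tag value.
$filters = array('Filters' => array(
    array('Name' => 'tag-value', 'Values' => array('MY_INSTANCE_TAG')),
));
$list = $ec2Client->DescribeInstances($filters);
echo "Number of matching instances: " . countInstances($list);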
My code executes without error, but I had to adapt it to post it here.
EDIT: Added list of instances to countInstances function. Otherwise it wouldn't be visible.
I've just started working with the PHP API for Rackspace Cloud Files. So far so good-- but I am using it as sort of a poor man's memcache, storing key/value pairs of serialized data.
My app attempts to grab the existing cached object by its key ('name' in the API language) using something like this:
$obj = $this->container->get_object($key);
The problem is, if the object doesn't exist, the API throws a fatal error rather than simply returning false. The "right" way to do this with the API would probably be to do a
$objs = $this->container->list_objects();
and then check for my $key value in that list. However, this seems way more time/CPU intensive than just returning false from the get_object request.
Is there a way to do a "search for object" or "check if object exists" in Cloud Files?
Thanks
I sent them a pull request and hope it'll get included.
https://github.com/rackspace/php-cloudfiles/pull/35
My pull request includes an example; for you it would be similar to this:
$object = new CF_Object($this->container, 'key');
if ($object->exists() === false) {
    echo "The object '{$object->name}' does not exist.";
}
I have a more general way to check if an object exists:
try {
    $this->_container->get_object($path);
    $booExists = true;
} catch (Exception $e) {
    $booExists = false;
}
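A caveat: catching the base Exception will also report network or auth failures as "does not exist". If memory serves, php-cloudfiles throws a NoSuchObjectException for missing keys, so a tighter version (an assumption worth verifying against your library version) would be:

try {
    $this->_container->get_object($path);
    $booExists = true;
} catch (NoSuchObjectException $e) { // only "not found"; other errors still bubble up
    $booExists = false;
}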
If you dump the $object, you'll see that its content_length is zero, or that last_modified is a zero-length string.
Example:
$object = new CF_Object($container, 'thisdocaintthere.pdf');
print_r($object->content_length);
There is also, deep in the dumped parent object, a 404 that comes back, but it's private, so you'd need to do some hackin' to get at it.
To see this, do the following:
$object = new CF_Object($container, 'thisdocaintthere.pdf');
print_r($object->container->cfs_http);
You'll see inside that object a response_status that is 404
[response_status:CF_Http:private] => 404
I know I'm a little late to the party, but hopefully this will help someone in the future: you can use the objectExists() method to test if an object is available.
public static function getObject($container, $filename, $expirationTime = false)
{
    if ($container->objectExists($filename)) {
        $object = $container->getPartialObject($filename);
        // return a private, temporary url
        if ($expirationTime) {
            return $object->getTemporaryUrl($expirationTime, 'GET');
        }
        // return a public url
        return $object->getPublicUrl();
    }
    // object does not exist
    return '';
}
Use it like...
// public CDN file
$photo = self::getObject($container, 'myPublicfile.jpg');
// private file; temporary link expires after 60 seconds
$photo = self::getObject($container, 'myPrivatefile.jpg', 60);
If you do not want to import opencloud to perform this check, you can use the following:
$url = 'YOUR CDN URL';
$code = FALSE;

$options['http'] = array(
    'method'        => "HEAD",
    'ignore_errors' => 1,
    'max_redirects' => 0,
);

$body = file_get_contents($url, NULL, stream_context_create($options));
sscanf($http_response_header[0], 'HTTP/%*d.%*d %d', $code);
if ($code != 200) {
    echo 'failed';
} else {
    echo 'exists';
}
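If allow_url_fopen is disabled on your host, the same HEAD check can be done with curl (a sketch using standard curl options only):

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);         // issue a HEAD request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the (empty) body
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

echo ($code == 200) ? 'exists' : 'failed';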