MongoDB -> DynamoDB Migration - php

All,
I am attempting to migrate roughly 6GB of Mongo data, spread across hundreds of collections, to DynamoDB. I have written some scripts using the AWS PHP SDK and am able to port over very small collections, but when I try ones with more than 20k documents (still a very small collection, all things considered) it either takes an outrageous amount of time or quietly fails.
Does anyone have any tips/tricks for taking data from Mongo (or any other NoSQL DB) and migrating it to Dynamo (or any other NoSQL DB)? I feel like this should be relatively easy because the documents are extremely flat/simple.
Any thoughts/suggestions would be much appreciated!
Thanks!
header.php
<?php
require './aws-autoloader.php';
require './MongoGet.php';

set_time_limit(0);

use Aws\DynamoDb\DynamoDbClient;

$client = DynamoDbClient::factory(array(
    'key'      => 'MY_KEY',
    'secret'   => 'MY_SECRET',
    'region'   => 'MY_REGION',
    'base_url' => 'http://localhost:8000'
));

$collection = "AccumulatorGasPressure4093_raw";

function nEcho($str) {
    echo "{$str}<br>\n";
}

echo "<pre>";
test-store.php
<?php
include('test-header.php');

nEcho("Creating table(s)...");

// Create the test table with a composite key (id = hash, count = range)
$client->createTable(array(
    'TableName' => $collection,
    'AttributeDefinitions' => array(
        array(
            'AttributeName' => 'id',
            'AttributeType' => 'N'
        ),
        array(
            'AttributeName' => 'count',
            'AttributeType' => 'N'
        )
    ),
    'KeySchema' => array(
        array(
            'AttributeName' => 'id',
            'KeyType' => 'HASH'
        ),
        array(
            'AttributeName' => 'count',
            'KeyType' => 'RANGE'
        )
    ),
    'ProvisionedThroughput' => array(
        'ReadCapacityUnits'  => 10,
        'WriteCapacityUnits' => 20
    )
));

$result = $client->describeTable(array(
    'TableName' => $collection
));

nEcho("Done creating table...");
nEcho("Getting data from Mongo...");

// Instantiate the helper class and fetch the whole collection
$mGet = new MongoGet();
$results = $mGet->getData($collection);

nEcho("Done retrieving Mongo data...");
nEcho("Inserting data...");

$i = 0;
foreach ($results as $result) {
    $insertResult = $client->putItem(array(
        'TableName' => $collection,
        'Item' => $client->formatAttributes(array(
            'id'    => $i,
            'date'  => $result['date'],
            'value' => $result['value'],
            'count' => $i
        )),
        'ReturnConsumedCapacity' => 'TOTAL'
    ));
    $i++;
}

nEcho("Done Inserting, script ending...");

I suspect that you are being throttled by DynamoDB, especially if your tables' provisioned throughput is low. The SDK retries throttled requests (up to 11 times per request by default), but eventually a request fails for good, which should throw an exception.
You should take a look at the WriteRequestBatch object. It is basically a queue of items that get sent in batches via BatchWriteItem, and any items that fail to transfer are automatically re-queued. It should provide a much more robust solution for what you are doing.
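As a rough, untested sketch, the insert loop from test-store.php could be reworked around WriteRequestBatch along these lines (reusing $client, $collection, and the MongoGet helper from the question; treat the exact usage as an outline, not a drop-in replacement):
<?php
include('test-header.php');

use Aws\DynamoDb\Model\Item;
use Aws\DynamoDb\Model\BatchRequest\PutRequest;
use Aws\DynamoDb\Model\BatchRequest\WriteRequestBatch;

$mGet = new MongoGet();
$results = $mGet->getData($collection);

// Queue the writes; the batch sends them 25 at a time via BatchWriteItem
// and automatically re-queues items that were throttled or rejected.
$batch = WriteRequestBatch::factory($client);

$i = 0;
foreach ($results as $result) {
    $batch->add(new PutRequest(Item::fromArray(array(
        'id'    => $i,
        'count' => $i,
        'date'  => $result['date'],
        'value' => $result['value'],
    )), $collection));
    $i++;
}

// Send anything still sitting in the queue.
$batch->flush();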

Related

How to lock an item in DynamoDB during update?

I'm testing DynamoDB with a lot of concurrent update requests, but I have some doubts.
For example:
Consider a financial transaction system with a table called Accounts and a table called Transactions.
Accounts:
id: string
balance: Number
Transactions:
id: Number
amount: Number
balanceAfter: Number
accountId: string
When performing a debit transaction, I update the balance in the Accounts table and create a transaction record holding the account balance after the transaction.
If the account has a balance of 100 and I execute two debits of 50 at the same time, the account balance can end up at 50 instead of 0, with two transactions in the database both recording balanceAfter: 50.
How can I lock a DynamoDB item for an update under concurrency and avoid double spending (similar to a TRANSACTION in a relational database)?
What is the safest way to get the updated item back from DynamoDB after running the update?
The code:
<?php
require './vendor/autoload.php';

use Aws\DynamoDb\DynamoDbClient;
use Aws\Credentials\CredentialProvider;

function executeDebitTransaction($accountId, $transactionAmount)
{
    $provider = CredentialProvider::defaultProvider();

    $client = DynamoDbClient::factory(array(
        'version'     => '2012-08-10',
        'credentials' => $provider,
        'region'      => 'sa-east-1'
    ));

    // Read the current balance
    $response = $client->getItem(array(
        'TableName' => 'Accounts',
        'Key' => array(
            'id' => array('S' => $accountId)
        )
    ));

    $currentBalance = $response['Item']['balance']['N'];
    $newBalance = (string)((int)$currentBalance - (int)$transactionAmount);

    // Debit the account
    $response = $client->updateItem(array(
        'TableName' => 'Accounts',
        'Key' => array(
            'id' => array('S' => $accountId)
        ),
        'ExpressionAttributeValues' => array(
            ':amount' => array('N' => $transactionAmount),
        ),
        'UpdateExpression' => 'SET balance = balance - :amount'
    ));

    // Record the transaction with a random ID
    $id = (string)(random_int(1, 1000000000));

    $client->putItem(array(
        'TableName' => 'Transactions',
        'Item' => array(
            'id'           => array('N' => $id),
            'amount'       => array('N' => $transactionAmount),
            'balanceAfter' => array('N' => $newBalance),
            'accountId'    => array('S' => $accountId)
        )
    ));
}

$accountId = 'A1469CCD-10B8-4D31-83A2-86B71BF39EA8';
$debitAmount = '50';

executeDebitTransaction($accountId, $debitAmount);
Running this script with low concurrency works perfectly, but when I increase the parallelism I start to have problems.
I did some tests using optimistic locking and had good results.
Repo: https://github.com/thalyswolf/dynamo-lock/blob/main/src/Repository/AccountRepositoryDynamoWithOptimisticLock.php
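For illustration only, here is a minimal sketch of the optimistic-locking idea using a conditional update; the `version` attribute and the `$expectedVersion` variable are assumptions for this sketch, not code taken from the repo above. If another writer updates the item first, DynamoDB rejects the write with a ConditionalCheckFailedException, and `ReturnValues` gives back the item as it looks after your own update:
<?php
// Sketch only: assumes the Accounts item carries a numeric `version` attribute
// and that $expectedVersion was read earlier with getItem.
use Aws\DynamoDb\Exception\DynamoDbException;

try {
    $result = $client->updateItem(array(
        'TableName' => 'Accounts',
        'Key' => array('id' => array('S' => $accountId)),
        // Only apply the debit if nobody else modified the item since we
        // read it, and if the balance cannot go negative.
        'ConditionExpression' => 'version = :expectedVersion AND balance >= :amount',
        'UpdateExpression' => 'SET balance = balance - :amount, version = version + :one',
        'ExpressionAttributeValues' => array(
            ':amount'          => array('N' => $transactionAmount),
            ':one'             => array('N' => '1'),
            ':expectedVersion' => array('N' => $expectedVersion),
        ),
        // The balanceAfter you store then comes from DynamoDB, not local math.
        'ReturnValues' => 'ALL_NEW'
    ));
    $balanceAfter = $result['Attributes']['balance']['N'];
} catch (DynamoDbException $e) {
    if ($e->getAwsErrorCode() === 'ConditionalCheckFailedException') {
        // Another writer won the race: re-read the item and retry.
    } else {
        throw $e;
    }
}
Newer SDK versions also expose DynamoDB transactions (transactWriteItems), which may let the balance update and the Transactions insert succeed or fail together.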

Get EC2 Bandwidth Usage By Instance ID

How do I get the bandwidth usage (NetworkIn and NetworkOut) for an EC2 instance, based on the instance ID, using the PHP SDK?
So far what I have is...
<?php
require_once("../aws/Sdk.php");

use Aws\CloudWatch\CloudWatchClient;

$client = CloudWatchClient::factory(array(
    'profile' => 'default',
    'region'  => 'ap-southeast-2'
));

$dimensions = array(
    array('Name' => 'Prefix', 'Value' => ""),
);

$result = $client->getMetricStatistics(array(
    'Namespace'  => 'AWSSDKPHP',
    'MetricName' => 'NetworkIn',
    'Dimensions' => $dimensions,
    'StartTime'  => strtotime('-1 hour'),
    'EndTime'    => strtotime('now'),
    'Period'     => 3000,
    'Statistics' => array('Maximum', 'Minimum'),
));
I have a PHP cron job running every hour and I need to be able to get the bandwidth in and out for a specific EC2 instance to record in an internal database.
What I have above I have been able to piece together from the SDK documentation but from here I am kinda stumped.
I believe CloudWatch is what I need, so I would rather do it through that. I know I could install a small program on each server that reports bandwidth usage to a file which I then download over SFTP into our database, but I would rather do it externally, outside any settings within the instance itself, so that an instance admin can't interfere with the bandwidth reporting.
Managed to get it working with...
<?php
require '../../aws.phar';

use Aws\CloudWatch\CloudWatchClient;

$cw = CloudWatchClient::factory(array(
    'key'     => 'your-key-here',
    'secret'  => 'your-secret-here',
    'region'  => 'your-region-here',
    'version' => 'latest'
));

$metrics = $cw->listMetrics(array('Namespace' => 'AWS/EC2'));
//print_r($metrics);

$statsyo = $cw->getMetricStatistics(array(
    'Namespace'  => 'AWS/EC2',
    'MetricName' => 'NetworkIn',
    'Dimensions' => array(array('Name' => 'InstanceId', 'Value' => 'your-instance-id-here')),
    'StartTime'  => strtotime("2017-01-23 00:00:00"),
    'EndTime'    => strtotime("2017-01-23 23:59:59"),
    'Period'     => 86400,
    'Statistics' => array('Average'),
    'Unit'       => 'Bytes'
));

print_r($statsyo);
If you're trying to calculate your bandwidth charges the same way AWS does, a better and more conclusive approach is VPC Flow Logs. You can enable flow logs on your ENI (this should be pretty cheap; you only pay the CloudWatch Logs costs, the flow logs themselves are free), then use the AWS SDK to pull the log events from CloudWatch Logs with GetLogEvents and sum up the byte totals.
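A very rough, untested sketch of that approach follows; the log group and log stream names are placeholders, and it assumes the default flow log record format, where the byte count is the 10th space-separated field:
<?php
require './vendor/autoload.php';

use Aws\CloudWatchLogs\CloudWatchLogsClient;

$logs = new CloudWatchLogsClient(array(
    'version' => 'latest',
    'region'  => 'your-region-here'
));

$totalBytes = 0;
$nextToken = null;

do {
    $params = array(
        'logGroupName'   => 'your-flow-log-group',      // placeholder
        'logStreamNames' => array('eni-xxxxxxxx-all'),   // placeholder stream for the ENI
        'startTime'      => strtotime('-1 hour') * 1000, // CloudWatch Logs uses milliseconds
        'endTime'        => time() * 1000,
    );
    if ($nextToken) {
        $params['nextToken'] = $nextToken;
    }

    $page = $logs->filterLogEvents($params);

    foreach ($page['events'] as $event) {
        // Default flow log format:
        // version account-id interface-id srcaddr dstaddr srcport dstport
        // protocol packets bytes start end action log-status
        $fields = preg_split('/\s+/', trim($event['message']));
        if (isset($fields[9]) && is_numeric($fields[9])) {
            $totalBytes += (int)$fields[9];
        }
    }

    $nextToken = isset($page['nextToken']) ? $page['nextToken'] : null;
} while ($nextToken);

echo "Total bytes: {$totalBytes}\n";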

php dynamic vs manual array declaration

I have this situation where I could pre-define the array in this way:
$packages = array(
    '0' => array(
        'name'   => 'Hotel1', // pcg name
        'curr'   => '$',
        'amount' => '125',
        'period' => 'NIGHT',  // pcg duration
        'client_data' => array(
            'Name'    => 'Adrien',
            'Addr'    => 'Sample Street',
            'Payment' => 'Credit Card',
            'Nights'  => '6',
        )
    ),
);
Or
$packages = array();
$packages[] = array(
    'name'   => 'PREMIUM', // pcg name
    'curr'   => '$',
    'amount' => '3.95',
    'period' => 'MONTH',   // pcg duration
    'features' => array(
        'Clients'  => '100',
        'Invoices' => '300 <small>MONTH</small>',
        'Products' => '30',
        'Staff'    => '1',
    )
);
The data will always be static, so I won't be fetching it from a SQL query or a dynamic search. Would using the first or the second "method" make any difference in terms of performance (even the slightest difference would be helpful to know), or are they actually 100% identical performance-wise?
Theoretically the "dynamic" array creation might be slower because it needs to check the size of the array, the last array index, and maybe other things like that.
Thank you.
One simple task like that takes practically no resources on current hardware. Even on my first PC, a 386DX at 20 MHz, it would not make a noticeable difference ;)
Anyway, I executed both options 1k times:
FIRST OPTION average: 0.000114s
SECOND OPTION average: 0.000108s
Be happy!
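If you want to reproduce a measurement like this yourself, here is a small sketch using microtime(); buildPredefined() and buildDynamic() are hypothetical wrappers standing in for the two snippets from the question:
<?php
// Hypothetical helpers wrapping the two array-building snippets.
function buildPredefined() {
    return array(
        '0' => array('name' => 'Hotel1', 'curr' => '$', 'amount' => '125', 'period' => 'NIGHT'),
    );
}

function buildDynamic() {
    $packages = array();
    $packages[] = array('name' => 'PREMIUM', 'curr' => '$', 'amount' => '3.95', 'period' => 'MONTH');
    return $packages;
}

// Run the callable many times and return the average seconds per call.
function benchmark(callable $fn, $iterations = 1000) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $iterations;
}

printf("FIRST OPTION average:  %.6fs\n", benchmark('buildPredefined'));
printf("SECOND OPTION average: %.6fs\n", benchmark('buildDynamic'));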

Can't get the exact error in aggregation on MongoDB

When I try this aggregation query on a sample Mongo database of 25 entries, it works. But when the same method is applied to the entire MongoDB of 317,000 entries, it doesn't respond, and it doesn't show any kind of error either.
What should I do about problems like this?
$mongo = new Mongo("localhost");
$collection = $mongo->selectCollection("sampledb", "analytics");

$cursor = $collection->aggregate(
    array(
        array(
            '$group' => array(
                '_id' => array(
                    'os'            => '$feat_os',
                    'minprice'      => '$finder_input_1',
                    'max_price'     => '$finder_input_2',
                    'feat_quadcode' => '$feat_quadcore'
                ),
                'count' => array('$sum' => 1)
            )
        ),
        array('$out' => "result")
    )
);

Limiting resultset from Magento SOAP query

How can you specify a max result set for Magento SOAP queries?
I am querying Magento via the SOAP API for a list of orders matching a given status. Some of our remote hosts are taking too long to return the list, so I'd like to limit the result set; however, I don't see a parameter for this.
$orderListRaw = $proxy->call($sessionId, 'sales_order.list', array(array('status' => array('in' => $orderstatusarray))));
I was able to confirm that we do get data back (6 minutes later) and have been able to deal with timeouts, etc., but I would prefer to just force a maximum result set.
It doesn't seem like it can be done using a limit. (Plus, you would have to do some complex pagination logic to get all the records, because you would need to know the total number of records, and the API does not have a method for that.) See the API call list at http://www.magentocommerce.com/api/soap/sales/salesOrder/sales_order.list.html
What you could do as a workaround is use complex filters to limit the result set based on creation date (adjust to every hour, day, or week depending on order volume).
Also, since you are filtering on status (assuming you are excluding more than just cancelled orders), you may want to think about fetching all orders and keeping track of the order_id/status locally (only processing the ones with the statuses above); the remainder that weren't processed would then be a list of order IDs that may need your attention later on.
Pseudo Code Example
$params = array(array(
    'filter' => array(
        array(
            'key'   => 'status',
            'value' => array(
                'key'   => 'in',
                'value' => $orderstatusarray,
            ),
        ),
    ),
    'complex_filter' => array(
        array(
            'key'   => 'created_at',
            'value' => array(
                'key'   => 'gteq',
                'value' => '2012-11-25 12:00:00'
            ),
        ),
        array(
            'key'   => 'created_at',
            'value' => array(
                'key'   => 'lteq',
                'value' => '2012-11-26 11:59:59'
            ),
        ),
    )
));

$orderListRaw = $proxy->call($sessionId, 'sales_order.list', $params);
Read more about filtering at http://www.magentocommerce.com/knowledge-base/entry/magento-for-dev-part-8-varien-data-collections
