I need to download a large number of large files that are stored across multiple identical servers. A file such as '5.doc' that is stored on server 3 is also stored on server 55.
To speed this up, instead of using just one server to download the files one after another, I'm using all of the servers at the same time. The problem is that one of the servers may be much slower than the others, or may even be down. When using Guzzle to download files in batches, every file in a batch must finish downloading before the next batch starts.
Is there a way to immediately start downloading another file alongside others so that all of the servers are constantly downloading a file?
If a server is down, I've set a timeout of 300 seconds; when it is reached, Guzzle throws its ConnectException, which I catch.
How do I identify which of the promises (downloads) have failed so I can cancel them? Can I get information about which file/server failed?
Below is a simplified example of the code I'm using to illustrate the point. Thanks for the help!
$filesToDownload = [['5.doc', '8.doc', '10.doc'], ['1.doc', '9.doc']]; // The file names that we need to download
$availableServers = [3, 55, 88]; // Server ids that are available

foreach ($filesToDownload as $index => $fileBatchToDownload) {
    $promises = [];

    foreach ($availableServers as $key => $availableServer) {
        array_push(
            $promises,
            $client->requestAsync('GET', 'http://domain.com/' . $fileBatchToDownload[$key], [
                'timeout' => 300,
                'sink' => '/assets/' . $fileBatchToDownload[$key]
            ])
        );

        $database->updateRecord($fileBatchToDownload[$key], ['is_cached' => 1]);
    }
    try {
        $results = Promise\unwrap($promises);
        $results = Promise\settle($promises)->wait();
    } catch (\GuzzleHttp\Exception\ConnectException $e) {
        // When can't connect to the server or didn't download within timeout
        foreach ($e->failed() as $failedPromise) {
            // Re-set record in database to is_cached = 0
            // Delete file from server
            // Remove this server from the $availableServers list as it may be down or too slow
            // Re-add this file to the next batch to download $filesToDownload
        }
    }
}
I'm not sure how you are doing an asynchronous download of one file from multiple servers using Guzzle, but you can get the array index of failed requests via the promise's then() method:
array_push(
    $promises,
    $client->requestAsync('GET', "http://localhost/file/{$id}", [
        'timeout' => 10,
        'sink' => "/assets/{$id}"
    ])->then(
        function () {
            echo 'Success';
        },
        function () use ($id) {
            echo "Failed: $id";
        }
    )
);
then() accepts two callbacks. The first one is triggered on success and the second one on failure. The source calls them $onFulfilled and $onRejected. Other usages are documented in the Guzzle documentation. This way you can start downloading a file again immediately after it fails.
Can I get information about which file/server failed?
When a promise fails, it means the request remained unfulfilled. In this case you can get the host and the requested path from the RequestException instance that is passed to the second then() callback:
use GuzzleHttp\Exception\RequestException;

// ...

array_push(
    $promises,
    $client->requestAsync('GET', "http://localhost/file/{$id}", [
        'timeout' => 10,
        'sink' => "/assets/{$id}"
    ])->then(
        function () {
            echo 'Success';
        },
        function (RequestException $e) {
            echo "Host: " . $e->getRequest()->getUri()->getHost() . "\n";
            echo "Path: " . $e->getRequest()->getRequestTarget() . "\n";
        }
    )
);
So you will have full information about the failing host and file name. If you need access to more information, note that $e->getRequest() returns an instance of the GuzzleHttp\Psr7\Request class, and all methods of that class are available here. (Guzzle and PSR-7)
When an item is successfully downloaded, can we then immediately start a new file download on this free server, whilst the other files are still downloading?
I think you should decide which files to download only when creating the promises at the very beginning, and repeat/renew failed requests within the second callback, as sketched below. Trying to create new promises from a fulfilled promise's callback may result in an endless process that downloads duplicate files, and that's not simple to handle.
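For illustration, here is a minimal sketch of retrying inside the rejected callback, assuming Guzzle 6 promises. The per-server URL scheme and the recursive $download helper are assumptions for this sketch, not part of the original code.

use GuzzleHttp\Exception\RequestException;

// Hypothetical helper: download $file from $server; on failure, drop that
// server and retry the same file on the next available one.
$download = function ($file, $server) use (&$download, $client, &$availableServers) {
    // Assumed URL scheme: each server id maps to its own hostname.
    return $client->requestAsync('GET', "http://server{$server}.domain.com/{$file}", [
        'timeout' => 300,
        'sink'    => "/assets/{$file}",
    ])->then(
        function () use ($file, $server) {
            echo "Downloaded {$file} from server {$server}\n";
        },
        function (RequestException $e) use (&$download, $file, $server, &$availableServers) {
            // Remove the slow/dead server and retry the same file elsewhere.
            $availableServers = array_values(array_diff($availableServers, [$server]));
            if (!empty($availableServers)) {
                return $download($file, $availableServers[0]); // chains a new promise
            }
            throw $e; // no servers left: let the failure propagate
        }
    );
};

$promises = [
    $download('5.doc', 3),
    $download('8.doc', 55),
    $download('10.doc', 88),
];

\GuzzleHttp\Promise\settle($promises)->wait();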
This actually follows on from a previous question of mine that, unfortunately, did not receive any answers, so I'm not exactly holding my breath for a response, but I understand this can be a tricky issue to solve.
I am currently trying to implement rate limiting on outgoing requests to an external API to match the limit on their end. I have tried to implement a token bucket library (https://github.com/bandwidth-throttle/token-bucket) into the class we are using to manage Guzzle requests for this particular API.
Initially, this seemed to be working as intended but we have now started seeing 429 responses from the API as it no longer seems to be correctly rate limiting the requests.
I have a feeling what is happening is that the number of tokens in the bucket is now being reset every time the API is called due to how Symfony handles services.
I am currently setting up the bucket location, rate, and starting amount in the service's constructor:
public function __construct()
{
$storage = new FileStorage(__DIR__ . "/api.bucket");
$rate = new Rate(50, Rate::MINUTE);
$bucket = new TokenBucket(50, $rate, $storage);
$this->consumer = new BlockingConsumer($bucket);
$bucket->bootstrap(50);
}
I'm then attempting to consume a token before each request:
public function fetch(): array
{
try {
$this->consumer->consume(1);
$response = $this->client->request(
'GET', $this->buildQuery(), [
'query' => array_merge($this->params, ['api_key' => $this->apiKey]),
'headers' => [ 'Content-type' => 'application/json' ]
]
);
} catch (ServerException $e) {
// Process Server Exception
} catch (ClientException $e) {
// Process Client Exception
}
return $this->checkResponse($response);
}
I can't see anything obvious there that would allow it to request more than 50 times per minute, unless the number of available tokens is being reset on each request.
This is being supplied to a set of repository services that handle converting the data from each endpoint into objects used within the system. Consumers use the appropriate repository to request the data needed to complete their process.
If the number of tokens is being reset because the bootstrap call lives in the service constructor, where within the Symfony framework should it be moved so that it still works with the consumers?
I assume that it should work, but maybe try to move the ->bootstrap(50) call out of every request? I'm not sure, but that could be the reason.
Anyway, it's better to do that only once, as part of your deployment (every time you deploy a new version). It doesn't really have anything to do with Symfony, because the framework doesn't place any restrictions on your deployment procedure, so it depends on how you do your deployments.
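As a rough sketch of the "bootstrap once, at deploy time" idea: a small console command you run from your deploy script. The command name, file path and namespace are assumptions for illustration; the use statements follow the library's README.

namespace AppBundle\Command;

use bandwidthThrottle\tokenBucket\Rate;
use bandwidthThrottle\tokenBucket\TokenBucket;
use bandwidthThrottle\tokenBucket\storage\FileStorage;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class BootstrapApiBucketCommand extends Command
{
    protected function configure()
    {
        $this->setName('api:bucket:bootstrap')
            ->setDescription('Fill the API token bucket once, at deploy time.');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        // Same storage path, rate and capacity as the service, but bootstrap()
        // is only ever called here, never inside a web request.
        $storage = new FileStorage('/var/www/app/var/api.bucket');
        $rate    = new Rate(50, Rate::MINUTE);
        $bucket  = new TokenBucket(50, $rate, $storage);
        $bucket->bootstrap(50);

        $output->writeln('API token bucket bootstrapped.');

        return 0;
    }
}

The service constructor would then build the same FileStorage/TokenBucket/BlockingConsumer as in the question, but without the $bucket->bootstrap(50) line.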
P.S. Have you considered just handling 429 errors from the server? IMO you can simply wait (that's what BlockingConsumer does internally) when you receive a 429 error. It's simpler and doesn't require an additional layer in your system.
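A minimal sketch of that idea inside the fetch() method from the question; the retry cap and the fallback wait time are arbitrary choices for illustration.

use GuzzleHttp\Exception\ClientException;

$attempts = 0;
do {
    try {
        $response = $this->client->request('GET', $this->buildQuery(), [
            'query'   => array_merge($this->params, ['api_key' => $this->apiKey]),
            'headers' => ['Content-type' => 'application/json'],
        ]);
        break; // success
    } catch (ClientException $e) {
        if ($e->getResponse()->getStatusCode() !== 429 || ++$attempts >= 3) {
            throw $e; // not a rate-limit problem, or retried too often
        }
        // Wait as the server asks (Retry-After), or fall back to a few seconds.
        $wait = (int) ($e->getResponse()->getHeaderLine('Retry-After') ?: 5);
        sleep($wait);
    }
} while (true);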
BTW, have you considered nginx's ngx_http_limit_req_module as an alternative solution? It usually comes with nginx by default, so there is nothing extra to install and only a small amount of configuration is required.
You can place an nginx proxy between your code and the target web service and enable limits on it. Then in your code you handle 429s as usual, but the requests will be throttled by your local nginx proxy rather than by the external web service, so the final destination receives only a limited number of requests.
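A hypothetical snippet of what that proxy could look like; the zone name, port, rate and upstream URL are all placeholders to adapt.

# goes inside the http {} block of nginx.conf
# allow at most 50 requests per minute towards the external API
limit_req_zone $binary_remote_addr zone=external_api:10m rate=50r/m;

server {
    listen 127.0.0.1:8080;

    location / {
        limit_req zone=external_api burst=10;
        proxy_pass https://api.external-service.example/;
    }
}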
I have found a trick using the Guzzle bundle for Symfony.
I had to improve a sequential program that sends GET requests to a Google API; in the code example it's a PageSpeed URL.
To respect the rate limit, there is an option to delay the requests before they are sent asynchronously.
The PageSpeed rate limit is 200 requests per minute.
A quick calculation gives 200/60 = 0.3 s per request.
Here is the code, which I tested on 300 URLs and got the fantastic result of no errors at all, except when the URL passed as a parameter in the GET request returns a 400 HTTP error (Bad Request).
I used a delay of 0.4 s and the average response time was less than 0.2 s, whereas a sequential program took more than a minute.
use GuzzleHttp;
use GuzzleHttp\Client;
use GuzzleHttp\Promise\EachPromise;
use GuzzleHttp\Exception\ClientException;

// ... Now inside class code ... //

$client = new GuzzleHttp\Client();
$promises = [];

foreach ($requetes as $i => $google_request) {
    // delay is the trick not to exceed the rate limit (in ms)
    $promises[] = $client->requestAsync('GET', $google_request, ['delay' => 0.4 * $i * 1000]);
}

GuzzleHttp\Promise\each_limit(
    $promises,
    function () { // function returning the number of concurrent requests
        return 100; // 1 or 100 concurrent request(s) don't really change execution time
    },
    // Fulfilled function
    function ($response, $index) use ($urls, $fp) {
        // $urls is used to get the url passed as a parameter in the GET request and $fp is a csv file pointer
        $feed = json_decode($response->getBody(), true); // Get array of results
        $this->write_to_csv($feed, $fp, $urls[$index]); // Write to csv
    },
    // Rejected function
    function ($reason, $index) {
        if ($reason instanceof GuzzleHttp\Exception\ClientException) {
            $message = $reason->getMessage();
            // You could write the errors to a file or database too
            var_dump(array("error" => "error", "id" => $index, "message" => $message));
        }
    }
)->wait();
My EC2 servers are currently hosting a website that logs each registered user's activity under their own separate log file on the local EC2 instance, say username.log. I'm trying to figure out a way to push log events for these to CloudWatch using the PHP SDK without slowing the application down, AND while still being able to maintain a separate log file for each registered member of my website.
I can't for the life of me figure this out:
OPTION 1: How can I log to CloudWatch asynchronously using the CloudWatch SDK? My PHP application is behaving VERY sluggishly, since each log line takes roughly 100ms to push directly to CloudWatch. Code sample is below.
OPTION 2: Alternatively, how could I configure an installed CloudWatch agent on EC2 to simply OBSERVE all of my log files, uploading them asynchronously to CloudWatch for me in a separate process? The CloudWatch EC2 logging agent requires a static "configuration file" (AWS documentation) on your server which, to my knowledge, needs to list out all of your log files ("log streams") in advance, and I won't be able to predict those at the time of server startup. Is there any way around this (i.e., simply observe ALL log files in a directory)? A config file sample is below.
All ideas are welcome here, but I don't want my solution to simply be "throw all your logs into a single file, so that your log names are always predictable".
Thanks in advance!!!
OPTION 1: Logging via SDK (takes ~100ms / logEvent):
// Configuration to use for the CloudWatch client
$sharedConfig = [
'region' => 'us-east-1',
'version' => 'latest',
'http' => [
'verify' => false
]
];
// Create a CloudWatch client
$cwClient = new Aws\CloudWatchLogs\CloudWatchLogsClient($sharedConfig);
// DESCRIBE ANY EXISTING LOG STREAMS / FILES
$create_new_stream = true;
$next_sequence_id = "0";
$result = $cwClient->describeLogStreams([
    'descending' => true,
    'logGroupName' => 'user_logs',
    'logStreamNamePrefix' => $stream,
]);
// Iterate through the results, looking for a stream that already exists with the intended name
// This is so that we can get the next sequence id ('uploadSequenceToken'), so we can add a line to an existing log file
foreach ($result->get("logStreams") as $stream_temp) {
if ($stream_temp['logStreamName'] == $stream) {
$create_new_stream = false;
if (array_key_exists('uploadSequenceToken', $stream_temp)) {
$next_sequence_id = $stream_temp['uploadSequenceToken'];
}
break;
}
}
// CREATE A NEW LOG STREAM / FILE IF NECESSARY
if ($create_new_stream) {
$result = $cwClient->createLogStream([
'logGroupName' => 'user_logs',
'logStreamName' => $stream,
]);
}
// PUSH A LINE TO THE LOG *** This step ALONE takes 70-100ms!!! ***
$result = $cwClient->putLogEvents([
'logGroupName' => 'user_logs',
'logStreamName' => $stream,
'logEvents' => [
[
'timestamp' => round(microtime(true) * 1000),
'message' => $msg,
],
],
'sequenceToken' => $next_sequence_id
]);
OPTION 2: Logging via the installed CloudWatch agent (note that the config file below only allows hardcoded, predetermined log names as far as I know):
[general]
state_file = /var/awslogs/state/agent-state
[applog]
file = /var/www/html/logs/applog.log
log_group_name = PP
log_stream_name = applog.log
datetime_format = %Y-%m-%d %H:%M:%S
Looks like we have some good news now... not sure if it's too late!
CloudWatch Log Configuration
So, to answer the question:
Is there any way around this (ie, simply observe ALL log files in a directory)?
Yes, you can specify log files and file paths using wildcards, which gives you some flexibility in configuring where the logs are fetched from and which log streams they are pushed to.
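For example, a stanza along these lines in the agent configuration; the section name, path, group name and stream name are placeholders, and the wildcard and {hostname} substitution are assumed from the awslogs agent documentation rather than taken from the question.

[userlogs]
file = /var/www/html/logs/users/*.log
log_group_name = user_logs
log_stream_name = {hostname}-userlogs
datetime_format = %Y-%m-%d %H:%M:%S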
I'm working on a trace logger of sorts that pushes log message requests onto a queue on a Service Bus, to later be picked up by a worker role which inserts them into the table store. While running on my machine this works just fine (since I'm the only one using it), but once I put it up on a server to test, it produced the following error:
HTTP_Request2_MessageException: Malformed response: in D:\home\site\wwwroot\vendor\pear-pear.php.net\HTTP_Request2\HTTP\Request2\Adapter\Socket.php on line 1013
0 HTTP_Request2_Response->__construct('', true, Object(Net_URL2)) D:\home\site\wwwroot\vendor\pear-pear.php.net\HTTP_Request2\HTTP\Request2\Adapter\Socket.php:1013
1 HTTP_Request2_Adapter_Socket->readResponse() D:\home\site\wwwroot\vendor\pear-pear.php.net\HTTP_Request2\HTTP\Request2\Adapter\Socket.php:139
2 HTTP_Request2_Adapter_Socket->sendRequest(Object(HTTP_Request2)) D:\home\site\wwwroot\vendor\pear-pear.php.net\HTTP_Request2\HTTP\Request2.php:939
3 HTTP_Request2->send() D:\home\site\wwwroot\vendor\microsoft\windowsazure\WindowsAzure\Common\Internal\Http\HttpClient.php:262
4 WindowsAzure\Common\Internal\Http\HttpClient->send(Array, Object(WindowsAzure\Common\Internal\Http\Url)) D:\home\site\wwwroot\vendor\microsoft\windowsazure\WindowsAzure\Common\Internal\RestProxy.php:141
5 WindowsAzure\Common\Internal\RestProxy->sendContext(Object(WindowsAzure\Common\Internal\Http\HttpCallContext)) D:\home\site\wwwroot\vendor\microsoft\windowsazure\WindowsAzure\Common\Internal\ServiceRestProxy.php:86
6 WindowsAzure\Common\Internal\ServiceRestProxy->sendContext(Object(WindowsAzure\Common\Internal\Http\HttpCallContext)) D:\home\site\wwwroot\vendor\microsoft\windowsazure\WindowsAzure\ServiceBus\ServiceBusRestProxy.php:139
7 WindowsAzure\ServiceBus\ServiceBusRestProxy->sendMessage('<queuename>/mes…', Object(WindowsAzure\ServiceBus\Models\BrokeredMessage)) D:\home\site\wwwroot\vendor\microsoft\windowsazure\WindowsAzure\ServiceBus\ServiceBusRestProxy.php:155
⋮
I've seen previous posts that describe similar issues, namely:
Windows Azure PHP Queue REST Proxy Limit (Stack Overflow)
Operations on HTTPS do not work correctly (GitHub)
These imply that this is a known issue with the PHP Azure Storage libraries, where only a limited number of HTTPS connections are allowed. Before the requirements were changed, I was accessing the table store directly, ran into this same issue, and fixed it in the way the first link describes.
The problem is that the Service Bus endpoint in the connection string, unlike Table Store (etc.) connection string endpoints, MUST be 'HTTPS'. Trying to use it with 'HTTP' will return a 400 - Bad Request error.
I was wondering if anyone had any ideas on a potential workaround. Any advice would be greatly appreciated.
Thanks!
EDIT (After Gary Liu's Comment):
Here's the code I use to add items to the queue:
private function logToAzureSB($source, $msg, $severity, $machine)
{
// Gather all relevant information
$msgInfo = array(
"Severity" => $severity,
"Message" => $msg,
"Machine" => $machine,
"Source" => $source
);
// Encode it to a JSON string, and add it to a Brokered message.
$encoded = json_encode($msgInfo);
$message = new BrokeredMessage($encoded);
$message->setContentType("application/json");
// Attempt to push the message onto the Queue
try
{
$this->sbRestProxy->sendQueueMessage($this->azureQueueName, $message);
}
catch(ServiceException $e)
{
        throw new \DatabaseException($e->getMessage(), $e->getCode(), $e->getPrevious());
}
}
Here, $this->sbRestProxy is a Service Bus REST Proxy, set up when the logging class initializes.
On the receiving end of things, here's the code on the worker role side:
public override void Run()
{
// Initiates the message pump and callback is invoked for each message that is received, calling close on the client will stop the pump.
Client.OnMessage((receivedMessage) =>
{
try
{
// Pull the Message from the received object.
Stream stream = receivedMessage.GetBody<Stream>();
StreamReader reader = new StreamReader(stream);
string message = reader.ReadToEnd();
LoggingMessage mMsg = JsonConvert.DeserializeObject<LoggingMessage>(message);
// Create an entry with the information given.
LogEntry entry = new LogEntry(mMsg);
// Set the Logger to the appropriate table store, and insert the entry into the table.
Logger.InsertIntoLog(entry, mMsg.Service);
}
catch
{
// Handle any message processing specific exceptions here
}
});
CompletedEvent.WaitOne();
}
Where LoggingMessage is a simple object that basically contains the same fields as the message logged from PHP (used for JSON deserialization), LogEntry is a TableEntity that contains these fields as well, and Logger is an instance of a table store logger set up during the worker role's OnStart method.
This was a known issue with the Windows Azure PHP SDK, which hasn't been looked at in a long time, nor has it been fixed. In the time between when I posted this and now, we ended up writing a separate API web service for logging and had our PHP code send JSON strings to it over cURL, which works well enough as a temporary workaround. We're moving off of PHP now, so this won't be an issue for much longer anyway.
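For what it's worth, a rough sketch of that kind of cURL call, using the same fields as the BrokeredMessage payload above; the endpoint URL is a hypothetical placeholder, not our actual service.

// Post a log entry as JSON to a separate logging web service.
$payload = json_encode([
    'Severity' => $severity,
    'Message'  => $msg,
    'Machine'  => $machine,
    'Source'   => $source,
]);

$ch = curl_init('https://logging-api.example.com/log'); // hypothetical endpoint
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 2, // keep logging from blocking the page for long
]);

if (curl_exec($ch) === false) {
    // Logging must never take the site down; just note the failure locally.
    error_log('Log API unreachable: ' . curl_error($ch));
}
curl_close($ch);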
I'm building a PHP client (I already have a C# client) as a proof of concept. The PHP client will connect to the same .NET server using SOAP. As an example I'm using the game blackjack.
Now, the C# client works perfectly, but there is an issue with the PHP one. After much debugging I found out that PHP uses a new connection for every remote server call, which makes it impossible to program a game.
For example, I have a PHP file that just sets up the client like this (client.php):
try {
$client = @new SoapClient("http://localhost:8000/BlackJack/Service?wsdl",
array(
"trace" => 1,
"exceptions" => 0,
"cache_wsdl" => 0)
);
} catch (Exception $e) {
print 'Caught exception: ' . $e->getMessage() . "\n";
}
Then in my main file (I'll be using jQuery and AJAX calls to load it dynamically, but for now I'm just testing) I have the following code (blackJackClient.php):
require_once("client.php");
$ini = ini_set("soap.wsdl_cache_enabled","0");
$BetAmountPost = 100;
print_r($result = $client->buttonDeal_Click(array("BetAmount" => (string)$BetAmountPost))->buttonDeal_ClickResult);
print_r($result = $client->PlayerMoney()->PlayerMoneyResult);
The first call starts a new game (Deal) and the player's bet amount (for example 100) gets subtracted from the total amount (1000). So the result I get back is money = 900.
The following call asks how much money I currently have. I expect it to return 900, but instead I get 1000 (my starting amount).
So my question is: how can I make all the calls happen in one session, so that we keep using the same connection?
Thanks!
Since SOAP uses the HTTP protocol, and HTTP is stateless, there is no way to keep your connection open during the session of the game.
Instead, you should send your user's authentication with every request to the server.
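A minimal sketch of that idea, assuming the .NET service accepts some identifying value on every call; the 'SessionToken' parameter and the token value are hypothetical and would have to exist on the server side.

require_once("client.php"); // sets up $client as in the question

// Hypothetical: an identifier issued by the server when the game started,
// sent with every call so the server can find the right game state.
$token = "abc123";

$deal = $client->buttonDeal_Click(array(
    "BetAmount"    => "100",
    "SessionToken" => $token,
));

$money = $client->PlayerMoney(array(
    "SessionToken" => $token,
));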
I have a PHP site, www.test.com.
On the index page of this site, I update another site's (www.pgh.com) database with the following PHP code:
$url = "https://pgh.com/test.php?userID=".$userName. "&password=" .$password ;
$response = file_get_contents($url);
But now the site www.pgh.com is down, so it is also affecting my site www.test.com.
How can I add some exception handling or something else to this code, so that my site keeps working when the other site is down?
$response = file_get_contents($url);

if ($response === false) {
    // Return error
}
From the PHP manual
Adding a timeout:
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1
    )
));

file_get_contents("http://example.com/", 0, $ctx);
file_get_contents returns false on fail.
You have two options:
Add a timeout to the file_get_contents call using stream_context_create() (the manual has good examples; docs for the timeout parameter are here). This is not perfect, as even a one-second timeout will cause a noticeable pause when loading the page.
More complex but better: use a caching mechanism. Do the file_get_contents request in a separate script that gets called frequently (e.g. every 15 minutes) using a cron job (if you have access to one). Write the result to a local file, which your actual script will read; see the sketch after this list.
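A minimal sketch of that caching approach; the cache path and the cron schedule are assumptions for illustration.

// fetch_pgh.php: run from cron, e.g. every 15 minutes.
$url = "https://pgh.com/test.php?userID=" . urlencode($userName)
     . "&password=" . urlencode($password);

$ctx = stream_context_create(array('http' => array('timeout' => 5)));
$response = file_get_contents($url, false, $ctx);

if ($response !== false) {
    // Only overwrite the cache on success, so a stale copy survives an outage.
    file_put_contents("/var/cache/test.com/pgh_response.txt", $response, LOCK_EX);
}

// index.php: never talks to pgh.com directly, only reads the cached copy.
$response = @file_get_contents("/var/cache/test.com/pgh_response.txt");
if ($response === false) {
    // No cache yet: degrade gracefully instead of breaking the page.
}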