Verify HTTP status code for ~90K URLs with PHP

I have a list of 90K URLs and I need to verify the HTTP status code for each of them and record it next to each URL in the database. Here's my attempt so far, but it's super slow. I'd appreciate any suggestions or feedback on making this run faster.
public function handle()
{
    // Send HEAD requests instead of GET so response bodies are not downloaded.
    // Setting the default context once, outside the loop, is enough.
    stream_context_set_default([
        'http' => ['method' => 'HEAD']
    ]);

    DB::table('internal_links')->whereNull('status')->orderBy('id')->chunk(5, function ($links) {
        foreach ($links as $link) {
            // @ suppresses the warning get_headers() emits for unreachable hosts.
            $status_line = @get_headers($link->href)[0];
            if (!$status_line) {
                $status_code = 404;
            } else {
                // e.g. "HTTP/1.1 200 OK" -> "200"
                $status_code = substr($status_line, 9, 3);
            }
            DB::table('internal_links')->where('href', $link->href)->update(['status' => $status_code]);
        }
    });

    return Command::SUCCESS;
}
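As an aside, the `substr(..., 9, 3)` parse assumes an `HTTP/1.x` status line and silently breaks for `HTTP/2 200` (different offset). A small regex is more robust; a plain-PHP sketch with a made-up status line:

```php
<?php

// Hypothetical status line, like the first element get_headers() returns.
$statusLine = 'HTTP/1.1 301 Moved Permanently';

// substr($statusLine, 9, 3) yields "301" here, but would misparse
// "HTTP/2 200". A regex handles any HTTP version:
preg_match('~^HTTP/[\d.]+\s+(\d{3})~', $statusLine, $m);
$statusCode = isset($m[1]) ? (int) $m[1] : 404;

// $statusCode is now 301
```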
In my test, this takes about 6-7 hours to complete. I'd be happy if this could be done faster with concurrent requests. I'm currently exploring a way to build the concurrent requests with Laravel's Http client, but can't figure it out.
Update
I'm attempting to use Laravel Http concurrent requests, but I'm not able to figure this out:
public function handle()
{
    DB::table('internal_links')->distinct('href')->orderBy('id')->chunk(50, function ($urls) {
        // How do I write the following block so that each "$urls as $url"
        // goes into $pool->head($url)?
        $responses = Http::pool(fn (Pool $pool) => [
            $pool->head('http://url1'),
            $pool->head('http://url2')
        ]);
    });

    return Command::SUCCESS;
}

Related

How to speed up an HTTP GET request that loads large data as a response

I have this code with a while loop to work around the API's limit of 1000 records: it merges all the arrays from the responses into one array and passes it to the view. But the waiting time is too long. Is there a better way to do this and speed up the process?
This is my code:
public function guzzleGet()
{
    $aData = [];
    $sCursor = null;

    while ($aResponse = $this->guzzleGetData($sCursor)) {
        if (empty($aResponse['data'])) {
            break;
        }

        $aData = array_merge($aData, $aResponse['data']);

        if (empty($aResponse['meta']['next_cursor'])) {
            break;
        }

        $sCursor = $aResponse['meta']['next_cursor'];
    }

    $user = Auth::user()->name;

    return view("" . $user . "/home")->with(['data' => json_encode($aData)]);
}
protected function guzzleGetData($sCursor = null)
{
    $client = new \GuzzleHttp\Client();
    $token = 'token';

    $response = $client->request('GET', 'https://data.beneath.dev/v1/user/project/table', [
        'headers' => [
            'Authorization' => 'Bearer ' . $token,
        ],
        'query' => [
            'limit' => 1000,
            'cursor' => $sCursor
        ]
    ]);

    if ($response->getBody()) {
        return json_decode($response->getBody(), true) ?: [];
    }

    return [];
}
You would have to debug where the bottleneck is. If it's your network/bandwidth, there is little the code can do about it. The API could also be limiting your download speed.
Another possible bottleneck is the download speed of the client: since you are building up a big array, the client has to download the whole thing from your server, which takes time.
You could gain a little speed by reusing the same curl handler or Guzzle client. Another area you could improve is the array_merge; you can use your own custom logic as explained here: https://stackoverflow.com/a/23348715/8485567.
If you have control over the external API, make sure to use gzip and HTTP/2, or possibly even gRPC instead of HTTP.
However, I would recommend doing this on the client side using JS; that way you avoid the additional bandwidth it takes for the client to download the data from your server. You could use the same limit-1000 approach, or even stream the response as it comes in and render it incrementally.
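On the array_merge point: calling it once per page re-copies the accumulated array on every iteration. Collecting the pages and flattening once avoids that. A small self-contained sketch with dummy page data standing in for the API responses:

```php
<?php

// Dummy pages standing in for successive paginated API responses.
$pages = [
    ['a', 'b'],
    ['c'],
    ['d', 'e'],
];

// Instead of $aData = array_merge($aData, $page) inside the loop,
// spread all collected pages into a single array_merge call:
$aData = array_merge([], ...$pages);

// $aData is now ['a', 'b', 'c', 'd', 'e']
```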

PHP & MQTT - Publish and subscribe in same function

Is there a way to make this work?
I'm trying to publish a 'PING' request to a device, and then subscribe to another channel and wait for the 'PONG' response. Here's the code:
public function ping(Request $request)
{
    $device = Device::find($request->id);

    if ($device) {
        // Listen on laravel_device_DEVICEID
        $listen_topic = 'laravel_device_' . $device->device_id;
        $result = [];

        $mqtt = MQTT::connection();
        $mqtt->subscribe($listen_topic, function (string $topic, string $message) use ($mqtt, &$result) {
            $result['topic'] = $topic;
            $result['message'] = $message;
            $mqtt->interrupt();
        }, 0);
        $mqtt->loop(true, true);

        // Submit PING message
        $topic = $device->device_id;
        $message = json_encode([
            'type' => 'ping',
            'device_id' => $device->device_id
        ]);

        $mqtt = MQTT::connection();
        $mqtt->publish($topic, $message);

        $device->last_ping_response = 2;
        $device->save();
    }
}
I'll take the liberty of copying my response from the original question in the php-mqtt/laravel-client repository.
Yes, it is possible and a fairly common thing to do as well. The pattern is called RPC (Remote Procedure Call).
Your code already looks quite good; the only issue is the order of subscribe(), publish() and loop(). Subscribing first is crucial, since you do not want to miss a response before your subscription goes through. This part is correct already. But you may start the loop only after publishing the request; otherwise the publish() is never reached, because the loop does not exit by itself, only when it receives a message.
Here is the updated piece of code of yours, with some best practices added in. I assumed the response message is what you want to save as $device->last_ping_response:
use PhpMqtt\Client\MqttClient;
use PhpMqtt\Client\Exceptions\MqttClientException;

public function ping(Request $request)
{
    $device = Device::find($request->id);

    // Exit early if we have no device to ping. Doing it this way round,
    // we save one level of indentation for the rest of the code.
    if ($device === null) {
        return;
    }

    // Prepare all data before connecting to the broker. This way, we do not
    // waste time preparing data while already connected.
    $subscribeTopic = "laravel_device_{$device->device_id}";
    $publishTopic = $device->device_id;
    $request = json_encode([
        'type' => 'ping',
        'device_id' => $device->device_id,
    ]);

    try {
        $mqtt = MQTT::connection();

        // Ensure we are ready to receive a response to the request
        // we are going to send.
        $mqtt->subscribe(
            $subscribeTopic,
            function (string $topic, string $message) use ($mqtt, $device) {
                // Update the device based on the response.
                $device->last_ping_response = $message;
                $device->save();

                $mqtt->interrupt();
            },
            MqttClient::QOS_AT_LEAST_ONCE
        );

        // Register a timeout using a loop event handler. The event handler
        // is passed the time the loop has already been running (in seconds).
        // We do this because if the receiver of our request is offline,
        // we would otherwise be stuck in the loop forever.
        $mqtt->registerLoopEventHandler(
            function (MqttClient $client, float $elapsedTime) {
                // After 10 seconds, we quit the loop.
                if ($elapsedTime > 10) {
                    $client->interrupt();
                }
            }
        );

        // Send the request. We use QoS 1 to have guaranteed delivery.
        $mqtt->publish($publishTopic, $request, MqttClient::QOS_AT_LEAST_ONCE);

        // Wait for a response. This will either return when we receive
        // a response on the subscribed topic, or after 10 seconds
        // due to our registered loop event handler.
        $mqtt->loop();

        // Always disconnect gracefully.
        $mqtt->disconnect();
    } catch (MqttClientException $e) {
        \Log::error("Pinging device failed.", [
            'device' => $device->id,
            'exception' => $e->getMessage(),
        ]);
    }
}

Guzzle async requests wait for the full timeout even when wrapped in any() - how can I make them return ASAP?

My sample code is below; it tries to fetch a certain URL through a list of proxies. I want to return a result as soon as any proxy responds:
$response = any(
    array_map(
        function (?string $proxy) use ($headers, $url) {
            return $this->client->getAsync($url, [
                'timeout' => 5,
                'http_errors' => FALSE,
                'proxy' => $proxy,
            ]);
        },
        self::PROXIES
    )
)->wait();
However, whatever value I set as the timeout, the whole HTTP request only returns once the full timeout has elapsed, i.e. 5 seconds in this case. If I change 5 to 10, the whole HTTP request only returns after 10 seconds.
How can I really make it return ASAP?
Are you sure that the proxy/end server works well? Guzzle does nothing special with proxied requests, so there are no settings you can tweak.
What do you get after the timeout: a normal 200 response or an exception?
To me it looks like the issue is in the proxy or in the end server. Have you tried requesting the URL directly, without a proxy? Is it fast, or does it still take 5-10+ seconds?
I finally wrote my own promise to solve the problem; it really does return immediately.
/** @var Promise $master_promise */
$master_promise = new Promise(
    function () use ($url, &$master_promise) {
        $onFulfilled = static function (ResponseInterface $response) use ($master_promise) {
            $master_promise->resolve($response);
        };

        $rejections = [];
        foreach (static::PROXIES as $proxy) {
            $this->client->getAsync($url, [
                'timeout' => static::TIMEOUT,
                'http_errors' => FALSE,
                'proxy' => $proxy,
            ])->then(
                $onFulfilled,
                static function (GuzzleException $exception) use ($master_promise, &$rejections) {
                    $rejections[] = $exception;
                    if (count($rejections) === count(static::PROXIES)) {
                        $master_promise->reject(
                            new AggregateException(
                                'Calls by all proxies have failed.',
                                $rejections
                            )
                        );
                    }
                }
            );
        }

        while ($master_promise->getState() === PromiseInterface::PENDING) {
            $this->handler->tick();
        }
    }
);

$response = $master_promise->wait();

Guzzle is sending requests synchronously

I am using Guzzle to consume a SOAP API. I have to make 6 requests, but in the future this might even be an indeterminate number of requests.
The problem is that the requests are being sent synchronously instead of asynchronously. Every request on its own takes ~2.5s. When I send all 6 requests in parallel (at least that's what I'm trying to do) it takes ~15s.
I tried all the examples in the Guzzle docs: the one with a fixed array of $promises, and even the pool (which I need eventually). When I put everything in one file (functional style) I manage to get the total timing down to 5-6s (which is OK, right?). But when I split everything into objects and functions, somehow Guzzle decides to send them synchronously again.
Checks.php:
public function request()
{
    $promises = [];
    $promises['requestOne'] = $this->requestOne();
    $promises['requestTwo'] = $this->requestTwo();
    $promises['requestThree'] = $this->requestThree();
    // etc.

    // Wait for all requests to complete
    $results = \GuzzleHttp\Promise\settle($promises)->wait();

    // Return results
    return $results;
}

public function requestOne()
{
    $promise = (new API\GetProposition())->requestAsync();

    return $promise;
}
// requestTwo, requestThree, etc
API\GetProposition.php:
public function requestAsync()
{
    $webservice = new Webservice();
    $xmlBody = '<some-xml></some-xml>';

    return $webservice->requestAsync($xmlBody, 'GetProposition');
}
Webservice.php:
public function requestAsync($xmlBody, $soapAction)
{
    $client = new Client([
        'base_uri' => 'some_url',
        'timeout' => 5.0
    ]);

    $xml = '<soapenv:Envelope>
        <soapenv:Body>
            ' . $xmlBody . '
        </soapenv:Body>
    </soapenv:Envelope>';

    $promise = $client->requestAsync('POST', 'NameOfService', [
        'body' => $xml,
        'headers' => [
            'Content-Type' => 'text/xml',
            'SOAPAction' => $soapAction, // SOAP method to post to
        ],
    ]);

    return $promise;
}
I changed the XML and some of the parameters for brevity. The structure is like this because I eventually have to talk to multiple APIs, which is why I have a webservice class in between that does all the preparation needed for each API. Most APIs have multiple methods/actions that you can call, which is why I have something like API\GetProposition.
Before the ->wait() statement I can see all $promises pending, so it looks like they are being sent asynchronously. After ->wait() they have all been fulfilled.
It's all working, except for the performance. All 6 requests shouldn't take more than 2.5 to at most 3s in total.
Hope someone can help.
Nick
The problem was that the $client object was created anew for every request, so curl's multi handler could not be shared between the requests.
Found the answer via https://stackoverflow.com/a/46019201/7924519.
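In other words, the fix is to create the Client once and reuse it, e.g. as a property of Webservice, so all async requests share one handler stack. A sketch (keeping the placeholder 'some_url' and the envelope-building details omitted for brevity):

```php
class Webservice
{
    private $client;

    public function __construct()
    {
        // Create the client once; every requestAsync() call then shares
        // the same handler (and curl multi handle), enabling concurrency.
        $this->client = new \GuzzleHttp\Client([
            'base_uri' => 'some_url',
            'timeout' => 5.0,
        ]);
    }

    public function requestAsync($xmlBody, $soapAction)
    {
        // Wrap $xmlBody in the SOAP envelope as before, then reuse the
        // shared client instead of constructing a new one per call.
        return $this->client->requestAsync('POST', 'NameOfService', [
            'body' => $xmlBody,
            'headers' => [
                'Content-Type' => 'text/xml',
                'SOAPAction' => $soapAction,
            ],
        ]);
    }
}
```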

die() in a dispatched Request also kills the calling Request

I don't know if these are the right terms to use...
I made an API in which the answer is sent via the die() function, to avoid some further useless calculations and/or function calls.
Example:
if (isset($authorize->refusalReason)) {
    die($this->api_return(true, [
        'resultCode' => $authorize->resultCode,
        'reason' => $authorize->refusalReason
    ]));
}

// api_return method:
protected function api_return($error, $params = [])
{
    $time = (new DateTime())->format('Y-m-d H:i:s');
    $params = (array) $params;
    $params = ['error' => $error, 'date_time' => $time] + $params;

    return (Response::json($params)->sendHeaders()->getContent());
}
But my website is based on this API, so I made a function that creates a Request and returns its contents, based on a URI, method, params, and headers:
protected function get_route_contents($uri, $type, $params = [], $headers = [])
{
    $request = Request::create($uri, $type, $params);

    if (Auth::user()->check()) {
        $request->headers->set('S-token', Auth::user()->get()->Key);
    }

    foreach ($headers as $key => $header) {
        $request->headers->set($key, $header);
    }

    // Merge the inputs into the new request.
    $originalInput = Request::input();
    Request::replace($request->input());
    $response = Route::dispatch($request);
    Request::replace($originalInput);

    $response = json_decode($response->getContent());

    // This header cancels the one set in api_return; sendHeaders() sets
    // Content-Type: application/json.
    header('Content-Type: text/html');

    return $response;
}
But now when I call an API function this way, the die() in the API also kills my current request.
public function postCard($token)
{
    $auth = $this->get_route_contents("/api/v2/booking/payment/card/authorize/$token", 'POST', Input::all());

    // The code below is not executed, since the API request uses die()
    if ($auth->error === false) {
        return Redirect::route('appts')->with(['success' => trans('messages.booked_ok')]);
    }

    return Redirect::back()->with(['error' => $auth->reason]);
}
Can I handle this better? Any suggestions on how I should restructure my code?
I know I could just use return statements, but I was always wondering whether there are other solutions. I want to get better at this, and I wouldn't ask the question if I knew for sure that returning is the only way to do what I want.
So it seems that you are calling an API endpoint through your own code as if it were coming from the browser (client), and I am assuming that your Route::dispatch is not making any external request (like curl etc.).
There are various approaches to handle this:
1. If your function get_route_contents is going to handle all the requests, remove the die() from your endpoints and simply make them return the data (instead of echoing it). This "handler" will take care of the response.
2. Give your endpoint function an optional parameter (or some property set on the $request variable) that tells it the request is internal and the data should be returned; when the request comes directly from a browser (client), you can echo.
3. Make an external call to your code using curl etc. (only do this if there is no other option).
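A minimal plain-PHP sketch of approach 1, with all names hypothetical: the endpoint returns its payload instead of die()ing, so internal callers keep executing after the call, and only the outermost layer decides whether to echo.

```php
<?php

// Hypothetical payload builder, mirroring api_return() but returning data.
function api_payload(bool $error, array $params = []): array
{
    return ['error' => $error, 'date_time' => date('Y-m-d H:i:s')] + $params;
}

// Hypothetical endpoint: returns instead of dying, so internal callers
// (like get_route_contents) continue running after the call.
function authorize_endpoint(object $authorize): array
{
    if (isset($authorize->refusalReason)) {
        return api_payload(true, ['reason' => $authorize->refusalReason]);
    }

    return api_payload(false);
}

$result = authorize_endpoint((object) ['refusalReason' => 'expired card']);
// $result['error'] is true and $result['reason'] is 'expired card';
// the caller decides whether to json-encode and echo it.
```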
