Simple setup:
1 node running twemproxy (vcache:22122)
2 nodes running memcached (vcache-1, vcache-2) both listening on 11211
I have the following twemproxy config:
default:
  auto_eject_hosts: true
  distribution: ketama
  hash: fnv1a_64
  listen: 0.0.0.0:22122
  server_failure_limit: 1
  server_retry_timeout: 600000 # 600sec, 10m
  timeout: 100
  servers:
    - vcache-1:11211:1
    - vcache-2:11211:1
The twemproxy node can resolve all hostnames. As part of testing, I took down vcache-2. In theory, for every attempt to interface with vcache:22122, twemproxy will contact a server from the pool to facilitate the attempt. However, if one of the cache nodes is down, twemproxy is supposed to "auto eject" it from the pool, so subsequent requests will not fail.
It is up to the app layer to determine whether a failed attempt to interface with vcache:22122 was due to an infrastructure issue and, if so, to try again. However, I am finding that on the retry the same failed server is being used, so instead of subsequent attempts being passed to a known good cache node (in this case vcache-1), they are still being passed to the ejected cache node (vcache-2).
Here's the php code snippet which attempts the retry:
....
// $this is a Memcached object with vcache:22122 in the server list
$retryCount = 0;
do {
    $status = $this->set($key, $value, $expiry);
    if (Memcached::RES_SUCCESS === $this->getResultCode()) {
        return true;
    }
} while (++$retryCount < 3);
return false;
-- Update --
Link to Issue opened on Github for more info: Issue #427
I can't see anything wrong with your configuration. As you know, the important settings are in place:
default:
  auto_eject_hosts: true
  server_failure_limit: 1
The documentation suggests connection timeouts might be an issue.
Relying only on client-side timeouts has the adverse effect of the original request having timed out on the client-to-proxy connection, but still being pending and outstanding on the proxy-to-server connection. This gets further exacerbated when the client retries the original request.
Is your PHP script closing the connection and retrying before twemproxy has failed its first attempt and removed the server from the pool? Perhaps setting a timeout value in twemproxy lower than the connection timeout used in PHP would solve the issue.
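For illustration, here is a sketch of the client side of that (the option values are assumptions, not something tested against your setup): keep the PHP Memcached client's timeouts comfortably above twemproxy's timeout: 100, so the proxy gets to register the failure and eject the node before the client gives up.
// Sketch only: client-side timeouts set above twemproxy's `timeout: 100` (ms).
// Values are assumptions and may need tuning for your environment.
$mc = new Memcached();
$mc->addServer('vcache', 22122);
$mc->setOption(Memcached::OPT_CONNECT_TIMEOUT, 500); // milliseconds
$mc->setOption(Memcached::OPT_POLL_TIMEOUT, 500);    // milliseconds
$mc->setOption(Memcached::OPT_SEND_TIMEOUT, 500000); // microseconds
$mc->setOption(Memcached::OPT_RECV_TIMEOUT, 500000); // microseconds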
From your discussion on GitHub, though, it sounds like support for health checks, and perhaps auto ejection, isn't stable in twemproxy. If you're building against old packages, you might be better off finding a package that has been stable for some time. Would mcrouter (there's an interesting article about it) be suitable?
For this feature to work, merge in this repo/branch:
https://github.com/charsyam/twemproxy/tree/feature/heartbeat
so that you have this specific commit:
https://github.com/charsyam/twemproxy/commit/4d49d2ecd9e1d60f18e665570e4ad1a2ba9b65b1
Here is the PR:
https://github.com/twitter/twemproxy/pull/428
After that, recompile twemproxy.
Related
I have a worker on AWS that handles queued Laravel notifications. Some of the notifications get sent out, but others get stuck in the queue, and I don't know why.
I've looked at the logs in Beanstalk and see three different types of error:
2020/11/03 09:22:34 [emerg] 10932#0: *30 malloc(4096) failed (12: Cannot allocate memory) while reading upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock:", host: "localhost"
I see an Out of Memory issue on Bugsnag too, but without any stacktrace.
Another error is this one:
2020/11/02 14:50:07 [error] 10241#0: *2623 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock", host: "localhost"
And this is the last one:
2020/11/02 15:00:24 [error] 10241#0: *2698 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock:", host: "localhost"
I don't really understand what I can do to resolve these errors. It's just a basic Laravel / EBS / SQS setup, and the only thing the queue has to do is handle notifications. Sometimes a couple dozen at a time. I'm running a t2.micro, and I would assume that's enough to send a few e-mails? I've upped the environment to a t2.large, but to no avail.
I notice that messages end up in the queue, then get the status 'Messages in flight', but then run into all sorts of trouble on the Laravel side. But I don't get any useful errors to work with.
All implementation code seems to be fine, because the first few notifications go out as expected and if I don't queue at all, all notifications get dispatched right away.
The queued notifications eventually generate two different exceptions: MaxAttemptsExceededException and an Out of Memory FatalError, but neither leads me to the actual underlying problem.
Where do I look further to debug?
UPDATE
See my answer for the problem and the solution. The database transaction hadn't finished before the worker tried to send a notification for the object that still had to be created.
What is the current memory_limit assigned to PHP? You can determine this by running this command:
php -i | grep memory_limit
You can increase this by running something like:
sed -i -e 's/memory_limit = [current-limit]/memory_limit = [new-limit]/g' [full-path-to-php-ini]
Just replace the [current-limit] with the value displayed in the first command, and [new-limit] with a new reasonable value. This might require trial and error. Replace [full-path-to-php-ini] with the full path to the php.ini that's used by the process that's failing. To find this, run:
php -i | grep php.ini
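Keep in mind that php -i on the command line reports the CLI configuration, which may differ from the php.ini loaded by php-fpm. As a quick sanity check (just a sketch; drop it into any web-reachable script temporarily), you can dump what the failing SAPI actually sees:
<?php
// Temporary debug snippet: confirm the limits the web/worker process runs with,
// since they can differ from the CLI's `php -i` output.
var_dump(
    ini_get('memory_limit'),
    ini_get('max_execution_time'),
    php_ini_loaded_file() // which php.ini this SAPI actually loaded
);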
First, make sure that you've increased both max_execution_time and memory_limit.
Also make sure that you set the --timeout option on the queue worker.
Then make sure you follow the instructions for Amazon SQS, as the Laravel docs say:
The only queue connection which does not contain a retry_after value is Amazon SQS. SQS will retry the job based on the Default Visibility Timeout which is managed within the AWS console.
Job Expirations & Timeouts
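For reference, here is a minimal sketch of the SQS connection in config/queue.php (the values are placeholders); note there is no retry_after key for SQS, and the worker's --timeout should stay below the queue's visibility timeout:
<?php
// config/queue.php (sketch, placeholder values)
return [
    'connections' => [
        'sqs' => [
            'driver' => 'sqs',
            'key'    => env('AWS_ACCESS_KEY_ID'),
            'secret' => env('AWS_SECRET_ACCESS_KEY'),
            'prefix' => env('SQS_PREFIX'),
            'queue'  => env('SQS_QUEUE', 'default'),
            'region' => env('AWS_DEFAULT_REGION', 'us-east-1'),
            // no 'retry_after' here: SQS retries based on the queue's
            // Default Visibility Timeout, which is set in the AWS console
        ],
    ],
];
Then start the worker with something like php artisan queue:work sqs --timeout=60, keeping that value below the visibility timeout configured in the console.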
If you are sure that some of the queued events are correctly received and processed by the Laravel worker, then, as others said, it's most likely a PHP memory issue.
On Beanstalk, here's what I added to my ebextensions to get a larger memory limit for PHP (it was for Composer memory issues):
Note that this is with a t3.medium EC2 instance with 4 GB of RAM, dedicated to the Laravel API only.
02-environment.config
commands:
  ...
option_settings:
  ...
  - namespace: aws:elasticbeanstalk:container:php:phpini
    option_name: memory_limit
    value: 4096M
  - namespace: aws:ec2:instances
    option_name: InstanceTypes
    value: t3.medium
So you can try increasing the limit to use more of your instance's available RAM, and deploy again so Beanstalk rebuilds the instance and sets up the PHP memory_limit.
Note: the real config of course contains other configuration files and more content (truncated here).
As you said, you are just sending an email, so it should be fine. Is it happening when there's a burst of queued email? Are there, in the end, many events in the SQS dead-letter queue? If so, it may be because of a queued email burst, in which case SQS will "flood" the /worker route to execute your jobs. You could check server usage from the AWS console or with htop-like CLI tools, and also check the SQS interface to see whether many failed jobs arrive at the same moments (a burst).
Edit: for Elastic Beanstalk I use dusterio/laravel-aws-worker; maybe you do too, since your log mentions the /worker/queue route.
Memory
The default amount of memory allocated to PHP can often be quite small. When using EBS, you want to use config files as much as possible - any time you have to SSH in and change things on the server, you're going to have more issues when you need to redeploy. I have this added to my EBS config /.ebextensions/01-php-settings.config:
option_settings:
  aws:elasticbeanstalk:container:php:phpini:
    memory_limit: 256M
That's been enough when running a t3.micro to do all my notification and import processing. For simple processing it doesn't usually need much more memory than the default, but it depends a fair bit on your use-case and how you've programmed your notifications.
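If you want real numbers rather than guesses before settling on a limit, a quick sketch like this at the end of a notification run logs the peak usage (where exactly you put it is up to you):
<?php
// Sketch: log peak memory after handling a notification batch so memory_limit
// can be sized from measured usage instead of trial and error.
use Illuminate\Support\Facades\Log;

Log::info('Notification batch finished', [
    'peak_memory_mb' => round(memory_get_peak_usage(true) / 1048576, 1),
    'memory_limit'   => ini_get('memory_limit'),
]);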
Timeout
As pointed out in this answer already, the SQS queue operates a little differently when it comes to timeouts. This is a small trait that I wrote to help work around this issue:
<?php

namespace App\Jobs\Traits;

trait CanExtendSqsVisibilityTimeout
{
    /** NOTE: this needs to map to the setting in the AWS console */
    protected $defaultBackoff = 30;

    protected $backoff = 30;

    /**
     * Extend the time that the job is locked for processing.
     *
     * SQS messages are managed via the default visibility timeout console setting; note the absence of retry_after config.
     * @see https://laravel.com/docs/7.x/queues#job-expirations-and-timeouts
     * AWS recommends creating a "heartbeat" in the consumer process in order to extend processing time:
     * @see https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html#configuring-visibility-timeout
     *
     * @param int $delay Number of seconds to extend the processing time by
     *
     * @return void
     */
    public function extendBackoff($delay = 60)
    {
        if ($this->job) {
            // VisibilityTimeout has a 12 hour (43200s) maximum and will error above that; no extensions if close to it
            if ($this->backoff + $delay > 42300) {
                return;
            }
            // add the delay
            $this->backoff += $delay;

            $sqs = $this->job->getSqs();
            $sqsJob = $this->job->getSqsJob();
            $sqs->changeMessageVisibility([
                'QueueUrl' => $this->job->getQueue(),
                'ReceiptHandle' => $sqsJob['ReceiptHandle'],
                'VisibilityTimeout' => $this->backoff,
            ]);
        }
    }
}
Then for a queued job that was taking a long time, I changed the code a bit to work out where I could insert a sensible "heartbeat". In my case, I had a loop:
class LongRunningJob implements ShouldQueue
{
    use CanExtendSqsVisibilityTimeout;

    //...

    public function handle()
    {
        // some other processing, no loops involved

        // now the code that loops!
        $last_extend_at = time();
        foreach ($tasks as $task) {
            $task->doingSomething();

            // make sure the processing doesn't time out, but don't extend time too often
            if (time() > $last_extend_at + $this->defaultBackoff - 10) {
                // "heartbeat" to extend visibility timeout
                $this->extendBackoff();
                $last_extend_at = time();
            }
        }
    }
}
Supervisor
It sounds like you might need to look at how you're running your worker(s) in a bit more detail.
Having Supervisor running to help restart your workers is a must, I think. Otherwise if the worker(s) stop working, messages that are queued up will end up getting deleted as they expire. It's a bit fiddly to get working nicely with Laravel + EBS - there isn't really much good documentation around it, which is potentially why not having to manage it is one of the selling points for Vapor!
We finally found out what the problem was, and it wasn't memory or execution time.
From the beginning I thought it was strange that the default memory or default execution time wouldn't be sufficient to send an e-mail or two.
Our use case is: a new Article is created and users receive a notification.
A few clues that led to the solution:
We noticed that we usually have problems with the first notification.
If we create 10 articles at the same time, we miss the first notification on every article.
We set the HTTP Max Connections in the Worker to 1. When creating 10 articles simultaneously, we noticed that only the first article missed the first notification.
We didn't get any useful error messages from the Worker, so we decided to set up our own EC2 instance and run php artisan queue:work manually.
What we then saw explained everything:
Illuminate\Database\Eloquent\ModelNotFoundException: No query results for model [App\Article]
This is an error that we never got from the EBS Worker / SQS and swiftly led to the solution:
The notification is handled before the article has made it to the database.
We have added a delay to the worker and haven't had a problem since then. We recently added a database transaction to the process of creating an article, and creating the notification happens within that transaction (at the very end). I think that's why we didn't have this problem before. We decided to leave the notification creation in the transaction and just handle the notifications with a delay. This means we don't have to do a hotfix to get this solved.
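For illustration, the delay can also be attached to the queued notification itself (a sketch with a made-up notification class; newer Laravel versions additionally offer afterCommit() so the job is only dispatched once the surrounding transaction commits):
<?php
// Sketch (hypothetical class name): delay the queued notification so the
// article's transaction has committed before the worker picks up the job.
use App\Notifications\ArticleCreated;

$user->notify(
    (new ArticleCreated($article))->delay(now()->addSeconds(10))
);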
Thanks to everyone who joined in to help!
I have some cron jobs set up on Linux through wget; those jobs run once every 24 hours. All the jobs basically call APIs, pull the data, and store it in the database. Now the issue is that some API calls are very slow and take a long time to get a response, which eventually ends up producing the error below.
--2017-07-24 06:00:02--  http://wwwin-cam-stage.cisco.com/cron/mh.php
Resolving wwwin-cam-stage.cisco.com (wwwin-cam-stage.cisco.com)... 171.70.100.25
Connecting to wwwin-cam-stage.cisco.com (wwwin-cam-stage.cisco.com)|171.70.100.25|:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers. Retrying.

--2017-07-24 06:05:03--  (try: 2)  http://wwwin-cam-stage.cisco.com/cron/mh.php
Connecting to wwwin-cam-stage.cisco.com (wwwin-cam-stage.cisco.com)|171.70.100.25|:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers. Retrying.

--2017-07-24 06:10:05--  (try: 3)  http://wwwin-cam-stage.cisco.com/cron/mh.php
Connecting to wwwin-cam-stage.cisco.com (wwwin-cam-stage.cisco.com)|171.70.100.25|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/html]
Saving to: ‘mh.php.6’

     0K                                       0.00 =0s

2017-07-24 06:14:58 (0.00 B/s) - ‘mh.php.6’ saved [0/0]
Though on the third try it returned 200 OK, it messes up the actual data because it timed out on the first and second tries.
How can I change the URL timeout settings to an unlimited or very high limit, so the job completes in one try and without getting an error like
(Connection reset by peer)....
wget --timeout 10 http://url
This can be used in the case of wget; the value is the network timeout in seconds.
EDIT
Or
If you are asking about the keep-alive settings of the Linux machine, this might help.
On Red Hat Linux, modify the following kernel parameter by editing the /etc/sysctl.conf file, and restart the network daemon (/etc/rc.d/init.d/network restart).
"Connection reset by peer" is the TCP/IP equivalent of slamming the
phone back on the hook. It's more polite than merely not replying,
leaving one hanging. But it's not the FIN-ACK expected of the truly
polite TCP/IP converseur.
Code:
# Decrease the default value for tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 1800
EDIT
-T seconds
‘--timeout=seconds’
Set the network timeout to seconds seconds. This is equivalent to specifying ‘--dns-timeout’, ‘--connect-timeout’, and ‘--read-timeout’, all at the same time.
When interacting with the network, Wget can check for timeout and abort the operation if it takes too long. This prevents anomalies like hanging reads and infinite connects. The only timeout enabled by default is a 900-second read timeout. Setting a timeout to 0 disables it altogether. Unless you know what you are doing, it is best not to change the default timeout settings.
All timeout-related options accept decimal values, as well as subsecond values. For example, ‘0.1’ seconds is a legal (though unwise) choice of timeout. Subsecond timeouts are useful for checking server response times or for testing network latency.
‘--dns-timeout=seconds’
Set the DNS lookup timeout to seconds seconds. DNS lookups that don’t complete within the specified time will fail. By default, there is no timeout on DNS lookups, other than that implemented by system libraries.
‘--connect-timeout=seconds’
Set the connect timeout to seconds seconds. TCP connections that take longer to establish will be aborted. By default, there is no connect timeout, other than that implemented by system libraries.
‘--read-timeout=seconds’
Set the read (and write) timeout to seconds seconds. The “time” of this timeout refers to idle time: if, at any point in the download, no data is received for more than the specified number of seconds, reading fails and the download is restarted. This option does not directly affect the duration of the entire download.
Of course, the remote server may choose to terminate the connection sooner than this option requires. The default read timeout is 900 seconds.
Source Wget Manual.
See this wget Manual page for more information.
I have a report that generates an array of data from a MySQL server by looping through PHP code (Laravel framework). However, the maximum the server can handle is an array with 400 rows, and each row contains 61 child values.
[
[1, ...,61], // row 1
.
.
.
[1,....,61] // row 400
]
Each value is calculated by running a loop that retrieves data from the MySQL server.
There is no load balancer.
I tried increasing max_execution_time = 600 (10 minutes), but it still shows the connection timed out problem. Any thoughts? Thanks,
Connection Timed Out
Description: Connection Timed Out
Server version: Apache/2.4.7 (Ubuntu) - PHP 5.6
Would need more info for a definitive answer...
What is the Apache/httpd version (there have been some bugs that relate to this)?
Is there a firewall or load balancer in the mix?
If you are sure it is still a timeout error and not, say, memory, then it is probably httpd's TimeOut directive. It defaults to 300 seconds.
If still stuck paste the exact error you are seeing.
My PHP version was 5.6. After upgrading to PHP7, my application speed was increased significantly. Everything works fine now.
I've done quite a bit of reading before asking this, so let me preface by saying I am not running out of connections, or memory, or cpu, and from what I can tell, I am not running out of file descriptors either.
Here's what PHP throws at me when MySQL is under heavy load:
Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (11 "Resource temporarily unavailable")
This happens randomly under load - but the more I push, the more frequently PHP throws this at me. While this is happening I can always connect locally through the console, and from PHP through 127.0.0.1 instead of "localhost" (which uses the faster unix socket).
Here's a few system variables to weed out the usual problems:
cat /proc/sys/fs/file-max = 4895952
lsof | wc -l = 215778 (during "outages")
Highest usage of available connections: 26% (261/1000)
InnoDB buffer pool / data size: 10.0G/3.7G (plenty o room)
soft nofile 999999
hard nofile 999999
I am actually running MariaDB (Server version: 10.0.17-MariaDB MariaDB Server)
These results were generated both under normal load and by running mysqlslap during off hours, so slow queries are not an issue - just high connection counts.
Any advice? I can report additional settings/data if necessary - mysqltuner.pl says everything is a-ok
and again, the revealing thing here is that connecting via IP works just fine and is fast during these outages - I just can't figure out why.
Edit: here is my my.ini (some values may seem a bit high from my recent troubleshooting changes, and please keep in mind that there are no errors in the MySQL logs, system logs, or dmesg)
socket=/var/lib/mysql/mysql.sock
skip-external-locking
skip-name-resolve
table_open_cache=8092
thread_cache_size=16
back_log=3000
max_connect_errors=10000
interactive_timeout=3600
wait_timeout=600
max_connections=1000
max_allowed_packet=16M
tmp_table_size=64M
max_heap_table_size=64M
sort_buffer_size=1M
read_buffer_size=1M
read_rnd_buffer_size=8M
join_buffer_size=1M
innodb_log_file_size=256M
innodb_log_buffer_size=8M
innodb_buffer_pool_size=10G
[mysql.server]
user=mysql
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
open-files-limit=65535
Most likely it is due to net.core.somaxconn
What is the value of /proc/sys/net/core/somaxconn?
net.core.somaxconn
# The maximum number of "backlogged sockets". Default is 128.
These are connections sitting in the queue that have not yet been accepted. Anything above that queue size will be rejected. I suspect this is what's happening in your case. Try increasing it according to your load.
As the root user, run:
echo 1024 > /proc/sys/net/core/somaxconn
This is something that can and should be solved by analysis. Learning how to do this is a great skill to have.
Analysis to find out just what is happening under heavy load - number of queries, execution times - should be your first step. Determine the load and then make the proper DB config settings. You might find you need to optimize the SQL queries instead!
Then make sure the PHP DB driver settings are in alignment as well, to fully utilize the database connections.
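As one concrete (hedged) example of that alignment, persistent connections in the PHP driver keep the connect rate on the unix socket down, at the cost of holding connections open between requests (the connection details below are placeholders):
<?php
// Sketch: persistent PDO connection over the unix socket so each request
// doesn't open a fresh connection; keep MySQL's max_connections sized for this.
$pdo = new PDO(
    'mysql:unix_socket=/var/lib/mysql/mysql.sock;dbname=app',
    'app_user',
    'secret',
    [
        PDO::ATTR_PERSISTENT => true,
        PDO::ATTR_ERRMODE    => PDO::ERRMODE_EXCEPTION,
    ]
);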
Here is a link to the MariaDB threadpool documentation. I know it says version 5.5, but it's still relevant and the page does reference version 10. There are settings listed there that may not be in your .cnf file which you can use.
https://mariadb.com/kb/en/mariadb/threadpool-in-55/
Off the top of my head, I can think of max_connections as a possible source of the problem. I'd increase the limit, if only to eliminate that possibility.
Hope it helps.
When I try to launch
php behat.phar
The WebDriver Firefox window pops up and then my feature test fails at the first step and skips the rest. I get:
...
Given I am on "first.php" #FeatureContext::visit()
Session [url] not available and is not among the last 1000 terminated sessions.
Active sessions are[ext. key 51191ae0-8f6f-49d0-27b322967296]
...
If I only use Behat, the test passes. This happens only when I try to use Selenium.
I'm using MinkExtension's premade GivenIAmOn() function.
my behat.yml:
default:
  paths:
    features: features
    bootstrap: features/bootstrap
  extensions:
    mink_extension.phar:
      mink_loader: 'mink.phar'
      base_url: 'http://10.0.0.10/'
      goutte: ~
      selenium2:
        wd_host: 'http://localhost:4444/wd/hub'
        capabilities:
          version: ''
My FeatureContext extends from MinkContext.
I've been searching for a solution for days and I couldn't solve this.
I'm working with Windows 7, Firefox 26, and selenium-server-standalone-2.42.2, and I tried lower versions as well. As I read in some issues, the session/"session-id"/url was broken some versions ago, but it shouldn't be now. For some reason it can't pick the right session.
Sorry for the data quality; I don't have an internet connection at my workplace and it's quite restricted. That's why I use .phar files instead of Composer. I can't copy-paste the files either, and so on. If I have to provide more data, just tell me and I will.
It sounds like a grid-level timeout issue. You should try to increase browserTimeout and newSessionWaitTimeout and see if that helps.
Source: Session not available and is not among the last 1000 terminated sessions.
Timeouts in the grid should normally be handled through webDriver.manage().timeouts(), which will control how the different operations time out.
The browserTimeout should be:
Higher than the socket lock timeout (45 seconds).
Generally higher than values used in webDriver.manage().timeouts(), since this mechanism is a "last line of defense".
For any issues, check also: http://localhost:4444/wd/hub/sessions
I had the same problem, and the Selenium Grid logs show this:
WARN [RequestHandler.process] - The client is gone for session ext. key fa804448787370d0547cd517ab2badc1, terminating
INFO [ActiveTestSessions.updateReason] - Removed a session that had not yet assigned an external key 24f5656a-7a59-4edb-bf7b-c6a1ae59ca16, indicates failure in session creation CLIENT_GONE
The error CLIENT_GONE is :
"The client process (your code) appears to have died or otherwise not responded to our requests, intermittent network issues may also cause".
And I had some tests waiting in the queue (5 tests waiting while 5 others were running).
I solved this problem by simply not letting tests queue up on the grid.
I fixed the problem by using the same version for the Selenium hub and the nodes:
image: selenium/hub:3.11
AND
image: selenium/node-chrome:3.11