We have experienced a problem with AWS Aurora failover and are looking for pointers on how to resolve it.
Scenario
AWS Aurora set up with two end points:
Writer:
host: stackName-dbcluster-ID.cluster-ID.us-west-2.rds.amazonaws.com
resolves to IP: 10.1.0.X
Reader:
host: stackName-dbcluster-ID.cluster-ro-ID.us-west-2.rds.amazonaws.com
resolves to IP: 10.1.0.Y
Our PDO MySQL connection string therefore uses stackName-dbcluster-ID.cluster-ID.us-west-2.rds.amazonaws.com (for writing).
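For reference, the writer connection might look like this minimal PDO sketch (the database name and credential variables here are hypothetical, not from the original setup):

$pdo = new PDO(
    'mysql:host=stackName-dbcluster-ID.cluster-ID.us-west-2.rds.amazonaws.com;dbname=app', // dbname is hypothetical
    $user,
    $pass,
    [PDO::ATTR_PERSISTENT => true]  // persistent connections are in play here
);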
After failover
On failover, the DNS entries are flipped to point as follows:
Reader:
host: stackName-dbcluster-ID.cluster-ro-ID.us-west-2.rds.amazonaws.com
resolves to IP: 10.1.0.X
Writer:
host: stackName-dbcluster-ID.cluster-ID.us-west-2.rds.amazonaws.com
resolves to IP: 10.1.0.Y
Critically, the PDO connection string (for writing) remains the same, "stackName-dbcluster-ID.cluster-ID.us-west-2.rds.amazonaws.com", but now points to a different IP address.
What Happened
We had error 1290 "SQLSTATE[HY000]: General error: 1290 The MySQL server is running with the --read-only option so it cannot execute this statement".
As the DB engines are stopped and started during failover, our initial persistent connections will have "gone away" and been invalidated (something we handle immediately in reconnect/retry code).
However, the error above means new connections were made to the old node and then not invalidated when the DNS change propagated. They lasted 10-15 minutes (well beyond the DNS TTL).
My Questions
Does anyone know whether PDO retrieves a persistent connection based on the connection string, or whether it more reliably uses the IP address or some other signature? Evidence suggests it's the hostname, but I would like confirmation.
Does anyone know a way to mark a persistent connection as "invalid" in PDO, so that it is not used again?
Or, is there something I missed?
Side notes
We already have code in place to handle the retry, and the retry is told to get a new non-persistent connection (which works). It's at this point we could "invalidate" the PDO connection so the next run of a script does not repeat this cycle over and over (a sketch of this pattern follows these notes).
The failover can happen at any time, so we're not in a position to do manual actions such as restart php (as we had to do this time).
Without persistent connections, performance is notably slower.
FastCGI, CentOS 16, PHP 7.2, mysqlnd 5.0.12-dev (which is normal on CentOS - see https://superuser.com/questions/1433346/php-shows-outdated-mysqlnd-version)
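For illustration, here is a minimal sketch of the retry path described above ($dsn, $user, $pass and $writeSql are hypothetical placeholders; as far as I know PDO offers no documented API to evict a cached persistent handle, so the fallback is simply a fresh non-persistent connection):

try {
    $pdo = new PDO($dsn, $user, $pass, [
        PDO::ATTR_PERSISTENT => true,
        PDO::ATTR_ERRMODE    => PDO::ERRMODE_EXCEPTION,
    ]);
    $pdo->exec($writeSql);
} catch (PDOException $e) {
    // 1290: the cached handle still points at the now read-only node.
    if (isset($e->errorInfo[1]) && $e->errorInfo[1] == 1290) {
        // Fall back to a fresh, non-persistent connection for the retry.
        $pdo = new PDO($dsn, $user, $pass, [
            PDO::ATTR_PERSISTENT => false,
            PDO::ATTR_ERRMODE    => PDO::ERRMODE_EXCEPTION,
        ]);
        $pdo->exec($writeSql);
    } else {
        throw $e;
    }
}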
Persistent connections must be terminated and restarted.
Reminds me of a 2-minute TTL that took 20 minutes to be recognized. I don't know whether Amazon does a better job, or even if they have any say in DNS.
5.0.12?? That was released in 2005! Maybe a typo. Anyway, I don't think the version matters in this question.
DNS may not be the optimal way to fail over; there are several proxy servers out there. I would expect them to flip within seconds. However, they need to know who's who rather than depending on DNS.
Can you modify the code to disconnect+reconnect when that error occurs? (It may not help.)
Unfortunately, this error is documented:
https://github.com/jeremydaly/serverless-mysql/issues/7
everything said there revolves around migrating to the mysqlnd driver with the mysqlnd_ms plugin.
I will continue looking for a more efficient solution.
Related
I've noticed that, while reasonably fast, the connection to the database (Google's Cloud SQL, the MySQL-compatible one) is a large part of the request. I'm using PDO for the abstraction.
So, since that's the case the obvious solution is to enable PHP's PDO persistent connections.
To my understanding, and I've verified this in PHP's source code (links below), the way these work is as follows:
when you connect with the persistent flag on, PHP caches the connection, using a hash of the connection string, username and password as the key
when you try to reconnect in another request, it checks whether a persistent connection exists, then checks the liveness of the connection (which is driver-specific; in my case a MySQL version query is executed) and kills the cached version if it fails the test
if the cached version was killed, a new connection is created and returned; otherwise you skip the overhead of creating the connection (around 30x faster creation based on Xdebug profiles executed directly on devel versions in the cloud)
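For concreteness, the persistent flag in question is just a constructor attribute; a minimal sketch (the DSN and credential variables are hypothetical):

$pdo = new PDO(
    'mysql:host=10.0.0.5;dbname=app',   // hypothetical Cloud SQL address
    $user,
    $pass,
    [PDO::ATTR_PERSISTENT => true]      // hash of DSN + user + pass keys the cache
);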
Everything sounds good so far? I'm not sure how all of this works when, say, you have a cached connection and two requests hit it (stress testing doesn't appear to cause issues), but otherwise it sounds okay and in testing works fine.
Well, here's what happens in the real world after some time passes...
Once a connection "dies", PDO will stall the entire request for 60s or more. This happens, I believe, after maybe an hour or more; so for a short while everything works just fine and PDO connects super fast to Cloud SQL. I've tried several ways to at least keep the stall under 1s, to no result (ini_set socket timeout won't affect it, the expires flag on PDO is ignored I believe, exception and status checks for "has gone away" are useless since it stalls on making the connection, etc.). I assume the connection most likely "expires", but the reasons are unknown to me. I assume Cloud SQL drops it since it's not in "show processlist;", but it's possible I'm not looking at it correctly.
Is there any secret sauce that makes PDO persistent connections work with Cloud SQL for more than a brief time?
Are persistent connections to Cloud SQL not supported?
You haven't described where your application is running (e.g. Compute Engine, App Engine, etc), so I will make an educated guess based on the described symptoms.
It's likely your TCP keepalive time is set too high on the application host. You can change the settings via these instructions.
Assuming a Linux host, the following command will show your current setting, in seconds:
cat /proc/sys/net/ipv4/tcp_keepalive_time
TCP connections without any activity for a period of time may be dropped anywhere along the path between the application host and Cloud SQL, depending on how everything in the communication path is configured.
TCP keepalive sends a periodic "ping" on idle connections to work around this problem/feature.
Cloud SQL supports long-running connections.
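As an aside (my illustration, not part of the original answer): PDO itself doesn't expose keepalive settings, but for sockets you control directly, PHP can enable the same mechanism per-socket (requires the sockets extension; the address below is hypothetical):

$sock = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
socket_set_option($sock, SOL_SOCKET, SO_KEEPALIVE, 1); // send periodic probes on idle connections
socket_connect($sock, '10.0.0.5', 3306);               // hypothetical database address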
We've recently implemented a SQL Server 2012 Always On failover cluster. The go-live is in 2 weeks time and some concerning issues have come up.
Previously we were dealing with servers in the same subnet, but we've since moved the servers to multiple subnets. Since doing that we encountered the multiple subnet failover issue; http://technet.microsoft.com/en-us/library/ff878716.aspx.
"In a multi-subnet configuration, both the online and offline IP addresses of the network name will be registered at the DNS server. The client application then retrieves all registered IP addresses from the DNS server and attempts to connect to the addresses either in order or in parallel. This means that client recovery time in multi-subnet failovers no longer depend on DNS update latencies.
By default, the client tries the IP addresses in order. When the client uses the new optional MultiSubnetFailover=True parameter in its connection string, it will instead try the IP addresses simultaneously and connects to the first server that responds. This can help minimize the client recovery latency when failovers occur."
The symptoms of the issue are: The PHP 5.4 server will intermittently fail to connect. It may work for 20 minutes, then fail for 25 minutes, then work for 40 minutes.
We've tried introducing the 'MultiSubnetFailover' parameter as so:
$dbhandle = sqlsrv_connect(
    $myServer,
    array(
        "UID"                  => $myUser,
        "PWD"                  => $myPass,
        "Database"             => $myDB,
        'ReturnDatesAsStrings' => true,
        'MultiSubnetFailover'  => true,
    )
);
And updating the webserver with Microsoft SQL drivers that explicitly support multi-subnet failover; http://blogs.msdn.com/b/sqlphp/archive/2012/03/07/microsoft-drivers-3-0-for-php-for-sql-server-released-to-web.aspx
The subnets are set up correctly and I can connect normally through other services such as SQL Server Management Studio when I supply the 'MultiSubnetFailover=Yes' parameter; in fact, the difference is night and day.
Any help appreciated, this one is too close to the release for comfort.
EDIT: There is actually a second connection string I missed, but even after configuring this with the multi-subnet failover parameter the error still occurs:
$pdoHandle = new PDO("sqlsrv:server={$myServer};database={$myDB};multiSubnetFailover=yes", $myUser, $myPass);
This is an awkward problem without much documentation, but we did come to a solution. The sqlsrv_connect parameter should read
'MultiSubnetFailover' => "Yes"
rather than 'true', because true is a boolean value, whereas the driver expects a string. For a connection string as used by the PDO interface, the following syntax seems to work for us:
"MultiSubnetFailover=True"
But even when you use the correct syntax, the support isn't great. If this solution doesn't work, you need to increase the timeout on the connection, because the SQL Server driver will try each DNS record in turn. We use "LoginTimeout=75" (seconds) for a set-up with 2 subnets, and 110 should do for a set-up with 3 subnets.
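Putting both fixes together with the longer timeout, the connection calls might look like this sketch (variable names as in the question; LoginTimeout is in seconds):

$dbhandle = sqlsrv_connect($myServer, array(
    "UID"                  => $myUser,
    "PWD"                  => $myPass,
    "Database"             => $myDB,
    "ReturnDatesAsStrings" => true,
    "MultiSubnetFailover"  => "Yes",  // string, not boolean
    "LoginTimeout"         => 75,     // allows the driver to try each DNS record
));

$pdoHandle = new PDO(
    "sqlsrv:server={$myServer};database={$myDB};MultiSubnetFailover=yes;LoginTimeout=75",
    $myUser,
    $myPass
);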
This solution is, however, still crap. It works acceptably for a front-end application that only needs to connect once and uses the same connection from then on. It doesn't work so well for web servers, which tend to create a new connection for each request. It could make loading each web page take as long as 30, 70 or 100 seconds, depending on how the DNS records happen to be configured at the time.
The target is simple: clients post HTTP requests to query data and update records by some keys. Peak request rate: 500/sec (the higher the better, but the main goal is to fulfil this requirement while keeping the system easy to build and using fewer machines).
What I've done: nginx + php-cgi (PHP) to serve HTTP requests; the PHP code uses Thrift RPC to retrieve data from a DB proxy that is only used to query and update the DB (MySQL). The DB proxy uses a MySQL connection pool and Thrift's TNonblockingServer. (In my country there are 2 ISPs; the DB proxy will be deployed on multi-ISP machines, as is the DB, while web servers can be deployed on single-ISP machines, based on experience.)
What troubles me: when I stress test (at >500/sec), I find "TSocket: Could not connect to 172.19.122.32:9090 (Connection refused [111])" in the PHP log. I think it may be caused by running out of ports (possibly an incorrect conclusion). So I planned to use a Thrift connection pool to reduce the number of Thrift connections, but there is no connection pool in PHP (there seems to be some DB connection pool tech) and PHP does not support the feature.
So I think maybe the project was designed the wrong way from the beginning (e.g. using PHP and Thrift). Is there a good way to solve this based on what I've done? I suspect most people will doubt my awkward scheme; a new scheme would help a lot.
thanks.
"TSocket: Could not connect to 172.19.122.32:9090 (Connection refused [111])" from php log shows the ports running out because of too many short connections in a short time. So I config the tcp TIME_WAIT status to recycle port in time using:
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_tw_recycle=1
it works!
What troubled me is solved, but changing the kernel parameters will affect NAT, so it's not a perfect solution. I think a better design for this system is still worth discussing.
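As an alternative to kernel tweaks (my sketch, not from the original answer): the Apache Thrift PHP library's TSocket accepts a persistence flag, which uses pfsockopen() under the hood so each PHP worker reuses its socket across requests instead of opening and closing one per request. Assuming the apache/thrift Composer package:

require_once 'vendor/autoload.php';

use Thrift\Transport\TSocket;
use Thrift\Transport\TBufferedTransport;
use Thrift\Protocol\TBinaryProtocol;

$socket    = new TSocket('172.19.122.32', 9090, true); // third argument: persist
$transport = new TBufferedTransport($socket);
$protocol  = new TBinaryProtocol($transport);
$transport->open();
// ... issue RPC calls against the generated Thrift client here ...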
Colleagues!
I'm running PHP 5.3 (5.3.8) with the memcache (2.2.6) client library (http://pecl.php.net/package/memcache) to talk to a memcached server.
My goal is to have failover solution for sessions engine, namely:
Only native php sessions support (no custom handlers)
Few memcached servers in the pool
What I expect is that if one of the memcached servers is down, PHP will attempt to use the second server in the pool (successfully connect to it and be happy). However, when the first memcached server in the pool is down, I receive the following error:
Session start failed. Original message: session_start(): Server 10.0.10.111 (tcp 11211) failed with: Connection refused (111)
while relevant php settings are:
session.save_handler memcache
session.save_path tcp://10.0.10.111:11211?persistent=1&weight=1&timeout=1&retry_interval=10, tcp://10.0.10.110:11211?persistent=1&weight=1&timeout=1&retry_interval=10
and memcache settings (while I think that it's near to standard) are:
Directive Local Value
memcache.allow_failover 1
memcache.chunk_size 8192
memcache.default_port 11211
memcache.default_timeout_ms 1000
memcache.hash_function crc32
memcache.hash_strategy standard
memcache.max_failover_attempts 20
Memcached still running on the second server and perfectly accessible from the WEB server:
telnet 10.0.10.110 11211
Trying 10.0.10.110...
Connected to 10.0.10.110 (10.0.10.110).
Escape character is '^]'.
get aaa
END
quit
Connection closed by foreign host.
So, in other words, instead of querying all of the listed servers sequentially, it crashes after an unsuccessful attempt to connect to the first server in the queue. Finally, I do realize that there are 3.0.x releases of the client library available, but they do not look reliable enough to me as they are still in beta.
Please advise how I can get the desired behavior with the standard PHP, client library and server.
Thanks a lot!
Best,
Eugene
Use the Memcached extension. Note that there are two memcache plugins for PHP. One is called Memcache, the other is called Memcached. Yes, that's confusing, but true anyway.
The Memcache plugin supports those complex URLs you're using, with the protocol identifier (tcp) and the parameters (persistence and so on), while the Memcached plugin supports connection pools.
The documentation you're mentioning in the comments above (http://www.php.net/manual/en/memcached.sessions.php) is about the Memcached extension, not about Memcache.
Update: Some interesting read: https://serverfault.com/questions/164350/can-a-pool-of-memcache-daemons-be-used-to-share-sessions-more-efficiently
I would like to thank everybody who participated in this question. The answer is the following: in reality memcache (not memcached) as a session handler does support comma-separated servers in session.save_path, and moreover it supports failover. The error mentioned above, "Session start failed. Original message: session_start(): Server 10.0.10.111 (tcp 11211) failed with: Connection refused (111)", was only level 8 (Notice). In fact the engine just informs you that one of the servers is unavailable (which is logical; otherwise how would you know?) and then successfully connects to the second server and uses it.
So all of the misunderstanding was caused by weak documentation, memcache/memcached confusion and the paranoid (E_ALL) settings of my custom error handler. In the meantime, the issue has been resolved by ignoring notices referring to Connection refused (111) in the session-establishing context.
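For reference, a minimal sketch of such a filter in a custom error handler (assuming notice-level reporting as described above; the message match is illustrative):

set_error_handler(function ($severity, $message, $file, $line) {
    // Swallow the failover notice; memcache will use the next server in the pool.
    if ($severity === E_NOTICE && strpos($message, 'Connection refused (111)') !== false) {
        return true;    // handled: do not escalate
    }
    return false;       // fall through to the default handler
});
session_start();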
You must change the hash strategy
Change your config to
memcache.hash_strategy consistent
When you set the hash strategy to consistent, memcache copies the data across multiple servers. If one of the servers is down, it retries copying it on the next request.
I'm using the PHP memcache module to connect to a local memcached server (127.0.0.1), but I don't know which one I should use: Memcache::connect() or Memcache::pconnect()? Will Memcache::pconnect() consume a lot of server resources?
Thank you very much for your answer!
Memcached uses a TCP connection (the handshake is 3 extra packets; closing is usually 4 packets) and doesn't require any authentication. Therefore, the only upside to using a persistent connection is that you don't need to send those extra 7 packets and don't have to worry about a leftover TIME-WAIT port for a few seconds.
Sadly, the downside of sacrificing those resources is far greater than the minor upsides. So I recommend not using persistent connections in memcached.
pconnect stands for persistent connection. This means that the client (in your case the script) will constantly hold a connection open to your server, which might not be a resource problem so much as a lack of available connections.
You should probably want the standard connect unless you know you need persistent connections.
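For reference, the two calls differ only in persistence; a minimal sketch against a local daemon:

// Standard connection: closed when the request ends (or on close()).
$mc = new Memcache();
$mc->connect('127.0.0.1', 11211);

// Persistent connection: survives across requests in the same worker process.
$pmc = new Memcache();
$pmc->pconnect('127.0.0.1', 11211);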
As far as I know, the same rules that govern persistent vs. regular connections when connecting to MySQL apply to memcached as well. The upshot is, you probably shouldn't use persistent connections in either case.
"Consumes" TCP port.
In the application I'm developing I use pconnect, as it uses a connection pool; from the hardware point of view, one server keeps one connection to memcache. I don't know exactly how it works, but I think memcached is smart enough to track the IP of the memcached client machine.
I've played with memcached for a long time and found that Memcache::getStats() shows the connection count doesn't increase when using pconnect.
You can use a debug page that shows memcached stats, and try tweaking pconnect or connect to see what's going on.
One downside is that PHP gets no blatant error or warning if one or all of the persistently-connected memcached daemons vanish(es). That's a pretty darn big downside.