Merge "Allow user to set sysctl_net_ipv4_tcp_retries2"

2021-06-23 13:57:13 +00:00
parent b22a7726aa 09d0409ed4
commit 18fd27feff
6 changed files with 79 additions and 3 deletions
--- a/ansible/roles/haproxy/defaults/main.yml
+++ b/ansible/roles/haproxy/defaults/main.yml
@@ -90,4 +90,8 @@ haproxy_check_timeout: "10s"
 # Check http://www.haproxy.org/download/1.5/doc/configuration.txt for available options
 haproxy_defaults_balance: "roundrobin"

+# Avoid TCP connections refusing to die after VIP switch
+# https://bugs.launchpad.net/kolla-ansible/+bug/1917068
+haproxy_host_ipv4_tcp_retries2: "KOLLA_UNSET"
+
 kolla_externally_managed_cert: False
--- a/ansible/roles/haproxy/tasks/config-host.yml
+++ b/ansible/roles/haproxy/tasks/config-host.yml
@@ -10,9 +10,10 @@
    sysctl_file: "{{ kolla_sysctl_conf_path }}"
  become: true
  with_items:
-    - { name: "net.ipv4.ip_nonlocal_bind", value: 1}
-    - { name: "net.ipv6.ip_nonlocal_bind", value: 1}
-    - { name: "net.unix.max_dgram_qlen", value: 128}
+    - { name: "net.ipv4.ip_nonlocal_bind", value: 1 }
+    - { name: "net.ipv6.ip_nonlocal_bind", value: 1 }
+    - { name: "net.ipv4.tcp_retries2", value: "{{ haproxy_host_ipv4_tcp_retries2 }}" }
+    - { name: "net.unix.max_dgram_qlen", value: 128 }
  when:
    - set_sysctl | bool
    - item.value != 'KOLLA_SKIP'
--- a/doc/source/reference/high-availability/haproxy-guide.rst
+++ b/doc/source/reference/high-availability/haproxy-guide.rst
@@ -0,0 +1,47 @@
+.. _haproxy-guide:
+
+=============
+HAProxy Guide
+=============
+
+Kolla Ansible supports a Highly Available (HA) deployment of
+OpenStack and other services. High availability in Kolla
+is implemented via Keepalived and HAProxy. Keepalived manages virtual IP
+addresses, while HAProxy load-balances traffic to service backends.
+These two components must be installed on the same hosts,
+and they are deployed to hosts in the ``haproxy`` group.
+
+Preparation and deployment
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+HAProxy and Keepalived are enabled by default. They may be disabled by
+setting the following in ``/etc/kolla/globals.yml``:
+
+.. code-block:: yaml
+
+   enable_haproxy: "no"
+   enable_keepalived: "no"
+
+Configuration
+~~~~~~~~~~~~~
+
+Failover tuning
+---------------
+
+When a VIP fails over from one host to another, hosts may take some
+time to detect that the connection has been dropped. This can lead
+to service downtime.
+
+To reduce the time the kernel takes to close dead connections to the VIP
+address, modify the ``net.ipv4.tcp_retries2`` kernel option by setting
+the following in ``/etc/kolla/globals.yml``:
+
+.. code-block:: yaml
+
+   haproxy_host_ipv4_tcp_retries2: 6
+
+This is especially helpful for connections to MariaDB. For further
+information about this kernel option, see
+`this article on RTO and tcp_retries2 <https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html>`__,
+`this Cloudflare blog post <https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/>`__ and
+`this Red Hat solution <https://access.redhat.com/solutions/726753>`__.
--- a/doc/source/reference/high-availability/index.rst
+++ b/doc/source/reference/high-availability/index.rst
@@ -0,0 +1,10 @@
+=================
+High-availability
+=================
+
+This section describes high-availability configuration of services.
+
+.. toctree::
+   :maxdepth: 1
+
+   haproxy-guide
--- a/doc/source/reference/index.rst
+++ b/doc/source/reference/index.rst
@@ -17,3 +17,4 @@ Projects Deployment Configuration Reference
   message-queues/index
   deployment-config/index
   deployment-and-bootstrapping/index
+   high-availability/index
--- a/releasenotes/notes/fix-TCP-connections-refusing-to-die-after-VIP-switch-5f9e811783c36041.yaml
+++ b/releasenotes/notes/fix-TCP-connections-refusing-to-die-after-VIP-switch-5f9e811783c36041.yaml
@@ -0,0 +1,13 @@
+---
+features:
+  - |
+    Added a new HAProxy configuration variable,
+    ``haproxy_host_ipv4_tcp_retries2``, which allows users to set the
+    ``net.ipv4.tcp_retries2`` kernel option.
+    This option sets the maximum number of times a TCP packet is retransmitted
+    in the established state before giving up. The default kernel value is 15,
+    which corresponds to a duration of between approximately 13 and 30
+    minutes, depending on the retransmission timeout. This variable can be used
+    to mitigate an issue with stuck connections in case of VIP failover;
+    see `bug 1917068 <https://bugs.launchpad.net/kolla-ansible/+bug/1917068>`__
+    for details.
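
Note (not part of the commit): the relationship between ``net.ipv4.tcp_retries2`` and the time a dead connection lingers can be estimated with a short calculation. The sketch below is a rough illustration only. It assumes the retransmission timeout starts at the Linux minimum of 200 ms, doubles after each retransmission, and is capped at 120 s. The function name and parameters are hypothetical, and real connections start from an RTO derived from the measured round-trip time, so actual durations are longer.

```python
def tcp_retries2_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Estimate seconds before the kernel gives up on an established connection.

    Assumption: the RTO starts at rto_min, doubles after every
    retransmission, and never exceeds rto_max. This gives a lower
    bound, not the exact duration seen on a real connection.
    """
    # `retries` retransmissions plus the final wait give retries + 1 intervals.
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


for value in (15, 6):
    print(f"tcp_retries2={value}: about {tcp_retries2_timeout(value):.1f} s")
```

Under these assumptions the kernel default of 15 gives a lower bound of roughly 15 minutes, while the value 6 used in the guide's example brings it down to under half a minute.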