diff --git a/doc/source/admin/index.rst b/doc/source/admin/index.rst index b24d8533f7..ebe03df5ee 100644 --- a/doc/source/admin/index.rst +++ b/doc/source/admin/index.rst @@ -34,3 +34,4 @@ the major upgrades procedures. upgrades/compatibility-matrix.rst upgrades/minor-upgrades.rst upgrades/major-upgrades.rst + upgrades/distribution-upgrades.rst diff --git a/doc/source/admin/upgrades/distribution-upgrades.rst b/doc/source/admin/upgrades/distribution-upgrades.rst new file mode 100644 index 0000000000..1ee925dab1 --- /dev/null +++ b/doc/source/admin/upgrades/distribution-upgrades.rst @@ -0,0 +1,281 @@ +===================== +Distribution upgrades +===================== + +This guide provides information about upgrading from one distribution +release to the next. + +.. note:: + + This guide was written when upgrading from Ubuntu Bionic to Focal during the + Victoria release cycle. + +Introduction +============ + +OpenStack Ansible supports operating system distribution upgrades during +specific release cycles. These can be observed by consulting the operating +system compatibility matrix, and identifying where two versions of the same +operating system are supported. + +Upgrades should be performed in the order specified in this guide to minimise +the risk of service interruptions. Upgrades must also be carried out by +performing a fresh installation of the target system's operating system, before +running openstack-ansible to install services on this host. + +Ordering +======== + +This guide includes a suggested order for carrying out upgrades. This may need +to be adapted dependent on the extent to which you have customised your +OpenStack Ansible deployment. + +Critically, it is important to consider when you upgrade 'repo' +hosts/containers. At least one 'repo' host should be upgraded before you +upgrade any API hosts/containers. The last 'repo' host to be upgraded should be +the 'primary', and should not be carried out until after the final service +which does not support '--limit' is upgraded. + +If this order is adapted, it will be necessary to restore some files to the +'repo' host from a backup part-way through the process. This will be necessary +if no 'repo' hosts remain which run the older operating system version, which +prevents older packages from being built. + +Beyond these requirements, a suggested order for upgrades is a follows: + +#. Infrastructure services (Galera, RabbitMQ, APIs, HAProxy) + + In all cases, secondary or backup instances should be upgraded first + +#. Compute nodes + +#. Network nodes + +Pre-Requisites +============== + +* Ensure that all hosts in your target deployment have been installed and + configured using a matching version of OpenStack Ansible. Ideally perform a + minor upgrade to the latest version of the OpenStack release cycle which you + are currently running first in order to reduce the risk of encountering + bugs. + +* Check any OpenStack Ansible variables which you customise to ensure that + they take into account the new and old operating system version (for example + custom package repositories and version pinning). + +* Perform backups of critical data, in particular the Galera database in case + of any failures. It is also recommended to back up the '/var/www/repo' + directory on the primary 'repo' host in case it needs to be restored + mid-upgrade. + +* Identify your 'primary' HAProxy/Galera/RabbitMQ/repo infrastructure host + + In a simple 3 infrastructure hosts setup, these services/containers + usually end up being all on the the same host. + + The 'primary' will be the LAST box you'll want to reinstall. + + * HAProxy/keepalived + + Finding your HAProxy/keepalived primary is as easy as + + .. code:: console + + ssh {{ external_lb_vip_address }} + + Or preferably if you've installed HAProxy with stats, like so; + + .. code-block:: yaml + + haproxy_stats_enabled: true + haproxy_stats_bind_address: "{{ external_lb_vip_address }}" + + and can visit https://admin:password@external_lb_vip_address:1936/ and read + 'Statistics Report for pid # on infrastructure_host' + + * repo_container + + Check all your repo_containers and look for /etc/lsyncd/lsyncd.conf.lua + +Warnings +======== + +* During the upgrade process, some OpenStack services cannot be deployed by + using Ansible's '--limit'. As such, it will be necessary to deploy some + services to mixed operating system versions at the same time. + + The following services are known to lack support for '--limit': + + * RabbitMQ + * Repo Server + * Keystone + +* In the same way as OpenStack Ansible major (and some minor) upgrades, there + will be brief interruptions to the entire Galera and RabbitMQ clusters + during the upgrade which will result in brief service interruptions. + +* When taking down 'memcached' instances for upgrades you may encounter + performance issues with the APIs. + +Deploying Infrastructure Hosts +============================== + +#. Drain RabbitMQ connections (optional) + + In order to cleanly hand over connections from one member of the RabbitMQ + cluster to another, the instance being reinstalled should be drained. + This can be achieved by running the following from the instance to be + reinstalled and waiting for the RabbitMQ admin interface to indicate that + socket descriptors have reduced to zero. + + .. code:: console + + rabbitmq-upgrade drain + +#. Disable HAProxy back ends (optional) + + If you wish to minimise error states in HAProxy, services on hosts which are + being reinstalled can be set in maintenance mode (MAINT). + + Log into your primary HAProxy/keepalived and run something similar to + + .. code:: console + + echo "disable server repo_all-back/_repo_container-" | socat /var/run/haproxy.stat stdio + + for each API or service instance you wish to disable. + + Or if you've enabled haproxy_stats as described above, you can visit + https://admin:password@external_lb_vip_address:1936/ and select them and + 'Set state to MAINT' + +#. Reinstall an infrastructure host's operating system + + As noted above, this should be carried out for non-primaries first, ideally + starting with a 'repo' host. + +#. Clearing out stale information + + #. Removing stale ansible-facts + + .. code:: console + + rm /etc/openstack_deploy/ansible-facts/reinstalled_host* + + (* because we're deleting all container facts for the host as well.) + + #. If RabbitMQ was running on this host + + We forget it by running these commands on another RabbitMQ host. + + .. code:: console + + rabbitmqctl cluster_status + rabbitmqctl forget_cluster_node rabbit@removed_host_rabbitmq_container + +#. If it is NOT a 'primary', install everything on the new host + + .. code:: console + + openstack-ansible setup-hosts.yml --limit localhost,reinstalled_host* + openstack-ansible setup-infrastructure.yml --limit localhost,repo_all,rabbitmq_all,reinstalled_host* + openstack-ansible setup-openstack.yml --limit localhost,keystone_all,reinstalled_host* + + (* because we need to include containers in the limit) + +#. If it IS a 'primary', do these steps + + .. code:: console + + openstack-ansible setup-hosts.yml --limit localhost,reinstalled_host* + + Temporarily set your primary Galera in MAINT in HAProxy + + .. code:: console + + openstack-ansible galera-install.yml --limit localhost,reinstalled_host* + + Note that at this point, the Ansible role will have taken the primary Galera + out of MAINT in HAProxy. You may wish to temporarily put it back into MAINT + until you are sure it is working correctly. + + You'll now have mariadb running, but it's not synced info from the + non-primaries. To fix this we ssh to the primary Galera, and restart the + mariadb.service and verify everything is in order. + + .. code:: console + + systemctl restart mariadb.service + mysql + mysql> SHOW STATUS LIKE "wsrep_cluster_%"; + mysql> SHOW DATABASES; + + Everything should be sync'ed and in order now. You can take your + primary Galera from MAINT to READY + + We can move on to RabbitMQ primary + + .. code:: console + + openstack-ansible rabbitmq-install.yml + + The RabbitMQ primary will also be in a cluster of it's own. You will need to + fix this by running these commands on the primary. + + .. code:: console + + rabbitmqctl stop_app + rabbitmqctl join_cluster rabbit@some_operational_rabbitmq_container + rabbitmqctl start_app + rabbitmqctl cluster_status + + Everything should now be in a working state and we can finish it off with + + .. code:: console + + openstack-ansible setup-infrastructure.yml --limit localhost,repo_all,rabbitmq_all,reinstalled_host* + openstack-ansible setup-openstack.yml --limit localhost,keystone_all,reinstalled_host* + +#. Adjust HAProxy status + + If HAProxy was set into MAINT mode, this can now be removed for services + which have been restored. + + For the 'repo' host, it is important that the freshly installed hosts are + set to READY in HAProxy, and any which remain on the old operating system + are set to 'MAINT'. + +Deploying Compute & Network Hosts +================================= + +#. Disable the hypervisor service on compute hosts and migrate any VMs to + another available hypervisor. + +#. Reinstall a host's operating system + +#. Clear out stale ansible-facts + + .. code:: console + + rm /etc/openstack_deploy/ansible-facts/reinstalled_host* + + (* because we're deleting all container facts for the host as well.) + +#. Execute the following: + + .. code:: console + + openstack-ansible setup-hosts.yml --limit localhost,reinstalled_host* + openstack-ansible setup-infrastructure.yml --limit localhost,reinstalled_host* + openstack-ansible setup-openstack.yml --limit localhost,reinstalled_host* + + (* because we need to include containers in the limit) + +.. note:: + + During this upgrade cycle it was noted that network nodes required a restart + to bring some tenant interfaces online after running setup-openstack. + Additionally, BGP speakers (used for IPv6) had to be re-initialised from the + command line. These steps were necessary before reinstalling further network + nodes to prevent HA Router interruptions.