diff --git a/doc/source/install/migration.rst b/doc/source/install/migration.rst
index 84064536b..92b7e8d81 100644
--- a/doc/source/install/migration.rst
+++ b/doc/source/install/migration.rst
@@ -3,163 +3,213 @@
 Migration Strategy
 ==================
 
-This document details an in-place migration strategy from ML2/OVS in either
-ovs-firewall, or ovs-hybrid mode in a TripleO OpenStack deployment.
+This document details an in-place migration strategy from ML2/OVS to ML2/OVN
+in either ovs-firewall or ovs-hybrid mode for a TripleO OpenStack deployment.
 
 For non TripleO deployments, please refer to the file ``migration/README.rst``
 and the ansible playbook ``migration/migrate-to-ovn.yml``.
 
 Overview
 --------
 
-The migration would be accomplished by following the steps:
+The migration process is orchestrated through the shell script
+ovn_migration.sh, which is provided with networking-ovn.
 
-a. Administrator steps:
+The administrator uses ovn_migration.sh to perform the readiness steps
+and the migration from the undercloud node.
+The readiness steps, such as host inventory generation and the DHCP and MTU
+adjustments, prepare the environment for the procedure.
 
-   * Updating to the latest openstack/neutron version
+Subsequent steps start the migration via Ansible.
 
-   * Reducing the DHCP T1 parameter on dhcp_agent.ini beforehand, which
-     is controlled by the dhcp_renewal_time of /etc/neutron/dhcp_agent.ini
+Plan for a 24-hour wait after the setup-mtu-t1 step to allow VMs to catch up
+with the new MTU size. The default neutron ML2/OVS configuration has a
+dhcp_lease_duration of 86400 seconds (24h).
 
-     Somewhere around 30 seconds would be enough (TODO: Data and calculations
-     to back this value with precise information).
+Also, if there are instances using static IP assignment, the administrator
+should be ready to update the MTU of those instances to the new value, which
+is 8 bytes less than the ML2/OVS (VXLAN) MTU value. For example, the typical
+1500 MTU underlay network that gives VXLAN tenant networks an MTU of 1450
+will give Geneve tenant networks an MTU of 1442. On the same underlay, a GRE
+encapsulated tenant network would use an MTU of 1458, which likewise becomes
+1442 under Geneve.
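+
+The arithmetic behind these example values can be checked from any shell.
+This is only an illustrative sketch, using the encapsulation overheads
+implied above (58 bytes for Geneve, 50 for VXLAN, 42 for GRE) on a 1500 byte
+underlay:
+
+.. code-block:: console
+
+   $ echo "geneve: $((1500 - 58))  vxlan: $((1500 - 50))  gre: $((1500 - 42))"
+   geneve: 1442  vxlan: 1450  gre: 1458
+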
-   * Waiting for at least dhcp_lease_duration (see /etc/neutron/neutron.conf
-     or /etc/neutron/dhcp_agent.ini) time (default is 86400 seconds =
-     24 hours), that way all instances will grab the new new lease renewal
-     time and start checking with the dhcp server periodically based on the
-     T1 parameter.
+If there are instances which use DHCP but don't support lease update during
+the T1 period, the administrator will need to reboot them to ensure that the
+MTU is updated inside those instances.
 
-   * Lowering the MTU of all VXLAN or GRE based networks down to
-     make sure geneve works (a tool will be provided for that). The mtu
-     must be set to "max_tunneling_network_mtu - ovn_geneve_overhead", that's
-     generally "1500 - ovn_geneve_overhead", unless your network and any
-     intermediate router hop between compute and network nodes is jumboframe
-     capable). ovn_geneve_overhead is 58 bytes. VXLAN overhead is 50 bytes. So
-     for the typical 1500 MTU tunneling network, we may need to assign 1442.
 
-b. Automated steps (via ansible)
+Steps for migration
+-------------------
 
-   * Create pre-migration resources (network and VM) to validate final
-     migration.
+Perform the following steps in the overcloud/undercloud
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-   * Update the overcloud stack (in the case of TripleO) to deploy OVN
-     alongside reference implementation services using a temporary bridge
-     "br-migration" instead of br-int.
+1. Ensure that you have updated to the latest openstack/neutron version.
 
-   * Start the migration process:
+Perform the following steps in the undercloud
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-     1. generate the OVN north db by running neutron-ovn-db-sync util
-     2. re-assign ovn-controller to br-int instead of br-migration
-     3. cleanup network namespaces (fip, snat, qrouter, qdhcp),
-     4. remove any unnecessary patch ports on br-int
-     5. remove br-tun and br-migration ovs bridges
-     6. delete qr-*, ha-* and qg-* ports from br-int
+1. Install python-networking-ovn-migration-tool.
 
-   * Delete neutron agents and neutron HA internal networks
+   .. code-block:: console
 
-   * Validate connectivity on pre-migration resources.
+      yum install python-networking-ovn-migration-tool
 
-   * Delete pre-migration resources.
+2. Create a working directory on the undercloud, and copy the ansible
+playbooks.
 
-   * Create post-migration resources.
+   .. code-block:: console
 
-   * Validate connectivity on post-migration resources.
+      mkdir ~/ovn_migration
+      cd ~/ovn_migration
+      cp -rfp /usr/share/ansible/networking-ovn-migration/playbooks .
 
-   * Cleanup post-migration resources.
 
-   * Re-run deployment tool to update OVN on br-int.
+3. Create or edit the ``overcloud-deploy-ovn.sh`` script in your ``$HOME``.
+This script must source your stackrc file, and then execute an ``openstack
+overcloud deploy`` with your original deployment parameters, plus
+the following environment files, added to the end of the command
+in the following order:
 
+   When your network topology is DVR and your compute nodes have connectivity
+   to the external network:
 
-Steps for migration
--------------------
-Carryout the below steps in the undercloud:
+   .. code-block:: console
 
-1. Create ``overcloud-deploy-ovn.sh`` script in /home/stack. Make sure the
-   below environment files are added in the order mentioned below
+      -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-dvr-ha.yaml \
+      -e $HOME/ovn-extras.yaml
+
+
+   When your compute nodes don't have external connectivity and you don't use
+   DVR:
 
    .. code-block:: console
 
-      -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
       -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
-      -e /home/stack/ovn-extras.yaml
+      -e $HOME/ovn-extras.yaml
+
 
-   If compute nodes have external connectivity, then you can use the
-   environment file - environments/services-docker/neutron-ovn-dvr-ha.yaml
+Make sure that all users have execution privileges on the script, because it
+will be called by ovn_migration.sh/ansible during the migration process.
+
+   .. code-block:: console
+
+      $ chmod a+x ~/overcloud-deploy-ovn.sh
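+
+A minimal sketch of what ``overcloud-deploy-ovn.sh`` could look like is shown
+below. It assumes a DVR topology and uses a placeholder for your original
+deployment parameters, so adapt it to your own deployment before using it:
+
+   .. code-block:: bash
+
+      #!/bin/bash
+      # Illustrative only: replace the placeholder with your original
+      # "openstack overcloud deploy" arguments.
+      source ~/stackrc
+      openstack overcloud deploy --templates \
+          <your original deployment parameters> \
+          -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-dvr-ha.yaml \
+          -e $HOME/ovn-extras.yaml
+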
-2. Check the script ``ovn_migration.sh`` and override the environment variables
-   if desired.
-   Below are the environment variables
+4. To configure the parameters of your migration, you can set the environment
+variables that will be used by ``ovn_migration.sh``. You can skip setting any
+values matching the defaults.
 
-   * IS_DVR_ENABLED - If the existing ML2/OVS has DVR enabled, set it to True.
-     Default value is False.
+   * STACKRC_FILE - must point to your stackrc file in your undercloud.
+     Default: ~/stackrc
 
-   * PUBLIC_NETWORK_NAME - Name of the public network. Default value is
-     'public'.
+   * OVERCLOUDRC_FILE - must point to your overcloudrc file in your
+     undercloud.
+     Default: ~/overcloudrc
+
+   * OVERCLOUD_OVN_DEPLOY_SCRIPT - must point to the script described in
+     step 3.
+     Default: ~/overcloud-deploy-ovn.sh
+
+   * PUBLIC_NETWORK_NAME - Name of your public network.
+     Default: 'public'.
+     To support migration validation, this network must have available
+     floating IPs, and those floating IPs must be pingable from the
+     undercloud. If that's not possible, please configure VALIDATE_MIGRATION
+     to False.
 
    * IMAGE_NAME - Name/ID of the glance image to use for booting a test server.
-     Default value is 'cirros'.
+     Default: 'cirros'.
+     It will be automatically downloaded during the pre-validation /
+     post-validation process.
 
    * VALIDATE_MIGRATION - Create migration resources to validate the
-     migration.
-     The migration script, before starting the migration, boots a server and
-     validates that the server is reachable after the migration.
-     Default value is True.
+     migration. The migration script, before starting the migration, boots a
+     server and validates that the server is reachable after the migration.
+     Default: True.
+
+   * SERVER_USER_NAME - User name to use for logging in to the migration
+     instances.
+     Default: 'cirros'.
 
-   * SERVER_USER_NAME - User name to use for logging to the migration server.
-     Default value is 'cirros'.
+   * DHCP_RENEWAL_TIME - DHCP renewal time in seconds to configure in the
+     DHCP agent configuration file.
+     Default: 30
 
-   * DHCP_RENEWAL_TIME - DHCP renewal time to configure in dhcp agent
-     configuration file. The default value is 30 seconds.
 
-2. Run ``./ovn_migration.sh generate-inventory`` to generate the inventory
-   file - hosts_for_migration. Please review this file for correctness and
-   modify it if desired.
+   .. warning::
+
+      Please note that VALIDATE_MIGRATION requires enough quota (2
+      available floating ips, 2 networks, 2 subnets, 2 instances,
+      and 2 routers as admin).
+
+   For example:
+
+   .. code-block:: console
+
+      $ export PUBLIC_NETWORK_NAME=my-public-network
+      $ ovn_migration.sh .........
+
+
+5. Run ``ovn_migration.sh generate-inventory`` to generate the inventory
+   file - ``hosts_for_migration`` and ``ansible.cfg``. Please review
+   ``hosts_for_migration`` for correctness.
+
+   .. code-block:: console
+
+      $ ovn_migration.sh generate-inventory
+
-4. Run ``./ovn_migration.sh setup-mtu-t1``. This lowers the T1 parameter
-   of the internal neutron DHCP servers configuring the 'dhcp_renewal_time' in
-   /var/lib/config-data/puppet-generated/neutron/etc/neutron/dhcp_agent.ini
+6. Run ``ovn_migration.sh setup-mtu-t1``. This lowers the T1 parameter
+   of the internal neutron DHCP servers configuring the ``dhcp_renewal_time``
+   in /var/lib/config-data/puppet-generated/neutron/etc/neutron/dhcp_agent.ini
    in all the nodes where DHCP agent is running.
 
-5. After the previous step we need to wait at least 24h before continuing
-   if you are using VXLAN or GRE tenant networking. This will allow VMs to
-   catch up with the new MTU size of the next step.
+   .. code-block:: console
 
-   .. warning::
+      $ ovn_migration.sh setup-mtu-t1
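+
+   If you want to double-check the effect on one of the nodes running the
+   DHCP agent, an optional verification could look like the following
+   (``overcloud-controller-0`` is only an example node name):
+
+   .. code-block:: console
+
+      [heat-admin@overcloud-controller-0 ~]$ sudo grep dhcp_renewal_time \
+          /var/lib/config-data/puppet-generated/neutron/etc/neutron/dhcp_agent.ini
+      dhcp_renewal_time = 30
+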
-      This step is very important, never skip it if you are using VXLAN
-      or GRE tenant networks. If you are using VLAN tenant networks you don't
-      need to wait.
 
-   .. warning::
+7. If you are using VXLAN or GRE tenant networking, ``wait at least 24 hours``
+before continuing. This will allow VMs to catch up with the new MTU size of
+the next step.
+
+   .. warning::
+
+      If you are using VXLAN or GRE networks, this 24-hour wait step is
+      critical. If you are using VLAN tenant networks you can proceed to the
+      next step without delay.
+
+   .. warning::
 
       If you have any instance with static IP assignment on VXLAN or
-      GRE tenant networks, you will need to manually modify the
-      configuration of those instances to configure the new geneve MTU,
-      which is current VXLAN MTU minus 8 bytes, that is 1442 when VXLAN
-      based MTU was 1450.
+      GRE tenant networks, you must manually modify the configuration of
+      those instances to configure the new Geneve MTU, which is the current
+      VXLAN MTU minus 8 bytes. For instance, if the VXLAN-based MTU was 1450,
+      change it to 1442. If your instances don't honor the T1 parameter of
+      DHCP, they will need to be rebooted.
 
-   .. note::
+   .. note::
 
-      24h is the time based on default configuration, it actually depends on
+      24 hours is the time based on default configuration. It actually depends on
       /var/lib/config-data/puppet-generated/neutron/etc/neutron/dhcp_agent.ini
       dhcp_renewal_time and
       /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf
       dhcp_lease_duration parameters. (defaults to 86400 seconds)
 
-   .. note::
+   .. note::
 
-      Please note that migrating a VLAN deployment is not recommended at
-      this time because of a bug in core ovn, full support is being worked
-      out here:
+      Please note that migrating a deployment which uses VLAN for tenant/project
+      networks is not recommended at this time because of a bug in core ovn,
+      full support is being worked out here:
       https://mail.openvswitch.org/pipermail/ovs-dev/2018-May/347594.html
 
-   One way of verifying that the T1 parameter has propated to existing VMs
-   is going to one of the compute nodes, and run tcpdump over one of the
-   VM taps attached to a tenant network, we should see that requests happen
-   around every 30 seconds.
-
-   .. code-block:: console
+   One way to verify that the T1 parameter has propagated to existing VMs
+   is to connect to one of the compute nodes, and run ``tcpdump`` over one
+   of the VM taps attached to a tenant network. If T1 propagation was
+   successful, you should see that requests happen on an interval of
+   approximately 30 seconds.
+
+   .. code-block:: console
 
       [heat-admin@overcloud-novacompute-0 ~]$ sudo tcpdump -i tap52e872c2-e6 port 67 or port 68 -n
       tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
      listening on tap52e872c2-e6, link-type EN10MB (Ethernet), capture size 262144 bytes
@@ -169,37 +219,121 @@ Carryout the below steps in the undercloud:
       13:17:56.241156 IP 192.168.99.5.bootpc > 192.168.99.3.bootps: BOOTP/DHCP, Request from fa:16:3e:6b:41:3d, length 300
       13:17:56.249899 IP 192.168.99.3.bootps > 192.168.99.5.bootpc: BOOTP/DHCP, Reply, length 355
 
-   .. note::
+   .. note::
 
-      This verification is not possible with cirros VMs, due to cirros
-      udhcpc implementation which won't obey DHCP option 58 (T1), if you have
-      any cirros based instances you will need to reboot them.
+      This verification is not possible with cirros VMs. The cirros
+      udhcpc implementation does not obey DHCP option 58 (T1). Please
+      try this verification on a port that belongs to a full linux VM.
+      We recommend that you check all the different types of workloads your
+      system runs (Windows, different flavors of Linux, etc.).
 
-6. Run ``./ovn_migration.sh reduce-mtu``. This lowers the MTU of the pre
-   migration VXLAN and GRE networks. You can skip this step if you use VLAN
-   tenant networks. It will be safe to execute in such case, because the
-   tool will ignore non-VXLAN/GRE networks.
+8. Run ``ovn_migration.sh reduce-mtu``.
+
+   This lowers the MTU of the pre-migration VXLAN and GRE networks. The tool
+   will ignore non-VXLAN/GRE networks, so if you use VLAN for tenant networks
+   it is fine if this step appears to do nothing.
+
+   .. code-block:: console
 
-7. Set the below tripleo heat template parameters to point to the proper
-   OVN docker images in appropriate environment file
+      $ ovn_migration.sh reduce-mtu
 
-   * DockerOvnControllerConfigImage
-   * DockerOvnControllerImage
-   * DockerOvnNorthdImage
-   * DockerNeutronApiImage
-   * DockerNeutronConfigImage
-   * DockerOvnDbsImage
-   * DockerOvnDbsConfigImage
 
-   This can be done running the next command:
+   This step will go network by network reducing the MTU, and tagging with
+   ``adapted_mtu`` the networks which have been already handled.
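+
+   To spot-check the result on one of your tenant networks, you can display
+   its MTU and tags from the overcloud (``private`` is only an example
+   network name):
+
+   .. code-block:: console
+
+      $ source ~/overcloudrc
+      $ openstack network show private -c mtu -c tags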
+
+
+9. Make TripleO ``prepare the new container images`` for OVN.
+
+   If your deployment didn't have a containers-prepare-parameter.yaml, you can
+   create one with:
+
+   .. code-block:: console
+
+      $ test -f $HOME/containers-prepare-parameter.yaml || \
+          openstack tripleo container image prepare default \
+          --output-env-file $HOME/containers-prepare-parameter.yaml
+
+
+   If you had to create the file, please make sure it's included at the end of
+   your $HOME/overcloud-deploy-ovn.sh and $HOME/overcloud-deploy.sh.
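+
+   For example, including it could be as simple as appending one more ``-e``
+   argument to the deploy command in both scripts (shown here only as an
+   illustration):
+
+   .. code-block:: console
+
+      -e $HOME/containers-prepare-parameter.yaml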
+
+   Change the neutron_driver in the containers-prepare-parameter.yaml file to
+   ovn:
+
+   .. code-block:: console
+
+      $ sed -i -E 's/neutron_driver:([ ]\w+)/neutron_driver: ovn/' $HOME/containers-prepare-parameter.yaml
+
+   You can verify with:
+
+   .. code-block:: console
+
+      $ grep neutron_driver containers-prepare-parameter.yaml
+      neutron_driver: ovn
+
+
+   Then update the images:
 
    .. code-block:: console
 
-   PREPARE_ARGS="-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-    -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml" \
-    ~/overcloud-prep-containers.sh
+      $ openstack tripleo container image prepare \
+          --environment-file /home/stack/containers-prepare-parameter.yaml
+
+   .. note::
+
+      It's important to provide the full path to your containers-prepare-parameter.yaml,
+      otherwise the command will finish very quickly and won't work (the current
+      version doesn't seem to output any error).
+
+
+   TripleO will validate the containers and push them to your local
+   registry.
+
+
+10. Run ``ovn_migration.sh start-migration`` to kick start the migration
+    process.
+
+    .. code-block:: console
+
+       $ ovn_migration.sh start-migration
+
+
+    Under the hood, this is what will happen:
+
+    * Create pre-migration resources (network and VM) to validate existing
+      deployment and final migration.
+
+    * Update the overcloud stack to deploy OVN alongside reference
+      implementation services using a temporary bridge "br-migration" instead
+      of br-int.
+
+    * Start the migration process:
+
+      1. generate the OVN north db by running neutron-ovn-db-sync util
+      2. clone the existing resources from br-int to br-migration, so that
+         OVN finds the same resource UUIDs over br-migration
+      3. re-assign ovn-controller to br-int instead of br-migration
+      4. cleanup network namespaces (fip, snat, qrouter, qdhcp)
+      5. remove any unnecessary patch ports on br-int
+      6. remove br-tun and br-migration ovs bridges
+      7. delete qr-*, ha-* and qg-* ports from br-int (via neutron netns
+         cleanup)
+
+    * Delete neutron agents and neutron HA internal networks from the database
+      via API.
+
+    * Validate connectivity on pre-migration resources.
+
+    * Delete pre-migration resources.
+
+    * Create post-migration resources.
+
+    * Validate connectivity on post-migration resources.
+
+    * Cleanup post-migration resources.
+
+    * Re-run deployment tool to update OVN on br-int.
 
-8. Run ``./ovn_migration.sh start-migration`` to kick start the migration
-   process.
 
 Migration is complete !!!
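+
+As an optional post-migration sanity check (not performed by the migration
+script itself), you can source your overcloudrc on the undercloud and list
+the networking agents; with ML2/OVN the ML2/OVS agents should be gone and
+OVN controller agents should be reported instead (the exact agent types
+listed depend on your release):
+
+.. code-block:: console
+
+   $ source ~/overcloudrc
+   $ openstack network agent list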