stories: discuss various latency issues encountered

Change-Id: I12c73133c67c2ebb64217d6deeb51fd57ffc5d2c
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@industrialdiscipline.com>

========================================================
2023-10-06 - Sahid Orentino Ferdjaoui (Société Générale)
========================================================

This document discusses various latency issues encountered in our
production platform at Société Générale, which makes heavy use of
OpenStack. The platform is experiencing rapid growth, putting a
considerable load on the distributed services that OpenStack relies
upon.

After further investigation, a timeout issue was identified during the
process of virtual machine creation. The occurrence of this timeout
suggests that there are communication difficulties between the Nova
and Neutron services.

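The timeout in question is Nova's bounded wait for the network
readiness signal described below. On the compute nodes it is governed
by two options; the values shown here are the upstream defaults, for
reference only (verify them against your Nova release)::

   [DEFAULT]
   vif_plugging_is_fatal = true
   vif_plugging_timeout = 300
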
Under the Hood
--------------

During the virtual machine creation process, once the compute host has
been selected, Nova begins the build process on that host. This
involves several steps, including initiating the QEMU process and
creating a TAP interface on the host. At this point, Nova places the
virtual machine in a paused state, awaiting an event (or signal) from
Neutron indicating that the network for that particular virtual
machine is ready. Receiving this event from Neutron can be
time-consuming, due to the following sequence of events:

1. The Neutron agent on the compute host, responsible for building the
   network and informing Nova, sends a message via RPC to the Neutron
   server.
2. The Neutron server, in turn, relays this to the Nova API through a
   REST call.
3. Finally, the Nova API informs the relevant compute host of the
   event via another RPC message.

The whole process uses the `event callback API
<https://blueprints.launchpad.net/nova/+spec/admin-event-callback-api>`_
introduced in Icehouse, together with the related
``network-vif-plugged`` and ``network-vif-failed`` events.

Considering the difficulty identified for Neutron to inform Nova in
time, we decided to focus on reducing the activity placed on it.

Reducing Nova requests on Neutron
---------------------------------

For each virtual machine, Nova continuously requests that Neutron
refresh its networking cache. This mechanism was initially designed to
reduce the number of API requests that Nova makes to Neutron. However,
the default interval for this periodic refresh task is 60 seconds, and
in a heavily-loaded environment with thousands of virtual machines
this leads to a substantial aggregate number of cache refresh
requests.

::

   [DEFAULT]
   heal_instance_info_cache_interval = 600

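As a rough illustration of what the interval change buys, going from
60 to 600 seconds divides the steady-state refresh rate by ten; the VM
count below is an assumption, not a measured figure:

.. code-block:: python

   # Back-of-envelope estimate; 5000 VMs is assumed for illustration.
   vms = 5000
   per_hour_default = vms * 3600 // 60    # requests/hour at the 60 s default
   per_hour_tuned = vms * 3600 // 600     # requests/hour at 600 s
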
The networking driver in use is quite stable. According to discussions
within the community, it was safe to raise the refresh interval to 600
seconds; it would even have been possible to disable this feature
entirely.

.. note::

   When the Nova Compute service is restarted, the healing process is
   reset and starts over. This can be particularly problematic in
   environments where the Nova Compute service is restarted
   frequently.

Increasing RPC workers for Neutron
----------------------------------

We have also decided to significantly increase the value for
``rpc_workers``. Given that RPC operations are designed to be
I/O-bound, we considered that exceeding the number of available cores
on our hosts by a factor of two would be both conservative and safe.

::

   [DEFAULT]
   rpc_workers = 20

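The sizing rule described above can be expressed directly; treat it as
a starting point rather than a hard formula:

.. code-block:: python

   import os

   # Twice the available cores, as described above; the fallback to 1 is
   # a safety net for platforms where cpu_count() returns None.
   cores = os.cpu_count() or 1
   rpc_workers = 2 * cores
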
Increasing Neutron ``max_pool_size``
------------------------------------

We have made a deliberate change to extend the ``max_pool_size`` value
from 1 to 60 in the Neutron database settings. This adjustment is
logical, given that we have increased the number of workers and can
anticipate these workers making use of the database.

::

   [database]
   max_pool_size = 60

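As a quick cross-check between the two tunings (the per-worker
connection factor below is purely an assumed illustration, not a
measured figure):

.. code-block:: python

   # rpc_workers and max_pool_size come from the sections above;
   # connections_per_worker is an assumption for illustration only.
   rpc_workers = 20
   max_pool_size = 60
   connections_per_worker = 3

   # The pool should cover the workers' plausible concurrent demand.
   headroom = max_pool_size - rpc_workers * connections_per_worker
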
Deferring flows deletion
------------------------

We have observed that the Neutron agents in our OpenStack environment
experience delays when deleting network flows as part of the virtual
machine termination process. This operation is blocking in nature,
causing the agent to become unresponsive to any other tasks until the
flow deletion is completed.

We decided to deploy the change `OpenDev Review #843253
<https://review.opendev.org/c/openstack/neutron/+/843253>`_, which
aims to mitigate this issue. It offloads the flow deletion task to a
separate thread, freeing the main thread to continue with other
operations.

.. code-block:: diff

      # will not match with the ip flow's cookie so OVS won't actually
      # delete the flow
      flow['cookie'] = ovs_lib.COOKIE_ANY
    - self._delete_flows(**flow)
    + self._delete_flows(deferred=False, **flow)

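The pattern behind that change can be illustrated with a minimal,
self-contained sketch (this is not Neutron's actual code): a worker
thread drains a queue of deletions, so the caller enqueues and returns
immediately instead of blocking:

.. code-block:: python

   import queue
   import threading

   # Minimal illustration of deferring blocking work to a thread;
   # not Neutron's actual implementation.
   deleted = []
   tasks = queue.Queue()

   def _worker():
       while True:
           flow = tasks.get()
           if flow is None:          # sentinel: stop the worker
               break
           deleted.append(flow)      # stand-in for the blocking deletion
           tasks.task_done()

   worker = threading.Thread(target=_worker, daemon=True)
   worker.start()

   # The caller enqueues and returns immediately instead of blocking.
   for cookie in (1, 2, 3):
       tasks.put({"cookie": cookie})
   tasks.join()                      # wait only here, for the demo
   tasks.put(None)
   worker.join()
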
Improvements after deploying
----------------------------

Finally, after deploying those changes, we have noticed a considerable
improvement in stability and success rate for virtual machine
creation. The latency involved in creating virtual machines is now
stable, requiring only a reasonable amount of time to transition them
to an active state.
@ -23,3 +23,4 @@ Contents:
|
|||||||
:maxdepth: 1
|
:maxdepth: 1
|
||||||
|
|
||||||
2020-01-29
|
2020-01-29
|
||||||
|
2023-10-06
|
||||||
|
Loading…
Reference in New Issue
Block a user