stories: discuss various latency issues encountered

Change-Id: I12c73133c67c2ebb64217d6deeb51fd57ffc5d2c
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@industrialdiscipline.com>

========================================================
2023-10-06 - Sahid Orentino Ferdjaoui (Société Générale)
========================================================
This document discusses various latency issues encountered in our
production platform at Société Générale, which makes heavy use of
OpenStack. The platform is experiencing rapid growth, putting a
considerable load on the distributed services that OpenStack relies
upon.

After investigation, a timeout issue was identified during the
process of virtual machine creation. The occurrence of this timeout
suggests that there are communication difficulties between the Nova
and Neutron services.

Under the Hood
--------------
During the virtual machine creation process, once the compute host has
been selected, Nova begins the build process on that host. This
involves several steps, including initiating the QEMU process and
creating a TAP interface on the host. At this point, Nova places the
virtual machine in a paused state, awaiting an event (or signal) from
Neutron to indicate that the network for that particular virtual
machine is ready. The process to receive this event from Neutron can
be time-consuming, due to the following sequence of events:
1. The Neutron agent on the compute host, which is responsible for
   building the network and informing Nova, sends a message via RPC
   to the Neutron server.

2. The Neutron server, in turn, relays this to the Nova API through
   a REST call.

3. Finally, the Nova API informs the relevant compute host of the
   event via another RPC message.

This whole process relies on the `event callback API
<https://blueprints.launchpad.net/nova/+spec/admin-event-callback-api>`_
introduced in Icehouse, and on the events ``network-vif-plugged`` and
``network-vif-failed``.
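
To make the waiting step more concrete, here is a minimal,
self-contained sketch of the pattern, not Nova's actual
implementation: the build pauses and waits, with a timeout, for a
``network-vif-plugged`` notification delivered by another thread. The
timeout constant is only modelled on the ``vif_plugging_timeout``
knob Nova exposes for this purpose.

.. code-block:: python

   import threading

   # Illustrative timeout, modelled on [DEFAULT]/vif_plugging_timeout.
   VIF_PLUGGING_TIMEOUT = 300

   class VifPlugWaiter:
       """Toy model of 'pause the build until Neutron says the VIF is ready'."""

       def __init__(self):
           self._plugged = threading.Event()

       def on_external_event(self, event_name):
           # Called when 'network-vif-plugged' finally arrives from Neutron
           # (agent -> neutron-server -> nova-api -> RPC to the compute host).
           if event_name == 'network-vif-plugged':
               self._plugged.set()

       def wait_for_vif(self):
           # The instance stays paused until the event arrives or we time out.
           if not self._plugged.wait(timeout=VIF_PLUGGING_TIMEOUT):
               raise TimeoutError('network-vif-plugged not received in time')

   waiter = VifPlugWaiter()
   # In production the event arrives asynchronously; simulate it here.
   threading.Timer(0.1, waiter.on_external_event,
                   args=('network-vif-plugged',)).start()
   waiter.wait_for_vif()
   print('network ready, resuming instance build')

Every hop in the chain above adds latency before that wait completes,
which is why the sections below focus on shortening it.
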
Considering the difficulty identified for Neutron to inform Nova in
time, we decided to focus on reducing the load we put on Neutron.

Reducing Nova requests on Neutron
---------------------------------
For each virtual machine, Nova periodically asks Neutron to refresh
the instance's networking info cache. This cache exists to reduce the
number of API requests that Nova makes to Neutron, but the periodic
refresh task runs every 60 seconds by default; in a heavily loaded
environment with thousands of virtual machines, this translates into
a substantial number of refresh requests hitting Neutron.

::

   [DEFAULT]
   heal_instance_info_cache_interval = 600

The networking driver in use is quite stable. According to
discussions within the community, it is safe to adjust the refresh
interval to 600 seconds, and it would even be possible to disable
this feature entirely.
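
As a rough back-of-the-envelope illustration of the effect (the fleet
size below is hypothetical, not our actual figure, and it keeps the
simplification used above of one refresh request per instance per
interval):

.. code-block:: python

   # Hypothetical fleet size, for illustration only.
   instances = 10000

   for interval in (60, 600):
       rate = instances / interval
       print(f"interval={interval}s -> ~{rate:.0f} cache refresh "
             f"requests per second across the fleet")

   # interval=60s  -> ~167 requests per second
   # interval=600s -> ~17 requests per second
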
.. note::

   When you restart the Nova Compute service, the healing process is
   reset and starts over. This can be particularly problematic in
   environments where the Nova Compute service is restarted
   frequently.

Increasing RPC workers for Neutron
----------------------------------
We have also decided to significantly increase the value of
`rpc_workers`. Given that RPC operations are mostly I/O-bound, we
considered that exceeding the number of available cores on our hosts
by a factor of two would be both conservative and safe.

::

   [DEFAULT]
   rpc_workers = 20
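
If you prefer to keep that sizing rule explicit in deployment
tooling, it can be derived from the host's core count; the snippet
below is only a sketch of the heuristic described above (it yields 20
on a 10-core host):

.. code-block:: python

   import os

   # Heuristic: RPC work is mostly I/O-bound, so allow roughly twice
   # as many RPC workers as there are cores on the host.
   cores = os.cpu_count() or 1
   rpc_workers = 2 * cores
   print(f"[DEFAULT]\nrpc_workers = {rpc_workers}")
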
Increasing Neutron `max_pool_size`
----------------------------------
We have made a deliberate change to extend the `max_pool_size` value
from 1 to 60 in the Neutron database settings. This adjustment is
logical, given that we have increased the number of workers and can
anticipate these workers making use of the database.

::

   [database]
   max_pool_size = 60
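
One point worth double-checking when raising this value: every worker
process keeps its own connection pool, so the database server must be
able to accept the worst-case product. The sketch below uses assumed
worker counts and an assumed ``max_overflow``, not our exact
topology:

.. code-block:: python

   rpc_workers = 20
   api_workers = 8      # assumed value, for illustration only
   max_pool_size = 60
   max_overflow = 50    # assumed value; check your [database] settings

   workers = rpc_workers + api_workers
   max_connections = workers * (max_pool_size + max_overflow)
   print(f"worst case: ~{max_connections} connections per neutron-server host")
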
Deferring flows deletion
------------------------
We have observed that the Neutron agents in our OpenStack environment
experience delays when deleting network flows as part of the virtual
machine termination process. This operation is blocking in nature,
causing the agent to become unresponsive to any other tasks until the
flow deletion is completed.
We decided to deploy the change `OpenDev Review #843253
<https://review.opendev.org/c/openstack/neutron/+/843253>`_, which
aims to mitigate this issue. The change offloads the flow deletion
task to a separate thread, freeing the main thread to continue with
other operations.

.. code-block:: diff

   # will not match with the ip flow's cookie so OVS won't actually
   # delete the flow
   flow['cookie'] = ovs_lib.COOKIE_ANY
   -self._delete_flows(**flow)
   +self._delete_flows(deferred=False, **flow)
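
For readers unfamiliar with the pattern, the sketch below illustrates
the general idea of deferring a blocking cleanup to a worker thread
so the agent's main loop stays responsive. It is a simplified
illustration, not the Neutron implementation from the review above:

.. code-block:: python

   import queue
   import threading

   def delete_flow(flow):
       # Simplified stand-in for the blocking OVS flow deletion call.
       print(f"deleting flow cookie={flow.get('cookie')}")

   _pending = queue.Queue()

   def _deletion_worker():
       # Background thread: drain the queue and perform the slow
       # deletions without blocking the agent's main processing loop.
       while True:
           flow = _pending.get()
           if flow is None:          # sentinel used to stop the worker
               break
           delete_flow(flow)
           _pending.task_done()

   threading.Thread(target=_deletion_worker, daemon=True).start()

   def delete_flows(deferred=True, **flow):
       # deferred=True returns immediately; deferred=False keeps the
       # old synchronous behaviour.
       if deferred:
           _pending.put(flow)
       else:
           delete_flow(flow)

   delete_flows(cookie=0x1234)                   # queued for the worker
   delete_flows(deferred=False, cookie=0x5678)   # synchronous path
   _pending.join()
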
Improvements after deployment
-----------------------------

After deploying these changes, we noticed a considerable improvement
in the stability and success rate of virtual machine creation. The
latency involved in creating virtual machines is now stable, and
instances transition to an active state within a reasonable amount of
time.
