From b589f7fa2ae49af07ef572df842c77367485577e Mon Sep 17 00:00:00 2001
From: Sahid Orentino Ferdjaoui
Date: Fri, 6 Oct 2023 11:41:03 +0200
Subject: [PATCH] stories: discuss various latency issues encountered

Change-Id: I12c73133c67c2ebb64217d6deeb51fd57ffc5d2c
Signed-off-by: Sahid Orentino Ferdjaoui
---
 doc/source/stories/2023-10-06.rst | 118 ++++++++++++++++++++++++++++++
 doc/source/stories/index.rst      |   1 +
 2 files changed, 119 insertions(+)
 create mode 100644 doc/source/stories/2023-10-06.rst

diff --git a/doc/source/stories/2023-10-06.rst b/doc/source/stories/2023-10-06.rst
new file mode 100644
index 0000000..8850610
--- /dev/null
+++ b/doc/source/stories/2023-10-06.rst
@@ -0,0 +1,118 @@
========================================================
2023-10-06 - Sahid Orentino Ferdjaoui (Société Générale)
========================================================

This document discusses various latency issues encountered in our
production platform at Société Générale, which makes heavy use of
OpenStack. The platform is experiencing rapid growth, putting
considerable load on the distributed services that OpenStack relies
upon.

After thorough investigation, a timeout was identified during virtual
machine creation. The occurrence of this timeout suggests
communication difficulties between the Nova and Neutron services.

Under the Hood
--------------

During virtual machine creation, once a compute host has been
selected, Nova begins the build process on that host. This involves
several steps, including starting the QEMU process and creating a TAP
interface on the host. At this point, Nova places the virtual machine
in a paused state, waiting for an event (or signal) from Neutron
indicating that the network for that particular virtual machine is
ready. Receiving this event from Neutron can be time-consuming, due
to the following sequence of steps:

1. The Neutron agent on the compute host, responsible for building
   the network and informing Nova, sends a message via RPC to the
   Neutron server.
2. The Neutron server, in turn, relays this to the Nova API through a
   REST call.
3. Finally, the Nova API informs the relevant compute host of the
   event via another RPC message.

This whole process relies on the `event callback API`_ introduced in
Icehouse and on the related events `network-vif-plugged` and
`network-vif-failed`.

Given the difficulty Neutron has informing Nova in time, we decided
to focus on reducing the amount of work we send its way.

Reducing Nova requests on Neutron
---------------------------------

For each virtual machine, Nova periodically asks Neutron to refresh
its networking cache. This mechanism is designed to reduce the number
of API requests that Nova makes to Neutron. However, the default
interval for this periodic refresh task is 60 seconds, and in a
heavily-loaded environment with thousands of virtual machines this
produces a substantial aggregate stream of cache-refresh requests.

::

    [DEFAULT]
    heal_instance_info_cache_interval = 600

The networking driver in use is quite stable. According to
discussions within the community, it was safe to raise the refresh
interval to 600 seconds; it would even have been possible to disable
this feature entirely.
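To put the effect of this change in perspective, here is a minimal
back-of-the-envelope sketch. The fleet size below is a hypothetical
figure chosen for illustration; the 60-second default and the
600-second value come from the discussion above.

.. code-block:: python

    # Estimate the steady-state cache-refresh load on neutron-server.
    # FLEET_SIZE is a hypothetical figure, not a measurement from our
    # platform.

    def refresh_rate(num_instances: int, interval_s: int) -> float:
        """Average cache-refresh requests per second across the fleet."""
        return num_instances / interval_s

    FLEET_SIZE = 5000  # hypothetical number of virtual machines

    print(f"60s default : ~{refresh_rate(FLEET_SIZE, 60):.0f} req/s")
    print(f"600s setting: ~{refresh_rate(FLEET_SIZE, 600):.1f} req/s")
    # Raising the interval from 60 to 600 seconds divides this
    # background traffic by ten, whatever the fleet size.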
Increasing RPC workers for Neutron
----------------------------------

We also decided to significantly increase the value of `rpc_workers`.
Given that RPC operations are largely I/O-bound, we considered that
exceeding the number of available cores on our hosts by a factor of
two would be both conservative and safe.

::

    [DEFAULT]
    rpc_workers = 20

Extending the connection timeout for Neutron agents to the servers
-------------------------------------------------------------------

An infrequently used option, `of_inactivity_probe`, has also been
increased from 10 to 60 seconds. Based on our investigations, this
adjustment should reduce how often an agent is considered orphaned
from the Neutron server's perspective, especially in a heavily-loaded
environment.

Deferring flow deletion
-----------------------

We have observed that the Neutron agents in our OpenStack environment
experience delays when deleting network flows as part of the virtual
machine termination process. This operation is blocking, leaving the
agent unresponsive to any other task until the flow deletion
completes.

We decided to deploy the change `OpenDev Review #843253
<https://review.opendev.org/c/openstack/neutron/+/843253>`_, which
aims to mitigate this issue. The change offloads the flow deletion
task to a separate thread, freeing the main thread to continue with
other operations.

.. code-block:: diff

         # will not match with the ip flow's cookie so OVS won't
         # actually delete the flow
         flow['cookie'] = ovs_lib.COOKIE_ANY

    +    self._delete_flows(deferred=False, **flow)
    -    self._delete_flows(**flow)

Improvements after deployment
-----------------------------

Finally, after deploying these changes, we have noticed a
considerable improvement in stability and in the success rate of
virtual machine creation. The latency involved in creating virtual
machines is now stable, and they transition to the active state
within a reasonable amount of time.

diff --git a/doc/source/stories/index.rst b/doc/source/stories/index.rst
index 9ce08f0..3c297b8 100644
--- a/doc/source/stories/index.rst
+++ b/doc/source/stories/index.rst
@@ -23,3 +23,4 @@ Contents:
    :maxdepth: 1

    2020-01-29
+   2023-10-06