stories: discuss various latency issues encountered

Change-Id: I12c73133c67c2ebb64217d6deeb51fd57ffc5d2c
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@industrialdiscipline.com>

@@ -0,0 +1,118 @@

========================================================
2023-10-06 - Sahid Orentino Ferdjaoui (Société Générale)
========================================================

This document discusses various latency issues encountered in our
production platform at Société Générale, which makes heavy use of
OpenStack. The platform is experiencing rapid growth, putting
considerable load on the distributed services that OpenStack relies
upon.

After thorough investigation, a timeout issue was identified during
virtual machine creation. The occurrence of this timeout suggests
communication difficulties between the Nova and Neutron services.

Under the Hood
--------------

During the virtual machine creation process, once a compute host has
been selected, Nova begins the build process on that host. This
involves several steps, including starting the QEMU process and
creating a TAP interface on the host. At this point, Nova places the
virtual machine in a paused state, awaiting an event (or signal) from
Neutron indicating that the network for that particular virtual
machine is ready. Receiving this event from Neutron can be
time-consuming, due to the following sequence of events:

1. The Neutron agent on the compute host, responsible for wiring up
   the network and informing Nova, sends an RPC message to the
   Neutron server.
2. The Neutron server, in turn, relays the information to the Nova
   API over REST.
3. Finally, the Nova API informs the relevant compute host of the
   event via another RPC message.

This whole process relies on the `event callback API
<https://blueprints.launchpad.net/nova/+spec/admin-event-callback-api>`_
introduced in Icehouse, together with the related events
`network-vif-plugged` and `network-vif-failed`.
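
To make the wait concrete, here is a minimal, self-contained sketch of
the pattern; every name in it is hypothetical, and Nova's real
implementation goes through its external events API rather than this
toy class:

.. code-block:: python

   import threading

   class VifPluggedWaiter:
       """Illustrative only: the build thread registers interest in an
       instance's network-vif-plugged event, then blocks until the
       notification completes the RPC -> REST -> RPC round trip."""

       def __init__(self):
           self._events = {}
           self._lock = threading.Lock()

       def expect(self, instance_id):
           # Called by the build thread before pausing the guest.
           with self._lock:
               self._events[instance_id] = threading.Event()

       def notify(self, instance_id):
           # Called when Nova receives network-vif-plugged from Neutron.
           with self._lock:
               event = self._events.get(instance_id)
           if event:
               event.set()

       def wait(self, instance_id, timeout=300):
           # If the three-hop notification is slower than the timeout,
           # the build fails, which is exactly the symptom we observed.
           with self._lock:
               event = self._events[instance_id]
           if not event.wait(timeout):
               raise TimeoutError(
                   "network-vif-plugged not received for %s" % instance_id)

In Nova itself, the length of this wait is governed by the
`vif_plugging_timeout` option, and `vif_plugging_is_fatal` decides
whether an expired wait aborts the build.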

Given the difficulty identified for Neutron to inform Nova in time, we
decided to focus on reducing the amount of activity we put on it.

Reducing Nova requests on Neutron
---------------------------------

For each virtual machine, Nova periodically asks Neutron to refresh
its networking cache. This mechanism was originally designed to reduce
the number of API requests that Nova makes to Neutron. However, the
default interval for this periodic task is 60 seconds, and in a
heavily loaded environment with thousands of virtual machines, this
adds up to a substantial number of cache refresh requests.

::

   [DEFAULT]
   heal_instance_info_cache_interval = 600

The networking driver in use is quite stable. According to discussions
within the community, it was safe to raise the refresh interval to 600
seconds; it would even have been possible to disable this feature
entirely.
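
As a rough mental model of why the interval matters, here is a sketch,
under our assumptions rather than Nova's actual periodic-task
machinery, of a loop that refreshes every cached entry each tick; the
load placed on Neutron shrinks in direct proportion to the interval:

.. code-block:: python

   import time

   def heal_info_caches(instances, fetch_network_info, interval=600):
       """Illustrative loop: refresh each instance's cached network
       information every `interval` seconds.

       `fetch_network_info` stands in for the Neutron API call; each
       pass issues one request per instance, so lengthening the
       interval from 60 to 600 seconds divides this task's load on
       Neutron by ten.
       """
       while True:
           for instance_id in list(instances):
               instances[instance_id] = fetch_network_info(instance_id)
           time.sleep(interval)
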
Increasing RPC workers for Neutron
----------------------------------

We also decided to significantly increase the value of `rpc_workers`.
Given that RPC operations are designed to be I/O-bound, we considered
that twice the number of available cores on our hosts would be both
conservative and safe.

::

   [DEFAULT]
   rpc_workers = 20
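
The factor-of-two heuristic is easy to express in code; the sketch
below is ours, not anything Neutron ships, and our deployed value of
20 corresponds to 10-core hosts:

.. code-block:: python

   import os

   # Conservative sizing for I/O-bound RPC workers: twice the core
   # count. A 10-core host yields rpc_workers = 20.
   rpc_workers = 2 * (os.cpu_count() or 1)
   print(f"rpc_workers = {rpc_workers}")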

Extending the connection timeout between Neutron agents and the servers
------------------------------------------------------------------------

A rarely used option, `of_inactivity_probe`, has also been increased
from 10 to 60 seconds. Based on our investigations, this adjustment is
expected to reduce the frequency with which an agent is considered
orphaned from the Neutron server's perspective, especially in a
heavily loaded environment.
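
For reference, a sketch of the corresponding setting; we assume it
belongs in the `[OVS]` section of the OVS agent configuration (e.g.
`openvswitch_agent.ini`), so adjust to your deployment:

::

   [OVS]
   of_inactivity_probe = 60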

Deferring flow deletion
-----------------------

We have observed that the Neutron agents in our OpenStack environment
experience delays when deleting network flows as part of the virtual
machine termination process. This operation is blocking in nature,
causing the agent to become unresponsive to any other task until the
flow deletion completes.

We decided to deploy the change `OpenDev Review #843253
<https://review.opendev.org/c/openstack/neutron/+/843253>`_, which
aims to mitigate this issue. The change offloads the flow deletion
task to a separate thread, freeing the main thread to continue with
other operations.

.. code-block:: diff

    # will not match with the ip flow's cookie so OVS won't actually
    # delete the flow
    flow['cookie'] = ovs_lib.COOKIE_ANY
   -self._delete_flows(**flow)
   +self._delete_flows(deferred=False, **flow)
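
The offloading pattern behind the change can be sketched as follows;
this is a minimal illustration of the idea rather than the actual
Neutron code, and every name in it is hypothetical:

.. code-block:: python

   import queue
   import threading

   class DeferredFlowDeleter:
       """Queue flow-deletion requests and let a background worker
       drain them, so the agent's main loop never blocks on OVS."""

       def __init__(self, delete_flows):
           # delete_flows stands in for the blocking OVS call.
           self._delete_flows = delete_flows
           self._queue = queue.Queue()
           threading.Thread(target=self._drain, daemon=True).start()

       def delete(self, **flow):
           # Called from the main loop: returns immediately.
           self._queue.put(flow)

       def _drain(self):
           while True:
               flow = self._queue.get()
               try:
                   self._delete_flows(**flow)  # the slow, blocking part
               finally:
                   self._queue.task_done()

The design choice is a classic producer/consumer split: the main loop
only pays the cost of a queue insertion, while the blocking OVS call
runs on the worker thread.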

Improvements after deployment
-----------------------------

Finally, after deploying these changes, we noticed a considerable
improvement in stability and in the success rate of virtual machine
creation. Virtual machine creation latency is now stable, and
instances transition to an active state within a reasonable amount of
time.

@@ -23,3 +23,4 @@ Contents:
   :maxdepth: 1

   2020-01-29
   2023-10-06