stories: discuss various latency issues encountered

Change-Id: I12c73133c67c2ebb64217d6deeb51fd57ffc5d2c
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@industrialdiscipline.com>

========================================================
2023-10-06 - Sahid Orentino Ferdjaoui (Société Générale)
========================================================
This document discusses various latency issues encountered in our
production platform at Société Générale, which makes heavy use of
OpenStack. The platform is experiencing rapid growth, putting a
considerable load on the distributed services that OpenStack relies
upon.

After investigation, a timeout issue was identified during the
process of virtual machine creation. The occurrence of this timeout
suggests that there are communication difficulties between the Nova
and Neutron services.

Under the Hood
--------------
During the virtual machine creation process, once the compute host has
been selected, Nova begins the build process on that host. This
involves several steps, including initiating the QEMU process and
creating a TAP interface on the host. At this point, Nova places the
virtual machine in a paused state, awaiting an event (or signal) from
Neutron to indicate that the network for that particular virtual
machine is ready. The process to receive this event from Neutron can
be time-consuming, due to the following sequence of events:
1. The Neutron agent on the compute host, which is responsible for
   building the network and informing Nova, sends a message via RPC
   to the Neutron server.

2. The Neutron server, in turn, relays this to the Nova API through
   a REST call.

3. Finally, the Nova API informs the relevant compute host of the
   event via another RPC message.

This whole process relies on the `event callback API
<https://blueprints.launchpad.net/nova/+spec/admin-event-callback-api>`_
introduced in Icehouse, and on the events ``network-vif-plugged`` and
``network-vif-failed``.
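
To make the waiting step more concrete, here is a minimal,
self-contained sketch of the pattern, not Nova's actual
implementation: the build pauses and waits, with a timeout, for a
``network-vif-plugged`` notification delivered by another thread. The
timeout constant is only modelled on the ``vif_plugging_timeout``
knob Nova exposes for this purpose.

.. code-block:: python

   import threading

   # Illustrative timeout, modelled on [DEFAULT]/vif_plugging_timeout.
   VIF_PLUGGING_TIMEOUT = 300

   class VifPlugWaiter:
       """Toy model of 'pause the build until Neutron says the VIF is ready'."""

       def __init__(self):
           self._plugged = threading.Event()

       def on_external_event(self, event_name):
           # Called when 'network-vif-plugged' finally arrives from Neutron
           # (agent -> neutron-server -> nova-api -> RPC to the compute host).
           if event_name == 'network-vif-plugged':
               self._plugged.set()

       def wait_for_vif(self):
           # The instance stays paused until the event arrives or we time out.
           if not self._plugged.wait(timeout=VIF_PLUGGING_TIMEOUT):
               raise TimeoutError('network-vif-plugged not received in time')

   waiter = VifPlugWaiter()
   # In production the event arrives asynchronously; simulate it here.
   threading.Timer(0.1, waiter.on_external_event,
                   args=('network-vif-plugged',)).start()
   waiter.wait_for_vif()
   print('network ready, resuming instance build')

Every hop in the chain above adds latency before that wait completes,
which is why the sections below focus on shortening it.
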
Considering the difficulty identified for Neutron to inform Nova in
time, we decided to focus on reducing the load we put on Neutron.

Reducing Nova requests on Neutron
---------------------------------
For each virtual machine, Nova periodically asks Neutron to refresh
the instance's networking info cache. This cache exists to reduce the
number of API requests that Nova makes to Neutron, but the periodic
refresh task runs every 60 seconds by default; in a heavily loaded
environment with thousands of virtual machines, this translates into
a substantial number of refresh requests hitting Neutron.

::

   [DEFAULT]
   heal_instance_info_cache_interval = 600

The networking driver in use is quite stable. According to
discussions within the community, it is safe to adjust the refresh
interval to 600 seconds, and it would even be possible to disable
this feature entirely.
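
As a rough back-of-the-envelope illustration of the effect (the fleet
size below is hypothetical, not our actual figure, and it keeps the
simplification used above of one refresh request per instance per
interval):

.. code-block:: python

   # Hypothetical fleet size, for illustration only.
   instances = 10000

   for interval in (60, 600):
       rate = instances / interval
       print(f"interval={interval}s -> ~{rate:.0f} cache refresh "
             f"requests per second across the fleet")

   # interval=60s  -> ~167 requests per second
   # interval=600s -> ~17 requests per second
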
.. note::

   When you restart the Nova Compute service, the healing process is
   reset and starts over. This can be particularly problematic in
   environments where the Nova Compute service is restarted
   frequently.

Increasing RPC workers for Neutron
----------------------------------
We have also decided to significantly increase the value of
`rpc_workers`. Given that RPC operations are mostly I/O-bound, we
considered that exceeding the number of available cores on our hosts
by a factor of two would be both conservative and safe.

::

   [DEFAULT]
   rpc_workers = 20
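
If you prefer to keep that sizing rule explicit in deployment
tooling, it can be derived from the host's core count; the snippet
below is only a sketch of the heuristic described above (it yields 20
on a 10-core host):

.. code-block:: python

   import os

   # Heuristic: RPC work is mostly I/O-bound, so allow roughly twice
   # as many RPC workers as there are cores on the host.
   cores = os.cpu_count() or 1
   rpc_workers = 2 * cores
   print(f"[DEFAULT]\nrpc_workers = {rpc_workers}")
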
Increasing Neutron `max_pool_size`
----------------------------------
We have made a deliberate change to extend the `max_pool_size` value
from 1 to 60 in the Neutron database settings. This adjustment is
logical, given that we have increased the number of workers and can
anticipate these workers making use of the database.

::

   [database]
   max_pool_size = 60
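
One point worth double-checking when raising this value: every worker
process keeps its own connection pool, so the database server must be
able to accept the worst-case product. The sketch below uses assumed
worker counts and an assumed ``max_overflow``, not our exact
topology:

.. code-block:: python

   rpc_workers = 20
   api_workers = 8      # assumed value, for illustration only
   max_pool_size = 60
   max_overflow = 50    # assumed value; check your [database] settings

   workers = rpc_workers + api_workers
   max_connections = workers * (max_pool_size + max_overflow)
   print(f"worst case: ~{max_connections} connections per neutron-server host")
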
Deferring flows deletion
------------------------
We have observed that the Neutron agents in our OpenStack environment
experience delays when deleting network flows as part of the virtual
machine termination process. This operation is blocking in nature,
causing the agent to become unresponsive to any other tasks until the
flow deletion is completed.
We decided to deploy the change `OpenDev Review #843253
<https://review.opendev.org/c/openstack/neutron/+/843253>`_, which
aims to mitigate this issue. The change offloads the flow deletion
task to a separate thread, freeing the main thread to continue with
other operations.

.. code-block:: diff

   # will not match with the ip flow's cookie so OVS won't actually
   # delete the flow
   flow['cookie'] = ovs_lib.COOKIE_ANY
   -self._delete_flows(**flow)
   +self._delete_flows(deferred=False, **flow)
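
For readers unfamiliar with the pattern, the sketch below illustrates
the general idea of deferring a blocking cleanup to a worker thread
so the agent's main loop stays responsive. It is a simplified
illustration, not the Neutron implementation from the review above:

.. code-block:: python

   import queue
   import threading

   def delete_flow(flow):
       # Simplified stand-in for the blocking OVS flow deletion call.
       print(f"deleting flow cookie={flow.get('cookie')}")

   _pending = queue.Queue()

   def _deletion_worker():
       # Background thread: drain the queue and perform the slow
       # deletions without blocking the agent's main processing loop.
       while True:
           flow = _pending.get()
           if flow is None:          # sentinel used to stop the worker
               break
           delete_flow(flow)
           _pending.task_done()

   threading.Thread(target=_deletion_worker, daemon=True).start()

   def delete_flows(deferred=True, **flow):
       # deferred=True returns immediately; deferred=False keeps the
       # old synchronous behaviour.
       if deferred:
           _pending.put(flow)
       else:
           delete_flow(flow)

   delete_flows(cookie=0x1234)                   # queued for the worker
   delete_flows(deferred=False, cookie=0x5678)   # synchronous path
   _pending.join()
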
Improvements after deployment
-----------------------------

After deploying these changes, we noticed a considerable improvement
in the stability and success rate of virtual machine creation. The
latency involved in creating virtual machines is now stable, and
instances transition to an active state within a reasonable amount of
time.
