stories: discuss various latency issues encountered

Change-Id: I12c73133c67c2ebb64217d6deeb51fd57ffc5d2c
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@industrialdiscipline.com>

========================================================
2023-10-06 - Sahid Orentino Ferdjaoui (Société Générale)
========================================================

This document discusses various latency issues encountered in our
production platform at Société Générale, which makes heavy use of
OpenStack. The platform is experiencing rapid growth, putting a
considerable load on the distributed services that OpenStack relies
upon.

After further investigation, a timeout issue was identified during the
process of virtual machine creation. The occurrence of this timeout
suggests that there are communication difficulties between the Nova
and Neutron services.

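The timeout in question is Nova's bounded wait for the network
readiness signal described below. On the compute nodes it is governed
by two options; the values shown here are the upstream defaults, for
reference only (verify them against your Nova release)::

   [DEFAULT]
   vif_plugging_is_fatal = true
   vif_plugging_timeout = 300
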
Under the Hood
--------------

During the virtual machine creation process, once the compute host has
been selected, Nova begins the build process on that host. This
involves several steps, including initiating the QEMU process and
creating a TAP interface on the host. At this point, Nova places the
virtual machine in a paused state, awaiting an event (or signal) from
Neutron indicating that the network for that particular virtual
machine is ready. Receiving this event from Neutron can be
time-consuming, due to the following sequence of events:

1. The Neutron agent on the compute host, responsible for building the
   network and informing Nova, sends a message via RPC to the Neutron
   server.
2. The Neutron server, in turn, relays this to the Nova API through a
   REST call.
3. Finally, the Nova API informs the relevant compute host of the
   event via another RPC message.

The whole process uses the `event callback API
<https://blueprints.launchpad.net/nova/+spec/admin-event-callback-api>`_
introduced in Icehouse, together with the related
``network-vif-plugged`` and ``network-vif-failed`` events.

Considering the difficulty identified for Neutron to inform Nova in
time, we decided to focus on reducing the activity placed on it.

Reducing Nova requests on Neutron
---------------------------------

For each virtual machine, Nova continuously requests that Neutron
refresh its networking cache. This mechanism was initially designed to
reduce the number of API requests that Nova makes to Neutron. However,
the default interval for this periodic refresh task is 60 seconds, and
in a heavily-loaded environment with thousands of virtual machines
this leads to a substantial aggregate number of cache refresh
requests.

::

   [DEFAULT]
   heal_instance_info_cache_interval = 600

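As a rough illustration of what the interval change buys, going from
60 to 600 seconds divides the steady-state refresh rate by ten; the VM
count below is an assumption, not a measured figure:

.. code-block:: python

   # Back-of-envelope estimate; 5000 VMs is assumed for illustration.
   vms = 5000
   per_hour_default = vms * 3600 // 60    # requests/hour at the 60 s default
   per_hour_tuned = vms * 3600 // 600     # requests/hour at 600 s
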
The networking driver in use is quite stable. According to discussions
within the community, it was safe to raise the refresh interval to 600
seconds; it would even have been possible to disable this feature
entirely.

.. note::

   When the Nova Compute service is restarted, the healing process is
   reset and starts over. This can be particularly problematic in
   environments where the Nova Compute service is restarted
   frequently.

Increasing RPC workers for Neutron
----------------------------------

We have also decided to significantly increase the value for
``rpc_workers``. Given that RPC operations are designed to be
I/O-bound, we considered that exceeding the number of available cores
on our hosts by a factor of two would be both conservative and safe.

::

   [DEFAULT]
   rpc_workers = 20

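The sizing rule described above can be expressed directly; treat it as
a starting point rather than a hard formula:

.. code-block:: python

   import os

   # Twice the available cores, as described above; the fallback to 1 is
   # a safety net for platforms where cpu_count() returns None.
   cores = os.cpu_count() or 1
   rpc_workers = 2 * cores
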
Increasing Neutron ``max_pool_size``
------------------------------------

We have made a deliberate change to extend the ``max_pool_size`` value
from 1 to 60 in the Neutron database settings. This adjustment is
logical, given that we have increased the number of workers and can
anticipate these workers making use of the database.

::

   [database]
   max_pool_size = 60

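As a quick cross-check between the two tunings (the per-worker
connection factor below is purely an assumed illustration, not a
measured figure):

.. code-block:: python

   # rpc_workers and max_pool_size come from the sections above;
   # connections_per_worker is an assumption for illustration only.
   rpc_workers = 20
   max_pool_size = 60
   connections_per_worker = 3

   # The pool should cover the workers' plausible concurrent demand.
   headroom = max_pool_size - rpc_workers * connections_per_worker
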
Deferring flows deletion
------------------------

We have observed that the Neutron agents in our OpenStack environment
experience delays when deleting network flows as part of the virtual
machine termination process. This operation is blocking in nature,
causing the agent to become unresponsive to any other tasks until the
flow deletion is completed.

We decided to deploy the change `OpenDev Review #843253
<https://review.opendev.org/c/openstack/neutron/+/843253>`_, which
aims to mitigate this issue. It offloads the flow deletion task to a
separate thread, freeing the main thread to continue with other
operations.

.. code-block:: diff

      # will not match with the ip flow's cookie so OVS won't actually
      # delete the flow
      flow['cookie'] = ovs_lib.COOKIE_ANY
    - self._delete_flows(**flow)
    + self._delete_flows(deferred=False, **flow)

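The pattern behind that change can be illustrated with a minimal,
self-contained sketch (this is not Neutron's actual code): a worker
thread drains a queue of deletions, so the caller enqueues and returns
immediately instead of blocking:

.. code-block:: python

   import queue
   import threading

   # Minimal illustration of deferring blocking work to a thread;
   # not Neutron's actual implementation.
   deleted = []
   tasks = queue.Queue()

   def _worker():
       while True:
           flow = tasks.get()
           if flow is None:          # sentinel: stop the worker
               break
           deleted.append(flow)      # stand-in for the blocking deletion
           tasks.task_done()

   worker = threading.Thread(target=_worker, daemon=True)
   worker.start()

   # The caller enqueues and returns immediately instead of blocking.
   for cookie in (1, 2, 3):
       tasks.put({"cookie": cookie})
   tasks.join()                      # wait only here, for the demo
   tasks.put(None)
   worker.join()
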
Improvements after deploying
----------------------------

Finally, after deploying those changes, we have noticed a considerable
improvement in stability and success rate for virtual machine
creation. The latency involved in creating virtual machines is now
stable, requiring only a reasonable amount of time to transition them
to an active state.
@ -23,3 +23,4 @@ Contents:
|
|||||||
:maxdepth: 1
|
:maxdepth: 1
|
||||||
|
|
||||||
2020-01-29
|
2020-01-29
|
||||||
|
2023-10-06
|
||||||
|
Loading…
Reference in New Issue
Block a user