stories: discuss various optimizations on managing 3000 nodes

Change-Id: I91054275728ba7334587ec7c77f4afd7816e9dee
songwenping 2024-06-27 20:38:57 +08:00
parent 884a0b7749
commit 401969e8ce
2 changed files with 115 additions and 0 deletions


@@ -0,0 +1,114 @@
=========================================
2024-06-10 - Alex Song (Inspur IncloudOS)
=========================================
This document discusses the latency and timeout issues encountered while
managing 3000 compute nodes on the Inspur IncloudOS cloud platform.
After in-depth investigation and optimization, the request latency and
service response timeouts on the cloud platform have been alleviated,
meeting the requirements for concurrent creation and management of
virtual machines in large-scale scenarios.

Problem Description
-------------------
Services failed to start:

1. Nova-api failed to connect to the database.
2. Nova-api was unable to create threads.
3. The RabbitMQ service restarted automatically.

Concurrent VM creation failed:

1. RabbitMQ message backlog.
2. Failure to create virtual machines on specified nodes.
3. Nova timed out waiting for port creation.
4. Repeated OVSDB transaction commits caused port creation failures.

Optimization Methods
--------------------
1. Increase the database connection count and thread cache size

   We ensure that the database service starts normally and that the
   OpenStack services can obtain sufficient database connections by
   increasing the maximum number of database connections
   (``max_connections``) and the server thread cache size
   (``thread_cache_size``), set in the ``[mysqld]`` section for a
   MySQL/MariaDB backend::

       [mysqld]
       max_connections = 100000
       thread_cache_size = 10000
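
   To verify that the settings took effect, one can check the runtime
   values on the database side (a minimal check, assuming a MySQL/MariaDB
   backend)::

       SHOW VARIABLES LIKE 'max_connections';
       SHOW VARIABLES LIKE 'thread_cache_size';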
2. Optimize the RabbitMQ message middleware configuration

   The RabbitMQ service restarted automatically, and we found its logs
   continuously reporting ``noproc`` and ``handshake_timeout`` errors. After
   adjusting the maximum number of connections and increasing the handshake
   time configuration, the issue no longer occurs::

       [DEFAULT]
       maximum = 20000

   The maximum number of connections for RabbitMQ is estimated by
   calculating the connections of the Nova and Cinder components. We have
   3000 compute nodes, 3 control nodes and 15 cinder-volume nodes; we deploy
   RabbitMQ on the control nodes in master-slave mode. The total number of
   RabbitMQ connections is almost 20,000, so we set the maximum to 20,000.
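
   For the handshake time, a minimal sketch assuming the stock
   ``rabbitmq.conf`` format (the value is in milliseconds; the
   connection-limit key shown above depends on the deployment tooling)::

       # /etc/rabbitmq/rabbitmq.conf: allow slow AMQP handshakes under load
       handshake_timeout = 20000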
3. Reduce the RabbitMQ message backlog

   While waiting for message confirmations, the Nova-conductor service did
   not release its RabbitMQ connections, so other coroutines were unable to
   obtain a connection, resulting in a message backlog. We reduce the risk
   of message backlog by modifying the code, adjusting the message timeout
   mechanism, and increasing the timeout duration.
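
   As a sketch of the timeout adjustment, assuming the standard
   oslo.messaging option in ``nova.conf`` (the exact value used is
   deployment-specific)::

       [DEFAULT]
       # Seconds to wait for a response from an RPC call (default: 60).
       rpc_response_timeout = 180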
4. Increase the number of allocation candidates returned from Placement

   Scheduling failed because Nova requests the resource provider list from
   Placement with a maximum limit of 1000 results; since the number of
   compute nodes in the environment exceeds 1000, some hosts were never
   returned, resulting in scheduling failures. We resolved this issue by
   modifying the ``max_placement_results`` configuration option under the
   Nova ``[scheduler]`` section::

       [scheduler]
       max_placement_results = 3000

   We spawn 3000 instances in one request, and Placement returns 1000
   allocation candidates by default, so we need to increase the limit
   to 3000.
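
   For reference, this limit caps the query that nova-scheduler sends to the
   Placement API, along the lines of (illustrative resource amounts)::

       GET /allocation_candidates?limit=3000&resources=VCPU:2,MEMORY_MB:4096,DISK_GB:40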
5. Resolve port creation timeouts

   We changed the deployment of the OVN southbound (SB) database by
   deploying 10 relay SB services per control node, so that each compute
   node connects to a single relay service, reducing port creation time.

   Before deploying the SB relays, each OVN SB process had to manage an
   average of 600+ connections; its CPU usage was often 100% and requests
   were processed slowly. After adding the relays, each relay process
   handles about 60 connections, with the total number of relay processes
   set to 10. In testing, each relay process showed low CPU usage and
   processed requests quickly.
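
   A minimal sketch of one relay instance and the compute-side remote,
   assuming the standard ovsdb-server relay syntax and example addresses::

       # On a control node: start one OVN SB relay on port 16642, pointing
       # at the central SB database (192.0.2.10 is an example address).
       ovsdb-server --remote=ptcp:16642:0.0.0.0 relay:OVN_Southbound:tcp:192.0.2.10:6642

       # On a compute node: point ovn-controller at a relay instead of the
       # central SB database (<relay-ip> is a placeholder).
       ovs-vsctl set open_vswitch . external_ids:ovn-remote="tcp:<relay-ip>:16642"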
6. Deployment optimization

   We modified the Ansible module on top of OpenStack-Helm to support
   user-defined configuration, making it more convenient to modify
   OpenStack configuration parameters.

   Additionally, we optimized kube-apiserver load balancing in large-scale
   scenarios by adjusting the Kubelet client's long-connection strategy so
   that it reconnects at random intervals, ensuring an even overall load
   across all management nodes.

Optimized Performance
---------------------

1. The success rate of concurrently creating 3000 virtual machines is 100%.
2. Querying 50000 virtual machines took 562.44 ms.
3. Creating 2000 ports concurrently succeeds 100% of the time, with an
   average creation time of less than 0.2 seconds per port.


@@ -24,3 +24,4 @@ Contents:
2020-01-29
2023-10-06
2024-06-10