stories: discuss various optimizations on managing 3000 nodes

Change-Id: I91054275728ba7334587ec7c77f4afd7816e9dee
songwenping 2024-06-27 20:38:57 +08:00
parent 884a0b7749
commit 401969e8ce
2 changed files with 115 additions and 0 deletions


@@ -0,0 +1,114 @@
=========================================
2024-06-10 - Alex Song (Inspur IncloudOS)
=========================================
This document discusses the latency and timeout issues encountered while
managing 3000 compute nodes on the Inspur IncloudOS cloud platform.
After in-depth investigation and optimization, the request latency and
service response timeouts on the cloud platform have been alleviated,
meeting the requirements for concurrent creation and management of
virtual machines in large-scale scenarios.

Problem Description
-------------------
Services failed to start:

1. Nova-api failed to connect to the database.
2. Nova-api was unable to create threads.
3. The RabbitMQ service restarted automatically.

Concurrent VM creation failed:

1. RabbitMQ message backlog.
2. Failure to create virtual machines on specified nodes.
3. Nova timed out waiting for port creation.
4. Repeated OVSDB transaction commits caused port creation failures.

Optimization Methods
--------------------
1. Increase the database connection count and thread cache size

   We ensure that the database service starts normally and that the
   OpenStack services can obtain sufficient database connections by
   increasing the maximum number of database connections
   (``max_connections``) and the server thread cache size
   (``thread_cache_size``), set in the ``[mysqld]`` section for a
   MySQL/MariaDB backend::

       [mysqld]
       max_connections = 100000
       thread_cache_size = 10000
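
   To verify that the settings took effect, one can check the runtime
   values on the database side (a minimal check, assuming a MySQL/MariaDB
   backend)::

       SHOW VARIABLES LIKE 'max_connections';
       SHOW VARIABLES LIKE 'thread_cache_size';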
2. Optimize the RabbitMQ message middleware configuration

   The RabbitMQ service restarted automatically, and we found its logs
   continuously reporting ``noproc`` and ``handshake_timeout`` errors. After
   adjusting the maximum number of connections and increasing the handshake
   time configuration, the issue no longer occurs::

       [DEFAULT]
       maximum = 20000

   The maximum number of connections for RabbitMQ is estimated by
   calculating the connections of the Nova and Cinder components. We have
   3000 compute nodes, 3 control nodes and 15 cinder-volume nodes; we deploy
   RabbitMQ on the control nodes in master-slave mode. The total number of
   RabbitMQ connections is almost 20,000, so we set the maximum to 20,000.
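
   For the handshake time, a minimal sketch assuming the stock
   ``rabbitmq.conf`` format (the value is in milliseconds; the
   connection-limit key shown above depends on the deployment tooling)::

       # /etc/rabbitmq/rabbitmq.conf: allow slow AMQP handshakes under load
       handshake_timeout = 20000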
3. Reduce the RabbitMQ message backlog

   While waiting for message confirmations, the Nova-conductor service did
   not release its RabbitMQ connections, so other coroutines were unable to
   obtain a connection, resulting in a message backlog. We reduce the risk
   of message backlog by modifying the code, adjusting the message timeout
   mechanism, and increasing the timeout duration.
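
   As a sketch of the timeout adjustment, assuming the standard
   oslo.messaging option in ``nova.conf`` (the exact value used is
   deployment-specific)::

       [DEFAULT]
       # Seconds to wait for a response from an RPC call (default: 60).
       rpc_response_timeout = 180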
4. Increase the number of allocation candidates returned from Placement

   Scheduling failed because Nova requests the resource provider list from
   Placement with a maximum limit of 1000 results; since the number of
   compute nodes in the environment exceeds 1000, some hosts were never
   returned, resulting in scheduling failures. We resolved this issue by
   modifying the ``max_placement_results`` configuration option under the
   Nova ``[scheduler]`` section::

       [scheduler]
       max_placement_results = 3000

   We spawn 3000 instances in one request, and Placement returns 1000
   allocation candidates by default, so we need to increase the limit
   to 3000.
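
   For reference, this limit caps the query that nova-scheduler sends to the
   Placement API, along the lines of (illustrative resource amounts)::

       GET /allocation_candidates?limit=3000&resources=VCPU:2,MEMORY_MB:4096,DISK_GB:40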
5. Resolve port creation timeouts

   We changed the deployment of the OVN southbound (SB) database by
   deploying 10 relay SB services per control node, so that each compute
   node connects to a single relay service, reducing port creation time.

   Before deploying the SB relays, each OVN SB process had to manage an
   average of 600+ connections; its CPU usage was often 100% and requests
   were processed slowly. After adding the relays, each relay process
   handles about 60 connections, with the total number of relay processes
   set to 10. In testing, each relay process showed low CPU usage and
   processed requests quickly.
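
   A minimal sketch of one relay instance and the compute-side remote,
   assuming the standard ovsdb-server relay syntax and example addresses::

       # On a control node: start one OVN SB relay on port 16642, pointing
       # at the central SB database (192.0.2.10 is an example address).
       ovsdb-server --remote=ptcp:16642:0.0.0.0 relay:OVN_Southbound:tcp:192.0.2.10:6642

       # On a compute node: point ovn-controller at a relay instead of the
       # central SB database (<relay-ip> is a placeholder).
       ovs-vsctl set open_vswitch . external_ids:ovn-remote="tcp:<relay-ip>:16642"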
6. Deployment optimization

   We modified the Ansible module on top of OpenStack-Helm to support
   user-defined configuration, making it more convenient to modify
   OpenStack configuration parameters.

   Additionally, we optimized kube-apiserver load balancing in large-scale
   scenarios by adjusting the Kubelet client's long-connection strategy so
   that it reconnects at random intervals, ensuring an even overall load
   across all management nodes.

Optimized Performance
---------------------

1. The success rate of concurrently creating 3000 virtual machines is 100%.
2. Querying 50000 virtual machines took 562.44 ms.
3. Creating 2000 ports concurrently succeeds 100% of the time, with an
   average creation time of less than 0.2 seconds per port.


@@ -24,3 +24,4 @@ Contents:
2020-01-29
2023-10-06
2024-06-10