stories: discuss various optimizations on managing 3000 nodes
Change-Id: I91054275728ba7334587ec7c77f4afd7816e9dee
This commit is contained in:
parent 884a0b7749
commit 401969e8ce

doc/source/stories/2024-06-10.rst (new file, 114 lines)

=========================================
2024-06-10 - Alex Song (Inspur IncloudOS)
=========================================

This document discusses the latency and timeout issues encountered in
managing 3000 compute nodes on the Inspur IncloudOS cloud platform.
After in-depth investigation and optimization, the request latency and
service response timeouts on the cloud platform have been alleviated,
meeting the requirements for concurrent creation and management of
virtual machines in large-scale scenarios.

Problem Description
-------------------

Services failed to start:

1. Nova-api failed to connect to the database.
2. Nova-api was unable to create threads.
3. The RabbitMQ service restarted automatically.

Concurrent VM creation failed:

1. RabbitMQ message backlog.
2. Creating a virtual machine on a specified node failed.
3. Nova timed out waiting for port creation.
4. Repeated OVSDB transaction commits caused port creation failures.

Optimized Method
----------------

1. Increase the database connection count and memory limit

We ensure that the database service starts normally and that the
OpenStack services can obtain sufficient database connections by
increasing the maximum number of connections (``max_connections``) and
the thread cache size (``thread_cache_size``).

::

    [mysqld]
    max_connections = 100000
    thread_cache_size = 10000

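On the client side, the connection pool each OpenStack service may open
is governed by standard oslo.db options. A sketch of the corresponding
``nova.conf`` knobs (the values here are illustrative, not the ones
used in this deployment):

::

    [database]
    max_pool_size = 10
    max_overflow = 50
    pool_timeout = 60

Raising the server-side ``max_connections`` only helps if the
per-service pools are allowed to grow accordingly.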
2. Optimize the RabbitMQ message-middleware configuration

The RabbitMQ service kept restarting automatically. We found that its
logs continuously reported ``noproc`` and ``handshake_timeout`` errors.
After raising the maximum number of connections and increasing the
handshake timeout, the issue no longer occurred.

::

    [DEFAULT]
    maximum = 20000

The maximum number of RabbitMQ connections was estimated by counting
the connections opened by the Nova and Cinder components. With 3000
compute nodes, 3 control nodes, and 15 cinder-volume nodes, and with
RabbitMQ deployed on the control nodes in master-slave mode, the total
is roughly 20,000 connections, so we set the limit to 20,000.

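The estimate above can be sketched as back-of-the-envelope arithmetic.
The figure of three AMQP connections per service and the
one-agent-per-compute-node assumption are illustrative, not measured:

```python
# Back-of-the-envelope RabbitMQ connection estimate. Three AMQP
# connections per service (RPC listener, casts, replies) is an
# assumed average, not a measured value.
CONNS_PER_SERVICE = 3

deployment = {
    "nova-compute": 3000,     # one per compute node
    "neutron-agent": 3000,    # assumed one agent per compute node
    "cinder-volume": 15,
    "control-plane": 3 * 10,  # 3 control nodes, ~10 services each
}

total = sum(deployment.values()) * CONNS_PER_SERVICE
print(total)  # 18135, on the same order as the 20,000 limit
```

Any safety margin on top of this lower bound is absorbed by rounding
the configured limit up to 20,000.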
3. Reduce the RabbitMQ message backlog

While waiting for message confirmation, the nova-conductor service did
not release its RabbitMQ connection, so other coroutines were unable to
obtain a connection, resulting in a message backlog. We reduced the
risk of backlog by modifying the code, adjusting the message timeout
mechanism, and increasing the timeout duration.

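The timeout adjustments map onto standard oslo.messaging options; a
sketch of the relevant knobs (the values are illustrative, not the ones
deployed):

::

    [DEFAULT]
    rpc_response_timeout = 180

    [oslo_messaging_rabbit]
    heartbeat_timeout_threshold = 120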
4. Increase the number of allocation candidates returned from Placement

The scheduling failures occurred because Nova requests the resource
provider list from Placement with a limit of at most 1000. Since the
environment has more than 1000 compute nodes, some hosts were never
returned, resulting in scheduling failures. We resolved this by raising
the ``max_placement_results`` option in the nova scheduler section.

::

    [scheduler]
    max_placement_results = 3000

We spawn 3000 instances in a single request, while Placement returns
1000 allocation candidates by default, so we raised the limit to 3000.

5. Resolve port creation timeouts

We changed the OVN southbound (SB) deployment by running 10 relay SB
services per control node; compute nodes connect to a single relay
service, which reduces port creation time.
Before deploying the SB relays, each OVN SB process had to manage an
average of 600+ connections; CPU usage was often at 100% and requests
were processed slowly. After adding the relays, with the number of
relay processes set to 10, each relay handles about 60 connections,
keeps CPU usage low, and processes requests quickly.

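An OVN SB relay is itself an ``ovsdb-server`` running in relay mode and
pointing at the central SB database; a minimal sketch of one relay
instance (the addresses and ports are illustrative):

::

    ovsdb-server --remote=ptcp:16642 \
        relay:OVN_Southbound:tcp:192.0.2.10:6642

Compute-node ``ovn-controller`` processes are then pointed at a relay
endpoint rather than at the central SB server.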
6. Deployment optimization

We modified the Ansible modules on top of OpenStack-Helm to support
user-defined configuration, making it more convenient to modify
OpenStack configuration parameters.
Additionally, we addressed the kube-apiserver load-balancing problem in
large-scale scenarios by adjusting the kubelet client's long-connection
strategy so that it reconnects to a random endpoint, keeping the load
balanced across all management nodes.

Optimized Performance
---------------------

1. Concurrent creation of 3000 virtual machines succeeds with a 100%
   success rate.

2. Querying 50000 virtual machines took 562.44 ms.

3. 2000 concurrent ports are created with a 100% success rate, with an
   average creation time of under 0.2 seconds per port.

Contents:

   2020-01-29
   2023-10-06
   2024-06-10