diff --git a/doc/source/stories/2024-06-10.rst b/doc/source/stories/2024-06-10.rst
new file mode 100644
index 0000000..6b8a67b
--- /dev/null
+++ b/doc/source/stories/2024-06-10.rst
@@ -0,0 +1,114 @@
+=========================================
+2024-06-10 - Alex Song (Inspur IncloudOS)
+=========================================
+
+This document discusses the latency and timeout issues encountered in
+managing 3000 compute nodes on the Inspur IncloudOS cloud platform.
+After in-depth investigation and optimization, the request latency and
+service response timeouts on the cloud platform have been alleviated,
+meeting the requirements for concurrent creation and management of
+virtual machines in large-scale scenarios.
+
+Problem Description
+-------------------
+
+Services failed to start:
+
+1. Nova-api failed to connect to the database.
+2. Nova-api was unable to create threads.
+3. The RabbitMQ service restarted automatically.
+
+Concurrent VM creation failed:
+
+1. RabbitMQ message backlog.
+2. Creating a virtual machine on a specified node failed.
+3. Nova timed out waiting for port creation.
+4. Repeated OVSDB transaction commits caused port creation failures.
+
+Optimization Methods
+--------------------
+
+1. Increase the database connection count and memory limits
+
+We ensure that the database service starts normally and that OpenStack
+services can obtain sufficient database connections by increasing the
+maximum number of database connections (max_connections) and the
+number of cached database threads (thread_cache_size):
+
+::
+
+    [DEFAULT]
+    max_connections = 100000
+    thread_cache_size = 10000
+
+2. Optimize the RabbitMQ message-middleware configuration
+
+The RabbitMQ service was restarting automatically, and we found
+continuous logs containing noproc and handshake_timeout errors. By
+raising the maximum number of connections and increasing the handshake
+timeout, the issue no longer occurs.
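+The connection ceiling configured below was derived from the size of
+the deployment. As a rough illustration of the arithmetic, the estimate
+can be sketched as follows; the per-service connection counts in this
+sketch are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope estimate of the RabbitMQ connection limit.
# Deployment sizes come from this story; the per-node connection
# counts below are illustrative assumptions, not measured values.
compute_nodes = 3000
control_nodes = 3
cinder_volume_nodes = 15

conns_per_compute = 6     # assumed: nova-compute, neutron agents, ...
conns_per_control = 200   # assumed: API/conductor/scheduler workers
conns_per_volume = 10     # assumed: cinder-volume workers

total = (compute_nodes * conns_per_compute
         + control_nodes * conns_per_control
         + cinder_volume_nodes * conns_per_volume)

print(total)  # 18750 -> rounded up to 20000 for headroom
```

+The real per-service counts depend on worker counts and oslo.messaging
+pool settings, so this is only a sizing heuristic, not a formula.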
+
+::
+
+    [DEFAULT]
+    maximum = 20000
+
+The maximum number of RabbitMQ connections is estimated by calculating
+the connections opened by the Nova and Cinder components. We have 3000
+compute nodes, 3 control nodes, and 15 cinder-volume nodes, with
+RabbitMQ deployed on the control nodes in master-slave mode. The total
+number of connections is almost 20000, so we set the limit to 20000.
+
+3. Reduce the RabbitMQ message backlog
+
+While the Nova-conductor service waited for message confirmations, its
+connection to RabbitMQ was not released, and other coroutines were
+unable to obtain a connection, resulting in a message backlog. We
+reduced this risk by modifying the code to adjust the message timeout
+mechanism and increase the timeout duration.
+
+4. Increase the number of allocation candidates returned by Placement
+
+Scheduling failed because Nova requests the resource provider list from
+Placement with a maximum limit of 1000 results; since the environment
+has more than 1000 compute nodes, some hosts were never returned,
+causing scheduling failures. We resolved this by modifying the
+max_placement_results option under Nova's scheduler section:
+
+::
+
+    [scheduler]
+    max_placement_results = 3000
+
+We spawn 3000 instances in one request, and Placement returns at most
+1000 allocation candidates by default, so we raise the limit to 3000.
+
+5. Resolve port-creation timeouts
+
+We changed the OVN southbound (SB) deployment by running 10 SB relay
+services per control node; each compute node connects to a single relay
+service, which reduces port creation time.
+Before the relays were deployed, each OVN SB process managed an average
+of 600+ connections, CPU usage was often 100%, and requests were
+processed slowly. After adding the relays, each relay process handles
+about 60 connections, with the number of relay processes set to 10. In
+testing, each relay process shows low CPU usage and processes requests
+quickly.
+
+6. Deployment optimization
+
+We modified the Ansible module built on top of OpenStack-Helm to
+support user-defined configuration, making it more convenient to modify
+OpenStack configuration parameters.
+Additionally, we addressed kube-apiserver load balancing at large scale
+by adjusting the kubelet client's long-connection strategy to reconnect
+randomly, keeping the load balanced across all management nodes.
+
+Optimization results
+--------------------
+
+1. The success rate of concurrently creating 3000 virtual machines is 100%.
+
+2. Querying 50000 virtual machines took 562.44 ms.
+
+3. 2000 concurrent ports can be created with a 100% success rate, with
+an average creation time of less than 0.2 seconds per port.
diff --git a/doc/source/stories/index.rst b/doc/source/stories/index.rst
index 3c297b8..052d813 100644
--- a/doc/source/stories/index.rst
+++ b/doc/source/stories/index.rst
@@ -24,3 +24,4 @@ Contents:
 
    2020-01-29
    2023-10-06
+   2024-06-10