From 8d374dd715ddb6e83ddd36fb83e471ccff2a0e9e Mon Sep 17 00:00:00 2001
From: Ghanshyam
Date: Tue, 18 Nov 2025 02:46:07 +0000
Subject: [PATCH] Add backlog spec for graceful shutdown of Nova services

PoC:

- Code change: https://review.opendev.org/c/openstack/nova/+/967261

Partial implement blueprint nova-services-graceful-shutdown

Change-Id: Ic1d5f039c4f1d9cc1c474d8750bff8c5274fd14e
Signed-off-by: Ghanshyam
---
 .../nova-services-graceful-shutdown.rst       | 711 ++++++++++++++++++
 1 file changed, 711 insertions(+)
 create mode 100644 specs/backlog/approved/nova-services-graceful-shutdown.rst

diff --git a/specs/backlog/approved/nova-services-graceful-shutdown.rst b/specs/backlog/approved/nova-services-graceful-shutdown.rst
new file mode 100644
index 000000000..d0c4d4ea0
--- /dev/null
+++ b/specs/backlog/approved/nova-services-graceful-shutdown.rst
@@ -0,0 +1,711 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==================================
+Graceful Shutdown of Nova Services
+==================================
+
+https://blueprints.launchpad.net/nova/+spec/nova-services-graceful-shutdown
+
+This is a backlog spec proposing the design of graceful shutdown.
+
+Nova services do not shut down gracefully. Stopping a service also stops all
+of its in-progress operations, which not only interrupts those operations but
+can leave instances in an unwanted or unrecoverable state. The idea is to let
+services stop processing new requests but complete the in-progress operations
+before the service is terminated.
+
+Problem description
+===================
+
+Nova services do not have a way to shut down gracefully, meaning they do not
+wait for in-progress operations to complete. When shutdown is initiated, a
+service stops its RPC server and waits so that all the already-received
+request messages (RPC call/cast) are dispatched, but it does not wait for the
+resulting operations to complete.
+
+Each Nova compute service has a single worker running and listening on a
+single RPC server (topic: ``compute.<host>``). The same RPC server is used
+for new requests as well as for in-progress operations where other compute or
+conductor services communicate. When shutdown is initiated, the RPC server is
+stopped, meaning it stops handling new requests, which is fine, but at the
+same time it also stops the communication needed for the in-progress
+operations. For example, if a live migration is in progress, the source and
+destination computes communicate (synchronously and asynchronously) multiple
+times with each other. Once the RPC server on one compute service is stopped,
+it cannot communicate with the other compute and the live migration fails.
+This can leave the system as well as the instance in an unwanted or
+unrecoverable state. A simplified sketch of today's stop path is shown at the
+end of this section.
+
+Use Cases
+---------
+
+As an operator, I want to be able to gracefully shut down (SIGTERM) the Nova
+services so that it does not impact the users' in-progress operations and
+keeps resources in a usable state.
+
+As an operator, I want instances and other resources to stay in a usable
+state even if a service is gracefully terminated (SIGTERM).
+
+As an operator, I want to get the actual benefit of Kubernetes pod graceful
+shutdown when Nova services are running in k8s pods.
+
+As a user, I want in-progress operations to be completed before the service
+is gracefully terminated (SIGTERM).
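+
+The sketch below illustrates the problem described above. It is a simplified,
+illustrative view of how a Nova service stops today (not verbatim Nova code):
+the single RPC server is stopped and waited on, which also cuts off the RPC
+communication that in-progress operations such as live migration still need.
+
+.. code-block:: python
+
+   class Service:
+       """Simplified sketch of today's nova.service shutdown path."""
+
+       def stop(self):
+           if self.rpcserver:
+               # Stop consuming messages: new requests AND the messages
+               # needed to finish in-progress operations are cut off.
+               self.rpcserver.stop()
+               # Wait only for already-dispatched messages; the operations
+               # they triggered may still be running when we return.
+               self.rpcserver.wait()
+           # The service process exits even if, for example, a live
+           # migration is still in flight.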
+
+Proposed change
+===============
+
+Scope: The proposed solution is to gracefully shut down the services on the
+SIGTERM signal.
+
+The graceful shutdown is based on the following design principles:
+
+* When service shutdown is initiated by SIGTERM:
+
+  * Do not process any new requests.
+  * New requests should not be lost. Once the service is restarted, it
+    should process them.
+  * Allow in-progress operations to reach their quickest safe termination
+    point, either completion or abort.
+  * Properly log the state of in-progress operations.
+  * Keep instances and other resources in a usable state.
+
+* When service shutdown is completed:
+
+  * Properly log unfinished operations.
+    Ideally, all in-progress operations should be completed before the
+    service is terminated, but if the graceful shutdown times out (due to a
+    configured timeout; the timeout details are covered in a later section)
+    then all unfinished operations should be properly logged. This will help
+    to recover the system or instances.
+
+* When the service is started again:
+
+  * Start processing new requests in the normal way.
+  * If requests were not processed because shutdown was initiated, they stay
+    in the message broker queue and there are multiple possibilities:
+
+    * Requests might have been picked up by another worker of that service.
+      For example, you can run more than one Nova scheduler (or conductor)
+      worker. If one of the workers is shutting down, another worker will
+      process the request. This is not the case for Nova compute, which is
+      always a single worker per compute service on a specific host.
+    * If a service has a single worker running, the request can be picked up
+      once the service is up again.
+    * There is an opportunity for the compute service to clean up or recover
+      the interrupted operations on instances during init_host(). The action
+      taken will depend on the task and its status.
+    * If the service is in the stopped state for a long time, then based on
+      the RPC and message queue timeouts, there is a chance that:
+
+      * The RPC client or server will time out the call.
+      * The message broker queue may drop messages due to timeout.
+      * The order of requests and messages can be stale.
+
+As a graceful shutdown goal, we need to do two things:
+
+#. A way to stop new requests without interrupting in-progress operations.
+   This is proposed to be done via RPC.
+
+#. Give services enough time to finish the operations. As a first step,
+   this is proposed to be done via a time-based wait and later with a proper
+   tracking mechanism.
+
+This backlog spec proposes achieving the above goals in two steps. Each step
+will be proposed as a separate spec for a specific release.
+
+The Nova services which already shut down gracefully:
+------------------------------------------------------
+
+For the services below, graceful shutdown is handled by their deployment
+server or by a library they use.
+
+* Nova API & Nova metadata API:
+
+  Those services are deployed using a server with WSGI support. That server
+  will ensure that the Nova API services shut down gracefully, meaning it
+  finishes the in-progress requests and rejects the new ones.
+
+  I investigated this with uWSGI/mod_proxy_uwsgi (devstack environment). On
+  service start, the uWSGI server pre-spawns a number of workers for the API
+  service, which handle the API requests in a distributed way. When shutdown
+  is initiated by SIGTERM, the uWSGI SIGTERM handler checks whether there are
+  any in-progress requests on any worker. It waits for all the workers to
+  finish their requests and then terminates each worker. Once all workers
+  are terminated, it terminates the Nova API service.
+
+  If any new request comes in after the shutdown is initiated, it is
+  rejected with a "503 Service Unavailable" error.
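+
+  To make the in-progress requests observable in the tests below, an
+  artificial delay was injected into the API handler under test. A minimal
+  sketch of that test aid (illustrative only, not part of the proposal; the
+  method name is hypothetical):
+
+  .. code-block:: python
+
+     import time
+
+     def detail(self, req):
+         # Hold the request long enough that shutdown is initiated while
+         # this request is still in flight.
+         time.sleep(10)
+         ...  # normal server list handling continues here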
+
+  Testing:
+
+  I tested two types of requests:
+
+  #. Sync request: ``openstack server list``:
+
+     * To observe the graceful shutdown, I added 10 seconds of sleep in the
+       server list API code.
+     * Start an API request 'request1': ``openstack server list``
+     * Wait till the server list request reaches the Nova API (you can see
+       the log on the controller).
+     * Because of sleep(10), the server list takes time to finish.
+     * Initiate the Nova API service shutdown.
+     * Start a new API request 'request2': ``openstack server list``. This
+       new request came after shutdown was initiated, so it should be
+       denied.
+     * The Nova API service will wait because 'request1' is not finished.
+     * 'request1' will get the server list response before the service is
+       terminated.
+     * 'request2' is denied and will receive the error
+       "503 Service Unavailable".
+
+  #. Async request: ``openstack server pause <server>``:
+
+     * To observe the graceful shutdown, I added 10 seconds of sleep in the
+       server pause API code.
+     * Start an API request 'request1': ``openstack server pause server1``
+     * Wait till the pause server request reaches the Nova API (you can see
+       the log on the controller).
+     * Because of sleep(10), the pause server takes time to finish.
+     * Initiate the Nova API service shutdown.
+     * The service will wait because 'request1' is not finished.
+     * Nova API will make an RPC cast to the Nova compute service and
+       return.
+     * 'request1' is completed, and the response is returned to the user.
+     * The Nova API service is terminated now.
+     * The Nova compute service is processing the pause server request.
+     * Check if the server is paused: ``openstack server list``
+     * You can see the server is paused.
+
+* Nova console proxy services: nova-novncproxy, nova-serialproxy, and
+  nova-spicehtml5proxy:
+
+  All the console proxy services run as a websockify.websocketproxy_
+  service. The websockify_ library handles the SIGTERM signal and the
+  graceful shutdown, which is enough for the Nova services.
+
+  When a user accesses the console, the websockify library starts a new
+  process in start_service_ and calls Nova's new_websocket_client_. Nova
+  authorizes the token and creates a socket on the host & port, which will
+  be used to send the data/frames. After that, the user can access the
+  console.
+
+  If a shutdown request is initiated, websockify handles the signal. First,
+  it terminates all the child processes and then raises the terminate
+  exception, which ends up calling the Nova close_connection_ method. The
+  Nova close_connection_ method calls shutdown() on the socket first and
+  then close(), which makes sure the remaining data/frames are sent before
+  closing the socket.
+
+  This way, user console sessions will be terminated gracefully, and users
+  will get a "Disconnected" message. Once the service is up again, the user
+  can refresh the browser, and the console will be back (if the token has
+  not expired).
+
+Spec 1: Split the new and in-progress requests via RPC:
+--------------------------------------------------------
+
+RPC communication is an essential part of how services finish a particular
+operation. During shutdown, we need to make sure we keep the required RPC
+servers/buses up. If we stop the RPC communication, then it is no different
+from terminating the service.
+
+Nova implements, and this spec talks a lot about, RPC server ``start``,
+``stop``, and ``wait``, so let's cover them briefly from the
+oslo.messaging/RPC resources point of view, to make this proposal easier to
+understand. If you already know this, you can skip this part.
+
+* RPC server:
+
+  * creation and start():
+
+    * It creates the required resources on the oslo.messaging side, for
+      example, dispatcher, consumer, listener, and queues.
+    * It handles the binding to the required exchanges.
+
+  * stop():
+
+    * It disables the listener's ability to pick up any new message from the
+      queue, but already-picked messages are still dispatched to the
+      dispatcher.
+    * It deletes the consumer.
+    * It does not delete the queues and exchange on the message broker side.
+    * It does not stop RPC clients from sending new messages to the queue;
+      however, they will not be picked up because the consumer and listener
+      are stopped.
+
+  * wait():
+
+    * It waits for the thread pool to finish dispatching all the
+      already-picked messages. Basically, this makes sure the methods are
+      called on the manager.
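+
+As a minimal sketch (the topic, host, and endpoint class here are
+illustrative assumptions, not the exact Nova wiring), the lifecycle above
+maps to the oslo.messaging API like this:
+
+.. code-block:: python
+
+   import oslo_messaging
+   from oslo_config import cfg
+
+   transport = oslo_messaging.get_rpc_transport(cfg.CONF)
+   target = oslo_messaging.Target(topic='compute', server='host1')
+   endpoints = [ComputeManager()]  # object whose methods RPC dispatches to
+
+   server = oslo_messaging.get_rpc_server(transport, target, endpoints)
+   server.start()  # create listener/consumer, bind queues to the exchange
+
+   # ... service runs ...
+
+   server.stop()   # stop picking up new messages; queues stay on the broker
+   server.wait()   # drain messages that were already picked up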
+
+Analysis per service and the required proposed RPC design change:
+
+* The services listed below communicate with other Nova services' RPC
+  servers. Since they do not have their own RPC server, no change is needed:
+
+  * Nova API
+  * Nova metadata API
+  * nova-novncproxy
+  * nova-serialproxy
+  * nova-spicehtml5proxy
+
+* Nova scheduler: No RPC change needed.
+
+  * Request handling:
+    The Nova scheduler service runs as multiple workers, each having its own
+    RPC server, but all the Nova scheduler workers listen to the same RPC
+    topic and queue ``scheduler`` in a fanout way.
+
+    Currently, nova.service.py->stop() calls stop() and wait() on the RPC
+    server. Once the RPC server is stopped, it stops listening for new
+    messages. But this does not impact the other scheduler workers; they
+    continue listening to the same queue and processing requests. If one of
+    the scheduler workers is stopped, the other workers will process the
+    requests.
+
+  * Response handling:
+    Whenever there is an RPC call, oslo.messaging creates a reply queue
+    connected with the unique message id. This reply queue is used to send
+    the RPC call response to the caller. Even if the RPC server is stopped
+    on this worker, the reply queue is not impacted.
+
+    We still need to keep the worker up until all the responses are sent via
+    the reply queue, and for that, we need to implement in-progress task
+    tracking in the scheduler service, but that will be handled in step 2.
+
+    This way, stopping a Nova scheduler worker will not impact the RPC
+    communication of the scheduler service.
+
+* Nova conductor: No RPC change needed.
+
+  The Nova conductor binary is a stateless service that can spawn multiple
+  worker threads. Each instance of the Nova conductor has its own RPC
+  server, but all the Nova conductor instances listen to the same RPC topic
+  and queue ``conductor``. This allows the conductor instances to act as a
+  distributed worker pool, such that stopping an individual conductor
+  instance will not impact the RPC communication for the pool of conductor
+  instances, allowing other available workers to process the request. Each
+  cell has its own pool of conductors, meaning that as long as one conductor
+  is up for any given cell, the RPC communication will continue to function
+  even when one or more conductors are stopped.
+
+  The request and response handling is done in the same way as mentioned for
+  the scheduler.
+
+  .. note::
+
+     This spec does not cover the single-worker conductor case. That might
+     require an RPC redesign for the conductor as well, but it needs more
+     investigation.
+
+* Nova compute: RPC design change needed.
+
+  * Request handling:
+    Nova compute runs as a single worker per host, and each compute has its
+    own RPC server, listener, and separate queues. It handles new requests
+    as well as the communication needed for in-progress operations on the
+    same RPC server. To achieve the graceful shutdown, we need to separate
+    the communication for new requests and for in-progress operations. This
+    will be done by adding a new RPC server in the compute service (a sketch
+    of the two servers is shown at the end of this section).
+
+    For easy readability, we will be using a different term for each RPC
+    server:
+
+    * 'ops RPC server': This term will be used for the new RPC server, which
+      will be used to finish the in-progress requests and will stay up
+      during shutdown.
+    * 'new request RPC server': This term will be used for the current RPC
+      server, which is used for the new requests and will be stopped during
+      shutdown.
+
+  * 'new request RPC server' per compute:
+    No change in this RPC server, but it will be used for all the new
+    requests, so that we can stop it during shutdown and thereby stop the
+    new requests on the compute.
+
+  * 'ops RPC server' per compute:
+
+    * Each compute will have a new 'ops RPC server' which will listen to a
+      new topic ``compute-ops.<host>``. The ``compute-ops`` name is used
+      because it is mainly for compute operations, but a better name can be
+      used if needed.
+    * It will use the same transport layer/bus and exchange that the
+      'new request RPC server' uses.
+    * It will create its own dispatcher, listener, and queue.
+    * Both RPC servers will be bound to the same endpoints (the same compute
+      manager), so that requests coming from either server are handled by
+      the same compute manager.
+    * This server will be mainly used for the compute-to-compute operations
+      and server external events. The idea is to keep this RPC server up
+      during shutdown so that the in-progress operations can be finished.
+    * During shutdown, nova.service will wait for the compute to signal that
+      it has finished all its tasks, so that it can stop the
+      'ops RPC server' and finish the shutdown.
+
+  * Response handling:
+    Irrespective of which RPC server a request comes from, whenever there is
+    an RPC call, oslo.messaging creates a reply queue connected with the
+    unique message id. This reply queue is used to send the RPC call
+    response to the caller. Even if the RPC server is stopped on this
+    worker, the reply queue is not impacted.
+
+  * Compute service workflow:
+
+    * The SIGTERM signal is handled by oslo.service, which calls stop on
+      nova.service.
+    * nova.service stops the 'new request RPC server' so that no new
+      requests are picked up by the compute. The 'ops RPC server' stays up
+      and running.
+    * nova.service waits for the manager to signal that all in-progress
+      operations are finished.
+    * Once the compute signals nova.service, it stops the 'ops RPC server'
+      and proceeds with the service shutdown.
+
+  * Timeout:
+
+    * There is an existing graceful_shutdown_timeout_ config option present
+      in oslo.service which can be set per service.
+    * It is honoured to time out the service stop, and it will stop the
+      service irrespective of whether the compute has finished its work.
+
+  * RPC client:
+
+    * The RPC client stays a singleton class, created with the topic
+      ``compute.<host>``, meaning that by default messages will be sent to
+      the 'new request RPC server'.
+    * If any RPC cast/call wants to send a message via the
+      'ops RPC server', it needs to override the ``topic`` to
+      ``compute-ops.<host>`` during the client.prepare() call.
+    * Which RPC casts/calls will use the 'ops RPC server' will be decided
+      during implementation, so that we can better judge which methods are
+      used by the operations we want to finish during shutdown.
+      A draft list of where we can use the 'ops RPC server':
+
+      .. note::
+
+         This is a draft list and can be changed during implementation.
+
+      * Migrations:
+
+        - Live migration:
+
+          .. note::
+
+             We will be using the 'new request RPC server' for the
+             check_can_live_migrate_destination and
+             check_can_live_migrate_source methods, as this is the very
+             initial phase where the compute service has not started the
+             live migration. If shutdown is initiated before the live
+             migration request comes in, the migration should be rejected.
+
+          - pre_live_migration()
+          - live_migration()
+          - prep_snapshot_based_resize_at_dest()
+          - remove_volume_connection()
+          - post_live_migration_at_destination()
+          - rollback_live_migration_at_destination()
+          - drop_move_claim_at_destination()
+
+        - resize methods
+        - cold migration methods
+
+      * Server external events
+      * Rebuild instance
+      * validate_console_port()
+        This is when the console has already been requested; if a port
+        validation request is going on, the compute should finish it before
+        shutdown so that users can get their requested console.
+
+* Time-based waiting for services to finish the in-progress operations:
+
+  .. note::
+
+     The time-based waiting is a temporary solution in spec 1. In spec 2, it
+     will be replaced by proper tracking of the in-progress tasks.
+
+  * To make the graceful shutdown less complicated, spec 1 proposes a
+    configurable time-based wait for services to complete their operations.
+  * The wait time should be less than the global graceful shutdown timeout,
+    so that the external system or oslo.service does not shut down the
+    service before the service wait time is over.
+  * It will be configurable per service.
+  * Proposal for the default values:
+
+    * compute service: 150 sec, considering long-running operations on
+      compute.
+    * conductor service: 60 sec should be enough.
+    * scheduler service: 60 sec should be enough.
+
+* PoC:
+  This PoC shows the spec 1 proposal working.
+
+  * Code change: https://review.opendev.org/c/openstack/nova/+/967261
+  * PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
+
+* Some specific examples of the shutdown issues which will be solved by this
+  proposal:
+
+  * Migrations:
+
+    * Migration operations will use the 'ops RPC server'.
+    * If a migration is in progress, the service shutdown will not terminate
+      it; instead, the service will be able to wait for the migration to
+      complete.
+
+  * Instance boot:
+
+    * Instance boot operations will continue to use the
+      'new request RPC server'. Otherwise, we would not be able to stop the
+      new requests.
+    * If instance boot requests are in progress on the compute service, the
+      shutdown will wait for the compute to boot them successfully.
+    * If a new instance boot request arrives after the shutdown is
+      initiated, it will stay in the queue, and the compute will handle it
+      once it is started again.
+
+  * Any operation which has already reached the compute will be completed
+    before the service is shut down.
+
+.. note::
+
+   As per my PoC and manual testing so far, this does not require any change
+   on the oslo.messaging side.
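+
+As referenced in the compute section above, here is a minimal sketch of the
+two RPC servers bound to the same manager. The topic names follow this spec;
+the helper wiring is an illustrative assumption, not the final
+implementation:
+
+.. code-block:: python
+
+   import oslo_messaging
+   from oslo_config import cfg
+
+   def create_compute_rpc_servers(host, manager):
+       transport = oslo_messaging.get_rpc_transport(cfg.CONF)
+       endpoints = [manager]  # both servers dispatch to the same manager
+
+       # 'new request RPC server': stopped first on SIGTERM so that no new
+       # requests are picked up.
+       new_request_server = oslo_messaging.get_rpc_server(
+           transport,
+           oslo_messaging.Target(topic='compute', server=host),
+           endpoints)
+
+       # 'ops RPC server': stays up during shutdown so that in-progress
+       # operations (e.g. live migration) can keep communicating.
+       ops_server = oslo_messaging.get_rpc_server(
+           transport,
+           oslo_messaging.Target(topic='compute-ops', server=host),
+           endpoints)
+
+       return new_request_server, ops_server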
+
+Spec 2: Smartly track and wait for the in-progress operations:
+---------------------------------------------------------------
+
+* The graceful shutdown of the services below is handled by their deployment
+  server or library, so no work is needed for Spec 2:
+
+  * Nova API
+  * Nova metadata API
+  * nova-novncproxy
+  * nova-serialproxy
+  * nova-spicehtml5proxy
+
+* The services below need to implement the tracking system:
+
+  * Nova compute
+  * Nova conductor
+  * Nova scheduler
+
+This proposal is to make the services wait based on tracking the in-progress
+tasks. Once a service finishes its tasks, it can signal nova.service to
+proceed with shutting down. Basically, this replaces the wait-time approach
+mentioned above with a tracker-based approach. A sketch of such a tracker is
+shown at the end of this section.
+
+* There will be a task tracker introduced to track the in-progress tasks.
+* It will be a singleton object.
+* It maintains a list of 'method names' and ``request-id``. If a task is
+  related to an instance, we can also add the instance UUID, which can help
+  to filter or know which operations on a specific instance are in progress.
+  The unique ``request-id`` will help to track multiple calls to the same
+  method.
+* Whenever a new request comes to the compute, it is added to the task list
+  and removed once the task is completed. Modifications to the tracker are
+  done under a lock.
+* Once shutdown is initiated:
+
+  * The task tracker will either add new tasks to the tracker list or reject
+    them. The decision will be made case by case, for example, rejecting
+    tasks that are not critical to handle during shutdown.
+  * During shutdown, any new periodic tasks will be denied, but in-progress
+    periodic tasks will be finished.
+  * The exact list of tasks which will be rejected and accepted will be
+    decided during implementation.
+  * The task tracker will start logging the tasks which are in progress, and
+    log when they are completed. Basically, it logs a detailed view of the
+    in-progress work during shutdown.
+
+* nova.service will wait for the task tracker to finish the in-progress
+  tasks until the timeout.
+* An example of the flow of RPC server stop, wait, and task tracker wait
+  will be something like:
+
+  * Signal the task tracker to start logging the in-progress tasks.
+  * RPCserver1.stop()
+  * RPCserver1.wait()
+  * manager.finish_tasks(): wait for the manager to finish the in-progress
+    tasks.
+  * RPCserver2.stop()
+  * RPCserver2.wait()
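+
+As referenced above, a minimal sketch of what such a task tracker could look
+like (the class and method names are illustrative assumptions; the real
+interface will be decided during implementation):
+
+.. code-block:: python
+
+   import logging
+   import threading
+   import time
+
+   LOG = logging.getLogger(__name__)
+
+
+   class ShuttingDown(Exception):
+       """Raised when a new task is rejected during shutdown."""
+
+
+   class TaskTracker:
+       """Singleton-style tracker of in-progress tasks."""
+
+       def __init__(self):
+           self._lock = threading.Lock()
+           self._tasks = {}        # request-id -> (method, instance uuid)
+           self._draining = False  # set once shutdown is initiated
+
+       def start(self, request_id, method, instance_uuid=None):
+           with self._lock:
+               if self._draining:
+                   # Case-by-case policy: non-critical new tasks are
+                   # rejected once shutdown has been initiated.
+                   raise ShuttingDown()
+               self._tasks[request_id] = (method, instance_uuid)
+
+       def finish(self, request_id):
+           with self._lock:
+               self._tasks.pop(request_id, None)
+
+       def drain(self, timeout, interval=1.0):
+           """Log in-progress tasks and wait for them, up to timeout."""
+           with self._lock:
+               self._draining = True
+           deadline = time.monotonic() + timeout
+           while time.monotonic() < deadline:
+               with self._lock:
+                   if not self._tasks:
+                       return True
+                   LOG.info('Waiting on in-progress tasks: %s', self._tasks)
+               time.sleep(interval)
+           return False  # caller logs the unfinished tasks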
+
+Graceful Shutdown Timeouts:
+---------------------------
+
+* Nova service timeout:
+
+  * oslo.service already has a timeout (graceful_shutdown_timeout_) which is
+    configurable per service and is used to time out the SIGTERM signal
+    handler.
+  * oslo.service will terminate the Nova service based on
+    graceful_shutdown_timeout_, even if the Nova service's graceful shutdown
+    is not finished.
+  * No new configurable timeout will be added to Nova; instead it will use
+    the existing graceful_shutdown_timeout_.
+  * Its default value is 60 sec, which is too short for Nova services. The
+    proposal is to override its default value per Nova service (see the
+    sketch at the end of this section):
+
+    * compute service: 180 sec (considering the long-running tasks).
+    * conductor service: 80 sec
+    * scheduler service: 80 sec
+
+* External system timeout:
+
+  Depending on how the Nova services are deployed, there might be an
+  external system timeout for graceful shutdown (for example, Nova running
+  in k8s pods). That can impact the Nova graceful shutdown, so we need to
+  document clearly that if there is an external system timeout, the Nova
+  graceful_shutdown_timeout_ should be set accordingly. The external system
+  timeout should be higher than graceful_shutdown_timeout_, otherwise the
+  external system will time out and interrupt the Nova graceful shutdown.
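+
+A minimal sketch of how the per-service default override could look. This is
+an illustrative assumption (the exact mechanism will be decided during
+implementation); the option itself is registered by oslo.service:
+
+.. code-block:: python
+
+   from oslo_config import cfg
+
+   def override_shutdown_timeout_default(binary):
+       # Raise the oslo.service default (60 sec) per Nova service.
+       defaults = {
+           'nova-compute': 180,    # long-running tasks, e.g. migrations
+           'nova-conductor': 80,
+           'nova-scheduler': 80,
+       }
+       if binary in defaults:
+           cfg.CONF.set_default('graceful_shutdown_timeout',
+                                defaults[binary])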
+
+Alternatives
+------------
+
+One alternative to the RPC redesign is to handle two topics per RPC server.
+This needs a good amount of change in the oslo.messaging framework as well
+as in the driver implementations. The idea is to allow the oslo.messaging
+Target to take more than one topic (take the topic as a list) and ask the
+driver to create separate consumers, listeners, dispatchers, and queues for
+each topic, and to create each topic's binding to the exchange. This also
+requires oslo.messaging to provide a new way to let the RPC server
+unsubscribe from a particular topic and continue listening on the other
+topics. We would also need to redesign how RPC server stop() and wait() work
+today. This is too complicated and amounts to redesigning the oslo.messaging
+RPC concepts.
+
+Another alternative is to track and stop sending requests from the Nova API
+or the scheduler service, but that would neither stop all the new requests
+(compute-to-compute tasks) nor let in-progress operations complete.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+This should have a positive impact on end users: the shutdown will no longer
+stop their in-progress operations.
+
+Performance Impact
+------------------
+
+No impact on normal operations, but the service shutdown will take more
+time. There is a configurable timeout to control the service shutdown wait
+time.
+
+Other deployer impact
+---------------------
+
+None other than a longer shutdown process, but deployers can configure an
+appropriate timeout for the service shutdown.
+
+Developer impact
+----------------
+
+None
+
+Upgrade impact
+--------------
+
+Adding a new RPC server will impact the upgrade. An old compute will not
+have the new 'ops RPC server' listening on the topic RPC_TOPIC_OPS, so we
+need to handle it with RPC versioning. If the RPC client detects an old
+compute (based on version_cap), it will fall back to sending the message to
+the original RPC server (listening on RPC_TOPIC).
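+
+A minimal sketch of the fallback described above (the RPC version number and
+helper name are hypothetical; the spec only requires that the routing be
+gated on the negotiated version):
+
+.. code-block:: python
+
+   def _prepare_for_ops(self, host):
+       # can_send_version() checks the negotiated version cap; an old
+       # compute without the 'ops RPC server' fails this check.
+       if self.client.can_send_version('6.5'):  # hypothetical version
+           # New compute: route via the 'ops RPC server' topic.
+           return self.client.prepare(topic='compute-ops.%s' % host)
+       # Old compute: fall back to the original single RPC server.
+       return self.client.prepare(server=host)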
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+
+gmaan
+
+Other contributors:
+
+None
+
+Feature Liaison
+---------------
+
+gmaan
+
+Work Items
+----------
+
+* Implement the 'ops RPC server' in the compute service.
+* Use the 'ops RPC server' for the operations we need to finish during
+  shutdown, for example, compute-to-compute tasks and server external
+  events.
+* RPC versioning due to the upgrade impact.
+* Implement a task tracker for services to track and report the in-progress
+  tasks during shutdown.
+
+Dependencies
+============
+
+* No dependency as of now, but we will see during implementation if any
+  change is needed in oslo.messaging.
+
+
+Testing
+=======
+
+* We cannot write Tempest tests for this because Tempest is not able to stop
+  the services.
+* We can try some testing in the 'post-run' phase (with a heavy live
+  migration that takes time), like it is done for the evacuate tests.
+* Unit and functional tests will be added.
+
+
+Documentation Impact
+====================
+
+The graceful shutdown behaviour will be documented along with other
+considerations, for example, the timeout or wait time used for the graceful
+shutdown.
+
+References
+==========
+
+* PoC:
+
+  * Code change: https://review.opendev.org/c/openstack/nova/+/967261
+  * PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
+
+* PTG discussions:
+
+  * https://etherpad.opendev.org/p/nova-2026.1-ptg#L860
+  * https://etherpad.opendev.org/p/nova-2025.1-ptg#L413
+  * https://etherpad.opendev.org/p/r.3d37f484b24bb0415983f345582508f7#L180
+
+.. _`websockify.websocketproxy`: https://github.com/novnc/websockify/blob/e9bd68cbb81ab9b0c4ee5fa7a62faba824a142d1/websockify/websocketproxy.py#L300
+.. _`websockify`: https://github.com/novnc/websockify
+.. _`start_service`: https://github.com/novnc/websockify/blob/e9bd68cbb81ab9b0c4ee5fa7a62faba824a142d1/websockify/websockifyserver.py#L861
+.. _`new_websocket_client`: https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1e3f/nova/console/websocketproxy.py#L164C9-L164C29
+.. _`close_connection`: https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1e3f/nova/console/websocketproxy.py#L150
+.. _`graceful_shutdown_timeout`: https://github.com/openstack/oslo.service/blob/8969233a0a45dad06c445fdf4a66920bd5f3eef0/oslo_service/_options.py#L60
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - 2026.1 Gazpacho
+     - Introduced