diff --git a/specs/approved/implement-rescue-mode.rst b/specs/approved/implement-rescue-mode.rst
new file mode 100644
index 00000000..b2e63053
--- /dev/null
+++ b/specs/approved/implement-rescue-mode.rst
@@ -0,0 +1,281 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=====================
+Implement Rescue Mode
+=====================
+
+https://bugs.launchpad.net/ironic/+bug/1526449
+
+Implement Nova rescue/unrescue in Ironic. Also implement an extension in IPA
+that carries out rescue-related tasks. After rescuing a node, it will be left
+running a rescue ramdisk, configured with the rescue_password, and listening
+for SSH connections on the specified network interfaces.
+
+Problem description
+===================
+
+Ironic does not currently implement the Nova rescue/unrescue interface.
+Therefore, end users are left with few options for troubleshooting or fixing
+anomalous or misconfigured nodes.
+
+Proposed change
+===============
+* Implement rescue() and unrescue() in the Ironic virt driver (no spec req'd):
+  https://blueprints.launchpad.net/nova/+spec/ironic-rescue-mode
+* Add InstanceRescueFailure and InstanceUnRescueFailure exceptions to Nova
+* Add methods to Nova driver to poll Ironic to wait for nodes to rescue and
+  unrescue, as appropriate.
+* Store hashed/salted (crypt(3)) password in Ironic node instance_info for
+  use in /etc/shadow on the rescued node
+* Add method to create salted hash appropriate for injecting into Linux
+  /etc/shadow file.
+* Modify Ironic state machine as described in the state machine impact section
+* Add AgentRescue driver (implements base.RescueInterface). This driver will
+  be mixed into the agent_ipmitool and agent_pyghmi drivers.
+* Add periodic task _check_rescue_timeouts to fail the rescue process if
+  it takes longer than rescue_callback_timeout seconds for the rescue ramdisk
+  to come online.
+* Add Conductor methods: do_node_rescue and do_node_unrescue
+* Add Conductor RPC calls: do_node_rescue and do_node_unrescue (and
+  increment API version)
+* Add conductor.rescue_callback_timeout config option
+* Add rescue-related functionality to Ironic Python Agent including ability
+  to set rescue password and kick off any needed network configuration
+  automation
+* Document good practices for building a rescue ramdisk in multi-tenant
+  environments
+* Add rescue_network configuration, which contains the UUID of the network the
+  rescue agent should be booted onto. For security reasons, this should be
+  separate from the networks used for provisioning and cleaning in multi-tenant
+  environments.
+
+An outline of the standard (non-error) rescue and unrescue processes follows:
+
+Standard rescue process:
+
+1. User calls Nova rescue() on a node.
+2. Nova ComputeManager calls the virt driver's rescue() method, passing in
+   rescue_password as a parameter.
+3. Virt driver calls node.set_provision_state(RESCUE), with the rescue_password
+   as a parameter.
+4. Virt driver loops while waiting for provision_state to change, and updates
+   Nova state as appropriate.
+5. Ironic API receives set_provision_state call, and performs do_node_rescue
+   RPC call.
+6. Ironic conductor hands off call to appropriate driver.
+7. Driver boots rescue ramdisk, using the configured boot driver. As part of
+   this process, Ironic will put the node onto the rescue_network, as
+   configured in ironic.conf.
+8. IPA ramdisk boots, performs a lookup, sets the rescue password based on the
+   returned data, and begins heartbeating.
+9. Upon receiving a heartbeat, the conductor, if using multiple networks,
+   places the instance back onto the tenant network, isolating the rescue
+   ramdisk from the Ironic control plane. When this completes, the node's
+   provision state will change to RESCUE.
+10. Inside the rescue image, automation (such as cloud-init) should configure
+    the rescue image as appropriate.
+
+Standard Unrescue process:
+
+1. User calls Nova unrescue() on a node.
+2. Nova calls the Ironic virt driver's unrescue() method.
+3. Virt driver removes rescue_password_hash from node instance info (set
+   during rescue process).
+4. Virt driver calls node.set_provision_state(ACTIVE).
+5. Virt driver loops while waiting for provision_state to change, and updates
+   Nova state as appropriate.
+6. Ironic API receives set_provision_state call, and performs
+   do_node_unrescue RPC call.
+7. Ironic conductor hands off call to appropriate driver.
+8. Driver performs actions required to boot node normally, and sets provision
+   state to ACTIVE.
+
+Rescue/Unrescue with standalone Ironic:
+
+1. Call Ironic provision state API with verb "rescue", with the rescue password
+   as an argument.
+2. When finished with rescuing the instance, call Ironic provision state API
+   with the "unrescue" verb.
+
+The current proposed methodology only works with Linux-based rescue ramdisks,
+due to the use of crypt(3) to store the password in a one-way hash.
+
+Alternatives
+------------
+* Continue to not support rescue and unrescue.
+* Use console access to get rescue-like access into the OS, although this may
+  not help in cases of lost password.
+
+Data model impact
+-----------------
+Essentially none. We will use instance_info to store, and subsequently
+retrieve, the hashed rescue_password while rescuing a node.
+
+State Machine Impact
+--------------------
+* Add states to the Ironic state machine: RESCUING, RESCUEWAIT, RESCUE,
+  RESCUEFAIL, UNRESCUING, UNRESCUEFAIL.
+* Add transitions to the Ironic state machine:
+
+  * ACTIVE -> RESCUING (initiate rescue)
+  * RESCUING -> RESCUE (rescue succeeds)
+  * RESCUING -> RESCUEWAIT (optionally, wait on external callback)
+  * RESCUING -> RESCUEFAIL (rescue fails)
+  * RESCUEWAIT -> RESCUE (finish rescue after callback)
+  * RESCUEWAIT -> RESCUEFAIL (callback fails)
+  * RESCUE -> RESCUING (re-rescue node)
+  * RESCUE -> DELETING (delete rescued node)
+  * RESCUE -> UNRESCUING (unrescue node)
+  * RESCUE -> UNRESCUEFAIL (unrescue fails)
+  * UNRESCUING -> UNRESCUEFAIL (unrescue fails)
+  * UNRESCUING -> ACTIVE (unrescue succeeds)
+  * UNRESCUEFAIL -> RESCUING (re-rescue node after failed unrescue)
+  * UNRESCUEFAIL -> UNRESCUING (re-unrescue node after failed unrescue)
+  * UNRESCUEFAIL -> DELETING (delete instance that failed unrescuing)
+  * RESCUEFAIL -> RESCUING (re-rescue after rescue failed)
+  * RESCUEFAIL -> UNRESCUING (unrescue after failed rescue)
+  * RESCUEFAIL -> DELETING (delete after failed rescue)
+
+* Add state machine verbs:
+
+  * RESCUE
+  * UNRESCUE
+
+REST API impact
+---------------
+Modify the provision state API to support the states and transitions described
+in this spec. Also increment the API microversion. Clients using an earlier
+microversion would be unable to modify nodes in states introduced by this spec
+(and the related, future microversion).
+
+Client (CLI) impact
+-------------------
+Support for the new verbs "rescue" and "unrescue" must be added to the client.
+
+RPC API impact
+--------------
+Add do_node_rescue and do_node_unrescue to the Conductor RPC API.
+
+Driver API impact
+-----------------
+None, because base.RescueInterface already exists in the driver API.
+
+Nova driver impact
+------------------
+Implement rescue() and unrescue() in the Nova driver. Add supporting methods
+including _wait_for_rescue(), _wait_for_unrescue(), and _hash_password().
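As an illustration only, the polling behaviour that a helper such as _wait_for_rescue() would need can be sketched as follows. This is not the Nova implementation: the function signature, the `get_provision_state` callable, the timeout and interval defaults, and the string values used for the states are all assumptions made for the example. Only the state names and the InstanceRescueFailure exception name come from this spec.

```python
import time


class InstanceRescueFailure(Exception):
    """Stand-in for the Nova exception named in the proposed changes."""


# State names from the State Machine Impact section. The string values
# are placeholders, not the values Ironic actually reports.
RESCUING = "RESCUING"
RESCUEWAIT = "RESCUEWAIT"
RESCUE = "RESCUE"
RESCUEFAIL = "RESCUEFAIL"


def wait_for_rescue(get_provision_state, timeout=60.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the node reaches RESCUE, fails, or the timeout expires.

    get_provision_state is a hypothetical zero-argument callable that
    returns the node's current provision state as a string.
    """
    deadline = clock() + timeout
    while True:
        state = get_provision_state()
        if state == RESCUE:
            return state
        if state == RESCUEFAIL:
            raise InstanceRescueFailure("node entered %s" % state)
        if state not in (RESCUING, RESCUEWAIT):
            # Any state outside the rescue path is treated as a failure.
            raise InstanceRescueFailure("unexpected state %s" % state)
        if clock() >= deadline:
            raise InstanceRescueFailure("timed out in state %s" % state)
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in unit tests without real delays; an unrescue helper would follow the same shape with UNRESCUING, ACTIVE, and UNRESCUEFAIL.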
+
+Ramdisk impact
+--------------
+An agent that wishes to support rescue should:
+    * Read and understand ipa-api-url kernel parameter for configuring API
+      endpoint
+    * Implement a client for Ironic's lookup API call
+    * The rescue_password will be in instance_info in the node object
+      returned by Ironic on lookup. This can be placed in a Linux-style
+      /etc/shadow entry to enable a new user account.
+    * Implement heartbeating to the appropriate API endpoint in Ironic
+    * After one heartbeat, the agent should then kick off any action needed
+      to reconfigure networking, such as re-DHCPing, as the Ironic conductor
+      will complete all actions to finish rescue - including moving the
+      node off a network with access to Ironic API, if relevant.
+    * Once the network is reconfigured, the agent process should shut down.
+      Rescue is complete.
+
+IPA will have a rescue extension added, implementing the above functionality.
+
+Security impact
+---------------
+The rescue_password must be sent from Nova to Ironic, and thereafter to the
+rescued node. If, at any step in this process, this password is intercepted
+or changed, an attacker can gain root access to the rescued node.
+
+Additionally, the lookup endpoint will be required to return the rescue
+password as a response to the first lookup once rescue is initiated. That
+means a properly executed timing attack could recover the password, but since
+this would also cause the rescue to fail (despite the node changing states),
+it's at worst a denial of service.
+
+Security vulnerabilities involving the rescue ramdisk are another source of
+attacks. This is different from existing ramdisk issues, as once the rescue
+is complete, the tenant would have access to the ramdisk. This means deployers
+may need to ensure no secret information (such as custom cleaning steps or
+firmware) is present in the rescue ramdisk.
+
+IPA is entirely unauthenticated. If IPA endpoints continue to be available
+after a node is rescued, then attackers with access to the tenant network
+would be able to leverage IPA's REST API to gain privileged access to the
+host. As such, IPA itself should be shut down, or the network should be
+sufficiently isolated during rescue operations.
+
+Other end user impact
+---------------------
+We will add rescue and unrescue commands to python-ironicclient.
+
+Scalability impact
+------------------
+None.
+
+Performance Impact
+------------------
+None.
+
+Other deployer impact
+---------------------
+Add conductor.rescue_callback_timeout config option.
+
+Multi-tenant deployers will most likely need to support two ramdisks: one
+running IPA for use with normal node-provisioning tasks, and another running
+IPA for rescue mode (with non-rescue endpoints disabled). This is to ensure
+the full suite of tooling and authentication needed for secure cleaning is not
+given to a tenant.
+
+Additionally, in some environments, operators may not want to use the full
+Ironic Python Agent inside the rescue ramdisk, due to its requirement for
+Python or its Linux-centric nature. They may use statically compiled software
+such as onmetal-rescue-agent [0]_ to perform the lookup and heartbeat needed
+to finalize rescue.
+
+Developer impact
+----------------
+None.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+Primary assignee:
+  JayF
+
+Other contributors:
+  Help Wanted!
+
+Work Items
+----------
+See proposed changes.
+
+Dependencies
+============
+Updating the Ironic virt driver in Nova to support this.
+
+Testing
+=======
+Unit tests and Tempest tests must be added.
+
+Upgrades and Backwards Compatibility
+====================================
+Clients that are unaware of rescue-related states may not function correctly
+with nodes that are in these states.
+
+Documentation Impact
+====================
+Write documentation.
+
+References
+==========
+.. [0] https://github.com/rackerlabs/onmetal-rescue-agent
diff --git a/specs/not-implemented/implement-rescue-mode.rst b/specs/not-implemented/implement-rescue-mode.rst
new file mode 120000
index 00000000..05efca5a
--- /dev/null
+++ b/specs/not-implemented/implement-rescue-mode.rst
@@ -0,0 +1 @@
+../approved/implement-rescue-mode.rst
\ No newline at end of file