Implement rescue mode

Implement Nova rescue/unrescue in Ironic. Also implement an extension in IPA
that carries out rescue-related tasks. After rescuing a node, it will be left
running a rescue ramdisk, configured with the rescue_password, and listening
with ssh on the specified network interfaces.

Partial-bug: 1526449
Change-Id: Idc05cf7a9c6c1968e1403fc97bde3713d2e7e3f6
Co-authored-by: Alex Weeks <alex.weeks@gmail.com>
2016-04-22 12:56:45 -07:00
commit 8a6c05d6f9
parent c5ce3c717a
2 changed files with 282 additions and 0 deletions
--- a/specs/approved/implement-rescue-mode.rst
+++ b/specs/approved/implement-rescue-mode.rst
@@ -0,0 +1,281 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=====================
+Implement Rescue Mode
+=====================
+
+https://bugs.launchpad.net/ironic/+bug/1526449
+
+Implement Nova rescue/unrescue in Ironic. Also implement an extension in IPA
+that carries out rescue-related tasks. After rescuing a node, it will be left
+running a rescue ramdisk, configured with the rescue_password, and listening
+with ssh on the specified network interfaces.
+
+Problem description
+===================
+
+Ironic does not currently implement the Nova rescue/unrescue interface.
+Therefore, end users are left with few options for troubleshooting or fixing
+anomalous and misconfigured nodes.
+
+Proposed change
+===============
+* Implement rescue() and unrescue() in the Ironic virt driver (no spec
+  required):
+  https://blueprints.launchpad.net/nova/+spec/ironic-rescue-mode
+* Add InstanceRescueFailure and InstanceUnRescueFailure exceptions to Nova
+* Add methods to Nova driver to poll Ironic to wait for nodes to rescue and
+  unrescue, as appropriate.
+* Store the salted, hashed (crypt(3)) password in Ironic node instance_info
+  for use in /etc/shadow on the rescued node
+* Add a method to create a salted hash suitable for injecting into a Linux
+  /etc/shadow file.
+* Modify Ironic state machine as described in the state machine impact section
+* Add AgentRescue driver (implements base.RescueInterface). This driver will
+  be mixed into the agent_ipmitool and agent_pyghmi drivers.
+* Add periodic task _check_rescue_timeouts to fail the rescue process if
+  it takes longer than rescue_callback_timeout seconds for the rescue ramdisk
+  to come online.
+* Add Conductor methods: do_node_rescue and do_node_unrescue
+* Add Conductor RPC calls: do_node_rescue and do_node_unrescue (and
+  increment API version)
+* Add conductor.rescue_callback_timeout config option
+* Add rescue-related functionality to Ironic Python Agent including ability
+  to set rescue password and kick off any needed network configuration
+  automation
+* Document good practices for building rescue ramdisks in multi-tenant
+  environments
+* Add rescue_network configuration, which contains the UUID of the network the
+  rescue agent should be booted onto. For security reasons, this should be
+  separate from the networks used for provisioning and cleaning in multi-tenant
+  environments.
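The timeout check performed by _check_rescue_timeouts could look roughly like the following. This is a minimal sketch under stated assumptions, not Ironic code: the ``Node`` dataclass, the ``fail_timed_out_rescues`` helper, and the use of plain strings for states are hypothetical stand-ins. Only rescue_callback_timeout and the RESCUEWAIT and RESCUEFAIL state names come from this spec.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Node:
    # Hypothetical stand-in for an Ironic node record.
    uuid: str
    provision_state: str
    provision_updated_at: datetime


def fail_timed_out_rescues(nodes, rescue_callback_timeout, now):
    """Move nodes stuck in RESCUEWAIT for too long to RESCUEFAIL.

    Returns the UUIDs of the nodes that were failed.
    """
    # Any node that entered RESCUEWAIT before this instant has timed out.
    deadline = now - timedelta(seconds=rescue_callback_timeout)
    failed = []
    for node in nodes:
        if (node.provision_state == "RESCUEWAIT"
                and node.provision_updated_at < deadline):
            node.provision_state = "RESCUEFAIL"
            failed.append(node.uuid)
    return failed
```

Passing ``now`` in explicitly, rather than reading the clock inside the helper, keeps the check deterministic and easy to unit test.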
+
+An outline of the standard (non-error) rescue and unrescue processes follows:
+
+Standard rescue process:
+
+1. User calls Nova rescue() on a node.
+2. Nova ComputeManager calls the virt driver's rescue() method, passing in
+   rescue_password as a parameter.
+3. Virt driver calls node.set_provision_state(RESCUE), with the rescue_password
+   as a parameter.
+4. Virt driver loops while waiting for provision_state to change, and updates
+   Nova state as appropriate.
+5. Ironic API receives set_provision_state call, and performs do_node_rescue
+   RPC call.
+6. Ironic conductor hands off call to appropriate driver.
+7. Driver boots rescue ramdisk, using the configured boot driver. As part of
+   this process, Ironic will put the node onto the rescue_network, as
+   configured in ironic.conf.
+8. IPA ramdisk boots, performs a lookup, sets the rescue password based on
+   the returned data, and begins heartbeating.
+9. Upon receiving a heartbeat, the conductor (if using multiple networks)
+   places the instance back onto the tenant network, isolating the rescue
+   ramdisk from the Ironic control plane. When this completes, the node's
+   provision state will change to RESCUE.
+10. Inside the rescue image, automation (such as cloud-init) should configure
+    the rescue image as appropriate.
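The wait loop in step 4 could be sketched as below. This is an illustration only: ``wait_for_rescue``, ``RescueFailure`` (standing in for the proposed InstanceRescueFailure), and the ``get_provision_state`` callable are hypothetical names, and the timeout and interval defaults are arbitrary. A real virt driver would query Ironic through its client and update Nova state as it goes.

```python
import time


class RescueFailure(Exception):
    """Hypothetical stand-in for Nova's proposed InstanceRescueFailure."""


def wait_for_rescue(get_provision_state, timeout=60.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the node reaches RESCUE, fails, or the timeout expires.

    get_provision_state is any zero-argument callable that returns the
    node's current provision state as a string.
    """
    deadline = clock() + timeout
    while True:
        state = get_provision_state()
        if state == "RESCUE":
            return
        if state == "RESCUEFAIL":
            raise RescueFailure("node entered RESCUEFAIL")
        if clock() >= deadline:
            raise RescueFailure("timed out waiting for rescue")
        # Intermediate states (RESCUING, RESCUEWAIT): keep waiting.
        sleep(interval)
```

The same shape would serve for unrescue, with ACTIVE as the success state and UNRESCUEFAIL as the failure state.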
+
+Standard unrescue process:
+
+1. User calls Nova unrescue() on a node.
+2. Nova calls the Ironic virt driver's unrescue() method.
+3. Virt driver removes rescue_password_hash from node instance info (set
+   during rescue process).
+4. Virt driver calls node.set_provision_state(ACTIVE).
+5. Virt driver loops while waiting for provision_state to change, and updates
+   Nova state as appropriate.
+6. Ironic API receives set_provision_state call, and performs
+   do_node_unrescue RPC call.
+7. Ironic conductor hands off call to appropriate driver.
+8. Driver performs actions required to boot node normally, and sets provision
+   state to ACTIVE.
+
+Rescue/Unrescue with standalone Ironic:
+
+1. Call the Ironic provision state API with the verb "rescue", passing the
+   rescue password as an argument.
+2. When finished with rescuing the instance, call the Ironic provision state
+   API with the verb "unrescue".
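A standalone caller might build the provision state request as sketched below. The ``build_provision_request`` helper is hypothetical, and the ``rescue_password`` field name is an assumption: this spec says only that the password is passed as an argument. The request is constructed but not sent, and authentication and the microversion header are left out.

```python
import json
import urllib.request


def build_provision_request(api_url, node_uuid, verb, rescue_password=None):
    """Build (but do not send) a provision state change request."""
    body = {"target": verb}
    if rescue_password is not None:
        # Assumed field name; the spec does not fix it.
        body["rescue_password"] = rescue_password
    return urllib.request.Request(
        url=f"{api_url}/v1/nodes/{node_uuid}/states/provision",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Unrescue would use the same helper with the verb "unrescue" and no password.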
+
+The current proposed methodology only works with Linux-based rescue ramdisks,
+due to the use of crypt(3) to store the password in a one-way hash.
+
+Alternatives
+------------
+* Continue to not support rescue and unrescue.
+* Use console access to get rescue-like access into the OS, although this may
+  not help in cases of lost password.
+
+Data model impact
+-----------------
+Essentially none.  We will use instance_info to store, and subsequently
+retrieve, the hashed rescue_password while rescuing a node.
+
+State Machine Impact
+--------------------
+* Add states to the Ironic state machine: RESCUING, RESCUEWAIT, RESCUE,
+  RESCUEFAIL, UNRESCUING, UNRESCUEFAIL.
+* Add transitions to the Ironic state machine:
+
+  * ACTIVE -> RESCUING (initiate rescue)
+  * RESCUING -> RESCUE (rescue succeeds)
+  * RESCUING -> RESCUEWAIT (optionally, wait on external callback)
+  * RESCUING -> RESCUEFAIL (rescue fails)
+  * RESCUEWAIT -> RESCUE (finish rescue after callback)
+  * RESCUEWAIT -> RESCUEFAIL (callback fails)
+  * RESCUE -> RESCUING (re-rescue node)
+  * RESCUE -> DELETING (delete rescued node)
+  * RESCUE -> UNRESCUING (unrescue node)
+  * RESCUE -> UNRESCUEFAIL (unrescue fails)
+  * UNRESCUING -> UNRESCUEFAIL (unrescue fails)
+  * UNRESCUING -> ACTIVE (unrescue succeeds)
+  * UNRESCUEFAIL -> RESCUING (re-rescue node after failed unrescue)
+  * UNRESCUEFAIL -> UNRESCUING (re-unrescue node after failed unrescue)
+  * UNRESCUEFAIL -> DELETING (delete instance that failed unrescuing)
+  * RESCUEFAIL -> RESCUING (re-rescue after rescue failed)
+  * RESCUEFAIL -> UNRESCUING (unrescue after failed rescue)
+  * RESCUEFAIL -> DELETING (delete after failed rescue)
+
+* Add state machine verbs:
+
+  * RESCUE
+  * UNRESCUE
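The transitions listed above can be written down as a lookup table, which makes it easy to check whether a requested move is allowed. This is a sketch for illustration, not Ironic's state machine implementation: ``TRANSITIONS`` and ``can_transition`` are hypothetical names, and states are plain strings.

```python
# Allowed target states for each source state, as listed in this spec.
TRANSITIONS = {
    "ACTIVE": {"RESCUING"},
    "RESCUING": {"RESCUE", "RESCUEWAIT", "RESCUEFAIL"},
    "RESCUEWAIT": {"RESCUE", "RESCUEFAIL"},
    "RESCUE": {"RESCUING", "DELETING", "UNRESCUING", "UNRESCUEFAIL"},
    "UNRESCUING": {"UNRESCUEFAIL", "ACTIVE"},
    "UNRESCUEFAIL": {"RESCUING", "UNRESCUING", "DELETING"},
    "RESCUEFAIL": {"RESCUING", "UNRESCUING", "DELETING"},
}


def can_transition(source, target):
    """Return True if this spec lists a transition from source to target."""
    return target in TRANSITIONS.get(source, set())
```

For example, ``can_transition("RESCUEWAIT", "RESCUE")`` holds, while ``can_transition("ACTIVE", "RESCUE")`` does not, because a node must pass through RESCUING first.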
+
+REST API impact
+---------------
+Modify the provision state API to support the states and transitions
+described in this spec.  Also increment the API microversion. Clients using
+an earlier microversion would be unable to modify nodes that are in states
+introduced by this spec (and the related, future microversion).
+
+Client (CLI) impact
+-------------------
+Support for the new verbs "rescue" and "unrescue" must be added to the client.
+
+RPC API impact
+--------------
+Add do_node_rescue and do_node_unrescue to the Conductor RPC API.
+
+Driver API impact
+-----------------
+None. The RescueInterface is already defined in the driver API.
+
+Nova driver impact
+------------------
+Implement rescue() and unrescue() in the Nova driver.  Add supporting methods
+including _wait_for_rescue(), _wait_for_unrescue(), and _hash_password().
+
+Ramdisk impact
+--------------
+An agent that wishes to support rescue should:
+
+* Read and understand the ipa-api-url kernel parameter for configuring the
+  API endpoint.
+* Implement a client for Ironic's lookup API call.
+
+  * The rescue_password will be in instance_info in the node object
+    returned by Ironic on lookup. This can be placed in a Linux-style
+    /etc/shadow entry to enable a new user account.
+
+* Implement heartbeating to the appropriate API endpoint in Ironic.
+
+  * After one heartbeat, the agent should kick off any action needed to
+    reconfigure networking, such as re-DHCPing, because the Ironic
+    conductor will complete all actions to finish rescue, including moving
+    the node off a network with access to the Ironic API, if relevant.
+  * Once the network is reconfigured, the agent process should shut down.
+    Rescue is complete.
+
+IPA will have a rescue extension added, implementing the above functionality.
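Placing the hashed password into a Linux-style /etc/shadow entry could be sketched as follows. This assumes the password arrives already hashed in crypt(3) format. The ``shadow_entry`` helper, the ``rescue`` user name, and the password-aging values are hypothetical choices for illustration, not part of this spec.

```python
from datetime import date


def shadow_entry(username, password_hash, today):
    """Format one /etc/shadow line for an already-hashed password.

    A shadow entry has nine colon-separated fields; the last three
    (inactivity period, expiration date, reserved) are left empty here.
    """
    for value in (username, password_hash):
        if ":" in value or "\n" in value:
            raise ValueError("field must not contain ':' or a newline")
    # The third field is the last password change, in days since 1970-01-01.
    last_change = (today - date(1970, 1, 1)).days
    fields = [username, password_hash, str(last_change),
              "0", "99999", "7", "", "", ""]
    return ":".join(fields)
```

Rejecting colons and newlines matters because the hash comes from outside the ramdisk and must not be able to inject extra fields or extra accounts.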
+
+Security impact
+---------------
+The rescue_password must be sent from Nova to Ironic, and thereafter to the
+rescued node.  If, at any step in this process, this password is intercepted
+or changed, an attacker can gain root access to the rescued node.
+
+Additionally, the lookup endpoint will be required to return the rescue
+password as a response to the first lookup once rescue is initiated. That
+means a properly executed timing attack could recover the password, but since
+this would also cause the rescue to fail (despite the node changing states),
+it's at worst a denial of service.
+
+Security vulnerabilities involving the rescue ramdisk are another source of
+attacks. This is different from existing ramdisk issues, as once the rescue
+is complete, the tenant would have access to the ramdisk. This means deployers
+may need to ensure that no secret information (such as custom cleaning steps
+or firmware) is present in the rescue ramdisk.
+
+IPA is entirely unauthenticated.  If IPA endpoints continue to be available
+after a node is rescued, then attackers with access to the tenant network
+would be able to leverage IPA's REST API to gain privileged access to the
+host. As such, IPA itself should be shut down, or the network should be
+sufficiently isolated during rescue operations.
+
+Other end user impact
+---------------------
+We will add rescue and unrescue commands to python-ironicclient.
+
+Scalability impact
+------------------
+None.
+
+Performance Impact
+------------------
+None.
+
+Other deployer impact
+---------------------
+Add conductor.rescue_callback_timeout config option.
+
+Multi-tenant deployers will most likely need to support two ramdisks: one
+running IPA for use with normal node-provisioning tasks, and another running
+IPA for rescue mode (with non-rescue endpoints disabled). This ensures that
+the full suite of tooling and authentication needed for secure cleaning is
+not given to a tenant.
+
+Additionally, in some environments, operators may not want to use the full
+Ironic Python Agent inside the rescue ramdisk, due to its requirement for
+Python or its Linux-centric nature. They may use statically compiled software
+such as onmetal-rescue-agent [0]_ to perform the lookup and heartbeat needed
+to finalize rescue.
+
+Developer impact
+----------------
+None.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+Primary assignee:
+  JayF
+
+Other contributors:
+  Help Wanted!
+
+Work Items
+----------
+See proposed changes.
+
+Dependencies
+============
+Updating the Ironic virt driver in Nova to support this.
+
+Testing
+=======
+Unit tests and Tempest tests must be added.
+
+Upgrades and Backwards Compatibility
+====================================
+Clients that are unaware of rescue-related states may not function correctly
+with nodes that are in these states.
+
+Documentation Impact
+====================
+Write documentation.
+
+References
+==========
+.. [0] https://github.com/rackerlabs/onmetal-rescue-agent
--- a/specs/not-implemented/implement-rescue-mode.rst
+++ b/specs/not-implemented/implement-rescue-mode.rst
@@ -0,0 +1 @@
+../approved/implement-rescue-mode.rst