Implement rescue mode

Implement Nova rescue/unrescue in Ironic. Also implement an extension in
IPA that carries out rescue-related tasks. After rescuing a node, it
will be left running a rescue ramdisk, configured with the
rescue_password, and listening with ssh on the specified network
interfaces.

Partial-bug: 1526449
Change-Id: Idc05cf7a9c6c1968e1403fc97bde3713d2e7e3f6
Co-authored-by: Alex Weeks <alex.weeks@gmail.com>
Jay Faulkner 2016-04-22 12:56:45 -07:00
parent c5ce3c717a
commit 8a6c05d6f9
2 changed files with 282 additions and 0 deletions

@ -0,0 +1,281 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================
Implement Rescue Mode
=====================

https://bugs.launchpad.net/ironic/+bug/1526449

Implement Nova rescue/unrescue in Ironic. Also implement an extension in IPA
that carries out rescue-related tasks. After rescuing a node, it will be left
running a rescue ramdisk, configured with the rescue_password, and listening
with ssh on the specified network interfaces.
Problem description
===================
Ironic does not currently implement the Nova rescue/unrescue interface.
Therefore, end users are left with few options for troubleshooting or fixing
anomalous and misconfigured nodes.
Proposed change
===============
* Implement rescue() and unrescue() in the Ironic virt driver (no spec
  required):
  https://blueprints.launchpad.net/nova/+spec/ironic-rescue-mode
* Add InstanceRescueFailure and InstanceUnRescueFailure exceptions to Nova
* Add methods to the Nova driver to poll Ironic while waiting for nodes to
  rescue and unrescue, as appropriate.
* Store the hashed and salted (crypt(3)) password in Ironic node
  instance_info for use in /etc/shadow on the rescued node
* Add a method to create a salted hash suitable for injecting into a Linux
  /etc/shadow file.
* Modify the Ironic state machine as described in the state machine impact
  section
* Add an AgentRescue driver (implements base.RescueInterface). This driver
  will be mixed into the agent_ipmitool and agent_pyghmi drivers.
* Add a periodic task, _check_rescue_timeouts, to fail the rescue process if
  it takes longer than rescue_callback_timeout seconds for the rescue ramdisk
  to come online.
* Add Conductor methods: do_node_rescue and do_node_unrescue
* Add Conductor RPC calls: do_node_rescue and do_node_unrescue (and
  increment the API version)
* Add the conductor.rescue_callback_timeout config option
* Add rescue-related functionality to Ironic Python Agent, including the
  ability to set the rescue password and kick off any needed network
  configuration automation
* Document good practices for building a rescue ramdisk in multi-tenant
  environments
* Add rescue_network configuration, which contains the UUID of the network the
  rescue agent should be booted onto. For security reasons, this should be
  separate from the networks used for provisioning and cleaning in
  multi-tenant environments.
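
The timeout check performed by _check_rescue_timeouts could look roughly like
the sketch below. It is illustrative only: the node representation (plain
dicts with ``uuid``, ``provision_state`` and ``provision_updated_at`` keys) and
the function name ``find_timed_out_rescues`` are assumptions made for this
example, not the conductor's actual data model or API.

```python
from datetime import datetime, timedelta, timezone

# State name taken from the state machine section of this spec.
RESCUEWAIT = "RESCUEWAIT"


def find_timed_out_rescues(nodes, rescue_callback_timeout, now=None):
    """Return UUIDs of nodes that have waited too long in RESCUEWAIT.

    ``nodes`` is an iterable of dicts with the hypothetical keys ``uuid``,
    ``provision_state`` and ``provision_updated_at`` (a timezone-aware
    datetime). ``rescue_callback_timeout`` is in seconds.
    """
    now = now or datetime.now(timezone.utc)
    deadline = now - timedelta(seconds=rescue_callback_timeout)
    return [
        node["uuid"]
        for node in nodes
        if node["provision_state"] == RESCUEWAIT
        and node["provision_updated_at"] <= deadline
    ]
```

A periodic task would then move each returned node to RESCUEFAIL.
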
An outline of the standard (non-error) rescue and unrescue processes follows.

Standard rescue process:

1. User calls Nova rescue() on a node.
2. Nova ComputeManager calls the virt driver's rescue() method, passing in
   rescue_password as a parameter.
3. Virt driver calls node.set_provision_state(RESCUE), with the
   rescue_password as a parameter.
4. Virt driver loops while waiting for provision_state to change, and updates
   Nova state as appropriate.
5. Ironic API receives the set_provision_state call and performs the
   do_node_rescue RPC call.
6. Ironic conductor hands off the call to the appropriate driver.
7. Driver boots the rescue ramdisk, using the configured boot driver. As part
   of this process, Ironic will put the node onto the rescue_network, as
   configured in ironic.conf.
8. The IPA ramdisk boots, performs a lookup, sets the rescue password based on
   the returned data, and begins heartbeating.
9. Upon receiving a heartbeat, the conductor (if multiple networks are in
   use) places the instance back onto the tenant network, isolating the
   rescue ramdisk from the Ironic control plane. When this completes, the
   node's provision state will change to RESCUE.
10. Inside the rescue image, automation (such as cloud-init) should configure
    the rescue image as appropriate.

Standard unrescue process:

1. User calls Nova unrescue() on a node.
2. Nova calls the Ironic virt driver's unrescue() method.
3. Virt driver removes rescue_password_hash from the node's instance_info
   (set during the rescue process).
4. Virt driver calls node.set_provision_state(ACTIVE).
5. Virt driver loops while waiting for provision_state to change, and updates
   Nova state as appropriate.
6. Ironic API receives the set_provision_state call and performs the
   do_node_unrescue RPC call.
7. Ironic conductor hands off the call to the appropriate driver.
8. Driver performs the actions required to boot the node normally, and sets
   the provision state to ACTIVE.

Rescue/Unrescue with standalone Ironic:

1. Call the Ironic provision state API with the verb "rescue", passing the
   rescue password as an argument.
2. When finished rescuing the instance, call the Ironic provision state API
   with the verb "unrescue".

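
For standalone use, the two calls above could be assembled as in the
following sketch. The spec does not fix the wire format, so the URL path, the
``target`` field, and the ``rescue_password`` field are assumptions for
illustration, as is the helper name ``build_provision_request``. The helper
only builds the request and performs no network I/O.

```python
import json


def build_provision_request(node_uuid, verb, rescue_password=None):
    """Build (method, path, body) for a rescue or unrescue request.

    The path and JSON field names here are assumed, not defined by this spec.
    """
    if verb not in ("rescue", "unrescue"):
        raise ValueError("unsupported verb: %s" % verb)
    body = {"target": verb}
    if verb == "rescue":
        # A password is only meaningful when entering rescue mode.
        if not rescue_password:
            raise ValueError("rescue requires a rescue password")
        body["rescue_password"] = rescue_password
    path = "/v1/nodes/%s/states/provision" % node_uuid
    return "PUT", path, json.dumps(body, sort_keys=True)
```

The caller would send the result with any HTTP client, along with a header
selecting an API microversion that understands the new verbs.
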
The current proposed methodology only works with Linux-based rescue ramdisks,
due to the use of crypt(3) to store the password in a one-way hash.
Alternatives
------------
* Continue to not support rescue and unrescue.
* Use console access to get rescue-like access into the OS, although this may
not help in cases of lost password.
Data model impact
-----------------
Essentially none. We will use instance_info to store, and subsequently
retrieve, the rescue_password while rescuing a node.
State Machine Impact
--------------------
* Add states to the Ironic state machine: RESCUING, RESCUEWAIT, RESCUE,
  RESCUEFAIL, UNRESCUING, UNRESCUEFAIL.
* Add transitions to the Ironic state machine:

  * ACTIVE -> RESCUING (initiate rescue)
  * RESCUING -> RESCUE (rescue succeeds)
  * RESCUING -> RESCUEWAIT (optionally, wait on external callback)
  * RESCUING -> RESCUEFAIL (rescue fails)
  * RESCUEWAIT -> RESCUE (finish rescue after callback)
  * RESCUEWAIT -> RESCUEFAIL (callback fails)
  * RESCUE -> RESCUING (re-rescue node)
  * RESCUE -> DELETING (delete rescued node)
  * RESCUE -> UNRESCUING (unrescue node)
  * RESCUE -> UNRESCUEFAIL (unrescue fails)
  * UNRESCUING -> UNRESCUEFAIL (unrescue fails)
  * UNRESCUING -> ACTIVE (unrescue succeeds)
  * UNRESCUEFAIL -> RESCUING (re-rescue node after failed unrescue)
  * UNRESCUEFAIL -> UNRESCUING (re-unrescue node after failed unrescue)
  * UNRESCUEFAIL -> DELETING (delete instance that failed unrescuing)
  * RESCUEFAIL -> RESCUING (re-rescue after rescue failed)
  * RESCUEFAIL -> UNRESCUING (unrescue after failed rescue)
  * RESCUEFAIL -> DELETING (delete after failed rescue)

* Add state machine verbs:

  * RESCUE
  * UNRESCUE

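
The transitions listed above can be written down as a lookup table, which
makes it easy to check whether a requested change is allowed. This is a
minimal sketch covering only the rescue-related transitions from this spec.
The names ``TRANSITIONS`` and ``can_transition`` are hypothetical and do not
reflect how the Ironic state machine is actually implemented.

```python
# Rescue-related transitions only, copied from the list in this spec.
TRANSITIONS = {
    "ACTIVE": {"RESCUING"},
    "RESCUING": {"RESCUE", "RESCUEWAIT", "RESCUEFAIL"},
    "RESCUEWAIT": {"RESCUE", "RESCUEFAIL"},
    "RESCUE": {"RESCUING", "DELETING", "UNRESCUING", "UNRESCUEFAIL"},
    "UNRESCUING": {"UNRESCUEFAIL", "ACTIVE"},
    "UNRESCUEFAIL": {"RESCUING", "UNRESCUING", "DELETING"},
    "RESCUEFAIL": {"RESCUING", "UNRESCUING", "DELETING"},
}


def can_transition(source, target):
    """Return True if source -> target is one of the listed transitions."""
    return target in TRANSITIONS.get(source, set())
```

For example, ``can_transition("ACTIVE", "RESCUING")`` is true, while a direct
``ACTIVE`` to ``RESCUE`` change is rejected because a node must pass through
RESCUING first.
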
REST API impact
---------------
Modify the provision state API to support the states and transitions described
in this spec, and increment the API microversion. Clients using an earlier
microversion would be unable to modify nodes that are in states introduced by
this spec (and the related, future microversion).
Client (CLI) impact
-------------------
Support for the new verbs "rescue" and "unrescue" must be added to the client.
RPC API impact
--------------
Add do_node_rescue and do_node_unrescue to the Conductor RPC API.
Driver API impact
-----------------
None. The RescueInterface is already defined, so no driver API changes are
needed.
Nova driver impact
------------------
Implement rescue() and unrescue() in the Nova driver. Add supporting methods
including _wait_for_rescue(), _wait_for_unrescue(), and _hash_password().
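
A polling helper such as _wait_for_rescue() might be shaped like the sketch
below. It is an assumption-laden illustration: the real Nova driver would use
its own looping and client utilities, the exception class here is a local
stand-in for the proposed InstanceRescueFailure, and ``get_provision_state``
is a hypothetical callable that returns the node's current provision state.

```python
import time


class InstanceRescueFailure(Exception):
    """Local stand-in for the Nova exception proposed in this spec."""


def wait_for_rescue(get_provision_state, interval=2.0, max_polls=60,
                    sleep=time.sleep):
    """Poll until the node reaches RESCUE, fails, or polling runs out."""
    for _ in range(max_polls):
        state = get_provision_state()
        if state == "RESCUE":
            return state
        if state == "RESCUEFAIL":
            raise InstanceRescueFailure("node entered RESCUEFAIL")
        # Still RESCUING or RESCUEWAIT: wait before asking again.
        sleep(interval)
    raise InstanceRescueFailure("timed out waiting for rescue")
```

The ``sleep`` parameter is injectable so the loop can be unit tested without
real delays. An unrescue variant would wait for ACTIVE and treat UNRESCUEFAIL
as the failure state.
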
Ramdisk impact
--------------
An agent that wishes to support rescue should:

* Read and understand the ipa-api-url kernel parameter for configuring the
  API endpoint
* Implement a client for Ironic's lookup API call

  * The rescue_password will be in instance_info in the node object
    returned by Ironic on lookup. This can be placed in a Linux-style
    /etc/shadow entry to enable a new user account.

* Implement heartbeating to the appropriate API endpoint in Ironic

  * After one heartbeat, the agent should kick off any action needed
    to reconfigure networking, such as re-DHCPing, because the Ironic
    conductor will complete all actions to finish rescue, including moving
    the node off a network with access to the Ironic API, if relevant.
  * Once the network is reconfigured, the agent process should shut down.
    Rescue is complete.

IPA will have a rescue extension added, implementing the above functionality.
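
To illustrate how an agent might turn the hashed password from lookup into an
/etc/shadow entry, here is a minimal sketch. The function name
``shadow_entry``, the account name used in the usage note, and the aging
values (0, 99999, 7) are assumptions chosen for the example. The nine
colon-separated fields follow the standard shadow(5) layout.

```python
def shadow_entry(username, password_hash, last_changed_days):
    """Format one /etc/shadow line for a crypt(3)-style password hash.

    ``last_changed_days`` is the date of the last password change, expressed
    as days since 1970-01-01.
    """
    for value in (username, password_hash):
        # A colon or newline would corrupt the field layout of the file.
        if ":" in value or "\n" in value:
            raise ValueError("invalid character in shadow field")
    fields = [
        username,
        password_hash,
        str(last_changed_days),
        "0",      # minimum password age (assumed)
        "99999",  # maximum password age (assumed)
        "7",      # warning period (assumed)
        "",       # inactivity period
        "",       # account expiration date
        "",       # reserved
    ]
    return ":".join(fields)
```

For a hypothetical ``rescue`` account, the agent would append the returned
line to /etc/shadow (and add a matching /etc/passwd entry) before it starts
heartbeating.
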
Security impact
---------------
The rescue_password must be sent from Nova to Ironic, and thereafter to the
rescued node. If, at any step in this process, this password is intercepted
or changed, an attacker can gain root access to the rescued node.
Additionally, the lookup endpoint will be required to return the rescue
password as a response to the first lookup once rescue is initiated. That
means a properly executed timing attack could recover the password, but since
this would also cause the rescue to fail (despite the node changing states),
it's at worst a denial of service.
Security vulnerabilities in the rescue ramdisk are another source of attacks.
This differs from existing ramdisk issues because, once the rescue is
complete, the tenant has access to the ramdisk. Deployers may therefore need
to ensure that no secret information (such as custom cleaning steps or
firmware) is present in the rescue ramdisk.
IPA is entirely unauthenticated. If IPA endpoints continue to be available
after a node is rescued, then attackers with access to the tenant network
would be able to leverage IPA's REST API to gain privileged access to the
host. As such, IPA itself should be shut down, or the network should be
sufficiently isolated during rescue operations.
Other end user impact
---------------------
We will add rescue and unrescue commands to python-ironicclient.
Scalability impact
------------------
None.
Performance Impact
------------------
None.
Other deployer impact
---------------------
Add conductor.rescue_callback_timeout config option.
Multi-tenant deployers will most likely need to support two ramdisks: one
running IPA for use with normal node-provisioning tasks, and another running
IPA for rescue mode (with non-rescue endpoints disabled). This ensures that
the full suite of tooling and authentication needed for secure cleaning is not
given to a tenant.
Additionally, in some environments, operators may not want to use the full
Ironic Python Agent inside the rescue ramdisk, due to its requirement for
Python or its Linux-centric nature. They may use statically compiled software
such as onmetal-rescue-agent [0]_ to perform the lookup and heartbeat needed
to finalize rescue.
Developer impact
----------------
None.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
JayF
Other contributors:
Help Wanted!
Work Items
----------
See proposed changes.
Dependencies
============
Updating the Ironic virt driver in Nova to support this.
Testing
=======
Unit tests and Tempest tests must be added.
Upgrades and Backwards Compatibility
====================================
Clients that are unaware of rescue-related states may not function correctly
with nodes that are in these states.
Documentation Impact
====================
Write documentation.
References
==========
.. [0] https://github.com/rackerlabs/onmetal-rescue-agent

@ -0,0 +1 @@
../approved/implement-rescue-mode.rst