Implement rescue mode
Implement Nova rescue/unrescue in Ironic. Also implement an extension in IPA that carries out rescue-related tasks. After rescuing a node, it will be left running a rescue ramdisk, configured with the rescue_password, and listening with ssh on the specified network interfaces. Partial-bug; 1526449 Change-Id: Idc05cf7a9c6c1968e1403fc97bde3713d2e7e3f6 Co-authored-by: Alex Weeks <alex.weeks@gmail.com>
This commit is contained in:
parent
c5ce3c717a
commit
8a6c05d6f9
281
specs/approved/implement-rescue-mode.rst
Normal file
281
specs/approved/implement-rescue-mode.rst
Normal file
@ -0,0 +1,281 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================
|
||||
Implement Rescue Mode
|
||||
=====================
|
||||
|
||||
https://bugs.launchpad.net/ironic/+bug/1526449
|
||||
|
||||
Implement Nova rescue/unrescue in Ironic. Also implement an extension in IPA
|
||||
that carries out rescue-related tasks. After rescuing a node, it will be left
|
||||
running a rescue ramdisk, configured with the rescue_password, and listening
|
||||
with ssh on the specified network interfaces.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Ironic does not currently implement the Nova rescue/unrescue interface.
|
||||
Therefore, end users are left with few options for troubleshooting or fixing
|
||||
anomalous and misconfigured nodes.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
* Implement rescue(), and unrescue() in the Ironic virt driver (no spec req'd):
|
||||
https://blueprints.launchpad.net/nova/+spec/ironic-rescue-mode
|
||||
* Add InstanceRescueFailure, and InstanceUnRescueFailure exceptions to Nova
|
||||
* Add methods to Nova driver to poll Ironic to wait for nodes to rescue and
|
||||
unrescue, as appropriate.
|
||||
* Store hashed/salted (crypt(3)) password in Ironic node instance_info for
|
||||
use in /etc/shadow on the rescued node
|
||||
* Add method to create salted hash appropriate for injecting into linux
|
||||
/etc/shadow file.
|
||||
* Modify Ironic state machine as described in the state machine impact section
|
||||
* Add AgentRescue driver (implements base.RescueInterface). This driver will
|
||||
be mixed into the agent_ipmitool and agent_pyghmi drivers.
|
||||
* Add periodic task _check_rescue_timeouts to fail the rescue process if
|
||||
it takes longer than rescue_callback_timeout seconds for the rescue ramdisk
|
||||
to come online.
|
||||
* Add Conductor methods: do_node_rescue, and do_node_unrescue
|
||||
* Add Conductor RPC calls: do_node_rescue, and do_node_unrescue (and
|
||||
increment API version)
|
||||
* Add conductor.rescue_callback_timeout config option
|
||||
* Add rescue-related functionality to Ironic Python Agent including ability
|
||||
to set rescue password and kick off any needed network configuration
|
||||
automation
|
||||
* Documentation of good practices for building rescue ramdisk in multitenant
|
||||
environments
|
||||
* Add rescue_network configuration, which contains the UUID of the network the
|
||||
rescue agent should be booted onto. For security reasons, this should be
|
||||
separate from the networks used for provisioning and cleaning in multi-tenant
|
||||
environments.
|
||||
|
||||
An outline of the standard (non-error) rescue and unrescue processes follows:
|
||||
|
||||
Standard rescue process:
|
||||
|
||||
1. User calls Nova rescue() on a node.
|
||||
2. Nova ComputeManager calls the virt driver's rescue() method, passing in
|
||||
rescue_password as a parameter.
|
||||
3. Virt driver calls node.set_provision_state(RESCUE), with the rescue_password
|
||||
as a parameter.
|
||||
4. Virt driver loops while waiting for provision_state to change, and updates
|
||||
Nova state as appropriate.
|
||||
5. Ironic API receives set_provision_state call, and performs do_node_rescue
|
||||
RPC call.
|
||||
6. Ironic conductor hands off call to appropriate driver.
|
||||
7. Driver boots rescue ramdisk, using the configured boot driver. As part of
|
||||
this process, Ironic will put the node onto the rescue_network, as
|
||||
configured in ironic.conf.
|
||||
8. IPA ramdisk boots, performs a lookup and sets rescue password based on
|
||||
returned data and begins heartbeating.
|
||||
9. Upon receiving heartbeat the conductor, if using multiple networks, places
|
||||
the instance back onto the tenant network, isolating the rescue ramdisk
|
||||
from the ironic control plane. When this completes, the node's provision
|
||||
state will change to RESCUE.
|
||||
10. Inside the rescue image, automation (such as cloud-init), should configure
|
||||
the rescue image as appropriate.
|
||||
|
||||
Standard Unrescue process:
|
||||
|
||||
1. User calls Nova unrescue() on a node.
|
||||
2. Nova calls Ironic unrescue() virt driver.
|
||||
3. Virt driver removes rescue_password_hash from node instance info (set
|
||||
during rescue process).
|
||||
4. Virt driver calls node.set_provision_state(ACTIVE).
|
||||
5. Virt driver loops while waiting for provision_state to change, and updates
|
||||
Nova state as appropriate.
|
||||
6. Ironic API receives set_provision_state call, and performs
|
||||
do_node_unrescue RPC call.
|
||||
7. Ironic conductor hands off call to appropriate driver.
|
||||
8. Driver performs actions required to boot node normally, and sets provision
|
||||
state to ACTIVE.
|
||||
|
||||
Rescue/Unrescue with standalone Ironic:
|
||||
|
||||
1. Call Ironic provision state API with verb "rescue", with the rescue password
|
||||
as an argument.
|
||||
2. When finished with rescuing the instance, call Ironic provision state API
|
||||
with "unrescue" verb
|
||||
|
||||
The current proposed methodology only works with Linux-based rescue ramdisks,
|
||||
due to the use of crypt(3) to store the password in a one-way hash.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
* Continue to not support rescue and unrescue.
|
||||
* Use console access to get rescue-like access into the OS, although this may
|
||||
not help in cases of lost password.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
Essentially none. We will use instance_info to store, and subsequently
|
||||
retrieve, the rescue_password while rescuing a node.
|
||||
|
||||
State Machine Impact
|
||||
--------------------
|
||||
* Add states to the Ironic state machine: RESCUING, RESCUEWAIT, RESCUE,
|
||||
RESCUEFAIL, UNRESCUING, UNRESCUEFAIL.
|
||||
* Add transitions to the Ironic state machine:
|
||||
|
||||
* ACTIVE -> RESCUING (initiate rescue)
|
||||
* RESCUING -> RESCUE (rescue succeeds)
|
||||
* RESCUING -> RESCUEWAIT (optionally, wait on external callback)
|
||||
* RESCUING -> RESCUEFAIL (rescue fails)
|
||||
* RESCUEWAIT -> RESCUE (finish rescue after callback)
|
||||
* RESCUEWAIT -> RESCUEFAIL (callback fails)
|
||||
* RESCUE -> RESCUING (re-rescue node)
|
||||
* RESCUE -> DELETING (delete rescued node)
|
||||
* RESCUE -> UNRESCUING (unrescue node)
|
||||
* RESCUE -> UNRESCUEFAIL (unrescue fails)
|
||||
* UNRESCUING -> UNRESCUEFAIL (unrescue fails)
|
||||
* UNRESCUING -> ACTIVE (unrescue succeeds)
|
||||
* UNRESCUEFAIL -> RESCUING (re-rescue node after failed unrescue)
|
||||
* UNRESCUEFAIL -> UNRESCUING (re-unrescue node after failed unrescue)
|
||||
* UNRESCUEFAIL -> DELETING (delete instance that failed unrescuing)
|
||||
* RESCUEFAIL -> RESCUING (re-rescue after rescue failed)
|
||||
* RESCUEFAIL -> UNRESCUING (unrescue after failed rescue)
|
||||
* RESCUEFAIL -> DELETING (delete after failed rescue)
|
||||
|
||||
* Add state machine verbs:
|
||||
|
||||
* RESCUE
|
||||
* UNRESCUE
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
Modify provision state API to support the states and transitions described in
|
||||
this spec. Also increment the API microversion. Nodes in states introduced by
|
||||
this spec (and related, future microversion) would be unable to be modified by
|
||||
clients using an earlier microversion.
|
||||
|
||||
Client (CLI) impact
|
||||
-------------------
|
||||
Support for the new verbs "rescue" and "unrescue" must be added to the client.
|
||||
|
||||
RPC API impact
|
||||
--------------
|
||||
Add do_node_rescue and do_node_unrescue to the Conductor RPC API.
|
||||
|
||||
Driver API impact
|
||||
-----------------
|
||||
None, because we defined the RescueInterface a long time ago.
|
||||
|
||||
Nova driver impact
|
||||
------------------
|
||||
Implement rescue() and unrescue() in the Nova driver. Add supporting methods
|
||||
including _wait_for_rescue(), _wait_for_unrescue(), and _hash_password().
|
||||
|
||||
Ramdisk impact
|
||||
--------------
|
||||
An agent that wishes to support rescue should:
|
||||
* Read and understand ipa-api-url kernel parameter for configuring API
|
||||
endpoint
|
||||
* Implement a client for ironic's lookup API call
|
||||
* The rescue_password will be in instance_info in the node object
|
||||
returned by Ironic on lookup. This can be placed in a linux-style
|
||||
/etc/shadow entry to enable a new user account.
|
||||
* Implement heartbeating to the appropriate API endpoint in Ironic
|
||||
* After one heartbeat, the agent should then kickoff any action needed
|
||||
to reconfigure networking, such as re-DHCPing, as the Ironic conductor
|
||||
will complete all actions to finish rescue - including moving the
|
||||
node off a network with access to Ironic API, if relevant.
|
||||
* Once network is reconfigured, the agent process should shutdown. Rescue
|
||||
is complete.
|
||||
|
||||
IPA will have a rescue extension added, implementing the above functionality.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
The rescue_password must be sent from Nova to Ironic, and thereafter to the
|
||||
rescued node. If, at any step in this process, this password is intercepted
|
||||
or changed, an attacker can gain root access to the rescued node.
|
||||
|
||||
Additionally, the lookup endpoint will be required to return the rescue
|
||||
password as a response to the first lookup once rescue is initiated. That
|
||||
means a properly executed timing attack could recover the password, but since
|
||||
this would also cause the rescue to fail (despite the node changing states),
|
||||
it's at worst a denial of service.
|
||||
|
||||
Security vulnerabilities involving the rescue ramdisk is another source of
|
||||
attacks. This is different from existing ramdisk issues, as once the rescue
|
||||
is complete, the tenant would have access to the ramdisk. This means deployers
|
||||
may need to ensure no secret information (such as custom cleaning steps or
|
||||
firmwares) are not present in the rescue ramdisk.
|
||||
|
||||
IPA is entirely unauthenticated. If IPA endpoints continue to be available
|
||||
after a node is rescued, then attackers with access to the tenant network
|
||||
would be able to leverage IPA's REST API to gain privileged access to the
|
||||
host. As such, IPA itself should be shut down, or the network should be
|
||||
sufficiently isolated during rescue operations.
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
We will add rescue and unrescue commands to python-ironicclient.
|
||||
|
||||
Scalability impact
|
||||
------------------
|
||||
None.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
None.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
Add conductor.rescue_callback_timeout config option.
|
||||
|
||||
Multi-tenant deployers will most likely need to support two ram disks--one
|
||||
running IPA for use with normal node-provisioning tasks, and another running
|
||||
IPA for rescue mode (with non-rescue endpoints disabled). This is to ensure
|
||||
the full suite of tooling and authentication needed for secure cleaning is not
|
||||
given to a tenant.
|
||||
|
||||
Additionally, in some environments, operators may not want to use the full
|
||||
Ironic Python Agent inside the rescue ramdisk, due to it's requirement for
|
||||
python or linux-centric nature. They may use statically compiled software
|
||||
such as onmetal-rescue-agent [0]_ to perform the lookup and heartbeat needed
|
||||
to finalize cleaning.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
None.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignee:
|
||||
JayF
|
||||
|
||||
Other contributors:
|
||||
Help Wanted!
|
||||
|
||||
Work Items
|
||||
----------
|
||||
See proposed changes.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
Updating the Ironic virt driver in Nova to support this.
|
||||
|
||||
Testing
|
||||
=======
|
||||
Unit tests and Tempest tests must be added.
|
||||
|
||||
Upgrades and Backwards Compatibility
|
||||
====================================
|
||||
Clients that are unaware of rescue-related states may not function correctly
|
||||
with nodes that are in these states.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
Write documentation.
|
||||
|
||||
References
|
||||
==========
|
||||
.. [0] https://github.com/rackerlabs/onmetal-rescue-agent
|
1
specs/not-implemented/implement-rescue-mode.rst
Symbolic link
1
specs/not-implemented/implement-rescue-mode.rst
Symbolic link
@ -0,0 +1 @@
|
||||
../approved/implement-rescue-mode.rst
|
Loading…
Reference in New Issue
Block a user