From b5c20d0f8d4a2bfc66a545eb4923b829f3559d5d Mon Sep 17 00:00:00 2001 From: Dmitry Tantsur Date: Wed, 3 Jun 2015 14:37:01 +0200 Subject: [PATCH] Start overcloud troubleshooting with figuring out failed layer Diving into heat resources is of no use, when Ironic deployment has failed. Starting with Ironic node-list can give important clues to whether the deployment failed before baremetal provisioning, during it or after. Change-Id: Idd329e71a724cd2a900cf9631f241884cdd6a2eb --- .../troubleshooting-overcloud.rst | 85 ++++++++++++++++--- 1 file changed, 72 insertions(+), 13 deletions(-) diff --git a/doc/source/troubleshooting/troubleshooting-overcloud.rst b/doc/source/troubleshooting/troubleshooting-overcloud.rst index d4137b89e..0bc3a9858 100644 --- a/doc/source/troubleshooting/troubleshooting-overcloud.rst +++ b/doc/source/troubleshooting/troubleshooting-overcloud.rst @@ -5,25 +5,84 @@ If an Overcloud deployment has failed, the OpenStack clients and service log files can be used to troubleshoot the failed deployment. The following commands are all run on the Undercloud and assume a stackrc file has been sourced. -* Identifying a failed deployment +Identifying Failed Component +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - In most cases, Heat will show the failed overcloud stack when a deployment - has failed. +In most cases, Heat will show the failed overcloud stack when a deployment +has failed:: + $ heat stack-list + + +--------------------------------------+------------+--------------------+----------------------+ + | id | stack_name | stack_status | creation_time | + +--------------------------------------+------------+--------------------+----------------------+ + | 7e88af95-535c-4a55-b78d-2c3d9850d854 | overcloud | CREATE_FAILED | 2015-04-06T17:57:16Z | + +--------------------------------------+------------+--------------------+----------------------+ + +Occassionally, Heat is not even able to create the the stack, so the ``heat +stack-list`` output will be empty. If this is the case, observe the message +that was printed to the terminal when ``instack-deploy-overcloud`` or ``heat +stack-create`` was run. + +Next, there are a few layers on which the deployment can fail: + +* Orchestration (Heat and Nova services) +* Bare metal provisioning (Ironic service) +* Post-deploy configuration (Puppet) + +As Ironic service is in the middle layer, you can use its shell to guess the +failed layer. Issue ``ironic node-list`` command to see all registered nodes +and their current status, you will see something like:: + + +--------------------------------------+------+---------------+-------------+-----------------+-------------+ + | UUID | Name | Instance UUID | Power State | Provision State | Maintenance | + +--------------------------------------+------+---------------+-------------+-----------------+-------------+ + | f1e26112-5fbd-4fc4-9612-ecce7a1d86aa | None | None | power off | available | False | + | f0b8c105-f1d7-4059-a9a3-b050c3340340 | None | None | power off | available | False | + +--------------------------------------+------+---------------+-------------+-----------------+-------------+ + +Pay close attention to **Provision State** and **Maintenance** columns +in the resulting table. + +* If the command shows empty table or less nodes that you expect, or + **Maintenance** is ``True``, or **Provision State** is ``manageable``, + there was a problem during node enrolling and introspection. + Please go back to these steps. + + For example, **Maintenance** goes to ``True`` automatically, if wrong power + credentials are provided. + +* If **Provision State** is ``available`` then the problem occured before + bare metal deployment has even started. Proceed with `Debugging Using Heat`_. + +* If **Provision State** is ``active`` and **Power State** is ``power on``, + then bare metal deployment has finished successfully, and problem happened + during the post-deployment configuration step. Again, refer to `Debugging + Using Heat`_. + +* If **Provision State** is ``wait call-back``, then bare metal deployment is + not finished for this node yet. You may want to wait until the status + changes. + +* If **Provision State** is ``error`` or ``deploy failed``, then bare metal + deployment has failed for this node. Issue :: - $ heat stack-list + ironic node-show - +--------------------------------------+------------+--------------------+----------------------+ - | id | stack_name | stack_status | creation_time | - +--------------------------------------+------------+--------------------+----------------------+ - | 7e88af95-535c-4a55-b78d-2c3d9850d854 | overcloud | CREATE_FAILED | 2015-04-06T17:57:16Z | - +--------------------------------------+------------+--------------------+----------------------+ + and look for **last_error** field. It will contain error description. - Occassionally, Heat is not even able to create the the stack, so the ``heat - stack-list`` output will be empty. If this is the case, observe the message - that was printed to the terminal when ``instack-deploy-overcloud`` or ``heat - stack-create`` was run. + If the error message is vague, you can use logs to clarify it:: + + sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api + + If you see wait timeout error, and node **Power State** is ``power on``, + then try to connect to the virtual console of the failed machine. Use + ``virt-manager`` tool for virtual machines and vendor-specific virtual + console (e.g. iDRAC for DELL) for bare metal machines. + +Debugging Using Heat +~~~~~~~~~~~~~~~~~~~~ * Identifying the failed Heat resource