.. _troubleshooting:
======================
Troubleshooting Ironic
======================
Nova returns "No valid host was found" Error
============================================
Sometimes the Nova Conductor log file ``nova-conductor.log`` or a message
returned from the Nova API contains the following error::
NoValidHost: No valid host was found. There are not enough hosts available.
"No valid host was found" means that the Nova Scheduler could not find a bare
metal node suitable for booting the new instance.
This in turn usually means some mismatch between resources that Nova expects
to find and resources that Ironic advertised to Nova.
A few things should be checked in this case:
#. Make sure that enough nodes are in ``available`` state, not in
maintenance mode and not already used by an existing instance.
Check with the following command::
openstack baremetal node list --provision-state available --no-maintenance --unassociated
If this command does not show enough nodes, use the generic
``openstack baremetal node list`` to check other nodes. For example, nodes in
the ``manageable`` state should be made available::
openstack baremetal node provide <IRONIC NODE>
The Bare metal service automatically puts a node in maintenance mode if
there are issues with accessing its management interface. Check the power
credentials (e.g. ``ipmi_address``, ``ipmi_username`` and ``ipmi_password``)
and then move the node out of maintenance mode::
openstack baremetal node maintenance unset <IRONIC NODE>
The ``node validate`` command can be used to verify that all required fields
are present. The following command should not return anything::
openstack baremetal node validate <IRONIC NODE> | grep -E '(power|management)\W*False'
Maintenance mode is also set on a node if automated cleaning has
previously failed for it.
#. Make sure that you have Compute services running and enabled::
$ openstack compute service list --service nova-compute
+----+--------------+-------------+------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+--------------+-------------+------+---------+-------+----------------------------+
| 7 | nova-compute | example.com | nova | enabled | up | 2017-09-04T13:14:03.000000 |
+----+--------------+-------------+------+---------+-------+----------------------------+
By default, a Compute service is disabled after 10 consecutive build
failures on it. This ensures that new build requests are not routed to
a broken Compute service. If this is the case, fix the source of
the failures, then re-enable the service::
openstack compute service set --enable <COMPUTE HOST> nova-compute
#. Starting with the Pike release, check that all your nodes have the
``resource_class`` field set using the following command::
openstack --os-baremetal-api-version 1.21 baremetal node list --fields uuid name resource_class
Then check that the flavor(s) are configured to request these resource
classes via their properties::
openstack flavor show <FLAVOR NAME> -f value -c properties
For example, if your node has resource class ``baremetal-large``, it will
be matched by a flavor with property ``resources:CUSTOM_BAREMETAL_LARGE``
set to ``1``. See :doc:`/install/configure-nova-flavors` for more
details on the correct configuration.
#. If you do not use scheduling based on resource classes, then the node's
properties must have been set either manually or via inspection.
For each node with ``available`` state check that the ``properties``
JSON field has valid values for the keys ``cpus``, ``cpu_arch``,
``memory_mb`` and ``local_gb``. Example of valid properties::
$ openstack baremetal node show <IRONIC NODE> --fields properties
+------------+------------------------------------------------------------------------------------+
| Property | Value |
+------------+------------------------------------------------------------------------------------+
| properties | {u'memory_mb': u'8192', u'cpu_arch': u'x86_64', u'local_gb': u'41', u'cpus': u'4'} |
+------------+------------------------------------------------------------------------------------+
.. warning::
If you're using exact match filters in the Nova Scheduler, make sure
the flavor and the node properties match exactly.
#. Check that the Nova flavor you are using matches the properties of the
available Ironic nodes. Use
::
openstack flavor show <FLAVOR NAME>
to compare. The extra specs in your flavor starting with ``capabilities:``
should match the ones in ``node.properties['capabilities']``.
.. note::
The format of capabilities is different in Nova and Ironic.
E.g. in Nova flavor::
$ openstack flavor show <FLAVOR NAME> -c properties
+------------+----------------------------------+
| Field | Value |
+------------+----------------------------------+
| properties | capabilities:boot_option='local' |
+------------+----------------------------------+
But in Ironic node::
$ openstack baremetal node show <IRONIC NODE> --fields properties
+------------+-----------------------------------------+
| Property | Value |
+------------+-----------------------------------------+
| properties | {u'capabilities': u'boot_option:local'} |
+------------+-----------------------------------------+
#. After making changes to nodes in Ironic, it takes time for those changes
to propagate from Ironic to Nova. Check that
::
openstack hypervisor stats show
correctly shows the total amount of resources in your system. You can also
check ``openstack hypervisor show <IRONIC NODE>`` to see the status of
individual Ironic nodes as reported to Nova.
.. TODO(dtantsur): explain inspecting the placement API
#. Figure out which Nova Scheduler filter ruled out your nodes. Check the
``nova-scheduler`` logs for lines containing something like::
Filter ComputeCapabilitiesFilter returned 0 hosts
The name of the filter that removed the last hosts may give some hints on
what exactly was not matched. See `Nova filters documentation
<https://docs.openstack.org/nova/latest/filter_scheduler.html>`_ for more
details.
#. If none of the above helped, check the Ironic conductor log carefully for
conductor-related errors that may be the root cause of "No valid host was
found". For example, an "Error in deploy of node <IRONIC-NODE-UUID>:
[Errno 28] ..." message in the Ironic conductor log means that the conductor
ran into an error during deployment (errno 28 is "No space left on device").
Use the log to fix or work around the reported problem, then try again.
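
The resource class and capabilities checks above both involve a naming
conversion that is easy to get wrong by hand. The following is a minimal
Python sketch, not code from Ironic or Nova: the helper names
``flavor_resource_key`` and ``flavor_capability_specs`` are hypothetical, and
the normalization rule (upper-case the name, replace every run of
non-alphanumeric characters with an underscore, add the ``CUSTOM_`` prefix) is
an assumption generalized from the ``baremetal-large`` example above.

.. code-block:: python

    import re


    def flavor_resource_key(resource_class):
        """Return the flavor property name expected for a node resource class.

        Assumption: upper-case the name, replace each run of characters
        other than letters and digits with "_", and add the CUSTOM_ prefix.
        """
        normalized = re.sub(r'[^0-9A-Za-z]+', '_', resource_class).upper()
        return 'resources:CUSTOM_' + normalized


    def flavor_capability_specs(node_capabilities):
        """Convert an Ironic capabilities string to flavor extra specs.

        Ironic stores "key1:value1,key2:value2" in properties['capabilities'];
        a flavor carries one "capabilities:<key>" extra spec per capability.
        """
        specs = {}
        for item in node_capabilities.split(','):
            if not item.strip():
                continue  # ignore empty fragments such as a trailing comma
            key, _, value = item.partition(':')
            specs['capabilities:' + key.strip()] = value.strip()
        return specs


    print(flavor_resource_key('baremetal-large'))
    print(flavor_capability_specs('boot_option:local'))

The first call prints ``resources:CUSTOM_BAREMETAL_LARGE``. The second prints
a dictionary with the single key ``capabilities:boot_option`` set to
``local``, which is the form shown in the flavor example above.
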
Patching the Deploy Ramdisk
===========================
When debugging a problem with deployment and/or inspection you may want to
quickly apply a change to the ramdisk to see if it helps. You can inject your
code and/or SSH keys during the ramdisk build (how to do so depends on how
you built your ramdisk), but it is also possible to quickly modify
an already built ramdisk.
Create an empty directory and unpack the ramdisk content there::
mkdir unpack
cd unpack
gzip -dc /path/to/the/ramdisk | cpio -id
The last command will result in the whole Linux file system tree unpacked in
the current directory. Now you can modify any files you want. The actual
location of the files will depend on the way you've built the ramdisk.
After you've done the modifications, pack the whole content of the current
directory back::
find . | cpio -H newc -o | gzip -c > /path/to/the/new/ramdisk
.. note:: You don't need to modify the kernel (e.g.
``tinyipa-master.vmlinuz``), only the ramdisk part.
.. note:: For CoreOS-based ramdisk you also need to unpack and pack back the
squashfs archive inside the unpacked ramdisk.
API Errors
==========
The ``debug_tracebacks_in_api`` config option may be set to return tracebacks
in the API response for all 4xx and 5xx errors.
.. _retrieve_deploy_ramdisk_logs:
Retrieving logs from the deploy ramdisk
=======================================
When troubleshooting deployments (especially in case of a deploy failure)
it's important to have access to the deploy ramdisk logs to be able to
identify the source of the problem. By default, Ironic will retrieve the
logs from the deploy ramdisk when the deployment fails and save them on the
local filesystem at ``/var/log/ironic/deploy``.
To change this behavior, operators can make the following changes to
``/etc/ironic/ironic.conf`` under the ``[agent]`` group:
* ``deploy_logs_collect``: Whether Ironic should collect the deployment
logs on deployment. Valid values for this option are:
* ``on_failure`` (**default**): Retrieve the deployment logs upon a
deployment failure.
* ``always``: Always retrieve the deployment logs, even if the
deployment succeeds.
* ``never``: Disable retrieving the deployment logs.
* ``deploy_logs_storage_backend``: The name of the storage backend where
the logs will be stored. Valid values for this option are:
* ``local`` (**default**): Store the logs in the local filesystem.
* ``swift``: Store the logs in Swift.
* ``deploy_logs_local_path``: The path to the directory where the
logs should be stored, used when the ``deploy_logs_storage_backend``
is configured to ``local``. By default logs will be stored at
**/var/log/ironic/deploy**.
* ``deploy_logs_swift_container``: The name of the Swift container to
store the logs in, used when ``deploy_logs_storage_backend`` is configured
to ``swift``. By default **ironic_deploy_logs_container**.
* ``deploy_logs_swift_days_to_expire``: Number of days before a log object
is marked as expired in Swift. If None, the logs will be kept forever
or until manually deleted. Used when ``deploy_logs_storage_backend`` is
configured to ``swift``. By default **30** days.
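
For illustration, a configuration that always collects the logs and stores
them in Swift could look like the fragment below. It is only a sketch: the
container name and the expiration are the defaults listed above, written out
explicitly.

.. code-block:: ini

    [agent]
    deploy_logs_collect = always
    deploy_logs_storage_backend = swift
    deploy_logs_swift_container = ironic_deploy_logs_container
    deploy_logs_swift_days_to_expire = 30
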
When the logs are collected, Ironic will store a *tar.gz* file containing
all the logs according to the ``deploy_logs_storage_backend``
configuration option. All log objects will be named with the following
pattern::
<node-uuid>[_<instance-uuid>]_<timestamp yyyy-mm-dd-hh:mm:ss>.tar.gz
.. note::
The *instance_uuid* field is not required for deploying a node when
Ironic is configured to be used in standalone mode. If present it
will be appended to the name.
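
Because the name encodes the node, the optional instance and the collection
time, it can be split back into those parts when post-processing many
archives. The sketch below is an illustration only: ``parse_log_name`` is a
hypothetical helper, not part of Ironic, and it assumes that both UUIDs use
the canonical 36-character hyphenated form.

.. code-block:: python

    import re
    from datetime import datetime

    UUID = r'[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}'
    LOG_NAME = re.compile(
        r'^(?P<node>' + UUID + r')'
        r'(?:_(?P<instance>' + UUID + r'))?'
        r'_(?P<timestamp>\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2})\.tar\.gz$')


    def parse_log_name(name):
        """Split a deploy log archive name into its components.

        Returns a (node_uuid, instance_uuid, collected_at) tuple, where
        instance_uuid is None when the name carries no instance UUID.
        Raises ValueError if the name does not follow the pattern.
        """
        match = LOG_NAME.match(name)
        if match is None:
            raise ValueError('unexpected log name: %s' % name)
        collected_at = datetime.strptime(match.group('timestamp'),
                                         '%Y-%m-%d-%H:%M:%S')
        return match.group('node'), match.group('instance'), collected_at

Returning ``None`` for a missing instance UUID keeps standalone deployments
and Nova-driven deployments distinguishable by the caller.
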
Accessing the log data
----------------------
When storing in the local filesystem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When storing the logs in the local filesystem, the log files can
be found at the path configured in the ``deploy_logs_local_path``
configuration option. For example, to find the logs from the node
``5e9258c4-cfda-40b6-86e2-e192f523d668``:
.. code-block:: bash
$ ls /var/log/ironic/deploy | grep 5e9258c4-cfda-40b6-86e2-e192f523d668
5e9258c4-cfda-40b6-86e2-e192f523d668_88595d8a-6725-4471-8cd5-c0f3106b6898_2016-08-08-13:52:12.tar.gz
5e9258c4-cfda-40b6-86e2-e192f523d668_db87f2c5-7a9a-48c2-9a76-604287257c1b_2016-08-08-14:07:25.tar.gz
.. note::
When saving the logs to the filesystem, operators may want to enable
some form of rotation for the logs to avoid disk space problems.
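
To pick the most recent archive for a node without scanning the listing by
eye, the same lookup can be scripted. This is a minimal sketch under stated
assumptions: ``latest_deploy_log`` is a hypothetical helper, its default
directory is the documented default of ``deploy_logs_local_path``, and it
relies on the timestamp being the last underscore-separated component of the
file name.

.. code-block:: python

    import os


    def latest_deploy_log(node_uuid, log_dir='/var/log/ironic/deploy'):
        """Return the path of the newest log archive for a node, or None.

        Names end in <timestamp yyyy-mm-dd-hh:mm:ss>.tar.gz, so the
        zero-padded timestamps sort chronologically as plain strings.
        """
        suffix = '.tar.gz'
        candidates = [
            name for name in os.listdir(log_dir)
            if name.startswith(node_uuid + '_') and name.endswith(suffix)
        ]
        if not candidates:
            return None
        # The timestamp is everything after the last underscore.
        newest = max(candidates, key=lambda name: name.rsplit('_', 1)[1])
        return os.path.join(log_dir, newest)

Comparing only the timestamp part, rather than the whole name, keeps the
result correct when a node has been deployed with several different instance
UUIDs.
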
When storing in Swift
~~~~~~~~~~~~~~~~~~~~~
When using Swift, operators can associate the objects in the
container with the nodes in Ironic and search for the logs for the node
``5e9258c4-cfda-40b6-86e2-e192f523d668`` using the **prefix** parameter.
For example:
.. code-block:: bash
$ swift list ironic_deploy_logs_container -p 5e9258c4-cfda-40b6-86e2-e192f523d668
5e9258c4-cfda-40b6-86e2-e192f523d668_88595d8a-6725-4471-8cd5-c0f3106b6898_2016-08-08-13:52:12.tar.gz
5e9258c4-cfda-40b6-86e2-e192f523d668_db87f2c5-7a9a-48c2-9a76-604287257c1b_2016-08-08-14:07:25.tar.gz
To download a specific log from Swift, do:
.. code-block:: bash
$ swift download ironic_deploy_logs_container "5e9258c4-cfda-40b6-86e2-e192f523d668_db87f2c5-7a9a-48c2-9a76-604287257c1b_2016-08-08-14:07:25.tar.gz"
5e9258c4-cfda-40b6-86e2-e192f523d668_db87f2c5-7a9a-48c2-9a76-604287257c1b_2016-08-08-14:07:25.tar.gz [auth 0.341s, headers 0.391s, total 0.391s, 0.531 MB/s]
The contents of the log file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The log is just a ``.tar.gz`` file that can be extracted as:
.. code-block:: bash
$ tar xvf <file path>
The contents of the file may differ slightly depending on the distribution
that the deploy ramdisk is using:
* For distributions using ``systemd`` there will be a file called
**journal** which contains all the system logs collected via the
``journalctl`` command.
* For other distributions, the ramdisk will collect all the contents of
the ``/var/log`` directory.
For all distributions, the log file will also contain the output of
the following commands (if present): ``ps``, ``df``, ``ip addr`` and
``iptables``.
Here is an example of extracting the contents of a log file for a
distribution that uses ``systemd``:
.. code-block:: bash
$ tar xvf 5e9258c4-cfda-40b6-86e2-e192f523d668_88595d8a-6725-4471-8cd5-c0f3106b6898_2016-08-08-13:52:12.tar.gz
df
ps
journal
ip_addr
iptables
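
The same inspection can be done without unpacking the archive to disk. The
following sketch uses Python's standard ``tarfile`` module;
``list_log_members`` and ``read_log_member`` are hypothetical helper names,
and the member names are assumed to match the listing above.

.. code-block:: python

    import tarfile


    def list_log_members(archive_path):
        """Return the names of the files stored in a deploy log archive."""
        with tarfile.open(archive_path, 'r:gz') as archive:
            return archive.getnames()


    def read_log_member(archive_path, member):
        """Return the text of one member, e.g. 'journal' or 'ip_addr'."""
        with tarfile.open(archive_path, 'r:gz') as archive:
            extracted = archive.extractfile(member)
            if extracted is None:
                raise ValueError('%s is not a regular file' % member)
            with extracted:
                return extracted.read().decode('utf-8', errors='replace')

Decoding with ``errors='replace'`` is a deliberate choice here, since ramdisk
logs may contain bytes that are not valid UTF-8.
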
.. _troubleshooting-stp:
DHCP during PXE or iPXE is inconsistent or unreliable
=====================================================
This can be caused by the spanning tree protocol delay on some switches. The
delay prevents the switch port from moving to forwarding mode during the
node's attempts to PXE boot, so the packets never make it to the DHCP server.
To resolve this issue, set the switch port that connects to your bare metal
nodes as an edge or PortFast type port. Configured in this way, the switch
port will move to forwarding mode as soon as the link is established. An
example of how to do that for a Cisco Nexus switch is:
.. code-block:: bash
$ config terminal
$ (config) interface eth1/11
$ (config-if) spanning-tree port type edge