Troubleshooting
===============
Retrying failed actions
=======================
In some cases, steps may fail because some components are not yet ready for
use due to initialization times, which can vary based on hardware and
workload volume. In the event of this occurring, two options exist that
allow a user to re-attempt or resume playbook executions.
* Solutions:
* The ansible-playbook command option --start-at-task="TASK NAME"
allows resumption of a playbook when used with the -l limit option
(see the example below).
* The ansible-playbook command option --step prompts the user to confirm
each task before Ansible executes it.
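For example, to resume an interrupted run at a named task on a single host,
confirming each task before it runs, combine the options above. The inventory,
playbook, host, and task names here are illustrative placeholders, not values
taken from this document::

    # inventory, playbook, host, and task name are placeholders
    ansible-playbook -i hosts update.yml -l controllerMgmt0 \
        --start-at-task="Ensuring MySQL is running" --step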
A node goes to ERROR state during rebuild
=========================================
This can happen from time to time due to network errors or temporary
overload of the undercloud.
* Symptoms:
* After the error, `nova list` shows the node in ERROR state
* Solution:
* Verify hardware is in working order.
* Verify that approximately 20% of the disk space is free on the Ironic
server node.
* Get the image ID of the machine with `nova show`::
nova show $node_id
* Rebuild manually::
nova rebuild --preserve-ephemeral $node_id $image_id
A node times out after rebuild
==============================
While rare, there is the possibility that something unexpected happened
and the host has failed to reboot as expected from a rebuild.
* Symptoms:
* Error Message: `msg: Timeout waiting for the server to come up.. Please
check manually`
* Solution:
* Follow the steps detailed above in "A node goes to ERROR state during
rebuild"
MySQL CLI configuration file missing
====================================
If the post-rebuild restart fails, the MySQL CLI configuration file may be
missing.
* Symptoms:
* Attempts to access the MySQL CLI command return an error::
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
* Solution:
* Verify that the MySQL CLI config file stored on the state drive
is present and has content. You can do this by executing the
command below to display the contents in your terminal::
sudo cat /mnt/state/root/metadata.my.cnf
* If the file is empty, run the command below, which retrieves the current
metadata and updates the config files on disk::
sudo os-collect-config --force --one --command=os-apply-config
* Verify that the MySQL CLI config file is present in the root user
directory by executing the following command::
sudo cat /root/.my.cnf
* If that file does not exist or is empty, two options exist.
* Add the following to your MySQL CLI command line::
--defaults-extra-file=/mnt/state/root/metadata.my.cnf
* Alternatively, copy the configuration from the state drive::
sudo cp -f /mnt/state/root/metadata.my.cnf /root/.my.cnf
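For illustration, the first option results in a client invocation along these
lines; note that the MySQL client expects --defaults-extra-file to be the
first option on the command line, and the query here is only a trivial
connectivity check::

    sudo mysql --defaults-extra-file=/mnt/state/root/metadata.my.cnf -e "SELECT 1"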
MySQL fails to start upon retrying update
=========================================
If the update was aborted or failed during the Update sequence before a
single MySQL controller was operational, MySQL will fail to start upon retrying.
* Symptoms:
* Update is being re-attempted.
* The following error messages have been observed:
* `msg: Starting MySQL (Percona XtraDB Cluster) database server: mysqld . . . . The server quit without updating PID file (/var/run/mysqld/mysqld.pid)`
* `stderr: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)`
* `FATAL: all hosts have already failed -- aborting`
* Update automatically aborts.
* *WARNING*:
* The command `/etc/init.d/mysql bootstrap-pxc` which is mentioned below
should only ever be executed when an entire MySQL cluster is down, and
then only on the last node to have been shut down. Running this command
on multiple nodes will cause the MySQL cluster to enter a split brain
scenario effectively breaking the cluster which will result in
unpredictable behavior.
* Solution:
* Use `nova list` to determine the IP of the controllerMgmt node, then ssh into it::
ssh heat-admin@$IP
* Verify MySQL is down by running the mysql client as root. It *should* fail::
sudo mysql -e "SELECT 1"
* Attempt to restart MySQL in case another cluster node is online.
This should fail in this error state; however, if it succeeds, your
cluster should again be operational and the next step can be skipped::
sudo /etc/init.d/mysql start
* Start MySQL back up in single node bootstrap mode::
sudo /etc/init.d/mysql bootstrap-pxc
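After the bootstrap, you can confirm that the node is up and acting as a
one-node cluster by checking the standard Galera status variables (generic
Galera checks, not specific to this playbook); wsrep_cluster_size should
report 1 and wsrep_local_state_comment should report Synced::

    sudo mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'"
    sudo mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"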
MySQL/Percona/Galera is out of sync
===================================
OpenStack is configured to store all of its state in a multi-node
synchronous replication Percona XtraDB Cluster database, which uses
Galera for replication. This database must be in sync and have the full
complement of servers before updates can be performed safely.
* Symptoms:
* Update fails with errors about Galera and/or MySQL being "Out of Sync"
* Solution:
* Use `nova list` to determine the IP of the controllerMgmt node, then SSH to it::
ssh heat-admin@$IP
* Verify replication is out of sync::
sudo mysql -e "SHOW STATUS like 'wsrep_%'"
* Stop mysql::
sudo /etc/init.d/mysql stop
* Verify it is down by running the mysql client as root. It *should* fail::
sudo mysql -e "SELECT 1"
* Start controllerMgmt0 MySQL back up in single node bootstrap mode::
sudo /etc/init.d/mysql bootstrap-pxc
* On the remaining controller nodes observed to be having issues, obtain
their IP addresses via `nova list` and log in to them::
ssh heat-admin@$IP
* Verify replication is out of sync::
sudo mysql -e "SHOW STATUS like 'wsrep_%'"
* Stop mysql::
sudo /etc/init.d/mysql stop
* Verify it is down by running the mysql client as root. It *should* fail::
sudo mysql -e "SELECT 1"
* Start MySQL back up so it attempts to connect to controllerMgmt0::
sudo /etc/init.d/mysql start
* If restarting MySQL fails, then the database is almost certainly out of
sync and the MySQL error logs, located at /var/log/mysql/error.log, will
need to be consulted. In this case, never attempt to restart MySQL with
`sudo /etc/init.d/mysql bootstrap-pxc`, as it will bootstrap the host
as a single node cluster, thus worsening what already appears to be a
split-brain scenario.
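When reading the `wsrep_%` output above, the values that normally indicate a
healthy, in-sync cluster member (standard Galera semantics) are::

    wsrep_cluster_status       Primary
    wsrep_local_state_comment  Synced
    wsrep_ready                ON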
MysQL "Node appears to be the last node in a cluster" error
===========================================================
This error occurs when one of the controller nodes does not have MySQL running.
The playbook has detected that the current node is the last running node,
although, based on the node sequence, it should not be. As a result, the
error is thrown and the update is aborted.
* Symptoms:
* Update Failed with error message "Galera Replication - Node appears to be the last node in a cluster - cannot safely proceed unless overridden via single_controller setting - See README.rst"
* Actions:
* Run the pre-flight_check.yml playbook (an example invocation is shown
below). It will attempt to restart MySQL on each node in the
"Ensuring MySQL is running -" step. If that step succeeds, you should be
able to re-run the playbook without encountering the "Node appears to be
last node in a cluster" error.
* If the pre-flight_check.yml playbook fails to restart MySQL, you will need
to consult the MySQL logs (/var/log/mysql/error.log) to determine why the
other nodes are not restarting.
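A typical invocation of the pre-flight check playbook, assuming it is run with
the same inventory used for the update run (the inventory name below is a
placeholder)::

    # inventory name is a placeholder
    ansible-playbook -i hosts pre-flight_check.yml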
SSH Connectivity is lost
========================
Ansible uses SSH to communicate with remote nodes. In heavily loaded, single
host virtualized environments, SSH can lose connectivity. It should be noted
that similar issues in a physical environment may indicate issues in the
underlying network infrastructure.
* Symptoms:
* Ansible update attempt fails.
* Error output::
fatal: [192.0.2.25] => SSH encountered an unknown error. The
output was: OpenSSH_6.6.1, OpenSSL 1.0.1i-dev xx XXX xxxx
debug1: Reading configuration data /etc/ssh/ssh_config debug1:
/etc/ssh/ssh_config line 19: Applying options for * debug1:
auto-mux: Trying existing master debug2: fd 3 setting
O_NONBLOCK mux_client_hello_exchange: write packet: Broken
pipe FATAL: all hosts have already failed aborting
* Solution:
* You will generally be able to re-run the playbook and complete the
upgrade, unless SSH connectivity is lost while all MySQL nodes are
down. (See 'MySQL fails to start upon retrying update' to correct
this issue.)
* Early Ubuntu Trusty kernel versions have known issues with KVM which
will severely impact SSH connectivity to instances. Test hosts should
have a minimum kernel version of 3.13.0-36-generic.
The update steps, as root, are::
apt-get update
apt-get dist-upgrade
reboot
* If this issue is repeatedly encountered on a physical environment, the
network infrastructure should be inspected for errors.
* Error messages similar to the one noted in the Symptoms section may occur
with long-running processes, such as database creation/upgrade steps. These
cases will generally show partial program execution log output immediately
before the broken pipe message.
Should this be the case, Ansible and OpenSSH may need to have their
configuration files tuned to meet the needs of the environment.
Consult the Ansible configuration file to see the available connection
settings ssh_args, timeout, and possibly pipelining::
https://github.com/ansible/ansible/blob/release1.7.0/examples/ansible.cfg
As Ansible uses OpenSSH, please also reference the ssh_config manual, in
particular the ServerAliveInterval and ServerAliveCountMax options.
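As a sketch only, the relevant connection settings in ansible.cfg look like
the following; the values shown are illustrative, not recommendations from
this document::

    [defaults]
    timeout = 30

    [ssh_connection]
    # keep idle sessions alive and reuse connections; values are illustrative
    ssh_args = -o ServerAliveInterval=30 -o ServerAliveCountMax=5 -o ControlMaster=auto -o ControlPersist=60s
    pipelining = True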
Postfix fails to reload
=======================
Occasionally the postfix mail transfer agent will fail to reload because
it is not running when the system expects it to be running.
* Symptoms:
* Step in /var/log/upstart/os-collect-config.log shows that 'service postfix reload' failed.
* Solution:
* Start postfix::
sudo service postfix start
Apache2 Fails to start
======================
Apache2 requires some self-signed SSL certificates to be put in place
that may not have been configured yet due to earlier failures in the
setup process.
* Error Message:
* failed: [192.0.2.25] => (item=apache2) => {"failed": true, "item": "apache2"}
* msg: start: Job failed to start
* Symptoms:
* apache2 service fails to start
* /etc/ssl/certs/ssl-cert-snakeoil.pem is missing or empty
* Solution:
* Re-run `os-collect-config` to reassert the SSL certificates::
sudo os-collect-config --force --one
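Once os-collect-config has completed, you can confirm the certificate now
exists and retry the service (this assumes the openssl command-line tool is
available on the node)::

    openssl x509 -in /etc/ssl/certs/ssl-cert-snakeoil.pem -noout -dates
    sudo service apache2 start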
RabbitMQ still running when restart is attempted
================================================
There are certain system states that cause RabbitMQ to fail to stop on normal kill signals.
* Symptoms:
* Attempts to start rabbitmq fail because it is already running
* Solution:
* Find any processes running as the `rabbitmq` user on the host and kill them, forcibly if need be (see the sketch below).
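One way to do this, assuming nothing else legitimately runs as the `rabbitmq`
user on the host::

    ps -u rabbitmq -o pid,cmd   # identify any lingering beam.smp/epmd processes
    sudo pkill -u rabbitmq      # ask them to exit
    sudo pkill -9 -u rabbitmq   # force-kill anything that remains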
Instance reported with status == "SHUTOFF" and task_state == "powering on"
==========================================================================
If nova attempts to restart an instance when the compute node is not ready,
it is possible that nova enters a confused state where it thinks that
an instance is starting when in fact the compute node is doing nothing.
* Symptoms:
* Command `nova list --all-tenants` reports instance(s) with STATUS ==
"SHUTOFF" and task_state == "powering on".
* Instance cannot be pinged.
* No instance appears to be running on the compute node.
* Nova hangs upon retrieving logs or returns old logs from the previous
boot.
* Console session cannot be established.
* Solution:
* On a controller logged in as root, after executing `source stackrc`:
* Execute `nova list --all-tenants` to obtain instance ID(s)
* Execute `nova show <instance-id>` on each suspected ID to identify
suspected compute nodes.
* Log into the suspected compute node(s) and execute:
`os-collect-config --force --one`
* Return to the controller node that you were logged into previously, and
using the instance IDs obtained previously, take the following steps.
* Execute `nova reset-state --active <instance-id>`
* Execute `nova stop <instance-id>`
* Execute `nova start <instance-id>`
* Once the above steps have been taken in order, you should see the
instance status return to ACTIVE and the instance become accessible
via the network.
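Collected into a single sequence for reference, where <instance-id> is an ID
obtained from `nova list --all-tenants`::

    # on the controller
    source stackrc
    nova list --all-tenants
    nova show <instance-id>
    # on the suspected compute node
    os-collect-config --force --one
    # back on the controller
    nova reset-state --active <instance-id>
    nova stop <instance-id>
    nova start <instance-id>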
state drive /mnt is not mounted
===============================
In the rare event that something goes wrong between the state drive being
unmounted and the rebuild command being triggered, the /mnt volume on the
instance being operated on at that time will be left in an unmounted
state.
In such a state, pre-flight checks will fail attempting to start MySQL and
RabbitMQ.
* Error Messages:
* Pre-flight check returns an error similar to::
failed: [192.0.2.24] => {"changed": true, "cmd":
"rabbitmqctl -n rabbit@$(hostname) status" stderr: Error:
unable to connect to node
'rabbit@overcloud-controller0-vahypr34iy2x': nodedown
* Attempting to manually start MySQL or RabbitMQ return::
start: Job failed to start
* Upgrade execution returns with an error indicating::
TASK: [fail msg="Galera Replication - Node appears to be the
last node in a cluster - cannot safely proceed unless
overriden via single_controller setting - See README.rst"] ***
* Symptoms:
* Execution of the `df` command does not show a volume mounted as /mnt.
* Unable to manually start services.
* Solution:
* Execute os-collect-config, which will re-mount the state drive. This
command may fail without additional intervention; however, it should mount
the state drive, which is all that is needed to proceed to the next step::
sudo os-collect-config --force --one
* At this point, the /mnt volume should be visible in the output of the `df`
command.
* Start MySQL by executing::
sudo /etc/init.d/mysql start
* If MySQL fails to start, and it has been verified that MySQL is not
running on any controller nodes, then you will need to identify the
*last* node that MySQL was stopped on and consult the section "MySQL
fails to start upon retrying update" for guidance on restarting the
cluster.
* Start RabbitMQ by executing::
service rabbitmq-server start
* If rabbitmq-server fails to start, then the cluster may be down. If
this is the case, then the *last* node to be stopped will need to be
identified and started before attempting to restart RabbitMQ on this
node.
* At this point, re-execute the pre-flight check, and proceed with the
upgrade.
VMs may not shut down properly during upgrade
=============================================
During the upgrade process, VMs on compute nodes are shut down
gracefully. If the VMs do not shut down, this can cause the upgrade to
stop.
* Error Messages:
* A playbook run ends with a message similar to::
failed: [10.23.210.31] => {"failed": true} msg: The ephemeral
storage of this system failed to be cleaned up properly and
processes or files are still in use. The previous ansible play
should have information to help troubleshoot this issue.
* The output of the playbook run prior to this message contains a
process listing and a listing of open files.
* Symptoms:
* The state drive on the compute node, /mnt, is still in use and
cannot be unmounted. You can confirm this by executing::
lsof -n | grep /mnt
* VMs are running on the node. To see which VMs are running, run::
virsh list
* If `virsh list` fails, you may need to restart libvirt-bin or
libvirtd, depending on which service is in use. Do so by running
one of the following::
service libvirt-bin restart
service libvirtd restart
* Solution:
* Manual intervention is required. You will need to determine why
the VMs did not shut down properly, and resolve the issue.
* Unresponsive VMs can be forcibly shut down using `virsh destroy
<id>`. Note that this can corrupt filesystems on the VM (see the sketch
after this list).
* Resume the playbook run once the VMs have been shut down.
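If the remaining VMs are unresponsive and you accept the filesystem corruption
risk noted above, a last-resort sketch to force off everything still running
on the compute node is::

    # forcibly powers off every running domain on this hypervisor
    for dom in $(virsh list --name); do
        virsh destroy "$dom"
    done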
Instances are inaccessible via network
======================================
Upon restarting, it is possible that virtual machines are unreachable
because Open vSwitch was not yet ready to provide the virtual machine
networking.
* Symptom:
* After a restart, instances won't ping.
* Solution:
* Log into a controller node and execute `source /root/stackrc`
* Stop all virtual machines on a compute node utilizing `nova
hypervisor-servers <hostname>` and `nova stop <id>`
* Log into the undercloud node and execute `source /root/stackrc`
* Obtain a list of nodes by executing `nova list`
* Execute `nova stop <id>` for the affected compute node.
* Once the compute node has stopped, execute `nova start <id>` to
reboot the compute node.
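Put together, the sequence looks like the following, where <vm-id> is each
instance reported by `nova hypervisor-servers` and <compute-id> is the
affected compute node's ID from `nova list` on the undercloud::

    # on a controller node
    source /root/stackrc
    nova hypervisor-servers <hostname>
    nova stop <vm-id>
    # on the undercloud node
    source /root/stackrc
    nova list
    nova stop <compute-id>
    nova start <compute-id>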
Online Upgrade fails with a message saying glanceclient is not found
====================================================================
* Symptoms:
* An online upgrade has been attempted; however, the playbook execution
failed when attempting to download the new image from Glance, reporting
that glanceclient was not found.
* Solution:
* If you are attempting to execute the Ansible playbook on the seed or
undercloud node, source the Ansible virtual environment by executing
`source /opt/stack/venvs/ansible/bin/activate`
* Once the Ansible virtual environment has been sourced, execute
`sudo pip install python-glanceclient` on the node you are attempting
to execute Ansible from.
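On the seed or undercloud node, the two steps above combined look like::

    source /opt/stack/venvs/ansible/bin/activate
    sudo pip install python-glanceclient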
Online Upgrade of compute node failed
=====================================
In the event that an online upgrade of a compute node fails, the node
can be recovered using a traditional rebuild.
* Symptoms:
* Online upgrade was performed.
* Compute node cannot be logged into, or is otherwise in a
non-working state.
* Solution:
* From the undercloud:
* Execute `source /root/stackrc`
* Identify the instance ID of the broken compute node via `nova list`
* Execute the command `nova stop <instance-id>` to stop the instance.
* Return to the host that you ran the upgrade from and re-run the playbook
without the "-e online_upgrade=True" option.
* Additionally, you may need to utilize the "-e force_rebuild=True" option
to force the instance to rebuild.
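A sketch of the recovery sequence; the inventory and playbook names are
placeholders, so use whatever you ran the original upgrade with::

    # on the undercloud
    source /root/stackrc
    nova list
    nova stop <instance-id>
    # on the host you run Ansible from, omitting -e online_upgrade=True
    ansible-playbook -i hosts upgrade.yml -e force_rebuild=True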