tripleo-ansible/Troubleshooting.rst


Retrying failed actions

In some cases, steps may fail because some components are not yet ready for use due to initialization times, which can vary based on hardware and workload volume. If this occurs, two options exist that allow a user to re-attempt or resume playbook executions.

  • Solutions:
    • The ansible-playbook command option --start-at-task="TASK NAME" allows resumption of a playbook from a named task when used with the -l limit option.
    • The ansible-playbook command option --step prompts the user to confirm each task before Ansible executes it.
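    • For example, a resumed run limited to a single host might look like the sketch below; the inventory path, playbook name, host name, and task name are placeholders for your environment:

      ansible-playbook -i hosts site.yml -l overcloud-controller0 --start-at-task="TASK NAME"

      # Alternatively, step through tasks one at a time, confirming each:
      ansible-playbook -i hosts site.yml --step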

A node goes to ERROR state during rebuild

This can happen from time to time due to network errors or temporary overload of the undercloud.

  • Symptoms:
    • After error, nova list shows node in ERROR
  • Solution:
    • Verify hardware is in working order.

    • Verify that approximately 20% of the disk space is free on the Ironic server node.

    • Get the image ID of the machine with `nova show`:

      nova show $node_id
    • Rebuild manually:

      nova rebuild --preserve-ephemeral $node_id $image_id

A node times out after rebuild

While rare, there is the possibility that something unexpected happened and the host has failed to reboot as expected from a rebuild.

  • Symptoms:
    • Error Message: msg: Timeout waiting for the server to come up.. Please check manually
  • Solution:
    • Follow the steps detailed above in "A node goes to ERROR state during rebuild"

MySQL CLI configuration file missing

Should the post-rebuild restart fail, the possibility exists that the MySQL CLI configuration file is missing.

  • Symptoms:
    • Attempts to access the MySQL CLI command return an error:

      ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
  • Solution:
    • Verify that the MySQL CLI configuration file stored on the state drive is present and has content. You can do this by executing the command below to display its contents in your terminal:

      sudo cat /mnt/state/root/metadata.my.cnf
    • If the file is empty, run the command below, which will retrieve the current metadata and update the configuration files on disk:

      sudo os-collect-config --force --one --command=os-apply-config
    • Verify that the MySQL CLI config file is present in the root user directory by executing the following command:

      sudo cat /root/.my.cnf
    • If that file does not exist or is empty, two options exist.

      • Add the following to your MySQL CLI command line:

        --defaults-extra-file=/mnt/state/root/metadata.my.cnf
      • Alternatively, copy the configuration from the state drive:

        sudo cp -f /mnt/state/root/metadata.my.cnf /root/.my.cnf

MySQL fails to start upon retrying update

If the update was aborted or failed during the Update sequence before a single MySQL controller was operational, MySQL will fail to start upon retrying.

  • Symptoms:
    • Update is being re-attempted.

    • The following error messages have been observed:

      • msg: Starting MySQL (Percona XtraDB Cluster) database server: mysqld . . . . The server quit without updating PID file (/var/run/mysqld/mysqld.pid)
      • stderr: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)
      • FATAL: all hosts have already failed -- aborting
    • Update automatically aborts.

  • WARNING:
    • The command /etc/init.d/mysql bootstrap-pxc which is mentioned below should only ever be executed when an entire MySQL cluster is down, and then only on the last node to have been shut down. Running this command on multiple nodes will cause the MySQL cluster to enter a split-brain scenario, effectively breaking the cluster and resulting in unpredictable behavior.
  • Solution:
    • Use nova list to determine the IP of the controllerMgmt node, then ssh into it:

      ssh heat-admin@$IP
    • Verify MySQL is down by running the mysql client as root. It *should* fail:

      sudo mysql -e "SELECT 1"
    • Attempt to restart MySQL in case another cluster node is online. This should fail in this error state; however, if it succeeds, your cluster should again be operational and the next step can be skipped:

      sudo /etc/init.d/mysql start
    • Start MySQL back up in single node bootstrap mode:

      sudo /etc/init.d/mysql bootstrap-pxc
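    • Once bootstrapped, the cluster's health can optionally be confirmed by querying the wsrep status; this is a minimal check, assuming the standard Percona/Galera status variables:

      sudo mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'"
      sudo mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"   # should report 'Synced'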

MySQL/Percona/Galera is out of sync

OpenStack is configured to store all of its state in a multi-node synchronous replication Percona XtraDB Cluster database, which uses Galera for replication. This database must be in sync and have the full complement of servers before updates can be performed safely.

  • Symptoms:

    • Update fails with errors about Galera and/or MySQL being "Out of Sync"
  • Solution:

    • Use nova list to determine the IP of the controllerMgmt node, then SSH to it:

      ssh heat-admin@$IP
    • Verify replication is out of sync:

      sudo mysql -e "SHOW STATUS like 'wsrep_%'"
    • Stop mysql:

      sudo /etc/init.d/mysql stop
    • Verify it is down by running the mysql client as root. It *should* fail:

      sudo mysql -e "SELECT 1"
    • Start controllerMgmt0 MySQL back up in single node bootstrap mode:

      sudo /etc/init.d/mysql bootstrap-pxc
    • On the remaining controller nodes observed to be having issues, obtain their IP addresses via nova list and log in to them:

      ssh heat-admin@$IP
    • Verify replication is out of sync:

      sudo mysql -e "SHOW STATUS like 'wsrep_%'"
    • Stop mysql:

      sudo /etc/init.d/mysql stop
    • Verify it is down by running the mysql client as root. It *should* fail:

      sudo mysql -e "SELECT 1"
    • Start MySQL back up so it attempts to connect to controllerMgmt0:

      sudo /etc/init.d/mysql start
    • If restarting MySQL fails, then the database is almost certainly out of sync and the MySQL error logs, located at /var/log/mysql/error.log, will need to be consulted. In this case, never attempt to restart MySQL with sudo /etc/init.d/mysql bootstrap-pxc, as it will bootstrap the host as a single-node cluster, worsening what already appears to be a split-brain scenario.
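    • To inspect the most recent entries in the MySQL error log, for example:

      sudo tail -n 100 /var/log/mysql/error.log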

MysQL "Node appears to be the last node in a cluster" error

This error occurs when one of the controller nodes does not have MySQL running. The playbook has detected that the current node is the last running node, although based on sequence it should not be the last node. As a result the error is thrown and update aborted.

  • Symptoms:
    • Update Failed with error message "Galera Replication - Node appears to be the last node in a cluster - cannot safely proceed unless overridden via single_controller setting - See README.rst"
  • Actions:
    • Run the pre-flight_check.yml playbook (an example invocation is shown below). It will attempt to restart MySQL on each node in the "Ensuring MySQL is running -" step. If that step succeeds, you should be able to re-run the playbook and not encounter the "Node appears to be the last node in a cluster" error.
    • If pre-flight_check.yml fails to restart MySQL, you will need to consult the MySQL logs (/var/log/mysql/error.log) to determine why the other nodes are not restarting.
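    • A minimal invocation of the pre-flight check might look like the following; the inventory path is an assumption and should be adjusted for your environment:

      ansible-playbook -i hosts pre-flight_check.yml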

SSH Connectivity is lost

Ansible uses SSH to communicate with remote nodes. In heavily loaded, single-host virtualized environments, SSH can lose connectivity. It should be noted that similar issues in a physical environment may indicate problems in the underlying network infrastructure.

  • Symptoms:
    • Ansible update attempt fails.

    • Error output:

      fatal: [192.0.2.25] => SSH encountered an unknown error. The
      output was: OpenSSH_6.6.1, OpenSSL 1.0.1i-dev xx XXX xxxx
      debug1: Reading configuration data /etc/ssh/ssh_config debug1:
      /etc/ssh/ssh_config line 19: Applying options for * debug1:
      auto-mux: Trying existing master debug2: fd 3 setting
      O_NONBLOCK mux_client_hello_exchange: write packet: Broken
      pipe FATAL: all hosts have already failed -- aborting
  • Solution:
    • You will generally be able to re-run the playbook and complete the upgrade, unless SSH connectivity is lost while all MySQL nodes are down. (See 'MySQL fails to start upon retrying update' to correct this issue.)

    • Early Ubuntu Trusty kernel versions have known issues with KVM which will severely impact SSH connectivity to instances. Test hosts should have a minimum kernel version of 3.13.0-36-generic. The update steps, as root, are:

      apt-get update
      apt-get dist-upgrade
      reboot
    • If this issue is repeatedly encountered on a physical environment, the network infrastructure should be inspected for errors.

    • Error messages similar to the one noted in the Symptoms may occur with long-running processes, such as database creation/upgrade steps. In these cases, partial program execution log output will generally be visible immediately before the broken pipe message.

      Should this be the case, Ansible and OpenSSH may need to have their configuration files tuned to meet the needs of the environment.

      Consult the Ansible configuration file to see the available connection settings (ssh_args, timeout, and possibly pipelining):

      https://github.com/ansible/ansible/blob/release1.7.0/examples/ansible.cfg

      As Ansible uses OpenSSH, please reference the ssh_config manual, in particular the ServerAliveInterval and ServerAliveCountMax options.
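    • As an illustration only, SSH keepalives and the Ansible connection timeout can be raised for a single run via environment variables; the values shown below are assumptions to be tuned for the environment:

      export ANSIBLE_SSH_ARGS="-o ServerAliveInterval=30 -o ServerAliveCountMax=5"
      export ANSIBLE_TIMEOUT=30
      ansible-playbook -i hosts site.yml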

Postfix fails to reload

Occasionally the postfix mail transfer agent will fail to reload because it is not running when the system expects it to be running.

  • Symptoms:
    • Step in /var/log/upstart/os-collect-config.log shows that 'service postfix reload' failed.

  • Solution:
    • Start postfix:

      sudo service postfix start

Apache2 fails to start

Apache2 requires some self-signed SSL certificates to be put in place that may not have been configured yet due to earlier failures in the setup process.

  • Error Message:
    • failed: [192.0.2.25] => (item=apache2) => {"failed": true, "item": "apache2"}
    • msg: start: Job failed to start
  • Symptoms:
    • apache2 service fails to start
    • /etc/ssl/certs/ssl-cert-snakeoil.pem is missing or empty
  • Solution:
    • Re-run os-collect-config to reassert the SSL certificates:

      sudo os-collect-config --force --one

RabbitMQ still running when restart is attempted

There are certain system states that cause RabbitMQ to fail to die on normal kill signals.

  • Symptoms:
    • Attempts to start rabbitmq fail because it is already running
  • Solution:
    • Find any processes running as rabbitmq on the box, and kill them, forcibly if need be.
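    • For example, assuming the processes run under the rabbitmq system user:

      ps -fu rabbitmq              # list processes owned by rabbitmq
      sudo pkill -u rabbitmq       # ask them to terminate
      sudo pkill -9 -u rabbitmq    # force-kill any that remain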

Instance reported with status == "SHUTOFF" and task_state == "powering on"

If nova attempts to restart an instance when the compute node is not ready, it is possible for nova to enter a confused state where it thinks that an instance is starting when in fact the compute node is doing nothing.

  • Symptoms:
    • Command nova list --all-tenants reports instance(s) with STATUS == "SHUTOFF" and task_state == "powering on".
    • Instance cannot be pinged.
    • No instance appears to be running on the compute node.
    • Nova hangs upon retrieving logs or returns old logs from the previous boot.
    • Console session cannot be established.
  • Solution:
    • On a controller logged in as root, after executing `source stackrc`:
      • Execute nova list --all-tenants to obtain instance ID(s)
      • Execute nova show <instance-id> on each suspected ID to identify suspected compute nodes.
    • Log into the suspected compute node(s) and execute: os-collect-config --force --one
    • Return to the controller node that you were logged into previously, and, using the instance IDs obtained earlier, take the following steps.
      • Execute nova reset-state --active <instance-id>
      • Execute nova stop <instance-id>
      • Execute nova start <instance-id>
    • Once the above steps have been taken in order, you should see the instance status return to ACTIVE and the instance become accessible via the network.
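    • As an illustrative sketch of the sequence above, with the instance ID as a placeholder:

      # on a controller, as root
      source stackrc
      nova list --all-tenants
      nova show <instance-id>                  # identify the suspected compute node

      # on the suspected compute node
      os-collect-config --force --one

      # back on the controller
      nova reset-state --active <instance-id>
      nova stop <instance-id>
      nova start <instance-id>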

State drive /mnt is not mounted

In the rare event that something went wrong between the state drive being unmounted and the rebuild command being triggered, the /mnt volume on the instance that was being operated on at the time will be left unmounted.

In such a state, pre-flight checks will fail attempting to start MySQL and RabbitMQ.

  • Error Messages:
    • Pre-flight check returns an error similar to:

      failed: [192.0.2.24] => {"changed": true, "cmd":
      "rabbitmqctl -n rabbit@$(hostname) status" stderr: Error:
      unable to connect to node
      'rabbit@overcloud-controller0-vahypr34iy2x': nodedown
    • Attempting to manually start MySQL or RabbitMQ return:

      start: Job failed to start
    • Upgrade execution returns with an error indicating:

      TASK: [fail msg="Galera Replication - Node appears to be the
      last node in a cluster - cannot safely proceed unless
      overriden via single_controller setting - See README.rst"] ***
  • Symptom:
    • Execution of the df command does not show a volume mounted as /mnt.
    • Unable to manually start services.
  • Solution:
    • Execute os-collect-config, which will re-mount the state drive. This command may fail without additional intervention; however, it should mount the state drive, which is all that is needed to proceed to the next step:

      sudo os-collect-config --force --one
    • At this point, the /mnt volume should be visible in the output of the df command.

    • Start MySQL by executing:

      sudo /etc/init.d/mysql start
    • If MySQL fails to start, and it has been verified that MySQL is not running on any controller nodes, then you will need to identify the last node that MySQL was stopped on and consult the section "MySQL fails to start upon retrying update" for guidance on restarting the cluster.

    • Start RabbitMQ by executing:

      service rabbitmq-server start
    • If rabbitmq-server fails to start, then the cluster may be down. If this is the case, then the last node to be stopped will need to be identified and started before attempting to restart RabbitMQ on this node.

    • At this point, re-execute the pre-flight check, and proceed with the upgrade.

VMs may not shut down properly during upgrade

During the upgrade process, VMs on compute nodes are shut down gracefully. If the VMs do not shut down, this can cause the upgrade to stop.

  • Error Messages:
    • A playbook run ends with a message similar to:

      failed: [10.23.210.31] => {"failed": true} msg: The ephemeral
      storage of this system failed to be cleaned up properly and
      processes or files are still in use. The previous ansible play
      should have information to help troubleshoot this issue.
    • The output of the playbook run prior to this message contains a process listing and a listing of open files.

  • Symptoms:
    • The state drive on the compute node, /mnt, is still in use and cannot be unmounted. You can confirm this by executing:

      lsof -n | grep /mnt
    • VMs are running on the node. To see which VMs are running, run:

      virsh list
    • If virsh list fails, you may need to restart libvirt-bin or libvirtd depending on which process you are running. Do so by running:

      service libvirt-bin restart
      or
      service libvirtd restart
  • Solution:
    • Manual intervention is required. You will need to determine why the VMs did not shut down properly, and resolve the issue.
    • Unresponsive VMs can be forcibly shut down using virsh destroy <id>. Note that this can corrupt filesystems on the VM.
    • Resume the playbook run once the VMs have been shut down.

Instances are inaccessible via network

Upon restarting, it is possible that a virtual machine is unreachable because Open vSwitch was not ready to provide the virtual machine's networking.

  • Symptom:
    • After a restart, instances won't ping.
  • Solution:
    • To resolve:
      • Log into a controller node and execute source /root/stackrc
      • Stop all virtual machines on a compute node utilizing nova hypervisor-servers <hostname> and nova stop <id>
      • Log into the undercloud node and execute source /root/stackrc
      • Obtain a list of nodes by executing nova list
      • Execute nova stop <id> for the affected compute node.
      • Once the compute node has stopped, execute nova start <id> to reboot the compute node.
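    • A condensed sketch of this sequence, with IDs and the host name as placeholders:

      # on a controller node
      source /root/stackrc
      nova hypervisor-servers <compute-hostname>   # list VMs on the compute node
      nova stop <vm-id>                            # repeat for each VM

      # on the undercloud node
      source /root/stackrc
      nova list                                    # find the compute node's instance ID
      nova stop <compute-id>
      nova start <compute-id>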

Online upgrade fails with a message saying glanceclient is not found

  • Symptoms:
    • An online upgrade has been attempted; however, the playbook execution failed when attempting to download the new image from Glance, reporting that glanceclient was not found.
  • Solution:
    • If you are attempting to execute the Ansible playbook on the seed or undercloud node, source the Ansible virtual environment by executing source /opt/stack/venvs/ansible/bin/activate
    • Once the Ansible virtual environment has been sourced, execute sudo pip install python-glanceclient on the node you are attempting to execute Ansible from.
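    • For example, combining the two steps above on the seed or undercloud node:

      source /opt/stack/venvs/ansible/bin/activate
      sudo pip install python-glanceclient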

Online Upgrade of compute node failed

In the event that an online upgrade of a compute node somehow failed, the node can be recovered utilizing a traditional rebuild.

  • Symptoms:
    • Online upgrade was performed.
    • Compute node cannot be logged into, or is otherwise in a non-working state.
  • Solution:
    • From the undercloud:
      • Execute source /root/stackrc
      • Identify the instance ID of the broken compute node via nova list
      • Execute the command nova stop <instance-id> to stop the instance.
      • Return to the host that you ran the upgrade from and re-run the playbook without the "-e online_upgrade=True" option.
      • Additionally, you may need to utilize the "-e force_rebuild=True" option to force the instance to rebuild.
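    • As a sketch of the recovery, with the instance ID as a placeholder; the playbook and inventory names are assumptions for your environment:

      # on the undercloud
      source /root/stackrc
      nova list                        # identify the broken compute node
      nova stop <instance-id>

      # from the host the upgrade was run from, without -e online_upgrade=True
      ansible-playbook -i hosts site.yml -e force_rebuild=True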