From a7f25d8162fb062d0875d0ebaa0949a576a5b150 Mon Sep 17 00:00:00 2001 From: Alexandra Settle Date: Wed, 16 Nov 2016 16:16:23 +0000 Subject: [PATCH] [DOCS] Creating new folder for proposed operations guide Moves pre-existing operations content to new folder Change-Id: I5c177dda2bba47e835fbd77cd63df3b52864c4d4 Implements: blueprint create-ops-guide --- .../draft-operations-guide/extending.rst | 306 ++++++++++++++++++ doc/source/draft-operations-guide/index.rst | 18 ++ .../ops-add-computehost.rst | 29 ++ .../ops-galera-recovery.rst | 302 +++++++++++++++++ .../ops-galera-remove.rst | 32 ++ .../ops-galera-start.rst | 88 +++++ .../draft-operations-guide/ops-galera.rst | 18 ++ .../ops-lxc-commands.rst | 38 +++ .../ops-remove-computehost.rst | 49 +++ .../draft-operations-guide/ops-tips.rst | 38 +++ .../ops-troubleshooting.rst | 125 +++++++ 11 files changed, 1043 insertions(+) create mode 100644 doc/source/draft-operations-guide/extending.rst create mode 100644 doc/source/draft-operations-guide/index.rst create mode 100644 doc/source/draft-operations-guide/ops-add-computehost.rst create mode 100644 doc/source/draft-operations-guide/ops-galera-recovery.rst create mode 100644 doc/source/draft-operations-guide/ops-galera-remove.rst create mode 100644 doc/source/draft-operations-guide/ops-galera-start.rst create mode 100644 doc/source/draft-operations-guide/ops-galera.rst create mode 100644 doc/source/draft-operations-guide/ops-lxc-commands.rst create mode 100644 doc/source/draft-operations-guide/ops-remove-computehost.rst create mode 100644 doc/source/draft-operations-guide/ops-tips.rst create mode 100644 doc/source/draft-operations-guide/ops-troubleshooting.rst diff --git a/doc/source/draft-operations-guide/extending.rst b/doc/source/draft-operations-guide/extending.rst new file mode 100644 index 0000000000..4ecaf5dc78 --- /dev/null +++ b/doc/source/draft-operations-guide/extending.rst @@ -0,0 +1,306 @@ +=========================== +Extending OpenStack-Ansible +=========================== + +The OpenStack-Ansible project provides a basic OpenStack environment, but +many deployers will wish to extend the environment based on their needs. This +could include installing extra services, changing package versions, or +overriding existing variables. + +Using these extension points, deployers can provide a more 'opinionated' +installation of OpenStack that may include their own software. + +Including OpenStack-Ansible in your project +------------------------------------------- + +Including the openstack-ansible repository within another project can be +done in several ways. + + 1. A git submodule pointed to a released tag. + 2. A script to automatically perform a git checkout of + openstack-ansible + +When including OpenStack-Ansible in a project, consider using a parallel +directory structure as shown in the `ansible.cfg files`_ section. + +Also note that copying files into directories such as `env.d`_ or +`conf.d`_ should be handled via some sort of script within the extension +project. + +ansible.cfg files +----------------- + +You can create your own playbook, variable, and role structure while still +including the OpenStack-Ansible roles and libraries by putting an +``ansible.cfg`` file in your ``playbooks`` directory. + +The relevant options for Ansible 1.9 (included in OpenStack-Ansible) +are as follows: + + ``library`` + This variable should point to + ``openstack-ansible/playbooks/library``. Doing so allows roles and + playbooks to access OpenStack-Ansible's included Ansible modules. 
+ ``roles_path`` + This variable should point to + ``openstack-ansible/playbooks/roles``. This allows Ansible to + properly look up any OpenStack-Ansible roles that extension roles + may reference. + ``inventory`` + This variable should point to + ``openstack-ansible/playbooks/inventory``. With this setting, + extensions have access to the same dynamic inventory that + OpenStack-Ansible uses. + +Note that the paths to the ``openstack-ansible`` top level directory can be +relative in this file. + +Consider this directory structure:: + + my_project + | + |- custom_stuff + | | + | |- playbooks + |- openstack-ansible + | | + | |- playbooks + +The variables in ``my_project/custom_stuff/playbooks/ansible.cfg`` would use +``../openstack-ansible/playbooks/``. + + +env.d +----- + +The ``/etc/openstack_deploy/env.d`` directory sources all YAML files into the +deployed environment, allowing a deployer to define additional group mappings. + +This directory is used to extend the environment skeleton, or modify the +defaults defined in the ``playbooks/inventory/env.d`` directory. + +See also `Understanding Container Groups`_ in Appendix C. + +.. _Understanding Container Groups: ../install-guide/app-custom-layouts.html#understanding-container-groups + +conf.d +------ + +Common OpenStack services and their configuration are defined by +OpenStack-Ansible in the +``/etc/openstack_deploy/openstack_user_config.yml`` settings file. + +Additional services should be defined with a YAML file in +``/etc/openstack_deploy/conf.d``, in order to manage file size. + +See also `Understanding Host Groups`_ in Appendix C. + +.. _Understanding Host Groups: ../install-guide/app-custom-layouts.html#understanding-host-groups + +user\_*.yml files +----------------- + +Files in ``/etc/openstack_deploy`` beginning with ``user_`` will be +automatically sourced in any ``openstack-ansible`` command. Alternatively, +the files can be sourced with the ``-e`` parameter of the ``ansible-playbook`` +command. + +``user_variables.yml`` and ``user_secrets.yml`` are used directly by +OpenStack-Ansible. Adding custom variables used by your own roles and +playbooks to these files is not recommended. Doing so will complicate your +upgrade path by making comparison of your existing files with later versions +of these files more arduous. Rather, recommended practice is to place your own +variables in files named following the ``user_*.yml`` pattern so they will be +sourced alongside those used exclusively by OpenStack-Ansible. + +Ordering and Precedence ++++++++++++++++++++++++ + +``user_*.yml`` variables are just YAML variable files. They will be sourced +in alphanumeric order by ``openstack-ansible``. + +.. _adding-galaxy-roles: + +Adding Galaxy roles +------------------- + +Any roles defined in ``openstack-ansible/ansible-role-requirements.yml`` +will be installed by the +``openstack-ansible/scripts/bootstrap-ansible.sh`` script. + + +Setting overrides in configuration files +---------------------------------------- + +All of the services that use YAML, JSON, or INI for configuration can receive +overrides through the use of a Ansible action plugin named ``config_template``. +The configuration template engine allows a deployer to use a simple dictionary +to modify or add items into configuration files at run time that may not have a +preset template option. All OpenStack-Ansible roles allow for this +functionality where applicable. 
+Files available to receive overrides can be seen in the
+``defaults/main.yml`` file as standard empty dictionaries (hashes).
+
+Practical guidance for using this feature is available in the `Install Guide`_.
+
+This module has been `submitted for consideration`_ into Ansible Core.
+
+.. _Install Guide: ../install-guide/app-advanced-config-override.html
+.. _submitted for consideration: https://github.com/ansible/ansible/pull/12555
+
+
+Build the environment with additional Python packages
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+The system allows you to install and build any Python-installable package.
+The repository infrastructure looks for and creates any git-based or
+PyPI-installable package. When the package is built, the repo-build role
+creates the sources as Python wheels to extend the base system and
+requirements.
+
+While the packages pre-built in the repository infrastructure are
+comprehensive, you may need to change the source locations and versions of
+packages to suit different deployment needs. Adding additional repositories
+as overrides is as simple as listing entries within the variable file of
+your choice. Any ``user_*.yml`` file within the ``/etc/openstack_deploy``
+directory will work to facilitate the addition of new packages.
+
+
+.. code-block:: yaml
+
+   swift_git_repo: https://private-git.example.org/example-org/swift
+   swift_git_install_branch: master
+
+
+Additional lists of Python packages can also be overridden using a
+``user_*.yml`` variable file.
+
+.. code-block:: yaml
+
+   swift_requires_pip_packages:
+     - virtualenv
+     - virtualenv-tools
+     - python-keystoneclient
+     - NEW-SPECIAL-PACKAGE
+
+
+Once the variables are set, run the ``repo-build.yml`` play to build all of
+the wheels within the repository infrastructure. When ready, run the target
+plays to deploy your overridden source code.
+
+
+Module documentation
+++++++++++++++++++++
+
+These are the options available as found within the virtual module
+documentation section.
+
+.. code-block:: yaml
+
+   module: config_template
+   version_added: 1.9.2
+   short_description: >
+     Renders template files providing a create/update override interface
+   description:
+     - The module contains the template functionality with the ability to
+       override items in config, in transit, through the use of a simple
+       dictionary without having to write out various temp files on target
+       machines. The module renders all of the potential Jinja a user could
+       provide in both the template file and in the override dictionary,
+       which is ideal for deployers who may have lots of different configs
+       using a similar code base.
+     - The module is an extension of the **copy** module and all of the
+       attributes that can be set there are available to be set here.
+   options:
+     src:
+       description:
+         - Path of a Jinja2 formatted template on the local server. This can
+           be a relative or absolute path.
+       required: true
+       default: null
+     dest:
+       description:
+         - Location to render the template to on the remote machine.
+       required: true
+       default: null
+     config_overrides:
+       description:
+         - A dictionary used to update or override items within a
+           configuration template. The dictionary data structure may be
+           nested. If the target config file is an INI file, the nested keys
+           in the ``config_overrides`` will be used as section headers.
+     config_type:
+       description:
+         - A string value describing the target config type.
+ choices: + - ini + - json + - yaml + + +Example task using the "config_template" module +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: yaml + + - name: Run config template ini + config_template: + src: test.ini.j2 + dest: /tmp/test.ini + config_overrides: {{ test_overrides }} + config_type: ini + + +Example overrides dictionary(hash) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: yaml + + test_overrides: + DEFAULT: + new_item: 12345 + + +Original template file "test.ini.j2" +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: ini + + [DEFAULT] + value1 = abc + value2 = 123 + + +Rendered on disk file "/tmp/test.ini" +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: ini + + [DEFAULT] + value1 = abc + value2 = 123 + new_item = 12345 + + +In this task the ``test.ini.j2`` file is a template which will be rendered and +written to disk at ``/tmp/test.ini``. The **config_overrides** entry is a +dictionary(hash) which allows a deployer to set arbitrary data as overrides to +be written into the configuration file at run time. The **config_type** entry +specifies the type of configuration file the module will be interacting with; +available options are "yaml", "json", and "ini". + + +Discovering Available Overrides +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +All of these options can be specified in any way that suits your deployment. +In terms of ease of use and flexibility it's recommended that you define your +overrides in a user variable file such as +``/etc/openstack_deploy/user_variables.yml``. + +The list of overrides available may be found by executing: + +.. code-block:: bash + + find . -name "main.yml" -exec grep '_.*_overrides:' {} \; \ + | grep -v "^#" \ + | sort -u diff --git a/doc/source/draft-operations-guide/index.rst b/doc/source/draft-operations-guide/index.rst new file mode 100644 index 0000000000..517b8e2733 --- /dev/null +++ b/doc/source/draft-operations-guide/index.rst @@ -0,0 +1,18 @@ +================================== +OpenStack-Ansible operations guide +================================== + +This is a draft index page for the proposed OpenStack-Ansible +operations guide. + +.. toctree:: + :maxdepth: 2 + + ops-lxc-commands.rst + ops-add-computehost.rst + ops-remove-computehost.rst + ops-galera.rst + ops-tips.rst + ops-troubleshooting.rst + extending.rst + diff --git a/doc/source/draft-operations-guide/ops-add-computehost.rst b/doc/source/draft-operations-guide/ops-add-computehost.rst new file mode 100644 index 0000000000..b1def07110 --- /dev/null +++ b/doc/source/draft-operations-guide/ops-add-computehost.rst @@ -0,0 +1,29 @@ +===================== +Adding a compute host +===================== + +Use the following procedure to add a compute host to an operational +cluster. + +#. Configure the host as a target host. See `Prepare target hosts + `_ + for more information. + +#. Edit the ``/etc/openstack_deploy/openstack_user_config.yml`` file and + add the host to the ``compute_hosts`` stanza. + + If necessary, also modify the ``used_ips`` stanza. + +#. If the cluster is utilizing Telemetry/Metering (Ceilometer), + edit the ``/etc/openstack_deploy/conf.d/ceilometer.yml`` file and add the + host to the ``metering-compute_hosts`` stanza. + +#. Run the following commands to add the host. Replace + ``NEW_HOST_NAME`` with the name of the new host. + + .. 
code-block:: shell-session
+
+      # cd /opt/openstack-ansible/playbooks
+      # openstack-ansible setup-hosts.yml --limit NEW_HOST_NAME
+      # openstack-ansible setup-openstack.yml --skip-tags nova-key-distribute --limit NEW_HOST_NAME
+      # openstack-ansible setup-openstack.yml --tags nova-key --limit compute_hosts
diff --git a/doc/source/draft-operations-guide/ops-galera-recovery.rst b/doc/source/draft-operations-guide/ops-galera-recovery.rst
new file mode 100644
index 0000000000..cb7dcd2892
--- /dev/null
+++ b/doc/source/draft-operations-guide/ops-galera-recovery.rst
@@ -0,0 +1,302 @@
+=======================
+Galera cluster recovery
+=======================
+
+Run the ``galera-install`` playbook using the ``galera-bootstrap`` tag to
+automatically recover a node or an entire environment.
+
+#. Run the following Ansible command to bootstrap the cluster:
+
+   .. code-block:: shell-session
+
+      # openstack-ansible galera-install.yml --tags galera-bootstrap
+
+The cluster comes back online after completion of this command.
+
+Single-node failure
+~~~~~~~~~~~~~~~~~~~
+
+If a single node fails, the other nodes maintain quorum and
+continue to process SQL requests.
+
+#. Run the following Ansible command to determine the failed node:
+
+   .. code-block:: shell-session
+
+      # ansible galera_container -m shell -a "mysql -h localhost \
+        -e 'show status like \"%wsrep_cluster_%\";'"
+      node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
+      ERROR 2002 (HY000): Can't connect to local MySQL server through
+      socket '/var/run/mysqld/mysqld.sock' (111)
+
+      node2_galera_container-49a47d25 | success | rc=0 >>
+      Variable_name             Value
+      wsrep_cluster_conf_id     17
+      wsrep_cluster_size        3
+      wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
+      wsrep_cluster_status      Primary
+
+      node4_galera_container-76275635 | success | rc=0 >>
+      Variable_name             Value
+      wsrep_cluster_conf_id     17
+      wsrep_cluster_size        3
+      wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
+      wsrep_cluster_status      Primary
+
+   In this example, node 3 has failed.
+
+#. Restart MariaDB on the failed node and verify that it rejoins the
+   cluster.
+
+#. If MariaDB fails to start, run the ``mysqld`` command and perform
+   further analysis on the output. As a last resort, rebuild the container
+   for the node.
+
+Multi-node failure
+~~~~~~~~~~~~~~~~~~
+
+When all but one node fails, the remaining node cannot achieve quorum and
+stops processing SQL requests. In this situation, failed nodes that
+recover cannot join the cluster because it no longer exists.
+
+#. Run the following Ansible command to show the failed nodes:
+
+   .. code-block:: shell-session
+
+      # ansible galera_container -m shell -a "mysql \
+        -h localhost -e 'show status like \"%wsrep_cluster_%\";'"
+      node2_galera_container-49a47d25 | FAILED | rc=1 >>
+      ERROR 2002 (HY000): Can't connect to local MySQL server
+      through socket '/var/run/mysqld/mysqld.sock' (111)
+
+      node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
+      ERROR 2002 (HY000): Can't connect to local MySQL server
+      through socket '/var/run/mysqld/mysqld.sock' (111)
+
+      node4_galera_container-76275635 | success | rc=0 >>
+      Variable_name             Value
+      wsrep_cluster_conf_id     18446744073709551615
+      wsrep_cluster_size        1
+      wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
+      wsrep_cluster_status      non-Primary
+
+   In this example, nodes 2 and 3 have failed. The remaining operational
+   server indicates ``non-Primary`` because it cannot achieve quorum.
+
+#. Run the following command to `rebootstrap `_ the operational
+   node into the cluster:
+
+   .. code-block:: shell-session
+
+      # mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';"
+      node4_galera_container-76275635 | success | rc=0 >>
+      Variable_name             Value
+      wsrep_cluster_conf_id     15
+      wsrep_cluster_size        1
+      wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
+      wsrep_cluster_status      Primary
+
+      node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
+      ERROR 2002 (HY000): Can't connect to local MySQL server
+      through socket '/var/run/mysqld/mysqld.sock' (111)
+
+      node2_galera_container-49a47d25 | FAILED | rc=1 >>
+      ERROR 2002 (HY000): Can't connect to local MySQL server
+      through socket '/var/run/mysqld/mysqld.sock' (111)
+
+   The remaining operational node becomes the primary node and begins
+   processing SQL requests.
+
+#. Restart MariaDB on the failed nodes and verify that they rejoin the
+   cluster:
+
+   .. code-block:: shell-session
+
+      # ansible galera_container -m shell -a "mysql \
+        -h localhost -e 'show status like \"%wsrep_cluster_%\";'"
+      node3_galera_container-3ea2cbd3 | success | rc=0 >>
+      Variable_name             Value
+      wsrep_cluster_conf_id     17
+      wsrep_cluster_size        3
+      wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
+      wsrep_cluster_status      Primary
+
+      node2_galera_container-49a47d25 | success | rc=0 >>
+      Variable_name             Value
+      wsrep_cluster_conf_id     17
+      wsrep_cluster_size        3
+      wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
+      wsrep_cluster_status      Primary
+
+      node4_galera_container-76275635 | success | rc=0 >>
+      Variable_name             Value
+      wsrep_cluster_conf_id     17
+      wsrep_cluster_size        3
+      wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
+      wsrep_cluster_status      Primary
+
+#. If MariaDB fails to start on any of the failed nodes, run the
+   ``mysqld`` command and perform further analysis on the output. As a
+   last resort, rebuild the container for the node.
+
+Complete failure
+~~~~~~~~~~~~~~~~
+
+Restore from backup if all of the nodes in a Galera cluster fail (do not
+shut down gracefully). Run the following command to determine if all nodes
+in the cluster have failed:
+
+.. code-block:: shell-session
+
+   # ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
+   node3_galera_container-3ea2cbd3 | success | rc=0 >>
+   # GALERA saved state
+   version: 2.1
+   uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
+   seqno:   -1
+   cert_index:
+
+   node2_galera_container-49a47d25 | success | rc=0 >>
+   # GALERA saved state
+   version: 2.1
+   uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
+   seqno:   -1
+   cert_index:
+
+   node4_galera_container-76275635 | success | rc=0 >>
+   # GALERA saved state
+   version: 2.1
+   uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
+   seqno:   -1
+   cert_index:
+
+All the nodes have failed if ``mysqld`` is not running on any of the
+nodes and all of the nodes contain a ``seqno`` value of -1.
+
+If any single node has a positive ``seqno`` value, then that node can be
+used to restart the cluster. However, because there is no guarantee that
+each node has an identical copy of the data, we do not recommend
+restarting the cluster using the ``--wsrep-new-cluster`` command on one
+node.
+
+Rebuilding a container
+~~~~~~~~~~~~~~~~~~~~~~
+
+Recovering from certain failures requires rebuilding one or more containers.
+
+#. Disable the failed node on the load balancer.
+
+   .. note::
+
+      Do not rely on the load balancer health checks to disable the node.
+ If the node is not disabled, the load balancer sends SQL requests + to it before it rejoins the cluster and cause data inconsistencies. + +#. Destroy the container and remove MariaDB data stored outside + of the container: + + .. code-block:: shell-session + + # lxc-stop -n node3_galera_container-3ea2cbd3 + # lxc-destroy -n node3_galera_container-3ea2cbd3 + # rm -rf /openstack/node3_galera_container-3ea2cbd3/* + + In this example, node 3 failed. + +#. Run the host setup playbook to rebuild the container on node 3: + + .. code-block:: shell-session + + # openstack-ansible setup-hosts.yml -l node3 \ + -l node3_galera_container-3ea2cbd3 + + + The playbook restarts all other containers on the node. + +#. Run the infrastructure playbook to configure the container + specifically on node 3: + + .. code-block:: shell-session + + # openstack-ansible setup-infrastructure.yml \ + -l node3_galera_container-3ea2cbd3 + + + .. warning:: + + The new container runs a single-node Galera cluster, which is a dangerous + state because the environment contains more than one active database + with potentially different data. + + .. code-block:: shell-session + + # ansible galera_container -m shell -a "mysql \ + -h localhost -e 'show status like \"%wsrep_cluster_%\";'" + node3_galera_container-3ea2cbd3 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 1 + wsrep_cluster_size 1 + wsrep_cluster_state_uuid da078d01-29e5-11e4-a051-03d896dbdb2d + wsrep_cluster_status Primary + + node2_galera_container-49a47d25 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 4 + wsrep_cluster_size 2 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + + node4_galera_container-76275635 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 4 + wsrep_cluster_size 2 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + +#. Restart MariaDB in the new container and verify that it rejoins the + cluster. + + .. note:: + + In larger deployments, it may take some time for the MariaDB daemon to + start in the new container. It will be synchronizing data from the other + MariaDB servers during this time. You can monitor the status during this + process by tailing the ``/var/log/mysql_logs/galera_server_error.log`` + log file. + + Lines starting with ``WSREP_SST`` will appear during the sync process + and you should see a line with ``WSREP: SST complete, seqno: `` + if the sync was successful. + + .. code-block:: shell-session + + # ansible galera_container -m shell -a "mysql \ + -h localhost -e 'show status like \"%wsrep_cluster_%\";'" + node2_galera_container-49a47d25 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 5 + wsrep_cluster_size 3 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + + node3_galera_container-3ea2cbd3 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 5 + wsrep_cluster_size 3 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + + node4_galera_container-76275635 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 5 + wsrep_cluster_size 3 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + + +#. Enable the failed node on the load balancer. 
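+
+   If the environment uses HAProxy as the load balancer, disabling and
+   enabling the node can be done through the HAProxy admin socket. The
+   following is only a sketch: the backend name (``galera-back``), the
+   server name, and the socket path are assumptions, so substitute the
+   values from your own HAProxy configuration:
+
+   .. code-block:: shell-session
+
+      # echo "disable server galera-back/node3_galera_container-3ea2cbd3" | \
+        socat stdio /var/run/haproxy.stat
+      # echo "enable server galera-back/node3_galera_container-3ea2cbd3" | \
+        socat stdio /var/run/haproxy.stat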
diff --git a/doc/source/draft-operations-guide/ops-galera-remove.rst b/doc/source/draft-operations-guide/ops-galera-remove.rst new file mode 100644 index 0000000000..af44acd444 --- /dev/null +++ b/doc/source/draft-operations-guide/ops-galera-remove.rst @@ -0,0 +1,32 @@ +============== +Removing nodes +============== + +In the following example, all but one node was shut down gracefully: + +.. code-block:: shell-session + + # ansible galera_container -m shell -a "mysql -h localhost \ + -e 'show status like \"%wsrep_cluster_%\";'" + node3_galera_container-3ea2cbd3 | FAILED | rc=1 >> + ERROR 2002 (HY000): Can't connect to local MySQL server + through socket '/var/run/mysqld/mysqld.sock' (2) + + node2_galera_container-49a47d25 | FAILED | rc=1 >> + ERROR 2002 (HY000): Can't connect to local MySQL server + through socket '/var/run/mysqld/mysqld.sock' (2) + + node4_galera_container-76275635 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 7 + wsrep_cluster_size 1 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + + +Compare this example output with the output from the multi-node failure +scenario where the remaining operational node is non-primary and stops +processing SQL requests. Gracefully shutting down the MariaDB service on +all but one node allows the remaining operational node to continue +processing SQL requests. When gracefully shutting down multiple nodes, +perform the actions sequentially to retain operation. diff --git a/doc/source/draft-operations-guide/ops-galera-start.rst b/doc/source/draft-operations-guide/ops-galera-start.rst new file mode 100644 index 0000000000..7555546a1c --- /dev/null +++ b/doc/source/draft-operations-guide/ops-galera-start.rst @@ -0,0 +1,88 @@ +================== +Starting a cluster +================== + +Gracefully shutting down all nodes destroys the cluster. Starting or +restarting a cluster from zero nodes requires creating a new cluster on +one of the nodes. + +#. Start a new cluster on the most advanced node. + Check the ``seqno`` value in the ``grastate.dat`` file on all of the nodes: + + .. code-block:: shell-session + + # ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat" + node2_galera_container-49a47d25 | success | rc=0 >> + # GALERA saved state version: 2.1 + uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1 + seqno: 31 + cert_index: + + node3_galera_container-3ea2cbd3 | success | rc=0 >> + # GALERA saved state version: 2.1 + uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1 + seqno: 31 + cert_index: + + node4_galera_container-76275635 | success | rc=0 >> + # GALERA saved state version: 2.1 + uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1 + seqno: 31 + cert_index: + + In this example, all nodes in the cluster contain the same positive + ``seqno`` values as they were synchronized just prior to + graceful shutdown. If all ``seqno`` values are equal, any node can + start the new cluster. + + .. code-block:: shell-session + + # /etc/init.d/mysql start --wsrep-new-cluster + + This command results in a cluster containing a single node. The + ``wsrep_cluster_size`` value shows the number of nodes in the + cluster. + + .. 
code-block:: shell-session + + node2_galera_container-49a47d25 | FAILED | rc=1 >> + ERROR 2002 (HY000): Can't connect to local MySQL server + through socket '/var/run/mysqld/mysqld.sock' (111) + + node3_galera_container-3ea2cbd3 | FAILED | rc=1 >> + ERROR 2002 (HY000): Can't connect to local MySQL server + through socket '/var/run/mysqld/mysqld.sock' (2) + + node4_galera_container-76275635 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 1 + wsrep_cluster_size 1 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + +#. Restart MariaDB on the other nodes and verify that they rejoin the + cluster. + + .. code-block:: shell-session + + node2_galera_container-49a47d25 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 3 + wsrep_cluster_size 3 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + + node3_galera_container-3ea2cbd3 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 3 + wsrep_cluster_size 3 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + + node4_galera_container-76275635 | success | rc=0 >> + Variable_name Value + wsrep_cluster_conf_id 3 + wsrep_cluster_size 3 + wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 + wsrep_cluster_status Primary + diff --git a/doc/source/draft-operations-guide/ops-galera.rst b/doc/source/draft-operations-guide/ops-galera.rst new file mode 100644 index 0000000000..fccac80652 --- /dev/null +++ b/doc/source/draft-operations-guide/ops-galera.rst @@ -0,0 +1,18 @@ +========================== +Galera cluster maintenance +========================== + +.. toctree:: + + ops-galera-remove.rst + ops-galera-start.rst + ops-galera-recovery.rst + +Routine maintenance includes gracefully adding or removing nodes from +the cluster without impacting operation and also starting a cluster +after gracefully shutting down all nodes. + +MySQL instances are restarted when creating a cluster, when adding a +node, when the service is not running, or when changes are made to the +``/etc/mysql/my.cnf`` configuration file. + diff --git a/doc/source/draft-operations-guide/ops-lxc-commands.rst b/doc/source/draft-operations-guide/ops-lxc-commands.rst new file mode 100644 index 0000000000..2464d67fd6 --- /dev/null +++ b/doc/source/draft-operations-guide/ops-lxc-commands.rst @@ -0,0 +1,38 @@ +======================== +Linux Container commands +======================== + +The following are some useful commands to manage LXC: + +- List containers and summary information such as operational state and + network configuration: + + .. code-block:: shell-session + + # lxc-ls --fancy + +- Show container details including operational state, resource + utilization, and ``veth`` pairs: + + .. code-block:: shell-session + + # lxc-info --name container_name + +- Start a container: + + .. code-block:: shell-session + + # lxc-start --name container_name + +- Attach to a container: + + .. code-block:: shell-session + + # lxc-attach --name container_name + +- Stop a container: + + .. 
code-block:: shell-session + + # lxc-stop --name container_name + diff --git a/doc/source/draft-operations-guide/ops-remove-computehost.rst b/doc/source/draft-operations-guide/ops-remove-computehost.rst new file mode 100644 index 0000000000..bd6a9aea9d --- /dev/null +++ b/doc/source/draft-operations-guide/ops-remove-computehost.rst @@ -0,0 +1,49 @@ +======================= +Removing a compute host +======================= + +The `openstack-ansible-ops `_ +repository contains a playbook for removing a compute host from an +OpenStack-Ansible (OSA) environment. +To remove a compute host, follow the below procedure. + +.. note:: + + This guide describes how to remove a compute node from an OSA environment + completely. Perform these steps with caution, as the compute node will no + longer be in service after the steps have been completed. This guide assumes + that all data and instances have been properly migrated. + +#. Disable all OpenStack services running on the compute node. + This can include, but is not limited to, the ``nova-compute`` service + and the neutron agent service. + + .. note:: + + Ensure this step is performed first + + .. code-block:: console + + # Run these commands on the compute node to be removed + # stop nova-compute + # stop neutron-linuxbridge-agent + +#. Clone the ``openstack-ansible-ops`` repository to your deployment host: + + .. code-block:: console + + $ git clone https://git.openstack.org/openstack/openstack-ansible-ops \ + /opt/openstack-ansible-ops + +#. Run the ``remove_compute_node.yml`` Ansible playbook with the + ``node_to_be_removed`` user variable set: + + .. code-block:: console + + $ cd /opt/openstack-ansible-ops/ansible_tools/playbooks + openstack-ansible remove_compute_node.yml \ + -e node_to_be_removed="" + +#. After the playbook completes, remove the compute node from the + OpenStack-Ansible configuration file in + ``/etc/openstack_deploy/openstack_user_config.yml``. diff --git a/doc/source/draft-operations-guide/ops-tips.rst b/doc/source/draft-operations-guide/ops-tips.rst new file mode 100644 index 0000000000..84d0221836 --- /dev/null +++ b/doc/source/draft-operations-guide/ops-tips.rst @@ -0,0 +1,38 @@ +=============== +Tips and tricks +=============== + +Ansible forks +~~~~~~~~~~~~~ + +The default MaxSessions setting for the OpenSSH Daemon is 10. Each Ansible +fork makes use of a Session. By default, Ansible sets the number of forks to +5. However, you can increase the number of forks used in order to improve +deployment performance in large environments. + +Note that more than 10 forks will cause issues for any playbooks +which use ``delegate_to`` or ``local_action`` in the tasks. It is +recommended that the number of forks are not raised when executing against the +Control Plane, as this is where delegation is most often used. + +The number of forks used may be changed on a permanent basis by including +the appropriate change to the ``ANSIBLE_FORKS`` in your ``.bashrc`` file. +Alternatively it can be changed for a particular playbook execution by using +the ``--forks`` CLI parameter. For example, the following executes the nova +playbook against the control plane with 10 forks, then against the compute +nodes with 50 forks. + +.. 
code-block:: shell-session + + # openstack-ansible --forks 10 os-nova-install.yml --limit compute_containers + # openstack-ansible --forks 50 os-nova-install.yml --limit compute_hosts + +For more information about forks, please see the following references: + +* OpenStack-Ansible `Bug 1479812`_ +* Ansible `forks`_ entry for ansible.cfg +* `Ansible Performance Tuning`_ + +.. _Bug 1479812: https://bugs.launchpad.net/openstack-ansible/+bug/1479812 +.. _forks: http://docs.ansible.com/ansible/intro_configuration.html#forks +.. _Ansible Performance Tuning: https://www.ansible.com/blog/ansible-performance-tuning diff --git a/doc/source/draft-operations-guide/ops-troubleshooting.rst b/doc/source/draft-operations-guide/ops-troubleshooting.rst new file mode 100644 index 0000000000..e5b7a40584 --- /dev/null +++ b/doc/source/draft-operations-guide/ops-troubleshooting.rst @@ -0,0 +1,125 @@ +=============== +Troubleshooting +=============== + +Host kernel upgrade from version 3.13 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Ubuntu kernel packages newer than version 3.13 contain a change in +module naming from ``nf_conntrack`` to ``br_netfilter``. After +upgrading the kernel, re-run the ``openstack-hosts-setup.yml`` +playbook against those hosts. See `OSA bug 157996`_ for more +information. + +.. _OSA bug 157996: https://bugs.launchpad.net/openstack-ansible/+bug/1579963 + + + +Container networking issues +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All LXC containers on the host have two virtual Ethernet interfaces: + +* `eth0` in the container connects to `lxcbr0` on the host +* `eth1` in the container connects to `br-mgmt` on the host + +.. note:: + + Some containers, such as ``cinder``, ``glance``, ``neutron_agents``, and + ``swift_proxy``, have more than two interfaces to support their + functions. + +Predictable interface naming +---------------------------- + +On the host, all virtual Ethernet devices are named based on their +container as well as the name of the interface inside the container: + + .. code-block:: shell-session + + ${CONTAINER_UNIQUE_ID}_${NETWORK_DEVICE_NAME} + +As an example, an all-in-one (AIO) build might provide a utility +container called `aio1_utility_container-d13b7132`. That container +will have two network interfaces: `d13b7132_eth0` and `d13b7132_eth1`. + +Another option would be to use the LXC tools to retrieve information +about the utility container: + + .. code-block:: shell-session + + # lxc-info -n aio1_utility_container-d13b7132 + + Name: aio1_utility_container-d13b7132 + State: RUNNING + PID: 8245 + IP: 10.0.3.201 + IP: 172.29.237.204 + CPU use: 79.18 seconds + BlkIO use: 678.26 MiB + Memory use: 613.33 MiB + KMem use: 0 bytes + Link: d13b7132_eth0 + TX bytes: 743.48 KiB + RX bytes: 88.78 MiB + Total bytes: 89.51 MiB + Link: d13b7132_eth1 + TX bytes: 412.42 KiB + RX bytes: 17.32 MiB + Total bytes: 17.73 MiB + +The ``Link:`` lines will show the network interfaces that are attached +to the utility container. + +Reviewing container networking traffic +-------------------------------------- + +To dump traffic on the ``br-mgmt`` bridge, use ``tcpdump`` to see all +communications between the various containers. To narrow the focus, +run ``tcpdump`` only on the desired network interface of the +containers. + +Cached Ansible facts issues +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +At the beginning of a playbook run, information about each host is gathered. 
+Examples of the information gathered are:
+
+ * Linux distribution
+ * Kernel version
+ * Network interfaces
+
+To improve performance, particularly in large deployments, you can
+cache host facts and information.
+
+OpenStack-Ansible enables fact caching by default. The facts are
+cached in JSON files within ``/etc/openstack_deploy/ansible_facts``.
+
+Fact caching can be disabled by commenting out the ``fact_caching``
+parameter in ``playbooks/ansible.cfg``. Refer to the Ansible
+documentation on `fact caching`_ for more details.
+
+.. _fact caching: http://docs.ansible.com/ansible/playbooks_variables.html#fact-caching
+
+Forcing regeneration of cached facts
+------------------------------------
+
+Cached facts may be incorrect if the host receives a kernel upgrade or if
+new network interfaces or bridges are created on the host.
+
+This can lead to unexpected errors while running playbooks and may
+require the cached facts to be regenerated.
+
+Run the following command to remove all currently cached facts for all hosts:
+
+.. code-block:: shell-session
+
+   # rm /etc/openstack_deploy/ansible_facts/*
+
+New facts will be gathered and cached during the next playbook run.
+
+To clear facts for a single host, find its file within
+``/etc/openstack_deploy/ansible_facts/`` and remove it. Each host has
+a JSON file that is named after its hostname. The facts for that host
+will be regenerated on the next playbook run.
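+
+For example, the cache entry for a hypothetical host named ``compute01``
+(substitute one of your own inventory hostnames) could be inspected and
+then removed as follows; the cached facts are plain JSON, so any JSON tool
+can read them:
+
+.. code-block:: shell-session
+
+   # python -m json.tool /etc/openstack_deploy/ansible_facts/compute01 | less
+   # rm /etc/openstack_deploy/ansible_facts/compute01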