import draft from openstack-manuals/doc/ha-guide-draft/

Move the draft of the restructured HA guide to this dedicated
repository, in order to focus reviews on a smaller, more specialist
audience and accelerate development.

Change-Id: I95a4b46fecaafafd1beb8314d1cf795b60fb17a8
Adam Spiers 2018-11-14 15:44:02 +01:00
parent 34c59c7cb5
commit 393604c6b7
38 changed files with 6919 additions and 5 deletions

230
doc/common/app-support.rst Normal file

@ -0,0 +1,230 @@
.. ## WARNING ##########################################################
.. This file is synced from openstack/openstack-manuals repository to
.. other related repositories. If you need to make changes to this file,
.. make the changes in openstack-manuals. After any change is merged to
.. openstack-manuals, a patch for the other repositories will be proposed automatically.
.. #####################################################################
=================
Community support
=================
The following resources are available to help you run and use OpenStack.
The OpenStack community constantly improves and adds to the main
features of OpenStack, but if you have any questions, do not hesitate to
ask. Use the following resources to get OpenStack support and
troubleshoot your installations.
Documentation
~~~~~~~~~~~~~
For the available OpenStack documentation, see
`docs.openstack.org <https://docs.openstack.org>`_.
The following guides explain how to install a Proof-of-Concept OpenStack cloud
and its associated components:
* `Rocky Installation Guides <https://docs.openstack.org/rocky/install/>`_
The following books explain how to configure and run an OpenStack cloud:
* `Architecture Design Guide <https://docs.openstack.org/arch-design/>`_
* `Rocky Administrator Guides <https://docs.openstack.org/rocky/admin/>`_
* `Rocky Configuration Guides <https://docs.openstack.org/rocky/configuration/>`_
* `Rocky Networking Guide <https://docs.openstack.org/neutron/rocky/admin/>`_
* `High Availability Guide <https://docs.openstack.org/ha-guide/>`_
* `Security Guide <https://docs.openstack.org/security-guide/>`_
* `Virtual Machine Image Guide <https://docs.openstack.org/image-guide/>`_
The following book explains how to use the command-line clients:
* `Rocky API Bindings
<https://docs.openstack.org/rocky/language-bindings.html>`_
The following documentation provides reference and guidance information
for the OpenStack APIs:
* `API Documentation <https://developer.openstack.org/api-guide/quick-start/>`_
The following guide provides information on how to contribute to OpenStack
documentation:
* `Documentation Contributor Guide <https://docs.openstack.org/doc-contrib-guide/>`_
ask.openstack.org
~~~~~~~~~~~~~~~~~
During the setup or testing of OpenStack, you might have questions
about how a specific task is completed or be in a situation where a
feature does not work correctly. Use the
`ask.openstack.org <https://ask.openstack.org>`_ site to ask questions
and get answers. When you visit the `Ask OpenStack
<https://ask.openstack.org>`_ site, scan
the recently asked questions to see whether your question has already
been answered. If not, ask a new question. Be sure to give a clear,
concise summary in the title and provide as much detail as possible in
the description. Paste in your command output or stack traces, links to
screen shots, and any other information which might be useful.
The OpenStack wiki
~~~~~~~~~~~~~~~~~~
The `OpenStack wiki <https://wiki.openstack.org/>`_ contains a broad
range of topics but some of the information can be difficult to find or
is a few pages deep. Fortunately, the wiki search feature enables you to
search by title or content. If you search for specific information, such
as about networking or OpenStack Compute, you can find a large amount
of relevant material. More is being added all the time, so be sure to
check back often. You can find the search box in the upper-right corner
of any OpenStack wiki page.
The Launchpad bugs area
~~~~~~~~~~~~~~~~~~~~~~~
The OpenStack community values your setup and testing efforts and wants
your feedback. To log a bug, you must `sign up for a Launchpad account
<https://launchpad.net/+login>`_. You can view existing bugs and report bugs
in the Launchpad Bugs area. Use the search feature to determine whether
the bug has already been reported or already been fixed. If it still
seems like your bug is unreported, fill out a bug report.
Some tips:
* Give a clear, concise summary.
* Provide as much detail as possible in the description. Paste in your
command output or stack traces, links to screen shots, and any other
information which might be useful.
* Be sure to include the software and package versions that you are
using, especially if you are using a development branch, such as
``"Kilo release" vs git commit bc79c3ecc55929bac585d04a03475b72e06a3208``.
* Any deployment-specific information is helpful, such as whether you
are using Ubuntu 14.04 or are performing a multi-node installation.
The following Launchpad Bugs areas are available:
* `Bugs: OpenStack Block Storage
(cinder) <https://bugs.launchpad.net/cinder>`_
* `Bugs: OpenStack Compute (nova) <https://bugs.launchpad.net/nova>`_
* `Bugs: OpenStack Dashboard
(horizon) <https://bugs.launchpad.net/horizon>`_
* `Bugs: OpenStack Identity
(keystone) <https://bugs.launchpad.net/keystone>`_
* `Bugs: OpenStack Image service
(glance) <https://bugs.launchpad.net/glance>`_
* `Bugs: OpenStack Networking
(neutron) <https://bugs.launchpad.net/neutron>`_
* `Bugs: OpenStack Object Storage
(swift) <https://bugs.launchpad.net/swift>`_
* `Bugs: Application catalog (murano) <https://bugs.launchpad.net/murano>`_
* `Bugs: Bare metal service (ironic) <https://bugs.launchpad.net/ironic>`_
* `Bugs: Clustering service (senlin) <https://bugs.launchpad.net/senlin>`_
* `Bugs: Container Infrastructure Management service (magnum) <https://bugs.launchpad.net/magnum>`_
* `Bugs: Data processing service
(sahara) <https://bugs.launchpad.net/sahara>`_
* `Bugs: Database service (trove) <https://bugs.launchpad.net/trove>`_
* `Bugs: DNS service (designate) <https://bugs.launchpad.net/designate>`_
* `Bugs: Key Manager Service (barbican) <https://bugs.launchpad.net/barbican>`_
* `Bugs: Monitoring (monasca) <https://bugs.launchpad.net/monasca>`_
* `Bugs: Orchestration (heat) <https://bugs.launchpad.net/heat>`_
* `Bugs: Rating (cloudkitty) <https://bugs.launchpad.net/cloudkitty>`_
* `Bugs: Shared file systems (manila) <https://bugs.launchpad.net/manila>`_
* `Bugs: Telemetry
(ceilometer) <https://bugs.launchpad.net/ceilometer>`_
* `Bugs: Telemetry v3
(gnocchi) <https://bugs.launchpad.net/gnocchi>`_
* `Bugs: Workflow service
(mistral) <https://bugs.launchpad.net/mistral>`_
* `Bugs: Messaging service
(zaqar) <https://bugs.launchpad.net/zaqar>`_
* `Bugs: Container service
(zun) <https://bugs.launchpad.net/zun>`_
* `Bugs: OpenStack API Documentation
(developer.openstack.org) <https://bugs.launchpad.net/openstack-api-site>`_
* `Bugs: OpenStack Documentation
(docs.openstack.org) <https://bugs.launchpad.net/openstack-manuals>`_
Documentation feedback
~~~~~~~~~~~~~~~~~~~~~~
To provide feedback on documentation, join our IRC channel ``#openstack-doc``
on the Freenode IRC network, or `report a bug in Launchpad
<https://bugs.launchpad.net/openstack/+filebug>`_ and choose the particular
project that the documentation is a part of.
The OpenStack IRC channel
~~~~~~~~~~~~~~~~~~~~~~~~~
The OpenStack community lives in the #openstack IRC channel on the
Freenode network. You can hang out, ask questions, or get immediate
feedback for urgent and pressing issues. To install an IRC client or use
a browser-based client, go to
`https://webchat.freenode.net/ <https://webchat.freenode.net>`_. You can
also use `Colloquy <http://colloquy.info/>`_ (Mac OS X),
`mIRC <http://www.mirc.com/>`_ (Windows),
or XChat (Linux). When you are in the IRC channel
and want to share code or command output, the generally accepted method
is to use a Paste Bin. The OpenStack project has one at `Paste
<http://paste.openstack.org>`_. Paste longer text or logs
into the web form, and you get a URL that you can then paste into the
channel. The OpenStack IRC channel is ``#openstack`` on
``irc.freenode.net``. You can find a list of all OpenStack IRC channels on
the `IRC page on the wiki <https://wiki.openstack.org/wiki/IRC>`_.
OpenStack mailing lists
~~~~~~~~~~~~~~~~~~~~~~~
A great way to get answers and insights is to post your question or
problematic scenario to the OpenStack mailing list. You can learn from
and help others who might have similar issues. To subscribe or view the
archives, go to the `general OpenStack mailing list
<http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack>`_. If you are
interested in the other mailing lists for specific projects or development,
refer to `Mailing Lists <https://wiki.openstack.org/wiki/Mailing_Lists>`_.
OpenStack distribution packages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following Linux distributions provide community-supported packages
for OpenStack:
* **CentOS, Fedora, and Red Hat Enterprise Linux:**
https://www.rdoproject.org/
* **openSUSE and SUSE Linux Enterprise Server:**
https://en.opensuse.org/Portal:OpenStack
* **Ubuntu:** https://wiki.ubuntu.com/OpenStack/CloudArchive

8
doc/common/appendix.rst Normal file

@ -0,0 +1,8 @@
Appendix
~~~~~~~~
.. toctree::
:maxdepth: 1
app-support.rst
glossary.rst


@ -0,0 +1,47 @@
.. ## WARNING ##########################################################
.. This file is synced from openstack/openstack-manuals repository to
.. other related repositories. If you need to make changes to this file,
.. make the changes in openstack-manuals. After any change is merged to
.. openstack-manuals, a patch for the other repositories will be proposed automatically.
.. #####################################################################
===========
Conventions
===========
The OpenStack documentation uses several typesetting conventions.
Notices
~~~~~~~
Notices take these forms:
.. note:: A comment with additional information that explains a part of the
text.
.. important:: Something you must be aware of before proceeding.
.. tip:: An extra but helpful piece of practical advice.
.. caution:: Helpful information that prevents the user from making mistakes.
.. warning:: Critical information about the risk of data loss or security
issues.
Command prompts
~~~~~~~~~~~~~~~
.. code-block:: console
$ command
Any user, including the ``root`` user, can run commands that are
prefixed with the ``$`` prompt.
.. code-block:: console
# command
The ``root`` user must run commands that are prefixed with the ``#``
prompt. You can also prefix these commands with the :command:`sudo`
command, if available, to run them.

4164
doc/common/glossary.rst Normal file

File diff suppressed because it is too large.

110
doc/common/source/conf.py Normal file

@ -0,0 +1,110 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import os
# import sys
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
# sys.path.insert(0, os.path.abspath('.'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['openstackdocstheme']
# Add any paths that contain templates here, relative to this directory.
# templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
# source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
repository_name = "openstack/openstack-manuals"
bug_project = 'openstack-manuals'
project = u'Common documents'
bug_tag = u'common'
copyright = u'2015-2018, OpenStack contributors'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = ''
# The full version, including alpha/beta/rc tags.
release = ''
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
# language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
# today = ''
# Else, today_fmt is used as the format for a strftime call.
# today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []
# The reST default role (used for this markup: `text`) to use for all
# documents.
# default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
# add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
# add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
# show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
# modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
# keep_warnings = False
# -- Options for Internationalization output ------------------------------
locale_dirs = ['locale/']

1
doc/source/common Symbolic link

@ -0,0 +1 @@
../common


@ -0,0 +1,55 @@
============================
Configuring the compute node
============================
The `Installation Guides
<https://docs.openstack.org/ocata/install/>`_
provide instructions for installing multiple compute nodes.
To make the compute nodes highly available, you must configure the
environment to include multiple instances of the API and other services.
Configuring high availability for instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As of September 2016, the OpenStack High Availability community is
designing and developing an official and unified way to provide high
availability for instances. We are developing automatic
recovery from failures of hardware or hypervisor-related software on
the compute node, or other failures that could prevent instances from
functioning correctly, such as issues with a cinder volume I/O path.
More details are available in the `user story
<http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html>`_
co-authored by OpenStack's HA community and `Product Working Group
<https://wiki.openstack.org/wiki/ProductTeam>`_ (PWG), where this feature is
identified as missing functionality in OpenStack, which
should be addressed with high priority.
Existing solutions
~~~~~~~~~~~~~~~~~~
The architectural challenges of instance HA and several currently
existing solutions were presented in `a talk at the Austin summit
<https://www.openstack.org/videos/video/high-availability-for-pets-and-hypervisors-state-of-the-nation>`_,
for which `slides are also available <http://aspiers.github.io/openstack-summit-2016-austin-compute-ha/>`_.
The code for three of these solutions can be found online at the following
links:
* `a mistral-based auto-recovery workflow
<https://github.com/gryf/mistral-evacuate>`_, by Intel
* `masakari <https://launchpad.net/masakari>`_, by NTT
* `OCF RAs
<http://aspiers.github.io/openstack-summit-2016-austin-compute-ha/#/ocf-pros-cons>`_,
as used by Red Hat and SUSE
Current upstream work
~~~~~~~~~~~~~~~~~~~~~
Work is in progress on a unified approach, which combines the best
aspects of existing upstream solutions. More details are available on
`the HA VMs user story wiki
<https://wiki.openstack.org/wiki/ProductTeam/User_Stories/HA_VMs>`_.
To get involved with this work, see the section on the
:doc:`ha-community`.


@ -1,3 +1,16 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is execfile()d with the current directory set to its
# containing dir.
#
@ -8,8 +21,7 @@
# serve to show the default.
import os
import openstackdocstheme
# import sys
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
@ -26,6 +38,15 @@ import openstackdocstheme
# ones.
extensions = ['openstackdocstheme']
# Add any paths that contain templates here, relative to this directory.
# templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
# source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
@ -36,12 +57,97 @@ project = u'High Availability Guide'
bug_tag = u'ha-guide'
copyright = u'2016-present, OpenStack contributors'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = ''
# The full version, including alpha/beta/rc tags.
release = ''
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
# language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
# today = ''
# Else, today_fmt is used as the format for a strftime call.
# today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['common/cli*', 'common/nova*',
'common/get-started*', 'common/dashboard*']
# The reST default role (used for this markup: `text`) to use for all
# documents.
# default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
# add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
# add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
# show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
# modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
# keep_warnings = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'openstackdocs'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
html_theme_options = {
'display_badge': False
}
# Add any paths that contain custom themes here, relative to this directory.
# html_theme_path = [openstackdocstheme.get_html_theme_path()]
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
# html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
# html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
# html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
# html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = []
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
# html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
# So that we can enable "log-a-bug" links from each output HTML page, this
@ -49,6 +155,73 @@ html_theme = 'openstackdocs'
# minutes.
html_last_updated_fmt = '%Y-%m-%d %H:%M'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
# html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
# html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
# html_additional_pages = {}
# If false, no module index is generated.
# html_domain_indices = True
# If false, no index is generated.
html_use_index = False
# If true, the index is split into individual pages for each letter.
# html_split_index = False
# If true, links to the reST sources are added to the pages.
html_show_sourcelink = False
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
# html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
# html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
# html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
# html_file_suffix = None
# Output file base name for HTML help builder.
htmlhelp_basename = 'ha-guide'
# If true, publish source files
html_copy_source = False
# -- Options for LaTeX output ---------------------------------------------
latex_engine = 'xelatex'
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
# 'papersize': 'letterpaper',
# set font (TODO: different fonts for translated PDF document builds)
'fontenc': '\\usepackage{fontspec}',
'fontpkg': '''\
\defaultfontfeatures{Scale=MatchLowercase}
\setmainfont{Liberation Serif}
\setsansfont{Liberation Sans}
\setmonofont[SmallCapsFont={Liberation Mono}]{Liberation Mono}
''',
# The font size ('10pt', '11pt' or '12pt').
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
# 'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
@ -57,5 +230,63 @@ latex_documents = [
u'OpenStack contributors', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
# latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
# latex_use_parts = False
# If true, show page references after internal links.
# latex_show_pagerefs = False
# If true, show URL addresses after external links.
# latex_show_urls = False
# Documents to append as an appendix to all manuals.
# latex_appendices = []
# If false, no module index is generated.
# latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'haguide', u'High Availability Guide',
[u'OpenStack contributors'], 1)
]
# If true, show URL addresses after external links.
# man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'HAGuide', u'High Availability Guide',
u'OpenStack contributors', 'HAGuide',
'This guide shows OpenStack operators and deployers how to configure '
'OpenStack to be robust and fault-tolerant.', 'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
# texinfo_appendices = []
# If false, no module index is generated.
# texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
# texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
# texinfo_no_detailmenu = False
# -- Options for Internationalization output ------------------------------
locale_dirs = ['locale/']


@ -0,0 +1,342 @@
=================================
Configuring the stateful services
=================================
.. to do: scope how in depth we want these sections to be
Database for high availability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Galera
------
The first step is to install the database that sits at the heart of the
cluster. To implement high availability, run an instance of the database on
each controller node and use Galera Cluster to provide replication between
them. Galera Cluster is a synchronous multi-master database cluster, based
on MySQL and the InnoDB storage engine. It is a high-availability service
that provides high system uptime, no data loss, and scalability for growth.
You can achieve high availability for the OpenStack database in many
different ways, depending on the type of database that you want to use.
There are three implementations of Galera Cluster available to you:
- `Galera Cluster for MySQL <http://galeracluster.com/>`_: The MySQL
reference implementation from Codership, Oy.
- `MariaDB Galera Cluster <https://mariadb.org/>`_: The MariaDB
implementation of Galera Cluster, which is commonly supported in
environments based on Red Hat distributions.
- `Percona XtraDB Cluster <https://www.percona.com/>`_: The XtraDB
implementation of Galera Cluster from Percona.
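As an illustrative sketch only (node names, addresses, and the provider path
are hypothetical, and exact option names vary between Galera implementations
and versions), the Galera-related part of a MariaDB configuration file on one
controller node might look like this:

.. code-block:: ini

   # Hypothetical Galera settings for one controller node
   [mysqld]
   binlog_format = ROW
   default_storage_engine = InnoDB
   innodb_autoinc_lock_mode = 2
   wsrep_on = ON
   wsrep_provider = /usr/lib/galera/libgalera_smm.so
   wsrep_cluster_name = "openstack_db_cluster"
   wsrep_cluster_address = "gcomm://10.0.0.12,10.0.0.13,10.0.0.14"
   wsrep_node_name = "controller1"
   wsrep_node_address = "10.0.0.12"

Each controller node uses the same cluster address list, while the node name
and node address differ per node.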
In addition to Galera Cluster, you can also achieve high availability
through other database options, such as PostgreSQL, which has its own
replication system.
Pacemaker active/passive with HAProxy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Replicated storage
------------------
For example: DRBD
Shared storage
--------------
Messaging service for high availability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RabbitMQ
--------
An AMQP (Advanced Message Queuing Protocol) compliant message bus is
required for most OpenStack components in order to coordinate the
execution of jobs entered into the system.
The most popular AMQP implementation used in OpenStack installations
is RabbitMQ.
RabbitMQ nodes fail over on the application and the infrastructure layers.
The application layer is controlled by the ``oslo.messaging``
configuration options for multiple AMQP hosts. If the AMQP node fails,
the application reconnects to the next one configured within the
specified reconnect interval. The specified reconnect interval
constitutes its SLA.
On the infrastructure layer, the SLA is the time it takes the RabbitMQ
cluster to reassemble. Several cases are possible. The Mnesia keeper
node is the master of the corresponding Pacemaker resource for
RabbitMQ. When it fails, the result is a full AMQP cluster downtime
interval. Normally, its SLA is no more than several minutes. Failure
of another node that is a slave of the corresponding Pacemaker
resource for RabbitMQ results in no AMQP cluster downtime at all.
.. until we've determined the content depth, I've transferred RabbitMQ
configuration below from the old HA guide (darrenc)
Making the RabbitMQ service highly available involves the following steps:
- :ref:`Install RabbitMQ<rabbitmq-install>`
- :ref:`Configure RabbitMQ for HA queues<rabbitmq-configure>`
- :ref:`Configure OpenStack services to use RabbitMQ HA queues
<rabbitmq-services>`
.. note::
Access to RabbitMQ is not normally handled by HAProxy. Instead,
consumers must be supplied with the full list of hosts running
RabbitMQ via the ``rabbit_hosts`` option and must enable the ``rabbit_ha_queues``
option. For more information, read the `core issue
<http://people.redhat.com/jeckersb/private/vip-failover-tcp-persist.html>`_.
For more detail, read the `history and solution
<http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html>`_.
.. _rabbitmq-install:
Install RabbitMQ
^^^^^^^^^^^^^^^^
The commands for installing RabbitMQ are specific to the Linux distribution
you are using.
For Ubuntu or Debian:
.. code-block:: console
# apt-get install rabbitmq-server
For RHEL, Fedora, or CentOS:
.. code-block:: console
# yum install rabbitmq-server
For openSUSE:
.. code-block:: console
# zypper install rabbitmq-server
For SLES 12:
.. code-block:: console
# zypper addrepo -f obs://Cloud:OpenStack:Kilo/SLE_12 Kilo
[Verify the fingerprint of the imported GPG key. See below.]
# zypper install rabbitmq-server
.. note::
For SLES 12, the packages are signed by GPG key 893A90DAD85F9316.
You should verify the fingerprint of the imported GPG key before using it.
.. code-block:: none
Key ID: 893A90DAD85F9316
Key Name: Cloud:OpenStack OBS Project <Cloud:OpenStack@build.opensuse.org>
Key Fingerprint: 35B34E18ABC1076D66D5A86B893A90DAD85F9316
Key Created: Tue Oct 8 13:34:21 2013
Key Expires: Thu Dec 17 13:34:21 2015
For more information, see the official installation manual for the
distribution:
- `Debian and Ubuntu <https://www.rabbitmq.com/install-debian.html>`_
- `RPM based <https://www.rabbitmq.com/install-rpm.html>`_
(RHEL, Fedora, CentOS, openSUSE)
.. _rabbitmq-configure:
Configure RabbitMQ for HA queues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. [TODO: This section should begin with a brief mention
.. about what HA queues are and why they are valuable, etc]
.. [TODO: replace "currently" with specific release names]
.. [TODO: Does this list need to be updated? Perhaps we need a table
.. that shows each component and the earliest release that allows it
.. to work with HA queues.]
The following components/services can work with HA queues:
- OpenStack Compute
- OpenStack Block Storage
- OpenStack Networking
- Telemetry
Consider that, while exchanges and bindings survive the loss of individual
nodes, queues and their messages do not because a queue and its contents
are located on one node. If we lose this node, we also lose the queue.
Mirrored queues in RabbitMQ improve the availability of the service, since
they are resilient to the failure of individual nodes.
Production deployments should run (at least) three RabbitMQ servers; for testing
and demonstration purposes, however, it is possible to run only two servers.
In this section, we configure two nodes, called ``rabbit1`` and ``rabbit2``.
To build a broker, ensure that all nodes have the same Erlang cookie file.
.. [TODO: Should the example instead use a minimum of three nodes?]
#. Stop RabbitMQ and copy the cookie from the first node to each of the
other node(s):
.. code-block:: console
# scp /var/lib/rabbitmq/.erlang.cookie root@NODE:/var/lib/rabbitmq/.erlang.cookie
#. On each target node, verify the correct owner,
group, and permissions of the file :file:`.erlang.cookie`:
.. code-block:: console
# chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
# chmod 400 /var/lib/rabbitmq/.erlang.cookie
#. Start the message queue service on all nodes and configure it to start
when the system boots. On Ubuntu, it is configured by default.
On CentOS, RHEL, openSUSE, and SLES:
.. code-block:: console
# systemctl enable rabbitmq-server.service
# systemctl start rabbitmq-server.service
#. Verify that the nodes are running:
.. code-block:: console
# rabbitmqctl cluster_status
Cluster status of node rabbit@NODE...
[{nodes,[{disc,[rabbit@NODE]}]},
{running_nodes,[rabbit@NODE]},
{partitions,[]}]
...done.
#. Run the following commands on each node except the first one:
.. code-block:: console
# rabbitmqctl stop_app
Stopping node rabbit@NODE...
...done.
# rabbitmqctl join_cluster --ram rabbit@rabbit1
# rabbitmqctl start_app
Starting node rabbit@NODE ...
...done.
.. note::
The default node type is a disc node. In this guide, nodes
join the cluster as RAM nodes.
#. Verify the cluster status:
.. code-block:: console
# rabbitmqctl cluster_status
Cluster status of node rabbit@NODE...
[{nodes,[{disc,[rabbit@rabbit1]},{ram,[rabbit@NODE]}]}, \
{running_nodes,[rabbit@NODE,rabbit@rabbit1]}]
If the cluster is working, you can create usernames and passwords
for the queues.
#. To ensure that all queues except those with auto-generated names
are mirrored across all running nodes,
set the ``ha-mode`` policy key to all
by running the following command on one of the nodes:
.. code-block:: console
# rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
More information is available in the RabbitMQ documentation:
- `Highly Available Queues <https://www.rabbitmq.com/ha.html>`_
- `Clustering Guide <https://www.rabbitmq.com/clustering.html>`_
.. note::
As another option to make RabbitMQ highly available, RabbitMQ has shipped
OCF scripts for Pacemaker cluster resource agents since version 3.5.7.
These provide an active/active RabbitMQ cluster with mirrored queues.
For more information, see `Auto-configuration of a cluster with
a Pacemaker <https://www.rabbitmq.com/pacemaker.html>`_.
.. _rabbitmq-services:
Configure OpenStack services to use Rabbit HA queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Configure the OpenStack components to use at least two RabbitMQ nodes.
Use these steps to configure all services that use RabbitMQ; a consolidated
sketch of the resulting configuration follows the note below:
#. RabbitMQ HA cluster ``host:port`` pairs:
.. code-block:: console
rabbit_hosts=rabbit1:5672,rabbit2:5672,rabbit3:5672
#. Retry connecting with RabbitMQ:
.. code-block:: console
rabbit_retry_interval=1
#. How long to back-off for between retries when connecting to RabbitMQ:
.. code-block:: console
rabbit_retry_backoff=2
#. Maximum retries with trying to connect to RabbitMQ (infinite by default):
.. code-block:: console
rabbit_max_retries=0
#. Use durable queues in RabbitMQ:
.. code-block:: console
rabbit_durable_queues=true
#. Use HA queues in RabbitMQ (``x-ha-policy: all``):
.. code-block:: console
rabbit_ha_queues=true
.. note::
If you change the configuration from an old set-up
that did not use HA queues, restart the service:
.. code-block:: console
# rabbitmqctl stop_app
# rabbitmqctl reset
# rabbitmqctl start_app
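Taken together, the resulting messaging configuration in a service's
configuration file might look like the following sketch. The
``[oslo_messaging_rabbit]`` section name and host names are illustrative
assumptions; depending on the release, some of these options live under
``[DEFAULT]``, and several have since been renamed or deprecated in newer
oslo.messaging releases.

.. code-block:: ini

   [oslo_messaging_rabbit]
   rabbit_hosts = rabbit1:5672,rabbit2:5672,rabbit3:5672
   rabbit_retry_interval = 1
   rabbit_retry_backoff = 2
   rabbit_max_retries = 0
   rabbit_durable_queues = true
   rabbit_ha_queues = true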
Pacemaker active/passive
------------------------
Mirrored queues
---------------
Qpid
----


@ -0,0 +1,518 @@
==============================
Configuring stateless services
==============================
.. to do: scope what details we want on the following services
API services
~~~~~~~~~~~~
Load-balancer
~~~~~~~~~~~~~
HAProxy
-------
HAProxy provides a fast and reliable HTTP reverse proxy and load balancer
for TCP or HTTP applications. It is particularly suited for web sites
with very high traffic loads that need persistence or Layer 7 processing.
It realistically supports tens of thousands of connections with recent
hardware.
Each instance of HAProxy configures its front end to accept connections only
to the virtual IP (VIP) address. The HAProxy back end (termination
point) is a list of all the IP addresses of instances for load balancing.
.. note::
To ensure that your HAProxy installation is not a single point of failure,
it is advisable to run multiple HAProxy instances.
You can also ensure availability by other means, such as Keepalived
or Pacemaker.
Alternatively, you can use a commercial load balancer, either hardware-based
or software-based. We recommend a hardware load balancer as it generally has
good performance.
For detailed instructions about installing HAProxy on your nodes,
see the HAProxy `official documentation <http://www.haproxy.org/#docs>`_.
Configuring HAProxy
^^^^^^^^^^^^^^^^^^^
#. Restart the HAProxy service.
#. Locate your HAProxy instance on each OpenStack controller in your
environment. The following is an example ``/etc/haproxy/haproxy.cfg``
configuration file. Configure your instance using the following
configuration file; you will need a copy of it on each
controller node.
.. code-block:: none
global
chroot /var/lib/haproxy
daemon
group haproxy
maxconn 4000
pidfile /var/run/haproxy.pid
user haproxy
defaults
log global
maxconn 4000
option redispatch
retries 3
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout check 10s
listen dashboard_cluster
bind <Virtual IP>:443
balance source
option tcpka
option httpchk
option tcplog
server controller1 10.0.0.12:443 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:443 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:443 check inter 2000 rise 2 fall 5
listen galera_cluster
bind <Virtual IP>:3306
balance source
option mysql-check
server controller1 10.0.0.12:3306 check port 9200 inter 2000 rise 2 fall 5
server controller2 10.0.0.13:3306 backup check port 9200 inter 2000 rise 2 fall 5
server controller3 10.0.0.14:3306 backup check port 9200 inter 2000 rise 2 fall 5
listen glance_api_cluster
bind <Virtual IP>:9292
balance source
option tcpka
option httpchk
option tcplog
server controller1 10.0.0.12:9292 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:9292 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:9292 check inter 2000 rise 2 fall 5
listen glance_registry_cluster
bind <Virtual IP>:9191
balance source
option tcpka
option tcplog
server controller1 10.0.0.12:9191 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:9191 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:9191 check inter 2000 rise 2 fall 5
listen keystone_admin_cluster
bind <Virtual IP>:35357
balance source
option tcpka
option httpchk
option tcplog
server controller1 10.0.0.12:35357 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:35357 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:35357 check inter 2000 rise 2 fall 5
listen keystone_public_internal_cluster
bind <Virtual IP>:5000
balance source
option tcpka
option httpchk
option tcplog
server controller1 10.0.0.12:5000 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:5000 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:5000 check inter 2000 rise 2 fall 5
listen nova_ec2_api_cluster
bind <Virtual IP>:8773
balance source
option tcpka
option tcplog
server controller1 10.0.0.12:8773 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:8773 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:8773 check inter 2000 rise 2 fall 5
listen nova_compute_api_cluster
bind <Virtual IP>:8774
balance source
option tcpka
option httpchk
option tcplog
server controller1 10.0.0.12:8774 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:8774 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:8774 check inter 2000 rise 2 fall 5
listen nova_metadata_api_cluster
bind <Virtual IP>:8775
balance source
option tcpka
option tcplog
server controller1 10.0.0.12:8775 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:8775 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:8775 check inter 2000 rise 2 fall 5
listen cinder_api_cluster
bind <Virtual IP>:8776
balance source
option tcpka
option httpchk
option tcplog
server controller1 10.0.0.12:8776 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:8776 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:8776 check inter 2000 rise 2 fall 5
listen ceilometer_api_cluster
bind <Virtual IP>:8777
balance source
option tcpka
option tcplog
server controller1 10.0.0.12:8777 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:8777 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:8777 check inter 2000 rise 2 fall 5
listen nova_vncproxy_cluster
bind <Virtual IP>:6080
balance source
option tcpka
option tcplog
server controller1 10.0.0.12:6080 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:6080 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:6080 check inter 2000 rise 2 fall 5
listen neutron_api_cluster
bind <Virtual IP>:9696
balance source
option tcpka
option httpchk
option tcplog
server controller1 10.0.0.12:9696 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:9696 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:9696 check inter 2000 rise 2 fall 5
listen swift_proxy_cluster
bind <Virtual IP>:8080
balance source
option tcplog
option tcpka
server controller1 10.0.0.12:8080 check inter 2000 rise 2 fall 5
server controller2 10.0.0.13:8080 check inter 2000 rise 2 fall 5
server controller3 10.0.0.14:8080 check inter 2000 rise 2 fall 5
.. note::
The Galera cluster configuration directive ``backup`` indicates
that two of the three controllers are standby nodes.
This ensures that only one node services write requests
because OpenStack support for multi-node writes is not yet production-ready.
.. note::
The Telemetry API service configuration does not have the ``option httpchk``
directive as it cannot process this check properly.
.. TODO: explain why the Telemetry API is so special
#. Configure the kernel parameter to allow non-local IP binding. This allows
running HAProxy instances to bind to a VIP for failover. Add the following line
to ``/etc/sysctl.conf``:
.. code-block:: none
net.ipv4.ip_nonlocal_bind = 1
#. Restart the host or, to make changes work immediately, invoke:
.. code-block:: console
$ sysctl -p
#. Add HAProxy to the cluster and ensure the VIPs can only run on machines
where HAProxy is active:
``pcs``
.. code-block:: console
$ pcs resource create lb-haproxy systemd:haproxy --clone
$ pcs constraint order start vip then lb-haproxy-clone kind=Optional
$ pcs constraint colocation add lb-haproxy-clone with vip
``crmsh``
.. code-block:: console
$ crm cib new conf-haproxy
$ crm configure primitive haproxy lsb:haproxy op monitor interval="1s"
$ crm configure clone haproxy-clone haproxy
$ crm configure colocation vip-with-haproxy inf: vip haproxy-clone
$ crm configure order haproxy-after-vip mandatory: vip haproxy-clone
Pacemaker versus systemd
------------------------
Memcached
---------
Memcached is a general-purpose distributed memory caching system. It
is used to speed up dynamic database-driven websites by caching data
and objects in RAM to reduce the number of times an external data
source must be read.
Memcached is a memory cache daemon that can be used by most OpenStack
services to store ephemeral data, such as tokens.
Access to Memcached is not handled by HAProxy because replicated
access is currently in an experimental state. Instead, OpenStack
services must be supplied with the full list of hosts running
Memcached.
The Memcached client implements hashing to balance objects among the
instances. Failure of an instance impacts only a percentage of the
objects and the client automatically removes it from the list of
instances. The SLA is several minutes.
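For example (the host names are illustrative, and the exact option names and
section depend on the service and release), a service that caches data through
oslo.cache might be pointed at all Memcached instances as follows:

.. code-block:: ini

   [cache]
   backend = oslo_cache.memcache_pool
   memcache_servers = controller1:11211,controller2:11211,controller3:11211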
Highly available API services
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Identity API
------------
Ensure you have read the
`OpenStack Identity service getting started documentation
<https://docs.openstack.org/admin-guide/common/get-started-identity.html>`_.
.. to do: reference controller-ha-identity and see if section involving
adding to pacemaker is in scope
Add OpenStack Identity resource to Pacemaker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following section(s) detail how to add the Identity service
to Pacemaker on SUSE and Red Hat.
SUSE
----
SUSE Linux Enterprise Server and SUSE-based distributions, such as openSUSE,
use a set of OCF agents for controlling OpenStack services.
#. Run the following commands to download the OpenStack Identity resource
to Pacemaker:
.. code-block:: console
# cd /usr/lib/ocf/resource.d
# mkdir openstack
# cd openstack
# wget https://git.openstack.org/cgit/openstack/openstack-resource-agents/plain/ocf/keystone
# chmod a+rx *
#. Add the Pacemaker configuration for the OpenStack Identity resource
by running the following command to connect to the Pacemaker cluster:
.. code-block:: console
# crm configure
#. Add the following cluster resources:
.. code-block:: console
clone p_keystone ocf:openstack:keystone \
params config="/etc/keystone/keystone.conf" os_password="secretsecret" os_username="admin" os_tenant_name="admin" os_auth_url="http://10.0.0.11:5000/v2.0/" \
op monitor interval="30s" timeout="30s"
.. note::
This configuration creates ``p_keystone``,
a resource for managing the OpenStack Identity service.
#. Commit your configuration changes from the :command:`crm configure` menu
with the following command:
.. code-block:: console
# commit
The :command:`crm configure` command supports batch input: you can copy and
paste the above lines into your live Pacemaker configuration and then make
changes as required.
For example, you may enter ``edit p_ip_keystone`` from the
:command:`crm configure` menu and edit the resource to match your preferred
virtual IP address.
Pacemaker now starts the OpenStack Identity service and its dependent
resources on all of your nodes.
Red Hat
--------
For Red Hat Enterprise Linux and Red Hat-based Linux distributions,
the following process uses Systemd unit files.
.. code-block:: console
# pcs resource create openstack-keystone systemd:openstack-keystone --clone interleave=true
.. _identity-config-identity:
Configure OpenStack Identity service
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Edit the :file:`keystone.conf` file
to change the values of the :manpage:`bind(2)` parameters:
.. code-block:: ini
bind_host = 10.0.0.12
public_bind_host = 10.0.0.12
admin_bind_host = 10.0.0.12
The ``admin_bind_host`` parameter
lets you use a private network for admin access.
#. To be sure that all data is highly available,
ensure that everything is stored in the MySQL database
(which is also highly available):
.. code-block:: ini
[catalog]
driver = keystone.catalog.backends.sql.Catalog
# ...
[identity]
driver = keystone.identity.backends.sql.Identity
# ...
#. If the Identity service will be sending ceilometer notifications
and your message bus is configured for high availability, you will
need to ensure that the Identity service is correctly configured to
use it.
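For example, a minimal sketch of the messaging settings in
:file:`keystone.conf` for a clustered RabbitMQ back end, assuming the
``transport_url`` style used by recent oslo.messaging releases (the user,
password, and host names are hypothetical):

.. code-block:: ini

   [DEFAULT]
   transport_url = rabbit://openstack:RABBIT_PASS@rabbit1:5672,openstack:RABBIT_PASS@rabbit2:5672,openstack:RABBIT_PASS@rabbit3:5672/

   [oslo_messaging_notifications]
   driver = messagingv2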
.. _identity-services-config:
Configure OpenStack services to use the highly available OpenStack Identity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Your OpenStack services must now point their OpenStack Identity configuration
to the highly available virtual cluster IP address.
#. For the OpenStack Compute service (if your OpenStack Identity service
IP address is 10.0.0.11), use the following configuration in the
:file:`api-paste.ini` file:
.. code-block:: ini
auth_host = 10.0.0.11
#. Create the OpenStack Identity Endpoint with this IP address.
.. note::
If you are using both private and public IP addresses,
create two virtual IP addresses and define the endpoint. For
example:
.. code-block:: console
$ openstack endpoint create --region $KEYSTONE_REGION \
$service-type public http://PUBLIC_VIP:5000/v2.0
$ openstack endpoint create --region $KEYSTONE_REGION \
$service-type admin http://10.0.0.11:35357/v2.0
$ openstack endpoint create --region $KEYSTONE_REGION \
$service-type internal http://10.0.0.11:5000/v2.0
#. If you are using Dashboard (horizon), edit the :file:`local_settings.py`
file to include the following:
.. code-block:: ini
OPENSTACK_HOST = 10.0.0.11
Telemetry API
-------------
The Telemetry polling agent can be configured to partition its polling
workload between multiple agents. This enables high availability (HA).
Both the central and the compute agent can run in an HA deployment.
This means that multiple instances of these services can run in
parallel with workload partitioning among these running instances.
The `Tooz <https://pypi.org/project/tooz>`_ library provides
the coordination within the groups of service instances.
It provides an API above several back ends that can be used for building
distributed applications.
Tooz supports
`various drivers <https://docs.openstack.org/tooz/latest/user/drivers.html>`_
including the following back end solutions:
* `Zookeeper <http://zookeeper.apache.org/>`_:
Recommended solution by the Tooz project.
* `Redis <http://redis.io/>`_:
Recommended solution by the Tooz project.
* `Memcached <http://memcached.org/>`_:
Recommended for testing.
You must configure a supported Tooz driver for the HA deployment of
the Telemetry services.
For information about the required configuration options
to set in the :file:`ceilometer.conf`, see the `coordination section
<https://docs.openstack.org/ocata/config-reference/telemetry.html>`_
in the OpenStack Configuration Reference.
.. note::
Only one instance for the central and compute agent service(s) is able
to run and function correctly if the ``backend_url`` option is not set.
The availability check of the instances is provided by heartbeat messages.
When the connection with an instance is lost, the workload will be
reassigned within the remaining instances in the next polling cycle.
.. note::
Memcached uses a timeout value, which should always be set to
a value that is higher than the heartbeat value set for Telemetry.
For backward compatibility and to support existing deployments, the central
agent configuration supports using different configuration files for
groups of service instances that are running in parallel.
To enable this configuration, set a value for the
``partitioning_group_prefix`` option in the
`polling section <https://docs.openstack.org/ocata/config-reference/telemetry/telemetry-config-options.html>`_
in the OpenStack Configuration Reference.
.. warning::
For each sub-group of the central agent pool with the same
``partitioning_group_prefix``, a disjoint subset of meters must be polled
to avoid samples being missing or duplicated. The list of meters to poll
can be set in the :file:`/etc/ceilometer/pipeline.yaml` configuration file.
For more information about pipelines see the `Data processing and pipelines
<https://docs.openstack.org/admin-guide/telemetry-data-pipelines.html>`_
section.
To enable the compute agent to run multiple instances simultaneously with
workload partitioning, the ``workload_partitioning`` option must be set to
``True`` under the `compute section <https://docs.openstack.org/ocata/config-reference/telemetry.html>`_
in the :file:`ceilometer.conf` configuration file.
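A sketch of the relevant :file:`ceilometer.conf` settings, assuming a Redis
Tooz back end at a hypothetical address:

.. code-block:: ini

   [coordination]
   backend_url = redis://10.0.0.11:6379

   [compute]
   workload_partitioning = True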
.. To Do: Cover any other projects here with API services which require specific
HA details.


@ -0,0 +1,9 @@
===========================
Configuring a control plane
===========================
.. toctree::
:maxdepth: 2
control-plane-stateless.rst
control-plane-stateful.rst

Binary file not shown (new image, 223 KiB).

Binary file not shown (new image, 215 KiB).


@ -0,0 +1,15 @@
============
HA community
============
The OpenStack HA community holds `weekly IRC meetings
<https://wiki.openstack.org/wiki/Meetings/HATeamMeeting>`_ to discuss
a range of topics relating to HA in OpenStack. Everyone interested is
encouraged to attend. The `logs of all previous meetings
<http://eavesdrop.openstack.org/meetings/ha/>`_ are available to read.
You can contact the HA community directly in `the #openstack-ha
channel on Freenode IRC <https://wiki.openstack.org/wiki/IRC>`_, or by
sending mail to the `openstack-dev
<https://wiki.openstack.org/wiki/Mailing_Lists#Future_Development>`_
mailing list with the ``[HA]`` prefix in the ``Subject`` header.


@ -5,8 +5,31 @@ OpenStack High Availability Guide
Abstract
~~~~~~~~
This guide provides information about configuring OpenStack services for high
availability.
This guide describes how to install and configure OpenStack for high
availability. It supplements the Installation Guides
and assumes that you are familiar with the material in those guides.
This is a placeholder while we migrate information over from another repo.
.. warning::
This guide is a work-in-progress and changing rapidly
while we continue to test and enhance the guidance. There are
open `TODO` items throughout the guide, which are also tracked on the OpenStack manuals
`bug list <https://bugs.launchpad.net/openstack-manuals?field.tag=ha-guide>`_.
Please help where you are able.
.. toctree::
:maxdepth: 1
common/conventions.rst
overview.rst
intro-ha.rst
intro-os-ha.rst
control-plane.rst
networking-ha.rst
storage-ha.rst
compute-node-ha.rst
monitoring.rst
testing.rst
ref-arch-examples.rst
ha-community.rst
common/appendix.rst


@ -0,0 +1,127 @@
========================
Commonly used technology
========================
High availability can only be achieved at the system level; both hardware and
software components contribute to system-level availability.
This document lists the most common hardware and software technologies
that can be used to build a highly available system.
Hardware
~~~~~~~~
Using different technologies to enable high availability at the hardware
level provides a good basis for building a highly available system. The following
sections discuss the most common technologies used in this field.
Redundant switches
------------------
Network switches are single points of failure because networking is critical to
the operation of all other basic domains of the infrastructure, such as compute
and storage. A network switch must be able to forward the network traffic and
to forward it to a working next hop.
For these reasons, consider the following two factors when making a network
switch redundant:
#. The network switch itself should synchronize its internal state to a
redundant switch, in either an active/active or an active/passive way.
#. The network topology should be designed so that the network router can
use at least two paths in every critical direction.
Bonded interfaces
-----------------
Bonded interfaces are two independent physical network interfaces handled as
one interface in active/passive or in active/active redundancy mode. In
active/passive mode, if an error happens in the active network interface or in
the remote end of the interface, the interfaces are switched over. In
active/active mode, when an error happens in an interface or in the remote end
of an interface, the interface is marked as unavailable and ceases to be
used.
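For illustration only (the interface names are hypothetical, and your
distribution's network configuration tooling may manage bonding differently),
an active/passive bond can be created with ``iproute2`` commands such as:

.. code-block:: console

   # ip link add bond0 type bond mode active-backup miimon 100
   # ip link set eth0 down
   # ip link set eth0 master bond0
   # ip link set eth1 down
   # ip link set eth1 master bond0
   # ip link set bond0 up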
Load balancers
--------------
Physical load balancers are special routers that direct traffic in
different directions based on a set of rules. Load balancers can be deployed
redundantly, similarly to physical switches.
Load balancers are also important for distributing the traffic to the different
active/active components of the system.
Storage
-------
Physical storage high availability can be achieved with different scopes:
#. High availability within a hardware unit with redundant disks (mostly
organized into different RAID configurations), redundant control components,
redundant I/O interfaces and redundant power supply.
#. System level high availability with redundant hardware units with data
replication.
Software
~~~~~~~~
HAProxy
-------
HAProxy provides a fast and reliable HTTP reverse proxy and load balancer
for TCP or HTTP applications. It is particularly suited for web sites
with very high traffic loads that need persistence or Layer 7 processing.
It realistically supports tens of thousands of connections with recent
hardware.
.. note::
To ensure that your HAProxy installation is not a single point of failure,
it is advisable to run multiple HAProxy instances.
You can also ensure availability by other means, such as Keepalived
or Pacemaker.
Alternatively, you can use a commercial load balancer, either hardware-based
or software-based. We recommend a hardware load balancer as it generally has
good performance.
For detailed instructions about installing HAProxy on your nodes,
see the HAProxy `official documentation <http://www.haproxy.org/#docs>`_.
keepalived
----------
`keepalived <http://www.keepalived.org/>`_ is routing software that
provides load balancing and high availability facilities to Linux
systems and Linux-based infrastructures.
Keepalived implements a set of health checkers to dynamically and
adaptively maintain and manage load-balanced server pools according
to their health.
The keepalived daemon can be used to monitor services or systems and
to automatically fail over to a standby if problems occur.
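For illustration only, a minimal keepalived VRRP instance that manages a
virtual IP address could be sketched as follows; the interface name, router
ID, priority, and address are assumptions.
.. code-block:: none
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    virtual_ipaddress {
        10.0.0.11/24
    }
}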
Pacemaker
---------
The `Pacemaker <http://clusterlabs.org/>`_ cluster stack is a state-of-the-art
high availability and load balancing stack for the Linux platform.
Pacemaker is used to make the OpenStack infrastructure highly available.
Pacemaker relies on the
`Corosync <http://corosync.github.io/corosync/>`_ messaging layer
for reliable cluster communications. Corosync implements the Totem single-ring
ordering and membership protocol. It also provides UDP and InfiniBand based
messaging, quorum, and cluster membership to Pacemaker.
Pacemaker does not inherently understand the applications it manages.
Instead, it relies on resource agents (RAs) that are scripts that encapsulate
the knowledge of how to start, stop, and check the health of each application
managed by the cluster.
These agents must conform to one of the `OCF <https://github.com/ClusterLabs/
OCF-spec/blob/master/ra/resource-agent-api.md>`_,
`SysV Init <http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/
LSB-Core-generic/iniscrptact.html>`_, Upstart, or Systemd standards.
Pacemaker ships with a large set of OCF agents (such as those managing
MySQL databases, virtual IP addresses, and RabbitMQ), but can also use
any agents already installed on your system and can be extended with
your own (see the
`developer guide <http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html>`_).
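As a minimal sketch of how such an agent is used, the following creates a
virtual IP address resource with the ``ocf:heartbeat:IPaddr2`` agent via the
``pcs`` command-line tool; the address and netmask are assumptions.
.. code-block:: console
# pcs resource create vip ocf:heartbeat:IPaddr2 \
  ip=10.0.0.11 cidr_netmask=24 op monitor interval=30s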

View File

@ -0,0 +1,147 @@
============
Key concepts
============
Redundancy and failover
~~~~~~~~~~~~~~~~~~~~~~~
High availability is implemented with redundant hardware
running redundant instances of each service.
If one piece of hardware running one instance of a service fails,
the system can then fail over to another instance of the service
that is running on hardware that did not fail.
A crucial aspect of high availability
is the elimination of single points of failure (SPOFs).
A SPOF is an individual piece of equipment or software
that causes system downtime or data loss if it fails.
In order to eliminate SPOFs, check that mechanisms exist for redundancy of:
- Network components, such as switches and routers
- Applications and automatic service migration
- Storage components
- Facility services such as power, air conditioning, and fire protection
In the event that a component fails and a back-up system must take on
its load, most high availability systems will replace the failed
component as quickly as possible to maintain necessary redundancy. This
way, the time spent in a degraded protection state is minimized.
Most high availability systems fail in the event of multiple
independent (non-consequential) failures. In this case, most
implementations favor protecting data over maintaining availability.
High availability systems typically achieve an uptime percentage of
99.99% or more, which roughly equates to less than an hour of
cumulative downtime per year. In order to achieve this, high
availability systems should keep recovery times after a failure to
about one to two minutes, sometimes significantly less.
OpenStack currently meets such availability requirements for its own
infrastructure services, meaning that an uptime of 99.99% is feasible
for the OpenStack infrastructure proper. However, OpenStack does not
guarantee 99.99% availability for individual guest instances.
This document discusses some common methods of implementing highly
available systems, with an emphasis on the core OpenStack services and
other open source services that are closely aligned with OpenStack.
You will need to address high availability concerns for any application
software that you run in your OpenStack environment. The important thing is
to make sure that your services are redundant and available.
How you achieve that is up to you.
Active/passive versus active/active
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stateful services can be configured as active/passive or active/active,
which are defined as follows:
:term:`active/passive configuration`
Maintains a redundant instance
that can be brought online when the active service fails.
For example, OpenStack writes to the main database
while maintaining a disaster recovery database that can be brought online
if the main database fails.
A typical active/passive installation for a stateful service maintains
a replacement resource that can be brought online when required.
Requests are handled using a :term:`virtual IP address (VIP)` that
facilitates returning to service with minimal reconfiguration.
A separate application (such as Pacemaker or Corosync) monitors
these services, bringing the backup online as necessary.
:term:`active/active configuration`
Each service also has a backup but manages both the main and
redundant systems concurrently.
This way, if there is a failure, the user is unlikely to notice.
The backup system is already online and takes on increased load
while the main system is fixed and brought back online.
Typically, an active/active installation for a stateless service
maintains a redundant instance, and requests are load balanced using
a virtual IP address and a load balancer such as HAProxy.
A typical active/active installation for a stateful service includes
redundant services, with all instances having an identical state. In
other words, updates to one instance of a database update all other
instances. This way a request to one instance is the same as a
request to any other. A load balancer manages the traffic to these
systems, ensuring that operational systems always handle the
request.
Clusters and quorums
~~~~~~~~~~~~~~~~~~~~
The quorum specifies the minimal number of nodes
that must be functional in a cluster of redundant nodes
in order for the cluster to remain functional.
When one node fails and failover transfers control to other nodes,
the system must ensure that data and processes remain sane.
To determine this, the contents of the remaining nodes are compared
and, if there are discrepancies, a majority rules algorithm is implemented.
For this reason, each cluster in a high availability environment should
have an odd number of nodes and the quorum is defined as more than a half
of the nodes.
If multiple nodes fail so that the cluster size falls below the quorum
value, the cluster itself fails.
For example, in a seven-node cluster, the quorum should be set to
``floor(7/2) + 1 == 4``. With a quorum of four, the cluster fails if four
nodes fail simultaneously, but continues to function if no more than three
nodes fail. If the cluster is split into partitions of three and four nodes,
the four-node majority partition continues to operate and stops or fences the
three-node minority partition (depending on the ``no-quorum-policy`` cluster
configuration). As a configuration example, the quorum could also have been
set to three.
.. note::
We do not recommend setting the quorum to a value less than
``floor(n/2) + 1``, as it would likely cause a split-brain in the face of
network partitions.
With a quorum of three, the cluster would also continue to function when four
nodes fail simultaneously. However, if the cluster were split into partitions
of three and four nodes, both partitions would meet the quorum of three, and
each would attempt to fence the other and host the resources. Without fencing
enabled, the cluster would go straight to running two copies of each resource.
This is why setting the quorum to a value less than ``floor(n/2) + 1`` is
dangerous. However, it may be required for some specific cases, such as a
temporary measure at a point when it is known with 100% certainty that the
other nodes are down.
When configuring an OpenStack environment for study or demonstration purposes,
it is possible to turn off the quorum checking. Production systems should
always run with quorum enabled.
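As a hedged illustration for a Corosync/Pacemaker cluster, quorum state can be
inspected and, for demonstration environments only, quorum checking can be
relaxed as sketched below; never disable quorum handling in production.
.. code-block:: console
# Show current quorum and membership status
corosync-quorumtool -s
# Demonstration environments only: keep resources running without quorum
pcs property set no-quorum-policy=ignore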
Load balancing
~~~~~~~~~~~~~~
.. to do: definition and description of need within HA

24
doc/source/intro-ha.rst Normal file
View File

@ -0,0 +1,24 @@
=================================
Introduction to high availability
=================================
High availability systems seek to minimize the following issues:
#. System downtime: Occurs when a user-facing service is unavailable
beyond a specified maximum amount of time.
#. Data loss: Accidental deletion or destruction of data.
Most high availability systems guarantee protection against system downtime
and data loss only in the event of a single failure.
However, they are also expected to protect against cascading failures,
where a single failure deteriorates into a series of consequential failures.
Many service providers guarantee a :term:`Service Level Agreement (SLA)`
that includes an uptime percentage for their computing services, calculated
from the available time and the system downtime, excluding planned outages.
.. toctree::
:maxdepth: 2
intro-ha-key-concepts.rst
intro-ha-common-tech.rst

View File

@ -0,0 +1,67 @@
================
Cluster managers
================
At its core, a cluster is a distributed finite state machine capable
of co-ordinating the startup and recovery of inter-related services
across a set of machines.
Even a distributed or replicated application that is able to survive failures
on one or more machines can benefit from a cluster manager because a cluster
manager has the following capabilities:
#. Awareness of other applications in the stack
While SysV init replacements like systemd can provide
deterministic recovery of a complex stack of services, the
recovery is limited to one machine and lacks the context of what
is happening on other machines. This context is crucial to
distinguish between a local failure, and a clean startup and
recovery after a total site failure.
#. Awareness of instances on other machines
Services like RabbitMQ and Galera have complicated boot-up
sequences that require co-ordination, and often serialization, of
startup operations across all machines in the cluster. This is
especially true after a site-wide failure or shutdown where you must
first determine the last machine to be active.
#. A shared implementation and calculation of `quorum
<https://en.wikipedia.org/wiki/Quorum_(Distributed_Systems)>`_
It is very important that all members of the system share the same
view of who their peers are and whether or not they are in the
majority. Failure to do this leads very quickly to an internal
`split-brain <https://en.wikipedia.org/wiki/Split-brain_(computing)>`_
state. This is where different parts of the system are pulling in
different and incompatible directions.
#. Data integrity through fencing (a non-responsive process does not
imply it is not doing anything)
A single application does not have sufficient context to know the
difference between failure of a machine and failure of the
application on a machine. The usual practice is to assume the
machine is dead and continue working; however, this is highly risky. A
rogue process or machine could still be responding to requests and
generally causing havoc. The safer approach is to make use of
remotely accessible power switches, network switches, or SAN
controllers to fence (isolate) the machine before continuing
(see the fencing sketch after this list).
#. Automated recovery of failed instances
While the application can still run after the failure of several
instances, it may not have sufficient capacity to serve the
required volume of requests. A cluster can automatically recover
failed instances to prevent additional load induced failures.
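For illustration of the fencing capability described in the list above, the
following sketch registers an IPMI-based fencing device with Pacemaker; the
host name, device address, and credentials are placeholders, and parameter
names may vary with the fence agent version.
.. code-block:: console
# Illustrative only: the device address and credentials are placeholders
pcs stonith create fence_node1 fence_ipmilan \
  pcmk_host_list="node1" ipaddr="192.0.2.10" \
  login="admin" passwd="FENCE_PASS" \
  op monitor interval=60s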
Pacemaker
~~~~~~~~~
.. to do: description and point to ref arch example using pacemaker
`Pacemaker <http://clusterlabs.org>`_.
Systemd
~~~~~~~
.. to do: description and point to ref arch example using Systemd and link

View File

@ -0,0 +1,35 @@
=========
Memcached
=========
Most OpenStack services can use Memcached to store ephemeral data such as
tokens. Although Memcached does not support typical forms of redundancy such
as clustering, OpenStack services can use almost any number of instances
by configuring multiple hostnames or IP addresses.
The Memcached client implements hashing to balance objects among the instances.
Failure of an instance only impacts a percentage of the objects,
and the client automatically removes it from the list of instances.
Installation
~~~~~~~~~~~~
To install and configure Memcached, read the
`official documentation <https://github.com/Memcached/Memcached/wiki#getting-started>`_.
Memory caching is managed by `oslo.cache
<http://specs.openstack.org/openstack/oslo-specs/specs/kilo/oslo-cache-using-dogpile.html>`_.
This ensures consistency across all projects when using multiple Memcached
servers. The following is an example configuration with three hosts:
.. code-block:: ini
memcached_servers = controller1:11211,controller2:11211,controller3:11211
The client distributes cached objects across all three hosts. If one host
becomes unavailable, the client removes it from the list and the remaining
hosts take over its share of the objects.
For more information about Memcached installation, see the
*Environment -> Memcached* section in the
`Installation Guides <https://docs.openstack.org/ocata/install/>`_
depending on your distribution.
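As a hedged sketch, the corresponding oslo.cache options in a service
configuration file might look like the following; option names and defaults
can vary between releases, so treat this as illustrative rather than
definitive.
.. code-block:: ini
[cache]
enabled = true
backend = oslo_cache.memcache_pool
memcache_servers = controller1:11211,controller2:11211,controller3:11211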

View File

@ -0,0 +1,52 @@
==================================
Stateless versus stateful services
==================================
OpenStack components can be divided into three categories:
- OpenStack APIs: stateless HTTP(S) services written in Python that are easy
to duplicate and mostly easy to load balance.
- The SQL relational database server, which provides stateful storage consumed
by other components. Supported databases are MySQL, MariaDB, and PostgreSQL.
Making the SQL database redundant is complex.
- :term:`Advanced Message Queuing Protocol (AMQP)`, which provides the
internal stateful communication service for OpenStack.
.. to do: Ensure the difference between stateless and stateful services
.. is clear
Stateless services
~~~~~~~~~~~~~~~~~~
A service that provides a response after your request and then
requires no further attention. To make a stateless service highly
available, you need to provide redundant instances and load balance them.
Stateless OpenStack services
----------------------------
OpenStack services that are stateless include ``nova-api``,
``nova-conductor``, ``glance-api``, ``keystone-api``, ``neutron-api``,
and ``nova-scheduler``.
Stateful services
~~~~~~~~~~~~~~~~~
A service where subsequent requests to the service
depend on the results of the first request.
Stateful services are more difficult to manage because a single
action typically involves more than one request. Providing
additional instances and load balancing does not solve the problem.
For example, if the horizon user interface reset itself every time
you went to a new page, it would not be very useful.
OpenStack services that are stateful include the OpenStack database
and message queue.
Making stateful services highly available can depend on whether you choose
an active/passive or active/active configuration.
Stateful OpenStack services
----------------------------
.. to do: create list of stateful services

View File

@ -0,0 +1,12 @@
================================================
Introduction to high availability with OpenStack
================================================
.. to do: description of section & improvement of title (intro to OS HA)
.. toctree::
:maxdepth: 2
intro-os-ha-state.rst
intro-os-ha-cluster.rst
intro-os-ha-memcached.rst

View File

@ -0,0 +1,6 @@
==========
Monitoring
==========

View File

@ -0,0 +1,20 @@
========
L3 Agent
========
.. TODO: Introduce L3 agent
HA Routers
~~~~~~~~~~
.. TODO: content for HA routers
Networking DHCP agent
~~~~~~~~~~~~~~~~~~~~~
The OpenStack Networking (neutron) service has a scheduler that lets you run
multiple agents across nodes. The DHCP agent can be natively highly available.
To configure the number of DHCP agents per network, modify the
``dhcp_agents_per_network`` parameter in the :file:`/etc/neutron/neutron.conf`
file. By default this is set to 1. To achieve high availability, assign more
than one DHCP agent per network. For more information, see
`High-availability for DHCP
<https://docs.openstack.org/newton/networking-guide/config-dhcp-ha.html>`_.
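For example, a sketch of the relevant setting in
:file:`/etc/neutron/neutron.conf` (the value of three agents is an assumption;
choose a number that matches your deployment):
.. code-block:: ini
[DEFAULT]
# Schedule three DHCP agents for each network
dhcp_agents_per_network = 3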

View File

@ -0,0 +1,6 @@
==========
Neutron L3
==========
.. TODO: create and import Neutron L3 analysis
Introduce the Networking (neutron) service L3 agent

View File

@ -0,0 +1,5 @@
=========================
Neutron Networking server
=========================
.. TODO: Create content similar to other API sections

View File

@ -0,0 +1,29 @@
===================================
Configuring the networking services
===================================
Configure networking on each node. See the basic information about
configuring networking in the Networking service section of the
`Install Guides <https://docs.openstack.org/ocata/install/>`_,
depending on your distribution.
OpenStack network nodes contain:
- Networking DHCP agent
- Neutron L3 agent
- Networking L2 agent
.. note::
The L2 agent cannot be distributed and highly available. Instead, it
must be installed on each data forwarding node to control the virtual
network driver such as Open vSwitch or Linux Bridge. One L2 agent runs
per node and controls its virtual interfaces.
.. toctree::
:maxdepth: 2
networking-ha-neutron-server.rst
networking-ha-neutron-l3-analysis.rst
networking-ha-l3-agent.rst

24
doc/source/overview.rst Normal file
View File

@ -0,0 +1,24 @@
========
Overview
========
This guide can be split into two parts:
#. High level architecture
#. Reference architecture examples, monitoring, and testing
.. warning::
We recommend using this guide as an aid when planning your HA cloud,
not as a recipe for manually building one.
We recommend starting with a pre-validated solution and adjusting it to your
needs.
High availability is not for every user; it presents some challenges.
High availability may be too complex for databases or
systems with large amounts of data, and replication can slow large systems
down. Different setups have different prerequisites, so read the guidelines
for each setup.
.. important::
High availability is not enabled by default in OpenStack setups.

View File

@ -0,0 +1,3 @@
======================
Reference Architecture
======================

View File

@ -0,0 +1,59 @@
.. _storage-ha-backend:
================
Storage back end
================
An OpenStack environment includes multiple data pools for the VMs:
- Ephemeral storage is allocated for an instance and is deleted when the
instance is deleted. The Compute service manages ephemeral storage and
by default, Compute stores ephemeral drives as files on local disks on the
compute node. As an alternative, you can use Ceph RBD as the storage back
end for ephemeral storage.
- Persistent storage exists outside all instances. Two types of persistent
storage are provided:
- The Block Storage service (cinder) that can use LVM or Ceph RBD as the
storage back end.
- The Image service (glance) that can use the Object Storage service (swift)
or Ceph RBD as the storage back end.
For more information about configuring storage back ends for
the different storage options, see `Manage volumes
<https://docs.openstack.org/admin-guide/blockstorage-manage-volumes.html>`_
in the OpenStack Administrator Guide.
This section discusses ways to protect against data loss in your OpenStack
environment.
RAID drives
-----------
Configuring RAID on the hard drives that implement storage protects your data
against a hard drive failure. If the node itself fails, data may be lost.
In particular, all volumes stored on an LVM node can be lost.
Ceph
----
`Ceph RBD <http://ceph.com/>`_ is an innately highly available storage back
end. It creates a storage cluster with multiple nodes that communicate with
each other to replicate and redistribute data dynamically.
A Ceph RBD storage cluster provides a single shared set of storage nodes that
can handle all classes of persistent and ephemeral data (glance, cinder, and
nova) that are required for OpenStack instances.
Ceph RBD provides object replication capabilities by storing Block Storage
volumes as Ceph RBD objects. Ceph RBD ensures that each replica of an object
is stored on a different node. This means that your volumes are protected
against hard drive and node failures, or even the failure of the data center
itself.
When Ceph RBD is used for ephemeral volumes as well as block and image storage,
it supports `live migration
<https://docs.openstack.org/admin-guide/compute-live-migration-usage.html>`_
of VMs with ephemeral drives. LVM only supports live migration of
volume-backed VMs.

View File

@ -0,0 +1,192 @@
==================================
Highly available Block Storage API
==================================
Cinder provides Block-Storage-as-a-Service suitable for performance
sensitive scenarios such as databases, expandable file systems, or
providing a server with access to raw block level storage.
Persistent block storage can survive instance termination and can also
be moved across instances like any external storage device. Cinder
also provides a volume snapshot capability for backing up volumes.
Making the Block Storage API service highly available in
active/passive mode involves:
- :ref:`ha-blockstorage-pacemaker`
- :ref:`ha-blockstorage-configure`
- :ref:`ha-blockstorage-services`
In theory, you can run the Block Storage service as active/active.
However, because of unresolved concerns, we recommend running
the volume component as active/passive only.
You can read more about these concerns on the
`Red Hat Bugzilla <https://bugzilla.redhat.com/show_bug.cgi?id=1193229>`_
and there is a
`pseudo roadmap <https://etherpad.openstack.org/p/cinder-kilo-stabilisation-work>`_
for addressing them upstream.
.. _ha-blockstorage-pacemaker:
Add Block Storage API resource to Pacemaker
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On RHEL-based systems, create resources for cinder's systemd agents and create
constraints to enforce startup/shutdown ordering:
.. code-block:: console
pcs resource create openstack-cinder-api systemd:openstack-cinder-api --clone interleave=true
pcs resource create openstack-cinder-scheduler systemd:openstack-cinder-scheduler --clone interleave=true
pcs resource create openstack-cinder-volume systemd:openstack-cinder-volume
pcs constraint order start openstack-cinder-api-clone then openstack-cinder-scheduler-clone
pcs constraint colocation add openstack-cinder-scheduler-clone with openstack-cinder-api-clone
pcs constraint order start openstack-cinder-scheduler-clone then openstack-cinder-volume
pcs constraint colocation add openstack-cinder-volume with openstack-cinder-scheduler-clone
If the Block Storage service runs on the same nodes as the other services,
then it is advisable to also include:
.. code-block:: console
pcs constraint order start openstack-keystone-clone then openstack-cinder-api-clone
Alternatively, instead of using systemd agents, download and
install the OCF resource agent:
.. code-block:: console
# cd /usr/lib/ocf/resource.d/openstack
# wget https://git.openstack.org/cgit/openstack/openstack-resource-agents/plain/ocf/cinder-api
# chmod a+rx *
You can now add the Pacemaker configuration for Block Storage API resource.
Connect to the Pacemaker cluster with the :command:`crm configure` command
and add the following cluster resources:
.. code-block:: none
primitive p_cinder-api ocf:openstack:cinder-api \
params config="/etc/cinder/cinder.conf" \
os_password="secretsecret" \
os_username="admin" \
os_tenant_name="admin" \
keystone_get_token_url="http://10.0.0.11:5000/v2.0/tokens" \
op monitor interval="30s" timeout="30s"
This configuration creates ``p_cinder-api``, a resource for managing the
Block Storage API service.
The :command:`crm configure` command supports batch input. Copy and paste the
lines above into your live Pacemaker configuration and then make changes as
required. For example, you may enter ``edit p_ip_cinder-api`` from the
:command:`crm configure` menu and edit the resource to match your preferred
virtual IP address.
Once completed, commit your configuration changes by entering :command:`commit`
from the :command:`crm configure` menu. Pacemaker then starts the Block Storage
API service and its dependent resources on one of your nodes.
.. _ha-blockstorage-configure:
Configure Block Storage API service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit the ``/etc/cinder/cinder.conf`` file. For example, on a RHEL-based system:
.. code-block:: ini
:linenos:
[DEFAULT]
# This is the name that we advertise ourselves as. For A/P
# installations it should be the same everywhere.
host = cinder-cluster-1
# Listen on the Block Storage VIP
osapi_volume_listen = 10.0.0.11
auth_strategy = keystone
control_exchange = cinder
volume_driver = cinder.volume.drivers.nfs.NfsDriver
nfs_shares_config = /etc/cinder/nfs_exports
nfs_sparsed_volumes = true
nfs_mount_options = v3
[database]
connection = mysql+pymysql://cinder:CINDER_DBPASS@10.0.0.11/cinder
max_retries = -1
[keystone_authtoken]
# 10.0.0.11 is the Keystone VIP
identity_uri = http://10.0.0.11:35357/
www_authenticate_uri = http://10.0.0.11:5000/
admin_tenant_name = service
admin_user = cinder
admin_password = CINDER_PASS
[oslo_messaging_rabbit]
# Explicitly list the rabbit hosts as it doesn't play well with HAProxy
rabbit_hosts = 10.0.0.12,10.0.0.13,10.0.0.14
# As a consequence, we also need HA queues
rabbit_ha_queues = True
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
Replace ``CINDER_DBPASS`` with the password you chose for the Block Storage
database. Replace ``CINDER_PASS`` with the password you chose for the
``cinder`` user in the Identity service.
This example assumes that you are using NFS for the physical storage, which
will almost never be true in a production installation.
If you are using the Block Storage service OCF agent, some settings will
be filled in for you, resulting in a shorter configuration file:
.. code-block:: ini
:linenos:
# We have to use a MySQL connection to store data, using the Python 3
# compatible pymysql driver.
# Ref: https://wiki.openstack.org/wiki/PyMySQL_evaluation
connection = mysql+pymysql://cinder:CINDER_DBPASS@10.0.0.11/cinder
# We bind Block Storage API to the VIP:
osapi_volume_listen = 10.0.0.11
# We send notifications to the highly available RabbitMQ:
notifier_strategy = rabbit
rabbit_host = 10.0.0.11
Replace ``CINDER_DBPASS`` with the password you chose for the Block Storage
database.
.. _ha-blockstorage-services:
Configure OpenStack services to use the highly available Block Storage API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Your OpenStack services must now point their Block Storage API configuration
to the highly available, virtual cluster IP address rather than a Block Storage
API server's physical IP address as you would for a non-HA environment.
Create the Block Storage API endpoint with this IP.
If you are using both private and public IP addresses, create two virtual IPs
and define your endpoint. For example:
.. code-block:: console
$ openstack endpoint create --region $KEYSTONE_REGION \
volumev2 public http://PUBLIC_VIP:8776/v2/%\(project_id\)s
$ openstack endpoint create --region $KEYSTONE_REGION \
volumev2 admin http://10.0.0.11:8776/v2/%\(project_id\)s
$ openstack endpoint create --region $KEYSTONE_REGION \
volumev2 internal http://10.0.0.11:8776/v2/%\(project_id\)s

View File

@ -0,0 +1,114 @@
========================================
Highly available Shared File Systems API
========================================
Making the Shared File Systems (manila) API service highly available
in active/passive mode involves:
- :ref:`ha-sharedfilesystems-configure`
- :ref:`ha-sharedfilesystems-services`
- :ref:`ha-sharedfilesystems-pacemaker`
.. _ha-sharedfilesystems-configure:
Configure Shared File Systems API service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit the :file:`/etc/manila/manila.conf` file:
.. code-block:: ini
:linenos:
# We have to use a MySQL connection to store data:
sql_connection = mysql+pymysql://manila:password@10.0.0.11/manila?charset=utf8
# We bind the Shared File Systems API to the VIP:
osapi_share_listen = 10.0.0.11
# We send notifications to the highly available RabbitMQ:
notifier_strategy = rabbit
rabbit_host = 10.0.0.11
.. _ha-sharedfilesystems-services:
Configure OpenStack services to use Shared File Systems API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Your OpenStack services must now point their Shared File Systems API
configuration to the highly available, virtual cluster IP address rather than
a Shared File Systems API server's physical IP address as you would
for a non-HA environment.
You must create the Shared File Systems API endpoint with this IP.
If you are using both private and public IP addresses, you should create two
virtual IPs and define your endpoints like this:
.. code-block:: console
$ openstack endpoint create --region RegionOne \
sharev2 public 'http://PUBLIC_VIP:8786/v2/%(tenant_id)s'
$ openstack endpoint create --region RegionOne \
sharev2 internal 'http://10.0.0.11:8786/v2/%(tenant_id)s'
$ openstack endpoint create --region RegionOne \
sharev2 admin 'http://10.0.0.11:8786/v2/%(tenant_id)s'
.. _ha-sharedfilesystems-pacemaker:
Add Shared File Systems API resource to Pacemaker
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. Download the resource agent to your system:
.. code-block:: console
# cd /usr/lib/ocf/resource.d/openstack
# wget https://git.openstack.org/cgit/openstack/openstack-resource-agents/plain/ocf/manila-api
# chmod a+rx *
#. Add the Pacemaker configuration for the Shared File Systems
API resource. Connect to the Pacemaker cluster with the following
command:
.. code-block:: console
# crm configure
.. note::
The :command:`crm configure` command supports batch input. Copy and paste
the lines in the next step into your live Pacemaker configuration and then
make changes as required.
For example, you may enter ``edit p_ip_manila-api`` from the
:command:`crm configure` menu and edit the resource to match your preferred
virtual IP address.
#. Add the following cluster resources:
.. code-block:: none
primitive p_manila-api ocf:openstack:manila-api \
params config="/etc/manila/manila.conf" \
os_password="secretsecret" \
os_username="admin" \
os_tenant_name="admin" \
keystone_get_token_url="http://10.0.0.11:5000/v2.0/tokens" \
op monitor interval="30s" timeout="30s"
This configuration creates ``p_manila-api``, a resource for managing the
Shared File Systems API service.
#. Commit your configuration changes by entering the following command
from the :command:`crm configure` menu:
.. code-block:: console
# commit
Pacemaker now starts the Shared File Systems API service and its
dependent resources on one of your nodes.

View File

@ -0,0 +1,141 @@
==========================
Highly available Image API
==========================
The OpenStack Image service offers a service for discovering, registering, and
retrieving virtual machine images. To make the OpenStack Image API service
highly available in active/passive mode, you must:
- :ref:`glance-api-pacemaker`
- :ref:`glance-api-configure`
- :ref:`glance-services`
Prerequisites
~~~~~~~~~~~~~
Before beginning, ensure that you are familiar with the
documentation for installing the OpenStack Image API service.
See the *Image service* section in the
`Installation Guides <https://docs.openstack.org/ocata/install>`_,
depending on your distribution.
.. _glance-api-pacemaker:
Add OpenStack Image API resource to Pacemaker
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. Download the resource agent to your system:
.. code-block:: console
# cd /usr/lib/ocf/resource.d/openstack
# wget https://git.openstack.org/cgit/openstack/openstack-resource-agents/plain/ocf/glance-api
# chmod a+rx *
#. Add the Pacemaker configuration for the OpenStack Image API resource.
Use the following command to connect to the Pacemaker cluster:
.. code-block:: console
crm configure
.. note::
The :command:`crm configure` command supports batch input. Copy and paste
the lines in the next step into your live Pacemaker configuration and
then make changes as required.
For example, you may enter ``edit p_ip_glance-api`` from the
:command:`crm configure` menu and edit the resource to match your
preferred virtual IP address.
#. Add the following cluster resources:
.. code-block:: console
primitive p_glance-api ocf:openstack:glance-api \
params config="/etc/glance/glance-api.conf" \
os_password="secretsecret" \
os_username="admin" os_tenant_name="admin" \
os_auth_url="http://10.0.0.11:5000/v2.0/" \
op monitor interval="30s" timeout="30s"
This configuration creates ``p_glance-api``, a resource for managing the
OpenStack Image API service.
#. Commit your configuration changes by entering the following command from
the :command:`crm configure` menu:
.. code-block:: console
commit
Pacemaker then starts the OpenStack Image API service and its dependent
resources on one of your nodes.
.. _glance-api-configure:
Configure OpenStack Image service API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit the :file:`/etc/glance/glance-api.conf` file
to configure the OpenStack Image service:
.. code-block:: ini
# We have to use MySQL connection to store data:
sql_connection=mysql://glance:password@10.0.0.11/glance
# Alternatively, you can switch to pymysql,
# a new Python 3 compatible library and use
# sql_connection=mysql+pymysql://glance:password@10.0.0.11/glance
# and be ready when everything moves to Python 3.
# Ref: https://wiki.openstack.org/wiki/PyMySQL_evaluation
# We bind OpenStack Image API to the VIP:
bind_host = 10.0.0.11
# Connect to OpenStack Image registry service:
registry_host = 10.0.0.11
# We send notifications to the highly available RabbitMQ:
notifier_strategy = rabbit
rabbit_host = 10.0.0.11
[TODO: need more discussion of these parameters]
.. _glance-services:
Configure OpenStack services to use the highly available OpenStack Image API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Your OpenStack services must now point their OpenStack Image API configuration
to the highly available, virtual cluster IP address instead of pointing to the
physical IP address of an OpenStack Image API server as you would in a non-HA
cluster.
For example, if your OpenStack Image API service IP address is 10.0.0.11
(as in the configuration explained here), you would use the following
configuration in your :file:`nova.conf` file:
.. code-block:: ini
[glance]
# ...
api_servers = 10.0.0.11
# ...
You must also create the OpenStack Image API endpoint with this IP address.
If you are using both private and public IP addresses, create two virtual IP
addresses and define your endpoint. For example:
.. code-block:: console
$ openstack endpoint create --region $KEYSTONE_REGION \
image public http://PUBLIC_VIP:9292
$ openstack endpoint create --region $KEYSTONE_REGION \
image admin http://10.0.0.11:9292
$ openstack endpoint create --region $KEYSTONE_REGION \
image internal http://10.0.0.11:9292

22
doc/source/storage-ha.rst Normal file
View File

@ -0,0 +1,22 @@
===================
Configuring storage
===================
.. toctree::
:maxdepth: 2
storage-ha-image.rst
storage-ha-block.rst
storage-ha-file-systems.rst
storage-ha-backend.rst
Making the Block Storage (cinder) API service highly available in
active/active mode involves:
* Configuring Block Storage to listen on the VIP address
* Managing the Block Storage API daemon with the Pacemaker cluster manager
* Configuring OpenStack services to use this IP address
.. To Do: HA without Pacemaker

6
doc/source/testing.rst Normal file
View File

@ -0,0 +1,6 @@
=======
Testing
=======

27
setup.cfg Normal file
View File

@ -0,0 +1,27 @@
[metadata]
name = openstackhaguide
summary = OpenStack High Availability Guide
author = OpenStack
author-email = openstack-dev@lists.openstack.org
home-page = https://docs.openstack.org/
classifier =
Environment :: OpenStack
Intended Audience :: Information Technology
Intended Audience :: System Administrators
License :: OSI Approved :: Apache Software License
Operating System :: POSIX :: Linux
Topic :: Documentation
[global]
setup-hooks =
pbr.hooks.setup_hook
[files]
[build_sphinx]
warning-is-error = 1
build-dir = build
source-dir = source
[wheel]
universal = 1

30
setup.py Normal file
View File

@ -0,0 +1,30 @@
#!/usr/bin/env python
# Copyright (c) 2013 Hewlett-Packard Development Company, L.P.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# THIS FILE IS MANAGED BY THE GLOBAL REQUIREMENTS REPO - DO NOT EDIT
import setuptools
# In python < 2.7.4, a lazy loading of package `pbr` will break
# setuptools if some other modules registered functions in `atexit`.
# solution from: http://bugs.python.org/issue15881#msg170215
try:
import multiprocessing # noqa
except ImportError:
pass
setuptools.setup(
setup_requires=['pbr'],
pbr=True)

13
tox.ini
View File

@ -14,3 +14,16 @@ deps =
commands =
doc8 doc/source -e txt -e rst
sphinx-build -E -W -b html doc/source doc/build/html
[doc8]
# Settings for doc8:
# Ignore target directories and autogenerated files
ignore-path = doc/*/target,doc/*/build*
# File extensions to use
extensions = .rst,.txt
# Maximal line length should be 79 but we have some overlong lines.
# Let's not get far more in.
max-line-length = 79
# Disable some doc8 checks:
# D000: Check RST validity (cannot handle the "linenos" directive)
ignore = D000