Retire TripleO: remove repo content
The TripleO project is retiring - https://review.opendev.org/c/openstack/governance/+/905145 - so this commit removes the content of this project's repository. Change-Id: Ie1b54f1dce996fefd4080b307b6959f2570bfeef
parent e614e6c417
commit be8ce42b78
@ -1,7 +0,0 @@
[run]
branch = True
source = tripleo-specs
omit = tripleo-specs/tests/*,tripleo-specs/openstack/*

[report]
ignore_errors = True
51 .gitignore vendored
@ -1,51 +0,0 @@
*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml
.stestr/

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Complexity
output/*.html
output/*/index.html

# Sphinx
doc/build

# pbr generates these
AUTHORS
ChangeLog

# Editors
*~
.*.swp
3 .mailmap
@ -1,3 +0,0 @@
# Format is:
# <preferred e-mail> <other e-mail 1>
# <preferred e-mail> <other e-mail 2>
@ -1,3 +0,0 @@
[DEFAULT]
test_path=./tests
top_dir=.
@ -1,9 +0,0 @@
- project:
    templates:
      - openstack-specs-jobs
    check:
      jobs:
        - openstack-tox-py36
    gate:
      jobs:
        - openstack-tox-py36
@ -1,16 +0,0 @@
If you would like to contribute to the development of OpenStack,
you must follow the steps in this page:

   http://docs.openstack.org/infra/manual/developers.html

Once those steps have been completed, changes to OpenStack
should be submitted for review via the Gerrit tool, following
the workflow documented at:

   http://docs.openstack.org/infra/manual/developers.html#development-workflow

Pull requests submitted through GitHub will be ignored.

Bugs should be filed on Launchpad, not GitHub:

   https://bugs.launchpad.net/tripleo
@ -1,4 +0,0 @@
tripleo-specs Style Commandments
===============================================

Read the OpenStack Style Commandments https://docs.openstack.org/hacking/latest/
175 LICENSE
@ -1,175 +0,0 @@
|
||||
|
||||
Apache License
|
||||
Version 2.0, January 2004
|
||||
http://www.apache.org/licenses/
|
||||
|
||||
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
||||
|
||||
1. Definitions.
|
||||
|
||||
"License" shall mean the terms and conditions for use, reproduction,
|
||||
and distribution as defined by Sections 1 through 9 of this document.
|
||||
|
||||
"Licensor" shall mean the copyright owner or entity authorized by
|
||||
the copyright owner that is granting the License.
|
||||
|
||||
"Legal Entity" shall mean the union of the acting entity and all
|
||||
other entities that control, are controlled by, or are under common
|
||||
control with that entity. For the purposes of this definition,
|
||||
"control" means (i) the power, direct or indirect, to cause the
|
||||
direction or management of such entity, whether by contract or
|
||||
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
||||
outstanding shares, or (iii) beneficial ownership of such entity.
|
||||
|
||||
"You" (or "Your") shall mean an individual or Legal Entity
|
||||
exercising permissions granted by this License.
|
||||
|
||||
"Source" form shall mean the preferred form for making modifications,
|
||||
including but not limited to software source code, documentation
|
||||
source, and configuration files.
|
||||
|
||||
"Object" form shall mean any form resulting from mechanical
|
||||
transformation or translation of a Source form, including but
|
||||
not limited to compiled object code, generated documentation,
|
||||
and conversions to other media types.
|
||||
|
||||
"Work" shall mean the work of authorship, whether in Source or
|
||||
Object form, made available under the License, as indicated by a
|
||||
copyright notice that is included in or attached to the work
|
||||
(an example is provided in the Appendix below).
|
||||
|
||||
"Derivative Works" shall mean any work, whether in Source or Object
|
||||
form, that is based on (or derived from) the Work and for which the
|
||||
editorial revisions, annotations, elaborations, or other modifications
|
||||
represent, as a whole, an original work of authorship. For the purposes
|
||||
of this License, Derivative Works shall not include works that remain
|
||||
separable from, or merely link (or bind by name) to the interfaces of,
|
||||
the Work and Derivative Works thereof.
|
||||
|
||||
"Contribution" shall mean any work of authorship, including
|
||||
the original version of the Work and any modifications or additions
|
||||
to that Work or Derivative Works thereof, that is intentionally
|
||||
submitted to Licensor for inclusion in the Work by the copyright owner
|
||||
or by an individual or Legal Entity authorized to submit on behalf of
|
||||
the copyright owner. For the purposes of this definition, "submitted"
|
||||
means any form of electronic, verbal, or written communication sent
|
||||
to the Licensor or its representatives, including but not limited to
|
||||
communication on electronic mailing lists, source code control systems,
|
||||
and issue tracking systems that are managed by, or on behalf of, the
|
||||
Licensor for the purpose of discussing and improving the Work, but
|
||||
excluding communication that is conspicuously marked or otherwise
|
||||
designated in writing by the copyright owner as "Not a Contribution."
|
||||
|
||||
"Contributor" shall mean Licensor and any individual or Legal Entity
|
||||
on behalf of whom a Contribution has been received by Licensor and
|
||||
subsequently incorporated within the Work.
|
||||
|
||||
2. Grant of Copyright License. Subject to the terms and conditions of
|
||||
this License, each Contributor hereby grants to You a perpetual,
|
||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
||||
copyright license to reproduce, prepare Derivative Works of,
|
||||
publicly display, publicly perform, sublicense, and distribute the
|
||||
Work and such Derivative Works in Source or Object form.
|
||||
|
||||
3. Grant of Patent License. Subject to the terms and conditions of
|
||||
this License, each Contributor hereby grants to You a perpetual,
|
||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
||||
(except as stated in this section) patent license to make, have made,
|
||||
use, offer to sell, sell, import, and otherwise transfer the Work,
|
||||
where such license applies only to those patent claims licensable
|
||||
by such Contributor that are necessarily infringed by their
|
||||
Contribution(s) alone or by combination of their Contribution(s)
|
||||
with the Work to which such Contribution(s) was submitted. If You
|
||||
institute patent litigation against any entity (including a
|
||||
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
||||
or a Contribution incorporated within the Work constitutes direct
|
||||
or contributory patent infringement, then any patent licenses
|
||||
granted to You under this License for that Work shall terminate
|
||||
as of the date such litigation is filed.
|
||||
|
||||
4. Redistribution. You may reproduce and distribute copies of the
|
||||
Work or Derivative Works thereof in any medium, with or without
|
||||
modifications, and in Source or Object form, provided that You
|
||||
meet the following conditions:
|
||||
|
||||
(a) You must give any other recipients of the Work or
|
||||
Derivative Works a copy of this License; and
|
||||
|
||||
(b) You must cause any modified files to carry prominent notices
|
||||
stating that You changed the files; and
|
||||
|
||||
(c) You must retain, in the Source form of any Derivative Works
|
||||
that You distribute, all copyright, patent, trademark, and
|
||||
attribution notices from the Source form of the Work,
|
||||
excluding those notices that do not pertain to any part of
|
||||
the Derivative Works; and
|
||||
|
||||
(d) If the Work includes a "NOTICE" text file as part of its
|
||||
distribution, then any Derivative Works that You distribute must
|
||||
include a readable copy of the attribution notices contained
|
||||
within such NOTICE file, excluding those notices that do not
|
||||
pertain to any part of the Derivative Works, in at least one
|
||||
of the following places: within a NOTICE text file distributed
|
||||
as part of the Derivative Works; within the Source form or
|
||||
documentation, if provided along with the Derivative Works; or,
|
||||
within a display generated by the Derivative Works, if and
|
||||
wherever such third-party notices normally appear. The contents
|
||||
of the NOTICE file are for informational purposes only and
|
||||
do not modify the License. You may add Your own attribution
|
||||
notices within Derivative Works that You distribute, alongside
|
||||
or as an addendum to the NOTICE text from the Work, provided
|
||||
that such additional attribution notices cannot be construed
|
||||
as modifying the License.
|
||||
|
||||
You may add Your own copyright statement to Your modifications and
|
||||
may provide additional or different license terms and conditions
|
||||
for use, reproduction, or distribution of Your modifications, or
|
||||
for any such Derivative Works as a whole, provided Your use,
|
||||
reproduction, and distribution of the Work otherwise complies with
|
||||
the conditions stated in this License.
|
||||
|
||||
5. Submission of Contributions. Unless You explicitly state otherwise,
|
||||
any Contribution intentionally submitted for inclusion in the Work
|
||||
by You to the Licensor shall be under the terms and conditions of
|
||||
this License, without any additional terms or conditions.
|
||||
Notwithstanding the above, nothing herein shall supersede or modify
|
||||
the terms of any separate license agreement you may have executed
|
||||
with Licensor regarding such Contributions.
|
||||
|
||||
6. Trademarks. This License does not grant permission to use the trade
|
||||
names, trademarks, service marks, or product names of the Licensor,
|
||||
except as required for reasonable and customary use in describing the
|
||||
origin of the Work and reproducing the content of the NOTICE file.
|
||||
|
||||
7. Disclaimer of Warranty. Unless required by applicable law or
|
||||
agreed to in writing, Licensor provides the Work (and each
|
||||
Contributor provides its Contributions) on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
||||
implied, including, without limitation, any warranties or conditions
|
||||
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
||||
PARTICULAR PURPOSE. You are solely responsible for determining the
|
||||
appropriateness of using or redistributing the Work and assume any
|
||||
risks associated with Your exercise of permissions under this License.
|
||||
|
||||
8. Limitation of Liability. In no event and under no legal theory,
|
||||
whether in tort (including negligence), contract, or otherwise,
|
||||
unless required by applicable law (such as deliberate and grossly
|
||||
negligent acts) or agreed to in writing, shall any Contributor be
|
||||
liable to You for damages, including any direct, indirect, special,
|
||||
incidental, or consequential damages of any character arising as a
|
||||
result of this License or out of the use or inability to use the
|
||||
Work (including but not limited to damages for loss of goodwill,
|
||||
work stoppage, computer failure or malfunction, or any and all
|
||||
other commercial damages or losses), even if such Contributor
|
||||
has been advised of the possibility of such damages.
|
||||
|
||||
9. Accepting Warranty or Additional Liability. While redistributing
|
||||
the Work or Derivative Works thereof, You may choose to offer,
|
||||
and charge a fee for, acceptance of support, warranty, indemnity,
|
||||
or other liability obligations and/or rights consistent with this
|
||||
License. However, in accepting such obligations, You may act only
|
||||
on Your own behalf and on Your sole responsibility, not on behalf
|
||||
of any other Contributor, and only if You agree to indemnify,
|
||||
defend, and hold each Contributor harmless for any liability
|
||||
incurred by, or claims asserted against, such Contributor by reason
|
||||
of your accepting any such warranty or additional liability.
|
@ -1,6 +0,0 @@
include AUTHORS
include ChangeLog
exclude .gitignore
exclude .gitreview

global-exclude *.pyc
30 README.rst
@ -1,24 +1,10 @@
========================
Team and repository tags
========================
This project is no longer maintained.

.. image:: http://governance.openstack.org/badges/tripleo-specs.svg
   :target: http://governance.openstack.org/reference/tags/index.html
The contents of this repository are still available in the Git
source code management system. To see the contents of this
repository before it reached its end of life, please check out the
previous commit with "git checkout HEAD^1".

.. Change things from this point on

===============================
tripleo-specs
===============================

TripleO specs repository

* Free software: Apache license
* Documentation: https://specs.openstack.org/openstack/tripleo-specs
* Source: http://git.openstack.org/cgit/openstack/tripleo-specs
* Bugs: http://bugs.launchpad.net/tripleo

Features
--------

* TODO
For any further questions, please email
openstack-discuss@lists.openstack.org or join #openstack-dev on
OFTC.
@ -1,89 +0,0 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
||||
# implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import datetime
|
||||
|
||||
# -- General configuration ----------------------------------------------------
|
||||
|
||||
# Add any Sphinx extension module names here, as strings. They can be
|
||||
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
|
||||
extensions = [
|
||||
#'sphinx.ext.intersphinx',
|
||||
'openstackdocstheme',
|
||||
'yasfb',
|
||||
]
|
||||
|
||||
# Feed configuration for yasfb
|
||||
feed_base_url = 'https://specs.openstack.org/openstack/tripleo-specs'
|
||||
feed_author = 'OpenStack TripleO Team'
|
||||
|
||||
exclude_patterns = [
|
||||
'**/template.rst',
|
||||
'**/policy-template.rst',
|
||||
]
|
||||
|
||||
# openstackdocstheme options
|
||||
openstackdocs_repo_name = 'openstack/tripleo-specs'
|
||||
openstackdocs_bug_project = 'tripleo'
|
||||
openstackdocs_bug_tag = ''
|
||||
|
||||
# autodoc generation is a bit aggressive and a nuisance when doing heavy
|
||||
# text edit cycles.
|
||||
# execute "export SPHINX_DEBUG=1" in your terminal to disable
|
||||
|
||||
# The suffix of source filenames.
|
||||
source_suffix = '.rst'
|
||||
|
||||
# The master toctree document.
|
||||
master_doc = 'index'
|
||||
|
||||
# General information about the project.
|
||||
project = 'tripleo-specs'
|
||||
copyright = 'OpenStack Foundation'
|
||||
|
||||
# If true, '()' will be appended to :func: etc. cross-reference text.
|
||||
add_function_parentheses = True
|
||||
|
||||
# If true, the current module name will be prepended to all description
|
||||
# unit titles (such as .. function::).
|
||||
add_module_names = True
|
||||
|
||||
# The name of the Pygments (syntax highlighting) style to use.
|
||||
pygments_style = 'native'
|
||||
|
||||
# -- Options for HTML output --------------------------------------------------
|
||||
|
||||
# The theme to use for HTML and HTML Help pages. Major themes that come with
|
||||
# Sphinx are currently 'default' and 'sphinxdoc'.
|
||||
# html_theme_path = ["."]
|
||||
# html_theme = '_theme'
|
||||
# html_static_path = ['static']
|
||||
|
||||
html_theme = 'openstackdocs'
|
||||
|
||||
# Output file base name for HTML help builder.
|
||||
htmlhelp_basename = '%sdoc' % project
|
||||
|
||||
# Grouping the document tree into LaTeX files. List of tuples
|
||||
# (source start file, target name, title, author, documentclass
|
||||
# [howto/manual]).
|
||||
latex_documents = [
|
||||
('index',
|
||||
'%s.tex' % project,
|
||||
'%s Documentation' % project,
|
||||
'OpenStack Foundation', 'manual'),
|
||||
]
|
||||
|
||||
# Example configuration for intersphinx: refer to the Python standard library.
|
||||
#intersphinx_mapping = {'http://docs.python.org/': None}
|
@ -1,159 +0,0 @@
|
||||
.. tripleo documentation master file
|
||||
|
||||
==============================
|
||||
TripleO Project Specifications
|
||||
==============================
|
||||
|
||||
Zed Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/zed/*
|
||||
|
||||
Yoga Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/yoga/*
|
||||
|
||||
Xena Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/xena/*
|
||||
|
||||
Wallaby Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/wallaby/*
|
||||
|
||||
Victoria Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/victoria/*
|
||||
|
||||
Ussuri Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/ussuri/*
|
||||
|
||||
Train Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/train/*
|
||||
|
||||
Stein Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/stein/*
|
||||
|
||||
Rocky Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/rocky/*
|
||||
|
||||
Queens Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/queens/*
|
||||
|
||||
Pike Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/pike/*
|
||||
|
||||
Ocata Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/ocata/*
|
||||
|
||||
Newton Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/newton/*
|
||||
|
||||
Mitaka Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/mitaka/*
|
||||
|
||||
Liberty Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/liberty/*
|
||||
|
||||
Kilo Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/kilo/*
|
||||
|
||||
Juno Approved Specs:
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/juno/*
|
||||
|
||||
========================
|
||||
TripleO Project Policies
|
||||
========================
|
||||
|
||||
Team decisions and policies that are not limited to a specific release.
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 1
|
||||
|
||||
specs/policy/*
|
||||
|
||||
==================
|
||||
Indices and tables
|
||||
==================
|
||||
|
||||
* :ref:`search`
|
@ -1 +0,0 @@
../../specs
Binary file not shown. (Before: 582 KiB)
@ -1,5 +0,0 @@
openstackdocstheme>=2.2.1 # Apache-2.0
sphinx>=2.0.0,!=2.1.0 # BSD
stestr>=2.0.0 # Apache-2.0
testtools>=0.9.34
yasfb>=0.8.0
12 setup.cfg
@ -1,12 +0,0 @@
[metadata]
name = tripleo-specs
summary = TripleO specs repository
description_file =
    README.rst
author = OpenStack
author_email = openstack-discuss@lists.openstack.org
home_page = https://specs.openstack.org/openstack/tripleo-specs/
classifier =
    Intended Audience :: Developers
    License :: OSI Approved :: Apache Software License
    Operating System :: POSIX :: Linux
23 setup.py
@ -1,23 +0,0 @@
#!/usr/bin/env python
# Copyright (c) 2013 Hewlett-Packard Development Company, L.P.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# THIS FILE IS MANAGED BY THE GLOBAL REQUIREMENTS REPO - DO NOT EDIT
import setuptools

setuptools.setup(
    setup_requires=['pbr'],
    py_modules=[],
    pbr=True)
@ -1,260 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Backwards compatibility and TripleO
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-backwards-compat
|
||||
|
||||
TripleO has run with good but not perfect backwards compatibility since
|
||||
creation. It's time to formalise this in a documentable and testable fashion.
|
||||
|
||||
TripleO will follow Semantic Versioning (aka semver_) for versioning all
|
||||
releases. We will strive to avoid breaking backwards compatibility at all, and
|
||||
if we have to it will be because of extenuating circumstances such as security
|
||||
fixes with no other way to fix things.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
TripleO has historically run with an unspoken backwards compatibility policy
|
||||
but we now have too many people making changes - we need to build a single
|
||||
clear policy or else our contributors will have to rework things when one
|
||||
reviewer asks for backwards compat when they thought it was not needed (or, vice
versa, do the work to be backwards compatible when it isn't needed).
|
||||
|
||||
Secondly, because we haven't marked any of our projects as 1.0.0 there is no
|
||||
way for users or developers to tell when and where backwards compatibility is
|
||||
needed / appropriate.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Adopt the following high level heuristics for identifying backwards
|
||||
incompatible changes:
|
||||
|
||||
* Making changes that break user code that scripts or uses a public interface.
|
||||
|
||||
* Becoming unable to install something we could previously.
|
||||
|
||||
* Being unable to install something because someone else has altered things -
|
||||
e.g. being unable to install F20 if it no longer exists on the internet
|
||||
is not an incompatible change - if it were returned to the net, we'd be able
|
||||
to install it again. If we remove the code to support this thing, then we're
|
||||
making an incompatible change. The one exception here is unsupported
|
||||
projects - e.g. unsupported releases of OpenStack, or Fedora, or Ubuntu.
|
||||
Because unsupported releases are security issues, and we expect most of our
|
||||
dependencies to do releases, and stop supporting things, we will not treat
|
||||
cleaning up code only needed to support such an unsupported release as
|
||||
backwards compatible. For instance, breaking the ability to deploy a previous
|
||||
*still supported* OpenStack release where we had previously been able to
|
||||
deploy it is a backwards incompatible change, but breaking the ability to
|
||||
deploy an *unsupported* OpenStack release is not.
|
||||
|
||||
Corollaries to these principles:
|
||||
|
||||
* Breaking a public API (network or Python). The public API of a project is
  any released API (e.g. not explicitly marked alpha/beta/rc) in a version that
  is >= 1.0.0. For Python projects, a ``_`` prefix marks a namespace as non-public,
  e.g. in ``foo._bar.quux``, ``quux`` is not public because it sits in a non-public
  namespace (see the sketch after this list). For our projects that accept
  environment variables, a variable is part of the public interface if it is
  documented (in the README.md/user documentation); otherwise it is not.
|
||||
|
||||
* Increasing the set of required parameters to Heat templates. This breaks
|
||||
scripts that use TripleO to deploy. Note that adding new parameters which
|
||||
need to be set when deploying *new* things is fine because the user is
|
||||
doing more than just pulling in updated code.
|
||||
|
||||
* Decreasing the set of accepted parameters to Heat templates. Likewise, this
|
||||
breaks scripts using the Heat templates to do deploys. If the parameters are
|
||||
no longer accepted because they are for no longer supported versions of
|
||||
OpenStack then that is covered by the carve-out above.
|
||||
|
||||
* Increasing the required metadata to use an element except when both Tuskar
|
||||
and tripleo-heat-templates have been updated to use it. There is a
|
||||
bi-directional dependency from t-i-e to t-h-t and back - when we change
|
||||
signals in the templates we have to update t-i-e first, and when we change
|
||||
parameters to elements we have to alter t-h-t first. We could choose to make
|
||||
t-h-t and t-i-e completely independent, but don't believe that is a sensible
|
||||
use of time - they are closely connected, even though loosely coupled.
|
||||
Instead we're treating them a single unit: at any point in time t-h-t can
|
||||
only guarantee to deploy images built from some minimum version of t-i-e,
|
||||
and t-i-e can only guarantee to be deployed with some minimum version of
|
||||
t-h-t. The public API here is t-h-t's parameters, and the link to t-i-e
|
||||
is equivalent to the dependency on a helper library for a Python
|
||||
library/program: requiring new minor versions of the helper library is not
|
||||
generally considered to be an API break of the calling code. Upgrades will
|
||||
still work with this constraint - machines will get a new image at the same
|
||||
time as new metadata, with a rebuild in the middle. Downgrades / rollback
|
||||
may require switching to an older template at the same time, but that was
|
||||
already the case.
|
||||
|
||||
* Decreasing the accepted metadata for an element if that would result in an
|
||||
error or misbehaviour.
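To make the Python part of the public-API corollary above concrete, a minimal
illustration follows; the module and function names are hypothetical and not
taken from any TripleO project:

.. code-block:: python

    # foo/_bar.py -- everything under a leading-underscore namespace is
    # non-public, regardless of the names defined inside it.
    def quux():
        return 42          # foo._bar.quux: not part of the public API

    # foo/api.py -- released (>= 1.0.0), no underscore anywhere in the path.
    from foo._bar import quux

    def deploy():
        return quux()      # foo.api.deploy: public, covered by semver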
|
||||
|
||||
Other sorts of changes may also be backwards incompatible, and if identified
|
||||
will be treated as such - that is, this list is not comprehensive.
|
||||
|
||||
We don't consider the internal structure of Heat templates to be an API, nor
|
||||
any test code within the TripleO codebases (whether it may appear to be public
|
||||
or not).
|
||||
|
||||
TripleO's incubator is not released and has no backwards compatibility
|
||||
guarantees - but a point in time incubator snapshot interacts with ongoing
|
||||
releases of other components - and they will be following semver, which means
|
||||
that a user wanting stability can get that as long as they don't change the
|
||||
incubator.
|
||||
|
||||
TripleO will promote all its component projects to 1.0 within one OpenStack
|
||||
release cycle of them being created. Projects may not become dependencies of a
|
||||
project with a 1.0 or greater version until they are at 1.0 themselves. This
|
||||
restriction serves to prevent version locking (makes upgrades impossible) by
|
||||
the depending version, or breakage (breaks users) if the pre 1.0 project breaks
|
||||
compatibility. Adding new projects will involve creating test jobs that test
|
||||
the desired interactions before the dependency is added, so that the API can
|
||||
be validated before the new project has reached 1.0.
|
||||
|
||||
Adopt the following rule on *when* we are willing to [deliberately] break
|
||||
backwards compatibility:
|
||||
|
||||
* When all known uses of the code are for no longer supported OpenStack
|
||||
releases.
|
||||
|
||||
* If the PTL signs off on the break. E.g. a high impact security fix for which
|
||||
we cannot figure out a backwards compatible way to deliver it to our users
|
||||
and distributors.
|
||||
|
||||
We also need to:
|
||||
|
||||
* Set a timeline for new codebases to become mature (one cycle). Existing
|
||||
codebases will have the clock start when this specification is approved.
|
||||
|
||||
* Set rules for allowing anyone to depend on new codebases (codebase must be
|
||||
1.0.0).
|
||||
|
||||
* Document what backwards compatible means in the context of heat templates and
|
||||
elements.
|
||||
|
||||
* Add an explicit test job for deploying Icehouse from trunk, because that will
|
||||
tell us about our ability to deploy currently supported OpenStack versions
|
||||
which we could previously deploy - that failing would indicate the proposed
|
||||
patch is backwards incompatible.
|
||||
|
||||
* If needed either fix Icehouse, or take a consensus decision to exclude
|
||||
Icehouse support from this policy.
|
||||
|
||||
* Commit to preserving backwards compatibility.
|
||||
|
||||
* When we need alternate codepaths to support backwards compatibility we will
|
||||
mark them clearly to facilitate future cleanup::
|
||||
|
||||
# Backwards compatibility: <....>
|
||||
if ..
|
||||
# Trunk
|
||||
...
|
||||
elif
|
||||
# Icehouse
|
||||
...
|
||||
else
|
||||
# Havana
|
||||
...
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
* We could say that we don't do backwards compatibility and release like the
|
||||
OpenStack API services do, but this makes working with us really difficult
|
||||
and it also forces folk with stable support desires to work from separate
|
||||
branches rather than being able to collaborate on a single codebase.
|
||||
|
||||
* We could treat tripleo-heat-templates and tripleo-image-elements separately
|
||||
to the individual components and run them under different rules - e.g. using
|
||||
stable branches rather than semver. But there have been so few times that
|
||||
backwards compatibility would be hard for us that this doesn't seem worth
|
||||
doing.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Keeping code around longer may have security considerations, but this is a
|
||||
well known interaction.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
End users will love us.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None anticipated. Images will be marginally larger due to carrying backwards
|
||||
compat code around.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
Deployers will appreciate not having to rework things. Not that they have had
|
||||
to, but still.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Developers will have clear expectations set about backwards compatibility which
|
||||
will help them avoid being asked to rework things. They and reviewers will need
|
||||
to look out for backward incompatible changes and special case handling of
|
||||
them to deliver the compatibility we aspire to.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
lifeless
|
||||
|
||||
Other contributors:
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Draft this spec.
|
||||
|
||||
* Get consensus around it.
|
||||
|
||||
* Release all our non-incubator projects as 1.0.0.
|
||||
|
||||
* Add Icehouse deploy test job. (Because we could install Icehouse at the start
|
||||
of Juno, and if we get in fast we can keep being able to do so).
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None. An argument could be made for doing a quick cleanup of stuff, but the
|
||||
reality is that it's not such a burden we've had to clean it up yet.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
To ensure we don't accidentally break backwards compatibility we should look
|
||||
at the oslo cross-project matrix eventually - e.g. run os-refresh-config
|
||||
against older releases of os-apply-config to ensure we're not breaking
|
||||
compatibility. Our general policy of building releases of things and using
|
||||
those goes a long way to giving us good confidence though - we can be fairly
|
||||
sure of no single-step regressions (but will still have to watch out for
|
||||
N-step regressions unless some mechanism is put in place).
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The users manual and developer guides should reflect this.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. _semver: http://docs.openstack.org/developer/pbr/semver.html
|
@ -1,229 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
================================================
|
||||
Haproxy ports and related services configuration
|
||||
================================================
|
||||
|
||||
Blueprint: https://blueprints.launchpad.net/tripleo/+spec/tripleo-haproxy-configuration
|
||||
|
||||
This spec provides options for HA endpoint delivery via haproxy.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Current tripleo deployment scheme binds services on 0.0.0.0:standard_port,
|
||||
with stunnel configured to listen on ssl ports.
|
||||
|
||||
This configuration has some drawbacks and won't work in HA, for several reasons:
|
||||
|
||||
* haproxy can't bind on <vip_address>:<service_port> - openstack services are
|
||||
bound to 0.0.0.0:<service_port>
|
||||
|
||||
* service ports are hardcoded in many places (any_service.conf, init-keystone),
|
||||
so changing them and configuring from heat would be a lot of pain
|
||||
|
||||
* the non-ssl endpoint is reachable from outside the local host,
|
||||
which could potentially confuse users and expose them to an insecure connection
|
||||
in the case where we want to run that service on SSL only. We want to offer SSL
|
||||
by default but we can't really prevent it.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
We will bind haproxy, stunnel (ssl), openstack services on ports with
|
||||
different ipaddress settings.
|
||||
|
||||
HAProxy will be bound to VIP addresses only.
|
||||
|
||||
STunnel where it is used will be bound to the controller ctlplane address.
|
||||
|
||||
OpenStack services will bind to localhost for SSL only configurations, and to
|
||||
the ctlplane address for non-SSL or mixed-mode configurations. They will bind
|
||||
to the standard non-encrypted ports, but will never bind to 0.0.0.0 on any
|
||||
port.
|
||||
|
||||
We'll strive to make SSL-only the default.
|
||||
|
||||
An example, using horizon in mixed mode (HTTPS and HTTP):
|
||||
|
||||
vip_address = 192.0.2.21
|
||||
node_address = 192.0.2.24
|
||||
|
||||
1. haproxy
|
||||
listen horizon_http
|
||||
bind vip_address:80
|
||||
server node_1 node_address:80
|
||||
listen horizon_https
|
||||
bind vip_address:443
|
||||
server node_1 node_address:443
|
||||
|
||||
2. stunnel
|
||||
accept node_address:443
|
||||
connect node_address:80
|
||||
|
||||
3. horizon
|
||||
bind node_address:80
|
||||
|
||||
A second example, using horizon in HTTPS only mode:
|
||||
|
||||
vip_address = 192.0.2.21
|
||||
node_address = 192.0.2.24
|
||||
|
||||
1. haproxy
|
||||
listen horizon_https
|
||||
bind vip_address:443
|
||||
server node_1 node_address:443
|
||||
|
||||
2. stunnel
|
||||
accept node_address:443
|
||||
connect 127.0.0.1:80
|
||||
|
||||
3. horizon
|
||||
bind 127.0.0.1:80
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
There are several alternatives which do not cover all the requirements for
|
||||
security or extensibility
|
||||
|
||||
Option 1: Assignment of different ports for haproxy, stunnel, openstack services on 0.0.0.0
|
||||
|
||||
* requires additional firewall configuration
|
||||
* security issue with non-ssl services endpoints
|
||||
|
||||
1. haproxy
|
||||
bind :80
|
||||
|
||||
listen horizon
|
||||
server node_1 node_address:8800
|
||||
|
||||
2. stunnel
|
||||
accept :8800
|
||||
connect :8880
|
||||
|
||||
3. horizon
|
||||
bind :8880
|
||||
|
||||
Option 2: Using only haproxy ssl termination is suboptimal:
|
||||
|
||||
* 1.5 is still in devel phase -> potential stability issues
|
||||
* we would have to get this into supported distros
|
||||
* this also means that there is no SSL between haproxy and real service
|
||||
* security issue with non-ssl services endpoints
|
||||
|
||||
1. haproxy
|
||||
bind vip_address:80
|
||||
|
||||
listen horizon
|
||||
server node_1 node_address:80
|
||||
|
||||
2. horizon
|
||||
bind node_address:80
|
||||
|
||||
Option 3: Add additional ssl termination before load-balancer
|
||||
|
||||
* not useful in current configuration because load balancer (haproxy)
|
||||
and openstack services installed on same nodes
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
* Only ssl protected endpoints are publicly available if running SSL only.
|
||||
* Minimal firewall configuration
|
||||
* Not forwarding decrypted traffic over non-localhost connections
|
||||
* compromise of a control node exposes all external traffic (future and possibly past)
|
||||
to decryption and/or spoofing
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Several services will listen on the same port, but this is easy to
understand once the user (operator) knows the context.
|
||||
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
No differences between approaches.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
We need to make the service configs - nova etc - know on a per service basis
|
||||
where to bind. The current approach uses logic in the template to choose
|
||||
between localhost and my_ip. If we move the selection into Heat this can
|
||||
become a lot simpler (read a bind address, if set use it, if not don't).
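A minimal sketch of that per-service selection, assuming the Heat metadata
already tells the element whether the service is SSL-only; the key names here
are illustrative, not the actual template metadata:

.. code-block:: python

    def bind_address(metadata):
        # SSL-only: bind the plain-HTTP service to loopback so only the local
        # stunnel can reach it; otherwise bind to the node's ctlplane address.
        # In neither case do we fall back to 0.0.0.0.
        if metadata.get('ssl_only', False):
            return '127.0.0.1'
        return metadata['ctlplane_address']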
|
||||
|
||||
We considered extending the connect_ip concept to be on a per service basis.
|
||||
Right now all services are exposed to both SSL and plain, so this would be
|
||||
workable until we get a situation where only some services are plain - but we
|
||||
expect that sooner rather than later.
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
dshulyak
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
tripleo-incubator:
|
||||
* build overcloud-control image with haproxy element
|
||||
|
||||
tripleo-image-elements:
|
||||
|
||||
* openstack-ssl element refactoring
|
||||
|
||||
* refactor services configs to listen on 127.0.0.1 / ctlplane address:
|
||||
horizon apache configuration, glance, nova, cinder, swift, ceilometer,
|
||||
neutron, heat, keystone, trove
|
||||
|
||||
tripleo-heat-templates:
|
||||
* add haproxy metadata to heat-templates
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
None
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
CI testing dependencies:
|
||||
|
||||
* use vip endpoints in overcloud scripts
|
||||
|
||||
* add haproxy element to overcloud-control image (maybe with stats enabled) before
|
||||
adding haproxy related metadata to heat templates
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
* update incubator manual
|
||||
|
||||
* update elements README.md
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
http://haproxy.1wt.eu/download/1.4/doc/configuration.txt
|
||||
|
||||
https://www.stunnel.org/howto.html
|
@ -1,272 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
TripleO network configuration
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/os-net-config
|
||||
|
||||
We need a tool (or tools) to help configure host level networking
|
||||
in TripleO. This includes things like:
|
||||
|
||||
* Static IPs
|
||||
|
||||
* Multiple OVS bridges
|
||||
|
||||
* Bonding
|
||||
|
||||
* VLANs
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Today in TripleO we bootstrap nodes using DHCP so they can download
|
||||
custom per node metadata from Heat. This metadata contains per instance
|
||||
network information that allows us to create a customized host level network
|
||||
configuration.
|
||||
|
||||
Today this is accomplished via two scripts:
|
||||
|
||||
* ensure-bridge: http://git.openstack.org/cgit/openstack/tripleo-image-elements/tree/elements/network-utils/bin/ensure-bridge
|
||||
* init-neutron-ovs: http://git.openstack.org/cgit/openstack/tripleo-image-elements/tree/elements/neutron-openvswitch-agent/bin/init-neutron-ovs
|
||||
|
||||
The problem with the existing scripts is that their feature set is extremely
|
||||
prescriptive and limited. Today we only support bridging a single NIC
|
||||
onto an OVS bridge, VLAN support is limited and more advanced configuration
|
||||
(of even common IP address attributes like MTUs, etc) is not possible.
|
||||
|
||||
Furthermore we also desire some level of control over how networking changes
|
||||
are made and whether they are persistent. In this regard a provider layer
|
||||
would be useful so that users can choose between using for example:
|
||||
|
||||
* ifcfg/eni scripts: used where persistence is required and we want
|
||||
to configure interfaces using the distro supported defaults
|
||||
* iproute2: used to provide optimized/streamlined network configuration
|
||||
which may or may not also include persistence
|
||||
|
||||
Our capabilities are currently limited to the extent that we are unable
|
||||
to fully provision our TripleO CI overclouds without making manual
|
||||
changes and/or hacks to images themselves. As such we need to
|
||||
expand our host level network capabilities.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Create a new python project which encapsulates host level network configuration.
|
||||
|
||||
This will likely consist of:
|
||||
|
||||
* an internal python library to facilitate host level network configuration
|
||||
|
||||
* a binary which processes a YAML (or JSON) format and makes the associated
|
||||
python library calls to configure host level networking.
|
||||
|
||||
By following this design the tool should work well with Heat driven
|
||||
metadata and provide us the future option of moving some of the
|
||||
library code into Oslo (oslo.network?) or perhaps Neutron itself.
|
||||
|
||||
The tool will support a "provider" layer such that multiple implementations
|
||||
can drive the host level network configuration (iproute2, ifcfg, eni).
|
||||
This is important because as new network config formats are adopted
|
||||
by distributions we may want to gradually start making use of them
|
||||
(thinking ahead to systemd.network for example).
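To illustrate the intended split between the object model, the YAML-driven
binary and a provider, a rough sketch follows; the class and method names are
hypothetical and not the eventual os-net-config API:

.. code-block:: python

    import yaml

    class Interface(object):
        def __init__(self, name, use_dhcp=False):
            self.name = name
            self.use_dhcp = use_dhcp

    class OvsBridge(Interface):
        def __init__(self, name, members=None, **kwargs):
            super(OvsBridge, self).__init__(name, **kwargs)
            self.members = members or []

    class IfcfgProvider(object):
        """Render the object model as persistent ifcfg-* files ("make it so")."""
        def apply(self, objects):
            for obj in objects:
                path = '/etc/sysconfig/network-scripts/ifcfg-%s' % obj.name
                print('would write %s' % path)

    def configure(config_path):
        data = yaml.safe_load(open(config_path))['network_config']
        objects = []
        for item in data:
            if item['type'] == 'ovs_bridge':
                members = [Interface(m['name'])
                           for m in item.get('members', [])]
                objects.append(OvsBridge(item['name'], members=members,
                                         use_dhcp=item.get('use_dhcp', False)))
        IfcfgProvider().apply(objects)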
|
||||
|
||||
The tool will also need to be extensible such that we can add new
|
||||
configuration options over time. We may for example want to add
|
||||
more advanced bondings options at a later point in time... and
|
||||
this should be as easy as possible.
|
||||
|
||||
The focus of the tool initially will be host level network configuration
|
||||
for existing TripleO features (interfaces, bridges, vlans) in a much
|
||||
more flexible manner. While we support these things today in a prescriptive
|
||||
manner the new tool will immediately support multiple bridges, interfaces,
|
||||
and vlans that can be created in an ad-hoc manner. Heat templates can be
|
||||
created to drive common configurations and people can customize those
|
||||
as needed for more advanced networking setups.
|
||||
|
||||
The initial implementation will focus on persistent configuration formats
|
||||
for ifcfg and eni, like we do today via ensure-bridge. This will help us
|
||||
continue to make steps towards bringing bare metal machines back online
|
||||
after a power outage (providing a static IP for the DHCP server for example).
|
||||
|
||||
The primary focus of this tool should always be host level network
|
||||
configuration and fine tuning that we can't easily do within Neutron itself.
|
||||
Over time the scope and concept of the tool may shift as Neutron features are
|
||||
added and/or subtracted.
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
One alternative is to keep expanding ensure-bridge and init-neutron-ovs
|
||||
which would require a significant number of new bash options and arguments to
|
||||
configure all the new features (vlans, bonds, etc.).
|
||||
|
||||
Many of the deployment projects within the OpenStack ecosystem are doing
|
||||
similar sorts of networking today. Consider:
|
||||
|
||||
* Chef/Crowbar: https://github.com/opencrowbar/core/blob/master/chef/cookbooks/network/recipes/default.rb
|
||||
* Fuel: https://github.com/stackforge/fuel-library/tree/master/deployment/puppet/l23network
|
||||
* VDSM (GPL): contains code to configure interfaces, both ifcfg and iproute2 abstractions (git clone http://gerrit.ovirt.org/p/vdsm.git, then look at vdsm/vdsm/network/configurators)
|
||||
* Netconf: heavy handed for this perhaps but interesting (OpenDaylight, etc)
|
||||
|
||||
Most of these options are undesirable because they would add a significant
|
||||
number of dependencies to TripleO.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The configuration data used by this tool is already admin-oriented in
|
||||
nature and will continue to be provided by Heat. As such there should
|
||||
be no user facing security concerns with regards to access to the
|
||||
configuration data that aren't already present.
|
||||
|
||||
This implementation will directly impact the low level network connectivity
|
||||
in all layers of TripleO including the seed, undercloud, and overcloud
|
||||
networks. Any of the host level networking that isn't already provided
|
||||
by Neutron is likely affected.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
This feature enables deployers to build out more advanced undercloud and
|
||||
overcloud networks and as such should help improve the reliability and
|
||||
performance of the fundamental host network capabilities in TripleO.
|
||||
|
||||
End users should benefit from these efforts.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
This feature will allow us to build better/more advanced networks and as
|
||||
such should help improve performance. In particular the interface bonding
|
||||
and VLAN support should help in this regard.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Dan Prince (dan-prince on Launchpad)
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Create project on GitHub: os-net-config
|
||||
|
||||
* Import project into openstack-infra, get unit tests gating, etc.
|
||||
|
||||
* Build a python library to configure host level networking with
|
||||
an initial focus on parity with what we already have including things
|
||||
we absolutely need for our TripleO CI overcloud networks.
|
||||
|
||||
The library will consist of an object model which will allow users to
|
||||
create interfaces, bridges, and vlans, and bonds (optional). Each of
|
||||
these types will act as a container for address objects (IPv4 and IPv6)
|
||||
and routes (multiple routes may be defined). Additionally, each
|
||||
object will include options to enable/disable DHCP and set the MTU.
|
||||
|
||||
* Create provider layers for ifcfg/eni. The providers take an object
|
||||
model and apply it ("make it so"). The ifcfg provider will write out
|
||||
persistent config files in /etc/sysconfig/network-scripts/ifcfg-<name>
|
||||
and use ifup/ifdown to start and stop the interfaces when a change
|
||||
has been made. The eni provider will write out configurations to
|
||||
/etc/network/interfaces and likewise use ifup/ifdown to start and
|
||||
stop interfaces when a change has been made.
|
||||
|
||||
* Create a provider layer for iproute2. Optional, can be done at
|
||||
a later time. This provider will most likely not use persistent
|
||||
formats and will run various ip/vconfig/route commands to
|
||||
configure host level networking for a given object model.
|
||||
|
||||
* Create a binary that processes a YAML config file format and makes
|
||||
the correct python library calls. The binary should be idempotent
|
||||
in that running the binary once with a given configuration should
|
||||
"make it so". Running it a second time with the same configuration
|
||||
should do nothing (i.e. it is safe to run multiple times). An example
|
||||
YAML configuration format is listed below which describes a single
|
||||
OVS bridge with an attached interface, this would match what
|
||||
ensure-bridge creates today:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
network_config:
|
||||
-
|
||||
type: ovs_bridge
|
||||
name: br-ctlplane
|
||||
use_dhcp: true
|
||||
ovs_extra:
|
||||
- br-set-external-id br-ctlplane bridge-id br-ctlplane
|
||||
members:
|
||||
-
|
||||
type: interface
|
||||
name: em1
|
||||
|
||||
..
|
||||
|
||||
The above format uses a nested approach to define an interface
|
||||
attached to a bridge.
|
||||
|
||||
* TripleO element to install os-net-config. Most likely using
|
||||
pip (but we may use git initially until it is released).
|
||||
|
||||
* Wire this up to TripleO...get it all working together using the
|
||||
existing Heat metadata formats. This would include any documentation
|
||||
changes to tripleo-incubator, deprecating old elements, etc.
|
||||
|
||||
* TripleO heat template changes to use the new YAML/JSON formats. Our default
|
||||
configuration would most likely do exactly what we do today (OVS bridge
|
||||
with a single attached interface). We may want to create some other example
|
||||
heat templates which can be used in other environments (multi-bridge
|
||||
setups like we use for our CI overclouds for example).
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Existing TripleO CI will help ensure that as we implement this we maintain
|
||||
parity with the current feature set.
|
||||
|
||||
The ability to provision and make use of our TripleO CI clouds without
|
||||
custom modifications/hacks will also be a proving ground for much of
|
||||
the work here.
|
||||
|
||||
Additional manual testing may be required for some of the more advanced
|
||||
modes of operation (bonding, VLANs, etc.)
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The recommended heat metadata used for network configuration may
|
||||
change as result of this feature. Older formats will be preserved for
|
||||
backwards compatibility.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Notes from the Atlanta summit session on this topic can be found
|
||||
here (includes possible YAML config formats):
|
||||
|
||||
* https://etherpad.openstack.org/p/tripleo-network-configuration
|
@ -1,162 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================================
|
||||
Control mechanism for os-apply-config
|
||||
=====================================
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
We require a control mechanism in os-apply-config (oac). This could be used,
|
||||
for example, to:
|
||||
|
||||
* Not create an empty target
|
||||
* Set permissions on the target
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
The basic proposal is to parameterise oac with maps (aka dictionaries)
|
||||
containing control data. These maps will be supplied as YAML in companion
|
||||
control files. Each file will be named after the template it relates to, with a
|
||||
".oac" suffix. For example, the file "abc/foo.sh" would be controlled by
|
||||
"abc/foo.sh.oac".
|
||||
|
||||
Only control files with matching template files will be respected, i.e. the file
|
||||
"foo" must exist for the control file "foo.oac" to have any effect. A dib-lint
|
||||
check will be added to look for file control files without matching templates,
|
||||
as this may indicate a template has been moved without its control file.
|
||||
|
||||
Directories may also have control files. In this case, the control file must be
|
||||
inside the directory and be named exactly "oac". A file either named "oac" or
|
||||
with the control file suffix ".oac" will never be considered as templates.
|
||||
|
||||
The YAML in the control file must evaluate to nothing or a mapping. The former
|
||||
allows for the whole mapping having been commented out. The presence of
|
||||
unrecognised keys in the mapping is an error. File and directory control keys
|
||||
are distinct but may share names. If they do, they should also share similar
|
||||
semantics.
|
||||
|
||||
Example control file::
|
||||
|
||||
key1: true
|
||||
key2: 0700
|
||||
# comment
|
||||
key3:
|
||||
- 1
|
||||
- 2
|
||||
|
||||
To make the design concrete, one file control key will be offered initially:
|
||||
allow_empty. This expects a Boolean value and defaults to true. If it is true,
|
||||
oac will behave as it does today. Otherwise, if after substitutions the
|
||||
template body is empty, no file will be created at the target path and any
|
||||
existing file there will be deleted.
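A rough sketch of how oac might honour the key, assuming a ``render()``
callable that performs the substitutions; the function names are illustrative
only, not the actual os-apply-config implementation:

.. code-block:: python

    import os
    import yaml

    def write_target(template_path, target_path, render):
        control = {}
        control_path = template_path + '.oac'
        if os.path.exists(control_path):
            # The control file may evaluate to nothing (fully commented out).
            control = yaml.safe_load(open(control_path).read()) or {}
        unknown = set(control) - set(['allow_empty'])
        if unknown:
            raise ValueError('unrecognised control keys: %s' % sorted(unknown))
        body = render(open(template_path).read())
        if body == '' and not control.get('allow_empty', True):
            # allow_empty=false: do not create the target and remove any
            # previously written copy.
            if os.path.exists(target_path):
                os.remove(target_path)
            return
        with open(target_path, 'w') as out:
            out.write(body)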
|
||||
|
||||
allow_empty will also be allowed as a directory control key. Again, it will
|
||||
expect a Boolean value and default to true. Given a nested structure
|
||||
"A/B/C/foo", where "foo" is an empty file with allow_empty=false:
|
||||
|
||||
* C has allow_empty=false: A/B/ is created, C is not.
|
||||
* B has allow_empty=false: A/B/C/ is created.
|
||||
* B and C have allow_empty=false: Only A/ is created.
|
||||
|
||||
It is expected that additional keys will be proposed soon after this spec is
|
||||
approved.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
A fenced header could be used rather than a separate control file. Although
|
||||
this aids visibility of the control data, it is less consistent with control
|
||||
files for directories and (should they be added later) symlinks.
|
||||
|
||||
The directory control file name has been the subject of some debate.
|
||||
Alternatives to control "foo/" include:
|
||||
|
||||
* foo/.oac (not visible with unmodified "ls")
|
||||
* foo/oac.control (longer)
|
||||
* foo/control (generic)
|
||||
* foo.oac (if foo/ is empty, can't be stored in git)
|
||||
* foo/foo.oac (masks control file for foo/foo)
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
None. The user is already in full control of the target environment. For
|
||||
example, they could use the allow_empty key to delete a critical file. However
|
||||
they could already simply provide a bash script to do the same. Further, the
|
||||
resulting image will be running on their (or their customer's) hardware, so it
|
||||
would be their own foot they'd be shooting.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
It will no longer be possible to create files named "oac" or with the suffix
|
||||
".oac" using oac. This will not affect any elements currently within
|
||||
diskimage-builder or tripleo-image-elements.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
alexisl (aka lxsli, Alexis Lee)
|
||||
|
||||
Other contributors:
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Support file control files in oac
|
||||
* Support the allow_empty file control key
|
||||
* Add dib-lint check for detached control files
|
||||
* Support directory control files in oac
|
||||
* Support the allow_empty directory control key
|
||||
* Update the oac README
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
This change is easily tested using standard unit test techniques.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The oac README must be updated.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
There has already been some significant discussion of this feature:
|
||||
https://blueprints.launchpad.net/tripleo/+spec/oac-header
|
||||
|
||||
There is a bug open for which an oac control mechanism would be useful:
|
||||
https://bugs.launchpad.net/os-apply-config/+bug/1258351
|
@ -1,258 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
================
|
||||
Promote HEAT_ENV
|
||||
================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-promote-heat-env
|
||||
|
||||
Promote values set in the Heat environment file to take precedence over
|
||||
input environment variables.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Historically TripleO scripts have consulted the environment for many items of
|
||||
configuration. This raises risks of scope leakage, and the number of environment
|
||||
variables required often forces users to manage their environment with scripts.
|
||||
Consequently, there's a push to prefer data files like the Heat environment
|
||||
file (HEAT_ENV) which may be set by passing -e to Heat. To allow this file to
|
||||
provide an unambiguous source of truth, the environment must not be allowed to
|
||||
override the values from this file. That is to say, precedence must be
|
||||
transferred.
|
||||
|
||||
A key distinction is whether the value of an environment variable is obtained
|
||||
from the environment passed to it by its parent process (either directly or
|
||||
through derivation). Those which are will be referred to as "input variables"
|
||||
and are deprecated by this spec. Those which are not will be called "local
|
||||
variables" and may be introduced freely. Variables containing values
|
||||
synthesised from multiple sources must be handled on a case-by-case basis.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Since changes I5b7c8a27a9348d850d1a6e4ab79304cf13697828 and
|
||||
I42a9d4b85edcc99d13f7525e964baf214cdb7cbf, ENV_JSON (the contents of the file
|
||||
named by HEAT_ENV) is constructed in devtest_undercloud.sh like so::
|
||||
|
||||
ENV_JSON=$(jq '.parameters = {
|
||||
"MysqlInnodbBufferPoolSize": 100
|
||||
} + .parameters + {
|
||||
"AdminPassword": "'"${UNDERCLOUD_ADMIN_PASSWORD}"'",
|
||||
"AdminToken": "'"${UNDERCLOUD_ADMIN_TOKEN}"'",
|
||||
"CeilometerPassword": "'"${UNDERCLOUD_CEILOMETER_PASSWORD}"'",
|
||||
"GlancePassword": "'"${UNDERCLOUD_GLANCE_PASSWORD}"'",
|
||||
"HeatPassword": "'"${UNDERCLOUD_HEAT_PASSWORD}"'",
|
||||
"NovaPassword": "'"${UNDERCLOUD_NOVA_PASSWORD}"'",
|
||||
"NeutronPassword": "'"${UNDERCLOUD_NEUTRON_PASSWORD}"'",
|
||||
"NeutronPublicInterface": "'"${NeutronPublicInterface}"'",
|
||||
"undercloudImage": "'"${UNDERCLOUD_ID}"'",
|
||||
"BaremetalArch": "'"${NODE_ARCH}"'",
|
||||
"PowerSSHPrivateKey": "'"${POWER_KEY}"'",
|
||||
"NtpServer": "'"${UNDERCLOUD_NTP_SERVER}"'"
|
||||
}' <<< $ENV_JSON)
|
||||
|
||||
This is broadly equivalent to "A + B + C", where values from B override those
|
||||
from A and values from C override those from either. Currently section C
|
||||
contains a mix of input variables and local variables. It is proposed that
|
||||
current and future environment variables are allocated such that:
|
||||
|
||||
* A only contains default values.
|
||||
* B is the contents of the HEAT_ENV file (from either the user or a prior run).
|
||||
* C only contains computed values (from local variables).
|
||||
|
||||
The following are currently in section C but are not local vars::
|
||||
|
||||
NeutronPublicInterface (default 'eth0')
|
||||
UNDERCLOUD_NTP_SERVER (default '')
|
||||
|
||||
The input variables will be ignored and the defaults moved into section A::
|
||||
|
||||
ENV_JSON=$(jq '.parameters = {
|
||||
"MysqlInnodbBufferPoolSize": 100,
|
||||
"NeutronPublicInterface": "eth0",
|
||||
"NtpServer": ""
|
||||
} + .parameters + {
|
||||
... elided ...
|
||||
}' <<< $ENV_JSON)
|
||||
|
||||
devtest_overcloud.sh will be dealt with similarly. These are the variables
|
||||
which need to be removed and their defaults added to section A::
|
||||
|
||||
OVERCLOUD_NAME (default '')
|
||||
OVERCLOUD_HYPERVISOR_PHYSICAL_BRIDGE (default '')
|
||||
OVERCLOUD_HYPERVISOR_PUBLIC_INTERFACE (default '')
|
||||
OVERCLOUD_BRIDGE_MAPPINGS (default '')
|
||||
OVERCLOUD_FLAT_NETWORKS (default '')
|
||||
NeutronPublicInterface (default 'eth0')
|
||||
OVERCLOUD_LIBVIRT_TYPE (default 'qemu')
|
||||
OVERCLOUD_NTP_SERVER (default '')
|
||||
|
||||
Only one out of all these input variables is used outside of these two scripts
|
||||
and consequently the rest are safe to remove.
|
||||
|
||||
The exception is OVERCLOUD_LIBVIRT_TYPE. This is saved by the script
|
||||
'write-tripleorc'. As it will now be preserved in HEAT_ENV, it does not need to
|
||||
also be preserved by write-tripleorc and can be removed from there.
|
||||
|
||||
----
|
||||
|
||||
So that users know they need to start setting these values through HEAT_ENV
|
||||
rather than input variables, it is further proposed that for an interim period
|
||||
each script echo a message to STDERR if deprecated input variables are set. For
|
||||
example::
|
||||
|
||||
for OLD_VAR in OVERCLOUD_NAME; do
|
||||
if [ ! -z "${!OLD_VAR}" ]; then
|
||||
echo "WARNING: ${OLD_VAR} is deprecated, please set this in the" \
|
||||
"HEAT_ENV file (${HEAT_ENV})" 1>&2
|
||||
fi
|
||||
done
|
||||
|
||||
----
|
||||
|
||||
To separate user input from generated values further, it is proposed that user
|
||||
values be read from a new file - USER_HEAT_ENV. This will default to
|
||||
{under,over}cloud-user-env.json. A new commandline parameter, --user-heat-env,
|
||||
will be added to both scripts so that this can be changed.
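
For illustration, usage might look like this once the option exists (the
script path and file name here are examples only)::

    # Keep hand-edited settings out of the generated HEAT_ENV file.
    ./devtest_overcloud.sh --user-heat-env ~/overcloud-user-env.json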
|
||||
|
||||
#. ENV_JSON is initialised with default values.
|
||||
#. ENV_JSON is overlaid by HEAT_ENV.
|
||||
#. ENV_JSON is overlaid by USER_HEAT_ENV.
|
||||
#. ENV_JSON is overlaid by computed values.
|
||||
#. ENV_JSON is saved to HEAT_ENV.
|
||||
|
||||
See http://paste.openstack.org/show/83551/ for an example of how to accomplish
|
||||
this. In short::
|
||||
|
||||
ENV_JSON=$(cat ${HEAT_ENV} ${USER_HEAT_ENV} | jq -s '
|
||||
.[0] + .[1] + {"parameters":
|
||||
({..defaults..} + .[0].parameters + {..computed..} + .[1].parameters)}')
|
||||
cat > "${HEAT_ENV}" <<< ${ENV_JSON}
|
||||
|
||||
Choosing to move user data into a new file, compared to moving the merged data,
|
||||
makes USER_HEAT_ENV optional. If users wish, they can continue providing their
|
||||
values in HEAT_ENV. The complementary solution requires users to clean
|
||||
precomputed values out of HEAT_ENV, or they risk unintentionally preventing the
|
||||
values from being recomputed.
|
||||
|
||||
Loading computed values after user values sacrifices user control in favour of
|
||||
correctness. Considering that any devtest user must be rather technical, if a
|
||||
computation is incorrect they can fix or at least hack the computation
|
||||
themselves.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Instead of removing the input variables entirely, an interim form could be
|
||||
used::
|
||||
|
||||
ENV_JSON=$(jq '.parameters = {
|
||||
"MysqlInnodbBufferPoolSize": 100,
|
||||
"NeutronPublicInterface": "'"${NeutronPublicInterface}"'",
|
||||
"NtpServer": "'"${UNDERCLOUD_NTP_SERVER}"'"
|
||||
} + .parameters + {
|
||||
...
|
||||
}
|
||||
|
||||
However, the input variables would only have an effect if the keys they affect
|
||||
are not present in HEAT_ENV. As HEAT_ENV is written each time devtest runs, the
|
||||
keys will usually be present unless the file is deleted each time (rendering it
|
||||
pointless). So this form is more likely to cause confusion than aid
|
||||
transition.
|
||||
|
||||
----
|
||||
|
||||
jq includes an 'alternative operator', ``//``, which is intended for providing
|
||||
defaults::
|
||||
|
||||
A filter of the form a // b produces the same results as a, if a produces
|
||||
results other than false and null. Otherwise, a // b produces the same
|
||||
results as b.
|
||||
|
||||
This has not been used in the proposal for two reasons:
|
||||
|
||||
#. It only works on individual keys, not whole maps.
|
||||
#. It doesn't work in jq 1.2, still included by Ubuntu 13.04 (Saucy).
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
None.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
An announcement will be made on the mailing list when this change merges. This
|
||||
coupled with the warnings given if the deprecated variables are set should
|
||||
provide sufficient notice.
|
||||
|
||||
As HEAT_ENV is rewritten every time devtest executes, we can safely assume it
|
||||
matches the last environment used. However users who use scripts to switch
|
||||
their environment may be surprised. Overall the change should be a benefit to
|
||||
these users, as they can use two separate HEAT_ENV files (passing --heat-env to
|
||||
specify which to activate) instead of needing to maintain scripts to set up
|
||||
their environment and risking settings leaking from one to the other.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
lxsli
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Add USER_HEAT_ENV to both scripts.
|
||||
* Move variables in both scripts.
|
||||
* Add deprecated variables warning to both scripts.
|
||||
* Remove OVERCLOUD_LIBVIRT_TYPE from write-tripleorc.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The change will be tested in isolation from the rest of the script.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
* Update usage docs with env var deprecation warnings.
|
||||
* Update usage docs to recommend HEAT_ENV.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
#. http://stedolan.github.io/jq/manual/ - JQ manual
|
||||
#. http://jqplay.herokuapp.com/ - JQ interactive demo
|
@ -1,169 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=======
|
||||
SSL PKI
|
||||
=======
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-ssl-pki
|
||||
|
||||
Each of our clouds requires multiple SSL certificates to operate. We need to
|
||||
support generating these certificates in devtest in a manner which will
|
||||
closely resemble the needs of an actual deployment. We also need to support
|
||||
interfacing with the PKI (Public Key Infrastructure) of existing organizations.
|
||||
This spec outlines the ways we will address these needs.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
We have a handful of services which require SSL certificates:
|
||||
|
||||
* Keystone
|
||||
* Public APIs
|
||||
* Galera replication
|
||||
* RabbitMQ replication
|
||||
|
||||
Developers need to have these certificates generated automatically for them,
|
||||
while organizations will likely want to make use of their existing PKI. We
|
||||
have not made clear at what level we will manage these certificates and/or
|
||||
their CA(s) and at what level the user will be responsible for them. This is
|
||||
further complicated by the Public API's likely having a different CA than the
|
||||
internal-only facing services.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Each of these services will accept their SSL certificate, key, and CA via
|
||||
environment JSON (heat templates for over/undercloud, config.json for seed).
|
||||
|
||||
At the most granular level, a user can specify these values by editing the
|
||||
over/undercloud-env.json or config.json files. If a certificate and key are
|
||||
specified for a service then we will not attempt to automatically generate one
|
||||
for that service. If only a certificate or key is specified it is considered
|
||||
an error.
|
||||
|
||||
If no certificate and key is specified for a service, we will attempt to
|
||||
generate a certificate and key, and sign the certificate with a self-signed
|
||||
CA we generate. Both the undercloud and seed will share a self-signed CA in
|
||||
this scenario, and each overcloud will have a separate self-signed CA. We will
|
||||
also add this self-signed CA to the chain of trust for hosts which use services
|
||||
of the cloud being created.
|
||||
|
||||
The use of a custom CA for signing the automatically generated certificates
|
||||
will be solved in a future iteration.
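
As a rough illustration of what the automatic path amounts to, the following
standard openssl commands create a self-signed CA and a signed service
certificate (this is a sketch, not the actual os-cloud-config implementation)::

    # Create a self-signed CA (shared by seed/undercloud, or one per overcloud).
    openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -keyout ca.key -out ca.crt -subj "/CN=tripleo-test-ca"

    # Create a key and signing request for a service, then sign it with the CA.
    openssl req -newkey rsa:2048 -nodes \
        -keyout service.key -out service.csr -subj "/CN=undercloud-public-api"
    openssl x509 -req -in service.csr -CA ca.crt -CAkey ca.key \
        -CAcreateserial -out service.crt -days 365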
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None presented thus far.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
This change has high security impact as it affects our PKI. We currently do not
|
||||
have any SSL support, and implementing this should therefore improve our
|
||||
security. We should ensure all key files we create in this change have file
|
||||
permissions of 0600 and that the directories they reside in have permissions
|
||||
of 0700.
|
||||
|
||||
There are many security implications for SSL key generation (including entropy
|
||||
availability) and we defer to the OpenStack Security Guide[1] for this.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Users can interact with this feature by editing the under/overcloud-env.json
|
||||
files and the seed config.json file. Additionally, the current properties which
|
||||
are used for specifying the keystone CA and certificate will be changed to
|
||||
support a more general naming scheme.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
We will be performing key generation which can require a reasonable amount of
|
||||
resources, including entropy sources.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
More SSL keys will be generated for developers. Debugging via monitoring
|
||||
network traffic can also be more difficult once SSL is adopted. Production
|
||||
environments will also require SSL unwrapping to debug network traffic, so this
|
||||
will allow us to closer emulate production (developers can now spot missing SSL
|
||||
wrapping).
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
The code behind generate-keystone-pki in os-cloud-config will be generalized
|
||||
to support creation of a CA and certificates separately, and support creation
|
||||
of multiple certificates using a single CA. A new script will be created
|
||||
named 'generate-ssl-cert' which accepts a heat environment JSON file and a
|
||||
service name. This will add ssl.certificate and ssl.certificate_key properties
|
||||
under the servicename property (an example is below). If no ssl.ca_certificate
|
||||
and ssl.ca_certificate_key properties are defined then this script will perform
|
||||
generation of the self-signed certificate.
|
||||
|
||||
Example heat environment output::
|
||||
|
||||
{
|
||||
"ssl": {
|
||||
"ca_certificate": "<PEM Data>",
|
||||
"ca_key": "<PEM Data>"
|
||||
},
|
||||
"horizon" {
|
||||
"ssl": {
|
||||
"ca_certificate": "<PEM Data>",
|
||||
"ca_certificate_key": "<PEM Data>"
|
||||
},
|
||||
...
|
||||
},
|
||||
...
|
||||
}
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
greghaynes
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Generalize CA/certificate creation in os-cloud-config.
|
||||
* Add detection logic for certificate key pairs in -env.json files to devtest
|
||||
* Make devtest scripts call CA/cert creation scripts if no cert is found
|
||||
for a service
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
The services listed above are not all set up to use SSL certificates yet. This
|
||||
is required before we can add detection logic for user specified certificates
|
||||
for all services.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Tests for new functionality will be made to os-cloud-config. The default
|
||||
behavior for devtest is designed to closely mimic a production setup, allowing
|
||||
us to best make use of our CI.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
We will need to document the new interfaces described in 'Other End User
|
||||
Impact'.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
1. OpenStack Security Guide: http://docs.openstack.org/security-guide/content/
|
@ -1,269 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=======================
|
||||
TripleO CI improvements
|
||||
=======================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-ci-improvements
|
||||
|
||||
TripleO CI is painful at the moment: we have problems with both the reliability
|
||||
and consistency of job run times. This spec is intended to address a
|
||||
number of the problems we have been facing.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Developers should be able to depend on CI to produce reliable test results with
|
||||
a minimum number of false negatives reported in a timely fashion; this
|
||||
currently isn't the case. To date the reliability of tripleo ci has been
|
||||
heavily affected by network glitches, availability of network resources and
|
||||
reliability of the CI clouds. This spec is intended to deal with the problems
|
||||
we have been seeing.
|
||||
|
||||
**Problem :** Reliability of hp1 (hp1_reliability_)
|
||||
Intermittent failures on jobs running on the hp1 cloud have been causing a
|
||||
large number of job failures and sometimes taking this region down
|
||||
altogether. Current thinking is that the root of most of these issues is
|
||||
problems with a mellanox driver.
|
||||
|
||||
**Problem :** Unreliable access to network resources (net_reliability_)
|
||||
Gaining reliable access to various network resources has been inconsistent,
|
||||
causing a CI outage when any one network resource is unavailable. Also
|
||||
inconsistent speeds downloading these resources can make it difficult to
|
||||
gauge overall speed improvements made to tripleo.
|
||||
|
||||
**Problem :** (system_health_) The health of the overall CI system isn't
|
||||
immediately obvious; problems often persist for hours (or occasionally days)
|
||||
before we react to them.
|
||||
|
||||
**Problem :** (ci_run_times_) The tripleo devtest story takes time to run;
|
||||
this uses up CI resources and developers' time. Where possible we should
|
||||
reduce the time required to run devtest.
|
||||
|
||||
**Problem :** (inefficient_usage_) Hardware on which to run tripleo is a finite
|
||||
resource. There is a spec in place to run devtest on an openstack
|
||||
deployment[1]; this is the best way forward in order to use the resources we
|
||||
have in the most efficient way possible. We also have a number of options to
|
||||
explore that would help minimise resource wastage.
|
||||
|
||||
**Problem :** (system_feedback_) Our CI provides no feedback about trends.
|
||||
A good CI system should be more than a system that reports pass or fail; we
|
||||
should be getting feedback on metrics, allowing us to observe degradations.
|
||||
Where possible we should make use of services already provided by infra.
|
||||
This will allow us to proactively intervene as CI begins to degrade.
|
||||
|
||||
**Problem :** (bug_frequency_) We currently have no indication of which CI
|
||||
bugs are occurring most often. This frustrates efforts to make CI more
|
||||
reliable.
|
||||
|
||||
**Problem :** (test_coverage_) Currently CI only tests a subset of what it
|
||||
should.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
There are a number of changes required in order to address the problems we have
|
||||
been seeing, each listed here (in order of priority).
|
||||
|
||||
.. _hp1_reliability:
|
||||
|
||||
**Solution :**
|
||||
|
||||
* Temporarily scale back on CI by removing one of the overcloud jobs (so rh1 has
|
||||
the capacity to run CI Solo).
|
||||
* Remove hp1 from the configuration.
|
||||
* Run burn-in tests on each hp1 host, removing (or repairing) failing hosts.
|
||||
Burn-in tests should consist of running CI on a newly deployed cloud matching
|
||||
the load expected to run on the region. Any failure rate should not exceed
|
||||
that of currently deployed regions.
|
||||
* Redeploy testing infrastructure on hp1 and test with tempest; this redeploy
|
||||
should be done with our tripleo scripts so it can be repeated and we
|
||||
are sure of parity between ci-overcloud deployments.
|
||||
* Place hp1 back into CI and monitor situation.
|
||||
* Add back any removed CI jobs.
|
||||
* Ensure burn-in / tempest tests are followed on future regions being deployed.
|
||||
* Attempts should be made to deal with problems that develop on already
|
||||
deployed clouds. If it becomes obvious they can't be quickly dealt with after
|
||||
48 hours, they should be temporarily removed from the CI infrastructure and will
|
||||
need to pass the burn-in tests before being added back into production.
|
||||
|
||||
.. _net_reliability:
|
||||
|
||||
**Solution :**
|
||||
|
||||
* Deploy a mirror of pypi.openstack.org on each Region.
|
||||
* Deploy a mirror of the Fedora and Ubuntu package repositories on each region.
|
||||
* Deploy squid in each region and cache http traffic through it, mirroring
|
||||
where possible should be considered our preference but having squid in place
|
||||
should cache any resources not mirrored.
|
||||
* Mirror other resources (e.g. github.com, percona tarballs etc..).
|
||||
* Any new requirements added to devtest should be cachable with caches in
|
||||
place before the requirement is added.
|
||||
|
||||
.. _system_health:
|
||||
|
||||
**Solution :**
|
||||
|
||||
* Monitor our CI clouds and testenvs with Icinga, monitoring should include
|
||||
ping, starting (and connecting to) new instances, disk usage etc....
|
||||
* Monitor CI test results and trigger an alert if "X" number of jobs of the
|
||||
same type fail in succession. An example of using logstash to monitor CI
|
||||
results can be found here[5].
|
||||
|
||||
Once consistency is no longer a problem we will investigate improvements
|
||||
we can make to the speed of CI jobs.
|
||||
|
||||
.. _ci_run_times:
|
||||
|
||||
**Solution :**
|
||||
|
||||
* Investigate if unsafe disk caching strategies will speed up disk image
|
||||
creation; if an improvement is found, implement it in production CI by one of
|
||||
|
||||
* run "unsafe" disk caching strategy on ci cloud VM's (would involve exposing
|
||||
this libvirt option via the nova api).
|
||||
* use "eatmydata" to noop disk sync system calls, not currently
|
||||
packaged for F20 but we could try and restart that process[2].
|
||||
|
||||
|
||||
.. _inefficient_usage:
|
||||
|
||||
**Solution :**
|
||||
|
||||
* Abandon on failure : adding a feature to zuul (or turning it on if it already
|
||||
exists) to abandon all jobs in a queue for a particular commit as soon as a
|
||||
voting commit fails. This would minimize usage of resources running long
|
||||
running jobs that we already know will have to be rechecked.
|
||||
|
||||
* Adding the collectl element to compute nodes and testenv hosts will allow us
|
||||
to find bottlenecks and also identify places where it is safe to overcommit
|
||||
(e.g. we may find that overcommitting CPU a lot on testenv hosts is viable).
|
||||
|
||||
.. _system_feedback:
|
||||
|
||||
**Solution :**
|
||||
|
||||
* Using a combination of logstash and graphite
|
||||
|
||||
* Output graphs of occurrences of false negative test results.
|
||||
* Output graphs of CI run times over time in order to identify trends.
|
||||
* Output graphs of CI job peak memory usage over time.
|
||||
* Output graphs of CI image sizes over time.
|
||||
|
||||
.. _bug_frequency:
|
||||
|
||||
**Solution :**
|
||||
|
||||
* In order to be able to track false negatives that are hurting us most we
|
||||
should agree not to use "recheck no bug", instead recheck with the
|
||||
relevant bug number. Adding signatures to Elastic recheck for known CI
|
||||
issues should help uptake of this.
|
||||
|
||||
.. _test_coverage:
|
||||
|
||||
**Solution :**
|
||||
|
||||
* Run tempest against the deployed overcloud.
|
||||
* Test our upgrade story by upgrading to new images. Initially, to avoid
|
||||
having to build new images we can edit something on the overcloud qcow images
|
||||
in place in order to get a set of images to upgrade to[3].
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
* As an alternative to deploying our own distro mirrors we could simply point
|
||||
directly at a mirror known to be reliable. This is undesirable as a long
|
||||
term solution as we still can't control outages.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
* No longer using recheck no bug places a burden on developers to
|
||||
investigate why a job failed.
|
||||
|
||||
* Adding coverage to our tests will increase the overall time to run a job.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Performance of CI should improve overall.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
derekh
|
||||
|
||||
Other contributors:
|
||||
looking for volunteers...
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* hp1 upgrade to trusty.
|
||||
* Potential pypi mirror.
|
||||
* Fedora Mirrors.
|
||||
* Ubuntu Mirrors.
|
||||
* Mirroring other non distro resources.
|
||||
* Per region caching proxy.
|
||||
* Document CI.
|
||||
* Running an unsafe disk caching strategy in the overcloud nodes.
|
||||
* ZUUL abandon on failure.
|
||||
* Include collectl on compute and testenv Hosts and analyse output.
|
||||
* Mechanism to monitor CI run times.
|
||||
* Mechanism to monitor nodepool connection failures to instances.
|
||||
* Remove ability to recheck no bug or at the very least discourage its use.
|
||||
* Monitoring cloud/testenv health.
|
||||
* Expand ci to include tempest.
|
||||
* Expand ci to include upgrades.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
CI failure rate and timings will be tracked to confirm improvements.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The tripleo-ci repository needs additional documentation in order to describe
|
||||
the current layout and should then be updated as changes are made.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* [1] spec to run devtest on openstack https://review.openstack.org/#/c/92642/
|
||||
* [2] eatmydata for Fedora https://bugzilla.redhat.com/show_bug.cgi?id=1007619
|
||||
* [3] CI upgrades https://review.openstack.org/#/c/87758/
|
||||
* [4] summit session https://etherpad.openstack.org/p/juno-summit-tripleo-ci
|
||||
* [5] http://jogo.github.io/gate/tripleo.html
|
@ -1,238 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=======================================================
|
||||
Configurable directory for persistent and stateful data
|
||||
=======================================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-configurable-mnt-state
|
||||
|
||||
Make the hardcoded /mnt/state path for stateful data be configurable.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
1. A hard coded directory of /mnt/state for persistent data is incompatible
|
||||
with Red Hat based distros' available mechanism for a stateful data path. Red
|
||||
Hat based distros, such as Fedora, RHEL, and CentOS, have a feature that uses
|
||||
bind mounts for mounting paths onto a stateful data partition and does not
|
||||
require manually reconfiguring software to use /mnt/state.
|
||||
|
||||
2. Distros that use SELinux have pre-existing policy that allows access to
|
||||
specific paths. Reconfiguring these paths to be under /mnt/state results
|
||||
in SELinux denials for existing services, requiring additional policy to be
|
||||
written and maintained.
|
||||
|
||||
3. Some operators and administrators find the reconfiguring of many services to
|
||||
not use well known default values for filesystem paths to be disruptive and
|
||||
inconsistent. They do not expect these changes when using a distro that they've
|
||||
come to learn and anticipate certain configurations. These types of changes
|
||||
also require documentation changes to existing documents and processes.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
Deployers will be able to choose a configurable path instead of the hardcoded
|
||||
value of /mnt/state for the stateful path.
|
||||
|
||||
A new element, stateful-path, will be added that defines the value for the
|
||||
stateful path. The default will be /mnt/state.
|
||||
|
||||
There are 3 areas that need to respect the configurable path:
|
||||
|
||||
os-apply-config template generation
|
||||
The stateful-path element will set the stateful path value by installing a
|
||||
JSON file to a well known location for os-collect-config to use as a local
|
||||
data source. This will require a new local data source collector to be added
|
||||
to os-collect-config (See `Dependencies`_).
|
||||
|
||||
The JSON file's contents will be based on $STATEFUL_PATH, e.g.:
|
||||
|
||||
{"stateful-path": "/mnt/state"}
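
A minimal sketch of how the element could install that file at image build
time follows; the target directory is hypothetical, as the real location will
be defined by the os-collect-config local datasource spec listed under
Dependencies::

    #!/bin/bash
    # Hypothetical install.d script in the stateful-path element.
    set -eu
    STATEFUL_PATH=${STATEFUL_PATH:-"/mnt/state"}
    datadir=/var/lib/os-collect-config/local-data   # hypothetical location
    mkdir -p "$datadir"
    echo "{\"stateful-path\": \"${STATEFUL_PATH}\"}" > "$datadir/stateful-path.json"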
|
||||
|
||||
File templates (files under os-apply-config in an element) will then be
|
||||
updated to replace the hard coded /mnt/state with {{stateful-path}}.
|
||||
|
||||
Currently, there is a mix of root locations of the os-apply-config templates.
|
||||
Most are written under /, although some are written under /mnt/state. The
|
||||
/mnt/state is hard coded in the directory tree under os-apply-config in these
|
||||
elements, so this will be removed to have the templates just written under /.
|
||||
Symlinks could instead be used in these elements to setup the correct paths.
|
||||
Support can also be added to os-apply-config's control file mechanism to
|
||||
indicate these files should be written under the stateful path. An example
|
||||
patch that does this is at: https://review.openstack.org/#/c/113651/
|
||||
|
||||
os-refresh-config scripts run at boot time
|
||||
In order to make the stateful path configurable, all of the hard coded
|
||||
references to /mnt/state in os-refresh-config scripts will be replaced with an
|
||||
environment variable, $STATEFUL_PATH.
|
||||
|
||||
The stateful-path element will provide an environment.d script for
|
||||
os-refresh-config that reads the value from os-apply-config:
|
||||
|
||||
export STATEFUL_PATH=$(os-apply-config --key stateful-path --type raw)
|
||||
|
||||
Hook scripts run at image build time
|
||||
The stateful-path element will provide an environment.d script for use at
|
||||
image build time:
|
||||
|
||||
export STATEFUL_PATH=${STATEFUL_PATH:-"/mnt/state"}
|
||||
|
||||
The use-ephemeral element will depend on the stateful-path element, effectively
|
||||
making the default stateful path remain /mnt/state.
|
||||
|
||||
The stateful path can be reconfigured by defining $STATEFUL_PATH either A) in
|
||||
the environment before an image build; or B) in an element with an
|
||||
environment.d script which runs earlier than the stateful-path environment.d
|
||||
script.
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
None come to mind; the point of this spec is to enable an alternative to what's
|
||||
already existing. There may be additional alternatives out there other folks
|
||||
may wish to add support for.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
None
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
End users using elements that change the stateful path location from /mnt/state
|
||||
to something else will see this change reflected in configuration files and in
|
||||
the directories used for persistent and stateful data. They will have to know
|
||||
how the stateful path is configured and accessed.
|
||||
|
||||
Different TripleO installs would appear different if used with elements that
|
||||
configured the stateful path differently.
|
||||
|
||||
This also adds some complexity when reading TripleO code, because instead of
|
||||
there being an explicit path, there would instead be a reference to a
|
||||
configurable value.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
There will be additional logic in os-refresh-config to determine and set the
|
||||
stateful path, and an additional local collector that os-collect-config would
|
||||
use. However, these are negligible in terms of negatively impacting
|
||||
performance.
|
||||
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
Deployers will be able to choose different elements that may reconfigure the
|
||||
stateful path or change the value for $STATEFUL_PATH. The default will remain
|
||||
unchanged however.
|
||||
|
||||
Deployers would have to know what the stateful path is, and if it's different
|
||||
across their environment, this could be confusing. However, this seems unlikely
|
||||
as deployers are likely to be standardizing on one set of common elements,
|
||||
distro, etc.
|
||||
|
||||
In the future, if TripleO CI and CD clouds that are based on Red Hat distros
|
||||
make use of this feature to enable Red Hat read only root support, then these
|
||||
clouds would be configured differently from clouds that are configured to use
|
||||
/mnt/state. As a team, the tripleo-cd-admins will have to know which
|
||||
configuration has been used.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
1. Developers need to use the $STATEFUL_PATH and {{stateful-path}}
|
||||
substitutions when they intend to refer to the stateful path.
|
||||
|
||||
2. Code that needs to know the stateful path will need access to the variable
|
||||
defining the path, it won't be able to assume the path is /mnt/state. A call to
|
||||
os-apply-config to query the key defining the path could be done to get
|
||||
the value, as long as os-collect-config has already run at least once.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
james-slagle
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
tripleo-incubator
|
||||
^^^^^^^^^^^^^^^^^
|
||||
* Update troubleshooting docs to mention that /mnt/state is a configurable
|
||||
path, and could be different in local environments.
|
||||
|
||||
tripleo-image-elements
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
* Add a new stateful-path element that configures stateful-path and $STATEFUL_PATH
|
||||
to /mnt/state
|
||||
* Update os-apply-config templates to replace /mnt/state with {{stateful-path}}
|
||||
* Update os-refresh-config scripts to replace /mnt/state with $STATEFUL_PATH
|
||||
* Update all elements that have os-apply-config template files under /mnt/state
|
||||
to just be under /.
|
||||
|
||||
* update os-apply-config element to call os-apply-config with a --root
|
||||
$STATEFUL_PATH option
|
||||
* update elements that have paths to os-apply-config generated files (such
|
||||
as /etc/nova/nova.conf) to refer to those paths as
|
||||
$STATEFUL_PATH/path/to/file.
|
||||
|
||||
* make use-ephemeral element depend on stateful-path element
|
||||
|
||||
Dependencies
|
||||
============
|
||||
1. os-collect-config will need a new feature to read from a local data source
|
||||
directory that elements can install JSON files into, such as a source.d. There
|
||||
will be a new spec filed on this feature.
|
||||
https://review.openstack.org/#/c/100965/
|
||||
|
||||
2. os-apply-config will need an option in its control file to support
|
||||
generating templates under the configurable stateful path. There is a patch
|
||||
here: https://review.openstack.org/#/c/113651/
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
There is currently no testing that all stateful and persistent data is actually
|
||||
written to a stateful partition.
|
||||
|
||||
We should add tempest tests that directly exercise the preserve_ephemeral
|
||||
option, and have tests that check that all stateful data has been preserved
|
||||
across a "nova rebuild". Tempest seems like a reasonable place to add these
|
||||
tests since preserve_ephemeral is a Nova OpenStack feature. Plus, once TripleO
|
||||
CI is running tempest against the deployed OverCloud, we will be testing this
|
||||
feature.
|
||||
|
||||
We should also test in TripleO CI that state is preserved across a rebuild by
|
||||
adding stateful data before a rebuild and verifying it is still present after a
|
||||
rebuild.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
We will document the new stateful-path element.
|
||||
|
||||
TripleO documentation will need to mention the potential difference in
|
||||
configuration files and the location of persistent data if a value other than
|
||||
/mnt/state is used.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
os-collect-config local datasource collector spec:
|
||||
|
||||
* https://review.openstack.org/100965
|
||||
|
||||
Red Hat style stateful partition support this will enable:
|
||||
|
||||
* https://git.fedorahosted.org/cgit/initscripts.git/tree/systemd/fedora-readonly
|
||||
* https://git.fedorahosted.org/cgit/initscripts.git/tree/sysconfig/readonly-root
|
||||
* https://git.fedorahosted.org/cgit/initscripts.git/tree/statetab
|
||||
* https://git.fedorahosted.org/cgit/initscripts.git/tree/rwtab
|
@ -1,258 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
====================================
|
||||
TripleO Deploy Cloud Hypervisor Type
|
||||
====================================
|
||||
|
||||
# TODO: file the actual blueprint...
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-deploy-cloud-hypervisor-type
|
||||
|
||||
The goal of this spec is to detail how the TripleO deploy cloud type could be
|
||||
varied from just baremetal to baremetal plus other hypervisors to deploy
|
||||
Overcloud services.
|
||||
|
||||
Linux kernel containers make this approach attractive due to the lightweight
|
||||
manner in which services and processes can be virtualized and isolated, so
|
||||
libvirt+lxc and Docker seem likely targets. However, we should
|
||||
aim to make this approach as agnostic as possible for those deployers who may
|
||||
wish to use any Nova driver, such as libvirt+kvm.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The overcloud control plane is generally lightly loaded and allocation of
|
||||
entire baremetal machines to it is wasteful. Also, when the Overcloud services
|
||||
are running entirely on baremetal they take longer to upgrade and rollback.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
We should support any Nova virtualization type as a target for Overcloud
|
||||
services, as opposed to using baremetal nodes to deploy overcloud images.
|
||||
Containers are particularly attractive because they are lightweight, easy to
|
||||
upgrade/rollback and offer isolation and security similar to full VMs. For the
|
||||
purpose of this spec, the alternate Nova virtualization target for the
|
||||
Overcloud will be referred to as alt-hypervisor. alt-hypervisor could be
|
||||
substituted with libvirt+lxc, Docker, libvirt+kvm, etc.
|
||||
|
||||
At a minimum, we should support running each Overcloud service in isolation in
|
||||
its own alt-hypervisor instance in order to be as flexible as possible to deployer
|
||||
needs. We should also support combining services.
|
||||
|
||||
In order to make other alt-hypervisors available as deployment targets for the
|
||||
Overcloud, we need additional Nova Compute nodes/services configured to use
|
||||
alt-hypervisors registered with the undercloud Nova.
|
||||
|
||||
Additionally, the undercloud must still be running a Nova compute with the
|
||||
ironic driver in order to allow for scaling itself out to add additional
|
||||
undercloud compute nodes.
|
||||
|
||||
To accomplish this, we can run 2 Nova compute processes on each undercloud
|
||||
node. One configured with Nova+Ironic and one configured with
|
||||
Nova+alt-hypervisor. For the straight baremetal deployment, where an alternate
|
||||
hypervisor is not desired, the additional Nova compute process would not be
|
||||
included. This would be accomplished via the standard inclusion/exclusion of
|
||||
elements during a diskimage-builder tripleo image build.
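
A rough sketch of what this means on an undercloud node is shown below; the
per-driver config file paths follow the layout proposed in the work items,
and the invocation itself is illustrative rather than how the elements will
actually start the services::

    # One compute process per driver, each with its own config (and state_path).
    nova-compute --config-file /etc/nova/nova.conf \
        --config-file /etc/nova/compute/nova-ironic.conf &
    nova-compute --config-file /etc/nova/nova.conf \
        --config-file /etc/nova/compute/nova-docker.conf &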
|
||||
|
||||
It will also be possible to build and deploy just an alt-hypervisor compute
|
||||
node that is registered with the Undercloud as an additional compute node.
|
||||
|
||||
To minimize the changes needed to the elements, we will aim to run a full init
|
||||
stack in each alt-hypervisor instance, such as systemd. This will allow all the
|
||||
services that we need to also be running in the instance (cloud-init,
|
||||
os-collect-config, etc). It will also make troubleshooting similar to the
|
||||
baremetal process in that you'd be able to ssh to individual instances, read
|
||||
logs, restart services, turn on debug mode, etc.
|
||||
|
||||
To handle Neutron network configuration for the Overcloud, the Overcloud
|
||||
neutron L2 agent will have to be on a provider network that is shared between
|
||||
the hypervisors. VLAN provider networks will have to be modeled in Neutron and
|
||||
connected to alt-hypervisor instances.
|
||||
|
||||
Overcloud compute nodes themselves would be deployed to baremetal nodes. These
|
||||
images would be made up of:
|
||||
* libvirt+kvm (assuming this is the hypervisor choice for the Overcloud)
|
||||
* nova-compute + libvirt+kvm driver (registered to overcloud control).
|
||||
* neutron-l2-agent (registered to overcloud control)
|
||||
An image with those contents is deployed to a baremetal node via nova+ironic
|
||||
from the undercloud.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Deployment from the seed
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
An alternative to having the undercloud deploy additional alt-hypervisor
|
||||
compute nodes would be to register additional baremetal nodes with the seed vm,
|
||||
and then describe an undercloud stack in a template that is the undercloud
|
||||
controller and its set of alt-hypervisor compute nodes. When the undercloud
|
||||
is deployed via the seed, all of the nodes are set up initially.
|
||||
|
||||
The drawback with that approach is that the seed is meant to be short-lived in
|
||||
the long term. So, it then becomes difficult to scale out the undercloud if
|
||||
needed. We could offer a hybrid of the 2 models: launch all nodes initially
|
||||
from the seed, but still have the functionality in the undercloud to deploy
|
||||
more alt-hypervisor compute nodes if needed.
|
||||
|
||||
The init process
|
||||
^^^^^^^^^^^^^^^^
|
||||
If running systemd in a container turns out to be problematic, it should be
|
||||
possible to run a single process in the container that starts just the
|
||||
OpenStack service that we care about. However that process would also need to
|
||||
do things like read Heat metadata. It's possible this process could be
|
||||
os-collect-config. This change would require more changes to the elements
|
||||
themselves however since they are so dependent on an init process currently in
|
||||
how they enable/restart services etc. It may be possible to replace os-svc-*
|
||||
with other tools that don't use systemd or upstart when you're building images
|
||||
for containers.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
* We should aim for equivalent security when deploying to alt-hypervisor
|
||||
instances as we do when deploying to baremetal. To the best of our ability, it
|
||||
should not be possible to compromise the instance if an individual service is
|
||||
compromised.
|
||||
|
||||
* Since Overcloud services and Undercloud services would be co-located on the
|
||||
same baremetal machine, compromising the hypervisor and gaining access to the
|
||||
host is a risk to both the Undercloud and Overcloud. We should mitigate this
|
||||
risk to the best of our ability via things like SELinux, and removing all
|
||||
unnecessary software/processes from the alt-hypervisor instances.
|
||||
|
||||
* Certain hypervisors are inherently more secure than others. libvirt+kvm uses
|
||||
virtualization and is much more secure than container-based hypervisors such as
|
||||
libvirt+lxc and Docker which use namespacing.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
None. The impact of this change is limited to Deployers. End users should have
|
||||
no visibility into the actual infrastructure of the Overcloud.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
Ideally, deploying an overcloud to containers should result in a faster
|
||||
deployment than deploying to baremetal. Upgrading and downgrading the Overcloud
|
||||
should also be faster.
|
||||
|
||||
More images will have to be built via diskimage-builder however, which will
|
||||
take more time.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
The main impact to deployers will be the ability to use alt-hypervisors
|
||||
instances, such as containers if they wish. They also must understand how to
|
||||
use nova-baremetal/ironic on the undercloud to scale out the undercloud and add
|
||||
additional alt-hypervisor compute nodes if needed.
|
||||
|
||||
Additional space in the configured glance backend would also likely be needed
|
||||
to store additional images.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
* Developers working on TripleO will have the option of deploying to
|
||||
alt-hypervisor instances. This should make testing and developing on some
|
||||
aspects of TripleO easier due to the need for fewer VMs.
|
||||
|
||||
* More images will have to be built due to the greater potential variety with
|
||||
alt-hypervisor instances housing Overcloud services.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
james-slagle
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
tripleo-incubator
|
||||
^^^^^^^^^^^^^^^^^
|
||||
* document how to use an alternate hypervisor for the overcloud deployment
|
||||
** eventually, this could possibly be the default
|
||||
* document how to troubleshoot this type of deployment
|
||||
* need a user option or json property to describe if the devtest
|
||||
environment being set up should use an alternate hypervisor for the overcloud
|
||||
deployment or not. Consider using HEAT_ENV where appropriate.
|
||||
* load-image should be updated to add an additional optional argument that sets
|
||||
the hypervisor_type property on the loaded images in glance. The argument is
|
||||
optional and wouldn't need to be specified for some images, such as regular
|
||||
dib images that can run under KVM.
|
||||
* Document commands to setup-neutron for modeling provider VLAN networks.
|
||||
|
||||
tripleo-image-elements
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
* add new element for nova docker driver
|
||||
* add new element for docker registry (currently required by nova docker
|
||||
driver)
|
||||
* more hypervisor specific configuration files for the different nova compute
|
||||
driver elements
|
||||
** /etc/nova/compute/nova-kvm.conf
|
||||
** /etc/nova/compute/nova-baremetal.conf
|
||||
** /etc/nova/compute/nova-ironic.conf
|
||||
** /etc/nova/compute/nova-docker.conf
|
||||
* Separate configuration options per compute process for:
|
||||
** host (undercloud-kvm, undercloud-baremetal, etc).
|
||||
** state_path (/var/lib/nova-kvm, /var/lib/nova-baremetal, etc).
|
||||
* Maintain backwards compatibility in the elements by consulting both old and
|
||||
new heat metadata key namespaces.
|
||||
|
||||
tripleo-heat-templates
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
* Split out heat metadata into separate namespaces for each compute process
|
||||
configuration.
|
||||
* For the vlan case, update templates for any network modeling for
|
||||
alt-hypervisor instances so that those instances have correct interfaces
|
||||
attached to the vlan network.
|
||||
|
||||
diskimage-builder
|
||||
^^^^^^^^^^^^^^^^^
|
||||
* add ability where needed to build new image types for alt-hypervisor
|
||||
** Docker
|
||||
** libvirt+lxc
|
||||
* Document how to build images for the new types
|
||||
|
||||
Dependencies
|
||||
============
|
||||
For Docker support, this effort depends on continued development on the nova
|
||||
Docker driver. We would need to drive any missing features or bug fixes that
|
||||
were needed in that project.
|
||||
|
||||
For other drivers that may not be as well supported as libvirt+kvm, we will
|
||||
also have to drive missing features there as well if we want to support them,
|
||||
such as libvirt+lxc, openvz, etc.
|
||||
|
||||
This effort also depends on the provider resource templates spec (unwritten)
|
||||
that will be done for the template backend for Tuskar. That work should be done
|
||||
in such a way that the provider resource templates are reusable for this effort
|
||||
as well in that you will be able to create templates to match the images that
|
||||
you intend to create for your Overcloud deployment.
|
||||
|
||||
Testing
|
||||
=======
|
||||
We would need a separate set of CI jobs that were configured to deploy an
|
||||
Overcloud to each alternate hypervisor that TripleO intended to support well.
|
||||
|
||||
For Docker support specifically, CI jobs could be considered non-voting since
|
||||
they'd rely on a stackforge project which isn't officially part of OpenStack.
|
||||
We could potentially make this job voting if TripleO CI was enabled on the
|
||||
stackforge/nova-docker repo so that changes there are less likely to break
|
||||
TripleO deployments.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
We should update the TripleO specific docs in tripleo-incubator to document how
|
||||
to use an alternate hypervisor for an Overcloud deployment.
|
||||
|
||||
References
|
||||
==========
|
||||
Juno Design Summit etherpad: https://etherpad.openstack.org/p/juno-summit-tripleo-and-docker
|
||||
nova-docker driver: https://git.openstack.org/cgit/stackforge/nova-docker
|
||||
Docker: https://www.docker.io/
|
||||
Docker github: https://github.com/dotcloud/docker
|
@ -1,176 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
======================
|
||||
Dracut Deploy Ramdisks
|
||||
======================
|
||||
|
||||
Include the URL of your launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-dracut-ramdisks
|
||||
|
||||
Our current deploy ramdisks include functionality that is duplicated from
|
||||
existing tools such as Dracut, and do not include some features that those
|
||||
tools do. Reimplementing our deploy ramdisks to use Dracut would shrink
|
||||
our maintenance burden for that code and allow us to take advantage of those
|
||||
additional features.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently our deploy ramdisks are implemented as a bash script that runs
|
||||
as init during the deploy process. This means that we are responsible for
|
||||
correctly configuring things such as udev and networking which would normally
|
||||
be handled by distribution tools. While this isn't an immediate problem
|
||||
because the implementation has already been done, it is an unnecessary
|
||||
duplication and additional maintenance debt for the future as we need to add
|
||||
or change such low-level functionality.
|
||||
|
||||
In addition, because our ramdisk is a one-off, users will not be able to make
|
||||
use of any ramdisk troubleshooting methods that they might currently know.
|
||||
This is an unnecessary burden when there are tools to build ramdisks that are
|
||||
standardized and well-understood by the people using our software.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
The issues discussed above can be dealt with by using a standard tool such as
|
||||
Dracut to build our deploy ramdisks. This will actually result in a reduction
|
||||
in code that we have to maintain and should be compatible with all of our
|
||||
current ramdisks because we can continue to use the same method of building
|
||||
the init script - it will just run as a user script instead of as PID 1,
|
||||
allowing Dracut to do low-level configuration for us.
|
||||
|
||||
Initially this will be implemented alongside the existing ramdisk element to
|
||||
provide a fallback option if there are any use cases not covered by the
|
||||
initial version of the Dracut ramdisk.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
For consistency with the rest of Red Hat/Fedora's ramdisks I would prefer to
|
||||
implement this using Dracut, but if there is a desire to also make use of
|
||||
another method of building ramdisks, that could probably be implemented
|
||||
alongside Dracut. The current purely script-based implementation could even
|
||||
be kept in parallel with a Dracut version. However, I believe Dracut is
|
||||
available on all of our supported platforms, so I don't see an immediate need
|
||||
for alternatives.
|
||||
|
||||
Additionally, there is the option to replace our dynamically built init
|
||||
script with Dracut modules for each deploy element. This is probably
|
||||
unnecessary as it is perfectly fine to use the current method with Dracut,
|
||||
and using modules would tightly couple our deploy ramdisks to Dracut, making
|
||||
it difficult to use any alternatives in the future.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The same security considerations that apply to the current deploy ramdisk
|
||||
would continue to apply to Dracut-built ones.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
This change would enable end users to make use of any Dracut knowledge they
|
||||
might already have, including the ability to dynamically enable tracing
|
||||
of the commands used to do the deployment (essentially set -x in bash).
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Because Dracut supports more hardware and software configurations, it is
|
||||
possible there will be some additional overhead during the boot process.
|
||||
However, I would expect this to be negligible in comparison to the time it
|
||||
takes to copy the image to the target system, so I see it as a reasonable
|
||||
tradeoff.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
As noted before, Dracut supports a wide range of hardware configurations,
|
||||
so deployment methods that currently wouldn't work with our script-based
|
||||
ramdisk would become available. For example, Dracut supports using network
|
||||
disks as the root partition, so running a diskless node with separate
|
||||
storage should be possible.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
There would be some small changes to how developers would add a new dependency
|
||||
to the ramdisk images. Instead of executables and their required libraries
|
||||
being copied to the ramdisk manually, the executable can simply be added to
|
||||
the list of things Dracut will include in the ramdisk.
|
||||
|
||||
Developers would also gain the dynamic tracing ability mentioned above in
|
||||
the end user impact.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
bnemec
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Convert the ramdisk element to use Dracut (see WIP change in References).
|
||||
|
||||
* Verify that DHCP booting of ramdisks still works.
|
||||
|
||||
* Verify that nova-baremetal ramdisks can be built successfully with Dracut.
|
||||
|
||||
* Verify that Ironic ramdisks can be built successfully with Dracut.
|
||||
|
||||
* Verify that Dracut can build Ironic-IPA ramdisks.
|
||||
|
||||
* Verify the Dracut debug shell provides equivalent functionality to the
|
||||
existing one.
|
||||
|
||||
* Provide ability for other elements to install additional files to the
|
||||
ramdisk.
|
||||
|
||||
* Provide ability for other elements to include additional drivers.
|
||||
|
||||
* Find a way to address potential 32-bit binaries being downloaded and run in
|
||||
the ramdisk for firmware deployments.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
This would add a dependency on Dracut for building ramdisks.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Since building deploy ramdisks is already part of CI, this should be covered
|
||||
automatically. If it is implemented in parallel with another method, then
|
||||
the CI jobs would need to be configured to exercise the different methods
|
||||
available.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
We would want to document the additional features available in Dracut.
|
||||
Otherwise this should function in essentially the same way as the current
|
||||
ramdisks, so any existing documentation will still be valid.
|
||||
|
||||
Some minor developer documentation changes may be needed to address the
|
||||
different ways Dracut handles adding extra kernel modules and files.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* Dracut: https://dracut.wiki.kernel.org/index.php/Main_Page
|
||||
|
||||
* PoC of building ramdisks with Dracut:
|
||||
https://review.openstack.org/#/c/105275/
|
||||
|
||||
* openstack-dev discussion:
|
||||
http://lists.openstack.org/pipermail/openstack-dev/2014-July/039356.html
|
@ -1,168 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
===================================
|
||||
os-collect-config local data source
|
||||
===================================
|
||||
|
||||
|
||||
https://blueprints.launchpad.net/tripleo-juno-occ-local-datasource
|
||||
|
||||
os-collect-config needs a local data source collector for configuration data.
|
||||
This will allow individual elements to drop files into a well-known location to
|
||||
set the initial configuration data of an instance.
|
||||
|
||||
There is already a heat_local collector, but that uses a single hard coded path
|
||||
of /var/lib/heat-cfntools/cfn-init-data.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
* Individual elements cannot currently influence the configuration available
|
||||
to os-apply-config for an instance without overwriting each other.
|
||||
* Elements that rely on configuration values that must be set the same at both
|
||||
image build time and instance run time currently have no way of propagating
the value used at build time through to run time.
|
||||
* Elements have no way to specify default values for configuration they may
|
||||
need at runtime (outside of configuration file templates).
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
A new collector class will be added to os-collect-config that collects
|
||||
configuration data from JSON files in a configurable list of directories with a
|
||||
well known default of /var/lib/os-collect-config/local-data.
|
||||
|
||||
The collector will return a list of pairs of JSON files and their content,
|
||||
sorted by the JSON filename in traditional C collation. For example, if
|
||||
/var/lib/os-collect-config/local-data contains bar.json and foo.json, the
collector would return::

  [ ('bar.json', bar_content),
    ('foo.json', foo_content) ]
|
||||
|
||||
This new collector will be configured first in DEFAULT_COLLECTORS in
|
||||
os-collect-config. This means all later configured collectors will override any
|
||||
shared configuration keys from the local datasource collector.
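
A minimal sketch of such a collector is shown below. It assumes the same
``Collector``/``collect()`` convention used by the existing os-collect-config
collectors; names and error handling are illustrative only.

.. code-block:: python

    import json
    import os

    DEFAULT_DIR = '/var/lib/os-collect-config/local-data'  # proposed default


    class Collector(object):

        def __init__(self, data_dirs=None):
            self.data_dirs = data_dirs or [DEFAULT_DIR]

        def collect(self):
            # Return (filename, content) pairs sorted by filename so that
            # the ordering (and therefore key overriding) is predictable.
            result = []
            for data_dir in self.data_dirs:
                if not os.path.isdir(data_dir):
                    continue
                for name in sorted(os.listdir(data_dir)):
                    if not name.endswith('.json'):
                        continue
                    with open(os.path.join(data_dir, name)) as f:
                        result.append((name, json.load(f)))
            return result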
|
||||
|
||||
Elements making use of this feature can install a json file into the
|
||||
/var/lib/os-collect-config/local-data directory. The os-collect-config element
|
||||
will be responsible for creating the /var/lib/os-collect-config/local-data
|
||||
directory at build time and will create it with 0755 permissions.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
OS_CONFIG_FILES
|
||||
^^^^^^^^^^^^^^^
|
||||
There is already a mechanism in os-apply-config to specify arbitrary files to
|
||||
look at for configuration data via setting the OS_CONFIG_FILES environment
|
||||
variable. However, this is not ideal because each call to os-apply-config would
|
||||
have to be prefaced with setting OS_CONFIG_FILES, or it would need to be set
|
||||
globally in the environment (via an environment.d script for instance). As an
|
||||
element developer, this is not clear. Having a robust and clearly documented
|
||||
location to drop in configuration data will be simpler.
|
||||
|
||||
heat_local collector
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
There is already a collector that reads from local data, but it must be
|
||||
configured to read explicit file paths. This does not scale well if several
|
||||
elements each want to provide local configuration data, since you'd have to
|
||||
reconfigure os-collect-config itself. We could modify the heat_local collector
|
||||
to read from directories instead, while maintaining backwards compatibility as
|
||||
well, instead of writing a whole new collector. However, given that collectors
|
||||
are pretty simple implementations, I'm proposing just writing a new one, so
|
||||
that they remain generally single purpose with clear goals.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
* Harmful elements could drop bad configuration data into the well-known
location. This is mitigated somewhat by the fact that, as a deployer, you
should know and validate what elements you're using that may inject local
configuration.
|
||||
|
||||
* We should verify that the local data source files are not world writable and
|
||||
are in a directory that is root owned. Checks to dib-lint could be added to
|
||||
verify this at image build time. Checks could be added to os-collect-config
|
||||
for instance run time (a minimal check is sketched below).
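
The sketch below assumes the proposed default directory; the real dib-lint and
os-collect-config checks may differ in detail.

.. code-block:: python

    import os
    import stat


    def validate_local_data(path='/var/lib/os-collect-config/local-data'):
        # The directory must be owned by root and no file in it may be
        # world writable.
        if os.stat(path).st_uid != 0:
            raise RuntimeError('%s must be owned by root' % path)
        for name in os.listdir(path):
            mode = os.stat(os.path.join(path, name)).st_mode
            if mode & stat.S_IWOTH:
                raise RuntimeError('%s is world writable' % name)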
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
An additional collector will be running as part of os-collect-config, but its
|
||||
execution time should be minimal.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
* There will be an additional configuration option in os-collect-config to
|
||||
configure the list of directories to look at for configuration data. This
|
||||
will have a reasonable default and will not usually need to be changed.
|
||||
* Deployers will have to consider what local data source configuration may be
|
||||
influencing their current applied configuration.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
We will need to make clear in documentation when to use this feature versus
|
||||
what to expose in a template or specify via passthrough configuration.
|
||||
Configuration values that are needed at image build time and must also be
available at instance run time are good candidates for using this feature.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
james-slagle
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* write new collector for os-collect-config
|
||||
* unit tests for new collector
|
||||
* document new collector
|
||||
* add checks to dib-lint to verify JSON files installed to the local data
|
||||
source directory are not world writable
|
||||
* add checks to os-collect-config to verify JSON files read by the local data
|
||||
collector are not world writable and that their directory is root owned.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* The configurable /mnt/state spec at:
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-configurable-mnt-state
|
||||
depends on this spec.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Unit tests will be written for the new collector. The new collector will also
|
||||
eventually be tested in CI because there will be an existing element that will
|
||||
configure the persistent data directory to /mnt/state that will make use of
|
||||
this implementation.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The ability of elements to drop configuration data into a well known location
|
||||
should be documented in tripleo-image-elements itself so folks can be made
|
||||
better aware of the functionality.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-configurable-mnt-state
|
||||
* https://review.openstack.org/#/c/94876
|
@ -1,611 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==================================
|
||||
Tuskar Plan REST API Specification
|
||||
==================================
|
||||
|
||||
Blueprint:
|
||||
https://blueprints.launchpad.net/tuskar/+spec/tripleo-juno-tuskar-plan-rest-api
|
||||
|
||||
In Juno, the Tuskar API is moving towards a model of being a large scale
|
||||
application planning service. Its initial usage will be to deploy OpenStack
|
||||
on OpenStack by leveraging TripleO Heat Templates and fitting into the
|
||||
greater TripleO workflow.
|
||||
|
||||
As compared to Icehouse, Tuskar will no longer make calls to Heat for creating
|
||||
and updating a stack. Instead, it will serve to define and manipulate the Heat
|
||||
templates for describing a cloud. Tuskar will be the source for the cloud
|
||||
planning while Heat is the source for the state of the live cloud.
|
||||
|
||||
Tuskar employs the following concepts:
|
||||
|
||||
* *Deployment Plan* - The description of an application (for example,
|
||||
the overcloud) being planned by Tuskar. The deployment plan keeps track of
|
||||
what roles will be present in the deployment and their configuration values.
|
||||
In TripleO terms, each overcloud will have its own deployment plan that
|
||||
describes what services will run and the configuration of those services
|
||||
for that particular overcloud. For brevity, this is simply referred to as
|
||||
the "plan" elsewhere in this spec.
|
||||
* *Role* - A unit of functionality that can be added to a plan. A role
|
||||
is the definition of what will run on a single server in the deployed Heat
|
||||
stack. For example, an "all-in-one" role may contain all of the services
|
||||
necessary to run an overcloud, while a "compute" role may provide only the
|
||||
nova-compute service.
|
||||
|
||||
Put another way, Tuskar is responsible for assembling
|
||||
the user-selected roles and their configuration into a Heat environment and
|
||||
making the built Heat templates and files available to the caller (the
|
||||
Tuskar UI in TripleO but, more generally, any consumer of the REST API) to send
|
||||
to Heat.
|
||||
|
||||
Tuskar will ship with the TripleO Heat Templates installed to serve as its
|
||||
roles (dependent on the conversions taking place this release [4]_).
|
||||
For now it is assumed those templates are installed as part of TripleO's
|
||||
installation of Tuskar. A different spec will cover the API calls necessary
|
||||
for users to upload and manipulate their own custom roles.
|
||||
|
||||
This specification describes the REST API clients will interact with in
|
||||
Tuskar, including the URLs, HTTP methods, request, and response data, for the
|
||||
following workflow:
|
||||
|
||||
* Create an empty plan in Tuskar.
|
||||
* View the list of available roles.
|
||||
* Add roles to the plan.
|
||||
* Request, from Tuskar, the description of all of the configuration values
|
||||
necessary for the entire plan.
|
||||
* Save user-entered configuration values with the plan in Tuskar.
|
||||
* Request, from Tuskar, the Heat templates for the plan, which includes
|
||||
all of the files necessary to deploy the configured application in Heat.
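
As an illustration only, the workflow above could be driven by a REST client
along the following lines. The paths match the calls described in this spec;
the endpoint address, token handling, and payload values are assumptions and
not part of the specification.

.. code-block:: python

    import json

    import requests

    TUSKAR = 'http://tuskar-host:8585/v2'  # illustrative endpoint
    HEADERS = {'X-Auth-Token': 'TOKEN', 'Content-Type': 'application/json'}

    # Create an empty plan.
    plan = requests.post(TUSKAR + '/plans/', headers=HEADERS,
                         data=json.dumps({'name': 'dev-cloud',
                                          'description': 'Development testing cloud'})).json()

    # Discover the available roles and add one to the plan.
    roles = requests.get(TUSKAR + '/roles/', headers=HEADERS).json()
    requests.post('%s/plans/%s/roles/' % (TUSKAR, plan['uuid']),
                  headers=HEADERS, data=json.dumps({'uuid': roles[0]['uuid']}))

    # Set configuration values, then fetch the generated Heat templates.
    requests.patch('%s/plans/%s/' % (TUSKAR, plan['uuid']), headers=HEADERS,
                   data=json.dumps([{'name': 'database_host',
                                     'value': '10.11.12.13'}]))
    templates = requests.get('%s/plans/%s/templates/' % (TUSKAR, plan['uuid']),
                             headers=HEADERS).text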
|
||||
|
||||
The list roles call is essential to this workflow and is therefore described
|
||||
in this specification. Otherwise, this specification does not cover the API
|
||||
calls around creating, updating, or deleting roles. It is assumed that the
|
||||
installation process for Tuskar in TripleO will take the necessary steps to
|
||||
install the TripleO Heat Templates into Tuskar. A specification will be filed
|
||||
in the future to cover the role-related API calls.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The REST API in Tuskar seeks to fulfill the following needs:
|
||||
|
||||
* Flexible selection of an overcloud's functionality and deployment strategy.
|
||||
* Repository for discovering what roles can be added to a cloud.
|
||||
* Help the user to avoid having to manually manipulate Heat templates to
|
||||
create the desired cloud setup.
|
||||
* Storage of a cloud's configuration without making the changes immediately
|
||||
live (future needs in this area may include offering a more structured
|
||||
review and promotion lifecycle for changes).
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
**Overall Concepts**
|
||||
|
||||
* These API calls will be added under the ``/v2/`` path; however, the v1 API
|
||||
will not be maintained (the model is being changed to not contact Heat and
|
||||
the existing database is being removed [3]_).
|
||||
* All calls have the potential to raise a 500 if something goes horribly wrong
|
||||
in the server, but for brevity this is omitted from the list of possible
|
||||
response codes in each call.
|
||||
* All calls have the potential to raise a 401 in the event of a failed user
|
||||
authentication and have been similarly omitted from each call's
|
||||
documentation.
|
||||
|
||||
----
|
||||
|
||||
.. _retrieve-single-plan:
|
||||
|
||||
**Retrieve a Single Plan**
|
||||
|
||||
URL: ``/plans/<plan-uuid>/``
|
||||
|
||||
Method: ``GET``
|
||||
|
||||
Description: Returns the details of a specific plan, including its
|
||||
list of assigned roles and configuration information.
|
||||
|
||||
Notes:
|
||||
|
||||
* The configuration values are read from Tuskar's stored files rather than
|
||||
Heat itself. Heat is the source for the live stack, while Tuskar is the
|
||||
source for the plan.
|
||||
|
||||
Request Data: None
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 200 - if the plan is found
|
||||
* 404 - if there is no plan with the given UUID
|
||||
|
||||
Response Data:
|
||||
|
||||
JSON document containing the following:
|
||||
|
||||
* Tuskar UUID for the given plan.
|
||||
* Name of the plan that was created.
|
||||
* Description of the plan that was created.
|
||||
* The timestamp of the last time a change was made.
|
||||
* List of the roles (identified by name and version) assigned to the plan.
|
||||
For this sprint, there will be no pre-fetching of any more role information
|
||||
beyond name and version, but this can be added in the future while maintaining
|
||||
backward compatibility.
|
||||
* List of parameters that can be configured for the plan, including the
|
||||
parameter name, label, description, hidden flag, and current value if
|
||||
set.
|
||||
|
||||
Response Example:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"uuid" : "dd4ef003-c855-40ba-b5a6-3fe4176a069e",
|
||||
"name" : "dev-cloud",
|
||||
"description" : "Development testing cloud",
|
||||
"last_modified" : "2014-05-28T21:11:09Z",
|
||||
"roles" : [
|
||||
{
|
||||
"uuid" : "55713e6a-79f5-42e1-aa32-f871b3a0cb64",
|
||||
"name" : "compute",
|
||||
"version" : "1",
|
||||
"links" : {
|
||||
"href" : "http://server/v2/roles/55713e6a-79f5-42e1-aa32-f871b3a0cb64/",
|
||||
"rel" : "bookmark"
|
||||
}
|
||||
},
|
||||
{
|
||||
"uuid" : "2ca53130-b9a4-4fa5-86b8-0177e8507803",
|
||||
"name" : "controller",
|
||||
"version" : "1",
|
||||
"links" : {
|
||||
"href" : "http://server/v2/roles/2ca53130-b9a4-4fa5-86b8-0177e8507803/",
|
||||
"rel" : "bookmark"
|
||||
}
|
||||
}
|
||||
],
|
||||
"parameters" : [
|
||||
{"name" : "database_host",
|
||||
"label" : "Database Host",
|
||||
"description" : "Hostname of the database server",
|
||||
"hidden" : "false",
|
||||
"value" : "10.11.12.13"
|
||||
}
|
||||
],
|
||||
"links" : [
|
||||
{
|
||||
"href" : "http://server/v2/plans/dd4ef003-c855-40ba-b5a6-3fe4176a069e/",
|
||||
"rel" : "self"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
----
|
||||
|
||||
.. _retrieve-plan-template:
|
||||
|
||||
**Retrieve a Plan's Template Files**
|
||||
|
||||
URL: ``/plans/<plan-uuid>/templates/``
|
||||
|
||||
Method: ``GET``
|
||||
|
||||
Description: Returns the set of files to send to Heat to create or update
|
||||
the planned application.
|
||||
|
||||
Notes:
|
||||
|
||||
* The Tuskar service will build up the entire environment into a single
|
||||
file suitable for sending to Heat. The contents of this file are returned
|
||||
from this call.
|
||||
|
||||
Request Data: None
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 200 - if the plan's templates are found
|
||||
* 404 - if no plan exists with the given ID
|
||||
|
||||
Response Data: <Heat template>
|
||||
|
||||
----
|
||||
|
||||
.. _list-plans:
|
||||
|
||||
**List Plans**
|
||||
|
||||
URL: ``/plans/``
|
||||
|
||||
Method: ``GET``
|
||||
|
||||
Description: Returns a list of all plans stored in Tuskar. In the future when
|
||||
multi-tenancy is added, this will be scoped to a particular tenant.
|
||||
|
||||
Notes:
|
||||
|
||||
* The detailed information about a plan, including its roles and configuration
|
||||
values, is not returned in this call. A follow-up call is needed on the
|
||||
specific plan. It may be necessary in the future to add a flag to pre-fetch
|
||||
this information during this call.
|
||||
|
||||
Request Data: None (future enhancement will require the tenant ID and
|
||||
potentially support a pre-fetch flag for more detailed data)
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 200 - if the list can be retrieved, even if the list is empty
|
||||
|
||||
Response Data:
|
||||
|
||||
JSON document containing a list of limited information about each plan.
|
||||
An empty list is returned when no plans are present.
|
||||
|
||||
Response Example:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
[
|
||||
{
|
||||
"uuid" : "3e61b4b2-259b-4b91-8344-49d7d6d292b6",
|
||||
"name" : "dev-cloud",
|
||||
"description" : "Development testing cloud",
|
||||
"links" : {
|
||||
"href" : "http://server/v2/plans/3e61b4b2-259b-4b91-8344-49d7d6d292b6/",
|
||||
"rel" : "bookmark"
|
||||
}
|
||||
},
|
||||
{
|
||||
"uuid" : "135c7391-6c64-4f66-8fba-aa634a86a941",
|
||||
"name" : "qe-cloud",
|
||||
"description" : "QE testing cloud",
|
||||
"links" : {
|
||||
"href" : "http://server/v2/plans/135c7391-6c64-4f66-8fba-aa634a86a941/",
|
||||
"rel" : "bookmark"
|
||||
}
|
||||
}
|
||||
]
|
||||
|
||||
|
||||
----
|
||||
|
||||
.. _create-new-plan:
|
||||
|
||||
**Create a New Plan**
|
||||
|
||||
URL: ``/plans/``
|
||||
|
||||
Method: ``POST``
|
||||
|
||||
Description: Creates an entry in Tuskar's storage for the plan. The details
|
||||
are outside of the scope of this spec, but the idea is that all of the
|
||||
necessary Heat environment infrastructure files and directories will be
|
||||
created and stored in Tuskar's storage solution [3]_.
|
||||
|
||||
Notes:
|
||||
|
||||
* Unlike in Icehouse, Tuskar will not make any calls into Heat during this
|
||||
call. This call is to create a new (empty) plan in Tuskar that
|
||||
can be manipulated, configured, saved, and retrieved in a format suitable
|
||||
for sending to Heat.
|
||||
* This is a synchronous call that completes when Tuskar has created the
|
||||
necessary files for the newly created plan.
|
||||
* As of this time, this call does not support a larger batch operation that
|
||||
will add roles or set configuration values in a single call. From a REST
|
||||
perspective, this is acceptable, but from a usability standpoint we may want
|
||||
to add this support in the future.
|
||||
|
||||
Request Data:
|
||||
|
||||
JSON document containing the following:
|
||||
|
||||
* Name - Name of the plan being created. Must be unique across all plans
|
||||
in the same tenant.
|
||||
* Description - Description of the plan to create.
|
||||
|
||||
Request Example:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"name" : "dev-cloud",
|
||||
"description" : "Development testing cloud"
|
||||
}
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 201 - if the create is successful
|
||||
* 409 - if there is an existing plan with the given name (for a particular
|
||||
tenant when multi-tenancy is taken into account)
|
||||
|
||||
Response Data:
|
||||
|
||||
JSON document describing the created plan.
|
||||
The details are the same as for the GET operation on an individual plan
|
||||
(see :ref:`Retrieve a Single Plan <retrieve-single-plan>`).
|
||||
|
||||
|
||||
----
|
||||
|
||||
.. _delete-plan:
|
||||
|
||||
**Delete an Existing Plan**
|
||||
|
||||
URL: ``/plans/<plan-uuid>/``
|
||||
|
||||
Method: ``DELETE``
|
||||
|
||||
Description: Deletes the plan's Heat templates and configuration values from
|
||||
Tuskar's storage.
|
||||
|
||||
Request Data: None
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 200 - if deleting the plan entries from Tuskar's storage was successful
|
||||
* 404 - if there is no plan with the given UUID
|
||||
|
||||
Response Data: None
|
||||
|
||||
|
||||
----
|
||||
|
||||
.. _add-plan-role:
|
||||
|
||||
**Adding a Role to a Plan**
|
||||
|
||||
URL: ``/plans/<plan-uuid>/roles/``
|
||||
|
||||
Method: ``POST``
|
||||
|
||||
Description: Adds the specified role to the given plan.
|
||||
|
||||
Notes:
|
||||
|
||||
* This will cause the parameter consolidation to occur and entries to be added
|
||||
the plan's configuration parameters for the new role.
|
||||
* This call will update the ``last_modified`` timestamp to indicate a change
|
||||
has been made that will require an update to Heat to be made live.
|
||||
|
||||
Request Data:
|
||||
|
||||
JSON document containing the uuid of the role to add.
|
||||
|
||||
Request Example:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"uuid" : "role_uuid"
|
||||
}
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 201 - if the addition is successful
|
||||
* 404 - if there is no plan with the given UUID
|
||||
* 409 - if the plan already has the specified role
|
||||
|
||||
Response Data:
|
||||
|
||||
The same document describing the plan as from
|
||||
:ref:`Retrieve a Single Plan <retrieve-single-plan>`. The newly added
|
||||
configuration parameters will be present in the result.
|
||||
|
||||
|
||||
----
|
||||
|
||||
.. _remove-cloud-plan:
|
||||
|
||||
**Removing a Role from a Plan**
|
||||
|
||||
URL: ``/plans/<plan-uuid>/roles/<role-uuid>/``
|
||||
|
||||
Method: ``DELETE``
|
||||
|
||||
Description: Removes a role identified by role_uuid from the given plan.
|
||||
|
||||
Notes:
|
||||
|
||||
* This will cause the parameter consolidation to occur and entries to be
|
||||
removed from the plan's configuration parameters.
|
||||
* This call will update the ``last_modified`` timestamp to indicate a change
|
||||
has been made that will require an update to Heat to be made live.
|
||||
|
||||
Request Data: None
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 200 - if the removal is successful
|
||||
* 404 - if there is no plan with the given UUID or it does not have the
|
||||
specified role and version combination
|
||||
|
||||
Response Data:
|
||||
|
||||
The same document describing the plan as from
|
||||
:ref:`Retrieve a Single Plan <retrieve-single-plan>`. The configuration
|
||||
parameters will be updated to reflect the removed role.
|
||||
|
||||
|
||||
----
|
||||
|
||||
.. _changing-plan-configuration:
|
||||
|
||||
**Changing a Plan's Configuration Values**
|
||||
|
||||
URL: ``/plans/<plan-uuid>/``
|
||||
|
||||
Method: ``PATCH``
|
||||
|
||||
Description: Sets the values for one or more configuration parameters.
|
||||
|
||||
Notes:
|
||||
|
||||
* This call will update the ``last_modified`` timestamp to indicate a change
|
||||
has been made that will require an update to Heat to be made live.
|
||||
|
||||
Request Data: JSON document containing the parameter keys and values to set
|
||||
for the plan.
|
||||
|
||||
Request Example:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
[
|
||||
{
|
||||
"name" : "database_host",
|
||||
"value" : "10.11.12.13"
|
||||
},
|
||||
{
|
||||
"name" : "database_password",
|
||||
"value" : "secret"
|
||||
}
|
||||
]
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 200 - if the update was successful
|
||||
* 400 - if one or more of the new values fails validation
|
||||
* 404 - if there is no plan with the given UUID
|
||||
|
||||
Response Data:
|
||||
|
||||
The same document describing the plan as from
|
||||
:ref:`Retrieve a Single Plan <retrieve-single-plan>`.
|
||||
|
||||
|
||||
----
|
||||
|
||||
.. _list-roles:
|
||||
|
||||
**Retrieving Possible Roles**
|
||||
|
||||
URL: ``/roles/``
|
||||
|
||||
Method: ``GET``
|
||||
|
||||
Description: Returns a list of all roles available in Tuskar.
|
||||
|
||||
Notes:
|
||||
|
||||
* There will be a separate entry for each version of a particular role.
|
||||
|
||||
Request Data: None
|
||||
|
||||
Response Codes:
|
||||
|
||||
* 200 - containing the available roles
|
||||
|
||||
Response Data: A list of roles, where each role contains:
|
||||
|
||||
* Name
|
||||
* Version
|
||||
* Description
|
||||
|
||||
Response Example:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
[
|
||||
{
|
||||
"uuid" : "3d46e510-6a63-4ed1-abd0-9306a451f8b4",
|
||||
"name" : "compute",
|
||||
"version" : "1",
|
||||
"description" : "Nova Compute"
|
||||
},
|
||||
{
|
||||
"uuid" : "71d6c754-c89c-4293-9d7b-c4dcc57229f0",
|
||||
"name" : "compute",
|
||||
"version" : "2",
|
||||
"description" : "Nova Compute"
|
||||
},
|
||||
{
|
||||
"uuid" : "651c26f6-63e2-4e76-9b60-614b51249677",
|
||||
"name" : "controller",
|
||||
"version" : "1",
|
||||
"description" : "Controller Services"
|
||||
}
|
||||
]
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
There are currently no alternate schemas proposed for the REST APIs.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
These changes should have no additional security impact.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The potential performance issues revolve around Tuskar's solution for storing
|
||||
the cloud files [3]_.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
After being merged, there will be a period where the Tuskar CLI is out of date
|
||||
with the new calls. The Tuskar UI will also need to be updated for the changes
|
||||
in flow.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
jdob
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implement plan CRUD APIs
|
||||
* Implement role retrieval API
|
||||
* Write REST API documentation
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
These API changes are dependent on the rest of the Tuskar backend being
|
||||
implemented, including the changes to storage and the template consolidation.
|
||||
|
||||
Additionally, the assembly of roles (provider resources) into a Heat
|
||||
environment is contingent on the conversion of the TripleO Heat templates [4]_.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Tempest testing should be added as part of the API creation.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The REST API documentation will need to be updated accordingly.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [3] https://review.openstack.org/#/c/97553/
|
||||
.. [4] https://review.openstack.org/#/c/97939/
|
@ -1,552 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
============================================
|
||||
TripleO Template and Deployment Plan Storage
|
||||
============================================
|
||||
|
||||
This design specification describes a storage solution for a deployment plan.
|
||||
Deployment plans consist of a set of roles, which in turn define a master Heat
|
||||
template that can be used by Heat to create a stack representing the deployment
|
||||
plan; and an environment file that defines the parameters needed by the master
|
||||
template.
|
||||
|
||||
This specification is principally intended to be used by Tuskar.
|
||||
|
||||
https://blueprints.launchpad.net/tuskar/+spec/tripleo-juno-tuskar-template-storage
|
||||
|
||||
.. _tripleo_juno_tuskar_template_storage_problem:
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
.. note:: The terminology used in this specification is defined in the `Tuskar
|
||||
REST API`_ specification.
|
||||
|
||||
.. _Tuskar REST API: https://blueprints.launchpad.net/tuskar/+spec/tripleo-juno-tuskar-plan-rest-api
|
||||
|
||||
In order to accomplish the goal of this specification, we need to first define
|
||||
storage domain models for roles, deployment plans, and associated concepts.
|
||||
These associated concepts include Heat templates and environment files. The
|
||||
models must account for requirements such as versioning and the appropriate
|
||||
relationships between objects.
|
||||
|
||||
We also need to create a storage mechanism for these models. The storage
|
||||
mechanism should be distinct from the domain model, allowing the latter to be
|
||||
stable while the former retains enough flexibility to use a variety of backends
|
||||
as need and availability dictates. Storage requirements for particular models
|
||||
include items such as versioning and secure storage.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
**Change Summary**
|
||||
|
||||
The following proposed change is split into three sections:
|
||||
|
||||
- Storage Domain Models: Defines the domain models for templates, environment
|
||||
files, roles, and deployment plans.
|
||||
- Storage API Interface: Defines Python APIs that relate the models to
|
||||
the underlying storage drivers; is responsible for translating stored content
|
||||
into a model object and vice versa. Each model requires its own storage
|
||||
interface.
|
||||
- Storage Drivers: Defines the API that storage backends need to implement in
|
||||
order to be usable by the Python API Interface. Plans for initial and future
|
||||
driver support are discussed here.
|
||||
|
||||
It should be noted that each storage interface will be specified by the user as
|
||||
part of the Tuskar setup. Thus, the domain model can assume that the appropriate
|
||||
storage interfaces - a template store, an environment store, etc - are defined
|
||||
globally and accessible for use.
|
||||
|
||||
|
||||
**Storage Domain Models**
|
||||
|
||||
The storage API requires the following domain models:
|
||||
|
||||
- Template
|
||||
- Environment File
|
||||
- Role
|
||||
- Deployment Plan
|
||||
|
||||
The first two map directly to Heat concepts; the latter two are Tuskar concepts.
|
||||
|
||||
Note that each model will also contain a save method. The save method will call
|
||||
create on the store if the uuid isn't set, and will call update on the store
|
||||
if the instance has a uuid.
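
A minimal sketch of that save behaviour, shown on the DeploymentPlan model
defined below and assuming the globally configured DeploymentPlanStore is
available as ``plan_store``:

.. code-block:: python

    class DeploymentPlan:

        def save(self):
            # First save: create a record and adopt the generated uuid;
            # afterwards, update the existing record in place.
            if self.uuid is None:
                created = plan_store.create(self.name, self.description)
                self.uuid = created.uuid
            else:
                plan_store.update(self)
            return self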
|
||||
|
||||
|
||||
**Template Model**
|
||||
|
||||
The template model represents a Heat template.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class Template:
|
||||
uuid = UUID string
|
||||
name = string
|
||||
version = integer
|
||||
description = string
|
||||
content = string
|
||||
created_at = datetime
|
||||
|
||||
# This is derived from the content from within the template store.
|
||||
parameters = dict of parameter names with their types and defaults
|
||||
|
||||
|
||||
**Environment File Model**
|
||||
|
||||
The environment file defines the parameters and resource registry for a Heat
|
||||
stack.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class EnvironmentFile:
|
||||
uuid = UUID string
|
||||
content = string
|
||||
created_at = datetime
|
||||
updated_at = datetime
|
||||
|
||||
# These are derived from the content from within the environment file store.
|
||||
resource_registry = list of provider resource template names
|
||||
parameters = dict of parameter names and their values
|
||||
|
||||
def add_provider_resource(self, template):
|
||||
# Adds the specified template object to the environment file as a
|
||||
# provider resource. This updates the parameters and resource registry
|
||||
# in the content. The provider resource type will be derived from the
|
||||
# template file name.
|
||||
|
||||
def remove_provider_resource(self, template):
|
||||
# Removes the provider resource that matches the template from the
|
||||
# environment file. This updates the parameters and resource registry
|
||||
# in the content.
|
||||
|
||||
def set_parameters(self, params_dict):
|
||||
# The key/value pairs in params_dict correspond to parameter names/
|
||||
# desired values. This method updates the parameters section in the
|
||||
# content to the values specified in params_dict.
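
As an illustration, ``set_parameters`` could be implemented roughly as below,
assuming the content is a standard YAML Heat environment; this is a sketch,
not the final implementation.

.. code-block:: python

    import yaml

    def set_parameters(self, params_dict):
        # Merge the given values into the 'parameters' section of the
        # environment file content and re-serialize it.
        env = yaml.safe_load(self.content) or {}
        env.setdefault('parameters', {}).update(params_dict)
        self.content = yaml.safe_dump(env, default_flow_style=False)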
|
||||
|
||||
|
||||
**Role Model**
|
||||
|
||||
A role is a scalable unit of a cloud. A deployment plan specifies one or more
|
||||
roles. Each role must specify a primary role template. It must also specify
|
||||
the dependencies of that template.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class Role:
|
||||
uuid = UUID string
|
||||
name = string
|
||||
version = integer
|
||||
description = string
|
||||
role_template_uuid = Template UUID string
|
||||
dependent_template_uuids = list of Template UUID strings
|
||||
created_at = datetime
|
||||
|
||||
def retrieve_role_template(self):
|
||||
# Retrieves the Template with uuid matching role_template_uuid
|
||||
|
||||
def retrieve_dependent_templates(self):
|
||||
# Retrieves the list of Templates with uuids matching
|
||||
# dependent_template_uuids
|
||||
|
||||
|
||||
**Deployment Plan Model**
|
||||
|
||||
The deployment plan defines the application to be deployed. It does so by
|
||||
specifying a list of roles. Those roles are used to construct an environment
|
||||
file that contains the parameters that are needed by the roles' templates and
|
||||
the resource registry that registers each role's primary template as a provider
|
||||
resource. A master template is also constructed so that the plan can be
|
||||
deployed as a single Heat stack.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class DeploymentPlan:
|
||||
uuid = UUID string
|
||||
name = string
|
||||
description = string
|
||||
role_uuids = list of Role UUID strings
|
||||
master_template_uuid = Template UUID string
|
||||
environment_file_uuid = EnvironmentFile UUID string
|
||||
created_at = datetime
|
||||
updated_at = datetime
|
||||
|
||||
def retrieve_roles(self):
|
||||
# Retrieves the list of Roles with uuids matching role_uuids
|
||||
|
||||
def retrieve_master_template(self):
|
||||
# Retrieves the Template with uuid matching master_template_uuid
|
||||
|
||||
def retrieve_environment_file(self):
|
||||
# Retrieves the EnvironmentFile with uuid matching environment_file_uuid
|
||||
|
||||
def add_role(self, role):
|
||||
# Adds a Role to the plan. This operation will modify the master
|
||||
# template and environment file through template munging operations
|
||||
# specified in a separate spec.
|
||||
|
||||
def remove_role(self, role):
|
||||
# Removes a Role from the plan. This operation will modify the master
|
||||
# template and environment file through template munging operations
|
||||
# specified in a separate spec.
|
||||
|
||||
def get_dependent_templates(self):
|
||||
# Returns a list of dependent templates. This consists of the
|
||||
# associated role templates.
|
||||
|
||||
|
||||
**Storage API Interface**
|
||||
|
||||
Each of the models defined above has their own Python storage interface. These
|
||||
are manager classes that query and perform CRUD operations against the storage
|
||||
drivers and return instances of the models for use (with the exception of delete
|
||||
which returns ``None``). The storage interfaces bind the models to the driver
|
||||
being used; this allows us to store each model in a different location.
|
||||
|
||||
Note that each store also contains a serialize method and a deserialize method.
|
||||
The serialize method takes the relevant object and returns a dictionary
|
||||
containing all value attributes; the deserialize method does the reverse.
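
For example, the TemplateStore pair could look roughly like this; attribute
names follow the Template model above, everything else is illustrative.

.. code-block:: python

    class TemplateStore:

        def serialize(self, template):
            # Only value attributes are stored; derived fields such as
            # ``parameters`` are recomputed when the template is retrieved.
            return {'uuid': template.uuid,
                    'name': template.name,
                    'version': template.version,
                    'description': template.description,
                    'content': template.content,
                    'created_at': template.created_at}

        def deserialize(self, object_dict):
            template = Template()
            for key, value in object_dict.items():
                setattr(template, key, value)
            return template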
|
||||
|
||||
The drivers are discussed in
|
||||
:ref:`the next section<tripleo_juno_tuskar_template_storage_drivers>`.
|
||||
|
||||
|
||||
**Template API**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class TemplateStore:
|
||||
|
||||
def create(self, name, content, description=None):
|
||||
# Creates a Template. If no template exists with a matching name,
|
||||
# the template version is set to 0; otherwise it is set to the
|
||||
# greatest existing version plus one.
|
||||
|
||||
def retrieve(self, uuid):
|
||||
# Retrieves the Template with the specified uuid. Queries a Heat
|
||||
# template parser for template parameters and dependent template names.
|
||||
|
||||
def retrieve_by_name(self, name, version=None):
|
||||
# Retrieves the Template with the specified name and version. If no
|
||||
# version is specified, retrieves the latest version of the Template.
|
||||
|
||||
def delete(self, uuid):
|
||||
# Deletes the Template with the specified uuid.
|
||||
|
||||
def list(self, only_latest=False):
|
||||
# Returns a list of all Templates. If only_latest is True, filters
|
||||
# the list to the latest version of each Template name.
|
||||
|
||||
|
||||
**Environment File API**
|
||||
|
||||
The environment file requires secure storage to protect parameter values.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class EnvironmentFileStore:
|
||||
|
||||
def create(self):
|
||||
# Creates an empty EnvironmentFile.
|
||||
|
||||
def retrieve(self, uuid):
|
||||
# Retrieves the EnvironmentFile with the specified uuid.
|
||||
|
||||
def update(self, model):
|
||||
# Updates an EnvironmentFile.
|
||||
|
||||
def delete(self, uuid):
|
||||
# Deletes the EnvironmentFile with the specified uuid.
|
||||
|
||||
def list(self):
|
||||
# Returns a list of all EnvironmentFiles.
|
||||
|
||||
|
||||
**Role API**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class RoleStore:
|
||||
|
||||
def create(self, name, role_template, description=None,
           version=None, template_uuid=None):
|
||||
# Creates a Role. If no role exists with a matching name, the
|
||||
# template version is set to 0; otherwise it is set to the greatest
|
||||
# existing version plus one.
|
||||
#
|
||||
# Dependent templates are derived from the role_template. The
|
||||
# create method will take all dependent template names from
|
||||
# role_template, retrieve the latest version of each from the
|
||||
# TemplateStore, and use those as the dependent template list.
|
||||
#
|
||||
# If a dependent template is missing from the TemplateStore, then
|
||||
# an exception is raised.
|
||||
|
||||
def retrieve(self, uuid):
|
||||
# Retrieves the Role with the specified uuid.
|
||||
|
||||
def retrieve_by_name(self, name, version=None):
|
||||
# Retrieves the Role with the specified name and version. If no
|
||||
# version is specified, retrieves the latest version of the Role.
|
||||
|
||||
def update(self, model):
|
||||
# Updates a Role.
|
||||
|
||||
def delete(self, uuid):
|
||||
# Deletes the Role with the specified uuid.
|
||||
|
||||
def list(self, only_latest=False):
|
||||
# Returns a list of all Roles. If only_latest is True, filters
|
||||
# the list to the latest version of each Role.
|
||||
|
||||
|
||||
**Deployment Plan API**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class DeploymentPlanStore:
|
||||
|
||||
def create(self, name, description=None):
|
||||
# Creates a DeploymentPlan. Also creates an associated empty master
|
||||
# Template and EnvironmentFile; these will be modified as Roles are
# added to or removed from the plan.
|
||||
|
||||
def retrieve(self, uuid):
|
||||
# Retrieves the DeploymentPlan with the specified uuid.
|
||||
|
||||
def update(self, model):
|
||||
# Updates a DeploymentPlan.
|
||||
|
||||
def delete(self, uuid):
|
||||
# Deletes the DeploymentPlan with the specified uuid.
|
||||
|
||||
def list(self):
|
||||
# Retrieves a list of all DeploymentPlans.
|
||||
|
||||
|
||||
.. _tripleo_juno_tuskar_template_storage_drivers:
|
||||
|
||||
**Storage Drivers**
|
||||
|
||||
Storage drivers operate by storing object dictionaries. For storage solutions
|
||||
such as Glance these dictionaries are stored as flat files. For a storage
|
||||
solution such as a database, the dictionary is translated into a table row. It
|
||||
is the responsibility of the driver to understand how it is storing the object
|
||||
dictionaries.
|
||||
|
||||
Each storage driver must provide the following methods.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class Driver:
|
||||
|
||||
def create(self, filename, object_dict):
|
||||
# Stores the specified content under filename and returns the resulting
|
||||
# uuid.
|
||||
|
||||
def retrieve(self, uuid):
|
||||
# Returns the object_dict matching the uuid.
|
||||
|
||||
def update(self, uuid, object_dict):
|
||||
# Updates the object_dict specified by the uuid.
|
||||
|
||||
def delete(self, uuid):
|
||||
# Deletes the content specified by the uuid.
|
||||
|
||||
def list(self):
|
||||
# Return a list of all content.
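
To illustrate the contract, a trivial in-memory driver could look like the
following; real drivers (database, Heat, Glance) would persist the
dictionaries instead of keeping them in memory.

.. code-block:: python

    import uuid


    class InMemoryDriver:
        """Toy driver used only to illustrate the interface above."""

        def __init__(self):
            self._objects = {}

        def create(self, filename, object_dict):
            object_uuid = str(uuid.uuid4())
            self._objects[object_uuid] = (filename, dict(object_dict))
            return object_uuid

        def retrieve(self, uuid):
            return self._objects[uuid][1]

        def update(self, uuid, object_dict):
            filename = self._objects[uuid][0]
            self._objects[uuid] = (filename, dict(object_dict))

        def delete(self, uuid):
            del self._objects[uuid]

        def list(self):
            return [content for (_name, content) in self._objects.values()]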
|
||||
|
||||
|
||||
For Juno, we will aim to use a combination of a relational database and Heat.
|
||||
Heat will be used for the secure storage of sensitive environment parameters.
|
||||
Database tables will be used for everything else. The usage of Heat for secure
|
||||
stores relies on `PATCH support`_ being added to the Heat API. This bug is
|
||||
targeted for completion by Juno-2.
|
||||
|
||||
.. _PATCH support: https://bugs.launchpad.net/heat/+bug/1224828
|
||||
|
||||
This is merely a short-term solution, as it is understood that there is some
|
||||
reluctance to introduce an unneeded database dependency. In the long term we
|
||||
would like to replace the database with Glance once it is updated from an image
|
||||
store to a more general artifact repository. However, this feature is currently
|
||||
in development and cannot be relied on for use in the Juno cycle. The
|
||||
architecture described in this specification should allow reasonable ease in
|
||||
switching from one to the other.
|
||||
|
||||
|
||||
.. _tripleo_juno_tuskar_template_storage_alternatives:
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
**Modeling Relationships within Heat Templates**
|
||||
|
||||
The specification proposes modeling relationships such as a plan's associated
|
||||
roles or a role's dependent templates as direct attributes of the object.
|
||||
However, this information would appear to be available as part of a plan's
|
||||
environment file or by traversing the role template's dependency graph. Why
|
||||
not simply derive the relationships in that way?
|
||||
|
||||
A role is a Tuskar abstraction. Within Heat, it corresponds to a template used
|
||||
as a provider resource; however, a role has added requirements, such as the
|
||||
versioning of itself and its dependent templates, or the ability to list out
|
||||
available roles for selection within a plan. These are not requirements that
|
||||
Heat intends to fulfill, and fulfilling them entirely within Heat feels like an
|
||||
abuse of mechanics.
|
||||
|
||||
From a practical point of view, modeling relationships within Heat templates
|
||||
requires the in-place modification of Heat templates by Tuskar to deal with
|
||||
versioning. For example, if version 1 of the compute role specifies
|
||||
{{compute.yaml: 1}, {compute-config.yaml: 1}}, and version 2 of the role
|
||||
specifies {{compute.yaml: 1}, {compute-config.yaml: 2}}, the only way to
|
||||
allow both versions of the role to be used is to allow programmatic
|
||||
modification of compute.yaml to point at the correct version of
|
||||
compute-config.yaml.
|
||||
|
||||
|
||||
**Swift as a Storage Backend**
|
||||
|
||||
Swift was considered as an option to replace the relational database but was
|
||||
ultimately discounted for two key reasons:
|
||||
|
||||
- The versioning system in Swift doesn't provide a static reference to the
|
||||
current version of an object. Rather, it has the version "latest", which is
dynamic and changes when a new version is added; therefore, there is no way to
|
||||
stick a deployment to a version.
|
||||
- We need to create a relationship between the provider resources within a Role,
and Swift doesn't support relationships between stored objects.
|
||||
|
||||
Having said that, after seeking guidance from the Swift team, it has been
|
||||
suggested that a naming convention or work with different containers may
|
||||
provide us with enough control to mimic a versioning system that meets our
|
||||
requirements. These suggestions have made Swift more favourable as an option.
|
||||
|
||||
|
||||
**File System as a Storage Backend**
|
||||
|
||||
The filesystem was briefly considered and may be included to provide a simpler
|
||||
developer setup. However, to create a production-ready system with versioning
and relationships, this would require re-implementing much of what other
databases and services provide for us. Therefore, this option is reserved for
development use only and will be missing key features.
|
||||
|
||||
|
||||
**Secure Driver Alternatives**
|
||||
|
||||
Barbican, the OpenStack secure storage service, provides us with an alternative
|
||||
if PATCH support isn't added to Heat in time.
|
||||
|
||||
Currently the only alternative other than Barbican is to implement our own
|
||||
cryptography with one of the other options listed above. This isn't a
|
||||
favourable choice as it adds technical complexity and risk that should be
|
||||
beyond the scope of this proposal.
|
||||
|
||||
The other option with regards to sensitive data is to not store any. This would
|
||||
require the REST API caller to provide the sensitive information each time a
|
||||
Heat create (and potentially update) is called.
|
||||
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Some of the configuration values, such as service passwords, will be sensitive.
|
||||
For this reason, Heat or Barbican will be used to store all configuration
|
||||
values.
|
||||
|
||||
While access will be controlled by the Tuskar API, large files could be provided
|
||||
in the place of provider resource files or configuration files. These should be
|
||||
verified against a reasonable limit.
|
||||
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
The template storage will be primarily used by the Tuskar API, but as it may be
|
||||
used directly in the future it will need to be documented.
|
||||
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Storing the templates in Glance and Barbican will lead to API calls over the
|
||||
local network rather than direct database access. These are likely to have
|
||||
higher overhead. However, the reads and writes used in Tuskar are expected to be
|
||||
infrequent and will only trigger simple reads and writes when manipulating a
|
||||
deployment plan.
|
||||
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
TripleO will have access to sensitive and non-sensitive storage through the
|
||||
storage API.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
d0ugal
|
||||
|
||||
Other contributors:
|
||||
tzumainn
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
- Implement storage API
|
||||
- Create Glance and Barbican based storage driver
|
||||
- Create database storage driver
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
- Glance
|
||||
- Barbican
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
- The API logic will be verified with a suite of unit tests that mock the
|
||||
external services.
|
||||
- Tempest will be used for integration testing.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The code should be documented with docstrings and comments. If it is used
|
||||
outside of Tuskar, further user documentation should be developed.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
- https://blueprints.launchpad.net/glance/+spec/artifact-repository-api
|
||||
- https://blueprints.launchpad.net/glance/+spec/metadata-artifact-repository
|
||||
- https://bugs.launchpad.net/heat/+bug/1224828
|
||||
- https://docs.google.com/document/d/1tOTsIytVWtXGUaT2Ia4V5PWq4CiTfZPDn6rpRm5In7U
|
||||
- https://etherpad.openstack.org/p/juno-hot-artifacts-repository-finalize-design
|
||||
- https://etherpad.openstack.org/p/juno-summit-tripleo-tuskar-planning
|
||||
- https://wiki.openstack.org/wiki/Barbican
|
||||
- https://wiki.openstack.org/wiki/TripleO/TuskarJunoPlanning
|
||||
- https://wiki.openstack.org/wiki/TripleO/TuskarJunoPlanning/TemplateBackend
|
@ -1,246 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
QuintupleO - TripleO on OpenStack
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-on-openstack
|
||||
|
||||
This is intended as a new way to do a TripleO deployment in a virtualized
|
||||
environment. Rather than provisioning the target virtual machines directly
|
||||
via virsh, we would be able to use the standard OpenStack APIs to create and
|
||||
manage the instances. This should make virtual TripleO environments more
|
||||
scalable and easier to manage.
|
||||
|
||||
Ultimately the goal would be to make it possible to do virtual TripleO
|
||||
deployments on any OpenStack cloud, except where necessary features have
|
||||
explicitly been disabled. We would like to have the needed features
|
||||
available on the public clouds used for OpenStack CI, so existing providers
|
||||
are invited to review this specification.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
TripleO development and testing requires a lot of hardware resources, and
|
||||
this is only going to increase as things like HA are enabled by default.
|
||||
In addition, we are going to want to be able to test larger deployments than
|
||||
will fit on a single physical machine. While it would be possible to set
|
||||
this up manually, OpenStack already provides services capable of managing
|
||||
a large number of physical hosts and virtual machines, so it doesn't make
|
||||
sense to reinvent the wheel.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
* Write a virtual power driver for OpenStack instances. I already have a
|
||||
rough version for nova-baremetal, but it needs a fair amount of cleaning up
|
||||
before it could be merged into the main codebase. We will also need to
|
||||
work with the Ironic team to enable this functionality there (a rough sketch
of the idea appears after this list).
|
||||
|
||||
* Determine whether changes are needed in Neutron to allow us to run our own
|
||||
DHCP server, and if so work with the Neutron team to make those changes.
|
||||
This will probably require allowing an instance to be booted without any
|
||||
IP assigned. If not, booting an instance without an IP would be a good
|
||||
future enhancement to avoid wasting IP quota.
|
||||
|
||||
* Likewise, determine how to use virtual IPs with keepalived/corosync+pacemaker
|
||||
in Neutron, and if changes to Neutron are needed work with their team to
|
||||
enable that functionality.
|
||||
|
||||
* Enable PXE booting in Nova. There is already a bug open to track this
|
||||
feature request, but it seems to have been abandoned. See the link in the
|
||||
References section of this document. Ideally this should be enabled on a
|
||||
per-instance basis so it doesn't require a specialized compute node, which
|
||||
would not allow us to run on a standard public cloud.
|
||||
|
||||
* For performance and feature parity with the current virtual devtest
|
||||
environment, we will want to allow the use of unsafe caching for the
|
||||
virtual baremetal instances.
|
||||
|
||||
* Once all of the OpenStack services support this use case, we will want to
|
||||
convert our CI environment to a standard OpenStack KVM cloud, as well as
|
||||
deprecate the existing method of running TripleO virtually and enable
|
||||
devtest to install and configure a local OpenStack installation (possibly
|
||||
using devstack) on which to run.
|
||||
|
||||
* Depending on the state of our container support at that time, we may want
|
||||
to run the devtest OpenStack using containers to avoid taking over the host
|
||||
system the way devstack normally does. This may call for its own spec when
|
||||
we reach that point.
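
The following is a minimal, hypothetical sketch of such a virtual power
driver. It is not the rough driver mentioned above; the
``OpenStackPowerDriver`` class, its method names and the idea of mapping a
baremetal node ID to a Nova server ID are illustration-only assumptions. It
presumes an already-authenticated python-novaclient handle and simply
translates power operations into Nova start/stop calls::

    # Hypothetical sketch only: translate baremetal power operations into
    # Nova server start/stop calls via an already-authenticated novaclient.
    class OpenStackPowerDriver(object):
        def __init__(self, nova, node_to_server):
            self.nova = nova                       # novaclient Client instance
            self.node_to_server = node_to_server   # assumed map: node id -> Nova server id

        def _server(self, node_id):
            return self.nova.servers.get(self.node_to_server[node_id])

        def power_on(self, node_id):
            server = self._server(node_id)
            if server.status != 'ACTIVE':
                self.nova.servers.start(server)

        def power_off(self, node_id):
            server = self._server(node_id)
            if server.status == 'ACTIVE':
                self.nova.servers.stop(server)

        def get_power_state(self, node_id):
            return 'on' if self._server(node_id).status == 'ACTIVE' else 'off'

A real driver would also need to handle error states and polling until the
requested state is reached, which is omitted here for brevity.
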
Alternatives
------------

* There's no real alternative to writing a virtual power driver. We have to
  be able to manage OpenStack instances as baremetal nodes for this to work.

* Creating a flat Neutron network connected to a local bridge can address the
  issues with Neutron not allowing DHCP traffic, but that only works if you
  have access to create the local bridge and configure the new network. This
  may not be true in many (all?) public cloud providers.

* I have not done any work with virtual IP addresses in Neutron yet, so it's
  unclear to me whether any alternatives exist for that.

* As noted earlier, using an iPXE image can allow PXE booting of Nova
  instances. However, because that image is overwritten during the deploy,
  it is not possible to PXE boot the instance afterward. Making the TripleO
  images bootable on their own might be an option, but it would diverge from
  how a real baremetal environment would work and thus is probably not
  desirable.

Deploy overcloud without PXE boot
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since a number of the complications around doing TripleO development on an
OpenStack cloud relate to PXE booting the instances, one option that could
be useful in some situations is the ability to deploy images directly. Since
we're using Heat for deployments, it should be possible to build the TripleO
images with the ``vm`` element and deploy them as regular instances instead of
fake baremetal ones.

This has the drawback of not exercising as much of the TripleO baremetal
functionality as a full virtual PXE boot process, but it should be easier to
implement, and for some development work not related to the deploy process
would be sufficient for verifying that a feature works as intended. It might
serve as a good intermediate step while we work to enable full PXE boot
functionality in OpenStack clouds.

It would also prevent exercising HA functionality because we would likely not
be able to use virtual IP addresses if we can't use DHCP/PXE to manage our
own networking environment.

Security Impact
---------------

* The virtual power driver is going to need access to OpenStack
  credentials so it can control the instances.

* The Neutron changes to allow private networks to behave as flat networks
  may have security impacts, though I'm not exactly sure what they would be.
  The same applies to virtual IP support.

* PXE booting instances could in theory allow an attacker to override the
  DHCP server and boot arbitrary images, but in order to do that they would
  already need to have access to the private network being used, so I don't
  consider this a significant new threat.

Other End User Impact
---------------------

End users doing proof of concepts using a virtual deployment environment
would need to be switched to this new method, but that should be largely
taken care of by the necessary changes to devtest since that's what would
be used for such a deployment.

Performance Impact
------------------

In my testing, my OpenStack virtual power driver was significantly slower
than the existing virsh-based one, but I believe with a better implementation
that could be easily solved.

When running TripleO on a public cloud, a developer would be subject to the
usual limitations of shared hardware - a given resource may be oversubscribed
and cause performance issues for the processing or disk-heavy operations done
by a TripleO deployment.

Other Deployer Impact
---------------------

This is not intended to be visible to regular deployers, but it should
make our CI environment more flexible by allowing more dynamic allocation
of resources.

Developer Impact
----------------

If this becomes the primary method of doing TripleO development, devtest would
need to be altered to either point at an existing OpenStack environment or
to configure a local one itself. This will have an impact on how developers
debug problems with their environment, but since they would be debugging
OpenStack in that case it should be beneficial in the long run.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bnemec

Other contributors:
  jang

Work Items
----------

* Implement an Ironic OpenStack virtual power driver.

* Implement a nova-baremetal OpenStack virtual power driver, probably out
  of tree based on the feedback we're getting from Nova and Ironic.

* Enable PXE booting of Nova instances.

* Allow unsafe caching to be enabled on Nova instances.

* Allow DHCP/PXE traffic on private networks in Neutron.

* If not already covered by the previous point, allow booting of instances
  without IP addresses.

* Migrate CI to use an OpenStack cloud for its virtual baremetal instances.

* Migrate devtest to install and configure an OpenStack cloud instead of
  managing instances and networking manually.

* To simplify the VM provisioning process, we should make it possible to
  provision but not boot a Nova VM.


Dependencies
============

The Ironic, Neutron, and Nova changes in the Work Items section will all have
to be done before TripleO can fully adopt this feature.


Testing
=======

* All changes in the other projects will be unit and functional tested as
  would any other new feature.

* We cannot test this functionality by running devstack to provision an
  OpenStack cloud in a gate VM, such as would be done for Tempest, because
  the performance of the nested qemu virtual machines would make the process
  prohibitively slow. We will need to have a baremetal OpenStack deployment
  that can be targeted by the tests. A similar problem exists today with
  virsh instances, however, and it can probably be solved in a similar
  fashion with dedicated CI environments.

* We will need to have Tempest tests gating on all the projects we use to
  exercise the functionality we depend on. This should be largely covered
  by the functional tests for the first point, but it's possible we will find
  TripleO-specific scenarios that need to be added as well.


Documentation Impact
====================

devtest will need to be updated to reflect the new setup steps needed to run
it against an OpenStack-based environment.


References
==========

This is largely based on the "Devtest on OpenStack" discussion in
https://etherpad.openstack.org/p/devtest-env-reqs

Nova bug requesting PXE booting support:
https://bugs.launchpad.net/nova/+bug/1183885

@ -1,187 +0,0 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Unit Testing TripleO Projects
==========================================

https://blueprints.launchpad.net/tripleo/unit-testing

We should enable more unit testing in TripleO projects to allow better test
coverage of code paths not included in CI, make it easier for reviewers
to verify that a code change does what it is supposed to, and avoid wasting
reviewer and developer time resolving style issues.

Problem Description
===================

Right now there is very little unit testing of the code in most of the TripleO
projects. This has a few negative effects:

- We have no test coverage of any code that isn't included in our CI runs.

- For the code that is included in CI runs, we don't actually know how much
  of that code is being tested. There may be many code branches that are not
  used during a CI run.

- We have no way to test code changes in isolation, which makes it slower to
  iterate on them.

- Changes not covered by CI are either not tested at all or must be manually
  tested by reviewers, which is tedious and error-prone.

- Major refactorings frequently break less commonly used interfaces to tools
  because those interfaces are not tested.

Additionally, because there are few/no hacking-style checks in the TripleO
projects, many patches get -1'd for style issues that could be caught by
an automated tool. This causes unnecessary delay in merging changes.

Proposed Change
===============

I would like to build out a unit testing framework that simplifies the
process of unit testing in TripleO. Once that is done, we should start
requiring unit tests for new and changed features like the other OpenStack
projects do. At that point we can also begin adding test coverage for
existing code.

The current plan is to make use of Python unit testing libraries to be as
consistent as possible with the rest of OpenStack and make use of the test
infrastructure that already exists. This will reduce the amount of new code
required and make it easier for developers to begin writing unit tests.
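
As a rough illustration of the kind of test this framework should make easy,
the sketch below uses only the standard library to exercise the behaviour of
a small bash script in an isolated temporary directory. The script name,
environment variables and output file are invented for the example; a real
framework would likely build on testtools and the existing OpenStack test
infrastructure rather than bare unittest::

    # Illustrative only: a stdlib-based unit test that runs a (hypothetical)
    # element script in a scratch directory and asserts on its output.
    import os
    import shutil
    import subprocess
    import tempfile
    import unittest


    class WriteConfigScriptTest(unittest.TestCase):
        def test_writes_expected_config(self):
            workdir = tempfile.mkdtemp()
            self.addCleanup(shutil.rmtree, workdir)
            env = dict(os.environ, TARGET_DIR=workdir, AMQP_HOST='10.0.0.1')
            # 'write-config.sh' is a made-up script standing in for an
            # element hook that renders a config file into TARGET_DIR.
            subprocess.check_call(['bash', 'write-config.sh'], env=env)
            with open(os.path.join(workdir, 'service.conf')) as f:
                self.assertIn('amqp_host = 10.0.0.1', f.read())


    if __name__ == '__main__':
        unittest.main()
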
For style checking, the dib-lint tool has already been created to catch
common errors in image elements. More rules should be added to it as we
find problems that can be caught automatically. It should also be applied
to the tripleo-image-elements project.

The bashate project also provides some general style checks that would be
useful in TripleO, so we should begin making use of it as well. We should
also contribute additional checks when possible and provide feedback on any
checks we disagree with.

Any unit tests added should be able to run in parallel. This both speeds up
testing and helps find race bugs.

Alternatives
------------

Shell unit testing
^^^^^^^^^^^^^^^^^^
Because of the quantity of bash code used in TripleO, we may want to
investigate using a shell unit test framework in addition to Python. I
think this can be revisited once we are further along in the process and
have a better understanding of how difficult it will be to unit test our
scripts with Python. I still think we should start with Python for the
reasons above and only add other options if we find something that Python
unit tests can't satisfy.

One possible benefit of a shell-specific unit testing framework is that it
could provide test coverage stats so we know exactly what code is and isn't
being tested.

If we determine that a shell unit test framework is needed, we should try
to choose a widely-used one with well-understood workflows to ease adoption.

Sandboxing
^^^^^^^^^^
I have done some initial experimentation with using fakeroot/fakechroot to
sandbox scripts that expect to have access to the root filesystem. I was
able to run a script that writes to root-owned files as a regular user, making
it think it was writing to the real files, but I haven't gotten this working
with tox for running unit tests that way.
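
If we do pursue the fakeroot route, a test could look something like the
following sketch, which simply wraps the script invocation with the
``fakeroot`` command so writes to root-owned paths are intercepted. The
script path is hypothetical and this assumes ``fakeroot`` is installed on the
test node::

    # Sketch only: run a (hypothetical) element script under fakeroot so it
    # believes it has root privileges without touching the real filesystem.
    import subprocess
    import unittest


    class FakerootSandboxTest(unittest.TestCase):
        def test_script_runs_without_real_root(self):
            # fakeroot intercepts chown/chmod/etc. via LD_PRELOAD, so the
            # script sees "root" behaviour while running as a regular user.
            result = subprocess.call(
                ['fakeroot', 'bash', 'elements/example/install.d/10-example'])
            self.assertEqual(0, result)
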
Another option would be to use real chroots. This would provide isolation
and is probably more common than fakeroots. The drawback would be that
chrooting requires root access on the host machine, so running the unit tests
would as well.

Security Impact
---------------

Many scripts in elements assume they will be running as root. We obviously
don't want to do that in unit tests, so we need a way to sandbox those scripts
to allow them to run but not affect the test system's root filesystem.

Other End User Impact
---------------------

None

Performance Impact
------------------

Adding more tests will increase the amount of time Jenkins gate jobs take.
This should have minimal real impact though, because unit tests should run
in significantly less time than the integration tests.

Other Deployer Impact
---------------------

None

Developer Impact
----------------

Developers will need to implement unit tests for their code changes, which
will require learning the unit testing tools we adopt.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bnemec

goneri has begun some work to enable dib-lint in tripleo-image-elements

Work Items
----------

* Provide and document a good Python framework for testing the behavior of
  bash scripts. Use existing functionality in upstream projects where
  possible, and contribute new features when necessary.

* Gate tripleo-image-elements on dib-lint, which will require fixing any
  lint failures currently in tripleo-image-elements.

* Enable bashate in the projects with a lot of bash scripts.

* Add unit testing to tripleo-incubator to enable verification of things
  like ``devtest.sh --build-only``.

* Add a template validation test job to tripleo-heat-templates.

Dependencies
============

* bashate will be a new test dependency.

Testing
=======

These changes should leverage the existing test infrastructure as much as
possible, so the only thing needed to enable the new tests would be changes
to the infra config for the affected projects.

Documentation Impact
====================

None of this work should be user-visible, but we may need developer
documentation to help with writing unit tests.


References
==========

bashate: http://git.openstack.org/cgit/openstack-dev/bashate/

There are some notes related to this spec at the bottom of the Summit
etherpad: https://etherpad.openstack.org/p/juno-summit-tripleo-ci

@ -1,159 +0,0 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================
Virtual IPs for public addresses
================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+specs/tripleo-juno-virtual-public-ips

The current public IP feature is intended to specify the endpoint that a cloud
can be reached at. This is typically something where HA is highly desirable.

Making the public IP be a virtual IP instead of locally bound to a single
machine should increase the availability of the clustered service, once we
increase the control plane scale to more than one machine.

Problem Description
===================

Today, we run all OpenStack services with listening ports on one virtual IP.

This means that we're exposing RabbitMQ, MySQL and possibly other cluster-only
services to the world, when really what we want is public services exposed to
the world and cluster-only services not exposed to the world. Deployers are
(rightfully) not exposing our all-services VIP to the world, which leads to
them having to choose between a) no support for externally visible endpoints,
b) all services attackable or c) manually tracking the involved ports and
playing a catch-up game as we evolve things.

Proposed Change
===============

Create a second virtual IP from a user supplied network. Bind additional copies
of API endpoints that should be publicly accessible to that virtual IP. We
need to keep presenting them internally as well (still via haproxy and the
control virtual IP) so that servers without any public connectivity such as
hypervisors can still use the APIs (though they may need to override the IP to
use in their hosts files - we have facilities for that already).

The second virtual IP could in principle be on a dedicated ethernet card, or
on a VLAN on a shared card. For now, let's require the admin to specify the
interface on which keepalived should be provisioning the shared IP - be that
``br-ctlplane``, ``vlan25`` or ``eth2``. Because the network topology may be
independent, the keepalived quorum checks need to take place on the specified
interface even though this costs external IP addresses.

The user must be able to specify the same undercloud network as they do today
so that small installs are not made impossible - requiring two distinct
networks is likely hard for small organisations. Using the same network would
not imply using the same IP address - a dedicated IP address will still be
useful to permit better testing confidence and also allows for simple exterior
firewalling of the cluster.

Alternatives
------------

We could not do HA for the public endpoints - not really an option.

We could not do public endpoints and instead document how to provide border
gateway firewalling and NAT through to the endpoints. This just shifts the
problem onto infrastructure we are not deploying, making it harder to deploy.

Security Impact
---------------

Our security story improves by making this change, as we can potentially
start firewalling the intra-cluster virtual IP to only allow known nodes to
connect. Short of that, our security story has improved since we started
binding to specific IPs only, as that made opening a new IP address not
actually expose core services (other than ssh) on it.

Other End User Impact
---------------------

End users will need to be able to find out about the new virtual IP. That
should be straightforward via our existing mechanisms.

Performance Impact
------------------

None anticipated.

Other Deployer Impact
---------------------

Deployers will require an additional IP address either on their undercloud
ctlplane network (small installs) or on their public network (larger/production
installs).

Developer Impact
----------------

None expected.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  lifeless (hahahaha)

Other contributors:
  None.

Work Items
----------

* Generalise keepalived.conf to support multiple VRRP interfaces.

* Add support for binding multiple IPs to the haproxy configuration.

* Add logic to incubator and/or heat templates to request a second virtual IP.

* Change heat templates to bind public services to the public virtual IP.

* Possibly tweak setup-endpoints to cooperate, though the prior support
  should be sufficient.

These are out of scope for this, but necessary to use it - I intend to put
them in the discussion in Dan's network overhaul spec.

* Add optional support to our heat templates to boot the machines with two
  nics, not just one - so that we have an IP address for the public interface
  when it's a physical interface. We may find there are ordering / enumeration
  issues in Nova/Ironic/Neutron to solve here.

* Add optional support to our heat templates for statically allocating a port
  from neutron and passing it into the control plane for when we're using
  VLANs.

Dependencies
============

None.

Testing
=======

This will be on by default, so our default CI path will exercise it.

Additionally we'll be using it in the upcoming VLAN test job, which will
give us confidence it works when the networks are partitioned.

Documentation Impact
====================

The main task is adding this to the manual.

References
==========

None

@ -1,183 +0,0 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=========
Cinder HA
=========

https://blueprints.launchpad.net/tripleo/+spec/tripleo-kilo-cinder-ha

Ensure Cinder volumes remain available if one or multiple nodes running
Cinder services or hosting volumes go down.


Problem Description
===================

TripleO currently deploys Cinder without shared storage, balancing requests
amongst the nodes. Should one of the nodes running `cinder-volume` fail,
requests for volumes hosted by that node will fail as well. In addition,
without shared storage, should a disk on any of the `cinder-volume` nodes
fail, the volumes hosted by that node would be lost forever.


Proposed Change
===============

Overview
--------

We aim at introducing support for the configuration of Cinder's Ceph backend
driver and for the deployment of a Ceph storage for use with Cinder.

Such a scenario will install `ceph-osd` on an arbitrary number of Ceph storage
nodes and `cinder-api`, `cinder-scheduler`, `cinder-volume` and `ceph-mon` on
the controller nodes, allowing users to scale out the Ceph storage nodes
independently from the controller nodes.

To ensure HA of the volumes, these will be hosted on the Ceph storage, and to
achieve HA for the `cinder-volume` service, all Cinder nodes will use a
shared string as their `host` config setting so that each of them is able to
operate on the entire (and shared) set of volumes. A short sketch of how a
deployer could verify this follows below.

Support for configuration of more drivers could be added later.
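
As an illustration of the shared `host` behaviour (not part of the proposed
implementation), a deployer could confirm that every `cinder-volume` service
registered with the same host string by parsing the output of the standard
`cinder service-list` command; treating "exactly one distinct host value" as
the success condition is the assumption being checked here::

    # Illustration only: verify that all cinder-volume services share one
    # `host` string by parsing the table printed by `cinder service-list`.
    import subprocess

    def shared_volume_hosts():
        output = subprocess.check_output(['cinder', 'service-list'])
        hosts = set()
        for line in output.decode('utf-8').splitlines():
            columns = [c.strip() for c in line.split('|')]
            if len(columns) > 2 and columns[1] == 'cinder-volume':
                hosts.add(columns[2])
        return hosts

    if __name__ == '__main__':
        hosts = shared_volume_hosts()
        print('cinder-volume host strings: %s' % ', '.join(sorted(hosts)))
        assert len(hosts) == 1, 'expected a single shared host string'
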
Alternatives
------------

An alternative approach could be to deploy the `cinder-volume` services in an
active/standby configuration. This would allow us to support scenarios where
the storage is not shared amongst the Cinder nodes, for example LVM over
shared Fibre Channel LUNs. Such a scenario would suffer from downsides
though: it won't permit scaling out and balancing traffic over the storage
nodes as easily, and it may be prone to issues related to iSCSI session
management on failover.

A different scenario, based instead on the combined usage of LVM and DRBD,
could be imagined too. Yet this would suffer from downsides as well. The
deployment program would be put in charge of managing the replicas and would
probably be required to have some understanding of the replicas' status as
well. These concerns are covered by Ceph itself, which also takes care of
related problems such as data rebalancing and replica recreation.

Security Impact
---------------

By introducing support for the deployment of the Ceph tools, we will have to
secure the Ceph services.

We will allow access to the data hosted by Ceph only to authorized hosts via
usage of `cephx` for authentication, distributing the `cephx` keyrings on the
relevant nodes. Controller nodes will be provisioned with the `ceph.mon`
keyring, the `client.admin` keyring and the `client.cinder` keyring,
Compute nodes will be provisioned with the `client.cinder` secret in libvirt and
lastly the Ceph storage nodes will be provisioned with the `client.admin`
keyring.

Note that the monitors should not be reachable from the public network,
despite being hosted on the Controllers. Also, Cinder won't need access to
the monitors' keyring nor the `client.admin` keyring, but those will be
hosted on the same hosts because the Controllers also run the Ceph monitor
service; the Cinder config will not provide any knowledge about them though.

Other End User Impact
---------------------

Cinder volumes as well as Cinder services will remain available despite failure
of one (or more, depending on scaling settings) of the Controller nodes or Ceph
storage nodes.

Performance Impact
------------------

The `cinder-api` services will remain balanced, and the Controller nodes will
be relieved of the LVM-file overhead and the iSCSI traffic, so this topology
should, as an additional benefit, improve performance.

Other Deployer Impact
---------------------

* Automated setup of Cinder HA will require the deployment of Ceph.

* To take advantage of a pre-existing Ceph installation instead of deploying it
  via TripleO, deployers will have to provide the input data needed to configure
  Cinder's backend driver appropriately.

* It will be possible to scale the number of Ceph storage nodes at any time, as
  well as the number of Controllers (running `cinder-volume`), but changing the
  backend driver won't be supported as there are no plans to support volume
  migration.

* Not all Cinder drivers support the scenario where multiple instances of the
  `cinder-volume` service use a shared `host` string; notably the default LVM
  driver does not. We will use this setting only when appropriate config params
  are found in the Heat template, as happens today with the param called
  `include_nfs_backend`.

* Ceph storage nodes, running the `ceph-osd` service, use the network to
  maintain replicas' consistency and as such may transfer a large amount of
  data over the network. Ceph allows for the OSD service to differentiate
  between a public network and a cluster network for this purpose. This spec
  is not going to introduce support for usage of a dedicated cluster network,
  but we want to have a follow-up spec to implement support for that later.

Developer Impact
----------------

Cinder will continue to be configured with the LVM backend driver by default.

Developers interested in testing Cinder with the Ceph shared storage will have
to use an appropriate scaling setting for the Ceph storage nodes.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  gfidente

Other contributors:
  jprovazn

Work Items
----------

* add support for deployment of Cinder's Ceph backend driver

* add support for deployment of the Ceph services

* add support for external configuration of Cinder's Ceph backend driver


Dependencies
============

None.


Testing
=======

Will be testable in CI when support for the deployment of the shared Ceph
storage nodes becomes available in TripleO itself.


Documentation Impact
====================

We will need to provide documentation on how users can deploy Cinder together
with the Ceph storage nodes and also on how users can instead use a
pre-existing Ceph deployment.


References
==========

juno mid-cycle meetup
kilo design session, https://etherpad.openstack.org/p/tripleo-kilo-l3-and-cinder-ha

@ -1,486 +0,0 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===========================================
Remove merge.py from TripleO Heat Templates
===========================================

https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-remove-mergepy

``merge.py`` is where we've historically accumulated the technical debt for our
Heat templates [0]_ with the intention of migrating away from it when Heat meets
our templating needs.

Its main functionality includes combining smaller template snippets into a
single template describing the full TripleO deployment, merging certain
resources together to reduce duplication while keeping the snippets themselves
functional as standalone templates, and support for manual scaling of Heat
resources.

This spec describes the changes necessary to move towards templates
that do not depend on ``merge.py``. We will use native Heat features
where we can and document the rest, possibly driving new additions to
the Heat template format.

It is largely based on the April 2014 discussion in openstack-dev [1]_.


Problem Description
===================

Because of the mostly undocumented nature of ``merge.py``, our templates are
difficult to understand or modify by newcomers (even those already familiar with
Heat).

It has always been considered a short-term measure, and Heat can now provide
most of what we need in our templates.


Proposed Change
===============

We will start with making small correctness-preserving changes to our
templates and ``merge.py`` that move us onto using more Heat native
features. Where we cannot make the change for some reason, we will
file a bug with Heat and work with them to unblock the process.

Once we get to a point where we have to do large changes to the
structure of our templates, we will split them off to new files and
enable them in our CI as parallel implementations.

Once we are confident that the new templates fulfill the same
requirements as the original ones, we will deprecate the old ones,
deprecate ``merge.py`` and switch to the new ones as the default.

The list of action items necessary for the full transition is
below.

**1. Remove the custom resource types**

TripleO Heat templates and ``merge.py`` carry two custom types that (after the
move to software config [8]_, [9]_) are no longer used for anything:

* OpenStack::ImageBuilder::Elements
* OpenStack::Role

We will drop them from the templates and deprecate them in the merge tool.


**2. Remove combining whitelisted resource types**

If we have two ``AWS::AutoScaling::LaunchConfiguration`` resources with the same
name, ``merge.py`` will combine their ``Properties`` and ``Metadata``. Our
templates are no longer using this after the software-config update.


**3. Port TripleO Heat templates to HOT**

With most of the non-Heat syntax out of the way, porting our CFN/YAML templates
to pure HOT format [2]_ should be straightforward.

We will have to update ``merge.py`` as well. We should be able to support both
the old format and HOT.

We should be able to differentiate between the two by looking for the
``heat_template_version`` top-level section which is mandatory in the HOT
syntax.

Most of the changes to ``merge.py`` should be around spelling (``Parameters`` ->
``parameters``, ``Resources`` -> ``resources``) and different names for
intrinsic functions, etc. (``Fn::GetAtt`` -> ``get_attr``).
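
To make the format-detection idea concrete, here is a minimal sketch (not the
actual ``merge.py`` code) of how the tool could tell the two formats apart and
rename the top-level sections; the intrinsic-function renames would need a
similar, deeper walk of the template::

    # Sketch only: detect HOT vs. CFN-style templates and rename the
    # top-level sections.  merge.py itself may structure this differently.
    import yaml

    SECTION_MAP = {'Parameters': 'parameters',
                   'Resources': 'resources',
                   'Outputs': 'outputs'}

    def is_hot(template):
        # heat_template_version is mandatory in HOT and absent from CFN.
        return 'heat_template_version' in template

    def rename_sections(template):
        if is_hot(template):
            return template
        return {SECTION_MAP.get(key, key): value
                for key, value in template.items()}

    with open('overcloud-source.yaml') as f:
        converted = rename_sections(yaml.safe_load(f))
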
This task will require syntactic changes to all of our templates and
unfortunately, it isn't something different people can update bit by bit. We
should be able to update the undercloud and overcloud portions separately, but
we can't e.g. just update a part of the overcloud. We are still putting
templates together with ``merge.py`` at this point and we would end up with a
template that has both CFN and HOT bits.


**4. Move to Provider resources**

Heat allows passing in multiple templates when deploying a stack. These
templates can map to custom resource types. Each template would represent a role
(compute server, controller, block storage, etc.) and its ``parameters`` and
``outputs`` would map to the custom resource's ``properties`` and
``attributes``.

These roles will be referenced from a master template (``overcloud.yaml``,
``undercloud.yaml``) and eventually wrapped in a scaling resource
(``OS::Heat::ResourceGroup`` [5]_) or whatever scaling mechanism we adopt.

.. note:: Provider resources represent fully functional standalone templates.
          Any provider resource template can be passed to Heat and turned into a
          stack or treated as a custom resource in a larger deployment.

Here's a hypothetical outline of ``compute.yaml``::

    parameters:
      flavor:
        type: string
      image:
        type: string
      amqp_host:
        type: string
      nova_compute_driver:
        type: string

    resources:
      compute_instance:
        type: OS::Nova::Server
        properties:
          flavor: {get_param: flavor}
          image: {get_param: image}

      compute_deployment:
        type: OS::Heat::StructuredDeployment
        properties:
          server: {ref: compute_instance}
          config: {ref: compute_config}
          input_values:
            amqp_host: {get_param: amqp_host}
            nova_compute_driver: {get_param: nova_compute_driver}

      compute_config:
        type: OS::Heat::StructuredConfig
        properties:
          group: os-apply-config
          config:
            amqp:
              host: {get_input: amqp_host}
            nova:
              compute_driver: {get_input: nova_compute_driver}
    ...

We will use a similar structure for all the other roles (``controller.yaml``,
``block-storage.yaml``, ``swift-storage.yaml``, etc.). That is, each role will
contain the ``OS::Nova::Server``, the associated deployments and any other
resources required (random string generators, security groups, ports, floating
IPs, etc.).

We can map the roles to custom types using Heat environments [4]_.

``role_map.yaml``: ::

    resource_registry:
      OS::TripleO::Compute: compute.yaml
      OS::TripleO::Controller: controller.yaml
      OS::TripleO::BlockStorage: block-storage.yaml
      OS::TripleO::SwiftStorage: swift-storage.yaml


Lastly, we'll have a master template that puts it all together.

``overcloud.yaml``::

    parameters:
      compute_flavor:
        type: string
      compute_image:
        type: string
      compute_amqp_host:
        type: string
      compute_driver:
        type: string
      ...

    resources:
      compute0:
        # defined in compute.yaml, type mapping in role_map.yaml
        type: OS::TripleO::Compute
        parameters:
          flavor: {get_param: compute_flavor}
          image: {get_param: compute_image}
          amqp_host: {get_param: compute_amqp_host}
          nova_compute_driver: {get_param: compute_driver}

      controller0:
        # defined in controller.yaml, type mapping in role_map.yaml
        type: OS::TripleO::Controller
        parameters:
          flavor: {get_param: controller_flavor}
          image: {get_param: controller_image}
          ...

    outputs:
      keystone_url:
        description: URL for the Overcloud Keystone service
        # `keystone_url` is an output defined in the `controller.yaml` template.
        # We're referencing it here to expose it to the Heat user.
        value: {get_attr: [controller0, keystone_url]}

and similarly for ``undercloud.yaml``.

.. note:: The individual roles (``compute.yaml``, ``controller.yaml``) are
          structured in such a way that they can be launched as standalone
          stacks (i.e. in order to test the compute instance, one can type
          ``heat stack-create -f compute.yaml -P ...``). Indeed, Heat treats
          provider resources as nested stacks internally.


**5. Remove FileInclude from ``merge.py``**

The goal of ``FileInclude`` was to keep individual Roles (to borrow a
loaded term from TripleO UI) viable as templates that can be launched
standalone. The canonical example is ``nova-compute-instance.yaml`` [3]_.

With the migration to provider resources, ``FileInclude`` is not necessary.


**6. Move the templates to Heat-native scaling**

Scaling of resources is currently handled by ``merge.py``. The ``--scale``
command line argument takes a resource name and duplicates it as needed (it's
a bit more complicated than that, but that's beside the point).

Heat has a native scaling ``OS::Heat::ResourceGroup`` [5]_ resource that does
essentially the same thing::

    scaled_compute:
      type: OS::Heat::ResourceGroup
      properties:
        count: 42
        resource_def:
          type: OS::TripleO::Compute
          parameters:
            flavor: baremetal
            image: compute-image-rhel7
            ...

This will create 42 instances of compute hosts.


**7. Replace Merge::Map with scaling groups' inner attributes**

We are using the custom ``Merge::Map`` helper function for getting values out of
scaled-out servers:

* `Building a comma-separated list of RabbitMQ nodes`__

__ https://github.com/openstack/tripleo-heat-templates/blob/a7f2a2c928e9c78a18defb68feb40da8c7eb95d6/overcloud-source.yaml#L642

* `Getting the name of the first controller node`__

__ https://github.com/openstack/tripleo-heat-templates/blob/a7f2a2c928e9c78a18defb68feb40da8c7eb95d6/overcloud-source.yaml#L405

* `List of IP addresses of all controllers`__

__ https://github.com/openstack/tripleo-heat-templates/blob/a7f2a2c928e9c78a18defb68feb40da8c7eb95d6/overcloud-source.yaml#L405

* `Building the /etc/hosts file`__

__ https://github.com/openstack/tripleo-heat-templates/blob/a7f2a2c928e9c78a18defb68feb40da8c7eb95d6/overcloud-source.yaml#L585


The ``ResourceGroup`` resource supports selecting an attribute of an inner
resource as well as getting the same attribute from all resources and returning
them as a list.

Example of getting an IP address of the controller node: ::

    {get_attr: [controller_group, resource.0.networks, ctlplane, 0]}

(`controller_group` is the `ResourceGroup` of our controller nodes, `ctlplane`
is the name of our control plane network)

Example of getting the list of names of all of the controller nodes: ::

    {get_attr: [controller_group, name]}

The more complex uses of ``Merge::Map`` involve formatting the returned data in
some way, for example building a list of ``{ip: ..., name: ...}`` dictionaries
for haproxy or generating the ``/etc/hosts`` file.

Since our ResourceGroups will not be using Nova servers directly, but rather the
custom role types using provider resources and environments, we can put this
data formatting into the role's ``outputs`` section and then use the same
mechanism as above.

Example of building out the haproxy node entries::

    # overcloud.yaml:
    resources:
      controller_group:
        type: OS::Heat::ResourceGroup
        properties:
          count: {get_param: controller_scale}
          resource_def:
            type: OS::TripleO::Controller
            properties:
              ...

      controllerConfig:
        type: OS::Heat::StructuredConfig
        properties:
          ...
          haproxy:
            nodes: {get_attr: [controller_group, haproxy_node_entry]}


    # controller.yaml:
    resources:
      ...
      controller:
        type: OS::Nova::Server
        properties:
          ...

    outputs:
      haproxy_node_entry:
        description: A {ip: ..., name: ...} dictionary for configuring the
                     haproxy node
        value:
          ip: {get_attr: [controller, networks, ctlplane, 0]}
          name: {get_attr: [controller, name]}


Alternatives
------------

This proposal is very t-h-t and Heat specific. One alternative is to do nothing
and keep using and evolving ``merge.py``. That was never the intent, and most
members of the core team do not consider this a viable long-term option.


Security Impact
---------------

This proposal does not affect the overall functionality of TripleO in any way.
It just changes the way TripleO Heat templates are stored and written.

If anything, this will move us towards more standard and thus more easily
auditable templates.


Other End User Impact
---------------------

There should be no impact for the users of vanilla TripleO.

More advanced users may want to customise the existing Heat templates or write
their own. That will be made easier when we rely on standard Heat features only.


Performance Impact
------------------

This moves some of the template-assembling burden from ``merge.py`` to Heat. It
will likely also end up producing more resources and nested stacks in the
background.

As far as we're aware, no one has tested these features at the scale we are
inevitably going to hit.

Before we land changes that can affect this (provider config and scaling) we
need to have scale tests in Tempest running TripleO to make sure Heat can cope.

These tests can be modeled after the `large_ops`_ scenario: a Heat template that
creates and destroys a stack of 50 Nova server resources with associated
software configs.

We should have two tests to assess the before and after performance (a sketch
of how the first template could be generated follows this list):

1. A single HOT template with 50 copies of the same server resource and software
   config/deployment.
2. A template with a single server and its software config/deploys, an
   environment file with a custom type mapping and an overall template that
   wraps the new type in a ResourceGroup with the count of 50.
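
As a rough illustration of how the first test's input could be produced (an
assumption about how the eventual Tempest code might do it, not a
prescription), a template with 50 copies of one server resource can be
generated mechanically; the flavor and image values are taken from the
ResourceGroup example above::

    # Sketch only: mechanically build a HOT template containing 50 copies of
    # the same OS::Nova::Server resource for the "before" scale test.
    import yaml

    def build_flat_template(copies=50):
        resources = {}
        for i in range(copies):
            resources['server_%d' % i] = {
                'type': 'OS::Nova::Server',
                'properties': {'flavor': 'baremetal',
                               'image': 'compute-image-rhel7'},
            }
        return {'heat_template_version': '2013-05-23',
                'resources': resources}

    print(yaml.safe_dump(build_flat_template()))
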
.. _large_ops: https://github.com/openstack/tempest/blob/master/tempest/scenario/test_large_ops.py


Other Deployer Impact
---------------------

Deployers can keep using ``merge.py`` and the existing Heat templates as before
-- existing scripts ought not break.

With the new templates, Heat will be called directly and will need the resource
registry (in a Heat environment file). This will mean a change in the deployment
process.


Developer Impact
----------------

This should not affect non-Heat and non-TripleO OpenStack developers.

There will likely be a slight learning curve for the TripleO developers who want
to write and understand our Heat templates. Chances are, we will also encounter
bugs or unforeseen complications while swapping ``merge.py`` for Heat features.

The impact on Heat developers would involve processing the bugs and feature
requests we uncover. This will hopefully not be an avalanche.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Tomas Sedovic <lp: tsedovic> <irc: shadower>


Work Items
----------

1. Remove the custom resource types
2. Remove combining whitelisted resource types
3. Port TripleO Heat templates to HOT
4. Move to Provider resources
5. Remove FileInclude from ``merge.py``
6. Move the templates to Heat-native scaling
7. Replace Merge::Map with scaling groups' inner attributes


Dependencies
============

* The Juno release of Heat
* Being able to kill specific nodes in Heat (for scaling down or because they're
  misbehaving)

  - Relevant Heat blueprint: `autoscaling-parameters`_

.. _autoscaling-parameters: https://blueprints.launchpad.net/heat/+spec/autoscaling-parameters


Testing
=======

All of these changes will be made to the tripleo-heat-templates repository and
should be testable by our CI just as any other t-h-t change.

In addition, we will need to add Tempest scenarios for scale to ensure Heat can
handle the load.


Documentation Impact
====================

We will need to update the `devtest`_, `Deploying TripleO`_ and `Using TripleO`_
documentation and create a guide for writing TripleO templates.

.. _devtest: http://docs.openstack.org/developer/tripleo-incubator/devtest.html
.. _Deploying TripleO: http://docs.openstack.org/developer/tripleo-incubator/deploying.html
.. _Using TripleO: http://docs.openstack.org/developer/tripleo-incubator/userguide.html


References
==========

.. [0] https://github.com/openstack/tripleo-heat-templates
.. [1] http://lists.openstack.org/pipermail/openstack-dev/2014-April/031915.html
.. [2] http://docs.openstack.org/developer/heat/template_guide/hot_guide.html
.. [3] https://github.com/openstack/tripleo-heat-templates/blob/master/nova-compute-instance.yaml
.. [4] http://docs.openstack.org/developer/heat/template_guide/environment.html
.. [5] http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Heat::ResourceGroup
.. [8] https://review.openstack.org/#/c/81666/
.. [9] https://review.openstack.org/#/c/93319/

@ -1,169 +0,0 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Enable Neutron DVR on overcloud in TripleO
==========================================

https://blueprints.launchpad.net/tripleo/+spec/support-neutron-dvr

Neutron distributed virtual routing should be able to be configured in TripleO.


Problem Description
===================

To be able to enable distributed virtual routing in Neutron there need to be
several changes to the current TripleO overcloud deployment. The overcloud
compute node(s) are constructed with the ``neutron-openvswitch-agent`` image
element, which provides the ``neutron-openvswitch-agent`` on the compute node.
In order to support distributed virtual routing, the compute node(s) must also
have the ``neutron-metadata-agent`` and ``neutron-l3-agent`` installed. The
installation of the ``neutron-l3-agent`` and ``neutron-dhcp-agent`` will also
need to be decoupled.

Additionally, for distributed virtual routing to be enabled, the
``neutron.conf``, ``l3_agent.ini`` and ``ml2_conf.ini`` all need additional
settings.

Proposed Change
===============

Overview
--------

In the tripleo-image-elements, move the current ``neutron-network-node`` element
to an element named ``neutron-router``, which will be responsible for doing the
installation and configuration work required to install the ``neutron-l3-agent``
and the ``neutron-metadata-agent``. This ``neutron-router`` element will list
the ``neutron-openvswitch-agent`` in its element-deps. The
``neutron-network-node`` element will then become simply a 'wrapper' whose sole
purpose is to list the dependencies required for a network node (neutron,
``neutron-dhcp-agent``, ``neutron-router``, os-refresh-config).

Additionally, in the tripleo-image-elements/neutron element, the
``neutron.conf``, ``l3_agent.ini`` and ``plugins/ml2/ml2_conf.ini`` will be
modified to add the configuration variables required in each to support
distributed virtual routing (the required configuration variables are listed at
https://wiki.openstack.org/wiki/Neutron/DVR/HowTo#Configuration); an
illustrative sketch of applying such settings appears at the end of this
overview.

In the tripleo-heat-templates, the ``nova-compute-config.yaml``,
``nova-compute-instance.yaml`` and ``overcloud-source.yaml`` files will be
modified to provide the correct settings for the new distributed virtual routing
variables. The enablement of distributed virtual routing will be determined by
a 'NeutronDVR' variable which will be 'False' by default (distributed virtual
routing not enabled) for backward compatibility, but can be set to 'True' if
distributed virtual routing is desired.

Lastly, the tripleo-incubator script ``devtest_overcloud.sh`` will be modified
to: a) build the overcloud-compute disk image with ``neutron-router`` rather
than with ``neutron-openvswitch-agent``, and b) configure the appropriate
parameter values to be passed in to the heat stack create for the overcloud so
that distributed routing is either enabled or disabled.
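
For illustration only, applying such settings with Python's standard library
could look like the sketch below. The authoritative option list is the DVR
HowTo page referenced above; the file paths, option names and values shown
here are assumptions, and the element would really render them through its
os-apply-config templates rather than a script like this::

    # Sketch only: write DVR-related options into the Neutron config files.
    # Option names/values are assumptions based on the DVR HowTo.
    try:
        import configparser                      # Python 3
    except ImportError:
        import ConfigParser as configparser      # Python 2

    DVR_SETTINGS = {
        '/etc/neutron/neutron.conf': {
            'DEFAULT': {'router_distributed': 'True'}},
        '/etc/neutron/l3_agent.ini': {
            'DEFAULT': {'agent_mode': 'dvr_snat'}},
        '/etc/neutron/plugins/ml2/ml2_conf.ini': {
            'ml2': {'mechanism_drivers': 'openvswitch,l2population'},
            'agent': {'l2_population': 'True',
                      'enable_distributed_routing': 'True'}},
    }

    for path, sections in DVR_SETTINGS.items():
        config = configparser.ConfigParser()
        config.read(path)
        for section, options in sections.items():
            if section != 'DEFAULT' and not config.has_section(section):
                config.add_section(section)
            for option, value in options.items():
                config.set(section, option, value)
        with open(path, 'w') as f:
            config.write(f)
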
Alternatives
------------

We could choose to make no change to the ``neutron-router`` image element, and
it can be included as well in the list of element arguments to the disk image
build for compute nodes. This has the undesired effect of also
including/configuring and starting the ``neutron-dhcp-agent`` on each compute
node. Alternatively, it is possible to keep the ``neutron-network-node``
element as it is and create a ``neutron-router`` element which is a copy of
most of the element contents of the ``neutron-network-node`` element but without
the dependency on the ``neutron-dhcp-agent`` element. This approach would
introduce a significant amount of code duplication.

Security Impact
---------------

Although a TripleO installation does not use FWaaS, enablement of DVR currently
is known to break FWaaS.
See https://blueprints.launchpad.net/neutron/+spec/neutron-dvr-fwaas

Other End User Impact
---------------------

The user will have the ability to set an environment variable during install
which will determine whether distributed virtual routing is enabled or not.

Performance Impact
------------------

None identified

Other Deployer Impact
---------------------

The option to enable or disable distributed virtual routing at install time will
be added. By default distributed virtual routing will be disabled.

Developer Impact
----------------

None identified

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Erik Colnick (erikcolnick on Launchpad)
Other contributors:
  None

Work Items
----------

* Create the ``neutron-router`` element in tripleo-image-elements and move
  related contents from the ``neutron-network-node`` element. Remove the
  ``neutron-dhcp-agent`` dependency from the element-deps of the
  ``neutron-router`` element.

* Add the ``neutron-router`` element as a dependency in the
  ``neutron-network-node`` ``element-deps`` file. The ``element-deps``
  file becomes the only content in the ``neutron-network-node`` element.

* Add the configuration values indicated in
  https://wiki.openstack.org/wiki/Neutron/DVR/HowTo#Configuration to the
  ``neutron.conf``, ``l3_agent.ini`` and ``ml2_conf.ini`` files in the
  ``neutron`` image element.

* Add the necessary reference variables to the ``nova-compute-config.yaml`` and
  ``nova-compute-instance.yaml`` tripleo-heat-templates files in order to be
  able to set the new variables in the config files (from the above item). Add
  definitions and default values in ``overcloud-source.yaml``.

* Modify the tripleo-incubator ``devtest_overcloud.sh`` script to set the
  appropriate environment variables which will drive the configuration of
  neutron on the overcloud to either enable distributed virtual routers or
  disable distributed virtual routers (with disable as the default).

Dependencies
============

None

Testing
=======

Existing TripleO CI will help ensure that as this is implemented, the current
feature set is not impacted and that the default behavior of disabled
distributed virtual routers is maintained.

Additional CI tests which test the installation with distributed virtual
routers should be added as this implementation is completed.

Documentation Impact
====================

Documentation of the new configuration option will be needed.

References
==========

@ -1,144 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
========================
|
||||
TripleO Review Standards
|
||||
========================
|
||||
|
||||
No launchpad blueprint because this isn't a spec to be implemented in code.
|
||||
|
||||
Like many OpenStack projects, TripleO generally has more changes incoming to
|
||||
the projects than it has core reviewers to review and approve those changes.
|
||||
Because of this, optimizing reviewer bandwidth is important. This spec will
|
||||
propose some changes to our review process discussed at the Paris OpenStack
|
||||
Summit and intended to make the best possible use of core reviewer time.
|
||||
|
||||
There are essentially two major areas that a reviewer looks at when reviewing
|
||||
a given change: design and implementation. The design part of the review
|
||||
covers things like whether the change fits with the overall direction of the
|
||||
project and whether new code is organized in a reasonable fashion. The
|
||||
implementation part of a review will get into smaller details, such as
|
||||
whether language functionality is being used properly and whether the general
|
||||
sections of the code identified in the design part of the review do what is
|
||||
intended.
|
||||
|
||||
Generally design is considered first, and then the reviewer will drill down to
|
||||
the implementation details of the chosen design.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
Often, an overall design for a given change will be agreed upon early in
|
||||
the change's lifecycle. The implementation for the design may then be
|
||||
tweaked multiple times (due to rebases, or specific issues pointed out by
|
||||
reviewers) without any changes to the overall design. Often these
|
||||
implementation details are small changes that shouldn't require much
|
||||
review effort, but because of our current standard of 2 +2's on the current
|
||||
patch set before a change can be approved, reviewers often must unnecessarily
|
||||
revisit a change even when it is clear that everyone involved in the review
|
||||
is in favor of it.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
When appropriate, allow a core reviewer to approve a change even if the
|
||||
latest patch set does not have 2 +2's. Specifically, this should be used
|
||||
under the following circumstances:
|
||||
|
||||
* A change that has had multiple +2's on past patch sets, indicating an
|
||||
agreement from the other reviewers that the overall design of the change
|
||||
is good.
|
||||
* Any further alterations to the change since the patch set(s) with +2's should
|
||||
be implementation details only - trivial rebases, minor syntax changes, or
|
||||
comment/documentation changes. Any more significant changes invalidate this
|
||||
option.
|
||||
|
||||
As always, core reviewers should use their judgment. When in doubt, waiting
|
||||
for 2 +2's to approve a change is always acceptable, but this new policy is
|
||||
intended to make it socially acceptable to single approve a change under the
|
||||
circumstances described above.
|
||||
|
||||
When approving a change in this manner, it is preferable to leave a comment
|
||||
explaining why the change is being approved without 2 +2's.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Allowing a single +2 on "trivial" changes was also discussed, but there were
|
||||
concerns from a number of people present that such a policy might cause more
|
||||
trouble than it was worth, particularly since "trivial" changes by nature do
|
||||
not require much review and therefore don't take up much reviewer time.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Should be minimal to none. If a change between patch sets is significant
|
||||
enough to have a security impact, then this policy does not apply.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Core reviewers will spend less time revisiting patches they have already
|
||||
voted in favor of, and contributors should find it easier to get their
|
||||
patches merged because they won't have to wait as long after rebases and
|
||||
minor changes.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
bnemec
|
||||
|
||||
Other contributors:
|
||||
All cores should review this spec and apply it in their reviewing
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Publish the agreed-upon guidelines somewhere more permanent than a spec.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
None
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
A new document will need to be created for core reviewers to reference.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
https://etherpad.openstack.org/p/kilo-tripleo-summit-reviews
|
@ -1,219 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Release Branch proposal for TripleO
|
||||
==========================================
|
||||
|
||||
To date, the majority of folks consuming TripleO have been doing so via the
|
||||
master branches of the various repos required to allow TripleO to deploy
|
||||
an OpenStack cloud. This spec proposes an alternative "release branch" methodology
|
||||
which should enable those consuming stable OpenStack releases to deploy
|
||||
more easily using TripleO.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Historically, strong guarantees about deploying the current stable OpenStack
|
||||
release have not been made, and it's not something we've been testing in
|
||||
upstream CI. This is fine from a developer perspective, but it's a major
|
||||
impediment to those wishing to deploy production clouds based on the stable
|
||||
OpenStack releases/branches.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
I propose we consider supporting additional "release" branches, for selected
|
||||
TripleO repos where release-specific changes are required.
|
||||
|
||||
The model will be based on the stable branch model[1] used by many/most
|
||||
OpenStack projects, but with one difference: "feature" backports will be
|
||||
permitted provided they are 100% compatible with the currently released
|
||||
OpenStack services.
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The justification for allowing features is that many/most TripleO features are
|
||||
actually enabling access to features of OpenStack services which will exist in
|
||||
the stable branches of the services being deployed. Thus, the target audience
|
||||
of this branch will likely want to consume such "features" to better access
|
||||
features and configurations which are appropriate to the OpenStack release they
|
||||
are consuming.
|
||||
|
||||
The other aspect of justification is that projects are adding features
|
||||
constantly, so it's unlikely TripleO will be capable of aligning with every
|
||||
possible new feature for, say, Liberty, on day 1 of the release being made. The
|
||||
recognition that we'll be playing "catch up", and adopting a suitable branch
|
||||
policy should mean there is scope to continue that alignment after the services
|
||||
themselves have been released, which will be of benefit to our users.
|
||||
|
||||
Changes landing on the master branch can be considered valid candidates for
|
||||
backport, unless:
|
||||
|
||||
* The patch requires new features of an OpenStack service (that do not exist
|
||||
on the stable branches) to operate. E.g. if a tripleo-heat-templates change
|
||||
needs new-for-liberty Heat features, it would *not* be allowed for release/kilo.
|
||||
|
||||
* The patch enables Overcloud features of an OpenStack service that do not
|
||||
exist on the stable branches of the supported Overcloud version (e.g. for
|
||||
release/kilo we only support kilo overcloud features).
|
||||
|
||||
* User visible interfaces are modified, renamed or removed - removal of
|
||||
deprecated interfaces may be allowed on the master branch (after a suitable
|
||||
deprecation period), but these changes would *not* be valid for backport as
|
||||
they could impact existing users without warning. Adding new interfaces
|
||||
such as provider resources or parameters would be permitted provided the
|
||||
default behavior does not impact existing users of the release branch.
|
||||
|
||||
* The patch introduces new dependencies or changes the current requirements.txt.
|
||||
|
||||
To make it easier to identify not-valid-for-backport changes, it's proposed
|
||||
that a review process be adopted whereby a developer proposing a patch to
|
||||
master would tag a commit if it doesn't meet the criteria above, or there is
|
||||
some other reason why the patch would be unsuitable for backport.
|
||||
|
||||
e.g.:
|
||||
|
||||
No-Backport: This patch requires new-for-Mitaka Heat features
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The main alternative to this is to leave upstream TripleO as something which
|
||||
primarily targets developer/trunk-chasing users, and leave maintaining a
|
||||
stable branch of the various components to downstream consumers of TripleO,
|
||||
rdo-manager for example.
|
||||
|
||||
The disadvantage of this approach is that it's an impediment to adoption and
|
||||
participation in the upstream project, so I feel it'd be better to do this work
|
||||
upstream, and improve the experience for those wishing to deploy via TripleO
|
||||
using only the upstream tools and releases.
|
||||
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
We'd need to ensure security-related patches landing in master get
|
||||
appropriately applied to the release branches (same as stable branches for all
|
||||
other projects).
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
This should make it much easier for end users to stand up a TripleO deployed
|
||||
cloud using the stable released versions of OpenStack services.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
This may reduce duplication of effort when multiple downstream consumers of
|
||||
TripleO exist.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
The proposal of valid backports will ideally be made by the developer
|
||||
proposing a patch to the master branch, but to avoid creating an undue barrier to
|
||||
entry for new contributors, this will not be mandatory, but will be recommended
|
||||
and encouraged via code review comments.
|
||||
|
||||
Standard stable-maint processes[1] will be observed when proposing backports.
|
||||
|
||||
We need to consider if we want a separate stable-maint core (as is common on
|
||||
most other projects), or if all tripleo-core members can approve backports.
|
||||
Initially, it is anticipated that all of tripleo-core will be allowed, potentially with the
|
||||
addition of others with a specific interest in branch maintenance (e.g.
|
||||
downstream package maintainers).
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Initially the following repos will gain release branches:
|
||||
|
||||
* openstack/tripleo-common
|
||||
* openstack/tripleo-docs
|
||||
* openstack/tripleo-heat-templates
|
||||
* openstack/tripleo-puppet-elements
|
||||
* openstack/python-tripleoclient
|
||||
* openstack/instack-undercloud
|
||||
|
||||
These will all have a new branch created, ideally near the time of the upcoming
|
||||
liberty release, and to avoid undue modification to existing infra tooling,
|
||||
e.g. zuul, they will use the standard stable branch naming, e.g.:
|
||||
|
||||
* stable/liberty
|
||||
|
||||
If any additional repos require stable branches, we can add those later when
|
||||
required.
|
||||
|
||||
It is expected that any repos which don't have a stable/release branch must
|
||||
maintain compatibility such that they don't break deploying the stable released
|
||||
OpenStack version (if this proves impractical in any case, we'll create
|
||||
branches when required).
|
||||
|
||||
Also, when the release branches have been created, we will explicitly *not*
|
||||
require the master branch for those repos to observe backwards compatibility,
|
||||
with respect to consuming new OpenStack features. For example, new-for-mitaka
|
||||
Heat features may be consumed on the master branch of tripleo-heat-templates
|
||||
after we have a stable/liberty branch for that repo.
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
shardy
|
||||
|
||||
Other contributors:
|
||||
TBC
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
1. Identify the repos which require release branches
|
||||
2. Create the branches
|
||||
3. Communicate the need to backport to developers; consider options for automating
|
||||
4. CI jobs to ensure the release branch stays working
|
||||
5. Documentation to show how users may consume the release branch
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
We'll need CI jobs configured to use the TripleO release branches, deploying
|
||||
the stable branches of other OpenStack projects. Hopefully we can make use of
|
||||
e.g. RDO packages for most of the project stable branch content, then build
|
||||
delorean packages for the tripleo release branch content.
|
||||
|
||||
Ideally in future we'd also test upgrade from one release branch to another
|
||||
(e.g current release from the previous, and/or from the release branch to
|
||||
master).
|
||||
|
||||
As a starting point, derekh has suggested we create a single CentOS job, which
|
||||
only tests HA, and that we'll avoid having a tripleo-ci release branch,
|
||||
ideally using the under-development [2] tripleo.sh developer script to abstract
|
||||
any differences between deployment steps for branches.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
We'll need to update the docs to show:
|
||||
|
||||
1. How to deploy an undercloud node from the release branches using stable
|
||||
OpenStack service versions
|
||||
2. How to build images containing content from the release branches
|
||||
3. How to deploy an overcloud using only the release branch versions
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
We started discussing this idea in this thread:
|
||||
|
||||
http://lists.openstack.org/pipermail/openstack-dev/2015-August/072217.html
|
||||
|
||||
[1] https://wiki.openstack.org/wiki/StableBranch
|
||||
[2] https://review.openstack.org/#/c/225096/
|
@ -1,169 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
======================
|
||||
External Load Balancer
|
||||
======================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-mitaka-external-load-balancer
|
||||
|
||||
Make it possible to optionally use an external load balancer as a frontend for
|
||||
the Overcloud.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
To use an external load balancer the Overcloud templates and manifests will be
|
||||
updated to accomplish the following three goals:
|
||||
|
||||
* accept a list of virtual IPs as a parameter, to be used instead of the virtual
|
||||
IPs which are normally created as Neutron ports and hosted by the controllers
|
||||
|
||||
* make the deployment and configuration of HAProxy on the controllers optional
|
||||
|
||||
* allow for the assignment of a predefined list of IPs to the controller nodes
|
||||
so that these can be used for the external load balancer configuration
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The VipMap structure, governed by the ``OS::TripleO::Network::Ports::NetIpMap``
|
||||
resource type, will be switched to ``OS::TripleO::Network::Ports::NetVipMap``,
|
||||
a more specific resource type, so that it can be pointed to a custom YAML allowing
|
||||
for the VIPs to be provided by the user at deployment time. Any reference to the
|
||||
VIPs in the templates will be updated to gather the VIP details from such a
|
||||
structure. The existing VIP resources will also be switched from the
|
||||
non-specialized type ``OS::TripleO::Controller::Ports::InternalApiPort`` into a
|
||||
more specific type ``OS::TripleO::Network::Ports::InternalApiVipPort`` so that
|
||||
it will be possible to noop the VIPs or add support for more parameters as
|
||||
required and independently from the controller ports resource.
|
||||
|
||||
The deployment and configuration of HAProxy on the controller nodes will become
|
||||
optional and driven by a new template parameter visible only to the controllers.
|
||||
|
||||
It will be possible to provide via template parameters a predefined list of IPs
|
||||
to be assigned to the controller nodes, on each network, so that these can be
|
||||
configured as target IPs in the external load balancer, before the deployment
|
||||
of the Overcloud is initiated. A new port YAML will be provided for this purpose;
|
||||
when using an external load balancer this will be used for resources like
|
||||
``OS::TripleO::Controller::Ports::InternalApiPort``.
|
||||
|
||||
As a requirement for the deployment process to succeed, the external load
|
||||
balancer must be configured in advance with the appropriate balancing rules and
|
||||
target IPs. This is because the deployment process itself uses a number of
|
||||
infrastructure services (database/messaging) as well as core OpenStack services
|
||||
(Keystone) during the configuration steps. A validation script will be provided
|
||||
so that connectivity to the VIPs can be tested in advance and hopefully avoid
|
||||
false negatives during the deployment.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
By filtering the incoming connections for the controller nodes, an external load
|
||||
balancer might help the Overcloud survive network flood attacks or issues due
|
||||
to purposely malformed API requests.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
The deployer wishing to deploy with an external load balancer will have to
|
||||
provide a few more parameters at deployment time, including:
|
||||
|
||||
* the VIPs configured on the balancer to be used by the Overcloud services
|
||||
|
||||
* the IPs to be configured on the controllers, for each network
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Given that there won't be any instance of HAProxy running on the controllers when
|
||||
using an external load balancer, these might benefit from lower stress on the
|
||||
TCP stack.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None expected unless deploying with an external load balancer. A sample
|
||||
environment file will be provided to give some guidance on the parameters
|
||||
to be passed when deploying with an external load balancer.
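
To make that guidance concrete, a sample environment might look roughly like
the sketch below. This is illustrative only: the parameter names (e.g.
``ControlFixedIPs``, ``ControllerIPs``) and addresses are assumptions made for
this spec, not the final interface; only the ``NetVipMap`` resource type is
taken from the proposal above.

.. code::

  # Illustrative sample environment for an external load balancer deployment.
  parameter_defaults:
    # VIPs already configured on the external balancer
    ControlFixedIPs: [{'ip_address': '192.0.2.250'}]
    InternalApiVirtualFixedIPs: [{'ip_address': '172.16.2.250'}]
    # Predefined IPs to assign to the controller nodes, per network
    ControllerIPs:
      internal_api:
        - 172.16.2.10
        - 172.16.2.11
        - 172.16.2.12
  resource_registry:
    # Point the VIP map at the user-provided values
    OS::TripleO::Network::Ports::NetVipMap: /path/to/network/ports/net_vip_map_external.yaml
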
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
In those scenarios where the deployer was using only a subset of the isolated
|
||||
networks, the customization templates will need to be updated so that the new
|
||||
VIPs resource type is nooped. This can be achieved with something like:
|
||||
|
||||
.. code::
|
||||
|
||||
resource_registry:
|
||||
OS::TripleO::Network::Ports::InternalApiVipPort: /path/to/network/ports/noop.yaml
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
gfidente
|
||||
|
||||
Other contributors:
|
||||
dprince
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* accept user provided collection of VIPs as parameter
|
||||
|
||||
* make the deployment of the managed HAProxy optional
|
||||
|
||||
* allow for the assignment of a predefined list of IPs to the controller nodes
|
||||
|
||||
* add a validation script to test connectivity against the external VIPs
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The feature seems untestable in CI at the moment, but it will be possible to test
|
||||
at least the assignment of a predefined list of IPs to the controller nodes by
|
||||
providing only the predefined list of IPs as a parameter.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
In addition to documenting the specific template parameters needed when
|
||||
deploying with an external load balancer, it will also be necessary to provide
|
||||
some guidance for the configuration of the load balancer itself so that
|
||||
it will behave as expected in the event of a failure. Unfortunately the
|
||||
configuration settings are strictly dependent on the balancer in use; we should
|
||||
publish a copy of a managed HAProxy instance config to use as a reference so that
|
||||
a deployer could configure their external appliance similarly.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
None.
|
@ -1,202 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==================================
|
||||
Puppet Module Deployment via Swift
|
||||
==================================
|
||||
|
||||
Launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/puppet-modules-deployment-via-swift
|
||||
|
||||
The ability to deploy a local directory of puppet modules to an overcloud
|
||||
using the OpenStack Swift object service.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
When deploying puppet modules to the overcloud there are currently three
|
||||
options:
|
||||
|
||||
* pre-install the puppet modules into a "golden" image. You can pre-install
|
||||
modules via git sources or by using a distro package.
|
||||
|
||||
* use a "firstboot" script to rsync the modules from the undercloud (or
|
||||
some other rsync server that is available).
|
||||
|
||||
* post-install the puppet modules via a package upgrade onto a running
|
||||
Overcloud server by using a distro package (RPM, Deb, etc.).
|
||||
|
||||
None of the above mechanisms provides an easy workflow when making
|
||||
minor (ad-hoc) changes to the puppet modules, and only distro packages can be
|
||||
used to provide updated puppet modules to an already deployed overcloud.
|
||||
While we do have a way to rsync updated modules over on "firstboot",
|
||||
this isn't a useful mechanism for operators who may wish to
|
||||
use heat stack-update to deploy puppet changes without having to build
|
||||
a new RPM/Deb package for each revision.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Create an optional (opt-in) workflow that, if enabled, will allow an operator
|
||||
to create and deploy a local artifact (tarball, distro package, etc.) of
|
||||
puppet modules to a new or existing overcloud via heat stack-create and
|
||||
stack-update. The mechanism would use the OpenStack object store service
|
||||
(rather than rsync), which we already have available on the undercloud.
|
||||
The new workflow would work like this:
|
||||
|
||||
* A puppet modules artifact (tarball, distro package, etc.) would be uploaded
|
||||
into a swift container.
|
||||
|
||||
* The container would be configured so that a Swift Temp URL can be generated.
|
||||
|
||||
* A Swift Temp URL would be generated for the puppet modules artifact that is
|
||||
stored in swift.
|
||||
|
||||
* A heat environment would be generated which sets a DeployArtifactURLs
|
||||
parameter to this swift URL. (The parameter could be a list so that
|
||||
multiple URLs could also be downloaded.) See the sketch after this list.
|
||||
|
||||
* The TripleO Heat Templates would be modified so that they include a new
|
||||
'script' step which, if it detects a custom DeployArtifactURLs parameter,
|
||||
would automatically download the artifact from the provided URL, and
|
||||
deploy it locally on each overcloud role during the deployment workflow.
|
||||
By "deploy locally" we mean a tarball would be extracted, and RPM would
|
||||
get installed, etc. The actual deployment mechanism will be pluggable
|
||||
such that both tarballs and distro packages will be supported and future
|
||||
additions might be added as well so long as they also fit into the generic
|
||||
DeployArtifactURLs abstraction.
|
||||
|
||||
* The Operator could then use the generated heat environment to deploy
|
||||
a new set of puppet modules via heat stack-create or heat stack-update.
|
||||
|
||||
* TripleO client could be modified so that it automatically loads
|
||||
generated heat environments in a convenient location. This (optional)
|
||||
extra step would make enabling the above workflow transparent and
|
||||
only require the operator to run an 'upload-puppet-modules' tool to
|
||||
upload and configure new puppet modules for deployment via Swift.
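
A minimal sketch of such a generated environment follows; the container name
and signed URL are illustrative only and would in practice be produced by the
upload tool.

.. code::

  # Hypothetical generated environment; the URL is an example Swift Temp URL.
  parameter_defaults:
    DeployArtifactURLs:
      - "http://192.0.2.1:8080/v1/AUTH_abc123/overcloud-artifacts/puppet-modules.tar.gz?temp_url_sig=da39a3ee&temp_url_expires=1440000000"
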
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
There are many alternatives we could use to obtain a similar workflow that
|
||||
allows the operator to more easily deploy puppet modules from a local directory:
|
||||
|
||||
* Setting up a puppet master would allow a similar workflow. The downside
|
||||
of this approach is that it would require a bit of overhead, and it
|
||||
is puppet-specific (the deployment mechanism would need to be re-worked
|
||||
if we ever had other types of on-disk files to update).
|
||||
|
||||
* Rsync. We already support rsync for firstboot scripts. The downside of
|
||||
rsync is that it requires extra setup, and doesn't have an API like
|
||||
OpenStack Swift does, allowing for local or remote management and updates
|
||||
to the puppet modules.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The new deployment would use a Swift Temp URL over HTTP/HTTPS. The duration
|
||||
of the Swift Temp URLs can be controlled when they are signed via
|
||||
swift-temp-url if extra security is desired. By using a Swift Temp URL we
|
||||
avoid the need to pass the administrator's credentials onto each overcloud
|
||||
node for swiftclient and instead can simply use curl (or wget) to download
|
||||
the updated puppet modules. Given we already deploy images over http/https
|
||||
using an undercloud, the use of Swift in this manner should pose minimal extra
|
||||
security risks.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
The ability to deploy puppet modules via Swift will be opt-in, so the
|
||||
impact on end users would be minimal. The heat templates will contain
|
||||
a new script deployment that may take a few extra seconds to deploy on
|
||||
each node (even if the feature is not enabled). We could perhaps avoid the extra
|
||||
deployment time by noop'ing out the heat resource for the new
|
||||
swift puppet module deployment.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Developers and Operators would likely be able to deploy puppet module changes
|
||||
more quickly (without having to create a distro package). The actual deployment
|
||||
of puppet modules via swift (downloading and extracting the tarball) would
|
||||
likely be just as fast as deploying a distro package.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Being able to more easily deploy updated puppet modules to an overcloud would
|
||||
likely speed up the development, update, and testing cycle of puppet modules.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
dan-prince
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Create an upload-puppet-modules script in tripleo-common. Initially this
|
||||
may be a bash script which we ultimately refine into a Python version if
|
||||
it proves useful.
|
||||
|
||||
* Modify tripleo-heat-templates so that it supports a DeployArtifactURLs
|
||||
parameter; if the parameter is set, it will attempt to deploy the list of
|
||||
files from this parameter. The actual contents of the file might be
|
||||
a tarball or a distribution package (RPM).
|
||||
|
||||
* Modify tripleoclient so that the workflow around using upload-puppet-modules
|
||||
can be "transparent". Simply running upload-puppet-modules would not only
|
||||
upload the puppet modules, it would also generate a Heat environment that
|
||||
would then automatically configure heat stack-update/create commands
|
||||
to use the new URL via a custom heat environment.
|
||||
|
||||
* Update our CI scripts in tripleo-ci and/or tripleo-common so that we
|
||||
make use of the new Puppet modules deployment mechanism.
|
||||
|
||||
* Update tripleo-docs to make note of the new feature.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
We would likely want to switch to using this feature in our CI because
|
||||
it allows us to avoid git cloning the same puppet modules for both
|
||||
the undercloud and overcloud nodes. Simply calling the extra
|
||||
upload-puppet-modules script on the undercloud as part of our
|
||||
deployment workflow would enable the feature and allow it to be tested.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
We would need to document the additional (optional) workflow associated
|
||||
with deploying puppet modules via Swift.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* https://review.openstack.org/#/c/245314/ (Add support for DeployArtifactURLs)
|
||||
* https://review.openstack.org/#/c/245310/ (Add scripts/upload-swift-artifacts)
|
||||
* https://review.openstack.org/#/c/245172/ (tripleoclient --environment)
|
@ -1,129 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Refactor top level puppet manifests
|
||||
==========================================
|
||||
|
||||
Launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/refactor-puppet-manifests
|
||||
|
||||
The current overcloud controller puppet manifests duplicate a large amount
|
||||
of code between the pacemaker (HA) and non-HA versions. We can reduce the
|
||||
effort required to add new features by refactoring this code, and since
|
||||
there is already a puppet-tripleo module this is the logical destination.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Large amounts of puppet/manifests/overcloud\_controller.pp are shared with
|
||||
puppet/manifests/overcloud\_controller\_pacemaker.pp. When adding a feature
|
||||
or fixing a mistake in the former, it is frequently also an issue in the
|
||||
latter. This is a violation of the common programming principle of DRY, which,
|
||||
while not an inviolable rule, is usually considered good practice.
|
||||
|
||||
In addition, moving this code into separate classes in another module will
|
||||
make it simpler to enable/disable components, as it will be a matter of
|
||||
merely controlling which classes (profiles) are included.
|
||||
|
||||
Finally, it allows easier experimentation with modifying the 'ha strategy'.
|
||||
Currently this is done using 'step', but could in theory be done using a
|
||||
service registry. By refactoring into ha+non-ha classes, this would be quite
|
||||
simple to swap in/out.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
While there are significant differences in ha and non-ha deployments, in almost
|
||||
all cases the ha code will be a superset of the non-ha. A simple example of
|
||||
this is at the top of both files, where the load balancer is handled. The
|
||||
non-HA version simply includes the loadbalancing class, while the HA version
|
||||
instantiates the exact same class but with some parameters changed. Across
|
||||
the board, the same classes are included for the OpenStack services, but with
|
||||
the manage-service flag set to false in the HA case.
|
||||
|
||||
I propose first breaking up the non-ha version into profiles which can reside
|
||||
in puppet-tripleo/manifests/profile/nonha, then adding ha versions which
|
||||
use those classes under puppet-tripleo/manifests/profile/pacemaker. Pacemaker
|
||||
could be described as an 'ha strategy' which in theory should be replaceable.
|
||||
For this reason we use a pacemaker subfolder since one day perhaps we'll have
|
||||
an alternative.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
We could leave things as they are, which works and isn't the end of the world,
|
||||
but it's probably not optimal.
|
||||
|
||||
We could use kolla or something that removes the need for puppet entirely, but
|
||||
this discussion is outside the scope of this spec.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
It will make downstreams happy since they can swap classes in and out more easily.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Adding wrapper classes isn't going to impact puppet compile times very much.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Changes in t-h-t and puppet-tripleo will often be coupled, as t-h-t
|
||||
defines the data on which puppet-tripleo depends.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
michaeltchapman
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Move overcloud controller to profile classes
|
||||
* Move overcloud controller pacemaker to profile classes
|
||||
* Move any other classes from the smaller manifests in t-h-t
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
No new features are added, so current tests apply in their entirety.
|
||||
Additional testing can be added for each profile class.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
None
|
@ -1,274 +0,0 @@
|
||||
============================================================
|
||||
Library support for TripleO Overcloud Deployment Via Mistral
|
||||
============================================================
|
||||
|
||||
We need a TripleO library that supports the overcloud deployment workflow.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
TripleO has an overcloud deployment workflow that uses Heat templates and
|
||||
follows these steps:
|
||||
|
||||
* The user edits the templates and environment file. These can be stored
|
||||
anywhere.
|
||||
* Templates may be validated by Heat.
|
||||
* Templates and environment are sent to Heat for overcloud deployment.
|
||||
|
||||
This workflow is already supported by the CLI.
|
||||
|
||||
However, from a GUI perspective, although the workflow is straightforward, it is
|
||||
not simple. Here are some of the complications that arise:
|
||||
|
||||
* Some of the business logic in this workflow is contained in the CLI itself,
|
||||
making it difficult for other UIs to use.
|
||||
* If the TripleO overcloud deployment workflow changes, it is easy for the CLI
|
||||
and GUI approach to end up on divergent paths - a dangerous situation.
|
||||
* The CLI approach allows open-ended flexibility (the CLI doesn't care where
|
||||
the templates come from) that is detrimental for a GUI (the GUI user doesn't
|
||||
care where the templates are stored, but consistency in approach is desirable
|
||||
to prevent divergence among GUIs and CLIs).
|
||||
|
||||
There is a need to create common code that accommodates the flexibility of the
|
||||
CLI with the ease-of-use needs of GUI consumers.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
In order to solve this problem, we propose to create a Mistral-integrated
|
||||
deployment with the following:
|
||||
|
||||
* Encapsulate the business logic involved in the overcloud deployment workflow
|
||||
within the tripleo-common library utilizing Mistral actions and workflows.
|
||||
* Provide a simplified workflow to hide unneeded complexity from GUI consumers
|
||||
* Update the CLI to use this code where appropriate to prevent divergence with
|
||||
GUIs.
|
||||
|
||||
These points deserve further explanation. First, let us lay out the
|
||||
proposed GUI workflow.
|
||||
|
||||
1. A user pushes the Heat deployment templates into swift.
|
||||
2. The user defines values for the template resource types given by Heat
|
||||
template capabilities which are stored in an environment[1]. Note that this
|
||||
spec will be completed by mitaka at the earliest. A workaround is discussed
|
||||
below.
|
||||
3. Now that the template resource types are specified, the user can configure
|
||||
deployment parameters given by Heat. Edited parameters are updated and are
|
||||
stored in an environment. 'Roles' can still be derived from available Heat
|
||||
parameters[2].
|
||||
4. Steps 2 and 3 can be repeated.
|
||||
5. With configuration complete, the user triggers the deployment of the
|
||||
overcloud. The templates and environment file are taken from Swift
|
||||
and sent to Heat.
|
||||
6. Once overcloud deployment is complete, any needed post-deploy config is
|
||||
performed.
|
||||
|
||||
The CLI and GUI will both use the Swift workflow and store the templates into
|
||||
Swift. This would make it possible to switch to the UI from a CLI-based
|
||||
deployment and vice-versa.
|
||||
|
||||
Mistral Workflows are composed of Tasks, which group together one or more
|
||||
Actions to be executed with a Workflow Execution. The Action is implemented as
|
||||
a class with an initialization method and a run method. The run method provides
|
||||
a single execution point for Python code. Any persistence of state required for
|
||||
Actions or Workflows will be stored in a Mistral Environment object.
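
As a rough sketch of how these pieces fit together, a plan-creation Workflow
wrapping two Actions might look something like the following. The workflow and
action names (``tripleo.plan.v1.create_deployment_plan``,
``tripleo.create_container``, ``tripleo.create_plan``) are assumptions made for
this example, not the final interface.

.. code::

  ---
  version: '2.0'

  tripleo.plan.v1.create_deployment_plan:
    description: Illustrative sketch only; action names are assumptions.
    input:
      - container
    tasks:
      create_container:
        action: tripleo.create_container container=<% $.container %>
        on-success: create_plan
      create_plan:
        action: tripleo.create_plan container=<% $.container %>
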
|
||||
|
||||
In some cases, an OpenStack Service may be missing a feature needed for TripleO
|
||||
or it might only be accessible through its associated Python client. To
|
||||
mitigate this issue in the short term, some of the Actions will need to be
|
||||
executed directly with an Action Execution [3], which calls the Action directly and
|
||||
returns instantly, but also doesn't have access to the same context as a
|
||||
Workflow Execution. In theory, every action execution should be replaced by an
|
||||
OpenStack service API call.
|
||||
|
||||
Below is a summary of the intended Workflows and Actions to be executed from the
|
||||
CLI or the GUI using the python-mistralclient or Mistral API. There may be
|
||||
additional actions or library code necessary to enable these operations; these
|
||||
are not intended to be consumed directly.
|
||||
|
||||
Workflows:
|
||||
|
||||
* Node Registration
|
||||
* Node Introspection
|
||||
* Plan Creation
|
||||
* Plan Deletion
|
||||
* Deploy
|
||||
* Validation Operations
|
||||
|
||||
Actions:
|
||||
|
||||
* Plan List
|
||||
* Get Capabilities
|
||||
* Update Capabilities
|
||||
* Get Parameters
|
||||
* Update Parameters
|
||||
* Roles List
|
||||
|
||||
For Flavors and Image management, the Nova and Glance APIs will be used
|
||||
respectively.
|
||||
|
||||
The registration and introspection of nodes will be implemented within a
|
||||
Mistral Workflow. The logic is currently in tripleoclient and will be ported,
|
||||
as certain node configurations are specified as part of the logic (ramdisk,
|
||||
kernel names, etc.) so the user does not have to specify those. Tagging,
|
||||
listing and deleting nodes will happen via the Ironic/Inspector APIs as
|
||||
appropriate.
|
||||
|
||||
A deployment plan consists of a collection of heat templates in a Swift
|
||||
container, combined with data stored in a Mistral Environment. When the plan is
|
||||
first created, the capabilities map data will be parsed and stored in the
|
||||
associated Mistral Environment. The templates will need to be uploaded to a
|
||||
Swift container with the same name as the stack to be created. While any user
|
||||
could use a raw POST request to accomplish this, the GUI and CLI will provide
|
||||
convenience functions to improve the user experience. The convenience functions
|
||||
will be implemented in an Action that can be used directly or included in a
|
||||
Workflow.
|
||||
|
||||
The deletion of a plan will be implemented in a Workflow to ensure there isn't
|
||||
an associated stack before deleting the templates, container and Mistral
|
||||
Environment. Listing the plans will be accomplished by calling
|
||||
'mistral environment-list'.
|
||||
|
||||
To get a list of the available Heat environment files with descriptions and
|
||||
constraints, the library will have an Action that returns the information about
|
||||
capabilities added during plan creation and identifies which Heat environment
|
||||
files have already been selected. There will also be an action that accepts a
|
||||
list of user-selected Heat environment files and stores the information in the
|
||||
Mistral Environment. It would be inconvenient to use a Workflow for these
|
||||
actions as they just read or update the Mistral Environment and do not require
|
||||
additional logic.
|
||||
|
||||
The identification of Roles will be implemented in a Workflow that calls out to
|
||||
Heat.
|
||||
|
||||
To obtain the deployment parameters, Actions will be created that will call out
|
||||
to heat with the required template information to obtain the parameters and set
|
||||
the parameter values in the Mistral Environment.
|
||||
|
||||
To perform TripleO validations, Workflows and associated Actions will be created
|
||||
to support list, start, stop, and results operations. See the spec [4] for more
|
||||
information on how the validations will be implemented with Mistral.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
One alternative is to force non-CLI UIs to re-implement the business logic
|
||||
currently contained within the CLI. This is not a good alternative. Another
|
||||
possible alternative would be to create a REST API [5] to abstract TripleO
|
||||
deployment logic, but it would require considerably more effort to create and
|
||||
maintain, and has been discussed at length on the mailing list [6][7].
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
The --templates workflow will end up being modified to use the updated
|
||||
tripleo-common library.
|
||||
|
||||
Integrating with Mistral is a straightforward process and this may result in
|
||||
increased usage.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Rather than writing workflow code in python-tripleoclient directly, developers will
|
||||
now create Mistral Actions and Workflows that help implement the requirements.
|
||||
|
||||
Right now, changing the overcloud deployment workflow results in stress due to
|
||||
the need to individually update both the CLI and GUI code. Converging the two
|
||||
makes this a far easier proposition. However developers will need to have this
|
||||
architecture in mind and ensure that changes to the --templates or --plan
|
||||
workflow are maintained in the tripleo-common library (when appropriate) to
|
||||
avoid unneeded divergences.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignees:
|
||||
|
||||
* rbrady
|
||||
* jtomasek
|
||||
* dprince
|
||||
|
||||
Work Items
|
||||
----------
|
||||
The work items required are:
|
||||
|
||||
* Develop the tripleo-common Mistral actions that provide all of the
|
||||
functionality required for our deployment workflows.
|
||||
* This involves moving much of the code out of python-tripleoclient and into
|
||||
generic, narrowly focused Mistral actions that can be consumed via the
|
||||
Mistral API.
|
||||
* Create new Mistral workflows to help with high level things like deployment,
|
||||
introspection, node registration, etc.
|
||||
* tripleo-common is more of an internal library, and its logic is meant to be
|
||||
consumed (almost) solely by using Mistral
|
||||
actions. As much as possible, projects should not circumvent the API by using
|
||||
tripleo-common as a library directly.
|
||||
There may be some exceptions to this for common polling functions, etc., but in
|
||||
general all core workflow logic should be API-driven.
|
||||
* Update the CLI to consume these Mistral actions directly via
|
||||
python-mistralclient.
|
||||
|
||||
All patches that implement these changes must pass CI and add additional tests
|
||||
as needed.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The TripleO CI should be updated to test the updated tripleo-common library.
|
||||
|
||||
Our intent is to make tripleoclient consume Mistral actions as we write them.
|
||||
Because all of the existing upstream TripleO CI relies on tripleoclient, taking
|
||||
this approach ensures that all of our workflow actions always work. This
|
||||
should get us coverage on 90% of the Mistral actions and workflows and allow us
|
||||
to proceed with the implementation iteratively/quickly. Once the UI is installed
|
||||
and part of our upstream CI, we can also rely on coverage there to ensure we
|
||||
don't have breakages.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Mistral Actions and Workflows are sort of self-documenting and can be easily
|
||||
introspected by running 'mistral workflow-list' or 'mistral action-list' on the
|
||||
command line. The updated library, however, will have to be well-documented and
|
||||
meet OpenStack standards. Documentation will be needed in both the
|
||||
tripleo-common and tripleo-docs repositories.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] https://specs.openstack.org/openstack/heat-specs/specs/mitaka/resource-capabilities.html
|
||||
|
||||
[2] https://specs.openstack.org/openstack/heat-specs/specs/liberty/nested-validation.html
|
||||
|
||||
[3] http://docs.openstack.org/developer/mistral/terminology/executions.html
|
||||
|
||||
[4] https://review.openstack.org/#/c/255792/
|
||||
|
||||
[5] http://specs.openstack.org/openstack/tripleo-specs/specs/mitaka/tripleo-overcloud-deployment-library.html
|
||||
|
||||
[6] http://lists.openstack.org/pipermail/openstack-dev/2016-January/083943.html
|
||||
|
||||
[7] http://lists.openstack.org/pipermail/openstack-dev/2016-January/083757.html
|
||||
|
@ -1,244 +0,0 @@
|
||||
================================================
|
||||
Library support for TripleO Overcloud Deployment
|
||||
================================================
|
||||
|
||||
We need a TripleO library that supports the overcloud deployment workflow.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
With Tuskar insufficient for complex overcloud deployments, TripleO has moved to
|
||||
an overcloud deployment workflow that bypasses Tuskar. This workflow can be
|
||||
summarized as follows:
|
||||
|
||||
* The user edits the templates and environment file. These can be stored
|
||||
anywhere.
|
||||
* Templates may be validated by Heat.
|
||||
* Templates and environment are sent to Heat for overcloud deployment.
|
||||
* Post-deploy, overcloud endpoints are configured.
|
||||
|
||||
This workflow is already supported by the CLI.
|
||||
|
||||
However from a GUI perspective, although the workflow is straightforward, it is
|
||||
not simple. Here are some of the complications that arise:
|
||||
|
||||
* Some of the business logic in this workflow is contained in the CLI itself,
|
||||
making it difficult for other UIs to use.
|
||||
* If the TripleO overcloud deployment workflow changes, it is easy for the CLI
|
||||
and GUI approach to end up on divergent paths - a dangerous situation.
|
||||
* The CLI approach allows open-ended flexibility (the CLI doesn't care where the
|
||||
templates come from) that is detrimental for a GUI (the GUI user doesn't care
|
||||
where the templates are stored, but consistency in approach is desirable to
|
||||
prevent divergence among GUIs).
|
||||
|
||||
There is a need to create common code that accommodates the flexibility of the
|
||||
CLI with the ease-of-use needs of Python-based GUI consumers. Note that an API
|
||||
will eventually be needed in order to accommodate non-Python GUIs. The work
|
||||
there will be detailed in a separate spec.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
In order to solve this problem, we propose the following:
|
||||
|
||||
* Encapsulate the business logic involved in the overcloud deployment workflow
|
||||
within the tripleo-common library.
|
||||
* Provide a simplified workflow to hide unneeded complexity from GUI consumers
|
||||
- for example, template storage.
|
||||
* Update the CLI to use this code where appropriate to prevent divergence with
|
||||
GUIs.
|
||||
|
||||
The first two points deserve further explanation. First, let us lay out the
|
||||
proposed GUI workflow. We will refer to the Heat files the user desires to use
|
||||
for the overcloud deployment as a 'plan'.
|
||||
|
||||
1. A user creates a plan by pushing a copy of the Heat deployment templates into
|
||||
a data store.
|
||||
2. The user defines values for the template resource types given by Heat
|
||||
template capabilities. This results in an updated resource registry in an
|
||||
environment file saved to the data store.
|
||||
(https://review.openstack.org/#/c/196656/7/specs/liberty/resource-capabilities.rst)
|
||||
Note that this spec will be completed by mitaka at the earliest. A
|
||||
workaround is discussed below.
|
||||
3. Now that the template resource types are specified, the user can configure
|
||||
deployment parameters given by Heat. Edited parameters are updated and an
|
||||
updated environment file is saved to the data store. 'Roles' no longer exist
|
||||
in Tuskar, but can still be derived from available Heat parameters.
|
||||
(https://review.openstack.org/#/c/197199/5/specs/liberty/nested-validation.rst)
|
||||
4. Steps 2 and 3 can be repeated.
|
||||
5. With configuration complete, the user triggers the deployment of the
|
||||
overcloud. The templates and environment file are taken from the data store
|
||||
and sent to Heat.
|
||||
6. Once overcloud deployment is complete, any needed post-deploy config is
|
||||
performed.
|
||||
|
||||
In order to fulfill this workflow, we propose to initially promote the use of
|
||||
Swift as the template data store. This usage will be abstracted away behind
|
||||
the tripleo-common library, and later updates may allow the use of other data
|
||||
stores.
|
||||
|
||||
Note that the Swift workflow is intended to be an alternative to the current CLI
|
||||
'--templates' workflow. Both would end up being options under the CLI; a user
|
||||
could choose '--templates' or '--plan'. However, they would both be backed by
|
||||
common tripleo-common library code, with the '--plan' option simply calling
|
||||
additional functions to pull the plan information from Swift. And GUIs that
|
||||
expect a Swift-backed deployment would lose functionality if the overcloud
|
||||
is deployed using the '--templates' CLI workflow.
|
||||
|
||||
The tripleo-common library functions needed are:
|
||||
|
||||
* **Plan CRUD**
|
||||
|
||||
* **create_plan(plan_name, plan_files)**: Creates a plan by creating a Swift
|
||||
container matching plan_name, and placing all files needed for that plan
|
||||
into that container (for Heat that would be the 'parent' templates, nested
|
||||
stack templates, environment file, etc). The Swift container will be
|
||||
created with object versioning active to allow for versioned updates.
|
||||
* **get_plan(plan_name)**: Retrieves the Heat templates and environment file
|
||||
from the Swift container matching plan_name.
|
||||
* **update_plan(plan_name, plan_files)**: Updates a plan by updating the
|
||||
plan files in the Swift container matching plan_name. This may necessitate
|
||||
an update to the environment file to add and/or remove parameters. Although
|
||||
updates are versioned, retrieval of past versions will not be implemented
|
||||
until a later iteration.
|
||||
* **delete_plan(plan_name)**: Deletes a plan by deleting the Swift container
|
||||
matching plan_name, but only if there is no deployed overcloud that was
|
||||
deployed with the plan.
|
||||
|
||||
* **Deployment Options**
|
||||
|
||||
* **get_deployment_plan_resource_types(plan_name)**: Determine available
|
||||
template resource types by retrieving plan_name's templates from Swift and
|
||||
using the proposed Heat resource-capabilities API
|
||||
(https://review.openstack.org/#/c/196656/7/specs/liberty/resource-capabilities.rst).
|
||||
If that API is not ready in the required timeframe, then we will implement
|
||||
a temporary workaround - a manually created map between templates and
|
||||
provider resources. We would work closely with the spec developers to try
|
||||
and ensure that the output of this method matches their proposed output, so
|
||||
that once their API is ready, replacement is easy.
|
||||
* **update_deployment_plan_resource_types(plan_name, resource_types)**:
|
||||
Retrieve plan_name's environment file from Swift and update the
|
||||
resource_registry tree according to the values passed in by resource_types.
|
||||
Then update the environment file in Swift.
|
||||
|
||||
* **Deployment Configuration**
|
||||
|
||||
* **get_deployment_parameters(plan_name)**: Determine available deployment
|
||||
parameters by retrieving plan_name's templates from Swift and using the
|
||||
proposed Heat nested-validation API call
|
||||
(https://review.openstack.org/#/c/197199/5/specs/liberty/nested-validation.rst).
|
||||
* **update_deployment_parameters(plan_name, deployment_parameters)**:
|
||||
Retrieve plan_name's environment file from Swift and update the parameters
|
||||
according to the values passed in by deployment_parameters. Then update the
|
||||
environment file in Swift (see the sketch after this list).
|
||||
* **get_deployment_roles(plan_name)**: Determine available deployment roles.
|
||||
This can be done by retrieving plan_name's deployment parameters and
|
||||
deriving available roles from parameter names, or by looking at the
|
||||
top-level ResourceGroup types.
|
||||
|
||||
* **Deployment**
|
||||
|
||||
* **validate_plan(plan_name)**: Retrieve plan_name's templates and environment
|
||||
file from Swift and use them in a Heat API validation call.
|
||||
* **deploy_plan(plan_name)**: Retrieve plan_name's templates and environment
|
||||
file from Swift and use them in a Heat API call to create the overcloud
|
||||
stack. Perform any needed pre-processing of the templates, such as the
|
||||
template file dictionary needed by Heat. This function will return a Heat
|
||||
stack ID that can be used to monitor the status of the deployment.
|
||||
|
||||
* **Post-Deploy**
|
||||
|
||||
* **postdeploy_plan(plan_name)**: Initialize the API endpoints of the
|
||||
overcloud corresponding to plan_name.
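
For reference, the environment file these functions read and update is an
ordinary Heat environment stored in the plan's Swift container. A minimal
sketch follows; the registry entry, parameter names and values shown are
illustrative examples only, not the final interface.

.. code::

  # Illustrative plan environment; entries are examples only.
  resource_registry:
    OS::TripleO::Controller::Ports::InternalApiPort: network/ports/internal_api.yaml
  parameters:
    ControllerCount: 3
    NeutronPublicInterface: nic1
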
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The alternative is to force non-CLI UIs to re-implement the business logic
|
||||
currently contained within the CLI. This is not a good alternative.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
The --templates workflow will end up being modified to use the updated
|
||||
tripleo-common library.
|
||||
|
||||
Python-based code would find it far easier to adopt the TripleO method of
|
||||
deployment. This may result in increased usage.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Right now, changing the overcloud deployment workflow is burdensome because the
CLI and GUI code must be updated individually. Converging the two makes this a
far easier proposition. However, developers will need to have this
|
||||
architecture in mind and ensure that changes to the --templates or --plan
|
||||
workflow are maintained in the tripleo-common library (when appropriate) to
|
||||
avoid unneeded divergences.
|
||||
|
||||
Another important item to note is that we will need to keep the TripleO CI
|
||||
updated with changes, and will be responsible for fixing the CI as needed.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignees:
|
||||
|
||||
* tzumainn
|
||||
* akrivoka
|
||||
* jtomasek
|
||||
* dmatthews
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
The work items required are:
|
||||
|
||||
* Develop the tripleo-common library to provide the functionality described
|
||||
above. This also involves moving code from the CLI to tripleo-common.
|
||||
* Update the CLI to use the tripleo-common library.
|
||||
|
||||
All patches that implement these changes must pass CI and add additional tests as
|
||||
needed.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
We are dependent upon two Heat specs:
|
||||
|
||||
* Heat resource-capabilities API
|
||||
(https://review.openstack.org/#/c/196656/7/specs/liberty/resource-capabilities.rst)
|
||||
* Heat nested-validation API
|
||||
(https://review.openstack.org/#/c/197199/5/specs/liberty/nested-validation.rst)
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The TripleO CI should be updated to test the updated tripleo-common library.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The updated library with its Swift-backed workflow will have to be well-
|
||||
documented and meet OpenStack standards. Documentation will be needed in both
|
||||
the tripleo-common and tripleo-docs repositories.
|
||||
|
||||
References
|
||||
==========
|
@ -1,140 +0,0 @@
|
||||
==================
|
||||
TripleO Quickstart
|
||||
==================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-quickstart
|
||||
|
||||
We need a common way for developers/CI systems to quickly stand up a virtual
|
||||
environment.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The tool we currently document for this use case is instack-virt-setup.
|
||||
However, this tool has two major issues and some missing features:
|
||||
|
||||
* There is no upstream CI using it. This means we have no way to test changes
|
||||
other than manually. This is a huge barrier to adding the missing features.
|
||||
|
||||
* It relies on a maze of bash scripts in the incubator repository[1] in order
|
||||
to work. This is a barrier to new users, as it can take quite a bit of time
|
||||
to find and then navigate that maze.
|
||||
|
||||
* It has no way to use a pre-built undercloud image instead of starting from
|
||||
scratch and redoing the same work that CI and every other tripleo developer
|
||||
is doing on every run. Starting from a pre-built undercloud with overcloud
|
||||
images prebaked can be a significant time savings for both CI systems and
developer test environments.
|
||||
|
||||
* It has no way to create this undercloud image either.
|
||||
|
||||
* There are other smaller missing features like automatically tagging the fake
|
||||
baremetals with profile capability tags via instackenv.json. These would not
|
||||
be too painful to implement, but without CI even small changes carry some
|
||||
amount of pain.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
* Import the tripleo-quickstart[2] tool that RDO is using for this purpose.
|
||||
This project is a set of ansible roles that can be used to build an
|
||||
undercloud.qcow2, or alternatively to consume it. It was patterned after
|
||||
instack-virt-setup, and anything configurable via instack-virt-setup is
|
||||
configurable in tripleo-quickstart.
|
||||
|
||||
* Use third-party CI for self-gating this new project. In order to set up an
environment similar to how developers and users can use this tool, we need
a baremetal host. The CI that currently self-gates this project is set up on
|
||||
ci.centos.org[3], and setting this up as third party CI would not be hard.
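
For illustration, consuming a pre-built undercloud image with an ansible-based
tool of this kind could look roughly like the playbook below; the role and
variable names are assumptions made for this sketch, not the actual
tripleo-quickstart interface::

    # quickstart.yml -- illustrative sketch only
    - hosts: virthost
      vars:
        # assumed variable: where to fetch the pre-built undercloud image
        undercloud_image_url: http://images.example.com/undercloud.qcow2
      roles:
        - libvirt-setup      # create the virtual "baremetal" environment
        - undercloud-deploy  # boot the pre-built undercloud.qcow2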
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
* One alternative is to keep using instack-virt-setup for this use case.
|
||||
However, we would still need to add CI for instack-virt-setup. This would
|
||||
still need to be outside of tripleoci, since it requires a baremetal host.
|
||||
Unless someone is volunteering to set that up, this is not really a viable
|
||||
alternative.
|
||||
|
||||
* Similarly, we could use some other method for creating virtual environments.
|
||||
However, this alternative is similarly constrained by needing third-party CI
|
||||
for validation.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Using a pre-built undercloud.qcow2 drastically simplifies the virt-setup
|
||||
instructions, and therefore is less error prone. This should lead to a better
|
||||
new user experience of TripleO.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Using a pre-built undercloud.qcow2 will shave 30+ minutes from the CI
|
||||
gate jobs.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
There is no reason this same undercloud.qcow2 could not be used to deploy
|
||||
real baremetal environments. There have been many production deployments of
|
||||
TripleO that have used a VM undercloud.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
The undercloud.qcow2 approach makes it much easier and faster to reproduce
|
||||
exactly what is run in CI. This leads to a much better developer experience.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignees:
|
||||
|
||||
* trown
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Import the existing work from the RDO community to the openstack namespace
|
||||
under the TripleO umbrella.
|
||||
|
||||
* Set up third-party CI running in ci.centos.org to self-gate this new project.
|
||||
(We can just update the current CI[3] to point at the new upstream location)
|
||||
|
||||
* Documentation will need to be updated for the virtual environment setup.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Currently, the only undercloud.qcow2 available is built in RDO. We would
|
||||
either need to build one in tripleo-ci, or use the one built in RDO.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
We need a way to CI the virtual environment setup. This is not feasible within
|
||||
tripleoci, since it requires a baremetal host machine. We will need to rely on
|
||||
third party CI for this.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Overall this will be a major simplification of the documentation.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] https://github.com/openstack/tripleo-incubator/tree/master/scripts
|
||||
[2] https://github.com/redhat-openstack/tripleo-quickstart
|
||||
[3] https://ci.centos.org/view/rdo/job/tripleo-quickstart-gate-mitaka-delorean-minimal/
|
@ -1,175 +0,0 @@
|
||||
==========
|
||||
TripleO UI
|
||||
==========
|
||||
|
||||
We need a graphical user interface that will support deploying OpenStack using
|
||||
TripleO.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Tuskar-UI, the only currently existing GUI capable of TripleO deployments, has
|
||||
several significant issues.
|
||||
|
||||
Firstly, its back-end relies on an obsolete version of the Tuskar API, which is
|
||||
insufficient for complex overcloud deployments.
|
||||
|
||||
Secondly, it is implemented as a Horizon plugin and placed under the Horizon
|
||||
umbrella, which has proven to be suboptimal, for several reasons:
|
||||
|
||||
* The placement under the Horizon program. In order to be able to develop the
|
||||
Tuskar-UI, one needs deep familiarity with both Horizon and TripleO projects.
|
||||
Furthermore, in order to be able to approve patches, one needs to be a
|
||||
Horizon core reviewer. This restriction drastically reduces the number of
people who can contribute and makes it hard for Tuskar-UI developers to
actually land code.
|
||||
|
||||
* The complexity of the Horizon Django application. Horizon is a very complex
|
||||
heavyweight application composed of many OpenStack services. It has become
very large and inflexible, and consists of several unnecessary middle layers. As
|
||||
a result of this, we have been witnessing the emergence of several new GUIs
|
||||
implemented as independent (usually fully client-side JavaScript) applications,
|
||||
rather than as Horizon plugins. Ironic webclient[1] is one such example. This
|
||||
downside of Horizon has been recognized and an attempt to address it is
|
||||
described in the next point.
|
||||
|
||||
* The move to Angular JS (version 1). In an attempt to address the issues listed
|
||||
above, the Horizon community decided to rewrite it in Angular JS. However,
|
||||
instead of doing a total rewrite, they opted for a more gradual approach,
|
||||
resulting in even more middle layers (the original Django layer turned into an
|
||||
API for the Angular-based front end). Although the intention is to eventually
|
||||
get rid of the unwanted layers, the move is happening very slowly. In
|
||||
addition, this rewrite of Horizon is to AngularJS version 1, which may soon
|
||||
become obsolete, with version 2 just around the corner. This probably means
|
||||
another complete rewrite in the not too distant future.
|
||||
|
||||
* Packaging issues. The move to AngularJS brought along a new set of issues
|
||||
related to the poor state of packaging of nodejs based tooling in all major
|
||||
Linux distributions.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
In order to address the need for a TripleO based GUI, while avoiding the issues
|
||||
listed above, we propose introducing a new GUI project, *TripleO UI*, under the
|
||||
TripleO program.
|
||||
|
||||
As it is a TripleO specific UI, TripleO GUI will be placed under the TripleO
|
||||
program, which will bring it to attention of TripleO reviewers and allow
|
||||
TripleO core reviewers to approve patches. This should facilitate the code
|
||||
contribution process.
|
||||
|
||||
TripleO UI will be a web UI designed for overcloud deployment and
|
||||
management. It will be a lightweight, independent client-side application,
|
||||
designed for flexibility, adaptability and reusability.
|
||||
|
||||
TripleO UI will be a fully client-side JavaScript application. It will be
|
||||
stateless and contain no business logic. It will consume the TripleO REST API[2],
|
||||
which will expose the overcloud deployment workflow business logic implemented
|
||||
in the tripleo-common library[3]. As opposed to the previous architecture which
|
||||
included many unwanted middle layers, this one will be very simple, consisting
|
||||
only of the REST API serving JSON, and the client-side JavaScript application
|
||||
consuming it.
|
||||
|
||||
The development stack will consist of ReactJS[4] and Flux[5]. We will use ReactJS
|
||||
to implement the web UI components, and Flux for architecture design.
|
||||
|
||||
Due to the packaging problems described above, we will not provide any packages
|
||||
for the application for now. We will simply make the code available for use.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The alternative is to keep developing Tuskar-UI under the Horizon umbrella. In
|
||||
addition to all the problems outlined above, this approach would also mean a
|
||||
complete re-write of Tuskar-UI back-end to make it use the new tripleo-common
|
||||
library.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
This proposal introduces a brand new application; all the standard security
|
||||
concerns which come with building a client-side web application apply.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
We plan to build a standalone web UI which will be capable of deploying
|
||||
OpenStack with TripleO. Since as of now no such GUIs exist, this can be a huge
|
||||
boost for adoption of TripleO.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The proposed technology stack, ReactJS and Flux, has excellent performance
|
||||
characteristics. TripleO UI should be a lightweight, fast, flexible application.
|
||||
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Right now, development on Tuskar-UI is uncomfortable for the reasons
|
||||
detailed above. This proposal should result in more comfortable development
|
||||
as it logically places TripleO UI under the TripleO program, which brings
|
||||
it under the direct attention of TripleO developers and core reviewers.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignees:
|
||||
|
||||
* jtomasek
|
||||
* flfuchs
|
||||
* jrist
|
||||
* <TBD person with JS & CI skills>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
This is a general proposal regarding the adoption of a new graphical user
|
||||
interface under the TripleO program. The implementation of specific features
|
||||
will be covered in subsequent proposals.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
We are dependent upon the creation of the TripleO REST API[2], which in turn
|
||||
depends on the tripleo-common[3] library containing all the functionality
|
||||
necessary for advanced overcloud deployment.
|
||||
|
||||
Alternatively, using Mistral to provide a REST API, instead of building a new
|
||||
API, is currently being investigated as another option.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
TripleO UI should be thoroughly tested, including unit tests and integration
|
||||
tests. Every new feature and bug fix should be accompanied by appropriate tests.
|
||||
|
||||
The TripleO CI should be updated to test the TripleO UI.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
TripleO UI will have to be well-documented and meet OpenStack standards.
|
||||
We will need both developer and deployment documentation. Documentation will
|
||||
live in the tripleo-docs repository.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] https://github.com/openstack/ironic-webclient
|
||||
[2] https://review.openstack.org/#/c/230432
|
||||
[3] http://specs.openstack.org/openstack/tripleo-specs/specs/mitaka/tripleo-overcloud-deployment-library.html
|
||||
[4] https://facebook.github.io/react/
|
||||
[5] https://facebook.github.io/flux/
|
@ -1,220 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
====================================
|
||||
Metal to Tenant: Ironic in Overcloud
|
||||
====================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/ironic-integration
|
||||
|
||||
This blueprint adds support for providing bare metal machines to tenants by
|
||||
integrating Ironic to the overcloud.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
There is an increasing interest in providing bare metal machines to tenants in
|
||||
the overcloud in addition to or instead of virtual instances. One example is
|
||||
Sahara: users hope to achieve better performance by removing the hypervisor
|
||||
abstraction layer in order to eliminate the noisy neighbor effect. For that
|
||||
purpose, the OpenStack Bare Metal service (Ironic) provides an API and a Nova
driver to serve bare metal instances behind the same Nova and Neutron APIs.
|
||||
Currently, however, TripleO does not support installing and configuring Ironic
|
||||
and Nova to serve bare metal instances to the tenant.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Composable Services
|
||||
-------------------
|
||||
|
||||
In the bare metal deployment case, the nova-compute service is only a thin
|
||||
abstraction layer around the Ironic API. The actual compute instances in
|
||||
this case are the bare metal nodes. Thus a TripleO deployment with support for
|
||||
only bare metal nodes will not need dedicated compute nodes in the overcloud.
|
||||
The overcloud nova-compute service will therefore be placed on controller nodes.
|
||||
|
||||
New TripleO composable services will be created and optionally deployed on the
|
||||
controller nodes:
|
||||
|
||||
* ``OS::TripleO::Services::IronicApi`` will deploy the bare metal API.
|
||||
|
||||
* ``OS::TripleO::Services::IronicNovaCompute`` will deploy nova compute
|
||||
with Ironic as a back end. It will also configure the nova compute to use
|
||||
`ClusteredComputeManager
|
||||
<https://github.com/openstack/ironic/blob/master/ironic/nova/compute/manager.py>`_
|
||||
provided by Ironic to work around the inability to have several nova compute
|
||||
instances configured with Ironic.
|
||||
|
||||
* ``OS::TripleO::Services::IronicConductor`` will deploy a TFTP server,
|
||||
an HTTP server (for an optional iPXE environment) and an ironic-conductor
|
||||
instance. The ironic-conductor instance will not be managed by pacemaker
|
||||
in the HA scenario, as Ironic has its own Active/Active HA model,
|
||||
which spreads load on all active conductors using a hash ring.
|
||||
|
||||
There is no public data on how many bare metal nodes each conductor
|
||||
can handle, but the Ironic team expects on the order of hundreds of nodes
|
||||
per conductor.
|
||||
|
||||
Since this feature is not a requirement in all deployments, this will be
|
||||
opt-in by having a separate environment file.
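
A hedged sketch of such an opt-in environment file follows; the template paths
are assumptions made for illustration, and only the service names above are
defined by this spec::

    # environments/services/ironic.yaml -- illustrative sketch
    resource_registry:
      OS::TripleO::Services::IronicApi: ../puppet/services/ironic-api.yaml
      OS::TripleO::Services::IronicConductor: ../puppet/services/ironic-conductor.yaml
      OS::TripleO::Services::IronicNovaCompute: ../puppet/services/ironic-nova-compute.yaml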
|
||||
|
||||
Hybrid Deployments
|
||||
------------------
|
||||
|
||||
For hybrid deployments with both virtual and bare metal instances, we will use
|
||||
Nova host aggregates: one for all bare metal hosts, the other for all virtual
|
||||
compute nodes. This will prevent virtual instances from being deployed on bare metal
|
||||
nodes. Note that every bare metal machine is presented as a separate
|
||||
Nova compute host. These host aggregates will always be created, even for
|
||||
purely bare metal deployments, as users might want to add virtual computes
|
||||
later.
|
||||
|
||||
Networking
|
||||
----------
|
||||
|
||||
As of Mitaka, Ironic only supports flat networking for all tenants and for
|
||||
provisioning. The **recommended** deployment layout will consist of two networks:
|
||||
|
||||
* The ``provisioning`` / ``tenant`` network. It must have access to the
|
||||
overcloud Neutron service for DHCP, and to overcloud baremetal-conductors
|
||||
for provisioning.
|
||||
|
||||
.. note:: While this network can technically be the same as the undercloud
|
||||
provisioning network, it's not recommended to do so due to
|
||||
potential conflicts between various DHCP servers provided by
|
||||
Neutron (and in the future by ironic-inspector).
|
||||
|
||||
* The ``management`` network. It will contain the BMCs of bare metal nodes,
|
||||
and it only needs access to baremetal-conductors. No tenant access will be
|
||||
provided to this network.
|
||||
|
||||
.. note:: Splitting away this network is not really required if tenants are
|
||||
trusted (which is assumed in this spec) and BMC access is
|
||||
reasonably restricted.
|
||||
|
||||
Limitations
|
||||
-----------
|
||||
|
||||
To limit the scope of this spec, the following useful features are
|
||||
explicitly left out for now:
|
||||
|
||||
* ``provision`` <-> ``tenant`` network separation (not yet implemented by
|
||||
ironic)
|
||||
|
||||
* in-band inspection (requires ironic-inspector, which is not yet HA-ready)
|
||||
|
||||
* untrusted tenants (requires configuring secure boot and checking firmwares,
|
||||
which is vendor-dependent)
|
||||
|
||||
* node autodiscovery (depends on ironic-inspector)
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Alternatively, we could leave configuring a metal-to-tenant environment up to
|
||||
the operator.
|
||||
|
||||
We could also have it enabled by default, but most likely it won't be required
|
||||
in most deployments.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Most of the security implications have to be handled within Ironic, e.g. wiping
the hard disk, checking firmware, etc. Ironic needs to be configured to be
able to run these jobs by enabling automatic cleaning during the node lifecycle.
|
||||
It is also worth mentioning that we will assume trusted tenants for these bare
|
||||
metal machines.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
The ability to deploy Ironic in the overcloud will be optional.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
If enabled, TripleO will deploy additional services to the overcloud:
|
||||
|
||||
* ironic-conductor
|
||||
|
||||
* a TFTP server
|
||||
|
||||
* an HTTP server
|
||||
|
||||
None of these should have heavy performance requirements.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
ifarkas
|
||||
|
||||
Other contributors:
|
||||
dtantsur, lucasagomes, mgould, mkovacik
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
When the environment file is included, make sure:
|
||||
|
||||
* ironic is deployed on baremetal-conductor nodes
|
||||
|
||||
* nova compute is deployed and correctly configured, including:
|
||||
|
||||
* configuring Ironic as a virt driver
|
||||
|
||||
* configuring ClusteredComputeManager
|
||||
|
||||
* setting ram_allocation_ratio to 1.0
|
||||
|
||||
* host aggregates are created
|
||||
|
||||
* update documentation
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
This is testable in the CI with nested virtualization and tests will be added
|
||||
to the tripleo-ci jobs.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
* Quick start documentation and a sample environment file will be provided.
|
||||
|
||||
* Document how to enroll new nodes in overcloud ironic (including host
|
||||
aggregates)
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* `Host aggregates <https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/4/html/Configuration_Reference_Guide/host-aggregates.html>`_
|
@ -1,197 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
====================================
|
||||
Add Adapter Teaming to os-net-config
|
||||
====================================
|
||||
|
||||
https://blueprints.launchpad.net/os-net-config/+spec/os-net-config-teaming
|
||||
|
||||
This spec describes adding features to os-net-config to support adapter teaming
|
||||
as an option for bonded interfaces. Adapter teaming allows additional features
|
||||
over regular bonding, due to the use of the teaming agent.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
os-net-config supports both OVS bonding and Linux kernel bonding, but some
|
||||
users want to use adapter teaming instead of bonding. Adapter teaming provides
|
||||
additional options that bonds don't support, while still supporting almost all
of the options that bonds provide.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Add a new class similar to the existing bond classes that allows for the
|
||||
configuration of the teamd daemon through teamdctl. The syntax for the
|
||||
configuration of the teams should be functionally similar to configuring
|
||||
bonds.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
We already have two bonding methods in use, the Linux bonding kernel module,
|
||||
and Open vSwitch. However, adapter teaming is becoming a best practice, and
|
||||
this change will open up that possibility.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The end result of using teaming instead of other modes of bonding should be
|
||||
the same from a security standpoint. Adapter teaming does not interfere with
|
||||
iptables or selinux.
|
||||
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Operators who are troubleshooting a deployment where teaming is used may need
|
||||
to familiarize themselves with the teamdctl utility.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Using teaming rather than bonding will have a mostly positive impact on
|
||||
performance. Teaming is very lightweight, and may use less CPU than other
|
||||
bonding modes, especially OVS. Teaming has the following impacts:
|
||||
|
||||
* Fine-grained control over load balancing hashing algorithms.
|
||||
|
||||
* Port priorities and stickiness.
|
||||
|
||||
* Per-port monitoring.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
In TripleO, os-net-config has existing sample templates for OVS-mode
|
||||
bonds and Linux bonds. There has been some discussion with Dan Prince
|
||||
about unifying the bonding templates in the future.
|
||||
|
||||
The type of bond could be set as a parameter in the NIC config
|
||||
templates. To this end, it probably makes sense to make the teaming
|
||||
configuration as similar to the bonding configurations as possible.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
The configuration should be kept as similar to the bonding
|
||||
configuration as possible. In fact, it might be treated as a different
|
||||
form of bond, as long as the required metadata for teaming can be
|
||||
provided in the options.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Dan Sneddon <dsneddon@redhat.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Add teaming object and unit tests.
|
||||
|
||||
* Configure sample templates to demonstrate usage of teaming.
|
||||
|
||||
* Test TripleO with new version of os-net-config and adapter teaming configured.
|
||||
|
||||
Configuration Example
|
||||
---------------------
|
||||
|
||||
The following is an example of a teaming configuration that os-net-config
|
||||
should be able to implement::
|
||||
|
||||
    -
      type: linux_team
      name: team0
      bonding_options: '{"runner": {"name": "activebackup"}, "link_watch": {"name": "ethtool"}}'
      addresses:
        -
          ip_subnet: 192.168.0.10/24
      members:
        -
          type: interface
          name: eno2
          primary: true
        -
          type: interface
          name: eno3
|
||||
|
||||
The only difference between a Linux bond configuration and an adapter team
|
||||
configuration in the above example is the type (linux_team), and the content
|
||||
of the bonding_options (bonding has a different format for options).
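
For comparison, a sketch of a Linux bond carrying the same addresses and
members is shown below; the bonding option string is only an example value::

    -
      type: linux_bond
      name: bond0
      bonding_options: "mode=active-backup miimon=100"
      addresses:
        -
          ip_subnet: 192.168.0.10/24
      members:
        -
          type: interface
          name: eno2
          primary: true
        -
          type: interface
          name: eno3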
|
||||
|
||||
Implementation Details
|
||||
----------------------
|
||||
|
||||
os-net-config will have to configure the ifcfg files for the team. The ifcfg
|
||||
format for team interfaces is documented here [1].
|
||||
|
||||
If an interface is marked as primary, then the ifcfg file for that interface
|
||||
should list it at a higher than default (0) priority::
|
||||
|
||||
TEAM_PORT_CONFIG='{"prio": 100}'
|
||||
|
||||
The mode is set in the runner: statement, as well as any settings that
|
||||
apply to that teaming mode.
|
||||
|
||||
We have the option of using strictly ifcfg files or using the ip utility
|
||||
to influence the settings of the adapter team. It appears from the teaming
|
||||
documentation that either approach will work.
|
||||
|
||||
The proposed implementation [2] of adapter teaming for os-net-config uses
|
||||
only ifcfg files to set the team settings, slave interfaces, and to
|
||||
set the primary interface. The potential downside of this path is that
|
||||
the interface must be shut down and restarted when config changes are
|
||||
made, but that is consistent with the other device types in os-net-config.
|
||||
This is probably acceptable, since network changes are made rarely and
|
||||
are assumed to be disruptive to the host being reconfigured.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* teamd daemon and teamdctl command-line utility must be installed. teamd is
|
||||
not installed by default on RHEL/CentOS; however, teamd is currently
included in the RDO overcloud-full image. It should be added to the list
|
||||
of os-net-config RPM dependencies.
|
||||
|
||||
* For LACP bonds using 802.3ad, switch support will need to be configured and
|
||||
at least two ports must be configured for LACP bonding.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
In order to test this in CI, we would need to have an environment where we
|
||||
have multiple physical NICs. Adapter teaming supports modes other than LACP,
|
||||
so we could possibly get away with multiple links without any special
|
||||
configuration.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The deployment documentation will need to be updated to cover the use of
|
||||
teaming. The os-net-config sample configurations will demonstrate the use
|
||||
in os-net-config. TripleO Heat template examples should also help with
|
||||
deployments using teaming.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* [1] - Documentation: Creating a Network Team Using ifcfg Files
|
||||
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configure_a_Network_Team_Using-the_Command_Line.html#sec-Creating_a_Network_Team_Using_ifcfg_Files
|
||||
|
||||
* [2] - Review: Add adapter teaming support using teamd for ifcfg-systems
|
||||
https://review.openstack.org/#/c/339854/
|
@ -1,229 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
======================================
|
||||
Pacemaker Next Generation Architecture
|
||||
======================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/ha-lightweight-architecture
|
||||
|
||||
Change the existing HA manifests and templates to deploy a minimal pacemaker
|
||||
architecture, where all the openstack services are started and monitored by
|
||||
systemd with the exception of: VIPs/Haproxy, rabbitmq, redis and galera.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The pacemaker architecture deployed currently via
|
||||
`puppet/manifests/overcloud_controller_pacemaker.pp` manages most
|
||||
service on the controllers via pacemaker. This approach, while having the
|
||||
advantage of having a single entity managing and monitoring all services, does
|
||||
bring a certain complexity to it and assumes that the operators are quite
|
||||
familiar with pacemaker and its management of resources. The aim is to
|
||||
propose a new architecture, replacing the existing one, where pacemaker
|
||||
controls the following resources:
|
||||
|
||||
* Virtual IPs + HAProxy
|
||||
* RabbitMQ
|
||||
* Galera
|
||||
* Redis
|
||||
* openstack-cinder-volume (as the service is not A/A yet)
|
||||
* Any future Active/Passive service
|
||||
|
||||
Basically every service that is managed today by a specific resource agent
|
||||
and not systemd, will be still running under pacemaker. The same goes
|
||||
for any service (like openstack-cinder-volume) that need to be active/passive.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Initially the plan was to create a brand new template implementing this
|
||||
new HA architecture. After a few rounds of discussions within the TripleO
|
||||
community, it has been decided to actually have a single HA architecture.
|
||||
The main reasons for moving to a single next generation HA architecture are
the amount of work needed to maintain two separate architectures and the
|
||||
fact that the previous HA architecture does not bring substantial advantages
|
||||
over this next generation one.
|
||||
|
||||
The new architecture will enable most services via systemd and will remove most
|
||||
pacemaker resource definitions with their corresponding constraints.
|
||||
In terms of ordering constraints we will go from a graph like this one:
|
||||
http://acksyn.org/files/tripleo/wsgi-openstack-core.pdf (mitaka)
|
||||
|
||||
to a graph like this one:
|
||||
http://acksyn.org/files/tripleo/light-cib-nomongo.pdf (next-generation-mitaka)
|
||||
|
||||
Once this new architecture is in place and we have tested it extensively, we
|
||||
can work on the upgrade path from the previous fully-fledged pacemaker HA
|
||||
architecture to this new one. Since the impact of pacemaker in the new
|
||||
architecture is quite small, it is possible to consider dropping the non-ha
|
||||
template in the future for every deployment and every CI job. The decision
|
||||
on this can be taken in a later step, even post-newton.
|
||||
|
||||
Another side-benefit is that with this newer architecture the
|
||||
whole upgrade/update topic is much easier to manage with TripleO,
|
||||
because there is less coordination needed between pacemaker, the update
|
||||
of openstack services, puppet and the update process itself.
|
||||
|
||||
Note that once composable services land, this next generation architecture will
|
||||
merely consist of a single environment file setting some services to be
|
||||
started via systemd, some via pacemaker and a bunch of environment variables
|
||||
needed for the services to reconnect even when galera and rabbitmq are down.
|
||||
All services that need to be started via systemd will use the default
|
||||
state:
|
||||
https://github.com/openstack/tripleo-heat-templates/blob/40ad2899106bc5e5c0cf34c40c9f391e19122a49/overcloud-resource-registry-puppet.yaml#L124
|
||||
|
||||
The services running via pacemaker will be explicitly listed in an
|
||||
environment file, like here:
|
||||
https://github.com/openstack/tripleo-heat-templates/blob/40ad2899106bc5e5c0cf34c40c9f391e19122a49/environments/puppet-pacemaker.yaml#L12
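
As a sketch (the template paths below are illustrative assumptions; the file
linked above is the authoritative list), the pacemaker-specific environment
file boils down to a handful of resource_registry overrides::

    resource_registry:
      OS::TripleO::Services::RabbitMQ: ../puppet/services/pacemaker/rabbitmq.yaml
      OS::TripleO::Services::HAproxy: ../puppet/services/pacemaker/haproxy.yaml
      OS::TripleO::Services::Redis: ../puppet/services/pacemaker/database/redis.yaml
      OS::TripleO::Services::MySQL: ../puppet/services/pacemaker/database/mysql.yaml
      OS::TripleO::Services::CinderVolume: ../puppet/services/pacemaker/cinder-volume.yaml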
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
There are many alternative designs for the HA architecture. The decision
|
||||
to use pacemaker only for a certain set of "core" services and all the
|
||||
Active/Passive services comes from a careful balance between complexity
|
||||
of the architecture and its management and being able to recover resources
|
||||
in a known broken state. There is a main assumption here about native
|
||||
openstack services:
|
||||
|
||||
They *must* be able to start when the broker and the database are down and keep
|
||||
retrying.
|
||||
|
||||
The reason for using only pacemaker for the core services and not, for
|
||||
example keepalived for the Virtual IPs, is to keep the stack simple and
|
||||
not introduce multiple distributed resource managers. Also, if we used
|
||||
only keepalived, we'd have no way of recovering from a failure beyond
|
||||
trying to relocate the VIP.
|
||||
|
||||
The reason for keeping haproxy under pacemaker's management is that
|
||||
we can guarantee that a VIP will always run where haproxy is running,
|
||||
should an haproxy service fail.
|
||||
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
No changes regarding security aspects compared to the existing status quo.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
The operators working with a cloud are impacted in the following ways:
|
||||
|
||||
* The services (galera, redis, openstack-cinder-volume, VIPs,
|
||||
haproxy) will be managed as usual via `pcs`. Pacemaker will monitor these
|
||||
services and provide their status via `pcs status`.
|
||||
|
||||
* All other services will be managed via `systemctl` and systemd will be
|
||||
configured to automatically restart a failed service. Note that this is
already done in RDO with (Restart={always,on-failure}) in the service files.
It is a no-op when pacemaker manages the service, as an override file is
|
||||
created by pacemaker:
|
||||
|
||||
https://github.com/ClusterLabs/pacemaker/blob/master/lib/services/systemd.c#L547
|
||||
|
||||
With the new architecture, restarting a native openstack service across
|
||||
all controllers will require restarting it via `systemctl` on each node (as opposed
|
||||
to a single `pcs` command, as is done today).
|
||||
|
||||
* All services will be configured to retry indefinitely to connect to
|
||||
the database or to the messaging broker. In case of a controller failure,
|
||||
the failover scenario will be the same as with the current HA architecture,
|
||||
with the difference that the services will just retry to re-connect indefinitely.
|
||||
|
||||
* Previously, with the HA template, every service was monitored and managed by
|
||||
pacemaker. With the split between openstack services being managed by systemd and
|
||||
"core" services managed by pacemaker, the operator needs to know which service
|
||||
to monitor with which command.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
No changes compared to the existing architecture.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
In the future we might see if the removal of the non-HA template is feasible,
|
||||
thereby simplifying our CI jobs and leaving a single, better-maintained template.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
michele
|
||||
|
||||
Other contributors:
|
||||
...
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Prepare the roles that deploy the next generation architecture. Initially,
|
||||
keep it as close as possible to the existing HA template and make it simpler
|
||||
in a second iteration (remove unnecessary steps, etc.) Template currently
|
||||
lives here and deploys successfully:
|
||||
|
||||
https://review.openstack.org/#/c/314208/
|
||||
|
||||
* Test failure scenarios and recovery scenarios, and open bugs against services that
|
||||
misbehave in the face of database and/or broker being down.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Initial smoke-testing has been completed successfully. Another set of tests
|
||||
focusing on the behaviour of openstack services when galera and rabbitmq are
|
||||
down is in the process of being run.
|
||||
|
||||
Particular focus will be on failover scenarios and recovery times and making
|
||||
sure that there are no regressions compared to the current HA architecture.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Currently we do not describe the architectures as deployed by TripleO itself,
|
||||
so no changes are needed. A short page in the docs describing the architecture
|
||||
would be a nice thing to have in the future.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
This design came mostly out from a meeting in Brno with the following attendees:
|
||||
|
||||
* Andrew Beekhof
|
||||
* Chris Feist
|
||||
* Eoghan Glynn
|
||||
* Fabio Di Nitto
|
||||
* Graeme Gillies
|
||||
* Hugh Brock
|
||||
* Javier Peña
|
||||
* Jiri Stransky
|
||||
* Lars Kellogg-Steadman
|
||||
* Mark Mcloughlin
|
||||
* Michele Baldessari
|
||||
* Raoul Scarazzini
|
||||
* Rob Young
|
@ -1,229 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
TripleO LLDP Validation
|
||||
==========================================
|
||||
|
||||
Include the URL of your launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-lldp-validation
|
||||
|
||||
The Link Layer Discovery Protocol (LLDP) is a vendor-neutral link layer
|
||||
protocol in the Internet Protocol Suite used by network devices for
|
||||
advertising their identity, capabilities, and neighbors on an
|
||||
IEEE 802 local area network, principally wired Ethernet. [1]
|
||||
|
||||
The Link Layer Discovery Protocol (LLDP) helps identify layer 1/2
|
||||
connections between hosts and switches. The switch port, chassis ID,
|
||||
VLANs trunked, and other info is available, for planning or
|
||||
troubleshooting a deployment. For instance, a deployer may validate
|
||||
that the proper VLANs are supplied on a link, or that all hosts
|
||||
are connected to the Provisioning network.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
A detailed description of the problem:
|
||||
|
||||
* Deployment networking is one of the most difficult parts of any
|
||||
OpenStack deployment. A single misconfigured port or loose cable
|
||||
can derail an entire multi-rack deployment.
|
||||
|
||||
* Given the first point, we should work to automate validation and
|
||||
troubleshooting where possible.
|
||||
|
||||
* Work is underway to collect LLDP data in ironic-python-agent,
|
||||
and we have an opportunity to make that data useful [2].
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The goal is to expose LLDP data that is collected during
|
||||
introspection, and provide this data in a format that is useful for the
|
||||
deployer. This work depends on the LLDP collection work being done
|
||||
in ironic-python-agent [3].
|
||||
|
||||
There is work being done to implement LLDP data collection for Ironic/
|
||||
Neutron integration. Although this work is primarily focused on features
|
||||
for bare-metal Ironic instances, there will be some overlap with the
|
||||
way TripleO uses Ironic to provision overcloud servers.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
There are many network management utilities that use CDP or LLDP data to
|
||||
validate the physical networking. Some of these are open source, but none
|
||||
are integrated with OpenStack.
|
||||
|
||||
Alternative approaches that do not use LLDP are typically vendor-specific
|
||||
and require specific hardware support. Cumulus has a solution which works
|
||||
with multiple vendors' hardware, but that solution requires running their
|
||||
custom OS on the Ethernet switches.
|
||||
|
||||
Another approach which is common is to perform collection of the switch
|
||||
configurations to a central location, where port configurations can be
|
||||
viewed, or in some cases even altered and remotely pushed. The problem
|
||||
with this approach is that the switch configurations are hardware and
|
||||
vendor-specific, and typically a network engineer is required to read
|
||||
and interpret the configuration. A unified approach that works for all
|
||||
common switch vendors is preferred, along with a unified reporting format.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The physical network report provides a roadmap to the underlying network
|
||||
structure. This could prove handy to an attacker who was unaware of the
|
||||
existing topology. On the other hand, the information about physical
|
||||
network topology is less valuable than information about logical topology
|
||||
to an attacker. LLDP contains some information about both physical and
|
||||
logical topology, but the logical topology is limited to VLAN IDs.
|
||||
|
||||
The network topology report should be considered sensitive but not
|
||||
critical. No credentials or shared secrets are revealed in the data
|
||||
collected by ironic-inspector.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
This report will hopefully reduce the troubleshooting time for nodes
|
||||
with failed network deployments.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
If this report is produced as part of the ironic-inspector workflow,
|
||||
then it will increase the time taken to introspect each node by a
|
||||
negligible amount, perhaps a few seconds.
|
||||
|
||||
If this report is called by the operator on demand, it will have
|
||||
no performance impact on other components.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
Deployers may want more information than the per-node LLDP report provides.
|
||||
There may be some use in providing aggregate reports, such as the number
|
||||
of nodes with a specific configuration of interfaces and trunked VLANs.
|
||||
This would help to highlight outliers or misconfigured nodes.
|
||||
|
||||
There have been discussions about adding automated switch configuration
|
||||
in TripleO. This would be a mechanism whereby deployers could produce the
|
||||
Ethernet switch configuration with a script based on a configuration
|
||||
template. The deployer would provide specifics like the number of nodes
|
||||
and the configuration per node, and the script would generate the switch
|
||||
configuration to match. In that case, the LLDP collection and analysis
|
||||
would function as a validator for the automatically generated switch
|
||||
port configurations.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
The initial work will be to fill in fixed fields such as Chassis ID
|
||||
and switch port. An LLDP packet can contain additional data on a
|
||||
per-vendor basis, however.
|
||||
|
||||
The long-term plan is to store the entire LLDP packet in the
|
||||
metadata. This will have to be parsed out. We may have to work with
|
||||
switch vendors to figure out how to interpret some of the data if
|
||||
we want to make full use of it.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Some notes about implementation:
|
||||
|
||||
* This Python tool will access the introspection data and produce
|
||||
reports on various information such as VLANs per port, host-to-port
|
||||
mapping, and MACs per host.
|
||||
|
||||
* The introspection data can be retrieved with the Ironic API [4] [5].
|
||||
|
||||
* The data will initially be a set of fixed fields which are retrievable
|
||||
in the JSON in the Ironic introspection data. Later, the entire
|
||||
LLDP packet will be stored, and will need to be parsed outside of the
|
||||
Ironic API.
|
||||
|
||||
* Although the initial implementation can return a human-readable report,
|
||||
other outputs should be available for automation, such as YAML.
|
||||
|
||||
* The tool that produces the LLDP report should be able to return data
|
||||
on a single host, or return all of the data.
|
||||
|
||||
* Some basic support for searching would be a nice feature to have.
|
||||
|
||||
* This data will eventually be used by the GUI to display as a validation
|
||||
step in the deployment workflow.
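
A hedged sketch of what a per-node YAML report could look like is shown below;
the field names are assumptions chosen for illustration, not a committed
schema::

    node: overcloud-controller-0
    interfaces:
      - name: eth0
        switch_chassis_id: "aa:bb:cc:dd:ee:ff"
        switch_port_id: "Ethernet1/10"
        switch_port_vlans: [10, 20, 30]
      - name: eth1
        switch_chassis_id: "aa:bb:cc:dd:ee:01"
        switch_port_id: "Ethernet1/11"
        switch_port_vlans: [50]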
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
dsneddon <dsneddon@redhat.com>
|
||||
|
||||
Other contributors:
|
||||
bfournie <bfournie@redhat.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Create the Python script to grab introspection data from Swift using
|
||||
the API.
|
||||
|
||||
* Create the Python code to extract the relevant LLDP data from the
|
||||
data JSON.
|
||||
|
||||
* Implement per-node reports
|
||||
|
||||
* Implement aggregate reports
|
||||
|
||||
* Interface with UI developers to give them the data in a form that can
|
||||
be consumed and presented by the TripleO UI.
|
||||
|
||||
* In the future, when the entire LLDP packet is stored, refactor logic
|
||||
to take this into account.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Since this is a report that is supposed to benefit the operator, perhaps
|
||||
the best way to include it in CI is to make sure that the report gets
|
||||
logged by the Undercloud. Then the report can be reviewed in the log
|
||||
output from the CI run.
|
||||
|
||||
In fact, this might benefit the TripleO CI process, since hardware issues
|
||||
on the network would be easier to troubleshoot without having access to
|
||||
the bare metal console.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Documentation will need to be written to cover making use of the new
|
||||
LLDP reporting tool. This should cover running the tool by hand and
|
||||
interpreting the data.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
* [1] - Wikipedia entry on LLDP:
|
||||
https://en.wikipedia.org/wiki/Link_Layer_Discovery_Protocol
|
||||
|
||||
* [2] - Blueprint for Ironic/Neutron integration:
|
||||
https://blueprints.launchpad.net/ironic/+spec/ironic-ml2-integration
|
||||
|
||||
* [3] - Review: Support LLDP data as part of interfaces in inventory
|
||||
https://review.openstack.org/#/c/320584/
|
||||
|
||||
* [4] - Accessing Ironic Introspection Data
|
||||
http://tripleo.org/advanced_deployment/introspection_data.html
|
||||
|
||||
* [5] - Ironic API - Get Introspection Data
|
||||
http://docs.openstack.org/developer/ironic-inspector/http-api.html#get-introspection-data
|
@ -1,186 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
============================================
|
||||
Enable deployment of availability monitoring
|
||||
============================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-opstools-availability-monitoring
|
||||
|
||||
TripleO should be deploying out-of-the-box availability monitoring solution
|
||||
to serve the overcloud.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently there is no such feature implemented, except for the possibility to deploy
sensu-server, sensu-api and uchiwa (Sensu dashboard) services in the undercloud
stack. Without sensu-client services deployed on overcloud nodes this piece
of code is useless. Due to the potential for high resource consumption it is also
reasonable to remove the current undercloud code to avoid possible problems
when a high number of overcloud nodes is being deployed.
|
||||
|
||||
Instead, sensu-server, sensu-api and uchiwa should be deployed on separate
node(s), whether at the undercloud level or at the overcloud level.
Sensu-client deployment support should therefore be flexible enough to enable
connection to an external monitoring infrastructure or to a Sensu stack deployed
on a dedicated overcloud node.
|
||||
|
||||
Summary of use cases:
|
||||
|
||||
1. sensu-server, sensu-api and uchiwa deployed in external infrastructure;
|
||||
sensu-client deployed on each overcloud node
|
||||
2. sensu-server, sensu-api and uchiwa deployed as a separate Heat stack in
|
||||
the overcloud stack; sensu-client deployed on each overcloud node
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The sensu-client service will be deployed as a composable service on
|
||||
the overcloud stack when it is explicitly stated via environment file.
|
||||
Sensu checks will have to be configured as subscription checks (see [0]
|
||||
for details). Each composable service will have its own subscription string,
which will ensure that checks defined on the Sensu server node (wherever it lives)
|
||||
are run on the correct overcloud nodes.
|
||||
|
||||
It will be possible to deploy the sensu-server, sensu-api
and uchiwa services on a standalone node deployed by the undercloud.
This standalone node will be dedicated to monitoring
(not only availability monitoring services, but in the future also
centralized logging services or performance monitoring services).
|
||||
|
||||
The monitoring node will be deployed as a Heat stack separate from the overcloud
|
||||
stack using Puppet and composable roles for required services.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
An additional service (sensu-client) will be installed on all overcloud nodes.
These services will have an open connection to the RabbitMQ instance running
on the monitoring node and are used to execute commands (checks) on the overcloud
nodes. Check definitions will live on the monitoring node.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
We might consider deploying separate RabbitMQ and Redis for monitoring purposes
|
||||
if we want to avoid influencing OpenStack deployment in the overcloud.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
* Sensu clients will be deployed by default on all overcloud nodes except the monitoring node.
|
||||
* New Sensu common parameters:
|
||||
|
||||
* MonitoringRabbitHost
|
||||
|
||||
* RabbitMQ host Sensu has to connect to
|
||||
|
||||
* MonitoringRabbitPort
|
||||
|
||||
* RabbitMQ port Sensu has to connect to
|
||||
|
||||
* MonitoringRabbitUseSSL
|
||||
|
||||
* Whether Sensu should connect to RabbitMQ using SSL
|
||||
|
||||
* MonitoringRabbitPassword
|
||||
|
||||
* RabbitMQ password used for Sensu to connect
|
||||
|
||||
* MonitoringRabbitUserName
|
||||
|
||||
* RabbitMQ username used for Sensu to connect
|
||||
|
||||
* MonitoringRabbitVhost
|
||||
|
||||
* RabbitMQ vhost used for monitoring purposes.
|
||||
|
||||
* New Sensu server/API parameters
|
||||
|
||||
* MonitoringRedisHost
|
||||
|
||||
* Redis host Sensu has to connect to
|
||||
|
||||
* MonitoringRedisPassword
|
||||
|
||||
* Redis password used for Sensu to connect
|
||||
|
||||
* MonitoringChecks:
|
||||
|
||||
* Full definition (for all subscriptions) of checks performed by Sensu
|
||||
|
||||
* New parameters for subscription strings for each composable service:
|
||||
|
||||
* For example for service nova-compute MonitoringSubscriptionNovaCompute, which will default to 'overcloud-nova-compute'
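
A sketch of an environment file a deployer might pass, using the parameters
listed above (all values are illustrative only)::

    parameter_defaults:
      MonitoringRabbitHost: 192.0.2.100
      MonitoringRabbitPort: 5672
      MonitoringRabbitUseSSL: false
      MonitoringRabbitUserName: sensu
      MonitoringRabbitPassword: example-password
      MonitoringRabbitVhost: /sensu
      MonitoringSubscriptionNovaCompute: overcloud-nova-compute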
|
||||
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Support for the new node type should be implemented in tripleo-quickstart.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Martin Mágr <mmagr@redhat.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* puppet-tripleo profile for Sensu services
|
||||
* puppet-tripleo profile for Uchiwa service
|
||||
* tripleo-heat-templates composable service for sensu-client deployment
|
||||
* tripleo-heat-templates composable service for sensu-server deployment
|
||||
* tripleo-heat-templates composable service for sensu-api deployment
|
||||
* tripleo-heat-templates composable service for uchiwa deployment
|
||||
* Support for monitoring node in tripleo-quickstart
|
||||
* Revert patch(es) implementing Sensu support in instack-undercloud
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Puppet module for Sensu services: sensu-puppet [1]
|
||||
* Puppet module for Uchiwa: puppet-uchiwa [2]
|
||||
* CentOS Opstools SIG repo [3]
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Sensu client deployment will be tested by current TripleO CI as soon as
|
||||
the patch is merged, as it will be deployed by default.
|
||||
|
||||
We should consider creating CI job for deploying overcloud with monitoring
|
||||
node to test the rest of the monitoring components.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Process of creating new node type and new options will have to be documented.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[0] https://sensuapp.org/docs/latest/reference/checks.html#subscription-checks
|
||||
[1] https://github.com/sensu/sensu-puppet
|
||||
[2] https://github.com/Yelp/puppet-uchiwa
|
||||
[3] https://wiki.centos.org/SpecialInterestGroup/OpsTools
|
@ -1,147 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
========================================
|
||||
Enable deployment of centralized logging
|
||||
========================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-opstools-centralized-logging
|
||||
|
||||
TripleO should deploy an out-of-the-box centralized logging
|
||||
solution to serve the overcloud.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
With a complex distributed system like OpenStack, identifying and
|
||||
diagnosing a problem may require tracking a transaction across many
|
||||
different systems and many different logfiles. In the absence of a
|
||||
centralized logging solution, this process is frustrating to both new
|
||||
and experienced operators and can make even simple problems hard to
|
||||
diagnose.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
We will deploy the Fluentd_ service in log collecting mode as a
|
||||
composable service on all nodes in the overcloud stack when configured
|
||||
to do so by the environment. Each composable service will have its
|
||||
own fluentd source configuration.
|
||||
|
||||
.. _fluentd: http://www.fluentd.org/
|
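As an illustration only (the exact keys are an assumption about how a composable
service could expose its log sources, not a settled interface), a service
template might declare something like::

  outputs:
    role_data:
      value:
        logging_source:
          tag: openstack.nova.compute
          path: /var/log/nova/nova-compute.log
        logging_groups:
          - nova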
||||
|
||||
To receive these messages, we will deploy a centralized logging system
|
||||
running Kibana_, Elasticsearch_ and Fluentd on dedicated nodes to
|
||||
provide log aggregation and analysis. This will be deployed in a
|
||||
dedicated Heat stack that is separate from the overcloud stack using
|
||||
composable roles.
|
||||
|
||||
.. _kibana: https://www.elastic.co/products/kibana
|
||||
.. _elasticsearch: https://www.elastic.co/
|
||||
|
||||
We will also support sending messages to an external Fluentd
|
||||
instance not deployed by tripleo.
|
||||
|
||||
Summary of use cases
|
||||
--------------------
|
||||
|
||||
1. Elasticsearch, Kibana and Fluentd log relay/transformer deployed as
|
||||
a Heat stack separate from the overcloud stack; Fluentd log
|
||||
collector deployed on each overcloud node
|
||||
|
||||
2. ElasticSearch, Kibana and Fluentd log relay/transformer deployed in
|
||||
external infrastructure; Fluentd log collector deployed on each
|
||||
overcloud node
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Data collected from the logs of OpenStack services can contain
|
||||
sensitive information:
|
||||
|
||||
- Communication between the
|
||||
fluentd agent and the log aggregator should be protected with SSL.
|
||||
|
||||
- Access to the Kibana UI must have at least basic HTTP
|
||||
authentication, and client access should be via SSL.
|
||||
|
||||
- ElasticSearch should only allow connections over ``localhost``.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Additional resources will be required for running Fluentd on overcloud
|
||||
nodes. Log traffic from the overcloud nodes to the log aggregator
|
||||
will consume some bandwidth.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
- Fluentd will be deployed on all overcloud nodes.
|
||||
- New parameters for configuring Fluentd collector.
|
||||
- New parameters for configuring log collector (Fluentd,
|
||||
ElasticSearch, and Kibana)
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Support for the new node type should be implemented for tripleo-quickstart.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Martin Mágr <mmagr@redhat.com>
|
||||
Lars Kellogg-Stedman <lars@redhat.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
- puppet-tripleo profile for fluentd service
|
||||
- tripleo-heat-templates composable role for FluentD collector deployment
|
||||
- tripleo-heat-templates composable role for FluentD aggregator deployment
|
||||
- tripleo-heat-templates composable role for ElasticSearch deployment
|
||||
- tripleo-heat-templates composable role for Kibana deployment
|
||||
- Support for logging node in tripleo-quickstart
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
- Puppet module for Fluentd: `konstantin-fluentd` [1]
|
||||
- Puppet module for ElasticSearch `elasticsearch-elasticsearch` [2]
|
||||
- Puppet module for Kibana (tbd)
|
||||
- CentOS Opstools SIG package repository
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Fluentd client deployment will be tested by current TripleO CI as soon as
|
||||
the patch is merged. Because the centralized logging features will not
|
||||
be enabled by default we may need to introduce specific tests for
|
||||
these features.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Process of creating new node type and new options will have to be documented.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] https://forge.puppet.com/srf/fluentd
|
||||
[2] https://forge.puppet.com/elasticsearch/elasticsearch
|
@ -1,232 +0,0 @@
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Adding OVS-DPDK to TripleO
|
||||
==========================================
|
||||
|
||||
Blueprint URL -
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-ovs-dpdk
|
||||
|
||||
DPDK is a set of libraries and drivers for fast packet processing and gets as
|
||||
close to wire-line speed as possible for virtual machines.
|
||||
|
||||
* It is a complete framework for fast packet processing in data plane
|
||||
applications.
|
||||
|
||||
* Directly polls the data from the NIC.
|
||||
|
||||
* Does not use interrupts - to prevent performance overheads.
|
||||
|
||||
* Uses the hugepages to preallocate large regions of memory, which allows the
|
||||
applications to DMA data directly into these pages.
|
||||
|
||||
* DPDK also has its own buffer and ring management systems for handling
|
||||
sk_buffs efficiently.
|
||||
|
||||
DPDK provides data plane libraries and NIC drivers for -
|
||||
|
||||
* Queue management with lockless queues.
|
||||
|
||||
* Buffer manager with pre-allocated fixed size buffers.
|
||||
|
||||
* PMD (poll mode drivers) to work without asynchronous notifications.
|
||||
|
||||
* Packet framework (set of libraries) to help data plane packet processing.
|
||||
|
||||
* Memory manager - allocates pools of memory, uses a ring to store free
|
||||
objects.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
* Today the installation and configuration of OVS+DPDK in OpenStack is done
|
||||
manually after overcloud deployment. This can be very challenging for the
|
||||
operator and tedious to do over a large number of compute nodes.
|
||||
The installation of OVS+DPDK needs to be automated in TripleO.
|
||||
|
||||
* Identification of the hardware capabilities for DPDK is currently done manually
|
||||
and shall be automated during introspection. This hardware
|
||||
detection also provides the operator with the data needed for configuring
|
||||
Heat templates.
|
||||
|
||||
* As of today it is not possible to mix compute nodes with
|
||||
DPDK-enabled hardware and compute nodes without DPDK-enabled hardware.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
* Ironic Python Agent shall discover the below hardware details and store it
|
||||
in swift blob -
|
||||
|
||||
* CPU flags for hugepages support -
|
||||
If pse exists, then 2MB hugepages are supported.
|
||||
If pdpe1gb exists, then 1GB hugepages are supported.
|
||||
|
||||
* CPU flags for IOMMU -
|
||||
If VT-d/svm exists, then IOMMU is supported, provided IOMMU support is
|
||||
enabled in BIOS.
|
||||
|
||||
* Compatible NICs -
|
||||
Shall compare it with the list of NICs whitelisted for DPDK. The DPDK
|
||||
supported NICs are available at http://dpdk.org/doc/nics
|
||||
|
||||
The nodes without any of the above mentioned capabilities can't be used for
|
||||
COMPUTE role with DPDK.
|
||||
|
||||
* Operator shall have a provision to enable DPDK on compute nodes
|
||||
|
||||
* The overcloud image for the nodes identified to be COMPUTE capable and having
|
||||
DPDK NICs, shall have the OVS+DPDK package instead of OVS. It shall also have
|
||||
packages dpdk and driverctl.
|
||||
|
||||
* The device names of the DPDK capable NICs shall be obtained from T-H-T.
|
||||
The PCI address of each DPDK NIC needs to be identified from the device name.
|
||||
It is required for whitelisting the DPDK NICs during PCI probe.
|
||||
|
||||
* Hugepages need to be enabled in the Compute nodes with DPDK.
|
||||
Bug: https://bugs.launchpad.net/tripleo/+bug/1589929 needs to be implemented.
|
||||
|
||||
* CPU isolation needs to be done so that the CPU cores reserved for DPDK Poll
|
||||
Mode Drivers (PMD) are not used by the general kernel balancing,
|
||||
interrupt handling and scheduling algorithms.
|
||||
Bug: https://bugs.launchpad.net/tripleo/+bug/1589930 needs to be implemented.
|
||||
|
||||
* On each COMPUTE node with DPDK enabled NIC, puppet shall configure the
|
||||
DPDK_OPTIONS for whitelisted NICs, CPU mask and number of memory channels
|
||||
for DPDK PMD. The DPDK_OPTIONS needs to be set in /etc/sysconfig/openvswitch
|
||||
|
||||
* Os-net-config shall -
|
||||
|
||||
* Associate the given interfaces with the dpdk drivers (default as vfio-pci
|
||||
driver) by identifying the pci address of the given interface. The
|
||||
driverctl shall be used to bind the driver persistently
|
||||
|
||||
* Understand the ovs_user_bridge and ovs_dpdk_port types and configure the
|
||||
ifcfg scripts accordingly.
|
||||
|
||||
* The "TYPE" ovs_user_bridge shall translate to OVS type OVSUserBridge and
|
||||
based on this OVS will configure the datapath type to 'netdev'.
|
||||
|
||||
* The "TYPE" ovs_dpdk_port shall translate OVS type OVSDPDKPort and based on
|
||||
this OVS adds the port to the bridge with interface type as 'dpdk'
|
||||
|
||||
* Understand the ovs_dpdk_bond and configure the ifcfg scripts accordingly.
|
||||
|
||||
* On each COMPUTE node with DPDK enabled NIC, puppet shall -
|
||||
|
||||
* Enable OVS+DPDK in /etc/neutron/plugins/ml2/openvswitch_agent.ini
|
||||
[OVS]
|
||||
datapath_type=netdev
|
||||
vhostuser_socket_dir=/var/run/openvswitch
|
||||
|
||||
* Configure vhostuser ports in /var/run/openvswitch to be owned by qemu.
|
||||
|
||||
* On each controller node, puppet shall -
|
||||
|
||||
* Add NUMATopologyFilter to scheduler_default_filters in nova.conf.
|
||||
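The os-net-config snippet below is a rough sketch of how the new types could be
used together (interface and bridge names are illustrative only)::

  network_config:
    -
      type: ovs_user_bridge
      name: br-link
      members:
        -
          type: ovs_dpdk_port
          name: dpdk0
          members:
            -
              type: interface
              name: nic3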
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
* The boot parameters could be configured via puppet (during overcloud
|
||||
deployment) as well as virt-customize (after image building or downloading).
|
||||
The choice of selection of boot parameter is moved out of scope of this
|
||||
blueprint and would be tracked via
|
||||
https://bugs.launchpad.net/tripleo/+bug/1589930.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
* We have no firewall drivers which support ovs-dpdk at present. Security group
|
||||
support with conntrack is a possible option, and this work is in progress.
|
||||
Security groups will not be supported.
|
||||
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
* OVS-DPDK can improve dataplane performance by roughly 3x.
|
||||
Refer http://goo.gl/Du1EX2
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
* The operator shall ensure that the VT-d/IOMMU virtualization technology is
|
||||
enabled in BIOS of the compute nodes.
|
||||
|
||||
* Post deployment, operator shall modify the VM flavors for using hugepages,
|
||||
CPU pinning
|
||||
Ex: nova flavor-key m1.small set "hw:mem_page_size=large"
|
||||
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignees:
|
||||
|
||||
* karthiks
|
||||
* sanjayu
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* The proposed changes discussed earlier will be the work items
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* We are dependent on composable roles, as this is something we would
|
||||
require only on specific compute nodes and not generally on all the nodes.
|
||||
|
||||
* To enable Hugepages, bug: https://bugs.launchpad.net/tripleo/+bug/1589929
|
||||
needs to be implemented
|
||||
|
||||
* To address boot parameter changes for CPU isolation,
|
||||
bug: https://bugs.launchpad.net/tripleo/+bug/1589930 needs to be implemented
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
* Since DPDK needs specific hardware support, this feature can't be tested under
|
||||
CI. We will need third party CI for validating it.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
* Manual steps that need to be done by the operator shall be documented.
|
||||
Ex: configuring BIOS for VT-d, adding boot parameters for CPU isolation and
|
||||
hugepages, post deployment configurations.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* Manual steps to setup DPDK in RedHat Openstack Platform 8
|
||||
https://goo.gl/6ymmJI
|
||||
|
||||
* Setup procedure for CPU pinning and NUMA topology
|
||||
http://goo.gl/TXxuhv
|
||||
|
||||
* DPDK supported NICS
|
||||
http://dpdk.org/doc/nics
|
||||
|
||||
|
||||
|
@ -1,250 +0,0 @@
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Adding SR-IOV to TripleO
|
||||
==========================================
|
||||
|
||||
Blueprint URL:
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-sriov
|
||||
|
||||
SR-IOV is a specification that extends the PCI Express specification and allows
|
||||
a PCIe device to appear to be multiple separate physical PCIe devices.
|
||||
|
||||
SR-IOV provides one or more Virtual Functions (VFs) and a Physical Function(PF)
|
||||
|
||||
* Virtual Functions (VFs) are ‘lightweight’ PCIe functions that contain the
|
||||
resources necessary for data movement but have a carefully minimized set
|
||||
of configuration resources.
|
||||
|
||||
* Physical Functions (PFs) are full PCIe functions that include the SR-IOV
|
||||
Extended Capability. This capability is used to configure and manage
|
||||
the SR-IOV functionality.
|
||||
|
||||
The VFs can be attached to VMs like a dedicated PCIe device and thereby the
|
||||
usage of SR-IOV NICs boosts the networking performance considerably.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
* Today the installation and configuration of the SR-IOV feature is done manually
|
||||
after overcloud deployment. It shall be automated via TripleO.
|
||||
|
||||
* Identification of the hardware capabilities for SR-IOV is currently done manually
|
||||
and shall be automated during introspection. The hardware
|
||||
detection also provides the operator with the data needed for configuring Heat
|
||||
templates.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
* Ironic Python Agent will discover the below hardware details and store it in
|
||||
swift blob -
|
||||
|
||||
* SR-IOV capable NICs:
|
||||
Shall read /sys/bus/pci/devices/.../sriov_totalvfs and check if it is
|
||||
non-zero, in order to identify if the NIC is SR-IOV capable
|
||||
|
||||
* VT-d or IOMMU support in BIOS:
|
||||
The CPU flags shall be read to identify the support.
|
||||
|
||||
* DIB shall include the package by default - openstack-neutron-sriov-nic-agent.
|
||||
|
||||
* The nodes without any of the above mentioned capabilities can't be used for
|
||||
COMPUTE role with SR-IOV
|
||||
|
||||
* SR-IOV drivers shall be loaded during bootup via persistent module loading
|
||||
scripts. These persistent module loading scripts shall be created by the
|
||||
puppet manifests.
|
||||
|
||||
* T-H-T shall provide the below details (a sample environment file is sketched after this list)
|
||||
|
||||
* supported_pci_vendor_devs - configure the vendor-id/product-id couples in
|
||||
the nodes running neutron-server
|
||||
|
||||
* Max number of VFs - persistent across reboots
|
||||
|
||||
* physical device mappings - Add physical device mappings to the ml2_conf_sriov.ini
|
||||
file on the compute node
|
||||
|
||||
* On the nodes running the Neutron server, puppet shall
|
||||
|
||||
* enable sriovnicswitch in the /etc/neutron/plugin.ini file
|
||||
mechanism_drivers = openvswitch,sriovnicswitch
|
||||
This configuration enables the SR-IOV mechanism driver alongside
|
||||
OpenvSwitch.
|
||||
|
||||
* Set the VLAN range for SR-IOV in the file /etc/neutron/plugin.ini, present
|
||||
in the network node
|
||||
network_vlan_ranges = <physical network name SR-IOV interface>:<VLAN min>
|
||||
:<VLAN max> Ex : network_vlan_ranges = fabric0:15:20
|
||||
|
||||
* Configure the vendor-id/product-id couples if it differs from
|
||||
“15b3:1004,8086:10ca” in /etc/neutron/plugins/ml2/ml2_conf_sriov.ini
|
||||
supported_pci_vendor_devs = 15b3:1004,8086:10ca,<vendor-id:product-id>
|
||||
|
||||
* Configure neutron-server.service to use the ml2_conf_sriov.ini file
|
||||
[Service] Type=notify User=neutron ExecStart=/usr/bin/neutron-server
|
||||
--config-file /usr/share/neutron/neutron-dist.conf --config-file
|
||||
/etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini
|
||||
--config-file /etc/neutron/plugins/ml2/ml2_conf_sriov.ini --log-file
|
||||
/var/log/neutron/server.log
|
||||
|
||||
* In the nodes running nova scheduler, puppet shall
|
||||
|
||||
* add PciPassthroughFilter filter to the list of scheduler_default_filters.
|
||||
This needs to be done to allow proper scheduling of SR-IOV devices
|
||||
|
||||
* On each COMPUTE+SRIOV node, puppet shall configure /etc/nova/nova.conf
|
||||
|
||||
* Associate the available VFs with each physical network
|
||||
Ex: pci_passthrough_whitelist={"devname": "enp5s0f1",
|
||||
"physical_network":"fabric0"}
|
||||
|
||||
PCI passthrough whitelist entries use the following syntax: ["device_id":
|
||||
"<id>",] ["product_id": "<id>",] ["address":
|
||||
"[[[[<domain>]:]<bus>]:][<slot>][.[<function>]]" | "devname": "Ethernet
|
||||
Interface Name",] "physical_network":"Network label string"
|
||||
|
||||
VFs that need to be excluded from the agent configuration shall be added in
|
||||
[sriov_nic]/exclude_devices. T-H-T shall configure this.
|
||||
|
||||
Multiple whitelist entries per host are supported.
|
||||
|
||||
* Puppet shall
|
||||
|
||||
* Set up the max number of VFs to be configured by the operator
|
||||
echo required_max_vfs > /sys/bus/pci/devices/.../sriov_numvfs
|
||||
Puppet will also validate the required_max_vfs, so that it does not go
|
||||
beyond the supported max on the device.
|
||||
|
||||
* Enable NoopFirewallDriver in the
|
||||
'/etc/neutron/plugins/ml2/sriov_agent.ini' file.
|
||||
|
||||
[securitygroup]
|
||||
firewall_driver = neutron.agent.firewall.NoopFirewallDriver
|
||||
|
||||
* Add mappings to the /etc/neutron/plugins/ml2/sriov_agent.ini file. Ex:
|
||||
physical_device_mappings = fabric0:enp4s0f1
|
||||
In this example, fabric0 is the physical network, and enp4s0f1 is the
|
||||
physical function
|
||||
|
||||
* Puppet shall start the SR-IOV agent on Compute
|
||||
|
||||
* systemctl enable neutron-sriov-nic-agent.service
|
||||
|
||||
* systemctl start neutron-sriov-nic-agent.service
|
||||
|
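A rough sketch of the Heat environment file mentioned above (the parameter names
are illustrative assumptions, not a final interface)::

  parameter_defaults:
    NeutronSupportedPCIVendorDevs: ['15b3:1004', '8086:10ca']
    NeutronSriovNumVFs: "enp5s0f1:5"
    NeutronPhysicalDevMappings: "fabric0:enp5s0f1"
    NovaPCIPassthrough:
      - devname: "enp5s0f1"
        physical_network: "fabric0"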
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
* We have no firewall drivers which support SR-IOV at present.
|
||||
Security groups will be disabled only for SR-IOV ports in compute hosts.
|
||||
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
* SR-IOV provides near native I/O performance for each virtual machine on a
|
||||
physical server. Refer - http://goo.gl/HxZvXX
|
||||
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
* The operator shall ensure that the BIOS supports VT-d/IOMMU virtualization
|
||||
technology on the compute nodes.
|
||||
|
||||
* IOMMU needs to be enabled in the Compute+SR-IOV nodes. Boot parameters
|
||||
(intel_iommu=on or amd_iommu=pt) shall be added to grub.conf using the
|
||||
first boot scripts (THT). A first-boot sketch follows this list.
|
||||
|
||||
* Post deployment, operator shall
|
||||
|
||||
* Create neutron ports prior to creating VMs (nova boot)
|
||||
neutron port-create fabric0_0 --name sr-iov --binding:vnic-type direct
|
||||
|
||||
* Create the VM with the required flavor and SR-IOV port id
|
||||
Ex: nova boot --flavor m1.small --image <image id> --nic port-id=<port id>
|
||||
vnf0
|
||||
|
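One possible first-boot sketch for adding the IOMMU kernel boot parameter
(assuming the usual NodeUserData first-boot hook; the template below is
illustrative only, not a committed interface)::

  heat_template_version: 2014-10-16

  resources:
    userdata:
      type: OS::Heat::MultipartMime
      properties:
        parts:
          - config: {get_resource: enable_iommu}

    enable_iommu:
      type: OS::Heat::SoftwareConfig
      properties:
        config: |
          #!/bin/bash
          # append intel_iommu=on to the kernel command line and regenerate grub
          sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"$/\1 intel_iommu=on"/' /etc/default/grub
          grub2-mkconfig -o /boot/grub2/grub.cfg

  outputs:
    OS::stack_id:
      value: {get_resource: userdata}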
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignees:
|
||||
|
||||
* karthiks
|
||||
* sanjayu
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Documented above in the Proposed changes
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* We are dependent on composable roles as SR-IOV specific changes are something
|
||||
we would require on specific compute nodes and not generally on all the
|
||||
nodes. Blueprint -
|
||||
https://blueprints.launchpad.net/tripleo/+spec/composable-services-within-roles
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
* Since SR-IOV needs specific hardware support, this feature can't be tested
|
||||
under CI. We will need third party CI for validating it.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
* Manual steps that need to be done by the operator shall be documented.
|
||||
Ex: configuring BIOS for VT-d, IOMMU, post deployment configurations.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* SR-IOV support for virtual networking
|
||||
https://goo.gl/eKP1oO
|
||||
|
||||
* Enable SR-IOV functionality available in OpenStack
|
||||
http://docs.openstack.org/liberty/networking-guide/adv_config_sriov.html
|
||||
|
||||
* Introduction to SR-IOV
|
||||
http://goo.gl/m7jP3
|
||||
|
||||
* Setup procedure for CPU pinning and NUMA topology
|
||||
http://goo.gl/TXxuhv
|
||||
|
||||
* /sys/bus/pci/devices/.../sriov_totalvfs - This file appears when a physical
|
||||
PCIe device supports SR-IOV.
|
||||
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci
|
||||
|
@ -1,272 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==================
|
||||
Undercloud Upgrade
|
||||
==================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/undercloud-upgrade
|
||||
|
||||
Our currently documented upgrade path for the undercloud is very problematic.
|
||||
In fact, it doesn't work. A number of different patches are attempting to
|
||||
address this problem (see the `References`_ section), but they all take slightly
|
||||
different approaches that are not necessarily compatible with each other.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The undercloud upgrade must be carefully orchestrated. A few of the problems
|
||||
that can be encountered during an undercloud upgrade if things are not done
|
||||
or not done in the proper order:
|
||||
|
||||
#. Services may fail and get stuck in a restart loop
|
||||
|
||||
#. Service databases may not be properly upgraded
|
||||
|
||||
#. Services may fail to stop and prevent the upgraded version from starting
|
||||
|
||||
Currently there is no agreement over who should be responsible for running
|
||||
the various steps of the undercloud upgrade. Getting everyone on the same
|
||||
page regarding this is the ultimate goal of this spec.
|
||||
|
||||
Also of note is the MariaDB major version update flow from
|
||||
`Upgrade documentation (under and overcloud)`_. This will need to be
|
||||
addressed as part of whatever upgrade solution we decide to pursue.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
I'm going to present my proposed solution here, but will try to give a fair
|
||||
overview of the other proposals in the `Alternatives`_ section. Others
|
||||
should feel free to push modifications or follow-ups if I miss anything
|
||||
important, however.
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Services must be stopped before their respective package update is run.
|
||||
This is because the RPM specs for the services include a mandatory restart to
|
||||
ensure that the new code is running after the package is updated. On a major
|
||||
version upgrade, this can and does result in broken services because the config
|
||||
files are not always forward compatible, so until Puppet is run again to
|
||||
configure them appropriately the service cannot start. The broken services
|
||||
can cause other problems as well, such as the yum update taking an excessively
|
||||
long time because it times out waiting for the service to restart. It's worth
|
||||
noting that this problem does not exist on an HA overcloud because Pacemaker
|
||||
stubs out the service restarts in the systemd services so the package update
|
||||
restart becomes a noop.
|
||||
|
||||
Because the undercloud is not required to have extremely high uptime, I am in
|
||||
favor of just stopping all of the services, updating all the packages, then
|
||||
re-running the undercloud install to apply the new configs and start the
|
||||
services again. This ensures that the services are not restarted by the
|
||||
package update - which only happens if the service was running at the time of
|
||||
the update - and that there is no chance of an old version of a service being
|
||||
left running and interfering with the new version, as can happen when moving
|
||||
a service from a standalone API process to httpd.
|
||||
|
||||
instack-undercloud will be responsible for implementing the process described
|
||||
above. However, to avoid complications with instack-undercloud trying to
|
||||
update itself, tripleoclient will be responsible for updating
|
||||
instack-undercloud and its dependencies first. This two-step approach
|
||||
should allow us to sanely use an older tripleoclient to run the upgrade
|
||||
because the code in the client will be minimal and should not change from
|
||||
release to release. Upgrade-related backports to stable clients should not
|
||||
be needed in any foreseeable case. Any potential version-specific logic can
|
||||
live in instack-undercloud. The one exception being that we may need to
|
||||
initially backport this new process to the previous stable branch so we can
|
||||
start using it without waiting an entire cycle. Since the current upgrade
|
||||
process does not work correctly there, I think this would be a valid bug fix
|
||||
backport.
|
||||
|
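As a rough illustration of the intended flow (only ``openstack undercloud
upgrade`` is a command referenced by this spec; the internal steps are an
assumption about how it could be implemented)::

  $ openstack undercloud upgrade
  # internally, roughly:
  #  1. tripleoclient updates instack-undercloud and its dependencies
  #     (e.g. yum update instack-undercloud)
  #  2. instack-undercloud stops the undercloud services, updates the
  #     remaining packages, then re-runs the undercloud install to apply
  #     the new configuration and start the services again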
||||
A potential drawback of this approach is that it will not automatically
|
||||
trigger the Puppet service db-syncs because Puppet is not aware that the
|
||||
version has changed if we update the packages separately. However, I feel
|
||||
this is a case we need to handle sanely anyway in case a package is updated
|
||||
outside Puppet either intentionally or accidentally. To that end, we've
|
||||
already merged a patch to always run db-syncs on the undercloud since they're
|
||||
idempotent anyway. See `Stop all services before upgrading`_ for a link to
|
||||
the patch.
|
||||
|
||||
MariaDB
|
||||
-------
|
||||
|
||||
Regarding the MariaDB issue mentioned above, I believe that regardless of the
|
||||
approach we take, we should automate the dump and restore of the database as
|
||||
much as possible. Either solution should be able to look at the version of
|
||||
mariadb before yum update and the version after, and decide whether the db
|
||||
needs to be dumped. If a user updates the package manually outside the
|
||||
undercloud upgrade flow then they will be responsible for the db upgrade
|
||||
themselves. I think this is the best we can do, short of writing some sort
|
||||
of heuristic that can figure out whether the existing db files are for an
|
||||
older version of MariaDB and doing the dump/restore based on that.
|
||||
|
||||
Updates vs. Upgrades
|
||||
--------------------
|
||||
|
||||
I am also proposing that we not differentiate between minor updates and major
|
||||
upgrades on the undercloud. Because we don't need to be as concerned with
|
||||
uptime there, any additional time required to treat all upgrades as a
|
||||
potential major version upgrade should be negligible, and it avoids us
|
||||
having to maintain and test multiple paths.
|
||||
|
||||
Additionally, the difference between a major and minor upgrade becomes very
|
||||
fuzzy for anyone upgrading between versions of master. There may be db
|
||||
or rpc changes that require the major upgrade flow anyway. Also, the whole
|
||||
argument assumes we can even come up with a sane, yet less-invasive update
|
||||
strategy for the undercloud anyway, and I think our time is better spent
|
||||
elsewhere.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
As shown in `Don't update whole system on undercloud upgrade`_, another
|
||||
option is to limit the manual yum update to just instack-undercloud and make
|
||||
Puppet responsible for updating everything else. This would allow Puppet
|
||||
to handle all of the upgrade logic internally. As of this writing, there is
|
||||
at least one significant problem with the patch as proposed because it does
|
||||
not update the Puppet modules installed on the undercloud, which leaves us
|
||||
in a chicken and egg situation with a newer instack-undercloud calling older
|
||||
Puppet modules to run the update. I believe this could be solved by also
|
||||
updating the Puppet modules along with instack-undercloud.
|
||||
|
||||
Drawbacks of this approach would be that each service needs to be orchestrated
|
||||
correctly in Puppet (this could also be a feature, from a Puppet CI
|
||||
perspective), and it does not automatically handle things like services moving
|
||||
from standalone to httpd. This could be mitigated by the undercloud upgrade
|
||||
CI job catching most such problems before they merge.
|
||||
|
||||
I still personally feel this is more complicated than the proposal above, but
|
||||
I believe it could work, and as noted could have benefits for CI'ing upgrades
|
||||
in Puppet modules.
|
||||
|
||||
There is one other concern with this that is less a functional issue, which is
|
||||
that it significantly alters our previous upgrade methods, and might be
|
||||
problematic to backport as older versions of instack-undercloud were assuming
|
||||
an external package update. It's probably not an insurmountable obstacle, but
|
||||
I do feel it's worth noting. Either approach is going to require some amount
|
||||
of backporting, but this may require backporting in non-tripleo Puppet modules
|
||||
which may be more difficult to do.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
No significant security impact one way or another.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
This will likely have an impact on how a user runs undercloud upgrades,
|
||||
especially compared to our existing documented upgrade method.
|
||||
Ideally all of the implementation will happen behind the ``openstack undercloud
|
||||
upgrade`` command regardless of which approach is taken, but even that is a
|
||||
change from before.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The method I am suggesting can do an undercloud upgrade in 20-25
|
||||
minutes end-to-end in a scripted CI job.
|
||||
|
||||
The performance impact of the Puppet approach is unknown to me.
|
||||
|
||||
The performance of the existing method where service packages are updated with
|
||||
the service still running is terrible - upwards of two hours for a full
|
||||
upgrade in some cases, assuming the upgrade completes at all. This is largely
|
||||
due to the aforementioned problem with services restarting before their config
|
||||
files have been updated.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
Same as the end user impact. In this case I believe they're the same person.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Discussed somewhat in the proposals, but I believe my approach is a little
|
||||
simpler from the developer perspective. They don't have to worry about the
|
||||
orchestration of the upgrade, they only have to provide a valid configuration
|
||||
for a given version of OpenStack. The one drawback is that if we add any new
|
||||
services on the undercloud, their db-sync must be wired into the "always run
|
||||
db-syncs" list.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignees:
|
||||
|
||||
* bnemec
|
||||
* EmilienM
|
||||
|
||||
Other contributors (I'm essentially listing everyone who has been involved in
|
||||
upgrade work so far):
|
||||
|
||||
* lbezdick
|
||||
* bandini
|
||||
* marios
|
||||
* jistr
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implement an undercloud upgrade CI job to test upgrades.
|
||||
* Implement the selected approach in the undercloud upgrade command.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
A CI job is already underway. See `Undercloud Upgrade CI Job`_. This should
|
||||
provide reasonable coverage on a per-patch basis. We may also want to test
|
||||
undercloud upgrades in periodic jobs to ensure that it is possible to deploy
|
||||
an overcloud with an upgraded undercloud. This probably takes too long to be
|
||||
done in the regular CI jobs, however.
|
||||
|
||||
There has also been discussion of running Tempest API tests on the upgraded
|
||||
undercloud, but I'm unsure of the status of that work. It would be good to
|
||||
have in the standalone undercloud upgrade job though.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The docs will need to be updated to reflect the new upgrade method. Hopefully
|
||||
this will be as simple as "Run openstack undercloud upgrade", but that remains
|
||||
to be seen.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Stop all services before upgrading
|
||||
----------------------------------
|
||||
Code: https://review.openstack.org/331804
|
||||
|
||||
Docs: https://review.openstack.org/315683
|
||||
|
||||
Always db-sync: https://review.openstack.org/#/c/346138/
|
||||
|
||||
Don't update whole system on undercloud upgrade
|
||||
-----------------------------------------------
|
||||
https://review.openstack.org/327176
|
||||
|
||||
Upgrade documentation (under and overcloud)
|
||||
-------------------------------------------
|
||||
https://review.openstack.org/308985
|
||||
|
||||
Undercloud Upgrade CI Job
|
||||
-------------------------
|
||||
https://review.openstack.org/346995
|
@ -1,159 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==============================
|
||||
TripleO Deployment Validations
|
||||
==============================
|
||||
|
||||
We need ways in TripleO to perform validations at various stages of the
|
||||
deployment.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
TripleO deployments, and more generally all OpenStack deployments, are complex,
|
||||
error prone, and highly dependent on the environment. An appropriate set of
|
||||
tools can help engineers to identify potential problems as early as possible
|
||||
and fix them before going further with the deployment.
|
||||
|
||||
People have already developed such tools [1], however they appear more like
|
||||
a random collection of scripts than a well integrated solution within TripleO.
|
||||
We need to expose the validation checks from a library so they can be consumed
|
||||
from the GUI or CLI without distinction and integrate flawlessly within TripleO
|
||||
deployment workflow.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
We propose to extend the TripleO Overcloud Deployment Mistral workflow [2] to
|
||||
include Actions for validation checks.
|
||||
|
||||
These actions will need at least to:
|
||||
|
||||
* List validations
|
||||
* Run and stop validations
|
||||
* Get validation status
|
||||
* Persist and retrieve validation results
|
||||
* Permit grouping validations by 'deployment stage' and executing group operations
|
||||
|
||||
Running validations will be implemented in a workflow to ensure the nodes meet
|
||||
certain expectations. For example, a baremetal validation may require the node
|
||||
to boot on a ramdisk first.
|
||||
|
||||
Mistral workflow execution can be started with the `mistral execution-create`
|
||||
command and can be stopped with the `mistral execution-update` command by
|
||||
setting the workflow status to either SUCCESS or ERROR.
|
||||
|
||||
Every run of the workflow (workflow execution) is stored in Mistral's DB and
|
||||
can be retrieved for later use. The workflow execution object contains all
|
||||
information about the workflow and its execution, including all output data and
|
||||
statuses for all the tasks composing the workflow.
|
||||
|
||||
By introducing a reasonable naming convention for validation workflows, we are able
|
||||
to use workflow names to identify the stage at which the validations should run and
|
||||
trigger all validations of a given stage (e.g.
|
||||
tripleo.validation.hardware.undercloudRootPartitionDiskSizeCheck).
|
||||
|
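A minimal sketch of what such a workflow definition could look like (the action
name and task body are illustrative assumptions; only the naming convention is
the point)::

  ---
  version: '2.0'

  tripleo.validation.hardware.undercloudRootPartitionDiskSizeCheck:
    type: direct
    tasks:
      check_root_partition:
        action: tripleo.run_validation validation='undercloud-root-disk-size'
        publish:
          status: <% task(check_root_partition).result %>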
||||
Using the naming conventions, the user is also able to register a new
|
||||
validation workflow and add it to the existing ones.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
One alternative is to ship a collection of scripts within TripleO to be run by
|
||||
engineers at different stages of the deployment. This solution is not optimal
|
||||
because it requires a lot of manual work and does not integrate with the UI.
|
||||
|
||||
Another alternative is to build our own API, but it would require significantly
|
||||
more effort to create and maintain. This topic has been discussed at length on
|
||||
the mailing list.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The whole point behind the validations framework is to permit running scripts
|
||||
on the nodes, thus providing access from the control node to the deployed nodes
|
||||
at different stages of the deployment. Special care needs to be taken to grant
|
||||
access to the target nodes using secure methods and ensure only trusted scripts
|
||||
can be executed from the library.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
We expect reduced deployment time thanks to early issue detection.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Developers will need to keep the TripleO CI updated with changes, and will be
|
||||
responsible for fixing the CI as needed.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignees:
|
||||
|
||||
* shadower
|
||||
* mandre
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
The work items required are:
|
||||
|
||||
* Develop the tripleo-common Mistral actions that provide all of the
|
||||
functionality required for the validation workflow.
|
||||
* Write an initial set of validation checks based on real deployment
|
||||
experience, starting by porting existing validations [1] to work with the
|
||||
implemented Mistral actions.
|
||||
|
||||
All patches that implement these changes must pass CI and add additional tests as
|
||||
needed.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
We are dependent upon the tripleo-mistral-deployment-library [2] work.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The TripleO CI should be updated to test the updated tripleo-common library.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Mistral Actions and Workflows are sort of self-documenting and can be easily
|
||||
introspected by running 'mistral workflow-list' or 'mistral action-list' on the
|
||||
command line. The updated library however will have to be well-documented and
|
||||
meet OpenStack standards. Documentation will be needed in both the
|
||||
tripleo-common and tripleo-docs repositories.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* [1] Set of tools to help detect issues during TripleO deployments:
|
||||
https://github.com/rthallisey/clapper
|
||||
* [2] Library support for TripleO Overcloud Deployment Via Mistral:
|
||||
https://specs.openstack.org/openstack/tripleo-specs/specs/mitaka/tripleo-mistral-deployment-library.html
|
@ -1,212 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Workflow Simplification
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/workflow-simplification
|
||||
|
||||
The TripleO workflow is still too complex for many (most?) users to follow
|
||||
successfully. There are some fairly simple steps we can take to improve
|
||||
that situation.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The current TripleO workflow grew somewhat haphazardly out of a collection
|
||||
of bash scripts that originally made up instack-undercloud. These scripts
|
||||
started out life as primarily a proof of concept exercise to demonstrate
|
||||
that the idea was viable, and while the steps still work fine when followed
|
||||
correctly, it seems "when followed correctly" is too difficult today, at least
|
||||
based on the feedback I'm hearing from users.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
There seem to be a number of low-hanging fruit candidates for cleanup. In the
|
||||
order in which they appear in the docs, these would be:
|
||||
|
||||
#. **Node registration** Why is this two steps? Is there ever a case where we
|
||||
would want to register a node but not configure it to be able to boot?
|
||||
If there is, is it a significant enough use case to justify the added
|
||||
step every time a user registers nodes?
|
||||
|
||||
I propose that we configure boot on newly registered nodes automatically.
|
||||
Note that this will probably require us to also update the boot
|
||||
configuration when updating images, but again this is a good workflow
|
||||
improvement. Users are likely to forget to reconfigure their nodes' boot
|
||||
images after updating them in Glance.
|
||||
|
||||
.. note:: This would not remove the ``openstack baremetal configure boot``
|
||||
command for independently updating the boot configuration of
|
||||
Ironic nodes. In essence, it would just always call the
|
||||
configure boot command immediately after registering nodes so
|
||||
it wouldn't be a mandatory step.
|
||||
|
||||
This also means that the deploy ramdisk would have to be built
|
||||
and loaded into Glance before registering nodes, but our
|
||||
documented process already satisfies that requirement, and we
|
||||
could provide a --no-configure-boot param to import for cases
|
||||
where someone wanted to register nodes without configuring them.
|
||||
|
||||
#. **Flavor creation** Nowhere in our documentation do we recommend or
|
||||
provide guidance on customizing the flavors that will be used for
|
||||
deployment. While it is possible to deploy solely based on flavor
|
||||
hardware values (ram, disk, cpu), in practice it is often simpler
|
||||
to just assign profiles to Ironic nodes and have scheduling done solely
|
||||
on that basis. This is also the method we document at this time.
|
||||
|
||||
I propose that we simply create all of the recommended flavors at
|
||||
undercloud install time and assign them the appropriate localboot and
|
||||
profile properties at that time. These flavors would be created with the
|
||||
minimum supported cpu, ram, and disk values so they would work for any
|
||||
valid hardware configuration. This would also reduce the possibility of
|
||||
typos in the flavor creation commands causing avoidable deployment
|
||||
failures.
|
||||
|
||||
These default flavors can always be customized if a user desires, so there
|
||||
is no loss of functionality from making this change.
|
||||
|
||||
#. **Node profile assignment** This is not currently part of the standard
|
||||
workflow, but in practice it is something we need to be doing for most
|
||||
real-world deployments with heterogeneous hardware for controllers,
|
||||
computes, cephs, etc. Right now the documentation requires running an
|
||||
ironic node-update command specifying all of the necessary capabilities
|
||||
(in the manual case anyway, this section does not apply to the AHC
|
||||
workflow).
|
||||
|
||||
os-cloud-config does have support for specifying the node profile in
|
||||
the imported JSON file, but to my knowledge we don't mention that anywhere
|
||||
in the documentation. This would be the lowest of low-hanging
|
||||
fruit since it's simply a question of documenting something we already
|
||||
have.
|
||||
|
||||
We could even give the generic baremetal flavor a profile and have our
|
||||
default instackenv.json template include that[1] (sketched after this list), with a note that it can
|
||||
be overridden to a more specific profile if desired. If users want to
|
||||
change a profile assignment after registration, the node update command
|
||||
for ironic will still be available.
|
||||
|
||||
1. For backwards compatibility, we might want to instead create a new flavor
|
||||
named something like 'default' and use that, leaving the old baremetal
|
||||
flavor as an unprofiled thing for users with existing unprofiled nodes.
|
||||
|
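As an illustration (all values are placeholders), a node entry in the sample
instackenv.json could then carry the profile capability directly::

  {
    "nodes": [
      {
        "pm_type": "pxe_ipmitool",
        "pm_addr": "192.0.2.10",
        "pm_user": "admin",
        "pm_password": "password",
        "mac": ["52:54:00:aa:bb:cc"],
        "capabilities": "profile:default,boot_option:local"
      }
    ]
  }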
||||
Alternatives
|
||||
------------
|
||||
|
||||
tripleo.sh
|
||||
~~~~~~~~~~
|
||||
tripleo.sh addresses the problem to some extent for developers, but it is
|
||||
not a viable option for real world deployments (nor should it be IMHO).
|
||||
However, it may be valuable to look at tripleo.sh for guidance on a simpler
|
||||
flow that can be more easily followed, as that is largely the purpose of the
|
||||
script. A similar flow codified into the client/API would be a good result
|
||||
of these proposed changes.
|
||||
|
||||
Node Registration
|
||||
~~~~~~~~~~~~~~~~~
|
||||
One option Dmitry has suggested is to make the node registration operation
|
||||
idempotent, so that it can be re-run any number of times and will simply
|
||||
update the details of any already registered nodes. He also suggested
|
||||
moving the bulk import functionality out of os-cloud-config and (hopefully)
|
||||
into Ironic itself.
|
||||
|
||||
I'm totally in favor of both these options, but I suspect that they will
|
||||
represent a significantly larger amount of work than the other items in this
|
||||
spec, so I think I'd like that to be addressed as an independent spec since
|
||||
this one is already quite large.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Minimal, if any. This is simply combining existing deployment steps. If we
|
||||
were to add a new API for node profile assignment that would have some slight
|
||||
security impact as it would increase our attack surface, but I feel even that
|
||||
would be negligible.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Simpler deployments. This is all about the end user.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Some individual steps may take longer, but only because they will be
|
||||
performing actions that were previously in separate steps. In aggregate
|
||||
the process should take about the same time.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
If all of these suggested improvements are implemented, it will make the
|
||||
standard deployment process somewhat less flexible. However, in the
|
||||
Proposed Change section I attempted to address any such new limitations,
|
||||
and I feel they are limited to the edgiest of edge cases that in most cases
|
||||
can still be implemented through some extra manual steps (which likely would
|
||||
have been necessary anyway - they are edge cases after all).
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
There will be some changes in the basic workflow, but as noted above the same
|
||||
basic steps will be getting run. Developers will see some impact from the
|
||||
proposed changes, but as they will still likely be using tripleo.sh for an
|
||||
already simplified workflow it should be minimal.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
bnemec
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Configure boot on newly registered nodes automatically.
|
||||
* Reconfigure boot on nodes after deploy images are updated.
|
||||
* Remove explicit step for configure boot from the docs, but leave the actual
|
||||
function itself in the client so it can still be used when needed.
|
||||
* Create flavors at undercloud install time and move documentation on creating
|
||||
them manually to the advanced section of the docs.
|
||||
* Add a 'default' flavor to the undercloud.
|
||||
* Update the sample instackenv.json to include setting a profile (by default,
|
||||
the 'default' profile associated with the flavor from the previous step).
|
||||
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Nothing that I'm aware of.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
As these changes are implemented, we would need to update tripleo.sh to match
|
||||
the new flow, which will result in the changes being covered in CI.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
This should reduce the number of steps in the basic deployment flow in the
|
||||
documentation. It is intended to simplify the documentation.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Proposed change to create flavors at undercloud install time:
|
||||
https://review.openstack.org/250059
|
||||
https://review.openstack.org/251555
|
@ -1,133 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
===========================================
|
||||
Tool to Capture Environment Status and Logs
|
||||
===========================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/capture-environment-status-and-logs
|
||||
|
||||
To aid in troubleshooting, debugging, and reproducing issues we should create
|
||||
or integrate with a tool that allows an operator or developer to collect and
|
||||
generate a single bundle that provides the state and history of a deployed
|
||||
environment.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently there is no single command that can be run via either the
|
||||
tripleoclient or the UI that will generate a single artifact to be used
|
||||
to report issues when failures occur.
|
||||
|
||||
* tripleo-quickstart_, tripleo-ci_ and operators collect the logs for bug
|
||||
reports in different ways.
|
||||
|
||||
* When a failure occurs, many different pieces of information must be collected
|
||||
to be able to understand where the failure occurred. If the required logs are
|
||||
not asked for, an operator may not know what to provide for
|
||||
troubleshooting.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
TripleO should provide a unified method for collecting status and logs from the
|
||||
undercloud and overcloud nodes. The tripleoclient should support executing a
|
||||
workflow to run status and log collection processes via sosreport_. The output
|
||||
of the sosreport_ should be collected and bundled together in a single location.
|
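On each node the workflow would, roughly, run sosreport non-interactively and
collect the resulting archive; the plugin list below is only an assumption of
what a TripleO profile might enable::

  sosreport --batch --tmp-dir /var/tmp \
      -o openstack_nova,openstack_neutron,openstack_heat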
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Currently, various shell scripts and ansible tasks are used by the CI processes
|
||||
to perform log collection. These scripts are not maintained alongside
|
||||
the core TripleO projects and may require additional artifacts that are not installed by
|
||||
default with a TripleO environment.
|
||||
|
||||
tripleo-quickstart_ uses ansible-role-tripleo-collect-logs_ to collect logs.
|
||||
|
||||
tripleo-ci_ uses bash scripts to collect the logs.
|
||||
|
||||
Fuel uses timmy_.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The logs and status information may be considered sensitive information. The
|
||||
process to trigger status and logs should require authentication. Additionally
|
||||
we should provide a basic password protection mechanism for the bundle of logs
|
||||
that is created by this process.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
alex-schultz
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Ensure OpenStack `sosreport plugins`_ are current.
|
||||
* Write a TripleO sosreport plugin.
|
||||
* Write a `Mistral workflow`_ to execute sosreport and collect artifacts.
|
||||
* Write python-tripleoclient_ integration to execute Mistral workflows.
|
||||
* Update documentation and CI scripts to leverage new collection method.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
As part of CI testing, the new tool should be used to collect environment logs.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Documentation should be updated to reflect the standard ways to collect the logs
|
||||
using the tripleo client.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. _ansible-role-tripleo-collect-logs: https://github.com/redhat-openstack/ansible-role-tripleo-collect-logs
|
||||
.. _Mistral workflow: http://docs.openstack.org/developer/mistral/terminology/workflows.html
|
||||
.. _python-tripleoclient: https://github.com/openstack/python-tripleoclient
|
||||
.. _tripleo-ci: https://github.com/openstack-infra/tripleo-ci
|
||||
.. _tripleo-quickstart: https://github.com/openstack/tripleo-quickstart
|
||||
.. _sosreport: https://github.com/sosreport/sos
|
||||
.. _sosreport plugins: https://github.com/sosreport/sos/tree/master/sos/plugins
|
||||
.. _timmy: https://github.com/openstack/timmy
|
@ -1,201 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================
|
||||
Composable HA architecture
|
||||
==========================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/composable-ha
|
||||
|
||||
Since Newton, we have the following services managed by pacemaker:
|
||||
|
||||
* Cloned and master/slave resources:
|
||||
galera, redis, haproxy, rabbitmq
|
||||
|
||||
* Active/Passive resources:
|
||||
VIPs, cinder-volume, cinder-backup, manila-share
|
||||
|
||||
It is currently not possible to compose the above services in the same
|
||||
way as we do today via composable roles for the non-pacemaker services.
|
||||
This spec aims to address this limitation and let the operator be more flexible
|
||||
in the composition of the control plane.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently, TripleO implements no logic whatsoever to assign specific pacemaker
|
||||
managed services to roles/nodes.
|
||||
|
||||
* Since we do not have a lot in terms of hard performance data, we typically support
|
||||
three controller nodes. This is perceived as a scalability limiting factor and there is
|
||||
a general desire to be able to assign specific nodes to specific pacemaker-managed
|
||||
services (e.g. three nodes only for galera, five nodes only for rabbitmq)
|
||||
|
||||
* Right now, if the operator deploys on N controllers, they will get N cloned instances
|
||||
of the non-A/P pacemaker services on the same N nodes. We want to be able to
|
||||
be much more flexible. E.g. deploy galera on the first 3 nodes, rabbitmq on the
|
||||
remaining 5 nodes, etc.
|
||||
|
||||
* It is also desirable for the operator to be able to choose on which nodes the A/P
|
||||
resources will run.
|
||||
|
||||
* We also currently have a scalability limit of 16 nodes for the pacemaker cluster.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The proposal here is to keep the existing cluster in its current form, but to extend
|
||||
it in two ways:
|
||||
A) Allow the operator to include a specific service in a custom node and have pacemaker
|
||||
run that resource only on that node. E.g. the operator can define the following custom nodes:
|
||||
|
||||
* Node A
|
||||
pacemaker
|
||||
galera
|
||||
|
||||
* Node B
|
||||
pacemaker
|
||||
rabbitmq
|
||||
|
||||
* Node C
|
||||
pacemaker
|
||||
VIPs, cinder-volume, cinder-backup, manila-share, redis, haproxy
|
||||
|
||||
With the above definition the operator can instantiate any number of A, B or C nodes
|
||||
and scale up to a total of 16 nodes. Pacemaker will place the resources only on
|
||||
the appropriate nodes.
|
||||
|
||||
B) Allow the operator to extend the cluster beyond 16 nodes via pacemaker remote.
|
||||
For example an operator could define the following:
|
||||
|
||||
* Node A
|
||||
pacemaker
|
||||
galera
|
||||
rabbitmq
|
||||
|
||||
* Node B
|
||||
pacemaker-remote
|
||||
redis
|
||||
|
||||
* Node C
|
||||
pacemaker-remote
|
||||
VIPs, cinder-volume, cinder-backup, manila-share, redis, haproxy
|
||||
|
||||
This second scenario would allow an operator to extend beyond the 16 nodes limit.
|
||||
The only difference to scenario 1) is the fact that the quorum of the cluster is
|
||||
obtained only by the nodes from Node A.
|
||||
|
||||
The way this would work is that the placement on nodes would be controlled by location
|
||||
rules that would work based on node properties matching.
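
As an illustration of the mechanism (attribute and resource names are
examples only), the generated configuration would be roughly equivalent to::

    # tag the node with a property describing the role it hosts
    pcs node attribute overcloud-galera-0 galera-role=true

    # constrain the galera resource to nodes carrying that property
    pcs constraint location galera-master rule score=0 galera-role eq true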
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
A number of alternative designs were discussed and evaluated:
|
||||
A) A cluster per service:
|
||||
|
||||
One possible architecture would be to create a separate pacemaker cluster for
|
||||
each HA service. This has been ruled out mainly for the following reasons:
|
||||
|
||||
* It cannot be done outside of containers
|
||||
* It would create a lot of network traffic
|
||||
|
||||
* It would increase the management/monitoring of the pacemaker resources and clusters
|
||||
exponentially
|
||||
|
||||
* Each service would still be limited to 16 nodes
|
||||
* A new container fencing agent would have to be written
|
||||
|
||||
B) A single cluster where only the clone-max property is set for the non A/P services
|
||||
|
||||
This would be still a single cluster, but unlike today where the cloned and
|
||||
master/slave resources run on every controller we would introduce variables to
|
||||
control the maximum number of nodes a resource could run on. E.g.
|
||||
GaleraResourceCount would set clone-max to a value different than the number of
|
||||
controllers. Example: 10 controllers, galera has clone-max set to 3, rabbit to
|
||||
5 and redis to 3.
|
||||
While this would be rather simple to implement and would change very little in the
|
||||
current semantics, this design was ruled out:
|
||||
|
||||
* We'd still have the 16 nodes limit
|
||||
* It would not provide fine grained control over which services live on which nodes
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
No changes regarding security aspects compared to the existing status quo.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
No particular impact except added flexibility in placing pacemaker-managed resources.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The performance impact here is that with the added scalability it will be possible for
|
||||
an operator to dedicate specific nodes for certain pacemaker-managed services.
|
||||
There are no changes in terms of code, only a more flexible and scalable way to deploy
|
||||
services on the control plane.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
This proposal aims to use the same method that the custom roles introduced in Newton
|
||||
use to tailor the services running on a node. With the very same method it will be possible
|
||||
to do that for the HA services managed by pacemaker today.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
No impact
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
michele
|
||||
|
||||
Other contributors:
|
||||
cmsj, abeekhof
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
We need to work on the following:
|
||||
|
||||
1. Add location rule constraints support in puppet
|
||||
2. Make puppet-tripleo set node properties on the nodes where a service profile is applied
|
||||
3. Create corresponding location rules
|
||||
4. Add a puppet-tripleo pacemaker-remote profile
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
No additional dependencies are required.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
We will need to test the flexible placement of the pacemaker-managed services
|
||||
within the CI. This can be done within today's CI limitations (i.e. in the three
|
||||
controller HA job we can make sure that the placement is customized and working)
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
No impact
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Mostly internal discussions within the HA team at Red Hat
|
@ -1,212 +0,0 @@
|
||||
===============================
|
||||
Deploying TripleO in Containers
|
||||
===============================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/containerize-tripleo
|
||||
|
||||
Ability to deploy TripleO in Containers.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Linux containers are changing how the industry deploys applications by offering
|
||||
a lightweight, portable and upgradeable alternative to deployments on a physical
|
||||
host or virtual machine.
|
||||
|
||||
Since TripleO already manages OpenStack infrastructure by using OpenStack
|
||||
itself, containers could be a new approach to deploy OpenStack services. It
|
||||
would change the deployment workflow but could extend upgrade capabilities,
|
||||
orchestration, and security management.
|
||||
|
||||
Benefits of containerizing the OpenStack services include:
|
||||
|
||||
* Upgrades can be performed by swapping out containers.
|
||||
* Since the entire software stack is held within the container,
|
||||
interdependencies do not affect deployments of services.
|
||||
* Containers define explicit state and data requirements. Ultimately if we
|
||||
moved to kubernetes all volumes would be off the host making the host
|
||||
stateless.
|
||||
* Easy rollback to working containers if upgrading fails.
|
||||
* Software shipped in each container has been proven to work for this service.
|
||||
* Mix and match versions of services on the same host.
|
||||
* Immutable containers provide a consistent environment upon startup.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The intention of this blueprint is to introduce containers as a method of
|
||||
delivering an OpenStack installation. We currently have a fully functioning
|
||||
containerized version of the compute node but we would like to extend this to
|
||||
all services. In addition it should work with the new composable roles work that
|
||||
has been recently added.
|
||||
|
||||
The idea is to have an interface within the heat templates that adds information
|
||||
for each service to be started as a container. This container format should
|
||||
closely resemble the Kubernetes definition so we can possibly transition to
|
||||
Kubernetes in the future. This work has already been started here:
|
||||
|
||||
https://review.openstack.org/#/c/330659/
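
To give a feel for the direction (the exact schema is being worked out in the
review above; the keys below are illustrative only), a per-service container
definition could look roughly like::

    nova_compute:
      image: centos-binary-nova-compute
      net: host
      privileged: true
      volumes:
        - /var/lib/config-data/nova:/etc/nova:ro
        - /run:/run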
|
||||
|
||||
There are some technology choices that have been made to keep things usable and
|
||||
practical. These include:
|
||||
|
||||
* Using Kolla containers. Kolla containers are built using the most popular
|
||||
operating system choices including CentOS, RHEL, Ubuntu, etc. and are a
|
||||
good fit for our use case.
|
||||
* We are using a heat hook to start these containers directly via docker.
|
||||
This minimizes the software required on the node and maps directly to the
|
||||
current baremetal implementation. Also maintaining the heat interface
|
||||
keeps the GUI functional and allows heat to drive upgrades and changes to
|
||||
containers.
|
||||
* Changing the format of container deployment to match Kubernetes for
|
||||
potential future use of this technology.
|
||||
* Using CentOS in full (not CentOS Atomic) on the nodes to allow users to
|
||||
have a usable system for debugging.
|
||||
* Puppet driven configuration that is mounted into the container at startup.
|
||||
This allows us to retain our puppet configuration system and operate in
|
||||
parallel with existing baremetal deployment.
|
||||
|
||||
Bootstrapping
|
||||
-------------
|
||||
|
||||
Once the node is up and running, a systemd service script runs that
|
||||
starts the docker agents container. This container has all of the
|
||||
components needed to bootstrap the system. This includes:
|
||||
|
||||
* heat agents including os-collect-config, os-apply-config etc.
|
||||
* puppet-agent and modules needed for the configuration of the deployment.
|
||||
* docker client that connects to host docker daemon.
|
||||
* environment for configuring networking on the host.
|
||||
|
||||
This container acts as a self-installing container. Once started, this
|
||||
container will use os-collect-config to connect back to heat. The heat agents
|
||||
then perform the following tasks:
|
||||
|
||||
* Set up an etc directory and run the puppet configuration scripts. This
|
||||
generates all the config files needed by the services in the same manner
|
||||
it would if run without containers. These are copied into a directory
|
||||
accessible on the host and by all containerized services.
|
||||
* Begin starting containerized services and other steps as defined in the
|
||||
heat template.
|
||||
|
||||
Currently all containers are implemented using net=host to allow the services to
|
||||
listen directly on the host network(s). This maintains functionality in terms of
|
||||
network isolation and IPv6.
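
For illustration, a service started by the docker hook would be roughly
equivalent to the following invocation (image name and mount paths are
examples only)::

    docker run --detach --net=host --name nova_compute \
        --volume /var/lib/config-data/nova:/etc/nova:ro \
        --volume /var/log/containers/nova:/var/log/nova \
        tripleoupstream/centos-binary-nova-compute:latest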
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
There shouldn't be major security impacts from this change. The deployment
|
||||
shouldn't be affected negatively by this change from a security standpoint but
|
||||
unknown issues might be found. SELinux support is implemented in Docker.
|
||||
|
||||
End User Impact
|
||||
---------------
|
||||
|
||||
* Debugging of containerized services will be different as it will require
|
||||
knowledge about docker (kubernetes in the future) and other tools to access
|
||||
the information from the containers.
|
||||
* Possibly provide more options for upgrades and new versions of services.
|
||||
* It will allow for service isolation and better dependency management.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Very little impact:
|
||||
|
||||
* Runtime performance should remain the same.
|
||||
* We are noticing a slightly longer bootstrapping time with containers but that
|
||||
should be fixable with a few easy optimizations.
|
||||
|
||||
Deployer Impact
|
||||
---------------
|
||||
|
||||
From a deployment perspective very little changes:
|
||||
* Deployment workflow remains the same.
|
||||
* There may be more options for versions of different services since we do
|
||||
not need to worry about interdependency issues with the software stack.
|
||||
|
||||
Upgrade Impact
|
||||
--------------
|
||||
|
||||
This work aims to allow for resilient, transparent upgrades from baremetal
|
||||
overcloud deployments to container based ones.
|
||||
|
||||
Initially we need to transition to containers:
|
||||
* Would require node reboots.
|
||||
* Automated upgrades should be possible as services are the same, just
|
||||
containerized.
|
||||
* Some state may be moved off nodes to centralized storage. Containers very
|
||||
clearly define required data and state storage requirements.
|
||||
|
||||
Upgrades could be made easier:
|
||||
* Individual services can be upgraded because of reduced interdependencies.
|
||||
* It is easier to roll back to a previous version of a service.
|
||||
* Explicit storage of data and state for containers makes it very clear what
|
||||
needs to be preserved. Ultimately state information and data will likely
|
||||
not exist on individual nodes.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
The developer workflow changes slightly. Instead of interacting with the service
|
||||
via systemd and log files, you will interact with the service via docker.
|
||||
|
||||
Inside the compute node:
|
||||
* sudo docker ps -a
|
||||
* sudo docker logs <container-name>
|
||||
* sudo docker exec -it <container-name> /bin/bash
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
rhallisey
|
||||
imain
|
||||
flaper87
|
||||
mandre
|
||||
|
||||
Other contributors:
|
||||
dprince
|
||||
emilienm
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Heat Docker hook that starts containers (DONE)
|
||||
* Containerized Compute (DONE)
|
||||
* TripleO CI job (INCOMPLETE - https://review.openstack.org/#/c/288915/)
|
||||
* Containerized Controller
|
||||
* Automatically build containers for OpenStack services
|
||||
* Containerized Undercloud
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Composable roles.
|
||||
* Heat template interface which allows extensions to support containerized
|
||||
service definitions.
|
||||
|
||||
Testing
|
||||
=======
|
||||
TripleO CI would need a new Jenkins job that will deploy an overcloud in
|
||||
containers by using the selected solution.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/README-containers.md
|
||||
|
||||
* Deploying TripleO in containers
|
||||
* Debugging TripleO containers
|
||||
|
||||
References
|
||||
==========
|
||||
* https://docs.docker.com/misc/
|
||||
* https://etherpad.openstack.org/p/tripleo-docker-puppet
|
||||
* https://docs.docker.com/articles/security/
|
||||
* http://docs.openstack.org/developer/kolla/
|
||||
* https://review.openstack.org/#/c/209505/
|
||||
* https://review.openstack.org/#/c/227295/
|
@ -1,236 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
GUI Deployment configuration improvements
|
||||
==========================================
|
||||
|
||||
TripleO UI deployment configuration is based on enabling environments provided by
|
||||
deployment plan (tripleo-heat-templates) and letting user set parameter values.
|
||||
|
||||
This spec proposes improvements to this approach.
|
||||
|
||||
Blueprint: https://blueprints.launchpad.net/tripleo/+spec/deployment-configuration-improvements
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The general goal of TripleO UI is to guide the user through the deployment
|
||||
process and provide relevant information along the way, so the user does not
|
||||
have to search for a context in documentation or by analyzing TripleO templates.
|
||||
|
||||
There is a set of problems identified with the current deployment configuration
|
||||
solution. Resolving those problems should lead to improved user experience when
|
||||
making deployment design decisions.
|
||||
|
||||
The important information about the usage of environment and relevant parameters
|
||||
is usually included as a comment in environment file itself. This is not consumable by GUI.
|
||||
We currently use capabilities-map.yaml to define environment meta data to work
|
||||
around this.
|
||||
|
||||
* As the number of environments is growing it is hard to keep capabilities-map.yaml
|
||||
up to date. When a new environment is added, capabilities-map.yaml is usually
|
||||
not updated by the same developer, which leads to inaccuracy in environment
|
||||
description when added later.
|
||||
|
||||
* The environments depend on each other and potentially collide when used together
|
||||
|
||||
* There are no means to list and let user set parameters relevant to certain
|
||||
environment. These are currently listed as comments in environments - not
|
||||
consumable by GUI (example: [1])
|
||||
|
||||
* There are not enough means to organize parameters coming as a result of
|
||||
heat validate
|
||||
|
||||
* Not all parameters defined in tripleo-heat-templates have the correct type set
|
||||
or include all the relevant information that the HOT spec provides
|
||||
(constraints...)
|
||||
|
||||
* The same parameters are defined in multiple templates in tripleo-heat-templates
|
||||
but their definitions differ
|
||||
|
||||
* The list of parameters which are supposed to be auto-generated when a value is not
|
||||
provided by the user is hard-coded in the deployment workflow
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
* Propose environment metadata to track additional information about environment
|
||||
directly as part of the file in Heat (partially in progress [2]). Similar concept is
|
||||
already present in heat resources [3].
|
||||
In the meantime update tripleo-common environment listing feature to read
|
||||
environments and include environment metadata.
|
||||
|
||||
Each TripleO environment file should define:
|
||||
|
||||
.. code::
|
||||
|
||||
metadata:
|
||||
label: <human readable environment name>
|
||||
description: <description of the environment purpose>
|
||||
|
||||
resource_registry:
|
||||
...
|
||||
|
||||
parameter_defaults:
|
||||
...
|
||||
|
||||
|
||||
* With the environment metadata in place, capabilities-map.yaml purpose would
|
||||
simplify to defining grouping and dependencies among environments.
|
||||
|
||||
* Implement environment parameter listing in TripleO UI
|
||||
|
||||
* To organize parameters we should use ParameterGroups.
|
||||
(related discussion: [4])
|
||||
|
||||
* Make sure that same parameters are defined the same way across tripleo-heat-templates
|
||||
There may be exceptions, but in those cases it must be ensured that two templates which
|
||||
define the same parameter differently won't be used at the same time.
|
||||
|
||||
* Update parameter definitions in TripleO templates, so the type actually matches
|
||||
the expected parameter value (e.g. 'string' vs 'boolean'). This will result in the correct
|
||||
input type being used in the GUI.
|
||||
|
||||
* Define a custom constraint for parameters which are supposed to be auto-generated.
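
As a rough sketch of that last item (the constraint name is purely
illustrative and would be defined as part of this work), an auto-generated
parameter could be declared as::

    parameters:
      SnmpdReadonlyUserPassword:
        type: string
        hidden: true
        constraints:
          - custom_constraint: tripleo.autogenerate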
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Potential alternatives to listing environment related parameters are:
|
||||
|
||||
* Use Parameter Groups to match template parameters to an environment. This
|
||||
solution ties the template with an environment and clutters the template.
|
||||
|
||||
|
||||
* As the introduction of environment metadata depends on having this feature accepted
|
||||
and implemented in Heat, alternative solution is to keep title and description in
|
||||
capabilities map as we do now
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
No significant security impact
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Resolving mentioned problems greatly improves the TripleO UI workflow and
|
||||
makes deployment configuration much more streamlined.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The described approach allows introducing caching of Heat validation, which is
|
||||
currently the most expensive operation. The cache only becomes invalid
|
||||
when a deployment plan is updated or switched.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
Same as End User Impact
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
jtomasek
|
||||
|
||||
Other contributors:
|
||||
rbrady
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* tripleo-heat-templates: update environments to include metadata (label,
|
||||
description), update parameter_defaults to include all parameters relevant
|
||||
to the environment
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/update-environment-files-with-related-parameters
|
||||
|
||||
* tripleo-heat-templates: update capabilities-map.yaml to map environment
|
||||
grouping and dependencies
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/update-capabilities-map-to-map-environment-dependencies
|
||||
|
||||
* tripleo-heat-templates: create parameter groups for deprecated and internal
|
||||
parameters
|
||||
|
||||
* tripleo-heat-templates: make sure that same parameters have the same definition
|
||||
|
||||
bug: https://bugs.launchpad.net/tripleo/+bug/1640243
|
||||
|
||||
* tripleo-heat-templates: make sure type is properly set for all parameters
|
||||
|
||||
bug: https://bugs.launchpad.net/tripleo/+bug/1640248
|
||||
|
||||
* tripleo-heat-templates: create custom constraint for autogenerated parameters
|
||||
|
||||
bug: https://bugs.launchpad.net/tripleo/+bug/1636987
|
||||
|
||||
* tripleo-common: update environments listing to combine capabilities map with
|
||||
environment metadata
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/update-capabilities-map-to-map-environment-dependencies
|
||||
|
||||
* tripleo-ui: Environment parameters listing
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/get-environment-parameters
|
||||
|
||||
* tripleo-common: autogenerate values for parameters with custom constraint
|
||||
|
||||
bug: https://bugs.launchpad.net/tripleo/+bug/1636987
|
||||
|
||||
* tripleo-ui: update environment configuration to reflect API changes, provide means to display and configure environment parameters
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/tripleo-ui-deployment-configuration
|
||||
|
||||
* tripleo-ui: add client-side parameter validations based on parameter type
|
||||
and constraints
|
||||
|
||||
bugs: https://bugs.launchpad.net/tripleo/+bug/1638523, https://bugs.launchpad.net/tripleo/+bug/1640463
|
||||
|
||||
* tripleo-ui: don't show parameters included in deprecated and internal groups
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Heat Environment metadata discussion [2]
|
||||
|
||||
* Heat Parameter Groups discussion [3]
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The changes should be covered by unit tests in tripleo-common and GUI
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Part of this effort should be proper documentation of how TripleO environments
|
||||
as well as capabilities-map.yaml should be defined
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] https://github.com/openstack/tripleo-heat-templates/blob/b6a4bdc3e4db97785b930065260c713f6e70a4da/environments/storage-environment.yaml
|
||||
|
||||
[2] http://lists.openstack.org/pipermail/openstack-dev/2016-June/097178.html
|
||||
|
||||
[3] http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#resources-section.
|
||||
|
||||
[4] http://lists.openstack.org/pipermail/openstack-dev/2016-August/102297.html
|
@ -1,154 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==================================
|
||||
GUI: Import/Export Deployment Plan
|
||||
==================================
|
||||
|
||||
Add two features to TripleO UI:
|
||||
|
||||
* Import a deployment plan with a Mistral environment
|
||||
* Export a deployment plan with a Mistral environment
|
||||
|
||||
Blueprint: https://blueprints.launchpad.net/tripleo/+spec/gui-plan-import-export
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Right now, the UI only supports simple plan creation. The user needs to upload
|
||||
the plan files, make the environment selection and set the parameters. We want
|
||||
to add a plan import feature which would allow the user to import the plan
|
||||
together with a complete Mistral environment. This way the selection of the
|
||||
environment and parameters would be stored and automatically imported, without
|
||||
any need for manual configuration.
|
||||
|
||||
Conversely, we want to allow the user to export a plan together with a Mistral
|
||||
environment, using the UI.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
In order to identify the Mistral environment when importing a plan, I propose
|
||||
we use a JSON formatted file and name it 'plan-environment.json'. This file
|
||||
should be uploaded to the Swift container together with the rest of the
|
||||
deployment plan files. The convention of calling the file with a fixed name is
|
||||
enough for it to be detected. Once this file is detected by the tripleo-common
|
||||
workflow handling the plan import, the workflow then creates (or updates) the
|
||||
Mistral environment using the file's contents. In order to avoid possible future
|
||||
unintentional overwriting of environment, the workflow should delete this file
|
||||
once it has created (or updated) the Mistral environment with its contents.
|
||||
|
||||
Exporting the plan should consist of downloading all the plan files from the
|
||||
swift container, adding the plan-environment.json, and packing it all up in
|
||||
a tarball.
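
A minimal sketch of what such a plan-environment.json could contain (the
exact schema is to be settled during implementation; keys and values below
are illustrative only)::

    {
      "template": "overcloud.yaml",
      "environments": [
        {"path": "environments/network-isolation.yaml"},
        {"path": "environments/storage-environment.yaml"}
      ],
      "parameter_defaults": {
        "ControllerCount": 3,
        "ComputeCount": 2
      }
    }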
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
One alternative is what we have now, i.e. making the user perform all the
|
||||
environment configuration settings and parameter settings manually each time.
|
||||
This is obviously very tedious and the user experience suffers greatly as a
|
||||
result.
|
||||
|
||||
The alternative to deleting the plan-environment.json file upon its
|
||||
processing is to leave it in the swift container and keep it in sync with all
|
||||
the updates that might happen thereafter. This can get very complicated and is
|
||||
entirely unnecessary, so deleting the file instead is a better choice.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The import and export features will only be triggered on demand (user clicks
|
||||
on button, or similar), so they will have no performance impact on the rest
|
||||
of the application.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
akrivoka
|
||||
|
||||
Other contributors:
|
||||
jtomasek
|
||||
d0ugal
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* tripleo-common: Enhance plan creation/update to consume plan-environment.json
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/enhance-plan-creation-with-plan-environment-json
|
||||
|
||||
* tripleo-common: Add plan export workflow
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/plan-export-workflow
|
||||
|
||||
* python-tripleoclient: Add plan export command
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/plan-export-command
|
||||
|
||||
* tripleo-ui: Integrate plan export into UI
|
||||
|
||||
blueprint: https://blueprints.launchpad.net/tripleo/+spec/plan-export-gui
|
||||
|
||||
Note: We don't need any additional UI (neither GUI nor CLI) for plan import - the
|
||||
existing GUI elements and CLI command for plan creation can be used for import
|
||||
as well.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The changes should be covered by unit tests in tripleo-ui, tripleo-common and
|
||||
python-tripleoclient.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
User documentation should be enhanced by adding instructions on how these two
|
||||
features are to be used.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
None
|
@ -1,190 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
============================================================
|
||||
Enable deployment of alternative backends for oslo.messaging
|
||||
============================================================
|
||||
|
||||
Include the URL of your launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/om-dual-backends
|
||||
|
||||
This spec describes adding two functional capabilities to the messaging
|
||||
services of an overcloud deployment. The first capability is to enable
|
||||
the selection and configuration of separate messaging backends for
|
||||
oslo.messaging RPC and Notification communications. The second
|
||||
capability is to introduce support for a brokerless messaging backend
|
||||
for oslo.messaging RPC communications via the AMQP 1.0 Apache
|
||||
qpid-dispatch-router.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The oslo.messaging library supports the deployment of dual messaging system
|
||||
backends. This enables alternative backends to be deployed for RPC and
|
||||
Notification messaging communications. Users have identified the
|
||||
constraints of using a store and forward (broker based) messaging system for RPC
|
||||
communications and are seeking direct messaging (brokerless)
|
||||
approaches to optimize the RPC messaging pattern. In addition to
|
||||
operational challenges, emerging distributed cloud architectures
|
||||
define requirements around peer-to-peer relationships and geo-locality
|
||||
that can be addressed through intelligent messaging transport routing
|
||||
capabilities such as is provided by the AMQP 1.0 qpid-dispatch-router.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Provide the capability to select and configure alternative
|
||||
transport_url's for oslo.messaging RPCs and Notifications across
|
||||
overcloud OpenStack services.
|
||||
|
||||
Retain the current default behavior to deploy the rabbitMQ server as
|
||||
the messaging backend for both RPC and Notification communications.
|
||||
|
||||
Introduce an alternative deployment of the qpid-dispatch-router as the
|
||||
messaging backend for RPC communications.
|
||||
|
||||
Utilize the oslo.messaging AMQP 1.0 driver for delivering RPC services
|
||||
via the dispatch-router messaging backend.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The configuration of dual backends for oslo.messaging could be
|
||||
performed post overcloud deployment.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The end result of using the AMQP 1.0 dispatch-router as an alternative
|
||||
messaging backend for oslo.messaging RPC communications should be the
|
||||
same from a security standpoint. The driver/router solution provides
|
||||
SSL and SASL support in parity to the current rabbitMQ server deployment.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
The configuration of the dual backends for RPC and Notification
|
||||
messaging communications should be transparent to the operation of the OpenStack
|
||||
services.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Using a dispatch-router mesh topology rather than broker clustering
|
||||
for messaging communications will have a positive impact on
|
||||
performance and scalability by:
|
||||
|
||||
* Directly expanding connection capacity
|
||||
|
||||
* Providing parallel communication flows across the mesh
|
||||
|
||||
* Increasing aggregate message transfer capacity
|
||||
|
||||
* Improving resource utilization of messaging infrastructure
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
The deployment of the dispatch-router, however, will be new to
|
||||
OpenStack operators. Operators will need to learn the
|
||||
architectural differences as compared to a broker cluster
|
||||
deployment. This will include capacity planning, monitoring,
|
||||
troubleshooting and maintenance best practices.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Support for alternative oslo.messaging backends and deployment of
|
||||
qpid-dispatch-router in addition to rabbitMQ should be implemented for
|
||||
tripleo-quickstart.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
|
||||
* John Eckersberg <jeckersb@redhat.com>
|
||||
|
||||
* Andy Smith <ansmith@redhat.com>
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Update overcloud templates for dual backends and dispatch-router service
|
||||
|
||||
* Add dispatch-router packages to overcloud image elements
|
||||
|
||||
* Add services template for dispatch-router
|
||||
|
||||
* Update OpenStack services base templates to select and configure
|
||||
transport_urls for RPC and Notification
|
||||
|
||||
* Deploy dispatch-router for controller and compute for topology
|
||||
|
||||
* Test failure and recovery scenarios for dispatch-router
|
||||
|
||||
Transport Configuration
|
||||
-----------------------
|
||||
|
||||
The oslo.messaging configuration options define a default and an
|
||||
additional notification transport_url. If the notification
|
||||
transport_url is not specified, oslo.messaging will use the default
|
||||
transport_url for both RPC and Notification messaging communications.
|
||||
|
||||
The transport_url parameter is of the form::
|
||||
|
||||
transport://user:pass@host1:port[,hostN:portN]/virtual_host
|
||||
|
||||
Where the transport scheme specifies the RPC or Notification backend as
|
||||
one of rabbit or amqp, etc. Oslo.messaging is deprecating the host,
|
||||
port and auth configuration options. All drivers will get these
|
||||
options via the transport_url.
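
For example, a service configured to use the dispatch-router for RPC and
rabbitMQ for Notifications might end up with settings along these lines
(hosts and credentials are placeholders)::

    [DEFAULT]
    # RPC over the AMQP 1.0 dispatch-router
    transport_url = amqp://user:pass@dispatch-host:5672/

    [oslo_messaging_notifications]
    # Notifications over the rabbitMQ broker
    transport_url = rabbit://user:pass@rabbit-host:5672/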
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Support for dual backends and AMQP 1.0 driver integration
|
||||
with the dispatch-router depends on oslo.messaging V5.10 or later.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
In order to test this in CI, an environment will be needed where dual
|
||||
messaging system backends (e.g. rabbitMQ server and dispatch-router
|
||||
server) are deployed. Any existing hardware configuration should be
|
||||
appropriate for the dual backend deployment.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The deployment documentation will need to be updated to cover the
|
||||
configuration of dual messaging system backends and the use of the
|
||||
dispatch-router. TripleO Heat template examples should also help with
|
||||
deployments using dual backends.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* [1] https://blueprints.launchpad.net/oslo.messaging/+spec/amqp-dispatch-router
|
||||
* [2] http://qpid.apache.org/components/dispatch-router/
|
||||
* [3] http://docs.openstack.org/developer/oslo.messaging/AMQP1.0.html
|
||||
* [4] https://etherpad.openstack.org/p/ocata-oslo-consistent-mq-backends
|
||||
* [5] https://github.com/openstack/puppet-qdr
|
@ -1,258 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
================================================
|
||||
PKI management of the overcloud using Certmonger
|
||||
================================================
|
||||
|
||||
There is currently support for enabling SSL for the public endpoints of the
|
||||
OpenStack services. However, certain use cases require the availability of SSL
|
||||
everywhere. This spec proposes an approach to enable it.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Even though there is support for deploying both the overcloud and the
|
||||
undercloud with TLS/SSL support for the public endpoints, there are deployments
|
||||
that demand the usage of encrypted communications through all the interfaces.
|
||||
|
||||
The current approach for deploying SSL in TripleO is to inject the needed
|
||||
keys/certificates through Heat environment files; this requires the
|
||||
pre-creation of those. While this approach works for the public-facing
|
||||
services, as we attempt to secure the communication between different
|
||||
services, and in different levels of the infrastructure, the amount of keys
|
||||
and certificates grows. So, getting the deployer to generate all the
|
||||
certificates and manage them will be quite cumbersome.
|
||||
|
||||
On the other hand, TripleO is not meant to handle the PKI of the cloud. And
|
||||
being the case that we will at some point need to enable the deployer to be
|
||||
able to renew, revoke and keep track of the certificates and keys deployed in
|
||||
the cloud, we are in need of a system with such capabilities.
|
||||
|
||||
Instead of brewing an OpenStack-specific solution ourselves, I propose the
|
||||
usage of already existing systems that will make this a lot easier.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The proposal is to start using certmonger[1] in the nodes of the overcloud to
|
||||
interact with a CA for managing the certificates that are being used. With this
|
||||
tool, we can request the fetching of the needed certificates for interfaces
|
||||
such as the internal OpenStack endpoints, the database cluster and the message
|
||||
broker for the cloud. Those certificates will in turn have automatic tracking,
|
||||
and for cases where there is a certificate to identify the node, it could
|
||||
even automatically request a renewal of the certificate when needed.
|
||||
|
||||
Certmonger is already available in several distributions (both Red Hat and
|
||||
Debian based) and has the capability of interacting with several CAs, so if the
|
||||
operator already has a working one, they could use that. On the other hand,
|
||||
certmonger has the mechanism for registering new CAs, and executing scripts
|
||||
(which are customizable) to communicate with those CAs. Those scripts are
|
||||
language independent. But for means of the open source community, a solution
|
||||
such as FreeIPA[2] or Dogtag[3] could be used to act as a CA and handle the
|
||||
certificates and keys for us. Note that it's possible to write a plugin for
|
||||
certmonger to communicate with Barbican or another CA, if that's what we would
|
||||
like to go for.
|
||||
|
||||
In the FreeIPA case, this will require a full FreeIPA system running either on
|
||||
another node in the cluster or in the undercloud in a container[4].
|
||||
|
||||
For cases where the services are terminated by HAProxy, and the overcloud being
|
||||
in an HA-deployment, the controller nodes will need to share a certificate that
|
||||
HAProxy will present when accessed. In this case, the workflow will be as
|
||||
follows:
|
||||
|
||||
#. Register the undercloud as a FreeIPA client. This configures the kerberos
|
||||
environment and provides access control to the undercloud node.
|
||||
#. Get keytab (credentials) corresponding to the undercloud in order to access
|
||||
FreeIPA, and be able to register nodes.
|
||||
#. Create a HAProxy service
|
||||
#. Create a certificate/key for that service
|
||||
#. Store the key in FreeIPA's Vault.
|
||||
#. Create each of the controllers to be deployed as hosts in FreeIPA (Please
|
||||
see note about this)
|
||||
#. On each controller node get the certificate from service entry.
|
||||
#. Fetch the key from the FreeIPA vault.
|
||||
#. Set certmonger to keep track of the resulting certificates and
|
||||
keys.
|
||||
|
||||
.. note::
|
||||
|
||||
While the process of creating each node beforehand could sound cumbersome,
|
||||
this can be automated to increase usability. The proposed approach is to
|
||||
have a nova micro-service that automatically registers the nodes from the
|
||||
overcloud when they are created [5]. This hook will not only register the
|
||||
node in the system, but will also inject an OTP which the node can use to
|
||||
fetch the required credentials and get its corresponding certificate and
|
||||
key. The aforementioned OTP is only used for enrollment. Once enrollment
|
||||
has already taken place, certmonger can already be used to fetch
|
||||
certificates from FreeIPA.
|
||||
|
||||
However, even if this micro-service is not in place, we could pass the OTP
|
||||
via the TripleO Heat Templates (in the overcloud deployment). So it is
|
||||
possible to have the controllers fetching their keytab and subsequently
|
||||
request their certificates even if we don't have auto-enrollment in place.
|
||||
|
||||
.. note::
|
||||
|
||||
Barbican could also be used instead of FreeIPA's Vault. With the upside of
|
||||
it being an already accepted OpenStack service. However, Barbican will also
|
||||
need to have a backend, which might be Dogtag in our case, since having an
|
||||
HSM for the CI will probably not be an option.
|
||||
|
||||
Now, for services such as the message broker, where an individual certificate
|
||||
is required per-host, the process is much simpler, since the nodes will have
|
||||
already been registered in FreeIPA and will be able to fetch their credentials.
|
||||
Now we can just let certmonger do the work and request, and subsequently track
|
||||
the appropriate certificates.
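
For such per-node certificates this boils down to a certmonger request along
the lines of the following (paths, hostname and principal are illustrative)::

    getcert request -c IPA \
        -f /etc/pki/tls/certs/rabbitmq.crt \
        -k /etc/pki/tls/private/rabbitmq.key \
        -D overcloud-controller-0.example.com \
        -K rabbitmq/overcloud-controller-0.example.com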
|
||||
|
||||
Once the certificates and keys are present in the nodes, then we can let the
|
||||
subsequent steps of the overcloud deployment process take place; So the
|
||||
services will be configured to use those certificates and enable TLS where the
|
||||
deployer specifies it.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The alternative is to take the same approach as we did for the public
|
||||
endpoints. Which is to simply inject the certificates and keys to the nodes.
|
||||
That would have the downside that the certificates and keys will be pasted in
|
||||
heat environment files. This will be problematic for services such as RabbitMQ,
|
||||
where we are giving a list of nodes for communication, because to enable SSL in
|
||||
it, we need to have a certificate per-node serving as a message broker.
|
||||
In this case two approaches could be taken:
|
||||
|
||||
* We will need to copy and paste each certificate and key that is needed for
|
||||
each of the nodes. With the downside being how much text needs to be copied,
|
||||
and the difficulty of keeping track of the certificates. On the other hand,
|
||||
each time a node is removed or added, we need to make sure we remember to add
|
||||
a certificate and a key for it in the environment file. So this becomes a
|
||||
scaling and a usability issue too.
|
||||
|
||||
* We could also give in an intermediate certificate, and let TripleO create the
|
||||
certificates and keys per-service. However, even if this fixes the usability
|
||||
issue, we still cannot keep track of the specific certificates and keys that
|
||||
are being deployed in the cloud.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
This approach enables better security for the overcloud, as it not only eases
|
||||
us to enable TLS everywhere (if desired) but it also helps us keep track and
|
||||
manage our PKI. On the other hand, it enables other means of security, such as
|
||||
mutual authentication. In the case of FreeIPA, we could let the nodes have
|
||||
client certificates, and so they would be able to authenticate to the services
|
||||
(as is possible with tools such as HAProxy or Galera/MySQL). However, this can
|
||||
come as subsequent work of this.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
For doing this, the user will need to pass extra parameters to the overcloud
|
||||
deployment, such as the CA information. In the case of FreeIPA, we will need to
|
||||
pass the host and port, the kerberos realm, the kerberos principal of the
|
||||
undercloud and the location of the keytab (the credentials) for the undercloud.
|
||||
|
||||
However, this will be reflected in the documentation.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Having SSL everywhere will degrade the performance of the overcloud overall, as
|
||||
there will be some overhead in each call. However, this is a known issue and
|
||||
this is why SSL everywhere is optional. It should only be enabled for deployers
|
||||
that really need it.
|
||||
|
||||
The usage of an external CA or FreeIPA shouldn't impact the overcloud
|
||||
performance, as the operations that it will be doing are not recurrent
|
||||
operations (issuing, revoking or renewing certificates).
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
If a deployer wants to enable SSL everywhere, they will need to have a working
|
||||
CA for this to work. Or if they don't they could install FreeIPA in a node.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Discuss things that will affect other developers working on OpenStack.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
jaosorior
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Enable certmonger and the FreeIPA client tools in the overcloud image
|
||||
elements.
|
||||
|
||||
* Include the host auto-join hook for nova in the undercloud installation.
|
||||
|
||||
* Create nested templates that will be used in the existing places for the
|
||||
NodeTLSData and NodeTLSCAData. These templates will do the certmonger
|
||||
certificate fetching and tracking.
|
||||
|
||||
* Configure the OpenStack internal endpoints to use TLS and make this optional
|
||||
through a heat environment.
|
||||
|
||||
* Configure the Galera/MySQL cluster to use TLS and make this optional through
|
||||
a heat environment.
|
||||
|
||||
* Configure RabbitMQ to use TLS (which means having a certificate for each
|
||||
node) and make this optional through a heat environment
|
||||
|
||||
* Create a CI gate for SSL everywhere. This will include a FreeIPA installation
|
||||
and it will enable SSL for all the services, ending in the running of a
|
||||
pingtest. For the FreeIPA preparations, a script running before the overcloud
|
||||
deployment will add the undercloud as a client, configure the appropriate
|
||||
permissions for it and deploy a keytab so that it can use the nova hook.
|
||||
Subsequently it will create a service for the OpenStack internal endpoints,
|
||||
and the database, which it will use to create the needed certificates and
|
||||
keys.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* This requires the following bug to be fixed in Nova:
|
||||
https://bugs.launchpad.net/nova/+bug/1518321
|
||||
|
||||
* Also requires the packaging of the nova hook.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
We will need to create a new gate in CI to test this.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The documentation on how to use an external CA and how to install and use
|
||||
FreeIPA with TripleO needs to be created.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] https://fedorahosted.org/certmonger/
|
||||
[2] http://www.freeipa.org/page/Main_Page
|
||||
[3] http://pki.fedoraproject.org/wiki/PKI_Main_Page
|
||||
[4] http://www.freeipa.org/page/Docker
|
||||
[5] https://github.com/richm/rdo-vm-factory/blob/use-centos/rdo-ipa-nova/novahooks.py
|
@ -1,149 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=======================
|
||||
Step by step validation
|
||||
=======================
|
||||
|
||||
Include the URL of your launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/step-by-step-validation
|
||||
|
||||
Validate each step during the installation to be able to stop fast in
|
||||
case of errors and provide feedback on which components are in error.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
During deployment, problems are often spotted at the end of the
|
||||
configuration and can accumulate on top of each other making it
|
||||
difficult to find the root cause.
|
||||
|
||||
Deployers and developers will benefit from having the installation
|
||||
process fail fast and from spotting the lowest-level components
|
||||
causing the problem.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Leverage the steps already defined in Tripleo to run a validation tool
|
||||
at the end of each step.
|
||||
|
||||
During each step, collect assertions about what components are
|
||||
configured on each host; then, at the end of the step, run a validation
|
||||
tool consuming the assertions to report all the failed assertions.
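
The assertion format is still to be defined; as a purely illustrative sketch,
a step could drop a YAML file like the following on each node for the
validation tool to consume::

    # hypothetical /var/lib/tripleo/validations/step3.yaml
    step: 3
    host: overcloud-controller-0
    assertions:
      - service: haproxy
        running: true
      - port: 3306
        listening: true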
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
We could use Puppet to add assertions in the code to validate what has
|
||||
been configured. The drawback of this approach is the difficulty of
|
||||
providing good reporting on the issues, compared to a specialized
|
||||
tool that can be run outside of the installer if needed.
|
||||
|
||||
The other drawback to this approach is that it can't be reused in
|
||||
future if/when we support non-puppet configuration and it probably
|
||||
also can't be used when we use puppet to generate an external config
|
||||
file for containers.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
* some validations may require access to sensitive data like passwords
|
||||
or keys to access the components.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
This feature will be activated automatically in the installer.
|
||||
|
||||
If needed, the deployer or developer will be able to launch the tool
|
||||
by hand to validate a set of assertions.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
We expect the validations to take less than one minute per step.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
The objective is to have a faster iterative process by failing fast.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Each configuration module will need to generate assertions to be
|
||||
consumed by the validation tool.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Note that this approach (multiple step application of ansible in
|
||||
localhost mode via heat) is already being prototyped for upgrades and it will work well for
|
||||
validations too.
|
||||
|
||||
https://review.openstack.org/#/c/393448/
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee: <shardy@redhat.com>
|
||||
|
||||
Other contributors to help validate services:
|
||||
<launchpad-id or None>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* generate assertions about the configured components on the server
|
||||
being configured in yaml files.
|
||||
|
||||
* implement the validation tool leveraging the work that has already
|
||||
been done in ``tripleo-validations`` that will do the following
|
||||
steps:
|
||||
|
||||
1. collect yaml files from the servers on the undercloud.
|
||||
|
||||
2. run validations in parallel on each server from the undercloud.
|
||||
|
||||
3. report all issues and exit with 0 if no error or 1 if there is at
|
||||
least one error.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
To be added.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The change will be used automatically in the CI so it will always be tested.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
We'll need to document integration with whatever validation tool is
|
||||
used, e.g so that those integrating new services (or in future
|
||||
out-of-tree additional services) can know how to integrate with the
|
||||
validation.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
A similar approach was used in SpinalStack using serverspec. See
|
||||
https://github.com/redhat-cip/config-tools/blob/master/verify-servers.sh
|
||||
|
||||
A collection of Ansible playbooks to detect and report potential
|
||||
issues during TripleO deployments:
|
||||
https://github.com/openstack/tripleo-validations
|
||||
|
||||
Prototype of composable upgrades with Heat+Ansible:
|
||||
https://review.openstack.org/#/c/393448/
|
@ -1,258 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
======================================================
|
||||
Make tripleo third party ci toolset tripleo-quickstart
|
||||
======================================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/use-tripleo-quickstart-and-tripleo-quickstart-extras-for-the-tripleo-ci-toolset
|
||||
|
||||
Devstack, being the reference CI deployment of OpenStack, does a good job of
|
||||
running both in CI and locally on development hardware.
|
||||
TripleO-Quickstart (TQ)`[3]`_ and TripleO-QuickStart-Extras (TQE) can provide
|
||||
an equivalent experience like devstack both in CI and on local development
|
||||
hardware. TQE does a nice job of breaking down the steps required to install an
|
||||
undercloud and deploy and overcloud step by step by creating bash scripts on the
|
||||
developers system and then executing them in the correct order.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently there is a population of OpenStack developers that are unfamiliar
|
||||
with TripleO and our TripleO CI tools. It's critical that this population have
|
||||
a tool which can provide a similar user experience that devstack currently
|
||||
provides OpenStack developers.
|
||||
|
||||
Recreating a deployment failure from TripleO-CI can be difficult for developers
|
||||
outside of TripleO. Developers may need more than just a script that executes
|
||||
a deployment. Ideally developers have a tool that provides a high level
|
||||
overview, a step-by-step install process with documentation, and a way to inject
|
||||
their local patches or patches from Gerrit into the build.
|
||||
|
||||
Additionally there may be groups outside of TripleO that want to integrate
|
||||
additional code or steps to a deployment. In this case the composablity of the
|
||||
CI code is critical to allow others to plugin, extend and create their own steps
|
||||
for a deployment.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Replace the tools found in openstack-infra/tripleo-ci that drive the deployment
|
||||
of tripleo with TQ and TQE.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
One alternative is to break down TripleO-CI into composable shell scripts, and
|
||||
improve the user experience `[4]`_.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
No known additional security vulnerabilities at this time.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
We expect that newcomers to TripleO will have an enhanced experience
|
||||
reproducing results from CI.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Using an undercloud image with preinstalled rpms should provide a faster
|
||||
deployment end-to-end.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
None at this time.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
This is the whole point really and discussed elsewhere in the spec. However,
|
||||
this should provide a quality user experience for developers wishing to setup
|
||||
TripleO.
|
||||
|
||||
TQE provides a step-by-step, well documented deployment of TripleO.
|
||||
Furthermore, and is easy to launch and configure::
|
||||
|
||||
bash quickstart.sh -p quickstart-extras.yml -r quickstart-extras-requirements.txt --tags all <development box>
|
||||
|
||||
Everything is executed via a bash shell script, the shell scripts are customized
|
||||
via jinja2 templates. Users can see the command prior to executing it when
|
||||
running it locally. Documentation of what commands were executed are
|
||||
automatically generated per execution.
|
||||
|
||||
Node registration and introspection example:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Bash script::

   https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-31/undercloud/home/stack/overcloud-prep-images.sh


* Execution log::

   https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-31/undercloud/home/stack/overcloud_prep_images.log.gz

* Generated rST documentation::

   https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-31/docs/build/overcloud-prep-images.html

Overcloud Deployment example:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Bash script::

   https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal_pacemaker-31/undercloud/home/stack/overcloud-deploy.sh.gz

* Execution log::

   https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal_pacemaker-31/undercloud/home/stack/overcloud_deploy.log.gz

* Generated rST documentation::

   https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-current-tripleo-delorean-minimal-37/docs/build/overcloud-deploy.html

Step by Step Deployment:
^^^^^^^^^^^^^^^^^^^^^^^^

There are times when a developer will want to walk through a deployment step-by-step,
run commands by hand, and try to figure out what exactly is involved with
a deployment. A developer may also want to tweak the settings or add a patch.
To do the above the deployment cannot just run through end to end.

TQE can set up the undercloud and overcloud nodes, and then just add already
configured scripts to install the undercloud and deploy the overcloud
successfully. Essentially this allows the developer to ssh to the undercloud and
drive the installation from there with prebuilt scripts.

* Example::

   ./quickstart.sh --no-clone --bootstrap --requirements quickstart-extras-requirements.txt --playbook quickstart-extras.yml --skip-tags undercloud-install,undercloud-post-install,overcloud-deploy,overcloud-validate --release newton <development box>

Composability:
^^^^^^^^^^^^^^

TQE is not a single tool, it's a collection of composable Ansible roles. These
Ansible roles can coexist in a single Git repository or be distributed to many
Git repositories. See "Additional References."

Why have two projects? Why risk adding complexity?
One of the goals of TQ and TQE is to not assume we are
writing code that works for everyone, on every deployment type, and in any
kind of infrastructure. To ensure that TQE developers cannot block outside
contributions (roles, additions, and customization to either TQ or TQE),
it was thought best to uncouple them as well and make them as composable
as possible. Ansible playbooks, after all, are best used as a method to just
call roles so that anyone can create playbooks with a variety of roles in the
way that best suits their purpose.

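As a sketch of what this composability enables (the playbook below is
illustrative only; the role names are taken from the library listed under
"Additional References"), a consumer can assemble their own deployment
playbook out of the published roles::

   # Illustrative playbook built from TQE roles
   - name: Prepare overcloud images on the undercloud
     hosts: undercloud
     roles:
       - ansible-role-tripleo-overcloud-prep-images

   - name: Deploy and validate the overcloud
     hosts: undercloud
     roles:
       - ansible-role-tripleo-overcloud
       - ansible-role-tripleo-overcloud-validate
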
Implementation
==============

Assignee(s)
-----------

Primary assignee:
  - weshayutin

Other contributors:
  - trown
  - sshnaidm
  - gcerami
  - adarazs
  - larks

Work Items
----------

- Enable third party testing `[1]`_
- Enable TQE to run against the RH2 OVB OpenStack cloud `[2]`_
- Move the TQE roles into one or many OpenStack Git repositories, see the roles listed
  in the "Additional References"


Dependencies
============

- A decision needs to be made regarding `[1]`_
- The work to enable third party testing in rdoproject needs to be completed

Testing
=======

There is work in progress testing TQE against the RH2 OVB cloud at the moment `[2]`_.
TQE has been vetted for quite some time with OVB on other clouds.


Documentation Impact
====================

What is the impact on the docs? Don't repeat details discussed above, but
please reference them here.


References
==========
* `[1]`_ -- http://lists.openstack.org/pipermail/openstack-dev/2016-October/105248.html
* `[2]`_ -- https://review.openstack.org/#/c/381094/
* `[3]`_ -- https://etherpad.openstack.org/p/tripleo-third-party-ci-quickstart
* `[4]`_ -- https://blueprints.launchpad.net/tripleo/+spec/make-tripleo-ci-externally-consumable

.. _[1]: http://lists.openstack.org/pipermail/openstack-dev/2016-October/105248.html
.. _[2]: https://review.openstack.org/#/c/381094/
.. _[3]: https://etherpad.openstack.org/p/tripleo-third-party-ci-quickstart
.. _[4]: https://blueprints.launchpad.net/tripleo/+spec/make-tripleo-ci-externally-consumable

Additional References
=====================

TQE Ansible role library
------------------------

* Undercloud roles:

  * https://github.com/redhat-openstack/ansible-role-tripleo-baremetal-virt-undercloud
  * https://github.com/redhat-openstack/ansible-role-tripleo-pre-deployment-validate ( under development )

* Overcloud roles:

  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-config
  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-flavors
  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-images
  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-network
  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud
  * https://github.com/redhat-openstack/ansible-role-tripleo-ssl ( under development )

* Utility roles:

  * https://github.com/redhat-openstack/ansible-role-tripleo-cleanup-nfo
  * https://github.com/redhat-openstack/ansible-role-tripleo-collect-logs
  * https://github.com/redhat-openstack/ansible-role-tripleo-gate
  * https://github.com/redhat-openstack/ansible-role-tripleo-provision-heat
  * https://github.com/redhat-openstack/ansible-role-tripleo-image-build

* Post Deployment roles:

  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-upgrade
  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-scale-nodes
  * https://github.com/redhat-openstack/ansible-role-tripleo-tempest
  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-validate
  * https://github.com/redhat-openstack/ansible-role-tripleo-validate-ipmi
  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-validate-ha

* Baremetal roles:

  * https://github.com/redhat-openstack/ansible-role-tripleo-baremetal-prep-virthost
  * https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-baremetal
@ -1,197 +0,0 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

===========================
Composable Service Upgrades
===========================

https://blueprints.launchpad.net/tripleo/+spec/overcloud-upgrades-per-service

In the Newton release TripleO delivered a new capability to deploy arbitrary
custom roles_ (groups of nodes) with a lot of flexibility of which services
are placed on which roles (using roles_data.yaml_). This means we can no
longer make the same assumptions about a specific service running on a
particular role (e.g. Controller).

The current upgrades workflow_ is organised around the node role determining
the order in which that given node and services deployed therein are upgraded.
The workflow dictates "swifts", before "controllers", before "cinders", before
"computes", before "cephs". The reasons for this ordering are beyond the scope
here and ultimately inconsequential, since the important point to note is
there is a hard coded relationship between a given service and a given node
with respect to upgrading that service (e.g. a script that upgrades all
services on "Compute" nodes). For upgrades from Newton to Ocata we can no
longer make these assumptions about services being tied to a specific role,
so a more composable model is needed.

Consensus after the initial discussion during the Ocata design summit session_
was that:

* Re-engineering the upgrades workflow for Newton to Ocata is necessary
  because of 'custom roles'
* We should start by moving the upgrades logic into the composable service
  templates in the tripleo-heat-templates (i.e. into each service)
* There is still a need for an over-arching workflow - albeit service
  rather than role oriented.
* It is TBD what will drive that workflow. We will use whatever will be
  'easier' for a first iteration, especially given the Ocata development
  time constraints.

Problem Description
===================

As explained in the introduction above, the current upgrades workflow_ can no
longer work for composable service deployments. Right now the upgrade scripts
are organised around and indeed targeted at specific nodes: the upgrade
script for swifts_ is different to that for computes_ or for controllers (split
across a number_ of_ steps_), cinders_ or cephs_. These scripts are invoked
as part of a workflow where each step is either a heat stack update or an
invocation of the upgrade-non-controller.sh_ script to execute the node
specific upgrade script (delivered as one of the earlier steps in the workflow)
on non controllers.

One way to handle this problem is to decompose the upgrades logic
from those monolithic per-node upgrade scripts into per-service upgrades logic.
This should live in the tripleo-heat-templates puppet services_ templates for
each service. For the upgrade of a given service we need to express:

* any pre-upgrade requirements (run a migration, stop a service, pin RPC)
* any post upgrade (migrations, service starts/reload config)
* any dependencies on other services (upgrade foo only after bar)

If we organise the upgrade logic in this manner the idea is to gain the
flexibility to combine this dynamically into the new upgrades workflow.
Besides the per-service upgrades logic the workflow will also need to handle
and provide for any deployment wide upgrades related operations such as
unpin of the RPC version once all services are successfully running Ocata, or
upgrading of services that aren't directly managed or configured by the
tripleo deployment (like openvswitch as just one example), or even the delivery
of a new kernel which will require a reboot on the given service node after
all services have been upgraded.


Proposed Change
===============

The first step is to work out where to add upgrades related configuration to
each service in the tripleo-heat-templates services_ templates. The exact
format will depend on what we end up using to drive the workflow. We could
include them in the *outputs* as 'upgrade_config', like::

  outputs:
    role_data:
      description: Role data for the Nova Compute service.
      value:
        service_name: nova_compute
        ...
        upgrade_tasks:
          - name: RPC pin nova-compute
            exec: "crudini --set /etc/nova/nova.conf upgrade_levels compute $upgrade_level_nova_compute"
            tags: step1
          - name: stop nova-compute
            service: name=openstack-nova-compute state=stopped
            tags: step2
          - name: update nova database
            command: nova-manage db_sync
            tags: step3
          - name: start nova-compute
            service: name=openstack-nova-compute state=started
            tags: step4
          ...

The current proposal is for the upgrade snippets to be expressed in Ansible.
The initial focus will be to drive the upgrade via the existing tripleo
tooling, e.g. heat applying ansible similar to how heat applies scripts for
the non composable implementation. In future it may also be possible to
expose the per-role ansible playbooks to enable advanced operators to drive
the upgrade workflow directly, perhaps used in conjunction with the dynamic
inventory provided for tripleo validations.

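As an illustration of how the step tags might be consumed (this is only a
sketch; the playbook name and the way the per-service ``upgrade_tasks`` get
aggregated are not defined by this spec), the collected tasks could be
rendered into a single playbook and applied one step at a time across the
relevant nodes::

   # Hypothetical invocation of the aggregated upgrade tasks
   ansible-playbook -i tripleo-ansible-inventory upgrade_tasks_playbook.yaml --tags step1
   ansible-playbook -i tripleo-ansible-inventory upgrade_tasks_playbook.yaml --tags step2
   ansible-playbook -i tripleo-ansible-inventory upgrade_tasks_playbook.yaml --tags step3
   ansible-playbook -i tripleo-ansible-inventory upgrade_tasks_playbook.yaml --tags step4

Each step would only proceed once the previous step has completed on every
node that hosts the service.
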
One other point of note that was brought up in the Ocata design summit
session_ and which should factor into the design here is that operators may
wish to run the upgrade in stages rather than all at once. It could still be
the case that the new workflow can differentiate between 'controlplane'
vs 'non-controlplane' services. The operator could then upgrade controlplane
services as one stand-alone upgrade step and then later start to roll out the
upgrade of non-controlplane services.

Alternatives
------------

One alternative is to have a stand-alone upgrades workflow driven by ansible.
Some early work and prototyping was done on this (linked from the
Ocata design summit session_). Ultimately the proposal was abandoned but it is
still possible that we will use ansible for the upgrade logic as described
above. We could also explore exposing the resulting ansible playbooks for
advanced operators to invoke as part of their own tooling.

Other End User Impact
---------------------
Significant change in the tripleo upgrades workflow.

Implementation
==============

Assignee(s)
-----------

Primary assignee: shardy

Other contributors: marios, emacchi, matbu, chem, lbezdick


Work Items
----------
Some prototyping was done by shardy in
"WIP prototyping composable upgrades with Heat+Ansible" at
I39f5426cb9da0b40bec4a7a3a4a353f69319bdf9_

* Decompose the upgrades logic into each service template in the tht
* Design a workflow that incorporates migrations, the per-service upgrade
  scripts and any deployment wide upgrades operations.
* Decide how this workflow is to be invoked (mistral? puppet? bash?)
* profit!


Dependencies
============


Testing
=======

Hopefully we can use the soon to be added upgrades job_ to help with the
development and testing of this feature and obviously guard against changes
that break upgrades. Ideally we will expand that to include jobs for each of
the stable branches (upgrade M->N and N->O). The M->N job would exercise the
previous upgrades workflow whereas N->O would be exercising the work developed
as part of this spec.


Documentation Impact
====================


References
==========

.. _roles: https://blueprints.launchpad.net/tripleo/+spec/custom-roles
.. _roles_data.yaml: https://github.com/openstack/tripleo-heat-templates/blob/78500bc2e606bd1f80e05d86bf7da4d1d27f77b1/roles_data.yaml
.. _workflow: http://docs.openstack.org/developer/tripleo-docs/post_deployment/upgrade.html
.. _session: https://etherpad.openstack.org/p/ocata-tripleo-upgrades
.. _swifts: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_object_storage.sh
.. _computes: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_compute.sh
.. _number: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh
.. _of: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
.. _steps: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh
.. _cinders: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_block_storage.sh
.. _cephs: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_ceph_storage.sh
.. _upgrade-non-controller.sh: https://github.com/openstack/tripleo-common/blob/01b68d0b0cdbd0323b7f006fbda616c12cbf90af/scripts/upgrade-non-controller.sh
.. _services: https://github.com/openstack/tripleo-heat-templates/tree/master/puppet/services
.. _I39f5426cb9da0b40bec4a7a3a4a353f69319bdf9: https://review.openstack.org/#/c/393448/
.. _job: https://bugs.launchpad.net/tripleo/+bug/1583125
@ -1,105 +0,0 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

=============================================
Enable deployment of performance monitoring
=============================================

https://blueprints.launchpad.net/tripleo/+spec/tripleo-opstools-performance-monitoring

TripleO should have the possibility to automatically set up and install
the performance monitoring agent (collectd) to service the overcloud.

Problem Description
===================

We need to easily enable operators to connect overcloud nodes to a performance
monitoring stack. The possible way to do so is to install the collectd agent
together with a set of plugins, depending on the metrics we want to collect
from overcloud nodes.

Summary of use cases:

1. collectd deployed on each overcloud node reporting configured metrics
   (via collectd plugins) to an external collector.

Proposed Change
===============

Overview
--------

The collectd service will be deployed as a composable service on
the overcloud stack when it is explicitly enabled via an environment file.

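A minimal sketch of what enabling the service might look like (the resource
registry path and parameter names below are illustrative assumptions; the real
interface will be defined by the tripleo-heat-templates change)::

   # environment file passed with ``openstack overcloud deploy -e``
   resource_registry:
     OS::TripleO::Services::Collectd: ../puppet/services/metrics/collectd.yaml

   parameter_defaults:
     CollectdServer: collector.example.com
     CollectdServerPort: 25826
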
Security Impact
---------------

None

Other End User Impact
---------------------

None

Performance Impact
------------------

Metric collection and transport to the monitoring node can create I/O which
might have a performance impact on monitored nodes.

Other Deployer Impact
---------------------

None

Developer Impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Lars Kellogg-Stedman (larsks)

Other contributors:
  Martin Magr (mmagr)

Work Items
----------

* puppet-tripleo profile for the collectd service
* tripleo-heat-templates composable service for collectd deployment

Dependencies
============

* Puppet module for the collectd service: puppet-collectd [1]
* CentOS Opstools SIG repo [2]

Testing
=======

We should consider creating a CI job for deploying the overcloud with a
monitoring node to perform functional testing.


Documentation Impact
====================

New template parameters will have to be documented.


References
==========

[1] https://github.com/voxpupuli/puppet-collectd
[2] https://wiki.centos.org/SpecialInterestGroup/OpsTools
@ -1,139 +0,0 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

==============================
TripleO Repo Management Tool
==============================

https://blueprints.launchpad.net/tripleo/tripleo-repos

Create a tool to handle the repo setup for TripleO

Problem Description
===================

The documented repo setup steps for TripleO are currently:

* 3 curls
* a sed
* a multi-line bash command
* a yum install
* (optional) another yum install and sed command

These steps are also implemented in multiple other places, which means every
time a change needs to be made it has to be done in at least three different
places. The stable branches also need slightly different commands which further
complicates the documentation. They also need to appear in multiple places
in the docs (e.g. virt system setup, undercloud install, image build,
undercloud upgrade).

Proposed Change
===============

Overview
--------

My proposal is to abstract away the repo management steps into a standalone
tool. This would essentially change the repo setup from the process
described above to something like::

   sudo yum install -y http://tripleo.org/tripleo-repos.rpm
   sudo tripleo-repos current

Historical note: The original proposal was called dlrn-repo because it was
dealing exclusively with dlrn repos. Now that we've started to add more
repos like Ceph that are not from dlrn, that name doesn't really make sense.

This will mean that when repo setup changes are needed (which happen
periodically), they only need to be made in one place and will apply to both
developer and user environments.

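For stable branches and additional repos the same single entry point would
apply; the exact flag names below are purely illustrative and would be
settled when the tool is imported::

   # Hypothetical invocations of the proposed tool
   sudo tripleo-repos current                 # repos for master
   sudo tripleo-repos -b newton current       # repos for a stable branch
   sudo tripleo-repos current ceph            # enable an extra (non-dlrn) repo
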
Alternatives
------------

Use tripleo.sh's repo setup. However, tripleo.sh is not intended as a
user-facing tool. It's supposed to be a thin wrapper that essentially
implements the documented deployment commands.

Security Impact
---------------

The tool would need to make changes to the system's repo setup and install
packages. This is the same thing done by the documented commands today.

Other End User Impact
---------------------

This would be a new user-facing CLI.

Performance Impact
------------------

No meaningful change

Other Deployer Impact
---------------------

Deployers would need to switch to this new method of configuring the
TripleO repos in their deployments.

Developer Impact
----------------

There should be little to no developer impact because they are mostly using
other tools to set up their repos, and those tools should be converted to use
the new tool.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bnemec

Other contributors:
  <launchpad-id or None>

Work Items
----------

* Update the proposed tool to match the current repo setup
* Import code into gerrit
* Package tool
* Publish the package somewhere easily accessible
* Update docs to use tool
* Convert existing developer tools to use this tool


Dependencies
============

NA

Testing
=======

tripleo.sh would be converted to use this tool so it would be covered by
existing CI.


Documentation Impact
====================

Documentation would be simplified.


References
==========

Original proposal:
http://lists.openstack.org/pipermail/openstack-dev/2016-June/097221.html

Current version of the tool:
https://github.com/cybertron/dlrn-repo
@ -1,177 +0,0 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

================================
composable-undercloud
================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+spec/heat-undercloud

Deploy the undercloud with Heat instead of elements. This will allow us to use
composable services for the Undercloud and better fits with the architecture
of TripleO (providing a feedback loop between the Undercloud and Overcloud).
Furthermore this gets us a step closer to an HA undercloud and will help
us potentially convert the Undercloud to containers as work is ongoing
in t-h-t for containers as well.

Problem Description
===================

The Undercloud today uses instack-undercloud. Instack-undercloud is built
around the concept of 'instack' which uses elements to install services.

* When instack-undercloud started we shared elements across the undercloud
  and overcloud via the tripleo-image-elements project. This is no longer the
  case, thus we have lost the feedback loop of using the same elements in
  both the overcloud and undercloud.

* We retro-fitted instack-undercloud with a single element called
  puppet-stack-config that contains a single (large) puppet manifest for
  all the services. Being able to compose the Undercloud would be more
  scalable.

* A maintenance problem. Ideally we could support the under and overcloud with the same tooling.

Proposed Change
===============

Overview
--------

We can use a single process Heat API/Engine in noauth mode to leverage
recent "composable services" work in the tripleo-heat-templates project.

* A new heat-all launcher will be created.

* We will run the heat-all launcher with "noauth" middleware to skip keystone
  auth at a high level.

* The heat-all process will use the fake RPC driver and SQLite, thus avoiding
  the need to run RabbitMQ or MySQL on the deployment server for bootstrapping
  (a minimal configuration sketch follows this list).

* To satisfy client library requirements inside heat we will run a fake keystone
  API (a thread in our installer perhaps), that will return just enough to
  make these clients functionally work in noauth mode.

* The new "deployed-server" feature in tripleo-heat-templates will make
  it possible to create Heat "server" objects and thus run
  OS::Heat::SoftwareDeployment resources on pre-installed servers.

* We will use os-collect-config to communicate with the local Heat API via
  the Heat signal transport. We will run os-collect-config until the
  stack finishes processing and either completes or fails.

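A minimal sketch of the kind of configuration the single-process heat-all
could run with (the option values here are assumptions for illustration; the
noauth paste flavor is the new middleware described above)::

   # heat.conf used only for the undercloud bootstrap (illustrative)
   [DEFAULT]
   rpc_backend = fake

   [database]
   connection = sqlite:////var/lib/heat/undercloud-heat.db

   [paste_deploy]
   flavor = noauth
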
Alternatives
------------

* Create another tool which can read composable services in
  tripleo-heat-templates. This tool would be required to have feature
  parity with Heat such that things like parameters, nested stacks, and
  environments all worked in a similar fashion so that we could share the
  template work across the Undercloud and Overcloud. This approach isn't
  really feasible.

* Use an alternate tool like Ansible. This would create duplicate services
  in Ansible playbooks for each service we require in the Undercloud. This
  approach isn't ideal in that it involves duplicate work across the Undercloud
  and Overcloud. Ongoing work around multi-node configuration and containers
  would need to be duplicated into both the Overcloud (tripleo-heat-templates)
  and Undercloud (Ansible) frameworks.

Security Impact
---------------

* The approach would run Heat on a single node in noauth mode. Heat
  API and the fake Keystone stub would listen on 127.0.0.1 only. This
  would be similar to other projects which allow noauth in local mode
  as well.

Other End User Impact
---------------------

* We would again have a single template language driving our Undercloud
  and Overcloud tooling. Heat templates are very well documented.

Performance Impact
------------------

* Initial testing shows the single process Heat API/Engine is quite light,
  taking only 70MB of RAM on a machine.

* The approach is likely to be on-par with the performance of
  instack-undercloud.


Other Deployer Impact
---------------------

* The format of undercloud.conf may change. We will add a
  'compat' layer which takes the format of 'undercloud.conf' today
  and sets Heat parameters and/or includes heat environments to give
  feature parity and an upgrade path for existing users. Additionally,
  CI jobs will also be created to ensure users who upgrade from
  previous instack environments can use the new tool.

Developer Impact
----------------

* Developers would be able to do less work to maintain the Undercloud by
  sharing composable services.

* Future work around composable upgrades could also be utilized and shared
  across the Undercloud and Overcloud.


Implementation
==============

Assignee(s)
-----------

dprince (dan-prince on LP)

Work Items
----------

* Create heat-all launcher.

* Create python-tripleoclient command to run 'undercloud deploy'.

* Create undercloud.yaml Heat templates.


Dependencies
============

* Heat all launcher and noauth middleware.

Testing
=======

Swapping in the new Undercloud as part of CI should allow us to fully test it.

Additionally, we will have an upgrade job that tests an upgrade from
an instack-undercloud installation to a new t-h-t driven Undercloud install.

Documentation Impact
====================

Documentation changes will need to be made that explain the new config
interfaces (Heat parameters and environments). We could minimize doc changes
by developing a 'compat' interface to process the legacy undercloud.conf
and perhaps even re-use the 'undercloud install' task in python-tripleoclient
as well so it essentially acts the same on the CLI.

References
==========

* Onward dark owl presentation: https://www.youtube.com/watch?v=y1qMDLAf26Q

* https://etherpad.openstack.org/p/tripleo-composable-undercloud

* https://blueprints.launchpad.net/tripleo/+spec/heat-undercloud
@ -1,142 +0,0 @@
=============================
TripleO Undercloud NTP Server
=============================

The Undercloud should provide NTP services for when external NTP services are
not available.

Problem Description
===================

NTP services are required to deploy with HA, but we rely on external services.
This means that TripleO can't be installed without Internet access or a local
NTP server.

This has several drawbacks:

* The NTP server is a potential point of failure, and it is an external
  dependency.

* Isolated deployments without Internet access are not possible without
  additional effort (manually deploying an NTP server).

* Infra CI is dependent on an external resource, leading to potential
  false negative test runs or CI failures.

Proposed Change
===============

Overview
--------

In order to address this problem, the Undercloud installation process should
include setting up an NTP server on the local Undercloud. The use of this
NTP server would be optional, but we may wish to make it a default. Having
a default is better than none, since HA deployments will fail without time
synchronization between the controller cluster members.

The operation of the NTP server on the Undercloud would be primarily of use
in small or proof-of-concept deployments. It is expected that sufficiently
large deployments will have an infrastructure NTP server already operating
locally.

Alternatives
------------

The alternative is to continue to require external NTP services, or to
require manual steps to set up a local NTP server.

Security Impact
---------------

Since the NTP server is required for keeping the HA cluster in sync, a skewed
clock on one controller (in relation to the other controllers) may make it
ineligible to participate in the HA cluster. If more than one controller's
clock is skewed, the entire cluster will fail to operate. This opens up an
opportunity for denial-of-service attacks against the cloud, either by causing
NTP updates to fail, or using a man-in-the-middle attack where deliberately
false NTP responses are returned to the controllers.

Of course, operating the NTP server on the Undercloud moves that attack
vector down to the Undercloud, so sufficient security hardening should be done
on the Undercloud and/or the attached networks. We may wish to bind the NTP
server only to the provisioning (control plane) network.

Other End User Impact
---------------------

This may make the life of the installer easier, since they don't need to open
a network connection to an NTP server or set up a local NTP server.

Performance Impact
------------------

The operation of the NTP server should have a negligible impact on Undercloud
performance. It is a lightweight protocol and the daemon requires few
resources.

Other Deployer Impact
---------------------

We now require that a valid NTP server be configured either in the templates
or on the deployment command-line. This requirement would become optional if
we had a default pointing to NTP services on the Undercloud.

Developer Impact
----------------

None

Implementation
==============

Assignee(s)
-----------
Primary assignees:

* dsneddon@redhat.com
* bfournie@redhat.com

Work Items
----------

The TripleO Undercloud installation scripts will have to be modified to include
the installation and configuration of an NTP server. This will likely be done
using a composable service for the Undercloud, with configuration data taken
from undercloud.conf. The configuration should include a set of default NTP
servers which are reachable on the public Internet for when no servers are
specified in undercloud.conf.

Implement opening up iptables for NTP on the control plane network (bound to
only one IP/interface [ctlplane] if possible).

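A rough sketch of the two pieces described above (the undercloud.conf option
name and the interface binding are assumptions made only for illustration)::

   # undercloud.conf (hypothetical option name)
   undercloud_ntp_servers = 0.pool.ntp.org,1.pool.ntp.org

   # iptables rule opening NTP only on the provisioning interface
   iptables -A INPUT -i br-ctlplane -p udp --dport 123 -j ACCEPT
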
Dependencies
============

The NTP server RPMs must be installed, and upstream NTP servers must be
identified (although we might configure a default such as pool.ntp.org).

Testing
=======

Since proper operation of the NTP services is required for successful
deployment of an HA overcloud, this functionality will be tested every time
a TripleO CI HA job is run.

We may also want to implement a validation that ensures that the NTP server
can reach its upstream stratum 1 servers. This will ensure that the NTP
server is serving up the correct time. This is optional, however, since the
only dependency is that the overcloud nodes agree on the time, not that it
be correct.

Documentation Impact
====================

The setup and configuration of the NTP server should be documented. Basic NTP
best practices should be communicated.

References
==========

* [1] - Administration Guide Draft/NTP - Fedora Project
  https://fedoraproject.org/wiki/Administration_Guide_Draft/NTP
@ -1,224 +0,0 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

================================
Validations in TripleO Workflows
================================

https://blueprints.launchpad.net/tripleo/+spec/validations-in-workflows

The Newton release introduced TripleO validations -- a set of
extendable checks that identify potential deployment issues early and
verify that the deployed OpenStack is set up properly. These
validations are automatically being run by the TripleO UI, but there
is no support for the command line workflow and they're not being
exercised by our CI jobs either.


Problem Description
===================

When enabled, TripleO UI runs the validations at the appropriate phase
of the planning and deployment. This is done within the TripleO UI
codebase and therefore not available to python-tripleoclient or
the CI.

The TripleO deployer can run the validations manually, but they need
to know at which point to do so and they will need to do it by calling
Mistral directly.

This causes a disparity between the command line and GUI experience
and complicates the efforts to exercise the validations by the CI.


Proposed Change
===============

Overview
--------

Each validation already advertises where in the planning/deployment
process it should be run. This is under the ``vars/metadata/groups``
section. In addition, the ``tripleo.validations.v1.run_groups``
Mistral workflow lets us run all validations belonging to a given
group.

For each validation group (currently ``pre-introspection``, ``pre-deployment``
and ``post-deployment``) we will update the appropriate workflow in
tripleo-common to optionally call ``run_groups``.

Each of the workflows above will receive a new Mistral input called
``run_validations``. It will be a boolean value that indicates whether
the validations ought to be run as part of that workflow or not.

To expose this functionality to the command line user, we will add an
option for enabling/disabling validations into python-tripleoclient
(which will set the ``run_validations`` Mistral input) and a way to
show the results of each validation in the screen output.

When the validations are run, they will report their status to Zaqar
and any failures will block the deployment. The deployer can disable
validations if they wish to proceed despite failures.

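As a sketch of the shape of this change (only ``run_validations`` and the
``tripleo.validations.v1.run_groups`` workflow come from this spec; the task
names and surrounding structure below are illustrative)::

   # Hypothetical excerpt from a tripleo-common deployment workflow
   input:
     - run_validations: false

   tasks:
     # only invoked when run_validations is true
     run_pre_deployment_validations:
       workflow: tripleo.validations.v1.run_groups
       input:
         group_names:
           - pre-deployment
       on-success: start_deployment
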
One unresolved question is the post-deployment validations. The Heat
stack create/update Mistral action is currently asynchronous and we
have no way of calling actions after the deployment has finished.
Unless we change that, the post-deployment validations may have to be
run manually (or via python-tripleoclient).


Alternatives
------------

1. Document where to run each group and how and leave it at that. This
   risks that the users already familiar with TripleO may miss the
   validations or that they won't bother.

   We would still need to find a way to run validations in a CI job,
   though.

2. Provide subcommands to run validations (and groups of validations)
   in python-tripleoclient and rely on people running them manually.

   This is similar to 1., but provides an easier way of running a
   validation and getting its result.

   Note that this may be a useful addition even with the proposal
   outlined in this specification.

3. Do what the GUI does in python-tripleoclient, too. The client will
   know when to run which validation and will report the results back.

   The drawback is that we'll need to implement and maintain the same
   set of rules in two different codebases and have no API for them,
   i.e. exactly what the switch to Mistral is supposed to solve.



Security Impact
---------------

None

Other End User Impact
---------------------

We will need to modify python-tripleoclient to be able to display the
status of validations once they have finished. TripleO UI already does this.

The deployers may need to learn about the validations.

Performance Impact
------------------

Running a validation can take about a minute (this depends on the
nature of the validation, e.g. does it check a configuration file or
does it need to log in to all compute nodes).

This may be a concern if we run multiple validations at the same
time.

We should be able to run the whole group in parallel. It's possible
we're already doing that, but this needs to be investigated.
Specifically, does ``with-items`` run the tasks in sequence or in
parallel?

There are also some options that would allow us to speed up the
running time of a validation itself, by using common ways of speeding
up Ansible playbooks in general:

* Disabling the default "setup" task for validations that don't need
  it (this task gathers hardware and system information about the
  target node and it takes some time)
* Using persistent SSH connections
* Making each validation task run independently (by default, Ansible
  runs a task on all the nodes, waits for its completion everywhere
  and then moves on to another task)
* Each validation runs the ``tripleo-ansible-inventory`` script which
  gathers information about deployed servers and configuration from
  Mistral and Heat. Running this script can be slow. When we run
  multiple validations at the same time, we should generate the
  inventory only once and cache the results.

Since the validations are going to be optional, the deployer can
always choose not to run them. On the other hand, any slowdown should
ideally be outweighed by the time saved not investigating failed
deployments.

We will also document the actual time difference. This information
should be readily available from our CI environments, but we should
also provide measurements on bare metal.


Other Deployer Impact
---------------------

Depending on whether the validations will be run by default or not,
the only impact should be an option that lets the deployer run them
or not.


Developer Impact
----------------

The TripleO developers may need to learn about validations, where to
find them and how to change them.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  tsedovic

Other contributors:
  None

Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.

* Add ``run_validations`` input and call ``run_groups`` from the
  deployment and node registration workflows
* Add an option to run the validations to python-tripleoclient
* Display the validation results with python-tripleoclient
* Add or update a CI job to run the validations
* Add a CI job to tripleo-validations


Dependencies
============

None


Testing
=======

This should make the validations testable in CI. Ideally, we would
verify the expected success/failure for the known validations given
the CI environment. But having them go through the testing machinery
would be a good first step to ensure we don't break anything.


Documentation Impact
====================

We will need to document the fact that we have validations, where they
live and when and how they are being run.


References
==========

* http://docs.openstack.org/developer/tripleo-common/readme.html#validations
* http://git.openstack.org/cgit/openstack/tripleo-validations/
* http://docs.openstack.org/developer/tripleo-validations/
@ -1,185 +0,0 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

====================================
AIDE - Intrusion Detection Database
====================================

https://blueprints.launchpad.net/tripleo/+spec/tripleo-aide-database

AIDE (Advanced Intrusion Detection Environment) is a file and directory
integrity verification system. It computes a checksum of object
attributes, which are then stored into a database. Operators can then
run periodic checks against the current state of defined objects and
verify if any attributes have been changed (thereby suggesting possible
malicious / unauthorised tampering).

Problem Description
===================

Security frameworks such as DISA STIG [1] / CIS [3] require that AIDE be
installed and configured on all Linux systems.

To enable OpenStack operators to comply with the aforementioned security
requirements, they require a method of automating the installation of
AIDE and initialization of AIDE's integrity database. They also require
a means to perform a periodic integrity verification run.

Proposed Change
===============

Overview
--------

Introduce a puppet module to manage the AIDE service and ensure the AIDE
application is installed, create rule entries, and set up a CRON job to allow
a periodic check of the AIDE database, or templates to allow monitoring
via Sensu checks as part of OpTools.

Create a tripleo-heat-templates service to allow population of hiera data
to be consumed by the puppet module managing AIDE.

The proposed puppet module is lhinds-aide [2] as this module will accept
rules declared in hiera data, initialize the database and enable CRON
entries. Other puppet AIDE modules were missing hiera functionality or
other features (such as CRON population).

Within tripleo-heat-templates, a composable service will be created to
feed a rule hash into the AIDE puppet module as follows::

  AIDERules:
    description: Mapping of AIDE config rules
    type: json
    default: {}

The Operator can then source an environment file and provide rule
information as a hash::

  parameter_defaults:
    AIDERules:
      'Monitor /etc for changes':
        content: '/etc p+sha256'
        order: 1
      'Monitor /boot for changes':
        content: '/boot p+u+g+a'
        order: 2

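For illustration, the hash above would translate into ordinary AIDE rules and
a periodic verification run, roughly like the following (the generated file
path and CRON schedule are assumptions of this sketch, not part of the spec)::

   # /etc/aide.conf entries generated from AIDERules
   /etc p+sha256
   /boot p+u+g+a

   # Example CRON entry for the periodic check against the database
   0 3 * * * root /usr/sbin/aide --check
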
Ops Tool Integration
--------------------

In order to allow active monitoring of AIDE events, a Sensu check can
be created to perform an interval based verification of AIDE monitored
files (set using ``AIDERules``) against the last initialized database.

Results of the Sensu activated AIDE verification checks will then be fed
to the Sensu server for alerting and archiving.

The Sensu clients (all overcloud nodes) will be configured with a
standalone/passive check via the puppet-sensu module which is already
installed on the overcloud image.

If the Operator should choose not to use OpTools, then they can still
configure AIDE using the traditional method by means of a CRON entry.

Alternatives
------------

Using a puppet module coupled with a TripleO service is the most
pragmatic approach to populating AIDE rules and managing the AIDE
service.

Security Impact
---------------

AIDE is an integrity checking application and therefore requires
Operators to ensure the security of AIDE's database so that it is protected
from tampering. Should an attacker get access to the database, they could
attempt to hide malicious activity by removing records of file integrity
hashes.

The default location is currently `/var/lib/aide/$database` which
puppet-aide sets with privileges of `0600` and ownership of
`root \ root`.

AIDE itself introduces no security impact to any OpenStack projects
and has no interaction with any OpenStack services.

Other End User Impact
---------------------

The service interaction will occur via heat templates and the TripleO
UI (should a capability map be present).

Performance Impact
------------------

No performance impact.

Other Deployer Impact
---------------------

The service will be utilised by means of an environment file. Therefore,
should a deployer not reference the environment template using the
`openstack overcloud deploy -e` flag, there will be no impact.

Developer Impact
----------------

No impact on other OpenStack developers.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  lhinds

Work Items
----------

1. Add puppet-aide [1] to RDO as a puppet package

2. Create TripleO service for AIDE

3. Create capability map

4. Create CI job

5. Submit documentation to tripleo-docs.


Dependencies
============

Dependency on the lhinds-aide Puppet module.

Testing
=======

Will be tested in TripleO CI by adding the service and an environment
template to a TripleO CI scenario.

Documentation Impact
====================

Documentation patches will be made to explain how to use the service.

References
==========

Original Launchpad issue: https://bugs.launchpad.net/tripleo/+bug/1665031

[1] https://www.stigviewer.com/stig/red_hat_enterprise_linux_6/2016-07-22/finding/V-38489

[2] https://forge.puppet.com/lhinds/aide

[3] file:///home/luke/project-files/tripleo-security-hardening/CIS_Red_Hat_Enterprise_Linux_7_Benchmark_v2.1.0.pdf
@ -1,148 +0,0 @@
===========================================
Container Healthchecks for TripleO Services
===========================================

https://blueprints.launchpad.net/tripleo/+spec/container-healthchecks

An OpenStack deployment involves many services spread across many
hosts. It is important that we provide tooling and APIs that make it
as easy as possible to monitor this large, distributed environment.
The move to containerized services in the overcloud [1]
brings with it many opportunities, such as the ability to bundle
services with their associated health checks and provide a standard
API for assessing the health of the service.

[1]: https://blueprints.launchpad.net/tripleo/+spec/containerize-tripleo

Problem Description
===================

The people who are in the best position to develop appropriate health
checks for a service are generally those people responsible for
developing the service. Unfortunately, the task of setting up
monitoring generally ends up in the hands of cloud operators or some
intermediary.

I propose that we take advantage of the bundling offered by
containerized services and create a standard API with which an
operator can assess the health of a service. This makes life easier
for the operator, who can now provide granular service monitoring
without requiring detailed knowledge about every service, and it
allows service developers to ensure that services are monitored
appropriately.

Proposed Change
===============

Overview
--------

The Docker engine (since version 1.12), as well as most higher-level
orchestration frameworks, provide a standard mechanism for validating
the health of a container. Docker itself provides the
HEALTHCHECK_ directive, while Kubernetes has explicit
support for `liveness and readiness probes`_. Both
mechanisms work by executing a defined command inside the container,
and using the result of that execution to determine whether or not the
container is "healthy".

.. _liveness and readiness probes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
.. _healthcheck: https://docs.docker.com/engine/reference/builder/#healthcheck

I propose that we explicitly support these interfaces in containerized
TripleO services through the following means:

1. Include in every container a `/openstack/healthcheck` command that
   will check the health of the containerized service, exit with
   status ``0`` if the service is healthy or ``1`` if not, and provide
   a message on ``stdout`` describing the nature of the error (a sketch
   of such a script follows this list).

2. Include in every Docker image an appropriate ``HEALTHCHECK``
   directive to utilize the script::

     HEALTHCHECK CMD /openstack/healthcheck

3. If Kubernetes becomes a standard part of the TripleO deployment
   process, we may be able to implement liveness or readiness probes
   using the same script::

     livenessProbe:
       exec:
         command:
           - /openstack/healthcheck

Alternatives
|
||||
------------
|
||||
|
||||
The alternative is the status quo: services do not provide a standard
|
||||
healthcheck API, and service monitoring must be configured
|
||||
individually by cloud operators.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
N/A
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Users can explicitly run the healthcheck script to immediately assess
|
||||
the state of a service.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
This proposal will result in the periodic execution of tasks on the
|
||||
overcloud hosts. When designing health checks, service developers
|
||||
should select appropriate check intervals such that there is minimal
|
||||
operational overhead from the health checks.
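
For the Kubernetes case, such check intervals map onto the standard probe
timing fields; the values in this sketch are purely illustrative and not
intervals prescribed by this spec::

    livenessProbe:
      exec:
        command:
          - /openstack/healthcheck
      initialDelaySeconds: 30   # illustrative values only
      periodSeconds: 60
      timeoutSeconds: 10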

Other Deployer Impact
---------------------

N/A

Developer Impact
----------------

Developers will need to determine how best to assess the health of a
service and provide the appropriate script to perform this check.

Implementation
==============

Assignee(s)
-----------

N/A

Work Items
----------

N/A

Dependencies
============

- This requires that we implement the `containerize-tripleo-overcloud`_
  blueprint.

.. _containerize-tripleo-overcloud: https://specs.openstack.org/openstack/tripleo-specs/specs/ocata/containerize-tripleo-overcloud.html

Testing
=======

TripleO CI jobs should be updated to utilize the healthcheck API to
determine if services are running correctly.

Documentation Impact
====================

Any documentation describing the process of containerizing a service
for TripleO must be updated to describe the healthcheck API.

References
==========

N/A

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

====================================================
Best practices for logging of containerized services
====================================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+spec/containerized-services-logs

Containerized services shall persist their logs. There are many ways to address
that. The scope of this blueprint is to suggest best practices and intermediate
implementation steps for the Pike release as well.

Problem Description
===================

Pike will be released with a notion of hybrid deployments, meaning that some
services may be running in containers and managed by the docker daemon, while
others may be managed by systemd or Pacemaker and placed on hosts directly.

The notion of composable deployments likewise assumes end users and
developers may want to deploy some services non-containerized, and tripleo
heat templates shall not prevent them from doing so.

Regardless of the service placement type, end users and developers shall get
all logs persisted, consistent and available for future analysis.

Proposed Change
===============

Overview
--------

.. note:: As the spec transitions from Pike, some of the sections below are
   split into the Pike and Queens parts.

The scope of this document for Pike is limited to recommendations for
developers of containerized services, bearing in mind use cases for hybrid
environments. It addresses only intermediate implementation steps for Pike and
smooth UX with upgrades from Ocata to Pike, and with future upgrades from Pike
as well.

The `12factor <https://12factor.net/logs>`_ methodology is the general
guideline for logging in containerized apps. Based on it, we rephrase our main
design assumption as: "each running process writes its single event stream to
be persisted outside of its container". And we put an additional design
constraint: "each container has only one running foreground process; nothing
else requires persistent logs that may outlast the container execution time".
This assumes all streams but the main event stream are ephemeral and live no
longer than the container instance does.

.. note:: HA stateful services may require another approach, see the
   alternatives section for more details.

The scope for future releases, starting from Queens, shall include best
practices for collecting (shipping), storing (persisting), processing (parsing)
and accessing (filtering) logs of hybrid TripleO deployments with advanced
techniques like EFK (Elasticsearch, Fluentd, Kibana) or the like. Hereafter
those are referred to as "future steps".

Note, this is limited to OpenStack and the Linux HA stack (Pacemaker and
Corosync). We can do nothing about the rest of the supporting and legacy apps
like webservers, load balancing reverse proxies, database and message queue
clusters. Even if we could, this stays out of OpenStack specs scope.

Here is a list of suggested best practices for TripleO developers for Pike:

* Host services shall keep writing logs as is, having UIDs, logging configs,
  rotation rules and target directories unchanged.

  .. note:: Host services changing their control plane to systemd or pacemaker
     in the Ocata to Pike upgrade process may have logging configs, rules and
     destinations changed as well, but this is out of the scope of this spec.

* Containerized services that normally log to files under the `/var/log` dir
  shall keep logging as is inside of containers. The logs shall be persisted
  with hostpath mounted volumes placed under the `/var/log/containers` path.
  This is required because of the hybrid use cases. For example, containerized
  nova services access `/var/log/nova` with different UIDs than the host
  services would have. Given that, nova containers should have log volumes
  mounted as ``-v /var/log/containers/nova:/var/log/nova`` in order to not
  bring conflicts (see the sketch after this list). Persisted log files can
  then be pulled by a node agent like fluentd or rsyslog and forwarded to a
  central logging service.

* Containerized services that can only log to syslog facilities: bind mount
  /dev/log into all tripleo service containers as well so that the host
  collects the logs via journald. This should be a standard component of our
  container "API": we guarantee (a) a log directory and (b) a syslog socket
  for *every* containerized service. Collected journald logs can then be
  pulled by a node agent like fluentd or rsyslog and forwarded to a central
  logging service.

* Containerized services that leverage Kolla bootstrap, extended start and/or
  config facilities shall be templated with Heat deployment steps as the
  following:

  * Host prep tasks to ensure target directories are pre-created on hosts.

  * Kolla config's permissions to enforce ownership for log dirs (hostpath
    mounted volumes).

  * Init containers steps to chown log directories early otherwise. Kolla
    bootstrap and DB sync containers are normally invoked before the
    `kolla_config` permissions are set; hence the init containers.

* Containerized services that do not use Kolla and run as root in containers
  shall be running from a separate user namespace remapped to a non root host
  user, for security reasons. No such services are currently deployed by
  TripleO, though.

  .. note:: The Docker daemon would have to be running under that remapped non
     root user as well. See the docker documentation for the
     ``--userns-remap`` option.

* Containerized services that run under the pacemaker (or pacemaker remote)
  control plane and do not fall into any of the given cases: bind mount
  /dev/log as well. At this stage the way services log is in line with the
  best practice w.r.t. a "dedicated log directory to avoid conflicts".
  Pacemaker bundles isolate the containerized resources' logs on the host into
  `/var/log/pacemaker/bundles/{resource}`.

Future steps TBD.
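
As a minimal sketch of the hostpath and syslog socket conventions above, a
containerized service definition in tripleo-heat-templates might mount its
volumes as follows; the service name, image placeholder and step number are
illustrative assumptions, not the actual templates::

    # Hypothetical excerpt from a docker/services/*.yaml template.
    docker_config:
      step_4:
        nova_api:
          image: <nova api image>   # placeholder, not a real reference
          volumes:
            # host path under /var/log/containers, container path /var/log/nova
            - /var/log/containers/nova:/var/log/nova
            # syslog socket guaranteed to every containerized service
            - /dev/log:/dev/log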

Alternatives
------------

The alternatives below come for future steps only.

As an alternative to hostpath mounted volumes, create a directory structure
such that each container has a namespace for its logs somewhere under
`/var/log`. So, a container named 12345 would have *all its logs* in the
`/var/log/container-12345` directory structure (requires clarification).
This also alters the assumption that in general there is only one main log
per container, which is the case for highly available containerized
stateful services bundled with pacemaker remote, with multiple logs to
capture, like `/var/log/pacemaker.log`, logs for cluster bootstrapping
events, control plane agents, helper tools like rsyncd, and the stateful
service itself.

When we have control over the logging API (e.g. via oslo.log), we can forsake
hostpath mounted volumes and configure containerized services to output to
syslog (via bind mounting `/dev/log`) so that the host collects the logs via
journald. Or configure services to log only to stdout, so that the docker
daemon collects logs and ships them to journald.

.. note:: The "winning" trend is switching all services (including openstack
   services) to syslog and logging nothing to /var/log, e.g. just bind-mount
   ``-v /dev/null:/var/log`` for containers.

Or use a specialized log driver like the oslo.log fluentd logging driver
(instead of the default journald or json-file) to output to a fluentd log
agent running on the host or containerized as well, which would then aggregate
logs from all containers, annotate them with node metadata, and use the
fluentd `secure_forward` protocol to send the logs to a remote fluentd agent
like common logging.

These are not doable for Pike as they require too many changes, impacting
upgrade UX as well. However, this is the only recommended best practice and
end goal for future releases and future steps coming after Pike.

Security Impact
---------------

As the spec transitions from Pike, the section is split into the Pike and
Queens parts.

UID collisions may happen when users in containers occasionally match other
user IDs on the host, allowing those users to access logs of foreign services.
This should be mitigated with SELinux policies.

Future steps impact TBD.

Other End User Impact
---------------------

As the spec transitions from Pike, the section is split into the Pike and
Queens parts.

Containerized and host services will be logging under different paths. The
former to `/var/log/containers/foo` and `/var/log/pacemaker/bundles/*`, the
latter to `/var/log/foo`. This impacts log collecting tools like
`sosreport <https://github.com/sosreport/sos>`_ et al.

Future steps impact TBD.

Performance Impact
------------------

As the spec transitions from Pike, the section is split into the Pike and
Queens parts.

Hostpath mounted volumes bring no performance overhead for containerized
services' logs. Host services are not affected by the proposed change.

The future steps impact is that handling of the byte stream of stdout can
have a significant impact on performance.

Other Deployer Impact
---------------------

As the spec transitions from Pike, the section is split into the Pike and
Queens parts.

When upgrading from Ocata to Pike, containerized services will change their
logging destination directory as described in the end user impact section.
This also impacts log collecting tools like sosreport et al.

Logrotate scripts must be adjusted for the `/var/log/containers` and
`/var/log/pacemaker/bundles/*` paths as well.

Future steps impact TBD.

Developer Impact
----------------

As the spec transitions from Pike, the section is split into the Pike and
Queens parts.

Developers will have to keep in mind the recommended intermediate best
practices when designing heat templates for TripleO hybrid deployments.

Developers will have to understand Kolla and Docker runtime internals,
although that's already the case once we have containerized services onboard.

Future steps impact (to be finished):

* The notion of Tracebacks in the events is difficult to handle as a byte
  stream, because it becomes the responsibility of the apps to ensure output
  of new-line separated text is not interleaved. That notion of Tracebacks
  needs to be implemented on the apps' side.

* Oslo.log is really emitting a stream of event points, or trace points, with
  rich metadata to describe those events. Capturing that metadata via a byte
  stream later needs to be implemented.

* Event streams of child processes, forked even temporarily, should or may
  need to be captured by the parent event stream as well.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bogdando

Other contributors:
  michele
  flaper87
  larsks
  dciabrin

Work Items
----------

As the spec transitions from Pike, the work items are split into the Pike and
Queens parts:

* Implement an intermediate logging solution for tripleo-heat-templates for
  containerized services that log under `/var/log` (flaper87, bogdando). Done
  for Pike.
* Come up with an intermediate logging solution for containerized services
  that log to syslog only (larsks). Done for Pike.
* Come up with a solution for HA containerized services managed by Pacemaker
  (michele). Done for Pike.
* Make sure that sosreport collects `/var/log/containers/*` and
  `/var/log/pacemaker/bundles/*` (no assignee). Pending for Pike.
* Adjust logrotate scripts for the `/var/log/containers` and
  `/var/log/pacemaker/bundles/*` paths (no assignee). Pending for Pike.
* Verify if the namespaced `/var/log/` for containers works and fits the case
  (no assignee).
* Address the current state of OpenStack infrastructure apps as they are, and
  gently move them towards these guidelines referred to as "future steps" (no
  assignee).

Dependencies
============

None.

Testing
=======

Existing CI coverage fully fits the proposed change needs.

Documentation Impact
====================

The given best practices and the intermediate solutions built from them do not
involve changes visible to end users other than those given in the end user
impact section. The same is true for developers and dev docs.

References
==========

* `Sosreport tool <https://github.com/sosreport/sos>`_.
* `Pacemaker container bundles <http://lists.clusterlabs.org/pipermail/users/2017-April/005380.html>`_.
* `User namespaces in docker <https://success.docker.com/KBase/Introduction_to_User_Namespaces_in_Docker_Engine>`_.
* `Docker logging drivers <https://docs.docker.com/engine/admin/logging/overview/>`_.
* `Engineering blog posts <http://blog.oddbit.com/2017/06/14/openstack-containers-and-logging/>`_.

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

==================================
Deployment Plan Management changes
==================================

https://blueprints.launchpad.net/tripleo/+spec/deployment-plan-management-refactor

The goal of this work is to improve GUI and CLI interoperability by changing
the way deployment configuration is stored, making it more compact and
simplifying plan import and export.

Problem Description
===================

The problem is broadly described in the mailing list discussion [1]. This spec
is a result of the agreement reached in that discussion.

The TripleO-Common library currently operates on the Mistral environment for
storing plan configuration, although not all data is stored there: additional
files which define plan configuration (roles_data.yaml, network_data.yaml,
capabilities-map.yaml) are currently used by the CLI to drive certain parts of
deployment configuration. This imposes the problem of synchronizing the
content of those files with the Mistral environment when a plan is imported or
exported.

TripleO-Common needs to be able to provide means for roles and networks
management.

Proposed Change
===============

Overview
--------

TripleO plan configuration data should be stored in a single place rather than
in multiple places (the Mistral environment plus plan meta files stored in the
Swift container).

TripleO-Common should move from using the Mistral environment to storing the
information in a file (plan-environment.yaml) in the Swift container, so all
plan configuration data is stored in 'meta' files in Swift and tripleo-common
provides an API to perform operations on this data.

Plan meta files: capabilities-map.yaml, roles_data.yaml, network_data.yaml [3],
plan-environment.yaml

Proposed plan-environment.yaml file structure::

    version: 1.0

    name: A name of a plan which this file describes
    description: >
      A description of a plan, its usage and potential summary of features it
      provides
    template: overcloud.yaml
    environments:
      - path: overcloud-resource-registry-puppet.yaml
    parameter_defaults:
      ControllerCount: 1
    passwords:
      TrovePassword: "vEPKFbdpTeesCWRmtjgH4s7M8"
      PankoPassword: "qJJj3gTg8bTCkbtYtYVPtzcyz"
      KeystoneCredential0: "Yeh1wPLUWz0kiugxifYU19qaf5FADDZU31dnno4gJns="


This solution keeps the whole plan configuration stored in the Swift container
together with the rest of the plan files and simplifies plan import/export
functionality, as no synchronization is necessary between the Swift files and
the Mistral environment. Plan configuration is more straightforward and
CLI/GUI interoperability is improved.

Initially the plan configuration is going to be split into multiple 'meta'
files (plan-environment.yaml, capabilities-map.yaml, roles_data.yaml,
network_data.yaml), all stored in the Swift container.
As a next step we can evaluate a solution which merges them all into
plan-environment.yaml.

Using the CLI workflow, the user works with local files. The plan, Networks
and Roles are configured by making changes directly in the relevant files
(plan-environment.yaml, roles_data.yaml, ...).
The plan is created and templates are generated by the deploy command.

The TripleO Common library will implement CRUD actions for Roles and Networks
management. This will allow clients to manage Roles and Networks and generate
relevant templates (see work items).
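
For reference, a minimal sketch of a roles_data.yaml entry that such Roles
actions would read and persist; the role name and service list are
illustrative only::

    # Hypothetical roles_data.yaml excerpt; values are illustrative.
    - name: Controller
      CountDefault: 1
      ServicesDefault:
        - OS::TripleO::Services::Keystone
        - OS::TripleO::Services::HeatEngine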

The TripleO UI and other clients use the tripleo-common library, which
operates on the plan stored in the Swift container.


Alternatives
------------

An alternative approach is treating the Swift 'meta' files as an input during
plan creation and synchronizing them to the Mistral environment when a plan is
imported, which is described initially in [1] and is used in the current plan
import/export implementation [2].

This solution needs to deal with multiple race conditions, makes plan
import/export much more complicated, and the overall solution is not simple to
understand. Using this solution should be considered if using the Mistral
environment as a plan configuration storage has some marginal benefits over
using a file in Swift, which is not the case according to the discussion [1].

As a subsequent step to the proposed solution, it is possible to join all
existing 'meta' files into a single one.

Security Impact
---------------

None.

Other End User Impact
---------------------

CLI/GUI interoperability is improved.

Performance Impact
------------------

None.

Other Deployer Impact
---------------------

None.

Developer Impact
----------------

This change makes the Deployment Plan import/export functionality much simpler
and makes tripleo-common operate on the same set of files as the CLI does. It
is much easier for CLI users to understand how tripleo-common works, as it
does not do any Swift files -> Mistral environment synchronization in the
background.

TripleO-Common can introduce functionality to manage Roles and Networks which
matches how the CLI workflow does it.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  akrivoka

Other contributors:
  * d0ugal
  * rbrady
  * jtomasek

Work Items
----------

* [tripleo-heat-templates] Update plan-environment.yaml to match the new
  specification.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/update-plan-environment-yaml

* [tripleo-common] Update relevant actions to store data in
  plan-environment.yaml in Swift instead of using the mistral-environment.
  Migrate any existing data away from Mistral.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/stop-using-mistral-env

* [tripleo-common] On plan creation/update, tripleo-common validates the plan
  and checks that roles_data.yaml and network_data.yaml exist, as well as
  validating their format. On successful plan creation/update, templates are
  generated/regenerated.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/validate-roles-networks

* [tripleo-common] Provide a GetRoles action to list current roles in json
  format by reading roles_data.yaml.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/get-roles-action

* [tripleo-common] Provide a GetNetworks action to list current networks in
  json format by reading network_data.yaml.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/get-networks-action

* [tripleo-common] Provide an UpdateRoles action to update Roles. It takes
  data in json format, validates its contents and persists it in
  roles_data.yaml; after a successful update, templates are regenerated.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/update-roles-action

* [tripleo-common] Provide an UpdateNetworks action to update Networks. It
  takes data in json format, validates its contents and persists it in
  network_data.yaml.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/update-networks-action

* [tripleo-ui] Provide a way to create/list/update/delete Roles by calling
  tripleo-common actions.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/roles-crud-ui

* [tripleo-ui] Provide a way to create/list/update/delete Networks by calling
  tripleo-common actions.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/networks-crud-ui

* [tripleo-ui] Provide a way to assign Networks to Roles.

  blueprint: https://blueprints.launchpad.net/tripleo/+spec/networks-roles-assignment-ui

* [python-tripleoclient] Update the CLI to use tripleo-common actions for
  operations that currently modify the mistral environment.

  related bug: https://bugs.launchpad.net/tripleo/+bug/1635409

Dependencies
============

None.

Testing
=======

The feature will be tested as part of TripleO CI.

Documentation Impact
====================

Documentation should be updated to reflect the new capabilities of the GUI
(Roles/Networks management), the way to use plan-environment.yaml via the CLI
workflow, and CLI/GUI interoperability using the plan import and export
features.

References
==========

[1] http://lists.openstack.org/pipermail/openstack-dev/2017-February/111433.html

[2] https://specs.openstack.org/openstack/tripleo-specs/specs/ocata/gui-plan-import-export.html

[3] https://review.openstack.org/#/c/409921/

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

============================
Sample Environment Generator
============================

A common tool to generate sample Heat environment files would be beneficial
in two main ways:

* Consistent formatting and details. Every environment file would include
  parameter descriptions, types, defaults, etc.

* Ease of updating. The parameters can be dynamically read from the templates,
  which allows the sample environments to be updated automatically when
  parameters are added or changed.

Problem Description
===================

Currently our sample environments are hand written, with no consistency in
terms of what is included. Most do not include a description of what all
the parameters do, and almost none include the types of the parameters or the
default values for them.

In addition, the environment files often get out of date because developers
have to remember to manually update them any time they make a change to the
parameters for a given feature or service. This is tedious and error-prone.

The lack of consistency in environment files is also a problem for the UI,
which wants to use details from environments to improve the user experience.
When environments are created manually, these details are likely to be missed.

Proposed Change
===============

Overview
--------

A new tool, similar to the oslo.config generator, will allow us to eliminate
these problems. It will take some basic information about the environment and
use the parameter definitions in the templates to generate the sample
environment file.

The resulting environments should contain the following information:

* Human-readable Title
* Description
* parameter_defaults describing all the available parameters for the
  environment
* Optional resource_registry with any necessary entries

Initially the title and description will simply be comments, but eventually we
would like to get support for those fields into Heat itself so they can be
top-level keys.
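
As a rough sketch of what a generated file might look like under this
proposal; the parameter name, type and values below are invented for
illustration and are not output of the actual tool::

    # title: Enable Example Feature
    # description: |
    #   Generated sample environment; every parameter carries its
    #   description, type and default pulled from the templates.
    parameter_defaults:
      # Timeout for the example service, in seconds. (type: number)
      ExampleServiceTimeout: 60
    resource_registry:
      OS::TripleO::Services::Example: ../puppet/services/example.yaml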

Ideally the tool would be able to update the capabilities map automatically as
well. At some point there may be some refactoring done there to eliminate the
overlap, but during the transition period this will be useful.

This is also a good opportunity to impose some organization on the
environments directory of tripleo-heat-templates. Currently it is mostly a
flat directory that contains all of the possible environments. It would be
good to add subdirectories that group related environments so they are easier
to find.

The non-generated environments will either be replaced by generated ones,
when that makes sense, or deprecated in favor of a generated environment.
In the latter case the old environments will be left for a cycle to allow
users transition time to the new environments.

Alternatives
------------

We could add more checks to the yaml-validate tool to ensure environment files
contain the required information, but this still requires more developer
time and doesn't solve the maintenance problems as parameters change.

Security Impact
---------------

None

Other End User Impact
---------------------

Users should get an improved deployment experience through more complete and
better documented sample environments. Existing users who are referencing
the existing sample environments may need to switch to the new generated
environments.

Performance Impact
------------------

No runtime performance impact. Initial testing suggests that it may take a
non-trivial amount of time to generate all of the environments, but it's not
something developers should have to do often.

Other Deployer Impact
---------------------

See End User Impact

Developer Impact
----------------

Developers will need to write an entry in the input file for the tool rather
than directly writing sample environments. The input format of the tool will
be documented, so this should not be too difficult.

When an existing environment is deprecated in favor of a generated one, a
release note should be written by the developer making the change in order to
communicate it to users.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bnemec

Other contributors:
  jtomasek

Work Items
----------

* Update the proposed tool to reflect the latest design decisions
* Convert existing environments to be generated


Dependencies
============

No immediate dependencies, but in the long run we would like to have some
added functionality from Heat to allow these environments to be more easily
consumed by the UI. However, it was agreed at the PTG that we would proceed
with this work and make the Heat changes in parallel so we can get some of
the benefits of the change as soon as possible.


Testing
=======

Any environments used in CI should be generated with the tool. We will want
to add a job that exercises the tool as well, probably a job that ensures any
changes in the patch under test are reflected in the environment files.


Documentation Impact
====================

We will need to document the format of the input file.


References
==========

`Initial proposed version of the tool
<https://review.openstack.org/#/c/253638/>`_

https://etherpad.openstack.org/p/tripleo-environment-generator

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

===========
GUI logging
===========

The TripleO GUI currently has no way to persist logging information.

Problem Description
===================

The TripleO GUI is a web application without its own dedicated backend. As
such, any and all client-side errors are lost when the End User reloads the
page or navigates away from the application. When things go wrong, the End
User is unable to retrieve client-side logs because this information is not
persisted.

Proposed Change
===============

Overview
--------

I propose that we use Zaqar as a persistence backend for client-side logging.
At present, the web application is already communicating with Zaqar using
websockets. We can use this connection to publish new messages to a dedicated
logging queue.
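
A minimal sketch of what a single message posted to that queue might look
like; the body fields are an assumption here, since the standard JSON format
is only defined later as a work item::

    # Hypothetical message posted to the tripleo-ui-logging queue.
    messages:
      - ttl: 3600            # Zaqar message TTL of one hour
        body:
          level: error
          timestamp: "2017-06-01T12:00:00Z"
          message: "Failed to fetch the deployment plan"
          source: tripleo-ui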

Zaqar messages have a TTL of one hour. So once every thirty minutes, Mistral
will query Zaqar using a cron trigger, and retrieve all messages from the
``tripleo-ui-logging`` queue. Mistral will then look for a file called
``tripleo-ui-log`` in Swift. If this file exists, Mistral will check its size.
If the size exceeds a predetermined limit (e.g. 10MB), Mistral will rename it
to ``tripleo-ui-log-<timestamp>``, and create a new file in its place. The
file will then receive the messages from Zaqar, one per line. Once we reach,
let's say, a hundred archives (about 1GB) we can start dropping data in order
to prevent unnecessary data accumulation.

To view the logging data, we can ask Swift for the ten latest files with a
prefix of ``tripleo-ui-log``. These files can be presented in the GUI for
download. Should the user require, we can present a "View more" link that will
display the rest of the collected files.

Alternatives
------------

None at this time

Security Impact
---------------

There is a chance of logging sensitive data. I propose that we apply some
common scrubbing mechanism to the messages before they are stored in Swift.

Other End User Impact
---------------------

Performance Impact
------------------

Sending additional messages over an existing websocket connection should have
a negligible performance impact on the web application. Likewise, running
periodic cron tasks in Mistral shouldn't impose a significant burden on the
undercloud machine.

Other Deployer Impact
---------------------

Developer Impact
----------------

Developers should also benefit from having a centralized logging system in
place as a means of improving productivity when debugging.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  hpokorny

Work Items
----------

* Introduce a central logging system (already in progress, see `blueprint`_)
* Introduce a global error handler
* Convert all logging messages to JSON using a standard format
* Configuration: the name for the Zaqar queue to carry the logging data
* Introduce a Mistral workflow to drain a Zaqar queue and publish the acquired
  data to a file in Swift
* Introduce GUI elements to download the log files

Dependencies
============

Testing
=======

We can write unit tests for the code that handles sending messages over the
websocket connection. We might be able to write an integration smoke test that
will ensure that a message is received by the undercloud. We can also add some
testing code to tripleo-common to cover the logic that drains the queue and
publishes the log data to Swift.

Documentation Impact
====================

We need to document the default name of the Zaqar queue, the maximum size of
each log file, and how many log files can be stored at most. On the End User
side, we should document the fact that a GUI-oriented log is available, and
the way to get it.

References
==========

.. _blueprint: https://blueprints.launchpad.net/tripleo/+spec/websocket-logging

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

===============================================
Tool to send email with TripleO tempest results
===============================================

https://blueprints.launchpad.net/tripleo/+spec/send-mail-tool

To speed up troubleshooting, debugging and reproducing TripleO tempest
results, we should have a list of people responsible for receiving email
status about tempest failures, containing a list of all the failures and of
the failures that are known issues already covered by an open bug in
Launchpad.

Problem Description
===================

Currently there are periodic TripleO jobs running tempest, and these results
are not being verified for failures or passes.
Even if there is someone responsible for verifying these runs, it is still a
manual job to go to the logs web site, check what the latest job is, go to the
logs, verify whether tempest ran, list the number of failures, check against a
list whether these failures are known or new ones, and only after all these
steps start working to identify the root cause of the problem.

Proposed Change
===============

Overview
--------

TripleO should provide a unified method to send email to a list of
users who would be responsible for taking action when something goes wrong
with tempest results.
The method should run at the end of every run, in the validate-tempest role,
and read the log file, either from the output generated by tempest or from the
logs uploaded to the logs website, identifying failures in tempest and
reporting them by mail, or saving the mail content in a file to be verified
later. The mail should contain information such as the list of failures, the
list of known failures, the date, a link to the logs of the run, and any other
information that might be relevant.

Alternatives
------------

One of the alternatives would be openstack-health, where the user can
subscribe to the RSS feed of one of the jobs using a third party application.
Right now, openstack-health doesn't support user subscriptions or sending
emails.

Security Impact
---------------

None, since it will use an API running in some cloud service to send the
email, so the username and password remain secure.

Other End User Impact
---------------------

None.

Performance Impact
------------------

None.

Other Deployer Impact
---------------------

None.

Developer Impact
----------------

Developers in different teams will be more involved in TripleO CI debugging.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  arxcruz


Work Items
----------

* The script should be written in Python
* Should be part of the validate-tempest role in tripleo-quickstart-extras
* Should be able to read the logs of any run in http://logs.openstack.org
* Once it reads the log, collect information about the failures, passes and
  known failures, or take the tempest output and parse it directly.
* Be able to work with a Jinja2 template to send email, so it's
  possible to have different templates for different types of job
* Read the list of addresses to which the report should be sent

  * The list is a dictionary mapping each email address to a list of tests
    and/or jobs in which the user is interested (see the sketch after this
    list).

* Render the template with the proper data
* Send the report
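
A minimal sketch of that dictionary, assuming a YAML representation; the keys,
addresses and job names are purely illustrative, not a defined format::

    # Hypothetical recipients mapping consumed by the report script.
    recipients:
      dev1@example.com:
        tests:
          - tempest.api.compute
        jobs:
          - periodic-tripleo-ci-centos-7-ovb-ha-tempest
      dev2@example.com:
        jobs:
          - periodic-tripleo-ci-centos-7-ovb-nonha-tempest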


Dependencies
============

None.

Testing
=======

As part of CI testing, the new tool should be used to send a
report to a list of interested people.

Documentation Impact
====================

Documentation should be updated to reflect the standard ways
to send the report and call the script at the end of every
periodic run.

References
==========

Sagi's tempest mail script:
https://github.com/sshnaidm/various/blob/master/check_tests.py

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

===============================================
Enable TripleO to Deploy Ceph via Ceph Ansible
===============================================

https://blueprints.launchpad.net/tripleo/+spec/tripleo-ceph-ansible

Enable TripleO to deploy Ceph via Ceph Ansible using a new Mistral
workflow. This will make the Ceph installation less tightly coupled
with TripleO, but the existing operator interfaces to deploy Ceph with
TripleO will still be supported until the end of the Queens release.

Problem Description
===================

The Ceph community maintains ceph-ansible to deploy and manage Ceph.
Members of the TripleO community maintain similar tools too. This is
a proposal to have TripleO trigger the Ceph community's tools via
Mistral as an alternative method to deploy and manage Ceph.

Benefits of using another project to deploy and manage Ceph
===========================================================

Avoid duplication of effort
---------------------------

If there is a feature or bug fix in the Ceph community's tools not in
the tools used by TripleO, then members of the TripleO community could
allow deployers to use those features directly instead of writing
their own implementation. If this proposal is successful, then it
might result in not maintaining two code bases (along with the bug
fixes and testing included) in the future. For example, if
ceph-ansible fixed a bug to correctly handle alternative system paths
to block devices, e.g. /dev/disk/by-path/ in lieu of /dev/sdb, then
the same bug would not need to be fixed in puppet-ceph. This detail
would also be nicely abstracted from a deployer because this spec
proposes maintaining parity with TripleO Heat Templates. Thus, the
deployer would not need to change the `ceph::profile::params::osds`
parameter as the same list of OSDs would work.

In taking this approach it's possible for there to be cases where
TripleO's deployment architecture may have unique features that don't
exist within ceph-ansible. In these cases, efforts may need to be
taken to ensure such features remain in parity with this approach.
In no way does this proposal enable a TripleO deployer to bypass
TripleO and use ceph-ansible directly. Also, because Ceph is not an
OpenStack service itself but a service that TripleO uses, this
approach remains consistent with the TripleO mission.


Consistency between OpenStack and non-OpenStack Ceph deployments
----------------------------------------------------------------

A deployer may seek assistance from the Ceph community with a Ceph
deployment and this process will be simplified if both deployments
were done using the same tool.

Enable Decoupling of Ceph management from TripleO
-------------------------------------------------

The complexity of Ceph management can be moved to a different tool
and abstracted, where appropriate, from TripleO, making the Ceph
management aspect of TripleO less complex. Combining this with
containerized Ceph would offer flexible deployment options. This
is a deployer benefit that is difficult to deliver today.

Features in the Ceph community's tools not in TripleO's tools
-------------------------------------------------------------

The Ceph community tool, ceph-ansible [1]_, offers benefits to
OpenStack users not found in TripleO's tool chain, including playbooks
to deploy Ceph in containers and migrate a non-containerized
deployment to a containerized deployment without downtime. Also,
making the Ceph deployment in TripleO less tightly coupled, by moving
it into a new Mistral workflow, would make it easier in a future
release to add a business logic layer through a tool like Tendrl [2]_,
to offer additional Ceph policy based configurations and possibly a
graphical tool to see the status of the Ceph cluster. However, the
scope of this proposal for Pike does not include Tendrl and instead
takes the first step towards deploying Ceph via a Mistral workflow by
triggering ceph-ansible directly. After the Pike cycle is complete,
triggering Tendrl may be considered in a future spec.

Proposed Change
===============

Overview
--------

The ceph-ansible [1]_ project provides a set of playbooks to deploy
and manage Ceph. A proof of concept [3]_ has been written which uses
two custom Mistral actions from the experimental
mistral-ansible-actions project [4]_ to have a Mistral workflow on the
undercloud trigger ceph-ansible to produce a working hyperconverged
overcloud.

The deployer experience to stand up Ceph with TripleO at the end of
this cycle should be the following:

#. The deployer chooses to deploy a role containing any of the
   Ceph server services: CephMon, CephOSD, CephRbdMirror, CephRgw,
   or CephMds.
#. The deployer provides the same Ceph parameters they provide today
   in a Heat env file, e.g. a list of OSDs.
#. The deployer starts the deploy and gets an overcloud with Ceph.

Thus, the deployment experience remains the same for the deployer but
behind the scenes a Mistral workflow is started which triggers
ceph-ansible. The details of the Mistral workflow to accomplish this
follow.
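
As a minimal sketch of the second step above, the Heat environment a deployer
passes with ``-e`` might keep the familiar puppet-ceph style parameters; the
registry path and device list below are assumptions for illustration, not the
final template layout::

    # Hypothetical environment file; paths and values are illustrative.
    resource_registry:
      OS::TripleO::Services::CephOSD: ../docker/services/ceph-ansible/ceph-osd.yaml
    parameter_defaults:
      ExtraConfig:
        ceph::profile::params::osds:
          '/dev/sdb': {}
          '/dev/sdc': {}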

TripleO Ceph Deployment via Mistral
-----------------------------------

TripleO's workflow to deploy a Ceph cluster would be changed so that
there are two ways to deploy a Ceph cluster: the way currently
supported by TripleO and the way described in this proposal.

The workflow described here assumes the following:

#. A deployer chooses to deploy Ceph server services from the
   following list of five services found in THT's roles_data.yaml:
   CephMon, CephOSD, CephRbdMirror, CephRgw, or CephMds.
#. The deployer chooses to include new Heat environment files which
   will be in THT when this spec is implemented. The new Heat
   environment file will change the implementation of any of the five
   services from the previous step. Using storage-environment.yaml,
   which defaults to Ceph deployed by puppet-ceph, will still trigger
   the Ceph deployment by puppet-ceph. However, if the new Heat
   environment files are included instead of storage-environment.yaml,
   then the implementation of the service will be done by ceph-ansible
   instead, which already configures these services for hosts under
   the following roles in the Ansible inventory: mons, osds, mdss,
   rgws, or rbdmirrors.
#. The undercloud has a directory called /usr/share/ceph-ansible
   which contains the ceph-ansible playbooks described in this spec.
   It will be present because the undercloud install will include the
   installation of the ceph-ansible package.
#. Mistral on the undercloud will contain two custom actions called
   `ansible` and `ansible-playbook` (or similar) and will also contain
   the workflow for each task below, which can be observed by running
   `openstack workflow list`. Assume this is the case because the
   tripleo-common package will be modified to ship these actions and
   they will be available after undercloud installation.
#. Heat will ship a new CustomResource type like
   OS::Mistral::WorkflowExecution [6]_, which will execute custom
   Mistral workflows.

The standard TripleO workflow, as executed by a deployer, will create
a custom Heat resource which starts an independent Mistral workflow to
interact with ceph-ansible. An example of such a Heat resource would be
OS::Mistral::WorkflowExecution [6]_.

Each independent Mistral workflow may be implemented directly in
tripleo-common/workbooks. A separate Mistral workbook will be created
for each goal described below:

* Initial deployment of OpenStack and Ceph
* Adding additional Ceph OSDs to existing OpenStack and Ceph clusters

The initial goal for the Pike cycle will be to maintain feature parity
with what is possible today in TripleO and puppet-ceph but with
containerized Ceph. Additional Mistral workflows may be written, time
permitting or in a future cycle, to add new features to TripleO's Ceph
deployment which leverage ceph-ansible playbooks to shrink the Ceph
cluster and safely remove an OSD, or to perform maintenance on the
cluster by using Ceph's 'noout' flag so that the maintenance does not
result in more data migration than necessary.
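
The parameter translation described in the deployment sequence below (e.g.
`ceph::profile::params::osds` becoming `devices`) could, as a rough sketch,
produce extra variables along these lines; the exact mapping and variable set
are assumptions, not a defined interface::

    # Hypothetical variables passed via `ansible-playbook --extra-vars`.
    devices:
      - /dev/sdb
      - /dev/sdc
    containerized_deployment: true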
|
||||
|
||||
Initial deployment of OpenStack and Ceph
|
||||
----------------------------------------
|
||||
|
||||
The sequence of events for this new Mistral workflow and Ceph-Ansible
|
||||
to be triggered during initial deployment with TripleO follows:
|
||||
|
||||
#. Define the Overcloud on the Undercloud in Heat. This includes the
|
||||
Heat parameters that are related to storage which will later be
|
||||
passed to ceph-ansible via a Mistral workflow.
|
||||
#. Run `openstack overcloud deploy` with standard Ceph options but
|
||||
including a new Heat environment file to make the implementation
|
||||
of the service deployment use ceph-ansible.
|
||||
#. The undercloud assembles and uploads the deployment plan to the
|
||||
undercloud Swift and Mistral environment.
|
||||
#. Mistral starts the workflow to deploy the Overcloud and interfaces
|
||||
with Heat accordingly.
|
||||
#. A point in the deployment is reached where the Overcloud nodes are
|
||||
imaged, booted, and networked. At that point the undercloud has
|
||||
access to the provisioning or management IPs of the Overcloud
|
||||
nodes.
|
||||
#. A new Heat Resource is created which starts a Mistral workflow to
|
||||
Deploy Ceph on the systems with the any of the five Ceph server
|
||||
services, including CephMon, CephOSD, CephRbdMirror, CephRgw, or
|
||||
CephMds [6]_.
|
||||
#. The servers which host Ceph services have their relevant firewall
|
||||
ports opened according to the needs of their service, e.g. the Ceph
|
||||
monitor firewalls are configured to accept connections on TCP
|
||||
port 6789. [7]_.
|
||||
#. The Heat resource is passed the same parameters normally found in
|
||||
the tripleo-heat-templates environments/storage-environment.yaml
|
||||
but instead through a new Heat environment file. Additional files
|
||||
may be passed to include overrides, e.g. the list of OSD disks.
|
||||
#. The Heat resource passes its parameters to the Mistral workflow as
|
||||
parameters. This will include information about which hosts should
|
||||
have which of the five Ceph server services.
|
||||
#. The Mistral workflow translates these parameters so that they match
|
||||
the parameters that ceph-ansible expects, e.g.
|
||||
ceph::profile::params::osds would become devices though they'd have
|
||||
the same content, which would be a list of block devices. The
|
||||
translation entails building an argument list that may be passed
|
||||
to the playbook by calling `ansible-playbook --extra-vars`.
|
||||
Typically ceph-ansible uses modified files in the group_vars
|
||||
directory but in this case, no files are modified and instead the
|
||||
parameters are passed programmatically. Thus, the playbooks in
|
||||
/usr/share/ceph-ansible may be run unaltered and that will be the
|
||||
default directory. However, it will be possible to pass an
|
||||
alternative location for the /usr/share/ceph-ansible playbook as
|
||||
an argument. No playbooks are run yet at this stage; an illustrative
invocation is sketched at the end of this section.
|
||||
#. The Mistral environment is updated to generate a new SSH key-pair
|
||||
for ceph-ansible and the Overcloud nodes using the same process
|
||||
that is used to create the SSH keys for TripleO validations and
|
||||
install the public key on Overcloud nodes. After this environment
|
||||
update it will be possible to run `mistral environment-get
|
||||
ssh_keys_ceph` on the undercloud and see the public and private
|
||||
keys in JSON.
|
||||
#. The Mistral Action Plugin `ansible-playbook` is called and passed
|
||||
the list of parameters as described earlier. The dynamic ansible
|
||||
inventory used by tripleo-validations is used with the `-i`
|
||||
option. In order for ceph-ansible to work as usual there must be
groups called `[mons]` and `[osds]` in the inventory, in addition to
optional groups for `[mdss]`, `[rgws]`, or `[rbdmirrors]`.
|
||||
Modifications to the tripleo-validations project's
|
||||
tripleo-ansible-inventory script may be made to support this, or a
|
||||
derivative work of the same as shipped by TripleO common. The SSH
|
||||
private key for the heat-admin user and the provisioning or
|
||||
management IPs of the Overcloud nodes are what Ansible will use.
|
||||
#. The Mistral workflow computes the number of forks in Ansible
|
||||
according to the number of machines that are going to be
|
||||
bootstrapped and will pass this number with `ansible-playbook
|
||||
--forks`.
|
||||
#. Mistral verifies that the Ansible ping module can execute `ansible
|
||||
$group -m ping` for any group in mons, osds, mdss, rgws, or
|
||||
rbdmirrors, that was requested by the deployer. For example, if the
|
||||
deployer only specified the CephMon and CephOSD service, then
|
||||
Mistral will only run `ansible mons -m ping` and `ansible osds -m
|
||||
ping`. The Ansible ping module will SSH into each host as the
|
||||
heat-admin user with key which was generated as described
|
||||
previously. If this fails, then the deployment fails.
|
||||
#. Mistral starts the Ceph install using the `ansible-playbook`
|
||||
action.
|
||||
#. The Mistral workflow creates a Zaqar queue to send progress
|
||||
information back to the client (CLI or web UI).
|
||||
#. The workflow posts messages to the "tripleo" Zaqar queue or the
|
||||
queue name provided to the original deploy workflow.
|
||||
#. If there is a problem during the status of the deploy may be seen
|
||||
by `openstack workflow execution list | grep ceph` and in the logs
|
||||
at /var/log/mistral/{engine.log,executor.log}. Running `openstack
|
||||
stack resource list` would show the custom Heat resource that
|
||||
started the Mistral workflow, but `openstack workflow execution
|
||||
list` and `openstack workflow task list` would contain more details
|
||||
about what steps completed within the Mistral workflow.
|
||||
#. The Ceph deployment is done in containers in a way which must
|
||||
prevent any configuration file conflict for any composed service,
|
||||
e.g. if a Nova compute container (as deployed by TripleO) and a
|
||||
Ceph OSD container are on the same node, then they must have
|
||||
different ceph.conf files, even if those files have the same
|
||||
content. Though ceph-ansible will manage ceph.conf for Ceph
services and puppet-ceph will still manage ceph.conf for OpenStack
services, neither tool will try to manage the same ceph.conf file
|
||||
because it will be in a different location on the container host
|
||||
and bind mounted to /etc/ceph/ceph.conf within different
|
||||
containers.
|
||||
#. After the Mistral workflow is completed successfully, the custom
|
||||
Heat resource is considered successfully created. If the Mistral
|
||||
workflow does not complete successfully, then the Heat resource
|
||||
is not considered successfully created. TripleO should handle this
|
||||
the same way that it handles any Heat resource that failed to be
|
||||
created. For example, because the workflow is idempotent, if the
|
||||
resource creation fails because the wrong parameter was passed or
|
||||
because of a temporary network issue, the deployer could simply run
|
||||
a stack-update; the Mistral workflow would run again and if the
|
||||
issues which caused the first run to fail were resolved, the
|
||||
deployment should succeed. Similarly if a user updates a parameter,
|
||||
e.g. a new disk is added to `ceph::profile::params::osds`, then the
|
||||
workflow will run again without breaking the state of the running
|
||||
Ceph cluster but it will configure the new disk.
|
||||
#. After the dependency of the previous step is satisfied, the TripleO
|
||||
Ceph external Heat resource is created to configure the appropriate
|
||||
Overcloud nodes as Ceph clients.
|
||||
#. For the CephRGW service, hieradata will be emitted so that it may
|
||||
be used for the haproxy listener setup and keystone users setup.
|
||||
#. The Overcloud deployment continues as if it was using an external
|
||||
Ceph cluster.
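
The exact invocation is not defined by this spec; the following is a
minimal sketch of the inventory groups and the `ansible-playbook` call
described in the steps above. The host addresses, device list, key
location, playbook path, and fork count are illustrative assumptions::

    # Hypothetical static view of the dynamic inventory passed with -i
    [mons]
    192.0.2.10 ansible_user=heat-admin

    [osds]
    192.0.2.20 ansible_user=heat-admin
    192.0.2.21 ansible_user=heat-admin

    # Hypothetical command built by the ansible-playbook Mistral action
    ansible-playbook -i tripleo-ansible-inventory --forks 2 \
      --private-key ~mistral/.ssh/ceph-ansible \
      --extra-vars '{"devices": ["/dev/sdb", "/dev/sdc"]}' \
      /usr/share/ceph-ansible/site-docker.yml.sample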
|
||||
|
||||
Adding additional Ceph OSD Nodes to existing OpenStack and Ceph clusters
|
||||
------------------------------------------------------------------------
|
||||
|
||||
The process to add an additional Ceph OSD node is similar to the
|
||||
process to deploy the OSDs along with the Overcloud:
|
||||
|
||||
#. Introspect the new hardware to host the OSDs.
|
||||
#. In the Heat environment file containing the node counts, increment
the CephStorageCount (see the example after this list).
|
||||
#. Run `openstack overcloud deploy` with standard Ceph options and the
|
||||
environment file which specifies the implementation of the Ceph
|
||||
deployment via ceph-ansible.
|
||||
#. The undercloud updates the deployment plan.
|
||||
#. Mistral starts the workflow to update the Overcloud and interfaces
|
||||
with Heat accordingly.
|
||||
#. A point in the deployment is reached where the new Overcloud nodes
|
||||
are imaged, booted, and networked. At that point the undercloud has
|
||||
access to the provisioning or management IPs of the Overcloud
|
||||
nodes.
|
||||
#. A new Heat Resource is created which starts a Mistral workflow to
|
||||
add new Ceph OSDs.
|
||||
#. TCP ports 6800:7300 are opened on the OSD host [7]_.
|
||||
#. The Mistral environment already has an SSH key-pair as described in
|
||||
the initial deployment scenario. The same process that is used to
|
||||
install the public SSH key on Overcloud nodes for TripleO
|
||||
validations is used to install the SSH keys for ceph-ansible.
|
||||
#. If necessary, the Mistral workflow updates the number of forks in
|
||||
Ansible according to the new number of machines that are going to
|
||||
be bootstrapped.
|
||||
#. The dynamic Ansible inventory will contain the new node.
|
||||
#. Mistral confirms that Ansible can execute `ansible osds -m ping`.
|
||||
This causes Ansible to SSH as the heat-admin user into all of the
|
||||
CephOsdAnsible nodes, including the new nodes. If this fails, then
|
||||
the update fails.
|
||||
#. Mistral uses the Ceph variables found in Heat as described in the
|
||||
initial deployment scenario.
|
||||
#. Mistral runs the osd-configure.yaml playbook from ceph-ansible to
|
||||
add the extra Ceph OSD server.
|
||||
#. The OSDs on the server are each deployed in their own containers
|
||||
and `docker ps` will list each OSD container.
|
||||
#. After the Mistral workflow is completed, the Custom Heat resource
|
||||
is considered to be updated.
|
||||
#. No changes are necessary for the TripleO Ceph external Heat
|
||||
resource since the Overcloud Ceph clients only need information
|
||||
about new OSDs from the Ceph monitors.
|
||||
#. The Overcloud deployment continues as if it was using an external
|
||||
Ceph cluster.
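
The count increment in step 2 above is an ordinary Heat environment
entry; a minimal sketch (the count value is illustrative)::

    parameter_defaults:
      CephStorageCount: 4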
|
||||
|
||||
Containerization of configuration files
|
||||
---------------------------------------
|
||||
|
||||
As described in the Containerize TripleO spec, configuration files
|
||||
for the containerized service will be generated by Puppet and then
|
||||
passed to the containerized service using a configuration volume [8]_.
|
||||
A similar containerization feature is already supported by
|
||||
ceph-ansible, which uses the following sequence to generate the
|
||||
ceph.conf configuration file.
|
||||
|
||||
* Ansible generates a ceph.conf on a monitor node
|
||||
* Ansible runs the monitor container and bindmount /etc/ceph
|
||||
* No modification is being done in the ceph.conf
|
||||
* Ansible copies the ceph.conf to the Ansible server
|
||||
* Ansible copies the ceph.conf and keys to the appropriate machine
|
||||
* Ansible runs the OSD container and bindmount /etc/ceph
|
||||
* No modification is being done in the ceph.conf
|
||||
|
||||
These similar processes are compatible, even in the case of container
|
||||
hosts which run more than one OpenStack service but which each need
|
||||
their own copy of the configuration file per container. For example,
|
||||
consider a containerization node which hosts both Nova compute and Ceph
|
||||
OSD services. In this scenario, the Nova compute service would be a
|
||||
Ceph client and puppet-ceph would generate its ceph.conf and the Ceph
|
||||
OSD service would be a Ceph server and ceph-ansible would generate its
|
||||
ceph.conf. It is necessary for Puppet to configure the Ceph client
|
||||
because Puppet configures the other OpenStack related configuration
|
||||
files as is already provided by TripleO. Both generated ceph.conf
|
||||
files would need to be stored in a separate directory on the
|
||||
containerization hosts to avoid conflicts and the directories could be
|
||||
mapped to specific containers. For example, host0 could have the
|
||||
following versions of foo.conf for two different containers::
|
||||
|
||||
host0:/container1/etc/foo.conf <--- generated by conf tool 1
|
||||
host0:/container2/etc/foo.conf <--- generated by conf tool 2
|
||||
|
||||
When each container is started on the host, the different
|
||||
configuration files could then be mapped to the different containers::
|
||||
|
||||
docker run container1 ... /container1/etc/foo.conf:/etc/foo.conf
docker run container2 ... /container2/etc/foo.conf:/etc/foo.conf
|
||||
|
||||
In the above scenario, it is necessary for both configuration files
|
||||
to be generated from the same parameters. I.e. both Puppet and Ansible
|
||||
will use the same values from the Heat environment file, but will
|
||||
generate the configuration files differently. After the configuration
|
||||
programs have run it won't matter that Puppet idempotently updated
|
||||
lines of the ceph.conf and that Ansible used a Jinja2 template. What
|
||||
will matter is that both configuration files have the same value,
|
||||
e.g. the same FSID.
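
As a minimal illustration, assuming the hypothetical per-container
directories above and a placeholder FSID, the two independently
generated files would agree on their values::

    # /container1/etc/ceph.conf -- rendered by puppet-ceph for the Nova client
    [global]
    fsid = 4b5c8c0a-ff60-454b-a1b4-9747aa737d19

    # /container2/etc/ceph.conf -- rendered by ceph-ansible for the Ceph OSD
    [global]
    fsid = 4b5c8c0a-ff60-454b-a1b4-9747aa737d19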
|
||||
|
||||
Configuration files generated as described in the Containerize TripleO
spec are not stored in the container host's /etc directory before
being passed to the container guest with a bind mount. By default,
ceph-ansible generates the initial ceph.conf
|
||||
on the container host's /etc directory before it uses a bind mount to
|
||||
pass it through to the container. In order to be consistent with the
|
||||
Containerize TripleO spec, ceph-ansible will get a new feature for
|
||||
deploying Ceph in containers so that it will not generate the
|
||||
ceph.conf on the container host's /etc directory. The same option will
|
||||
need to apply when generating Ceph key rings; which will be stored in
|
||||
/etc/ceph in the container, but not on the container host.
|
||||
|
||||
Because Mistral on the undercloud runs the ansible playbooks, the
|
||||
user "mistral" on the undercloud will be the one that SSH's into the
|
||||
overcloud nodes to run ansible playbooks. Care will need to be taken
|
||||
to ensure that user doesn't make changes which are out of scope.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
From a high level, this proposal is an alternative to the current
|
||||
method of deploying Ceph with TripleO and offers the benefits listed
|
||||
in the problem description.
|
||||
|
||||
From a lower level, how this proposal is implemented as described in
|
||||
the Workflow section should be considered.
|
||||
|
||||
#. In a split-stack scenario, after the hardware has been provisioned
|
||||
by the first Heat stack and before the configuration Heat stack is
|
||||
created, a Mistral workflow like the one in the POC [3]_ could be
|
||||
run to configured Ceph on the Ceph nodes. This scenario would be
|
||||
more similar to the one where TripleO is deployed using the TripleO
|
||||
Heat Templates environment file puppet-ceph-external.yaml. This
|
||||
could be an alternative to a new OS::Mistral::WorflowExecution Heat
|
||||
resource [6]_.
|
||||
#. Trigger the ceph-ansible deployment before the OpenStack deployment
|
||||
In the initial workflow section, it is proposed that "A new
|
||||
Heat Resource is created which starts a Mistral workflow to Deploy
|
||||
Ceph". This may be difficult because, in general, composable services
|
||||
currently define snippets of puppet data which is then later combined
|
||||
to define the deployment steps, and there is not yet a way to support
|
||||
running an arbitrary Mistral workflow at a given step of a deployment.
|
||||
Thus, the Mistral workflow could be started first and then it could
|
||||
wait for what is described in step 6 of the overview section.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
* A new SSH key pair will be created on the undercloud and will be
|
||||
accessible in the Mistral environment via a command like
|
||||
`mistral environment-get ssh_keys_ceph`. The public key of this
|
||||
pair will be installed in the heat-admin user's authorized_keys
|
||||
file on all Overcloud nodes which will be Ceph Monitors or OSDs.
|
||||
This process will follow the same pattern used to create the SSH
|
||||
keys used for TripleO validations so nothing new would happen in
|
||||
that respect; just another instance of the same type of process.
|
||||
* An additional tool would do configuration on the Overcloud, though
|
||||
the impact of this should be isolated via Containers.
|
||||
* Regardless of how Ceph services are configured, they require changes
|
||||
to the firewall. This spec will implement parity in firewalling for
|
||||
Ceph services [7]_.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The following applies to the undercloud:
|
||||
|
||||
* Mistral will need to run an additional workflow
|
||||
* Heat's role in deploying Ceph would be lessened so the Heat stack
|
||||
would be smaller.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
Ceph will be deployed using a method that is proven but whose
|
||||
integration is new to TripleO.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
fultonj
|
||||
|
||||
Other contributors:
|
||||
gfidente
|
||||
leseb
|
||||
colonwq
|
||||
d0ugal (to review Mistral workflows/actions)
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Prototype a Mistral workflow to independently install Ceph on
|
||||
Overcloud nodes [3]_. [done]
|
||||
* Prototype a Heat Resource to start an independent Mistral Workflow
|
||||
[6]_. [done]
|
||||
* Expand mistral-ansible-actions with necessary options (fultonj)
|
||||
* Parameterize the Mistral workflow (fultonj)
|
||||
* Update and have merged Heat CustomResource [6]_ (gfidente)
|
||||
* Have ceph-ansible create openstack pools and keys for containerized
|
||||
deployments: https://github.com/ceph/ceph-ansible/issues/1321 (leseb)
|
||||
* Get ceph-ansible packaged on ceph.com and pushed to the CentOS CBS
|
||||
(fultonj / leseb)
|
||||
* Make undercloud install produce /usr/share/ceph-ansible by modifying
|
||||
RDO's instack RPM's spec file to add a dependency (fultonj)
|
||||
* Submit mistral workflow and ansible-mistral-actions to
|
||||
tripleo-common (fultonj)
|
||||
* Prototype new service plugin interface that defines per-service
|
||||
workflows (gfidente / shardy / fultonj)
|
||||
* Submit new services into tht/roles_data.yaml so users can use it.
|
||||
This should include a change to the tripleo-heat-templates
|
||||
ci/environments/scenario001-multinode.yaml to include the new
|
||||
service, e.g. CephMonAnsible so that CI is tested. This may not
|
||||
work unless it all co-exists in a single overcloud deploy.
|
||||
If it works, we use it to get started. The initial plan is for
|
||||
scenario004 to keep using puppet-ceph.
|
||||
* Implement the Ceph cluster deletion scenario
* Implement the scenario for adding additional Ceph OSDs to existing
OpenStack and Ceph clusters
* Implement the Ceph OSD node removal scenario
* Implement the Ceph OSD node maintenance scenario (optional)
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Containerization of the Ceph services provided by ceph-ansible is
|
||||
used to ensure the configuration tools aren't competing. This
|
||||
will need to be compatible with the Containerize TripleO spec
|
||||
[9]_.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
A change to tripleo-heat-templates' scenario001-multinode.yaml will be
|
||||
submitted which includes deployment of the new services CephMonAnsible
|
||||
and CephOsdAnsible (note that these role names will be changed when
|
||||
fully working). This testing scenario may not work unless all of the
|
||||
services may co-exist; however, preliminary testing indicates that
|
||||
this will work. Initially scenario004 will not be modified and will be
|
||||
kept using puppet-ceph. We may start by changing ovb-nonha scenario
|
||||
first as we believe this may be faster. When the CI moves to
|
||||
tripleo-quickstart happens and there is a containers only scenario we
|
||||
will want to add a hyperconverged containerized deployment too.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
A new TripleO Backend Configuration document "Deploying Ceph with
|
||||
ceph-ansible" would be required.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] `ceph-ansible <https://github.com/ceph/ceph-ansible>`_
|
||||
.. [2] `Tendrl <https://github.com/Tendrl/documentation>`_
|
||||
.. [3] `POC tripleo-ceph-ansible <https://github.com/fultonj/tripleo-ceph-ansible>`_
|
||||
.. [4] `Experimental mistral-ansible-actions project <https://github.com/d0ugal/mistral-ansible-actions>`_
|
||||
.. [6] `Proposed new Heat resource OS::Mistral::WorflowExecution <https://review.openstack.org/#/c/420664>`_
|
||||
.. [7] `These firewall changes must be managed in a way that does not conflict with TripleO's mechanism for managing host firewall rules and should be done before the Ceph servers are deployed. We are working on a solution to this problem.`
|
||||
.. [8] `Configuration files generated by Puppet and passed to a containerized service via a config volume <https://review.openstack.org/#/c/416421/29/docker/docker-puppet.py>`_
|
||||
.. [9] `Spec to Containerize TripleO <https://review.openstack.org/#/c/223182>`_
|
@ -1,440 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
===========================
|
||||
Deriving TripleO Parameters
|
||||
===========================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-derive-parameters
|
||||
|
||||
This specification proposes a generic interface for automatically
|
||||
populating environment files with parameters which were derived from
|
||||
formulas; where the formula's input came from introspected hardware
|
||||
data, workload type, and deployment type. It also provides specific
|
||||
examples of how this interface may be used to improve deployment of
|
||||
overclouds to be used in DPDK or HCI usecases. Finally, it proposes
|
||||
how this generic interface may be shared and extended by operators
|
||||
who optionally chose to have certain parameters prescribed so that
|
||||
future systems tuning expertise may be integrated into TripleO.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Operators must populate parameters for a deployment which may be
|
||||
specific to hardware and deployment type. The hardware information
|
||||
of a node is available to the operator once the introspection of the
|
||||
node is completed. However, the current process requires that the
|
||||
operator manually read the introspected data, make decisions based on
|
||||
that data and then update the parameters in an environment file. This
|
||||
makes deployment preparation unnecessarily complex.
|
||||
|
||||
For example, when deploying for DPDK, the operator must provide the
|
||||
list of CPUs which should be assigned to the DPDK Poll Mode Driver
|
||||
(PMD) and the CPUs should be provided from the same NUMA node on which
|
||||
the DPDK interface is present. In order to provide the correct
|
||||
parameters, the operator must cross check all of these details.
|
||||
|
||||
Another example is the deployment of HCI overclouds, which run both
|
||||
Nova compute and Ceph OSD services on the same nodes. In order to
|
||||
prevent contention between compute and storage services, the operator
|
||||
may manually apply formulas, provided by performance tuning experts,
|
||||
which take into account available hardware, type of workload, and type
|
||||
of deployment, and then after computing the appropriate parameters
|
||||
based on those formulas, manually store them in environment files.
|
||||
|
||||
In addition to the complexity of the DPDK or HCI usecase, knowing the
|
||||
process to assign CPUs to the DPDK Poll Mode Driver or isolate compute
|
||||
and storage resources for HCI is, in itself, another problem. Rather
|
||||
than document the process and expect operators to follow it, the
|
||||
process should be captured in a high level language with a generic
|
||||
interface so that performance tuning experts may easily share new
|
||||
similar processes for other use cases with operators.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
This spec aims to make three changes to TripleO outlined below.
|
||||
|
||||
Mistral Workflows to Derive Parameters
|
||||
--------------------------------------
|
||||
|
||||
A group of Mistral workflows will be added for the features which are
|
||||
complex to determine the deployment parameters. Features like DPDK,
|
||||
SR-IOV and HCI require input from the introspection data to be
|
||||
analyzed to compute the deployment parameters. This derive parameters
|
||||
workflow will provide a default set of computational formulas by
|
||||
analyzing the introspected data. Thus, there will be a hard dependency
|
||||
with node introspection for this workflow to be successful.
|
||||
|
||||
During the first iterations, all the roles in a deployment will be
|
||||
analyzed to find a service associated with the role, which requires
|
||||
parameter derivation. Various options for doing this, and the final
choice for the current iteration, are discussed in the section
`Workflow Association with Services`_ below.
|
||||
|
||||
This workflow assumes that all the nodes in a role have a homogeneous
|
||||
hardware specification and introspection data of the first node will
|
||||
be used for processing the parameters for the entire role. This will
|
||||
be reexamined in later iterations, based on the need for node specific
|
||||
derivations. The workflow will consider the flavor-profile association
|
||||
and nova placement scheduler to identify the nodes associated with a
|
||||
role.
|
||||
|
||||
Role-specific parameters are an important requirement for this workflow.
|
||||
If there are multiple roles with the same service (feature) enabled,
|
||||
the parameters which are derived from this workflow will be applied
|
||||
only on the corresponding role.
|
||||
|
||||
The input sources for these workflows are the ironic database and ironic
|
||||
introspection data stored in Swift, in addition to the Deployment plan stored
|
||||
in Swift. Computations done to derive the parameters within the Mistral
|
||||
workflow will be implemented in YAQL. These computations will be a separate
|
||||
workflow on a per-feature basis so that the formulas can be customized. If
operators need to modify the default formulas, they only have to update this
workflow with their customized formula.
|
||||
|
||||
|
||||
Applying Derived Parameters to the Overcloud
|
||||
--------------------------------------------
|
||||
|
||||
In order for the resulting parameters to be applied to the overcloud,
|
||||
the deployment plan, which is stored in Swift on the undercloud,
|
||||
will be modified with the Mistral `tripleo.parameters.update` action
|
||||
or similar.
|
||||
|
||||
The methods for providing input for derivation and the update of
|
||||
parameters which are derivation output should be consistent with the
|
||||
Deployment Plan Management specification [1]_. The implementation of
|
||||
this spec with respect to the interfaces to set and get parameters may
|
||||
change as it is updated. However, the basic workflow should remain the
|
||||
same.
|
||||
|
||||
Trigger Mistral Workflows with TripleO
|
||||
--------------------------------------
|
||||
|
||||
Assuming that workflows are in place to derive parameters and update the
|
||||
deployment plan as described in the previous two sections, an operator may
|
||||
take advantage of this optional feature by enabling it via
``plan-environment.yaml``. A new section ``workflow_parameters`` will be added
to the ``plan-environment.yaml`` file to accommodate the additional parameters
required for executing workflows. With this additional section, we can ensure
that the workflow-specific parameters are provided only to the workflow,
|
||||
without polluting the heat environments. It will also be possible to provide
|
||||
multiple plan environment files which will be merged in the CLI before plan
|
||||
creation.
|
||||
|
||||
These additional parameters will be read by the derive params workflow
|
||||
directly from the merged ``plan-environment.yaml`` file stored in Swift.
|
||||
|
||||
It is possible to modify the created plan or modify the profile-node
|
||||
association, after the derive parameters workflow execution. For
now, we assume that no such alterations are done, but after the initial
iteration this will be extended to fail the deployment with
some validations.
|
||||
|
||||
An operator should be able to derive and view parameters without doing a
|
||||
deployment; e.g. "generate deployment plan". If the calculation is done as
|
||||
part of the plan creation, it would be possible to preview the calculated
|
||||
values. Alternatively the workflow could be run independently of the overcloud
|
||||
deployment, but how that will fit with the UI workflow needs to be determined.
|
||||
|
||||
Usecase 1: Derivation of DPDK Parameters
|
||||
========================================
|
||||
|
||||
A part of the Mistral workflow which uses YAQL to derive DPDK
|
||||
parameters based on introspection data, including NUMA [2]_, exists
|
||||
and may be seen on GitHub [3]_.
|
||||
|
||||
Usecase 2: Derivation Profiles for HCI
|
||||
======================================
|
||||
|
||||
This usecase uses HCI, running Ceph OSD and Nova Compute on the same node. HCI
|
||||
derive parameters workflow works with a default set of configs to categorize
|
||||
the type of the workload that the role will host. An option will be provided to
override the default configs with deployment-specific configs via
``plan-environment.yaml``.
|
||||
|
||||
In case of HCI deployment, the additional plan environment used for the
|
||||
deployment will look like::
|
||||
|
||||
workflow_parameters:
|
||||
tripleo.workflows.v1.derive_parameters:
|
||||
# HCI Derive Parameters
|
||||
HciProfile: nfv_default
|
||||
HciProfileConfig:
|
||||
default:
|
||||
average_guest_memory_size_in_mb: 2048
|
||||
average_guest_CPU_utilization_percentage: 50
|
||||
many_small_vms:
|
||||
average_guest_memory_size_in_mb: 1024
|
||||
average_guest_CPU_utilization_percentage: 20
|
||||
few_large_vms:
|
||||
average_guest_memory_size_in_mb: 4096
|
||||
average_guest_CPU_utilization_percentage: 80
|
||||
nfv_default:
|
||||
average_guest_memory_size_in_mb: 8192
|
||||
average_guest_CPU_utilization_percentage: 90
|
||||
|
||||
In the above example, the section ``workflow_parameters`` is used to provide
|
||||
input parameters for the workflow in order to isolate Nova and Ceph
|
||||
resources while maximizing performance for different types of guest
|
||||
workloads. An example of the derivation done with these inputs is
|
||||
provided in nova_mem_cpu_calc.py on GitHub [4]_.
|
||||
|
||||
|
||||
Other Integration of Parameter Derivation with TripleO
|
||||
======================================================
|
||||
|
||||
Users may still override parameters
|
||||
-----------------------------------
|
||||
|
||||
If a workflow derives a parameter, e.g. cpu_allocation_ratio, but the
|
||||
operator specified a cpu_allocation_ratio in their overcloud deploy,
|
||||
then the operator provided value is given priority over the derived
|
||||
value. This may be useful in a case where an operator wants all of the
|
||||
values that were derived but just wants to override a subset of those
|
||||
parameters.
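
For example, an operator could keep every derived value but pin just the
CPU allocation ratio with an ordinary environment file; a minimal sketch
(the value is illustrative)::

    parameter_defaults:
      ExtraConfig:
        nova::cpu_allocation_ratio: 6.5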
|
||||
|
||||
Handling Cross Dependency Resources
|
||||
-----------------------------------
|
||||
|
||||
It is possible that multiple workflows will end up deriving parameters based
|
||||
on the same resource (like CPUs). When this happens, it is important to have a
|
||||
specific order for the workflows to be run considering the priority.
|
||||
|
||||
For example, let us consider the resource CPUs and how it should be used
|
||||
between DPDK and HCI. DPDK requires a set of dedicated CPUs for Poll Mode
|
||||
Drivers (NeutronDpdkCoreList), which should not be used for host process
|
||||
(ComputeHostCpusList) and guest VM's (NovaVcpuPinSet). HCI requires the CPU
|
||||
allocation ratio to be derived based on the number of CPUs that are available
|
||||
for guest VMs (NovaVcpuPinSet). Priority is given to DPDK, followed by HOST
|
||||
parameters and then HCI parameters. In this case, the workflow execution
|
||||
starts with a pool of CPUs, then:
|
||||
|
||||
* DPDK: Allocate NeutronDpdkCoreList
|
||||
* HOST: Allocate ComputeHostCpusList
|
||||
* HOST: Allocate NovaVcpuPinSet
|
||||
* HCI: Fix the cpu allocation ratio based on NovaVcpuPinSet
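
A sketch of how the shared pool might be partitioned by that ordering,
assuming an illustrative host with 24 pCPUs (all values are examples,
not recommendations)::

    NeutronDpdkCoreList: "2,4,6,8"   # DPDK: PMD cores, allocated first
    ComputeHostCpusList: "0,1"       # HOST: housekeeping cores
    NovaVcpuPinSet: "3,5,7,9-23"     # HOST: remaining cores for guest vCPUs
    # HCI: cpu_allocation_ratio is then derived from the 18 CPUs in NovaVcpuPinSet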
|
||||
|
||||
Derived parameters for specific services or roles
|
||||
-------------------------------------------------
|
||||
|
||||
If an operator only wants to configure Enhanced Placement Awareness (EPA)
|
||||
features like CPU pinning or huge pages, which are not associated with any
|
||||
feature like DPDK or HCI, then it should be associated with just the compute
|
||||
service.
|
||||
|
||||
Workflow Association with Services
|
||||
----------------------------------
|
||||
|
||||
The optimal way to associate the derived parameter workflows with
|
||||
services, is to get the list of the enabled services on a given role,
|
||||
by previewing the Heat stack. With the current limitations in Heat, it is
not possible to fetch the enabled services list on a role. Thus, a new
|
||||
parameter will be introduced on the service which is associated with a
|
||||
derive parameters workflow. If this parameter is referenced in the
|
||||
heat resource tree, on a specific role, then the corresponding derive
|
||||
parameter workflow will be invoked. For example, the DPDK service will
|
||||
have a new parameter "EnableDpdkDerivation" to enable the DPDK
|
||||
specific workflows.
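
A sketch of how such a flag might appear in the DPDK service template;
only the parameter name comes from this spec, the surrounding structure
is illustrative::

    parameters:
      EnableDpdkDerivation:
        type: boolean
        default: false
        description: Run the DPDK derive parameters workflow for this role.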
|
||||
|
||||
Future integration with TripleO UI
|
||||
----------------------------------
|
||||
|
||||
If this spec were implemented and merged, then the TripleO UI could
|
||||
have a menu item for a deployment, e.g. HCI, in which the deployer may
|
||||
choose a derivation profile and then deploy an overcloud with that
|
||||
derivation profile.
|
||||
|
||||
The UI could better integrate with this feature by allowing a deployer
|
||||
to use a graphical slider to vary an existing derivation profile and
|
||||
then save that derivation profile with a new name. The following
|
||||
cycle could be used by the deployer to tune the overcloud.
|
||||
|
||||
* Choose a deployment, e.g. HCI
|
||||
* Choose an HCI profile, e.g. many_small_vms
|
||||
* Run the deployment
|
||||
* Benchmark the planned workload on the deployed overcloud
|
||||
* Use the sliders to change aspects of the derivation profile
|
||||
* Update the deployment and re-run the benchmark
|
||||
* Repeat as needed
|
||||
* Save the new derivation profile as the one to be deployed in the field
|
||||
|
||||
The implementation of this spec would enable the TripleO UI to support
|
||||
the above.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The simplest alternative is for operators to determine what tunings
|
||||
are appropriate by testing or reading documentation and then implement
|
||||
those tunings in the appropriate Heat environment files. For example,
|
||||
in an HCI scenario, an operator could run nova_mem_cpu_calc.py [4]_
|
||||
and then create a Heat environment file like the following with its
|
||||
output and then deploy the overcloud and directly reference this
|
||||
file::
|
||||
|
||||
parameter_defaults:
|
||||
ExtraConfig:
|
||||
nova::compute::reserved_host_memory: 75000
|
||||
nova::cpu_allocation_ratio: 8.2
|
||||
|
||||
This could translate into a variety of overrides which would require
|
||||
initiative on the operator's part.
|
||||
|
||||
Another alternative is to write separate tools which generate the
|
||||
desired Heat templates but don't integrate them with TripleO. For
|
||||
example, nova_mem_cpu_calc.py and similar tools would produce a set of Heat
environment files as output, which the operator would then include,
instead of output containing the following:
|
||||
|
||||
* nova.conf reserved_host_memory_mb = 75000 MB
|
||||
* nova.conf cpu_allocation_ratio = 8.214286
|
||||
|
||||
When evaluating the above, keep in mind that only two parameters for
|
||||
CPU allocation and memory are being provided as an example, but that
|
||||
a tuned deployment may contain more.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
There is no security impact from this change as it sits at a higher
|
||||
level to automate, via Mistral and Heat, features that already exist.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Operators need not manually derive the deployment parameters based on the
|
||||
introspection or hardware specification data, as it is automatically derived
|
||||
with pre-defined formulas.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The deployment and update of an overcloud may take slightly longer if
|
||||
an operator uses this feature because an additional Mistral workflow
|
||||
needs to run to perform some analytics before applying configuration
|
||||
updates. However, the performance of the overcloud would be improved
|
||||
because this proposal aims to make it easier to tune the overcloud for
|
||||
performance.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
A new configuration option is being added, but it has to be explicitly
|
||||
enabled, and thus it would not take immediate effect after it is merged.
|
||||
Though, if a deployer chooses to use it and there is a bug in it, then
|
||||
it could affect the overcloud deployment. If a deployer uses this new
|
||||
option, and had a deploy in which they set a parameter directly,
|
||||
e.g. the Nova cpu_allocation_ratio, then that parameter may be
|
||||
overridden by a particular tuning profile. So that is something a
|
||||
deployer should be aware of when using this proposed feature.
|
||||
|
||||
The config options being added will ship with a variety of defaults
|
||||
based on deployments put under load in a lab. The main idea is to make
|
||||
different sets of defaults, which were produced under these
|
||||
conditions, available. The example discussed in this proposal and to
|
||||
be made available on completion could be extended.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
This spec proposes modifying the deployment plan which, if there was a
|
||||
bug, could introduce problems into a deployment. However, because the
|
||||
new feature is completely optional, a developer could easily disable
|
||||
it.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignees:
|
||||
skramaja
|
||||
fultonj
|
||||
|
||||
Other contributors:
|
||||
jpalanis
|
||||
abishop
|
||||
shardy
|
||||
gfidente
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Derive Params start workflow to find list of roles
|
||||
* Workflow run for each role to fetch the introspection data and trigger
|
||||
individual features workflow
|
||||
* Workflow to identify if a service associated with a features workflow is
|
||||
enabled in a role
|
||||
* DPDK Workflow: Analysis and concluding the format of the input data (jpalanis)
|
||||
* DPDK Workflow: Parameter deriving workflow (jpalanis)
|
||||
* HCI Workflow: Run a workflow that calculates the parameters (abishop)
|
||||
* SR-IOV Workflow
|
||||
* EPA Features Workflow
|
||||
* Run the derive params workflow from CLI
|
||||
* Add a CI scenario testing whether the workflow produced the expected output
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* NUMA Topology in introspection data (ironic-python-agent) [5]_
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Create a new scenario in the TripleO CI in which a deployment is done
|
||||
using all of the available options within a derivation profile called
|
||||
all-derivation-options. A CI test would need to be added that would
|
||||
test this new feature by doing the following:
|
||||
|
||||
* A deployment would be done with the all-derivation-options profile
|
||||
* The deployment would be checked that all of the configurations had been made
|
||||
* If the configuration changes are in place, then the test passed
|
||||
* Else the test failed
|
||||
|
||||
Relating the above to the HCI usecase, the test could verify one of
|
||||
two options:
|
||||
|
||||
1. A Heat environment file created with the following syntactically
|
||||
valid Heat::
|
||||
|
||||
parameter_defaults:
|
||||
ExtraConfig:
|
||||
nova::compute::reserved_host_memory: 75000
|
||||
nova::cpu_allocation_ratio: 8.2
|
||||
|
||||
2. The compute node was deployed such that the commands below return
|
||||
something like the following::
|
||||
|
||||
[root@overcloud-osd-compute-0 ~]# grep reserved_host_memory /etc/nova/nova.conf
|
||||
reserved_host_memory_mb=75000
|
||||
[root@overcloud-osd-compute-0 ~]# grep cpu_allocation_ratio /etc/nova/nova.conf
|
||||
cpu_allocation_ratio=8.2
|
||||
[root@overcloud-osd-compute-0 ~]#
|
||||
|
||||
Option 1 would put less load on the CI infrastructure and produce a
|
||||
faster test but Option 2 tests the full scenario.
|
||||
|
||||
If a new derived parameter option is added, then the all-derivation-options
|
||||
profile would need to be updated and the test would need to be updated
|
||||
to verify that the new options were set.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
A new chapter would be added to the TripleO document on deploying with
|
||||
derivation profiles.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] `Deployment Plan Management specification <https://review.openstack.org/#/c/438918>`_
|
||||
.. [2] `Spec for Ironic to retrieve NUMA node info <https://review.openstack.org/#/c/396147>`_
|
||||
.. [3] `<https://github.com/Jaganathancse/Jagan/tree/master/mistral-workflow>`_
|
||||
.. [4] `nova_mem_cpu_calc.py <https://github.com/RHsyseng/hci/blob/master/scripts/nova_mem_cpu_calc.py>`_
|
||||
.. [5] `NUMA Topology in introspection data (ironic-python-agent) <https://review.openstack.org/#/c/424729/>`_
|
@ -1,235 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
======================================
|
||||
Add real-time compute nodes to TripleO
|
||||
======================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-realtime
|
||||
|
||||
Real-time guest VMs require compute nodes with a specific configuration to
|
||||
control the sources of latency spikes.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Manual configuration of compute nodes to support real-time guests is possible.
|
||||
However this is complex and time consuming where there is a large number of
|
||||
compute nodes to configure.
|
||||
|
||||
On a real-time compute node a subset of the available physical CPUs (pCPUs) are
|
||||
isolated and dedicated to real-time tasks. The remaining pCPUs are dedicated to
|
||||
general housekeeping tasks. This requires a real-time Linux Kernel and real-time
|
||||
KVM that allow their housekeeping tasks to be isolated. The real-time and
|
||||
housekeeping pCPUs typically reside on different NUMA nodes.
|
||||
|
||||
Huge pages are also reserved for guest VMs to prevent page faults, either via
|
||||
the kernel command line or via sysfs. Sysfs is preferable as it allows the
|
||||
reservation on each individual NUMA node to be set.
|
||||
|
||||
A real-time Linux guest VM is partitioned in a similar manner, having one or
|
||||
more real-time virtual CPUs (vCPUs) and one or more general vCPUs to handle
|
||||
the non real-time housekeeping tasks.
|
||||
|
||||
A real-time vCPU is pinned to a real-time pCPU while a housekeeping vCPU is
|
||||
pinned to a housekeeping pCPU.
|
||||
|
||||
It is expected that operators would require both real-time and non real-time
|
||||
compute nodes on the same overcloud.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
The primary use-case is NFV appliances deployed by the telco community which
|
||||
require strict latency guarantees. Other latency sensitive applications should
|
||||
also benefit.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
This spec proposes changes to automate the deployment of real-time capable
|
||||
compute nodes using TripleO.
|
||||
|
||||
* a custom overcloud image for the real-time compute nodes, which shall include:
|
||||
|
||||
* real-time Linux Kernel
|
||||
* real-time KVM
|
||||
* real-time tuned profiles
|
||||
|
||||
* a new real-time compute role that is a variant of the existing compute role
|
||||
|
||||
* huge pages shall be enabled on the real-time compute nodes.
|
||||
* huge pages shall be reserved for the real-time guests.
|
||||
* CPU pinning shall be used to isolate kernel housekeeping tasks from the
|
||||
real-time tasks by configuring tuned.
|
||||
* CPU pinning shall be used to isolate virtualization housekeeping tasks from
|
||||
the real-time tasks by configuring nova.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Worse-case latency in real-time guest VMs should be significantly reduced.
|
||||
However a real-time configuration potentially reduces the overall throughput of
|
||||
a compute node.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
The operator will remain responsible for:
|
||||
|
||||
* appropriate BIOS settings on compute node.
|
||||
* setting appropriate parameters for the real-time role in an environment file
|
||||
* post-deployment configuration
|
||||
|
||||
* creating/modifying overcloud flavors to enable CPU pinning, hugepages,
|
||||
dedicated CPUs, real-time policy
|
||||
* creating host aggregates for real-time and non real-time compute nodes
|
||||
|
||||
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Real-time ``overcloud-full`` image creation:
|
||||
|
||||
* create a disk-image-builder element to include the real-time packages
|
||||
* add support for multiple overcloud images in python-tripleoclient CLIs::
|
||||
|
||||
openstack overcloud image build
|
||||
openstack overcloud image upload
|
||||
|
||||
Real-time compute role:
|
||||
|
||||
* create a ``ComputeRealtime`` role
|
||||
|
||||
* variant of the ``Compute`` role that can be configured and scaled
|
||||
independently
|
||||
* allows a different image and flavor to be used for real-time nodes
|
||||
* includes any additional parameters/resources that apply to real-time nodes
|
||||
|
||||
* create a ``NovaRealtime`` service
|
||||
|
||||
* contains a nested ``NovaCompute`` service
|
||||
* allows parameters to be overridden for the real-time role only
|
||||
|
||||
Nova configuration:
|
||||
|
||||
* Nova ``vcpu_pin_set`` support is already implemented. See NovaVcpuPinSet in
|
||||
:ref:`references`
|
||||
|
||||
Kernel/system configuration:
|
||||
|
||||
* hugepages support
|
||||
|
||||
* set default hugepage size (kernel cmdline)
|
||||
* number of hugepages of each size to reserve at boot (kernel cmdline)
|
||||
* number of hugepages of each size to reserve post boot on each NUMA node
|
||||
(sysfs)
|
||||
|
||||
* Kernel CPU pinning
|
||||
|
||||
* isolcpu option (kernel cmdline)
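
A sketch of the resulting settings with illustrative values; the exact
sizes and CPU ranges depend on the hardware and on the real-time role
parameters::

    # Kernel command line (grub): default size, boot-time reservation, isolation
    default_hugepagesz=1G hugepagesz=1G hugepages=32 isolcpus=2-19

    # Post-boot, per-NUMA-node reservation via sysfs
    echo 2048 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages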
|
||||
|
||||
Ideally this can be implemented outside of TripleO in the Tuned profiles, where
|
||||
it is possible to set the kernel command line and manage sysfs. TripleO would
|
||||
then manage the Tuned profile config files.
|
||||
Alternatively the grub and systemd config files can be managed directly.
|
||||
|
||||
.. note::
|
||||
|
||||
This requirement is shared with OVS-DPDK. The development should be
|
||||
coordinated to ensure a single implementation is used for
|
||||
both use-cases.
|
||||
Managing the grub config via a UserData script is the current approach used
|
||||
for OVS-DPDK. See OVS-DPDK documentation in :ref:`references`.
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
owalsh
|
||||
|
||||
Other contributors:
|
||||
ansiwen
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
As outlined in the proposed changes.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Libvirt real time instances
|
||||
https://blueprints.launchpad.net/nova/+spec/libvirt-real-time
|
||||
* Hugepages enabled in the Compute nodes.
|
||||
https://bugs.launchpad.net/tripleo/+bug/1589929
|
||||
* CPU isolation of real-time and non real-time tasks.
|
||||
https://bugs.launchpad.net/tripleo/+bug/1589930
|
||||
* Tuned
|
||||
https://fedorahosted.org/tuned/
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Genuine real-time guests are unlikely to be testable in CI:
|
||||
|
||||
* specific BIOS settings are required.
|
||||
* images with real-time Kernel and KVM modules are required
|
||||
|
||||
However the workflow to deploy these guest should be testable in CI.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Manual steps performed by the operator shall be documented:
|
||||
|
||||
* BIOS settings for low latency
|
||||
* Real-time overcloud image creation
|
||||
|
||||
.. note::
|
||||
|
||||
CentOS repos do not include RT packages. The CERN CentOS RT repository is an
|
||||
alternative.
|
||||
* Flavor and profile creation
|
||||
* Parameters required in a TripleO environment file
|
||||
* Post-deployment configuration
|
||||
|
||||
.. _references:
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Nova blueprint `"Libvirt real time instances"
|
||||
<https://blueprints.launchpad.net/nova/+spec/libvirt-real-time>`_
|
||||
|
||||
The requirements are similar to :doc:`../newton/tripleo-ovs-dpdk`
|
||||
|
||||
CERN CentOS 7 RT repo http://linuxsoft.cern.ch/cern/centos/7/rt/
|
||||
|
||||
NovaVcpuPinSet parameter added: https://review.openstack.org/#/c/343770/
|
||||
|
||||
OVS-DPDK documentation (work-in-progress): https://review.openstack.org/#/c/395431/
|
@ -1,386 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================================
|
||||
Modify TripleO Ironic Inspector to PXE Boot Via DHCP Relay
|
||||
==========================================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-routed-networks-ironic-inspector
|
||||
|
||||
This blueprint is part of the series tripleo-routed-networks-deployment [0]_.
|
||||
|
||||
This spec describes adding features to the Undercloud to support Ironic
|
||||
Inspector performing PXE boot services for multiple routed subnets (with
|
||||
DHCP relay on the routers forwarding the requests). The changes required
|
||||
to support this will be in the format of ``undercloud.conf`` and in the Puppet
|
||||
script that writes the ``dnsmasq.conf`` configuration for Ironic Inspector.
|
||||
|
||||
TripleO uses Ironic Inspector to perform baremetal inspection of overcloud
|
||||
nodes prior to deployment. Today, the ``dnsmasq.conf`` that is used by Ironic
|
||||
Inspector is generated by Puppet scripts that run when the Undercloud is
|
||||
configured. A single subnet and IP allocation range is entered in
|
||||
``undercloud.conf`` in the parameter ``inspection_iprange``. This spec would
|
||||
implement support for multiple subnets in one provisioning network.
|
||||
|
||||
Background Context
|
||||
==================
|
||||
|
||||
For a detailed description of the desired topology and problems being
|
||||
addressed, please reference the parent blueprint
tripleo-routed-networks-deployment [0]_.
|
||||
|
||||
Problem Descriptions
|
||||
====================
|
||||
|
||||
Ironic Inspector DHCP doesn't yet support DHCP relay. This makes it
|
||||
difficult to do introspection when the hosts are not on the same L2 domain
|
||||
as the controllers. The dnsmasq process will actually function across a DHCP
|
||||
relay, but the configuration must be edited by hand.
|
||||
|
||||
Possible Solutions, Ideas, or Approaches:
|
||||
|
||||
1. Add support for DHCP scopes and support for DHCP relays.
|
||||
2. Use remote DHCP/PXE boot but provide L3 routes back to the introspection server
|
||||
3. Use Neutron DHCP agent to PXE boot nodes for introspection (the Neutron
|
||||
dhcp-agent already supports multiple subnets, and can be modified to support
|
||||
DHCP relay). Note that there has been discussion about moving to Neutron for
|
||||
Ironic Introspection on this bug [3]_. This is currently infeasible due to
|
||||
Neutron not being able to issue IPs for unknown MACs. The related patch has
|
||||
been abandoned [5]_.
|
||||
|
||||
|
||||
Solution Implementation
|
||||
|
||||
The Ironic Inspector DHCP server uses dnsmasq, but only configures one subnet.
|
||||
We need to modify the Ironic Inspector DHCP configuration so that we can
|
||||
configure DHCP for multiple Neutron subnets and allocation pools. Then we
|
||||
should be able to use DHCP relay to send DHCP requests to the Ironic
|
||||
Inspector DHCP server. In the long term, we can likely leverage the Routed
|
||||
Networks work being done in Neutron to represent the subnets and allocation
|
||||
pools that would be used for the DHCP range sets below. This spec only covers
|
||||
the minimum needed for TripleO, so the work can be achieved simply by modifying
|
||||
the Undercloud Puppet scripts. The following has been tested and shown
|
||||
to result in successful introspection across two subnets, one local and one
|
||||
across a router configured with DHCP relay::
|
||||
|
||||
Current dnsmasq.conf representing one network (172.20.0.0/24), which is
|
||||
configured in the "inspection_iprange" in undercloud.conf:
|
||||
port=0
|
||||
interface=br-ctlplane
|
||||
bind-interfaces
|
||||
dhcp-range=172.21.0.100,172.21.0.120,29
|
||||
dhcp-sequential-ip
|
||||
dhcp-match=ipxe,175
|
||||
# Client is running iPXE; move to next stage of chainloading
|
||||
dhcp-boot=tag:ipxe,http://172.20.0.1:8088/inspector.ipxe
|
||||
dhcp-boot=undionly.kpxe,localhost.localdomain,172.20.0.1
|
||||
|
||||
Multiple-subnet dnsmasq.conf representing multiple subnets:
|
||||
port=0
|
||||
interface=br-ctlplane
|
||||
bind-interfaces
|
||||
# Ranges and options
|
||||
dhcp-range=172.21.0.100,172.21.0.120,29
|
||||
dhcp-range=set:leaf1,172.20.0.100,172.20.0.120,255.255.255.0,29
|
||||
dhcp-option=tag:leaf1,option:router,172.20.0.254
|
||||
dhcp-range=set:leaf2,172.19.0.100,172.19.0.120,255.255.255.0,29
|
||||
dhcp-option=tag:leaf2,option:router,172.19.0.254
|
||||
|
||||
dhcp-sequential-ip
|
||||
dhcp-match=ipxe,175
|
||||
# Client is running iPXE; move to next stage of chainloading
|
||||
dhcp-boot=tag:ipxe,http://172.20.0.1:8088/inspector.ipxe
|
||||
dhcp-boot=undionly.kpxe,localhost.localdomain,172.20.0.1
|
||||
|
||||
In the above configuration, a router is supplied for all subnets, including
|
||||
the subnet to which the Undercloud is attached. Note that the router is not
|
||||
required for nodes on the same subnet as the inspector host, but if it gets
|
||||
automatically generated it won't hurt anything.
|
||||
|
||||
This file is created by the Puppet file located in [1]_. That is where the
|
||||
changes will have to be made.
|
||||
|
||||
As discussed above, using a remote DHCP/PXE server is a possibility only if we
|
||||
have support in the top-of-rack switches, or if there is a system or VM
|
||||
listening on the remote subnet to relay DHCP requests. This configuration of
|
||||
dnsmasq will allow it to send DHCP offers to the DHCP relay, which forwards the
|
||||
offer on to the requesting host. After the offer is accepted, the host can
|
||||
communicate directly with the Undercloud, since it has already received the
|
||||
proper gateway address for packets to be forwarded. It will send a DHCP request
|
||||
directly based on the offer, and the DHCP ACK will be sent directly from the
|
||||
Undercloud to the client. Downloading of the PXE images is then done via TFTP
|
||||
and HTTP, not through the DHCP relay.
|
||||
|
||||
An additional problem is that Ironic Inspector blacklists nodes that have
|
||||
already been introspected using iptables rules blocking traffic from
|
||||
particular MAC addresses. Since packets relayed via DHCP relay will come
|
||||
from the MAC address of the router (not the original NIC that sent the packet),
|
||||
we will need to blacklist MACs based on the contents of the relayed DHCP
|
||||
packet. If possible, this blacklisting would be done using dnsmasq, which
|
||||
would provide the ability to decode the DHCP Discover packets and act on the
|
||||
contents. In order to do blacklisting directly with ``dnsmasq`` instead of
|
||||
using iptables, we need to be able to influence the ``dnsmasq`` configuration
|
||||
file.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
The proposed changes are discussed below.
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The Puppet modules will need to be refactored to output a multi-subnet
|
||||
``dnsmasq.conf`` from a list of subnets in undercloud.conf.
|
||||
|
||||
The blacklisting functionality will need to be updated. Filtering by MAC
|
||||
address won't work for DHCP requests that are relayed by a router. In that
|
||||
case, the source MAC address will be that of the router interface that sent the
|
||||
relayed request. There are methods to blacklist MAC addresses within dnsmasq,
|
||||
such as this configuration::
|
||||
|
||||
dhcp-mac=blacklist,<target MAC address>
|
||||
dhcp-ignore=blacklist
|
||||
|
||||
Or this configuration::
|
||||
|
||||
# Never offer DHCP service to a machine whose Ethernet
|
||||
# address is 11:22:33:44:55:66
|
||||
dhcp-host=11:22:33:44:55:66,ignore
|
||||
|
||||
The configuration could be placed into the main ``dnsmasq.conf`` file, or into
|
||||
a file in ``/etc/dnsmasq.d/``. Either way, dnsmasq will have to be restarted
|
||||
in order to re-read the configuration files. This is due to a security feature
|
||||
in dnsmasq to prevent foreign configuration being loaded as root. Since DHCP
|
||||
has a built-in retry mechanism, the brief time it takes to restart dnsmasq
|
||||
should not impact introspection, as long as we don't restart dnsmasq too
|
||||
many times in any 60-second period.
|
||||
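
As a minimal sketch of that approach, the following assumes a drop-in
directory of ``/etc/dnsmasq.d/`` and a service simply named ``dnsmasq``; the
actual file layout and service name used by Ironic Inspector may differ::

    import os
    import subprocess

    DROPIN_DIR = "/etc/dnsmasq.d"   # assumed drop-in directory
    DNSMASQ_SERVICE = "dnsmasq"     # assumed service name

    def blacklist_mac(mac):
        """Write an ignore entry for an introspected node, then restart
        dnsmasq so the new configuration file is read at startup."""
        name = "inspector-ignore-%s.conf" % mac.replace(":", "-")
        with open(os.path.join(DROPIN_DIR, name), "w") as conf:
            conf.write("dhcp-host=%s,ignore\n" % mac)
        # dnsmasq only reads configuration files at startup, so restart it.
        subprocess.check_call(["systemctl", "restart", DNSMASQ_SERVICE])

    blacklist_mac("11:22:33:44:55:66")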
|
||||
It does not appear that the dnsmasq DBus interface can be used to set the
|
||||
"dhcp-ignore" option for individual MAC addresses [4]_ [6]_.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
One alternative approach is to use DHCP servers to assign IP addresses to all
|
||||
hosts on all interfaces. This would simplify configuration within the Heat
|
||||
templates and environment files. Unfortunately, this was the original approach
|
||||
of TripleO, and it was deemed insufficient by end-users, who wanted stability
|
||||
of IP addresses, and didn't want to have an external dependency on DHCP.
|
||||
|
||||
Another approach which was considered was simply trunking all networks back
|
||||
to the Undercloud, so that dnsmasq could respond to DHCP requests directly,
|
||||
rather than requiring a DHCP relay. Unfortunately, this has already been
|
||||
identified as being unacceptable by some large operators, who have network
|
||||
architectures that make heavy use of L2 segregation via routers. This also
|
||||
won't work well in situations where there is geographical separation between
|
||||
the VLANs, such as in split-site deployments.
|
||||
|
||||
Another approach is to use the DHCP server functionality in the network switch
|
||||
infrastructure in order to PXE boot systems, then assign static IP addresses
|
||||
once the DHCP-driven PXE boot is complete. This approach would require configuration
|
||||
at the switch level that influences where systems PXE boot, potentially opening
|
||||
up a security hole that is not under the control of OpenStack. This approach
|
||||
also doesn't lend itself to automation that accounts for things like changes
|
||||
to the PXE image that is being served to hosts.
|
||||
|
||||
It is not necessary to use hardware routers to forward DHCP packets. There
|
||||
are DHCP relay and DHCP proxy packages available for Linux. It is possible
|
||||
to place a system or a VM on both the Provisioning network and the remote
|
||||
network in order to forward DHCP requests. This might be one method for
|
||||
implementing CI testing. Another method might be to trunk all remote provisioning
|
||||
networks back to the Undercloud, with DHCP relay running on the Undercloud
|
||||
forwarding to the local br-ctlplane.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
One of the major differences between spine-and-leaf and standard isolated
|
||||
networking is that the various subnets are connected by routers, rather than
|
||||
being completely isolated. This means that without proper ACLs on the routers,
|
||||
private networks may be opened up to outside traffic.
|
||||
|
||||
This should be addressed in the documentation, and it should be stressed that
|
||||
ACLs should be in place to prevent unwanted network traffic. For instance, the
|
||||
Internal API network is sensitive in that the database and message queue
|
||||
services run on that network. It is supposed to be isolated from outside
|
||||
connections. This can be achieved fairly easily if *supernets* are used, so that
|
||||
if all Internal API subnets are a part of the ``172.19.0.0/16`` supernet, an
|
||||
ACL rule will allow only traffic between Internal API IPs (this is a simplified
|
||||
example that could be applied on all Internal API router VLAN interfaces
|
||||
or as a global ACL)::
|
||||
|
||||
allow traffic from 172.19.0.0/16 to 172.19.0.0/16
|
||||
deny traffic from * to 172.19.0.0/16
|
||||
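
As a small illustration of the supernet planning this implies, the following
Python sketch (using made-up example subnets) checks that each per-leaf
Internal API subnet falls within the ``172.19.0.0/16`` supernet that the ACL
above relies on::

    import ipaddress

    # Hypothetical IP plan: each leaf gets a /24 carved out of the supernet.
    internal_api_supernet = ipaddress.ip_network("172.19.0.0/16")
    leaf_subnets = [ipaddress.ip_network("172.19.1.0/24"),
                    ipaddress.ip_network("172.19.2.0/24")]

    for subnet in leaf_subnets:
        # subnet_of() confirms the leaf range is covered by the supernet ACL.
        assert subnet.subnet_of(internal_api_supernet), \
            "%s falls outside the Internal API supernet" % subnet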
|
||||
In the case of Ironic Inspector, the TFTP server is a potential point of
|
||||
vulnerability. TFTP is inherently unauthenticated and does not include an
|
||||
access control model. The network(s) where Ironic Inspector is operating
|
||||
should be secured from remote access.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Deploying with spine-and-leaf will require additional parameters to
|
||||
provide the routing information and multiple subnets required. This will have
|
||||
to be documented. Furthermore, the validation scripts may need to be updated
|
||||
to ensure that the configuration is validated, and that there is proper
|
||||
connectivity between overcloud hosts.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Much of the traffic that today remains within layer 2 will traverse layer
|
||||
3 routing boundaries in this design. That adds some minimal latency and overhead,
|
||||
although in practice the difference may not be noticeable. One important
|
||||
consideration is that the routers must not be too overcommitted on their
|
||||
uplinks, and the routers must be monitored to ensure that they are not acting
|
||||
as a bottleneck, especially if complex access control lists are used.
|
||||
|
||||
The DHCP process is not likely to be affected, however delivery of system
|
||||
images via TFTP may suffer a performance degradation. Since TFTP does not
|
||||
deal well with packet loss, deployers will have to take care not to
|
||||
oversaturate the links between routing switches.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
A spine-and-leaf deployment will be more difficult to troubleshoot than a
|
||||
deployment that simply uses a set of VLANs. The deployer may need to have
|
||||
more network expertise, or a dedicated network engineer may be needed to
|
||||
troubleshoot in some cases.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Spine-and-leaf is not easily tested in virt environments. This should be
|
||||
possible, but due to the complexity of setting up libvirt bridges and
|
||||
routes, we may want to provide a simulation of spine-and-leaf for use in
|
||||
virtual environments. This may involve building multiple libvirt bridges
|
||||
and routing between them on the Undercloud, or it may involve using a
|
||||
DHCP relay on the virt-host as well as routing on the virt-host to simulate
|
||||
a full routing switch. A plan for development and testing will need to be
|
||||
formed, since not every developer can be expected to have a routed
|
||||
environment to work in. It may take some time to develop a routed virtual
|
||||
environment, so initial work will be done on bare metal.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Dan Sneddon <dsneddon@redhat.com>
|
||||
|
||||
Final assignees to be determined.
|
||||
|
||||
Approver(s)
|
||||
-----------
|
||||
|
||||
Primary approver:
|
||||
Emilien Macchi <emacchi@redhat.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
1. Modify Ironic Inspector ``dnsmasq.conf`` generation to allow export of
|
||||
multiple DHCP ranges. The patch enabling this has merged [7]_.
|
||||
2. Modify the Ironic Inspector blacklisting mechanism so that it supports DHCP
|
||||
relay, since the DHCP requests forwarded by the router will have the source
|
||||
MAC address of the router, not the node being deployed.
|
||||
3. Modify the documentation in ``tripleo-docs`` to cover the spine-and-leaf case.
|
||||
4. Add an upstream CI job to test booting across subnets (although
|
||||
hardware availability may make this a long-term goal).
|
||||
|
||||
[*] Note that depending on the timeline for Neutron/Ironic integration, it might
|
||||
make sense to implement support for multiple subnets via changes to the Puppet
|
||||
modules which process ``undercloud.conf`` first, then follow up with a patch
|
||||
to integrate Neutron networks into Ironic Inspector later on.
|
||||
|
||||
Implementation Details
|
||||
----------------------
|
||||
|
||||
Workflow for introspection and deployment:
|
||||
|
||||
1. The network administrator configures all provisioning VLANs with the IP
|
||||
address of the Undercloud server on the ctlplane network as the DHCP relay
|
||||
or "helper-address".
|
||||
2. Operator configures IP address ranges and default gateways in
|
||||
``undercloud.conf``. Each subnet will require its own IP address range.
|
||||
3. Operator imports baremetal instackenv.json.
|
||||
4. When introspection or deployment is run, the DHCP server receives the DHCP
|
||||
request from the baremetal host via DHCP relay.
|
||||
5. If the node has not been introspected, the DHCP server replies with an IP
|
||||
address from the introspection pool and the inspector PXE boot image.
|
||||
6. Introspection is performed. LLDP collection [2]_ is performed to gather
|
||||
information about attached network ports.
|
||||
7. The node is blacklisted in ``dnsmasq.conf`` (or in ``/etc/dnsmasq.d``),
|
||||
and dnsmasq is restarted.
|
||||
8. On the next boot, if the MAC address is blacklisted and a port exists in
|
||||
Neutron, then Neutron replies with the IP address from the Neutron port
|
||||
and the overcloud-full deployment image.
|
||||
9. The Heat templates are processed, generating os-net-config templates, and
|
||||
os-net-config is run to assign static IPs from the correct subnets, as well
|
||||
as routes to other subnets via the router gateway addresses.
|
||||
|
||||
When using spine-and-leaf, the DHCP server will need to provide an introspection
|
||||
IP address on the appropriate subnet, depending on the information contained in
|
||||
the DHCP relay packet that is forwarded by the segment router. dnsmasq will
|
||||
automatically match the gateway address (GIADDR) of the router that forwarded
|
||||
the request to the subnet where the DHCP request was received, and will respond
|
||||
with an IP and gateway appropriate for that subnet.
|
||||
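
The selection logic belongs to dnsmasq itself, but conceptually it behaves
like the following Python sketch, which uses illustrative pools rather than
values from a real deployment::

    import ipaddress

    # Configured provisioning pools, mirroring the dhcp-range entries above.
    pools = {
        ipaddress.ip_network("172.21.0.0/24"): ("172.21.0.100", "172.21.0.120"),
        ipaddress.ip_network("172.20.0.0/24"): ("172.20.0.100", "172.20.0.120"),
        ipaddress.ip_network("172.19.0.0/24"): ("172.19.0.100", "172.19.0.120"),
    }

    def select_pool(giaddr):
        """Return the DHCP pool whose subnet contains the relay's GIADDR."""
        address = ipaddress.ip_address(giaddr)
        for network, pool in pools.items():
            if address in network:
                return network, pool
        return None

    print(select_pool("172.19.0.254"))  # matches the 172.19.0.0/24 pool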
|
||||
The above workflow for the DHCP server should allow for provisioning IPs on
|
||||
multiple subnets.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
There will be a dependency on routing switches that perform DHCP relay service
|
||||
for production spine-and-leaf deployments. Since we will not have routing
|
||||
switches in our virtual testing environment, a DHCP proxy may be set up as
|
||||
described in the testing section below.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
In order to properly test this framework, we will need to establish at least
|
||||
one CI test that deploys spine-and-leaf. As discussed in this spec, it isn't
|
||||
necessary to have a full routed bare metal environment in order to test this
|
||||
functionality, although there is some work required to get it working in virtual
|
||||
environments such as OVB.
|
||||
|
||||
For virtual testing, it is sufficient to trunk all VLANs back to the
|
||||
Undercloud, then run DHCP proxy on the Undercloud to receive all the
|
||||
requests and forward them to br-ctlplane, where dnsmasq listens. This
|
||||
will provide a substitute for routers running DHCP relay.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The TripleO docs will need to be updated to include detailed instructions
|
||||
for deploying in a spine-and-leaf environment, including the environment
|
||||
setup. Covering specific vendor implementations of switch configurations
|
||||
is outside the scope of this document, but an overview of the required configuration
|
||||
options should be included, such as enabling DHCP relay (or "helper-address"
|
||||
as it is also known) and setting the Undercloud as a server to receive
|
||||
DHCP requests.
|
||||
|
||||
The updates to TripleO docs will also have to include a detailed discussion
|
||||
of choices to be made about IP addressing before a deployment. If supernets
|
||||
are to be used for network isolation, then a good plan for IP addressing will
|
||||
be required to ensure scalability in the future.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [0] `Spec: Routed Networks for Neutron <https://review.openstack.org/#/c/225384/6/specs/mitaka/routed-networks.rst>`_
|
||||
.. [1] `Source Code: inspector_dnsmasq_http.erb <https://github.com/openstack/puppet-ironic/blob/master/templates/inspector_dnsmasq_http.erb>`_
|
||||
.. [2] `Review: Add LLDP processing hook and new CLI commands <https://review.openstack.org/#/c/374381>`_
|
||||
.. [3] `Bug: [RFE] Implement neutron routed networks support in Ironic <https://bugs.launchpad.net/ironic/+bug/1658964>`_
|
||||
.. [4] `Wikibooks: Python Programming: DBus <https://en.wikibooks.org/wiki/Python_Programming/Dbus>`_
|
||||
.. [5] `Review: Enhanced Network/Subnet DHCP Options <https://review.openstack.org/#/c/248931/>`_
|
||||
.. [6] `Documentation: DBus Interface for dnsmasq <http://www.thekelleys.org.uk/dnsmasq/docs/DBus-interface>`_
|
||||
.. [7] `Review: Multiple DHCP Subnets for Ironic Inspector <https://review.openstack.org/#/c/436716/>`_
|
@ -1,126 +0,0 @@
|
||||
..
|
||||
This template should be in ReSTructured text. For help with syntax,
|
||||
see http://sphinx-doc.org/rest.html
|
||||
|
||||
To test out your formatting, build the docs using tox, or see:
|
||||
http://rst.ninjs.org
|
||||
|
||||
The filename in the git repository should match the launchpad URL,
|
||||
for example a URL of
|
||||
https://blueprints.launchpad.net/oslo?searchtext=awesome-thing should be
|
||||
named awesome-thing.rst.
|
||||
|
||||
For specs targeted at a single project, please prefix the first line
|
||||
of your commit message with the name of the project. For example,
|
||||
if you're submitting a new feature for oslo.config, your git commit
|
||||
message should start something like: "config: My new feature".
|
||||
|
||||
Wrap text at 79 columns.
|
||||
|
||||
Do not delete any of the sections in this template. If you have
|
||||
nothing to say for a whole section, just write: None
|
||||
|
||||
If you would like to provide a diagram with your spec, ascii diagrams are
|
||||
required. http://asciiflow.com/ is a very nice tool to assist with making
|
||||
ascii diagrams. The reason for this is that the tool used to review specs is
|
||||
based purely on plain text. Plain text will allow review to proceed without
|
||||
having to look at additional files which can not be viewed in gerrit. It
|
||||
will also allow inline feedback on the diagram itself.
|
||||
|
||||
=========================
|
||||
The title of the policy
|
||||
=========================
|
||||
|
||||
Introduction paragraph -- why are we doing anything?
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
A detailed description of the problem.
|
||||
|
||||
Policy
|
||||
======
|
||||
|
||||
Here is where you cover the change you propose to make in detail. How do you
|
||||
propose to solve this problem?
|
||||
|
||||
If the policy seeks to modify a process or workflow followed by the
|
||||
team, explain how and why.
|
||||
|
||||
If this is one part of a larger effort make it clear where this piece ends. In
|
||||
other words, what's the scope of this policy?
|
||||
|
||||
Alternatives & History
|
||||
======================
|
||||
|
||||
What other ways could we do this thing? Why aren't we using those? This doesn't
|
||||
have to be a full literature review, but it should demonstrate that thought has
|
||||
been put into why the proposed solution is an appropriate one.
|
||||
|
||||
If the policy changes over time, summarize the changes here. The exact
|
||||
details are always available by looking at the git history, but
|
||||
summarizing them will make it easier for anyone to follow the desired
|
||||
policy and understand when and why it might have changed.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Who is leading the writing of the policy? If more than one person is
|
||||
working on it, please designate the primary author and contact.
|
||||
|
||||
Primary author:
|
||||
<launchpad-id or None>
|
||||
|
||||
Other contributors:
|
||||
<launchpad-id or None>
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
When will the policy go into effect?
|
||||
|
||||
If there is a built-in deprecation period for the policy, or criteria
|
||||
that would trigger it no longer being in effect, describe them.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
List any concrete steps we need to take to implement the policy.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Please add any useful references here. You are not required to have
|
||||
any references. Moreover, this policy should still make sense when
|
||||
your references are unavailable. Examples of what you could include
|
||||
are:
|
||||
|
||||
* Links to mailing list or IRC discussions
|
||||
|
||||
* Links to notes from a summit session
|
||||
|
||||
* Links to relevant research, if appropriate
|
||||
|
||||
* Related policies as appropriate
|
||||
|
||||
* Anything else you feel it is worthwhile to refer to
|
||||
|
||||
Revision History
|
||||
================
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* -
|
||||
- Introduced
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,146 +0,0 @@
|
||||
====================
|
||||
Adding New CI Jobs
|
||||
====================
|
||||
|
||||
New CI jobs need to be added following a specific process in order to ensure
|
||||
they don't block patches unnecessarily and that they aren't ignored by
|
||||
developers.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
We need to have a process for adding CI jobs that is not going to result
|
||||
in a lot of spurious failures due to the new jobs. Bogus CI results force
|
||||
additional rechecks and reduce developer/reviewer confidence in the results.
|
||||
|
||||
In addition, maintaining CI jobs is a non-trivial task, and each one we add
|
||||
increases the load on the team. Hopefully having a process that requires the
|
||||
involvement of the new job's proposer makes it clear that the person/team
|
||||
adding the job has a responsibility to help maintain it. CI is everyone's
|
||||
problem.
|
||||
|
||||
Policy
|
||||
======
|
||||
|
||||
The following steps should be completed in the order listed when adding a new
|
||||
job:
|
||||
|
||||
#. Create an experimental job or hijack an existing job for a single Gerrit
|
||||
change. See the references section for details on how to add a new job.
|
||||
This job should be passing before moving on to the next step.
|
||||
|
||||
#. Verify that the new job is providing a reasonable level of logging. Not
|
||||
too much, not too little. Important logs, such as the OpenStack service
|
||||
logs and basic system logs, are necessary to determine why jobs fail.
|
||||
However, OpenStack Infra has to store the logs from an enormous number of
|
||||
jobs, so it is also important to keep our log artifact sizes under control.
|
||||
When in doubt, try to capture about the same amount of logs as the existing
|
||||
jobs.
|
||||
|
||||
#. Promote the job to check non-voting. While the job should have been
|
||||
passing prior to this, it most likely has not been run a significant number
|
||||
of times, so the overall stability is still unknown.
|
||||
|
||||
"Stable" in this case would be defined as not having significantly more
|
||||
spurious failures than the ovb-ha job. Due to the additional complexity of
|
||||
an HA deployment, that job tends to fail for reasons unrelated to the patch
|
||||
being tested more often than the other jobs. We do not want to add any
|
||||
jobs that are less stable. Note that failures due to legitimate problems
|
||||
being caught by the new job should not count against its stability.
|
||||
|
||||
.. important:: Before adding OVB jobs to the check queue, even as
|
||||
non-voting, please check with the CI admins to ensure there is enough
|
||||
OVB capacity to run a large number of new jobs. As of this writing,
|
||||
the OVB cloud capacity is significantly more constrained than regular
|
||||
OpenStack Infra.
|
||||
|
||||
A job should remain in this state until it has been proven stable over a
|
||||
period of time. A good rule of thumb would be that after a week of
|
||||
stability the job can and should move to the next step.
|
||||
|
||||
.. important:: Jobs should not remain non-voting indefinitely. This causes
|
||||
reviewers to ignore the results anyway, so the jobs become a waste of
|
||||
resources. Once a job is believed to be stable, it should be made
|
||||
voting as soon as possible.
|
||||
|
||||
#. To assist with confirming the stability of a job, it should be added to the
|
||||
`CI Status <http://tripleo.org/cistatus.html>`_ page at this point. This
|
||||
can actually be done at any time after the job is moved to the check queue,
|
||||
but must be done before the job becomes voting.
|
||||
|
||||
Additionally, contact Sagi Shnaidman (sshnaidm on IRC) to get the job
|
||||
added to the `Extended CI Status <http://status-tripleoci.rhcloud.com/>`_
|
||||
page.
|
||||
|
||||
#. Send an e-mail to openstack-dev, tagged with [tripleo], that explains the
|
||||
purpose of the new job and notifies people that it is about to be made
|
||||
voting.
|
||||
|
||||
#. Make the job voting. At this point there should be sufficient confidence
|
||||
in the job that reviewers can trust the results and should not merge
|
||||
anything which does not pass it.
|
||||
|
||||
In addition, be aware that voting multinode jobs are also gating. If the
|
||||
job fails the patch cannot merge. This means a broken job can block all
|
||||
TripleO changes from merging.
|
||||
|
||||
#. Keep an eye on the `CI Status <http://tripleo.org/cistatus.html>`_ page to
|
||||
ensure the job keeps running smoothly. If it starts to fail an unusual
|
||||
amount, please investigate.
|
||||
|
||||
Alternatives & History
|
||||
======================
|
||||
|
||||
Historically, a number of jobs have been added to the check queue when they
|
||||
were completely broken. This is bad and reduces developer and reviewer
|
||||
confidence in the CI results. It can also block TripleO changes from merging
|
||||
if the broken job is gating.
|
||||
|
||||
We also have a bad habit of leaving jobs in the non-voting state, which makes
|
||||
them fairly worthless since reviewers will not respect the results. Per
|
||||
this policy, we should clean up all of the non-voting jobs by either moving
|
||||
them back to experimental, or stabilizing them and making them voting.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Primary author:
|
||||
bnemec
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
This policy would go into effect immediately.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
This policy is mostly targeted at new jobs, but we do have a number of
|
||||
non-voting jobs that should be brought into compliance with it.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
`OpenStack Infra Manual <https://docs.openstack.org/infra/manual/>`_
|
||||
|
||||
`Adding a New Job <https://docs.openstack.org/infra/manual/drivers.html#running-jobs-with-zuul>`_
|
||||
|
||||
Revision History
|
||||
================
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Pike
|
||||
- Introduced
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,150 +0,0 @@
|
||||
========
|
||||
Bug tags
|
||||
========
|
||||
|
||||
The main TripleO bug tracker is used to keep track of bugs for multiple
|
||||
projects that are all parts of TripleO. In order to reduce confusion,
|
||||
we are using a list of approved tags to categorize them.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Given the heavily interconnected nature of the various TripleO
|
||||
projects, there is a desire to track all the related bugs in a single
|
||||
bug tracker. However when it is needed, it can be difficult to narrow
|
||||
down the bugs related to a specific aspect of the project. Launchpad
|
||||
bug tags can help us here.
|
||||
|
||||
Policy
|
||||
======
|
||||
|
||||
The Launchpad official tags list for TripleO contains the following
|
||||
tags. Keeping them official in Launchpad means the tags will
|
||||
auto-complete when users start typing them. A bug report can have any
|
||||
combination of these tags, or none.
|
||||
|
||||
Proposing new tags should be done via policy update (proposing a change
|
||||
to this file). Once such a change is merged, a member of the driver
|
||||
team will create/delete the tag in Launchpad.
|
||||
|
||||
Tags
|
||||
----
|
||||
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| Tag | Description |
|
||||
+===============================+============================================================================+
|
||||
| alert | For critical bugs requiring immediate attention. Triggers IRC notification |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| ci | A bug affecting the Continuous Integration system |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| ci-reproducer | A bug affecting local recreation of Continuous Integration environments |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| config-agent | A bug affecting os-collect-config, os-refresh-config, os-apply-config |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| containers | A bug affecting container based deployments |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| depcheck | A bug affecting 3rd party dependencies, for example ceph-ansible, podman |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| deployment-time | A bug affecting deployment time |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| documentation | A bug that is specific to documentation issues |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| edge | A bug that correlates to EDGE computing cases by network/scale etc. areas |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| i18n | A bug related to internationalization issues |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| low-hanging-fruit | A good starter bug for newcomers |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| networking | A bug that is specific to networking issues |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| promotion-blocker | Bug that is blocking promotion job(s) |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| puppet | A bug affecting the TripleO Puppet templates |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| quickstart | A bug affecting tripleo-quickstart or tripleo-quickstart-extras |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| selinux | A bug related to SELinux |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| tech-debt | A bug related to TripleO tech debt |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| tempest | A bug related to tempest running on TripleO |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| tripleo-common | A bug affecting tripleo-common |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| tripleo-heat-templates | A bug affecting the TripleO Heat Templates |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| tripleoclient | A bug affecting python-tripleoclient |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| ui | A bug affecting the TripleO UI |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| upgrade | A bug affecting upgrades |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| ux | A bug affecting user experience |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| validations | A bug affecting the Validations |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| workflows | A bug affecting the Mistral workflows |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| xxx-backport-potential | Cherry-pick request for the stable team |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
|
||||
Alternatives & History
|
||||
======================
|
||||
|
||||
The current ad-hoc system is not working well, as people use
|
||||
inconsistent subject tags and other markers. Likewise, with the list
|
||||
not being official, Launchpad tags do not autocomplete and quickly
|
||||
become inconsistent, hence not as useful.
|
||||
|
||||
We could use the wiki to keep track of the tags, but the future of the
|
||||
wiki is in doubt. By making tags an official policy, changes to the
|
||||
list can be reviewed.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Primary author:
|
||||
jpichon
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
Newton-3
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Once the policy has merged, someone with the appropriate Launchpad
|
||||
permissions should create the tags and an email should be sent to
|
||||
openstack-dev referring to this policy.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Launchpad page to manage the tag list:
|
||||
https://bugs.launchpad.net/tripleo/+manage-official-tags
|
||||
|
||||
Thread that led to the creation of this policy:
|
||||
http://lists.openstack.org/pipermail/openstack-dev/2016-July/099444.html
|
||||
|
||||
Revision History
|
||||
================
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Newton
|
||||
- Introduced
|
||||
* - Queens
|
||||
- tech-debt tag added
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,246 +0,0 @@
|
||||
CI Team Structure
|
||||
=================
|
||||
|
||||
Problem Description
|
||||
-------------------
|
||||
The soft analysis over the past one to two years is that landing major new
|
||||
features and functionality in CI is difficult while being interrupted by a constant
|
||||
stream of issues. Each individual is siloed in their own work, feature or
|
||||
section of the production chain and there is very little time for thoughtful
|
||||
peer review and collaborative development.
|
||||
|
||||
Policy
|
||||
------
|
||||
|
||||
Goals
|
||||
^^^^^
|
||||
|
||||
* Increase developer focus, decrease distractions, interruptions, and time
|
||||
slicing.
|
||||
* Encourage collaborative team development.
|
||||
* Better and faster code reviews.
|
||||
|
||||
Team Structure
|
||||
^^^^^^^^^^^^^^
|
||||
* The Ruck
|
||||
* The Rover
|
||||
* The Sprint Team
|
||||
|
||||
The Ruck
|
||||
^^^^^^^^
|
||||
One person per week will be on the front lines reporting failures found in CI.
|
||||
The Ruck & Rover switch roles in the second week of the sprint.
|
||||
|
||||
* Primary focus is to watch CI, report bugs, improve debug documentation.
|
||||
* Does not participate in the sprint
|
||||
* Attends the meetings where the team needs to be represented
|
||||
* Responds to pings on #oooq / #tripleo regarding CI
|
||||
* Reviews and improves documentation
|
||||
* Attends meetings for the group where possible
|
||||
* For identification, use the irc nick $user|ruck
|
||||
|
||||
The Rover
|
||||
^^^^^^^^^
|
||||
The primary backup for the Ruck. The Ruck should be catching all the issues
|
||||
in CI and passing the issues to the Rover for more in-depth analysis or
|
||||
resolution of the bug.
|
||||
|
||||
* Back up for the Ruck
|
||||
* Workload is driven from the tripleo-quickstart bug queue, the Rover is
|
||||
not monitoring CI
|
||||
* A secondary input for work is identified technical debt defined in the
|
||||
Trello board.
|
||||
* Attends the sprint meetings, but is not responsible for any sprint work
|
||||
* Helps to triage incoming gerrit reviews
|
||||
* Responds to pings on irc #oooq / #tripleo
|
||||
* If the Ruck is overwhelmed with any of their responsibilities the
|
||||
Rover is the primary backup.
|
||||
* For identification, use the irc nick $user|rover
|
||||
|
||||
The Sprint Team
|
||||
^^^^^^^^^^^^^^^
|
||||
The team is defined at the beginning of the sprint based on availability.
|
||||
Members on the team should be as focused on the sprint epic as possible.
|
||||
A member of the team should spend 80% of their time on sprint goals and 20%
|
||||
on any other duties like code review or incoming high priority bugs that
|
||||
the Rover can not manage alone.
|
||||
|
||||
* hand off interruptions to the Ruck and Rover as much as possible
|
||||
* focus as a team on the sprint epic
|
||||
* collaborate with other members of the sprint team
|
||||
* seek out peer review regarding sprint work
|
||||
* keep the Trello board updated daily
|
||||
* One can point to Trello cards in stand up meetings for status
|
||||
|
||||
The Squads
|
||||
^^^^^^^^^^
|
||||
The squads operate as a subunit of the sprint team. Each squad will operate
|
||||
with the same process and procedures and is managed by the team catalyst.
|
||||
|
||||
* Current Squads
|
||||
* CI
|
||||
* Responsible for the TripleO CI system ( non-infra ) and build
|
||||
verification.
|
||||
* Tempest
|
||||
* Responsible for tempest development.
|
||||
|
||||
Team Leaders
|
||||
------------
|
||||
|
||||
The team catalyst (TC)
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
The member of the team responsible for organizing the group. The team will elect or
|
||||
appoint a team catalyst per release.
|
||||
|
||||
* organize and plan sprint meetings
|
||||
* collect status and send status emails
|
||||
|
||||
The user advocate (UA)
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
The member of the team responsible for helping to prioritize work. The team will
|
||||
elect or appoint a user advocate per release.
|
||||
|
||||
* organize and prioritize the Trello board for the sprint planning
|
||||
* monitor the board during the sprint.
|
||||
* ensure the right work is being done.
|
||||
|
||||
The Squads
|
||||
^^^^^^^^^^
|
||||
There are two squads on the CI team.
|
||||
|
||||
* tripleo ci
|
||||
* tempest development
|
||||
|
||||
Each squad has a UA and they share a TC. Both contribute to Ruck and Rover rotations.
|
||||
|
||||
|
||||
Current Leaders for Rocky
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
* team catalyst (ci, tempest) - Matt Young
|
||||
* user advocate (ci) - Gabriele Cerami
|
||||
* user advocate (tempest) - Chandan Kumar
|
||||
|
||||
Sprint Structure
|
||||
^^^^^^^^^^^^^^^^
|
||||
The goal of the sprint is to define a narrow and focused feature called an epic
|
||||
to work on in a collaborative way. Work not completed in the sprint will be
|
||||
added to the technical debt column of Trello.
|
||||
|
||||
**Note:** Each sprint needs a clear definition of done that is documented in
|
||||
the epic used for the sprint.
|
||||
|
||||
Sprint Start ( Day 1 ) - 2.5 hours
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Sprints are three weeks in length
|
||||
* A planning meeting is attended by the entire team including the Ruck and
|
||||
Rover
|
||||
* Review PTO
|
||||
* Review any meetings that need to be covered by the Ruck/Rover
|
||||
* The UA will present options for the sprint epic
|
||||
* Discuss the epic, lightly breaking each one down
|
||||
* Vote on an epic
|
||||
* The vote can be done using a doodle form
|
||||
* Break down the sprint epic into cards
|
||||
* Review each card
|
||||
* Each card must have a clear definition of done
|
||||
* As a group include as much detail in the card as to provide enough
|
||||
information for an engineer with little to no background with the task.
|
||||
|
||||
|
||||
Sprint End ( Day 15 ) - 2.5 hours
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
* Retrospective
|
||||
* team members, ruck and rover only
|
||||
* Document any technical debt left over from the sprint
|
||||
* Ruck / Rover hand off
|
||||
* Assign Ruck and Rover positions
|
||||
* Sprint demo - when available
|
||||
* Office hours on irc
|
||||
|
||||
Scrum meetings - 30 Min
|
||||
^^^^^^^^^^^^^^^^^^^^^^^
|
||||
* Planning meeting, video conference
|
||||
* Sprint End, video and irc #oooq on freenode
|
||||
* 2 live video conference meetings per week
|
||||
* sprint stand up
|
||||
* Other days, post status to the team's Trello board and/or cards
|
||||
|
||||
|
||||
TripleO CI Community meeting
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
* A community meeting should be held once a week.
|
||||
* The meeting should ideally be scheduled immediately after
|
||||
the TripleO community meeting on #tripleo (OFTC)
|
||||
* The CI meeting should be announced as part of the TripleO community meeting
|
||||
to encourage participation.
|
||||
|
||||
Alternatives & History
|
||||
----------------------
|
||||
|
||||
In the past the CI team has worked as individuals or by pairing up for distinct
|
||||
parts of the CI system and for certain features. Neither approach has been
|
||||
overwhelmingly successful for delivering features on a regular cadence.
|
||||
|
||||
Implementation
|
||||
--------------
|
||||
|
||||
Primary author: Wes Hayutin weshayutin at gmail
|
||||
|
||||
Other contributors:
|
||||
* Ronelle Landy rlandy at redhat
|
||||
* Arx Cruz acruz at redhat
|
||||
* Sagi Shnaidman at redhat
|
||||
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
This document is likely to evolve from the feedback discussed in sprint
|
||||
retrospectives. An in depth retrospective should be done at the end of each
|
||||
upstream cycle.
|
||||
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
Trello
|
||||
^^^^^^
|
||||
A Trello board will be used to organize work. The team is expected to keep the
|
||||
board and their cards updated on a daily basis.
|
||||
|
||||
* https://trello.com/b/U1ITy0cu/tripleo-ci-squad
|
||||
|
||||
Dashboards
|
||||
^^^^^^^^^^
|
||||
A number of dashboards are used to monitor the CI
|
||||
|
||||
* http://cistatus.tripleo.org/
|
||||
* https://dashboards.rdoproject.org/rdo-dev
|
||||
* http://zuul-status.tripleo.org/
|
||||
|
||||
Team Notes
|
||||
^^^^^^^^^^
|
||||
|
||||
* https://etherpad.openstack.org/p/tripleo-ci-squad-meeting
|
||||
|
||||
Bug Queue
|
||||
^^^^^^^^^
|
||||
* http://tinyurl.com/yag6y9ne
|
||||
|
||||
|
||||
Revision History
|
||||
----------------
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Rocky
|
||||
- April 16 2018
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License. http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,122 +0,0 @@
|
||||
=====================
|
||||
Expedited Approvals
|
||||
=====================
|
||||
|
||||
In general, TripleO follows the standard "2 +2" review process, but there are
|
||||
situations where we want to make an exception. This policy is intended to
|
||||
document those exceptions.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Core reviewer time is precious, and there is never enough of it. In some
|
||||
cases, requiring 2 +2's on a patch is a waste of that core time, so we need
|
||||
to be reasonable about when to make exceptions. While core reviewers are
|
||||
always free to use their judgment about when to merge or not merge a patch,
|
||||
it can be helpful to list some specific situations where it is acceptable and
|
||||
even expected to approve a patch with a single +2.
|
||||
|
||||
Part of this information is already in the wiki, but the future of the wiki
|
||||
is in doubt and it's better to put policies in a place that they can be
|
||||
reviewed anyway.
|
||||
|
||||
Policy
|
||||
======
|
||||
|
||||
Single +2 Approvals
|
||||
-------------------
|
||||
|
||||
A core can and should approve patches without a second +2 under the following
|
||||
circumstances:
|
||||
|
||||
* The change has multiple +2's on previous patch sets, indicating an agreement
|
||||
from the other cores that the overall design is good, and any alterations to
|
||||
the patch since those +2's must be minor implementation details only -
|
||||
trivial rebases, minor syntax changes, or comment/documentation changes.
|
||||
* Backports proposed by another core reviewer. Backports should already have
|
||||
been reviewed for design when they merged to master, so if two cores agree
|
||||
that the backport is good (one by proposing, the other by reviewing), they
|
||||
can be merged with a single +2 review.
|
||||
* Requirements updates proposed by the bot.
|
||||
* Translation updates proposed by the bot. (See also `reviewing
|
||||
translation imports
|
||||
<https://docs.openstack.org/i18n/latest/reviewing-translation-import.html>`_.)
|
||||
|
||||
Co-author +2
|
||||
------------
|
||||
|
||||
Co-authors on a patch are allowed to +2 that patch, but at least one +2 from a
|
||||
core not listed as a co-author is required to merge the patch. For example, if
|
||||
core A pushes a patch with cores B and C as co-authors, core B and core C are
|
||||
both allowed to +2 that patch, but another core is required to +2 before the
|
||||
patch can be merged.
|
||||
|
||||
Self-Approval
|
||||
-------------
|
||||
|
||||
It is acceptable for a core to self-approve a patch they submitted if it has the
|
||||
requisite 2 +2's and a CI pass. However, this should not be done if there is any
|
||||
dispute about the patch, such as on a change with 2 +2's and an unresolved -1.
|
||||
|
||||
Note on CI
|
||||
----------
|
||||
|
||||
This policy does not affect CI requirements. Patches must still pass CI before
|
||||
merging.
|
||||
|
||||
Alternatives & History
|
||||
======================
|
||||
|
||||
This policy has been in effect for a while now, but not every TripleO core is
|
||||
aware of it, so it is simply being written down in an official location for
|
||||
reference.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Primary author:
|
||||
bnemec
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
The policy is already in effect.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Ensure all cores are aware of the policy. Once the policy has merged, an email
|
||||
should be sent to openstack-dev referring to it.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Existing wiki on review guidelines:
|
||||
https://wiki.openstack.org/wiki/TripleO/ReviewGuidelines
|
||||
|
||||
Previous spec that implemented some of this policy:
|
||||
https://specs.openstack.org/openstack/tripleo-specs/specs/kilo/tripleo-review-standards.html
|
||||
|
||||
Revision History
|
||||
================
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Newton
|
||||
- Introduced
|
||||
* - Newton
|
||||
- Added co-author +2 policy
|
||||
* - Ocata
|
||||
- Added note on translation imports
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
https://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,257 +0,0 @@
|
||||
..
|
||||
|
||||
========================
|
||||
TripleO First Principles
|
||||
========================
|
||||
|
||||
The TripleO first principles are a set of principles that guide decision making
|
||||
around future direction with TripleO. The principles are used to evaluate
|
||||
choices around changes in direction and architecture. Every impactful decision
|
||||
does not necessarily have to follow all the principles, but we use them to make
|
||||
informed decisions about trade offs when necessary.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
When evaluating technical direction within TripleO, a better and more
|
||||
consistent method is needed to weigh pros and cons of choices. Defining the
|
||||
principles is a step towards addressing that need.
|
||||
|
||||
|
||||
Policy
|
||||
======
|
||||
|
||||
|
||||
Definitions
|
||||
-----------
|
||||
|
||||
Framework
|
||||
The functional implementation which exposes a set of standard enforcing
|
||||
interfaces that can be consumed by a service to describe that service's
|
||||
deployment and management. The framework includes all functional pieces that
|
||||
implement such interfaces, such as CLI's, API's, or libraries.
|
||||
|
||||
Example: tripleoclient/tripleo-common/tripleo-ansible/tripleo-heat-templates
|
||||
|
||||
Service
|
||||
The unit of deployment. A service will implement the necessary framework
|
||||
interfaces in order to describe its deployment.
|
||||
|
||||
The framework does not enforce a particular service boundary, other than by
|
||||
prescribing best practices. For example, a given service implementation could
|
||||
deploy both a REST API and a database, when in reality the API and database
|
||||
should more likely be deployed as their own services and expressed as
|
||||
dependencies.
|
||||
|
||||
Example: Keystone, MariaDB, RabbitMQ
|
||||
|
||||
Third party integrations
|
||||
Service implementations that are developed and maintained outside of the
|
||||
TripleO project. These are often implemented by vendors aiming to add support
|
||||
for their products within TripleO.
|
||||
|
||||
Example: Cinder drivers, Neutron plugins
|
||||
|
||||
First Principles
|
||||
----------------
|
||||
|
||||
#. [UndercloudMigrate] No Undercloud Left Behind
|
||||
|
||||
#. TripleO itself as the deployment tool can be upgraded. We do
|
||||
not immediately propose what the upgrade will look like or the technology
|
||||
stack, but we will offer an upgrade path or a migration path.
|
||||
|
||||
#. [OvercloudMigrate] No Overcloud Left Behind
|
||||
|
||||
#. An overcloud deployed with TripleO can be upgraded to the next major version
|
||||
with either an in-place upgrade or a migration.
|
||||
|
||||
#. [DefinedInterfaces] TripleO will have a defined interface specification.
|
||||
|
||||
#. We will document clear boundaries between internal and external
|
||||
(third party integrations) interfaces.
|
||||
#. We will document the supported interfaces of the framework in the same
|
||||
way that a code library or API would be documented.
|
||||
#. Individual services of the framework can be deployed and tested in
|
||||
isolation from other services. Service dependencies are expressed per
|
||||
service, but do not preclude using the framework to deploy a service
|
||||
isolated from its dependencies. Whether that is successful or not
|
||||
depends on how the service responds to missing dependencies, and that is
|
||||
a behavior of the service and not the framework.
|
||||
#. The interface will offer update and upgrade tasks as first class citizens
|
||||
#. The interface will offer validation tasks as first class citizens
|
||||
|
||||
#. [OSProvisioningSeparation] Separation between operating system provisioning
|
||||
and software configuration.
|
||||
|
||||
#. Baremetal configuration, network configuration and base operating system
|
||||
provisioning is decoupled from the software deployment.
|
||||
#. The software deployment will have a defined set of minimal requirements
|
||||
which are expected to be in-place before it begins the software deployment.
|
||||
|
||||
#. Specific linux distributions
|
||||
#. Specific linux distribution versions
|
||||
#. Password-less access via ssh
|
||||
#. Password-less sudo access
|
||||
#. Pre-configured network bridges
|
||||
|
||||
#. [PlatformAgnostic] Platform agnostic deployment tooling.
|
||||
|
||||
#. TripleO is sufficiently isolated from the platform in a way that allows
|
||||
for use in a variety of environments (baremetal/virtual/containerized/OS
|
||||
version).
|
||||
#. The developer experience is such that it can easily be run in
|
||||
isolation on developer workstations
|
||||
|
||||
#. [DeploymentToolingScope] The deployment tool has a defined scope
|
||||
|
||||
#. Data collection tool.
|
||||
|
||||
#. Responsible for collecting host and state information and posting to a
|
||||
centralized repository.
|
||||
#. Handles writes to central repository (e.g. read information from
|
||||
repository, do aggregation, post to central repository)
|
||||
|
||||
#. A configuration tool to configure software and services as part of the
|
||||
deployment
|
||||
|
||||
#. Manages Software Configuration
|
||||
|
||||
#. Files
|
||||
#. Directories
|
||||
#. Service (containerized or non-containerized) state
|
||||
#. Software packages
|
||||
|
||||
#. Executes commands related to “configuration” of a service
|
||||
Example: Configure OpenStack AZ's, Neutron Networks.
|
||||
#. Isolated executions that are invoked independently by the orchestration tool
|
||||
#. Single execution state management
|
||||
|
||||
#. Input is configuration data/tasks/etc
|
||||
#. A single execution produces the desired state or reports failure.
|
||||
#. Idempotent
|
||||
|
||||
#. Read-only communication with centralized data repository for configuration data
|
||||
|
||||
#. The deployment process depends on an orchestration tool to handle various
|
||||
task executions.
|
||||
|
||||
#. Task graph manager
|
||||
#. Task transport and execution tracker
|
||||
#. Aware of hosts and work to be executed on the hosts
|
||||
#. Ephemeral deployment tooling
|
||||
#. Efficient execution
|
||||
#. Scale and reliability/durability are first class citizens
|
||||
|
||||
#. [CI/CDTooling] TripleO functionality should be considered within the context
|
||||
of being directly invoked as part of a CI/CD pipeline.
|
||||
|
||||
#. [DebuggableFramework] Diagnosis of deployment/configuration failures within
|
||||
the framework should be quick and simple. Interfaces should be provided to
|
||||
enable debuggability of service failures.
|
||||
|
||||
#. [BaseOSBootstrap] TripleO can start from a base OS and go to full cloud
|
||||
|
||||
#. It should be able to start at any point after base OS, but should be able
|
||||
to handle the initial OS bootstrap
|
||||
|
||||
#. [PerServiceManagement] TripleO can manage individual services in isolation,
|
||||
and express and rely on dependencies and ordering between services.
|
||||
|
||||
#. [Predictable/Reproducible/Idempotent] The deployment is predictable
|
||||
|
||||
#. The operator can determine what changes will occur before actually applying
|
||||
those changes.
|
||||
#. The deployment is reproducible in that the operator can re-run the
|
||||
deployment with the same set of inputs and achieve the same results across
|
||||
different environments.
|
||||
#. The deployment is idempotent in that the operator can re-run the
|
||||
deployment with the same set of inputs and no further changes will be made
|
||||
beyond those applied when it was first deployed.
|
||||
#. In the case where a service needs to restart a process, the framework
|
||||
will have an interface that the service can use to notify of the
|
||||
needed restart. In this way, the restarts are predictable.
|
||||
#. The interface for service restarts will allow for a service to describe
|
||||
how it should be restarted in terms of dependencies on other services,
|
||||
simultaneous restarts, or sequential restarts.
|
||||
|
||||
Non-principles
|
||||
--------------
|
||||
|
||||
#. [ContainerImageManagement] The framework does not manage container images.
|
||||
Other than using a given container image to start a container, the framework
|
||||
does not encompass common container image management to include:
|
||||
|
||||
#. Building container images
|
||||
#. Patching container images
|
||||
#. Serving or mirroring container images
|
||||
#. Caching container images
|
||||
|
||||
Specific tools for container image and runtime management and that need to
|
||||
leverage the framework during deployment are expected to be implemented as
|
||||
services.
|
||||
|
||||
#. [SupportingTooling] Tools and software executed by the framework to deploy
|
||||
services or tools required prior to service deployment by the framework are
|
||||
not considered part of the framework itself.
|
||||
|
||||
Examples: podman, TCIB, image-serve, nova-less/metalsmith
|
||||
|
||||
Alternatives & History
|
||||
======================
|
||||
|
||||
Many, if not all, of the principles are already well agreed upon and understood as
|
||||
core to TripleO. Writing them down as policy makes them more discoverable and
|
||||
official.
|
||||
|
||||
Historically, there have been instances when decisions have been guided by
|
||||
desired technical implementation or outcomes. Recording the principles does not
|
||||
necessarily mean those decisions would stop, but it does allow for a more
|
||||
reasonable way to think about the trade offs.
|
||||
|
||||
We do not need to adopt any principles, or record them. However, there is no
|
||||
harm in doing so.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Primary author:
|
||||
James Slagle <jslagle@redhat.com>
|
||||
|
||||
Other contributors:
|
||||
<launchpad-id or None>
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
None.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
None.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
None.
|
||||
|
||||
Revision History
|
||||
================
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - v0.0.1
|
||||
- Introduced
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,109 +0,0 @@
|
||||
=================
|
||||
Patch Abandonment
|
||||
=================
|
||||
|
||||
Goal
|
||||
====
|
||||
|
||||
Provide basic policy that core reviewers can apply to outstanding reviews. As
|
||||
always, it is up to the core reviewers' discretion whether a patch should or
|
||||
should not be abandoned. This policy is just a baseline with some basic rules.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
TripleO consists of many different projects in which many patches become stale
|
||||
or simply forgotten. This can lead to problems when trying to review the
|
||||
current patches for a given project.
|
||||
|
||||
When to Abandon
|
||||
===============
|
||||
|
||||
If a proposed patch has been marked -1 WIP by the author but has sat idle for
|
||||
more than 180 days, a core reviewer should abandon the change with a reference
|
||||
to this policy.
|
||||
|
||||
If a proposed patch is submitted and given a -2 and the patch has sat idle for
|
||||
90 days with no effort to address the -2, a core reviewer should abandon the
|
||||
change with a reference to this policy.
|
||||
|
||||
If a proposed patch becomes stale by ending up with a -1 from CI for 90 days
|
||||
and no activity to resolve the issues, a core reviewer should abandon the
|
||||
change with a reference to this policy.
|
||||
|
||||
If a proposed patch with no activity for 90 days is in merge conflict, even
|
||||
with a +1 from CI, a core reviewer should abandon the change with a reference
|
||||
to this policy.
|
||||
|
||||
When NOT to Abandon
|
||||
===================
|
||||
|
||||
If a proposed patch has no feedback but is +1 from CI, a core reviewer should
|
||||
not abandon such changes.
|
||||
|
||||
If a proposed patch is given a -1 by a reviewer but the patch is +1 from CI and
|
||||
not in merge conflict and the author becomes unresponsive for a few weeks,
|
||||
reviewers can leave a reminder comment on the review to see if there is
|
||||
still interest in the patch. If the issues are trivial then anyone should feel
|
||||
welcome to check out the change and resubmit it using the same change ID to
|
||||
preserve original authorship. Core reviewers should not abandon such changes.
|
||||
|
||||
Restoration
|
||||
===========
|
||||
|
||||
Feel free to restore your own patches. If a change has been abandoned
|
||||
by a core reviewer, anyone can request the restoration of the patch by
|
||||
asking a core reviewer on IRC in #tripleo on OFTC or by sending a
|
||||
request to the openstack-dev mailing list. Should the patch again
|
||||
become stale it may be abandoned again.
|
||||
|
||||
Alternative & History
|
||||
=====================
|
||||
|
||||
This topic was previously brought up on the openstack mailing list [1]_ along
|
||||
with proposed code to use for automated abandonment [2]_. Similar policies are
|
||||
used by the Puppet OpenStack group [3]_.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Primary author:
|
||||
aschultz
|
||||
|
||||
Other contributors:
|
||||
bnemec
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
Pike-2
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] http://lists.openstack.org/pipermail/openstack-dev/2015-October/076666.html
|
||||
.. [2] https://github.com/cybertron/tripleo-auto-abandon
|
||||
.. [3] https://docs.openstack.org/developer/puppet-openstack-guide/reviews.html#abandonment
|
||||
|
||||
Revision History
|
||||
================
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Pike
|
||||
- Introduced
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,163 +0,0 @@
|
||||
=========================
|
||||
Spec Review Process
|
||||
=========================
|
||||
|
||||
Document the existing process to help reviewers, especially newcomers,
|
||||
understand how to review specs. This is migrating the existing wiki
|
||||
documentation into a policy.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Care should be taken when approving specs. An approved spec, and an
|
||||
associated blueprint, indicate that the proposed change has some
|
||||
priority for the TripleO project. We don't want a bunch of approved
|
||||
specs sitting out there that no community members are owning or working
|
||||
on. We also want to make sure that our specs and blueprints are easy to
|
||||
understand and have sufficient detail to effectively communicate
|
||||
the intent of the change. The more effective the communication, the
|
||||
more likely we are to elicit meaningful feedback from the wider
|
||||
community.
|
||||
|
||||
Policy
|
||||
======
|
||||
|
||||
To this end, we should be cognizant of the following checklist when
|
||||
reviewing and approving specs.
|
||||
|
||||
* Broad feedback from interested parties.
|
||||
|
||||
* We should do our best to elicit feedback from operators,
|
||||
non-TripleO developers, end users, and the wider OpenStack
|
||||
community in general.
|
||||
* Mail the appropriate lists, such as openstack-operators and
|
||||
openstack-dev to ask for feedback. Respond to feedback on the list,
|
||||
but also encourage direct comments on the spec itself, as those
|
||||
will be easier for other spec reviewers to find.
|
||||
|
||||
* Overall consensus
|
||||
|
||||
* Check for a general consensus in the spec.
|
||||
* Do reviewers agree this change is meaningful for TripleO?
|
||||
* If they don't have a vested interest in the change, are they at
|
||||
least not objecting to the change?
|
||||
|
||||
* Review older patchsets to make sure everything has been addressed
|
||||
|
||||
* Have any reviewers raised objections in previous patchsets that
|
||||
were not addressed?
|
||||
* Have any potential pitfalls been pointed out that have not been
|
||||
addressed?
|
||||
|
||||
* Impact/Security
|
||||
|
||||
* Ensure that the various Impact (end user, deployer, etc) and
|
||||
Security sections in the spec have some content.
|
||||
* These aren't sections to just gloss over after understanding
|
||||
the implementation and proposed change. They are actually the most
|
||||
important sections.
|
||||
* It would be nice if that content had elicited some feedback. If it
|
||||
didn't, that's probably a good sign that the author and/or
|
||||
reviewers have not yet thought about these sections carefully.
|
||||
|
||||
* Ease of understandability
|
||||
|
||||
* The spec should be easy to understand for those reviewers who are
|
||||
familiar with the project. While the implementation may contain
|
||||
technical details that not everyone will grasp, the overall
|
||||
proposed change should be able to be understood by folks generally
|
||||
familiar with TripleO. Someone who is generally familiar with
|
||||
TripleO is likely someone who has run through the undercloud
|
||||
install, perhaps contributed some code, or participated in reviews.
|
||||
* To aid in comprehension, grammar nits should generally be corrected
|
||||
when they have been pointed out. Be aware though that even nits can
|
||||
cause disagreements, as folks pointing out nits may be wrong
|
||||
themselves. Do not bikeshed over solving disagreements on nits.
|
||||
|
||||
* Implementation
|
||||
|
||||
* Does the implementation make sense?
|
||||
* Are there alternative implementations, perhaps easier ones, and if
|
||||
so, have those been listed in the Alternatives section?
|
||||
* Are reasons for discounting the Alternatives listed in the spec?
|
||||
|
||||
* Ownership
|
||||
|
||||
* Is the spec author the primary assignee?
|
||||
* If not, has the primary assignee reviewed the spec, or at least
|
||||
commented that they agree that they are the primary assignee?
|
||||
|
||||
* Reviewer workload
|
||||
|
||||
* Specs turn into patches to codebases.
|
||||
* A +2 on a spec means that the core reviewer intends to review the
|
||||
patches associated with that spec in addition to their other core
|
||||
commitments for reviewer workload.
|
||||
* A +1 on a spec from a core reviewer indicates that the core
|
||||
reviewer is not necessarily committing to review that spec's
|
||||
patches.
|
||||
* It's fine to +2 even if the spec also relates to other repositories
|
||||
and areas of expertise, in addition to the reviewer's own. We
|
||||
probably would not want to merge any spec that spanned multiple
|
||||
specialties without a representative from each group adding their
|
||||
+2.
|
||||
* Have any additional (perhaps non-core) reviewers volunteered to
|
||||
review patches that implement the spec?
|
||||
* There should be a sufficient number of core reviewers who have
|
||||
volunteered to go above and beyond their typical reviewer workload
|
||||
(indicated by their +2) to review the relevant patches. A
|
||||
"sufficient number" is dependent on the individual spec and the
|
||||
scope of the change.
|
||||
* If reviewers have said they'll be reviewing a spec's patches
|
||||
instead of patches they'd review otherwise, that doesn't help much
|
||||
and is actually harmful to the overall project.
|
||||
|
||||
Alternatives & History
|
||||
======================
|
||||
|
||||
This is migrating the already agreed upon policy from the wiki.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Primary author:
|
||||
james-slagle (from the wiki history)
|
||||
|
||||
Other contributors:
|
||||
jpichon
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Once the policy has merged, an email should be sent to openstack-dev
|
||||
referring to this document.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* Original documentation: https://wiki.openstack.org/wiki/TripleO/SpecReviews
|
||||
|
||||
Revision History
|
||||
================
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Ocata
|
||||
- Migrated from wiki
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,141 +0,0 @@
|
||||
==============
|
||||
TripleO Squads
|
||||
==============
|
||||
|
||||
Scaling up a team is a common challenge in OpenStack.
The number of projects keeps increasing, along with the number of contributors,
and this often implies changes in how the team is organized.
|
||||
This policy is intended to document how we will address this challenge in
|
||||
the TripleO project.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Projects usually start from a single git repository and very often grow to
dozens of repositories doing different things. As a project matures, people
who work together on the same topic need some space to collaborate in the
open.
Currently, TripleO acts as a single team where everyone meets on IRC once a
week to talk about bugs, CI status, and release management.
It also happens very often that new contributors have a hard time finding an
area where they can quickly start to contribute.
Time is precious for our developers, and we need to find a way to allow them
to keep their focus on their own area of work.
|
||||
|
||||
Policy
|
||||
======
|
||||
|
||||
The idea of this policy is to create squads of people who work on the
|
||||
same topic and allow them to keep focus with a minimal amount of external
|
||||
distractions.
|
||||
|
||||
* Anyone would be free to join and leave a squad at will.
|
||||
Right now, there is no size limit for a squad as this is something
|
||||
experimental. If we realize a squad is too big (more than 10 people),
|
||||
we might reconsider the area of focus of the squad.
|
||||
* Anyone can join one or multiple squads at the same time. Squads will be
|
||||
documented in a place anyone can contribute.
|
||||
* Squads are free to organize their own weekly meeting.
|
||||
* #tripleo remains the official IRC channel. We won't add more channels.
|
||||
* Squads will have to choose a representative, who would be a squad liaison
|
||||
with TripleO PTL.
|
||||
* TripleO weekly meeting will still exist, anyone is encouraged to join,
|
||||
but topics would stay high level. Some examples of topics: release
|
||||
management; horizontal discussion between squads, CI status, etc.
|
||||
The meeting would be a TripleO cross-projects meeting.
|
||||
|
||||
We might need to test the idea for at least 1 or 2 months and invest some
|
||||
time to reflect what is working and what could be improved.
|
||||
|
||||
Benefits
|
||||
--------
|
||||
|
||||
* More collaboration is expected between people working on the same topic.
|
||||
It will reflect officially what we have nearly done over the last cycles.
|
||||
* People working on the same area of TripleO would have the possibility
|
||||
to do public and open meetings, where anyone would be free to join.
|
||||
* Newcomers would more easily understand what the TripleO project delivers
|
||||
since squads would provide a good overview of the work we do. Also
|
||||
it would be an opportunity for people who want to learn about a specific
|
||||
area of TripleO to join a new squad and learn from others.
|
||||
* Open more possibilities like setting up a mentoring program for each squad,
|
||||
or specific docs to get involved more quickly.
|
||||
|
||||
Challenges
|
||||
----------
|
||||
|
||||
* We need to avoid creating silos and keep horizontal collaboration.
|
||||
Working in a squad doesn't mean you need to ignore other squads.
|
||||
|
||||
Squads
|
||||
------
|
||||
|
||||
The list tends to be dynamic over the cycles, depending on which topics
|
||||
the team is working on. The list below is subject to change as squads change.
|
||||
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| Squad | Description |
|
||||
+===============================+============================================================================+
|
||||
| ci | Group of people focusing on Continuous Integration tooling and system |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| upgrade | Group of people focusing on TripleO upgrades |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| validations | Group of people focusing on TripleO validations tooling |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| networking | Group of people focusing on networking bits in TripleO |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| integration | Group of people focusing on configuration management (eg: services) |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| security | Group of people focusing on security |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| edge | Group of people focusing on Edge/multi-site/multi-cloud |
|
||||
| | https://etherpad.openstack.org/p/tripleo-edge-squad-status |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
| transformation | Group of people focusing on converting heat templates / puppet to Ansible |
|
||||
| | within the tripleo-ansible framework |
|
||||
+-------------------------------+----------------------------------------------------------------------------+
|
||||
|
||||
.. note::
|
||||
|
||||
Note about CI: the squad is about working together on the tooling used
|
||||
by OpenStack Infra to test TripleO, though every squad is in charge of
keeping its own tests in good shape.
|
||||
|
||||
|
||||
Alternatives & History
|
||||
======================
|
||||
|
||||
One alternative would be to continue that way and keep a single horizontal
team. As we welcome more people into the team and add more projects, the
difficulty of scaling up the TripleO project will only increase.
The number of people involved and the variety of topics make it really
difficult for anyone to be able to work on everything.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Primary author:
|
||||
emacchi
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
Ongoing
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Work with TripleO developers to document the area of work for every squad.
|
||||
* Document the output.
|
||||
* Document squads members.
|
||||
* Setup Squad meetings if needed.
|
||||
* For each squad, find a liaison or a squad leader.
|
||||
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,113 +0,0 @@
|
||||
==================
|
||||
Tech Debt Tracking
|
||||
==================
|
||||
|
||||
Goal
|
||||
====
|
||||
|
||||
Provide a basic policy for tracking and being able to reference tech debt
|
||||
related changes in TripleO.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
During the development of TripleO, sometimes tech debt is acquired due to time
|
||||
or resource constraints. Without a solid way of tracking when
|
||||
we intentionally add tech debt, it is hard to quantify how much tech debt is
|
||||
being self-inflicted. Additionally, tech debt gets lost in the code and without
|
||||
a way to remember where we left it, it is almost impossible to remember when
|
||||
and where we need to go back to fix some known issues.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Tracking Code Tech Debt with Bugs
|
||||
---------------------------------
|
||||
|
||||
Intentionally created tech debt items should have a bug [1]_ created with the
|
||||
`tech-debt` tag added to it. Additionally the commit message of the change
|
||||
should reference this `tech-debt` bug and if possible a comment should be added
|
||||
into the code referencing who put it in there.
|
||||
|
||||
Example Commit Message::
|
||||
|
||||
Always exit 0 because foo is currently broken
|
||||
|
||||
We need to always exit 0 because the foo process erroneously returns
|
||||
42. A bug has been reported upstream but we are not sure when it
|
||||
will be addressed.
|
||||
|
||||
Related-Bug: #1234567
|
||||
|
||||
Example Comment::
|
||||
|
||||
# TODO(aschultz): We need this because the world is falling apart LP#1234567
|
||||
foo || exit 0
|
||||
|
||||
Triaging Bugs as Tech Debt
|
||||
--------------------------
|
||||
|
||||
If an end user reports a bug that we know is a tech debt item, the person
|
||||
triaging the bug should add the `tech-debt` tag to the bug.
|
||||
|
||||
Reporting Tech Debt
|
||||
-------------------
|
||||
|
||||
With the `tech-debt` tag on bugs, we should be able to keep a running count
|
||||
of the bugs we have labeled and should report on this every release milestone
|
||||
to see trends around how much is being added and when. As part of our triaging
|
||||
of bugs, we should strive to add net-zero tech-debt bugs each major release if
|
||||
possible.
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
We continue to not track any of these things and continue to rely on developers
to remember the code they added and to circle back around to fix it themselves,
or on other developers to find the issue and remove it.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Core reviewers should request that any tech debt be appropriately tracked and
|
||||
feel free to -1 any patches that are adding tech debt without proper
|
||||
attribution.
|
||||
|
||||
Author(s)
|
||||
---------
|
||||
|
||||
Primary author:
|
||||
aschultz
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
Queens-1
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* aschultz to create tech-debt tag in Launchpad.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] https://docs.openstack.org/tripleo-docs/latest/contributor/contributions.html#reporting-bugs
|
||||
|
||||
Revision History
|
||||
================
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Queens
|
||||
- Introduced
|
||||
|
||||
.. note::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
@ -1,351 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================
|
||||
Fast-forward upgrades
|
||||
=====================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/fast-forward-upgrades
|
||||
|
||||
Fast-forward upgrades are upgrades that move an environment from release `N` to
|
||||
`N+X` in a single step, where `X` is greater than `1` and for fast-forward
|
||||
upgrades is typically `3`. This spec outlines how such upgrades can be
|
||||
orchestrated by TripleO between the Newton and Queens OpenStack releases.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
OpenStack upgrades are often seen by operators as problematic [1]_ [2]_.
|
||||
Whilst TripleO upgrades have improved greatly over recent cycles many operators
|
||||
are still reluctant to upgrade with each new release.
|
||||
|
||||
This often leads to a situation where environments remain on the release used
|
||||
when first deployed. Eventually this release will come to the end of its
|
||||
supported life (EOL), forcing operators to upgrade to the next supported
|
||||
release. There can also be restrictions imposed on an environment that simply
|
||||
do not allow upgrades to be performed ahead of the EOL of a given release,
|
||||
forcing operators to again wait until the release hits EOL.
|
||||
|
||||
While it is possible to then linearly upgrade to a supported release with the
|
||||
cadence of upstream releases, downstream distributions providing long-term
|
||||
support (LTS) releases may not be able to provide the same path once the
|
||||
initially installed release reaches EOL. Operators in such a situation may also
|
||||
want to avoid running multiple lengthy linear upgrades to reach their desired
|
||||
release.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
TripleO support for fast-forward upgrades will first target `N` to `N+3`
|
||||
upgrades between the Newton and Queens releases:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
Newton Ocata Pike Queens
|
||||
+-----+ +-----+ +-----+ +-----+
|
||||
| | | N+1 | | N+2 | | |
|
||||
| N | ---------------------> | N+3 |
|
||||
| | | | | | | |
|
||||
+-----+ +-----+ +-----+ +-----+
|
||||
|
||||
|
||||
This will give the impression of the Ocata and Pike releases being skipped with
|
||||
the fast-forward upgrade moving the environment from Newton to Queens. In
|
||||
reality, as OpenStack projects with the `supports-upgrade` tag are only required
|
||||
to support `N` to `N+1` upgrades [3]_, the upgrade will still need to move
|
||||
through each release, completing database migrations and a limited set of other
|
||||
tasks.
|
||||
|
||||
Caveats
|
||||
-------
|
||||
|
||||
Before outlining the suggested changes to TripleO it is worth highlighting the
|
||||
following caveats for fast-forward upgrades:
|
||||
|
||||
* The control plane is inaccessible for the duration of the upgrade
|
||||
* The data plane and active workloads must remain available for the duration of
|
||||
the upgrade.
|
||||
|
||||
Prerequisites
|
||||
-------------
|
||||
|
||||
Prior to the overcloud fast-forward upgrade starting the following prerequisite
|
||||
tasks must be completed:
|
||||
|
||||
* Rolling minor update of the overcloud on `N`
|
||||
|
||||
This is a normal TripleO overcloud update [4]_ and should bring each node in
|
||||
the environment up to the latest supported version of the underlying OS and
|
||||
pulling in the latest packages. Operators can then reboot the nodes as
|
||||
required. The reboot ensures that the latest kernel, openvswitch, QEMU and any
other reboot-dependent package is reloaded before proceeding with the upgrade.
|
||||
This can happen well in advance of the overcloud fast-forward upgrade and
|
||||
should remove the need for additional reboots during the upgrade.
|
||||
|
||||
* Upgrade undercloud from `N` to `N+3`
|
||||
|
||||
The undercloud also needs to be upgraded to `N+3` ahead of any overcloud
|
||||
upgrade. Again this can happen well in advance of the overcloud upgrade. For
|
||||
the time being this is a traditional, linear upgrade between `N` and `N+1`
|
||||
releases until we reach the target `N+3` Queens release.
|
||||
|
||||
* Container images cached prior to the start of the upgrade
|
||||
|
||||
With the introduction of containerised TripleO overclouds in Pike operators
|
||||
will need to cache the required container images prior to the fast-forward
|
||||
upgrade if they wish to end up with a containerised Queens overcloud.
|
||||
|
||||
High level flow
|
||||
---------------
|
||||
|
||||
At a high level the following actions will be carried out by the fast-forward
|
||||
upgrade to move the overcloud from `N` to `N+3`:
|
||||
|
||||
* Stop all OpenStack control and compute services across all roles
|
||||
|
||||
This will bring down the OpenStack control plane, leaving infrastructure
|
||||
services such as the databases running, while allowing any workloads to
|
||||
continue running without interruption. For HA environments this will disable
|
||||
the cluster, ensuring that OpenStack services are not restarted.
|
||||
|
||||
* Upgrade a single host from `N` to `N+1` then `N+1` to `N+2`
|
||||
|
||||
As alluded to earlier, OpenStack projects currently only support `N` to `N+1`
|
||||
upgrades and so fast-forward upgrades still need to cycle through each release in
|
||||
order to complete data migrations and any other tasks that are required before
|
||||
these migrations can be completed. This part of the upgrade is limited to a
|
||||
single host per role to ensure this is completed as quickly as possible.
|
||||
|
||||
* Optional upgrade and deployment of single canary compute host to `N+3`
|
||||
|
||||
As fast-forward upgrades aim to ensure workloads are online and accessible
|
||||
during the upgrade we can optionally upgrade all control service hosting roles
|
||||
_and_ a single canary compute to `N+3` to verify that workloads will remain
|
||||
active and accessible during the upgrade.
|
||||
|
||||
A canary compute node will be selected at the start of the upgrade and have
|
||||
instances launched on it to validate that both it and the data plane remain
|
||||
active during the upgrade. The upgrade will halt if either become inaccessible
|
||||
with a recovery procedure being provided to move all hosts back to `N+1`
|
||||
without further disruption to the active workloads on the untouched compute
|
||||
hosts.
|
||||
|
||||
* Upgrade and deployment of all roles to `N+3`
|
||||
|
||||
If the above optional canary compute host upgrade is not used then the final
|
||||
action in the fast-forward upgrade will be a traditional `N` to `N+1` migration
|
||||
between `N+2` and `N+3` followed by the deployment of all roles on `N+3`. This
|
||||
final action is essentially a redeployment of the overcloud to containers on
|
||||
`N+3` (Queens) as previously seen when upgrading TripleO environments from
|
||||
Ocata to Pike.
|
||||
|
||||
A python-tripleoclient command and associated Mistral workflow will control if
|
||||
this final step is applied to all roles in parallel (default), all hosts in a
|
||||
given role or selected hosts in a given role. The latter being useful if a user
|
||||
wants to control the order in which computes are moved from `N+1` to `N+3` etc.
|
||||
|
||||
Implementation
|
||||
--------------
|
||||
|
||||
As with updates [5]_ and upgrades [6]_, specific fast-forward upgrade Ansible
tasks associated with the first two actions above will be introduced into the
`tripleo-heat-templates` service templates for each service as `RoleConfig`
|
||||
outputs.
|
||||
|
||||
As with `upgrade_tasks` each task is associated with a particular step in the
|
||||
process. For `fast_forward_upgrade_tasks` these steps are split between prep
|
||||
tasks that apply to all hosts and bootstrap tasks that only apply to a single
|
||||
host for a given role.
|
||||
|
||||
Prep step tasks will map to the following actions:
|
||||
|
||||
- Step=1: Disable the overall cluster
|
||||
- Step=2: Stop OpenStack services
|
||||
- Step=3: Update host repositories
|
||||
|
||||
Bootstrap step tasks will map to the following actions:
|
||||
|
||||
- Step=4: Take OpenStack DB backups
|
||||
- Step=5: Pre package update commands
|
||||
- Step=6: Update required packages
|
||||
- Step=7: Post package update commands
|
||||
- Step=8: OpenStack service DB sync
|
||||
- Step=9: Validation
|
||||
|
||||
As with `update_tasks` each task will use simple `when` conditionals to
|
||||
identify which step and release(s) it is associated with, ensuring these tasks
|
||||
are executed at the correct point in the upgrade.
|
||||
|
||||
For example, a step 2 `fast_forward_upgrade_task` task on Ocata is listed below:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
fast_forward_upgrade_tasks:
|
||||
- name: Example Ocata step 2 task
|
||||
command: /bin/foo bar
|
||||
when:
|
||||
- step|int == 2
|
||||
- release == 'ocata'
|
||||
|
||||
|
||||
These tasks will then be collated into role specific Ansible playbooks via the
|
||||
RoleConfig output of the `overcloud` heat template, with step and release
|
||||
variables being fed in to ensure tasks are executed in the correct order.
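
To make the collation more concrete, the following is a rough sketch of what
a generated, role-specific playbook could look like. The file name, the loop
layout and the exact step/release bounds are assumptions made for
illustration; the real output of `RoleConfig` may differ.

.. code-block:: yaml

    # Illustrative only: replay the collected fast_forward_upgrade_tasks for
    # one role, release by release and step by step, so that the per-task
    # "when" conditionals fire in the intended order.
    - hosts: Controller
      gather_facts: false
      tasks:
        - include_tasks: Controller_fast_forward_upgrade_tasks.yaml
          vars:
            release: "{{ item.0 }}"
            step: "{{ item.1 }}"
          with_nested:
            - ['ocata', 'pike']
            - [1, 2, 3, 4, 5, 6, 7, 8, 9]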
|
||||
|
||||
As with `major upgrades` [8]_ a new mistral workflow and tripleoclient command
|
||||
will be introduced to generate and execute the associated Ansible tasks.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
openstack overcloud fast-forward-upgrade --templates [..path to latest THT..] \
|
||||
[..original environment arguments..] \
|
||||
[..new container environment arguments..]
|
||||
|
||||
Operators will also be able to generate [7]_ , download and review the
|
||||
playbooks ahead of time using the latest version of `tripleo-heat-templates`
|
||||
with the following commands:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
openstack overcloud deploy --templates [..path to latest THT..] \
|
||||
[..original environment arguments..] \
|
||||
[..new container environment arguments..]
|
||||
-e environments/fast-forward-upgrade.yaml \
|
||||
-e environments/noop-deploy-steps.yaml
|
||||
openstack overcloud config download
|
||||
|
||||
|
||||
Dev workflow
|
||||
------------
|
||||
|
||||
The existing tripleo-upgrade Ansible role will be used to automate the
|
||||
fast-forward upgrade process for use by developers and CI, including the
|
||||
initial overcloud minor update, undercloud upgrade to `N+3` and fast-forward
|
||||
upgrade itself.
|
||||
|
||||
Developers working on fast_forward_upgrade_tasks will also be able to deploy
|
||||
minimal overcloud deployments via `tripleo-quickstart` using release configs
|
||||
also used by CI.
|
||||
|
||||
Further, when developing tasks, developers will be able to manually render and
|
||||
run `fast_forward_upgrade_tasks` as standalone Ansible playbooks, allowing them
|
||||
to run a subset of the tasks against specific nodes using
|
||||
`tripleo-ansible-inventory`. Examples of how to do this will be documented
|
||||
hopefully ensuring a smooth development experience for anyone looking to
|
||||
contribute tasks for specific services.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
* Continue to force operators to upgrade linearly through each major release
|
||||
* Parallel cloud migrations.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
N/A
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
* The control plane will be down for the duration of the upgrade
|
||||
* The data plane and workloads will remain up.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
N/A
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
N/A
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
* Third party service template providers will need to provide
|
||||
fast_forward_upgrade_tasks in their THT service configurations.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignees:
|
||||
|
||||
* lbezdick
|
||||
* marios
|
||||
* chem
|
||||
|
||||
Other contributors:
|
||||
|
||||
* shardy
|
||||
* lyarwood
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Introduce fast_forward_upgrades_playbook.yaml to RoleConfig
|
||||
* Introduce fast_forward_upgrade_tasks in each service template
|
||||
* Introduce a python-tripleoclient command and associated Mistral workflow.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* TripleO - Ansible upgrade Workflow with UI integration [9]_
|
||||
|
||||
The new major upgrade workflow being introduced for Pike to Queens upgrades
|
||||
will obviously impact what fast-forward upgrades look like to Queens. At
|
||||
present the high level flow for fast-forward upgrades assumes that we can reuse
|
||||
the current `upgrade_tasks` between N+2 and N+3 to disable and then potentially
|
||||
remove baremetal services. This is likely to change as the major upgrade
|
||||
workflow is introduced and so it is likely that these steps will need to be
|
||||
encoded in `fast_forward_upgrade_tasks`.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
* Third party CI jobs will need to be created to test Newton to Queens using
|
||||
RDO given the upstream EOL of stable/newton with the release of Pike.
|
||||
|
||||
* These jobs should cover the initial undercloud upgrade, overcloud upgrade and
|
||||
optional canary compute node checks.
|
||||
|
||||
* An additional third party CI job will be required to verify that a Queens
|
||||
undercloud can correctly manage a Newton overcloud, allowing the separation
|
||||
of the undercloud upgrade and fast-forward upgrade discussed under
|
||||
prerequisites.
|
||||
|
||||
* Finally, minimal overcloud roles should be used to verify the upgrade for
|
||||
certain services. For example, when changes are made to the
|
||||
`fast_forward_upgrade_tasks` of Nova via changes to
|
||||
`docker/services/nova-*.yaml` files then a basic overcloud deployment of
|
||||
Keystone, Glance, Swift, Cinder, Neutron and Nova could be used to quickly
|
||||
verify the changes in regards to fast-forward upgrades.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
* This will require extensive developer and user documentation to be written,
|
||||
most likely in a new section of the docs specifically detailing the
|
||||
fast-forward upgrade flow.
|
||||
|
||||
References
|
||||
==========
|
||||
.. [1] https://etherpad.openstack.org/p/MEX-ops-migrations-upgrades
|
||||
.. [2] https://etherpad.openstack.org/p/BOS-forum-skip-level-upgrading
|
||||
.. [3] https://governance.openstack.org/tc/reference/tags/assert_supports-upgrade.html
|
||||
.. [4] http://tripleo.org/install/post_deployment/package_update.html
|
||||
.. [5] https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/README.rst#update-steps
|
||||
.. [6] https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/README.rst#upgrade-steps
|
||||
.. [7] https://review.openstack.org/#/c/495658/
|
||||
.. [8] https://review.openstack.org/#/q/topic:major-upgrade+(status:open+OR+status:merged)
|
||||
.. [9] https://specs.openstack.org/openstack/tripleo-specs/specs/queens/tripleo_ansible_upgrades_workflow.html
|
@ -1,145 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================
|
||||
Instance High Availability
|
||||
==========================
|
||||
|
||||
Include the URL of your launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/instance-ha
|
||||
|
||||
A feature very often requested by operators and customers is the ability to
|
||||
automatically resurrect VMs that were running on a compute node that failed (either
|
||||
due to hardware failures, networking issues or general server problems).
|
||||
Currently we have a downstream-only procedure which consists of many manual
|
||||
steps to configure Instance HA:
|
||||
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/chapter-1-overview
|
||||
|
||||
What we would like to implement here is basically an optional opt-in automatic
|
||||
deployment of a cloud that has Instance HA support.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently, if a compute node has a hardware failure or a kernel panic, all the
instances that were running on the node will be gone and manual intervention
|
||||
is needed to resurrect these instances on another compute node.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The proposed change would be to add a few additional puppet-tripleo profiles that would help
|
||||
us configure the pacemaker resources needed for instance HA. Unlike in previous iterations
|
||||
we won't need to move nova-compute resources under pacemaker's management. We managed to
|
||||
achieve the same result without touching the compute nodes (except by setting
|
||||
up pacemaker_remote on the computes, but that support exists already).
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
There are a few specs that are modeling host recovery:
|
||||
|
||||
Host Recovery - https://review.openstack.org/#/c/386554/
|
||||
Instances auto evacuation - https://review.openstack.org/#/c/257809
|
||||
|
||||
The first spec uses pacemaker in a very similar way but is too new
|
||||
and too high level to really be able to comment at this point in time.
|
||||
The second one has been stalled for a long time and it looks like there
|
||||
is no consensus yet on the approaches needed. The long-term goal is
|
||||
to morph the Instance HA deployment into the spec that gets accepted.
|
||||
We are actively working on both specs as well. In any case we have
|
||||
discussed the long-term plan with SUSE and NTT and we agreed
|
||||
on a long-term plan of which this spec is the first step for TripleO.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
No additional security impact.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
End users are not impacted except for the fact that VMs can be resurrected
|
||||
automatically on a non-failed compute node.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
There are no performance related impacts as compared to a current deployment.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
This change does not affect the default deployments. What it does is add a boolean
|
||||
and some additional profiles so that a deployer can have a cloud configured with Instance
|
||||
HA support out of the box.
|
||||
|
||||
* One top-level parameter to enable the Instance HA deployment (see the sketch after this list)
|
||||
|
||||
* Although fencing configuration is already supported by tripleo, we will need
|
||||
to improve bits and pieces so that we won't need an extra command to generate the
|
||||
fencing parameters.
|
||||
|
||||
* Upgrades will be impacted by this change in the sense that we will need to make sure to test
|
||||
them when Instance HA is enabled.
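
A minimal sketch of what the opt-in could look like from the deployer's side
is shown below; the parameter name is a placeholder for the boolean mentioned
above, not a settled interface.

.. code-block:: yaml

    # Illustrative environment file passed with "-e" at deployment time.
    # "EnableInstanceHA" stands in for whatever the top-level boolean ends
    # up being called.
    parameter_defaults:
      EnableInstanceHA: true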
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
No developer impact is planned.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
michele
|
||||
|
||||
Other contributors:
|
||||
cmsj, abeekhof
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Make the fencing configuration fully automated (this is mostly done already, we need oooq integration
|
||||
and some optimization)
|
||||
|
||||
* Add the logic and needed resources on the control-plane
|
||||
|
||||
* Test the upgrade path when Instance HA is configured
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Testing this manually is fairly simple:
|
||||
|
||||
* Deploy with Instance HA configured and two compute nodes
|
||||
|
||||
* Spawn a test VM
|
||||
|
||||
* Crash the compute node where the VM is running (a sketch for automating this step follows the list)
|
||||
|
||||
* Observe the VM being resurrected on the other compute node
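
A minimal sketch of how the "crash the compute node" step could be automated
with Ansible is shown below; the host name is an example and the play assumes
SSH access to the overcloud nodes.

.. code-block:: yaml

    # Illustrative only and destructive: force an immediate kernel panic on
    # the compute node hosting the test VM. Only run this against a
    # disposable test node.
    - hosts: overcloud-novacompute-0
      become: true
      gather_facts: false
      tasks:
        - name: Make sure the sysrq trigger is enabled
          sysctl:
            name: kernel.sysrq
            value: "1"
            sysctl_set: true
        - name: Trigger a kernel panic via sysrq
          shell: echo c > /proc/sysrq-trigger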
|
||||
|
||||
Testing this in CI is doable but might be a bit more challenging due to resource constraints.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
A section under advanced configuration is needed explaining the deployment of
|
||||
a cloud that supports Instance HA.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/
|
@ -1,189 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
========================
|
||||
IPSEC encrypted networks
|
||||
========================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/ipsec
|
||||
|
||||
This proposes the usage of IPSEC tunnels for encrypting all communications in a
|
||||
TripleO cloud.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Having everything in the network encrypted is a hard requirement for certain
|
||||
use-cases. While TLS everywhere provides support for this, not everyone wants a
|
||||
full-fledged CA. IPSEC provides an alternative which requires one component
|
||||
less (the CA) while still fulfilling the security requirements, with the
|
||||
downside that IPSEC tunnel configurations can get quite verbose.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
As mentioned in the mailing list [1], for OSP10 we already worked on an ansible
|
||||
role that runs on top of a TripleO deployment [2].
|
||||
|
||||
It does the following:
|
||||
|
||||
* Installs IPSEC if it's not available in the system.
|
||||
|
||||
* Sets up the firewall rules.
|
||||
|
||||
* Based on a hard-coded set of networks, it discovers the IP addresses for each
|
||||
of them.
|
||||
|
||||
* Based on a hard-coded set of networks, it discovers the Virtual IP addresses
|
||||
(including the Redis VIP).
|
||||
|
||||
* It puts up an IPSEC tunnel for most IPs in each network.
|
||||
|
||||
- Regular IPs are handled as a point-to-point IPSEC tunnel.
|
||||
|
||||
- Virtual IPs are handled with road-warrior configurations. This means that
|
||||
the VIP's tunnel listens for any connections. This enables easier
|
||||
configuration of the tunnel, as the VIP-holder doesn't need to be aware of nor
|
||||
configure each tunnel.
|
||||
|
||||
- Similarly to TLS everywhere, this focuses on service-to-service
|
||||
communication, so we explicitly skip the tenant network. Or,
|
||||
as it was in the original ansible role, compute-to-compute communication.
|
||||
This significantly reduces the amount of tunnels we need to set up, but
|
||||
leaves application security to the deployer.
|
||||
|
||||
- Authentication for the tunnels is done via a Pre-Shared Key (PSK), which is
|
||||
shared between all nodes.
|
||||
|
||||
* Finally, it creates an OCF resource that tracks each VIP and puts up or down
|
||||
its corresponding IPSEC tunnel depending on the VIP's location.
|
||||
|
||||
- While this resource is still in the repository [3], it has now landed
|
||||
upstream [4]. Once this resource is available in the packaged version of
|
||||
the resource agents, the preferred version will be the packaged one.
|
||||
|
||||
- This resource effectively handles VIP fail-overs, by detecting that a VIP
|
||||
is no longer hosted by the node, it cleanly puts down the IPSEC tunnel and
|
||||
enables it where the VIP is now hosted.
|
||||
|
||||
All of this work is already part of the role; however, to have better
|
||||
integration with the current state of TripleO, the following work is needed:
|
||||
|
||||
* Support for composable networks.
|
||||
|
||||
- Now that composable networks are a thing, we can no longer rely on the
|
||||
hard-coded values we had in the role.
|
||||
|
||||
- Fortunately, this is information we can get from the tripleo dynamic
|
||||
inventory. So we would need to add information about the available networks
|
||||
and the VIPs.
|
||||
|
||||
* Configurable skipping of networks.
|
||||
|
||||
- In order to address the tenant network skipping, we need to somehow make it
|
||||
configurable.
|
||||
|
||||
* Add the IPSEC package as part of the image.
|
||||
|
||||
* Configure Firewall rules the TripleO way.
|
||||
|
||||
- Currently the role handles the firewall rule setup. However, it should be
|
||||
fairly simple to configure these rules the same way other services
|
||||
configure theirs (using the tripleo.<service>.firewall_rules entry). This
|
||||
will require the usage of a composable service template.
|
||||
|
||||
* As mentioned above, we will need to create a composable service template (a rough sketch follows this list).
|
||||
|
||||
- This could take into use the recently added `external_deploy_tasks` section
|
||||
of the templates, which will work similarly to the Kubernetes configuration
|
||||
and would rely on the config-download mechanism [5].
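
The following is a hedged sketch of what such a composable service template
could expose, combining the firewall rules with the ``external_deploy_tasks``
hook; the service name, rule numbering and task layout are assumptions for
illustration only.

.. code-block:: yaml

    # Illustrative extract of a hypothetical service template output.
    outputs:
      role_data:
        description: Role data for the IPSEC service
        value:
          service_name: ipsec
          config_settings:
            tripleo.ipsec.firewall_rules:
              '294 allow IKE and NAT-T':
                proto: udp
                dport:
                  - 500
                  - 4500
          external_deploy_tasks:
            - name: Run the tripleo-ipsec role from the undercloud
              include_role:
                name: tripleo-ipsec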
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Deployers can already use TLS everywhere. However, a few are already using the
aforementioned ansible role, so this would provide a seamless upgrade path for
them.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
This by itself is a security enhancement, as it enables encryption in the
|
||||
network.
|
||||
|
||||
The PSK being shared by all the nodes is not ideal and could be addressed by
|
||||
per-network PSKs. However, this work could be done in further iterations.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Currently, the deployer needs to provide their PSK. However, this could be
|
||||
automated as part of the tasks that TripleO does.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Same as with TLS everywhere, adding encryption in the network will have a
|
||||
performance impact. We currently don't have concrete data on what this impact
|
||||
actually is.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
This would be added as a composable service. So it would be something that the
|
||||
deployer would need to enable via an environment file.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
jaosorior
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Add the libreswan (IPSEC's frontend) package to the overcloud-full image.
|
||||
|
||||
* Add required information to the dynamic inventory (networks and VIPs)
|
||||
|
||||
* Based on the inventory, create the IPSEC tunnels dynamically, and not based
|
||||
on the hardcoded networks.
|
||||
|
||||
* Add tripleo-ipsec ansible role as part of the TripleO umbrella.
|
||||
|
||||
* Create composable service.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* This requires the tripleo-ipsec role to be available. For this, it will be
|
||||
moved to the TripleO umbrella and packaged as such.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Given that this doesn't require an extra component, we could test this as part
|
||||
of our upstream tests. The requirement is that the deployment has
|
||||
network-isolation enabled.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] http://lists.openstack.org/pipermail/openstack-dev/2017-November/124615.html
|
||||
[2] https://github.com/JAORMX/tripleo-ipsec
|
||||
[3] https://github.com/JAORMX/tripleo-ipsec/blob/master/files/ipsec-resource-agent.sh
|
||||
[4] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/ipsec
|
||||
[5] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/services/kubernetes-master.yaml#L58
|
@ -1,115 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================
|
||||
Network configuration
|
||||
=====================
|
||||
|
||||
Network configuration for the TripleO GUI
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently, it's not possible to make advanced network configurations using the
|
||||
TripleO GUI.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
In the GUI, we will provide a wizard to guide the user through configuring the
|
||||
networks of their deployment. The user will be able to assign networks to
|
||||
roles, and configure additional network parameters. We will use the
|
||||
``network_data.yaml`` in the `TripleO Heat Templates`_. The idea is to expose
|
||||
the data in ``network_data.yaml`` via the web interface.
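
For reference, a single network entry in ``network_data.yaml`` looks roughly
like the following; the values shown are illustrative, but these are the kinds
of fields the wizard would read and write.

.. code-block:: yaml

    # Illustrative network_data.yaml entry; names and addresses are examples.
    - name: Storage
      name_lower: storage
      vip: true
      enabled: true
      vlan: 30
      ip_subnet: '172.16.1.0/24'
      allocation_pools:
        - start: '172.16.1.4'
          end: '172.16.1.250'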
|
||||
|
||||
In addition to the wizard, we will implement a dynamic network topology diagram
|
||||
to visually present the configured networks. This will enable the Deployer to
|
||||
quickly validate their work. The diagram will rely on ``network_data.yaml``
|
||||
and ``roles_data.yaml`` for the actual configuration.
|
||||
|
||||
For details, please see the `wireframes`_.
|
||||
|
||||
.. _wireframes: https://openstack.invisionapp.com/share/UM87J4NBQ#/screens
|
||||
.. _TripleO Heat Templates: https://review.openstack.org/#/c/409921/
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
As an alternative, heat templates can be edited manually to allow customization
|
||||
before uploading.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The Deployer could accidentally misconfigure the network topology, and thereby
|
||||
cause data to be exposed.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The addition of the configuration wizard and the network topology diagram should
|
||||
have no performance impact on the amount of time needed to run a deployment.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
As with any new substantial feature, the impact on the developer is cognitive.
|
||||
We will have to gain a detailed understanding of network configuration in
|
||||
``network_data.yaml``. Also, testing will add overhead to our efforts.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
We can proceed with implementation immediately.
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
hpokorny
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Network configuration wizard
|
||||
- Reading data from the backend
|
||||
- Saving changes
|
||||
- UI based on wireframes
|
||||
* Network topology diagram
|
||||
- Investigate suitable javascript libraries
|
||||
- UI based on wireframes
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* The presence of ``roles_data.yaml`` and ``network_data.yaml`` in the plan
|
||||
* A javascript library for drawing the diagram
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Testing shouldn't pose any real challenges with the exception of the network
|
||||
topology diagram rendering. At best, this is currently unknown as it depends on
|
||||
the chosen javascript library. Verifying that the correct diagram is displayed
|
||||
using automated testing might be non-trivial.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
We should document the new settings introduced by the wizard. The documentation
|
||||
should be transferable between the heat template project and TripleO UI.
|
||||
|
||||
References
|
||||
==========
|
@ -1,316 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==============================================
|
||||
Tripleo RPC and Notification Messaging Support
|
||||
==============================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo
|
||||
|
||||
This specification proposes changes to tripleo to enable the selection
|
||||
and configuration of separate messaging backends for oslo.messaging
|
||||
RPC and Notification communications. This proposal is a derivative of
|
||||
the work associated with the original blueprint [1]_ and specification
|
||||
[2]_ to enable dual backends for oslo.messaging in tripleo.
|
||||
|
||||
Most of the groundwork to enable dual backends was implemented during
the Pike release, when an alternative messaging backend (qdrouterd)
service was introduced. Presently, the deployment of this alternative
messaging backend is accomplished by aliasing the rabbitmq service, as
the tripleo implementation does not model separate messaging backends.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The oslo.messaging library supports the deployment of dual messaging
|
||||
system backends for RPC and Notification communications. However, tripleo
|
||||
currently deploys a single rabbitmq server (cluster) that serves as a
|
||||
single messaging backend for both RPC and Notifications.
|
||||
|
||||
::
|
||||
|
||||
+------------+ +----------+
|
||||
| RPC Caller | | Notifier |
|
||||
+-----+------+ +----+-----+
|
||||
| |
|
||||
+--+ +--+
|
||||
| |
|
||||
v v
|
||||
+-+---------------+-+
|
||||
| RabbitMQ Service |
|
||||
| Messaging Backend |
|
||||
| |
|
||||
+-+---------------+-+
|
||||
^ ^
|
||||
| |
|
||||
+--+ +--+
|
||||
| |
|
||||
v v
|
||||
+------+-----+ +------+-------+
|
||||
| RPC | | Notification |
|
||||
| Server | | Server |
|
||||
+------------+ +--------------+
|
||||
|
||||
|
||||
To support two separate and distinct messaging backends, tripleo needs
|
||||
to "duplicate" the set of parameters needed to specify each messaging
|
||||
system. The oslo.messaging library in OpenStack provides the API to the
|
||||
messaging services. It is proposed that the implementation model the
|
||||
RPC and Notification messaging services in place of the backend
|
||||
messaging server (e.g. rabbitmq).
|
||||
|
||||
::
|
||||
|
||||
+------------+ +----------+
|
||||
| RPC Caller | | Notifier |
|
||||
+-----+------+ +----+-----+
|
||||
| |
|
||||
| |
|
||||
v v
|
||||
+-------------------+ +-------------------+
|
||||
| RPC | | Notification |
|
||||
| Messaging Service | | Messaging Service |
|
||||
| | | |
|
||||
+--------+----------+ +--------+----------+
|
||||
| |
|
||||
| |
|
||||
v v
|
||||
+------------+ +------+-------+
|
||||
| RPC | | Notification |
|
||||
| Server | | Server |
|
||||
+------------+ +--------------+
|
||||
|
||||
|
||||
Introducing the separate messaging services and associated parameters in place
|
||||
of the rabbitmq server is not a major rework but special consideration
|
||||
must be given to upgrade paths and capabilities to ensure that existing
|
||||
configurations are not impacted.
|
||||
|
||||
Having separate messaging backends for RPC and Notification
|
||||
communications provides a number of benefits. These benefits include:
|
||||
|
||||
* tuning the backend to the messaging patterns
|
||||
* increased aggregate message capacity
|
||||
* reduced applied load to messaging servers
|
||||
* increased message throughput
|
||||
* reduced message latency
|
||||
* etc.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
A number of issues need to be resolved in order to express RPC
|
||||
and Notification messaging services on top of the backend messaging systems.
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The proposed change is similar to the concept of a service "backend"
|
||||
that is configured by tripleo. A number of existing services support
|
||||
such a backend (or plugin) model. The implementation of a messaging
|
||||
service backend model should account for the following requirements:
|
||||
|
||||
* deploy a single messaging backend for both RPC and Notifications
|
||||
* deploy a messaging backend twice, once for RPC and once for
|
||||
Notifications
|
||||
* deploy a messaging backend for RPC and a different messaging backend
|
||||
for Notifications
|
||||
* deploy an external messaging backend for RPC
|
||||
* deploy an external messaging backend for Notifications
|
||||
|
||||
Generally, the parameters that were required for deployment of the
|
||||
rabbitmq service should be duplicated and renamed to "RPC Messaging"
|
||||
and "Notify Messaging" backend service definitions. Individual backend
|
||||
files would exist for each possible backend type (e.g. rabbitmq,
|
||||
qdrouterd, zeromq, kafka or external). The backend selected will
|
||||
correspondingly define the messaging transport for the messaging
|
||||
system.
|
||||
|
||||
* transport specifier
|
||||
* username
|
||||
* password (and generation)
|
||||
* host
|
||||
* port
|
||||
* virtual host(s)
|
||||
* ssl (enabled)
|
||||
* ssl configuration
|
||||
* health checks
|
||||
|
||||
Tripleo should continue to have a default configuration that deploys
|
||||
RPC and Notifications messaging services on top of a single rabbitmq
|
||||
backend server (cluster). Tripleo upgrades should map the legacy
|
||||
rabbitmq service deployment onto the RPC and Notification messaging
|
||||
services model.
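
For illustration only, the per-service oslo.messaging configuration that results
from separated backends is sketched below; the host names, credentials and the
choice of a kafka backend for Notifications are assumptions, not part of this
spec::

    [DEFAULT]
    # RPC messaging service (e.g. backed by rabbitmq or qdrouterd)
    transport_url = rabbit://rpc_user:rpc_password@rpc-host-0:5672/

    [oslo_messaging_notifications]
    # Notification messaging service, potentially a different backend/cluster
    transport_url = kafka://notify_user:notify_password@notify-host-0:9092/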
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The configuration of separate messaging backends could be post
|
||||
overcloud deployment (e.g. external to tripleo framework). This would
|
||||
be problematic over the lifecycle of deployments, e.g. during upgrades.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The deployment of dual messaging backends for RPC and Notification
|
||||
communications should be the same from a security standpoint. This
|
||||
assumes the backends have parity from a security feature
|
||||
perspective, e.g. authentication and encryption.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Depending on the configuration of the messaging backend deployment,
|
||||
there could be a number of end user impacts including the following:
|
||||
|
||||
* monitoring of separated messaging backend services
|
||||
* understanding differences in functionality/behaviors between different
|
||||
messaging backends (e.g. broker versus router, etc.)
|
||||
* handling exceptions (e.g. different places for logs, etc.)
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Using separate messaging systems for RPC and Notifications should
|
||||
have a positive impact on performance and scalability by:
|
||||
|
||||
* separating RPC and Notification messaging loads
|
||||
* increased parallelism in message processing
|
||||
* increased aggregate message transfer capacity
|
||||
* tuned backend configuration aligned to messaging patterns
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
The deployment of hybrid messaging will be new to OpenStack
|
||||
operators. Operators will need to learn the architectural differences
|
||||
as compared to a single backend deployment. This will include capacity
|
||||
planning, monitoring, troubleshooting and maintenance best practices.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Developers of TripleO service templates will need to consume the new RPC and
Notification messaging parameters in place of the legacy 'rabbit' parameters.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
|
||||
* Andy Smith <ansmith@redhat.com>
|
||||
|
||||
* John Eckersberg <jeckersb@redhat.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
tripleo-heat-templates:
|
||||
|
||||
* Modify *puppet/services/<service>base.yaml* to introduce separate RPC and
|
||||
Notification Messaging parameters (e.g. replace 'rabbit' parameters)
|
||||
* Support two ssl environments (e.g. one for RPC and one for
|
||||
Notification when separate backends are deployed)
|
||||
* Consider example backend model such as the following:
|
||||
|
||||
::
|
||||
|
||||
tripleo-heat-templates
|
||||
|
|
||||
+--+ /environments
|
||||
| |
|
||||
| +--+ /messaging
|
||||
| |
|
||||
| +--+ messaging-(rpc/notify)-rabbitmq.yaml
|
||||
| +--+ messaging-(rpc/notify)-qdrouterd.yaml
|
||||
| +--+ messaging-(rpc/notify)-zmq.yaml
|
||||
| +--+ messaging-(rpc/notify)-kafka.yaml
|
||||
+--+ /puppet
|
||||
| |
|
||||
| +--+ /services
|
||||
| |
|
||||
| +--+ messaging-(rpc/notify)-backend-rabbitmq.yaml
|
||||
| +--+ messaging-(rpc/notify)-backend-qdrouterd.yaml
|
||||
| +--+ messaging-(rpc/notify)-backend-zmq.yaml
|
||||
| +--+ messaging-(rpc/notify)-backend-kafka.yaml
|
||||
|
|
||||
+--+ /roles
|
||||
|
||||
|
||||
puppet_tripleo:
|
||||
|
||||
* Replace rabbitmq_node_names with messaging_rpc_node_names and
|
||||
messaging_notify_node_names or similar
|
||||
* Add vhost support
|
||||
* Consider example backend model such as the following:
|
||||
|
||||
::
|
||||
|
||||
puppet-tripleo
|
||||
|
|
||||
+--+ /manifests
|
||||
|
|
||||
+--+ /profile
|
||||
|
|
||||
+--+ /base
|
||||
|
|
||||
+--+ /messaging
|
||||
|
|
||||
+--+ backend.pp
|
||||
+--+ rpc.pp
|
||||
+--+ notify.pp
|
||||
|
|
||||
+--+ /backend
|
||||
|
|
||||
+--+ rabbitmq.pp
|
||||
+--+ qdrouterd.pp
|
||||
+--+ zmq.pp
|
||||
+--+ kafka.pp
|
||||
|
||||
|
||||
tripleo_common:
|
||||
|
||||
* Add user and password management for the RPC and Notification messaging services
|
||||
* Support distinct health checks for separated messaging backends
|
||||
|
||||
pacemaker:
|
||||
|
||||
* Determine what should happen when two separate rabbitmq clusters
|
||||
are deployed. Does this result in two pacemaker services or one?
|
||||
Some experimentation may be required.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
In order to test this in CI, an environment will be needed where separate
|
||||
messaging system backends (e.g. RabbitMQ server and dispatch-router
|
||||
server) are deployed. Any existing hardware configuration should be
|
||||
appropriate for the dual backend deployment.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The deployment documentation will need to be updated to cover the
|
||||
configuration of the separate messaging (RPC and Notify) services.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] https://blueprints.launchpad.net/tripleo/+spec/om-dual-backends
|
||||
.. [2] https://review.openstack.org/#/c/396740/
|
@ -1,141 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=============================================
|
||||
TripleO PTP (Precision Time Protocol) Support
|
||||
=============================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-ptp
|
||||
|
||||
This spec introduces support for a time synchronization method called PTP [0]
|
||||
which provides better time accuracy than NTP in general. With hardware
|
||||
timestamping support on the host, PTP can achieve clock accuracy in the
|
||||
sub-microsecond range, making it suitable for measurement and control systems.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently tripleo deploys NTP services by default which provide millisecond
|
||||
level time accuracy, but this is not enough for some cases:
|
||||
|
||||
* Fault/Error events will include timestamps placed on the associated event
|
||||
messages, retrieved by detectors with the purpose of accurately identifying
|
||||
the time that the event occurred. Given that the target Fault Management
|
||||
cycle timelines are in tens of milliseconds on most critical faults, events
|
||||
ordering may reverse against actual time if the precision and accuracy of clock
|
||||
synchronization are in the same level of accuracy.
|
||||
|
||||
* NFV C-RAN (Cloud Radio Access Network) is looking for better time
|
||||
synchronization and distribution with microsecond-level accuracy as an alternative
|
||||
to NTP; PTP has been evaluated as one of the technologies.
|
||||
|
||||
This spec is not intended to cover all possible ways of using PTP; rather, it aims
|
||||
to provide a basic deployment path for PTP in tripleo with default
|
||||
configuration set to support PTP Ordinary Clock (slave mode); the master mode
|
||||
ptp clock configuration is not in the scope of this spec, but shall be deployed
|
||||
by the user to provide the time source for the PTP Ordinary Clock. The full support
|
||||
of PTP capability can be enhanced further based on this spec.
|
||||
|
||||
Users should be aware that NTP and PTP cannot be configured together
|
||||
on the same node without a coordinator program like timemaster, which is also
|
||||
provided by the linuxptp package. How to configure and use timemaster is not in the
|
||||
scope of this spec.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Provide the capability to configure PTP as time synchronization method:
|
||||
|
||||
* Add PTP configuration file path in overcloud resource registry.
|
||||
|
||||
* Add puppet-tripleo profile for PTP services.
|
||||
|
||||
* Add tripleo-heat-templates composable service for PTP.
|
||||
|
||||
Retain the current default behavior to deploy NTP as time synchronization
|
||||
source:
|
||||
|
||||
* The NTP services remain unchanged as the default time synchronization method.
|
||||
|
||||
* The NTP services must be disabled on nodes where PTP is deployed.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The alternative is to continue to use NTP.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
Security issues originating from PTP will need to be considered.
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Users will get more accurate time from PTP.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
No impact with the default deployment mode, which uses NTP as the time source.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
The operator who wants to use PTP should identify and provide the PTP capable
|
||||
network interface name and make sure NTP is not deployed on the nodes where PTP
|
||||
will be deployed. The default PTP network interface name is set to 'nic1'; the
|
||||
operator should change it to match the real interface name. By default, PTP will
|
||||
not be deployed unless explicitly configured.
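
As an illustrative sketch only (the composable service name, template path and
parameter name are assumptions to be settled during implementation), an
environment file enabling PTP and disabling NTP on the same nodes could look
like::

    resource_registry:
      # hypothetical composable PTP service
      OS::TripleO::Services::Ptp: ../puppet/services/time/ptp.yaml
      # disable the default NTP service on nodes that run PTP
      OS::TripleO::Services::Ntp: OS::Heat::None

    parameter_defaults:
      # PTP capable interface; replace 'nic1' with the real interface name
      PtpInterface: nic1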
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
zshi
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Puppet-tripleo profile for PTP services
|
||||
* Tripleo-heat-templates composable service for PTP deployment
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Puppet module for PTP services: ptp [1]
|
||||
* The linuxptp RPM must be installed, and a PTP capable NIC must be identified
  (see the sketch after this list).
|
||||
* Refer to linuxptp project page [2] for the list of drivers that support the
|
||||
PHC (Physical Hardware Clock) subsystem.
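
A quick way to check whether a NIC and its driver expose the PHC and hardware
timestamping needed by PTP is ethtool (the interface name is illustrative)::

    # show the time stamping capabilities of the candidate PTP interface
    ethtool -T eth0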
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The deployment of PTP should be testable in CI.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The deployment documentation will need to be updated to cover the configuration of
|
||||
PTP.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* [0] https://standards.ieee.org/findstds/standard/1588-2008.html
|
||||
* [1] https://github.com/redhat-nfvpe/ptp
|
||||
* [2] http://linuxptp.sourceforge.net
|
@ -1,733 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
========================================================
|
||||
TripleO Routed Networks Deployment (Spine-and-Leaf Clos)
|
||||
========================================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/tripleo-routed-networks-deployment
|
||||
|
||||
TripleO uses shared L2 networks today, so each node is attached to the
|
||||
provisioning network, and any other networks are also shared. This
|
||||
significantly reduces the complexity required to deploy on bare metal,
|
||||
since DHCP and PXE booting are simply done over a shared broadcast domain.
|
||||
This also makes the network switch configuration easy, since there is only
|
||||
a need to configure VLANs and ports, but no added complexity from dynamic
|
||||
routing between all switches.
|
||||
|
||||
This design has limitations, however, and becomes unwieldy beyond a certain
|
||||
scale. As the number of nodes increases, the background traffic from Broadcast,
|
||||
Unicast, and Multicast (BUM) traffic also increases. This design also requires
|
||||
all top-of-rack switches to trunk the VLANs back to the core switches, which
|
||||
centralizes the layer 3 gateway, usually on a single core switch. That creates
|
||||
a bottleneck which is not present in Clos architecture.
|
||||
|
||||
This spec serves as a detailed description of the overall problem set, and
|
||||
applies to the master blueprint. The sub-blueprints for the various
|
||||
implementation items also have their own associated spec.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Where possible, modern high-performance datacenter networks typically use
|
||||
routed networking to increase scalability and reduce failure domains. Using
|
||||
routed networks makes it possible to optimize a Clos (also known as
|
||||
"spine-and-leaf") architecture for scalability::
|
||||
|
||||
,=========. ,=========.
|
||||
| spine 1 |__ __| spine 2 |
|
||||
'==|\=====\_ \__________________/ _/=====/|=='
|
||||
| \_ \___ / \ ___/ _/ | ^
|
||||
| \___ / \ _______ / \ ___/ | |-- Dynamic routing (BGP, OSPF,
|
||||
| / \ / \ / \ | v EIGRP)
|
||||
,------. ,------ ,------. ,------.
|
||||
|leaf 1|....|leaf 2| |leaf 3|....|leaf 4| ======== Layer 2/3 boundary
|
||||
'------' '------' '------' '------'
|
||||
| | | |
|
||||
| | | |
|
||||
|-[serv-A1]=-| |-[serv-B1]=-|
|
||||
|-[serv-A2]=-| |-[serv-B2]=-|
|
||||
|-[serv-A3]=-| |-[serv-B3]=-|
|
||||
Rack A Rack B
|
||||
|
||||
|
||||
|
||||
In the above diagram, each server is connected via an Ethernet bond to both
|
||||
top-of-rack leaf switches, which are clustered and configured as a virtual
|
||||
switch chassis. Each leaf switch is attached to each spine switch. Within each
|
||||
rack, all servers share a layer 2 domain. The subnets are local to the rack,
|
||||
and the default gateway is the top-of-rack virtual switch pair. Dynamic routing
|
||||
between the leaf switches and the spine switches permits East-West traffic
|
||||
between the racks.
|
||||
|
||||
This is just one example of a routed network architecture. The layer 3 routing
|
||||
could also be done only on the spine switches, or there may even be distribution
|
||||
level switches that sit in between the top-of-rack switches and the routed core.
|
||||
The distinguishing feature that we are trying to enable is segregating local
|
||||
systems within a layer 2 domain, with routing between domains.
|
||||
|
||||
In a shared layer-2 architecture, the spine switches typically have to operate in an
|
||||
active/passive mode to act as the L3 gateway for the single shared VLAN. All
|
||||
leaf switches must be attached to the active switch, and the limit on North-South
|
||||
bandwidth is the connection to the active switch, so there is an upper bound on
|
||||
the scalability. The Clos topology is favored because it provides horizontal
|
||||
scalability. Additional spine switches can be added to increase East-West and
|
||||
North-South bandwidth. Equal-cost multipath routing between switches ensures
|
||||
that all links are utilized simultaneously. If all ports are full on the spine
|
||||
switches, an additional tier can be added to connect additional spines,
|
||||
each with their own set of leaf switches, providing hyperscale expandability.
|
||||
|
||||
Each network device may be taken out of service for maintenance without the entire
|
||||
network being offline. This topology also allows the switches to be configured
|
||||
without physical loops or Spanning Tree, since the redundant links are either
|
||||
delivered via bonding or via multiple layer 3 uplink paths with equal metrics.
|
||||
Some advantages of using this architecture with separate subnets per rack are:
|
||||
|
||||
* Reduced domain for broadcast, unknown unicast, and multicast (BUM) traffic.
|
||||
* Reduced failure domain.
|
||||
* Geographical separation.
|
||||
* Association between IP address and rack location.
|
||||
* Better cross-vendor support for multipath forwarding using equal-cost
|
||||
multipath (ECMP) routing at L3, instead of proprietary "fabric" technologies.
|
||||
|
||||
This topology is significantly different from the shared-everything approach that
|
||||
TripleO takes today.
|
||||
|
||||
Problem Descriptions
|
||||
====================
|
||||
|
||||
As this is a complex topic, it will be easier to break the problems down into
|
||||
their constituent parts, based on which part of TripleO they affect:
|
||||
|
||||
**Problem #1: TripleO uses DHCP/PXE on the Undercloud provisioning net (ctlplane).**
|
||||
|
||||
Neutron on the undercloud does not yet support DHCP relays and multiple L2
|
||||
subnets, since it does DHCP/PXE directly on the provisioning network.
|
||||
|
||||
Possible Solutions, Ideas, or Approaches:
|
||||
|
||||
1. Modify Ironic and/or Neutron to support multiple DHCP ranges in the dnsmasq
|
||||
configuration, and use a DHCP relay running on top-of-rack switches which
|
||||
receives DHCP requests and forwards them to dnsmasq on the Undercloud.
|
||||
There is a patch in progress to support that [11]_.
|
||||
2. Modify Neutron to support DHCP relay. There is a patch in progress to
|
||||
support that [10]_.
|
||||
|
||||
Currently, if one adds a subnet to a network, Neutron DHCP agent will pick up
|
||||
the changes and configure separate subnets correctly in ``dnsmasq``. For instance,
|
||||
after adding a second subnet to the ``ctlplane`` network, here is the resulting
|
||||
startup command for Neutron's instance of dnsmasq::
|
||||
|
||||
dnsmasq --no-hosts --no-resolv --strict-order --except-interface=lo \
|
||||
--pid-file=/var/lib/neutron/dhcp/aae53442-204e-4c8e-8a84-55baaeb496cf/pid \
|
||||
--dhcp-hostsfile=/var/lib/neutron/dhcp/aae53442-204e-4c8e-8a84-55baaeb496cf/host \
|
||||
--addn-hosts=/var/lib/neutron/dhcp/aae53442-204e-4c8e-8a84-55baaeb496cf/addn_hosts \
|
||||
--dhcp-optsfile=/var/lib/neutron/dhcp/aae53442-204e-4c8e-8a84-55baaeb496cf/opts \
|
||||
--dhcp-leasefile=/var/lib/neutron/dhcp/aae53442-204e-4c8e-8a84-55baaeb496cf/leases \
|
||||
--dhcp-match=set:ipxe,175 --bind-interfaces --interface=tap4ccef953-e0 \
|
||||
--dhcp-range=set:tag0,172.19.0.0,static,86400s \
|
||||
--dhcp-range=set:tag1,172.20.0.0,static,86400s \
|
||||
--dhcp-option-force=option:mtu,1500 --dhcp-lease-max=512 \
|
||||
--conf-file=/etc/dnsmasq-ironic.conf --domain=openstacklocal
|
||||
|
||||
The router information gets put into the dhcp-optsfile, here are the contents
|
||||
of /var/lib/neutron/dhcp/aae53442-204e-4c8e-8a84-55baaeb496cf/opts::
|
||||
|
||||
tag:tag0,option:classless-static-route,172.20.0.0/24,0.0.0.0,0.0.0.0/0,172.19.0.254
|
||||
tag:tag0,249,172.20.0.0/24,0.0.0.0,0.0.0.0/0,172.19.0.254
|
||||
tag:tag0,option:router,172.19.0.254
|
||||
tag:tag1,option:classless-static-route,169.254.169.254/32,172.20.0.1,172.19.0.0/24,0.0.0.0,0.0.0.0/0,172.20.0.254
|
||||
tag:tag1,249,169.254.169.254/32,172.20.0.1,172.19.0.0/24,0.0.0.0,0.0.0.0/0,172.20.0.254
|
||||
tag:tag1,option:router,172.20.0.254
|
||||
|
||||
The above options file will result in separate routers being handed out to
|
||||
separate IP subnets. Furthermore, Neutron appears to "do the right thing" with
|
||||
regard to routes for other subnets on the same network. We can see that the
|
||||
option "classless-static-route" is given, with pointers to both the default
|
||||
route and the other subnet(s) on the same Neutron network.
|
||||
|
||||
In order to modify Ironic-Inspector to use multiple subnets, we will need to
|
||||
extend instack-undercloud to support network segments. There is a patch in
|
||||
review to support segments in instack undercloud [0]_.
|
||||
|
||||
**Potential Workaround**
|
||||
|
||||
One possibility is to use an alternate method to DHCP/PXE boot, such as using
|
||||
DHCP configuration directly on the router, or to configure a host on the remote
|
||||
network which provides DHCP and PXE URLs, then provides routes back to the
|
||||
ironic-conductor and metadata server as part of the DHCP response.
|
||||
|
||||
It is not always feasible for groups doing testing or development to configure
|
||||
DHCP relay on the switches. For proof-of-concept implementations of
|
||||
spine-and-leaf, we may want to configure all provisioning networks to be
|
||||
trunked back to the Undercloud. This would allow the Undercloud to provide DHCP
|
||||
for all networks without special switch configuration. In this case, the
|
||||
Undercloud would act as a router between subnets/VLANs. This should be
|
||||
considered a small-scale solution, as this is not as scalable as DHCP relay.
|
||||
The configuration file for dnsmasq is the same whether all subnets are local or
|
||||
remote, but dnsmasq may have to listen on multiple interfaces (today it only
|
||||
listens on br-ctlplane). The dnsmasq process currently runs with
|
||||
``--bind-interface=tap-XXX``, but the process will need to be run with either
|
||||
binding to multiple interfaces, or with ``--except-interface=lo`` and multiple
|
||||
interfaces bound to the namespace.
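
For example (the interface names are illustrative), dnsmasq can listen on several
trunked provisioning VLAN interfaces in addition to br-ctlplane by repeating
the ``--interface`` option::

    dnsmasq --no-hosts --no-resolv --except-interface=lo \
        --interface=br-ctlplane --interface=vlan101 --interface=vlan102 \
        --dhcp-range=set:tag0,172.19.0.0,static,86400s \
        --dhcp-range=set:tag1,172.20.0.0,static,86400s \
        --conf-file=/etc/dnsmasq-ironic.conf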
|
||||
|
||||
For proof-of-concept deployments, as well as testing environments, it might
|
||||
make sense to run a DHCP relay on the Undercloud, and trunk all provisioning
|
||||
VLANs back to the Undercloud. This would allow dnsmasq to listen on the tap
|
||||
interface, and DHCP requests would be forwarded to the tap interface. The
|
||||
downside of this approach is that the Undercloud would need to have IP
|
||||
addresses on each of the trunked interfaces.
|
||||
|
||||
Another option is to configure dedicated hosts or VMs to be used as DHCP relay
|
||||
and router for subnets on multiple VLANs, all of which would be trunked to the
|
||||
relay/router host, thus acting exactly like routing switches.
|
||||
|
||||
------------
|
||||
|
||||
**Problem #2: Neutron's model for a segmented network that spans multiple L2
|
||||
domains uses the segment object to allow multiple subnets to be assigned to
|
||||
the same network. This functionality needs to be integrated into the
|
||||
Undercloud.**
|
||||
|
||||
Possible Solutions, Ideas, or Approaches:
|
||||
|
||||
1. Implement Neutron segments on the undercloud.
|
||||
|
||||
The spec for Neutron routed network segments [1]_ provides a schema that we can
|
||||
use to model a routed network. By implementing support for network segments, we
|
||||
can assign Ironic nodes to networks on routed subnets. This allows us
|
||||
to continue to use Neutron for IP address management, as ports are assigned by
|
||||
Neutron and tracked in the Neutron database on the Undercloud. See approach #1
|
||||
below.
|
||||
|
||||
2. Multiple Neutron networks (1 set per rack), to model all L2 segments.
|
||||
|
||||
By using a different set of networks in each rack, this provides us with
|
||||
the flexibility to use different network architectures on a per-rack basis.
|
||||
Each rack could have its own set of networks, and we would no longer have
|
||||
to provide all networks in all racks. Additionally, a split-datacenter
|
||||
architecture would naturally have a different set of networks in each
|
||||
site, so this approach makes sense. This is detailed in approach #2 below.
|
||||
|
||||
3. Multiple subnets per Neutron network.
|
||||
|
||||
This is probably the best approach for provisioning, since Neutron is
|
||||
already able to handle DHCP relay with multiple subnets as part of the
|
||||
same network. Additionally, this allows a clean separation between local
|
||||
subnets associated with provisioning, and networks which are used
|
||||
in the overcloud (such as External networks in two different datacenters).
|
||||
This is covered in more detail in approach #3 below.
|
||||
|
||||
4. Use another system for IPAM, instead of Neutron.
|
||||
|
||||
Although we could use a database, flat file, or some other method to keep
|
||||
track of IP addresses, Neutron as an IPAM back-end provides many integration
|
||||
benefits. Neutron integrates DHCP, hardware switch port configuration (through
|
||||
the use of plugins), integration in Ironic, and other features such as
|
||||
IPv6 support. This has been deemed to be infeasible due to the level of effort
|
||||
required in replacing both Neutron and the Neutron DHCP server (dnsmasq).
|
||||
|
||||
**Approaches to Problem #2:**
|
||||
|
||||
Approach 1 (Implement Neutron segments on the Undercloud):
|
||||
|
||||
The Neutron segments model provides a schema in Neutron that allows us to
|
||||
model the routed network. Using multiple subnets provides the flexibility
|
||||
we need without creating exponentially more resources. We would create the same
|
||||
provisioning network that we do today, but use multiple segments associated
|
||||
to different routed subnets. The disadvantage to this approach is that it makes
|
||||
it impossible to represent network VLANs with more than one IP subnet (Neutron
|
||||
technically supports more than one subnet per port). Currently TripleO only
|
||||
supports a single subnet per isolated network, so this should not be an issue.
|
||||
|
||||
Approach 2 (Multiple Neutron networks (1 set per rack), to model all L2 segments):
|
||||
|
||||
We will be using multiple networks to represent isolated networks in multiple
|
||||
L2 domains. One sticking point is that although Neutron will configure multiple
|
||||
routes for multiple subnets within a given network, we need to be able to both
|
||||
configure static IPs and routes, and be able to scale the network by adding
|
||||
additional subnets after initial deployment.
|
||||
|
||||
Since we control addresses and routes on the host nodes using a
|
||||
combination of Heat templates and os-net-config, it is possible to use
|
||||
static routes to supernets to provide L2 adjacency. This approach only
|
||||
works for non-provisioning networks, since we rely on Neutron DHCP servers
|
||||
providing routes to adjacent subnets for the provisioning network.
|
||||
|
||||
Example:
|
||||
Suppose 2 subnets are provided for the Internal API network: ``172.19.1.0/24``
|
||||
and ``172.19.2.0/24``. We want all Internal API traffic to traverse the Internal
|
||||
API VLANs on both the controller and a remote compute node. The Internal API
|
||||
network uses different VLANs for the two nodes, so we need the routes on the
|
||||
hosts to point toward the Internal API gateway instead of the default gateway.
|
||||
This can be provided by a supernet route to 172.19.x.x pointing to the local
|
||||
gateway on each subnet (e.g. 172.19.1.1 and 172.19.2.1 on the respective
|
||||
subnets). This could be represented in os-net-config with the following::
|
||||
|
||||
-
|
||||
type: interface
|
||||
name: nic3
|
||||
addresses:
|
||||
-
|
||||
ip_netmask: {get_param: InternalApiIpSubnet}
|
||||
routes:
|
||||
-
|
||||
ip_netmask: {get_param: InternalApiSupernet}
|
||||
next_hop: {get_param: InternalApiRouter}
|
||||
|
||||
Where InternalApiIpSubnet is the IP address on the local subnet,
|
||||
InternalApiSupernet is '172.19.0.0/16', and InternalApiRouter is either
|
||||
172.19.1.1 or 172.19.2.1 depending on which local subnet the host belongs to.
|
||||
|
||||
The end result of this is that each host has a set of IP addresses and routes
|
||||
that isolate traffic by function. In order for the return traffic to also be
|
||||
isolated by function, similar routes must exist on both hosts, pointing to the
|
||||
local gateway on the local subnet for the larger supernet that contains all
|
||||
Internal API subnets.
|
||||
|
||||
The downside of this is that we must require proper supernetting, and this may
|
||||
lead to larger blocks of IP addresses being used to provide ample space for
|
||||
scaling growth. For instance, in the example above an entire /16 network is set
|
||||
aside for up to 255 local subnets for the Internal API network. This could be
|
||||
changed into a more reasonable space, such as /18, if the number of local
|
||||
subnets will not exceed 64, etc. This will be less of an issue with native IPv6
|
||||
than with IPv4, where scarcity is much more likely.
|
||||
|
||||
Approach 3 (Multiple subnets per Neutron network):
|
||||
|
||||
The approach we will use for the provisioning network will be to use multiple
|
||||
subnets per network, using Neutron segments. This will allow us to take
|
||||
advantage of Neutron's ability to support multiple networks with DHCP relay.
|
||||
The DHCP server will supply the necessary routes via DHCP until the nodes are
|
||||
configured with a static IP post-deployment.
|
||||
|
||||
---------
|
||||
|
||||
**Problem #3: Ironic introspection DHCP doesn't yet support DHCP relay**
|
||||
|
||||
This makes it difficult to do introspection when the hosts are not on the same L2
|
||||
domain as the controllers. Patches are either merged or in review to support
|
||||
DHCP relay.
|
||||
|
||||
Possible Solutions, Ideas, or Approaches:
|
||||
|
||||
1. A patch to support a dnsmasq PXE filter driver has been merged. This will
|
||||
allow us to support selective DHCP when using DHCP relay (where the packet
|
||||
is not coming from the MAC of the host but rather the MAC of the switch)
|
||||
[12]_.
|
||||
|
||||
2. A patch has been merged to puppet-ironic to support multiple DHCP subnets
|
||||
for Ironic Inspector [13]_.
|
||||
|
||||
3. A patch is in review to add support for multiple subnets for the
|
||||
provisioning network in the instack-undercloud scripts [14]_.
|
||||
|
||||
For more information about solutions, please refer to the
|
||||
tripleo-routed-networks-ironic-inspector blueprint [5]_ and spec [6]_.
|
||||
|
||||
-------
|
||||
|
||||
**Problem #4: The IP addresses on the provisioning network need to be
|
||||
static IPs for production.**
|
||||
|
||||
Possible Solutions, Ideas, or Approaches:
|
||||
|
||||
1. Dan Prince wrote a patch [9]_ in Newton to convert the ctlplane network
|
||||
addresses to static addresses post-deployment. This will need to be
|
||||
refactored to support multiple provisioning subnets across routers.
|
||||
|
||||
Solution Implementation
|
||||
|
||||
This work is done and merged for the legacy use case. During the
|
||||
initial deployment, the nodes receive their IP address via DHCP, but during
|
||||
Heat deployment the os-net-config script is called, which writes static
|
||||
configuration files for the NICs with static IPs.
|
||||
|
||||
This work will need to be refactored to support assigning IPs from the
|
||||
appropriate subnet, but the work will be part of the TripleO Heat Template
|
||||
refactoring listed in Problems #6, and #7 below.
|
||||
|
||||
For the deployment model where the IPs are specified (ips-from-pool-all.yaml),
|
||||
we need to develop a model where the Control Plane IP can be specified
|
||||
on multiple deployment subnets. This may happen in a later cycle than the
|
||||
initial work being done to enable routed networks in TripleO. For more
|
||||
information, reference the tripleo-predictable-ctlplane-ips blueprint [7]_
|
||||
and spec [8]_.
|
||||
|
||||
------
|
||||
|
||||
**Problem #5: Heat Support For Routed Networks**
|
||||
|
||||
The Neutron routed networks extensions were only added in recent releases, and
|
||||
there was a dependency on TripleO Heat Templates.
|
||||
|
||||
Possible Solutions, Ideas or Approaches:
|
||||
|
||||
1. Add the required objects to Heat. At minimum, we will probably have to
|
||||
add ``OS::Neutron::Segment``, which represents layer 2 segments, the
|
||||
``OS::Neutron::Network`` will be updated to support the ``l2-adjacency``
|
||||
attribute, ``OS::Neutron::Subnet`` and ``OS::Neutron:port`` would be extended
|
||||
to support the ``segment_id`` attribute.
|
||||
|
||||
Solution Implementation:
|
||||
|
||||
Heat now supports the OS::Neutron::Segment resource. For example::
|
||||
|
||||
heat_template_version: 2015-04-30
|
||||
...
|
||||
resources:
|
||||
...
|
||||
the_resource:
|
||||
type: OS::Neutron::Segment
|
||||
properties:
|
||||
description: String
|
||||
name: String
|
||||
network: String
|
||||
network_type: String
|
||||
physical_network: String
|
||||
segmentation_id: Integer
|
||||
|
||||
This work has been completed in Heat with this review [15]_.
|
||||
|
||||
------
|
||||
|
||||
**Problem #6: Static IP assignment: Choosing static IPs from the correct
|
||||
subnet**
|
||||
|
||||
Some roles, such as Compute, can likely be placed in any subnet, but we will
|
||||
need to keep certain roles co-located within the same set of L2 domains. For
|
||||
instance, whatever role is providing Neutron services will need all controllers
|
||||
in the same L2 domain for VRRP to work properly.
|
||||
|
||||
The network interfaces will be configured using templates that create
|
||||
configuration files for os-net-config. The IP addresses that are written to each
|
||||
node's configuration will need to be on the correct subnet for each host. In
|
||||
order for Heat to assign ports from the correct subnets, we will need to have a
|
||||
host-to-subnets mapping.
|
||||
|
||||
Possible Solutions, Ideas or Approaches:
|
||||
|
||||
1. The simplest implementation of this would probably be a mapping of role/index
|
||||
to a set of subnets, so that it is known to Heat that Controller-1 is in
|
||||
subnet set X and Compute-3 is in subnet set Y.
|
||||
2. We could associate particular subnets with roles, and then use one role
|
||||
per L2 domain (such as per-rack).
|
||||
3. The roles and templates should be refactored to allow for dynamic IP
|
||||
assignment within subnets associated with the role. We may wish to evaluate
|
||||
the possibility of storing the routed subnets in Neutron using the routed
|
||||
networks extensions that are still under development. This would provide
|
||||
additional flexibility, but is probably not required to implement separate
|
||||
subnets in each rack.
|
||||
4. A scalable long-term solution is to map which subnet the host is on
|
||||
during introspection. If we can identify the correct subnet for each
|
||||
interface, then we can correlate that with IP addresses from the correct
|
||||
allocation pool. This would have the advantage of not requiring a static
|
||||
mapping of role to node to subnet. In order to do this, additional
|
||||
integration would be required between Ironic and Neutron (to make Ironic
|
||||
aware of multiple subnets per network, and to add the ability to make
|
||||
that association during introspection).
|
||||
|
||||
Solution Implementation:
|
||||
|
||||
Solutions 1 and 2 above have been implemented in the "composable roles" series
|
||||
of patches [16]_. The initial implementation uses separate Neutron networks
|
||||
for different L2 domains. These templates are responsible for assigning the
|
||||
isolated VLANs used for the data plane and overcloud control planes, but do not
|
||||
address the provisioning network. Future work may refactor the non-provisioning
|
||||
networks to use segments, but for now non-provisioning networks must use
|
||||
different networks for different roles.
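
A hypothetical sketch of such per-rack network definitions, in the style used
by the composable networks work (the names, VLANs and subnets are illustrative)::

    # one Internal API network per rack
    - name: InternalApi1
      vlan: 201
      ip_subnet: '172.19.1.0/24'
    - name: InternalApi2
      vlan: 202
      ip_subnet: '172.19.2.0/24'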
|
||||
|
||||
Ironic autodiscovery may allow us to determine the subnet where each node
|
||||
is located without manual entry. More work is required to automate this
|
||||
process.
|
||||
|
||||
------
|
||||
|
||||
**Problem #7: Isolated Networking Requires Static Routes to Ensure Correct VLAN
|
||||
is Used**
|
||||
|
||||
In order to continue using the Isolated Networks model, routes will need to be
|
||||
in place on each node, to steer traffic to the correct VLAN interfaces. The
|
||||
routes are written when os-net-config first runs, but may change. We
|
||||
can't just rely on the specific routes to other subnets, since the number of
|
||||
subnets will increase or decrease as racks are added or taken away. Rather than
|
||||
try to deal with constantly changing routes, we should use static routes that
|
||||
will not need to change, to avoid disruption on a running system.
|
||||
|
||||
Possible Solutions, Ideas or Approaches:
|
||||
|
||||
1. Require that supernets are used for various network groups. For instance,
|
||||
all the Internal API subnets would be part of a single supernet; for example,
|
||||
172.17.0.0/16 could be used, and broken up into many smaller subnets, such
|
||||
as /24. This would simplify the routes, since only a single route for
|
||||
172.17.0.0/16 would be required pointing to the local router on the
|
||||
172.17.x.0/24 network.
|
||||
2. Modify os-net-config so that routes can be updated without bouncing
|
||||
interfaces, and then run os-net-config on all nodes when scaling occurs.
|
||||
A review for this functionality was considered and abandoned [3]_.
|
||||
The patch was determined to have the potential to lead to instability.
|
||||
|
||||
os-net-config configures static routes for each interface. If we can keep the
|
||||
routing simple (one route per functional network), then we would be able to
|
||||
isolate traffic onto functional VLANs like we do today.
|
||||
|
||||
It would be a change to the existing workflow to have os-net-config run on
|
||||
updates as well as deployment, but if this were a non-impacting event (the
|
||||
interfaces didn't have to be bounced), that would probably be OK.
|
||||
|
||||
At a later time, the possibility of using dynamic routing should be considered,
|
||||
since it reduces the possibility of user error and is better suited to
|
||||
centralized management. SDN solutions are one way to provide this, or other
|
||||
approaches may be considered, such as setting up OVS tunnels.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
The proposed changes are discussed below.
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
In order to provide spine-and-leaf networking for deployments, several changes
|
||||
will have to be made to TripleO:
|
||||
|
||||
1. Support for DHCP relay in Ironic and Neutron DHCP servers. Implemented in
|
||||
patch [15]_ and the patch series [17]_.
|
||||
2. Refactoring of TripleO Heat Templates network isolation to support multiple
|
||||
subnets per isolated network, as well as per-subnet and supernet routes.
|
||||
The bulk of this work is done in the patch series [16]_ and in patch [10]_.
|
||||
3. Changes to Infra CI to support testing.
|
||||
4. Documentation updates.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The approach outlined here is very prescriptive, in that the networks must be
|
||||
known ahead of time, and the IP addresses must be selected from the appropriate
|
||||
pool. This is due to the reliance on static IP addresses provided by Heat.
|
||||
|
||||
One alternative approach is to use DHCP servers to assign IP addresses on all
|
||||
hosts on all interfaces. This would simplify configuration within the Heat
|
||||
templates and environment files. Unfortunately, this was the original approach
|
||||
of TripleO, and it was deemed insufficient by end-users, who wanted stability
|
||||
of IP addresses, and didn't want to have an external dependency on DHCP.
|
||||
|
||||
Another approach is to use the DHCP server functionality in the network switch
|
||||
infrastructure in order to PXE boot systems, then assign static IP addresses
|
||||
after the PXE boot is done via DHCP. This approach only solves for part of the
|
||||
requirement: the net booting. It does not solve the desire to have static IP
|
||||
addresses on each network. This could be achieved by having static IP addresses
|
||||
in some sort of per-node map. However, this approach is not as scalable as
|
||||
programmatically determining the IPs, since it only applies to a fixed number of
|
||||
hosts. We want to retain the ability of using Neutron as an IP address
|
||||
management (IPAM) back-end, ideally.
|
||||
|
||||
Another approach which was considered was simply trunking all networks back
|
||||
to the Undercloud, so that dnsmasq could respond to DHCP requests directly,
|
||||
rather than requiring a DHCP relay. Unfortunately, this has already been
|
||||
identified as being unacceptable by some large operators, who have network
|
||||
architectures that make heavy use of L2 segregation via routers. This also
|
||||
won't work well in situations where there is geographical separation between
|
||||
the VLANs, such as in split-site deployments.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
One of the major differences between spine-and-leaf and standard isolated
|
||||
networking is that the various subnets are connected by routers, rather than
|
||||
being completely isolated. This means that without proper ACLs on the routers,
|
||||
networks which should be private may be opened up to outside traffic.
|
||||
|
||||
This should be addressed in the documentation, and it should be stressed that
|
||||
ACLs should be in place to prevent unwanted network traffic. For instance, the
|
||||
Internal API network is sensitive in that the database and message queue
|
||||
services run on that network. It is supposed to be isolated from outside
|
||||
connections. This can be achieved fairly easily if *supernets* are used, so
|
||||
that if all Internal API subnets are a part of the ``172.19.0.0/16`` supernet,
|
||||
an ACL rule will allow only traffic between Internal API IPs (this is a
|
||||
simplified example that could be applied to any Internal API VLAN, or as a
|
||||
global ACL)::
|
||||
|
||||
allow traffic from 172.19.0.0/16 to 172.19.0.0/16
|
||||
deny traffic from * to 172.19.0.0/16
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Deploying with spine-and-leaf will require additional parameters to
|
||||
provide the routing information and multiple subnets required. This will have
|
||||
to be documented. Furthermore, the validation scripts may need to be updated
|
||||
to ensure that the configuration is validated, and that there is proper
|
||||
connectivity between overcloud hosts.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Much of the traffic that is today carried over layer 2 will be traversing layer
|
||||
3 routing borders in this design. That adds some minimal latency and overhead,
|
||||
although in practice the difference may not be noticeable. One important
|
||||
consideration is that the routers must not be too overcommitted on their
|
||||
uplinks, and the routers must be monitored to ensure that they are not acting
|
||||
as a bottleneck, especially if complex access control lists are used.
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
A spine-and-leaf deployment will be more difficult to troubleshoot than a
|
||||
deployment that simply uses a set of VLANs. The deployer may need to have
|
||||
more network expertise, or a dedicated network engineer may be needed to
|
||||
troubleshoot in some cases.
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
Spine-and-leaf is not easily tested in virt environments. This should be
|
||||
possible, but due to the complexity of setting up libvirt bridges and
|
||||
routes, we may want to provide a simulation of spine-and-leaf for use in
|
||||
virtual environments. This may involve building multiple libvirt bridges
|
||||
and routing between them on the Undercloud, or it may involve using a
|
||||
DHCP relay on the virt-host as well as routing on the virt-host to simulate
|
||||
a full routing switch. A plan for development and testing will need to be
|
||||
developed, since not every developer can be expected to have a routed
|
||||
environment to work in. It may take some time to develop a routed virtual
|
||||
environment, so initial work will be done on bare metal.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Dan Sneddon <dsneddon@redhat.com>
|
||||
|
||||
Approver(s)
|
||||
-----------
|
||||
|
||||
Primary approver:
|
||||
Emilien Macchi <emacchi@redhat.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
1. Add static IP assignment to Control Plane [DONE]
|
||||
2. Modify Ironic Inspector ``dnsmasq.conf`` generation to allow export of
|
||||
multiple DHCP ranges, as described in Problem #1 and Problem #3.
|
||||
3. Evaluate the Routed Networks work in Neutron, to determine if it is required
|
||||
for spine-and-leaf, as described in Problem #2.
|
||||
4. Add OS::Neutron::Segment and l2-adjacency support to Heat, as described
|
||||
in Problem #5. This may or may not be a dependency for spine-and-leaf, based
|
||||
on the results of work item #3.
|
||||
5. Modify the Ironic-Inspector service to record the host-to-subnet mappings,
|
||||
perhaps during introspection, to address Problem #6.
|
||||
6. Add parameters to Isolated Networking model in Heat to support supernet
|
||||
routes for individual subnets, as described in Problem #7.
|
||||
7. Modify Isolated Networking model in Heat to support multiple subnets, as
|
||||
described in Problem #2.
|
||||
8. Add support for setting routes to supernets in os-net-config NIC templates,
|
||||
as described in the proposed solution to Problem #2.
|
||||
9. Implement support for iptables on the Controller, in order to mitigate
|
||||
the APIs potentially being reachable via remote routes. Alternatively,
|
||||
document the mitigation procedure using ACLs on the routers.
|
||||
10. Document the testing procedures.
|
||||
11. Modify the documentation in tripleo-docs to cover the spine-and-leaf case.
|
||||
|
||||
|
||||
Implementation Details
|
||||
----------------------
|
||||
|
||||
Workflow:
|
||||
|
||||
1. Operator configures DHCP networks and IP address ranges
|
||||
2. Operator imports baremetal instackenv.json
|
||||
3. When introspection or deployment is run, the DHCP server receives the DHCP
|
||||
request from the baremetal host via DHCP relay
|
||||
4. If the node has not been introspected, reply with an IP address from the
|
||||
introspection pool* and the inspector PXE boot image
|
||||
5. If the node already has been introspected, then the server assumes this is
|
||||
a deployment attempt, and replies with the Neutron port IP address and the
|
||||
overcloud-full deployment image
|
||||
6. The Heat templates are processed which generate os-net-config templates, and
|
||||
os-net-config is run to assign static IPs from the correct subnets, as well
|
||||
as routes to other subnets via the router gateway addresses.
|
||||
|
||||
* The introspection pool will be different for each provisioning subnet.
|
||||
|
||||
When using spine-and-leaf, the DHCP server will need to provide an introspection
|
||||
IP address on the appropriate subnet, depending on the information contained in
|
||||
the DHCP relay packet that is forwarded by the segment router. dnsmasq will
|
||||
automatically match the gateway address (GIADDR) of the router that forwarded
|
||||
the request to the subnet where the DHCP request was received, and will respond
|
||||
with an IP and gateway appropriate for that subnet.
|
||||
|
||||
The above workflow for the DHCP server should allow for provisioning IPs on
|
||||
multiple subnets.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
There may be a dependency on the Neutron Routed Networks. This won't be clear
|
||||
until a full evaluation is done on whether we can represent spine-and-leaf
|
||||
using only multiple subnets per network.
|
||||
|
||||
There will be a dependency on routing switches that perform DHCP relay service
|
||||
for production spine-and-leaf deployments.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
In order to properly test this framework, we will need to establish at least
|
||||
one CI test that deploys spine-and-leaf. As discussed in this spec, it isn't
|
||||
necessary to have a full routed bare metal environment in order to test this
|
||||
functionality, although some work is required to get it working in virtual
|
||||
environments such as OVB.
|
||||
|
||||
For bare metal testing, it is sufficient to trunk all VLANs back to the
|
||||
Undercloud, then run DHCP proxy on the Undercloud to receive all the
|
||||
requests and forward them to br-ctlplane, where dnsmasq listens. This
|
||||
will provide a substitute for routers running DHCP relay. For Neutron
|
||||
DHCP, some modifications to the iptables rule may be required to ensure
|
||||
that all DHCP requests from the overcloud nodes are received by the
|
||||
DHCP proxy and/or the Neutron dnsmasq process running in the dhcp-agent
|
||||
namespace.
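
A minimal sketch of the kind of iptables rule that may be needed on the
Undercloud for such a test environment (the interface name is an assumption)::

    # accept DHCP requests arriving on a trunked provisioning VLAN interface
    iptables -A INPUT -i vlan101 -p udp --dport 67 -j ACCEPT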
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The procedure for setting up a dev environment will need to be documented,
|
||||
and a work item mentions this requirement.
|
||||
|
||||
The TripleO docs will need to be updated to include detailed instructions
|
||||
for deploying in a spine-and-leaf environment, including the environment
|
||||
setup. Covering specific vendor implementations of switch configurations
|
||||
is outside this scope, but a specific overview of required configuration
|
||||
options should be included, such as enabling DHCP relay (or "helper-address"
|
||||
as it is also known) and setting the Undercloud as a server to receive
|
||||
DHCP requests.
|
||||
|
||||
The updates to TripleO docs will also have to include a detailed discussion
|
||||
of choices to be made about IP addressing before a deployment. If supernets
|
||||
are to be used for network isolation, then a good plan for IP addressing will
|
||||
be required to ensure scalability in the future.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [0] `Review: TripleO Heat Templates: Tripleo routed networks ironic inspector, and Undercloud <https://review.openstack.org/#/c/437544>`_
|
||||
.. [1] `Spec: Routed Networks for Neutron <https://specs.openstack.org/openstack/neutron-specs/specs/newton/routed-networks.html>`_
|
||||
.. [3] `Review: Modify os-net-config to make changes without bouncing interface <https://review.openstack.org/#/c/152732/>`_
|
||||
.. [5] `Blueprint: Modify TripleO Ironic Inspector to PXE Boot Via DHCP Relay <https://blueprints.launchpad.net/tripleo/+spec/tripleo-routed-networks-ironic-inspector>`_
|
||||
.. [6] `Spec: Modify TripleO Ironic Inspector to PXE Boot Via DHCP Relay <https://review.openstack.org/#/c/421011>`_
|
||||
.. [7] `Blueprint: User-specifiable Control Plane IP on TripleO Routed Isolated Networks <https://blueprints.launchpad.net/tripleo/+spec/tripleo-routed-networks-deployment>`_
|
||||
.. [8] `Spec: User-specifiable Control Plane IP on TripleO Routed Isolated Networks <https://review.openstack.org/#/c/421010>`_
|
||||
.. [9] `Review: Configure ctlplane network with a static IP <https://review.openstack.org/#/c/206022/>`_
|
||||
.. [10] `Review: Neutron: Make "on-link" routes for subnets optional <https://review.openstack.org/#/c/438171>`_
|
||||
.. [11] `Review: Ironic Inspector: Make "on-link" routes for subnets optional <https://review.openstack.org/438175>`_
|
||||
.. [12] `Review: Ironic Inspector: Introducing a dnsmasq PXE filter driver <https://review.openstack.org/466448>`_
|
||||
.. [13] `Review: Multiple DHCP Subnets for Ironic Inspector <https://review.openstack.org/#/c/436716>`_
|
||||
.. [14] `Review: Instack Undercloud: Add support for multiple inspection subnets <https://review.openstack.org/#/c/533367>`_
|
||||
.. [15] `Review: DHCP Agent: Separate local from non-local subnets <https://review.openstack.org/#/c/468744>`_
|
||||
.. [16] `Review Series: topic:bp/composable-networks <https://review.openstack.org/#/q/topic:bp/composable-networks+(status:open+OR+status:merged)>`_
|
||||
.. [17] `Review Series: project:openstack/networking-baremetal <https://review.openstack.org/#/q/project:openstack/networking-baremetal+committer:hjensas%2540redhat.com>`_
|
@ -1,190 +0,0 @@
|
||||
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================================
|
||||
TripleO - Ansible upgrade Worklow with UI integration
|
||||
==========================================================
|
||||
|
||||
The launchpad blueprint for this work is:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/major-upgrade-workflow
|
||||
|
||||
During the Pike cycle the minor update and some parts of the major upgrade
|
||||
are significantly different to any previous cycle, in that they are *not* being
|
||||
delivered onto nodes via Heat stack update. Rather, Heat stack update is used
|
||||
to only collect, but not execute, the relevant ansible tasks defined in each
|
||||
of the service manifests_ as upgrade_tasks_ or update_tasks_ accordingly.
|
||||
These tasks are then written as stand-alone ansible playbooks in the stack
|
||||
outputs_.
|
||||
|
||||
These 'config' playbooks are then downloaded using the *openstack overcloud
|
||||
config download* utility_ and finally executed to deliver the actual
|
||||
upgrade or update. See bugs 1715557_ and 1708115_ for more information
|
||||
(or pointers/reviews) about this mechanism as used during the P cycle.
|
||||
|
||||
For Queens and as discussed at the Denver PTG_ we aim to extend this approach
|
||||
to include the controlplane upgrade too. That is, instead of using Heat
|
||||
SoftwareConfig and Deployments_ to invoke_ ansible, we should collect
|
||||
the upgrade_tasks for the controlplane nodes into ansible playbooks that can
|
||||
then be invoked to deliver the actual upgrade.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Whilst it has continually improved in each cycle, the complexity and difficulty of
|
||||
debugging or understanding what has been executed at any given point of the upgrade
|
||||
remain among the biggest complaints from operators about the TripleO
|
||||
upgrades workflow. In the P cycle and as discussed above, the minor version
|
||||
update and some part of the 'non-controller' upgrade have already moved to the
|
||||
model being proposed here, i.e. generate ansible-playbooks with an initial heat
|
||||
stack update and then execute them.
|
||||
|
||||
If we are to use this approach for all parts of the upgrade, including for the
|
||||
controlplane services then we will also need a mistral workbook that can handle
|
||||
the download and execution of the ansible-playbook invocations. With this kind
|
||||
of ansible driven workflow, executed by mistral action/workbook, we can for
|
||||
the first time consider integration with the UI for upgrade/updates. This
|
||||
aligns well with the effort_ by the UI team for feature parity in CLI/UI for
|
||||
Queens. It should also be noted that there is already some work underway to
|
||||
adding the required mistral actions, at least for the minor update for Pike
|
||||
deployments in changes 487488_ and 487496_
|
||||
|
||||
Implementing a fully ansible-playbook delivered workflow for the entire major
|
||||
upgrade workflow will offer a number of benefits:
|
||||
|
||||
* very short initial heat stack update to generate the playbooks
|
||||
* easier to follow and understand what is happening at a given step of the upgrade
|
||||
* easier to debug and re-run any particular step of the upgrade
|
||||
* implies full python-tripleoclient and mistral workbook support for the
|
||||
ansible-playbook invocations.
|
||||
* can consider integrating upgrades/updates into the UI, for the first time

Proposed Change
===============

We will need an initial heat stack update to populate the
upgrade_tasks_playbook into the overcloud stack output (the cli shown here
is just a suggestion):

* openstack overcloud upgrade --init --init-commands [ "sudo curl -L -o /etc/yum.repos.d/delorean-pike.repo https://trunk.rdoproject.org/centos7-ocata/current/pike.repo",
  "sudo yum install my_package", ... ]

The first step of the upgrade will be used to deliver any required common
upgrade initialisation, such as switching repos to the target version and
installing any new packages required during the upgrade, as well as
populating the upgrade playbooks.
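
As a rough illustration only (the exact tasks, module choices and rendering
are assumptions here, not a settled interface), the initialisation commands
above could end up being delivered to nodes as ansible tasks along these
lines:

.. code-block:: yaml

   # Hypothetical sketch of generated upgrade initialisation tasks.
   - hosts: overcloud
     become: true
     tasks:
       - name: Switch to the target version repository
         get_url:
           url: https://trunk.rdoproject.org/centos7-ocata/current/pike.repo
           dest: /etc/yum.repos.d/delorean-pike.repo
       - name: Install packages required during the upgrade
         yum:
           name: my_package
           state: present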

Then the operator will run the upgrade targeting specific nodes:

* openstack overcloud upgrade --nodes [overcloud-novacompute-0, overcloud-novacompute-1] or
  openstack overcloud upgrade --nodes "Compute"

This will download and execute the ansible playbooks on the specified set of
nodes. Ideally we will make it possible to specify a role name, with the
playbooks being invoked in a rolling fashion on each node of that role.
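
The rolling behaviour could be expressed with ansible's native ``serial``
keyword; the following is only a sketch, and the play name, host group and
included tasks file are assumptions rather than the final generated content:

.. code-block:: yaml

   # Hypothetical sketch: upgrade one node of the Compute role at a time.
   - name: Upgrade Compute nodes one at a time
     hosts: Compute
     serial: 1
     become: true
     tasks:
       - name: Run the collected upgrade tasks for this role
         include_tasks: Compute_upgrade_tasks.yaml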

One of the required changes is to convert all the service templates to use
'when' conditionals instead of the current 'stepN' grouping. For Pike we did
this conversion in the client_ to avoid breaking the heat driven upgrade
workflow still in use for the controlplane during the Ocata to Pike upgrade.
This will allow us to use the 'ansible-native' loop_ control we are
currently using in the generated ansible playbooks.
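
To illustrate the intended shape (the particular service and tasks here are
examples only, not a prescription), a service template's upgrade_tasks would
state the step they apply to via a 'when' conditional on the step variable
supplied by the playbook loop, roughly as follows:

.. code-block:: yaml

   # Hypothetical example of 'when'-based upgrade_tasks in a service template.
   upgrade_tasks:
     - name: Stop the nova compute service
       when: step|int == 2
       service:
         name: openstack-nova-compute
         state: stopped
     - name: Update the nova compute package
       when: step|int == 3
       yum:
         name: openstack-nova-compute
         state: latest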

Other End User Impact
---------------------

There will be significant changes to the workflow and the cli the operator
uses for the major upgrade, as documented above.

Performance Impact
------------------

The initial Heat stack update will not deliver any of the puppet or docker
configuration to nodes, since the DeploymentSteps will be disabled_ as we
currently do for the Pike minor update. This will mean a much shorter heat
stack update - exact numbers are TBD, but 'minutes not hours'.

Developer Impact
----------------

Should make it easier for developers to debug particular parts of the
upgrades workflow.

Implementation
==============

Assignee(s)
-----------

Contributors:
  Marios Andreou (marios)
  Mathieu Bultel (matbu)
  Sofer Athlang Guyot (chem)
  Steve Hardy (shardy)
  Carlos Camacho (ccamacho)
  Jose Luis Franco Arza (jfrancoa)
  Marius Cornea (mcornea)
  Yurii Prokulevych (yprokule)
  Lukas Bezdicka (social)
  Raviv Bar-Tal (rbartal)
  Amit Ugol (amitu)

Work Items
----------

* Remove step groupings and add 'when' conditionals for all the ansible
  upgrade tasks, minor update tasks, deployment steps and
  post_upgrade_tasks.
* Mistral workflows that can invoke the required stages of the workflow
  (--init and --nodes); a rough sketch is included after this list. There
  is some existing work in this direction in 463765_.
* CLI/python-tripleoclient changes. Related to the previous item, there is
  some work started on this in 463728_.
* UI work - we will need to collaborate with the UI team for the
  integration. We have never had UI driven upgrades or updates.
* CI: implement a simple job (one node, just controller, which does the
  heat-setup-output and runs ansible --nodes Controller) with a keystone
  only upgrade. Then iterate on this as we add service upgrade_tasks.
* Docs!
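
As a very rough sketch of the mistral piece (the workflow name, the
tripleo.ansible-playbook action and its inputs are all assumptions here, not
a confirmed interface), the --nodes stage could look something like:

.. code-block:: yaml

   # Hypothetical mistral workflow that runs the downloaded upgrade playbook
   # against a subset of nodes.
   version: '2.0'

   tripleo.upgrade_nodes:
     input:
       - nodes
     tasks:
       run_upgrade_playbook:
         action: tripleo.ansible-playbook
         input:
           playbook: upgrade_tasks_playbook.yaml
           limit_hosts: <% $.nodes %>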

Testing
=======

We will aim to land a 'keystone-only' job as soon as possible, and update it
as the various changes required to deliver this spec get closer to merging.
For example, we may deploy only a very small subset of services (e.g. just
keystone at first) and then iterate as changes related to this spec are
proposed.

Documentation Impact
====================

We should also track changes in the documented upgrades workflow since, as
described here, it is going to change significantly, both internally and in
the interface exposed to an operator.

References
==========

Check the source_ for links.

.. _manifests: https://github.com/openstack/tripleo-heat-templates/tree/master/docker/services
.. _upgrade_tasks: https://github.com/openstack/tripleo-heat-templates/blob/211d7f32dc9cda261e96c3f5e0e1e12fc0afdbb5/docker/services/nova-compute.yaml#L147
.. _update_tasks: https://github.com/openstack/tripleo-heat-templates/blob/60f3f10442f3b4cedb40def22cf7b6938a39b391/puppet/services/tripleo-packages.yaml#L59
.. _outputs: https://github.com/openstack/tripleo-heat-templates/blob/3dcc9b30e9991087b9e898e25685985df6f94361/common/deploy-steps.j2#L324-L372
.. _utility: https://github.com/openstack/python-tripleoclient/blob/27bba766daa737a56a8d884c47cca1c003f16e3f/tripleoclient/v1/overcloud_config.py#L26-L154
.. _1715557: https://bugs.launchpad.net/tripleo/+bug/1715557
.. _1708115: https://bugs.launchpad.net/tripleo/+bug/1708115
.. _PTG: https://etherpad.openstack.org/p/tripleo-ptg-queens-upgrades
.. _Deployments: https://github.com/openstack/tripleo-heat-templates/blob/f4730632a51dec2b9be6867d58184fac3b8a11a5/common/major_upgrade_steps.j2.yaml#L132-L173
.. _invoke: https://github.com/openstack/tripleo-heat-templates/blob/f4730632a51dec2b9be6867d58184fac3b8a11a5/puppet/upgrade_config.yaml#L21-L50
.. _effort: http://lists.openstack.org/pipermail/openstack-dev/2017-September/122089.html
.. _487488: https://review.openstack.org/#/c/487488/
.. _487496: https://review.openstack.org/#/c/487496/
.. _client: https://github.com/openstack/python-tripleoclient/blob/4d342826d6c3db38ee88dccc92363b655b1161a5/tripleoclient/v1/overcloud_config.py#L63
.. _loop: https://github.com/openstack/tripleo-heat-templates/blob/fe2acfc579295965b5f39c5ef7a34bea35f3d6bf/common/deploy-steps.j2#L364-L365
.. _disabled: https://review.openstack.org/#/c/487496/21/tripleo_common/actions/package_update.py@63
.. _source: https://raw.githubusercontent.com/openstack/tripleo-specs/master/specs/queens/tripleo_ansible_upgrades_workflow.rst
.. _463728: https://review.openstack.org/#/c/463728/
.. _463765: https://review.openstack.org/#/c/463765/