stx-6.0: Initial spec for ceph upgrade

This spec proposes the upgrade of Ceph from Mimic to Nautilus and of its
related components as part of the STX 6.0 release. The upgrade is required
since Mimic is already EOL.

Story: 2009074

Signed-off-by: Vinicius Lopes da Silva <vinicius.lopesdasilva@windriver.com>
Change-Id: If7ed1c55c26aa4501a5638036dfc16cd4aca1291

..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License. http://creativecommons.org/licenses/by/3.0/legalcode
===================================
Ceph upgrade from Mimic to Nautilus
===================================
Storyboard:
https://storyboard.openstack.org/#!/story/2009074
This story covers the upgrade of Ceph from Mimic to Nautilus. The upgrade also includes code and
configuration changes in StarlingX components that are needed to support Nautilus.
Official instructions about how to migrate from Mimic to Nautilus can be found in [1]_.
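
For reference, the core of the upstream procedure in [1]_ boils down to the sequence below
(a simplified sketch; package upgrades and daemon restarts are performed per host, and the
StarlingX upgrade will need to orchestrate these steps through its own tooling):

.. code-block:: bash

    # Avoid unnecessary rebalancing while daemons are restarted
    ceph osd set noout

    # After upgrading/restarting all monitors, confirm they run Nautilus
    ceph mon dump | grep min_mon_release   # expect "min_mon_release 14 (nautilus)"

    # ...upgrade and restart ceph-mgr and ceph-osd daemons, host by host...

    # Once every daemon reports Nautilus, finalize the upgrade
    ceph osd require-osd-release nautilus
    ceph mon enable-msgr2
    ceph osd unset noout
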
Problem description
===================
Mimic reached end of life on 2020-07-22. StarlingX needs to move to an actively supported release
that provides an automated version migration between releases (i.e. when the MON/OSD/MDS daemons
start, service data formats are migrated to the new formats, if required, from the Mimic formats).
This will require evaluating the historic HA reliability code and removing/retiring unneeded code,
enabling features that are now the upstream defaults (BlueStore, systemd service files, ceph-volume
instead of ceph-disk for OSD deployment), and easing future upgrades.
Use Cases
---------
Users should keep access to the same storage features they have today without noticing any
difference between Ceph versions.
Proposed change
===============
The first step is to build a StarlingX ISO that contains Ceph Nautilus.
Other choices such as Octopus or Pacific are ruled out because we want to align with what is currently
shipped by Debian Bullseye, which is Nautilus. In addition, Pacific only supports upgrades from Octopus or Nautilus [2]_.
Once the image is built, we can evaluate the downstream changes carried on Ceph Mimic and port those
that are still needed to the Ceph Nautilus downstream branch. Next, we should be able to successfully
install an AIO-SX system with Ceph Nautilus built into it.
The integration of the Ceph subsystems with the new Ceph version will also need to be verified
(a quick post-install check is sketched after the list below). The subsystems are:
* ceph-manager
* python-cephclient
* mgr-restful-plugin
* puppet
* ansible playbooks
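
A minimal sketch of such a post-install sanity check, run from a controller; the StarlingX
service wrappers (e.g. mgr-restful-plugin) have their own status commands and are not shown here:

.. code-block:: bash

    # Cluster is healthy and every daemon reports a Nautilus (14.2.x) build
    ceph -s
    ceph versions

    # The restful mgr module used by python-cephclient/mgr-restful-plugin is up
    ceph mgr module ls | grep restful
    ceph mgr services
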
New features enablement
-----------------------
Having an ISO, we can verify the enablement of some new features (a brief verification sketch
follows this list), such as:

* Switch OSDs from FileStore to BlueStore

  * BlueStore has been the default backend for OSDs since Luminous and improves performance over
    the previous FileStore backend. More details can be found in [3]_.

* Switch from sysvinit services/HA scripts to systemd/HA driven services

  * Ceph upstream uses systemd to control Ceph process initialization. This was disabled downstream
    to maintain the historical (Ceph Hammer and Ceph Jewel) sysvinit script optimizations.

* Migrate OSD deployment to ceph-volume due to the deprecation of ceph-disk
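
A minimal sketch of how the OSD backend can be checked and how a BlueStore OSD could be deployed
with ceph-volume (the OSD id and device path are examples only):

.. code-block:: bash

    # Which backend an existing OSD uses ("filestore" or "bluestore")
    ceph osd metadata 0 | grep osd_objectstore

    # Deploy a new BlueStore OSD with ceph-volume and list the result
    ceph-volume lvm create --bluestore --data /dev/sdb
    ceph-volume lvm list
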
Investigate differences between ceph versions
---------------------------------------------
Some commands may have been changed or deprecated between Mimic and Nautilus. We will need to
identify those commands, determine what changed, and assess their impact on the overall system
(a simple way to locate the callers is sketched after the list below). The impacts might happen
in the following projects/modules:

* config:

  * sysinv/cgts-client
  * sysinv/sysinv

* stx-puppet:

  * puppet-manifests/src/modules/platform/manifests/ceph.pp

* utilities:

  * ceph/ceph-manager
  * ceph/python-cephclient

* integ:

  * ceph
  * config/puppet-modules/openstack/puppet-ceph-2.2.0 - upgrade to 3.1.1
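
A simple way to locate the call sites that may be affected (repository paths and the search
pattern are illustrative only):

.. code-block:: bash

    # Find direct ceph CLI invocations in the impacted repos
    grep -rnE '\bceph (osd|mon|mgr|auth|pg|health|df)\b' \
        config/sysinv stx-puppet/puppet-manifests utilities/ceph integ/ceph
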
Alternatives
------------
Ceph Octopus might be an alternative if it shows up in the Debian Bullseye package list and if time permits.
Data model impact
-----------------
N/A
REST API impact
---------------
The impact will depend on the command changes required by Nautilus.
Security impact
---------------
N/A
Other end user impact
---------------------
N/A
Performance Impact
------------------
* A performance improvement is expected when switching OSDs from FileStore to BlueStore.
* Replacing ceph-disk with ceph-volume should increase reliability and improve performance;
  details to be verified in [4]_. A minimal way to compare throughput is sketched below.
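
A minimal sketch for comparing throughput before and after the change, using a throwaway pool
(pool name and PG count are examples; deleting pools requires ``mon_allow_pool_delete`` to be
enabled):

.. code-block:: bash

    ceph osd pool create perftest 64
    rados bench -p perftest 30 write --no-cleanup
    rados bench -p perftest 30 seq
    rados -p perftest cleanup
    ceph osd pool delete perftest perftest --yes-i-really-really-mean-it
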
Other deployer impact
---------------------
N/A
Developer impact
----------------
N/A
Upgrade impact
--------------
Upgrading to subsequent releases should become simpler.
The newly enabled features should provide a better user experience.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Vinícius Lopes da Silva (viniciuslopesdasilva)
Other contributors:
- Delfino Gomes Curado Filho (dcuradof)
- Felipe Sanches Zanoni (fsanches)
- Mauricio Biasi do Monte Carmelo (mbiasido)
- Thiago Oliveira Miranda (thiagooliveiramiranda)
- Alan Kyoshi (akyoshi)
- Daniel Pinto Barros (dbarros)
Repos Impacted
--------------
- config
- integ
- stx-puppet
- ha
- ansible-playbook
- utilities
Work Items
----------
* Verify compatibility between Nautilus and Mimic. According to the
  `upgrade compatibility notes <https://docs.ceph.com/en/latest/releases/nautilus/#upgrade-compatibility-notes>`_,
  some commands have changed between the versions and we should assess their impact on the
  current implementation.
* Current OSDs are FileStore based, while BlueStore is the default in Nautilus. The feasibility of
  migrating OSDs from FileStore to BlueStore needs to be determined, as does whether FileStore and
  BlueStore OSDs can coexist (a per-OSD conversion sketch follows this list).
* Ceph's default use of systemd to control ceph process initialization is currently
  `disabled <https://github.com/starlingx-staging/stx-ceph/commit/ecbbc1c833106a1151c6ccb93eebbad93b55b2c2>`_. It should
  be re-enabled, and the changes needed in the
  `init script <https://github.com/starlingx-staging/stx-ceph/commits/stx/v13.2.2/src/init-ceph.in>`_ and
  `pmon <https://opendev.org/starlingx/integ/src/branch/master/ceph/ceph/files/ceph-init-wrapper.sh>`_ wrapper should be evaluated.
* ceph-disk is currently used to deploy OSDs, but it is deprecated and ceph-volume should be used
  in its place. This will require an investigation into the impacts of the change. In the worst
  case scenario, it is still possible to use ceph-disk, since it is available through the Ceph
  Pacific release (latest to date).
* Evaluate the `current patch set <https://github.com/starlingx-staging/stx-ceph/commits/stx/v13.2.2>`_ applied on
  Mimic and port the relevant patches to the Nautilus branch.
* Ensure the integration between Ceph and its subsystems (ceph-manager, python-cephclient, mgr-restful-plugin,
  Puppet code, ansible-playbooks) is working correctly.
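
A minimal sketch of the per-OSD FileStore to BlueStore conversion, following the procedure from
the upstream BlueStore migration documentation (the OSD id and device are examples; on StarlingX
the equivalent steps would have to be driven by the sysinv/puppet provisioning code):

.. code-block:: bash

    ID=0            # example OSD id
    DEV=/dev/sdb    # example device backing that OSD

    # Drain the OSD and wait until it is safe to destroy
    ceph osd out ${ID}
    while ! ceph osd safe-to-destroy osd.${ID}; do sleep 60; done

    # Stop the daemon and destroy the OSD, keeping its id for reuse
    systemctl stop ceph-osd@${ID}
    umount /var/lib/ceph/osd/ceph-${ID}
    ceph osd destroy ${ID} --yes-i-really-mean-it

    # Re-create it as a BlueStore OSD with ceph-volume, reusing the same id
    ceph-volume lvm zap ${DEV}
    ceph-volume lvm create --bluestore --data ${DEV} --osd-id ${ID}
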
Dependencies
============
N/A
Testing
=======
All validation activities should pass Sanity/Storage regression tests.
Standard configurations scenarios
---------------------------------
* AIO-SX
* AIO-DX
* Standard 2C+2W
* Storage 2C+2S+2W
* Storage Tiers - Can be done on AIO-SX, should be valid across all installs
Additional scenarios
--------------------
* SSD Journal Disks - Use SSD journal disks to validate proper configuration in a storage lab
* Peer Groups - Provision system with up to 8 (replication 2) and 9 (replication 3) storage hosts
* OSD disk replacement - Validate OSD disk replacement procedure
Backup and restore scenarios
----------------------------
* B&R - AIO-SX
* B&R - AIO-DX
* B&R - Standard 2C+2W
* B&R - Storage 2C+2S+2W
Documentation Impact
====================
The changes to be made shouldn't interfere with system usage. At this time,
no documentation changes are expected to be required.
References
==========
.. [1] https://docs.ceph.com/en/latest/releases/nautilus/#upgrading-from-mimic-or-luminous
.. [2] https://docs.ceph.com/en/latest/releases/pacific/#upgrade-from-pre-nautilus-releases-like-mimic-or-luminous
.. [3] https://ceph.io/en/news/blog/2017/new-luminous-bluestore/
.. [4] https://docs.ceph.com/en/latest/ceph-volume/intro/#ceph-disk-replaced