From 3e29e7aa7f12024ac2569ed4e19941c0b20ca4cb Mon Sep 17 00:00:00 2001
From: John Fulton
Date: Thu, 27 May 2021 17:39:38 -0400
Subject: [PATCH] Document HCI deployments with cephadm

Change-Id: I3798d7e366c6e522368d326a44150599d2805b59
---
 deploy-guide/source/features/cephadm.rst        | 141 +++++++++++++++++-
 .../source/features/derived_parameters.rst      | 114 +++++++++-----
 2 files changed, 220 insertions(+), 35 deletions(-)

diff --git a/deploy-guide/source/features/cephadm.rst b/deploy-guide/source/features/cephadm.rst
index b4ef0f42..7de22f68 100644
--- a/deploy-guide/source/features/cephadm.rst
+++ b/deploy-guide/source/features/cephadm.rst
@@ -78,7 +78,7 @@ appropriate environment file as in the example below::
   -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml
 
 If you only wish to deploy Ceph RBD without RGW then use the following
-variation of the above.
+variation of the above::
 
   openstack overcloud deploy --templates \
   -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm-rbd-only.yaml
@@ -615,6 +615,144 @@ Now that the host and OSDs have been logically removed from the Ceph
 cluster proceed to remove the host from the overcloud as described in
 the "Scaling Down" section of :doc:`../provisioning/baremetal_provision`.
 
+Scenario: Deploy Hyperconverged Ceph
+------------------------------------
+
+Use a command like the following to create a `roles.yaml` file
+containing a standard Controller role and a ComputeHCI role::
+
+  openstack overcloud roles generate Controller ComputeHCI -o ~/roles.yaml
+
+The ComputeHCI role is a Compute role which also runs co-located Ceph
+OSD daemons. This kind of service co-location is referred to as HCI,
+or hyperconverged infrastructure. See the :doc:`composable_services`
+documentation for details on roles and services.
+
+When collocating Nova Compute and Ceph OSD services, boundaries can
+be set to reduce contention for CPU and memory between the two
+services. This is done by adding parameters to
+`cephadm-overrides.yaml` like the following::
+
+  parameter_defaults:
+    CephHciOsdType: hdd
+    CephHciOsdCount: 4
+    CephConfigOverrides:
+      osd:
+        osd_memory_target_autotune: true
+        osd_numa_auto_affinity: true
+      mgr:
+        mgr/cephadm/autotune_memory_target_ratio: 0.2
+
+The `CephHciOsdType` and `CephHciOsdCount` parameters are used by the
+Derived Parameters workflow to tune the Nova scheduler so that it
+does not allocate to virtual machines the memory and CPU that Ceph
+needs on the hypervisor. See the :doc:`derived_parameters`
+documentation for details. If you do not use the Derived Parameters
+workflow, then at least set `NovaReservedHostMemory` to the number of
+OSDs on the host multiplied by 5 GB.
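+
+For example, with the `CephHciOsdCount` of 4 shown above, 20 GB
+(20480 MB) could be reserved by adding the following to
+`cephadm-overrides.yaml`::
+
+  parameter_defaults:
+    NovaReservedHostMemory: 20480  # 4 OSDs x 5 GB, in MB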
+
+The `CephConfigOverrides` map passes Ceph OSD parameters to limit the
+CPU and memory used by the OSDs.
+
+`osd_memory_target_autotune`_ is set to true so that the OSD daemons
+adjust their memory consumption based on the `osd_memory_target`
+config option. The `autotune_memory_target_ratio` defaults to 0.7, so
+70% of the total RAM in the system is the starting point, from which
+any memory consumed by non-autotuned Ceph daemons is subtracted, and
+the remaining memory is divided among the OSDs (assuming all OSDs
+have `osd_memory_target_autotune` set to true). For HCI deployments,
+`mgr/cephadm/autotune_memory_target_ratio` can be set to 0.2 so that
+more memory is available for the Nova Compute service. This has the
+same effect as setting the ceph-ansible `is_hci` parameter to true.
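+
+For example, on a hypothetical ComputeHCI node with 256 GB of RAM and
+4 OSDs, the starting point for the autotuned memory target of each
+OSD would be approximately::
+
+  256 GB x 0.2 = 51.2 GB for all OSDs
+  51.2 GB / 4 OSDs = 12.8 GB per OSD (osd_memory_target)
+
+The actual value will be slightly lower because any memory consumed
+by non-autotuned Ceph daemons is subtracted first, as described
+above.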
+
+A two NUMA node system can host a latency-sensitive Nova workload on
+one NUMA node and a Ceph OSD workload on the other NUMA node. To
+configure Ceph OSDs to use a specific NUMA node (and not the one
+being used by the Nova Compute workload), use either of the following
+Ceph OSD configurations:
+
+- `osd_numa_node` sets affinity to a NUMA node (-1 for none)
+- `osd_numa_auto_affinity` automatically sets affinity to the NUMA
+  node where storage and network match
+
+If there are network interfaces on both NUMA nodes and the disk
+controllers are on NUMA node 0, then use a network interface on NUMA
+node 0 for the storage network and host the Ceph OSD workload on NUMA
+node 0. Then host the Nova workload on NUMA node 1 and have it use
+the network interfaces on NUMA node 1. Setting
+`osd_numa_auto_affinity` to true, as in the example
+`cephadm-overrides.yaml` file above, should result in this
+configuration. Alternatively, `osd_numa_node` could be set directly
+to 0 and `osd_numa_auto_affinity` could be left unset so that it
+defaults to false.
+
+When a hyperconverged cluster backfills as a result of an OSD going
+offline, the backfill process can be slowed down. In exchange for a
+slower recovery, the backfill activity has less of an impact on the
+collocated Compute workload. Ceph Pacific has the following defaults
+to control the rate of backfill activity::
+
+  parameter_defaults:
+    CephConfigOverrides:
+      osd:
+        osd_recovery_op_priority: 3
+        osd_max_backfills: 1
+        osd_recovery_max_active_hdd: 3
+        osd_recovery_max_active_ssd: 10
+
+It is not necessary to pass the above since they are the default
+values, but if different values are needed, modify an example like
+the above and include it before deployment. If the values need to be
+adjusted after the deployment, use `ceph config set osd <option> <value>`.
+
+Deploy the overcloud as described in "Scenario: Deploy Ceph with
+TripleO and Metalsmith" but use the `-r` option to include the
+generated `roles.yaml` file and the `-e` option with the
+`cephadm-overrides.yaml` file containing the HCI tunings described
+above.
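+
+For example, assuming the `roles.yaml` and `cephadm-overrides.yaml`
+files created above are in the home directory of the `stack` user,
+the deployment command might look like this (additional environment
+files for your environment may also be required)::
+
+  openstack overcloud deploy --templates \
+      -r ~/roles.yaml \
+      -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
+      -e ~/cephadm-overrides.yaml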
+
+The examples above may be used to tune a hyperconverged system during
+deployment. If the values need to be changed after deployment, then
+use the `ceph config` command within the Ceph shell to set them
+directly.
+
+After deployment, start a Ceph shell as described in "Accessing the
+Ceph Command Line" and confirm the above values were applied. For
+example, to check the NUMA and memory target auto tuning, run
+commands like this::
+
+  [ceph: root@oc0-controller-0 /]# ceph config dump | grep numa
+  osd   advanced  osd_numa_auto_affinity   true
+  [ceph: root@oc0-controller-0 /]# ceph config dump | grep autotune
+  osd   advanced  osd_memory_target_autotune   true
+  [ceph: root@oc0-controller-0 /]# ceph config get mgr mgr/cephadm/autotune_memory_target_ratio
+  0.200000
+  [ceph: root@oc0-controller-0 /]#
+
+We can then confirm that a specific OSD, e.g. osd.11, inherited those
+values with commands like this::
+
+  [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_memory_target
+  4294967296
+  [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_memory_target_autotune
+  true
+  [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_numa_auto_affinity
+  true
+  [ceph: root@oc0-controller-0 /]#
+
+To confirm that the default backfill values are set for the same
+example OSD, use commands like this::
+
+  [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_op_priority
+  3
+  [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_max_backfills
+  1
+  [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_max_active_hdd
+  3
+  [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_max_active_ssd
+  10
+  [ceph: root@oc0-controller-0 /]#
+
 .. _`cephadm`: https://docs.ceph.com/en/latest/cephadm/index.html
 .. _`cleaning instructions in the Ironic documentation`: https://docs.openstack.org/ironic/latest/admin/cleaning.html
@@ -628,3 +766,4 @@ the "Scaling Down" section of :doc:`../provisioning/baremetal_provision`.
 .. _`pgcalc`: http://ceph.com/pgcalc
 .. _`CRUSH Map Rules`: https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/?highlight=ceph%20crush%20rules#crush-map-rules
 .. _`OSD Service Documentation for cephadm`: https://docs.ceph.com/en/latest/cephadm/osd/
+.. _`osd_memory_target_autotune`: https://docs.ceph.com/en/latest/cephadm/osd/#automatically-tuning-osd-memory
diff --git a/deploy-guide/source/features/derived_parameters.rst b/deploy-guide/source/features/derived_parameters.rst
index 58f6e55f..3b3d11df 100644
--- a/deploy-guide/source/features/derived_parameters.rst
+++ b/deploy-guide/source/features/derived_parameters.rst
@@ -9,8 +9,8 @@ supports this feature for both NFV (Network Function Virtualization)
 and HCI (Hyper-converged Infrastructure; nodes with collocated Ceph
 OSD and Nova Compute services) deployments.
 
-Using derived paramters during a deployment
--------------------------------------------
+Using derived parameters during a deployment
+--------------------------------------------
 
 To have TripleO derive parameters during deployment, specify an
 alternative *deployment plan* containing directives which trigger
@@ -18,7 +18,7 @@ either a Mistral workflow (prior to Victoria) or an Ansible playbook
 (in Victoria and newer) which derives the parameters.
 
 A default *deployment plan* is created during deployment. This
-deployment plan my be overridden by passing the ``-p`` or
+deployment plan may be overridden by passing the ``-p`` or
 ``--plan-environment-file`` option to the ``openstack overcloud
 deploy`` command.
 If the ``plan-environment-derived-params.yaml`` file, located in
@@ -42,12 +42,12 @@ for use by a Ceph OSD.
 Parameters which are derived for HCI deployments
 ------------------------------------------------
 
-The derived paramters for HCI sets the NovaReservedHostMemory and
+The derived parameters for HCI set the NovaReservedHostMemory and
 NovaCPUAllocationRatio per role based on the amount and type of Ceph
 OSDs requested during deployment, the available hardware in Ironic,
 and the average Nova guest workload.
 
-Deriving the paramters is useful because in an HCI deployment the Nova
+Deriving the parameters is useful because in an HCI deployment the Nova
 scheduler does not, by default, take into account the requirements of
 the Ceph OSD services which are collocated with the Nova Compute
 services. Thus, it's possible for Compute resources needed by an OSD
@@ -61,7 +61,7 @@ medium the more vCPUs an OSD should use in order for the CPU
 resources to not become a performance bottle-neck. All of this is
 taken into account by the derived parameters for HCI.
 
-The workload of the Nova guests should also to be taken into account.
+The workload of the Nova guests may also be taken into account.
 The ``plan-environment-derived-params.yaml`` file contains the
 following::
@@ -98,18 +98,18 @@ to take into account the memory overhead per guest for the hypervisor.
 It also does not set the NovaCPUAllocationRatio. Thus, passing an
 expected average workload will produce a more accurate set of derived
 HCI parameters. However, this default does allow for a simpler
-deployment where derived paramters may be used without having to
+deployment where derived parameters may be used without having to
 specify a workload but the OSDs are protected from having their memory
 allocated to Nova guests.
 
-Deriving HCI paramters before a deployment
-------------------------------------------
+Deriving HCI parameters before a deployment
+-------------------------------------------
 
 The ``tripleo_derive_hci_parameters`` Ansible module may be run
 independently on the undercloud before deployment to generate a YAML
-file to pass to the ``opentack overcloud deploy`` command with the
+file to pass to the ``openstack overcloud deploy`` command with the
 ``-e`` option. If this option is used it's not necessary to derive HCI
-paramters during deployment. Using this option also allows the
+parameters during deployment. Using this option also allows the
 deployer to quickly see the values of the derived parameters.
 
 .. warning::
@@ -143,35 +143,81 @@ modify it to set the four playbook variables as below::
 
     # Set the following variables for your environment
     ironic_node_id: ef4cbd49-3773-4db2-80da-4210a7c24047
     role: ComputeHCI
-    average_guest_cpu_utilization_percentage: 10
-    average_guest_memory_size_in_mb: 2048
+    average_guest_cpu_utilization_percentage: 50
+    average_guest_memory_size_in_mb: 8192
     heat_environment_input_file: /home/stack/ceph_overrides.yaml
   [stack@undercloud ~]$
 
 In the above example it is assumed the ``role`` `ComputeHCI` will use
-nodes with the same type of hardwqare which is set to the
+nodes with the same type of hardware as the node identified by the
 ``ironic_node_id`` and that the average guest will use 50% of its CPU
-and will use 8 GB of RAM. The ``heat_environment_input_file`` must
-be set to the path of the Heat environment file where the
-``CephAnsibleDisksConfig`` parameter is set. This parameter is used
-to define which disks are used as Ceph OSDs and might look like the
-following if bluestore was being deployed on 4 SSDs::
+and will use 8 GB of RAM. If the workload is unknown, remove the two
+``average_guest_*`` variables. The system tuning will not be as
+accurate but the Ansible module will at least set
+``NovaReservedHostMemory`` as a function of the number of OSDs.
 
-  CephAnsibleDisksConfig:
-    osd_scenario: lvm
-    osd_objectstore: bluestore
-    osds_per_device: 4
-    devices:
-      - /dev/sda
-      - /dev/sdb
-      - /dev/sdc
-      - /dev/sdd
+The ``heat_environment_input_file`` must be set to the path of the
+Heat environment file which defines the OSDs.
 
-The derived parameters workflow would use the values above to
-determine the number of OSDs requested (e.g. 4 devices * 4 OSDs per
-device = 16) and the type of device based on the Ironic data
-(e.g. during introspection, ironic can determine if a storage device
-is rotational).
+.. admonition:: Victoria or earlier
+
+   When ceph-ansible is used, in place of cephadm, this should be the
+   file where the ``CephAnsibleDisksConfig`` parameter is set. This
+   parameter is used to define which disks are used as Ceph OSDs and
+   might look like the following if bluestore was being deployed on 4
+   NVMe SSDs::
+
+     parameter_defaults:
+       CephAnsibleDisksConfig:
+         osd_scenario: lvm
+         osd_objectstore: bluestore
+         osds_per_device: 4
+         devices:
+           - /dev/nvme0n1
+           - /dev/nvme0n2
+           - /dev/nvme0n3
+           - /dev/nvme0n4
+
+   The derived parameters workflow would use the values above to
+   determine the number of OSDs requested (e.g. 4 devices * 4 OSDs per
+   device = 16) and the type of device based on the Ironic data
+   (e.g. during introspection, ironic can determine if a storage device
+   is rotational).
+
+If cephadm is used, in place of ceph-ansible (for Wallaby and newer),
+then the ``heat_environment_input_file`` must be set to the path of
+the file where the ``CephHciOsdCount`` and ``CephHciOsdType``
+parameters are set.
+
+The ``CephHciOsdCount`` and ``CephHciOsdType`` parameters exist
+because ``CephOsdSpec``, as used by cephadm, might only specify a
+description of devices to be used as OSDs (e.g. "all devices"), and
+not a list of devices like ``CephAnsibleDisksConfig``, so setting the
+count directly is necessary in order to know how much CPU/RAM to
+reserve. Similarly, because a device path is not hard coded, we
+cannot look up that device in Ironic to determine its type. For
+information on the ``CephOsdSpec`` parameter see the :doc:`cephadm`
+documentation.
+
+``CephHciOsdType`` is the type of ``data_device`` (not ``db_device``)
+used for each OSD and must be one of hdd, ssd, or nvme. Both
+parameters are used by the Ansible module
+``tripleo_derive_hci_parameters``.
+
+``CephHciOsdCount`` is the number of expected Ceph OSDs per HCI
+node. If a server has eight HDD drives, then the parameters should be
+set like this::
+
+    parameter_defaults:
+      CephHciOsdType: hdd
+      CephHciOsdCount: 8
+
+To fully utilize NVMe devices for data (not metadata), multiple OSDs
+per device are required. If the ``CephOsdSpec`` parameter is used to
+set `osds_per_device` to 4, and there are four NVMe drives on a host
+(and no HDD drives), then the parameters should be set like this::
+
+    parameter_defaults:
+      CephHciOsdType: nvme
+      CephHciOsdCount: 16
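+
+For reference, a ``CephOsdSpec`` which might produce the NVMe layout
+above is sketched here, alongside the ``CephHciOsdType`` and
+``CephHciOsdCount`` settings shown in the previous example; see the
+:doc:`cephadm` documentation for the authoritative format::
+
+    parameter_defaults:
+      CephOsdSpec:              # hypothetical spec: 4 OSDs per data device
+        data_devices:
+          all: true
+        osds_per_device: 4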
 
 After these values are set run the playbook::
 
@@ -184,7 +230,7 @@ After these values are set run the playbook::
    TASK [Get baremetal inspection data] *********************************************************
    ok: [localhost]
 
-   TASK [Get tripleo CephDisks environment paramters] *******************************************
+   TASK [Get tripleo CephDisks environment parameters] *******************************************
    ok: [localhost]
 
    TASK [Derive HCI parameters] *****************************************************************