Merge "Document HCI deployments with cephadm"

This commit is contained in:
Zuul 2021-06-08 13:12:17 +00:00 committed by Gerrit Code Review
commit 25c4b89652
2 changed files with 220 additions and 35 deletions

View File

@ -78,7 +78,7 @@ appropriate environment file as in the example below::
-e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml
If you only wish to deploy Ceph RBD without RGW then use the following
variation of the above::
openstack overcloud deploy --templates \
-e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm-rbd-only.yaml
@ -615,6 +615,144 @@ Now that the host and OSDs have been logically removed from the Ceph
cluster proceed to remove the host from the overcloud as described in
the "Scaling Down" section of :doc:`../provisioning/baremetal_provision`.
Scenario: Deploy Hyperconverged Ceph
------------------------------------
Use a command like the following to create a `roles.yaml` file
containing a standard Controller role and a ComputeHCI role::
openstack overcloud roles generate Controller ComputeHCI -o ~/roles.yaml
The ComputeHCI role is a Compute node which also runs co-located Ceph
OSD daemons. This kind of service co-location is referred to as HCI,
or hyperconverged infrastructure. See the :doc:`composable_services`
documentation for details on roles and services.
When collocating Nova Compute and Ceph OSD services, boundaries can be
set to reduce contention for CPU and memory between the two services.
This is possible by adding parameters to `cephadm-overrides.yaml` like
the following::
  parameter_defaults:
    CephHciOsdType: hdd
    CephHciOsdCount: 4
    CephConfigOverrides:
      osd:
        osd_memory_target_autotune: true
        osd_numa_auto_affinity: true
      mgr:
        mgr/cephadm/autotune_memory_target_ratio: 0.2
The `CephHciOsdType` and `CephHciOsdCount` parameters are used by the
Derived Parameters workflow to tune the Nova scheduler so that it does
not allocate to virtual machines the memory and CPU that Ceph needs on
the hypervisor. See the :doc:`derived_parameters` documentation for
details. If you do not use the Derived Parameters workflow, then at
least set `NovaReservedHostMemory` to the number of OSDs multiplied by
5 GB per OSD per host.
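For example, assuming a ComputeHCI node hosting four OSDs as in the
example above, the reservation would be 4 OSDs * 5 GB = 20 GB,
expressed in MB as expected by `NovaReservedHostMemory`::

  parameter_defaults:
    # 4 OSDs * 5 GB per OSD = 20 GB = 20480 MB (illustrative value)
    NovaReservedHostMemory: 20480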
The `CephConfigOverrides` map passes Ceph OSD parameters to limit the
CPU and memory used by the OSDs.
The `osd_memory_target_autotune`_ option is set to true so that the
OSD daemons adjust their memory consumption based on the
`osd_memory_target` config option. The `autotune_memory_target_ratio`
defaults to 0.7, so 70% of the total RAM in the system is the starting
point, from which any memory consumed by non-autotuned Ceph daemons
is subtracted, and the remaining memory is then divided among the OSDs
(assuming all OSDs have `osd_memory_target_autotune` set to true). For HCI
deployments the `mgr/cephadm/autotune_memory_target_ratio` can be set
to 0.2 so that more memory is available for the Nova Compute
service. This has the same effect as setting the ceph-ansible `is_hci`
parameter to true.
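As a rough illustration of the arithmetic, assuming a hypothetical
host with 256 GB of RAM and four OSDs, and ignoring memory consumed by
non-autotuned Ceph daemons::

  256 GB * 0.2 (autotune_memory_target_ratio) = 51.2 GB for autotuned Ceph daemons
  51.2 GB / 4 OSDs                            = 12.8 GB osd_memory_target per OSD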
A two NUMA node system can host a latency sensitive Nova workload on
one NUMA node and a Ceph OSD workload on the other NUMA node. To
configure Ceph OSDs to use a specific NUMA node (and not the one being
used by the Nova Compute workload) use either of the following Ceph
OSD configurations:
- `osd_numa_node` sets affinity to a numa node (-1 for none)
- `osd_numa_auto_affinity` automatically sets affinity to the NUMA
node where storage and network match
If there are network interfaces on both NUMA nodes and the disk
controllers are on NUMA node 0, then use a network interface on NUMA
node 0 for the storage network and host the Ceph OSD workload on NUMA
node 0. Then host the Nova workload on NUMA node 1 and have it use the
network interfaces on NUMA node 1. Setting `osd_numa_auto_affinity` to
true, as in the example `cephadm-overrides.yaml` file above, should
result in this configuration. Alternatively, `osd_numa_node` could be
set directly to 0 and `osd_numa_auto_affinity` could be left unset so
that it defaults to false.
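For example, a hypothetical variation of the `cephadm-overrides.yaml`
file above, which pins the OSDs to NUMA node 0 directly instead of
relying on auto affinity, might look like this::

  parameter_defaults:
    CephConfigOverrides:
      osd:
        # pin the OSDs to NUMA node 0 (-1 would mean no affinity)
        osd_numa_node: 0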
When a hyperconverged cluster backfills as a result of an OSD going
offline, the backfill process can be slowed down. In exchange for a
slower recovery, the backfill activity has less of an impact on
the collocated Compute workload. Ceph Pacific has the following
defaults to control the rate of backfill activity::
  parameter_defaults:
    CephConfigOverrides:
      osd:
        osd_recovery_op_priority: 3
        osd_max_backfills: 1
        osd_recovery_max_active_hdd: 3
        osd_recovery_max_active_ssd: 10
It is not necessary to pass the above since they are the default
values, but if they need to be deployed with different values, modify
an example like the above before deployment. If the values need to be
adjusted after the deployment, use `ceph config set osd <key> <value>`.
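For example, if the backfill limit needed to be raised after
deployment, a command like the following could be run from the Ceph
shell (the value shown is only illustrative)::

  [ceph: root@oc0-controller-0 /]# ceph config set osd osd_max_backfills 2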
Deploy the overcloud as described in "Scenario: Deploy Ceph with
TripleO and Metalsmith" but use the `-r` option to include the
generated `roles.yaml` file and the `-e` option with the
`cephadm-overrides.yaml` file containing the HCI tunings described
above.
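For example, a deployment command combining these options might look
like the following sketch, which omits any other environment files a
particular deployment requires; `cephadm.yaml` is the environment file
shown earlier and `cephadm-overrides.yaml` is the overrides file
described in this scenario::

  openstack overcloud deploy --templates \
      -r ~/roles.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
      -e cephadm-overrides.yaml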
The examples above may be used to tune a hyperconverged system during
deployment. If the values need to be changed after deployment, then
use the `ceph orchestrator` command to set them directly.
After deployment, start a Ceph shell as described in "Accessing the
Ceph Command Line" and confirm that the above values were applied. For
example, to check the NUMA and memory target auto tuning, run
commands like this::
[ceph: root@oc0-controller-0 /]# ceph config dump | grep numa
osd advanced osd_numa_auto_affinity true
[ceph: root@oc0-controller-0 /]# ceph config dump | grep autotune
osd advanced osd_memory_target_autotune true
[ceph: root@oc0-controller-0 /]# ceph config get mgr mgr/cephadm/autotune_memory_target_ratio
0.200000
[ceph: root@oc0-controller-0 /]#
We can then confirm that a specific OSD, e.g. osd.11, inherited those
values with commands like this::
[ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_memory_target
4294967296
[ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_memory_target_autotune
true
[ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_numa_auto_affinity
true
[ceph: root@oc0-controller-0 /]#
To confirm that the default backfill values are set for the same
example OSD, use commands like this::
[ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_op_priority
3
[ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_max_backfills
1
[ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_max_active_hdd
3
[ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_max_active_ssd
10
[ceph: root@oc0-controller-0 /]#
.. _`cephadm`: https://docs.ceph.com/en/latest/cephadm/index.html
.. _`cleaning instructions in the Ironic documentation`: https://docs.openstack.org/ironic/latest/admin/cleaning.html
@ -628,3 +766,4 @@ the "Scaling Down" section of :doc:`../provisioning/baremetal_provision`.
.. _`pgcalc`: http://ceph.com/pgcalc
.. _`CRUSH Map Rules`: https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/?highlight=ceph%20crush%20rules#crush-map-rules
.. _`OSD Service Documentation for cephadm`: https://docs.ceph.com/en/latest/cephadm/osd/
.. _`osd_memory_target_autotune`: https://docs.ceph.com/en/latest/cephadm/osd/#automatically-tuning-osd-memory

View File

@ -9,8 +9,8 @@ supports this feature for both NFV (Network Function Virtualization)
and HCI (Hyper-converged Infrastructure; nodes with collocated Ceph
OSD and Nova Compute services) deployments.
Using derived parameters during a deployment
--------------------------------------------
To have TripleO derive parameters during deployment, specify an
alternative *deployment plan* containing directives which trigger
@ -18,7 +18,7 @@ either a Mistral workflow (prior to Victoria) or an Ansible playbook
(in Victoria and newer) which derives the parameters.
A default *deployment plan* is created during deployment. This
deployment plan may be overridden by passing the ``-p`` or
``--plan-environment-file`` option to the ``openstack overcloud
deploy`` command. If the ``plan-environment-derived-params.yaml``
file, located in
@ -42,12 +42,12 @@ for use by a Ceph OSD.
Parameters which are derived for HCI deployments
------------------------------------------------
The derived parameters for HCI set the NovaReservedHostMemory and
NovaCPUAllocationRatio per role based on the amount and type of Ceph
OSDs requested during deployment, the available hardware in Ironic,
and the average Nova guest workload.
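For illustration only, the derived role-specific output might resemble
the following; the numbers are hypothetical and depend on the hardware
and workload::

  parameter_defaults:
    ComputeHCIParameters:
      NovaReservedHostMemory: 75000
      NovaCPUAllocationRatio: 8.2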
Deriving the parameters is useful because in an HCI deployment the Nova
scheduler does not, by default, take into account the requirements of
the Ceph OSD services which are collocated with the Nova Compute
services. Thus, it's possible for Compute resources needed by an OSD
@ -61,7 +61,7 @@ medium the more vCPUs an OSD should use in order for the CPU resources
to not become a performance bottle-neck. All of this is taken into
account by the derived parameters for HCI.
The workload of the Nova guests may also be taken into account.
The ``plan-environment-derived-params.yaml`` file contains the
following::
@ -98,18 +98,18 @@ to take into account the memory overhead per guest for the hypervisor.
It also does not set the NovaCPUAllocationRatio. Thus, passing an
expected average workload will produce a more accurate set of derived
HCI parameters. However, this default does allow for a simpler
deployment where derived parameters may be used without having to
specify a workload but the OSDs are protected from having their memory
allocated to Nova guests.
Deriving HCI parameters before a deployment
-------------------------------------------
The ``tripleo_derive_hci_parameters`` Ansible module may be run
independently on the undercloud before deployment to generate a YAML
file to pass to the ``openstack overcloud deploy`` command with the
``-e`` option. If this option is used it's not necessary to derive HCI
parameters during deployment. Using this option also allows the
deployer to quickly see the values of the derived parameters.
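For example, assuming the module wrote its output to a hypothetical
file such as ``~/derived_parameters.yaml``, that file could then be
passed to the deployment::

  openstack overcloud deploy --templates \
      -e ~/derived_parameters.yaml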
.. warning::
@ -143,35 +143,81 @@ modify it to set the four playbook variables as below::
# Set the following variables for your environment
ironic_node_id: ef4cbd49-3773-4db2-80da-4210a7c24047
role: ComputeHCI
average_guest_cpu_utilization_percentage: 50
average_guest_memory_size_in_mb: 8192
heat_environment_input_file: /home/stack/ceph_overrides.yaml
[stack@undercloud ~]$
In the above example it is assumed that the ``role`` `ComputeHCI` will
use nodes with the same type of hardware as the node identified by
``ironic_node_id``, and that the average guest will use 50% of its CPU
and will use 8 GB of RAM. If the workload is unknown, remove these
variables. The system tuning will not be as accurate but the Ansible
module will at least set the NovaReservedHostMemory as a function of
the number of OSDs.
The ``heat_environment_input_file`` must be set to the path of the
Heat environment file which defines the OSDs.
.. admonition:: Victoria or earlier
When ceph-ansible is used, in place of cephadm, this should be the
file where the ``CephAnsibleDisksConfig`` parameter is set. This
parameter is used to define which disks are used as Ceph OSDs and
might look like the following if bluestore was being deployed on 4
NVMe SSDs::
  parameter_defaults:
    CephAnsibleDisksConfig:
      osd_scenario: lvm
      osd_objectstore: bluestore
      osds_per_device: 4
      devices:
        - /dev/nvme0n1
        - /dev/nvme0n2
        - /dev/nvme0n3
        - /dev/nvme0n4
The derived parameters workflow would use the values above to
determine the number of OSDs requested (e.g. 4 devices * 4 OSDs per
device = 16) and the type of device based on the Ironic data
(e.g. during introspection, ironic can determine if a storage device
is rotational).
If cephadm is used, in place of ceph-ansible (for Wallaby and newer),
then the ``heat_environment_input_file`` must be set to the path of
the file where the ``CephHciOsdCount`` and ``CephHciOsdType``
parameters are set.
The ``CephHciOsdCount`` and ``CephHciOsdType`` parameters exist because
``CephOsdSpec``, as used by cephadm, might only specify a description
of the devices to be used as OSDs (e.g. "all devices") rather than a
list of devices like ``CephAnsibleDisksConfig``, so setting the count
directly is necessary in order to know how much CPU/RAM to reserve. Similarly,
because a device path is not hard coded, we cannot look up that device
in Ironic to determine its type. For information on the
``CephOsdSpec`` parameter see the :doc:`cephadm` documentation.
``CephHciOsdType`` is the type of data_device (not db_device) used for
each OSD and must be one of hdd, ssd, or nvme. These are used by
the Ansible module tripleo_derive_hci_parameters.
``CephHciOsdCount`` is the number of expected Ceph OSDs per HCI
node. If a server has eight HDD drives, then the parameters should be
set like this::
  parameter_defaults:
    CephHciOsdType: hdd
    CephHciOsdCount: 8
To fully utilize nvme devices for data (not metadata), multiple
OSDs are required. If the ``CephOsdSpec`` parameter is used to set
`osds_per_device` to 4, and there are four NVMe drives on a host (and
no HDD drives), then the parameters should be set like this::
  parameter_defaults:
    CephHciOsdType: nvme
    CephHciOsdCount: 16
After these values are set, run the playbook::
@ -184,7 +230,7 @@ After these values are set run the playbook::
TASK [Get baremetal inspection data] *********************************************************
ok: [localhost]
TASK [Get tripleo CephDisks environment parameters] *******************************************
ok: [localhost]
TASK [Derive HCI parameters] *****************************************************************