Document HCI deployments with cephadm
Change-Id: I3798d7e366c6e522368d326a44150599d2805b59
This commit is contained in:
parent 642efbce3e
commit 3e29e7aa7f

@@ -78,7 +78,7 @@ appropriate environment file as in the example below::

    -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml

If you only wish to deploy Ceph RBD without RGW then use the following
variation of the above::

    openstack overcloud deploy --templates \
        -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm-rbd-only.yaml

@@ -615,6 +615,144 @@ Now that the host and OSDs have been logically removed from the Ceph
cluster, proceed to remove the host from the overcloud as described in
the "Scaling Down" section of :doc:`../provisioning/baremetal_provision`.

Scenario: Deploy Hyperconverged Ceph
------------------------------------

Use a command like the following to create a `roles.yaml` file
containing a standard Controller role and a ComputeHCI role::

    openstack overcloud roles generate Controller ComputeHCI -o ~/roles.yaml

The ComputeHCI role is a Compute node which also runs co-located Ceph
OSD daemons. This kind of service co-location is referred to as HCI,
or hyperconverged infrastructure. See the :doc:`composable_services`
documentation for details on roles and services.
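
For illustration, the ComputeHCI entry in the generated `roles.yaml`
contains, among many other services, both the Compute and the Ceph OSD
service (an abridged sketch, not the full generated role)::

    - name: ComputeHCI
      ServicesDefault:
        - OS::TripleO::Services::NovaCompute
        - OS::TripleO::Services::CephOSD
        # ... many other services omitted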

When collocating Nova Compute and Ceph OSD services, boundaries can be
set to reduce contention for CPU and memory between the two services.
This is possible by adding parameters to `cephadm-overrides.yaml` like
the following::

    parameter_defaults:
      CephHciOsdType: hdd
      CephHciOsdCount: 4
      CephConfigOverrides:
        osd:
          osd_memory_target_autotune: true
          osd_numa_auto_affinity: true
        mgr:
          mgr/cephadm/autotune_memory_target_ratio: 0.2

The `CephHciOsdType` and `CephHciOsdCount` parameters are used by the
Derived Parameters workflow to tune the Nova scheduler so that it does
not allocate to virtual machines a certain amount of the hypervisor's
memory and CPU, leaving those resources for Ceph instead. See the
:doc:`derived_parameters` documentation for details. If you do not use
the Derived Parameters workflow, then at least set
`NovaReservedHostMemory` to the number of OSDs multiplied by 5 GB per
OSD per host.
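
For example, with `CephHciOsdCount: 4` the reservation would be at
least 4 x 5 GB = 20 GB. A minimal sketch, assuming the value is
expressed in MB as with the Nova `reserved_host_memory_mb` option it
maps to::

    parameter_defaults:
      NovaReservedHostMemory: 20480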

The `CephConfigOverrides` map passes Ceph OSD parameters to limit the
CPU and memory used by the OSDs.

The `osd_memory_target_autotune`_ option is set to true so that the OSD
daemons will adjust their memory consumption based on the
`osd_memory_target` config option. The `autotune_memory_target_ratio`
defaults to 0.7, so 70% of the total RAM in the system is the starting
point, from which any memory consumed by non-autotuned Ceph daemons is
subtracted, and then the remaining memory is divided among the OSDs
(assuming all OSDs have `osd_memory_target_autotune` set to true). For
HCI deployments the `mgr/cephadm/autotune_memory_target_ratio` can be
set to 0.2 so that more memory is available for the Nova Compute
service. This has the same effect as setting the ceph-ansible `is_hci`
parameter to true.
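
For example, on a hypothetical ComputeHCI node with 256 GB of RAM,
four autotuned OSDs, and a ratio of 0.2, each OSD would receive an
`osd_memory_target` of roughly (256 GB x 0.2) / 4 = 12.8 GB (minus a
share of any memory consumed by non-autotuned Ceph daemons), leaving
the rest of the RAM for the Nova Compute service.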

A two NUMA node system can host a latency sensitive Nova workload on
one NUMA node and a Ceph OSD workload on the other NUMA node. To
configure Ceph OSDs to use a specific NUMA node (and not the one being
used by the Nova Compute workload) use either of the following Ceph
OSD configurations:

- `osd_numa_node` sets affinity to a NUMA node (-1 for none)
- `osd_numa_auto_affinity` automatically sets affinity to the NUMA
  node where storage and network match

If there are network interfaces on both NUMA nodes and the disk
controllers are on NUMA node 0, then use a network interface on NUMA
node 0 for the storage network and host the Ceph OSD workload on NUMA
node 0. Then host the Nova workload on NUMA node 1 and have it use the
network interfaces on NUMA node 1. Setting `osd_numa_auto_affinity`
to true, as in the example `cephadm-overrides.yaml` file above, should
result in this configuration. Alternatively, the `osd_numa_node` could
be set directly to 0 and `osd_numa_auto_affinity` could be left unset
so that it will default to false.
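
A minimal sketch of that alternative, pinning the OSDs directly to
NUMA node 0 instead of relying on auto affinity::

    parameter_defaults:
      CephConfigOverrides:
        osd:
          osd_numa_node: 0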

When a hyperconverged cluster backfills as a result of an OSD going
offline, the backfill process can be slowed down. In exchange for a
slower recovery, the backfill activity has less of an impact on
the collocated Compute workload. Ceph Pacific has the following
defaults to control the rate of backfill activity::

    parameter_defaults:
      CephConfigOverrides:
        osd:
          osd_recovery_op_priority: 3
          osd_max_backfills: 1
          osd_recovery_max_active_hdd: 3
          osd_recovery_max_active_ssd: 10

It is not necessary to pass the above as they are the default values,
but if these defaults need to be deployed with different values, modify
an example like the above before deployment. If the values need to be
adjusted after the deployment, use `ceph config set osd <key> <value>`.
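
For example, to allow more concurrent backfills after deployment, at
the cost of more impact on the collocated Compute workload (the value
shown is only illustrative)::

    [ceph: root@oc0-controller-0 /]# ceph config set osd osd_max_backfills 2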

Deploy the overcloud as described in "Scenario: Deploy Ceph with
TripleO and Metalsmith" but use the `-r` option to include the
generated `roles.yaml` file and the `-e` option with the
`cephadm-overrides.yaml` file containing the HCI tunings described
above.
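
A minimal sketch of such a deploy command, assuming
`cephadm-overrides.yaml` is in the home directory and omitting any
other environment files the Metalsmith scenario requires::

    openstack overcloud deploy --templates \
        -r ~/roles.yaml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
        -e ~/cephadm-overrides.yaml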

The examples above may be used to tune a hyperconverged system during
deployment. If the values need to be changed after deployment, then
use the `ceph orchestrator` command to set them directly.

After deployment start a Ceph shell as described in "Accessing the
Ceph Command Line" and confirm the above values were applied. For
example, to check the NUMA and memory target auto tuning, run
commands like this::

    [ceph: root@oc0-controller-0 /]# ceph config dump | grep numa
    osd  advanced  osd_numa_auto_affinity  true
    [ceph: root@oc0-controller-0 /]# ceph config dump | grep autotune
    osd  advanced  osd_memory_target_autotune  true
    [ceph: root@oc0-controller-0 /]# ceph config get mgr mgr/cephadm/autotune_memory_target_ratio
    0.200000
    [ceph: root@oc0-controller-0 /]#

We can then confirm that a specific OSD, e.g. osd.11, inherited those
values with commands like this::

    [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_memory_target
    4294967296
    [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_memory_target_autotune
    true
    [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_numa_auto_affinity
    true
    [ceph: root@oc0-controller-0 /]#

To confirm that the default backfill values are set for the same
example OSD, use commands like this::

    [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_op_priority
    3
    [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_max_backfills
    1
    [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_max_active_hdd
    3
    [ceph: root@oc0-controller-0 /]# ceph config get osd.11 osd_recovery_max_active_ssd
    10
    [ceph: root@oc0-controller-0 /]#

.. _`cephadm`: https://docs.ceph.com/en/latest/cephadm/index.html
.. _`cleaning instructions in the Ironic documentation`: https://docs.openstack.org/ironic/latest/admin/cleaning.html

@@ -628,3 +766,4 @@ the "Scaling Down" section of :doc:`../provisioning/baremetal_provision`.
.. _`pgcalc`: http://ceph.com/pgcalc
.. _`CRUSH Map Rules`: https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/?highlight=ceph%20crush%20rules#crush-map-rules
.. _`OSD Service Documentation for cephadm`: https://docs.ceph.com/en/latest/cephadm/osd/
.. _`osd_memory_target_autotune`: https://docs.ceph.com/en/latest/cephadm/osd/#automatically-tuning-osd-memory

@@ -9,8 +9,8 @@ supports this feature for both NFV (Network Function Virtualization)
and HCI (Hyper-converged Infrastructure; nodes with collocated Ceph
OSD and Nova Compute services) deployments.

Using derived parameters during a deployment
--------------------------------------------

To have TripleO derive parameters during deployment, specify an
alternative *deployment plan* containing directives which trigger

@@ -18,7 +18,7 @@ either a Mistral workflow (prior to Victoria) or an Ansible playbook
(in Victoria and newer) which derives the parameters.

A default *deployment plan* is created during deployment. This
deployment plan may be overridden by passing the ``-p`` or
``--plan-environment-file`` option to the ``openstack overcloud
deploy`` command. If the ``plan-environment-derived-params.yaml``
file, located in

@@ -42,12 +42,12 @@ for use by a Ceph OSD.
Parameters which are derived for HCI deployments
------------------------------------------------

The derived parameters for HCI set the NovaReservedHostMemory and
NovaCPUAllocationRatio per role based on the amount and type of Ceph
OSDs requested during deployment, the available hardware in Ironic,
and the average Nova guest workload.

Deriving the parameters is useful because in an HCI deployment the Nova
scheduler does not, by default, take into account the requirements of
the Ceph OSD services which are collocated with the Nova Compute
services. Thus, it's possible for Compute resources needed by an OSD

@@ -61,7 +61,7 @@ medium the more vCPUs an OSD should use in order for the CPU resources
to not become a performance bottleneck. All of this is taken into
account by the derived parameters for HCI.

The workload of the Nova guests may also be taken into account.
The ``plan-environment-derived-params.yaml`` file contains the
following::

@@ -98,18 +98,18 @@ to take into account the memory overhead per guest for the hypervisor.
It also does not set the NovaCPUAllocationRatio. Thus, passing an
expected average workload will produce a more accurate set of derived
HCI parameters. However, this default does allow for a simpler
deployment where derived parameters may be used without having to
specify a workload, but the OSDs are protected from having their memory
allocated to Nova guests.

Deriving HCI parameters before a deployment
-------------------------------------------

The ``tripleo_derive_hci_parameters`` Ansible module may be run
independently on the undercloud before deployment to generate a YAML
file to pass to the ``openstack overcloud deploy`` command with the
``-e`` option. If this option is used, it's not necessary to derive HCI
parameters during deployment. Using this option also allows the
deployer to quickly see the values of the derived parameters.
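
A hypothetical sketch of the kind of environment file such a run might
produce; the exact values depend entirely on the hardware and workload
inputs, and the role-specific ``ComputeHCIParameters`` wrapper shown
here is only an assumption::

    parameter_defaults:
      ComputeHCIParameters:
        NovaReservedHostMemory: 75000
        NovaCPUAllocationRatio: 8.2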

.. warning::

@@ -143,29 +143,40 @@ modify it to set the four playbook variables as below::

    # Set the following variables for your environment
    ironic_node_id: ef4cbd49-3773-4db2-80da-4210a7c24047
    role: ComputeHCI
    average_guest_cpu_utilization_percentage: 50
    average_guest_memory_size_in_mb: 8192
    heat_environment_input_file: /home/stack/ceph_overrides.yaml
    [stack@undercloud ~]$

In the above example it is assumed that the ``role`` `ComputeHCI` will
use nodes with the same type of hardware as the node identified by
``ironic_node_id`` and that the average guest will use 50% of its CPU
and 8 GB of RAM. If the workload is unknown, remove these
variables. The system tuning will not be as accurate, but the Ansible
module will at least set the NovaReservedHostMemory as a function of
the number of OSDs.

The ``heat_environment_input_file`` must be set to the path of the
Heat environment file which defines the OSDs.

.. admonition:: Victoria or earlier

   When ceph-ansible is used, in place of cephadm, this should be the
   file where the ``CephAnsibleDisksConfig`` parameter is set. This
   parameter is used to define which disks are used as Ceph OSDs and
   might look like the following if bluestore was being deployed on 4
   NVMe SSDs::

      parameter_defaults:
        CephAnsibleDisksConfig:
          osd_scenario: lvm
          osd_objectstore: bluestore
          osds_per_device: 4
          devices:
            - /dev/nvme0n1
            - /dev/nvme0n2
            - /dev/nvme0n3
            - /dev/nvme0n4

   The derived parameters workflow would use the values above to
   determine the number of OSDs requested (e.g. 4 devices * 4 OSDs per

@@ -173,6 +184,41 @@ device = 16) and the type of device based on the Ironic data
   (e.g. during introspection, ironic can determine if a storage device
   is rotational).

If cephadm is used, in place of ceph-ansible (for Wallaby and newer),
then the ``heat_environment_input_file`` must be set to the path of
the file where the ``CephHciOsdCount`` and ``CephHciOsdType``
parameters are set.

The ``CephHciOsdCount`` and ``CephHciOsdType`` parameters exist because
``CephOsdSpec``, as used by cephadm, might only specify a description
of devices to be used as OSDs (e.g. "all devices") and not a list of
devices like ``CephAnsibleDisksConfig``, so setting the count directly
is necessary in order to know how much CPU/RAM to reserve. Similarly,
because a device path is not hard coded, we cannot look up that device
in Ironic to determine its type. For information on the
``CephOsdSpec`` parameter see the :doc:`cephadm` documentation.
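
For illustration, a spec-style description of "all devices" might look
like the sketch below; the exact structure of ``CephOsdSpec`` is an
assumption here and is covered in the :doc:`cephadm` documentation::

    parameter_defaults:
      CephOsdSpec:
        data_devices:
          all: true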

``CephHciOsdType`` is the type of data_device (not db_device) used for
each OSD and must be one of hdd, ssd, or nvme. These are used by
the Ansible module tripleo_derive_hci_parameters.

``CephHciOsdCount`` is the number of expected Ceph OSDs per HCI
node. If a server has eight HDD drives, then the parameters should be
set like this::

    parameter_defaults:
      CephHciOsdType: hdd
      CephHciOsdCount: 8

To fully utilize NVMe devices for data (not metadata), multiple
OSDs are required. If the ``CephOsdSpec`` parameter is used to set
`osds_per_device` to 4, and there are four NVMe drives on a host (and
no HDD drives), then the parameters should be set like this::

    parameter_defaults:
      CephHciOsdType: nvme
      CephHciOsdCount: 16

After these values are set, run the playbook::

    [stack@undercloud ~]$ ansible-playbook derive-local-hci-parameters.yml

@@ -184,7 +230,7 @@ After these values are set run the playbook::

    TASK [Get baremetal inspection data] *********************************************************
    ok: [localhost]

    TASK [Get tripleo CephDisks environment parameters] *******************************************
    ok: [localhost]

    TASK [Derive HCI parameters] *****************************************************************