Distributed Multibackend Storage
================================

In Ussuri and newer, |project| is able to extend
:doc:`distributed_compute_node` to include distributed image
management and persistent storage with the benefits of using
OpenStack and Ceph.

Features
--------

This Distributed Multibackend Storage design extends the architecture
described in :doc:`distributed_compute_node` to support the following
workflow.

- Upload an image to the Central site, and any additional DCN sites
  with storage, concurrently using one command like
  `glance image-create-via-import --stores central,dcn1,dcn3`.
- Move a copy of the same image to additional DCN sites when needed
  using a command like
  `glance image-import --stores dcn2,dcn4 --import-method copy-image`.
- The image's unique ID will be shared consistently across sites.
- The image may be copy-on-write booted on any DCN site as the RBD
  pools for Glance and Nova will use the same local Ceph cluster.
- If the Glance server at each DCN site was configured with write
  access to the Central Ceph cluster as an additional store, then an
  image generated from making a snapshot of an instance running at a
  DCN site may be copied back to the central site and then copied to
  additional DCN sites.
- The same Ceph cluster per site may also be used by Cinder as an RBD
  store to offer local volumes in active/active mode.

In the above workflow the only time RBD traffic crosses the WAN is
when an image is imported or copied between sites. Otherwise all RBD
traffic is local to each site for fast COW boots and performant IO to
the local Cinder and Nova Ceph pools.

Architecture
------------

The architecture to support the above features has the following
properties.

- A separate Ceph cluster at each availability zone or geographic
  location
- Glance servers at each availability zone or geographic location
- The containers implementing the Ceph clusters may be collocated on
  the same hardware providing compute services, i.e. the compute
  nodes may be hyper-converged, though it is not necessary that they
  be hyper-converged
- It is not necessary to deploy Glance and Ceph at a DCN site if
  storage services are not needed at that DCN site

In this scenario the Glance service at the central site is configured
with multiple stores such that:

- The central Glance server's default store is the central Ceph
  cluster using the RBD driver
- The central Glance server has additional RBD stores; one per DCN
  site running Ceph

Similarly the Glance server at each DCN site is configured with
multiple stores such that:

- Each DCN Glance server's default store is the DCN Ceph cluster
  that is in the same geographic location.
- Each DCN Glance server is configured with one additional store
  which is the Central RBD Ceph cluster.

Though there are Glance services distributed to multiple sites, the
Glance client for overcloud users should use the public Glance
endpoints at the central site. These endpoints may be determined by
querying the Keystone service, which only runs at the central site,
with `openstack endpoint list`.
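For example, the following command, which assumes the standard
service name `glance`, lists the public Glance endpoints registered
in Keystone at the central site.

.. code-block:: bash

  # Query Keystone (which runs only at the central site) for the
  # public Glance endpoints that overcloud clients should use.
  openstack endpoint list --service glance --interface public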
Ideally all images should reside in the central Glance and be copied
to DCN sites before instances of those images are booted on DCN
sites. If an image is not copied to a DCN site before it is booted,
then the image will be streamed to the DCN site and then the image
will boot as an instance. This happens because Glance at the DCN site
has access to the image store at the central Ceph cluster. Though the
booting of the image will take time because it has not been copied in
advance, this is still preferable to failing to boot the image.

Stacks
------

In the example deployment three stacks are deployed:

control-plane
  All control plane services including Glance. Includes a Ceph
  cluster named central which is hyperconverged with compute nodes
  and runs Cinder in active/passive mode managed by Pacemaker.
dcn0
  Runs Compute, Glance and Ceph services. The Cinder volume service
  is configured in active/active mode and not managed by Pacemaker.
  The Compute and Cinder services are deployed in a separate
  availability zone and may also be in a separate geographic
  location.
dcn1
  Deploys the same services as dcn0 but in a different availability
  zone and also in a separate geographic location.

Note how the above differs from the :doc:`distributed_compute_node`
example, which splits services at the primary location into two
stacks called `control-plane` and `central`. This example combines
the two into one stack.

During the deployment steps all templates used to deploy the
control-plane stack will be kept on the undercloud in
`/home/stack/control-plane`, all templates used to deploy the dcn0
stack will be kept on the undercloud in `/home/stack/dcn0`, and dcn1
will follow the same pattern as dcn0. The sites dcn2, dcn3 and so on
may be created, based on need, by following the same pattern.

Ceph Deployment Types
---------------------

|project| supports two types of Ceph deployments. An "internal" Ceph
deployment is one where a Ceph cluster is deployed as part of the
overcloud as described in :doc:`ceph_config`. An "external" Ceph
deployment is one where a Ceph cluster already exists and an
overcloud is configured to be a client of that Ceph cluster as
described in :doc:`ceph_external`. Ceph external deployments have
special meaning to |project| in the following ways:

- The Ceph cluster was not deployed by |project|
- The OpenStack Ceph client is configured by |project|

The deployment example in this document uses the "external" term to
focus on the second of the above because the client configuration is
important. This example differs from the first of the above because
Ceph was deployed by |project|; however, relative to other stacks it
is an external Ceph cluster because, for the stacks which configure
the Ceph clients, it doesn't matter that the Ceph server came from a
different stack. In this sense, the example in this document uses
both types of deployments as described in the following sequence:

- The central site deploys an internal Ceph cluster called central
  with an additional cephx keyring which may be used to access the
  central Ceph pools.
- The dcn0 site deploys an internal Ceph cluster called dcn0 with an
  additional cephx keyring which may be used to access the dcn0 Ceph
  pools. During the same deployment the dcn0 site is also configured
  with the cephx keyring from the previous step so that it is also a
  client of the external Ceph cluster, relative to dcn0, called
  central from the previous step. The `GlanceMultistoreConfig`
  parameter is also used during this step so that Glance will use the
  dcn0 Ceph cluster as an RBD store by default but will also be
  configured to use the central Ceph cluster as an additional RBD
  backend.
- The dcn1 site is deployed the same way as the dcn0 site and the
  pattern may be continued for as many DCN sites as necessary.
- The central site is then updated so that, in addition to having an
  internal Ceph deployment for the cluster called central, it is also
  configured with multiple external Ceph clusters, relative to the
  central site, one for each DCN site. This is accomplished by
  passing the cephx keys which were created during each DCN site
  deployment as input to the stack update. During the stack update
  the `GlanceMultistoreConfig` parameter is added so that Glance will
  continue to use the central Ceph cluster as an RBD store by default
  but will also be configured to use each DCN Ceph cluster as an
  additional RBD backend.

The above sequence is possible by using the `CephExtraKeys` parameter
as described in :doc:`ceph_config` and the `CephExternalMultiConfig`
parameter described in :doc:`ceph_external`.
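After such a deployment, one quick way to confirm which stores a
Glance endpoint exposes is the store discovery call of the Glance
client. The example below is only a sketch; it assumes a
python-glanceclient new enough to support multiple stores and uses
the store names from this example.

.. code-block:: bash

  # List the stores exposed by the Glance API the client points at.
  # Against the central endpoint of this example the expected stores
  # are default_backend plus one store per DCN site (e.g. dcn0, dcn1).
  glance stores-info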
Deployment Steps
----------------

This section shows the deployment commands and associated environment
files of an example DCN deployment with distributed image management.
It is based on the :doc:`distributed_compute_node` example and does
not cover redundant aspects of it such as networking.

Create extra Ceph key
^^^^^^^^^^^^^^^^^^^^^

Create ``/home/stack/control-plane/ceph_keys.yaml`` with contents
like the following::

  parameter_defaults:
    CephExtraKeys:
      - name: "client.external"
        caps:
          mgr: "allow *"
          mon: "profile rbd"
          osd: "profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images"
        key: "AQD29WteAAAAABAAphgOjFD7nyjdYe8Lz0mQ5Q=="
        mode: "0600"

The key should be considered sensitive and may be randomly generated
with the following command::

  python3 -c 'import os,struct,time,base64; key = os.urandom(16); header = struct.pack("<hiih", 1, int(time.time()), 0, len(key)); print(base64.b64encode(header + key).decode())'

Confirm images may be copied between sites
--------------------------------------------

This section assumes a cirros image has already been uploaded in raw
format to the default_backend and dcn0 stores with a single
`glance image-create-via-import --stores default_backend,dcn0`
command. Images must be in raw format to be COW booted from RBD, so
the disk format may be confirmed with a command like
`openstack image show <image-id> | grep disk_format`
after the image is uploaded.

Set an environment variable to the ID of the newly created image:

.. code-block:: bash

  ID=$(openstack image show cirros -c id -f value)

Copy the image from the default store to the dcn1 store:

.. code-block:: bash

  glance image-import $ID --stores dcn1 --import-method copy-image

Confirm a copy of the image is in each store by looking at the image
properties:

.. code-block:: bash

  $ openstack image show $ID | grep properties
  | properties | direct_url='rbd://d25504ce-459f-432d-b6fa-79854d786f2b/images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076/snap', locations='[{u'url': u'rbd://d25504ce-459f-432d-b6fa-79854d786f2b/images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076/snap', u'metadata': {u'store': u'default_backend'}}, {u'url': u'rbd://0c10d6b5-a455-4c4d-bd53-8f2b9357c3c7/images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076/snap', u'metadata': {u'store': u'dcn0'}}, {u'url': u'rbd://8649d6c3-dcb3-4aae-8c19-8c2fe5a853ac/images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076/snap', u'metadata': {u'store': u'dcn1'}}]', os_glance_failed_import='', os_glance_importing_to_stores='', os_hash_algo='sha512', os_hash_value='b795f047a1b10ba0b7c95b43b2a481a59289dc4cf2e49845e60b194a911819d3ada03767bbba4143b44c93fd7f66c96c5a621e28dff51d1196dae64974ce240e', os_hidden='False', stores='default_backend,dcn0,dcn1' |

The `stores` key, which is the last item in the properties map, is
set to 'default_backend,dcn0,dcn1'. On further inspection the
`direct_url` key is set to::

  rbd://d25504ce-459f-432d-b6fa-79854d786f2b/images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076/snap

The prefix 'd25504ce-459f-432d-b6fa-79854d786f2b' is the FSID of the
central Ceph cluster, 'images' is the name of the pool, and
'8083c7e7-32d8-4f7a-b1da-0ed7884f1076' is the Glance image ID and the
name of the Ceph object. The properties map also contains
`locations`, which is set to similar RBD paths for the dcn0 and dcn1
clusters with their respective FSIDs and pool names. Note that the
Glance image ID is consistent in all RBD paths.
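As a cross-check, the FSID embedded in `direct_url` should match the
FSID reported by the central Ceph cluster itself. The sketch below
assumes the ceph-mon container naming used elsewhere in this document
and may be run on any Controller node from the control-plane stack.

.. code-block:: bash

  # The FSID printed here should match the one embedded in direct_url.
  sudo podman exec ceph-mon-$(hostname) ceph --cluster central fsid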
If the image were deleted with `glance image-delete`, then the image
would be removed from all three RBD stores to ensure consistency.
However, if the glanceclient is >3.1.0, then an image may be deleted
from a specific store only by using a syntax like
`glance stores-delete --store <store-id> <image-id>`.

Optionally, run the following on any Controller node from the
control-plane stack:

.. code-block:: bash

  sudo podman exec ceph-mon-$(hostname) rbd --cluster central -p images ls -l

Run the following on any DistributedComputeHCI node from the dcn0
stack:

.. code-block:: bash

  sudo podman exec ceph-mon-$(hostname) rbd --id external --keyring /etc/ceph/dcn0.client.external.keyring --conf /etc/ceph/dcn0.conf -p images ls -l

Run the following on any DistributedComputeHCI node from the dcn1
stack:

.. code-block:: bash

  sudo podman exec ceph-mon-$(hostname) rbd --id external --keyring /etc/ceph/dcn1.client.external.keyring --conf /etc/ceph/dcn1.conf -p images ls -l

The results in all cases should produce output like the following::

  NAME                                       SIZE    PARENT  FMT  PROT  LOCK
  8083c7e7-32d8-4f7a-b1da-0ed7884f1076       44 MiB           2
  8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap  44 MiB           2    yes

When an ephemeral instance is COW booted from the image, a similar
command in the vms pool should show the same parent image:

.. code-block:: bash

  $ sudo podman exec ceph-mon-$(hostname) rbd --id external --keyring /etc/ceph/dcn1.client.external.keyring --conf /etc/ceph/dcn1.conf -p vms ls -l
  NAME                                       SIZE   PARENT                                             FMT  PROT  LOCK
  2b431c77-93b8-4edf-88d9-1fd518d987c2_disk  1 GiB  images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap   2          excl
  $

Confirm image-based volumes may be booted as DCN instances
----------------------------------------------------------

An instance with a persistent root volume may be created on a DCN
site by using the active/active Cinder service at the DCN site.
Assuming the Glance image created in the previous step is available,
identify the image ID and pass it to `openstack volume create` with
the `--image` option to create a volume based on that image.

.. code-block:: bash

  IMG_ID=$(openstack image show cirros -c id -f value)
  openstack volume create --size 8 --availability-zone dcn0 pet-volume-dcn0 --image $IMG_ID

Once the volume is created, identify its volume ID and pass it to
`openstack server create` with the `--volume` option. This example
assumes a flavor, key, security group and network have already been
created.

.. code-block:: bash

  VOL_ID=$(openstack volume show -f value -c id pet-volume-dcn0)
  openstack server create --flavor tiny --key-name dcn0-key --network dcn0-network --security-group basic --availability-zone dcn0 --volume $VOL_ID pet-server-dcn0

It is also possible to issue one command to have Nova ask Cinder to
create the volume before it boots the instance by passing the
`--image` and `--boot-from-volume` options as shown in the example
below:

.. code-block:: bash

  openstack server create --flavor tiny --image $IMG_ID --key-name dcn0-key --network dcn0-network --security-group basic --availability-zone dcn0 --boot-from-volume 4 pet-server-dcn0

The above will only work if the Nova `cross_az_attach` setting of the
relevant compute node is set to `false`. This is automatically
configured by deploying with `environments/dcn-hci.yaml`.
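One way to confirm the effective value on a dcn0 compute node is to
inspect the rendered Nova configuration directly. The path below
assumes TripleO's containerized configuration layout and may differ
between releases.

.. code-block:: bash

  # Check the cross_az_attach option (under [cinder]) that Nova is using.
  sudo grep cross_az_attach /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf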
If the `cross_az_attach` setting is `true` (the default), then the
volume will be created from the image not at the dcn0 site, but at
the default central site (as may be verified with the `rbd` command
on the central Ceph cluster), and then the instance will fail to boot
at the dcn0 site. Even if `cross_az_attach` is `true`, it's still
possible to create an instance from a volume by using
`openstack volume create` and then `openstack server create` as shown
earlier.

Optionally, after creating the volume from the image at the dcn0 site
and then creating an instance from the existing volume, verify that
the volume is based on the image by running the `rbd` command within
a ceph-mon container on the dcn0 site to list the volumes pool.

.. code-block:: bash

  $ sudo podman exec ceph-mon-$HOSTNAME rbd --cluster dcn0 -p volumes ls -l
  NAME                                          SIZE   PARENT                                             FMT  PROT  LOCK
  volume-28c6fc32-047b-4306-ad2d-de2be02716b7   8 GiB  images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap   2          excl
  $

The following commands may be used to create a Cinder snapshot of the
root volume of the instance.

.. code-block:: bash

  openstack server stop pet-server-dcn0
  openstack volume snapshot create pet-volume-dcn0-snap --volume $VOL_ID --force
  openstack server start pet-server-dcn0

In the above example the server is stopped to quiesce data for a
clean snapshot. The `--force` option is necessary when creating the
snapshot because the volume status will remain "in-use" even when the
server is shut down. When the snapshot is completed, start the
server. Listing the contents of the volumes pool on the dcn0 Ceph
cluster should show the snapshot which was created and how it is
connected to the original volume and original image.

.. code-block:: bash

  $ sudo podman exec ceph-mon-$HOSTNAME rbd --cluster dcn0 -p volumes ls -l
  NAME                                                                                        SIZE   PARENT                                             FMT  PROT  LOCK
  volume-28c6fc32-047b-4306-ad2d-de2be02716b7                                                 8 GiB  images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap   2          excl
  volume-28c6fc32-047b-4306-ad2d-de2be02716b7@snapshot-a1ca8602-6819-45b4-a228-b4cd3e5adf60   8 GiB  images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap   2    yes
  $

Confirm image snapshots may be created and copied between sites
---------------------------------------------------------------

A new image called "cirros-snapshot" may be created at the dcn0 site
from the instance created in the previous section by running the
following commands.

.. code-block:: bash

  NOVA_ID=$(openstack server show pet-server-dcn0 -f value -c id)
  openstack server stop $NOVA_ID
  openstack server image create --name cirros-snapshot $NOVA_ID
  openstack server start $NOVA_ID

In the above example the instance is stopped to quiesce data for a
clean snapshot image and is then restarted after the image has been
created. The output of
`openstack image show $IMAGE_ID -f value -c properties`
should contain a JSON data structure whose `stores` key contains only
"dcn0", as that is the only store which has a copy of the new
cirros-snapshot image.

The new image may then be copied from the dcn0 site to the central
site, which is the default backend for Glance.

.. code-block:: bash

  IMAGE_ID=$(openstack image show cirros-snapshot -f value -c id)
  glance image-import $IMAGE_ID --stores default_backend --import-method copy-image

After the above is run, the output of
`openstack image show $IMAGE_ID -f value -c properties`
should contain a JSON data structure whose `stores` key looks like
"dcn0,default_backend", as the image will also exist in the
"default_backend" store, which keeps its data on the central Ceph
cluster.
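Optionally, the copied image may also be confirmed on the Ceph side
by listing the central images pool from a Controller node in the
control-plane stack, as was done earlier. The grep below is only a
convenience and assumes the snapshot image's ID is substituted for
`$IMAGE_ID` on that node.

.. code-block:: bash

  # Run on a Controller node from the control-plane stack; substitute
  # the snapshot image's ID for $IMAGE_ID if the variable is not set there.
  sudo podman exec ceph-mon-$(hostname) rbd --cluster central -p images ls -l | grep $IMAGE_ID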
The same image at the central site may then be copied to other DCN
sites, COW booted into their vms or volumes pools, and snapshotted
again so that the same process may repeat.
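For example, the snapshot image could be copied on to another DCN
site with the same import method used earlier; the store name dcn1
below is from this example deployment.

.. code-block:: bash

  # Copy the snapshot image to an additional DCN site's store.
  glance image-import $IMAGE_ID --stores dcn1 --import-method copy-image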