High Availability ================= In this guide we'll go over design and programming considerations related to high availability in Cinder. The document aims to provide a single point of truth in all matters related to Cinder's high availability. Cinder developers must always have these aspects present during the design and programming of the Cinder core code, as well as the drivers' code. Most topics will focus on Active-Active deployments. Some topics covering node and process concurrency will also apply to Active-Passive deployments. Overview -------- There are 4 services that must be considered when looking at a highly available Cinder deployment: API, Scheduler, Volume, Backup. Each of these services has its own challenges and mechanisms to support concurrent and multi node code execution. This document provides a general overview of Cinder aspects related to high availability, together with implementation details. Given the breadth and depth required to properly explain them all, it will fall short in some places. It will provide external references to expand on some of the topics hoping to help better understand them. Some of the topics that will be covered are: - Job distribution. - Message queues. - Threading model. - Versioned Objects used for rolling upgrades. - Heartbeat system. - Mechanism used to clean up out of service cluster nodes. - Mutual exclusion mechanisms used in Cinder. It's good to keep in mind that Cinder threading model is based on eventlet's green threads. Some Cinder and driver code may use native threads to prevent thread blocking, but that's not the general rule. Throughout the document we'll be referring to clustered and non clustered Volume services. This distinction is not based on the number of services running, but on their configurations. A non clustered Volume service is one that will be deployed as Active-Passive and has not been included in a Cinder cluster. On the other hand, a clustered Volume service is one that can be deployed as Active-Active because it is part of a Cinder cluster. We consider a Volume service to be clustered even when there is only one node in the cluster. Job distribution ---------------- Cinder uses RPC calls to pass jobs to Scheduler, Volume, and Backup services. A message broker is used for the transport layer on the RPC calls and parameters. Job distribution is handled by the message broker using message queues. The different services, except the API, listen on specific message queues for RPC calls. Based on the maximum number of nodes that will connect, we can differentiate two types of message queues: those with a single listener and those with multiple listeners. We use single listener queues to send RPC calls to a specific service in a node. For example, when the API calls a non clustered Volume service to create a snapshot. Message queues having multiple listeners are used in operations such as: - Creating any volume. Call made from the API to the Scheduler. - Creating a volume in a clustered Volume service. Call made from the Scheduler to the Volume service. - Attaching a volume in a clustered Volume service. Call made from the API to the Volume service. Regardless of the number of listeners, all the above mentioned RPC calls are unicast calls. The caller will place the request in a queue in the message broker and a single node will retrieve it and execute the call. There are other kinds of RPC calls, those where we broadcast a single RPC call to multiple nodes. The best example of this type of call is the Volume service capabilities report sent to all the Schedulers. Message queues are fair queues and are used to distribute jobs in a round robin fashion. Single target RPC calls made to message queues with multiple listeners are distributed in round robin. So sending three request to a cluster of 3 Schedulers will send one request to each one. Distribution is content and workload agnostic. A node could be receiving all the quick and easy jobs while another one gets all the heavy lifting and its ongoing workload keeps increasing. Cinder's job distribution mechanism allows fine grained control over who to send RPC calls. Even on clustered Volume services we can still access individual nodes within the cluster. So developers must pay attention to where they want to send RPC calls and ask themselves: Is the target a clustered service? Is the RPC call intended for *any* node running the service? Is it for a *specific* node? For *all* nodes? The code in charge of deciding the target message queue, therefore the recipient, is in the `rpcapi.py` files. Each service has its own file with the RPC calls: `volume/rpcapi.py`, `scheduler/rpcapi.py`, and `backup/rpcapi.py`. For RPC calls the different `rcpapi.py` files ultimately use the `_get_cctxt` method from the `cinder.rpc.RPCAPI` class. For a detailed description on the issue, ramifications, and solutions, please refer to the `Cinder Volume Job Distribution`_. The `RabbitMQ tutorials`_ are a good way to understand message brokers general topics. Heartbeats ---------- Cinder services, with the exception of API services, have a periodic heartbeat to indicate they are up and running. When services are having health issues, they may decide to stop reporting heartbeats, even if they are running. This happens during initialization if the driver cannot be setup correctly. The database is used to report service heartbeats. Fields `report_count` and `updated_at`, in the `services` table, keep a heartbeat counter and the last time the counter was updated. There will be multiple database entries for Cinder Volume services running multiple backends. One per backend. Using a date-time to mark the moment of the last heartbeat makes the system time relevant for Cinder's operation. A significant difference in system times on our nodes could cause issues in a Cinder deployment. All services report and expect the `updated_at` field to be UTC. To determine if a service is up, we check the time of the last heartbeat to confirm that it's not older than `service_down_time` seconds. Default value for `service_down_time` configuration option is 60 seconds. Cinder uses method `is_up`, from the `Service` and `Cluster` Versioned Object, to ensure consistency in the calculations across the whole code base. Heartbeat frequency in Cinder services is determined by the `report_interval` configuration option. The default is 10 seconds, allowing network and database interruptions. Cinder protects itself against some incorrect configurations. If `report_interval` is greater or equal than `service_down_time`, Cinder will log a warning and use a service down time of two and a half times the configured `report_interval`. .. note:: It is of utter importance having the same `service_down_time` and `report_interval` configuration options in all your nodes. In each service's section we'll expand this topic with specific information only relevant to that service. Cleanup ------- Power outages, hardware failures, unintended reboots, and software errors. These are all events that could make a Cinder service unexpectedly halt its execution. A running Cinder service is usually carrying out actions on resources. So when the service dies unexpectedly, it will abruptly stop those operations. Stopped operations in this way leaves resources in transitioning states. For example a volume could be left in a `deleting` or `creating` status. If left alone resources will remain in this state forever, as the service in charge of transitioning them to a rest status (`available`, `error`, `deleted`) is no longer running. Existing reset-status operations allow operators to forcefully change the state of a resource. But these state resets are not recommended except in very specific cases and when we really know what we are doing. Cleanup mechanisms are tasked with service's recovery after an abrupt stop of the service. They are the recommended way to resolve stuck transitioning states caused by sudden service stop. There are multiple cleanup mechanisms in Cinder, but in essence they all follow the same logic. Based on the resource type and its status the mechanism determines the best cleanup action that will transition the state to a rest state. Some actions require a resource going through several services. In this case deciding the cleanup action may also require taking into account where the resource was being processed. Cinder has two types of cleanup mechanisms: - On node startup: Happen on Scheduler, Volume, and Backup services. - Upon user request. User requested cleanups can only be triggered on Scheduler and Volume nodes. When a node starts it will do a cleanup, but only for the resources that were left in a transitioning state when the service stopped. It will never touch resources from other services in the cluster. Node startup cleanup is slightly different on services supporting user requested cleanups -Scheduler and Volume- than on Backup services. Backup cleanups will be covered in the service's section. For services supporting user requested cleanups we can differentiate the following tasks: - Tracking transitioning resources: Using workers table and Cleanable Versioned Objects methods. - Defining when a resource must be cleaned if service dies: Done in Cleanable Versioned Objects. - Defining how a resource must be cleaned: Done in the service manager. .. note:: All Volume services can accept cleanup requests, doesn't matter if they are clustered or not. This will provide a better alternative to the reset-state mechanism to handle resources stuck in a transitioning state. Workers table ~~~~~~~~~~~~~ For Cinder Volume managed resources -Volumes and Snapshots- we used to establish a one-to-one relationship between a resource and the volume service managing it. A resource would belong to a node if the resource's `host` field matched that of the running Cinder Volume service. Snapshots must always be managed by the same service as the volume they originate from, so they don't have a `host` field in the database. In this case the parent volume's `host` is used to determine who owns the resource. Cinder-Volume services can be clustered, so we no longer have a one-to-one owner relationship. On clustered services we use the `cluster_name` database field instead of the `host` to determine ownership. Now we have a one-to-many ownership relationship. When a clustered service abruptly stops running, any of the nodes from the same cluster can cleanup the resources it was working on. There is no longer a need to restart the service to get the resources cleaned by the node startup cleanup process. We keep track of the resources our Cinder services are working on in the `workers` table. Only resources that can be cleaned are tracked. This table stores the resource type and id, the status that should be cleared on service failure, the service that is working on it, etc. And we'll be updating this table as the resources move from service to service. `Worker` entries are not passed as RPC parameters, so we don't need a Versioned Object class to represent them. We only have the `Worker` ORM class to represent database entries. Following subsections will cover implementation details required to develop new cleanup resources and states. For a detailed description on the issue, ramifications, and overall solution, please refer to the `Cleanup spec`_. Tracking resources ~~~~~~~~~~~~~~~~~~ Resources supporting cleanup using the workers table must inherit from the `CinderCleanableObject` Versioned Object class. This class provides helper methods and the general interface used by Cinder for the cleanup mechanism. This interface is conceptually split in three tasks: - Manage workers table on the database. - Defining what states must be cleaned. - Defining how to clean resources. Among methods provided by the `CinderCleanableObject` class the most important ones are: - `is_cleanable`: Checks if the resource, given its current status, is cleanable. - `create_worker`: Create a worker entry on the API service. - `set_worker`: Create or update worker entry. - `unset_worker`: Remove an entry from the database. This is a real delete, not a soft-delete. - `set_workers`: Function decorator to create or update worker entries. Inheriting classes must define `_is_cleanable` method to define which resource states can be cleaned up. Earlier we mentioned how cleanup depends on a resource's current state. But it also depends under what version the services are running. With rolling updates we can have a service running under an earlier pinned version for compatibility purposes. A version X service could have a resource that it would consider cleanable, but it's pinned to version X-1, where it was not considered cleanable. To avoid breaking things, the resource should be considered as non cleanable until the service version is unpinned. Implementation of `_is_cleanable` method must take them both into account. The state, and the version. Volume's implementation is a good example, as workers table was not supported before version 1.6: .. code-block:: python @staticmethod def _is_cleanable(status, obj_version): if obj_version and obj_version < 1.6: return False return status in ('creating', 'deleting', 'uploading', 'downloading') Tracking states in the workers table starts by calling the `create_worker` method on the API node. This is best done on the different `rpcapi.py` files. For example, a create volume operation will go from the API service to the Scheduler service, so we'll add it in `cinder/scheduler/rpcapi.py`: .. code-block:: python def create_volume(self, ctxt, volume, snapshot_id=None, image_id=None, request_spec=None, filter_properties=None, backup_id=None): volume.create_worker() But if we are deleting a volume or creating a snapshot the API will call the Volume service directly, so changes should go in `cinder/scheduler/rpcapi.py`: .. code-block:: python def delete_volume(self, ctxt, volume, unmanage_only=False, cascade=False): volume.create_worker() Once we receive the call on the other side's manager we have to call the `set_worker` method. To facilitate this task we have the `set_workers` decorator that will automatically call `set_worker` for any cleanable versioned object that is in a cleanable state. For the create volume on the Scheduler service: .. code-block:: python @objects.Volume.set_workers @append_operation_type() def create_volume(self, context, volume, snapshot_id=None, image_id=None, request_spec=None, filter_properties=None, backup_id=None): And then again for the create volume on the Volume service: .. code-block:: python @objects.Volume.set_workers def create_volume(self, context, volume, request_spec=None, filter_properties=None, allow_reschedule=True): In these examples we are using the `set_workers` method from the `Volume` Versioned Object class. But we could be using it from any other class as it is a `staticmethod` that is not overwritten by any of the classes. Using the `set_workers` decorator will cover most of our use cases, but sometimes we may have to call the `set_worker` method ourselves. That's the case when transitioning from `creating` state to `downloading`. The `worker` database entry was created with the `creating` state and the working service was updated when the Volume service received the RPC call. But once we change the status to `creating` the worker and the resource status don't match, so the cleanup mechanism will ignore the resource. To solve this we add another worker update in the `save` method from the `Volume` Versioned Object class: .. code-block:: python def save(self): ... if updates.get('status') == 'downloading': self.set_worker() Actions on resource cleanup ~~~~~~~~~~~~~~~~~~~~~~~~~~~ We've seen how to track cleanable resources in the `workers` table. Now we'll cover how to define the actions used to cleanup a resource. Services using the `workers` table inherit from the `CleanableManager` class and must implement the `_do_cleanup` method. This method receives a versioned object to clean and indicates whether we should keep the `workers` table entry. On asynchronous cleanup tasks method must return `True` and take care of removing the worker entry on completion. Simplified version of the cleanup of the Volume service, illustrating synchronous and asynchronous cleanups and how we can do a synchronous cleanup and take care ourselves of the `workers` entry: .. code-block:: python def _do_cleanup(self, ctxt, vo_resource): if isinstance(vo_resource, objects.Volume): if vo_resource.status == 'downloading': self.driver.clear_download(ctxt, vo_resource) elif vo_resource.status == 'deleting': if CONF.volume_service_inithost_offload: self._add_to_threadpool(self.delete_volume, ctxt, vo_resource, cascade=True) else: self.delete_volume(ctxt, vo_resource, cascade=True) return True if vo_resource.status in ('creating', 'downloading'): vo_resource.status = 'error' vo_resource.save() When the volume is `downloading` we don't return anything, so the caller receives `None`, which evaluates to not keep the row entry. When the status is `deleting` we call `delete_volume` synchronously or asynchronously. The `delete_volume` has the `set_workers` decorator, that calls `unset_worker` once the decorated method has successfully finished. So when calling `delete_volume` we must ask the caller of `_do_cleanup` to not try to remove the `workers` entry. Cleaning resources ~~~~~~~~~~~~~~~~~~ We may not have a `Worker` Versioned Object because we didn't need it, but we have a `CleanupRequest` Versioned Object to specify resources for cleanup. Resources will be cleaned when a node starts up and on user request. In both cases we'll use the `CleanupRequest` that contains a filtering of what needs to be cleaned up. The `CleanupRequest` can be considered as a filter on the `workers` table to determine what needs to be cleaned. Managers for services using the `workers` table must support the startup cleanup mechanism. Support for this mechanism is provided via the `init_host` method in the `CleanableManager` class. So managers inheriting from `CleanableManager` must make sure they call this `init_host` method. This can be done using `CleanableManager` as the first inherited class and using `super` to call the parent's `init_host` method, or by calling the class method directly: `cleanableManager.init_host(self, ...)`. `CleanableManager`'s `init_host` method will create a `CleanupRequest` for the current service before calling its `do_cleanup` method with it before returning. Thus cleaning up all transitioning resources from the service. For user requested cleanups, the API generates a `CleanupRequest` object using the request's parameters and calls the scheduler's `work_cleanup` RPC with it. The Scheduler receives the `work_cleanup` RPC call and uses the `CleanupRequest` to filter services that match the request. With this list of services the Scheduler sends an individual cleanup request for each of the services. This way we can spread the cleanup work if we have multiple services to cleanup. The Scheduler checks the service to clean to know where it must send the clean request. Scheduler service cleanup can be performed by any Scheduler, so we send it to the scheduler queue where all Schedulers are listening. In the worst case it will come back to us if there is no other Scheduler running at the time. For the Volume service we'll be sending it to the cluster message queue if it's a clustered service, or to a single node if it's non clustered. But unlike with the Scheduler, we can't be sure that there is a service to do the cleanup, so we check if the service or cluster is up before sending the request. After sending all the cleanup requests, the Scheduler will return a list of services that have received a cleanup request, and all the services that didn't because they were down. Mutual exclusion ---------------- In Cinder, as many other concurrent and parallel systems, there are "critical sections". Code sections that share a common resource that can only be accessed by one of them at a time. Resources can be anything, not only Cinder resources such as Volumes and Snapshots, and they can be local or remote. Examples of resources are libraries, command line tools, storage target groups, etc. Exclusion scopes can be per process, per node, or global. We have four mutual exclusion mechanisms available during Cinder development: - Database locking using resource states. - Process locks. - Node locks. - Global locks. For performance reasons we must always try to avoid using any mutual exclusion mechanism. If avoiding them is not possible, we should try to use the narrowest scope possible and reduce the critical section as much as possible. Locks by decreasing order of preference are: process locks, node locks, global locks, database locks. Status based locking ~~~~~~~~~~~~~~~~~~~~ Many Cinder operations are inherently exclusive and the Cinder core code ensures that drivers will not receive contradictory or incompatible calls. For example, you cannot clone a volume if it's being created. And you shouldn't delete the source volume of an ongoing snapshot. To prevent these from happening Cinder API services use resource status fields to check for incompatibilities preventing operations from getting through. There are exceptions to this rule, for example the force delete operation that ignores the status of a resource. We should also be aware that administrators can forcefully change the status of a resource and then call the API, bypassing the check that prevents multiple operations from being requested to the drivers. Resource locking using states is expanded upon in the `Race prevention`_ subsection in the `Cinder-API`_ section. Process locks ~~~~~~~~~~~~~ Cinder services are multi-threaded -not really since we use greenthreads-, so the narrowest possible scope of locking is among the threads of a single process. Some cases where we may want to use this type of locking are when we share arrays or dictionaries between the different threads within the process, and when we use a Python or C library that doesn't properly handle concurrency and we have to be careful with how we call its methods. To use this locking in Cinder we must use the `synchronized` method in `cinder.utils`. This method in turn uses the `synchronized` method from `oslo_concurrency.lockutils` with the `cinder-` prefix for all the locks to avoid conflict with other OpenStack services. The only required parameter for this usage is the name of the lock. The name parameter provided for these locks must be a literal string value. There is no kind of templating support. Example from `cinder/volume/throttling.py`: .. code-block:: python @utils.synchronized('BlkioCgroup') def _inc_device(self, srcdev, dstdev): .. note:: When developing a driver, and considering which type of lock to use, we must remember that Cinder is a multi backend service. So the same driver can be running multiple times on different processes in the same node. Node locks ~~~~~~~~~~ Sometimes we want to define the whole node as the scope of the lock. Our critical section requires that only one thread in the whole node is using the resource. This inter process lock ensures that no matter how many processes and backends want to access the same resource, only one will access it at a time. All others will have to wait. These locks are useful when: - We want to ensure there's only one ongoing call to a command line program. That's the case of the `cinder-rtstool` command in `cinder/volume/targets/lio.py`, and the `nvmetcli` command in `cinder/volume/targets/nvmet.py`. - Common initialization in all processes in the node. This is the case of the backup service cleanup code. The backup service can run multiple processes simultaneously for the same backend, but only one of them can run the cleanup code on start. - Drivers not supporting Active-Active configurations. Any operation that should only be performed by one driver at a time. For example creating target groups for a node. This type of lock use the same method as the `Process locks`_, `synchronized` method from `cinder.utils`. Here we need to pass two parameters, the name of the lock, and `external=True` to make sure that file locks are being used. The name parameter provided for these locks must be a literal string value. There is no kind of templating support. Example from `cinder/volume/targest/lio.py`: .. code-block:: python @staticmethod @utils.synchronized('lioadm', external=True) def _execute(*args, **kwargs): Example from `cinder/backup/manager.py`: .. code-block:: python @utils.synchronized('backup-pgid-%s' % os.getpgrp(), external=True, delay=0.1) def _cleanup_incomplete_backup_operations(self, ctxt): .. warning:: These are not fair locks. Order in which the lock is acquired by callers may differ from request order. Starvation is possible, so don't choose a generic lock name for all your locks and try to create a unique name for each locking domain. Drivers that use node locks based on volumes should implement method ``clean_volume_file_locks`` and if they use locks based on the snapshots they should also implement ``clean_snapshot_file_locks`` and use method ``synchronized_remove`` from ``cinder.utils``. Example for a driver that used ``cinder.utils.synchronized``: .. code-block:: python def my_operation(self, volume): @utils.synchronized('my-driver-lock' + volume.id) def method(): pass method() @classmethod def clean_volume_file_locks(cls, volume_id): utils.synchronized_remove('my-driver-lock-' + volume_id) Global locks ~~~~~~~~~~~~ Global locks, also known as distributed locks in Cinder, provide mutual exclusion in the global scope of the Cinder services. They allow you to have a lock regardless of the backend, for example to prevent deleting a volume that is being cloned, or making sure that your driver is only creating a Target group at a time, in the whole Cinder deployment, to avoid race conditions. Global locking functionality is provided by the `synchronized` decorator from `cinder.coordination`. This method is more advanced than the one used for the `Process locks`_ and the `Node locks`_, as the name supports templates. For the template we have all the method parameters as well as `f_name` that represents that name of the method being decorated. Templates must use Python's `Format Specification Mini-Language`_. Using brackets we can access the function name `'{f_name}'`, an attribute of a parameter `'{volume.id}'`, a key in a dictonary `{snapshot['name']}`, etc. Up to date information on the method can be found in the `synchronized method's documentation`_. Example from the delete volume operation in `cinder/volume/manager.py`. We use the `id` attribute of the `volume` parameter, and the function name to form the lock name: .. code-block:: python @coordination.synchronized('{volume.id}-{f_name}') @objects.Volume.set_workers def delete_volume(self, context, volume, unmanage_only=False, cascade=False): Example from create snapshot in `cinder/volume/drivers/nfs.py`, where we use an attribute from `self`, and a recursive reference in the `snapshot` parameter. .. code-block:: python @coordination.synchronized('{self.driver_prefix}-{snapshot.volume.id}') def create_snapshot(self, snapshot): Internally Cinder uses the `Tooz library`_ to provide the distributed locking. By default, this library is configured for Active-Passive deployments, where it uses file locks equivalent to those used for `Node locks`_. To support Active-Active deployments a specific driver will need to be configured using the `backend_url` configuration option in the `coordination` section. For a detailed description of the requirement for global locks in cinder please refer to the `replacing local locks with Tooz`_ and `manager local locks`_ specs. Drivers that use global locks based on volumes should implement method ``clean_volume_file_locks`` and if they use locks based on the snapshots they should also implement ``clean_snapshot_file_locks`` and use method ``synchronized_remove`` from ``cinder.coordination``. Example for the 3PAR driver: .. code-block:: python @classmethod def clean_volume_file_locks(cls, volume_id): coordination.synchronized_remove('3par-' + volume_id) Cinder locking ~~~~~~~~~~~~~~ Cinder uses the different locking mechanisms covered in this section to assure mutual exclusion on some actions. Here's an *incomplete* list: Barbican keys - Lock scope: Global. - Critical section: Migrate Barbican encryption keys. - Lock name: `{id}-_migrate_encryption_key`. - Where: `_migrate_encryption_key` method. - File: `cinder/keymgr/migration.py`. Backup service - Lock scope: Node. - Critical section: Cleaning up resources at startup. - Lock name: `backup-pgid-{process-group-id}`. - Where: `_cleanup_incomplete_backup_operations` method. - File: `cinder/backup/manager.py`. Image cache - Lock scope: Global. - Critical section: Create a new image cache entry. - Lock name: `{image_id}`. - Where: `_prepare_image_cache_entry` method. - File: `cinder/volume/flows/manager/create_volume.py`. Throttling: - Lock scope: Process. - Critical section: Set parameters of a cgroup using `cgset` CLI. - Lock name: `''BlkioCgroup'`. - Where: `_inc_device` and `_dec_device` methods. - File: `cinder/volume/throttling.py`. Volume deletion: - Lock scope: Global. - Critical section: Volume deletion operation. - Lock name: `{volume.id}-delete_volume`. - Where: `delete_volume` method. - File: `cinder/volume/manager.py`. Volume deletion request: - Lock scope: Status based. - Critical section: Volume delete RPC call. - Status requirements: attach_status != 'attached' && not migrating - Where: `delete` method. - File: `cinder/volume/api.py`. Snapshot deletion: - Lock scope: Global. - Critical section: Snapshot deletion operation. - Lock name: `{snapshot.id}-delete_snapshot`. - Where: `delete_snapshot` method. - File: `cinder/volume/manager.py`. Volume creation: - Lock scope: Global. - Critical section: Protect source of volume creation from deletion. Volume or Snapshot. - Lock name: `{snapshot-id}-delete_snapshot` or `{volume-id}-delete_volume}`. - Where: Inside `create_volume` method as context manager for calling `_fun_flow`. - File: `cinder/volume/manager.py`. Attach volume: - Lock scope: Global. - Critical section: Updating DB to show volume is attached. - Lock name: `{volume_id}`. - Where: `attach_volume` method. - File: `cinder/volume/manager.py`. Detach volume: - Lock scope: Global. - Critical section: Updating DB to show volume is detached. - Lock name: `{volume_id}-detach_volume`. - Where: `detach_volume` method. - File: `cinder/volume/manager.py`. Volume upload image: - Lock scope: Status based. - Critical section: `copy_volume_to_image` RPC call. - Status requirements: status = 'available' or (force && status = 'in-use') - Where: `copy_volume_to_image` method. - File: `cinder/volume/api.py`. Volume extend: - Lock scope: Status based. - Critical section: `extend_volume` RPC call. - Status requirements: status in ('in-use', 'available') - Where: `_extend` method. - File: `cinder/volume/api.py`. Volume migration: - Lock scope: Status based. - Critical section: `migrate_volume` RPC call. - Status requirements: status in ('in-use', 'available') && not migrating - Where: `migrate_volume` method. - File: `cinder/volume/api.py`. Volume retype: - Lock scope: Status based. - Critical section: `retype` RPC call. - Status requirements: status in ('in-use', 'available') && not migrating - Where: `retype` method. - File: `cinder/volume/api.py`. Driver locking ~~~~~~~~~~~~~~ There is no general rule on where drivers should use locks. Each driver has its own requirements and limitations determined by the storage backend and the tools and mechanisms used to manage it. Even if they are all different, commonalities may exist between drivers. Providing a list of where some drivers are using locks, even if the list is incomplete, may prove useful to other developers. To contain the length of this document and keep it readable, the list with the :doc:`drivers_locking_examples` has its own document. Cinder-API ---------- The API service is the public face of Cinder. Its REST API makes it possible for anyone to manage and consume block storage resources. So requests from clients can, and usually do, come from multiple sources. Each Cinder API service by default will run multiple workers. Each worker is run in a separate subprocess and will run a predefined maximum number of green threads. The number of API workers is defined by the `osapi_volume_workers` configuration option. Defaults to the number of CPUs available. Number of green threads per worker is defined by the `wsgi_default_pool_size` configuration option. Defaults to 100 green threads. The service takes care of validating request parameters. Any detected error is reported immediately to the user. Once the request has been validated, the database is changed to reflect the request. This can result in adding a new entry to the database and/or modifying an existing entry. For create volume and create snapshot operations the API service will create a new database entry for the new resource. And the new information for the resource will be returned to the caller right after the service passes the request to the next Cinder service via RPC. Operations like retype and delete will change the database entry referenced by the request, before making the RPC call to the next Cinder service. Create backup and restore backup are two of the operations that will create a new entry in the database, and modify an existing one. These database changes are very relevant to the high availability operation. Cinder core code uses resource states extensively to control exclusive access to resources. Race prevention ~~~~~~~~~~~~~~~ The API service checks that resources referenced in requests are in a valid state. Unlike allowed resource states, valid states are those that allow an operation to proceed. Validation usually requires checking multiple conditions. Careless coding leaves Cinder open to race conditions. Patterns in the form of DB data read, data check, and database entry modification, must be avoided in the Cinder API service. Cinder has implemented a custom mechanism, called conditional updates, to prevent race conditions. Leverages the SQLAlchemy ORM library to abstract the equivalent ``UPDATE ... FROM ... WHERE;`` SQL query. Complete reference information on the conditional updates mechanism is available on the :doc:`api_conditional_updates` development document. For a detailed description on the issue, ramifications, and solution, please refer to the `API Race removal spec`_. Cinder-Volume ------------- The most common deployment option for Cinder-Volume is as Active-Passive. This requires a common storage backend, the same Cinder backend configuration in all nodes, having the `backend_host` set on the backend sections, and using a high-availability cluster resource manager like Pacemaker. .. attention:: Having the same `host` value configured on more than one Cinder node is highly discouraged. Using `backend_host` in the backend section is the recommended way to set Active-Passive configurations. Setting the same `host` field will make Scheduler and Backup services report using the same database entry in the `services` table. This may create a good number of issues: We cannot tell when the service in a node is down, backups services will break other running services operation on start, etc. For Active-Active configurations we need to include the Volume services that will be managing the same backends on the cluster. To include a node in a cluster, we need to define its name in the `[DEFAULT]` section using the `cluster` configuration option, and start or restart the service. .. note:: We can create a cluster with a single volume node. Having a single node cluster allows us to later on add new nodes to the cluster without restarting the existing node. .. warning:: The name of the cluster must be unique and cannot match any of the `host` or `backend_host` values. Non unique values will generate duplicated names for message queues. When a Volume service is configured to be part of a cluster, and the service is restarted, the manager detects the change in configuration and moves existing resources to the cluster. Resources are added to the cluster in the `_include_resources_in_cluster` method setting the `cluster_name` field in the database. Volumes, groups, consistency groups, and image cache elements are added to the cluster. Clustered Volume services are different than normal services. To determine if a backend is up, it is no longer enough checking `service.is_up`, as that will only give us the status of a specific service. In a clustered deployment there could be other services that are able to service the same backend. That's why we'll have to check if a service is clustered using `cinder.is_clustered` and if it is, check the cluster's `is_up` property instead: `service.cluster.is_up`. In the code, to detect if a cluster is up, the `is_up` property from the `Cluster` Versioned Object uses the `last_heartbeat` field from the same object. The `last_heartbeat` is a *column property* from the SQLAlchemy ORM model resulting from getting the latest `updated_at` field from all the services in the same cluster. RPC calls ~~~~~~~~~ When we discussed the `Job distribution`_ we mentioned message queues having multiple listeners and how they were used to distribute jobs in a round robin fashion to multiple nodes. For clustered Volume services we have the same queues used for broadcasting and to address a specific node, but we also have queues to broadcast to the cluster and to send jobs to the cluster. Volume services will be listening in all these queues and they can receive request from any of them. Which they'll have to do to process RPC calls addressed to the cluster or to themselves. Deciding the target message queue for request to the Volume service is done in the `volume/rpcapi.py` file. We use method `_get_cctxt`, from the `VolumeAPI` class, to prepare the client context to make RPC calls. This method accepts a `host` parameter to indicate where we want to make the RPC. This `host` parameter refers to both hosts and clusters, and is used to determine the server and the topic. When calling the `_get_cctx` method, we would need to pass the resource's `host` field if it's not clustered, and `cluster_name` if it is. To facilitate this, clustered resources implement the `service_topic_queue` property that automatically gives you the right value to pass to `_get_cctx`. An example for the create volume: .. code-block:: python def create_volume(self, ctxt, volume, request_spec, filter_properties, allow_reschedule=True): cctxt = self._get_cctxt(volume.service_topic_queue) cctxt.cast(ctxt, 'create_volume', request_spec=request_spec, filter_properties=filter_properties, allow_reschedule=allow_reschedule, volume=volume) As we know, snapshots don't have a `host` or `cluseter_name` fields, but we can still use the `service_topic_queue` property from the `Snapshot` Versioned Object to get the right value. The `Snapshot` internally checks these values from the `Volume` Versioned Object linked to that `Snapshot` to determine the right value. Here's an example for deleting a snapshot: .. code-block:: python def delete_snapshot(self, ctxt, snapshot, unmanage_only=False): cctxt = self._get_cctxt(snapshot.service_topic_queue) cctxt.cast(ctxt, 'delete_snapshot', snapshot=snapshot, unmanage_only=unmanage_only) Replication ~~~~~~~~~~~ Replication v2.1 failover is requested on a per node basis, so when a failover request is received by the API it is then redirected to a specific Volume service. Only one of the services that form the cluster for the storage backend will receive the request, and the others will be oblivious to this change and will continue using the same replication site they had been using before. To support the replication feature on clustered Volume services, drivers need to implement the `Active-Active replication spec`_. In this spec the `failover_host` method is split in two, `failover` and `failover_completed`. On a backend supporting replication on Active-Active deployments, `failover_host` would end up being a call to `failover` followed by a call to `failover_completed`. Code extract from the RBD driver: .. code-block:: python def failover_host(self, context, volumes, secondary_id=None, groups=None): active_backend_id, volume_update_list, group_update_list = ( self.failover(context, volumes, secondary_id, groups)) self.failover_completed(context, secondary_id) return active_backend_id, volume_update_list, group_update_list Enabling Active-Active on Drivers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Supporting Active-Active configurations is driver dependent, so they have to opt in. By default drivers are not expected to support Active-Active configurations and will fail on startup if we try to deploy them as such. Drivers can indicate they support Active-Active setting the class attribute `SUPPORTS_ACTIVE_ACTIVE` to `True`. If a single driver supports multiple storage solutions, it can leave the class attribute as it is, and set it as an overriding instance attribute on `__init__`. There is no well defined procedure required to allow driver maintainers to set `SUPPORTS_ACTIVE_ACTIVE` to `True`. Though there is an ongoing effort to write a spec on `testing Active-Active`_. So for now, we could say that it's "self-certification". Vendors must do their own testing until they are satisfied with their testing. Real testing of Active-Active deployments requires multiple Cinder Volume nodes on different hosts, as well as a properly configured Tooz DLM. Driver maintainers can use Devstack to catch the rough edges on their initial testing. Running 2 Cinder Volume services on an All-In-One DevStack installation makes it easy to deploy and debug. Running 2 Cinder Volume services on the same node simulating different nodes can be easily done: - Creating a new directory for local locks: Since we are running both services on the same node, a file lock could make us believe that the code would work on different nodes. Having a different lock directory, default is `/opt/stack/data/cinder`, will prevent this. - Creating a layover cinder configuration file: Cinder supports having different configurations files where each new files overrides the common parts of the old ones. We can use the same base cinder configuration provided by DevStack and write a different file with a `[DEFAULT]` section that configures `host` (to anything different than the one used in the first service), and `lock_path` (to the new directory we created). For example we could create `/etc/cinder/cinder2.conf`. - Create a new service unit: This service unit should be identical to the existing `devstack@c-vol` except replace the `ExecStart` that should have the postfix `--config-file /etc/cinder/cinder2.conf`. Once we have tested it in DevStack way we should deploy Cinder in a new Node, and continue with the testings. It is not necessary to do the DevStack step first, we can jump to having Cinder in multiple nodes right from the start. Whatever way we decide to test this, we'll have to change `cinder.conf` and add the `cluster` configuration option and restart the Cinder service. We also need to modify the driver under test to include the `SUPPORTS_ACTIVE_ACTIVE = True` class attribute. Cinder-Scheduler ---------------- Unlike the Volume service, the Cinder Scheduler has supported Active-Active deployments for a long time. Unfortunately, current support is not perfect, scheduling on Active-Active deployments has some issues. The root cause of these issues is that the scheduler services don't have a reliable single source of truth for the information they rely on to make the scheduling. Volume nodes periodically send a broadcast with the backend stats to all the schedulers. The stats include total storage space, free space, configured maximum over provisioning, etc. All the backends' information is stored in memory at the Schedulers, and used to decide where to create new volumes, migrate them on a retype, and so on. For additional information on the stats, please refer to the :ref:`volume stats ` section of the Contributor/Developer docs. Trying to keep updated stats, schedulers reduce available free space on backends in their internal dictionary. These updates are not shared between schedulers, so there is not a single source of truth, and other schedulers don't operate with the same information. Until the next stat reports is sent, schedulers will not get in sync. This may create unexpected behavior on scheduling. There are ongoing efforts to fix this problem. Multiple solutions are being discussed: using the database as a single source of truth, or using an external placement service. When we added Active-Active support to the Cinder Volume service we had to update the scheduler to understand it. This mostly entailed 3 things: - Setting the `cluster_name` field on Versioned Objects once a backend has been chosen. - Grouping stats for all clustered hosts. We don't want to have individual entries for the stats of each host that manages a cluster, as there should be only one up to date value. We stopped using the `host` field as the id for each host, and created a new property called `backend_id` that takes into account if the service is clustered and returns the host or the cluster as the identifier. - Prevent race conditions on stats reports. Due to the concurrency on the multiple Volume services in a cluster, and the threading in the Schedulers, we could receive stat reports out of order (more up to date stats last). To prevent this we started time stamping the stats on the Volume services. Using the timestamps schedulers can discard older stats. Heartbeats ~~~~~~~~~~ Like any other non API service, schedulers also send heartbeats using the database. The difference is that, unlike other services, the purpose of these heartbeats is merely informative. Admins can easily know whether schedulers are running or not with a Cinder command. Using the same `host` configuration in all nodes defeats the whole purpose of reporting heartbeats in the schedulers, as they will all report on the same database entry. Cinder-Backups -------------- Originally, the Backup service was not only limited to Active-Passive deployments, but it was also tightly coupled to the Volume service. This coupling meant that the Backup service could only backup volumes created by the Volume service running on the same node. In the Mitaka cycle, the `Scalable Backup Service spec`_ was implemented. This added support for Active-Active deployments to the backup service. The Active-Active implementation for the backup service is different than the one we explained for the Volume Service. The reason lays not only on the fact that the Backup service supported it first, but also on it not supporting multiple backends, and not using the Scheduler for any operations. Scheduling ~~~~~~~~~~ For backups, it's the API the one selecting the host that will do the backup, using methods `_get_available_backup_service_host`, `_is_backup_service_enabled`, and `_get_any_available_backup_service`. These methods use the Backup services' heartbeats to determine which hosts are up to handle requests. Cleaning ~~~~~~~~ Cleanup on Backup services is only performed on start up. To know what resources each node is working on, they set the `host` field in the backup Versioned Object when they receive the RPC call. That way they can select them for cleanup on start. The method in charge of doing the cleanup for the backups is called `_cleanup_incomplete_backup_operations`. Unlike with the Volume service we cannot have a backup node clean up after another node's. .. _API Race removal spec: https://specs.openstack.org/openstack/cinder-specs/specs/mitaka/cinder-volume-active-active-support.html .. _Cinder Volume Job Distribution: https://specs.openstack.org/openstack/cinder-specs/specs/ocata/ha-aa-job-distribution.html .. _RabbitMQ tutorials: https://www.rabbitmq.com/getstarted.html .. _Cleanup spec: https://specs.openstack.org/openstack/cinder-specs/specs/newton/ha-aa-cleanup.html .. _synchronized method's documentation: https://docs.openstack.org/cinder/latest/contributor/api/cinder.coordination.html#module-cinder.coordination .. _Format Specification Mini-Language: https://docs.python.org/2.7/library/string.html#formatspec .. _Tooz library: https://opendev.org/openstack/tooz .. _replacing local locks with Tooz: https://specs.openstack.org/openstack/cinder-specs/specs/mitaka/ha-aa-tooz-locks.html .. _manager local locks: https://specs.openstack.org/openstack/cinder-specs/specs/newton/ha-aa-manager_locks.html .. _Active-Active replication spec: https://specs.openstack.org/openstack/cinder-specs/specs/ocata/ha-aa-replication.html .. _testing Active-Active: https://review.openstack.org/#/c/443504 .. _Scalable Backup Service spec: https://specs.openstack.org/openstack/cinder-specs/specs/mitaka/scalable-backup-service.html