diff --git a/specs/7.0/volume-manager-refactoring.rst b/specs/7.0/volume-manager-refactoring.rst new file mode 100644 index 00000000..f37eda38 --- /dev/null +++ b/specs/7.0/volume-manager-refactoring.rst @@ -0,0 +1,684 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +========================== +Volume manager refactoring +========================== + +https://blueprints.launchpad.net/fuel/+spec/volume-manager-refactoring + +Currently the Nailgun volume manager is not flexible and customizable enough +to address many needs of users. For example, some users want some volumes +to be untouched during OS provisioning, some users want it to be possible +to deploy software RAIDs or configure FS mount options, etc. + +Problem description +=================== + +There are use cases which aren't covered by the functionality of the current +implementation of the volume manager. + +These use cases include at least the following: + +* Volume preservation + + Sometimes when a node is going to be re-provisioned there could be + volumes (partitions, logical volumes, MD devices) which the user wants + to remain untouched. + +* FS mount options + + Sometimes the user needs to mount some file systems using specific options, like + noatime or ro, etc. + +* Bootable disks + + Currently we install the bootloader on all hard drives, but it does not always + correspond to what the user wants. + +* Flexible partitioning scheme + + Currently we have a predefined partition scheme which assumes, for example, + that we put the root file system on a logical volume and that we don't create + a separate file system for /var. These assumptions limit users in their + ability to create the partition scheme they might want. + +* Pluggable partitioning scheme + + Some Fuel plugins assume we need to have additional partitions on a node. 
+ Currently, plugin partitions can conflict with existent partitions and + we need to resolve these potential conflicts. + +Proposed change +=============== + +Provisioning process in general can be considered as the following +set of steps: + +:: + + +---------+ +----------+ +-----------+ +-----------+ + | | | | | | | | + |discovery+--> |allocation+--> |OS building+--> |OS copying | + | | | | | | | | + +---------+ +----------+ +-----------+ +-----------+ + +All those main steps can be implemented in a monolithic manner or they can be +a set of separable modules/plugins/extensions. + +1. Discovery + + This step is when we try to find out which hard drives are available on a + node. Anaconda and debian-installer do the same at the very beginning of + provisioning process. In our case this step is implemented as a separate + service which is called Nailgun Agent. + +2. Allocation + + On this step the default partitioning scheme is generated. This allocation + step can be data driven when, for example, a user of a provisioning agent + defines which file systems she needs to create and their priorities but + not their exact sizes. Again anaconda and debian-installer do the same + using some default hard coded or user defined (kickstart/preseed) + partitioning metadata. In our case it is implemented as the + ``volume manager`` module in Nailgun. + +3. OS building + + On this step OS is built from scratch using packages repositories or any + other available mechanisms. Anaconda builds OS using rpm packages and yum. + Debian-installer uses deb packages and debootstrap. In terms of Fuel this + step is exactly what we call OS image building. In contrast to anaconda + and debian-installer we build OS just once somewhere on the master node or + on a developer node during ISO building. We then just copy this pre-built + OS image on all provisioned nodes. 
This step indirectly depends on the + previous step (step 2) because a user might be potentially + interested in assigning some specific options for a particular file system. + Step 2 (allocation) is exactly the place where we define which partitions + and file systems we need. OS building (or equivalently OS image building) + being implemented in the scope of Fuel Agent can be potentially run on the + slave node if, for example, this node requires specific file system options. + +4. OS copying + + This step makes sense only for the image based approach when we build the OS + remotely. For example, anaconda and debian-installer build the OS right on the + file system where it is going to live on a node. + +Anaconda and debian-installer implement these four steps in a monolithic +manner. For example, we can not separate the OS building step from the whole +provisioning process. In case of Fuel all these steps are implemented as separate +components. Currently, Fuel Agent implements steps 3 and 4, but it looks like +Fuel Agent is also the right place to implement steps 1 and 2 +[#discovery]_. +This spec does not concern step 1. Re-implementing the functionality +of Nailgun Agent in the scope of Fuel Agent is a matter for a separate feature. +This spec is entirely about step 2. + +The suggestion is to implement dynamic volume allocation on the Volume Manager +side keeping as much code as possible in Fuel Agent and reusing it. +The motivation behind this is: + +* Fuel Agent already has quite a detailed partitioning + object model ``fuel_agent/objects/partition.py`` which just needs to + be developed so as to support dynamic allocation over existent hard drives + on a node. +* Allocation scheme can influence steps 3 and 4. So, it is much easier to + deal with the whole provisioning process when it is entirely implemented in + terms of one modular component. +* Being quite independent Fuel Agent can be used w/o Fuel. 
And it would be + great to make it able to dynamically allocate volumes when it is used + out of Fuel. +* In the future we will need to allocate volumes not only basing on their + size but also taking into account disk types and other parameters. And it + is going to be much easier to introduce those parameters in the scope of + Fuel Agent object model. + +On the other hand we are moving towards modular Fuel architecture, so, it +looks like it is the place where we can start putting our efforts towards +modularisation. The suggestion is to implement current volume manager +in nailgun as extension. Being installed this volume manager extension +imports Fuel Agent code in order to generate volume allocation +(metadata/UI driven). The default volume allocation should be +configurable via allocation metadata. A user then can modify this default +allocation on the disk management tab on UI. If other extensions +(ceph or mongo, etc.) need to modify volume allocation scheme they need +to use volume manager extension for this and they need to interact +with it only via its API. + +So, the feature can be considered as two independent tasks: + +1. Convert Nailgun volume manager into Nailgun volume manager extension +2. Implement dynamic volume allocation procedure in the scope of Fuel Agent + and introduce this functionality into Nailgun volume manager extension + importing necessary modules from Fuel Agent. + +The coverage scheme then will be as follows: + +:: + + +-------------------------+ +----------------------------+ + |Nailgun & vol. 
extension | | Fuel Agent | + +-------------------------+ +----------------------------+ + +---------+ +----------+ +-----------+ +-----------+ + | | | | | | | | + |discovery+--> |allocation+--> |OS building+--> |OS copying | + | | | | | | | | + +---------+ +----------+ +-----------+ +-----------+ + +A more detailed scheme of how it will work: + +:: + + VolumeManager + +--------------------+ + | +----------------+ | + | | objects | | + | |(from fuel_agent| | + | | import objects)| | + | +----------------+ | + | | +-----------+ + | +------------+ +---> | + | | new volumes| | | nailgun | + | | allocation | <---+ | + | | algorithm | | +-----------+ + | +------------+ | + +---------+----------+ + | + | +---------------+ + | | serialize | + | | ready to use | + | |PartitionScheme| + | +---------------+ + | + | fuel_agent + | + +---------v-------------------------+ + | +---------+ +----------------+ | + | | objects | | partitioning | | + | +---------+ | provisioning | | + | +----------------+ | + | | + | +-------------------------------+ | + | | NEW DataDriver | | + | | (deserialize obtained | | + | | PartitionScheme) | | + | +-------------------------------+ | + +-----------------------------------+ + + +The new volume allocation algorithm will be implemented first in terms of +Fuel Agent and then used (imported but not moved) in volume manager. 
+ +Dynamic allocation +------------------ + +Dynamic allocation metadata could look like (exact format will be found +during actual implementation): + +:: + + - id: 1 + type: "fs" + mount: "/boot" + device_id: 9 + fs_type: "ext2" + + - id: 2 + type: "fs" + mount: "/" + device_id: 5 + fs_type: "ext4" + + - id: 3 + type: "fs" + mount: "swap" + device_id: 6 + fs_type: "swap" + + - id: 4 + type: "fs" + device_id: 7 + mount: "/var/lib/mysql" + fs_type: "ext4" + block_size: "4K" + + - id: 5 + type: "lv" + vg_id: 8 + name: "root" + minsize: "10G" + bestsize: "15G" + priority: 1000 + + - id: 6 + type: "lv" + vg_id: 8 + minsize: "1G" + maxsize: "8G" + priority: 200 + name: "swap" + + - id: 7 + type: "partition" + minsize: "20G" + device_id: __auto__ + + - id: 8 + type: "vg" + name: "os" + minsize: __auto__ + pvs_id: __auto__ + + - id: 9 + type: "md" + level: "mirror" + minsize: "200M" + maxsize: "400M" + bestsize: "200M" + numactive: 2 + numspares: 1 + devices_id: __auto__ + spares_id: __auto__ + +The format of these metadata should be as close to the format of Fuel Agent +objects as possible. It can make it easier to serialize/de-serialize +objects. + +Let's go through these metadata step by step. + +1. Each item has id field which is used to connect objects wherever they need + to be connected avoiding at the same time non-trivial data hierarchies. + However, id is used only for serialized set of objects. When it is a set + of Python objects, ``device_id`` will be just ``device`` and it will be + a Python reference to the object. ``id`` can be integer or string for + sake of readability. Python objects are identified by their contents. + For example, there can not be two file systems with the + same mount point on a node. So, mount point can be + considered as unique identifier for the file system object. + Logical volumes are identified by the combination + of volume group name and logical volume name. + + That metadata is flat makes it easily scalable. 
Any plugin/extension + can append or remove items. For example, the following item means we need + to allocate ``ext2`` file system with ``/boot`` mount point + on device with ``id`` equal to 10. + +:: + + - id: 1 + type: "fs" + mount: "/boot" + device_id: 10 + fs_type: "ext2" + +2. Logical volume items have a ``vg_id`` field which identifies the volume group where + a logical volume is to be placed. + +:: + + - id: 5 + type: "lv" + vg_id: 8 + name: "root" + minsize: "10G" + bestsize: "15G" + maxsize: "50G" + priority: 1000 + +The fields ``minsize``, ``maxsize`` +and ``bestsize`` are used to set limits and give recommendations about the +size of the logical volume. The field ``priority`` is going to be used for +sharing the volume group space over all logical volumes in this group. +The priority is used as the weight of a particular volume. For example, +if two volumes are given and we need to share the whole space between these +two volumes, we can use the following algorithm: + +:: + + space_1 = total_space * priority_1 / (priority_1 + priority_2) + space_2 = total_space * priority_2 / (priority_1 + priority_2) + +Allocation algorithm for logical volumes should look like the following: + + - Allocating minimal size for each logical volume (fail if there is not + enough space) + - Allocating remaining space up to recommended size for each logical volume + taking into account their priorities + - Allocating remaining space up to maximal size for each logical volume + taking into account their priorities. If maximal size is not set, we + assume there is no such limit. + +Those size limitation/recommendation/priority fields are optional. +If they are not set we can use some default (0) +priority and allocate remaining space for the logical volume taking into +account this default priority value. + +3. Volume group can also have ``minsize``, ``maxsize``, ``bestsize`` and + ``priority`` fields which are to be used exactly the same way as in case + of logical volumes. 
If ``minsize`` is equal to ``__auto__`` then it means + it should be calculated as a sum of minimal sizes of all logical volumes + in the volume group. The field ``pvs_id`` should define a set of physical + volume identifiers which constitute the volume group. If this field is + equal to ``__auto__`` then it means we should define physical volumes + dynamically during allocation. For example, we need to allocate 100G for + the volume group, and there are two disks on the node partly allocated for + other volume groups and partitions. Let's say there is 50G of free space on + the first disk and 50G of free space on the second disk. So, two physical + volumes (50G each) will be allocated for the volume group. + +4. Plain partition can have the same limitation/recommendation fields + ``minsize``, ``maxsize``, ``bestsize``, ``priority`` and these fields have + the same meaning. It is necessary to note that unlike volume groups, + plain partitions can not be split into parts (physical volumes). + So, plain partitions should be allocated before volume groups and then + the remaining free space can be flexibly used for volume groups. + +5. MD device has the same dynamic allocation fields, but the trick here is + that we need to allocate several partitions for one MD device and these + partitions are to be located on different hard drives. + +Ideally, dynamic allocation process must take into account many other +parameters apart from just size of a volume. For example, we'd better avoid +using SSD and HDD disks together for one volume group. Another example is +we need to set file system block size taking into account the type of hard +drive, otherwise we can encounter some serious performance issues. +But due to tight deadline for 7.0 let's implement ONLY size driven allocation. +Other metadata can be easily introduced later. + +Another important thing is that currently Fuel Agent objects are +often initialized with actual block device names (e.g. /dev/sda). 
But in case +of dynamic allocation the actual device names are unknown when an object is +instantiated. Actual block device name makes sense not earlier than the +command parted is run. The correct way how to deal with this is to +modify objects so as to make it possible to postpone actual device evaluation +(e.g. ``fuel_agent/objects/device.py:Loop``). In partition scheme there +should not be names like ``/dev/sda3`` until it is evaluated and actualized. + +Volume sets, roles and compatibility +------------------------------------ + +Several named sets of volume items (like those which are outlined above) +can be defined and then these sets can be combined so as to define other sets. +When a set defines another set as its element, then this element should be +treated as a subset rather than an element. So, the resulting set is to +remain flat. In the example below, ``Set_3`` is a set of +elements: ``Item_1``, ``Item_2``, ``Item_4``. + +:: + + Set_1: + - Item_1 + - Item_2 + Set_2: + - Item_3 + Set_3: + - Set_1 + - Item_4 + +As mentioned above, every volume item is to have ``id`` field. This field is +only used to connect items with each other inside a set. When a set has +another set as its subset, other items ``id`` should not intersect with +those in the subset. Otherwise, items with the same ``id`` will override +those in the subset. It can be used if one, in fact, wants to override +one or more items in the subset. + +For example: + +:: + + Set_1: + - id: 1 + type: "fs" + ... + - id: 2 + type: "partition" + ... + Set_2: + - Set_1 + - id: 2 + type: "lv" + ... + - id: 3 + type: "vg" + ... + +gives ``Set_2`` equal to: + +:: + + Set_2: + - id: 1 + type: "fs" + ... + - id: 2 + type: "lv" + ... + - id: 3 + type: "vg" + +Some of the sets are to be named after node role names. So, if a set has the +same name as a role, then it means this set of volumes will be used for a node +with this role assigned. 
For example, the following means ``Controller_Role`` +will have three volume items: ``Item_1``, ``Item_2``, ``Item_3``. + +:: + + Set_1: + - Item_1 + - Item_2 + Controller_Role: + - Set_1 + - Item_3 + +If we have several roles assigned for a node and these roles define volume +items with parameters which conflict with each other, we need to be able to +resolve the conflict if it is possible or report error if the conflict can't +be resolved. + +:: + + Role_1: + - type: "lv" + name: "my_favorite_lv" + vg_id: "my_favorite_vg" + minsize: 10 + maxsize: 30 + Role_2: + - type: "lv" + name: "my_favorite_lv" + vg_id: "my_favorite_vg" + minsize: 20 + maxsize: 50 + +The example above describes two roles which define the same logical volume +differently. Roles do not contain each other as their subsets, so, we can not +override logical volume definition from one role with parameters from another. +Roles don't have priorities, they are equal in their rights to define +volume items. The only way to deal with this is to resolve the conflict. + +Fortunately, it is always possible to consider parameter intervals (continuous +or enumerable) as abstract sets which can intersect with one another. If the +intersection is empty, then we need to conclude those parameters +are incompatible and report an error. If the intersection is not empty, +then the new parameter interval is equal to the intersection. It is not always +the most effective way to reconcile parameters but it is general enough +to be useful for all possible cases. How we calculate the parameter +intersection depends on the nature of a particular parameter. 
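The intersection idea above can be sketched in Python. This is only an illustration under stated assumptions: the helper names ``intersect_interval`` and ``intersect_enum`` are hypothetical and not part of Nailgun or Fuel Agent, sizes are plain numbers, and ``None`` is used here to mean "no limit on this side".

```python
def intersect_interval(a, b):
    """Intersect two closed intervals given as (low, high) tuples.

    Hypothetical helper: None as a bound means "unbounded" on that side.
    """
    lows = [x for x in (a[0], b[0]) if x is not None]
    highs = [x for x in (a[1], b[1]) if x is not None]
    low = max(lows) if lows else None
    high = min(highs) if highs else None
    # An empty intersection means the two definitions cannot be reconciled.
    if low is not None and high is not None and low > high:
        raise ValueError("Incompatible parameters: %r and %r" % (a, b))
    return (low, high)


def intersect_enum(a, b):
    """Intersect two enumerable parameter sets (e.g. allowed fs types)."""
    result = set(a) & set(b)
    if not result:
        raise ValueError("Incompatible parameters: %r and %r" % (a, b))
    return result


# Role_1 asks for 10..30 and Role_2 asks for 20..50 for the same volume.
merged = intersect_interval((10, 30), (20, 50))
```

With the ``Role_1`` and ``Role_2`` definitions from the example, the merged size interval is 20..30. Two intervals such as 10..15 and 20..50 do not intersect, so the sketch raises an error instead of picking one role over the other.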
+ +Let's define the following set of rules: + +:: + + def minsize(minsize_1, minsize_2, maxsize_1, maxsize_2): + result = max(minsize_1, minsize_2) + if result > min(maxsize_1, maxsize_2): + raise Exception("Incompatible parameters") + return result + + def maxsize(maxsize_1, maxsize_2, minsize_1, minsize_2): + result = min(maxsize_1, maxsize_2) + if result < max(minsize_1, minsize_2): + raise Exception("Incompatible parameters") + return result + + def bestsize(bestsize_1, bestsize_2, minsize, maxsize): + result = (bestsize_1 + bestsize_2) / 2.0 + if result > maxsize: + return maxsize + elif result < minsize: + return minsize + else: + return result + + def priority(priority_1, priority_2): + return max(priority_1, priority_2) + +Alternatives +------------ + +We could implement volume management mechanism from scratch and fully +independently from Fuel Agent. But it looks irrational to avoid reusing existing +code and to ignore a beautiful architectural concept. + +Data model impact +----------------- + +Fuel Agent object model is going to be changed so as to include dynamic +allocation methods and data. + +Volume data in Nailgun are stored as plain json in the Node data model. Since +the Nailgun volume manager will be re-implemented as an extension, these volume +data will be moved into an extension table with a foreign key to the Node. + +REST API impact +--------------- + +That part of REST API which deals with volume data is going to be moved into +volume manager extension. + +Upgrade impact +-------------- + +Since Fuel Agent is installed into the bootstrap ramdisk, nodes which are +booted with this ramdisk must be forced to reboot to make sure the newest +version of Fuel Agent is available on slave nodes. + +Also Fuel Agent package should be updated on the master node because Nailgun +volume manager extension is going to use Fuel Agent modules. + +Besides, we need to write a database migration which should create +the new volume manager table and move volume data there. 
+ +Security impact +--------------- + +None + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +In 7.0 there is no plan to expose the new format to users. + +Performance Impact +------------------ + +None + +Plugin impact +------------- + +Volume manager should be implemented as Fuel extension. Other +plugins/extensions which need to modify volume allocation should use +volume manager extension API. + +Other deployer impact +--------------------- + +If a deployer needs a specific allocation mechanism other than the one available +in Fuel Agent she just needs to write her own volume manager extension +implementing corresponding API. But since the Fuel Agent allocation algorithm +is going to be metadata driven, it'll likely be possible to avoid changing +the code of Fuel Agent when covering such specific cases. + +Developer impact +---------------- + +None + +Infrastructure impact +--------------------- + +None + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + + +Other contributors: + + + + + +Work Items +---------- + +1. Implement Nailgun volume manager extension +2. Implement dynamic volume allocation in the scope of Fuel Agent +3. Use new dynamic volume allocation in volume manager extension + +Dependencies +============ + +None + + +Testing +======= + +After moving volume manager extension to new volume allocation format and +algorithm, new system tests need to be added to cover usage of it. + +Acceptance criteria +------------------- + +* Current functionality works as usual with no regressions, except where + it is changed by this spec. +* Volume preservation: ability to reserve partition as untouched while + re-provisioning. +* FS mount options: ability to specify different mount options for + particular partitions. +* Bootable disks: ability to choose which hard drives should contain + bootloader. +* Flexible partitioning scheme: ability to create various partition + schemes. 
+* Pluggable partitioning scheme: ability for plugins to create own + partitions without conflicts. + +Documentation Impact +==================== + +The new volume allocation format needs to be described. + +References +========== + +.. [#discovery] In fact, Fuel Agent currently implements discovery + functionality but only for block devices (hard drives) and it is not + compatible with Nailgun. So, if it is necessary, Fuel Agent is able + to get the information about available hard drives on a node + totally on its own.
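As a supplement to the "Dynamic allocation" section, the three-pass allocation of logical volumes (minimal sizes first, then up to ``bestsize``, then up to ``maxsize``, sharing free space by ``priority``) can be sketched as follows. This is a minimal sketch, not the Fuel Agent implementation: the function names ``allocate`` and ``_share`` are hypothetical, sizes are plain numbers in gigabytes rather than strings like ``"10G"``, and a missing ``bestsize`` is assumed to mean "no recommendation beyond ``minsize``".

```python
INF = float("inf")


def _share(free, room, weight):
    """Split `free` among volumes proportionally to `weight`.

    No volume gets more than room[name]. Returns (grants, leftover).
    Hypothetical helper used only for this illustration.
    """
    grants = dict.fromkeys(room, 0.0)
    active = [n for n in room if room[n] > 0]
    while active and free > 0:
        total = sum(weight[n] for n in active)
        # If all remaining priorities are 0 (the default), share equally.
        part = {n: (free * weight[n] / total if total else free / len(active))
                for n in active}
        capped = [n for n in active if part[n] >= room[n]]
        if not capped:
            # Nobody hits a limit: hand out the proportional shares.
            for n in active:
                grants[n] = part[n]
            return grants, 0.0
        # Volumes that hit their limit get exactly that limit; the rest
        # of the space is redistributed among the others in the next round.
        for n in capped:
            grants[n] = room[n]
            free -= room[n]
        active = [n for n in active if n not in capped]
    return grants, free


def allocate(total, volumes):
    """Return {name: size} for logical volumes in a group of size `total`."""
    # Pass 1: minimal sizes, fail if they do not fit.
    size = {v["name"]: v.get("minsize", 0) for v in volumes}
    free = total - sum(size.values())
    if free < 0:
        raise ValueError("Not enough space for minimal sizes")
    weight = {v["name"]: v.get("priority", 0) for v in volumes}
    # Pass 2 grows volumes up to bestsize, pass 3 up to maxsize.
    for limit, default in (("bestsize", None), ("maxsize", INF)):
        room = {}
        for v in volumes:
            name = v["name"]
            target = v.get(limit, size[name] if default is None else default)
            room[name] = max(target - size[name], 0)
        grants, free = _share(free, room, weight)
        for name in size:
            size[name] += grants[name]
    return size


# The "root" and "swap" logical volumes from the metadata example, in GB.
volumes = [
    {"name": "root", "minsize": 10, "bestsize": 15, "priority": 1000},
    {"name": "swap", "minsize": 1, "maxsize": 8, "priority": 200},
]
sizes = allocate(40, volumes)
```

For a 40G volume group this gives ``root`` 35G and ``swap`` 5G: after the minimal sizes and the ``bestsize`` pass, the remaining 24G is split 1000:200 because ``root`` has no ``maxsize`` and ``swap`` does not reach its 8G limit.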