fuel-specs/specs/7.0/volume-manager-refactoring.rst
Vladimir Kozhukalov cf1490a1e6 Added volume manager refactoring spec
Implements: bp volume-manager-refactoring
Change-Id: Icb07160512acb0f63a8a40d06e69dc7a88a93903
2015-07-23 05:19:09 +02:00

..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==========================
Volume manager refactoring
==========================
https://blueprints.launchpad.net/fuel/+spec/volume-manager-refactoring
Currently nailgun volume manager is not flexible and customizable enough
to address many needs of users. For example, some users want some volumes
to be untouched during OS provisioning, some users want it to be possible
to deploy software RAIDs or configure FS mount options, etc.
Problem description
===================
There are use cases which are not covered by the functionality of the current
implementation of the volume manager.
These use cases include at least the following:
* Volume preservation
Sometimes when a node is going to be re-provisioned there could be
volumes (partitions, logical volumes, MD devices) which user wants
to remain untouched.
* FS mount options
Sometimes user needs to mount some file systems using specific options, like
noatime or ro, etc.
* Bootable disks
Currently we install bootloader on all hard drives, but it does not always
correspond to what user wants.
* Flexible partitioning scheme
Currently we have predefined partition scheme which assumes, for example,
that we put root file system on logical volume and that we don't create
separate file system for /var. These assumptions limit users in their
abilities to create a partition scheme they might want.
* Pluggable partitioning scheme
Some Fuel plugins need additional partitions on a node.
Currently, plugin partitions can conflict with existing partitions and
we need to resolve these potential conflicts.
Proposed change
===============
Provisioning process in general can be considered as the following
set of steps:
::

    +---------+    +----------+    +-----------+    +-----------+
    |         |    |          |    |           |    |           |
    |discovery+--> |allocation+--> |OS building+--> |OS copying |
    |         |    |          |    |           |    |           |
    +---------+    +----------+    +-----------+    +-----------+

All those main steps can be implemented in a monolithic manner or they can be
a set of separable modules/plugins/extensions.
1. Discovery
This step is when we try to find out which hard drives are available on a
node. Anaconda and debian-installer do the same at the very beginning of
provisioning process. In our case this step is implemented as a separate
service which is called Nailgun Agent.
2. Allocation
On this step the default partitioning scheme is generated. This allocation
step can be data driven when, for example, a user of a provisioning agent
defines which file systems she needs to create and their priorities but
not their exact sizes. Again anaconda and debian-installer do the same
using some default hard coded or user defined (kickstart/preseed)
partitioning metadata. In our case it is implemented as the
``volume manager`` module in Nailgun.
3. OS building
On this step OS is built from scratch using packages repositories or any
other available mechanisms. Anaconda builds OS using rpm packages and yum.
Debian-installer uses deb packages and debootstrap. In terms of Fuel this
step is exactly what we call OS image building. In contrast to anaconda
and debian-installer we build OS just once somewhere on the master node or
on a developer node during ISO building. We then just copy this pre-built
OS image on all provisioned nodes. This step indirectly depends on the
previous step (step 2) because a user might be potentially
interested in assigning some specific options for a particular file system.
Step 2 (allocation) is exactly the place where we define which partitions
and file systems we need. OS building (or equivalently OS image building)
being implemented in the scope of Fuel Agent can be potentially run on the
slave node if, for example, this node requires specific file system options.
4. OS copying
This step makes sense only for image based approach when we build OS
remotely. For example, anaconda and debian-installer build OS right on the
file system where it is going to live on a node.
Anaconda and debian-installer implement these four steps in a monolithic
manner. For example, the OS building step can not be separated from the rest
of the provisioning process. In the case of Fuel all these steps are
implemented as separate components. Currently, Fuel Agent implements steps 3
and 4, but it looks like Fuel Agent is also the right place to implement
steps 1 and 2 [#discovery]_.
This spec does not concern step 1. Re-implementing the functionality
of Nailgun Agent in the scope of Fuel Agent is a matter for a separate feature.
This spec is entirely about step 2.
The suggestion is to implement dynamic volume allocation on the Volume Manager
side, keeping as much code as possible in Fuel Agent and reusing it.
The motivation behind this is:
* Fuel Agent already has quite a detailed partitioning
object model, ``fuel_agent/objects/partition.py``, which just needs to
be extended to support dynamic allocation over the existing hard drives
on a node.
* The allocation scheme can influence steps 3 and 4. So, it is much easier to
deal with the whole provisioning process when it is entirely implemented in
terms of one modular component.
* Being quite independent, Fuel Agent can be used without Fuel. It would be
great to make it able to dynamically allocate volumes when it is used
outside of Fuel.
* In the future we will need to allocate volumes based not only on their
size but also on disk types and other parameters. It
is going to be much easier to introduce those parameters in the scope of
the Fuel Agent object model.
On the other hand, we are moving towards a modular Fuel architecture, so
this looks like a good place to start putting our efforts into
modularisation. The suggestion is to implement the current volume manager
in Nailgun as an extension. Once installed, this volume manager extension
imports Fuel Agent code in order to generate the volume allocation
(metadata/UI driven). The default volume allocation should be
configurable via allocation metadata. A user can then modify this default
allocation on the disk management tab in the UI. If other extensions
(ceph, mongo, etc.) need to modify the volume allocation scheme, they need
to use the volume manager extension for this and interact
with it only via its API.
So, the feature can be considered as two independent tasks:
1. Convert Nailgun volume manager into Nailgun volume manager extension
2. Implement dynamic volume allocation procedure in the scope of Fuel Agent
and introduce this functionality into Nailgun volume manager extension
importing necessary modules from Fuel Agent.
The coverage scheme then will be as follows:
::

    +-------------------------+    +----------------------------+
    |Nailgun & vol. extension |    |         Fuel Agent         |
    +-------------------------+    +----------------------------+
    +---------+    +----------+    +-----------+    +-----------+
    |         |    |          |    |           |    |           |
    |discovery+--> |allocation+--> |OS building+--> |OS copying |
    |         |    |          |    |           |    |           |
    +---------+    +----------+    +-----------+    +-----------+

A more detailed scheme of how it will work:
::

        VolumeManager
    +--------------------+
    | +----------------+ |
    | |    objects     | |
    | |(from fuel_agent| |
    | | import objects)| |
    | +----------------+ |
    |                    |   +-----------+
    |   +------------+   +--->           |
    |   | new volumes|   |   |  nailgun  |
    |   | allocation |   <---+           |
    |   | algorithm  |   |   +-----------+
    |   +------------+   |
    +---------+----------+
              |
              |   +---------------+
              |   |   serialize   |
              |   | ready to use  |
              |   |PartitionScheme|
              |   +---------------+
              |
              |   fuel_agent
              |
    +---------v-------------------------+
    | +---------+    +----------------+ |
    | | objects |    | partitioning   | |
    | +---------+    | provisioning   | |
    |                +----------------+ |
    |                                   |
    | +-------------------------------+ |
    | |        NEW DataDriver         | |
    | |     (deserialize obtained     | |
    | |       PartitionScheme)        | |
    | +-------------------------------+ |
    +-----------------------------------+

The new volume allocation algorithm will be implemented first in terms of
Fuel Agent and then used (imported but not moved) in the volume manager.
Dynamic allocation
------------------
Dynamic allocation metadata could look like this (the exact format will be
settled during the actual implementation):
::

    - id: 1
      type: "fs"
      mount: "/boot"
      device_id: 9
      fs_type: "ext2"
    - id: 2
      type: "fs"
      mount: "/"
      device_id: 5
      fs_type: "ext4"
    - id: 3
      type: "fs"
      mount: "swap"
      device_id: 6
      fs_type: "swap"
    - id: 4
      type: "fs"
      device_id: 7
      mount: "/var/lib/mysql"
      fs_type: "ext4"
      block_size: "4K"
    - id: 5
      type: "lv"
      vg_id: 8
      name: "root"
      minsize: "10G"
      bestsize: "15G"
      priority: 1000
    - id: 6
      type: "lv"
      vg_id: 8
      minsize: "1G"
      maxsize: "8G"
      priority: 200
      name: "swap"
    - id: 7
      type: "partition"
      minsize: "20G"
      device_id: __auto__
    - id: 8
      type: "vg"
      name: "os"
      minsize: __auto__
      pvs_id: __auto__
    - id: 9
      type: "md"
      level: "mirror"
      minsize: "200M"
      maxsize: "400M"
      bestsize: "200M"
      numactive: 2
      numspares: 1
      devices_id: __auto__
      spares_id: __auto__

The format of these metadata should be as close to the format of Fuel Agent
objects as possible. It can make it easier to serialize/de-serialize
objects.
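A minimal sketch of how such a flat, ``id``-linked list could be turned into
linked Python objects on de-serialization. Plain dicts stand in for Fuel Agent
objects here. The helper name ``link_items`` and the rule that every ``*_id``
field becomes a reference stored under the name without the suffix are
assumptions made for illustration, not existing Fuel Agent code:

```python
def link_items(items):
    """Replace ``*_id`` fields with references to the items they name.

    Fields equal to "__auto__" are left as they are, because they can
    only be resolved by the allocation step. An unknown id raises KeyError.
    """
    by_id = {item["id"]: dict(item) for item in items}
    for item in by_id.values():
        for key in list(item):
            if not key.endswith("_id") or item[key] == "__auto__":
                continue
            value = item.pop(key)
            if isinstance(value, list):
                # e.g. pvs_id: [10, 11] -> pvs: [<item 10>, <item 11>]
                item[key[:-3]] = [by_id[ref] for ref in value]
            else:
                # e.g. device_id: 5 -> device: <item 5>
                item[key[:-3]] = by_id[value]
    return list(by_id.values())


items = [
    {"id": 1, "type": "fs", "mount": "/", "device_id": 5, "fs_type": "ext4"},
    {"id": 5, "type": "lv", "vg_id": 8, "name": "root"},
    {"id": 8, "type": "vg", "name": "os", "pvs_id": "__auto__"},
]
linked = link_items(items)
```

After linking, the file system item points at the logical volume dict
itself, which in turn points at its volume group, so no lookups by ``id``
are needed any more.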
Let's go through these metadata step by step.
1. Each item has an ``id`` field which is used to connect objects wherever they
need to be connected, while avoiding non-trivial data hierarchies.
However, ``id`` is used only for a serialized set of objects. When it is a set
of Python objects, ``device_id`` will be just ``device`` and it will be
a Python reference to the object. ``id`` can be an integer or a string for
the sake of readability. Python objects are identified by their contents.
For example, there can not be two file systems with the
same mount point on a node. So, the mount point can be
considered a unique identifier for the file system object.
Logical volumes are identified by the combination
of volume group name and logical volume name.
The fact that the metadata is flat makes it easy to extend. Any plugin/extension
can append or remove items. For example, the following item means we need
to allocate an ``ext2`` file system with the ``/boot`` mount point
on the device with ``id`` equal to 10.
::

    - id: 1
      type: "fs"
      mount: "/boot"
      device_id: 10
      fs_type: "ext2"

2. Logical volume items have a ``vg_id`` field which identifies the volume group
where a logical volume is to be placed.
::

    - id: 5
      type: "lv"
      vg_id: 8
      name: "root"
      minsize: "10G"
      bestsize: "15G"
      maxsize: "50G"
      priority: 1000

The fields ``minsize``, ``maxsize``
and ``bestsize`` are used to set limits and give recommendations about the
size of the logical volume. The field ``priority`` is going to be used for
sharing the volume group space over all logical volumes in this group.
The priority is used as the weight of a particular volume. For example,
if two volumes are given and we need to share the whole space between these
two volumes, we can use the following algorithm:
::

    space_1 = total_space * priority_1 / (priority_1 + priority_2)
    space_2 = total_space * priority_2 / (priority_1 + priority_2)

The allocation algorithm for logical volumes should look like the following:
- Allocate the minimal size for each logical volume (fail if there is not
enough space)
- Allocate remaining space up to the recommended size for each logical volume,
taking into account their priorities
- Allocate remaining space up to the maximal size for each logical volume,
taking into account their priorities. If the maximal size is not set, we
assume there is no such limit.
The size limitation/recommendation/priority fields are optional.
If they are not set we can use some default (0)
priority and allocate remaining space for the logical volume taking into
account this default priority value.
3. A volume group can also have ``minsize``, ``maxsize``, ``bestsize`` and
``priority`` fields which are to be used in exactly the same way as for
logical volumes. If ``minsize`` is equal to ``__auto__``, it
should be calculated as the sum of the minimal sizes of all logical volumes
in the volume group. The field ``pvs_id`` should define the set of physical
volume identifiers which constitute the volume group. If this field is
equal to ``__auto__``, we should define physical volumes
dynamically during allocation. For example, we need to allocate 100G for
the volume group, and there are two disks on the node partly allocated for
other volume groups and partitions. Let's say there is 50G of free space on
the first disk and 50G of free space on the second disk. So, two physical
volumes (50G each) will be allocated for the volume group.
4. Plain partition can have the same limitation/recommendation fields
``minsize``, ``maxsize``, ``bestsize``, ``priority`` and these fields have
the same meaning. It is necessary to note that unlike volume groups,
plain partitions can not be split into parts (physical volumes).
So, plain partitions should be allocated before volume groups and then
the remaining free space can be flexibly used for volume groups.
5. An MD device has the same dynamic allocation fields, but the trick here is
that we need to allocate several partitions for one MD device and these
partitions are to be located on different hard drives.
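The three-pass allocation of logical volumes described in item 2 could be
sketched as follows. This is only an illustration under stated assumptions:
sizes are plain numbers in a single unit rather than strings like ``"10G"``,
a volume without ``bestsize`` takes part only in the last pass, a missing
``priority`` is treated as a weight of 1 so that the division stays defined,
and the names ``allocate_lvs`` and ``_grow`` are hypothetical:

```python
INF = float("inf")


def _grow(free, lvs, sizes, limit_key):
    """Share ``free`` among ``lvs`` by priority, capping each at ``limit_key``.

    Returns the space left over once every volume has reached its limit.
    """
    pending = [lv for lv in lvs if lv.get(limit_key, INF) > sizes[lv["name"]]]
    while free > 0 and pending:
        weight_sum = sum(lv.get("priority", 1) for lv in pending)
        shares = {lv["name"]: free * lv.get("priority", 1) / weight_sum
                  for lv in pending}
        # Volumes whose weighted share would reach their limit are capped;
        # what they cannot absorb is shared among the rest in the next round.
        capped = [lv for lv in pending
                  if sizes[lv["name"]] + shares[lv["name"]]
                  >= lv.get(limit_key, INF)]
        if not capped:
            for lv in pending:
                sizes[lv["name"]] += shares[lv["name"]]
            return 0
        for lv in capped:
            free -= lv[limit_key] - sizes[lv["name"]]
            sizes[lv["name"]] = lv[limit_key]
            pending.remove(lv)
    return free


def allocate_lvs(vg_size, lvs):
    """Return {lv name: size} for a volume group of ``vg_size`` units."""
    # Pass 1: every volume gets its minimal size or the allocation fails.
    sizes = {lv["name"]: lv.get("minsize", 0) for lv in lvs}
    free = vg_size - sum(sizes.values())
    if free < 0:
        raise ValueError("not enough space for the minimal sizes")
    # Pass 2: grow towards bestsize. Pass 3: grow towards maxsize.
    free = _grow(free, [lv for lv in lvs if "bestsize" in lv], sizes, "bestsize")
    _grow(free, lvs, sizes, "maxsize")
    return sizes


lvs = [
    {"name": "root", "minsize": 10, "bestsize": 15, "maxsize": 50,
     "priority": 1000},
    {"name": "swap", "minsize": 1, "maxsize": 8, "priority": 200},
]
sizes = allocate_lvs(28, lvs)  # root ends up with 25, swap with 3
```

With 28 units, 11 go to the minimal sizes, 5 more bring ``root`` up to its
``bestsize``, and the remaining 12 are split 1000:200 between the two
volumes.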
Ideally, the dynamic allocation process must take into account many other
parameters apart from just the size of a volume. For example, we'd better avoid
using SSD and HDD disks together for one volume group. Another example is that
we need to set the file system block size taking into account the type of hard
drive, otherwise we can encounter serious performance issues.
But due to the tight deadline for 7.0 let's implement ONLY size driven allocation.
Other metadata can be easily introduced later.
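With size as the only criterion, picking physical volumes for a volume group
whose ``pvs_id`` is ``__auto__`` (item 3 above) can be as simple as walking
over the disks and taking free space until the request is covered. A hedged
sketch follows. The function name, the disk names and the greedy order are
assumptions, not a description of the final algorithm:

```python
def split_into_pvs(required, free_by_disk):
    """Return {disk: physical volume size} covering ``required`` units.

    Disks are visited in the order given; each contributes as much of its
    free space as is still needed. Raises ValueError if they cannot cover it.
    """
    pvs = {}
    remaining = required
    for disk, free in free_by_disk.items():
        if remaining <= 0:
            break
        take = min(free, remaining)
        if take > 0:
            pvs[disk] = take
            remaining -= take
    if remaining > 0:
        raise ValueError("not enough free space for the volume group")
    return pvs


# The example from item 3: 100G requested, 50G free on each of two disks.
pvs = split_into_pvs(100, {"sda": 50, "sdb": 50})
```

Plain partitions would have to be placed before this step, since they can
not be split across disks the way a volume group can.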
Another important thing is that currently Fuel Agent objects are
often initialized with actual block device names (e.g. /dev/sda). But in the case
of dynamic allocation the actual device names are unknown when an object is
instantiated. An actual block device name makes sense no earlier than when the
``parted`` command is run. The correct way to deal with this is to
modify objects so as to make it possible to postpone actual device evaluation
(e.g. ``fuel_agent/objects/device.py:Loop``). In a partition scheme there
should not be names like ``/dev/sda3`` until it is evaluated and actualized.
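One possible shape for such postponed evaluation is a placeholder object that
other objects can reference before the device name exists. The class
``LazyDevice`` and its ``actualize`` method below are hypothetical and are not
the existing Fuel Agent API. They only illustrate the idea:

```python
class LazyDevice:
    """Placeholder for a block device whose name is not known yet."""

    def __init__(self, name=None):
        self.name = name

    def actualize(self, name):
        """Record the real device name once partitioning has produced it."""
        self.name = name

    def __str__(self):
        if self.name is None:
            raise RuntimeError("device name has not been evaluated yet")
        return self.name


# The file system is described before any partition exists...
device = LazyDevice()
boot_fs = {"mount": "/boot", "fs_type": "ext2", "device": device}

# ...and the name is filled in only after the partition has been created.
device.actualize("/dev/sda3")
resolved = str(boot_fs["device"])  # "/dev/sda3"
```

Raising on early access, rather than returning an empty string, makes it
obvious when some code tries to use a device name before it is known.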
Volume sets, roles and compatibility
------------------------------------
Several named sets of volume items (like those which are outlined above)
can be defined and then these sets can be combined so as to define other sets.
When a set defines another set as its element, then this element should be
treated as a subset rather than an element. So, the resulting set is to
remain flat. In the example below, ``Set_3`` is a set of
elements: ``Item_1``, ``Item_2``, ``Item_4``.
::

    Set_1:
      - Item_1
      - Item_2
    Set_2:
      - Item_3
    Set_3:
      - Set_1
      - Item_4

As mentioned above, every volume item is to have an ``id`` field. This field is
only used to connect items with each other inside a set. When a set has
another set as its subset, the ``id`` values of its other items should not
coincide with those in the subset. Otherwise, items with the same ``id`` will
override those in the subset. This can be used deliberately if one wants to
override one or more items in the subset.
For example:
::

    Set_1:
      - id: 1
        type: "fs"
        ...
      - id: 2
        type: "partition"
        ...
    Set_2:
      - Set_1
      - id: 2
        type: "lv"
        ...
      - id: 3
        type: "vg"
        ...

gives ``Set_2`` equal to:
::

    Set_2:
      - id: 1
        type: "fs"
        ...
      - id: 2
        type: "lv"
        ...
      - id: 3
        type: "vg"

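The flattening and override rules can be sketched in a few lines of Python.
The representation is an assumption made for the example: sets are a dict
mapping a set name to a list whose elements are either item dicts or the name
of another set, and ``flatten`` is a hypothetical helper:

```python
def flatten(name, sets):
    """Return the flat list of items of ``sets[name]``.

    Subsets are expanded first, then the set's own items are applied, so an
    own item overrides a subset item with the same ``id``. Cyclic references
    between sets are not detected in this sketch.
    """
    merged = {}
    for element in sets[name]:
        if isinstance(element, str):  # a reference to another set
            for item in flatten(element, sets):
                merged[item["id"]] = item
    for element in sets[name]:
        if not isinstance(element, str):  # an item defined by this set
            merged[element["id"]] = element
    return list(merged.values())


sets = {
    "Set_1": [
        {"id": 1, "type": "fs"},
        {"id": 2, "type": "partition"},
    ],
    "Set_2": [
        "Set_1",
        {"id": 2, "type": "lv"},
        {"id": 3, "type": "vg"},
    ],
}
flat = flatten("Set_2", sets)
types = [item["type"] for item in flat]  # ["fs", "lv", "vg"]
```

The item with ``id`` 2 from ``Set_1`` is replaced by the one defined in
``Set_2``, which matches the result shown above.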
Some of the sets are to be named after node role names. If a set has the
same name as a role, this set of volumes will be used for a node
with this role assigned. For example, the following means ``Controller_Role``
will have three volume items: ``Item_1``, ``Item_2``, ``Item_3``.
::

    Set_1:
      - Item_1
      - Item_2
    Controller_Role:
      - Set_1
      - Item_3

If we have several roles assigned for a node and these roles define volume
items with parameters which conflict with each other, we need to be able to
resolve the conflict if it is possible or report error if the conflict can't
be resolved.
::

    Role_1:
      - type: "lv"
        name: "my_favorite_lv"
        vg_id: "my_favorite_vg"
        minsize: 10
        maxsize: 30
    Role_2:
      - type: "lv"
        name: "my_favorite_lv"
        vg_id: "my_favorite_vg"
        minsize: 20
        maxsize: 50

The example above describes two roles which define the same logical volume
differently. Roles do not contain each other as subsets, so we can not
override the logical volume definition from one role with parameters from
another. Roles don't have priorities; they have equal rights to define
volume items. The only way to deal with this is to resolve the conflict.
Fortunately, it is always possible to consider parameter intervals (continuous
or enumerable) as abstract sets which can intersect with one another. If the
intersection is empty, then we need to conclude those parameters
are incompatible and report an error. If the intersection is not empty,
then the new parameter interval is equal to the intersection. For the two
roles above, the size intervals [10, 30] and [20, 50] intersect in [20, 30].
It is not always
the most effective way to reconcile parameters but it is general enough
to be useful for all possible cases. How we calculate the parameter
intersection depends on the nature of a particular parameter.
Let's define the following set of rules:
::

    def minsize(minsize_1, minsize_2, maxsize_1, maxsize_2):
        # The merged lower bound is the larger of the two lower bounds.
        result = max(minsize_1, minsize_2)
        if result > min(maxsize_1, maxsize_2):
            raise Exception("Incompatible parameters")
        return result

    def maxsize(maxsize_1, maxsize_2, minsize_1, minsize_2):
        # The merged upper bound is the smaller of the two upper bounds.
        result = min(maxsize_1, maxsize_2)
        if result < max(minsize_1, minsize_2):
            raise Exception("Incompatible parameters")
        return result

    def bestsize(bestsize_1, bestsize_2, minsize, maxsize):
        # Average the recommendations, then clamp to [minsize, maxsize].
        result = (bestsize_1 + bestsize_2) / 2.0
        if result > maxsize:
            return maxsize
        elif result < minsize:
            return minsize
        else:
            return result

    def priority(priority_1, priority_2):
        return max(priority_1, priority_2)

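Applied to the ``Role_1``/``Role_2`` example, interval intersection could look
like the following self-contained sketch. The helper ``merge_lv`` is
hypothetical, handles only ``minsize`` and ``maxsize``, and assumes both fields
are set on both definitions:

```python
def merge_lv(lv_1, lv_2):
    """Merge two definitions of the same logical volume.

    The resulting size interval is the intersection of both intervals;
    an empty intersection means the roles are incompatible.
    """
    minsize = max(lv_1["minsize"], lv_2["minsize"])
    maxsize = min(lv_1["maxsize"], lv_2["maxsize"])
    if minsize > maxsize:
        raise ValueError("Incompatible parameters")
    return dict(lv_1, minsize=minsize, maxsize=maxsize)


role_1_lv = {"type": "lv", "name": "my_favorite_lv",
             "vg_id": "my_favorite_vg", "minsize": 10, "maxsize": 30}
role_2_lv = {"type": "lv", "name": "my_favorite_lv",
             "vg_id": "my_favorite_vg", "minsize": 20, "maxsize": 50}

merged = merge_lv(role_1_lv, role_2_lv)  # minsize 20, maxsize 30
```

A node with both roles assigned would therefore get a single
``my_favorite_lv`` sized somewhere between 20 and 30.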
Alternatives
------------
We could implement the volume management mechanism from scratch and fully
independently from Fuel Agent. But it looks irrational to avoid using existing
code and to ignore a beautiful architectural concept.
Data model impact
-----------------
Fuel Agent object model is going to be changed so as to include dynamic
allocation methods and data.
Volume data in Nailgun are stored as plain json in the Node data model. Since
the Nailgun volume manager will be re-implemented as an extension, these volume
data will be moved into an extension table with a foreign key to the Node.
REST API impact
---------------
That part of REST API which deals with volume data is going to be moved into
volume manager extension.
Upgrade impact
--------------
Since Fuel Agent is installed into the bootstrap ramdisk, nodes which are
booted with this ramdisk must be rebooted to make sure the newest
version of Fuel Agent is available on slave nodes.
Also Fuel Agent package should be updated on the master node because Nailgun
volume manager extension is going to use Fuel Agent modules.
Besides, we need to write a database migration which should create
the new volume manager table and move volume data there.
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
In 7.0 there is no plan to expose the new format to the user.
Performance Impact
------------------
None
Plugin impact
-------------
Volume manager should be implemented as Fuel extension. Other
plugins/extensions which need to modify volume allocation, should use
volume manager extension API.
Other deployer impact
---------------------
If a deployer needs a specific allocation mechanism other than the one available
in Fuel Agent, she just needs to write her own volume manager extension
implementing the corresponding API. But since the Fuel Agent allocation algorithm
is going to be metadata driven, it will likely be possible to avoid changing
the code of Fuel Agent when covering such specific cases.
Developer impact
----------------
None
Infrastructure impact
---------------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
<skalinowski@mirantis.com>
Other contributors:
<vkozhukalov@mirantis.com>
<mkwiek@mirantis.com>
<sbrzeczkowski@mirantis.com>
<eli@mirantis.com>
Work Items
----------
1. Implement Nailgun volume manager extension
2. Implement dynamic volume allocation in the scope of Fuel Agent
3. Use new dynamic volume allocation in volume manager extension
Dependencies
============
None
Testing
=======
After moving the volume manager extension to the new volume allocation format
and algorithm, new system tests need to be added to cover its usage.
Acceptance criteria
-------------------
* Current functionality works as usual with no regressions, except where a
change is described by this spec.
* Volume preservation: ability to reserve partition as untouched while
re-provisioning.
* FS mount options: ability to specify different mount options for
particular partitions.
* Bootable disks: ability to choose which hard drives should contain
the bootloader.
* Flexible partitioning scheme: ability to create various partition
schemes.
* Pluggable partitioning scheme: ability for plugins to create own
partitions without conflicts.
Documentation Impact
====================
The new volume allocation format needs to be described.
References
==========
.. [#discovery] In fact, Fuel Agent currently implements discovery
functionality but only for block devices (hard drives) and it is not
compatible with Nailgun. So, if it is necessary, Fuel Agent is able
to get the information about available hard drives on a node
totally on its own.