Add spec for scaling with the ansible inventory

Change-Id: I0cbc1620904acb149230cd5f295f2a17abd59146
James Slagle 2 years ago

This work is licensed under a Creative Commons Attribution 3.0 Unported
Scaling with the Ansible Inventory
Scaling an existing deployment should be possible by adding new host
definitions directly to the Ansible inventory, and not having to increase the
<Role>Count parameters.
Problem Description
Currently to scale a deployment, a Heat stack update is required. The stack
update reflects the new desired node count of each role, which is then
represented in the generated Ansible inventory. The inventory file is then used
by the config-download process when ansible-playbook is executed to perform the
software configuration on each node.
Updating the Heat stack with the new desired node count has posed some
scaling challenges. Heat creates a set of resources associated with each node.
As the number of nodes in a deployment increases, Heat has more and more
resources to manage.
As the stack size grows, Heat must be tuned through configuration changes or
horizontally scaled with additional engine workers. However, horizontally
scaling Heat workers only helps up to a point, as eventually other service
workers, such as database, messaging, or Keystone worker processes, would need
to be scaled as well. Scaling more and more worker processes consumes
additional physical resources.
Heat performance also begins to degrade as stack size increases. Stack
operations take longer and longer to complete as node count increases, and
often run for many hours, which is usually longer than a typical maintenance
window.
It is also hard to predict what changes Heat will make. Often, no changes are
desired other than to scale out to new nodes. However, unintended template
changes or user errors, such as forgetting to pass environment files, pose
additional and unnecessary risk to the scaling operation.
Proposed Change
The proposed change would allow users to add new node definitions directly to
the Ansible inventory by way of a new Heat parameter, so that services can be
scaled onto those new nodes. No change in the <Role>Count parameters would be
required.
A minimum set of data would be required when adding a new node to the Ansible
inventory. Presently, this includes the TripleO role, and an IP address on each
network that is used by that role.
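As a rough illustration of that minimum data set, the sketch below validates a single node definition. The field names (``role``, ``ips``), the function name, and the role-to-network mapping are assumptions made for this example; the spec does not define a schema.

```python
import ipaddress

# Hypothetical mapping of role -> networks used by that role (not from the spec).
ROLE_NETWORKS = {
    "Compute": ["ctlplane", "internal_api", "tenant", "storage"],
}


def validate_node(name, node, role_networks=ROLE_NETWORKS):
    """Return a list of problems found in one extra node definition."""
    role = node.get("role")
    if role not in role_networks:
        # Only already defined roles can be scaled with this method.
        return [f"{name}: unknown role {role!r}"]
    problems = []
    ips = node.get("ips", {})
    for network in role_networks[role]:
        if network not in ips:
            problems.append(f"{name}: missing IP on network {network!r}")
            continue
        try:
            ipaddress.ip_address(ips[network])
        except ValueError:
            problems.append(f"{name}: invalid IP {ips[network]!r} on {network!r}")
    return problems
```

An empty list means the definition carries a role and one valid address for each network that role uses.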
Only scaling of already defined roles will be possible with this method.
Defining new roles would still require a full Heat stack update that defines
the new role.
Once the new node(s) are added to the inventory, ansible-playbook could be
rerun with the config-download directory to scale the software services out
on to the new nodes.
Because scaling would no longer increase the node count in the Heat stack, any
baremetal provisioning required for the new nodes has to happen outside of
Heat. In that case, this work depends on the nova-less-deploy work:
Once baremetal provisioning is migrated out of Heat with the above work, then
new nodes can be provisioned with those new workflows before adding them
directly to the Ansible inventory.
Since new nodes added directly to the Ansible inventory would still be
consuming IPs from the subnet ranges defined for the overcloud networks,
Neutron needs to be made aware of those assignments so that there are no
overlapping IP addresses. This could be done with a new interface in
tripleo-heat-templates that allows for specifying the extra node inventory
data. The parameter would be called ``ExtraInventoryData``. The templates would
take care of operating on that input and creating the appropriate Neutron ports
to correspond to the IP addresses specified in the data.
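The overlap concern can be illustrated without Neutron. The sketch below checks requested addresses against one subnet range and a set of addresses already in use. It does not call the Neutron API; in the proposal, creating the ports is what would record the assignments. The function and argument names are hypothetical.

```python
import ipaddress


def find_ip_conflicts(cidr, allocated, requested):
    """Return {node: reason} for requested IPs that cannot be used.

    cidr      -- subnet range of one overcloud network, e.g. "172.16.2.0/24"
    allocated -- IPs already in use on that network (what Neutron would know)
    requested -- {node_name: ip} for the nodes being added
    """
    subnet = ipaddress.ip_network(cidr)
    in_use = {ipaddress.ip_address(ip) for ip in allocated}
    conflicts = {}
    for node, ip in requested.items():
        addr = ipaddress.ip_address(ip)
        if addr not in subnet:
            conflicts[node] = f"{ip} is outside {cidr}"
        elif addr in in_use:
            conflicts[node] = f"{ip} is already allocated"
        else:
            # Reserve it so two new nodes cannot claim the same address.
            in_use.add(addr)
    return conflicts
```

A check like this only catches conflicts that are visible in the data it is given, which is why the proposal has the templates create real Neutron ports rather than relying on a side check.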
When tripleo-ansible-inventory is used to generate the inventory, it would
query Heat as it does today, but also layer in the extra inventory data as
specified by ``ExtraInventoryData``. The resulting inventory would be a unified
view of all nodes in the deployment.
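A minimal sketch of that layering step follows. It is not the tripleo-ansible-inventory implementation; the function name and the assumed data shape (role name mapped to a ``hosts`` dictionary) are illustrative only.

```python
import copy


def merge_inventory(heat_inventory, extra_inventory):
    """Layer extra hosts onto a Heat-derived inventory without mutating either.

    Both arguments map a role (group) name to {"hosts": {hostname: hostvars}}.
    """
    unified = copy.deepcopy(heat_inventory)
    for role, group in extra_inventory.items():
        if role not in unified:
            # New roles still require a full Heat stack update.
            raise ValueError(f"role {role!r} is not defined in the Heat stack")
        hosts = unified[role].setdefault("hosts", {})
        for hostname, hostvars in group.get("hosts", {}).items():
            if hostname in hosts:
                raise ValueError(f"host {hostname!r} is already in the inventory")
            hosts[hostname] = copy.deepcopy(hostvars)
    return unified
```

Rejecting unknown roles and duplicate hostnames is a design choice made for this sketch; it mirrors the constraint that only already defined roles can be scaled this way.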
``ExtraInventoryData`` may be a list of files that are consumed with Heat's
get_file function so that the deployer can keep their inventory data organized
by file.
This change is primarily targeted at addressing scaling issues around the
Heat stack operation. Alternative methods include using undercloud minions:
Multi-stack/split-controlplane also addresses the issue somewhat by breaking up
the deployment into smaller and more manageable stacks:
These alternatives are complementary to the solution proposed here, and all of
these solutions can be used together for the greatest benefit.
Direct manipulation of inventory data
Another alternative would be to not make use of any new interface in the
templates such as the previously mentioned ``ExtraInventoryData``. Users could just
update the inventory file manually, or drop inventory files in a specified
location (since Ansible can use a directory as an inventory source).
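As a sketch of the drop-in approach, the helper below writes one host per file into an inventory directory. It writes JSON, on the assumption that the Ansible YAML inventory plugin is enabled and accepts ``.json`` files in its default configuration; the function name, file naming scheme, and data layout are hypothetical.

```python
import json
import os


def write_node_inventory(inventory_dir, role, hostname, hostvars):
    """Write one host into its own file under an inventory directory."""
    os.makedirs(inventory_dir, exist_ok=True)
    path = os.path.join(inventory_dir, f"{hostname}.json")
    # The role is used as the group name, with the host and its variables below it.
    data = {role: {"hosts": {hostname: hostvars}}}
    with open(path, "w") as handle:
        json.dump(data, handle, indent=2, sort_keys=True)
    return path
```

Keeping one file per node makes it easy to add or remove a node without editing a shared inventory file.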
The drawback of this approach is that another tool would be necessary to
create the associated ports in Neutron so that there are no overlapping IP
addresses. Port creation could also be a manual step, although that is prone
to error.
The advantage of this approach is that it would completely eliminate the stack
update operation from scaling. Not having any stack operation is appealing in
some regards, given the potential for forgotten environment files or other
user error (out-of-date templates, etc.).
Security Impact
IP addresses and hostnames would potentially exist in user-managed templates
that hold the value for ``ExtraInventoryData``; however, this is no different
from what is present today.
Upgrade Impact
The upgrade process will need to be aware that not all nodes are represented in
the Heat stack, and some will be represented only in the inventory. This should
not be an issue as long as there is a consistent interface for getting a single
unified inventory, as exists today.
Any changes around creating the unified view of the inventory should be made
within the implementation of that interface (tripleo-ansible-inventory) such
that existing tooling continues to use an inventory that contains all nodes for
a deployment.
Other End User Impact
Users will potentially have to manage additional environment files for the
extra inventory data.
Performance Impact
Performance should be improved during scale out operations.
However, it should be noted that Ansible will face scaling challenges as well.
While this change does not directly introduce those new challenges, it may
expose them more rapidly as it bypasses the Heat scaling challenges.
For example, simply adding hundreds or thousands of new nodes directly to the
Ansible inventory is not expected to guarantee that the scaling operation
succeeds. It would likely expose new scaling challenges in other tooling, such
as the playbook and role tasks or Ansible itself.
Other Deployer Impact
Since this proposal is meant to align with the nova-less-deploy work, all nodes
(whether they are known to Heat or not) would be unprovisioned if the
deployment is deleted.
If using pre-provisioned nodes, then there is no change in behavior in that
deleting the Heat stack does not actually "undeploy" any software. This
proposal does not change that behavior.
Developer Impact
Developers could more quickly test scaling by bypassing the Heat stack update
completely if desired, or using the ``ExtraInventoryData`` interface.
Primary assignee:
James Slagle
Work Items
* Add new parameter ``ExtraInventoryData``
* Add Heat processing of ``ExtraInventoryData``

  * create Neutron ports
  * add stack outputs

* Update tripleo-ansible-inventory to consume from added stack outputs
* Update HostsEntry to be generic
* Depends on nova-less-deploy work for baremetal provisioning outside of Heat.
  If using pre-provisioned nodes, does not depend on nova-less-deploy.
* All deployment configurations coming out of Heat need to be generic per role.
  Most of this work was completed in Train; however, this should be reviewed.
  For example, the HostsEntry data is still static and Heat is calculating the
  node list. This data needs to be moved to an Ansible template.
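To illustrate why a static, Heat-calculated node list is a problem, the sketch below derives hosts entries from the unified inventory instead, so that nodes added outside the stack are picked up automatically. It uses plain Python rather than an Ansible template, and the ``ctlplane_ip`` hostvar, the domain, and the function name are hypothetical.

```python
def render_hosts_entries(inventory, ip_var="ctlplane_ip", domain="localdomain"):
    """Build /etc/hosts style lines for every host in every role group."""
    lines = []
    for role in sorted(inventory):
        hosts = inventory[role].get("hosts", {})
        for hostname, hostvars in sorted(hosts.items()):
            ip = hostvars.get(ip_var)
            if ip is None:
                continue  # no address recorded for this host; skip it
            lines.append(f"{ip} {hostname}.{domain} {hostname}")
    return "\n".join(lines)
```

Because the entries are computed from whatever the inventory contains at run time, nothing in this step depends on the node count known to Heat.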
Scaling is not currently tested in CI; however, perhaps it could be with this
change.
Manual test plans and other test automation would need to be updated to also
test scaling with ``ExtraInventoryData``.
Documentation Impact
Documentation needs to be added for ``ExtraInventoryData``.
The feature should also be fully explained, since users and deployers need to
be aware that nodes may or may not be represented in the Heat stack.