Add spec for scaling with the ansible inventory

Change-Id: I0cbc1620904acb149230cd5f295f2a17abd59146
2019-10-16 20:04:58 -04:00 · 2019-10-16 20:04:58 -04:00 · 91ccca4058
parent 88b4a4a203
commit 91ccca4058
1 changed files with 251 additions and 0 deletions
--- a/specs/ussuri/scaling-with-ansible-inventory.rst
+++ b/specs/ussuri/scaling-with-ansible-inventory.rst
@@ -0,0 +1,251 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==================================
+Scaling with the Ansible Inventory
+==================================
+
+https://blueprints.launchpad.net/tripleo/scaling-with-Ansible-inventory
+
+Scaling an existing deployment should be possible by adding new host
+definitions directly to the Ansible inventory, and not having to increase the
+<Role>Count parameters.
+
+Problem Description
+===================
+
+Currently, to scale a deployment, a Heat stack update is required. The stack
+update reflects the new desired node count of each role, which is then
+represented in the generated Ansible inventory. The inventory file is then used
+by the config-download process when ansible-playbook is executed to perform the
+software configuration on each node.
+
+Updating the Heat stack with the new desired node count has posed some
+scaling challenges. Heat creates a set of resources associated with each node.
+As the number of nodes in a deployment increases, Heat has more and more
+resources to manage.
+
+As the stack size grows, Heat must either be tuned through its configuration or
+scaled horizontally with additional engine workers. However, scaling Heat
+workers horizontally only helps up to a point, because eventually other
+services, such as the database, messaging, or Keystone worker processes, need
+to be scaled as well. Continually scaling worker processes consumes additional
+physical resources.
+
+Heat performance also degrades as stack size increases. Stack operations take
+longer to complete as the node count grows, and often run for many hours,
+which is usually longer than a typical maintenance window.
+
+It is also hard to predict what changes Heat will make. Often, no changes are
+desired other than to scale out to new nodes. However, unintended template
+changes or user errors, such as forgetting to pass environment files, pose
+additional unnecessary risk to the scaling operation.
+
+
+Proposed Change
+===============
+
+Overview
+--------
+
+The proposed change would allow users to add new node definitions directly to
+the Ansible inventory by way of a new Heat parameter, so that services can be
+scaled onto those new nodes. No change to the <Role>Count parameters would be
+required.
+
+A minimum set of data would be required when adding a new node to the Ansible
+inventory. Presently, this includes the TripleO role, and an IP address on each
+network that is used by that role.
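+
+The minimum data described above can be sketched as a small validation
+helper. This is only an illustration: the spec does not define a schema for
+the extra inventory data, so the ``role`` and ``ips`` keys, the
+``role_networks`` mapping, and the function name are all assumptions made for
+this example.
+
```python
import ipaddress


def validate_extra_node(node, role_networks):
    """Return a list of problems found in one extra node definition.

    `node` uses a hypothetical shape: {"role": str, "ips": {network: ip}}.
    `role_networks` maps each already defined role to the networks it uses.
    """
    problems = []
    role = node.get("role")
    if role not in role_networks:
        # Only already defined roles can be scaled with this method.
        problems.append(f"unknown or missing role: {role!r}")
        return problems

    ips = node.get("ips", {})
    for network in role_networks[role]:
        if network not in ips:
            problems.append(f"missing IP address on network {network!r}")
            continue
        try:
            ipaddress.ip_address(ips[network])
        except ValueError:
            problems.append(
                f"invalid IP address on network {network!r}: {ips[network]!r}"
            )
    return problems
```
+
+An empty list means the definition carries a known role and one valid IP
+address for every network that role uses.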
+
+Only scaling of already defined roles will be possible with this method.
+Defining new roles would still require a full Heat stack update that defines
+the new role.
+
+Once the new node(s) are added to the inventory, ansible-playbook could be
+rerun with the config-download directory to scale the software services out
+on to the new nodes.
+
+Because scaling will no longer require increasing the node count in a Heat
+stack operation, any baremetal provisioning required for the new nodes depends
+on the nova-less-deploy work:
+
+https://specs.openstack.org/openstack/tripleo-specs/specs/stein/nova-less-deploy.html
+
+Once baremetal provisioning is migrated out of Heat with the above work, then
+new nodes can be provisioned with those new workflows before adding them
+directly to the Ansible inventory.
+
+Since new nodes added directly to the Ansible inventory would still be
+consuming IPs from the subnet ranges defined for the overcloud networks,
+Neutron needs to be made aware of those assignments so that there are no
+overlapping IP addresses. This could be done with a new interface in
+tripleo-heat-templates that allows for specifying the extra node inventory
+data. The parameter would be called ``ExtraInventoryData``. The templates would
+take care of operating on that input and creating the appropriate Neutron ports
+to correspond to the IP addresses specified in the data.
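+
+The overlap concern can be illustrated with a short sketch that checks
+requested addresses against a subnet and a set of addresses that are already
+taken. It is not how the templates or Neutron would implement this: the
+function name and the plain lists used as input are assumptions for
+illustration, and in a real deployment the allocations would come from
+Neutron ports.
+
```python
import ipaddress


def find_ip_conflicts(requested, subnet_cidr, allocated):
    """Split requested IPs into (usable, conflicts) for one subnet.

    `allocated` stands in for addresses already held by existing ports.
    """
    subnet = ipaddress.ip_network(subnet_cidr)
    taken = {ipaddress.ip_address(ip) for ip in allocated}
    usable, conflicts = [], []
    for raw in requested:
        ip = ipaddress.ip_address(raw)
        if ip not in subnet:
            conflicts.append((raw, "outside subnet"))
        elif ip in taken:
            conflicts.append((raw, "already allocated"))
        else:
            usable.append(raw)
            # Record the address so a duplicate later in the same
            # request is also reported as a conflict.
            taken.add(ip)
    return usable, conflicts
```
+
+The sketch does not exclude the network or broadcast address and knows
+nothing about allocation pools, which is part of why creating real Neutron
+ports is the safer way to reserve the addresses.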
+
+When tripleo-ansible-inventory is used to generate the inventory, it would
+query Heat as it does today, but also layer in the extra inventory data as
+specified by ``ExtraInventoryData``. The resulting inventory would be a unified
+view of all nodes in the deployment.
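+
+A minimal sketch of that layering step follows. It assumes the inventory is
+a mapping of role groups to ``hosts`` dictionaries, as in Ansible's YAML
+inventory layout, and that each extra node carries hypothetical ``role``,
+``hostname``, and ``ansible_host`` keys. The real tripleo-ansible-inventory
+data structures may differ, and ``merge_inventory`` is not an existing
+function.
+
```python
import copy


def merge_inventory(heat_inventory, extra_nodes):
    """Return a new inventory with extra nodes layered onto Heat's view."""
    # Work on a copy so the Heat-derived inventory is left untouched.
    merged = copy.deepcopy(heat_inventory)
    for node in extra_nodes:
        role = node["role"]
        if role not in merged:
            # New roles still require a full Heat stack update.
            raise ValueError(f"role {role!r} is not defined in the Heat stack")
        hosts = merged[role].setdefault("hosts", {})
        name = node["hostname"]
        if name in hosts:
            raise ValueError(f"host {name!r} already exists in role {role!r}")
        hosts[name] = {"ansible_host": node["ansible_host"]}
    return merged
```
+
+Rejecting unknown roles and duplicate hostnames keeps the merged result
+consistent with the constraint that only already defined roles can be scaled
+this way.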
+
+``ExtraInventoryData`` may be a list of files that are consumed with Heat's
+get_file function so that the deployer can keep their inventory data organized
+by file.
+
+Alternatives
+------------
+
+This change is primarily targeted at addressing scaling issues around the
+Heat stack operation. Alternative methods include using undercloud minions:
+
+https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/undercloud_minion.html
+
+Multi-stack/split-controlplane also addresses the issue somewhat by breaking up
+the deployment into smaller and more manageable stacks:
+
+https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/distributed_compute_node.html
+
+These alternatives are complementary to the proposed solution here, and all of
+these solutions can be used together for the greatest benefits.
+
+Direct manipulation of inventory data
+_____________________________________
+
+Another alternative would be to not make use of any new interface in the
+templates such as the previously mentioned ``ExtraInventoryData``. Users could just
+update the inventory file manually, or drop inventory files in a specified
+location (since Ansible can use a directory as an inventory source).
+
+The drawback to this approach is that another tool would be necessary to
+create the associated ports in Neutron so that there are no overlapping IP
+addresses. Creating the ports could also be a manual step, although that is
+prone to error.
+
+The advantage of this approach is that it would completely eliminate the stack
+update operation as part of scaling. Not having any stack operation is
+appealing in some regards, given the potential to forget environment files or
+make other user errors (out-of-date templates, etc.).
+
+Security Impact
+---------------
+
+IP addresses and hostnames would potentially exist in user-managed templates
+that set the value of ``ExtraInventoryData``; however, this is no different
+from what is present today.
+
+Upgrade Impact
+--------------
+
+The upgrade process will need to be aware that not all nodes are represented in
+the Heat stack, and some will be represented only in the inventory. This should
+not be an issue as long as there is a consistent interface to get a single
+unified inventory, as exists today.
+
+Any changes around creating the unified view of the inventory should be made
+within the implementation of that interface (tripleo-ansible-inventory) such
+that existing tooling continues to use an inventory that contains all nodes for
+a deployment.
+
+Other End User Impact
+---------------------
+
+Users will potentially have to manage additional environment files for the
+extra inventory data.
+
+Performance Impact
+------------------
+
+Performance should be improved during scale out operations.
+
+However, it should be noted that Ansible will face scaling challenges as well.
+While this change does not directly introduce those new challenges, it may
+expose them more rapidly as it bypasses the Heat scaling challenges.
+
+For example, simply adding hundreds or thousands of new nodes directly to the
+Ansible inventory is not expected to guarantee that the scaling operation
+succeeds. It would likely expose new scaling challenges in other tooling, such
+as the playbook and role tasks or Ansible itself.
+
+Other Deployer Impact
+---------------------
+
+Since this proposal is meant to align with the nova-less-deploy work, all nodes
+(whether they are known to Heat or not) would be unprovisioned if the
+deployment is deleted.
+
+If using pre-provisioned nodes, deleting the Heat stack does not actually
+"undeploy" any software. This proposal does not change that behavior.
+
+Developer Impact
+----------------
+
+Developers could more quickly test scaling by bypassing the Heat stack update
+completely if desired, or using the ``ExtraInventoryData`` interface.
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  James Slagle <jslagle@redhat.com>
+
+Work Items
+----------
+
+* Add new parameter ``ExtraInventoryData``
+
+* Add Heat processing of ``ExtraInventoryData``
+
+  * create Neutron ports
+
+  * add stack outputs
+
+* Update tripleo-ansible-inventory to consume from added stack outputs
+
+* Update HostsEntry to be generic
+
+Dependencies
+============
+
+* Depends on nova-less-deploy work for baremetal provisioning outside of Heat.
+  Deployments using pre-provisioned nodes do not depend on nova-less-deploy.
+
+* All deployment configurations coming out of Heat need to be generic per role.
+  Most of this work was completed in Train; however, it should be reviewed.
+  For example, the HostsEntry data is still static and Heat is calculating the
+  node list. This data needs to be moved to an Ansible template.
+
+
+Testing
+=======
+
+Scaling is not currently tested in CI; however, this change might make that
+possible.
+
+Manual test plans and other test automation would need to be updated to also
+test scaling with ``ExtraInventoryData``.
+
+
+Documentation Impact
+====================
+
+Documentation needs to be added for ``ExtraInventoryData``.
+
+The feature should also be fully explained, because users and deployers need
+to be aware that nodes may or may not be represented in the Heat stack.
+
+References
+==========
+
+* https://specs.openstack.org/openstack/tripleo-specs/specs/stein/nova-less-deploy.html
+* https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/undercloud_minion.html
+* https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/distributed_compute_node.html