TripleO.Next - Task-Core + Directord

Add spec to introduce task-core and directord into the tripleo stack.

Change-Id: I51f340d3250ee3b03be48b962efbc001d5e7ddb7
Signed-off-by: Kevin Carter <kecarter@redhat.com>
Kevin Carter 2021-06-28 17:47:21 -05:00
parent 00b9f10ce5
commit 975c7280c6
1 changed files with 526 additions and 0 deletions

@ -0,0 +1,526 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===========================================================
Unifying TripleO Orchestration with Task-Core and Directord
===========================================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+spec/unified-orchestration

The purpose of this spec is to introduce core concepts around Task-Core and
Directord, explain their benefits, and cover why the project should consider
using them.
TripleO has long been established as an enterprise deployment solution for
OpenStack. While TripleO has been built to meet the needs of operators, it has
never been built to be fast or concise. TripleO maintains many layers of
abstraction which are effectively infinitely configurable, all at the expense
of time. Over the past few cycles, the TripleO core team has been on a mission
to simplify the stack, focusing on removing unnecessary services and minimizing
or marginalizing other over-engineered components. These efforts have come to a
head and are now approaching the point of diminishing returns. To further ease
the time and complexity burdens within TripleO, the project must look deeper to
achieve the next level of improvement; this is where Task-Core and Directord
come in.

Task-Core_:
  A dependency management and inventory graph solution which allows operators
  to define tasks in simple terms with robust dominion over a given
  environment. Declarative dependencies will ensure that if a container/config
  is changed, only the necessary services are reloaded/restarted. Task-Core
  provides access to the right tools for a given job with provenance, allowing
  operators and developers to define outcomes confidently.

Directord_:
  A deployment framework built to manage the data center life cycle, which is
  both modular and fast. Directord focuses on consistently maintaining
  deployment expectations with a near real-time level of performance_ at almost
  any scale.

Problem Description
===================
TripleO presently uses a collection of bespoke tools to achieve its
orchestration goals. While the TripleO tool suite has worked and is likely to
continue working should maintainers bear an increased burden, recent
revelations around the apparatuses provide an inflection point. Because of the
impending perfect storm spanning almost everything in the TripleO stack, the
project is presented with a choice: stay the course or confidently course
correct.

Staying the course:
  The TripleO project increases the core team size and begins planning for
  long-term maintenance. The project focuses on developing individual
  maintainers for at-risk components (Ansible, Puppet, Heat); efforts to
  further simplify TripleO ostensibly come to an end. While new deployment
  models may be developed, reducing time complexity and improving scalability
  will no longer be a focus of the core team. The core team will ensure the
  TripleO project remains practical_ for the foreseeable future.

Course correcting:
  Begin the systematic replacement of legacy core components with more tailored
  solutions to meet the project's actual needs. Tailoring the stack will
  simplify the maintenance required across life cycles. The corrective action
  necessary to provide TripleO with a quantum leap will be invasive; having
  said that, once complete, TripleO will exceed operator expectations and meet
  future scale requirements, ensuring platform sovereignty, all without
  breaking the user interface.

Upstream changes within applications like Ansible, where it is fundamentally
moving away from the TripleO use case, force TripleO maintainers to take on
more ownership for no additional benefit. The TripleO use case is actively
working against the future direction of Ansible. Secondly, while Puppet has
remained stable over the years, the number of maintainers for Puppet modules
within TripleO has reached an all-time low; these modules represent a
significant amount of complexity in the stack and are becoming more of a risk
to the project by the day. The
cost of maintaining systems like Ansible and Puppet, and all of their
corresponding overlapping functionality, especially as the project looks to
support future OS versions, has a high likelihood of causing a significant
disruption to the TripleO project. When an infinitely configurable interface
powered by Heat is compounded by tightly coupled integrations across a set
of increasingly brittle services, TripleO faces an `existential crisis`_;
the TripleO project needs to maintain less across the framework.
Presently, TripleO will see its objective end without a course correction as
there's no longer any meaningful performance, scale, or configurability that
can be squeezed out of the current system. Additionally, as the project veers
further off the path of leveraging supportable community tools, TripleO will
see the time to deliver indefinitely extend, as the project's value proposition
invariably declines. To stem the tide, TripleO must greatly simplify the
framework, enable developers to build intelligent tasks, and provide meaningful
performance enhancements that scale to meet operators' expectations. If TripleO
can capitalize on this moment, it will improve the quality of life for day one
deployers and day two operations and upgrades.
Proposed Change
===============
Dramatically enhance the TripleO developer, operator, and user experience by
unifying the stack with tools built for TripleO by TripleO.
In some ways, the move toward Task-Core and Directord creates a
General-Problem_, as it's proposing the replacement of many bespoke tools, which
are well known, with two new homegrown ones. Be that as it may, much attention
has been given to the user experience, addressing many well-known pain points
commonly associated with TripleO environments. Task-Core and Directord aim to
remove problems at scale, drop the development barrier to entry, and open the
flood gates of innovation. Teams surrounding TripleO will no longer worry about
execution times and convoluted step processes. TripleO Deployers and developers
of tomorrow will be empowered to run operations within an environment without
dedicating weeks to a risky or otherwise error-prone process.
Overview
--------
This specification consists of two parts that work together to achieve the
project goals.

Task-Core:
  Task-Core builds upon native OpenStack libraries to create a dependency graph
  and executes a compiled solution. With Task-Core, TripleO will be able to
  define a deployment instead of brute-forcing one. While powerful, Task-Core
  keeps development easy and consistent, reducing the time to deliver and
  allowing developers to focus on their actual deliverable, not the
  orchestration details. Task-Core also guarantees reproducible builds, runtime
  awareness, and the ability to resume when issues are encountered.

* Templates containing step-logic and ad-hoc tasks will be refactored into
  Task-Core definitions.
* Each component can have its own Task-Core purpose, providing resources and
  allowing other resources to depend on it.
* The invocation of Task-Core will be baked into the TripleO client, making its
  existence transparent to operators and deployers.
* Advanced users will be able to use Task-Core to meet their environment
  expectations without fully understanding the deployment nuance of multiple
  bespoke systems.
* Task-Core employs a schema validation system which ensures input is
  well-formed before anything runs. While schema validation won't guarantee an
  operational deployment, it will eradicate issues caused by malformed user
  input, providing early feedback to deployers so they're able to make
  corrections before running longer operations.

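
The dependency-graph idea described above can be pictured with a small,
self-contained sketch. This is not Task-Core code: it only assumes that each
service declares what it requires, and it uses the Python standard library
(``graphlib``, Python 3.9+) to derive an order in which every requirement runs
before the service that needs it. The service names and the ``DEPENDENCIES``
mapping are hypothetical.

.. code-block:: python

    from graphlib import TopologicalSorter

    # Hypothetical service graph: each key depends on the names in its set.
    # This is an illustration only, not the Task-Core data model.
    DEPENDENCIES = {
        "os-setup": set(),
        "mysql": {"os-setup"},
        "rabbitmq": {"os-setup"},
        "keystone": {"mysql", "rabbitmq"},
    }


    def execution_order(dependencies):
        """Return one valid order in which every dependency runs first."""
        # static_order() raises graphlib.CycleError if the graph has a cycle.
        return list(TopologicalSorter(dependencies).static_order())


    order = execution_order(DEPENDENCIES)
    print(order)

Any order that satisfies the declared dependencies is acceptable, so the
relative position of independent services (here ``mysql`` and ``rabbitmq``)
is not significant.
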

Directord:
  Directord provides a modular execution platform that is environmentally
  aware. Because Directord leverages messaging, the platform can guarantee
  availability, transport, and performance. Directord has been built from the
  ground up, making use of industry-standard messaging protocols which ensure
  pseudo-real-time performance and limited resource utilization. The built-in
  DSL provides most of what the TripleO project will require out of the box.
  Because no solution is perfect, Directord utilizes a plugin system that will
  allow developers to create new functionality without compromise or needing to
  modify core components. Additionally, plugins are handled the same, allowing
  Directord to ensure the delivery and execution performance remain consistent.

* Directord is a single application that is ideally suited for containers while
  also providing native hooks into systems; this allows Directord to operate in
  heterogeneous environments. Because Directord is a simplified application,
  operators can choose how they want to run it and are not forced into a
  one-size-fits-all solution.
* Directord is platform-agnostic, allowing it to run across systems, versions,
  and network topologies while simultaneously guaranteeing it maintains the
  smallest possible footprint.
* Directord is built upon messaging, giving it the unique ability to span
  network topologies with varying latencies; messaging protocols compensate for
  high latency environments and will finally give TripleO the ability to address
  multiple data-centers and fully embrace "the edge."

With Task-Core and Directord, TripleO will take a quantum leap in performance
and configurability. TripleO will no longer force developers and deployers to
run massive single-use systems to meet deployment goals. TripleO will have an
intelligent dependency graph that is both easy to understand and extend.
TripleO will now be environmentally aware, making it possible to run day two
operations quickly and efficiently. TripleO will better fulfill its life
cycle management through the use of cluster-aware orchestration. Finally,
TripleO will dramatically shrink its maintenance burden by eliminating many
bespoke systems running in unique and unsupported ways.
Alternatives
------------
The TripleO core team grows and embraces the maintenance burden of the bespoke
`legacy tooling`_ currently responsible for orchestration. Additionally, the
TripleO project begins documenting the scale limitations and the boundaries
that will never be addressed due to these limitations. Finally, TripleO
effectively ends the multi-cycle simplification efforts and shifts focus to the
required maintenance to maintain functional expectations long term.
Security Impact
---------------
While Task-Core and Directord are two new attack surfaces, their implementation
will eventually remove the entirety of services like Ansible and Puppet, which
are considerably more extensive in scope. A new Security assessment will need
to be performed to ensure the tooling exceeds the standard already set.
That said, steps have already been taken to ensure that systems are FIPS_
compatible, ensuring TripleO aims for a higher standard of operation from day
one.
Upgrade Impact
--------------
Upgrades will hopefully be impacted in a very positive way. With the
introduction of Task-Core, upgrade tasks will use well-defined dependencies and
job tailored actions. Therefore, upgrade jobs should be much more efficient,
easier to understand, and effectively more straightforward; all of which make
execution inherently faster. At present there's no possible way for TripleO to
meet the expectation of being able to perform upgrade/update tasks rapidly; in
the future, should this specification be implemented, TripleO will address
updates and upgrades efficiently, with the aim to rein in maintenance windows
so that TripleO is no longer synonymous with operations that take exorbitant
amounts of time.
The introduction of Directord will necessitate a rewrite of much of the
underlying functionality; however, upgrade tasks should be easily ported into
the Directord orchestrations and will allow TripleO to begin writing upgrades
that are based on the needs of a job, and allow us to massively simplify the
task definitions.
Both Task-Core and Directord greatly improve the quality of life for operators
and developers when considering upgrades and roll back operations. The TripleO
project will finally realize roll-forward/backward capabilities on a
per-application basis in a time conscious way. No longer will a failed
operation result in cluster wide instability and obscurity. When planning
activities the Task-Core dependency graph will ensure only the actions required
are included, without duplication, or forcing deployers into multi-day
maintenance scenarios. With Directord, operations are easily written and
transparently executed. The combination of Task-Core and Directord will empower
updates and upgrades in ways never thought possible.
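
One way to picture "only the actions required are included, without
duplication" is as a reachability query over the dependency graph: when a
resource changes, only that resource and whatever transitively depends on it
need attention. The sketch below is illustrative and is not how Task-Core is
implemented; the ``DEPENDS_ON`` mapping, the resource names, and the
``affected_by`` helper are all hypothetical.

.. code-block:: python

    from collections import deque

    # Hypothetical graph: each key lists the resources it depends on.
    DEPENDS_ON = {
        "mysql-config": [],
        "mysql": ["mysql-config"],
        "rabbitmq": [],
        "keystone": ["mysql", "rabbitmq"],
        "nova-api": ["keystone", "rabbitmq"],
    }


    def affected_by(changed, depends_on):
        """Return the changed resource plus all of its transitive dependents."""
        # Invert the graph so we can walk from a resource to its dependents.
        dependents = {name: [] for name in depends_on}
        for name, requirements in depends_on.items():
            for requirement in requirements:
                dependents[requirement].append(name)

        seen = {changed}
        pending = deque([changed])
        while pending:
            current = pending.popleft()
            for dependent in dependents[current]:
                if dependent not in seen:  # each resource is planned only once
                    seen.add(dependent)
                    pending.append(dependent)
        return seen


    print(sorted(affected_by("mysql-config", DEPENDS_ON)))

In this made-up graph a change to ``mysql-config`` touches ``mysql``,
``keystone``, and ``nova-api`` but leaves ``rabbitmq`` alone, which is the
behaviour the paragraph above describes.
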
Other End User Impact
---------------------
When following the `happy path`_, the end-user, deployers, and operators will
not interact with this change. The user interface will effectively remain the
same. If an operator wishes to leverage the advanced capabilities of either
Task-Core or Directord, the tooling will be documented and at their disposal.
It should be noted that there's a change in deployment architecture in that
Directord follows a server/client model, albeit an ephemeral one. This change
aims to be fully transparent; however, it is something that end users and
deployers will need to be aware of.
Performance Impact
------------------
This specification, if implemented, will have a massive impact on performance.
With Directord, the TripleO project will enjoy near-realtime execution without
compromise.

* Performance_ analysis has been done comparing configurability and runtime of
  Directord vs. Ansible, the TripleO default orchestration tool. This analysis
  highlights some of the performance gains this specification will provide;
  initial testing suggests that Task-Core and Directord are more than 10x
  faster than our current tool chain, representing a potential 90% time savings
  when executing a comparable workload.
* One of the goals of this specification is to remove impediments in the time
  to work. Deployers should not be spending exorbitant time waiting for tools
  to do work; in some cases, they wait longer for a worker to become available
  than it would take to perform the task manually.
* Deployers will no longer be required to run a massive server for a
  medium-scale deployment. Regardless of size, the memory footprint and compute
  cores needed to execute a deployment will be significantly reduced.

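
The relationship between the quoted speedup and the quoted time savings is
simple arithmetic. Taking the speedup factor as :math:`s = 10` (the lower bound
suggested by the initial testing), the fraction of runtime saved is:

.. math::

   1 - \frac{1}{s} = 1 - \frac{1}{10} = 0.9

A 10x speedup therefore corresponds to the 90% figure. As a purely
hypothetical illustration, a workload that takes 60 minutes today would take
about 6 minutes.
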
Other Deployer Impact
---------------------
Deployers are the primary focus of this specification, and the impact to them
could be positively huge. The time savings alone represents a massive quality
of life improvement. The ability to configure deployments and debug problems is
an unexpected bonus. If TripleO deployers are also considered developers, the
ease of implementing new services will be a welcomed addition. All that said,
both Task-Core and Directord represent an unknown factor; as such, they are
**not** battle-tested and will create uncertainty in an otherwise "stable_"
project.
Implementing both Task-Core and Directord promises a better tomorrow by
fulfilling resolutions derived from the past. Extensive testing has been done;
all known use-cases, from system-level configuration to container pod
orchestration, have been covered, and automated tests have been created to
ensure nothing breaks unexpectedly. Additionally, for the first time, these
projects have expectations on performance, with tests backing up those claims,
even at a large scale. This proposal aims to remove a mountain of technical
debt while doing its best to create as little new debt as possible, all under
the lens of improving the lives of deployers.
Should TripleO adopt Task-Core and Directord, new cloud topologies will open
to deployers. At present, TripleO assumes SSH access between the Undercloud and
Overcloud is always present. Additionally, TripleO believes the infrastructure
is relatively static, making day two operations risky and potentially painful.
Task-Core will reduce the computational burden when crafting action plans, and
Directord will ensure actions are always performed against the functional
hosts.
Another improvement this specification will enhance is in the area of vendor
integrations. Vendors will finally be able to provide meaningful task
definitions which leverage an intelligent inventory and dependency system. No
longer will TripleO require vendors to have in-depth knowledge of every
deployment detail, even those outside of the scope of their deliverable. By
easing the job definitions, simplifying the development process, and speeding
up the execution of tasks, deployers will finally be able to develop solutions
and test them with confidence, without needing to spend months embedding
resources into TripleO and committing to huge capital expenditures associated
with a minimally functional environment. Test clouds are still highly
recommended sources of information; however, system requirements on the
Undercloud will be reduced, meaning the cost of running test environments, in
terms of both hardware and time, will be significantly lowered.
Developer Impact
----------------
Task-Core provides access to the right tool when required, meaning the
implementation of Task-Core will not adversely impact developers as they can
presently write code in whatever format they want; Ansible, Puppet, and
Directord are all perfectly viable options. Developers will need to change
their focus on tasks and ensure their jobs use the new graphing capabilities.
Because of the built-in dependency graph, the implementation of Task-Core
should be a welcomed one, without much in the way of negative developer impact.
One hugely positive impact on developers can be found in the Task-Core
interface validation_. Task-Core will validate the input scheme_ making the
framework more intelligent, thereby removing errors caused by the "free-form"
input and correctly setting task expectations.
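
The value of validating input before execution can be shown with a minimal
sketch. It is not the Task-Core schema or validator (the linked repository
holds the real one); the ``TASK_SCHEMA`` keys, the ``validate_task`` helper,
and the sample tasks are assumptions made for illustration. The point is that
a misspelled or mistyped field is reported up front instead of surfacing deep
into a long-running operation.

.. code-block:: python

    # Hypothetical schema: required keys and the type each value must have.
    TASK_SCHEMA = {
        "id": str,
        "driver": str,
        "requires": list,
    }


    def validate_task(task, schema=TASK_SCHEMA):
        """Return a list of readable errors; an empty list means valid."""
        errors = []
        for key, expected_type in schema.items():
            if key not in task:
                errors.append(f"missing required key: {key}")
            elif not isinstance(task[key], expected_type):
                errors.append(
                    f"{key}: expected {expected_type.__name__}, "
                    f"got {type(task[key]).__name__}"
                )
        # Reject free-form keys so typos are caught instead of ignored.
        for key in task:
            if key not in schema:
                errors.append(f"unknown key: {key}")
        return errors


    good = {"id": "keystone-setup", "driver": "directord", "requires": ["mysql"]}
    bad = {"id": "keystone-setup", "requires": "mysql", "drvier": "directord"}

    print(validate_task(good))
    for error in validate_task(bad):
        print(error)

The second task fails for three reasons (a missing ``driver``, a string where
a list is expected, and the misspelled ``drvier`` key), all of which are
reported without running anything.
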
To fully realize the benefits of this specification, Ansible tasks will need to
be refactored into the Task-Core schema. While Task-Core can run Ansible and
Directord has a plugin system which easily allows developers to port legacy
modules into Directord plugins, there will be a developer impact as the TripleO
development methodology will change. It's fair to say that the potential
developer impact will be huge, yet the shift isn't monumental. Much of the
Ansible presently in TripleO is shell-oriented and, as such, is easily
portable; as stated, compatibility layers exist, allowing the TripleO project
to make the required shift gradually. That said, once the Ansible tasks are
ported, the time saved in execution will be massive; this is on top of the fact
that TripleO will no longer be plagued with errors in day two operations
resulting from transient inventory.

Example `Task-Core and Directord implementation for Keystone`_:
  While this implementation example is fairly basic, it results in a
  functional Keystone environment in roughly 5 minutes and includes
  services like MySQL, RabbitMQ, and Keystone, as well as ensuring that the
  operating system is set up and configured for a cloud execution environment.
  The most powerful aspect of this example is the inclusion of the graph
  dependency system, which will allow us to easily externalize services, such
  as in the case where deployers wish to offload applications into
  environments like OKD.

The implementation of Task-Core and Directord will not change the user
interface when following a `happy path`_; however, it will allow developers to
bridge the TripleO to OKD gap more effectively. As mentioned, Directord is
container-native. Images for Directord already exist on the Quay, Docker Hub,
and GitHub registries; all of the appropriate metadata is available to support an
OKD environment, and tests have been implemented to ensure Directord is
functional from within pod environments. With Directord's ability to
automagically support heterogeneous infrastructure, TripleO developers and
deployers will now be able to implement solutions bridging container-native and
physical infrastructure without relying on fragile interfaces or legacy
transport models.

* The use of advanced messaging protocols means TripleO will efficiently
  address deployments in local data centers or at the edge without transport
  stress.
* The Directord server and storage can be easily offloaded, making it possible
  for the TripleO Client to be executed from simple environments without access
  to the overcloud network; imagine running a massive deployment from a laptop.
* TripleO, through the implementation of Task-Core and Directord, will finally
  be able to compartmentalize systems.

Implementation
==============
In terms of essential TripleO integration, most of the work will occur within
the tripleoclient_, with the following new workflow.

Execution Workflow::

    ┌────┐   ┌─────────────┐   ┌────┐   ┌─────────┐   ┌─────────┬──────┐   ┌╌╌╌╌╌╌╌╌╌┐
    │USER├──►│TripleOclient├──►│Heat├──►│Task-Core├──►│Directord│Server├──►╎ Network ╎
    └────┘   └─────────────┘   └────┘   └─────────┘   └─────────┴──────┘   └╌╌╌╌╌╌╌╌╌┘
                    ▲                                             ▲             ▲
                    │                       ┌─────────┬───────┐   │             │
                    └──────────────────────►│Directord│Storage│◄──┘             │
                                            └─────────┴───────┘                 │
                                                                                │
                                                      ┌─────────┬──────┐        │
                                                      │Directord│Client│◄───────┘
                                                      └─────────┴──────┘


* Directord|Server - Task executor connecting to client.
* Directord|Client - Client program running on remote hosts connecting back to
  the Directord|Server.
* Directord|Storage - An optional component. When storage is not externalized,
  Directord maintains the runtime storage internally; in this configuration
  Directord is ephemeral.

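
The three roles listed above can be modelled in a few lines to show how they
relate. This is a deliberately simplified, in-process sketch and not
Directord's implementation or API: threads stand in for remote hosts,
``queue.Queue`` stands in for the messaging transport, and a plain dictionary
stands in for storage. The ``Server``, ``Client``, ``register``, and
``run_job`` names are hypothetical.

.. code-block:: python

    import queue
    import threading


    class Client(threading.Thread):
        """Stand-in for a remote host: runs jobs and reports results back."""

        def __init__(self, name, results):
            super().__init__(name=name)
            self.jobs = queue.Queue()  # stand-in for the server -> client link
            self.results = results     # stand-in for the client -> server link

        def run(self):
            while True:
                job = self.jobs.get()
                if job is None:  # sentinel: no more work
                    break
                job_id, func = job
                self.results.put((job_id, self.name, func()))


    class Server:
        """Stand-in for the task executor that dispatches jobs to clients."""

        def __init__(self, storage=None):
            # Without external storage, results live in memory only, so they
            # vanish with the server; this mirrors the "ephemeral" mode.
            self.storage = {} if storage is None else storage
            self.results = queue.Queue()
            self.clients = {}

        def register(self, name):
            client = Client(name, self.results)
            self.clients[name] = client
            client.start()

        def run_job(self, job_id, func):
            for client in self.clients.values():
                client.jobs.put((job_id, func))
            for _ in self.clients:  # wait for one result per client
                done_id, host, result = self.results.get()
                self.storage[(done_id, host)] = result

        def shutdown(self):
            for client in self.clients.values():
                client.jobs.put(None)
                client.join()


    server = Server()
    for host in ("compute-0", "compute-1"):
        server.register(host)
    server.run_job("job-1", lambda: "ok")
    server.shutdown()
    print(sorted(server.storage.items()))

Passing a shared mapping as ``storage`` would correspond to the externalized
storage case, where results outlive any single server process.
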
To enable a gradual transition, ansible-runner_ has been implemented within
Task-Core, allowing the TripleO project to convert playbooks into tasks that
rely upon strongly typed dependencies without requiring a complete rewrite. The
initial implementation should be transparent. Once the Task-Core hooks are set
within tripleoclient_, functional groups can then convert their tripleo-ansible_
roles or ad-hoc Ansible tasks into Directord orchestrations. Teams will have
the flexibility to transition code over time and are incentivized by a
significantly improved user experience and shorter time to delivery.
Assignee(s)
-----------
Primary assignee:
* Cloudnull - Kevin Carter
* Mwhahaha - Alex Schultz
Other contributors:
* Slagle - James Slagle
* Odyssey4me - Jesse Pretorius
Work Items
----------
1. Package all of the Task-Core and Directord dependencies, should there be any.
2. Package both Task-Core and Directord.
3. Converge on a Directord deployment model (container, system, hybrid).
4. Implement the Task-Core code path within TripleO client.
5. Port in-template Ansible tasks to Directord orchestrations.
6. Port Ansible roles into Directord orchestrations.
Dependencies
============
Both Task-Core and Directord are dependencies, as they're new projects. These
dependencies may or may not be brought into the OpenStack namespace;
regardless, both of these projects, and their associated dependencies, will
need to be packaged and provided for by RDO.
Testing
=======
If successful, the implementation of Task-Core and Directord will leave the
existing testing infrastructure unchanged. TripleO will continue to function as
it currently does through the use of the tripleoclient_.
New tests will be created to ensure the Task-Core and Directord components
remain functional and provide an SLA around performance and configurability
expectations.
Documentation Impact
====================
Documentation around Ansible will need to be refactored.
New documentation will need to be created to cover the advanced
usage of Task-Core and Directord. Much of the client interactions from the
"`happy path`_" will remain unchanged.
References
==========
* Directord official documentation https://directord.com
* Ansible's decision to pivot to execution environments:
https://ansible-runner.readthedocs.io/en/latest/execution_environments.html
* Puppet 7 running experimental support for Ruby 3:
https://tickets.puppetlabs.com/browse/PUP-10957
.. _Task-Core: https://github.com/mwhahaha/task-core
.. _Directord: https://github.com/cloudnull/directord
.. _`existential crisis`: https://xkcd.com/1821
.. _General-Problem: https://xkcd.com/974
.. _`legacy tooling`: https://xkcd.com/1822
.. _FIPS: https://en.wikipedia.org/wiki/Federal_Information_Processing_Standards
.. _Performance: https://directord.com/overview.html#comparative-analysis
.. _practical: https://xkcd.com/382
.. _stable: https://xkcd.com/1343
.. _validation: https://xkcd.com/327
.. _scheme: https://github.com/mwhahaha/task-core/tree/main/schema
.. _`Task-Core and Directord implementation for Keystone`: https://raw.githubusercontent.com/mwhahaha/task-core/main/examples/directord/services/openstack-keystone.yaml
.. _`happy path`: https://xkcd.com/85
.. _tripleoclient: https://github.com/openstack/python-tripleoclient
.. _ansible-runner: https://github.com/ansible/ansible-runner
.. _tripleo-ansible: https://github.com/openstack/tripleo-ansible