diff --git a/specs/xena/directord-orchestration.rst b/specs/xena/directord-orchestration.rst new file mode 100644 index 00000000..10eacd01 --- /dev/null +++ b/specs/xena/directord-orchestration.rst @@ -0,0 +1,534 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +=========================================================== +Unifying TripleO Orchestration with Task-Core and Directord +=========================================================== + +Include the URL of your launchpad blueprint: +https://blueprints.launchpad.net/tripleo/+spec/unified-orchestration + +The purpose of this spec is to introduce core concepts around Task-Core and +Directord, explain their benefits, and cover why the project should consider +using them. + +TripleO has long been established as an enterprise deployment solution for +OpenStack. While TripleO has been built to meet the needs of operators, it has +never been built to be fast or concise. TripleO maintains many layers of +abstraction which are effectively infinitely configurable, all at the expense +of time. Over the past few cycles, the TripleO core team has been on a mission +to simplify the stack, focusing on removing unnecessary services and minimizing +or marginalizing other over-engineered components. These efforts have come to a +head and are now approaching the point of diminishing returns. To further ease +the time and complexity burdens within TripleO, the project must look deeper to +achieve the next level of improvement; this is where Task-Core and Directord +come in. + +Task-Core_: + A dependency management and inventory graph solution which allows operators + to define tasks in simple terms with robust dominion over a given + environment. Declarative dependencies will ensure that if a container/config + is changed, only the necessary services are reloaded/restarted. 
Task-Core + provides access to the right tools for a given job with provenance, allowing + operators and developers to define outcomes confidently. + +Directord_: + A deployment framework built to manage the data center life cycle, which is + both modular and fast. Directord focuses on consistently maintaining + deployment expectations with a near real-time level of performance_ at almost + any scale. + + +Problem Description +=================== + +TripleO presently uses a collection of bespoke tools to achieve its +orchestration goals. While the TripleO tool suite has worked and is likely to +continue working should maintainers bear an increased burden, recent +revelations around the apparatuses provide an inflection point. Because of the +impending perfect storm spanning almost everything in the TripleO stack, the +project is presented with a choice: stay the course or confidently course +correct. + +Staying the course: + The TripleO project increases the core team size and begins planning for long + term maintenance. The project focuses on developing individual maintainers for + at risk components (Ansible, Puppet, Heat); efforts to further simplify + TripleO ostensibly come to an end. While new deployment models may be + developed, reducing time complexity and improving scalability will no longer be a + focus of the core team. The core team will ensure the TripleO project remains + practical_ for the foreseeable future. + +Course correcting: + Begin the systematic replacement of legacy core components with more tailored + solutions to meet the project's actual needs. Tailoring the stack will + simplify the maintenance required across life cycles. The corrective action + necessary to provide TripleO with a quantum leap will be invasive; having + said that, once complete, TripleO will exceed operator expectations and meet + future scale requirements, ensuring platform sovereignty, all without breaking + the user interface. 
+ +Upstream changes within applications like Ansible, where it is fundamentally +moving away from the TripleO use case, force TripleO maintainers to take on +more ownership for no additional benefit. The TripleO use case is actively +working against the future direction of Ansible. Secondly, while Puppet has +remained stable over the years, the number of maintainers for Puppet modules within +TripleO has reached an all-time low; these modules represent a significant amount of +complexity in the stack and are becoming more of a risk to the project. The cost +of maintaining systems like Ansible and Puppet, and all of their corresponding +overlapping functionality, has a high likelihood of causing a significant +disruption to the TripleO project; this is especially true as the project looks +to support future OS versions. When an infinitely configurable interface +powered by Heat is compounded against tightly coupled integrations across a set +of increasingly brittle services, TripleO is faced with an `existential +crisis`_; the TripleO project needs to maintain less across the framework. + +Presently, TripleO will see its objective end without a course correction as +there's no longer any meaningful performance, scale, or configurability that +can be squeezed out of the current system. Additionally, as the project veers +further off the path of leveraging supportable community tools, TripleO will +see the time to deliver extend indefinitely, as the project's value proposition +invariably declines. To stem the tide, TripleO must greatly simplify the +framework, enable developers to build intelligent tasks, and provide meaningful +performance enhancements that scale to meet operators' expectations. If TripleO +can capitalize on this moment, it will improve the quality of life for day one +deployers and day two operations and upgrades. 
+ + +Proposed Change +=============== + +Dramatically enhance the TripleO developer, operator, and user experience by +unifying the stack with tools built for TripleO by TripleO. + +In some ways, the move toward Task-Core and Directord creates a +General-Problem_, as it's proposing the replacement of many bespoke tools, which +are well known, with two new homegrown ones. Be that as it may, much attention +has been given to the user experience, addressing many well-known pain points +commonly associated with TripleO environments. Task-Core and Directord aim to +remove problems at scale, drop the development barrier to entry, and open the +flood gates of innovation. Teams surrounding TripleO will no longer worry about +execution times and convoluted step processes. TripleO deployers and developers +of tomorrow will be empowered to run operations within an environment without +dedicating weeks to a risky or otherwise error-prone process. + +Overview +-------- + +This specification consists of two parts that work together to achieve the +project goals. + +Task-Core: + Task-Core builds upon native OpenStack libraries to create a dependency graph + and executes a compiled solution. With Task-Core, TripleO will be able to + define a deployment instead of brute-forcing one. While powerful, Task-Core + keeps development easy and consistent, reducing the time to deliver and + allowing developers to focus on their actual deliverable, not the + orchestration details. Task-Core also guarantees reproducible builds, runtime + awareness, and the ability to resume when issues are encountered. + +* Templates containing step-logic and ad-hoc tasks will be refactored into + Task-Core definitions. + +* Each component can have its own Task-Core purpose, providing resources and + allowing other resources to depend on it. + +* The invocation of Task-Core will be baked into the TripleO client, making its + existence transparent to operators and deployers. 
+ +* Advanced users will be able to use Task-Core to meet their environment + expectations without fully understanding the deployment nuance of multiple + bespoke systems. + +* Task-Core employs a schema validation system which will ensure input is always correct + and results in a functional outcome. While the schema validation won't ensure + an operational deployment, it will eradicate issues caused by incorrect user + input, providing early feedback to deployers so they're able to make + corrections before running longer operations. + +Directord: + Directord provides a modular execution platform that is environmentally + aware. Because Directord leverages messaging, the platform can guarantee + availability, transport, and performance. Directord has been built from the + ground up, making use of industry-standard messaging protocols which ensure + pseudo-real-time performance and limited resource utilization. The built-in + DSL provides most of what the TripleO project will require out of the box. + Because no solution is perfect, Directord utilizes a plugin system that will + allow developers to create new functionality without compromise or needing to + modify core components. Additionally, all plugins are handled the same way, allowing + Directord to ensure the delivery and execution performance remain consistent. + +* Directord is a single application that is ideally suited for containers while + also providing native hooks into systems; this allows Directord to operate in + heterogeneous environments. Because Directord is a simplified application, + operators can choose how they want to run it and are not forced into a one size + fits all solution. + +* Directord is platform-agnostic, allowing it to run across systems, versions, + and network topologies while simultaneously guaranteeing it maintains the + smallest possible footprint. 
+ +* Directord is built upon messaging, giving it the unique ability to span + network topologies with varying latencies; messaging protocols compensate for + high latency environments and will finally give TripleO the ability to address + multiple data-centers and fully embrace "the edge." + +With Task-Core and Directord, TripleO will take a quantum leap in performance +and configurability. TripleO will no longer force developers and deployers to +run massive single-use systems to meet deployment goals. TripleO will have an +intelligent dependency graph that is both easy to understand and extend. +TripleO will now be environmentally aware, making it possible to run day two +operations quickly and efficiently. TripleO will better fulfill its life +cycle management through the use of cluster-aware orchestration. Finally, +TripleO will dramatically shrink its maintenance burden by eliminating many +bespoke systems running in unique and unsupported ways. + + +Alternatives +------------ + +The TripleO core team grows and embraces the maintenance burden of the bespoke +`legacy tooling`_ currently responsible for orchestration. Additionally, the +TripleO project begins documenting the scale limitations and the boundaries +that will never be addressed due to these limitations. Finally, TripleO +effectively ends the multi-cycle simplification efforts and shifts focus to the +required maintenance to maintain functional expectations long term. + + +Security Impact +--------------- + +While Task-Core and Directord are two new attack surfaces, their implementation +will eventually remove the entirety of services like Ansible and Puppet, which +are considerably more extensive in scope. A new Security assessment will need +to be performed to ensure the tooling exceeds the standard already set. 
That +said, steps have already been taken to ensure the new proposed architecture is +FIPS_ compatible, enforces `transport encryption`_, and generally adheres to a +higher standard of security; TripleO next aims for a higher standard of +operation from day one. + + +Upgrade Impact +-------------- + +Upgrades will hopefully be impacted in a very positive way. With the +introduction of Task-Core, upgrade tasks will use well-defined dependencies and +job tailored actions. Therefore, upgrade jobs should be much more efficient, +easier to understand, and effectively more straightforward; all of which make +execution inherently faster. At present there's no possible way for TripleO to +meet the expectation of being able to perform upgrade/update tasks rapidly; in +the future, should this specification be implemented, TripleO will address +updates and upgrades efficiently, with the aim to rein in maintenance windows +so that TripleO is no longer synonymous with operations that take exorbitant +amounts of time. + +The introduction of Directord will necessitate a rewrite of much of the +underlying functionality; however, upgrade tasks should be easily ported into +the Directord orchestrations and will allow TripleO to begin writing upgrades +that are based on the needs of a job, and allow us to massively simplify the +task definitions. + +Both Task-Core and Directord greatly improve the quality of life for operators +and developers when considering upgrades and rollback operations. The TripleO +project will finally realize roll-forward/backward capabilities on a +per-application basis in a time conscious way. No longer will a failed +operation result in cluster wide instability and obscurity. When planning +activities, the Task-Core dependency graph will ensure only the actions required +are included, without duplication or forcing deployers into multi-day +maintenance scenarios. With Directord, operations are easily written and +transparently executed. 
The combination of Task-Core and Directord will empower +updates and upgrades in ways never thought possible. + + +Other End User Impact +--------------------- + +When following the `happy path`_, the end-user, deployers, and operators will +not interact with this change. The user interface will effectively remain the +same. If an operator wishes to leverage the advanced capabilities of either +Task-Core or Directord, the tooling will be documented and at their disposal. + +It should be noted that there's a change in deployment architecture in that +Directord follows a server/client model, albeit an ephemeral one. This change +aims to be fully transparent; however, it is something that end users and +deployers will need to be aware of. + + +Performance Impact +------------------ + +This specification, if implemented, will have a massive impact on performance. +With Directord, the TripleO project will enjoy near real-time execution without +compromise. + +* Performance_ analysis has been done comparing configurability and runtime of + Directord vs. Ansible, the TripleO default orchestration tool. This analysis + highlights some of the performance gains this specification will provide; + initial testing suggests that Task-Core and Directord are more than 10x + faster than our current tool chain, representing a potential 90% time savings + when executing a comparable workload. + +* One of the goals of this specification is to remove impediments in the time + to work. Deployers should not be spending exorbitant time waiting for tools to + do work; in some cases, waiting longer for a worker to be available than it + would take to perform a task manually. + +* Deployers will no longer be required to run a massive server for medium-scale + deployment. Regardless of size, the memory footprint and compute cores needed + to execute a deployment will be significantly reduced. 
+ + +Other Deployer Impact +--------------------- + +Deployers are the primary focus of this specification, and the impact to them +could be positively huge. The time savings alone represents a massive quality +of life improvement. The ability to configure deployments and debug problems is +an unexpected bonus. If TripleO deployers are also considered developers, the +ease of implementing new services will be a welcomed addition. All that said, +both Task-Core and Directord represent an unknown factor; as such, they are +**not** battle-tested and will create uncertainty in an otherwise "stable_" +project. + +Implementing both Task-Core and Directord promises a better tomorrow by +fulfilling resolutions derived from the past. Extensive testing has been done; +all known use-cases, from system-level configuration to container pod +orchestration, have been covered, and automated tests have been created to +ensure nothing breaks unexpectedly. Additionally, for the first time, these +projects have expectations on performance, with tests backing up those claims, +even at a large scale. This proposal aims to remove a mountain of technical +debt while doing its best to create as little new debt as possible, all under +the lens of improving the lives of deployers. + +Should TripleO adopt Task-Core and Directord, new cloud topologies will open +to deployers. At present, TripleO assumes SSH access between the Undercloud and +Overcloud is always present. Additionally, TripleO believes the infrastructure +is relatively static, making day two operations risky and potentially painful. +Task-Core will reduce the computational burden when crafting action plans, and +Directord will ensure actions are always performed against the functional +hosts. + +Another improvement this specification will enhance is in the area of vendor +integrations. Vendors will finally be able to provide meaningful task +definitions which leverage an intelligent inventory and dependency system. 
No +longer will TripleO require vendors to have in-depth knowledge of every deployment +detail, even those outside of the scope of their deliverable. By easing the job +definitions, simplifying the development process, and speeding up the execution +of tasks, deployers will finally be able to develop solutions and test them with +confidence, without needing to spend months embedding resources into TripleO +and committing to huge capital expenditures associated with a minimally +functional environment. + +Test clouds are still highly recommended sources of information; however, +system requirements on the Undercloud will be reduced. By reducing the resources +required to operate the Undercloud, the cost of test environments, in terms of +both hardware and time, will be significantly lowered. With a lower barrier to +entry, developers and operators alike will be able to more easily contribute to +the overall project. + + +Developer Impact +---------------- + +Task-Core provides access to the right tool when required, meaning the +implementation of Task-Core will not adversely impact developers as they can +presently write code in whatever format they want; Ansible, Puppet, and +Directord are all perfectly viable options. Developers will need to change +their focus on tasks and ensure their jobs use the new graphing capabilities. +Because of the built-in dependency graph, the implementation of Task-Core +should be a welcomed one, without much in the way of negative developer impact. +One hugely positive impact on developers can be found in the Task-Core +interface validation_. Task-Core will validate the input scheme_ making the +framework more intelligent, thereby removing errors caused by the "free-form" +input and correctly setting task expectations. + +To fully realize the benefits of this specification, Ansible tasks will need to +be refactored into the Task-Core schema. 
While Task-Core can run Ansible and +Directord has a plugin system which easily allows developers to port legacy +modules into Directord plugins, there will be a developer impact as the TripleO +development methodology will change. It's fair to say that the potential +developer impact will be huge, yet the shift isn't monumental. Much of the +Ansible presently in TripleO is shell-oriented, and as such, it is easily +portable and, as stated, compatibility layers exist allowing the TripleO project +to make the required shift gradually. That said, once the Ansible tasks are +ported, the time saved in execution will be massive; this is on top of the fact +that TripleO will no longer be plagued with errors in day two operations +resulting from transient inventory. + +Example `Task-Core and Directord implementation for Keystone`_: + While this implementation example is fairly basic, it does result in a + functional Keystone environment in roughly 5 minutes and includes + services like MySQL, RabbitMQ, and Keystone, as well as ensuring that the + operating system is set up and configured for a cloud execution environment. + The most powerful aspect of this example is the inclusion of the graph + dependency system, which will allow us to easily externalize services, such as in + the case where deployers wish to offload applications into environments like + OKD. + +The implementation of Task-Core and Directord will not change the user +interface when following a `happy path`_; however, it will allow developers to +bridge the TripleO to OKD gap more effectively. As mentioned, Directord is +container-native. Images for Directord already exist on Quay, Docker Hub, and +GitHub registries, all of the appropriate metadata is available to support an +OKD environment, and tests have been implemented to ensure Directord is +functional from within pod environments. 
With Directord's ability to +automagically support heterogeneous infrastructure, TripleO developers and +deployers will now be able to implement solutions bridging container-native and +physical infrastructure without relying on fragile interfaces or legacy +transport models. + +* The use of advanced messaging protocols means TripleO will efficiently + address deployments in local data centers or at the edge without transport + stress. + +* The Directord server and storage can be easily offloaded, making it possible + for the TripleO Client to be executed from simple environments without access + to the overcloud network; imagine running a massive deployment from a laptop. + +* TripleO through the implementation of Task-Core and Directord will finally be + able to compartmentalize systems. + + +Implementation +============== + +In terms of essential TripleO integration, most of the work will occur within +the tripleoclient_, with the following new workflow. + +`Execution Workflow`_:: + + ┌────┐ ┌─────────────┐ ┌────┐ ┌─────────┐ ┌─────────┬──────┐ ??????????? + │USER├──►│TripleOclient├──►│Heat├──►│Task-Core├──►│Directord│Server├──►? Network ? + └────┘ └─────────────┘ └────┘ └─────────┘ └─────────┴──────┘ ??????????? + ▲ ▲ ▲ + │ ┌─────────┬───────┐ | | + └──────────────────────►│Directord│Storage│◄──┘ | + └─────────┴───────┘ | + | + ┌─────────┬──────┐ | + │Directord│Client│◄───────┘ + └─────────┴──────┘ + +* Directord|Server - Task executor connecting to client. + +* Directord|Client - Client program running on remote hosts connecting back to + the Directord|Server. + +* Directord|Storage - An optional component, when not externalized, Directord will + maintain the runtime storage internally. In this configuration Directord is + ephemeral. 
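The Task-Core stage of the workflow above resolves declared dependencies into an execution order before jobs are handed to Directord. The following is only a minimal sketch of that idea using Python's standard ``graphlib`` module; it is not Task-Core's actual API. The task names and the ``execution_order`` and ``affected_by`` helpers are hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it requires.
REQUIRES = {
    "os-setup": set(),
    "mysql": {"os-setup"},
    "rabbitmq": {"os-setup"},
    "keystone": {"mysql", "rabbitmq"},
}


def execution_order(requires):
    """Return one valid order in which every task follows its requirements."""
    return list(TopologicalSorter(requires).static_order())


def affected_by(changed, requires):
    """Return the changed task plus every task that depends on it."""
    affected = {changed}
    grew = True
    # Keep sweeping until no further dependent task is discovered.
    while grew:
        grew = False
        for task, deps in requires.items():
            if task not in affected and deps & affected:
                affected.add(task)
                grew = True
    return affected
```

In this sketch, a change to ``mysql`` marks only ``mysql`` and ``keystone`` for re-execution and leaves ``rabbitmq`` and ``os-setup`` untouched, which mirrors the spec's goal of reloading only the necessary services.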
+ +To enable a gradual transition, ansible-runner_ has been implemented within +Task-Core, allowing the TripleO project to convert playbooks into tasks that +rely upon strongly typed dependencies without requiring a complete rewrite. The +initial implementation should be transparent. Once the Task-Core hooks are set +within tripleoclient_, functional groups can then convert their tripleo-ansible_ +roles or ad-hoc Ansible tasks into Directord orchestrations. Teams will have +the flexibility to transition code over time and are incentivized by a +significantly improved user experience and shorter time to delivery. + + +Assignee(s) +----------- + +Primary assignee: + * Cloudnull - Kevin Carter + * Mwhahaha - Alex Schultz + + +Other contributors: + * Slagle - James Slagle + * Odyssey4me - Jesse Pretorius + + +Work Items +---------- + +1. Package all of the Task-Core and Directord dependencies, should there be any. +2. Package both Task-Core and Directord. +3. Converge on a Directord deployment model (container, system, hybrid). +4. Implement the Task-Core code path within TripleO client. +5. Port in-template Ansible tasks to Directord orchestrations. +6. Port Ansible roles into Directord orchestrations. + + +Dependencies +============ + +Both Task-Core and Directord are dependencies, as they're new projects. These +dependencies may or may not be brought into the OpenStack namespace; +regardless, both of these projects, and their associated dependencies, will +need to be packaged and provided for by RDO. + + +Testing +======= + +If successful, the implementation of Task-Core and Directord will leave the +existing testing infrastructure unchanged. TripleO will continue to function as +it currently does through the use of the tripleoclient_. + +New tests will be created to ensure the Task-Core and Directord components +remain functional and provide an SLA around performance and configurability +expectations. 
+ + +Documentation Impact +==================== + +Documentation around Ansible will need to be refactored. + +New documentation will need to be created to encompass the advanced +usage of Task-Core and Directord. Much of the client interactions from the +"`happy path`_" will remain unchanged. + + +References +========== + +* Directord official documentation https://directord.com + +* Ansible's decision to pivot to execution environments: + https://ansible-runner.readthedocs.io/en/latest/execution_environments.html + +* Puppet 7 running experimental support for Ruby 3: + https://tickets.puppetlabs.com/browse/PUP-10957 + +.. _Task-Core: https://github.com/mwhahaha/task-core + +.. _Directord: https://github.com/cloudnull/directord + +.. _`existential crisis`: https://xkcd.com/1821 + +.. _General-Problem: https://xkcd.com/974 + +.. _`legacy tooling`: https://xkcd.com/1822 + +.. _`transport encryption`: https://directord.com/authentication.html + +.. _FIPS: https://en.wikipedia.org/wiki/Federal_Information_Processing_Standards + +.. _Performance: https://directord.com/overview.html#comparative-analysis + +.. _practical: https://xkcd.com/382 + +.. _stable: https://xkcd.com/1343 + +.. _validation: https://xkcd.com/327 + +.. _scheme: https://github.com/mwhahaha/task-core/tree/main/schema + +.. _`Task-Core and Directord implementation for Keystone`: https://raw.githubusercontent.com/mwhahaha/task-core/main/examples/directord/services/openstack-keystone.yaml + +.. _`happy path`: https://xkcd.com/85 + +.. _tripleoclient: https://github.com/openstack/python-tripleoclient + +.. _`Execution Workflow`: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/798747 + +.. _ansible-runner: https://github.com/ansible/ansible-runner + +.. _tripleo-ansible: https://github.com/openstack/tripleo-ansible