From e77cfc3208b5c65f8031a0cadbe65f116419a193 Mon Sep 17 00:00:00 2001 From: Zane Bitter Date: Thu, 23 Jan 2020 21:09:18 -0500 Subject: [PATCH] Add Project Teapot idea Project Teapot is a design proposal for a bare-metal cloud to run Kubernetes on, and which itself also runs on Kubernetes. Change-Id: I5e0c63583318e64af9ec91f908eb8eac745ac907 --- doc/source/ideas/teapot/compute.rst | 178 +++++++++++++ doc/source/ideas/teapot/dns.rst | 117 +++++++++ doc/source/ideas/teapot/idm.rst | 146 +++++++++++ doc/source/ideas/teapot/index.rst | 129 ++++++++++ doc/source/ideas/teapot/installation.rst | 69 +++++ doc/source/ideas/teapot/key-management.rst | 70 +++++ doc/source/ideas/teapot/load-balancing.rst | 241 ++++++++++++++++++ doc/source/ideas/teapot/networking.rst | 187 ++++++++++++++ .../ideas/teapot/openstack-integration.rst | 109 ++++++++ doc/source/ideas/teapot/storage.rst | 140 ++++++++++ doc/source/index.rst | 3 +- 11 files changed, 1388 insertions(+), 1 deletion(-) create mode 100644 doc/source/ideas/teapot/compute.rst create mode 100644 doc/source/ideas/teapot/dns.rst create mode 100644 doc/source/ideas/teapot/idm.rst create mode 100644 doc/source/ideas/teapot/index.rst create mode 100644 doc/source/ideas/teapot/installation.rst create mode 100644 doc/source/ideas/teapot/key-management.rst create mode 100644 doc/source/ideas/teapot/load-balancing.rst create mode 100644 doc/source/ideas/teapot/networking.rst create mode 100644 doc/source/ideas/teapot/openstack-integration.rst create mode 100644 doc/source/ideas/teapot/storage.rst diff --git a/doc/source/ideas/teapot/compute.rst b/doc/source/ideas/teapot/compute.rst new file mode 100644 index 0000000..1202235 --- /dev/null +++ b/doc/source/ideas/teapot/compute.rst @@ -0,0 +1,178 @@ +Teapot Compute +============== + +Project Teapot is conceived as an exclusively bare-metal compute service for +Kubernetes clusters. Providing bare-metal compute workers to tenants allows +them to make their own decisions about how they make use of virtualisation. For +example, Tenants can choose to use a container hypervisor (such as Kata_) to +further sandbox applications, traditional VMs (such as those managed by +KubeVirt_ or `OpenStack Nova`_), *or both* `side-by-side +`_ in the same +cluster. Furthermore, it allows users to manage all components of an +application -- both those that run in containers and those that need a +traditional VM -- from the same Kubernetes control plane (using KubeVirt). +Finally, it eliminates the complexity of needing to virtualise access to +specialist hardware such as :abbr:`GPGPU (general-purpose GPU)`\ s or FPGAs, +while still allowing the capability to be used by different tenants at +different times. + +However, the *master* nodes of tenant cluster will run in containers on the +management cluster (or some other centrally-managed cluster). This makes it +easy and cost-effective to provide high availability of cluster control planes, +by not sacrificing large numbers of hosts to this purpose or requiring +workloads to run on master nodes. It also makes it possible to optionally +operate Teapot as a fully-managed Kubernetes service. Finally, it makes it +relatively cheap to scale a cluster to zero when it has nothing to do, for +example if it is only used for batch jobs, without requiring it to be recreated +from scratch each time. Since the management cluster also runs on bare metal, +the tenant pods could also be isolated from each other and from the rest of the +system using Kata, in addition to regular security policies. 
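+
+As an illustration of that last point, the isolation mechanism could be as
+simple as scheduling the tenant control-plane pods with a Kata RuntimeClass.
+The following is only a sketch; it assumes a Kata runtime handler named
+``kata`` has already been configured in the management cluster's container
+runtime, and the image shown is purely illustrative:
+
+.. code-block:: yaml
+
+   # Register the handler configured in the container runtime (CRI-O or
+   # containerd) as a RuntimeClass that pods can request.
+   apiVersion: node.k8s.io/v1beta1
+   kind: RuntimeClass
+   metadata:
+     name: kata
+   handler: kata
+   ---
+   # A tenant control-plane pod opts in to the sandboxed runtime.
+   apiVersion: v1
+   kind: Pod
+   metadata:
+     name: tenant-a-kube-apiserver
+   spec:
+     runtimeClassName: kata
+     containers:
+       - name: kube-apiserver
+         image: k8s.gcr.io/kube-apiserver:v1.17.0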
+ +.. _teapot-compute-metal3: + +Metal³ +------ + +Provisioning of bare-metal servers will use `Metal³`_. + +The baremetal-operator from Metal³ provides a Kubernetes-native interface over +a simplified `OpenStack Ironic`_ deployment. In this configuration, Ironic runs +standalone (i.e. it does not use Keystone authentication). All communication +between components occurs inside of a pod. RabbitMQ has been replaced by +json-rpc. Ironic state is maintained in a database, but the database can run on +ephemeral storage -- the Kubernetes custom resource (BareMetalHost) is the +source of truth. + +The baremetal-operator will run only in the management cluster (or some other +centrally managed cluster) because it requires access to both the :abbr:`BMC +(Baseboard Management Controller)`\ s' network (as well as the +:ref:`provisioning network `) and the +authentication credentials for the BMCs. + +.. _teapot-compute-cluster-api: + +Cluster API +----------- + +The baremetal-operator can be integrated with the Kubernetes Cluster Lifecycle +SIG's `Cluster API`_ via another Metal³ component, the +cluster-api-provider-baremetal. This contains a BareMetalMachine controller +that implements the Machine abstraction using a BareMetalHost. (Airship_ 2.0 is +also slated to use Metal³ and the Cluster API to manage cluster provisioning, +so this mechanism could be extended to deploy fully-configured clusters with +Airship as well.) + +When the Cluster API is used to build standalone clusters, typically a +bootstrap node is created (often using a local VM) to run it in order to create +the permanent cluster members. The Cluster and Machine resources are then +'pivoted' (copied) into the cluster, which continues to manage itself while the +bootstrap node is retired. When used with a centralised cluster manager such as +Teapot, the process is usually similar but can use the management cluster to do +the bootstrapping. Pivoting is optional but usually expected. + +Teapot imposes some additional constraints. Because the BareMetalHost objects +must remain in the management cluster, the Machine objects cannot be simply +copied to the tenant cluster and continue to be backed by the BareMetalMachine +controller in its present form. + +One option might be to build a machine controller for the tenant cluster that +is backed by a Machine object in another cluster (the management cluster). This +might prove useful for centralised management clusters in general, not just +Teapot. We would have no choice but to name this component +cluster-api-provider-cluster-api. + +Cluster API does not yet have support for running the tenant control plane in +containers. Tools like Gardener_ do, but are not yet well integrated with the +Cluster API. However, the Cluster Lifecycle SIG is aware of this use case, and +will likely evolve the Cluster API to make this possible. + +.. _teapot-compute-autoscaling: + +Autoscaling +----------- + +The preferred mechanism in Kubernetes for applications to control the size of +the cluster they run in is the Cluster Autoscaler. There is no separate +interface to this mechanism for applications. If an application is too busy, it +simply requests more or larger pods. When there is no longer sufficient +capacity to schedule all requested pods, the Cluster Autoscaler will scale the +cluster up. Similarly, if there is significant excess capacity not being used +by pods, Cluster Autoscaler will scale the cluster down. + +Cluster Autoscaler works using its own cloud-specific plugins. 
A `plugin that +uses the Cluster API is in progress +`_, so Teapot could +automatically make use of that provided that the Machine resources were pivoted +into the tenant cluster. + +One significant challenge posed by bare-metal is the extremely high latency +involved in provisioning a bare-metal host (15 minutes is not unusual, due in +large part to running hardware tests including checking increasingly massive +amounts of RAM). The situation is even worse when needing to deprovision a host +from one tenant before giving it to another tenant, since that requires +cleaning the local disks, though this extra overhead can be essentially +eliminated if the disk is encrypted (in which case only the keys need be +erased). + +.. _teapot-compute-scheduling: + +Scheduling +---------- + +Unlike when operating a standalone bare-metal cluster, when allocating hosts +amongst different clusters it is important to have sophisticated ways of +selecting which hosts are added to which cluster. + +An obvious example would be selecting for various hardware traits -- which are +unlikely to be grouped into 'flavours' in the way that Nova does. The optimal +way of doing this would likely include some sort of cost function, so that a +cluster is always allocated the minimum spec machine that meets its +requirements. Another example would be selecting for either affinity or +anti-affinity of hosts, possibly at different (and deployment-specific) levels +of granularity. + +Work is underway in Metal³ on a hardware-classification-controller that will +add labels to BareMetalHosts based on selected traits, and the baremetal +actuator can select hosts based on labels. This would be sufficient to perform +flavour-based allocation and affinity, but likely not on its own for +trait-based allocation and anti-affinity. + +.. _teapot-compute-reservation: + +Reservation and Quota Management +-------------------------------- + +The design for quota management should recognise the many ways in which it is +used in both private and public clouds. In public clouds utilisation is +controlled by billing; quotas are primarily a tool for *users* to limit their +financial exposure. + +In private OpenStack clouds, the implementation of chargeback is rare. A more +common model is that a department will contribute a portion of the capital +budget for a cloud in exchange for a quota -- a model that fits quite well with +Teapot's allocation of entire hosts to tenants. + +To best support the private cloud use case, there need to be separate concepts +of a guaranteed minimum reservation and a maximum quota. The sum of minimum +reservations must not exceed the capacity of the cloud (are more complex +requirement than it sounds, since it must take into account selected hardware +traits). Some form of pre-emption is needed, along with a way of prioritising +requests for hosts. Similar concepts exist in many public clouds, in the form +of reserved and spot-rate instances. + +The reservation/quota system should have a time component. This allows, for +example, users who have large batch jobs to reserve capacity for them without +tying it up around the clock. (The increasing importance of machine learning +means that once again almost everybody has large batch jobs.) Time-based +reservations can also help mitigate the high latency of moving hosts between +tenants, by allowing some of the demand to be anticipated. + + +.. _Kata: https://katacontainers.io/ +.. _KubeVirt: https://kubevirt.io/ +.. _OpenStack Nova: https://docs.openstack.org/nova +.. 
_Metal³: https://metal3.io/ +.. _OpenStack Ironic: https://docs.openstack.org/ironic +.. _Cluster API: https://github.com/kubernetes-sigs/cluster-api#readme +.. _Airship: https://www.airshipit.org/ +.. _Gardener: https://gardener.cloud/030-architecture/ diff --git a/doc/source/ideas/teapot/dns.rst b/doc/source/ideas/teapot/dns.rst new file mode 100644 index 0000000..4bdc722 --- /dev/null +++ b/doc/source/ideas/teapot/dns.rst @@ -0,0 +1,117 @@ +Teapot DNS +========== + +Project Teapot must provide a trusted way for DNS information generated by the +(untrusted) tenant clusters to be propagated out to the network. + +Each tenant cluster requires at least 2 DNS records -- one for the control +plane, and a wildcard for any applications. These would usually be subdomains +of a zone delegated to the Teapot for this purpose. Teapot would be responsible +for rolling up these records and making them available over DNS. + +Since Teapot will be responsible for :ref:`allocating public IP addresses +`, it will also need to be responsible for +advertising reverse DNS records for those IPs. + +Implementation Options +---------------------- + +The Kubernetes SIG ExternalDNS_ project is a Kubernetes-native service that +collects IP addresses for Services and Ingresses running in the cluster and +exports DNS records for them (though it is *not* itself a DNS server). It +supports many different back-ends -- both traditional DNS servers and +cloud-based services (including OpenStack Designate). + +While tenants are of course free to run this in their own clusters already +(perhaps pointing to an external cloud service), this is not sufficient to +satisfy the above requirements. It requires them to use an external cloud +service (which may not always be appropriate for internal-only applications in +a private cloud), since tenants are untrusted and cannot be given write access +to an internal DNS server. And reverse DNS records cannot be exported, because +tenant clusters are not a trusted source of information about what IP addresses +are assigned to them. + +.. _teapot-dns-externaldns: + +ExternalDNS in load balancing cluster +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If Teapot implemented the :doc:`load balancing ` :ref:`option +based on Ingress resources ` in the +management cluster (or a separate load balancing cluster), and these were used +for both Services and Ingresses, then ExternalDNS running in that same cluster +would automatically see all of the external endpoints for the tenant clusters. +It could even rely on the fact that the IP addresses will have been sanitised +already before creating the Ingress objects. There would need to be provision +made somewhere for sanitising the DNS names, however. + +On its own this only satisfies the first requirement. Additional work might +need to be done to export the wildcard DNS records for the tenant workloads. +(Note that the tenant control planes would be running in containers on the +management cluster or another centrally-managed cluster, and may well have +Ingress resources associated with them already.) And additional work would +certainly be needed to export the reverse DNS records. + +A major downside of this is that it gives the tenant very little control over +whether and how it exports DNS information. + +.. 
_teapot-dns-externaldns-sync: + +Build a component to sync ExternalDNS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A component running in a tenant cluster could sync any ExternalDNS Endpoint +resources (not to be confused with Kubernetes Endpoint resources) from the +tenant cluster into the management cluster. (This component could even be +written as an ExternalDNS provider.) This option is analagous to an +:ref:`Ingress-based API for load balancing +`. + +On the management cluster side, a validating webhook would check for legitimacy +prior to accepting a resource. More investigation is required into the +mechanics of this -- since the resources are not normally manipulated by +anything other than ExternalDNS itself, having something else writing to them +might prove brittle. + +Again, additional work might need to be done to export the wildcard DNS records +for the tenant workloads and would be needed for reverse DNS records. + +.. _teapot-dns-designate: + +OpenStack Designate +~~~~~~~~~~~~~~~~~~~ + +Designate_ is already one of the supported back-ends for ExternalDNS. By +running a minimal, opinionated installation of Designate in the management +cluster we could allow tenants to choose whether and how to set up ExternalDNS +in their own clusters. They could choose to export records to the Teapot cloud, +to some external cloud, or not at all. + +Since Designate has an API, it would be easy to add the two top-level records +for each cluster. + +Designate has the ability to export reverse DNS records based on floating IPs. +However, the current implementation is tightly coupled to Neutron. If Neutron +is used in Teapot it should be as an implementation detail only, so other +services like Designate should not rely on integrating with it. Therefore +additional work would be required to support reverse DNS. There is an API +plugin point to pull data, or it could be pushed in through the Designate API. + +Ideally the back-end in the opinionated configuration would be CoreDNS_, due to +its status in the Kubernetes community (it is used for the *internal* DNS and +is a CNCF project). However, there is currently no CoreDNS back-end for +Designate. An alternative to writing one would be to write a Designate plugin +for CoreDNS -- similar plugins exist for other clouds already. The latter would +provide the most benefit to OpenStack users, since theoretically tenants could +make use of it even if CoreDNS is not chosen as the back-end by their OpenStack +cloud's administrators. + +The Designate Sink component would not be required, but the rest of Designate +is also built around RabbitMQ, which is highly undesirable. However, it is +largely used to implement RPC patterns (``call``, not ``cast``), and might be +amenable to being swapped for a json-rpc interface in the same way as is done +in Ironic for Metal³. + +.. _ExternalDNS: https://github.com/kubernetes-sigs/external-dns#readme +.. _Designate: https://docs.openstack.org/designate/ +.. _CoreDNS: https://coredns.io/ diff --git a/doc/source/ideas/teapot/idm.rst b/doc/source/ideas/teapot/idm.rst new file mode 100644 index 0000000..1e68f79 --- /dev/null +++ b/doc/source/ideas/teapot/idm.rst @@ -0,0 +1,146 @@ +Teapot Identity Management +========================== + +Teapot need not, and should not, impose any particular identity management +system for tenant clusters. 
These are the clusters that applications and +application developers/operators will routinely interact with, and the choice +of identity management providers is completely up to the administrators of +those clusters, or at least the administrator of the Teapot cloud when running +as a fully-managed service. + +Identity management in Teapot itself (i.e. the management cluster) is needed +for two different purposes. While not strictly necessary, it would be +advantageous to require only one identity management provider to cover both of +these use cases. + +Authenticating From Below +------------------------- + +Software running in the tenant clusters needs to authenticate to the cloud to +request resources, such as machines, :doc:`load balancers `, +:doc:`shared storage `, :doc:`DNS records `, and (in future) +managed software services. + +Credentials for these purposes should be regularly rotated and narrowly +authorised, to limit both the scope and duration of any compromise. + +Authenticating From Above +------------------------- + +Real users and sometime software services need to authenticate to the cloud to +create or destroy clusters, manually scale them up or down, request quotas, and +so on. + +In many cases, such as most enterprise private clouds, these credentials should +be linked to an external identity management provider. This would allow +auditors of the system to tie physical hardware directly back to corporeal +humans to which it is allocated and the organisational units to which they +belong. + +Humans must also have a secure way of delegating privileges to an application +to interact with the cloud in this way -- for example, imagine a CI system that +needs to create an entire test cluster from scratch and destroy it again. This +must not require the user's own credentials to be stored anywhere. + +Implementation options +---------------------- + +.. _teapot-idm-keystone: + +OpenStack Keystone +~~~~~~~~~~~~~~~~~~ + +Keystone_ is currently the only game in town for providing identity management +for OpenStack services that are candidates for being included to provide some +multi-tenant functionality in Teapot, such as :ref:`Manila +` and :ref:`Designate `. Therefore +using Keystone for all identity management on the management cluster would not +only not increase complexity of the deployment, it would actually minimise it. + +An authorisation webhook for Kubernetes that uses Keystone is available in +cloud-provider-openstack_. In general, OAuth seems to be preferred to webhooks +for connecting external identity management systems, but there is at least a +working option. + +Keystone supports delegating user authentication +to LDAP, as well as offering its own built-in user management. It can also +federate with other identity providers via the `OpenID Connect`_ or SAML_ +protocols. Using Keystone would also make it simpler to run Teapot alongside an +existing OpenStack cloud -- enabling tenants to share services in that cloud, +as well as potentially making Teapot's functionality available behind an +OpenStack-native API (similar to Magnum) for those who want it. + +Keystone also features quota management capabilities that could be reused to +manage tenant quotas_. A proof-of-concept for a validating webhook that allows +this to be used for governing Kubernetes resources `exists +`_. 
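+
+For illustration, wiring a cluster up to the Keystone authorisation webhook
+mentioned above (k8s-keystone-auth) uses the standard Kubernetes webhook
+configuration file, a kubeconfig-format document passed to the API server via
+``--authentication-token-webhook-config-file`` and/or
+``--authorization-webhook-config-file``. The service name, namespace, port
+and path below are assumptions about how the webhook might be deployed, not a
+prescribed layout:
+
+.. code-block:: yaml
+
+   apiVersion: v1
+   kind: Config
+   clusters:
+     - name: keystone-webhook
+       cluster:
+         certificate-authority: /etc/kubernetes/pki/webhook-ca.crt
+         server: https://k8s-keystone-auth.kube-system.svc:8443/webhook
+   users:
+     - name: kube-apiserver
+   contexts:
+     - name: webhook
+       context:
+         cluster: keystone-webhook
+         user: kube-apiserver
+   current-context: webhook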
+
+While there are generally significant impedance mismatches between the
+Kubernetes and Keystone models of authorisation, Project Teapot is a fresh
+start and can prescribe custom policy models that mitigate the mismatch.
+(Ongoing changes to default policies will likely smooth over these kinds of
+issues in regular OpenStack clouds also.) This may not be so easy when sharing
+a Keystone :doc:`with an OpenStack cloud <./openstack-integration>` though.
+
+Keystone Application Credentials allow users to create (potentially)
+short-lived credentials that an application can use to authenticate without the
+need to store the user's own LDAP password (which likely also governs their
+access to a wide range of unrelated corporate services) anywhere. Credentials
+provided to tenant clusters should be exclusively of this type, limited to the
+purpose assigned (e.g. credentials intended for accessing storage can only be
+used to access storage), and regularly rotated out and expired.
+
+.. _teapot-idm-dex:
+
+Dex
+~~~
+
+Dex_ is an identity management service that uses `OpenID Connect`_ to provide
+authentication to Kubernetes. It too supports delegating user authentication to
+LDAP, amongst others. This would likely be seen as a more conventional choice
+in the Kubernetes community. Dex can store its data using Kubernetes custom
+resources, so it is the most lightweight option.
+
+Dex does not support authorisation. However, Keystone supports OpenID Connect
+as a federated identity provider, so it could still be used as the
+authorisation mechanism (including for OpenStack-derived services such as
+Manila) using Dex for authentication. This does, however, add additional
+moving parts. In general, Keystone has `difficulty
+`_ with application
+credentials for federated users because it is not immediately notified of
+membership revocations, but since both components are under the same control in
+this case it would be easier to build some additional integration to keep them
+in sync.
+
+.. _teapot-idm-keycloak:
+
+Keycloak
+~~~~~~~~
+
+Keycloak_ is a more full-featured identity management service. It would also be
+seen in the Kubernetes community as a more conventional choice than Keystone,
+although it does not use the Kubernetes API as a data store. Keycloak is
+significantly more complex to deploy than Dex. However, a `Kubernetes operator
+for Keycloak `_ now exists,
+which should hide much of the complexity.
+
+Keystone could federate to Keycloak as an identity management provider using
+either OpenID Connect or SAML.
+
+Theoretically, Keycloak could be used without Keystone if the Keystone
+middleware in the services were replaced by some new OpenID Connect middleware.
+The architecture of OpenStack is designed to make this at least possible. It
+would also require changes to client-side code (most prominently any
+cloud-provider-openstack providers that might otherwise be reused), although
+there is a chance that they could be contained to a small blast radius around
+Gophercloud's `clientconfig module
+`_.
+
+
+.. _Keystone: https://docs.openstack.org/keystone/
+.. _OpenID Connect: https://openid.net/connect/
+.. _SAML: https://docs.oasis-open.org/security/saml/Post2.0/sstc-saml-tech-overview-2.0.html
+.. _cloud-provider-openstack: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/using-keystone-webhook-authenticator-and-authorizer.md#readme
+.. _quotas: https://docs.openstack.org/keystone/latest/admin/unified-limits.html
+.. _Dex: https://github.com/dexidp/dex/#readme
+.. 
_Keycloak: https://www.keycloak.org/ diff --git a/doc/source/ideas/teapot/index.rst b/doc/source/ideas/teapot/index.rst new file mode 100644 index 0000000..e1def79 --- /dev/null +++ b/doc/source/ideas/teapot/index.rst @@ -0,0 +1,129 @@ +Project Teapot +============== + +.. _teapot-introduction: + +Introduction +------------ + +Project Teapot is a design proposal for a bare-metal cloud to run Kubernetes +on. + +When OpenStack was first designed, 10 years ago, the cloud computing landscape +was a very different place. In the intervening period, OpenStack has amassed an +enormous installed base of many thousands of users who all depend on it +remaining essentially the same service, with backward-compatible APIs. If we +designed an open source cloud platform without those restrictions and looking +ahead to the 2020s, knowing everything we know today, what might it look like? +And how could we build it without starting from scratch, but using existing +open source technologies where possible? Project Teapot is one answer to these +questions. + +Project Teapot is designed to run natively on Kubernetes, and to integrate with +Kubernetes clusters deployed by tenants. It provides only bare-metal compute +capacity, so that tenants can orchestrate all aspects of an application -- from +legacy VMs to cloud-native containers to workloads requiring custom hardware, +and everything in between -- through a single API that they can control. + +It seems inevitable that numerous organisations are going to end up +implementing various subsets of this functionality just to deal with bare-metal +clusters in their own environment. By developing Teapot in the open, we would +give them a chance to reduce costs by collaborating on a common solution. + +.. _teapot-goals: + +Goals +----- + +OpenStack's `mission +`_ +is to be ubiquitous; Teapot's is narrower. In the 2020s, Kubernetes will be +ubiquitous. However, Kubernetes' separation of responsibilities with the +underlying cloud mean that some important capabilities are considered out of +scope for it -- most obviously multi-tenancy of the sort provided by clouds, +allowing isolation from potentially malicious users (including innocuous users +who have had their workloads hacked by malicious third parties). Teapot's +primary mission is to fill those gaps with an open source solution, by +providing a cloud layer to manage a physical data center beneath Kubernetes. + +In addition to mediating access to a physical data center, another important +role of clouds is to offer managed services (for example, a database as a +service). Teapot itself can be used to provide a managed service -- Kubernetes +(though it could equally be configured to provide fully user-controlled tenant +clusters). A secondary goal is to make Teapot a platform that cloud providers +could use to offer other kinds of managed service as well. Teapot is an easier +base than OpenStack on which to deploy such services because it is itself based +on Kubernetes. + +.. _teapot-non-goals: + +Non-Goals +--------- + +Teapot's design makes it suitable for deployments that require multi-tenancy +and are medium-sized or larger. Specifically, Teapot makes sense when tenants +are large enough to be able to utilise at least one (and usually more than one) +entire bare-metal server, because managing virtual machines is not a goal. + +Smaller deployments that nevertheless require hard multi-tenancy (that is to +say, zero trust required between tenants) would be better off with OpenStack. 
+ +Smaller deployments that do not require hard multi-tenancy would be better off +running a single standalone Kubernetes cluster. + +.. _teapot-design: + +Design +------ + +The `Vision for OpenStack Clouds`_ states that the `physical data center +management function +`_ +of a cloud must "[provide] the abstractions needed to deal with external +systems like :doc:`compute `, :doc:`storage `, and +:doc:`networking ` hardware [including :doc:`load balancers +` and :doc:`hardware security modules `], the +:doc:`Domain Name System `, and :doc:`identity management systems `." +This proposal discusses implementation options for each of those classes of +systems. + +Teapot also fulfils the `self-service +`_ +requirements of a cloud, by providing multi-tenancy and :ref:`capacity +management `. In the Kubernetes model, +multi-tenancy is something that must be provided by the cloud layer. + +Because Teapot targets Kubernetes as its tenant workload, it is able to +`provide applications control +`_ +over the cloud using the standard Kubernetes interfaces (such as Ingress +resources and the Cluster Autoscaler). This greatly simplifies porting of many +workloads to and from other clouds. + +Teapot is designed to be radically simpler than OpenStack to :doc:`install +` and operate. By running on the same technology stack as the +tenant clusters it deploys, it allows a common set of skills to be applied to +the operation of both applications and the underlying infrastructure. By +eschewing direct management of virtualisation it avoids having to shoehorn +bare-metal management into a virtualisation context or vice-versa, and +eliminates entire layers of networking abstractions. + +At the same time, Teapot should be able to :doc:`interoperate with OpenStack +` when required so that each enhances the value of the +other without adding unnecessary layers of complexity. + +Index +----- + +.. toctree:: + compute + storage + networking + load-balancing + dns + idm + key-management + installation + openstack-integration + +.. _Vision for OpenStack Clouds: https://governance.openstack.org/tc/reference/technical-vision.html diff --git a/doc/source/ideas/teapot/installation.rst b/doc/source/ideas/teapot/installation.rst new file mode 100644 index 0000000..3b979e7 --- /dev/null +++ b/doc/source/ideas/teapot/installation.rst @@ -0,0 +1,69 @@ +Teapot Installation +=================== + +In a sense, the core of Teapot is simply an application running in a Kubernetes +cluster (the management cluster). This is a great advantage for ease of +installation, because Kubernetes is renowned for its simplicity in +bootstrapping. Many, many (perhaps too many) tools already exist for +bootstrapping a Kubernetes cluster, so there is no need to reinvent them. + +However, Teapot is designed to be the system that provides cloud services to +bare-metal Kubernetes clusters, and while it is possible to run the management +cluster on another cloud (such as OpenStack), it is likely in most instances to +be self-hosted on bare metal. This presents a unique bootstrapping challenge. + +OpenStack does not define an 'official' installer, largely due to the plethora +of configuration management tools that different users preferred. Teapot does +not have the same issue, as it standardises on Kubernetes as the *lingua +franca*. There should be a single official installer and third parties are +encouraged to add extensions and customisations by adding Resources and +Operators through the Kubernetes API. 
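+
+To make the extension point concrete, the kind of resource that the installer
+(and any third-party customisation) might operate on could look something like
+the sketch below. It is purely hypothetical -- the API group, kind and fields
+are placeholders rather than a proposed schema:
+
+.. code-block:: yaml
+
+   apiVersion: teapot.example.org/v1alpha1
+   kind: TeapotDeployment
+   metadata:
+     name: teapot
+   spec:
+     provisioningNetwork:
+       interface: eno2
+       dhcpRange: 172.22.0.10,172.22.0.100
+     dns:
+       zone: teapot.example.com
+     loadBalancing:
+       implementation: ingress
+
+Distributors and vendors would then layer their customisations on top of
+resources like this one, rather than patching the installer itself.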
+ +Implementation Options +---------------------- + +Metal³ +~~~~~~ + +`Metal³`_ is designed to bootstrap standalone bare-metal clusters, so it can be +used to install the management cluster. There are multiple ways to do this. One +is to use the `Cluster API`_ on a bootstrap VM, and then pivot the relevant +resources into the cluster. The OpenShift installer takes a slightly different +approach, again using a bootstrap VM, but creating the master nodes initially +using Terraform and then creating BareMetalHost resources marked as 'externally +provisioned' for them in the cluster. + +One inevitable challenge is that the initial bootstrap VM must be able to +connect to the :ref:`management and provisioning networks +` in order to begin the installation. That +makes it difficult to simply run from a laptop, which makes installing a small +proof-of-concept cluster harder than anyone would like. (This is inherent to +the bare-metal environment and also a problem for OpenStack installers). If a +physical host must be used as the bootstrap, reincorporating that hardware into +the actual cluster once it is up and running should at least be simpler on +Kubernetes. + +Airship +~~~~~~~ + +Airship_ 2.0 uses Metal³ and the Cluster API to provision Kubernetes clusters +on bare metal. It also provides a declarative way of repeatably setting the +initial configuration and workloads of the deployed cluster, along with a rich +document layering and substitution model (based on Kustomize). This might be +the simplest existing way of defining what a Teapot installation looks like +while allowing distributors and third-party vendors a clear method for +providing customisations and add-ons. + +Teapot Operator +~~~~~~~~~~~~~~~ + +A Kubernetes operator for managing the deployment and configuration of the +Teapot components could greatly simplify the installation process. This is not +incompatible with using Airship (or indeed any other method) to define the +configuration, as Helm would just create the top-level custom resource(s) +controlled by the operator, instead of lower-level resources for the individual +components. + +.. _Metal³: https://metal3.io/ +.. _Cluster API: https://github.com/kubernetes-sigs/cluster-api#readme +.. _Airship: https://www.airshipit.org/ diff --git a/doc/source/ideas/teapot/key-management.rst b/doc/source/ideas/teapot/key-management.rst new file mode 100644 index 0000000..bc85ec4 --- /dev/null +++ b/doc/source/ideas/teapot/key-management.rst @@ -0,0 +1,70 @@ +Teapot Key Management +===================== + +Kubernetes offers the Secret resource for storing secrets needed by +applications. This is an improvement on storing them in the applications' +source code, but unfortunately by default Secrets are not encrypted at rest, +but simply stored in etcd in plaintext. An EncryptionConfiguration_ resource +can be used to ensure the Secrets are encrypted before storing them, but in +most cases the keys used to encrypt the data are themselves stored in etcd in +plaintext, alongside the encrypted data. + +This can be avoided by using a `Key Management Service provider`_ plugin. In +this case the encryption keys for each Secret are themselves encrypted, and can +only be decrypted using a master key stored in the key management service +(which may be a hardware security module). All extant KMS providers appear to +be for cloud services; there are no baremetal options. 
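+
+For reference, the cluster-side half of this arrangement is just the standard
+EncryptionConfiguration passed to the API server (via
+``--encryption-provider-config``); only the plugin name and socket path below
+are assumptions about how a Teapot-provided plugin would be deployed:
+
+.. code-block:: yaml
+
+   apiVersion: apiserver.config.k8s.io/v1
+   kind: EncryptionConfiguration
+   resources:
+     - resources:
+         - secrets
+       providers:
+         # Envelope-encrypt Secrets with data keys that are themselves
+         # encrypted by the external KMS, reached over a local Unix socket
+         # served by the provider plugin on each master node.
+         - kms:
+             name: teapot-kms
+             endpoint: unix:///var/run/kms/teapot.sock
+             cachesize: 1000
+             timeout: 3s
+         # Fall back to plaintext reads so that existing, unencrypted Secrets
+         # remain readable until they have been rewritten.
+         - identity: {}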
+ +Since the KMS provider is necessary to provide effective encryption at rest and +is the *de facto* responsibility of the cloud, it would be desirable for Teapot +to support it. The implementation should be able to make use of :abbr:`HSM +(Hardware Security Module)`\ s, but also be able to work with a pure-software +solution. + +Implementation Options +---------------------- + +.. _teapot-key-management-barbican: + +OpenStack Barbican +~~~~~~~~~~~~~~~~~~ + +Barbican_ provides exactly the thing we want. It `provides +`_ an +abstraction over HSMs as well as software implementations using Dogtag_ (which +can itself store its master keys either in software or in an HSM) or Vault_, +along with another that simply stores its master key in the config file. + +Like other OpenStack services, Barbican uses Keystone for :doc:`authentication +`. A :abbr:`KMS (Key Management Service)` provider for Barbican already +exists in cloud-provider-openstack_. This could be used in both the management +cluster and in tenant clusters. + +Barbican's architecture is relatively simple, although it does rely on RabbitMQ +for communication between the API and the workers. This should be easy to +replace with something like json-rpc as was done for Ironic in Metal³ to +simplify the deployment. + +Storing keys in software on a dynamic system like Kubernetes presents +challenges. It might be necessary to use a host volume on the master nodes to +store master keys when no HSM is available. Ultimately the most secure solution +is to use a HSM. + +.. _teapot-key-management-secrets: + +Write a new KMS plugin using Secrets +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writing a KMS provider plugin is very straightforward. We could write one that +just uses a Secret stored in the management cluster as the master key. + +However, this could not be used to encrypt Secrets at rest in the management +cluster itself. + + +.. _EncryptionConfiguration: https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/ +.. _Key Management Service provider: https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/ +.. _Barbican: https://docs.openstack.org/barbican/latest/ +.. _Dogtag: https://www.dogtagpki.org/wiki/PKI_Main_Page +.. _Vault: https://www.vaultproject.io/ +.. _cloud-provider-openstack: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/using-barbican-kms-plugin.md#readme diff --git a/doc/source/ideas/teapot/load-balancing.rst b/doc/source/ideas/teapot/load-balancing.rst new file mode 100644 index 0000000..8944232 --- /dev/null +++ b/doc/source/ideas/teapot/load-balancing.rst @@ -0,0 +1,241 @@ +Teapot Load Balancing +===================== + +Load balancers are one of the things that Kubernetes expects to be provided by +the underlying cloud. No multi-tenant bare-metal solutions for this exist, so +project Teapot would need to provide one. Ideally an external load balancer +would act as an abstraction over what could be either a tenant-specific +software load balancer or multi-tenant-safe access to a hardware (or virtual) +load balancer. + +There are two ways for an application to request an external load balancer in +Kubernetes. The first is to create a Service_ with type |LoadBalancer|_. This +is the older way of doing things but is still useful for lower-level plumbing, +and may be required for non-HTTP(S) protocols. The preferred (though nominally +beta) way is to create an Ingress_. 
The Ingress API allows for more
+sophisticated control (such as adding :abbr:`TLS (Transport Layer Security)`
+termination), and can allow multiple services to share a single external load
+balancer (including across different DNS names), and hence a single IP address.
+
+Most managed Kubernetes services provide an Ingress controller that can set up
+external load balancers, including TLS termination, using the underlying
+cloud's services. Without this, tenants can still use an Ingress controller,
+but it would have to be one that uses resources available to the tenant, such
+as by running software load balancers in the tenant cluster.
+
+When using a Service of type |LoadBalancer| (rather than an Ingress), there is
+no standardised way of requesting TLS termination (some cloud providers permit
+it using an annotation), so supporting this use case is not a high priority.
+The |LoadBalancer| Service type in general should be supported, however (though
+there are existing Kubernetes offerings where it is not).
+
+Implementation options
+----------------------
+
+The choices below are not mutually exclusive. An administrator of a Teapot
+cloud and their tenants could each potentially choose from among several
+available options.
+
+.. _teapot-load-balancing-metallb-l2:
+
+MetalLB (Layer 2) on tenant cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The MetalLB_ project provides two ways of doing load balancing for bare-metal
+clusters. One requires control over only layer 2, although it really only
+provides the high-availability aspects of load balancing, not actual balancing.
+All incoming traffic for each service is directed to a single node; from there
+kube-proxy distributes it to the endpoints that handle it. However, should the
+node die, traffic rapidly fails over to another node.
+
+This form of load balancing does not support offloading TLS termination,
+results in large amounts of East-West traffic, and consumes resources from the
+tenant cluster.
+
+Tenants could decide to use this unilaterally (i.e. without the involvement of
+the management cluster or its administrators). However, using MetalLB restricts
+the choice of CNI plugins -- for example it does not work with OVN. A
+prerequisite to using it would be that all tenant machines share a layer 2
+broadcast domain, which may be undesirable in larger clouds. This may be an
+acceptable solution for Services in some cases though.
+
+.. _teapot-load-balancing-metallb-l3-management:
+
+MetalLB (Layer 3) on management cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The layer 3 form of MetalLB_ load balancing provides true load balancing, but
+requires control over the network hardware in the form of advertising
+:abbr:`ECMP (Equal Cost Multiple Path)` routes via BGP. (This also places
+additional `requirements
+`_ on the network
+hardware.) Since tenant clusters are not trusted to do this, it would have to
+run in the management cluster. There would need to be an API in the management
+cluster to vet requests and pass them on to MetalLB, and a
+cloud-provider-teapot plugin that tenants could optionally install to connect
+to it.
+
+This form of load balancing does not support offloading TLS termination either.
+
+.. 
_teapot-load-balancing-metallb-l3-tenant: + +MetalLB (Layer 3) on tenant cluster +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +While the network cannot trust BGP announcements from tenants, in principle the +management cluster could have a component, perhaps based on `ExaBGP +`_, that listens to such +announcements on the tenant V(x)LANs, drops any that refer to networks not +allocated to the tenant, and rebroadcasts the legitimate ones to the network +hardware. + +This would allow tenant networks to choose to make use of MetalLB in its Layer +3 mode, providing actual traffic balancing as well as making it possible to +split tenant machines amongst separate L2 broadcast domains. It would also +allow tenants to choose among a much wider range of :doc:`CNI plugins +<./networking>`, many of which also rely on BGP announcements. + +This form of load balancing still does not support offloading TLS termination. + +.. _teapot-load-balancing-ovn: + +Build a new OVN-based load balancer +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +One drawback of MetalLB is that it is not compatible with using OVN as the +network overlay. This is unfortunate, as OVN is one of the most popular network +overlays used with OpenStack, and thus might be a common choice for those +wanting to integrate workloads running in OpenStack and Kubernetes together. + +A new OVN-based network load balancer in the vein of MetalLB might provide more +options for this group. + +.. _teapot-load-balancing-ingress-api: + +Build a new API using Ingress resources +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A new API in the management cluster would receive requests in a form similar to +an Ingress resource, sanitise them, and then proxy them to an Ingress +controller running in the management cluster (or some other +centrally-controlled cluster). In fact, it is possible the 'API' could be as +simple as using the existing Ingress API in a namespace with a validating +webhook. + +The most challenging part of this would be coaxing the Ingress controllers on +the load balancing cluster to target services in a different cluster (the +tenant cluster). Most likely we would have to sync the EndpointSlices from the +tenant cluster into the load balancing cluster. + +In all likelihood when using a software-based Ingress controller running in a +load balancing cluster, a network load balancer would also be used on that +cluster to ensure high-availability of the load balancers themselves. Examples +include MetalLB and `kube-keepalived-vip +`_ (which uses :abbr:`VRRP +(Virtual Router Redundancy Protocol)` to ensure high availability). This +component would need to be integrated with :ref:`public IP assignment +`. + +There are already controllers for several types of software load balancers (the +nginx controller is even officially supported by the Kubernetes project), as +well as multiple hardware load balancers. This includes an existing Octavia +Ingress controller in cloud-provider-openstack_, which would be useful for +:doc:`integrating with OpenStack clouds `. The ecosystem +around this API is likely to have continued growth. This is also likely to be +the site of future innovation around configuration of network hardware, such as +hardware firewalls. + +In general, Ingress controllers are not expected to support non-HTTP(S) +protocols, so it's not necessarily possible to implement the |LoadBalancer| +Service type with an arbitrary plugin. 
However, the nginx Ingress controller +has support for arbitrary `TCP and UDP services +`_, +so the API would be able to provide for either type. + +Unlike the network load balancer options, this form of load balancing would be +able to terminate TLS connections. + +.. _teapot-load-balancing-custom-api: + +Build a new custom API +~~~~~~~~~~~~~~~~~~~~~~ + +A new service running on the management cluster would provide an API through +which tenants could request a load balancer. An implementation of this API +would provide a pure-software load balancer running in containers in the +management cluster (or some other centrally-controlled cluster). As in the case +of an Ingress-based controller, a network load balancer would likely be used to +provide high-availability of the load balancers. + +The API would be designed such that alternate implementations of the controller +could be created for various load balancing hardware. Ideally one would take +the form of a shim to the existing cloud-provider API for load balancers, so +that existing plugins could be used. This would include +cloud-provider-openstack, for the case where Teapot is installed alongside an +OpenStack cloud allowing it to make use of Octavia. + +Unlike the network load balancer options, this form of load balancing would be +able to terminate TLS connections. + +This option seems to be strictly inferior to using Ingress controllers on the +load balancing cluster to implement an API, assuming both options prove +feasible. + +.. _teapot-load-balancing-ingress-controller: + +Build a new Ingress controller +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In the event that we build a new API in the management cluster, a Teapot +Ingress controller would proxy requests for an Ingress to it. This controller +would likely be responsible for syncing the EndpointSlices to the API as well. + +.. _teapot-load-balancing-cloud-provider: + +Build a new cloud-provider +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In the event that we build a new API in the management cluster, a +cloud-provider-teapot plugin that tenants could optionally install would allow +them to make use of the API in the management cluster to configure Services of +type |LoadBalancer|. + +While helpful to increase portability of applications between clouds, this is a +much lower priority than building an Ingress controller. Tenants can always +choose to use Layer 2 MetalLB for their |LoadBalancer| Services instead. + +.. _teapot-load-balancing-octavia: + +OpenStack Octavia +~~~~~~~~~~~~~~~~~ + +On paper, Octavia_ provides exactly what we want: a multi-tenant abstraction +layer over hardware load balancer APIs, with a software-based driver for those +wanting a pure-software solution. + +In practice, however, there is only one driver for a hardware load balancer +(along with a couple of other out-of-tree drivers), and an Ingress controller +for that hardware also exists. More drivers existed for the earlier Neutron +LBaaS v2 API, but some vendors had largely moved on to Kubernetes by the time +the Neutron API was replaced by Octavia. + +The pure-software driver (Amphora) itself supports provider plugins for its +compute and network. However the only currently available providers are for +OpenStack Nova and OpenStack Neutron. Nova will not be present in Teapot. Since +we want to make use of Neutron only as a replaceable implementation detail -- +if at all -- Teapot cannot allow other components of the system to become +dependent on it. 
Additional providers would have to be written in order to use +Octavia in Teapot. + +Another possibility is integration in the other direction -- using a +Kubernetes-based service as a driver for Octavia when Teapot is +:doc:`co-installed with an OpenStack cloud `. + +.. |LoadBalancer| replace:: ``LoadBalancer`` + +.. _Service: https://kubernetes.io/docs/concepts/services-networking/service/ +.. _LoadBalancer: https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer +.. _Ingress: https://kubernetes.io/docs/concepts/services-networking/ingress/ +.. _cloud-provider-openstack: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/using-octavia-ingress-controller.md#readme +.. _MetalLB: https://metallb.universe.tf/ +.. _Octavia: https://docs.openstack.org/octavia/ diff --git a/doc/source/ideas/teapot/networking.rst b/doc/source/ideas/teapot/networking.rst new file mode 100644 index 0000000..5353009 --- /dev/null +++ b/doc/source/ideas/teapot/networking.rst @@ -0,0 +1,187 @@ +Teapot Networking +================= + +In Project Teapot, tenant clusters are deployed exclusively on bare-metal +servers, which are under the complete control of the tenant. Therefore the +network itself must be the guarantor of multi-tenancy, with only untrusted +components running on tenant machines. (Trusted components can still run within +the management cluster.) + +.. _teapot-networking-multi-tenancy: + +Multi-tenant Network Model +-------------------------- + +Support for VLANs and VxLAN is ubiquitous in modern data center network +hardware, so this will be the basis for Teapot's networking. Each tenant will +be assigned one or more V(x)LANs. (Separate failure domains will likely also +have separate broadcast domains.) As machines are assigned to the tenant, the +Teapot controller will connect each to a private virtual network also assigned +to the tenant. + +Small deployments can just use VLANs. Larger deployments may need VxLAN, and in +this case :abbr:`VTEP (VxLAN Tunnel EndPoint)`-capable edge switches and a +VTEP-capable router will be required. + +This design frees the tenant clusters from being forced to use a particular +:abbr:`CNI (Container Network Interface)` plugin. Tenants are free to select a +networking overlay (e.g. Flannel, Cilium, OVN, &c.) or other CNI plugin (e.g. +Calico, Romana) of their choice within the tenant cluster, provided that it +does not need to be trusted by the network. (This would preclude solutions that +rely on advertising BGP/OSPF routes, although it's conceivable that one day +these advertisements could be filtered through a trusted component in the +management cluster and rebroadcast to the unencapsulated network -- this would +also be useful for :ref:`load balancing +` of Services.) If the tenant's CNI +plugin does create an overlay network, that technically means that packets will +be double-encapsulated, which is a Bad Thing when it occurs in VM-based +clusters, for several reasons: + +* There is a performance overhead to encapsulating the packets on the + hypervisor, and it also limits the ability to apply some performance + optimisations (such as using SR-IOV to provide direct access to the NICs from + the VMs by virtualising the PCIe bus). +* The extra overhead in each packet can cause fragmentation, and reduces the + bandwidth available at the edge. +* Broadcast, multicast and unknown unicast traffic is flooded to all possible + endpoints in the overlay network; doing this at multiple layers can increase + network load. 
+ +However, these problems are significantly mitigated in the Teapot model: + +* The performance cost of performing the second encapsulation is eliminated by + offloading it to the network hardware. +* Encapsulation headers are carried only within the core of the network, where + bandwidth is less scarce and frame sizes can be adjusted to prevent + fragmentation. +* CNI plugins don't generally make significant use of broadcast or multicast. + +.. _teapot-networking-provisioning: + +Provisioning Network +-------------------- + +Generally bare-metal machines will need at least one interface connected to a +provisioning network in order to boot using :abbr:`PXE (Pre-boot execution +environment)`. Typically the provisioning network is required to be an untagged +VLAN. + +PXE can be avoided by provisioning using virtual media (where the BMC attaches +a virtual disk containing the boot image to the host's USB), but hardware +support for doing this from Ironic is uneven (though rapidly improving) and it +is considerably slower than PXE. In addition, the Ironic agent typically +communicates over this network for purposes such as introspection of hosts or +cleaning of disks. + +For the purpose of PXE booting, hosts could be left permanently connected to +the provisioning network provided they are isolated from each other (e.g. using +private VLANs). This would have the downside that the main network interface of +the tenant worker would have to appear on a tagged VLAN. However, the Ironic +agent's access to the Ironic APIs is unauthenticated, and therefore not safe to +be carried over networks that have hosts allocated to tenants connected to +them. This could occur over a separate network, but in any event hosts' +membership of this network will have to be changed dynamically in concert with +the baremetal provisioner. + +The :abbr:`BMC (Baseboard management controller)`\ s will be connected to a +separate network that is reachable only from the management cluster. + +.. _teapot-networking-storage: + +Storage Network +--------------- + +When (optionally) used in combination with multi-tenant storage, machines will +need to also be connected to a separate storage network. The networking +requirements for this network are much simpler, as it does not need to be +dynamically managed. Each edge port should be isolated from all of the others +(using e.g. Private VLANs), regardless of whether they are part of the same +tenant. :abbr:`QoS (Quality of Service)` rules should ensure that no individual +machine can effectively deny access to others. Configuring the switches for the +storage network can be considered out of scope for Project Teapot, at least +initially, as the configuration need not be dynamic, but might be in scope for +the :doc:`installer `. + +.. _teapot-networking-external: + +External connections +-------------------- + +Workloads running in a tenant cluster can request to be exposed for incoming +external connections in a number of different ways. The Teapot cloud is +responsible for ensuring that each of these is possible. + +The ``NodePort`` service type simply requires that the IP addresses of the +cluster members be routable from external networks. + +For IPv4 support in particular, Teapot will need to be able to allocate public +IP addresses and route traffic for them to the appropriate networks. +Traditionally this is done using :abbr:`NAT (Network Address Translation)` +(e.g. Floating IPs in OpenStack). 
Users can specify an externalAddress to make +use of public IPs within their cluster, although there's no built-in way to +discover what IPs are available. Teapot should also have a way of exporting the +:doc:`reverse DNS records ` for public IP addresses. + +The ``LoadBalancer`` Service type uses an external :doc:`load balancer +` as a front end. Traffic from the load balancer is directed +to a ``NodePort`` service within the tenant cluster. + +Most managed Kubernetes services provide an Ingress controller that can set up +load balancing (including :abbr:`TLS (Transport Layer Security)` termination) +in the underlying cloud for HTTP(S) traffic, including automatically +configuring public IPs. If Teapot provided :ref:`such an Ingress controller +`, it might be a viable option to not +support public IPs at all for the ``NodePort`` service type. In this case, the +implementation of public IPs could be confined to the :ref:`load balancing API +`, and the only stable public IP addresses +would be the Virtual IPs of the load balancers. Tenant IPv6 addresses could +easily be made publicly routable to provide direct access to ``NodePort`` +services over IPv6 only, although this also comes with the caveat that some +clients may be tempted to rely on the IP of a Service being static, when in +fact the only safe way to reference it is via a :doc:`DNS name ` exported +by ExternalDNS. + +Implementation Options +---------------------- + +.. _teapot-networking-ansible: + +Ansible Networking +~~~~~~~~~~~~~~~~~~ + +A good long-term implementation strategy might be to use ansible-networking to +directly configure the top-of-rack switches. This would be driven by a +Kubernetes controller running in the management cluster operating on a set of +Custom Resource Definitions (CRDs). The ansible-networking project supports a +wide variety of hardware already. A minimal proof of concept for this +controller `exists `_. + +In addition to configuring the edge switches, a solution for public IPs and +other ways of exposing services is also needed. Future requirements likely +include configuring limited cross-tenant network connectivity, and access to +hardware load balancers and other data center hardware. + +.. _teapot-networking-neutron: + +OpenStack Neutron +~~~~~~~~~~~~~~~~~ + +A good short-term option might be to use a cut-down Neutron installation as an +implementation detail to manage the network. Using only the baremetal port +types in Neutron circumvents a lot of the complexity. Most of the Neutron +agents would not be required, so message queue--based RPC could be eliminated +or replaced with json-rpc (as it has been in Ironic for Metal³). Since only a +trusted service would be controlling network changes, Keystone authentication +would not be required either. + +To ensure that Neutron itself could eventually be switched out, it would be +strictly confined behind a Kubernetes-native API, in much the same way as +Ironic is behind Metal³. The existing direct integration between Ironic and +Neutron would not be used, and nor could we rely on Neutron to provide an +integration point for e.g. :ref:`Octavia ` to +provide an abstraction over hardware load balancers. + +The abstraction point would be the Kubernetes CRDs -- different controllers +could be chosen to manage custom resources (and those might in turn make use of +additional non-public CRDs), but we would not attempt to build controllers with +multiple plugin points that could lead to ballooning complexity. 
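+
+To make the abstraction point concrete, the sort of custom resource that
+either back-end (Ansible- or Neutron-driven) might reconcile is sketched
+below. It is entirely hypothetical -- the API group, kind and fields are
+placeholders, not a proposed schema:
+
+.. code-block:: yaml
+
+   apiVersion: teapot.example.org/v1alpha1
+   kind: TenantNetwork
+   metadata:
+     name: tenant-a-workers
+   spec:
+     tenant: tenant-a
+     segmentation:
+       type: vlan        # or vxlan in larger deployments
+       id: 2107
+     # Edge ports to be attached to this network; maintained by the
+     # controller that allocates hosts to the tenant.
+     ports:
+       - switch: tor-rack-12
+         interface: Ethernet1/7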
diff --git a/doc/source/ideas/teapot/openstack-integration.rst b/doc/source/ideas/teapot/openstack-integration.rst
new file mode 100644
index 0000000..38e3e54
--- /dev/null
+++ b/doc/source/ideas/teapot/openstack-integration.rst
@@ -0,0 +1,109 @@
+Teapot and OpenStack
+====================
+
+Many potential users of Teapot have large existing OpenStack deployments.
+Teapot is not intended to be a wholesale replacement for OpenStack -- it does
+not deal with virtualisation at all, in fact -- so it is important that the two
+complement each other.
+
+.. _teapot-openstack-managed-services:
+
+Managed Services
+----------------
+
+A goal of Teapot is to make it easier for cloud providers to offer managed
+services to tenants. Attempts to do this in OpenStack, such as Trove_, have
+mostly foundered. The Kubernetes Operator pattern offers the most promising
+ground for building such services in future, and since Teapot is
+Kubernetes-native it would be well-placed to host them.
+
+Building a thin OpenStack-style REST API over such services would allow them to
+be used from an OpenStack cloud (presumably one sharing, or federated to, the
+same Keystone) at the same time. In fact, most such services could be decoupled
+from Teapot altogether and run in a generic Kubernetes cluster, so that they
+could benefit users of either cloud type even absent the other.
+
+Teapot's :ref:`load balancing API ` would arguably already be a managed
+service. :ref:`Octavia ` could possibly use it as a back-end, as a first
+example.
+
+.. _teapot-openstack-side-by-side:
+
+Side-by-side Clouds
+-------------------
+
+Teapot should be co-installable alongside an existing OpenStack cloud to
+provide additional value. In this configuration, the Teapot cloud would use the
+OpenStack cloud's :ref:`Keystone ` and any services that are expected to be
+found in the catalog (e.g. :ref:`Manila <teapot-storage-manila>`,
+:ref:`Cinder <teapot-storage-cinder>`, :ref:`Designate `).
+
+An OpenStack-style REST API in front of Teapot would allow users of the
+OpenStack cloud to create and manage bare-metal Kubernetes clusters in much the
+same way they do today with Magnum.
+
+Tenants would need a way to connect their Neutron networks in OpenStack to the
+Kubernetes clusters. Since Teapot tenant networks are :ref:`just V(x)LANs
+`, this could be accomplished by adding those networks as provider networks
+in Neutron, and allowing the correct tenants to connect to them via Neutron
+routers. This should be sufficient for the main use case, which would be
+running parts of an application in a Kubernetes cluster while other parts
+remain in OpenStack VMs.
+
+However, the ideal for this type of deployment would be to allow servers to be
+moved dynamically between the OpenStack and Teapot clouds. Sharing inventory
+with OpenStack's Ironic might be simple enough -- if Metal³ were configured to
+use the OpenStack cloud's Ironic, then a small component could claim hosts in
+OpenStack Placement and create corresponding BareMetalHost objects in Teapot.
+Both clouds would end up manipulating the top-of-rack switch configuration for
+a host, but presumably only at different times.
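+
+The BareMetalHost resources that such a component would create are the standard
+Metal³ kind; a minimal example might look like the following (the host name,
+namespace, MAC, and BMC address are placeholders for this sketch):
+
+.. code-block:: yaml
+
+   apiVersion: metal3.io/v1alpha1
+   kind: BareMetalHost
+   metadata:
+     name: rack12-host07                    # placeholder name
+     namespace: teapot-hosts                # assumed namespace
+   spec:
+     online: true
+     bootMACAddress: "52:54:00:ab:cd:ef"    # placeholder MAC
+     bmc:
+       address: ipmi://10.0.3.17            # placeholder BMC address
+       credentialsName: rack12-host07-bmc   # Secret holding BMC credentials
+
+The claiming component would be responsible for keeping such objects in step
+with the corresponding allocations in OpenStack Placement.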
+
+Switching hosts between acting as OpenStack compute nodes and being available
+to Teapot tenants would be more complex, since it would require interaction
+with the tool managing the OpenStack deployment, of which there are many.
+However, supporting autoscaling between the two is probably unnecessary.
+Manually moving hosts between the clouds should be manageable, since no changes
+to the physical network cabling would be required. Separate :ref:`provisioning
+networks <teapot-networking-provisioning>` would need to be maintained, since
+the provisioner needs control over DHCP.
+
+.. _teapot-openstack-on-teapot:
+
+OpenStack on Teapot
+-------------------
+
+To date, the most popular OpenStack installers have converged on Ansible as a
+deployment tool, because OpenStack's complexity demands tight control over the
+deployment workflow that purely declarative tools struggle to match. However,
+Kubernetes Operators present a declarative alternative that is nonetheless
+equally flexible. Even without Operators, Airship_ and StarlingX_ are both
+installing OpenStack on top of Kubernetes. It seems likely that in the future
+this will be a popular way of layering the two, and Teapot is well-placed to
+enable it since it provides bare-metal hosts running Kubernetes.
+
+For a large, shared OpenStack cloud, this would likely be best achieved by
+running the OpenStack control plane components inside the Teapot management
+cluster. Sharing of services would then be similar to the side-by-side case.
+OpenStack Compute nodes or e.g. Ceph storage nodes could be deployed using
+`Metal³`_. This effectively means building an OpenStack installation/management
+system similar to a TripleO undercloud, but based on Kubernetes.
+
+There is a second use case: running small OpenStack installations (similar to
+StarlingX) within a tenant. In that case, the tenant OpenStack would still need
+to access storage from the Teapot cloud. This could possibly be achieved by
+federating the tenant Keystone to Teapot's Keystone and using hierarchical
+multi-tenancy, so that projects in the tenant Keystone are actually
+sub-projects of the tenant's project in the Teapot Keystone. (The long-dead
+`Trio2o `_ project also offered a potential solution in the form of an API
+proxy, but probably not one worth resurrecting.) Use of an overlay network
+(e.g. OVN) would be required, since the tenant would have no access to the
+underlying network hardware. Some integration between the tenant's Neutron and
+Teapot would also need to be built to allow ingress traffic.
+
+
+.. _Trove: https://docs.openstack.org/trove/
+.. _Airship: https://www.airshipit.org/
+.. _StarlingX: https://www.starlingx.io/
+.. _Metal³: https://metal3.io/
diff --git a/doc/source/ideas/teapot/storage.rst b/doc/source/ideas/teapot/storage.rst
new file mode 100644
index 0000000..5f8de7a
--- /dev/null
+++ b/doc/source/ideas/teapot/storage.rst
@@ -0,0 +1,140 @@
+Teapot Storage
+==============
+
+Project Teapot should optionally be able to provide multi-tenant access to
+shared file, block, and/or object storage. Shared file and block storage
+capabilities are not currently available to Kubernetes users except through
+the cloud providers.
+
+Tenants can always choose to use hyperconverged storage -- that is to say, both
+compute and storage workloads on the same hosts -- without involvement or
+permission from Teapot (for example, by using Rook_). However, this means that
+compute and storage cannot be scaled independently; they are tightly coupled.
+Tenants with disproportionately large amounts of data but modest compute needs
+(and sometimes vice versa) would not be served efficiently. Hyperconverged
+storage also usually makes sense only for clusters that are essentially fixed:
+changing the size of the cluster results in rebalancing of storage, so it is
+not suitable for workloads that vary greatly over time (for instance, training
+of machine learning models).
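+
+As an illustration of the hyperconverged option, a tenant could install the
+Rook operator in their own cluster and describe a Ceph cluster with a resource
+roughly like the following (a sketch only; the Ceph image tag and the
+device-selection settings are illustrative, not recommendations):
+
+.. code-block:: yaml
+
+   apiVersion: ceph.rook.io/v1
+   kind: CephCluster
+   metadata:
+     name: rook-ceph
+     namespace: rook-ceph
+   spec:
+     cephVersion:
+       image: quay.io/ceph/ceph:v17   # illustrative Ceph release
+     dataDirHostPath: /var/lib/rook
+     mon:
+       count: 3
+     storage:
+       useAllNodes: true      # co-locate storage with compute on every worker
+       useAllDevices: true    # consume any unused local disks
+
+Because the OSDs live on the tenant's own workers, growing or shrinking the
+cluster triggers the rebalancing described above, which is exactly the coupling
+that a shared storage pool avoids.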
+
+Running hyperconverged storage efficiently also requires a somewhat specialised
+choice of servers. Particularly in a large cloud where different tenants have
+different storage requirements, it might be cheaper to provide a centralised
+storage cluster and thus require either fewer variants or less specialisation
+of server hardware.
+
+For all of these reasons, a shared storage pool is needed to take full
+advantage of the highly dynamic environment offered by a cloud like Teapot.
+
+Providing multi-tenant access to shared file and block storage allows the cloud
+provider to use a dedicated storage network (such as a :abbr:`SAN (Storage Area
+Network)`). Many potential users may already have something like this. Having
+the storage centralised also makes it easier and more efficient to share large
+amounts of data between tenants when required, since traffic can be confined to
+the same :ref:`storage network <teapot-networking-storage>` rather than
+traversing the public network.
+
+Applications can use object storage anywhere (including outside clouds), but to
+minimise network bandwidth it will often be better to have it nearby. Should
+the `proposal to add Object Bucket Provisioning `_ to Kubernetes eventuate,
+there will also be an advantage in having object storage as part of the local
+cloud, using the same authentication mechanism.
+
+Implementation Options
+----------------------
+
+OpenStack already provides robust, mature implementations of multi-tenant
+shared storage that are accessible from Kubernetes. The main task would be to
+integrate them into the system and simplify their deployment. These services
+would run in either the management cluster or a separate (but still
+centrally-managed) storage cluster.
+
+.. _teapot-storage-manila:
+
+OpenStack Manila
+~~~~~~~~~~~~~~~~
+
+Manila_ is the most natural fit for Kubernetes because it provides 'RWX'
+(Read/Write Many) persistent storage, which is often needed to avoid downtime
+when pods are upgraded or rescheduled to different nodes, as well as for
+applications in which multiple pods write to the same filesystem in parallel.
+
+Manila's architecture is relatively simple already. It would be helpful if the
+dependency on RabbitMQ could be removed (replaced with e.g. json-rpc, in the
+same way as has been done for Ironic in Metal³), but this would require more
+investigation. An Operator for deploying and managing Manila on Kubernetes is
+under development.
+
+A :abbr:`CSI (Container Storage Interface)` plugin for Manila already exists in
+cloud-provider-openstack_.
+
+.. _teapot-storage-cinder:
+
+OpenStack Cinder
+~~~~~~~~~~~~~~~~
+
+Cinder_ is more limited than Manila in the sense that it can provide only 'RWO'
+(Read/Write One) access to persistent storage for most applications.
+(Kubernetes volume mounts are generally file-based -- Kubernetes creates its
+own file system on block devices if none is present.) However, Kubernetes does
+now support raw block storage volumes, which *do* support 'RWX' mode for
+applications that can work with raw block offsets. KubeVirt in particular is
+expected to make use of raw block mode persistent volumes for backing virtual
+machines, so this is likely to be a common use case.
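+
+For example, an application (or KubeVirt) could request such a volume with a
+PersistentVolumeClaim like the following. This is a sketch only: the
+``teapot-block`` StorageClass name is an assumed placeholder for whatever class
+a Cinder-backed provisioner would expose, and 'RWX' access to a block volume
+still depends on the back-end supporting multi-attach:
+
+.. code-block:: yaml
+
+   apiVersion: v1
+   kind: PersistentVolumeClaim
+   metadata:
+     name: vm-disk-0                  # placeholder name
+   spec:
+     storageClassName: teapot-block   # assumed Cinder-backed StorageClass
+     volumeMode: Block                # raw block device, no filesystem
+     accessModes:
+       - ReadWriteMany                # 'RWX', as described above
+     resources:
+       requests:
+         storage: 100Gi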
+
+Much of the complexity in Cinder is linked to the need to provide agents
+running on Nova compute hosts. Since Teapot is a bare-metal-only service, only
+the parts of Cinder needed to provide storage to Ironic servers are required.
+Unfortunately, Cinder is quite heavily dependent on RabbitMQ. However, there
+may be scope for simplification through further work with the Cinder community.
+The remaining portions of Cinder are architecturally very similar to Manila, so
+similar results could be expected.
+
+Cinder has a dependency on Barbican for supporting encrypted volumes. Encrypted
+volume support is not required, but it would be nice to have, and is another
+reason to use :ref:`Barbican `. It is tempting to think that Cinder could be
+adapted to use Kubernetes Secrets instead (perhaps via another key-manager
+back-end for Castellan), but that doesn't actually provide the :doc:`level of
+security you would hope for <key-management>` without Barbican or an equivalent
+anyway.
+
+A :abbr:`CSI (Container Storage Interface)` plugin for Cinder already exists in
+cloud-provider-openstack_.
+
+Ember_ is an alternative CSI plugin that makes use of cinderlib rather than all
+of Cinder. This allows Cinder's hardware drivers to be used directly from
+Kubernetes while eliminating a lot of overhead. However, some of the overhead
+that is eliminated is the API that enforces multi-tenancy, so Ember is not an
+option for this particular use case.
+
+.. _teapot-storage-swift:
+
+OpenStack Swift
+~~~~~~~~~~~~~~~
+
+Swift_ is a very mature object storage system, with both a native API and the
+ability to emulate Amazon S3. It supports :ref:`Keystone ` authentication,
+and it has a relatively simple architecture that should make it straightforward
+to deploy on top of Kubernetes.
+
+.. _teapot-storage-radosgw:
+
+Ceph Object Gateway
+~~~~~~~~~~~~~~~~~~~
+
+RadosGW_ is a service that provides an object storage interface backed by Ceph,
+with two APIs that are compatible with large subsets of Swift and Amazon S3,
+respectively. It can use either :ref:`Keystone ` or :ref:`Keycloak ` for
+authentication, and it can be installed and managed using the Rook_ operator.
+
+
+.. _Rook: https://rook.io/
+.. _cloud-provider-openstack: https://github.com/kubernetes/cloud-provider-openstack#readme
+.. _Manila: https://docs.openstack.org/manila/latest/
+.. _Cinder: https://docs.openstack.org/cinder/latest/
+.. _Ember: https://ember-csi.io/
+.. _Swift: https://docs.openstack.org/swift/latest/
+.. _RadosGW: https://docs.ceph.com/docs/master/radosgw/
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 3348f82..f72783b 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -87,8 +87,9 @@ Proposed ideas
 ==============
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :titlesonly:
    :glob:
 
+   ideas/*/index
    ideas/*