openstack-helm/docs/guides-welcome/welcome-resiliency.md

## Welcome: Resiliency Philosophy

One of the goals of this project is to produce a set of charts that can be used in a production setting to deploy and upgrade OpenStack.  To achieve this goal, all components must be resilient, including both OpenStack and Infrastructure components leveraged by this project.  In addition, this also includes Kubernetes itself.  It is part of our mission to ensure that all infrastructure components are highly available and that a deployment can withstand a physical host failure out of the box. This means that:

- OpenStack components need to support and deploy with multiple replicas out of the box to ensure that each chart is deployed as a single-unit production ready first class citizen (unless development mode is enabled).
- Infrastructure elements such as Ceph, RabbitMQ, Galera (MariaDB), Memcached, and all others need to support resiliency and leverage multiple replicas for resiliency where applicable.  These components also need to validate that their application level configurations (for instance the underlying Galera cluster) can tolerate host crashes and withstand physical host failures.
- Scheduling annotations need to be employed to ensure maximum resiliency for multi-host environments.  They also need to be flexible to allow all-in-one deployments.  To this end, we promote the usage of `podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution` for most infrastructure elements.
- We make the assumption that we can depend on a reliable implementation of centralized storage to create PVCs within Kubernetes to support resiliency and complex application design.  Today, this is provided by the included Ceph chart. There is much work to do when making even a single backend production ready. We have chosen to focus on bringing Ceph into a production ready state, which includes handling real world deployment scenarios, resiliency, and pool configurations. In the future we would like to support more options for hardened backend PVC's. In the future, we would like to offer flexibility in choosing a hardened backend.
- We will document the best practices for running a resilient Kubernetes cluster in production.  This includes documenting the steps necessary to make all components resilient, such as Etcd and SkyDNS where possible, and point out gaps due to missing features.