Merge "Initial documentation for infra-cloud"

2015-07-13 22:59:09 +00:00
parent 12f360bd4a 9b95458143
commit 5597a47fcd
2 changed files with 192 additions and 0 deletions
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -27,6 +27,7 @@ Contents:
   test-infra-requirements
   sysadmin
   systems
+   infra-cloud

 .. toctree::
   :hidden:
--- a/doc/source/infra-cloud.rst
+++ b/doc/source/infra-cloud.rst
@@ -0,0 +1,191 @@
+:title: Infra Cloud
+
+.. _infra_cloud:
+
+Infra Cloud
+###########
+
+Introduction
+============
+
+With donated hardware and datacenter space, we can run an optimized
+semi-private cloud for the purpose of adding testing capacity and also
+with an eye toward "dogfooding" OpenStack itself.
+
+Current Status
+==============
+
+Currently this cloud is in the planning and design phases. This section
+will be updated or removed as that changes.
+
+Mission
+=======
+
+The infra-cloud's mission is to turn donated raw hardware resources into
+expanded capacity for the OpenStack infrastructure nodepool.
+
+Methodology
+===========
+
+Infra-cloud is run like any other infra-managed service. Puppet modules
+and Ansible do the bulk of configuring hosts, and Gerrit code review
+drives 99% of activities, with logins used only for debugging and
+repairing the service.
+
+Requirements
+============
+
+ * Compute - The intended workload is mostly nodepool-launched Jenkins
+   slaves. Thus flavors that are capable of running these tests in a
+   reasonable amount of time must be available. The flavor(s) must provide:
+
+    * 8GB RAM
+
+    * 8 * `vcpu`
+
+    * 30GB root disk
+
+ * Images - Image upload must be allowed for nodepool.
+
+ * Uptime - Because there are other clouds that can keep some capacity
+   running, 99.9% uptime should be acceptable.
+
+ * Performance - The performance of compute and networking in infra-cloud
+   should be at least as good as, if not better than, the other nodepool
+   clouds that infra uses today.
+
+ * Infra-core - Infra-core is in charge of running the service.
+
+Implementation
+==============
+
+Multi-Site
+----------
+
+There are at least two "sites" with a collection of servers in each
+site. Each site will have its own cloud, and these clouds will share no
+infrastructure or data. The racks may be in the same physical location,
+but they will be managed as if they are not.
+
+HP1
+~~~
+
+The HP1 site has 48 machines. Each machine has 96G of RAM, 1.8TiB of disk and
+24 Cores of Intel Xeon X5650 @ 2.67GHz processors.
+
+HP2
+~~~
+
+The HP2 site has 100 machines. Each machine has 96G of RAM, 1.8TiB of disk and
+32 Cores of Intel Xeon E5-2670 0 @ 2.60GHz processors.
+
+Software
+--------
+
+Infra-cloud runs the most recent OpenStack stable release. During the
+period following a release, plans must be made to upgrade as soon as
+possible. In the future the cloud may be continuously deployed.
+
+Management
+----------
+
+ * An "Ironic Controller" machine is installed by hand into each site. That
+   machine is enrolled into the Puppet/Ansible infrastructure.
+
+ * An all-in-one, single-node OpenStack cloud with Ironic as the Nova driver
+   is installed on each Ironic Controller node. The OpenStack Cloud
+   produced by this installation will be referred to as "Ironic Cloud
+   $site". In order to keep things simpler, these do not share anything
+   with the cloud that nodepool will make use of.
+
+ * Each additional machine in a site will be enrolled into the Ironic Cloud
+   as a bare metal resource.
+
+ * Each Ironic Cloud $site will be added to the list of available clouds that
+   launch_node.py or the Ansible replacement for it can use to spin up
+   long-lived servers.
+
+ * An OpenStack Cloud with KVM as the hypervisor will be installed using
+   launch_node and the OpenStack puppet modules as per normal infra
+   installation of services.
+
+ * As with all OpenStack services, metrics will be collected in public
+   cacti and graphite services. The particular metrics are TBD.
+
+ * As a cloud has a large amount of pertinent log data, a public ELK cluster
+   will be needed to capture and expose it.
+
+ * All Infra services run on the public internet, and the same will be true
+   for the Infra Clouds and the Ironic Clouds. Insecure services that need
+   to be accessible across machine boundaries will employ per-IP iptables
+   rules rather than relying on a squishy middle.
+
+Architecture
+------------
+
+The generally accepted "Controller" and "Compute" layout is used,
+with controllers running all non-compute services and compute nodes
+running only nova-compute and supporting services.
+
+  * The cloud is deployed with two controllers in a DRBD storage pair
+    with ACTIVE/PASSIVE configured and a VIP shared between the two.
+    This is done to avoid complications with Galera and RabbitMQ at
+    the cost of making failovers more painful and under-utilizing the
+    passive stand-by controller.
+
+  * The cloud will use KVM because it is the default free hypervisor and
+    has the widest user base in OpenStack.
+
+  * The cloud will use Neutron configured for Provider VLAN because we
+    do not require tenant isolation and this simplifies our networking on
+    compute nodes.
+
+  * The cloud will not use floating IPs because every node will need to be
+    reachable via routable IPs and thus there is no need for separation. Also,
+    Nodepool is under our control, so we don't have to worry about DNS TTLs
+    or anything else causing a need for a particular endpoint to remain at
+    a stable IP.
+
+  * The cloud will not use security groups because these are single-use VMs
+    and they will configure any firewall inside the VM.
+
+  * The cloud will use MySQL because it is the default in OpenStack and has
+    the widest user base.
+
+  * The cloud will use RabbitMQ because it is the default in OpenStack and
+    has the widest user base. We don't have scaling demands that come close
+    to pushing the limits of RabbitMQ.
+
+  * The cloud will run swift as a backend for glance so that we can scale
+    image storage out as need arises.
+
+  * The cloud will run keystone v3 and glance v2 APIs because these are the
+    versions upstream recommends using.
+
+  * The cloud will not use the glance task API for image uploads; it will use
+    the PUT interface because the task API does not function and we are not
+    expecting a wide user base to be uploading many images simultaneously.
+
+  * The cloud will provide DHCP directly to its nodes because we trust DHCP.
+
+  * The cloud will have config drive enabled because we believe it to be more
+    robust than the EC2-style metadata service.
+
+  * The cloud will not have the metadata service enabled because we do not
+    believe it to be robust.
+
+Networking
+----------
+
+Neutron is used, with a single `provider VLAN`_ attached to VMs for the
+simplest possible networking. DHCP is configured to hand the machine a
+routable IP which can be reached directly from the internet to facilitate
+nodepool/zuul communications.
+
+.. _provider VLAN: http://docs.openstack.org/networking-guide/deploy_scenario4b.html
+
+Each site will need 2 VLANs. The first is for the public IPs, and every NIC of
+every host will be attached to it. That VLAN will get a publicly routable /23.
+There should also be a second VLAN that is connected only to the NIC of the
+Ironic Cloud and is routed to the IPMI management network of all of the other
+nodes. Whether we use LinuxBridge or Open vSwitch is still TBD.