Merge "Initial documentation for infra-cloud"
This commit is contained in:
commit
5597a47fcd
@ -27,6 +27,7 @@ Contents:
|
||||
test-infra-requirements
|
||||
sysadmin
|
||||
systems
|
||||
infra-cloud
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
|
191
doc/source/infra-cloud.rst
Normal file
191
doc/source/infra-cloud.rst
Normal file
@ -0,0 +1,191 @@
:title: Infra Cloud

.. _infra_cloud:

Infra Cloud
###########

Introduction
============

With donated hardware and datacenter space, we can run an optimized
semi-private cloud for the purpose of adding testing capacity and also
with an eye toward "dogfooding" OpenStack itself.

Current Status
==============

Currently this cloud is in the planning and design phases. This section
will be updated or removed as that changes.

Mission
=======

The infra-cloud's mission is to turn donated raw hardware resources into
expanded capacity for the OpenStack infrastructure nodepool.

Methodology
===========

Infra-cloud is run like any other infra-managed service. Puppet modules
and Ansible do the bulk of configuring hosts, and Gerrit code review
drives 99% of activities, with logins used only for debugging and
repairing the service.

Requirements
============

* Compute - The intended workload is mostly nodepool-launched Jenkins
  slaves. Thus flavors that are capable of running these tests in a
  reasonable amount of time must be available (see the example after
  this list). The flavor(s) must provide:

  * 8GB RAM

  * 8 vCPUs

  * 30GB root disk

* Images - Image upload must be allowed for nodepool.

* Uptime - Because there are other clouds that can keep some capacity
  running, 99.9% uptime should be acceptable.

* Performance - The performance of compute and networking in infra-cloud
  should be at least as good as, if not better than, the other nodepool
  clouds that infra uses today.

* Infra-core - Infra-core is in charge of running the service.
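
A minimal sketch of satisfying the flavor and image requirements above with
the standard ``openstack`` CLI; the flavor name, image name and file path
are placeholders, not decisions that have been made:

.. code-block:: console

   # Flavor providing the minimum resources nodepool slaves need.
   $ openstack flavor create --ram 8192 --vcpus 8 --disk 30 nodepool-8gb

   # Nodepool must be able to upload its own images.
   $ openstack image create --disk-format qcow2 --container-format bare \
         --file ./template.qcow2 template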

Implementation
==============

Multi-Site
----------

There are at least two "sites" with a collection of servers in each
site. Each site will have its own cloud, and these clouds will share no
infrastructure or data. The racks may be in the same physical location,
but they will be managed as if they are not.

HP1
~~~

The HP1 site has 48 machines. Each machine has 96GB of RAM, 1.8TiB of disk
and 24 cores of Intel Xeon X5650 @ 2.67GHz processors.

HP2
~~~

The HP2 site has 100 machines. Each machine has 96GB of RAM, 1.8TiB of disk
and 32 cores of Intel Xeon E5-2670 0 @ 2.60GHz processors.
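
Purely as a back-of-the-envelope illustration (not a commitment), memory
bounds how many of the 8GB slaves from the requirements above each site
could theoretically host, ignoring overcommit, controller nodes and host
overhead:

.. code-block:: console

   # HP1: 48 hosts * 96GB each / 8GB per slave
   $ echo $((48 * 96 / 8))
   576
   # HP2: 100 hosts * 96GB each / 8GB per slave
   $ echo $((100 * 96 / 8))
   1200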

Software
--------

Infra-cloud runs the most recent OpenStack stable release. During the
period following a release, plans must be made to upgrade as soon as
possible. In the future the cloud may be continuously deployed.

Management
----------

* An "Ironic Controller" machine is installed by hand into each site. That
  machine is enrolled into the puppet/ansible infrastructure.

* An all-in-one, single-node OpenStack cloud with Ironic as the Nova driver
  is installed on each Ironic Controller node. The OpenStack Cloud
  produced by this installation will be referred to as "Ironic Cloud
  $site". In order to keep things simpler, these do not share anything
  with the cloud that nodepool will make use of.

* Each additional machine in a site will be enrolled into the Ironic Cloud
  as a bare metal resource (see the enrollment sketch after this list).

* Each Ironic Cloud $site will be added to the list of available clouds that
  launch_node.py, or the ansible replacement for it, can use to spin up
  long-lived servers.

* An OpenStack Cloud with KVM as the hypervisor will be installed using
  launch_node and the OpenStack puppet modules, as per normal infra
  installation of services.

* As with all OpenStack services, metrics will be collected in public
  cacti and graphite services. The particular metrics are TBD.

* As a cloud has a large amount of pertinent log data, a public ELK cluster
  will be needed to capture and expose it.

* All Infra services run on the public internet, and the same will be true
  for the Infra Clouds and the Ironic Clouds. Insecure services that need
  to be accessible across machine boundaries will employ per-IP iptables
  rules rather than relying on a squishy middle (see the sketch after
  this list).
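
Illustrative sketches of two of the steps above; the Ironic driver name,
IPMI credentials, addresses and ports are placeholders, not decisions that
have been made:

.. code-block:: console

   # Enroll an additional machine into the Ironic Cloud as a bare metal
   # resource (all values are examples only).
   $ openstack baremetal node create --driver ipmi \
         --driver-info ipmi_address=192.0.2.10 \
         --driver-info ipmi_username=admin \
         --driver-info ipmi_password=secret

   # Expose an insecure service (MySQL here) only to a specific peer IP
   # instead of trusting a "squishy middle" network.
   $ iptables -A INPUT -p tcp --dport 3306 -s 192.0.2.20 -j ACCEPT
   $ iptables -A INPUT -p tcp --dport 3306 -j DROP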

Architecture
------------

The generally accepted "Controller" and "Compute" layout is used,
with controllers running all non-compute services and compute nodes
running only nova-compute and supporting services.

* The cloud is deployed with two controllers in a DRBD storage pair
  configured ACTIVE/PASSIVE, with a VIP shared between the two.
  This is done to avoid complications with Galera and RabbitMQ, at
  the cost of making failovers more painful and under-utilizing the
  passive stand-by controller.

* The cloud will use KVM because it is the default free hypervisor and
  has the widest user base in OpenStack.

* The cloud will use Neutron configured for Provider VLAN because we
  do not require tenant isolation and this simplifies our networking on
  compute nodes.

* The cloud will not use floating IPs because every node will need to be
  reachable via routable IPs and thus there is no need for separation. Also,
  Nodepool is under our control, so we don't have to worry about DNS TTLs
  or anything else causing a need for a particular endpoint to remain at
  a stable IP.

* The cloud will not use security groups because these are single-use VMs
  and they will configure any firewall inside the VM.

* The cloud will use MySQL because it is the default in OpenStack and has
  the widest user base.

* The cloud will use RabbitMQ because it is the default in OpenStack and
  has the widest user base. We don't have scaling demands that come close
  to pushing the limits of RabbitMQ.

* The cloud will run swift as a backend for glance so that we can scale
  image storage out as the need arises.

* The cloud will run the keystone v3 and glance v2 APIs because these are
  the versions upstream recommends using.

* The cloud will not use the glance task API for image uploads; it will use
  the PUT interface, because the task API does not function and we are not
  expecting a wide user base to be uploading many images simultaneously
  (see the upload sketch after this list).

* The cloud will provide DHCP directly to its nodes because we trust DHCP.

* The cloud will have config drive enabled because we believe it to be more
  robust than the EC2-style metadata service.

* The cloud will not have the metadata service enabled because we do not
  believe it to be robust.
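
A minimal sketch of uploading image data through the glance v2 PUT
interface rather than the task API; the endpoint URL, token and image ID
are placeholders:

.. code-block:: console

   # Register the image record, then PUT the raw bytes to glance v2.
   $ openstack image create --disk-format qcow2 --container-format bare \
         template
   $ curl -X PUT -H "X-Auth-Token: $TOKEN" \
         -H "Content-Type: application/octet-stream" \
         --data-binary @template.qcow2 \
         https://glance.example.com/v2/images/$IMAGE_ID/file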

Networking
----------

Neutron is used, with a single `provider VLAN`_ attached to VMs for the
simplest possible networking. DHCP is configured to hand each machine a
routable IP which can be reached directly from the internet, to facilitate
nodepool/zuul communications.

.. _provider VLAN: http://docs.openstack.org/networking-guide/deploy_scenario4b.html

Each site will need two VLANs: one for the public IPs, to which every NIC
of every host will be attached. That VLAN will get a publicly routable /23.
There should also be a second VLAN that is connected only to the NIC of the
Ironic Cloud and is routed to the IPMI management network of all of the
other nodes. Whether we use LinuxBridge or Open vSwitch is still TBD. A
sketch of creating the public provider network follows.
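
A minimal sketch of creating the public provider VLAN network and its
subnet with the ``openstack`` CLI; the physical network name, VLAN segment
ID and address range are placeholders (a private range stands in for the
publicly routable /23), not allocations:

.. code-block:: console

   $ openstack network create --external --share \
         --provider-network-type vlan \
         --provider-physical-network provider \
         --provider-segment 101 public
   $ openstack subnet create --network public --dhcp \
         --subnet-range 10.0.0.0/23 public-subnet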