4628 Commits

Author SHA1 Message Date
Zuul
e63611ad1a Merge "ng-1: Add nodegroup representation" 2019-03-24 20:48:47 +00:00
Zuul
f1f96e5835 Merge "add python 3.7 unit test job" 2019-03-22 17:24:41 +00:00
Zuul
5c586aed3c Merge "Fix openstack-cloud-controller-manager restarts" 2019-03-21 21:18:34 +00:00
Theodoros Tsioutsias
0607c7a9d6 ng-1: Add nodegroup representation
This adds the object and db schema changes needed for supporting
nodegroups.

story: 2005266

Change-Id: Ibf10277a52aa94c4b217cf3b364844b04baab1e0
2019-03-21 16:19:56 +00:00
Diogo Guerra
a46d2ffc91 [k8s] Install prometheus monitoring with helm
The Helm stable repository includes a prometheus-operator chart.
This stable/prometheus-operator chart can be used to install
prometheus together with all of its dependencies and some sensible
default configurations.
The installed extra charts are:
  * stable/prometheus-node-exporter (data scraping)
  * stable/prometheus (prometheus and alertmanager server)
  * stable/grafana (visualization dashboard)
  * stable/prometheus-operator (supervision and simple configuration)

The prometheus-operator is installed by setting the label
monitoring_enabled=True. The label grafana_admin_passwd can be used
to set the admin password for access to the grafana dashboard.

This patch allows the maintenance of the prometheus monitoring stack
to be handed over to the kubernetes/helm community.
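
For illustration, a cluster enabling the stack might be created like
this (the label names come from this change; the CLI invocation,
template name and password value are illustrative only):

openstack coe cluster create my-cluster \
    --cluster-template k8s-fedora-atomic \
    --labels monitoring_enabled=True,grafana_admin_passwd=s3cr3t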

Task: 28544
Story: 2004623
depends_on: I99d3a78085ba10030200f12bbfe58a72964e2326
Change-Id: I80d590785bf30f9d634debeaf51c0d4cce0aeb93
Signed-off-by: Diogo Guerra <dy090.guerra@gmail.com>
8.0.0.0rc1
2019-03-21 13:25:04 +01:00
Zuul
d1957c71dc Merge "Improve floating IP allocation" 2019-03-20 18:12:43 +00:00
Diogo Guerra
21acb8dc9a Fix openstack-cloud-controller-manager restarts
Openstack-cloud-controller-manager restarts several times during
cluster creation.

This happens because cloud-controller-manager starts running before
the secrets it needs exist in kubernetes. Cloud-controller-manager
lists secrets; if a secret exists it uses it and moves on, but if the
secret doesn't exist yet it starts a watch until it does. As the
watch is not allowed, the pod fails.

This is triggered by the upstream issue
https://github.com/kubernetes/cloud-provider-openstack/issues/545

Story: 2005270

Change-Id: If8f34dc45b3b8a76e3d561ed41b4d0a783ceecb5
Signed-off-by: Diogo Guerra <dy090.guerra@gmail.com>
2019-03-20 14:55:23 +01:00
Zuul
342023e870 Merge "Migrate legacy jobs to Ubuntu Bionic" 2019-03-20 08:15:57 +00:00
Lingxian Kong
c47fde0cbe Improve floating IP allocation
- Never allocate a floating IP for the etcd service.
- Introduce a new label `master_lb_floating_ip_enabled` which controls
  whether Magnum allocates a floating IP for the master load balancer.
  This label only takes effect when `master_lb_enabled` is set. The
  default value is the same as `floating_ip_enabled` (see the example
  below).
- The `floating_ip_enabled` property now only controls whether Magnum
  should allocate floating IPs for the master and worker nodes.
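
A usage sketch (the label comes from this change; the cluster and
template names are illustrative):

openstack coe cluster create my-cluster \
    --cluster-template k8s-template \
    --master-count 3 \
    --labels master_lb_floating_ip_enabled=False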

Change-Id: I0a232406deaf112b0cb9e445735d7b49206c676d
Story: #2005153
Task: #29868
2019-03-20 18:44:45 +13:00
Zuul
0cd35dbcca Merge "Support <ClusterID>/actions/resize API" 2019-03-19 22:16:15 +00:00
Feilong Wang
15ecdb8033 Support <ClusterID>/actions/resize API
An OpenStack driver for the Kubernetes Cluster Autoscaler is being
proposed to support autoscaling when running a k8s cluster on top of
OpenStack. However, there is currently no way in Magnum for an
external consumer to control which node will be removed. The
alternative is calling the Heat API directly, but that is clearly
not the best solution and it confuses the k8s community. So with
this patch, we're going to add a new API:

POST <ClusterID>/actions/resize

And the post body will be:

{
    "node_count": 3,
    "nodes_to_remove": ["dd9cc5ed-3a2b-11e9-9233-fa163e46bcc2"],
    "nodegroup": "production_group"
}

The API works in a declarative way. For example, if there are
3 nodes in the cluster, a user can send a request like the one
above. Magnum will first call Heat to remove the node
dd9cc5ed-3a2b-11e9-9233-fa163e46bcc2, then bring the node count
back to 3 again.
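
A sketch of calling the new endpoint directly (the /v1/clusters path
and the token/endpoint variables are assumptions; the request body is
the example above):

curl -X POST \
    -H "X-Auth-Token: ${OS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"node_count": 3, "nodes_to_remove": ["dd9cc5ed-3a2b-11e9-9233-fa163e46bcc2"]}' \
    "${MAGNUM_ENDPOINT}/v1/clusters/<ClusterID>/actions/resize"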

Task: 29563
Story: 2005052

Change-Id: I7e36ce82c3f442976cc498153950b19c56a1759f
2019-03-19 20:13:17 +00:00
Spyros Trigazis
13e8c11f78 k8s_fedora: Add ca_key before all deployments
The script [1] that writes the ca.key depends on the apiserver being
up, while the script that starts the apiserver [0] needs the ca.key
to exist.

Write the ca_key before all other scripts that depend on the apiserver.

story: 2005254
task: 30051

[0]
https://github.com/openstack/magnum/blob/master/magnum/drivers/common/templates/kubernetes/fragments/enable-services-master.sh
[1]
https://github.com/openstack/magnum/blob/master/magnum/drivers/k8s_fedora_atomic_v1/templates/kubecluster.yaml#L843

Change-Id: If532ccc4673225eb1b7e7cab77a30950ee5ee695
Signed-off-by: Spyros Trigazis <spyridon.trigazis@cern.ch>
2019-03-18 10:48:06 +01:00
Zuul
0da8288ada Merge "ci: Disable functional tests" 2019-03-13 11:26:21 +00:00
ghanshyam
b5a6ee1dc1 Migrate legacy jobs to Ubuntu Bionic
We migrated the zuulv3 jobs to Bionic during December/January.
 - http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000837.html
 - https://etherpad.openstack.org/p/devstack-bionic
But that effort did not move all gate jobs to Bionic, as a large
number of jobs are still legacy jobs. All the legacy jobs still
use Xenial as their nodeset.

Per the decided runtime for Stein, we need to test everything on OpenStack
CI/CD on Bionic - https://governance.openstack.org/tc/reference/runtimes/stein.html

The patch below moves the legacy base jobs to Bionic, which automatically
moves the derived jobs to Bionic as well. These jobs are modified with a
branch variant so that they use Bionic nodes from Stein onwards and Xenial
for all stable branches up to stable/rocky.
- https://review.openstack.org/#/c/639096

This commit removes the overridden nodeset from the magnum legacy jobs
so that they start using the nodeset defined in the parent job.

More details:
- https://etherpad.openstack.org/p/legacy-job-bionic
- http://lists.openstack.org/pipermail/openstack-discuss/2019-March/003614.html

Depends-On: https://review.openstack.org/#/c/641886/
Change-Id: Ia5f037432f4c5925f916e19cbe8a3253869674d9
2019-03-13 01:24:50 +00:00
Zuul
e6f4969539 Merge "[fedora-atomic-k8s] Adding Node Problem Detector" 2019-03-12 22:05:22 +00:00
Feilong Wang
c39f1150e5 [fedora-atomic-k8s] Adding Node Problem Detector
Deploy Node Problem Detector to all nodes to detect problems which
can be leveraged by auto healing. This is the first step toward
enabling the auto healing feature.
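
A quick check that the detector landed on every node (the namespace
and DaemonSet name are assumptions, not confirmed by this change):

kubectl -n kube-system get daemonset node-problem-detector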

Task: 29886
Story: 2004782

Change-Id: I1b6075025c5f369821b4136783e68b16535dc6ef
2019-03-11 22:39:50 +00:00
Zuul
988cbb8b49 Merge "Add missing ws separator between words" 2019-03-11 22:17:41 +00:00
Spyros Trigazis
16c2a4cfe3 ci: Disable functional tests
We currently run only vexxhost with nested
virtualization. Due to a kernel change all
functional jobs are failing.

Change-Id: I9ab45da36dbc5618587b4795658b4f4bb264f2c8
Signed-off-by: Spyros Trigazis <spyridon.trigazis@cern.ch>
2019-03-11 20:20:22 +01:00
Jonathan Rosser
2595fda3e3 Ensure http proxy environment is available during 'atomic install' for k8s
The scripts run by cloud-init for the master and minion nodes currently
write proxy environment variables into /etc/bashrc when they are defined.

These variables are only introduced into the running environment when a
new bash shell is started. The /bin/sh used by the fragment scripts
ignores /etc/bashrc, so the new shells invoked per fragment do not have
the http proxy variables present. This means that master/minion node
deployment fails when behind an http proxy.

This patch adds explicit exports for HTTP_PROXY and HTTPS_PROXY when those
variables are defined and not empty.
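
A sketch of the shape of the fix, assuming the fragments keep the
variable names above; the actual script may differ:

if [ -n "${HTTP_PROXY}" ]; then
    export HTTP_PROXY
fi
if [ -n "${HTTPS_PROXY}" ]; then
    export HTTPS_PROXY
fi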

Task: 29863
Change-Id: Id05c90d5bf99d720ae6002b38d3291e364e1e0c4
2019-03-07 22:16:38 +00:00
Zuul
90dfeaa491 Merge "Fix swarm functional job" 2019-03-07 21:37:46 +00:00
Zuul
24775e0eb3 Merge "Update min tox version to 2.0" 2019-03-07 21:37:45 +00:00
Zuul
f0175f6aac Merge "[k8s] Make flannel self-hosted" 2019-03-07 21:37:40 +00:00
Zuul
722fc56eb3 Merge "Return health_status for cluster listing" 2019-03-07 11:05:58 +00:00
Zuul
373286368d Merge "make sure to set node_affinity_policy for Mesos template definition" 2019-03-06 21:10:57 +00:00
Zuul
c11c40a04d Merge "Fix prometheus installation script" 2019-03-06 15:44:39 +00:00
Zuul
6505aa360d Merge "Do not exit in the enable-helm-tiller script" 2019-03-06 09:46:49 +00:00
Spyros Trigazis
2ab874a5be [k8s] Make flannel self-hosted
Similar to calico, deploy flannel as a DS.
Flannel can use the kubernetes API to store
data, so it doesn't need to contact the etcd
server directly anymore.

This patch drops two relatively large files for
flannel's config, flannel-config-service.sh and
write-flannel-config.sh. All required config is
in the manifests.

Additional options to the controller manager:
--allocate-node-cidrs=true and --cluster-cidr.
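
For illustration, the controller manager invocation then gains roughly
the following flags (the CIDR value is a placeholder, not the template
default):

kube-controller-manager \
    --allocate-node-cidrs=true \
    --cluster-cidr=10.100.0.0/16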

Change-Id: I4f1129e155e2602299394b5866165260f4ea0df8
story: 2002751
task: 24870
2019-03-05 18:33:45 +01:00
Nguyen Hai Truong
18fc68dd26 Update min tox version to 2.0
The commands used by constraints need at least tox 2.0.
Update to reflect reality, which should help with local running of
constraints targets.

Change-Id: Iece749b90ec90bec1f5324bc351878e6252720ed
2019-03-05 11:56:54 +11:00
Feilong Wang
83c8b13bf0 Release k8s v1.11.8, v1.12.6 and v1.13.4
Release new k8s versions because of CVE-2019-1002100 [1]

[1] https://discuss.kubernetes.io/t/kubernetes-security-announcement-v1-11-8-1-12-6-1-13-4-released-to-address-medium-severity-cve-2019-1002100/5147

Task: 29789
Story: 2005124

Change-Id: I6435a10b05932ea71e825e944d53859eba374e91
2019-03-03 20:55:47 +00:00
Guang Yee
a47f5a3994 make sure to set node_affinity_policy for Mesos template definition
Fixes the problem with Mesos cluster creation where the
nodes_affinity_policy was not properly conveyed; it is required
in order to create the corresponding server group in Nova.

Change-Id: Ie8d73247ba95f20e24d6cae27963d18b35f8715a
story: 2005116
2019-03-01 15:49:06 -08:00
Zuul
e256f87d1a Merge "[k8s-fedora-atomic] Use ClusterIP for prometheus service" 2019-03-01 02:36:49 +00:00
Feilong Wang
e4b05bbd1a Fix swarm functional job
The swarm functional job currently fails due to a regression caused by
If11ba863a2aa538efe1e3e850084bdd33afd27d2. This patch fixes it.

Task: 29766
Story: 2004195

Change-Id: I830ab66775e0dd57766cdab25d06500d85651dc1
2019-03-01 14:36:33 +13:00
Lingxian Kong
2cf4df0850 Fix prometheus installation script
- Fix the indentation in the file.
- Use 'kubectl apply' instead of 'kubectl create' for a more robust
  service restart (see the example below).
- Do not retry infinitely when the Prometheus datasource has already
  been injected into Grafana.
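
For example (the file name is illustrative), 'kubectl apply' is
idempotent where 'kubectl create' fails on a second run:

kubectl apply -f prometheus-service.yaml   # creates, or updates in place
kubectl create -f prometheus-service.yaml  # errors with AlreadyExists on re-run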

Story: #2005117
Task: #29765

Change-Id: I5857fe62f922d27860946fd318296950834a8797
2019-03-01 14:16:36 +13:00
Feilong Wang
8c8cd7d199 Return health_status for cluster listing
Task: 29761
Story: 2002742

Change-Id: If702584fabe1402257b45db281561a5f5b83b972
2019-03-01 12:08:01 +13:00
Lingxian Kong
3695536085 Do not exit in the enable-helm-tiller script
The scripts included in the Heat kube_cluster_config resource should not exit
if the particular step is skipped.
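
Since the fragments run as one concatenated script, an exit in a
skipped step would abort every step after it. A sketch of the pattern
(the variable name is illustrative):

if [ "${TILLER_ENABLED}" = "true" ]; then
    echo "installing helm tiller"
    # ... install tiller here ...
else
    echo "helm tiller disabled, skipping"
fi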

Change-Id: I2d4cf54631c8ed3a9eb30b3e6c8e1af0007e23d5
Story: #2005109
Task: #29743
2019-03-01 12:03:52 +13:00
Zuul
57a3b73fa0 Merge "Fix async reserved word in python3.7" 2019-02-28 17:03:18 +00:00
Zuul
c181fce90d Merge "FakeLoopingCall raises IOError" 2019-02-28 17:03:13 +00:00
Zuul
6d85d7be56 Merge "python3 fix: decode binary cert data if encountered" 2019-02-28 11:28:40 +00:00
Theodoros Tsioutsias
14b46ea22b FakeLoopingCall raises IOError
All unit tests using FakeLoopingCall raise an IOError if an initial
delay is not specified, because the default initial_delay is -1.
Change the default initial delay to 0.

story: 2005112
task: 29748
Change-Id: I6cbae0996c2347e25d8be617e4b3fd93f4d9cc95
2019-02-28 10:01:17 +00:00
Zuul
d76ab4da80 Merge "[k8s-fedora-atomic] Security group definition for worker nodes" 2019-02-27 23:59:12 +00:00
Lingxian Kong
31c82625d6 [k8s-fedora-atomic] Security group definition for worker nodes
Define stricter security group rules for kubernetes worker nodes. The
ports that are open by default: the default NodePort range (30000-32767)
for external service ports; the kubelet healthcheck port; Calico BGP
network ports; flannel overlay network ports. The cluster admin should
manually configure the security group on the nodes where Traefik is
allowed.
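
A sketch of such a manual step, opening the HTTP port for Traefik (the
security group name and port are illustrative):

openstack security group rule create \
    --protocol tcp --dst-port 80 \
    <worker-security-group>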

Story: #2005082
Task: #29661
Change-Id: Idbc67cb95133d3a4029105e6d4dc92519c816288
2019-02-27 22:15:46 +00:00
Zuul
07e48a1ed5 Merge "Add server group for cluster worker nodes" 2019-02-27 12:32:47 +00:00
Zuul
731499c460 Merge "Return instance ID of worker node" 2019-02-27 11:57:34 +00:00
Lingxian Kong
2bbfd52abc [k8s-fedora-atomic] Use ClusterIP for prometheus service
A NodePort type service, by design, bypasses almost all network
security in Kubernetes, so it is not recommended in cloud
environments.

This patch changes the prometheus service type from NodePort to ClusterIP.
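
With the NodePort gone, one way to reach prometheus from outside the
cluster (the namespace and service name are assumptions):

kubectl -n prometheus-monitoring port-forward svc/prometheus 9090:9090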

Story: #2005098
Task: #29712

Change-Id: Ic47a334bcf81afb87a78a5e66db1a988b473a47e
2019-02-28 00:13:28 +13:00
Zuul
138472dcf1 Merge "Add reno for flannel reboot fix" 2019-02-27 10:00:52 +00:00
Feilong Wang
20d03919fb Return instance ID of worker node
Return the nova instance UUID of worker nodes in kubeminion
templates. We will be able to remove resources from the
ResourceGroups based on nova instance uuid.

Backstory:
In heat a ResourceGroup creates a stack of depth 2. ResourceGroups
support removal policies to declare which resources must be removed.
This can be done by passing the index of the resource or the stack_id
of the nested stack. If a stack update call receives a list of
indices (e.g. [0, 5, 3]) or nested stack uuids (e.g. [uuidA, uuidB]),
it will remove the corresponding nested stacks.

In magnum's heat templates, a nested stack logically represents a
nova compute instance which is a cluster node. Using composition in
heat, we can change the way a resource group references the nested
stacks. This patch proposes to use the nova instance uuid as
'OS::stack_id'.

With this change, an external consumer of the stack (the cluster
autoscaler or an actual user) can remove resources from the
ResourceGroup using the nova instance uuid or resource index. Without
this change, a user or system (which typically knows the name,
server uuid or ip) would have to find out which nested stack a
kubernetes node belongs to, resulting in multiple calls to heat.

The end result of this patch can be verified like this:
nested_stack_id=$(openstack stack resource show <STACK_ID_OR_NAME> kube_minions -c physical_resource_id -f value)
openstack stack show "${nested_stack_id}"
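
As a follow-up check (standard openstack CLI column flags), each nested
stack's physical_resource_id should now be the nova instance uuid:

openstack stack resource list "${nested_stack_id}" -c resource_name -c physical_resource_id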

Task: 29664
Story: 2005054

Change-Id: I6d776f62d640c72b3228460392b92df94fe56fe6
2019-02-27 10:46:41 +01:00
Feilong Wang
4f84c849f6 Add server group for cluster worker nodes
Magnum currently has only one server group for all master and worker
nodes per cluster, which is not very flexible for small clouds. A
cluster with 3+ masters can easily hit the capacity limit when a hard
anti-affinity policy is used. This patch proposes one server group for
each of the master and worker node groups for better flexibility.

story: 2004195

Change-Id: If11ba863a2aa538efe1e3e850084bdd33afd27d2
2019-02-27 09:09:20 +00:00
Jake Yip
ea362b1391 python3 fix: decode binary cert data if encountered
We are writing to files opened in text mode ('w+'), so binary cert
data has to be decoded before writing.

Task: 29577
Story: 2005057
Change-Id: I034d0230c3022e701111bdc71f0af43da1852c3c
2019-02-27 19:47:38 +11:00
Nguyen Hai Truong
055384343f Add python 3.6 unit test job
This is a mechanically generated patch to add a unit test job running
under Python 3.6 as part of the python3-first goal.

See the python3-first goal document for details:
https://governance.openstack.org/tc/goals/stein/python3-first.html

Change-Id: I5a92105f7cfbcabf521150d65f89b14cea62db0f
2019-02-23 18:01:18 +11:00
Spyros Trigazis
e6b3325120 Add reno for flannel reboot fix
Change [0] fixed the issue of resetting iptables on node reboot
when flannel was configured, which made pods lose connectivity.

[0] I7f6200a4966fda1cc701749bf1f37ddc492390c5

Change-Id: I07771f2c4711b0b86a53610517abdc3dad270574
Signed-off-by: Spyros Trigazis <spyridon.trigazis@cern.ch>
2019-02-22 11:07:59 +01:00