Commit Graph

1371 Commits

Author SHA1 Message Date
Arnaud Morin
feda25de9d Set OVH GRA1 region in maintenance mode
I recently applied a new kernel on BHS1. If everything is fine with
that, I propose to apply the same one on GRA1, which should help fix
some timeout errors.

Change-Id: I489f8b84871c18f2dad079cae5b53fb1a504f1bd
Signed-off-by: Arnaud Morin <arnaud.morin@corp.ovh.com>
2018-12-20 08:29:14 +01:00
Clark Boylan
7942c19f22 Use OVH BHS1 again
Set ovh-bhs1 max-servers to 150. OVH (thank you amorin) have debugged
and corrected a memory leak there that we believe to be the cause of the
test node slowness.

Frickler and I have run fio tests on VMs running on each hypervisor in
the region and they look happy. We've also run spot tests of devstack
and tempest which also appear happy.

Change-Id: If6fd5a6194a9996e8b031f74918f373dc7bbe758
2018-12-18 07:59:16 -08:00
Mohammed Naser
0ff63458ce vexxhost: tweak nodepool settings
This patch drops the VEXXHOST specific flavors from the Montreal
region because all of the SJC datacenter has *supported* and very
reliable nested virtualization.

It also bumps the max-servers to 10 in order to be able to supply
more results.

Change-Id: I6383772d6d1e1bca3a759692bf20d373baf588c6
2018-12-07 19:04:31 -05:00
Jens Harbott
55d145c34e Disable ovh bhs1
We are seeing excessive job timeouts in this region[0], disable it
until we can get a more stable turnout again.

[0] https://ethercalc.openstack.org/jg8f4p7jow5o

Change-Id: I7969cca2cdd99526294a4bf7a0f44f059823dae7
2018-12-07 14:06:27 +00:00
Clark Boylan
a5088837e2 Halve bhs1 max-servers value
We are debugging slow nodes in bhs1. Looking at dstat data we clearly
have some jobs that end up spending a lot of cpu time in sys and wai
columns while other similar jobs do not.

One thought was that this is due to an unhappy hypervisor or two, but
amorin has dug in and found that these slow jobs run on multiple unique
hypervisors, implying that isn't likely.

My next thought is that we are our own noisy neighbors. Reducing the
max-servers should improve things if we are indeed our own noisy
neighbors.

Change-Id: Idd7804778a141d38da38b739294c6c6a62016053
2018-12-06 14:04:47 -08:00
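Changes like the one above adjust a provider's max-servers cap in nodepool's YAML configuration. A minimal sketch of what such a stanza can look like, with illustrative values and label names rather than the actual project-config contents:

```yaml
# Illustrative nodepool provider fragment (not the real project-config file).
providers:
  - name: ovh-bhs1
    region-name: 'BHS1'
    cloud: ovh
    pools:
      - name: main
        max-servers: 79   # halved to test the noisy-neighbor theory
        labels:
          - name: ubuntu-xenial
            diskimage: ubuntu-xenial
```

Lowering max-servers throttles how many nodes the launcher will hold in the region at once, without removing the provider entirely.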
Arnaud Morin
7671cc88f5 Reduce a little number of instances on BHS1
I'd like to isolate one host from the aggregate. To do that cleanly,
it's better to reduce the number of instances nodepool is trying to
boot; this will avoid useless "no valid host found" errors.

Change-Id: Iddbfba1c3093e9f128c41db91d6b5b3e1d467ce8
Signed-off-by: Arnaud Morin <arnaud.morin@corp.ovh.com>
2018-12-05 09:01:56 +01:00
Jeremy Stanley
dfda58e203 Revert "Temporarily disable ovh-bhs1 in nodepool"
This reverts commit 3f40af4296.

Can be approved once the slow disk performance in this region is
resolved.

Change-Id: Idda585116ae9dc09b55f6794ab5ee7bda47f455a
2018-11-30 17:38:54 +00:00
Jeremy Stanley
3f40af4296 Temporarily disable ovh-bhs1 in nodepool
We've gotten reports of frequent slow job runs in the BHS1 region
leading to job timeouts. Further investigation indicates these
instances top out around 10-15MB/sec for contiguous writes to their
rootfs while instances booted from the same image and flavor in GRA1
see 250MB/sec or better with the same write patterns. Disable BHS1
in nodepool for now while we work with OVH staff to see if they can
determine the root cause.

Change-Id: I8b9a79b64dd7da6d3a33f24797ca597bd2426c86
2018-11-30 17:33:50 +00:00
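A quick way to reproduce the kind of contiguous-write measurement described above is a dd run that flushes data to disk before reporting throughput. This is only a rough spot check under illustrative parameters; fio gives more rigorous numbers:

```shell
# Write 64 MiB of zeros to the filesystem under test; conv=fdatasync makes
# dd include the final flush in its reported throughput, so cached writes
# don't inflate the MB/sec figure.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 conv=fdatasync
```

On the affected BHS1 nodes a run like this would report throughput in the 10-15MB/sec range, versus 250MB/sec or better on GRA1.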
Jeremy Stanley
970987e3ce Revert "Halve ovh-bhs1 max-servers temporarily"
This reverts commit 521d1ceafe. Merge
once testing of the CPU contention theory has concluded.

Change-Id: Ia15f6f943bab530e8b6fd96a2c57d091d60e3193
2018-11-23 15:30:52 +00:00
Jeremy Stanley
521d1ceafe Halve ovh-bhs1 max-servers temporarily
We've gotten reports of frequent slow job runs in the BHS1 region
leading to job timeouts and OVH staff have confirmed we're running a
CPU oversubscription ratio of 2:1 there, so try dropping our
utilization by half to confirm whether this could be due to CPU
contention during peak load.

Change-Id: If7e5f3c0dec71813f5bcb974a0217dc031801115
2018-11-23 15:25:10 +00:00
Ian Wienand
2eec5cd352 Fix arm64ci cloud names
This name was incorrectly added in
I428d46565921e018ac01cbd9c64b4be60c44f3d5; it's supposed to just be
arm64ci.

Change-Id: Iaae8db611acf317770eaea3b4caf1d3e403e1d54
2018-11-20 08:02:56 +11:00
Andreas Jaeger
cc48a5cd89 Update bindep-fallback for openSUSE 15.0
openSUSE 15.0 does not have libffi48-devel, instead we can use
libffi-devel. Install libffi48-devel only on openSUSE 42.3.

This was triggered by the failure in https://review.openstack.org/617282

Change-Id: I2207d69bd837a7249476b4a20025f41df3a7bc84
2018-11-12 12:12:41 +01:00
Ian Wienand
5115fd49d8 nodepool: Add arm64ci cloud
Credentials are populated (Ib96d14008ab3b8b7c12429d7432eaa485c404bb2),
mirror.nrt1.arm64ci.openstack.org is alive so everything is ready to
go.

We have a quota of 40 cores & 96GB RAM; the c1.large flavor is 8
cores / 8GB, so we should be able to fit 5 CI servers to start with.

Change-Id: I428d46565921e018ac01cbd9c64b4be60c44f3d5
2018-11-09 14:59:09 +11:00
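The capacity arithmetic above can be checked directly: with a 40-core / 96GB quota and an 8-core / 8GB flavor, cores are the binding constraint. A purely illustrative sketch:

```python
quota_cores, quota_ram_gb = 40, 96   # region quota from the commit message
flavor_cores, flavor_ram_gb = 8, 8   # c1.large flavor

# Capacity is bounded by whichever resource runs out first.
fit = min(quota_cores // flavor_cores, quota_ram_gb // flavor_ram_gb)
print(fit)  # cores allow 5 servers, RAM would allow 12, so 5
```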
Mathieu Gagné
9bf8267708 Revert "Disable inap-mtl01 provider"
This reverts commit a8d18c9142.

Change-Id: Ic3681220cc555115c1ddffc742f19d4cd038447e
2018-10-29 15:59:24 +00:00
Mathieu Gagné
a8d18c9142 Disable inap-mtl01 provider
Change-Id: Ic367b9b59d10869d46e2dfac820adf1b85ed121a
2018-10-25 16:36:10 -04:00
Clark Boylan
57eaa73695 Switch nodepool launchers to use new zk cluster
This should happen at the same time as we switch the zuul scheduler over
to the new zk cluster and after the nodepool builders have populated
image data on the new zk cluster.

This gets us off the old nodepool.o.o server and onto the newer HA
cluster.

Change-Id: I9cea03f726d4acb21ad5584f8db7a4d15bc556db
2018-10-22 09:23:12 -07:00
Clark Boylan
a5bf522ce3 Switch nodepool builders to zk cluster
Switch over the nodepool builders to our newer zk cluster from the old
single node zk cluster. We will stop building images that the launchers
can see before the launchers move, but this lets us preseed the new
cluster with up to date image data.

Once the images are built with records in the new zk cluster we can
switch over the zuul scheduler and the launchers to this newer cluster.

Change-Id: I95ca326095decc03cf279383fa48dbdfc56ed8c8
2018-10-22 09:20:47 -07:00
Ian Wienand
c64c3d6f0f Restore full OVH-GRA1 quota
This is a follow-on to Id01f85fcee150f9360f508b09003a8d0043155bd to
restore the full quota.

Change-Id: Iec483a37f711f12fbb8ae6fe3299aabe4f621ac4
2018-10-19 16:01:07 +02:00
Ian Wienand
529b912c80 Revert "Disable ovh-gra1"
This partially reverts commit bfdd3e6a42.

After fruitful discussions with amorin in IRC, we have nodes working
again in this region.  This puts a small load on the region for us to
monitor for a while.  A follow-on will do a full revert so we don't forget.

Story: #2004090
Task: #27492
Change-Id: Id01f85fcee150f9360f508b09003a8d0043155bd
2018-10-18 09:41:14 +00:00
Zuul
6f32c131c3 Merge "Disable ovh-gra1" 2018-10-16 07:59:02 +00:00
Ian Wienand
bfdd3e6a42 Disable ovh-gra1
As described in the story/task, this region is currently not working

Change-Id: Ief7b68b45537e7fc8791905d3039d35942636368
Story: #2004090
Task: #27492
2018-10-16 17:34:09 +11:00
Clark Boylan
bff5ce049f Disable packethost due to mirror outages
The mirror keeps getting shutdown which leads to jobs failing in pre-run
and restarting. This is just thrashing things and could lead to
failures. Lets disable the region until we understand the problem.

Change-Id: Ied3fd534dc029868fb770280c01bb564078c5a3d
2018-10-12 16:27:29 -07:00
Zuul
28bc6498f6 Merge "Revert "OVH GRA1 Maintenance" - 2018-10-11 0000UTC" 2018-10-11 16:41:55 +00:00
Zuul
b86d897b67 Merge "OVH GRA1 Maintenance 2018-10-10 1900 UTC" 2018-10-10 17:09:10 +00:00
Logan V
95dcab6b12 Bump Limestone to max-servers 50
The cloud has grown significantly over the past few months and we will
begin scaling max-servers slowly to fill capacity.

Change-Id: I8ead8e56ce5c54ac1ab286fe23f703d50760a560
2018-10-09 18:34:26 -05:00
Matthew Thode
a8e7fbe127 fix rsyslog builds on gentoo
A new version was stabilized on the 5th that allows for more complex
ssl usage.

Also, alphabetize the USE flag definitions based on package name.

Change-Id: Ie6f3f8462e98ca24879db9ef942ec81072330323
2018-10-07 05:11:12 -05:00
Matthew Thode
9c0292db70 set use flags for systemd
Change-Id: I081b23c1acec4b832bbfe1bae96d63e31ff6d335
2018-10-06 21:28:14 -05:00
Zuul
3dcc0359f6 Merge "enable sqlite in python" 2018-10-05 07:00:04 +00:00
Matthew Thode
d341ceca23 enable sqlite in python
Change-Id: Ie7248a1765029bcf8b17433fc4714d359bfb2747
2018-10-05 00:50:50 -05:00
Zuul
f276269e54 Merge "Update stackviz tarball location" 2018-10-04 20:33:49 +00:00
Zuul
b5980e3840 Merge "upgrade complete" 2018-10-04 19:13:28 +00:00
John Studarus
533e85e23a upgrade complete
This reverts commit 5c7223b477.

Change-Id: I957c02ae2d12df67fedbab497df94f21ad38b8bc
2018-10-04 18:57:40 +00:00
Clark Boylan
2224100eac Update stackviz tarball location
We've patched stackviz to work under python3 properly but we are still
pulling an old tarball for stackviz that was built last year. The legacy
job that built the file at this location seems to have been removed.
Switch to the new dist/ location which appears to be correct based on
tarball file sizes.

Someone who understands stackviz better than me should confirm this new
location is the correct one.

Change-Id: If659a6f1fb50d288afed75e3f4975f7a4d140d35
2018-10-04 10:46:08 -07:00
Zuul
9dc9eb0765 Merge "Add GPU instances to CI infrastructure" 2018-10-04 14:33:42 +00:00
Mohammed Naser
6ec575648b Disable vexxhost mtl1
This is being done for capacity reasons.  We'll be bringing back
the region with 100+ VMs after the changes are complete, which
should be in less than 2 weeks.

Change-Id: I549386c3ae0c3611eb50f8ffe6ad657d1f7bb443
2018-10-03 19:43:09 -04:00
Mohammed Naser
b40c2084b7 Add GPU instances to CI infrastructure
This patch adds a small number of instances that include the
following specifications:

- 6 (dedicated) threads
- 60GB memory
- 225GB PCIe NVMe storage
- NVIDIA K80 GPU

This should hopefully help in adding CI coverage for vGPU
support.

Change-Id: If5b8f9cd305e2fd51b8dab315e4804ce7c628dfd
2018-10-03 14:46:44 -04:00
Zuul
2c30704192 Merge "elements/nodepool-base: only initially populate ipv4 nameservers" 2018-10-02 18:29:05 +00:00
John Studarus
5c7223b477 a control plane capacity upgrade is planned for later this week
reducing the workload until then since the control plane is overloaded

Change-Id: I4dc336fc5e4c3844bbc66e71d932e0f26fd4a0f2
2018-10-02 03:02:51 +00:00
Zuul
b858572806 Merge "Revert "Temporarily bump up capacity by 50 VMs"" 2018-09-29 15:09:56 +00:00
Zuul
d834293091 Merge "Temporarily bump up capacity by 50 VMs" 2018-09-28 13:06:56 +00:00
Mohammed Naser
1aaa0d95a6 Revert "Temporarily bump up capacity by 50 VMs"
This reverts commit c8cdfa8b12.

Change-Id: I426db005755defa5bff4e83a2259f0a875e5b27a
2018-09-28 08:45:10 -04:00
Mohammed Naser
c8cdfa8b12 Temporarily bump up capacity by 50 VMs
Times are hard.  Gates are long.  Let's help flush them out.

Please revert this once we've cleared the gate.

Change-Id: Idf0d8a784f11aa4004a909ca911782f7c7496763
2018-09-28 08:43:58 -04:00
Ian Wienand
6565b3c140 elements/nodepool-base: only initially populate ipv4 nameservers
We are seeing a problem on Fedora where, on hosts without ipv6
configured, unbound appears to choose to send queries via the ipv6
forwarders and then returns DNS failures.

An upstream issue has been filed [1], but it remains unclear exactly
why this happens on Fedora but not other platforms.

However, having ipv6 forwarders is not always correct.  Not all our
platforms have glean support for ipv6 configuration, nor do all our
providers provide ipv6 transit.

Therefore, ipv4 is the lowest common denominator across all platforms.
Even those which are "ipv6 only" still provide ipv4 via NAT --
originally it was the unreliability of this NAT transit that led to
unbound being used in the first place.  It should be noted that in
almost all jobs, the configure-unbound role [2] called from the base
job will re-write the forwarding information and configure ipv4/6
correctly depending on the node & provider support.  Thus this only
really affects some of the openstack-zuul-jobs/system-config
integration jobs, where we start out without unbound configured
because we're actually *testing* the unbound configuration role.

An additional complication is that we want to keep backwards
compatibility and populate the settings if
NODEPOOL_STATIC_NAMESERVER_V6 is explicitly set -- this is sometimes
required if you are building infra-style images within a corporate
network that disallows outbound DNS queries, for example.

Thus by default only populate ipv4 forwarders, unless explicitly asked
to add ipv6 with the new variable or the static v6 nameservers are
explicitly specified.

[1] https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=4188
[2] http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/configure-unbound

Change-Id: If060455e163266b2c3e72b4a2ac2838a61859496
2018-09-27 14:27:13 +10:00
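The result described above, in unbound.conf terms, is a forward-zone whose baked-in entries list only ipv4 resolvers. A sketch of the idea; the addresses are illustrative public resolvers, not necessarily what the element writes:

```
# /etc/unbound/unbound.conf (illustrative fragment)
forward-zone:
  name: "."
  # Only ipv4 forwarders are populated in the image by default; the
  # configure-unbound role rewrites this per node/provider at job start.
  forward-addr: 8.8.8.8
  forward-addr: 1.1.1.1
```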
Clark Boylan
9a3fc0c1e2 Revert "Disable OVH BHS1 region"
This reverts commit 19e7cf09d9.

The issues in OVH BHS1 around networking configuration have been worked
around with updates to glean and configuration to the labels in zuul.
New images are in place for each supported image in BHS1. We can go
ahead and start using this region again.

I have manually tested this by booting an ubuntu-xenial node with
glean_ignore_interfaces='True' set in metadata and the networking comes
up with expected using DHCP. The mirror in that region is reachable from
this test node.

Change-Id: I29746686217a62709c4afc6656d95829ace6fb3b
2018-09-25 14:01:27 -07:00
Clark Boylan
22fb41c763 Glean config on OVH nodes
Instruct glean via metadata properties to ignore the config drive
network_data.json interface data on OVH and instead fall back to DHCP.
This is necessary because post upgrade OVH config drive
network_data.json provides inaccurate network configuration details and
DHCP is actually what is needed there for working l2 networking.

Change-Id: I51f16d34a96ee8d964e8b540ce5113a662a56f6d
2018-09-25 09:28:03 -07:00
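The mechanism here is nodepool setting instance metadata properties that glean reads at boot. A sketch of what such provider configuration could look like; the glean_ignore_interfaces key comes from the follow-up revert's test note, while the surrounding structure is illustrative:

```yaml
# Illustrative nodepool fragment (not the real project-config contents).
providers:
  - name: ovh-gra1
    cloud: ovh
    pools:
      - name: main
        labels:
          - name: ubuntu-xenial
            diskimage: ubuntu-xenial
            # Tell glean to ignore config-drive network_data.json
            # interface data and fall back to DHCP instead.
            instance-properties:
              glean_ignore_interfaces: 'True'
```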
Ian Wienand
19e7cf09d9 Disable OVH BHS1 region
This reverts commit 756a8f43f7, which
was where we re-enabled OVH BHS1 after maintenance.  I strongly
suspect that this has something to do with the issues ...

It appears that VMs in BHS1 can not communicate with the mirror

From a sample host 158.69.64.62 to mirror01.bhs1.ovh.openstack.org

---
 root@ubuntu-bionic-ovh-bhs1-0002154210:~# ip addr
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
 2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:1b:4b:32 brd ff:ff:ff:ff:ff:ff
    inet 158.69.64.62/19 brd 158.69.95.255 scope global ens3
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe1b:4b32/64 scope link
       valid_lft forever preferred_lft forever

 root@ubuntu-bionic-ovh-bhs1-0002154210:~# traceroute -n mirror01.bhs1.ovh.openstack.org
 traceroute to mirror01.bhs1.ovh.openstack.org (158.69.80.87), 30 hops max, 60 byte packets
  1  158.69.64.62  2140.650 ms !H  2140.627 ms !H  2140.615 ms !H

 root@ubuntu-bionic-ovh-bhs1-0002154210:~# ping mirror01.bhs1.ovh.openstack.org
 PING mirror01.bhs1.ovh.openstack.org (158.69.80.87) 56(84) bytes of data.
 From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=1 Destination Host Unreachable
 From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=2 Destination Host Unreachable
 From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=3 Destination Host Unreachable
 --- mirror01.bhs1.ovh.openstack.org ping statistics ---
 4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3049ms
---

However, *external* access to the mirror host and all other hosts
seems fine.  It appears to be an internal OVH BHS1 networking issue.

I have raised ticket #9721374795 with OVH about this issue.  It needs
to be escalated so is currently pending (further details should come
to infra-root@openstack.org).

In the mean time, all jobs are failing in the region.  Disable it
until we have a solution.

Change-Id: I748ca1c10d98cc2d7acf2e1821d4d0f886db86eb
2018-09-20 15:55:45 +10:00
Zuul
3be846e1e6 Merge "Install gentoolkit on Gentoo" 2018-09-19 22:37:01 +00:00
Zuul
16b8693e02 Merge "Revert "Revert "Revert "OVH BHS1 Maintenance" - 2018-09-19 1200UTC""" 2018-09-19 20:53:18 +00:00
Matthew Thode
66e29f7bb2 Install gentoolkit on Gentoo
Change-Id: I031d6fa77337ea7cbf5865c2f568e9a498096a00
2018-09-19 09:11:07 -05:00
Andreas Jaeger
756a8f43f7 Revert "Revert "Revert "OVH BHS1 Maintenance" - 2018-09-19 1200UTC""
Enable OVH BHS1 again.

This reverts commit d74c51b0a5.

Change-Id: Ie3c24efb3e9a753d027dc680ab6a26c6a1934159
2018-09-19 13:18:20 +00:00