I recently applied a new kernel on BHS1. If everything is fine with
that, I propose to apply the same one on GRA1, which should help fix
some of the timeout errors.
Change-Id: I489f8b84871c18f2dad079cae5b53fb1a504f1bd
Signed-off-by: Arnaud Morin <arnaud.morin@corp.ovh.com>
Set ovh-bhs1 max-servers to 150. OVH (thank you amorin) have debugged
and corrected a memory leak there that we believe to be the cause of the
test node slowness.
Frickler and I have run fio tests on VMs running on each hypervisor in
the region and the results look healthy. We've also run spot tests of
devstack and tempest, which also look good.
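For reference, a minimal sketch of the sort of fio check run against the
rootfs (the job parameters here are assumptions, not the exact invocation):

    # Sequential write test against a file on the rootfs
    fio --name=seqwrite --filename=/var/tmp/fiotest --size=1G \
        --rw=write --bs=1M --direct=1 --group_reporting
    rm -f /var/tmp/fiotest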
Change-Id: If6fd5a6194a9996e8b031f74918f373dc7bbe758
This patch drops the VEXXHOST-specific flavors from the Montreal
region because the entire SJC datacenter has *supported* and very
reliable nested virtualization.
It also bumps max-servers to 10 in order to supply more results.
Change-Id: I6383772d6d1e1bca3a759692bf20d373baf588c6
We are seeing excessive job timeouts in this region [0]; disable it
until we can get more stable performance there again.
[0] https://ethercalc.openstack.org/jg8f4p7jow5o
Change-Id: I7969cca2cdd99526294a4bf7a0f44f059823dae7
We are debugging slow nodes in bhs1. Looking at dstat data, we clearly
have some jobs that end up spending a lot of CPU time in the sys and wai
columns while other similar jobs do not.
One thought was that this is due to an unhappy hypervisor or two, but
amorin has dug in and found that these slow jobs run on multiple unique
hypervisors, implying that isn't likely.
My next thought is that we are our own noisy neighbors. Reducing
max-servers should improve things if that is indeed the case.
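As a rough illustration, the dstat sampling looked something like this
(interval and output path are assumptions):

    # Record CPU (usr/sys/idl/wai), disk and network stats every 10s to CSV
    # so slow and fast job runs can be compared after the fact
    dstat --time --cpu --disk --net --output /var/log/dstat-csv.log 10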
Change-Id: Idd7804778a141d38da38b739294c6c6a62016053
I'd like to isolate one host from the aggregate, but to do that cleanly
it's better to first reduce the number of instances nodepool is trying
to boot; this will avoid needless "no valid host found" errors.
Change-Id: Iddbfba1c3093e9f128c41db91d6b5b3e1d467ce8
Signed-off-by: Arnaud Morin <arnaud.morin@corp.ovh.com>
This reverts commit 3f40af4296.
This can be approved once the slow disk performance in this region is
resolved.
Change-Id: Idda585116ae9dc09b55f6794ab5ee7bda47f455a
We've gotten reports of frequent slow job runs in the BHS1 region
leading to job timeouts. Further investigation indicates these
instances top out around 10-15MB/sec for contiguous writes to their
rootfs, while instances booted from the same image and flavor in GRA1
see 250MB/sec or better with the same write patterns. Disable BHS1
in nodepool for now while we work with OVH staff to see if they can
determine the root cause.
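A quick way to reproduce the comparison described above, run on a node in
each region (block size, count and path are assumptions):

    # Reports sequential write throughput to the rootfs in MB/s
    dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=1024 conv=fdatasync
    rm -f /var/tmp/ddtest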
Change-Id: I8b9a79b64dd7da6d3a33f24797ca597bd2426c86
We've gotten reports of frequent slow job runs in the BHS1 region
leading to job timeouts and OVH staff have confirmed we're running a
CPU oversubscription ratio of 2:1 there, so try dropping our
utilization by half to confirm whether this could be due to CPU
contention during peak load.
Change-Id: If7e5f3c0dec71813f5bcb974a0217dc031801115
This name was incorrectly added in
I428d46565921e018ac01cbd9c64b4be60c44f3d5; it's supposed to just be
arm64ci.
Change-Id: Iaae8db611acf317770eaea3b4caf1d3e403e1d54
openSUSE 15.0 does not have libffi48-devel, instead we can use
libffi-devel. Install libffi48-devel only on openSUSE 42.3.
This was triggered by the failure in https://review.openstack.org/617282
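A minimal sketch of the package selection this describes, assuming a simple
release check (not the literal form of the change):

    # Pick the right libffi development package per openSUSE release
    source /etc/os-release
    if [ "${VERSION_ID}" = "42.3" ]; then
        zypper --non-interactive install libffi48-devel
    else
        zypper --non-interactive install libffi-devel
    fi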
Change-Id: I2207d69bd837a7249476b4a20025f41df3a7bc84
Credentials are populated (Ib96d14008ab3b8b7c12429d7432eaa485c404bb2)
and mirror.nrt1.arm64ci.openstack.org is alive, so everything is ready
to go.
We have a quota of 40 cores & 96GB RAM; the c1.large flavor is 8 cores /
8GB. We should be able to fit 5 CI servers to start with.
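Back-of-the-envelope sizing from those numbers: 40 cores / 8 vCPU per
c1.large allows 5 servers, while 96GB / 8GB allows 12, so cores are the
binding limit and max-servers starts at 5.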
Change-Id: I428d46565921e018ac01cbd9c64b4be60c44f3d5
This should happen at the same time as we switch the zuul scheduler over
to the new zk cluster, and after the nodepool builders have populated
image data on the new zk cluster.
This gets us off the old nodepool.o.o server and onto the newer HA
cluster.
Change-Id: I9cea03f726d4acb21ad5584f8db7a4d15bc556db
Switch over the nodepool builders to our newer zk cluster from the old
single node zk cluster. We will stop building images that the launchers
can see before the launchers move, but this lets us preseed the new
cluster with up-to-date image data.
Once the images are built with records in the new zk cluster we can
switch over the zuul scheduler and the launchers to this newer cluster.
Change-Id: I95ca326095decc03cf279383fa48dbdfc56ed8c8
This partially reverts commit
bfdd3e6a42.
After fruitful discussions with amorin in IRC, we have nodes working
again in this region. This puts a small load back on so we can monitor
it for a while. A follow-on will do a full revert so we don't forget.
Story: #2004090
Task: #27492
Change-Id: Id01f85fcee150f9360f508b09003a8d0043155bd
The mirror keeps getting shut down, which leads to jobs failing in
pre-run and restarting. This is just thrashing things and could lead to
failures. Let's disable the region until we understand the problem.
Change-Id: Ied3fd534dc029868fb770280c01bb564078c5a3d
The cloud has grown significantly over the past few months and we will
begin scaling max-servers slowly to fill capacity.
Change-Id: I8ead8e56ce5c54ac1ab286fe23f703d50760a560
A new version was stabilized on the 5th that allows for more complex
SSL usage.
Also, alphabetize the USE flag definitions based on package name.
Change-Id: Ie6f3f8462e98ca24879db9ef942ec81072330323
We've patched stackviz to work under python3 properly but we are still
pulling an old tarball for stackviz that was built last year. The legacy
job that built the file at this location seems to have been removed.
Switch to the new dist/ location which appears to be correct based on
tarball file sizes.
Someone who understands stackviz better than I do should confirm that
this new location is the correct one.
Change-Id: If659a6f1fb50d288afed75e3f4975f7a4d140d35
This is being done for capacity reasons. We'll be bringing back
the region with 100+ VMs after the changes are complete, which
should be in less than 2 weeks.
Change-Id: I549386c3ae0c3611eb50f8ffe6ad657d1f7bb443
This patch adds a small number of instances that include the
following specifications:
- 6 (dedicated) threads
- 60GB memory
- 225GB PCIe NVMe storage
- NVIDIA K80 GPU
This should hopefully help in adding CI coverage for vGPU
support.
Change-Id: If5b8f9cd305e2fd51b8dab315e4804ce7c628dfd
Times are hard. Gates are long. Let's help flush them out.
Please revert this once we've cleared the gate.
Change-Id: Idf0d8a784f11aa4004a909ca911782f7c7496763
We are seeing a problem on Fedora where it appears that, on hosts
without configured ipv6, unbound chooses to send queries via the ipv6
forwarders and then returns DNS failures.
An upstream issue has been filed [1], but it remains unclear exactly
why this happens on Fedora but not on other platforms.
However, having ipv6 forwarders is not always correct. Not all our
platforms have glean support for ipv6 configuration, nor do all our
providers provide ipv6 transit.
Therefore, ipv4 is the lowest common denominator across all platforms.
Even those that are "ipv6 only" still provide ipv4 via NAT --
originally it was the unreliability of this NAT transit that led to
unbound being used in the first place. It should be noted that in
almost all jobs, the configure-unbound role [2] called from the base
job will rewrite the forwarding information and configure ipv4/6
correctly depending on the node & provider support. Thus this only
really affects some of the openstack-zuul-jobs/system-config
integration jobs, where we start out without unbound configured
because we're actually *testing* the unbound configuration role.
An additional complication is that we want to keep backwards
compatibility and populate the settings if
NODEPOOL_STATIC_NAMESERVER_V6 is explicitly set -- this is sometimes
required if you are building infra-style images inside a corporate
network that disallows outbound DNS queries, for example.
Thus, by default, only populate ipv4 forwarders unless we are explicitly
asked to add ipv6 via the new variable or the static v6 nameservers are
explicitly specified.
[1] https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=4188
[2] http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/configure-unbound
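To make the default concrete, the rendered forwarding configuration ends up
looking roughly like the following; the resolver addresses and target path
shown are illustrative assumptions:

    # IPv4 forwarders only by default; v6 entries are added only when
    # explicitly requested or when NODEPOOL_STATIC_NAMESERVER_V6 is set
    cat > /etc/unbound/forwarding.conf <<'EOF'
    forward-zone:
      name: "."
      forward-addr: 208.67.222.222
      forward-addr: 8.8.8.8
    EOF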
Change-Id: If060455e163266b2c3e72b4a2ac2838a61859496
This reverts commit 19e7cf09d9.
The issues in OVH BHS1 around networking configuration have been worked
around with updates to glean and to the label configuration in zuul.
New images are in place for each supported image in BHS1. We can go
ahead and start using this region again.
I have manually tested this by booting an ubuntu-xenial node with
glean_ignore_interfaces='True' set in metadata, and the networking comes
up as expected using DHCP. The mirror in that region is reachable from
this test node.
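For anyone re-checking this, the manual boot boiled down to something like
the following; the flavor and network names are placeholders:

    # Boot a test node with the glean override set as instance metadata
    openstack server create \
        --image ubuntu-xenial \
        --flavor test-flavor \
        --network test-net \
        --property glean_ignore_interfaces=True \
        glean-dhcp-test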
Change-Id: I29746686217a62709c4afc6656d95829ace6fb3b
Instruct glean via metadata properties to ignore the config drive
network_data.json interface data on OVH and instead fall back to DHCP.
This is necessary because, post-upgrade, the OVH config drive
network_data.json provides inaccurate network configuration details and
DHCP is actually what is needed there for working l2 networking.
Change-Id: I51f16d34a96ee8d964e8b540ce5113a662a56f6d
This reverts commit 756a8f43f7, which
was where we re-enabled OVH BHS1 after maintenance. I strongly
suspect that this has something to do with the issues ...
It appears that VMs in BHS1 cannot communicate with the mirror.
From a sample host 158.69.64.62 to mirror01.bhs1.ovh.openstack.org:
---
root@ubuntu-bionic-ovh-bhs1-0002154210:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether fa:16:3e:1b:4b:32 brd ff:ff:ff:ff:ff:ff
inet 158.69.64.62/19 brd 158.69.95.255 scope global ens3
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe1b:4b32/64 scope link
valid_lft forever preferred_lft forever
root@ubuntu-bionic-ovh-bhs1-0002154210:~# traceroute -n mirror01.bhs1.ovh.openstack.org
traceroute to mirror01.bhs1.ovh.openstack.org (158.69.80.87), 30 hops max, 60 byte packets
1 158.69.64.62 2140.650 ms !H 2140.627 ms !H 2140.615 ms !H
root@ubuntu-bionic-ovh-bhs1-0002154210:~# ping mirror01.bhs1.ovh.openstack.org
PING mirror01.bhs1.ovh.openstack.org (158.69.80.87) 56(84) bytes of data.
From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=1 Destination Host Unreachable
From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=2 Destination Host Unreachable
From ubuntu-bionic-ovh-bhs1-0002154210 (158.69.64.62) icmp_seq=3 Destination Host Unreachable
--- mirror01.bhs1.ovh.openstack.org ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3049ms
---
However, *external* access to the mirror host and all other hosts
seems fine. It appears to be an internal OVH BHS1 networking issue.
I have raised ticket #9721374795 with OVH about this issue. It needs
to be escalated, so it is currently pending (further details should come
to infra-root@openstack.org).
In the meantime, all jobs are failing in the region. Disable it
until we have a solution.
Change-Id: I748ca1c10d98cc2d7acf2e1821d4d0f886db86eb