We have a network appliance we test via nested virt. While the outer
node is live and the port we nodescan is open, the nested node is still
starting SSHd. This causes nodescan to return:
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
until SSHd is properly running.
Previously we set our boot-timeout to 5 mins, to allow for the nested
SSHd to come online properly. This should restore that functionality.
Change-Id: I7f43530ee77a81f7c969d548190a71bfb9b03455
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
We are seeing users try to enable FIPS on their test nodes. This
presents a problem because the ssh host key which we've been using is an
ed25519 key, which FIPS disables. FIPS forces the use of another key
which ansible doesn't trust, so subsequent ssh connections fail.
Address this by trying to scan all available host keys on the server and
not just the first one that paramiko returns.
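A minimal sketch of the per-key-type scan, assuming a hypothetical
fetch_key(key_type) callable that performs one handshake restricted to a
single host key type and raises if the server does not offer it. The
callable only stands in for the paramiko handshake; the helper names and
key strings are illustrative and not taken from nodepool.

```python
def scan_all_host_keys(fetch_key, key_types):
    """Collect one host key per key type that the server offers.

    fetch_key is a hypothetical callable: it takes a key type name, does
    one handshake limited to that type, and returns the key line, or
    raises an exception when the server does not offer that type.
    """
    keys = []
    for key_type in key_types:
        try:
            keys.append(fetch_key(key_type))
        except Exception:
            # The server does not offer this key type (for example
            # ed25519 under FIPS); try the next one instead of giving up.
            continue
    return keys


def fake_fips_server(key_type):
    """Stand-in for a FIPS-enabled host that refuses ed25519."""
    offered = {
        "ssh-rsa": "ssh-rsa FAKE-RSA-KEY",
        "ecdsa-sha2-nistp256": "ecdsa-sha2-nistp256 FAKE-ECDSA-KEY",
    }
    if key_type not in offered:
        raise ValueError("key type not offered: %s" % key_type)
    return offered[key_type]


found = scan_all_host_keys(
    fake_fips_server, ["ssh-ed25519", "ssh-rsa", "ecdsa-sha2-nistp256"])
```

With the fake server, the ed25519 attempt fails and the two remaining
keys are still returned.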
Change-Id: Ibb2a07a29681dcefd4017eb2fd6134ee33ab726c
This ensures that we don't wait forever for tests to complete tasks.
This is particularly useful if you've disabled the global test timeout.
Change-Id: I0141e62826c3594ed20605cac25e39091d1514e2
Having python files with the exec bit and a shebang defined in
/usr/lib/python-*/site-packages/ is not fine in an RPM package.
Instead of carrying a patch in the nodepool RPM packaging, it is better
to fix this directly upstream.
Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
During nodescan we currently set a socket timeout which is equal to
the timeout we wait for the entire boot. In case we have unfortunate
timing of the network interface setup of the node (especially Windows
does this very late in the boot process) we get longer wait times than
necessary. This happens because uninitialized network interfaces on
the node lead to unanswered syn packets instead of connection refused
errors. Linux typically does around 6 syn retries with an exponential
backoff starting with 3s. This means the delay between syn retries is
3, 6, 12, ... seconds and thus in absolute time a single socket connect
can return after 0, 3, 9, 21, 45, 93 or 189 seconds.
This can be solved by setting a fixed lower timeout on the socket to
force it to return with timeout after 10s so we can avoid the
exponential syn retry backoff and thus don't waste too much time on
slower starting nodes.
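The arithmetic can be checked with a short sketch. It assumes the 3s
initial delay and 6 retries stated above; the function names are
hypothetical, and capping the per-attempt timeout at the remaining boot
time is an added assumption, not something this change states.

```python
def syn_retry_return_times(initial_delay=3, retries=6):
    """Absolute times (seconds) at which a blocked connect() can return.

    Each retry waits twice as long as the previous one, so the return
    times are the running sums of 3, 6, 12, 24, 48, 96.
    """
    times = [0]
    delay = initial_delay
    for _ in range(retries):
        times.append(times[-1] + delay)
        delay *= 2
    return times


def connect_timeout(remaining_boot_time, cap=10):
    """Per-attempt socket timeout: a fixed cap, never more than time left.

    The min() against the remaining time is an assumption made for this
    sketch.
    """
    return max(0, min(cap, remaining_boot_time))


print(syn_retry_return_times())  # [0, 3, 9, 21, 45, 93, 189]
```

With a 10s cap, a node whose interface comes up 25s into the boot is
retried at roughly 30s rather than at the next backoff step of 45s.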
Change-Id: Ibabdff1966d49752e86e15a1c2a24dd2c86d33f6
The connection port should be included in the provider diskimage.
This makes it possible to define images using other ports for
connections, e.g. winrm for Windows, which runs on a different port
than 22.
Change-Id: Ib4b335ffbcc4dc71704c06387377675a4206c663
In case of an image with the connection type winrm we cannot scan the
ssh host keys. So in case the connection type is not ssh we
need to skip gathering the host keys.
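A minimal sketch of that check, assuming a hypothetical zero-argument
scan callable that performs the actual ssh key scan; the function name
is illustrative only.

```python
def gather_host_keys(connection_type, scan):
    """Return scanned host keys for ssh nodes and an empty list otherwise.

    scan is a hypothetical zero-argument callable that performs the
    actual ssh host key scan.
    """
    if connection_type != "ssh":
        # winrm (and any other non-ssh type) has no ssh host keys to scan.
        return []
    return scan()
```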
Change-Id: I56f308baa10d40461cf4a919bbcdc4467e85a551
The pep8 rules used in nodepool are somewhat broken. In preparation to
use the pep8 ruleset from zuul we need to fix the findings upfront.
Change-Id: I9fb2a80db7671c590cdb8effbd1a1102aaa3aff8
We had a launch thread stuck here:
Thread: NodeLauncher-0000341123 (140201917658880)
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.5/dist-packages/nodepool/driver/openstack/handler.py", line 245, in run
    self._run()
  File "/usr/local/lib/python3.5/dist-packages/nodepool/driver/openstack/handler.py", line 216, in _run
    self._launchNode()
  File "/usr/local/lib/python3.5/dist-packages/nodepool/driver/openstack/handler.py", line 201, in _launchNode
    interface_ip, timeout=self._provider.boot_timeout)
  File "/usr/local/lib/python3.5/dist-packages/nodepool/nodeutils.py", line 74, in keyscan
    t.start_client()
  File "/usr/local/lib/python3.5/dist-packages/paramiko/transport.py", line 489, in start_client
    event.wait(0.1)
  File "/usr/lib/python3.5/threading.py", line 549, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.5/threading.py", line 297, in wait
    gotit = waiter.acquire(True, timeout)
This adds a timeout to that method so paramiko won't get stuck there.
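The stack trace shows a polling loop on event.wait(0.1) with no upper
bound. A minimal standard-library sketch of the bounded variant follows;
wait_with_deadline is a hypothetical helper written for illustration and
is not paramiko's code.

```python
import threading
import time


def wait_with_deadline(event, timeout, poll=0.1):
    """Poll a threading.Event, but give up once timeout seconds passed."""
    deadline = time.monotonic() + timeout
    while not event.is_set():
        if time.monotonic() >= deadline:
            raise TimeoutError("event not set within %.1fs" % timeout)
        # Short waits keep the loop responsive, as in the trace above.
        event.wait(poll)
    return True
```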
Change-Id: I038d88cb141f57b93d8572c067e714f4a3af9c2d
Use six.text_type since unicode() doesn't exist for python3.
Change-Id: I3628759c46f44429471aa394dee5056e191e4a05
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
This exception is not subscriptable in py3, but the proper way to
get to the errno in any version is to access the 'errno' attribute.
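A small illustration using only the standard library; error_number is a
hypothetical helper name.

```python
import errno


def error_number(exc):
    """Return the errno of an exception, or None if it has none."""
    return getattr(exc, "errno", None)


try:
    raise OSError(errno.ECONNREFUSED, "Connection refused")
except OSError as e:
    # e[0] would raise TypeError on Python 3; the attribute works on both.
    code = error_number(e)
```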
Change-Id: I9a2e23cee358ff0f573f29962ab03525bfd40974
When we switched from paramiko client to paramiko transport we failed to
properly set up a timeout.
Change-Id: Ia25c7f31a55d0d6e6bd42b2b266f41a4a2daf8ba
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
The syntax for imports has changed for python3, so let's use the new
syntax.
Change-Id: Ia985424bf23b44e492f51182179d2e476cdcccbb
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
As we move forward with zuulv3, we no longer need the ability to SSH
into a node from nodepool-launcher. This means we can remove SSH
private keys from the production server. Now we only keyscan the node
and pass the info to zuul to do SSH operations.
We also create our own socket now for paramiko, so we can better
control the exception handling.
Change-Id: I123631aa41fd3db374ef78cf97a8b8afde93f699
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Today, when SSHExceptions are raised, nodepool will abort communication
with the node. Now, nodepool will properly trap them and try again
until the SSH timeout has been reached.
This helps with potential race conditions between openssh-server and
nodepool, where nodes would restart sshd after nodepool has
established a connection.
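A minimal sketch of such a retry loop, using only the standard library.
The helper name, its parameters and the retryable exception tuple are
hypothetical; in nodepool the retried exception would be paramiko's
SSHException.

```python
import time


def retry_until_timeout(connect, timeout, retryable, interval=1.0):
    """Call connect() until it succeeds or timeout seconds have passed.

    retryable is a tuple of exception classes that trigger another try;
    any other exception propagates immediately.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect()
        except retryable as exc:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    "gave up after %d attempts" % attempts) from exc
            time.sleep(interval)
```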
Change-Id: I40bfa1b1af6e4e75f8f14c597c28407ed08023de
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Add some sort of server information about our failed ssh_connect
attempts. Currently we don't expose any information about the host.
2016-08-23 16:26:11,894 ERROR nodepool.utils: Exception while testing ssh access:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nodepool/nodeutils.py", line 55, in ssh_connect
    client = SSHClient(ip, username, **connect_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/nodepool/sshclient.py", line 30, in __init__
    key_filename=key_filename)
  File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 305, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/local/lib/python2.7/dist-packages/paramiko/util.py", line 270, in retry_on_signal
    return function()
  File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 305, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 110] Connection timed out
Change-Id: I5705798c91b228a7be2788c33c5a128653b24bbe
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
At the moment, grepping through logs to determine what's happening with
timeouts on a provider is difficult because for some errors the cause of
the timeout is on a different line than the provider in question.
Give each timeout a specific named exception, and then when we catch the
exceptions, log them specifically with node id, provider and then the
additional descriptive text from the timeout exception. This should
allow for easy grepping through logs to find specific instances of
types of timeouts - or of all timeouts. Also add a corresponding success
debug log so that comparative greps/counts are also easy.
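A minimal sketch of the idea. The exception names, logger name and
message wording are hypothetical and chosen for illustration; they are
not taken from nodepool.

```python
import logging

log = logging.getLogger("nodepool.example")


class LaunchTimeout(Exception):
    """Base class so all timeouts can be caught (and grepped) together."""


class ConnectionTimeout(LaunchTimeout):
    pass


class IPAddTimeout(LaunchTimeout):
    pass


def timeout_message(node_id, provider, exc):
    """One line holding node id, provider and the timeout description."""
    return "Timeout launching node id: %s in provider: %s, error type: %s: %s" % (
        node_id, provider, type(exc).__name__, exc)


def launch_with_logging(node_id, provider, launch):
    """Run launch() and log the outcome on a single greppable line."""
    try:
        launch()
    except LaunchTimeout as exc:
        log.error(timeout_message(node_id, provider, exc))
        return False
    log.debug("Node id: %s is ready in provider: %s", node_id, provider)
    return True
```

Grepping for "Timeout launching" then finds every timeout, and grepping
for one class name finds a single kind.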
Change-Id: I889bd9b5d92f77ce9ff86415c775fe1cd9545bbc
nodeutils.ssh_connect() logs an info message which suggests we
attempted to connect to an instance using password authentication:
Password auth exception. Try number 5...
Change the message to be more generic.
Include ip and username to better differentiate messages in the log
spam.
Example output:
Auth exception for debian@10.0.0.42, Try number 5...
Change-Id: Iea3c1cf3ae30919cbc6d147e16d383da91df5d75
New versions of paramiko wrap exceptions from multiple connection
attempts for multiple address families into one
NoValidConnectionsError exception. It is a subclass of socket.error
but with an errno set to None. Just check for that and ignore it
to suppress log entries on perfectly normal connection failures.
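A minimal sketch of that check, using only the standard library. The
set of errno values treated as normal is an assumption made for this
sketch, and a plain socket.error built with errno None stands in for
paramiko's NoValidConnectionsError.

```python
import errno
import socket

# Assumed for this sketch: connection failures that are normal while a
# node is still booting. None covers the wrapped multi-address error.
EXPECTED_ERRNOS = (errno.ECONNREFUSED, errno.EHOSTUNREACH, None)


def is_expected_connect_failure(exc):
    """True if the socket error should be ignored instead of logged."""
    return getattr(exc, "errno", None) in EXPECTED_ERRNOS


# Stand-in for the wrapped exception: a socket.error with errno None.
wrapped = socket.error(None, "Unable to connect to port 22")
```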
Change-Id: If64ab66dcc6db7c1886fb72f36078f7f819d6506
Add an option to use IPv6 as the ssh connect IP for building snapshot
images and launching jenkins slaves.
Conflicts:
doc/source/configuration.rst
nodepool/nodepool.py
Change-Id: I7e023e7581fc0b5ec1ee34d1e5a1eeaacd7d3bfd
Make testing easier by removing a copy of a method from the
provider_manager. Instead import this method from nodeutils.
Change-Id: I68addb82826c2ce5ee89e120d5f1958fde4f7f12
According to https://docs.python.org/3/howto/pyporting.html the
syntax changed in Python 3.x. The new syntax is usable with
Python >= 2.6 and should be preferred to be compatible with Python3.
Enabled hacking check H231.
Change-Id: Ide60f971493440311f1dcc594e33d536beb925e5
Some cloud instance types (Fedora for example) create
the ssh user after sshd comes online. This allows
our ssh connection retry loop to handle this scenario
gracefully.
Change-Id: Ie345dea50fc2983112cd2e72826a708583d2712a
Log stdout/stderr from the image build process. Use the provider
and image name in the log selector so that admins can route
appropriately (or at least grep).
Change-Id: I7bc74ebfca3184340b51b083695b3441f0924e83
This is used to serialize all access to an individual provider
(nova client). One ProviderManager is created for every provider
defined in the configuration. Any actions that require interaction
with nova submit a task to the manager which processes them serially
with an appropriate delay to ensure that rate limits are not hit.
This solves not only rate-limit problems, but also ends multi-threaded
access to a single novaclient Client object.
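A minimal sketch of such a serializing manager, using only the standard
library. The class name, the rate parameter and the submit/stop methods
are hypothetical and do not reproduce nodepool's ProviderManager.

```python
import queue
import threading
import time


class SerialTaskManager(threading.Thread):
    """Runs submitted callables one at a time on a single thread."""

    def __init__(self, rate):
        super().__init__(daemon=True)
        self.rate = rate  # seconds to sleep after each task
        self.tasks = queue.Queue()

    def submit(self, func, *args):
        """Queue func(*args) and block until the manager thread ran it."""
        done = threading.Event()
        result = {}
        self.tasks.put((func, args, result, done))
        done.wait()
        if "error" in result:
            raise result["error"]
        return result["value"]

    def stop(self):
        """Ask the manager thread to exit after the queued tasks."""
        self.tasks.put(None)

    def run(self):
        while True:
            item = self.tasks.get()
            if item is None:
                return
            func, args, result, done = item
            try:
                result["value"] = func(*args)
            except Exception as exc:
                result["error"] = exc
            finally:
                done.set()
            # Delay before the next task so rate limits are not hit.
            time.sleep(self.rate)
```

Callers on any thread use manager.submit(client_call, ...), so the
wrapped client object is only ever touched by the manager thread.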
Change-Id: I0cdaa747dac08cdbe4719cb6c9c220678b7a0320
The existing db session strategy was inherited from a bunch of
shell scripts that ran once in a single thread and exited.
The surprising thing is that it even worked at all. This change
replaces that "strategy" with one where each thread clearly
begins a new session as a context manager and passes that around
to functions that need the DB. A thread-local session is used
for convenience and extra safety.
This also adds a fake provider that will produce fake images and
servers quickly without needing a real nova or jenkins. This was
used to develop the database change.
Also some minor logging changes and very brief developer docs.
Change-Id: I45e6564cb061f81d79c47a31e17f5d85cd1d9306
This is effectively a required db field; without it, the watermark
calculation can be wrong until it's filled in, so make sure it's
there to start.
Also some minor logging changes.
Change-Id: Idc5a9cd40fe330f7a1aea4a5513267ee3c254f60