Whenever the cluster is initialized, new loops for keepalive validation
are created.
The old loops should be stopped so as not to overload the NSX with
keepalive checks.
Change-Id: I6ae746ba11457c141814424f42e9a0c0e2684601
- Censor sensitive headers so they are not logged in _proxy()
- Fix an issue where Cookie and X-XSRF-TOKEN were not censored as intended
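A minimal sketch of the censoring idea (the names are hypothetical,
not the actual nsxlib code):

    _SENSITIVE_HEADERS = frozenset(['authorization', 'cookie',
                                    'x-xsrf-token'])

    def censor_headers(headers):
        # Header names are case-insensitive, so compare lowercased names.
        return {name: ('<omitted>' if name.lower() in _SENSITIVE_HEADERS
                       else value)
                for name, value in headers.items()}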
Change-Id: I14b422a25b40d0014c05226f9ae4fe8be75e33fb
In addition to the status code, also log the error message and the
local NSX time for failed NSX authentication, to ease troubleshooting
of failed JWT authentication and of potential clock drift between the
NSX, VC and Master nodes.
Change-Id: Icf31477bffda85ba73a1123232b6aa5503066922
This patch adds an option to collect per-cluster or per-endpoint API
call records during _proxy() calls. This enables client-side API
auditing without the need to rely on NSX support bundles. By default
this option is disabled.
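A minimal sketch of such a collector, with hypothetical names and an
illustrative record layout:

    import time

    class APICallCollector(object):
        # Disabled by default; enable to record client-side API calls.
        def __init__(self, enabled=False):
            self.enabled = enabled
            self.records = []

        def add_record(self, endpoint, method, uri, status):
            if self.enabled:
                self.records.append({'timestamp': time.time(),
                                     'endpoint': endpoint,
                                     'method': method,
                                     'uri': uri,
                                     'status': status})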
Change-Id: Ied30d90fc745d5009850c1c83c74eacd46d5fbd9
Since py2 is no longer supported, built-in methods can replace the
six package usage, as has been done in the neutron project.
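Typical replacements of this kind (illustrative; not necessarily the
exact substitutions made in this patch):

    # six.string_types                -> str
    # six.iteritems(d)                -> d.items()
    # six.moves.urllib.parse.urlparse -> urllib.parse.urlparse
    # @six.add_metaclass(abc.ABCMeta) -> class C(metaclass=abc.ABCMeta)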
Change-Id: I435462c940e68fa48a910210e584cf139b3b9d95
The get_token() method has a 'refresh_token' argument in its
signature. Update get_token() calls to pass the 'refresh_token'
argument accordingly.
Change-Id: Idbf64f7f1f0db1c9bfbd079f426cfa3f79bd2edf
In a multi-cluster setup, an adaptive API rate limit is more useful, as
utilization can be dynamically balanced across all active clusters.
AIMD from TCP congestion control is a simple but effective algorithm
that fits our need here, as:
- The API rate is similar to the TCP window size. Each API call sent
concurrently is similar to a packet in flight.
- Each successful API call that was blocked before being sent increases
the rate limit by 1, similar to each ACK received.
- Each failed API call due to Server Busy (429/503) halves the rate
limit, similar to packet loss.
When the adaptive rate is set to AIMD, a custom hard limit can still be
set, capped at 100/s. TCP slow start is not implemented, as the upper
bound of the rate is relatively small. The API rate is adjusted once
per period and under no circumstances will exceed the hard limit.
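A minimal sketch of the AIMD adjustment described above (class and
method names are illustrative, not the nsxlib implementation):

    class AIMDRateLimiter(object):
        def __init__(self, hard_limit=100):
            self.hard_limit = hard_limit  # rate may never exceed this
            self.rate = 1                 # allowed calls per period

        def on_success_after_wait(self):
            # A previously blocked call succeeded: additive increase,
            # analogous to growing the TCP window on each ACK.
            self.rate = min(self.rate + 1, self.hard_limit)

        def on_server_busy(self):
            # 429/503 received: multiplicative decrease, analogous to
            # halving the window on packet loss.
            self.rate = max(self.rate // 2, 1)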
Change-Id: I7360f422c704d63adf59895b893dcdbef05cfd23
The JWT token used to authenticate with NSX can become invalid before
expiration due to a VC service account credentials refresh. When this
happens, nsxlib should immediately fetch a new token using the latest
credentials and refresh the request headers.
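A minimal sketch of the intended behavior, assuming a requests-style
session and the get_token(refresh_token=...) signature mentioned in
this log:

    def send_with_token_refresh(session, request, token_provider):
        response = session.send(request)
        if response.status_code in (401, 403):
            # Credentials may have rotated; force-fetch a fresh token
            # and replay the request once with updated headers.
            token = token_provider.get_token(refresh_token=True)
            request.headers['Authorization'] = 'Bearer %s' % token
            response = session.send(request)
        return response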
Change-Id: I1e3415379926f07e7b30eeaf44e9bcc7e2a26e9e
When an endpoint goes down, the user should see the same exception as
when the cluster is already down (detected by earlier activity).
For this purpose, translate the grounding exception to
ServiceClusterUnavailable.
In addition, display a warning if the number of retries is less than
the number of endpoints, since in this case not all endpoints will be
probed.
Change-Id: Ib4aa5eb95069b917c989b1f6dcd3535880b5a038
The user will be able to specify an exception config object that
defines which exceptions bring an endpoint down, and which exceptions
trigger a retry.
This change removes exception handling from the client class, which
hopefully makes the code more readable and easier to follow.
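A minimal sketch of such a config object (names are illustrative):

    class ExceptionConfig(object):
        def __init__(self, ground_triggers=None, retry_triggers=None):
            # Exception classes that take the endpoint down.
            self.ground_triggers = tuple(ground_triggers or ())
            # Exception classes that only trigger a retry.
            self.retry_triggers = tuple(retry_triggers or ())

        def should_ground_endpoint(self, exc):
            return isinstance(exc, self.ground_triggers)

        def should_retry(self, exc):
            return isinstance(exc, self.retry_triggers)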
Change-Id: If4dd5c01e4bc83c9704347c2c7c8638c5ac1d72c
Currently in nsxlib there is no client-side API rate throttling. In a
scale setup this can easily overwhelm the NSX backend. This patch
introduces a per-endpoint rate limiter that blocks over-limit calls.
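A minimal sketch of a blocking per-endpoint limiter, here as a sliding
one-second window (illustrative, not the nsxlib implementation):

    import collections
    import threading
    import time

    class APIRateLimiter(object):
        def __init__(self, max_calls_per_second):
            self._max_calls = max_calls_per_second
            self._call_times = collections.deque()
            self._lock = threading.Lock()

        def acquire(self):
            # Block until the call count in the last second is under
            # the limit, then record this call's timestamp.
            while True:
                with self._lock:
                    now = time.time()
                    while (self._call_times and
                           self._call_times[0] <= now - 1.0):
                        self._call_times.popleft()
                    if len(self._call_times) < self._max_calls:
                        self._call_times.append(now)
                        return
                time.sleep(0.01)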
Change-Id: Iccd1d2675bed16833d36fa40cc2ef56cf3464652
Before this change, the keepalive probe consisted of two separate
configurable round trips - one based on the keepalive_section
attribute, and one on validation_method.
The recommended way to probe an NSX appliance is the node/health API,
and tests show it has the best round-trip time. nsxlib will switch to
this health check and no longer expose the keepalive methodology to
clients.
Change-Id: Ia972ef3d087fd01fa18d5a4e9dc9c32fbed0eb40
This can help distinguish which requests have been queued waiting for
an available connection or have been retried.
Change-Id: I197ae819afde9333a2969472ba716694893298bd
Endpoint validation was two-fold - first validation_connection_method
was invoked, and then a GET of the keepalive section, if configured.
This change runs only one validation, but makes sure one is always
run:
if a keepalive section is configured, validation is based on it;
otherwise the default validation (validation_connection_method) is used.
For policy, the suggested default validation is via the infra API.
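A minimal sketch of the selection logic, with assumed attribute names
based on the description above:

    def validate_endpoint(cluster, endpoint):
        # Exactly one validation round trip is always performed.
        if cluster.nsxlib_config.keepalive_section:
            cluster.get(cluster.nsxlib_config.keepalive_section)
        else:
            cluster.nsxlib_config.validation_connection_method(
                cluster, endpoint)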
Change-Id: Ib53d09ba6b2d70f99d5dba781950975c3d7195b6
In the case of no validation, the endpoint state should be assumed to
be UP.
This is a quick fix to unblock no-validation scenarios. The next patch
will deal with the cluster DOWN->UP transition.
Change-Id: Ia2a47e1a8d8aeb0174377b24b469613d866fc805
This change reduces retries during cluster health validation. There are
multiple retry levels today:
* retry at the urllib3 HTTP level
* retry in validating cluster health
* retry in _proxy_internal
This causes a retry storm, which brings significant delays to API
calls. This is especially relevant when nsxlib is configured with
cluster_unavailable_retry = True (which is always the case with a
single endpoint).
This change reduces the configurable retry attempts in cluster health
validation to a single retry per endpoint.
In addition, this change fixes the scenario where the client configures
nsxlib with no validation, in which case the cluster should not mark an
endpoint as UP in validation-related code.
Change-Id: I33b4101a0e0c0f4088e10776e126cc495dabd89c
Keepalive can pose an extra load on the backend, especially
when clients spawn multiple processes. In addition, some
deployments use an external load balancer with its
own monitoring mechanism, in which case nsxlib probing is
redundant.
This change avoids keepalive probing when only one backend
is configured. If the cluster is DOWN, the connection will
always be retried upon an API call.
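A minimal sketch of the guard (attribute names are hypothetical):

    def keepalive_needed(cluster):
        # With a single endpoint there is nothing to fail over to, and
        # a DOWN cluster is retried on the API call itself, so periodic
        # probing is skipped.
        return len(cluster.endpoints) > 1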
Change-Id: If6b5542f0444f5bb72c0d60e90942a7819c5d72e
In case validate_connection_method already has a keep-alive effect,
it should be possible to skip any extra keep-alive requests.
Currently in MP the default keepalive section is transport-zones, which
degrades performance significantly in a scale setup. As the more
lightweight path reverse-proxy/node/health is already used, we should
allow the keepalive section to be disabled.
Change-Id: I26c0af67f90b62533a39827ca5111832d306a153
The recursive call causes the pool to be locked if a failure occurs
repeatedly. The new code makes sure the pool element is returned to
the pool before the next call.
Change-Id: Ia03ac434bbddb3cd304c14555149da23f8852602
This patch adds a new config option "thumbprint".
It will be used to verify the Manager server certificate
when "insecure" is false and "ca_file" is unset.
Change-Id: Idfb654c5b502cd6df12275e0a88cf10c546d819d
Add a JSON Web Token provider abstract class.
In addition, allow clients to configure a token provider
instance such that, when it is set, the Authorization header
of NSX-T requests has the bearer token value inserted.
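A minimal sketch of the provider contract (class and method names are
illustrative):

    import abc

    class AbstractJWTProvider(metaclass=abc.ABCMeta):
        @abc.abstractmethod
        def get_token(self, refresh_token=False):
            # Return a valid JWT, refreshing it if requested.
            pass

    def build_auth_header(token_provider):
        # When a provider is configured, insert the bearer token.
        if token_provider is None:
            return {}
        return {'Authorization':
                'Bearer %s' % token_provider.get_token()}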
Change-Id: Ieb701411413ec239276685f02ee1364bd2b05abd
This patch updates hacking and bandit versions to match what neutron
and others are doing. It also fixes and ignores some new pep8 errors
that crop up due to the version bump. Finally the doc requirements are
moved to doc/requirements.txt to match what other projects do, even
though this project does not build docs today.
Change-Id: Ibe07dbdbaccc220b5ea2a628d342a09a01b09d11
1. Replace the URL used for the manager status check.
2. Change the order of the validation checks, since the list action can
time out.
3. Run the status validation without retries, to make sure the node is
identified as DOWN more quickly.
Change-Id: I60501b544b5892dcc6eb1c4c897ee4add6262e0b
When validating whether a connection with the NSX manager is up, also
check the manager status directly.
Change-Id: I5de7054a058a74e8237a3344e0951ce37976c135
Getting default headers might fail due to the backend not supporting
it or one of the cluster nodes being down.
In any case, the flow should continue as usual and fail in endpoint(),
marking the node as down if it should be.
Change-Id: I18cd5dadd37a96903544464ad3f3e5ea9d6edd4d
There is no need to pre-initialize the connection pool, as this would
require additional error handling.
Connections will be created on demand.
In addition, reduce the default for max-connections to a reasonable
number of expected concurrent requests.
Change-Id: Ic2dfbd29169fe0532fde46f0ac29ce52ee01d40e
The connection generator should not insert "None" into the connection
pool. The exception caught in the validation function should do the
job.
Change-Id: Ia854e3fb719b7e177c07b9b566ff50dc44fe7765
This patch adds support for the case where nsxlib is configured with
a cluster (a few NSX managers) whose availability changes, and which
at a specific point in time might all be DOWN.
1) nsxlib will succeed even if all the managers are currently
unavailable (state DOWN)
2) With cluster_unavailable_retry=True configured, when a request
is issued and all managers are DOWN, the selected endpoint will be
retried until one of the managers is back UP, or until max retries
is reached (10 by default), as sketched below
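A minimal sketch of the retry loop for case 2 (the import path is
assumed; ServiceClusterUnavailable is the exception mentioned earlier
in this log):

    import time

    from vmware_nsxlib.v3 import exceptions

    def retry_when_cluster_down(do_request, max_attempts=10):
        for attempt in range(1, max_attempts + 1):
            try:
                return do_request()
            except exceptions.ServiceClusterUnavailable:
                if attempt == max_attempts:
                    raise
                time.sleep(1)  # wait before retrying the endpoint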
Change-Id: I2e3d1a9734f37ef82859baf0082b39c11d6ce149
In environments where access is very slow, an IOError may be received
instead of OpenSSL.SSL.Error. Here we perform a retry.
Change-Id: Ib70eaabf94cd637ca68d311a7944687bf52d7bc9
When an established connection is closed by the server, the endpoint
should not go down, and the request should be retried with another
connection. However, if a connection fails to be established, the
endpoint should go down as coded before.
We observe the server closing connections in spite of keep-alive with
the policy endpoint.
Change-Id: I264da0ad47c31c9875a4be35acd4c5c4c88f4916
A recent change in pep/pycodingchecks introduced new warnings as part
of the pep8 target, causing pep8 to fail now.
This patch fixes code that issued warnings W503, E731, E266 and E402.
Change-Id: Ib0ad4d722eb6ce322d7f72a5bdaf38b6cb85937e
With client auth, the request for the XSRF token should not carry the
admin username/password, for two reasons:
1. The username/password may not exist in the config.
2. The backend treats these credentials as authentication and ignores
the desired principal identity.
Change-Id: I35af9536018196959297dcad6b11b98d0681d625
The current version of the requests library checks for client
certificate file existence on each request, regardless of whether the
cert is going to be needed for that request.
Therefore it is pointless to try to save the effort of populating the
cert file for non-first requests on the connection.
In addition, add a retry for the case where the SSL error comes from
within the SSL C code.
Change-Id: I32b8304b3217049752e8d25a1b735a6d2035fa0b
The XSRF token might expire after too long a period with no activity.
This should not happen, because the nsxlib cluster uses keep-alive
messages.
In case it does happen, the keep-alive will detect this incident
and renew the session.
Change-Id: I6c9a7af01b5b18c2a7e46cc6bf8337b7205d161f
To avoid CSRF attacks, requests to the NSX manager should carry the
X-XSRF-TOKEN header.
To get it, and the JSESSIONID, the cluster will issue a session/create
request for each endpoint at init, and retrieve the values from the
response headers.
Those values will later be used for each request to this endpoint.
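A minimal sketch of the init step, assuming the NSX session/create API
path and a requests-based client:

    import requests

    def create_session(manager_url, username, password):
        resp = requests.post('%s/api/session/create' % manager_url,
                             data={'j_username': username,
                                   'j_password': password})
        resp.raise_for_status()
        # The XSRF token arrives in a response header, the session id
        # in a cookie; both are reused on later requests.
        return (resp.headers.get('X-XSRF-TOKEN'),
                resp.cookies.get('JSESSIONID'))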
Change-Id: I24cf6416e38dc7f57d7d8ceece48a1d3d5815112
This reverts commit 529ca1be95.
This is because, under load, we would get:
Request failed due to: [('system library', 'fopen', 'No such file or directory'), ('BIO routines', 'FILE_CTRL', 'system lib'), ('SSL routines', 'SSL_CTX_use_certificate_file', 'system lib')]
Change-Id: Icd4052754b9be606c4912a5137ff081883399337