A partner performing some testing identified a case where a request
sent to the Ironic conductor while it is starting is accepted for
processing, yet the operation later fails with errors such as a
NodeNotLocked exception. Notably, they were able to reproduce this by
requesting the attachment or detachment of a VIF at the same time as
restarting the conductor.
In part, this condition arises because the conductor is being
restarted while the conductors table still includes the node being
restarted, and the webserver may not yet have had a chance to observe
that the conductor is restarting, as the hash ring is still valid.
In short, incoming RPC requests can arrive during the initialization
window, so we should not remove locks while the conductor could
already be receiving work.
As such, we've added a ``prepare_host`` method which initializes
the conductor database connection and removes the stale locks.
Under normal operating conditions, the database client is reused.
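The ordering can be sketched as follows. This is a minimal
illustration, not Ironic's implementation: the ToyConductor class, its
in-memory reservations dict, and the init_host name for the later
start-up step are assumptions made for the example; only
``prepare_host`` is named above.

```python
class ToyConductor:
    """Hypothetical stand-in for a conductor; not Ironic's real class."""

    def __init__(self, hostname, reservations):
        self.hostname = hostname
        # Assumed lock store: node name -> hostname holding the lock.
        self.reservations = reservations
        self.accepting_rpc = False

    def prepare_host(self):
        # Must run before RPC starts: at this point every lock held under
        # our hostname predates the restart and is therefore stale.
        if self.accepting_rpc:
            raise RuntimeError("too late to clear locks safely")
        for node, holder in list(self.reservations.items()):
            if holder == self.hostname:
                del self.reservations[node]

    def init_host(self):
        # Hypothetical later step: from here on requests may take locks.
        self.accepting_rpc = True


locks = {"node-1": "cond-a", "node-2": "cond-b"}
conductor = ToyConductor("cond-a", locks)
conductor.prepare_host()  # stale locks removed while nothing can race us
conductor.init_host()     # only now may work arrive
```

Locks owned by other conductors are left untouched, and no lock is
removed once requests may already be in flight.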
rhbz# 1847305
Change-Id: I8e759168f1dc81cdcf430f3e33be990731595ec3
(cherry picked from commit b8e4aba1ec)
If a conductor's hostname is changed while reservations are held
under the old hostname, such as 'hostname', and the process is then
restarted as 'new_hostname', the queries no longer match the nodes,
which effectively become inaccessible until the reservation is
cleared.
This patch clears the reservation when stopping the
ironic-conductor service to avoid the nodes becoming inaccessible.
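A minimal sketch of the failure mode and the fix, assuming
reservations are a plain mapping of node name to conductor hostname.
The release_reservations helper is hypothetical and exists only for
this example.

```python
def release_reservations(reservations, hostname):
    """Drop every reservation held by ``hostname`` (illustrative only)."""
    held = [node for node, holder in reservations.items()
            if holder == hostname]
    for node in held:
        del reservations[node]
    return held


reservations = {"node-1": "hostname", "node-2": "other"}

# A conductor restarted as 'new_hostname' looks up its own locks and
# finds none, so node-1 would stay locked by the vanished 'hostname'.
found_after_rename = [node for node, holder in reservations.items()
                      if holder == "new_hostname"]

# Clearing at stop time, while the old hostname is still known,
# avoids the orphaned lock.
released = release_reservations(reservations, "hostname")
```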
Ref to: https://review.opendev.org/#/c/711765/
Change-Id: Id31cd30564ff26df0bbe4976ffe3f268b0dd3d7b
Since we've dropped support for Python 2.7, it's time to look at
the bright future that Python 3.x will bring and stop forcing
compatibility with older versions.
This patch removes the six library from requirements, not
looking back.
Change-Id: Ib546f16965475c32b2f8caabd560e2c7d382ac5a
This change adds an option to publish the endpoint via mDNS on start
up and clean it up on tear down.
Story: #2005393
Task: #30383
Change-Id: I55d2e7718a23cde111eaac4e431588184cb16bda
`hash_distribution_replicas` was deprecated in the Stein cycle (12.1.0).
Story: #1680160
Task: #30033
Change-Id: Iddc59ed113fb9808f8c8564433475638491be84f
This change allows allocations that were not finished because of
conductor restarting or crashing to be finished after start up.
Change-Id: I016e08dcb59613b59ae753ef7d3bc9ac4a4a950a
Story: #2004341
Task: #29544
If a node is in a transitional (-ING) state such as INSPECTING or
RESCUING and the conductor goes down, the node remains stuck in that
state when the conductor comes back.
The DEPLOYING and CLEANING cases are already handled as expected, but
INSPECTING, RESCUING, UNRESCUING, VERIFYING, and ADOPTING are not.
DELETING cannot be transitioned to a 'fail' state.
Change-Id: Ie6886ea78fac8bae81675dabf467939deb1c4460
Story: #2003147
Task: #23282
This changes the calculation for keys in the hash ring manager to be of
the form "<conductor_group>:<driver>", instead of just driver. This is
used when the RPC version pin is 1.47 or greater (1.47 was created to
handle this).
When finding an RPC topic, we use the conductor group marked on the node
as part of this calculation. However, this becomes a problem when we
don't have a node that we're looking up a topic for. In this case we
look for a conductor in any group which has the driver loaded, and use a
temporary hash ring that does not use conductor groups to find a
conductor.
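The key format and the group-less fallback can be sketched as below.
This is an illustration under assumptions, not the hash ring manager
itself: the function names, the dict-of-sets ring representation, and
the sample group and driver names are all invented for the example.

```python
def ring_key(driver, conductor_group=""):
    """Build a hash ring key of the form "<conductor_group>:<driver>"."""
    return "%s:%s" % (conductor_group, driver)


def conductors_for_driver(rings, driver):
    """Collect conductors from every group that has ``driver`` loaded.

    ``rings`` maps "<conductor_group>:<driver>" keys to sets of hosts.
    This mimics the temporary, group-less ring used when there is no
    node to take a conductor group from.
    """
    hosts = set()
    for key, members in rings.items():
        _group, _sep, key_driver = key.partition(":")
        if key_driver == driver:
            hosts |= members
    return hosts


rings = {
    ring_key("driver-a", "rack1"): {"cond-1"},
    ring_key("driver-a", "rack2"): {"cond-2"},
    ring_key("driver-b"): {"cond-3"},
}
any_group = conductors_for_driver(rings, "driver-a")
```

With a node in hand, the lookup would instead use the single key built
from the node's own conductor group.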
This also begins the API work, as the API must be aware of the new hash
ring calculation. However, exposing the conductor_group field and adding
a microversion is left for a future patch.
Story: 2001795
Task: 22641
Change-Id: Iaf71348666b683518fc6ce4769112459d98938f2
Adds the fields and bumps the objects versions. Excludes the field from
the node API for now.
Also adds the conductor_group config option, and populates the field in
the conductors table.
Also fixes a fundamentally broken test in ironic.tests.unit.db.test_api.
Change-Id: Ice2f90f7739b2927712ed45c969865136a216bd6
Story: 2001795
Task: 22640
Task: 22642
* removes any bits related to loading classic drivers from
the drivers factory code
* removes exceptions that only happen when classic drivers
can be loaded
* removes the BaseDriver, moves the useful functionality to
the BareDriver class
* /v1/drivers/?type=classic now always returns an empty list
* removes the migration updating classic drivers to hardware
types
The documentation will be updated separately.
Change-Id: I8ee58dfade87ae2a2544c5dcc27702c069f5089d
python-swiftclient stopped supporting the temp URL structure used when radosgw
was set as the endpoint_type in Ocata, meaning only Newton and older versions
of python-swiftclient will work. Newton is deprecated, so remove the option.
This breaks the deprecation cycle, but since the option has not worked for so
long, it simply needs to be dropped.
Change-Id: Ibdc93b049b7e1ae34cac9e1f599786439c46a685
Also a few related errors based on some earlier investigation
may have been pulled in along the lines of E305.
Story: #2001985
Change-Id: Ifb2d3b481202fbd8cbb472e02de0f14f4d0809fd
Currently we only collect periodic tasks from interfaces used in enabled
classic drivers. Meaning, periodics are not collected from interfaces
that are only used in hardware types. This patch corrects it.
This patch does not enable collection of periodic tasks from hardware
types, since we did not collect them from classic drivers. I don't
remember the reason for that, and we may want to fix it later.
Change-Id: Ib1963f3f67a758a6b2405387bfe7b3e30cc31ed8
Story: #2001884
Task: #14357
If a conductor dies while holding a reservation, the node can get
stuck in its current state. Currently the conductor that takes
over the node only cleans it up if it's in the DEPLOYING state.
This change applies the same logic for all nodes:
1. Reservation is cleared by the conductor that took over the node
no matter what provision state.
2. CLEANING is also aborted, nodes are moved to CLEAN FAIL with
maintenance on.
3. Target power state is cleared as well.
The reservation is cleared even for nodes in maintenance mode,
otherwise it's impossible to move them out of maintenance.
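The three rules above can be sketched on a plain dict standing in for
a node. The field names mirror the ones used in this message, but the
take_over function and the state constants are placeholders for
illustration, not the conductor's actual code.

```python
CLEANING = "CLEANING"      # placeholder state names for the sketch
CLEAN_FAIL = "CLEAN FAIL"


def take_over(node):
    """Apply the takeover rules to a node dict (illustrative only)."""
    # 1. Always clear the reservation, whatever the provision state,
    #    and even when the node is in maintenance mode.
    node["reservation"] = None
    # 2. An interrupted clean is aborted and flagged for an operator.
    if node["provision_state"] == CLEANING:
        node["provision_state"] = CLEAN_FAIL
        node["maintenance"] = True
    # 3. The pending power action died with the old conductor.
    node["target_power_state"] = None
    return node


stuck = {
    "reservation": "dead-conductor",
    "provision_state": CLEANING,
    "maintenance": False,
    "target_power_state": "power on",
}
recovered = take_over(stuck)
```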
Change-Id: I379c1335692046ca9423fda5ea68d2f10c065cb5
Closes-Bug: #1588901
When a conductor managing a node dies abruptly mid-cleaning, the
node will get stuck in the CLEANING state.
This also moves _start_service() before creating CLEANING nodes
in tests. Finally, it adds autospec to a few places where the tests
fail in a mysterious way otherwise.
Change-Id: Ia7bce4dff57569707de4fcf3002eac241a5aa85b
Co-Authored-By: Dmitry Tantsur <dtantsur@redhat.com>
Partial-Bug: #1651092
When the heartbeat thread of the ironic-conductor service reports a
heartbeat, it is interrupted by any database exception other than
'DBConnectionError'. Catch 'Exception' in
_conductor_service_record_keepalive so that every exception raised
from the database is handled and the heartbeat thread does not exit,
and log the exception information. When the database recovers, the
heartbeat thread continues to report heartbeats.
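The pattern can be sketched as below. The keepalive_loop function and
its callable parameters are assumptions made for this example; the
real loop lives in _conductor_service_record_keepalive and differs in
detail.

```python
import logging

LOG = logging.getLogger(__name__)


def keepalive_loop(touch, stop_requested, wait):
    """Call ``touch`` until ``stop_requested()`` is true, surviving errors."""
    while not stop_requested():
        try:
            touch()
        except Exception:
            # Log and carry on so the thread outlives database failures.
            LOG.exception("Conductor heartbeat failed; will retry.")
        wait()


calls = []


def flaky_touch():
    # Simulated database: the first two heartbeats fail, the third works.
    calls.append(len(calls))
    if len(calls) < 3:
        raise RuntimeError("database unavailable")


keepalive_loop(flaky_touch, lambda: len(calls) >= 3, lambda: None)
```

The loop keeps running through both failures and reports the third
heartbeat once the simulated database recovers.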
Change-Id: I0dc3ada945275811ef7272d500823e0a57011e8f
Closes-Bug: #1696296
When the conductor is being stopped, it waits for the completion of
all periodic tasks that are already running. If many nodes are
assigned to the conductor, this may take a long time, and the
oslo.service library can kill the thread on timeout. This patch adds
code that stops iterating over nodes in periodic tasks if the
conductor is being stopped. These changes reduce both the probability
of leaving locked nodes after shutdown and the shutdown time.
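A minimal sketch of the idea, assuming the stop request is signalled
through a threading.Event. The generator name and the event are
illustrative and are not taken from the patch.

```python
import threading


def iter_nodes_until_shutdown(nodes, shutdown_event):
    """Yield nodes one by one, stopping early once shutdown is requested."""
    for node in nodes:
        if shutdown_event.is_set():
            return
        yield node


shutdown = threading.Event()
processed = []
for node in iter_nodes_until_shutdown(["n1", "n2", "n3", "n4"], shutdown):
    processed.append(node)
    if node == "n2":
        shutdown.set()  # simulate a stop request arriving mid-task
```

The node being processed when the request arrives is finished, and the
remaining nodes are never touched, so they cannot be left locked.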
Closes-Bug: #1701495
Change-Id: If6ea48d01132817a6f47560d3f6ee1756ebfab39
Currently, the config drive can be stored in swift with keystone
authentication. This change allows ironic to store the config drive in
ceph radosgw and use the radosgw authentication mechanism, which is not
currently supported. It relies on the swift API compatibility of ceph
radosgw.
New options:
[deploy]/configdrive_use_object_store
[deploy]/object_store_endpoint_type
Deprecations:
[conductor]/configdrive_use_swift
Replaced by: [deploy]/configdrive_use_object_store
[glance]/temp_url_endpoint_type
Replaced by: [deploy]/object_store_endpoint_type
Change-Id: I9204c718505376cfb73632b0d0f31cea00d5e4d8
Closes-Bug: #1642719
The i18n team has decided not to translate the logs because it seems
like it's not very useful.
This patch removes translation of log messages from ironic/conductor.
Change-Id: I0fabef88f2d1bc588150f02cac0f5e975965fc29
Partial-Bug: #1674374
This changes driver_factory.default_interface() so that instead
of returning None if there is no calculated default interface,
it raises exception.NoValidDefaultForInterface.
This is a follow up to 6206c47720.
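The behavioural difference for callers can be sketched as below. The
pick_default_interface helper, its signature, and its selection rule
(first supported interface that is also enabled) are assumptions for
illustration, and the exception class is a local stand-in for
exception.NoValidDefaultForInterface.

```python
class NoValidDefaultForInterface(Exception):
    """Local stand-in for exception.NoValidDefaultForInterface."""


def pick_default_interface(supported, enabled, interface_type):
    """Return a default interface name or raise; never return None."""
    for name in supported:
        if name in enabled:
            return name
    raise NoValidDefaultForInterface(
        "no enabled %s interface is supported" % interface_type)


chosen = pick_default_interface(["alpha", "beta"], ["beta"], "power")
try:
    pick_default_interface(["alpha"], ["beta"], "power")
    failure = None
except NoValidDefaultForInterface as exc:
    failure = str(exc)
```

Callers no longer need a separate None check; a missing default
surfaces as an explicit error at the point of calculation.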
Change-Id: I0c3d5d75b5a37af02f3660968cf3f2c669e52019
Partial-Bug: #1524745
This adds additional constraints to the help messages for the
enabled_*_interfaces config options. It also checks if they are
empty at conductor startup, and if any are empty, errors out
with a better error message than previously provided.
Change-Id: I97fc318ce00291d5e43b70423930981c2f5a2de0
Partial-Bug: #1524745
This causes the conductor to fail to start up if a default interface
implementation cannot be found for any dynamic driver. This avoids
problems later where building a task object to operate on a node
could fail for the same reason.
This also removes a RAID interface test that turned out to be an
invalid test, but we couldn't tell it was invalid until we had
changed the start up behavior of the conductor.
Note that this release note doesn't actually note a change between
releases, but rather is mostly for my use when I come back to combine
many of the release notes for this feature later.
Change-Id: I39d3c30a6beda2e496ff85119281fdf4de191560
Partial-Bug: #1524745
This changes the driver loading validation in the conductor
startup to check for at least one classic *or* dynamic driver.
Previously the conductor would fail to start if no classic drivers
were loaded. This allows the conductor to use only dynamic
drivers, without loading any classic drivers.
It also now checks classic driver names against dynamic driver
names, and fails to start if there is a conflict there. This
would totally break the hash ring and cause mass confusion,
so we cannot allow it.
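Both start-up checks can be sketched as one function. The name
validate_loaded_drivers, the use of RuntimeError, and the sample
driver names are assumptions made for this example only.

```python
def validate_loaded_drivers(classic, dynamic):
    """Fail if nothing is loaded or a name is used by both driver kinds."""
    if not classic and not dynamic:
        raise RuntimeError("no drivers loaded")
    conflicts = sorted(set(classic) & set(dynamic))
    if conflicts:
        raise RuntimeError("name conflict: %s" % ", ".join(conflicts))


# A conductor with only dynamic drivers is now accepted.
validate_loaded_drivers([], ["hw-a"])

# A name shared by a classic and a dynamic driver is rejected.
try:
    validate_loaded_drivers(["shared"], ["shared", "hw-a"])
    conflict_error = None
except RuntimeError as exc:
    conflict_error = str(exc)
```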
Change-Id: Id368690697f90471d09f16eaa4925338dadebd0f
Partial-Bug: #1524745
Attaching periodic tasks on a driver object (rather than an interface)
was deprecated during the Newton cycle (6.1.0). This removes support
for it.
Change-Id: I35afd4e0d3d1a32a516f6c755a0bd9aee0f1b1ba
Fixes-Bug: #1660805
The log string 'Failed to register hardware types' doesn't
provide much help as to what went wrong. This adds the reason
(exception message) to the log.
Change-Id: I941e35473f48c636134d5df31087d0ddbcacf44a
Partial-Bug: #1524745
conductor.base_manager._register_and_validate_hardware_interfaces had a
note at the top about what exceptions might be raised. Turn this into a
proper docstring.
Change-Id: I60b3e864f4cfba38ed7d12caf3bf723d73ab9e39
Partial-Bug: #1524745
This registers the intersection of supported and enabled interfaces for
each hardware type enabled in the conductor at conductor startup, and
unregisters them at conductor shutdown. Validation is left as a todo for
now.
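The intersection can be sketched as below, assuming hardware types are
described by a mapping of interface type to supported names and the
enabled interfaces by a similar mapping. The function and all sample
names are hypothetical.

```python
def interfaces_to_register(hardware_types, enabled):
    """Return (hardware type, interface type, name) rows to register."""
    rows = []
    for hw_name, supported in sorted(hardware_types.items()):
        for iface_type, names in sorted(supported.items()):
            allowed = set(enabled.get(iface_type, ()))
            for name in names:
                # Register only what is both supported and enabled.
                if name in allowed:
                    rows.append((hw_name, iface_type, name))
    return rows


hardware_types = {
    "hw-a": {"power": ["alpha", "beta"], "deploy": ["gamma"]},
}
enabled = {"power": ["beta"], "deploy": ["gamma", "delta"]}
rows = interfaces_to_register(hardware_types, enabled)
```

Interfaces that are enabled but unsupported ("delta") or supported but
not enabled ("alpha") are left out.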
Change-Id: I14e88bfc304de9414de008d1cc8568dda9115ecc
Partial-Bug: #1524745
This changes the ironic driver to use the hash ring implementation from
tooz, which is nearly identical to ironic.common.hash_ring.
Change-Id: I4200be2035067622604e5aa70e025594bcd0a801
Depends-On: Ic1f8b89b819ace8df9b15c61eaf9bf136ad3166b
In order to properly support booting and maintenance of
systems that boot from a remote storage device, we need an
interface to associate the driver with.
This commit adds a basic storage_interface and noop and fake
interfaces along with the appropriate handling for configuration
in the event that the driver list is blank, or is missing the
noop driver.
Co-Authored-By: Stephane Miller <stephane@alum.mit.edu>
Change-Id: Ib21eda88f207f18675c8580cd7fd37eab6fd70bf
Partial-Bug: #1559691
This approach will no longer make sense once driver composition is in effect.
In any case, an interface is usually a better place to put a task.
Change-Id: I7096e428ce9774d89ac624c2d38bb23984a4b842
Related-Bug: #1524745
This change also introduces two network interfaces:
* flat: Copies current neutron DHCP provider logic to work with
cleaning ports;
* noop: noop interface.
The default value of the network_interface is None, meaning that the
node will be using the default network interface. The default network
interface is determined as follows:
* if [DEFAULT]default_network_interface configuration option is set
(the default for it is None), the specified interface becomes the
default for all nodes;
* if it is not set, 'flat' interface will be used if the deployment
currently uses 'neutron' DHCP provider, otherwise 'noop' interface
will be used.
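The selection rules above can be sketched as a small function. The
function name and its parameters are assumptions for illustration; the
interface names and the 'neutron' DHCP provider come from this
message.

```python
def resolve_network_interface(node_value, default_network_interface,
                              dhcp_provider):
    """Pick the network interface for a node (illustrative only)."""
    if node_value is not None:
        # An explicit per-node setting always wins.
        return node_value
    if default_network_interface is not None:
        # [DEFAULT]default_network_interface applies to all other nodes.
        return default_network_interface
    # Nothing configured: follow the DHCP provider in use.
    return "flat" if dhcp_provider == "neutron" else "noop"


with_neutron = resolve_network_interface(None, None, "neutron")
without_neutron = resolve_network_interface(None, None, "none")
configured = resolve_network_interface(None, "noop", "neutron")
per_node = resolve_network_interface("flat", "noop", "none")
```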
create_cleaning_ports and delete_cleaning_ports methods of the DHCP
providers are still being called in case of out-of-tree DHCP
providers, but this possibility will be removed completely in the
next release. If the DHCP provider logic is rewritten into a custom
network interface, please remove those methods from the provider, so
that network interface is called instead.
Partial-bug: #1526403
Co-Authored-By: Om Kumar <om.kumar@hp.com>
Co-Authored-By: Vasyl Saienko <vsaienko@mirantis.com>
Co-Authored-By: Sivaramakrishna Garimella <sivaramakrishna.garimella@hp.com>
Co-Authored-By: Vladyslav Drok <vdrok@mirantis.com>
Co-Authored-By: Zhenguo Niu <Niu.ZGlinux@gmail.com>
Change-Id: I0c26582b6b6e9d32650ff3e2b9a3269c3c2d5454
When clearing locks, also clear target_power_state. Nodes may be
locked in the middle of a power operation; the sync_power_state task
will sync the power_state field, but nothing handles
target_power_state.
Change-Id: I2293e03c05e13c716f78533680d128ba45ccda02
Closes-Bug: #1567255