822 Commits

Author SHA1 Message Date
Jay Faulkner
1d11f0b7dd If listen_tls is true, enable TLS on wsgi server
This change enables operators to set [DEFAULT]listen_tls to
true to configure IPA to host its WSGI server over TLS using the
existing SSL support in oslo.service.

In addition to configuring this in IPA, a deployer will need to
also set [ssl]cert_file, [ssl]key_file, and optionally
[ssl]ca_file in their ipa config, in addition to embedding those
files into the IPA ramdisk in order for this to be functional.

In order to make this change work, we also need to monkey patch the
socket library early, or else oslo.service will end up passing an
unpatched socket to the eventlet wsgi server, which causes
deadlocks.

Change-Id: Ib7decae410915f3c27b045ee08538c94d455b030
2020-09-02 16:07:42 -07:00
Jay Faulkner
7d0ad36ebd Make WSGI server respect listen_* directives
The listen_port and listen_host directives are intended to allow
deployers of IPA to change the port and host IPA listens on. These
configs have not been obeyed since the migration to the oslo.service
wsgi server.

Story: 2008016
Task: 40668
Change-Id: I76235a6e6ffdf80a0f5476f577b055223cdf1585
2020-08-31 14:37:38 +00:00
Zuul
cfede0c5bc Merge "Clarify connection error on heartbeats" 2020-08-24 13:29:27 +00:00
Julia Kreger
f670f704f3 Clarify connection error on heartbeats
Heartbeat connection errors are often a sign of transitory
network failures which may resolve themselves. But an operator
looking at the screen doesn't necessarily know that.

They may not realize that a network failure or a misconfiguration
could have caused the connectivity failure, and sort of default
to "well, it failed" without further clarification.

As such, this patch adds explicit catching of the requests
ConnectionError exception and raises a new internal error
with a more verbose error message in that event to provide
operators with additional clarity.
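
A minimal sketch of the pattern described above, not the actual patch.
HeartbeatConnectionError and heartbeat() are hypothetical names, and the
built-in ConnectionError stands in for the requests exception so that the
example has no third-party dependency.

```python
class HeartbeatConnectionError(Exception):
    """Hypothetical error type; the real class in IPA may be named differently."""

    def __init__(self, detail):
        super().__init__(
            "Transitory connection failure while sending heartbeat. "
            "This may resolve itself, or may indicate a network "
            "misconfiguration: %s" % detail)


def heartbeat(send):
    """Call send() and translate connection failures into a clearer error."""
    try:
        return send()
    except ConnectionError as exc:  # stand-in for the requests ConnectionError
        raise HeartbeatConnectionError(str(exc)) from exc


def failing_send():
    """Simulates a heartbeat call that cannot reach the API."""
    raise ConnectionError("connection refused")
```

Calling heartbeat(failing_send) raises HeartbeatConnectionError with the
original detail appended, while a successful send() is returned unchanged.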

Change-Id: I4cb2c0d1f577df1c4451308bd86efa8f94390b0c
Story: 2008046
Task: 40709
2020-08-20 13:45:47 -07:00
Dmitry Tantsur
d50ff06b6b Enable the logs collection by default
It's incredibly helpful when debugging and most consumers seem
to enable and rely on it.

Change-Id: I33bf58b3eb16b63b70f2a23e8a04449dc88fd94c
2020-08-19 17:25:24 +02:00
Vladyslav Drok
ba6ca246f5 Add possibility to pass global request ID
It can be done via ipa-global-request-id kernel commandline parameter.
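
As an illustration only, a kernel command line parameter of this form can
be read with a few lines of Python. parse_kernel_params() is a hypothetical
helper, and the sample string and request ID value are made up; on a live
system the string would come from /proc/cmdline.

```python
def parse_kernel_params(cmdline):
    """Return the key=value tokens of a kernel command line as a dict."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:  # skip bare flags such as "quiet"
            params[key] = value
    return params


# Made-up sample command line for demonstration.
sample = "ro quiet ipa-global-request-id=req-example-1234"
request_id = parse_kernel_params(sample).get("ipa-global-request-id")
```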

Story: 2007681
Task: 39792
Change-Id: I6f544327d310c976a1625cfb411947591867882a
2020-08-12 15:21:08 +03:00
Zuul
3e938b6fcc Merge "Support changing the protocol part of callback_url to https" 2020-08-10 14:59:51 +00:00
Zuul
9f88a0cb59 Merge "Fix TypeError on agent lookup failure" 2020-08-07 16:32:30 +00:00
Zuul
cda5467839 Merge "Examples: add deploy_steps examples" 2020-08-07 15:55:30 +00:00
Dmitry Tantsur
353d09c3b0 Support changing the protocol part of callback_url to https
Adds a new kernel parameter for manual configuration and also creates
foundation for automatic TLS support later.
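
A rough sketch of swapping the protocol part of a URL, using only the
standard library. with_scheme() is a hypothetical helper and the address is
a documentation example; the actual kernel parameter name and the code in
the patch are not shown here.

```python
from urllib.parse import urlsplit, urlunsplit


def with_scheme(url, scheme):
    """Return url with its protocol part replaced by scheme."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(scheme=scheme))


callback_url = with_scheme("http://192.0.2.10:9999", "https")
```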

Change-Id: If341c3a8a268fc8cab6bd6be04b12ca32b31c8d8
Story: #2007214
Task: #40619
2020-08-06 15:14:31 +02:00
Zuul
008316e7e3 Merge "Extends pci devices metrics" 2020-08-05 10:30:44 +00:00
Zuul
e1f6c774c0 Merge "Hint 404 lookup failures for Operators" 2020-08-05 10:28:38 +00:00
Julia Kreger
5eab9bced6 Fix TypeError on agent lookup failure
Agent lookups can fail as we presently use logging.exception,
better known in our code as LOG.exception, which can also generate
other fun issues on journald based systems where additional errors
could be raised resulting in us being unable to troubleshoot the
actual issue.

Because of the misuse of LOG.exception and the default behavior
of the backoff retry handler, the retry logic was also not
functional, as any error, no matter how small, caused IPA to
just exit.

Change-Id: Ic4608b7c6ff9773d1403926efb3d59869c71343b
Story: 2007968
Task: 40465
2020-08-04 20:43:02 -07:00
Kaifeng Wang
b424fbfa35 Extends pci devices metrics
Collects PCI class, revision, and bus information for the pci-devices
collector. These metrics, as well as vendor id and device id, are
components which can be used to construct device information like
lspci output, which is how cyborg agent collects accelerator devices.

Accelerator device based scheduling is possible after ironic has such
information in place.

Change-Id: I6c37c554f37dd5f1d21c8fd4fad2a4f44a3c75d7
Story: 2007971
Task: 40474
2020-08-04 23:32:37 +08:00
Dmitry Tantsur
ce53863361 Examples: add deploy_steps examples
Change-Id: Ifacd8fb05a80f34029965156334fbb707468f1f6
2020-08-04 16:51:54 +02:00
Zuul
a9ed390f08 Merge "set EVENTLET_NO_GREENDNS to 'yes'" 2020-07-31 18:27:55 +00:00
Julia Kreger
9830f3cb0f set EVENTLET_NO_GREENDNS to 'yes'
Eventlet, when monkey patching occurs, replaces the base
DNS resolver methods. This can lead to compatibility issues
and unexpected exceptions being raised during the process
of monkey patching, such as when there are no resolvers.

As such, since we don't really need monkey patching of DNS,
set the flag. This should make the inspector CI jobs happier,
where we neither need nor use DNS, and where tinycore may not be
setting a resolver configuration at all. That is the root of the
failure upon monkey patching that causes IPA to fail on start in
certain circumstances.

As a note, this has been performed on other projects due to
bugs. See Id9fe265d67f6e9ea5090bebcacae4a7a9150c5c2.
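
A minimal sketch of how such a flag is typically set from Python, assuming
it has to be in the environment before eventlet is first imported. The
commented-out lines only indicate where the import would follow and are
not part of the runnable example.

```python
import os

# Set the flag before the first "import eventlet" in the process, so that
# eventlet leaves DNS resolution unpatched.
os.environ["EVENTLET_NO_GREENDNS"] = "yes"

# import eventlet            # would follow here in a real entry point
# eventlet.monkey_patch()
```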

Change-Id: Ib8f7b844b1bfffff16f88ebbb6ef5ddbe61d5a30
Story: 2007936
Task: 40394
2020-07-31 16:21:06 +02:00
Zuul
ad9c54f55c Merge "Return the final RAID configuration from apply_configuration" 2020-07-29 14:00:08 +00:00
Dmitry Tantsur
f03d72019a Return the final RAID configuration from apply_configuration
AgentRAID expects it and fails with TypeError if it's not provided.

Change-Id: Id84ac129bba97540338e25f0027aa0a0f51bde52
Story: #2006963
2020-07-29 10:10:18 +02:00
Dmitry Tantsur
eb87651496 Allow erase_devices_metadata to be used as a deploy step
Change-Id: I75f156dd76b0e3aaa1592ba24fe42fb2a7057cc8
Story: #2006963
2020-07-27 17:57:37 +02:00
Zuul
dc395c5837 Merge "More refactoring of the image module" 2020-07-27 07:15:42 +00:00
Zuul
9ca640a1c5 Merge "Prevent un-needed iscsi cleanup" 2020-07-25 13:54:51 +00:00
Riccardo Pittau
80e11811f5 More refactoring of the image module
Introducing new function _umount_all_partitions to reduce the size
of _install_grub2

Change-Id: I304468d57b10d677f2a9d58aec42a1bf414c6cba
2020-07-24 14:34:46 +02:00
Zuul
daf61f33b0 Merge "Fix bootloader install issue with MDRAID" 2020-07-22 22:13:34 +00:00
Zuul
bfb395837d Merge "Adds poll mode deployment support" 2020-07-22 19:53:31 +00:00
Doug Szumski
5e95b1321d Fix bootloader install issue with MDRAID
When no root_device hint is set, an MDRAID partition can be incorrectly
selected as the root device which causes installation of the bootloader
to the physical disks behind the MDRAID volume to fail. See the notes
in the referenced Story for more detail.

This change adds a little more specificity to the listing of block
devices.

Change-Id: I66db457e71a0586723ee753bef961aec5bf58827
Story: 2007905
Task: 40303
2020-07-22 11:16:13 -07:00
Julia Kreger
2a56ee03b6 Prevent un-needed iscsi cleanup
When we added software raid support, we started calling bootloader
installation. As time went on, we enhanced that code path for non
RAID cases in order to ensure that UEFI nvram was setup
for the instance to boot properly.

Somewhere in this process, we missed a possible failure case where
the iscsi client tgtadm may return failures. Obviously, the correct
path is to not call iscsi teardown if we don't need to.

Since it was always semi-opportunistic teardown, we can't blindly
catch any error, and if we started iSCSI and failed to tear the
connection down, we might want to still fail, so this change
moves the logic over to use a flag on the agent object, which
one extension sets and the other reads in order to take
action based upon it.

Change-Id: Id3b1ae5e59282f4109f6246d5614d44c93aefa7c
Story: 2007937
Task: 40395
2020-07-20 14:24:06 -07:00
Dmitry Tantsur
1f3b70c4e9 Ignore devices with size 0 when collecting inventory
delete_configuration still fetches all devices as it needs to clean
ones with broken RAID.

Story: #2007907
Task: #40307
Change-Id: I4b0be2b0755108490f9cd3c4f3b71a5e036761a1
2020-07-09 18:28:20 +02:00
Riccardo Pittau
9d9a6bce5c Refactor part of image module
Shuffle some functions around and reduce the size of
_is_bootloader_loaded by moving logic out to a new function.

Change-Id: I9c10bf05186dcebb37f175d61bf4ac9ff86b6510
2020-07-07 10:44:50 +02:00
Zuul
2e9620a2c0 Merge "Limit Inspection->Lookup->Heartbeat lag" 2020-07-06 18:08:14 +00:00
Zuul
6218725610 Merge "Fix serializing ironic-lib exceptions" 2020-07-06 16:47:58 +00:00
Julia Kreger
c76b8b2c21 Limit Inspection->Lookup->Heartbeat lag
Caches hardware information collected during inspection
so that the initial lookup can occur without any delay.

Also adds logging to track how long inventory collection takes.

Co-Authored-By: Dmitry Tantsur <dtantsur@protonmail.com>
Change-Id: I3e0d237d37219e783d81913fa6cc490492b3f96a
2020-07-03 10:32:26 +02:00
Dmitry Tantsur
ba3caa6c64 Increase the ESP partition size to 550 MiB when using software RAID
This has been a popular guidance, and diskimage-builder has recently
started following it.

Change-Id: I794c846fb191c15b0a30546bf64d624dfbde0fd4
2020-07-02 17:30:33 +02:00
Zuul
de7d5affe7 Merge "Mount all vfat partitions before calling grub2" 2020-07-02 10:37:04 +00:00
Dmitry Tantsur
a4855c544c Fix serializing ironic-lib exceptions
Change-Id: If1408e4b81d263c56b4bbab618dd0737db5f762e
Story: #2007889
Task: #40268
2020-07-02 12:18:53 +02:00
Arne Wiebalck
c5022790b3 Mount all vfat partitions before calling grub2
In order to ensure grub2 finds all files it needs, mount all
vfat partitions specified in the deployed image.

Story: #2007618
Task: #39629
Change-Id: Ie5b6e0abc3f266409562f9ecb26538126b667056
2020-06-30 18:31:58 +02:00
Dmitry Tantsur
00ad03b709 Fixes minor issues in the read() retries patch
Follow-up to commit c5b97eb781cf9851f9abe87a1500b4da55b8bde8.

Two things slipped through the cracks:
* ImageDownloadError was instantiated incorrectly, resulting in a wrong
  error message. This was uncovered by using assertRaisesRegex in tests.
* We allowed calling write(None). This was uncovered by avoiding sleep(4)
  in tests and enabling more failed calls before timeout.

Change-Id: If5e798c5461ea3e474a153574b0db2da96f2dfa8
2020-06-30 10:51:53 +02:00
Zuul
f97f8e2c06 Merge "Fix confusing logging when running asynchronous commands" 2020-06-29 22:40:02 +00:00
Zuul
9219aae291 Merge "Extend retries to 9, 10 seconds apart." 2020-06-29 22:40:01 +00:00
Dmitry Tantsur
0eee26ea66 Fix confusing logging when running asynchronous commands
We log them as completed when they start executing.

Also fix a problem in remove_large_keys that prevented items
with defaultdict from being logged.

Change-Id: I34a06cc85f55c693416f8c4c9877d55d6affafc9
2020-06-26 15:19:04 +02:00
Riccardo Pittau
5cc44d251f Add debug message to node lookup
This should help identify the start of the node lookup.

Change-Id: I72f0949fee84be5a2b06eab976c5560e252fa63a
2020-06-25 16:04:00 +02:00
Zuul
c94fb84497 Merge "Minor clean-up follow-up to timeout on read() fix" 2020-06-25 10:23:18 +00:00
Julia Kreger
7abda4eefe Minor clean-up follow-up to timeout on read() fix
Just some minor cleanup driven from the review process.

Change-Id: I0b3d73c251d6da6d85e11279990dcc36751e27e7
2020-06-24 10:02:28 -07:00
Julia Kreger
c77a7df851 Extend retries to 9, 10 seconds apart.
The download retry interval was previously five seconds which is
not long enough to recover after a hard network connectivity break
where we may be reliant upon network port forwarding hold-down
timers or even routing protocol route propagation to recover
communication.

Previously the time value was 5 seconds, with 3 attempts, meaning
15 seconds total ignoring the error detection timeouts.

Now it is 10 seconds, with 10 attempts, meaning 100 seconds before
the error detection timeouts.
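
The arithmetic above, written out as a quick check. total_retry_wait() is
a hypothetical helper used only for this illustration.

```python
def total_retry_wait(interval_seconds, attempts):
    """Total time spent waiting between attempts, ignoring detection timeouts."""
    return interval_seconds * attempts


old_total = total_retry_wait(5, 3)    # 15 seconds
new_total = total_retry_wait(10, 10)  # 100 seconds
```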

Change-Id: I6d11edc9a3156f2bdc21c3d432ecc7625d652699
2020-06-23 20:27:49 +00:00
Julia Kreger
159ab9f0ce Add full download retries
Instead of just trying to get the connection and handler
for the download, let's retry the whole action of
downloading.
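
One way to picture retrying the whole action, as a sketch rather than the
actual implementation. download_with_retries() and its parameters are
hypothetical, and OSError stands in for whatever errors the real download
code treats as retryable.

```python
import time


def download_with_retries(download, attempts, interval, sleep=time.sleep):
    """Run download() up to `attempts` times, sleeping between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # Each attempt repeats the entire download, not only the connect.
            return download()
        except OSError as exc:
            last_error = exc
            if attempt + 1 < attempts:
                sleep(interval)
    raise last_error
```

The sleep argument is injectable so that tests do not have to wait in
real time.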

Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a
2020-06-23 20:27:41 +00:00
Julia Kreger
c5b97eb781 Add timeout operations to try and prevent hang on read()
Socket read operations can be blocking and may not timeout as
expected when thinking of timeouts at the beginning of a
socket request. This can occur when streaming file contents
down to the agent and there is a hard connectivity break.

In other words, we could be in a situation like:

- read(fd, len) - Gets data
- Select returns context to the program, we do things with data.
** hard connectivity break for next 90 seconds**
-  read(fd, len) - We drain the in-memory buffer side of the socket.
-  Select returns context, we do things with our remaining data
** Server retransmits **
** Server times out due to no ack **
** Server closes socket and issues a FIN,RST packet to the client **
** Connectivity restored, Client never got FIN,RST **
** Client socket still waiting for more data **
- read(fd, len) - No data returned
- Select returns, yet we have no data to act on as the buffer is
  empty OR the buffered data doesn't meet our required read len value.
  tl;dr noop
- read(fd, len) <-- We continue to try and read until the socket is
                    recognized as dead, which could be a long time.

NOTE: The above read()s are python's read() on contents being
      streamed. Lower level reads exist, but brains will hurt
      if we try to cover the dynamics at that level.

As such, we need to keep track of the last time we
received a packet, and use that to decide whether we have
timed out. Requests periodically yields back even when no data
has been received, in order to allow the caller to wall
clock the progress/status and take appropriate action.

When we exceed the timeout time value with our wall clock,
we will fail the download.
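
The wall-clock check can be sketched as follows. This is an illustration
under assumptions, not the code from the patch: DownloadTimeout and
read_with_deadline() are hypothetical names, and the chunk iterator is
assumed to hand back an empty chunk whenever it yields without new data.

```python
import time


class DownloadTimeout(Exception):
    """Hypothetical error raised when no data arrives within the timeout."""


def read_with_deadline(chunks, timeout, clock=time.monotonic):
    """Yield non-empty chunks; fail once `timeout` seconds pass without data."""
    last_data = clock()
    for chunk in chunks:
        if chunk:
            # Data arrived, so restart the wall clock.
            last_data = clock()
            yield chunk
        elif clock() - last_data > timeout:
            # The iterator yielded with nothing to show for too long.
            raise DownloadTimeout(
                "no data received for more than %s seconds" % timeout)
```

The clock argument is injectable so the behavior can be exercised with a
fake clock instead of real waiting.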

Change-Id: I7214fc9dbd903789c9e39ee809f05454aeb5a240
2020-06-23 13:25:09 -07:00
Kaifeng Wang
61c95554ff Adds poll mode deployment support
Adds a new poll extension to provide get_hardware_info and get_node_info
interfaces.

get_hardware_info will be used for node validation by ironic deploy
drivers.

get_node_info will be used for sending lookup data to IPA.

Standalone mode was assumed to be debug only, but that is no longer
the case once poll mode is introduced. This change slightly updates
the description and also prevents the mdns lookup when standalone is
true.

Story: 1526486
Task: 28724

Change-Id: I5ad772a18cc4584585c5a7b6fb127547cece1998
2020-06-21 16:44:00 +08:00
Zuul
46bf7e0ef4 Merge "Add a deploy step for writing an image" 2020-06-20 00:00:10 +00:00
Dmitry Tantsur
6d7ec350ff Make get_partition_uuids work with whole disk images
We used to populate the root UUID inside the message formatting
function; move it to the actual prepare_image/cache_image calls.

Change-Id: Ifb22220dfd49633e8623dd76f7a6a128f5874b78
2020-06-17 14:38:58 +02:00
Zuul
751dac7b90 Merge "Split and move logic for partition tables" 2020-06-11 22:53:07 +00:00