104 Commits

Author SHA1 Message Date
Jay Faulkner
3d42298619 Remove standby.cache_image support
Image caching was never fully supported in Ironic or IPA; this is vestigial
code leftover from a partial implementation.

Even if we implemented it today, we'd likely use a completely different
methodology.

Change-Id: Id4ab7b3c4f106b209585dbd090cdcb229b1daa73
2023-10-24 15:02:44 -07:00
Zuul
b42f0be422 Merge "implement basic-auth support for user-image download process" 2023-10-13 17:08:28 +00:00
Julia Kreger
cb61a8d6c0 Retry on checksum failures
HTTP is a fun protocol.

Size is basically optional. And clients implicitly trust that the
server and socket have transferred all the bytes. Which *really* means
you should always checksum.

But... previously we didn't checksum as part of retrying.

So if anything happened with python-requests, or lower level
library code or the system itself causing bytes to be lost off the
buffer, creating an incomplete transfer situation, then we wouldn't
know until the checksum.

So now, we checksum and re-trigger the download if there is a
failure of the checksum.

This involved a minor shift in the download logic, and resulted in
a needful minor fix to an image checksum test as it would loop for
90 seconds as well.

Closes-Bug: 2038934
Change-Id: I543a60555a2621b49dd7b6564bd0654a46db2e9a
2023-10-10 09:15:31 -07:00
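
The retry-on-checksum-failure idea described above can be sketched roughly
as follows. This is a minimal illustration and not the agent's actual code:
the fetch callable, the ChecksumMismatch exception and the attempts
parameter are all hypothetical names introduced for the example.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when no download attempt matches the expected digest."""


def download_with_checksum_retry(fetch, expected_hex, algo="sha256", attempts=3):
    """Call fetch() until its payload matches expected_hex.

    fetch is a hypothetical zero-argument callable that returns the bytes
    of one complete download attempt.
    """
    last_digest = None
    for _ in range(attempts):
        data = fetch()
        # Verify every attempt, so a silently truncated transfer
        # triggers a fresh download instead of a late failure.
        last_digest = hashlib.new(algo, data).hexdigest()
        if last_digest == expected_hex:
            return data
    raise ChecksumMismatch(
        "expected %s, got %s after %d attempts"
        % (expected_hex, last_digest, attempts))
```

A real implementation would stream to disk and hash incrementally rather
than hold the whole image in memory; the sketch only shows the control flow.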
Adam Rozman
70961789a6 implement basic-auth support for user-image download process
This feature was proposed in https://bugs.launchpad.net/ironic-python-agent/+bug/2021947

Change-Id: I9dbfc1402240beb75b6736214753fd86dccae676
2023-10-10 16:25:51 +03:00
Julia Kreger
c65ad42ff1 Log the number of bytes downloaded
When troubleshooting download issues, which may present
as checksum validation failures, it is difficult to understand
if the *entire* file was downloaded due to the way HTTP works.

In that, a download may start with a successful result code,
and the content is streamed out until the socket is closed.

But with HTTP there is no way to know if that socket closed
prematurely and the original server size is *also* an optional
field, so just log the size we got so we don't drive the
humans [more-]insane.

Also now logs the (optional) content-length field if
supplied by the server.

Change-Id: Id71b167f4e330d54b9afddf95f1a2ef9e40398bf
2023-07-19 16:20:40 +00:00
Zuul
bb156aad6c Merge "Fix Bandit errors" 2023-06-26 09:25:09 +00:00
Julia Kreger
78c1343a54 Fix Bandit errors
Bandit 1.7.5 released with a timeout check for all requests and
urllib calls.

Fixed those.

In the process, this exposed a Bandit B310 issue, which was already
covered by the code, so it is now explicitly marked as such.

Also enables Bandit checks to be voting in CI.

Change-Id: If0e87790191f5f3648366d571e1d85dd7393a548
2023-06-06 08:34:55 -07:00
Zuul
141c5ff1c3 Merge "Add support for CentOS SUM files" 2023-05-09 09:03:25 +00:00
Harald Jensås
e7a048ecbe Add support for CentOS SUM files
The CentOS Stream SUM files use the format:
  # FILENAME: <size> bytes
  ALGORITHM (FILENAME) = CHECKSUM

Compared to the more common format:
  CHECKSUM  *FILE_A
  CHECKSUM  FILE_B

Use regular expressions to check for filename both
in the middle with parentheses and at the end.
Similarly look for valid checksums at beginning or
end of line. Also look for known checksum patterns in
case the file only contains the checksum itself.

Change-Id: I9e49c1a6c66e51a7b884485f0bcaf7f1802bda33
2023-05-03 21:31:23 +02:00
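
The two checksum file layouts described above could be matched along these
lines. This is a hedged sketch rather than the code from the change: the
function name find_checksum and both regular expressions are assumptions
made for illustration.

```python
import re

# Hex digest lengths for sha512, sha256, sha1 and md5, longest first.
_HEX_DIGEST = (r"(?:[0-9a-fA-F]{128}|[0-9a-fA-F]{64}"
               r"|[0-9a-fA-F]{40}|[0-9a-fA-F]{32})")


def find_checksum(text, filename):
    """Return the checksum for filename from a SUM file body, or None."""
    name = re.escape(filename)
    # CentOS Stream layout: ALGORITHM (FILENAME) = CHECKSUM
    middle = re.compile(r"^\S+\s*\(%s\)\s*=\s*(%s)\s*$" % (name, _HEX_DIGEST))
    # Common layout: CHECKSUM  FILENAME or CHECKSUM  *FILENAME
    trailing = re.compile(r"^(%s)\s+\*?%s\s*$" % (_HEX_DIGEST, name))
    # Drop blank lines and "# FILENAME: <size> bytes" style comments.
    lines = [line.strip() for line in text.splitlines()
             if line.strip() and not line.lstrip().startswith("#")]
    for line in lines:
        for pattern in (middle, trailing):
            match = pattern.match(line)
            if match:
                return match.group(1)
    # Fall back to a file that contains nothing but a single checksum.
    if len(lines) == 1 and re.fullmatch(_HEX_DIGEST, lines[0]):
        return lines[0]
    return None
```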
Julia Kreger
c05fdf790c Fix checksum validation logic
The checksum validation logic, which was updated early on in the
whole process of deprecating md5, didn't account for a URL *or* a
longer checksum (i.e. sha256/sha512), which was decided on while the
overall approach was still being worked out.

Fixes the logic, and adds additional tests.

Change-Id: Ic4053776e131fc02ace295a1e69e9f9faab47f42
2023-05-02 17:24:57 -07:00
Julia Kreger
32df26a22a Disable MD5 image checksums
MD5 image checksums have long been superseded by the use of a
``os_hash_algo`` and ``os_hash_value`` field as part of the
properties of an image.

In the process of doing this, we determined that checksum via
URL usage was non-trivial and determined that an appropriate
path was to allow the checksum type to be determined as needed.

Change-Id: I26ba8f8c37d663096f558e83028ff463d31bd4e6
2023-04-24 16:54:42 -07:00
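
One way to "determine the checksum type as needed" is to infer it from the
length of the hex digest. The sketch below assumes that approach; the
function name infer_checksum_algo and the allow_md5 flag are hypothetical
and are not taken from the change itself.

```python
import re

# Hex digest length to algorithm name; md5 is listed only so that it can
# be rejected explicitly.
_ALGO_BY_HEX_LENGTH = {32: "md5", 64: "sha256", 128: "sha512"}


def infer_checksum_algo(checksum, allow_md5=False):
    """Guess the hash algorithm of a hex checksum string from its length."""
    if checksum.startswith(("http://", "https://")):
        # A checksum URL has to be fetched and parsed before inference.
        raise ValueError("checksum is a URL, not a digest")
    if not re.fullmatch(r"[0-9a-fA-F]+", checksum):
        raise ValueError("checksum is not a hexadecimal string")
    algo = _ALGO_BY_HEX_LENGTH.get(len(checksum))
    if algo is None:
        raise ValueError("unrecognized checksum length %d" % len(checksum))
    if algo == "md5" and not allow_md5:
        raise ValueError("MD5 checksums are disabled")
    return algo
```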
Dmitry Tantsur
6a1334a068 Drop support for instance netboot
Change-Id: I2b4c543537dac8904028fdcdb590c1c214238e10
2022-07-07 16:38:22 +02:00
Derek Higgins
12f5f30e63 Instruct qemu-img to write image zeros to disk.
Doing this will cause it not to zero out the entire
block device which can be very costly on a slow HDD.

Story: 2009227
Task: 43315

Change-Id: I62ba2afc037d9844387e6b0984fe5008779d95d2
2021-12-08 15:56:05 +00:00
Dmitry Tantsur
cb836a29bf Trivial: minor fixes in error messages
Change-Id: I06b32c2eb576520cddff88074e4619070731017d
2021-09-07 14:41:38 +02:00
Riccardo Pittau
efbbc86f53 Increase version of hacking and pycodestyle
Fix H904 "Delay string interpolations at logging calls" errors

Change-Id: I331808d0132094faf739998a6984440787d3ebf8
2021-07-30 14:34:33 +02:00
Zuul
6be440eb3b Merge "Refactor: use convert_image from ironic_lib" 2021-06-04 16:35:00 +00:00
Zuul
7fdbcde3de Merge "Stop accepting duplicated configdrive" 2021-06-02 12:36:57 +00:00
Dmitry Tantsur
f657526807 Stop accepting duplicated configdrive
We're currently requiring it twice: in image_info and in a separate
configdrive argument. I think we should eventually settle on separate
arguments for separate entities, so this change makes the value in
image_info optional with a goal to stop accepting it.

We could probably just remove the handling in image_info, but a
deprecation is safer.

The (unused in ironic) cache_image call is updated with an optional
configdrive argument.

Story: #2008904
Task: #42480
Change-Id: I1e2efa28efa3ea7e389774cb7633d916757bc6ed
2021-06-02 11:19:39 +02:00
Dmitry Tantsur
33d889c3c4 Refactor: use convert_image from ironic_lib
Change-Id: If890baf3545cff6cef7c645c42e7f9d9038c9aa7
2021-06-01 14:07:34 +02:00
Julia Kreger
9e4c7052a2 Limit qemu-img execution arenas
qemu-img attempts to launch multiple threads by default *and*
attempts to have multiple memory allocation arenas to operate
from. While multithreading can be good for performance, this
pattern and the memory footprint for process launch and
dependencies can turn the memory footprint for a cirros image
conversion (16MB) into 1.2GB of memory being asked for by the
qemu-img tool.

In order to limit this impact, as the default number of arenas
is governed by the number of CPUs times the number 8, it seems
reasonable to lower this to a more reasonable number which
also helps keep our possible memory footprint from being exceeded.

Change-Id: I71a28ec59ec31c691205eb34d9fcab63a2ccb682
Story: 2008928
Task: 42528
2021-05-26 13:04:46 -07:00
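
A rough sketch of capping the arena count for a qemu-img run follows. It
assumes the limit is applied through glibc's MALLOC_ARENA_MAX environment
variable; the helper name and the value 3 are illustrative assumptions, not
details taken from the commit.

```python
import os


def build_qemu_img_convert(source, dest, arena_max=3):
    """Return (argv, env) for a raw conversion with a capped arena count."""
    env = dict(os.environ)
    # MALLOC_ARENA_MAX is the glibc tunable that caps malloc arenas; the
    # default value of 3 here is an assumption made for illustration.
    env["MALLOC_ARENA_MAX"] = str(arena_max)
    argv = ["qemu-img", "convert", "-O", "raw", source, dest]
    return argv, env
```

The result could then be passed to something like
subprocess.run(argv, env=env, check=True).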
Dmitry Tantsur
606e500312 Rewrite write_image.sh in Python
Change-Id: I0caa65561948f4e0934943a7a0d3a209701b5a59
2021-05-18 14:45:13 +02:00
Dmitry Tantsur
24951b1029 Import deployment logic from ironic-lib
The two functions work_on_disk and create_config_drive_partition contain
a substantial part of the deployment logic. Previously we placed them in
ironic-lib for re-using on the conductor side in the iSCSI deploy
interface. Since the iSCSI deploy is going away, we can move this code
to ironic-python-agent to simplify maintenance.

Imports code from ironic_lib commit 9fb5be348202f4854a455cd08f400ae12b99e1f2.

Change-Id: I6cbcd81533f135208b57746cb0e33ffdfaf94eee
2021-05-03 14:17:57 +02:00
Dmitry Tantsur
b395181b1b Always fall back to sysrq when power off fails
The line we're looking for is not there when IPA is in a container, at least
for CentOS based containers. Just fall back to sysrq on errors.

Change-Id: Ie4ee605ad9c6cda58808512a563247175859c71e
2021-04-13 19:05:04 +02:00
Steve Baker
e61336602f Fix root UUID for streamed partition images
The root UUID changes after a streamed partition image is written to
the block device, causing later deployment failure when assuming the
old UUID.

This change updates the root UUID after streaming the partition image
is complete.

This issue may have been missed in local testing because deploying the
same image repeatedly will result in stable root UUID across runs.

Change-Id: Ice4630c16fc216980488d1427f3b02e1b8a417fa
2021-03-19 12:08:43 +01:00
Riccardo Pittau
bff252c726 Remove default parameter from execute
The param check_exit_code from the processutils extension execute has
default already at [0]
See:
https://opendev.org/openstack/oslo.concurrency/src/branch/master/oslo_concurrency/processutils.py#L214

Change-Id: Iedff5325e0737556d5eb3da601c984ddfc633873
2021-03-02 16:19:32 +01:00
Julia Kreger
4fb8163717 Fix boot mode detection for partition images
Previously, partition images were hard coded to be bios based
as opposed to consulting all of the values AND the node itself
before making the most appropriate determination. Now the agent
utilises the internal helper to properly determine the boot
mode when calling ironic-lib.

Story: 2008070
Task: 41265
Change-Id: Id5eeda69d5b9de2b393af414472d57b0d4380c43
2020-12-19 19:03:16 +00:00
Julia Kreger
246e0cf29e Change default ironic_lib invocation to flag local booting
The partition image support has been telling ironic-lib
that the machine will be local booted. While this is likely
harmless, and doesn't seem to break anything, we should have
it match moving forward just to be on the safe side so we don't
accidentally break things down the road.

Change-Id: I33e5d583964ef8c21aa04d7427bcd3957b89d449
2020-12-19 19:02:58 +00:00
Julia Kreger
cb6c0059b5 Fix default disk label with partition images
Partition images through the agent have the unfortunate
side effect of being executed without full node context
by default. Luckily we've had a similar problem and
cache the node.

This patch changes the lookup from a default of msdos
partitions to use the cached node object.

Change-Id: I002816c9372fdf1cc32f3c67f420073551479fd9
2020-12-14 06:36:18 -08:00
Julia Kreger
d3c3d4dabe Update the cache if we don't have a root device hint
Or at least try to.

Some deployments just don't use root device hints, and this is okay.

However, other deployments need root device hints, and with fast
track mode in ramdisks, we created a situation where the node cache
could be updated by a human or software between the time the agent
was started, and the deployment was requested.

As a result, the agent has been updated to check if we have a hint
and if we don't, update the cache from the node lookup endpoint.

This is not needed when the inband deploy steps are executed, as
the process of updating the steps does force the node cache to be
updated.

Change-Id: I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6
Story: 2008039
Task: 40701
2020-08-25 19:34:48 +00:00
Dmitry Tantsur
00ad03b709 Fixes minor issues in the read() retries patch
Follow-up to commit c5b97eb781cf9851f9abe87a1500b4da55b8bde8.

Two things slipped through the cracks:
* ImageDownloadError was instantiated incorrectly, resulting in a wrong
  error message. This was uncovered by using assertRaisesRegex in tests.
* We allowed calling write(None). This was uncovered by avoiding sleep(4)
  in tests and enabling more failed calls before timeout.

Change-Id: If5e798c5461ea3e474a153574b0db2da96f2dfa8
2020-06-30 10:51:53 +02:00
Zuul
c94fb84497 Merge "Minor clean-up follow-up to timeout on read() fix" 2020-06-25 10:23:18 +00:00
Julia Kreger
7abda4eefe Minor clean-up follow-up to timeout on read() fix
Just some minor cleanup driven from the review process.

Change-Id: I0b3d73c251d6da6d85e11279990dcc36751e27e7
2020-06-24 10:02:28 -07:00
Julia Kreger
159ab9f0ce Add full download retries
Instead of just trying to get the connection and handler
for the download, let's try to retry the whole action of
downloading.

Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a
2020-06-23 20:27:41 +00:00
Julia Kreger
c5b97eb781 Add timeout operations to try and prevent hang on read()
Socket read operations can be blocking and may not timeout as
expected when thinking of timeouts at the beginning of a
socket request. This can occur when streaming file contents
down to the agent and there is a hard connectivity break.

In other words, we could be in a situation like:

- read(fd, len) - Gets data
- Select returns context to the program, we do things with data.
** hard connectivity break for next 90 seconds**
-  read(fd, len) - We drain the in-memory buffer side of the socket.
-  Select returns context, we do things with our remaining data
** Server retransmits **
** Server times out due to no ack **
** Server closes socket and issues a FIN,RST packet to the client **
** Connectivity restored, Client never got FIN,RST **
** Client socket still waiting for more data **
- read(fd, len) - No data returned
- Select returns, yet we have no data to act on as the buffer is
  empty OR the buffered data doesn't meet our required read len value.
  tl;dr noop
- read(fd, len) <-- We continue to try and read until the socket is
                    recognized as dead, which could be a long time.

NOTE: The above read()s are python's read() on contents being
      streamed. Lower level reads exist, but brains will hurt
      if we try to cover the dynamics at that level.

As such, we need to keep an eye on when the last time we
received a packet, and treat that as if we have timed out
or not. Requests periodically yields back even when no data
has been received, in order to allow the caller to wall
clock the progress/status and take appropriate action.

When we exceed the timeout time value with our wall clock,
we will fail the download.

Change-Id: I7214fc9dbd903789c9e39ee809f05454aeb5a240
2020-06-23 13:25:09 -07:00
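
The wall-clock check described above can be sketched as a generator wrapped
around a chunk iterator. This is an illustration under stated assumptions,
not the agent's implementation: the names read_with_stall_timeout and
DownloadTimeout are hypothetical, and an empty chunk stands in for the
iterator yielding back without data.

```python
import time


class DownloadTimeout(Exception):
    """Raised when no data has arrived within the allowed window."""


def read_with_stall_timeout(chunks, timeout, clock=time.monotonic):
    """Yield non-empty chunks; fail once data stalls for longer than timeout.

    chunks is any iterable that may yield empty values while waiting.
    """
    last_data = clock()
    for chunk in chunks:
        if chunk:
            # Data arrived, so reset the wall-clock reference point.
            last_data = clock()
            yield chunk
        elif clock() - last_data > timeout:
            raise DownloadTimeout(
                "no data received for more than %s seconds" % timeout)
```

Passing the clock as a parameter keeps the check testable without sleeping.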
Dmitry Tantsur
6d7ec350ff Make get_partition_uuids work with whole disk images
We used to populate the root UUID inside the message formatting function;
move it to the actual prepare_image/cache_image calls.

Change-Id: Ifb22220dfd49633e8623dd76f7a6a128f5874b78
2020-06-17 14:38:58 +02:00
Dmitry Tantsur
6c1545b75b New extension call to return partition UUIDs
Currently we parse the success message from the write_image call.
This is inconvenient and incompatible with the deploy steps split.

Change-Id: I258dc1ff1ad1c9df5cbc26a7825d9e7ef2f3205b
Story: #2006963
2020-06-02 15:05:59 +02:00
Dmitry Tantsur
8adb7e1a04 Add timeout and retries when connecting to an image server
If the server is stuck for any reason, the download will hang for
a potentially long time. Provide a timeout (defaults to 60 seconds)
and 2 retries on failure.

Change-Id: Ie53519266edd914fdbfa82fe52b4a55151e5ec5f
2020-04-24 10:34:40 +02:00
Riccardo Pittau
a332a19a57 Bump hacking to 3.0.0
Change-Id: I1032ea6a2e9d79aeaecb1458c319cbeb15ac1fff
2020-03-30 12:55:46 +02:00
Julia Kreger
55b011cb1f Fix GPT partition tables after agent writes contents
Fixes errors that were being raised upon restarting the agent
directly written out software raid images as the raidset is
restarted for device consistency and partition updates later
on in the code path of deployment.

Story: 2007455
Task: 39187
Change-Id: I9abf51eb77b262932e70329af5ce1593106a3171
2020-03-29 07:45:25 -07:00
Zuul
5521fa32f6 Merge "Add NTP time sync" 2020-03-11 19:51:24 +00:00
Julia Kreger
cee4bfc4bc Add NTP time sync
Attempt to sync the clock and save it to the hardware clock.

This feature supports use of chrony or ntpdate.

Sem-Ver: feature
Change-Id: I178d7614429d582e742d9cba6d0fa3ae099775e3
Story: 1619054
Task: 11591
2020-03-07 09:16:19 -08:00
Kaifeng Wang
629a19f24b Ignore None md5 checksum field
The current check on the md5 checksum field is a bit strict now that
we have alternate hashing algorithm support from glance. This
patch ignores a None value for the md5 checksum if it exists.
This doesn't provide any use to end users but may provide
convenience for internal logic.

Change-Id: I89d7ea8ac3464a430141e80be57b743673c3a173
2020-02-22 10:52:44 +08:00
Julia Kreger
ab00904e27 Catch ValueError for FIPS 140-2 mode
In FIPS 140-2 mode, the underlying operating system will
prevent the loading of certain algorithms for hashing and
encryption. Python hashlib returns a ValueError exception
when the type cannot be instantiated.

This change catches the error and returns a relatively
user understandable reason as to why a failure has occurred.

Change-Id: Id1a144b906303caa92ce88793fba8d1b14def738
Story: 2007306
Task: 38788
2020-02-18 10:45:23 -08:00
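
A minimal sketch of the pattern described above, assuming the hasher is
created with hashlib.new(). The helper name new_hasher and the choice of
RuntimeError are illustrative and are not the agent's actual names.

```python
import hashlib


def new_hasher(algo):
    """Create a hash object, translating ValueError into a clearer error."""
    try:
        return hashlib.new(algo)
    except ValueError as exc:
        # hashlib raises ValueError when the algorithm is unknown or when
        # the operating system refuses to load it.
        raise RuntimeError(
            "Unable to use hash algorithm %r; it may be unsupported or "
            "disabled by the operating system, for example in FIPS 140-2 "
            "mode: %s" % (algo, exc)) from exc
```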
Riccardo Pittau
ca7a46b113 Stop using six library
Since we've dropped support for Python 2.7, it's time to look at
the bright future that Python 3.x will bring and stop forcing
compatibility with older versions.
This patch removes the six library from requirements, not
looking back.

Change-Id: I4795417aa649be75ba7162a8cf30eacbb88c7b5e
2019-11-29 10:18:14 +01:00
Kaifeng Wang
6f634c358b Adds bandit template and exclude some of tests
Adds a bandit configuration template and excludes some of the
tests that we don't want to fix for the moment.

Keeping job unvoted so that we can keep an eye on possible
issues while not breaking gate.

Change-Id: I092d686ba38723d7951e8f06415f28cc809ad365
Story: 2005791
Task: 33563
2019-06-20 14:39:36 +08:00
Kaifeng Wang
a9cac52190 Relax checksum fields validation
In Stein, ironic added the new os_hash_algo and os_hash_value checksum
fields provided by glance, but the checksum field is still mandatory,
which is inconvenient for the standalone use case.

We could relax the checksum checking and proceed as long as at least
one checksum mechanism is available.

Change-Id: Ia90197416f76ada0422681044a16f1c07d7049a1
Story: 2005773
Task: 33490
2019-05-28 09:38:36 +08:00
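
The relaxed rule described above might look like the following sketch. The
field names come from the commit message; the function name, the return
values and the use of ValueError are assumptions made for illustration.

```python
def validate_checksum_fields(image_info):
    """Accept image_info if at least one checksum mechanism is present.

    Returns "os_hash" when os_hash_algo and os_hash_value are both set,
    otherwise "checksum" when the legacy checksum field is set.
    """
    has_os_hash = bool(image_info.get("os_hash_algo")) and bool(
        image_info.get("os_hash_value"))
    has_checksum = bool(image_info.get("checksum"))
    if not (has_os_hash or has_checksum):
        raise ValueError(
            "image_info needs either checksum, or both os_hash_algo "
            "and os_hash_value")
    # Prefer the newer pair of fields when both mechanisms are supplied.
    return "os_hash" if has_os_hash else "checksum"
```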
Dmitry Tantsur
f821db3a54 Allow image checksum to be a URL
We allow image_source to be a URL, let us also support URLs for checksums.
This change copies handling of multi-file checksum files from metalsmith.

Change-Id: Ie4d7e5c79b76bdd72d50eeb384cf10519278a80c
Story: #2005061
Task: #29605
2019-02-25 14:28:09 +01:00
Sam Betts
fc2dfcee60 Attempt to read the partition table after writing an image
This patch adds code that tries to read the partition table after we've
successfully written an image to make sure the image that we wrote has a
valid partition table so we can more easily guarantee that what we've
written is bootable and not just junk. Without a valid partition table
writing a config drive will fail for whole disk images.

Co-Authored-By: Dmitry Tantsur <dtantsur@redhat.com>
Change-Id: I5cfd8c433a4db3e0d2d5086250e629d16234b7a4
Story: 2001760
Task: 12159
2018-11-19 18:57:23 +01:00
Zuul
f63099ebb6 Merge "Allow streaming raw partition images" 2018-10-26 14:14:55 +00:00
Dmitry Tantsur
29136bf68d Allow streaming raw partition images
Currently we support streaming raw whole disk images, but not
partition ones. This change enables it.

Change-Id: Ie95102aa3f2054a6b429f3d3e0926e90923c5faf
Story: #2003809
Task: #26558
2018-10-17 11:16:04 +02:00