Socket read operations can be blocking and may not timeout as
expected when thinking of timeouts at the beginning of a
socket request. This can occur when streaming file contents
down to the agent and there is a hard connectivity break.
In other words, we could be in a situation like:
- read(fd, len) - Gets data
- Select returns context to the program, we do things with data.
** hard connectivity break for next 90 seconds**
- read(fd, len) - We drain the in-memory buffer side of the socket.
- Select returns context, we do things with our remaining data
** Server retransmits **
** Server times out due to no ack **
** Server closes socket and issues a FIN,RST packet to the client **
** Connectivity restored, Client never got FIN,RST **
** Client socket still waiting for more data **
- read(fd, len) - No data returned
- Select returns, yet we have no data to act on as the buffer is
empty OR the buffered data doesn't meet our requried read len value.
tl;dr noop
- read(fd, len) <-- We continue to try and read until the socket is
recognized as dead, which could be a long time.
NOTE: The above read()s are python's read() on an contents being
streamed. Lower level reads exist, but brains will hurt
if we try to cover the dynamics at that level.
As such, we need to keep an eye on when the last time we
received a packet, and treat that as if we have timed out
or not. Requests periodically yeilds back even when no data
has been received, in order to allow the caller to wall
clock the progress/status and take appropriate action.
When we exceed the timeout time value with our wall clock,
we will fail the download.
(cherry picked from commit c5b97eb781)
(cherry picked from commit 33c96d0066)
This change incorporates the follow-up commit:
Fixes minor issues in the read() retries patch
Follow-up to commit c5b97eb781.
Two things slipped through the cracks:
* ImageDownloadError was instantiated incorrectly, resulting in a wrong
error message. This was uncovered by using assertRaisesRegext in tests.
* We allowed calling write(None). This was uncovered by avoiding sleep(4)
in tests and enabling more failed calls before timeout.
(cherry picked from commit 00ad03b709)
Co-Authored-By: Dmitry Tantsur <dtantsur@protonmail.com>
Change-Id: I7214fc9dbd903789c9e39ee809f05454aeb5a240