The races here are non-obvious. First, in general, HTTP audit event
dispatching looks synchronous in GitOverHttpServlet, but it is
effectively asynchronous from the client's point of view, since it
happens after the response data has been written over the HTTP
connection. There is no way to know whether the thread on the other end
has already read that data and continued executing before #doFilter
returns.
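A toy model of the race, with all names made up for illustration (none
of this is Gerrit code): the "servlet" thread hands a response to the
"client" thread and only afterwards records the audit event, so nothing
orders the client's next steps after the audit recording.

  import java.util.concurrent.ConcurrentLinkedQueue;
  import java.util.concurrent.SynchronousQueue;

  class AuditRaceSketch {
    // Stand-in for the HTTP connection between the two threads.
    static final SynchronousQueue<String> response = new SynchronousQueue<>();
    static final ConcurrentLinkedQueue<String> auditLog =
        new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws InterruptedException {
      Thread servlet =
          new Thread(
              () -> {
                try {
                  response.put("200 OK"); // response "written" first...
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  return;
                }
                auditLog.add("http-audit-event"); // ...audit recorded after
              });
      servlet.start();

      response.take(); // the "client" has now read the response
      // The servlet thread may or may not have recorded the event yet,
      // so this can legitimately print 0 or 1.
      System.out.println("audit events visible: " + auditLog.size());
      servlet.join();
    }
  }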
More specifically, in the test environment, GitOverHttpServletIT performs
some Git-over-HTTP operations during setup, namely cloning the repo into
an InMemoryRepository. We don't know whether the audit events generated
by this operation are processed before or after the @Test method starts
executing. As a result, we can't use the cleaner alternative of having a
test set up its own AuditListener binding, because the DynamicSet
addition may or may not be visible when GitOverHttpServlet calls the
audit listener.
Fundamentally, we can't avoid waiting an arbitrarily long amount of time
for the audit events to appear, because the JVM provides no time-bounded
guarantee about when the HTTP thread will continue executing after the
test thread has finished reading the HTTP response. So ultimately, this
change boils down to increasing the timeout we wait for audit events to
appear. (Note also that under pessimal JVM thread scheduling we are
subject to timeouts in HTTP communication as well, so even if we
somehow got rid of timeouts for audit event processing, we would not be
free of timeouts entirely.)
That said, this change also makes the code more reliable and easier to
reason about in several ways. First, use a BlockingQueue instead of
wait/notify, which is a better abstraction for pumping messages between
threads.
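The shape of that pattern is roughly the following (a generic sketch,
not the actual FakeGroupAuditService code; the String event type and
the names are placeholders):

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;
  import java.util.concurrent.TimeUnit;

  class AuditEventSink {
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();

    // Producer side, called from the HTTP thread: no locks, no notify.
    void onAuditEvent(String event) {
      events.offer(event);
    }

    // Consumer side, called from the test thread: blocks until an event
    // arrives or the timeout expires, instead of a hand-rolled
    // synchronized/wait/notify loop.
    String nextEvent() throws InterruptedException {
      return events.poll(30, TimeUnit.SECONDS);
    }
  }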
Second, we now keep track in GitOverHttpServlet of how many requests
were started, and this AtomicLong increment operation happens before any
HTTP response data is written. This means that we know exactly how many
audit events to expect from the BlockingQueue, and callers don't have to
specify the expected count in advance. This reduces the risk of
a caller miscounting and waiting for an event that will never arrive. It
also means that test methods can easily drain an arbitrary number of
events from the setup code with a single call at the top of the method.
The remaining events are generated entirely within the method.
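A minimal sketch of the counter-plus-drain idea (simplified, with
made-up names; the real code lives in GitOverHttpServlet and
FakeGroupAuditService):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.atomic.AtomicLong;

  class AuditEventDrain {
    // Incremented by the servlet before any response bytes are written,
    // so by the time a client has seen a response, the count already
    // includes that request.
    final AtomicLong requestsStarted = new AtomicLong();
    final BlockingQueue<String> auditEvents = new LinkedBlockingQueue<>();
    private long drainedSoFar;

    // Waits for one audit event per request started since the last
    // drain, so callers never have to count requests themselves.
    synchronized List<String> drainAuditEvents() throws InterruptedException {
      long expected = requestsStarted.get() - drainedSoFar;
      List<String> drained = new ArrayList<>();
      for (long i = 0; i < expected; i++) {
        String event = auditEvents.poll(30, TimeUnit.SECONDS);
        if (event == null) {
          throw new AssertionError("timed out waiting for audit event");
        }
        drained.add(event);
      }
      drainedSoFar += expected;
      return drained;
    }
  }

A test method can then call drainAuditEvents() once at the top to flush
whatever setup produced, and again after each Git operation to collect
exactly that operation's events.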
Now that we more reliably dequeue all events, it's clear that we
actually do expect two HTTP requests in both the upload-pack and
receive-pack cases, since each smart HTTP operation issues a ref
advertisement request (/info/refs) followed by the service request. The
previous code, which expected only one event in the upload-pack case,
was wrong and only passed due to flakiness.
The timeouts may still need to be adjusted to work in other
environments (slow laptops, CI). In the worst case, we can switch to
BlockingQueue#take(), which has no timeout at all.
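That fallback would just mean swapping the bounded poll for an
unbounded take; in the AuditEventSink sketch above, nextEvent() would
become something like:

  String nextEvent() throws InterruptedException {
    // Blocks until an event arrives; the only remaining limit is
    // whatever timeout the test harness itself enforces.
    return events.take();
  }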
As a side effect, move FakeGroupAuditService to the acceptance package,
since it needs a new dependency that we don't really need to add to the
testing package. Like other classes in the acceptance package, it's used
during acceptance test setup, not in any smaller tests.
Considered the following alternatives to introducing a new counter in
GitOverHttpServlet:
* Changing the GroupAuditService interface to explicitly record the
start of requests. Getting the interface right is nontrivial and
would require changes to multiple implementations, with no benefit
outside of tests.
* Hooking into existing HTTP request metrics, which doesn't currently
work due to the bug identified by Ic7b38a01.
Verified that this reduces flakiness to <0.5% on my workstation with:
$ bazel test --runs_per_test=200 --test_filter='GitOverHttpServletIT#.*AuditEventLog' //javatests/com/google/gerrit/acceptance/git:git
Change-Id: Idd57cc7f5e0b6504403b10bd3676e78d9f1cd792