gerrit/gerrit-plugin-api
Hugo Arès 00fc15ac00 Reduce chance of deadlock in account cache
This deadlock is not the typical deadlock where 2 threads locked a
resource and each one is waiting to lock a second resource already
locked by the other thread. The thread owning the account cache lock is
parked, which tell us that the locked was not released. I could not
determine the exact sequence of events leading to this deadlock making
it really hard to report/fix the problem.

While investigating, I realized that there quite a few reported issues
in Guava that could be major for Gerrit:

  Our deadlock happening in account cache
  https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

  Other deadlock
  https://github.com/google/guava/issues/2976
  https://github.com/google/guava/issues/2863

  Performance
  https://github.com/google/guava/issues/2063

  Race condition
  https://github.com/google/guava/issues/2971

Because I could not reproduce the deadlock in a dev environment or in
a unit test making it almost impossible to fix, I considered other
options such as replacing Guava by something else.

The maintainer of Caffeine[1] cache claims that Caffeine is a high
performance[2], near optimal caching library designed/implemented base
on the experience of designing Guava's cache and ConcurrentLinkedHashMap.
I also did some benchmarks about spawning a lot of threads reading/writing
values from the caches. I ran those benchmarks on both Guava and Caffeine
and Guava was always taking at least double the time than Caffeine to
complete all operations.

Migrating to Caffeine is almost a drop-in replacement. Caffeine
interface are very similar to Guava cache and there is an adapter to
migrate to Caffeine and keep using Guava's interfaces. After migrating
to Caffeine, we saw that deadlock occurrence was reduced from once a day
to once every 2 weeks in our production server.

The maintainer of Caffeine, Ben Manes pointed us to the possible
cause[3] of this issue, a bug[4] in the kernel and its fix[5]. Our
kernel version is supposed to have the fix but we will try another OS
and kernel version.

Replacing Guava caches by Caffeine brings 2 things, it reduces the
chances of having the deadlock most likely caused by a kernel bug and
improve the cache performance.

[1]https://github.com/ben-manes/caffeine
[2]https://github.com/ben-manes/caffeine/wiki/Benchmarks
[3]https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
[4]b0c29f79ec
[5]76835b0ebf

Bug: Issue 7645
Change-Id: I8d2b17a94d0e9daf9fa9cdda563316c5768b29ae
2018-02-22 18:53:47 +00:00
..
2018-01-15 11:07:26 +09:00