James E. Blair eb803984a0 Use tenant-level layout locks
The current "layout_lock" in the scheduler is really an "abide" lock.
We lock it every time we change something in the abide (including
tenant layouts).  The name is inherited from pre-multi-tenant Zuul.

This can cause some less-than-optimal behavior when we need to wait to
acquire the "layout_lock" for a tenant reconfiguration event in one
thread while another thread holds the same lock because it is
reloading the configuration for a different tenant.  Ideally we should
be able to have finer-grained tenant-level locking instead, allowing
for less time waiting to reconfigure.

The following sections describe the layout lock use prior to this
commit and how this commit adjusts the code to make it safe for
finer-grained locking.

1) Tenant iteration

The layout lock is used in some places (notably some cleanup methods)
to avoid having the tenant list change during the method.  However,
the configloader already performs an atomic replacement of the tenant
list making it safe for iteration.  This change adds a lock around
updates to the tenant list to prevent corruption if two threads update
it at the same time.

The semaphore cleanup method indirectly references the abide and
layout for use in global and local semaphores.  This is just for path
construction, and the semaphores exist apart from the abide and layout
configurations and so should not be affected by either changing while
the cleanup method is running.

The node request cleanup method could end up running with an outdated
layout objects, including pipelines, however it should not be a
problem if these orphaned objects end up refreshing data from ZK right
before they are removed.

In these cases, we can simply remove the layout lock.

2) Protecting the unparsed project branch cache

The config cache cleanup method uses the unparsed project branch cache
(that is, the in-memory cache of the contents of zuul config files) to
determine what the active projects are.

Within the configloader, the cache is updated and then also used while
loading tenant configuration.  The layout lock would have made sure
all of these actions were mutually exclusive.  In order to remove the
layout lock here, we need to make sure the Abide's
unparsed_project_branch_cache is safe for concurrent updates.

The unparsed_project_branch_cache attribute is a dictionary that
conains references to UnparsedBranchCache objects.  Previously, the
configloader would delete each UnparsedBranchCache object from the
dictionary, reinitialize it, then incrementially add to it.

This current process has a benign flaw.  The branch cache is cleared,
and then loaded with data based on the tenant project config (TPC)
currently being processed.  Because the cache is loaded based on data
from the TPC, it is really only valid for one tenant at a time despite
our intention that it be valid for the entire abide.  However, since
we do check whether it is valid for a given TPC, and then clear and
reload it if it is not, there is no error in data, merely an
incomplete utilization of the cache.

In order to make the cache safe for use by different tenants at the
same time, we address this problem (and effectively make it so that it
is also *effective* for different tenants, even at different times).
The cache is updated to store the ltime for each entry in the cache,
and also to store null entries (with ltimes) for files and paths that
have been checked but are not present in the project-cache.  This
means that at any given time we can determine whether the cache is
valid for a given TPC, and support multiple TPCs (i.e., multiple
tenants).

It's okay for the cache to be updated simultaneously by two tenants
since we don't allow the cache contents to go backwards in ltime.  The
cache will either have the data with at least the ltime required, or
if not, that particular tenant load will spawn cat jobs and update it.

3) Protecting Tenant Project Configs (TPCs)

The loadTPC method on the ConfigLoader would similarly clear the TPCs
for a tenant, then add them back.  This could be problematic for any
other thread which might be referencing or iterating over TPCs.  To
correct this, we take a similar approach of atomic replacement.

Because there are two flavors of TPCs (config and untrusted) and they
are stored in two separate dictionaries, in order to atomically update
a complete tenant at once, the storage hierarchy is restructured as
"tenant -> {config/untrusted} -> project" rather than
"{config/untrusted} -> tenant -> project".  A new class named
TenantTPCRegistry holds both flavors of TPCs for a given tenant, and
it is this object that is atomically replaced.

Now that these issues are dealt with, we can implement a tenant-level
thread lock that is used simply to ensure that two threads don't
update the configuration for the same tenant at the same time.

The scheduler's unparsed abide is updated in two places: upon full
reconfiguration, or when another scheduler has performed a full
reconfiguration and updated the copy in ZK.  To prevent these two
methods from performing the same update simultaneously, we add an
"unparsed_abide_lock" and mutually exclude them.

Change-Id: Ifba261b206db85611c16bab6157f8d1f4349535d
2023-08-24 17:32:25 -07:00
..
2023-08-24 17:32:25 -07:00
2022-02-02 16:16:27 -08:00