From 2836af9c4441b33678a2d6a509769a09e544d91f Mon Sep 17 00:00:00 2001 From: "James E. Blair" Date: Wed, 24 Nov 2021 10:30:29 -0800 Subject: [PATCH] Add SOS documentation and remove spec This migrates and adapts some of the SOS documentation into the ZK reference. It omits some of the detail as that belongs either in the map below or in the code itself, but it retains the narrative documentation to help understand the approach. This removes the SOS spec itself since it is now (approximately) complete, so we don't have redundant documentation. Change-Id: I7113f1a14c5e3d68a50b242c0cb05a4eedeb0aec Co-Authored-By: Tobias Henkel --- .../reference/developer/specs/index.rst | 1 - .../developer/specs/scale-out-scheduler.rst | 341 ------------------ doc/source/reference/developer/zookeeper.rst | 104 +++++- 3 files changed, 103 insertions(+), 343 deletions(-) delete mode 100644 doc/source/reference/developer/specs/scale-out-scheduler.rst diff --git a/doc/source/reference/developer/specs/index.rst b/doc/source/reference/developer/specs/index.rst index b61daf06fa..d96df0c26e 100644 --- a/doc/source/reference/developer/specs/index.rst +++ b/doc/source/reference/developer/specs/index.rst @@ -19,7 +19,6 @@ documentation instead. tenant-scoped-admin-web-API kubernetes-operator circular-dependencies - scale-out-scheduler zuul-runner enhanced-regional-executors tenant-resource-quota diff --git a/doc/source/reference/developer/specs/scale-out-scheduler.rst b/doc/source/reference/developer/specs/scale-out-scheduler.rst deleted file mode 100644 index 4bb750cdbb..0000000000 --- a/doc/source/reference/developer/specs/scale-out-scheduler.rst +++ /dev/null @@ -1,341 +0,0 @@ -Scale out scheduler -=================== - -.. warning:: This is not authoritative documentation. These features - are not currently available in Zuul. They may change significantly - before final implementation, or may never be fully completed. - -Zuul has a microservices architecture with the goal of no single point of -failure in mind. This has not yet been achieved for the zuul-scheduler -component. - -Especially within large Zuul deployments with many also long running jobs the -cost of a scheduler crash can be quite high. In this case currently all -in-flight jobs are lost and need to be restarted. A scale out scheduler approach -can avoid this. - -The same problem holds true when updating the scheduler. Currently there is no -possibility to upgrade the scheduler without downtime. While the pipeline state -can be saved and re-enqueued this still loses all in-flight jobs. Further on a -larger deployment the startup of the scheduler easily can be in the multi minute -range. Having the ability to do zero downtime upgrades can make updates much -more easier. - -Further having multiple schedulers can facilitate parallel processing of several -tenants and help reducing global locks within installations with many tenants. - -In this document we will outline an approach towards a completely single point -of failure free zuul system. This will be a transition with multiple phases. - - -Status quo ----------- - -Zuul is an event driven system with several event loops that interact with each -other: - -* Driver event loop: Drivers like Github or Gerrit have its own event loops. - They perform preprocessing of the received events and add events into the - scheduler event loop. - -* Scheduler event loop: This event loop processes the pipelines and - reconfigurations. 
- -All of these event loops currently run within the scheduler process without -persisting their state. So the path to a scale out scheduler involves mainly -making all event loops scale out capable. - - - -Target architecture -------------------- - -In addition to the event loops mentioned above we need an additional event queue -per pipeline. This will make it easy to process several pipelines in parallel. A -new driver event would first be processed in the driver event queue. This will -add a new event into the scheduler event queue. The scheduler event queue then -checks which pipeline may be interested in this event according to the tenant -configuration and layout. Based on this the event is dispatched to all matching -pipeline queues. - -As it is today different event types will have different priorities. This will -be expressed like in node-requests with a prefix. - -The event queues will be stored in Zookeeper in the following paths: - -* ``/zuul/events/connection//``: Event queue of a - connection - -* ``/zuul/events/scheduler-global/-``: Global event queue of - scheduler - -* ``/zuul/events/tenant///-``: Pipeline - event queue - -In order to make reconfigurations efficient we also need to store the parsed -branch config in Zookeeper. This makes it possible to create the current layout -without the need to ask the mergers multiple times for the configuration. This -also can be used by zuul-web to keep an up-to-date layout that can be used for -api requests. - -We also need to store the pipeline state in Zookeeper. This will be similar to -the status.json but also needs to contain the frozen jobs and their current -state. - -Further we need to replace gearman by Zookeeper as rpc mechanism to the -executors. This will make it possible that different schedulers can continue -smoothly with pipeline execution. The jobs will be stored in -``/zuul/jobs//``. - - -Driver event ingestion ----------------------- - -Currently the drivers immediately get events from Gerrit or Github, process them -and forward the events to the scheduler event loop. Thus currently all events -are lost during a downtime of the zuul-scheduler. In order to decouple this we -can push the raw events into Zookeeper and pop them in the driver event loop. - -We will split the drivers into an event receiving and an event processing -component. The event receiving component will store the events in a squenced -znode in the path ``/zuul/events/connection//``. -The event receiving part may or may not run within the scheduler context. -The event processing part will be part of the scheduler context. - -There are three types of event receive mechanisms in Zuul: - -* Active event gathering: The connection actively subscribes for events (Gerrit) - or generates them itself (git, timer, zuul) - -* Passive event gathering: The events are sent to zuul from outside (Github - webhooks) - -* Internal event generation: The events are generated within zuul itself and - typically get injected directly into the scheduler event loop and thus don't - need to be changed in this phase. - -The active and passive event gathering need to be handled slightly different. - -Active event gathering -~~~~~~~~~~~~~~~~~~~~~~ - -This is mainly done by the Gerrit driver. We actively maintain a connection to -the target and receive events. This means that if we have more than one instance -we need to find a way to handle duplicated events. This type of event gathering -can run within the scheduler process. 
Optionally if there is interest we can -also make it possible to run this as a standalone process. - -We can utilize leader election to make sure there is exactly one instance -receiving the events. This makes sure that we don't need to handle duplicated -events at all. A drawback is that there is a short time when the current leader -stops until the next leader has started event gathering. This could lead to a -few missed events. But as this is the most easiest way we can accept this in -the initial version. - -If there is a need to guarantee that there is no missed event during a -leadership change the above algorithm can be enhanced later with parallel -gathering and deduplication strategies. As this is much more complicated this -will not be in scope of this spec but subject to later enhancements. - - -Passive event gathering -~~~~~~~~~~~~~~~~~~~~~~~ - -In case of passive event gathering the events are sent to Zuul typically via -webhooks. These types of events will be received in zuul-web that stores them in -Zookeeper. This type of event gathering is used for example by the Github and -Gerrit driver (other drivers, possibly implemented before this is realized, -should be checked too). In this -case we can have multiple instances but still receive only one event. So we -don't need to take special care of event deduplication or leader election - -multiple instances behind a load balancer are save to use and recommended for -such passive event gathering. - - -Store unparsed branch config in Zookeeper ------------------------------------------ - -We need to store the global configuration in zookeeper. However zookeeper is not -designed as a database with a large amount of data we should store as little as -possible in zookeeper. Thus we only store the per project-branch unparsed config -in zookeeper. From this every part of zuul like the scheduler or also zuul-web -can quickly recalculate the layout of each tenant and keep it up to date by -watching for changes in the unparsed project-branch-config. We will lock the -complete global config with one lock and maintain a checkpoint version of it. -This way each component can watch the config version number and react -accordingly. Although we lock the complete global config we still should store -the actual config in distinct nodes per project and branch. This is needed -because of the 1MB limit per znode in zookeeper. It further makes it less -expensive to cache the global config in each component as this cache will be -updated incrementally. - -The configs will be stored in the path ``/zuul/config//`` per -branch segmented in ``/zuul/config////``. -The ``shard`` is a sequence number and will be used to store larger than 1MB -files due to the limitation mentioned above. - - -Store pipeline and tenant state in Zookeeper --------------------------------------------- - -The pipeline state is similar to the current status.json. However the frozen -jobs and their state are needed for seemless continuation of the pipeline -execution on a different scheduler. Further this can make it easy to generate -the status.json directly in zuul-web by inspecting the data in Zookeeper. -Buildsets that are enqueued in a pipeline will be stored in -``/zuul/tenant//pipeline//queue//``. - -Each buildset will contain a child znode per job that holds a data structure -with the frozen job as well as the current state. This will also contain a -reference to the node request that was used for this job. 
When the node request -is fulfilled the pipeline processor creates an execution-request which -will be locked by an executor before processing the job. The buildset will -contain a link to the execution request. The executor will accept the referenced -node request, lock the nodes and run the job. If the job needs to be canceled -the pipeline processor just pulls the execution-request. The executor will -notice this, abort the job and return the nodes. - -We also need to store tenant state like semaphores in Zookeeper. This will be -stored in ``/zuul/tenant//semaphores/``. - - -Mandatory SQL connection ------------------------- - -Currently the times database is stored on the local filesystem of the scheduler. -We already have an optional SQL database that holds the needed information. We -need to be able to rely on this information so we'll make the SQL db mandatory. - -Zuul currently supports multiple database connections. At the moment the SQL -reporters can be configured on pipelines. This should be changed to global and -tenant-based SQL reporters. When we make the database -connection mandatory zuul needs to know which one is the primary database -connection. If there is only one configured connection it will be automatically -the primary. If there are more configured connections one will need to be -configured as primary database. Reporters will use the primary -database in any case. - -The primary database can be used to query the last 10 successful build times -and use this as the times database. - - -Executor via Zookeeper ----------------------- - -In order to prepare for distributed pipeline execution we need to use Zookeeper -for scheduling jobs on the executors. This is needed so that any scheduler can -take over a pipeline execution without having to restart jobs. - -As described above the executor will look for builds. These will be stored in -``/zuul/builds/``. The builds will contain every information that is -needed to run the job. The builds are stored outside of the pipeline itself -for two reasons. First the executors should not need to do a deep search when -looking for new builds to do. Second this makes it clear that they are not -subject to the pipeline lock but have their own locks. However the buildsets -in the pipeline will contain a reference to their builds. - -During the lifecycle of a build the executor can update the states by their own. -But should enqueue result events to the corresponding pipeline event queue as -pipeline processing relies on build started, paused, finished events. The -lifecycle will be as follows. - -* Build gets created in state REQUESTED -* Executor locks it and sets the state to RUNNING. It will enqueue a build - started event to the pipeline event queue. -* If requested the executor sets the state to PAUSED after the run phase and - enqueues a build paused event to the pipeline event queue -* If build is PAUSED a resume can be requested by the pipeline processor by - adding an empty ``resume`` child node to the build. This way we don't have to - update a locked znode while ignoring the lock. The executor will then change - the state back to RUNNING and continue the execution. -* When the build is finished the executor changes the state to COMPLETE, unlocks - the build and enqueues a build finished event to the pipeline. -* If a build should be canceled the pipeline processor adds a ``cancel`` child - znode that will be recognized by the executor which will act accordingly. - -It can be that an executor crashes. 
In this case it will lose the lock. We need -to be able to recover from this and emit the right event to the pipeline. -Such a lost builds can be detected if it is in a state other than REQUESTED or -COMPLETED but unlocked. Any executor that sees such a request while looking for -new builds to execute will lock and mark it as COMPLETED and failed. It then -will emit a build completed event such that the pipeline event processor can -react on this and reschedule the build. There is no special handling needed to -return the nodes as in this case the failing executor will also lose its lock -on the nodes so they will be deleted or recycled by nodepool automatically. - - -Parallelize pipeline processing -------------------------------- - -Once we have the above data in place we can create the per pipeline event and -the global scheduler event queues in Zookeeper. The global scheduler event queue -will receive the trigger, management and result events that are not tenant -specific. The purpose of this queue is to take these events and dispatch them to -the pipeline queues of the tenants as appropriate. This event queue can easily -processed using a locking mechanism. - -We also have tenant global events like tenant reconfigurations. These need -exclusive access to all pipelines in the tenant. So we need a two layer locking -approach during pipeline processing. At first we need an RW lock at the tenant -level. This will allow to be locked by all pipeline processors at the same time -(call them readers as they don't modify the global tenant state). Management -events (e.g. tenant-reconfiguration) however will get this lock exclusive (call -them writers as they modify the global tenant state). - -Each pipeline processor will loop over all pipelines that have outstanding -events. Before processing an event it will first try to lock the tenant. If it -fails it will continue with pipelines in the the next tenant having outstanding -events. If it got the tenant lock it will try to lock the pipeline. If it fails -it will continue with the next pipeline. If it succeeds it will process all -outstanding events of that pipeline. To prevent starvation of tenants we can -define a max processing time after which the pipeline processor will switch to -the next tenant or pipeline even if there are outstanding events. - -In order to reduce stalls when doing reconfigurations or tenant reconfigurations -we can run one pipeline processor in one thread and reconfigurations in a -separate thread(s). This way a tenant that is running a longer reconfiguration -won't block other tenants. - - -Zuul-web changes ----------------- - -Now zuul can be changed to directly use the data in Zookeeper instead if -asking the scheduler via gearman. - - -Security considerations ------------------------ - -When switching the executor job queue to Zookeeper we need to take precautions -because this will also contain decrypted secrets. In order to secure this -communication channel we need to make sure that we use authenticated and -encrypted connections to zookeeper. - -* There is already a change that adds Zookeeper auth: - https://review.openstack.org/619156 -* Kazoo SSL support just has landed: https://github.com/python-zk/kazoo/pull/513 - -Further we will encrypt every secret that is stored in zookeeper using a -symmetric cipher with a shared key that is known to all zuul services but not -zookeeper. This way we can avoid dumping decrypted secrets into the transaction -log of zookeeper. 
- - -Roadmap -------- - -In order to manage the workload and minimize rebasing efforts, we suggest to -break the above into smaller changes. Each such change should be then -implemented separately. - -#. Mandatory SQL connection, definition of primary SQL connection and add SQL - reporters for tenants -#. Storing parsed branch config in zookeeper -#. Storing raw events in zookeeper using drivers or a separate service -#. Event queue per pipeline -#. Storing pipeline state and tenant state in zookeeper -#. Adapt drivers to pop events from zookeeper (split drivers into event - receiving and event processing components) -#. Parallel pipeline processing -#. Switch to using zookeeper instead of gearman diff --git a/doc/source/reference/developer/zookeeper.rst b/doc/source/reference/developer/zookeeper.rst index 89b96af005..b0aae0b202 100644 --- a/doc/source/reference/developer/zookeeper.rst +++ b/doc/source/reference/developer/zookeeper.rst @@ -1,5 +1,107 @@ +ZooKeeper +========= + +Overview +-------- + +Zuul has a microservices architecture with the goal of no single point of +failure in mind. + +Zuul is an event driven system with several event loops that interact +with each other: + +* Driver event loop: Drivers like GitHub or Gerrit have their own event loops. + They perform preprocessing of the received events and add events into the + scheduler event loop. + +* Scheduler event loop: This event loop processes the pipelines and + reconfigurations. + +Each of these event loops persists data in ZooKeeper so that other +components can share or resume processing. + +A key aspect of scalability is maintaining an event queue per +pipeline. This makes it easy to process several pipelines in +parallel. A new driver event is first processed in the driver event +queue. This adds a new event into the scheduler event queue. The +scheduler event queue then checks which pipeline may be interested in +this event according to the tenant configuration and layout. Based on +this the event is dispatched to all matching pipeline queues. + +In order to make reconfigurations efficient we store the parsed branch +config in Zookeeper. This makes it possible to create the current +layout without the need to ask the mergers multiple times for the +configuration. This is used by zuul-web to keep an up-to-date layout +for API requests. + +We store the pipeline state in Zookeeper. This contains the complete +information about queue items, jobs and builds, as well as a separate +abbreviated state for quick access by zuul-web for the status page. + +Driver Event Ingestion +---------------------- + +There are three types of event receiving mechanisms in Zuul: + +* Active event gathering: The connection actively listens to events (Gerrit) + or generates them itself (git, timer, zuul) + +* Passive event gathering: The events are sent to Zuul from outside (GitHub + webhooks) + +* Internal event generation: The events are generated within Zuul itself and + typically get injected directly into the scheduler event loop. + +The active event gathering needs to be handled differently from +passive event gathering. + +Active Event Gathering +~~~~~~~~~~~~~~~~~~~~~~ + +This is mainly done by the Gerrit driver. We actively maintain a +connection to the target and receive events. We utilize a leader +election to make sure there is exactly one instance receiving the +events. + +Passive Event Gathering +~~~~~~~~~~~~~~~~~~~~~~~ + +In case of passive event gathering the events are sent to Zuul +typically via webhooks. 
These types of events are received in zuul-web
+which then stores them in Zookeeper. This type of event gathering is
+used by GitHub and other drivers. In this case we can run multiple
+instances, but each event is still received only once, so we don't
+need to take special care of event deduplication or leader election.
+Multiple instances behind a load balancer are safe to use and
+recommended for such passive event gathering.
+
+Configuration Storage
+---------------------
+
+Zookeeper is not designed as a database for large amounts of data, so
+we store as little as possible in it. Thus we only store the unparsed
+config of each project branch in Zookeeper. From this, every part of
+Zuul, like the scheduler or zuul-web, can quickly recalculate the
+layout of each tenant and keep it up to date by watching for changes
+in the unparsed project-branch config.
+
+The actual config is sharded across multiple znodes, stored under
+per-project and per-branch paths. This is needed because of the 1MB
+limit per znode in Zookeeper. It also makes it less expensive to cache
+the global config in each component, as this cache is updated
+incrementally.
+
+Executor and Merger Queues
+--------------------------
+
+The executors and mergers each have an execution queue (and, in the
+case of executors, optionally per-zone queues). This makes it easy
+for executors and mergers to simply pick the next job to run without
+needing to inspect the entire pipeline state. The scheduler is
+responsible for submitting job requests as the state changes.
+
 Zookeeper Map
-=============
+-------------
 
 This is a reference for object layout in Zookeeper.
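
The per-pipeline event queues and the webhook events stored by zuul-web are,
at their core, ordered queues of znodes. The following is a minimal kazoo
sketch of that pattern, not Zuul's actual implementation; the
``/example/events/...`` path, the tenant and pipeline names, and the JSON
payload are assumptions made purely for illustration.

.. code-block:: python

   # Illustrative sketch only: append events as sequence znodes under a
   # per-pipeline path and consume them in order.  Not Zuul's real code.
   import json
   from kazoo.client import KazooClient

   QUEUE = "/example/events/tenant-one/check"  # assumed path layout

   def put_event(zk, event):
       # Sequence znodes give a FIFO ordering without a central counter.
       zk.create(QUEUE + "/event-", json.dumps(event).encode("utf-8"),
                 sequence=True, makepath=True)

   def pop_events(zk):
       # Handle events in sequence order, deleting each once processed.
       for name in sorted(zk.get_children(QUEUE)):
           path = QUEUE + "/" + name
           data, _stat = zk.get(path)
           yield json.loads(data)
           zk.delete(path)

   if __name__ == "__main__":
       zk = KazooClient(hosts="localhost:2181")
       zk.start()
       put_event(zk, {"type": "change-merged", "project": "demo"})
       for event in pop_events(zk):
           print(event)
       zk.stop()

Because each event lives in its own sequence znode, multiple producers can
append concurrently while consumers still see a single, well-defined order.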
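
The Configuration Storage section describes sharding data across znodes to
stay below the 1MB-per-znode limit. A minimal kazoo sketch of that technique
is shown below; the ``/example/config/...`` path, the ``shard-`` node names,
and the chunk size are illustrative assumptions, not Zuul's actual layout.

.. code-block:: python

   # Illustrative sketch only: split a serialized config blob into shards
   # below the ~1MB znode limit and reassemble it by reading the shards
   # back in sequence order.  Not Zuul's real implementation.
   from kazoo.client import KazooClient

   CHUNK = 1000 * 1000  # stay safely below the default 1MB znode limit

   def write_sharded(zk, path, data):
       # Replace any previous shards, then write numbered shards in order.
       if zk.exists(path):
           zk.delete(path, recursive=True)
       zk.ensure_path(path)
       for offset in range(0, len(data), CHUNK):
           zk.create(path + "/shard-", data[offset:offset + CHUNK],
                     sequence=True)

   def read_sharded(zk, path):
       # Reassemble the blob by concatenating shards in sequence order.
       shards = sorted(zk.get_children(path))
       return b"".join(zk.get(path + "/" + s)[0] for s in shards)

   if __name__ == "__main__":
       zk = KazooClient(hosts="localhost:2181")
       zk.start()
       blob = b"{}" * (2 * 1000 * 1000)  # stand-in for serialized config
       write_sharded(zk, "/example/config/myproject/master", blob)
       assert read_sharded(zk, "/example/config/myproject/master") == blob
       zk.stop()

Reassembly relies on the lexical ordering of the sequence suffixes, which is
why the shards can simply be sorted and concatenated when reading them back.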