diff --git a/doc/source/reference/developer/specs/index.rst b/doc/source/reference/developer/specs/index.rst
index b61daf06fa..d96df0c26e 100644
--- a/doc/source/reference/developer/specs/index.rst
+++ b/doc/source/reference/developer/specs/index.rst
@@ -19,7 +19,6 @@ documentation instead.
    tenant-scoped-admin-web-API
    kubernetes-operator
    circular-dependencies
-   scale-out-scheduler
    zuul-runner
    enhanced-regional-executors
    tenant-resource-quota
diff --git a/doc/source/reference/developer/specs/scale-out-scheduler.rst b/doc/source/reference/developer/specs/scale-out-scheduler.rst
deleted file mode 100644
index 4bb750cdbb..0000000000
--- a/doc/source/reference/developer/specs/scale-out-scheduler.rst
+++ /dev/null
@@ -1,341 +0,0 @@
-Scale out scheduler
-===================
-
-.. warning:: This is not authoritative documentation. These features
-   are not currently available in Zuul. They may change significantly
-   before final implementation, or may never be fully completed.
-
-Zuul has a microservices architecture with the goal of no single point of
-failure in mind. This has not yet been achieved for the zuul-scheduler
-component.
-
-Especially within large Zuul deployments with many long running jobs the
-cost of a scheduler crash can be quite high. Currently all in-flight jobs
-are lost in this case and need to be restarted. A scale out scheduler
-approach can avoid this.
-
-The same problem holds true when updating the scheduler. Currently there is
-no way to upgrade the scheduler without downtime. While the pipeline state
-can be saved and re-enqueued, this still loses all in-flight jobs. Further,
-on a larger deployment the startup of the scheduler can easily take several
-minutes. Having the ability to do zero downtime upgrades makes updates much
-easier.
-
-Further, having multiple schedulers can facilitate parallel processing of
-several tenants and help reduce global locks within installations with many
-tenants.
-
-In this document we outline an approach towards a Zuul system that is
-completely free of single points of failure. This will be a transition with
-multiple phases.
-
-
-Status quo
-----------
-
-Zuul is an event driven system with several event loops that interact with
-each other:
-
-* Driver event loop: Drivers like Github or Gerrit have their own event
-  loops. They perform preprocessing of the received events and add events
-  into the scheduler event loop.
-
-* Scheduler event loop: This event loop processes the pipelines and
-  reconfigurations.
-
-All of these event loops currently run within the scheduler process without
-persisting their state. So the path to a scale out scheduler mainly involves
-making all event loops scale out capable.
-
-
-Target architecture
--------------------
-
-In addition to the event loops mentioned above we need an additional event
-queue per pipeline. This will make it easy to process several pipelines in
-parallel. A new driver event is first processed in the driver event queue.
-This adds a new event into the scheduler event queue. The scheduler event
-queue then checks which pipeline may be interested in this event according
-to the tenant configuration and layout. Based on this the event is
-dispatched to all matching pipeline queues.
-
-As it is today, different event types will have different priorities. This
-will be expressed, as in node requests, with a prefix.
-
-The event queues will be stored in Zookeeper in the following paths:
-
-* ``/zuul/events/connection/<connection>/``: Event queue of a
-  connection
-
-* ``/zuul/events/scheduler-global/<priority>-<sequence>``: Global event queue
-  of the scheduler
-
-* ``/zuul/events/tenant/<tenant>/<pipeline>/<priority>-<sequence>``: Pipeline
-  event queue
-
-In order to make reconfigurations efficient we also need to store the parsed
-branch config in Zookeeper. This makes it possible to create the current
-layout without the need to ask the mergers multiple times for the
-configuration. This can also be used by zuul-web to keep an up-to-date
-layout that can be used for API requests.
-
-We also need to store the pipeline state in Zookeeper. This will be similar
-to the status.json but also needs to contain the frozen jobs and their
-current state.
-
-Further, we need to replace gearman with Zookeeper as the RPC mechanism to
-the executors. This will make it possible for different schedulers to
-continue smoothly with pipeline execution. The jobs will be stored in
-``/zuul/jobs/<sequence>``.
-
-
-Driver event ingestion
-----------------------
-
-Currently the drivers get events directly from Gerrit or Github, process
-them and forward them to the scheduler event loop. Thus all events are
-currently lost during a downtime of the zuul-scheduler. In order to decouple
-this we can push the raw events into Zookeeper and pop them in the driver
-event loop.
-
-We will split the drivers into an event receiving and an event processing
-component. The event receiving component will store the events in a
-sequenced znode in the path ``/zuul/events/connection/<connection>/<sequence>``.
-The event receiving part may or may not run within the scheduler context.
-The event processing part will be part of the scheduler context.
-
-There are three types of event receiving mechanisms in Zuul:
-
-* Active event gathering: The connection actively subscribes to events
-  (Gerrit) or generates them itself (git, timer, zuul)
-
-* Passive event gathering: The events are sent to zuul from outside (Github
-  webhooks)
-
-* Internal event generation: The events are generated within zuul itself and
-  typically get injected directly into the scheduler event loop, and thus
-  don't need to be changed in this phase.
-
-Active and passive event gathering need to be handled slightly differently.
-
-Active event gathering
-~~~~~~~~~~~~~~~~~~~~~~
-
-This is mainly done by the Gerrit driver. We actively maintain a connection
-to the target and receive events. This means that if we have more than one
-instance we need to find a way to handle duplicated events. This type of
-event gathering can run within the scheduler process. Optionally, if there
-is interest, we can also make it possible to run this as a standalone
-process.
-
-We can utilize leader election to make sure there is exactly one instance
-receiving the events. This makes sure that we don't need to handle
-duplicated events at all. A drawback is that there is a short gap between
-the current leader stopping and the next leader starting event gathering.
-This could lead to a few missed events. But as this is the easiest approach
-we can accept this in the initial version.
-
-If there is a need to guarantee that no event is missed during a leadership
-change, the above algorithm can be enhanced later with parallel gathering
-and deduplication strategies. As this is much more complicated it will not
-be in scope of this spec but subject to later enhancements.
-
-
-Passive event gathering
-~~~~~~~~~~~~~~~~~~~~~~~
-
-In case of passive event gathering the events are sent to Zuul, typically
-via webhooks. These types of events will be received by zuul-web, which
-stores them in Zookeeper. This type of event gathering is used for example
-by the Github and Gerrit drivers (other drivers, possibly implemented before
-this is realized, should be checked too). In this case we can have multiple
-instances but still receive each event only once. So we don't need to take
-special care of event deduplication or leader election -
-multiple instances behind a load balancer are safe to use and recommended
-for such passive event gathering.
-
-
-Store unparsed branch config in Zookeeper
------------------------------------------
-
-We need to store the global configuration in zookeeper. However, as
-zookeeper is not designed as a database for large amounts of data, we should
-store as little as possible in zookeeper. Thus we only store the per
-project-branch unparsed config in zookeeper. From this, every part of zuul,
-like the scheduler or zuul-web, can quickly recalculate the layout of each
-tenant and keep it up to date by watching for changes in the unparsed
-project-branch config. We will lock the complete global config with one lock
-and maintain a checkpoint version of it. This way each component can watch
-the config version number and react accordingly. Although we lock the
-complete global config, we should still store the actual config in distinct
-nodes per project and branch. This is needed because of the 1MB limit per
-znode in zookeeper. It further makes it less expensive to cache the global
-config in each component as this cache will be updated incrementally.
-
-The configs will be stored in the path ``/zuul/config/<project>/``, per
-branch segmented in ``/zuul/config/<project>/<branch>/<shard>/``.
-The ``shard`` is a sequence number and will be used to store files larger
-than 1MB due to the limitation mentioned above.
-
-
-Store pipeline and tenant state in Zookeeper
---------------------------------------------
-
-The pipeline state is similar to the current status.json. However, the
-frozen jobs and their state are needed for seamless continuation of the
-pipeline execution on a different scheduler. Further, this can make it easy
-to generate the status.json directly in zuul-web by inspecting the data in
-Zookeeper. Buildsets that are enqueued in a pipeline will be stored in
-``/zuul/tenant/<tenant>/pipeline/<pipeline>/queue/<queue>/<buildset>``.
-
-Each buildset will contain a child znode per job that holds a data structure
-with the frozen job as well as the current state. This will also contain a
-reference to the node request that was used for this job. When the node
-request is fulfilled the pipeline processor creates an execution request
-which will be locked by an executor before processing the job. The buildset
-will contain a link to the execution request. The executor will accept the
-referenced node request, lock the nodes and run the job. If the job needs to
-be canceled the pipeline processor just removes the execution request. The
-executor will notice this, abort the job and return the nodes.
-
-We also need to store tenant state like semaphores in Zookeeper. This will
-be stored in ``/zuul/tenant/<tenant>/semaphores/<semaphore>``.
-
-
-Mandatory SQL connection
-------------------------
-
-Currently the times database is stored on the local filesystem of the
-scheduler. We already have an optional SQL database that holds the needed
-information.
-We need to be able to rely on this information, so we'll make the SQL db
-mandatory.
-
-Zuul currently supports multiple database connections. At the moment the SQL
-reporters can be configured on pipelines. This should be changed to global
-and tenant-based SQL reporters. When we make the database connection
-mandatory, zuul needs to know which one is the primary database connection.
-If there is only one configured connection it will automatically be the
-primary. If there are multiple configured connections, one will need to be
-configured as the primary database. Reporters will use the primary database
-in any case.
-
-The primary database can be used to query the last 10 successful build times
-and use this as the times database.
-
-
-Executor via Zookeeper
-----------------------
-
-In order to prepare for distributed pipeline execution we need to use
-Zookeeper for scheduling jobs on the executors. This is needed so that any
-scheduler can take over a pipeline execution without having to restart jobs.
-
-As described above, the executor will look for builds. These will be stored
-in ``/zuul/builds/<sequence>``. The builds will contain all the information
-that is needed to run the job. The builds are stored outside of the pipeline
-itself for two reasons. First, the executors should not need to do a deep
-search when looking for new builds to run. Second, this makes it clear that
-they are not subject to the pipeline lock but have their own locks. However,
-the buildsets in the pipeline will contain a reference to their builds.
-
-During the lifecycle of a build the executor can update the state on its
-own, but it should enqueue result events to the corresponding pipeline event
-queue, as pipeline processing relies on build started, paused and finished
-events. The lifecycle will be as follows:
-
-* Build gets created in state REQUESTED.
-* Executor locks it and sets the state to RUNNING. It will enqueue a build
-  started event to the pipeline event queue.
-* If requested, the executor sets the state to PAUSED after the run phase
-  and enqueues a build paused event to the pipeline event queue.
-* If a build is PAUSED, a resume can be requested by the pipeline processor
-  by adding an empty ``resume`` child node to the build. This way we don't
-  have to update a locked znode while ignoring the lock. The executor will
-  then change the state back to RUNNING and continue the execution.
-* When the build is finished the executor changes the state to COMPLETED,
-  unlocks the build and enqueues a build finished event to the pipeline.
-* If a build should be canceled the pipeline processor adds a ``cancel``
-  child znode that will be recognized by the executor, which will act
-  accordingly.
-
-An executor may crash; in this case it will lose the lock. We need to be
-able to recover from this and emit the right event to the pipeline. Such a
-lost build can be detected if it is in a state other than REQUESTED or
-COMPLETED but is unlocked. Any executor that sees such a build while looking
-for new builds to execute will lock it, mark it as COMPLETED and failed, and
-then emit a build completed event so that the pipeline event processor can
-react to this and reschedule the build. There is no special handling needed
-to return the nodes, as in this case the failing executor will also lose its
-lock on the nodes, so they will be deleted or recycled by nodepool
-automatically.
-
-
-Parallelize pipeline processing
--------------------------------
-
-Once we have the above data in place we can create the per-pipeline event
-queues and the global scheduler event queue in Zookeeper. The global
-scheduler event queue will receive the trigger, management and result events
-that are not tenant specific. The purpose of this queue is to take these
-events and dispatch them to the pipeline queues of the tenants as
-appropriate. This event queue can easily be processed using a locking
-mechanism.
-
-We also have tenant global events like tenant reconfigurations. These need
-exclusive access to all pipelines in the tenant. So we need a two-layer
-locking approach during pipeline processing. First we need an RW lock at the
-tenant level. This allows it to be locked by all pipeline processors at the
-same time (call them readers as they don't modify the global tenant state).
-Management events (e.g. tenant reconfiguration) however will take this lock
-exclusively (call them writers as they modify the global tenant state).
-
-Each pipeline processor will loop over all pipelines that have outstanding
-events. Before processing an event it will first try to lock the tenant. If
-that fails it will continue with pipelines in the next tenant that has
-outstanding events. If it got the tenant lock it will try to lock the
-pipeline. If that fails it will continue with the next pipeline. If it
-succeeds it will process all outstanding events of that pipeline. To prevent
-starvation of tenants we can define a maximum processing time after which
-the pipeline processor will switch to the next tenant or pipeline even if
-there are outstanding events.
-
-In order to reduce stalls when doing reconfigurations or tenant
-reconfigurations, we can run pipeline processing in one thread and
-reconfigurations in one or more separate threads. This way a tenant that is
-running a longer reconfiguration won't block other tenants.
-
-
-Zuul-web changes
-----------------
-
-Now zuul-web can be changed to directly use the data in Zookeeper instead of
-asking the scheduler via gearman.
-
-
-Security considerations
------------------------
-
-When switching the executor job queue to Zookeeper we need to take
-precautions because it will also contain decrypted secrets. In order to
-secure this communication channel we need to make sure that we use
-authenticated and encrypted connections to zookeeper.
-
-* There is already a change that adds Zookeeper auth:
-  https://review.openstack.org/619156
-* Kazoo SSL support has just landed: https://github.com/python-zk/kazoo/pull/513
-
-Further, we will encrypt every secret that is stored in zookeeper using a
-symmetric cipher with a shared key that is known to all zuul services but
-not to zookeeper. This way we can avoid dumping decrypted secrets into the
-transaction log of zookeeper.
-
-
-Roadmap
--------
-
-In order to manage the workload and minimize rebasing efforts, we suggest
-breaking the above into smaller changes. Each such change should then be
-implemented separately.
-
-#. Mandatory SQL connection, definition of the primary SQL connection, and
-   SQL reporters for tenants
-#. Storing parsed branch config in zookeeper
-#. Storing raw events in zookeeper using drivers or a separate service
-#. Event queue per pipeline
-#. Storing pipeline state and tenant state in zookeeper
-#. Adapt drivers to pop events from zookeeper (split drivers into event
-   receiving and event processing components)
-#. Parallel pipeline processing
-#. Switch to using zookeeper instead of gearman
diff --git a/doc/source/reference/developer/zookeeper.rst b/doc/source/reference/developer/zookeeper.rst
index 89b96af005..b0aae0b202 100644
--- a/doc/source/reference/developer/zookeeper.rst
+++ b/doc/source/reference/developer/zookeeper.rst
@@ -1,5 +1,107 @@
+ZooKeeper
+=========
+
+Overview
+--------
+
+Zuul has a microservices architecture with the goal of no single point of
+failure in mind.
+
+Zuul is an event driven system with several event loops that interact
+with each other:
+
+* Driver event loop: Drivers like GitHub or Gerrit have their own event loops.
+  They perform preprocessing of the received events and add events into the
+  scheduler event loop.
+
+* Scheduler event loop: This event loop processes the pipelines and
+  reconfigurations.
+
+Each of these event loops persists data in ZooKeeper so that other
+components can share or resume processing.
+
+A key aspect of scalability is maintaining an event queue per
+pipeline. This makes it easy to process several pipelines in
+parallel. A new driver event is first processed in the driver event
+queue. This adds a new event into the scheduler event queue. The
+scheduler event queue then checks which pipeline may be interested in
+this event according to the tenant configuration and layout. Based on
+this the event is dispatched to all matching pipeline queues.
+
+In order to make reconfigurations efficient we store the parsed branch
+config in Zookeeper. This makes it possible to create the current
+layout without the need to ask the mergers multiple times for the
+configuration. This is used by zuul-web to keep an up-to-date layout
+for API requests.
+
+We store the pipeline state in Zookeeper. This contains the complete
+information about queue items, jobs and builds, as well as a separate
+abbreviated state for quick access by zuul-web for the status page.
+
+Driver Event Ingestion
+----------------------
+
+There are three types of event receiving mechanisms in Zuul:
+
+* Active event gathering: The connection actively listens to events (Gerrit)
+  or generates them itself (git, timer, zuul)
+
+* Passive event gathering: The events are sent to Zuul from outside (GitHub
+  webhooks)
+
+* Internal event generation: The events are generated within Zuul itself and
+  typically get injected directly into the scheduler event loop.
+
+The active event gathering needs to be handled differently from
+passive event gathering.
+
+Active Event Gathering
+~~~~~~~~~~~~~~~~~~~~~~
+
+This is mainly done by the Gerrit driver. We actively maintain a
+connection to the target and receive events. We utilize a leader
+election to make sure there is exactly one instance receiving the
+events.
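+
+As an illustration of how such a leader election can look, the sketch
+below uses kazoo's ``Election`` recipe. The znode path, host list,
+identifier and callback are assumptions made for the example only, not
+Zuul's actual implementation:
+
+.. code-block:: python
+
+   # Illustrative sketch only; paths, hosts and the callback are
+   # assumptions, not Zuul's actual implementation.
+   from kazoo.client import KazooClient
+
+   def watch_gerrit_events():
+       # Hypothetical callback: keep the Gerrit event stream open and
+       # append each received event to the connection's event queue in
+       # ZooKeeper. Runs only while this process holds the leadership.
+       ...
+
+   zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
+   zk.start()
+
+   # run() blocks until this contender wins the election, then invokes
+   # the callback; if the leader dies, another contender takes over.
+   election = zk.Election("/zuul/election/gerrit-watcher", "scheduler-1")
+   election.run(watch_gerrit_events)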
+
+Passive Event Gathering
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In case of passive event gathering the events are sent to Zuul
+typically via webhooks. These types of events are received in zuul-web
+which then stores them in Zookeeper. This type of event gathering is
+used by GitHub and other drivers. In this case we can have multiple
+instances but still receive only one event so that we don't need to
+take special care of event deduplication or leader election. Multiple
+instances behind a load balancer are safe to use and recommended for
+such passive event gathering.
+
+Configuration Storage
+---------------------
+
+Zookeeper is not designed as a database with a large amount of data,
+so we should store as little as possible in zookeeper. Thus we only
+store the per project-branch unparsed config in zookeeper. From this,
+every part of Zuul, like the scheduler or zuul-web, can quickly
+recalculate the layout of each tenant and keep it up to date by
+watching for changes in the unparsed project-branch-config.
+
+We store the actual config sharded in multiple nodes, and those nodes
+are stored under per project and branch znodes. This is needed because
+of the 1MB limit per znode in zookeeper. It further makes it less
+expensive to cache the global config in each component as this cache
+is updated incrementally.
+
+Executor and Merger Queues
+--------------------------
+
+The executors and mergers each have an execution queue (and in the
+case of executors, optionally per-zone queues). This makes it easy
+for executors and mergers to simply pick the next job to run without
+needing to inspect the entire pipeline state. The scheduler is
+responsible for submitting job requests as the state changes.
+
 Zookeeper Map
-=============
+-------------
 
 This is a reference for object layout in Zookeeper.
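+
+The layout below can also be inspected on a running system. A minimal
+sketch using kazoo follows; the connection details are deployment
+specific and the values shown are assumptions for the example only:
+
+.. code-block:: python
+
+   # Illustrative sketch only; host, auth and TLS settings are
+   # deployment specific and omitted here.
+   from kazoo.client import KazooClient
+
+   def dump_tree(zk, path, depth=0):
+       # Recursively print the znode names below the given path.
+       print("  " * depth + path.rsplit("/", 1)[-1])
+       for child in sorted(zk.get_children(path)):
+           dump_tree(zk, path + "/" + child, depth + 1)
+
+   zk = KazooClient(hosts="localhost:2181")
+   zk.start()
+   dump_tree(zk, "/zuul")
+   zk.stop()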