Merge changes I7216e8d8,I99a4cf9c,I7a05807d into stable-3.4
* changes: Add a section on reducing the impacts of Git gc Add a section about the impacts of Git gc Add a repository-maintenance doc
This commit is contained in:
@@ -75,6 +75,7 @@
|
||||
. link:config-reverseproxy.html[Reverse Proxy]
|
||||
. link:config-auto-site-initialization.html[Automatic Site Initialization on Startup]
|
||||
. link:pgm-index.html[Server Side Administrative Tools]
|
||||
. link:repository-maintenance.html[Repository Maintenance]
|
||||
. link:user-request-tracing.html[Request Tracing]
|
||||
. link:note-db.html[NoteDb]
|
||||
. link:config-accounts.html[Accounts on NoteDb]
|
||||
|
116
Documentation/repository-maintenance.txt
Normal file
116
Documentation/repository-maintenance.txt
Normal file
@@ -0,0 +1,116 @@
|
||||
= Gerrit Code Review - Repository Maintenance
|
||||
|
||||
== Description
|
||||
|
||||
Each project in Gerrit is stored in a bare Git repository. Gerrit uses
|
||||
the JGit library to access (read and write to) these Git repositories.
|
||||
As modifications are made to a project, Git repository maintenance will
|
||||
be needed or performance will eventually suffer. When using the Git
|
||||
command line tool to operate on a Git repository, it will run `git gc`
|
||||
every now and then on the repository to ensure that Git garbage
|
||||
collection is performed. However regular maintenance does not happen as
|
||||
a result of normal Gerrit operations, so this is something that Gerrit
|
||||
administrators need to plan for.
|
||||
|
||||
Gerrit has a built-in feature which allows it to run Git garbage
|
||||
collection on repositories. This can be
|
||||
link:config-gerrit.html#gc[configured] to run on a regular basis, and/or
|
||||
this can be run manually with the link:cmd-gc.html[gerrit gc] ssh
|
||||
command, or with the link:rest-api-projects.html#run-gc[run-gc] REST API.
|
||||
Some administrators will opt to run `git gc` or `jgit gc` outside of
|
||||
Gerrit instead. There are many reasons this might be done, the main one
|
||||
likely being that when it is run in Gerrit it can be very resource
|
||||
intensive and scheduling an external job to run Git garbage collection
|
||||
allows administrators to finely tune the approach and resource usage of
|
||||
this maintenance.
|
||||
|
||||
== Git Garbage Collection Impacts
|
||||
|
||||
Unlike a typical server database, access to Git repositories is not
|
||||
marshalled through a single process or a set of inter communicating
|
||||
processes. Unfortuntatlely the design of the on-disk layout of a Git
|
||||
repository does not allow for 100% race free operations when accessed by
|
||||
multiple actors concurrently. These design shortcomings are more likely
|
||||
to impact the operations of busy repositories since racy conditions are
|
||||
more likely to occur when there are more concurrent operations. Since
|
||||
most Gerrit servers are expected to run without interruptions, Git
|
||||
garbage collection likely needs to be run during normal operational hours.
|
||||
When it runs, it adds to the concurrency of the overall accesses. Given
|
||||
that many of the operations in garbage collection involve deleting files
|
||||
and directories, it has a higher chance of impacting other ongoing
|
||||
operations than most other operations.
|
||||
|
||||
=== Interrupted Operations
|
||||
|
||||
When Git garbage collection deletes a file or directory that is
|
||||
currently in use by an ongoing operation, it can cause that operation to
|
||||
fail. These sorts of failures are often single shot failures, i.e. the
|
||||
operation will succeed if tried again. An example of such a failure is
|
||||
when a pack file is deleted while Gerrit is sending an object in the
|
||||
file over the network to a user performing a clone or fetch. Usually
|
||||
pack files are only deleted when the referenced objects in them have
|
||||
been repacked and thus copied to a new pack file. So performing the same
|
||||
operation again after the fetch will likely send the same object from
|
||||
the new pack instead of the deleted one, and the operation will succeed.
|
||||
|
||||
=== Data Loss
|
||||
|
||||
It is possible for data loss to occur when Git garbage collection runs.
|
||||
This is very rare, but it can happen. This can happen when an object is
|
||||
believed to be unreferenced when object repacking is running, and then
|
||||
garbage collection deletes it. This can happen because even though an
|
||||
object may indeed be unreferenced when object repacking begins and
|
||||
reachability of all objects is determined, it can become referenced by
|
||||
another concurrent operation after this unreferenced determination but
|
||||
before it gets deleted. When this happens, a new reference can be
|
||||
created which points to a now missing object, and this will result in a
|
||||
loss.
|
||||
|
||||
== Reducing Git Garbage Collection Impacts
|
||||
|
||||
JGit has a `preserved` directory feature which is intended to reduce
|
||||
some of the impacts of Git garbage collection, and Gerrit can take
|
||||
advantage of the feature too. The `preserved` directory is a
|
||||
subdirectory of a repository's `objects/pack` directory where JGit will
|
||||
move pack files that it would normally delete when `jgit gc` is invoked
|
||||
with the `--preserve-oldpacks` option. It will later delete these files
|
||||
the next time that `jgit gc` is run if it is invoked with the
|
||||
`--prune-preserved` option. Using these flags together on every `jgit gc`
|
||||
invocation means that packfiles will get an extended lifetime by one
|
||||
full garbage collection cycle. Since an atomic move is used to move these
|
||||
files, any open references to them will continue to work, even on NFS. On
|
||||
a busy repository, preserving pack files can make operations much more
|
||||
reliable, and interrupted operations should almost entirely disappear.
|
||||
|
||||
Moving files to the `preserved` directory also has the ability to reduce
|
||||
data loss. If JGit cannot find an object it needs in its current object
|
||||
DB, it will look into the `preserved` directory as a last resort. If it
|
||||
finds the object in a pack file there, it will restore the
|
||||
slated-to-be-deleted pack file back to the original `objects/pack`
|
||||
directory effectively "undeleting" it and making all the objects in it
|
||||
available again. When this happens, data loss is prevented.
|
||||
|
||||
One advantage of restoring preserved packfiles in this way when an
|
||||
object is referenced in them, is that it makes loosening unreferenced
|
||||
objects during Git garbage collection, which is a potentially expensive,
|
||||
wasteful, and performance impacting operation, no longer desirable. It
|
||||
is recommended that if you use Git for garbage collection, that you use
|
||||
the `-a` option to `git repack` instead of the `-A` option to no longer
|
||||
perform this loosening.
|
||||
|
||||
When Git is used for garbage collection instead of JGit, it is fairly
|
||||
easy to wrap `git gc` or `git repack` with a small script which has a
|
||||
`--prune-preserved` option which behaves as mentioned above by deleting
|
||||
any pack files currently in the preserved directory, and also has a
|
||||
`--preserve-oldpacks` option which then hardlinks all the currently
|
||||
existing pack files from the `objects/pack` directory into the
|
||||
`preserved` directory right before calling the real Git command. This
|
||||
approach will then behave similarly to `jgit gc` with respect to
|
||||
preserving pack files.
|
||||
|
||||
GERRIT
|
||||
------
|
||||
Part of link:index.html[Gerrit Code Review]
|
||||
|
||||
SEARCHBOX
|
||||
---------
|
Reference in New Issue
Block a user