diff --git a/Documentation/index.txt b/Documentation/index.txt index e56e7ca2d3..dc94b14c54 100644 --- a/Documentation/index.txt +++ b/Documentation/index.txt @@ -75,6 +75,7 @@ . link:config-reverseproxy.html[Reverse Proxy] . link:config-auto-site-initialization.html[Automatic Site Initialization on Startup] . link:pgm-index.html[Server Side Administrative Tools] +. link:repository-maintenance.html[Repository Maintenance] . link:user-request-tracing.html[Request Tracing] . link:note-db.html[NoteDb] . link:config-accounts.html[Accounts on NoteDb] diff --git a/Documentation/repository-maintenance.txt b/Documentation/repository-maintenance.txt new file mode 100644 index 0000000000..16724367bc --- /dev/null +++ b/Documentation/repository-maintenance.txt @@ -0,0 +1,116 @@ += Gerrit Code Review - Repository Maintenance + +== Description + +Each project in Gerrit is stored in a bare Git repository. Gerrit uses +the JGit library to access (read and write to) these Git repositories. +As modifications are made to a project, Git repository maintenance will +be needed or performance will eventually suffer. When using the Git +command line tool to operate on a Git repository, it will run `git gc` +every now and then on the repository to ensure that Git garbage +collection is performed. However regular maintenance does not happen as +a result of normal Gerrit operations, so this is something that Gerrit +administrators need to plan for. + +Gerrit has a built-in feature which allows it to run Git garbage +collection on repositories. This can be +link:config-gerrit.html#gc[configured] to run on a regular basis, and/or +this can be run manually with the link:cmd-gc.html[gerrit gc] ssh +command, or with the link:rest-api-projects.html#run-gc[run-gc] REST API. +Some administrators will opt to run `git gc` or `jgit gc` outside of +Gerrit instead. There are many reasons this might be done, the main one +likely being that when it is run in Gerrit it can be very resource +intensive and scheduling an external job to run Git garbage collection +allows administrators to finely tune the approach and resource usage of +this maintenance. + +== Git Garbage Collection Impacts + +Unlike a typical server database, access to Git repositories is not +marshalled through a single process or a set of inter communicating +processes. Unfortuntatlely the design of the on-disk layout of a Git +repository does not allow for 100% race free operations when accessed by +multiple actors concurrently. These design shortcomings are more likely +to impact the operations of busy repositories since racy conditions are +more likely to occur when there are more concurrent operations. Since +most Gerrit servers are expected to run without interruptions, Git +garbage collection likely needs to be run during normal operational hours. +When it runs, it adds to the concurrency of the overall accesses. Given +that many of the operations in garbage collection involve deleting files +and directories, it has a higher chance of impacting other ongoing +operations than most other operations. + +=== Interrupted Operations + +When Git garbage collection deletes a file or directory that is +currently in use by an ongoing operation, it can cause that operation to +fail. These sorts of failures are often single shot failures, i.e. the +operation will succeed if tried again. An example of such a failure is +when a pack file is deleted while Gerrit is sending an object in the +file over the network to a user performing a clone or fetch. Usually +pack files are only deleted when the referenced objects in them have +been repacked and thus copied to a new pack file. So performing the same +operation again after the fetch will likely send the same object from +the new pack instead of the deleted one, and the operation will succeed. + +=== Data Loss + +It is possible for data loss to occur when Git garbage collection runs. +This is very rare, but it can happen. This can happen when an object is +believed to be unreferenced when object repacking is running, and then +garbage collection deletes it. This can happen because even though an +object may indeed be unreferenced when object repacking begins and +reachability of all objects is determined, it can become referenced by +another concurrent operation after this unreferenced determination but +before it gets deleted. When this happens, a new reference can be +created which points to a now missing object, and this will result in a +loss. + +== Reducing Git Garbage Collection Impacts + +JGit has a `preserved` directory feature which is intended to reduce +some of the impacts of Git garbage collection, and Gerrit can take +advantage of the feature too. The `preserved` directory is a +subdirectory of a repository's `objects/pack` directory where JGit will +move pack files that it would normally delete when `jgit gc` is invoked +with the `--preserve-oldpacks` option. It will later delete these files +the next time that `jgit gc` is run if it is invoked with the +`--prune-preserved` option. Using these flags together on every `jgit gc` +invocation means that packfiles will get an extended lifetime by one +full garbage collection cycle. Since an atomic move is used to move these +files, any open references to them will continue to work, even on NFS. On +a busy repository, preserving pack files can make operations much more +reliable, and interrupted operations should almost entirely disappear. + +Moving files to the `preserved` directory also has the ability to reduce +data loss. If JGit cannot find an object it needs in its current object +DB, it will look into the `preserved` directory as a last resort. If it +finds the object in a pack file there, it will restore the +slated-to-be-deleted pack file back to the original `objects/pack` +directory effectively "undeleting" it and making all the objects in it +available again. When this happens, data loss is prevented. + +One advantage of restoring preserved packfiles in this way when an +object is referenced in them, is that it makes loosening unreferenced +objects during Git garbage collection, which is a potentially expensive, +wasteful, and performance impacting operation, no longer desirable. It +is recommended that if you use Git for garbage collection, that you use +the `-a` option to `git repack` instead of the `-A` option to no longer +perform this loosening. + +When Git is used for garbage collection instead of JGit, it is fairly +easy to wrap `git gc` or `git repack` with a small script which has a +`--prune-preserved` option which behaves as mentioned above by deleting +any pack files currently in the preserved directory, and also has a +`--preserve-oldpacks` option which then hardlinks all the currently +existing pack files from the `objects/pack` directory into the +`preserved` directory right before calling the real Git command. This +approach will then behave similarly to `jgit gc` with respect to +preserving pack files. + +GERRIT +------ +Part of link:index.html[Gerrit Code Review] + +SEARCHBOX +---------