Update our Gitea robots.txt from gitea.com's

We've experienced some runaway growth of Gitea archive cache files
on one of our backends, which according to upstream is often caused
by web crawlers indexing the archive URLs. They recommended updating
our robots.txt to the current state of https://gitea.com/robots.txt
in order to help mitigate the issue.

I've left commented out everything we had expressly commented out
before, along with anything similar to those entries, on the
assumption that the same reasons carry over.

After some discussion in IRC, we also decided it would make sense to
disallow /avatars and /user/* as gitea.com does.

Change-Id: I2b43b89de08c9a9d170e1ecbd14b1e6336fd2c84
Jeremy Stanley 2024-01-05 17:04:05 +00:00
parent 8734fa7c6e
commit 79103e1a35

@@ -3,6 +3,7 @@
# and
# https://github.com/robots.txt
# at 2020-07-01
# and https://gitea.com/robots.txt on 2024-01-05
#
# Some commented out items are left to indicate we have considered
# them and would like to explicitly allow them for indexing while they
@@ -10,26 +11,82 @@
User-agent: *
# Disallow: /avatars
# Disallow: /user/*
Disallow: /api/*
Disallow: /avatars
Disallow: /user/*
# Disallow: /*/*/src/commit/*
# Disallow: /*/*/commit/*
# Disallow: /*/*/*/refs/*
Disallow: /*/*/*/star
Disallow: /*/*/*/watch
Disallow: /*/*/labels
Disallow: /*/*/activity/*
Disallow: /vendor/librejs.html
Disallow: /api/swagger
Disallow: /vendor/*
Disallow: /swagger.*.json
# Language spam
Disallow: /*?lang=
# From github
Disallow: */archive/
Disallow: */blame/
# from Github, to be cleaned
Allow: /*/*/tree/master
Allow: /*/*/blob/master
Disallow: /*/*/pulse
Disallow: /*/*/tree/*
Disallow: /*/*/blob/*
Disallow: /*/*/wiki/*/*
Disallow: /gist/*/*/*
Disallow: /oembed
Disallow: /*/forks
Disallow: /*/stars
Disallow: /*/download
Disallow: /*/revisions
Disallow: /*/*/issues/new
Disallow: /*/*/issues/search
Disallow: /*/*/commits/*/*
Disallow: /*/*/commits/*?author
Disallow: /*/*/commits/*?path
Disallow: /*/*/branches
Disallow: /*/*/tags
Disallow: /*/*/contributors
Disallow: /*/*/comments
Disallow: /*/*/stargazers
Disallow: /*/*/search
Disallow: /*/tarball/
Disallow: /*/zipball/
Disallow: /*/*/archive/
# Disallow: /raw/*
Disallow: /*/followers
Disallow: /*/following
Disallow: /stars/*
Disallow: /*/blame/
Disallow: /*/watchers
Disallow: /*/network
Disallow: /*/graphs
# Disallow: /*/raw/
Disallow: /*/compare/
Disallow: /*/cache/
Disallow: /*/*/blame/
Disallow: /*/*/watchers
Disallow: /*/*/network
Disallow: /*/*/graphs
# Disallow: /*/*/raw/
Disallow: /*/*/compare/
Disallow: /*/*/cache/
Disallow: /.git/
Disallow: */.git/
Disallow: /*/.git/
Disallow: /*.git$
Disallow: /*/sitemap.xml
Disallow: /search/advanced
Disallow: /search
Disallow: /*q=
Disallow: /*.atom
Crawl-delay: 2
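
As a rough sanity check on how a few of the rules above would apply to archive and tree URLs, here is a minimal sketch of wildcard matching in the style of RFC 9309 ("*" matches any run of characters, a trailing "$" anchors the end, the longest matching pattern wins, and Allow wins ties). It is not how any particular crawler is implemented; real crawlers may differ in the details. The helper names, the RULES subset, and the example-org/example-repo paths are hypothetical, chosen only for illustration. Python's standard urllib.robotparser is not used because it does plain prefix matching and does not interpret these wildcards.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path pattern into a regex anchored at the start.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, rules):
    """Return True if the most specific matching rule is a Disallow.

    rules is a list of (allow, pattern) pairs. The longest matching
    pattern wins, and Allow wins when lengths tie.
    """
    best = None
    for allow, pattern in rules:
        if rule_to_regex(pattern).match(path):
            key = (len(pattern), allow)
            if best is None or key > best:
                best = key
    return best is not None and not best[1]


# A small subset of the rules in this file (False = Disallow, True = Allow).
RULES = [
    (False, "/*/*/archive/"),
    (False, "*/archive/"),
    (False, "/*.git$"),
    (True, "/*/*/tree/master"),
    (False, "/*/*/tree/*"),
]

# Hypothetical repository paths, for illustration only.
for path in [
    "/example-org/example-repo/archive/master.tar.gz",
    "/example-org/example-repo/tree/master",
    "/example-org/example-repo/tree/feature",
    "/example-org/example-repo.git",
    "/example-org/example-repo",
]:
    print(path, "blocked" if is_blocked(path, RULES) else "allowed")
```

Under these assumptions the archive URL, the non-master tree URL, and the ".git" URL are blocked, while the repository root and tree/master stay crawlable.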