diff --git a/doc/source/index.rst b/doc/source/index.rst
index f9950dd..fb4e5e5 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -66,6 +66,7 @@ section of this index.
    specs/central-auth
    specs/irc
+   specs/prometheus
    specs/storyboard_integration_tests
    specs/storyboard_story_tags
    specs/storyboard_subscription_pub_sub
diff --git a/specs/prometheus.rst b/specs/prometheus.rst
new file mode 100644
index 0000000..9e99132
--- /dev/null
+++ b/specs/prometheus.rst
@@ -0,0 +1,173 @@
+::
+
+ Copyright 2021 Open Infrastructure Foundation
+
+ This work is licensed under a Creative Commons Attribution 3.0
+ Unported License.
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+========================
+Run a Prometheus Service
+========================
+
+https://storyboard.openstack.org/#!/story/2009228
+
+Our existing system metrics tooling is built around Cacti. Unfortunately,
+that tooling is aging without a clear upgrade path. This gives us the
+opportunity to reevaluate which tools are best suited to gathering system
+metrics today. Prometheus has grown into a popular, well supported tool in
+this space, and it can gather application metrics for many of the services
+we already run in addition to system metrics. Let's run a Prometheus
+instance and start replacing Cacti.
+
+Problem Description
+===================
+
+In order to properly size the services we run, debug issues with resource
+limits, and generally ensure the health of our systems, we need to collect
+metrics on how they are performing. Historically we have done this with
+Cacti, which polls systems via SNMP and collects that information in RRD
+files. Cacti then renders graphs of this RRD data per host over various
+time ranges.
+
+Our Cacti installation is aging and needs to be upgraded. Rather than put
+a bunch of effort into maintaining and modernizing this older system, we
+can jump directly to Prometheus, which is already supported by software we
+run such as Zuul, Gerrit, and Gitea. This change is likely to require a bit
+more bootstrapping effort, but in the end we will get a much richer set of
+metrics for understanding our systems and software.
+
+Proposed Change
+===============
+
+We will deploy a new server with a large attached volume. We will then run
+Prometheus with docker-compose, using the prom/prometheus image published
+to Docker Hub. The large volume will be mounted to provide storage for
+Prometheus' TSDB files.
+
+To collect the system metrics we will use Prometheus' node-exporter tool.
+The upstream for this tool publishes binaries for x86_64 and arm64 systems.
+We will use the published binaries (possibly using a local copy) instead of
+distro packages because the distro packages are quite old, and
+node-exporter changed its metric schema multiple times before reaching
+version 1.0. We will use the published binaries instead of Docker images
+because running node-exporter in Docker is awkward: significant system
+resources have to be exposed into the container for it to collect their
+details properly. We will need to open node-exporter's publishing port to
+the new Prometheus server in our firewall rules.
+
+Once the base set of services and firewall access are in place, we can
+begin rolling out configuration that polls the instances and renders the
+information into sets of graphs per instance. Ideally this will be
+configured automatically for instances in our inventory, similar to how
+sslcertcheck works.
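+Purely as an illustration of the shape of these pieces (the image tag,
+paths, hostnames, ports, and intervals below are placeholder assumptions,
+not decisions), the docker-compose service might look roughly like:
+
+.. code-block:: yaml
+
+   # Sketch only: volume paths and the pinned tag are assumptions.
+   version: '2'
+   services:
+     prometheus:
+       image: docker.io/prom/prometheus:latest
+       network_mode: host
+       restart: always
+       volumes:
+         # The large attached volume, mounted on the host, backs the TSDB.
+         - /var/lib/prometheus:/prometheus
+         - /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
+
+and the per-instance scrape configuration, which we would expect to
+generate from our Ansible inventory rather than maintain by hand, might
+look roughly like:
+
+.. code-block:: yaml
+
+   # Sketch only: target names are placeholders; node-exporter listens on
+   # port 9100 by default.
+   global:
+     scrape_interval: 60s
+   scrape_configs:
+     - job_name: node
+       static_configs:
+         - targets:
+             - 'host01.opendev.org:9100'
+             - 'host02.opendev.org:9100'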
+None of us are Prometheus experts at this point, so we will not pin down
+exactly what those configs should look like in this spec. Instead we expect
+Prometheus config to ingest metrics per instance, and Grafana configs to
+render graphs per instance from that data.
+
+We can leverage our functional testing system to work out what these configs
+should look like, or simply modify them on the new server until we are happy.
+We can get away with making these updates "in production" because the new
+service won't actually be in production until we are happy with it.
+
+Once we are happy with the results, we should collect data side by side in
+both Cacti and Prometheus for one month. We can then compare the two systems
+to ensure the data is accurate and usable. Once we have made this
+determination, the old Cacti server can be put into hibernation for
+historical record purposes.
+
+Integrating with services like Zuul, Gerrit, Gerritbot, and Gitea is also
+possible but outside the scope of this spec. Adding these integrations is
+expected to be straightforward once the Cacti replacement details have been
+sorted out.
+
+Alternatives
+------------
+
+We can keep running Cacti and upgrade it one way or another. The end result
+will be familiar but provide far less functionality.
+
+We can run Prometheus with its SNMP exporter instead of node-exporter. The
+upside to this approach is that we already know how to collect SNMP data
+from our servers. The downside is that the Prometheus community seems to
+prefer node-exporter and there is a bit more tooling around it; we will
+probably find better support for Grafana dashboards and graphs this way.
+Additionally, node-exporter collects for free a lot of information that we
+would otherwise have to write our own SNMP MIBs to gather. This is a good
+opportunity to use modern tooling.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  TBD
+
+Gerrit Topic
+------------
+
+Use Gerrit topic "opendev-prometheus" for all patches related to this spec.
+
+.. code-block:: bash
+
+   git-review -t opendev-prometheus
+
+Work Items
+----------
+
+1. Deploy a new metrics.opendev.org server and update DNS.
+2. Deploy Prometheus on the new server with docker-compose.
+3. Deploy node-exporter on all of our instances.
+4. Update firewall rules on all of our instances to allow Prometheus polls
+   from the new server.
+5. Configure Prometheus to poll our instances.
+6. Review the results and iterate until we are collecting what we want to
+   collect and it is safe to expose publicly.
+7. Open firewall rules on the new server to expose the Prometheus data
+   externally.
+8. Build Grafana dashboards for our instances exposing the metrics in
+   Prometheus.
+
+Repositories
+------------
+
+No new repositories will need to be created. All config should live in
+opendev/system-config.
+
+Servers
+-------
+
+We will create a new metrics.opendev.org server.
+
+DNS Entries
+-----------
+
+Only DNS records for the new server will be created.
+
+Documentation
+-------------
+
+We will update documentation to include information on operating Prometheus,
+adding sources of data to Prometheus, and adding graph dashboards to Grafana
+backed by Prometheus.
+
+Security
+--------
+
+We will need to update firewall rules on all systems to allow Prometheus
+polls from the new metrics.opendev.org server.
+
+Testing
+-------
+
+A system-config-run-prometheus job will be added to run Prometheus alongside
+at least one other server that it will gather metrics from.
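+For illustration, and assuming the job follows the same conventions as the
+existing system-config-run-* jobs (the parent job, node names, labels, and
+file matchers below are guesses rather than decisions), the job definition
+might look roughly like:
+
+.. code-block:: yaml
+
+   # Sketch only: structure borrowed from other system-config-run-* jobs.
+   - job:
+       name: system-config-run-prometheus
+       parent: system-config-run
+       description: Deploy Prometheus and poll a second test node.
+       nodeset:
+         nodes:
+           - name: prometheus99.opendev.org
+             label: ubuntu-focal
+           - name: gitea99.opendev.org
+             label: ubuntu-focal
+       files:
+         - playbooks/service-prometheus.yaml
+         - playbooks/roles/prometheus/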
+The job will ensure that node-exporter polling and ingestion into
+Prometheus are functional end to end.
+
+Dependencies
+============
+
+None