========
Database
========

Introduction
------------

The general recommendation is to run a galera cluster (using mysql or mariadb) for your database.
Its active-active features allow for a fast and easy failover.

Note that OpenStack does not play well with writing in parallel to multiple galera nodes;
see the config recommendation below.

------------------------------
One database to rule them all?
------------------------------

You can consider deploying the database in two ways:

* one galera cluster for each OpenStack service
* one single big galera cluster for all OpenStack services

The recommendation is to split your databases into separate galera clusters, for multiple reasons:

* Reduce the impact when a galera cluster is down
* Allow interventions on a smaller part of the infrastructure

Also, there is no benefit to colocating multiple services on the same galera cluster.

Config recommendation
---------------------

This section is split into three parts:

* the configuration for galera itself
* the configuration for the reverse proxy in front of galera
* the configuration for the OpenStack services

--------------------
Galera configuration
--------------------

All of these settings need to be consistent on all nodes of the galera cluster.

Note that this guide does not include the general requirements to get the galera cluster set up in the first place.
For this please see https://mariadb.com/kb/en/getting-started-with-mariadb-galera-cluster/

General health configs
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: console

   [mysqld]
   max_connections=4000
   max_statement_time=20

In order to ensure your cluster runs smoothly, we recommend limiting the number of connections and the time statements are allowed to run.

The value for ``max_connections`` should be set based on actual tests (testing with a lot of idle connections is fine).

The value of ``20`` seconds for ``max_statement_time`` is enough for all normal use-cases we know of.
You might only run into issues with regular Nova cleanup jobs if they do not run often enough.
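
Before settling on a value, it can help to compare the configured limit with the peak usage the cluster has actually seen. A minimal sketch, assuming a local ``mysql`` client with sufficient privileges:

.. code-block:: console

   # compare the configured limit with the peak number of connections observed so far
   mysql -e "SHOW VARIABLES LIKE 'max_connections'; SHOW GLOBAL STATUS LIKE 'Max_used_connections';"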

Replication stability
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: console

   [galera]
   wsrep_provider_options=gcomm.thread_prio=rr:2;gcs.fc_limit=160;gcs.fc_factor=0.8;gcache.size=2G

When you have a large number of connections to your galera cluster, these connections might starve the galera replication thread.
If the replication thread does not get enough CPU time, the galera cluster will lose its members and break.

This setting gives the replication thread realtime scheduling on the kernel side.
If you run galera as a non-privileged user (as you hopefully do), galera will need ``CAP_SYS_NICE`` in order to be allowed to change the priority.
If you run inside a container environment, you might need to set ``kernel.sched_rt_runtime_us=-1`` (although that is suboptimal).
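
If mariadb runs under systemd, one way to grant this capability is a unit drop-in. This is a minimal sketch; the unit name ``mariadb.service`` and the drop-in path are assumptions about your deployment:

.. code-block:: console

   # /etc/systemd/system/mariadb.service.d/galera-rt.conf (hypothetical drop-in)
   [Service]
   AmbientCapabilities=CAP_SYS_NICE

After adding the drop-in, run ``systemctl daemon-reload`` and restart the database service so the capability takes effect.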

Performance
^^^^^^^^^^^

.. code-block:: console

   [mysqld]
   tmp_table_size=64M
   max_heap_table_size=64M
   optimizer_switch=derived_merge=off

The temporary table sizes and the ``derived_merge`` optimizer switch are important settings if you have a large number of Neutron RBAC rules.

---------------------------
Reverse Proxy Configuration
---------------------------

You will need to run a reverse proxy in front of your galera cluster to ensure OpenStack only ever communicates with a single cluster node for ``write`` requests.
This is required because OpenStack does not handle well the deadlocks that occur when writing to different nodes in parallel.

If you choose to run haproxy for this, you can use something like the following config:

.. code-block:: console

   defaults
       timeout client 300s

   listen db_master
       bind 0.0.0.0:3306
       balance first
       option mysql-check
       server server-1 server-1.with.the.fqdn check inter 5s downinter 15s fastinter 2s resolvers cluster id 1
       server server-2 server-2.with.the.fqdn check inter 5s downinter 15s fastinter 2s resolvers cluster backup id 2
       server server-3 server-3.with.the.fqdn check inter 5s downinter 15s fastinter 2s resolvers cluster backup id 3

   listen db_slave
       bind 0.0.0.0:3308
       balance roundrobin
       option mysql-check
       server server-1 server-1.with.the.fqdn check inter 5s downinter 15s fastinter 2s resolvers cluster id 3 backup
       server server-2 server-2.with.the.fqdn check inter 5s downinter 15s fastinter 2s resolvers cluster id 1 weight 10
       server server-3 server-3.with.the.fqdn check inter 5s downinter 15s fastinter 2s resolvers cluster id 2 weight 10

By using two blocks, we can separate the ``read`` only SQL requests (listening on port 3308 here) from the ``read/write``
requests (listening on port 3306 here) and reduce the load on the first (master) mysql backend a little.

You should note the ``timeout client`` setting here, as it is relevant to the OpenStack configuration.
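
Note that the ``resolvers cluster`` keyword on the ``server`` lines refers to a DNS resolvers section that is not shown above. A minimal sketch of such a section (the nameserver address is a placeholder for your own DNS setup):

.. code-block:: console

   resolvers cluster
       nameserver dns1 10.0.0.53:53
       resolve_retries 3
       timeout resolve 1s
       hold valid 10s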

-----------------------
OpenStack Configuration
-----------------------

Database Connection Settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The database configuration normally lives in the ``[database]`` section of each service's configuration file.
You should set the following:

.. code-block:: console

   connection = mysql+pymysql://login:pass@proxy:3306/db?charset=utf8
   slave_connection = mysql+pymysql://login:pass@proxy:3308/db?charset=utf8
   connection_recycle_time = 280
   max_pool_size = 15
   max_overflow = 25

The ``connection`` is used by OpenStack services to do ``read`` and ``write`` requests.

The ``slave_connection`` is used by OpenStack services to do ``read`` only requests.

The ``connection_recycle_time`` should be 5% to 10% smaller than the ``timeout client`` in the reverse proxy (with the ``300s`` from the haproxy example above, ``280`` seconds is roughly 7% below).
This ensures connections are recreated on the OpenStack side before the reverse proxy forces them to terminate.

The ``max_pool_size`` and ``max_overflow`` define the number of connections an individual service worker is allowed to open.
You will need to tune these based on experience (although the above should be a good start).
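
As a rough, back-of-the-envelope sanity check (an illustration, not an exact formula): with ``max_pool_size = 15`` and ``max_overflow = 25``, a single worker can open up to 40 connections, so around 100 such workers across all services would already reach the ``max_connections=4000`` limit configured on the galera side.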

Database cleanup
^^^^^^^^^^^^^^^^

Nova and Cinder use soft deletes inside their databases.
This means deleted entries remain in the database and just get a ``deleted`` flag set.

In order to prevent the database tables from growing forever, these deleted entries need to be removed regularly.
For this you can use:

* ``nova-manage db archive_deleted_rows`` and ``nova-manage db purge`` for Nova (https://docs.openstack.org/nova/latest/cli/nova-manage.html)
* ``cinder-manage db purge`` for Cinder (https://docs.openstack.org/cinder/latest/cli/cinder-manage.html)

If you have never run these cleanups before (or if your environment has a high amount of resources being deleted), you might run into
a timeout due to the ``max_statement_time`` on the database cluster.
To work around this, the ``nova-manage`` commands support a ``--max_rows`` argument.
For Cinder you might need to run the SQL statements manually and add a ``limit 1000`` to them (the statements are part of the command's error output).
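
As an illustration only (the schedules, retention, and exact flags are assumptions; check the linked documentation for your release), a periodic cleanup could be driven by cron entries such as:

.. code-block:: console

   # archive soft-deleted Nova rows into the shadow tables every night
   0 2 * * * nova-manage db archive_deleted_rows --until-complete
   # purge Cinder rows deleted more than 30 days ago, once a week
   0 3 * * 0 cinder-manage db purge 30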