Add retry to start OSD

In Nautilus, it was verified that on every start during the recovery
sequence the OSD has an initial monmap whose fsid is zeroed.

During initialization, the OSD requests the monmap from the monitor and
proceeds to the initialization routines, which involve sending commands
to be executed by the monitor using the stored fsid. If that fsid
doesn't match the one in the monitor, a "wrong fsid" error is reported,
causing the OSD to stop its execution.

In most cases the OSD receives the monmap before it sends commands to
the monitor, but occasionally it does not.
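
The ordering problem described above can be illustrated with a toy
model. This is only a sketch: ToyMonitor, ToyOSD and their methods are
hypothetical names invented for illustration and do not correspond to
Ceph's actual classes or message flow.

```python
import uuid

# Assumption: a zeroed fsid is modeled as the all-zero UUID.
ZERO_FSID = uuid.UUID(int=0)


class ToyMonitor:
    """Hypothetical stand-in for a monitor that checks the sender's fsid."""

    def __init__(self, fsid: uuid.UUID) -> None:
        self.fsid = fsid

    def handle_command(self, sender_fsid: uuid.UUID, command: str) -> str:
        # Reject any command whose fsid differs from the cluster fsid.
        if sender_fsid != self.fsid:
            return "wrong fsid"
        return "ok"


class ToyOSD:
    """Hypothetical stand-in for an OSD that starts with a zeroed fsid."""

    def __init__(self) -> None:
        self.fsid = ZERO_FSID

    def receive_monmap(self, monitor: ToyMonitor) -> None:
        # Receiving the monmap replaces the zeroed fsid with the real one.
        self.fsid = monitor.fsid

    def send_command(self, monitor: ToyMonitor, command: str) -> str:
        return monitor.handle_command(self.fsid, command)


mon = ToyMonitor(uuid.uuid4())
osd = ToyOSD()
early = osd.send_command(mon, "boot")  # sent before the monmap arrived
osd.receive_monmap(mon)
late = osd.send_command(mon, "boot")   # sent after the monmap arrived
```

In this model the early command is rejected and the late one is
accepted, which is the behavior a second start attempt relies on.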

In virtual environments this error was reproduced only once in 90
tries. With the rt kernel it seems to happen more frequently.

This fix allows up to five attempts to start the OSD, giving it a
chance to receive the monmap correctly before it starts sending
commands to the monitor.
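
As a rough illustration of the retry behavior, the Python sketch below
runs a command until it exits with 0 or the attempts are exhausted. It
is not the playbook code: run_until_success and its parameters are
hypothetical, and it models "up to N attempts" as described in this
message rather than Ansible's exact retry counting.

```python
import subprocess
import time
from typing import Callable, Sequence


def _default_run(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_until_success(
    cmd: Sequence[str],
    attempts: int = 5,
    delay: float = 3.0,
    run: Callable[[Sequence[str]], int] = _default_run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0; return the number of attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return attempt
        if attempt < attempts:
            # Wait before retrying so the next start has a chance to
            # see the state the previous attempt was missing.
            sleep(delay)
    raise RuntimeError(f"{list(cmd)!r} failed after {attempts} attempts")
```

A call such as run_until_success(["/etc/init.d/ceph", "start", "osd"])
would mirror the intent of the change; the run and sleep parameters
exist only so the loop can be exercised without a real cluster.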

Testing performed:

AIO-SX - Created an Ansible test file to reproduce this error and left
it running in a loop until the "wrong fsid" message appeared. On the
second try the OSD received the monmap and was started successfully.

Story: 2009074
Task: 44094

Signed-off-by: Vinicius Lopes da Silva <vinicius.lopesdasilva@windriver.com>
Change-Id: Ib4b6d37b520ec2d78ea7b6a6c411a128ce284f66
Vinicius Lopes da Silva
2021-11-30 11:34:22 -03:00
parent 5e9c6edc67
commit fe53eb47da

@@ -194,8 +194,8 @@
 - name: Restore store.db from mon-store
   shell: cp -ar /tmp/mon-store/store.db /var/lib/ceph/mon/ceph-{{ mon_name }}
 
-- name: Bring up ceph Monitor and OSDs
-  command: /etc/init.d/ceph start mon osd
+- name: Bring up ceph Monitor
+  command: /etc/init.d/ceph start mon
 
 - name: Wait for ceph monitor to be up
   shell: ceph -s
@@ -203,6 +203,19 @@
   retries: 5
   delay: 2
 
+# During initialization of OSD, it requests the monmap to the monitor
+# before sending the monitor commands to be run. Since there are different
+# threads involved, it is possible the OSD sends the command before receiving
+# the monmap, causing an error of "wrong fsid" and making OSD to stop its
+# execution. So we add a retry of 5 to make sure it will receive the monmap when
+# it should
+- name: Bring up ceph OSDs
+  command: /etc/init.d/ceph start osd
+  retries: 5
+  delay: 3
+  register: result
+  until: result.rc == 0
+
 - name: Enable Ceph Msgr v2 protocol
   shell: ceph mon enable-msgr2
   until: true