==========================================
HA tests improvements
==========================================
Launchpad blueprint:
https://blueprints.launchpad.net/fuel/+spec/ha-test-improvements
Problem description
===================
We need to add new HA tests and modify the existing ones
Proposed change
===============
We need to define the list of new tests and new checks
and then implement them in the system tests
Alternatives
------------
No alternatives
Data model impact
-----------------
No impact
REST API impact
---------------
No impact
Upgrade impact
--------------
No impact
Security impact
---------------
No impact
Notifications impact
--------------------
No impact
Other end user impact
---------------------
No impact
Performance Impact
------------------
No impact
Other deployer impact
---------------------
No impact
Developer impact
----------------
No impact
Implementation
==============
Assignee(s)
-----------
Can be implemented by the fuel-qa team in parallel
Work Items
----------
1. Shut down public vip two times
(link to bug https://bugs.launchpad.net/fuel/+bug/1311749)
Steps:
1. Deploy HA cluster with Nova-network, 3 controllers, 2 compute
2. Find node with public vip
3. Shut down eth with public vip
4. Check vip is recovered
5. Find node on which vip is recovered
6. Shut down eth with public vip one more time
7. Check vip is recovered
8. Run OSTF
9. Do the same for management vip
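
A minimal sketch of these checks from a controller shell, assuming the
Pacemaker resources are named vip__public / vip__management and the public
vip lives on br-ex (names are assumptions and may differ per environment)::

  # step 2: find the node currently holding the public vip
  crm_resource --resource vip__public --locate

  # step 3: on that node, bring the interface carrying the vip down
  ip link set br-ex down

  # steps 4-5: from another controller, verify the vip was restarted elsewhere
  crm_resource --resource vip__public --locate
  crm status | grep vip__public
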
2. Galera does not reassemble on galera quorum loss
(link to bug https://bugs.launchpad.net/fuel/+bug/1350545)
Steps:
1. Deploy HA cluster with Nova-network, 3 controllers, 2 compute
2. Shut down one controller
3. Wait for the Galera cluster to reassemble (HA health check passes)
4. Kill mysqld on second controller
5. Start first controller
6. Wait up to 5 minutes for Galera to reassemble and check that it has
7. Run OSTF
8. Check rabbit status with MOS script
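
The Galera and RabbitMQ state can be verified with standard commands, for
example (a sketch; the MOS rabbit check script itself is not reproduced
here)::

  # cluster size and state as seen from a surviving controller
  mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
  mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"   # expect "Primary"

  # step 4: kill mysqld on the second controller
  pkill -9 mysqld

  # step 8: RabbitMQ cluster membership
  rabbitmqctl cluster_status
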
3. Corrupt root file system on primary controller
Steps:
1. Deploy HA cluster with Nova-network, 3 controllers, 2 compute
2. Corrupt root file system on primary controller
3. Run OSTF
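
One way to emulate root file system corruption on the primary controller
(a destructive sketch; the device name is an assumption and depends on the
environment)::

  # overwrite the beginning of the root device
  dd if=/dev/zero of=/dev/sda bs=1M count=100
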
4. Block corosync traffic
(link to bug https://bugs.launchpad.net/fuel/+bug/1354520)
Steps:
1. Deploy HA cluster with Nova-network, 3 controllers, 2 compute
2. Login to rabbit master node
3. Block corosync traffic by removing the interface from the management bridge
4. Unblock corosync traffic by adding the interface back
5. Check rabbitmqctl cluster_status at rabbit master node
6. Run OSTF HA tests
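
A possible sketch of steps 2-5, assuming the management network sits on the
br-mgmt OVS bridge (bridge, port and resource names are assumptions)::

  # step 2: the rabbit master is the master of the multi-state
  # p_rabbitmq-server resource
  crm status | grep -A 2 rabbitmq

  # step 3: block corosync traffic by pulling the management port
  # out of the bridge
  ovs-vsctl del-port br-mgmt eth0.101

  # step 4: restore it
  ovs-vsctl add-port br-mgmt eth0.101

  # step 5: verify the cluster state
  rabbitmqctl cluster_status
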
5. HA scalability for mongo
Steps:
1. Deploy HA cluster with Nova-network, 1 controller and 3 mongo nodes
2. Add 2 controller nodes and re-deploy cluster
3. Run OSTF
4. Add 2 mongo nodes and re-deploy cluster
5. Run OSTF
6. Lock DB access on primary controller
Steps:
1. Deploy HA cluster with Nova-network, 3 controllers, 2 compute
2. Lock DB access on primary controller
3. Run OSTF
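
Locking DB access can be emulated by dropping MySQL and Galera traffic on
the primary controller, for example (the exact rules are an assumption)::

  iptables -I INPUT -p tcp --dport 3306 -j DROP   # MySQL client connections
  iptables -I INPUT -p tcp --dport 4567 -j DROP   # Galera replication
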
7. Need to test HA failover on clusters with bonding
Steps:
1. Deploy HA cluster with Neutron Vlan, 3 controllers, 2 compute,
eth1-eth4 interfaces are bonded in active-backup mode
2. Destroy primary controller
3. Check pacemaker status
4. Run OSTF
5. Check rabbit status with MOS script
(retry for 5 minutes until a successful result)
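
Example checks after the primary controller is destroyed (a sketch;
rabbitmqctl cluster_status stands in for the MOS rabbit status script)::

  # step 3: pacemaker view from a surviving controller
  crm status

  # step 5: retry the rabbit check for up to 5 minutes
  for i in $(seq 1 10); do
    rabbitmqctl cluster_status && break
    sleep 30
  done
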
8. HA load testing with rally
(may not be a part of this blueprint)
9. Need to test an HA Neutron cluster under high load with simultaneous
removal of virtual router ports (related link:
http://lists.openstack.org/pipermail/openstack-operators/2014-September/005165.html)
10. Cinder Neutron Plugin
Steps:
1. Deploy HA cluster with Neutron GRE, 3 controllers, 2 compute,
cinder-neutron plugin enabled
2. Run network verification
3. Run OSTF
11. Rmq failover test for compute service
Steps:
1. Deploy HA cluster with Nova-network, 3 controllers,
2 compute with cinder roles
2. Disable one compute node with
nova-manage service disable --host=<compute_node_name> --service=nova-compute
3. On the controller node under test (the one the compute node under test
connects to via rmq port 5673), repeat spawn / destroy instance requests
continuously (sleep 60) while the test is running
4. Add iptables block rule from compute IP to controller IP:5673
(take care of conntrack as well)
iptables -I INPUT 1 -s compute_IP -p tcp --dport 5673 -m state
--state NEW,ESTABLISHED,RELATED -j DROP
5. Wait 3 minutes for the compute node under test to be marked as down
in nova service-list
6. Wait another 3 minutes for it to be brought back up
7. Check the queue of the compute node under test - it should contain
zero messages
8. Check that an instance can be spawned on the node
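
The block rule and the service / queue checks might look like this (the
queue name pattern is an assumption)::

  # step 4: block the compute's AMQP traffic and flush the matching
  # conntrack entries so the established connection is cut as well
  iptables -I INPUT 1 -s <compute_IP> -p tcp --dport 5673 \
    -m state --state NEW,ESTABLISHED,RELATED -j DROP
  conntrack -D -p tcp --orig-port-dst 5673

  # steps 5-7: service state and queue depth
  nova service-list | grep nova-compute
  rabbitmqctl list_queues name messages | grep <compute_node_name>
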
12. Check monit on compute nodes
Steps:
1. Deploy HA cluster with Nova-network, 3 controllers, 2 compute
2. Ssh to every compute node
3. Kill nova-compute service
4. Check that service was restarted by monit
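
A sketch of the check on each compute node (the monit service name is an
assumption)::

  pgrep -f nova-compute              # note the current pid
  pkill -9 -f nova-compute           # step 3: kill the service
  sleep 60
  pgrep -f nova-compute              # a new pid means monit restarted it
  monit summary | grep nova-compute
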
13. Check pacemaker restarts heat-engine in case of losing amqp connection
Steps:
1. Deploy HA cluster with Nova-network, 3 controllers, 2 compute
2. SSH to controller with running heat-engine
3. Check heat-engine status
4. Block heat-engine amqp connections
5. Check if heat-engine was moved to another controller or stopped
on current controller
6. If moved - ssh to node with running heat-engine
6.1 Check heat-engine is running
6.2 Check heat-engine has some amqp connections
7. If stopped - check heat-engine process is running with new pid
7.1 Unblock heat-engine amqp connections
7.2 Check amqp connection re-appears for heat-engine
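
Example commands for the heat-engine checks (the Pacemaker resource name,
the heat system user and the AMQP port are assumptions based on a typical
Fuel deployment)::

  # step 3: where heat-engine is running
  crm_resource --resource p_heat-engine --locate
  pgrep -f heat-engine

  # step 4: drop AMQP traffic originating from the heat user only
  iptables -I OUTPUT -p tcp --dport 5673 -m owner --uid-owner heat -j DROP

  # steps 6-7: AMQP connections of the heat-engine process
  ss -ntp | grep 5673 | grep heat-engine
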
14. Neutron agent rescheduling
Steps:
1. Deploy HA cluster with Neutron GRE, 3 controllers, 2 compute
2. Check the neutron agent list consistency (no duplicates,
all agents alive, etc.)
3. On host with l3 agent create one more router
4. Check there are 2 namespaces
5. Destroy controller with l3 agent
6. Check the l3 agent was moved to another controller and that all routers
and namespaces were moved
7. Check the metadata agent was also moved and there is a process in the
router namespace listening on port 8775
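
Example neutron and namespace checks (the router name is an assumption)::

  # step 2: agent list consistency
  neutron agent-list

  # steps 3-4: create one more router and check the namespaces
  neutron router-create test_router
  ip netns list | grep qrouter

  # step 6: which node hosts the router after the failover
  neutron l3-agent-list-hosting-router test_router

  # step 7: metadata proxy listening inside the router namespace
  ip netns exec qrouter-<router_id> netstat -lnpt | grep 8775
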
15. DHCP agent rescheduling
Steps:
1. Deploy HA cluster with Neutron GRE, 3 controllers, 2 compute
2. Destroy controller with dhcp agent
3. Check the dhcp agent was moved to another controller
4. Check the metadata agent was also moved and there is a process in the
router namespace listening on port 8775
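
Example check after the controller is destroyed (the network name is an
assumption)::

  # which node currently hosts the DHCP agent for the network
  neutron dhcp-agent-list-hosting-net net04
  neutron agent-list | grep -i dhcp
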
Dependencies
============
Testing
=======
Documentation Impact
====================
References
==========