Detailed session information and enhancements

- Add GET /v1/maintenance/{session_id}/detail
- Add 'maintenance.session' event. This can be used
  to track workflow. It gives you percent of hosts
  maintained.

Other enhancements:
- Add Sample VNFM for OpenStack: vnfm.py
  (Kubernetes renamed to vnfm_k8s.py)
- Add Sample VNF for OpenStack:
  maintenance_hot_tpl.yaml
- Update testing instructions (tools)
- Update documentation
- Add more tools for testing:
  - fenix_db_reset (flushes the database)
  - set_config.py (sets the AODH / Ceilometer config)
- Add admin tool: infra_admin.py
  This tool can run maintenance workflow and
  track its progress
- Make sure everything is written to the database.
  If Fenix is restarted, it initialises existing
  'ongoing' workflows from the database. Add more functions
  to the database API and use them in the example workflows.

story: 2004336
Task: #27922

Change-Id: I794b11a8684f5fc513cb8f5affcd370ec70f3dbc
Signed-off-by: Tomi Juvonen <tomi.juvonen@nokia.com>
Tomi Juvonen 2020-04-17 12:31:15 +03:00
parent ef8bbb388b
commit 244fb3ced0
34 changed files with 2481 additions and 379 deletions

@ -27,6 +27,7 @@ would also be telling about adding or removing a host.
* Documentation: https://fenix.readthedocs.io/en/latest/index.html
* Developer Documentation: https://wiki.openstack.org/wiki/Fenix
* Source: https://opendev.org/x/fenix
* Running sample workflows: https://opendev.org/x/fenix/src/branch/master/fenix/tools/README.md
* Bug tracking and Blueprints: https://storyboard.openstack.org/#!/project/x/fenix
* How to contribute: https://docs.openstack.org/infra/manual/developers.html
* `Fenix Specifications <specifications/index.html>`_

@ -1,6 +1,6 @@
####################
Host Maintenance API
####################
###
API
###
.. toctree::
:maxdepth: 2

@ -1,28 +1,29 @@
:tocdepth: 2
#######################
Host Maintenance API v1
#######################
######
API v1
######
.. rest_expand_all::
#####
Admin
#####
#########
Admin API
#########
These APIs are meant for infrastructure admin who is in charge of triggering
the rolling maintenance and upgrade workflows.
the rolling maintenance and upgrade workflow sessions.
.. include:: maintenance.inc
#######
Project
#######
###########
Project API
###########
These APIs are meant for projects having instances on top of the infrastructure
under corresponding rolling maintenance or upgrade session. Usage of these APIs
expects there is an application manager (VNFM) that can interact with Fenix
workflow via these APIs. If this is not the case, workflow should have a default
behavior for instances owned by projects, that are not interacting with Fenix.
These APIs are meant for projects (tenant/VNF) that have instances on top of the
infrastructure under the corresponding rolling maintenance or upgrade session.
Using these APIs assumes there is an application manager (VNFM) that can
interact with the Fenix workflow through them. If this is not the case, the
workflow should have a default behavior for instances owned by projects that
are not interacting with Fenix.
.. include:: project.inc

@ -1,13 +1,13 @@
.. -*- rst -*-
===========
Maintenance
===========
==========================
Admin workflow session API
==========================
Create maintenance session
==========================
.. rest_method:: POST /v1/maintenance/
.. rest_method:: POST /v1/maintenance
Create a new maintenance session. You can specify a list of 'hosts' to be
maintained or have an empty list to indicate those should be self-discovered.
@ -49,7 +49,7 @@ Response codes
Update maintenance session (planned future functionality)
=========================================================
.. rest_method:: PUT /v1/maintenance/{session_id}/
.. rest_method:: PUT /v1/maintenance/{session_id}
Update existing maintenance session. This can be used to continue a failed
session after manually fixing what failed. Workflow should then run
@ -79,7 +79,7 @@ Response codes
Get maintenance sessions
========================
.. rest_method:: GET /v1/maintenance/
.. rest_method:: GET /v1/maintenance
Get all ongoing maintenance sessions.
@ -88,7 +88,7 @@ Response codes
.. rest_status_code:: success status.yaml
- 200: get-maintenance-sessions-get
- 200: maintenance-sessions-get
.. rest_status_code:: error status.yaml
@ -98,7 +98,7 @@ Response codes
Get maintenance session
=======================
.. rest_method:: GET /v1/maintenance/{session_id}/
.. rest_method:: GET /v1/maintenance/{session_id}
Get a maintenance session state.
@ -114,7 +114,38 @@ Response codes
.. rest_status_code:: success status.yaml
- 200: get-maintenance-session-get
- 200: maintenance-session-get
.. rest_status_code:: error status.yaml
- 400
- 404
- 422
- 500
Get maintenance session details
===============================
.. rest_method:: GET /v1/maintenance/{session_id}/detail
Get the details of a maintenance session. This information can be useful to see
the detailed status of a maintenance session or to troubleshoot a failed session.
Usually a session fails on a simple problem that can quickly be fixed manually.
Then one can update the maintenance session state to continue from 'prev_state'.
Request
-------
.. rest_parameters:: parameters.yaml
- session_id: session_id
Response codes
--------------
.. rest_status_code:: success status.yaml
- 200: maintenance-session-detail-get
.. rest_status_code:: error status.yaml
@ -126,7 +157,7 @@ Response codes
Delete maintenance session
==========================
.. rest_method:: DELETE /v1/maintenance/{session_id}/
.. rest_method:: DELETE /v1/maintenance/{session_id}
Delete a maintenance session. Usually called after the session is successfully
finished.
@ -141,12 +172,3 @@ finished.
- 400
- 422
- 500
Future
======
On top of some expected changes mentioned above, it will also be handy to get
detailed information about the steps run already in the maintenance session.
This will be helpful when need to figure out any correcting actions to
successfully finish a failed session. For now admin can update failed session
state to previous or his wanted state to try continue a failed session.

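A minimal client-side sketch of calling the new detail endpoint, using only the Python standard library. The base URL, the token value and the use of an `X-Auth-Token` header are assumptions about the deployment, not something this change defines; only the path `/v1/maintenance/{session_id}/detail` comes from the documentation above.

```python
import urllib.request

# Hypothetical endpoint; replace with the Fenix API address of your deployment.
FENIX_URL = "http://127.0.0.1:12347"


def build_detail_request(base_url, session_id, token):
    """Build a GET request for /v1/maintenance/{session_id}/detail."""
    url = "%s/v1/maintenance/%s/detail" % (base_url.rstrip("/"), session_id)
    # Assumption: the API is protected by a Keystone token.
    headers = {"X-Auth-Token": token, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


req = build_detail_request(
    FENIX_URL, "47479bca-7f0e-11ea-99c9-2c600c9893ee", "TOKEN")
print(req.get_method(), req.full_url)
# Sending it would look like: json.load(urllib.request.urlopen(req))
```

Note that the path has no trailing slash, matching the `rest_method` corrections in this file.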
@ -36,7 +36,7 @@ uuid-path:
#############################################################################
action-metadata:
description: |
Metadata; hints to plug-ins
Metadata; hints to plug-ins.
in: body
required: true
type: dictionary
@ -44,7 +44,17 @@ action-metadata:
action-plugin-name:
description: |
plug-in name. Default workflow executes same type of plug-ins in an
alphabetical order
alphabetical order.
in: body
required: true
type: string
action-plugin-state:
description: |
Action plug-in state. This is workflow and action plug-in specific
information passed from the action plug-in to the workflow. It helps
to understand how the action plug-in was executed and to troubleshoot
accordingly.
in: body
required: true
type: string
@ -77,6 +87,20 @@ boolean:
required: true
type: boolean
datetime-string:
description: |
Date and time string according to ISO 8601.
in: body
required: true
type: string
details:
description: |
Workflow-internal detail for special usage, for example the nova-compute service ID.
in: body
required: true
type: string
group-uuid:
description: |
Instance group uuid. Should match with OpenStack server group if one exists.
@ -84,6 +108,21 @@ group-uuid:
required: true
type: string
host-type:
description: |
Host type as it is to be used in the workflow implementation.
The example workflows use the values 'compute' and 'controller'.
in: body
required: false
type: list of strings
hostname:
description: |
Name of the host.
in: body
required: true
type: string
hosts:
description: |
Hosts to be maintained. An empty list can indicate hosts are to be
@ -102,7 +141,7 @@ instance-action:
instance-actions:
description: |
instance ID : action string. This variable is not needed in reply to state
MAINTENANCE, SCALE_IN or MAINTENANCE_COMPLETE
MAINTENANCE, SCALE_IN or MAINTENANCE_COMPLETE.
in: body
required: true
type: dictionary
@ -128,6 +167,14 @@ instance-name:
required: true
type: string
instance-state:
description: |
State of the instance as in underlying cloud. Can be different in
different clouds like OpenStack or Kubernetes.
in: body
required: true
type: string
lead-time:
description: |
How long lead time VNF needs for 'migration_type' operation. VNF needs to
@ -177,30 +224,50 @@ max-interruption-time:
metadata:
description: |
Metadata; like hints to projects
Hint that tells the project/tenant/VNF what capability the infrastructure
offers to an instance when it moves to an already maintained host in the
'PLANNED_MAINTENANCE' state action. This may affect how the instance is
moved, or whether the instance is upgraded and the VNF needs to
re-instantiate it as its 'OWN_ACTION'. This could be the case with new
hardware, or when the instance is to be upgraded anyway at the same time
as the infrastructure maintenance.
in: body
required: true
type: dictionary
migration-type:
description: |
LIVE_MIGRATION, MIGRATION or OWN_ACTION
'LIVE_MIGRATE', 'MIGRATE' or 'OWN_ACTION'
Own action means creating a new instance and deleting the old one.
Note! VNF needs to obey resource_mitigation with own action.
This affects the order of deleting old and creating new to not over
commit the resources. In Kubernetes also EVICTION supported. There admin
commit the resources. In Kubernetes also 'EVICTION' supported. There admin
will delete instance and VNF automation like ReplicaSet will make a new
instance
instance.
in: body
required: true
type: string
percent_done:
description: |
Percentage of hosts that have been maintained.
in: body
required: true
type: integer
plugin:
description: |
Action plugin name.
in: body
required: true
type: string
recovery-time:
description: |
VNF recovery time after operation to instance. Workflow needs to take
into account recovery_time for previous instance moved and only then
start moving next obeying max_impacted_members
Note! regardless anti_affinity group or not
Note! regardless anti_affinity group or not.
in: body
required: true
type: integer
@ -255,7 +322,7 @@ workflow-name:
workflow-state:
description: |
Maintenance workflow state.
Maintenance workflow state (States explained in the user guide)
in: body
required: true
type: string

@ -1,8 +1,8 @@
.. -*- rst -*-
=======
Project
=======
============================
Project workflow session API
============================
These APIs are generic for any cloud as instance ID should be something that can
be matched to virtual machines or containers regardless of the cloud underneath.
@ -10,7 +10,7 @@ be matched to virtual machines or containers regardless of the cloud underneath.
Get project maintenance session
===============================
.. rest_method:: GET /v1/maintenance/{session_id}/{project_id}/
.. rest_method:: GET /v1/maintenance/{session_id}/{project_id}
Get project instances belonging to the current state of maintenance session.
The Project-manager receives an AODH event alarm telling about different
@ -31,7 +31,7 @@ Response codes
.. rest_status_code:: success status.yaml
- 200: get-project-maintenance-session-post
- 200: project-maintenance-session-post
.. rest_status_code:: error status.yaml
@ -42,7 +42,7 @@ Response codes
Input from project to maintenance session
=========================================
.. rest_method:: PUT /v1/maintenance/{session_id}/{project_id}/
.. rest_method:: PUT /v1/maintenance/{session_id}/{project_id}
Project having instances on top of the infrastructure handled by a maintenance
session might need to make own action for its instances on top of a host going
@ -78,9 +78,9 @@ Response codes
- 422
- 500
============================
Project with NFV constraints
============================
===========================
Project NFV constraints API
===========================
These APIs are for VNFs, VNFM and EM that are made to support ETSI defined
standard VIM interface for sophisticated interaction to optimize rolling

@ -0,0 +1,212 @@
{
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"instances": [
{
"instance_id": "da8f96ae-a1fe-4e6b-a852-6951d513a440",
"action_done": false,
"host": "overcloud-novacompute-2",
"created_at": "2020-04-15T11:43:09.000000",
"project_state": "INSTANCE_ACTION_DONE",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"instance_name": "demo_nonha_app_2",
"state": "active",
"details": null,
"action": null,
"project_id": "444b05e6f4764189944f00a7288cd281",
"id": "73190018-eab0-4074-bed0-4b0c274a1c8b"
},
{
"instance_id": "22d869d7-2a67-4d70-bb3c-dcc14a014d78",
"action_done": false,
"host": "overcloud-novacompute-4",
"created_at": "2020-04-15T11:43:09.000000",
"project_state": "ACK_PLANNED_MAINTENANCE",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"instance_name": "demo_nonha_app_3",
"state": "active",
"details": null,
"action": "MIGRATE",
"project_id": "444b05e6f4764189944f00a7288cd281",
"id": "c0930990-65ac-4bca-88cb-7cb0e7d5c420"
},
{
"instance_id": "89467f5c-d5f8-461f-8b5c-236ce54138be",
"action_done": false,
"host": "overcloud-novacompute-2",
"created_at": "2020-04-15T11:43:09.000000",
"project_state": "INSTANCE_ACTION_DONE",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"instance_name": "demo_nonha_app_1",
"state": "active",
"details": null,
"action": null,
"project_id": "444b05e6f4764189944f00a7288cd281",
"id": "c6eba3ae-cb9e-4a1f-af10-13c66f61e4d9"
},
{
"instance_id": "5243f1a4-9f7b-4c91-abd5-533933bb9c90",
"action_done": false,
"host": "overcloud-novacompute-3",
"created_at": "2020-04-15T11:43:09.000000",
"project_state": "INSTANCE_ACTION_DONE",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"instance_name": "demo_ha_app_0",
"state": "active",
"details": "floating_ip",
"action": null,
"project_id": "444b05e6f4764189944f00a7288cd281",
"id": "d67176ff-e2e4-45e3-9a52-c069a3a66c5e"
},
{
"instance_id": "4e2e24d7-0e5d-4a92-8edc-e343b33b9f10",
"action_done": false,
"host": "overcloud-novacompute-3",
"created_at": "2020-04-15T11:43:09.000000",
"project_state": "INSTANCE_ACTION_DONE",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"instance_name": "demo_nonha_app_0",
"state": "active",
"details": null,
"action": null,
"project_id": "444b05e6f4764189944f00a7288cd281",
"id": "f2f7fd7f-8900-4b24-91dc-098f797790e1"
},
{
"instance_id": "92aa44f9-7ce4-4ba4-a29c-e03096ad1047",
"action_done": false,
"host": "overcloud-novacompute-4",
"created_at": "2020-04-15T11:43:09.000000",
"project_state": "ACK_PLANNED_MAINTENANCE",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"instance_name": "demo_ha_app_1",
"state": "active",
"details": null,
"action": "MIGRATE",
"project_id": "444b05e6f4764189944f00a7288cd281",
"id": "f35c9ba5-e5f7-4843-bae5-7df9bac2a33c"
},
{
"instance_id": "afa2cf43-6a1f-4508-ba59-12b773f8b926",
"action_done": false,
"host": "overcloud-novacompute-0",
"created_at": "2020-04-15T11:43:09.000000",
"project_state": "ACK_PLANNED_MAINTENANCE",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"instance_name": "demo_nonha_app_4",
"state": "active",
"details": null,
"action": "MIGRATE",
"project_id": "444b05e6f4764189944f00a7288cd281",
"id": "fea38e9b-3d7c-4358-ba2e-06e9c340342d"
}
],
"state": "PLANNED_MAINTENANCE",
"session": {
"workflow": "vnf",
"created_at": "2020-04-15T11:43:09.000000",
"updated_at": "2020-04-15T11:44:04.000000",
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"maintenance_at": "2020-04-15T11:43:28.000000",
"state": "PLANNED_MAINTENANCE",
"prev_state": "START_MAINTENANCE",
"meta": "{'openstack': 'upgrade'}"
},
"hosts": [
{
"created_at": "2020-04-15T11:43:09.000000",
"hostname": "overcloud-novacompute-3",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"disabled": false,
"maintained": true,
"details": "3de22382-5500-4d13-b9a2-470cc21002ee",
"type": "compute",
"id": "426ea4b9-4438-44ee-9849-1b3ffcc42ad6"
},
{
"created_at": "2020-04-15T11:43:09.000000",
"hostname": "overcloud-novacompute-2",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"disabled": false,
"maintained": true,
"details": "91457572-dabf-4aff-aab9-e12a5c6656cd",
"type": "compute",
"id": "74f0f6d1-520a-4e5b-b69c-c3265d874b14"
},
{
"created_at": "2020-04-15T11:43:09.000000",
"hostname": "overcloud-novacompute-5",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"disabled": false,
"maintained": true,
"details": "87921762-0c70-4d3e-873a-240cb2e5c0bf",
"type": "compute",
"id": "8d0f764e-11e8-4b96-8f6a-9c8fc0eebca2"
},
{
"created_at": "2020-04-15T11:43:09.000000",
"hostname": "overcloud-novacompute-1",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"disabled": false,
"maintained": true,
"details": "52c7270a-cfc2-41dd-a574-f4c4c54aa78d",
"type": "compute",
"id": "be7fd08c-0c5f-4bf4-a95b-bc3b3c01d918"
},
{
"created_at": "2020-04-15T11:43:09.000000",
"hostname": "overcloud-novacompute-0",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"disabled": true,
"maintained": false,
"details": "ea68bd0d-a5b6-4f06-9bff-c6eb0b248530",
"type": "compute",
"id": "ce46f423-e485-4494-8bb7-e1a2b038bb8e"
},
{
"created_at": "2020-04-15T11:43:09.000000",
"hostname": "overcloud-novacompute-4",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"disabled": true,
"maintained": false,
"details": "d5271d60-db14-4011-9497-b1529486f62b",
"type": "compute",
"id": "efdf668c-b1cc-4539-bdb6-aea9afbcc897"
},
{
"created_at": "2020-04-15T11:43:09.000000",
"hostname": "overcloud-controller-0",
"updated_at": null,
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"disabled": false,
"maintained": true,
"details": "9a68c85e-42f7-4e40-b64a-2e7a9e2ccd03",
"type": "controller",
"id": "f4631941-8a51-44ee-b814-11a898729f3c"
}
],
"percent_done": 71,
"action_plugin_instances": [
{
"created_at": "2020-04-15 11:12:16",
"updated_at": null,
"id": "4e864972-b692-487b-9204-b4d6470db266",
"session_id": "47479bca-7f0e-11ea-99c9-2c600c9893ee",
"hostname": "overcloud-novacompute-4",
"plugin": "dummy",
"state": null
}
]
}

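The sample above reports `"percent_done": 71` while five of its seven hosts have `"maintained": true`. That is consistent with an integer percentage of maintained hosts (5 / 7 ≈ 71.4), although how Fenix itself computes the value is not shown in this change. A small sketch, assuming only the `hosts`, `hostname` and `maintained` fields from the sample, that derives the same figure and lists the hosts still pending:

```python
def summarize_hosts(detail):
    """Return (integer percent of maintained hosts, hostnames still pending)."""
    hosts = detail["hosts"]
    done = [h["hostname"] for h in hosts if h["maintained"]]
    pending = [h["hostname"] for h in hosts if not h["maintained"]]
    # Assumption: the percentage is truncated to an integer.
    percent = 100 * len(done) // len(hosts) if hosts else 0
    return percent, pending


# Host names and flags copied from the sample response above.
sample = {"hosts": [
    {"hostname": "overcloud-novacompute-3", "maintained": True},
    {"hostname": "overcloud-novacompute-2", "maintained": True},
    {"hostname": "overcloud-novacompute-5", "maintained": True},
    {"hostname": "overcloud-novacompute-1", "maintained": True},
    {"hostname": "overcloud-novacompute-0", "maintained": False},
    {"hostname": "overcloud-novacompute-4", "maintained": False},
    {"hostname": "overcloud-controller-0", "maintained": True},
]}
print(summarize_hosts(sample))
# → (71, ['overcloud-novacompute-0', 'overcloud-novacompute-4'])
```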
@ -19,28 +19,60 @@
.. literalinclude:: samples/maintenance-session-put-200.json
:language: javascript
get-maintenance-sessions-get: |
maintenance-sessions-get: |
.. rest_parameters:: parameters.yaml
- session_id: uuid-list
.. literalinclude:: samples/get-maintenance-sessions-get-200.json
.. literalinclude:: samples/maintenance-sessions-get-200.json
:language: javascript
get-maintenance-session-get: |
maintenance-session-get: |
.. rest_parameters:: parameters.yaml
- state: workflow-state
.. literalinclude:: samples/get-maintenance-session-get-200.json
.. literalinclude:: samples/maintenance-session-get-200.json
:language: javascript
get-project-maintenance-session-post: |
maintenance-session-detail-get: |
.. rest_parameters:: parameters.yaml
- action: migration-type
- action_done: boolean
- created_at: datetime-string
- details: details
- disabled: boolean
- host: hostname
- hostname: hostname
- id: uuid
- instance_id: uuid
- instance_name: instance-name
- maintained: boolean
- maintenance_at: datetime-string
- meta: metadata
- percent_done: percent_done
- plugin: plugin
- prev_state: workflow-state
- project_id: uuid
- project_state: workflow-state-reply
- session_id: uuid
- state(action_plugin_instances): action-plugin-state
- state(instances): instance-state
- state: workflow-state
- type: host-type
- updated_at: datetime-string
- workflow: workflow-name
.. literalinclude:: samples/maintenance-session-detail-get-200.json
:language: javascript
project-maintenance-session-post: |
.. rest_parameters:: parameters.yaml
- instance_ids: instance-ids
.. literalinclude:: samples/get-project-maintenance-session-post-200.json
.. literalinclude:: samples/project-maintenance-session-post-200.json
:language: javascript
201:

@ -77,12 +77,38 @@ Example:
Event type 'maintenance.session'
--------------------------------
--Not yet implemented--
This event type is meant for infrastructure admin to know the changes in the
ongoing maintenance workflow session. When implemented, there will not be a need
for polling the state through an API.
ongoing maintenance workflow session. This can be used instead of polling the
API. The API still gives more detailed information if you need to troubleshoot.
payload
~~~~~~~~
+--------------+--------+-----------------------------------------------------------------+
| Name         | Type   | Description                                                     |
+==============+========+=================================================================+
| service      | string | Origin service name: Fenix                                      |
+--------------+--------+-----------------------------------------------------------------+
| state        | string | Maintenance workflow state (States explained in the user guide) |
+--------------+--------+-----------------------------------------------------------------+
| session_id   | string | UUID of the related maintenance session                         |
+--------------+--------+-----------------------------------------------------------------+
| percent_done | string | How many percent of hosts are maintained                        |
+--------------+--------+-----------------------------------------------------------------+
| project_id   | string | workflow admin project ID                                       |
+--------------+--------+-----------------------------------------------------------------+
Example:
.. code-block:: json
{
"service": "fenix",
"state": "IN_MAINTENANCE",
"session_id": "76e55df8-1c51-11e8-9928-0242ac110002",
"percent_done": 34,
"project_id": "ead0dbcaf3564cbbb04842e3e54960e3"
}
Project
=======

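A sketch of how an admin-side consumer might process the 'maintenance.session' payload documented above. The function name and the returned tuple are hypothetical. The two terminal states are the ones named in the tools README ('MAINTENANCE_DONE' and 'MAINTENANCE_FAILED'); every other state is treated as ongoing. `percent_done` is passed through `int()` because the table lists it as a string while the example shows a number.

```python
# Terminal states named in the tools README; anything else counts as ongoing.
FINAL_STATES = ("MAINTENANCE_DONE", "MAINTENANCE_FAILED")


def handle_session_event(payload):
    """Return (progress message, True if the session has finished)."""
    percent = int(payload["percent_done"])  # accepts 34 as well as "34"
    state = payload["state"]
    message = "session %s: %s, %d%% of hosts maintained" % (
        payload["session_id"], state, percent)
    return message, state in FINAL_STATES


event = {
    "service": "fenix",
    "state": "IN_MAINTENANCE",
    "session_id": "76e55df8-1c51-11e8-9928-0242ac110002",
    "percent_done": 34,
    "project_id": "ead0dbcaf3564cbbb04842e3e54960e3",
}
print(handle_session_event(event))
```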
@ -66,7 +66,11 @@ class V1Controller(rest.RestController):
            else:
                args[0] = 'http404-nonexistingcontroller'
        elif depth == 3 and route == "maintenance":
            args[0] = "project"
            last = self._routes.get(args[2], args[2])
            if last == "detail":
                args[0] = "session"
            else:
                args[0] = "project"
        elif depth == 4 and route == "maintenance":
            args[0] = "project_instance"
        else:

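The routing change above can be read as: a three-part path under `maintenance` goes to the session controller when its last part is `detail`, and to the project controller otherwise. A standalone sketch of just that decision follows; the function name is hypothetical, the `_routes` lookup is omitted, and paths of other depths are not modelled because the hunk does not show them.

```python
def controller_for(path_parts):
    """Pick a controller name for ['maintenance', <session_id>, ...] paths."""
    if not path_parts or path_parts[0] != "maintenance":
        return None  # not covered by this sketch
    depth = len(path_parts)
    if depth == 3:
        # /v1/maintenance/<session_id>/detail vs. /<session_id>/<project_id>
        return "session" if path_parts[2] == "detail" else "project"
    if depth == 4:
        return "project_instance"
    return None  # depths 1 and 2 are handled elsewhere in the real controller


print(controller_for(["maintenance", "47479bca", "detail"]))
print(controller_for(["maintenance", "47479bca", "444b05e6"]))
```

One consequence of this design is that the literal string `detail` can no longer be used as a project ID in a three-part path.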
@ -160,9 +160,10 @@ class SessionController(BaseController):
        self.engine_rpcapi = maintenance.EngineRPCAPI()
    # GET /v1/maintenance/<session_id>
    # GET /v1/maintenance/<session_id>/detail
    @policy.authorize('maintenance:session', 'get')
    @expose(content_type='application/json')
    def get(self, session_id):
    def get(self, session_id, detail=None):
        try:
            jsonschema.validate(session_id, schema.uid)
        except jsonschema.exceptions.ValidationError as e:
@ -173,7 +174,15 @@ class SessionController(BaseController):
            LOG.error("Unexpected data")
            abort(400)
        try:
            session = self.engine_rpcapi.admin_get_session(session_id)
            if detail:
                if detail != "detail":
                    description = "Invalid path %s" % detail
                    LOG.error(description)
                    abort(400, six.text_type(description))
                session = (
                    self.engine_rpcapi.admin_get_session_detail(session_id))
            else:
                session = self.engine_rpcapi.admin_get_session(session_id)
        except RemoteError as e:
            self.handle_remote_error(e)
        if session is None:

@ -37,9 +37,13 @@ class EngineRPCAPI(service.RPCClient):
        return self.call('admin_create_session', data=data)
    def admin_get_session(self, session_id):
        """Get maintenance workflow session details"""
        """Get maintenance workflow session state"""
        return self.call('admin_get_session', session_id=session_id)
    def admin_get_session_detail(self, session_id):
        """Get maintenance workflow session details"""
        return self.call('admin_get_session_detail', session_id=session_id)
    def admin_delete_session(self, session_id):
        """Delete maintenance workflow session thread"""
        return self.call('admin_delete_session', session_id=session_id)

@ -115,11 +115,23 @@ def create_session(values):
    return IMPL.create_session(values)
def update_session(values):
    return IMPL.update_session(values)
def remove_session(session_id):
    """Remove a session from the tables."""
    return IMPL.remove_session(session_id)
def get_session(session_id):
    return IMPL.maintenance_session_get(session_id)
def get_sessions():
    return IMPL.maintenance_session_get_all()
def create_action_plugin(values):
    """Create a action from the values."""
    return IMPL.create_action_plugin(values)
@ -129,10 +141,22 @@ def create_action_plugins(session_id, action_dict_list):
    return IMPL.create_action_plugins(action_dict_list)
def get_action_plugins(session_id):
    return IMPL.action_plugins_get_all(session_id)
def create_action_plugin_instance(values):
    return IMPL.create_action_plugin_instance(values)
def get_action_plugin_instances(session_id):
    return IMPL.action_plugin_instances_get_all(session_id)
def update_action_plugin_instance(values):
    return IMPL.update_action_plugin_instance(values)
def remove_action_plugin_instance(ap_instance):
    return IMPL.remove_action_plugin_instance(ap_instance)
@ -141,11 +165,19 @@ def create_downloads(download_dict_list):
    return IMPL.create_downloads(download_dict_list)
def get_downloads(session_id):
    return IMPL.download_get_all(session_id)
def create_host(values):
    """Create a host from the values."""
    return IMPL.create_host(values)
def update_host(values):
    return IMPL.update_host(values)
def create_hosts(session_id, hostnames):
    hosts = []
    for hostname in hostnames:
@ -174,6 +206,10 @@ def create_hosts_by_details(session_id, hosts_dict_list):
    return IMPL.create_hosts(hosts)
def get_hosts(session_id):
    return IMPL.hosts_get(session_id)
def create_projects(session_id, project_ids):
    projects = []
    for project_id in project_ids:
@ -185,6 +221,18 @@ def create_projects(session_id, project_ids):
    return IMPL.create_projects(projects)
def update_project(values):
    return IMPL.update_project(values)
def get_projects(session_id):
    return IMPL.projects_get(session_id)
def update_instance(values):
    return IMPL.update_instance(values)
def create_instance(values):
    """Create a instance from the values."""
    return IMPL.create_instance(values)
@ -199,6 +247,10 @@ def remove_instance(session_id, instance_id):
    return IMPL.remove_instance(session_id, instance_id)
def get_instances(session_id):
    return IMPL.instances_get(session_id)
def update_project_instance(values):
    return IMPL.update_project_instance(values)

@ -58,8 +58,6 @@ def upgrade():
sa.Column('maintained', sa.Boolean, default=False),
sa.Column('disabled', sa.Boolean, default=False),
sa.Column('details', sa.String(length=255), nullable=True),
sa.Column('plugin', sa.String(length=255), nullable=True),
sa.Column('plugin_state', sa.String(length=32), nullable=True),
sa.UniqueConstraint('session_id', 'hostname', name='_session_host_uc'),
sa.PrimaryKeyConstraint('id'))

@ -135,6 +135,15 @@ def maintenance_session_get(session_id):
    return _maintenance_session_get(get_session(), session_id)
def _maintenance_session_get_all(session):
    query = model_query(models.MaintenanceSession, session)
    return query
def maintenance_session_get_all():
    return _maintenance_session_get_all(get_session())
def create_session(values):
    values = values.copy()
    msession = models.MaintenanceSession()
@ -152,6 +161,18 @@ def create_session(values):
    return maintenance_session_get(msession.session_id)
def update_session(values):
    session = get_session()
    session_id = values.session_id
    with session.begin():
        msession = _maintenance_session_get(session,
                                            session_id)
        msession.update(values)
        msession.save(session=session)
    return maintenance_session_get(session_id)
def remove_session(session_id):
    session = get_session()
    with session.begin():
@ -276,6 +297,22 @@ def action_plugin_instances_get_all(session_id):
    return _action_plugin_instances_get_all(get_session(), session_id)
def update_action_plugin_instance(values):
    session = get_session()
    session_id = values.session_id
    plugin = values.plugin
    hostname = values.hostname
    with session.begin():
        ap_instance = _action_plugin_instance_get(session,
                                                  session_id,
                                                  plugin,
                                                  hostname)
        ap_instance.update(values)
        ap_instance.save(session=session)
    return action_plugin_instance_get(session_id, plugin, hostname)
def create_action_plugin_instance(values):
    values = values.copy()
    ap_instance = models.MaintenanceActionPluginInstance()
@ -402,6 +439,18 @@ def create_host(values):
    return host_get(mhost.session_id, mhost.hostname)
def update_host(values):
    session = get_session()
    session_id = values.session_id
    hostname = values.hostname
    with session.begin():
        mhost = _host_get(session, session_id, hostname)
        mhost.update(values)
        mhost.save(session=session)
    return host_get(session_id, hostname)
def create_hosts(values_list):
    for values in values_list:
        vals = values.copy()
@ -468,6 +517,18 @@ def create_project(values):
    return project_get(mproject.session_id, mproject.project_id)
def update_project(values):
    session = get_session()
    session_id = values.session_id
    project_id = values.project_id
    with session.begin():
        mproject = _project_get(session, session_id, project_id)
        mproject.update(values)
        mproject.save(session=session)
    return project_get(session_id, project_id)
def create_projects(values_list):
    for values in values_list:
        vals = values.copy()
@ -476,7 +537,7 @@ def create_projects(values_list):
mproject = models.MaintenanceProject()
mproject.update(vals)
if _project_get(session, mproject.session_id,
mproject.project_id):
mproject.project_id):
selected = ['project_id']
raise db_exc.FenixDBDuplicateEntry(
model=mproject.__class__.__name__,
@ -512,6 +573,18 @@ def instances_get(session_id):
    return _instances_get(get_session(), session_id)
def update_instance(values):
    session = get_session()
    session_id = values.session_id
    instance_id = values.instance_id
    with session.begin():
        minstance = _instance_get(session, session_id, instance_id)
        minstance.update(values)
        minstance.save(session=session)
    return instance_get(session_id, instance_id)
def create_instance(values):
    values = values.copy()
    minstance = models.MaintenanceInstance()

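The new `update_*` functions above all follow one pattern: look up the row by its natural key inside a transaction, apply the new values, save, then re-read and return the row. The sketch below illustrates that pattern with the standard-library `sqlite3` module instead of SQLAlchemy and the Fenix models, so the table layout and function signature are assumptions for illustration only. It also raises an explicit error when the row is missing; the hunks above do not show how that case is handled.

```python
import sqlite3


def update_host(conn, values):
    """Update the host identified by (session_id, hostname); return the row."""
    key = (values["session_id"], values["hostname"])
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT id FROM hosts WHERE session_id = ? AND hostname = ?",
            key).fetchone()
        if row is None:
            raise LookupError("no host %s in session %s" % (key[1], key[0]))
        conn.execute(
            "UPDATE hosts SET maintained = ?, disabled = ? WHERE id = ?",
            (values["maintained"], values["disabled"], row[0]))
    # Re-read after the transaction, as the functions above do.
    return conn.execute(
        "SELECT hostname, maintained, disabled FROM hosts WHERE id = ?",
        (row[0],)).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts (id INTEGER PRIMARY KEY, session_id TEXT, "
    "hostname TEXT, maintained INTEGER, disabled INTEGER, "
    "UNIQUE (session_id, hostname))")
conn.execute(
    "INSERT INTO hosts (session_id, hostname, maintained, disabled) "
    "VALUES ('s1', 'overcloud-novacompute-0', 0, 1)")
print(update_host(conn, {"session_id": "s1",
                         "hostname": "overcloud-novacompute-0",
                         "maintained": True, "disabled": False}))
```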
@ -99,8 +99,6 @@ class MaintenanceHost(mb.FenixBase):
maintained = sa.Column(sa.Boolean, default=False)
disabled = sa.Column(sa.Boolean, default=False)
details = sa.Column(sa.String(length=255), nullable=True)
plugin = sa.Column(sa.String(length=255), nullable=True)
plugin_state = sa.Column(sa.String(length=32), nullable=True)
def to_dict(self):
return super(MaintenanceHost, self).to_dict()

@ -117,9 +117,7 @@ def _get_fake_host_values(uuid=_get_fake_uuid(),
'type': 'compute',
'maintained': False,
'disabled': False,
'details': None,
'plugin': None,
'plugin_state': None}
'details': None}
return hdict

@ -10,7 +10,18 @@ Files:
- 'demo-ha.yaml': demo-ha ReplicaSet to make 2 anti-affinity PODS.
- 'demo-nonha.yaml': demo-nonha ReplicaSet to make n nonha PODS.
- 'vnfm.py': VNFM to test k8s.py workflow.
- 'vnfm_k8s.py': VNFM to test k8s.py (Kubernetes example) workflow.
- 'vnfm.py': VNFM to test nfv.py (OpenStack example) workflow.
- 'infra_admin.py': Tool to act as infrastructure admin. The tool also catches
the 'maintenance.session' and 'maintenance.host' events to keep track of
how the maintenance is progressing. You will see when a certain host is
maintained and what percentage of hosts are maintained.
- 'session.json': Example of defining maintenance session parameters as a JSON
file to be given as input to 'infra_admin.py'. The example is for the nfv.py
workflow. This could be used for any advanced workflow testing giving software
downloads and real action plugins.
- 'set_config.py': You can use this to set Fenix AODH/Ceilometer configuration.
- 'fenix_db_reset': Flush the Fenix database.
## Kubernetes workflow (k8s.py)
@ -92,7 +103,7 @@ kluster. Under here is what you can run in different terminals. Terminals
should be running in master node. Here is short description:
- Term1: Used for logging Fenix
- Term2: Infrastructure admin commands
- Term2: Infrastructure admin
- Term3: VNFM logging for testing and setting up the VNF
#### Term1: Fenix-engine logging
@ -114,6 +125,8 @@ Debugging and other configuration changes to '.conf' files under '/etc/fenix'
#### Term2: Infrastructure admin window
##### Admin commands as command line and curl
Use DevStack admin as user. Set your variables needed accordingly
```sh
@ -148,12 +161,42 @@ If maintenance run till the end with 'MAINTENANCE_DONE', you are ready to run it
again if you wish. 'MAINTENANCE_FAILED' or in case of exceptions, you should
recover system before trying to test again. This is covered in Term3 below.
#### Term3: VNFM (fenix/tools/vnfm.py)
##### Admin commands using admin tool
Use DevStack admin as user.
Go to Fenix tools directory
```sh
. ~/devstack/openrc admin admin
cd /opt/stack/fenix/fenix/tools
```
Run the admin tool to start the maintenance workflow. It defaults to the
'openstack' cloud type and the 'vnf' workflow; you can override those by
exporting environment variables
```sh
. ~/devstack/openrc admin admin
export WORKFLOW=k8s
export CLOUD_TYPE=k8s
python infra_admin.py
```
If you want to choose the maintenance workflow session parameters freely,
you can give a session.json file as input. With this option infra_admin.py
only overrides 'maintenance_at', setting it 20 seconds into the future from
the moment Fenix is called.
```sh
python infra_admin.py --file session.json
```
Maintenance starts when you press enter. Follow the instructions on the
console.
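The 'maintenance_at' override can be sketched as follows, assuming the same timestamp format that 'infra_admin.py' sends to Fenix. The function name `build_session_request` and its `now` / `delay_seconds` parameters are hypothetical and exist only to make the example testable; the JSON mirrors the example 'session.json'.

```python
import datetime
import json

# Contents mirror the example 'session.json' for the 'vnf' workflow.
SESSION_JSON = """
{
    "state": "MAINTENANCE",
    "metadata": {"openstack": "upgrade"},
    "actions": [{"metadata": {"os": "upgrade"}, "type": "host",
                 "plugin": "dummy"}],
    "workflow": "vnf"
}
"""


def build_session_request(session_json, now=None, delay_seconds=20):
    """Parse session parameters and schedule maintenance shortly after 'now'."""
    request = json.loads(session_json)
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    maintenance_at = now + datetime.timedelta(seconds=delay_seconds)
    # Only this key is overridden; everything else comes from the file.
    request['maintenance_at'] = maintenance_at.strftime('%Y-%m-%d %H:%M:%S')
    return request


fixed_now = datetime.datetime(2020, 4, 17, 12, 0, 0)
session_request = build_session_request(SESSION_JSON, now=fixed_now)
print(session_request['maintenance_at'])
```

The resulting dict is what would be serialized and posted to the maintenance endpoint; all other keys are taken from the file unchanged.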
#### Term3: VNFM (fenix/tools/vnfm_k8s.py)
Use DevStack as demo user for testing demo application
```sh
. ~/devstack/openrc demo demo
```
Go to Fenix Kubernetes tool directory for testing
@ -181,7 +224,7 @@ is 32 cpus, so value is "15" in both yaml files. Replicas can be changed in
demo-nonha.yaml. Minimum 2 (if minimum of 3 worker nodes) to maximum
'(amount_of_worker_nodes-1)*2'. Greater amount means more scaling needed and
longer maintenance window as less parallel actions possible. Surely constraints
in vnfm.py also can be changed for different behavior.
in vnfm_k8s.py also can be changed for different behavior.
You can delete pods used like this
@ -192,11 +235,11 @@ kubectl delete replicaset.apps demo-ha demo-nonha --namespace=demo
Start Kubernetes VNFM that we need for testing
```sh
python vnfm.py
python vnfm_k8s.py
```
Now you can start maintenance session in Term2. When workflow failed or
completed; you first kill vnfm.py with "ctrl+c" and delete maintenance session
completed; you first kill vnfm_k8s.py with "ctrl+c" and delete maintenance session
in Term2.
If workflow failed something might need to be manually fixed. Here you
@ -221,7 +264,8 @@ kubectl delete replicaset.apps demo-ha demo-nonha --namespace=demo;sleep 15;kube
## OpenStack workflows (default.py and vnf.py)
OpenStack workflows can be tested by using OPNFV Doctor project for testing.
OpenStack workflows can be tested either with the OPNFV Doctor project
or with Fenix's own tools.
Workflows:
- default.py is the first example workflow with VNFM interaction.
@ -290,7 +334,7 @@ cpu_allocation_ratio = 1.0
allow_resize_to_same_host = False
```
### Workflow default.py
### Workflow default.py testing with Doctor
On controller node clone Doctor to be able to test. Doctor currently requires
Python 3.6:
@ -331,13 +375,13 @@ sudo systemctl restart devstack@fenix*
You can also make changes to Doctor before running Doctor test
### Workflow vnf.py
### Workflow vnf.py testing with Doctor
This workflow differs from above as it expects ETSI FEAT03 constraints.
In Doctor testing it means we also need to use different application manager (VNFM)
Where the default.py workflow used the sample.py application manager, the vnf.py
workflow uses vnfm.py workflow (doctor/doctor_tests/app_manager/vnfm.py)
workflow uses vnfm_k8s.py workflow (doctor/doctor_tests/app_manager/vnfm_k8s.py)
Only change to testing is that you should export variable to use different
application manager.
@ -354,3 +398,115 @@ export APP_MANAGER_TYPE=sample
```
Doctor modifies the message where it calls maintenance accordingly to use
either 'default' or 'vnf' as workflow in Fenix side
### Workflow vnf.py testing with Fenix
Whereas Doctor automates everything as a test case, Fenix provides
separate tools for the admin and VNFM roles:
- 'vnfm.py': VNFM to test vnf.py.
- 'infra_admin.py': Tool to act as infrastructure admin.
Use 3 terminal windows (Term1, Term2 and Term3) to test Fenix with an OpenStack
cluster. Below is what you can run in the different terminals. The terminals
should be running on the controller node. Here is a short description:
- Term1: Used for logging Fenix
- Term2: Infrastructure admin
- Term3: VNFM logging for testing and setting up the VNF
#### Term1: Fenix-engine logging
If you make any changes to Fenix, make them under '/opt/stack/fenix', then restart
Fenix and follow the logs
```sh
sudo systemctl restart devstack@fenix*;sudo journalctl -f --unit devstack@fenix-engine
```
API logs can also be seen
```sh
sudo journalctl -f --unit devstack@fenix-api
```
Debugging and other configuration changes are made in the '.conf' files under '/etc/fenix'
#### Term2: Infrastructure admin window
Go to Fenix tools directory for testing
```sh
cd /opt/stack/fenix/fenix/tools
```
Create a flavor for testing that takes half of the VCPUs of a single
compute node (here we have 48 VCPUs on each compute node). This is required by
the current example 'vnfm.py' and the VNF 'maintenance_hot_tpl.yaml' that
is used in testing. The 'vnf.py' workflow is not bound to these in any way and
can be used with different VNFs and VNFMs.
```sh
openstack flavor create --ram 512 --vcpus 24 --disk 1 --public demo_maint_flavor
```
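The sizing rule is simple arithmetic: the flavor gets half of one compute node's VCPUs, so 48 VCPUs per node gives a 24 VCPU flavor. A minimal sketch, with the hypothetical helper names `flavor_vcpus` and `flavor_create_command`, that derives the command for a different node size:

```python
def flavor_vcpus(vcpus_per_compute):
    """Return a flavor size that is half of one compute node's VCPUs."""
    if vcpus_per_compute < 2:
        raise ValueError('need at least 2 VCPUs per compute node')
    return vcpus_per_compute // 2


def flavor_create_command(vcpus_per_compute, name='demo_maint_flavor'):
    """Build the 'openstack flavor create' command for the test flavor."""
    return ('openstack flavor create --ram 512 --vcpus %d --disk 1 '
            '--public %s' % (flavor_vcpus(vcpus_per_compute), name))


print(flavor_create_command(48))
```

The RAM and disk values are copied from the command used in this README; only the VCPU count is derived.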
Run the admin tool and it will run the vnf.py workflow.
```sh
. ~/devstack/openrc admin admin
python infra_admin.py
```
If you want to choose the maintenance workflow session parameters freely,
you can give a 'session.json' file as input. With this option 'infra_admin.py'
only overrides 'maintenance_at', setting it 20 seconds into the future from
the moment Fenix is called.
```sh
python infra_admin.py --file session.json
```
Maintenance starts when you press enter. Follow the instructions on the
console.
If you failed to remove the maintenance workflow session, you can do it
manually as instructed above in 'Admin commands as command line and curl'.
#### Term3: VNFM (fenix/tools/vnfm.py)
Use DevStack as demo user for testing demo application
```sh
. ~/devstack/openrc demo demo
```
Go to Fenix tools directory for testing
```sh
cd /opt/stack/fenix/fenix/tools
```
Start VNFM that we need for testing
```sh
python vnfm.py
```
Now you can start the maintenance session in Term2. When the workflow has failed
or completed, first kill vnfm.py with "ctrl+c" and then delete the maintenance
session in Term2.
If the workflow failed, something might need to be fixed manually.
Here you can remove the Heat stack if vnfm.py failed to do that:
```sh
openstack stack delete -y --wait demo_stack
```
It may also be that the workflow failed somewhere in the middle and some
'nova-compute' services are left disabled. You can enable those. Here you can see
the states:
```sh
openstack compute service list
```
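To find the services that need re-enabling, the list can be filtered programmatically. The sketch below assumes the JSON form of the listing (`openstack compute service list -f json`) with `Binary`, `Host` and `Status` keys; that key naming is an assumption about the client output, and the helper name `disabled_compute_hosts` and the sample data are hypothetical.

```python
import json


def disabled_compute_hosts(service_list_json):
    """Return hosts whose nova-compute service is reported as disabled."""
    services = json.loads(service_list_json)
    return sorted(s['Host'] for s in services
                  if s.get('Binary') == 'nova-compute' and
                  s.get('Status') == 'disabled')


# Illustrative sample, not real output from any deployment.
sample = json.dumps([
    {'Binary': 'nova-compute', 'Host': 'compute-2', 'Status': 'disabled'},
    {'Binary': 'nova-compute', 'Host': 'compute-1', 'Status': 'enabled'},
    {'Binary': 'nova-scheduler', 'Host': 'controller', 'Status': 'enabled'},
])

for host in disabled_compute_hosts(sample):
    # Print the command rather than running it, so nothing changes by accident.
    print('openstack compute service set --enable %s nova-compute' % host)
```

Printing the commands instead of executing them leaves the decision to enable each host with the operator.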

@ -0,0 +1,9 @@
MYSQLPW=admin
# Fenix DB
[ `mysql -uroot -p$MYSQLPW -e "SELECT host, user FROM mysql.user;" | grep fenix | wc -l` -eq 0 ] && {
mysql -uroot -p$MYSQLPW -hlocalhost -e "CREATE USER 'fenix'@'localhost' IDENTIFIED BY 'fenix';"
mysql -uroot -p$MYSQLPW -hlocalhost -e "GRANT ALL PRIVILEGES ON fenix.* TO 'fenix'@'' identified by 'fenix';FLUSH PRIVILEGES;"
}
mysql -ufenix -pfenix -hlocalhost -e "DROP DATABASE IF EXISTS fenix;"
mysql -ufenix -pfenix -hlocalhost -e "CREATE DATABASE fenix CHARACTER SET utf8;"

320
fenix/tools/infra_admin.py Normal file

@ -0,0 +1,320 @@
# Copyright (c) 2020 Nokia Corporation.
# All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
import aodhclient.client as aodhclient
import argparse
import datetime
from flask import Flask
from flask import request
import json
from keystoneauth1 import loading
from keystoneclient import client as ks_client
import logging as lging
import os
from oslo_config import cfg
from oslo_log import log as logging
import requests
import sys
from threading import Thread
import time
import yaml
try:
import fenix.utils.identity_auth as identity_auth
except (ImportError, ValueError):
sys.path.append('../utils')
import identity_auth
try:
input = raw_input
except NameError:
pass
LOG = logging.getLogger(__name__)
streamlog = lging.StreamHandler(sys.stdout)
formatter = lging.Formatter("%(asctime)s: %(message)s")
streamlog.setFormatter(formatter)
LOG.logger.addHandler(streamlog)
LOG.logger.setLevel(logging.INFO)
def get_identity_auth(conf, project=None, username=None, password=None):
loader = loading.get_plugin_loader('password')
return loader.load_from_options(
auth_url=conf.service_user.os_auth_url,
username=(username or conf.service_user.os_username),
password=(password or conf.service_user.os_password),
user_domain_name=conf.service_user.os_user_domain_name,
project_name=(project or conf.service_user.os_project_name),
tenant_name=(project or conf.service_user.os_project_name),
project_domain_name=conf.service_user.os_project_domain_name)
class InfraAdmin(object):
def __init__(self, conf, log):
self.conf = conf
self.log = log
self.app = None
def start(self):
self.log.info('InfraAdmin start...')
self.app = InfraAdminManager(self.conf, self.log)
self.app.start()
def stop(self):
self.log.info('InfraAdmin stop...')
if not self.app:
return
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
}
url = 'http://%s:%d/shutdown'\
% (self.conf.host,
self.conf.port)
requests.post(url, data='', headers=headers)
class InfraAdminManager(Thread):
def __init__(self, conf, log, project='service'):
Thread.__init__(self)
self.conf = conf
self.log = log
self.project = project
# Now we are as admin:admin:admin by default. This means we listen
# notifications/events as admin
# This means Fenix service user needs to be admin:admin:admin
# self.auth = identity_auth.get_identity_auth(conf,
# project=self.project)
self.auth = get_identity_auth(conf,
project='service',
username='fenix',
password='admin')
self.session = identity_auth.get_session(auth=self.auth)
self.keystone = ks_client.Client(version='v3', session=self.session)
self.aodh = aodhclient.Client(2, self.session)
self.headers = {
'Content-Type': 'application/json',
'Accept': 'application/json'}
self.project_id = self.keystone.projects.list(name=self.project)[0].id
self.headers['X-Auth-Token'] = self.session.get_token()
self.create_alarm()
services = self.keystone.services.list()
for service in services:
if service.type == 'maintenance':
LOG.info('maintenance service: %s:%s type %s'
% (service.name, service.id, service.type))
maint_id = service.id
self.endpoint = [ep.url for ep in self.keystone.endpoints.list()
if ep.service_id == maint_id and
ep.interface == 'public'][0]
self.log.info('maintenance endpoint: %s' % self.endpoint)
if self.conf.workflow_file:
with open(self.conf.workflow_file) as json_file:
self.session_request = yaml.safe_load(json_file)
else:
if self.conf.cloud_type == 'openstack':
metadata = {'openstack': 'upgrade'}
elif self.conf.cloud_type in ['k8s', 'kubernetes']:
metadata = {'kubernetes': 'upgrade'}
else:
metadata = {}
self.session_request = {'state': 'MAINTENANCE',
'workflow': self.conf.workflow,
'metadata': metadata,
'actions': [
{"plugin": "dummy",
"type": "host",
"metadata": {"foo": "bar"}}]}
self.start_maintenance()
def create_alarm(self):
alarms = {alarm['name']: alarm for alarm in self.aodh.alarm.list()}
alarm_name = "%s_MAINTENANCE_SESSION" % self.project
if alarm_name not in alarms:
alarm_request = dict(
name=alarm_name,
description=alarm_name,
enabled=True,
alarm_actions=[u'http://%s:%d/maintenance_session'
% (self.conf.host,
self.conf.port)],
repeat_actions=True,
severity=u'moderate',
type=u'event',
event_rule=dict(event_type=u'maintenance.session'))
self.aodh.alarm.create(alarm_request)
alarm_name = "%s_MAINTENANCE_HOST" % self.project
if alarm_name not in alarms:
alarm_request = dict(
name=alarm_name,
description=alarm_name,
enabled=True,
alarm_actions=[u'http://%s:%d/maintenance_host'
% (self.conf.host,
self.conf.port)],
repeat_actions=True,
severity=u'moderate',
type=u'event',
event_rule=dict(event_type=u'maintenance.host'))
self.aodh.alarm.create(alarm_request)
def start_maintenance(self):
self.log.info('Waiting AODH to initialize...')
time.sleep(5)
input('--Press ENTER to start maintenance session--')
maintenance_at = (datetime.datetime.utcnow() +
datetime.timedelta(seconds=20)
).strftime('%Y-%m-%d %H:%M:%S')
self.session_request['maintenance_at'] = maintenance_at
self.headers['X-Auth-Token'] = self.session.get_token()
url = self.endpoint + "/maintenance"
self.log.info('Start maintenance session: %s\n%s\n%s' %
(url, self.headers, self.session_request))
ret = requests.post(url, data=json.dumps(self.session_request),
headers=self.headers)
session_id = ret.json()['session_id']
self.log.info('--== Maintenance session %s instantiated ==--'
% session_id)
def _alarm_data_decoder(self, data):
if "[" in data or "{" in data:
# string to list or dict removing unicode
data = yaml.safe_load(data.replace("u'", "'"))
return data
def _alarm_traits_decoder(self, data):
return ({str(t[0]): self._alarm_data_decoder(str(t[2]))
for t in data['reason_data']['event']['traits']})
def run(self):
app = Flask('InfraAdmin')
@app.route('/maintenance_host', methods=['POST'])
def maintenance_host():
data = json.loads(request.data.decode('utf8'))
try:
payload = self._alarm_traits_decoder(data)
except Exception:
payload = ({t[0]: t[2] for t in
data['reason_data']['event']['traits']})
self.log.error('cannot parse alarm data: %s' % payload)
raise Exception('InfraAdmin cannot parse alarm. '
'Possibly trait data over 256 char')
state = payload['state']
host = payload['host']
session_id = payload['session_id']
self.log.info("%s: Host: %s %s" % (session_id, host, state))
return 'OK'
@app.route('/maintenance_session', methods=['POST'])
def maintenance_session():
data = json.loads(request.data.decode('utf8'))
try:
payload = self._alarm_traits_decoder(data)
except Exception:
payload = ({t[0]: t[2] for t in
data['reason_data']['event']['traits']})
self.log.error('cannot parse alarm data: %s' % payload)
raise Exception('InfraAdmin cannot parse alarm. '
'Possibly trait data over 256 char')
state = payload['state']
percent_done = payload['percent_done']
session_id = payload['session_id']
self.log.info("%s: %s%% done in state %s" % (session_id,
percent_done,
state))
if state in ['MAINTENANCE_FAILED', 'MAINTENANCE_DONE']:
self.headers['X-Auth-Token'] = self.session.get_token()
input('--Press ENTER to remove %s session--' %
session_id)
self.log.info('Remove maintenance session %s....' % session_id)
url = ('%s/maintenance/%s' % (self.endpoint, session_id))
self.headers['X-Auth-Token'] = self.session.get_token()
ret = requests.delete(url, data=None, headers=self.headers)
LOG.info('Press CTRL + C to quit')
if ret.status_code != 200:
raise Exception(ret.text)
return 'OK'
@app.route('/shutdown', methods=['POST'])
def shutdown():
self.log.info('shutdown InfraAdmin server at %s' % time.time())
func = request.environ.get('werkzeug.server.shutdown')
if func is None:
raise RuntimeError('Not running with the Werkzeug Server')
func()
return 'InfraAdmin shutting down...'
app.run(host=self.conf.host, port=self.conf.port)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Workflow Admin tool')
parser.add_argument('--file', type=str, default=None,
help='Workflow session creation arguments file')
parser.add_argument('--host', type=str, default=None,
help='the ip of InfraAdmin')
parser.add_argument('--port', type=int, default=None,
help='the port of InfraAdmin')
args = parser.parse_args()
opts = [
cfg.StrOpt('host',
default=(args.host or '127.0.0.1'),
help='the ip of InfraAdmin',
required=True),
cfg.IntOpt('port',
default=(args.port or 12349),
help='the port of InfraAdmin',
required=True),
cfg.StrOpt('workflow',
default=os.environ.get('WORKFLOW', 'vnf'),
help='Workflow to be used',
required=True),
cfg.StrOpt('cloud_type',
default=os.environ.get('CLOUD_TYPE', 'openstack'),
help='Cloud type for metadata',
required=True),
cfg.StrOpt('workflow_file',
default=(args.file or None),
help='Workflow session creation arguments file',
required=False)]
CONF = cfg.CONF
CONF.register_opts(opts)
CONF.register_opts(identity_auth.os_opts, group='service_user')
app = InfraAdmin(CONF, LOG)
app.start()
try:
LOG.info('Press CTRL + C to quit')
while True:
time.sleep(2)
except KeyboardInterrupt:
app.stop()

@ -0,0 +1,108 @@
---
heat_template_version: 2017-02-24
description: Demo VNF test case
parameters:
ext_net:
type: string
default: public
# flavor_vcpus:
# type: number
# default: 24
maint_image:
type: string
default: cirros-0.4.0-x86_64-disk
ha_intances:
type: number
default: 2
nonha_intances:
type: number
default: 10
app_manager_alarm_url:
type: string
default: http://0.0.0.0:12348/maintenance
resources:
int_net:
type: OS::Neutron::Net
int_subnet:
type: OS::Neutron::Subnet
properties:
network_id: {get_resource: int_net}
cidr: "9.9.9.0/24"
dns_nameservers: ["8.8.8.8"]
ip_version: 4
int_router:
type: OS::Neutron::Router
properties:
external_gateway_info: {network: {get_param: ext_net}}
int_interface:
type: OS::Neutron::RouterInterface
properties:
router_id: {get_resource: int_router}
subnet: {get_resource: int_subnet}
# maint_instance_flavor:
# type: OS::Nova::Flavor
# properties:
# name: demo_maint_flavor
# ram: 512
# vcpus: {get_param: flavor_vcpus}
# disk: 1
ha_app_svrgrp:
type: OS::Nova::ServerGroup
properties:
name: demo_ha_app_group
policies: ['anti-affinity']
floating_ip:
type: OS::Nova::FloatingIP
properties:
pool: {get_param: ext_net}
multi_ha_instances:
type: OS::Heat::ResourceGroup
properties:
count: {get_param: ha_intances}
resource_def:
type: OS::Nova::Server
properties:
name: demo_ha_app_%index%
flavor: demo_maint_flavor
image: {get_param: maint_image}
networks:
- network: {get_resource: int_net}
scheduler_hints:
group: {get_resource: ha_app_svrgrp}
multi_nonha_instances:
type: OS::Heat::ResourceGroup
properties:
count: {get_param: nonha_intances}
resource_def:
type: OS::Nova::Server
properties:
name: demo_nonha_app_%index%
flavor: demo_maint_flavor
image: {get_param: maint_image}
networks:
- network: {get_resource: int_net}
association:
type: OS::Nova::FloatingIPAssociation
properties:
floating_ip: {get_resource: floating_ip}
server_id: {get_attr: [multi_ha_instances, resource.0]}
app_manager_alarm:
type: OS::Aodh::EventAlarm
properties:
alarm_actions:
- {get_param: app_manager_alarm_url}
event_type: "maintenance.scheduled"
repeat_actions: true

6
fenix/tools/session.json Normal file

@ -0,0 +1,6 @@
{
"state": "MAINTENANCE",
"metadata": {"openstack": "upgrade"},
"actions": [{"metadata": {"os": "upgrade"}, "type": "host", "plugin": "dummy"}],
"workflow": "vnf"
}

185
fenix/tools/set_config.py Normal file

@ -0,0 +1,185 @@
# Copyright (c) 2020 ZTE and others.
# All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
import os
import shutil
import yaml
cbase = "/var/lib/config-data/puppet-generated/ceilometer"
if not os.path.isdir(cbase):
cbase = ""
def set_notifier_topic():
ep_file = cbase + '/etc/ceilometer/event_pipeline.yaml'
ep_file_bak = cbase + '/etc/ceilometer/event_pipeline.yaml.bak'
event_notifier_topic = 'notifier://?topic=alarm.all'
config_modified = False
if not os.path.isfile(ep_file):
raise Exception("File doesn't exist: %s." % ep_file)
with open(ep_file, 'r') as file:
config = yaml.safe_load(file)
sinks = config['sinks']
for sink in sinks:
if sink['name'] == 'event_sink':
publishers = sink['publishers']
if event_notifier_topic not in publishers:
print('Add event notifier in ceilometer')
publishers.append(event_notifier_topic)
config_modified = True
else:
print('NOTE: event notifier is configured '
'in ceilometer as we needed')
if config_modified:
shutil.copyfile(ep_file, ep_file_bak)
with open(ep_file, 'w+') as file:
file.write(yaml.safe_dump(config))
def set_event_definitions():
ed_file = cbase + '/etc/ceilometer/event_definitions.yaml'
ed_file_bak = cbase + '/etc/ceilometer/event_definitions.bak'
orig_ed_file_exist = True
modify_config = False
if not os.path.isfile(ed_file):
# Deployment did not modify file, so it did not exist
src_file = '/etc/ceilometer/event_definitions.yaml'
if not os.path.isfile(src_file):
config = []
orig_ed_file_exist = False
else:
shutil.copyfile('/etc/ceilometer/event_definitions.yaml', ed_file)
if orig_ed_file_exist:
with open(ed_file, 'r') as file:
config = yaml.safe_load(file)
et_list = [et['event_type'] for et in config]
if 'compute.instance.update' in et_list:
print('NOTE: compute.instance.update already configured')
else:
print('NOTE: add compute.instance.update to event_definitions.yaml')
modify_config = True
instance_update = {
'event_type': 'compute.instance.update',
'traits': {
'deleted_at': {'fields': 'payload.deleted_at',
'type': 'datetime'},
'disk_gb': {'fields': 'payload.disk_gb',
'type': 'int'},
'display_name': {'fields': 'payload.display_name'},
'ephemeral_gb': {'fields': 'payload.ephemeral_gb',
'type': 'int'},
'host': {'fields': 'publisher_id.`split(., 1, 1)`'},
'instance_id': {'fields': 'payload.instance_id'},
'instance_type': {'fields': 'payload.instance_type'},
'instance_type_id': {'fields': 'payload.instance_type_id',
'type': 'int'},
'launched_at': {'fields': 'payload.launched_at',
'type': 'datetime'},
'memory_mb': {'fields': 'payload.memory_mb',
'type': 'int'},
'old_state': {'fields': 'payload.old_state'},
'os_architecture': {
'fields':
"payload.image_meta.'org.openstack__1__architecture'"},
'os_distro': {
'fields':
"payload.image_meta.'org.openstack__1__os_distro'"},
'os_version': {
'fields':
"payload.image_meta.'org.openstack__1__os_version'"},
'resource_id': {'fields': 'payload.instance_id'},
'root_gb': {'fields': 'payload.root_gb',
'type': 'int'},
'service': {'fields': 'publisher_id.`split(., 0, -1)`'},
'state': {'fields': 'payload.state'},
'tenant_id': {'fields': 'payload.tenant_id'},
'user_id': {'fields': 'payload.user_id'},
'vcpus': {'fields': 'payload.vcpus', 'type': 'int'}
}
}
config.append(instance_update)
if 'maintenance.scheduled' in et_list:
print('NOTE: maintenance.scheduled already configured')
else:
print('NOTE: add maintenance.scheduled to event_definitions.yaml')
modify_config = True
mscheduled = {
'event_type': 'maintenance.scheduled',
'traits': {
'allowed_actions': {'fields': 'payload.allowed_actions'},
'instance_ids': {'fields': 'payload.instance_ids'},
'reply_url': {'fields': 'payload.reply_url'},
'actions_at': {'fields': 'payload.actions_at',
'type': 'datetime'},
'reply_at': {'fields': 'payload.reply_at', 'type': 'datetime'},
'state': {'fields': 'payload.state'},
'session_id': {'fields': 'payload.session_id'},
'project_id': {'fields': 'payload.project_id'},
'metadata': {'fields': 'payload.metadata'}
}
}
config.append(mscheduled)
if 'maintenance.host' in et_list:
print('NOTE: maintenance.host already configured')
else:
print('NOTE: add maintenance.host to event_definitions.yaml')
modify_config = True
mhost = {
'event_type': 'maintenance.host',
'traits': {
'host': {'fields': 'payload.host'},
'project_id': {'fields': 'payload.project_id'},
'state': {'fields': 'payload.state'},
'session_id': {'fields': 'payload.session_id'}
}
}
config.append(mhost)
if 'maintenance.session' in et_list:
print('NOTE: maintenance.session already configured')
else:
print('NOTE: add maintenance.session to event_definitions.yaml')
modify_config = True
msession = {
'event_type': 'maintenance.session',
'traits': {
'percent_done': {'fields': 'payload.percent_done'},
'project_id': {'fields': 'payload.project_id'},
'state': {'fields': 'payload.state'},
'session_id': {'fields': 'payload.session_id'}
}
}
config.append(msession)
if modify_config:
if orig_ed_file_exist:
shutil.copyfile(ed_file, ed_file_bak)
else:
with open(ed_file_bak, 'w+') as file:
file.close()
with open(ed_file, 'w+') as file:
file.write(yaml.safe_dump(config))
set_notifier_topic()
set_event_definitions()

@ -12,21 +12,25 @@
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
import aodhclient.client as aodhclient
import datetime
from flask import Flask
from flask import request
import heatclient.client as heatclient
from heatclient.common.template_utils import get_template_contents
from heatclient import exc as heat_excecption
import json
from keystoneauth1 import loading
from keystoneclient import client as ks_client
from kubernetes import client
from kubernetes import config
import logging as lging
from neutronclient.v2_0 import client as neutronclient
import novaclient.client as novaclient
import os
from oslo_config import cfg
from oslo_log import log as logging
import requests
import sys
from threading import Thread
import time
import uuid
import yaml
try:
@ -56,6 +60,120 @@ CONF.register_opts(opts)
CONF.register_opts(identity_auth.os_opts, group='service_user')
class Stack(object):
def __init__(self, conf, log, project='demo'):
self.conf = conf
self.log = log
self.project = project
self.auth = identity_auth.get_identity_auth(conf, project=self.project)
self.session = identity_auth.get_session(self.auth)
self.heat = heatclient.Client(version='1', session=self.session)
self.stack_name = None
self.stack_id = None
self.template = None
self.parameters = {}
self.files = {}
# standard yaml.load will not work for hot tpl because the date format in
# heat_template_version is not a string
def get_hot_tpl(self, template_file):
if not os.path.isfile(template_file):
raise Exception('File(%s) does not exist' % template_file)
return get_template_contents(template_file=template_file)
def _wait_stack_action_complete(self, action):
action_in_progress = '%s_IN_PROGRESS' % action
action_complete = '%s_COMPLETE' % action
action_failed = '%s_FAILED' % action
status = action_in_progress
stack_retries = 160
while status == action_in_progress and stack_retries > 0:
time.sleep(2)
try:
stack = self.heat.stacks.get(self.stack_name)
except heat_excecption.HTTPNotFound:
if action == 'DELETE':
# Might happen you never get status as stack deleted
status = action_complete
break
else:
raise Exception('unable to get stack')
status = stack.stack_status
stack_retries = stack_retries - 1
if stack_retries == 0 and status != action_complete:
raise Exception("stack %s not completed within 5min, status:"
" %s" % (action, status))
elif status == action_complete:
self.log.info('stack %s %s' % (self.stack_name, status))
elif status == action_failed:
raise Exception("stack %s failed" % action)
else:
self.log.error('stack %s %s' % (self.stack_name, status))
raise Exception("stack %s unknown result" % action)
def wait_stack_delete(self):
self._wait_stack_action_complete('DELETE')
def wait_stack_create(self):
self._wait_stack_action_complete('CREATE')
def wait_stack_update(self):
self._wait_stack_action_complete('UPDATE')
def create(self, stack_name, template, parameters={}, files={}):
self.stack_name = stack_name
self.template = template
self.parameters = parameters
self.files = files
stack = self.heat.stacks.create(stack_name=self.stack_name,
files=files,
template=template,
parameters=parameters)
self.stack_id = stack['stack']['id']
try:
self.wait_stack_create()
except Exception:
# It might not always work at first
self.log.info('retry creating maintenance stack.......')
self.delete()
time.sleep(5)
stack = self.heat.stacks.create(stack_name=self.stack_name,
files=files,
template=template,
parameters=parameters)
self.stack_id = stack['stack']['id']
self.wait_stack_create()
def update(self, stack_name, stack_id, template, parameters={}, files={}):
self.heat.stacks.update(stack_name=stack_name,
stack_id=stack_id,
files=files,
template=template,
parameters=parameters)
self.wait_stack_update()
def delete(self):
if self.stack_id is not None:
self.heat.stacks.delete(self.stack_name)
self.wait_stack_delete()
else:
self.log.info('no stack to delete')
def get_identity_auth(conf, project=None, username=None, password=None):
loader = loading.get_plugin_loader('password')
return loader.load_from_options(
auth_url=conf.service_user.os_auth_url,
username=(username or conf.service_user.os_username),
password=(password or conf.service_user.os_password),
user_domain_name=conf.service_user.os_user_domain_name,
project_name=(project or conf.service_user.os_project_name),
tenant_name=(project or conf.service_user.os_project_name),
project_domain_name=conf.service_user.os_project_domain_name)
class VNFM(object):
def __init__(self, conf, log):
@ -64,16 +182,18 @@ class VNFM(object):
self.app = None
def start(self):
LOG.info('VNFM start......')
self.log.info('VNFM start...')
self.app = VNFManager(self.conf, self.log)
self.app.start()
def stop(self):
LOG.info('VNFM stop......')
self.log.info('VNFM stop...')
if not self.app:
return
self.app.headers['X-Auth-Token'] = self.app.session.get_token()
self.log.info('delete VNF constraints...')
self.app.delete_constraints()
self.log.info('VNF delete start...')
self.app.stack.delete()
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
@ -86,29 +206,38 @@ class VNFM(object):
class VNFManager(Thread):
def __init__(self, conf, log):
def __init__(self, conf, log, project='demo'):
Thread.__init__(self)
self.conf = conf
self.log = log
self.port = self.conf.port
self.log = log
self.intance_ids = None
# VNFM is started with OS_* exported as admin user
# We need that to query Fenix endpoint url
# Still we work with our tenant/poroject/vnf as demo
self.project = "demo"
LOG.info('VNFM project: %s' % self.project)
self.project = project
self.auth = identity_auth.get_identity_auth(conf, project=self.project)
self.session = identity_auth.get_session(auth=self.auth)
self.ks = ks_client.Client(version='v3', session=self.session)
self.aodh = aodhclient.Client(2, self.session)
# Subscribe to mainenance event alarm from Fenix via AODH
self.create_alarm()
config.load_kube_config()
self.kaapi = client.AppsV1Api()
self.kapi = client.CoreV1Api()
self.keystone = ks_client.Client(version='v3', session=self.session)
auth = get_identity_auth(conf,
project='service',
username='fenix',
password='admin')
session = identity_auth.get_session(auth=auth)
keystone = ks_client.Client(version='v3', session=session)
self.nova = novaclient.Client(version='2.34', session=self.session)
self.neutron = neutronclient.Client(session=self.session)
self.headers = {
'Content-Type': 'application/json',
'Accept': 'application/json'}
self.project_id = self.session.get_project_id()
self.stack = Stack(self.conf, self.log, self.project)
files, template = self.stack.get_hot_tpl('maintenance_hot_tpl.yaml')
ext_net = self.get_external_network()
parameters = {'ext_net': ext_net}
self.log.info('creating VNF...')
self.log.info('parameters: %s' % parameters)
self.stack.create('%s_stack' % self.project,
template,
parameters=parameters,
files=files)
self.headers['X-Auth-Token'] = self.session.get_token()
self.orig_number_of_instances = self.number_of_instances()
# List of instances
@ -118,66 +247,58 @@ class VNFManager(Thread):
self.instance_constraints = None
# Update existing instances to instance lists
self.update_instances()
# How many instances needs to exists (with current VNF load)
# max_impacted_members need to be updated accordingly
# if number of instances is scaled. example for demo-ha:
# max_impacted_members = len(self.ha_instances) - ha_group_limit
self.ha_group_limit = 2
self.nonha_group_limit = 2
nonha_instances = len(self.nonha_instances)
if nonha_instances < 7:
self.scale = 2
else:
self.scale = int((nonha_instances) / 2)
self.log.info('Init nonha_instances: %s scale: %s: max_impacted %s' %
(nonha_instances, self.scale, nonha_instances - 1))
# Different instance groups constraints dict
self.ha_group = None
self.nonha_group = None
# VNF project_id (VNF ID)
self.project_id = None
# HA instance_id that is active has active label
self.nonha_group_id = str(uuid.uuid4())
self.ha_group_id = [sg.id for sg in self.nova.server_groups.list()
if sg.name == "%s_ha_app_group" % self.project][0]
# Floating IP used in HA instance
self.floating_ip = None
# HA instance_id that is active / has floating IP
self.active_instance_id = self.active_instance_id()
services = self.ks.services.list()
services = keystone.services.list()
for service in services:
if service.type == 'maintenance':
LOG.info('maintenance service: %s:%s type %s'
% (service.name, service.id, service.type))
self.log.info('maintenance service: %s:%s type %s'
% (service.name, service.id, service.type))
maint_id = service.id
self.maint_endpoint = [ep.url for ep in self.ks.endpoints.list()
self.maint_endpoint = [ep.url for ep in keystone.endpoints.list()
if ep.service_id == maint_id and
ep.interface == 'public'][0]
LOG.info('maintenance endpoint: %s' % self.maint_endpoint)
self.log.info('maintenance endpoint: %s' % self.maint_endpoint)
self.update_constraints_lock = False
self.update_constraints()
# Instances waiting action to be done
self.pending_actions = {}
def create_alarm(self):
alarms = {alarm['name']: alarm for alarm in self.aodh.alarm.list()}
alarm_name = "%s_MAINTENANCE_ALARM" % self.project
if alarm_name in alarms:
return
alarm_request = dict(
name=alarm_name,
description=alarm_name,
enabled=True,
alarm_actions=[u'http://%s:%d/maintenance'
% (self.conf.ip,
self.conf.port)],
repeat_actions=True,
severity=u'moderate',
type=u'event',
event_rule=dict(event_type=u'maintenance.scheduled'))
self.aodh.alarm.create(alarm_request)
def get_external_network(self):
ext_net = None
networks = self.neutron.list_networks()['networks']
for network in networks:
if network['router:external']:
ext_net = network['name']
break
if ext_net is None:
raise Exception("external network not defined")
return ext_net
def delete_remote_instance_constraints(self, instance_id):
url = "%s/instance/%s" % (self.maint_endpoint, instance_id)
LOG.info('DELETE: %s' % url)
self.log.info('DELETE: %s' % url)
ret = requests.delete(url, data=None, headers=self.headers)
if ret.status_code != 200 and ret.status_code != 204:
if ret.status_code == 404:
LOG.info('Already deleted: %s' % instance_id)
else:
raise Exception(ret.text)
raise Exception(ret.text)
def update_remote_instance_constraints(self, instance):
url = "%s/instance/%s" % (self.maint_endpoint, instance["instance_id"])
LOG.info('PUT: %s' % url)
self.log.info('PUT: %s' % url)
ret = requests.put(url, data=json.dumps(instance),
headers=self.headers)
if ret.status_code != 200 and ret.status_code != 204:
@ -186,7 +307,7 @@ class VNFManager(Thread):
def delete_remote_group_constraints(self, instance_group):
url = "%s/instance_group/%s" % (self.maint_endpoint,
instance_group["group_id"])
LOG.info('DELETE: %s' % url)
self.log.info('DELETE: %s' % url)
ret = requests.delete(url, data=None, headers=self.headers)
if ret.status_code != 200 and ret.status_code != 204:
raise Exception(ret.text)
@ -194,13 +315,14 @@ class VNFManager(Thread):
def update_remote_group_constraints(self, instance_group):
url = "%s/instance_group/%s" % (self.maint_endpoint,
instance_group["group_id"])
LOG.info('PUT: %s' % url)
self.log.info('PUT: %s' % url)
ret = requests.put(url, data=json.dumps(instance_group),
headers=self.headers)
if ret.status_code != 200 and ret.status_code != 204:
raise Exception(ret.text)
def delete_constraints(self):
self.headers['X-Auth-Token'] = self.session.get_token()
for instance_id in self.instance_constraints:
self.delete_remote_instance_constraints(instance_id)
self.delete_remote_group_constraints(self.nonha_group)
@ -208,73 +330,82 @@ class VNFManager(Thread):
def update_constraints(self):
while self.update_constraints_lock:
LOG.info('Waiting update_constraints_lock...')
self.log.info('Waiting update_constraints_lock...')
time.sleep(1)
self.update_constraints_lock = True
LOG.info('Update constraints')
if self.project_id is None:
self.project_id = self.ks.projects.list(name=self.project)[0].id
# Pods are grouped by ReplicaSet, so we use that id
rs = {r.metadata.name: r.metadata.uid for r in
self.kaapi.list_namespaced_replica_set('demo').items}
self.log.info('Update constraints')
# Nova does not support grouping instances that do not belong to
# anti-affinity server_groups. Anyhow, all instances need grouping
max_impacted_members = len(self.nonha_instances) - 1
nonha_group = {
"group_id": rs['demo-nonha'],
"group_id": self.nonha_group_id,
"project_id": self.project_id,
"group_name": "demo-nonha",
"group_name": "%s_nonha_app_group" % self.project,
"anti_affinity_group": False,
"max_instances_per_host": 0,
"max_impacted_members": max_impacted_members,
"recovery_time": 10,
"recovery_time": 2,
"resource_mitigation": True}
LOG.info('create demo-nonha constraints: %s'
% nonha_group)
self.log.info('create %s_nonha_app_group constraints: %s'
% (self.project, nonha_group))
ha_group = {
"group_id": rs['demo-ha'],
"group_id": self.ha_group_id,
"project_id": self.project_id,
"group_name": "demo-ha",
"group_name": "%s_ha_app_group" % self.project,
"anti_affinity_group": True,
"max_instances_per_host": 1,
"max_impacted_members": 1,
"recovery_time": 10,
"recovery_time": 4,
"resource_mitigation": True}
LOG.info('create demo-ha constraints: %s'
% ha_group)
self.log.info('create %s_ha_app_group constraints: %s'
% (self.project, ha_group))
if not self.ha_group or self.ha_group != ha_group:
LOG.info('ha instance group need update')
self.update_remote_group_constraints(ha_group)
self.ha_group = ha_group.copy()
if not self.nonha_group or self.nonha_group != nonha_group:
LOG.info('nonha instance group need update')
self.update_remote_group_constraints(nonha_group)
self.nonha_group = nonha_group.copy()
instance_constraints = {}
for ha_instance in self.ha_instances:
instance = {
"instance_id": ha_instance.metadata.uid,
"instance_id": ha_instance.id,
"project_id": self.project_id,
"group_id": ha_group["group_id"],
"instance_name": ha_instance.metadata.name,
"instance_name": ha_instance.name,
"max_interruption_time": 120,
"migration_type": "EVICTION",
"migration_type": "MIGRATE",
"resource_mitigation": True,
"lead_time": 40}
LOG.info('create ha instance constraints: %s' % instance)
instance_constraints[ha_instance.metadata.uid] = instance
self.log.info('create ha instance constraints: %s'
% instance)
instance_constraints[ha_instance.id] = instance
for nonha_instance in self.nonha_instances:
instance = {
"instance_id": nonha_instance.metadata.uid,
"instance_id": nonha_instance.id,
"project_id": self.project_id,
"group_id": nonha_group["group_id"],
"instance_name": nonha_instance.metadata.name,
"instance_name": nonha_instance.name,
"max_interruption_time": 120,
"migration_type": "EVICTION",
"migration_type": "MIGRATE",
"resource_mitigation": True,
"lead_time": 40}
LOG.info('create nonha instance constraints: %s' % instance)
instance_constraints[nonha_instance.metadata.uid] = instance
self.log.info('create nonha instance constraints: %s'
% instance)
instance_constraints[nonha_instance.id] = instance
if not self.instance_constraints:
# Initial instance constraints
LOG.info('create initial instances constraints...')
self.log.info('create initial instances constraints...')
for instance in [instance_constraints[i] for i
in instance_constraints]:
self.update_remote_instance_constraints(instance)
self.instance_constraints = instance_constraints.copy()
else:
LOG.info('check instances constraints changes...')
self.log.info('check instances constraints changes...')
added = [i for i in instance_constraints.keys()
if i not in self.instance_constraints]
deleted = [i for i in self.instance_constraints.keys()
@ -291,64 +422,55 @@ class VNFManager(Thread):
if updated or deleted:
# Some instance constraints have changed
self.instance_constraints = instance_constraints.copy()
if not self.ha_group or self.ha_group != ha_group:
LOG.info('ha instance group need update')
self.update_remote_group_constraints(ha_group)
self.ha_group = ha_group.copy()
if not self.nonha_group or self.nonha_group != nonha_group:
LOG.info('nonha instance group need update')
self.update_remote_group_constraints(nonha_group)
self.nonha_group = nonha_group.copy()
self.update_constraints_lock = False
def active_instance_id(self):
# We dictate the active instance in the beginning
instance = self.ha_instances[0]
LOG.info('Initially Active instance: %s %s' %
(instance.metadata.name, instance.metadata.uid))
name = instance.metadata.name
namespace = instance.metadata.namespace
body = {"metadata": {"labels": {"active": "True"}}}
self.kapi.patch_namespaced_pod(name, namespace, body)
self.active_instance_id = instance.metadata.uid
def switch_over_ha_instance(self, instance_id):
if instance_id == self.active_instance_id:
# Need to switchover as instance_id will be affected and is active
# Need to retry, as after the Heat template is done it takes time
# before the floating IP is in place
retry = 5
while retry > 0:
for instance in self.ha_instances:
if instance_id == instance.metadata.uid:
LOG.info('Active to Standby: %s %s' %
(instance.metadata.name, instance.metadata.uid))
name = instance.metadata.name
namespace = instance.metadata.namespace
# Setting the label to None removes it from the Pod
body = {"metadata": {"labels": {"active": None}}}
self.kapi.patch_namespaced_pod(name, namespace, body)
else:
LOG.info('Standby to Active: %s %s' %
(instance.metadata.name, instance.metadata.uid))
name = instance.metadata.name
namespace = instance.metadata.namespace
# Mark the Pod active with the same label used at start-up
body = {"metadata": {"labels": {"active": "True"}}}
self.kapi.patch_namespaced_pod(name, namespace, body)
self.active_instance_id = instance.metadata.uid
network_interfaces = next(iter(instance.addresses.values()))
for network_interface in network_interfaces:
_type = network_interface.get('OS-EXT-IPS:type')
if _type == "floating":
if not self.floating_ip:
self.floating_ip = network_interface.get('addr')
self.log.debug('active_instance: %s %s' %
(instance.name, instance.id))
return instance.id
time.sleep(2)
self.update_instances()
retry -= 1
raise Exception("No active instance found")
def switch_over_ha_instance(self):
for instance in self.ha_instances:
if instance.id != self.active_instance_id:
self.log.info('Switch over to: %s %s' % (instance.name,
instance.id))
# Deprecated, need to use neutron instead
# instance.add_floating_ip(self.floating_ip)
port = self.neutron.list_ports(device_id=instance.id)['ports'][0]['id'] # noqa
floating_id = self.neutron.list_floatingips(floating_ip_address=self.floating_ip)['floatingips'][0]['id'] # noqa
self.neutron.update_floatingip(floating_id, {'floatingip': {'port_id': port}}) # noqa
# Have to update ha_instances as floating_ip changed
self.update_instances()
self.active_instance_id = instance.id
break
def get_instance_ids(self):
instances = self.kapi.list_pod_for_all_namespaces().items
return [i.metadata.uid for i in instances
if i.metadata.name.startswith("demo-")
and i.metadata.namespace == "demo"]
ret = list()
for instance in self.nova.servers.list(detailed=False):
ret.append(instance.id)
return ret
def update_instances(self):
instances = self.kapi.list_pod_for_all_namespaces().items
instances = self.nova.servers.list(detailed=True)
self.ha_instances = [i for i in instances
if i.metadata.name.startswith("demo-ha")
and i.metadata.namespace == "demo"]
if "%s_ha_app_" % self.project in i.name]
self.nonha_instances = [i for i in instances
if i.metadata.name.startswith("demo-nonha")
and i.metadata.namespace == "demo"]
if "%s_nonha_app_" % self.project in i.name]
def _alarm_data_decoder(self, data):
if "[" in data or "{" in data:
@ -364,77 +486,38 @@ class VNFManager(Thread):
ret = requests.get(url, data=None, headers=self.headers)
if ret.status_code != 200:
raise Exception(ret.text)
LOG.info('get_instance_ids %s' % ret.json())
self.log.info('get_instance_ids %s' % ret.json())
return ret.json()['instance_ids']
def scale_instances(self, scale_instances):
def scale_instances(self, number_of_instances):
# number_of_instances_before = self.number_of_instances()
number_of_instances_before = len(self.nonha_instances)
replicas = number_of_instances_before + scale_instances
parameters = self.stack.parameters
parameters['nonha_intances'] = (number_of_instances_before +
number_of_instances)
self.stack.update(self.stack.stack_name,
self.stack.stack_id,
self.stack.template,
parameters=parameters,
files=self.stack.files)
# We only scale nonha apps
namespace = "demo"
name = "demo-nonha"
body = {'spec': {"replicas": replicas}}
self.kaapi.patch_namespaced_replica_set_scale(name, namespace, body)
time.sleep(3)
# Let's check if scale has taken effect
# number_of_instances_after = self.number_of_instances()
self.update_instances()
self.update_constraints()
number_of_instances_after = len(self.nonha_instances)
check = 20
while number_of_instances_after == number_of_instances_before:
if check == 0:
LOG.error('scale_instances with: %d failed, still %d instances'
% (scale_instances, number_of_instances_after))
raise Exception('scale_instances failed')
check -= 1
time.sleep(1)
self.update_instances()
number_of_instances_after = len(self.nonha_instances)
if (number_of_instances_before + number_of_instances !=
number_of_instances_after):
self.log.error('scale_instances with: %d from: %d ends up to: %d'
% (number_of_instances, number_of_instances_before,
number_of_instances_after))
raise Exception('scale_instances failed')
LOG.info('scaled instances from %d to %d' %
(number_of_instances_before, number_of_instances_after))
self.log.info('scaled nonha_intances from %d to %d' %
(number_of_instances_before,
number_of_instances_after))
def number_of_instances(self):
instances = self.kapi.list_pod_for_all_namespaces().items
return len([i for i in instances
if i.metadata.name.startswith("demo-")])
def instance_action(self, instance_id, allowed_actions):
# We should keep instance constraint in our internal structure
# and match instance_id specific allowed action. Now we assume EVICTION
if 'EVICTION' not in allowed_actions:
LOG.error('Action for %s not found from %s' %
(instance_id, allowed_actions))
return None
return 'EVICTION'
def instance_action_started(self, instance_id, action):
time_now = datetime.datetime.utcnow()
max_interruption_time = (
self.instance_constraints[instance_id]['max_interruption_time'])
self.pending_actions[instance_id] = {
'started': time_now,
'max_interruption_time': max_interruption_time,
'action': action}
def was_instance_action_in_time(self, instance_id):
time_now = datetime.datetime.utcnow()
started = self.pending_actions[instance_id]['started']
limit = self.pending_actions[instance_id]['max_interruption_time']
action = self.pending_actions[instance_id]['action']
td = time_now - started
if td.total_seconds() > limit:
LOG.error('%s %s took too long: %ds' %
(instance_id, action, td.total_seconds()))
LOG.error('%s max_interruption_time %ds might be too short' %
(instance_id, limit))
raise Exception('%s %s took too long: %ds' %
(instance_id, action, td.total_seconds()))
else:
LOG.info('%s %s with recovery time took %ds' %
(instance_id, action, td.total_seconds()))
del self.pending_actions[instance_id]
return len(self.nova.servers.list(detailed=False))
def run(self):
app = Flask('VNFM')
@ -447,85 +530,86 @@ class VNFManager(Thread):
except Exception:
payload = ({t[0]: t[2] for t in
data['reason_data']['event']['traits']})
LOG.error('cannot parse alarm data: %s' % payload)
self.log.error('cannot parse alarm data: %s' % payload)
raise Exception('VNFM cannot parse alarm.'
'Possibly trait data over 256 char')
LOG.info('VNFM received data = %s' % payload)
self.log.info('VNFM received data = %s' % payload)
state = payload['state']
reply_state = None
reply = dict()
LOG.info('VNFM state: %s' % state)
self.log.info('VNFM state: %s' % state)
if state == 'MAINTENANCE':
self.headers['X-Auth-Token'] = self.session.get_token()
instance_ids = (self.get_session_instance_ids(
payload['instance_ids'],
payload['session_id']))
reply['instance_ids'] = instance_ids
reply_state = 'ACK_MAINTENANCE'
my_instance_ids = self.get_instance_ids()
invalid_instances = (
[instance_id for instance_id in instance_ids
if instance_id not in my_instance_ids])
if invalid_instances:
self.log.error('Invalid instances: %s' % invalid_instances)
reply_state = 'NACK_MAINTENANCE'
else:
reply_state = 'ACK_MAINTENANCE'
elif state == 'SCALE_IN':
# scale down only nonha instances
nonha_instances = len(self.nonha_instances)
scale_in = nonha_instances / 2
self.scale_instances(-scale_in)
self.update_constraints()
reply['instance_ids'] = self.get_instance_ids()
# scale down by "self.scale" instances, that is, by VCPUs equaling
# at least a single compute node
self.scale_instances(-self.scale)
reply_state = 'ACK_SCALE_IN'
elif state == 'MAINTENANCE_COMPLETE':
# possibly need to upscale
number_of_instances = self.number_of_instances()
if self.orig_number_of_instances > number_of_instances:
scale_instances = (self.orig_number_of_instances -
number_of_instances)
self.scale_instances(scale_instances)
self.update_constraints()
self.scale_instances(self.scale)
reply_state = 'ACK_MAINTENANCE_COMPLETE'
elif (state == 'PREPARE_MAINTENANCE'
or state == 'PLANNED_MAINTENANCE'):
instance_id = payload['instance_ids'][0]
instance_action = (self.instance_action(instance_id,
payload['allowed_actions']))
if not instance_action:
raise Exception('Allowed_actions not supported for %s' %
instance_id)
elif state == 'PREPARE_MAINTENANCE':
# TBD from constraints
if "MIGRATE" not in payload['allowed_actions']:
raise Exception('MIGRATE not supported')
instance_ids = payload['instance_ids'][0]
self.log.info('VNFM got instance: %s' % instance_ids)
if instance_ids == self.active_instance_id:
self.switch_over_ha_instance()
# optional also in constraints
reply['instance_action'] = "MIGRATE"
reply_state = 'ACK_PREPARE_MAINTENANCE'
LOG.info('VNFM got instance: %s' % instance_id)
self.switch_over_ha_instance(instance_id)
reply['instance_action'] = instance_action
reply_state = 'ACK_%s' % state
self.instance_action_started(instance_id, instance_action)
elif state == 'PLANNED_MAINTENANCE':
# TBD from constraints
if "MIGRATE" not in payload['allowed_actions']:
raise Exception('MIGRATE not supported')
instance_ids = payload['instance_ids'][0]
self.log.info('VNFM got instance: %s' % instance_ids)
if instance_ids == self.active_instance_id:
self.switch_over_ha_instance()
# optional also in constraints
reply['instance_action'] = "MIGRATE"
reply_state = 'ACK_PLANNED_MAINTENANCE'
elif state == 'INSTANCE_ACTION_DONE':
# TBD was action done in max_interruption_time (live migration)
# NOTE, in EVICTION instance_id reported that was in evicted
# node. New instance_id might be different
LOG.info('%s' % payload['instance_ids'])
self.was_instance_action_in_time(payload['instance_ids'][0])
self.update_instances()
self.update_constraints()
# TBD was action done in allowed window
self.log.info('%s' % payload['instance_ids'])
else:
raise Exception('VNFM received event with'
' unknown state %s' % state)
if reply_state:
reply['session_id'] = payload['session_id']
self.headers['X-Auth-Token'] = self.session.get_token()
reply['state'] = reply_state
url = payload['reply_url']
LOG.info('VNFM reply: %s' % reply)
self.log.info('VNFM reply: %s' % reply)
requests.put(url, data=json.dumps(reply), headers=self.headers)
return 'OK'
@app.route('/shutdown', methods=['POST'])
def shutdown():
LOG.info('shutdown VNFM server at %s' % time.time())
self.log.info('shutdown VNFM server at %s' % time.time())
func = request.environ.get('werkzeug.server.shutdown')
if func is None:
raise RuntimeError('Not running with the Werkzeug Server')
@ -543,3 +627,5 @@ if __name__ == '__main__':
time.sleep(2)
except KeyboardInterrupt:
app_manager.stop()
except Exception:
app_manager.app.stack.delete()

561
fenix/tools/vnfm_k8s.py Normal file

@ -0,0 +1,561 @@
# Copyright (c) 2020 Nokia Corporation.
# All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
import aodhclient.client as aodhclient
import datetime
from flask import Flask
from flask import request
import json
from keystoneauth1 import loading
from keystoneclient import client as ks_client
from kubernetes import client
from kubernetes import config
import logging as lging
from oslo_config import cfg
from oslo_log import log as logging
import requests
import sys
from threading import Thread
import time
import yaml
try:
import fenix.utils.identity_auth as identity_auth
except ValueError:
sys.path.append('../utils')
import identity_auth
LOG = logging.getLogger(__name__)
streamlog = lging.StreamHandler(sys.stdout)
LOG.logger.addHandler(streamlog)
LOG.logger.setLevel(logging.INFO)
opts = [
cfg.StrOpt('ip',
default='127.0.0.1',
help='the ip of VNFM',
required=True),
cfg.IntOpt('port',
default=12348,
help='the port of VNFM',
required=True),
]
CONF = cfg.CONF
CONF.register_opts(opts)
CONF.register_opts(identity_auth.os_opts, group='service_user')
def get_identity_auth(conf, project=None, username=None, password=None):
loader = loading.get_plugin_loader('password')
return loader.load_from_options(
auth_url=conf.service_user.os_auth_url,
username=(username or conf.service_user.os_username),
password=(password or conf.service_user.os_password),
user_domain_name=conf.service_user.os_user_domain_name,
project_name=(project or conf.service_user.os_project_name),
tenant_name=(project or conf.service_user.os_project_name),
project_domain_name=conf.service_user.os_project_domain_name)
class VNFM(object):
def __init__(self, conf, log):
self.conf = conf
self.log = log
self.app = None
def start(self):
LOG.info('VNFM start......')
self.app = VNFManager(self.conf, self.log)
self.app.start()
def stop(self):
LOG.info('VNFM stop......')
if not self.app:
return
self.app.headers['X-Auth-Token'] = self.app.session.get_token()
self.app.delete_constraints()
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
}
url = 'http://%s:%d/shutdown'\
% (self.conf.ip,
self.conf.port)
requests.post(url, data='', headers=headers)
class VNFManager(Thread):
def __init__(self, conf, log):
Thread.__init__(self)
self.conf = conf
self.log = log
self.port = self.conf.port
self.intance_ids = None
# VNFM is started with OS_* exported as admin user
# We need that to query Fenix endpoint url
# Still we work with our tenant/project/vnf as demo
self.project = "demo"
LOG.info('VNFM project: %s' % self.project)
self.auth = identity_auth.get_identity_auth(conf, project=self.project)
self.session = identity_auth.get_session(auth=self.auth)
self.ks = ks_client.Client(version='v3', session=self.session)
self.aodh = aodhclient.Client(2, self.session)
# Subscribe to maintenance event alarm from Fenix via AODH
self.create_alarm()
config.load_kube_config()
self.kaapi = client.AppsV1Api()
self.kapi = client.CoreV1Api()
self.headers = {
'Content-Type': 'application/json',
'Accept': 'application/json'}
self.headers['X-Auth-Token'] = self.session.get_token()
self.orig_number_of_instances = self.number_of_instances()
# List of instances
self.ha_instances = []
self.nonha_instances = []
# Different instance_id specific constraints {instance_id: {},...}
self.instance_constraints = None
# Update existing instances to instance lists
self.update_instances()
# How many instances need to exist (with current VNF load)
# max_impacted_members needs to be updated accordingly
# if number of instances is scaled. example for demo-ha:
# max_impacted_members = len(self.ha_instances) - ha_group_limit
self.ha_group_limit = 2
self.nonha_group_limit = 2
# Different instance groups constraints dict
self.ha_group = None
self.nonha_group = None
auth = get_identity_auth(conf,
project='service',
username='fenix',
password='admin')
session = identity_auth.get_session(auth=auth)
keystone = ks_client.Client(version='v3', session=session)
# VNF project_id (VNF ID)
self.project_id = self.session.get_project_id()
# HA instance_id that is active has active label
self.active_instance_id = self.active_instance_id()
services = keystone.services.list()
for service in services:
if service.type == 'maintenance':
LOG.info('maintenance service: %s:%s type %s'
% (service.name, service.id, service.type))
maint_id = service.id
self.maint_endpoint = [ep.url for ep in keystone.endpoints.list()
if ep.service_id == maint_id and
ep.interface == 'public'][0]
LOG.info('maintenance endpoint: %s' % self.maint_endpoint)
self.update_constraints_lock = False
self.update_constraints()
# Instances waiting action to be done
self.pending_actions = {}
def create_alarm(self):
alarms = {alarm['name']: alarm for alarm in self.aodh.alarm.list()}
alarm_name = "%s_MAINTENANCE_ALARM" % self.project
if alarm_name in alarms:
return
alarm_request = dict(
name=alarm_name,
description=alarm_name,
enabled=True,
alarm_actions=[u'http://%s:%d/maintenance'
% (self.conf.ip,
self.conf.port)],
repeat_actions=True,
severity=u'moderate',
type=u'event',
event_rule=dict(event_type=u'maintenance.scheduled'))
self.aodh.alarm.create(alarm_request)
def delete_remote_instance_constraints(self, instance_id):
url = "%s/instance/%s" % (self.maint_endpoint, instance_id)
LOG.info('DELETE: %s' % url)
ret = requests.delete(url, data=None, headers=self.headers)
if ret.status_code != 200 and ret.status_code != 204:
if ret.status_code == 404:
LOG.info('Already deleted: %s' % instance_id)
else:
raise Exception(ret.text)
def update_remote_instance_constraints(self, instance):
url = "%s/instance/%s" % (self.maint_endpoint, instance["instance_id"])
LOG.info('PUT: %s' % url)
ret = requests.put(url, data=json.dumps(instance),
headers=self.headers)
if ret.status_code != 200 and ret.status_code != 204:
raise Exception(ret.text)
def delete_remote_group_constraints(self, instance_group):
url = "%s/instance_group/%s" % (self.maint_endpoint,
instance_group["group_id"])
LOG.info('DELETE: %s' % url)
ret = requests.delete(url, data=None, headers=self.headers)
if ret.status_code != 200 and ret.status_code != 204:
raise Exception(ret.text)
def update_remote_group_constraints(self, instance_group):
url = "%s/instance_group/%s" % (self.maint_endpoint,
instance_group["group_id"])
LOG.info('PUT: %s' % url)
ret = requests.put(url, data=json.dumps(instance_group),
headers=self.headers)
if ret.status_code != 200 and ret.status_code != 204:
raise Exception(ret.text)
def delete_constraints(self):
for instance_id in self.instance_constraints:
self.delete_remote_instance_constraints(instance_id)
self.delete_remote_group_constraints(self.nonha_group)
self.delete_remote_group_constraints(self.ha_group)
def update_constraints(self):
while self.update_constraints_lock:
LOG.info('Waiting update_constraints_lock...')
time.sleep(1)
self.update_constraints_lock = True
LOG.info('Update constraints')
# Pods are grouped by ReplicaSet, so we use that id
rs = {r.metadata.name: r.metadata.uid for r in
self.kaapi.list_namespaced_replica_set('demo').items}
max_impacted_members = len(self.nonha_instances) - 1
nonha_group = {
"group_id": rs['demo-nonha'],
"project_id": self.project_id,
"group_name": "demo-nonha",
"anti_affinity_group": False,
"max_instances_per_host": 0,
"max_impacted_members": max_impacted_members,
"recovery_time": 10,
"resource_mitigation": True}
LOG.info('create demo-nonha constraints: %s'
% nonha_group)
ha_group = {
"group_id": rs['demo-ha'],
"project_id": self.project_id,
"group_name": "demo-ha",
"anti_affinity_group": True,
"max_instances_per_host": 1,
"max_impacted_members": 1,
"recovery_time": 10,
"resource_mitigation": True}
LOG.info('create demo-ha constraints: %s'
% ha_group)
if not self.ha_group or self.ha_group != ha_group:
LOG.info('ha instance group need update')
self.update_remote_group_constraints(ha_group)
self.ha_group = ha_group.copy()
if not self.nonha_group or self.nonha_group != nonha_group:
LOG.info('nonha instance group need update')
self.update_remote_group_constraints(nonha_group)
self.nonha_group = nonha_group.copy()
instance_constraints = {}
for ha_instance in self.ha_instances:
instance = {
"instance_id": ha_instance.metadata.uid,
"project_id": self.project_id,
"group_id": ha_group["group_id"],
"instance_name": ha_instance.metadata.name,
"max_interruption_time": 120,
"migration_type": "EVICTION",
"resource_mitigation": True,
"lead_time": 40}
LOG.info('create ha instance constraints: %s' % instance)
instance_constraints[ha_instance.metadata.uid] = instance
for nonha_instance in self.nonha_instances:
instance = {
"instance_id": nonha_instance.metadata.uid,
"project_id": self.project_id,
"group_id": nonha_group["group_id"],
"instance_name": nonha_instance.metadata.name,
"max_interruption_time": 120,
"migration_type": "EVICTION",
"resource_mitigation": True,
"lead_time": 40}
LOG.info('create nonha instance constraints: %s' % instance)
instance_constraints[nonha_instance.metadata.uid] = instance
if not self.instance_constraints:
# Initial instance constraints
LOG.info('create initial instances constraints...')
for instance in [instance_constraints[i] for i
in instance_constraints]:
self.update_remote_instance_constraints(instance)
self.instance_constraints = instance_constraints.copy()
else:
LOG.info('check instances constraints changes...')
added = [i for i in instance_constraints.keys()
if i not in self.instance_constraints]
deleted = [i for i in self.instance_constraints.keys()
if i not in instance_constraints]
modified = [i for i in instance_constraints.keys()
if (i not in added and i not in deleted and
instance_constraints[i] !=
self.instance_constraints[i])]
for instance_id in deleted:
self.delete_remote_instance_constraints(instance_id)
updated = added + modified
for instance in [instance_constraints[i] for i in updated]:
self.update_remote_instance_constraints(instance)
if updated or deleted:
# Some instance constraints have changed
self.instance_constraints = instance_constraints.copy()
self.update_constraints_lock = False
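# A minimal standalone sketch of the added/deleted/modified comparison that
# update_constraints() performs above. The helper name diff_constraints and
# the sample dicts are hypothetical and are not part of Fenix or this commit.

```python
def diff_constraints(current, desired):
    """Compare two {instance_id: constraints} dicts.

    Returns three lists of instance ids: those only in `desired` (added),
    those only in `current` (deleted), and those present in both whose
    constraint dicts differ (modified).
    """
    added = [i for i in desired if i not in current]
    deleted = [i for i in current if i not in desired]
    modified = [i for i in desired
                if i in current and desired[i] != current[i]]
    return added, deleted, modified


# Hypothetical sample data: "a" disappears, "b" changes, "c" is new
current = {"a": {"lead_time": 40}, "b": {"lead_time": 40}}
desired = {"b": {"lead_time": 60}, "c": {"lead_time": 40}}
added, deleted, modified = diff_constraints(current, desired)
print(added, deleted, modified)
```

# Only the ids in added + modified would need a PUT and only the ids in
# deleted a DELETE, which is why the comparison is done before calling Fenix.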
def active_instance_id(self):
# We dictate the active instance in the beginning
instance = self.ha_instances[0]
LOG.info('Initially Active instance: %s %s' %
(instance.metadata.name, instance.metadata.uid))
name = instance.metadata.name
namespace = instance.metadata.namespace
body = {"metadata": {"labels": {"active": "True"}}}
self.kapi.patch_namespaced_pod(name, namespace, body)
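# Return the uid: __init__ assigns the result of this call to
# self.active_instance_id, so returning nothing would store None
return instance.metadata.uid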
def switch_over_ha_instance(self, instance_id):
if instance_id == self.active_instance_id:
# Need to switchover as instance_id will be affected and is active
for instance in self.ha_instances:
if instance_id == instance.metadata.uid:
LOG.info('Active to Standby: %s %s' %
(instance.metadata.name, instance.metadata.uid))
name = instance.metadata.name
namespace = instance.metadata.namespace
# Setting the label to None removes it from the Pod
body = {"metadata": {"labels": {"active": None}}}
self.kapi.patch_namespaced_pod(name, namespace, body)
else:
LOG.info('Standby to Active: %s %s' %
(instance.metadata.name, instance.metadata.uid))
name = instance.metadata.name
namespace = instance.metadata.namespace
# Mark the Pod active with the same label used at start-up
body = {"metadata": {"labels": {"active": "True"}}}
self.kapi.patch_namespaced_pod(name, namespace, body)
self.active_instance_id = instance.metadata.uid
self.update_instances()
def get_instance_ids(self):
instances = self.kapi.list_pod_for_all_namespaces().items
return [i.metadata.uid for i in instances
if i.metadata.name.startswith("demo-") and
i.metadata.namespace == "demo"]
def update_instances(self):
instances = self.kapi.list_pod_for_all_namespaces().items
self.ha_instances = [i for i in instances
if i.metadata.name.startswith("demo-ha") and
i.metadata.namespace == "demo"]
self.nonha_instances = [i for i in instances
if i.metadata.name.startswith("demo-nonha") and
i.metadata.namespace == "demo"]
def _alarm_data_decoder(self, data):
if "[" in data or "{" in data:
# string to list or dict removing unicode
data = yaml.safe_load(data.replace("u'", "'"))
return data
def _alarm_traits_decoder(self, data):
return ({str(t[0]): self._alarm_data_decoder(str(t[2]))
for t in data['reason_data']['event']['traits']})
def get_session_instance_ids(self, url, session_id):
ret = requests.get(url, data=None, headers=self.headers)
if ret.status_code != 200:
raise Exception(ret.text)
LOG.info('get_instance_ids %s' % ret.json())
return ret.json()['instance_ids']
def scale_instances(self, scale_instances):
number_of_instances_before = len(self.nonha_instances)
replicas = number_of_instances_before + scale_instances
# We only scale nonha apps
namespace = "demo"
name = "demo-nonha"
body = {'spec': {"replicas": replicas}}
self.kaapi.patch_namespaced_replica_set_scale(name, namespace, body)
time.sleep(3)
# Let's check if scale has taken effect
self.update_instances()
number_of_instances_after = len(self.nonha_instances)
check = 20
while number_of_instances_after == number_of_instances_before:
if check == 0:
LOG.error('scale_instances with: %d failed, still %d instances'
% (scale_instances, number_of_instances_after))
raise Exception('scale_instances failed')
check -= 1
time.sleep(1)
self.update_instances()
number_of_instances_after = len(self.nonha_instances)
LOG.info('scaled instances from %d to %d' %
(number_of_instances_before, number_of_instances_after))
def number_of_instances(self):
instances = self.kapi.list_pod_for_all_namespaces().items
return len([i for i in instances
if i.metadata.name.startswith("demo-")])
def instance_action(self, instance_id, allowed_actions):
# We should keep instance constraints in our internal structure
# and match the instance_id specific allowed action. Now we assume EVICTION
if 'EVICTION' not in allowed_actions:
LOG.error('Action for %s not found from %s' %
(instance_id, allowed_actions))
return None
return 'EVICTION'
def instance_action_started(self, instance_id, action):
time_now = datetime.datetime.utcnow()
max_interruption_time = (
self.instance_constraints[instance_id]['max_interruption_time'])
self.pending_actions[instance_id] = {
'started': time_now,
'max_interruption_time': max_interruption_time,
'action': action}
def was_instance_action_in_time(self, instance_id):
time_now = datetime.datetime.utcnow()
started = self.pending_actions[instance_id]['started']
limit = self.pending_actions[instance_id]['max_interruption_time']
action = self.pending_actions[instance_id]['action']
td = time_now - started
if td.total_seconds() > limit:
LOG.error('%s %s took too long: %ds' %
(instance_id, action, td.total_seconds()))
LOG.error('%s max_interruption_time %ds might be too short' %
(instance_id, limit))
raise Exception('%s %s took too long: %ds' %
(instance_id, action, td.total_seconds()))
else:
LOG.info('%s %s with recovery time took %ds' %
(instance_id, action, td.total_seconds()))
del self.pending_actions[instance_id]
def run(self):
app = Flask('VNFM')
@app.route('/maintenance', methods=['POST'])
def maintenance_alarm():
data = json.loads(request.data.decode('utf8'))
try:
payload = self._alarm_traits_decoder(data)
except Exception:
payload = ({t[0]: t[2] for t in
data['reason_data']['event']['traits']})
LOG.error('cannot parse alarm data: %s' % payload)
raise Exception('VNFM cannot parse alarm. '
'Possibly trait data over 256 char')
LOG.info('VNFM received data = %s' % payload)
state = payload['state']
reply_state = None
reply = dict()
LOG.info('VNFM state: %s' % state)
if state == 'MAINTENANCE':
self.headers['X-Auth-Token'] = self.session.get_token()
instance_ids = (self.get_session_instance_ids(
payload['instance_ids'],
payload['session_id']))
reply['instance_ids'] = instance_ids
reply_state = 'ACK_MAINTENANCE'
elif state == 'SCALE_IN':
# scale down only nonha instances
nonha_instances = len(self.nonha_instances)
# Integer division keeps the replica count an int on Python 3
scale_in = nonha_instances // 2
self.scale_instances(-scale_in)
self.update_constraints()
reply['instance_ids'] = self.get_instance_ids()
reply_state = 'ACK_SCALE_IN'
elif state == 'MAINTENANCE_COMPLETE':
# possibly need to upscale
number_of_instances = self.number_of_instances()
if self.orig_number_of_instances > number_of_instances:
scale_instances = (self.orig_number_of_instances -
number_of_instances)
self.scale_instances(scale_instances)
self.update_constraints()
reply_state = 'ACK_MAINTENANCE_COMPLETE'
elif (state == 'PREPARE_MAINTENANCE' or
state == 'PLANNED_MAINTENANCE'):
instance_id = payload['instance_ids'][0]
instance_action = (self.instance_action(instance_id,
payload['allowed_actions']))
if not instance_action:
raise Exception('Allowed_actions not supported for %s' %
instance_id)
LOG.info('VNFM got instance: %s' % instance_id)
self.switch_over_ha_instance(instance_id)
reply['instance_action'] = instance_action
reply_state = 'ACK_%s' % state
self.instance_action_started(instance_id, instance_action)
elif state == 'INSTANCE_ACTION_DONE':
# TBD: check the action was done within max_interruption_time
# (live migration)
# NOTE: with EVICTION the reported instance_id is the one that was on
# the evicted node. The new instance_id might be different
LOG.info('%s' % payload['instance_ids'])
self.was_instance_action_in_time(payload['instance_ids'][0])
self.update_instances()
self.update_constraints()
else:
raise Exception('VNFM received event with'
' unknown state %s' % state)
if reply_state:
reply['session_id'] = payload['session_id']
reply['state'] = reply_state
url = payload['reply_url']
LOG.info('VNFM reply: %s' % reply)
requests.put(url, data=json.dumps(reply), headers=self.headers)
return 'OK'
@app.route('/shutdown', methods=['POST'])
def shutdown():
LOG.info('shutdown VNFM server at %s' % time.time())
func = request.environ.get('werkzeug.server.shutdown')
if func is None:
raise RuntimeError('Not running with the Werkzeug Server')
func()
return 'VNFM shutting down...'
app.run(host="0.0.0.0", port=self.port)
if __name__ == '__main__':
app_manager = VNFM(CONF, LOG)
app_manager.start()
try:
LOG.info('Press CTRL + C to quit')
while True:
time.sleep(2)
except KeyboardInterrupt:
app_manager.stop()
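
The constraint synchronization at the top of this hunk sorts instances into added, deleted and modified sets before calling the remote update and delete methods. A minimal standalone sketch of that comparison follows, assuming plain dicts keyed by instance id; the helper name `diff_constraints` is hypothetical and not part of Fenix.

```python
def diff_constraints(old, new):
    """Return (added, deleted, modified) instance ids between two dicts."""
    added = [i for i in new if i not in old]
    deleted = [i for i in old if i not in new]
    # Only ids present on both sides can count as modified
    modified = [i for i in new if i in old and new[i] != old[i]]
    return added, deleted, modified


old = {"a": {"max_interruption_time": 120},
       "b": {"max_interruption_time": 60}}
new = {"b": {"max_interruption_time": 30},
       "c": {"max_interruption_time": 120}}

added, deleted, modified = diff_constraints(old, new)
print(added, deleted, modified)
```

Restricting the modified check to ids present in both dicts replaces the original `i not in added and i not in deleted` test while giving the same result.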

@ -94,7 +94,36 @@ class RPCClient(object):
class EngineEndpoint(object):
def __init__(self):
sessions = db_api.get_sessions()
self.workflow_sessions = {}
if sessions:
LOG.info("Initialize workflows from DB")
for session in sessions:
session_id = session.session_id
LOG.info("Session %s from DB" % session.session_id)
workflow = "fenix.workflow.workflows.%s" % session.workflow
LOG.info("Workflow plugin module: %s" % workflow)
try:
wf_plugin = getattr(import_module(workflow), 'Workflow')
self.workflow_sessions[session_id] = wf_plugin(CONF,
session_id,
None)
except ImportError:
session_dir = "%s/%s" % (CONF.local_cache_dir, session_id)
download_plugin_dir = session_dir + "/workflow/"
download_plugin_file = "%s/%s.py" % (download_plugin_dir,
session.workflow)
if os.path.isfile(download_plugin_file):
self.workflow_sessions[session_id] = (
source_loader_workflow_instance(
workflow,
download_plugin_file,
CONF,
session_id,
None))
else:
raise Exception('%s: could not find workflow plugin %s'
% (session_id, session.workflow))
def _validate_session(self, session_id):
if session_id not in self.workflow_sessions.keys():
@ -144,7 +173,7 @@ class EngineEndpoint(object):
data))
else:
raise Exception('%s: could not find workflow plugin %s' %
(self.session_id, data["workflow"]))
(session_id, data["workflow"]))
self.workflow_sessions[session_id].start()
return {"session_id": session_id}
@ -154,8 +183,23 @@ class EngineEndpoint(object):
if not self._validate_session(session_id):
return None
LOG.info("EngineEndpoint: admin_get_session")
return ({"session_id": session_id, "state":
self.workflow_sessions[session_id].session.state})
return {"session_id": session_id, "state":
self.workflow_sessions[session_id].session.state}
def admin_get_session_detail(self, ctx, session_id):
"""Get maintenance workflow session details"""
if not self._validate_session(session_id):
return None
LOG.info("EngineEndpoint: admin_get_session_detail")
sess = self.workflow_sessions[session_id]
return {"session_id": session_id,
"state": sess.session.state,
"percent_done": sess.session_report["last_percent"],
"session": sess.session,
"hosts": sess.hosts,
"instances": sess.instances,
"action_plugin_instances": db_api.get_action_plugin_instances(
session_id)}
def admin_delete_session(self, ctx, session_id):
"""Delete maintenance workflow session thread"""
@ -198,6 +242,7 @@ class EngineEndpoint(object):
session_obj = self.workflow_sessions[session_id]
project = session_obj.project(project_id)
project.state = data["state"]
db_api.update_project(project)
if "instance_actions" in data:
session_obj.proj_instance_actions[project_id] = (
data["instance_actions"].copy())
@ -212,6 +257,7 @@ class EngineEndpoint(object):
instance.project_state = data["state"]
if "instance_action" in data:
instance.action = data["instance_action"]
db_api.update_instance(instance)
return data
def get_instance(self, ctx, instance_id):
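
The new `EngineEndpoint.__init__` restores each session's workflow plugin by first importing it as a module and, on `ImportError`, falling back to a plugin file downloaded into the session cache directory. The sketch below illustrates that two-step lookup with the standard library only. The function `load_workflow_class`, the module name `hypothetical_pkg.workflows.custom` and the temporary plugin file are assumptions for illustration; Fenix itself uses its own `source_loader_workflow_instance` helper for the file case.

```python
import importlib
import importlib.util
import os
import tempfile


def load_workflow_class(module_name, fallback_file):
    """Return the `Workflow` class from a module, else from a plugin file."""
    try:
        return getattr(importlib.import_module(module_name), "Workflow")
    except ImportError:
        if not os.path.isfile(fallback_file):
            raise Exception("could not find workflow plugin %s" % module_name)
        # Load the downloaded plugin file as an ad hoc module
        spec = importlib.util.spec_from_file_location("downloaded_workflow",
                                                      fallback_file)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module.Workflow


# Demo: the module import fails, so the class comes from the file
plugin_dir = tempfile.mkdtemp()
plugin_file = os.path.join(plugin_dir, "custom.py")
with open(plugin_file, "w") as f:
    f.write("class Workflow(object):\n    name = 'custom'\n")

wf_class = load_workflow_class("hypothetical_pkg.workflows.custom",
                               plugin_file)
print(wf_class.name)
```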

@ -12,8 +12,10 @@
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
from fenix.db import api as db_api
from oslo_log import log as logging
import subprocess
import time
LOG = logging.getLogger(__name__)
@ -32,10 +34,12 @@ class ActionPlugin(object):
output = subprocess.check_output("echo Dummy running in %s" %
self.hostname,
shell=True)
time.sleep(1)
self.ap_dbi.state = "DONE"
except subprocess.CalledProcessError:
self.ap_dbi.state = "FAILED"
finally:
db_api.update_action_plugin_instance(self.ap_dbi)
LOG.debug("%s: OUTPUT: %s" % (self.wf.session_id, output))
LOG.info("%s: Dummy action plugin state: %s" % (self.wf.session_id,
self.ap_dbi.state))
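
The dummy action plugin now persists its final state in a `finally` clause, so the database is updated whether the command succeeds or fails. A minimal sketch of that pattern follows. The names `run_action` and `persist` are hypothetical stand-ins (`persist` plays the role of `db_api.update_action_plugin_instance`), and `output` is initialized up front so that later logging cannot hit an unbound variable when the command fails.

```python
import subprocess
import sys


def run_action(hostname, persist):
    """Run a dummy command and always persist the resulting state."""
    state = "FAILED"
    output = b""  # defined even if the command raises
    try:
        output = subprocess.check_output(
            [sys.executable, "-c",
             "import sys; print('Dummy running in ' + sys.argv[1])",
             hostname])
        state = "DONE"
    except subprocess.CalledProcessError:
        state = "FAILED"
    finally:
        # Stand-in for the database update in the plugin
        persist(state)
    return state, output


saved = []
state, output = run_action("host1", saved.append)
print(state, output)
```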

@ -34,31 +34,55 @@ LOG = logging.getLogger(__name__)
class BaseWorkflow(Thread):
def __init__(self, conf, session_id, data):
def __init__(self, conf, session_id, data=None):
# if data not set, we initialize from DB
Thread.__init__(self)
self.conf = conf
self.session_id = session_id
self.stopped = False
self.thg = threadgroup.ThreadGroup()
self.timer = {}
self.session = self._init_session(data)
if data:
self.session = self._init_session(data)
else:
self.session = db_api.get_session(session_id)
LOG.info('%s session from DB: %s' % (self.session_id,
self.session.state))
self.hosts = []
if "hosts" in data and data['hosts']:
if not data:
self.hosts = db_api.get_hosts(session_id)
elif "hosts" in data and data['hosts']:
# Hosts given as input, not to be discovered in workflow
self.hosts = self.init_hosts(self.convert(data['hosts']))
else:
LOG.info('%s: No hosts as input' % self.session_id)
if "actions" in data:
if not data:
self.actions = db_api.get_action_plugins(session_id)
elif "actions" in data:
self.actions = self._init_action_plugins(data["actions"])
else:
self.actions = []
if "download" in data:
if not data:
self.downloads = db_api.get_downloads(session_id)
elif "download" in data:
self.downloads = self._init_downloads(data["download"])
else:
self.downloads = []
self.projects = []
self.instances = []
if not data:
self.projects = db_api.get_projects(session_id)
else:
self.projects = []
if not data:
self.instances = db_api.get_instances(session_id)
else:
self.instances = []
self.proj_instance_actions = {}
self.states_methods = {'MAINTENANCE': 'maintenance',
@ -72,6 +96,7 @@ class BaseWorkflow(Thread):
self.url = "http://%s:%s" % (conf.host, conf.port)
self.auth = get_identity_auth(conf)
self.auth_session = get_session(auth=self.auth)
self.project_id = self.auth_session.get_project_id()
self.aodh = aodhclient.Client('2', self.auth_session)
transport = messaging.get_transport(self.conf)
self.notif_proj = messaging.Notifier(transport,
@ -84,6 +109,13 @@ class BaseWorkflow(Thread):
driver='messaging',
topics=['notifications'])
self.notif_admin = self.notif_admin.prepare(publisher_id='fenix')
self.notif_sess = messaging.Notifier(transport,
'maintenance.session',
driver='messaging',
topics=['notifications'])
self.notif_sess = self.notif_sess.prepare(publisher_id='fenix')
self.session_report = {'last_percent': 0, 'last_state': None}
def init_hosts(self, hostnames):
LOG.info('%s: init_hosts: %s' % (self.session_id, hostnames))
@ -174,6 +206,12 @@ class BaseWorkflow(Thread):
return [host.hostname for host in self.hosts if host.maintained and
host.type == host_type]
def get_maintained_percent(self):
maintained_hosts = float(len([host for host in self.hosts
if host.maintained]))
all_hosts = float(len(self.hosts))
return int(maintained_hosts / all_hosts * 100)
def get_disabled_hosts(self):
return [host for host in self.hosts if host.disabled]
@ -195,6 +233,7 @@ class BaseWorkflow(Thread):
if host_obj:
if len(host_obj) == 1:
host_obj[0].maintained = True
db_api.update_host(host_obj[0])
else:
raise Exception('host_maintained: %s has duplicate entries' %
hostname)
@ -230,8 +269,10 @@ class BaseWorkflow(Thread):
def set_projets_state(self, state):
for project in self.projects:
project.state = state
db_api.update_project(project)
for instance in self.instances:
instance.project_state = None
db_api.update_instance(instance)
def project_has_state_instances(self, project_id):
instances = ([instance.instance_id for instance in self.instances if
@ -254,11 +295,13 @@ class BaseWorkflow(Thread):
instance.project_state = state
else:
instance.project_state = None
db_api.update_instance(instance)
if state_instances:
some_project_has_instances = True
project.state = state
else:
project.state = None
db_api.update_project(project)
if not some_project_has_instances:
LOG.error('%s: No project has instances on hosts %s' %
(self.session_id, hosts))
@ -410,6 +453,10 @@ class BaseWorkflow(Thread):
# TBD we could notify admin for workflow state change
self.session.prev_state = self.session.state
self.session.state = state
self.session = db_api.update_session(self.session)
self._session_notify(state,
self.get_maintained_percent(),
self.session_id)
if state in ["MAINTENANCE_DONE", "MAINTENANCE_FAILED"]:
try:
statefunc = (getattr(self,
@ -481,14 +528,35 @@ class BaseWorkflow(Thread):
self.notif_proj.info({'some': 'context'}, 'maintenance.scheduled',
payload)
def _admin_notify(self, project, host, state, session_id):
payload = dict(project_id=project, host=host, state=state,
def _admin_notify(self, host, state, session_id):
payload = dict(project_id=self.project_id, host=host, state=state,
session_id=session_id)
LOG.info('Sending "maintenance.host": %s' % payload)
self.notif_admin.info({'some': 'context'}, 'maintenance.host', payload)
def _session_notify(self, state, percent_done, session_id):
# Threads race to send this message, so another thread may already
# have reported further progress. Never report a lower percentage
if self.session_report['last_percent'] > percent_done:
percent_done = self.session_report['last_percent']
if self.session_report['last_state'] == state:
return
else:
self.session_report['last_percent'] = percent_done
self.session_report['last_state'] = state
payload = dict(project_id=self.project_id,
state=state,
percent_done=percent_done,
session_id=session_id)
LOG.info('Sending "maintenance.session": %s' % payload)
self.notif_sess.info({'some': 'context'},
'maintenance.session',
payload)
def projects_answer(self, state, projects):
state_ack = 'ACK_%s' % state
state_nack = 'NACK_%s' % state
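
The `maintenance.session` event added in this file reports the percentage of maintained hosts and suppresses reports that carry nothing new. The sketch below shows one way to read that logic as standalone code. It is an interpretation and not the Fenix implementation: the diff rendering does not show indentation, so the exact grouping of the duplicate check is assumed, and the guard for an empty host list is an addition. `maintained_percent` and `SessionReporter` are hypothetical names.

```python
def maintained_percent(hosts):
    """Percent of hosts flagged maintained; 0 when there are no hosts."""
    if not hosts:
        return 0
    done = sum(1 for h in hosts if h["maintained"])
    return int(float(done) / len(hosts) * 100)


class SessionReporter(object):
    """Collects payloads instead of sending oslo.messaging notifications."""

    def __init__(self):
        self.last_percent = 0
        self.last_state = None
        self.sent = []

    def notify(self, state, percent_done, session_id):
        # Another thread may already have reported more progress
        percent_done = max(percent_done, self.last_percent)
        if state == self.last_state and percent_done == self.last_percent:
            return None  # nothing new to report
        self.last_percent = percent_done
        self.last_state = state
        payload = dict(state=state, percent_done=percent_done,
                       session_id=session_id)
        self.sent.append(payload)
        return payload


hosts = [{"maintained": True}, {"maintained": False},
         {"maintained": False}, {"maintained": False}]
reporter = SessionReporter()
reporter.notify("MAINTENANCE", maintained_percent(hosts), "s1")
reporter.notify("MAINTENANCE", 0, "s1")   # stale report, dropped
reporter.notify("MAINTENANCE", 50, "s1")  # progress, sent
print(reporter.sent)
```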

@ -140,6 +140,7 @@ class Workflow(BaseWorkflow):
host.type = 'controller'
continue
host.type = 'other'
db_api.update_host(host)
def disable_host_nova_compute(self, hostname):
LOG.info('%s: disable nova-compute on host %s' % (self.session_id,
@ -153,6 +154,7 @@ class Workflow(BaseWorkflow):
self.nova.services.disable_log_reason(hostname, "nova-compute",
"maintenance")
host.disabled = True
db_api.update_host(host)
def enable_host_nova_compute(self, hostname):
LOG.info('%s: enable nova-compute on host %s' % (self.session_id,
@ -165,6 +167,7 @@ class Workflow(BaseWorkflow):
(self.session_id, hostname))
self.nova.services.enable(hostname, "nova-compute")
host.disabled = False
db_api.update_host(host)
def get_compute_hosts(self):
return [host.hostname for host in self.hosts
@ -408,8 +411,8 @@ class Workflow(BaseWorkflow):
def get_free_vcpus_by_host(self, host, hvisors):
hvisor = ([h for h in hvisors if
h.__getattr__('hypervisor_hostname').split(".", 1)[0]
== host][0])
h.__getattr__(
'hypervisor_hostname').split(".", 1)[0] == host][0])
vcpus = hvisor.__getattr__('vcpus')
vcpus_used = hvisor.__getattr__('vcpus_used')
return vcpus - vcpus_used
@ -547,6 +550,7 @@ class Workflow(BaseWorkflow):
reply_at = None
state = "INSTANCE_ACTION_DONE"
instance.project_state = state
db_api.update_instance(instance)
metadata = "{}"
self._project_notify(project, instance_ids, allowed_actions,
actions_at, reply_at, state, metadata)
@ -561,6 +565,7 @@ class Workflow(BaseWorkflow):
project, instance.instance_id))
LOG.info('Action %s instance %s ' % (instance.action,
instance.instance_id))
db_api.update_instance(instance)
if instance.action == 'MIGRATE':
if not self.migrate_server(instance):
return False
@ -576,6 +581,12 @@ class Workflow(BaseWorkflow):
'%s not supported' %
(self.session_id, instance.instance_id,
instance.action))
server = self.nova.servers.get(instance.instance_id)
instance.host = (
str(server.__dict__.get('OS-EXT-SRV-ATTR:host')))
instance.state = server.__dict__.get('OS-EXT-STS:vm_state')
instance.action = None
db_api.update_instance(instance)
return self._wait_host_empty(host)
def _wait_host_empty(self, host):
@ -625,6 +636,7 @@ class Workflow(BaseWorkflow):
if instance.state == 'error':
LOG.error('instance %s live migration failed'
% server_id)
db_api.update_instance(instance)
return False
elif orig_vm_state != instance.state:
LOG.info('instance %s state changed: %s' % (server_id,
@ -632,6 +644,7 @@ class Workflow(BaseWorkflow):
elif host != orig_host:
LOG.info('instance %s live migrated to host %s' %
(server_id, host))
db_api.update_instance(instance)
return True
migration = (
self.nova.migrations.list(instance_uuid=server_id)[0])
@ -664,6 +677,7 @@ class Workflow(BaseWorkflow):
except Exception as e:
LOG.error('server %s live migration failed, Exception=%s' %
(server_id, e))
db_api.update_instance(instance)
return False
def migrate_server(self, instance):
@ -693,6 +707,7 @@ class Workflow(BaseWorkflow):
LOG.info('instance %s migration resized to host %s' %
(server_id, host))
instance.host = host
db_api.update_instance(instance)
return True
if last_vm_state != instance.state:
LOG.info('instance %s state changed: %s' % (server_id,
@ -701,6 +716,7 @@ class Workflow(BaseWorkflow):
LOG.error('instance %s migration failed, state: %s'
% (server_id, instance.state))
instance.host = host
db_api.update_instance(instance)
return False
time.sleep(5)
retries = retries - 1
@ -712,6 +728,7 @@ class Workflow(BaseWorkflow):
if retry_migrate == 0:
LOG.error('server %s migrate failed after retries' %
server_id)
db_api.update_instance(instance)
return False
# Might take time for scheduler to sync inconsistent instance
# list for host
@ -723,11 +740,13 @@ class Workflow(BaseWorkflow):
except Exception as e:
LOG.error('server %s migration failed, Exception=%s' %
(server_id, e))
db_api.update_instance(instance)
return False
finally:
retry_migrate = retry_migrate - 1
LOG.error('instance %s migration timeout, state: %s' %
(server_id, instance.state))
db_api.update_instance(instance)
return False
def maintenance_by_plugin_type(self, hostname, plugin_type):
@ -889,13 +908,11 @@ class Workflow(BaseWorkflow):
self.disable_host_nova_compute(compute)
for host in self.get_controller_hosts():
LOG.info('IN_MAINTENANCE controller %s' % host)
self._admin_notify(self.conf.service_user.os_project_name,
host,
self._admin_notify(host,
'IN_MAINTENANCE',
self.session_id)
self.host_maintenance(host)
self._admin_notify(self.conf.service_user.os_project_name,
host,
self._admin_notify(host,
'MAINTENANCE_COMPLETE',
self.session_id)
LOG.info('MAINTENANCE_COMPLETE controller %s' % host)
@ -908,13 +925,11 @@ class Workflow(BaseWorkflow):
self._wait_host_empty(host)
LOG.info('IN_MAINTENANCE compute %s' % host)
self._admin_notify(self.conf.service_user.os_project_name,
host,
self._admin_notify(host,
'IN_MAINTENANCE',
self.session_id)
self.host_maintenance(host)
self._admin_notify(self.conf.service_user.os_project_name,
host,
self._admin_notify(host,
'MAINTENANCE_COMPLETE',
self.session_id)
@ -929,13 +944,11 @@ class Workflow(BaseWorkflow):
self._wait_host_empty(host)
LOG.info('IN_MAINTENANCE host %s' % host)
self._admin_notify(self.conf.service_user.os_project_name,
host,
self._admin_notify(host,
'IN_MAINTENANCE',
self.session_id)
self.host_maintenance(host)
self._admin_notify(self.conf.service_user.os_project_name,
host,
self._admin_notify(host,
'MAINTENANCE_COMPLETE',
self.session_id)

@ -63,11 +63,12 @@ class Workflow(BaseWorkflow):
LOG.info("%s: initialized with Kubernetes: %s" %
(self.session_id,
v_api.get_code_with_http_info()[0].git_version))
self.hosts = self._init_hosts_by_services()
LOG.info('%s: Execute pre action plugins' % (self.session_id))
self.maintenance_by_plugin_type("localhost", "pre")
if not data:
self.hosts = db_api.get_hosts(session_id)
else:
self.hosts = self._init_hosts_by_services()
LOG.info('%s: Execute pre action plugins' % (self.session_id))
self.maintenance_by_plugin_type("localhost", "pre")
self.group_impacted_members = {}
def _init_hosts_by_services(self):
@ -106,6 +107,7 @@ class Workflow(BaseWorkflow):
body = {"apiVersion": "v1", "spec": {"unschedulable": True}}
self.kapi.patch_node(node_name, body)
host.disabled = True
db_api.update_host(host)
def uncordon(self, node_name):
LOG.info("%s: uncordon %s" % (self.session_id, node_name))
@ -113,6 +115,7 @@ class Workflow(BaseWorkflow):
body = {"apiVersion": "v1", "spec": {"unschedulable": None}}
self.kapi.patch_node(node_name, body)
host.disabled = False
db_api.update_host(host)
def _pod_by_id(self, pod_id):
return [p for p in self.kapi.list_pod_for_all_namespaces().items
@ -667,6 +670,7 @@ class Workflow(BaseWorkflow):
actions_at = reply_time_str(wait_time)
reply_at = actions_at
instance.project_state = state
db_api.update_instance(instance)
metadata = self.session.meta
retry = 2
replied = False
@ -737,6 +741,7 @@ class Workflow(BaseWorkflow):
reply_at = None
state = "INSTANCE_ACTION_DONE"
instance.project_state = state
db_api.update_instance(instance)
metadata = "{}"
self._project_notify(project, instance_ids, allowed_actions,
actions_at, reply_at, state, metadata)
@ -814,22 +819,24 @@ class Workflow(BaseWorkflow):
if host.type == "compute":
self._wait_host_empty(hostname)
LOG.info('IN_MAINTENANCE %s' % hostname)
self._admin_notify(self.conf.service_user.os_project_name,
hostname,
self._admin_notify(hostname,
'IN_MAINTENANCE',
self.session_id)
for plugin_type in ["host", host.type]:
LOG.info('%s: Execute %s action plugins' % (self.session_id,
plugin_type))
self.maintenance_by_plugin_type(hostname, plugin_type)
self._admin_notify(self.conf.service_user.os_project_name,
hostname,
self._admin_notify(hostname,
'MAINTENANCE_COMPLETE',
self.session_id)
if host.type == "compute":
self.uncordon(hostname)
LOG.info('MAINTENANCE_COMPLETE %s' % hostname)
host.maintained = True
db_api.update_host(host)
self._session_notify(self.session.state,
self.get_maintained_percent(),
self.session_id)
def maintenance(self):
LOG.info("%s: maintenance called" % self.session_id)
@ -919,6 +926,10 @@ class Workflow(BaseWorkflow):
return
for host_name in self.get_compute_hosts():
self.cordon(host_name)
for host in self.get_controller_hosts():
# TBD: one might need to change this. For now the maintenance of
# all controllers is serialized
self.host_maintenance(host)
thrs = []
for host_name in empty_hosts:
# LOG.info("%s: Maintaining %s" % (self.session_id, host_name))

@ -66,15 +66,20 @@ class Workflow(BaseWorkflow):
nova_version = max_nova_server_ver
self.nova = novaclient.Client(nova_version,
session=self.auth_session)
if not self.hosts:
if not data:
self.hosts = db_api.get_hosts(session_id)
elif not self.hosts:
self.hosts = self._init_hosts_by_services()
else:
self._init_update_hosts()
LOG.info("%s: initialized. Nova version %f" % (self.session_id,
nova_version))
LOG.info('%s: Execute pre action plugins' % (self.session_id))
self.maintenance_by_plugin_type("localhost", "pre")
if data:
# Run only for a new session. When initialized from DB we expect
# the pre action plugins have already been executed
LOG.info('%s: Execute pre action plugins' % (self.session_id))
self.maintenance_by_plugin_type("localhost", "pre")
# How many members of each instance group are currently affected
self.group_impacted_members = {}
@ -144,6 +149,7 @@ class Workflow(BaseWorkflow):
host.type = 'controller'
continue
host.type = 'other'
db_api.update_host(host)
def disable_host_nova_compute(self, hostname):
LOG.info('%s: disable nova-compute on host %s' % (self.session_id,
@ -157,6 +163,7 @@ class Workflow(BaseWorkflow):
self.nova.services.disable_log_reason(hostname, "nova-compute",
"maintenance")
host.disabled = True
db_api.update_host(host)
def enable_host_nova_compute(self, hostname):
LOG.info('%s: enable nova-compute on host %s' % (self.session_id,
@ -169,6 +176,7 @@ class Workflow(BaseWorkflow):
(self.session_id, hostname))
self.nova.services.enable(hostname, "nova-compute")
host.disabled = False
db_api.update_host(host)
def get_instance_details(self, instance):
network_interfaces = next(iter(instance.addresses.values()))
@ -413,17 +421,17 @@ class Workflow(BaseWorkflow):
prev_hostname = hostname
if free_vcpus >= vcpus:
# TBD vcpu capacity might be too scattered so moving instances from
# one host to other host still might not succeed. At least with
# one host to another host still might not succeed. At least with
# NUMA and CPU pinning, one should calculate and ask specific
# instances
# instances to be moved so can get empty host obeying pinning.
return False
else:
return True
def get_vcpus_by_host(self, host, hvisors):
hvisor = ([h for h in hvisors if
h.__getattr__('hypervisor_hostname').split(".", 1)[0]
== host][0])
h.__getattr__(
'hypervisor_hostname').split(".", 1)[0] == host][0])
vcpus = hvisor.__getattr__('vcpus')
vcpus_used = hvisor.__getattr__('vcpus_used')
return vcpus, vcpus_used
@ -535,6 +543,7 @@ class Workflow(BaseWorkflow):
actions_at = reply_time_str(wait_time)
reply_at = actions_at
instance.project_state = state
db_api.update_instance(instance)
metadata = self.session.meta
retry = 2
replied = False
@ -605,6 +614,7 @@ class Workflow(BaseWorkflow):
reply_at = None
state = "INSTANCE_ACTION_DONE"
instance.project_state = state
db_api.update_instance(instance)
metadata = "{}"
self._project_notify(project, instance_ids, allowed_actions,
actions_at, reply_at, state, metadata)
@ -697,6 +707,11 @@ class Workflow(BaseWorkflow):
% (instance.instance_id,
self.group_impacted_members[group_id],
max_parallel))
server = self.nova.servers.get(instance.instance_id)
instance.host = str(server.__dict__.get('OS-EXT-SRV-ATTR:host'))
instance.state = server.__dict__.get('OS-EXT-STS:vm_state')
instance.action = None
db_api.update_instance(instance)
@run_async
def actions_to_have_empty_host(self, host, state, target_host=None):
@ -759,6 +774,7 @@ class Workflow(BaseWorkflow):
if instance.state == 'error':
LOG.error('instance %s live migration failed'
% server_id)
db_api.update_instance(instance)
return False
elif orig_vm_state != instance.state:
LOG.info('instance %s state changed: %s' % (server_id,
@ -766,6 +782,7 @@ class Workflow(BaseWorkflow):
elif host != orig_host:
LOG.info('instance %s live migrated to host %s' %
(server_id, host))
db_api.update_instance(instance)
return True
migration = (
self.nova.migrations.list(instance_uuid=server_id)[0])
@ -775,6 +792,7 @@ class Workflow(BaseWorkflow):
'%d retries' %
(server_id,
self.conf.live_migration_retries))
db_api.update_instance(instance)
return False
# A live migration can fail fast right after the call.
# Give Nova time to be ready for the next live migration
@ -793,17 +811,20 @@ class Workflow(BaseWorkflow):
waited = waited + 1
last_migration_status = migration.status
last_vm_status = vm_status
db_api.update_instance(instance)
LOG.error('instance %s live migration did not finish in %ss, '
'state: %s' % (server_id, waited, instance.state))
except Exception as e:
LOG.error('server %s live migration failed, Exception=%s' %
(server_id, e))
db_api.update_instance(instance)
return False
def migrate_server(self, instance, target_host=None):
server_id = instance.instance_id
server = self.nova.servers.get(server_id)
instance.state = server.__dict__.get('OS-EXT-STS:vm_state')
orig_state = server.__dict__.get('OS-EXT-STS:vm_state')
instance.state = orig_state
orig_host = str(server.__dict__.get('OS-EXT-SRV-ATTR:host'))
LOG.info('migrate_server %s state %s host %s to %s' %
(server_id, instance.state, orig_host, target_host))
@ -823,7 +844,12 @@ class Workflow(BaseWorkflow):
server.confirm_resize()
LOG.info('instance %s migration resized to host %s' %
(server_id, host))
instance.host = host
server = self.nova.servers.get(server_id)
instance.host = (
str(server.__dict__.get('OS-EXT-SRV-ATTR:host')))
instance.state = (
server.__dict__.get('OS-EXT-STS:vm_state'))
db_api.update_instance(instance)
return True
if last_vm_state != instance.state:
LOG.info('instance %s state changed: %s' % (server_id,
@ -832,6 +858,7 @@ class Workflow(BaseWorkflow):
LOG.error('instance %s migration failed, state: %s'
% (server_id, instance.state))
instance.host = host
db_api.update_instance(instance)
return False
time.sleep(5)
retries = retries - 1
@ -843,6 +870,7 @@ class Workflow(BaseWorkflow):
if retry_migrate == 0:
LOG.error('server %s migrate failed after retries' %
server_id)
db_api.update_instance(instance)
return False
# Might take time for scheduler to sync inconsistent instance
# list for host.
@ -855,11 +883,13 @@ class Workflow(BaseWorkflow):
except Exception as e:
LOG.error('server %s migration failed, Exception=%s' %
(server_id, e))
db_api.update_instance(instance)
return False
finally:
retry_migrate = retry_migrate - 1
LOG.error('instance %s migration timeout, state: %s' %
(server_id, instance.state))
db_api.update_instance(instance)
return False
def maintenance_by_plugin_type(self, hostname, plugin_type):
@ -922,22 +952,24 @@ class Workflow(BaseWorkflow):
if host.type == "compute":
self._wait_host_empty(hostname)
LOG.info('IN_MAINTENANCE %s' % hostname)
self._admin_notify(self.conf.service_user.os_project_name,
hostname,
self._admin_notify(hostname,
'IN_MAINTENANCE',
self.session_id)
for plugin_type in ["host", host.type]:
LOG.info('%s: Execute %s action plugins' % (self.session_id,
plugin_type))
self.maintenance_by_plugin_type(hostname, plugin_type)
self._admin_notify(self.conf.service_user.os_project_name,
hostname,
self._admin_notify(hostname,
'MAINTENANCE_COMPLETE',
self.session_id)
if host.type == "compute":
self.enable_host_nova_compute(hostname)
LOG.info('MAINTENANCE_COMPLETE %s' % hostname)
host.maintained = True
db_api.update_host(host)
self._session_notify(self.session.state,
self.get_maintained_percent(),
self.session_id)
def maintenance(self):
LOG.info("%s: maintenance called" % self.session_id)