
add the devspecs to doc for tricircle

Add the devspecs docs to the Tricircle documentation to make them more
accessible for people who are interested in Tricircle.

Change-Id: Ia26c82f4c348664e0bdee4c36c9e13f8f3a8b7fc
Signed-off-by: song baisen <songbaisen@szzt.com.cn>
Co-Authored-By: tangzhuo <ztang@hnu.edu.cn>, zhiyuan_cai <luckyvega.g@gmail.com>
1. doc/source/devspecs/async_job_management.rst (+276)
2. doc/source/devspecs/cross-neutron-l2-networking.rst (+558)
3. doc/source/devspecs/cross-neutron-vxlan-networking.rst (+233)
4. doc/source/devspecs/devspecs-guide.rst (+18)
5. doc/source/devspecs/dynamic-pod-binding.rst (+236)
6. doc/source/devspecs/enhance-xjob-reliability.rst (+234)
7. doc/source/devspecs/index.rst (+8)
8. doc/source/devspecs/l3-networking-combined-bridge-net.rst (+554)
9. doc/source/devspecs/l3-networking-multi-NS-with-EW-enabled.rst (+393)
10. doc/source/devspecs/lbaas.rst (+185)
11. doc/source/devspecs/legacy_tables_clean.rst (+111)
12. doc/source/devspecs/local-neutron-plugin.rst (+214)
13. doc/source/devspecs/new-l3-networking-mulit-NS-with-EW.rst (+327)
14. doc/source/devspecs/quality-of-service.rst (+247)
15. doc/source/devspecs/resource_deleting.rst (+66)
16. doc/source/devspecs/smoke-test-engine.rst (+219)
17. doc/source/index.rst (+9)

doc/source/devspecs/async_job_management.rst

=========================================
Tricircle Asynchronous Job Management API
=========================================
Background
==========
In the Tricircle, XJob provides OpenStack multi-region functionality. It
receives and processes jobs from the Admin API or Tricircle Central
Neutron Plugin and handles them in an asynchronous way. For example, when
an instance is booted for the first time for a project, the router, security
group rules, FIP and other resources may not have been created yet in the
local Neutron(s). These resources can be created asynchronously to accelerate
the response to the initial instance boot request, unlike networks, subnets
and security groups, which must be created before an instance boots. Central
Neutron can send such creation jobs to local Neutron(s) through XJob, and the
local Neutron(s) then handle them at their own pace.
Implementation
==============
The XJob server may fail occasionally, so tenants and cloud administrators
need to know the job status and delete or redo a failed job if necessary.
The asynchronous job management APIs provide such functionality; they are
listed as follows:
* Create a job
Create a job to synchronize resource if necessary.
Create Job Request::
POST /v1.0/jobs
{
"job": {
"type": "port_delete",
"project_id": "d01246bc5792477d9062a76332b7514a",
"resource": {
"pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
"port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
}
}
}
Response:
{
"job": {
"id": "3f4ecf30-0213-4f1f-9cb0-0233bcedb767",
"project_id": "d01246bc5792477d9062a76332b7514a",
"type": "port_delete",
"timestamp": "2017-03-03 11:05:36",
"status": "NEW",
"resource": {
"pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
"port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
}
}
}
Normal Response Code: 202
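For illustration only, a minimal Python sketch of issuing this request is
shown below; the Admin API endpoint and the Keystone token are placeholders,
not values defined by this spec::

    import requests

    ADMIN_API = 'http://127.0.0.1:19999'   # assumed Tricircle Admin API endpoint
    TOKEN = '<keystone-token>'             # assumed valid token

    def create_port_delete_job(project_id, pod_id, port_id):
        """Ask XJob to delete a port asynchronously in the given pod."""
        body = {
            'job': {
                'type': 'port_delete',
                'project_id': project_id,
                'resource': {'pod_id': pod_id, 'port_id': port_id},
            }
        }
        resp = requests.post(ADMIN_API + '/v1.0/jobs', json=body,
                             headers={'X-Auth-Token': TOKEN})
        resp.raise_for_status()            # expect 202 Accepted
        return resp.json()['job']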
* Get a job
Retrieve a job from the Tricircle database.
The detailed information of the job will be shown; otherwise a
"Resource not found" exception will be returned.
List Request::
GET /v1.0/jobs/3f4ecf30-0213-4f1f-9cb0-0233bcedb767
Response:
{
"job": {
"id": "3f4ecf30-0213-4f1f-9cb0-0233bcedb767",
"project_id": "d01246bc5792477d9062a76332b7514a",
"type": "port_delete",
"timestamp": "2017-03-03 11:05:36",
"status": "NEW",
"resource": {
"pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
"port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
}
}
}
Normal Response Code: 200
* Get all jobs
Retrieve all of the jobs from the Tricircle database.
List Request::
GET /v1.0/jobs/detail
Response:
{
"jobs":
[
{
"id": "3f4ecf30-0213-4f1f-9cb0-0233bcedb767",
"project_id": "d01246bc5792477d9062a76332b7514a",
"type": "port_delete",
"timestamp": "2017-03-03 11:05:36",
"status": "NEW",
"resource": {
"pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
"port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
}
},
{
"id": "b01fe514-5211-4758-bbd1-9f32141a7ac2",
"project_id": "d01246bc5792477d9062a76332b7514a",
"type": "seg_rule_setup",
"timestamp": "2017-03-01 17:14:44",
"status": "FAIL",
"resource": {
"project_id": "d01246bc5792477d9062a76332b7514a"
}
}
]
}
Normal Response Code: 200
* Get all jobs with filter(s)
Retrieve job(s) from the Tricircle database. We can filter them by
project ID, job type and job status. If no filter is provided,
GET /v1.0/jobs will return all jobs.
The response contains a list of jobs. Using filters, a subset of jobs
will be returned.
List Request::
GET /v1.0/jobs?project_id=d01246bc5792477d9062a76332b7514a
Response:
{
"jobs":
[
{
"id": "3f4ecf30-0213-4f1f-9cb0-0233bcedb767",
"project_id": "d01246bc5792477d9062a76332b7514a",
"type": "port_delete",
"timestamp": "2017-03-03 11:05:36",
"status": "NEW",
"resource": {
"pod_id": "0eb59465-5132-4f57-af01-a9e306158b86",
"port_id": "8498b903-9e18-4265-8d62-3c12e0ce4314"
}
},
{
"id": "b01fe514-5211-4758-bbd1-9f32141a7ac2",
"project_id": "d01246bc5792477d9062a76332b7514a",
"type": "seg_rule_setup",
"timestamp": "2017-03-01 17:14:44",
"status": "FAIL",
"resource": {
"project_id": "d01246bc5792477d9062a76332b7514a"
}
}
]
}
Normal Response Code: 200
* Get all jobs' schemas
Retrieve all jobs' schemas. A user may want to know which resources are
needed for a specific job.
List Request::
GET /v1.0/jobs/schemas
return all jobs' schemas.
Response:
{
"schemas":
[
{
"type": "configure_route",
"resource": ["router_id"]
},
{
"type": "router_setup",
"resource": ["pod_id", "router_id", "network_id"]
},
{
"type": "port_delete",
"resource": ["pod_id", "port_id"]
},
{
"type": "seg_rule_setup",
"resource": ["project_id"]
},
{
"type": "update_network",
"resource": ["pod_id", "network_id"]
},
{
"type": "subnet_update",
"resource": ["pod_id", "subnet_id"]
},
{
"type": "shadow_port_setup",
"resource": [pod_id", "network_id"]
}
]
}
Normal Response Code: 200
* Delete a job
Delete a failed or duplicated job from the Tricircle database.
A pair of curly braces will be returned if the deletion succeeds; otherwise
an exception will be thrown. Moreover, we can list all the jobs to verify
whether it was deleted successfully.
Delete Job Request::
DELETE /v1.0/jobs/{id}
Response:
This operation does not return a response body.
Normal Response Code: 200
* Redo a job
Redo a halted job caused by XJob server corruption or network failures.
The job handler will redo a failed job after a time interval, but this Admin
API will redo a job immediately. Nothing is returned for this request,
but we can monitor its status through the job's execution state.
Redo Job Request::
PUT /v1.0/jobs/{id}
Response:
This operation does not return a response body.
Normal Response Code: 200
Data Model Impact
=================
In order to manage the jobs for each tenant, we need to filter them by
project ID, so the project ID will be added to both the AsyncJob model and
the AsyncJobLog model.
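As a rough illustration only (model and table names are assumptions, and the
real change would be done via a database migration), the addition might look
like this::

    import sqlalchemy as sa
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class AsyncJob(Base):                  # illustrative model only
        __tablename__ = 'async_jobs'
        id = sa.Column(sa.String(36), primary_key=True)
        type = sa.Column(sa.String(36))
        status = sa.Column(sa.String(36))
        resource = sa.Column(sa.Text)
        # new column so jobs can be filtered per tenant
        project_id = sa.Column(sa.String(36), index=True)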
Dependencies
============
None
Documentation Impact
====================
- Add documentation for asynchronous job management API
- Add release note for asynchronous job management API
References
==========
None

doc/source/devspecs/cross-neutron-l2-networking.rst

========================================
Cross Neutron L2 networking in Tricircle
========================================
Background
==========
The Tricircle provides unified OpenStack API gateway and networking automation
functionality. Those main functionalities allow cloud operators to manage
multiple OpenStack instances which are running in one site or multiple sites
as a single OpenStack cloud.
Each bottom OpenStack instance which is managed by the Tricircle is also called
a pod.
The Tricircle has the following components:
* Nova API-GW
* Cinder API-GW
* Neutron API Server with Neutron Tricircle plugin
* Admin API
* XJob
* DB
Nova API-GW provides the functionality to trigger automatic networking creation
when new VMs are being provisioned. The Neutron Tricircle plug-in provides
the functionality to create cross Neutron L2/L3 networking for new VMs. After
the binding of tenant-id and pod is finished in the Tricircle, Cinder API-GW
and Nova API-GW will pass the Cinder or Nova API request to the appropriate
bottom OpenStack instance.
Please refer to the Tricircle design blueprint[1], especially from
'7. Stateless Architecture Proposal', for the detailed description of each
component.
Problem Description
===================
When a user wants to create a network via the Neutron API Server, the user can
specify 'availability_zone_hints' (AZ or az is used as shorthand for
availability zone) during network creation[5]. In the Tricircle, 'az_hints'
means which AZ(s) the network should be spread into, which differs slightly
from the meaning of 'az_hints' in Neutron[5]. If no 'az_hints' is specified
during network creation, the created network can be spread into any AZ. If a
list of 'az_hints' is given during network creation, the network should be
able to be spread into the AZs suggested by that list.
When a user creates VM or Volume, there is also one parameter called
availability zone. The AZ parameter is used for Volume and VM co-location, so
that the Volume and VM will be created into same bottom OpenStack instance.
When a VM is being attached to a network, the Tricircle will check whether the
VM's AZ is inside the network's AZ scope. If it is not, the VM creation will
be rejected.
Currently, the Tricircle only supports one pod in one AZ, and only supports a
network associated with one AZ. That means a tenant's network will be presented
in only one bottom OpenStack instance, and therefore all VMs connected to the
network will be located in that one bottom OpenStack instance. If there is more
than one pod in one AZ, refer to the dynamic pod binding spec[6].
There are lots of use cases where a tenant needs a network being able to be
spread out into multiple bottom OpenStack instances in one AZ or multiple AZs.
* Capacity expansion: as tenants add more and more VMs, the capacity of one
OpenStack instance may not be enough, so a new OpenStack instance has to be
added to the cloud. But the tenant still wants to add new VMs into the same
network.
* Cross Neutron network service chaining. Service chaining is based on
the port-pairs. Leveraging the cross Neutron L2 networking capability which
is provided by the Tricircle, the chaining could also be done across sites.
For example, with vRouter1 in pod1 and vRouter2 in pod2, these two VMs could
be chained.
* Applications are often required to run in different availability zones to
achieve high availability. An application needs to be designed as
Active-Standby/Active-Active/N-Way to achieve high availability, and some
components inside one application are designed to work as a distributed
cluster; this design typically leads to state replication or heartbeat
among application components (directly, via replicated database services,
or via a privately designed message format). When this kind of application
is deployed across multiple OpenStack instances, cross Neutron L2 networking
is needed to support heartbeat or state replication.
* When a tenant's VMs are provisioned in different OpenStack instances, there
is E-W (East-West) traffic between these VMs. The E-W traffic should only be
visible to the tenant, so isolation is needed. If the traffic goes through
N-S (North-South) paths via tenant-level VPNs, the overhead is too high, and
the orchestration of multiple site-to-site VPN connections is also
complicated. Therefore cross Neutron L2 networking to bridge the tenant's
routers in different Neutron servers can provide more lightweight isolation.
* In a hybrid cloud, there is a cross-L2-networking requirement between the
private OpenStack and the public OpenStack. Cross Neutron L2 networking will
help VM migration in this case, since it is not necessary to change the
IP/MAC/security group configuration during VM migration.
The spec[5] is to explain how one AZ can support more than one pod, and how
to schedule a proper pod during VM or Volume creation.
And this spec is to deal with the cross Neutron L2 networking automation in
the Tricircle.
The simplest way to spread L2 networking out to multiple OpenStack instances
is to use the same VLAN. But this has a lot of limitations: (1) the number of
VLAN segments is limited, (2) the VLAN network itself is not well suited to
spanning multiple sites, although you can use some gateways to do the same
thing. So flexible tenant-level L2 networking across multiple Neutron servers
in one site or in multiple sites is needed.
Proposed Change
===============
Cross Neutron L2 networking can be divided into three categories,
``VLAN``, ``Shared VxLAN`` and ``Mixed VLAN/VxLAN``.
* VLAN
Network in each bottom OpenStack is VLAN type and has the same VLAN ID.
If we want VLAN L2 networking to work in multi-site scenario, i.e.,
Multiple OpenStack instances in multiple sites, physical gateway needs to
be manually configured to make one VLAN networking be extended to other
sites.
*Manual setup physical gateway is out of the scope of this spec*
* Shared VxLAN
Network in each bottom OpenStack instance is VxLAN type and has the same
VxLAN ID.
Leverage L2GW[2][3] to implement this type of L2 networking.
* Mixed VLAN/VxLAN
Network in each bottom OpenStack instance may have different types and/or
have different segment IDs.
Leverage L2GW[2][3] to implement this type of L2 networking.
There is another network type called “Local Network”. For “Local Network”,
the network will be only presented in one bottom OpenStack instance. And the
network won't be presented in different bottom OpenStack instances. If a VM
in another pod tries to attach to the “Local Network”, the attachment should
fail. This use case is quite useful for scenarios in which cross Neutron L2
networking is not required and one AZ will not include more than one bottom
OpenStack instance.
Cross Neutron L2 networking can be established dynamically while a tenant's VM
is being provisioned.
There is an assumption here that only one type of L2 networking will be used
in one cloud deployment.
A Cross Neutron L2 Networking Creation
--------------------------------------
A cross Neutron L2 network can be created using the az_hint attribute of the
network. If az_hint includes one or more AZs, the network will be presented
only in this AZ or these AZs; if there is no AZ in az_hint, the network can be
extended to any bottom OpenStack.
There is a special use case for external network creation. For external
network creation, you need to specify the pod_id but not AZ in the az_hint
so that the external network will be only created in one specified pod per AZ.
*Support of External network in multiple OpenStack instances in one AZ
is out of scope of this spec.*
Pluggable L2 networking framework is proposed to deal with three types of
L2 cross Neutron networking, and it should be compatible with the
``Local Network``.
1. Type Driver under Tricircle Plugin in Neutron API server
* Type driver to distinguish different types of cross Neutron L2 networking.
The Tricircle plugin needs to load the type driver according to the
configuration. The Tricircle can reuse the ML2 type drivers with updates.
* Type driver to allocate VLAN segment id for VLAN L2 networking.
* Type driver to allocate VxLAN segment id for shared VxLAN L2 networking.
* Type driver for mixed VLAN/VxLAN to allocate VxLAN segment id for the
network connecting L2GWs[2][3].
* Type driver for Local Network only updating ``network_type`` for the
network to the Tricircle Neutron DB.
When a network creation request is received in Neutron API Server in the
Tricircle, the type driver will be called based on the configured network
type.
2. Nova API-GW to trigger the bottom networking automation
Nova API-GW is aware that a new VM is being provisioned when a boot-VM API
request is received, therefore Nova API-GW is responsible for the network
creation in the bottom OpenStack instances.
Nova API-GW needs to get the network type from Neutron API server in the
Tricircle, and deal with the networking automation based on the network type:
* VLAN
Nova API-GW creates network in bottom OpenStack instance in which the VM will
run with the VLAN segment id, network name and type that are retrieved from
the Neutron API server in the Tricircle.
* Shared VxLAN
Nova API-GW creates the network in the bottom OpenStack instance in which the
VM will run, with the VxLAN segment id, network name and type retrieved from
the Tricircle Neutron API server. After the network in the bottom OpenStack
instance is created successfully, Nova API-GW needs to add this bottom network
as one of the segments of the network in the Tricircle.
* Mixed VLAN/VxLAN
Nova API-GW creates networks in the different bottom OpenStack instances in
which the VM will run, with the VLAN or VxLAN segment id respectively, and the
network name and type retrieved from the Tricircle Neutron API server. After
the networks in the bottom OpenStack instances are created successfully, Nova
API-GW needs to update the network in the Tricircle with the segmentation
information of the bottom networks.
3. L2GW driver under Tricircle Plugin in Neutron API server
Tricircle plugin needs to support multi-segment network extension[4].
For Shared VxLAN or Mixed VLAN/VxLAN L2 network type, L2GW driver will utilize the
multi-segment network extension in Neutron API server to build the L2 network in the
Tricircle. Each network in the bottom OpenStack instance will be a segment for the
whole cross Neutron L2 networking in the Tricircle.
After the network in the bottom OpenStack instance is created successfully,
Nova API-GW will call the Neutron server API to update the network in the
Tricircle with a new segment taken from the network in the bottom OpenStack
instance.
If the network in the bottom OpenStack instance is removed successfully, Nova
API-GW will call the Neutron server API to remove that segment from the
network in the Tricircle.
When L2GW driver under Tricircle plugin in Neutron API server receives the
segment update request, L2GW driver will start async job to orchestrate L2GW API
for L2 networking automation[2][3].
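The segment update described above could, purely for illustration, be
performed roughly as follows with python-neutronclient, assuming the central
Neutron accepts the multi-provider ``segments`` attribute on network update;
the credentials and the physical_network encoding below are placeholders::

    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(
        username='admin', password='password', tenant_name='admin',
        auth_url='http://127.0.0.1:5000/v2.0')    # placeholder credentials

    def add_bottom_segment(top_net_id, existing_segments,
                           net_type, seg_id, bottom_pod_id):
        """Append the newly created bottom network as one more segment."""
        new_segment = {
            'provider:network_type': net_type,          # e.g. 'vxlan'
            'provider:segmentation_id': seg_id,
            # encode the bottom pod as proposed in the data model section
            'provider:physical_network': 'phys_net#%s' % bottom_pod_id,
        }
        body = {'network': {'segments': existing_segments + [new_segment]}}
        return neutron.update_network(top_net_id, body)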
Data model impact
-----------------
In the database, we are considering setting physical_network in the top
OpenStack instance to ``bottom_physical_network#bottom_pod_id`` to distinguish
segmentation information in different bottom OpenStack instances.
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
----------------
**Local Network Implementation**
For Local Network, L2GW is not required. In this scenario, no cross Neutron L2/L3
networking is required.
A user creates network ``Net1`` with single AZ1 in az_hint, the Tricircle plugin
checks the configuration, if ``tenant_network_type`` equals ``local_network``,
it will invoke Local Network type driver. Local Network driver under the
Tricircle plugin will update ``network_type`` in database.
For example, a user creates VM1 in AZ1 which has only one pod ``POD1``, and
connects it to network ``Net1``. ``Nova API-GW`` will send network creation
request to ``POD1`` and the VM will be booted in AZ1 (There should be only one
pod in AZ1).
If a user wants to create VM2 in AZ2 or in ``POD2`` of AZ1, and connect it to
network ``Net1`` in the Tricircle, the request will fail, because ``Net1`` is a
local_network type network and is limited to being present in ``POD1`` in AZ1
only.
**VLAN Implementation**
For VLAN, L2GW is not required. This is the simplest form of cross Neutron
L2 networking, suitable for limited scenarios. For example, with a small
number of networks, all VLANs are extended through a physical gateway to
support cross Neutron VLAN networking, or all Neutron servers under the same
core switch, with the same visible VLAN ranges supported by that core switch,
are connected by the core switch.
When a user creates a network called ``Net1``, the Tricircle plugin checks the
configuration. If ``tenant_network_type`` equals ``vlan``, the Tricircle will
invoke the VLAN type driver. The VLAN driver will allocate the ``segment``,
assign ``network_type`` as VLAN, and update ``segment``, ``network_type`` and
``physical_network`` in the DB.
A user creates VM1 in AZ1, and connects it to network Net1. If VM1 will be
booted in ``POD1``, ``Nova API-GW`` needs to get the network information and
send network creation message to ``POD1``. Network creation message includes
``network_type`` and ``segment`` and ``physical_network``.
Then the user creates VM2 in AZ2, and connects it to network Net1. If VM will
be booted in ``POD2``, ``Nova API-GW`` needs to get the network information and
send create network message to ``POD2``. Create network message includes
``network_type`` and ``segment`` and ``physical_network``.
**Shared VxLAN Implementation**
A user creates network ``Net1``. The Tricircle plugin checks the
configuration; if ``tenant_network_type`` equals ``shared_vxlan``, it will
invoke the shared VxLAN driver. The shared VxLAN driver will allocate the
``segment``, assign ``network_type`` as VxLAN, and update the network with
``segment`` and ``network_type`` in the DB.
A user creates VM1 in AZ1, and connects it to network ``Net1``. If VM1 will be
booted in ``POD1``, ``Nova API-GW`` needs to get the network information and send
create network message to ``POD1``, create network message includes
``network_type`` and ``segment``.
``Nova API-GW`` should update ``Net1`` in the Tricircle with the segment
information returned by ``POD1``.
Then the user creates VM2 in AZ2, and connects it to network ``Net1``. If VM2
will be booted in ``POD2``, ``Nova API-GW`` needs to get the network
information and send a network creation message to ``POD2``; the network
creation message includes ``network_type`` and ``segment``.
``Nova API-GW`` should update ``Net1`` in the Tricircle with the segment
information returned by ``POD2``.
The Tricircle plugin detects that the network includes more than one segment
network, calls L2GW driver to start async job for cross Neutron networking for
``Net1``. The L2GW driver will create L2GW1 in ``POD1`` and L2GW2 in ``POD2``. In
``POD1``, L2GW1 will connect the local ``Net1`` and create L2GW remote connection
to L2GW2, then populate the information of MAC/IP which resides in L2GW1. In
``POD2``, L2GW2 will connect the local ``Net1`` and create L2GW remote connection
to L2GW1, then populate remote MAC/IP information which resides in ``POD1`` in L2GW2.
The L2GW driver in the Tricircle will also detect new port creation/deletion
API requests. If a port (MAC/IP) is created or deleted in ``POD1`` or
``POD2``, the MAC/IP information in the L2GW of the other pod needs to be
refreshed.
Whether to populate port (MAC/IP) information should be configurable according
to the L2GW capability, and MAC/IP information should only be populated for
ports that do not reside in the same pod.
**Mixed VLAN/VxLAN**
To achieve cross Neutron L2 networking, L2GW will be used to connect L2 network
in different Neutron servers, using L2GW should work for Shared VxLAN and Mixed VLAN/VxLAN
scenario.
When an L2GW is connected to a local network in the same OpenStack instance,
no matter whether it is VLAN, VxLAN or GRE, the L2GW should be able to connect
to the local network, and because L2GW is an extension of Neutron, the network
UUID alone should be enough for the L2GW to connect to the local network.
When an admin user creates a network in the Tricircle, he/she specifies the
network type as one of the network types discussed above. In the phase of
creating the network in the Tricircle, only one record is saved in the
database; no network is created in the bottom OpenStack.
After the network in the bottom pod is created successfully, the network
information such as segment id, network name and network type needs to be
retrieved, and this bottom network is added as one of the segments of the
network in the Tricircle.
In the Tricircle, a network could be created by a tenant or an admin. A tenant
has no way to specify the network type and segment id, so the default network
type will be used instead. When the user uses the network to boot a VM, ``Nova API-GW``
checks the network type. For Mixed VLAN/VxLAN network, ``Nova API-GW`` first
creates network in bottom OpenStack without specifying network type and segment
ID, then updates the top network with bottom network segmentation information
returned by bottom OpenStack.
A user creates network ``Net1``. The plugin checks the configuration; if
``tenant_network_type`` equals ``mixed_vlan_vxlan``, it will invoke the mixed
VLAN and VxLAN driver. The driver needs to do nothing, since the segment is
allocated in the bottom pod.
A user creates VM1 in AZ1, and connects it to the network ``Net1``, the VM is
booted in bottom ``POD1``, and ``Nova API-GW`` creates network in ``POD1`` and
queries the network detail segmentation information (using admin role), and
gets network type, segment id, then updates this new segment to the ``Net1``
in Tricircle ``Neutron API Server``.
Then the user creates another VM2 with AZ info AZ2, so the VM should be booted
in bottom ``POD2`` which is located in AZ2. When VM2 is booted in AZ2,
``Nova API-GW`` also creates a network in ``POD2``, queries the network
information including segment and network type, and updates ``Net1`` in the
Tricircle ``Neutron API Server`` with this new segment.
The Tricircle plugin detects that the ``Net1`` includes more than one network
segments, calls L2GW driver to start async job for cross Neutron networking for
``Net1``. The L2GW driver will create L2GW1 in ``POD1`` and L2GW2 in ``POD2``. In
``POD1``, L2GW1 will connect the local ``Net1`` and create L2GW remote connection
to L2GW2, then populate information of MAC/IP which resides in ``POD2`` in L2GW1.
In ``POD2``, L2GW2 will connect the local ``Net1`` and create L2GW remote connection
to L2GW1, then populate remote MAC/IP information which resides in ``POD1`` in L2GW2.
The L2GW driver in the Tricircle will also detect new port creation/deletion
API calls. If a port (MAC/IP) is created or deleted in ``POD1``, then the
L2GW2 MAC/IP information needs to be refreshed; if a port (MAC/IP) is created
or deleted in ``POD2``, then the L2GW1 MAC/IP information needs to be
refreshed.
Whether to populate MAC/IP information should be configurable according to the
L2GW capability, and MAC/IP information should only be populated for ports
that do not reside in the same pod.
**L3 bridge network**
Current implementation without cross Neutron L2 networking.
* A special bridge network is created and connected to the routers in
different bottom OpenStack instances. We configure the extra routes of the routers
to route the packets from one OpenStack to another. In current
implementation, we create this special bridge network in each bottom
OpenStack with the same ``VLAN ID``, so we have an L2 network to connect
the routers.
Differences between L2 networking for a tenant's VMs and for the L3 bridging
network:
* The creation of bridge network is triggered during attaching router
interface and adding router external gateway.
* The L2 network for a VM is triggered by ``Nova API-GW`` when a VM is to be
created in one pod and it finds that there is no such network there; the
network will then be created before the VM is booted, since a network or port
parameter is required to boot a VM. The IP/MAC for the VM is allocated in the
``Tricircle`` top layer, to avoid IP/MAC collisions that could occur if they
were allocated separately in the bottom pods.
After cross Neutron L2 networking is introduced, the L3 bridge network should
be updated too.
L3 bridge network N-S (North-South):
* For each tenant, one cross Neutron N-S bridge network should be created for
router N-S inter-connection. Just replace the current VLAN N-S bridge network
with the corresponding Shared VxLAN or Mixed VLAN/VxLAN network.
L3 bridge network E-W (East-West):
* When a router interface is attached, for VLAN the current process of
establishing the E-W bridge network is kept. For Shared VxLAN and Mixed
VLAN/VxLAN, if an L2 network is able to expand to the current pod, then the
L2 network is simply expanded to that pod; all E-W traffic will go out via the
local L2 network, so no bridge network is needed.
* For example, (Net1, Router1) in ``Pod1``, (Net2, Router1) in ``Pod2``, if
``Net1`` is a cross Neutron L2 network and can be expanded to Pod2, then
``Net1`` is simply expanded to Pod2. After the ``Net1`` expansion (just like
cross Neutron L2 networking spreading one network across multiple Neutron
servers), it will look like (Net1, Router1) in ``Pod1`` and (Net1, Net2,
Router1) in ``Pod2``. In ``Pod2`` there is no VM in ``Net1``; it is only used
for E-W traffic. Now the E-W traffic will look like this:
from Net2 to Net1:
Net2 in Pod2 -> Router1 in Pod2 -> Net1 in Pod2 -> L2GW in Pod2 ---> L2GW in
Pod1 -> Net1 in Pod1.
Note: The traffic from ``Net1`` in ``Pod2`` to ``Net1`` in ``Pod1`` can bypass
the L2GW in ``Pod2``; that means outbound traffic can bypass the local L2GW if
the remote VTEP of the L2GW is known to the local compute node and the packet
from the local compute node with VxLAN encapsulation could be routed to the
remote L2GW directly. This is up to the L2GW implementation. With inbound
traffic going through the L2GW, the inbound traffic to the VM will not be
impacted by VM migration from one host to another.
If ``Net2`` is a cross Neutron L2 network and can be expanded to ``Pod1`` too,
then ``Net2`` is simply expanded to ``Pod1``. After the ``Net2`` expansion
(just like cross Neutron L2 networking spreading one network across multiple
Neutron servers), it will look like (Net2, Net1, Router1) in ``Pod1`` and
(Net1, Net2, Router1) in ``Pod2``. In ``Pod1`` there is no VM in Net2; it is
only used for E-W traffic. Now the E-W traffic will look like this: from
``Net1`` to ``Net2``:
Net1 in Pod1 -> Router1 in Pod1 -> Net2 in Pod1 -> L2GW in Pod1 ---> L2GW in
Pod2 -> Net2 in Pod2.
To limit the complexity, a network's az_hint can only be specified at creation
time, and no update is allowed; if the az_hint needs to be updated, you have
to delete the network and create it again.
If the network can't be expanded, then E-W bridge network is needed. For
example, Net1(AZ1, AZ2,AZ3), Router1; Net2(AZ4, AZ5, AZ6), Router1.
Then a cross Neutron L2 bridge network has to be established:
Net1(AZ1, AZ2, AZ3), Router1 --> E-W bridge network ---> Router1,
Net2(AZ4, AZ5, AZ6).
Assignee(s)
------------
Primary assignee:
Other contributors:
Work Items
------------
Dependencies
----------------
None
Testing
----------------
None
References
----------------
[1] https://docs.google.com/document/d/18kZZ1snMOCD9IQvUKI5NVDzSASpw-QKj7l2zNqMEd3g/
[2] https://review.openstack.org/#/c/270786/
[3] https://github.com/openstack/networking-l2gw/blob/master/specs/kilo/l2-gateway-api.rst
[4] http://developer.openstack.org/api-ref-networking-v2-ext.html#networks-multi-provider-ext
[5] http://docs.openstack.org/mitaka/networking-guide/adv-config-availability-zone.html
[6] https://review.openstack.org/#/c/306224/

doc/source/devspecs/cross-neutron-vxlan-networking.rst

===========================================
Cross Neutron VxLAN Networking in Tricircle
===========================================
Background
==========
Currently we only support VLAN as the cross-Neutron network type. For VLAN network
type, central plugin in Tricircle picks a physical network and allocates a VLAN
tag(or uses what users specify), then before the creation of local network,
local plugin queries this provider network information and creates the network
based on this information. Tricircle only guarantees that instance packets sent
out of hosts in different pods belonging to the same VLAN network will be tagged
with the same VLAN ID. Deployers need to carefully configure physical networks
and switch ports to make sure that packets can be transported correctly between
physical devices.
For more flexible deployment, VxLAN network type is a better choice. Compared
to 12-bit VLAN ID, 24-bit VxLAN ID can support more numbers of bridge networks
and cross-Neutron L2 networks. With MAC-in-UDP encapsulation of VxLAN network,
hosts in different pods only need to be IP routable to transport instance
packets.
Proposal
========
There are some challenges to support cross-Neutron VxLAN network.
1. How to keep VxLAN ID identical for the same VxLAN network across Neutron servers
2. How to synchronize tunnel endpoint information between pods
3. How to trigger L2 agents to build tunnels based on this information
4. How to support different back-ends, like ODL, L2 gateway
The first challenge can be solved as VLAN network does, we allocate VxLAN ID in
central plugin and local plugin will use the same VxLAN ID to create local
network. For the second challenge, we introduce a new table called
"shadow_agents" in Tricircle database, so central plugin can save the tunnel
endpoint information collected from one local Neutron server in this table
and use it to populate the information to other local Neutron servers when
needed. Here is the schema of the table:
.. csv-table:: Shadow Agent Table
   :header: Field, Type, Nullable, Key, Default

   id, string, no, primary, null
   pod_id, string, no, , null
   host, string, no, unique, null
   type, string, no, unique, null
   tunnel_ip, string, no, , null
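As a sketch only (column names taken from the table above, sizes and
constraint details assumed), the model could be declared like this::

    import sqlalchemy as sa
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class ShadowAgent(Base):              # illustrative model only
        __tablename__ = 'shadow_agents'
        id = sa.Column(sa.String(36), primary_key=True)
        pod_id = sa.Column(sa.String(36), nullable=False)
        host = sa.Column(sa.String(255), nullable=False)
        type = sa.Column(sa.String(36), nullable=False)
        tunnel_ip = sa.Column(sa.String(48), nullable=False)
        # one tunnel endpoint record per (host, agent type)
        __table_args__ = (sa.UniqueConstraint('host', 'type'),)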
**How to collect tunnel endpoint information**
When the host where a port will be located is determined, local Neutron server
will receive a port-update request containing host ID in the body. During the
process of this request, local plugin can query agent information that contains
tunnel endpoint information from local Neutron database with host ID and port
VIF type; then send tunnel endpoint information to central Neutron server by
issuing a port-update request with this information in the binding profile.
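For illustration, the binding profile carried by that port-update request
might be shaped roughly like this; the key names are assumptions, not the
actual implementation::

    # hypothetical payload sketch: the local plugin reports the agent's
    # tunnel endpoint to the central Neutron server via binding:profile
    port_update_body = {
        'port': {
            'binding:profile': {
                'host': 'compute-1',            # host the port is bound to
                'type': 'Open vSwitch agent',   # agent type found locally
                'tunnel_ip': '192.168.100.11',  # VTEP address of that host
            }
        }
    }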
**How to populate tunnel endpoint information**
When the tunnel endpoint information in one pod is needed to be populated to
other pods, XJob will issue port-create requests to corresponding local Neutron
servers with tunnel endpoint information queried from Tricircle database in the
bodies. After receiving such request, local Neutron server will save tunnel
endpoint information by calling real core plugin's "create_or_update_agent"
method. This method comes from neutron.db.agent_db.AgentDbMixin class. Plugins
that support "agent" extension will have this method. Actually there's no such
agent daemon running in the target local Neutron server, but we insert a record
for it in the database so the local Neutron server will assume there exists an
agent. That's why we call it shadow agent.
The proposed solution for the third challenge is based on the shadow agent and
L2 population mechanism. In the original Neutron process, if the port status
is updated to active, L2 population mechanism driver does two things. First,
driver checks if the updated port is the first port in the target agent. If so,
driver collects tunnel endpoint information of other ports in the same network,
then sends the information to the target agent via RPC. Second, driver sends
the tunnel endpoint information of the updated port to other agents where ports
in the same network are located, also via RPC. L2 agents will build the tunnels
based on the information they received. To trigger the above processes to build
tunnels across Neutron servers, we further introduce shadow port.
Let's say we have two instance ports, port1 is located in host1 in pod1 and
port2 is located in host2 in pod2. To make L2 agent running in host1 build a
tunnel to host2, we create a port with the same properties of port2 in pod1.
As discussed above, local Neutron server will create shadow agent during the
process of port-create request, so local Neutron server in pod1 won't complain
that host2 doesn't exist. To trigger L2 population process, we then update the
port status to active, so L2 agent in host1 will receive tunnel endpoint
information of port2 and build the tunnel. Port status is a read-only property
so we can't directly update it via ReSTful API. Instead, we issue a port-update
request with a special key in the binding profile. After local Neutron server
receives such request, it pops the special key from the binding profile and
updates the port status to active. XJob daemon will take the job to create and
update shadow ports.
Here is the flow of shadow agent and shadow port process::
+-------+ +---------+ +---------+
| | | | +---------+ | |
| Local | | Local | | | +----------+ +------+ | Local |
| Nova | | Neutron | | Central | | | | | | Neutron |
| Pod1 | | Pod1 | | Neutron | | Database | | XJob | | Pod2 |
| | | | | | | | | | | |
+---+---+ +---- ----+ +----+----+ +----+-----+ +--+---+ +----+----+
| | | | | |
| update port1 | | | | |
| [host id] | | | | |
+---------------> | | | |
| | update port1 | | | |
| | [agent info] | | | |
| +----------------> | | |
| | | save shadow | | |
| | | agent info | | |
| | +----------------> | |
| | | | | |
| | | trigger shadow | | |
| | | port setup job | | |
| | | for pod1 | | |
| | +---------------------------------> |
| | | | | query ports in |
| | | | | the same network |
| | | | +------------------>
| | | | | |
| | | | | return port2 |
| | | | <------------------+
| | | | query shadow | |
| | | | agent info | |
| | | | for port2 | |
| | | <----------------+ |
| | | | | |
| | | | create shadow | |
| | | | port for port2 | |
| <--------------------------------------------------+ |
| | | | | |
| | create shadow | | | |
| | agent and port | | | |
| +-----+ | | | |
| | | | | | |
| | | | | | |
| <-----+ | | | |
| | | | update shadow | |
| | | | port to active | |
| <--------------------------------------------------+ |
| | | | | |
| | L2 population | | | trigger shadow |
| +-----+ | | | port setup job |
| | | | | | for pod2 |
| | | | | +-----+ |
| <-----+ | | | | |
| | | | | | |
| | | | <-----+ |
| | | | | |
| | | | | |
+ + + + + +
Bridge networks can support VxLAN in the same way; we just create shadow
ports for the router interface and router gateway. In the above graph, the
local Nova server updates the port with a host ID to trigger the whole
process. The L3 agent will update the interface port and gateway port with a
host ID, so a similar process will be triggered to create shadow ports for
the router interface and router gateway.
Currently Neutron team is working on push notification [1]_, Neutron server
will send resource data to agents; agents cache this data and use it to do the
real job like configuring openvswitch, updating iptables, configuring dnsmasq,
etc. Agents don't need to retrieve resource data from Neutron server via RPC
any more. Based on push notification, if tunnel endpoint information is stored
in port object later, and this information supports updating via ReSTful API,
we can simplify the solution for challenge 3 and 4. We just need to create
shadow port containing tunnel endpoint information. This information will be
pushed to agents and agents use it to create necessary tunnels and flows.
**How to support different back-ends besides ML2+OVS implementation**
We consider two typical back-ends that can support cross-Neutron VxLAN
networking: L2 gateway and an SDN controller like ODL. For L2 gateway, we
consider only supporting static tunnel endpoint information at the first step.
Shadow agent and shadow port process is almost the same with the ML2+OVS
implementation. The difference is that, for L2 gateway, the tunnel IP of the
shadow agent is set to the tunnel endpoint of the L2 gateway. So after L2
population, L2 agents will create tunnels to the tunnel endpoint of the L2
gateway. For SDN controller, we assume that SDN controller has the ability to
manage tunnel endpoint information across Neutron servers, so Tricircle only helps to
allocate VxLAN ID and keep the VxLAN ID identical across Neutron servers for one network.
Shadow agent and shadow port process will not be used in this case. However, if
different SDN controllers are used in different pods, it will be hard for each
SDN controller to connect hosts managed by other SDN controllers since each SDN
controller has its own mechanism. This problem is discussed in this page [2]_.
One possible solution under Tricircle is as what L2 gateway does. We create
shadow ports that contain L2 gateway tunnel endpoint information so SDN
controller can build tunnels in its own way. We then configure L2 gateway in
each pod to forward the packets between L2 gateways. L2 gateways discussed here
are mostly hardware based, and can be controlled by SDN controller. SDN
controller will use ML2 mechanism driver to receive the L2 network context and
further control L2 gateways for the network.
To distinguish different back-ends, we will add a new configuration option
cross_pod_vxlan_mode whose valid values are "p2p", "l2gw" and "noop". Mode
"p2p" works for the ML2+OVS scenario, in this mode, shadow ports and shadow
agents containing host tunnel endpoint information are created; mode "l2gw"
works for the L2 gateway scenario, in this mode, shadow ports and shadow agents
containing L2 gateway tunnel endpoint information are created. For the SDN
controller scenario, as discussed above, if SDN controller can manage tunnel
endpoint information by itself, we only need to use "noop" mode, meaning that
neither shadow ports nor shadow agents will be created; or if SDN controller
can manage hardware L2 gateway, we can use "l2gw" mode.
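A sketch of how such an option could be registered with oslo.config is shown
below; the option group and the default value are assumptions::

    from oslo_config import cfg

    vxlan_opts = [
        cfg.StrOpt('cross_pod_vxlan_mode',
                   default='p2p',                     # assumed default
                   choices=['p2p', 'l2gw', 'noop'],
                   help='How to set up VxLAN tunnels across pods: "p2p" for '
                        'host-to-host tunnels (ML2+OVS), "l2gw" to tunnel via '
                        'an L2 gateway, "noop" when an SDN controller manages '
                        'tunnel endpoints itself.'),
    ]

    cfg.CONF.register_opts(vxlan_opts, group='client')  # group name assumed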
Data Model Impact
=================
New table "shadow_agents" is added.
Dependencies
============
None
Documentation Impact
====================
- Update configuration guide to introduce options for VxLAN network
- Update networking guide to discuss new scenarios with VxLAN network
- Add release note about cross-Neutron VxLAN networking support
References
==========
.. [1] https://blueprints.launchpad.net/neutron/+spec/push-notifications
.. [2] http://etherealmind.com/help-wanted-stitching-a-federated-sdn-on-openstack-with-evpn/

doc/source/devspecs/devspecs-guide.rst

Devspecs Guide
------------------
Design specs for developers who are interested in Tricircle.
.. include:: ./async_job_management.rst
.. include:: ./cross-neutron-l2-networking.rst
.. include:: ./cross-neutron-vxlan-networking.rst
.. include:: ./dynamic-pod-binding.rst
.. include:: ./enhance-xjob-reliability.rst
.. include:: ./l3-networking-combined-bridge-net.rst
.. include:: ./l3-networking-multi-NS-with-EW-enabled.rst
.. include:: ./lbaas.rst
.. include:: ./legacy_tables_clean.rst
.. include:: ./local-neutron-plugin.rst
.. include:: ./new-l3-networking-mulit-NS-with-EW.rst
.. include:: ./quality-of-service.rst
.. include:: ./resource_deleting.rst
.. include:: ./smoke-test-engine.rst

doc/source/devspecs/dynamic-pod-binding.rst

=================================
Dynamic Pod Binding in Tricircle
=================================
Background
===========
Most public cloud infrastructure is built with Availability Zones (AZs).
Each AZ consists of one or more discrete data centers, each with high
bandwidth and low latency network connections, separate power and facilities.
These AZs offer cloud tenants the ability to operate production applications
and databases across multiple AZs, which makes them more highly available,
fault tolerant and scalable than a single data center.
In production clouds, each AZ is built from modularized OpenStack instances,
and each OpenStack instance is one pod. Moreover, one AZ can include multiple
pods, and the pods are classified into different categories. For example,
servers in one pod are only for general purposes, while other pods may be
built for heavy-load CAD modeling with GPUs. So the pods in one AZ can be
divided into different groups, with different pod groups for different
purposes, and the VMs' cost and performance also differ.
The concept "pod" is created for the Tricircle to facilitate managing
OpenStack instances among AZs, which therefore is transparent to cloud
tenants. The Tricircle maintains and manages a pod binding table which
records the mapping relationship between a cloud tenant and pods. When the
cloud tenant creates a VM or a volume, the Tricircle tries to assign a pod
based on the pod binding table.
Motivation
===========
In a resource allocation scenario, a tenant may create a VM in one pod and a
new volume in another pod. If the tenant attempts to attach the volume to the
VM, the operation will fail. In other words, the volume should be in the same
pod where the VM is; otherwise the volume and VM would not be able to finish
the attachment. Hence, the Tricircle needs to ensure the pod binding so as to
guarantee that the VM and volume are created in one pod.
In a capacity expansion scenario, when resources in one pod are exhausted, a
new pod of the same type should be added into the AZ. Therefore, new resources
of this type should be provisioned in the newly added pod, which requires a
dynamic change of the pod binding. The pod binding could be changed dynamically
by the Tricircle, or by the admin through the admin API for maintenance
purposes. For example, during a maintenance (upgrade, repair) window, all new
provisioning requests should be forwarded to the running pod, not the one
under maintenance.
Solution: dynamic pod binding
==============================
Capacity expansion inside one pod is quite a headache: you have to estimate,
calculate, monitor, simulate, test, and do online grey expansion for
controller nodes and network nodes whenever you add new machines to the pod.
It's quite a big challenge as more and more resources are added to one pod,
and at last you will reach the limits of one OpenStack instance. If this pod's
resources are exhausted or reach the limit for new resource provisioning, the
Tricircle needs to bind the tenant to a new pod instead of expanding the
current pod unlimitedly. The Tricircle needs to select a proper pod and keep
the binding for a duration; during this duration, VMs and volumes will be
created for one tenant in the same pod.
For example, suppose we have two groups of pods, and each group has 3 pods,
i.e.,
GroupA(Pod1, Pod2, Pod3) for general purpose VM,
GroupB(Pod4, Pod5, Pod6) for CAD modeling.
Tenant1 is bound to Pod1, Pod4 during the first phase for several months.
In the first phase, we can just add a weight to each pod, for example Pod1
with weight 1 and Pod2 with weight 2. This could be done by adding one new
field in the pod table, or with no field at all, just linking them by the
order they were created in the Tricircle. In this case, we use the pod
creation time as the weight.
If the tenant wants to allocate VM/volume for general VM, Pod1 should be
selected. It can be implemented with flavor or volume type metadata. For
general VM/Volume, there is no special tag in flavor or volume type metadata.
If the tenant wants to allocate VM/volume for CAD modeling VM, Pod4 should be
selected. For CAD modeling VM/Volume, a special tag "resource: CAD Modeling"
in flavor or volume type metadata determines the binding.
When it is detected that there are no more resources in Pod1 and Pod4, the
Tricircle queries the pod table, based on the resource_affinity_tag, for
available pods which provision a specific type of resource. The field
resource_affinity is a key-value pair. A pod will be selected when there is a
matching key-value pair in the flavor extra specs or volume type extra specs.
A tenant will be bound to one pod in one group of pods with the same
resource_affinity_tag. In this case, the Tricircle obtains Pod2 and Pod3 for
general purposes, as well as Pod5 and Pod6 for CAD purposes. The Tricircle
needs to change the binding; for example, tenant1 needs to be bound to Pod2
and Pod5.
Implementation
==============
Measurement
----------------
To get information about the resource utilization of pods, the Tricircle needs
to conduct some measurements on the pods. The statistics task should be done
in the bottom pod.
For resource usage, the current cells feature provides an interface to
retrieve the usage for cells [1]. OpenStack provides details of a cell's
capacity, including disk and RAM, via the show-cell-capacities API [1].
If OpenStack is not running with cells mode, we can ask Nova to provide
an interface to show the usage detail in AZ. Moreover, an API for usage
query at host level is provided for admins [3], through which we can obtain
details of a host, including cpu, memory, disk, and so on.
Cinder also provides interface to retrieve the backend pool usage,
including updated time, total capacity, free capacity and so on [2].
The Tricircle needs one task to collect the usage in the bottom pods on a
daily basis, to evaluate whether the threshold is reached or not. A threshold
or headroom could be configured for each pod so as not to reach 100%
exhaustion of resources.
On top there should be no heavy processing, so summing up the info from the
bottom can be done in the Tricircle. After collecting the details, the
Tricircle can judge whether a pod has reached its limit.
Tricircle
----------
The Tricircle needs a framework to support different binding policy (filter).
Each pod is one OpenStack instance, including controller nodes and compute
nodes. E.g.,
::
+-> controller(s) - pod1 <--> compute nodes <---+
|
The tricircle +-> controller(s) - pod2 <--> compute nodes <---+ resource migration, if necessary
(resource controller) .... |
+-> controller(s) - pod{N} <--> compute nodes <-+
The Tricircle selects a pod and thus decides to which pod's controllers the
requests should be forwarded. Then the controllers in the selected pod will do
their own scheduling.
The simplest binding filter is as follows: line up all available pods in a
list and always select the first one. When all the resources in the first pod
have been allocated, remove it from the list. This is quite like how a
production cloud is built: at first, only a few pods are in the list, and then
more and more pods are added if there are not enough resources in the current
cloud. For example,
List1 for general pool: Pod1 <- Pod2 <- Pod3
List2 for CAD modeling pool: Pod4 <- Pod5 <- Pod6
If Pod1's resource exhausted, Pod1 is removed from List1. The List1 is changed
to: Pod2 <- Pod3.
If Pod4's resource exhausted, Pod4 is removed from List2. The List2 is changed
to: Pod5 <- Pod6
If the tenant wants to allocate resources for general VM, the Tricircle
selects Pod2. If the tenant wants to allocate resources for CAD modeling VM,
the Tricircle selects Pod5.
Filtering
-------------
For the strategy of selecting pods, we need a series of filters. Before
implementing dynamic pod binding, the binding criteria are hard coded to
select the first pod in the AZ. Hence, we need to design a series of filter
algorithms. Firstly, we plan to design an ALLPodsFilter which does no
filtering and passes all the available pods. Secondly, we plan to design an
AvailabilityZoneFilter which passes the pods matching the specified
availability zone. Thirdly, we plan to design a ResourceAffinityFilter which
passes the pods matching the specified resource type. Based on the
resource_affinity_tag,
the Tricircle can be aware of which type of resource the tenant wants to
provision. In the future, we can add more filters, which requires adding more
information in the pod table.
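The filter chain could be sketched roughly as follows; the class interface and
the pod attribute names are illustrative only, not the final design::

    class BasePodFilter(object):
        """Return the subset of pods that satisfy one criterion."""
        def filter_pods(self, context, pods, request_spec):
            raise NotImplementedError

    class AllPodsFilter(BasePodFilter):
        def filter_pods(self, context, pods, request_spec):
            return list(pods)              # pass every available pod

    class AvailabilityZoneFilter(BasePodFilter):
        def filter_pods(self, context, pods, request_spec):
            az = request_spec.get('availability_zone')
            return [p for p in pods if not az or p['az_name'] == az]

    class ResourceAffinityFilter(BasePodFilter):
        def filter_pods(self, context, pods, request_spec):
            # match the key-value pair from flavor/volume-type extra specs
            tag = request_spec.get('resource_affinity_tag')
            return [p for p in pods
                    if not tag or p.get('resource_affinity_tag') == tag]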
Weighting
-------------
After filtering all the pods, the Tricircle obtains the available pods for a
tenant. The Tricircle needs to select the most suitable pod for the tenant.
Hence, we need to define a weight function to calculate the corresponding
weight of each pod. Based on the weights, the Tricircle selects the pod which
has the maximum weight value. When calculating the weight of a pod, we need
to design a series of weighers. We first take the pod creation time into
consideration when designing the weight function. The second one is the idle
capacity, to select a pod which has the most idle capacity. Other metrics
will be added in the future, e.g., cost.
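A corresponding weigher sketch, where the pod with the maximum combined weight
is selected, might look like this; the ``created_at`` and ``idle_capacity``
fields are assumptions::

    class PodCreationTimeWeigher(object):
        """Prefer older pods, mimicking 'fill the first pod first'."""
        def weigh(self, pod):
            return -pod['created_at'].timestamp()

    class IdleCapacityWeigher(object):
        """Prefer the pod with the most idle capacity."""
        def weigh(self, pod):
            return pod.get('idle_capacity', 0.0)

    def select_pod(pods, weighers):
        return max(pods, key=lambda pod: sum(w.weigh(pod) for w in weighers))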
Data Model Impact
-----------------
Firstly, we need to add a column “resource_affinity_tag” to the pod table,
which is used to store the key-value pair, to match flavor extra-spec and
volume extra-spec.
Secondly, in the pod binding table, we need to add fields of start binding
time and end binding time, so the history of the binding relationship could
be stored.
Thirdly, we need a table to store the usage of each pod for Cinder/Nova.
We plan to use JSON object to store the usage information. Hence, even if
the usage structure is changed, we don't need to update the table. And if
the usage value is null, that means the usage has not been initialized yet.
As mentioned above, the usage could be refreshed on a daily basis. If it's
not initialized yet, that means there are still lots of resources available,
and the pod can be scheduled as if it had not reached the usage threshold.
Dependencies
------------
None
Testing
-------
None
Documentation Impact
--------------------
None
Reference
---------
[1] http://developer.openstack.org/api-ref-compute-v2.1.html#showCellCapacities
[2] http://developer.openstack.org/api-ref-blockstorage-v2.html#os-vol-pool-v2
[3] http://developer.openstack.org/api-ref-compute-v2.1.html#showinfo

doc/source/devspecs/enhance-xjob-reliability.rst

=======================================
Enhance Reliability of Asynchronous Job
=======================================
Background
==========
Currently we are using cast method in our RPC client to trigger asynchronous
job in XJob daemon. After one of the worker threads receives the RPC message
from the message broker, it registers the job in the database and starts to
run the handle function. The registration guarantees that asynchronous job will
not be lost after the job fails and the failed job can be redone. The detailed
discussion of the asynchronous job process in XJob daemon is covered in our
design document [1].
Though asynchronous jobs are correctly saved after worker threads get the RPC
message, we still risk losing jobs. By using the cast method, it's only
guaranteed that the message is received by the message broker, but there's no
guarantee that the message can be received by the message consumer, i.e., the
RPC server thread running in XJob daemon. According to the RabbitMQ document,
undelivered messages will be lost if RabbitMQ server stops [2]. Message
persistence or publisher confirm can be used to increase reliability, but
they sacrifice performance. On the other hand, we can not assume that message
brokers other than RabbitMQ will provide similar persistence or confirmation
functionality. Therefore, Tricircle itself should handle the asynchronous job
reliability problem as far as possible. Since we already have a framework to
register, run and redo asynchronous jobs in XJob daemon, we propose a cheaper
way to improve reliability.
Proposal
========
One straightforward way to make sure that the RPC server has received the RPC
message is to use the call method. The RPC client will be blocked until the
RPC server replies to the message if it uses the call method to send the RPC
request. So if something goes wrong before the reply, the RPC client can be
aware of it. Of course we cannot make the RPC client wait too long, so the RPC
handlers on the RPC server side need to be simple and quick to run. Thanks to
the asynchronous job framework we already have, migrating from the cast method
to the call method is easy.
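In oslo.messaging terms the change is roughly the difference sketched below;
the transport setup, topic and method name are assumptions, not the actual
Tricircle code::

    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='xjob', version='1.0')  # assumed topic
    client = oslo_messaging.RPCClient(transport, target)

    ctxt = {}
    payload = {'pod_id': 'pod-1', 'port_id': 'port-1'}

    # current behaviour: fire-and-forget, the message can be lost
    client.cast(ctxt, 'delete_server_port', payload=payload)

    # proposed behaviour: block until the server replies, i.e. until
    # the job has at least been registered in the database
    client.call(ctxt, 'delete_server_port', payload=payload)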
Here is the flow of the current process::
+--------+ +--------+ +---------+ +---------------+ +----------+
| | | | | | | | | |
| API | | RPC | | Message | | RPC Server | | Database |
| Server | | client | | Broker | | Handle Worker | | |
| | | | | | | | | |
+---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+
| | | | |
| call RPC API | | | |
+--------------> | | |
| | send cast message | | |
| +-------------------> | |
| call return | | dispatch message | |
<--------------+ +------------------> |
| | | | register job |
| | | +---------------->
| | | | |
| | | | obtain lock |
| | | +---------------->
| | | | |
| | | | run job |
| | | +----+ |
| | | | | |
| | | | | |
| | | <----+ |
| | | | |
| | | | |
+ + + + +
We can just leave **register job** phase in the RPC handle and put **obtain
lock** and **run job** phase in a separate thread, so the RPC handle is simple
enough to use call method to invoke it. Here is the proposed flow::
+--------+ +--------+ +---------+ +---------------+ +----------+ +-------------+ +-------+
| | | | | | | | | | | | | |
| API | | RPC | | Message | | RPC Server | | Database | | RPC Server | | Job |
| Server | | client | | Broker | | Handle Worker | | | | Loop Worker | | Queue |
| | | | | | | | | | | | | |
+---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+ +------+------+ +---+---+
| | | | | | |
| call RPC API | | | | | |
+--------------> | | | | |
| | send call message | | | | |
| +--------------------> | | | |
| | | dispatch message | | | |
| | +------------------> | | |
| | | | register job | | |
| | | +----------------> | |
| | | | | | |
| | | | job enqueue | | |
| | | +------------------------------------------------>
| | | | | | |
| | | reply message | | | job dequeue |
| | <------------------+ | |-------------->
| | send reply message | | | obtain lock | |
| <--------------------+ | <----------------+ |
| call return | | | | | |
<--------------+ | | | run job | |
| | | | | +----+ |
| | | | | | | |
| | | | | | | |
| | | | | +----> |
| | | | | | |
| | | | | | |
+ + + + + + +
In the above graph, **Loop Worker** is a newly introduced thread that does the
actual work. **Job Queue** is an eventlet queue used to coordinate
**Handle Worker**, which produces job entries, and **Loop Worker**, which
consumes job entries. While accessing an empty queue, **Loop Worker** is
blocked until some job entries are put into the queue. **Loop Worker**
retrieves job entries from the job queue and then runs them. Similar to the
original flow, since multiple workers may get the same type of job for the same
resource at the same time, workers need to obtain the lock before they can run
the job. One problem occurs when the XJob daemon stops before it finishes all
the jobs in the job queue: all unfinished jobs are lost. To solve it, we change
the original periodical task that is used to redo failed jobs and let it also
handle jobs which have been registered for a certain time but haven't been
started. So both failed jobs and "orphan" new jobs can be picked up and redone.
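The following is a minimal, runnable sketch of this worker split, assuming
eventlet is used as in XJob; the in-memory job table and lock set are
simplified stand-ins for the real database-backed registration and locking::

    import eventlet
    eventlet.monkey_patch()
    from eventlet import queue

    job_queue = queue.Queue()   # coordinates Handle Worker and Loop Worker
    job_db = {}                 # stand-in for the job table in the database
    job_locks = set()           # stand-in for the database-based job lock


    def handle_worker(job_type, resource_id):
        """RPC handler: short enough to be invoked via the blocking call."""
        job_id = len(job_db) + 1
        job_db[job_id] = {'type': job_type, 'resource': resource_id,
                          'status': 'NEW'}                 # register job
        job_queue.put(job_id)                              # job enqueue
        return job_id                                      # reply to the client


    def loop_worker():
        """Does the actual work; blocked while the job queue is empty."""
        while True:
            job_id = job_queue.get()                       # job dequeue
            if job_id in job_locks:                        # obtain lock
                continue                                   # another worker runs it
            job_locks.add(job_id)
            job_db[job_id]['status'] = 'SUCCESS'           # run job (omitted)


    eventlet.spawn_n(loop_worker)
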
You can see that **Handle Worker** doesn't do much work: it just consumes RPC
messages, registers jobs and then puts job items into the job queue. So one
extreme solution is to register new jobs on the API server side and start
worker threads that retrieve jobs from the database and run them. In this way,
we can remove all the RPC processes and use the database to coordinate. The
drawback of this solution is that we don't dispatch jobs: all the workers query
jobs from the database, so there is a high probability that some of the workers
obtain the same job and races occur. In the first solution, the message broker
helps us dispatch messages, and thus dispatch jobs.
Considering that job dispatch is important, we can make some changes to the
second solution and move to a third one: new jobs are still registered on the
API server side, but we keep using the cast method to trigger the asynchronous
job in the XJob daemon. Since job registration is done on the API server side,
we are not afraid that jobs will be lost if cast messages are lost. If the API
server fails to register the job, it returns a failure response; if the
registration succeeds, the job will eventually be done by the XJob daemon. By
using RPC, we dispatch jobs with the help of message brokers. One thing that
makes the cast method better than the call method is that retrieving RPC
messages and running job handlers are done in the same thread, so if one XJob
daemon is busy handling jobs, RPC messages will not be dispatched to it.
However, when using the call method, RPC messages are retrieved by one thread
(the **Handle Worker**) and job handlers are run by another thread
(the **Loop Worker**), so the XJob daemon may keep accumulating jobs in the
queue while it is already busy handling jobs.
This solution has the same problem as the call method solution: if cast
messages are lost, the new jobs are registered in the database but no XJob
daemon is aware of them. We solve it in the same way, using the periodical task
to pick up these "orphan" jobs. Here is the flow::
+--------+ +--------+ +---------+ +---------------+ +----------+
| | | | | | | | | |
| API | | RPC | | Message | | RPC Server | | Database |
| Server | | client | | Broker | | Handle Worker | | |
| | | | | | | | | |
+---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+
| | | | |
| call RPC API | | | |
+--------------> | | |
| | register job | | |
| +------------------------------------------------------->
| | | | |
| | [if succeed to | | |
| | register job] | | |
| | send cast message | | |
| +-------------------> | |
| call return | | dispatch message | |
<--------------+ +------------------> |
| | | | obtain lock |
| | | +---------------->
| | | | |
| | | | run job |
| | | +----+ |
| | | | | |
| | | | | |
| | | <----+ |
| | | | |
| | | | |
+ + + + +
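A condensed, runnable sketch of this register-then-cast flow on the API server
side is shown below. The tiny in-memory classes are stand-ins for the real
database layer and the oslo.messaging RPC client, used only to illustrate the
ordering of the steps::

    import uuid


    class FakeJobDB(object):
        """Stand-in for the job table accessed by the API server."""
        def __init__(self):
            self.jobs = {}

        def register_job(self, job_type, resource_id):
            job_id = str(uuid.uuid4())
            self.jobs[job_id] = {'type': job_type, 'resource': resource_id,
                                 'status': 'NEW'}
            return job_id


    class FakeRPCClient(object):
        """Stand-in for the cast-only RPC client towards XJob."""
        def cast(self, method, **kwargs):
            # Fire-and-forget: if this message is lost, the periodical task
            # later finds the already-registered job and redoes it.
            print('cast %s(%s)' % (method, kwargs))


    def trigger_job(db, rpc, job_type, resource_id):
        """API server side: register the job first, then cast."""
        job_id = db.register_job(job_type, resource_id)
        rpc.cast('run_job', job_id=job_id)
        return job_id


    print(trigger_job(FakeJobDB(), FakeRPCClient(),
                      'port_delete', '8498b903-9e18-4265-8d62-3c12e0ce4314'))
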
Discussion
==========
In this section we discuss the pros and cons of the above three solutions.
.. list-table:: **Solution Comparison**
   :header-rows: 1

   * - Solution
     - Pros
     - Cons
   * - API server uses call
     - no RPC message lost
     - downtime of unfinished jobs in the job queue when the XJob daemon
       stops, job dispatch not based on XJob daemon workload
   * - API server registers jobs + no RPC
     - no requirement on RPC (message broker), no downtime
     - no job dispatch, conflict costs time
   * - API server registers jobs + uses cast
     - job dispatch based on XJob daemon workload
     - downtime of lost jobs due to lost cast messages
Downtime means that after a job is dispatched to a worker, other workers need
to wait for a certain time to determine that the job has expired before they
can take it over.
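Both the second and the third solution rely on the periodical recovery task
mentioned earlier. The sketch below shows in a simplified form how such a task
could select the jobs to redo; the job-table layout and the threshold value are
illustrative, not the actual Tricircle schema::

    import datetime

    REDO_THRESHOLD = datetime.timedelta(minutes=30)


    def collect_jobs_to_redo(jobs, now=None):
        """Return ids of failed jobs and of "orphan" NEW jobs."""
        now = now or datetime.datetime.utcnow()
        to_redo = []
        for job_id, job in jobs.items():
            if job['status'] == 'FAIL':
                to_redo.append(job_id)          # failed job, redo it
            elif (job['status'] == 'NEW' and
                    now - job['timestamp'] > REDO_THRESHOLD):
                to_redo.append(job_id)          # registered but never started
        return to_redo
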
Conclusion
==========
We decide to implement the third solution (API server registers jobs + uses
cast) since it improves asynchronous job reliability and at the same time
provides better workload dispatch.
Data Model Impact
=================
None
Dependencies
============
None
Documentation Impact
====================
None
References
==========
.. [1] https://docs.google.com/document/d/1zcxwl8xMEpxVCqLTce2-dUOtB-ObmzJTbV1uSQ6qTsY
.. [2] https://www.rabbitmq.com/tutorials/tutorial-two-python.html
.. [3] https://www.rabbitmq.com/confirms.html
.. [4] http://eventlet.net/doc/modules/queue.html

8
doc/source/devspecs/index.rst

@ -0,0 +1,8 @@
==========================
Tricircle Devspecs Guide
==========================
.. toctree::
:maxdepth: 4
devspecs-guide

554
doc/source/devspecs/l3-networking-combined-bridge-net.rst

@ -0,0 +1,554 @@
==============================================
Layer-3 Networking and Combined Bridge Network
==============================================
Background
==========
To achieve cross-Neutron layer-3 networking, we utilize a bridge network to
connect networks in each Neutron server, as shown below:
East-West networking::
+-----------------------+ +-----------------------+
| OpenStack1 | | OpenStack2 |
| | | |
| +------+ +---------+ | +------------+ | +---------+ +------+ |
| | net1 | | ip1| | | bridge net | | |ip2 | | net2 | |
| | +--+ R +---+ +---+ R +--+ | |
| | | | | | | | | | | | | |
| +------+ +---------+ | +------------+ | +---------+ +------+ |
+-----------------------+ +-----------------------+
Fig 1
North-South networking::
+---------------------+ +-------------------------------+
| OpenStack1 | | OpenStack2 |
| | | |
| +------+ +-------+ | +--------------+ | +-------+ +----------------+ |
| | net1 | | ip1| | | bridge net | | | ip2| | external net | |
| | +--+ R1 +---+ +---+ R2 +--+ | |
| | | | | | | 100.0.1.0/24 | | | | | 163.3.124.0/24 | |
| +------+ +-------+ | +--------------+ | +-------+ +----------------+ |
+---------------------+ +-------------------------------+
Fig 2
To support east-west networking, we configure extra routes in routers in each
OpenStack cloud::
In OpenStack1, destination: net2, nexthop: ip2
In OpenStack2, destination: net1, nexthop: ip1
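The extra routes above map to the standard Neutron "extraroute" extension. A
minimal sketch, assuming already authenticated neutronclient instances for each
region and illustrative router ids, CIDRs and next-hop addresses, could look
like this::

    def add_east_west_route(neutron, router_id, destination_cidr, nexthop_ip):
        """Append one extra route to an existing router."""
        router = neutron.show_router(router_id)['router']
        routes = router.get('routes') or []
        routes.append({'destination': destination_cidr, 'nexthop': nexthop_ip})
        neutron.update_router(router_id, {'router': {'routes': routes}})

    # In OpenStack1: destination net2, nexthop ip2 (R's bridge-net ip in OpenStack2)
    # add_east_west_route(neutron1, r1_id, '10.0.2.0/24', '100.0.1.3')
    # In OpenStack2: destination net1, nexthop ip1 (R's bridge-net ip in OpenStack1)
    # add_east_west_route(neutron2, r2_id, '10.0.1.0/24', '100.0.1.2')
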
To support north-south networking, we set the bridge network as the external
network in OpenStack1 and as an internal network in OpenStack2. For an instance
in net1 to access the external network, the packets are SNATed twice: first to
ip1, then to ip2. For floating IP binding, an IP in net1 is first bound to an
IP (like 100.0.1.5) in the bridge network (the bridge network is attached to R1
as an external network), then the IP (100.0.1.5) in the bridge network is bound
to an IP (like 163.3.124.8) in the real external network (the bridge network is
attached to R2 as an internal network).
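A rough sketch of this two-stage floating IP binding through the plain Neutron
API is given below; the client objects, network and port ids, and the way the
port carrying the bridge-network address appears in OpenStack2 are all
illustrative and leave out Tricircle's internal bookkeeping::

    def bind_floating_ip_two_stage(neutron1, neutron2, bridge_net_id,
                                   instance_port_id, ext_net_id,
                                   bridge_port_id_in_os2):
        # Stage 1 (OpenStack1): bind an address from the bridge network
        # (R1's external network) to the instance port in net1.
        fip1 = neutron1.create_floatingip(
            {'floatingip': {'floating_network_id': bridge_net_id,
                            'port_id': instance_port_id}})['floatingip']
        # Stage 2 (OpenStack2): bind an address from the real external network
        # to the port holding fip1's address on the bridge network
        # (R2's internal network).
        fip2 = neutron2.create_floatingip(
            {'floatingip': {'floating_network_id': ext_net_id,
                            'port_id': bridge_port_id_in_os2}})['floatingip']
        return fip1['floating_ip_address'], fip2['floating_ip_address']
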
Problems
========
The idea of introducing a bridge network is good, but there are some problems
in the current usage of the bridge network.
Redundant Bridge Network
------------------------
We use two bridge networks per tenant to achieve layer-3 networking. If VLAN is
used as the bridge network type, the range of the VLAN tag (about 4096 IDs)
allows only around 2048 pairs of bridge networks to be created. The number of
tenants supported is far from enough.
Redundant SNAT
--------------
In the current implementation, packets are SNATed twice for outbound traffic
and DNATed twice for inbound traffic. The drawback is that outbound packets go
through extra NAT operations. Also, we need to maintain an extra floating IP
pool for inbound traffic.
DVR support
-----------
The bridge network is attached to the router as an internal network for
east-west networking, and also for north-south networking when the real
external network and the router are not located in the same OpenStack cloud.
This is fine when the bridge network is VLAN type, since packets directly go
out of the host and are exchanged by switches. But if we would like to support
VxLAN as the bridge network type later, attaching the bridge network as an
internal network in the DVR scenario will cause some trouble. The way DVR
connects internal networks is that packets are routed locally in each host, and
if the destination is not on the local host, the packets are sent to the
destination host via a VxLAN tunnel. Here comes the problem: if the bridge
network is attached as an internal network, router interfaces will exist on all
the hosts where the router namespaces are created, so the Tricircle needs to
maintain lots of VTEPs and VxLAN tunnels for the bridge network. Ports in the
bridge network are located in different OpenStack clouds, so a local Neutron
server is not aware of ports in other OpenStack clouds and will not set up the
VxLAN tunnels for us.
Proposal
--------
To address the above problems, we propose to combine the bridge networks for
east-west and north-south networking. The bridge network is always attached to
routers as an external network. In the DVR scenario, different from router
interfaces, the router gateway will only exist in the SNAT namespace on a
specific host, which reduces the number of VTEPs and VxLAN tunnels the
Tricircle needs to handle. By setting the "enable_snat" option to "False" when
attaching the router gateway, packets will not be SNATed when going through the
router gateway, so packets are only SNATed and DNATed once at the real external
gateway. However, since one router can only be attached to one external
network, in the OpenStack cloud where the real external network is located, we
need to add one more router to connect the bridge network with the real
external network. The network topology is shown below::
+-------------------------+ +-------------------------+
|OpenStack1 | |OpenStack2 |
| +------+ +--------+ | +------------+ | +--------+ +------+ |
| | | | IP1| | | | | |IP2 | | | |
| | net1 +---+ R1 XXXXXXX bridge net XXXXXXX R2 +---+ net2 | |
| | | | | | | | | | | | | |
| +------+ +--------+ | +---X----+---+ | +--------+ +------+ |
| | X | | |
+-------------------------+ X | +-------------------------+
X |
X |
+--------------------------------X----|-----------------------------------+
|OpenStack3 X | |
| X | |
| +------+ +--------+ X | +--------+ +--------------+ |
| | | | IP3| X | |IP4 | | | |
| | net3 +----+ R3 XXXXXXXXXX +---+ R4 XXXXXX external net | |
| | | | | | | | | |
| +------+ +--------+ +--------+ +--------------+ |
| |
+-------------------------------------------------------------------------+
router interface: -----
router gateway: XXXXX
IPn: router gateway ip or router interface ip
Fig 3
Extra routes and gateway ip are configured to build the connection::
routes of R1: net2 via IP2
net3 via IP3
external gateway ip of R1: IP4
(IP2 and IP3 are from bridge net, so routes will only be created in
SNAT namespace)
routes of R2: net1 via IP1
net3 via IP3
external gateway ip of R2: IP4
(IP1 and IP3 are from bridge net, so routes will only be created in
SNAT namespace)
routes of R3: net1 via IP1
net2 via IP2
external gateway ip of R3: IP4
(IP1 and IP2 are from bridge net, so routes will only be created in
SNAT namespace)
routes of R4: net1 via IP1
net2 via IP2
net3 via IP3
external gateway ip of R4: real-external-gateway-ip
disable DVR mode
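The routing configuration of, for example, R1 above could be applied roughly as
sketched below, assuming an authenticated neutronclient instance and
illustrative ids and addresses; the default route towards IP4 would come from
the gateway_ip of the bridge subnet and is not shown here::

    def wire_router_r1(neutron, r1_id, bridge_net_id, ip1, routes_via_bridge):
        # Attach the bridge network as R1's external network, pin R1's own
        # address on it to IP1 and disable SNAT so packets are only NATed at
        # the real external gateway (R4).
        neutron.add_gateway_router(
            r1_id,
            {'network_id': bridge_net_id,
             'enable_snat': False,
             'external_fixed_ips': [{'ip_address': ip1}]})
        # routes_via_bridge, e.g.:
        #   [{'destination': net2_cidr, 'nexthop': ip2},
        #    {'destination': net3_cidr, 'nexthop': ip3}]
        neutron.update_router(r1_id, {'router': {'routes': routes_via_bridge}})
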
An alternative solution that can avoid the extra router is that, for the router
located in the same OpenStack cloud as the real external network, we attach the
bridge network as an internal network, so the real external network can be
attached to the same router. Here is the topology::
+-------------------------+ +-------------------------+
|OpenStack1 | |OpenStack2 |
| +------+ +--------+ | +------------+ | +--------+ +------+ |
| | | | IP1| | | | | |IP2 | | | |
| | net1 +---+ R1 XXXXXXX bridge net XXXXXXX R2 +---+ net2 | |
| | | | | | | | | | | | | |
| +------+ +--------+ | +-----+------+ | +--------+ +------+ |
| | | | |
+-------------------------+ | +-------------------------+
|
|
+----------------------|---------------------------------+
|OpenStack3 | |
| | |
| +------+ +---+----+ +--------------+ |
| | | | IP3 | | | |
| | net3 +----+ R3 XXXXXXXX external net | |
| | | | | | | |
| +------+ +--------+ +--------------+ |
| |
+--------------------------------------------------------+
router interface: -----
router gateway: XXXXX
IPn: router gateway ip or router interface ip
Fig 4
The limitation of this solution is that R3 needs to be set as non-DVR mode.
As discussed above, for a network attached to a DVR mode router, the router
interfaces of this network will be created on all the hosts where the router
namespaces are created. Since these interfaces all have the same IP and MAC,
packets sent between instances (virtual machines, containers or bare metal)
can't be directly wrapped in VxLAN packets, otherwise packets sent from
different hosts would have the same MAC. The way Neutron solves this problem is
to introduce DVR MACs, which are allocated by the Neutron server and assigned
to each host hosting a DVR mode router. Before the packets are wrapped in VxLAN
packets, their source MAC is replaced by the DVR MAC of the host. If R3 were
DVR mode, the source MAC of packets sent from net3 to the bridge network would
be changed, but after the packets reach R1 or R2, R1 and R2 don't recognize the
DVR MAC, so the packets are dropped.
Likewise, extra routes and gateway IPs are configured to build the connection::
routes of R1: net2 via IP2
net3 via IP3
external gateway ip of R1: IP3
(IP2 and IP3 are from bridge net, so routes will only be created in
SNAT namespace)
routes of R2: net1 via IP1
net3 via IP3
external gateway ip of R2: IP3
(IP1 and IP3 are from bridge net, so routes will only be created in
SNAT namespace)
routes of R3: net1 via IP1
net2 via IP2
external gateway ip of R3: real-external-gateway-ip
(non-DVR mode, routes will all be created in the router namespace)
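A sketch of how R3 in Fig 4 could be set up is shown below, assuming an
authenticated neutronclient instance with admin rights (needed for the
distributed flag) and illustrative ids; the extra routes list mirrors the
configuration above::

    def wire_router_r3(neutron, bridge_subnet_id, ext_net_id, routes):
        # Create R3 in non-DVR (legacy) mode.
        r3 = neutron.create_router(
            {'router': {'name': 'R3', 'distributed': False}})['router']
        # Attach the bridge network as an internal network (router interface).
        neutron.add_interface_router(r3['id'], {'subnet_id': bridge_subnet_id})
        # Use the real external network as the router gateway (SNAT enabled).
        neutron.add_gateway_router(r3['id'], {'network_id': ext_net_id})
        # routes, e.g.:
        #   [{'destination': net1_cidr, 'nexthop': ip1},
        #    {'destination': net2_cidr, 'nexthop': ip2}]
        neutron.update_router(r3['id'], {'router': {'routes': routes}})
        return r3['id']
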
The real external network can be deployed in one dedicated OpenStack cloud. In
that case, there is no need to run services like Nova and Cinder in that cloud:
instances and volumes will not be provisioned there, and only the Neutron
service is required. Then the above two topologies transform into the same
one::
+-------------------------+ +-------------------------+
|OpenStack1 | |OpenStack2 |
| +------+ +--------+ | +------------+ | +--------+ +------+ |
| | | | IP1| | | | | |IP2 | | | |
| | net1 +---+ R1 XXXXXXX bridge net XXXXXXX R2 +---+ net2 | |
| | | | | | | | | | | | | |
| +------+ +--------+ | +-----+------+ | +--------+ +------+ |
| | | | |
+-------------------------+ | +-------------------------+
|
|
+-----------|-----------------------------------+
|OpenStack3 | |
| | |
| | +--------+ +--------------+ |
| | |IP3 | | | |
| +---+ R3 XXXXXX external net | |
| | | | | |
| +--------+ +--------------+ |
| |
+-----------------------------------------------+
Fig 5
The motivation for putting the real external network in a dedicated OpenStack
cloud is to simplify management of the real external network, and also to
separate the real external network from the internal networking area for better
security control.
Discussion
----------
The implementation of DVR does bring some restrictions to our cross-Neutron
layer-2 and layer-3 networking, resulting in the limitations of the above two
proposals. In the first proposal, if the real external network is deployed
together with internal networks in the same OpenStack cloud, one extra router
is needed in that cloud. Also, since one of the routers is DVR mode and the
other is non-DVR mode, we need to deploy at least two L3 agents, one in
dvr_snat mode and the other in legacy mode. The limitation of the second
proposal is that the router is non-DVR mode, so both east-west and north-south
traffic go through the router namespace on the network node.
Also, cross-Neutron layer-2 networking cannot work with DVR because of the
source MAC replacement. Consider the following topology::
+----------------------------------------------+ +-------------------------------+
|OpenStack1 | |OpenStack2 |
| +-----------+ +--------+ +-----------+ | | +--------+ +------------+ |
| | | | | | | | | | | | | |
| | net1 +---+ R1 +---+ net2 | | | | R2 +---+ net2 | |
| | Instance1 | | | | Instance2 | | | | | | Instance3 | |
| +-----------+ +--------+ +-----------+ | | +--------+ +------------+ |
| | | |
+----------------------------------------------+ +-------------------------------+
Fig 6
net2 supports cross-Neutron layer-2 networking, so instances in net2 can be
created in both OpenStack clouds. If the router that net1 and net2 are
connected to is DVR mode, when Instance1 pings Instance2, the packets are
routed locally and exchanged via a VxLAN tunnel. Source MAC replacement is
handled correctly inside OpenStack1. But when Instance1 tries to ping
Instance3, OpenStack2 does not recognize the DVR MAC from OpenStack1, so the
connection fails. Therefore, only local type networks can be attached to a DVR
mode router.
Cross-Neutron layer-2 networking and DVR may co-exist after we address the DVR
MAC recognition problem (we will start a discussion about this problem in the
Neutron community) or introduce an l2 gateway. Actually, this bridge network
approach is just one possible implementation; in the near future we are
considering providing a mechanism to let an SDN controller plug in, in which
case DVR and the bridge network may not be needed.
Given the above limitations, can our proposal support the major user scenarios?
Considering whether the tenant network and router are local or cross Neutron
servers, we divide the user scenarios into four categories. For the scenario of
a cross-Neutron router, we use the proposal shown in Fig 3 in our discussion.
Local Network and Local Router
------------------------------
Topology::
+-----------------+ +-----------------+
|OpenStack1 | |OpenStack2 |
| | | |
| ext net1 | | ext net2 |