neutron-specs/specs/xena/node-local-ip.rst

18 KiB

Node-Local virtual IP address (Local IP)

RFE: https://bugs.launchpad.net/neutron/+bug/1930200

This spec proposes a new type of shared virtual IP address that can be used to access distributed and/or multi-node services and applications running in the cloud. The feature is primarily focused on high efficiency and performance of the networking data plane for very large scale clouds and/or clouds with high network throughput demands.

Problem Description

In a topology where several workloads access a shared resource over network obvious bottlenecks are the network and the shared resource itself (or a loadbalancer):

image

Proposed feature suggests to solve this bottleneck issue by deploying local replicas/caches of the shared service and re-translating local workloads` requests to these replicas transparently. Transparency is achieved by assigning same IP address to local replicas as on main shared resource.

image

Primary use-cases for this feature are listed in the RFE description. While the proposed virtual IP may be used on its own for simple use cases, it is mainly intended to serve as a basic building block and should be used in conjunction with other services to provide full-fledged service discovery, load balancing, proxying and other advanced features.

Proposed Change

Add a new Node Local IP (or just Local IP) resource - a virtual IP that can be shared across multiple ports/VMs (similar to anycast IP) and is guaranteed to only be reachable within the same physical server/node boundaries. The mechanism is similar to Floating IP, with the difference that Local IP could be assigned to several fixed IPs (on different nodes) simultaneously and that all address translation happens early/locally on the node's integration bridge.

The scope of this feature is limited to:

  • API and CLI to provide fine grained control over allocation and assignment of the Local IP addresses
  • Networking data plane mechanism to efficiently redirect traffic to a port within the same physical node (if present) or to optionally forward it to a fallback destination
  • Local IP will only be accessible within the same L2 segment

Local IP object will be composed of the following objects & attributes:

  • Underlying Neutron port representing Local IP
  • Collection of associated Neutron ports that will serve as packets' destination when the packet source port is located on the same physical node
  • IP mode for local ports:
    • 'translate' (default) - Local IP will behave as DNAT for redirected packets
    • 'passthrough' - packets will be redirected without modification to IP headers; destination VM's guest OS will be responsible for configuring Local IP's IP address on a corresponding port interface. This is similar to current allowed address pair mechanism with the benefit of fine grained control over IP allocation and assignment
  • Fallback policy:
    • 'forward' (default) - to redirect packets to an underlying Neutron port IP address if no Local IP assigned port is available on the physical node
    • 'drop' - packets will be dropped if no Local IP assigned port is available on the physical node - out of scope for now
    • 'reject' - similar to iptables' 'reject' policy to help clients fail fast if no Local IP assigned port is available on the physical node -out of scope for now

REST API Impact

New Local IP API resource

Attribute Type Req CRUD Description
id uuid-str Yes R Unique identifier for the Local IP object.
name String No CRU Human readable name for the Local IP (255 characters limit). Does not have to be unique.
description String No CRU Human readable description for the Local IP (255 characters limit).
project_id uuid-str No CR Owner of the Local IP.
local_ip String No CR Local IP CIDR (virtual) that will be reachable within the same physical server/node. Required when provided local_port_id's Port has several fixed IPs
local_port_id uuid-str Yes CR Underlying (backup) Neutron port ID used by Local IP object to get actual IP address for local translation. If port has several fixed IPs -local_ip should be provided as well.
network_id uuid-str Yes CR ID of network from which to allocate Local IPs underlying Neutron port. Required when no local_port_id was provided upon creation.
ip_mode String No CRU One of 'translate' (for DNAT) or 'passthrough' (no NAT) modes described above. Default: 'translate'

New Local IP Association API resource

Attribute Type Req CRUD Description
local_ip_id uuid-str Yes CR Local IP id
fixed_port_id uuid-str Yes CR ID of associated Neutron fixed port
fixed_ip String No CR Exact port's fixed IP address that is associated with Local IP
host List No R Host where associated port is bound

Basically new API will be similar to existing Floating IP API with some differences:

  • Local IP could be associated with multiple fixed ports at the same time.
  • Local IP may be created by providing an existing Neutron port which will be used to obtain the IP address and may serve as a fallback destination
  • Creating Local IP without explicitly specified port will trigger creation of an underlying Neutron port to handle IPAM, network boundaries and permission checks. In this case network ID should be specified upon creation
  • Deletion of the Local IP underlying fallback port will be prohibited by the Neutron server as long as a Local IP associated with such port exists
  • Manual underlying Neutron port updates will be ignored by parent Local IP (similar to current Floating IP behavior)
  • Deleting Local IP should be prohibited if Local IP has associated local ports; 'force' option may be provided to trigger local port disassociation on deletion; deleting Local IP should also delete the underlying Neutron port representing Local IP if this port was created specifically for Local IP and did not exist before
  • IPAM and IP validation will be handled automatically by underlying Port
  • Local IP will behave similar to Port and Floating IP API in terms of user/admin access rights
  • Local IP may have quotas similar to Floating IP API (out of scope for now)

Data Model Impact

The following are the backend database tables for the REST API proposed above.


Local IPs
Attribute Type Req Description
id (PK) uuid-str Yes Unique identifier for the Local IP object.
name string No Human readable name for the Local IP (255 characters limit). Does not have to be unique.
project_id uuid-str No Owner of the Local IP. Only admin users can specify a project identifier other than their own.
local_ip string Yes Local IP CIDR (virtual) that will be reachable within the same physical server/node.
local_port_id (FK) uuid-str Yes Underlying (backup) Neutron port ID used by Local IP object to get actual IP address for local NATting
ip_mode string No One of 'translate' (for DNAT) or 'passthrough' (no NAT) modes described above. Default: 'translate'

Local IP associations
Attribute Type Req Description
local_ip_id (PK) uuid-str Yes UUID of a Local IP
fixed_port_id (PK) uuid-str Yes UUID of a port which will serve Local IP requests on this port's host
fixed_ip String Yes Exact port's fixed IP address that is associated with Local IP

Client impact

  • New CLI and OpenStack client commands to create, update, delete, list and show Local IPs and to associate/disassociate fixed IP ports

OVS, ML2 Drivers impact

RPC interface between server and OVS agent will be updated to include info about Local IPs associated with agent's ports. Server will update port_details with this info.

OVS agent will perform following flow updates based on Local IPs info:

  • update ARP spoofing flow for associated port (like done for allowed address pairs)
  • identify network/local vlan of associated port
  • for each port from this network - redirect packets from table 0 or 24(ARP_SPOOF)/25(MAC_SPOOF) to a new table - 50 (LOCAL_EGRESS)

Below flow examples are for the case when fixed port (id=56fbc1e7-2c.., MAC=fa:16:3e:9a:1a:de, IP=10.0.0.51) is associated with Local IP 10.0.0.10

table=25, n_packets=0, n_bytes=0, priority=2,in_port="tap56fbc1e7-2c",
dl_src=fa:16:3e:9a:1a:de actions=resubmit(,50)
  • table 50(LOCAL_EGRESS) serves for memorizing local vlans to later distinguish Local IP traffic from different nets and redirect packets further to table 51(LOCAL_IP_HANDLE)
table=50, n_packets=0, n_bytes=0, priority=9,in_port="tap56fbc1e7-2c"
actions=load:0x4->NXM_NX_REG6[],resubmit(,51)
  • table 51(LOCAL_IP_HANDLE) has actual flows for Local IP handling
    • ARP responder flow to handle Local IP ARP requests from same subnet ports (fixed port and local IP associated with it are from the same IP subnet)

      table=51, n_packets=0, n_bytes=0, priority=1,arp,reg6=0x4,
      arp_tpa=10.0.0.10
      actions=load:0x2->NXM_OF_ARP_OP[],
      move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],
      move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],
      load:0xfa163e9a1ade->NXM_NX_ARP_SHA[],
      load:0x1010102->NXM_OF_ARP_SPA[],
      move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],
      mod_dl_src:fa:16:3e:9a:1a:de,IN_PORT
    • NAT flows to do actual Local IP address translation

      table=51, n_packets=0, n_bytes=0, priority=10,ip,reg6=0x4,
      nw_dst=10.0.0.10
      actions=ct(commit,table=60,zone=4,nat(dst=10.0.0.51))
      
      table=51, n_packets=0, n_bytes=0, priority=10,ct_state=-trk,ip,reg6=0x4,
      dl_src=fa:16:3e:9a:1a:de,nw_src=10.0.0.51 actions=ct(table=60,zone=4,nat)

      After translation packets are resubmitted further to TRANSIENT_TABLE (60)

      Alternatively static NAT could be used to avoid kernel conntrack: learn back flows with 7 tuple (src_mac, dest_mac, src_ip, dest_ip, protocol, src_protocol_port, dest_protocol_port)

      table=51, priority=30,tcp,dl_vlan=1,dl_dst=fa:16:3e:9a:1a:de,
      dl_src=fa:16:3e:4f:26:fd,nw_src=10.0.0.100,nw_dst=10.0.0.10,
      tp_dst=0x8000/0x8000
      actions=learn(table=62,idle_timeout=30,hard_timeout=1800,priority=90,
      eth_type=0x800,nw_proto=6,NXM_OF_ETH_SRC[]=NXM_OF_ETH_DST[],
      NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],NXM_OF_IP_SRC[]=NXM_OF_IP_DST[],
      NXM_OF_IP_DST[]=NXM_OF_IP_SRC[],NXM_OF_TCP_DST[]=NXM_OF_TCP_SRC[],
      NXM_OF_TCP_SRC[]=NXM_OF_TCP_DST[],NXM_OF_VLAN_TCI[0..11],
      load:NXM_NX_REG0[0..11]->NXM_OF_VLAN_TCI[0..11],
      output:NXM_OF_IN_PORT[]),set_field:10.0.0.10->ip_dst,goto_table:60

      This is usable in case of OVS offloading (e.g. SmartNICs). This will be made configurable on the ovs agent side.

    • Not matched (not related to Local IP) packets are transmitted further

      table=51, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,60)

Packet processing example:

VM1 <10.0.0.100> tries to access Local IP 10.0.0.10
Local IP is assigned to port with MAC_2/10.0.0.51:

ARP:
VM1 -> ARP -> 10.0.0.10 and response <10.0.0.10 has MAC_2>

Egress:
<VM1_MAC_1 + 10.0.0.100> to <MAC_2 + 10.0.0.10>
<<ct NAT>>
<VM1_MAC_1 + 10.0.0.100> to <MAC_2 + 10.0.0.51>

Ingress (back):
<MAC_2 + 10.0.0.51> to <VM1_MAC_1 + 10.0.0.100>
<<ct NAT>>
<MAC_2 + 10.0.0.10> to <VM1_MAC_1 + 10.0.0.100>

Scheduling

Local IP will have no effect on VM scheduling/placement. Local IP operates under assumption that at any given time any physical node in the cloud should have no more than one active port/VM associated with a given Local IP. Users are responsible to utilize OpenStack placement/scheduling features (e.g. anti-affinity rules) to avoid assigning multiple ports from the same physical node the the same Local IP.

However there might be valid cases when multiple Local IP's local ports may be colocated on the same physycal node (e.g. as a result of temporary live migration during other node's maintenance). Local IP should provide a deterministic way of handling such situations (e.g. in case of multiple local ports, only the oldest port shall be used).

Initial release limitations

  • Only IPv4 will be supported. IPv6 support will be considered in future releases
  • Only 'openvswitch' ML2 mechanism driver will support the feature
  • Only tunnel (VxLAN, GRE) networks will be supported. 'vlan' will be considered if require minimum overhead
  • No deterministic handling of packets if a node contains multiple local ports from same L2 segment associated with the same Local IP
  • no 'reject' and 'drop' fallback policies; only 'forward' will be supported

References