Container pinning on worker nodes and All-in-one servers

This story pins the infrastructure and openstack pods to the
platform cores on worker nodes and All-in-one (AIO) servers.

This configures the systemd system.conf parameter
CPUAffinity=<platform_cpus> by generating
/etc/systemd/system.conf.d/platform-cpuaffinity.conf, so that all
services launch their tasks with the appropriate CPU affinity.
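
A hypothetical rendering of that drop-in, assuming platform cores 0-1
and that the range string is substituted verbatim (the exact value and
format depend on the platform_cpu_list template variable and the
installed systemd version):

 [Manager]
 CPUAffinity=0-1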

This creates the cgroup called 'k8s-infra' for the following subset
of controllers ('cpuacct', 'cpuset', 'cpu', 'memory', 'systemd').
This configures custom cpuset.cpus (i.e., cpuset) and cpuset.mems
(i.e., nodeset) based on the sysinv platform configurable cores. These
values are generated by puppet from sysinv host cpu information and
are stored in the following hieradata variables (example values below):
- platform::kubernetes::params::k8s_cpuset
- platform::kubernetes::params::k8s_nodeset
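
For example, on a hypothetical openstack worker with platform cores
0-1 on NUMA node 0, the generated hieradata could look like:

 platform::kubernetes::params::k8s_cpuset: "0-1"
 platform::kubernetes::params::k8s_nodeset: "0"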

This creates the cgroup called 'machine.slice' for the 'cpuset'
controller and sets its cpuset.cpus and cpuset.mems to the parent
values. This prevents VMs from inheriting the cpuset of the
kubernetes libvirt pod, since machine.slice would otherwise be
created with that cpuset when the first VM is launched.

Note: systemd automatically mounts cgroups and all available
resource controllers, so the new puppet code does not need to do
that.

Kubelet is now launched with --cgroup-root /k8s-infra by configuring
kubeadm.yaml with the option cgroupRoot: "/k8s-infra".
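
For reference, the relevant portion of the kubelet settings in
kubeadm.yaml then looks roughly like this (indentation assumed; see
the template change in the diff below):

 failSwapOn: false
 featureGates:
   HugePages: false
 cgroupRoot: "/k8s-infra"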

For openstack-based worker nodes, including AIO
(i.e., hosts with the label openstack-compute-node=enabled):
- the k8s cpuset and nodeset include the assigned platform cores

For non-openstack-based worker nodes, including AIO:
- the k8s cpuset and nodeset include all cpus except the assigned
  platform cores. This will be refined in a later update, since the
  cpusets of the k8s infrastructure need to be isolated from other
  pods. (A worked example follows.)
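
As a hypothetical illustration, consider a host with cpus 0-23
(node 0: cpus 0-11, node 1: cpus 12-23) and platform cores 0-1:
- openstack worker: k8s_cpuset = "0-1", k8s_nodeset = "0"
- non-openstack worker: k8s_cpuset = "2-23", k8s_nodeset = "0-1"
  (node 0 still holds the non-platform cpus 2-11)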

The cpuset topology can be viewed with the following:
 sudo systemd-cgls cpuset

The task cpu affinity can be verified with the following:
 ps-sched.sh

The dynamic affining of platform tasks during start-up is disabled;
that code requires cleanup and is likely no longer required now that
systemd CPUAffinity and cgroups are used.

This includes a few small fixes to enable testing of this feature:
- facter platform_res_mem was updated to no longer require 'memtop',
  since that depends on the existence of NUMA nodes. This was failing
  in a QEMU environment when the host has no NUMA nodes, which occurs
  when no CPU topology is specified.
- cpumap_functions.sh parameter defaults were updated so that calling
  bash scripts may enable 'set -u' undefined-variable checking.
- the generation of platform_cpu_list did not include all threads.
- the inline cpulist-to-ranges code was incorrect; in certain
  scenarios the rstrip(',') would strip the wrong commas (a sketch of
  the range conversion follows).
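
For illustration, here is a minimal sketch of the kind of
cpulist-to-ranges conversion that the new utils.format_range_set()
calls centralize (the actual sysinv helper may differ in detail);
building complete range tokens and joining them once avoids the
rstrip(',') pitfall:

  import itertools

  def format_range_set(cpuset):
      # Convert a set of ints, e.g. {0, 1, 2, 5}, to the range string "0-2,5".
      ranges = []
      for _, group in itertools.groupby(enumerate(sorted(cpuset)),
                                        lambda xy: xy[1] - xy[0]):
          group = list(group)
          first, last = group[0][1], group[-1][1]
          ranges.append(str(first) if first == last
                        else "%d-%d" % (first, last))
      return ','.join(ranges)

  print(format_range_set({0, 1, 2, 5}))  # prints "0-2,5"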

Story: 2004762
Task: 28879

Change-Id: I6fd21bac59fc2d408132905b88710da48aa8d928
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
Jim Gauld 2019-03-28 14:26:24 -04:00
parent 61fcb0b9b7
commit 209e346ab4
13 changed files with 265 additions and 72 deletions

View File

@@ -1,3 +1,12 @@
# Platform reserved memory is the total normal memory (i.e. 4K memory) that
# may be allocated by programs in MiB. This total excludes huge-pages and
# kernel overheads.
#
# The 'MemAvailable' field represents total unused memory. This includes:
# free, buffers, cached, and reclaimable slab memory.
#
# The Active(anon) and Inactive(anon) fields represent the total used
# anonymous memory.
Facter.add(:platform_res_mem) do
setcode "memtop | awk 'FNR == 3 {a=$13+$14} END {print a}'"
setcode "grep -e '^MemAvailable:' -e '^Active(anon):' -e '^Inactive(anon):' /proc/meminfo | awk '{a+=$2} END{print int(a/1024)}'"
end

View File

@@ -15,6 +15,16 @@ class platform::compute::config
replace => true,
content => template('platform/worker_reserved.conf.erb')
}
file { '/etc/systemd/system.conf.d/platform-cpuaffinity.conf':
ensure => 'present',
replace => true,
content => template('platform/systemd-system-cpuaffinity.conf.erb')
}
}
class platform::compute::config::runtime {
include ::platform::compute::config
}
class platform::compute::grub::params (
@@ -307,6 +317,37 @@ class platform::compute::pmqos (
}
}
# Set systemd machine.slice cgroup cpuset to be used with VMs,
# and configure this cpuset to span all logical cpus and numa nodes.
# NOTES:
# - The parent directory cpuset spans all online cpus and numa nodes.
# - Setting the machine.slice cpuset prevents this from inheriting
# kubernetes libvirt pod's cpuset, since machine.slice cgroup will be
# created when a VM is launched if it does not already exist.
# - systemd automatically mounts cgroups and controllers, so don't need
# to do that here.
class platform::compute::machine {
$parent_dir = '/sys/fs/cgroup/cpuset'
$parent_mems = "${parent_dir}/cpuset.mems"
$parent_cpus = "${parent_dir}/cpuset.cpus"
$machine_dir = "${parent_dir}/machine.slice"
$machine_mems = "${machine_dir}/cpuset.mems"
$machine_cpus = "${machine_dir}/cpuset.cpus"
notice("Create ${machine_dir}")
file { $machine_dir :
ensure => directory,
owner => 'root',
group => 'root',
mode => '0700',
}
-> exec { "Create ${machine_mems}" :
command => "/bin/cat ${parent_mems} > ${machine_mems}",
}
-> exec { "Create ${machine_cpus}" :
command => "/bin/cat ${parent_cpus} > ${machine_cpus}",
}
}
class platform::compute {
Class[$name] -> Class['::platform::vswitch']
@@ -316,5 +357,6 @@ class platform::compute {
require ::platform::compute::allocate
require ::platform::compute::pmqos
require ::platform::compute::resctrl
require ::platform::compute::machine
require ::platform::compute::config
}

View File

@@ -6,12 +6,90 @@ class platform::kubernetes::params (
$etcd_endpoint = undef,
$service_domain = undef,
$dns_service_ip = undef,
$host_labels = [],
$ca_crt = undef,
$ca_key = undef,
$sa_key = undef,
$sa_pub = undef,
$k8s_cpuset = undef,
$k8s_nodeset = undef,
) { }
class platform::kubernetes::cgroup::params (
$cgroup_root = '/sys/fs/cgroup',
$cgroup_name = 'k8s-infra',
$controllers = ['cpuset', 'cpu', 'cpuacct', 'memory', 'systemd'],
) {}
class platform::kubernetes::cgroup
inherits ::platform::kubernetes::cgroup::params {
include ::platform::kubernetes::params
$k8s_cpuset = $::platform::kubernetes::params::k8s_cpuset
$k8s_nodeset = $::platform::kubernetes::params::k8s_nodeset
# Default to float across all cpus and numa nodes
if !defined('$k8s_cpuset') {
$k8s_cpuset = generate('/bin/cat', '/sys/devices/system/cpu/online')
notice("System default cpuset ${k8s_cpuset}.")
}
if !defined('$k8s_nodeset') {
$k8s_nodeset = generate('/bin/cat', '/sys/devices/system/node/online')
notice("System default nodeset ${k8s_nodeset}.")
}
# Create kubelet cgroup for the minimal set of required controllers.
# NOTE: The kubernetes cgroup_manager_linux func Exists() checks that
# specific subsystem cgroup paths actually exist on the system. The
# particular cgroup cgroupRoot must exist for the following controllers:
# "cpu", "cpuacct", "cpuset", "memory", "systemd".
# Reference:
# https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cgroup_manager_linux.go
# systemd automatically mounts cgroups and controllers, so don't need
# to do that here.
notice("Create ${cgroup_root}/${controllers}/${cgroup_name}")
$controllers.each |String $controller| {
$cgroup_dir = "${cgroup_root}/${controller}/${cgroup_name}"
file { $cgroup_dir :
ensure => directory,
owner => 'root',
group => 'root',
mode => '0700',
}
# Modify k8s cpuset resources to reflect platform configured cores.
# NOTE: Using 'exec' here instead of 'file' resource type with 'content'
# tag to update contents under /sys, since puppet tries to create files
# with temp names in the same directory, and the kernel only allows
# specific filenames to be created in these particular directories.
# This causes puppet to fail if we use the 'content' tag.
# NOTE: A child cgroup's cpuset must be a subset of its parent's. If
# child directories already exist and we change the parent's cpuset to
# be a subset of what the children have, the command fails with
# "-bash: echo: write error: device or resource busy".
if $controller == 'cpuset' {
$cgroup_mems = "${cgroup_dir}/cpuset.mems"
$cgroup_cpus = "${cgroup_dir}/cpuset.cpus"
$cgroup_tasks = "${cgroup_dir}/tasks"
notice("Set ${cgroup_name} nodeset: ${k8s_nodeset}, cpuset: ${k8s_cpuset}")
File[ $cgroup_dir ]
-> exec { "Create ${cgroup_mems}" :
command => "/bin/echo ${k8s_nodeset} > ${cgroup_mems} || :",
}
-> exec { "Create ${cgroup_cpus}" :
command => "/bin/echo ${k8s_cpuset} > ${cgroup_cpus} || :",
}
-> file { $cgroup_tasks :
ensure => file,
owner => 'root',
group => 'root',
mode => '0644',
}
}
}
}
class platform::kubernetes::kubeadm {
include ::platform::docker::params
@@ -276,6 +354,7 @@ class platform::kubernetes::master
inherits ::platform::kubernetes::params {
contain ::platform::kubernetes::kubeadm
contain ::platform::kubernetes::cgroup
contain ::platform::kubernetes::master::init
contain ::platform::kubernetes::firewall
@@ -285,6 +364,7 @@ class platform::kubernetes::master
# kubeadm init is run.
Class['::platform::dns'] -> Class[$name]
Class['::platform::kubernetes::kubeadm']
-> Class['::platform::kubernetes::cgroup']
-> Class['::platform::kubernetes::master::init']
-> Class['::platform::kubernetes::firewall']
}
@@ -338,10 +418,17 @@ class platform::kubernetes::worker
# will already be configured and includes support for running pods.
if $::personality != 'controller' {
contain ::platform::kubernetes::kubeadm
contain ::platform::kubernetes::cgroup
contain ::platform::kubernetes::worker::init
Class['::platform::kubernetes::kubeadm']
-> Class['::platform::kubernetes::cgroup']
-> Class['::platform::kubernetes::worker::init']
} else {
# Reconfigure cgroups cpusets on AIO
contain ::platform::kubernetes::cgroup
Class['::platform::kubernetes::cgroup']
}
file { '/var/run/.disable_worker_services':

View File

@@ -34,3 +34,4 @@ nodeStatusUpdateFrequency: "4s"
failSwapOn: false
featureGates:
HugePages: false
cgroupRoot: "/k8s-infra"

View File

@@ -0,0 +1,3 @@
[Manager]
CPUAffinity=<%= @platform_cpu_list %>

View File

@@ -7581,7 +7581,8 @@ class ConductorManager(service.PeriodicService):
config_dict = {
"personalities": personalities,
"host_uuids": [host_uuid],
"classes": ['platform::compute::grub::runtime']
"classes": ['platform::compute::grub::runtime',
'platform::compute::config::runtime']
}
self._config_apply_runtime_manifest(context, config_uuid,
config_dict,

View File

@@ -33,7 +33,7 @@ class LibvirtHelm(openstack.OpenstackBaseHelm):
'qemu': {
'user': "root",
'group': "root",
'cgroup_controllers': ["cpu", "cpuacct"],
'cgroup_controllers': ["cpu", "cpuacct", "cpuset"],
'namespaces': [],
'clear_emulator_capabilities': 0
}

View File

@@ -13,6 +13,7 @@ from sqlalchemy.orm.exc import NoResultFound
from sysinv.common import constants
from sysinv.common import utils
from sysinv.common import exception
from sysinv.helm import common as helm_common
from sysinv.puppet import quoted_str
@@ -268,3 +269,12 @@ class BasePuppet(object):
return str(address)
except netaddr.AddrFormatError:
return address
# TODO (jgauld): Refactor to use utility has_openstack_compute(labels)
def is_openstack_compute(self, host):
if self.dbapi is None:
return False
for obj in self.dbapi.label_get_by_host(host.id):
if helm_common.LABEL_COMPUTE_LABEL == obj.label_key:
return True
return False

View File

@@ -10,6 +10,7 @@ import subprocess
from sysinv.common import constants
from sysinv.common import exception
from sysinv.common import utils
from sysinv.openstack.common import log as logging
from sysinv.puppet import base
@@ -69,6 +70,13 @@ class KubernetesPuppet(base.BasePuppet):
def get_host_config(self, host):
config = {}
# Retrieve labels for this host
config.update(self._get_host_label_config(host))
# Update cgroup resource controller parameters for this host
config.update(self._get_host_k8s_cgroup_config(host))
if host.personality != constants.WORKER:
return config
@@ -131,3 +139,55 @@ class KubernetesPuppet(base.BasePuppet):
def _get_dns_service_ip(self):
# Setting this to a constant for now. Will be configurable later
return constants.DEFAULT_DNS_SERVICE_IP
def _get_host_label_config(self, host):
config = {}
labels = self.dbapi.label_get_by_host(host.uuid)
host_label_keys = []
for label in labels:
host_label_keys.append(label.label_key)
config.update(
{'platform::kubernetes::params::host_labels': host_label_keys})
return config
def _get_host_k8s_cgroup_config(self, host):
config = {}
# determine set of all logical cpus and nodes
host_cpus = self._get_host_cpu_list(host, threads=True)
host_cpuset = set([c.cpu for c in host_cpus])
host_nodeset = set([c.numa_node for c in host_cpus])
# determine set of platform logical cpus and nodes
platform_cpus = self._get_host_cpu_list(
host, function=constants.PLATFORM_FUNCTION, threads=True)
platform_cpuset = set([c.cpu for c in platform_cpus])
platform_nodeset = set([c.numa_node for c in platform_cpus])
# determine set of nonplatform logical cpus and nodes
nonplatform_cpuset = host_cpuset - platform_cpuset
nonplatform_nodeset = set()
for c in host_cpus:
if c.cpu not in platform_cpuset:
nonplatform_nodeset.update([c.numa_node])
if constants.WORKER in utils.get_personalities(host):
if self.is_openstack_compute(host):
k8s_cpuset = utils.format_range_set(platform_cpuset)
k8s_nodeset = utils.format_range_set(platform_nodeset)
else:
k8s_cpuset = utils.format_range_set(nonplatform_cpuset)
k8s_nodeset = utils.format_range_set(nonplatform_nodeset)
else:
k8s_cpuset = utils.format_range_set(host_cpuset)
k8s_nodeset = utils.format_range_set(host_nodeset)
LOG.debug('host:%s, k8s_cpuset:%s, k8s_nodeset:%s',
host.hostname, k8s_cpuset, k8s_nodeset)
config.update(
{'platform::kubernetes::params::k8s_cpuset': k8s_cpuset,
'platform::kubernetes::params::k8s_nodeset': k8s_nodeset,
})
return config

View File

@@ -4,10 +4,7 @@
# SPDX-License-Identifier: Apache-2.0
#
import copy
import itertools
import os
import operator
from sysinv.common import constants
from sysinv.common import exception
@@ -552,16 +549,9 @@ class PlatformPuppet(base.BasePuppet):
if not host_cpus:
return config
# Define the full range of CPUs for the compute host
max_cpu = max(host_cpus, key=operator.attrgetter('cpu'))
worker_cpu_list = "\"0-%d\"" % max_cpu.cpu
platform_cpus_no_threads = self._get_platform_cpu_list(host)
vswitch_cpus_no_threads = self._get_vswitch_cpu_list(host)
platform_cpu_list_with_quotes = \
"\"%s\"" % ','.join([str(c.cpu) for c in platform_cpus_no_threads])
platform_numa_cpus = utils.get_numa_index_list(platform_cpus_no_threads)
vswitch_numa_cpus = utils.get_numa_index_list(vswitch_cpus_no_threads)
@@ -582,69 +572,53 @@ class PlatformPuppet(base.BasePuppet):
reserved_platform_cores = "(%s)" % ' '.join(platform_cores)
reserved_vswitch_cores = "(%s)" % ' '.join(vswitch_cores)
host_cpus = sorted(host_cpus, key=lambda c: c.cpu)
n_cpus = len(host_cpus)
host_cpu_list = [c.cpu for c in host_cpus]
# all logical cpus
host_cpus = self._get_host_cpu_list(host, threads=True)
host_cpuset = set([c.cpu for c in host_cpus])
host_ranges = utils.format_range_set(host_cpuset)
n_cpus = len(host_cpuset)
# platform logical cpus
platform_cpus = self._get_host_cpu_list(
host, function=constants.PLATFORM_FUNCTION, threads=True)
platform_cpus = sorted(platform_cpus, key=lambda c: c.cpu)
platform_cpu_list = \
"%s" % ','.join([str(c.cpu) for c in platform_cpus])
platform_cpuset = set([c.cpu for c in platform_cpus])
platform_ranges = utils.format_range_set(platform_cpuset)
# vswitch logical cpus
vswitch_cpus = self._get_host_cpu_list(
host, constants.VSWITCH_FUNCTION, threads=True)
vswitch_cpus = sorted(vswitch_cpus, key=lambda c: c.cpu)
vswitch_cpu_list = \
"%s" % ','.join([str(c.cpu) for c in vswitch_cpus])
vswitch_cpuset = set([c.cpu for c in vswitch_cpus])
vswitch_ranges = utils.format_range_set(vswitch_cpuset)
# rcu_nocbs = all cores - platform cores
rcu_nocbs = copy.deepcopy(host_cpu_list)
for i in [int(s) for s in platform_cpu_list.split(',')]:
rcu_nocbs.remove(i)
# non-platform logical cpus
rcu_nocbs_cpuset = host_cpuset - platform_cpuset
rcu_nocbs_ranges = utils.format_range_set(rcu_nocbs_cpuset)
# change the CPU list to ranges
rcu_nocbs_ranges = ""
for key, group in itertools.groupby(enumerate(rcu_nocbs),
lambda xy: xy[1] - xy[0]):
group = list(group)
rcu_nocbs_ranges += "%s-%s," % (group[0][1], group[-1][1])
rcu_nocbs_ranges = rcu_nocbs_ranges.rstrip(',')
# non-vswitch CPUs = all cores - vswitch cores
non_vswitch_cpus = host_cpu_list
for i in [c.cpu for c in vswitch_cpus]:
non_vswitch_cpus.remove(i)
# change the CPU list to ranges
non_vswitch_cpus_ranges = ""
for key, group in itertools.groupby(enumerate(non_vswitch_cpus),
lambda xy: xy[1] - xy[0]):
group = list(group)
non_vswitch_cpus_ranges += "\"%s-%s\"," % (group[0][1], group[-1][1])
# non-vswitch logical cpus
non_vswitch_cpuset = host_cpuset - vswitch_cpuset
non_vswitch_ranges = utils.format_range_set(non_vswitch_cpuset)
cpu_options = ""
if constants.LOWLATENCY in host.subfunctions:
vswitch_cpu_list_with_quotes = \
"\"%s\"" % ','.join([str(c.cpu) for c in vswitch_cpus])
config.update({
'platform::compute::pmqos::low_wakeup_cpus':
vswitch_cpu_list_with_quotes,
"\"%s\"" % vswitch_ranges,
'platform::compute::pmqos::hight_wakeup_cpus':
non_vswitch_cpus_ranges.rstrip(',')})
vswitch_cpu_list = rcu_nocbs_ranges
cpu_options += "nohz_full=%s " % vswitch_cpu_list
"\"%s\"" % non_vswitch_ranges,
})
vswitch_ranges = rcu_nocbs_ranges
cpu_options += "nohz_full=%s " % vswitch_ranges
cpu_options += "isolcpus=%s rcu_nocbs=%s kthread_cpus=%s " \
"irqaffinity=%s" % (vswitch_cpu_list,
"irqaffinity=%s" % (vswitch_ranges,
rcu_nocbs_ranges,
platform_cpu_list,
platform_cpu_list)
platform_ranges,
platform_ranges)
config.update({
'platform::compute::params::worker_cpu_list':
worker_cpu_list,
"\"%s\"" % host_ranges,
'platform::compute::params::platform_cpu_list':
platform_cpu_list_with_quotes,
"\"%s\"" % platform_ranges,
'platform::compute::params::reserved_vswitch_cores':
reserved_vswitch_cores,
'platform::compute::params::reserved_platform_cores':
@@ -660,8 +634,8 @@ class PlatformPuppet(base.BasePuppet):
host_memory = self.dbapi.imemory_get_by_ihost(host.id)
memory_numa_list = utils.get_numa_index_list(host_memory)
platform_cpus = self._get_platform_cpu_list(host)
platform_cpu_count = len(platform_cpus)
platform_cpus_no_threads = self._get_platform_cpu_list(host)
platform_core_count = len(platform_cpus_no_threads)
platform_nodes = []
vswitch_nodes = []
@@ -684,7 +658,7 @@ class PlatformPuppet(base.BasePuppet):
platform_size = memory.platform_reserved_mib
platform_node = "\"node%d:%dMB:%d\"" % (
node, platform_size, platform_cpu_count)
node, platform_size, platform_core_count)
platform_nodes.append(platform_node)
vswitch_size = memory.vswitch_hugepages_size_mib

View File

@@ -34,12 +34,15 @@ function affine_tasks {
# Affine non-kernel-thread tasks (excluded [kthreadd] and its children) to all available
# cores. They will be reaffined to platform cores later on as part of nova-compute
# launch.
log_debug "Affining all tasks to all available CPUs..."
affine_tasks_to_all_cores
RET=$?
if [ $RET -ne 0 ]; then
log_error "Some tasks failed to be affined to all cores."
fi
##log_debug "Affining all tasks to all available CPUs..."
# TODO: Should revisit this since this leaves a few lingering floating
# tasks and does not really work with cgroup cpusets.
# Comment out for now. Cleanup required.
##affine_tasks_to_all_cores
##RET=$?
##if [ $RET -ne 0 ]; then
## log_error "Some tasks failed to be affined to all cores."
##fi
# Get number of logical cpus
N_CPUS=$(cat /proc/cpuinfo 2>/dev/null | \

View File

@@ -26,8 +26,11 @@ start ()
log "Initial Configuration incomplete. Skipping affining tasks."
exit 0
fi
affine_tasks_to_platform_cores
[[ $? -eq 0 ]] && log "Tasks re-affining done." || log "Tasks re-affining failed."
# TODO: Should revisit this since this leaves a few lingering floating
# tasks and does not really work with cgroup cpusets.
# Comment out for now. Cleanup required.
##affine_tasks_to_platform_cores
##[[ $? -eq 0 ]] && log "Tasks re-affining done." || log "Tasks re-affining failed."
}
stop ()

View File

@@ -32,8 +32,8 @@ function expand_sequence {
# Append a string to comma separated list string
################################################################################
function append_list {
local PUSH=$1
local LIST=$2
local PUSH=${1-}
local LIST=${2-}
if [ -z "${LIST}" ]; then
LIST=${PUSH}
else
@@ -179,8 +179,8 @@ function invert_cpulist {
#
################################################################################
function in_list {
local item="$1"
local list="$2"
local item="${1-}"
local list="${2-}"
# expand list format 0-3,8-11 to a full sequence {0..3} {8..11}
local exp_list