The CPU namespace aims to extend the current pool of namespaces in the kernel to isolate the system topology view from applications. The CPU namespace virtualizes CPU information by maintaining an internal translation from the namespace CPU to the logical CPU in the kernel. The CPU namespace will also enable the existing interfaces, such as sysfs/procfs, cgroupfs, and the sched_setaffinity()/sched_getaffinity() syscalls, to be context aware and divulge topology information based on the CPU namespace context that requests it.
Today, applications that run in containers have their CPU and memory limits enforced with the help of cgroups. However, many applications, legacy or otherwise, get their view of the system through sysfs/procfs and size resources, such as the number of threads/processes or memory allocations, based on that information. This can lead to unexpected runtime behavior as well as a high impact on performance.
The problem is not limited to the coherency of information. Cloud runtime environments request CPU runtime in millicores, which translates to using the CFS period and quota to limit CPU runtime in cgroups. However, applications generally operate in terms of threads, with little to no cognizance of the millicore limit or its connotation.
In addition to coherency issues, the current way of doing things also poses security and fair-use implications on a multi-tenant system, such as:
A case where an actor, aware of the CPU node topology, schedules workloads and selects CPUs such that the bus is flooded, causing a denial-of-service attack.
A case wherein identifying the CPU system topology helps identify cores that are close to buses and peripherals such as GPUs, to gain an undue latency advantage over the rest of the workloads.
Currently, all of the problems mentioned above can be mitigated with the use of lightweight VMs such as Kata Containers. With a CPU namespace, however, the isolation advantages provided by a Kata Container can be achieved without the overhead of a virtual machine.
The architecture of the CPU namespace is as follows:
The task struct links to nsproxy, which, as the name suggests, is a proxy of pointers to the namespaces that can be attached to the task. One such proxy pointer is now introduced for the CPU namespace.
The CPU namespace structure contains the following fields:
NOTE: For the sake of this design discussion, consider vCPU as the CPU within the CPU namespace and pCPU as the corresponding translation that Linux as host recognizes and can perform operations upon.
Translation map for vCPU to pCPU mapping: This is a map of all the vCPUs of the namespace and their corresponding pCPUs. In the init namespace the map is 1:1, i.e. each vCPU is mapped to the pCPU of the same number. Subsequent namespaces can have scrambled maps, wherein each physical CPU has a corresponding vCPU attached to it. This allows the CPU details to be abstracted away.
The translation map follows a flat hierarchy. This means that a child namespace also maps independently to pCPUs, rather than translating through its parent. The advantage of this approach is twofold:
vCPUs need not be traversed along the hierarchy to get a physical CPU, which leads to faster translations.
This also abstracts information between parent and child. Even the parent does not know the virtual-to-physical CPU mapping of the child, which further reduces the potential attack surface.
vcpuset cpus: This is the set of CPUs that are accessible to this CPU namespace. The CPUs in this set are the virtual counterparts of the physical CPUs in the translation map.
Pointer to the parent CPU namespace: NULL for the init namespace.
To further explain the design, a sample hierarchy is shown. Assume there are 4 CPU namespaces in the system, and the system has 32 CPUs.
Init namespace: This has a vcpuset of 0-31. The translation is a 1:1 map.
Namespace A: Child of the init namespace. It inherits the cpuset restrictions. The translations are scrambled.
Namespace B: Child of the init namespace. The cgroup interface restricts the CPUs to 0-3, so its vcpuset contains 4 CPUs, shown according to the scrambled map.
Namespace C: Child of namespace B. It inherits the cpuset restriction of B. However, its vCPUs point to a different translation map, which is its own flat vCPU to pCPU map.
IBM POWER9 - 176 CPUs, 2 nodes
Container runtime: Docker
Kernel: 5.14 + CPU namespace patches
The right-hand side is a shell spawned in the CPU init namespace, i.e. it is the shell prompt right after boot and has a 1:1 vCPU to pCPU mapping.
Spawn a simple Ubuntu container whose cpuset spans all the CPUs in the system.
$ docker run --cpuset-cpus="0-175" -it ubuntu bash
Restrict the container's cpuset using the cgroup interface in the init namespace
$ cd /sys/fs/cgroup/cpuset/docker/cont_id/
$ echo "0-3" > cpuset.cpus
The container will now be restricted to the said CPUs and will show the virtualized mapping of the CPUs in that namespace.
The sysfs, procfs, and cgroupfs files are now context sensitive, such that information about CPUs is divulged based on the CPU namespace context that requests it.
Userspace tools like nproc, lscpu, and top will show virtualized CPU information as well.
Using taskset to pin a process to a CPU within the container namespace will result in the translation of that CPU number to that of a physical CPU. The view from within the namespace is unchanged. The view from the init namespace, however, tells the real story!
Survey proposal for identification of the problems and state-of-the-art solutions: