How to use Control groups to restrict tasks to a certain CPU or a percentage of all CPUs?
Objectives
This post shows you how to use control groups (cgroup-v1, cgroup-v2) to restrict tasks to a certain CPU or to a percentage of all CPUs. We will create specific control groups to which we will add a shell; from this shell we'll run stress-ng. htop and ps(1) will help us monitor what happens.
Prerequisites
We use cgroup-v1 here. This means that a somewhat recent, properly configured kernel should be fine. As a root file system we'll use a core-image-minimal here. It uses systemd as the init manager plus a few simple tools.
Kernel config
These are the cgroup-related kernel configs:
root@multi-v7-ml:~# zcat /proc/config.gz | grep CGROUP
CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_BLK_CGROUP_RWSTAT=y
# CONFIG_BLK_CGROUP_IOLATENCY is not set
# CONFIG_BLK_CGROUP_IOCOST is not set
CONFIG_CGROUP_NET_PRIO=y
# CONFIG_CGROUP_NET_CLASSID is not set
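The grep above does not match two options the following examples also rely on: the cpuset controller and CFS bandwidth control. It may be worth checking them as well (option names as in the mainline Kconfig; assumed to be enabled in this kernel):

zcat /proc/config.gz | grep -E "CONFIG_CPUSETS|CONFIG_FAIR_GROUP_SCHED|CONFIG_CFS_BANDWIDTH"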
core-image-minimal with systemd
I created a special resy-systemd distro configuration for this. These are the variables of interest; nowadays it could be done with even fewer variables:
# I want to use systemd here instead of sysvinit
DISTRO_FEATURES_append = " systemd"
VIRTUAL-RUNTIME_init_manager = "systemd"
DISTRO_FEATURES_BACKFILL_CONSIDERED = "sysvinit"
VIRTUAL-RUNTIME_initscripts = ""
VIRTUAL-RUNTIME_dev_manager ?= ""
VIRTUAL-RUNTIME_login_manager ?= ""
VIRTUAL-RUNTIME_init_manager ?= ""
VIRTUAL-RUNTIME_initscripts ?= ""
VIRTUAL-RUNTIME_keymaps ?= ""
packages to be added to core-image-minimal
It would be better to create an “image-recipe” here, but we’ll just hack local.conf:
IMAGE_INSTALL_append = " htop stress-ng dropbear packagegroup-core-base-utils tree"
htop
Htop is a cross-platform interactive process viewer. It is a text-mode application (for console or X terminals) and requires ncurses.
stress-ng
stress-ng stress tests various physical subsystems of a computer as well as various kernel interfaces of the operating system.
dropbear
Dropbear is a relatively small SSH server and client. It runs on a variety of POSIX-based platforms. Dropbear is open source software, distributed under an MIT-style license. Dropbear is useful for “embedded”-type Linux systems.
packagegroup-core-base-utils
I’m using this packagegroup to have more complete versions of standard Unix/Linux tools like ps(1).
tree
Tree(1) lists the contents of directories in a tree-like format.
Overview
Multi-core
Let's assume we have a symmetric multi-core machine with 4 cores:
root@multi-v7-ml:~# cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 6.00
Features        : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

processor       : 1
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 6.00
Features        : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

processor       : 2
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 6.00
Features        : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

processor       : 3
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 6.00
Features        : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

Hardware        : Freescale i.MX6 Quad/DualLite (Device Tree)
Revision        : 0000
Serial          : 0000000000000000
Control Groups (excerpt from here)
Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children (child processes/tasks), into hierarchical groups with specialized behavior.
Definitions:
A cgroup associates a set of tasks with a set of parameters for one or more subsystems. It allows us to deal with a group of tasks.
A subsystem is a kernel module that makes use of the task grouping facilities; it is typically a "resource controller" that schedules a resource or applies per-cgroup limits.
A hierarchy is a set of cgroups arranged in a tree, such that every task in the system is in exactly one of the cgroups in the hierarchy, and a set of subsystems; each subsystem has system-specific state attached to each cgroup in the hierarchy. Each hierarchy has an instance of the cgroup virtual filesystem associated with it.
User-level code may create and destroy cgroups by name in an instance of the cgroup virtual file system, specify and query to which cgroup a task is assigned, and list the task PIDs assigned to a cgroup. Those creations and assignments only affect the hierarchy associated with that instance of the cgroup file system.
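As a minimal sketch of that user-level interface (the group name demo is made up for illustration; the concrete steps used later in this post work the same way):

# create a new cgroup by name in the cpu hierarchy
mkdir /sys/fs/cgroup/cpu/demo
# assign the current shell to it by writing its PID (thread group ID)
echo $$ > /sys/fs/cgroup/cpu/demo/cgroup.procs
# query to which cgroups the shell is currently assigned
cat /proc/$$/cgroup
# list the task PIDs assigned to the new cgroup
cat /sys/fs/cgroup/cpu/demo/tasks
# move the shell back to the root cgroup and remove the (now empty) group
echo $$ > /sys/fs/cgroup/cpu/cgroup.procs
rmdir /sys/fs/cgroup/cpu/demo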
The top-level cgroup mount is this:
mount | grep "/sys/fs/cgroup "
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
The other cgroup mounts are these:
mount | grep /sys/fs/cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/debug type cgroup (rw,nosuid,nodev,noexec,relatime,debug)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
CPUSETS (excerpt from here)
The mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks is called Cpusets. Resources like CPU and memory can be limited via cpusets.
cd /sys/fs/cgroup/cpuset
tree
.
|-- cgroup.clone_children
|-- cgroup.procs
|-- cgroup.sane_behavior
|-- cpuset.cpu_exclusive
|-- cpuset.cpus
|-- cpuset.effective_cpus
|-- cpuset.effective_mems
|-- cpuset.mem_exclusive
|-- cpuset.mem_hardwall
|-- cpuset.memory_migrate
|-- cpuset.memory_pressure
|-- cpuset.memory_pressure_enabled
|-- cpuset.memory_spread_page
|-- cpuset.memory_spread_slab
|-- cpuset.mems
|-- cpuset.sched_load_balance
|-- cpuset.sched_relax_domain_level
|-- notify_on_release
|-- release_agent
`-- tasks
Let’s have a look at some items of interest here:
/sys/fs/cgroup/cpuset/cgroup.procs
list of thread group IDs in the cgroup. This list is not guaranteed to be sorted or free of duplicate thread group IDs, and userspace should sort/uniquify the list if this property is required. Writing a thread group ID into this file moves all threads in that group into this cgroup.
By default all thread group IDs are in this cgroup:
cat /sys/fs/cgroup/cpuset/cgroup.procs | tail -10
372
376
378
379
403
407
408
409
410
411
/sys/fs/cgroup/cpuset/cpuset.cpus
list of CPUs in that cpuset
cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-3
This means that tasks attached to this cpuset are allowed to run on all 4 CPUs.
/sys/fs/cgroup/cpuset/cpuset.mems
list of Memory Nodes in that cpuset. The parameter has a value of 0 on systems that do not have a non-uniform memory architecture (NUMA).
cat /sys/fs/cgroup/cpuset/cpuset.mems
0
/sys/fs/cgroup/cpuset/tasks
You can list all the tasks (by PID) attached to any cpuset, or attach a task to a cgroup by writing its PID to the tasks file of that cgroup.
cat /sys/fs/cgroup/cpuset/tasks | tail -10
372
376
378
379
403
409
410
411
412
413
Process and task IDs of your shell can be checked like this:
ps -L -p $$ -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  234   234     3     1 -sh
CPU resource usage is unrestricted by default (cgroup/cpuset)
Try to apply CPU load to all 4 CPUs
stress-ng --cpu 4 &
Monitor CPU usage with htop:
  0[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]   Tasks: 26, 1 thr; 4 running
  1[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]   Load average: 1.77 0.96 0.45
  2[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]   Uptime: 00:08:44
  3[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  Mem[|||||||||                                                      59.2M/990M]
  Swp[                                                                     0K/0K]

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  284 root      20   0 14988  2692  1388 R 100.  0.3  0:30.47 stress-ng --cpu 4
  285 root      20   0 14988  2692  1388 R 100.  0.3  0:30.48 stress-ng --cpu 4
  286 root      20   0 14988  2692  1388 R 100.  0.3  0:30.46 stress-ng --cpu 4
  287 root      20   0 14988  2692  1388 R 100.  0.3  0:30.08 stress-ng --cpu 4
  282 root      20   0  3696  2564  1628 R  1.3  0.3  0:01.72 htop
    1 root      20   0  7744  5692  4016 S  0.0  0.6  0:09.93 /sbin/init
  ...
We can see that all four CPUs are pretty busy.
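Since stress-ng was started in the background from our shell, it can be stopped again with the shell's job control (or by killing its PID) before we move on to the restricted experiments:

# stop the background stress-ng job started above
kill %1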
Restrict CPU resource usage for whatever runs from a specific shell to CPU 3
Create a new cpuset
mkdir /sys/fs/cgroup/cpuset/set1
tree /sys/fs/cgroup/cpuset/set1
/sys/fs/cgroup/cpuset/set1
|-- cgroup.clone_children
|-- cgroup.procs
|-- cpuset.cpu_exclusive
|-- cpuset.cpus
|-- cpuset.effective_cpus
|-- cpuset.effective_mems
|-- cpuset.mem_exclusive
|-- cpuset.mem_hardwall
|-- cpuset.memory_migrate
|-- cpuset.memory_pressure
|-- cpuset.memory_spread_page
|-- cpuset.memory_spread_slab
|-- cpuset.mems
|-- cpuset.sched_load_balance
|-- cpuset.sched_relax_domain_level
|-- notify_on_release
`-- tasks
cgroup.procs
No processes should be in the set yet:
cat /sys/fs/cgroup/cpuset/set1/cgroup.procs
cpuset.cpus
Restrict this set to CPU 3:
echo 3 > /sys/fs/cgroup/cpuset/set1/cpuset.cpus
cat /sys/fs/cgroup/cpuset/set1/cpuset.cpus
3
cpuset.mems
We don't have a NUMA architecture, so memory node 0 is the only option:
echo 0 > /sys/fs/cgroup/cpuset/set1/cpuset.mems
cat /sys/fs/cgroup/cpuset/set1/cpuset.mems
0
tasks
No tasks should be in the set yet:
cat /sys/fs/cgroup/cpuset/set1/tasks
Attach the shell to the new cpuset
echo $$ > /sys/fs/cgroup/cpuset/set1/tasks
cat /sys/fs/cgroup/cpuset/set1/tasks
265
289
ps -L -p 265 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  265   265     3   264 -sh
As you can see, the shell (PID 265) runs on CPU 3. (The second PID in the tasks file is presumably the cat command itself, which, as a child of the shell, was created in the same cgroup.)
Apply load from the shell
stress-ng --cpu 4 &
Monitor with htop
  0[                                                                         0.0%]   Tasks: 26, 1 thr; 4 running
  1[||                                                                       1.3%]   Load average: 0.89 0.20 0.06
  2[                                                                         0.0%]   Uptime: 02:01:58
  3[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  Mem[|||||||||                                                      60.8M/990M]
  Swp[                                                                     0K/0K]

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  310 root      20   0 14988  2960  1400 R 25.0  0.3  0:04.60 stress-ng --cpu 4
  311 root      20   0 14988  2960  1400 R 25.0  0.3  0:04.60 stress-ng --cpu 4
  312 root      20   0 14988  2956  1400 R 25.0  0.3  0:04.60 stress-ng --cpu 4
  313 root      20   0 14988  2952  1400 R 25.0  0.3  0:04.60 stress-ng --cpu 4
  314 root      20   0  3696  2584  1652 R  2.0  0.3  0:00.13 htop
    1 root      20   0  7744  5708  4020 S  0.0  0.6  0:10.08 /sbin/init
  ...
As you can see, only CPU 3 is busy.
Let’s check process and task IDs:
ps -L -p 310 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  310   310     3   304 stress-ng --cpu 4
ps -L -p 311 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  311   311     3   304 stress-ng --cpu 4
ps -L -p 312 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  312   312     3   304 stress-ng --cpu 4
ps -L -p 313 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  313   313     3   304 stress-ng --cpu 4
As you can see, the four stress-ng instances run only on CPU 3.
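For reference, here are the cpuset steps condensed into one block, meant to be pasted into the shell that should be restricted (in a separate script, $$ would refer to the script's shell instead). The undo lines at the end are an assumption based on the generic cgroup interface, not something shown above:

# run in the shell that should be restricted to CPU 3
CG=/sys/fs/cgroup/cpuset/set1
mkdir -p "$CG"
echo 3  > "$CG/cpuset.cpus"    # allow only CPU 3
echo 0  > "$CG/cpuset.mems"    # memory node 0 (no NUMA here)
echo $$ > "$CG/tasks"          # attach this shell; children inherit the cpuset

# undo: move the shell back to the root cpuset and remove the empty group
echo $$ > /sys/fs/cgroup/cpuset/tasks
rmdir "$CG"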
CPU resource usage is unrestricted by default (cgroup/cpu)
CPU
The cgroups CPU subsystem isolates the CPU time consumption of a group of processes (tasks) from the rest of the system. We can get reports on CPU usage by the processes in a cgroup and set limits on the CPU time available to those processes.
tree /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu
|-- cgroup.clone_children
|-- cgroup.procs
|-- cgroup.sane_behavior
|-- cpu.cfs_period_us
|-- cpu.cfs_quota_us
|-- cpu.rt_period_us
|-- cpu.rt_runtime_us
|-- cpu.shares
|-- cpu.stat
|-- cpuacct.stat
|-- cpuacct.usage
|-- cpuacct.usage_all
|-- cpuacct.usage_percpu
|-- cpuacct.usage_percpu_sys
|-- cpuacct.usage_percpu_user
|-- cpuacct.usage_sys
|-- cpuacct.usage_user
|-- notify_on_release
|-- release_agent
`-- tasks
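The "reports on CPU usage" mentioned above come from the cpuacct files visible in this tree. For example (read against the root hierarchy here; the values are cumulative since boot):

# accumulated CPU time, in nanoseconds, of all tasks in the root cpu,cpuacct cgroup
cat /sys/fs/cgroup/cpu/cpuacct.usage
# the same, broken down per CPU
cat /sys/fs/cgroup/cpu/cpuacct.usage_percpu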
cgroup.procs, tasks
By default all thread group IDs are in cpu/cgroup.procs, and all tasks of the system are likewise attached to cpu/tasks.
cpu.cfs_period_us
Specifies the period of time, in microseconds, for how regularly a cgroup's access to CPU resources should be reallocated. Valid values range from 1000 microseconds to 1 second.
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
cpu.cfs_quota_us (excerpt from here)
cpu.cfs_quota_us specifies the total amount of time, in microseconds, for which all tasks in the control group may run during one period (as defined by cpu.cfs_period_us).
As soon as Tasks in a cgroup use up all the time specified by the quota, they are throttled.
A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place; such a group is described as an unconstrained bandwidth group. This represents the traditional work-conserving behavior for CFS (the completely fair scheduler).
Writing any (valid) positive value will enact the specified bandwidth limit. The minimum allowed quota or period is 1 ms; the upper bound on the period length is 1 s.
Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit.
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1
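A small sketch of the quota interface described above, using a hypothetical child group only for illustration (the real experiment below uses a group called set1):

# create a throwaway group
mkdir /sys/fs/cgroup/cpu/example
# limit it to half a CPU: 50000 us of runtime per 100000 us period
echo 50000 > /sys/fs/cgroup/cpu/example/cpu.cfs_quota_us
# remove the limit again (any negative value works; -1 is the convention)
echo -1 > /sys/fs/cgroup/cpu/example/cpu.cfs_quota_us
# clean up
rmdir /sys/fs/cgroup/cpu/example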
Restrict CPU resource usage for whatever runs from a specific shell to 50% of one CPU
Create new set
mkdir /sys/fs/cgroup/cpu/set1
tree /sys/fs/cgroup/cpu/set1
/sys/fs/cgroup/cpu/set1
|-- cgroup.clone_children
|-- cgroup.procs
|-- cpu.cfs_period_us
|-- cpu.cfs_quota_us
|-- cpu.rt_period_us
|-- cpu.rt_runtime_us
|-- cpu.shares
|-- cpu.stat
|-- cpuacct.stat
|-- cpuacct.usage
|-- cpuacct.usage_all
|-- cpuacct.usage_percpu
|-- cpuacct.usage_percpu_sys
|-- cpuacct.usage_percpu_user
|-- cpuacct.usage_sys
|-- cpuacct.usage_user
|-- notify_on_release
`-- tasks
cpu.cfs_period_us
cat /sys/fs/cgroup/cpu/set1/cpu.cfs_period_us
100000
cpu.cfs_quota_us
echo 50000 > /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us
50000
Attach a shell to the set & stress the system
echo $$ > /sys/fs/cgroup/cpu/set1/tasks
stress-ng --cpu 4 &
CPUs | CPUs utilized | CPU % utilized | cpu.cfs_period_us | cpu.cfs_quota_us | CPUs really utilized | CPU % really utilized
---|---|---|---|---|---|---
4 | 1 | 50 | 100000 | 50000 | 4 | 12.5 on average
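The last two columns follow from the quota/period ratio, assuming the scheduler spreads the four stress-ng workers evenly over the four CPUs:

# allowed bandwidth = cpu.cfs_quota_us / cpu.cfs_period_us
#   50000 / 100000 = 0.5 CPUs worth of runtime in total
# spread over 4 workers on 4 CPUs:
#   0.5 / 4 = 0.125  ->  about 12.5 % load per CPU on average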
Monitor CPU usage
  0[||||||||||                                                              10.6%]   Tasks: 26, 1 thr; 4 running
  1[|||||||||||                                                             12.6%]   Load average: 1.15 0.98 0.63
  2[|||||||||||||                                                           14.6%]   Uptime: 00:52:42
  3[|||||||||||                                                             12.6%]
  Mem[|||||||||                                                      68.7M/990M]
  Swp[                                                                     0K/0K]

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  319 root      20   0 14988  2692  1388 R 15.2  0.3  1:06.78 stress-ng --cpu 4
  318 root      20   0 14988  2692  1388 R 13.2  0.3  1:06.31 stress-ng --cpu 4
  321 root      20   0 14988  2692  1388 R 13.2  0.3  1:05.43 stress-ng --cpu 4
  320 root      20   0 14988  2692  1388 R  9.9  0.3  0:50.14 stress-ng --cpu 4
  316 root      20   0  3696  2576  1708 R  0.7  0.3  0:05.63 htop
    1 root      20   0  7772  5808  4128 S  0.0  0.6  0:09.87 /sbin/init
Restrict CPU resource usage for whatever runs from a specific shell to 10% of one CPU
CPUs | CPUs utilized | CPU % utilized | cpu.cfs_period_us | cpu.cfs_quota_us | CPUs really utilized | CPU % really utilized
---|---|---|---|---|---|---
4 | 1 | 10 | 100000 | 10000 | 4 | 2.5 on average
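Compared to the 50 % case only the quota changes; presumably it was set like this before re-running stress-ng --cpu 4 & from the attached shell:

# 10000 us of runtime per 100000 us period = 10 % of one CPU in total
echo 10000 > /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us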
  0[|||                                                                      2.6%]   Tasks: 26, 1 thr; 4 running
  1[|||                                                                      2.6%]   Load average: 0.15 0.26 0.12
  2[||||                                                                     3.3%]   Uptime: 00:03:10
  3[||                                                                       2.0%]
  Mem[|||||||||                                                      58.9M/990M]
  Swp[                                                                     0K/0K]

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  286 root      20   0 14988  2604  1304 R  2.6  0.3  0:00.32 stress-ng --cpu 4
  287 root      20   0 14988  2848  1296 R  2.6  0.3  0:00.33 stress-ng --cpu 4
  288 root      20   0 14988  2348  1300 R  2.6  0.2  0:00.31 stress-ng --cpu 4
  289 root      20   0 14988  2348  1300 R  2.0  0.2  0:00.32 stress-ng --cpu 4
  262 root      20   0  3704  2600  1660 R  0.7  0.3  0:00.46 htop
    1 root      20   0  7772  5688  4024 S  0.0  0.6  0:09.82 /sbin/init
Restrict CPU resource usage for whatever runs from a specific shell to 100% of two CPUs
CPUs | CPUs utilized | CPU % utilized | cpu.cfs_period_us | cpu.cfs_quota_us | CPUs really utilized | CPU % really utilized
---|---|---|---|---|---|---
4 | 2 | 100 | 100000 | 200000 | 4 | 50 on average
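Here the quota exceeds the period, which is how more than one CPU's worth of runtime is granted; presumably it was set like this:

# 200000 us of runtime per 100000 us period = 2 CPUs worth of runtime in total
echo 200000 > /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us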
  0[|||||||||||||||||||||||||||||||||||||||||||                             50.3%]   Tasks: 26, 1 thr; 4 running
  1[||||||||||||||||||||||||||||||||||||||||||||                            51.0%]   Load average: 1.10 0.32 0.11
  2[|||||||||||||||||||||||||||||||||||||||||||                             50.3%]   Uptime: 00:26:35
  3[||||||||||||||||||||||||||||||||||||||||||||                            50.6%]
  Mem[|||||||||                                                      60.9M/990M]
  Swp[                                                                     0K/0K]

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  300 root      20   0 14988  2952  1388 R 50.2  0.3  0:22.00 stress-ng --cpu 4
  297 root      20   0 14988  2940  1388 R 49.6  0.3  0:21.75 stress-ng --cpu 4
  298 root      20   0 14988  2948  1388 R 49.6  0.3  0:21.99 stress-ng --cpu 4
  299 root      20   0 14988  2948  1388 R 49.6  0.3  0:21.85 stress-ng --cpu 4
  271 root      20   0  3692  2636  1704 R  1.3  0.3  0:00.86 htop
    1 root      20   0  7740  5768  4096 S  0.0  0.6  0:10.04 /sbin/init
  ...
Conclusion
As you can see, control groups are an interesting option for managing system resources. Think about use cases like containers and real-time. We used fairly low-level kernel interfaces here; there are definitely more high-level ways to manage cgroups as well.
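One example of such a higher-level interface, since this image already runs systemd: systemd-run can put a command into a transient scope with a CPU quota without touching the cgroup filesystem by hand (shown only as an illustration, not something used in this post; property name per systemd.resource-control):

# run stress-ng in a transient scope limited to 50 % of one CPU
systemd-run --scope -p CPUQuota=50% stress-ng --cpu 4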
Appendix
A post about “Degrees of real-time” is here. If you want to learn how Embedded Linux and real-time Linux work have a look here. To learn more about the Yocto Project® have a look here.