How to use Control groups to restrict tasks to a certain CPU or a percentage of all CPUs?

Objectives

This post shows you how to use Control groups (cgroup-v1, cgroup-v2) to restrict tasks to a certain CPU or to a percentage of all CPUs. We will create specific control groups and add a shell to them. From this shell, we’ll run stress-ng. htop and ps(1) will help us monitor what happens.

Prerequisites

We use cgroup-v1 here, so any reasonably recent kernel that is configured properly should be fine. As the root file system we’ll use a core-image-minimal, with systemd as the init manager plus a few simple tools.

Kernel config

These are the cgroup-related kernel config options (note that the cpuset controller, CONFIG_CPUSETS, does not match this grep pattern, but it is enabled as well, as the cpuset mount below shows):

root@multi-v7-ml:~# zcat /proc/config.gz | grep CGROUP
CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_BLK_CGROUP_RWSTAT=y
# CONFIG_BLK_CGROUP_IOLATENCY is not set
# CONFIG_BLK_CGROUP_IOCOST is not set
CONFIG_CGROUP_NET_PRIO=y
# CONFIG_CGROUP_NET_CLASSID is not set

core-image-minimal with systemd

I created a special resy-systemd distro configuration for this. These are the variables of interest; nowadays it could be done with even fewer variables:

# I want to use systemd here instead of sysvinit
DISTRO_FEATURES_append = " systemd"
VIRTUAL-RUNTIME_init_manager = "systemd"
DISTRO_FEATURES_BACKFILL_CONSIDERED = "sysvinit"
VIRTUAL-RUNTIME_initscripts = ""

VIRTUAL-RUNTIME_dev_manager ?= ""
VIRTUAL-RUNTIME_login_manager ?= ""
VIRTUAL-RUNTIME_init_manager ?= ""
VIRTUAL-RUNTIME_initscripts ?= ""
VIRTUAL-RUNTIME_keymaps ?= ""

packages to be added to core-image-minimal

It would be better to create an “image-recipe” here, but we’ll just hack local.conf:

IMAGE_INSTALL_append = " htop stress-ng dropbear packagegroup-core-base-utils tree"

htop

Htop is a cross-platform interactive process viewer. It is a text-mode application (for console or X terminals) and requires ncurses.

stress-ng

stress-ng stress-tests various physical subsystems of a computer, as well as various kernel interfaces of the operating system.

dropbear

Dropbear is a relatively small SSH server and client. It runs on a variety of POSIX-based platforms. Dropbear is open source software, distributed under an MIT-style license. Dropbear is useful for “embedded”-type Linux systems.

packagegroup-core-base-utils

I’m using this packagegroup to have more complete versions of standard Unix/Linux tools like ps(1).

tree

Tree(1) lists the contents of directories in a tree-like format.

Overview

Multi-core

Let’s assume we have a symmetric multi-core machine with 4 cores:

root@multi-v7-ml:~# cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 6.00
Features        : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

processor       : 1
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 6.00
Features        : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

processor       : 2
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 6.00
Features        : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

processor       : 3
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 6.00
Features        : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

Hardware        : Freescale i.MX6 Quad/DualLite (Device Tree)
Revision        : 0000
Serial          : 0000000000000000

Control Groups (excerpt from here)

Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children (child processes/tasks), into hierarchical groups with specialized behavior.

Definitions:

cgroup associates a set of tasks with a set of parameters for one or more subsystems. It allows us to deal with a group of tasks.

subsystem is a kernel module that makes use of the task grouping facilities. A subsystem is typically a “resource controller” that schedules a resource or applies per-cgroup limits.

hierarchy is a set of cgroups arranged in a tree, such that every task in the system is in exactly one of the cgroups in the hierarchy, and a set of subsystems; each subsystem has system-specific state attached to each cgroup in the hierarchy. Each hierarchy has an instance of the cgroup virtual filesystem associated with it.

User-level code may create and destroy cgroups by name in an instance of the cgroup virtual file system, specify and query to which cgroup a task is assigned, and list the task PIDs assigned to a cgroup. Those creations and assignments only affect the hierarchy associated with that instance of the cgroup file system.

The top-level cgroup mount point is this:

mount | grep "/sys/fs/cgroup "
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)

Other cgroups are those:

mount | grep /sys/fs/cgroup 
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/debug type cgroup (rw,nosuid,nodev,noexec,relatime,debug)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
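
User-level creation and destruction of a cgroup is just a mkdir/rmdir in one of these mounted hierarchies; a minimal sketch (the name demo is only a placeholder):

# create a new cgroup in the cpuset hierarchy
mkdir /sys/fs/cgroup/cpuset/demo
# ... configure it and attach tasks ...
# an empty cgroup (no attached tasks, no children) can be removed again
rmdir /sys/fs/cgroup/cpuset/demo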

CPUSETS (excerpt from here)

Cpusets are the mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks. Resources like CPU and memory can thus be limited via cpusets.

cd /sys/fs/cgroup/cpuset
tree
.
|-- cgroup.clone_children
|-- cgroup.procs
|-- cgroup.sane_behavior
|-- cpuset.cpu_exclusive
|-- cpuset.cpus
|-- cpuset.effective_cpus
|-- cpuset.effective_mems
|-- cpuset.mem_exclusive
|-- cpuset.mem_hardwall
|-- cpuset.memory_migrate
|-- cpuset.memory_pressure
|-- cpuset.memory_pressure_enabled
|-- cpuset.memory_spread_page
|-- cpuset.memory_spread_slab
|-- cpuset.mems
|-- cpuset.sched_load_balance
|-- cpuset.sched_relax_domain_level
|-- notify_on_release
|-- release_agent
`-- tasks

Let’s have a look at some items of interest here:

/sys/fs/cgroup/cpuset/cgroup.procs

cgroup.procs:

list of thread group IDs in the cgroup. This list is not guaranteed to be sorted or free of duplicate thread group IDs, and userspace should sort/uniquify the list if this property is required. Writing a thread group ID into this file moves all threads in that group into this cgroup.
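
A minimal sketch of these semantics (set1 is a hypothetical sub-cpuset here; note that for a cpuset, cpuset.cpus and cpuset.mems must be configured before such a write succeeds):

# move the shell's whole thread group (all of its threads) into set1
echo $$ > /sys/fs/cgroup/cpuset/set1/cgroup.procs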

By default, all thread group IDs are in this cgroup:

cat /sys/fs/cgroup/cpuset/cgroup.procs | tail -10
372
376
378
379
403
407
408
409
410
411

/sys/fs/cgroup/cpuset/cpuset.cpus

cpuset.cpus:

list of CPUs in that cpuset

cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-3

This means that tasks attached to this cpuset are allowed to run on all 4 CPUs.

/sys/fs/cgroup/cpuset/cpuset.mems

cpuset.mems:

list of Memory Nodes in that cpuset. The parameter has a value of 0 on systems that do not have a non-uniform memory architecture (NUMA).

cat /sys/fs/cgroup/cpuset/cpuset.mems
0

/sys/fs/cgroup/cpuset/tasks

You can list all the tasks (by PID) attached to any cpuset, and you can attach a task to a cgroup by writing its PID to that cgroup’s tasks file, e.g. /sys/fs/cgroup/cpuset/tasks.

cat /sys/fs/cgroup/cpuset/tasks | tail -10
372
376
378
379
403
409
410
411
412
413

Process and task IDs of your shell can be checked like this:

ps -L -p $$ -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  234   234     3     1 -sh

CPU resource usage is unrestricted by default (cgroup/cpuset)

Try to apply CPU load to all 4 CPUs

stress-ng --cpu 4 &

Monitor CPU usage with htop:

    0[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]   Tasks: 26, 1 thr; 4 running
    1[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]   Load average: 1.77 0.96 0.45 
    2[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]   Uptime: 00:08:44
    3[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  Mem[|||||||||                                                        59.2M/990M]
  Swp[                                                                      0K/0K]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  284 root       20   0 14988  2692  1388 R 100.  0.3  0:30.47 stress-ng --cpu 4
  285 root       20   0 14988  2692  1388 R 100.  0.3  0:30.48 stress-ng --cpu 4
  286 root       20   0 14988  2692  1388 R 100.  0.3  0:30.46 stress-ng --cpu 4
  287 root       20   0 14988  2692  1388 R 100.  0.3  0:30.08 stress-ng --cpu 4
  282 root       20   0  3696  2564  1628 R  1.3  0.3  0:01.72 htop
    1 root       20   0  7744  5692  4016 S  0.0  0.6  0:09.93 /sbin/init
...

We can see that all four CPUs are pretty busy.
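
Before continuing, you may want to stop the background stress-ng job again, e.g. via the shell’s job control (assuming it was started as job %1):

kill %1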

Restrict CPU resource usage for whatever runs from a specific shell to CPU 3

Create a new cpuset

mkdir /sys/fs/cgroup/cpuset/set1
tree /sys/fs/cgroup/cpuset/set1
/sys/fs/cgroup/cpuset/set1
|-- cgroup.clone_children
|-- cgroup.procs
|-- cpuset.cpu_exclusive
|-- cpuset.cpus
|-- cpuset.effective_cpus
|-- cpuset.effective_mems
|-- cpuset.mem_exclusive
|-- cpuset.mem_hardwall
|-- cpuset.memory_migrate
|-- cpuset.memory_pressure
|-- cpuset.memory_spread_page
|-- cpuset.memory_spread_slab
|-- cpuset.mems
|-- cpuset.sched_load_balance
|-- cpuset.sched_relax_domain_level
|-- notify_on_release
`-- tasks

cgroup.procs

No processes should be in the set yet:

cat /sys/fs/cgroup/cpuset/set1/cgroup.procs

cpuset.cpus

Restrict this set to CPU 3:

echo 3 > /sys/fs/cgroup/cpuset/set1/cpuset.cpus
cat /sys/fs/cgroup/cpuset/set1/cpuset.cpus
3
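
As a cross-check, cpuset.effective_cpus reports the CPUs that are effectively available to this set; here it should read 3 as well:

cat /sys/fs/cgroup/cpuset/set1/cpuset.effective_cpus
3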

cpuset.mems

We don’t have a NUMA architecture:

echo 0 > /sys/fs/cgroup/cpuset/set1/cpuset.mems
cat /sys/fs/cgroup/cpuset/set1/cpuset.mems
0

tasks

No tasks should be in the set yet:

cat /sys/fs/cgroup/cpuset/set1/tasks

Attach the shell to the new cpuset

echo $$ > /sys/fs/cgroup/cpuset/set1/tasks
cat /sys/fs/cgroup/cpuset/set1/tasks
265
289
ps -L -p 265 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  265   265     3   264 -sh

As you can see, the shell runs on CPU 3.
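
Another way to verify a task’s allowed CPUs is via procfs; given the cpuset above, this should report 3:

grep Cpus_allowed_list /proc/$$/status
Cpus_allowed_list:      3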

Apply load from the shell

stress-ng --cpu 4 &

Monitor with htop

    0[                                                                       0.0%]   Tasks: 26, 1 thr; 4 running
    1[||                                                                     1.3%]   Load average: 0.89 0.20 0.06 
    2[                                                                       0.0%]   Uptime: 02:01:58
    3[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  Mem[|||||||||                                                        60.8M/990M]
  Swp[                                                                      0K/0K]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  310 root       20   0 14988  2960  1400 R 25.0  0.3  0:04.60 stress-ng --cpu 4
  311 root       20   0 14988  2960  1400 R 25.0  0.3  0:04.60 stress-ng --cpu 4
  312 root       20   0 14988  2956  1400 R 25.0  0.3  0:04.60 stress-ng --cpu 4
  313 root       20   0 14988  2952  1400 R 25.0  0.3  0:04.60 stress-ng --cpu 4
  314 root       20   0  3696  2584  1652 R  2.0  0.3  0:00.13 htop
    1 root       20   0  7744  5708  4020 S  0.0  0.6  0:10.08 /sbin/init
...

As you can see, only CPU 3 is busy.

Let’s check process and task IDs:

ps -L -p 310 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  310   310     3   304 stress-ng --cpu 4
ps -L -p 311 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  311   311     3   304 stress-ng --cpu 4
ps -L -p 312 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  312   312     3   304 stress-ng --cpu 4
ps -L -p 313 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  313   313     3   304 stress-ng --cpu 4

As you can see, the four stress-ng instances run only on CPU 3.
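
When you are done with the experiment, the cpuset can be torn down again; a minimal sketch (stop stress-ng first, so that no tasks remain attached):

# move the shell back to the root cpuset
echo $$ > /sys/fs/cgroup/cpuset/tasks
# the now empty cgroup can be removed
rmdir /sys/fs/cgroup/cpuset/set1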

CPU resource usage is unrestricted by default (cgroup/cpu)

CPU

The cgroup CPU subsystem isolates the CPU time consumption of a group of processes (tasks) from the rest of the system. We can get reports on CPU usage by the processes in a cgroup and set limits on the CPU time used by those processes.

tree /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu
|-- cgroup.clone_children
|-- cgroup.procs
|-- cgroup.sane_behavior
|-- cpu.cfs_period_us
|-- cpu.cfs_quota_us
|-- cpu.rt_period_us
|-- cpu.rt_runtime_us
|-- cpu.shares
|-- cpu.stat
|-- cpuacct.stat
|-- cpuacct.usage
|-- cpuacct.usage_all
|-- cpuacct.usage_percpu
|-- cpuacct.usage_percpu_sys
|-- cpuacct.usage_percpu_user
|-- cpuacct.usage_sys
|-- cpuacct.usage_user
|-- notify_on_release
|-- release_agent
`-- tasks

cgroup.procs, tasks

By default, all thread group IDs are in cpu/cgroup.procs, and all tasks of the system are attached to cpu/tasks as well.

cpu.cfs_period_us

Specifies a period of time, in microseconds, for how regularly a cgroup’s access to CPU resources should be reallocated. Valid values range from 1000 microseconds to 1 second.

cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000

cpu.cfs_quota_us (excerpt from here)

cpu.cfs_quota_us specifies the total time in microseconds for which all tasks in the control group can run during one period (as defined by cpu.cfs_period_us).

As soon as the tasks in a cgroup have used up all the time specified by the quota, they are throttled for the remainder of the period.

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place; such a group is described as an unconstrained bandwidth group. This represents the traditional work-conserving behavior for CFS (the Completely Fair Scheduler).

Writing any (valid) positive value will enact the specified bandwidth limit. The minimum quota or period allowed is 1 ms; the upper bound on the period length is 1 s.

Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit.

cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1
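
The allowed CPU bandwidth is simply the ratio of the two values: cpu.cfs_quota_us / cpu.cfs_period_us CPUs’ worth of run time per period. For the experiments below:

#  50000 / 100000 = 0.5 -> 50% of one CPU
#  10000 / 100000 = 0.1 -> 10% of one CPU
# 200000 / 100000 = 2.0 -> 100% of two CPUs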

Restrict CPU resource usage for whatever runs from a specific shell to 50% of one CPU

Create a new set

mkdir /sys/fs/cgroup/cpu/set1
tree /sys/fs/cgroup/cpu/set1
/sys/fs/cgroup/cpu/set1
|-- cgroup.clone_children
|-- cgroup.procs
|-- cpu.cfs_period_us
|-- cpu.cfs_quota_us
|-- cpu.rt_period_us
|-- cpu.rt_runtime_us
|-- cpu.shares
|-- cpu.stat
|-- cpuacct.stat
|-- cpuacct.usage
|-- cpuacct.usage_all
|-- cpuacct.usage_percpu
|-- cpuacct.usage_percpu_sys
|-- cpuacct.usage_percpu_user
|-- cpuacct.usage_sys
|-- cpuacct.usage_user
|-- notify_on_release
`-- tasks

cpu.cfs_period_us

cat /sys/fs/cgroup/cpu/set1/cpu.cfs_period_us
100000

cpu.cfs_quota_us

echo 50000 > /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us
50000

Attach a shell to the set & stress the system

echo $$ > /sys/fs/cgroup/cpu/set1/tasks
stress-ng --cpu 4 &

CPUs | CPUs utilized | CPU % utilized | cpu.cfs_period_us | cpu.cfs_quota_us | CPUs really utilized | CPU % really utilized
4    | 1             | 50             | 100000            | 50000            | 4                    | 12.5 on average

Monitor CPU usage

    0[||||||||||                                                            10.6%]   Tasks: 26, 1 thr; 4 running
    1[|||||||||||                                                           12.6%]   Load average: 1.15 0.98 0.63 
    2[|||||||||||||                                                         14.6%]   Uptime: 00:52:42
    3[|||||||||||                                                           12.6%]
  Mem[|||||||||                                                        68.7M/990M]
  Swp[                                                                      0K/0K]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  319 root       20   0 14988  2692  1388 R 15.2  0.3  1:06.78 stress-ng --cpu 4
  318 root       20   0 14988  2692  1388 R 13.2  0.3  1:06.31 stress-ng --cpu 4
  321 root       20   0 14988  2692  1388 R 13.2  0.3  1:05.43 stress-ng --cpu 4
  320 root       20   0 14988  2692  1388 R  9.9  0.3  0:50.14 stress-ng --cpu 4
  316 root       20   0  3696  2576  1708 R  0.7  0.3  0:05.63 htop
    1 root       20   0  7772  5808  4128 S  0.0  0.6  0:09.87 /sbin/init

Restrict CPU resource usage for whatever runs from a specific shell to 10% of one CPU
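
The settings are analogous to the 50% case; a sketch (reusing the set1 group from above):

# 10000 / 100000 = 10% of one CPU
echo 10000 > /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us
echo $$ > /sys/fs/cgroup/cpu/set1/tasks
stress-ng --cpu 4 &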

CPUs | CPUs utilized | CPU % utilized | cpu.cfs_period_us | cpu.cfs_quota_us | CPUs really utilized | CPU % really utilized
4    | 1             | 10             | 100000            | 10000            | 4                    | 2.5 on average

    0[|||                                                                    2.6%]   Tasks: 26, 1 thr; 4 running
    1[|||                                                                    2.6%]   Load average: 0.15 0.26 0.12 
    2[||||                                                                   3.3%]   Uptime: 00:03:10
    3[||                                                                     2.0%]
  Mem[|||||||||                                                        58.9M/990M]
  Swp[                                                                      0K/0K]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  286 root       20   0 14988  2604  1304 R  2.6  0.3  0:00.32 stress-ng --cpu 4
  287 root       20   0 14988  2848  1296 R  2.6  0.3  0:00.33 stress-ng --cpu 4
  288 root       20   0 14988  2348  1300 R  2.6  0.2  0:00.31 stress-ng --cpu 4
  289 root       20   0 14988  2348  1300 R  2.0  0.2  0:00.32 stress-ng --cpu 4
  262 root       20   0  3704  2600  1660 R  0.7  0.3  0:00.46 htop
    1 root       20   0  7772  5688  4024 S  0.0  0.6  0:09.82 /sbin/init

Restrict CPU resource usage for whatever runs from a specific shell to 100% of two CPUs
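
Again, a sketch of the corresponding settings:

# 200000 / 100000 = two full CPUs' worth of time per period
echo 200000 > /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us
echo $$ > /sys/fs/cgroup/cpu/set1/tasks
stress-ng --cpu 4 &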

CPUs | CPUs utilized | CPU % utilized | cpu.cfs_period_us | cpu.cfs_quota_us | CPUs really utilized | CPU % really utilized
4    | 2             | 100            | 100000            | 200000           | 4                    | 50 on average

    0[|||||||||||||||||||||||||||||||||||||||||||                           50.3%]   Tasks: 26, 1 thr; 4 running
    1[||||||||||||||||||||||||||||||||||||||||||||                          51.0%]   Load average: 1.10 0.32 0.11 
    2[|||||||||||||||||||||||||||||||||||||||||||                           50.3%]   Uptime: 00:26:35
    3[||||||||||||||||||||||||||||||||||||||||||||                          50.6%]
  Mem[|||||||||                                                        60.9M/990M]
  Swp[                                                                      0K/0K]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU%-MEM%   TIME+  Command
  300 root       20   0 14988  2952  1388 R 50.2  0.3  0:22.00 stress-ng --cpu 4
  297 root       20   0 14988  2940  1388 R 49.6  0.3  0:21.75 stress-ng --cpu 4
  298 root       20   0 14988  2948  1388 R 49.6  0.3  0:21.99 stress-ng --cpu 4
  299 root       20   0 14988  2948  1388 R 49.6  0.3  0:21.85 stress-ng --cpu 4
  271 root       20   0  3692  2636  1704 R  1.3  0.3  0:00.86 htop
    1 root       20   0  7740  5768  4096 S  0.0  0.6  0:10.04 /sbin/init
...
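
To lift the bandwidth restriction again, write a negative value to the quota file, as described above:

echo -1 > /sys/fs/cgroup/cpu/set1/cpu.cfs_quota_us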

Conclusion

As you can see, control groups are an interesting choice for managing system resources. Think about use cases like containers and real-time. We used fairly low-level kernel interfaces here; there are definitely more high-level ways to manage cgroups as well.

Appendix

A post about “Degrees of real-time” is here. If you want to learn how Embedded Linux and real-time Linux work, have a look here. To learn more about the Yocto Project®, have a look here.
