Cgroup initialization
Call stack
start_kernel()
cgroup_init_early()
for_each_subsys()
mem_cgroup_init()
cgroup_init()
for_each_subsys()
cgroup Macro Expansion
The Linux kernel uses x-macros to populate an enum, cgroup_subsys_id, whose members are the cgroup subsystem IDs. The cgroup_subsys.h header file checks the relevant CONFIG_ flags (set at kernel compile time) to decide which cgroup subsystems are enabled. The name of each enabled subsystem is passed as the argument to a SUBSYS() call, and this list of calls is the core of the x-macro pattern.
I will cover the three x-macro expansion patterns most relevant to this guide. The macro SUBSYS() has multiple definitions to implement different x-macro expansions:
- Expansion to populate an enum type named cgroup_subsys_id
#define SUBSYS(_x) _x ## _cgrp_id,
enum cgroup_subsys_id {
#include <linux/cgroup_subsys.h>
CGROUP_SUBSYS_COUNT,
};
The subsystem name passed to SUBSYS() (i.e., _x) is combined with _cgrp_id to form a member of the enum type cgroup_subsys_id.
[!NOTE] The ## symbol is the token-pasting operator; it joins _x and _cgrp_id into a single identifier.
Including the header places one SUBSYS() call per enabled cgroup subsystem into the body of the enum:
#define SUBSYS(_x) _x ## _cgrp_id,
enum cgroup_subsys_id {
SUBSYS(cpuset)
SUBSYS(cpu)
SUBSYS(memory)
SUBSYS(devices)
SUBSYS(io)
CGROUP_SUBSYS_COUNT,
};
[!NOTE] For brevity, I’m only showing a few cgroup subsystems.
Each SUBSYS(_x) call is expanded by taking the argument and pasting (i.e., concatenating) _cgrp_id onto it. For example, SUBSYS(cpu) becomes cpu_cgrp_id followed by a comma. After preprocessing, the enum type looks like this:
enum cgroup_subsys_id {
cpuset_cgrp_id,
cpu_cgrp_id,
memory_cgrp_id,
devices_cgrp_id,
io_cgrp_id,
CGROUP_SUBSYS_COUNT,
};
- Expansion to create the name of a cgroup subsystem struct.
#define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys;
#include <linux/cgroup_subsys.h>
#undef SUBSYS
This x-macro pattern uses a different definition of the SUBSYS() macro. As in the pattern described above, the argument passed to SUBSYS() is the name of an enabled cgroup subsystem; here, however, the macro concatenates the name with _cgrp_subsys to declare an extern variable of type struct cgroup_subsys. For the same subsystems, this expands to:
extern struct cgroup_subsys cpu_cgrp_subsys;
extern struct cgroup_subsys cpuset_cgrp_subsys;
extern struct cgroup_subsys memory_cgrp_subsys;
extern struct cgroup_subsys devices_cgrp_subsys;
extern struct cgroup_subsys io_cgrp_subsys;
- Expansion to create an array of cgroup subsystem struct pointers (i.e., struct cgroup_subsys *)
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
struct cgroup_subsys *cgroup_subsys[] = {
#include <linux/cgroup_subsys.h>
};
This would expand to the following array definition:
struct cgroup_subsys *cgroup_subsys[] = {
[cpu_cgrp_id] = &cpu_cgrp_subsys,
[cpuset_cgrp_id] = &cpuset_cgrp_subsys,
[memory_cgrp_id] = &memory_cgrp_subsys,
[devices_cgrp_id] = &devices_cgrp_subsys,
[io_cgrp_id] = &io_cgrp_subsys,
};
Indexing into cgroup_subsys array of cgroup_subsys struct pointers
In the third expansion above, the enum members produced by the first expansion are used as array indices (C designated initializers). This guarantees that each pointer sits at the index matching its subsystem ID, regardless of the order in which the entries are listed. It also allows simple access to a subsystem struct by indexing into the array with its ID.
As shown in the original call stack, cgroup_init and cgroup_init_early both call for_each_subsys, which, itself, is a macro (NOT an X-macro).
#define for_each_subsys(ss, ssid) \
for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT && \
(((ss) = cgroup_subsys[ssid]) || true); (ssid)++)
It’s pretty clear what this macro is doing; I’m including it to show how each cgroup subsystem is initialized. I’m not entirely sure why a macro is used for this for loop, as it (seemingly) has no semantic difference from writing out the following:
for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT && (((ss) = cgroup_subsys[ssid]) || true); (ssid)++)
{
// code here
}
One important thing to note here is that the enum is never referenced explicitly in the loop. As shown above, X-macro expansion created the enum members representing the cgroup subsystem IDs. Each of those members was then used in the x-macro expansion for the cgroup subsystem struct pointer array (struct cgroup_subsys *cgroup_subsys[]) as the index at which the corresponding subsystem struct pointer is stored (e.g., [memory_cgrp_id] = &memory_cgrp_subsys).
The for-loop shown above simply indexes into the array for every ssid from 0 up to, but not including, CGROUP_SUBSYS_COUNT. The second half of the loop condition, (((ss) = cgroup_subsys[ssid]) || true), assigns the array entry to ss (i.e., struct cgroup_subsys *ss), a pointer that is local to the calling function (i.e., on that function's stack). The || true does not verify the assignment; it forces the expression to evaluate to true, so the loop would not stop early even if an entry were NULL.
The loop body can then reach the subsystem struct through ss and manipulate it.
cgroup_init_early() and cgroup_init() call stacks
for_each_subsys()
cgroup_init_subsys()
for_each_subsys()
cgroup_init_subsys()
In cgroup_init_early(), the loop assigns each subsystem its ID and its name, the name being read from an array of cgroup subsystem names. Next, cgroup_init_subsys() is called, which: takes the cgroup lock, initializes the head of a list, creates a root cgroup state for the subsystem, then releases the lock.
Looking at cgroup_init_subsys(), the subsystem is initialized by creating a cgroup subsystem state, which is the basic structure used by controllers.
The cgroup_subsys struct that was passed in has its root set to the default hierarchy root (i.e., cgrp_dfl_root), and the cgroup subsystem state is set to the result of ss->css_alloc(NULL), which is a call through a function pointer.
struct cgroup_subsys {
struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);
...
};
Passing NULL here means passing a NULL pointer for the parent_css parameter, which has type struct cgroup_subsys_state *. Because the state being created belongs to the root cgroup, it has no parent state to point to, so NULL appears to be the minimal way to ask the subsystem for its root-level state.
There is a call to init_and_link_css(), which passes css, ss, and the address of cgrp_dfl_root.cgrp (the root cgroup) as the cgroup parameter.
That function zeroes the memory of the cgroup subsystem state and sets the state's cgroup member to the cgroup that was passed in (i.e., a pointer to the root cgroup, cgrp_dfl_root.cgrp). Next, the state's subsystem member is set to the subsystem parameter that was passed, followed by some further initialization.
The newly initialized cgroup subsystem state is stored in init_css_set; one of that struct's members is subsys, which is an array of pointers to cgroup_subsys_state. The subsystem's ID (from the enum we covered earlier) is the index.
Finally, bitmasks are used, I suppose as flags, to record which callbacks each subsystem provides:
have_fork_callback |= (bool)ss->fork << ss->id;
have_exit_callback |= (bool)ss->exit << ss->id;
have_release_callback |= (bool)ss->release << ss->id;
have_canfork_callback |= (bool)ss->can_fork << ss->id;
After cgroup_init_subsys returns, cgroup_init() creates what appears to be a feature bitmask that would be used to determine if the subsystem is implicit, threaded, and legacy cftype.
The for-loop ends with a bind (ss->bind(..)) and the cgroup subsystem directory is populated with the newly initialized cgroup.
mem_cgroup_init()
Not really doing much, it seems like it’s only setting up some caches.
It’s also helpful to look at the cgroup_subsys struct, which has several function pointers. The memory cgroup subsystem struct has the following assignments:
struct cgroup_subsys memory_cgrp_subsys = {
.css_alloc = mem_cgroup_css_alloc,
.css_online = mem_cgroup_css_online,
.css_offline = mem_cgroup_css_offline,
.css_released = mem_cgroup_css_released,
.css_free = mem_cgroup_css_free,
.css_reset = mem_cgroup_css_reset,
.css_rstat_flush = mem_cgroup_css_rstat_flush,
.attach = mem_cgroup_attach,
.fork = mem_cgroup_fork,
.exit = mem_cgroup_exit,
.dfl_cftypes = memory_files,
#ifdef CONFIG_MEMCG_V1
.legacy_cftypes = mem_cgroup_legacy_files,
#endif
.early_init = 0,
};
Relevant call stack
callers:
filemap.c::filemap_add_folio()
huge_memory.c::vma_alloc_anon_folio_pmd()
khugepaged.c::alloc_charge_folio() // unlikely()
ksm.c::ksm_might_need_to_copy()
memory.c::folio_prealloc()
memory.c::alloc_anon_folio()
migrate_device.c::migrate_vma_insert_page()
shmem.c::shmem_alloc_and_add_folio()
shmem.c::shmem_mfill_atomic_pte()
userfaultfd.c::mfill_atomic_pte_copy()
userfaultfd.c::mfill_atomic_pte_zeroed_folio()
----
mem_cgroup_charge()
__mem_cgroup_charge()
memcg = get_mem_cgroup_from_mm(mm)
ret = charge_memcg()
css_put()
A folio[1, 2] is a struct page which is guaranteed not to be a tail page.
Memory Resource Controller
Memory Cgroup Accounting
The memory cgroup subsystem.
Each mm_struct knows about which cgroup it belongs to and each page has a pointer to the page_cgroup, which knows the cgroup that it belongs to.
The accounting is done as follows: mem_cgroup_charge() is invoked to set up the necessary data structures and check if the cgroup that is being charged is over its limit. If it is, then reclaim is invoked on the cgroup. More details can be found in the reclaim section of this document. If everything goes well, a page meta-data-structure called page_cgroup is updated. page_cgroup has its own LRU on cgroup. (*) The page_cgroup structure is allocated at boot/memory-hotplug time.
