Linux Kernel Memory Management

Initialization

Memory management initialization lives in the file mm/mm_init.c.

One thing I notice is that many of the functions in this file are marked with macros designating them as initialization functions. The most notable, __init and __meminit, are defined in include/linux/init.h.

__init macro

The __init macro expands to __section(".init.text") __cold __latent_entropy __no_kstack_erase

This essentially places the function in the .init.text section of the kernel image. According to the Linux Foundation's Linux Standard Base, where Linux documentation on standards resides, the .init section in an ordinary ELF binary stores executable instructions that contribute to the process initialization code. When a program starts to run, the system arranges to execute the code in this section before calling the main program entry point.

The kernel's .init.text plays a similar role, with one difference: the functions in mm/mm_init.c marked with __init are called explicitly during boot (for example, from start_kernel()), and the memory that holds them is freed once initialization is complete.

If you’re not familiar with ELF, I strongly suggest reading about it. The text section is for non-writable executable code, so .init.text is a section for non-writable initialization code; that is, code that can be executed but cannot be written to. For obvious reasons, you don’t want people to be able to write to executable code.

Sections allow the linker to group code not only into one of the main ELF sections (e.g., .text, .data, .bss, etc.) but also into subsections.
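To make the section mechanism concrete, here is a minimal user-space sketch, assuming GCC or Clang targeting ELF. The macro my_init, the section name .text.my_init, and the function setup_tables() are hypothetical stand-ins for the kernel's __init and .init.text. Nothing here frees the section after use, so it only illustrates the placement, not the kernel's full behavior.

```c
#include <stdio.h>

/*
 * Hypothetical analogue of the kernel's __init: place the function in a
 * named ELF section and mark it cold. The section name is made up for
 * this example; noinline keeps the function body in that section.
 */
#define my_init __attribute__((section(".text.my_init"), cold, noinline))

/* A one-time setup routine that the linker groups into .text.my_init. */
static int my_init setup_tables(void)
{
	puts("running one-time setup");
	return 42;
}
```

Calling setup_tables() from main() behaves like any other function call; the only difference is where the linker places the code. Compiling this into an object file and inspecting it with objdump -h should list .text.my_init among the sections.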

Memory Initialization

After looking at the code, it’s pretty easy to see how memory is initialized. Here is a high-level overview.

The function mm_core_init() is called to set up the memory allocators. Here is the body of mm_core_init() [Kernel:stable:6.18.6]:

void __init mm_core_init(void)
{
	arch_mm_preinit();
	hugetlb_bootmem_alloc();

	/* Initializations relying on SMP setup */
	BUILD_BUG_ON(MAX_ZONELISTS > 2);
	build_all_zonelists(NULL);
	page_alloc_init_cpuhp();
	alloc_tag_sec_init();
	page_ext_init_flatmem();
	mem_debugging_and_hardening_init();
	kfence_alloc_pool_and_metadata();
	report_meminit();
	kmsan_init_shadow();
	stack_depot_early_init();

	kho_memory_init();

	memblock_free_all();
	mem_init();
	kmem_cache_init();

	page_ext_init_flatmem_late();
	kmemleak_init();
	ptlock_cache_init();
	pgtable_cache_init();
	debug_objects_mem_init();
	vmalloc_init();

	if (!deferred_struct_pages)
		page_ext_init();

	pti_init();
	kmsan_init_runtime();
	mm_cache_init();
	execmem_init();
}

One of the functions called from the body of mm_core_init() is mem_init(), which is defined directly above mm_core_init() and marked with a macro corresponding to the weak attribute[1]. This attribute lets developers define library functions that can be overridden in user code. This is possible because the definition is emitted as a weak symbol rather than a global one, so the linker prefers any non-weak definition with the same name.

void __init __weak mem_init(void)
{
}

As you can see, the body of mem_init() is empty. This is not a function prototype (hence the curly brackets) but a definition with no code. The __weak macro (which expands to __attribute__((__weak__))) allows this. The implementation of mem_init() is architecture dependent; that is, an architecture that Linux supports can define its own mem_init(), which overrides the instance of mem_init() in mm/mm_init.c.
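The override mechanism can be sketched in user space, assuming GCC or Clang. The names demo_mem_init() and run_init() are hypothetical and exist only for this illustration; the block shows the weak default on its own, with the overriding definition described in a comment because it has to live in a separate object file.

```c
/*
 * Weak default, analogous to the empty mem_init() in mm/mm_init.c.
 * demo_mem_init is a hypothetical name used only in this sketch.
 */
__attribute__((weak)) int demo_mem_init(void)
{
	return 0;	/* generic fallback: nothing to do */
}

/*
 * If another object file linked into the same program contained
 *
 *     int demo_mem_init(void) { return 1; }
 *
 * the linker would pick that strong definition and ignore this one,
 * without reporting a duplicate-symbol error.
 */
int run_init(void)
{
	return demo_mem_init();
}
```

With no strong definition linked in, run_init() uses the weak fallback. This mirrors how mm_core_init() can call mem_init() unconditionally, whether or not the architecture supplies one.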

As an example, let’s take a look at the x86_64 architecture.

If you’re unfamiliar with x86_64 bootloaders, it’s important that you understand a few of the basics. All systems with an x86 (or x86_64) architecture first boot into a 16-bit mode known as real mode. This is a legacy mode, and it is required for backwards compatibility with the original 8086. Real mode does not support hardware memory protection, and it uses an obsolete memory addressing scheme, segmentation, in which a 16-bit segment value is multiplied by 16 and added to a 16-bit offset to form the address of a byte or word. Because of this, switching to 32-bit mode as quickly as possible is the first objective of modern bootloaders when booting on x86. I won’t cover this in detail, but it’s covered extensively on the OSDev Wiki website.
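The real mode address calculation is simple enough to write down directly. The helper name real_mode_linear() is hypothetical; the arithmetic is the segment-times-16-plus-offset rule described above.

```c
#include <stdint.h>

/*
 * Real-mode address translation: the 16-bit segment value is multiplied
 * by 16 (shifted left by 4) and the 16-bit offset is added.
 * real_mode_linear is a hypothetical helper name.
 */
static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
	return ((uint32_t)segment << 4) + offset;
}
```

For example, the segment:offset pairs 0x07C0:0x0000 and 0x0000:0x7C00 both resolve to 0x7C00, which shows that many different pairs can name the same byte.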

After switching to 32-bit mode, hardware-enforced protection is available and paging, the modern virtual memory mechanism, can be enabled. After some setup, the system can switch to 64-bit mode. On x86_64 architectures, this order is always followed; that is, the system always starts in 16-bit real mode, then moves to 32-bit mode, then to 64-bit mode. (If only booting to 32-bit, the bootloader stops at 32-bit mode.) This boot escalation is transparent to the user. Another important thing to note is that, once the bootloader switches to 32-bit mode, it can use the hardware protection for basic memory layout setup, including setting up stack space. Once stack space is created, the bootloader program can switch to C.
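The three stages can be told apart by two control bits: CR0.PE (bit 0) is set once protected mode is enabled, and EFER.LMA (bit 10) is set once long mode is active. The sketch below classifies a mode from register values passed in as plain integers, since user-space code cannot read CR0 or EFER directly. The enum and the function name classify_mode() are hypothetical.

```c
#include <stdint.h>

#define CR0_PE   (1u << 0)	/* CR0.PE: protected mode enabled */
#define EFER_LMA (1u << 10)	/* EFER.LMA: long mode active */

enum cpu_mode { MODE_REAL, MODE_PROTECTED, MODE_LONG };

/*
 * Classify the operating mode from register values supplied by the
 * caller. classify_mode is a hypothetical helper for illustration.
 */
static enum cpu_mode classify_mode(uint64_t cr0, uint64_t efer)
{
	if (!(cr0 & CR0_PE))
		return MODE_REAL;	/* 16-bit real mode */
	if (efer & EFER_LMA)
		return MODE_LONG;	/* 64-bit long mode */
	return MODE_PROTECTED;		/* 32-bit protected mode */
}
```

The order of the checks follows the boot escalation: protection has to be on before long mode can become active.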

This is why the Linux kernel’s Memory Management program delegates the responsibility of defining behavior for mem_init() to the mm code in the arch subsystem. For x86, this function is defined in arch/x86/mm/init_64.c:

void __init mem_init(void)
{
	/* clear_bss() already clear the empty_zero_page */

	after_bootmem = 1;
	x86_init.hyper.init_after_bootmem();

	/*
	 * Must be done after boot memory is put on freelist, because here we
	 * might set fields in deferred struct pages that have not yet been
	 * initialized, and memblock_free_all() initializes all the reserved
	 * deferred pages for us.
	 */
	register_page_bootmem_info();

	/* Register memory areas for /proc/kcore */
	if (get_gate_vma(&init_mm))
		kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER);

	preallocate_vmalloc_pages();
}

[!NOTE] Read my extremely thorough notes on the Linux boot process, which follow the bootloader from the first jump from BIOS to the call that initializes memory management. That guide only explores the bootloader for i386/x86_64 and will help you understand how the bootloader brings the system from real mode to protected mode and, eventually, long (64-bit) mode. Most importantly, see how the hardware memory is initialized during boot (i.e., memory detection, BIOS helper functions, low and high memory mapping, page table setup and init, etc.). I walk through the assembly and explain a lot of useful and important things, even things like syscalls and asmlinkage.

[!NOTE] Another thing to note is that main.c, the compilation unit that defines start_kernel, is not in the arch subsystem folder. Rather, it’s in init/main.c. This illustrates how the kernel is the abstraction layer on top of hardware. Generally speaking, programs running on the Linux kernel do not need to be aware of the underlying architecture. Of course, this doesn’t apply to everything; compilers and drivers (some, not all) are examples of programs that still need to know the architecture.

start_kernel::mm_core_init()

The mm_core_init function calls many init functions; we’ve looked at mem_init(), which is marked with the __weak attribute and overridden by the architecture-specific implementation. The 64-bit implementation registers the memory page information set up during boot and registers memory areas for /proc/kcore. Finally, vmalloc pages are preallocated.

The next initialization function we’ll explore is vmalloc_init. Similar to mem_init, vmalloc_init is first defined in mm/internal.h with an empty body; unlike mem_init, the implementation does not come from the arch/ subsystem. It is defined in mm/vmalloc.c.

All I’ll say about this function is that it calls vmap_init_nodes().

Conclusion

This was fun and a good learning experience. Perhaps I’ll explore NUMA another time.
