Linux kmalloc Memory Allocation

Maurice | Dec 17, 2025

Allocating Memory (p. 213)

Kmalloc

- Similar to malloc
- Fast (unless it blocks) and does not clear the obtained memory; the allocated region still holds its previous contents
- The allocated region is also contiguous in physical memory (a very important property of kmalloc())
#include <linux/slab.h>
void *kmalloc(size_t size, gfp_t flags);

Flags:

GFP_KERNEL

  • Means the allocation is performed on behalf of a process running in kernel space
  • The GFP_ prefix comes from the internal call to __get_free_pages(), which backs the kmalloc() allocation (GFP == get free pages)
  • The calling function is typically executing a system call on behalf of a process.
    • GFP_KERNEL allows kmalloc() to put the process to sleep while waiting for a memory page to become available in low-memory situations.
    • Because kmalloc() may sleep the process, the GFP_KERNEL flag can only be used in process context, not interrupt context.
      • This is because only a process context is reentrant; kmalloc() called within a process context can wait for the page.
      • Interrupt contexts are not reentrant: while the kernel worked to find free pages for a contiguous allocation, kmalloc() would have no way to sleep, because there would be no means of returning to the interrupt event. Consider a low-memory situation. Recall that kmalloc() provides allocations of contiguous physical memory, and a contiguous physical range large enough to satisfy a request is not always (in fact, rarely) immediately available. When memory is low, the kernel is unlikely to find a free contiguous range right away, so it works to free memory (e.g., flushing buffers to disk, swapping pages out). This takes time, which is why GFP_KERNEL is useful: it can put the process to sleep while the kernel hunts for memory; once the memory is found, kmalloc() wakes the process and execution continues. In an interrupt context this is simply not possible: you cannot sleep an interrupt and re-enter its state. Thus, GFP_KERNEL cannot be used in interrupt contexts.
    • Furthermore, the calling function must not be running in an atomic context. Atomic contexts guarantee completion without interruption and are therefore not reentrant.
    • Code using kmalloc() in interrupt context should use GFP_ATOMIC instead. (Don't confuse this with the previous point, which states that the calling function cannot be in an atomic context.)
      • The kernel typically reserves some pages for atomic allocations, and the GFP_ATOMIC flag allows kmalloc() to use them. kmalloc() can dip all the way down to the last reserved page; if even that fails, the allocation fails.

A list of all flags that can be used with kmalloc, along with an explanation, can be found in include/linux/gfp_types.h.
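As a sketch of how the two flags pair with context, here is a minimal, hypothetical kernel-module fragment; the function names are invented for illustration, and this is not meant as a definitive driver pattern:

```c
#include <linux/slab.h>      /* kmalloc(), kfree() */
#include <linux/gfp.h>       /* GFP_KERNEL, GFP_ATOMIC */

/* Process context (e.g., an open() or ioctl() handler on a syscall path):
 * sleeping is allowed, so GFP_KERNEL is the right choice. */
static void *alloc_in_process_ctx(size_t size)
{
        return kmalloc(size, GFP_KERNEL);   /* may sleep during reclaim */
}

/* Interrupt context (e.g., an IRQ handler): sleeping is forbidden, so
 * GFP_ATOMIC draws on the kernel's reserved pages and never sleeps. */
static void *alloc_in_irq_ctx(size_t size)
{
        void *buf = kmalloc(size, GFP_ATOMIC);

        if (!buf)
                return NULL;   /* atomic allocation can fail; always check */
        return buf;
}
```

Note that GFP_ATOMIC failures must always be handled: since the allocator cannot sleep and wait for reclaim, running out of reserve pages is a normal, expected outcome.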

Memory Zones

Memory zones include: Normal, DMA-capable, and High-Memory zones.

Normal Memory Zone

Normal is used for most allocations; however, if the __GFP_DMA or __GFP_HIGHMEM flag is set, the allocation must be satisfied from a zone other than the Normal zone.

  • This is due to most RAM being viewed as equivalent by the kernel.
  • Some memory, such as high memory, cannot be accessed directly by the kernel; special mappings (e.g., via kmap()) must be created first. This illustrates why the Normal zone cannot be used for all allocations.

DMA-capable Memory Zone

This memory lives in a preferential address range, where peripherals can perform DMA access.

  • For most modern platforms, all memory lives in this Zone
  • (x86 archs) No size limit for the DMA-capable zone is imposed on PCI/e devices
    • Legacy ISA devices are limited to the first 16MB of RAM
    • I believe this is because of ISA's limited address bus: 20 bits in PC/XT (8-bit) mode and 24 bits in AT (16-bit) mode.
      • For PC/XT, the 20-bit address bus supports a maximum of 1MB of addressable memory.
      • For AT, the bus was extended to 24 bits, allowing a maximum memory space of 16MB.

Further reading about DMA is strongly encouraged. The Linux kernel documentation includes a helpful page on DMA mapping that provides more detail about DMA-capable memory. Reading the documentation page covering the DMAEngine is also strongly encouraged.

High Memory

A mechanism allowing access to large amounts of memory. As mentioned earlier, this memory cannot be accessed directly by the kernel; special mappings must be set up first. Still, using this zone is a viable option for device drivers with large memory footprints.

On traditional 32-bit systems, 4GB of memory could be addressed. Linux, however, splits that address space into user and kernel space, typically giving 3GB to user space and 1GB to the kernel. All of the kernel's data structures, including the kernel code itself, must fit in that 1GB space. This includes drivers. The kernel cannot manipulate memory unless that memory is mapped into the kernel's address space; thus, the biggest consumer of kernel address space is virtual mappings.

While user processes each have their own virtual memory space, all processes share a single privileged kernel virtual address space. This kernel space is shared by mapping it into the upper portion of every process's virtual address space. This improves performance, especially for syscalls and interrupts: a full context switch of the memory tables isn't required. If the kernel's space were not mapped and differed from the user's, a context switch, and therefore a TLB flush, would be required for every system call.

Furthermore, the kernel maps physical RAM into its virtual space, allowing it to efficiently access any part of physical memory by adjusting an offset. This memory mapping is a direct mapping: a 1-to-1 mapping from virtual to physical addresses, and such an address is usually referred to as a logical address. This means every byte of directly mapped RAM has a corresponding kernel virtual address, giving the kernel full visibility of RAM and the ability to "touch" any piece of memory, such as a user buffer or a page table.

For user-space processes needing memory, the kernel maintains a list of assigned virtual ranges (stored in vm_area_struct objects). These are just address ranges. When a process attempts to touch one of these addresses, if a mapping to a physical page doesn't already exist:

  1. A page fault occurs.
  2. The kernel locates a free RAM page.
  3. The kernel updates the process's page table to map the virtual page (corresponding to the address the user attempted to access) to the newly located physical page.
  4. Subsequent accesses should not page fault.

The process now has a page table entry for that address, with the Physical Frame Number (PFN) mapped and the valid/present bit set. The kernel still has its own private, directly mapped virtual address that maps to the same physical page frame.

[!NOTE] Memory allocated to a process isn't automatically mapped to a physical page when the user virtual memory is allocated. The relevant page table entry can always be located using the VPN and offset mechanism: each virtual address (byte addressable) corresponds to a page table entry, and only when a Physical Frame Number (PFN) is mapped does the PTE contain the PFN. The MMU checks the valid/present bit; if it's not set, no PFN is mapped, and a page fault occurs to locate a free physical page, get its PFN, and record it in the PTE. Since virtual addressing allows the cumulative amount of virtual memory to exceed the physical memory available, it's usually not possible for every process to have all of its pages resident, unless you're only running a few. Instead, swap and demand paging are used to load only what's needed. This keeps physical pages free for processes that actually need them, rather than letting processes waste pages by not using them.

When a driver (or anything else) calls vmalloc() or kmalloc(), the allocations, which live in kernel space, along with the kernel's directly mapped virtual space, are mapped into every process's page table. This mapping lets the kernel handle system calls or interrupts without a memory-table context switch (a process context switch still occurs), that is, without swapping page tables and flushing the TLB. Originally, every kernel virtual address was mapped; modern kernels, however, use Kernel Page Table Isolation (KPTI). User page tables include PTEs for the process plus a minimal trampoline of kernel memory containing essential entry points for syscalls and interrupts. When a system call occurs, the kernel immediately switches to a second set of page tables that include PTEs for the kernel virtual address space.

This also applies to I/O devices, such as a GPU or NIC. A driver may use something like ioremap() to take the physical bus address of a PCIe device and map it into kernel virtual space. The driver can then read and write the MMIO region through kernel virtual memory, and the hardware routes those accesses to the device instead of RAM. This is very common with DMA: a DMA driver allocates DMA-coherent buffers that are mapped into kernel virtual memory, and during a transfer, data moves directly between the I/O device and the coherent buffers without using the CPU.

The mappings also allow the kernel to mitigate fragmentation by giving a user process contiguous virtual memory blocks that map to (usually) discontiguous physical memory blocks.

This also allows the kernel to provide isolation (not to be confused with hardware-enabled protection).

  • Even though the kernel shares an address space with the process (by mapping its KVAS to upper portion of process virtual address space), the MMU’s supervisor (or protection) bits in the page tables are used to prevent user applications from reading/writing to kernel memory.

[!IMPORTANT] Don't confuse user and kernel virtual address spaces with many-to-one thread mappings between user and kernel threads. Many-to-one threading is about how code runs: it is a management model where a user-space library manages multiple "user threads" that are all mapped to a single "kernel thread" (the entity the OS actually schedules on a CPU).

TLDR: The kernel needs its own virtual address for any memory it must touch directly. Modern processors include address-extension features that support more than 4GB of RAM on 32-bit systems. Only the lowest portion (1-2GB) of RAM has logical addresses; that is, only the lowest portion of RAM is directly mapped into the kernel's virtual address space (KVAS), because that's all that will fit. High memory is not mapped and does not have a logical address.

To use high memory, the kernel must set up explicit virtual mappings that bring high-memory pages into the KVAS, making them accessible to the kernel. This is why kernel data structures are placed in low memory; high memory is reserved for user-space process pages.
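A sketch of that explicit mapping step, using the kernel's kmap_local_page() interface (the older kmap()/kunmap() pair works similarly); `page` is assumed to be a struct page that may live in high memory, and the helper name is invented:

```c
#include <linux/highmem.h>   /* kmap_local_page(), kunmap_local() */
#include <linux/string.h>    /* memset() */
#include <linux/mm.h>        /* struct page, PAGE_SIZE */

/* High-memory pages have no permanent logical address, so the kernel
 * must create a temporary mapping in the KVAS before touching them. */
static void zero_possibly_high_page(struct page *page)
{
        void *vaddr = kmap_local_page(page);  /* map into the KVAS */

        memset(vaddr, 0, PAGE_SIZE);          /* kernel can now touch it */
        kunmap_local(vaddr);                  /* tear the mapping down */
}
```

For low-memory pages the "mapping" is just the existing direct map, so this pattern is cheap there and only pays the mapping cost when the page really is high memory.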

Lookaside Cache

Memory Pools

There are places in the kernel where memory allocations cannot be allowed to fail. As a way of guaranteeing allocations in those situations, the kernel developers created an abstraction known as a memory pool (or “mempool”). A memory pool is really just a form of a lookaside cache that tries to always keep a list of free memory around for use in emergencies.

// prototype for mempool_create()

mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data);

min_nr is the minimum number of objects the pool keeps preallocated.

When the mempool is created, the allocation function will be called enough times to create a pool of preallocated objects. Thereafter, calls to mempool_alloc attempt to acquire additional objects from the allocation function; should that allocation fail, one of the preallocated objects (if any remain) is returned.

When an object is freed with mempool_free, it is kept in the pool if the number of preallocated objects is below the minimum; otherwise, it is returned to the system.

Mempool resizing can be done with the mempool_resize function:

// prototype for mempool_resize()
int mempool_resize(mempool_t *pool, int new_min_nr, int gfp_mask);

Function __get_free_pages()

If large memory chunks are needed, allocating whole pages is a better technique than using a mempool or lookaside cache. Flag options and usage for __get_free_pages() are the same as for kmalloc().

Allocating whole pages has the following advantages:

  • Page alignment
  • Less overhead: __get_free_pages() is a low-level allocator and interacts directly with the buddy system
  • Improved TLB performance (i.e., a higher rate of TLB hits)

[!NOTE] Comments in the source code clearly state that GFP_DMA and GFP_DMA32 should be avoided. These flag options remain available to support legacy systems that require them. While kmalloc and __get_free_pages both accept these flags, the DMA API function dma_alloc_coherent() should be used instead. Please refer to the DMA API documentation for guidance.

vmalloc

The allocation function vmalloc allocates contiguous virtual memory, but does not guarantee contiguous physical memory.

Use of vmalloc is discouraged in most situations. Memory obtained from vmalloc is slightly less efficient to work with, and, on some architectures, the amount of address space set aside for vmalloc is relatively small. Code that uses vmalloc is likely to get a chilly reception if submitted for inclusion in the kernel.

Communicating w/Hardware (p. 235)

- Skim / use osdev

Interrupt Handling (p. 258)

- 259 - 275
- 281 - 285

PCI (p. 302)

- Skim only

Memory Mapping DMA (p. 412)

- Full

Network Drivers (p. 497)

- Full