Memory Allocation

  • There is essentially one engine for memory allocation in the Linux kernel: the page allocator (aka the buddy system allocator, or BSA), with the slab allocator layered on top of it.
  • The kernel’s primary means of memory allocation are the Page Allocator API (aka the buddy system) and the Slab Allocator API.
  • The internal implementation of the slab allocator used by modern kernels is called the SLUB allocator.
  • The slab allocator addresses the wastage (internal fragmentation) that arises when small requests are served directly by the page allocator: it carves pages into caches of fixed-size objects, allowing more efficient use of memory and lower overhead. This is why the slab allocator is layered on top of the page allocator.
  • Ultimately, the only way to allocate or deallocate physical memory in the kernel is through the page allocator API; every other kernel allocator sits on top of it.
  • The slab and page allocators reside completely in kernel space and are not directly accessible from user space; calling malloc or free in user space does not invoke them directly. In kernel space, use kmalloc()/kfree() or kzalloc()/kfree() instead.
  • The page allocator gets its page frames from the kernel’s lowmem region.
  • Linux kernel memory is non-swappable

Page Allocator (Buddy System)

  • The key to the page allocator is its primary internal metadata structure, the buddy system “free list”: an array of pointers to doubly linked circular lists.
  • The index into this array of pointers is called the ‘order’ of the list: it is the power to which 2 is raised. The indices run from 0 to MAX_ORDER - 1. The value of MAX_ORDER is architecture dependent, but is typically 11 or 12.
  • On x86 systems, a page is typically 4KB in size. Thus, order 0 corresponds to a single 4KB page, order 1 corresponds to 8KB (2 contiguous pages), order 2 corresponds to 16KB (4 contiguous pages), and so on.
  • Each doubly linked circular list points to free and physically contiguous blocks of memory of the corresponding order.
  • The kernel gives us a convenient view into the current state of the page allocator through the /proc/buddyinfo file. This file shows the number of free blocks of each order for each memory zone in the system.
λ ~ $ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2
Node 0, zone    DMA32      6      6      8      7      5      6      7      4      8      9    283
Node 0, zone   Normal      5    113    180   1018    715    320    151     49     38     10  14971
λ ~ $
  • The first two columns identify the node and the memory zone. Common zones include DMA, DMA32, and Normal: DMA is used for legacy devices that can only address the first 16MB of memory, DMA32 for devices that can address up to 4GB, and Normal is the standard zone for general-purpose allocations. The remaining columns give the number of free blocks of each order, starting from order 0. For example, in the Normal row above, the value 1018 at order 3 means there are 1018 free blocks of 2^3 = 8 contiguous pages (32KB) each.
  • If it’s not already obvious based on /proc/buddyinfo, the kernel keeps a different free-list for each memory zone to accommodate the different addressing requirements of various hardware devices.
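To make the order-to-size mapping concrete, here is a tiny userspace sketch. It hard-codes a 4 KB page size as an assumption and only mirrors the idea behind the kernel's get_order() helper (it is not the kernel's actual implementation); it computes the smallest order able to satisfy a given request:

#include <stdio.h>

/* Userspace sketch: find the smallest order whose block (2^order pages)
 * covers a requested size. The page size is assumed to be 4 KB here.
 */
#define PAGE_SZ 4096UL

static unsigned int order_for(unsigned long size)
{
	unsigned long pages = (size + PAGE_SZ - 1) / PAGE_SZ;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

int main(void)
{
	unsigned long sizes[] = { 100, 4096, 4097, 20000, 132 * 1024 };

	for (int i = 0; i < 5; i++)
		printf("request %7lu bytes -> order %u (a %lu KB block)\n",
		       sizes[i], order_for(sizes[i]),
		       (PAGE_SZ << order_for(sizes[i])) / 1024);
	return 0;
}

Note the rounding: a 4097-byte request already needs an order-1 (8 KB) block, and anything between one power-of-two boundary and the next is rounded up to the next order. This wastage for small requests is one reason the slab layer exists.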

How it works

  • When a request for memory allocation is made, the page allocator first determines the smallest order that can satisfy the request. It then checks the corresponding free-list for available blocks.

  • If a suitable block is found, it is removed from the free-list and returned to the requester.

  • If no suitable block is available, the allocator looks for a larger block in the higher-order free lists. When one is found, it is split in half (repeatedly, if necessary, down to the required order); one piece is returned to the requester while the remaining halves are added back to the appropriate lower-order free lists.

  • When memory is freed, the allocator checks whether the freed block’s “buddy” (the block of the same order it was originally split from; its page frame number differs in exactly one bit, as shown in the small sketch after the /proc/pagetypeinfo listing below) is also free. If so, the two blocks are merged back into a larger block and moved to the next higher-order free list. This merging continues recursively up the orders until no further merging is possible.

  • The buddy system’s design helps to minimize fragmentation by ensuring that memory blocks are always combined into larger blocks when possible.

  • The page allocator also incorporates various strategies to handle memory pressure, such as reclaiming memory from other parts of the kernel or invoking the Out-Of-Memory (OOM) killer when necessary.

  • The page allocator is optimized for performance, with fast allocation and deallocation operations to meet the demands of the kernel’s memory management needs.

  • The kernel provides another view into the state of memory allocation through the /proc/pagetypeinfo file. This file gives a more detailed breakdown of free pages by migrate type and order within each memory zone.

λ ~ $ sudo cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      1      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      2
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable      0      0      0      1      1      0      1      0      0      1      0
Node    0, zone    DMA32, type      Movable      6      6      8      6      4      6      6      4      8      8    283
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable    616    401    143     51     19      3      0      2      2      9      1
Node    0, zone   Normal, type      Movable   2290   1082    620   1032    792    350    161     47     31      0  14945
Node    0, zone   Normal, type  Reclaimable      1      1      1     19     31     18      4      2      3      1      2
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA            3            5            0            0            0            0
Node 0, zone    DMA32            6          614            0            0            0            0
Node 0, zone   Normal          254        31436           40            0            0            0
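The buddy lookup in the merge step described under “How it works” above is cheap: a block’s buddy at a given order has a page frame number that differs from its own in exactly one bit. The sketch below is plain userspace C with an arbitrary example PFN; the kernel’s own helper for this (to the best of my knowledge, __find_buddy_pfn() in mm/internal.h) uses the same XOR trick:

#include <stdio.h>

/* Userspace sketch of the buddy computation: at a given order, a block's
 * buddy PFN is its own PFN with bit 'order' flipped (an XOR). If both the
 * block and its buddy are free, they merge into a single order+1 block
 * whose starting PFN is the lower (more aligned) of the two.
 */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

int main(void)
{
	unsigned long pfn = 0x1230;	/* arbitrary, suitably aligned example PFN */

	for (unsigned int order = 0; order < 4; order++) {
		unsigned long buddy = buddy_pfn(pfn, order);
		unsigned long merged = pfn & ~((1UL << (order + 1)) - 1);

		printf("order %u: block @ PFN 0x%lx, buddy @ PFN 0x%lx, merged order-%u block @ PFN 0x%lx\n",
		       order, pfn, buddy, order + 1, merged);
	}
	return 0;
}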

Using the Page Allocator API

  • The Linux kernel exposes a comprehensive API for memory allocation through the page allocator.
  • For the APIs or macros that take two parameters, the first parameter is the GFP flags bitmask (named gfp_mask) and the second is typically the order of the allocation (order).
  • The API prototypes are currently defined in the include/linux/gfp.h header file.
#define pr_fmt(fmt) "%s:%s(): " fmt, KBUILD_MODNAME, __func__
#include <linux/init.h>
#include <linux/module.h>
#include <linux/mm.h>
#include "../../klib.h"

MODULE_AUTHOR("rtn");
MODULE_VERSION("0.2");
MODULE_LICENSE("Dual MIT/GPL");

static void *gptr1, *gptr2, *gptr3, *gptr4, *gptr5;
static int bsa_alloc_order = 3;
module_param_named(order, bsa_alloc_order, int, 0660);
MODULE_PARM_DESC(order, "order of our step 2 allocation (power-to-raise-2-to; default:3)");

/*
 * bsa_alloc : test some of the bsa (buddy system allocator
 * aka page allocator) APIs
 */
static int bsa_alloc(void)
{
	int stat = -ENOMEM;
	u64 numpg2alloc = 0;
	const struct page *pg_ptr1, *pg_ptr2;

	/* 0. Show the identity mapping: physical RAM page frames to kernel virtual
	 *    addresses, from PAGE_OFFSET for 5 pages
	 */
	pr_info("0. Show identity mapping: RAM page frames : kernel virtual pages :: 1:1\n"
		"(PAGE_SIZE = %ld bytes)\n", PAGE_SIZE);
	/* SEE THIS!
	 * Show the virt, phy addr and PFNs (page frame numbers).
	 * This function is in our 'library' code here: ../../klib.c
	 * This way, we can see if the pages allocated are really physically
	 * contiguous! Signature:
	 *  void show_phy_pages(const void *kaddr, size_t len, bool contiguity_check);
	 */
	pr_info("[--------- show_phy_pages() output follows:\n");
	show_phy_pages((void *)PAGE_OFFSET, 5 * PAGE_SIZE, 1);
	pr_info(" --------- show_phy_pages() output done]\n");

	/* 1. Allocate one page with the __get_free_page() API */
	gptr1 = (void *)__get_free_page(GFP_KERNEL);
	if (!gptr1) {
		pr_warn("mem alloc via __get_free_page() failed!\n");
		/* As per convention, we emit a printk above saying that the
		 * allocation failed. In practice it isn't required; the kernel
		 * will definitely emit many warning printk's if a memory alloc
		 * request ever fails! Thus, we do this only once (here; could also
		 * use the WARN_ONCE()); from now on we don't pedantically print any
		 * error message on a memory allocation request failing.
		 */
		goto out1;
	}
	pr_info("#.    BSA/PA API     Amt alloc'ed        KVA\n");	// header

	pr_info("1.  __get_free_page()     1 page    %px\n", gptr1);

	/* 2. Allocate 2^bsa_alloc_order pages with the __get_free_pages() API */
	numpg2alloc = powerof(2, bsa_alloc_order);  /* returns 2^bsa_alloc_order */
	gptr2 = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, bsa_alloc_order);
	if (!gptr2) {
		/* no error/warning printk now; see above comment */
		goto out2;
	}
	pr_info("2. __get_free_pages()  2^%d page(s)  %px\n",
		bsa_alloc_order, gptr2);
	pr_info("[--------- show_phy_pages() output follows:\n");
	show_phy_pages(gptr2, numpg2alloc * PAGE_SIZE, 1);
	pr_info(" --------- show_phy_pages() output done]\n");

	/* 3. Allocate and init one page with the get_zeroed_page() API */
	gptr3 = (void *)get_zeroed_page(GFP_KERNEL);
	if (!gptr3)
		goto out3;
	pr_info("#.    BSA/PA API     Amt alloc'ed        KVA\n");	// header
	pr_info("3.  get_zeroed_page()   1 page      %px\n", gptr3);

	/* 4. Allocate one page with the alloc_page() API.
	 * Careful! It does not return the alloc'ed page addr but rather the pointer
	 * to the metadata structure 'page' representing the allocated page:
	 *    struct page * alloc_page(gfp_mask);
	 * So, we use the page_address() helper to convert it to a kernel
	 * logical (or virtual) address.
	 */
	pg_ptr1 = alloc_page(GFP_KERNEL);
	if (!pg_ptr1)
		goto out4;
	gptr4 = page_address(pg_ptr1);
	pr_info("4.       alloc_page()   1 page      %px\n"
		" (struct page addr = %px)\n", (void *)gptr4, pg_ptr1);

	/* 5. Allocate 2^5 = 32 pages with the alloc_pages() API.
	 * < Same warning as above applies here too! We check the returned
	 *   struct page pointer for NULL *before* converting it with
	 *   page_address(); checking the converted address isn't reliable. >
	 */
	pg_ptr2 = alloc_pages(GFP_KERNEL, 5);
	if (!pg_ptr2)
		goto out5;
	gptr5 = page_address(pg_ptr2);
	pr_info("5.      alloc_pages()  %lld pages     %px\n",
		powerof(2, 5), (void *)gptr5);

	return 0;
 out5:
	free_page((unsigned long)gptr4);
 out4:
	free_page((unsigned long)gptr3);
 out3:
	free_pages((unsigned long)gptr2, bsa_alloc_order);
 out2:
	free_page((unsigned long)gptr1);
 out1:
	return stat;
}

static int __init lowlevel_mem_init(void)
{
	return bsa_alloc();
}

static void __exit lowlevel_mem_exit(void)
{
	pr_info("free-ing up the prev allocated BSA/PA memory chunks...\n");
	/* Free 'em! We follow the convention of freeing them in the reverse
	 * order from which they were allocated
	 */
	free_pages((unsigned long)gptr5, 5);
	free_page((unsigned long)gptr4);
	free_page((unsigned long)gptr3);
	free_pages((unsigned long)gptr2, bsa_alloc_order);
	free_page((unsigned long)gptr1);
	pr_info("removed\n");
}

module_init(lowlevel_mem_init);
module_exit(lowlevel_mem_exit);

Using the Slab Allocator API

  • The slab allocator is optimized for allocating and deallocating small objects, making it suitable for kernel data structures.
  • kmalloc() is used to allocate memory, while kfree() frees it (a minimal usage sketch follows this list).
  • kzalloc() is a variant of kmalloc() that zero-initializes the allocated memory.
  • You can use vmstat -m to view slab allocator statistics.
  • The kernel keeps a whole slew of slab caches of various sizes to optimize memory allocation for frequently used object sizes.
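Below is a minimal sketch of the slab API calls mentioned above, as they would appear inside a kernel module; struct foo and its fields are made up purely for illustration:

#include <linux/slab.h>

/* Hypothetical structure, purely for illustration */
struct foo {
	int id;
	char name[32];
};

static int demo_slab_alloc(void)
{
	struct foo *p;

	/* kzalloc() = kmalloc() followed by zeroing the memory */
	p = kzalloc(sizeof(*p), GFP_KERNEL);
	if (!p)
		return -ENOMEM;

	p->id = 1;
	/* ... use the object ... */

	kfree(p);
	return 0;
}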

Creating a custom slab cache

  • If we’re writing a driver, and within it, we notice that we frequently need to allocate and free objects of a particular size, we can create a custom slab cache for that object type using the kmem_cache_create API.
#define pr_fmt(fmt) "%s:%s(): " fmt, KBUILD_MODNAME, __func__

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/version.h>
#include <linux/sched.h>	/* current */
#include "../../convenient.h"
#include "../../klib.h"

#define OURCACHENAME "ryans_slab_cache"

static unsigned int use_ctor = 1;
module_param(use_ctor, uint, 0);
MODULE_PARM_DESC(use_ctor, "If set to 1 (default), our custom ctor routine"
	 " will initialize slabmem; when 0, no custom constructor will run");

MODULE_AUTHOR("rtn");
MODULE_DESCRIPTION("simple demo of creating a custom slab cache");
MODULE_LICENSE("Dual MIT/GPL");
MODULE_VERSION("0.1");

/* Our 'demo' structure; one that (we imagine) is often allocated and freed;
 * hence, we create a custom slab cache to hold pre-allocated 'instances'
 * of it... Its size: 328 bytes.
 */
struct myctx {
	u32 iarr[10];		// 40 bytes; total=40
	u64 uarr[10];		// 80 bytes; total=120
	s8 uname[128], passwd[16], config[64];	// 128+16+64=208 bytes; total=328
};
static struct kmem_cache *gctx_cachep;

static int use_our_cache(void)
{
	struct myctx *obj = NULL;

#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39)
	pr_debug("Cache name is %s\n", kmem_cache_name(gctx_cachep));
#else
	pr_debug("[ker ver > 2.6.38 cache name deprecated...]\n");
#endif

	obj = kmem_cache_alloc(gctx_cachep, GFP_KERNEL);
	if (!obj) {		/* pedantic warning printk below... */
		pr_warn("[Pedantic] kmem_cache_alloc() failed\n");
		return -ENOMEM;
	}

	pr_info("Our cache object (@ %pK, actual=%px) size is %u bytes; actual ksize=%zu\n",
		obj, obj, kmem_cache_size(gctx_cachep), ksize(obj));
	print_hex_dump_bytes("obj: ", DUMP_PREFIX_OFFSET, obj, sizeof(struct myctx));

	/* free it */
	kmem_cache_free(gctx_cachep, obj);
	return 0;
}

/* The parameter to the constructor is the pointer to the just-allocated memory
 * 'object' from our custom slab cache. Here, in our 'constructor' routine,
 * we initialize the just-allocated memory object.
 */
static void our_ctor(void *new)
{
	struct myctx *ctx = new;
	struct task_struct *p = current;

	/* TIPS:
	 * 1. To see how exactly we got here, insert this call: dump_stack();
	 *    (read its output bottom-up ignoring call frames that begin with '?')
	 * 2. Count the number of times the printk below is emitted; it's, in effect,
	 *    the number of objects pre-allocated (cached) by the kernel within this slab cache.
	 */
	pr_info("in ctor: just alloced mem object is @ 0x%px\n", ctx);	/* %pK in production */
	memset(ctx, 0, sizeof(struct myctx));

	/* As a demo, we init the 'config' field of our structure to some
	 * (arbitrary) 'accounting' values from our task_struct
	 */
	snprintf_lkp(ctx->config, 6 * sizeof(u64) + 5, "%d.%d,%ld.%ld,%ld,%ld",
		 p->tgid, p->pid, p->nvcsw, p->nivcsw, p->min_flt, p->maj_flt);
}

static int create_our_cache(void)
{
	int ret = 0;
	void *ctor_fn = NULL;

	if (use_ctor == 1)
		ctor_fn = our_ctor;

	pr_info("sizeof our ctx structure is %zu bytes\n"
		" using custom constructor routine? %s\n",
		sizeof(struct myctx), use_ctor == 1 ? "yes" : "no");

	/* Create a new slab cache:
	 * kmem_cache_create(const char *name, unsigned int size, unsigned int align,
	 *                   slab_flags_t flags, void (*ctor)(void *));
	 * When our constructor's enabled (the default), this call will trigger it.
	 */
	gctx_cachep = kmem_cache_create(OURCACHENAME,	// name of our cache
					sizeof(struct myctx),	// (min) size of each object
					sizeof(long),	// alignment
					SLAB_POISON |	/* use slab poison values (explained soon) */
					SLAB_RED_ZONE |	/* good for catching buffer under|over-flow bugs */
					SLAB_HWCACHE_ALIGN,	/* good for performance */
					ctor_fn);	// ctor: here, on by default
	if (!gctx_cachep) {
		/* When a kernel-space mem alloc fails we'll usually not require a warning
		 * message as the kernel will definitely emit warning printk's.
		 * We do so here pedantically...
		 * Note: kmem_cache_create() returns NULL on failure (not an
		 * ERR_PTR value), so we set the error code ourselves.
		 */
		pr_warn("kmem_cache_create() failed\n");
		ret = -ENOMEM;
	}

	return ret;
}

static int __init slab_custom_init(void)
{
	int ret;

	pr_info("inserted\n");
	ret = create_our_cache();
	if (ret < 0)
		return ret;

	ret = use_our_cache();
	if (ret < 0)
		kmem_cache_destroy(gctx_cachep);	/* don't leak the cache on init failure */
	return ret;
}

static void __exit slab_custom_exit(void)
{
	kmem_cache_destroy(gctx_cachep);
	pr_info("custom cache destroyed; removed\n");
}

module_init(slab_custom_init);
module_exit(slab_custom_exit);

  • After creating a custom slab cache, we can view it using the vmstat -m command.
λ ~ $ sudo vmstat -m | head -n1
Cache                       Num  Total   Size  Pages
λ ~ $ sudo vmstat -m |grep -i ryan
ryans_slab_cache              0     36    448     36
λ ~ $
  • We also have access to slab information via the sysfs filesystem. Example: /sys/kernel/slab/<slab name>:
λ notes (main) $ ls /sys/kernel/slab/ryans_slab_cache/
aliases                   ctor                      object_size               poison                    shrink                    store_user
align                     destroy_by_rcu            objects_partial           reclaim_account           skip_kfence               total_objects
cache_dma                 hwcache_align             objs_per_slab             red_zone                  slabs                     trace
cpu_partial               min_partial               order                     remote_node_defrag_ratio  slabs_cpu_partial         usersize
cpu_slabs                 objects                   partial                   sanity_checks             slab_size                 validate


vmalloc

  • The vmalloc() function is used to allocate virtually contiguous memory in kernel space. Unlike kmalloc(), which returns physically contiguous memory, vmalloc() can satisfy larger requests whose backing pages need not be physically contiguous but are contiguous in the kernel’s virtual address space (a small sketch follows).
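A small sketch of vmalloc()/vfree() usage in a module context; the 5 MB size is an arbitrary example, chosen because an allocation that large is typically unreasonable to satisfy with physically contiguous memory via kmalloc():

#include <linux/vmalloc.h>

static void *vbuf;

static int vdemo_alloc(void)
{
	/* vzalloc() = vmalloc() + zeroing; virtually contiguous only */
	vbuf = vzalloc(5 * 1024 * 1024);
	if (!vbuf)
		return -ENOMEM;
	/* ... use the buffer; the backing page frames may be scattered ... */
	return 0;
}

static void vdemo_free(void)
{
	vfree(vbuf);
}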

Design Tips

  • Keep the most important (frequently accessed) members together and at the top of a data structure to minimize padding and improve cache utilization. To understand why, imagine there are five important members (of a total size of 56 bytes) in your data structure and you keep them all together at the top of your structure. Say the CPU cacheline size is 64 bytes. When the CPU fetches the first cacheline containing your data structure, it will bring in all five important members in a single cacheline fetch, resulting in better performance.
  • Try to align structure members so that a single member does not ‘fall off’ a cacheline. For example, consider a structure with a 4-byte integer followed by an 8-byte double. If the integer is at the end of a cacheline, the double will span two cachelines, resulting in two cacheline fetches when accessing the double. To avoid this, you can add padding to ensure that the double starts at the beginning of a cacheline.
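As an illustration of both tips above (the structure and its field names are made up for this sketch), the layout below keeps the hot members within the first cacheline and is arranged so that no member straddles a cacheline boundary; a tool such as pahole can be used to verify the actual offsets and padding:

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical structure, laid out with the frequently accessed ('hot')
 * members grouped at the top. Assuming a 64-byte cacheline, everything up
 * to and including last_seen_ns lives in the first line (offsets 0..31),
 * so the common lookup path touches a single cacheline. Cold members,
 * touched only at setup/teardown time, follow.
 */
struct conn_entry {
	/* hot: examined on every lookup */
	u64	hits;			/* offset  0, 8 bytes */
	u32	saddr, daddr;		/* offset  8, 8 bytes */
	u16	sport, dport;		/* offset 16, 4 bytes */
	u32	state;			/* offset 20, 4 bytes */
	u64	last_seen_ns;		/* offset 24, 8 bytes */
	/* cold: setup/teardown only */
	char	label[64];
	struct list_head node;
} ____cacheline_aligned;

Had the 8-byte counters been placed after the char array instead, one of them could easily start a few bytes before a cacheline boundary and span two lines, which is exactly the situation the second tip warns about.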