Memory Allocation
- At bottom, there is just one engine for memory (de)allocation in the Linux kernel: the page allocator (aka the buddy system allocator, or BSA), with the slab allocator layered on top of it.
- The kernel’s primary means of memory allocation are thus the page allocator API (the buddy system) and the slab allocator API.
- The internal implementation of the slab allocator used by the kernel is called the SLUB allocator.
- The slab allocator mitigates the internal fragmentation (wastage) inherent to the page allocator, which can only hand out power-of-two multiples of whole pages, by carving pages into caches of small, fixed-size objects. This allows for more efficient use of memory for small allocations. Thus, the slab allocator is layered on top of the page allocator.
- Ultimately, the only way to allocate or deallocate physical memory in the kernel is through the page allocator; even the slab allocator obtains its memory from it.
- The slab and page allocators reside completely in kernel space and are not accessible from user space. Calling malloc or free in user space does not directly invoke the kernel’s memory allocation mechanisms. In kernel space, use kmalloc() and kfree(), or kzalloc() (the zero-initializing variant) and kfree().
- The page allocator gets its page frames from the kernel’s lowmem region.
- Linux kernel memory is non-swappable
Page Allocator (Buddy System)
- The key to the page allocator is its primary internal metadata structure. It’s called the “buddy system free-list” and consists of an array of pointers to doubly linked circular lists.
- The index into this array of pointers is called the ‘order’ of the list: the power to which 2 is raised. Valid indices run from 0 to MAX_ORDER - 1; the value of MAX_ORDER is architecture dependent, but is typically 11 or 12.
- On x86 systems, a page is typically 4 KB in size. Thus, order 0 corresponds to a single 4 KB page, order 1 corresponds to 8 KB (2 contiguous pages), order 2 corresponds to 16 KB (4 contiguous pages), and so on.
- Each doubly linked circular list points to free and physically contiguous blocks of memory of the corresponding order.
- The kernel gives us a convenient view into the current state of the page allocator through the /proc/buddyinfo file. This file shows the number of free blocks of each order for each memory zone in the system.
λ ~ $ cat /proc/buddyinfo
Node 0, zone DMA 0 0 0 0 0 0 0 0 1 1 2
Node 0, zone DMA32 6 6 8 7 5 6 7 4 8 9 283
Node 0, zone Normal 5 113 180 1018 715 320 151 49 38 10 14971
λ ~ $
- The first two columns identify the NUMA node and the memory zone. Common zones include DMA, DMA32, and Normal. DMA is used for legacy devices that can only address the first 16 MB of memory; DMA32 is for devices that can address up to 4 GB of memory; Normal is the standard zone for general-purpose allocations. The remaining eleven columns give the number of free blocks of each order, starting from order 0 on the left.
- If it’s not already obvious from /proc/buddyinfo: the kernel keeps a separate free-list for each memory zone, to accommodate the different addressing requirements of various hardware devices.
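To make that metadata concrete, here is a simplified user-space sketch of the free-list structure just described. This is illustrative only (the names are mine); the kernel’s real version is the free_area[] array embedded in struct zone, in include/linux/mmzone.h.
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define MAX_ORDER 11		/* typical value; architecture dependent */

/* One node of a doubly linked circular list of free blocks */
struct free_block {
	struct free_block *prev, *next;
	unsigned long pfn;	/* first page frame number of this free block */
};

/* The buddy-system free-list: one list head per order. The list at
 * index n holds free, physically contiguous blocks of 2^n pages
 * (order 0 = 4 KB, order 1 = 8 KB, order 2 = 16 KB, ...).
 */
struct buddy_freelists {
	struct free_block *head[MAX_ORDER];
	unsigned long nr_free[MAX_ORDER];
};

/* Block size in bytes at a given order: 2^order * PAGE_SIZE */
static inline unsigned long order_to_bytes(unsigned int order)
{
	return (1UL << order) * PAGE_SIZE;
}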
How it works
- When a request for memory allocation is made, the page allocator first determines the smallest order that can satisfy the request. It then checks the corresponding free-list for available blocks.
- If a suitable block is found, it is removed from the free-list and returned to the requester.
- If no suitable block is available, the allocator looks for a larger block in the higher-order free-lists. If one is found, it is split into two smaller blocks (‘buddies’); one is returned to the requester while the other is added back to the appropriate lower-order free-list.
- When memory is freed, the allocator checks whether the freed block’s “buddy” (the adjacent block of the same order) is also free. If so, the two blocks are merged into a single block on the next higher-order free-list. This merging continues recursively up the orders until no further merging is possible. (See the arithmetic sketch after the /proc/pagetypeinfo output below.)
- The buddy system’s design helps to minimize external fragmentation by ensuring that memory blocks are recombined into larger blocks whenever possible.
- The page allocator also incorporates various strategies to handle memory pressure, such as reclaiming memory from other parts of the kernel or invoking the Out-Of-Memory (OOM) killer when necessary.
- The page allocator is optimized for performance, with fast allocation and deallocation paths to meet the demands of the kernel’s memory management.
- The kernel provides another view into the state of memory allocation through the /proc/pagetypeinfo file. This file gives a more detailed breakdown of free pages, by migrate type and order, within each memory zone.
λ ~ $ sudo cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 1 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 2
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 0 0 0 1 1 0 1 0 0 1 0
Node 0, zone DMA32, type Movable 6 6 8 6 4 6 6 4 8 8 283
Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 616 401 143 51 19 3 0 2 2 9 1
Node 0, zone Normal, type Movable 2290 1082 620 1032 792 350 161 47 31 0 14945
Node 0, zone Normal, type Reclaimable 1 1 1 19 31 18 4 2 3 1 2
Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate
Node 0, zone DMA 3 5 0 0 0 0
Node 0, zone DMA32 6 614 0 0 0 0
Node 0, zone Normal 254 31436 40 0 0 0
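The split/merge logic described above boils down to power-of-two arithmetic on page frame numbers (PFNs). The user-space sketch below (the helper names are mine) shows the two core calculations: choosing the smallest order that satisfies a request, and finding a block’s buddy by flipping a single PFN bit. These mirror what the kernel’s get_order() and __find_buddy_pfn() helpers do.
#include <stdio.h>

#define PAGE_SHIFT 12	/* 4 KB pages */

/* Smallest order whose block (2^order pages) covers 'size' bytes;
 * a user-space take on the kernel's get_order().
 */
static unsigned int order_for_size(unsigned long size)
{
	unsigned long pages = (size + (1UL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* A block's buddy differs from it in exactly one PFN bit: bit 'order'.
 * Flipping that bit locates the buddy (cf. the kernel's __find_buddy_pfn()).
 */
static unsigned long buddy_of(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

int main(void)
{
	/* A 20 KB request needs 5 pages, hence order 3 (8 pages = 32 KB) */
	printf("order for a 20 KB request = %u\n", order_for_size(20 * 1024));
	/* The order-1 buddy of the block at PFN 4 is the block at PFN 6 */
	printf("buddy of PFN 4 at order 1 = %lu\n", buddy_of(4, 1));
	return 0;
}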
Using the Page Allocator API
- The Linux kernel exposes a comprehensive API for memory allocation through the page allocator.
- For the APIs and macros that take two parameters, the first parameter is the GFP flags bitmask (named gfp_mask); the second is typically the order of the allocation (order).
- The API prototypes are currently defined in the include/linux/gfp.h header file.
#define pr_fmt(fmt) "%s:%s(): " fmt, KBUILD_MODNAME, __func__
#include <linux/init.h>
#include <linux/module.h>
#include <linux/mm.h>
#include "../../klib.h"
MODULE_AUTHOR("rtn");
MODULE_VERSION("0.2");
MODULE_LICENSE("Dual MIT/GPL");
static void *gptr1, *gptr2, *gptr3, *gptr4, *gptr5;
static int bsa_alloc_order = 3;
module_param_named(order, bsa_alloc_order, int, 0660);
MODULE_PARM_DESC(order, "order of our step 2 allocation (power-to-raise-2-to; default:3)");
/*
* bsa_alloc : test some of the bsa (buddy system allocator
* aka page allocator) APIs
*/
static int bsa_alloc(void)
{
int stat = -ENOMEM;
u64 numpg2alloc = 0;
const struct page *pg_ptr1, *pg_ptr2;
/* 0. Show the identity mapping: physical RAM page frames to kernel virtual
* addresses, from PAGE_OFFSET for 5 pages
*/
pr_info("0. Show identity mapping: RAM page frames : kernel virtual pages :: 1:1\n"
"(PAGE_SIZE = %ld bytes)\n", PAGE_SIZE);
/* SEE THIS!
* Show the virt, phy addr and PFNs (page frame numbers).
* This function is in our 'library' code here: ../../klib.c
* This way, we can see if the pages allocated are really physically
* contiguous! Signature:
* void show_phy_pages(const void *kaddr, size_t len, bool contiguity_check);
*/
pr_info("[--------- show_phy_pages() output follows:\n");
show_phy_pages((void *)PAGE_OFFSET, 5 * PAGE_SIZE, 1);
pr_info(" --------- show_phy_pages() output done]\n");
/* 1. Allocate one page with the __get_free_page() API */
gptr1 = (void *)__get_free_page(GFP_KERNEL);
if (!gptr1) {
pr_warn("mem alloc via __get_free_page() failed!\n");
/* As per convention, we emit a printk above saying that the
* allocation failed. In practice it isn't required; the kernel
* will definitely emit many warning printk's if a memory alloc
* request ever fails! Thus, we do this only once (here; could also
* use the WARN_ONCE()); from now on we don't pedantically print any
* error message on a memory allocation request failing.
*/
goto out1;
}
pr_info("#. BSA/PA API Amt alloc'ed KVA\n"); // header
pr_info("1. __get_free_page() 1 page %px\n", gptr1);
/* 2. Allocate 2^bsa_alloc_order pages with the __get_free_pages() API */
numpg2alloc = powerof(2, bsa_alloc_order); /* returns 2^bsa_alloc_order */
gptr2 = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, bsa_alloc_order);
if (!gptr2) {
/* no error/warning printk now; see above comment */
goto out2;
}
pr_info("2. __get_free_pages() 2^%d page(s) %px\n",
bsa_alloc_order, gptr2);
pr_info("[--------- show_phy_pages() output follows:\n");
show_phy_pages(gptr2, numpg2alloc * PAGE_SIZE, 1);
pr_info(" --------- show_phy_pages() output done]\n");
/* 3. Allocate and init one page with the get_zeroed_page() API */
gptr3 = (void *)get_zeroed_page(GFP_KERNEL);
if (!gptr3)
goto out3;
pr_info("#. BSA/PA API Amt alloc'ed KVA\n"); // header
pr_info("3. get_zeroed_page() 1 page %px\n", gptr3);
/* 4. Allocate one page with the alloc_page() API.
* Careful! It does not return the alloc'ed page addr but rather the pointer
* to the metadata structure 'page' representing the allocated page:
* struct page * alloc_page(gfp_mask);
* So, we use the page_address() helper to convert it to a kernel
* logical (or virtual) address.
*/
pg_ptr1 = alloc_page(GFP_KERNEL);
if (!pg_ptr1)
goto out4;
gptr4 = page_address(pg_ptr1);
pr_info("4. alloc_page() 1 page %px\n"
" (struct page addr = %px)\n", (void *)gptr4, pg_ptr1);
	/* 5. Allocate 2^5 = 32 pages with the alloc_pages() API.
	 * < Same warning as above applies here too! >
	 * Note: check alloc_pages()'s return value *before* handing it to
	 * page_address(); passing a NULL 'struct page *' there is a bug
	 * (and the computed address wouldn't be NULL anyway, defeating
	 * the check).
	 */
	pg_ptr2 = alloc_pages(GFP_KERNEL, 5);
	if (!pg_ptr2)
		goto out5;
	gptr5 = page_address(pg_ptr2);
pr_info("5. alloc_pages() %lld pages %px\n",
powerof(2, 5), (void *)gptr5);
return 0;
out5:
free_page((unsigned long)gptr4);
out4:
free_page((unsigned long)gptr3);
out3:
free_pages((unsigned long)gptr2, bsa_alloc_order);
out2:
free_page((unsigned long)gptr1);
out1:
return stat;
}
static int __init lowlevel_mem_init(void)
{
return bsa_alloc();
}
static void __exit lowlevel_mem_exit(void)
{
pr_info("free-ing up the prev allocated BSA/PA memory chunks...\n");
/* Free 'em! We follow the convention of freeing them in the reverse
* order from which they were allocated
*/
free_pages((unsigned long)gptr5, 5);
free_page((unsigned long)gptr4);
free_page((unsigned long)gptr3);
free_pages((unsigned long)gptr2, bsa_alloc_order);
free_page((unsigned long)gptr1);
pr_info("removed\n");
}
module_init(lowlevel_mem_init);
module_exit(lowlevel_mem_exit);
Using the Slab Allocator API
- The slab allocator is optimized for allocating and deallocating small objects, making it well suited for kernel data structures. kmalloc() is used to allocate memory, while kfree() frees it; kzalloc() is a variant of kmalloc() that zero-initializes the allocated memory. (A minimal usage sketch follows this list.)
- You can use vmstat -m to view slab allocator statistics.
- The kernel keeps a whole slew of slab caches of various sizes to optimize memory allocation for frequently used object sizes.
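A minimal sketch of the kmalloc()/kzalloc()/kfree() pattern (kernel code; the buffer size and function name here are made up for illustration):
#include <linux/slab.h>

#define MYBUF_SZ 256	/* arbitrary illustrative size */

static int slab_api_demo(void)
{
	char *buf, *zbuf;

	/* kmalloc(): physically contiguous memory, contents uninitialized */
	buf = kmalloc(MYBUF_SZ, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* kzalloc(): as kmalloc(), but the memory is zeroed */
	zbuf = kzalloc(MYBUF_SZ, GFP_KERNEL);
	if (!zbuf) {
		kfree(buf);
		return -ENOMEM;
	}

	/* ... use the buffers ... */

	kfree(zbuf);
	kfree(buf);
	return 0;
}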
Creating a custom slab cache
- If we’re writing a driver, and within it, we notice that we frequently need to allocate and free objects of a particular size, we can create a custom slab cache for that object type using the kmem_cache_create() API.
#define pr_fmt(fmt) "%s:%s(): " fmt, KBUILD_MODNAME, __func__
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/version.h>
#include <linux/sched.h> /* current */
#include "../../convenient.h"
#include "../../klib.h"
#define OURCACHENAME "ryans_slab_cache"
static int use_ctor = 1;
module_param(use_ctor, int, 0);	/* type must match the int declaration above */
MODULE_PARM_DESC(use_ctor, "If set to 1 (default), our custom ctor routine"
" will initialize slabmem; when 0, no custom constructor will run");
MODULE_AUTHOR("rtn");
MODULE_DESCRIPTION("simple demo of creating a custom slab cache");
MODULE_LICENSE("Dual MIT/GPL");
MODULE_VERSION("0.1");
/* Our 'demo' structure; one that (we imagine) is often allocated and freed;
* hence, we create a custom slab cache to hold pre-allocated 'instances'
 * of it... Its size: 328 bytes.
*/
struct myctx {
u32 iarr[10]; // 40 bytes; total=40
u64 uarr[10]; // 80 bytes; total=120
s8 uname[128], passwd[16], config[64]; // 128+16+64=208 bytes; total=328
};
static struct kmem_cache *gctx_cachep;
static int use_our_cache(void)
{
struct myctx *obj = NULL;
#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39)
pr_debug("Cache name is %s\n", kmem_cache_name(gctx_cachep));
#else
pr_debug("[ker ver > 2.6.38 cache name deprecated...]\n");
#endif
obj = kmem_cache_alloc(gctx_cachep, GFP_KERNEL);
if (!obj) { /* pedantic warning printk below... */
pr_warn("[Pedantic] kmem_cache_alloc() failed\n");
return -ENOMEM;
}
pr_info("Our cache object (@ %pK, actual=%px) size is %u bytes; actual ksize=%zu\n",
obj, obj, kmem_cache_size(gctx_cachep), ksize(obj));
print_hex_dump_bytes("obj: ", DUMP_PREFIX_OFFSET, obj, sizeof(struct myctx));
/* free it */
kmem_cache_free(gctx_cachep, obj);
return 0;
}
/* The parameter to the constructor is the pointer to the just-allocated memory
* 'object' from our custom slab cache. Here, in our 'constructor' routine,
* we initialize the just-allocated memory object.
*/
static void our_ctor(void *new)
{
struct myctx *ctx = new;
struct task_struct *p = current;
/* TIPS:
* 1. To see how exactly we got here, insert this call: dump_stack();
* (read its output bottom-up ignoring call frames that begin with '?')
 * 2. Count the number of times the printk below is emitted; it's, in effect,
* the number of objects cached by the kernel within this slab cache.
*/
pr_info("in ctor: just alloced mem object is @ 0x%px\n", ctx); /* %pK in production */
memset(ctx, 0, sizeof(struct myctx));
/* As a demo, we init the 'config' field of our structure to some
* (arbitrary) 'accounting' values from our task_struct
*/
	snprintf_lkp(ctx->config, 6 * sizeof(u64) + 5, "%d.%d,%lu.%lu,%lu,%lu",
		     p->tgid, p->pid, p->nvcsw, p->nivcsw, p->min_flt, p->maj_flt);
}
static int create_our_cache(void)
{
int ret = 0;
void *ctor_fn = NULL;
if (use_ctor == 1)
ctor_fn = our_ctor;
pr_info("sizeof our ctx structure is %zu bytes\n"
" using custom constructor routine? %s\n",
sizeof(struct myctx), use_ctor == 1 ? "yes" : "no");
/* Create a new slab cache:
* kmem_cache_create(const char *name, unsigned int size, unsigned int align,
* slab_flags_t flags, void (*ctor)(void *));
* When our constructor's enabled (the default), this call will trigger it.
*/
gctx_cachep = kmem_cache_create(OURCACHENAME, // name of our cache
sizeof(struct myctx), // (min) size of each object
sizeof(long), // alignment
SLAB_POISON | /* use slab poison values (explained soon) */
SLAB_RED_ZONE | /* good for catching buffer under|over-flow bugs */
SLAB_HWCACHE_ALIGN, /* good for performance */
ctor_fn); // ctor: here, on by default
	if (!gctx_cachep) {
		/* When a kernel-space mem alloc fails we'll usually not require a warning
		 * message as the kernel will definitely emit warning printk's.
		 * We do so here pedantically...
		 * Note: kmem_cache_create() returns NULL on failure (not an
		 * ERR_PTR() value), so we must set the error code ourselves.
		 */
		pr_warn("kmem_cache_create() failed\n");
		ret = -ENOMEM;
	}
	return ret;
}
static int __init slab_custom_init(void)
{
	int ret;

	pr_info("inserted\n");
	/* Bail out if cache creation failed; calling kmem_cache_alloc() on
	 * a NULL cache pointer in use_our_cache() would oops.
	 */
	ret = create_our_cache();
	if (ret < 0)
		return ret;
	return use_our_cache();
}
static void __exit slab_custom_exit(void)
{
kmem_cache_destroy(gctx_cachep);
pr_info("custom cache destroyed; removed\n");
}
module_init(slab_custom_init);
module_exit(slab_custom_exit);
- After creating a custom slab cache, we can view it using the vmstat -m command. (Notice below that the per-object Size, 448 bytes, exceeds sizeof(struct myctx) = 328 bytes; the difference is per-object overhead from cacheline alignment and the red-zoning/poisoning debug flags.)
λ ~ $ sudo vmstat -m | head -n1
Cache Num Total Size Pages
λ ~ $ sudo vmstat -m |grep -i ryan
ryans_slab_cache 0 36 448 36
λ ~ $
- We also have access to slab information via the sysfs filesystem, under /sys/kernel/slab/<slab name>. Example:
λ notes (main) $ ls /sys/kernel/slab/ryans_slab_cache/
aliases ctor object_size poison shrink store_user
align destroy_by_rcu objects_partial reclaim_account skip_kfence total_objects
cache_dma hwcache_align objs_per_slab red_zone slabs trace
cpu_partial min_partial order remote_node_defrag_ratio slabs_cpu_partial usersize
cpu_slabs objects partial sanity_checks slab_size validate
vmalloc
- The vmalloc() function is used to allocate virtually contiguous memory in kernel space. Unlike kmalloc(), which allocates physically contiguous memory, vmalloc() can allocate larger blocks of memory that need not be physically contiguous but are contiguous in the kernel virtual address space. (A minimal sketch follows.)
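A minimal sketch of the vmalloc()/vfree() pattern (the 1 MB size and function name are arbitrary, for illustration):
#include <linux/vmalloc.h>

static int vmalloc_demo(void)
{
	void *vbuf;

	/* 1 MB: a physically contiguous kmalloc() of this size could fail
	 * on a fragmented system; vmalloc() needs only virtual contiguity,
	 * mapping whichever individual page frames the page allocator
	 * hands it. Note: vmalloc() can sleep, so no atomic context.
	 */
	vbuf = vmalloc(1024 * 1024);
	if (!vbuf)
		return -ENOMEM;

	/* ... use vbuf; see also vzalloc() for a zeroing variant ... */

	vfree(vbuf);
	return 0;
}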
Design Tips
- Keep the most important (frequently accessed) members together and at the top of a data structure to minimize padding and improve cache utilization. To understand why, imagine there are five important members (of a total size of 56 bytes) in your data structure and you keep them all together at the top of your structure. Say the CPU cacheline size is 64 bytes. When the CPU fetches the first cacheline containing your data structure, it will bring in all five important members in a single cacheline fetch, resulting in better performance.
- Try to align structure members so that a single member does not ‘fall off’ a cacheline. For example, consider a packed structure with a 4-byte integer followed by an 8-byte u64. If the integer ends near a cacheline boundary, the u64 can span two cachelines, costing two cacheline fetches on every access. To avoid this, add padding (or alignment attributes) so that the u64 sits entirely within one cacheline. See the sketch below.
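Both tips in one hedged sketch (the structure and its field names are invented for illustration; ____cacheline_aligned is the kernel’s alignment attribute from <linux/cache.h>):
#include <linux/cache.h>	/* ____cacheline_aligned */
#include <linux/types.h>

/* The hot members below total 56 bytes, so on a CPU with 64-byte
 * cachelines they all arrive in a single cacheline fetch. The
 * alignment attribute on the first cold member then starts the
 * rarely accessed data on the next cacheline, ensuring no member
 * straddles a line boundary.
 */
struct conn_stats {
	/* hot: touched on every packet (6*8 + 2*4 = 56 bytes) */
	u64 rx_bytes;
	u64 tx_bytes;
	u64 rx_pkts;
	u64 tx_pkts;
	u64 last_seen;
	u64 timeout;
	u32 state;
	u32 flags;
	/* cold: configuration data, rarely touched */
	char name[32] ____cacheline_aligned;
	u64 created_at;
};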