memory

The CPU has a memory management unit (MMU) to add flexibility in accessing memory. The kernel assists the MMU by breaking down the memory used by a process into chunks called ‘pages’. The kernel maintains a data structure, called a ‘page table’, that maps a process’s virtual page addresses into real page addresses in memory. As a process accesses memory, the MMU translates the virtual addresses used by the process into real addresses based on the kernel’s page table.

A user process doesn't need all of its memory to be immediately available in order to run. The kernel generally loads and allocates pages as a process needs them; this system is known as on-demand paging, or just demand paging. Let's see how a program starts and runs as a new process:

1) The kernel loads the beginning of the program's instruction code into memory pages.
2) The kernel may allocate some working-memory pages to the new process.
3) As the process runs, it may find that the next instruction it needs isn't in any of the memory pages that the kernel loaded initially. At this point, the kernel takes over, loads the necessary page into memory, and then lets the program resume execution.
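
To make demand paging visible, here is a minimal sketch (Linux-specific; it assumes mmap(2) and mincore(2) are available) that maps four anonymous pages and checks which of them are resident before and after touching one. On a typical Linux system it prints 0 0 0 0 first, then 0 0 1 0:

    /* demand.c — observe demand paging via mincore(2); Linux-specific sketch. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);
        size_t len = 4 * (size_t)page;               /* four pages */
        unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned char vec[4];

        if (buf == MAP_FAILED)
            return 1;

        mincore(buf, len, vec);                      /* one residency byte per page */
        printf("before: %d %d %d %d\n", vec[0] & 1, vec[1] & 1, vec[2] & 1, vec[3] & 1);

        buf[2 * page] = 42;                          /* writing faults page 2 into RAM */

        mincore(buf, len, vec);
        printf("after:  %d %d %d %d\n", vec[0] & 1, vec[1] & 1, vec[2] & 1, vec[3] & 1);

        munmap(buf, len);
        return 0;
    }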

You can get a system's page size by querying the system configuration:

getconf PAGE_SIZE
4096
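
The same value is available from C via the POSIX sysconf(3) call:

    /* pagesize.c — query the page size via POSIX sysconf(3). */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));  /* typically 4096 */
        return 0;
    }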

Page Faults

If a memory page isn't ready when a process wants to use it, the process triggers a page fault. When that happens, the kernel takes control of the CPU from the process in order to get the page ready. There are two kinds of page faults: major and minor.

Minor page faults occur when the page is in main memory but the MMU doesn't know where it is. A major page fault occurs when the desired memory page isn't in main memory at all, which means the kernel must load it from disk or some other slow storage medium. Major page faults will bog down a system; some are unavoidable, such as when the system loads a program's code from disk for the first time.

You can drill down into the page faults for individual processes by using the top, ps, and time commands. You'll need the system version of time (/usr/bin/time) rather than the shell built-in.

ryan:todo/  |main ?:3 ✗|$ /usr/bin/time cal > /dev/null
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 2824maxresident)k
0inputs+0outputs (0major+130minor)pagefaults 0swaps

As you can see in the output above, there were 0 major page faults and 130 minor page faults when running the cal program.
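
A process can also read its own fault counters with getrusage(2); a minimal sketch:

    /* faults.c — read this process's page-fault counters via getrusage(2). */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0)
            printf("minor faults: %ld, major faults: %ld\n",
                   ru.ru_minflt, ru.ru_majflt);
        return 0;
    }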

Virtual Memory

  • The OS's process abstraction provides each process with a virtual memory space: a private, logical address space in which the process's instructions and data are stored. Each process's virtual address space can be thought of as an array of addressable bytes from 0 up to some maximum address. Processes cannot access the contents of one another's address spaces.

  • Operating systems implement virtual memory as part of the lone view abstraction of processes. That is, each process interacts with memory only in terms of its own virtual address space rather than the reality of many processes sharing the computer's RAM simultaneously.

  • A process's virtual address space is divided into several sections, each of which stores a different part of the process's memory. The top part is reserved for the OS and can only be accessed in kernel mode. The text and data parts of a process's virtual address space are initialized from the program executable file. The text section contains the program instructions, and the data section contains global variables. The stack and heap sections vary in size as the process runs. Stack space grows as the process makes function calls and shrinks as it returns from them. Heap space grows when the process dynamically allocates memory (via calls to malloc) and shrinks when the process frees memory (via calls to free). The heap and stack are typically located far apart in the address space to maximize the amount of space either can use: the stack sits near the top of the address space, just below the OS, and grows downward toward lower addresses, while the heap sits above the data section and grows upward toward the stack.

    --------------------------------  <- highest addresses
    |            OS Code           |
    --------------------------------
    |            Stack             |
    |            ⌄⌄⌄⌄              |
    |                              |
    |                              |
    |            ^^^^              |
    |            Heap              |
    --------------------------------
    |      Data (Global Vars)      |
    --------------------------------
    |       Application Code       |
    --------------------------------  <- address 0
    
  • A page fault occurs when a process tries to access a page that is not currently stored in RAM. The opposite is a page hit. To handle a page fault, the OS needs to keep track of which RAM frames are free so that it can find a free frame of RAM into which the page read from disk can be stored.

  • Page table entries (PTEs) include a dirty bit that indicates whether the in-RAM copy of the page has been modified since it was loaded, as in the sketch below.
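
Real PTE formats are architecture-specific, but a toy entry makes the bookkeeping concrete (a sketch only; the field widths here are invented):

    /* A toy page table entry — field widths are invented; consult the
     * x86-64 or RISC-V manuals for real formats. */
    #include <stdint.h>

    struct toy_pte {
        uint64_t frame : 40;   /* physical frame number */
        uint64_t valid : 1;    /* page is resident in RAM */
        uint64_t dirty : 1;    /* set by the MMU on a store; a dirty page must
                                  be written back to disk before eviction */
        uint64_t       : 22;   /* permissions and other bookkeeping bits */
    };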

Memory Addresses

  • Because processes operate within their own virtual address spaces, operating systems must make an important distinction between two types of memory addresses. Virtual addresses refer to storage locations in a process's virtual address space, and physical addresses refer to locations in RAM.
  • At any point in time, the OS stores in RAM the address space contents of many processes as well as OS code that it may map into every process’s virtual address space.

Virtual Memory and Virtual Addresses

  • Virtual memory is the per-process view of its memory space, and virtual addresses are addresses in the process's view of its memory. If two processes run the same binary executable, they will have the exact same virtual addresses for function code and for global variables in their address spaces.
  • Processors generally provide some hardware support for virtual memory. An OS can use this support to perform virtual-to-physical address translation quickly, avoiding a trap to the OS on every address translation.
  • The memory management unit (MMU) is the part of the computer hardware that implements address translation. At its most complete, the MMU performs full translation.

Paging

  • Although many virtual memory systems have been implemented over the years, paging is now the most widely used implementation of virtual memory.
  • In a paged virtual memory system, the OS divides the virtual address space of each process into fixed-size chunks called pages. The OS defines the page size for the system. Page sizes of a few kilobytes are common in general-purpose operating systems today; 4 KB is the default page size on many systems.
  • Physical memory is similarly divided into page-sized chunks called frames. Because pages and frames are defined to be the same size, any page of a process’s virtual memory can be stored in any frame of physical RAM.

Virtual and Physical Addresses in Paged Systems

  • Paged virtual memory systems divide the bits of a virtual address into two parts: the high-order bits specify the page number on which the virtual address is stored, and the low-order bits correspond to the byte offset within the page (which byte from the top of the page corresponds to the address).
  • Similarly, paging systems divide physical addresses into two parts: the high-order bits specify the frame number in physical memory, and the low-order bits specify the byte offset within the frame. Because frames and pages are the same size, the byte offset bits in a virtual address are identical to the byte offset bits in its translated physical address. Virtual addresses differ from their translated physical addresses only in their high-order bits, which specify the virtual page number and the physical frame number, respectively.
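
With 4 KB pages the offset occupies the low 12 bits, so the split is one shift and one mask; a small worked example:

    /* split.c — split a virtual address into page number and offset,
     * assuming 4 KB pages (12 offset bits). */
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1UL << PAGE_SHIFT) - 1)

    int main(void) {
        unsigned long vaddr  = 0x1234ABCDUL;
        unsigned long vpn    = vaddr >> PAGE_SHIFT;   /* page number: 0x1234A */
        unsigned long offset = vaddr & PAGE_MASK;     /* byte offset: 0xBCD   */
        printf("vaddr 0x%lx -> page 0x%lx, offset 0x%lx\n", vaddr, vpn, offset);
        return 0;
    }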

Page tables

  • Because every page of a process's virtual memory space can map to a different frame of RAM, the OS must maintain a mapping for every virtual page in the process's address space. The OS keeps a per-process page table that stores the process's virtual page number to physical frame number mappings, as sketched below.
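
Conceptually, a single-level page table is just an array indexed by virtual page number. The sketch below is toy code, not any real kernel's layout; real systems use multi-level tables, and the MMU performs this walk in hardware:

    /* A toy single-level page table walk — real kernels use multi-level
     * tables, and the MMU performs this lookup in hardware. */
    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1UL << PAGE_SHIFT) - 1)

    typedef struct {
        unsigned long frame;   /* physical frame number */
        int valid;             /* is the page resident in RAM? */
    } pte_t;

    /* Translate a virtual address via the per-process table.
     * Returns 0 and fills *paddr on success; -1 signals a page fault. */
    int translate(const pte_t *table, unsigned long vaddr, unsigned long *paddr) {
        unsigned long vpn = vaddr >> PAGE_SHIFT;        /* index into the table */
        if (!table[vpn].valid)
            return -1;                                  /* page fault: OS loads the page */
        *paddr = (table[vpn].frame << PAGE_SHIFT) | (vaddr & PAGE_MASK);
        return 0;
    }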

Translation Look-aside Buffer (TLB)

  • Although paging has many benefits, it also adds a significant slowdown to every memory access. In a paged virtual memory system, every load and store to a virtual memory address requires two RAM accesses: the first reads the page table entry (PTE) to get the frame number for virtual-to-physical address translation, and the second reads or writes the byte(s) at the physical RAM address. Thus, in a paged virtual memory system, every memory access is twice as slow as in a system that supports direct physical RAM addressing.
  • One way to reduce the additional overhead of paging is to cache page table mappings of virtual page numbers to physical frame numbers. When translating a virtual address, the MMU first checks for the page number in the cache. If found, the page's frame number can be retrieved from the cache entry, avoiding one RAM access for reading the PTE.
  • A translation look-aside buffer (TLB) is a hardware cache that stores (page number, frame number) mappings. It is a small, fully associative cache that is optimized for fast lookups in hardware. When the MMU finds a mapping in the TLB (a TLB hit), a page table lookup is not needed, and only one RAM access is required to execute a load or store to a virtual memory address.
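
As a software model, a fully associative lookup amounts to checking every entry (real hardware compares all entries in parallel in a single cycle). A toy sketch, with invented sizes:

    /* A toy fully associative TLB — sizes and policy are invented; real
     * hardware compares all entries in parallel. */
    #define TLB_ENTRIES 16

    struct tlb_entry { unsigned long vpn, pfn; int valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Return the frame number for vpn, or -1 on a TLB miss (the MMU then
     * falls back to a page table walk and refills an entry). */
    long tlb_lookup(unsigned long vpn) {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (long)tlb[i].pfn;                /* TLB hit */
        return -1;                                      /* TLB miss */
    }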

How Linux Organizes Virtual Memory

UVA (User Virtual Addressing)

KVA (Kernel Virtual Addressing)

How Linux Organizes Physical Memory

At boot, the kernel organizes and partitions RAM into a tree-like hierarchy consisting of nodes, zones, and page frames (page frames are physical pages of RAM). The top level of the hierarchy is made up of nodes, which represent a collection of memory that is local to a particular CPU or group of CPUs. Each node contains one or more zones, which are collections of page frames that share similar characteristics. The zones are further divided into page frames, the smallest unit of memory the kernel can allocate.

Any processor core can access any physical memory location, regardless of which node it belongs to. However, accessing memory that is local to the core’s node is faster than accessing memory that is located on a different node. This is because local memory access avoids the overhead of traversing interconnects between nodes.

NUMA vs. UMA

Essentially, nodes are data structures used to denote and abstract a physical RAM module on the system motherboard and its associated controller chipset: software abstracting the actual hardware. Two types of memory architecture exist, UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access).

NUMA

  • In a NUMA architecture, each processor has its own local memory, and accessing local memory is faster than accessing memory located on a different node, because local access avoids traversing the interconnects between nodes.
  • NUMA architectures are commonly used in high-performance computing systems, where multiple processors are used to perform complex computations. By using NUMA, these systems can achieve better performance and scalability than traditional UMA architectures.
  • A system must have at least two physical memory banks (nodes) to be considered NUMA.
  • One can use the lstopo command, part of the hwloc toolset, to view the NUMA topology of a system. The output shows the number of nodes, the amount of memory in each node, and the CPUs associated with each node; lstopo can also render a graphical representation of the system's hardware topology, including the NUMA nodes and their associated memory and CPUs.
  • The number of zones per node is determined dynamically by the kernel at boot time, based on the amount of memory in the node and the system architecture. The kernel typically creates three zones per node: DMA, DMA32, and Normal. The DMA zone is used for memory accessible by devices that use direct memory access (DMA), the DMA32 zone is used for memory accessible by 32-bit devices, and the Normal zone is used for all other memory. In addition to these standard zones, the kernel may create additional zones depending on the system architecture and configuration. You can view /proc/buddyinfo to see the memory zones and their free page frames; column n in the output counts the free blocks of 2^n contiguous page frames in each zone.
λ ch7 (main) $ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2
Node 0, zone    DMA32      6      6      8      7      5      6      7      4      8      9    283
Node 0, zone   Normal   3170   7694  10543   8113   5011   1761    552    182     55     66  10595

UMA

  • In a UMA architecture, all processors share the same physical memory, and accessing any memory location takes the same amount of time, regardless of which processor is accessing it.
  • In Linux, UMA systems are treated as NUMA systems with a single node. This means that the kernel still uses the same data structures and algorithms for managing memory, but there is no need to consider the locality of memory access.
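
To see how Linux presents nodes to user space, here is a small sketch using libnuma (an assumption: the libnuma development package is installed; link with -lnuma). On a UMA machine it reports a single node 0:

    /* nodes.c — list NUMA nodes via libnuma; build with: cc nodes.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void) {
        if (numa_available() < 0) {
            printf("libnuma: NUMA not available on this system\n");
            return 1;
        }
        for (int node = 0; node <= numa_max_node(); node++) {
            long long free_bytes;
            long long size = numa_node_size64(node, &free_bytes);
            printf("node %d: %lld MB total, %lld MB free\n",
                   node, size >> 20, free_bytes >> 20);
        }
        return 0;
    }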