Keyboard shortcuts

Press ← or β†’ to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

🏠 Back to Blog

Process Scheduling

  • Every process in Linux goes through a variety of states throughout it’s lifetime. By coding these states, we can form the state machine of a process.
  • The KSE (Kernel Scheduling Entity) on a Linux system is the thread. Each thread has its own state, which is represented by a set of flags in the task_struct structure. Convention uses the word process. However, we will ignore convention going forward.
  • The states that a Linux thread can be in are:
    • TASK_RUNNING: The thread is either running on a CPU or is ready to run. In either state, the thread is put into the run queue. Linux maintains one run queue per core on the CPU.
    • TASK_INTERRUPTIBLE: The thread is sleeping and can be woken up by signals. Tasks that are waiting are put in a wait queue.
    • TASK_UNINTERRUPTIBLE: The thread is sleeping and cannot be woken up by signals.
    • TASK_STOPPED: The thread has been stopped, usually by a signal.
    • TASK_TRACED: The thread is being traced by another process (e.g., a debugger).
    • EXIT_ZOMBIE: The thread has finished execution but still has an entry in the process table.
    • EXIT_DEAD: The thread has been completely removed from the process table.
  • Once a thread has been created via fork(), clone(), or the pthread_create() API, it informs the process scheduler that it is now ready to be scheduled for execution.

The Linux Process (thread/task) State Machine

                          +--------------------+
                          |      CREATED       |
                          |  (new task_struct) |
                          +---------+----------+
                                    |
                                    v
                         +----------+----------+
                         |      READY          |
                         | (runnable, in runq) |
                         +----------+----------+
                                    |
                                    | scheduled by CPU
                                    v
                         +----------+----------+
                         |      RUNNING        |
                         +----------+----------+uling
                                    |
        +---------------------------+------------------------------+
        |                           |                              |
        | voluntary sleep           | preemption                   |
        v                           v                              |
+-------+--------+        +---------+---------+                    |
| INTERRUPTIBLE  | <----- |     READY          | <-----------------+
|   SLEEP (S)    |        | (runnable again)   |
+-------+--------+        +--------------------+
        |
        | syscall waits (I/O, futex, etc.)
        v
+-------+--------+
| UNINTERRUPTIBLE|
|   SLEEP (D)    |
+-------+--------+
        |
        | I/O completes
        v
+---------------------------------------+
|                 READY                 |
+---------------------------------------+


                     (signals)
                         |
                         v
                +--------+---------+
                |     STOPPED      |
                |  (T state: SIGSTOP,
                |   SIGTSTP, etc.) |
                +--------+---------+
                         |
                         | SIGCONT
                         v
                +--------+---------+
                |       READY       |
                +-------------------+


When the process calls exit():
                       |
                       v
              +--------+---------+
              |      ZOMBIE      |
              |   (defunct, SZ)  |
              +--------+---------+
                       |
                       | parent calls wait()
                       v
              +--------+---------+
              |     REAPED       |
              | (task_struct freed)  
              +-------------------+

Scheduling Policies

  • Linux supports several scheduling policies that determine how threads are prioritized and scheduled for execution. The main policies are:
    • SCHED_OTHER: This is the default time-sharing policy for regular threads. This is the commonly discussed Completely Fair Scheduler (CFS) policy.
    • SCHED_FIFO: THis is a real-time policy where threads are scheduled in a first-in, first-out manner.
      • Threads will yield back their processor time if:
        • They block on I/O
        • They stop or die
        • A higher priority real-time thread becomes runnable
    • SCHED_RR: This is a real-time policy where threads are scheduled in a round-robin manner.
      • Threads have a finite time slice. Typically 100ms.
      • Threads will yield back their time if:
        • They go into a blocking state
        • They stop or die
        • A higher priority real-time thread becomes runnable
        • the timeslice expires
    • SCHED_BATCH: Suitable for low-priority and non-interactive batch jobs, with less preemption.
    • SCHED_IDLE: For very low priority background tasks that should only run when the system is idle.

CFS (Completely Fair Scheduler)

  • The CFS is the default scheduling algorithm used in Linux for SCHED_OTHER threads.
  • It aims to allocate CPU time fairly among all runnable threads, ensuring that each thread gets a proportional share of the CPU based on its weight (priority).
  • CFS uses a red-black tree data structure to manage runnable threads, allowing for efficient insertion, deletion, and selection of the next thread. When a thread becomes runnable, it is added to the red-black tree based on its vruntime. The red-black tree is sorted in such a way that the task that has the smallest vruntime is always at the leftmost node of the tree, making it easy to select the next thread to run. Scanning the tree from left to right gives a timeline of execution for all runnable tasks.
  • Each thread has a virtual runtime (vruntime) that tracks the amount of CPU time it has received. The scheduler selects the thread with the smallest vruntime to run next.
  • CFS dynamically adjusts the vruntime of threads based on their priority, ensuring that higher-priority threads receive more CPU time.