file systems

A file system is like a database. It defines the structure that transforms a simple block device into a sophisticated hierarchy of files and directories that users can understand.
File systems are typically implemented in the kernel. However, 9P from Plan 9 has inspired the development of user-space file systems. The FUSE (Filesystem in Userspace) feature allows file systems to be created in user space.
The following list shows the most common file systems in use today:
- ext4 (extended file system, v4)
  - Supports journaling (a log of pending changes, kept apart from the normal file system data structures) to protect data integrity and speed up recovery after a crash, which hastens booting (introduced with ext3)
  - An incremental improvement over ext2/3 that supports larger files and a greater number of subdirectories
- btrfs (b-tree filesystem)
  - A newer file system native to Linux, designed to scale beyond the limitations of ext4
- FAT (file allocation table)
  - 3 types: msdos, vfat, exfat
  - Used by most removable flash media
  - Supported by Windows, Darwin, and Linux
- XFS
  - A high performance filesystem used by default on some Linux distros, such as RHEL
- HFS+
  - An Apple standard filesystem used on Mac systems
- ISO 9660
  - Used on CD-ROM discs

Directories

A directory in Linux (on the ext file systems) is just a file containing a table with two columns: the name and the inode number of each file within the directory.
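To make the name/inode table concrete, here is a minimal Python sketch that prints those two columns for a directory. The helper name directory_table is made up for this note; os.scandir is the standard-library call that exposes each entry's name and inode number.

```python
import os


def directory_table(path="."):
    """Return sorted (name, inode number) pairs for the entries in a directory."""
    with os.scandir(path) as entries:
        return sorted((entry.name, entry.inode()) for entry in entries)


if __name__ == "__main__":
    # Roughly what `ls -i` shows: one row per directory entry.
    for name, inode in directory_table("."):
        print(f"{inode:>12}  {name}")
```

The output should line up with what ls -i reports for the same directory.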

inodes

A traditional *nix file system has two primary components: a pool of data blocks where you can store data and a database system that manages that data pool. The database system is centered around the inode data structure. An inode holds the metadata about a file, such as its type, permissions, and the location of its data blocks. Inodes are identified by numbers in an inode table. Both files and directories are represented by inodes; file names are not stored in the inode itself but in directory entries.
A directory inode contains a list of file names and links corresponding to other inodes.
To view inode numbers, use ls -i on a directory or stat <filename> on a single file.
Each inode records three timestamps:
- atime: last time the file's contents were accessed (read)
- mtime: last time the file's contents were modified
- ctime: ctime IS NOT creation time. It is the last time the inode was changed. For example, using chown or chmod will change the ctime.
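The difference between mtime and ctime can be checked with a short Python sketch. It assumes a Linux system (on Windows, Python's st_ctime has historically meant creation time), and the function names are invented for this example.

```python
import os
import tempfile
import time


def timestamps(path):
    """Return (atime, mtime, ctime) of a file in nanoseconds."""
    st = os.stat(path)
    return st.st_atime_ns, st.st_mtime_ns, st.st_ctime_ns


def chmod_effect():
    """Return (mtime_before, mtime_after, ctime_before, ctime_after) around a chmod."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
        path = tmp.name
    try:
        _, mtime_before, ctime_before = timestamps(path)
        time.sleep(0.05)        # let the clock move past the timestamp granularity
        os.chmod(path, 0o644)   # metadata-only change: touches the inode, not the data
        _, mtime_after, ctime_after = timestamps(path)
        return mtime_before, mtime_after, ctime_before, ctime_after
    finally:
        os.unlink(path)


if __name__ == "__main__":
    mb, ma, cb, ca = chmod_effect()
    print("mtime changed:", mb != ma)   # expected False: contents were not modified
    print("ctime changed:", cb != ca)   # expected True on Linux: the inode was changed
```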

Format a partition with a file system

When preparing a new storage device, once you have partitioned it you are ready to create a file system.
- You can use mkfs to create file systems. mkfs is a front end for a set of file-system-specific programs, one per file system type. Example: mkfs.ext4, mkfs.xfs, etc.
- mkfs will automatically determine the number of blocks on a device and set some reasonable defaults based on this. Unless you really understand what you are doing, do not change these defaults.

Mounting a file system

After creating a file system on a partition, you can mount it using the mount command.
- Usage: mount -t *type* *device* *mountpoint*
- Example: mount -t ext4 /dev/sda2 /mnt/mydisk
- To unmount a file system, use umount (note the spelling: there is no "n" after the first "u")
  - Example: umount /mnt/mydisk
- It is recommended to mount a file system by its UUID rather than by its device name. Device names are determined by the order in which the kernel finds the devices and can change over time.
  - You can use blkid or lsblk -f to identify the UUID of a partition
  - You can then mount using the UUID by:
    - mount UUID=<insert UUID here> /mnt/mydisk

The number of options available for the mount command is staggering. Review the mount man page for more info.

Buffering/Cache/Caching

Linux, like other Unix variants, buffers writes to the disk. This means that the kernel doesn't immediately write changes to the disk. Instead, it writes the changes to a buffer in RAM and later writes them to the disk when it deems appropriate.
When you unmount a file system with umount, its changes are automatically written to the disk from the buffer (which is why you should always unmount partitions before removing them from the system, e.g. USB drives). You can also force this to happen at any time using the sync command.
The kernel also uses a cache to store reads from the disk. This way, if a process repeatedly reads the same data, the kernel can serve it from the cache instead of going to the disk every time.
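A program can also request a flush for a single file instead of waiting for the kernel or running sync. The Python sketch below shows the usual sequence; write_and_flush is a hypothetical helper name, while flush and os.fsync are standard calls (os.sync is the system-wide counterpart of the sync command on Unix).

```python
import os


def write_and_flush(path, data):
    """Write data and block until the kernel reports it is on the storage device."""
    with open(path, "wb") as f:
        f.write(data)          # lands in Python's own user-space buffer
        f.flush()              # hands the data to the kernel (buffered in RAM)
        os.fsync(f.fileno())   # asks the kernel to write this file's dirty data to disk


if __name__ == "__main__":
    write_and_flush("example.dat", b"important bytes\n")
```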

Automatically mounting filesystems at boot time

The /etc/fstab file is used to automatically mount filesystems at boot time.
There are two alternatives to /etc/fstab:
- The /etc/fstab.d/ directory. On systems that support it, this directory can contain individual filesystem configuration files (one for each filesystem).
- Systemd unit files.
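Each non-comment line in /etc/fstab has up to six whitespace-separated fields: device, mount point, file system type, mount options, dump flag, and fsck pass number. The Python sketch below parses that layout. The sample entries and the UUID are made up for illustration, and the parser ignores details such as escaped spaces in paths.

```python
from typing import List, NamedTuple


class FstabEntry(NamedTuple):
    device: str          # device path, UUID=..., or LABEL=...
    mountpoint: str
    fstype: str
    options: List[str]
    dump: int            # optional in fstab, defaults to 0
    fsck_pass: int       # optional in fstab, defaults to 0


def parse_fstab(text):
    """Parse fstab-formatted text into a list of FstabEntry tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue                      # malformed line: too few fields
        dump = int(fields[4]) if len(fields) > 4 else 0
        fsck_pass = int(fields[5]) if len(fields) > 5 else 0
        entries.append(FstabEntry(fields[0], fields[1], fields[2],
                                  fields[3].split(","), dump, fsck_pass))
    return entries


# Hypothetical sample content, not taken from a real system.
SAMPLE = """\
# <device>  <mountpoint>  <type>  <options>  <dump>  <pass>
UUID=0a1b2c3d-0000-4000-8000-000000000001  /  ext4  errors=remount-ro  0  1
/dev/sdb1  /mnt/mydisk  xfs  defaults,noatime  0  2
"""

if __name__ == "__main__":
    for entry in parse_fstab(SAMPLE):
        print(entry)
```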

Filesystem Utilization

To view the utilization of your currently mounted filesystems, you can use the df command.
- Pass the -h flag to view free space in a human readable form
  - Example: df -h
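The same numbers are available to programs. Below is a rough Python sketch that uses shutil.disk_usage; the helper names human and usage are invented here, and the percentage is a simple used/total ratio that may differ slightly from the Use% column df prints.

```python
import shutil


def human(n):
    """Format a byte count with binary units, similar in spirit to df -h."""
    for unit in ("B", "K", "M", "G", "T", "P"):
        if n < 1024 or unit == "P":
            return f"{n:.1f}{unit}"
        n /= 1024


def usage(path="/"):
    """Return total/used/free bytes and a rough usage percentage for a mount."""
    total, used, free = shutil.disk_usage(path)
    percent = 100 * used / total if total else 0.0
    return {"total": total, "used": used, "free": free, "percent": percent}


if __name__ == "__main__":
    u = usage("/")
    print(f"size={human(u['total'])} used={human(u['used'])} "
          f"avail={human(u['free'])} use={u['percent']:.0f}%")
```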

Checking and Repairing Filesystems

Filesystem errors are usually due to a user shutting down a system in a wrong way (like pulling the power cable). Such situations could leave the filesystem cache in memory not matching the data on the disk. This is especially bad if the system is in the process of modifying the filesystem when you give it a kick. Many filesystems support journaling (ext3+ filesystems for example), but you should always shut down a system properly.
The tool to check a filesystem for errors is fsck. There is a different version of fsck for each filesystem that Linux supports. For example, the ext filesystems use e2fsck to check the filesystem for errors. However, you don't need to run e2fsck directly. You can just run fsck and it will usually detect the filesystem type and run the appropriate repair tool.
You should never run fsck on a mounted filesystem. The kernel may alter data in the filesystem as you run the check, causing runtime mismatches that can crash the system and corrupt files. There is one exception to this rule. If you mount the root partition in read-only, single-user mode, you can run fsck on it.
When fsck asks you about reconnecting an inode, it has found a file that doesn’t appear to have a name. When reconnecting an inode, fsck will place the file in the lost+found directory with a number as the name. fsck does this by walking through the inode table and directory structure to generate new link counts and a new block allocation map (such as the block bitmap), and then it compares this newly generated data with the filesystem on the disk. If there are mismatches, fsck must fix the link counts and determine what to do with any inodes and/or data that didn’t come up when it traversed the directory structure.
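The link-count pass described above can be illustrated with a toy model in Python. This is only a sketch of the idea, not how e2fsck is implemented: the dictionaries stand in for the on-disk inode table and directory files, "." and ".." entries are left out, and every name is hypothetical.

```python
from collections import Counter


def check_links(inode_table, directories, root):
    """Recount links by walking the directory tree and compare with stored counts.

    inode_table: {inode: stored link count}
    directories: {directory inode: {name: inode}}
    Returns (wrong_counts, orphans).
    """
    counted = Counter()
    seen_dirs = set()
    stack = [root]
    while stack:
        d = stack.pop()
        if d in seen_dirs:
            continue
        seen_dirs.add(d)
        for name, ino in directories.get(d, {}).items():
            counted[ino] += 1              # one link per directory entry
            if ino in directories:
                stack.append(ino)          # descend into subdirectories
    # Inodes whose stored count disagrees with the recount need fixing.
    wrong = {ino: (stored, counted[ino])
             for ino, stored in inode_table.items()
             if ino != root and counted[ino] > 0 and counted[ino] != stored}
    # Inodes never reached have no name: candidates for lost+found.
    orphans = sorted(ino for ino in inode_table
                     if ino != root and counted[ino] == 0)
    return wrong, orphans


if __name__ == "__main__":
    inode_table = {2: 1, 11: 1, 12: 2, 13: 1}
    directories = {2: {"a.txt": 11, "b.txt": 12}}
    print(check_links(inode_table, directories, root=2))
```

In this example inode 12 claims two links but only one name points at it, and inode 13 has no name at all, so it is the kind of file that would be reconnected under lost+found.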
On a system that has many problems, fsck can make things worse. One way to tell if you should cancel the fsck utility is if it asks a lot of questions while running the repair process. This is usually indicative of a bigger problem. If you think this is the case, you can run fsck -n to run fsck in dry mode (no changes will be made to the partition).
If you suspect that a superblock is corrupt, perhaps because someone overwrote the beginning of the disk, you might be able to recover the filesystem with one of the superblock backups that mkfs creates. Use fsck -b <num> to replace the corrupted superblock with an alternate at num and hope for the best. If you don’t know where to find a backup for the superblock, you can run mkfs -n on the device to view a list of superblock backup numbers without destroying your data.
You normally do not need to check ext3/4 filesystems manually because the journal ensures data integrity.
The kernel will not mount an ext3/4 filesystem with a non-empty journal. You can flush the journal with: e2fsck -fy /dev/<device>

Special purpose filesystems

- proc - mounted on /proc. Each numbered directory inside /proc refers to the PID of a running process on the system. The directory /proc/self represents the current process.
- sysfs - mounted on /sys. See ./devices.md for more info.
- tmpfs - mounted on /run and other locations. Allows you to use physical memory and swap space as temporary storage.
- squashfs - a type of read-only filesystem where content is stored in a compressed format and extracted on-demand through a loopback device.
- overlay - a filesystem that merges directories in a composite directory. Often used by containers.
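As a small illustration of the proc entry above, a process can read about itself through /proc/self. This Python sketch is Linux-only and assumes the usual "Key: value" layout of /proc/self/status; self_status is a made-up helper name.

```python
import os


def self_status(fields=("Name", "Pid", "VmRSS")):
    """Read selected fields from /proc/self/status (Linux only)."""
    result = {}
    with open("/proc/self/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in fields:
                result[key] = value.strip()
    return result


if __name__ == "__main__":
    if os.path.exists("/proc/self/status"):
        print(self_status())
    else:
        print("no /proc on this system")
```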

Swap space

Swap space is used to augment the RAM on a machine with disk space.
If you run out of physical memory, the Linux virtual memory system can move pages of memory to and from disk storage (swap space). This is referred to as paging.
You can use mkswap to create swap space on a partition. Then use swapon to enable it. You can also use swapoff to disable swap space.
In addition to using a disk partition for swap, you can also use a file. First create the file with dd, then run mkswap on the file and enable it with swapon. Example: dd if=/dev/zero of=/swapfile bs=1024k count=<size in megabytes>
High performance servers should not have swap space and should avoid disk access if at all possible.

Prefetch

A common file system workload involves reading a large amount of file data sequentially, for example, for a file system backup. This data may be too large to fit in the cache, or it may be read only once, and is therefore unlikely to remain in the cache. Such a workload would perform relatively poorly, as it would have a low cache-hit ratio.

Prefetch is a common file system feature for solving this problem. It can detect a sequential read workload based on the current and previous file I/O offsets, and then predict and issue disk reads before the application has requested them. This populates the file system cache, so that if the application does perform the expected read, it results in a cache-hit, rather than reading from much slower disk.

Prefetch behavior can be tuned on most systems.
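The detection logic can be sketched as a toy simulation in Python. This is an illustrative model rather than any kernel's actual read-ahead algorithm; the class name, window size, and threshold are arbitrary assumptions.

```python
class ReadAhead:
    """Toy prefetcher: detects sequential block reads and fetches ahead."""

    def __init__(self, window=4, threshold=2):
        self.window = window        # blocks to prefetch once a pattern is seen
        self.threshold = threshold  # consecutive sequential reads required
        self.last_block = None
        self.run = 0                # length of the current sequential run
        self.cache = set()
        self.hits = 0
        self.misses = 0

    def read(self, block):
        if block in self.cache:
            self.hits += 1
        else:
            self.misses += 1        # the application had to wait for the disk
            self.cache.add(block)
        # Compare the current offset with the previous one.
        if self.last_block is not None and block == self.last_block + 1:
            self.run += 1
        else:
            self.run = 0
        self.last_block = block
        # Sequential pattern detected: read upcoming blocks before they are requested.
        if self.run >= self.threshold:
            self.cache.update(range(block + 1, block + 1 + self.window))


if __name__ == "__main__":
    ra = ReadAhead()
    for b in range(10):
        ra.read(b)
    print("sequential:", ra.hits, "hits,", ra.misses, "misses")
```

With these settings, only the first three reads of a sequential scan miss; a random access pattern never triggers prefetch and gains nothing.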

Write-back Caching

Write-back caching is commonly used by file systems to improve write performance. It works by treating writes as completed after the transfer to main memory, and writing them to disk sometime later, asynchronously. The file system process for writing this ‘dirty’ data to disk is called ‘flushing’. The trade-off of write-back cache is reliability. DRAM-based main memory is volatile, and dirty data can be lost in the event of a power failure. Data could also be written to disk incompletely, leaving the disk in a corrupt state. If file-system metadata becomes corrupted, the file system may no longer load.
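The trade-off can be seen in a toy model. The Python sketch below is an in-memory simulation under simplifying assumptions (whole-block writes, a dictionary standing in for the disk); none of the names come from a real file system.

```python
class WriteBackCache:
    """Toy write-back cache: writes complete in memory and reach disk on flush."""

    def __init__(self):
        self.disk = {}      # stands in for persistent storage
        self.memory = {}    # cached blocks in volatile RAM
        self.dirty = set()  # blocks changed in memory but not yet on disk

    def write(self, block, data):
        self.memory[block] = data
        self.dirty.add(block)   # the write is treated as complete at this point

    def flush(self):
        """Write all dirty blocks to disk, as a background flusher would."""
        for block in sorted(self.dirty):
            self.disk[block] = self.memory[block]
        self.dirty.clear()

    def power_failure(self):
        """Drop everything in RAM and report which dirty blocks were lost."""
        lost = sorted(self.dirty)
        self.memory.clear()
        self.dirty.clear()
        return lost


if __name__ == "__main__":
    cache = WriteBackCache()
    cache.write(1, b"flushed")
    cache.flush()
    cache.write(2, b"never flushed")
    print("lost blocks:", cache.power_failure())
    print("on disk:", cache.disk)
```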

Synchronous writes

Synchronous writes are used by some applications such as database log writers, where the risk of data corruption for asynchronous writes is unacceptable.
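One way an application can request synchronous behavior is to open the file with the O_SYNC flag, so that each write returns only after the data has been handed to the storage device. The Python sketch below assumes a Unix-like system (the flag is looked up with getattr because it is not available on every platform), and append_log_record is a hypothetical helper.

```python
import os


def append_log_record(path, record):
    """Append a record with O_SYNC so the write is synchronous where supported."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        return os.write(fd, record)   # number of bytes written
    finally:
        os.close(fd)


if __name__ == "__main__":
    append_log_record("txn.log", b"BEGIN 42\n")
    append_log_record("txn.log", b"COMMIT 42\n")
```

The cost is latency: every call waits for the device, which is why this mode is usually reserved for data such as transaction logs.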

VFS (Virtual File System)

VFS provides a common interface for different file system types. Prior to VFS, different file systems required different system calls for interacting with each. The calls for interacting with a FAT file system were different from those for an ext file system.

File System Caches

Unix originally only had the buffer cache to improve performance of block device access. Nowadays, Linux has multiple different cache types.

- Page cache
- Buffer cache
- Directory cache
- Inode cache

Copy on write

A Copy on write (COW) file system does not overwrite existing blocks but instead follows these steps:

- Write blocks to a new location (a new copy)
- Update references to point to the new blocks
- Add the old blocks to the free list

This helps maintain file system integrity in the event of a system failure, and also helps improve performance by turning random writes into sequential ones.
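The three steps can be sketched with a toy block store in Python. This is a simplified model (one block per file, no snapshots or checksums), and the class and method names are invented for the example.

```python
class CowStore:
    """Toy copy-on-write store: an update never overwrites a live block."""

    def __init__(self, num_blocks):
        self.blocks = [None] * num_blocks
        self.free = list(range(num_blocks))   # free list of block indexes
        self.refs = {}                        # file name -> block index

    def write(self, name, data):
        new = self.free.pop(0)
        self.blocks[new] = data               # step 1: write to a new location
        old = self.refs.get(name)
        self.refs[name] = new                 # step 2: update the reference
        if old is not None:
            self.free.append(old)             # step 3: old block goes on the free list
        return old, new

    def read(self, name):
        return self.blocks[self.refs[name]]


if __name__ == "__main__":
    store = CowStore(4)
    store.write("f", b"v1")
    print(store.write("f", b"v2"))   # (old block, new block)
    print(store.read("f"))
```

If a crash happens between steps 1 and 2, the reference still points at the old, intact block, which is where the integrity benefit comes from.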

Troubleshooting File Systems

Key metrics for file systems include:

- Operation rate
- Operation latency

In Linux, there are typically no readily available metrics for file system operations (the exception being NFS, via nfsstat).

Tools:

- mount
- free
- top
- vmstat
- sar
- slabtop
- filetop
- cachestat
- fsck
- ext4slower
- e2fsck
