file systems
- A file system is like a database. It defines the structure that transforms a simple block device into a sophisticated hierarchy of files and directories that users can understand.
- File systems are typically implemented in the kernel. However, 9P from Plan 9 inspired the development of user-space file systems. FUSE (Filesystem in Userspace) allows file systems to be implemented in user space.
- The following list shows the most common file systems in use today:
- ext4 (fourth extended file system)
  - Supports journaling (a log of pending changes kept in a reserved area of the disk) to protect file system integrity and speed up booting after an unclean shutdown (journaling was introduced with ext3)
  - An incremental improvement over ext2/3 that supports larger files and a greater number of subdirectories
- btrfs (b-tree file system)
  - A newer file system native to Linux, designed to scale beyond the limitations of ext4
- FAT (file allocation table)
  - 3 types: msdos, vfat, exfat
  - Used by most removable flash media
  - Supported by Windows, Darwin, and Linux
- XFS
  - A high-performance file system used by default on some Linux distros, such as RHEL
- HFS+
  - An Apple standard file system used on Mac systems
- ISO 9660
  - Used on CD-ROM discs
Directories
- A directory in Linux (on ext file systems) is just a file that contains a table with two columns: the name and the inode number of each file in the directory.
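The name-to-inode table of a directory can be inspected from user space. The following is a minimal Python sketch, not a definitive tool; the helper name `list_dir_inodes` is made up for illustration, and it relies on the standard `os.scandir` call.

```python
import os

def list_dir_inodes(path="."):
    """Return {name: inode number} for the entries of a directory."""
    with os.scandir(path) as entries:
        return {entry.name: entry.inode() for entry in entries}

if __name__ == "__main__":
    # Print one "inode name" pair per directory entry.
    for name, inode in sorted(list_dir_inodes().items()):
        print(inode, name)
```

Each printed pair corresponds to one row of the directory table described above.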
inodes
- A traditional *nix file system has two primary components: a pool of data blocks where you can store data and a database system that manages that data pool. The database system is centered around the inode data structure. An inode holds the metadata about a file, such as its type, permissions, owner, timestamps, and the location of its data blocks. Inodes are identified by numbers in an inode table. Both files and directories are represented by inodes; file names are stored in directory entries, not in the inode itself.
- A directory inode contains a list of filenames and links corresponding to other inodes.
- To view the inode numbers for any directory, use `ls -i` or `stat <filename>`
- An inode records 3 time stamps:
  - atime: last time the file’s contents were accessed (read)
  - mtime: last time the file’s contents were modified
  - ctime: ctime IS NOT creation time. It is the last time the inode was changed. For example, using `chown` or `chmod` will change the ctime.
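The difference between mtime and ctime can be checked with `os.stat`. This is a small sketch assuming a Unix-like system (on Windows, `st_ctime` has a different meaning); the helper name `timestamps` and the backdated value are illustrative choices only.

```python
import os
import tempfile

def timestamps(path):
    """Return (atime, mtime, ctime) in seconds since the epoch."""
    st = os.stat(path)
    return st.st_atime, st.st_mtime, st.st_ctime

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        # Backdate atime and mtime so they are easy to tell apart from ctime.
        os.utime(f.name, (1_000_000_000, 1_000_000_000))
        before = timestamps(f.name)
        os.chmod(f.name, 0o600)  # metadata-only change: touches the inode
        after = timestamps(f.name)
        print("mtime unchanged by chmod:", before[1] == after[1])
        print("ctime did not go backwards:", after[2] >= before[2])
```

Note that `os.utime` can set atime and mtime, but there is no equivalent call for setting ctime; the kernel updates it whenever the inode changes.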
Format a partition with a file system
- When preparing new storage devices, after partitioning the device, you are ready to create a file system
- You can use `mkfs` to create a file system on a partition. `mkfs` is a front end for a set of per-file-system programs, such as `mkfs.ext4` and `mkfs.xfs`.
- `mkfs` will automatically determine the number of blocks on a device and set some reasonable defaults based on this. Unless you really understand what you are doing, do not change these defaults.
Mounting a file system
- After creating a file system on a partition, you can mount it using the `mount` command
  - Usage: `mount -t *type* *device* *mountpoint*`
  - Example: `mount -t ext4 /dev/sda2 /mnt/mydisk`
- To unmount a file system, use `umount` (note the spelling: there is no `unmount` command)
  - Example: `umount /mnt/mydisk`
- It is recommended to mount a file system by its UUID rather than by its device name. Device names are determined by the order in which the kernel finds the devices and can change over time.
  - You can use `blkid` or `lsblk -f` to identify the UUID of a partition
  - You can then mount using the UUID: `mount UUID=<insert UUID here> /mnt/mydisk`
The number of options available for the `mount` command is staggering. Review the man page (`man 8 mount`) for more info.
Buffering/Cache/Caching
- Linux, like other Unix variants, buffers writes to the disk. This means that the kernel doesn’t immediately write changes to the disk, but instead writes them to a buffer in RAM and later writes them to the disk when it deems appropriate.
- When you unmount a file system with `umount`, its changes are automatically written to the disk from the buffer (which is why you should always unmount partitions, such as those on USB drives, before removing them from the system). However, you can also force this to happen at any time using the `sync` command.
- The kernel also uses a cache to store reads from the disk. This way, if a process repeatedly reads the same data, the kernel can serve it from the cache instead of going to the disk every time.
Automatically mounting filesystems at boot time
- The `/etc/fstab` file is used to automatically mount filesystems at boot time
- There are two alternatives to `/etc/fstab`:
  - The `/etc/fstab.d/` directory. This directory can contain individual filesystem configuration files (one for each filesystem).
  - Systemd unit files.
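To make the layout of an `/etc/fstab` entry concrete, here is a hedged Python sketch that splits one line into its six whitespace-separated fields (device, mount point, type, options, dump, fsck pass). The names `FstabEntry`, `parse_fstab_line`, and `SAMPLE` are hypothetical, the UUID is made up, and the parser ignores escape sequences such as `\040`.

```python
from typing import NamedTuple, Optional

class FstabEntry(NamedTuple):
    device: str       # device path, or a UUID=... / LABEL=... spec
    mountpoint: str
    fstype: str
    options: list
    dump: int
    fsck_pass: int

def parse_fstab_line(line: str) -> Optional[FstabEntry]:
    """Parse one fstab line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    # The last two fields are optional and default to 0.
    dump = int(fields[4]) if len(fields) > 4 else 0
    fsck_pass = int(fields[5]) if len(fields) > 5 else 0
    return FstabEntry(fields[0], fields[1], fields[2],
                      fields[3].split(","), dump, fsck_pass)

# Hypothetical entry; the UUID below is invented for illustration.
SAMPLE = "UUID=0a1b2c3d-1111-2222-3333-444455556666 /mnt/mydisk ext4 defaults,noatime 0 2"

if __name__ == "__main__":
    print(parse_fstab_line(SAMPLE))
```

Using a `UUID=` spec in the first field keeps the entry valid even if the kernel assigns the device a different name on the next boot.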
Filesystem Utilization
- To view the utilization of your currently mounted filesystems, you can use the `df` command.
  - Pass the `-h` flag to view free space in a human-readable form
    - Example: `df -h`
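The same utilization figures can be read programmatically. The sketch below uses the standard `shutil.disk_usage` call; the helper names `utilization` and `human` are made up, and the percentage is a simple used/total ratio, so it may differ slightly from the `Use%` column of `df`, which accounts for reserved blocks.

```python
import shutil

def utilization(path="/"):
    """Return (total, used, free, percent_used) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return usage.total, usage.used, usage.free, percent

def human(nbytes):
    """Format a byte count with a binary unit suffix, similar in spirit to df -h."""
    for unit in ("B", "K", "M", "G", "T"):
        if nbytes < 1024 or unit == "T":
            return f"{nbytes:.1f}{unit}"
        nbytes /= 1024

if __name__ == "__main__":
    total, used, free, percent = utilization("/")
    print(f"size={human(total)} used={human(used)} avail={human(free)} use={percent:.0f}%")
```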
Checking and Repairing Filesystems
- Filesystem errors are usually caused by an unclean shutdown (like pulling the power cable). Such situations can leave the filesystem cache in memory out of sync with the data on the disk. This is especially bad if the system was in the process of modifying the filesystem at the time. Many filesystems support journaling (ext3+ filesystems, for example) to limit the damage, but you should always shut down a system properly.
- The tool to check a filesystem for errors is `fsck`. There is a different version of `fsck` for each filesystem that Linux supports. For example, the ext filesystems use `e2fsck` to check the filesystem for errors. However, you don’t need to run `e2fsck` directly. You can just run `fsck` and it will usually detect the filesystem type and run the appropriate repair tool.
- You should never run `fsck` on a mounted filesystem. The kernel may alter data in the filesystem as you run the check, causing runtime mismatches that can crash the system and corrupt files. There is one exception to this rule: if you mount the root partition read-only in single-user mode, you can run `fsck` on it.
- When `fsck` asks you about reconnecting an inode, it has found a file that doesn’t appear to have a name. When reconnecting an inode, `fsck` will place the file in the lost+found directory with a number as the name. `fsck` does this by walking through the inode table and directory structure to generate new link counts and a new block allocation map (such as the block bitmap), and then it compares this newly generated data with the filesystem on the disk. If there are mismatches, `fsck` must fix the link counts and determine what to do with any inodes and/or data that didn’t come up when it traversed the directory structure.
- On a system that has many problems, `fsck` can make things worse. One sign that you should cancel `fsck` is that it asks a lot of questions during the repair process. This usually indicates a bigger problem. If you think this is the case, you can run `fsck -n` to check the filesystem in dry-run mode (no changes will be made to the partition).
- If you suspect that a superblock is corrupt, perhaps because someone overwrote the beginning of the disk, you might be able to recover the filesystem with one of the superblock backups that `mkfs` creates. Use `fsck -b <num>` to replace the corrupted superblock with an alternate at block *num* and hope for the best. If you don’t know where to find a backup for the superblock, you can run `mkfs -n` on the device to view a list of superblock backup numbers without destroying your data. On ext filesystems, these options are handled by `e2fsck` and `mke2fs` respectively.
- You normally do not need to check ext3/4 filesystems manually because the journal ensures data integrity.
- The kernel will not mount an ext3/4 filesystem with a non-empty journal. You can flush the journal using `e2fsck -fy /dev/<device>`
Special purpose filesystems
- proc - mounted on `/proc`. Each numbered directory inside `/proc` refers to the PID of a running process on the system. The directory `/proc/self` represents the current process.
- sysfs - mounted on `/sys`. See `./devices.md` for more info.
- tmpfs - mounted on `/run` and other locations. Allows you to use physical memory and swap space as temporary storage.
- squashfs - a type of read-only filesystem where content is stored in a compressed format and extracted on demand through a loopback device.
- overlay - a filesystem that merges directories into a composite directory. Often used by containers.
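Because proc exposes process information as ordinary files, it can be explored with plain file calls. The following sketch assumes a Linux system with procfs mounted on `/proc` and returns empty results elsewhere; the helper names are made up for illustration.

```python
import os

def pid_from_proc_self():
    """Return the PID that /proc/self points at, or None if procfs is unavailable."""
    try:
        # /proc/self is a symbolic link whose target is the caller's PID.
        return int(os.readlink("/proc/self"))
    except (OSError, ValueError):
        return None

def running_pids():
    """Return the numbered directories under /proc as a sorted list of PIDs."""
    try:
        return sorted(int(name) for name in os.listdir("/proc") if name.isdigit())
    except OSError:
        return []

if __name__ == "__main__":
    print("this process:", pid_from_proc_self(), "os.getpid():", os.getpid())
    print("processes visible in /proc:", len(running_pids()))
```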
Swap space
- Swap space is used to augment the RAM on a machine with disk space
- If you run out of physical memory, the Linux virtual memory system can move pages of memory to and from disk storage (swap space). This is referred to as paging.
- You can use `mkswap` to create swap space on a partition. Then use `swapon` to enable it. You can also use `swapoff` to disable swap space.
- In addition to using a partition for swap, you can also use a file. First create the file with `dd`, then run `mkswap` and `swapon` on it as you would for a partition. Example: `dd if=/dev/zero of=/swapfile bs=1024k count=<size in megabytes>`
- High-performance servers should not have swap space and should avoid disk access if at all possible.
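To check how much swap is configured and free, Linux reports `SwapTotal` and `SwapFree` in `/proc/meminfo`. The sketch below parses text in that format; the function name and the sample figures are invented for illustration.

```python
def swap_from_meminfo(text):
    """Extract SwapTotal and SwapFree (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])
    return values

# Made-up sample in the /proc/meminfo layout.
SAMPLE = """MemTotal:       16303428 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
"""

if __name__ == "__main__":
    print(swap_from_meminfo(SAMPLE))
```

On a Linux machine, the real values can be obtained by passing the contents of `/proc/meminfo` to the same function.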
Prefetch
A common file system workload involves reading a large amount of file data sequentially, for example, for a file system backup. This data may be too large to fit in the cache, or it may be read only once, and is therefore unlikely to remain in the cache. Such a workload would perform relatively poorly, as it would have a low cache-hit ratio.
Prefetch is a common file system feature for solving this problem. It can detect a sequential read workload based on the current and previous file I/O offsets, and then predict and issue disk reads before the application has requested them. This populates the file system cache, so that if the application does perform the expected read, it results in a cache hit rather than a read from the much slower disk.
Prefetch can be tuned on most systems.
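The detection logic described above can be illustrated with a toy model. This is a simplified sketch, not how any particular kernel implements read-ahead: the class name, the block-granular interface, the fixed read-ahead window, and the unbounded cache are all assumptions made for illustration.

```python
class PrefetchingCache:
    """Toy block cache that prefetches when it sees two consecutive block numbers."""

    def __init__(self, readahead=4):
        self.readahead = readahead   # how many blocks to fetch ahead
        self.cached = set()          # block numbers currently in the cache
        self.prev_block = None       # previous block requested
        self.hits = 0
        self.misses = 0

    def read(self, block):
        if block in self.cached:
            self.hits += 1
        else:
            self.misses += 1         # would be a disk read in a real system
            self.cached.add(block)
        # Sequential pattern: the current offset directly follows the previous one.
        if self.prev_block is not None and block == self.prev_block + 1:
            for ahead in range(block + 1, block + 1 + self.readahead):
                self.cached.add(ahead)
        self.prev_block = block

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

if __name__ == "__main__":
    sequential = PrefetchingCache()
    for block in range(100):
        sequential.read(block)
    strided = PrefetchingCache()
    for block in range(0, 1000, 10):
        strided.read(block)
    print("sequential hit ratio:", sequential.hit_ratio())
    print("strided hit ratio:", strided.hit_ratio())
```

In this model, only the first two reads of a sequential scan miss; a strided pattern never triggers prefetch and never hits.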
Write-back Caching
Write-back caching is commonly used by file systems to improve write performance. It works by treating writes as completed after the transfer to main memory, and writing them to disk sometime later, asynchronously. The file system process for writing this ‘dirty’ data to disk is called ‘flushing’. The trade-off of write-back cache is reliability. DRAM-based main memory is volatile, and dirty data can be lost in the event of a power failure. Data could also be written to disk incompletely, leaving the disk in a corrupt state. If file-system metadata becomes corrupted, the file system may no longer load.
Synchronous writes
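A synchronous write completes only after the data has been written to persistent storage, rather than as soon as it reaches main memory. Synchronous writes are used by some applications, such as database log writers, where the risk of data loss or corruption from asynchronous writes is unacceptable. The cost is higher write latency, because each write waits for the disk.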
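One common way for an application to get synchronous behavior is to flush its own buffers and then call `fsync` on the file. The sketch below shows that pattern with the standard `os.fsync` call; the helper name `durable_write` is made up, and the example leaves out further steps a real database would take (such as syncing the containing directory).

```python
import os
import tempfile

def durable_write(path, data: bytes):
    """Write data and return only after fsync has completed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move Python's user-space buffer into the kernel
        os.fsync(f.fileno())   # ask the kernel to write its dirty pages to the device

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "log.bin")
        durable_write(target, b"commit record\n")
        print("bytes on disk:", os.path.getsize(target))
```

Without the `fsync` call, the write would be treated as complete once it reached the write-back cache.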
VFS (Virtual File System)
VFS provides a common interface for different file system types. Prior to VFS, different file systems required different system calls. For example, the calls for interacting with a FAT file system were different from those for an ext file system.
File System Caches
Unix originally had only the buffer cache to improve the performance of block device access. Today, Linux has several cache types:
- Page Cache
- Buffer cache
- Directory Cache
- inode cache
Copy on write
A Copy on write (COW) file system does not overwrite existing blocks but instead follows these steps:
- Write blocks to a new location (a new copy)
- Update references to new blocks
- Add old blocks to the free list
This helps maintain file system integrity in the event of a system failure, because the old blocks remain valid until the references are updated. It can also improve performance by turning random writes into sequential ones.
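The three steps above can be sketched with a toy in-memory block store. This is an illustration only: the class `CowStore`, its one-block-per-file layout, and its simple free list are assumptions, not a model of any real file system.

```python
class CowStore:
    """Toy copy-on-write store: each file name references exactly one block."""

    def __init__(self, num_blocks):
        self.blocks = [None] * num_blocks
        self.free = list(range(num_blocks))   # free list of block indexes
        self.refs = {}                        # file name -> block index

    def write(self, name, data):
        new = self.free.pop(0)        # 1. write to a new location (IndexError if full)
        self.blocks[new] = data
        old = self.refs.get(name)
        self.refs[name] = new         # 2. update the reference to the new block
        if old is not None:
            self.free.append(old)     # 3. add the old block to the free list
        return new

    def read(self, name):
        return self.blocks[self.refs[name]]

if __name__ == "__main__":
    store = CowStore(4)
    first = store.write("f", b"v1")
    second = store.write("f", b"v2")
    print("blocks used:", first, second)
    print("current contents:", store.read("f"))
    print("old block still holds:", store.blocks[first])
```

Because the old block is never overwritten in place, a failure between steps 1 and 2 leaves the previous version of the file intact.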
Troubleshooting File Systems
Key metrics for file systems include:
- Operation Rate
- Operation latency
In Linux, there are typically no readily available metrics for file system operations (the exception being NFS, via nfsstat).
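When the system provides no ready-made figures, operation rate and latency can be approximated from the application side. The sketch below times small write-plus-`fsync` operations; the helper name, the operation count, and the 4096-byte payload are arbitrary choices, and the result measures this one workload rather than the file system as a whole.

```python
import os
import tempfile
import time

def measure_writes(directory, count=100, size=4096):
    """Time `count` write+fsync operations; return (ops_per_second, avg_latency_seconds)."""
    payload = b"\0" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    total = sum(latencies)
    return count / total, total / count

if __name__ == "__main__":
    rate, latency = measure_writes(tempfile.gettempdir())
    print(f"{rate:.0f} ops/s, {latency * 1000:.3f} ms average latency")
```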
Tools:
- mount
- free
- top
- vmstat
- sar
- slabtop
- filetop
- cachestat
- fsck
- ext4slower
- e2fsck