Learning about Partitions and How to Create Them for Fedora

Operating system distributions try to craft a one-size-fits-all partition layout for their file systems. Distributions cannot know the details of how your hardware is configured or how you use your system, though. Do you have more than one storage drive? If so, you might get a performance benefit by putting the write-heavy partitions (/var and swap, for example) on a separate drive from the ones that tend to be more read-intensive, since most drives cannot read and write at the same time. Or maybe you are running a database and have a small solid-state drive that would improve the database’s performance if its files were stored on the SSD.

The following sections attempt to describe in brief some of the historical reasons for separating some parts of the file system out into separate partitions so that you can make a more informed decision when you install your Linux operating system.

If you know more (or contradictory) historical details about the partitioning decisions that shaped the Linux operating systems used today, contribute what you know below in the comments section!

Common partitions and why or why not to create them

The boot partition

One of the reasons for putting the /boot directory on a separate partition was to ensure that the boot loader and kernel were located within the first 1024 cylinders of the disk. Most modern computers do not have the 1024 cylinder restriction, so for most people this concern is no longer relevant. However, modern UEFI-based computers have a different restriction that makes it necessary to have a separate partition for the boot loader. UEFI-based computers require that the boot loader (which can be the Linux kernel directly) be on a FAT-formatted file system. The Linux operating system, however, requires a POSIX-compliant file system that can designate access permissions to individual files. Since FAT file systems do not support access permissions, the boot loader must be on a separate file system from the rest of the operating system on modern UEFI-based computers. A single partition cannot be formatted with more than one type of file system.
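
If you are not sure whether your computer boots in UEFI or legacy BIOS mode, you can check from a running Linux system. The following one-liner is a minimal sketch that only assumes the standard sysfs layout of modern kernels (the /sys/firmware/efi directory exists only when the system was booted via UEFI):

$ [ -d /sys/firmware/efi ] && echo "UEFI boot" || echo "legacy BIOS boot"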

The var partition

One of the historical reasons for putting the /var directory on a separate partition was to prevent files that were frequently written to (/var/log/* for example) from filling up the entire drive. Since modern drives tend to be much larger and since other means like log rotation and disk quotas are available to manage storage utilization, putting /var on a separate partition may not be necessary. It is much easier to change a disk quota than it is to re-partition a drive.
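
As a sketch of what those other means can look like, you can check what is consuming space under /var and, as one example, cap how much disk space the systemd journal may use. SystemMaxUse is a standard journald.conf option, but the 500M limit below is an arbitrary example value:

# du -sh /var/* | sort -h
# mkdir -p /etc/systemd/journald.conf.d
# printf '[Journal]\nSystemMaxUse=500M\n' > /etc/systemd/journald.conf.d/size-limit.conf
# systemctl restart systemd-journald.service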

Another reason for isolating /var was that file system corruption was much more common in the original version of the Linux Extended File System (EXT). The file systems that had more write activity were much more likely to be irreversibly corrupted by a power outage than those that did not. By partitioning the disk into separate file systems, one could limit the scope of the damage in the event of file system corruption. This concern is no longer as significant because modern file systems support journaling.

The home partition

Having /home on a separate partition makes it possible to re-format the other partitions without overwriting your home directories. However, because modern Linux distributions are much better at doing in-place operating system upgrades, re-formatting shouldn’t be needed as frequently as it might have been in the past.

It can still be useful to have /home on a separate partition if you have a dual-boot setup and want both operating systems to share the same home directories. Or if your operating system is installed on a file system that supports snapshots and rollbacks and you want to be able to roll back your operating system to an older snapshot without reverting the content in your user profiles. Even then, some file systems allow their descendant file systems to be rolled back independently, so it still may not be necessary to have a separate partition for /home. On ZFS, for example, one pool/partition can have multiple descendant file systems.
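
For example, the following ZFS sketch keeps home in its own dataset so that rolling the root dataset back to a snapshot leaves it untouched. The pool and dataset names (rpool, rpool/root, rpool/home) are hypothetical:

# zfs create rpool/home
# zfs snapshot rpool/root@pre-upgrade
# zfs rollback rpool/root@pre-upgrade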

The swap partition

The swap partition reserves space for the contents of RAM to be written to permanent storage. There are pros and cons to having a swap partition. A pro of having swap memory is that it theoretically gives you time to gracefully shut down unneeded applications before the OOM killer takes matters into its own hands. This might be important if the system is running mission-critical software that you don’t want abruptly terminated. A con might be that your system runs so slowly when it starts swapping memory to disk that you’d rather have the OOM killer take care of the problem for you.

Another use for swap memory is hibernation mode. This might be where the rule that the swap partition should be twice the size of your computer’s RAM originated. Ideally, you should be able to put a system into hibernation even if nearly all of its RAM is in use. Beware that Linux’s support for hibernation is not perfect. It is not uncommon that after a Linux system is resumed from hibernation some hardware devices are left in an inoperable state (for example, no video from the video card or no internet from the WiFi card).
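
If you are sizing a swap partition with hibernation in mind, you can quickly compare the amount of installed RAM to the swap space you currently have:

$ free -h
$ swapon --show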

In any case, having a swap partition is more a matter of taste. It is not required.

The root partition

The root partition (/) is the catch-all for all directories that have not been assigned to a separate partition. There is always at least one root partition. BIOS-based systems that are new enough to not have the 1024 cylinder limit can be configured with only a root partition and no others so that there is never a need to resize a partition or file system if space requirements change.

The EFI system partition

The EFI System Partition (ESP) serves the same purpose on UEFI-based computers as the boot partition did on the older BIOS-based computers. It contains the boot loader and kernel. Because the files on the ESP need to be accessible by the computer’s firmware, the ESP has a few restrictions that the older boot partition did not have. The restrictions are:

  1. The ESP must be formatted with a FAT file system (vfat in Anaconda)
  2. The ESP must have a special type-code (EF00 when using gdisk)

Because the older boot partition did not have file system or type-code restrictions, it is permissible to apply the above properties to the boot partition and use it as your ESP. Note, however, that the GRUB boot loader does not support combining the boot and ESP partitions. If you use GRUB, you will have to create a separate ESP and mount it beneath the /boot directory (conventionally at /boot/efi).
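
As a rough sketch, the two properties listed above can be applied from the command line with the non-interactive sgdisk front end to gdisk and with mkfs.fat. The device name (/dev/sda) and the 512 MiB size are assumptions; adjust them to match your hardware:

# sgdisk --new=1:0:+512M --typecode=1:EF00 /dev/sda
# mkfs.fat -F 32 /dev/sda1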

The Boot Loader Specification (BLS) lists several reasons why it is ideal to use the legacy boot partition as your ESP. The reasons include:

  1. The UEFI firmware should be able to load the kernel directly. Having a separate, non-ESP-compliant boot partition for the kernel prevents the UEFI firmware from being able to directly load the kernel.
  2. Nesting the ESP mount point three mount levels deep increases the likelihood that an intermediate mount could fail or otherwise be unavailable when needed. That is, requiring root (/), then boot (/boot), then efi (/boot/efi) to be consecutively mounted is unnecessarily complex and prone to error.
  3. Requiring the boot loader to be able to read other partitions/disks which may be formatted with arbitrary file systems is non-trivial. Even when the boot loader does contain such code, the code that works at installation time can become outdated and fail to access the kernel/initrd after a file system update. This is currently true of GRUB’s ZFS file system driver, for example. You must be careful not to update your ZFS file system if you use the GRUB boot loader or else your system may not come back up the next time you reboot.

Besides the concerns listed above, it is a good idea to have your startup environment, up to and including your initramfs, on a single self-contained file system for recovery purposes. Suppose, for example, that you need to roll back your root file system because it has become corrupted or infected with malware. If your kernel and initramfs are on the root file system, you may be unable to perform the recovery. By having the boot loader, kernel, and initramfs all on a single file system that is rarely accessed or updated, you can increase your chances of being able to recover the rest of your system.

In summary, there are many ways that you can lay out your partitions, and the type of hardware (BIOS or UEFI) and the brand of boot loader (GRUB, Syslinux or systemd-boot) are among the factors that will influence which layouts will work.

Other considerations

MBR vs. GPT

GUID Partition Table (GPT) is the newer partition format that supports larger disks. GPT was designed to work with the newer UEFI firmware. It is backward-compatible with the older Master Boot Record (MBR) scheme in the sense that a GPT-partitioned drive can still be booted on a BIOS-based computer, but not all boot loaders support that boot method. GRUB and Syslinux support both BIOS and UEFI booting, but systemd-boot supports only the newer UEFI boot method.
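
To check which partition format an existing drive uses, you can print its partition table and look at the label type. Either of the following commands works (assuming the drive is /dev/sda):

# parted /dev/sda print | grep "Partition Table"
# fdisk -l /dev/sda | grep "Disklabel type"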

By using GPT now, you can increase the likelihood that your storage device, or an image of it, can be transferred over to a newer computer in the future should you wish to do so. If you have an older computer that natively supports only MBR-partitioned drives, you may need to add the inst.gpt parameter to Anaconda when starting the installer to get it to use the newer format. How to add the inst.gpt parameter is shown in the video below titled “Partitioning a BIOS Computer”.

If you use the GPT partition format on a BIOS-based computer, and you use the GRUB boot loader, you must additionally create a one-megabyte biosboot partition at the start of your storage device. The biosboot partition is not needed by any other brand of boot loader. How to create the biosboot partition is demonstrated in the video below titled “Partitioning a BIOS Computer”.
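
A minimal sketch of creating the biosboot partition with sgdisk is shown below (EF02 is gdisk’s type code for a BIOS boot partition; /dev/sda is an assumed device name). The partition does not need to be formatted with a file system, because GRUB embeds its core image there directly:

# sgdisk --new=1:0:+1M --typecode=1:EF02 /dev/sda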

LVM

One last thing to consider when manually partitioning your Linux system is whether to use standard partitions or logical volumes. Logical volumes are managed by the Logical Volume Manager (LVM). You can set up LVM volumes directly on your disk without first creating standard partitions to hold them. However, most computers still require that the boot partition be a standard partition and not an LVM volume. Consequently, LVM only adds complexity for such systems because the LVM volumes must be created within standard partitions anyway.
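
For reference, here is a bare-bones sketch of the extra steps that LVM adds on top of an existing standard partition. The partition (/dev/sda2), volume group name (fedora), and logical volume name (root) are all arbitrary examples:

# pvcreate /dev/sda2
# vgcreate fedora /dev/sda2
# lvcreate --name root --extents 100%FREE fedora
# mkfs.ext4 /dev/fedora/root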

The main features of LVM — online storage resizing and clustering — are not really applicable to the typical end user. Most laptops do not have hot-swappable drive bays for adding or reconfiguring storage while the system is running. And not many laptop or desktop users have clvmd configured so they can access a centralized storage device concurrently from multiple client computers.

LVM is great for servers and clusters. But it adds extra complexity for the typical end user. Go with standard partitions unless you are a server admin who needs the more advanced features.

Video demonstrations

Now that you know which partitions you need, you can watch the short video demonstrations below to see how to manually partition a Fedora Linux computer from the Anaconda installer.

These videos demonstrate creating only the minimally required partitions. You can add more if you choose.

Because the GRUB boot loader requires a more complex partition layout on UEFI systems, the video below titled “Partitioning a UEFI Computer” additionally demonstrates how to install the systemd-boot boot loader. By using the systemd-boot boot loader, you can reduce the number of needed partitions to just two: boot and root. How to use a boot loader other than the default (GRUB) with Fedora’s Anaconda installer is officially documented here.

Partitioning a UEFI Computer
Partitioning a BIOS Computer

Check storage performance with dd

This article includes some example commands to show you how to get a rough estimate of hard drive and RAID array performance using the dd command. Accurate measurements would have to take into account things like write amplification and system call overhead, which this guide does not. For a tool that may give more accurate results, consider using hdparm.
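
For comparison, hdparm has a built-in read-timing test, and its --direct flag bypasses the page cache so the result reflects the drive rather than RAM. A minimal invocation (run as root, with /dev/sda as an assumed device) looks like this:

# hdparm -t --direct /dev/sda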

To factor out performance issues related to the file system, these examples show how to test the performance of your drives and arrays at the block level by reading and writing directly to/from their block devices. WARNING: The write tests will destroy any data on the block devices against which they are run. Do not run them against any device that contains data you want to keep!

Four tests

Below are four example dd commands that can be used to test the performance of a block device:

  1. One process reading from $MY_DISK:
    # dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache
  2. One process writing to $MY_DISK:
    # dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct
  3. Two processes reading concurrently from $MY_DISK:
    # (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache &); (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache skip=200 &)
  4. Two processes writing concurrently to $MY_DISK:
    # (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct &); (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct skip=200 &)

– The iflag=nocache and oflag=direct parameters are important when performing the read and write tests (respectively) because without them the dd command will sometimes show the resulting speed of transferring the data to/from RAM rather than the hard drive.

– The values for the bs and count parameters are somewhat arbitrary and what I have chosen should be large enough to provide a decent average in most cases for current hardware.

– The null and zero devices are used for the destination and source (respectively) in the read and write tests because they are fast enough that they will not be the limiting factor in the performance tests.

– The skip=200 parameter on the second dd command in the concurrent read and write tests is to ensure that the two copies of dd are operating on different areas of the hard drive.

16 examples

Below are demonstrations showing the results of running each of the above four tests against each of the following four block devices:

  1. MY_DISK=/dev/sda2 (used in examples 1-X)
  2. MY_DISK=/dev/sdb2 (used in examples 2-X)
  3. MY_DISK=/dev/md/stripped (used in examples 3-X)
  4. MY_DISK=/dev/md/mirrored (used in examples 4-X)

A video demonstration of these tests being run on a PC is provided at the end of this guide.

Begin by putting your computer into rescue mode to reduce the chances that disk I/O from background services might randomly affect your test results. WARNING: This will shutdown all non-essential programs and services. Be sure to save your work before running these commands. You will need to know your root password to get into rescue mode. The passwd command, when run as the root user, will prompt you to (re)set your root account password.

$ sudo -i
# passwd
# setenforce 0
# systemctl rescue

You might also want to temporarily disable logging to disk:

# sed -r -i.bak 's/^#?Storage=.*/Storage=none/' /etc/systemd/journald.conf
# systemctl restart systemd-journald.service

If you have a swap device, it can be temporarily disabled and its underlying drives used to perform the following tests. The commands below assume that swap is on an mdadm RAID device named /dev/md/swap; adjust them to match your setup:

# swapoff -a
# MY_DEVS=$(mdadm --detail /dev/md/swap | grep active | grep -o "/dev/sd.*")
# mdadm --stop /dev/md/swap
# mdadm --zero-superblock $MY_DEVS

Example 1-1 (reading from sda)

# MY_DISK=$(echo $MY_DEVS | cut -d ' ' -f 1)
# dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.7003 s, 123 MB/s

Example 1-2 (writing to sda)

# MY_DISK=$(echo $MY_DEVS | cut -d ' ' -f 1)
# dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.67117 s, 125 MB/s

Example 1-3 (reading concurrently from sda)

# MY_DISK=$(echo $MY_DEVS | cut -d ' ' -f 1)
# (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache &); (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache skip=200 &)
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.42875 s, 61.2 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.52614 s, 59.5 MB/s

Example 1-4 (writing concurrently to sda)

# MY_DISK=$(echo $MY_DEVS | cut -d ' ' -f 1)
# (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct &); (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct skip=200 &)
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.2435 s, 64.7 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.60872 s, 58.1 MB/s

Example 2-1 (reading from sdb)

# MY_DISK=$(echo $MY_DEVS | cut -d ' ' -f 2)
# dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.67285 s, 125 MB/s

Example 2-2 (writing to sdb)

# MY_DISK=$(echo $MY_DEVS | cut -d ' ' -f 2)
# dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.67198 s, 125 MB/s

Example 2-3 (reading concurrently from sdb)

# MY_DISK=$(echo $MY_DEVS | cut -d ' ' -f 2)
# (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache &); (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache skip=200 &)
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.52808 s, 59.4 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.57736 s, 58.6 MB/s

Example 2-4 (writing concurrently to sdb)

# MY_DISK=$(echo $MY_DEVS | cut -d ' ' -f 2)
# (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct &); (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct skip=200 &)
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.7841 s, 55.4 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.81475 s, 55.0 MB/s

Example 3-1 (reading from RAID0)

# mdadm --create /dev/md/stripped --homehost=any --metadata=1.0 --level=0 --raid-devices=2 $MY_DEVS
# MY_DISK=/dev/md/stripped
# dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.837419 s, 250 MB/s

Example 3-2 (writing to RAID0)

# MY_DISK=/dev/md/stripped
# dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.823648 s, 255 MB/s

Example 3-3 (reading concurrently from RAID0)

# MY_DISK=/dev/md/stripped
# (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache &); (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache skip=200 &)
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.31025 s, 160 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.80016 s, 116 MB/s

Example 3-4 (writing concurrently to RAID0)

# MY_DISK=/dev/md/stripped
# (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct &); (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct skip=200 &)
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.65026 s, 127 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.81323 s, 116 MB/s

Example 4-1 (reading from RAID1)

# mdadm --stop /dev/md/stripped
# mdadm --create /dev/md/mirrored --homehost=any --metadata=1.0 --level=1 --raid-devices=2 --assume-clean $MY_DEVS
# MY_DISK=/dev/md/mirrored
# dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.74963 s, 120 MB/s

Example 4-2 (writing to RAID1)

# MY_DISK=/dev/md/mirrored
# dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.74625 s, 120 MB/s

Example 4-3 (reading concurrently from RAID1)

# MY_DISK=/dev/md/mirrored
# (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache &); (dd if=$MY_DISK of=/dev/null bs=1MiB count=200 iflag=nocache skip=200 &)
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.67171 s, 125 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.67685 s, 125 MB/s

Example 4-4 (writing concurrently to RAID1)

# MY_DISK=/dev/md/mirrored
# (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct &); (dd if=/dev/zero of=$MY_DISK bs=1MiB count=200 oflag=direct skip=200 &)
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 4.09666 s, 51.2 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 4.1067 s, 51.1 MB/s

Restore your swap device and journald configuration

# mdadm --stop /dev/md/stripped /dev/md/mirrored
# mdadm --create /dev/md/swap --homehost=any --metadata=1.0 --level=1 --raid-devices=2 $MY_DEVS
# mkswap /dev/md/swap
# swapon -a
# mv /etc/systemd/journald.conf.bak /etc/systemd/journald.conf
# systemctl restart systemd-journald.service
# reboot

Interpreting the results

Examples 1-1, 1-2, 2-1, and 2-2 show that each of my drives read and write at about 125 MB/s.

Examples 1-3, 1-4, 2-3, and 2-4 show that when two reads or two writes are done in parallel on the same drive, each process gets about half the drive’s bandwidth (roughly 60 MB/s).

The 3-x examples show the performance benefit of putting the two drives together in a RAID0 (data striping) array. The numbers, in all cases, show that the RAID0 array performs about twice as fast as either drive is able to perform on its own. The trade-off is that you are twice as likely to lose everything, because each drive only contains half the data. A three-drive array would perform three times as fast as a single drive (all drives being equal) but it would be three times as likely to suffer a catastrophic failure.

The 4-x examples show that the performance of the RAID1 (data mirroring) array is similar to that of a single disk except for the case where multiple processes are reading concurrently (example 4-3). In that case, the performance of the RAID1 array is similar to that of the RAID0 array. This means that you will see a performance benefit with RAID1, but only when processes are reading concurrently; for example, when a process accesses a large number of files in the background while you are using a web browser or email client in the foreground. The main benefit of RAID1 is that your data is unlikely to be lost if a drive fails.

Video demo

Testing storage throughput using dd

Troubleshooting

If the above tests aren’t performing as you expect, you might have a bad or failing drive. Most modern hard drives have built-in Self-Monitoring, Analysis and Reporting Technology (SMART). If your drive supports it, the smartctl command can be used to query your hard drive for its internal statistics:

# smartctl --health /dev/sda
# smartctl --log=error /dev/sda
# smartctl -x /dev/sda

Another way that you might be able to tune your PC for better performance is by changing your I/O scheduler. Linux systems support several I/O schedulers and the current default for Fedora systems is the multiqueue variant of the deadline scheduler. The default performs very well overall and scales extremely well for large servers with many processors and large disk arrays. There are, however, a few more specialized schedulers that might perform better under certain conditions.

To view which I/O scheduler your drives are using, issue the following command:

$ for i in /sys/block/sd?/queue/scheduler; do echo "$i: $(<$i)"; done

You can change the scheduler for a drive by writing the name of the desired scheduler to the /sys/block/<device name>/queue/scheduler file:

# echo bfq > /sys/block/sda/queue/scheduler

You can make your changes permanent by creating a udev rule for your drive. The following example shows how to create a udev rule that will set all rotational drives to use the BFQ I/O scheduler:

# cat << END > /etc/udev/rules.d/60-ioscheduler-rotational.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
END

Here is another example that sets all solid-state drives to use the none I/O scheduler (the multi-queue equivalent of the legacy noop scheduler):

# cat << END > /etc/udev/rules.d/60-ioscheduler-solid-state.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
END

Changing your I/O scheduler won’t affect the raw throughput of your devices, but it might make your PC seem more responsive by prioritizing the bandwidth for the foreground tasks over the background tasks or by eliminating unnecessary block reordering.


Photo by James Donovan on Unsplash.