Moving a Linux system's root without rebooting
On a long-lived Linux server I operate, I’ve periodically found it desirable to move the root filesystem (containing core OS files and the like) to a new drive for various reasons. Over the years it’s migrated from a conveniently inexpensive mechanical hard drive to a series of increasingly capable and larger solid-state drives (SSDs), first connected over SATA and later in the form of M.2 NVMe cards. I recently sought to do another upgrade, swapping the current NVMe drive for a larger one. Making things more interesting, I’ve developed strategies to do this without rebooting, except to physically plug or unplug devices!
Traditionally, migrating a system to a new root disk like this involves either reinstalling completely and restoring data afterwards, or booting from some kind of rescue media (a “Live CD”) and manually copying everything from the old disk to the new. Both require some system downtime, and the former makes it easy to unintentionally leave behind important aspects of system configuration, so I’ve usually done the latter.
During an earlier migration of this same system, I realized that Linux’s Logical Volume Manager (LVM) could make these kinds of migrations easier. Primarily, LVM makes it easier to resize disk partitions after moving them to a new disk, allowing me to take advantage of progressively larger disks over time; but it also turns out that LVM’s layers of abstraction make it possible to move filesystems (including the root) without even turning the computer off, doing a kind of “switcheroo” that minimizes downtime.
Depending on the hardware configuration, this approach allows disk migration without any downtime at all, provided the storage devices involved can be hotplugged or can both be connected at once! It’s been a while since I last did this switcheroo, and the techniques have come up more than once in conversation with other professional computer people since then, so this time I’ve chosen to write up the process: it can serve both as notes for me the next time I want to do it and as a demonstration for others.
Setup
Before diving into the details, I should recap what the system configuration looks like. This is a fairly recent consumer-grade PC based around an Intel processor, with the OS installed on an M.2 NVMe SSD in a slot on the mainboard. The new drive I want to move to is another M.2 SSD, but with greater capacity.
Because the mainboard does not have any spare M.2 slots where I could install the new SSD, I’ve used a USB enclosure to connect the new drive over USB, with the expectation that after the migration is complete I can incur a small amount of downtime to take out the old SSD and install the new one in its slot on the mainboard. Running the new drive over USB probably degrades performance somewhat, but that temporary loss of performance is unimportant to me.
Discovering the current state
Although I know this system uses LVM, it’s been a while since I configured it, so many of the details are fuzzy. Before beginning, I needed to investigate the current configuration in order to understand what I should do to perform the switcheroo.
First, I used lsblk
to understand what storage devices were connected and in use (with irrelevant ones omitted here):
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sdf 8:80 0 953.9G 0 disk
nvme0n1 259:1 0 238.5G 0 disk
├─nvme0n1p1 259:2 0 512M 0 part /boot
└─nvme0n1p2 259:3 0 238G 0 part
├─myhost-root 254:0 0 40G 0 lvm /
├─myhost-home 254:1 0 60G 0 lvm /home
└─myhost-var 254:2 0 60G 0 lvm /var
Or reformatted as a table, omitting columns that aren’t interesting and translating a few items to more understandable versions:
| Device name | Size | Type | Mountpoint |
|---|---|---|---|
| sdf | 953.9G | disk | |
| nvme0n1 | 238.5G | disk | |
| nvme0n1p1 | 512M | partition | /boot |
| nvme0n1p2 | 238G | partition | |
| myhost-root | 40G | LVM LV | / |
| myhost-home | 60G | LVM LV | /home |
| myhost-var | 60G | LVM LV | /var |
The disk sdf is the new drive in the USB enclosure, and nvme0n1 is the current boot disk containing the root filesystem as well as /home and /var, all of which need to be moved. Most of the data is backed by LVM (the type column reads lvm, meaning it’s on an LVM LV), which is what enables this whole procedure.
There’s also a small (512MB) partition at nvme0n1p1 mounted at /boot. This will be important to preserve. Looking at the disk’s partition table in detail provides more useful information (with some things that are not relevant to this discussion removed):
# gdisk -l /dev/nvme0n1
Found valid GPT with protective MBR; using GPT.
Disk /dev/nvme0n1: 500118192 sectors, 238.5 GiB
Sector size (logical/physical): 512/512 bytes
Number Start (sector) End (sector) Size Code Name
1 2048 1050623 512.0 MiB EF00 EFI system partition
2 1050624 500117503 238.0 GiB 8E00 Linux LVM
Knowing the boot partition’s exact size in sectors will be useful later (to confirm that the new one is exactly the same), and this confirms that there are only two partitions on the old disk.
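Incidentally (this is a cross-check of my own, not part of the original procedure), blockdev from util-linux reports a partition’s length in 512-byte sectors directly, which avoids parsing gdisk output; here it should print 1048576, i.e. 1050623 − 2048 + 1, matching the 512 MiB above:
# blockdev --getsz /dev/nvme0n1p1
1048576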
EFI configuration
The EFI system partition
type for the boot partition indicates that it is in
fact an EFI system partition (ESP), telling us a few things:
- This machine uses UEFI to boot, not legacy BIOS.
- It’s impossible to put this data on LVM, because the system firmware needs to be able to read the files stored therein. The partition is formatted with a FAT32 filesystem, as is typical for EFI systems; some bootloaders are capable of accessing data on LVM, but I doubt any UEFI firmware implementations out there can.
- The boot sector and any data stored outside a partition does not need to be preserved, because only the data on the ESP is needed to boot. Compare to legacy BIOS, where the boot sector always contains bootloader code that must be preserved and sometimes bootloader code is stored in unpartitioned areas of the disk which must also be preserved.
- The firmware may have references to partition UUIDs that tell it what to boot from, which would need to either be preserved or updated later.
To understand what the firmware’s boot configuration is, efibootmgr
works
well:1
# efibootmgr
BootCurrent: 0001
Timeout: 5 seconds
BootOrder: 0001,0000
Boot0000* Arch Linux VenHw(99e275e7-75a0-4b37-a2e6-c5385e6c00cb)72006f006f0074003d002f006400650076002f006d00610070007000650072002f006b006900720069007300680069006d0061005f006e00650077002d0072006f006f007400200072007700200069006e0069007400720064003d002f0069006e00740065006c002d00750063006f00640065002e0069006d006700200069006e0069007400720064003d002f0069006e0069007400720061006d00660073002d006c0069006e00750078002e0069006d006700
Boot0001* GRUB HD(1,GPT,abf0aa0a-b6ed-45a9-9c16-77727fef1538,0x800,0x100000)/File(\EFI\GRUB\grubx64.efi)
efibootmgr tells us here that the firmware is configured to prefer to boot GRUB (entry Boot0001, which comes first in BootOrder), and it finds the GRUB executable on HD number 1, in a partition with UUID abf0aa0a…. The other boot entry contains a lot of text, which in this output is displayed as hexadecimal bytes, but I know that it’s actually meant to be text encoded with UCS-2 (similar to UTF-16: two bytes per character, with the second byte zero for most Latin characters).
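For the curious, such an entry can be decoded with common tools. A minimal sketch, assuming xxd and iconv are installed, with the hex payload from the efibootmgr output pasted in place of the placeholder:
# HEX='7200...'   # paste the full hex string from the Boot0000 entry here
# echo "$HEX" | xxd -r -p | iconv -f UTF-16LE -t UTF-8   # prints the stored kernel command line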
Seeing this slight mess of text summons memories from when I first configured EFI boot on this system: I remember being unable to make the firmware boot the kernel directly, then giving up and using GRUB instead. While the system’s firmware seemed willing to boot a Linux kernel image directly, it would not pass the required kernel command line options, so the kernel was unable to do anything useful like mount the root filesystem. Using GRUB instead means the configuration is loaded from a file rather than from parameters passed by the firmware, sidestepping that issue.
It looks like the disk number (HD(1,…
for the Boot0001
GRUB entry that is of
interest here) is chosen by the firmware and perhaps correlates to a physical
port on the mainboard, so there should be no special considerations in making
the firmware select the correct disk to boot from. To select the correct partition to load GRUB from, however, either the new partition’s UUID must be set to match the old one’s, or the boot entry must be updated with the new UUID.
To confirm that this UUID refers to the EFI system partition, I looked at /dev/disk/by-partuuid and verified that the file abf0aa0a… is a link to nvme0n1p1: the ESP. It seems plausible that there might be a way to tell the firmware to load something from the ESP directly without specifying the partition UUID, but I’m not sufficiently motivated to discover if that’s true right now.
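For reference, that verification is a one-liner that resolves the symlink:
# readlink -f /dev/disk/by-partuuid/abf0aa0a-b6ed-45a9-9c16-77727fef1538
/dev/nvme0n1p1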
Moving /boot
To move the ESP, I’ve chosen to reuse the original partition UUID so I don’t need to change firmware configuration at all. This should make it easier to switch back to the old disk if I need to.
First I used gdisk
to interactively create a partition on the new disk with
the same size, partition type, and UUID as the original:
# gdisk /dev/sdf
GPT fdisk (gdisk) version 1.0.9.1
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Command (? for help): p
Disk /dev/sdf: 2000409264 sectors, 953.9 GiB
Model: Generic
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): CE6A291B-12C9-4669-BF19-8696D81FBF4B
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 2000409230
Partitions will be aligned on 2048-sector boundaries
Total free space is 2000409197 sectors (953.9 GiB)
Number Start (sector) End (sector) Size Code Name
Command (? for help): n
Partition number (1-128, default 1):
First sector (34-2000409230, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-2000409230, default = 2000408575) or {+-}size{KMGTP}: +512M
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): EF00
Changed type of partition to 'EFI system partition'
Command (? for help): x
Expert command (? for help): c
Using 1
Enter the partition's new unique GUID ('R' to randomize): abf0aa0a-b6ed-45a9-9c16-77727fef1538
New GUID is ABF0AA0A-B6ED-45A9-9C16-77727FEF1538
After saving the new partition table and exiting gdisk, gdisk -l /dev/sdf verifies that the new partition has the same size (start and end sectors) as the old one, so I’m happy with this result.
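For a non-interactive version of that check, sgdisk (gdisk’s scriptable sibling from the same gptfdisk package) can print a single partition’s details, including its unique GUID and first and last sectors:
# sgdisk -i 1 /dev/sdf   # compare against: sgdisk -i 1 /dev/nvme0n1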
Copying the filesystem
To copy the filesystem, I’d like to take a bit-exact copy (rather than copying the files and replicating the directory structure) to ensure there won’t be any unexpected differences between the original and copy. Since I only rarely need to write to this filesystem and certainly don’t need to right now (aside from writing the copy to the new disk), I can remount it read-only then copy the block device directly:
# mount -o remount,ro /boot
# dd if=/dev/nvme0n1p1 of=/dev/sdf1 bs=1M
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 14.6401 s, 36.7 MB/s
Making it read-only before copying ensures that its state on disk is consistent and the system won’t make any changes that could leave it in an inconsistent state while I’m making the copy. To ensure the new copy remains the same as the old one, this filesystem should remain read-only until I replace it with the new one. Otherwise changes wouldn’t be propagated to the copy, and I’d need to repeat this step later.
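As an optional extra step, the copy can be verified bit-for-bit while everything is still read-only; since both partitions are exactly the same length, cmp can compare the two block devices directly, printing nothing if they match:
# cmp /dev/nvme0n1p1 /dev/sdf1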
LVM switcheroo
With the non-LVM data moved, it’s on to the rest. To recall how LVM works (or to provide a quick introduction for readers unfamiliar with it), the Linux Logical Volume Manager (LVM) offers a variety of features to abstract away the details of persistent storage devices.
- Physical volumes (PVs) are created on physical storage (actual hard disks).
- Multiple PVs can be combined into a volume group (VG), which acts like a storage pool made up of all of the PVs in the VG.
- Logical volumes (LVs) are created on VGs and act like regular disk partitions, but are given superpowers by the layers of VG and PV underneath (see the sketch just below).
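To make those layers concrete, here is a minimal sketch of building the stack from scratch; the device and volume names are hypothetical, not anything from this migration:
# pvcreate /dev/sdb1                 # turn a partition into a PV
# vgcreate examplevg /dev/sdb1       # pool it into a new VG
# lvcreate -L 20G -n data examplevg  # carve a 20 GiB LV from the pool
# mkfs.ext4 /dev/examplevg/data      # the LV takes a filesystem like any partition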
By sitting between actual storage and the “partition” that typically has a filesystem on it (an LV), it’s possible to configure the LV to behave entirely unlike a real disk. A few of the interesting features include:
- A “thin” LV can be larger than the underlying PV, which might be useful in an actively managed system where the administrator ensures there is always enough storage available: they can create a very large filesystem that doesn’t need to be resized as storage is added, without having enough storage for the entire filesystem from the outset.
- “Snapshot” LVs record changes to another LV, exactly preserving its contents as of the moment the snapshot was created, without blocking writes to the original (see the sketch after this list). This works a lot like the snapshot features of ZFS and btrfs, but isn’t tied to the use of any particular filesystem.
- The allocation policy of an LV with regard to the PVs in its VG can be controlled, often via the raid LV type, which applies any of the common RAID policies to spread data across disks to improve performance, provide redundancy against disk failure, or both.
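As a concrete illustration of snapshots (again with hypothetical usage; I didn’t need this during the migration), a classic snapshot LV only needs some space reserved to hold blocks as they are overwritten on the origin:
# lvcreate --snapshot --size 5G --name root_snap myhost/root
# mount -o ro /dev/myhost/root_snap /mnt   # inspect or back up the frozen view
# umount /mnt && lvremove myhost/root_snap # discard the snapshot when done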
VGs and LVs are all given names, which is where the names like myhost-root
seen earlier come from. myhost
is the name of the VG I’m using, and root
is
the name of the LV holding the filesystem that gets mounted at /
(the
system root).
Desired configuration
After doing some reading of the LVM-related documentation (in particular lvmraid(7)), I concluded that temporarily configuring my LVs as RAID1 across the two devices was a sensible approach. By also setting the activation mode to degraded, the system will allow the relevant LVs to be used as long as either PV they’re on (assuming RAID1 with two devices) is present. By doing this, I can reconfigure the LVs, wait for them to be synced, then reboot to swap in the new storage. If something goes wrong with the new disk, I can connect the old one again: the two will still have the same contents, so I can easily roll back and try again.
First, create a PV on the new disk in a new partition that fills the rest of the device (sdf2) and add it to the VG (which is named myhost):
# pvcreate /dev/sdf2
Physical volume "/dev/sdf2" successfully created.
# vgextend myhost /dev/sdf2
At this point vgdisplay
shows there are two active PVs in the VG, and its size
is much larger than it was previously. Now convert each of the LVs I care about
to RAID1, mirrored across both PVs:
# for lv in root var home; do
for> lvconvert --type raid1 --mirrors 1 myhost/$lv
for> done
Are you sure you want to convert linear LV myhost/root to raid1 with 2 images enhancing resilience? [y/n]: y
Logical volume myhost/root successfully converted.
Are you sure you want to convert linear LV myhost/var to raid1 with 2 images enhancing resilience? [y/n]: y
Logical volume myhost/var successfully converted.
Are you sure you want to convert linear LV myhost/home to raid1 with 2 images enhancing resilience? [y/n]: y
Logical volume myhost/home successfully converted.
After conversion, the LVs need to sync: make a copy of all their data on the new
PV. The lvmraid
manpage says to use lvs -a -o name,sync_percent
to monitor
this process:
# lvs -a -o name,sync_percent
LV Cpy%Sync
root 3.71
[root_rimage_0]
[root_rimage_1]
[root_rmeta_0]
[root_rmeta_1]
Before continuing, I waited until Cpy%Sync
was 100% for each of the volumes.
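Rather than re-running that command by hand, wrapping it in watch makes the wait painless:
# watch -n 10 'lvs -a -o name,sync_percent'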
To ensure that the system will still try to boot with only one replica present,
I also verified that the current LVM configuration allowed degraded RAID LVs to
be activated:
# lvmconfig --type current activation/activation_mode
activation_mode="degraded"
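Had it not already been set, my understanding is that this knob lives in the activation section of /etc/lvm/lvm.conf (the standard location on most distributions):
activation {
    # allow RAID LVs to activate even with a missing leg
    activation_mode = "degraded"
}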
Since this was already set as I wanted, I didn’t need to make any further changes, and proceeded to shut down the computer, swap out the old disk and install the new one.
Unexpected misbehavior
After swapping the two disks and disconnecting the one in the USB enclosure
(leaving only the new one installed in the machine), the firmware booted the
Linux kernel correctly, but the system didn’t boot successfully: it complained
that it couldn’t find the root filesystem. Using a rescue shell I was able to
determine that all my LVs were present, but inactive (unusable until activated)
and in the partial
state, seemingly indicating that some important metadata
was stored only on the original PV and LVM refused to allow the LVs to activate
because it didn’t have all of the data needed to manage them correctly.
I suspect there may have been a way to repair this RAID issue, but I opted to fall back to the approach I had used previously rather than embark into the unknown while being stuck with a computer that’s offline until it’s fixed.
Without redundancy
Since attempting to use RAID had already broken the system more than the old approach would have, I fell back at this point to the workflow I had used previously. That workflow still allows everything to be moved without downtime (other than for physically moving hardware around), but makes rollback to the original state more difficult. Unfortunately, because the system was already broken, I had to do all of this from a rescue shell after crawling around to connect the requisite cables.
The plan for this was:
- Convert the LVs back to linear allocation, rather than raid1; keep a single copy of each LV’s data, having no particular opinion about where it’s stored.
- Move all data off the old PV and onto the new one.
- Remove the old PV from the VG, guaranteeing that nothing will be stored on it.
- (optional) Wipe the old PV so it can’t be accidentally re-added to the VG.
Each of these steps conveniently maps to a single command to run:
# lvconvert --type linear myhost/root myhost/var myhost/home
# pvmove /dev/sdf2
# vgreduce myhost /dev/sdf2
# pvremove /dev/sdf2
pvmove
takes a while to complete, because it moves all data on sdf2
to other
PVs in the VG. Since I moved the new drive to the mainboard slot when I was
ready to try the RAID, sdf
is now the old drive. Then vgreduce
removes the
PV from the named VG (failing if there is any data still stored on the nominated PV), and pvremove
wipes the PV signature off the old disk.
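Before rebooting, a quick sanity check with lvs shows which devices back each LV; after the pvmove and vgreduce, every LV should list only the new disk’s PV:
# lvs -o name,devices myhost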
After doing that, I rebooted normally and the system came up as expected this time, running entirely on the new disk. Meanwhile, the old disk has no data left on it and can be reused for other purposes.
Useful tools, if you don’t mess it up
I hope this writeup is useful to others, and expect it will be of use to myself in the future as well. LVM provides some very useful capabilities to system administrators that I believe many users are unaware of, and greater awareness is the first step towards making use of these tools.
My failed experiment with LVM RAID reinforces that when doing low-level computer
maintenance it’s often best to go with a proven approach to avoid nasty
surprises: this nasty surprise wasn’t too bad, but it did delay my dinner while
I sorted it out. Had the RAID worked as I’d expected, it would have become
another tool in my arsenal, and because this system isn’t very critical to
anybody, on balance the learning experience seems to have been worthwhile. For
the future however, I’d want to either do more experimentation before trying
that approach again, or simply follow the well-trod path (using pvmove) for safety.
1. It’s usually impossible to interrogate the firmware to learn about boot configuration when using legacy BIOS; being able to do this is a useful feature of UEFI. ↩︎