Replacing a Failed Drive in a ZFS Zpool (on Proxmox)
Dec 12, 2016 · 5 minute read · Category: linux
Recently we had one of our Proxmox machines suffer a failed disk drive.
Thankfully, replacing a failed disk in a ZFS zpool is remarkably simple if you know how.
In this example, we are using the ZFS configuration as set up by the Proxmox installer, which also creates a boot partition that is not part of the zpool. Seems like a pretty sensible idea to me.
Here is how we can look at the status of our zpool and see that it has a failed disk:
root@cluster1 zpool status -v
pool: rpool
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: none requested
config:
    NAME                      STATE     READ WRITE CKSUM
    rpool                     DEGRADED     0     0     0
      raidz1-0                DEGRADED     0     0     0
        sdb2                  ONLINE       0     0     0
        sdc2                  ONLINE       0     0     0
        sdd2                  ONLINE       0     0     0
        14456048953908038050  FAULTED      0     0     0  was /dev/sdd2
So you can see that /dev/sdd2 has died and is no longer available. The numeric ID shown in place of sdd2 is important, so make a note of it.
Now let's assume that you have figured out which drive is the failed one, whipped it out and slotted in a shiny new replacement drive that is at least as big as the one it replaces. The next step is to actually add the new drive in.
Step one: Know your Drive IDs
To avoid misery, you need to make absolutely sure you know which drives are which. If you replace a drive then the IDs (sda, sdb, etc.) can get shuffled around, so you need to double-check.
The easiest way, I think, is to look in /dev/disk/by-id - in there you should notice one disk that has no partitions, and that is your new one.
root@cluster1 cd /dev/disk/by-id/
/dev/disk/by-id
root@cluster1 ll
total 0
drwxr-xr-x 2 root root 560 Dec 12 12:08 .
drwxr-xr-x 6 root root 120 Dec 12 12:08 ..
lrwxrwxrwx 1 root root 9 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRREL -> ../../sdd
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRREL-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRREL-part2 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRREL-part9 -> ../../sdd9
lrwxrwxrwx 1 root root 9 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRS0J -> ../../sdc
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRS0J-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRS0J-part2 -> ../../sdc2
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRS0J-part9 -> ../../sdc9
lrwxrwxrwx 1 root root 9 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRV9T -> ../../sdb
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRV9T-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRV9T-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YCRV9T-part9 -> ../../sdb9
lrwxrwxrwx 1 root root 9 Dec 12 12:08 ata-ST1000DX001-1NS162_Z4YE995W -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 12 12:08 wwn-0x5000c50090cca172 -> ../../sdc
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cca172-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cca172-part2 -> ../../sdc2
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cca172-part9 -> ../../sdc9
lrwxrwxrwx 1 root root 9 Dec 12 12:08 wwn-0x5000c50090cd24c4 -> ../../sdb
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cd24c4-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cd24c4-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cd24c4-part9 -> ../../sdb9
lrwxrwxrwx 1 root root 9 Dec 12 12:08 wwn-0x5000c50090cd2ff2 -> ../../sdd
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cd2ff2-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cd2ff2-part2 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Dec 12 12:08 wwn-0x5000c50090cd2ff2-part9 -> ../../sdd9
lrwxrwxrwx 1 root root 9 Dec 12 12:08 wwn-0x5000c50092c1f5b2 -> ../../sda
So in our example, the new disk is sda.
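If you want a second opinion before touching anything, lsblk gives a quick cross-check - the new disk should show no child partitions and a serial number that doesn't appear against the existing pool members. A minimal sketch, assuming a util-linux lsblk that supports these columns:
# Cross-check the candidate disk: the new drive should have no partitions
# listed under it and an unfamiliar serial number
lsblk -o NAME,SIZE,SERIAL,MODEL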
Step two: Partitions
We need to get our new drive set up with the right partition table. Thankfully this is easy enough because we can simply copy the table from a healthy drive.
Warning: make sure you have the next command the right way around before running it - sgdisk will happily overwrite the partition table of whichever disk you point it at.
# Use these variables to make sure you have this the right way around
newDisk='/dev/sda'
healthyDisk='/dev/sdb'
# Copy the partition table from the healthy disk onto the new disk
sgdisk -R "$newDisk" "$healthyDisk"
# Randomise the GUIDs so the new disk does not clash with the disk it was cloned from
sgdisk -G "$newDisk"
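Optionally, as a sanity check, you can print both partition tables and confirm they now match. This is a read-only check (sgdisk -p only prints the table, it changes nothing):
# Sanity check: the new disk's table should now mirror the healthy one
sgdisk -p "$healthyDisk"
sgdisk -p "$newDisk"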
Step three: Boot partition
In our example, partition one is the boot partition and can simply be copied directly from a healthy disk (we will sort out the ZFS partition later).
# Use these variables to make sure you have this the right way around
newDiskBootPartition='/dev/sda1'
healthyDiskBootPartition='/dev/sdb1'
# Clone the boot partition byte-for-byte from the healthy disk
dd if="$healthyDiskBootPartition" of="$newDiskBootPartition" bs=512
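If you want reassurance that the copy worked, a byte-for-byte comparison is a simple check - cmp prints nothing when the two partitions are identical (which works here because we replicated the partition table, so both partitions are the same size):
# Verify the boot partition copy - no output means the two partitions match
cmp "$healthyDiskBootPartition" "$newDiskBootPartition"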
Step four: Add to zpool
Now we are going to add the new disk to the zpool and replace the failed one.
newDiskZFSPartition='/dev/sda2'
# Put your failed disk ID here - as reported in `zpool status -v` - eg 14456048953908038050
failedDiskPartitionID=''
zpool replace rpool "$failedDiskPartitionID" "$newDiskZFSPartition"
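As an aside, if you would rather the pool referenced a stable device name instead of an sdX letter (which, as we saw, can get shuffled), you can point zpool replace at the /dev/disk/by-id path for the new ZFS partition instead. A sketch using the new disk's ID from the listing above:
# Alternative (sketch): use the stable by-id path so the pool entry is not
# tied to the sdX letter, which can change between boots
zpool replace rpool "$failedDiskPartitionID" /dev/disk/by-id/ata-ST1000DX001-1NS162_Z4YE995W-part2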
Either way, zpool replace should give you the warning: Make sure to wait until resilver is done before rebooting.
You can keep track of the resilvering process by running zpool status -v, eg:
root@cluster1 zpool status -v
pool: rpool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Dec 12 12:14:43 2016
91.9M scanned out of 1.87T at 7.66M/s, 71h0m to go
22.6M resilvered, 0.00% done
config:
    NAME                        STATE     READ WRITE CKSUM
    rpool                       DEGRADED     0     0     0
      raidz1-0                  DEGRADED     0     0     0
        sdb2                    ONLINE       0     0     0
        sdc2                    ONLINE       0     0     0
        sdd2                    ONLINE       0     0     0
        replacing-3             UNAVAIL      0     0     0
          14456048953908038050  FAULTED      0     0     0  was /dev/sdd2
          sda2                  ONLINE       0     0     0  (resilvering)
errors: No known data errors
A note on this: the estimated time to go (71h0m here) was wildly pessimistic - the resilver actually took around four and a half hours, as the final status below shows.
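If you don't fancy re-running zpool status by hand while you wait, something like watch does the job (just a convenience sketch - adjust the interval to taste):
# Refresh the pool status every 60 seconds while the resilver runs
watch -n 60 zpool status -v rpool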
Step five: Reboot
Once the resilvering process has finished, you can reboot the machine and check that everything is back to normal health:
root@cluster1 zpool status -v
pool: rpool
state: ONLINE
scan: resilvered 456G in 4h42m with 0 errors on Mon Dec 12 16:57:37 2016
config:
    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sdc2    ONLINE       0     0     0
        sdd2    ONLINE       0     0     0
        sde2    ONLINE       0     0     0
        sdb2    ONLINE       0     0     0
errors: No known data errors
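As an optional final step, it doesn't hurt to kick off a scrub once things have settled, so that ZFS re-verifies checksums across the whole pool - just a suggestion, the resilver itself doesn't require it:
# Optional: re-verify checksums across the whole pool after the replacement
zpool scrub rpool
# ...then check on its progress the same way as before
zpool status -v rpool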