
I was on holidays a few weeks ago and decided to replace an aging Mac Pro that I had been using as a Plex server with a new FreeNAS box, since I could run Plex in a jail. So I took the four 3TB WD Red drives from the Mac Pro and, along with two similar new drives, built a new FreeNAS box.

Of course, the problem with those four drives I used as a base was that they were already four years old, but since they still seemed to be working well enough and it would save me about $600, why not?

As it turns out, yesterday FreeNAS helpfully told me that one of the drives was failing. I’m no stranger to systems administration, having done it for the better part of 20 years, and I’m certainly no stranger to replacing failed drives, but… FreeNAS (based on FreeBSD) is a bit different from what I’m used to (I’m a Linux guy; I can deal with mdadm and friends no problem). Couple this with the fact that I’m using ZFS in a RAIDZ2 configuration and… this is all new to me, thus this entry. If nothing else, it gives me something to refer to down the line, but hopefully others find it useful as well.

I’m using FreeNAS 9.10.1, and while the error alert and emails are nice, they don’t give me much to go on, especially since the web UI tells me that the zpool is degraded but doesn’t actually tell me which drive has the problem. When I navigate to Storage - Volumes - [mount point] - View Volumes, I can select the volume name (in my case it is “storage”) by clicking on it, and at the bottom of the page three new icons appear, the last of which is Volume Status. Here I can see the physical device that has failed, in my case with a DEGRADED status. You can also view this information from the command line:

[root@heimdall] ~# zpool status storage
  pool: storage
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: none requested
config:

    NAME                                            STATE     READ WRITE CKSUM
    storage                                         DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/ebcc2eb3-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
        gptid/ed05b1c9-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
        gptid/edc30220-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
        gptid/ee870fb3-4be5-11e6-9152-3497f634fc9c  DEGRADED     0     0    40  too many errors
        gptid/ef3cebfa-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
        gptid/effda85e-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0

errors: No known data errors

So from the above I can see that my storage zpool is in a degraded state, and that the device gptid/ee870fb3-4be5-11e6-9152-3497f634fc9c is the culprit. That’s a GPT label, not a drive identifier, and I can’t see a way in the zpool manpage to make it show the underlying device. So I have to use a different command, and since I have the gptid, I can weed out the entries I don’t care about:

[root@heimdall] ~# glabel status | grep ee870fb3
gptid/ee870fb3-4be5-11e6-9152-3497f634fc9c     N/A  ada2p2

This is helpful! So I know that the degraded device with too many errors is /dev/ada2. Checking it out with smartctl shows me no useful information: it has not failed any SMART tests. I even ran long and short SMART tests after the fact, in a different machine, to see if it was actually dying, and smartctl tells me that both tests completed without error. But, given the age of the drive, it’s probably due for a replacement anyway.
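For the curious, those SMART checks look something like this (a sketch; smartctl ships with FreeNAS as part of smartmontools, and the device name comes from the glabel step above):

```shell
# Overall health assessment (PASSED/FAILED) plus error counters
smartctl -H /dev/ada2

# Queue self-tests; the short test takes a couple of minutes,
# the long (extended) test can take hours on a 3TB drive
smartctl -t short /dev/ada2
smartctl -t long /dev/ada2

# Review the self-test log once the tests have finished
smartctl -l selftest /dev/ada2
```

In my case every one of these came back clean, which is a good reminder that SMART results and ZFS checksum errors don’t always agree.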

When you’re pulling the drive, unless you’ve labelled the drives physically, you’ll need to identify them by serial number. The smartctl tool can show you this, or you can use camcontrol:

[root@heimdall] ~# camcontrol identify ada2|grep serial
serial number         WD-WMC1T0421516
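If you prefer smartctl, it prints the same serial as part of the drive identity info:

```shell
# Show drive identity (model, serial, firmware) and pick out the serial line
smartctl -i /dev/ada2 | grep -i 'serial number'
```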

The FreeNAS documentation tells you how to replace a failed drive. Effectively, you power off the system, pull the failed drive, replace it with a new one, and reboot. Once you have done this, navigate back to Volume Status and find the OFFLINE disk (it will be the one shown as a series of numbers rather than a device name), click the Replace button, and select the new device (it should be the only one in the list). Since this drive is entirely new and unpartitioned, you’ll need to force the replacement. After that, it’s just a matter of sitting back while the volume performs the resilver operation.
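Under the hood, the Replace button is doing roughly this (a sketch only; the web UI also partitions the new disk and updates FreeNAS’s own database, so the UI is the safer route on a FreeNAS box; the gptid here is the failed member from zpool status, and /dev/ada2 stands in for the new blank disk):

```shell
# Take the failed member offline (optional if the drive is already gone)
zpool offline storage gptid/ee870fb3-4be5-11e6-9152-3497f634fc9c

# Replace the old member with the new disk; resilvering starts automatically
zpool replace storage gptid/ee870fb3-4be5-11e6-9152-3497f634fc9c /dev/ada2

# Watch the resilver progress
zpool status storage
```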

If you’re using an encrypted volume (I’m not), you have a few more steps to take, which the documentation describes.

When you’ve done this, you’ll be able to see the estimate of how long it will take to resilver with the zpool command:

[root@heimdall] ~# zpool status -v storage
  pool: storage
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Aug 10 23:20:19 2016
        41.4G scanned out of 4.55T at 392M/s, 3h20m to go
        6.89G resilvered, 0.89% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    storage                                         ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/ebcc2eb3-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
        gptid/ed05b1c9-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
        gptid/edc30220-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
        gptid/42ea0574-5f83-11e6-aa4a-3497f634fc9c  ONLINE       0     0     0  (resilvering)
        gptid/ef3cebfa-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
        gptid/effda85e-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0

errors: No known data errors

In my case, it took over 3 hours. I just went to bed and when I got up, it was back online and in good state. Really really easy. Thank you FreeNAS!
