I was on holidays a few weeks ago and decided to replace an aging Mac Pro that I had been using as a Plex server with a new FreeNAS box, since I could run a jail with Plex. So I used the four 3TB WD Red drives in the Mac Pro with another two new similar drives to construct a new FreeNAS box.
Of course, the problem with those four drives I used as a base was that they were already four years old, but since they still seemed to be working well enough and it would save me about $600, why not?
As it turns out, yesterday FreeNAS helpfully told me that one of the drives was failing. I'm no stranger to systems administration, having done it for the better part of 20 years, and I'm certainly no stranger to replacing failed drives, but FreeNAS (based on FreeBSD) is a bit different from what I'm used to (I'm a Linux guy; I can deal with mdadm and others no problem). Couple this with the fact that I'm using ZFS in a RAIDZ2 configuration and this is all new to me, thus this entry. If nothing else, it gives me something to refer to down the line, but hopefully others find it useful as well.
I'm using FreeNAS 9.10.1 and while the alerts and emails are nice, they don't give me much to go on, especially when the web UI tells me that the zpool is degraded but doesn't actually tell me which drive has the problem. To find out, navigate to Storage - Volumes - [mount point] - View Volumes and select the volume name (in my case it is "storage") by clicking on it. At the bottom of the page you will see three new icons, the last of which is Volume Status. Here you can see which physical device has failed, in my case with a DEGRADED status. You can also view this information from the command line:
    [root@heimdall] ~# zpool status storage
      pool: storage
     state: DEGRADED
    status: One or more devices has experienced an unrecoverable error. An
            attempt was made to correct the error. Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://illumos.org/msg/ZFS-8000-9P
      scan: none requested
    config:
            NAME                                            STATE     READ WRITE CKSUM
            storage                                         DEGRADED     0     0     0
              raidz2-0                                      DEGRADED     0     0     0
                gptid/ebcc2eb3-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
                gptid/ed05b1c9-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
                gptid/edc30220-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
                gptid/ee870fb3-4be5-11e6-9152-3497f634fc9c  DEGRADED     0     0    40  too many errors
                gptid/ef3cebfa-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
                gptid/effda85e-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
    errors: No known data errors
So from the above I can see that my storage zpool is in a degraded state, and that the device gptid/ee870fb3-4be5-11e6-9152-3497f634fc9c is the culprit. That is a partition's gptid label rather than a drive identifier, and I can't see a way in the zpool manpage to make it show me the device. So I have to use a different command, and since I have the gptid, I can filter out everything I don't care about:
    [root@heimdall] ~# glabel status | grep ee870fb3
    gptid/ee870fb3-4be5-11e6-9152-3497f634fc9c     N/A  ada2p2
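The two lookups above (zpool status, then glabel status) can be chained so you don't have to copy the gptid by hand. This is only a rough sketch: the function names unhealthy_labels and label_to_disk are made up for illustration, the pool name defaults to "storage" as in this post, and it assumes member disks show up as gptid/... labels with partition names like ada2p2.

```shell
#!/bin/sh
# Print every gptid label in `zpool status` output (read on stdin)
# whose STATE column is anything other than ONLINE.
unhealthy_labels() {
    awk '$1 ~ /^gptid\// && $2 != "ONLINE" { print $1 }'
}

# Given a label as $1 and `glabel status` output on stdin, print the
# underlying disk name with the partition suffix removed (ada2p2 -> ada2).
label_to_disk() {
    awk -v label="$1" '$1 == label { dev = $3; sub(/p[0-9]+$/, "", dev); print dev }'
}

# Only attempt the live lookup where the ZFS and GEOM tools exist.
pool="${1:-storage}"
if command -v zpool >/dev/null 2>&1 && command -v glabel >/dev/null 2>&1; then
    zpool status "$pool" | unhealthy_labels | while read -r label; do
        printf '%s -> %s\n' "$label" "$(glabel status | label_to_disk "$label")"
    done
fi
```

On a healthy pool this prints nothing, which makes it easy to drop into a cron job if you want a second opinion alongside the FreeNAS alerts.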
This is helpful! The label belongs to partition ada2p2, so the device that is degraded and has too many errors is /dev/ada2. Checking it with smartctl shows me no useful information: it has not failed any SMART tests. I even ran long and short SMART tests after the fact, in a different machine, to see if it was actually dying, and smartctl tells me that both tests completed without error. But, given the age of the drive, it's probably due for replacement anyway.
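For reference, the SMART checks described above look roughly like this from the shell. Treat it as a sketch: smart_verdict is a hypothetical helper name, /dev/ada2 is the suspect drive from this post, and the parsing assumes the usual "overall-health self-assessment test result" line that smartctl -H prints for ATA drives. As my experience shows, a clean result here does not rule out the checksum errors ZFS reported.

```shell
#!/bin/sh
# Extract the verdict (e.g. PASSED) from `smartctl -H` output read on stdin.
smart_verdict() {
    awk -F': *' '/overall-health self-assessment/ { print $2 }'
}

disk="${1:-/dev/ada2}"   # suspect drive from this post; adjust for your system

# Only touch the drive if smartctl is installed and the device node exists.
if command -v smartctl >/dev/null 2>&1 && [ -e "$disk" ]; then
    smartctl -t short "$disk"       # queue a short self-test
    # smartctl -t long "$disk"      # the long test can run for hours on a 3TB drive
    echo "verdict: $(smartctl -H "$disk" | smart_verdict)"
    smartctl -l selftest "$disk"    # log of self-tests that have completed
fi
```

The self-tests run in the background on the drive itself, so check the self-test log again later for the result of the one you just queued.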
When you're pulling the drive, unless you've labelled the drives physically, you'll need to identify them by serial number. The smartctl tool can show you this, or you can use camcontrol:
    [root@heimdall] ~# camcontrol identify ada2|grep serial
    serial number         WD-WMC1T0421516
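If none of your drives are physically labelled, it can help to dump every serial number in one go and tape the list to the case. A minimal sketch, assuming your disks are ATA devices named ada*, as mine are; serial_from_identify is a name I made up for the helper.

```shell
#!/bin/sh
# Pull the serial number out of `camcontrol identify` output read on stdin.
serial_from_identify() {
    awk '/^serial number/ { print $3 }'
}

# Print "disk serial" for each ATA disk the kernel knows about.
# Only meaningful on FreeBSD/FreeNAS, so skip quietly elsewhere.
if command -v camcontrol >/dev/null 2>&1; then
    for disk in $(sysctl -n kern.disks); do
        case "$disk" in
            ada*)
                printf '%s %s\n' "$disk" \
                    "$(camcontrol identify "$disk" | serial_from_identify)"
                ;;
        esac
    done
fi
```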
The FreeNAS documentation will tell you how to replace a failed drive. Effectively, you just need to power off the system, swap the failed drive for a new one, and boot back up. Once you have done this, navigate back to Volume Status and find the OFFLINE disk (it will be the one with a series of numbers rather than a device name), click the Replace button, and select the new device (it should be the only one on the list). Since this drive is entirely new and unpartitioned, you'll need to force the replacement. After that, it's just a matter of sitting back while the volume performs the resilver operation.
If you're using an encrypted volume (I'm not), there are a few more steps to take, which the documentation describes.
When you've done this, you'll be able to see an estimate of how long the resilver will take with the zpool command:
    [root@heimdall] ~# zpool status -v storage
      pool: storage
     state: ONLINE
    status: One or more devices is currently being resilvered. The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Wed Aug 10 23:20:19 2016
            41.4G scanned out of 4.55T at 392M/s, 3h20m to go
            6.89G resilvered, 0.89% done
    config:
            NAME                                            STATE     READ WRITE CKSUM
            storage                                         ONLINE       0     0     0
              raidz2-0                                      ONLINE       0     0     0
                gptid/ebcc2eb3-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
                gptid/ed05b1c9-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
                gptid/edc30220-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
                gptid/42ea0574-5f83-11e6-aa4a-3497f634fc9c  ONLINE       0     0     0  (resilvering)
                gptid/ef3cebfa-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
                gptid/effda85e-4be5-11e6-9152-3497f634fc9c  ONLINE       0     0     0
    errors: No known data errors
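If you'd rather not keep re-running zpool status by hand, a small loop can print just the progress line until the resilver finishes. This is a sketch under a few assumptions: resilver_progress is a made-up helper name, the pool is called "storage", and the scan section is worded as in the output above ("... to go" and "...% done").

```shell
#!/bin/sh
# Reduce `zpool status` output (read on stdin) to a one-line summary such as
# "0.89% done, 3h20m to go". Prints nothing when no resilver is running.
resilver_progress() {
    awk '
        /to go/  { for (i = 2; i <= NF; i++) if ($i == "to")   eta = $(i - 1) }
        /% done/ { for (i = 2; i <= NF; i++) if ($i == "done") pct = $(i - 1) }
        END      { if (pct != "") print pct, "done,", eta, "to go" }
    '
}

pool="${1:-storage}"

# Poll once a minute while a resilver is in progress; skip where zpool is absent.
if command -v zpool >/dev/null 2>&1; then
    while zpool status "$pool" | grep -q 'resilver in progress'; do
        zpool status "$pool" | resilver_progress
        sleep 60
    done
fi
```

The time estimate tends to move around as the resilver proceeds, so treat it as a rough guide rather than a promise.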
In my case, it took over 3 hours. I just went to bed and when I got up, it was back online and in a good state. Really, really easy. Thank you, FreeNAS!