User:GNUru/Software RAID on NVMe with Monitoring
The default configuration and support you get by installing the usual tools and following the official documentation for software RAID on NVMe drives is very *VERY* far from adequate. I learned this the hard way after purchasing a pair of Samsung 980 PRO 2TB NVMe modules with a now-widely-known-to-be-defective firmware that Samsung was very quiet about. I very nearly lost a bunch of data, but my RAID 1 configuration plus an old backup meant that in the end I lost none. It was an eye-opening experience that made me realize just how lacking the standard monitoring capabilities of mdadm and smartmontools are, particularly with NVMe drives (despite their prevalence in 2023).
Here's a cheat sheet or overview of what you need to do to achieve a sane monitoring setup on a desktop:
- mdadm, nvme, smartmontools, cronie, and a mail agent installed and configured using their respective guides
- scripts I wrote, and either placed somewhere in PATH or called by absolute path:
  - email-sysadmin
    - wrote this as a generic wrapper for emailing sysadmin
  - mdadm_notify.sh
    - wrote this to gather useful info about RAID event and send notification e-mail
  - smart-nvme-degradation-check
    - wrote this script which keeps state (uses subdir of same dir that smartd uses) and notifies (via e-mail and wall) if any NVMe lifetime health properties degrade
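The core of a check like smart-nvme-degradation-check is comparing a saved snapshot against a fresh one. Below is a rough sketch of that comparison only, not the actual script. The function name degraded_fields, the one-"name value"-pair-per-line snapshot format, and the choice of counters are all assumptions made for illustration; the counter names follow the fields of the NVMe SMART/health log.

```shell
#!/bin/sh
# Hypothetical sketch, not the real smart-nvme-degradation-check.
# degraded_fields OLD_FILE NEW_FILE
# Both files hold one "name value" pair per line (assumed format).
# Prints one line for every counter that got worse between the snapshots.
degraded_fields() {
    awk '
        NR == FNR { old[$1] = $2; next }      # first file: remember previous values
        !($1 in old) { next }                 # no baseline for this counter: skip it
        $1 == "available_spare" {             # spare capacity degrades by going DOWN
            if ($2 + 0 < old[$1] + 0) print $1 ": " old[$1] " -> " $2
            next
        }
        $2 + 0 > old[$1] + 0 {                # every other counter degrades by going UP
            print $1 ": " old[$1] " -> " $2
        }
    ' "$1" "$2"
}
```

A real script would produce the snapshots from the drive's health log, keep the old one in its state directory, and pass any non-empty output on to the e-mail and wall notifications.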
- /etc/conf.d/smartd
  - edited this to add --attributelog=/var/log/smartd/ and --savestates=/var/lib/smartd/ options
    - both those dirs are the values suggested in man smartd, and had to be created manually
    - attribute log doesn't actually work for NVMe, so this is only added in preparation for if/when support is added in future; see https://www.smartmontools.org/ticket/1190
    - savestates stores min/max temperatures seen per device, so that they're preserved across restarts of smartd (since devices themselves don't track their min/max temperatures)
- /etc/cron.daily/smart-data-log
  - added this to log SMART data for every SMART-capable device to syslog every day, mainly since smartd's attribute log feature doesn't (yet?) support NVMe drives
- /etc/cron.daily/smart-nvme-degradation-check
  - added this to call your /root/bin/smart-nvme-degradation-check script to perform crucial NVMe checks that smartctl doesn't (yet?) support
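A daily logging job like smart-data-log can be very small. This is a minimal sketch under assumptions, not the actual cron script: the function name log_smart_data and the syslog tag are made up for illustration, and it passes only the device path from each smartctl --scan line, relying on smartctl to detect the device type itself.

```shell
#!/bin/sh
# Hypothetical sketch of a daily SMART dump to syslog.
# log_smart_data: for every device smartctl can find, log its full SMART output.
log_smart_data() {
    # "smartctl --scan" prints one device per line; the first word is the device path
    smartctl --scan | while read -r dev rest; do
        # -x prints all available SMART/health information for the device;
        # logger -t tags each syslog line so entries are easy to find later
        smartctl -x "$dev" 2>&1 | logger -t "smart-data-log[$dev]"
    done
}
```

The cron.daily file would just define this and call log_smart_data at the end.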
- /etc/cron.weekly/mdadm
  - edited this (commented it all out) to disable default of scrubbing weekly
- /etc/cron.monthly/mdadm
  - copied from cron.weekly/mdadm, but:
    - removed the stupid date condition (which assumes script runs weekly and breaks if run monthly) and executability condition
    - removed --cron option to checkarray since all it did was hide useful info from being logged
    - removed --quiet option to checkarray, for same reason
    - redirected stderr to stdout, then piped the whole thing to a logger command to make sure it gets logged to syslog
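Put together, the monthly scrub job described above could look roughly like the following. This is a sketch under assumptions, not a copy of the real file: the checkarray path, the remaining --all and --idle options, the syslog tag, and the function name run_monthly_scrub are all guesses for illustration, so check them against your own cron.weekly/mdadm.

```shell
#!/bin/sh
# Hypothetical sketch of a monthly scrub job.
# Path to checkarray is an assumption; override it if yours lives elsewhere.
CHECKARRAY=${CHECKARRAY:-/usr/sbin/checkarray}

# run_monthly_scrub: start a check of all arrays with no date test and no
# --cron/--quiet, and send everything checkarray prints to syslog.
run_monthly_scrub() {
    # 2>&1 folds stderr into stdout so warnings and errors are logged too
    "$CHECKARRAY" --all --idle 2>&1 | logger -t mdadm-checkarray
}
```

The cron.monthly file would end with a call to run_monthly_scrub.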
- /etc/local.d/50-md-sync-stall-shutdown.stop
  - added this to check if any md devices are in a sync state, and stall the shutdown until the operation completes
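One way to implement a shutdown stall like this is to poll /proc/mdstat until no array reports a running operation. The sketch below is an assumption about how such a script might work, not the actual 50-md-sync-stall-shutdown.stop; the function names, the 10 second poll interval, and the optional file argument (there so the logic can be tried against a saved copy of mdstat) are all illustrative.

```shell
#!/bin/sh
# Hypothetical sketch of a "wait for md sync before shutdown" helper.
# md_sync_active [FILE]: succeed if FILE (default /proc/mdstat) shows a
# resync, recovery, reshape or check that is running or queued (DELAYED).
md_sync_active() {
    grep -Eq '(resync|recovery|reshape|check) *= *([0-9]|DELAYED)' "${1:-/proc/mdstat}"
}

# wait_for_md_sync [FILE]: block until no array is syncing any more.
wait_for_md_sync() {
    while md_sync_active "$1"; do
        sleep 10
    done
}
```

The pattern deliberately ignores the PENDING state, on the assumption that a pending resync may never start by itself and would stall the shutdown forever.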
- /etc/mdadm.conf
  - you normally need this file anyway, and existing docs already mention what you need in it, but here's what you have:
    - AUTO -all
      - to disable auto-assembly so that only arrays listed in this file are assembled
    - DEVICE /dev/disk/by-id/nvme-blah-blah*-part1
      - i.e. a stable glob pattern that matches only the array's or arrays' component drives
    - ARRAY /dev/md# ...etc...
      - is the usual line you get when you run mdadm --detail --scan >>/etc/mdadm.conf during the initial setup of mdadm
    - PROGRAM /path/to/mdadm_notify.sh
      - runs this script on every RAID array event
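mdadm's monitor mode calls the PROGRAM with the event name, the md device, and sometimes a component device as arguments. A notification script can start by turning those into a readable summary. This is a minimal sketch of that first step only, not the actual mdadm_notify.sh; the function name format_md_event and the output layout are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the argument handling in a PROGRAM handler.
# format_md_event EVENT MD_DEVICE [COMPONENT_DEVICE]
# Prints a short summary of the event for use in a notification body.
format_md_event() {
    event=$1
    md_dev=$2
    component=${3:-}                     # third argument is optional
    printf 'Event: %s\nArray: %s\n' "$event" "$md_dev"
    if [ -n "$component" ]; then
        printf 'Component: %s\n' "$component"
    fi
}
```

A fuller script would call format_md_event "$@", append extra context such as the output of mdadm --detail for the array and the contents of /proc/mdstat, and hand the result to the mail wrapper.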
- /etc/smartd.conf
  - edited this to disable default of:
    - DEVICESCAN
  - and instead do:
    - DEFAULT -c interval=60 -a -m @ALL
      - to update defaults (only applies to lines below, and only settings which are not overridden in lines below) to:
        - check every 60s
        - enable standard collection of directives
        - and on warnings/failures run every executable in /etc/smartd_warning.d/
    - /dev/nvme# -W 0,58,65
      - for each NVMe drive (replace # with appropriate value for desired device) to enable temperature checking since it's not enabled by default
      - can't use by-id symlinks because of bug https://www.smartmontools.org/ticket/1670, and workaround https://www.smartmontools.org/changeset/4847 only works for symlinks pointing to /dev/sd* so won't work here
      - enables temperature checking, but not the diff/min/max check, just inform at 58C and warn/notify at 65C (adjust as desired)
    - DEVICESCAN -c interval=1800
      - to scan for devices other than those explicitly mentioned above, but for them go back to using smartd's default 30min interval
- /etc/smartd_warning.d/email-sysadmin
  - added this to send an e-mail notification
- /etc/smartd_warning.d/wall
  - added this to blast a notification via wall command
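A wall plugin for /etc/smartd_warning.d/ can be just a few lines. The sketch below is not the actual plugin. It assumes smartd's warning script exports the message text in SMARTD_FULLMESSAGE (falling back to SMARTD_MESSAGE), which is worth confirming against your smartmontools version, and the function name broadcast_warning is made up.

```shell
#!/bin/sh
# Hypothetical sketch of a smartd_warning.d plugin that uses wall.
# broadcast_warning: send smartd's warning text to every logged-in terminal.
broadcast_warning() {
    # Prefer the full message, then the short one, then a generic fallback
    msg=${SMARTD_FULLMESSAGE:-${SMARTD_MESSAGE:-smartd reported a problem}}
    # wall reads the text to broadcast from stdin when no file is given
    printf '%s\n' "$msg" | wall
}
```

The plugin file would end with a call to broadcast_warning and needs to be executable for the @ALL setting to pick it up.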