====== HDD: "Nnn... ka-chunk, ka-chunk, ka-chunk. Nnn... ka-chunk, ka-chunk, ka-chunk" ======
A nasty noise like the one in the title started coming from my file server.
Nervously, I ran zpool status...
$ zpool status
pool: zdata3
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0 in 8h58m with 0 errors on Sun Jun 30 21:56:04 2013
config:
NAME STATE READ WRITE CKSUM
zdata3 DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ada2p1 ONLINE 0 0 0
ada1p1 ONLINE 0 0 0
7910286219826960147 REMOVED 0 0 0 was /dev/ada0p1
raidz1-1 ONLINE 0 0 0
ada3 ONLINE 0 0 0
ada5 ONLINE 0 0 0
ada12 ONLINE 0 0 0
logs
da1 ONLINE 0 0 0
cache
da2 ONLINE 0 0 0
spares
ada14p1 AVAIL
errors: No known data errors
Nooooooooooooooooooo!!
Taking a look at syslog as well...
Aug 19 08:11:04 Freyja kernel: ahcich0: Timeout on slot 9 port 0
Aug 19 08:11:04 Freyja kernel: ahcich0: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd c0 serr 00000000 cmd 0000c917
Aug 19 09:41:10 Freyja kernel: ahcich0: Timeout on slot 12 port 0
Aug 19 09:41:10 Freyja kernel: ahcich0: is 00000000 cs 00000000 ss 0000f000 rs 0000f000 tfd 40 serr 00000000 cmd 0000cf17
Aug 19 09:41:41 Freyja kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Aug 19 09:42:11 Freyja kernel: ahcich0: Timeout on slot 15 port 0
Aug 19 09:42:11 Freyja kernel: ahcich0: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 80 serr 00000000 cmd 0000cf17
Aug 19 09:42:42 Freyja kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Aug 19 09:43:12 Freyja kernel: ahcich0: Timeout on slot 15 port 0
Aug 19 09:43:12 Freyja kernel: ahcich0: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 80 serr 00000000 cmd 0000cf17
Aug 19 09:43:12 Freyja kernel: (ada0:ahcich0:0:0:0): lost device
Aug 19 09:43:43 Freyja kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Aug 19 09:44:13 Freyja kernel: ahcich0: Timeout on slot 15 port 0
Aug 19 09:44:13 Freyja kernel: ahcich0: is 00000000 cs 00078000 ss 00078000 rs 00078000 tfd 80 serr 00000000 cmd 0000cf17
Aug 19 09:44:14 Freyja kernel: (ada0:ahcich0:0:0:0): removing device entry
Aug 19 09:44:17 Freyja kernel: ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
Aug 19 09:44:17 Freyja kernel: ada0: ATA-8 SATA 3.x device
Aug 19 09:44:17 Freyja kernel: ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug 19 09:44:17 Freyja kernel: ada0: Command Queueing enabled
Aug 19 09:44:17 Freyja kernel: ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
Aug 19 09:44:17 Freyja kernel: ada0: Previously was known as ad4
Aug 19 09:44:30 Freyja kernel: ahcich0: Timeout on slot 18 port 0
Aug 19 09:44:30 Freyja kernel: ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr 00000000 cmd 0000d217
Aug 19 09:44:30 Freyja kernel: ahcich0: Error while READ LOG EXT
Yep... that is one very... dead drive...
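To get a quick feel for how long a port has been misbehaving, the timeout lines in a log like the one above can be tallied per AHCI channel. This is only a minimal sketch: count_ahci_timeouts is a hypothetical helper name, and the /var/log/messages path in the usage comment assumes a stock FreeBSD syslog setup.

```sh
#!/bin/sh
# Count "Timeout on slot" kernel messages per AHCI channel.
# Reads syslog-style lines on stdin and prints "<channel> <count>" per channel.
# count_ahci_timeouts is a hypothetical helper, not an existing tool.
count_ahci_timeouts() {
    awk '/ahcich[0-9]+: Timeout on slot/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^ahcich[0-9]+:$/) {
                ch = $i
                sub(/:$/, "", ch)   # strip the trailing colon
                count[ch]++
                break               # count each log line once
            }
        }
    }
    END { for (ch in count) print ch, count[ch] }' | sort
}

# Usage (log path is an assumption):
#   count_ahci_timeouts < /var/log/messages
```

Fed the excerpt above, this should report six timeouts on ahcich0, the first of them at 08:11 on Aug 19.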
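8/19 means it was clunking away the whole time I was off visiting my parents... With RAID-Z1, that is pretty terrifying. Also, so the hot spare doesn't kick in automatically, huh. I wonder why?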
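Since the pool sat degraded for days without anyone noticing, a small periodic check would have helped. Here is a rough sketch, assuming it is fed the tab-separated output of zpool list -H -o name,health; unhealthy_pools is a hypothetical helper name, and the logger call in the usage comment is just one possible way to raise the alarm.

```sh
#!/bin/sh
# Print every pool whose health is not ONLINE.
# Exit status is 1 if at least one such pool was seen, 0 otherwise.
# Expects the output of `zpool list -H -o name,health` on stdin
# (one pool per line, name and health separated by a tab).
# unhealthy_pools is a hypothetical helper, not part of ZFS.
unhealthy_pools() {
    awk -F '\t' '
        $2 != "ONLINE" { print $1 ": " $2; bad = 1 }
        END { exit bad + 0 }'
}

# Usage, e.g. from cron (the alerting command is an assumption):
#   zpool list -H -o name,health | unhealthy_pools \
#       || logger -p user.crit "ZFS pool not healthy"
```

Parsing zpool list rather than the full zpool status text keeps the check down to one line per pool, at the cost of not saying which device went missing.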
For now, I'll manually swap the dead HDD for the spare.
First, take the dead HDD offline in the pool.
$ sudo zpool offline zdata3 ada0p1
$ zpool status zdata3
pool: zdata3
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0 in 8h58m with 0 errors on Sun Jun 30 21:56:04 2013
config:
NAME STATE READ WRITE CKSUM
zdata3 DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ada2p1 ONLINE 0 0 0
ada1p1 ONLINE 0 0 0
7910286219826960147 OFFLINE 0 0 0 was /dev/ada0p1
raidz1-1 ONLINE 0 0 0
ada3 ONLINE 0 0 0
ada5 ONLINE 0 0 0
ada12 ONLINE 0 0 0
logs
da1 ONLINE 0 0 0
cache
da2 ONLINE 0 0 0
spares
ada14p1 AVAIL
errors: No known data errors
Then swap in the spare.
$ sudo zpool replace zdata3 ada0p1 ada14p1
$ zpool status zdata3
pool: zdata3
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Aug 24 10:42:18 2013
53.5M scanned out of 10.1T at 4.86M/s, 602h18m to go
8.17M resilvered, 0.00% done
config:
NAME STATE READ WRITE CKSUM
zdata3 DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ada2p1 ONLINE 0 0 0
ada1p1 ONLINE 0 0 0
spare-2 OFFLINE 0 0 0
7910286219826960147 OFFLINE 0 0 0 was /dev/ada0p1
ada14p1 ONLINE 0 0 0 (resilvering)
raidz1-1 ONLINE 0 0 0
ada3 ONLINE 0 0 0
ada5 ONLINE 0 0 0
ada12 ONLINE 0 0 0
logs
da1 ONLINE 0 0 0
cache
da2 ONLINE 0 0 0
spares
8748654380753058766 INUSE was /dev/ada14p1
errors: No known data errors
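The scary "602h18m to go" is nothing more than the remaining data divided by the current rate, and the rate a few seconds into a resilver is not representative. A quick sanity check of that arithmetic, assuming zpool's T and M are binary units (TiB and MiB); eta_hours is a hypothetical helper name.

```sh
#!/bin/sh
# Estimate the hours needed to process a given amount of data.
#   $1 = total size in TiB, $2 = rate in MiB/s
# eta_hours is a hypothetical helper; binary units are an assumption.
eta_hours() {
    awk -v tib="$1" -v rate="$2" \
        'BEGIN { printf "%.0f\n", tib * 1024 * 1024 / rate / 3600 }'
}

# Usage with the figures from the scan lines above:
#   eta_hours 10.1 4.86
```

With the figures shown this comes out at roughly 605 hours, which matches zpool's own estimate to within the rounding of the displayed values.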
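The replace command took a while (ten-odd seconds) to return, which had me a little worried, lol. Guess I need to go buy a 3TB HDD...
Still, I have truly terrible luck with Seagate. I bought this HDD on 6/20 last year, so it died after just over a year. On a server I used to run, a Seagate drive also died in under a year. I can probably get it exchanged for a working unit via RMA, but I'll just sell that right away and put the money toward a new HDD...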
**(Addendum, 2014/8/25 1:18)**
The rebuild of the 10TB pool finished in about 10 hours.
Detach the dead HDD.
$ sudo zpool detach zdata3 ada0p1
$ zpool status zdata3
pool: zdata3
state: ONLINE
scan: resilvered 2.37T in 10h6m with 0 errors on Sat Aug 24 20:49:04 2013
config:
NAME STATE READ WRITE CKSUM
zdata3 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ada2p1 ONLINE 0 0 0
ada1p1 ONLINE 0 0 0
ada14p1 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
ada3 ONLINE 0 0 0
ada5 ONLINE 0 0 0
ada12 ONLINE 0 0 0
logs
da1 ONLINE 0 0 0
cache
da2 ONLINE 0 0 0
errors: No known data errors
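For the record, the scan line above also gives the average write rate onto the spare: 2.37T resilvered in 10h6m. A small sketch of that calculation, again assuming zpool's T means TiB; resilver_mib_per_sec is a hypothetical helper name.

```sh
#!/bin/sh
# Average resilver rate in MiB/s.
#   $1 = resilvered size in TiB, $2 = hours, $3 = minutes
# resilver_mib_per_sec is a hypothetical helper; TiB is an assumption.
resilver_mib_per_sec() {
    awk -v tib="$1" -v h="$2" -v m="$3" \
        'BEGIN { printf "%.1f\n", tib * 1024 * 1024 / (h * 3600 + m * 60) }'
}

# Usage with the figures from the scan line above:
#   resilver_mib_per_sec 2.37 10 6
```

That works out to roughly 68 MiB/s, a long way from the 4.86M/s reported in the first seconds after the replace.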
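To work out which physical HDD in the case is ada0, all you need is the drive's model and serial number, so normally camcontrol identify ada0 does the job.
This time, though, ada0 returned no information at all, so I ran identify on the surviving drives of the same model one by one and tracked down the dead one by process of elimination.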
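The elimination step can be scripted so that every live drive's model and serial number land in one list to compare against the labels in the case. This is a sketch only: model_and_serial is a hypothetical helper name, it assumes camcontrol identify prints "device model" and "serial number" lines in its usual layout, and the device list in the usage comment is simply copied from the pool above.

```sh
#!/bin/sh
# Extract the model and serial number from `camcontrol identify` output.
# Reads that output on stdin and prints "<model><TAB><serial>".
# model_and_serial is a hypothetical helper; the field labels are assumed
# to be "device model" and "serial number".
model_and_serial() {
    awk '
        /^device model/  { sub(/^device model[ \t]+/, "");  model = $0 }
        /^serial number/ { sub(/^serial number[ \t]+/, ""); serial = $0 }
        END { print model "\t" serial }'
}

# Usage: walk the drives that still answer (names taken from the pool above).
#   for d in ada1 ada2 ada3 ada5 ada12 ada14; do
#       printf '%s\t' "$d"
#       sudo camcontrol identify "$d" | model_and_serial
#   done
```

Whichever serial number is printed on a drive in the case but missing from that list belongs to the dead one.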