2015-02-17

SMART Data from a Failed HDD

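One of my servers started behaving erratically because of an HDD fault. I managed to get it to boot again with fsck, but when I checked the SMART data, this is what it showed: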
=== START OF INFORMATION SECTION ===
Device Model:     ST31000524AS
Serial Number:    xxxxxxxx
Firmware Version: HP64
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Feb 17 19:06:41 2015 JST
...
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   088   086   006    Pre-fail  Always       -       15941795
  3 Spin_Up_Time            0x0023   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   001   001   036    Pre-fail  Always   FAILING_NOW 4094
  7 Seek_Error_Rate         0x002f   078   060   030    Pre-fail  Always       -       71770086
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       16194
 10 Spin_Retry_Count        0x0033   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       58
180 Unused_Rsvd_Blk_Cnt_Tot 0x002b   100   100   000    Pre-fail  Always       -       1721951280
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1901
188 Command_Timeout         0x0032   100   094   000    Old_age   Always       -       38655819799
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   074   066   045    Old_age   Always       -       26 (Lifetime Min/Max 26/26)
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 11 0 0)
195 Hardware_ECC_Recovered  0x003a   051   021   000    Old_age   Always       -       15941795
196 Reallocated_Event_Count 0x0032   001   001   036    Old_age   Always   FAILING_NOW 4094
197 Current_Pending_Sector  0x0032   100   097   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       207
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 200 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 200 occurred at disk power-on lifetime: 16194 hours (674 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      01:41:47.658  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      01:41:47.625  READ LOG EXT
  60 00 08 ff ff ff 4f 00      01:41:44.947  READ FPDMA QUEUED
  60 00 08 f0 74 87 4e 00      01:41:44.928  READ FPDMA QUEUED
  60 00 08 e8 74 87 4e 00      01:41:44.918  READ FPDMA QUEUED
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        17         -
# 2  Short offline       Completed without error       00%         4         -
# 3  Extended offline    Aborted by host               90%         2         -
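
To make the attribute table above easier to read: as far as I understand, smartctl reports FAILING_NOW when the normalized VALUE has fallen to or below a nonzero THRESH, which is what happened to attributes 5 and 196 here. The following Python is only a minimal sketch of that rule, applied to a few rows copied from the table. The function names and the watch list of raw counters (5, 197, 198) are my own illustrative choices, not something smartctl defines.

```python
# A few rows copied verbatim from the attribute table above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   088   086   006    Pre-fail  Always       -       15941795
  5 Reallocated_Sector_Ct   0x0033   001   001   036    Pre-fail  Always   FAILING_NOW 4094
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1901
190 Airflow_Temperature_Cel 0x0022   074   066   045    Old_age   Always       -       26 (Lifetime Min/Max 26/26)
196 Reallocated_Event_Count 0x0032   001   001   036    Old_age   Always   FAILING_NOW 4094
197 Current_Pending_Sector  0x0032   100   097   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       207
"""

# Hypothetical watch list: raw counters where I treat any nonzero value as a warning sign.
WATCH_IDS = {5, 197, 198}


def parse_attributes(text):
    """Parse rows of the 'Vendor Specific SMART Attributes' table into dicts."""
    rows = []
    for line in text.splitlines():
        # The table has 10 columns; RAW_VALUE may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers, blank lines, and anything that is not a table row
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9].strip(),
        })
    return rows


def failing_now(rows):
    """Attributes whose normalized value is at or below a nonzero threshold."""
    return [r for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


def nonzero_watched(rows):
    """(id, raw count) pairs for watched attributes with a nonzero raw value."""
    result = []
    for r in rows:
        if r["id"] in WATCH_IDS:
            count = int(r["raw"].split()[0])  # leading integer of RAW_VALUE
            if count != 0:
                result.append((r["id"], count))
    return result


if __name__ == "__main__":
    rows = parse_attributes(SAMPLE)
    for r in failing_now(rows):
        print("FAILING:", r["id"], r["name"], r["value"], "<=", r["thresh"])
    for attr_id, count in nonzero_watched(rows):
        print("nonzero raw:", attr_id, count)
```

Note that attribute 187 also has a normalized value of 001, but its threshold is 000, so this rule does not flag it.
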
Apparently, log entries like the following had been showing up in /var/log/messages for a few weeks:
Feb  7 13:34:49 localhost smartd[3459]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Feb  7 13:34:49 localhost smartd[3459]: Device: /dev/sda [SAT], 207 Offline uncorrectable sectors
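
Since these warnings sat unnoticed in the log, a small script that pulls them out might have helped. The following is a hedged sketch in Python that matches only the two smartd message shapes shown above; the regular expression and the function name `find_smartd_warnings` are my own assumptions, and other smartd messages are ignored.

```python
import re

# Matches only the two message shapes quoted in this post.
SMARTD_RE = re.compile(
    r"^(\w{3}\s+\d+\s[\d:]{8})\s\S+\ssmartd\[\d+\]:\s"
    r"Device:\s(\S+)(?:\s\[[^\]]*\])?,\s(\d+)\s"
    r"(Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)

# The two lines from /var/log/messages quoted above.
SAMPLE_LOG = """\
Feb  7 13:34:49 localhost smartd[3459]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Feb  7 13:34:49 localhost smartd[3459]: Device: /dev/sda [SAT], 207 Offline uncorrectable sectors
"""


def find_smartd_warnings(lines):
    """Return (timestamp, device, sector count, kind) for each matching log line."""
    hits = []
    for line in lines:
        m = SMARTD_RE.match(line)
        if m:
            hits.append((m.group(1), m.group(2), int(m.group(3)), m.group(4)))
    return hits


if __name__ == "__main__":
    for timestamp, device, count, kind in find_smartd_warnings(SAMPLE_LOG.splitlines()):
        print(f"{timestamp} {device}: {count} {kind} sectors")
```

To run it against a real log, pass an open file object for /var/log/messages to `find_smartd_warnings` in place of the sample lines.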
