SMART (Self-Monitoring, Analysis, and Reporting Technology) is built into most hard drives today. You can use it to check and test for hard drive failure on running systems. Almost all Linux distributions include the smartmontools package. (I say almost because it's impossible to be familiar with all of them.) Here are some handy commands that take advantage of the reporting and testing features of the Linux SMART tools.
Please note that I am using the device /dev/sda in the following examples; this may or may not be the storage device in your system.
Print the overall health of a drive:
    smartctl -H /dev/sda
    smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
As you can see, my device currently has a passing grade. This is not a final verdict, however; it simply means that the drive has not failed any previous tests or found any problems in the time it has run since. To make the drive dig a little deeper, you can use smartctl to run some self-tests. Let's do that now.
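Before moving on to the self-tests: if you want to check the verdict from a script rather than by eye, something like the following might work. It is only a sketch. The function name smart_health is my own invention, and it assumes the ATA-style "test result:" line shown above (other drive types may word this line differently).

```shell
#!/bin/sh
# smart_health: hypothetical helper, not part of smartmontools.
# Reads `smartctl -H` output on stdin and prints the word that follows
# "test result:" (for example PASSED), or UNKNOWN if no such line exists.
smart_health() {
    awk -F': *' '
        /overall-health self-assessment test result/ { print $2; found = 1 }
        END { if (!found) print "UNKNOWN" }
    '
}

# Usage (smartctl generally needs root; /dev/sda is just my example device):
#   smartctl -H /dev/sda | smart_health
```

Reading from stdin keeps the parsing separate from the smartctl call, so you can try it against saved output first.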
According to the documentation, the short self-test can be started during normal system operation (unless run in captive mode).
    smartctl --test=short /dev/sda
    smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
    Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 1 minutes for test to complete.
    Test will complete after Sat Jul 21 12:27:19 2012
Looking at the response, I got a little scared when I saw “off-line mode”, but that simply means the test runs in the background while the machine keeps functioning normally. Notice that this test takes around a minute to complete, after which you can use the overall health command from earlier to get a quick result. It is best to do this after the test has completed.
The long test can also be run on a live system and tests the device much more thoroughly; however, it takes significantly longer to finish.
    smartctl --test=long /dev/sda
    smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
    Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 86 minutes for test to complete.
    Test will complete after Sat Jul 21 14:04:28 2012
86 minutes is a far cry from the short test’s 1 minute, but again, this is a much more detailed test.
Getting it all:
The next and last command outputs all the SMART information the drive can give. In the response below I have removed a lot of output because there is a lot of information to go through. My main point is the command itself, plus something I will get to in just a second.
    smartctl -a /dev/sda
    Device Model:     ###########
    Serial Number:    ########
    Firmware Version: ####
    User Capacity:    ###,###,###,### bytes
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Sat Jul 21 13:24:32 2012 CDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Self-test routine in progress  50%        11320            -
    # 2  Short offline       Completed without error        00%        11319            -
    # 3  Short offline       Completed without error        00%        8721             -
    # 4  Short offline       Completed without error        00%        1                -
If you look, you can see that the long test is in progress and about 50% complete. Running either the overall health command or the detailed command before a test is finished won’t hurt anything, but it also won’t tell you the result of the currently running test until after it has finished.
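If you would rather have a script wait for the test than keep re-running the command by hand, a rough sketch could look like this. The function name selftest_running is made up for illustration, and it assumes the log line contains the text "Self-test routine in progress" as in my output above. The -l selftest option prints only the self-test log.

```shell
#!/bin/sh
# selftest_running: hypothetical helper, not part of smartmontools.
# Reads `smartctl -l selftest` (or `smartctl -a`) output on stdin and
# succeeds (exit status 0) if any log entry reports a test still running.
selftest_running() {
    grep -q 'Self-test routine in progress'
}

# Usage: poll once a minute until the running test is done
# (needs root; /dev/sda is just my example device):
#   while smartctl -l selftest /dev/sda | selftest_running; do
#       sleep 60
#   done
#   smartctl -l selftest /dev/sda
```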
I don’t have a failing drive, at least according to SMART. That’s great news! The downside is that I don’t have output from a failing drive to put here, but a little Google-fu can give you some examples of what you don’t want to see, as well as what some of the detailed information means.
A few notes about the detailed output:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       69591434
      3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       28
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       166638935
      9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11321
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       14
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       4295032861
    189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
    190 Airflow_Temperature_Cel 0x0022   072   065   045    Old_age   Always       -       28 (Lifetime Min/Max 26/28)
    194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 22 0 0)
    195 Hardware_ECC_Recovered  0x001a   033   028   000    Old_age   Always       -       69591434
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
The ‘Old_age’ in ‘TYPE’ does not mean the drive is old or past its life expectancy. It marks a usage attribute: one whose value is expected to change over the life of the drive, and which indicates normal wear-out only if VALUE drops to or below THRESH. ‘Pre-fail’ does not mean that the drive is failing either. It marks an attribute that signals a likely pending failure only if its VALUE drops to or below the manufacturer's THRESH.
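To make the VALUE versus THRESH comparison concrete, here is a small sketch that lists any attribute whose normalized VALUE is at or below its THRESH. The function name attrs_at_threshold is my own, and the column positions assume the attribute table layout that smartctl -A prints (name in column 2, VALUE in column 4, THRESH in column 6, TYPE in column 7). Treat it as an illustration, not a replacement for smartctl's own WHEN_FAILED column.

```shell
#!/bin/sh
# attrs_at_threshold: hypothetical helper, not part of smartmontools.
# Reads `smartctl -A` output on stdin. For each attribute row (numeric ID
# in column 1, at least 10 columns) it prints the attribute name and TYPE
# when the normalized VALUE is less than or equal to THRESH.
attrs_at_threshold() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) { print $2, $7 }'
}

# Usage (needs root; /dev/sda is just my example device):
#   smartctl -A /dev/sda | attrs_at_threshold
```

On a healthy drive like mine this should print nothing.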
‘Reallocated_Sector_Ct’ is a good value to keep an eye on. Each drive has some spare sectors to replace those that fail, and an occasional bad sector is typical on some drives; however, a large number here might be indicative of problems to come.
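If you want to track that number over time, for example from a cron job, the raw value can be pulled out with a one-line awk filter. This is a sketch under the same assumptions as before: smartctl prints the attribute name as Reallocated_Sector_Ct, the raw value sits in column 10 of the smartctl -A table, and the function name reallocated_sectors is made up.

```shell
#!/bin/sh
# reallocated_sectors: hypothetical helper, not part of smartmontools.
# Reads `smartctl -A` output on stdin and prints the RAW_VALUE column
# (column 10) of the Reallocated_Sector_Ct row, if the drive reports one.
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage (needs root; /dev/sda is just my example device):
#   smartctl -A /dev/sda | reallocated_sectors
```

Not every drive reports this attribute, in which case the function prints nothing.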
A note about raid controllers:
You can also get the SMART status of drives behind some RAID controllers, such as 3ware cards, using
smartctl -H -d 3ware,P /dev/twa#
Where P is the drive port and # is the controller number.